We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, which extends the canonical multi-armed bandit to the case where side-information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, when the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized learnable context processes for which universal consistency is achievable; and further gave algorithms ensuring universal consistency whenever this is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, contrary to all previously studied settings in online learning -- including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable processes for non-stationary rewards is still extremely general -- larger than i.i.d., stationary or ergodic -- but in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.
In this paper, we study both multi-armed and contextual bandit problems in censored environments. Our goal is to estimate the performance loss due to censorship in the context of classical algorithms designed for uncensored environments. Our main contributions include the introduction of a broad class of censorship models and their analysis in terms of the effective dimension of the problem -- a natural measure of its underlying statistical complexity and main driver of the regret bound. In particular, the effective dimension allows us to maintain the structure of the original problem at first order, while embedding it in a bigger space, and thus naturally leads to results analogous to uncensored settings. Our analysis involves a continuous generalization of the Elliptical Potential Inequality, which we believe is of independent interest. We also discover an interesting property of decision-making under censorship: a transient phase during which initial misspecification of censorship is self-corrected at an extra cost, followed by a stationary phase that reflects the inherent slowdown of learning governed by the effective dimension. Our results are useful for applications of sequential decision-making models where the feedback received depends on strategic uncertainty (e.g., agents' willingness to follow a recommendation) and/or random uncertainty (e.g., loss or delay in arrival of information).
A new era of space exploration and exploitation is fast approaching. A multitude of spacecraft will flow in the future decades under the propulsive momentum of the new space economy. Yet, the flourishing proliferation of deep-space assets will make it unsustainable to pilot them from ground with standard radiometric tracking. The adoption of autonomous navigation alternatives is crucial to overcoming these limitations. Among these, optical navigation is an affordable and fully ground-independent approach. Probes can triangulate their position by observing visible beacons, e.g., planets or asteroids, by acquiring their line-of-sight in deep space. To do so, developing efficient and robust image processing algorithms providing information to navigation filters is a necessary action. This paper proposes an innovative pipeline for unresolved beacon recognition and line-of-sight extraction from images for autonomous interplanetary navigation. The developed algorithm exploits the k-vector method for the non-stellar object identification and statistical likelihood to detect whether any beacon projection is visible in the image. Statistical results show that the accuracy in detecting the planet position projection is independent of the spacecraft position uncertainty. Whereas, the planet detection success rate is higher than 95% when the spacecraft position is known with a 3sigma accuracy up to 10^5 km.
The reinforcement learning (RL) research area is very active, with an important number of new contributions; especially considering the emergent field of deep RL (DRL). However a number of scientific and technical challenges still need to be resolved, amongst which we can mention the ability to abstract actions or the difficulty to explore the environment in sparse-reward settings which can be addressed by intrinsic motivation (IM). We propose to survey these research works through a new taxonomy based on information theory: we computationally revisit the notions of surprise, novelty and skill learning. This allows us to identify advantages and disadvantages of methods and exhibit current outlooks of research. Our analysis suggests that novelty and surprise can assist the building of a hierarchy of transferable skills that further abstracts the environment and makes the exploration process more robust.
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. Existing works generally disentangle timbre, while information about pitch, rhythm and content is still mixed together. To perform one-shot VC effectively with further disentangling these speech components, we employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information and gradient reversal layer based adversarial mutual information learning to ensure the different parts of the latent space containing only the desired disentangled representation during training. Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intellgibility. In addition, we can transfer characteristics of one-shot VC on timbre, pitch and rhythm separately by speech representation disentanglement. Our code, pre-trained models and demo are available at https://im1eon.github.io/IS2022-SRDVC/.
Spiking neural networks (SNNs) have attracted much attention due to their ability to process temporal information, low power consumption, and higher biological plausibility. However, it is still challenging to develop efficient and high-performing learning algorithms for SNNs. Methods like artificial neural network (ANN)-to-SNN conversion can transform ANNs to SNNs with slight performance loss, but it needs a long simulation to approximate the rate coding. Directly training SNN by spike-based backpropagation (BP) such as surrogate gradient approximation is more flexible. Yet now, the performance of SNNs is not competitive compared with ANNs. In this paper, we propose a novel k-based leaky Integrate-and-Fire (KLIF) neuron model to improve the learning ability of SNNs. Compared with the popular leaky integrate-and-fire (LIF) model, KLIF adds a learnable scaling factor to dynamically update the slope and width of the surrogate gradient curve during training and incorporates a ReLU activation function that selectively delivers membrane potential to spike firing and resetting. The proposed spiking unit is evaluated on both static MNIST, Fashion-MNIST, CIFAR-10 datasets, as well as neuromorphic N-MNIST, CIFAR10-DVS, and DVS128-Gesture datasets. Experiments indicate that KLIF performs much better than LIF without introducing additional computational cost and achieves state-of-the-art performance on these datasets with few time steps. Also, KLIF is believed to be more biological plausible than LIF. The good performance of KLIF can make it completely replace the role of LIF in SNN for various tasks.
Monocular depth estimation is an ambiguous problem, thus global structural cues play an important role in current data-driven single-view depth estimation methods. Panorama images capture the complete spatial information of their surroundings utilizing the equirectangular projection which introduces large distortion. This requires the depth estimation method to be able to handle the distortion and extract global context information from the image. In this paper, we propose an end-to-end deep network for monocular panorama depth estimation on a unit spherical surface. Specifically, we project the feature maps extracted from equirectangular images onto unit spherical surface sampled by uniformly distributed grids, where the decoder network can aggregate the information from the distortion-reduced feature maps. Meanwhile, we propose a global cross-attention-based fusion module to fuse the feature maps from skip connection and enhance the ability to obtain global context. Experiments are conducted on five panorama depth estimation datasets, and the results demonstrate that the proposed method substantially outperforms previous state-of-the-art methods. All related codes will be open-sourced in the upcoming days.
Large-scale data missing is a challenging problem in Intelligent Transportation Systems (ITS). Many studies have been carried out to impute large-scale traffic data by considering their spatiotemporal correlations at a network level. In existing traffic data imputations, however, rich semantic information of a road network has been largely ignored when capturing network-wide spatiotemporal correlations. This study proposes a Graph Transformer for Traffic Data Imputation (GT-TDI) model to impute large-scale traffic data with spatiotemporal semantic understanding of a road network. Specifically, the proposed model introduces semantic descriptions consisting of network-wide spatial and temporal information of traffic data to help the GT-TDI model capture spatiotemporal correlations at a network level. The proposed model takes incomplete data, the social connectivity of sensors, and semantic descriptions as input to perform imputation tasks with the help of Graph Neural Networks (GNN) and Transformer. On the PeMS freeway dataset, extensive experiments are conducted to compare the proposed GT-TDI model with conventional methods, tensor factorization methods, and deep learning-based methods. The results show that the proposed GT-TDI outperforms existing methods in complex missing patterns and diverse missing rates. The code of the GT-TDI model will be available at https://github.com/KP-Zhang/GT-TDI.
Multiview depth imagery will play a critical role in free-viewpoint television. This technology requires high quality virtual view synthesis to enable viewers to move freely in a dynamic real world scene. Depth imagery at different viewpoints is used to synthesize an arbitrary number of novel views. Usually, depth images at multiple viewpoints are estimated individually by stereo-matching algorithms, and hence, show lack of interview consistency. This inconsistency affects the quality of view synthesis negatively. This paper proposes a method for depth consistency testing in depth difference subspace to enhance the depth representation of a scene across multiple viewpoints. Furthermore, we propose a view synthesis algorithm that uses the obtained consistency information to improve the visual quality of virtual views at arbitrary viewpoints. Our method helps us to find a linear subspace for our depth difference measurements in which we can test the inter-view consistency efficiently. With this, our approach is able to enhance the depth information for real world scenes. In combination with our consistency-adaptive view synthesis, we improve the visual experience of the free-viewpoint user. The experiments show that our approach enhances the objective quality of virtual views by up to 1.4 dB. The advantage for the subjective quality is also demonstrated.
A good feature representation is the key to image classification. In practice, image classifiers may be applied in scenarios different from what they have been trained on. This so-called domain shift leads to a significant performance drop in image classification. Unsupervised domain adaptation (UDA) reduces the domain shift by transferring the knowledge learned from a labeled source domain to an unlabeled target domain. We perform feature disentanglement for UDA by distilling category-relevant features and excluding category-irrelevant features from the global feature maps. This disentanglement prevents the network from overfitting to category-irrelevant information and makes it focus on information useful for classification. This reduces the difficulty of domain alignment and improves the classification accuracy on the target domain. We propose a coarse-to-fine domain adaptation method called Domain Adaptation via Feature Disentanglement~(DAFD), which has two components: (1)the Category-Relevant Feature Selection (CRFS) module, which disentangles the category-relevant features from the category-irrelevant features, and (2)the Dynamic Local Maximum Mean Discrepancy (DLMMD) module, which achieves fine-grained alignment by reducing the discrepancy within the category-relevant features from different domains. Combined with the CRFS, the DLMMD module can align the category-relevant features properly. We conduct comprehensive experiment on four standard datasets. Our results clearly demonstrate the robustness and effectiveness of our approach in domain adaptive image classification tasks and its competitiveness to the state of the art.