We develop ShiftMatch, a new training-data-dependent likelihood for out-of-distribution (OOD) robustness in Bayesian neural networks (BNNs). ShiftMatch is inspired by the training-data-dependent "EmpCov" priors from Izmailov et al. (2021a) and efficiently matches test-time spatial correlations to those at training time. Critically, ShiftMatch is designed to leave neural network training unchanged, allowing it to use publicly available samples from pre-trained BNNs. Using pre-trained HMC samples, ShiftMatch gives strong performance improvements on CIFAR-10-C, outperforms EmpCov priors, and is perhaps the first Bayesian method capable of convincingly outperforming plain deep ensembles. ShiftMatch can be integrated with non-Bayesian methods like deep ensembles, where it offers smaller, but still considerable, performance improvements. Overall, Bayesian ShiftMatch gave slightly better accuracy than ensembles with ShiftMatch, though they both had very similar log-likelihoods.
Robotic touch, particularly when using soft optical tactile sensors, suffers from distortion caused by motion-dependent shear. The manner in which the sensor contacts a stimulus is entangled with the tactile information about the geometry of the stimulus. In this work, we propose a supervised convolutional deep neural network model that learns to disentangle, in the latent space, the components of sensor deformations caused by contact geometry from those due to sliding-induced shear. The approach is validated by reconstructing unsheared tactile images from sheared images and showing they match unsheared tactile images collected with no sliding motion. In addition, the unsheared tactile images give a faithful reconstruction of the contact geometry that is not possible from the sheared data, and robust estimation of the contact pose that can be used for servo control sliding around various 2D shapes. Finally, the contact geometry reconstruction, in conjunction with servo control sliding, was used for faithful full object reconstruction of various 2D shapes. The methods have broad applicability to deep learning models for robots with a shear-sensitive sense of touch.
Deep kernel processes (DKPs) generalise Bayesian neural networks, but do not require us to represent either features or weights. Instead, at each hidden layer they represent and optimize a flexible kernel. Here, we develop a Newton-like method for DKPs that converges in around 10 steps, exploiting matrix solvers initially developed in the control theory literature. This method is many times faster than the usual gradient descent approach. We generalise to arbitrary DKP architectures by developing "kernel backprop" and algorithms for "kernel autodiff". While these methods are currently not Bayesian, as they give point estimates, and scale poorly, as they are cubic in the number of datapoints, we hope they will form the basis of a new class of much more efficient approaches to optimizing deep nonlinear function approximators.
Recent work introduced deep kernel processes as an entirely kernel-based alternative to NNs (Aitchison et al. 2020). Deep kernel processes flexibly learn good top-layer representations by alternately sampling the kernel from a distribution over positive semi-definite matrices and performing nonlinear transformations. One deep kernel process, the deep Wishart process (DWP), is of particular interest because its prior is equivalent to deep Gaussian process (DGP) priors. However, inference in DWPs has not yet been possible due to the lack of sufficiently flexible distributions over positive semi-definite matrices. Here, we give a novel approach to obtaining flexible distributions over positive semi-definite matrices by generalising the Bartlett decomposition of the Wishart probability density. We use this new distribution to develop an approximate posterior for the DWP that includes dependency across layers. We develop a doubly-stochastic inducing-point inference scheme for the DWP and show experimentally that inference in the DWP gives improved performance over doing inference in a DGP with the equivalent prior.
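For readers unfamiliar with the Bartlett decomposition mentioned above, the following is a minimal sketch of the standard (not generalised) construction, which draws a Wishart sample from a lower-triangular factor with chi-distributed diagonal entries and standard normal entries below the diagonal. It does not implement the generalised distribution developed in the work itself. The function names and the pure-Python matrix helpers are illustrative assumptions, and the degrees of freedom must exceed the matrix dimension minus one.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def bartlett_factor(p, dof, rng):
    """Lower-triangular A such that A A^T ~ Wishart(I_p, dof)."""
    a = [[0.0] * p for _ in range(p)]
    for i in range(p):
        # Squared diagonal entry is chi-square with (dof - i) degrees of
        # freedom, i.e. Gamma(shape=(dof - i) / 2, scale=2).
        a[i][i] = math.sqrt(rng.gammavariate((dof - i) / 2.0, 2.0))
        for j in range(i):
            # Entries below the diagonal are independent standard normals.
            a[i][j] = rng.gauss(0.0, 1.0)
    return a


def sample_wishart(chol_scale, dof, rng):
    """Draw W ~ Wishart(L L^T, dof), with L = chol_scale lower triangular."""
    p = len(chol_scale)
    la = matmul(chol_scale, bartlett_factor(p, dof, rng))
    return matmul(la, transpose(la))


# Illustrative usage with an identity scale matrix (hypothetical values).
rng = random.Random(0)
identity = [[1.0, 0.0], [0.0, 1.0]]
w = sample_wishart(identity, dof=5, rng=rng)
```

Because the sample is built as a product of a matrix with its own transpose, it is symmetric and positive semi-definite by construction, which is the property that makes this parameterisation attractive for distributions over kernels.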
We show that a popular self-supervised learning method, InfoNCE, is a special case of a new family of unsupervised learning methods, the self-supervised variational autoencoder (SSVAE). SSVAEs circumvent the usual VAE requirement to reconstruct the data by using a carefully chosen implicit decoder. The InfoNCE objective was motivated as a simplified parametric mutual information estimator. Under one choice of prior, the SSVAE objective (i.e. the ELBO) is exactly equal to the mutual information (up to constants). Under an alternative choice of prior, the SSVAE objective is exactly equal to the simplified parametric mutual information estimator used in InfoNCE (up to constants). Importantly, the use of simplified parametric mutual information estimators is believed to be critical to obtain good high-level representations, and the SSVAE framework naturally provides a principled justification for using prior information to choose these estimators.
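As a point of reference for the InfoNCE objective discussed above, here is a minimal sketch of the commonly used contrastive form, in which each embedding is scored against its own pair and against the other items in the batch. The dot-product critic, the temperature parameter, and all function names are assumptions made for illustration; the abstract does not specify the critic, and this sketch does not implement the SSVAE itself.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_nce_loss(z, z_prime, temperature=1.0):
    """InfoNCE loss for paired embeddings z[i] <-> z_prime[i].

    For each i, the other rows of z_prime act as negatives.
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        scores = [dot(z[i], z_prime[j]) / temperature for j in range(n)]
        # Log-probability of picking the true pair out of the batch.
        total += scores[i] - log_sum_exp(scores)
    return -total / n


# Illustrative usage with hypothetical, perfectly aligned embeddings.
z = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = info_nce_loss(z, z, temperature=0.1)
```

When the critic cannot tell pairs apart (for example, all-zero embeddings), the loss equals the log of the batch size, and it approaches zero as true pairs are scored far above the negatives.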
Data augmentation is a highly effective approach for improving performance in deep neural networks. The standard view is that it creates an enlarged dataset by adding synthetic data, which raises a problem when combining it with Bayesian inference: how much data are we really conditioning on? This question is particularly relevant to recent observations linking data augmentation to the cold posterior effect. We investigate various principled ways of finding a log-likelihood for augmented datasets. Our approach prescribes augmenting the same underlying image multiple times, at both test time and training time, and averaging either the logits or the predictive probabilities. Empirically, we observe the best performance with averaging probabilities. While there are interactions with the cold posterior effect, neither averaging logits nor averaging probabilities eliminates it.
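The two averaging schemes described above can be sketched as follows. This is a minimal illustration, assuming the network's logits for each augmented copy of an image are already available; the function names and the example logits are hypothetical and not taken from the work.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_logits(logits_per_aug):
    """Average the logits over augmentations, then apply softmax."""
    k = len(logits_per_aug)
    n_classes = len(logits_per_aug[0])
    mean_logits = [sum(l[c] for l in logits_per_aug) / k
                   for c in range(n_classes)]
    return softmax(mean_logits)


def average_probs(logits_per_aug):
    """Apply softmax to each augmentation, then average the probabilities."""
    k = len(logits_per_aug)
    probs = [softmax(l) for l in logits_per_aug]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / k for c in range(n_classes)]


# Hypothetical logits for one image under three augmentations (two classes).
example_logits = [[4.0, 0.0], [0.0, 1.0], [0.5, 0.0]]
p_logit_avg = average_logits(example_logits)
p_prob_avg = average_probs(example_logits)
```

The two schemes agree when every augmentation yields the same logits and generally differ otherwise, because softmax is nonlinear and so does not commute with averaging.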
Bayesian neural networks have shown great promise in many applications where calibrated uncertainty estimates are crucial and can often also lead to a higher predictive performance. However, it remains challenging to choose a good prior distribution over their weights. While isotropic Gaussian priors are often chosen in practice due to their simplicity, they do not reflect our true prior beliefs well and can lead to suboptimal performance. Our new library, BNNpriors, enables state-of-the-art Markov Chain Monte Carlo inference on Bayesian neural networks with a wide range of predefined priors, including heavy-tailed ones, hierarchical ones, and mixture priors. Moreover, it follows a modular approach that eases the design and implementation of new custom priors. It has facilitated foundational discoveries on the nature of the cold posterior effect in Bayesian neural networks and will hopefully catalyze future research as well as practical applications in this area.
We develop variational Laplace for Bayesian neural networks (BNNs), which exploits a local approximation of the curvature of the likelihood to estimate the ELBO without the need for stochastic sampling of the neural-network weights. Variational Laplace performs better on image classification tasks than MAP inference and far better than standard variational inference with stochastic sampling, despite using the same mean-field Gaussian approximate posterior. The variational Laplace objective is simple to evaluate, as it is (in essence) the log-likelihood, plus weight-decay, plus a squared-gradient regularizer. Finally, we emphasise the care needed in benchmarking standard VI, as there is a risk of stopping before the variance parameters have converged. We show that early stopping can be avoided by increasing the learning rate for the variance parameters.
We introduce a principled approach to detecting out-of-distribution (OOD) data by exploiting a connection to data curation. In data curation, we exclude ambiguous or difficult-to-classify input points from the dataset, and these excluded points are by definition OOD. We can therefore obtain the likelihood for OOD points by using a principled generative model of data-curation initially developed to explain the cold-posterior effect in Bayesian neural networks (Aitchison 2020). This model gives higher OOD probabilities when predictive uncertainty is higher and can be trained using maximum-likelihood jointly over the in-distribution and OOD points. This approach gives superior performance to past methods that did not provide a probability for OOD points, and therefore could not be trained using maximum-likelihood.
Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, such simplistic priors are unlikely to either accurately reflect our true beliefs about the weight distributions, or to give optimal performance. We study summary statistics of neural network weights in different networks trained using SGD. We find that fully connected networks (FCNNs) display heavy-tailed weight distributions, while convolutional neural network (CNN) weights display strong spatial correlations. Building these observations into the respective priors leads to improved performance on a variety of image classification datasets. Moreover, we find that these priors also mitigate the cold posterior effect in FCNNs, while in CNNs we see strong improvements at all temperatures, and hence no reduction in the cold posterior effect.
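To make the contrast between an isotropic Gaussian prior and a heavy-tailed prior concrete, the sketch below compares the log-density each assigns to a set of weights. The choice of a Student-t distribution as the heavy-tailed prior, the parameter values, and the function names are assumptions for illustration only; the abstract does not state which heavy-tailed family was used, and the spatially correlated CNN priors are not covered here.

```python
import math


def gaussian_log_prior(weights, scale=1.0):
    """Sum of log N(w | 0, scale^2) over all weights."""
    const = -0.5 * math.log(2.0 * math.pi) - math.log(scale)
    return sum(const - 0.5 * (w / scale) ** 2 for w in weights)


def student_t_log_prior(weights, dof=3.0, scale=1.0):
    """Sum of log Student-t(w | dof, location 0, scale) over all weights."""
    const = (math.lgamma((dof + 1.0) / 2.0) - math.lgamma(dof / 2.0)
             - 0.5 * math.log(dof * math.pi) - math.log(scale))
    return sum(const - 0.5 * (dof + 1.0) * math.log1p((w / scale) ** 2 / dof)
               for w in weights)


# Hypothetical weights: mostly small values plus one large outlier.
weights = [0.1, -0.2, 0.05, 6.0]
log_p_gauss = gaussian_log_prior(weights)
log_p_t = student_t_log_prior(weights)
```

A single large weight is penalised quadratically under the Gaussian but only logarithmically under the Student-t, so the heavy-tailed prior assigns far more mass to weight vectors containing occasional outliers.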