Stephan Mandt

The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks

Feb 07, 2020
Jakub Swiatkowski, Kevin Roth, Bastiaan S. Veeling, Linh Tran, Joshua V. Dillon, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin

Variational Bayesian Inference is a popular methodology for approximating posterior distributions over Bayesian neural network weights. Recent work developing this class of methods has explored ever richer parameterizations of the approximate posterior in the hope of improving performance. In contrast, here we share a curious experimental finding that suggests instead restricting the variational distribution to a more compact parameterization. For a variety of deep Bayesian neural networks trained using Gaussian mean-field variational inference, we find that the posterior standard deviations consistently exhibit strong low-rank structure after convergence. This means that by decomposing these variational parameters into a low-rank factorization, we can make our variational approximation more compact without decreasing the models' performance. Furthermore, we find that such factorized parameterizations improve the signal-to-noise ratio of stochastic gradient estimates of the variational lower bound, resulting in faster convergence.
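
As a rough illustration of the low-rank idea in the abstract above, the sketch below builds a layer's matrix of posterior standard deviations from two thin positive factors and compares parameter counts. The names `k_tied_std`, `u`, `v`, the layer size, and the use of an exponential to keep the factors positive are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def k_tied_std(u, v):
    """Build an (m, n) matrix of standard deviations from positive
    factors u (m, k) and v (n, k) as sigma = u @ v.T (rank at most k)."""
    return u @ v.T

def param_counts(m, n, k):
    """Standard-deviation parameters: full mean-field vs. rank-k factorized."""
    return m * n, k * (m + n)

rng = np.random.default_rng(0)
m, n, k = 512, 256, 2  # hypothetical layer shape and rank

# Assumption: positivity is enforced by exponentiating unconstrained values.
u = np.exp(rng.normal(scale=0.1, size=(m, k)))
v = np.exp(rng.normal(scale=0.1, size=(n, k)))

sigma = k_tied_std(u, v)
full, tied = param_counts(m, n, k)
print(f"full mean-field: {full} std parameters, rank-{k} factorized: {tied}")
```

For a layer of this hypothetical size, the factorized form stores `k * (m + n)` values instead of `m * n`, which is where the compactness would come from.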

How Good is the Bayes Posterior in Deep Neural Networks Really?

Feb 06, 2020
Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin

During the past five years the Bayesian deep learning community has developed increasingly accurate and efficient approximate inference procedures that allow for Bayesian inference in deep neural networks. However, despite this algorithmic progress and the promise of improved uncertainty quantification and sample efficiency, there are, as of early 2020, no publicized deployments of Bayesian neural networks in industrial practice. In this work we cast doubt on the current understanding of Bayes posteriors in popular deep neural networks: we demonstrate through careful MCMC sampling that the posterior predictive induced by the Bayes posterior yields systematically worse predictions compared to simpler methods including point estimates obtained from SGD. Furthermore, we demonstrate that predictive performance is improved significantly through the use of a "cold posterior" that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as a heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate the hypotheses through experiments. Our work questions the goal of accurate posterior approximations in Bayesian deep learning: If the true Bayes posterior is poor, what is the use of more accurate approximations? Instead, we argue that it is timely to focus on understanding the origin of the improved performance of cold posteriors.
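
To make the notion of a cold posterior concrete, here is a minimal sketch assuming the common tempering form in which the unnormalized log posterior is divided by a temperature T, with T < 1 corresponding to "cold". The one-dimensional Gaussian toy posterior, the grid, and the function names are assumptions chosen only for illustration.

```python
import numpy as np

def tempered_log_density(log_joint, temperature):
    """Unnormalized log density of the tempered posterior: log_joint / T."""
    return log_joint / temperature

def variance_on_grid(log_density, grid):
    """Variance of the normalized density, evaluated on a uniform grid."""
    w = np.exp(log_density - log_density.max())
    w /= w.sum()
    mean = (w * grid).sum()
    return (w * (grid - mean) ** 2).sum()

theta = np.linspace(-10.0, 10.0, 20001)
log_joint = -0.5 * theta**2  # toy posterior N(0, 1), up to a constant

var_bayes = variance_on_grid(tempered_log_density(log_joint, 1.0), theta)
var_cold = variance_on_grid(tempered_log_density(log_joint, 0.25), theta)
print(var_bayes, var_cold)
```

In this toy case, tempering with T = 0.25 shrinks the variance by the same factor, which matches the description of a posterior that overcounts evidence and is therefore sharper than the Bayes posterior at T = 1.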

Machine Learning in Thermodynamics: Prediction of Activity Coefficients by Matrix Completion

Jan 29, 2020
Fabian Jirasek, Rodrigo A. S. Alves, Julie Damay, Robert A. Vandermeulen, Robert Bamler, Michael Bortz, Stephan Mandt, Marius Kloft, Hans Hasse

Activity coefficients, which are a measure of the non-ideality of liquid mixtures, are a key property in chemical engineering with relevance to modeling chemical and phase equilibria as well as transport processes. Although experimental data on thousands of binary mixtures are available, prediction methods are needed to calculate the activity coefficients in many relevant mixtures that have not been explored to date. In this report, we propose a probabilistic matrix factorization model for predicting the activity coefficients in arbitrary binary mixtures. Although no physical descriptors for the considered components were used, our method outperforms the state-of-the-art method that has been refined over three decades while requiring much less training effort. This opens perspectives to novel methods for predicting physico-chemical properties of binary mixtures with the potential to revolutionize modeling and simulation in chemical engineering.
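
The matrix-completion view can be sketched as follows: treat the binary mixtures as a partially observed table whose rows and columns are the two components, and fit low-rank factors to the observed entries only. This sketch swaps in a plain MAP point estimate fitted by gradient descent, with an L2 penalty standing in for Gaussian priors. It is not the authors' inference procedure, and the data are synthetic. All names and hyperparameters here are assumptions.

```python
import numpy as np

def fit_matrix_factorization(values, mask, rank=2, lam=0.1, lr=0.005,
                             steps=3000, seed=0):
    """MAP fit of values ~ u @ v.T on observed entries (mask == 1).
    lam is an L2 penalty on both factors (Gaussian-prior analogue)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = values.shape
    u = 0.1 * rng.normal(size=(n_rows, rank))
    v = 0.1 * rng.normal(size=(n_cols, rank))
    for _ in range(steps):
        resid = mask * (u @ v.T - values)  # error on observed entries only
        grad_u = resid @ v + lam * u
        grad_v = resid.T @ u + lam * v
        u -= lr * grad_u
        v -= lr * grad_v
    return u, v

def rmse(a, b, m):
    """Root-mean-square error over the entries selected by mask m."""
    return np.sqrt((m * (a - b) ** 2).sum() / m.sum())

# Synthetic stand-in for a component-by-component table; not real data.
rng = np.random.default_rng(1)
truth = rng.normal(size=(20, 2)) @ rng.normal(size=(15, 2)).T
mask = (rng.random(truth.shape) < 0.6).astype(float)

u, v = fit_matrix_factorization(truth * mask, mask)
pred = u @ v.T  # also contains predictions for the unobserved entries

rmse_observed = rmse(pred, truth, mask)
rmse_baseline = rmse(np.zeros_like(truth), truth, mask)
print(rmse_observed, rmse_baseline)
```

Note that no descriptors of the rows or columns enter the fit. The factors are learned purely from which entries are observed and their values, which mirrors the abstract's point that no physical descriptors were used.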

* Published version: J. Phys. Chem. Lett. 11 (2020) 981-985; https://pubs.acs.org/doi/full/10.1021/acs.jpclett.9b03657

Hydra: Preserving Ensemble Diversity for Model Distillation

Jan 14, 2020
Linh Tran, Bastiaan S. Veeling, Kevin Roth, Jakub Swiatkowski, Joshua V. Dillon, Jasper Snoek, Stephan Mandt, Tim Salimans, Sebastian Nowozin, Rodolphe Jenatton

Ensembles of models have been empirically shown to improve predictive performance and to yield robust measures of uncertainty. However, they are expensive in computation and memory. Therefore, recent research has focused on distilling ensembles into a single compact model, reducing the computational and memory burden of the ensemble while trying to preserve its predictive behavior. Most existing distillation formulations summarize the ensemble by capturing its average predictions. As a result, the diversity of the ensemble predictions, stemming from each individual member, is lost. Thus, the distilled model cannot provide a measure of uncertainty comparable to that of the original ensemble. To retain the diversity of the ensemble more faithfully, we propose a distillation method based on a single multi-headed neural network, which we refer to as Hydra. The shared body network learns a joint feature representation that enables each head to capture the predictive behavior of each ensemble member. We demonstrate that with a slight increase in parameter count, Hydra improves distillation performance on classification and regression settings while capturing the uncertainty behavior of the original ensemble over both in-domain and out-of-distribution tasks.
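
The multi-headed structure described above might look roughly like the following forward pass: one shared body, one head per ensemble member, and a loss that matches each head to its own member. The one-layer body, the linear heads, the per-head KL loss, and all names and shapes are assumptions for this sketch, not the paper's architecture or training objective.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def hydra_forward(x, body_w, head_ws):
    """Shared body followed by one linear head per ensemble member.
    Returns probabilities of shape (n_heads, batch, n_classes)."""
    features = np.tanh(x @ body_w)
    return np.stack([softmax(features @ w) for w in head_ws])

def per_head_kl(teacher_probs, head_probs, eps=1e-12):
    """Mean over heads and batch of KL(teacher_m || head_m)."""
    log_ratio = np.log(teacher_probs + eps) - np.log(head_probs + eps)
    return np.sum(teacher_probs * log_ratio, axis=-1).mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))                             # hypothetical batch
body_w = rng.normal(size=(5, 16))                       # shared body weights
head_ws = [rng.normal(size=(16, 4)) for _ in range(3)]  # three heads

probs = hydra_forward(x, body_w, head_ws)
# Matching every head to the ensemble average would discard member diversity.
loss_vs_average = per_head_kl(probs, probs.mean(axis=0, keepdims=True))
print(probs.shape, loss_vs_average)
```

Keeping the head axis in the output is the point of the design: disagreement between heads can then be read as a stand-in for disagreement between ensemble members, which a single averaged prediction would not expose.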

Tightening Bounds for Variational Inference by Revisiting Perturbation Theory

Sep 30, 2019
Robert Bamler, Cheng Zhang, Manfred Opper, Stephan Mandt

Variational inference has become one of the most widely used methods in latent variable modeling. In its basic form, variational inference employs a fully factorized variational distribution and minimizes its KL divergence to the posterior. As the minimization can only be carried out approximately, this approximation induces a bias. In this paper, we revisit perturbation theory as a powerful way of improving the variational approximation. Perturbation theory relies on a form of Taylor expansion of the log marginal likelihood, vaguely in terms of the log ratio of the true posterior and its variational approximation. While first-order terms give the classical variational bound, higher-order terms yield corrections that tighten it. However, traditional perturbation theory does not provide a lower bound, making it unsuitable for stochastic optimization. In this paper, we present a similar yet alternative way of deriving corrections to the ELBO that resemble perturbation theory, but that result in a valid bound. We show in experiments on Gaussian processes and variational autoencoders that the new bounds are more mass covering, and that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.

* To appear in Journal of Statistical Mechanics: Theory and Experiment (JSTAT), 2019

Autoregressive Text Generation Beyond Feedback Loops

Aug 30, 2019
Florian Schmidt, Stephan Mandt, Thomas Hofmann

Autoregressive state transitions, where predictions are conditioned on past predictions, are the predominant choice for both deterministic and stochastic sequential models. However, autoregressive feedback exposes the evolution of the hidden state trajectory to potential biases from well-known train-test discrepancies. In this paper, we combine a latent state space model with a CRF observation model. We argue that such autoregressive observation models form an interesting middle ground that expresses local correlations on the word level but keeps the state evolution non-autoregressive. On unconditional sentence generation we show performance improvements compared to RNN and GAN baselines while avoiding some prototypical failure modes of autoregressive models.

* EMNLP camera-ready version

Multivariate Time Series Imputation with Variational Autoencoders

Jul 12, 2019
Vincent Fortuin, Gunnar Rätsch, Stephan Mandt

Multivariate time series with missing values are common in many areas, for instance in healthcare and finance. To face this problem, modern data imputation approaches should (a) be tailored to sequential data, (b) deal with high dimensional and complex data distributions, and (c) be based on the probabilistic modeling paradigm for interpretability and confidence assessment. However, many current approaches fall short in at least one of these aspects. Drawing on advances in deep learning and scalable probabilistic modeling, we propose a new deep sequential variational autoencoder approach for dimensionality reduction and data imputation. Temporal dependencies are modeled with a Gaussian process prior and a Cauchy kernel to reflect multi-scale dynamics in the latent space. We furthermore use a structured variational inference distribution that improves the scalability of the approach. We demonstrate that our model exhibits superior imputation performance on benchmark tasks and challenging real-world medical data.
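
For readers unfamiliar with the Cauchy kernel mentioned above, the sketch below builds its covariance matrix over a set of time points and draws one latent trajectory from the resulting Gaussian process prior. It assumes the standard form k(t, t') = sigma2 / (1 + (t - t')^2 / l^2). The parameter values, the time grid, and the jitter term are assumptions for illustration.

```python
import numpy as np

def cauchy_kernel(t, sigma2=1.0, length_scale=1.0):
    """Cauchy kernel matrix: sigma2 / (1 + ((t_i - t_j) / length_scale)^2)."""
    diff = t[:, None] - t[None, :]
    return sigma2 / (1.0 + (diff / length_scale) ** 2)

t = np.arange(10, dtype=float)  # hypothetical time stamps
K = cauchy_kernel(t, sigma2=1.5, length_scale=2.0)

rng = np.random.default_rng(0)
# A small diagonal jitter is added for numerical stability when sampling.
sample = rng.multivariate_normal(np.zeros(len(t)), K + 1e-6 * np.eye(len(t)))
print(K[0, :4], sample.shape)
```

The kernel decays polynomially rather than exponentially with the time gap, so distant time points stay weakly correlated, which is one way to read the abstract's remark about multi-scale dynamics.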

A Quantum Field Theory of Representation Learning

Jul 04, 2019
Robert Bamler, Stephan Mandt

Continuous symmetries and their breaking play a prominent role in contemporary physics. Effective low-energy field theories around symmetry breaking states explain diverse phenomena such as superconductivity, magnetism, and the mass of nucleons. We show that such field theories can also be a useful tool in machine learning, in particular for loss functions with continuous symmetries that are spontaneously broken by random initializations. In this paper, we revisit our earlier published work (Bamler & Mandt, 2018) on this topic from the perspective of theoretical physics. We show that the analogies between superconductivity and symmetry breaking in temporal representation learning are rather deep, allowing us to formulate a gauge theory of 'charged' embedding vectors in time series models. We show that making the loss function gauge invariant speeds up convergence in such models.

* Presented at the ICML 2019 Workshop on Theoretical Physics for Deep Learning

Augmenting and Tuning Knowledge Graph Embeddings

Jul 01, 2019
Robert Bamler, Farnood Salehi, Stephan Mandt

Knowledge graph embeddings rank among the most successful methods for link prediction in knowledge graphs, i.e., the task of completing an incomplete collection of relational facts. A downside of these models is their strong sensitivity to model hyperparameters, in particular regularizers, which have to be extensively tuned to reach good performance [Kadlec et al., 2017]. We propose an efficient method for large scale hyperparameter tuning by interpreting these models in a probabilistic framework. After a model augmentation that introduces per-entity hyperparameters, we use a variational expectation-maximization approach to tune thousands of such hyperparameters with minimal additional cost. Our approach is agnostic to details of the model and results in a new state of the art in link prediction on standard benchmark data.

* Published version, Conference on Uncertainty in Artificial Intelligence (UAI 2019)
