Balaji Lakshminarayanan

Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server

Sep 07, 2017
Leonard Hasenclever, Stefan Webb, Thibaut Lienart, Sebastian Vollmer, Balaji Lakshminarayanan, Charles Blundell, Yee Whye Teh

This paper makes two contributions to Bayesian machine learning algorithms. Firstly, we propose stochastic natural gradient expectation propagation (SNEP), a novel alternative to expectation propagation (EP), a popular variational inference algorithm. SNEP is a black box variational algorithm, in that it does not require any simplifying assumptions on the distribution of interest, beyond the existence of some Monte Carlo sampler for estimating the moments of the EP tilted distributions. Further, as opposed to EP which has no guarantee of convergence, SNEP can be shown to be convergent, even when using Monte Carlo moment estimates. Secondly, we propose a novel architecture for distributed Bayesian learning which we call the posterior server. The posterior server allows scalable and robust Bayesian learning in cases where a data set is stored in a distributed manner across a cluster, with each compute node containing a disjoint subset of data. An independent Monte Carlo sampler is run on each compute node, with direct access only to the local data subset, but which targets an approximation to the global posterior distribution given all data across the whole cluster. This is achieved by using a distributed asynchronous implementation of SNEP to pass messages across the cluster. We demonstrate SNEP and the posterior server on distributed Bayesian learning of logistic regression and neural networks.

Keywords: Distributed Learning, Large Scale Learning, Deep Learning, Bayesian Learning, Variational Inference, Expectation Propagation, Stochastic Approximation, Natural Gradient, Markov chain Monte Carlo, Parameter Server, Posterior Server.

* Journal of Machine Learning Research 18 (2017) 1-37
* 37 pages, 7 figures
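
As background for the abstract above: in EP-style methods the global approximation is a product of a prior and per-node site approximations, and for exponential-family factors that product is a sum of natural parameters. The sketch below shows only this bookkeeping for one-dimensional Gaussians. It is not the SNEP update or the posterior server protocol from the paper; the names `GaussianNat` and `combine` and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianNat:
    """1-D Gaussian factor stored in natural parameters (hypothetical helper)."""
    precision: float        # 1 / variance
    precision_mean: float   # mean / variance

    def __mul__(self, other):
        # Multiplying Gaussian factors adds their natural parameters.
        return GaussianNat(self.precision + other.precision,
                           self.precision_mean + other.precision_mean)

    @property
    def mean(self):
        return self.precision_mean / self.precision

    @property
    def variance(self):
        return 1.0 / self.precision


def combine(prior, sites):
    """Global approximation = prior times the product of all site factors."""
    q = prior
    for site in sites:
        q = q * site
    return q


# Illustrative numbers only: a N(0, 1) prior and two site approximations,
# as if reported by two compute nodes.
prior = GaussianNat(precision=1.0, precision_mean=0.0)
sites = [GaussianNat(2.0, 2.0), GaussianNat(1.0, 3.0)]
posterior = combine(prior, sites)
print(posterior.mean, posterior.variance)
```

Because the combination is a plain sum, a node can replace its own site by sending the difference between its new and old natural parameters, which is one reason this parameterisation suits asynchronous message passing.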

The Cramer Distance as a Solution to Biased Wasserstein Gradients

May 30, 2017
Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, Rémi Munos

The Wasserstein probability metric has received much attention from the machine learning community. Unlike the Kullback-Leibler divergence, which strictly measures change in probability, the Wasserstein metric reflects the underlying geometry between outcomes. The value of being sensitive to this geometry has been demonstrated, among others, in ordinal regression and generative modelling. In this paper we describe three natural properties of probability divergences that reflect requirements from machine learning: sum invariance, scale sensitivity, and unbiased sample gradients. The Wasserstein metric possesses the first two properties but, unlike the Kullback-Leibler divergence, does not possess the third. We provide empirical evidence suggesting that this is a serious issue in practice. Leveraging insights from probabilistic forecasting we propose an alternative to the Wasserstein metric, the Cramér distance. We show that the Cramér distance possesses all three desired properties, combining the best of the Wasserstein and Kullback-Leibler divergences. To illustrate the relevance of the Cramér distance in practice we design a new algorithm, the Cramér Generative Adversarial Network (GAN), and show that it performs significantly better than the related Wasserstein GAN.
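
To make the two quantities concrete for one-dimensional data, the sketch below computes both from empirical CDFs. It assumes the usual CDF-based forms, which the abstract does not spell out: the 1-Wasserstein distance as the integral of |F_P - F_Q|, and the Cramér distance as the integral of (F_P - F_Q)^2. The function names are hypothetical, and this is not the Cramér GAN training objective.

```python
import numpy as np


def cdf_gap(x, y):
    """Difference of empirical CDFs on each interval between pooled sample points."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.sort(np.concatenate([x, y]))
    widths = np.diff(grid)
    # Both empirical CDFs are constant on [grid[i], grid[i+1]).
    fx = np.searchsorted(x, grid[:-1], side="right") / x.size
    fy = np.searchsorted(y, grid[:-1], side="right") / y.size
    return fx - fy, widths


def wasserstein_1(x, y):
    """Integral of |F_x - F_y| (1-Wasserstein distance in one dimension)."""
    gap, widths = cdf_gap(x, y)
    return float(np.sum(np.abs(gap) * widths))


def cramer_distance(x, y):
    """Integral of (F_x - F_y)^2, the squared-CDF-difference form assumed here."""
    gap, widths = cdf_gap(x, y)
    return float(np.sum(gap ** 2 * widths))


# Half of the mass has to move a distance of 2.
a = [0.0, 0.0]
b = [0.0, 2.0]
print(wasserstein_1(a, b), cramer_distance(a, b))
```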

Comparison of Maximum Likelihood and GAN-based training of Real NVPs

May 15, 2017
Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, Peter Dayan

We train a generator by maximum likelihood and we also train the same generator architecture by Wasserstein GAN. We then compare the generated samples, exact log-probability densities and approximate Wasserstein distances. We show that an independent critic trained to approximate Wasserstein distance between the validation set and the generator distribution helps detect overfitting. Finally, we use ideas from the one-shot learning literature to develop a novel fast learning critic.

Learning Deep Nearest Neighbor Representations Using Differentiable Boundary Trees

Feb 28, 2017
Daniel Zoran, Balaji Lakshminarayanan, Charles Blundell

Nearest neighbor (kNN) methods have been gaining popularity in recent years in light of advances in hardware and efficiency of algorithms. There is a plethora of methods to choose from today, each with their own advantages and disadvantages. One requirement shared between all kNN based methods is the need for a good representation and distance measure between samples. We introduce a new method called differentiable boundary tree which allows for learning deep kNN representations. We build on the recently proposed boundary tree algorithm which allows for efficient nearest neighbor classification, regression and retrieval. By modelling traversals in the tree as stochastic events, we are able to form a differentiable cost function which is associated with the tree's predictions. Using a deep neural network to transform the data and back-propagating through the tree allows us to learn good representations for kNN methods. We demonstrate that our method is able to learn suitable representations allowing for very efficient trees with a clearly interpretable structure.

Learning in Implicit Generative Models

Feb 27, 2017
Shakir Mohamed, Balaji Lakshminarayanan

Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties: they do not require a likelihood function to be specified, only a generating procedure; they provide samples that are sharp and compelling; and they allow us to harness our knowledge of building highly accurate neural network classifiers. Here, we develop our understanding of GANs with the aim of forming a rich view of this growing area of machine learning, to build connections to the diverse set of statistical thinking on this topic, of which much can be gained by a mutual exchange of ideas. We frame GANs within the wider landscape of algorithms for learning in implicit generative models (models that only specify a stochastic procedure with which to generate data) and relate these ideas to modelling problems in related fields, such as econometrics and approximate Bayesian computation. We develop likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models, using which we are able to derive the objective function used by GANs, and many other related objectives. The testing viewpoint directs our focus to the general problem of density ratio estimation. There are four approaches for density ratio estimation, one of which is a solution using classifiers to distinguish real from generated data. Other approaches such as divergence minimisation and moment matching have also been explored in the GAN literature, and we synthesise these views to form an understanding in terms of the relationships between them and the wider literature, highlighting avenues for future exploration and cross-pollination.
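
The classifier-based route to density ratio estimation mentioned above can be sketched in a toy setting. With equal numbers of samples from p ("real") and q ("generated"), a probabilistic classifier D satisfies D(x) / (1 - D(x)) ≈ p(x) / q(x), so the classifier's logit estimates the log density ratio. The example assumes two unit-variance Gaussians and a hand-written logistic regression; `fit_logistic` and all settings are illustrative choices, not taken from the paper.

```python
import numpy as np


def fit_logistic(x, labels, steps=2000, lr=0.5):
    """Fit 1-D logistic regression by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w * x + b)))   # predicted P(label = 1 | x)
        grad = p - labels                        # d(loss)/d(logit) per sample
        w -= lr * np.mean(grad * x)
        b -= lr * np.mean(grad)
    return w, b


rng = np.random.default_rng(0)
n = 5000
real = rng.normal(1.0, 1.0, n)    # samples from p = N(1, 1)
fake = rng.normal(0.0, 1.0, n)    # samples from q = N(0, 1)
x = np.concatenate([real, fake])
labels = np.concatenate([np.ones(n), np.zeros(n)])

w, b = fit_logistic(x, labels)


def log_ratio(points):
    """Estimated log p(x)/q(x): the classifier's logit, valid for balanced classes."""
    return w * np.asarray(points, dtype=float) + b


# For these two Gaussians the exact log ratio is x - 0.5.
print(w, b)
```

The linear classifier is well specified here only because the true log ratio is linear in x; for other pairs of distributions a more flexible classifier would be needed.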

The Mondrian Kernel

Jun 16, 2016
Matej Balog, Balaji Lakshminarayanan, Zoubin Ghahramani, Daniel M. Roy, Yee Whye Teh

We introduce the Mondrian kernel, a fast random feature approximation to the Laplace kernel. It is suitable for both batch and online learning, and admits a fast kernel-width-selection procedure as the random features can be re-used efficiently for all kernel widths. The features are constructed by sampling trees via a Mondrian process [Roy and Teh, 2009], and we highlight the connection to Mondrian forests [Lakshminarayanan et al., 2014], where trees are also sampled via a Mondrian process, but fit independently. This link provides a new insight into the relationship between kernel methods and random forests.

* Accepted for presentation at the 32nd Conference on Uncertainty in Artificial Intelligence (UAI 2016)
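
The link between Mondrian partitions and the Laplace kernel can be checked numerically in a small case. The sketch below samples Mondrian partitions of the unit square with lifetime `lifetime` and estimates the kernel value for one pair of points as the fraction of partitions in which both points fall in the same cell; this should approach exp(-lifetime * ||x - y||_1). It follows only the cell containing `x` rather than building full trees or feature vectors, so it illustrates the idea and is not the paper's implementation; `same_cell` and the parameter values are hypothetical.

```python
import numpy as np


def same_cell(x, y, lifetime, rng, low, high):
    """Sample one Mondrian partition of the box [low, high], following only the
    cell that contains x, and report whether y ends up in the same leaf cell."""
    low, high = low.copy(), high.copy()
    t = 0.0
    while True:
        lengths = high - low
        # Waiting time to the next cut is exponential with rate = sum of side lengths.
        t += rng.exponential(1.0 / lengths.sum())
        if t > lifetime:
            return True                      # no further cuts: x and y share a leaf
        d = rng.choice(len(lengths), p=lengths / lengths.sum())
        cut = rng.uniform(low[d], high[d])
        if (x[d] <= cut) != (y[d] <= cut):
            return False                     # this cut separates the two points
        if x[d] <= cut:                      # descend into the child holding both
            high[d] = cut
        else:
            low[d] = cut


rng = np.random.default_rng(0)
x = np.array([0.2, 0.3])
y = np.array([0.5, 0.4])
lifetime = 2.0
num_trees = 4000

hits = sum(same_cell(x, y, lifetime, rng, np.zeros(2), np.ones(2))
           for _ in range(num_trees))
estimate = hits / num_trees
laplace = np.exp(-lifetime * np.abs(x - y).sum())
print(estimate, laplace)
```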

Mondrian Forests for Large-Scale Regression when Uncertainty Matters

May 27, 2016
Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh

Many real-world regression problems demand a measure of the uncertainty associated with each prediction. Standard decision forests deliver efficient state-of-the-art predictive performance, but high-quality uncertainty estimates are lacking. Gaussian processes (GPs) deliver uncertainty estimates, but scaling GPs to large-scale data sets comes at the cost of approximating the uncertainty estimates. We extend Mondrian forests, first proposed by Lakshminarayanan et al. (2014) for classification problems, to the large-scale non-parametric regression setting. Using a novel hierarchical Gaussian prior that dovetails with the Mondrian forest framework, we obtain principled uncertainty estimates, while still retaining the computational advantages of decision forests. Through a combination of illustrative examples, real-world large-scale datasets, and Bayesian optimization benchmarks, we demonstrate that Mondrian forests outperform approximate GPs on large-scale regression tasks and deliver better-calibrated uncertainty assessments than decision-forest-based methods.

* Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51

Approximate Inference with the Variational Holder Bound

Jun 19, 2015
Guillaume Bouchard, Balaji Lakshminarayanan

We introduce the Variational Holder (VH) bound as an alternative to Variational Bayes (VB) for approximate Bayesian inference. Unlike VB which typically involves maximization of a non-convex lower bound with respect to the variational parameters, the VH bound involves minimization of a convex upper bound to the intractable integral with respect to the variational parameters. Minimization of the VH bound is a convex optimization problem; hence the VH method can be applied using off-the-shelf convex optimization algorithms and the approximation error of the VH bound can also be analyzed using tools from convex optimization literature. We present experiments on the task of integrating a truncated multivariate Gaussian distribution and compare our method to VB, EP and a state-of-the-art numerical integration method for this problem.
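
For reference, the inequality the bound is named after can be written out. The first line is the standard Hölder inequality for nonnegative functions. The second line is an assumption about how variational parameters might enter, through a positive function $\Psi$ that cancels inside the product; the abstract does not state the exact form, so it should be read as an illustration only.

```latex
% Hölder's inequality for nonnegative f_1, f_2 and
% \alpha_1, \alpha_2 > 1 with 1/\alpha_1 + 1/\alpha_2 = 1:
\int f_1(x)\, f_2(x)\, dx
  \;\le\;
  \left( \int f_1(x)^{\alpha_1}\, dx \right)^{1/\alpha_1}
  \left( \int f_2(x)^{\alpha_2}\, dx \right)^{1/\alpha_2}

% Assumed variational form: since f_1 f_2 = (f_1 \Psi)(f_2 / \Psi)
% for any positive \Psi, the same inequality gives an upper bound
% that can be minimised over \Psi:
\int f_1(x)\, f_2(x)\, dx
  \;\le\;
  \left( \int \bigl( f_1(x)\, \Psi(x) \bigr)^{\alpha_1} dx \right)^{1/\alpha_1}
  \left( \int \bigl( f_2(x) / \Psi(x) \bigr)^{\alpha_2} dx \right)^{1/\alpha_2}
```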

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages

Jun 09, 2015
Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, S. M. Ali Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, Zoltán Szabó

We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.

* Accepted to UAI 2015. This revision corrects typos and adds content to the appendix; main results are unchanged

Particle Gibbs for Bayesian Additive Regression Trees

Feb 16, 2015
Balaji Lakshminarayanan, Daniel M. Roy, Yee Whye Teh

Additive regression trees are flexible non-parametric models and popular off-the-shelf tools for real-world non-linear regression. In application domains, such as bioinformatics, where there is also demand for probabilistic predictions with measures of uncertainty, the Bayesian additive regression trees (BART) model, introduced by Chipman et al. (2010), is increasingly popular. As data sets have grown in size, however, the standard Metropolis-Hastings algorithms used to perform inference in BART are proving inadequate. In particular, these Markov chains make local changes to the trees and suffer from slow mixing when the data are high-dimensional or the best fitting trees are more than a few layers deep. We present a novel sampler for BART based on the Particle Gibbs (PG) algorithm (Andrieu et al., 2010) and a top-down particle filtering algorithm for Bayesian decision trees (Lakshminarayanan et al., 2013). Rather than making local changes to individual trees, the PG sampler proposes a complete tree to fit the residual. Experiments show that the PG sampler outperforms existing samplers in many settings.

* Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, San Diego, CA, USA. JMLR: W&CP volume 38
