Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Lester Mackey

Kernel Thinning

May 17, 2021

Raaz Dwivedi, Lester Mackey

Abstract:We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error in the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.

* 55 pages, 4 figures

Via

Access Paper or Ask Questions

Initialization and Regularization of Factorized Neural Layers

May 03, 2021

Mikhail Khodak, Neil Tenenholtz, Lester Mackey, Nicolò Fusi

Figure 1 for Initialization and Regularization of Factorized Neural Layers

Figure 2 for Initialization and Regularization of Factorized Neural Layers

Figure 3 for Initialization and Regularization of Factorized Neural Layers

Figure 4 for Initialization and Regularization of Factorized Neural Layers

Abstract:Factorized layers--operations parameterized by products of two or more matrices--occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to that of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.

* ICLR 2021 Camera-Ready

Via

Access Paper or Ask Questions

Knowledge Distillation as Semiparametric Inference

Apr 20, 2021

Tri Dao, Govinda M Kamath, Vasilis Syrgkanis, Lester Mackey

Figure 1 for Knowledge Distillation as Semiparametric Inference

Figure 2 for Knowledge Distillation as Semiparametric Inference

Figure 3 for Knowledge Distillation as Semiparametric Inference

Figure 4 for Knowledge Distillation as Semiparametric Inference

Abstract:A popular approach to model compression is to train an inexpensive student model to mimic the class probabilities of a highly accurate but cumbersome teacher model. Surprisingly, this two-step knowledge distillation process often leads to higher accuracy than training the student directly on labeled data. To explain and enhance this phenomenon, we cast knowledge distillation as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as nuisance, and the teacher probabilities as a plug-in nuisance estimate. By adapting modern semiparametric tools, we derive new guarantees for the prediction error of standard distillation and develop two enhancements -- cross-fitting and loss correction -- to mitigate the impact of teacher overfitting and underfitting on student performance. We validate our findings empirically on both tabular and image data and observe consistent improvements from our knowledge distillation enhancements.

Via

Access Paper or Ask Questions

Model-specific Data Subsampling with Influence Functions

Oct 20, 2020

Anant Raj, Cameron Musco, Lester Mackey, Nicolo Fusi

Figure 1 for Model-specific Data Subsampling with Influence Functions

Abstract:Model selection requires repeatedly evaluating models on a given dataset and measuring their relative performances. In modern applications of machine learning, the models being considered are increasingly more expensive to evaluate and the datasets of interest are increasing in size. As a result, the process of model selection is time-consuming and computationally inefficient. In this work, we develop a model-specific data subsampling strategy that improves over random sampling whenever training points have varying influence. Specifically, we leverage influence functions to guide our selection strategy, proving theoretically, and demonstrating empirically that our approach quickly selects high-quality models.

Via

Access Paper or Ask Questions

Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Sep 22, 2020

Tin D. Nguyen, Jonathan Huggins, Lorenzo Masoero, Lester Mackey, Tamara Broderick

Figure 1 for Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Figure 2 for Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Figure 3 for Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Figure 4 for Independent finite approximations for Bayesian nonparametric inference: construction, error bounds, and practical implications

Abstract:Bayesian nonparametrics based on completely random measures (CRMs) offers a flexible modeling approach when the number of clusters or latent components in a dataset is unknown. However, managing the infinite dimensionality of CRMs often leads to slow computation. Practical inference typically relies on either integrating out the infinite-dimensional parameter or using a finite approximation: a truncated finite approximation (TFA) or an independent finite approximation (IFA). The atom weights of TFAs are constructed sequentially, while the atoms of IFAs are independent, which (1) make them well-suited for parallel and distributed computation and (2) facilitates more convenient inference schemes. While IFAs have been developed in certain special cases in the past, there has not yet been a general template for construction or a systematic comparison to TFAs. We show how to construct IFAs for approximating distributions in a large family of CRMs, encompassing all those typically used in practice. We quantify the approximation error between IFAs and the target nonparametric prior, and prove that, in the worst-case, TFAs provide more component-efficient approximations than IFAs. However, in experiments on image denoising and topic modeling tasks with real data, we find that the error of Bayesian approximation methods overwhelms any finite approximation error, and IFAs perform very similarly to TFAs.

Via

Access Paper or Ask Questions

Cross-validation Confidence Intervals for Test Error

Jul 24, 2020

Pierre Bayle, Alexandre Bayle, Lucas Janson, Lester Mackey

Figure 1 for Cross-validation Confidence Intervals for Test Error

Figure 2 for Cross-validation Confidence Intervals for Test Error

Figure 3 for Cross-validation Confidence Intervals for Test Error

Figure 4 for Cross-validation Confidence Intervals for Test Error

Abstract:This work develops central limit theorems for cross-validation and consistent estimators of its asymptotic variance under weak stability conditions on the learning algorithm. Together, these results provide practical, asymptotically-exact confidence intervals for $k$-fold test error and valid, powerful hypothesis tests of whether one learning algorithm has smaller $k$-fold test error than another. These results are also the first of their kind for the popular choice of leave-one-out cross-validation. In our real-data experiments with diverse learning algorithms, the resulting intervals and tests outperform the most popular alternative methods from the literature.

* 40 pages, 14 figures

Via

Access Paper or Ask Questions

Stochastic Stein Discrepancies

Jul 06, 2020

Jackson Gorham, Anant Raj, Lester Mackey

Figure 1 for Stochastic Stein Discrepancies

Figure 2 for Stochastic Stein Discrepancies

Figure 3 for Stochastic Stein Discrepancies

Abstract:Stein discrepancies (SDs) monitor convergence and non-convergence in approximate inference when exact integration and sampling are intractable. However, the computation of a Stein discrepancy can be prohibitive if the Stein operator - often a sum over likelihood terms or potentials - is expensive to evaluate. To address this deficiency, we show that stochastic Stein discrepancies (SSDs) based on subsampled approximations of the Stein operator inherit the convergence control properties of standard SDs with probability 1. In our experiments with biased Markov chain Monte Carlo (MCMC) hyperparameter tuning, approximate MCMC sampler selection, and stochastic Stein variational gradient descent, SSDs deliver comparable inferences to standard SDs with orders of magnitude fewer likelihood evaluations.

Via

Access Paper or Ask Questions

Metrizing Weak Convergence with Maximum Mean Discrepancies

Jun 16, 2020

Carl-Johann Simon-Gabriel, Alessandro Barp, Lester Mackey

Abstract:Theorem 12 of Simon-Gabriel & Sch\"olkopf (JMLR, 2018) seemed to close a 40-year-old quest to characterize maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures. We prove, however, that the theorem is incorrect and provide a correction. We show that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel k, whose RKHS-functions vanish at infinity, metrizes the weak convergence of probability measures if and only if k is continuous and integrally strictly positive definite (ISPD) over all signed, finite, regular Borel measures. We also show that, contrary to the claim of the aforementioned Theorem 12, there exist both bounded continuous ISPD kernels that do not metrize weak convergence and bounded continuous non-ISPD kernels that do metrize it.

* 10 pages. Correction of Thm.12 in Simon-Gabriel and Sch\"olkopf, JMLR, 19(44):1-29, 2018. See http://jmlr.org/papers/v19/16-291.html

Via

Access Paper or Ask Questions

Minimax Estimation of Conditional Moment Models

Jun 12, 2020

Nishanth Dikkala, Greg Lewis, Lester Mackey, Vasilis Syrgkanis

Figure 1 for Minimax Estimation of Conditional Moment Models

Figure 2 for Minimax Estimation of Conditional Moment Models

Figure 3 for Minimax Estimation of Conditional Moment Models

Figure 4 for Minimax Estimation of Conditional Moment Models

Abstract:We develop an approach for estimating models described via conditional moment restrictions, with a prototypical application being non-parametric instrumental variable regression. We introduce a min-max criterion function, under which the estimation problem can be thought of as solving a zero-sum game between a modeler who is optimizing over the hypothesis space of the target model and an adversary who identifies violating moments over a test function space. We analyze the statistical estimation rate of the resulting estimator for arbitrary hypothesis spaces, with respect to an appropriate analogue of the mean squared error metric, for ill-posed inverse problems. We show that when the minimax criterion is regularized with a second moment penalty on the test function and the test function space is sufficiently rich, then the estimation rate scales with the critical radius of the hypothesis and test function spaces, a quantity which typically gives tight fast rates. Our main result follows from a novel localized Rademacher analysis of statistical learning problems defined via minimax objectives. We provide applications of our main results for several hypothesis spaces used in practice such as: reproducing kernel Hilbert spaces, high dimensional sparse linear functions, spaces defined via shape constraints, ensemble estimators such as random forests, and neural networks. For each of these applications we provide computationally efficient optimization methods for solving the corresponding minimax problem (e.g. stochastic first-order heuristics for neural networks). In several applications, we show how our modified mean squared error rate, combined with conditions that bound the ill-posedness of the inverse problem, lead to mean squared error rates. We conclude with an extensive experimental analysis of the proposed methods.

Via

Access Paper or Ask Questions

Optimal Thinning of MCMC Output

May 08, 2020

Marina Riabiz, Wilson Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Lester Mackey, Chris. J. Oates

Figure 1 for Optimal Thinning of MCMC Output

Figure 2 for Optimal Thinning of MCMC Output

Figure 3 for Optimal Thinning of MCMC Output

Figure 4 for Optimal Thinning of MCMC Output

Abstract:The use of heuristics to assess the convergence and compress the output of Markov chain Monte Carlo can be sub-optimal in terms of the empirical approximations that are produced. Typically a number of the initial states are attributed to "burn in" and removed, whilst the chain can be "thinned" if compression is also required. In this paper we consider the problem of selecting a subset of states, of fixed cardinality, such that the approximation provided by their empirical distribution is close to optimal. A novel method is proposed, based on greedy minimisation of a kernel Stein discrepancy, that is suitable for problems where heavy compression is required. Theoretical results guarantee consistency of the method and its effectiveness is demonstrated in the challenging context of parameter inference for ordinary differential equations. Software is available in the "Stein Thinning" package in both Python and MATLAB, and example code is included.

Via

Access Paper or Ask Questions