Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Rodolphe Jenatton

CMAP

Sparse MoEs meet Efficient Ensembles

Oct 07, 2021

James Urquhart Allingham, Florian Wenzel, Zelda E Mariet, Basil Mustafa, Joan Puigcerver, Neil Houlsby, Ghassen Jerfel, Vincent Fortuin, Balaji Lakshminarayanan, Jasper Snoek(+3 more)

Figure 1 for Sparse MoEs meet Efficient Ensembles

Figure 2 for Sparse MoEs meet Efficient Ensembles

Figure 3 for Sparse MoEs meet Efficient Ensembles

Figure 4 for Sparse MoEs meet Efficient Ensembles

Abstract:Machine learning models based on the aggregated outputs of submodels, either at the activation or prediction levels, lead to strong performance. We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs). First, we show that these two approaches have complementary features whose combination is beneficial. Then, we present partitioned batch ensembles, an efficient ensemble of sparse MoEs that takes the best of both classes of models. Extensive experiments on fine-tuned vision transformers demonstrate the accuracy, log-likelihood, few-shot learning, robustness, and uncertainty calibration improvements of our approach over several challenging baselines. Partitioned batch ensembles not only scale to models with up to 2.7B parameters, but also provide larger performance gains for larger models.

* 44 pages, 19 figures, 24 tables

Via

Access Paper or Ask Questions

Deep Classifiers with Label Noise Modeling and Distance Awareness

Oct 06, 2021

Vincent Fortuin, Mark Collier, Florian Wenzel, James Allingham, Jeremiah Liu, Dustin Tran, Balaji Lakshminarayanan, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

Figure 1 for Deep Classifiers with Label Noise Modeling and Distance Awareness

Figure 2 for Deep Classifiers with Label Noise Modeling and Distance Awareness

Figure 3 for Deep Classifiers with Label Noise Modeling and Distance Awareness

Figure 4 for Deep Classifiers with Label Noise Modeling and Distance Awareness

Abstract:Uncertainty estimation in deep learning has recently emerged as a crucial area of interest to advance reliability and robustness in safety-critical applications. While there have been many proposed methods that either focus on distance-aware model uncertainties for out-of-distribution detection or on input-dependent label uncertainties for in-distribution calibration, both of these types of uncertainty are often necessary. In this work, we propose the HetSNGP method for jointly modeling the model and data uncertainty. We show that our proposed model affords a favorable combination between these two complementary types of uncertainty and thus outperforms the baseline methods on some challenging out-of-distribution datasets, including CIFAR-100C, Imagenet-C, and Imagenet-A. Moreover, we propose HetSNGP Ensemble, an ensembled version of our method which adds an additional type of uncertainty and also outperforms other ensemble baselines.

Via

Access Paper or Ask Questions

Scaling Vision with Sparse Mixture of Experts

Jun 10, 2021

Carlos Riquelme, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, Neil Houlsby

Figure 1 for Scaling Vision with Sparse Mixture of Experts

Figure 2 for Scaling Vision with Sparse Mixture of Experts

Figure 3 for Scaling Vision with Sparse Mixture of Experts

Figure 4 for Scaling Vision with Sparse Mixture of Experts

Abstract:Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.

* 44 pages, 38 figures

Via

Access Paper or Ask Questions

Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Jun 07, 2021

Zachary Nado, Neil Band, Mark Collier, Josip Djolonga, Michael W. Dusenberry, Sebastian Farquhar, Angelos Filos, Marton Havasi, Rodolphe Jenatton, Ghassen Jerfel(+14 more)

Figure 1 for Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Figure 2 for Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Figure 3 for Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Figure 4 for Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

Abstract:High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet, competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code available at https://github.com/google/uncertainty-baselines.

Via

Access Paper or Ask Questions

Correlated Input-Dependent Label Noise in Large-Scale Image Classification

May 19, 2021

Mark Collier, Basil Mustafa, Efi Kokiopoulou, Rodolphe Jenatton, Jesse Berent

Figure 1 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Figure 2 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Figure 3 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Figure 4 for Correlated Input-Dependent Label Noise in Large-Scale Image Classification

Abstract:Large scale image classification datasets often contain noisy labels. We take a principled probabilistic approach to modelling input-dependent, also known as heteroscedastic, label noise in these datasets. We place a multivariate Normal distributed latent variable on the final hidden layer of a neural network classifier. The covariance matrix of this latent variable, models the aleatoric uncertainty due to label noise. We demonstrate that the learned covariance structure captures known sources of label noise between semantically similar and co-occurring classes. Compared to standard neural network training and other baselines, we show significantly improved accuracy on Imagenet ILSVRC 2012 79.3% (+2.6%), Imagenet-21k 47.0% (+1.1%) and JFT 64.7% (+1.6%). We set a new state-of-the-art result on WebVision 1.0 with 76.6% top-1 accuracy. These datasets range from over 1M to over 300M training examples and from 1k classes to more than 21k classes. Our method is simple to use, and we provide an implementation that is a drop-in replacement for the final fully-connected layer in a deep classifier.

* Accepted as Oral at CVPR 2021

Via

Access Paper or Ask Questions

Amazon SageMaker Autopilot: a white box AutoML solution at scale

Dec 16, 2020

Piali Das, Valerio Perrone, Nikita Ivkin, Tanya Bansal, Zohar Karnin, Huibin Shen, Iaroslav Shcherbatyi, Yotam Elor, Wilton Wu, Aida Zolic(+15 more)

Figure 1 for Amazon SageMaker Autopilot: a white box AutoML solution at scale

Figure 2 for Amazon SageMaker Autopilot: a white box AutoML solution at scale

Figure 3 for Amazon SageMaker Autopilot: a white box AutoML solution at scale

Figure 4 for Amazon SageMaker Autopilot: a white box AutoML solution at scale

Abstract:AutoML systems provide a black-box solution to machine learning problems by selecting the right way of processing features, choosing an algorithm and tuning the hyperparameters of the entire pipeline. Although these systems perform well on many datasets, there is still a non-negligible number of datasets for which the one-shot solution produced by each particular system would provide sub-par performance. In this paper, we present Amazon SageMaker Autopilot: a fully managed system providing an automated ML solution that can be modified when needed. Given a tabular dataset and the target column name, Autopilot identifies the problem type, analyzes the data and produces a diverse set of complete ML pipelines including feature preprocessing and ML algorithms, which are tuned to generate a leaderboard of candidate models. In the scenario where the performance is not satisfactory, a data scientist is able to view and edit the proposed ML pipelines in order to infuse their expertise and business knowledge without having to revert to a fully manual solution. This paper describes the different components of Autopilot, emphasizing the infrastructure choices that allow scalability, high quality models, editable ML pipelines, consumption of artifacts of offline meta-learning, and a convenient integration with the entire SageMaker suite allowing these trained models to be used in a production setting.

Via

Access Paper or Ask Questions

Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization

Dec 15, 2020

Valerio Perrone, Huibin Shen, Aida Zolic, Iaroslav Shcherbatyi, Amr Ahmed, Tanya Bansal, Michele Donini, Fela Winkelmolen, Rodolphe Jenatton, Jean Baptiste Faddoul(+5 more)

Figure 1 for Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization

Figure 2 for Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization

Figure 3 for Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization

Figure 4 for Amazon SageMaker Automatic Model Tuning: Scalable Black-box Optimization

Abstract:Tuning complex machine learning systems is challenging. Machine learning models typically expose a set of hyperparameters, be it regularization, architecture, or optimization parameters, whose careful tuning is critical to achieve good performance. To democratize access to such systems, it is essential to automate this tuning process. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for black-box optimization at scale. AMT finds the best version of a machine learning model by repeatedly training it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best-performing model, as measured by the metric chosen by the user. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture and our design principles. We also describe some more advanced features provided by AMT, such as automated early stopping and warm-starting, demonstrating their benefits in experiments.

Via

Access Paper or Ask Questions

Training independent subnetworks for robust prediction

Oct 13, 2020

Marton Havasi, Rodolphe Jenatton, Stanislav Fort, Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan, Andrew M. Dai, Dustin Tran

Figure 1 for Training independent subnetworks for robust prediction

Figure 2 for Training independent subnetworks for robust prediction

Figure 3 for Training independent subnetworks for robust prediction

Figure 4 for Training independent subnetworks for robust prediction

Abstract:Recent approaches to efficiently ensemble neural networks have shown that strong robustness and uncertainty performance can be achieved with a negligible gain in parameters over the original network. However, these methods still require multiple forward passes for prediction, leading to a significant computational cost. In this work, we show a surprising result: the benefits of using multiple predictions can be achieved `for free' under a single model's forward pass. In particular, we show that, using a multi-input multi-output (MIMO) configuration, one can utilize a single model's capacity to train multiple subnetworks that independently learn the task at hand. By ensembling the predictions made by the subnetworks, we improve model robustness without increasing compute. We observe a significant improvement in negative log-likelihood, accuracy, and calibration error on CIFAR10, CIFAR100, ImageNet, and their out-of-distribution variants compared to previous methods.

Via

Access Paper or Ask Questions

Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Jun 24, 2020

Florian Wenzel, Jasper Snoek, Dustin Tran, Rodolphe Jenatton

Figure 1 for Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Figure 2 for Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Figure 3 for Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Figure 4 for Hyperparameter Ensembles for Robustness and Uncertainty Quantification

Abstract:Ensembles over neural network weights trained from different random initialization, known as deep ensembles, achieve state-of-the-art accuracy and calibration. The recently introduced batch ensembles provide a drop-in replacement that is more parameter efficient. In this paper, we design ensembles not only over weights, but over hyperparameters to improve the state of the art in both settings. For best performance independent of budget, we propose hyper-deep ensembles, a simple procedure that involves a random search over different hyperparameters, themselves stratified across multiple random initializations. Its strong performance highlights the benefit of combining models with both weight and hyperparameter diversity. We further propose a parameter efficient version, hyper-batch ensembles, which builds on the layer structure of batch ensembles and self-tuning networks. The computational and memory costs of our method are notably lower than typical ensembles. On image classification tasks, with MLP, LeNet, and Wide ResNet 28-10 architectures, our methodology improves upon both deep and batch ensembles.

Via

Access Paper or Ask Questions

On Mixup Regularization

Jun 10, 2020

Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, Jean-Philippe Vert

Abstract:Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has empirically shown to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We further show that these transformations and perturbations induce multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator, and that these schemes interact synergistically with each other, resulting in a self calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We illustrate our theoretical analysis by experiments that empirically support our conclusions.

Via

Access Paper or Ask Questions