Yann Dauphin

Tied-Augment: Controlling Representation Similarity Improves Data Augmentation

May 22, 2023
Emirhan Kurtulus, Zichao Li, Yann Dauphin, Ekin Dogus Cubuk

Data augmentation methods have played an important role in the recent advance of deep learning models, and have become an indispensable component of state-of-the-art models in semi-supervised, self-supervised, and supervised training for vision. Despite incurring no additional latency at test time, data augmentation often requires more epochs of training to be effective. For example, even the simple flips-and-crops augmentation requires training for more than 5 epochs to improve performance, whereas RandAugment requires more than 90 epochs. We propose a general framework called Tied-Augment, which improves the efficacy of data augmentation in a wide range of applications by adding a simple term to the loss that can control the similarity of representations under distortions. Tied-Augment can improve state-of-the-art methods from data augmentation (e.g. RandAugment, mixup), optimization (e.g. SAM), and semi-supervised learning (e.g. FixMatch). For example, Tied-RandAugment can outperform RandAugment by 2.0% on ImageNet. Notably, using Tied-Augment, data augmentation can be made to improve generalization even when training for a few epochs and when fine-tuning. We open source our code at https://github.com/ekurtulus/tied-augment/tree/main.
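
For illustration, here is a minimal NumPy sketch of a Tied-Augment-style objective: a supervised loss on two augmented views plus a term tying their representations. The `model` interface, the L2 tying term, and the weight `lam` are assumptions for exposition, not the authors' exact formulation.

```python
# Illustrative sketch of a Tied-Augment-style objective (not the authors' code).
import numpy as np

def softmax_cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def tied_augment_loss(model, x_aug1, x_aug2, labels, lam=1.0):
    # Forward both augmented views of the same batch.
    feats1, logits1 = model(x_aug1)   # model returns (representation, logits)
    feats2, logits2 = model(x_aug2)
    # Standard supervised loss on both views.
    ce = softmax_cross_entropy(logits1, labels) + softmax_cross_entropy(logits2, labels)
    # Tying term: penalize dissimilar representations under the two distortions.
    tie = np.mean(np.sum((feats1 - feats2) ** 2, axis=1))
    return ce + lam * tie
```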

* 14 pages, 2 figures, ICML 2023 

JaxPruner: A concise library for sparsity research

May 02, 2023
Joo Hyung Lee, Wonpyo Park, Nicole Mitchell, Jonathan Pilault, Johan Obando-Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart Bik, Woohyun Han, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci

This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine, and FedJAX, and provide baseline experiments on popular benchmarks.
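
To illustrate the kind of Optax integration described above, here is a hedged sketch of pruning expressed as an Optax gradient transformation. The function name and the fixed-mask scheme are hypothetical and do not reflect the actual JaxPruner API; only standard Optax and JAX calls are used.

```python
# Toy illustration of composing a pruning mask with Optax (not the JaxPruner API).
import jax
import jax.numpy as jnp
import optax

def apply_fixed_mask(mask):
    """Zero out gradient updates for weights removed by `mask` (1 = keep)."""
    def init_fn(params):
        del params
        return optax.EmptyState()

    def update_fn(updates, state, params=None):
        del params
        masked = jax.tree_util.tree_map(lambda u, m: u * m, updates, mask)
        return masked, state

    return optax.GradientTransformation(init_fn, update_fn)

# Usage: chain the masking step with any standard optimizer.
params = {"w": jnp.ones((4, 4)), "b": jnp.zeros((4,))}
mask = {"w": (jax.random.uniform(jax.random.PRNGKey(0), (4, 4)) > 0.5).astype(jnp.float32),
        "b": jnp.ones((4,))}
tx = optax.chain(apply_fixed_mask(mask), optax.sgd(1e-1))
opt_state = tx.init(params)
```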

* Jaxpruner is hosted at http://github.com/google-research/jaxpruner 

Robustmix: Improving Robustness by Regularizing the Frequency Bias of Deep Nets

Apr 06, 2023
Jonas Ngnawe, Marianne ABEMGNIGNI NJIFON, Jonathan Heek, Yann Dauphin

Deep networks have achieved impressive results on a range of well-curated benchmark datasets. Surprisingly, their performance remains sensitive to perturbations that have little effect on human performance. In this work, we propose a novel extension of Mixup called Robustmix that regularizes networks to classify based on lower-frequency spatial features. We show that this type of regularization improves robustness on a range of benchmarks such as ImageNet-C and Stylized ImageNet. It adds little computational overhead and, furthermore, does not require a priori knowledge of a large set of image transformations. We find that this approach further complements recent advances in model architecture and data augmentation, attaining a state-of-the-art mCE of 44.8 with an EfficientNet-B8 model and RandAugment, which is a reduction of 16 mCE compared to the baseline.
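
As a rough sketch of frequency-band mixing in this spirit (not the paper's exact formulation), one can low-pass images with an FFT and combine bands from two inputs; the cutoff and the label rule below are simplifying assumptions.

```python
# Illustrative sketch of frequency-band mixing; details differ from the paper.
import numpy as np

def low_pass(img, cutoff):
    """Keep spatial frequencies below `cutoff` (fraction of Nyquist) via FFT."""
    f = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    h, w = img.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    radius = np.sqrt(((yy - h / 2) / (h / 2)) ** 2 + ((xx - w / 2) / (w / 2)) ** 2)
    mask = (radius <= cutoff)[..., None] if img.ndim == 3 else (radius <= cutoff)
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask, axes=(0, 1)), axes=(0, 1)))

def robustmix_pair(img_a, img_b, label_a, cutoff=0.25):
    # Low frequencies from image A, high frequencies from image B; here the
    # label simply follows the low-frequency content (a simplification).
    mixed = low_pass(img_a, cutoff) + (img_b - low_pass(img_b, cutoff))
    return mixed, label_a
```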

* Accepted at: Workshop on Distribution Shifts, 36th Conference on Neural Information Processing Systems (NeurIPS 2022). https://openreview.net/forum?id=Na64z0YpOx 

No One Representation to Rule Them All: Overlapping Features of Training Methods

Oct 26, 2021
Raphael Gontijo-Lopes, Yann Dauphin, Ekin D. Cubuk

Despite being able to capture a range of features of the data, high-accuracy models trained with supervision tend to make similar predictions. This seemingly implies that high-performing models share similar biases regardless of training methodology, which would limit ensembling benefits and render low-accuracy models as having little practical use. Against this backdrop, recent work has made very different training techniques, such as large-scale contrastive learning, yield competitively high accuracy on generalization and robustness benchmarks. This motivates us to revisit the assumption that models necessarily learn similar functions. We conduct a large-scale empirical study of models across hyper-parameters, architectures, frameworks, and datasets. We find that model pairs that diverge more in training methodology display categorically different generalization behavior, producing increasingly uncorrelated errors. We show these models specialize in subdomains of the data, leading to higher ensemble performance: with just 2 models (each with ImageNet accuracy ~76.5%), we can create ensembles with 83.4% accuracy (a +7% boost). Surprisingly, we find that even significantly low-accuracy models can be used to improve high-accuracy models. Finally, we show that diverging training methodologies yield representations that capture overlapping (but not supersetting) feature sets which, when combined, lead to increased downstream performance.
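
A minimal sketch of the two-model ensembling referenced above, assuming simple probability averaging (the study's exact ensembling protocol may differ):

```python
# Average the predicted class probabilities of two independently trained models.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ensemble_predict(logits_model_a, logits_model_b):
    probs = 0.5 * (softmax(logits_model_a) + softmax(logits_model_b))
    return probs.argmax(axis=1)
```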

Auxiliary Task Update Decomposition: The Good, The Bad and The Neutral

Aug 25, 2021
Lucio M. Dery, Yann Dauphin, David Grangier

While deep learning has been very beneficial in data-rich settings, tasks with smaller training sets often resort to pre-training or multitask learning to leverage data from other tasks. In this case, careful consideration is needed to select tasks and model parameterizations such that updates from the auxiliary tasks actually help the primary task. We seek to alleviate this burden by formulating a model-agnostic framework that performs fine-grained manipulation of the auxiliary task gradients. We propose to decompose auxiliary updates into directions which help, damage, or leave the primary task loss unchanged. This allows weighting the update directions differently depending on their impact on the problem of interest. We present a novel and efficient algorithm for that purpose and show its advantage in practice. Our method leverages efficient automatic differentiation procedures and randomized singular value decomposition for scalability. We show that our framework is generic and encompasses some prior work as particular cases. Our approach consistently outperforms strong and widely used baselines when leveraging out-of-distribution data for text and image classification tasks.
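
As a simplified illustration of the decomposition idea, the single-direction sketch below projects an auxiliary gradient onto the primary gradient and reweights the helpful, harmful, and neutral parts. The paper's actual algorithm operates on a subspace obtained with randomized SVD; the weights here are assumptions.

```python
# Simplified, single-direction sketch of auxiliary-update decomposition.
import numpy as np

def decompose_aux_update(g_aux, g_primary, w_help=1.0, w_harm=0.0, w_neutral=1.0):
    direction = g_primary / (np.linalg.norm(g_primary) + 1e-12)
    coeff = g_aux @ direction          # signed component along the primary gradient
    parallel = coeff * direction
    orthogonal = g_aux - parallel      # "neutral": leaves the primary loss unchanged to first order
    if coeff > 0:                      # helps the primary task
        parallel = w_help * parallel
    else:                              # would increase the primary loss
        parallel = w_harm * parallel
    return parallel + w_neutral * orthogonal
```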

* 15 pages, 3 figures, Accepted to International Conference on Learning Representations (ICLR) 2021. See https://github.com/ldery/ATTITTUD for associated code 

Continental-Scale Building Detection from High Resolution Satellite Imagery

Jul 29, 2021
Wojciech Sirko, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Eddine Bouchareb, Yann Dauphin, Daniel Keysers, Maxim Neumann, Moustapha Cisse, John Quinn

Identifying the locations and footprints of buildings is vital for many practical and scientific purposes. Such information can be particularly useful in developing regions where alternative data sources may be scarce. In this work, we describe a model training pipeline for detecting buildings across the entire continent of Africa, using 50 cm satellite imagery. Starting with the U-Net model, widely used in satellite image analysis, we study variations in architecture, loss functions, regularization, pre-training, self-training and post-processing that increase instance segmentation performance. Experiments were carried out using a dataset of 100k satellite images across Africa containing 1.75M manually labelled building instances, and further datasets for pre-training and self-training. We report novel methods for improving performance of building detection with this type of model, including the use of mixup (mAP +0.12) and self-training with soft KL loss (mAP +0.06). The resulting pipeline obtains good results even in a wide variety of challenging rural and urban contexts, and was used to create the Open Buildings dataset of 516M Africa-wide detected footprints.
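
A hedged sketch of the soft KL self-training loss mentioned above, with the student trained to match a teacher's soft class probabilities on unlabelled imagery; the reduction and numerical details are assumptions rather than the paper's exact setup.

```python
# Sketch of a soft KL distillation/self-training loss.
import numpy as np

def soft_kl_loss(student_logits, teacher_probs, eps=1e-12):
    # KL(teacher || student), averaged over pixels/examples.
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    student_log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    kl = (teacher_probs * (np.log(teacher_probs + eps) - student_log_probs)).sum(axis=-1)
    return kl.mean()
```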

Temperature check: theory and practice for training models with softmax-cross-entropy losses

Oct 14, 2020
Atish Agarwala, Jeffrey Pennington, Yann Dauphin, Sam Schoenholz

The softmax function combined with a cross-entropy loss is a principled approach to modeling probability distributions that has become ubiquitous in deep learning. The softmax function is defined by a lone hyperparameter, the temperature, that is commonly set to one or regarded as a way to tune model confidence after training; however, less is known about how the temperature impacts training dynamics or generalization performance. In this work we develop a theory of early learning for models trained with softmax-cross-entropy loss and show that the learning dynamics depend crucially on the inverse-temperature $\beta$ as well as the magnitude of the logits at initialization, $||\beta{\bf z}||_{2}$. We follow up these analytic results with a large-scale empirical study of a variety of model architectures trained on CIFAR10, ImageNet, and IMDB sentiment analysis. We find that generalization performance depends strongly on the temperature, but only weakly on the initial logit magnitude. We provide evidence that the dependence of generalization on $\beta$ is not due to changes in model confidence, but is a dynamical phenomenon. It follows that the addition of $\beta$ as a tunable hyperparameter is key to maximizing model performance. Although we find the optimal $\beta$ to be sensitive to the architecture, our results suggest that tuning $\beta$ over the range $10^{-2}$ to $10^1$ improves performance over all architectures studied. We find that smaller $\beta$ may lead to better peak performance at the cost of learning stability.
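
As a minimal sketch of where the inverse temperature enters, $\beta$ simply rescales the logits inside the softmax-cross-entropy loss:

```python
# Softmax-cross-entropy with an inverse-temperature scaling of the logits.
import numpy as np

def softmax_cross_entropy_with_temperature(logits, labels, beta=1.0):
    z = beta * logits
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```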

Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win

Oct 07, 2020
Utku Evci, Yani A. Ioannou, Cem Keskin, Yann Dauphin

Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exception of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). In this work, we attempt to answer two questions: (1) why does training unstructured sparse networks from random initialization perform poorly; and (2) what makes LTs and DST the exceptions? We show that sparse NNs have poor gradient flow at initialization and propose a modified initialization for unstructured connectivity. Furthermore, we find that DST methods significantly improve gradient flow during training over traditional sparse training methods. Finally, we show that LTs do not improve gradient flow; rather, their success lies in re-learning the pruning solution they are derived from. However, this comes at the cost of learning novel solutions.
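
As a hedged sketch of a sparsity-aware initialization in this spirit, one can rescale each unit's weights by its masked (actual) fan-in rather than the dense fan-in; the exact scaling rule used in the paper may differ.

```python
# Sketch of a fan-in-aware initialization for an unstructured sparse layer.
import numpy as np

def sparse_he_init(mask, seed=0):
    """mask: (fan_in, fan_out) binary array, 1 = connection kept."""
    rng = np.random.default_rng(seed)
    fan_in_per_unit = np.maximum(mask.sum(axis=0, keepdims=True), 1)  # per output unit
    std = np.sqrt(2.0 / fan_in_per_unit)                              # He-style scaling
    weights = rng.normal(0.0, 1.0, size=mask.shape) * std
    return weights * mask
```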

* sparse training, sparsity, pruning, lottery ticket hypothesis, lottery tickets, sparse initialization, initialization, deep learning, gradient flow 