Utku Evci

Scaling Laws for Sparsely-Connected Foundation Models

Sep 15, 2023
Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales on ViT/JFT-4B and T5/C4. These results allow us to characterize the "optimal sparsity", the sparsity level that yields the best performance for a given effective model size and training budget. For a fixed number of non-zero parameters, we find that the optimal sparsity increases with the amount of data used for training. We also extend our study to different sparsity structures (such as the hardware-friendly n:m pattern) and strategies (such as starting from a pretrained dense model). Our findings shed light on the power and limitations of weight sparsity across various parameter and computational settings, offering both theoretical understanding and practical implications for leveraging sparsity towards computational efficiency improvements.
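
As a rough illustration of how such a law can be used, the sketch below combines a sparsity-modulated capacity term with a data term and searches for the optimal sparsity at a fixed non-zero parameter count and training-FLOP budget. The functional form and all coefficients are hypothetical placeholders chosen to reproduce the qualitative trade-off the abstract describes; they are not the fitted law from the paper.

```python
# Hypothetical sketch: the functional form and coefficients are illustrative
# placeholders, not the scaling law fitted in the paper.
import numpy as np

def predicted_loss(sparsity, n_nonzero, tokens,
                   a_s=2.0, b_s=0.5, c_s=1.0, b_n=0.1, a_d=1e9, b_d=0.3, floor=0.5):
    """Power-law ansatz: a sparsity-modulated capacity term in the non-zero
    parameter count, a data term in the training tokens, and an error floor."""
    capacity = (a_s * (1.0 - sparsity) ** b_s + c_s) * n_nonzero ** (-b_n)
    data = (a_d / tokens) ** b_d
    return capacity + data + floor

def optimal_sparsity(n_nonzero, flops_budget, grid=np.linspace(0.0, 0.95, 96)):
    """At a fixed non-zero and training-FLOP budget, sparser models have a larger
    dense footprint, so fewer training tokens fit into the same budget."""
    best_s, best_loss = 0.0, np.inf
    for s in grid:
        dense_params = n_nonzero / (1.0 - s)
        tokens = flops_budget / (6.0 * dense_params)  # rough 6*params FLOPs per token
        loss = predicted_loss(s, n_nonzero, tokens)
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s

for budget in (1e19, 1e20, 1e21):  # a larger budget (more data) favors higher sparsity
    print(f"budget={budget:.0e}  optimal sparsity={optimal_sparsity(1e8, budget):.2f}")
```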

Dynamic Sparse Training with Structured Sparsity

May 03, 2023
Mike Lasby, Anna Golubeva, Utku Evci, Mihai Nica, Yani Ioannou

Dynamic sparse training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically cheaper to train, achieving speedups with unstructured sparsity on real-world hardware is challenging. In this work we propose a DST method to learn a variant of structured N:M sparsity whose acceleration is commonly supported in commodity hardware. Furthermore, we motivate the generalization performance of our specific N:M sparsity (constant fan-in) with both a theoretical analysis and empirical results, present a condensed representation with a reduced parameter and memory footprint, and demonstrate reduced inference time compared to dense models with a naive PyTorch CPU implementation of the condensed representation. Our source code is available at https://github.com/calgaryml/condensed-sparsity
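
A minimal sketch of the condensed, constant fan-in representation described above (not the authors' implementation; see their repository for that): each output unit keeps exactly k incoming weights, so the non-zero values and their column indices pack into dense (out_features, k) tensors, and the forward pass becomes a gather plus a small contraction.

```python
# Sketch of a constant fan-in condensed linear layer (illustrative, not the paper's code).
import torch
import torch.nn as nn

class CondensedLinear(nn.Module):
    def __init__(self, dense_weight: torch.Tensor, k: int, bias: torch.Tensor = None):
        super().__init__()
        # Keep the k largest-magnitude weights in each row: constant fan-in per output unit.
        idx = dense_weight.abs().topk(k, dim=1).indices              # (out_features, k)
        self.register_buffer("idx", idx)
        self.weight = nn.Parameter(dense_weight.gather(1, idx))      # (out_features, k)
        self.bias = nn.Parameter(bias) if bias is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gather only the k needed inputs per output unit, then contract over k.
        gathered = x[:, self.idx]                  # (batch, out_features, k)
        out = (gathered * self.weight).sum(-1)     # (batch, out_features)
        return out + self.bias if self.bias is not None else out

# Usage: condense a dense layer to ~90% sparsity (fan-in = 10% of in_features).
dense = nn.Linear(512, 256)
sparse = CondensedLinear(dense.weight.detach(), k=51, bias=dense.bias.detach())
print(sparse(torch.randn(8, 512)).shape)   # torch.Size([8, 256])
```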

* 16 pages, 11 figures 

JaxPruner: A concise library for sparsity research

May 02, 2023
Joo Hyung Lee, Wonpyo Park, Nicole Mitchell, Jonathan Pilault, Johan Obando-Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart Bik, Woohyun Han, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci

This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration by providing examples in four different codebases (Scenic, t5x, Dopamine, and FedJAX) and provide baseline experiments on popular benchmarks.
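
The integration pattern looks roughly like the sketch below, which follows the library's README as I recall it; the exact configuration fields and function names (e.g. create_updater_from_config, wrap_optax) should be checked against the repository linked in the note below.

```python
# Sketch of wrapping an Optax optimizer with a JaxPruner sparsity updater.
import jaxpruner
import ml_collections
import optax

config = ml_collections.ConfigDict()
config.algorithm = "magnitude"     # e.g. magnitude pruning; other algorithms share this API
config.sparsity = 0.8              # target fraction of zero weights
config.update_freq = 10            # update the mask every 10 steps
config.update_start_step = 200
config.update_end_step = 1000
config.dist_type = "erk"           # how sparsity is distributed across layers

pruner = jaxpruner.create_updater_from_config(config)
tx = pruner.wrap_optax(optax.adamw(3e-4))   # behaves like a regular Optax transformation
# `tx` can then replace the optimizer in any Optax-based training loop,
# which is the kind of drop-in integration the paper demonstrates.
```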

* Jaxpruner is hosted at http://github.com/google-research/jaxpruner 

The Dormant Neuron Phenomenon in Deep Reinforcement Learning

Feb 24, 2023
Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, Utku Evci

In this work we identify the dormant neuron phenomenon in deep reinforcement learning, where an agent's network suffers from an increasing number of inactive neurons, thereby affecting network expressivity. We demonstrate the presence of this phenomenon across a variety of algorithms and environments, and highlight its effect on learning. To address this issue, we propose a simple and effective method (ReDo) that Recycles Dormant neurons throughout training. Our experiments demonstrate that ReDo maintains the expressive power of networks by reducing the number of dormant neurons and results in improved performance.
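
A minimal sketch of the recycling step as I read the abstract (the paper's exact dormancy score, threshold, and re-initialization details may differ): units whose average activation is negligible relative to their layer are re-initialized on the incoming side and zeroed on the outgoing side, so recycling does not perturb the agent's current outputs.

```python
# Illustrative recycling step for dormant hidden units (not the authors' code).
import torch
import torch.nn as nn

@torch.no_grad()
def recycle_dormant(layer_in: nn.Linear, layer_out: nn.Linear,
                    activations: torch.Tensor, tau: float = 0.025) -> int:
    """activations: (batch, hidden) post-activation outputs of layer_in.
    tau is an illustrative dormancy threshold."""
    score = activations.abs().mean(dim=0)
    score = score / (score.mean() + 1e-8)        # normalize by the layer's average activity
    dormant = score <= tau                        # mask of dormant units
    if dormant.any():
        fresh = torch.empty_like(layer_in.weight)
        nn.init.kaiming_uniform_(fresh)           # re-draw incoming weights
        layer_in.weight[dormant] = fresh[dormant]
        layer_in.bias[dormant] = 0.0
        layer_out.weight[:, dormant] = 0.0        # zero outgoing weights: outputs unchanged
    return int(dormant.sum())                     # number of recycled units
```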

Scaling Vision Transformers to 22 Billion Parameters

Feb 10, 2023
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.

Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

Sep 15, 2022
Sheng-Chun Kao, Amir Yazdanbakhsh, Suvinay Subramanian, Shivani Agrawal, Utku Evci, Tushar Krishna

Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. In particular, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sparsity to yield higher compute-efficiency. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay". Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a Transformer-based model for a translation task. The increase in the accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs).
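
For concreteness, the sketch below shows the underlying N:M pattern (keep the N largest-magnitude weights in every group of M consecutive weights, e.g. 2:4) together with one plausible reading of a decay-based schedule in which pruned weights are attenuated gradually rather than zeroed abruptly; the paper's exact "pruning mask decay" and "sparse structure decay" rules may differ.

```python
# N:M structured mask plus an illustrative decaying-mask schedule (not the paper's exact recipe).
import torch

def n_m_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary mask keeping the n largest-magnitude weights in each group of m."""
    groups = weight.reshape(-1, m)                             # (num_groups, m)
    topk = groups.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(groups).scatter_(1, topk, 1.0)
    return mask.reshape(weight.shape)

def apply_decayed_mask(weight, mask, step, total_steps):
    """Attenuate pruned weights by a factor annealed from 1 to 0, so the network
    can adapt to the emerging sparse structure instead of losing it abruptly."""
    decay = max(0.0, 1.0 - step / total_steps)
    return weight * (mask + (1.0 - mask) * decay)

w = torch.randn(8, 16)
m = n_m_mask(w)                        # 50% structured sparsity in a 2:4 pattern
print(m.reshape(-1, 4).sum(dim=1))     # every group of 4 keeps exactly 2 weights
```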

* 11 pages, 2 figures, and 9 tables. Published at the ICML Workshop on Sparsity in Neural Networks: Advancing Understanding and Practice, 2022. First two authors contributed equally 

The State of Sparse Training in Deep Reinforcement Learning

Jun 17, 2022
Laura Graesser, Utku Evci, Erich Elsen, Pablo Samuel Castro

The use of sparse neural networks has seen rapid growth in recent years, particularly in computer vision. Their appeal stems largely from the reduced number of parameters required to train and store them, as well as an increase in learning efficiency. Somewhat surprisingly, there have been very few efforts exploring their use in Deep Reinforcement Learning (DRL). In this work we perform a systematic investigation into applying a number of existing sparse training techniques to a variety of DRL agents and environments. Our results corroborate, in the DRL domain, the finding from computer vision that sparse networks perform better than dense networks with the same parameter count. We provide detailed analyses of how the various components in DRL are affected by the use of sparse networks and conclude by suggesting promising avenues for improving the effectiveness of sparse training methods, as well as for advancing their use in DRL.

* Proceedings of the 39th International Conference on Machine Learning (ICML'22) 

GradMax: Growing Neural Networks using Gradient Information

Jan 13, 2022
Utku Evci, Max Vladymyrov, Thomas Unterthiner, Bart van Merriënboer, Fabian Pedregosa

The architecture and the parameters of neural networks are often optimized independently, which requires costly retraining of the parameters whenever the architecture is modified. In this work we instead focus on growing the architecture without requiring costly retraining. We present a method that adds new neurons during training without impacting what is already learned, while improving the training dynamics. We achieve the latter by maximizing the gradients of the new weights, finding the optimal initialization efficiently by means of the singular value decomposition (SVD). We call this technique Gradient Maximizing Growth (GradMax) and demonstrate its effectiveness in a variety of vision tasks and architectures.
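
A minimal sketch of the growing step as I read the abstract (the exact matrix that is decomposed and the scaling of the new weights may differ from the paper): outgoing weights of the new neurons start at zero, so the network's function is unchanged, while incoming weights are taken from the leading right singular vectors of a gradient/activation outer product, which makes the gradients with respect to those zero outgoing weights large.

```python
# Illustrative gradient-maximizing initialization for newly grown neurons.
import numpy as np

def gradmax_init(prev_acts: np.ndarray, next_grads: np.ndarray,
                 num_new: int, scale: float = 1e-3):
    """prev_acts: (batch, d_in) activations feeding the grown layer.
    next_grads: (batch, d_out) loss gradients at the following layer's pre-activations."""
    # Outer-product matrix whose top right singular vectors maximize the induced gradient norm.
    m = next_grads.T @ prev_acts                          # (d_out, d_in)
    _, _, vt = np.linalg.svd(m, full_matrices=False)
    w_in_new = scale * vt[:num_new]                       # (num_new, d_in) incoming weights
    w_out_new = np.zeros((next_grads.shape[1], num_new))  # zero outgoing: function preserved
    return w_in_new, w_out_new
```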

Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning

Jan 10, 2022
Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael C. Mozer

Transfer-learning methods aim to improve performance in a data-scarce target domain using a model pretrained on a data-rich source domain. A cost-efficient strategy, linear probing, involves freezing the source model and training a new classification head for the target domain. This strategy is outperformed by a more costly but state-of-the-art method -- fine-tuning all parameters of the source model to the target domain -- possibly because fine-tuning allows the model to leverage useful information from intermediate layers that would otherwise be discarded by the later pretrained layers. We explore the hypothesis that these intermediate layers might be directly exploited. We propose a method, Head-to-Toe probing (Head2Toe), that selects features from all layers of the source model to train a classification head for the target domain. In evaluations on the VTAB-1k benchmark, Head2Toe matches the performance obtained with fine-tuning on average while reducing training and storage costs a hundredfold or more; critically, for out-of-distribution transfer, Head2Toe outperforms fine-tuning.
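
A minimal sketch of the select-then-probe idea (the paper's feature pooling and relevance scoring differ in detail): pool and concatenate features from every layer of the frozen backbone, rank them with a sparsity-regularized linear probe, keep a small fraction, and train the target-domain head on that subset only.

```python
# Illustrative two-stage probe over "head to toe" features (not the paper's exact procedure).
import numpy as np
from sklearn.linear_model import LogisticRegression

def head2toe_probe(layer_features, labels, keep_fraction=0.01):
    """layer_features: list of (num_examples, d_i) arrays, one per backbone layer."""
    feats = np.concatenate(layer_features, axis=1)          # concatenate all layers' features
    # Stage 1: sparsity-inducing probe to score per-feature relevance.
    selector = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=1000)
    selector.fit(feats, labels)
    relevance = np.abs(selector.coef_).sum(axis=0)
    keep = np.argsort(relevance)[::-1][: max(1, int(keep_fraction * feats.shape[1]))]
    # Stage 2: train the actual classification head on the selected features only.
    head = LogisticRegression(max_iter=1000).fit(feats[:, keep], labels)
    return keep, head
```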
