Martin Jaggi
EPFL

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

May 29, 2024

Deep Grokking: Would Deep Neural Networks Generalize Better?

May 29, 2024

The Privacy Power of Correlated Noise in Decentralized Learning

May 02, 2024

Personalized Collaborative Fine-Tuning for On-Device Large Language Models

Apr 15, 2024

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

Mar 30, 2024

Towards an empirical understanding of MoE design choices

Feb 20, 2024

Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Feb 06, 2024

InterpretCC: Conditional Computation for Inherently Interpretable Neural Networks

Feb 05, 2024

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Feb 04, 2024

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

Nov 27, 2023