Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ekaterina Lobacheva

HSE University, Russia

Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

Jun 05, 2025

Andrei Mircea, Supriyo Chakraborty, Nima Chitsazan, Irina Rish, Ekaterina Lobacheva

Abstract:This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training; an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl

* Published as a conference paper at ACL 2025

Via

Access Paper or Ask Questions

SGD as Free Energy Minimization: A Thermodynamic View on Neural Network Training

May 29, 2025

Ildus Sadrtdinov, Ivan Klimov, Ekaterina Lobacheva, Dmitry Vetrov

Abstract:We present a thermodynamic interpretation of the stationary behavior of stochastic gradient descent (SGD) under fixed learning rates (LRs) in neural network training. We show that SGD implicitly minimizes a free energy function $F=U-TS$, balancing training loss $U$ and the entropy of the weights distribution $S$, with temperature $T$ determined by the LR. This perspective offers a new lens on why high LRs prevent training from converging to the loss minima and how different LRs lead to stabilization at different loss levels. We empirically validate the free energy framework on both underparameterized (UP) and overparameterized (OP) models. UP models consistently follow free energy minimization, with temperature increasing monotonically with LR, while for OP models, the temperature effectively drops to zero at low LRs, causing SGD to minimize the loss directly and converge to an optimum. We attribute this mismatch to differences in the signal-to-noise ratio of stochastic gradients near optima, supported by both a toy example and neural network experiments.

* First two authors contributed equally

Via

Access Paper or Ask Questions

Where Do Large Learning Rates Lead Us?

Oct 29, 2024

Ildus Sadrtdinov, Maxim Kodryan, Eduard Pokonechny, Ekaterina Lobacheva, Dmitry Vetrov

Abstract:It is generally accepted that starting neural networks training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold lead to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of reached minima, we observe that using LRs from this optimal range allows for the optimization to locate a basin that only contains high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task. In contrast, starting training with too small LRs leads to unstable minima and attempts to learn all features simultaneously, resulting in poor generalization. Conversely, using initial LRs that are too large fails to detect a basin with good solutions and extract meaningful patterns from the data.

* Published in NeurIPS 2024. First three authors contributed equally, last two authors share senior authorship

Via

Access Paper or Ask Questions

Large Learning Rates Improve Generalization: But How Large Are We Talking About?

Nov 19, 2023

Ekaterina Lobacheva, Eduard Pockonechnyy, Maxim Kodryan, Dmitry Vetrov

Abstract:Inspired by recent research that recommends starting neural networks training with large learning rates (LRs) to achieve the best generalization, we explore this hypothesis in detail. Our study clarifies the initial LR ranges that provide optimal results for subsequent training with a small LR or weight averaging. We find that these ranges are in fact significantly narrower than generally assumed. We conduct our main experiments in a simplified setup that allows precise control of the learning rate hyperparameter and validate our key findings in a more practical setting.

* Published in Mathematics of Modern Machine Learning Workshop at NeurIPS 2023. First two authors contributed equally

Via

Access Paper or Ask Questions

To Stay or Not to Stay in the Pre-train Basin: Insights on Ensembling in Transfer Learning

Mar 06, 2023

Ildus Sadrtdinov, Dmitrii Pozdeev, Dmitry Vetrov, Ekaterina Lobacheva

Abstract:Transfer learning and ensembling are two popular techniques for improving the performance and robustness of neural networks. Due to the high cost of pre-training, ensembles of models fine-tuned from a single pre-trained checkpoint are often used in practice. Such models end up in the same basin of the loss landscape and thus have limited diversity. In this work, we study if it is possible to improve ensembles trained from a single pre-trained checkpoint by better exploring the pre-train basin or a close vicinity outside of it. We show that while exploration of the pre-train basin may be beneficial for the ensemble, leaving the basin results in losing the benefits of transfer learning and degradation of the ensemble quality.

* First two authors contributed equally

Via

Access Paper or Ask Questions

Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Sep 08, 2022

Maxim Kodryan, Ekaterina Lobacheva, Maksim Nakhodnov, Dmitry Vetrov

Figure 1 for Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Figure 2 for Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Figure 3 for Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Figure 4 for Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Abstract:A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.

* First three authors contributed equally

Via

Access Paper or Ask Questions

Machine Learning Methods for Spectral Efficiency Prediction in Massive MIMO Systems

Dec 29, 2021

Evgeny Bobrov, Sergey Troshin, Nadezhda Chirkova, Ekaterina Lobacheva, Sviatoslav Panchenko, Dmitry Vetrov, Dmitry Kropotov

Figure 1 for Machine Learning Methods for Spectral Efficiency Prediction in Massive MIMO Systems

Figure 2 for Machine Learning Methods for Spectral Efficiency Prediction in Massive MIMO Systems

Figure 3 for Machine Learning Methods for Spectral Efficiency Prediction in Massive MIMO Systems

Figure 4 for Machine Learning Methods for Spectral Efficiency Prediction in Massive MIMO Systems

Abstract:Channel decoding, channel detection, channel assessment, and resource management for wireless multiple-input multiple-output (MIMO) systems are all examples of problems where machine learning (ML) can be successfully applied. In this paper, we study several ML approaches to solve the problem of estimating the spectral efficiency (SE) value for a certain precoding scheme, preferably in the shortest possible time. The best results in terms of mean average percentage error (MAPE) are obtained with gradient boosting over sorted features, while linear models demonstrate worse prediction quality. Neural networks perform similarly to gradient boosting, but they are more resource- and time-consuming because of hyperparameter tuning and frequent retraining. We investigate the practical applicability of the proposed algorithms in a wide range of scenarios generated by the Quadriga simulator. In almost all scenarios, the MAPE achieved using gradient boosting and neural networks is less than 10\%.

* To appear in Optimization Methods & Software, 22 pages, 10 figures, 2 tables

Via

Access Paper or Ask Questions

On the Memorization Properties of Contrastive Learning

Jul 21, 2021

Ildus Sadrtdinov, Nadezhda Chirkova, Ekaterina Lobacheva

Figure 1 for On the Memorization Properties of Contrastive Learning

Figure 2 for On the Memorization Properties of Contrastive Learning

Figure 3 for On the Memorization Properties of Contrastive Learning

Figure 4 for On the Memorization Properties of Contrastive Learning

Abstract:Memorization studies of deep neural networks (DNNs) help to understand what patterns and how do DNNs learn, and motivate improvements to DNN training approaches. In this work, we investigate the memorization properties of SimCLR, a widely used contrastive self-supervised learning approach, and compare them to the memorization of supervised learning and random labels training. We find that both training objects and augmentations may have different complexity in the sense of how SimCLR learns them. Moreover, we show that SimCLR is similar to random labels training in terms of the distribution of training objects complexity.

* Published in Workshop on Overparameterization: Pitfalls & Opportunities at ICML 2021

Via

Access Paper or Ask Questions

On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Jun 29, 2021

Ekaterina Lobacheva, Maxim Kodryan, Nadezhda Chirkova, Andrey Malinin, Dmitry Vetrov

Figure 1 for On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Figure 2 for On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Figure 3 for On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Figure 4 for On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Abstract:Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show their joint usage may cause instabilities at the late stages of training. Other works, in contrast, show convergence to the equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior. That is, the training process regularly exhibits instabilities which, however, do not lead to complete training failure, but cause a new period of training. We rigorously investigate the mechanism underlying this discovered periodic behavior both from an empirical and theoretical point of view and show that this periodic behavior is indeed caused by the interaction between batch normalization and weight decay.

* First two authors contributed equally

Via

Access Paper or Ask Questions

On Power Laws in Deep Ensembles

Jul 16, 2020

Ekaterina Lobacheva, Nadezhda Chirkova, Maxim Kodryan, Dmitry Vetrov

Figure 1 for On Power Laws in Deep Ensembles

Figure 2 for On Power Laws in Deep Ensembles

Figure 3 for On Power Laws in Deep Ensembles

Figure 4 for On Power Laws in Deep Ensembles

Abstract:Ensembles of deep neural networks are known to achieve state-of-the-art performance in uncertainty estimation and lead to accuracy improvement. In this work, we focus on a classification problem and investigate the behavior of both non-calibrated and calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size. We indicate the conditions under which CNLL follows a power law w.r.t. ensemble size or member network size, and analyze the dynamics of the parameters of the discovered power laws. Our important practical finding is that one large network may perform worse than an ensemble of several medium-size networks with the same total number of parameters (we call this ensemble a memory split). Using the detected power law-like dependencies, we can predict (1) the possible gain from the ensembling of networks with given structure, (2) the optimal memory split given a memory budget, based on a relatively small number of trained networks. We describe the memory split advantage effect in more details in arXiv:2005.07292

* Published in Workshop on Uncertainty and Robustness in Deep Learning, ICML 2020

Via

Access Paper or Ask Questions