Bernd Rosenow

A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification

May 21, 2026

Marcel Kühn, Yoon Thelge, Bernd Rosenow

Abstract: Hard-label classification is usually trained with smooth surrogate losses, most prominently softmax cross-entropy. We isolate an asymptotic mechanism by which this mismatch between smooth surrogate and discrete labels produces power-law learning curves in an online teacher-student model. After subtracting the mean logit, the thermodynamic-limit dynamics close in centered variables: a growing centered student-teacher alignment $D$ and the residual student variance $Δ$. At late times, examples away from teacher decision boundaries are already classified confidently and contribute exponentially little. Only boundary layers of width $O(D^{-1})$ remain active, while the noise of fixed-learning-rate online gradient descent maintains a nonzero $Δ$. As a function of the training time $α$, the late-time solution yields an $α^{-1/3}$ power law not only for the test loss but also for the generalization error $ε_g$, i.e., one minus test accuracy. This is much slower than the $α^{-1}$ Bayes-optimal reference for the same model. We further show that learning-rate schedules can improve the generalization error towards an $ε_g \sim α^{-1/2}$ power law. Simulations support the predicted order parameter dynamics and learning curves. Controlled experiments with correlated Gaussian inputs and whitened pretrained features show that data structure can dominate transients. Therefore, our result is an asymptotic, complementary mechanism rather than an alternative to spectral explanations of neural scaling laws.

* 20 pages, 7 figures

Locating Information in Large Language Models via Random Matrix Theory

Oct 23, 2024

Max Staats, Matthias Thamm, Bernd Rosenow

Abstract: As large language models (LLMs) become central to AI applications, gaining a deeper understanding of their inner workings is increasingly important. In this work, we analyze the weight matrices of pretrained transformer models -- specifically BERT and Llama -- using random matrix theory (RMT) as a zero-information hypothesis. While randomly initialized weights perfectly agree with RMT predictions, deviations emerge after training, allowing us to locate learned structures within the models. We identify layer-type-specific behaviors that are consistent across all blocks and architectures considered. By pinpointing regions that deviate from RMT predictions, we highlight areas of feature learning and confirm this through comparisons with the activation covariance matrices of the corresponding layers. Our method provides a diagnostic tool for identifying relevant regions in transformer weights using only the trained matrices. Additionally, we address the ongoing debate regarding the significance of small singular values in the context of fine-tuning and alignment in LLMs. Our findings reveal that, after fine-tuning, small singular values play a crucial role in the models' capabilities, suggesting that removing them in an already aligned transformer can be detrimental, as it may compromise model alignment.

* 17 pages, 14 figures

Analyzing Neural Scaling Laws in Two-Layer Networks with Power-Law Data Spectra

Oct 11, 2024

Roman Worschech, Bernd Rosenow

Abstract: Neural scaling laws describe how the performance of deep neural networks scales with key factors such as training data size, model complexity, and training time, often following power-law behaviors over multiple orders of magnitude. Despite their empirical observation, the theoretical understanding of these scaling laws remains limited. In this work, we employ techniques from statistical mechanics to analyze one-pass stochastic gradient descent within a student-teacher framework, where both the student and teacher are two-layer neural networks. Our study primarily focuses on the generalization error and its behavior in response to data covariance matrices that exhibit power-law spectra. For linear activation functions, we derive analytical expressions for the generalization error, exploring different learning regimes and identifying conditions under which power-law scaling emerges. Additionally, we extend our analysis to non-linear activation functions in the feature learning regime, investigating how power-law spectra in the data covariance matrix impact learning dynamics. Importantly, we find that the length of the symmetric plateau depends on the number of distinct eigenvalues of the data covariance matrix and the number of hidden units, demonstrating how these plateaus behave under various configurations. In addition, our results reveal a transition from exponential to power-law convergence in the specialized phase when the data covariance matrix possesses a power-law spectrum. This work contributes to the theoretical understanding of neural scaling laws and provides insights into optimizing learning performance in practical scenarios involving complex data structures.

Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances

Jun 08, 2023

Marcel Kühn, Bernd Rosenow

Abstract: Stochastic gradient descent (SGD) has become a cornerstone of neural network optimization, yet the noise introduced by SGD is often assumed to be uncorrelated over time, despite the ubiquity of epoch-based training. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum, limited to a quadratic loss. Our main contributions are twofold: first, we calculate the exact autocorrelation of the noise for training in epochs under the assumption that the noise is independent of small fluctuations in the weight vector; second, we explore the influence of correlations introduced by the epoch-based learning scheme on SGD dynamics. We find that for directions with a curvature greater than a hyperparameter-dependent crossover value, the results for uncorrelated noise are recovered. However, for relatively flat directions, the weight variance is significantly reduced. We provide an intuitive explanation for these results based on a crossover between correlation times, contributing to a deeper understanding of the dynamics of SGD in the presence of epoch-based noise correlations.

* 25 pages, 7 figures

Reevaluating Loss Functions: Enhancing Robustness to Label Noise in Deep Learning Models

Jun 08, 2023

Max Staats, Matthias Thamm, Bernd Rosenow

Abstract: Large annotated datasets inevitably contain incorrect labels, which poses a major challenge for the training of deep neural networks, as they easily fit the labels. Good generalization performance can be achieved only when training with a robust model that is not easily distracted by the noise. A simple yet effective way to create a noise-robust model is to use a noise-robust loss function. However, the number of proposed loss functions is large, they often come with hyperparameters, and they may learn more slowly than the widely used but noise-sensitive Cross Entropy loss. By heuristic considerations and extensive numerical experiments, we study in which situations the proposed loss functions are applicable and give suggestions on how to choose an appropriate loss. Additionally, we propose a novel technique to enhance learning with bounded loss functions: the inclusion of an output bias, i.e., a slight increase in the neuron pre-activation corresponding to the correct label. Surprisingly, we find that this not only significantly improves the learning of bounded losses, but also leads to the Mean Absolute Error loss outperforming the Cross Entropy loss on the CIFAR-100 dataset, even in the absence of additional label noise. This suggests that training with a bounded loss function can be advantageous even in the presence of minimal label noise. To further strengthen our analysis of the learning behavior of different loss functions, we additionally design and test a novel loss function denoted as Bounded Cross Entropy.

* 13 pages, 4 figures
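
The contrast between an unbounded and a bounded loss, and the output-bias idea described in the abstract above, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of the sum of absolute differences as the Mean Absolute Error, and the bias size are assumptions made for illustration only.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(logits, label):
    """Unbounded loss: grows without limit as the probability of `label` goes to 0."""
    return -np.log(softmax(logits)[label])

def mae(logits, label):
    """Bounded loss: sum of |p - y| for a one-hot target y, at most 2."""
    probs = softmax(logits)
    target = np.zeros_like(probs)
    target[label] = 1.0
    return np.abs(probs - target).sum()

def add_output_bias(logits, label, bias):
    """Hypothetical helper: raise the pre-activation of the correct label by `bias`.

    The abstract only states that the correct-label pre-activation is slightly
    increased; the value of `bias` is an assumption of this sketch.
    """
    out = logits.copy()
    out[label] += bias
    return out

# A confidently wrong prediction for a sample whose label is class 1.
example_logits = np.array([10.0, 0.0, 0.0])
example_ce = cross_entropy(example_logits, 1)    # large, keeps growing with the logit gap
example_mae = mae(example_logits, 1)             # saturates just below 2
example_mae_biased = mae(add_output_bias(example_logits, 1, 2.0), 1)
```

For a one-hot target the absolute-error sum equals twice one minus the probability assigned to the correct class, so a mislabeled sample can contribute at most 2 to the loss, whereas its cross-entropy contribution is unlimited.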

Topological gap protocol based machine learning optimization of Majorana hybrid wires

May 25, 2023

Matthias Thamm, Bernd Rosenow

Abstract: Majorana zero modes in superconductor-nanowire hybrid structures are a promising candidate for topologically protected qubits with the potential to be used in scalable structures. Currently, disorder in such Majorana wires is a major challenge, as it can destroy the topological phase and thus reduce the yield in the fabrication of Majorana devices. We study machine learning optimization of a gate array in proximity to a grounded Majorana wire, which allows us to reliably compensate even strong disorder. We propose a metric for optimization that is inspired by the topological gap protocol, and which can be implemented based on measurements of the non-local conductance through the wire.

* 13 pages, 11 figures

Online Learning for the Random Feature Model in the Student-Teacher Framework

Mar 24, 2023

Roman Worschech, Bernd Rosenow

Abstract: Deep neural networks are widely used prediction algorithms whose performance often improves as the number of weights increases, leading to over-parametrization. We consider a two-layered neural network whose first layer is frozen while the last layer is trainable, known as the random feature model. We study over-parametrization in the context of a student-teacher framework by deriving a set of differential equations for the learning dynamics. For any finite ratio of hidden layer size and input dimension, the student cannot generalize perfectly, and we compute the non-zero asymptotic generalization error. An approach to perfect generalization is possible only when the student's hidden layer size is exponentially larger than the input dimension.

Machine learning optimization of Majorana hybrid nanowires

Aug 09, 2022

Matthias Thamm, Bernd Rosenow

Abstract: As the complexity of quantum systems such as quantum bit arrays increases, efforts to automate expensive tuning are increasingly worthwhile. We investigate machine learning based tuning of gate arrays using the CMA-ES algorithm for the case study of Majorana wires with strong disorder. We find that the algorithm is able to efficiently improve the topological signatures, learn intrinsic disorder profiles, and completely eliminate disorder effects. For example, with only 20 gates, it is possible to fully recover Majorana zero modes destroyed by disorder by optimizing gate voltages.

* 13 pages, 13 figures; added references

Boundary between noise and information applied to filtering neural network weight matrices

Jun 08, 2022

Max Staats, Matthias Thamm, Bernd Rosenow

Abstract: Deep neural networks have been successfully applied to a broad range of problems where overparametrization yields weight matrices which are partially random. A comparison of weight matrix singular vectors to the Porter-Thomas distribution suggests that there is a boundary between randomness and learned information in the singular value spectrum. Inspired by this finding, we introduce an algorithm for noise filtering, which both removes small singular values and reduces the magnitude of large singular values to counteract the effect of level repulsion between the noise and the information part of the spectrum. For networks trained in the presence of label noise, we indeed find that the generalization performance improves significantly due to noise filtering.

* 6 pages, 5 figures

Random matrix analysis of deep neural network weight matrices

Mar 28, 2022

Matthias Thamm, Max Staats, Bernd Rosenow

Abstract: Neural networks have been used successfully in a variety of fields, which has led to a great deal of interest in developing a theoretical understanding of how they store the information needed to perform a particular task. We study the weight matrices of trained deep neural networks using methods from random matrix theory (RMT) and show that the statistics of most of the singular values follow universal RMT predictions. This suggests that they are random and do not contain system-specific information, which we investigate further by comparing the statistics of eigenvector entries to the universal Porter-Thomas distribution. We find that for most eigenvectors the hypothesis of randomness cannot be rejected, and that only eigenvectors belonging to the largest singular values deviate from the RMT prediction, indicating that they may encode learned information. We analyze the spectral distribution of such large singular values using the Hill estimator and find that the distribution cannot be characterized by a tail index, i.e., it is not of power-law type.

* 11 pages, 9 figures
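
The idea of using random matrix theory as a baseline for weight matrices, as in the abstract above, can be sketched on synthetic data. The example below does not reproduce the paper's analysis (no trained network, no Porter-Thomas or Hill-estimator test). It only compares singular values with the Marchenko-Pastur bulk edge for i.i.d. entries and counts values above it. The function names, the known entry scale `sigma`, the safety `margin`, and the planted rank-one "signal" are all assumptions of this sketch.

```python
import numpy as np

def mp_singular_value_edges(n_rows, n_cols):
    """Asymptotic Marchenko-Pastur edges for the singular values of W / sqrt(n_max),
    where W has i.i.d. unit-variance entries and n_max is the larger dimension."""
    ratio = min(n_rows, n_cols) / max(n_rows, n_cols)
    return 1.0 - np.sqrt(ratio), 1.0 + np.sqrt(ratio)

def count_outliers(weights, sigma=1.0, margin=0.1):
    """Count singular values above the upper bulk edge.

    `sigma` is the assumed standard deviation of the random entries and `margin`
    is a crude allowance for finite-size fluctuations (both are assumptions).
    """
    n_rows, n_cols = weights.shape
    singular_values = np.linalg.svd(weights, compute_uv=False)
    normalized = singular_values / (sigma * np.sqrt(max(n_rows, n_cols)))
    _, upper_edge = mp_singular_value_edges(n_rows, n_cols)
    return int(np.sum(normalized > upper_edge + margin))

rng = np.random.default_rng(0)
n_rows, n_cols = 400, 100

# Pure-noise matrix: stands in for randomly initialized weights.
noise = rng.standard_normal((n_rows, n_cols))

# Plant a rank-one component as a toy stand-in for learned structure.
u = rng.standard_normal(n_rows)
u /= np.linalg.norm(u)
v = rng.standard_normal(n_cols)
v /= np.linalg.norm(v)
spiked = noise + 3.0 * np.sqrt(n_rows) * np.outer(u, v)

noise_outliers = count_outliers(noise)
spiked_outliers = count_outliers(spiked)
```

In this toy setting the pure-noise spectrum stays inside the predicted bulk, while the planted component shows up as a singular value beyond the upper edge. For trained networks the entry scale is not known in advance and would have to be estimated from the bulk of the spectrum.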
