Bruno Loureiro

Learning Two-Layer Neural Networks, One (Giant) Step at a Time

May 29, 2023
Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan

We study the training dynamics of shallow neural networks, investigating the conditions under which a limited number of large-batch gradient descent steps can facilitate feature learning beyond the kernel regime. We compare the influence of batch size with that of multiple (but finitely many) steps. Our analysis of a single-step process reveals that while a batch size of $n = O(d)$ enables feature learning, it is only adequate for learning a single direction, i.e. a single-index model. In contrast, $n = O(d^2)$ is essential for learning multiple directions and specialization. Moreover, we demonstrate that "hard" directions, which lack the first $\ell$ Hermite coefficients, remain unobserved and require a batch size of $n = O(d^\ell)$ to be captured by gradient descent. Upon iterating a few steps, the scenario changes: a batch size of $n = O(d)$ is enough to learn new target directions spanning the subspace linearly connected in the Hermite basis to the previously learned directions, thereby exhibiting a staircase property. Our analysis utilizes a blend of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which are of independent interest. By determining the conditions necessary for learning and specialization, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction where learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on feature learning in neural networks.
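
As an illustration of the single-direction, $n = O(d)$ regime, here is a minimal numpy sketch (not the authors' code) of one large-batch gradient step on the first layer of a two-layer network with a single-index ReLU target. The learning-rate scaling, activation, width, and batch size are illustrative assumptions; multi-index or Hermite-hard targets would require the larger batch sizes discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 512, 64                  # input dimension and student width
n = 8 * d                       # batch size of order d: enough for a single target direction

# Single-index teacher y = relu(w_star . x), with a unit-norm target direction w_star.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

# Student: f(x) = (1/p) * sum_j a_j * tanh(w_j . x / sqrt(d)), second layer a frozen.
W = rng.standard_normal((p, d))
a = rng.choice([-1.0, 1.0], size=p)

X = rng.standard_normal((n, d))
y = np.maximum(X @ w_star, 0.0)

pre = X @ W.T / np.sqrt(d)                       # (n, p) pre-activations
err = y - np.tanh(pre) @ a / p                   # residuals at initialization

# One full-batch gradient step on the first layer (squared loss), with a large ("giant") rate.
grad_W = -(a[:, None] / p) * ((err[:, None] * (1 - np.tanh(pre) ** 2)).T @ X) / (n * np.sqrt(d))
eta = 5 * p * d                                  # illustrative giant-step scaling
W_new = W - eta * grad_W

overlap = lambda M: np.mean(np.abs(M @ w_star) / np.linalg.norm(M, axis=1))
print("mean |cos(w_j, w_star)| before:", overlap(W))      # ~ 1/sqrt(d) at initialization
print("mean |cos(w_j, w_star)| after :", overlap(W_new))  # markedly larger after the giant step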

Escaping mediocrity: how two-layer networks learn hard single-index models with SGD

May 29, 2023
Luca Arnaboldi, Florent Krzakala, Bruno Loureiro, Ludovic Stephan

This study explores the sample complexity for two-layer neural networks to learn a single-index target function under Stochastic Gradient Descent (SGD), focusing on the challenging regime where many flat directions are present at initialization. It is well-established that in this scenario $n=O(d\log{d})$ samples are typically needed. However, we provide precise results concerning the pre-factors in high-dimensional contexts and for varying widths. Notably, our findings suggest that overparameterization can only enhance convergence by a constant factor within this problem class. These insights are grounded in the reduction of SGD dynamics to a stochastic process in lower dimensions, where escaping mediocrity equates to calculating an exit time. Yet, we demonstrate that a deterministic approximation of this process adequately represents the escape time, implying that the role of stochasticity may be minimal in this scenario.
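
The exit-time picture can be caricatured with a one-dimensional toy process (not the paper's exact reduced dynamics): an overlap starting at $1/\sqrt{d}$ with a weakly unstable drift and noise of scale $\sqrt{\gamma/d}$, whose escape time is compared with that of its deterministic approximation. The drift, noise scale, and threshold below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 1000                    # input dimension: sets the initial overlap ~ 1/sqrt(d)
gamma = 0.05                # effective step size of the reduced process (illustrative)
m_stop = 0.5                # overlap threshold defining "escape from mediocrity"

def exit_time(noisy: bool) -> int:
    """Steps for the overlap m_t to reach m_stop, with or without the noise term."""
    m, t = 1.0 / np.sqrt(d), 0
    while abs(m) < m_stop:
        drift = gamma * (m - m ** 3)                      # toy drift, unstable near m = 0
        noise = np.sqrt(gamma / d) * rng.standard_normal() if noisy else 0.0
        m, t = m + drift + noise, t + 1
    return t

print("stochastic exit time   :", exit_time(True))
print("deterministic exit time:", exit_time(False))
# The two times are comparable, echoing the claim that stochasticity plays a minor role here.
```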

Expectation consistency for calibration of neural networks

Mar 05, 2023
Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Despite their incredible performance, it is well reported that deep neural networks tend to be overoptimistic about their prediction confidence. Finding effective and efficient calibration methods for neural networks is therefore an important endeavour towards better uncertainty quantification in deep learning. In this manuscript, we introduce a novel calibration technique named expectation consistency (EC), consisting of a post-training rescaling of the last-layer weights that enforces that the average validation confidence coincides with the average proportion of correct labels. First, we show that the EC method achieves calibration performance similar to temperature scaling (TS) across different neural network architectures and data sets, while requiring a comparable amount of validation data and computational resources. However, we argue that EC provides a principled method grounded in a Bayesian optimality principle known as the Nishimori identity. Next, we provide an asymptotic characterization of both TS and EC in a synthetic setting and show that their performance crucially depends on the target function. In particular, we discuss examples where EC significantly outperforms TS.
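
A minimal sketch of the EC idea as stated above, assuming only a matrix of validation logits and labels: find a single scale $s$ for the logits (equivalently, a rescaling of the last-layer weights and biases) such that the mean confidence matches the validation accuracy. The bisection routine and the synthetic logits are illustrative, not the authors' reference implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expectation_consistency_scale(logits, labels, lo=1e-3, hi=1e3, iters=60):
    """Find a scalar s such that the mean confidence max_k softmax(s * logits)_k on the
    validation set matches the validation accuracy (sketch of the EC criterion)."""
    acc = np.mean(logits.argmax(axis=1) == labels)
    gap = lambda s: np.mean(softmax(s * logits).max(axis=1)) - acc
    # mean confidence is increasing in s, so a simple log-scale bisection suffices
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if gap(mid) < 0 else (lo, mid)
    return np.sqrt(lo * hi)

# toy usage on synthetic validation logits
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=2000)
logits = rng.standard_normal((2000, 10)) + 3.0 * np.eye(10)[labels] * (rng.random((2000, 1)) < 0.7)
s = expectation_consistency_scale(logits, labels)
print("EC scale:", s,
      "| accuracy:", np.mean(logits.argmax(1) == labels),
      "| mean confidence after rescaling:", softmax(s * logits).max(1).mean())
```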

Universality laws for Gaussian mixtures in generalized linear models

Feb 17, 2023
Yatin Dandi, Ludovic Stephan, Florent Krzakala, Bruno Loureiro, Lenka Zdeborová

Let $(x_{i}, y_{i})_{i=1,\dots,n}$ denote independent samples from a general mixture distribution $\sum_{c\in\mathcal{C}}\rho_{c}P_{c}^{x}$, and consider the hypothesis class of generalized linear models $\hat{y} = F(\Theta^{\top}x)$. In this work, we investigate the asymptotic joint statistics of the family of generalized linear estimators $(\Theta_{1}, \dots, \Theta_{M})$ obtained either from (a) minimizing an empirical risk $\hat{R}_{n}(\Theta;X,y)$ or (b) sampling from the associated Gibbs measure $\exp(-\beta n \hat{R}_{n}(\Theta;X,y))$. Our main contribution is to characterize under which conditions the asymptotic joint statistics of this family depend (in a weak sense) only on the means and covariances of the class-conditional feature distributions $P_{c}^{x}$. In particular, this allows us to prove the universality of different quantities of interest, such as the training and generalization errors, redeeming a recent line of work in high-dimensional statistics that works under the Gaussian mixture hypothesis. Finally, we discuss the applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty quantification.
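
A quick numerical illustration in the spirit of this universality statement (not the theorem's exact conditions): ridge-regularized logistic regression trained on a two-cluster mixture with non-Gaussian (uniform) cluster features should give training and test errors close to those obtained on the moment-matched Gaussian mixture. The dimensions, signal strength, and regularization below are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 400, 2000

def sample(n, gaussian: bool):
    """Two balanced clusters with means +/- mu, identity covariance, label = cluster index.
    The non-Gaussian version uses uniform entries with the same mean and covariance."""
    y = rng.integers(0, 2, size=n)
    mu = np.ones(d) / np.sqrt(d)
    noise = (rng.standard_normal((n, d)) if gaussian
             else rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, d)))
    return (2 * y[:, None] - 1) * mu + noise, y

for gaussian in (False, True):
    Xtr, ytr = sample(n, gaussian)
    Xte, yte = sample(10 * n, gaussian)
    clf = LogisticRegression(C=1.0, max_iter=1000).fit(Xtr, ytr)
    print("gaussian" if gaussian else "uniform ",
          "train err: %.3f" % (1 - clf.score(Xtr, ytr)),
          "test err: %.3f" % (1 - clf.score(Xte, yte)))
```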

Are Gaussian data all you need? Extents and limits of universality in high-dimensional generalized linear estimation

Feb 17, 2023
Luca Pesce, Florent Krzakala, Bruno Loureiro, Ludovic Stephan

In this manuscript we consider the problem of generalized linear estimation on Gaussian mixture data with labels given by a single-index model. Our first result is a sharp asymptotic expression for the test and training errors in the high-dimensional regime. Motivated by the recent stream of results on the Gaussian universality of the test and training errors in generalized linear estimation, we ask ourselves the question: "when is a single Gaussian enough to characterize the error?". Our formula allows us to give sharp answers to this question, both in the positive and negative directions. More precisely, we show that the sufficient conditions for Gaussian universality (or lack thereof) crucially depend on the alignment between the target weights and the means and covariances of the mixture clusters, which we precisely quantify. In the particular case of least-squares interpolation, we prove a strong universality property of the training error and show that it follows a simple, closed-form expression. Finally, we apply our results to real datasets, clarifying some recent discussions in the literature about Gaussian universality of the errors in this context.

From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks

Feb 12, 2023
Luca Arnaboldi, Ludovic Stephan, Florent Krzakala, Bruno Loureiro

This manuscript investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network trained on Gaussian data and labels generated by a similar, though not necessarily identical, target function. We rigorously analyse the limiting dynamics via a deterministic and low-dimensional description in terms of the sufficient statistics for the population risk. Our unifying analysis bridges different regimes of interest, such as the classical gradient-flow regime of vanishing learning rate, the high-dimensional regime of large input dimension, and the overparameterised "mean-field" regime of large network width, covering as well the intermediate regimes where the limiting dynamics is determined by the interplay between these behaviours. In particular, in the high-dimensional limit, the infinite-width dynamics is found to remain close to a low-dimensional subspace spanned by the target principal directions. Our results therefore provide a unifying picture of the limiting SGD dynamics with synthetic data.
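
A minimal sketch of the sufficient-statistics viewpoint, assuming the usual teacher-student setup on Gaussian data: run one-pass SGD on the first layer and track the overlap matrices $M = W W_*^{\top}/d$ and $Q = W W^{\top}/d$, the low-dimensional quantities whose limiting dynamics this line of work characterizes. The learning rate, widths, and activation below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, k = 500, 4, 2            # input dim, student width, teacher width
eta, steps = 0.5, 20 * d       # illustrative one-pass SGD schedule (~O(d) samples)

g = np.tanh                    # activation for both student and teacher
W_star = rng.standard_normal((k, d))          # teacher first layer
a_star = np.ones(k) / k
W = rng.standard_normal((p, d))               # student first layer (second layer frozen here)
a = np.ones(p) / p

for t in range(steps):
    x = rng.standard_normal(d)
    y = a_star @ g(W_star @ x / np.sqrt(d))                 # teacher label
    pre = W @ x / np.sqrt(d)
    err = y - a @ g(pre)
    # SGD step on the first layer for the squared loss (g' = 1 - tanh^2)
    W += eta * err * (a * (1 - g(pre) ** 2))[:, None] * x[None, :] / np.sqrt(d)
    if t % (5 * d) == 0:
        M = W @ W_star.T / d                                # student-teacher overlaps
        Q = W @ W.T / d                                     # student-student overlaps
        print(f"t={t:6d}  max|M|={np.abs(M).max():.3f}  tr(Q)/p={np.trace(Q)/p:.3f}")
```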

Deterministic equivalent and error universality of deep random features learning

Feb 01, 2023
Dominik Schröder, Hugo Cui, Daniil Dmitriev, Bruno Loureiro

This manuscript considers the problem of learning a random Gaussian network function using a fully connected network with frozen intermediate layers and a trainable readout layer. This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures. First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers, and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep random features sample covariance matrices, which can be of independent interest. Second, we conjecture the asymptotic Gaussian universality of the test error in the more general setting of arbitrary convex losses and generic learner/target architectures. We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay between architecture design and implicit regularization.
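
A small sketch of the shared-intermediate-layers ridge setting described above, assuming a tanh deep random feature map and a noisy random readout as the target; the widths, noise level, and ridge penalty are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, widths, n, n_test, lam = 300, [300, 300], 1200, 4000, 1e-2

# Frozen random intermediate layers, shared by learner and target.
Ws = [rng.standard_normal((w_out, w_in)) / np.sqrt(w_in)
      for w_in, w_out in zip([d] + widths[:-1], widths)]

def features(X):
    """Deep random feature map: alternate frozen random layers and tanh."""
    H = X
    for W in Ws:
        H = np.tanh(H @ W.T)
    return H

theta_star = rng.standard_normal(widths[-1]) / np.sqrt(widths[-1])   # target readout

def make_data(n):
    X = rng.standard_normal((n, d))
    Phi = features(X)
    y = Phi @ theta_star + 0.1 * rng.standard_normal(n)               # noisy target output
    return Phi, y

Phi, y = make_data(n)
Phi_t, y_t = make_data(n_test)

# Trainable readout: ridge regression on the last-layer post-activations.
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(widths[-1]), Phi.T @ y)
print("test mse:", np.mean((Phi_t @ theta - y_t) ** 2))
```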

A study of uncertainty quantification in overparametrized high-dimensional models

Oct 23, 2022
Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better-calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double-descent-like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization.
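
A small experiment in the spirit of this comparison (with arbitrary, non-optimal regularization): measure the gap between the mean last-layer confidence and the test accuracy of a random-features logistic classifier as the number of features grows past the sample size. The data model, widths, and noise level are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n, n_test = 200, 600, 5000
w_teacher = rng.standard_normal(d) / np.sqrt(d)

def data(n):
    X = rng.standard_normal((n, d))
    y = (X @ w_teacher + 0.5 * rng.standard_normal(n) > 0).astype(int)   # noisy linear teacher
    return X, y

Xtr, ytr = data(n)
Xte, yte = data(n_test)

for p in (100, 400, 1600):                        # random-features width, from under- to over-parametrized
    F = rng.standard_normal((d, p)) / np.sqrt(d)  # frozen random first layer
    Ptr, Pte = np.tanh(Xtr @ F), np.tanh(Xte @ F)
    clf = LogisticRegression(C=1.0, max_iter=2000).fit(Ptr, ytr)
    conf = clf.predict_proba(Pte).max(axis=1)     # last-layer confidence scores
    acc = clf.score(Pte, yte)
    print(f"p={p:5d}  test acc={acc:.3f}  mean confidence={conf.mean():.3f}  gap={conf.mean() - acc:+.3f}")
```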

Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap

May 26, 2022
Luca Pesce, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $\rho$, as well as the ratio $\alpha$ between the number of samples and the dimension, are fixed while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm, analyzed via its state evolution, which is conjectured to be optimal among polynomial-time algorithms for this task. In particular, we identify the existence of a statistical-to-computational gap between algorithms, which require a signal-to-noise ratio $\lambda_{\text{alg}} \ge k / \sqrt{\alpha}$ to perform better than random, and the information-theoretic threshold at $\lambda_{\text{it}} \approx \sqrt{-k \rho \log{\rho}} / \sqrt{\alpha}$. Finally, we discuss the case of sub-extensive sparsity $\rho$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse PCA and diagonal thresholding.
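
A short sketch of the data model together with the diagonal-thresholding baseline mentioned at the end of the abstract (AMP itself is omitted); the sparsity, sample ratio, and signal strength below are illustrative and do not follow the paper's exact normalization of $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, alpha, rho, snr = 2000, 2, 0.5, 0.01, 4.0
n = int(alpha * d)

# Sparse cluster means: only a fraction rho of the coordinates carries signal.
support = rng.choice(d, size=int(rho * d), replace=False)
means = np.zeros((k, d))
means[:, support] = snr * rng.standard_normal((k, len(support))) / np.sqrt(rho * d)

labels = rng.integers(0, k, size=n)
X = means[labels] + rng.standard_normal((n, d))

# Diagonal thresholding: informative coordinates have an inflated empirical second moment.
second_moment = (X ** 2).mean(axis=0)
selected = np.argsort(second_moment)[-len(support):]          # keep the top |support| coordinates
recovered = np.intersect1d(selected, support)
print(f"support recovery: {len(recovered)}/{len(support)} coordinates")
```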

Gaussian Universality of Linear Classifiers with Random Labels in High-Dimension

May 26, 2022
Federica Gerace, Florent Krzakala, Bruno Loureiro, Ludovic Stephan, Lenka Zdeborová

While classical in many theoretical settings, the assumption of Gaussian i.i.d. inputs is often perceived as a strong limitation in the analysis of high-dimensional learning. In this study, we redeem this line of work in the case of generalized linear classification with random labels. Our main contribution is a rigorous proof that data coming from a range of generative models in high dimensions have the same minimum training loss as Gaussian data with the corresponding data covariance. In particular, our theorem covers data created by an arbitrary mixture of homogeneous Gaussian clouds, as well as multi-modal generative neural networks. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. Finally, we show that this universality property is observed in practice with real datasets and random labels.
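
A rough numerical check in the spirit of this result (not under the theorem's exact assumptions): the minimum ridge-regularized square loss on random labels for data from a simple one-layer generative network, compared with Gaussian data whose covariance is matched empirically. The dimensions, generator, and regularization are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 400, 800, 1e-4                     # n > d so the minimum training loss is nonzero

A = rng.standard_normal((d, d)) / np.sqrt(d)   # one-layer generative network x = tanh(A z)

def gen_data(n):                               # non-Gaussian generative model
    return np.tanh(rng.standard_normal((n, d)) @ A.T)

# Gaussian data with covariance matched to the generative model (estimated empirically).
cov = np.cov(gen_data(20000), rowvar=False)
L = np.linalg.cholesky(cov + 1e-8 * np.eye(d))
def gauss_data(n):
    return rng.standard_normal((n, d)) @ L.T

y = rng.choice([-1.0, 1.0], size=n)            # random labels, shared by both datasets

def min_train_loss(X):
    """Minimum of (1/n)||y - Xw||^2 + lam ||w||^2 over w, evaluated at its minimizer."""
    w = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
    return np.mean((y - X @ w) ** 2) + lam * np.sum(w ** 2)

print("generative model:", min_train_loss(gen_data(n)))
print("matched Gaussian:", min_train_loss(gauss_data(n)))
```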
