Guy Gur-Ari

Are wider nets better given the same number of parameters?

Oct 27, 2020
Anna Golubeva, Behnam Neyshabur, Guy Gur-Ari

Empirical studies demonstrate that the performance of neural networks improves as the number of parameters increases. In most of these studies, the number of parameters is increased by increasing the network width. This raises the question: Is the observed improvement due to the larger number of parameters, or is it due to the larger width itself? We compare different ways of increasing model width while keeping the number of parameters constant. We show that for models initialized with a random, static sparsity pattern in the weight tensors, network width is the determining factor for good performance, while the number of weights is secondary, as long as trainability is ensured. As a step towards understanding this effect, we analyze these models in the framework of Gaussian Process kernels. We find that the distance between the sparse finite-width model kernel and the infinite-width kernel at initialization is indicative of model performance.

* 9 pages
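
To make the setup in this abstract concrete, here is a minimal NumPy sketch of widening a layer while holding the number of trainable weights fixed through a random, static mask. It is only an illustration under stated assumptions, not the authors' code: the helper `sparse_mask`, the base width of 64, and the widening factors are all hypothetical choices.

```python
import numpy as np

def sparse_mask(fan_in, fan_out, n_weights, rng):
    """Random boolean mask with exactly n_weights True entries (hypothetical helper)."""
    total = fan_in * fan_out
    if n_weights > total:
        raise ValueError("layer too small for the requested weight count")
    flat = np.zeros(total, dtype=bool)
    # Choose the surviving weight positions once; the mask never changes afterwards.
    flat[rng.choice(total, size=n_weights, replace=False)] = True
    return flat.reshape(fan_in, fan_out)

rng = np.random.default_rng(0)
base_width = 64                      # assumed width of the dense baseline layer
n_weights = base_width * base_width  # 4096 weights, held fixed for every width

masks = {}
for widen in (1, 2, 4):
    width = base_width * widen
    masks[width] = sparse_mask(width, width, n_weights, rng)
    # Density falls as 1 / widen**2 while the weight count stays the same.
    print(width, int(masks[width].sum()), masks[width].mean())
```

In training, such a mask would typically be multiplied into the weight matrix at initialization and again after every update, so that masked entries stay at zero.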

On the training dynamics of deep networks with $L_2$ regularization

Jun 15, 2020
Aitor Lewkowycz, Guy Gur-Ari

We study the role of $L_2$ regularization in deep learning, and uncover simple relations between the performance of the model, the $L_2$ coefficient, the learning rate, and the number of training steps. These empirical relations hold when the network is overparameterized. They can be used to predict the optimal regularization parameter of a given model. In addition, based on these observations we propose a dynamical schedule for the regularization parameter that improves performance and speeds up training. We test these proposals in modern image classification settings. Finally, we show that these empirical relations can be understood theoretically in the context of infinitely wide networks. We derive the gradient flow dynamics of such networks, and compare the role of $L_2$ regularization in this context with that of linear models.

* 10+12 pages, 5+10 figures

On the asymptotics of wide networks with polynomial activations

Jun 11, 2020
Kyle Aitken, Guy Gur-Ari

We consider an existing conjecture addressing the asymptotic behavior of neural networks in the large width limit. The results that follow from this conjecture include tight bounds on the behavior of wide networks during stochastic gradient descent, and a derivation of their finite-width dynamics. We prove the conjecture for deep networks with polynomial activation functions, greatly extending the validity of these results. Finally, we point out a difference in the asymptotic behavior of networks with analytic (and non-linear) activation functions and those with piecewise-linear activations such as ReLU.

* 8+12 pages, 6 figures, 2 tables

The large learning rate phase of deep learning: the catapult mechanism

Mar 04, 2020
Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory and the nonlinear, large-learning-rate training dynamics relevant to practice.

* 25 pages, 19 figures
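
As background for the learning-rate regimes this abstract mentions, the sketch below shows the textbook stability threshold for gradient descent on a one-dimensional quadratic loss. This is explicitly not the paper's solvable model: it is a stand-in linear example, and `gd_on_quadratic`, the curvature value, and the step count are assumptions made for illustration. On a quadratic with curvature c, each step multiplies the weight by (1 - lr * c), so the iterates shrink only when lr < 2 / c.

```python
def gd_on_quadratic(curvature, lr, w0=1.0, steps=50):
    """Run gradient descent on L(w) = 0.5 * curvature * w**2 and return |w| at each step."""
    w = w0
    trajectory = [abs(w)]
    for _ in range(steps):
        w = w - lr * curvature * w  # the gradient of L is curvature * w
        trajectory.append(abs(w))
    return trajectory

curvature = 4.0              # assumed curvature of the toy loss
threshold = 2.0 / curvature  # stability threshold for this quadratic

small = gd_on_quadratic(curvature, lr=0.9 * threshold)  # contracts towards 0
large = gd_on_quadratic(curvature, lr=1.1 * threshold)  # grows without bound
print(small[-1], large[-1])
```

In this purely quadratic picture nothing trains above the threshold, which is why describing stable training at larger learning rates requires a nonlinear model such as the one the abstract refers to.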

Wider Networks Learn Better Features

Sep 25, 2019
Dar Gilboa, Guy Gur-Ari

Transferability of learned features between tasks can massively reduce the cost of training a neural network on a novel task. We investigate the effect of network width on learned features using activation atlases, a visualization technique that captures features the entire hidden state responds to, as opposed to individual neurons alone. We find that, while individual neurons do not learn interpretable features in wide networks, groups of neurons do. In addition, the hidden state of a wide network contains more information about the inputs than that of a narrow network trained to the same test accuracy. Inspired by this observation, we show that when fine-tuning the last layer of a network on a new task, performance improves significantly as the width of the network is increased, even though test accuracy on the original task is independent of width.

Asymptotics of Wide Networks from Feynman Diagrams

Sep 25, 2019
Ethan Dyer, Guy Gur-Ari

Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on wide network evolution during stochastic gradient descent. Going beyond the strict large width limit, we present closed-form expressions for higher-order terms governing wide network training, and test these predictions empirically.

* 10 pages, 3 figures, 1 table + appendices
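
The abstract describes Feynman diagrams as a standard tool for multivariate Gaussian integrals. The identity behind that use is Wick's (Isserlis') theorem: a moment of a zero-mean Gaussian equals the sum, over all ways of pairing up the factors, of the product of the paired covariances, and each pairing can be drawn as a diagram. The sketch below implements only that identity; it is not the paper's method for networks, and the function names are hypothetical.

```python
import numpy as np

def pairings(indices):
    """Yield every way of splitting an even-length list into unordered pairs."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail

def gaussian_moment(cov, indices):
    """E[x_i1 * ... * x_in] for a zero-mean Gaussian x with covariance matrix cov."""
    if len(indices) % 2 == 1:
        return 0.0  # odd moments of a zero-mean Gaussian vanish
    # One term per pairing ("diagram"): the product of covariances of the paired factors.
    return float(sum(
        np.prod([cov[a, b] for a, b in pairing])
        for pairing in pairings(list(indices))
    ))

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
# E[x0^2 * x1^2] = C00 * C11 + 2 * C01^2
print(gaussian_moment(cov, [0, 0, 1, 1]))
```

The number of pairings grows as (n - 1)!! with the number of factors n, so this brute-force enumeration is only practical for low-order moments.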

Gradient Descent Happens in a Tiny Subspace

Dec 12, 2018
Guy Gur-Ari, Daniel A. Roberts, Ethan Dyer

We show that in a variety of large-scale deep learning scenarios the gradient dynamically converges to a very small subspace after a short period of training. The subspace is spanned by the top eigenvectors of the Hessian (as many as there are classes in the dataset), and is mostly preserved over long periods of training. A simple argument then suggests that gradient descent may happen mostly in this subspace. We give an example of this effect in a solvable model of classification, and we comment on possible implications for optimization and learning.

* 9 pages + appendices, 12 figures
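
One way to quantify the claim in this abstract is to measure what fraction of the squared gradient norm lies in the span of the top k Hessian eigenvectors. The sketch below does this for an explicit, small Hessian. It is an assumption-laden illustration rather than the authors' procedure: the name `top_subspace_fraction` is hypothetical, "top" is taken to mean largest eigenvalue, and the diagonal toy Hessian stands in for a real one, which for a large network would be too big to form explicitly.

```python
import numpy as np

def top_subspace_fraction(grad, hessian, k):
    """Fraction of ||grad||^2 lying in the span of the k largest-eigenvalue Hessian eigenvectors."""
    _, eigvecs = np.linalg.eigh(hessian)  # eigh returns eigenvalues in ascending order
    top = eigvecs[:, -k:]                 # columns are the k top eigenvectors
    projected = top.T @ grad              # gradient coordinates inside the top subspace
    return float(projected @ projected / (grad @ grad))

# Toy example: two sharp directions and two nearly flat ones.
hessian = np.diag([10.0, 8.0, 0.1, 0.05])
aligned = np.array([3.0, 4.0, 0.0, 0.0])  # lies entirely in the top-2 subspace
mixed = np.array([1.0, 0.0, 1.0, 0.0])    # half in the top-2 subspace, half outside

print(top_subspace_fraction(aligned, hessian, k=2))
print(top_subspace_fraction(mixed, hessian, k=2))
```

A value close to 1, with k set to the number of classes, would correspond to the behavior the abstract reports.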
