Jeffrey Pennington

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

Aug 15, 2020

Ben Adlam, Jeffrey Pennington

Abstract: Modern deep learning models employ considerably more parameters than required to fit the training data. Whereas conventional statistical wisdom suggests such models should drastically overfit, in practice these models generalize remarkably well. An emerging paradigm for describing this unexpected behavior is in terms of a double descent curve, in which increasing a model's capacity causes its test error to first decrease, then increase to a maximum near the interpolation threshold, and then decrease again in the overparameterized regime. Recent efforts to explain this phenomenon theoretically have focused on simple settings, such as linear regression or kernel regression with unstructured random features, which we argue are too coarse to reveal important nuances of actual neural networks. We provide a precise high-dimensional asymptotic analysis of generalization under kernel regression with the Neural Tangent Kernel, which characterizes the behavior of wide neural networks optimized with gradient descent. Our results reveal that the test error has non-monotonic behavior deep in the overparameterized regime and can even exhibit additional peaks and descents when the number of parameters scales quadratically with the dataset size.

* Published as a conference paper in the Proceedings of the 37th International Conference on Machine Learning; 31 pages; 4 figures

The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks

Jun 25, 2020

Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington

Abstract: Modern neural networks are often regarded as complex black-box functions whose behavior is difficult to understand owing to their nonlinear dependence on the data and the nonconvexity in their loss landscapes. In this work, we show that these common perceptions can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs. We additionally argue that this surprising simplicity can persist in networks with more layers and with convolutional architecture, which we verify empirically. Key to our analysis is to bound the spectral norm of the difference between the Neural Tangent Kernel (NTK) at initialization and an affine transform of the data kernel; however, unlike many previous results utilizing the NTK, we do not require the network to have disproportionately large width, and the network is allowed to escape the kernel regime later in training.

Exact posterior distributions of wide Bayesian neural networks

Jun 18, 2020

Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations. We provide the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling.
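
The rejection-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two-point dataset, the one-hidden-layer tanh network, the standard normal prior, the Gaussian likelihood with noise level `sigma`, and the choice of the prior as the proposal. With the prior as proposal, accepting a draw with probability equal to the likelihood divided by its maximum value yields exact posterior samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: two 1-D inputs with scalar targets.
X = np.array([[-1.0], [1.0]])
y = np.array([-0.5, 0.5])
sigma = 0.5                 # assumed observation-noise standard deviation
width = 8                   # hidden width of the toy network
n_proposals = 20000

def sample_prior_outputs(n):
    """Draw n networks from the prior and return their outputs on X, shape (n, len(X))."""
    W1 = rng.standard_normal((n, width, X.shape[1]))
    b1 = rng.standard_normal((n, width))
    w2 = rng.standard_normal((n, width)) / np.sqrt(width)
    hidden = np.tanh(np.einsum("nwd,md->nmw", W1, X) + b1[:, None, :])
    return np.einsum("nmw,nw->nm", hidden, w2)

outputs = sample_prior_outputs(n_proposals)
sq_res = np.sum((outputs - y) ** 2, axis=1)

# Gaussian likelihood divided by its maximum (attained at zero residual),
# so the ratio is a valid acceptance probability in [0, 1].
accept_prob = np.exp(-sq_res / (2.0 * sigma**2))
accepted = rng.uniform(size=n_proposals) < accept_prob

posterior_outputs = outputs[accepted]   # exact posterior samples of the outputs on X
print("acceptance rate:", accepted.mean())
```

The acceptance rate falls quickly as the dataset grows, which is consistent with the abstract restricting this validation to a small dataset.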

Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

Jan 16, 2020

Wei Hu, Lechao Xiao, Jeffrey Pennington

Abstract: The selection of initial parameter values for gradient-based optimization of deep neural networks is one of the most impactful hyperparameter choices in deep learning systems, affecting both convergence times and model performance. Yet despite significant empirical and theoretical analysis, relatively little has been proved about the concrete effects of different initialization schemes. In this work, we analyze the effect of initialization in deep linear networks, and provide for the first time a rigorous proof that drawing the initial weights from the orthogonal group speeds up convergence relative to the standard Gaussian initialization with iid weights. We show that for deep networks, the width needed for efficient convergence to a global minimum with orthogonal initializations is independent of the depth, whereas the width needed for efficient convergence with Gaussian initializations scales linearly in the depth. Our results demonstrate how the benefits of a good initialization can persist throughout learning, suggesting an explanation for the recent empirical successes found by initializing very deep non-linear networks according to the principle of dynamical isometry.

* International Conference on Learning Representations (ICLR) 2020
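
As a rough illustration of why orthogonal initialization is attractive for deep linear networks, the sketch below compares the singular values of a product of orthogonal matrices with those of a product of scaled Gaussian matrices. It is not the paper's analysis: the depth, width, and the QR-based sampler are assumptions chosen for the demonstration, and it looks only at the end-to-end map at initialization, not at convergence rates.

```python
import numpy as np

def orthogonal(n, rng):
    """Sample an n x n orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Multiply each column by the sign of the matching diagonal entry of r
    # so the result is uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))

def gaussian(n, rng):
    """Sample an n x n matrix with iid N(0, 1/n) entries."""
    return rng.standard_normal((n, n)) / np.sqrt(n)

def product_singular_values(sampler, depth, n, rng):
    """Singular values of the end-to-end map of a deep linear network at initialization."""
    prod = np.eye(n)
    for _ in range(depth):
        prod = sampler(n, rng) @ prod
    return np.linalg.svd(prod, compute_uv=False)

rng = np.random.default_rng(0)
sv_orth = product_singular_values(orthogonal, depth=20, n=64, rng=rng)
sv_gauss = product_singular_values(gaussian, depth=20, n=64, rng=rng)

print("orthogonal: min/max singular value", sv_orth.min(), sv_orth.max())
print("gaussian:   min/max singular value", sv_gauss.min(), sv_gauss.max())
```

The orthogonal product keeps every singular value at one regardless of depth, while the Gaussian product becomes badly conditioned as depth grows.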

Disentangling trainability and generalization in deep learning

Dec 30, 2019

Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

Abstract: A fundamental goal in deep learning is the characterization of trainability and generalization of neural networks as a function of their architecture and hyperparameters. In this paper, we discuss these challenging issues in the context of wide neural networks at large depths where we will see that the situation simplifies considerably. To do this, we leverage recent advances that have separately shown: (1) that in the wide network limit, random networks before training are Gaussian Processes governed by a kernel known as the Neural Network Gaussian Process (NNGP) kernel, (2) that at large depths the spectrum of the NNGP kernel simplifies considerably and becomes "weakly data-dependent" and (3) that gradient descent training of wide neural networks is described by a kernel called the Neural Tangent Kernel (NTK) that is related to the NNGP. Here we show that in the large depth limit the spectrum of the NTK simplifies in much the same way as that of the NNGP kernel. By analyzing this spectrum, we arrive at a precise characterization of trainability and a necessary condition for generalization across a range of architectures including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). In particular, we find that there are large regions of hyperparameter space where networks can only memorize the training set in the sense that they reach perfect training accuracy but completely fail to generalize outside the training set, in contrast with several recent results. By comparing CNNs with and without global average pooling, we show that CNNs without average pooling have very nearly identical learning dynamics to FCNs while CNNs with pooling contain a correction that alters their generalization performance. We perform a thorough empirical investigation of these theoretical results and find excellent agreement on real datasets.

* 22 pages, 3 figures

A Random Matrix Perspective on Mixtures of Nonlinearities for Deep Learning

Dec 02, 2019

Ben Adlam, Jake Levinson, Jeffrey Pennington

Abstract: One of the distinguishing characteristics of modern deep learning systems is that they typically employ neural network architectures that utilize enormous numbers of parameters, often in the millions and sometimes even in the billions. While this paradigm has inspired significant research on the properties of large networks, relatively little work has been devoted to the fact that these networks are often used to model large complex datasets, which may themselves contain millions or even billions of constraints. In this work, we focus on this high-dimensional regime in which both the dataset size and the number of features tend to infinity. We analyze the performance of a simple regression model trained on the random features $F=f(WX+B)$ for a random weight matrix $W$ and random bias vector $B$, obtaining an exact formula for the asymptotic training error on a noisy autoencoding task. The role of the bias can be understood as parameterizing a distribution over activation functions, and our analysis directly generalizes to such distributions, even those not expressible with a traditional additive bias. Intriguingly, we find that a mixture of nonlinearities can outperform the best single nonlinearity on the noisy autoencoding task, suggesting that mixtures of nonlinearities might be useful for approximate kernel methods or neural network architecture design.
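
The random-feature model $F=f(WX+B)$ from the abstract can be sketched numerically as follows. The abstract does not spell out the details, so several choices here are assumptions: ridge regression as the "simple regression model", `tanh` as the nonlinearity, the scaling of `W`, the noise level, and the reading of "noisy autoencoding" as reconstructing clean inputs from a noise-corrupted copy. The sketch only measures empirical training error and does not reproduce the paper's asymptotic formula.

```python
import numpy as np

def random_features(X, n_features, f, rng):
    """Compute F = f(W X + B), where the columns of X are data points."""
    d = X.shape[0]
    W = rng.standard_normal((n_features, d)) / np.sqrt(d)  # random weight matrix
    B = rng.standard_normal((n_features, 1))               # one random bias per feature
    return f(W @ X + B)

def ridge_training_error(F, Y, reg):
    """Training mean squared error of argmin_A ||Y - A F||^2 + reg * ||A||^2."""
    n = F.shape[0]
    A_T = np.linalg.solve(F @ F.T + reg * np.eye(n), F @ Y.T)
    return float(np.mean((Y - A_T.T @ F) ** 2))

rng = np.random.default_rng(0)
d, m, n = 20, 200, 100                  # input dimension, dataset size, feature count
X_clean = rng.standard_normal((d, m))
X_noisy = X_clean + 0.3 * rng.standard_normal((d, m))

F = random_features(X_noisy, n, np.tanh, rng)
err_small = ridge_training_error(F, X_clean, reg=1e-3)
err_large = ridge_training_error(F, X_clean, reg=10.0)
print("training error (reg=1e-3):", err_small)
print("training error (reg=10):  ", err_large)
```

Swapping `np.tanh` for another function, or drawing a different function per feature, is one way to experiment with the mixtures of nonlinearities the abstract describes.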

A Mean Field Theory of Batch Normalization

Mar 05, 2019

Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range. Our theory leverages Laplace, Fourier, and Gegenbauer transforms and we derive new identities that may be of independent interest.

* To appear in ICLR 2019

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Feb 18, 2019

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, Jeffrey Pennington

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.

* 10+8 pages, 13 figures
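
The linear model referred to in the abstract is the first-order Taylor expansion of the network in its parameters, f_lin(x; θ) = f(x; θ0) + ∇θ f(x; θ0) · (θ − θ0). The sketch below builds this expansion for a small two-layer tanh network and checks how its error behaves as the parameters move away from θ0. The architecture, the 1/sqrt(width) output scaling, and the random perturbation direction are assumptions made for the illustration, and the gradient is written out by hand rather than taken from any library.

```python
import numpy as np

def f(params, x):
    """Two-layer network f(x) = a . tanh(W x / sqrt(d)) / sqrt(n)."""
    W, a = params
    n, d = W.shape
    return a @ np.tanh(W @ x / np.sqrt(d)) / np.sqrt(n)

def grad_f(params, x):
    """Gradient of f with respect to (W, a), derived by hand for this network."""
    W, a = params
    n, d = W.shape
    h = np.tanh(W @ x / np.sqrt(d))
    grad_W = np.outer(a * (1.0 - h**2), x) / np.sqrt(n * d)
    grad_a = h / np.sqrt(n)
    return grad_W, grad_a

def f_lin(params, params0, x):
    """First-order Taylor expansion of f around params0, evaluated at params."""
    grad_W, grad_a = grad_f(params0, x)
    delta_W = params[0] - params0[0]
    delta_a = params[1] - params0[1]
    return f(params0, x) + np.sum(grad_W * delta_W) + grad_a @ delta_a

rng = np.random.default_rng(0)
n, d = 512, 10
params0 = (rng.standard_normal((n, d)), rng.standard_normal(n))
direction = (rng.standard_normal((n, d)), rng.standard_normal(n))
x = rng.standard_normal(d)

def linearization_error(eps):
    """|f - f_lin| after moving the parameters by eps along a fixed random direction."""
    params = tuple(p0 + eps * dp for p0, dp in zip(params0, direction))
    return abs(f(params, x) - f_lin(params, params0, x))

for eps in (0.1, 0.01):
    print(f"eps={eps}: linearization error {linearization_error(eps):.3e}")
```

Because the expansion is first order, the error should shrink roughly quadratically with the size of the parameter change. This small-displacement check does not by itself demonstrate the infinite-width result stated in the abstract.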

Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

Jan 25, 2019

Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

Abstract: Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and the GRU, do exhibit modest improvements over vanilla RNN cells, but they still suffer from instabilities when trained on very long sequences. In this work, we develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians. By optimizing these quantities in terms of the initialization hyperparameters, we derive a novel initialization scheme that eliminates or reduces training instabilities. We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower. We also observe a beneficial effect on generalization performance using this new initialization.

Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes

Oct 11, 2018

Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state-of-the-art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance in finite-channel CNNs trained with stochastic gradient descent (SGD) has no corresponding property in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.

* 26 pages, 7 figures
