Recent works report an intriguing phenomenon, the Frequency Principle (F-Principle): deep neural networks (DNNs) fit the target function from low to high frequency during training, which provides insight into the training and generalization behavior of DNNs in complex tasks. In this paper, through analysis of an infinite-width two-layer NN in the neural tangent kernel (NTK) regime, we derive the exact differential equation, namely the Linear Frequency-Principle (LFP) model, governing the evolution of the NN output function in the frequency domain during training. Our exact computation applies to general activation functions with no assumption on the size and distribution of the training data. This LFP model reveals that higher frequencies evolve polynomially or exponentially slower than lower frequencies, depending on the smoothness/regularity of the activation function. We further bridge the gap between training dynamics and generalization by proving that the LFP model implicitly minimizes a Frequency-Principle norm (FP-norm) of the learned function, by which higher frequencies are penalized more severely according to the inverse of their evolution rate. Finally, we derive an \textit{a priori} generalization error bound controlled by the FP-norm of the target function, which provides a theoretical justification for the empirical result that DNNs often generalize well for low-frequency functions.
How a neural network behaves during training over different choices of hyperparameters is an important question in the study of neural networks. However, except for specific examples with particular choices of hyperparameters, e.g., the neural tangent kernel (NTK) and the mean-field model, this question is largely unanswered. In this work, inspired by the phase diagram in statistical mechanics, we draw the phase diagram for the two-layer ReLU neural network at the infinite-width limit for a complete characterization of its dynamical regimes and their dependence on hyperparameters. Through both experimental and theoretical approaches, we identify three regimes in the phase diagram, i.e., the linear regime, the critical regime and the condensed regime, based on the relative change of input weights as the width approaches infinity, which tends to $0$, $O(1)$ and $+\infty$, respectively. In the linear regime, the NN training dynamics is approximately linear, similar to a random feature model, with an exponential loss decay. In the condensed regime, we demonstrate through experiments that active neurons are condensed at several discrete orientations. The critical regime serves as the boundary between the above two regimes and exhibits an intermediate nonlinear behavior, with the mean-field model as a typical example. Overall, our phase diagram for the two-layer ReLU NN serves as a map for future studies and is a first step towards a more systematic investigation of the training behavior and the implicit regularization of NNs of different structures.
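The linear-regime behavior described above can be illustrated with a random feature model, in which the hidden ReLU weights stay frozen and only the output weights are trained. The following is a minimal sketch under that assumption, not the experimental setup of the paper: the width, data, target and step-size rule are all hypothetical choices. Because the model is linear in the trained weights, each eigen-direction of the residual shrinks geometrically, so the recorded loss never increases.

```python
import math
import random


def relu_features(xs, ws, bs):
    """Fixed random ReLU features: phi_j(x) = max(0, w_j * x + b_j) / sqrt(m)."""
    m = len(ws)
    return [[max(0.0, w * x + b) / math.sqrt(m) for w, b in zip(ws, bs)]
            for x in xs]


def train_output_weights(phi, ys, steps=200):
    """Gradient descent on 0.5 * sum_i (phi_i . a - y_i)^2 over output weights a."""
    n, m = len(phi), len(phi[0])
    # trace(Phi^T Phi) upper-bounds the largest Hessian eigenvalue, so this
    # step size guarantees a non-increasing loss for the quadratic objective.
    trace = sum(v * v for row in phi for v in row)
    lr = 1.0 / trace
    a = [0.0] * m
    losses = []
    for _ in range(steps):
        residual = [sum(p * ai for p, ai in zip(row, a)) - y
                    for row, y in zip(phi, ys)]
        losses.append(0.5 * sum(r * r for r in residual))  # loss before the update
        grad = [sum(residual[i] * phi[i][j] for i in range(n)) for j in range(m)]
        a = [ai - lr * g for ai, g in zip(a, grad)]
    return a, losses


# Hypothetical toy problem: 10 samples of sin(pi * x) on [-1, 1], width 50.
rng = random.Random(0)
m = 50
ws = [rng.gauss(0, 1) for _ in range(m)]
bs = [rng.gauss(0, 1) for _ in range(m)]
xs = [-1.0 + 2.0 * i / 9 for i in range(10)]
ys = [math.sin(math.pi * x) for x in xs]
phi = relu_features(xs, ws, bs)
a, losses = train_output_weights(phi, ys)
```

Training the hidden weights as well, and tracking how far they move relative to their initial norm, would be the natural next step for probing the other two regimes; that is left out here to keep the dynamics exactly linear.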
We focus on estimating the \emph{a priori} generalization error of two-layer ReLU neural networks (NNs) trained with the mean squared error, an error that depends only on the initial parameters and the target function, through the following line of research. We first estimate the \emph{a priori} generalization error of a finite-width two-layer ReLU NN under the constraint of the minimal-norm solution, which is proved by \cite{zhang2019type} to be an equivalent solution of a linearized (w.r.t. parameters) finite-width two-layer NN. As the width goes to infinity, the linearized NN converges to the NN in the Neural Tangent Kernel (NTK) regime \citep{jacot2018neural}. Thus, we can derive the \emph{a priori} generalization error of a two-layer ReLU NN in the NTK regime. The distance between an NN in the NTK regime and a finite-width NN trained by gradient descent is estimated by \cite{arora2019exact}. Based on the results in \cite{arora2019exact}, our work proves an \emph{a priori} generalization error bound for two-layer ReLU NNs. This estimate uses the intrinsic implicit bias of the minimum-norm solution without requiring extra regularity in the loss function. This \emph{a priori} estimate also implies that the NN does not suffer from the curse of dimensionality, and that a small generalization error can be achieved without requiring an exponentially large number of neurons. In addition, the line of research proposed in this paper can also be used to study other properties of the finite-width network, such as the posterior generalization error.
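The minimal-norm solution mentioned above can be made concrete on a linear model, which stands in here for the linearized network. The sketch below is an illustration under assumptions that are not in the abstract: two training samples, eight parameters, and hypothetical helper names. It checks that gradient descent on the squared error, started from $\theta_0$, converges to the interpolating solution closest to $\theta_0$, namely $\theta_0 + X^\top (X X^\top)^{-1} (y - X\theta_0)$.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def closest_interpolant(x1, x2, y1, y2, theta0):
    """theta0 + X^T (X X^T)^{-1} (y - X theta0) for a two-sample linear model."""
    g11, g12, g22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
    det = g11 * g22 - g12 * g12
    r1, r2 = y1 - dot(x1, theta0), y2 - dot(x2, theta0)
    # Solve the 2x2 Gram system explicitly.
    c1 = (g22 * r1 - g12 * r2) / det
    c2 = (g11 * r2 - g12 * r1) / det
    return [t + c1 * a + c2 * b for t, a, b in zip(theta0, x1, x2)]


def gradient_descent(x1, x2, y1, y2, theta0, steps=5000):
    """Minimize 0.5 * ((x1.theta - y1)^2 + (x2.theta - y2)^2) starting from theta0."""
    # 1 / trace(X X^T) is no larger than 1 / lambda_max, so the iteration is stable.
    lr = 1.0 / (dot(x1, x1) + dot(x2, x2))
    theta = list(theta0)
    for _ in range(steps):
        r1, r2 = dot(x1, theta) - y1, dot(x2, theta) - y2
        theta = [t - lr * (r1 * a + r2 * b) for t, a, b in zip(theta, x1, x2)]
    return theta


# Hypothetical overparameterized problem: 2 samples, 8 parameters.
rng = random.Random(1)
d = 8
x1 = [rng.gauss(0, 1) for _ in range(d)]
x2 = [rng.gauss(0, 1) for _ in range(d)]
y1, y2 = 1.0, -0.5
theta0 = [rng.gauss(0, 1) for _ in range(d)]

theta_gd = gradient_descent(x1, x2, y1, y2, theta0)
theta_star = closest_interpolant(x1, x2, y1, y2, theta0)
gap = max(abs(a - b) for a, b in zip(theta_gd, theta_star))
```

The key point is that every gradient step moves $\theta$ within the span of the training inputs, so the component of $\theta_0$ orthogonal to that span is never changed.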
Along with fruitful applications of Deep Neural Networks (DNNs) to realistic problems, recently, some empirical studies of DNNs reported a universal phenomenon of Frequency Principle (F-Principle): a DNN tends to learn a target function from low to high frequencies during the training. The F-Principle has been very useful in providing both qualitative and quantitative understandings of DNNs. In this paper, we rigorously investigate the F-Principle for the training dynamics of a general DNN at three stages: initial stage, intermediate stage, and final stage. For each stage, a theorem is provided in terms of proper quantities characterizing the F-Principle. Our results are general in the sense that they work for multilayer networks with general activation functions, population densities of data, and a large class of loss functions. Our work lays a theoretical foundation of the F-Principle for a better understanding of the training process of DNNs.
It remains a puzzle why deep neural networks (DNNs), with more parameters than samples, often generalize well. One approach to understanding this puzzle is to discover implicit biases underlying the training process of DNNs, such as the Frequency Principle (F-Principle), i.e., DNNs often fit target functions from low to high frequencies. Inspired by the F-Principle, we propose an effective model of linear F-Principle (LFP) dynamics which accurately predicts the learning results of two-layer ReLU neural networks (NNs) of large widths. This LFP dynamics is rationalized by a linearized mean-field residual dynamics of NNs. Importantly, the long-time limit solution of this LFP dynamics is equivalent to the solution of a constrained optimization problem explicitly minimizing an FP-norm, in which higher frequencies of feasible solutions are more heavily penalized. Using this optimization formulation, an a priori estimate of the generalization error bound is provided, revealing that a higher FP-norm of the target function increases the generalization error. Overall, by making the implicit bias of the F-Principle explicit as a penalty for two-layer NNs, our work takes a step towards a quantitative understanding of the learning and generalization of general DNNs.
How different initializations and loss functions affect the learning of a deep neural network (DNN), specifically its generalization error, is an important problem in practice. In this work, focusing on regression problems, we develop a kernel-norm minimization framework for the analysis of DNNs in the kernel regime, in which the number of neurons in each hidden layer is sufficiently large (Jacot et al. 2018, Lee et al. 2019). We find that, in the kernel regime, for any loss in a general class of functions, e.g., any $L^p$ loss for $1 < p < \infty$, the DNN finds the same global minimum: the one that is nearest to the initial value in the parameter space, or equivalently, the one that is closest to the initial DNN output in the corresponding reproducing kernel Hilbert space. With this framework, we prove that a non-zero initial output increases the generalization error of the DNN. We further propose an antisymmetrical initialization (ASI) trick that eliminates this type of error and accelerates the training. We also demonstrate experimentally that even for DNNs in the non-kernel regime, our theoretical analysis and the ASI trick remain effective. Overall, our work provides insight into how initialization and loss function quantitatively affect the generalization of DNNs, and also provides guidance for the training of DNNs.
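The idea behind the ASI trick can be sketched in a few lines. The abstract does not spell out the construction, so the code below rests on an assumption: that ASI takes the difference of two identically initialized copies of a network, scaled by $\sqrt{2}/2$, so that the combined output is exactly zero at initialization while both copies remain trainable. The two-layer tanh network and all function names are hypothetical.

```python
import math
import random


def two_layer(x, params):
    """Scalar two-layer tanh network: sum_j a_j * tanh(w_j * x + b_j)."""
    return sum(a * math.tanh(w * x + b) for a, w, b in params)


def asi_init(width, rng):
    """Return two parameter sets that are identical at initialization."""
    theta = [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
             for _ in range(width)]
    theta_copy = list(theta)  # tuples are immutable, so a shallow copy suffices
    return theta, theta_copy


def asi_output(x, theta, theta_copy):
    """Assumed ASI form: (sqrt(2)/2) * (h(x; theta) - h(x; theta_copy))."""
    return (math.sqrt(2) / 2) * (two_layer(x, theta) - two_layer(x, theta_copy))


theta, theta_copy = asi_init(16, random.Random(0))
plain_output = two_layer(0.3, theta)               # generally non-zero
initial_output = asi_output(0.3, theta, theta_copy)  # zero by construction
```

During training the two copies would receive gradients of opposite sign and drift apart, so the zero output holds only at initialization, which is the property the abstract ties to the generalization error.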
We study the training process of Deep Neural Networks (DNNs) from the Fourier analysis perspective. Our starting point is the Frequency Principle (F-Principle), namely that DNNs initialized with small parameters often fit target functions from low to high frequencies, which was first proposed by Xu et al. (2018) and Rahaman et al. (2018) on synthetic datasets. In this work, we first show the universality of the F-Principle by demonstrating this phenomenon on high-dimensional benchmark datasets, such as MNIST and CIFAR10. Then, based on experiments, we show that the F-Principle provides insight into both the success and failure of DNNs in different types of problems. Based on the F-Principle, we further propose that DNNs can be adopted to accelerate the convergence of low frequencies for scientific computing problems, in which most of the conventional methods (e.g., the Jacobi method) exhibit the opposite convergence behavior, namely faster convergence for higher frequencies. Finally, we prove a theorem for DNNs of one hidden layer as a first step towards a mathematical explanation of the F-Principle. Our work indicates that the F-Principle with Fourier analysis is a promising approach to the study of DNNs because it seems ubiquitous, applicable, and explainable.
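The contrast with conventional iterative solvers mentioned above can be checked on the 1-d Poisson problem with zero boundary values. The sketch below is an illustration under stated assumptions: it uses the damped (weighted) Jacobi variant with $\omega = 2/3$, a common smoothing choice that the abstract does not specify, and it tracks the error directly rather than solving a particular right-hand side. The error mode $\sin(k\pi i/(n+1))$ is multiplied by $1 - \omega + \omega\cos(k\pi/(n+1))$ at each sweep, so the highest-frequency mode dies out quickly while the lowest barely moves.

```python
import math


def weighted_jacobi_error(e, omega=2.0 / 3.0):
    """One damped-Jacobi sweep applied to the error of -u'' = f (zero boundaries)."""
    n = len(e)
    padded = [0.0] + list(e) + [0.0]
    return [(1 - omega) * padded[i] + omega * 0.5 * (padded[i - 1] + padded[i + 1])
            for i in range(1, n + 1)]


def mode(k, n):
    """Discrete sine mode of frequency k on n interior grid points."""
    return [math.sin(k * math.pi * i / (n + 1)) for i in range(1, n + 1)]


def max_abs(v):
    return max(abs(x) for x in v)


n = 31
low, high = mode(1, n), mode(n, n)  # lowest and highest frequency error modes
for _ in range(10):
    low = weighted_jacobi_error(low)
    high = weighted_jacobi_error(high)

low_remaining = max_abs(low)    # stays close to its initial size
high_remaining = max_abs(high)  # shrinks by roughly a factor of 3 per sweep
```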
Why deep neural networks (DNNs) capable of overfitting often generalize well in practice is a mystery in deep learning. Existing works indicate that this observation holds for both complicated real datasets and simple datasets of one-dimensional (1-d) functions. In this work, for fitting low-frequency dominant 1-d functions, memorizing natural images and classification problems, we empirically find that a DNN, i.e., a fully connected DNN or a convolutional neural network with common settings, first quickly captures the dominant low-frequency components, and then relatively slowly captures the high-frequency ones. We call this phenomenon the Frequency Principle (F-Principle). The F-Principle can be observed over various DNN setups with different activation functions, layer structures and training algorithms in our experiments. The F-Principle can be used to understand (i) the behavior of DNN training in the information plane and (ii) why DNNs often generalize well despite their ability to overfit. The F-Principle can potentially provide insights into the general principle underlying DNN optimization and generalization for real datasets.
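One way to quantify the observation above for 1-d functions is to compare the discrete Fourier transforms of the target and of the network output, frequency by frequency. The sketch below shows only that measurement. It is an assumption about how such a diagnostic could look, not the procedure of the paper, and the "early output" is a hand-written stand-in that has captured the low-frequency component only, not the output of a trained network.

```python
import cmath
import math


def dft(samples):
    """Plain discrete Fourier transform of a real sequence."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(samples))
            for k in range(n)]


def relative_error(target, output, k):
    """Relative difference between target and output at frequency index k."""
    ft, fo = dft(target), dft(output)
    return abs(ft[k] - fo[k]) / abs(ft[k])


# Target with a dominant low frequency (k = 1) and a weaker high one (k = 5),
# sampled on one full period so both fall exactly on DFT bins.
n = 64
xs = [2 * math.pi * t / n for t in range(n)]
target = [math.sin(x) + 0.5 * math.sin(5 * x) for x in xs]

# Stand-in for a network output early in training: low frequency only.
early_output = [math.sin(x) for x in xs]

err_low = relative_error(target, early_output, 1)   # low frequency already fitted
err_high = relative_error(target, early_output, 5)  # high frequency not yet fitted
```

Evaluating `relative_error` at the dominant frequency indices after every few training steps of an actual network would give the kind of low-to-high convergence picture the abstract describes.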