A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

Feb 04, 2019
Yuan Cao, Quanquan Gu

Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work explains in theory that gradient-based methods with proper random initialization can find the global minima of the training loss in over-parameterized DNNs, it does not explain the good generalization performance of gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further and prove that, under a certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to a non-vacuous algorithm-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that explains the good generalization performance of over-parameterized deep neural networks learned by gradient descent.

* 52 pages 
