Title: The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

Nov 13, 2021

Nikhil Ghosh, Song Mei, Bin Yu

Abstract: To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been made based on empirically observed phenomena, but there is limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD-trained neural networks. Using precise high-dimensional asymptotics, we characterize the dynamics of the fitted model in two "worlds": in the Oracle World the model is trained on the population distribution, and in the Empirical World the model is trained on a sampled dataset. We show that, under mild conditions on the kernel and the $L^2$ target regression function, the training dynamics undergo three stages characterized by the behaviors of the models in the two worlds. Our theoretical results also mathematically formalize some interesting deep learning phenomena. Specifically, in our setting we show that SGD progressively learns more complex functions and that there is a "deep bootstrap" phenomenon: during the second stage, the test errors in both worlds remain close despite the empirical training error being much smaller. Finally, we give a concrete example comparing the dynamics of two different kernels, which shows that faster training is not necessary for better generalization.
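As a rough illustration of the Empirical World dynamics described in the abstract, the sketch below runs gradient flow on a kernel least-squares objective using its closed-form solution. Everything specific here is an assumption made for the example and is not taken from the paper: the RBF kernel, the Gaussian inputs, the toy target function, the noise level, the loss scaling $\frac{1}{2n}\sum_i (y_i - f(x_i))^2$, and the helper names `rbf_kernel` and `gradient_flow_coeffs`. With that scaling and zero initialization, the coefficients of $f_t = \sum_i a_i(t)\,k(x_i,\cdot)$ satisfy $a(t) = K^{-1}\bigl(I - e^{-tK/n}\bigr)y$. The Oracle World (training on the population distribution) is not simulated.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between rows of A and rows of B (an illustrative choice)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def gradient_flow_coeffs(K, y, t):
    """Coefficients a(t) = K^{-1} (I - exp(-t K / n)) y of the gradient-flow solution.

    Assumes the loss (1/(2n)) * sum_i (y_i - f(x_i))^2 and f_0 = 0.
    """
    n = K.shape[0]
    lam, V = np.linalg.eigh(K)           # K is symmetric positive semidefinite
    lam = np.maximum(lam, 1e-12)         # guard against tiny or negative round-off
    gain = -np.expm1(-t * lam / n) / lam # (1 - exp(-t*lam/n)) / lam, computed stably
    return V @ (gain * (V.T @ y))

# Toy data: all of these choices are hypothetical.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
X_test = rng.standard_normal((1000, d))

def target(Z):
    # A linear term plus a quadratic interaction term.
    return Z[:, 0] + 0.5 * Z[:, 0] * Z[:, 1]

y = target(X) + 0.1 * rng.standard_normal(n)
y_test = target(X_test)

K = rbf_kernel(X, X, gamma=1.0 / d)
K_test = rbf_kernel(X_test, X, gamma=1.0 / d)

times = [0.0, 1.0, 10.0, 100.0, 1e3, 1e4]
train_mse, test_mse = [], []
for t in times:
    a = gradient_flow_coeffs(K, y, t)
    train_mse.append(float(np.mean((K @ a - y) ** 2)))
    test_mse.append(float(np.mean((K_test @ a - y_test) ** 2)))
    print(f"t={t:>8.1f}  train MSE={train_mse[-1]:.4f}  test MSE={test_mse[-1]:.4f}")
```

Plotting `train_mse` and `test_mse` against `times` gives a simple way to watch the training error decay while the test error evolves. Whether the stages from the abstract are visible at this small scale depends on the assumed kernel, dimension, and sample size.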

View paper on

OpenReview
