Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance

Aug 02, 2019
Qianqian Tong, Guannan Liang, Jinbo Bi

Although adaptive gradient methods (AGMs) train deep neural networks quickly, they are known to generalize worse than stochastic gradient descent (SGD) or SGD with momentum (S-Momentum). Many works have attempted to modify AGMs so as to close the gap in generalization performance between AGMs and S-Momentum, but they do not explain why the gap exists. We identify that the anisotropic scale of the adaptive learning rate (A-LR) used by AGMs contributes to the generalization performance gap, and that all existing modified AGMs actually represent efforts to revise the A-LR. Because the A-LR varies significantly across the dimensions of the problem over the optimization epochs (i.e., it has an anisotropic scale), we propose a new AGM that calibrates the A-LR with a softplus function, resulting in the Sadam and SAMSGrad methods (code is available). These methods are less likely to become trapped at sharp local minimizers, which helps them recover the dips in the generalization error curve observed with SGD and S-Momentum. We further provide a new way to analyze the convergence of AGMs (e.g., Adam, Sadam, and SAMSGrad) under the nonconvex, non-strongly convex, and Polyak-Łojasiewicz conditions. We prove that the convergence rate of Adam also depends on its hyper-parameter epsilon, which has been overlooked in prior convergence analyses. Empirical studies support our observation of the anisotropic A-LR and show that the proposed methods outperform existing AGMs and generalize even better than S-Momentum in multiple deep learning tasks.
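The abstract does not spell out the update rule, so the following is only a minimal sketch of what calibrating the A-LR with a softplus function could look like. It assumes the usual Adam denominator `sqrt(v) + eps` is replaced by `softplus(sqrt(v))`. The function names, the use of bias correction, and the softplus sharpness `beta=50.0` are illustrative assumptions, not details taken from this page.

```python
import numpy as np


def softplus(x, beta=50.0):
    """(1/beta) * log(1 + exp(beta * x)), computed stably via logaddexp."""
    return np.logaddexp(0.0, beta * x) / beta


def adaptive_lr(v_hat, eps=1e-8, beta=None):
    """Per-coordinate scale 1/denominator (the A-LR without the base step size).

    beta=None  -> Adam-style denominator sqrt(v_hat) + eps
    beta=float -> assumed softplus-calibrated denominator
    """
    root = np.sqrt(v_hat)
    denom = root + eps if beta is None else softplus(root, beta)
    return 1.0 / denom


def step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, beta=None):
    """One Adam-style step; passing beta switches on the assumed calibration."""
    m = b1 * m + (1.0 - b1) * grad          # first-moment estimate
    v = b2 * v + (1.0 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)             # bias correction
    v_hat = v / (1.0 - b2 ** t)
    new_param = param - lr * m_hat * adaptive_lr(v_hat, eps, beta)
    return new_param, m, v


# Coordinates whose gradients differ by many orders of magnitude.
grad = np.array([1.0, 1e-6, 0.0])
v_hat = grad ** 2

adam_scale = adaptive_lr(v_hat)               # spans many orders of magnitude
soft_scale = adaptive_lr(v_hat, beta=50.0)    # bounded above by beta / log(2)

adam_spread = adam_scale.max() / adam_scale.min()
soft_spread = soft_scale.max() / soft_scale.min()
print("Adam spread:", adam_spread)
print("Softplus spread:", soft_spread)
```

In this sketch the softplus denominator never falls below `log(2) / beta`, so coordinates with tiny second-moment estimates no longer receive extremely large scales. That is one way to read the "anisotropic scale" the abstract refers to: the spread between the largest and smallest per-coordinate scale shrinks sharply once the denominator is calibrated.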
