Title: Don't Decay the Learning Rate, Increase the Batch Size

Feb 24, 2018

Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le

Abstract: It is common practice to decay the learning rate. Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times. We can further reduce the number of parameter updates by increasing the learning rate $\epsilon$ and scaling the batch size $B \propto \epsilon$. Finally, one can increase the momentum coefficient $m$ and scale $B \propto 1/(1-m)$, although this tends to slightly reduce the test accuracy. Crucially, our techniques allow us to repurpose existing training schedules for large batch training with no hyper-parameter tuning. We train ResNet-50 on ImageNet to $76.1\%$ validation accuracy in under 30 minutes.
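The scaling rules in the abstract can be illustrated with a minimal Python sketch. It assumes the quantity held fixed is $\epsilon / (B(1-m))$, which is what $B \propto \epsilon$ and $B \propto 1/(1-m)$ jointly imply. The function names, the `max_batch` cap, and the fallback to learning-rate decay once that cap is reached are illustrative assumptions, not the authors' code.

```python
def batch_size_schedule(base_lr, base_batch, decay_epochs, decay_factor,
                        max_batch=None):
    """Convert a step learning-rate decay into a step batch-size increase.

    At each epoch in `decay_epochs`, where the original schedule would divide
    the learning rate by `decay_factor`, the batch size is multiplied by
    `decay_factor` instead. Returns a list of
    (start_epoch, learning_rate, batch_size) phases.
    """
    lr, batch = base_lr, base_batch
    phases = [(0, lr, batch)]
    for epoch in decay_epochs:
        grown = batch * decay_factor
        if max_batch is not None and grown > max_batch:
            # Assumption: once the (hypothetical) hardware cap is hit,
            # fall back to ordinary learning-rate decay.
            lr = lr / decay_factor
        else:
            batch = grown
        phases.append((epoch, lr, batch))
    return phases


def rescale_batch(batch, lr, new_lr, momentum=0.0, new_momentum=0.0):
    """Scale the batch size so that lr / (batch * (1 - momentum)) is unchanged."""
    scale = (new_lr / lr) * (1.0 - momentum) / (1.0 - new_momentum)
    return round(batch * scale)


if __name__ == "__main__":
    # Hypothetical schedule: decay by 5x at epochs 60, 120 and 160.
    for phase in batch_size_schedule(0.1, 128, [60, 120, 160], 5):
        print(phase)
    # A 5x larger learning rate pairs with a 5x larger batch.
    print(rescale_batch(128, lr=0.1, new_lr=0.5))
```

Each phase keeps the learning rate fixed while the batch grows, so the number of epochs is unchanged but each epoch needs fewer parameter updates. The numbers in the usage example are placeholders, not values taken from the paper.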

* 11 pages, 8 figures. Published as a conference paper at ICLR 2018
