Aitor Lewkowycz

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Nov 30, 2021
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, Augustus Odena

Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
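To make the "step by step" idea in the abstract above concrete, here is a minimal sketch of how a scratchpad target for long addition might be generated as training text. The function name `addition_scratchpad` and the exact line format are assumptions made for illustration, not the format used in the paper; the sketch also assumes non-negative integers.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Render a + b as right-to-left digit steps with an explicit carry.

    Hypothetical format: each line records one digit column, so a model
    trained on this text emits intermediate results before the answer.
    Assumes a and b are non-negative integers.
    """
    xs, ys = str(a)[::-1], str(b)[::-1]  # least significant digit first
    lines = [f"Input: {a} + {b}", "Scratchpad:"]
    carry = 0
    digits = []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        digit, new_carry = total % 10, total // 10
        lines.append(
            f"{x} + {y} + carry {carry} = {total} -> write {digit}, carry {new_carry}"
        )
        digits.append(str(digit))
        carry = new_carry
    if carry:
        digits.append(str(carry))  # leftover carry becomes the leading digit
    result = "".join(reversed(digits))
    lines.append(f"Target: {result}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(addition_scratchpad(29, 57))
```

The design choice illustrated here is that the final answer appears only after every intermediate column has been written out, so each emitted token depends on a short, local computation rather than on the whole sum at once.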

How to decay your learning rate

Mar 23, 2021
Aitor Lewkowycz

Complex learning rate schedules have become an integral part of deep learning. We find empirically that common fine-tuned schedules decay the learning rate after the weight norm bounces. This leads to the proposal of ABEL: an automatic scheduler which decays the learning rate by keeping track of the weight norm. ABEL's performance matches that of tuned schedules and is more robust with respect to its parameters. Through extensive experiments in vision, NLP, and RL, we show that if the weight norm does not bounce, we can simplify schedules even further with no loss in performance. In such cases, a complex schedule has similar performance to a constant learning rate with a decay at the end of training.

* 9 + 14 pages, 5 + 11 figures
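The abstract above describes decaying the learning rate after the weight norm "bounces". A simplified sketch of that idea follows, assuming a bounce means the weight norm switches from decreasing to increasing between consecutive checks. The class name `BounceDecay`, the `decay_factor` parameter, and the per-epoch bookkeeping are assumptions for illustration; this is not the ABEL algorithm as specified in the paper.

```python
class BounceDecay:
    """Toy scheduler: multiply the learning rate by `decay_factor` whenever
    the tracked weight norm goes from decreasing to increasing.

    Hypothetical simplification of a weight-norm-based schedule; names and
    the bounce criterion are assumptions, not taken from the paper.
    """

    def __init__(self, lr: float, decay_factor: float = 0.1):
        self.lr = lr
        self.decay_factor = decay_factor
        self._prev_norm = None   # weight norm at the previous call
        self._prev_delta = None  # change in weight norm at the previous call

    def step(self, weight_norm: float) -> float:
        """Record the latest weight norm (e.g. once per epoch), return the lr."""
        if self._prev_norm is not None:
            delta = weight_norm - self._prev_norm
            bounced = (
                self._prev_delta is not None
                and self._prev_delta < 0
                and delta > 0
            )
            if bounced:
                self.lr *= self.decay_factor
            self._prev_delta = delta
        self._prev_norm = weight_norm
        return self.lr


if __name__ == "__main__":
    sched = BounceDecay(lr=1.0, decay_factor=0.5)
    norms = [10.0, 9.0, 8.0, 8.5, 9.0, 8.0, 9.0]  # made-up weight norms
    print([sched.step(n) for n in norms])
```

In a real training loop the weight norm would be computed from the model parameters at the end of each epoch; the values in the usage example are invented purely to exercise the bounce check.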

On the training dynamics of deep networks with $L_2$ regularization

Jun 15, 2020
Aitor Lewkowycz, Guy Gur-Ari

We study the role of $L_2$ regularization in deep learning, and uncover simple relations between the performance of the model, the $L_2$ coefficient, the learning rate, and the number of training steps. These empirical relations hold when the network is overparameterized. They can be used to predict the optimal regularization parameter of a given model. In addition, based on these observations we propose a dynamical schedule for the regularization parameter that improves performance and speeds up training. We test these proposals in modern image classification settings. Finally, we show that these empirical relations can be understood theoretically in the context of infinitely wide networks. We derive the gradient flow dynamics of such networks, and compare the role of $L_2$ regularization in this context with that of linear models.

* 10+12 pages, 5+10 figures

The large learning rate phase of deep learning: the catapult mechanism

Mar 04, 2020
Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.

* 25 pages, 19 figures
