Title: Parameter Norm Growth During Training of Transformers

Nov 11, 2020

William Merrill, Vivek Ramanujan, Yoav Goldberg, Roy Schwartz, Noah Smith

Abstract: The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically some variant of gradient descent (GD). To better understand this bias, we study the tendency of transformer parameters to grow in magnitude during training. We find, both theoretically and empirically, that, in certain contexts, GD increases the parameter $L_2$ norm up to a threshold that itself increases with training-set accuracy. This means increasing training accuracy over time enables the norm to increase. Empirically, we show that the norm grows continuously over pretraining for T5 (Raffel et al., 2019). We show that pretrained T5 approximates a semi-discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the original network family that can be described in automata-theoretic terms. This suggests saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. While our experiments focus on transformers, our theoretical analysis extends to other architectures with similar formal properties, such as feedforward ReLU networks.
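
As a rough illustration of the two quantities the abstract refers to, the sketch below computes the $L_2$ norm of a toy parameter set and shows how a softmax sharpens toward a hard, "saturated" choice as its inputs are scaled up. It is a minimal sketch under stated assumptions, not the authors' code: the names `l2_norm`, `softmax`, and `scaled_attention`, the toy values, and the scale factor `c` are all hypothetical. Multiplying the scores by `c` is only a stand-in for growing parameter norm; in a real transformer the attention logits would grow with some power of the parameter scale.

```python
import math

def l2_norm(params):
    """L2 norm of all parameters, treated as one flat vector."""
    return math.sqrt(sum(w * w for layer in params for w in layer))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_attention(scores, c):
    """Softmax over scores multiplied by c (assumed stand-in for norm growth)."""
    return softmax([c * s for s in scores])

# Toy values, chosen only for illustration (hypothetical, not from the paper).
params = [[0.3, -0.4], [1.2, 0.0]]
scores = [1.0, 0.5, -0.2]

for c in (1, 10, 100):
    scaled_params = [[c * w for w in layer] for layer in params]
    weights = scaled_attention(scores, c)
    print(c, round(l2_norm(scaled_params), 3), [round(p, 3) for p in weights])
```

As `c` grows, the printed norm grows linearly while the attention weights concentrate on the largest score, which is the sense in which a large-norm network behaves like one with saturated activations.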

* Preprint. 9 body pages with appendix
