Differentiable Subset Pruning of Transformer Heads

Aug 10, 2021
Jiaoda Li, Ryan Cotterell, Mrinmaya Sachan

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer (Vaswani et al., 2017). Recent work has shown, however, that a large proportion of the heads in a Transformer's multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably to or better than Voita et al. (2019) while offering the same exact control over the number of heads as Michel et al. (2019).
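
The idea of per-head importance variables combined with a hard budget on the number of kept heads can be illustrated with a small sketch. This is not taken from the paper: the function names (`hard_topk_mask`, `relaxed_topk_mask`, `gate_heads`), the temperature parameter, and the choice of a repeated-softmax relaxation are assumptions made for illustration only. The hard mask keeps exactly `k` heads; the relaxed mask is one possible smooth surrogate whose entries sum to `k`, which is the kind of quantity that gradient descent could act on.

```python
import math

def hard_topk_mask(importance, k):
    """Return a 0/1 mask that keeps the k heads with the largest importance."""
    if not 0 <= k <= len(importance):
        raise ValueError("k must be between 0 and the number of heads")
    # Rank head indices by descending importance; ties go to the lower index.
    ranked = sorted(range(len(importance)), key=lambda i: (-importance[i], i))
    keep = set(ranked[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(importance))]

def relaxed_topk_mask(importance, k, temperature=1.0):
    """Soft surrogate (an assumption, not the paper's stated procedure):
    sum k successive softmaxes, penalizing heads already selected."""
    logits = list(importance)
    mask = [0.0] * len(logits)
    for _ in range(k):
        top = max(logits)
        exps = [math.exp((x - top) / temperature) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        mask = [m + p for m, p in zip(mask, probs)]
        # Down-weight each head in proportion to how strongly it was just picked.
        logits = [x + math.log(max(1.0 - p, 1e-12)) for x, p in zip(logits, probs)]
    return mask

def gate_heads(head_outputs, mask):
    """Scale each head's output vector by its mask entry (0 prunes the head)."""
    return [[g * v for v in head] for g, head in zip(mask, head_outputs)]

# Hypothetical importance scores for four heads, with a budget of two heads.
scores = [0.2, 1.5, -0.3, 0.9]
print(hard_topk_mask(scores, 2))  # [0.0, 1.0, 0.0, 1.0]
print(relaxed_topk_mask(scores, 2, temperature=0.5))
```

Under these assumptions, lowering the temperature pushes the relaxed mask toward the hard one, while the budget `k` stays fixed throughout, which mirrors the exact control over the number of heads described above.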
