Title: Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

May 06, 2024

Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis


Abstract: Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.
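The abstract's two core ideas, soft merging of experts in parameter space and causal segment routing, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes each expert is a single linear map (real MoE feed-forward experts have more structure), that routing weights for a segment come from a softmax over the mean hidden state of the previous segment, and that the first segment falls back to uniform weights. All names here (`merge_experts`, `causal_segment_moe`, `router`, `segment_len`) are hypothetical and chosen for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def merge_experts(experts, weights):
    """Soft merging in parameter space: a weighted average of expert matrices."""
    n_rows, n_cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(w * e[i][j] for w, e in zip(weights, experts)) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def causal_segment_moe(hidden, experts, router, segment_len):
    """Apply one merged expert per segment.

    Routing weights for a segment depend only on the previous segment
    (assumption of this sketch), so no token's output depends on later tokens.
    """
    num_experts = len(experts)
    outputs, all_weights = [], []
    prev_segment = None
    for start in range(0, len(hidden), segment_len):
        segment = hidden[start:start + segment_len]
        if prev_segment is None:
            # Assumption: no earlier context, so use uniform merging weights.
            weights = [1.0 / num_experts] * num_experts
        else:
            weights = softmax(matvec(router, mean_vector(prev_segment)))
        merged = merge_experts(experts, weights)  # one merge per segment
        outputs.extend(matvec(merged, tok) for tok in segment)
        all_weights.append(weights)
        prev_segment = segment
    return outputs, all_weights


# Toy usage: 3 linear experts, hidden size 4, 6 tokens, segments of 2 tokens.
random.seed(0)
dim, n_experts = 4, 3
toy_experts = [
    [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(n_experts)
]
toy_router = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_experts)]
toy_hidden = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
toy_out, toy_weights = causal_segment_moe(toy_hidden, toy_experts, toy_router, 2)
print(len(toy_out), [[round(w, 3) for w in ws] for ws in toy_weights])
```

Because the merged matrix is a smooth function of the softmax weights, gradients can flow to the router without a discrete top-k choice, and merging once per segment rather than once per token keeps the cost of the merge operation low in this sketch.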

* 21 pages, 12 figures
