Mike Lewis

8-bit Optimizers via Block-wise Quantization

Oct 06, 2021
Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer

Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
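The block-wise idea in the abstract can be illustrated with a minimal sketch. It assumes a plain linear (absmax) 8-bit code per block, which is a simplification and not the dynamic quantization the abstract describes; the function names and the block size are hypothetical and are not the API of the released optimizers.

```python
from typing import List, Tuple


def quantize_blockwise(values: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Quantize each block independently to signed 8-bit codes.

    Assumption: a linear absmax scheme stands in for the paper's
    dynamic quantization, purely for illustration.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        absmax = max(abs(v) for v in block)
        # Guard against division by zero for an all-zero block.
        scale = absmax if absmax > 0.0 else 1.0
        scales.append(scale)
        # Map [-absmax, absmax] onto the integer range [-127, 127].
        codes.extend(round(v / scale * 127) for v in block)
    return codes, scales


def dequantize_blockwise(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Invert quantize_blockwise using the scale stored for each block."""
    return [code / 127 * scales[i // block_size] for i, code in enumerate(codes)]


# One block of small values followed by one block containing a large outlier.
state = [0.01, -0.02, 0.005, 0.0, 100.0, -50.0, 25.0, 0.0]
codes, scales = quantize_blockwise(state, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)
```

Because each block carries its own scale, the outlier `100.0` only affects the precision of its own block. With a single scale over the whole tensor (here `block_size=8`), the small values in the first block would all round to code 0.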

* ICLR 2022 submission with appendix

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Aug 27, 2021
Ofir Press, Noah A. Smith, Mike Lewis

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question remains open: how to achieve extrapolation at inference time to longer sequences than seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
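A minimal sketch of the distance-proportional bias described above, for a single attention head with causal masking. The slope is left as a free parameter here; the helper names are hypothetical, and how slopes are chosen per head is not stated in this abstract.

```python
import math
from typing import List


def alibi_bias(seq_len: int, slope: float) -> List[List[float]]:
    """Bias added to query-key scores: -slope * distance for past keys.

    Future positions (j > i) are masked with -inf, assuming a causal LM.
    """
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores: List[List[float]], slope: float) -> List[List[float]]:
    """Add the linear bias to raw query-key scores, then normalize each row."""
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [softmax([s + b for s, b in zip(scores[i], bias[i])]) for i in range(n)]


# With identical raw scores, the bias alone makes nearer keys weigh more.
uniform_scores = [[0.0] * 4 for _ in range(4)]
weights = biased_attention_weights(uniform_scores, slope=0.5)
```

No positional embedding is added to the inputs in this sketch; position enters only through the bias, which is defined for any sequence length, including lengths longer than those used in training.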

DEMix Layers: Disentangling Domains for Modular Language Modeling

Aug 20, 2021
Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A. Smith, Luke Zettlemoyer

We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer is a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity, increase training efficiency, and enable rapid adaptation with little overhead. We show that mixing experts during inference, using a parameter-free weighted ensemble, allows the model to better generalize to heterogeneous or unseen domains. We also show that experts can be added to iteratively incorporate new domains without forgetting older ones, and that experts can be removed to restrict access to unwanted domains, without additional training. Overall, these results demonstrate benefits of explicitly conditioning on textual domains during language modeling.
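One way to picture a parameter-free weighted ensemble of experts is sketched below. It assumes the weights are a posterior over domains computed from how well each expert explains the preceding context; this weighting rule, the function names, and the toy numbers are assumptions for illustration and are not taken from the abstract.

```python
import math
from typing import List, Optional


def domain_posterior(context_log_likelihoods: List[float],
                     prior: Optional[List[float]] = None) -> List[float]:
    """Weights over experts, proportional to prior * likelihood of the context.

    Assumption: each expert reports the log-likelihood it assigns to the
    text seen so far; no extra parameters are learned for the mixing.
    """
    n = len(context_log_likelihoods)
    if prior is None:
        prior = [1.0 / n] * n
    logits = [ll + math.log(p) for ll, p in zip(context_log_likelihoods, prior)]
    m = max(logits)  # subtract the max for numerical stability
    unnormalized = [math.exp(x - m) for x in logits]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mix_next_token(expert_probs: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of the experts' next-token distributions."""
    vocab_size = len(expert_probs[0])
    return [
        sum(w * probs[t] for w, probs in zip(weights, expert_probs))
        for t in range(vocab_size)
    ]


# Two hypothetical experts; the first explains the context better.
weights = domain_posterior([-10.0, -12.0])
mixed = mix_next_token([[0.9, 0.1], [0.2, 0.8]], weights)
```

Adding or removing an expert in this sketch only changes the lengths of the lists passed in, which mirrors the modularity the abstract describes.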

* edits: updated reference links, added related work, typo fixes

Noisy Channel Language Model Prompting for Few-Shot Text Classification

Aug 15, 2021
Sewon Min, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer

We introduce a noisy channel approach for language model prompting in few-shot text classification. Instead of computing the likelihood of the label given the input (referred to as direct models), channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input. We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning. Our experiments show that, for both methods, channel models significantly outperform their direct counterparts, which we attribute to their stability, i.e., lower variance and higher worst-case accuracy. We also present extensive ablations that provide recommendations for when to use channel prompt tuning instead of other competitive models (e.g., direct head tuning): channel prompt tuning is preferred when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.
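The channel direction of scoring can be sketched with a toy stand-in for the language model. The per-label word probability tables below are hypothetical and exist only to make the example runnable; in the paper's setting the probabilities would come from a prompted language model.

```python
import math
from typing import List

# Hypothetical stand-in for P(word | label); a real system would query an LM.
TOY_WORD_PROBS = {
    "positive": {"great": 0.5, "movie": 0.3, "boring": 0.05, "<unk>": 0.15},
    "negative": {"great": 0.05, "movie": 0.3, "boring": 0.5, "<unk>": 0.15},
}


def channel_log_score(words: List[str], label: str) -> float:
    """log P(input | label): every input word contributes to the score."""
    table = TOY_WORD_PROBS[label]
    return sum(math.log(table.get(w, table["<unk>"])) for w in words)


def classify_channel(words: List[str]) -> str:
    """Pick the label under which the whole input is most probable."""
    return max(TOY_WORD_PROBS, key=lambda label: channel_log_score(words, label))


prediction = classify_channel(["great", "movie"])
```

A direct model would instead score the label given the input, so only the label tokens are scored. In the channel direction the score is a sum over all input words, which is what "required to explain every word in the input" refers to.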

* 15 pages, 6 figures

HTLM: Hyper-Text Pre-Training and Prompting of Language Models

Jul 14, 2021
Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer

We introduce HTLM, a hyper-text language model trained on a large-scale web crawl. Modeling hyper-text has a number of advantages: (1) it is easily gathered at scale, (2) it provides rich document-level and end-task-adjacent supervision (e.g. class and id attributes often encode document category information), and (3) it allows for new structured prompting that follows the established semantics of HTML (e.g. to do zero-shot summarization by infilling title tags for a webpage that contains the input text). We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data. We will release all code and models to support future HTLM research.

Question Answering Infused Pre-training of General-Purpose Contextualized Representations

Jun 15, 2021
Robin Jia, Mike Lewis, Luke Zettlemoyer

This paper proposes a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets.

Multitasking Inhibits Semantic Drift

Apr 15, 2021
Athul Paul Jacob, Mike Lewis, Jacob Andreas

When intelligent agents communicate to accomplish shared goals, how do these goals shape the agents' language? We study the dynamics of learning in latent language policies (LLPs), in which instructor agents generate natural-language subgoal descriptions and executor agents map these descriptions to low-level actions. LLPs can solve challenging long-horizon reinforcement learning problems and provide a rich model for studying task-oriented language use. But previous work has found that LLP training is prone to semantic drift (use of messages in ways inconsistent with their original natural language meanings). Here, we demonstrate theoretically and empirically that multitask training is an effective counter to this problem: we prove that multitask training eliminates semantic drift in a well-studied family of signaling games, and show that multitask training of neural LLPs in a complex strategy game reduces drift while improving sample efficiency.

* NAACL 2021

BASE Layers: Simplifying Training of Large, Sparse Models

Mar 30, 2021
Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, Luke Zettlemoyer

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules that contain only a small fraction of the model parameters. However, it can be difficult to learn balanced routing functions that make full use of the available experts; existing approaches typically use routing heuristics or auxiliary expert-balancing loss functions. In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyperparameters or auxiliary losses. Code is publicly released at https://github.com/pytorch/fairseq/
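The balanced token-to-expert allocation can be illustrated on a tiny example. The sketch below solves the linear assignment problem by brute force over permutations, which is only workable for a handful of tokens and is an assumption made for clarity; it is not the solver used in the released code, and the function name and score values are hypothetical.

```python
from itertools import permutations
from typing import List


def balanced_assignment(scores: List[List[float]], num_experts: int) -> List[int]:
    """Assign each token to an expert, maximizing total score under equal loads.

    scores[t][e] is the affinity of token t for expert e. Brute force is
    used only for illustration; it is exponential in the number of tokens.
    """
    num_tokens = len(scores)
    if num_tokens % num_experts != 0:
        raise ValueError("number of tokens must be divisible by number of experts")
    capacity = num_tokens // num_experts
    # Each expert owns `capacity` slots; a permutation places one token per slot.
    slots = [e for e in range(num_experts) for _ in range(capacity)]

    best_total = float("-inf")
    best: List[int] = []
    for perm in permutations(range(num_tokens)):
        total = sum(scores[token][slots[s]] for s, token in enumerate(perm))
        if total > best_total:
            best_total = total
            assignment = [0] * num_tokens
            for s, token in enumerate(perm):
                assignment[token] = slots[s]
            best = assignment
    return best


# Every token prefers expert 0, so greedy routing would overload it.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
assignment = balanced_assignment(scores, num_experts=2)
```

In this example each expert receives exactly two tokens, and the tokens moved to expert 1 are those that lose the least by switching. No balancing loss or extra hyperparameter appears anywhere in the sketch, which is the property the abstract emphasizes.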
