Mike Lewis

Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training

May 06, 2024

Zexuan Zhong, Mengzhou Xia, Danqi Chen, Mike Lewis

Abstract:Mixture-of-experts (MoE) models facilitate efficient scaling; however, training the router network introduces the challenge of optimizing a non-differentiable, discrete objective. Recently, a fully-differentiable MoE architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges experts in the parameter space; nevertheless, its effectiveness was only demonstrated in downstream fine-tuning on classification tasks. In this paper, we present Lory, the first approach that scales such architectures to autoregressive language model pre-training. Lory introduces two key techniques: (1) a causal segment routing strategy that achieves high efficiency for expert merging operations while preserving the autoregressive nature of language models; (2) a similarity-based data batching method that encourages expert specialization by grouping similar documents in training instances. We pre-train a series of Lory models on 150B tokens from scratch, with up to 32 experts and 30B (1.5B active) parameters. Experimental results show significant performance gains over parameter-matched dense models on both perplexity (+13.9%) and a variety of downstream tasks (+1.5%-11.1%). Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing. We further demonstrate that the trained experts in Lory capture domain-level specialization without supervision. Our work highlights the potential of fully-differentiable MoE architectures for language model pre-training and advocates future research in this area.

* 21 pages, 12 figures
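
A minimal sketch of the two ideas named in the abstract, soft expert merging in parameter space and causal segment routing, written in plain Python. The function names, the toy linear router, and the uniform mixture used for the first segment are illustrative assumptions, not details taken from the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def merge_experts(expert_weights, mix):
    """Weighted average of expert parameter vectors (merging in parameter space)."""
    dim = len(expert_weights[0])
    return [sum(w * e[i] for w, e in zip(mix, expert_weights)) for i in range(dim)]

def causal_segment_routing(segments, router, expert_weights):
    """Return one merged expert per segment.

    The mixture for segment k is computed only from segment k-1, so no
    information from the tokens being predicted leaks into their own expert.
    """
    n_experts = len(expert_weights)
    merged = []
    for k in range(len(segments)):
        if k == 0:
            # Assumption for this sketch: no earlier segment exists, so use a uniform mixture.
            mix = [1.0 / n_experts] * n_experts
        else:
            prev = segments[k - 1]
            dim = len(prev[0])
            mean = [sum(tok[i] for tok in prev) / len(prev) for i in range(dim)]
            # Hypothetical linear router: one score per expert.
            logits = [sum(r * m for r, m in zip(row, mean)) for row in router]
            mix = softmax(logits)
        merged.append(merge_experts(expert_weights, mix))
    return merged
```

Because every step is a weighted sum followed by a softmax, the whole path from router to merged parameters stays differentiable, which is the property the abstract contrasts with discrete routing.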

In-Context Pretraining: Language Modeling Beyond Document Boundaries

Oct 20, 2023

Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis

Abstract: Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts, but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs' performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
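
The document-ordering idea can be illustrated with a toy greedy traversal: start from one document and repeatedly move to the most similar document not yet visited, so that each document appears exactly once. This is only a sketch under simplifying assumptions. It uses brute-force cosine similarity over small hypothetical embedding vectors, whereas the abstract describes approximate nearest neighbor search at the scale of billions of documents, and the function names are invented for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def order_documents(embeddings, start=0):
    """Greedy path over documents: always hop to the most similar unvisited one."""
    unvisited = set(range(len(embeddings)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        cur = order[-1]
        # sorted() makes tie-breaking deterministic (lowest index wins).
        nxt = max(sorted(unvisited), key=lambda j: cosine(embeddings[cur], embeddings[j]))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

# Toy usage: documents 0 and 2 are similar, as are 1 and 3.
docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(order_documents(docs))
```

Concatenating documents in the returned order and then cutting the stream into fixed-length contexts would place related documents next to each other without repeating any of them.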

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Oct 08, 2023

Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis(+2 more)

Abstract:Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average.

* 24 pages

Contrastive Decoding Improves Reasoning in Large Language Models

Sep 29, 2023

Sean O'Brien, Mike Lewis

Abstract: We demonstrate that Contrastive Decoding -- a simple, computationally light, and training-free text generation method proposed by Li et al. (2022) -- achieves large out-of-the-box improvements over greedy decoding on a variety of reasoning tasks. Originally shown to improve the perceived quality of long-form text generation, Contrastive Decoding searches for strings that maximize a weighted difference in likelihood between strong and weak models. We show that Contrastive Decoding leads LLaMA-65B to outperform LLaMA 2, GPT-3.5 and PaLM 2-L on the HellaSwag commonsense reasoning benchmark, and to outperform LLaMA 2, GPT-3.5 and PaLM-540B on the GSM8K math word reasoning benchmark, in addition to improvements on a collection of other tasks. Analysis suggests that Contrastive Decoding improves over existing methods by preventing some abstract reasoning errors, as well as by avoiding simpler modes such as copying sections of the input during chain-of-thought. Overall, Contrastive Decoding outperforms nucleus sampling for long-form generation and greedy decoding for reasoning tasks, making it a powerful general-purpose method for generating text from language models.

* 9 figures, 11 tables
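
The "weighted difference in likelihood between strong and weak models" can be sketched for a single decoding step as follows. The abstract does not give the exact formula, so the specific form used here, `(1 + beta) * log p_strong - beta * log p_weak` restricted to tokens the strong model finds plausible, along with the parameter names `alpha` and `beta` and their default values, should be read as assumptions for illustration.

```python
import math

def contrastive_scores(logp_strong, logp_weak, alpha=0.1, beta=0.5):
    """Score each candidate token for one decoding step.

    Tokens whose strong-model probability is below alpha times the best
    token's probability are masked out with -inf.
    """
    cutoff = max(logp_strong) + math.log(alpha)
    scores = []
    for s, w in zip(logp_strong, logp_weak):
        if s >= cutoff:
            scores.append((1 + beta) * s - beta * w)
        else:
            scores.append(float("-inf"))
    return scores

def pick_token(logp_strong, logp_weak, alpha=0.1, beta=0.5):
    scores = contrastive_scores(logp_strong, logp_weak, alpha, beta)
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy usage: the weak model strongly prefers token 0, so the contrast favors token 1.
strong = [math.log(p) for p in (0.5, 0.4, 0.1)]
weak = [math.log(p) for p in (0.8, 0.1, 0.1)]
print(pick_token(strong, weak))
```

In this toy case greedy decoding on the strong model alone would choose token 0, while the contrastive score prefers a token that the strong model likes and the weak model does not.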

Efficient Streaming Language Models with Attention Sinks

Sep 29, 2023

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis

Abstract: Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach -- but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a "sink" even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.
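
The cache policy described above, keeping the KV entries of the initial tokens plus a window of the most recent ones, can be sketched as a function that returns which token positions stay cached. The function name and the default sizes (`n_sink=4`, `window=8`) are assumptions chosen for illustration, and this is not the API of the linked repository.

```python
def streaming_cache_positions(seq_len, n_sink=4, window=8):
    """Positions whose KV states are kept after seq_len tokens have been seen.

    Keeps the first n_sink positions (the attention sinks) and the last
    `window` positions; everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# Toy usage: after 20 tokens the cache holds 4 sink tokens and the 8 most recent.
print(streaming_cache_positions(20))
```

The cache size is bounded by `n_sink + window` regardless of how long the stream runs, which is what makes the memory cost constant in the sequence length.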

Effective Long-Context Scaling of Foundation Models

Sep 27, 2023

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz(+11 more)

Abstract:We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens. Our model series are built through continual pretraining from Llama 2 with longer training sequences and on a dataset where long texts are upsampled. We perform extensive evaluation on language modeling, synthetic context probing tasks, and a wide range of research benchmarks. On research benchmarks, our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2. Notably, with a cost-effective instruction tuning procedure that does not require human-annotated long instruction data, the 70B variant can already surpass gpt-3.5-turbo-16k's overall performance on a suite of long-context tasks. Alongside these results, we provide an in-depth analysis on the individual components of our method. We delve into Llama's position encodings and discuss its limitation in modeling long dependencies. We also examine the impact of various design choices in the pretraining process, including the data mix and the training curriculum of sequence lengths -- our ablation experiments suggest that having abundant long texts in the pretrain dataset is not the key to achieving strong performance, and we empirically verify that long context continual pretraining is more efficient and similarly effective compared to pretraining from scratch with long sequences.

Self-Alignment with Instruction Backtranslation

Aug 14, 2023

Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis

Abstract:We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.

Trusting Your Evidence: Hallucinate Less with Context-aware Decoding

May 24, 2023

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, Scott Wen-tau Yih

Abstract:Language models (LMs) often struggle to pay enough attention to the input context, and generate texts that are unfaithful or contain hallucinations. To mitigate this issue, we present context-aware decoding (CAD), which follows a contrastive output distribution that amplifies the difference between the output probabilities when a model is used with and without context. Our experiments show that CAD, without additional training, significantly improves the faithfulness of different LM families, including OPT, GPT, LLaMA and FLAN-T5 for summarization tasks (e.g., 14.3% gain for LLaMA in factuality metrics). Furthermore, CAD is particularly effective in overriding a model's prior knowledge when it contradicts the provided context, leading to substantial improvements in tasks where resolving the knowledge conflict is essential.
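
A contrastive output distribution of this kind can be sketched by combining the next-token logits computed with and without the context. The abstract does not state the formula, so the combination `(1 + alpha) * with_context - alpha * without_context` and the parameter name `alpha` are assumptions used here for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def context_aware_probs(logits_with, logits_without, alpha=1.0):
    """Amplify the change in logits that the context causes.

    alpha = 0 recovers ordinary decoding with context; larger alpha pushes
    the distribution further from the model's context-free prior.
    """
    adjusted = [(1 + alpha) * a - alpha * b for a, b in zip(logits_with, logits_without)]
    return softmax(adjusted)

# Toy usage: the context raises token 1 from logit 0.0 to 1.0, and the
# contrast amplifies that shift until both tokens are equally likely.
print(context_aware_probs([2.0, 1.0], [2.0, 0.0], alpha=1.0))
```

Tokens whose score the context increases gain probability mass, which matches the abstract's description of overriding prior knowledge that conflicts with the provided context.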

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

May 23, 2023

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi

Abstract:Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FActScore (Factual precision in Atomicity Score), a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FActScores of people biographies generated by several state-of-the-art commercial LMs -- InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI -- and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FActScore, using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models.

* 23 pages, 7 figures
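
The scoring step, the percentage of atomic facts supported by a knowledge source, reduces to a simple ratio once the facts have been extracted. The sketch below assumes the decomposition into atomic facts has already been done and stands in for the knowledge source with a hypothetical `is_supported` callback; both the function name and the choice to skip generations with no facts are illustrative assumptions.

```python
def factscore(generations, is_supported):
    """Average, over generations, of the fraction of atomic facts that are supported.

    `generations` is a list of lists of atomic facts (strings).
    Generations with no facts are skipped in this sketch.
    """
    scores = []
    for facts in generations:
        if not facts:
            continue
        supported = sum(1 for fact in facts if is_supported(fact))
        scores.append(supported / len(facts))
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage with a set standing in for the knowledge source.
knowledge = {"a", "b", "c"}
gens = [["a", "b", "c", "d"], ["a", "x"]]
print(factscore(gens, lambda fact: fact in knowledge))
```

Scoring at the level of individual facts is what lets a generation that is mostly correct receive partial credit instead of a single binary judgment.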

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

May 19, 2023

Lili Yu, Dániel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, Mike Lewis

Abstract: Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding -- unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
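
The patching step and the reason it yields sub-quadratic attention can be shown with a small sketch. The padding byte, the function names, and the cost model (counting only pairwise attention scores, with one global pass over patches and one local pass inside each patch) are simplifying assumptions for illustration, not the paper's exact accounting.

```python
def patchify(data: bytes, patch_size: int, pad: int = 0):
    """Split a byte sequence into fixed-size patches, padding the last one."""
    remainder = len(data) % patch_size
    if remainder:
        data = data + bytes([pad]) * (patch_size - remainder)
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

def attention_pairs(seq_len: int, patch_size: int):
    """Pairwise attention scores for a flat model vs. a global-plus-local split."""
    n_patches = -(-seq_len // patch_size)  # ceiling division
    full = seq_len * seq_len
    global_pairs = n_patches * n_patches          # attention between patches
    local_pairs = n_patches * patch_size ** 2     # attention within each patch
    return full, global_pairs + local_pairs

print(patchify(b"hello world", 4))
print(attention_pairs(1024, 32))
```

For a sequence of 1024 bytes and patches of 32 bytes, this toy count drops from 1,048,576 pairs to 33,792, which illustrates why splitting attention into a global and a local level scales better with length.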
