Naman Goyal

On the Role of Bidirectionality in Language Model Pre-Training

May 24, 2022

Mikel Artetxe, Jingfei Du, Naman Goyal, Luke Zettlemoyer, Ves Stoyanov

Abstract: Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
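
The difference between unidirectional, prefix-LM and fully bidirectional attention can be illustrated with a boolean attention mask. The following is a minimal sketch, not the paper's code: the helper name `attention_mask` and its `n_bidir` parameter are assumptions made for illustration, and the sketch covers only the attention side of the framework, not the separate notion of bidirectional context.

```python
from typing import List


def attention_mask(seq_len: int, n_bidir: int) -> List[List[bool]]:
    """Return mask[i][j], True when position i may attend to position j.

    The first n_bidir positions attend to each other in both directions;
    every other position attends only to itself and earlier positions.
    """
    if not 0 <= n_bidir <= seq_len:
        raise ValueError("n_bidir must be between 0 and seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i                          # left-to-right attention
            in_prefix = i < n_bidir and j < n_bidir  # bidirectional block
            row.append(causal or in_prefix)
        mask.append(row)
    return mask
```

Under these assumptions, `n_bidir=0` gives a GPT-style causal mask, `n_bidir=seq_len` gives a BERT-style full mask, and values in between give a prefix-LM-style mask.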

Lifting the Curse of Multilinguality by Pre-training Modular Transformers

May 12, 2022

Jonas Pfeiffer, Naman Goyal, Xi Victoria Lin, Xian Li, James Cross, Sebastian Riedel, Mikel Artetxe

Abstract: Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while keeping the total number of trainable parameters per language constant. In contrast with prior work that learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages.

* NAACL 2022

OPT: Open Pre-trained Transformer Language Models

May 05, 2022

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin (+9 more)

Abstract: Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?

Apr 29, 2022

Shiyue Zhang, Vishrav Chaudhary, Naman Goyal, James Cross, Guillaume Wenzek, Mohit Bansal, Francisco Guzman

Abstract: A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.
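
The two warning features named in the abstract can be approximated with simple counts over a tokenized sample. The sketch below is an assumption-laden illustration rather than the paper's definition: the function names `unk_rate` and `chars_per_token` are hypothetical, and average characters per token is only one plausible proxy for closeness to the character level.

```python
from typing import List, Set


def unk_rate(tokens: List[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are not in the tokenizer vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocab)
    return unknown / len(tokens)


def chars_per_token(tokens: List[str]) -> float:
    """Average token length in characters (assumed proxy).

    A value near 1.0 means the segmentation is close to character level.
    """
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)
```

Both values can be computed on held-out text for each language before any translation model is trained, which is what makes them usable as early warnings.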

Graph Neural Networks for Image Classification and Reinforcement Learning using Graph representations

Mar 08, 2022

Naman Goyal, David Steiner

Abstract: In this paper, we evaluate the performance of graph neural networks in two distinct domains: computer vision and reinforcement learning. In the computer vision section, we seek to learn whether a novel non-redundant representation for images as graphs can improve performance over a trivial pixel-to-node mapping on a graph-level prediction task, specifically image classification. For the reinforcement learning section, we seek to learn if explicitly modeling solving a Rubik's cube as a graph problem can improve performance over a standard model-free technique with no inductive bias.

* The work was done as a project for Neural Networks and Deep Learning course, Fall 2021 offering by Prof. Richard Zemel at Columbia University

CM3: A Causal Masked Multimodal Model of the Internet

Jan 19, 2022

Armen Aghajanyan, Bernie Huang, Candace Ross, Vladimir Karpukhin, Hu Xu, Naman Goyal, Dmytro Okhonko, Mandar Joshi, Gargi Ghosh, Mike Lewis (+1 more)

Abstract: We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM. We set the new state-of-the-art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL-E) and do captioning all in a zero-shot setting with a single model.
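
The rearrangement described in the abstract, where a masked span is moved to the end of the string, can be sketched for a single span as follows. This is an illustrative assumption, not the released CM3 code: the function name `causally_mask` and the sentinel string `<mask:0>` are hypothetical, and the sketch handles one span only.

```python
from typing import List


def causally_mask(tokens: List[str], start: int, end: int,
                  sentinel: str = "<mask:0>") -> List[str]:
    """Move tokens[start:end] to the end of the sequence.

    The span is replaced in place by a sentinel, and the same sentinel
    followed by the span is appended, so a left-to-right model sees the
    context on both sides of the gap before it generates the span.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be non-empty and inside the sequence")
    span = tokens[start:end]
    return tokens[:start] + [sentinel] + tokens[end:] + [sentinel] + span
```

For example, masking positions 1 to 3 of `["a", "b", "c", "d", "e"]` places `d` and `e` ahead of the span `b c` in the training string, which is how the span gets bidirectional context under plain left-to-right generation.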

Efficient Large Scale Language Modeling with Mixtures of Experts

Dec 20, 2021

Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru (+14 more)

Abstract: Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using ~4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
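
Conditional computation in an MoE layer means that only a few experts run for each input. The sketch below illustrates generic top-k gating on a scalar input and is not taken from the paper's code: the function name `moe_forward`, the default `k=2`, and passing gate logits in directly (rather than computing them with a learned router) are all assumptions for illustration.

```python
import math
from typing import Callable, Sequence


def moe_forward(x: float,
                gate_logits: Sequence[float],
                experts: Sequence[Callable[[float], float]],
                k: int = 2) -> float:
    """Combine the outputs of the k highest-scoring experts for input x."""
    if len(gate_logits) != len(experts):
        raise ValueError("need exactly one gate logit per expert")
    if not 1 <= k <= len(experts):
        raise ValueError("k must be between 1 and the number of experts")
    # Numerically stable softmax over the gate logits.
    peak = max(gate_logits)
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalize their weights.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Only the selected experts are evaluated; the rest cost nothing.
    return sum(probs[i] / norm * experts[i](x) for i in top)
```

Because the unselected experts are never called, total parameter count can grow with the number of experts while per-input compute stays roughly fixed by `k`.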

Few-shot Learning with Multilingual Language Models

Dec 20, 2021

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du (+11 more)

Abstract: Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities in a wide range of tasks. Our largest model with 7.5 billion parameters sets new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in 0-shot settings and +9.4% in 4-shot settings) and natural language inference (+5.4% in each of 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement on surface form robustness and adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models in social value tasks such as hate speech detection in five languages and find they have limitations similar to comparable sized GPT-3 models.

* 36 pages

XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

Nov 19, 2021

Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino (+3 more)

Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.

FST: the FAIR Speech Translation System for the IWSLT21 Multilingual Shared Task

Aug 14, 2021

Yun Tang, Hongyu Gong, Xian Li, Changhan Wang, Juan Pino, Holger Schwenk, Naman Goyal

Abstract: In this paper, we describe our end-to-end multilingual speech translation system submitted to the IWSLT 2021 evaluation campaign on the Multilingual Speech Translation shared task. Our system is built by leveraging transfer learning across modalities, tasks and languages. First, we leverage general-purpose multilingual modules pretrained with large amounts of unlabelled and labelled data. We further enable knowledge transfer from the text task to the speech task by training the two tasks jointly. Finally, our multilingual model is finetuned on speech translation task-specific data to achieve the best translation results. Experimental results show our system outperforms the reported systems, including both end-to-end and cascaded approaches, by a large margin. In some translation directions, our speech translation results evaluated on the public Multilingual TEDx test set are even comparable with the ones from a strong text-to-text translation system, which uses the oracle speech transcripts as input.

* Accepted by IWSLT 2021 as a system paper
