Omer Levy

Aligned Cross Entropy for Non-Autoregressive Machine Translation

Apr 03, 2020
Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy

Non-autoregressive machine translation models significantly speed up decoding by allowing for parallel prediction of the entire target sequence. However, modeling word order is more challenging due to the lack of autoregressive factors in the model. This difficulty is compounded during training with cross entropy loss, which can highly penalize small shifts in word order. In this paper, we propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models. AXE uses a differentiable dynamic program to assign loss based on the best possible monotonic alignment between target tokens and model predictions. AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks, while setting a new state of the art for non-autoregressive models.
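
To make the idea of a loss based on the best monotonic alignment concrete, here is a minimal sketch of such a dynamic program. It is not the authors' implementation: the three moves (align, skip a prediction by charging a blank token, skip a target), the `blank_id` token, and the `delta` weight are assumptions made for illustration, since the abstract only states that the loss follows the best monotonic alignment. The sketch computes the loss value with NumPy and does not track gradients.

```python
import numpy as np

def aligned_nll(log_probs, target, blank_id, delta=1.0):
    """Minimum negative log-likelihood over monotonic alignments.

    log_probs: array of shape (m, vocab), log-probabilities per prediction slot.
    target:    list of n target token ids.
    blank_id:  hypothetical "emit nothing" token charged when a slot is skipped.
    delta:     hypothetical weight for a target that reuses the current slot.
    """
    m = log_probs.shape[0]
    n = len(target)
    # A[i, j] = best cost of explaining the first i targets with the first j slots.
    A = np.full((n + 1, m + 1), np.inf)
    A[0, 0] = 0.0
    for j in range(1, m + 1):
        # Slots before the first target can only be skipped.
        A[0, j] = A[0, j - 1] - log_probs[j - 1, blank_id]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            token_cost = -log_probs[j - 1, target[i - 1]]
            align = A[i - 1, j - 1] + token_cost
            skip_prediction = A[i, j - 1] - log_probs[j - 1, blank_id]
            skip_target = A[i - 1, j] + delta * token_cost
            A[i, j] = min(align, skip_prediction, skip_target)
    return A[n, m]
```

With this recurrence, a prediction that is correct but shifted by one position pays only for the skipped slot instead of being penalized at every position.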

Semi-Autoregressive Training Improves Mask-Predict Decoding

Jan 23, 2020
Marjan Ghazvininejad, Omer Levy, Luke Zettlemoyer

The recently proposed mask-predict decoding algorithm has narrowed the performance gap between semi-autoregressive machine translation models and the traditional left-to-right approach. We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict, producing training examples that contain model predictions as part of their inputs. Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.

Improving Transformer Models by Reordering their Sublayers

Nov 10, 2019
Ofir Press, Noah A. Smith, Omer Levy

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern achieve better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer design pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on the WikiText-103 language modeling benchmark, at no cost in parameters, memory, or training time.
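
As an illustration of the sublayer orderings discussed above, the sketch below writes an ordering as a string of `s` (self-attention) and `f` (feedforward) characters. The function names and the single parameter `k` are assumptions for illustration; the abstract states only that successful orderings have more self-attention at the bottom and more feedforward sublayers at the top.

```python
def interleaved_order(n):
    """Baseline: n self-attention/feedforward pairs, strictly alternating."""
    return "sf" * n

def sandwich_order(n, k):
    """Move k self-attention sublayers to the bottom and k feedforward
    sublayers to the top, keeping the middle interleaved.

    The total number of each sublayer type is unchanged, so the parameter
    count matches the interleaved baseline.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return "s" * k + "sf" * (n - k) + "f" * k
```

Setting `k = 0` recovers the interleaved baseline, which makes it easy to compare orderings under an otherwise identical configuration.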

Blockwise Self-Attention for Long Document Understanding

Nov 07, 2019
Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, Jie Tang

We present BlockBERT, a lightweight and efficient BERT model that is designed to better model long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training time, which also enables attention heads to capture either short- or long-range contextual information. We conduct experiments on several benchmark question answering datasets with various paragraph lengths. Results show that BlockBERT uses 18.7-36.1% less memory and reduces the training time by 12.0-25.1%, while having comparable and sometimes better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.
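
A sparse block structure in the attention matrix can be sketched as a boolean mask. The version below is an assumption-laden illustration, not the BlockBERT code: it splits the sequence into equal blocks and uses a hypothetical `perm` argument to say which key block each query block may attend to.

```python
import numpy as np

def block_mask(seq_len, n_blocks, perm):
    """Boolean (seq_len, seq_len) mask; True means attention is allowed.

    Queries in block b may attend only to keys in block perm[b].
    An identity perm gives local (short-range) attention; any other
    permutation lets a head look at a different, possibly distant, block.
    """
    if seq_len % n_blocks != 0:
        raise ValueError("seq_len must be divisible by n_blocks")
    if sorted(perm) != list(range(n_blocks)):
        raise ValueError("perm must be a permutation of the block indices")
    block_of = np.arange(seq_len) // (seq_len // n_blocks)
    allowed_key_block = np.asarray(perm)[block_of]
    return allowed_key_block[:, None] == block_of[None, :]
```

Each row of the mask allows `seq_len / n_blocks` keys, so only a `1 / n_blocks` fraction of the full attention matrix is active, which is where the memory saving in such a scheme would come from.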

Generalization through Memorization: Nearest Neighbor Language Models

Nov 01, 2019
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis

We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this augmentation to a strong WikiText-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79, a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.
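
The linear interpolation described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the softmax over negative distances, the `temperature` parameter, and the weight name `lam` are choices made here, and the neighbor search itself is assumed to have already returned distances and the tokens that followed each retrieved context.

```python
import numpy as np

def knn_distribution(distances, neighbor_tokens, vocab_size, temperature=1.0):
    """Turn retrieved neighbors into a distribution over the vocabulary.

    Closer neighbors get more weight; weights of neighbors that share the
    same next token are summed.
    """
    logits = -np.asarray(distances, dtype=float) / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, np.asarray(neighbor_tokens), weights)
    return p_knn

def interpolate(p_lm, p_knn, lam):
    """Linear interpolation of the LM and nearest-neighbor distributions."""
    return lam * np.asarray(p_knn) + (1.0 - lam) * np.asarray(p_lm)
```

Because only the datastore behind `distances` and `neighbor_tokens` changes, swapping in text from another domain requires no update to the LM parameters.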

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Oct 29, 2019
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also report ablation experiments that replicate other pretraining schemes within the BART framework, to better measure which factors most influence end-task performance.
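
The two noising operations highlighted above, sentence shuffling and in-filling, can be sketched as follows. This is a minimal illustration, not BART's preprocessing code: the `<mask>` string is a placeholder, and the spans are passed in explicitly because the abstract does not say how span positions and lengths are sampled.

```python
import random

MASK = "<mask>"  # placeholder mask symbol, an assumption for this sketch

def infill(tokens, spans):
    """Replace each (start, length) span of tokens with a single MASK.

    The model sees one mask per span regardless of the span's length, so it
    must also infer how many tokens are missing.
    """
    out, pos = [], 0
    for start, length in sorted(spans):
        if start < pos:
            raise ValueError("spans must not overlap")
        out.extend(tokens[pos:start])
        out.append(MASK)
        pos = start + length
    out.extend(tokens[pos:])
    return out

def shuffle_sentences(sentences, rng):
    """Return the sentences of a document in a random order."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled
```

The training pair would then be the corrupted sequence as encoder input and the original text as the reconstruction target.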

Structural Language Models for Any-Code Generation

Sep 30, 2019
Uri Alon, Roy Sadaka, Omer Levy, Eran Yahav

We address the problem of Any-Code Generation (AnyGen): generating code without any restriction on the vocabulary or structure. The state of the art for this problem is the sequence-to-sequence (seq2seq) approach, which treats code as a sequence and does not leverage any structural information. We introduce a new approach to AnyGen, structural language modeling (SLM), that leverages the strict syntax of programming languages to model a code snippet as a tree. SLM estimates the probability of the program's abstract syntax tree (AST) by decomposing it into a product of conditional probabilities over its nodes. We present a neural model that computes these conditional probabilities by considering all AST paths leading to a target node. Unlike previous structural techniques that have severely restricted the kinds of expressions that can be generated, our approach can generate arbitrary expressions in any programming language. Our model significantly outperforms both seq2seq and a variety of existing structured approaches in generating Java and C# code. We make our code, datasets, and models available online.
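
The decomposition over AST nodes can be written out as a chain-rule factorization. The notation is an assumption for illustration: $a_1, \dots, a_{|\mathcal{A}|}$ denotes the nodes of the tree $\mathcal{A}$ in some fixed generation order, and $a_{<t}$ the nodes generated before $a_t$; the abstract does not fix a particular order or symbols.

```latex
P(\mathcal{A}) \;=\; \prod_{t=1}^{|\mathcal{A}|} P\left(a_t \mid a_{<t}\right)
```

Under this reading, the neural model's job is to estimate each factor $P(a_t \mid a_{<t})$, which the abstract describes as conditioning on the AST paths that lead to the target node.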

BERT for Coreference Resolution: Baselines and Analysis

Sep 01, 2019
Mandar Joshi, Omer Levy, Daniel S. Weld, Luke Zettlemoyer

We apply BERT to coreference resolution, achieving strong improvements on the OntoNotes (+3.9 F1) and GAP (+11.5 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but distinct entities (e.g., President and CEO). However, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing. Our code and models are publicly available.

* EMNLP 2019 camera ready version

SpanBERT: Improving Pre-training by Representing and Predicting Spans

Jul 31, 2019
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy

We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even show gains on GLUE.
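
Masking contiguous spans rather than individual tokens, step (1) above, can be sketched as below. The sampling details are assumptions for illustration: span lengths are drawn uniformly up to a hypothetical `max_span`, and spans are added until a `mask_ratio` budget is met; the abstract does not specify the length distribution or the budget.

```python
import random

def sample_span_mask(n_tokens, mask_ratio, max_span, rng):
    """Return sorted token positions to mask, chosen as contiguous spans.

    Requires 0 < mask_ratio <= 1. Spans are sampled until exactly
    round(n_tokens * mask_ratio) positions (at least one) are covered.
    """
    if not 0 < mask_ratio <= 1:
        raise ValueError("mask_ratio must be in (0, 1]")
    budget = max(1, round(n_tokens * mask_ratio))
    masked = set()
    while len(masked) < budget:
        # Never sample a span longer than the remaining budget.
        length = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randint(0, n_tokens - length)
        masked.update(range(start, start + length))
    return sorted(masked)
```

Step (2), predicting the span's content from its boundary representations, would operate on the positions just outside each sampled span and is not shown here.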

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Jul 26, 2019
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
