Isaac Caswell

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Oct 29, 2020

Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna

Abstract: Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.

* Accepted to COLING 2020. 9 pages with 8 page abstract
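
The wordlist-based tunable-precision filter mentioned in the abstract could look roughly like the sketch below. This is an illustration under stated assumptions, not the authors' implementation: the function name `wordlist_filter`, the whitespace tokenization, and the rule "keep a sentence when the fraction of tokens found in the wordlist reaches a threshold" are all hypothetical. Raising the threshold trades recall for precision.

```python
def wordlist_filter(sentences, wordlist, threshold=0.5):
    """Keep sentences whose share of in-wordlist tokens is >= threshold.

    Hypothetical sketch: whitespace tokenization and this scoring rule
    are assumptions, not taken from the paper.
    """
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            continue  # skip empty lines rather than divide by zero
        hits = sum(1 for token in tokens if token in wordlist)
        if hits / len(tokens) >= threshold:
            kept.append(sentence)
    return kept


# Toy example with a made-up three-word list.
toy_wordlist = {"the", "cat", "sat"}
toy_sentences = ["The cat sat", "le chat noir", "the dog"]
loose = wordlist_filter(toy_sentences, toy_wordlist, threshold=0.5)
strict = wordlist_filter(toy_sentences, toy_wordlist, threshold=0.9)
```

With the toy data, the looser threshold also keeps the half-matching sentence, while the stricter one keeps only the fully matching sentence.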

BLEU might be Guilty but References are not Innocent

Apr 13, 2020

Markus Freitag, David Grangier, Isaac Caswell

Abstract: The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high-quality output, and present an alternative multi-reference formulation that is more effective.

Translationese as a Language in "Multilingual" NMT

Nov 10, 2019

Parker Riley, Isaac Caswell, Markus Freitag, David Grangier

Abstract: Machine translation has an undesirable propensity to produce "translationese" artifacts, which can lead to higher BLEU scores while being liked less by human raters. Motivated by this, we model translationese and original (i.e. natural) text as separate languages in a multilingual model, and pose the question: can we perform zero-shot translation between original source text and original target text? There is no data with original source and original target, so we train sentence-level classifiers to distinguish translationese from original target text, and use this classifier to tag the training data for an NMT model. Using this technique we bias the model to produce more natural outputs at test time, yielding gains in human evaluation scores on both accuracy and fluency. Additionally, we demonstrate that it is possible to bias the model to produce translationese and game the BLEU score, increasing it while decreasing human-rated quality. We analyze these models using metrics to measure the degree of translationese in the output, and present an analysis of the capriciousness of heuristically-based train-data tagging.

Investigating Multilingual NMT Representations at Scale

Sep 11, 2019

Sneha Reddy Kudugunta, Ankur Bapna, Isaac Caswell, Naveen Arivazhagan, Orhan Firat

Abstract: Multilingual Neural Machine Translation (NMT) models have yielded large empirical success in transfer learning settings. However, these black-box representations are poorly understood, and their mode of transfer remains elusive. In this work, we attempt to understand massively multilingual NMT representations (with 103 languages) using Singular Value Canonical Correlation Analysis (SVCCA), a representation similarity framework that allows us to compare representations across different languages, layers and models. Our analysis validates several empirical results and long-standing intuitions, and unveils new observations regarding how representations evolve in a multilingual translation model. We draw three major conclusions from our analysis, with implications for cross-lingual transfer learning: (i) Encoder representations of different languages cluster based on linguistic similarity, (ii) Representations of a source language learned by the encoder are dependent on the target language, and vice-versa, and (iii) Representations of high-resource and/or linguistically similar languages are more robust when fine-tuning on an arbitrary language pair, which is critical to determining how much cross-lingual transfer can be expected in a zero or few-shot setting. We further connect our findings with existing empirical observations in multilingual NMT and transfer learning.

* Paper at EMNLP 2019

Learning a Multitask Curriculum for Neural Machine Translation

Aug 28, 2019

Wei Wang, Ye Tian, Jiquan Ngiam, Yinfei Yang, Isaac Caswell, Zarana Parekh

Abstract: Existing curriculum learning research in neural machine translation (NMT) mostly focuses on a single final task such as selecting data for a domain or for denoising, and considers in-task example selection. This paper studies the data selection problem in a multitask setting. We present a method to learn a multitask curriculum on a single, diverse, potentially noisy training dataset. It computes multiple data selection scores for each training example, each score measuring how useful the example is to a certain task. It uses Bayesian optimization to learn a linear weighting of these per-instance scores, and then sorts the data to form a curriculum. We experiment with three domain translation tasks: two specific domains and the general domain, and demonstrate that the learned multitask curriculum delivers results close to individually optimized models and brings solid gains over no curriculum training, across all test sets.

* 12 pages
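
The scoring-and-sorting step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `curriculum_order` is hypothetical, the weights are passed in as given (the Bayesian optimization that learns them is omitted), and placing higher combined scores first is an assumption about the sort direction.

```python
def curriculum_order(examples, scores, weights):
    """Order examples by a linear weighting of their per-task scores.

    examples: list of training examples
    scores:   one row per example, one column per task
    weights:  one weight per task (assumed already learned elsewhere)
    Assumption: a higher combined score means the example comes earlier.
    """
    combined = [
        sum(w * s for w, s in zip(weights, row))
        for row in scores
    ]
    order = sorted(range(len(examples)), key=lambda i: combined[i], reverse=True)
    return [examples[i] for i in order]


# Toy data: three examples scored for two hypothetical tasks.
toy_examples = ["a", "b", "c"]
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
first_task_only = curriculum_order(toy_examples, toy_scores, [1.0, 0.0])
second_task_only = curriculum_order(toy_examples, toy_scores, [0.0, 1.0])
```

Changing the weight vector reorders the same data, which is what lets a single dataset serve several tasks.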

Tagged Back-Translation

Jun 15, 2019

Isaac Caswell, Ciprian Chelba, David Grangier

Abstract: Recent work in Neural Machine Translation (NMT) has shown significant quality gains from noised-beam decoding during back-translation, a method to generate synthetic parallel data. We show that the main role of such synthetic noise is not to diversify the source side, as previously suggested, but simply to indicate to the model that the given source is synthetic. We propose a simpler alternative to noising techniques, consisting of tagging back-translated source sentences with an extra token. Our results on WMT outperform noised back-translation in English-Romanian and match performance on English-German, re-defining state-of-the-art in the former.

* Accepted as oral presentation in WMT 2019; 9 pages; 9 tables; 1 figure
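
The tagging idea in the abstract (marking back-translated source sentences with an extra token) can be sketched as below. This is an illustrative sketch only: the tag string `<BT>`, the choice to prepend it, and the function name `build_training_pairs` are assumptions, not details confirmed by the listing.

```python
BT_TAG = "<BT>"  # hypothetical reserved token; the actual tag is an assumption


def build_training_pairs(genuine_pairs, back_translated_pairs, tag=BT_TAG):
    """Combine genuine and back-translated (source, target) pairs.

    Only the synthetic sources get the tag, so a model trained on the
    result can tell which sources were machine-generated.
    """
    tagged = [(f"{tag} {src}", tgt) for src, tgt in back_translated_pairs]
    return list(genuine_pairs) + tagged


# Toy example: one genuine pair and one back-translated pair.
genuine = [("Guten Morgen", "Good morning")]
synthetic = [("Wie geht es", "How are you")]
training_pairs = build_training_pairs(genuine, synthetic)
```

Genuine pairs pass through unchanged; at inference time, untagged input would then be treated as natural source text.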

Dynamically Composing Domain-Data Selection with Clean-Data Selection by "Co-Curricular Learning" for Neural Machine Translation

Jun 03, 2019

Wei Wang, Isaac Caswell, Ciprian Chelba

Abstract: Noise and domain are important aspects of data quality for neural machine translation. Existing research focuses separately on domain-data selection, clean-data selection, or their static combination, leaving the dynamic interaction across them not explicitly examined. This paper introduces a "co-curricular learning" method to compose dynamic domain-data selection with dynamic clean-data selection, for transfer learning across both capabilities. We apply an EM-style optimization procedure to further refine the "co-curriculum". Experimental results and analysis with two domains demonstrate the effectiveness of the method and the properties of data scheduled by the co-curriculum.

* The 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019)
* 11 pages

Text Repair Model for Neural Machine Translation

Apr 09, 2019

Markus Freitag, Isaac Caswell, Scott Roy

Abstract: In this work, we train a text repair model as a post-processor for Neural Machine Translation (NMT). The goal of the repair model is to correct typical errors introduced by the translation process, and convert the "translationese" output into natural text. The repair model is trained on monolingual data that has been round-trip translated through English, to mimic errors that are similar to the ones introduced by NMT. Having a trained repair model, we apply it to the output of existing NMT systems. We run experiments for both the WMT18 English to German and the WMT16 English to Romanian tasks. Furthermore, we apply the repair model on the output of the top submissions of the most recent WMT evaluation campaigns. We see quality improvements on all tasks of up to 2.5 BLEU points.

Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling

Feb 21, 2019

Jonathan Shen, Patrick Nguyen, Yonghui Wu, Zhifeng Chen, Mia X. Chen, Ye Jia, Anjuli Kannan, Tara Sainath, Yuan Cao, Chung-Cheng Chiu (+81 more)

Abstract: Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models. Lingvo models are composed of modular building blocks that are flexible and easily extensible, and experiment configurations are centralized and highly customizable. Distributed training and quantized inference are supported directly within the framework, and it contains existing implementations of a large number of utilities, helper functions, and the newest research ideas. Lingvo has been used in collaboration by dozens of researchers in more than 20 papers over the last two years. This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase the capabilities of the framework.
