Rico Sennrich

Distributionally Robust Recurrent Decoders with Random Network Distillation

Oct 25, 2021

Antonio Valerio Miceli-Barone, Alexandra Birch, Rico Sennrich

Abstract: Neural machine learning models can successfully model language that is similar to their training distribution, but they are highly susceptible to degradation under distribution shift, which occurs in many practical applications when processing out-of-domain (OOD) text. This has been attributed to "shortcut learning": relying on weak correlations over arbitrarily large contexts. We propose a method based on OOD detection with Random Network Distillation to allow an autoregressive language model to automatically disregard OOD context during inference, smoothly transitioning towards a less expressive but more robust model as the data becomes more OOD while retaining its full context capability when operating in-distribution. We apply our method to a GRU architecture, demonstrating improvements on multiple language modeling (LM) datasets.

* 8 pages, 1 figure
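
The abstract above names Random Network Distillation (RND) as the OOD detector. A minimal sketch of the general RND idea follows, not the authors' implementation: a fixed random network is imitated by a predictor trained only on in-distribution inputs, and the prediction error serves as an OOD score. The network sizes, the exponential `context_gate`, and the `mix` interpolation between a full-context and a more robust distribution are illustrative assumptions that do not come from the paper.

```python
import math
import random

def target_net(W, x):
    """Fixed, randomly initialized network (never trained): one tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def predictor_net(V, x):
    """Trainable predictor: a single linear layer."""
    return [sum(v * xi for v, xi in zip(row, x)) for row in V]

def rnd_error(W, V, x):
    """Mean squared gap between predictor and target; a large gap suggests OOD."""
    t, p = target_net(W, x), predictor_net(V, x)
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(t)

def train_predictor(W, V, data, lr=0.05, epochs=200):
    """Fit the predictor to the fixed target on in-distribution inputs (plain SGD)."""
    for _ in range(epochs):
        for x in data:
            t, p = target_net(W, x), predictor_net(V, x)
            for i, row in enumerate(V):
                err = p[i] - t[i]
                for j in range(len(row)):
                    row[j] -= lr * err * x[j]  # gradient of 0.5 * err**2

def context_gate(error, temperature=1.0):
    """Hypothetical gate in [0, 1]: close to 1 in-distribution, close to 0 when OOD."""
    return math.exp(-error / temperature)

def mix(p_full, p_robust, gate):
    """Hypothetical interpolation between full-context and robust predictions."""
    return [gate * a + (1.0 - gate) * b for a, b in zip(p_full, p_robust)]

rng = random.Random(0)
dim, out_dim = 4, 8
W = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(out_dim)]
V = [[0.0] * dim for _ in range(out_dim)]

# Toy data: small-norm inputs play the in-distribution role, large-norm inputs are OOD.
in_dist = [[rng.gauss(0, 0.3) for _ in range(dim)] for _ in range(50)]
ood = [[rng.gauss(0, 3.0) for _ in range(dim)] for _ in range(50)]

train_predictor(W, V, in_dist)
mean_in = sum(rnd_error(W, V, x) for x in in_dist) / len(in_dist)
mean_ood = sum(rnd_error(W, V, x) for x in ood) / len(ood)
print(f"in-distribution error: {mean_in:.4f}, OOD error: {mean_ood:.4f}")
```

In this toy setup the predictor only matches the target where it was trained, so the error rises on the large-norm inputs and the hypothetical gate shifts weight towards the robust distribution.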

On the Limits of Minimal Pairs in Contrastive Evaluation

Sep 15, 2021

Jannis Vamvas, Rico Sennrich

Abstract: Minimal sentence pairs are frequently used to analyze the behavior of language models. It is often assumed that model behavior on contrastive pairs is predictive of model behavior at large. We argue that two conditions are necessary for this assumption to hold: First, a tested hypothesis should be well-motivated, since experiments show that contrastive evaluation can lead to false positives. Secondly, test data should be chosen so as to minimize distributional discrepancy between evaluation time and deployment time. For a good approximation of deployment-time decoding, we recommend that minimal pairs are created based on machine-generated text, as opposed to human-written references. We present a contrastive evaluation suite for English-German MT that implements this recommendation.

* BlackboxNLP 2021

Improving Zero-shot Cross-lingual Transfer between Closely Related Languages by injecting Character-level Noise

Sep 14, 2021

Noëmi Aepli, Rico Sennrich

Abstract: Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity, but current approaches that operate in the embedding space do not take surface similarity into account. In this work, we present a simple yet effective strategy to improve cross-lingual transfer between closely related varieties by augmenting the data of the high-resource parent language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements over several languages and tasks: zero-shot transfer of POS tagging and topic identification between language varieties from the Germanic, Uralic, and Romance language genera. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language varieties.

* dialects, language varieties
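
As a rough illustration of the character-level noise described in the abstract above, the sketch below perturbs a fraction of the words in a sentence with a single random insertion, deletion, or replacement. The function name `add_char_noise`, the `noise_rate` parameter, the choice of three operations, and the lowercase ASCII alphabet are assumptions made for this example; the abstract does not specify the exact noising scheme.

```python
import random
import string

def add_char_noise(sentence, noise_rate=0.1, alphabet=string.ascii_lowercase, rng=None):
    """Apply one random character edit to roughly `noise_rate` of the words."""
    rng = rng or random.Random()
    noisy_words = []
    for word in sentence.split():
        if rng.random() < noise_rate:
            op = rng.choice(("insert", "delete", "replace"))
            pos = rng.randrange(len(word))
            if op == "insert":
                word = word[:pos] + rng.choice(alphabet) + word[pos:]
            elif op == "delete" and len(word) > 1:
                word = word[:pos] + word[pos + 1:]
            else:
                # "replace", or "delete" on a one-character word (kept non-empty)
                word = word[:pos] + rng.choice(alphabet) + word[pos + 1:]
        noisy_words.append(word)
    return " ".join(noisy_words)

# Augment training sentences of the high-resource language with noisy copies.
rng = random.Random(13)
clean = "das ist ein kleiner test"
augmented = [clean, add_char_noise(clean, noise_rate=0.5, rng=rng)]
print(augmented)
```

Labels such as POS tags stay aligned with the noisy copy because the word count is preserved; only the spelling of individual words changes.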

Vision Matters When It Should: Sanity Checking Multimodal Machine Translation Models

Sep 08, 2021

Jiaoda Li, Duygu Ataman, Rico Sennrich

Abstract: Multimodal machine translation (MMT) systems have been shown to outperform their text-only neural machine translation (NMT) counterparts when visual context is available. However, recent studies have also shown that the performance of MMT models is only marginally impacted when the associated image is replaced with an unrelated image or noise, which suggests that the visual context might not be exploited by the model at all. We hypothesize that this might be caused by the nature of the commonly used evaluation benchmark, also known as Multi30K, where the translations of image captions were prepared without actually showing the images to human translators. In this paper, we present a qualitative study that examines the role of datasets in stimulating the use of the visual modality, and we propose methods to highlight the importance of visual signals in the datasets, which increase the models' reliance on the source images. Our findings suggest that research on effective MMT architectures is currently impaired by the lack of suitable datasets and that careful consideration must be given to the creation of future MMT datasets, for which we also provide useful insights.

* EMNLP 2021

Language Modeling, Lexical Translation, Reordering: The Training Process of NMT through the Lens of Classical SMT

Sep 03, 2021

Elena Voita, Rico Sennrich, Ivan Titov

Abstract: Unlike traditional statistical MT, which decomposes the translation task into distinct, separately learned components, neural machine translation uses a single neural network to model the entire translation process. Although neural machine translation is the de facto standard, it is still not clear how NMT models acquire different competences over the course of training, and how this mirrors the different models in traditional SMT. In this work, we look at the competences related to three core SMT components and find that during training, NMT first focuses on learning target-side language modeling, then improves translation quality approaching word-by-word translation, and finally learns more complicated reordering patterns. We show that this behavior holds for several models and language pairs. Additionally, we explain how such an understanding of the training process can be useful in practice and, as an example, show how it can be used to improve vanilla non-autoregressive neural machine translation by guiding teacher model selection.

* EMNLP 2021

How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Sep 02, 2021

Chantal Amrhein, Rico Sennrich

Abstract: Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained at the subword and character level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.

* Findings of EMNLP 2021

Revisiting Negation in Neural Machine Translation

Jul 26, 2021

Gongbo Tang, Philipp Rönchen, Rico Sennrich, Joakim Nivre

Abstract: In this paper, we evaluate the translation of negation both automatically and manually, in English-German (EN-DE) and English-Chinese (EN-ZH). We show that the ability of neural machine translation (NMT) models to translate negation has improved with deeper and more advanced networks, although the performance varies between language pairs and translation directions. The accuracy of manual evaluation in EN-DE, DE-EN, EN-ZH, and ZH-EN is 95.7%, 94.8%, 93.4%, and 91.7%, respectively. In addition, we show that under-translation is the most significant error type in NMT, which contrasts with the more diverse error profile previously observed for statistical machine translation. To better understand the root of the under-translation of negation, we study the model's information flow and training data. While our information flow analysis does not reveal any deficiencies that could be used to detect or fix the under-translation of negation, we find that negation is often rephrased during training, which could make it more difficult for the model to learn a reliable link between source and target negation. We finally conduct intrinsic analysis and extrinsic probing tasks on negation, showing that NMT models can distinguish negation and non-negation tokens very well and encode a lot of information about negation in hidden states but nevertheless leave room for improvement.

* To appear at TACL and to be presented at ACL 2021. Authors' final version

Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation

May 18, 2021

Mathias Müller, Rico Sennrich

Abstract: Neural Machine Translation (NMT) currently exhibits biases such as producing translations that are too short and overgenerating frequent words, and shows poor robustness to copy noise in training data or domain shift. Recent work has tied these shortcomings to beam search (the de facto standard inference algorithm in NMT), and Eikema & Aziz (2020) propose to use Minimum Bayes Risk (MBR) decoding on unbiased samples instead. In this paper, we empirically investigate the properties of MBR decoding on a number of previously reported biases and failure cases of beam search. We find that MBR still exhibits a length and token frequency bias, owing to the MT metrics used as utility functions, but that MBR also increases robustness against copy noise in the training data and domain shift.

* V1: ACL 2021 camera-ready
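
To make the sampling-based MBR decoding mentioned in the abstract above concrete, here is a minimal sketch: every sampled translation is scored by its average utility against all samples (which double as pseudo-references), and the highest-scoring one is returned. The `unigram_f1` utility is a toy stand-in chosen for this example, not one of the MT metrics used in the paper, and the function names are illustrative.

```python
from collections import Counter

def unigram_f1(hypothesis, reference):
    """Toy utility: F1 over whitespace-token overlap (stand-in for an MT metric)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, utility=unigram_f1):
    """Return the sample with the highest average utility against all samples."""
    best, best_score = None, float("-inf")
    for candidate in samples:
        score = sum(utility(candidate, ref) for ref in samples) / len(samples)
        if score > best_score:  # first candidate wins ties
            best, best_score = candidate, score
    return best

# In practice the samples would be drawn from the NMT model for one source sentence.
samples = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
print(mbr_decode(samples))  # → the cat sat
```

The chosen output is the sample that agrees most with the others under the utility, which is one way to see why properties of the metric, such as a preference for certain lengths, can carry over to the decoder.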

Sparse Attention with Linear Units

Apr 14, 2021

Biao Zhang, Ivan Titov, Rico Sennrich

Abstract: Recently, it has been argued that encoder-decoder models can be made more interpretable by replacing the softmax function in the attention with its sparse variants. In this work, we introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU, and show that sparsity naturally emerges from such a formulation. Training stability is achieved with layer normalization with either a specialized initialization or an additional gating function. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. We apply ReLA to the Transformer and conduct experiments on five machine translation tasks. ReLA achieves translation performance comparable to several strong baselines, with training and decoding speed similar to that of the vanilla attention. Our analysis shows that ReLA delivers a high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment than recent sparsified softmax-based models. Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off') for some queries, which is not possible with sparsified softmax alternatives.
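
The core substitution described in the abstract above (ReLU in place of softmax over scaled dot-product scores) can be sketched in a few lines. This is a single-head toy version in plain Python under stated assumptions: the output normalization is a bare RMS normalization without learned parameters, standing in for the layer normalization the abstract mentions, and the specialized initialization and gating function are omitted.

```python
import math

def relu_attention(queries, keys, values, eps=1e-6):
    """Scaled dot-product attention with ReLU instead of softmax (single head)."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # ReLU zeroes out negative scores, so weights are sparse and need not sum to 1.
        weights = [
            max(0.0, sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d))
            for k in keys
        ]
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        # Assumed stand-in for the paper's normalization: RMS-normalize the output.
        rms = math.sqrt(sum(c * c for c in context) / len(context) + eps)
        outputs.append([c / rms for c in context])
        all_weights.append(weights)
    return outputs, all_weights

queries = [[1.0, 0.0], [-1.0, -1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = relu_attention(queries, keys, values)
print(weights)  # the second query has all-zero weights: it attends to nothing
```

The second query has only negative scores, so every weight is clipped to zero and the head effectively switches off for it, which a softmax-based distribution cannot do because its weights always sum to one.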

On Biasing Transformer Attention Towards Monotonicity

Apr 08, 2021

Annette Rios, Chantal Amrhein, Noëmi Aepli, Rico Sennrich

Abstract: Many sequence-to-sequence tasks in natural language processing are roughly monotonic in the alignment between source and target sequence, and previous work has facilitated or enforced learning of monotonic attention behavior via specialized attention functions or pretraining. In this work, we introduce a monotonicity loss function that is compatible with standard attention mechanisms and test it on several sequence-to-sequence tasks: grapheme-to-phoneme conversion, morphological inflection, transliteration, and dialect normalization. Experiments show that we can achieve largely monotonic behavior. Performance is mixed, with larger gains on top of RNN baselines. General monotonicity does not benefit transformer multihead attention; however, we see isolated improvements when only a subset of heads is biased towards monotonic behavior.

* To be published in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2021)
