Atsushi Fujita

Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

Jun 29, 2021
Benjamin Marie, Atsushi Fujita, Raphael Rubino

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated the MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have changed dramatically during the past decade and follow concerning trends. An increasing number of MT evaluations rely exclusively on differences between BLEU scores to draw conclusions, without performing any statistical significance testing or human evaluation, even though at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy automatic metric scores from previous work and compare them to claim the superiority of a method or an algorithm, without confirming that exactly the same training, validation, and test data were used or that the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation, along with a simple meta-evaluation scoring method to assess its credibility.
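
As a purely illustrative aside, the sketch below shows the kind of check the abstract says is often skipped: comparing two systems with paired bootstrap resampling over corpus BLEU (here via the sacreBLEU Python API) instead of trusting a raw score difference. The file names and resampling count are placeholders, and the test is a standard technique, not the paper's own meta-evaluation scoring method.

```python
# Paired bootstrap resampling over corpus BLEU; file names are illustrative.
import random
import sacrebleu

def paired_bootstrap(refs, sys_a, sys_b, n_samples=1000, seed=0):
    """Estimate how often system A beats system B on resampled test sets."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        r = [[refs[i] for i in idx]]
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        if sacrebleu.corpus_bleu(a, r).score > sacrebleu.corpus_bleu(b, r).score:
            wins_a += 1
    return wins_a / n_samples  # rough estimate of p(A > B)

refs  = [line.strip() for line in open("ref.txt")]
sys_a = [line.strip() for line in open("system_a.txt")]
sys_b = [line.strip() for line in open("system_b.txt")]
print("BLEU A:", sacrebleu.corpus_bleu(sys_a, [refs]).score)
print("BLEU B:", sacrebleu.corpus_bleu(sys_b, [refs]).score)
print("p(A beats B):", paired_bootstrap(refs, sys_a, sys_b))
```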

* Camera-ready for ACL2021 

Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation

Jun 18, 2021
Raj Dabre, Atsushi Fujita

In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous-space representations, which in turn improve the quality of the network's predictions. Conventionally, each layer in the stack has its own parameters, which leads to a significant increase in the number of model parameters. In this paper, we propose to share parameters across all layers, thereby obtaining a recurrently stacked neural network model. We report on an extensive case study on neural machine translation (NMT), in which we apply our proposed method to an encoder-decoder based neural network model, namely the Transformer model, and experiment with three Japanese--English translation datasets. We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer six times, despite having significantly fewer parameters, approaches that of a model that stacks six layers where each layer has different parameters. We also explore the limits of recurrent stacking by training extremely deep NMT models. This paper further examines the utility of our recurrently stacked model as a student model in transfer learning via pre-trained parameters and knowledge distillation, and shows that transfer learning compensates for the drop in translation quality incurred by training a recurrently stacked model directly. We also show how transfer learning enables faster decoding on top of the parameter reduction already achieved by recurrent stacking. Finally, we analyze the effects of recurrently stacked layers by visualizing the attention of models with and without recurrent stacking.
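
The core idea fits in a short sketch. The following is a minimal PyTorch illustration, not the authors' implementation: a single Transformer encoder layer is reused six times, so the "six-layer" encoder carries only one layer's worth of parameters; the hyper-parameters are placeholders.

```python
# Recurrent stacking: one shared layer applied num_steps times.
import torch
import torch.nn as nn

class RecurrentlyStackedEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_steps=6):
        super().__init__()
        # A single shared layer instead of six distinct layers.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_steps = num_steps

    def forward(self, x, mask=None):
        for _ in range(self.num_steps):  # reuse the same parameters at every depth
            x = self.shared_layer(x, src_key_padding_mask=mask)
        return x

enc = RecurrentlyStackedEncoder()
out = enc(torch.randn(2, 10, 512))                # (batch, length, d_model)
print(sum(p.numel() for p in enc.parameters()))   # roughly 1/6 of a vanilla 6-layer encoder
```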

* 22 pages. Under review. Work in progress. Extended version of https://ojs.aaai.org//index.php/AAAI/article/view/4590, which is an extension of arXiv:1807.05353. The focus is on analyzing the limitations of recurrently stacked layers and methods to overcome these limitations 

Understanding Pre-Editing for Black-Box Neural Machine Translation

Feb 05, 2021
Rei Miyata, Atsushi Fujita

Pre-editing is the process of modifying the source text (ST) so that it can be translated by machine translation (MT) with better quality. Despite the unpredictability of black-box neural MT (NMT), pre-editing has been deployed in various practical MT use cases. Although many studies have demonstrated the effectiveness of pre-editing methods for particular settings, a deep understanding of what pre-editing is and how it works for black-box NMT is still lacking. To build such an understanding, we extensively investigated human pre-editing practices. We first implemented a protocol to incrementally record the minimum edits for each ST and collected 6,652 instances of pre-editing across three translation directions, two MT systems, and four text domains. We then analysed the instances from three perspectives: the characteristics of the pre-edited STs, the diversity of pre-editing operations, and the impact of the pre-editing operations on NMT outputs. Our findings include the following: (1) making the meaning and syntactic structure of an ST more explicit is more important for obtaining better translations than making the ST shorter and simpler, and (2) although the impact of pre-editing on NMT is generally unpredictable, the changes in the NMT outputs show some tendencies that depend on the type of editing operation.
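
As a hedged, purely illustrative aside (not the authors' recording protocol), one simple way to characterise a collected pre-editing instance is to extract the token-level edit operations between the original and pre-edited ST; the example sentences below are invented.

```python
# Token-level diff between an original ST and its pre-edited version.
import difflib

def edit_operations(original: str, pre_edited: str):
    """Return (operation, original_span, edited_span) tuples describing the edits."""
    orig_tokens = original.split()
    edit_tokens = pre_edited.split()
    matcher = difflib.SequenceMatcher(a=orig_tokens, b=edit_tokens)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, " ".join(orig_tokens[i1:i2]), " ".join(edit_tokens[j1:j2])))
    return ops

# Toy example with invented sentences:
print(edit_operations("He gave it up finally .", "Finally , he abandoned it ."))
```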

* Accepted at EACL 2021 

Synthesizing Monolingual Data for Neural Machine Translation

Jan 29, 2021
Benjamin Marie, Atsushi Fujita

In neural machine translation (NMT), monolingual data in the target language are usually exploited through a method called "back-translation" to synthesize additional parallel training data. The synthetic data have been shown to help train better NMT systems, especially for low-resource language pairs and domains. Nonetheless, large monolingual data in the target domains or languages are not always available for generating large synthetic parallel data. In this work, we propose a new method to generate large synthetic parallel data by leveraging very small monolingual data in a specific domain. We fine-tune a pre-trained GPT-2 model on such small in-domain monolingual data and use the resulting model to generate a large amount of synthetic in-domain monolingual data. Then, we perform back-translation, or forward translation, to generate synthetic in-domain parallel data. Our preliminary experiments on three language pairs and five domains show that our method effectively generates fully synthetic yet useful in-domain parallel data that improve NMT in all configurations. We also show promising results in extreme adaptation for personalized NMT.
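
A minimal sketch of the first two steps of this pipeline is given below, assuming the Hugging Face transformers library; the file name, hyper-parameters, and sampling settings are illustrative rather than those used in the paper.

```python
# Fine-tune GPT-2 on a small in-domain corpus, then sample synthetic monolingual data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.train()
optim = torch.optim.AdamW(model.parameters(), lr=5e-5)

lines = [l.strip() for l in open("in_domain.txt") if l.strip()]  # placeholder file
for epoch in range(3):                       # a few passes over the tiny corpus
    for line in lines:
        ids = tok(line + tok.eos_token, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss   # standard LM loss on in-domain text
        loss.backward()
        optim.step()
        optim.zero_grad()

# Sample synthetic in-domain monolingual sentences from the tuned model.
model.eval()
with torch.no_grad():
    out = model.generate(
        tok(tok.bos_token, return_tensors="pt").input_ids,
        do_sample=True, top_p=0.9, max_length=60,
        num_return_sequences=5, pad_token_id=tok.eos_token_id,
    )
for seq in out:
    print(tok.decode(seq, skip_special_tokens=True))
# These sentences would then be back-translated (or forward-translated) with an
# existing NMT system to build the synthetic in-domain parallel data.
```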

* Preliminary work 

Softmax Tempering for Training Neural Machine Translation Models

Sep 20, 2020
Raj Dabre, Atsushi Fujita

Neural machine translation (NMT) models are typically trained using a softmax cross-entropy loss in which the softmax distribution is compared against smoothed gold labels. In low-resource scenarios, NMT models tend to overfit because the softmax distribution quickly approaches the gold label distribution. To address this issue, we propose to divide the logits by a temperature coefficient prior to applying softmax during training. In our experiments on 11 language pairs in the Asian Language Treebank dataset and the WMT 2019 English-to-German translation task, we observed significant improvements in translation quality of up to 3.9 BLEU points. Furthermore, softmax tempering makes greedy search as good as beam search decoding in terms of translation quality, enabling a 1.5- to 3.5-fold speed-up. We also study the impact of softmax tempering on multilingual NMT and recurrently stacked NMT, both of which reduce the NMT model size through parameter sharing, thereby verifying the utility of temperature in developing compact NMT models. Finally, an analysis of softmax entropies and gradients reveals the impact of our method on the internal behavior of NMT models.
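
The method itself amounts to a one-line change in the training loss. The sketch below shows the idea in PyTorch; the temperature and label-smoothing values are illustrative, not the tuned settings from the paper.

```python
# Softmax tempering: divide the logits by a temperature T > 1 before the
# softmax cross-entropy loss during training.
import torch
import torch.nn.functional as F

def tempered_loss(logits, targets, temperature=2.0, label_smoothing=0.1):
    """Cross-entropy over temperature-scaled logits.

    logits: (batch, vocab) raw decoder outputs; targets: (batch,) gold token ids.
    Dividing by T flattens the predicted distribution, which slows the fit to the
    gold labels and counteracts over-fitting in low-resource settings.
    """
    return F.cross_entropy(logits / temperature, targets, label_smoothing=label_smoothing)

logits = torch.randn(8, 32000, requires_grad=True)  # toy batch, 32k vocabulary
targets = torch.randint(0, 32000, (8,))
loss = tempered_loss(logits, targets)
loss.backward()
print(loss.item())
```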

* The paper is about prediction smoothing for improving sequence-to-sequence performance. Related to but not the same as label smoothing. Work in progress. Updates with deeper analyses and comparisons to related methods to follow. Rejected from EMNLP 2020 

Balancing Cost and Benefit with Tied-Multi Transformers

Feb 20, 2020
Raj Dabre, Raphael Rubino, Atsushi Fujita

We propose and evaluate a novel procedure for training multiple Transformers with tied parameters, which compresses multiple models into one and enables the number of encoder and decoder layers to be chosen dynamically during decoding. In sequence-to-sequence modeling, typically, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute the loss. Instead, our method computes a single loss consisting of NxM losses, where each loss is computed from the output of one of the M decoder layers connected to one of the N encoder layers. Such a model subsumes NxM models with different numbers of encoder and decoder layers and can be used for decoding with fewer than the maximum number of encoder and decoder layers. We then propose a mechanism to choose, a priori, the number of encoder and decoder layers for faster decoding, and also explore recurrent stacking of layers and knowledge distillation for model compression. We present a cost-benefit analysis of applying the proposed approaches to neural machine translation and show that they reduce decoding costs while preserving translation quality.
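
A hedged sketch of the training objective is given below (not the released implementation): every encoder depth is paired with every decoder depth, a loss is computed at each pairing through a shared output projection, and the NxM losses are combined. The plain sum used here, the joint source-target embedding, and the omission of positional encodings are simplifications of this sketch.

```python
# NxM losses: one loss per (encoder depth, decoder depth) pair, combined into one.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedMultiTransformer(nn.Module):
    def __init__(self, vocab, d_model=512, nhead=8, N=6, M=6):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)   # joint embedding, for brevity
        self.enc_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True) for _ in range(N))
        self.dec_layers = nn.ModuleList(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True) for _ in range(M))
        self.out = nn.Linear(d_model, vocab)      # shared softmax projection for every exit

    def forward(self, src_ids, tgt_in_ids, tgt_out_ids):
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in_ids.size(1))
        losses = []
        enc = self.emb(src_ids)
        for enc_layer in self.enc_layers:          # N encoder depths
            enc = enc_layer(enc)
            dec = self.emb(tgt_in_ids)
            for dec_layer in self.dec_layers:      # M decoder depths per encoder depth
                dec = dec_layer(dec, enc, tgt_mask=tgt_mask)
                logits = self.out(dec)
                losses.append(F.cross_entropy(
                    logits.reshape(-1, logits.size(-1)), tgt_out_ids.reshape(-1)))
        return torch.stack(losses).sum()           # single loss made of NxM losses

model = TiedMultiTransformer(vocab=8000)
loss = model(torch.randint(0, 8000, (2, 7)),
             torch.randint(0, 8000, (2, 5)),
             torch.randint(0, 8000, (2, 5)))
loss.backward()
```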

* Extended version of our previous manuscript available at arXiv:1908.10118 

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation

Jan 14, 2020
Haiyue Song, Raj Dabre, Atsushi Fujita, Sadao Kurohashi

Lectures translation is a case of spoken language translation, and publicly available parallel corpora for this purpose are lacking. To address this, we examine a language-independent framework for parallel corpus mining, which offers a quick and effective way to mine a parallel corpus from publicly available lectures on Coursera. Our approach determines sentence alignments by relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese--English lectures translation, we extracted approximately 40,000 lines of parallel data and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances translation quality when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests guidelines for gathering and cleaning corpora, mining parallel sentences, addressing noise in the mined data, and creating high-quality evaluation splits. For the sake of reproducibility, we will release our code for parallel data creation.
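
The alignment step can be sketched as follows. This is a hedged illustration rather than the released mining code: the embedding function and similarity threshold are assumptions, and the paper's combination of machine translation and sentence representations is abstracted into a single scoring function.

```python
# Greedy one-to-one sentence alignment by cosine similarity over sentence embeddings.
import numpy as np

def cosine_matrix(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def mine_pairs(src_sents, tgt_sents, embed, threshold=0.8):
    """Pair sentences whose similarity exceeds a threshold, most confident first.

    `embed` maps a list of sentences to an (n, d) array; here it stands in for
    the MT-plus-continuous-representation comparison used in the paper.
    """
    sim = cosine_matrix(embed(src_sents), embed(tgt_sents))
    pairs, used_tgt = [], set()
    for i in np.argsort(-sim.max(axis=1)):          # most confident source sentences first
        j = int(np.argmax(sim[i]))
        if sim[i, j] >= threshold and j not in used_tgt:
            pairs.append((src_sents[i], tgt_sents[j], float(sim[i, j])))
            used_tgt.add(j)
    return pairs

# Toy usage with random vectors standing in for real sentence embeddings:
rng = np.random.default_rng(0)
toy_embed = lambda sents: rng.standard_normal((len(sents), 16))
print(mine_pairs(["a", "b"], ["x", "y", "z"], toy_embed, threshold=-1.0))
```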

* 10 pages, 1 figure, 9 tables, under review by LREC2020 

Multi-Layer Softmaxing during Training Neural Machine Translation for Flexible Decoding with Fewer Layers

Aug 28, 2019
Raj Dabre, Atsushi Fujita

This paper proposes a novel procedure for training an encoder-decoder based deep neural network which compresses NxM models into a single model, enabling us to dynamically choose the number of encoder and decoder layers for decoding. Usually, the output of the last layer of the N-layer encoder is fed to the M-layer decoder, and the output of the last decoder layer is used to compute the softmax loss. Instead, our method computes a single loss consisting of NxM losses: the softmax loss for the output of each of the M decoder layers derived using the output of each of the N encoder layers. A single model trained by our method can be used for decoding with an arbitrary, smaller number of encoder and decoder layers. In practical scenarios, this (a) enables faster decoding with insignificant losses in translation quality and (b) alleviates the need to train NxM separate models, thereby saving space. We take neural machine translation as a case study, demonstrate the advantages of our approach, and provide a cost-benefit analysis.
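
The decode-time flexibility can be sketched briefly. The model layout (enc_layers/dec_layers lists with a shared output projection) follows the tied-multi sketch shown earlier on this page and is an assumption of this illustration, not the released code.

```python
# Run only the first n_enc encoder layers and m_dec decoder layers of a model
# trained with multi-layer softmaxing; the full stacks stay in the checkpoint.
import torch

@torch.no_grad()
def encode_decode_prefix(model, src_ids, tgt_in_ids, n_enc, m_dec):
    """Forward pass through a chosen (n_enc, m_dec) sub-model."""
    enc = model.emb(src_ids)
    for layer in model.enc_layers[:n_enc]:
        enc = layer(enc)
    dec = model.emb(tgt_in_ids)
    for layer in model.dec_layers[:m_dec]:
        dec = layer(dec, enc)
    return model.out(dec)   # logits from the shallower, faster sub-model

# Example (assuming the TiedMultiTransformer sketch above):
# logits = encode_decode_prefix(model, src_ids, tgt_in_ids, n_enc=3, m_dec=2)
```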

* Fixed numeric typos and corresponding explanations in the running text in the paper 

Exploiting Out-of-Domain Parallel Data through Multilingual Transfer Learning for Low-Resource Neural Machine Translation

Jul 06, 2019
Aizhan Imankulova, Raj Dabre, Atsushi Fujita, Kenji Imamura

This paper proposes a novel multilingual multistage fine-tuning approach for low-resource neural machine translation (NMT), taking a challenging Japanese--Russian pair for benchmarking. Although there are many solutions for low-resource scenarios, such as multilingual NMT and back-translation, we have empirically confirmed their limited success when restricted to in-domain data. We therefore propose to exploit out-of-domain data through transfer learning: we first use it to train a multilingual NMT model, which is then fine-tuned in multiple stages on in-domain parallel and back-translated pseudo-parallel data. Our approach, which combines domain adaptation, multilingualism, and back-translation, improves translation quality by more than 3.7 BLEU points over a strong baseline in this extremely low-resource scenario.

* Accepted at the 17th Machine Translation Summit 