George Foster

Ties Matter: Modifying Kendall's Tau for Modern Metric Meta-Evaluation

May 23, 2023
Daniel Deutsch, George Foster, Markus Freitag

Kendall's tau is frequently used to meta-evaluate how well machine translation (MT) evaluation metrics score individual translations. Its focus on pairwise score comparisons is intuitive but raises the question of how ties should be handled, a gray area that has motivated different variants in the literature. We demonstrate that, in settings like modern MT meta-evaluation, existing variants have weaknesses arising from their handling of ties, and in some situations can even be gamed. We propose a novel variant that gives metrics credit for correctly predicting ties, as well as an optimization procedure that automatically introduces ties into metric scores, enabling fair comparison between metrics that do and do not predict ties. We argue and provide experimental evidence that these modifications lead to fairer Kendall-based assessments of metric performance.
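
As a rough illustration of the core idea, the sketch below computes a pairwise agreement statistic in which a tie counts as its own ordering, so a metric is rewarded for predicting ties that the human scores also contain. It is only a generic sketch of the concept, not the exact variant or the tie-introducing optimization defined in the paper.

```python
import itertools

def pairwise_agreement(metric_scores, human_scores):
    """Tau-like statistic that credits a metric for correctly predicting ties.

    A pair counts as correct when the metric orders it the same way the human
    scores do, where "tied" is treated as its own ordering.
    """
    correct, total = 0, 0
    for (m1, h1), (m2, h2) in itertools.combinations(zip(metric_scores, human_scores), 2):
        metric_order = (m1 > m2) - (m1 < m2)  # -1, 0 (tie), or +1
        human_order = (h1 > h2) - (h1 < h2)
        correct += metric_order == human_order
        total += 1
    return correct / total if total else 0.0

# Toy usage: the human scores contain a tie; a metric that also predicts it gets credit.
print(pairwise_agreement([0.2, 0.2, 0.9], [70, 70, 95]))  # 1.0
```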

Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability

May 17, 2023
Eleftheria Briakou, Colin Cherry, George Foster

Large, multilingual language models exhibit surprisingly good zero- or few-shot machine translation capabilities, despite never having seen the intentionally included translation examples provided to typical neural translation systems. We investigate the role of incidental bilingualism -- the unintentional consumption of bilingual signals, including translation examples -- in explaining the translation capabilities of large language models, taking the Pathways Language Model (PaLM) as a case study. We introduce a mixed-method approach to measure and understand incidental bilingualism at scale. We show that PaLM is exposed to over 30 million translation pairs across at least 44 languages. Furthermore, the amount of incidental bilingual content is highly correlated with the amount of monolingual in-language content for non-English languages. We relate incidental bilingual content to zero-shot prompts and show that it can be used to mine new prompts to improve PaLM's out-of-English zero-shot translation quality. Finally, in a series of small-scale ablations, we show that its presence has a substantial impact on translation capabilities, although this impact diminishes with model scale.
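
A very loose illustration of how one might flag incidentally bilingual training instances at scale is a per-line language-identification pass. The snippet below uses the langdetect package as a stand-in and is only a proxy for the paper's mixed-method pipeline, not its actual detection procedure.

```python
from langdetect import detect  # assumes the langdetect package is installed

def is_incidentally_bilingual(document: str, pivot: str = "en") -> bool:
    """Rough detector for incidental bilingualism: a document counts as bilingual
    if it contains lines in the pivot language and lines in some other language."""
    langs = set()
    for line in document.splitlines():
        line = line.strip()
        if len(line) < 20:  # skip fragments that language ID handles poorly
            continue
        try:
            langs.add(detect(line))
        except Exception:
            continue
    return pivot in langs and len(langs) > 1
```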

* Accepted at ACL 2023 

Document Flattening: Beyond Concatenating Context for Document-Level Neural Machine Translation

Feb 16, 2023
Minghao Wu, George Foster, Lizhen Qu, Gholamreza Haffari

Existing work in document-level neural machine translation commonly concatenates several consecutive sentences as a pseudo-document and then learns inter-sentential dependencies. This strategy limits the model's ability to leverage information from distant context. We overcome this limitation with a novel Document Flattening (DocFlat) technique that integrates Flat-Batch Attention (FBA) and a Neural Context Gate (NCG) into the Transformer model to utilize information beyond the pseudo-document boundaries. FBA allows the model to attend to all positions in the batch and to learn the relationships between positions explicitly, while NCG identifies useful information from the distant context. We conduct comprehensive experiments and analyses on three benchmark datasets for English-German translation and validate the effectiveness of two variants of DocFlat. Empirical results show that our approach outperforms strong baselines with statistical significance on BLEU, COMET, and accuracy on the contrastive test set. The analyses highlight that DocFlat is highly effective in capturing long-range information.
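
The gating idea behind NCG can be pictured as a learned mixture of a local representation and a distant-context representation. The PyTorch layer below is a generic sketch in that spirit; its wiring and dimensions are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Generic gating layer in the spirit of DocFlat's Neural Context Gate:
    it learns how much distant-context information to mix into each local
    representation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, local: torch.Tensor, distant: torch.Tensor) -> torch.Tensor:
        # local, distant: (batch, seq_len, d_model)
        g = torch.sigmoid(self.gate(torch.cat([local, distant], dim=-1)))
        return g * local + (1.0 - g) * distant
```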

* 15 pages, 8 figures, accepted by EACL 2023 

The unreasonable effectiveness of few-shot learning for machine translation

Feb 02, 2023
Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Fangxiaoyu Feng, Melvin Johnson, Orhan Firat

We demonstrate the potential of few-shot translation systems, trained with unpaired language data, for both high- and low-resource language pairs. We show that with only five examples of high-quality translation data shown at inference, a decoder-only Transformer model trained solely with self-supervised learning is able to match specialized supervised state-of-the-art models as well as more general commercial translation systems. In particular, we outperform the best performing system on the WMT'21 English-Chinese news translation task using only five examples of English-Chinese parallel data at inference. Moreover, our approach to building these models does not necessitate joint multilingual training or back-translation, is conceptually simple, and shows the potential to extend to the multilingual setting. Furthermore, the resulting models are two orders of magnitude smaller than state-of-the-art language models. We then analyze the factors which impact the performance of few-shot translation systems and highlight that the quality of the few-shot demonstrations heavily determines the quality of the translations generated by our models. Finally, we show that the few-shot paradigm also provides a way to control certain attributes of the translation -- we show that we are able to control for regional varieties and formality using only five examples at inference, paving the way towards controllable machine translation systems.
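
The inference-time recipe amounts to prepending a handful of demonstrations to the source sentence. The helper below shows one plausible prompt layout; the exact template, language tags, and decoding setup used in the paper are not specified here and should be treated as assumptions.

```python
def build_few_shot_prompt(examples, source_sentence, src_lang="English", tgt_lang="Chinese"):
    """Assemble a five-shot translation prompt from high-quality example pairs.

    The resulting string would be fed to a decoder-only model trained only with
    self-supervised learning; the template is illustrative, not the paper's.
    """
    blocks = [f"{src_lang}: {src}\n{tgt_lang}: {tgt}" for src, tgt in examples[:5]]
    blocks.append(f"{src_lang}: {source_sentence}\n{tgt_lang}:")
    return "\n\n".join(blocks)
```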

Prompting PaLM for Translation: Assessing Strategies and Performance

Nov 16, 2022
David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, George Foster

Large language models (LLMs) that have been trained on multilingual but not parallel text exhibit a remarkable ability to translate between languages. We probe this ability in an in-depth study of the Pathways Language Model (PaLM), which has demonstrated the strongest machine translation (MT) performance among similarly trained LLMs to date. We investigate various strategies for choosing translation examples for few-shot prompting, concluding that example quality is the most important factor. Using optimized prompts, we revisit previous assessments of PaLM's MT capabilities with more recent test sets, modern MT metrics, and human evaluation, and find that its performance, while impressive, still lags that of state-of-the-art supervised systems. We conclude by providing an analysis of PaLM's MT output which reveals some interesting properties and prospects for future work.
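
Since example quality turned out to be the dominant factor, a simple selection strategy is to rank candidate demonstration pairs by whatever quality estimate is available and keep the best few. The sketch below illustrates that idea; `quality_score` is a hypothetical stand-in, not a function from the paper.

```python
def select_prompt_examples(candidate_pairs, quality_score, k=5):
    """Pick the k highest-quality demonstrations for few-shot prompting.

    `quality_score(src, tgt)` stands in for whatever quality estimate is
    available (e.g., an automatic MT quality metric); the finding being
    illustrated is simply that example quality matters most.
    """
    ranked = sorted(candidate_pairs, key=lambda pair: quality_score(*pair), reverse=True)
    return ranked[:k]
```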

Toward More Effective Human Evaluation for Machine Translation

Apr 11, 2022
Belén Saldías, George Foster, Markus Freitag, Qijun Tan

Improvements in text generation technologies such as machine translation have necessitated more costly and time-consuming human evaluation procedures to ensure an accurate signal. We investigate a simple way to reduce cost by reducing the number of text segments that must be annotated in order to accurately predict a score for a complete test set. Using a sampling approach, we demonstrate that information from document membership and automatic metrics can help improve estimates compared to a pure random sampling baseline. By leveraging stratified sampling and control variates, we reduce average absolute error by up to 20%. Our techniques can improve estimates made from a fixed annotation budget, are easy to implement, and can be applied to any problem with structure similar to the one we study.
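
To make the control-variate idea concrete, the sketch below estimates a test-set mean human score from a small annotated sample, using an automatic metric (available on every segment) as the control variate. This is the textbook estimator, shown only to illustrate the machinery the paper combines with stratified sampling; it is not the paper's exact procedure.

```python
import numpy as np

def control_variate_estimate(human_sample, metric_sample, metric_all):
    """Estimate the test-set mean human score from a small annotated sample.

    The automatic metric, known on every segment, serves as a control variate:
    the sample mean of the human scores is adjusted by how far the sampled
    metric scores deviate from the metric's full-test-set mean.
    """
    human_sample = np.asarray(human_sample, dtype=float)
    metric_sample = np.asarray(metric_sample, dtype=float)
    # Optimal coefficient: Cov(human, metric) / Var(metric), estimated on the sample.
    cov = np.cov(human_sample, metric_sample, ddof=1)
    beta = cov[0, 1] / cov[1, 1]
    return human_sample.mean() + beta * (np.mean(metric_all) - metric_sample.mean())
```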

* ACL 2022 Workshop on Human Evaluation of NLP Systems 

Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

Apr 29, 2021
Markus Freitag, George Foster, David Grangier, Viresh Ratnakar, Qijun Tan, Wolfgang Macherey

Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding, among other results, a substantially different ranking of the evaluated systems from the one established by the WMT crowd workers, with a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
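
MQM-style scoring aggregates annotated errors into a per-segment penalty using severity weights. The sketch below uses commonly reported weights (major errors weighted far more heavily than minor ones); treat the exact numbers as an assumption rather than the study's definitive scheme.

```python
# Severity weights commonly reported for MQM-style scoring; the exact values
# are an assumption, not necessarily the weighting used in the study.
SEVERITY_WEIGHTS = {
    "non-translation": 25.0,
    "major": 5.0,
    "minor": 1.0,
    "minor-punctuation": 0.1,
}

def mqm_segment_score(error_severities):
    """Aggregate annotated errors for one segment into a single penalty,
    where a higher score means a worse translation."""
    return sum(SEVERITY_WEIGHTS[severity] for severity in error_severities)

# Example: one major error and two minor errors.
print(mqm_segment_score(["major", "minor", "minor"]))  # 7.0
```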

Assessing Reference-Free Peer Evaluation for Machine Translation

Apr 12, 2021
Sweta Agrawal, George Foster, Markus Freitag, Colin Cherry

Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has recently been shown that the probabilities given by a large, multilingual model can achieve state-of-the-art results when used as a reference-free metric. We experiment with various modifications to this model and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities.
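
The underlying recipe is to treat a translation model itself as the judge: score a candidate by how probable a large multilingual model finds it given the source. The sketch below shows that shape; `log_prob` is a placeholder for whatever scoring interface the model exposes, and the length normalization is one common choice, not necessarily the paper's.

```python
def reference_free_score(source, candidate, log_prob):
    """Score a translation without a reference.

    `log_prob(source, candidate)` is a hypothetical interface returning the
    model's log-probability of the candidate given the source; the result is
    normalized by candidate length so long and short outputs are comparable.
    """
    tokens = candidate.split()
    return log_prob(source, candidate) / max(len(tokens), 1)
```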

* NAACL 2021 

Inference Strategies for Machine Translation with Conditional Masking

Oct 20, 2020
Julia Kreutzer, George Foster, Colin Cherry

Conditional masked language model (CMLM) training has proven successful for non-autoregressive and semi-autoregressive sequence generation tasks, such as machine translation. Given a trained CMLM, however, it is not clear what the best inference strategy is. We formulate masked inference as a factorization of conditional probabilities of partial sequences, show that this does not harm performance, and investigate a number of simple heuristics motivated by this perspective. We identify a thresholding strategy that has advantages over the standard "mask-predict" algorithm, and provide analyses of its behavior on machine translation tasks.
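
The contrast with mask-predict can be sketched as follows: instead of re-masking a fixed fraction of tokens each iteration, a thresholding strategy commits to a token once the model's confidence in it exceeds a threshold. `cmlm_predict` below is a placeholder for the trained CMLM, and the loop is an illustrative sketch rather than the paper's exact algorithm.

```python
MASK = "<mask>"

def threshold_decode(cmlm_predict, length, threshold=0.9, max_steps=10):
    """Iterative CMLM inference that keeps a token once its confidence passes
    a threshold, rather than re-masking a fixed fraction each step.

    `cmlm_predict(tokens)` is a hypothetical interface mapping the current
    partially masked sequence to (predicted_tokens, confidences).
    """
    tokens = [MASK] * length
    for _ in range(max_steps):
        preds, confs = cmlm_predict(tokens)
        for i, (tok, conf) in enumerate(zip(preds, confs)):
            if tokens[i] == MASK and conf >= threshold:
                tokens[i] = tok
        if MASK not in tokens:
            break
    # Fill any remaining masked positions with the current best predictions.
    preds, _ = cmlm_predict(tokens)
    return [preds[i] if tok == MASK else tok for i, tok in enumerate(tokens)]
```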

* EMNLP 2020, updated Fig 3 

Human-Paraphrased References Improve Neural Machine Translation

Oct 20, 2020
Markus Freitag, George Foster, David Grangier, Colin Cherry

Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has recently been proposed by Freitag et al. When used in place of original references, the paraphrased versions produce metric scores that correlate better with human judgment. This effect holds for a variety of different automatic metrics, and tends to favor natural formulations over more literal (translationese) ones. In this paper we compare the results of performing end-to-end system development using standard and paraphrased references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate better with human judgment, and demonstrates for the first time that using these scores for system development can lead to significant improvements.
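
Operationally, the comparison boils down to scoring the same system output against two reference sets. Below is a minimal sketch using the sacrebleu package, which is an assumption about tooling rather than a statement of what the paper used.

```python
import sacrebleu  # assumes the sacrebleu package is installed

def compare_reference_sets(hypotheses, standard_refs, paraphrased_refs):
    """Score the same system output against standard and human-paraphrased
    references; the two BLEU scores can diverge, as reported in the paper."""
    standard = sacrebleu.corpus_bleu(hypotheses, [standard_refs]).score
    paraphrased = sacrebleu.corpus_bleu(hypotheses, [paraphrased_refs]).score
    return {"standard_BLEU": standard, "paraphrased_BLEU": paraphrased}
```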

* Accepted at WMT 2020 