
Kenton Murray


Narrowing the Gap between Zero- and Few-shot Machine Translation by Matching Styles

Nov 04, 2023
Weiting Tan, Haoran Xu, Lingfeng Shen, Shuyue Stella Li, Kenton Murray, Philipp Koehn, Benjamin Van Durme, Yunmo Chen

Large language models trained primarily in a monolingual setting have demonstrated their ability to generalize to machine translation using zero- and few-shot examples with in-context learning. However, even though zero-shot translations are relatively good, a discernible gap remains between their performance and that of the few-shot setting. In this paper, we investigate the factors contributing to this gap and find that it can largely be closed (by about 70%) by matching the writing styles of the target corpus. Additionally, we explore potential approaches to enhance zero-shot baselines without the need for parallel demonstration examples, providing valuable insights into how these methods contribute to improving translation metrics.


Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

Oct 02, 2023
Tianjian Li, Haoran Xu, Philipp Koehn, Daniel Khashabi, Kenton Murray

Figures 1-4 for Error Norm Truncation: Robust Training in the Presence of Data Noise for Text Generation Models

Text generation models are notoriously vulnerable to errors in the training data. As massive amounts of web-crawled data become more commonplace, how can we enhance the robustness of models trained on noisy web-crawled text? In our work, we propose Error Norm Truncation (ENT), a robust enhancement method to the standard training objective that truncates noisy data. Compared to methods that only use the negative log-likelihood loss to estimate data quality, our method provides a more accurate estimation by considering the distribution of non-target tokens, which is often overlooked by previous work. Through comprehensive experiments across language modeling, machine translation, and text summarization, we show that equipping text generation models with ENT improves generation quality over standard training and previous soft and hard truncation methods. Furthermore, we show that our method improves the robustness of models against two of the most detrimental types of noise in machine translation, resulting in an increase of more than 2 BLEU points over the MLE baseline when up to 50% of noise is added to the data.
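A minimal sketch of the idea: score each token by the L2 distance between the model's predicted distribution and the one-hot target, and drop tokens whose error norm exceeds a threshold. The threshold value and the plain-NLL formulation here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def error_norm(probs: np.ndarray, target: int) -> float:
    """L2 distance between the predicted distribution over the
    vocabulary and the one-hot distribution of the target token."""
    one_hot = np.zeros_like(probs)
    one_hot[target] = 1.0
    return float(np.linalg.norm(probs - one_hot))

def ent_loss(probs_per_token, targets, threshold=1.2):
    """Truncated NLL: tokens whose error norm exceeds the threshold
    are treated as noise and excluded from the loss."""
    total, kept = 0.0, 0
    for probs, t in zip(probs_per_token, targets):
        if error_norm(probs, t) <= threshold:
            total += -np.log(probs[t] + 1e-12)
            kept += 1
    return total / max(kept, 1)
```

A clean token (most of the mass on the target) has a small error norm and is kept, while a token whose mass sits on a non-target symbol has an error norm approaching √2 and is dropped.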


Linking Symptom Inventories using Semantic Textual Similarity

Sep 08, 2023
Eamonn Kennedy, Shashank Vadlamani, Hannah M Lindsey, Kelly S Peterson, Kristen Dams OConnor, Kenton Murray, Ronak Agarwal, Houshang H Amiri, Raeda K Andersen, Talin Babikian, David A Baron, Erin D Bigler, Karen Caeyenberghs, Lisa Delano-Wood, Seth G Disner, Ekaterina Dobryakova, Blessen C Eapen, Rachel M Edelstein, Carrie Esopenko, Helen M Genova, Elbert Geuze, Naomi J Goodrich-Hunsaker, Jordan Grafman, Asta K Haberg, Cooper B Hodges, Kristen R Hoskinson, Elizabeth S Hovenden, Andrei Irimia, Neda Jahanshad, Ruchira M Jha, Finian Keleher, Kimbra Kenney, Inga K Koerte, Spencer W Liebel, Abigail Livny, Marianne Lovstad, Sarah L Martindale, Jeffrey E Max, Andrew R Mayer, Timothy B Meier, Deleene S Menefee, Abdalla Z Mohamed, Stefania Mondello, Martin M Monti, Rajendra A Morey, Virginia Newcombe, Mary R Newsome, Alexander Olsen, Nicholas J Pastorek, Mary Jo Pugh, Adeel Razi, Jacob E Resch, Jared A Rowland, Kelly Russell, Nicholas P Ryan, Randall S Scheibel, Adam T Schmidt, Gershon Spitz, Jaclyn A Stephens, Assaf Tal, Leah D Talbert, Maria Carmela Tartaglia, Brian A Taylor, Sophia I Thomopoulos, Maya Troyanskaya, Eve M Valera, Harm Jan van der Horn, John D Van Horn, Ragini Verma, Benjamin SC Wade, Willian SC Walker, Ashley L Ware, J Kent Werner Jr, Keith Owen Yeates, Ross D Zafonte, Michael M Zeineh, Brandon Zielinski, Paul M Thompson, Frank G Hillary, David F Tate, Elisabeth A Wilde, Emily L Dennis

Figures 1-4 for Linking Symptom Inventories using Semantic Textual Similarity

An extensive library of symptom inventories has been developed over time to measure clinical symptoms, but this variety has led to several long-standing issues. Most notably, results drawn from different settings and studies are not comparable, which limits reproducibility. Here, we present an artificial intelligence (AI) approach using semantic textual similarity (STS) to link symptoms and scores across previously incongruous symptom inventories. We tested the ability of four pre-trained STS models to screen thousands of symptom description pairs for related content - a challenging task typically requiring expert panels. Models were tasked to predict symptom severity across four different inventories for 6,607 participants drawn from 16 international data sources. The STS approach achieved 74.8% accuracy across five tasks, outperforming other models tested. This work suggests that incorporating contextual, semantic information can assist expert decision-making processes, yielding gains for both general and disease-specific clinical assessment.
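The linking step can be illustrated as nearest-neighbor matching over sentence embeddings. A minimal sketch, assuming the embeddings come from a pre-trained STS model (replaced here by toy vectors):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def link_items(emb_a, emb_b):
    """For each item in inventory A, return the index of the most
    semantically similar item in inventory B."""
    return [int(np.argmax([cosine(a, b) for b in emb_b])) for a in emb_a]
```

In practice `emb_a` and `emb_b` would be STS-model embeddings of the symptom descriptions from each inventory.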


MegaWika: Millions of reports and their sources across 50 diverse languages

Jul 13, 2023
Samuel Barham, Orion Weller, Michelle Yuan, Kenton Murray, Mahsa Yarmohammadi, Zhengping Jiang, Siddharth Vashishtha, Alexander Martin, Anqi Liu, Aaron Steven White, Jordan Boyd-Graber, Benjamin Van Durme

Figures 1-4 for MegaWika: Millions of reports and their sources across 50 diverse languages

To foster the development of new models for collaborative AI-assisted report generation, we introduce MegaWika, consisting of 13 million Wikipedia articles in 50 diverse languages, along with their 71 million referenced source materials. We process this dataset for a myriad of applications, going beyond the initial Wikipedia citation extraction and web scraping of content, including translating non-English articles for cross-lingual applications and providing FrameNet parses for automated semantic analysis. MegaWika is the largest resource for sentence-level report generation and the only report generation dataset that is multilingual. We manually analyze the quality of this resource through a semantically stratified sample. Finally, we provide baseline results and trained models for crucial steps in automated report generation: cross-lingual question answering and citation retrieval.

* Submitted to ACL, 2023 

Why Does Zero-Shot Cross-Lingual Generation Fail? An Explanation and a Solution

May 27, 2023
Tianjian Li, Kenton Murray

Figures 1-4 for Why Does Zero-Shot Cross-Lingual Generation Fail? An Explanation and a Solution

In zero-shot cross-lingual transfer, a multilingual model is trained to perform a task in one language and then applied to another language. Although this approach has achieved success in various classification tasks, its performance on natural language generation tasks falls short in quality and sometimes outputs an incorrect language. In our study, we show that the fine-tuning process learns language-invariant representations, which are beneficial for classification tasks but harmful for generation tasks. Motivated by this, we propose a simple method to regularize the model from learning language-invariant representations and a method to select model checkpoints without a development set in the target language, both resulting in better generation quality. Experiments on three semantically diverse generation tasks show that our method reduces the accidental translation problem by 68% and improves the ROUGE-L score by 1.5 on average.

* Findings of ACL 2023 

Exploring Representational Disparities Between Multilingual and Bilingual Translation Models

May 23, 2023
Neha Verma, Kenton Murray, Kevin Duh

Figures 1-4 for Exploring Representational Disparities Between Multilingual and Bilingual Translation Models

Multilingual machine translation has proven immensely useful for low-resource and zero-shot language pairs. However, language pairs in multilingual models sometimes see worse performance than in bilingual models, especially when translating in a one-to-many setting. To understand why, we examine the geometric differences in the representations from bilingual models versus those from one-to-many multilingual models. Specifically, we evaluate the isotropy of the representations, to measure how well they utilize the dimensions in their underlying vector space. Using the same evaluation data in both models, we find that multilingual model decoder representations tend to be less isotropic than bilingual model decoder representations. Additionally, we show that much of the anisotropy in multilingual decoder representations can be attributed to modeling language-specific information, therefore limiting remaining representational capacity.
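A common proxy for (an)isotropy, not necessarily the exact measure used in the paper, is the average pairwise cosine similarity of the representations: values near zero suggest the vectors spread across many directions, while values near one suggest they occupy a narrow cone.

```python
import numpy as np

def avg_pairwise_cosine(X: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows of X
    (one row per representation). Near 0: roughly isotropic;
    near 1: strongly anisotropic."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    off_diag = sims.sum() - np.trace(sims)  # drop self-similarities
    return float(off_diag / (n * (n - 1)))
```

Mutually orthogonal representations score 0 under this proxy, while representations pointing in a single direction score 1.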


Condensing Multilingual Knowledge with Lightweight Language-Specific Modules

May 23, 2023
Haoran Xu, Weiting Tan, Shuyue Stella Li, Yunmo Chen, Benjamin Van Durme, Philipp Koehn, Kenton Murray

Figures 1-4 for Condensing Multilingual Knowledge with Lightweight Language-Specific Modules

Incorporating language-specific (LS) modules is a proven method to boost performance in multilingual machine translation. This approach bears similarity to Mixture-of-Experts (MoE) because it does not inflate FLOPs. However, the scalability of this approach to hundreds of languages (experts) tends to be unmanageable due to the prohibitive number of parameters introduced by full-rank matrices in fully-connected layers. In this work, we introduce the Language-Specific Matrix Synthesis (LMS) method. This approach constructs LS modules by generating low-rank matrices from two significantly smaller matrices to approximate the full-rank matrix. Furthermore, we condense multilingual knowledge from multiple LS modules into a single shared module with the Fuse Distillation (FD) technique to improve the efficiency of inference and model serialization. We show that our LMS method significantly outperforms previous LS methods and MoE methods with the same amount of extra parameters, e.g., 1.73 BLEU points over the Switch Transformer on many-to-many multilingual machine translation. Importantly, LMS achieves comparable translation performance with far fewer parameters.
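The low-rank construction can be sketched as follows; the dimensions, initialization scale, and residual placement are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8  # hidden size and low rank; both values are illustrative

class LowRankLSModule:
    """Language-specific module parameterized by two small matrices
    whose product stands in for a full d x d matrix
    (2 * d * r parameters instead of d * d)."""
    def __init__(self, d: int, r: int):
        self.A = rng.normal(scale=0.02, size=(d, r))
        self.B = rng.normal(scale=0.02, size=(r, d))

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # h: (batch, d) hidden states; the LS transform is added residually
        return h + (h @ self.A) @ self.B

# One lightweight module per language, sharing the rest of the model
modules = {lang: LowRankLSModule(d, r) for lang in ("de", "fr", "zh")}
h = rng.normal(size=(4, d))
out = modules["de"](h)
```

With d = 512 and r = 8, each language adds 8,192 parameters rather than the 262,144 a full-rank matrix would require.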


Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

May 03, 2023
Haoran Xu, Maha Elbayad, Kenton Murray, Jean Maillard, Vedanuj Goswami

Figures 1-4 for Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity

Mixture-of-experts (MoE) models that employ sparse activation have demonstrated effectiveness in significantly increasing the number of parameters while maintaining low computational requirements per token. However, recent studies have established that MoE models are inherently parameter-inefficient as the improvement in performance diminishes with an increasing number of experts. We hypothesize this parameter inefficiency is a result of all experts having equal capacity, which may not adequately meet the varying complexity requirements of different tokens or tasks, e.g., in a multilingual setting, languages based on their resource levels might require different capacities. In light of this, we propose Stratified Mixture of Experts (SMoE) models, which feature a stratified structure and can assign dynamic capacity to different tokens. We demonstrate the effectiveness of SMoE on two multilingual machine translation benchmarks, where it outperforms multiple state-of-the-art MoE models. On a diverse 15-language dataset, SMoE improves the translation quality over vanilla MoE by +0.93 BLEU points on average. Additionally, SMoE is parameter-efficient, matching vanilla MoE performance with around 50% fewer parameters.
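As a deliberately simplified sketch, the code below gives experts unequal capacity by varying their width under top-1 routing; the paper's stratified design is more involved, so treat every detail here as an assumption rather than the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # model dimension; illustrative

class Expert:
    """Two-layer ReLU FFN; `hidden` controls the expert's capacity."""
    def __init__(self, d: int, hidden: int):
        self.W1 = rng.normal(scale=0.1, size=(d, hidden))
        self.W2 = rng.normal(scale=0.1, size=(hidden, d))

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return np.maximum(x @ self.W1, 0.0) @ self.W2

# Experts of increasing capacity; a gate (random here, learned in
# practice) routes each token to one expert, so harder tokens can
# be sent to wider experts.
experts = [Expert(d, hidden) for hidden in (8, 16, 32)]
gate_W = rng.normal(size=(d, len(experts)))

def smoe_layer(X: np.ndarray) -> np.ndarray:
    """Route each token (row of X) to its top-1 expert and apply it."""
    choice = np.argmax(X @ gate_W, axis=1)
    return np.stack([experts[c](x) for x, c in zip(X, choice)])
```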


Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns

Nov 14, 2022
Shuyue Stella Li, Kenton Murray

Figures 1-4 for Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns

In this work, we focus on intrasentential code-mixing and propose several different Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks across various amounts of labeled gold data. Most importantly, our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy, motivating further linguistic insights into the phenomenon of code-mixing. We test our data augmentation method in a variety of low-resource and cross-lingual settings, reaching up to a relative improvement of 7.73% on the extremely scarce English-Malayalam dataset. We conclude that the code-switch pattern in code-mixing sentences is also important for the model to learn. Finally, we propose a language-agnostic SCM algorithm that is cheap yet extremely helpful for low-resource languages.
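The masking idea can be sketched as replacing a fraction of matrix-language tokens with a constant mask token; the token name and mask rate below are illustrative, not the paper's settings.

```python
import random

MASK = "[MASK]"  # the constant mask token; name is illustrative

def synthetic_code_mix(tokens, mask_rate=0.3, seed=0):
    """Replace a random fraction of matrix-language tokens with a
    constant mask, a simplified stand-in for embedded-language
    insertions at plausible switch points."""
    rng = random.Random(seed)
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]

mixed = synthetic_code_mix("I really like this movie".split())
```

A real pipeline would choose the replaced positions using predicted linguistic switch patterns rather than uniformly at random.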

* 12 pages, 5 figures 

The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains

May 23, 2022
Haoran Xu, Philipp Koehn, Kenton Murray

Figures 1-4 for The Importance of Being Parameters: An Intra-Distillation Method for Serious Gains

Recent model pruning methods have demonstrated the ability to remove redundant parameters without sacrificing model performance. Common methods remove redundant parameters according to the parameter sensitivity, a gradient-based measure reflecting the contribution of the parameters. In this paper, however, we argue that redundant parameters can be trained to make beneficial contributions. We first highlight the large sensitivity (contribution) gap between high-sensitivity and low-sensitivity parameters and show that model generalization performance can be significantly improved after balancing the contribution of all parameters. Our goal is to balance the sensitivity of all parameters and encourage all of them to contribute equally. We propose a general task-agnostic method, namely intra-distillation, appended to the regular training loss to balance parameter sensitivity. We also design a novel adaptive learning method to control the strength of the intra-distillation loss for faster convergence. Our experiments show the strong effectiveness of our methods on machine translation, natural language understanding, and zero-shot cross-lingual transfer across up to 48 languages, e.g., a gain of 3.54 BLEU on average across 8 language pairs from the IWSLT'14 translation dataset.
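One plausible reading of such an agreement objective: run K stochastic forward passes (e.g. with different dropout masks) and penalize divergence among their output distributions on top of the usual NLL. The symmetric-KL form and the weighting below are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sym_kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Symmetric KL divergence between two distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))
                        + q * np.log((q + eps) / (p + eps))))

def intra_distillation_loss(logits_per_pass, target, alpha=1.0):
    """Average NLL over K stochastic passes plus a pairwise agreement
    penalty that pushes all sub-models, and hence all parameters,
    toward the same prediction. `alpha` trades off the two terms."""
    probs = [softmax(z) for z in logits_per_pass]
    nll = float(np.mean([-np.log(p[target] + 1e-12) for p in probs]))
    pairs = [sym_kl(p, q) for i, p in enumerate(probs) for q in probs[i + 1:]]
    agree = float(np.mean(pairs)) if pairs else 0.0
    return nll + alpha * agree
```

When all passes agree, the penalty vanishes and the loss reduces to plain NLL; disagreement between passes raises the loss.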
