Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ekaterina Vylomova

TULUN: Transparent and Adaptable Low-resource Machine Translation

May 24, 2025

Raphaël Merx, Hanna Suominen, Lois Hong, Nick Thieberger, Trevor Cohn, Ekaterina Vylomova

Abstract:Machine translation (MT) systems that support low-resource languages often struggle on specialized domains. While researchers have proposed various techniques for domain adaptation, these approaches typically require model fine-tuning, making them impractical for non-technical users and small organizations. To address this gap, we propose Tulun, a versatile solution for terminology-aware translation, combining neural MT with large language model (LLM)-based post-editing guided by existing glossaries and translation memories. Our open-source web-based platform enables users to easily create, edit, and leverage terminology resources, fostering a collaborative human-machine translation process that respects and incorporates domain expertise while increasing MT accuracy. Evaluations show effectiveness in both real-world and benchmark scenarios: on medical and disaster relief translation tasks for Tetun and Bislama, our system achieves improvements of 16.90-22.41 ChrF++ points over baseline MT systems. Across six low-resource languages on the FLORES dataset, Tulun outperforms both standalone MT and LLM approaches, achieving an average improvement of 2.8 ChrF points over NLLB-54B.

Via

Access Paper or Ask Questions

A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data

Mar 11, 2025

Naomi Baes, Raphaël Merx, Nick Haslam, Ekaterina Vylomova, Haim Dubossarsky

Abstract:Lexical Semantic Change (LSC) offers insights into cultural and social dynamics. Yet, the validity of methods for measuring kinds of LSC has yet to be established due to the absence of historical benchmark datasets. To address this gap, we develop a novel three-stage evaluation framework that involves: 1) creating a scalable, domain-general methodology for generating synthetic datasets that simulate theory-driven LSC across time, leveraging In-Context Learning and a lexical database; 2) using these datasets to evaluate the effectiveness of various methods; and 3) assessing their suitability for specific dimensions and domains. We apply this framework to simulate changes across key dimensions of LSC (SIB: Sentiment, Intensity, and Breadth) using examples from psychology, and evaluate the sensitivity of selected methods to detect these artificially induced changes. Our findings support the utility of the synthetic data approach, validate the efficacy of tailored methods for detecting synthetic changes in SIB, and reveal that a state-of-the-art LSC model faces challenges in detecting affective dimensions of LSC. This framework provides a valuable tool for dimension- and domain-specific bench-marking and evaluation of LSC methods, with particular benefits for the social sciences.

* 36 pages, under review

Via

Access Paper or Ask Questions

Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service

Nov 19, 2024

Raphael Merx, Hanna Suominen, Adérito José Guterres Correia, Trevor Cohn, Ekaterina Vylomova

Abstract:The impact of machine translation (MT) on low-resource languages remains poorly understood. In particular, observational studies of actual usage patterns are scarce. Such studies could provide valuable insights into user needs and behaviours, complementing survey-based methods. Here we present an observational analysis of real-world MT usage for Tetun, the lingua franca of Timor-Leste, using server logs from a widely-used MT service with over $70,000$ monthly active users. Our analysis of $100,000$ translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate short texts into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for languages like Tetun should prioritise translating into the low-resource language, handling brief inputs effectively, and covering a wide range of domains relevant to educational contexts. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.

Via

Access Paper or Ask Questions

Generating bilingual example sentences with large language models as lexicography assistants

Oct 04, 2024

Raphael Merx, Ekaterina Vylomova, Kemal Kurniawan

Figure 1 for Generating bilingual example sentences with large language models as lexicography assistants

Figure 2 for Generating bilingual example sentences with large language models as lexicography assistants

Figure 3 for Generating bilingual example sentences with large language models as lexicography assistants

Figure 4 for Generating bilingual example sentences with large language models as lexicography assistants

Abstract:We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility. Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for typicality and intelligibility in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.

Via

Access Paper or Ask Questions

Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

Sep 22, 2024

Aso Mahmudi, Borja Herce, Demian Inostroza Amestica, Andreas Scherbakov, Eduard Hovy, Ekaterina Vylomova

Figure 1 for Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

Figure 2 for Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

Figure 3 for Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

Figure 4 for Can a Neural Model Guide Fieldwork? A Case Study on Morphological Inflection

Abstract:Linguistic fieldwork is an important component in language documentation and preservation. However, it is a long, exhaustive, and time-consuming process. This paper presents a novel model that guides a linguist during the fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving the efficiency: (1) increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.

Via

Access Paper or Ask Questions

A Multidimensional Framework for Evaluating Lexical Semantic Change with Social Science Applications

Jun 10, 2024

Naomi Baes, Nick Haslam, Ekaterina Vylomova

Figure 1 for A Multidimensional Framework for Evaluating Lexical Semantic Change with Social Science Applications

Figure 2 for A Multidimensional Framework for Evaluating Lexical Semantic Change with Social Science Applications

Figure 3 for A Multidimensional Framework for Evaluating Lexical Semantic Change with Social Science Applications

Figure 4 for A Multidimensional Framework for Evaluating Lexical Semantic Change with Social Science Applications

Abstract:Historical linguists have identified multiple forms of lexical semantic change. We present a three-dimensional framework for integrating these forms and a unified computational methodology for evaluating them concurrently. The dimensions represent increases or decreases in semantic 1) sentiment, 2) breadth, and 3) intensity. These dimensions can be complemented by the evaluation of shifts in the frequency of the target words and the thematic content of its collocates. This framework enables lexical semantic change to be mapped economically and systematically and has applications in computational social science. We present an illustrative analysis of semantic shifts in mental health and mental illness in two corpora, demonstrating patterns of semantic change that illuminate contemporary concerns about pathologization, stigma, and concept creep.

Via

Access Paper or Ask Questions

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

Apr 20, 2024

Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

Figure 1 for Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

Figure 2 for Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

Figure 3 for Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

Figure 4 for Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

Abstract:The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms have been proposed, their evaluation and cross-comparison is still an open problem. As a solution, we propose a combined intrinsic-extrinsic evaluation framework for subword tokenization. Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien. Extrinsic evaluation, in turn, is performed via the Out-of-Vocabulary Generalization Challenge 1.0 benchmark, which consists of three newly specified downstream text classification tasks. Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that, in all language models studied (including ALBERT, BERT, RoBERTa, and DeBERTa), alien tokenization leads to poorer generalizations compared to morphological tokenization for semantic compositionality of word meanings.

Via

Access Paper or Ask Questions

Low-Resource Machine Translation through Retrieval-Augmented LLM Prompting: A Study on the Mambai Language

Apr 07, 2024

Raphaël Merx, Aso Mahmudi, Katrina Langford, Leo Alberto de Araujo, Ekaterina Vylomova

Abstract:This study explores the use of large language models (LLMs) for translating English into Mambai, a low-resource Austronesian language spoken in Timor-Leste, with approximately 200,000 native speakers. Leveraging a novel corpus derived from a Mambai language manual and additional sentences translated by a native speaker, we examine the efficacy of few-shot LLM prompting for machine translation (MT) in this low-resource context. Our methodology involves the strategic selection of parallel sentences and dictionary entries for prompting, aiming to enhance translation accuracy, using open-source and proprietary LLMs (LlaMa 2 70b, Mixtral 8x7B, GPT-4). We find that including dictionary entries in prompts and a mix of sentences retrieved through TF-IDF and semantic embeddings significantly improves translation quality. However, our findings reveal stark disparities in translation performance across test sets, with BLEU scores reaching as high as 21.2 on materials from the language manual, in contrast to a maximum of 4.4 on a test set provided by a native speaker. These results underscore the importance of diverse and representative corpora in assessing MT for low-resource languages. Our research provides insights into few-shot LLM prompting for low-resource MT, and makes available an initial corpus for the Mambai language.

Via

Access Paper or Ask Questions

Simpson's Paradox and the Accuracy-Fluency Tradeoff in Translation

Feb 20, 2024

Zheng Wei Lim, Ekaterina Vylomova, Trevor Cohn, Charles Kemp

Abstract:A good translation should be faithful to the source and should respect the norms of the target language. We address a theoretical puzzle about the relationship between these objectives. On one hand, intuition and some prior work suggest that accuracy and fluency should trade off against each other, and that capturing every detail of the source can only be achieved at the cost of fluency. On the other hand, quality assessment researchers often suggest that accuracy and fluency are highly correlated and difficult for human raters to distinguish (Callison-Burch et al. 2007). We show that the tension between these views is an instance of Simpson's paradox, and that accuracy and fluency are positively correlated at the level of the corpus but trade off at the level of individual source segments. We further suggest that the relationship between accuracy and fluency is best evaluated at the segment (or sentence) level, and that the trade off between these dimensions has implications both for assessing translation quality and developing improved MT systems.

Via

Access Paper or Ask Questions

Predicting Human Translation Difficulty with Neural Machine Translation

Dec 19, 2023

Zheng Wei Lim, Ekaterina Vylomova, Charles Kemp, Trevor Cohn

Abstract:Human translators linger on some words and phrases more than others, and predicting this variation is a step towards explaining the underlying cognitive processes. Using data from the CRITT Translation Process Research Database, we evaluate the extent to which surprisal and attentional features derived from a Neural Machine Translation (NMT) model account for reading and production times of human translators. We find that surprisal and attention are complementary predictors of translation difficulty, and that surprisal derived from a NMT model is the single most successful predictor of production duration. Our analyses draw on data from hundreds of translators operating across 13 language pairs, and represent the most comprehensive investigation of human translation difficulty to date.

Via

Access Paper or Ask Questions