Katharina Kann

Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction

Jun 11, 2023
Manuel Mager, Rajat Bhatnagar, Graham Neubig, Ngoc Thang Vu, Katharina Kann

Neural models have drastically advanced the state of the art for machine translation (MT) between high-resource languages, but they traditionally rely on large amounts of training data. A substantial share of the world's languages lack such resources; most languages of the Americas are among them, with limited parallel and monolingual data, if any. Here, we offer the interested reader an introduction to the basic challenges, concepts, and techniques involved in creating MT systems for these languages. Finally, we discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.

* Accepted to AmericasNLP 2023 
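
As a concrete illustration of one technique such an introduction typically covers, below is a minimal, hypothetical sketch of back-translation, a common data-augmentation strategy for low-resource MT: monolingual target-language text is translated back into the source language to create synthetic parallel data. The `translate_to_source` backend and the example sentence are placeholders, not artifacts of the paper.

```python
# Illustrative sketch (not from the paper): back-translation, a standard
# data-augmentation technique for low-resource MT. `translate_to_source`
# is a hypothetical stand-in for any reverse-direction MT model.

from typing import Callable, List, Tuple


def back_translate(
    monolingual_target: List[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Create synthetic (source, target) pairs from target-side monolingual text."""
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        synthetic_source = translate_to_source(target_sentence)  # noisy, machine-made source side
        synthetic_pairs.append((synthetic_source, target_sentence))  # target side stays human-written
    return synthetic_pairs


if __name__ == "__main__":
    # Dummy backend so the sketch runs; in practice this would be a target->source MT model.
    dummy_reverse_mt = lambda s: f"<machine translation of: {s}>"
    print(back_translate(["Ejemplo de oración en español."], dummy_reverse_mt))
```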

Ethical Considerations for Machine Translation of Indigenous Languages: Giving a Voice to the Speakers

May 31, 2023
Manuel Mager, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

In recent years, machine translation has become highly successful for high-resource language pairs. This has also sparked new interest in research on the automatic translation of low-resource languages, including Indigenous languages. However, the latter are deeply tied to the ethnic and cultural groups that speak (or used to speak) them. Collecting data for, modeling, and deploying machine translation systems for these languages thus raises new ethical questions that must be addressed. Motivated by this, we first survey the existing literature on ethical considerations for the documentation, translation, and general natural language processing of Indigenous languages. We then conduct and analyze an interview study to shed light on the positions of community leaders, teachers, and language activists regarding ethical concerns about the automatic translation of their languages. Our results show that including native speakers and community members, to varying degrees, is vital to conducting better and more ethical research on Indigenous languages.

* Accepted to ACL2023 Main Conference 

An Investigation of Noise in Morphological Inflection

May 26, 2023
Adam Wiemerslage, Changbing Yang, Garrett Nicolai, Miikka Silfverberg, Katharina Kann

With a growing focus on morphological inflection systems for languages where high-quality data is scarce, training data noise is a serious but so far largely ignored concern. We aim to close this gap by investigating the types of noise encountered within a pipeline for truly unsupervised morphological paradigm completion and their impact on morphological inflection systems: First, we propose an error taxonomy and annotation pipeline for inflection training data. Then, we compare the effect of different types of noise on multiple state-of-the-art inflection models. Finally, we propose a novel character-level masked language modeling (CMLM) pretraining objective and explore its impact on the models' resistance to noise. Our experiments show that different architectures are affected differently by distinct types of noise, but encoder-decoders tend to be more robust to noise than models trained with a copy bias. CMLM pretraining helps transformers, but has a smaller impact on LSTMs.

* ACL 2023 Findings 
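
To make the pretraining objective concrete, here is a minimal sketch of how character-level masked language modeling examples might be constructed by masking a random subset of characters in a word; the mask token, masking rate, and example word are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (assumptions, not the paper's code): constructing
# character-level masked language modeling (CMLM) pretraining examples by
# masking a random subset of characters in a word.

import random
from typing import List, Tuple

MASK = "<mask>"  # assumed mask symbol


def make_cmlm_example(word: str, mask_prob: float = 0.15, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Return (masked character sequence, original character sequence)."""
    rng = random.Random(seed)
    chars = list(word)
    masked = [MASK if rng.random() < mask_prob else c for c in chars]
    return masked, chars


if __name__ == "__main__":
    inp, target = make_cmlm_example("geschrieben", mask_prob=0.3, seed=0)
    print(inp)     # masked input the model sees
    print(target)  # characters it must reconstruct
```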

Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

Feb 15, 2023
Abteen Ebrahimi, Arya D. McCarthy, Arturo Oncevay, Luis Chiruzzo, John E. Ortega, Gustavo A. Giménez-Lugo, Rolando Coto-Solano, Katharina Kann

Large multilingual models have inspired a new class of word alignment methods, which work well for these models' pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri--Spanish, Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.

* EACL 2023 
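
For readers unfamiliar with how gold-standard alignments are used for evaluation, below is a small sketch of alignment error rate (AER; Och & Ney, 2003), the standard intrinsic metric for comparing predicted alignments against gold sure/possible links. The toy alignments are invented for illustration and are unrelated to the paper's datasets.

```python
# Illustrative sketch: Alignment Error Rate (AER), a standard metric for scoring
# predicted word alignments against gold annotations with "sure" (S) and
# "possible" (P) links. The toy alignments below are made up.

from typing import Set, Tuple

Link = Tuple[int, int]  # (source token index, target token index)


def aer(predicted: Set[Link], sure: Set[Link], possible: Set[Link]) -> float:
    """AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|); lower is better."""
    if not predicted and not sure:
        return 0.0
    overlap = len(predicted & sure) + len(predicted & possible)
    return 1.0 - overlap / (len(predicted) + len(sure))


if __name__ == "__main__":
    gold_sure = {(0, 0), (1, 2)}
    gold_possible = gold_sure | {(2, 1)}  # by convention, S is a subset of P
    predicted = {(0, 0), (2, 1), (3, 3)}
    print(f"AER = {aer(predicted, gold_sure, gold_possible):.3f}")  # 0.400 for this toy case
```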

Mind the Knowledge Gap: A Survey of Knowledge-enhanced Dialogue Systems

Dec 20, 2022
Sagi Shaier, Lawrence Hunter, Katharina Kann

Many dialogue systems (DSs) lack characteristics that humans have, such as emotion perception, factuality, and informativeness. Enhancing DSs with knowledge alleviates this problem, but, as many ways of doing so exist, keeping track of all proposed methods is difficult. Here, we present the first survey of knowledge-enhanced DSs. We define three categories of systems - internal, external, and hybrid - based on the knowledge they use. We survey the motivation for enhancing DSs with knowledge, the datasets used, and methods for knowledge search, knowledge encoding, and knowledge incorporation. Finally, we propose how to improve existing systems based on theories from linguistics and cognitive science.
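
As a rough illustration of the external-knowledge pattern the survey covers (knowledge search followed by knowledge incorporation), here is a deliberately simplified sketch with a word-overlap retriever and a stubbed response generator; neither corresponds to any specific system in the survey.

```python
# Illustrative sketch of the "external knowledge" pattern: search a knowledge
# source for the passage most relevant to the dialogue context, then hand it to
# the response generator together with the context. Both components are
# simplified stand-ins, not systems from the survey.

from typing import List


def retrieve(query: str, passages: List[str]) -> str:
    """Pick the passage with the highest word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    return max(passages, key=lambda p: len(query_words & set(p.lower().split())))


def generate(context: str, knowledge: str) -> str:
    """Stub for a knowledge-conditioned response model."""
    return f"[response conditioned on: '{knowledge}' given context: '{context}']"


if __name__ == "__main__":
    kb = [
        "The Eiffel Tower is located in Paris and was completed in 1889.",
        "Photosynthesis converts light energy into chemical energy.",
    ]
    user_turn = "When was the Eiffel Tower finished?"
    print(generate(user_turn, retrieve(user_turn, kb)))
```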

A Major Obstacle for NLP Research: Let's Talk about Time Allocation!

Nov 30, 2022
Katharina Kann, Shiran Dudy, Arya D. McCarthy

The field of natural language processing (NLP) has grown over the last few years: conferences have become larger, we have published an incredible number of papers, and state-of-the-art research has been implemented in a large variety of customer-facing products. However, this paper argues that we have been less successful than we should have been and reflects on where and how the field fails to tap its full potential. Specifically, we demonstrate that, in recent years, subpar time allocation has been a major obstacle for NLP research. We outline multiple concrete problems together with their negative consequences and, importantly, suggest remedies to improve the status quo. We hope that this paper will be a starting point for discussions around which common practices are -- or are not -- beneficial for NLP research.

* To appear at EMNLP 2022 

Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

Mar 21, 2022
Yoshinari Fujinuma, Jordan Boyd-Graber, Katharina Kann

Pretrained multilingual models enable zero-shot learning even for unseen languages, and that performance can be further improved via adaptation prior to finetuning. However, it is unclear how the number of pretraining languages influences a model's zero-shot learning for languages unseen during pretraining. To fill this gap, we ask the following research questions: (1) How does the number of pretraining languages influence zero-shot performance on unseen target languages? (2) Does the answer to that question change with model adaptation? (3) Do the findings for our first question change if the languages used for pretraining are all related? Our experiments on pretraining with related languages indicate that choosing a diverse set of languages is crucial. Without model adaptation, surprisingly, increasing the number of pretraining languages yields better results up to adding related languages, after which performance plateaus. In contrast, with model adaptation via continued pretraining, pretraining on a larger number of languages often gives further improvement, suggesting that model adaptation is crucial to exploit additional pretraining languages.

* ACL 2022 camera ready 
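
Below is a hedged sketch of what "model adaptation via continued pretraining" can look like in practice: continuing masked language modeling on raw text in the unseen target language before finetuning. The model name, hyperparameters, and the tiny in-memory corpus are placeholders, not the paper's setup.

```python
# Illustrative sketch (assumptions throughout): adapt a pretrained multilingual
# MLM to an unseen target language by continued masked-LM training on raw text,
# then finetune the saved checkpoint on the downstream task.

from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"  # any multilingual MLM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Raw, unlabeled sentences in the target language (placeholder data).
raw_text = Dataset.from_dict({"text": [
    "Example sentence in the unseen target language.",
    "Another raw sentence used only for adaptation.",
]})
tokenized = raw_text.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="adapted-mlm", num_train_epochs=1,
                         per_device_train_batch_size=2)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
model.save_pretrained("adapted-mlm")  # finetune this adapted checkpoint downstream
```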

BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Mar 16, 2022
Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

Morphologically rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against byte-pair encoding (BPE) as input for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except Nahuatl, an unsupervised morphological segmentation algorithm consistently outperforms BPE and that, although supervised methods achieve better segmentation scores, they underperform in MT. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri--Spanish.

* Accepted to Findings of ACL 2022 
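
For contrast with morphological segmentation, the following toy implementation shows how BPE merges are learned purely from symbol-pair frequencies; the miniature corpus and merge count are illustrative only, and real experiments would use an established toolkit.

```python
# Illustrative sketch of byte-pair encoding (BPE), the frequency-driven baseline
# compared against morphologically informed segmentation. Toy learner and corpus
# for illustration only.

from collections import Counter
from typing import List, Tuple


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn BPE merge operations from a word list (word-internal merges only)."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)  # each word as a tuple of symbols
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # most frequent adjacent symbol pair
        merges.append(best)
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    corpus = "ni mitz tlazohtla ni mitz itta".split()  # toy word list, not a real training corpus
    print(learn_bpe(corpus, num_merges=5))
```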

Morphological Processing of Low-Resource Languages: Where We Are and What's Next

Mar 16, 2022
Adam Wiemerslage, Miikka Silfverberg, Changbing Yang, Arya D. McCarthy, Garrett Nicolai, Eliana Colunga, Katharina Kann

Automatic morphological processing can aid downstream natural language processing applications, especially for low-resource languages, and assist language documentation efforts for endangered languages. Having long been multilingual, the field of computational morphology is increasingly moving towards approaches suitable for languages with minimal or no annotated resources. First, we survey recent developments in computational morphology with a focus on low-resource languages. Second, we argue that the field is ready to tackle the logical next challenge: understanding a language's morphology from raw text alone. We perform an empirical study on a truly unsupervised version of the paradigm completion task and show that, while existing state-of-the-art models and two newly proposed models we devise perform reasonably well, there is still much room for improvement. The stakes are high: solving this task will increase the language coverage of morphological resources by orders of magnitude.

* Findings of ACL 2022 
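
As a rough illustration of what the truly unsupervised setting asks of a system, here is a deliberately naive sketch that groups raw-text word forms into candidate paradigms by a shared prefix; the fixed stem length and word list are assumptions, and the paper's proposed models are far more sophisticated.

```python
# Illustrative, deliberately naive baseline: cluster raw-text word forms into
# candidate paradigms by a shared prefix "stem". It only shows what the
# unsupervised paradigm completion task asks for; the word list is made up.

from collections import defaultdict
from typing import Dict, List


def group_by_stem(word_forms: List[str], stem_len: int = 4) -> Dict[str, List[str]]:
    """Cluster word forms whose first `stem_len` characters match."""
    paradigms: Dict[str, List[str]] = defaultdict(list)
    for form in sorted(set(word_forms)):
        paradigms[form[:stem_len]].append(form)
    return {stem: forms for stem, forms in paradigms.items() if len(forms) > 1}


if __name__ == "__main__":
    raw_text_vocab = ["walk", "walks", "walked", "walking", "talk", "talked", "cat"]
    for stem, forms in group_by_stem(raw_text_vocab).items():
        print(stem, "->", forms)
```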

The World of an Octopus: How Reporting Bias Influences a Language Model's Perception of Color

Oct 15, 2021
Cory Paik, Stéphane Aroca-Ouellette, Alessandro Roncone, Katharina Kann

Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-perceived color distributions for 521 common objects; 2) use CoDa to analyze and compare the color distribution found in text, the distribution captured by language models, and a human's perception of color; and 3) investigate the performance differences between text-only and multimodal models on CoDa. Our results show that the distribution of colors that a language model recovers correlates more strongly with the inaccurate distribution found in text than with the ground-truth, supporting the claim that reporting bias negatively impacts and inherently limits text-only training. We then demonstrate that multimodal models can leverage their visual training to mitigate these effects, providing a promising avenue for future research.

* Accepted to EMNLP 2021, 9 Pages 
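
A simple way to probe the phenomenon described here, sketched below under assumptions: query a masked language model for color-word probabilities for an object and correlate them with a human-judged distribution. The prompt, color inventory, model choice, and the "human" numbers are all made up for illustration and are not CoDa.

```python
# Illustrative sketch (not CoDa itself): probe a masked LM for its color
# distribution over an object and compare it to a human-judged distribution.
# All specifics below are placeholders.

from scipy.stats import spearmanr
from transformers import pipeline

COLORS = ["red", "orange", "yellow", "green", "blue", "purple", "brown", "black", "white"]

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("Most bananas are [MASK].", targets=COLORS, top_k=len(COLORS))
model_scores = {p["token_str"].strip(): p["score"] for p in predictions}
model_dist = [model_scores.get(c, 0.0) for c in COLORS]

# Hypothetical human-perceived distribution over the same colors (made-up numbers).
human_dist = [0.0, 0.02, 0.85, 0.10, 0.0, 0.0, 0.03, 0.0, 0.0]

rho, _ = spearmanr(model_dist, human_dist)
print(f"Spearman correlation between model and human color distributions: {rho:.2f}")
```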