Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

David Chiang

Using Source-Side Confidence Estimation for Reliable Translation into Unfamiliar Languages

Mar 30, 2025

Kenneth J. Sible, David Chiang

Abstract:We present an interactive machine translation (MT) system designed for users who are not proficient in the target language. It aims to improve trustworthiness and explainability by identifying potentially mistranslated words and allowing the user to intervene to correct mistranslations. However, confidence estimation in machine translation has traditionally focused on the target side. Whereas the conventional approach to source-side confidence estimation would have been to project target word probabilities to the source side via word alignments, we propose a direct, alignment-free approach that measures how sensitive the target word probabilities are to changes in the source embeddings. Experimental results show that our method outperforms traditional alignment-based methods at detection of mistranslations.

* 7 pages, 5 figures, 1 table. Submitted to ACL 2025 System Demonstrations

Via

Access Paper or Ask Questions

Simulating Hard Attention Using Soft Attention

Dec 13, 2024

Andy Yang, Lena Strobl, David Chiang, Dana Angluin

Abstract:We study conditions under which transformers using soft attention can simulate hard attention, that is, effectively focus all attention on a subset of positions. First, we examine several variants of linear temporal logic, whose formulas have been previously been shown to be computable using hard attention transformers. We demonstrate how soft attention transformers can compute formulas of these logics using unbounded positional embeddings or temperature scaling. Second, we demonstrate how temperature scaling allows softmax transformers to simulate a large subclass of average-hard attention transformers, those that have what we call the uniform-tieless property.

Via

Access Paper or Ask Questions

Improving Rare Word Translation With Dictionaries and Attention Masking

Aug 17, 2024

Kenneth J. Sible, David Chiang

Abstract:In machine translation, rare words continue to be a problem for the dominant encoder-decoder architecture, especially in low-resource and out-of-domain translation settings. Human translators solve this problem with monolingual or bilingual dictionaries. In this paper, we propose appending definitions from a bilingual dictionary to source sentences and using attention masking to link together rare words with their definitions. We find that including definitions for rare words improves performance by up to 1.0 BLEU and 1.6 MacroF1.

Via

Access Paper or Ask Questions

Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn't

Jun 13, 2024

Chihiro Taguchi, David Chiang

Abstract:We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexities both degrade accuracy. To examine this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and we compare their ASR accuracy, number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results demonstrate that orthographic complexities significantly correlate with low ASR accuracy, while phonological complexity shows no significant correlation.

* 11 pages, 5 figures, 5 tables, submitted to ACL 2024

Via

Access Paper or Ask Questions

PILA: A Historical-Linguistic Dataset of Proto-Italic and Latin

Apr 25, 2024

Stephen Bothwell, Brian DuSell, David Chiang, Brian Krostenko

Abstract:Computational historical linguistics seeks to systematically understand processes of sound change, including during periods at which little to no formal recording of language is attested. At the same time, few computational resources exist which deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA's value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA's capability for enhancing other historical-linguistic datasets through a dataset compatibility study.

* 12 pages, 1 figure, 9 tables. Accepted at LREC-COLING 2024

Via

Access Paper or Ask Questions

Killkan: The Automatic Speech Recognition Dataset for Kichwa with Morphosyntactic Information

Apr 23, 2024

Chihiro Taguchi, Jefferson Saransig, Dayana Velásquez, David Chiang

Abstract:This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their community.

* 11 pages, 9 tables, 3 figures, to be published in LREC-COLING 2024

Via

Access Paper or Ask Questions

Nostra Domina at EvaLatin 2024: Improving Latin Polarity Detection through Data Augmentation

Apr 11, 2024

Stephen Bothwell, Abigail Swenor, David Chiang

Abstract:This paper describes submissions from the team Nostra Domina to the EvaLatin 2024 shared task of emotion polarity detection. Given the low-resource environment of Latin and the complexity of sentiment in rhetorical genres like poetry, we augmented the available data through automatic polarity annotation. We present two methods for doing so on the basis of the $k$-means algorithm, and we employ a variety of Latin large language models (LLMs) in a neural architecture to better capture the underlying contextual sentiment representations. Our best approach achieved the second highest macro-averaged Macro-$F_1$ score on the shared task's test set.

* Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages

Via

Access Paper or Ask Questions

We're Calling an Intervention: Taking a Closer Look at Language Model Adaptation to Different Types of Linguistic Variation

Apr 10, 2024

Aarohi Srivastava, David Chiang

Figure 1 for We're Calling an Intervention: Taking a Closer Look at Language Model Adaptation to Different Types of Linguistic Variation

Figure 2 for We're Calling an Intervention: Taking a Closer Look at Language Model Adaptation to Different Types of Linguistic Variation

Figure 3 for We're Calling an Intervention: Taking a Closer Look at Language Model Adaptation to Different Types of Linguistic Variation

Figure 4 for We're Calling an Intervention: Taking a Closer Look at Language Model Adaptation to Different Types of Linguistic Variation

Abstract:We present a suite of interventions and experiments that allow us to understand language model adaptation to text with linguistic variation (e.g., nonstandard or dialectal text). Our interventions address several features of linguistic variation, resulting in character, subword, and word-level changes. Applying our interventions during language model adaptation with varying size and nature of training data, we gain important insights into what makes linguistic variation particularly difficult for language models to deal with. For instance, on text with character-level variation, performance improves with even a few training examples but approaches a plateau, suggesting that more data is not the solution. In contrast, on text with variation involving new words or meanings, far more data is needed, but it leads to a massive breakthrough in performance. Our findings inform future work on dialectal NLP and making language models more robust to linguistic variation overall. We make the code for our interventions, which can be applied to any English text data, publicly available.

* Preprint. Under review

Via

Access Paper or Ask Questions

Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers

Apr 05, 2024

Andy Yang, David Chiang

Abstract:Deriving formal bounds on the expressivity of transformers, as well as studying transformers that are constructed to implement known algorithms, are both effective methods for better understanding the computational power of transformers. Towards both ends, we introduce the temporal counting logic $\textbf{K}_\text{t}$[#] alongside the RASP variant $\textbf{C-RASP}$. We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers with unbounded input size. We prove this by showing all $\textbf{K}_\text{t}$[#] formulas can be compiled into these transformers. As a case study, we demonstrate on paper how to use $\textbf{C-RASP}$ to construct simple transformer language models that, using greedy decoding, can only generate sentences that have given properties formally specified in $\textbf{K}_\text{t}$[#].

Via

Access Paper or Ask Questions

Transformers as Transducers

Apr 02, 2024

Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, Ashish Sabharwal

Abstract:We study the sequence-to-sequence mapping capacity of transformers by relating them to finite transducers, and find that they can express surprisingly large classes of transductions. We do so using variants of RASP, a programming language designed to help people "think like transformers," as an intermediate representation. We extend the existing Boolean variant B-RASP to sequence-to-sequence functions and show that it computes exactly the first-order rational functions (such as string rotation). Then, we introduce two new extensions. B-RASP[pos] enables calculations on positions (such as copying the first half of a string) and contains all first-order regular functions. S-RASP adds prefix sum, which enables additional arithmetic operations (such as squaring a string) and contains all first-order polyregular functions. Finally, we show that masked average-hard attention transformers can simulate S-RASP. A corollary of our results is a new proof that transformer decoders are Turing-complete.

Via

Access Paper or Ask Questions