Hendra Setiawan

Automating Behavioral Testing in Machine Translation

Sep 07, 2023
Javier Ferrando, Matthias Sperber, Hendra Setiawan, Dominic Telaar, Saša Hasan

Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is currently restricted to largely handcrafted tests covering a limited range of capabilities and languages. To address this limitation, we propose to use Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We then verify whether the MT model exhibits the expected behavior by matching its output against candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply the proposed evaluation framework to several available MT systems, revealing that while pass rates generally follow the trends observed with traditional accuracy-based metrics, our method uncovers several important differences and potential bugs that go unnoticed when relying on accuracy alone.
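The core evaluation step described above can be sketched as a candidate-set match: a behavioral test passes when the MT output is found in an LLM-generated set of acceptable translations. The function names and the normalization scheme below are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of candidate-set matching for behavioral testing of MT.
# Normalization here is a simplifying assumption (lowercase + whitespace).

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial surface variation does not fail a test."""
    return " ".join(text.lower().split())

def passes_test(mt_output: str, candidates: list[str]) -> bool:
    """Pass if the normalized MT output equals any normalized acceptable candidate."""
    return normalize(mt_output) in {normalize(c) for c in candidates}

def pass_rate(outputs: list[str], candidate_sets: list[list[str]]) -> float:
    """Fraction of behavioral tests the MT system passes."""
    results = [passes_test(o, c) for o, c in zip(outputs, candidate_sets)]
    return sum(results) / len(results)
```

A system is then summarized by its pass rate per capability, which is the quantity compared against traditional accuracy metrics.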

One Wide Feedforward is All You Need

Sep 04, 2023
Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work, we explore the role of the FFN and find that, despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN from the decoder layers and sharing a single FFN across the encoder. Finally, we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.
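The parameter accounting behind this idea can be sketched with back-of-the-envelope arithmetic: drop the decoder FFNs, share one encoder FFN, then widen it to spend the freed-up budget. The layer counts and dimensions below are standard Transformer Big values used purely for illustration, not numbers taken from the paper.

```python
# Illustrative FFN parameter accounting for the shared "one wide FFN" idea.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one two-layer FFN (two weight matrices plus biases)."""
    return 2 * d_model * d_ff + d_ff + d_model

def total_ffn_params(n_enc: int, n_dec: int, d_model: int, d_ff: int,
                     share_encoder: bool = False, drop_decoder: bool = False) -> int:
    """Total FFN parameters across encoder and decoder under each configuration."""
    enc = ffn_params(d_model, d_ff) * (1 if share_encoder else n_enc)
    dec = 0 if drop_decoder else ffn_params(d_model, d_ff) * n_dec
    return enc + dec

baseline = total_ffn_params(6, 6, 1024, 4096)        # 12 independent FFNs
one_wide = total_ffn_params(6, 6, 1024, 12 * 4096,   # single shared FFN, ~12x wider
                            share_encoder=True, drop_decoder=True)
```

With these toy numbers, a single FFN widened roughly 12x lands just under the baseline FFN budget, which is the "scale back to original size" step in the abstract.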

Accurate Knowledge Distillation with n-best Reranking

May 20, 2023
Hendra Setiawan

We propose extending Sequence-level Knowledge Distillation (Kim and Rush, 2016) with n-best reranking, considering not only the top-1 hypothesis but the top n-best hypotheses of teacher models. Our approach leverages a diverse set of models, including publicly available large pretrained models, to provide more accurate pseudo-labels for training student models. We validate our proposal on the WMT21 German-English translation task and demonstrate that our student model achieves accuracy comparable to a large translation model with 4.7 billion parameters from Tran et al. (2021) while having two orders of magnitude fewer parameters.
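The reranking step can be sketched as scoring each teacher hypothesis with a weighted ensemble of models and keeping the best-scoring one as the pseudo-label. The toy scorers below are stand-ins for the diverse (including large pretrained) models the abstract mentions.

```python
# Sketch of pseudo-label selection via weighted n-best reranking.
# Scorers and weights are toy assumptions for illustration.

def rerank(hypotheses: list[str], scorers, weights) -> str:
    """Return the hypothesis with the highest weighted combined score."""
    def combined(h: str) -> float:
        return sum(w * s(h) for s, w in zip(scorers, weights))
    return max(hypotheses, key=combined)

def length_score(h: str) -> float:
    """Toy scorer: prefer hypotheses close to 4 tokens."""
    return -abs(len(h.split()) - 4)

def punct_score(h: str) -> float:
    """Toy scorer: prefer hypotheses that end in a period."""
    return 1.0 if h.endswith(".") else 0.0

best = rerank(["the cat sat", "the cat sat down .", "cat"],
              [length_score, punct_score], [1.0, 2.0])
```

The selected hypothesis then replaces the teacher's top-1 output as the training target for the student.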

Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data

Dec 20, 2022
Mozhdeh Gheini, Tatiana Likhomanenko, Matthias Sperber, Hendra Setiawan

Self-training has been shown to help address data scarcity in many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds it to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an absence of sufficient data resources. We show that under such data-deficient circumstances, the unlabeled data can differ significantly in domain from the supervised data, which degrades pseudo-label quality. We investigate two categories of remedies that require no additional supervision and target the domain mismatch: pseudo-label filtering and data augmentation. We show that this pseudo-label analysis and processing yields additional gains on top of the vanilla pseudo-labeling setup, for total improvements of up to 0.6% absolute WER and 2.2 BLEU points.
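One of the two remedies, pseudo-label filtering, can be sketched as keeping only the unlabeled utterances whose pseudo-labels the model scored above a confidence threshold. The dict fields and the use of a single confidence score are simplifying assumptions, not the paper's exact filtering criteria.

```python
# Sketch of confidence-based pseudo-label filtering.
# Example fields ("audio", "pseudo_label", "confidence") are illustrative.

def filter_pseudo_labels(examples: list[dict], threshold: float) -> list[dict]:
    """Drop pseudo-labeled examples the model was unsure about."""
    return [ex for ex in examples if ex["confidence"] >= threshold]

pool = [
    {"audio": "utt1", "pseudo_label": "hello world", "confidence": 0.92},
    {"audio": "utt2", "pseudo_label": "noisy guess", "confidence": 0.41},
]
kept = filter_pseudo_labels(pool, threshold=0.8)  # only utt1 survives
```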

End-to-End Speech Translation for Code Switched Speech

Apr 11, 2022
Orion Weller, Matthias Sperber, Telmo Pires, Hendra Setiawan, Christian Gollan, Dominic Telaar, Matthias Paulik

Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages. CS can pose significant accuracy challenges to NLP, due to the often monolingual nature of the underlying systems. In this work, we focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation. To evaluate model performance on this task, we create a novel ST corpus derived from existing public data sets. We explore various ST architectures across two dimensions: cascaded (transcribe then translate) vs end-to-end (jointly transcribe and translate) and unidirectional (source -> target) vs bidirectional (source <-> target). We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.
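A common way to steer a single bidirectional end-to-end model, sketched below, is to prepend a direction tag and emit transcript and translation as one target sequence. The tag and separator tokens here are illustrative assumptions, not the paper's exact vocabulary.

```python
# Sketch of joint target construction for a bidirectional ST model.
# "<en2es>", "<es2en>", and "<sep>" are hypothetical special tokens.

SEP = "<sep>"

def make_target(transcript: str, translation: str, direction: str) -> str:
    """Build the joint decoding target for one utterance."""
    tag = {"en->es": "<en2es>", "es->en": "<es2en>"}[direction]
    return f"{tag} {transcript} {SEP} {translation}"

def split_target(target: str) -> tuple[str, str]:
    """Recover transcript and translation from a decoded joint target."""
    body = target.split(" ", 1)[1]  # drop the direction tag
    transcript, translation = body.split(f" {SEP} ")
    return transcript, translation
```

Because one model sees both directions, code-switched input that mixes languages mid-utterance does not force a hard choice of source language up front.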

* Accepted to Findings of ACL 2022 

Consistent Transcription and Translation of Speech

Aug 28, 2020
Matthias Sperber, Hendra Setiawan, Christian Gollan, Udhyakumar Nallasamy, Matthias Paulik

The conventional paradigm in speech translation starts with a speech recognition step to generate transcripts, followed by a translation step with the automatic transcripts as input. To address various shortcomings of this paradigm, recent work explores end-to-end trainable direct models that translate without transcribing. However, transcripts can be an indispensable output in practical applications, which often display transcripts alongside the translations to users. We make this common requirement explicit and explore the task of jointly transcribing and translating speech. While high accuracy of transcript and translation are crucial, even highly accurate systems can suffer from inconsistencies between both outputs that degrade the user experience. We introduce a methodology to evaluate consistency and compare several modeling approaches, including the traditional cascaded approach and end-to-end models. We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency. We further introduce simple techniques for directly optimizing for consistency, and analyze the resulting trade-offs between consistency, transcription accuracy, and translation accuracy.
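To make the notion of transcript/translation consistency concrete, one toy proxy is the fraction of transcript words with a known dictionary translation that actually appears in the translation output. This lexicon-overlap measure is our own illustrative assumption; the paper develops its own consistency evaluation methodology.

```python
# Toy lexicon-overlap proxy for transcript/translation consistency.

def lexical_consistency(transcript: str, translation: str,
                        lexicon: dict[str, set[str]]) -> float:
    """Fraction of lexicon-covered transcript words whose translation appears in the output."""
    src_words = transcript.lower().split()
    tgt_words = set(translation.lower().split())
    in_lex = [w for w in src_words if w in lexicon]
    if not in_lex:
        return 1.0  # nothing checkable counts as consistent
    covered = sum(1 for w in in_lex if lexicon[w] & tgt_words)
    return covered / len(in_lex)
```

Under a measure like this, two individually accurate outputs can still score poorly together, which is exactly the user-facing failure mode the abstract describes.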

* Accepted at TACL (pre-MIT Press publication version); added dataset link 

Variational Neural Machine Translation with Normalizing Flows

May 28, 2020
Hendra Setiawan, Matthias Sperber, Udhay Nallasamy, Matthias Paulik

Variational Neural Machine Translation (VNMT) is an attractive framework for modeling the generation of target translations, conditioned not only on the source sentence but also on some latent random variables. The latent variable modeling may introduce useful statistical dependencies that can improve translation accuracy. Unfortunately, learning informative latent variables is non-trivial, as the latent space can be prohibitively large, and the latent codes are prone to be ignored by many translation models at training time. Previous works impose strong assumptions on the distribution of the latent code and limit the choice of the NMT architecture. In this paper, we propose to apply the VNMT framework to the state-of-the-art Transformer and introduce a more flexible approximate posterior based on normalizing flows. We demonstrate the efficacy of our proposal under both in-domain and out-of-domain conditions, significantly outperforming strong baselines.
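The kind of building block a flow-based posterior stacks is a simple invertible transform with a tractable Jacobian. Below is one planar flow step, f(z) = z + u * tanh(w·z + b), in pure Python for exposition; a real implementation would be vectorized, learned, and constrained to stay invertible (w·u >= -1).

```python
import math

# One planar normalizing-flow step and its log|det Jacobian|.
# Pure-Python sketch for exposition, not the paper's implementation.

def planar_flow(z: list[float], u: list[float], w: list[float], b: float):
    """Apply f(z) = z + u * tanh(w.z + b); return (f(z), log|det df/dz|)."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    f = [zi + ui * h for zi, ui in zip(z, u)]
    # psi(z) = (1 - tanh^2(a)) * w;  log|det| = log|1 + u.psi(z)|
    dh = 1.0 - h * h
    log_det = math.log(abs(1.0 + dh * sum(ui * wi for ui, wi in zip(u, w))))
    return f, log_det
```

Stacking several such steps on top of a simple base distribution yields the more flexible approximate posterior the abstract refers to, with the log-determinants summed into the ELBO.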

* To appear in 2020 Association for Computational Linguistics (ACL) as a short paper 

Statistical Machine Translation Features with Multitask Tensor Networks

Jun 01, 2015
Hendra Setiawan, Zhongqiang Huang, Jacob Devlin, Thomas Lamar, Rabih Zbib, Richard Schwartz, John Makhoul

We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features.
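The "higher-order interaction" idea behind a tensor layer can be sketched as a bilinear form: each output unit k scores pairwise interactions between input units via y_k = x^T W_k x. The pure-Python toy below is for illustration only, not the paper's network.

```python
# Toy bilinear (tensor) layer: one weight slice W_k per output unit.

def tensor_layer(x: list[float], W: list[list[list[float]]]) -> list[float]:
    """y_k = sum_{i,j} x_i * W_k[i][j] * x_j for each slice W_k."""
    n = len(x)
    return [sum(x[i] * W_k[i][j] * x[j] for i in range(n) for j in range(n))
            for W_k in W]
```

Unlike a linear layer, which only weights units individually, each slice here can reward or penalize specific pairs of units firing together.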

* 11 pages (9 content + 2 references), 2 figures, accepted to ACL 2015 as a long paper 