Ngoc Thang Vu

Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions

Oct 26, 2023
Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu

Customizing the voice and speaking style of a speech synthesis system with intuitive and fine-grained controls is challenging, given that little data with appropriate labels is available. Furthermore, editing an existing human's voice comes with ethical concerns. In this paper, we propose a method to generate artificial speaker embeddings that cannot be linked to a real human while offering intuitive and fine-grained control over the voice and speaking style of the embeddings, without requiring any labels for speaker or style. The artificial and controllable embeddings can be fed to a speech synthesis system that is conditioned on embeddings of real humans during training, without sacrificing privacy during inference.
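As a rough illustration of the general idea, the sketch below fits principal directions on a pool of speaker embeddings and maps an interpretable control vector back into embedding space. The use of plain PCA, the dimensions, and all names are illustrative assumptions, not the paper's implementation.

# Minimal sketch: discovering principal directions in a pool of speaker
# embeddings and using them as control axes for an artificial voice.
# All names and dimensions are illustrative; not the paper's implementation.
import numpy as np
from sklearn.decomposition import PCA

def fit_control_space(speaker_embeddings: np.ndarray, n_directions: int = 8) -> PCA:
    """speaker_embeddings: (num_speakers, embedding_dim) from a pretrained encoder."""
    pca = PCA(n_components=n_directions)
    pca.fit(speaker_embeddings)
    return pca

def sample_artificial_embedding(pca: PCA, controls: np.ndarray) -> np.ndarray:
    """controls: (n_directions,) user-chosen coordinates along the principal axes.
    Maps an interpretable control vector back to embedding space."""
    return pca.inverse_transform(controls[None, :])[0]

# Example: start from the mean voice and push the first discovered direction
# one standard deviation up.
embeddings = np.random.randn(500, 192)          # stand-in for real speaker embeddings
pca = fit_control_space(embeddings)
controls = np.zeros(pca.n_components_)
controls[0] = np.sqrt(pca.explained_variance_[0])
artificial = sample_artificial_embedding(pca, controls)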

* Published at ISCA Interspeech 2023 https://www.isca-speech.org/archive/interspeech_2023/lux23_interspeech.html 

The IMS Toucan System for the Blizzard Challenge 2023

Oct 26, 2023
Florian Lux, Julia Koch, Sarina Meyer, Thomas Bott, Nadja Schauffler, Pavel Denisov, Antje Schweitzer, Ngoc Thang Vu

For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes into spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN-based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram into the final waveform. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open-source code and a demo are available.
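The pipeline described above can be pictured as three stages. The skeleton below only mirrors that structure; the class and method names are hypothetical and do not correspond to the actual IMS Toucan code.

# Illustrative skeleton of the described stages (rule-based phonemization ->
# non-autoregressive acoustic model -> GAN vocoder). Names are hypothetical.
import numpy as np

class RuleBasedFrontend:
    def phonemize(self, text: str, language: str = "fr") -> list[str]:
        # Rule-based grapheme-to-phoneme conversion, including homograph
        # disambiguation for French (e.g. "est" as the verb vs. the noun "east").
        ...

class AcousticModel:
    def spectrogram(self, phonemes: list[str]) -> np.ndarray:
        # Fast non-autoregressive synthesis (Conformer- and Glow-based)
        # predicting a spectrogram in a single pass.
        ...

class NeuralVocoder:
    def waveform(self, spectrogram: np.ndarray) -> np.ndarray:
        # GAN-based vocoder converting the spectrogram to audio samples.
        ...

def synthesize(text: str) -> np.ndarray:
    phonemes = RuleBasedFrontend().phonemize(text)
    spec = AcousticModel().spectrogram(phonemes)
    return NeuralVocoder().waveform(spec)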

* Published at the Blizzard Challenge Workshop 2023, co-located with the Speech Synthesis Workshop 2023, a satellite event of Interspeech 2023 

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

Oct 23, 2023
Injy Hamed, Nizar Habash, Ngoc Thang Vu

Code-switching (CSW) text generation has been receiving increasing attention as a solution to data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of the augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, both trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove effective in the absence of CSW parallel data, where both approaches achieve similar results.
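To make the lexical-replacement idea concrete, here is a toy sketch of random lexical replacement: a random subset of source tokens is swapped for dictionary translations to produce synthetic code-switched text. The lexicon, tokenization, and switching probability are placeholders, not the paper's setup (which also covers predictive, model-based replacements).

# Toy random lexical replacement for synthetic code-switching.
import random

def random_lexical_replacement(tokens: list[str],
                               bilingual_lexicon: dict[str, str],
                               switch_prob: float = 0.3,
                               seed: int | None = None) -> list[str]:
    """Swap each token for its translation with probability switch_prob."""
    rng = random.Random(seed)
    return [bilingual_lexicon.get(tok, tok) if rng.random() < switch_prob else tok
            for tok in tokens]

# Example with a made-up transliterated lexicon (illustrative only).
lexicon = {"mabsuta": "happy", "awi": "very"}
print(random_lexical_replacement("ana kont mabsuta awi".split(), lexicon, seed=0))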

* Findings of EMNLP 2023 

Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-to-Sequence End-to-End Spoken Language Understanding

Oct 09, 2023
Pavel Denisov, Ngoc Thang Vu

A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models; however, their evaluation often lacks a multilingual setup and tasks that require the prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU on six datasets in four languages in a generative manner, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7000 hours of multilingual data allows us to outperform the state of the art outright on two SLU datasets and partially on two more. Finally, we examine the cross-lingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.
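The generative formulation means the decoder emits the semantic annotation, slot fillers included, as ordinary text. The serialization below is an illustrative assumption showing what such a target string might look like; it is not the format used in the paper.

# Hypothetical flat target format for generative SLU (intent + slot fillers).
def serialize_slu(intent: str, slots: dict[str, str]) -> str:
    """Turn a structured SLU annotation into a flat decoder target string."""
    return " | ".join([intent] + [f"{name}={value}" for name, value in slots.items()])

def deserialize_slu(text: str) -> tuple[str, dict[str, str]]:
    """Recover intent and slot fillers from a generated string."""
    intent, *pairs = text.split(" | ")
    return intent, dict(pair.split("=", 1) for pair in pairs if "=" in pair)

target = serialize_slu("book_flight", {"departure": "Paris", "arrival": "New York"})
print(target)                    # book_flight | departure=Paris | arrival=New York
print(deserialize_slu(target))   # ('book_flight', {'departure': 'Paris', 'arrival': 'New York'})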

* IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2023 

VoicePAT: An Efficient Open-source Evaluation Toolkit for Voice Privacy Research

Sep 14, 2023
Sarina Meyer, Xiaoxiao Miao, Ngoc Thang Vu


Speaker anonymization is the task of modifying a speech recording such that the original speaker can no longer be identified. Since the first Voice Privacy Challenge in 2020, along with the release of a framework, the popularity of this research topic has steadily increased. However, the comparison and combination of different anonymization approaches remains challenging due to the complexity of evaluation and the absence of user-friendly research frameworks. We therefore propose an efficient speaker anonymization and evaluation framework based on a modular and easily extendable structure, almost fully in Python. The framework facilitates the orchestration of several anonymization approaches in parallel and allows for interfacing between different techniques. Furthermore, we propose modifications to common evaluation methods which make the evaluation more powerful and reduce its computation time by 65 to 95%, depending on the metric. Our code is fully open source.
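Conceptually, the modular structure lets anonymization systems and evaluation metrics be mixed freely. The sketch below is a hypothetical interface in that spirit, not the VoicePAT API.

# Hypothetical modular interface: each anonymizer is run over the data and
# scored by every configured metric (e.g. speaker-verification EER for
# privacy, ASR WER for utility).
from typing import Callable, Protocol

class Anonymizer(Protocol):
    def anonymize(self, wav_path: str) -> str:
        """Return the path of the anonymized recording."""
        ...

def evaluate(anonymizers: dict[str, Anonymizer],
             metrics: dict[str, Callable[[list[str], list[str]], float]],
             utterances: list[str]) -> dict[str, dict[str, float]]:
    results: dict[str, dict[str, float]] = {}
    for name, anonymizer in anonymizers.items():
        anonymized = [anonymizer.anonymize(path) for path in utterances]
        results[name] = {metric_name: metric(utterances, anonymized)
                         for metric_name, metric in metrics.items()}
    return results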

* Submitted to OJSP-ICASSP 2024 

Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction

Jun 11, 2023
Manuel Mager, Rajat Bhatnagar, Graham Neubig, Ngoc Thang Vu, Katharina Kann


Neural models have drastically advanced the state of the art for machine translation (MT) between high-resource languages. These models traditionally rely on large amounts of training data, yet a substantial share of the world's languages lack such resources. Most languages of the Americas are among them, with limited parallel and monolingual data, if any. Here, we give the interested reader an introduction to the basic challenges, concepts, and techniques involved in building MT systems for these languages. Finally, we discuss recent advances, findings, and open questions that have emerged from the NLP community's increased interest in these languages.

* Accepted to AmericasNLP 2023 

Ethical Considerations for Machine Translation of Indigenous Languages: Giving a Voice to the Speakers

May 31, 2023
Manuel Mager, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu


In recent years, machine translation has become very successful for high-resource language pairs. This has also sparked new research interest in the automatic translation of low-resource languages, including Indigenous languages. However, the latter are deeply tied to the ethnic and cultural groups that speak (or used to speak) them. Collecting data for, modeling, and deploying machine translation systems thus raises new ethical questions that must be addressed. Motivated by this, we first survey the existing literature on ethical considerations for the documentation, translation, and general natural language processing of Indigenous languages. Afterward, we conduct and analyze an interview study to shed light on the positions of community leaders, teachers, and language activists regarding ethical concerns about the automatic translation of their languages. Our results show that including native speakers and community members, to varying degrees, is vital to performing better and more ethical research on Indigenous languages.

* Accepted to ACL2023 Main Conference 

Neighboring Words Affect Human Interpretation of Saliency Explanations

May 06, 2023
Alon Jacovi, Hendrik Schuff, Heike Adel, Ngoc Thang Vu, Yoav Goldberg


Word-level saliency explanations ("heat maps over words") are often used to communicate feature attribution in text-based models. Recent studies found that superficial factors such as word length can distort human interpretation of the communicated saliency scores. We conduct a user study to investigate how the marking of a word's neighboring words affects the explainee's perception of that word's importance in the context of a saliency explanation. We find that neighboring words have significant effects on the word's importance rating. Concretely, the influence changes depending on the neighboring direction (left vs. right) and on a-priori linguistic and computational measures of phrases and collocations (vs. unrelated neighboring words). Our results question whether text-based saliency explanations should continue to be communicated at the word level, and inform future research on alternative saliency explanation methods.

* Accepted to Findings of ACL 2023 

Oh, Jeez! or Uh-huh? A Listener-aware Backchannel Predictor on ASR Transcriptions

Apr 10, 2023
Daniel Ortega, Chia-Yu Li, Ngoc Thang Vu


This paper presents our latest investigation into modeling backchannels in conversations. Motivated by a proactive backchanneling theory, we aim to develop a system that acts as a proactive listener by inserting backchannels, such as continuers and assessments, to influence speakers. Our model takes into account not only lexical and acoustic cues, but also introduces the simple and novel idea of using listener embeddings to mimic different backchanneling behaviours. Our experimental results on the Switchboard benchmark dataset reveal that acoustic cues are more important than lexical cues in this task, and that their combination with listener embeddings works best on both manual and automatically generated transcriptions.
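A minimal sketch of the modeling idea, assuming word vectors, acoustic frames, and an integer listener ID as inputs. All dimensions, layer types, and the three-way label set are illustrative choices, not the published architecture.

# Sketch: combine lexical and acoustic cues with a learned per-listener
# embedding to predict whether (and which) backchannel to insert.
import torch
import torch.nn as nn

class ListenerAwareBackchannelPredictor(nn.Module):
    def __init__(self, lexical_dim=300, acoustic_dim=80, listener_count=100,
                 listener_dim=32, hidden=128, classes=3):
        super().__init__()
        self.lexical_rnn = nn.LSTM(lexical_dim, hidden, batch_first=True)
        self.acoustic_rnn = nn.LSTM(acoustic_dim, hidden, batch_first=True)
        self.listener_embedding = nn.Embedding(listener_count, listener_dim)
        # classes: no backchannel / continuer ("uh-huh") / assessment ("oh, jeez!")
        self.classifier = nn.Linear(2 * hidden + listener_dim, classes)

    def forward(self, word_vectors, acoustic_frames, listener_id):
        _, (lex_h, _) = self.lexical_rnn(word_vectors)        # (1, B, hidden)
        _, (ac_h, _) = self.acoustic_rnn(acoustic_frames)     # (1, B, hidden)
        listener = self.listener_embedding(listener_id)       # (B, listener_dim)
        features = torch.cat([lex_h[-1], ac_h[-1], listener], dim=-1)
        return self.classifier(features)                      # logits over classes

# Toy forward pass with random inputs.
model = ListenerAwareBackchannelPredictor()
logits = model(torch.randn(4, 20, 300), torch.randn(4, 50, 80), torch.randint(0, 100, (4,)))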

* Published in ICASSP 2020 