Ngoc Thang Vu

The Who in Code-Switching: A Case Study for Predicting Egyptian Arabic-English Code-Switching Levels based on Character Profiles

Jul 31, 2022

Injy Hamed, Alia El Bolock, Cornelia Herbert, Slim Abdennadher, Ngoc Thang Vu

Abstract: Code-switching (CS) is a common linguistic phenomenon exhibited by multilingual individuals, where they tend to alternate between languages within a single conversation. CS is a complex phenomenon that not only encompasses linguistic challenges, but also contains a great deal of complexity in terms of its dynamic behaviour across speakers. Given that the factors giving rise to CS vary from one country to another, as well as from one person to another, CS is found to be a speaker-dependent behaviour, where the frequency with which the foreign language is embedded differs across speakers. While several researchers have looked into predicting CS behaviour from a linguistic point of view, research is still lacking in the task of predicting user CS behaviour from sociological and psychological perspectives. We provide an empirical user study, where we investigate the correlations between users' CS levels and character traits. We conduct interviews with bilinguals and gather information on their profiles, including their demographics, personality traits, and traveling experiences. We then use machine learning (ML) to predict users' CS levels based on their profiles, where we identify the main influential factors in the modeling process. We experiment with both classification and regression tasks. Our results show that CS behaviour is affected by the relation between speakers, travel experiences, and the Neuroticism and Extraversion personality traits.

* To be published in the International Journal of Asian Language Processing. arXiv admin note: substantial text overlap with arXiv:2112.06462

PoeticTTS -- Controllable Poetry Reading for Literary Studies

Jul 11, 2022

Julia Koch, Florian Lux, Nadja Schauffler, Toni Bernhart, Felix Dieterle, Jonas Kuhn, Sandra Richter, Gabriel Viehhauser, Ngoc Thang Vu

Abstract: Speech synthesis for poetry is challenging due to specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human-like naturalness in order to enable literary scholars to systematically examine hypotheses on the interplay between text, spoken realisation, and the listener's perception of poems. To meet these special requirements for literary studies, we resynthesise poems by cloning prosodic values from a human reference recitation, and afterwards make use of fine-grained prosody control to manipulate the synthetic speech in a human-in-the-loop setting to alter the recitation w.r.t. specific phenomena. We find that finetuning our TTS model on poetry captures poetic intonation patterns to a large extent, which is beneficial for prosody cloning and manipulation, and we verify the success of our approach both in an objective evaluation and in human studies.

* Accepted to Interspeech 2022

Speaker Anonymization with Phonetic Intermediate Representations

Jul 11, 2022

Sarina Meyer, Florian Lux, Pavel Denisov, Julia Koch, Pascal Tilli, Ngoc Thang Vu

Abstract: In this work, we propose a speaker anonymization pipeline that leverages high-quality automatic speech recognition and synthesis systems to generate speech conditioned on phonetic transcriptions and anonymized speaker embeddings. Using phones as the intermediate representation ensures near-complete elimination of speaker identity information from the input while preserving the original phonetic content as much as possible. Our experimental results on the LibriSpeech and VCTK corpora reveal two key findings: 1) although automatic speech recognition produces imperfect transcriptions, our neural speech synthesis system can handle such errors, making our system feasible and robust, and 2) combining speaker embeddings from different resources is beneficial and their appropriate normalization is crucial. Overall, our final best system significantly outperforms the baselines provided in the Voice Privacy Challenge 2020 in terms of privacy robustness against a lazy-informed attacker while maintaining high intelligibility and naturalness of the anonymized speech.

* Accepted at Interspeech 2022

Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

Jun 24, 2022

Florian Lux, Julia Koch, Ngoc Thang Vu

Abstract: The cloning of a speaker's voice using an untranscribed reference sample is one of the great advances of modern neural text-to-speech (TTS) methods. Approaches for mimicking the prosody of a transcribed reference audio have also been proposed recently. In this work, we bring these two tasks together for the first time through utterance-level normalization in conjunction with an utterance-level speaker embedding. We further introduce a lightweight aligner for extracting fine-grained prosodic features that can be finetuned on individual samples within seconds. We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently, without any degradation in quality and with high similarity to both the original voice and prosody, as our objective evaluation and human study show. All of our code and trained models are available, alongside static and interactive demos.

Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

May 25, 2022

Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Abstract: Code-switching (CS) poses several challenges to NLP tasks, where data sparsity is a main problem hindering the development of CS NLP systems. In this paper, we investigate data augmentation techniques for synthesizing Dialectal Arabic-English CS text. We perform lexical replacements using parallel corpora and alignments where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We evaluate the effectiveness of data augmentation on language modeling (LM), machine translation (MT), and automatic speech recognition (ASR) tasks. Results show that in the case of using 1-1 alignments, using trained predictive models produces more natural CS sentences, as reflected in perplexity. By relying on grow-diag-final alignments, we then identify aligning segments and perform replacements accordingly. By replacing segments instead of words, the quality of synthesized data is greatly improved. With this improvement, the random-based approach outperforms using trained predictive models on all extrinsic tasks. Our best models achieve a 33.6% improvement in perplexity, +3.2-5.6 BLEU points on the MT task, and a 7% relative improvement in WER on the ASR task. We also contribute to filling the gap in resources by collecting and publishing the first Arabic-English CS-English parallel corpus.
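
The word-level variant described in the abstract (lexical replacement over 1-1 alignments with randomly chosen CS points) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the `switch_prob` parameter, and the `(source_index, target_index)` alignment format are assumptions made for illustration, and the tokens below are placeholders rather than real Arabic or English words.

```python
import random


def random_lexical_replacement(src_tokens, tgt_tokens, alignments,
                               switch_prob=0.3, seed=0):
    """Swap randomly chosen source words for their aligned target words.

    `alignments` is a list of (source_index, target_index) pairs
    (a hypothetical format). Only strictly 1-1 links are used.
    """
    rng = random.Random(seed)

    # Count how often each index appears so many-to-one links can be dropped.
    src_counts, tgt_counts = {}, {}
    for s, t in alignments:
        src_counts[s] = src_counts.get(s, 0) + 1
        tgt_counts[t] = tgt_counts.get(t, 0) + 1
    one_to_one = {s: t for s, t in alignments
                  if src_counts[s] == 1 and tgt_counts[t] == 1}

    output = []
    for i, token in enumerate(src_tokens):
        # A CS point is chosen at random among the 1-1 aligned positions.
        if i in one_to_one and rng.random() < switch_prob:
            output.append(tgt_tokens[one_to_one[i]])
        else:
            output.append(token)
    return output


if __name__ == "__main__":
    src = ["a1", "a2", "a3"]          # placeholder source-language tokens
    tgt = ["e1", "e2", "e3"]          # placeholder target-language tokens
    links = [(0, 0), (1, 2), (2, 2)]  # positions 1 and 2 share a target word
    print(random_lexical_replacement(src, tgt, links, switch_prob=1.0))
```

With `switch_prob=1.0` every 1-1 aligned position is replaced, which makes the filtering of non-1-1 links easy to inspect. The segment-level replacement and the learnt CS-point predictor mentioned in the abstract are not covered by this sketch.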

Meta Learning for Natural Language Processing: A Survey

May 03, 2022

Hung-yi Lee, Shang-Wen Li, Ngoc Thang Vu

Abstract: Deep learning has been the mainstream technique in the natural language processing (NLP) area. However, these techniques require large amounts of labeled data and are less generalizable across domains. Meta-learning is an emerging field in machine learning that studies approaches to learn better learning algorithms. These approaches aim at improving algorithms in various aspects, including data efficiency and generalizability. The efficacy of these approaches has been shown in many NLP tasks, but there is no systematic survey of them in NLP, which hinders more researchers from joining the field. Our goal with this survey paper is to offer researchers pointers to relevant meta-learning works in NLP and attract more attention from the NLP community to drive future innovation. This paper first introduces the general concepts of meta-learning and the common approaches. Then we summarize task construction settings and applications of meta-learning for various NLP problems and review the development of meta-learning in the NLP community.

* Accepted by NAACL 2022

BPE vs. Morphological Segmentation: A Case Study on Machine Translation of Four Polysynthetic Languages

Mar 16, 2022

Manuel Mager, Arturo Oncevay, Elisabeth Mager, Katharina Kann, Ngoc Thang Vu

Abstract: Morphologically rich polysynthetic languages present a challenge for NLP systems due to data sparsity, and a common strategy to handle this issue is to apply subword segmentation. We investigate a wide variety of supervised and unsupervised morphological segmentation methods for four polysynthetic languages: Nahuatl, Raramuri, Shipibo-Konibo, and Wixarika. Then, we compare the morphologically inspired segmentation methods against Byte-Pair Encodings (BPEs) as inputs for machine translation (MT) when translating to and from Spanish. We show that for all language pairs except for Nahuatl, an unsupervised morphological segmentation algorithm outperforms BPEs consistently and that, although supervised methods achieve better segmentation scores, they under-perform in MT challenges. Finally, we contribute two new morphological segmentation datasets for Raramuri and Shipibo-Konibo, and a parallel corpus for Raramuri-Spanish.

* Accepted to Findings of ACL 2022
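
For readers unfamiliar with the BPE baseline named in the abstract, the following is a minimal sketch of how byte-pair-encoding merges are typically learned from word frequencies. It is a generic textbook version, not the configuration used in the paper; the function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions chosen for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merge operations from a word-frequency dict."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe({"low": 5, "lower": 2}, num_merges=2))
```

Because merges are driven purely by pair frequency, the resulting units need not coincide with morpheme boundaries, which is the contrast with morphological segmentation that the abstract examines.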

Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features

Mar 07, 2022

Florian Lux, Ngoc Thang Vu

Abstract: While neural text-to-speech systems perform remarkably well in high-resource scenarios, they cannot be applied to the majority of the over 6,000 spoken languages in the world due to a lack of appropriate training data. In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages. In conjunction with language-agnostic meta-learning, this enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.

* Accepted for the ACL 2022 main conference

Human Interpretation of Saliency-based Explanation Over Text

Jan 27, 2022

Hendrik Schuff, Alon Jacovi, Heike Adel, Yoav Goldberg, Ngoc Thang Vu

Abstract: While a lot of research in explainable AI focuses on producing effective explanations, less work is devoted to the question of how people understand and interpret the explanation. In this work, we focus on this question through a study of saliency-based explanations over textual data. Feature-attribution explanations of text models aim to communicate which parts of the input text were more influential than others towards the model decision. Many current explanation methods, such as gradient-based or Shapley value-based methods, provide measures of importance which are well-understood mathematically. But how does a person receiving the explanation (the explainee) comprehend it? And does their understanding match what the explanation attempted to communicate? We empirically investigate the effect of various factors of the input, the feature-attribution explanation, and the visualization procedure on laypeople's interpretation of the explanation. We query crowdworkers for their interpretation on tasks in English and German, and fit a GAMM to their responses considering the factors of interest. We find that people often misinterpret the explanations: superficial and unrelated factors, such as word length, influence the explainees' importance assignment despite the explanation communicating importance directly. We then show that some of this distortion can be attenuated: we propose a method to adjust saliencies based on model estimates of over- and under-perception, and explore bar charts as an alternative to heatmap saliency visualization. We find that both approaches can attenuate the distorting effect of specific factors, leading to a better-calibrated understanding of the explanation.

Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching

Dec 19, 2021

Chia-Yu Li, Ngoc Thang Vu

Abstract: Code-Switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech. We analyse different CS-specific issues such as the property mismatches between languages in a CS language pair, the unpredictable nature of switching points, and the data scarcity problem. We exploit and improve the state-of-the-art end-to-end system by merging nonlinguistic symbols, by integrating language identification using hierarchical softmax, by modeling sub-word units, by artificially lowering the speaking rate, and by augmenting data using the speed perturbation technique and several monolingual datasets. These steps improve the final performance not only on CS speech but also on monolingual benchmarks, in order to make the system more applicable in real-life settings. Finally, we explore the effect of different language model integration methods on the performance of the proposed model. Our experimental results reveal that all the proposed techniques improve the recognition performance. The best combined system improves on the baseline system by up to 35% relative in terms of mixed error rate and delivers acceptable performance on monolingual benchmarks.

* The 2019 International Conference on Asian Language Processing (IALP)
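
The abstract reports results in mixed error rate without defining it. A minimal sketch follows, assuming the convention often used for Mandarin-English code-switched speech: each Mandarin character and each English word counts as one token, and the rate is the token-level edit distance divided by the reference length. The tokenization regex and function names are illustrative assumptions, not taken from the paper.

```python
import re

# One token per CJK character, or per run of Latin letters/apostrophes.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z']+")


def mixed_tokens(text):
    """Split text into Mandarin characters and lower-cased English words."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (rolling single row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def mixed_error_rate(ref_text, hyp_text):
    """Token-level edit distance normalized by the reference length."""
    ref = mixed_tokens(ref_text)
    hyp = mixed_tokens(hyp_text)
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Reference has 4 tokens; the hypothesis drops one character.
    print(mixed_error_rate("我想要 coffee", "我要 coffee"))
```

Counting Mandarin at the character level and English at the word level avoids depending on a Chinese word segmenter, which is the usual motivation for this kind of metric.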
