Thomas Drugman

Distribution augmentation for low-resource expressive text-to-speech

Feb 13, 2022
Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

This paper presents a novel data augmentation technique for text-to-speech (TTS) that generates new (text, audio) training examples without requiring any additional data. Our goal is to increase the diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves the robustness of attention-based TTS models.
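
The fragment substitution idea in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each utterance is already segmented into aligned fragments that carry a syntactic tag, and it swaps one fragment for a same-tag fragment from another utterance. The `Fragment` type, the tags, and `substitute_fragment` are hypothetical names, and the paper's extra measures against audio artifacts are not modelled.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fragment:
    tag: str      # hypothetical syntactic label, e.g. "NP" or "VP"
    text: str     # text of the fragment
    audio: tuple  # placeholder for the aligned audio samples


def substitute_fragment(
    target: List[Fragment], donor: List[Fragment], rng: random.Random
) -> Optional[List[Fragment]]:
    """Replace one fragment of `target` with a same-tag fragment from `donor`.

    Returns a new utterance, or None when no compatible pair exists.
    """
    candidates = [
        (i, d)
        for i, t in enumerate(target)
        for d in donor
        if d.tag == t.tag and d != t
    ]
    if not candidates:
        return None
    index, replacement = rng.choice(candidates)
    return target[:index] + [replacement] + target[index + 1:]


# Toy utterances with made-up tags and audio placeholders.
utt_a = [Fragment("NP", "the cat", (0.1, 0.2)), Fragment("VP", "sat down", (0.3,))]
utt_b = [Fragment("NP", "a dog", (0.5,)), Fragment("VP", "ran away", (0.6, 0.7))]

augmented = substitute_fragment(utt_a, utt_b, random.Random(0))
no_match = substitute_fragment(utt_a, [Fragment("PP", "in the park", (0.9,))], random.Random(0))
```

Matching on the tag keeps the sequence of syntactic labels unchanged, which is one simple way to approximate the "preserves syntactical correctness" requirement; text and audio are swapped together so the new pair stays aligned.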

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Jun 29, 2021
Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody. We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS, where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms, while Sentence-level MSS additionally models a sentence-level spectrogram. Subjective evaluations show that Word-level MSS performs statistically significantly better than the baseline on two voices.

* Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)
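
To make the notion of scales concrete, the sketch below derives coarser spectrograms from a frame-level one. It assumes, purely for illustration, that a coarser scale is the average of the frames inside each linguistic unit; the abstract does not say how the coarse targets are computed, and `coarsen` and the span lists are hypothetical.

```python
from typing import List, Sequence, Tuple


def coarsen(
    frames: Sequence[Sequence[float]], spans: Sequence[Tuple[int, int]]
) -> List[List[float]]:
    """Average frame-level mel vectors inside each (start, end) span, end exclusive."""
    coarse = []
    for start, end in spans:
        unit = frames[start:end]
        if not unit:
            raise ValueError(f"empty span ({start}, {end})")
        n_bins = len(unit[0])
        coarse.append([sum(f[b] for f in unit) / len(unit) for b in range(n_bins)])
    return coarse


# Four frames with two mel bins each (toy values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
phoneme_spans = [(0, 1), (1, 2), (2, 4)]  # assumed frame alignment per phoneme
word_spans = [(0, 2), (2, 4)]             # assumed frame alignment per word

phoneme_level = coarsen(frames, phoneme_spans)
word_level = coarsen(frames, word_spans)
sentence_level = coarsen(frames, [(0, len(frames))])
```

In this picture, Word-level MSS would work with `word_level`, `phoneme_level`, and `frames`, and Sentence-level MSS would add `sentence_level` on top.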

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Jun 16, 2021
Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing non-parallel zero-shot VC, an important requirement for any VC system that needs to work on speakers not seen during training. We have validated our approach using a noisy reverberant version of the LibriSpeech dataset. Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments.

* Presented at the Speech Synthesis Workshops 2021 (SSW11)

A learned conditional prior for the VAE acoustic space of a TTS system

Jun 14, 2021
Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman

Many factors influence speech, yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE while explicitly taking the conditioning into account, which yields samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates a significant preference for the proposed approach over a standard Conditional VAE. We also provide visualisations of the latent space where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.

* in Proceedings of Interspeech 2021

Weakly-supervised word-level pronunciation error detection in non-native English speech

Jun 07, 2021
Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required; we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of this, and due to the limited amount of mispronounced L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech that is easily accessible and can be automatically annotated. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors in terms of AUC by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.

* Accepted to Interspeech 2021
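
For readers unfamiliar with the AUC metric quoted in this abstract, here is a minimal sketch of how the area under the ROC curve can be computed for word-level mispronunciation scores. The labels and scores are made up for illustration and are not data from the paper; `roc_auc` is a hypothetical helper using the pairwise-ranking definition of AUC.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# 1 = word marked as mispronounced, 0 = correctly pronounced (toy data).
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]  # model's mispronunciation probabilities
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of words, which is fine for a small check; a rank-based computation would be the usual choice for large test sets.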

Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

Feb 08, 2021
Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do not always hold, which can result in a significant number of false mispronunciation alarms. We propose a novel approach to overcome this problem based on two principles: a) taking into account uncertainty in the automatic phoneme recognition step, b) accounting for the fact that there may be multiple valid pronunciations. We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18% (relative) compared to the common approach.

* Accepted to ICASSP 2021

EmoCat: Language-agnostic Emotional Voice Conversion

Jan 14, 2021
Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba

Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and make it possible to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less than 45 minutes of German emotional recordings by exploiting large amounts of emotional data in US English. EmoCat is an encoder-decoder model based on CopyCat, a voice conversion system which transfers prosody. We use adversarial training to remove emotion leakage from the encoder to the decoder. The adversarial training is improved by a novel contribution to gradient reversal to truly reverse gradients. This makes it possible to remove only the leaking information and to converge to better optima with higher conversion performance. Evaluations show that EmoCat can convert to different emotions but falls short on emotion intensity compared to the recordings, especially for very expressive emotions. EmoCat is able to achieve audio quality on par with the recordings for five out of six tested emotion intensities.

* Submitted to IEEE ICASSP 2021
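
As background for the gradient reversal mentioned in this abstract, the sketch below shows the standard gradient reversal layer in plain Python: identity in the forward pass, negated and scaled gradient in the backward pass. It is not the paper's novel modification, which the abstract does not describe; the `GradientReversal` class and its `scale` parameter are illustrative names only.

```python
from typing import List, Sequence


class GradientReversal:
    """Standard gradient reversal: identity forward, negated gradient backward."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale  # assumed weighting of the reversed gradient

    def forward(self, x: Sequence[float]) -> List[float]:
        # Features pass through unchanged to the adversarial classifier.
        return list(x)

    def backward(self, grad_output: Sequence[float]) -> List[float]:
        # The classifier's gradient is flipped before it reaches the encoder,
        # pushing the encoder to discard what the classifier relies on.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=2.0)
features = layer.forward([1.0, -2.0])
encoder_grad = layer.backward([0.5, -1.0])
```

In an adversarial setup of this kind, an emotion classifier would sit after the layer, so that minimising its loss trains the classifier while the reversed gradient discourages the encoder from carrying emotion information.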

Detection of Lexical Stress Errors in Non-native (L2) English with Data Augmentation and Attention

Dec 29, 2020
Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited number of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic speakers.

* Submitted to ICASSP 2021
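
The precision and recall figures in this abstract follow the usual definitions, sketched below. The counts in the example are invented to show the arithmetic and are not taken from the paper; `precision_recall` is a hypothetical helper.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 for a ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 50 stress errors flagged correctly, 5 false alarms,
# 50 stress errors missed.
precision, recall = precision_recall(tp=50, fp=5, fn=50)
```

A detector with high precision and moderate recall, as reported above, raises few false alarms but misses a share of the real errors.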

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

Nov 04, 2020
Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of 13.2% in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.

* 5 pages and 3 figures
