Jaime Lorenzo-Trueba

Enhancing audio quality for expressive Neural Text-to-Speech

Aug 13, 2021

Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

Abstract: Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even for recent TTS architectures, since there seems to be a trade-off between the expressiveness of generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly expressive voice without the use of additional data. The proposed techniques include: tuning the autoregressive loop's granularity during training; using Generative Adversarial Networks in acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder. We show that, when combined, these techniques closed the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice.

* 6 pages, 4 figures, 2 tables, SSW 2021
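
One common way to read a "gap closed" figure like the 39% quoted in the abstract above is as the share of the baseline-to-recordings score difference recovered by the new system. The sketch below assumes that reading; the function name `gap_closed` and the example scores are hypothetical and are not taken from the paper.

```python
def gap_closed(baseline: float, proposed: float, reference: float) -> float:
    """Fraction of the baseline-to-reference gap recovered by the proposed system."""
    gap = reference - baseline
    if gap <= 0:
        raise ValueError("reference score must exceed the baseline score")
    return (proposed - baseline) / gap


# Hypothetical MUSHRA scores chosen only to illustrate the arithmetic.
fraction = gap_closed(baseline=60.0, proposed=67.8, reference=80.0)
print(f"gap closed: {fraction:.0%}")
```

Under this reading, a value of 1.0 would mean the system is rated on par with the recordings, and 0.0 would mean no improvement over the baseline.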

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Jun 16, 2021

Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

Abstract: Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing non-parallel zero-shot VC, an important requirement for any VC system that needs to work on speakers not seen during training. We have validated our approach using a noisy reverberant version of the LibriSpeech dataset. Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments.

* Presented at the Speech Synthesis Workshops 2021 (SSW11)

Weakly-supervised word-level pronunciation error detection in non-native English speech

Jun 07, 2021

Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

Abstract: We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to the limited amount of mispronounced L2 speech, the model is more likely to overfit. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probabilities of word-level mispronunciation. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech that is easily accessible and can be automatically annotated. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors, as measured by AUC, by 30% on the GUT Isle Corpus of L2 Polish speakers, and by 21.5% on the Isle Corpus of L2 German and Italian speakers.

* Accepted to Interspeech 2021
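
The AUC metric mentioned in the abstract above can be computed directly from its pairwise definition: the probability that a randomly chosen mispronounced word receives a higher score than a randomly chosen correctly pronounced one, with ties counted as half. The sketch below is a generic illustration of that definition, not the paper's evaluation code; the function name `roc_auc` and the labels and scores are made up.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical word-level labels (1 = mispronounced) and model probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
print(roc_auc(labels, scores))
```

The quadratic loop is fine for a small illustration; a rank-based formulation would be preferable for large evaluation sets.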

Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Apr 15, 2021

Shubhi Tyagi, Antonio Bonafonte, Jaime Lorenzo-Trueba, Javier Latorre

Abstract: Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using less than 3% of the data used by the state-of-the-art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn the majority of the classes and their normalizations from the training data itself. This is further combined with minimal precoded linguistic knowledge for other classes. We publish the first results on TN for TTS in Spanish and Tamil and also demonstrate that the performance of the approach is comparable with the previous work done on English. All annotated datasets used for experimentation will be released at https://github.com/amazon-research/proteno.

* Accepted to NAACL 2021

Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

Feb 08, 2021

Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

Abstract: A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare them to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do not always hold, which can result in a significant number of false mispronunciation alarms. We propose a novel approach to overcome this problem based on two principles: a) taking into account uncertainty in the automatic phoneme recognition step, b) accounting for the fact that there may be multiple valid pronunciations. We evaluate the model on non-native (L2) English speech of German, Italian and Polish speakers, where it is shown to increase the precision of detecting mispronunciations by up to 18% (relative) compared to the common approach.

* Accepted to ICASSP 2021
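
To make the second assumption in the abstract above concrete, the sketch below compares a recognized phoneme sequence against a set of valid pronunciations using edit distance, and flags a word only if it matches none of them. This is a simplified illustration of why a single reference pronunciation produces false alarms; it is not the paper's model and it ignores recognition uncertainty. The function names, the `tolerance` parameter, and the ARPAbet-style example are assumptions made for the illustration.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def is_mispronounced(recognized: Sequence[str],
                     valid_pronunciations: List[Sequence[str]],
                     tolerance: int = 0) -> bool:
    """Flag a word only if it is farther than `tolerance` from every valid variant."""
    best = min(edit_distance(recognized, p) for p in valid_pronunciations)
    return best > tolerance


# "either" has two accepted pronunciations; the learner uses the second one.
recognized = ["AY", "DH", "ER"]
single_reference = [["IY", "DH", "ER"]]
all_variants = [["IY", "DH", "ER"], ["AY", "DH", "ER"]]

print(is_mispronounced(recognized, single_reference))  # flagged: a false alarm
print(is_mispronounced(recognized, all_variants))      # not flagged
```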

EmoCat: Language-agnostic Emotional Voice Conversion

Jan 14, 2021

Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba

Abstract: Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow generating large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with less than 45 minutes of German emotional recordings by exploiting large amounts of emotional data in US English. EmoCat is an encoder-decoder model based on CopyCat, a voice conversion system which transfers prosody. We use adversarial training to remove emotion leakage from the encoder to the decoder. The adversarial training is improved by a novel contribution to gradient reversal to truly reverse gradients. This allows removing only the leaking information and converging to better optima with higher conversion performance. Evaluations show that EmoCat can convert to different emotions but falls short on emotion intensity compared to the recordings, especially for very expressive emotions. EmoCat is able to achieve audio quality on par with the recordings for five out of six tested emotion intensities.

* Submitted to IEEE ICASSP 2021
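
For readers unfamiliar with gradient reversal, the sketch below shows the standard form of the idea: a layer that passes activations through unchanged in the forward pass and negates (and optionally scales) the gradient in the backward pass, so the encoder is pushed to discard what an adversarial classifier can detect. This is the generic technique only; the abstract above describes a modification of it that is not reproduced here. The class name, the `scale` parameter, and the list-based interface are assumptions made to keep the example free of framework dependencies.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale  # weight of the adversarial signal (hypothetical default)

    def forward(self, x: List[float]) -> List[float]:
        # Activations reach the adversarial classifier unchanged.
        return list(x)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The gradient flowing back into the encoder has its sign flipped.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=0.5)
print(layer.forward([1.0, 2.0]))
print(layer.backward([2.0, -4.0]))
```

In an automatic-differentiation framework the same behaviour would normally be written as a custom gradient rule rather than as explicit `forward` and `backward` methods.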

Detection of Lexical Stress Errors in Non-native English with Data Augmentation and Attention

Dec 29, 2020

Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learning model that automatically derives an optimal syllable-level representation from frame-level and phoneme-level audio features. Training this model is challenging because of the limited number of incorrect stress patterns. To solve this problem, we propose to augment the training set with incorrectly stressed words generated with Neural TTS. Combining both techniques achieves 94.8% precision and 49.2% recall for the detection of incorrectly stressed words in L2 English speech of Slavic speakers.

* Submitted to ICASSP 2021
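
The precision and recall figures in the abstract above follow the usual definitions over detection counts. As a reminder of how the two interact, here is a minimal sketch; the function name and the counts are hypothetical and were chosen only to give a high-precision, moderate-recall detector similar in character to the one reported.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    flagged = true_pos + false_pos   # words the detector marked as incorrectly stressed
    actual = true_pos + false_neg    # words that really were incorrectly stressed
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: few false alarms, but half of the real errors are missed.
precision, recall = precision_recall(true_pos=45, false_pos=5, false_neg=45)
print(f"precision={precision:.1%} recall={recall:.1%}")
```

A detector tuned this way rarely accuses a learner of an error that was not made, at the cost of letting some real errors pass.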

Voice Conversion for Whispered Speech Synthesis

Jan 17, 2020

Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

Abstract: We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods and that it achieves results that are indistinguishable from copy-synthesis of natural whisper recordings. We investigate the ability of the DNN model to generalize to unseen speakers when trained with data from multiple speakers. We show that excluding the target speaker from the training set has little or no impact on the perceived naturalness and speaker similarity of the converted whisper. The proposed DNN method is used in the newly released Whisper Mode of Amazon Alexa.

* Submitted to IEEE Signal Processing Letters

Dynamic Prosody Generation for Speech Synthesis using Linguistics-Driven Acoustic Embedding Selection

Dec 02, 2019

Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba

Abstract: Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human capabilities when considering isolated sentences. However, what is still lacking in order to achieve human-like communication is the dynamic variation and adaptability of human speech. This work attempts to solve the problem of achieving a more dynamic and natural intonation in TTS systems, particularly for stylistic speech such as the newscaster speaking style. We propose a novel embedding selection approach which exploits linguistic information, leveraging the speech variability present in the training dataset. We analyze the contribution of both semantic and syntactic features. Our results show that the approach improves the prosody and naturalness for complex utterances as well as in Long Form Reading (LFR).

* Submitted for ICASSP 2020

Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

Nov 28, 2019

Vatsal Aggarwal, Marius Cotescu, Nishant Prateek, Jaime Lorenzo-Trueba, Roberto Barra-Chicote

Abstract: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving perceptual metrics over the state of the art. At synthesis time we use one example of expressive style as a reference input to the encoder for generating any text in the desired style. Perceptual MUSHRA evaluations show that we can create a voice with a 9% relative naturalness improvement over standard Neural Text-to-Speech, while also improving the perceived emotional intensity (59 compared to the 55 of neutral speech).

* Submitted to ICASSP 2020
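
The Householder Flow named in the abstract above is built from Householder reflections, each of which maps a latent vector z to z - 2 v (v·z) / ||v||². The sketch below shows a single reflection step in plain Python to illustrate the transformation; how the paper parameterizes v and how many steps it stacks are not stated here, so the function name and example vectors are assumptions.

```python
import math
from typing import List


def householder_reflect(z: List[float], v: List[float]) -> List[float]:
    """Reflect z across the hyperplane orthogonal to v: z - 2 v (v.z) / ||v||^2."""
    norm_sq = sum(x * x for x in v)
    if norm_sq == 0.0:
        raise ValueError("v must be a non-zero vector")
    coeff = 2.0 * sum(a * b for a, b in zip(v, z)) / norm_sq
    return [zi - coeff * vi for zi, vi in zip(z, v)]


# Hypothetical 2-D latent sample and reflection vector.
z0 = [3.0, 4.0]
z1 = householder_reflect(z0, v=[1.0, 1.0])
print(z1)
# A reflection is orthogonal, so the vector norm is unchanged.
print(math.isclose(math.hypot(*z0), math.hypot(*z1)))
```

Because each reflection is an orthogonal map, its Jacobian determinant has absolute value 1, which is what makes this family of flows cheap to use inside a VAE objective.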
