Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

EMA2S: An End-to-End Multimodal Articulatory-to-Speech System

Feb 07, 2021
Yu-Wen Chen, Kuo-Hsuan Hung, Shang-Yi Chuang, Jonathan Sherman, Wen-Chin Huang, Xugang Lu, Yu Tsao

Synthesized speech from articulatory movements can have real-world use for patients with vocal cord disorders, situations requiring silent speech, or in high-noise environments. In this work, we present EMA2S, an end-to-end multimodal articulatory-to-speech system that directly converts articulatory movements to speech signals. We use a neural-network-based vocoder combined with multimodal joint-training, incorporating spectrogram, mel-spectrogram, and deep features. The experimental results confirm that the multimodal approach of EMA2S outperforms the baseline system in terms of both objective evaluation and subjective evaluation metrics. Moreover, results demonstrate that joint mel-spectrogram and deep feature loss training can effectively improve system performance.

  Access Paper or Ask Questions

Scaling Speech Enhancement in Unseen Environments with Noise Embeddings

Oct 26, 2018
Gil Keren, Jing Han, Björn Schuller

We address the problem of speech enhancement generalisation to unseen environments by performing two manipulations. First, we embed an additional recording from the environment alone, and use this embedding to alter activations in the main enhancement subnetwork. Second, we scale the number of noise environments present at training time to 16,784 different environments. Experiment results show that both manipulations reduce word error rates of a pretrained speech recognition system and improve enhancement quality according to a number of performance measures. Specifically, our best model reduces the word error rate from 34.04% on noisy speech to 15.46% on the enhanced speech. Enhanced audio samples can be found in

* The Fifth CHiME Challenge Workshop, 2018 

  Access Paper or Ask Questions

Cycle-Consistent Speech Enhancement

Sep 06, 2018
Zhong Meng, Jinyu Li, Yifan Gong, Biing-Hwang, Juang

Feature mapping using deep neural networks is an effective approach for single-channel speech enhancement. Noisy features are transformed to the enhanced ones through a mapping network and the mean square errors between the enhanced and clean features are minimized. In this paper, we propose a cycle-consistent speech enhancement (CSE) in which an additional inverse mapping network is introduced to reconstruct the noisy features from the enhanced ones. A cycle-consistent constraint is enforced to minimize the reconstruction loss. Similarly, a backward cycle of mappings is performed in the opposite direction with the same networks and losses. With cycle-consistency, the speech structure is well preserved in the enhanced features while noise is effectively reduced such that the feature-mapping network generalizes better to unseen data. In cases where only unparalleled noisy and clean data is available for training, two discriminator networks are used to distinguish the enhanced and noised features from the clean and noisy ones. The discrimination losses are jointly optimized with reconstruction losses through adversarial multi-task learning. Evaluated on the CHiME-3 dataset, the proposed CSE achieves 19.60% and 6.69% relative word error rate improvements respectively when using or without using parallel clean and noisy speech data.

* Interspeech 2018 
* 5 pages, 2 figures. arXiv admin note: text overlap with arXiv:1809.02251 

  Access Paper or Ask Questions

Detection and Analysis of Human Emotions through Voice and Speech Pattern Processing

Oct 27, 2017
Poorna Banerjee Dasgupta

The ability to modulate vocal sounds and generate speech is one of the features which set humans apart from other living beings. The human voice can be characterized by several attributes such as pitch, timbre, loudness, and vocal tone. It has often been observed that humans express their emotions by varying different vocal attributes during speech generation. Hence, deduction of human emotions through voice and speech analysis has a practical plausibility and could potentially be beneficial for improving human conversational and persuasion skills. This paper presents an algorithmic approach for detection and analysis of human emotions with the help of voice and speech processing. The proposed approach has been developed with the objective of incorporation with futuristic artificial intelligence systems for improving human-computer interactions.

* International Journal of Computer Trends and Technology (IJCTT) V52(1):01-03, October 2017 
* 3 pages, Published with International Journal of Computer Trends and Technology (IJCTT), Volume-52 Number-1, 2017 

  Access Paper or Ask Questions

GANtron: Emotional Speech Synthesis with Generative Adversarial Networks

Oct 06, 2021
Enrique Hortal, Rodrigo Brechard Alarcia

Speech synthesis is used in a wide variety of industries. Nonetheless, it always sounds flat or robotic. The state of the art methods that allow for prosody control are very cumbersome to use and do not allow easy tuning. To tackle some of these drawbacks, in this work we target the implementation of a text-to-speech model where the inferred speech can be tuned with the desired emotions. To do so, we use Generative Adversarial Networks (GANs) together with a sequence-to-sequence model using an attention mechanism. We evaluate four different configurations considering different inputs and training strategies, study them and prove how our best model can generate speech files that lie in the same distribution as the initial training dataset. Additionally, a new strategy to boost the training convergence by applying a guided attention loss is proposed.

* 9 pages, 4 figures 

  Access Paper or Ask Questions

Speech-to-Singing Conversion based on Boundary Equilibrium GAN

May 30, 2020
Da-Yi Wu, Yi-Hsuan Yang

This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input, and optionally the F0 contour of the target singing, the proposed model generates as the output a singing signal with a progressive-growing encoder/decoder architecture and boundary equilibrium GAN loss functions. Our quantitative and qualitative analysis show that the proposed model generates singing voices with much higher naturalness than an existing non adversarially-trained baseline. For reproducibility, the code will be publicly available at a GitHub repository upon paper publication.

  Access Paper or Ask Questions

Curriculum Pre-training for End-to-End Speech Translation

Apr 21, 2020
Chengyi Wang, Yu Wu, Shujie Liu, Ming Zhou, Zhenglu Yang

End-to-end speech translation poses a heavy burden on the encoder, because it has to transcribe, understand, and learn cross-lingual semantics simultaneously. To obtain a powerful encoder, traditional methods pre-train it on ASR data to capture speech features. However, we argue that pre-training the encoder only through simple speech recognition is not enough and high-level linguistic knowledge should be considered. Inspired by this, we propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages. The difficulty of these courses is gradually increasing. Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.

* accepted by ACL2020 

  Access Paper or Ask Questions

AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

Nov 28, 2016
Brian Patton, Yannis Agiomyrgiannakis, Michael Terry, Kevin Wilson, Rif A. Saurous, D. Sculley

Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.

* 4 pages, 2 figures, 2 tables, NIPS 2016 End-to-end Learning for Speech and Audio Processing Workshop 

  Access Paper or Ask Questions

End-to-End Speech Recognition and Disfluency Removal

Sep 28, 2020
Paria Jamshid Lou, Mark Johnson

Disfluency detection is usually an intermediate step between an automatic speech recognition (ASR) system and a downstream task. By contrast, this paper aims to investigate the task of end-to-end speech recognition and disfluency removal. We specifically explore whether it is possible to train an ASR model to directly map disfluent speech into fluent transcripts, without relying on a separate disfluency detection model. We show that end-to-end models do learn to directly generate fluent transcripts; however, their performance is slightly worse than a baseline pipeline approach consisting of an ASR system and a disfluency detection model. We also propose two new metrics that can be used for evaluating integrated ASR and disfluency models. The findings of this paper can serve as a benchmark for further research on the task of end-to-end speech recognition and disfluency removal in the future.

  Access Paper or Ask Questions