"speech": models, code, and papers

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram

Oct 25, 2019
Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim

We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network. In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution of the realistic speech waveform. As our method does not require density distillation used in the conventional teacher-student framework, the entire model can be easily trained even with a small number of parameters. In particular, the proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveform 28.68 times faster than real-time on a single GPU environment. Perceptual listening test results verify that our proposed method achieves 4.16 mean opinion score within a Transformer-based text-to-speech framework, which is comparable to the best distillation-based Parallel WaveNet system.

* submitted to ICASSP 2020 
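
To illustrate the multi-resolution spectrogram idea mentioned in the abstract above, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not spell out the loss terms, so the sketch assumes a spectral-convergence term plus a log-magnitude L1 term, averaged over several STFT settings. The function names and the (FFT size, hop, window length) triples in `RESOLUTIONS` are illustrative assumptions.

```python
import numpy as np

# Illustrative (fft_size, hop, win_length) settings; assumed, not taken from the abstract.
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]


def stft_magnitude(x, fft_size, hop, win_length):
    """Magnitude spectrogram from Hann-windowed frames (no padding at the edges)."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_length] * window for i in range(n_frames)]
    )
    spec = np.fft.rfft(frames, n=fft_size, axis=1)  # zero-pads each frame to fft_size
    return np.maximum(np.abs(spec), 1e-7)  # clamp so the log below stays finite


def stft_loss(generated, target, fft_size, hop, win_length):
    """Assumed single-resolution loss: spectral convergence + log-magnitude L1."""
    g = stft_magnitude(generated, fft_size, hop, win_length)
    t = stft_magnitude(target, fft_size, hop, win_length)
    spectral_convergence = np.linalg.norm(t - g) / np.linalg.norm(t)
    log_magnitude = np.mean(np.abs(np.log(t) - np.log(g)))
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(generated, target, resolutions=RESOLUTIONS):
    """Average the single-resolution loss over all analysis settings."""
    return float(np.mean([stft_loss(generated, target, *r) for r in resolutions]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(4800)  # 0.2 s of noise at 24 kHz as stand-in audio
    generated = target + 0.1 * rng.standard_normal(4800)
    print(multi_resolution_stft_loss(generated, target))
```

Combining several window lengths is meant to keep the generator from fitting one particular time-frequency trade-off; in training, a term like this would be added to the adversarial loss rather than used on its own.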


Language Modeling with Highway LSTM

Sep 19, 2017
Gakuto Kurata, Bhuvana Ramabhadran, George Saon, Abhinav Sethy

Language models (LMs) based on Long Short Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks. In this paper, we extend an LSTM by adding highway networks inside an LSTM and use the resulting Highway LSTM (HW-LSTM) model for language modeling. The added highway networks increase the depth in the time dimension. Since a typical LSTM has two internal states, a memory cell and a hidden state, we compare various types of HW-LSTM by adding highway networks onto the memory cell and/or the hidden state. Experimental results on English broadcast news and conversational telephone speech recognition show that the proposed HW-LSTM LM improves speech recognition accuracy on top of a strong LSTM LM baseline. We report 5.1% and 9.9% on the Switchboard and CallHome subsets of the Hub5 2000 evaluation, which reaches the best performance numbers reported on these tasks to date.

* to appear in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017) 


Estimating Confusions in the ASR Channel for Improved Topic-based Language Model Adaptation

Mar 21, 2013
Damianos Karakos, Mark Dredze, Sanjeev Khudanpur

Human language is a combination of elemental languages/domains/styles that change across and sometimes within discourses. Language models, which play a crucial role in speech recognizers and machine translation systems, are particularly sensitive to such changes, unless some form of adaptation takes place. One approach to speech language model adaptation is self-training, in which a language model's parameters are tuned based on automatically transcribed audio. However, transcription errors can misguide self-training, particularly in challenging settings such as conversational speech. In this work, we propose a model that considers the confusions (errors) of the ASR channel. By modeling the likely confusions in the ASR output instead of using just the 1-best, we improve self-training efficacy by obtaining a more reliable reference transcription estimate. We demonstrate improved topic-based language modeling adaptation results over both 1-best and lattice self-training using our ASR channel confusion estimates on telephone conversations.

* Technical Report 8, Human Language Technology Center of Excellence, Johns Hopkins University 


Unraveling Social Perceptions & Behaviors towards Migrants on Twitter

Dec 04, 2021
Aparup Khatua, Wolfgang Nejdl

We draw insights from the social psychology literature to identify two facets of Twitter deliberations about migrants, i.e., perceptions about migrants and behaviors towards migrants. Our theoretical anchoring helped us in identifying two prevailing perceptions (i.e., sympathy and antipathy) and two dominant behaviors (i.e., solidarity and animosity) of social media users towards migrants. We have employed unsupervised and supervised approaches to identify these perceptions and behaviors. In the domain of applied NLP, our study offers a nuanced understanding of migrant-related Twitter deliberations. Our proposed transformer-based model, i.e., BERT + CNN, has reported an F1-score of 0.76 and outperformed other models. Additionally, we argue that tweets conveying antipathy or animosity can be broadly considered hate speech towards migrants, but they are not the same. Thus, our approach has fine-tuned the binary hate speech detection task by highlighting the granular differences between perceptual and behavioral aspects of hate speech.

* This work has been accepted to appear at International Conference on Web and Social Media ICWSM-2022 


Generalization Ability of MOS Prediction Networks

Oct 18, 2021
Erica Cooper, Wen-Chin Huang, Tomoki Toda, Junichi Yamagishi

Automatic methods to predict listener opinions of synthesized speech remain elusive since listeners, systems being evaluated, characteristics of the speech, and even the instructions given and the rating scale all vary from test to test. While automatic predictors for metrics such as mean opinion score (MOS) can achieve high prediction accuracy on samples from the same test, they typically fail to generalize well to new listening test contexts. In this paper, using a variety of networks for MOS prediction including MOSNet and self-supervised speech models such as wav2vec2, we investigate their performance on data from different listening tests in both zero-shot and fine-tuned settings. We find that wav2vec2 models fine-tuned for MOS prediction have good generalization capability to out-of-domain data even for the most challenging case of utterance-level predictions in the zero-shot setting, and that fine-tuning to in-domain data can improve predictions. We also observe that unseen systems are especially challenging for MOS prediction models.

* Submitted to ICASSP 2022 
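
The abstract above singles out utterance-level prediction as the hardest case. The following sketch shows one common way to contrast utterance-level and system-level agreement between predicted and listener MOS. The abstract names no evaluation metric, so Pearson correlation is an assumption here, and the function name and toy scores are hypothetical.

```python
from collections import defaultdict

import numpy as np


def utterance_and_system_correlation(system_ids, true_mos, pred_mos):
    """Return (utterance-level, system-level) Pearson correlation.

    System-level scores are the per-system means of the utterance scores.
    """
    true_mos = np.asarray(true_mos, dtype=float)
    pred_mos = np.asarray(pred_mos, dtype=float)
    utt_corr = np.corrcoef(true_mos, pred_mos)[0, 1]

    # Group (true, predicted) pairs by the system that produced each utterance.
    by_system = defaultdict(list)
    for sid, t, p in zip(system_ids, true_mos, pred_mos):
        by_system[sid].append((t, p))
    means = np.array([np.mean(pairs, axis=0) for pairs in by_system.values()])
    sys_corr = np.corrcoef(means[:, 0], means[:, 1])[0, 1]
    return utt_corr, sys_corr


if __name__ == "__main__":
    # Toy data: three hypothetical systems, two utterances each.
    systems = ["A", "A", "B", "B", "C", "C"]
    listener = [2, 3, 3, 4, 4, 5]
    predicted = [3, 2, 4, 3, 5, 4]
    utt, sys_ = utterance_and_system_correlation(systems, listener, predicted)
    print(f"utterance-level: {utt:.3f}, system-level: {sys_:.3f}")
```

In the toy data the per-utterance errors cancel within each system, so the system-level correlation is higher than the utterance-level one, which is one reason the two levels are reported separately.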


Emphasis control for parallel neural TTS

Oct 06, 2021
Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li

The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody. Recent parallel neural text-to-speech (TTS) synthesis methods are able to generate speech with high fidelity while maintaining high performance. However, these systems often lack simple control over the output prosody, thus restricting the semantic information conveyable for a given text. This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis. Three candidate features for the latent space are compared: 1) variance of pitch and duration within words in a sentence, 2) a wavelet-based feature computed from pitch, energy, and duration, and 3) a learned combination of the above features. Objective measures reveal that the proposed methods are able to achieve a wide range of emphasis modification, and subjective evaluations on the degree of emphasis and the overall quality indicate that they show promise for real-world applications.

* 5 pages, 6 figures, preprint will be submitted to ICASSP 2022 


End-to-End Spoken Language Understanding using RNN-Transducer ASR

Jul 08, 2021
Anirudh Raju, Gautam Tiwari, Milind Rao, Pranav Dheram, Bryan Anderson, Zhe Zhang, Bach Bui, Ariya Rastrow

We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance. It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface. This interface allows for end-to-end training using multi-task RNNT and NLU losses. Additionally, we introduce semantic sequence loss training for the joint RNNT-NLU system that allows direct optimization of non-differentiable SLU metrics. This end-to-end SLU model paradigm can leverage state-of-the-art advancements and pretrained models in both ASR and NLU research communities, outperforming recently proposed direct speech-to-semantics models, and conventional pipelined ASR and NLU systems. We show that this method improves both ASR and NLU metrics on both public SLU datasets and large proprietary datasets.


Exploring emotional prototypes in a high dimensional TTS latent space

May 05, 2021
Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M. C. Harrison, Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby

Recent TTS systems are able to generate prosodically varied and realistic speech. However, it is unclear how this prosodic variation contributes to the perception of speakers' emotional states. Here we use the recent psychological paradigm 'Gibbs Sampling with People' to search the prosodic latent space in a trained GST Tacotron model to explore prototypes of emotional prosody. Participants are recruited online and collectively manipulate the latent space of the generative speech model in a sequentially adaptive way so that the stimulus presented to one group of participants is determined by the response of the previous groups. We demonstrate that (1) particular regions of the model's latent space are reliably associated with particular emotions, (2) the resulting emotional prototypes are well-recognized by a separate group of human raters, and (3) these emotional prototypes can be effectively transferred to new sentences. Collectively, these experiments demonstrate a novel approach to the understanding of emotional speech by providing a tool to explore the relation between the latent space of generative models and human semantics.

* Submitted to INTERSPEECH'21 
