Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"speech": models, code, and papers

A Robust Frame-based Nonlinear Prediction System for Automatic Speech Coding

Jan 22, 2016
Mahmood Yousefi-Azar, Farbod Razzazi

In this paper, we propose a neural-based coding scheme in which an artificial neural network is exploited to automatically compress and decompress speech signals by a trainable approach. Having a two-stage training phase, the system can be fully specified to each speech frame and have robust performance across different speakers and wide range of spoken utterances. Indeed, Frame-based nonlinear predictive coding (FNPC) would code a frame in the procedure of training to predict the frame samples. The motivating objective is to analyze the system behavior in regenerating not only the envelope of spectra, but also the spectra phase. This scheme has been evaluated in time and discrete cosine transform (DCT) domains and the output of predicted phonemes show the potentiality of the FNPC to reconstruct complicated signals. The experiments were conducted on three voiced plosive phonemes, b/d/g/ in time and DCT domains versus the number of neurons in the hidden layer. Experiments approve the FNPC capability as an automatic coding system by which /b/d/g/ phonemes have been reproduced with a good accuracy. Evaluations revealed that the performance of FNPC system, trained to predict DCT coefficients is more desirable, particularly for frames with the wider distribution of energy, compared to time samples.

* 11 pages, 8 figures 

  Access Paper or Ask Questions

A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis

Dec 29, 2019
Thomas Drugman, Geoffrey Wilfart, Thierry Dutoit

Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis. The stochastic component is a high-pass filtered noise whose time structure is modulated by an energy-envelope, similarly to what is done in the Harmonic plus Noise Model (HNM). The proposed residual model is integrated within a HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significative improvement for both male and female voices. In addition the proposed model requires few computational load and memory, which is essential for its integration in commercial applications.

  Access Paper or Ask Questions

A Multilinear Tongue Model Derived from Speech Related MRI Data of the Human Vocal Tract

Apr 17, 2018
Alexander Hewer, Stefanie Wuhrer, Ingmar Steiner, Korin Richmond

We present a multilinear statistical model of the human tongue that captures anatomical and tongue pose related shape variations separately. The model is derived from 3D magnetic resonance imaging data of 11 speakers sustaining speech related vocal tract configurations. The extraction is performed by using a minimally supervised method that uses as basis an image segmentation approach and a template fitting technique. Furthermore, it uses image denoising to deal with possibly corrupt data, palate surface information reconstruction to handle palatal tongue contacts, and a bootstrap strategy to refine the obtained shapes. Our evaluation concludes that limiting the degrees of freedom for the anatomical and speech related variations to 5 and 4, respectively, produces a model that can reliably register unknown data while avoiding overfitting effects. Furthermore, we show that it can be used to generate a plausible tongue animation by tracking sparse motion capture data.

* Computer Speech & Language 51 (2018) 68-92 

  Access Paper or Ask Questions

Data Augmentation for Training Dialog Models Robust to Speech Recognition Errors

Jun 10, 2020
Longshaokan Wang, Maryam Fazel-Zarandi, Aditya Tiwari, Spyros Matsoukas, Lazaros Polymenakos

Speech-based virtual assistants, such as Amazon Alexa, Google assistant, and Apple Siri, typically convert users' audio signals to text data through automatic speech recognition (ASR) and feed the text to downstream dialog models for natural language understanding and response generation. The ASR output is error-prone; however, the downstream dialog models are often trained on error-free text data, making them sensitive to ASR errors during inference time. To bridge the gap and make dialog models more robust to ASR errors, we leverage an ASR error simulator to inject noise into the error-free text data, and subsequently train the dialog models with the augmented data. Compared to other approaches for handling ASR errors, such as using ASR lattice or end-to-end methods, our data augmentation approach does not require any modification to the ASR or downstream dialog models; our approach also does not introduce any additional latency during inference time. We perform extensive experiments on benchmark data and show that our approach improves the performance of downstream dialog models in the presence of ASR errors, and it is particularly effective in the low-resource situations where there are constraints on model size or the training data is scarce.

* To be presented at 2nd Workshop on NLP for ConvAI, ACL 2020 

  Access Paper or Ask Questions

Role of Artificial Intelligence in Detection of Hateful Speech for Hinglish Data on Social Media

May 11, 2021
Ananya Srivastava, Mohammed Hasan, Bhargav Yagnik, Rahee Walambe, Ketan Kotecha

Social networking platforms provide a conduit to disseminate our ideas, views and thoughts and proliferate information. This has led to the amalgamation of English with natively spoken languages. Prevalence of Hindi-English code-mixed data (Hinglish) is on the rise with most of the urban population all over the world. Hate speech detection algorithms deployed by most social networking platforms are unable to filter out offensive and abusive content posted in these code-mixed languages. Thus, the worldwide hate speech detection rate of around 44% drops even more considering the content in Indian colloquial languages and slangs. In this paper, we propose a methodology for efficient detection of unstructured code-mix Hinglish language. Fine-tuning based approaches for Hindi-English code-mixed language are employed by utilizing contextual based embeddings such as ELMo (Embeddings for Language Models), FLAIR, and transformer-based BERT (Bidirectional Encoder Representations from Transformers). Our proposed approach is compared against the pre-existing methods and results are compared for various datasets. Our model outperforms the other methods and frameworks.

* This work was presented at ICAAIML2020 and will be published in Lecture Notes in Electrical Engineering 

  Access Paper or Ask Questions

DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Jul 31, 2018
Mandar Gogate, Ahsan Adeel, Ricard Marxer, Jon Barker, Amir Hussain

Human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN) based audiovisual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV features extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited to leverages the complementary strengths of a stacked long short term memory (LSTM) and convolution LSTM network. The comparative simulation results in terms of speech quality and intelligibility demonstrate significant performance improvement of our proposed AV mask estimation model as compared to audio-only and visual-only mask estimation approaches for both speaker dependent and independent scenarios.

* Accepted for Interspeech 2018, 5 pages, 4 figures 

  Access Paper or Ask Questions

HumanGAN: generative adversarial network with human-based discriminator and its evaluation in speech perception modeling

Sep 25, 2019
Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari

We propose the HumanGAN, a generative adversarial network (GAN) incorporating human perception as a discriminator. A basic GAN trains a generator to represent a real-data distribution by fooling the discriminator that distinguishes real and generated data. Therefore, the basic GAN cannot represent the outside of a real-data distribution. In the case of speech perception, humans can recognize not only human voices but also processed (i.e., a non-existent human) voices as human voice. Such a human-acceptable distribution is typically wider than a real-data one and cannot be modeled by the basic GAN. To model the human-acceptable distribution, we formulate a backpropagation-based generator training algorithm by regarding human perception as a black-boxed discriminator. The training efficiently iterates generator training by using a computer and discrimination by crowdsourcing. We evaluate our HumanGAN in speech naturalness modeling and demonstrate that it can represent a human-acceptable distribution that is wider than a real-data distribution.

* Submitted to IEEE ICASSP 2020 

  Access Paper or Ask Questions

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention

Oct 24, 2017
Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara

This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without any recurrent units. Recurrent neural network (RNN) has been a standard technique to model sequential data recently, and this technique has been used in some cutting-edge neural TTS techniques. However, training RNN component often requires a very powerful computer, or very long time typically several days or weeks. Recent other studies, on the other hand, have shown that CNN-based sequence synthesis can be much faster than RNN-based techniques, because of high parallelizability. The objective of this paper is to show an alternative neural TTS system, based only on CNN, that can alleviate these economic costs of training. In our experiment, the proposed Deep Convolutional TTS can be sufficiently trained only in a night (15 hours), using an ordinary gaming PC equipped with two GPUs, while the quality of the synthesized speech was almost acceptable.

* submitted to ICASSP 2018 

  Access Paper or Ask Questions

Conformer-based End-to-end Speech Recognition With Rotary Position Embedding

Jul 13, 2021
Shengqiang Li, Menglong Xu, Xiao-Lei Zhang

Transformer-based end-to-end speech recognition models have received considerable attention in recent years due to their high training speed and ability to model a long-range global context. Position embedding in the transformer architecture is indispensable because it provides supervision for dependency modeling between elements at different positions in the input sequence. To make use of the time order of the input sequence, many works inject some information about the relative or absolute position of the element into the input sequence. In this work, we investigate various position embedding methods in the convolution-augmented transformer (conformer) and adopt a novel implementation named rotary position embedding (RoPE). RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module. To evaluate the effectiveness of the RoPE method, we conducted experiments on AISHELL-1 and LibriSpeech corpora. Results show that the conformer enhanced with RoPE achieves superior performance in the speech recognition task. Specifically, our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on test-clean and test-other sets of the LibriSpeech corpus respectively.

* 5 pages, 3 figures 

  Access Paper or Ask Questions

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

May 14, 2021
Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).

  Access Paper or Ask Questions