"speech recognition": models, code, and papers

Synthesizing Dysarthric Speech Using Multi-talker TTS for Dysarthric Speech Recognition

Jan 27, 2022
Mohammad Soleymanpour, Michael T. Johnson, Rahim Soleymanpour, Jeffrey Berry

Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility due to slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. Robust dysarthria-specific ASR requires sufficient training speech, which is not readily available. Recent advances in multi-speaker end-to-end Text-To-Speech (TTS) synthesis suggest the possibility of using synthesis for data augmentation. In this paper, we aim to improve multi-speaker end-to-end TTS systems to synthesize dysarthric speech for improved training of a dysarthria-specific DNN-HMM ASR system. In the synthesized speech, we add dysarthria severity level and pause insertion mechanisms alongside other control parameters such as pitch, energy, and duration. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that adding the severity level and pause insertion controls decreases WER by 6.5%, demonstrating the effectiveness of these parameters. Audio samples are available at

* Accepted ICASSP 2022 
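
A minimal sketch of the augmentation recipe outlined in this abstract, assuming a multi-speaker TTS engine that exposes a severity control; the tts.synthesize interface, the severity labels, and the pause-insertion rate below are hypothetical placeholders, not the authors' implementation:

    import random

    SEVERITY_LEVELS = ["mild", "moderate", "severe"]   # assumed discrete severity control

    def insert_pauses(text, rate=0.2, token="<pause>"):
        """Randomly insert pause tokens between words to mimic dysarthric pausing."""
        out = []
        for word in text.split():
            out.append(word)
            if random.random() < rate:
                out.append(token)
        return " ".join(out)

    def augment_corpus(real_utts, tts, n_synth_per_utt=2):
        """Pool real utterances with synthetic dysarthric speech for DNN-HMM training."""
        augmented = list(real_utts)
        for utt in real_utts:
            for _ in range(n_synth_per_utt):
                severity = random.choice(SEVERITY_LEVELS)
                prompt = insert_pauses(utt["text"])
                # Hypothetical TTS call; pitch, energy and duration controls omitted.
                wav = tts.synthesize(prompt, speaker=utt["speaker"], severity=severity)
                augmented.append({"text": utt["text"], "audio": wav, "synthetic": True})
        return augmented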
  

End-to-end training of time domain audio separation and recognition

Dec 25, 2019
Thilo von Neumann, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, Reinhold Haeb-Umbach

The rising interest in single-channel multi-speaker speech separation has sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer, and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over the cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.

* 5 pages, 1 figure, to appear in ICASSP 2020 
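
A toy sketch of the joint training idea, assuming PyTorch: a stand-in 1-D convolutional separator replaces Conv-TasNet, a GRU encoder with CTC stands in for the E2E recognizer, and the transcript assignment is resolved with a permutation-invariant loss. None of the module sizes or names come from the paper.

    import itertools
    import torch
    import torch.nn as nn

    class JointSepASR(nn.Module):
        """Separator followed by a shared recognizer, trained end to end."""

        def __init__(self, n_src=2, feat=64, vocab=32):
            super().__init__()
            self.n_src, self.feat = n_src, feat
            self.separator = nn.Sequential(
                nn.Conv1d(1, feat, kernel_size=16, stride=8, padding=4), nn.ReLU(),
                nn.Conv1d(feat, n_src * feat, kernel_size=3, padding=1))
            self.encoder = nn.GRU(feat, 128, batch_first=True)
            self.classifier = nn.Linear(128, vocab)

        def forward(self, mix):                         # mix: (batch, samples)
            z = self.separator(mix.unsqueeze(1))        # (batch, n_src*feat, frames)
            b, _, t = z.shape
            z = z.view(b, self.n_src, self.feat, t)     # per-speaker feature streams
            outs = []
            for s in range(self.n_src):
                h, _ = self.encoder(z[:, s].transpose(1, 2))
                outs.append(self.classifier(h).log_softmax(-1))
            return outs                                 # list of (batch, frames, vocab)

    def pit_ctc_loss(outs, targets, target_lens, ctc=nn.CTCLoss(blank=0)):
        """Pick the speaker-to-transcript assignment with the lowest total CTC loss.

        outs:        per-speaker log-probs from JointSepASR
        targets:     list of (batch, max_len) padded label tensors, one per speaker
        target_lens: list of (batch,) label-length tensors, one per speaker
        """
        batch, frames = outs[0].shape[0], outs[0].shape[1]
        in_lens = torch.full((batch,), frames, dtype=torch.long)
        best = None
        for perm in itertools.permutations(range(len(outs))):
            loss = sum(ctc(outs[s].transpose(0, 1), targets[t], in_lens, target_lens[t])
                       for s, t in enumerate(perm))
            best = loss if best is None else torch.minimum(best, loss)
        return best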
  

Speech Recognition Oriented Vowel Classification Using Temporal Radial Basis Functions

Dec 19, 2009
Mustapha Guezouri, Larbi Mesbahi, Abdelkader Benyettou

The recent resurgence of interest in spatio-temporal neural networks as speech recognition tools motivates the present investigation. In this paper, an approach is developed based on temporal radial basis functions (TRBF), which offer several advantages: few parameters, fast convergence, and time invariance. The application aims to identify vowels taken from natural speech samples in the TIMIT corpus of American English speech. We report a recognition accuracy of 98.06 percent in training and 90.13 percent in testing on a subset of 6 vowel phonemes, with the possibility of expanding the vowel set in future work.

* Journal of Computing, Volume 1, Issue 1, pp 162-167, December 2009 
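
Since the TRBF formulation itself is not spelled out in the abstract, the following is only a generic sketch of a temporal radial basis layer, assuming Gaussian units applied to a short window of stacked acoustic frames; the paper's actual architecture may differ.

    import numpy as np

    def temporal_rbf_features(frames, centers, gamma=0.5, context=2):
        """Generic temporal RBF layer: each unit responds to a window of stacked
        acoustic frames through a Gaussian kernel.

        frames:  (T, d) sequence of acoustic feature vectors (e.g. MFCCs)
        centers: (K, (2*context+1)*d) RBF centres, learned or sampled from data
        returns: (T, K) activations usable by a downstream vowel classifier
        """
        T, d = frames.shape
        padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
        windows = np.stack([padded[t:t + 2 * context + 1].ravel() for t in range(T)])
        dists = ((windows[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * dists)

A softmax or linear classifier trained on these activations can then label each frame with one of the vowel classes.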
  

Unsupervised low-rank representations for speech emotion recognition

Apr 14, 2021
Georgios Paraskevopoulos, Efthymios Tzinis, Nikolaos Ellinas, Theodoros Giannakopoulos, Alexandros Potamianos

We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition. Two feature sets are used: one based on low-level descriptors and their aggregations (IS10) and one modeling the recurrence dynamics of speech (RQA), as well as their fusion. We report speech emotion recognition (SER) results for the learned representations on two databases using different classification methods. Classification with low-dimensional representations yields performance improvements in a variety of settings. This indicates that dimensionality reduction is an effective way to combat the curse of dimensionality for SER. Visualization of the features in two dimensions provides insight into the discriminatory abilities of the reduced feature sets.

* Published at Interspeech 2019 https://www.isca-speech.org/archive/Interspeech_2019/abstracts/2769.html 
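
A rough illustration of the pipeline, assuming scikit-learn and randomly generated stand-in data; the feature dimensionality, reducer (PCA), and classifier (SVM) are placeholders rather than the exact configurations evaluated in the paper:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # X: (n_utterances, n_features) utterance-level descriptors (IS10-style),
    # y: (n_utterances,) emotion labels; random data stands in for a real corpus.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(200, 1582)), rng.integers(0, 4, size=200)

    model = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel="rbf"))
    model.fit(X, y)
    low_dim = model[:2].transform(X)    # the learned 50-dimensional representation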
  

Supervised Contrastive Learning for Accented Speech Recognition

Jul 02, 2021
Tao Han, Hantao Huang, Ziang Yang, Wei Han

Neural network based speech recognition systems suffer from performance degradation due to accented speech, especially unfamiliar accents. In this paper, we study the supervised contrastive learning framework for accented speech recognition. To build different views (similar "positive" data samples) for contrastive learning, three data augmentation techniques are investigated: noise injection, spectrogram augmentation, and TTS-same-sentence generation. Experiments on the Common Voice dataset show that contrastive learning helps to build data-augmentation-invariant and pronunciation-invariant representations, which significantly outperform traditional joint training methods in both zero-shot and full-shot settings. On average, contrastive learning improves accuracy by 3.66% (zero-shot) and 3.78% (full-shot) compared to the joint training method.

* Accented speech recognition, deep neural networks, model adaptation, supervised contrastive learning 
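
A sketch of a standard supervised contrastive (SupCon) loss, assuming PyTorch; embeddings of augmented views that share a label act as positives. The paper's exact loss weighting and view construction may differ.

    import torch
    import torch.nn.functional as F

    def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
        """SupCon loss: pull together embeddings that share a label.

        embeddings: (N, d) encoder outputs for a batch of augmented utterances
        labels:     (N,)   identifiers defining which samples count as positives
        """
        z = F.normalize(embeddings, dim=1)
        sim = z @ z.t() / temperature                        # (N, N) cosine similarities
        self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
        pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

        sim = sim.masked_fill(self_mask, float("-inf"))      # exclude self-comparisons
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

        # mean log-probability of the positives for each anchor that has positives
        pos_counts = pos_mask.sum(1)
        per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_counts.clamp(min=1)
        return per_anchor[pos_counts > 0].mean()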
  

Dyn-ASR: Compact, Multilingual Speech Recognition via Spoken Language and Accent Identification

Aug 04, 2021
Sangeeta Ghangam, Daniel Whitenack, Joshua Nemecek

Running automatic speech recognition (ASR) on edge devices is non-trivial due to resource constraints, especially in scenarios that require supporting multiple languages. We propose a new approach to enable multilingual speech recognition on edge devices. This approach uses both language identification and accent identification to select one of multiple monolingual ASR models on the fly, each fine-tuned for a particular accent. Initial results for both recognition performance and resource usage are promising, with our approach using less than 1/12th of the memory consumed by other solutions.

* Accepted to IEEE WF-IOT 2021 
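
In the spirit of the routing scheme described above, a minimal dispatcher might look as follows; the lang_id, accent_id, and per-accent asr_models objects are hypothetical components loaded elsewhere, not the authors' code.

    class DynASRDispatcher:
        """Identify language, then accent, then route audio to a matching ASR model."""

        def __init__(self, lang_id, accent_id, asr_models):
            self.lang_id = lang_id        # audio -> language code, e.g. "en"
            self.accent_id = accent_id    # {language: accent classifier}
            self.asr_models = asr_models  # {(language, accent): monolingual ASR model}

        def transcribe(self, audio):
            lang = self.lang_id.predict(audio)
            accent = self.accent_id[lang].predict(audio)
            # Fall back to a default model when no accent-specific model is loaded.
            model = self.asr_models.get((lang, accent), self.asr_models[(lang, "default")])
            return model.transcribe(audio)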
  

Personalized Automatic Speech Recognition Trained on Small Disordered Speech Datasets

Oct 09, 2021
Jimmy Tobin, Katrin Tomanek

This study investigates the performance of personalized automatic speech recognition (ASR) for recognizing disordered speech using small amounts of per-speaker adaptation data. We trained personalized models for 195 individuals with different types and severities of speech impairment with training sets ranging in size from <1 minute to 18-20 minutes of speech data. Word error rate (WER) thresholds were selected to determine Success Percentage (the percentage of personalized models reaching the target WER) in different application scenarios. For the home automation scenario, 79% of speakers reached the target WER with 18-20 minutes of speech; but even with only 3-4 minutes of speech, 63% of speakers reached the target WER. Further evaluation found similar improvement on test sets with conversational and out-of-domain, unprompted phrases. Our results demonstrate that with only a few minutes of recordings, individuals with disordered speech could benefit from personalized ASR.

* Submitted to ICASSP 2022 
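
For clarity, the Success Percentage metric used above amounts to the following computation once per-speaker WERs are known; the threshold in the usage comment is an arbitrary example, not the study's value:

    def success_percentage(per_speaker_wer, target_wer):
        """Percentage of personalized models whose WER reaches the target threshold."""
        hits = sum(1 for wer in per_speaker_wer if wer <= target_wer)
        return 100.0 * hits / len(per_speaker_wer)

    # e.g. success_percentage([0.08, 0.21, 0.12], target_wer=0.15)  # -> ~66.7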
  

Computing Optimal Location of Microphone for Improved Speech Recognition

Mar 24, 2022
Karan Nathwani, Bhavya Dixit, Sunil Kumar Kopparapu

It was shown in our earlier work that measurement error in the microphone position affected the room impulse response (RIR), which in turn affected both single-channel close-microphone and multi-channel distant-microphone speech recognition. In this paper, as an extension, we systematically study how to identify the optimal location of the microphone, given an approximate and hence erroneous microphone location in 3D space. The primary idea is to use a Monte Carlo technique to generate a large number of random microphone positions around the erroneous position and to select the position that results in the best performance of a general-purpose automatic speech recognizer (gp-asr). We experiment with clean and noisy speech and show that the optimal microphone location is unique and is affected by noise.

* 5 pages 
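
A compact sketch of the Monte Carlo search described above, assuming NumPy; the score_fn callback (which would simulate the RIR at a candidate point and decode a development set with the gp-asr system) and the search radius are placeholders:

    import numpy as np

    def refine_mic_position(approx_pos, score_fn, radius=0.25, n_samples=1000, seed=0):
        """Monte Carlo search around an erroneous microphone position.

        approx_pos: (3,) approximate microphone coordinates in metres
        score_fn:   maps a candidate 3-D position to ASR accuracy (higher is better)
        """
        rng = np.random.default_rng(seed)
        candidates = approx_pos + rng.uniform(-radius, radius, size=(n_samples, 3))
        scores = np.array([score_fn(p) for p in candidates])
        return candidates[int(np.argmax(scores))]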
  

Cross-lingual and Multilingual Speech Emotion Recognition on English and French

Mar 01, 2018
Michael Neumann, Ngoc Thang Vu

Research on multilingual speech emotion recognition faces the problem that most available speech corpora differ from each other in important ways, such as annotation methods or interaction scenarios. These inconsistencies complicate building a multilingual system. We present results for cross-lingual and multilingual emotion recognition on English and French speech data with similar characteristics in terms of interaction (human-human conversations). Furthermore, we explore the possibility of fine-tuning a pre-trained cross-lingual model with only a small number of samples from the target language, which is of great interest for low-resource languages. To gain more insight into what is learned by the deployed convolutional neural network, we analyze the attention mechanism inside the network.

* ICASSP 2018, Calgary 
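
A minimal sketch of the low-resource fine-tuning step, assuming PyTorch and a model whose classification-head parameters are prefixed with "classifier"; the layer names, optimizer, and hyperparameters are assumptions, not the paper's setup.

    import torch
    import torch.nn as nn

    def finetune_on_target_language(model, target_loader, lr=1e-4, epochs=5):
        """Freeze the pre-trained front-end and adapt only the classifier head
        using a small number of labelled target-language samples."""
        for name, param in model.named_parameters():
            param.requires_grad = name.startswith("classifier")   # assumed naming
        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=lr)
        criterion = nn.CrossEntropyLoss()
        model.train()
        for _ in range(epochs):
            for features, labels in target_loader:
                optimizer.zero_grad()
                loss = criterion(model(features), labels)
                loss.backward()
                optimizer.step()
        return model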
  