Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"speech recognition": models, code, and papers

Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

Feb 19, 2020
Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

Ainu is an unwritten language that has been spoken by Ainu people who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a quite limited parts of them are transcribed so far. Thus, we started a project of automatic speech recognition (ASR) for the Ainu language in order to contribute to the development of annotated language archives. In this paper, we report speech corpus development and the structure and performance of end-to-end ASR for Ainu. We investigated four modeling units (phone, syllable, word piece, and word) and found that the syllable-based model performed best in terms of both word and phone recognition accuracy, which were about 60% and over 85% respectively in speaker-open condition. Furthermore, word and phone accuracy of 80% and 90% has been achieved in a speaker-closed setting. We also found out that a multilingual ASR training with additional speech corpora of English and Japanese further improves the speaker-open test accuracy.

* ver. 2 

Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter

Nov 17, 2020
Xiong Wang, Zhuoyuan Yao, Xian Shi, Lei Xie

End-to-end models are favored in automatic speech recognition (ASR) because of its simplified system structure and superior performance. Among these models, recurrent neural network transducer (RNN-T) has achieved significant progress in streaming on-device speech recognition because of its high-accuracy and low-latency. RNN-T adopts a prediction network to enhance language information, but its language modeling ability is limited because it still needs paired speech-text data to train. Further strengthening the language modeling ability through extra text data, such as shallow fusion with an external language model, only brings a small performance gain. In view of the fact that Mandarin Chinese is a character-based language and each character is pronounced as a tonal syllable, this paper proposes a novel cascade RNN-T approach to improve the language modeling ability of RNN-T. Our approach firstly uses an RNN-T to transform acoustic feature into syllable sequence, and then converts the syllable sequence into character sequence through an RNN-T-based syllable-to-character converter. Thus a rich text repository can be easily used to strengthen the language model ability. By introducing several important tricks, the cascade RNN-T approach surpasses the character-based RNN-T by a large margin on several Mandarin test sets, with much higher recognition quality and similar latency.

* 7 pages, 3 figures, 5 tables 

Analysis of EEG frequency bands for Envisioned Speech Recognition

Mar 29, 2022
Ayush Tripathi

The use of Automatic speech recognition (ASR) interfaces have become increasingly popular in daily life for use in interaction and control of electronic devices. The interfaces currently being used are not feasible for a variety of users such as those suffering from a speech disorder, locked-in syndrome, paralysis or people with utmost privacy requirements. In such cases, an interface that can identify envisioned speech using electroencephalogram (EEG) signals can be of great benefit. Various works targeting this problem have been done in the past. However, there has been limited work in identifying the frequency bands ($\delta, \theta, \alpha, \beta, \gamma$) of the EEG signal that contribute towards envisioned speech recognition. Therefore, in this work, we aim to analyze the significance of different EEG frequency bands and signals obtained from different lobes of the brain and their contribution towards recognizing envisioned speech. Signals obtained from different lobes and bandpass filtered for different frequency bands are fed to a spatio-temporal deep learning architecture with Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). The performance is evaluated on a publicly available dataset comprising of three classification tasks - digit, character and images. We obtain a classification accuracy of $85.93\%$, $87.27\%$ and $87.51\%$ for the three tasks respectively. The code for the implementation has been made available at


Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition

Jul 10, 2018
Chun-Fu Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, Rogerio Feris

In this paper, we propose a novel Convolutional Neural Network (CNN) architecture for learning multi-scale feature representations with good tradeoffs between speed and accuracy. This is achieved by using a multi-branch network, which has different computational complexity at different branches. Through frequent merging of features from branches at distinct scales, our model obtains multi-scale features while using less computation. The proposed approach demonstrates improvement of model efficiency and performance on both object recognition and speech recognition tasks,using popular architectures including ResNet and ResNeXt. For object recognition, our approach reduces computation by 33% on object recognition while improving accuracy with 0.9%. Furthermore, our model surpasses state-of-the-art CNN acceleration approaches by a large margin in accuracy and FLOPs reduction. On the task of speech recognition, our proposed multi-scale CNNs save 30% FLOPs with slightly better word error rates, showing good generalization across domains.


Voice based self help System: User Experience Vs Accuracy

Apr 07, 2015
Sunil Kumar Kopparapu

In general, self help systems are being increasingly deployed by service based industries because they are capable of delivering better customer service and increasingly the switch is to voice based self help systems because they provide a natural interface for a human to interact with a machine. A speech based self help system ideally needs a speech recognition engine to convert spoken speech to text and in addition a language processing engine to take care of any misrecognitions by the speech recognition engine. Any off-the-shelf speech recognition engine is generally a combination of acoustic processing and speech grammar. While this is the norm, we believe that ideally a speech recognition application should have in addition to a speech recognition engine a separate language processing engine to give the system better performance. In this paper, we discuss ways in which the speech recognition engine and the language processing engine can be combined to give a better user experience.

* 5 pages; 1 figure 

Articulatory information and Multiview Features for Large Vocabulary Continuous Speech Recognition

Feb 16, 2018
Vikramjit Mitra, Wen Wang, Chris Bartels, Horacio Franco, Dimitra Vergyri

This paper explores the use of multi-view features and their discriminative transforms in a convolutional deep neural network (CNN) architecture for a continuous large vocabulary speech recognition task. Mel-filterbank energies and perceptually motivated forced damped oscillator coefficient (DOC) features are used after feature-space maximum-likelihood linear regression (fMLLR) transforms, which are combined and fed as a multi-view feature to a single CNN acoustic model. Use of multi-view feature representation demonstrated significant reduction in word error rates (WERs) compared to the use of individual features by themselves. In addition, when articulatory information was used as an additional input to a fused deep neural network (DNN) and CNN acoustic model, it was found to demonstrate further reduction in WER for the Switchboard subset and the CallHome subset (containing partly non-native accented speech) of the NIST 2000 conversational telephone speech test set, reducing the error rate by 12% relative to the baseline in both cases. This work shows that multi-view features in association with articulatory information can improve speech recognition robustness to spontaneous and non-native speech.

* 5 pages 

Machine Speech Chain with One-shot Speaker Adaptation

Mar 28, 2018
Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

In previous work, we developed a closed-loop speech chain model based on deep learning, in which the architecture enabled the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components to mutually improve their performance. This was accomplished by the two parts teaching each other using both labeled and unlabeled data. This approach could significantly improve model performance within a single-speaker speech dataset, but only a slight increase could be gained in multi-speaker tasks. Furthermore, the model is still unable to handle unseen speakers. In this paper, we present a new speech chain mechanism by integrating a speaker recognition model inside the loop. We also propose extending the capability of TTS to handle unseen speakers by implementing one-shot speaker adaptation. This enables TTS to mimic voice characteristics from one speaker to another with only a one-shot speaker sample, even from a text without any speaker information. In the speech chain loop mechanism, ASR also benefits from the ability to further learn an arbitrary speaker's characteristics from the generated speech waveform, resulting in a significant improvement in the recognition rate.


3-D Feature and Acoustic Modeling for Far-Field Speech Recognition

Nov 13, 2019
Anurenjan Purushothaman, Anirudh Sreeram, Sriram Ganapathy

Automatic speech recognition in multi-channel reverberant conditions is a challenging task. The conventional way of suppressing the reverberation artifacts involves a beamforming based enhancement of the multi-channel speech signal, which is used to extract spectrogram based features for a neural network acoustic model. In this paper, we propose to extract features directly from the multi-channel speech signal using a multi variate autoregressive (MAR) modeling approach, where the correlations among all the three dimensions of time, frequency and channel are exploited. The MAR features are fed to a convolutional neural network (CNN) architecture which performs the joint acoustic modeling on the three dimensions. The 3-D CNN architecture allows the combination of multi-channel features that optimize the speech recognition cost compared to the traditional beamforming models that focus on the enhancement task. Experiments are conducted on the CHiME-3 and REVERB Challenge dataset using multi-channel reverberant speech. In these experiments, the proposed 3-D feature and acoustic modeling approach provides significant improvements over an ASR system trained with beamformed audio (average relative improvements of 10 % and 9 % in word error rates for CHiME-3 and REVERB Challenge datasets respectively.


Multilingual Speech Recognition for Low-Resource Indian Languages using Multi-Task conformer

Sep 10, 2021
Krishna D N

Transformers have recently become very popular for sequence-to-sequence applications such as machine translation and speech recognition. In this work, we propose a multi-task learning-based transformer model for low-resource multilingual speech recognition for Indian languages. Our proposed model consists of a conformer [1] encoder and two parallel transformer decoders. We use a phoneme decoder (PHN-DEC) for the phoneme recognition task and a grapheme decoder (GRP-DEC) to predict grapheme sequence. We consider the phoneme recognition task as an auxiliary task for our multi-task learning framework. We jointly optimize the network for both phoneme and grapheme recognition tasks using Joint CTC-Attention [2] training. We use a conditional decoding scheme to inject the language information into the model before predicting the grapheme sequence. Our experiments show that our proposed approach can obtain significant improvement over previous approaches [4]. We also show that our conformer-based dual-decoder approach outperforms both the transformer-based dual-decoder approach and single decoder approach. Finally, We compare monolingual ASR models with our proposed multilingual ASR approach.

* 5 pages. Rejected from Interspeech 2021. arXiv admin note: substantial text overlap with arXiv:2109.03277 

Manifold-Kernels Comparison in MKPLS for Visual Speech Recognition

Jan 22, 2016
Amr Bakry, Ahmed Elgammal

Speech recognition is a challenging problem. Due to the acoustic limitations, using visual information is essential for improving the recognition accuracy in real-life unconstraint situations. One common approach is to model the visual recognition as nonlinear optimization problem. Measuring the distances between visual units is essential for solving this problem. Embedding the visual units on a manifold and using manifold kernels is one way to measure these distances. This work is intended to evaluate the performance of several manifold kernels for solving the problem of visual speech recognition. We show the theory behind each kernel. We apply manifold kernel partial least squares framework to OuluVs and AvLetters databases, and show empirical comparison between all kernels. This framework provides convenient way to explore different kernels.