"speech": models, code, and papers

ExKaldi-RT: A Real-Time Automatic Speech Recognition Extension Toolkit of Kaldi

Apr 03, 2021
Yu Wang, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki

The availability of open-source software is playing a remarkable role in automatic speech recognition (ASR). Kaldi, for instance, is widely used to develop state-of-the-art offline and online ASR systems. This paper describes "ExKaldi-RT," an online ASR toolkit built on Kaldi and the Python language. ExKaldi-RT provides tools for building a real-time audio stream pipeline, extracting acoustic features, transmitting packets over a remote connection, estimating acoustic probabilities with a neural network, and online decoding. While similar Kaldi-based functions are available, a key feature of ExKaldi-RT is that it works entirely in Python, giving online ASR system developers an easy-to-use interface for exploiting original research, for example, by applying neural network-based signal processing and acoustic models trained with deep learning frameworks. We performed benchmark experiments on the minimum LibriSpeech corpus and showed that ExKaldi-RT can achieve competitive ASR performance in real time.

* Submitted to INTERSPEECH2021 


Attention-Augmented End-to-End Multi-Task Learning for Emotion Prediction from Speech

Mar 29, 2019
Zixing Zhang, Bingwen Wu, Bjoern Schuller

Despite the increasing research interest in end-to-end learning systems for speech emotion recognition, conventional systems either suffer from overfitting, due in part to limited training data, or do not explicitly consider the different contributions of automatically learnt representations for a specific task. In this contribution, we propose a novel end-to-end framework which is enhanced by learning other auxiliary tasks and by an attention mechanism. That is, we jointly train an end-to-end network with several different but related emotion prediction tasks, i.e., arousal, valence, and dominance predictions, to extract representations shared among the tasks that are more robust than those of traditional systems, in the hope of relieving the overfitting problem. Meanwhile, an attention layer is implemented on top of the layers for each task, with the aim of capturing the contribution distribution of different segment parts for each individual task. To evaluate the effectiveness of the proposed system, we conducted a set of experiments on the widely used database IEMOCAP. The empirical results show that the proposed systems significantly outperform corresponding baseline systems.
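
The per-task attention layer described above can be pictured with a minimal sketch. It is not the authors' network: it assumes simple dot-product attention, and the function name `attention_pool` and the task-specific `query` vector are hypothetical stand-ins for learned parameters. Each task (arousal, valence, dominance) would hold its own query over the same shared segment features.

```python
import math

def attention_pool(segments, query):
    """Pool segment features into one vector using softmax attention.

    segments: list of equal-length feature vectors, one per speech segment.
    query:    task-specific vector (hypothetical learned parameter).
    Returns (pooled_vector, attention_weights).
    """
    # Dot-product relevance score of each segment for this task.
    scores = [sum(h * q for h, q in zip(seg, query)) for seg in segments]
    # Numerically stable softmax over the segment scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of segment features.
    dim = len(segments[0])
    pooled = [sum(w * seg[d] for w, seg in zip(weights, segments)) for d in range(dim)]
    return pooled, weights

# Shared segment features, one hypothetical query per task.
shared = [[1.0, 0.0], [0.0, 1.0]]
task_queries = {"arousal": [10.0, 0.0], "valence": [0.0, 10.0], "dominance": [0.0, 0.0]}
task_vectors = {task: attention_pool(shared, q)[0] for task, q in task_queries.items()}
```

With an all-zero query the weights are uniform and the pooled vector is the plain mean of the segments, so each task departs from mean pooling only as far as its query singles out particular segments.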

* Accepted by ICASSP 2019 


The Best of Both Worlds: Lexical Resources To Improve Low-Resource Part-of-Speech Tagging

Nov 21, 2018
Barbara Plank, Sigrid Klerke, Zeljko Agic

In natural language processing, the deep learning revolution has shifted the focus from conventional hand-crafted symbolic representations to dense inputs, which are adequate representations learned automatically from corpora. However, particularly when working with low-resource languages, small amounts of symbolic lexical resources such as user-generated lexicons are often available even when gold-standard corpora are not. Such additional linguistic information is, however, often neglected, and recent neural approaches to cross-lingual tagging typically rely only on word and subword embeddings. While these representations are effective, our recent work has shown clear benefits of combining the best of both worlds: integrating conventional lexical information improves neural cross-lingual part-of-speech (PoS) tagging. However, little is known about how complementary such additional information is, and to what extent improvements depend on the coverage and quality of these external resources. This paper seeks to fill this gap by providing the first thorough analysis of the contributions of lexical resources for cross-lingual PoS tagging in neural times.

* Under review for Natural Language Engineering 


[email protected]: Hope Speech Detection Using Indic Transliteration and Transformers

Feb 25, 2021
Ishan Sanjeev Upadhyay, Nikhil E, Anshul Wadhawan, Radhika Mamidi

This paper describes the approach we used to detect hope speech in the HopeEDI dataset. We experimented with two approaches. In the first approach, we used contextual embeddings to train classifiers using logistic regression, random forest, SVM, and LSTM-based models. The second approach used a majority voting ensemble of 11 models, which were obtained by fine-tuning pre-trained transformer models (BERT, ALBERT, RoBERTa, IndicBERT) after adding an output layer. We found that the second approach was superior for English, Tamil, and Malayalam. Our solution achieved weighted F1 scores of 0.93, 0.75, and 0.49 for English, Malayalam, and Tamil, respectively. Our solution ranked first in English, eighth in Malayalam, and eleventh in Tamil.
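
A majority voting ensemble of the kind mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the helper name `majority_vote`, the label strings, and the tie-breaking rule are hypothetical, and the abstract does not say how the authors resolved ties.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by majority vote.

    predictions: list with one entry per model; each entry is a list of
    labels, one per example, in the same example order.
    """
    if not predictions:
        raise ValueError("need predictions from at least one model")
    n_examples = len(predictions[0])
    if any(len(p) != n_examples for p in predictions):
        raise ValueError("all models must predict the same number of examples")
    voted = []
    for labels in zip(*predictions):
        # most_common(1) picks the most frequent label; on a tie it keeps the
        # label seen first, i.e. the earliest model wins (assumed rule).
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Three hypothetical models voting on three comments.
model_preds = [
    ["hope", "non_hope", "hope"],
    ["hope", "hope", "non_hope"],
    ["non_hope", "hope", "hope"],
]
ensemble_labels = majority_vote(model_preds)
```

An odd number of models, such as the 11 used in the paper, rules out ties for two-class decisions, though ties remain possible once there are more than two classes.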


Probabilistic SVM/GMM Classifier for Speaker-Independent Vowel Recognition in Continues Speech

Dec 12, 2008
Mohammad Nazari, Abolghasem Sayadiyan, SeyedMajid Valiollahzadeh

In this paper, we discuss the issues in automatic recognition of vowels in the Persian language. The present work focuses on a new statistical method for recognizing vowels as the basic unit of syllables. First, we describe a vowel detection system, then briefly discuss how the detected vowels can be fed to the recognition unit. In pattern recognition, Support Vector Machines (SVM) as a discriminative classifier and the Gaussian mixture model (GMM) as a generative classifier are two of the most popular techniques. Current state-of-the-art systems try to combine them to achieve greater classification power and improve the performance of recognition systems. The main idea of the study is to combine probabilistic SVM and traditional GMM pattern classification with some characteristics of speech, such as band-pass energy, to achieve a better classification rate. This idea has been analytically formulated and tested on a FarsDat-based vowel recognition system. The results show inconceivable increases in recognition accuracy. The tests have been carried out with various proposed vowel recognition algorithms, and the results have been compared.

* 4 pages 


FastCorrect: Fast Error Correction with Edit Alignment for Automatic Speech Recognition

May 20, 2021
Yichong Leng, Xu Tan, Linchen Zhu, Jin Xu, Renqian Luo, Linquan Liu, Tao Qin, Xiang-Yang Li, Ed Lin, Tie-Yan Liu

Error correction techniques have been used to refine the output sentences from automatic speech recognition (ASR) models and achieve a lower word error rate (WER) than original ASR outputs. Previous works usually use a sequence-to-sequence model to correct an ASR output sentence autoregressively, which causes large latency and cannot be deployed in online ASR services. A straightforward solution to reduce latency, inspired by non-autoregressive (NAR) neural machine translation, is to use an NAR sequence generation model for ASR error correction, which, however, comes at the cost of significantly increased ASR error rate. In this paper, observing distinctive error patterns and correction operations (i.e., insertion, deletion, and substitution) in ASR, we propose FastCorrect, a novel NAR error correction model based on edit alignment. In training, FastCorrect aligns each source token from an ASR output sentence to the target tokens from the corresponding ground-truth sentence based on the edit distance between the source and target sentences, and extracts the number of target tokens corresponding to each source token during editing/correction, which is then used to train a length predictor and to adjust the source tokens to match the length of the target sentence for parallel generation. In inference, the token number predicted by the length predictor is used to adjust the source tokens for target sequence generation. Experiments on the public AISHELL-1 dataset and an internal industrial-scale ASR dataset show the effectiveness of FastCorrect for ASR error correction: 1) it speeds up inference by 6-9 times and maintains the accuracy (8-14% WER reduction) compared with the autoregressive correction model; and 2) it outperforms popular NAR models adopted in neural machine translation in accuracy by a large margin.
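
The edit-alignment step described above can be illustrated with a simplified sketch. It is not the FastCorrect implementation: it follows a single minimum-edit-distance path and attaches each inserted target token to the source token on its left (or to the first source token at the start of the sentence), which is an assumed rule chosen for brevity. The function names `edit_alignment_counts` and `adjust_source` are hypothetical.

```python
def edit_alignment_counts(source, target):
    """Return (counts, distance): counts[i] is the number of target tokens
    aligned to source[i] along one minimum-edit-distance path."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    counts = [0] * m
    i, j = m, n
    while i > 0 or j > 0:
        cost = 0 if i > 0 and j > 0 and source[i - 1] == target[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            counts[i - 1] += 1   # source token keeps exactly one target token
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            i -= 1               # deleted source token: zero target tokens
        else:
            if m > 0:            # inserted target token: attach to left neighbour
                counts[max(i - 1, 0)] += 1
            j -= 1
    return counts, dist[m][n]

def adjust_source(source, counts):
    """Repeat or drop source tokens so the result has the target's length."""
    return [tok for tok, c in zip(source, counts) for _ in range(c)]

counts, distance = edit_alignment_counts(["a", "c"], ["a", "b", "c"])
adjusted = adjust_source(["a", "c"], counts)
```

In this toy case the missing token `b` is attached to `a`, so `a` is duplicated and the adjusted source has the same length as the target. Counts of this kind are what the length predictor is trained to reproduce, since the ground-truth sentence is unavailable at inference time.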


Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

Apr 04, 2022
Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where the source modal representations are what we are given, and the target modal representations are what we want to obtain from the memory network. We then construct an associative bridge between source and target memories that considers the interrelationship between the two memories. By learning the interrelationship through the associative bridge, the proposed bridging framework is able to obtain the target modal representations inside the memory network, even with the source modal input only, and it provides rich information for its downstream tasks. We apply the proposed framework to two tasks: lip reading and speech reconstruction from silent video. Through the proposed associative bridge and modality-specific memories, the knowledge of each task is enriched with the recalled audio context, achieving state-of-the-art performance. We also verify that the associative bridge properly relates the source and target memories.

* Published at ICCV 2021 


Modified SPLICE and its Extension to Non-Stereo Data for Noise Robust Speech Recognition

Jul 15, 2013
D. S. Pavan Kumar, N. Vishnu Prasad, Vikas Joshi, S. Umesh

In this paper, a modification to the training process of the popular SPLICE algorithm has been proposed for noise robust speech recognition. The modification is based on feature correlations, and enables this stereo-based algorithm to improve the performance in all noise conditions, especially in unseen cases. Further, the modified framework is extended to work for non-stereo datasets where clean and noisy training utterances, but not stereo counterparts, are required. Finally, an MLLR-based computationally efficient run-time noise adaptation method in SPLICE framework has been proposed. The modified SPLICE shows 8.6% absolute improvement over SPLICE in Test C of Aurora-2 database, and 2.93% overall. Non-stereo method shows 10.37% and 6.93% absolute improvements over Aurora-2 and Aurora-4 baseline models respectively. Run-time adaptation shows 9.89% absolute improvement in modified framework as compared to SPLICE for Test C, and 4.96% overall w.r.t. standard MLLR adaptation on HMMs.

* Submitted to Automatic Speech Recognition and Understanding (ASRU) 2013 Workshop 
