
"speech recognition": models, code, and papers

Fast offline Transformer-based end-to-end automatic speech recognition for real-world applications

Jan 14, 2021
Yoo Rhee Oh, Kiyoung Park, Jeon Gyu Park

Many real-world applications require converting speech files into text with high accuracy using limited resources. This paper proposes a method to recognize a large speech database quickly using a Transformer-based end-to-end model. Transformers have improved state-of-the-art performance in many fields, including speech recognition, but they are not easy to apply to long sequences. In this paper, various techniques to speed up the recognition of real-world speech are proposed and tested, including parallelizing recognition with batched beam search, detecting end-of-speech based on connectionist temporal classification (CTC), restricting the CTC prefix score, and splitting long utterances into short segments. Experiments are conducted on a real-world Korean speech recognition task. Results on an 8-hour test corpus show that the proposed system can convert speech into text in less than 3 minutes with a 10.73% character error rate, a 27.1% relative reduction compared to a conventional DNN-HMM based recognition system.
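One of the techniques listed above, splitting long speech into short segments, can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation; the segment length and overlap values are assumptions chosen for the example.

```python
def split_into_segments(frames, seg_len=3000, overlap=300):
    """Split a long frame sequence into overlapping segments so each
    piece stays short enough for Transformer decoding; the overlap
    gives the decoder some context across segment boundaries."""
    if len(frames) <= seg_len:
        return [frames]
    step = seg_len - overlap
    segments, start = [], 0
    while start < len(frames):
        segments.append(frames[start:start + seg_len])
        if start + seg_len >= len(frames):
            break
        start += step
    return segments
```

Segments produced this way can then be decoded in parallel as a batch, which is how segmentation and batched beam search combine to speed up offline recognition.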

* Submitted to the International Conference on Acoustics, Speech & Signal Processing (ICASSP) 2021 

Evaluating Gammatone Frequency Cepstral Coefficients with Neural Networks for Emotion Recognition from Speech

Jun 23, 2018
Gabrielle K. Liu

Current approaches to speech emotion recognition focus on speech features that can capture the emotional content of a speech signal. Mel Frequency Cepstral Coefficients (MFCCs) are one of the most commonly used representations for audio speech recognition and classification. This paper proposes Gammatone Frequency Cepstral Coefficients (GFCCs) as a potentially better representation of speech signals for emotion recognition. The effectiveness of the MFCC and GFCC representations is compared and evaluated on emotion and intensity classification tasks with fully connected and recurrent neural network architectures. The results provide evidence that GFCCs outperform MFCCs in speech emotion recognition.
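MFCCs and GFCCs differ in the filterbank applied to the spectrum (mel-scale triangular filters versus gammatone filters) but share the same final step: a discrete cosine transform of the log filterbank energies. A minimal sketch of that shared step, with illustrative inputs rather than a real filterbank:

```python
import math

def cepstral_coefficients(log_energies, n_coeffs=13):
    """DCT-II of log filterbank energies — the decorrelation step shared
    by MFCC (mel filterbank) and GFCC (gammatone filterbank)."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * k * (m + 0.5) / n)
                for m, e in enumerate(log_energies))
            for k in range(n_coeffs)]
```

Swapping the filterbank that produces `log_energies` is the entire difference between the two representations; the cepstral step is unchanged.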

* 5 pages, 1 figure, 3 tables 

Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition

Nov 09, 2020
Cunhang Fan, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Bin Liu, Zhengqi Wen

The joint training framework for speech enhancement and recognition has achieved good performance for robust end-to-end automatic speech recognition (ASR). However, these methods use only the enhanced feature as the input of the speech recognition component, which is affected by the speech distortion problem. To address this problem, this paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR. The GRF algorithm dynamically combines the noisy and enhanced features. The GRF can therefore not only remove noise signals from the enhanced features, but also learn the raw fine structures from the noisy features, alleviating speech distortion. The proposed method consists of speech enhancement, GRF, and speech recognition. Firstly, a mask-based speech enhancement network is applied to enhance the input speech. Secondly, the GRF is applied to address the speech distortion problem. Thirdly, to improve ASR performance, the state-of-the-art speech transformer is used as the speech recognition component. Finally, the joint training framework optimizes these three components simultaneously. Our experiments are conducted on an open-source Mandarin speech corpus called AISHELL-1. Experimental results show that the proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method using only the enhanced features. Especially at low signal-to-noise ratio (0 dB), our proposed method achieves a larger improvement, with a 12.67% CER reduction, which suggests the potential of our proposed method.
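The core of gated fusion is an elementwise sigmoid gate that interpolates between the noisy and enhanced feature vectors. In the paper the gate is produced by a learned recurrent network; the sketch below takes the gate logits as a given input so it stays self-contained, which is an assumption of this example, not the paper's design.

```python
import math

def gated_fusion(noisy, enhanced, gate_logits):
    """Elementwise gated combination of noisy and enhanced features:
    fused = g * enhanced + (1 - g) * noisy, with g = sigmoid(logit)."""
    fused = []
    for x, y, z in zip(noisy, enhanced, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))  # sigmoid gate in (0, 1)
        fused.append(g * y + (1.0 - g) * x)
    return fused
```

A gate near 1 trusts the enhanced feature; a gate near 0 falls back on the raw noisy feature, which is how the method can recover fine structure lost to enhancement distortion.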

* Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing 

AV Taris: Online Audio-Visual Speech Recognition

Dec 14, 2020
George Sterpu, Naomi Harte

In recent years, Automatic Speech Recognition (ASR) technology has approached human-level performance on conversational speech under relatively clean listening conditions. In more demanding situations involving distant microphones, overlapped speech, background noise, or natural dialogue structures, the ASR error rate is at least an order of magnitude higher. The visual modality of speech carries the potential to partially overcome these challenges and contribute to the sub-tasks of speaker diarisation, voice activity detection, and the recovery of the place of articulation, and can compensate for up to 15 dB of noise on average. This article develops AV Taris, a fully differentiable neural network model capable of decoding audio-visual speech in real time. We achieve this by connecting two recently proposed models for audio-visual speech integration and online speech recognition, namely AV Align and Taris. We evaluate AV Taris under the same conditions as AV Align and Taris on one of the largest publicly available audio-visual speech datasets, LRS2. Our results show that AV Taris is superior to the audio-only variant of Taris, demonstrating the utility of the visual modality to speech recognition within the real-time decoding framework defined by Taris. Compared to an equivalent Transformer-based AV Align model that takes advantage of full sentences without meeting the real-time requirement, we report an absolute degradation of approximately 3% with AV Taris. As opposed to the more popular alternative for online speech recognition, namely the RNN Transducer, Taris offers a greatly simplified, fully differentiable training pipeline. As a consequence, AV Taris has the potential to popularise the adoption of Audio-Visual Speech Recognition (AVSR) technology and overcome the inherent limitations of the audio modality in less optimal listening conditions.


Constrained Variational Autoencoder for improving EEG based Speech Recognition Systems

Jun 01, 2020
Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

In this paper we introduce a recurrent neural network (RNN) based variational autoencoder (VAE) model with a new constrained loss function that generates more meaningful electroencephalography (EEG) features from raw EEG features, improving the performance of EEG-based speech recognition systems. We demonstrate that both continuous and isolated speech recognition systems trained and tested on EEG features generated by our VAE model achieve improved performance, and we report results for a limited English vocabulary of 30 unique sentences for continuous speech recognition and 2 unique sentences for isolated speech recognition. We compare our method with a recently introduced method [1] for improving EEG-based continuous speech recognition and demonstrate that our method outperforms it as vocabulary size increases, when trained and tested on the same data set. Although we demonstrate results only for automatic speech recognition (ASR) experiments, the proposed VAE model with a constrained loss function can be extended to a variety of other EEG-based brain-computer interface (BCI) applications.
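The standard VAE objective that such a constrained loss extends is the reconstruction error plus a KL divergence between the encoder's Gaussian posterior and a unit-Gaussian prior. The sketch below shows that baseline objective with an additive placeholder for the paper's constraint; the specific constraint term is not reproduced here, so `constraint_term` is an assumption of the example.

```python
import math

def vae_loss(recon_err, mu, log_var, constraint_term=0.0):
    """VAE objective: reconstruction error + KL(q(z|x) || N(0, I)),
    plus an additive constraint term standing in for the paper's
    (unspecified here) constrained loss."""
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon_err + kl + constraint_term
```

When the posterior matches the prior (zero mean, unit variance), the KL term vanishes and only the reconstruction and constraint terms remain.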

* Under Review. arXiv admin note: substantial text overlap with arXiv:2006.01260 

Speech Emotion Recognition Based on Multi-feature and Multi-lingual Fusion

Jan 16, 2020
Chunyi Wang

A speech emotion recognition algorithm based on multi-feature and multi-lingual fusion is proposed to address the low recognition accuracy caused by the lack of large speech datasets and the low robustness of acoustic features in speech emotion recognition. First, handcrafted and deep automatic features are extracted from existing Chinese and English speech emotion data. Then, the various features are fused within each language. Finally, the fused features of the different languages are fused again and used to train a classification model. Comparing the fused features with the unfused ones, the results show that fusion significantly improves the accuracy of the speech emotion recognition algorithm. The proposed solution is evaluated on two Chinese corpora and two English corpora, and provides more accurate predictions than the original solution. The multi-feature and multi-lingual fusion algorithm can therefore significantly improve speech emotion recognition accuracy when the dataset is small.
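The two-stage fusion scheme described above can be sketched with the simplest possible fusion operator, concatenation; the abstract does not specify the fusion operator, so concatenation is an assumption of this illustration.

```python
def concat_fuse(*feature_vectors):
    """Fuse feature vectors by concatenation (a stand-in for whatever
    fusion operator the paper actually uses)."""
    fused = []
    for v in feature_vectors:
        fused.extend(v)
    return fused

def two_stage_fuse(handcrafted_zh, deep_zh, handcrafted_en, deep_en):
    """Stage 1: fuse handcrafted and deep features within each language.
    Stage 2: fuse the per-language results across languages."""
    chinese = concat_fuse(handcrafted_zh, deep_zh)
    english = concat_fuse(handcrafted_en, deep_en)
    return concat_fuse(chinese, english)
```

The output of `two_stage_fuse` is what would be fed to the classification model in the final step.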


Recognition of Isolated Words using Zernike and MFCC features for Audio Visual Speech Recognition

Jul 04, 2014
Prashant Borde, Amarsinh Varpe, Ramesh Manza, Pravin Yannawar

Automatic Speech Recognition (ASR) by machine is an attractive research topic in the signal processing domain and has attracted many researchers to contribute in this area. In recent years, there have been many advances in automatic speech reading systems with the inclusion of audio and visual speech features to recognize words under noisy conditions. The objective of an audio-visual speech recognition system is to improve recognition accuracy. In this paper we computed visual features using Zernike moments and audio features using Mel Frequency Cepstral Coefficients (MFCC) on the vVISWa (Visual Vocabulary of Independent Standard Words) dataset, which contains a collection of isolated city names from 10 speakers. The visual features were normalized and the dimension of the feature set was reduced by Principal Component Analysis (PCA) in order to recognize the isolated word utterance in PCA space. The recognition of isolated words based on visual-only and audio-only features achieves 63.88% and 100%, respectively.
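The normalization step applied to the visual features before PCA can be sketched as standard zero-mean, unit-variance scaling; this is one common reading of "the visual features were normalized", not necessarily the exact scheme the paper used.

```python
import math

def zscore_normalize(features):
    """Zero-mean, unit-variance normalization of a feature vector,
    a typical preprocessing step before PCA."""
    n = len(features)
    mean = sum(features) / n
    var = sum((x - mean) ** 2 for x in features) / n
    std = math.sqrt(var) or 1.0  # guard against a constant vector
    return [(x - mean) / std for x in features]
```

Normalizing first keeps PCA from being dominated by whichever raw feature happens to have the largest numeric range.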


Speech Recognition With No Speech Or With Noisy Speech Beyond English

Jul 14, 2019
Gautam Krishna, Co Tran, Yan Han, Mason Carnahan, Ahmed H Tewfik

In this paper we demonstrate continuous noisy speech recognition on a limited Chinese vocabulary using a connectionist temporal classification (CTC) model with electroencephalography (EEG) features and no speech signal as input. We further demonstrate continuous noisy speech recognition with a single CTC model on a limited joint English and Chinese vocabulary, again using EEG features with no speech signal as input.
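A CTC model emits a frame-level sequence of labels including a special blank symbol; the final transcription comes from collapsing that alignment. A minimal sketch of the collapse rule (the label ids here are illustrative, with 0 as the blank):

```python
def ctc_collapse(alignment, blank=0):
    """Collapse a frame-level CTC alignment into a label sequence:
    merge consecutive repeated labels, then drop blanks."""
    out, prev = [], None
    for s in alignment:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out
```

Note that a blank between two identical labels keeps them distinct, which is how CTC can output genuinely repeated characters.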

* On preparation for submission for ICASSP 2020. arXiv admin note: text overlap with arXiv:1906.08044 

KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos

Mar 01, 2019
Egor Lakomkin, Sven Magg, Cornelius Weber, Stefan Wermter

In this paper, we describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and post-processing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, with an estimated 3.5% word error rate in the transcriptions. Automatically collected samples contain read and spontaneous speech recorded in various conditions, including background noise and music, distant microphone recordings, and a variety of accents and reverberation. When training a deep neural network for speech recognition, we observed around 40% word error rate reduction on the Wall Street Journal dataset by integrating 200 hours of the collected samples into the training set. The demo and the crawler code are publicly available.
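The filtering stage the abstract refers to can be sketched as a simple predicate over each candidate clip. The thresholds and caption markers below are illustrative assumptions, not the rules the paper actually applies.

```python
def keep_sample(duration_s, transcript):
    """Hypothetical filtering rules in the spirit of the crawler's
    post-processing: drop clips whose duration is implausible for a
    single caption, clips with non-speech caption markup, and clips
    with empty transcripts."""
    if not 1.0 <= duration_s <= 10.0:  # illustrative duration bounds
        return False
    lowered = transcript.lower()
    if any(tag in lowered for tag in ("[music]", "[applause]", "[laughter]")):
        return False  # caption describes non-speech audio
    return lowered.strip() != ""
```

Running every crawled clip through such a predicate is what turns raw captioned video into training-ready (audio, transcript) pairs.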

* Accepted at the Conference on Empirical Methods in Natural Language Processing 2018, Brussels, Belgium 