Models, code, and papers for "speech recognition"

##### Speech Recognition with no speech or with noisy speech
Mar 02, 2019
Gautam Krishna, Co Tran, Jianguo Yu, Ahmed H Tewfik

The performance of automatic speech recognition (ASR) systems degrades in the presence of noisy speech. This paper demonstrates that using electroencephalography (EEG) can help automatic speech recognition systems overcome performance loss in the presence of noise. The paper also shows that distillation training of automatic speech recognition systems using EEG features will increase their performance. Finally, we demonstrate the ability to recognize words from EEG with no speech signal on a limited English vocabulary with high accuracy.

* Accepted for ICASSP 2019

##### Speech Enhancement Modeling Towards Robust Speech Recognition System
May 07, 2013
Urmila Shrawankar, V. M. Thakare

For about four decades, human beings have been dreaming of an intelligent machine that can master natural speech. In its simplest form, this machine should consist of two subsystems, namely automatic speech recognition (ASR) and speech understanding (SU). The goal of ASR is to transcribe natural speech, while the goal of SU is to understand the meaning of the transcription. Recognizing and understanding a spoken sentence is obviously a knowledge-intensive process, which must take into account all variable information about the speech communication process, from acoustics to semantics and pragmatics. While developing an automatic speech recognition system, it is observed that some adverse conditions degrade its performance. In this contribution, a speech enhancement system is introduced for enhancing speech signals corrupted by additive noise and improving the performance of automatic speech recognizers in noisy conditions. Automatic speech recognition experiments show that replacing noisy speech signals with the corresponding enhanced speech signals improves recognition accuracy. The amount of improvement varies with the type of corrupting noise.

* Pages: 04; Conference Proceedings International Conference on Advance Computing (ICAC-2008), India

##### Advancing Speech Recognition With No Speech Or With Noisy Speech
Jul 27, 2019
Gautam Krishna, Co Tran, Mason Carnahan, Ahmed H Tewfik

In this paper we demonstrate end-to-end continuous speech recognition (CSR) using electroencephalography (EEG) signals with no speech signal as input. An attention-model-based automatic speech recognition (ASR) system and a connectionist temporal classification (CTC) based ASR system were implemented for performing recognition. We further demonstrate CSR for noisy speech by fusing with EEG features.

* Accepted for publication at IEEE EUSIPCO 2019. Camera-ready version. arXiv admin note: text overlap with arXiv:1906.08045

##### Speech Recognition with Augmented Synthesized Speech

Recent success of the Tacotron speech synthesis architecture and its variants in producing natural sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific, human speech that is used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker and style variations derived from input acoustic representations, thereby allowing for manipulation of the synthesized speech. In this paper, we evaluate the feasibility of enhancing speech recognition performance with speech synthesis, using two corpora from different domains. We explore algorithms to provide the necessary acoustic and lexical diversity needed for robust speech recognition. Finally, we demonstrate the feasibility of this approach as a data augmentation strategy for domain-transfer. We find that improvements to speech recognition performance are achievable by augmenting training data with synthesized material. However, there remains a substantial gap in performance between recognizers trained on human speech and those trained on synthesized speech.

* Accepted for publication at ASRU 2020

##### Visual Speech Recognition
Sep 03, 2014

Lip reading is used to understand or interpret speech without hearing it, a technique especially mastered by people with hearing difficulties. The ability to lip read enables a person with a hearing impairment to communicate with others and to engage in social activities, which otherwise would be difficult. Recent advances in the fields of computer vision, pattern recognition, and signal processing have led to a growing interest in automating this challenging task of lip reading. Indeed, automating the human ability to lip read, a process referred to as visual speech recognition (VSR) (or sometimes speech reading), could open the door for other novel related applications. VSR has received a great deal of attention in the last decade for its potential use in applications such as human-computer interaction (HCI), audio-visual speech recognition (AVSR), speaker recognition, talking heads, sign language recognition and video surveillance. Its main aim is to recognise spoken word(s) by using only the visual signal that is produced during speech. Hence, VSR deals with the visual domain of speech and involves image processing, artificial intelligence, object detection, pattern recognition, statistical modelling, etc.

* Speech and Language Technologies (Book), Prof. Ivo Ipsic (Ed.), ISBN: 978-953-307-322-4, InTech (2011)

##### Speech Recognition by Machine, A Review
Jan 13, 2010
M. A. Anusuya, S. K. Katti

This paper presents a brief survey on Automatic Speech Recognition and discusses the major themes and advances made in the past 60 years of research, so as to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech communication. After years of research and development, the accuracy of automatic speech recognition remains one of the important research challenges (e.g., variations of the context, speakers, and environment). The design of a speech recognition system requires careful attention to the following issues: definition of various types of speech classes, speech representation, feature extraction techniques, speech classifiers, database and performance evaluation. The problems that exist in ASR, and the various techniques that researchers have developed to solve them, are presented in chronological order. Hence the authors hope that this work will be a contribution to the area of speech recognition. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a speech recognition system and to identify research topics and applications which are at the forefront of this exciting and challenging field.

* International Journal of Computer Science and Information Security, IJCSIS, Vol. 6, No. 3, pp. 181-205, December 2009, USA
* 25 pages IEEE format, International Journal of Computer Science and Information Security, IJCSIS December 2009, ISSN 1947 5500, http://sites.google.com/site/ijcsis/

##### CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

##### MIMO-SPEECH: End-to-End Multi-Channel Multi-Speaker Speech Recognition

Recently, the end-to-end approach has proven its efficacy in monaural multi-speaker speech recognition. However, high word error rates (WERs) still prevent these systems from being used in practical applications. On the other hand, the spatial information in multi-channel signals has proven helpful in far-field speech recognition tasks. In this work, we propose a novel neural sequence-to-sequence (seq2seq) architecture, MIMO-Speech, which extends the original seq2seq to deal with multi-channel input and multi-channel output so that it can fully model multi-channel multi-speaker speech separation and recognition. MIMO-Speech is a fully neural end-to-end framework, which is optimized only via an ASR criterion. It is comprised of: 1) a monaural masking network, 2) a multi-source neural beamformer, and 3) a multi-output speech recognition model. With this processing, the input overlapped speech is directly mapped to text sequences. We further adopted a curriculum learning strategy, making the best use of the training set to improve the performance. The experiments on the spatialized wsj1-2mix corpus show that our model can achieve more than 60% WER reduction compared to the single-channel system with high quality enhanced signals (SI-SDR = 23.1 dB) obtained by the above separation function.

* Accepted at ASRU 2019
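The MIMO-Speech abstract above reports its result as a relative WER reduction. As a minimal sketch (not taken from the paper), WER can be computed as the word-level edit distance between reference and hypothesis divided by the number of reference words. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and whitespace tokenization is an assumption; real evaluation toolkits usually normalize text first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the WER dropped relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

Under this definition, a system that lowers WER from 0.5 to 0.2 has a relative reduction of 0.6, which is the kind of figure the "more than 60%" claim refers to (the numbers here are illustrative, not the paper's).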

##### Monaural Multi-Talker Speech Recognition using Factorial Speech Processing Models
Oct 05, 2016

A Pascal challenge entitled monaural multi-talker speech recognition was developed, targeting the problem of robust automatic speech recognition against speech-like noise, which significantly degrades the performance of automatic speech recognition systems. In this challenge, two competing speakers say a simple command simultaneously and the objective is to recognize the speech of the target speaker. Surprisingly, during the challenge, a team from IBM Research achieved a performance better than human listeners on this task. The method proposed by the IBM team consists of an intermediate speech separation step followed by single-talker speech recognition. This paper reconsiders the task of this challenge based on gain-adapted factorial speech processing models. It develops a joint-token passing algorithm for direct utterance decoding of both target and masker speakers simultaneously. Compared to the challenge winner, it uses maximum uncertainty during decoding, which cannot be used in the earlier two-phase method. It provides a detailed derivation of inference on these models based on general inference procedures of probabilistic graphical models. As another improvement, it uses deep neural networks for joint-speaker identification and gain estimation, which makes these two steps easier than before while producing competitive results for them. The proposed method outperforms past super-human results and even the results achieved recently by Microsoft Research using deep neural networks. It achieved a 5.5% absolute task performance improvement compared to the first super-human system and a 2.7% absolute task performance improvement compared to its recent competitor.

##### Opportunities & Challenges In Automatic Speech Recognition
May 09, 2013
Rashmi Makhijani, Urmila Shrawankar, V M Thakare

Automatic speech recognition enables a wide range of current and emerging applications such as automatic transcription, multimedia content analysis, and natural human-computer interfaces. This paper provides a glimpse of the opportunities and challenges that parallelism provides for automatic speech recognition and related application research from the point of view of speech researchers. The increasing parallelism in computing platforms opens three major possibilities for speech recognition systems: improving recognition accuracy in non-ideal, everyday noisy environments; increasing recognition throughput in batch processing of speech data; and reducing recognition latency in real-time usage scenarios. This paper describes technical challenges, approaches taken, and possible directions for future research to guide the design of efficient parallel software and hardware infrastructures.

* Pages: 05 Figures : 01 Proceedings of the International Conference BEATS 2010, NIT Jalandhar, INDIA

##### Research on several key technologies in practical speech emotion recognition
Sep 27, 2017
Chengwei Huang

In this dissertation, practical speech emotion recognition technology is studied, including several cognition-related emotion types, namely fidgetiness, confidence and tiredness. High-quality naturalistic emotional speech data is the basis of this research. The following techniques are used for inducing practical emotional speech: cognitive task, computer game, noise stimulation, sleep deprivation and movie clips. A practical speech emotion recognition system is studied based on the Gaussian mixture model. A two-class classifier set is adopted for performance improvement in the small-sample case. Considering the context information in continuous emotional speech, a Gaussian mixture model embedded with Markov networks is proposed. A further study is carried out for system robustness analysis. First, a noise reduction algorithm based on auditory masking properties is introduced to practical speech emotion recognition for the first time. Second, to deal with the complicated unknown emotion types in real situations, an emotion recognition method with rejection ability is proposed, which enhances the system's compatibility with unknown emotion samples. Third, to cope with the difficulties brought by a large number of unknown speakers, an emotional feature normalization method based on speaker-sensitive feature clustering is proposed. Fourth, by adding the electrocardiogram channel, a bi-modal emotion recognition system based on speech signals and electrocardiogram signals is introduced for the first time. The speech emotion recognition methods studied in this dissertation may be extended to cross-language speech emotion recognition and whispered speech emotion recognition.

* in Chinese

##### Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?
Feb 28, 2010
Steven Wegmann, Larry Gillick

Hidden Markov models (HMMs) have been successfully applied to automatic speech recognition for more than 35 years in spite of the fact that a key HMM assumption -- the statistical independence of frames -- is obviously violated by speech data. In fact, this data/model mismatch has inspired many attempts to modify or replace HMMs with alternative models that are better able to take into account the statistical dependence of frames. However, it is fair to say that in 2010 the HMM is the consensus model of choice for speech recognition and that HMMs are at the heart of both commercially available products and contemporary research systems. In this paper we present a preliminary exploration aimed at understanding how speech data depart from HMMs and what effect this departure has on the accuracy of HMM-based speech recognition. Our analysis uses standard diagnostic tools from the field of statistics -- hypothesis testing, simulation and resampling -- which are rarely used in the field of speech recognition. Our main result, obtained by novel manipulations of real and resampled data, demonstrates that real data have statistical dependency and that this dependency is responsible for significant numbers of recognition errors. We also demonstrate, using simulation and resampling, that if we 'remove' the statistical dependency from data, then the resulting recognition error rates become negligible. Taken together, these results suggest that a better understanding of the structure of the statistical dependency in speech data is a crucial first step towards improving HMM-based speech recognition.
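
For readers unfamiliar with the assumption this abstract refers to, it can be written out in the standard textbook form of an HMM. The notation is an assumption of this note, not the paper's: $x_{1:T}$ are acoustic frames and $s_{1:T}$ are hidden states. The joint probability factorizes as

$$
p(x_{1:T}, s_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(x_t \mid s_t),
$$

where $p(s_1 \mid s_0)$ denotes the initial state distribution. Conditioned on the state sequence, the frames are therefore modeled as independent:

$$
p(x_{1:T} \mid s_{1:T}) = \prod_{t=1}^{T} p(x_t \mid s_t).
$$

Real speech frames are correlated beyond what the state sequence explains, which is the data/model mismatch the abstract describes.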

##### Speech-Driven Text Retrieval: Using Target IR Collections for Statistical Language Model Adaptation in Speech Recognition
Jun 24, 2002
Atsushi Fujii, Katunobu Itou, Tetsuya Ishikawa

Speech recognition has of late become a practical technology for real world applications. Aiming at speech-driven text retrieval, which facilitates retrieving information with spoken queries, we propose a method to integrate speech recognition and retrieval methods. Since users speak contents related to a target collection, we adapt statistical language models used for speech recognition based on the target collection, so as to improve both the recognition and retrieval accuracy. Experiments using existing test collections combined with dictated queries showed the effectiveness of our method.

* Anni R. Coden and Eric W. Brown and Savitha Srinivasan (Eds.), Information Retrieval Techniques for Speech Applications (LNCS 2273), pp.94-104, Springer, 2002

##### Integrating HMM-Based Speech Recognition With Direct Manipulation In A Multimodal Korean Natural Language Interface
Nov 18, 1996
Geunbae Lee, Jong-Hyeok Lee, Sangeok Kim

This paper presents an HMM-based speech recognition engine and its integration into direct manipulation interfaces for a Korean document editor. Speech recognition can reduce the tedious and repetitive actions that are inevitable in standard GUIs (graphical user interfaces). Our system consists of a general speech recognition engine called ABrain (Auditory Brain) and a speech-commandable document editor called SHE (Simple Hearing Editor). ABrain is a phoneme-based speech recognition engine that achieves a discrete command recognition rate of up to 97%. SHE is a EuroBridge widget-based document editor that supports speech commands as well as direct manipulation interfaces.

* 6 pages, ps file, presented at icmi96 (Beijing)

##### Understanding effect of speech perception in EEG based speech recognition systems
May 29, 2020
Gautam Krishna, Co Tran, Mason Carnahan, Ahmed Tewfik

The electroencephalography (EEG) signals recorded in parallel with speech are used to perform isolated and continuous speech recognition. During the speaking process, one also hears one's own speech, and this speech perception is also reflected in the recorded EEG signals. In this paper we investigate whether it is possible to separate out this speech perception component from EEG signals in order to design more robust EEG based speech recognition systems. We further demonstrate predicting EEG signals recorded in parallel with speaking from EEG signals recorded in parallel with passive listening, and vice versa, with very low normalized root mean squared error (RMSE). Finally, we demonstrate both isolated and continuous speech recognition using EEG signals recorded in parallel with listening and speaking, and improve on the previous connectionist temporal classification (CTC) model results demonstrated by the authors in [1] using their data set.

* Under Review
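
The abstract above evaluates EEG prediction with a normalized RMSE but does not say how the error is normalized. The sketch below assumes one common convention, dividing the RMSE by the range of the target signal. That choice and the function name `normalized_rmse` are assumptions for illustration, not the authors' definition.

```python
import math
from typing import Sequence


def normalized_rmse(target: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE between two equal-length signals, divided by the target's range."""
    if len(target) != len(predicted) or len(target) == 0:
        raise ValueError("signals must be non-empty and of equal length")

    # Mean of the squared sample-wise errors.
    mse = sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)
    rmse = math.sqrt(mse)

    # Assumed normalization: range (max - min) of the target signal.
    value_range = max(target) - min(target)
    if value_range == 0:
        raise ValueError("target signal is constant; range normalization undefined")
    return rmse / value_range
```

Other conventions divide by the mean or the standard deviation of the target instead, so values are only comparable across studies that use the same normalization.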

##### TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices
Aug 23, 2020
Alexander Wong, Mahmoud Famouri, Maya Pavlova, Siddharth Surana

Advances in deep learning have led to state-of-the-art performance across a multitude of speech recognition tasks. Nevertheless, the widespread deployment of deep neural networks for on-device speech recognition remains a challenge, particularly in edge scenarios where the memory and computing resources are highly constrained (e.g., low-power embedded devices) or where the memory and computing budget dedicated to speech recognition is low (e.g., mobile devices performing numerous tasks besides speech recognition). In this study, we introduce the concept of attention condensers for building low-footprint, highly-efficient deep neural networks for on-device speech recognition on the edge. More specifically, an attention condenser is a self-attention mechanism that learns and produces a condensed embedding characterizing joint local and cross-channel activation relationships, and performs selective attention accordingly. To illustrate its efficacy, we introduce TinySpeech, low-precision deep neural networks composed largely of attention condensers and tailored for on-device speech recognition using a machine-driven design exploration strategy. Experimental results on the Google Speech Commands benchmark dataset for limited-vocabulary speech recognition showed that TinySpeech networks achieved significantly lower architectural complexity (as much as $207\times$ fewer parameters) and lower computational complexity (as much as $21\times$ fewer multiply-add operations) when compared to previous deep neural networks in research literature. These results not only demonstrate the efficacy of attention condensers for building highly efficient deep neural networks for on-device speech recognition, but also illuminate their potential for accelerating deep learning on the edge and empowering a wide range of TinyML applications.

* 9 pages