Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

"speech recognition": models, code, and papers

Improving EEG based Continuous Speech Recognition

Dec 18, 2019
Gautam Krishna, Co Tran, Mason Carnahan, Yan Han, Ahmed H Tewfik

In this paper we introduce various techniques to improve the performance of electroencephalography (EEG) features based continuous speech recognition (CSR) systems. A connectionist temporal classification (CTC) based automatic speech recognition (ASR) system was implemented for performing recognition. We introduce techniques to initialize the weights of the recurrent layers in the encoder of the CTC model with more meaningful weights rather than with random weights and we make use of an external language model to improve the beam search during decoding time. We finally study the problem of predicting articulatory features from EEG features in this paper.

* On preparation for submission to EUSIPCO 2020. arXiv admin note: text overlap with arXiv:1911.04261, arXiv:1906.08871 

Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition

Mar 29, 2022
Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson

An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system by leveraging recent advances in unsupervised automatic speech recognition (ASR). Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at

* submitted to INTERSPEECH 

Exploiting Hidden Representations from a DNN-based Speech Recogniser for Speech Intelligibility Prediction in Hearing-impaired Listeners

Apr 08, 2022
Zehai Tu, Ning Ma, Jon Barker

An accurate objective speech intelligibility prediction algorithms is of great interest for many applications such as speech enhancement for hearing aids. Most algorithms measures the signal-to-noise ratios or correlations between the acoustic features of clean reference signals and degraded signals. However, these hand-picked acoustic features are usually not explicitly correlated with recognition. Meanwhile, deep neural network (DNN) based automatic speech recogniser (ASR) is approaching human performance in some speech recognition tasks. This work leverages the hidden representations from DNN-based ASR as features for speech intelligibility prediction in hearing-impaired listeners. The experiments based on a hearing aid intelligibility database show that the proposed method could make better prediction than a widely used short-time objective intelligibility (STOI) based binaural measure.

* Submitted to INTERSPEECH2022 

Belief Hidden Markov Model for speech recognition

Jan 22, 2015
Siwar Jendoubi, Boutheina Ben Yaghlane, Arnaud Martin

Speech Recognition searches to predict the spoken words automatically. These systems are known to be very expensive because of using several pre-recorded hours of speech. Hence, building a model that minimizes the cost of the recognizer will be very interesting. In this paper, we present a new approach for recognizing speech based on belief HMMs instead of proba-bilistic HMMs. Experiments shows that our belief recognizer is insensitive to the lack of the data and it can be trained using only one exemplary of each acoustic unit and it gives a good recognition rates. Consequently, using the belief HMM recognizer can greatly minimize the cost of these systems.

* International Conference on Modeling, Simulation and Applied Optimization (ICMSAO), Apr 2013, Hammamet, Tunisia. pp.1 - 6 

Learning to Recognise Words using Visually Grounded Speech

May 31, 2020
Sebastiaan Scholten, Danny Merkx, Odette Scharenborg

We investigated word recognition in a Visually Grounded Speech model. The model has been trained on pairs of images and spoken captions to create visually grounded embeddings which can be used for speech to image retrieval and vice versa. We investigate whether such a model can be used to recognise words by embedding isolated words and using them to retrieve images of their visual referents. We investigate the time-course of word recognition using a gating paradigm and perform a statistical analysis to see whether well known word competition effects in human speech processing influence word recognition. Our experiments show that the model is able to recognise words, and the gating paradigm reveals that words can be recognised from partial input as well and that recognition is negatively influenced by word competition from the word initial cohort.


Multi-task Recurrent Model for Speech and Speaker Recognition

Sep 27, 2016
Zhiyuan Tang, Lantian Li, Dong Wang

Although highly correlated, speech and speaker recognition have been regarded as two independent tasks and studied by two communities. This is certainly not the way that people behave: we decipher both speech content and speaker traits at the same time. This paper presents a unified model to perform speech and speaker recognition simultaneously and altogether. The model is based on a unified neural network where the output of one task is fed to the input of the other, leading to a multi-task recurrent network. Experiments show that the joint model outperforms the task-specific models on both the two tasks.

* APSIPA 2016