"speech recognition": models, code, and papers

Toward Cross-Domain Speech Recognition with End-to-End Models

Mar 09, 2020
Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

In the area of multi-domain speech recognition, research in the past focused on hybrid acoustic models to build cross-domain and domain-invariant speech recognition systems. In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems when mixing acoustic training data from several domains. For these experiments we composed a multi-domain dataset from public sources, with the different domains in the corpus covering a wide variety of topics and acoustic conditions such as telephone conversations, lectures, read speech and broadcast news. We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase the performance on specific domains. However, our end-to-end models optimized with a sequence-based criterion generalize better than the hybrid models on diverse domains. In terms of word error rate, our experimental acoustic-to-word and attention-based models trained on the multi-domain dataset reach the performance of domain-specific long short-term memory (LSTM) hybrid models, thus resulting in multi-domain speech recognition systems that do not suffer in performance compared with domain-specific ones. Moreover, the use of neural end-to-end models eliminates the need for domain-adapted language models during recognition, which is a great advantage when the input domain is unknown.

* Presented at the Life-Long Learning for Spoken Language Systems Workshop, ASRU 2019

Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition

Jun 16, 2021
Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process, and the interaction between the two models helps them reinforce each other to improve ASR performance. We apply MPL to an end-to-end ASR model based on connectionist temporal classification. The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios with varying amounts of data or domain mismatch.

* Accepted to Interspeech 2021
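
The momentum-based moving average described in the abstract can be illustrated with a minimal sketch. It assumes model weights are stored as a plain dictionary of floats; the function name `momentum_update` and the value of `alpha` are hypothetical choices for illustration, not taken from the paper.

```python
def momentum_update(offline, online, alpha=0.999):
    """Move each offline weight toward the matching online weight, in place.

    alpha close to 1 makes the offline model change slowly. The value used
    here is an illustrative assumption, not the paper's setting.
    """
    for name, w_online in online.items():
        offline[name] = alpha * offline[name] + (1.0 - alpha) * w_online
    return offline


# Toy weights: the offline model drifts a small step toward the online one.
online = {"w": 1.0, "b": -2.0}
offline = {"w": 0.0, "b": 0.0}
momentum_update(offline, online, alpha=0.9)
```

In a full training loop, this update would presumably run after each online-model step, with the offline model producing the pseudo-labels for the next batch.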

Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers

Dec 07, 2022
Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney

Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid models, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-level minimum Bayes risk, and lattice-free minimum Bayes risk, which are used for the final posterior output of the phoneme-based neural transducer with a limited context dependency. Compared to criteria using N-best lists, lattice-free methods eliminate the decoding step for hypothesis generation during training, which leads to more efficient training. Experimental results show that lattice-free methods gain up to 6.5% relative improvement in word error rate compared to a sequence-level cross-entropy trained model. Compared to the N-best-list based minimum Bayes risk objectives, lattice-free methods gain a 40% to 70% relative training time speedup with a small degradation in performance.

* Submitted to ICASSP 2023

Evaluating context-invariance in unsupervised speech representations

Oct 27, 2022
Mark Hallap, Emmanuel Dupoux, Ewan Dunbar

Unsupervised speech representations have taken off, with benchmarks (SUPERB, ZeroSpeech) demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of "discovering the phonemes" of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can have massive influence on the way it is pronounced, while the text remains stable. This is what allows tokens of the same word to have the same transcriptions, which is key to language understanding. Current benchmarks do not measure context-invariance. We develop a new version of the ZeroSpeech ABX benchmark that measures context-invariance, and apply it to recent self-supervised representations. We demonstrate that the context-independence of representations is predictive of the stability of word-level representations. We suggest research concentrate on improving context-independence of self-supervised and unsupervised representations.

* Submitted to ICASSP 2023
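
For readers unfamiliar with ABX-style evaluation, the following is a simplified sketch of the general idea: given tokens A and X from the same category and B from a different one, a good representation places X closer to A than to B. The sketch assumes one fixed-size vector per token and Euclidean distance; both are simplifications for illustration and not the benchmark's actual pipeline. The function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def abx_error_rate(triples, distance=euclidean):
    """Fraction of (a, b, x) triples in which x is not closer to a than to b.

    By convention here, x belongs to the same category as a.
    Ties count as half an error.
    """
    errors = 0.0
    for a, b, x in triples:
        d_ax, d_bx = distance(a, x), distance(b, x)
        if d_ax > d_bx:
            errors += 1.0
        elif d_ax == d_bx:
            errors += 0.5
    return errors / len(triples)


# Two toy triples: the first is discriminated correctly, the second is not.
triples = [
    ((0.0, 0.0), (1.0, 1.0), (0.1, 0.0)),
    ((0.0, 0.0), (1.0, 1.0), (0.9, 1.0)),
]
rate = abx_error_rate(triples)
```

A context-invariance variant would, under this reading of the abstract, draw A and X from different phonetic contexts, so that a low error rate requires the representation to abstract away from context.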

Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Jan 14, 2022
Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech, including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by singular value decomposition (SVD) of the speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN and end-to-end disordered speech recognition systems. Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation. Learning hidden unit contribution (LHUC) based speaker adaptation was further applied. The final speaker adapted system using the proposed spectral basis embedding features gave an overall WER of 25.6% on the UASpeech test set of 16 dysarthric speakers.

* Proceedings of INTERSPEECH 2021
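
Since the abstract quotes both an absolute and a relative WER reduction, a short sketch of how the two relate may help. WER is computed here as word-level edit distance divided by reference length, which is the standard definition; the baseline value in the example is back-solved from the quoted 2.63% absolute and 8.6% relative figures and is an inference, not a number stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction: absolute reduction divided by the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Inferred baseline: absolute / relative = 2.63 / 0.086 (roughly 30.6% WER).
implied_baseline = 2.63 / 0.086
implied_new = implied_baseline - 2.63
```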

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Nov 02, 2018
Jason Li, Ravi Gadde, Boris Ginsburg, Vitaly Lavrukhin

Building an accurate automatic speech recognition (ASR) system requires a large dataset that contains many hours of labeled speech samples produced by a diverse set of speakers. The lack of such free, open datasets is one of the main issues preventing advancements in ASR research. To address this problem, we propose to augment a natural speech dataset with synthetic speech. We train very large end-to-end neural speech recognition models using the LibriSpeech dataset augmented with synthetic speech. These new models achieve state-of-the-art word error rate (WER) for character-level models without an external language model.

* Pre-print. Work in progress, 5 pages, 1 figure

Domain Specific Wav2vec 2.0 Fine-tuning For The SE&R 2022 Challenge

Jul 29, 2022
Alef Iury Siqueira Ferreira, Gustavo dos Reis Oliveira

This paper presents our efforts to build a robust ASR model for the shared task Automatic Speech Recognition for spontaneous and prepared speech & Speech Emotion Recognition in Portuguese (SE&R 2022). The goal of the challenge is to advance ASR research for the Portuguese language, considering prepared and spontaneous speech in different dialects. Our method consists of fine-tuning an ASR model in a domain-specific approach, applying gain normalization and selective noise insertion. The proposed method improved over the strong baseline provided on the test set in 3 of the 4 tracks available.

* Proceedings of the First Workshop on Automatic Speech Recognition for Spontaneous and Prepared Speech & Speech Emotion Recognition in Portuguese (SE&R 2022), co-located with PROPOR 2022
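
The two preprocessing steps named in the abstract, gain normalization and noise insertion, can be sketched generically as follows. This is an illustrative assumption about what such steps commonly look like (RMS-based gain scaling and Gaussian noise at a target signal-to-noise ratio); the abstract does not specify the authors' exact procedure or how the noise is applied selectively. All function names and default values are hypothetical.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_gain(samples, target_rms=0.1):
    """Scale the signal so that its RMS level equals target_rms."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silent input: nothing to scale
    scale = target_rms / current
    return [s * scale for s in samples]


def add_noise(samples, snr_db, rng):
    """Add Gaussian noise whose expected RMS sits snr_db below the signal RMS."""
    noise_rms = rms(samples) / (10 ** (snr_db / 20))
    return [s + rng.gauss(0.0, noise_rms) for s in samples]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
clean = [0.5, -0.5, 0.5, -0.5]
leveled = normalize_gain(clean, target_rms=0.1)
noisy = add_noise(leveled, snr_db=20.0, rng=rng)
```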

BUT Opensat 2019 Speech Recognition System

Jan 30, 2020
Martin Karafiát, Murali Karthick Baskar, Igor Szöke, Hari Krishna Vydana, Karel Veselý, Jan "Honza" Černocký

The paper describes the BUT Automatic Speech Recognition (ASR) systems submitted for the OpenSAT evaluations in two domain categories: low-resource languages and public safety communications. The first was challenging due to the lack of training data, so various architectures and multilingual approaches were employed. Their combination led to superior performance. The second domain was challenging due to recording in extreme conditions such as a specific channel, speakers under stress and high levels of noise. Data augmentation was necessary to reach reasonably good performance.

* REJECTED in ICASSP 2020

Learning to Rank Microphones for Distant Speech Recognition

Apr 13, 2021
Samuele Cornell, Alessio Brutti, Marco Matassoni, Stefano Squartini

Fully exploiting ad-hoc microphone networks for distant speech recognition is still an open issue. Empirical evidence shows that being able to select the best microphone leads to significant improvements in recognition without any additional effort on front-end processing. Current channel selection techniques rely on signal-, decoder- or posterior-based features. Signal-based features are inexpensive to compute but do not always correlate with recognition performance. Decoder- and posterior-based features, by contrast, exhibit better correlation but require substantial computational resources. In this work, we tackle the channel selection problem by proposing MicRank, a learning-to-rank framework where a neural network is trained to rank the available channels directly using the recognition performance on the training set. The proposed approach is agnostic with respect to the array geometry and type of recognition back-end. We investigate different learning-to-rank strategies using a purpose-built synthetic dataset and the CHiME-6 data. Results show that the proposed approach is able to considerably improve over previous selection techniques, reaching comparable and in some instances better performance than oracle signal-based measures.
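
To make the learning-to-rank idea concrete, here is a minimal sketch of one common strategy: a pairwise logistic (RankNet-style) loss in which the channel with the lower word error rate should receive the higher score. The abstract says several ranking strategies were investigated but does not say which; this particular loss, and the names `pairwise_ranking_loss` and `select_channel`, are assumptions for illustration only.

```python
import math


def pairwise_ranking_loss(scores, wers):
    """Average logistic loss over channel pairs ordered by their WER.

    For every pair where channel i has a lower WER than channel j,
    the loss is small when scores[i] is well above scores[j].
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if wers[i] < wers[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                pairs += 1
    return total / pairs if pairs else 0.0


def select_channel(scores):
    """Index of the channel with the highest predicted score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example with three channels and their per-channel WERs.
wers = [0.10, 0.20, 0.30]
good_scores = [2.0, 0.0, -2.0]  # agrees with the WER ordering
bad_scores = [-2.0, 0.0, 2.0]   # reversed ordering
loss_good = pairwise_ranking_loss(good_scores, wers)
loss_bad = pairwise_ranking_loss(bad_scores, wers)
```

At inference time no WER is available, so only the network's scores would be used, as in `select_channel`.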

Fast End-to-End Speech Recognition via Non-Autoregressive Models and Cross-Modal Knowledge Transferring from BERT

Feb 15, 2021
Ye Bai, Jiangyan Yi, Jianhua Tao, Zhengkun Tian, Zhengqi Wen, Shuai Zhang

Attention-based encoder-decoder (AED) models have achieved promising performance in speech recognition. However, because the decoder predicts text tokens (such as characters or words) in an autoregressive manner, it is difficult for an AED model to predict all tokens in parallel. This makes the inference speed relatively slow. We believe that because the encoder already captures the whole speech utterance, which implicitly contains the token-level relationships, we can predict a token without explicit autoregressive language modeling. When the prediction of a token does not rely on other tokens, the parallel prediction of all tokens in the sequence is realizable. Based on this idea, we propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once). The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS). The three modules are based on basic attention blocks. The encoder extracts high-level representations from the speech. The PDS uses positional encodings corresponding to tokens to convert the acoustic representations into token-level representations. The decoder further captures token-level relationships with the self-attention mechanism. Finally, the probability distribution over the vocabulary is computed for each token position. Therefore, speech recognition is re-formulated as a position-wise classification problem. Further, we propose a cross-modal transfer learning method to refine semantics from the large-scale pre-trained language model BERT to improve performance.

* 14 pages, 7 figures
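
The final step described in the abstract, position-wise classification, can be sketched as an independent argmax at every token position. The sketch below covers only that step and assumes the per-position probability distributions are already available; the `<eos>` truncation convention, the toy vocabulary, and the function names are hypothetical and not taken from the paper.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=values.__getitem__)


def positionwise_decode(position_probs, vocab, eos="<eos>"):
    """Pick the most likely token at each position independently.

    No position depends on another position's choice, which is what makes
    parallel prediction possible. Truncating at the first end marker is an
    assumed convention for this sketch.
    """
    best_tokens = [vocab[argmax(probs)] for probs in position_probs]
    if eos in best_tokens:
        best_tokens = best_tokens[: best_tokens.index(eos)]
    return best_tokens


# Toy example: four positions over a three-symbol vocabulary.
vocab = ["<eos>", "h", "i"]
probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
    [0.6, 0.2, 0.2],
]
decoded = positionwise_decode(probs, vocab)
```

An autoregressive decoder would instead need the token chosen at position t before it could score position t+1, which is the sequential bottleneck the abstract refers to.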
