"speech recognition": models, code, and papers

Korean Tokenization for Beam Search Rescoring in Speech Recognition

Mar 28, 2022
Kyuhong Shim, Hyewon Bae, Wonyong Sung

The performance of automatic speech recognition (ASR) models can be greatly improved by proper beam-search decoding with an external language model (LM). There has been increasing interest in Korean speech recognition, but few studies have focused on the decoding procedure. In this paper, we propose a Korean tokenization method for the neural network-based LM used for Korean ASR. Although the common approach is to use the same tokenization method for the external LM as for the ASR model, we show that it may not be the best choice for Korean. We propose a new tokenization method that inserts a special token, SkipTC, when there is no trailing consonant in a Korean syllable. By utilizing the proposed SkipTC token, the input sequence for the LM becomes very regularly patterned so that the LM can better learn the linguistic characteristics. Our experiments show that the proposed approach achieves a lower word error rate than the same LM without SkipTC. In addition, we are the first to report the ASR performance for the recently introduced large-scale 7,600h Korean speech dataset.
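The SkipTC idea can be illustrated with a minimal sketch. It assumes the LM input is a sequence of jamo (leading consonant, vowel, trailing consonant) obtained from standard Unicode Hangul decomposition. The abstract does not specify the authors' token inventory, so the `<SkipTC>` spelling and the function name below are hypothetical.

```python
# Hypothetical sketch of SkipTC-style tokenization (not the authors' code).
# Precomposed Hangul syllables are laid out in Unicode as
#   0xAC00 + (lead * 21 + vowel) * 28 + tail,  where tail == 0 means "no trailing consonant".
HANGUL_BASE = 0xAC00
NUM_LEADS, NUM_VOWELS, NUM_TAILS = 19, 21, 28

LEADS = [chr(0x1100 + i) for i in range(NUM_LEADS)]
VOWELS = [chr(0x1161 + i) for i in range(NUM_VOWELS)]
TAILS = [None] + [chr(0x11A8 + i) for i in range(NUM_TAILS - 1)]

SKIP_TC = "<SkipTC>"  # assumed spelling of the special token


def tokenize_with_skiptc(text):
    """Split each Hangul syllable into exactly three tokens."""
    tokens = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < NUM_LEADS * NUM_VOWELS * NUM_TAILS:
            lead, rest = divmod(offset, NUM_VOWELS * NUM_TAILS)
            vowel, tail = divmod(rest, NUM_TAILS)
            tokens.append(LEADS[lead])
            tokens.append(VOWELS[vowel])
            # Fill the empty trailing-consonant slot so every syllable has length 3.
            tokens.append(TAILS[tail] if tail else SKIP_TC)
        else:
            tokens.append(ch)  # non-Hangul characters pass through unchanged
    return tokens
```

Under this assumption, every syllable contributes the same lead/vowel/trailing pattern, which is the regularity the abstract describes.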

* Submitted to INTERSPEECH 2022 

Learning Speech Emotion Representations in the Quaternion Domain

Apr 05, 2022
Eric Guizzo, Tillman Weyde, Simone Scardapane, Danilo Comminiello

The modeling of human emotion expression in speech signals is an important, yet challenging task. The high resource demand of speech emotion recognition models, combined with the general scarcity of emotion-labelled data, are obstacles to the development and application of effective solutions in this field. In this paper, we present an approach to jointly circumvent these difficulties. Our method, named RH-emo, is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monaural spectrograms, enabling the use of quaternion-valued networks for speech emotion recognition tasks. RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder. On the one hand, the classifier allows each latent axis of the embeddings to be optimized for the classification of a specific emotion-related characteristic: valence, arousal, dominance and overall emotion. On the other hand, the quaternion reconstruction enables the latent dimension to develop intra-channel correlations that are required for an effective representation as a quaternion entity. We test our approach on speech emotion recognition tasks using four popular datasets: Iemocap, Ravdess, EmoDb and Tess, comparing the performance of three well-established real-valued CNN architectures (AlexNet, ResNet-50, VGG) and their quaternion-valued equivalents fed with the embeddings created with RH-emo. We obtain a consistent improvement in the test accuracy for all datasets, while drastically reducing the resource demand of the models. Moreover, we performed additional experiments and ablation studies that confirm the effectiveness of our approach. The RH-emo repository is available at:

* Paper Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing 

Self-Attention Networks for Connectionist Temporal Classification in Speech Recognition

Jan 22, 2019
Julian Salazar, Katrin Kirchhoff, Zhiheng Huang

Self-attention has demonstrated great success in sequence-to-sequence tasks in natural language processing, with preliminary work applying it to end-to-end encoder-decoder approaches in speech recognition. Separately, connectionist temporal classification (CTC) has matured as an alignment-free strategy for monotonic sequence transduction, either by itself or in various multitask and decoding frameworks. We propose SAN-CTC, a deep, fully self-attentional network for CTC, and show it is tractable and competitive for speech recognition. On the Wall Street Journal and LibriSpeech datasets, SAN-CTC trains quickly and outperforms existing CTC models and most encoder-decoder models, attaining 4.7% CER in 1 day and 2.8% CER in 1 week respectively, using the same architecture and one GPU. We motivate the architecture for speech, evaluate position and downsampling approaches, and explore how the label alphabet affects attention head and performance outcomes.
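As a small illustration of why CTC is alignment-free, the sketch below shows the standard CTC collapse rule applied to per-frame greedy labels: merge consecutive repeats, then drop blanks. The function name and the choice of `0` as the blank index are assumptions for illustration. This is not SAN-CTC itself, only the decoding rule that any CTC model relies on.

```python
def ctc_collapse(frame_labels, blank=0):
    """Map a per-frame label sequence to an output sequence (CTC rule)."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame and is not blank.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output
```

A blank between two identical labels keeps both, so `[1, 0, 1]` collapses to `[1, 1]` while `[1, 1]` collapses to `[1]`.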

* Under review at ICASSP 2019 

Automatic recognition of suprasegmentals in speech

Aug 04, 2021
Jiahong Yuan, Neville Ryant, Xingyu Cai, Kenneth Church, Mark Liberman

This study reports our efforts to improve automatic recognition of suprasegmentals by fine-tuning wav2vec 2.0 with CTC, a method that has been successful in automatic speech recognition. We demonstrate that the method can improve the state-of-the-art on automatic recognition of syllables, tones, and pitch accents. Utilizing segmental information, by employing tonal finals or tonal syllables as recognition units, can significantly improve Mandarin tone recognition. Language models are helpful when tonal syllables are used as recognition units, but not helpful when tones are recognition units. Finally, Mandarin tone recognition can benefit from English phoneme recognition by combining the two tasks in fine-tuning wav2vec 2.0.

* submitted to ASRU 2021 

Small-footprint Deep Neural Networks with Highway Connections for Speech Recognition

Jun 14, 2017
Liang Lu, Steve Renals

For speech recognition, deep neural networks (DNNs) have significantly improved the recognition accuracy in most benchmark datasets and application domains. However, compared to the conventional Gaussian mixture models, DNN-based acoustic models usually have a much larger number of model parameters, making them challenging to apply on resource-constrained platforms, e.g., mobile devices. In this paper, we study the application of the recently proposed highway network to train small-footprint DNNs, which are thinner and deeper, and have a significantly smaller number of model parameters compared to conventional DNNs. We investigated this approach on the AMI meeting speech transcription corpus, which has around 70 hours of audio data. The highway neural networks consistently outperformed their plain DNN counterparts, and the number of model parameters can be reduced significantly without sacrificing the recognition accuracy.
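For readers unfamiliar with highway connections, here is a minimal sketch of a single highway layer in its common coupled-gate form, y = T(x) * H(x) + (1 - T(x)) * x. The coupled carry gate, the tanh nonlinearity, and all names are assumptions for illustration. The paper's exact gating configuration is not stated in the abstract.

```python
import math


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer on a plain-list vector x (input and output sizes match)."""

    def affine(weights, bias):
        # Row-wise dot product plus bias: W x + b.
        return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

    hidden = [math.tanh(v) for v in affine(w_h, b_h)]              # H(x): candidate transform
    gate = [1.0 / (1.0 + math.exp(-v)) for v in affine(w_t, b_t)]  # T(x): transform gate
    # Blend the transformed signal with the untouched input.
    return [t * h + (1.0 - t) * xi for t, h, xi in zip(gate, hidden, x)]
```

With a strongly negative gate bias the layer passes its input through almost unchanged, which is what makes thin, deep stacks of such layers easier to train.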

* 5 pages, 3 figures, fixed typo, accepted by Interspeech 2016 

Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech

Dec 10, 2021
Rohit Paturi, Sundararajan Srinivasan, Katrin Kirchhoff

Many of the recent advances in speech separation are primarily aimed at synthetic mixtures of short audio utterances with high degrees of overlap. These datasets differ significantly from real conversational data, and hence the models trained and evaluated on them do not generalize to real conversational scenarios. Another issue with using most of these models for long-form speech is the nondeterministic ordering of separated speech segments, due to either unsupervised clustering for time-frequency masks or the Permutation Invariant Training (PIT) loss. This leads to difficulty in accurately stitching homogeneous speaker segments for downstream tasks like Automatic Speech Recognition (ASR). In this paper, we propose a speaker-conditioned separator trained on speaker embeddings extracted directly from the mixed signal. We train this model using a directed loss which regulates the order of the separated segments. With this model, we achieve significant improvements in word error rate (WER) for real conversational data without the need for an additional re-stitching step.

Language model fusion for streaming end to end speech recognition

Apr 09, 2021
Rodrigo Cabrera, Xiaofeng Liu, Mohammadreza Ghodsi, Zebulun Matteson, Eugene Weinstein, Anjuli Kannan

Streaming processing of speech audio is required for many contemporary practical speech recognition tasks. Even with the large corpora of manually transcribed speech data available today, it is impossible for such corpora to adequately cover the long tail of linguistic content that's important for tasks such as open-ended dictation and voice search. We seek to address both the streaming and the tail recognition challenges by using a language model (LM) trained on unpaired text data to enhance the end-to-end (E2E) model. We extend shallow fusion and cold fusion approaches to the streaming Recurrent Neural Network Transducer (RNNT), and also propose two new competitive fusion approaches that further enhance the RNNT architecture. Our results on multiple languages with varying training set sizes show that these fusion methods improve streaming RNNT performance by introducing extra linguistic features. Cold fusion works consistently better on streaming RNNT, with up to an 8.5% WER improvement.
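Shallow fusion, the simplest of the approaches mentioned above, combines scores log-linearly: score = log P_ASR(y | x) + lambda * log P_LM(y). The sketch below applies it to a finished list of hypotheses. The function name, the tuple layout, the default weight, and the example scores are all hypothetical, and a real streaming system would apply the same sum at each beam-search step rather than after decoding.

```python
def shallow_fusion_best(hypotheses, lm_weight=0.3):
    """Pick the best hypothesis under ASR log-prob + lm_weight * LM log-prob.

    hypotheses: iterable of (text, asr_log_prob, lm_log_prob) tuples.
    """
    return max(hypotheses, key=lambda hyp: hyp[1] + lm_weight * hyp[2])


# Made-up scores for illustration only.
candidates = [
    ("recognize speech", -4.0, -6.0),
    ("wreck a nice beach", -3.8, -12.0),
]
best_text = shallow_fusion_best(candidates)[0]
```

With these made-up numbers, the acoustically preferred second candidate loses once the LM term is added. Setting `lm_weight=0` recovers the ASR-only ranking.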

* 5 pages 

Applying wav2vec2.0 to Speech Recognition in various low-resource languages

Dec 22, 2020
Cheng Yi, Jianzhong Wang, Ning Cheng, Shiyu Zhou, Bo Xu

Several domains have their own widely used feature extractors, such as ResNet, BERT, and GPT-x. These models are pre-trained on large amounts of unlabelled data by self-supervision and can be effectively applied to downstream tasks. In the speech domain, wav2vec2.0 has started to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the Librispeech corpus. However, this model has not been tested on real spoken scenarios or on languages other than English. To verify its universality across languages, we apply the released pre-trained models to low-resource speech recognition tasks in various spoken languages. We achieve more than 20% relative improvements in six languages compared with previous works. Among these languages, English improves up to 52.4%. Moreover, using coarse-grained modeling units, such as subword and character, achieves better results than the letter.

Thoughts on the potential to compensate a hearing loss in noise

Feb 24, 2021
Marc René Schädler

The effect of hearing impairment on speech perception was described by Plomp (1978) as a sum of a loss of class A, due to signal attenuation, and a loss of class D, due to signal distortion. While a loss of class A can be compensated by linear amplification, a loss of class D, which severely limits the benefit of hearing aids in noisy listening conditions, cannot. Many users of hearing aids keep complaining about the limited benefit of their devices in noisy environments. Recently, in an approach to model human speech recognition by means of a re-purposed automatic speech recognition system, the loss of class D was explained by introducing a level uncertainty which reduces the individual accuracy of spectro-temporal signal levels. Based on this finding, an implementation of a patented dynamic range manipulation scheme (PLATT) is proposed, which aims to mitigate the effect of increased level uncertainty on speech recognition in noise by expanding spectral modulation patterns in the range of 2 to 4 ERB. An objective evaluation of the benefit in speech recognition thresholds in noise using an ASR-based speech recognition model suggests that more than half of the class D loss due to an increased level uncertainty might be compensable.

* 26 pages, 22 figures, related code 