"speech recognition": models, code, and papers

Explaining the Attention Mechanism of End-to-End Speech Recognition Using Decision Trees

Oct 08, 2021
Yuanchao Wang, Wenji Du, Chenghao Cai, Yanyan Xu

The attention mechanism has largely improved the performance of end-to-end speech recognition systems. However, the underlying behaviour of attention is not yet clear. In this study, we use decision trees to explain how the attention mechanism impacts itself in speech recognition. The results indicate that attention levels are largely impacted by their previous states rather than the encoder and decoder patterns. Additionally, the default attention mechanism seems to put more weight on closer states, but performs poorly at modelling long-term dependencies of attention states.

* 10 pages, 5 figures

Joint unsupervised and supervised learning for context-aware language identification

Mar 29, 2023
Jinseok Park, Hyung Yong Kim, Jihwan Park, Byeong-Yeol Kim, Shukjae Choi, Yunkyu Lim

Language identification (LID) recognizes the language of a spoken utterance automatically. According to recent studies, LID models trained with an automatic speech recognition (ASR) task perform better than those trained with an LID task only. However, we need additional text labels to train the model to recognize speech, and acquiring the text labels is costly. To overcome this problem, we propose context-aware language identification using a combination of unsupervised and supervised learning without any text labels. The proposed method learns the context of speech through masked language modeling (MLM) loss and simultaneously trains to determine the language of the utterance with supervised learning loss. The proposed joint learning was found to reduce the error rate by 15.6% compared to the same structure model trained by supervised-only learning on a subset of the VoxLingua107 dataset consisting of sub-three-second utterances in 11 languages.

* Accepted by ICASSP 2023
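
The joint objective described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the two losses are combined, so the weighted sum, the `mlm_weight` parameter, the discrete MLM targets, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def joint_loss(lid_logits, lid_label, mlm_logits, mlm_targets, masked, mlm_weight=1.0):
    """Supervised LID loss plus an MLM loss averaged over masked frames.

    Assumption: the two terms are combined as a weighted sum.
    `mlm_logits[t]` are per-frame logits over hypothetical discrete targets,
    and `masked[t]` is True where frame t was masked in the input.
    """
    lid = cross_entropy(lid_logits, lid_label)
    masked_idx = [t for t, is_masked in enumerate(masked) if is_masked]
    mlm = sum(cross_entropy(mlm_logits[t], mlm_targets[t]) for t in masked_idx)
    mlm /= max(len(masked_idx), 1)  # avoid division by zero when nothing is masked
    return lid + mlm_weight * mlm

# Toy usage: 2 languages, 2 frames, only the first frame masked.
loss = joint_loss([0.0, 0.0], 0, [[0.0, 0.0], [0.0, 0.0]], [0, 1], [True, False])
```

Only the MLM term needs no labels; the LID term still requires a language label per utterance, which is consistent with the abstract's claim that no text labels are used.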

Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition

Oct 11, 2021
Yuchen Hu, Nana Hou, Chen Chen, Eng Siong Chng

Speech enhancement (SE) aims to suppress the additive noise from a noisy speech signal to improve the speech's perceptual quality and intelligibility. However, the over-suppression phenomenon in the enhanced speech might degrade the performance of the downstream automatic speech recognition (ASR) task due to the missing latent information. To alleviate this problem, we propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition to learn complementary information from the enhanced feature and original noisy feature. Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline on the RATS Channel-A corpus. Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.

* 5 pages, 7 figures, Submitted to ICASSP 2022
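
Several abstracts on this page report WER reductions, some absolute and some relative. The sketch below shows the standard word-level edit-distance definition of WER and the difference between the two kinds of reduction. The function names and the example numbers (20.0 and 15.9) are hypothetical and are not taken from any of the papers.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # edit distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def absolute_reduction(baseline_wer, new_wer):
    """Difference in WER, in the same units as the inputs (e.g. percentage points)."""
    return baseline_wer - new_wer

def relative_reduction(baseline_wer, new_wer):
    """Reduction expressed as a fraction of the baseline WER."""
    return (baseline_wer - new_wer) / baseline_wer

wer = word_error_rate("the cat sat on mat", "the cat sit on")  # 1 substitution + 1 deletion
rel = relative_reduction(20.0, 15.9)  # hypothetical baseline and new WER, in percent
```

With these hypothetical numbers, a 4.1-point absolute reduction corresponds to a relative reduction of about 20.5%, so the two figures are not interchangeable when comparing papers.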

Deformable TDNN with adaptive receptive fields for speech recognition

Apr 30, 2021
Keyu An, Yi Zhang, Zhijian Ou

Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based hybrid speech recognition systems and recent end-to-end systems. Nevertheless, the receptive fields of TDNNs are limited and fixed, which is not desirable for tasks like speech recognition, where the temporal dynamics of speech are varied and affected by many factors. This paper proposes to use deformable TDNNs for adaptive temporal dynamics modeling in end-to-end speech recognition. Inspired by deformable ConvNets, deformable TDNNs augment the temporal sampling locations with additional offsets and learn the offsets automatically based on the ASR criterion, without additional supervision. Experiments show that deformable TDNNs obtain state-of-the-art results on WSJ benchmarks (1.42%/3.45% WER on WSJ eval92/dev93 respectively), outperforming standard TDNNs significantly. Furthermore, we propose a latency control mechanism for deformable TDNNs, which enables deformable TDNNs to do streaming ASR without accuracy degradation.

* 5 pages. submitted to Interspeech 2021
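
The idea of augmenting temporal sampling locations with offsets can be sketched as follows. This is a simplified illustration that assumes scalar features per frame, linear interpolation at fractional positions (as in deformable ConvNets), clamping at the sequence edges, and offsets supplied by the caller. In the paper the offsets are learned from the ASR criterion; how they are predicted is not shown here, and all names are hypothetical.

```python
def sample_linear(x, pos):
    """Linearly interpolate sequence x at fractional position pos (clamped to the valid range)."""
    pos = min(max(pos, 0.0), len(x) - 1.0)
    lo = int(pos)
    hi = min(lo + 1, len(x) - 1)
    frac = pos - lo
    return (1.0 - frac) * x[lo] + frac * x[hi]

def deformable_tdnn_1d(x, weights, context, offsets):
    """Compute out[t] = sum_k weights[k] * x[t + context[k] + offsets[t][k]].

    `context` holds the fixed TDNN taps (e.g. [-1, 0, 1]); `offsets[t][k]` is the
    additional, possibly fractional, shift for tap k at frame t. With all offsets
    equal to zero this reduces to a standard TDNN layer with edge clamping.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, (w, c) in enumerate(zip(weights, context)):
            acc += w * sample_linear(x, t + c + offsets[t][k])
        out.append(acc)
    return out

x = [0.0, 1.0, 2.0, 3.0]
shifted = deformable_tdnn_1d(x, [1.0], [0], [[0.5]] * 4)  # single tap, half-frame offset
plain = deformable_tdnn_1d(x, [1.0, 1.0, 1.0], [-1, 0, 1], [[0.0, 0.0, 0.0]] * 4)
```

Because interpolation is piecewise linear in the offset, gradients can flow back to whatever module predicts the offsets, which is what allows them to be trained without additional supervision.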

Tandem Multitask Training of Speaker Diarisation and Speech Recognition for Meeting Transcription

Jul 08, 2022
Xianrui Zheng, Chao Zhang, Philip C. Woodland

Self-supervised-learning-based pre-trained models for speech data, such as Wav2Vec 2.0 (W2V2), have become the backbone of many speech tasks. In this paper, to achieve speaker diarisation and speech recognition using a single model, a tandem multitask training (TMT) method is proposed to fine-tune W2V2. For speaker diarisation, the tasks of voice activity detection (VAD) and speaker classification (SC) are required, and connectionist temporal classification (CTC) is used for ASR. The multitask framework implements VAD, SC, and ASR using an early layer, middle layer, and late layer of W2V2, which coincides with the order of segmenting the audio with VAD, clustering the segments based on speaker embeddings, and transcribing each segment with ASR. Experimental results on the augmented multi-party (AMI) dataset showed that using different W2V2 layers for VAD, SC, and ASR from the earlier to later layers for TMT not only saves computational cost, but also reduces diarisation error rates (DERs). Joint fine-tuning of VAD, SC, and ASR yielded 16%/17% relative reductions of DER with manual/automatic segmentation respectively, and consistent reductions in speaker attributed word error rate, compared to the baseline with separately fine-tuned models.

* To appear in Interspeech 2022

Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

Apr 13, 2023
Hainan Xu, Fei Jia, Somshubra Majumdar, He Huang, Shinji Watanabe, Boris Ginsburg

This paper introduces a novel Token-and-Duration Transducer (TDT) architecture for sequence-to-sequence tasks. TDT extends conventional RNN-Transducer architectures by jointly predicting both a token and its duration, i.e. the number of input frames covered by the emitted token. This is achieved by using a joint network with two outputs which are independently normalized to generate distributions over tokens and durations. During inference, TDT models can skip input frames guided by the predicted duration output, which makes them significantly faster than conventional Transducers which process the encoder output frame by frame. TDT models achieve both better accuracy and significantly faster inference than conventional Transducers on different sequence transduction tasks. TDT models for Speech Recognition achieve better accuracy and up to 2.82X faster inference than RNN-Transducers. TDT models for Speech Translation achieve an absolute gain of over 1 BLEU on the MUST-C test compared with conventional Transducers, and its inference is 2.27X faster. In Speech Intent Classification and Slot Filling tasks, TDT models improve the intent accuracy up to over 1% (absolute) over conventional Transducers, while running up to 1.28X faster.
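
The frame-skipping inference described in this abstract can be sketched as a greedy decoding loop. This is a minimal illustration under stated assumptions: `joint` is a hypothetical callable standing in for the encoder, prediction network, and joint network, returning the argmax token and argmax duration at frame `t`; the `max_symbols_per_frame` guard and the forced advance on a zero-duration blank are assumptions added here so the loop always terminates.

```python
def tdt_greedy_decode(num_frames, joint, blank=0, max_symbols_per_frame=5):
    """Greedy TDT-style decoding: emit a token, then jump ahead by its predicted duration."""
    hyp = []
    t = 0
    symbols_at_t = 0  # tokens emitted without advancing the frame index
    while t < num_frames:
        token, duration = joint(t, hyp)
        if token != blank:
            hyp.append(token)
            symbols_at_t += 1
        # Assumption: force progress on a zero-duration blank or after too many
        # emissions at the same frame.
        if duration == 0 and (token == blank or symbols_at_t >= max_symbols_per_frame):
            duration = 1
        if duration > 0:
            t += duration  # skip the frames covered by this prediction
            symbols_at_t = 0
    return hyp

# Toy joint network: outputs are looked up by (frame index, tokens emitted so far).
def toy_joint(t, hyp):
    table = {(0, 0): (7, 2), (2, 1): (0, 1), (3, 1): (8, 0), (3, 2): (9, 3)}
    return table[(t, len(hyp))]

decoded = tdt_greedy_decode(5, toy_joint)
```

In this toy run the decoder never evaluates frames 1 and 4, which illustrates where the speed-up comes from: a conventional Transducer advances at most one frame per joint-network call.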

Towards Better Domain Adaptation for Self-supervised Models: A Case Study of Child ASR

Apr 28, 2023
Ruchao Fan, Yunzheng Zhu, Jinhan Wang, Abeer Alwan

Recently, self-supervised learning (SSL) from unlabelled speech data has gained increased attention in the automatic speech recognition (ASR) community. Typical SSL methods include autoregressive predictive coding (APC), Wav2vec2.0, and hidden unit BERT (HuBERT). However, SSL models are biased to the pretraining data. When SSL models are finetuned with data from another domain, domain shifting occurs and might cause limited knowledge transfer for downstream tasks. In this paper, we propose a novel framework, domain responsible adaptation and finetuning (DRAFT), to reduce domain shifting in pretrained speech models, and evaluate it for a causal and non-causal transformer. For the causal transformer, an extension of APC (E-APC) is proposed to learn richer information from unlabelled data by using multiple temporally-shifted sequences to perform prediction. For the non-causal transformer, various solutions for using the bidirectional APC (Bi-APC) are investigated. In addition, the DRAFT framework is examined for Wav2vec2.0 and HuBERT methods, which use non-causal transformers as the backbone. The experiments are conducted on child ASR (using the OGI and MyST databases) using SSL models trained with unlabelled adult speech data from Librispeech. The relative WER improvements of up to 19.7% on the two child tasks are observed when compared to the pretrained models without adaptation. With the proposed methods (E-APC and DRAFT), the relative WER improvements are even larger (30% and 19% on the OGI and MyST data, respectively) when compared to the models without using pretraining methods.

* Published in IEEE Journal of Selected Topics in Signal Processing, ICASSP Journal Poster Presentation

Reproducibility is Nothing without Correctness: The Importance of Testing Code in NLP

Mar 31, 2023
Sara Papi, Marco Gaido, Andrea Pilzer, Matteo Negri

Despite its pivotal role in research experiments, code correctness is often presumed only on the basis of the perceived quality of the results. This comes with the risk of erroneous outcomes and potentially misleading findings. To address this issue, we posit that the current focus on result reproducibility should go hand in hand with the emphasis on coding best practices. We bolster our call to the NLP community by presenting a case study, in which we identify (and correct) three bugs in widely used open-source implementations of the state-of-the-art Conformer architecture. Through comparative experiments on automatic speech recognition and translation in various language settings, we demonstrate that the existence of bugs does not prevent the achievement of good and reproducible results and can lead to incorrect conclusions that potentially misguide future research. In response to this, this study is a call to action toward the adoption of coding best practices aimed at fostering correctness and improving the quality of the developed software.

Mask scalar prediction for improving robust automatic speech recognition

Apr 26, 2022
Arun Narayanan, James Walker, Sankaran Panchapagesan, Nathan Howard, Yuma Koizumi

Using neural network based acoustic frontends for improving robustness of streaming automatic speech recognition (ASR) systems is challenging because of the causality constraints and the resulting distortion that the frontend processing introduces in speech. Time-frequency masking based approaches have been shown to work well, but they need additional hyper-parameters to scale the mask to limit speech distortion. Such mask scalars are typically hand-tuned and chosen conservatively. In this work, we present a technique to predict mask scalars using an ASR-based loss in an end-to-end fashion, with minimal increase in the overall model size and complexity. We evaluate the approach on two robust ASR tasks: multichannel enhancement in the presence of speech and non-speech noise, and acoustic echo cancellation (AEC). Results show that the presented algorithm consistently improves word error rate (WER) without the need for any additional tuning over strong baselines that use hand-tuned hyper-parameters: up to 16% for multichannel enhancement in noisy conditions, and up to 7% for AEC.

* Submitted to Interspeech 2022

Unsupervised Speech Enhancement with speech recognition embedding and disentanglement losses

Nov 16, 2021
Viet Anh Trinh, Sebastian Braun

Speech enhancement has recently achieved great success with various deep learning methods. However, most conventional speech enhancement systems are trained with supervised methods that pose two significant challenges. First, a majority of training datasets for speech enhancement systems are synthetic. When mixing clean speech and noisy corpora to create the synthetic datasets, domain mismatches occur between synthetic and real-world recordings of noisy speech or audio. Second, there is a trade-off between increasing speech enhancement performance and degrading speech recognition (ASR) performance. Thus, we propose an unsupervised loss function to tackle these two problems. Our function is developed by extending the MixIT loss function with speech recognition embedding and disentanglement loss. Our results show that the proposed function effectively improves the speech enhancement performance compared to a baseline trained in a supervised way on the noisy VoxCeleb dataset. While fully unsupervised training is unable to exceed the corresponding baseline, with joint super- and unsupervised training, the system is able to achieve similar speech quality and better ASR performance than the best supervised baseline.

* Submitted to ICASSP 2022
