Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

"speech recognition": models, code, and papers

Adaptive Multi-Corpora Language Model Training for Speech Recognition

Nov 09, 2022
Yingyi Ma, Zhe Liu, Xuedong Zhang

Figure 1 for Adaptive Multi-Corpora Language Model Training for Speech Recognition

Figure 2 for Adaptive Multi-Corpora Language Model Training for Speech Recognition

Figure 3 for Adaptive Multi-Corpora Language Model Training for Speech Recognition

Figure 4 for Adaptive Multi-Corpora Language Model Training for Speech Recognition

Neural network language model (NNLM) plays an essential role in automatic speech recognition (ASR) systems, especially in adaptation tasks when text-only data is available. In practice, an NNLM is typically trained on a combination of data sampled from multiple corpora. Thus, the data sampling strategy is important to the adaptation performance. Most existing works focus on designing static sampling strategies. However, each corpus may show varying impacts at different NNLM training stages. In this paper, we introduce a novel adaptive multi-corpora training algorithm that dynamically learns and adjusts the sampling probability of each corpus along the training process. The algorithm is robust to corpora sizes and domain relevance. Compared with static sampling strategy baselines, the proposed approach yields remarkable improvement by achieving up to relative 7% and 9% word error rate (WER) reductions on in-domain and out-of-domain adaptation tasks, respectively.

Via

Access Paper or Ask Questions

Anytime, Anywhere: Human Arm Pose from Smartwatch Data for Ubiquitous Robot Control and Teleoperation

Jun 22, 2023
Fabian C Weigend, Shubham Sonawani, Michael Drolet, Heni Ben Amor

Figure 1 for Anytime, Anywhere: Human Arm Pose from Smartwatch Data for Ubiquitous Robot Control and Teleoperation

Figure 2 for Anytime, Anywhere: Human Arm Pose from Smartwatch Data for Ubiquitous Robot Control and Teleoperation

Figure 3 for Anytime, Anywhere: Human Arm Pose from Smartwatch Data for Ubiquitous Robot Control and Teleoperation

Figure 4 for Anytime, Anywhere: Human Arm Pose from Smartwatch Data for Ubiquitous Robot Control and Teleoperation

This work devises an optimized machine learning approach for human arm pose estimation from a single smartwatch. Our approach results in a distribution of possible wrist and elbow positions, which allows for a measure of uncertainty and the detection of multiple possible arm posture solutions, i.e., multimodal pose distributions. Combining estimated arm postures with speech recognition, we turn the smartwatch into a ubiquitous, low-cost and versatile robot control interface. We demonstrate in two use-cases that this intuitive control interface enables users to swiftly intervene in robot behavior, to temporarily adjust their goal, or to train completely new control policies by imitation. Extensive experiments show that the approach results in a 40% reduction in prediction error over the current state-of-the-art and achieves a mean error of 2.56cm for wrist and elbow positions.

* 8 pages, 10, figures, 1 table, conference: IROS

Via

Access Paper or Ask Questions

Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Jun 27, 2023
Haitao Tang, Yu Fu, Lei Sun, Jiabin Xue, Dan Liu, Yongchao Li, Zhiqiang Ma, Minghui Wu, Jia Pan, Genshun Wan, Ming'en Zhao

Figure 1 for Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Figure 2 for Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Figure 3 for Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Figure 4 for Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Transducer is one of the mainstream frameworks for streaming speech recognition. There is a performance gap between the streaming and non-streaming transducer models due to limited context. To reduce this gap, an effective way is to ensure that their hidden and output distributions are consistent, which can be achieved by hierarchical knowledge distillation. However, it is difficult to ensure the distribution consistency simultaneously because the learning of the output distribution depends on the hidden one. In this paper, we propose an adaptive two-stage knowledge distillation method consisting of hidden layer learning and output layer learning. In the former stage, we learn hidden representation with full context by applying mean square error loss function. In the latter stage, we design a power transformation based adaptive smoothness method to learn stable output distribution. It achieved 19\% relative reduction in word error rate, and a faster response for the first token compared with the original streaming model in LibriSpeech corpus.

Via

Access Paper or Ask Questions

Curriculum optimization for low-resource speech recognition

Feb 17, 2022
Anastasia Kuznetsova, Anurag Kumar, Jennifer Drexler Fox, Francis Tyers

Figure 1 for Curriculum optimization for low-resource speech recognition

Figure 2 for Curriculum optimization for low-resource speech recognition

Figure 3 for Curriculum optimization for low-resource speech recognition

Figure 4 for Curriculum optimization for low-resource speech recognition

Modern end-to-end speech recognition models show astonishing results in transcribing audio signals into written text. However, conventional data feeding pipelines may be sub-optimal for low-resource speech recognition, which still remains a challenging task. We propose an automated curriculum learning approach to optimize the sequence of training examples based on both the progress of the model while training and prior knowledge about the difficulty of the training examples. We introduce a new difficulty measure called compression ratio that can be used as a scoring function for raw audio in various noise conditions. The proposed method improves speech recognition Word Error Rate performance by up to 33% relative over the baseline system

Via

Access Paper or Ask Questions

RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

May 24, 2023
David Qiu, David Rim, Shaojin Ding, Oleg Rybakov, Yanzhang He

Figure 1 for RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

Figure 2 for RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

Figure 3 for RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

Figure 4 for RAND: Robustness Aware Norm Decay For Quantized Seq2seq Models

With the rapid increase in the size of neural networks, model compression has become an important area of research. Quantization is an effective technique at decreasing the model size, memory access, and compute load of large models. Despite recent advances in quantization aware training (QAT) technique, most papers present evaluations that are focused on computer vision tasks, which have different training dynamics compared to sequence tasks. In this paper, we first benchmark the impact of popular techniques such as straight through estimator, pseudo-quantization noise, learnable scale parameter, clipping, etc. on 4-bit seq2seq models across a suite of speech recognition datasets ranging from 1,000 hours to 1 million hours, as well as one machine translation dataset to illustrate its applicability outside of speech. Through the experiments, we report that noise based QAT suffers when there is insufficient regularization signal flowing back to the quantization scale. We propose low complexity changes to the QAT process to improve model accuracy (outperforming popular learnable scale and clipping methods). With the improved accuracy, it opens up the possibility to exploit some of the other benefits of noise based QAT: 1) training a single model that performs well in mixed precision mode and 2) improved generalization on long form speech recognition.

Via

Access Paper or Ask Questions

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

Apr 01, 2022
Xuankai Chang, Takashi Maekaku, Yuya Fujita, Shinji Watanabe

Figure 1 for End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

Figure 2 for End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

Figure 3 for End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

Figure 4 for End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation

This work presents our end-to-end (E2E) automatic speech recognition (ASR) model targetting at robust speech recognition, called Integraded speech Recognition with enhanced speech Input for Self-supervised learning representation (IRIS). Compared with conventional E2E ASR models, the proposed E2E model integrates two important modules including a speech enhancement (SE) module and a self-supervised learning representation (SSLR) module. The SE module enhances the noisy speech. Then the SSLR module extracts features from enhanced speech to be used for speech recognition (ASR). To train the proposed model, we establish an efficient learning scheme. Evaluation results on the monaural CHiME-4 task show that the IRIS model achieves the best performance reported in the literature for the single-channel CHiME-4 benchmark (2.0% for the real development and 3.9% for the real test) thanks to the powerful pre-trained SSLR module and the fine-tuned SE module.

* Submitted to Interspeech 2022

Via

Access Paper or Ask Questions

Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

May 16, 2022
Alëna Aksënova, Zhehuai Chen, Chung-Cheng Chiu, Daan van Esch, Pavel Golik, Wei Han, Levi King, Bhuvana Ramabhadran, Andrew Rosenberg, Suzan Schwartz, Gary Wang

Figure 1 for Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

Figure 2 for Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

Figure 3 for Accented Speech Recognition: Benchmarking, Pre-training, and Diverse Data

Building inclusive speech recognition systems is a crucial step towards developing technologies that speakers of all language varieties can use. Therefore, ASR systems must work for everybody independently of the way they speak. To accomplish this goal, there should be available data sets representing language varieties, and also an understanding of model configuration that is the most helpful in achieving robust understanding of all types of speech. However, there are not enough data sets for accented speech, and for the ones that are already available, more training approaches need to be explored to improve the quality of accented speech recognition. In this paper, we discuss recent progress towards developing more inclusive ASR systems, namely, the importance of building new data sets representing linguistic diversity, and exploring novel training approaches to improve performance for all users. We address recent directions within benchmarking ASR systems for accented speech, measure the effects of wav2vec 2.0 pre-training on accented speech recognition, and highlight corpora relevant for diverse ASR evaluations.

* 5 pages, 3 tables

Via

Access Paper or Ask Questions

Boosting Local Spectro-Temporal Features for Speech Analysis

May 17, 2023
Michael Guerzhoy

We introduce the problem of phone classification in the context of speech recognition, and explore several sets of local spectro-temporal features that can be used for phone classification. In particular, we present some preliminary results for phone classification using two sets of features that are commonly used for object detection: Haar features and SVM-classified Histograms of Gradients (HoG)

* Master's project, University of Toronto

Via

Access Paper or Ask Questions

Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person

May 23, 2023
Lucas Rafael Stefanel Gris, Ricardo Marcacini, Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Sandra Maria Aluísio

Figure 1 for Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person

Figure 2 for Evaluating OpenAI's Whisper ASR for Punctuation Prediction and Topic Modeling of life histories of the Museum of the Person

Automatic speech recognition (ASR) systems play a key role in applications involving human-machine interactions. Despite their importance, ASR models for the Portuguese language proposed in the last decade have limitations in relation to the correct identification of punctuation marks in automatic transcriptions, which hinder the use of transcriptions by other systems, models, and even by humans. However, recently Whisper ASR was proposed by OpenAI, a general-purpose speech recognition model that has generated great expectations in dealing with such limitations. This chapter presents the first study on the performance of Whisper for punctuation prediction in the Portuguese language. We present an experimental evaluation considering both theoretical aspects involving pausing points (comma) and complete ideas (exclamation, question, and fullstop), as well as practical aspects involving transcript-based topic modeling - an application dependent on punctuation marks for promising performance. We analyzed experimental results from videos of Museum of the Person, a virtual museum that aims to tell and preserve people's life histories, thus discussing the pros and cons of Whisper in a real-world scenario. Although our experiments indicate that Whisper achieves state-of-the-art results, we conclude that some punctuation marks require improvements, such as exclamation, semicolon and colon.

Via

Access Paper or Ask Questions

HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Jun 20, 2023
Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, Sanjeev Khudanpur

Figure 1 for HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Figure 2 for HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Figure 3 for HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

Figure 4 for HK-LegiCoST: Leveraging Non-Verbatim Transcripts for Speech Translation

We introduce HK-LegiCoST, a new three-way parallel corpus of Cantonese-English translations, containing 600+ hours of Cantonese audio, its standard traditional Chinese transcript, and English translation, segmented and aligned at the sentence level. We describe the notable challenges in corpus preparation: segmentation, alignment of long audio recordings, and sentence-level alignment with non-verbatim transcripts. Such transcripts make the corpus suitable for speech translation research when there are significant differences between the spoken and written forms of the source language. Due to its large size, we are able to demonstrate competitive speech translation baselines on HK-LegiCoST and extend them to promising cross-corpus results on the FLEURS Cantonese subset. These results deliver insights into speech recognition and translation research in languages for which non-verbatim or ``noisy'' transcription is common due to various factors, including vernacular and dialectal speech.

Via

Access Paper or Ask Questions