Jinyu Li

Fred

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

Nov 03, 2020

Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

Abstract: Integrating an external language model (LM) remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including the most popular recurrent neural network transducer (RNN-T) and attention-based encoder-decoder (AED) models. Trained with audio-transcript pairs, an E2E model implicitly learns an internal LM that characterizes the training data in the source domain. With ILME, the internal LM scores of an E2E model are estimated and subtracted from the log-linear interpolation between the scores of the E2E model and the external LM. The internal LM scores are approximated as the output of an E2E model when its acoustic components are eliminated. ILME can alleviate the domain mismatch between training and testing, or improve multi-domain E2E ASR. In experiments with RNN-T and AED models trained on 30K hours of data, ILME achieves up to 15.5% and 6.8% relative word error rate reductions from Shallow Fusion on out-of-domain LibriSpeech and in-domain Microsoft production test sets, respectively.

* 2021 IEEE Spoken Language Technology Workshop (SLT)
* 8 pages, 2 figures, SLT 2021
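
The score combination described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the weight names `ext_lm_weight` and `internal_lm_weight`, their default values, and the example scores are all hypothetical and are not taken from the paper.

```python
def ilme_score(log_p_e2e, log_p_ext_lm, log_p_internal_lm,
               ext_lm_weight=0.5, internal_lm_weight=0.5):
    """Combine per-hypothesis log scores.

    The E2E and external LM scores are interpolated log-linearly, then the
    estimated internal LM score is subtracted. Weight names and defaults are
    assumptions made for this sketch.
    """
    interpolated = log_p_e2e + ext_lm_weight * log_p_ext_lm
    return interpolated - internal_lm_weight * log_p_internal_lm


def rerank(hypotheses, ext_lm_weight=0.5, internal_lm_weight=0.5):
    """Sort hypotheses best first.

    Each hypothesis is a tuple:
    (text, log_p_e2e, log_p_ext_lm, log_p_internal_lm).
    """
    return sorted(
        hypotheses,
        key=lambda h: ilme_score(h[1], h[2], h[3],
                                 ext_lm_weight, internal_lm_weight),
        reverse=True,
    )


# Made-up scores, for illustration only.
nbest = [
    ("hypothesis a", -10.0, -8.0, -2.0),
    ("hypothesis b", -12.0, -6.0, -8.0),
]
shallow_fusion_best = rerank(nbest, internal_lm_weight=0.0)[0][0]
ilme_best = rerank(nbest, internal_lm_weight=0.5)[0][0]
```

Setting `internal_lm_weight` to zero reduces the sketch to plain Shallow Fusion, which makes it easy to compare the two rankings on the same N-best list.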

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

Oct 23, 2020

Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

Abstract: Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of external language model (LM) fusion. In HAT, the blank probability and the label probability are estimated using two separate probability distributions, which provides a more accurate solution for internal LM score estimation and thus works better when combined with an external LM. Previous work mainly focuses on HAT model training with the negative log-likelihood loss. In this paper, we study the minimum word error rate (MWER) training of HAT, a criterion that is closer to the evaluation metric for speech recognition and has been successfully applied to other types of end-to-end models such as sequence-to-sequence (S2S) and RNN-T models. From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models while also improving the robustness of the model to decoding hyper-parameters such as length normalization and decoding beam during inference.

* 5 pages, submitted to ICASSP 2021
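
As a rough illustration of what an MWER-style criterion measures, the sketch below computes the expected number of word errors over an N-best list, with hypothesis probabilities renormalized over the list and the mean error count subtracted. This is one common formulation of the loss value only (no gradients), and the paper's exact loss for HAT may differ. The function names and the example N-best lists are hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mwer_loss(reference, nbest):
    """Expected word errors over an N-best list, relative to the mean.

    nbest is a list of (hypothesis_text, log_prob) pairs. Probabilities are
    renormalized over the list, an assumption of this sketch.
    """
    ref = reference.split()
    log_probs = [lp for _, lp in nbest]
    max_lp = max(log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - max_lp) for lp in log_probs]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    errors = [edit_distance(ref, text.split()) for text, _ in nbest]
    mean_errors = sum(errors) / len(errors)
    return sum(p * (e - mean_errors) for p, e in zip(posteriors, errors))


# Made-up N-best lists: the loss is lower when the correct hypothesis
# carries more probability mass.
loss_equal = mwer_loss("a b c", [("a b c", -1.0), ("a c", -1.0)])
loss_good = mwer_loss("a b c", [("a b c", 0.0), ("a c", -2.0)])
```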

Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

Oct 23, 2020

Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li

Abstract: With its strong modeling capacity that comes from a multi-head and multi-layer structure, the Transformer is a very powerful model for learning a sequential representation and has recently been applied successfully to speech separation. However, multi-channel speech separation does not necessarily need such a heavy structure for all time frames, especially when the cross-talker challenge happens only occasionally. For example, in conversation scenarios, most regions contain only a single active speaker, where the separation task reduces to a single-speaker enhancement problem. It turns out that using a very deep network structure for dealing with signals with a low overlap ratio not only negatively affects the inference efficiency but also hurts the separation performance. To deal with this problem, we propose an early exit mechanism, which enables the Transformer model to handle different cases with adaptive depth. Experimental results indicate that not only does the early exit mechanism accelerate the inference, but it also improves the accuracy.

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Oct 22, 2020

Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li

Abstract: Recently, Transformer-based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explore the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.

* 5 pages
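
One way to picture chunk-wise streaming with a bounded history and a small look-ahead is as an attention mask over frames. The sketch below is an assumption about how such a mask could be built, not the paper's implementation. The function name and the parameters `chunk_size`, `left_chunks`, and `lookahead_frames` are hypothetical.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks, lookahead_frames=0):
    """Build a boolean mask where mask[i][j] is True if frame i may attend to frame j.

    Each frame sees its own chunk, up to `left_chunks` previous chunks, and
    `lookahead_frames` frames past the end of its chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # First visible frame: start of the oldest allowed chunk.
        start = max(0, (chunk_idx - left_chunks) * chunk_size)
        # One past the last visible frame: end of this chunk plus look-ahead.
        end = min(num_frames, (chunk_idx + 1) * chunk_size + lookahead_frames)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


# Six frames, chunks of two frames, one chunk of history, no look-ahead.
mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
```

A larger `lookahead_frames` gives every frame more future context at the cost of added latency, which is the trade-off the abstract refers to.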

Speaker Separation Using Speaker Inventories and Estimated Speech

Oct 20, 2020

Peidong Wang, Zhuo Chen, DeLiang Wang, Jinyu Li, Yifan Gong

Abstract: We propose speaker separation using speaker inventories and estimated speech (SSUSIES), a framework leveraging speaker profiles and estimated speech for speaker separation. SSUSIES contains two methods, speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES). SSUSI performs speaker separation with the help of a speaker inventory. By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches. SSUES is a widely applicable technique that can substantially improve speaker separation performance using the output of first-pass separation. We evaluate the models on both speaker separation and speech recognition metrics.

Transfer Learning Approaches for Streaming End-to-End Speech Recognition System

Aug 17, 2020

Vikas Joshi, Rui Zhao, Rupesh R. Mehta, Kshitiz Kumar, Jinyu Li

Abstract: Transfer learning (TL) is widely used in conventional hybrid automatic speech recognition (ASR) systems to transfer knowledge from a source language to a target language. TL can be applied to end-to-end (E2E) ASR systems such as recurrent neural network transducer (RNN-T) models by initializing the encoder and/or prediction network of the target language with pre-trained models from the source language. In a hybrid ASR system, transfer learning is typically done by initializing the target language acoustic model (AM) with the source language AM. Several transfer learning strategies exist in the case of the RNN-T framework, depending upon the choice of the initialization model for the encoder and prediction networks. This paper presents a comparative study of four different TL methods for the RNN-T framework. We show a 17% relative word error rate reduction with different TL methods over a randomly initialized RNN-T model. We also study the impact of TL with varying amounts of training data, ranging from 50 hours to 1000 hours, and show the efficacy of TL for languages with a small amount of training data.

Adaptation Algorithms for Speech Recognition: An Overview

Aug 14, 2020

Peter Bell, Joachim Fainberg, Ondrej Klejch, Jinyu Li, Steve Renals, Pawel Swietojanski

Abstract: We present a structured overview of adaptation algorithms for neural network-based speech recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neural network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions as reported in the literature.

* Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27 figures

Continuous Speech Separation with Conformer

Aug 13, 2020

Sanyuan Chen, Yu Wu, Zhuo Chen, Jinyu Li, Chengyi Wang, Shujie Liu, Ming Zhou

Abstract: Continuous speech separation plays a vital role in complicated speech-related tasks such as conversation transcription. The separation model extracts a single speaker signal from mixed speech. In this paper, we use transformer and conformer in lieu of recurrent neural networks in the separation system, as we believe capturing global information with the self-attention based method is crucial for speech separation. Evaluated on the LibriCSS dataset, the conformer separation model achieves state-of-the-art results, with a relative 23.5% word error rate (WER) reduction from bi-directional LSTM (BLSTM) in the utterance-wise evaluation and a 15.4% WER reduction in the continuous evaluation.
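
Several abstracts on this page report relative word error rate reductions. For readers less familiar with the metric, the helper below shows the usual arithmetic. The function name and the WER values in the example are hypothetical and do not come from any of the listed papers.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction in percent.

    Both arguments are absolute word error rates in the same unit,
    for example percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: going from 10.0% to 8.0% absolute WER is a
# 2.0-point absolute reduction and a 20% relative reduction.
reduction = relative_wer_reduction(10.0, 8.0)
```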

Developing RNN-T Models Surpassing High-Performance Hybrid Models with Customization Capability

Jul 30, 2020

Jinyu Li, Rui Zhao, Zhong Meng, Yanqing Liu, Wenning Wei, Sarangarajan Parthasarathy, Vadim Mazalov, Zhenghao Wang, Lei He, Sheng Zhao (+1 more)

Abstract: Because of its streaming nature, the recurrent neural network transducer (RNN-T) is a very promising end-to-end (E2E) model that may replace the popular hybrid model for automatic speech recognition. In this paper, we describe our recent development of RNN-T models with reduced GPU memory consumption during training, a better initialization strategy, and advanced encoder modeling with future lookahead. When trained with Microsoft's 65 thousand hours of anonymized training data, the developed RNN-T model surpasses a very well-trained hybrid model with both better recognition accuracy and lower latency. We further study how to customize RNN-T models to a new domain, which is important for deploying E2E models to practical scenarios. By comparing several methods leveraging text-only data in the new domain, we found that updating RNN-T's prediction and joint networks using text-to-speech generated from domain-specific text is the most effective.

* Accepted by Interspeech 2020

On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition

May 28, 2020

Jinyu Li, Yu Wu, Yashesh Gaur, Chengyi Wang, Rui Zhao, Shujie Liu

Abstract: Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.

* submitted to Interspeech 2020
