Trevor Strohman

Modular Hybrid Autoregressive Transducer

Oct 31, 2022

Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, Bhuvana Ramabhadran, W. Ronny Huang(+3 more)

Abstract:Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder. The encoder and label decoder outputs are directly projected to AM and internal LM scores and then added to compute label posteriors. We train MHAT with an internal LM loss and a HAT loss to ensure that its internal LM becomes a standalone neural LM that can be effectively adapted to text. Moreover, text adaptation of MHAT fosters a much better LM fusion than internal LM subtraction-based methods. On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM fusion from 400K-hour trained HAT.

* 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar
* 8 pages, 1 figure, SLT 2022
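
The abstract above says the encoder and label decoder outputs are projected to AM and internal LM scores and added to form label posteriors, with a separate blank decoder. A minimal sketch of that combination for one (frame, label-history) position follows. It assumes the usual HAT factorization (a Bernoulli blank probability from a sigmoid, then a softmax over labels); the function names and the sigmoid on `blank_logit` are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mhat_posteriors(am_scores, ilm_scores, blank_logit):
    """Return (blank_prob, label_probs) for one position.

    am_scores:   per-label scores projected from the encoder (assumed input)
    ilm_scores:  per-label scores projected from the label decoder (assumed input)
    blank_logit: scalar from the blank decoder (assumed to feed a sigmoid)
    """
    blank_prob = 1.0 / (1.0 + math.exp(-blank_logit))
    # AM and internal LM scores are added before normalization.
    label_given_nonblank = softmax([a + l for a, l in zip(am_scores, ilm_scores)])
    label_probs = [(1.0 - blank_prob) * p for p in label_given_nonblank]
    return blank_prob, label_probs


def internal_lm_distribution(ilm_scores):
    """The internal LM alone: a softmax over the label decoder scores."""
    return softmax(ilm_scores)
```

Because the internal LM scores enter only as an additive term, `internal_lm_distribution` can be evaluated on text without any audio, which is the property the abstract relies on for text-only adaptation.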

JOIST: A Joint Speech and Text Streaming Model For ASR

Oct 13, 2022

Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

Abstract:We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model with both speech-text paired inputs and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, both of which are novel compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for a good user-level experience.

Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR

Oct 11, 2022

Dongseong Hwang, Khe Chai Sim, Yu Zhang, Trevor Strohman

Abstract:Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. It also allows our production teacher to adapt to new data domains continuously.

* 8 pages, 1 figure
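
To make the soft versus hard target distinction concrete, here is a minimal sketch for a single output position. It is a deliberate simplification: the paper works with RNN-T models, where the losses are defined over the whole alignment lattice or over teacher-generated transcripts, whereas this sketch only contrasts the two kinds of target on one categorical distribution. The function names are hypothetical.

```python
import math


def soft_target_loss(teacher_probs, student_probs):
    """Cross-entropy of the student against the full teacher distribution."""
    return -sum(
        t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0
    )


def hard_target_loss(teacher_probs, student_probs):
    """Cross-entropy of the student against the teacher's single best label."""
    label = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
    return -math.log(student_probs[label])
```

With a one-hot teacher the two losses coincide; they differ only when the teacher spreads probability mass over several labels, which is the information soft targets pass on and hard targets discard.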

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Sep 13, 2022

Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-yiin Chang, Parisa Haghani

Abstract:Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and is beneficial to integrate into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve lower word error rates (WERs) using second-pass decoding with longer right-context. By leveraging such differences in the right-contexts and a streaming implementation of statistics pooling, the proposed method can achieve accurate streaming LID prediction with little extra test-time cost. Experimental results on a voice search dataset with 9 language locales show that the proposed method achieves an average of 96.2% LID prediction accuracy and the same second-pass WER as that obtained by including oracle LID in the input.
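
The abstract mentions a streaming implementation of statistics pooling. One common way to pool statistics frame by frame is to keep a running mean and standard deviation per feature dimension, updated with Welford's algorithm. The sketch below illustrates that general idea only; the class name, the use of the population standard deviation, and the concatenated output layout are assumptions rather than details from the paper.

```python
import math


class StreamingStatsPooling:
    """Running per-dimension mean and std over all frames seen so far."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations

    def update(self, frame):
        """Consume one frame and return [mean..., std...] over frames so far."""
        self.count += 1
        for i, x in enumerate(frame):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])
        std = [math.sqrt(m / self.count) for m in self.m2]
        return self.mean + std
```

Each call to `update` costs time proportional to the feature dimension and needs no access to past frames, so a per-frame predictor can consume the pooled vector as audio arrives.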

A Language Agnostic Multilingual Streaming On-Device ASR System

Aug 29, 2022

Bo Li, Tara N. Sainath, Ruoming Pang, Shuo-yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He(+2 more)

Abstract:On-device end-to-end (E2E) models have shown improvements over a conventional model on English Voice Search tasks in both quality and latency. E2E models have also shown promising results for multilingual automatic speech recognition (ASR). In this paper, we extend our previous capacity solution to streaming applications and present a streaming multilingual E2E ASR system that runs fully on device with comparable quality and latency to individual monolingual models. To achieve that, we propose an Encoder Endpointer model and an End-of-Utterance (EOU) Joint Layer for a better quality and latency trade-off. Our system is built in a language-agnostic manner, allowing it to natively support intersentential code switching in real time. To address the feasibility concerns on large models, we conducted on-device profiling and replaced the time-consuming LSTM decoder with the recently developed Embedding decoder. With these changes, we managed to run such a system on a mobile device in less than real time.

* Accepted in Interspeech 2022

Streaming Intended Query Detection using E2E Modeling for Continued Conversation

Aug 29, 2022

Shuo-yiin Chang, Guru Prakash, Zelin Wu, Qiao Liang, Tara N. Sainath, Bo Li, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Abstract:In voice-enabled applications, a predetermined hotword is usually used to activate a device in order to attend to the query. However, speaking queries followed by a hotword each time introduces a cognitive burden in continued conversations. To avoid repeating a hotword, we propose a streaming end-to-end (E2E) intended query detector that identifies the utterances directed towards the device and filters out other utterances not directed towards the device. The proposed approach incorporates the intended query detector into the E2E model that already folds different components of the speech recognition pipeline into one neural network. The E2E modeling on speech decoding and intended query detection also allows us to declare a quick intended query detection based on early partial recognition results, which is important to decrease latency and make the system responsive. We demonstrate that the proposed E2E approach yields a 22% relative improvement on equal error rate (EER) for the detection accuracy and a 600 ms latency improvement compared with an independent intended query detector. In our experiment, the proposed model detects whether the user is talking to the device with an 8.7% EER within 1.4 seconds of median latency after the user starts speaking.

* 5 pages, Interspeech 2022
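
The detection results above are reported as equal error rate (EER), the operating point where the false accept rate and false reject rate are equal. A minimal sketch of how EER can be estimated from detector scores follows. It assumes higher scores mean "directed at the device" and approximates the crossing point by scanning the observed scores as thresholds; the function name and this approximation are illustrative, not taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by scanning every observed score as a threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for th in thresholds:
        # False reject: a device-directed utterance scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False accept: a non-directed utterance scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

On small score sets the two error rates may never be exactly equal, so the sketch returns their average at the threshold where they are closest.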

Turn-Taking Prediction for Natural Conversational Speech

Aug 29, 2022

Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

Abstract:While a streaming voice assistant system has been used in many applications, this system typically focuses on unnatural, one-shot interactions assuming input from a single voice query without hesitation or disfluency. However, a common conversational utterance often involves multiple queries with turn-taking, in addition to disfluencies. These disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases. This makes doing speech recognition with conversational speech, including one with multiple queries, a challenging task. To better model the conversational interaction, it is critical to discriminate disfluencies and end of query in order to allow the user to hold the floor for disfluencies while having the system respond as quickly as possible when the user has finished speaking. In this paper, we present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer. Our best system is obtained by jointly optimizing for the ASR task and for detecting when the user has paused to think or has finished speaking. The proposed approach demonstrates over 97% recall rate and 85% precision rate on predicting true turn-taking with only 100 ms latency on a test set designed with 4 types of disfluencies inserted in conversational utterances.

* 5 pages, Interspeech 2022

Improving Deliberation by Text-Only and Semi-Supervised Training

Jun 29, 2022

Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

Abstract:Text-only and semi-supervised training based on audio-only data has gained popularity recently due to the wide availability of unlabeled text and speech data. In this work, we propose incorporating text-only and semi-supervised training into an attention-based deliberation model. By incorporating text-only data in training a bidirectional encoder representation from transformer (BERT) for the deliberation text encoder, and large-scale text-to-speech and audio-only utterances using joint acoustic and text decoder (JATD) and semi-supervised training, we achieved 4%-12% WER reduction for various tasks compared to the baseline deliberation. Compared to a state-of-the-art language model (LM) rescoring method, the deliberation model reduces the Google Voice Search WER by 11% relative. We show that the deliberation model also achieves a positive human side-by-side evaluation compared to the state-of-the-art LM rescorer with reasonable endpointer latencies.

* Accepted by Interspeech 2022

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes

Apr 20, 2022

Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang(+4 more)

Abstract:In this paper, we propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios. Moreover, the model can significantly reduce model size and power consumption without loss of quality. Namely, with the dynamic cascaded encoder model, we explore three techniques to maximally boost the performance of each model size: 1) Use separate decoders for each sub-model while sharing the encoders; 2) Use funnel-pooling to improve the encoder efficiency; 3) Balance the size of causal and non-causal encoders to improve quality and fit deployment constraints. Overall, the proposed large-medium model has 30% smaller size and reduces power consumption by 33%, compared to the baseline cascaded encoder model. The triple-size model that unifies the large, medium, and small models achieves 37% total size reduction with minimal quality loss, while substantially reducing the engineering efforts of having separate models.

* Submitted to INTERSPEECH 2022

Improving Rare Word Recognition with LM-aware MWER Training

Apr 15, 2022

Weiran Wang, Tongzhou Chen, Tara N. Sainath, Ehsan Variani, Rohit Prabhavalkar, Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser(+3 more)

Abstract:Language models (LMs) significantly improve the recognition accuracy of end-to-end (E2E) models on words rarely seen during training, when used in either the shallow fusion or the rescoring setups. In this work, we introduce LMs in the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework, to mitigate the training versus inference gap regarding the use of LMs. For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement over the model trained with standard MWER on voice search test sets containing rare words. For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner. This model achieves the same rescoring WER as a regular MWER-trained model, but without the need for sweeping fusion weights.

* In submission to INTERSPEECH 2022
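
As a rough illustration of using an LM inside the loss computation, the sketch below computes a minimum word error rate (MWER) style loss over an N-best list, where each hypothesis is scored by shallow fusion (E2E log-probability plus a weighted LM log-probability). This is the standard N-best MWER form with a fused score substituted in; the function names, the single global `lm_weight`, and the exact way the paper combines the scores are assumptions.

```python
import math


def fused_score(e2e_logprob, lm_logprob, lm_weight):
    """Shallow fusion: log-linear interpolation of E2E and LM scores."""
    return e2e_logprob + lm_weight * lm_logprob


def mwer_loss(e2e_logprobs, lm_logprobs, word_errors, lm_weight):
    """Expected word errors over the N-best list, relative to their mean."""
    scores = [
        fused_score(a, l, lm_weight) for a, l in zip(e2e_logprobs, lm_logprobs)
    ]
    # Renormalize the fused scores over the N-best list (stable softmax).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    posteriors = [e / z for e in exps]
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (w - mean_err) for p, w in zip(posteriors, word_errors))
```

Minimizing this quantity shifts probability mass toward hypotheses with fewer word errors than the list average, and because the posterior uses the fused score, the model is trained under the same scoring it will see at inference.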
