Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yongqiang Wang

BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Oct 01, 2021
Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, Yonghui Wu

Figure 1 for BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Figure 2 for BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Figure 3 for BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Figure 4 for BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitudes of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.

* 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated

Via

Access Paper or Ask Questions

Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Nov 03, 2020
Ching-Feng Yeh, Yongqiang Wang, Yangyang Shi, Chunyang Wu, Frank Zhang, Julian Chan, Michael L. Seltzer

Figure 1 for Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Figure 2 for Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Figure 3 for Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Figure 4 for Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition

Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need of access to the full sequence and the quadratically growing computational cost concerning the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with the streaming capability and reduces the large footprint from the streaming attention-based model using augmented memory. On the LibriSpeech dataset, our proposed system achieves word error rates 2.7% on test-clean and 5.8% on test-other, to our best knowledge the lowest among streaming approaches reported so far.

* IEEE Spoken Language Technology Workshop 2021

Via

Access Paper or Ask Questions

Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Oct 30, 2020
Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino

Figure 1 for Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Figure 2 for Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Figure 3 for Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Figure 4 for Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence and the computational cost grows quadratically with the length of the input sequence. Nevertheless, most of the previous work on simultaneous speech translation, the task of generating translations from partial audio input, ignores the time spent in generating the translation when analyzing the latency. With this assumption, a system may have good latency quality trade-offs but be inapplicable in real-time scenarios. In this paper, we focus on the task of streaming simultaneous speech translation, where the systems are not only capable of translating with partial input but are also able to handle very long or continuous input. We propose an end-to-end transformer-based sequence-to-sequence model, equipped with an augmented memory transformer encoder, which has shown great success on the streaming automatic speech recognition task with hybrid or transducer-based models. We conduct an empirical evaluation of the proposed model on segment, context and memory sizes and we compare our approach to a transformer with a unidirectional mask.

Via

Access Paper or Ask Questions

Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

Oct 29, 2020
Yongqiang Wang, Yangyang Shi, Frank Zhang, Chunyang Wu, Julian Chan, Ching-Feng Yeh, Alex Xiao

Figure 1 for Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

Figure 2 for Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

Figure 3 for Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

Figure 4 for Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

In this paper, we summarize the application of transformer and its streamable variant, Emformer based acoustic model for large scale speech recognition applications. We compare the transformer based acoustic models with their LSTM counterparts on industrial scale tasks. Specifically, we compare Emformer with latency-controlled BLSTM (LCBLSTM) on medium latency tasks and LSTM on low latency tasks. On a low latency voice assistant task, Emformer gets 24% to 26% relative word error rate reductions (WERRs). For medium latency scenarios, comparing with LCBLSTM with similar model size and latency, Emformer gets significant WERR across four languages in video captioning datasets with 2-3 times inference real-time factors reduction.

* submitted to ICASSP2021

Via

Access Paper or Ask Questions

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Oct 29, 2020
Yangyang Shi, Yongqiang Wang, Chunyang Wu, Ching-Feng Yeh, Julian Chan, Frank Zhang, Duc Le, Mike Seltzer

Figure 1 for Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Figure 2 for Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Figure 3 for Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

Figure 4 for Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

This paper proposes an efficient memory transformer Emformer for low latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce self-attention's computation complexity. A cache mechanism saves the computation for the key and value in self-attention for the left context. Emformer applies a parallelized block processing in training to support low latency models. We carry out experiments on benchmark LibriSpeech data. Under average latency of 960 ms, Emformer gets WER $2.50\%$ on test-clean and $5.62\%$ on test-other. Comparing with a strong baseline augmented memory transformer (AM-TRF), Emformer gets $4.6$ folds training speedup and $18\%$ relative real-time factor (RTF) reduction in decoding with relative WER reduction $17\%$ on test-clean and $9\%$ on test-other. For a low latency scenario with an average latency of 80 ms, Emformer achieves WER $3.01\%$ on test-clean and $7.09\%$ on test-other. Comparing with the LSTM baseline with the same latency and model size, Emformer gets relative WER reduction $9\%$ and $16\%$ on test-clean and test-other, respectively.

* 5 pages, 2 figures, submitted to ICASSP 2021

Via

Access Paper or Ask Questions

Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

May 19, 2020
Frank Zhang, Yongqiang Wang, Xiaohui Zhang, Chunxi Liu, Yatharth Saraf, Geoffrey Zweig

Figure 1 for Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Figure 2 for Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Figure 3 for Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

Figure 4 for Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces

In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by excluding all the GMM bootstrapping, decision tree building and force alignment steps, while still achieving very competitive word-error-rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency since we can use larger stride without losing accuracy. We further confirm these findings on two internal \emph{VideoASR} datasets: German, which is similar to English as a fusional language, and Turkish, which is an agglutinative language.

* submitted to interspeech 2020

Via

Access Paper or Ask Questions

Weak-Attention Suppression For Transformer Based Speech Recognition

May 18, 2020
Yangyang Shi, Yongqiang Wang, Chunyang Wu, Christian Fuegen, Frank Zhang, Duc Le, Ching-Feng Yeh, Michael L. Seltzer

Figure 1 for Weak-Attention Suppression For Transformer Based Speech Recognition

Figure 2 for Weak-Attention Suppression For Transformer Based Speech Recognition

Figure 3 for Weak-Attention Suppression For Transformer Based Speech Recognition

Figure 4 for Weak-Attention Suppression For Transformer Based Speech Recognition

Transformers, originally proposed for natural language processing (NLP) tasks, have recently achieved great success in automatic speech recognition (ASR). However, adjacent acoustic units (i.e., frames) are highly correlated, and long-distance dependencies between them are weak, unlike text units. It suggests that ASR will likely benefit from sparse and localized attention. In this paper, we propose Weak-Attention Suppression (WAS), a method that dynamically induces sparsity in attention probabilities. We demonstrate that WAS leads to consistent Word Error Rate (WER) improvement over strong transformer baselines. On the widely used LibriSpeech benchmark, our proposed method reduced WER by 10%$ on test-clean and 5% on test-other for streamable transformers, resulting in a new state-of-the-art among streaming models. Further analysis shows that WAS learns to suppress attention of non-critical and redundant continuous acoustic frames, and is more likely to suppress past frames rather than future ones. It indicates the importance of lookahead in attention-based ASR models.

* submitted to interspeech 2020

Via

Access Paper or Ask Questions

Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

May 16, 2020
Chunyang Wu, Yongqiang Wang, Yangyang Shi, Ching-Feng Yeh, Frank Zhang

Figure 1 for Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

Figure 2 for Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

Figure 3 for Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

Figure 4 for Streaming Transformer-based Acoustic Models Using Self-attention with Augmented Memory

Transformer-based acoustic modeling has achieved great suc-cess for both hybrid and sequence-to-sequence speech recogni-tion. However, it requires access to the full sequence, and thecomputational cost grows quadratically with respect to the in-put sequence length. These factors limit its adoption for stream-ing applications. In this work, we proposed a novel augmentedmemory self-attention, which attends on a short segment of theinput sequence and a bank of memories. The memory bankstores the embedding information for all the processed seg-ments. On the librispeech benchmark, our proposed methodoutperforms all the existing streamable transformer methods bya large margin and achieved over 15% relative error reduction,compared with the widely used LC-BLSTM baseline. Our find-ings are also confirmed on some large internal datasets.

* submitted to Interspeech 2020

Via

Access Paper or Ask Questions

Improving N-gram Language Models with Pre-trained Deep Transformer

Nov 22, 2019
Yiren Wang, Hongzhao Huang, Zhe Liu, Yutong Pang, Yongqiang Wang, ChengXiang Zhai, Fuchun Peng

Figure 1 for Improving N-gram Language Models with Pre-trained Deep Transformer

Figure 2 for Improving N-gram Language Models with Pre-trained Deep Transformer

Figure 3 for Improving N-gram Language Models with Pre-trained Deep Transformer

Figure 4 for Improving N-gram Language Models with Pre-trained Deep Transformer

Although n-gram language models (LMs) have been outperformed by the state-of-the-art neural LMs, they are still widely used in speech recognition due to its high efficiency in inference. In this paper, we demonstrate that n-gram LM can be improved by neural LMs through a text generation based data augmentation method. In contrast to previous approaches, we employ a large-scale general domain pre-training followed by in-domain fine-tuning strategy to construct deep Transformer based neural LMs. Large amount of in-domain text data is generated with the well trained deep Transformer to construct new n-gram LMs, which are then interpolated with baseline n-gram systems. Empirical studies on different speech recognition tasks show that the proposed approach can effectively improve recognition accuracy. In particular, our proposed approach brings significant relative word error rate reduction up to 6.0% for domains with limited in-domain data.

Via

Access Paper or Ask Questions

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Oct 28, 2019
Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

Figure 1 for Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Figure 2 for Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Figure 3 for Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Figure 4 for Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

We explore options to use Transformer networks in neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and comes with advantages in parallel computation and capturing contexts. We propose 1) using VGGNet with causal convolution to incorporate positional information and reduce frame rate for efficient inference 2) using truncated self-attention to enable streaming for Transformer and reduce computational complexity. All experiments are conducted on the public LibriSpeech corpus. The proposed Transformer-Transducer outperforms neural transducer with LSTM/BLSTM networks and achieved word error rates of 6.37 % on the test-clean set and 15.30 % on the test-other set, while remaining streamable, compact with 45.7M parameters for the entire system, and computationally efficient with complexity of O(T), where T is input sequence length.

Via

Access Paper or Ask Questions