Takaaki Hori

Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition

Jul 02, 2021

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it is spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtained state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks.

* Accepted to Interspeech 2021
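
The context growth that the abstract attributes to restricted self-attention can be illustrated with a small sketch. It is not the paper's implementation: the function name `future_reach` and its parameters are hypothetical, and past context is ignored. The sketch only tracks which future frames can influence one output frame once several restricted layers are stacked.

```python
def future_reach(num_frames, num_layers, look_ahead, t=0):
    """Return how many frames into the future the output at frame t depends on
    after stacking `num_layers` restricted self-attention layers, each of which
    may look `look_ahead` frames ahead (past context is ignored here)."""
    deps = {t}  # frames the output at t currently depends on
    for _ in range(num_layers):
        # Each dependency can itself see `look_ahead` frames further ahead.
        deps = {
            s
            for d in deps
            for s in range(d, min(num_frames, d + look_ahead + 1))
        }
    return max(deps) - t


# With 12 layers and a one-frame look-ahead per layer, the stacked
# look-ahead is 12 frames; the abstract states that DCN keeps it at
# the look-ahead of a single layer.
print(future_reach(num_frames=100, num_layers=12, look_ahead=1))
```

Under these assumptions the stacked look-ahead is the per-layer look-ahead times the number of layers, capped by the utterance length.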

Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition

Jun 16, 2021

Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

Abstract: Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process, and the interaction between the two models helps them reinforce each other to improve ASR performance. We apply MPL to an end-to-end ASR model based on connectionist temporal classification (CTC). The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios with varying amounts of data or domain mismatch.

* Accepted to Interspeech 2021
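
The momentum-based moving average mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming the usual exponential-moving-average form used by mean-teacher style methods; the abstract does not give the exact update rule, and the name `momentum_update`, the dictionary-of-floats parameter format, and the default `alpha` are hypothetical.

```python
def momentum_update(offline, online, alpha=0.999):
    """Move each offline parameter toward the matching online parameter.

    offline, online: dicts mapping parameter names to floats (a stand-in for
    real model weights). alpha is the assumed momentum coefficient.
    """
    for name, value in online.items():
        offline[name] = alpha * offline[name] + (1.0 - alpha) * value
    return offline


# One update step: the offline model drifts slowly toward the online model.
offline_params = {"w": 1.0, "b": 0.0}
online_params = {"w": 0.0, "b": 1.0}
momentum_update(offline_params, online_params, alpha=0.9)
print(offline_params)
```

A value of `alpha` close to 1 makes the offline model change slowly, so the pseudo-labels it generates for the online model stay comparatively stable between updates.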

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

Apr 19, 2021

Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Abstract: This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lectures and conversational speech. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) over multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving 5-15% relative error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although the results have shown a remarkable performance gain, there is still potential to further improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve the accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% character error rate for the HKUST dataset and 12.0%/6.3% word error rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and further enables streaming ASR with limited accuracy degradation.

* Submitted to INTERSPEECH 2021

Capturing Multi-Resolution Context by Dilated Self-Attention

Apr 07, 2021

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: Self-attention has become an important and widely used neural network component that has helped establish new state-of-the-art results for various applications, such as machine translation and automatic speech recognition (ASR). However, the computational complexity of self-attention grows quadratically with the input sequence length. This can be particularly problematic for applications such as ASR, where an input sequence generated from an utterance can be relatively long. In this work, we propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention. The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution. Different methods for summarizing distant frames are studied, such as subsampling, mean-pooling, and attention-based pooling. ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving results similar to full-sequence self-attention at a fraction of the computational cost.

* In Proc. ICASSP 2021
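
As a rough illustration of the idea described in the abstract, the sketch below builds the set of keys one query would attend to: nearby frames at full resolution plus mean-pooled summaries of the whole sequence. It is a simplified sketch, not the paper's method: the helper names `mean_pool` and `dilated_keys`, the window radius, and the choice to append a summary for every chunk (including chunks overlapping the local window) are assumptions.

```python
def mean_pool(frames, chunk):
    """Average each run of `chunk` consecutive frames into one summary frame."""
    summaries = []
    for start in range(0, len(frames), chunk):
        block = frames[start:start + chunk]
        dim = len(block[0])
        summaries.append(
            [sum(frame[d] for frame in block) / len(block) for d in range(dim)]
        )
    return summaries


def dilated_keys(frames, t, radius, chunk):
    """Keys for query frame t: full-resolution neighbors within `radius`,
    followed by low-resolution mean-pooled summaries of the sequence."""
    lo = max(0, t - radius)
    hi = min(len(frames), t + radius + 1)
    return frames[lo:hi] + mean_pool(frames, chunk)


# 12 one-dimensional frames; the query at t=6 sees 5 local frames and
# 3 chunk summaries instead of all 12 frames.
frames = [[float(i)] for i in range(12)]
keys = dilated_keys(frames, t=6, radius=2, chunk=4)
print(len(keys), "keys instead of", len(frames))
```

The number of keys per query grows with the number of chunks rather than with the number of frames, which is where the reduction in attention cost comes from in this sketch.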

The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

Dec 23, 2020

Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo (+5 more)

Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017, mainly to deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. ESPnet now also includes text to speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. ESPnet also provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by incorporating transformer, advanced data augmentation, and conformer. This project aims to provide an up-to-date speech processing experience to the community so that researchers in academia and in industry at various scales can develop their technologies collaboratively.

Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training

Nov 26, 2020

Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched. In this paper, we show that self-training (ST) combined with an uncertainty-based pseudo-label filtering approach can be effectively used for domain adaptation. We propose DUST, a dropout-based uncertainty-driven self-training technique which uses agreement between multiple predictions of an ASR system obtained for different dropout settings to measure the model's uncertainty about its prediction. DUST excludes pseudo-labeled data with high uncertainties from the training, which leads to substantially improved ASR results compared to ST without filtering, and shortens training time because the training data set is reduced. Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD as the target domains show that up to 80% of the performance of a system trained on ground-truth data can be recovered.

* Submitted to ICASSP 2021
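
The filtering step described in the abstract can be sketched as follows. This is an illustrative sketch only: measuring agreement with a length-normalized word edit distance, taking the worst case over the dropout runs, and the threshold value are all assumptions, and the names `edit_distance` and `keep_pseudo_label` are hypothetical. The hypotheses are plain strings standing in for the output of an ASR system.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def keep_pseudo_label(reference, dropout_hyps, threshold=0.3):
    """Keep a pseudo-label only if every dropout hypothesis stays close to
    the reference hypothesis (assumed to be decoded without dropout)."""
    ref = reference.split()
    if not ref or not dropout_hyps:
        return False
    worst = max(edit_distance(ref, hyp.split()) / len(ref)
                for hyp in dropout_hyps)
    return worst <= threshold


# Stable predictions are kept; strongly disagreeing predictions are dropped.
print(keep_pseudo_label("the cat sat", ["the cat sat", "the cat sat"]))
print(keep_pseudo_label("the cat sat", ["the cat sat", "a dog ran"]))
```

Utterances rejected by the filter are simply left out of the self-training set, which is also why the training set shrinks.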

Semi-Supervised Speech Recognition via Graph-based Temporal Classification

Oct 29, 2020

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training targets. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only a temporal alignment, similarly to CTC, but also a label alignment to obtain the optimal pseudo-label sequence from the weighted graph. Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, outperforming standard pseudo-labeling by a large margin, with ASR results close to an oracle experiment in which the best hypotheses of the N-best lists are selected manually.

* Submitted to ICASSP 2021

Multi-Pass Transformer for Machine Translation

Sep 23, 2020

Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux

Abstract: In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers. To maintain a directed acyclic graph structure, the encoder stack of a transformer is repeated along a new multi-pass dimension, keeping the parameters tied, and information is allowed to proceed unidirectionally both towards deeper layers within an encoder stack and towards any layer of subsequent stacks. We consider both soft (i.e., continuous) and hard (i.e., discrete) connections between parallel encoder stacks, relying on a neural architecture search to find the best connection pattern in the hard case. We perform an extensive ablation study of the proposed MPT architecture and compare it with other state-of-the-art transformer architectures. Surprisingly, a Base Transformer equipped with MPT can surpass the performance of a Large Transformer on the challenging machine translation En-De and En-Fr datasets. In the hard connection case, the optimal connection pattern found for En-De also leads to improved performance for En-Fr.

* 10 pages, 5 figures and 2 tables

Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

Feb 14, 2020

Leda Sarı, Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared when they are inserted at different layers of the encoder neural network, using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve word error rates (WERs) similar to those of i-vectors for single-speaker utterances and significantly lower WERs for utterances in which there are speaker changes.

* To appear in Proc. ICASSP 2020
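
The memory read described in the abstract can be sketched as a softmax-weighted sum over stored i-vectors. This is a minimal illustration under stated assumptions: dot-product scoring, the function name `read_memory`, and the toy two-dimensional vectors are all hypothetical, since the abstract does not specify how the attention scores are computed.

```python
import math


def read_memory(query, memory):
    """Attend over stored speaker vectors and return (m_vector, weights).

    query:  list of floats derived from the current input (assumed).
    memory: list of stored i-vectors, each the same length as `query`.
    """
    # Dot-product relevance score for each memory slot (an assumption).
    scores = [sum(q * m for q, m in zip(query, vec)) for vec in memory]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The M-vector is the attention-weighted combination of the i-vectors.
    dim = len(memory[0])
    m_vector = [sum(w * vec[d] for w, vec in zip(weights, memory))
                for d in range(dim)]
    return m_vector, weights


memory = [[1.0, 0.0], [0.0, 1.0]]
m_vector, weights = read_memory([10.0, 0.0], memory)
print(m_vector, weights)
```

In the setup the abstract describes, the resulting M-vector would then be concatenated to the acoustic features or to hidden activations; that step is omitted here.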

Streaming automatic speech recognition with the transformer model

Jan 09, 2020

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context information, has been shown to achieve significantly lower word error rates (WERs) compared to recurrent neural network (RNN) based system architectures. Despite its success, its practical use is limited to offline ASR tasks, since encoder-decoder architectures typically require an entire speech utterance as input. In this work, we propose a transformer-based end-to-end ASR system for streaming ASR, where an output must be generated shortly after each spoken word. To achieve this, we apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism. Our proposed streaming transformer architecture achieves 2.7% and 7.0% WER for the "clean" and "other" test data of LibriSpeech, which to our knowledge is the best published streaming end-to-end ASR result for this task.
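
Time-restricted self-attention, as named in the abstract, limits each frame to a window of nearby frames. The sketch below builds such a window as a boolean mask. It is illustrative only: the function name `restricted_mask` and the `left` and `right` parameters are hypothetical, and the abstract does not state the window sizes used.

```python
def restricted_mask(num_frames, left, right):
    """mask[t][s] is True when query frame t may attend to key frame s,
    i.e. when s lies within `left` past and `right` future frames of t."""
    return [
        [(t - left) <= s <= (t + right) for s in range(num_frames)]
        for t in range(num_frames)
    ]


# Five frames, two frames of history and one frame of look-ahead.
for row in restricted_mask(num_frames=5, left=2, right=1):
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `right` to 0 gives a fully causal layer, while a small positive value trades a fixed delay per layer for access to some future context.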
