Jonathan Le Roux

MERL

Capturing Multi-Resolution Context by Dilated Self-Attention

Apr 07, 2021

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: Self-attention has become an important and widely used neural network component that helped to establish new state-of-the-art results for various applications, such as machine translation and automatic speech recognition (ASR). However, the computational complexity of self-attention grows quadratically with the input sequence length. This can be particularly problematic for applications such as ASR, where an input sequence generated from an utterance can be relatively long. In this work, we propose a combination of restricted self-attention and a dilation mechanism, which we refer to as dilated self-attention. The restricted self-attention allows attention to neighboring frames of the query at a high resolution, and the dilation mechanism summarizes distant information to allow attending to it with a lower resolution. Different methods for summarizing distant frames are studied, such as subsampling, mean-pooling, and attention-based pooling. ASR results demonstrate substantial improvements compared to restricted self-attention alone, achieving results similar to full-sequence self-attention at a fraction of the computational cost.

* In Proc. ICASSP 2021
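
The idea in the abstract above (full-resolution attention to nearby frames plus low-resolution attention to summaries of distant frames) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes mean-pooling over non-overlapping chunks, appends summaries of the whole sequence to each query's local window, and uses the raw frames as queries, keys, and values with no learned projections. The names `dilated_self_attention`, `window`, and `chunk` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames, chunk):
    """Average non-overlapping chunks of frames (the last chunk may be shorter)."""
    pooled = []
    for start in range(0, len(frames), chunk):
        block = frames[start:start + chunk]
        dim = len(block[0])
        pooled.append([sum(f[d] for f in block) / len(block) for d in range(dim)])
    return pooled


def dilated_self_attention(frames, window, chunk):
    """Assumed variant: each query attends to neighbors within +/- window at
    full resolution, plus mean-pooled summaries of the whole sequence."""
    dim = len(frames[0])
    scale = 1.0 / math.sqrt(dim)
    summaries = mean_pool(frames, chunk)
    outputs = []
    for t, query in enumerate(frames):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        context = frames[lo:hi] + summaries  # keys double as values in this sketch
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return outputs


def context_size(num_frames, window, chunk):
    """Upper bound on keys per query: local window plus pooled summaries."""
    return min(num_frames, 2 * window + 1) + math.ceil(num_frames / chunk)
```

Under these assumptions, a 1000-frame input with `window=5` and `chunk=50` gives each query at most `context_size(1000, 5, 50)` keys instead of 1000, which is where the cost saving over full-sequence attention would come from.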

Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training

Nov 26, 2020

Sameer Khurana, Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: The performance of automatic speech recognition (ASR) systems typically degrades significantly when the training and test data domains are mismatched. In this paper, we show that self-training (ST) combined with an uncertainty-based pseudo-label filtering approach can be effectively used for domain adaptation. We propose DUST, a dropout-based uncertainty-driven self-training technique which uses agreement between multiple predictions of an ASR system obtained for different dropout settings to measure the model's uncertainty about its prediction. DUST excludes pseudo-labeled data with high uncertainties from the training, which leads to substantially improved ASR results compared to ST without filtering, and reduces the training time due to a smaller training data set. Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD as the target domains show that up to 80% of the performance of a system trained on ground-truth data can be recovered.

* Submitted to ICASSP 2021
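
The filtering step described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the published DUST procedure: agreement is measured here as the length-normalized edit distance between a dropout-free reference hypothesis and each dropout hypothesis, and an utterance is kept only if the largest such distance stays below a threshold. The `decode` callable, `num_samples`, and `threshold` are hypothetical placeholders.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def keep_pseudo_label(reference, dropout_hyps, threshold):
    """Keep an utterance only if every dropout hypothesis stays close to the
    reference hypothesis (normalized edit distance <= threshold)."""
    ref_len = max(len(reference), 1)
    worst = max(edit_distance(reference, hyp) / ref_len for hyp in dropout_hyps)
    return worst <= threshold


def filter_pseudo_labels(utterances, decode, num_samples=3, threshold=0.3):
    """decode(utt, dropout=...) is a hypothetical callable returning a token list.
    Returns (utterance, pseudo_label) pairs judged reliable enough to train on."""
    kept = []
    for utt in utterances:
        reference = decode(utt, dropout=False)
        samples = [decode(utt, dropout=True) for _ in range(num_samples)]
        if keep_pseudo_label(reference, samples, threshold):
            kept.append((utt, reference))
    return kept
```

Utterances whose hypotheses change substantially when dropout is active are dropped, so the retained set is smaller, which is consistent with the reduced training time the abstract mentions.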

Semi-Supervised Speech Recognition via Graph-based Temporal Classification

Oct 29, 2020

Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training targets. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only a temporal alignment, similarly to CTC, but also a label alignment to obtain the optimal pseudo-label sequence from the weighted graph. Results show that this approach can effectively exploit an N-best list of pseudo-labels with associated scores, outperforming standard pseudo-labeling by a large margin, with ASR results close to an oracle experiment in which the best hypotheses of the N-best lists are selected manually.

* Submitted to ICASSP 2021

Transcription Is All You Need: Learning to Separate Musical Mixtures with Score as Supervision

Oct 22, 2020

Yun-Ning Hung, Gordon Wichern, Jonathan Le Roux

Abstract: Most music source separation systems require large collections of isolated sources for training, which can be difficult to obtain. In this work, we use musical scores, which are comparatively easy to obtain, as a weak label for training a source separation system. In contrast with previous score-informed separation approaches, our system does not require isolated sources, and the score is used only as a training target and is not required for inference. Our model consists of a separator that outputs a time-frequency mask for each instrument, and a transcriptor that acts as a critic, providing both temporal and frequency supervision to guide the learning of the separator. A harmonic mask constraint is introduced as another way of leveraging score information during training, and we propose two novel adversarial losses for additional fine-tuning of both the transcriptor and the separator. Results demonstrate that using score information outperforms temporal weak-labels, and adversarial structures lead to further improvements in both separation and transcription performance.

Multi-Pass Transformer for Machine Translation

Sep 23, 2020

Peng Gao, Chiori Hori, Shijie Geng, Takaaki Hori, Jonathan Le Roux

Abstract: In contrast with previous approaches where information flows only towards deeper layers of a stack, we consider a multi-pass transformer (MPT) architecture in which earlier layers are allowed to process information in light of the output of later layers. To maintain a directed acyclic graph structure, the encoder stack of a transformer is repeated along a new multi-pass dimension, keeping the parameters tied, and information is allowed to proceed unidirectionally both towards deeper layers within an encoder stack and towards any layer of subsequent stacks. We consider both soft (i.e., continuous) and hard (i.e., discrete) connections between parallel encoder stacks, relying on a neural architecture search to find the best connection pattern in the hard case. We perform an extensive ablation study of the proposed MPT architecture and compare it with other state-of-the-art transformer architectures. Surprisingly, Base Transformer equipped with MPT can surpass the performance of Large Transformer on the challenging machine translation En-De and En-Fr datasets. In the hard connection case, the optimal connection pattern found for En-De also leads to improved performance for En-Fr.

* 10 pages, 5 figures and 2 tables

AutoClip: Adaptive Gradient Clipping for Source Separation Networks

Jul 25, 2020

Prem Seetharaman, Gordon Wichern, Bryan Pardo, Jonathan Le Roux

Abstract: Clipping the gradient is a known approach to improving gradient descent, but requires hand selection of a clipping threshold hyperparameter. We present AutoClip, a simple method for automatically and adaptively choosing a gradient clipping threshold, based on the history of gradient norms observed during training. Experimental results show that applying AutoClip results in improved generalization performance for audio source separation networks. Observation of the training dynamics of a separation network trained with and without AutoClip shows that AutoClip guides optimization into smoother parts of the loss landscape. AutoClip is very simple to implement and can be integrated readily into a variety of applications across multiple domains.

* Accepted at 2020 IEEE International Workshop on Machine Learning for Signal Processing, Sept. 21-24, 2020, Espoo, Finland
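
A threshold chosen "based on the history of gradient norms" could look like the sketch below. The abstract does not state the selection rule, so this example assumes the threshold is a fixed percentile of all gradient norms seen so far. The class name `AutoClipper`, the default percentile, and the linear-interpolation percentile are illustrative choices, not details taken from the source.

```python
import math


class AutoClipper:
    """Clip gradients to a percentile of the gradient-norm history (assumed rule)."""

    def __init__(self, percentile=10.0):
        self.percentile = percentile  # hypothetical default, in percent
        self.history = []             # gradient norms observed so far

    def threshold(self):
        """Percentile of the history, with linear interpolation between ranks."""
        ordered = sorted(self.history)
        rank = (self.percentile / 100.0) * (len(ordered) - 1)
        lo, hi = math.floor(rank), math.ceil(rank)
        frac = rank - lo
        return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

    def clip(self, grads):
        """Record the norm of grads, then rescale grads if the norm exceeds
        the current threshold. grads is a flat list of floats."""
        norm = math.sqrt(sum(g * g for g in grads))
        self.history.append(norm)
        limit = self.threshold()
        if norm > limit and norm > 0.0:
            scale = limit / norm
            return [g * scale for g in grads]
        return list(grads)
```

In a training loop, `clip` would be called once per step on the flattened gradient before the optimizer update, so no clipping threshold has to be tuned by hand; only the percentile remains as a setting.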

Spatio-Temporal Scene Graphs for Video Dialog

Jul 08, 2020

Shijie Geng, Peng Gao, Chiori Hori, Jonathan Le Roux, Anoop Cherian

Abstract: The Audio-Visual Scene-aware Dialog (AVSD) task requires an agent to engage in a natural conversation with a human about a given video. Specifically, apart from the video frames, the agent receives the audio, brief captions, and a dialog history, and the task is to produce the correct answer to a question about the video. Due to the diversity in the type of inputs, this task poses a very challenging multimodal reasoning problem. Current approaches to AVSD either use global video-level features or those from a few sampled frames, and thus lack the ability to explicitly capture relevant visual regions or their interactions for answer generation. To this end, we propose a novel spatio-temporal scene graph representation (STSGR) modeling fine-grained information flows within videos. Specifically, on an input video sequence, STSGR (i) creates a two-stream visual and semantic scene graph on every frame, (ii) conducts intra-graph reasoning using node and edge convolutions generating visual memories, and (iii) applies inter-graph aggregation to capture their temporal evolutions. These visual memories are then combined with other modalities and the question embeddings using a novel semantics-controlled multi-head shuffled transformer, which then produces the answer recursively. Our entire pipeline is trained end-to-end. We present experiments on the AVSD dataset and demonstrate state-of-the-art results. A human evaluation of the quality of our generated answers shows a 12% relative improvement over prior methods.

Detecting Audio Attacks on ASR Systems with Dropout Uncertainty

Jun 02, 2020

Tejas Jayashankar, Jonathan Le Roux, Pierre Moulin

Abstract: Various adversarial audio attacks have recently been developed to fool automatic speech recognition (ASR) systems. Here we propose a defense against such attacks based on the uncertainty introduced by dropout in neural networks. We show that our defense is able to detect attacks created through optimized perturbations and frequency masking on a state-of-the-art end-to-end ASR system. Furthermore, the defense can be made robust against attacks that are immune to noise reduction. We test our defense on Mozilla's CommonVoice dataset, the UrbanSound dataset, and an excerpt of the LibriSpeech dataset, showing that it achieves high detection accuracy in a wide range of scenarios.

Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

Feb 14, 2020

Leda Sarı, Niko Moritz, Takaaki Hori, Jonathan Le Roux

Abstract: We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.

* To appear in Proc. ICASSP 2020
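
The memory read described in the abstract above (attending over stored i-vectors and concatenating the result to the features) can be sketched as below. The scoring function is an assumption: this example uses a plain dot product between a query vector and each stored i-vector, with no learned projections, and requires the query to have the same dimension as the i-vectors. The function names are hypothetical.

```python
import math


def read_memory(query, memory):
    """Return (m_vector, weights): the attention-weighted combination of the
    stored i-vectors and the softmax weights used to form it."""
    scores = [sum(q * m for q, m in zip(query, entry)) for entry in memory]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    m_vector = [sum(w * entry[d] for w, entry in zip(weights, memory)) for d in range(dim)]
    return m_vector, weights


def append_m_vector(features, m_vector):
    """Concatenate the same M-vector to every acoustic feature frame."""
    return [frame + m_vector for frame in features]
```

Because the memory is filled from training data and read with the network's own activations, nothing in this read step needs a separate speaker embedding extractor at test time, which matches the advantage the abstract claims for M-vectors.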

End-to-End Multi-speaker Speech Recognition with Transformer

Feb 13, 2020

Xuankai Chang, Wangyou Zhang, Yanmin Qian, Jonathan Le Roux, Shinji Watanabe

Abstract: Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, under the anechoic condition in single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER.

* To appear in ICASSP 2020
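
The abstract above reports each result as a relative WER reduction together with the final WER. As a worked check of how the two relate, the sketch below inverts final = baseline × (1 − reduction) to recover the baseline WER each pair implies. The baselines it prints are back-computed approximations, not figures stated in the source, and the labels for the two reverberant results are placeholders because the abstract does not name the conditions for that pair.

```python
def implied_baseline_wer(final_wer, relative_reduction):
    """Invert final = baseline * (1 - reduction). Both arguments are in percent."""
    return final_wer / (1.0 - relative_reduction / 100.0)


# (final WER %, relative reduction %) pairs as quoted in the abstract.
reported = {
    "anechoic, single-channel": (12.1, 40.9),
    "anechoic, multi-channel": (6.4, 25.6),
    "reverberant, first result": (16.5, 41.5),
    "reverberant, second result": (15.2, 13.8),
}

for condition, (wer, reduction) in reported.items():
    baseline = implied_baseline_wer(wer, reduction)
    print(f"{condition}: implied baseline WER of about {baseline:.1f}%")
```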
