Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ronan Collobert

Pseudo-Labeling for Massively Multilingual Speech Recognition

Oct 30, 2021

Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

Figure 1 for Pseudo-Labeling for Massively Multilingual Speech Recognition

Figure 2 for Pseudo-Labeling for Massively Multilingual Speech Recognition

Figure 3 for Pseudo-Labeling for Massively Multilingual Speech Recognition

Figure 4 for Pseudo-Labeling for Massively Multilingual Speech Recognition

Abstract:Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised learning on a target language, generate pseudo-labels for that language, and train a final model using pseudo-labels for all languages, either from scratch or by fine-tuning. Experiments on the labeled Common Voice and unlabeled VoxPopuli datasets show that our recipe can yield a model with better performance for many languages that also transfers well to LibriSpeech.

Via

Access Paper or Ask Questions

Word Order Does Not Matter For Speech Recognition

Oct 18, 2021

Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

Figure 1 for Word Order Does Not Matter For Speech Recognition

Figure 2 for Word Order Does Not Matter For Speech Recognition

Figure 3 for Word Order Does Not Matter For Speech Recognition

Figure 4 for Word Order Does Not Matter For Speech Recognition

Abstract:In this paper, we study training of automatic speech recognition system in a weakly supervised setting where the order of words in transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using LogSumExp operation and uses a cross-entropy loss to match with the ground-truth words distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using Connectionist Temporal Classification loss. Our system achieves 2.3%/4.6% on test-clean/test-other subsets of LibriSpeech, which closely matches with the supervised baseline's performance.

Via

Access Paper or Ask Questions

Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Jun 14, 2021

Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

Figure 1 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Figure 2 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Figure 3 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Figure 4 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Abstract:In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised training. The proposed approach uses a teacher model which is updated as the exponential moving average of the student model parameters. This can be seen as a continuous version of the iterative pseudo-labeling approach for semi-supervised training. It is applicable for different training criteria, and in this paper we demonstrate it for frame-level hybrid hidden Markov model - deep neural network (HMM-DNN) models and sequence-level connectionist temporal classification (CTC) based models. The proposed approach shows more than 10% word error rate (WER) reduction over standard teacher-student training and more than 50\% relative WER reduction over 10 hour supervised baseline when using large scale realistic unsupervised public videos in UK English and Italian languages.

* 4 figures, 7 pages; fixed author list going out of margin

Via

Access Paper or Ask Questions

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Jun 06, 2021

Tatiana Likhomanenko, Qiantong Xu, Ronan Collobert, Gabriel Synnaeve, Alex Rogozhnikov

Figure 1 for CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Figure 2 for CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Figure 3 for CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Figure 4 for CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Abstract:Without positional information, attention-based transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed transformer models positional information. Absolute positional embeddings are simple to implement, but suffer from generalization issues when evaluating on sequences of different length than those seen at training time. Relative positions are more robust to length change, but are more complex to implement and yield inferior model throughput. In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings, which keeps the advantages of both absolute (simplicity and speed) and relative position embeddings (better generalization). In addition, our empirical evaluation on state-of-the-art models in machine translation, image and speech recognition demonstrates that CAPE leads to better generalization performance as well as increased stability with respect to training hyper-parameters.

Via

Access Paper or Ask Questions

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Apr 02, 2021

Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve(+1 more)

Figure 1 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Figure 2 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Figure 3 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Figure 4 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Abstract:Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.

Via

Access Paper or Ask Questions

MLS: A Large-Scale Multilingual Dataset for Speech Research

Dec 19, 2020

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

Figure 1 for MLS: A Large-Scale Multilingual Dataset for Speech Research

Figure 2 for MLS: A Large-Scale Multilingual Dataset for Speech Research

Figure 3 for MLS: A Large-Scale Multilingual Dataset for Speech Research

Figure 4 for MLS: A Large-Scale Multilingual Dataset for Speech Research

Abstract:This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models and for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.

* Interspeech 2020

Via

Access Paper or Ask Questions

Joint Masked CPC and CTC Training for ASR

Oct 30, 2020

Chaitanya Talnikar, Tatiana Likhomanenko, Ronan Collobert, Gabriel Synnaeve

Figure 1 for Joint Masked CPC and CTC Training for ASR

Figure 2 for Joint Masked CPC and CTC Training for ASR

Figure 3 for Joint Masked CPC and CTC Training for ASR

Figure 4 for Joint Masked CPC and CTC Training for ASR

Abstract:Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). But, training SSL models like wav2vec~2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and the supervised audio-to-text alignment loss Connectionist Temporal Classification (CTC). We show that this joint training method directly optimizes performance for the downstream ASR task using unsupervised data while achieving similar word error rates to wav2vec~2.0 on the Librispeech 100-hour dataset. Finally, we postulate that solving the contrastive task is a regularization for the supervised CTC loss.

Via

Access Paper or Ask Questions

Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Oct 22, 2020

Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

Figure 1 for Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Figure 2 for Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Figure 3 for Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Figure 4 for Rethinking Evaluation in ASR: Are Our Models Robust Enough?

Abstract:Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets -- in particular, if models trained on a single dataset transfer to other (possibly out-of-domain) datasets. Further, we demonstrate that when a large enough set of benchmarks is used, average word error rate (WER) performance over them provides a good proxy for performance on real-world data. Finally, we show that training a single acoustic model on the most widely-used datasets -- combined -- reaches competitive performance on both research and real-world benchmarks.

Via

Access Paper or Ask Questions

slimIPL: Language-Model-Free Iterative Pseudo-Labeling

Oct 22, 2020

Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert

Figure 1 for slimIPL: Language-Model-Free Iterative Pseudo-Labeling

Figure 2 for slimIPL: Language-Model-Free Iterative Pseudo-Labeling

Figure 3 for slimIPL: Language-Model-Free Iterative Pseudo-Labeling

Figure 4 for slimIPL: Language-Model-Free Iterative Pseudo-Labeling

Abstract:Recent results in end-to-end ASR have demonstrated the efficacy of simple pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to further increase performance in ASR. We improve upon the IPL algorithm: as the model learns, we propose to iteratively re-generate transcriptions with hard labels (the most probable tokens) assignments, that is without a language model. We call this approach Language-Model-Free IPL (slimIPL) and we give a resultant training setup for CTC and seq2seq models. At inference, our experiments show that decoding with a strong language model is more beneficial with slimIPL than IPL, asIPL exhibits some language model over-fitting issues. Compared to prior work on semi-supervised and unsupervised approaches, slimIPL not only simplifies the training process, but also achieves competitive and state-of-the-art results on LibriSpeech test sets in both standard and low-resource settings.

Via

Access Paper or Ask Questions

Self-training and Pre-training are Complementary for Speech Recognition

Oct 22, 2020

Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Figure 1 for Self-training and Pre-training are Complementary for Speech Recognition

Figure 2 for Self-training and Pre-training are Complementary for Speech Recognition

Figure 3 for Self-training and Pre-training are Complementary for Speech Recognition

Figure 4 for Self-training and Pre-training are Complementary for Speech Recognition

Abstract:Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of labeled data from Libri-light as well as 53k hours of unlabeled data from LibriVox achieves WERs of 3.0%/5.2% on the clean and other test sets of Librispeech - rivaling the best published systems trained on 960 hours of labeled data only a year ago. Training on all labeled data of Librispeech achieves WERs of 1.5%/3.1%.

Via

Access Paper or Ask Questions