Ronan Collobert

Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters

Jul 08, 2020

Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Abstract: We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages and, overall, simplifying the deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amounts of training data per language (from 100 hours to 1,100 hours). We compare three variants of multilingual training: a single joint model that does not know the input language, a joint model that uses this information, and a model with multiple heads (one per language cluster). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular on low-resource languages. We see 20.9%, 23% and 28.8% average relative WER reductions compared to monolingual baselines for the joint model, the joint model with language input, and the multi-head model, respectively. To our knowledge, this is the first work studying multilingual ASR at massive scale, with more than 50 languages and more than 16,000 hours of audio across them.
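
The relative WER reductions quoted in this abstract follow the usual definition, (baseline - new) / baseline. A minimal sketch of that arithmetic is below; the function name and the example numbers are illustrative assumptions, not values from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Computed as 100 * (baseline - new) / baseline, so a drop from
    40% WER to 30% WER is a 25% relative reduction (10 points absolute).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical monolingual vs. multilingual WERs (not taken from the paper).
reduction = relative_wer_reduction(baseline_wer=40.0, new_wer=30.0)
print(f"{reduction:.1f}% relative reduction")
```

An average over languages, as reported in the abstract, would presumably apply this per language and then average the results; the abstract does not spell out that detail.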

Unsupervised Cross-lingual Representation Learning for Speech Recognition

Jun 24, 2020

Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

Abstract: This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model which is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.

Iterative Pseudo-Labeling for Speech Recognition

May 19, 2020

Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

Abstract: Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word error rate on the Librispeech test sets in both standard and low-resource settings. We also study the effect of language models trained on different corpora to show IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the Librispeech training transcriptions to foster research in low-resource, semi-supervised ASR.
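
The loop this abstract describes (label a subset of unlabeled data with the current model, then update on labeled plus pseudo-labeled data, and repeat) can be sketched on a toy problem. Everything below is an illustrative assumption: a nearest-centroid classifier on 1-D points stands in for the acoustic model, the model is re-fit rather than fine-tuned, and language-model decoding and data augmentation are omitted.

```python
import random


def fit_centroids(points, labels):
    """Toy 'model': the per-class mean of 1-D points."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x):
    """Return the label of the nearest centroid."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


def iterative_pseudo_labeling(labeled_x, labeled_y, unlabeled_x,
                              iterations=5, subset_size=20, seed=0):
    rng = random.Random(seed)
    # Bootstrap the model on labeled data only.
    model = fit_centroids(labeled_x, labeled_y)
    for _ in range(iterations):
        # Draw a fresh subset of unlabeled data and label it with the current model.
        subset = rng.sample(unlabeled_x, min(subset_size, len(unlabeled_x)))
        pseudo_y = [predict(model, x) for x in subset]
        # Update on labeled + pseudo-labeled data (a re-fit here, not a fine-tune).
        model = fit_centroids(labeled_x + subset, labeled_y + pseudo_y)
    return model


# Two well-separated clusters, with only one labeled point per class.
data_rng = random.Random(1)
unlabeled = ([data_rng.gauss(0.0, 0.5) for _ in range(100)]
             + [data_rng.gauss(4.0, 0.5) for _ in range(100)])
model = iterative_pseudo_labeling([0.2, 3.8], ["a", "b"], unlabeled)
print(model)
```

Re-labeling a new subset at every iteration, instead of labeling all unlabeled data once, is what lets the pseudo-labels improve as the model does.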

Multi-scale Transformer Language Models

May 01, 2020

Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau

Abstract: We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs. memory footprint trade-offs; e.g., we show that it is possible to train a hierarchical variant with 30 layers that has a 23% smaller memory footprint and better perplexity, compared to a vanilla transformer with fewer than half the number of layers, on the Toronto BookCorpus. We analyze the advantages of learned representations at multiple scales in terms of memory footprint, compute time, and perplexity, which are particularly appealing given the quadratic scaling of transformers' run time and memory usage with respect to sequence length.

Scaling Up Online Speech Recognition Using ConvNets

Jan 27, 2020

Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

Abstract: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency, and accuracy, and discuss how these metrics can be tuned based on user requirements.
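
For readers unfamiliar with CTC, the rule that maps per-frame labels to an output string is to merge consecutive repeats and then drop blanks. The sketch below shows only that collapse step on a best-path (greedy) label sequence; it is not the paper's beam search decoder, and the blank symbol `_` is an assumption made for this example.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this sketch


def ctc_collapse(frame_labels):
    """Apply the CTC collapse rule: merge repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# The blank between the two runs of "l" keeps the double letter in "hello".
print(ctc_collapse(list("hh_e_ll_lo__")))
```

Because a blank must separate genuinely repeated characters, the order of the two steps matters: removing blanks first would merge the two runs of "l" into one.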

Libri-Light: A Benchmark for ASR with Limited or No Supervision

Dec 17, 2019

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen (+5 more)

Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.

End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

Nov 19, 2019

Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Edouard Grave, Tatiana Likhomanenko, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

Abstract: We study ResNet-, Time-Depth Separable ConvNets-, and Transformer-based acoustic models, trained with CTC or Seq2Seq criteria. We perform experiments on the LibriSpeech dataset, with and without LM decoding, optionally with beam rescoring. We reach 5.18% WER with external language models for decoding and rescoring. Additionally, we leverage the unlabeled data from LibriVox by doing semi-supervised training and show that it is possible to reach 5.29% WER on test-other without decoding, and 4.11% WER with decoding and rescoring, with only the standard 960 hours from LibriSpeech as labeled data.
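
The WER figures in this and the other abstracts refer to word error rate, conventionally the word-level edit distance between hypothesis and reference divided by the number of reference words. A minimal sketch is below, assuming whitespace tokenization and no text normalization; the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Since insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.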

Word-level Speech Recognition with a Dynamic Lexicon

Jun 10, 2019

Ronan Collobert, Awni Hannun, Gabriel Synnaeve

Abstract: We propose a direct-to-word sequence model with a dynamic lexicon. Our word network constructs word embeddings dynamically from the character-level tokens. The word network can be integrated seamlessly with arbitrary sequence models, including Connectionist Temporal Classification and encoder-decoder models with attention. Sub-word units are commonly used in speech recognition, yet are generated without the use of acoustic context. We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. Furthermore, we empirically validate that the word-level embeddings we learn contain significant acoustic information, making them more suitable for use in speech recognition. We also show that our direct-to-word approach retains the ability to predict words not seen at training time without any retraining.

wav2vec: Unsupervised Pre-training for Speech Recognition

May 24, 2019

Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 32% when only a few hours of transcribed data are available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using three orders of magnitude less labeled training data.
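
A noise contrastive binary classification task, in its generic form, asks the model to score a true sample high and distractor samples low under a logistic loss. The sketch below shows that generic loss on plain Python lists; the function names, the dot-product scoring, and the toy vectors are assumptions for illustration, and model-specific details such as projections or loss weighting are left out.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_binary_loss(context, positive, negatives):
    """Logistic loss: classify the positive as 'true' and each negative as 'noise'."""
    loss = -log_sigmoid(dot(context, positive))
    loss -= sum(log_sigmoid(-dot(context, neg)) for neg in negatives)
    return loss


# Toy vectors: the loss is low when the context aligns with the positive
# sample and opposes the distractor, and high when the roles are swapped.
good = contrastive_binary_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = contrastive_binary_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
print(good, bad)
```

Minimizing this loss pushes context representations toward the samples they should predict and away from the distractors, without needing any transcriptions.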

Who Needs Words? Lexicon-Free Speech Recognition

Apr 09, 2019

Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

Abstract: Lexicon-free speech recognition naturally deals with the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LMs) can perform as well as word-based LMs for speech recognition, in terms of word error rate (WER), even without restricting the decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large (character) contexts, which is key for good speech recognition performance downstream. We specifically show that the lexicon-free decoding performance (WER) on utterances with OOV words using character-based LMs is better than lexicon-based decoding, with either character-based or word-based LMs.

* 8 pages, 1 figure
