Alexei Baevski

Unsupervised Cross-lingual Representation Learning for Speech Recognition

Jun 24, 2020
Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

This paper presents XLSR, which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on a concurrently introduced self-supervised model which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data, and experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining. On the CommonVoice benchmark, XLSR shows a relative phoneme error rate reduction of 72% compared to the best known results. On BABEL, our approach improves word error rate by 16% relative compared to the strongest comparable system. Our approach enables a single multilingual speech recognition model which is competitive with strong individual models. Analysis shows that the latent discrete speech representations are shared across languages, with increased sharing for related languages.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Jun 20, 2020
Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. We set a new state of the art on both the 100-hour subset of Librispeech and on TIMIT phoneme recognition. When lowering the amount of labeled data to one hour, our model outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 5.7/10.1 WER on the clean/noisy test sets of Librispeech. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Fine-tuning on all of Librispeech achieves 1.9/3.5 WER using a simple baseline model architecture. We will release code and models.
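
As a rough illustration of the contrastive task mentioned in this abstract, the sketch below scores a context vector at one masked position against the true quantized latent and a few distractors. The function names, the cosine-similarity scoring, and the temperature value of 0.1 are assumptions made for this example; they are not taken from the abstract or from the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of picking `target` among all candidates.

    `context` stands in for the model output at a masked time step,
    `target` for the quantized latent at that step, and `distractors`
    for quantized latents sampled from other steps (all hypothetical).
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_norm - logits[0]


# The loss is small when the context points at the true target
# and large when it points at a distractor instead.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
print(good, bad)
```

The sketch covers a single masked position; a training objective would average this quantity over all masked positions, and any additional loss terms used in the paper are not shown here.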

Effectiveness of self-supervised pre-training for speech recognition

Nov 10, 2019
Alexei Baevski, Michael Auli, Abdelrahman Mohamed

We present pre-training approaches for self-supervised representation learning of speech data. A BERT-style masked language model loss on discrete features is compared with an InfoNCE-based contrastive loss on continuous speech features. The pre-trained models are then fine-tuned with a Connectionist Temporal Classification (CTC) loss to predict target character sequences. To study the impact of stacking multiple feature learning modules trained using different self-supervised loss functions, we test the discrete and continuous BERT pre-training approaches on spectral features and on learned acoustic representations, showing synergistic behaviour between acoustically motivated and masked language model loss functions. In low-resource conditions using only 10 hours of labeled data, we achieve Word Error Rates (WER) of 10.2% and 23.5% on the standard test "clean" and "other" benchmarks of the Librispeech dataset, which is almost on par with previously published work that uses 10 times more labeled data. Moreover, compared to previous work that uses two models in tandem, by using one model for both BERT pre-training and fine-tuning, our model provides an average relative WER reduction of 9%.
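
For readers unfamiliar with CTC, the sketch below shows only the standard collapsing rule that maps a per-frame label sequence to a character sequence: repeated labels are merged, then blanks are dropped. The function name and the use of `_` as the blank symbol are assumptions for this example, and the sketch does not implement the CTC loss itself or any part of the paper's models.

```python
def ctc_collapse(frame_labels, blank="_"):
    """Merge consecutive repeats, then remove blank symbols."""
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return "".join(output)


# A blank between two identical labels keeps both of them.
print(ctc_collapse("hh_e_ll_lo_"))
```

The blank is what lets the output contain doubled characters: without the `_` between the two runs of `l`, they would be merged into one.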

vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

Oct 12, 2019
Alexei Baevski, Steffen Schneider, Michael Auli

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
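
The two quantization options named in this abstract can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: `nearest_code` shows only the assignment step of a k-means-style quantizer (no codebook update), and `gumbel_softmax` shows only the forward sampling step (no gradient handling). Both function names and all numeric values are hypothetical.

```python
import math
import random


def nearest_code(vector, codebook):
    """Return (index, entry) of the codebook entry closest to `vector`."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Sample soft selection probabilities and the index of the hard choice."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp away from 0 so that log(u) is always defined.
        u = max(rng.random(), 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    top = max(noisy)
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs, probs.index(max(probs))


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(nearest_code([0.9, 0.1], codebook))
print(gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=random.Random(0)))
```

Either route turns a dense vector into a discrete index, which is what allows models that expect token-like inputs to be applied afterwards.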

Facebook FAIR's WMT19 News Translation Task Submission

Jul 15, 2019
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov

This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling toolkit which rely on sampled back-translations. This year we experiment with different bitext data filtering schemes, as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the human evaluation campaign. On En->De, our system significantly outperforms other systems as well as human translations. This system improves upon our WMT'18 submission by 4.5 BLEU points.
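
To make the reranking step more concrete, here is a minimal sketch of noisy channel reranking in its common form: each candidate translation is scored by a weighted sum of log-probabilities from a direct model, a channel (target-to-source) model, and a language model. The weights, the dictionary keys, and the example scores are all hypothetical, and the abstract does not specify the exact scoring formula or any length normalization used in the submission.

```python
def noisy_channel_score(direct, channel, lm, channel_weight=1.0, lm_weight=1.0):
    """Combine log P(y|x), log P(x|y) and log P(y) into one score."""
    return direct + channel_weight * channel + lm_weight * lm


def rerank(candidates, channel_weight=1.0, lm_weight=1.0):
    """Sort candidate translations from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda c: noisy_channel_score(
            c["direct"], c["channel"], c["lm"], channel_weight, lm_weight
        ),
        reverse=True,
    )


# Made-up log-probabilities for two n-best candidates.
candidates = [
    {"text": "candidate A", "direct": -1.0, "channel": -5.0, "lm": -4.0},
    {"text": "candidate B", "direct": -1.5, "channel": -2.0, "lm": -3.0},
]
print(rerank(candidates)[0]["text"])
```

With both weights set to zero the ranking falls back to the direct model alone, which makes the effect of the extra models easy to inspect.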

* 7 pages; WMT

wav2vec: Unsupervised Pre-training for Speech Recognition

May 24, 2019
Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 32% when only a few hours of transcribed data are available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature, while using three orders of magnitude less labeled training data.
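
The noise contrastive binary classification task can be illustrated as follows: the model should assign a high score to the true future latent and low scores to sampled negatives, with each decision treated as a binary classification. The sketch below is an assumption-laden simplification: it uses a plain dot product as the score and omits any learned projection or weighting of the negative term, and all names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def binary_contrastive_loss(context, true_future, negatives):
    """Push the true future latent up and each negative sample down."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    loss = -log_sigmoid(dot(context, true_future))
    for negative in negatives:
        loss -= log_sigmoid(-dot(context, negative))
    return loss


# Low loss when the context agrees with the true latent, high loss otherwise.
print(binary_contrastive_loss([2.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]]))
print(binary_contrastive_loss([2.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]]))
```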

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

Apr 01, 2019
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found at https://www.youtube.com/watch?v=OtgDdWtHvto

* NAACL 2019 Demo paper

Pre-trained Language Model Representations for Language Generation

Apr 01, 2019
Sergey Edunov, Alexei Baevski, Michael Auli

Pre-trained language model representations have been successful in a wide range of language understanding tasks. In this paper, we examine different strategies to integrate pre-trained representations into sequence-to-sequence models and apply them to neural machine translation and abstractive summarization. We find that pre-trained representations are most effective when added to the encoder network, which slows inference by only 14%. Our experiments in machine translation show gains of up to 5.3 BLEU in a simulated resource-poor setup. While returns diminish with more labeled data, we still observe improvements when millions of sentence-pairs are available. Finally, on abstractive summarization we achieve a new state of the art on the full text version of CNN/DailyMail.

* NAACL 2019

Cloze-driven Pretraining of Self-attention Networks

Mar 19, 2019
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, Michael Auli

We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with the concurrently introduced BERT model. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.
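
The cloze-style task described here, where each word is ablated and predicted from the rest of the text, can be pictured with a small data-preparation sketch. The function name and the `<mask>` placeholder are assumptions for illustration; the abstract does not describe how the model consumes the ablated input.

```python
def cloze_examples(tokens, mask_token="<mask>"):
    """Build one (masked_context, target_word) pair per position."""
    examples = []
    for i, word in enumerate(tokens):
        # Replace only position i and keep every other word as context.
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, word))
    return examples


for masked, target in cloze_examples(["the", "cat", "sat"]):
    print(masked, "->", target)
```

Each sentence of n words yields n prediction targets, and every target can draw on context from both its left and its right.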