Vineel Pratap

TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch

Oct 27, 2023
Jeff Hwang, Moto Hira, Caroline Chen, Xiaohui Zhang, Zhaoheng Ni, Guangzhi Sun, Pingchuan Ma, Ruizhe Huang, Vineel Pratap, Yuekai Zhang, Anurag Kumar, Chin-Yun Yu, Chuang Zhu, Chunxi Liu, Jacob Kahn, Mirco Ravanelli, Peng Sun, Shinji Watanabe, Yangyang Shi, Yumeng Tao, Robin Scheibler, Samuele Cornell, Sean Kim, Stavros Petridis

TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
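
As a taste of the pretrained pipelines, here is a minimal sketch that runs the bundled wav2vec 2.0 ASR pipeline with greedy CTC decoding. The WAV2VEC2_ASR_BASE_960H bundle and its accessors are part of torchaudio.pipelines; the audio path is a placeholder. The beam-search CTC decoder and forced-alignment tools highlighted above live under torchaudio.models.decoder and torchaudio.functional, and are not shown here.

```python
import torch
import torchaudio

# Load a pretrained wav2vec 2.0 ASR pipeline bundled with torchaudio.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()
labels = bundle.get_labels()

# "speech.wav" is a placeholder path; any mono recording works once
# resampled to bundle.sample_rate.
waveform, sample_rate = torchaudio.load("speech.wav")
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # (batch, frames, vocab) CTC emissions

# Greedy CTC decoding: collapse repeats, drop the blank token "-".
indices = torch.unique_consecutive(emissions[0].argmax(dim=-1))
transcript = "".join(labels[i] for i in indices if labels[i] != "-")
print(transcript.replace("|", " "))
```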

Scaling Speech Technology to 1,000+ Languages

May 22, 2023
Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and effectively leveraging self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, as well as a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
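
The released MMS checkpoints can be run through the Hugging Face transformers integration. The sketch below transcribes a recording with the multilingual ASR model after switching to a French adapter; the model ID, adapter-loading calls, and language code reflect the public MMS release rather than the paper itself, so treat them as assumptions to verify against the release documentation.

```python
import torch
import torchaudio
from transformers import AutoProcessor, Wav2Vec2ForCTC

# "facebook/mms-1b-all" is the publicly released multilingual ASR checkpoint;
# the ID and the adapter API below are assumptions based on the MMS release.
model_id = "facebook/mms-1b-all"
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Switch the language-specific adapter and vocabulary, e.g. to French ("fra").
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

# Placeholder audio path; the model expects 16 kHz mono input.
waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.inference_mode():
    logits = model(**inputs).logits

ids = torch.argmax(logits, dim=-1)[0]
print(processor.decode(ids))
```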

Flashlight: Enabling Innovation in Tools for Machine Learning

Jan 29, 2022
Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, Benoit Steiner, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increase, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototyping new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.

Star Temporal Classification: Sequence Classification with Partially Labeled Data

Jan 28, 2022
Vineel Pratap, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC), which uses a special star token to allow alignments which include all possible tokens whenever a token could be missing. We express STC as the composition of weighted finite-state transducers (WFSTs) and use GTN (a framework for automatic differentiation with WFSTs) to compute gradients. We perform extensive experiments on automatic speech recognition. These experiments show that STC can recover most of the performance of the supervised baseline when up to 70% of the labels are missing. We also perform experiments in handwriting recognition to show that our method easily applies to other sequence classification tasks.
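
A rough way to picture the STC label graph: before each observed label and after the last one, a wildcard <star> self-loop absorbs any number of missing tokens. The sketch below builds such a graph as plain Python arc tuples purely as an illustration of the construction; in the paper the graph is a WFST that is composed with the emission graph and differentiated through with GTN, and all names here are hypothetical.

```python
def build_stc_label_graph(partial_labels):
    """Build a star-augmented label acceptor for a partially labeled sequence.

    States 0..N accept the N observed labels in order; before each observed
    label and after the last one, a <star> self-loop allows any number of
    missing tokens. Arcs are (src_state, dst_state, label) tuples.
    """
    arcs = []
    n = len(partial_labels)
    for i, label in enumerate(partial_labels):
        arcs.append((i, i, "<star>"))       # absorb tokens missing before label i
        arcs.append((i, i + 1, label))      # consume the observed label
    arcs.append((n, n, "<star>"))           # absorb tokens missing at the end
    start_state, accept_state = 0, n
    return arcs, start_state, accept_state


# Example: only "cat" and "mat" survive from a longer transcript.
arcs, start, accept = build_stc_label_graph(["cat", "mat"])
for arc in arcs:
    print(arc)
# (0, 0, '<star>'), (0, 1, 'cat'), (1, 1, '<star>'), (1, 2, 'mat'), (2, 2, '<star>')
```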

Word Order Does Not Matter For Speech Recognition

Oct 18, 2021
Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

In this paper, we study the training of an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using a LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated from this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves 2.3%/4.6% WER on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
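
The frame-aggregation step lends itself to a compact PyTorch sketch: per-frame word scores are pooled over time with logsumexp, normalized, and matched against a bag-of-words target distribution via cross-entropy. The shapes and random stand-ins below are illustrative only, not the paper's code.

```python
import torch
import torch.nn.functional as F

batch, frames, vocab = 4, 200, 1000

# Per-frame word scores from a word-level acoustic model (random stand-in here).
frame_logits = torch.randn(batch, frames, vocab, requires_grad=True)

# Aggregate over frames with LogSumExp, then normalize to a log-distribution
# over the word vocabulary for each utterance.
utt_logits = torch.logsumexp(frame_logits, dim=1)          # (batch, vocab)
log_probs = F.log_softmax(utt_logits, dim=-1)

# Bag-of-words target: the ground-truth word distribution, ignoring order.
# Faked here by normalizing random word counts.
word_counts = torch.randint(0, 3, (batch, vocab)).float()
target_dist = word_counts / word_counts.sum(dim=-1, keepdim=True).clamp(min=1.0)

# Cross-entropy between predicted and target word distributions.
loss = -(target_dist * log_probs).sum(dim=-1).mean()
loss.backward()
print(loss.item())
```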

Parallel Composition of Weighted Finite-State Transducers

Oct 06, 2021
Shubho Sengupta, Vineel Pratap, Awni Hannun

Finite-state transducers (FSTs) are frequently used in speech recognition. Transducer composition is an essential operation for combining different sources of information at different granularities. However, composition is also one of the more computationally expensive operations. Due to the heterogeneous structure of FSTs, parallel algorithms for composition are suboptimal in efficiency, generality, or both. We propose an algorithm for parallel composition and implement it on graphics processing units. We benchmark our parallel algorithm on the composition of random graphs and the composition of graphs commonly used in speech recognition. The parallel composition scales better with the size of the input graphs and for large graphs can be as much as 10 to 30 times faster than a sequential CPU algorithm.
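
For reference, composition is a product construction over states: an arc of the first transducer pairs with an arc of the second whenever the first's output label matches the second's input label, and weights combine in the semiring. The sketch below is a naive sequential version (no epsilon arcs, no reachability pruning), i.e. the baseline operation the paper parallelizes on GPUs, not the parallel algorithm itself.

```python
from collections import namedtuple

# An arc maps an input label to an output label with a weight (log semiring).
Arc = namedtuple("Arc", "src dst ilabel olabel weight")

def compose(arcs_a, arcs_b):
    """Naive composition of two FSTs given as arc lists (no epsilon handling).

    The composed machine has pair states (state_a, state_b); an arc of A and
    an arc of B combine when A's output label equals B's input label, and
    their weights add.
    """
    out = []
    for a in arcs_a:
        for b in arcs_b:
            if a.olabel == b.ilabel:
                out.append(Arc((a.src, b.src), (a.dst, b.dst),
                               a.ilabel, b.olabel, a.weight + b.weight))
    return out

# Toy example: A maps "k" -> "c", B maps "c" -> "cat".
a = [Arc(0, 1, "k", "c", 0.5)]
b = [Arc(0, 1, "c", "cat", 1.0)]
for arc in compose(a, b):
    print(arc)  # Arc(src=(0, 0), dst=(1, 1), ilabel='k', olabel='cat', weight=1.5)
```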

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Apr 02, 2021
Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

Self-supervised learning of speech representations has been a very active research area, but most work is focused on a single domain, such as read audio books, for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data used for pre-training differs from the domain of the labeled data used for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.

MLS: A Large-Scale Multilingual Dataset for Speech Research

Dec 19, 2020
Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for the other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at http://www.openslr.org.

* Interspeech 2020  
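
A hedged sketch of reading an MLS split as distributed on OpenSLR: each split ships a transcripts.txt with tab-separated utterance IDs and transcripts, and audio files laid out by speaker and chapter. The directory layout, file extension, and names below are assumptions based on the public release, not details given in the abstract, and should be checked against the download.

```python
import csv
from pathlib import Path

import torchaudio

# Placeholder path to an extracted MLS language/split, e.g. mls_german/test.
split_dir = Path("mls_german/test")

# Assumed layout: "<speaker>_<chapter>_<utt>\t<transcript>" per line.
with open(split_dir / "transcripts.txt", encoding="utf-8") as f:
    transcripts = dict(csv.reader(f, delimiter="\t"))

for utt_id, text in list(transcripts.items())[:3]:
    speaker, chapter, _ = utt_id.split("_")
    # Audio may be .flac or .opus depending on which archive was downloaded.
    audio_path = split_dir / "audio" / speaker / chapter / f"{utt_id}.flac"
    waveform, sample_rate = torchaudio.load(str(audio_path))
    print(utt_id, sample_rate, waveform.shape, text[:60])
```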