Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ankur Bapna

Dima

Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Mar 03, 2023

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang(+17 more)

Figure 1 for Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Figure 2 for Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Figure 3 for Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Figure 4 for Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

Abstract:We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.

* 20 pages, 7 figures, 8 tables

Via

Access Paper or Ask Questions

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Mar 03, 2023

Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Yu Zhang, Wei Han, Ankur Bapna, Michiel Bacchiani

Figure 1 for Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Figure 2 for Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Figure 3 for Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Figure 4 for Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

Abstract:Speech restoration (SR) is a task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio-quality. To make our SR model robust against various degradation, we use (i) a speech representation extracted from w2v-BERT for the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradation and (ii) enable us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/

* Work in progress

Via

Access Paper or Ask Questions

Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models

Dec 19, 2022

Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna

$Figure 1 for Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models$

$Figure 2 for Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models$

$Figure 3 for Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models$

$Figure 4 for Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models$

Abstract:We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST) and Machine Translation (MT), in over 100 languages. By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder, for both unlabeled speech and text, while utilizing the supervised tasks to improve cross-lingual and cross-modal representation alignment within the model. On CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained on public datasets, improving on xx-en translation over the previous best by 1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder, despite using a relatively weaker sequence-to-sequence architecture. On text understanding tasks, our model improves by more than 6\% over mSLAM on XNLI, getting closer to the performance of mT5 models of comparable capacity on XNLI and TydiQA, paving the way towards a single model for all speech and text understanding tasks.

Via

Access Paper or Ask Questions

Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

Oct 27, 2022

Takaaki Saeki, Heiga Zen, Zhehuai Chen, Nobuyuki Morioka, Gary Wang, Yu Zhang, Ankur Bapna, Andrew Rosenberg, Bhuvana Ramabhadran

Abstract:This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty to scale multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages, and 2) they can synthesize reasonably intelligible and naturally sounding speech for unseen languages where no high-quality paired TTS data is available.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Oct 18, 2022

Zhehuai Chen, Ankur Bapna, Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Pedro Moreno, Nanxin Chen

Figure 1 for Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Figure 2 for Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Figure 3 for Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Figure 4 for Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

Abstract:Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover $102$ languages, where transcribed speech is available in $52$ of these languages and can be used to improve end-to-end ASR quality on the remaining $50$. First, we show that by combining speech representations with byte-level text representations and use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8\% to 30.8\%, a relative reduction of 53\%. Second, using a subset of South Asian languages we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5\% relative and reduces the CER of 19 languages below 15\%.

* Accepted by SLT 2022

Via

Access Paper or Ask Questions

JOIST: A Joint Speech and Text Streaming Model For ASR

Oct 13, 2022

Tara N. Sainath, Rohit Prabhavalkar, Ankur Bapna, Yu Zhang, Zhouyuan Huo, Zhehuai Chen, Bo Li, Weiran Wang, Trevor Strohman

Figure 1 for JOIST: A Joint Speech and Text Streaming Model For ASR

Figure 2 for JOIST: A Joint Speech and Text Streaming Model For ASR

Figure 3 for JOIST: A Joint Speech and Text Streaming Model For ASR

Figure 4 for JOIST: A Joint Speech and Text Streaming Model For ASR

Abstract:We present JOIST, an algorithm to train a streaming, cascaded, encoder end-to-end (E2E) model with both speech-text paired inputs, and text-only unpaired inputs. Unlike previous works, we explore joint training with both modalities, rather than pre-training and fine-tuning. In addition, we explore JOIST using a streaming E2E model with an order of magnitude more data, which are also novelties compared to previous works. Through a series of ablation studies, we explore different types of text modeling, including how to model the length of the text sequence and the appropriate text sub-word unit representation. We find that best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text. In addition, we quantitatively show that JOIST maintains streaming capabilities, which is important for good user-level experience.

Via

Access Paper or Ask Questions

SQuId: Measuring Speech Naturalness in Many Languages

Oct 12, 2022

Thibault Sellam, Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, Jason Riesa

Figure 1 for SQuId: Measuring Speech Naturalness in Many Languages

Figure 2 for SQuId: Measuring Speech Naturalness in Many Languages

Figure 3 for SQuId: Measuring Speech Naturalness in Many Languages

Figure 4 for SQuId: Measuring Speech Naturalness in Many Languages

Abstract:Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales-the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task, the model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effect of our design decision, e.g., model size, pre-training diversity, and language rebalancing with several ablation experiments.

Via

Access Paper or Ask Questions

FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

May 25, 2022

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna

Figure 1 for FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Figure 2 for FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Figure 3 for FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Figure 4 for FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech

Abstract:We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark. FLEURS is an n-way parallel speech dataset in 102 languages built on top of the machine translation FLoRes-101 benchmark, with approximately 12 hours of speech supervision per language. FLEURS can be used for a variety of speech tasks, including Automatic Speech Recognition (ASR), Speech Language Identification (Speech LangID), Translation and Retrieval. In this paper, we provide baselines for the tasks based on multilingual pre-trained models like mSLAM. The goal of FLEURS is to enable speech technology in more languages and catalyze research in low-resource speech understanding.

Via

Access Paper or Ask Questions

Building Machine Translation Systems for the Next Thousand Languages

May 16, 2022

Ankur Bapna, Isaac Caswell, Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey(+14 more)

Figure 1 for Building Machine Translation Systems for the Next Thousand Languages

Figure 2 for Building Machine Translation Systems for the Next Thousand Languages

Figure 3 for Building Machine Translation Systems for the Next Thousand Languages

Figure 4 for Building Machine Translation Systems for the Next Thousand Languages

Abstract:In this paper we share findings from our effort to build practical machine translation (MT) systems capable of translating across over one thousand languages. We describe results in three research domains: (i) Building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) Developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) Studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.

* V2: updated with some details from 24-language Google Translate launch in May 2022

Via

Access Paper or Ask Questions

XTREME-S: Evaluating Cross-lingual Speech Representations

Apr 13, 2022

Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale(+9 more)

Figure 1 for XTREME-S: Evaluating Cross-lingual Speech Representations

Figure 2 for XTREME-S: Evaluating Cross-lingual Speech Representations

Figure 3 for XTREME-S: Evaluating Cross-lingual Speech Representations

Figure 4 for XTREME-S: Evaluating Cross-lingual Speech Representations

Abstract:We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning. This paper describes the new benchmark and establishes the first speech-only and speech-text baselines using XLS-R and mSLAM on all downstream tasks. We motivate the design choices and detail how to use the benchmark. Datasets and fine-tuning scripts are made easily accessible at https://hf.co/datasets/google/xtreme_s.

* Minor fix: language code for Filipino (Tagalog), "tg" -> "tl"

Via

Access Paper or Ask Questions