Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Wei-Ning Hsu

Text-Free Prosody-Aware Generative Spoken Language Modeling

Sep 07, 2021

Eugene Kharitonov, Ann Lee, Adam Polyak, Yossi Adi, Jade Copet, Kushal Lakhotia, Tu-Anh Nguyen, Morgane Rivière, Abdelrahman Mohamed, Emmanuel Dupoux(+1 more)

Figure 1 for Text-Free Prosody-Aware Generative Spoken Language Modeling

Figure 2 for Text-Free Prosody-Aware Generative Spoken Language Modeling

Figure 3 for Text-Free Prosody-Aware Generative Spoken Language Modeling

Figure 4 for Text-Free Prosody-Aware Generative Spoken Language Modeling

Abstract:Speech pre-training has primarily demonstrated efficacy on classification tasks, while its capability of generating novel speech, similar to how GPT-2 can generate coherent paragraphs, has barely been explored. Generative Spoken Language Modeling (GSLM) (Lakhotia et al., 2021) is the only prior work addressing the generative aspects of speech pre-training, which replaces text with discovered phone-like units for language modeling and shows the ability to generate meaningful novel sentences. Unfortunately, despite eliminating the need of text, the units used in GSLM discard most of the prosodic information. Hence, GSLM fails to leverage prosody for better comprehension, and does not generate expressive speech. In this work, we present a prosody-aware generative spoken language model (pGSLM). It is composed of a multi-stream transformer language model (MS-TLM) of speech, represented as discovered unit and prosodic feature streams, and an adapted HiFi-GAN model converting MS-TLM outputs to waveforms. We devise a series of metrics for prosody modeling and generation, and re-use metrics from GSLM for content modeling. Experimental results show that the pGSLM can utilize prosody to improve both prosody and content modeling, and also generate natural, meaningful, and coherent speech given a spoken prompt. Audio samples can be found at https://speechbot.github.io/pgslm.

Via

Access Paper or Ask Questions

Direct speech-to-speech translation with discrete units

Jul 12, 2021

Ann Lee, Peng-Jen Chen, Changhan Wang, Jiatao Gu, Xutai Ma, Adam Polyak, Yossi Adi, Qing He, Yun Tang, Juan Pino(+1 more)

Figure 1 for Direct speech-to-speech translation with discrete units

Figure 2 for Direct speech-to-speech translation with discrete units

Figure 3 for Direct speech-to-speech translation with discrete units

Figure 4 for Direct speech-to-speech translation with discrete units

Abstract:We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. Previous work addresses the problem by training an attention-based sequence-to-sequence model that maps source speech spectrograms into target spectrograms. To tackle the challenge of modeling continuous spectrogram features of the target speech, we propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead. When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that predicting discrete units and joint speech and text training improve model performance by 11 BLEU compared with a baseline that predicts spectrograms and bridges 83% of the performance gap towards a cascaded system. When trained without any text transcripts, our model achieves similar performance as a baseline that predicts spectrograms and is trained with text data.

Via

Access Paper or Ask Questions

Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Jun 14, 2021

Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

Figure 1 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Figure 2 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Figure 3 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Figure 4 for Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

Abstract:In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised training. The proposed approach uses a teacher model which is updated as the exponential moving average of the student model parameters. This can be seen as a continuous version of the iterative pseudo-labeling approach for semi-supervised training. It is applicable for different training criteria, and in this paper we demonstrate it for frame-level hybrid hidden Markov model - deep neural network (HMM-DNN) models and sequence-level connectionist temporal classification (CTC) based models. The proposed approach shows more than 10% word error rate (WER) reduction over standard teacher-student training and more than 50\% relative WER reduction over 10 hour supervised baseline when using large scale realistic unsupervised public videos in UK English and Italian languages.

* 4 figures, 7 pages; fixed author list going out of margin

Via

Access Paper or Ask Questions

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Jun 14, 2021

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed

Figure 1 for HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Figure 2 for HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Figure 3 for HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Figure 4 for HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Abstract:Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.

Via

Access Paper or Ask Questions

Unsupervised Speech Recognition

May 24, 2021

Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

Figure 1 for Unsupervised Speech Recognition

Figure 2 for Unsupervised Speech Recognition

Figure 3 for Unsupervised Speech Recognition

Figure 4 for Unsupervised Speech Recognition

Abstract:Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago. We also experiment on nine other languages, including low-resource languages such as Kyrgyz, Swahili and Tatar.

Via

Access Paper or Ask Questions

Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Apr 02, 2021

Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve(+1 more)

Figure 1 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Figure 2 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Figure 3 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Figure 4 for Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

Abstract:Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which in turn may differ from the test data domain. Our experiments show that using target domain data during pre-training leads to large performance improvements across a variety of setups. On a large-scale competitive setup, we show that pre-training on unlabeled in-domain data reduces the gap between models trained on in-domain and out-of-domain labeled data by 66%-73%. This has obvious practical implications since it is much easier to obtain unlabeled target domain data than labeled data. Moreover, we find that pre-training on multiple domains improves generalization performance on domains not seen during training. Code and models will be made available at https://github.com/pytorch/fairseq.

Via

Access Paper or Ask Questions

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Apr 02, 2021

Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux

Figure 1 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Figure 2 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Figure 3 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Figure 4 for Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

Abstract:We propose using self-supervised discrete representations for the task of speech resynthesis. To generate disentangled representation, we separately extract low-bitrate representations for speech content, prosodic information, and speaker identity. This allows to synthesize speech in a controllable manner. We analyze various state-of-the-art, self-supervised representation learning methods and shed light on the advantages of each method while considering reconstruction quality and disentanglement properties. Specifically, we evaluate the F0 reconstruction, speaker identification performance (for both resynthesis and voice conversion), recordings' intelligibility, and overall quality using subjective human evaluation. Lastly, we demonstrate how these representations can be used for an ultra-lightweight speech codec. Using the obtained representations, we can get to a rate of 365 bits per second while providing better speech quality than the baseline methods. Audio samples can be found under https://resynthesis-ssl.github.io/.

Via

Access Paper or Ask Questions

Generative Spoken Language Modeling from Raw Audio

Feb 01, 2021

Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Adelrahman Mohamed(+1 more)

Figure 1 for Generative Spoken Language Modeling from Raw Audio

Figure 2 for Generative Spoken Language Modeling from Raw Audio

Figure 3 for Generative Spoken Language Modeling from Raw Audio

Figure 4 for Generative Spoken Language Modeling from Raw Audio

Abstract:Generative spoken language modeling involves learning jointly the acoustic and linguistic characteristics of a language from raw audio only (without text or labels). We introduce metrics to automatically evaluate the generated output in terms of acoustic and linguistic quality in two associated end-to-end tasks, respectively: speech resynthesis (repeating the speech input using the system's own voice), and speech generation (producing novel speech outputs conditional on a spoken prompt, or unconditionally), and validate these metrics with human judgment. We test baseline systems consisting of a discrete speech encoder (returning discrete, low bitrate, pseudo-text units), a generative language model (trained on pseudo-text units), and a speech decoder (generating a waveform from pseudo-text). By comparing three state-of-the-art unsupervised speech encoders (Contrastive Predictive Coding (CPC), wav2vec 2.0, HuBERT), and varying the number of discrete units (50, 100, 200), we investigate how the generative performance depends on the quality of the learned units as measured by unsupervised metrics (zero-shot probe tasks). We will open source our evaluation stack and baseline models.

Via

Access Paper or Ask Questions

Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Dec 31, 2020

Wei-Ning Hsu, David Harwath, Christopher Song, James Glass

Figure 1 for Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Figure 2 for Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Figure 3 for Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Figure 4 for Text-Free Image-to-Speech Synthesis Using Learned Segmental Units

Abstract:In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as drop-in replacements for text.

Via

Access Paper or Ask Questions

Differentiable Weighted Finite-State Transducers

Oct 02, 2020

Awni Hannun, Vineel Pratap, Jacob Kahn, Wei-Ning Hsu

Figure 1 for Differentiable Weighted Finite-State Transducers

Figure 2 for Differentiable Weighted Finite-State Transducers

Figure 3 for Differentiable Weighted Finite-State Transducers

Figure 4 for Differentiable Weighted Finite-State Transducers

Abstract:We introduce a framework for automatic differentiation with weighted finite-state transducers (WFSTs) allowing them to be used dynamically at training time. Through the separation of graphs from operations on graphs, this framework enables the exploration of new structured loss functions which in turn eases the encoding of prior knowledge into learning algorithms. We show how the framework can combine pruning and back-off in transition models with various sequence-level loss functions. We also show how to learn over the latent decomposition of phrases into word pieces. Finally, to demonstrate that WFSTs can be used in the interior of a deep neural network, we propose a convolutional WFST layer which maps lower-level representations to higher-level representations and can be used as a drop-in replacement for a traditional convolution. We validate these algorithms with experiments in handwriting recognition and speech recognition.

Via

Access Paper or Ask Questions