Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emmanuel Dupoux

LSCP, CoML

Textually Pretrained Speech Language Models

May 22, 2023

Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux(+2 more)

Figure 1 for Textually Pretrained Speech Language Models

Figure 2 for Textually Pretrained Speech Language Models

Figure 3 for Textually Pretrained Speech Language Models

Figure 4 for Textually Pretrained Speech Language Models

Abstract:Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model design choices such as the speech tokenizer, the pretrained textual model, and the dataset size. We find that model and dataset scale both play an important role in constructing better-performing SpeechLMs. Based on our observations, we present the largest (to the best of our knowledge) SpeechLM both in terms of number of parameters and training data. We additionally introduce two spoken versions of the StoryCloze textual benchmark to further improve model evaluation and advance future research in the field. Speech samples can be found on our website: https://pages.cs.huji.ac.il/adiyoss-lab/twist/ .

Via

Access Paper or Ask Questions

ProsAudit, a prosodic benchmark for self-supervised speech models

Feb 24, 2023

Maureen de Seyssel, Marvin Lavechin, Hadrien Titeux, Arthur Thomas, Gwendal Virlet, Andrea Santos Revilla, Guillaume Wisniewski, Bogdan Ludusan, Emmanuel Dupoux

Figure 1 for ProsAudit, a prosodic benchmark for self-supervised speech models

Figure 2 for ProsAudit, a prosodic benchmark for self-supervised speech models

Figure 3 for ProsAudit, a prosodic benchmark for self-supervised speech models

Figure 4 for ProsAudit, a prosodic benchmark for self-supervised speech models

Abstract:We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when trained on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of size with models trained on more data performing better in the two subtasks.

* 4 pages + references, 1 figure

Via

Access Paper or Ask Questions

Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge

Oct 27, 2022

Ewan Dunbar, Nicolas Hamilakis, Emmanuel Dupoux

Abstract:Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks -- Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling -- and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.

* Journal: IEEE Journal of Selected Topics in Signal Processing Publication Date: OCTOBER 2022 Volume: 16, Issue: 6 On Page(s): 1211-1226 Print ISSN: 1932-4553 Online ISSN: 1941-0484 Digital Object Identifier: 10.1109/JSTSP.2022.3206084

Via

Access Paper or Ask Questions

Evaluating context-invariance in unsupervised speech representations

Oct 27, 2022

Mark Hallap, Emmanuel Dupoux, Ewan Dunbar

Figure 1 for Evaluating context-invariance in unsupervised speech representations

Figure 2 for Evaluating context-invariance in unsupervised speech representations

Figure 3 for Evaluating context-invariance in unsupervised speech representations

Figure 4 for Evaluating context-invariance in unsupervised speech representations

Abstract:Unsupervised speech representations have taken off, with benchmarks (SUPERB, ZeroSpeech) demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of ``discovering the phonemes'' of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can have massive influence on the way it is pronounced, while the text remains stable. This is what allows tokens of the same word to have the same transcriptions -- key to language understanding. Current benchmarks do not measure context-invariance. We develop a new version of the ZeroSpeech ABX benchmark that measures context-invariance, and apply it to recent self-supervised representations. We demonstrate that the context-independence of representations is predictive of the stability of word-level representations. We suggest research concentrate on improving context-independence of self-supervised and unsupervised representations.

* Submitted to ICASSP 2023

Via

Access Paper or Ask Questions

Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Oct 27, 2022

Marvin Lavechin, Marianne Métais, Hadrien Titeux, Alodie Boissonnet, Jade Copet, Morgane Rivière, Elika Bergelson, Alejandrina Cristia, Emmanuel Dupoux, Hervé Bredin

Figure 1 for Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Figure 2 for Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Figure 3 for Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Figure 4 for Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation

Abstract:Most automatic speech processing systems are sensitive to the acoustic environment, with degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a pipeline to simulate audio segments recorded in noisy and reverberant conditions. We then use the simulated audio to jointly train the Brouhaha model for voice activity detection, signal-to-noise ratio estimation, and C50 room acoustics prediction. We show how the predicted SNR and C50 values can be used to investigate and help diagnose errors made by automatic speech processing tools (such as pyannote.audio for speaker diarization or OpenAI's Whisper for automatic speech recognition). Both our pipeline and a pretrained model are open source and shared with the speech community.

Via

Access Paper or Ask Questions

Are word boundaries useful for unsupervised language learning?

Oct 06, 2022

Tu Anh Nguyen, Maureen de Seyssel, Robin Algayres, Patricia Roze, Ewan Dunbar, Emmanuel Dupoux

Figure 1 for Are word boundaries useful for unsupervised language learning?

Figure 2 for Are word boundaries useful for unsupervised language learning?

Figure 3 for Are word boundaries useful for unsupervised language learning?

Figure 4 for Are word boundaries useful for unsupervised language learning?

Abstract:Word or word-fragment based Language Models (LM) are typically preferred over character-based ones in many downstream applications. This may not be surprising as words seem more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic and semantic levels using three speech-adapted black box NLP psycholinguistically-inpired benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2\% and 28\% in relative performance depending on the task. We show that gold boundaries can be replaced by automatically found ones obtained with an unsupervised segmentation algorithm, and that even modest segmentation performance gives a gain in performance on two of the three tasks compared to basic character/phone based models without boundary information.

* This is an archived version from September 2020

Via

Access Paper or Ask Questions

On The Robustness of Self-Supervised Representations for Spoken Language Modeling

Sep 30, 2022

Itai Gat, Felix Kreuk, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, Yossi Adi

Figure 1 for On The Robustness of Self-Supervised Representations for Spoken Language Modeling

Figure 2 for On The Robustness of Self-Supervised Representations for Spoken Language Modeling

Figure 3 for On The Robustness of Self-Supervised Representations for Spoken Language Modeling

Figure 4 for On The Robustness of Self-Supervised Representations for Spoken Language Modeling

Abstract:Self-supervised representations have been extensively studied for discriminative and generative tasks. However, their robustness capabilities have not been extensively investigated. This work focuses on self-supervised representations for spoken generative language models. First, we empirically demonstrate how current state-of-the-art speech representation models lack robustness to basic signal variations that do not alter the spoken information. To overcome this, we propose an effective and efficient method to learn robust self-supervised speech representation for generative spoken language modeling. The proposed approach is based on applying a set of signal transformations to the speech signal and optimizing the model using an iterative pseudo-labeling scheme. Our method significantly improves over the evaluated baselines when considering encoding metrics. We additionally evaluate our method on the speech-to-speech translation task. We consider Spanish-English and French-English conversions and empirically demonstrate the benefits of following the proposed approach.

Via

Access Paper or Ask Questions

Emergent Communication: Generalization and Overfitting in Lewis Games

Sep 30, 2022

Mathieu Rita, Corentin Tallec, Paul Michel, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux, Florian Strub

Figure 1 for Emergent Communication: Generalization and Overfitting in Lewis Games

Figure 2 for Emergent Communication: Generalization and Overfitting in Lewis Games

Figure 3 for Emergent Communication: Generalization and Overfitting in Lewis Games

Figure 4 for Emergent Communication: Generalization and Overfitting in Lewis Games

Abstract:Lewis signaling games are a class of simple communication games for simulating the emergence of language. In these games, two agents must agree on a communication protocol in order to solve a cooperative task. Previous work has shown that agents trained to play this game with reinforcement learning tend to develop languages that display undesirable properties from a linguistic point of view (lack of generalization, lack of compositionality, etc). In this paper, we aim to provide better understanding of this phenomenon by analytically studying the learning problem in Lewis games. As a core contribution, we demonstrate that the standard objective in Lewis games can be decomposed in two components: a co-adaptation loss and an information loss. This decomposition enables us to surface two potential sources of overfitting, which we show may undermine the emergence of a structured communication protocol. In particular, when we control for overfitting on the co-adaptation loss, we recover desired properties in the emergent languages: they are more compositional and generalize better.

* 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

Via

Access Paper or Ask Questions

Is the Language Familiarity Effect gradual? A computational modelling approach

Jun 27, 2022

Maureen de Seyssel, Guillaume Wisniewski, Emmanuel Dupoux

Figure 1 for Is the Language Familiarity Effect gradual? A computational modelling approach

Figure 2 for Is the Language Familiarity Effect gradual? A computational modelling approach

Figure 3 for Is the Language Familiarity Effect gradual? A computational modelling approach

Figure 4 for Is the Language Familiarity Effect gradual? A computational modelling approach

Abstract:According to the Language Familiarity Effect (LFE), people are better at discriminating between speakers of their native language. Although this cognitive effect was largely studied in the literature, experiments have only been conducted on a limited number of language pairs and their results only show the presence of the effect without yielding a gradual measure that may vary across language pairs. In this work, we show that the computational model of LFE introduced by Thorburn, Feldmand and Schatz (2019) can address these two limitations. In a first experiment, we attest to this model's capacity to obtain a gradual measure of the LFE by replicating behavioural findings on native and accented speech. In a second experiment, we evaluate LFE on a large number of language pairs, including many which have never been tested on humans. We show that the effect is replicated across a wide array of languages, providing further evidence of its universality. Building on the gradual measure of LFE, we also show that languages belonging to the same family yield smaller scores, supporting the idea of an effect of language distance on LFE.

* 8 pages, 2 figures, accepted at CogSci 2022

Via

Access Paper or Ask Questions

DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Jun 22, 2022

Robin Algayres, Tristan Ricoul, Julien Karadayi, Hugo Laurençon, Salah Zaiem, Abdelrahman Mohamed, Benoît Sagot, Emmanuel Dupoux

Figure 1 for DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Figure 2 for DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Figure 3 for DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Figure 4 for DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Abstract:Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.

Via

Access Paper or Ask Questions