Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Emmanuel Dupoux

LSCP, CoML

Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Apr 30, 2024

Mathieu Rita, Florian Strub, Rahma Chaabouni, Paul Michel, Emmanuel Dupoux, Olivier Pietquin

Figure 1 for Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Figure 2 for Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Figure 3 for Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Figure 4 for Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Abstract:While Reinforcement Learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we here introduce the Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three language tasks, which achieves comparable performance to carefully tuned baselines while mitigating ROO.

Via

Access Paper or Ask Questions

Language Evolution with Deep Learning

Mar 18, 2024

Mathieu Rita, Paul Michel, Rahma Chaabouni, Olivier Pietquin, Emmanuel Dupoux, Florian Strub

Figure 1 for Language Evolution with Deep Learning

Figure 2 for Language Evolution with Deep Learning

Figure 3 for Language Evolution with Deep Learning

Figure 4 for Language Evolution with Deep Learning

Abstract:Computational modeling plays an essential role in the study of language emergence. It aims to simulate the conditions and learning processes that could trigger the emergence of a structured language within a simulated controlled environment. Several methods have been used to investigate the origin of our language, including agent-based systems, Bayesian agents, genetic algorithms, and rule-based systems. This chapter explores another class of computational models that have recently revolutionized the field of machine learning: deep learning models. The chapter introduces the basic concepts of deep and reinforcement learning methods and summarizes their helpfulness for simulating language emergence. It also discusses the key findings, limitations, and recent attempts to build realistic simulations. This chapter targets linguists and cognitive scientists seeking an introduction to deep learning as a tool to investigate language evolution.

* to appear in the Oxford Handbook of Approaches to Language Evolution

Via

Access Paper or Ask Questions

SpiRit-LM: Interleaved Spoken and Written Language Model

Feb 08, 2024

Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat(+4 more)

Abstract:We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally, we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).

Via

Access Paper or Ask Questions

EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models

Dec 21, 2023

Maureen de Seyssel, Antony D'Avirro, Adina Williams, Emmanuel Dupoux

Abstract:We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.

Via

Access Paper or Ask Questions

WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models

Nov 27, 2023

Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent

Abstract:We propose WorldSense, a benchmark designed to assess the extent to which LLMs are consistently able to sustain tacit world models, by testing how they draw simple inferences from descriptions of simple arrangements of entities. Worldsense is a synthetic benchmark with three problem types, each with their own trivial control, which explicitly avoids bias by decorrelating the abstract structure of problems from the vocabulary and expressions, and by decorrelating all problem subparts with the correct response. We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) and show that these models make errors even with as few as three objects. Furthermore, they have quite heavy response biases, preferring certain responses irrespective of the question. Errors persist even with chain-of-thought prompting and in-context learning. Lastly, we show that while finetuning on similar problems does result in substantial improvements -- within- and out-of-distribution -- the finetuned models do not generalise beyond a constraint problem space.

Via

Access Paper or Ask Questions

XLS-R fine-tuning on noisy word boundaries for unsupervised speech segmentation into words

Oct 08, 2023

Robin Algayres, Pablo Diego-Simon, Benoit Sagot, Emmanuel Dupoux

Abstract:Due to the absence of explicit word boundaries in the speech stream, the task of segmenting spoken sentences into word units without text supervision is particularly challenging. In this work, we leverage the most recent self-supervised speech models that have proved to quickly adapt to new tasks through fine-tuning, even in low resource conditions. Taking inspiration from semi-supervised learning, we fine-tune an XLS-R model to predict word boundaries themselves produced by top-tier speech segmentation systems: DPDP, VG-HuBERT, GradSeg and DP-Parse. Once XLS-R is fine-tuned, it is used to infer new word boundary labels that are used in turn for another fine-tuning step. Our method consistently improves the performance of each system and sets a new state-of-the-art that is, on average 130% higher than the previous one as measured by the F1 score on correctly discovered word tokens on five corpora featuring different languages. Finally, our system can segment speech from languages unseen during fine-tuning in a zero-shot fashion.

* Findings at EMNLP 2023

Via

Access Paper or Ask Questions

Generative Spoken Language Model based on continuous word-sized audio tokens

Oct 08, 2023

Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

Figure 1 for Generative Spoken Language Model based on continuous word-sized audio tokens

Figure 2 for Generative Spoken Language Model based on continuous word-sized audio tokens

Figure 3 for Generative Spoken Language Model based on continuous word-sized audio tokens

Figure 4 for Generative Spoken Language Model based on continuous word-sized audio tokens

Abstract:In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing lookup table for lexical types with a Lexical Embedding function, the cross entropy loss by a contrastive loss, and multinomial sampling by k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete unit GSLMs regarding generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.

* Conference paper at EMNLP 2023

Via

Access Paper or Ask Questions

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Sep 29, 2023

Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

Figure 1 for Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Figure 2 for Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Figure 3 for Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Figure 4 for Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Abstract:Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TTS) system with limited resources using SSL features and generate a large synthetic corpus for pre-training. Experimental results demonstrate that our proposed approach effectively reduces the demand for speech data by 90\% with only slight performance degradation. To the best of our knowledge, this is the first work aiming to enhance low-resource self-supervised learning in speech processing.

* Preprint

Via

Access Paper or Ask Questions

EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Aug 10, 2023

Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid(+3 more)

Figure 1 for EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Figure 2 for EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Figure 3 for EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Figure 4 for EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Abstract:Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthesis datasets are read, severely limiting spontaneity and expressivity. Here, we introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis that includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles. We illustrate the challenges and potentials of this dataset with an expressive resynthesis benchmark where the task is to encode the input in low-bitrate units and resynthesize it in a target voice while preserving content and style. We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders, and explore tradeoffs between quality, bitrate and invariance to speaker and style. All the dataset, evaluation metrics and baseline models are open source

Via

Access Paper or Ask Questions

BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Jun 08, 2023

Marvin Lavechin, Yaya Sy, Hadrien Titeux, María Andrea Cruz Blandón, Okko Räsänen, Hervé Bredin, Emmanuel Dupoux, Alejandrina Cristia

Figure 1 for BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Figure 2 for BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Figure 3 for BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Figure 4 for BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Abstract:Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children's language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.

* Proceedings of Interspeech 2023

Via

Access Paper or Ask Questions