
Marc Dymetman


Should you marginalize over possible tokenizations?

Jun 30, 2023
Nadezhda Chirkova, Germán Kruszewski, Jos Rozen, Marc Dymetman

Autoregressive language models (LMs) map token sequences to probabilities. The usual practice for computing the probability of a character string (e.g. an English sentence) is to first transform it into a sequence of tokens that is scored by the model. However, there are exponentially many token sequences that represent any given string. To truly compute the probability of a string, one should marginalize over all tokenizations, which is typically intractable. Here, we analyze whether the practice of ignoring the marginalization is justified. To this end, we devise an importance-sampling-based algorithm that allows us to compute estimates of the marginal probabilities and compare them to the default procedure in a range of state-of-the-art models and datasets. Our results show that the gap in log-likelihood is no larger than 0.5% in most cases, but that it becomes more pronounced for data with long complex words.
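
The marginalization at issue is easy to see on a toy example. A minimal sketch, assuming a made-up three-token vocabulary with unigram "LM" probabilities (nothing below comes from the paper's models): even a two-character string has multiple tokenizations, and the default single-tokenization score underestimates the true marginal.

```python
# Toy unigram "LM": token probabilities over a 3-token vocabulary.
# The vocabulary and numbers are invented for illustration only.
VOCAB = {"ab": 0.5, "a": 0.3, "b": 0.2}

def tokenizations(s):
    """Enumerate every way of splitting s into vocabulary tokens."""
    if not s:
        yield []
        return
    for tok in VOCAB:
        if s.startswith(tok):
            for rest in tokenizations(s[len(tok):]):
                yield [tok] + rest

def seq_prob(toks):
    """Probability the toy model assigns to one token sequence."""
    p = 1.0
    for t in toks:
        p *= VOCAB[t]
    return p

s = "ab"
splits = list(tokenizations(s))              # [['ab'], ['a', 'b']]
marginal = sum(seq_prob(t) for t in splits)  # 0.5 + 0.3*0.2 = 0.56
default = seq_prob(["ab"])                   # 0.5: the single "canonical" split
```

For real subword vocabularies the number of splits grows exponentially with string length, which is why the paper resorts to importance sampling rather than exact enumeration.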

* Accepted to ACL 2023 

disco: a toolkit for Distributional Control of Generative Models

Mar 08, 2023
Germán Kruszewski, Jos Rozen, Marc Dymetman

Pre-trained language models and other generative models have revolutionized NLP and beyond. However, these models tend to reproduce undesirable biases present in their training data. Also, they may overlook patterns that are important but challenging to capture. To address these limitations, researchers have introduced distributional control techniques. These techniques, not limited to language, allow controlling the prevalence (i.e., expectations) of any features of interest in the model's outputs. Despite their potential, the widespread adoption of these techniques has been hindered by the difficulty in adapting complex, disconnected code. Here, we present disco, an open-source Python library that brings these techniques to the broader public.

Aligning Language Models with Preferences through f-divergence Minimization

Feb 16, 2023
Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, Marc Dymetman

Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences are good for approximating different targets. For instance, we discover that for GDC, the Jensen-Shannon divergence frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work.
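
The objectives being unified are all instances of f-divergences between two distributions. A minimal sketch on a three-outcome space (the distributions are invented, not taken from the paper), showing that forward KL, reverse KL, and Jensen-Shannon generally disagree:

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized KL to the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

target = [0.7, 0.2, 0.1]  # explicit target distribution (GDC-style)
model = [0.4, 0.4, 0.2]   # current policy

forward_kl = kl(target, model)  # mass-covering objective
reverse_kl = kl(model, target)  # mode-seeking objective (RLHF-style)
jsd = js(target, model)         # symmetric, bounded by log 2
```

The asymmetry is the point: forward and reverse KL penalize different mismatches, which is the intuition behind different divergences suiting different targets.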

On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Jun 01, 2022
Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM instead prescribes first making explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of a baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability and sample efficiency.
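
The benefit of a baseline can be seen on a toy score-function estimator (a Bernoulli "policy" with an invented reward, not the paper's actual objective): subtracting a constant baseline leaves the gradient estimate unbiased but lowers its variance.

```python
import random
import statistics

random.seed(0)

# Toy policy: Bernoulli(theta) over x in {0, 1}; reward favors x = 1.
# The true gradient of E[R] with respect to theta is exactly 1 here.
theta = 0.3
reward = {0: 0.0, 1: 1.0}

def grad_sample(baseline):
    """One score-function gradient sample, with an optional baseline."""
    x = 1 if random.random() < theta else 0
    # d/dtheta log pi(x): 1/theta if x == 1, else -1/(1 - theta)
    score = (1 / theta) if x == 1 else (-1 / (1 - theta))
    return (reward[x] - baseline) * score

no_base = [grad_sample(0.0) for _ in range(20000)]
with_base = [grad_sample(0.3) for _ in range(20000)]  # baseline = E[R]

var_no = statistics.pvariance(no_base)     # about 2.33 analytically
var_yes = statistics.pvariance(with_base)  # about 0.76 analytically
```

Both sample means hover around the true gradient of 1; only the spread changes, which is exactly what a baseline buys in DM methods as well.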

Sampling from Discrete Energy-Based Models with Quality/Efficiency Trade-offs

Dec 10, 2021
Bryan Eikema, Germán Kruszewski, Hady Elsahar, Marc Dymetman

Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions. However, they do not provide a mechanism for obtaining exact samples from these distributions. Monte Carlo techniques can aid us in obtaining samples if some proposal distribution that we can easily sample from is available. For instance, rejection sampling can provide exact samples but is often difficult or impossible to apply due to the need to find a proposal distribution that upper-bounds the target distribution everywhere. Approximate Markov chain Monte Carlo sampling techniques like Metropolis-Hastings are usually easier to design, exploiting a local proposal distribution that performs local edits on an evolving sample. However, these techniques can be inefficient due to the local nature of the proposal distribution and do not provide an estimate of the quality of their samples. In this work, we propose a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality, while providing explicit convergence bounds and diagnostics. QRS capitalizes on the availability of high-quality global proposal distributions obtained from deep learning models. We demonstrate the effectiveness of QRS sampling for discrete EBMs over text for the tasks of controlled text generation with distributional constraints and paraphrase generation. We show that we can sample from such EBMs with arbitrary precision at the cost of sampling efficiency.
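
The QRS acceptance rule itself fits in one line. A minimal sketch on a three-element space (target and proposal invented for illustration): with beta at least max_x P(x)/q(x), the scheme reduces to exact rejection sampling, and lowering beta trades sample quality for acceptance rate.

```python
import random

random.seed(1)

# Unnormalized target ("EBM") and a global proposal over 3 items.
# Both tables are invented for illustration.
P = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.4, "b": 0.4, "c": 0.2}

def qrs_sample(beta, n):
    """Accept x ~ q with probability min(1, P(x) / (beta * q(x)))."""
    items, weights = zip(*q.items())
    out = []
    while len(out) < n:
        x = random.choices(items, weights)[0]
        if random.random() < min(1.0, P[x] / (beta * q[x])):
            out.append(x)
    return out

# Here max_x P(x)/q(x) = 1.5, so beta = 1.5 yields exact samples from P.
samples = qrs_sample(1.5, 5000)
freq_a = samples.count("a") / 5000  # close to P["a"] = 0.6
```

Because beta is explicit, the same quantity drives the convergence diagnostics the paper describes: the closer beta is to the true bound, the closer the accepted samples are to the target.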

Controlling Conditional Language Models with Distributional Policy Gradients

Dec 01, 2021
Tomasz Korbak, Hady Elsahar, German Kruszewski, Marc Dymetman

Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some of the downstream requirements (e.g. hallucination in abstractive summarization or wrong format in automatic code generation). This raises the important question of how to adapt pre-trained generative models to a new task without destroying their capabilities. Recent work has suggested solving this problem by representing task-specific requirements through energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Unfortunately, this approach is limited to unconditional distributions, represented by unconditional EBMs. In this paper, we extend this approach to conditional tasks by proposing Conditional DPG (CDPG). We evaluate CDPG on three different control objectives across two tasks: summarization with T5 and code generation with GPT-Neo. Our results show that fine-tuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and -- in contrast with baseline approaches -- does not result in catastrophic forgetting.

* CtrlGen: Controllable Generative Modeling in Language and Vision Workshop at NeurIPS 2021 

Energy-Based Models for Code Generation under Compilability Constraints

Jun 09, 2021
Tomasz Korbak, Hady Elsahar, Marc Dymetman, Germán Kruszewski

Neural language models can be successfully trained on source code, leading to applications such as code completion. However, their versatile autoregressive self-supervision objective overlooks important global sequence-level features that are present in the data such as syntactic correctness or compilability. In this work, we pose the problem of learning to generate compilable code as constraint satisfaction. We define an Energy-Based Model (EBM) representing a pre-trained generative model with an imposed constraint of generating only compilable sequences. We then use the KL-Adaptive Distributional Policy Gradient algorithm (Khalifa et al., 2021) to train a generative model approximating the EBM. We conduct experiments showing that our proposed approach is able to improve compilability rates without sacrificing diversity and complexity of the generated samples.
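
The EBM construction is straightforward to sketch. Below, a hypothetical handful of candidate snippets with made-up LM scores a(x) is filtered by Python's own compiler, standing in for the binary compilability constraint b(x); the paper's actual setting uses a pre-trained code LM over a much larger sample space.

```python
# Sketch of the EBM P(x) ∝ a(x) * b(x): a pre-trained LM score a(x)
# multiplied by a binary compilability filter b(x).

def compiles(src):
    """b(x): True iff the snippet parses as valid Python."""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

candidates = {         # a(x): hypothetical LM scores, invented here
    "x = 1 + 2": 0.5,
    "x = 1 +": 0.3,    # syntax error: zeroed out by b(x)
    "print(x)": 0.2,
}

# Unnormalized EBM scores, then renormalization over the support.
ebm = {src: p * compiles(src) for src, p in candidates.items()}
Z = sum(ebm.values())
target = {src: w / Z for src, w in ebm.items()}
```

The resulting `target` distribution is what the KL-Adaptive DPG step then trains an autoregressive model to approximate.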

* Accepted for the First Workshop on Natural Language Processing for Programming, ACL 2021 

A Distributional Approach to Controlled Text Generation

Dec 21, 2020
Muhammad Khalifa, Hady Elsahar, Marc Dymetman

We propose a Distributional Approach to address Controlled Text Generation from pre-trained Language Models (LMs). This view permits defining, in a single formal framework, "pointwise" and "distributional" constraints over the target LM -- to our knowledge, this is the first approach with such generality -- while minimizing KL divergence with the initial LM distribution. The optimal target distribution is then uniquely determined as an explicit EBM (Energy-Based Model) representation. From that optimal representation we then train the target controlled autoregressive LM through an adaptive distributional variant of Policy Gradient. We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM (GPT-2). We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of bias in Language Models. Through an ablation study we show the effectiveness of our adaptive technique for obtaining faster convergence.
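
For a single distributional constraint, the optimal target takes the exponential-family form p(x) ∝ a(x) · exp(λ · φ(x)), and λ can be found by one-dimensional search so that the expectation of φ matches the constraint. A toy sketch with three outputs and a binary feature (all numbers invented):

```python
import math

# Base "LM" a(x) over three outputs and a binary feature phi(x).
a = {"x1": 0.7, "x2": 0.2, "x3": 0.1}
phi = {"x1": 0.0, "x2": 1.0, "x3": 1.0}
target_moment = 0.6  # constraint: E_p[phi] = 0.6 (base model gives 0.3)

def moment(lam):
    """E_p[phi] under p(x) ∝ a(x) * exp(lam * phi(x))."""
    w = {x: a[x] * math.exp(lam * phi[x]) for x in a}
    Z = sum(w.values())
    return sum(w[x] / Z * phi[x] for x in a)

lo, hi = -20.0, 20.0  # bisection: moment(lam) is monotone in lam
for _ in range(100):
    mid = (lo + hi) / 2
    if moment(mid) < target_moment:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2   # analytically, lam = log(3.5) in this toy case
```

The resulting EBM is exactly the kind of explicit representation from which the adaptive Policy Gradient variant then distills an autoregressive LM.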

* Under review at ICLR 2021 

Distributional Reinforcement Learning for Energy-Based Sequential Models

Dec 18, 2019
Tetiana Parshakova, Jean-Marc Andreoli, Marc Dymetman

Global Autoregressive Models (GAMs) are a recent proposal [Parshakova et al., CoNLL 2019] for exploiting global properties of sequences for data-efficient learning of seq2seq models. In the first phase of training, an Energy-Based Model (EBM) over sequences is derived. This EBM has high representational power, but is unnormalized and cannot be directly exploited for sampling. To address this issue, [Parshakova et al., CoNLL 2019] propose a distillation technique, which can only be applied under limited conditions. By relating this problem to Policy Gradient techniques in RL, but from a distributional rather than an optimization perspective, we propose a general approach applicable to any sequential EBM. Its effectiveness is illustrated on GAM-based experiments.
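
The distributional policy gradient at the core of this approach can be sketched on a toy categorical case (target, learning rate, and step count invented): importance-weight the score function by P(x)/π(x), so that the policy drifts toward the normalized EBM rather than toward a reward maximizer.

```python
import math
import random

random.seed(2)

# Unnormalized sequential EBM collapsed to 3 outcomes for illustration
# (it happens to sum to 1 here, so it is its own normalization).
P = {0: 0.1, 1: 0.6, 2: 0.3}
logits = [0.0, 0.0, 0.0]

def pi():
    """Softmax policy over the 3 outcomes."""
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    Z = sum(e)
    return [v / Z for v in e]

lr = 0.05
for _ in range(10000):
    p = pi()
    x = random.choices([0, 1, 2], weights=p)[0]
    w = P[x] / p[x]                       # importance weight P(x)/pi(x)
    for i in range(3):                    # grad of log softmax: one-hot - p
        logits[i] += lr * w * ((1.0 if i == x else 0.0) - p[i])

fitted = pi()  # approaches the normalized target [0.1, 0.6, 0.3]
```

Since P is normalized in this toy case, the expected logit update is simply P − π, so the stochastic iteration converges toward the target from any starting point.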

* OptRL workshop (Optimization Foundations for Reinforcement Learning) at NeurIPS 2019 

Character-based NMT with Transformer

Nov 12, 2019
Rohit Gupta, Laurent Besacier, Marc Dymetman, Matthias Gallé

Character-based translation has several appealing advantages, but its performance is in general worse than that of a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based Transformer models are more robust than their BPE counterparts, both when translating noisy text and when translating text from a different domain. To obtain comparable BLEU scores on clean, in-domain data and close the gap with BPE-based models, we use known techniques to train deeper Transformer models.
