Alexis Jacq

Diffusion Fine-tuning with Rewarded Moment Matching Distillation

Jun 29, 2026

Alexis Jacq, Guillaume Couairon, Valentin De Bortoli, Quentin Berthet, Arnaud Doucet, Romuald Elie

Abstract:Distillation and Reinforcement Learning (RL) fine-tuning are the primary pillars of diffusion post-training. While traditionally studied in isolation, the interaction between these phases remains poorly understood, and in particular how fine-tuning impacts the generative quality of distilled models. We introduce Rewarded Moment Matching Distillation (RMMD), a novel framework that simultaneously distills diffusion models and maximizes a reward function. RMMD preserves the high-fidelity "naturalness" characteristic of advanced distillation (such as 8-step Moment Matching) by adapting the sampling loop for on-policy training and repurposing the distillation loss as a proxy for integral KL regularization. By evaluating the FID-Reward Pareto fronts on ImageNet, we demonstrate that RMMD achieves superior trade-offs compared to single-step baselines (DI++) and multi-step competitors (DRaFT, HyperNoise). Finally, we apply RMMD to GenCast, a state-of-the-art weather forecasting model, to distill it while optimizing the Continuous Ranked Probability Score (CRPS) metric. The resulting distilled model achieves a 7.5x speedup while outperforming the teacher model on 93% of target weather variables, and being better calibrated. This proves that RMMD scales to complex, high-dimensional scientific domains.
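
For readers unfamiliar with the CRPS metric named in the abstract, the following is a minimal sketch of the standard empirical estimator for a finite ensemble and a scalar observation, CRPS = E|X - y| - 0.5 E|X - X'|. It is not taken from the paper: the function name `ensemble_crps` and the scalar setting are assumptions made for illustration, and the abstract does not say how the authors compute or optimize the score for GenCast.

```python
from itertools import product


def ensemble_crps(members, observation):
    """Empirical CRPS of a finite ensemble against one scalar observation.

    Illustrative helper (hypothetical name). Uses the standard estimator
    E|X - y| - 0.5 * E|X - X'|, with both expectations taken over the
    ensemble members. Lower is better; 0 means a perfect, sharp forecast.
    """
    n = len(members)
    if n == 0:
        raise ValueError("need at least one ensemble member")
    # Accuracy term: mean absolute error of the members.
    accuracy = sum(abs(x - observation) for x in members) / n
    # Spread term: mean absolute difference over all ordered member pairs.
    spread = sum(abs(a - b) for a, b in product(members, repeat=2)) / (n * n)
    return accuracy - 0.5 * spread


if __name__ == "__main__":
    # A single-member ensemble reduces to the absolute error.
    print(ensemble_crps([1.0], 3.0))
    # A two-member ensemble that brackets the observation.
    print(ensemble_crps([0.0, 2.0], 1.0))
```

With a single member the spread term vanishes and the score equals the absolute error, which is why CRPS is often described as a probabilistic generalization of mean absolute error.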

C3PO: Learning to Achieve Arbitrary Goals via Massively Entropic Pretraining

Nov 07, 2022

Alexis Jacq, Manu Orsini, Gabriel Dulac-Arnold, Olivier Pietquin, Matthieu Geist, Olivier Bachem

Abstract:Given a particular embodiment, we propose a novel method (C3PO) that learns policies able to achieve any arbitrary position and pose. Such a policy would allow for easier control, and would be reusable as a key building block for downstream tasks. The method is two-fold: First, we introduce a novel exploration algorithm that optimizes for uniform coverage and is able to discover a set of achievable states, and we investigate its ability to attain both high coverage and hard-to-discover states; Second, we leverage this set of achievable states as training data for a universal goal-achievement policy, a goal-based SAC variant. We demonstrate the trained policy's performance in achieving a large number of novel states. Finally, we showcase the influence of massive unsupervised training of a goal-achievement policy with state-of-the-art pose-based control of the Hopper, Walker, Halfcheetah, Humanoid and Ant embodiments.

Lazy-MDPs: Towards Interpretable Reinforcement Learning by Learning When to Act

Mar 16, 2022

Alexis Jacq, Johan Ferret, Olivier Pietquin, Matthieu Geist

Abstract:Traditionally, Reinforcement Learning (RL) aims at deciding how to act optimally for an artificial agent. We argue that deciding when to act is equally important. As humans, we drift from default, instinctive or memorized behaviors to focused, thought-out behaviors when required by the situation. To enhance RL agents with this aptitude, we propose to augment the standard Markov Decision Process and make a new mode of action available: being lazy, which defers decision-making to a default policy. In addition, we penalize non-lazy actions in order to encourage minimal effort and have agents focus on critical decisions only. We name the resulting formalism lazy-MDPs. We study the theoretical properties of lazy-MDPs, expressing value functions and characterizing optimal solutions. Then we empirically demonstrate that policies learned in lazy-MDPs generally come with a form of interpretability: by construction, they show us the states where the agent takes control over the default policy. We deem those states and corresponding actions important since they explain the difference in performance between the default and the new, lazy policy. With suboptimal policies as default (pretrained or random), we observe that agents are able to get competitive performance in Atari games while only taking control in a limited subset of states.

* Autonomous Agents and Multi-Agent Systems (2022)
* AAMAS 2022 (14 pages extended version, added Sec. 7.4 and appendix K)
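
The lazy-MDP construction described in the abstract above can be sketched as an environment wrapper: one extra "lazy" action defers to a default policy, and any other action incurs a cost. This is only one plausible reading of the abstract, not the authors' implementation. The names `ChainEnv`, `LazyWrapper`, `penalty`, and `took_control`, the toy chain task, and the choice to subtract a fixed cost from the reward are all assumptions made for illustration.

```python
class ChainEnv:
    """Hypothetical toy task: walk right along a chain to reach the end."""

    n_actions = 2  # 0 = left, 1 = right

    def __init__(self, length=4):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length)
        done = self.state == self.length
        reward = 1.0 if done else 0.0
        return self.state, reward, done


class LazyWrapper:
    """Adds a lazy action that defers to a default policy (illustrative)."""

    def __init__(self, env, default_policy, penalty):
        self.env = env
        self.default_policy = default_policy  # maps state -> base action
        self.penalty = penalty                # assumed fixed cost of taking control
        self.lazy_action = env.n_actions      # index of the extra action
        self.n_actions = env.n_actions + 1
        self.state = None

    def reset(self):
        self.state = self.env.reset()
        return self.state

    def step(self, action):
        took_control = action != self.lazy_action
        # Being lazy hands the decision to the default policy.
        base_action = action if took_control else self.default_policy(self.state)
        self.state, reward, done = self.env.step(base_action)
        if took_control:
            reward -= self.penalty  # penalize non-lazy actions
        return self.state, reward, done, took_control


if __name__ == "__main__":
    env = LazyWrapper(ChainEnv(length=2), default_policy=lambda s: 1, penalty=0.1)
    env.reset()
    print(env.step(env.lazy_action))  # defers to the default policy, no cost
    print(env.step(1))                # takes control, pays the penalty
```

Logging the `took_control` flag along a trajectory gives the set of states where the agent overrides the default policy, which is the interpretability signal the abstract refers to.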

Foolproof Cooperative Learning

Jun 24, 2019

Alexis Jacq, Julien Perolat, Matthieu Geist, Olivier Pietquin

Abstract:This paper extends the notion of equilibrium in game theory to learning algorithms in repeated stochastic games. We define a learning equilibrium as an algorithm used by a population of players, such that no player can individually use an alternative algorithm and increase its asymptotic score. We introduce Foolproof Cooperative Learning (FCL), an algorithm that converges to a Tit-for-Tat behavior. It allows cooperative strategies when played against itself while not being exploitable by selfish players. We prove that in repeated symmetric games, this algorithm is a learning equilibrium. We illustrate the behavior of FCL on symmetric matrix and grid games, and its robustness to selfish learners.

Cognitive Architecture for Mutual Modelling

Feb 22, 2016

Alexis Jacq, Wafa Johal, Pierre Dillenbourg, Ana Paiva

Abstract:In social robotics, robots need to be understood by humans, especially in collaborative tasks where they have to share mutual knowledge. For instance, in an educational scenario, learners share their knowledge and must adapt their behaviour to make sure they are understood by others. Learners display behaviours in order to show their understanding, and teachers adapt in order to make sure that the learners' knowledge is the required one. This ability requires a model of their own mental states as perceived by others: "Has the human understood that I (the robot) need this object for the task, or should I explain it once again?" In this paper, we discuss the importance of a cognitive architecture enabling second-order Mutual Modelling for Human-Robot Interaction in educational contexts.

* Presented at the 2nd Workshop on Cognitive Architectures for Social Human-Robot Interaction 2016 (arXiv:1602.01868)
