Maxime Heuillet

LLM-as-a-Judge: Toward World Models for Slate Recommendation Systems

Nov 06, 2025

Baptiste Bonin, Maxime Heuillet, Audrey Durand

Abstract: Modeling user preferences across domains remains a key challenge in slate recommendation (i.e., recommending an ordered sequence of items) research. We investigate how Large Language Models (LLMs) can effectively act as world models of user preferences through pairwise reasoning over slates. We conduct an empirical study involving several LLMs on three tasks spanning different datasets. Our results reveal relationships between task performance and properties of the preference function captured by LLMs, hinting at areas for improvement and highlighting the potential of LLMs as world models in recommender systems.
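
As a rough illustration of pairwise reasoning over slates, the sketch below ranks candidate slates by how many pairwise comparisons each one wins. It is not the paper's method: `rank_slates` and `toy_judge` are hypothetical names, and `toy_judge` is a deterministic stand-in for what would in practice be an LLM call returning which of two slates a user would prefer.

```python
from itertools import combinations


def rank_slates(slates, judge):
    """Rank slates by pairwise win counts.

    `judge(a, b)` is assumed to return True when slate `a` is preferred
    over slate `b`. In an LLM-as-a-judge setup this would wrap a model call.
    """
    wins = [0] * len(slates)
    for i, j in combinations(range(len(slates)), 2):
        if judge(slates[i], slates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Indices of slates, most preferred first (ties keep input order).
    return sorted(range(len(slates)), key=lambda k: wins[k], reverse=True)


def toy_judge(slate_a, slate_b):
    # Hypothetical stand-in for an LLM: prefers the slate whose first
    # item has the higher score.
    return slate_a[0] >= slate_b[0]


slates = [[1, 2], [3, 1], [2, 3]]
ranking = rank_slates(slates, toy_judge)
```

Counting wins is only one way to aggregate pairwise judgments; it implicitly assumes the judge's preferences are reasonably consistent across pairs.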

Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts

Aug 13, 2025

Maxime Heuillet, Yufei Cui, Boxing Chen, Audrey Durand, Prasanna Parthasarathi

Abstract: Advanced reasoning in LLMs on challenging domains such as mathematical reasoning can be tackled using reinforced fine-tuning (ReFT) based on verifiable rewards. In standard ReFT frameworks, a behavior model generates multiple completions with answers per problem, and each answer is then scored by a reward function. While such RL post-training methods demonstrate significant performance improvements across challenging reasoning domains, generating completions during training requires multiple inference steps, which makes the training cost non-trivial. To address this, we draw inspiration from off-policy RL and speculative decoding to introduce a novel ReFT framework, dubbed Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training. Configuring the behavior model with dynamic layer skipping per batch during training decreases the inference cost compared to standard ReFT frameworks. Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance. Our empirical analysis demonstrates improved computational efficiency, measured in tokens/sec, across multiple math reasoning benchmarks and model sizes. Additionally, we explore three variants of bias mitigation that minimize the off-policyness in the gradient updates, which allows for maintaining performance that matches the baseline ReFT performance.
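
To make the off-policy idea concrete, here is a minimal sketch of importance weighting, a standard correction when samples come from a behavior model rather than the target model. The abstract does not say which corrections Nested-ReFT uses, so treat this as a generic illustration: the function names and the optional `clip` parameter are assumptions, not the paper's API.

```python
import math


def importance_weights(target_logps, behavior_logps, clip=None):
    """Per-sample ratios pi_target(y) / pi_behavior(y) from log-probabilities.

    `clip` (an assumption, not from the abstract) optionally caps each ratio,
    trading some bias for lower variance.
    """
    weights = [math.exp(t - b) for t, b in zip(target_logps, behavior_logps)]
    if clip is not None:
        weights = [min(w, clip) for w in weights]
    return weights


def off_policy_estimate(rewards, target_logps, behavior_logps, clip=None):
    """Importance-weighted average reward under the target model."""
    weights = importance_weights(target_logps, behavior_logps, clip)
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


# Completions sampled from the behavior model, scored by a reward function.
rewards = [1.0, 0.0, 1.0]
behavior_logps = [math.log(0.25), math.log(0.5), math.log(0.25)]
target_logps = [math.log(0.5), math.log(0.25), math.log(0.25)]
estimate = off_policy_estimate(rewards, target_logps, behavior_logps)
```

Without clipping, the weighted average is an unbiased estimate of the expected reward under the target model whenever the behavior model assigns non-zero probability to every completion the target can produce.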

Neural Active Learning Meets the Partial Monitoring Framework

May 14, 2024

Maxime Heuillet, Ola Ahmad, Audrey Durand

Abstract: We focus on the online-based active learning (OAL) setting where an agent operates over a stream of observations and trades off the costly acquisition of information (labelled observations) against the cost of prediction errors. We propose a novel foundation for OAL tasks based on partial monitoring (PM), a theoretical framework specialized in online learning from partially informative actions. We show that previously studied binary and multi-class OAL tasks are instances of partial monitoring. We expand the real-world potential of OAL by introducing a new class of cost-sensitive OAL tasks. We propose NeuralCBP, the first PM strategy that accounts for predictive uncertainty with deep neural networks. Our extensive empirical evaluation on open-source datasets shows that NeuralCBP has favorable performance against state-of-the-art baselines on multiple binary, multi-class and cost-sensitive OAL tasks.
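
One way to picture an OAL task as a partial monitoring game is through a loss matrix and a feedback matrix indexed by (action, outcome). The sketch below builds such matrices under assumptions of our own: actions are "predict class k" plus one "query the label" action, and `make_oal_game`, `query_cost` and `error_cost` are hypothetical names and values, not the paper's construction.

```python
def make_oal_game(n_classes, query_cost=0.5, error_cost=1.0):
    """Build loss and feedback matrices for a simple OAL-style game.

    Rows are actions: predict class 0..n_classes-1, then "query the label".
    Columns are outcomes: the true class of the current observation.
    """
    loss, feedback = [], []
    for predicted in range(n_classes):
        # Predicting is free when correct, costs `error_cost` when wrong,
        # and reveals nothing about the true label.
        loss.append([0.0 if predicted == label else error_cost
                     for label in range(n_classes)])
        feedback.append(["none"] * n_classes)
    # Querying always costs `query_cost` but reveals the true label.
    loss.append([query_cost] * n_classes)
    feedback.append([f"label={label}" for label in range(n_classes)])
    return loss, feedback


loss, feedback = make_oal_game(n_classes=2)
```

Replacing the single `error_cost` with a per-pair cost table would give a cost-sensitive variant in the same shape.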

Randomized Confidence Bounds for Stochastic Partial Monitoring

Feb 07, 2024

Maxime Heuillet, Ola Ahmad, Audrey Durand

Abstract: The partial monitoring (PM) framework provides a theoretical formulation of sequential learning problems with incomplete feedback. On each round, a learning agent plays an action while the environment simultaneously chooses an outcome. The agent then observes a feedback signal that is only partially informative about the (unobserved) outcome. The agent leverages the received feedback signals to select actions that minimize the (unobserved) cumulative loss. In contextual PM, the outcomes depend on some side information that is observable by the agent before selecting the action on each round. In this paper, we consider the contextual and non-contextual PM settings with stochastic outcomes. We introduce a new class of strategies based on the randomization of deterministic confidence bounds that extend regret guarantees to settings where existing stochastic strategies are not applicable. Our experiments show that the proposed RandCBP and RandCBPside* strategies improve over state-of-the-art baselines in PM games. To encourage the adoption of the PM framework, we design a use case on the real-world problem of monitoring the error rate of any deployed classification system.
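
The round-by-round protocol described in the abstract can be sketched as a small simulation. Everything concrete here is an assumption for illustration: the two-action game in `LOSS` and `FEEDBACK` is a toy example, and `uniform_policy` is a placeholder strategy, not RandCBP or any strategy from the paper.

```python
import random

# Hypothetical game: rows are actions, columns are outcomes.
LOSS = [[1.0, 0.0],            # action 0: loss depends on the hidden outcome
        [0.0, 1.0]]            # action 1
FEEDBACK = [["none", "none"],  # action 0 reveals nothing about the outcome
            ["a", "b"]]        # action 1 reveals which outcome occurred


def uniform_policy(history, rng):
    # Placeholder strategy: ignores past feedback and picks uniformly.
    return rng.randrange(len(LOSS))


def play(n_rounds, outcome_probs, policy, seed=0):
    """Simulate a stochastic PM game and return (history, hidden loss)."""
    rng = random.Random(seed)
    history, hidden_loss = [], 0.0
    for _ in range(n_rounds):
        action = policy(history, rng)
        # The environment draws the outcome independently of the action.
        outcome = rng.choices(range(len(outcome_probs)),
                              weights=outcome_probs)[0]
        # The agent sees only the feedback symbol, never outcome or loss.
        history.append((action, FEEDBACK[action][outcome]))
        hidden_loss += LOSS[action][outcome]
    return history, hidden_loss


history, hidden_loss = play(50, [0.7, 0.3], uniform_policy, seed=1)
```

A real strategy would replace `uniform_policy` with one that uses the feedback in `history` to balance informative actions against low-loss ones.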
