Jonathan Colaço Carr

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

May 29, 2026

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

Abstract: Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. However, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the Markov decision contest as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

Focused Skill Discovery: Learning to Control Specific State Variables while Minimizing Side Effects

Oct 06, 2025

Jonathan Colaço Carr, Qinyi Sun, Cameron Allen

Abstract: Skills are essential for unlocking higher levels of problem solving. A common approach to discovering these skills is to learn ones that reliably reach different states, thus empowering the agent to control its environment. However, existing skill discovery algorithms often overlook the natural state variables present in many reinforcement learning problems, meaning that the discovered skills lack control of specific state variables. This can significantly hamper exploration efficiency, make skills more challenging to learn with, and lead to negative side effects in downstream tasks when the goal is under-specified. We introduce a general method that enables these skill discovery algorithms to learn focused skills -- skills that target and control specific state variables. Our approach improves state space coverage by a factor of three, unlocks new learning capabilities, and automatically avoids negative side effects in downstream tasks.

* Reinforcement Learning Journal 2025
