Title: Explicit Preference Optimization: No Need for an Implicit Reward Model

Jun 09, 2025

Xiangkun Hu, Lemin Kong, Tong He, David Wipf

Abstract: The generated responses of large language models (LLMs) are often fine-tuned to human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research effort has targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. And yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To this end, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to achieve an implicit reward. Quite differently, we merely posit intuitively-appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy of EXPO.

* arXiv admin note: substantial text overlap with arXiv:2407.09072
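
For readers unfamiliar with the implicit reward the abstract refers to, the following is a minimal sketch of the standard per-example DPO loss as it is commonly written, not the EXPO objective proposed in the paper (the abstract does not state EXPO's formula). The function names, argument names, and the default beta = 0.1 are illustrative assumptions; the inputs are assumed to be summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the trained policy and a frozen reference model.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair (hypothetical helper).

    The implicit reward of a response is beta times the log-ratio between
    the policy and the reference model. The loss is the negative
    log-sigmoid of the reward margin between chosen and rejected responses.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    return -log_sigmoid(chosen_reward - rejected_reward)


# When the policy equals the reference, both implicit rewards are zero
# and the loss equals log(2).
loss_at_reference = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Shifting probability mass toward the chosen response lowers the loss.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

Note that no separate reward model appears anywhere in this computation: the reward is defined entirely through the policy-to-reference log-ratio, which is the reparameterization whose side effects the paper analyzes.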
