Title: Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

May 21, 2025

Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, Tengyang Xie

Abstract: Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), which adapts this idea to LLMs. The result is a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines such as PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.
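
The abstract does not state the objective itself, so the following is only a minimal sketch of what a trajectory-level Bellman residual could look like when logits are read as $Q$-values. It assumes a deterministic, token-level, KL-regularized setting with soft value $V(s) = \beta \log \sum_a \pi_{\mathrm{ref}}(a \mid s) \exp(Q(s,a)/\beta)$ and a terminal value of zero. The function names, the `ref_logprobs` argument, and the uniform-reference default are hypothetical choices for illustration; the paper's exact loss may differ.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log-sum-exp over the last axis."""
    m = np.max(x, axis=-1, keepdims=True)
    return m[..., 0] + np.log(np.sum(np.exp(x - m), axis=-1))


def soft_value(q, ref_logprobs, beta):
    """Assumed soft value: V(s) = beta * log sum_a pi_ref(a|s) * exp(Q(s,a)/beta)."""
    return beta * logsumexp(ref_logprobs + q / beta)


def trajectory_bellman_residual(logits, actions, rewards, beta=1.0, ref_logprobs=None):
    """Squared sum of per-step soft Bellman residuals along one trajectory.

    logits:  (H, A) array, row h holds Q(s_h, .) read off the model's logits
    actions: (H,) integer array of the tokens actually taken
    rewards: (H,) array of per-step rewards (often zero except at the end)
    """
    H, A = logits.shape
    if ref_logprobs is None:
        # Assumption for this sketch: a uniform reference policy.
        ref_logprobs = np.full((H, A), -np.log(A))

    q_taken = logits[np.arange(H), actions]       # Q(s_h, a_h)
    v = soft_value(logits, ref_logprobs, beta)    # V(s_h) for every step
    v_next = np.append(v[1:], 0.0)                # V(s_{h+1}); terminal value is 0

    step_residuals = q_taken - rewards - v_next   # per-step Bellman residuals
    return float(np.sum(step_residuals) ** 2)     # one residual per trajectory


# Hypothetical two-step rollout over a two-token vocabulary.
logits = np.array([[0.3, -0.2], [0.5, 1.5]])
actions = np.array([0, 1])
rewards = np.array([0.0, 1.0])
loss = trajectory_bellman_residual(logits, actions, rewards)
```

Summing the residuals before squaring is what makes the objective trajectory-level in this sketch: the loss needs only one rollout and the model's own logits, with no separate critic network.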
