Title: TDRM: Smooth Reward Models with Temporal Difference for LLM RL and Inference

Sep 18, 2025

Dan Zhang, Min Cai, Jonathan Li, Ziniu Hu, Yisong Yue, Yuxiao Dong, Jie Tang

Abstract: Reward models are central to both reinforcement learning (RL) with language models and inference-time verification. However, existing reward models often lack temporal consistency, which leads to ineffective policy updates and unstable RL training. We introduce TDRM, a method for learning smoother and more reliable reward models by minimizing temporal differences during training. This temporal-difference (TD) regularization produces smooth rewards and improves alignment with long-term objectives. Incorporating TDRM into an actor-critic style online RL loop yields consistent empirical gains. TDRM complements verifiable reward methods, and the two can be used in series. Experiments show that TD-trained process reward models (PRMs) improve performance in Best-of-N (up to 6.6%) and tree-search (up to 23.7%) settings. When combined with Reinforcement Learning with Verifiable Rewards (RLVR), TD-trained PRMs make RL more data-efficient, reaching with just 2.5k data the performance that baseline methods need 50.1k data to attain, and yield higher-quality language model policies on 8 model variants (5 series), e.g., Qwen2.5-(0.5B, 1.5B), GLM4-9B-0414, GLM-Z1-9B-0414, Qwen2.5-Math-(1.5B, 7B), and DeepSeek-R1-Distill-Qwen-(1.5B, 7B). We release all code at https://github.com/THUDM/TDRM.

* 9 figures, 7 tables
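
To make the idea of "minimizing temporal differences" concrete, here is a minimal sketch of the textbook one-step TD error applied to per-step reward-model scores. It is not taken from the paper or its repository: the function names `td_errors` and `td_loss`, the squared-error form, the single outcome reward at the end, and the default `gamma=1.0` are all assumptions made for illustration.

```python
from typing import List


def td_errors(values: List[float], final_reward: float, gamma: float = 1.0) -> List[float]:
    """One-step TD errors for a chain of per-step value estimates.

    Assumptions (illustrative only, not the paper's exact setup):
    values[t] is a reward model's score after reasoning step t, and the
    only external reward is final_reward, observed after the last step.
    """
    if not values:
        raise ValueError("values must contain at least one step")
    errors = []
    for t in range(len(values)):
        if t + 1 < len(values):
            # Intermediate step: bootstrap from the next step's estimate.
            target = gamma * values[t + 1]
        else:
            # Last step: the target is the observed outcome reward.
            target = final_reward
        errors.append(target - values[t])
    return errors


def td_loss(values: List[float], final_reward: float, gamma: float = 1.0) -> float:
    """Mean squared TD error; smaller means scores change more smoothly."""
    errs = td_errors(values, final_reward, gamma)
    return sum(e * e for e in errs) / len(errs)


# A smoothly rising score sequence versus a jumpy one, same final outcome.
smooth = td_loss([0.25, 0.5, 0.75], final_reward=1.0)
jumpy = td_loss([0.0, 1.0, 0.0], final_reward=1.0)
```

Under these assumptions the smoothly rising sequence gets a lower loss than the jumpy one, which is the sense in which penalizing TD errors encourages temporally consistent scores.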
