Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Luhui Liu

Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Dec 08, 2025

Zhuoran Zhuang, Ye Chen, Jianghao Su, Chao Luo, Luhui Liu, Xia Zeng

Figure 1 for Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Figure 2 for Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Figure 3 for Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Figure 4 for Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

Abstract:Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback - encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.

Via

Access Paper or Ask Questions