Title: Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Oct 08, 2025

Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu

Abstract: Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle: many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
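The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the target ranges `[-alpha, alpha]` and `[1 - beta, 1 + beta]`, the sigmoid form of the weight, and every default value are assumptions chosen only to show the idea. Reward-model (RM) scores are rescaled separately inside the verifier-correct and verifier-incorrect groups so that no incorrect response can outrank a correct one, and prompts whose RM scores vary more receive a larger weight.

```python
import math
from statistics import pstdev


def stratified_rewards(verifier, rm_scores, alpha=0.1, beta=0.1, eps=1e-8):
    """Rescale RM scores within each verifier group (hypothetical ranges).

    verifier:  list of 0/1 labels from a deterministic checker.
    rm_scores: list of continuous reward-model scores, same length.
    Incorrect responses land in [-alpha, alpha]; correct ones in
    [1 - beta, 1 + beta]. Both ranges are illustrative assumptions.
    """
    # The two ranges must not overlap, so correctness is always preserved.
    assert alpha < 1 - beta, "ranges would overlap"
    rewards = [0.0] * len(rm_scores)
    for label in (0, 1):
        idx = [i for i, v in enumerate(verifier) if v == label]
        if not idx:
            continue
        group = [rm_scores[i] for i in idx]
        lo, hi = min(group), max(group)
        for i in idx:
            # Min-max normalize within the group; eps avoids division by zero.
            unit = (rm_scores[i] - lo) / (hi - lo + eps)
            if label == 1:
                rewards[i] = (1 - beta) + 2 * beta * unit
            else:
                rewards[i] = -alpha + 2 * alpha * unit
    return rewards


def variance_weight(rm_scores, w_min=0.5, w_max=2.0, k=5.0, sigma_ref=0.5):
    """Map the spread of RM scores for one prompt to a weight (assumed form).

    A sigmoid of the population standard deviation keeps the weight inside
    (w_min, w_max); higher spread yields a higher weight.
    """
    sigma = pstdev(rm_scores)
    return w_min + (w_max - w_min) / (1 + math.exp(-k * (sigma - sigma_ref)))


if __name__ == "__main__":
    verifier = [1, 0, 1, 0]
    rm_scores = [0.9, 0.8, 0.2, 0.1]
    print(stratified_rewards(verifier, rm_scores))
    print(variance_weight(rm_scores))
```

In this sketch the incorrect response with RM score 0.8 still receives a lower reward than the correct response with RM score 0.2, which is the sense in which the verifier's ordering is preserved while the RM refines distinctions within each group.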

* 20 pages
