Title: The Good, The Bad, and The Hybrid: A Reward Structure Showdown in Reasoning Models Training

Nov 17, 2025

Subramanyam Sahoo

Abstract: Reward design is central to reinforcement learning from human feedback (RLHF) and alignment research. In this work, we propose a unified framework to study hard, continuous, and hybrid reward structures for fine-tuning large language models (LLMs) on mathematical reasoning tasks. Using Qwen3-4B with LoRA fine-tuning on the GSM8K dataset, we formalize and empirically evaluate reward formulations that incorporate correctness, perplexity, reasoning quality, and consistency. We introduce an adaptive hybrid reward scheduler that transitions between discrete and continuous signals, balancing exploration and stability. Our results show that hybrid reward structures improve convergence speed and training stability over purely hard or continuous approaches, offering insights for alignment via adaptive reward modeling.
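To make the distinction between hard, continuous, and hybrid rewards concrete, here is a minimal Python sketch. It is not the paper's implementation. The abstract does not give the formulas, so the component weights, the `Scores` container, the linear schedule in `mixing_weight`, and the direction of the transition (hard first, continuous later) are all assumptions made for illustration. The scheduler described in the abstract is adaptive, whereas this sketch uses a fixed linear ramp for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Hypothetical per-sample component scores, each assumed to lie in [0, 1]."""
    correctness: float
    perplexity: float   # assumed already mapped so that higher is better
    reasoning: float
    consistency: float


def hard_reward(correct: bool) -> float:
    """Discrete signal: 1 for a correct final answer, 0 otherwise."""
    return 1.0 if correct else 0.0


def continuous_reward(s: Scores, weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of the component scores (weights are illustrative only)."""
    parts = (s.correctness, s.perplexity, s.reasoning, s.consistency)
    return sum(w * p for w, p in zip(weights, parts))


def mixing_weight(step: int, total_steps: int) -> float:
    """Assumed linear ramp from 0 to 1 over training, clamped to [0, 1]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(max(step / total_steps, 0.0), 1.0)


def hybrid_reward(correct: bool, s: Scores, step: int, total_steps: int) -> float:
    """Blend the hard and continuous signals according to the schedule."""
    alpha = mixing_weight(step, total_steps)
    return (1.0 - alpha) * hard_reward(correct) + alpha * continuous_reward(s)
```

With this blend, the reward equals the hard signal at step 0 and the continuous signal at the final step. An adaptive variant could replace `mixing_weight` with a function of training statistics such as reward variance, though that choice is likewise an assumption rather than something stated in the abstract.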

* Paper accepted to the 2nd Workshop on Aligning Reinforcement Learning Experimentalists and Theorists (ARLET 2025) at NeurIPS; the paper consists of 14 pages (including the appendix) and contains 3 figures
