Recent advances in large language models have demonstrated the promise of unsupervised reinforcement learning (RL) methods for enhancing reasoning capabilities without external supervision. However, the generalizability of these label-free RL approaches to smaller base models with limited reasoning capabilities remains unexplored. In this work, we systematically investigate the performance of label-free RL methods across different model sizes and reasoning strengths, from 0.5B to 7B parameters. Our empirical analysis reveals critical limitations: label-free RL is highly dependent on the base model's pre-existing reasoning capability, with performance often degrading below baseline levels for weaker models. We find that smaller models fail to generate sufficiently long or diverse chain-of-thought reasoning to enable effective self-reflection, and that training data difficulty plays a crucial role in determining success. To address these challenges, we propose a simple yet effective label-free RL method that uses curriculum learning to progressively introduce harder problems and masks no-majority rollouts during training. Additionally, we introduce a data curation pipeline to generate samples with predefined difficulty. Our approach demonstrates consistent improvements across all model sizes and reasoning capabilities, providing a path toward more robust unsupervised RL that can bootstrap reasoning abilities in resource-constrained models. We make our code available at https://github.com/BorealisAI/CuMa.
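For concreteness, the following is a minimal sketch of the no-majority masking idea in a majority-vote label-free reward (TTRL-style self-consistency): rollout groups whose extracted answers produce no clear majority are dropped from the policy-gradient loss rather than trained on. The function name, the `min_majority_frac` threshold, and the answer-extraction step are illustrative assumptions, not the paper's released implementation.

```python
from collections import Counter
from typing import List, Optional, Tuple

def majority_pseudo_rewards(
    answers: List[str],
    min_majority_frac: float = 0.5,  # hypothetical threshold; the paper may use a different value
) -> Tuple[Optional[List[float]], Optional[str]]:
    """Compute label-free rewards for one prompt's rollout group.

    `answers` are the final answers extracted from each sampled rollout.
    Returns (rewards, pseudo_label); (None, None) signals a no-majority
    group whose rollouts should be masked out of the RL loss.
    """
    counts = Counter(answers)
    (top_answer, top_count), *rest = counts.most_common()
    # Mask the group when the plurality answer is tied or falls below
    # the majority threshold: no reliable pseudo-label exists.
    if (rest and rest[0][1] == top_count) or top_count / len(answers) < min_majority_frac:
        return None, None
    # Otherwise treat the majority answer as a pseudo-label: reward 1 for
    # rollouts that agree with it, 0 otherwise.
    rewards = [1.0 if a == top_answer else 0.0 for a in answers]
    return rewards, top_answer

# Example: 5 of 8 rollouts agree -> rewarded; an even split -> masked.
print(majority_pseudo_rewards(["42"] * 5 + ["7", "9", "13"]))  # ([1,1,1,1,1,0,0,0], "42")
print(majority_pseudo_rewards(["1", "2", "1", "2"]))           # (None, None)
```

Under this sketch, the curriculum component would simply schedule which prompts are eligible for sampling, ordering the curated data from easy to hard so that weak base models see solvable problems first and the fraction of masked groups stays manageable.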