Morgane M Moss

Putting the Value Back in RL: Better Test-Time Scaling by Unifying LLM Reasoners With Verifiers

May 07, 2025

Kusha Sareen, Morgane M Moss, Alessandro Sordoni, Rishabh Agarwal, Arian Hosseini

Abstract: Prevalent reinforcement learning (RL) methods for fine-tuning LLM reasoners, such as GRPO or Leave-one-out PPO, abandon the learned value function in favor of empirically estimated returns. This hinders test-time compute scaling that relies on using the value function for verification. In this work, we propose RL^V, which augments any "value-free" RL method by jointly training the LLM as both a reasoner and a generative verifier using RL-generated data, adding verification capabilities without significant overhead. Empirically, RL^V boosts MATH accuracy by over 20% with parallel sampling and enables 8-32× more efficient test-time compute scaling compared to the base RL method. RL^V also exhibits strong generalization capabilities for both easy-to-hard and out-of-domain tasks. Furthermore, RL^V achieves 1.2-1.6× higher performance when jointly scaling parallel and sequential test-time compute with a long-reasoning R1 model.
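As a rough illustration of how verifier scores can be combined with parallel sampling at test time, the sketch below selects a final answer from several sampled solutions. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `best_of_n` and `weighted_vote`, the sample answers, and the score values are all hypothetical, and the abstract does not say which aggregation rule the authors use.

```python
from collections import defaultdict

# Each candidate is (final_answer, verifier_score).
# The scores are hypothetical stand-ins for the probability a
# generative verifier assigns to "this solution is correct".

def best_of_n(candidates):
    """Return the answer of the single highest-scoring sample."""
    return max(candidates, key=lambda c: c[1])[0]

def weighted_vote(candidates):
    """Sum verifier scores per distinct answer and return the largest total."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Four parallel samples for one problem (illustrative values only).
samples = [("42", 0.9), ("41", 0.6), ("41", 0.5), ("7", 0.1)]

print(best_of_n(samples))      # picks the single most trusted sample
print(weighted_vote(samples))  # lets agreeing samples pool their scores
```

The two rules can disagree: best-of-N trusts one high-scoring sample, while weighted voting rewards answers that several samples agree on. Either one needs a per-solution score, which is the role a verifier plays when no learned value function is available.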

debug-gym: A Text-Based Environment for Interactive Debugging

Mar 27, 2025

Xingdi Yuan, Morgane M Moss, Charbel El Feghali, Chinmay Singh, Darya Moldavskaya, Drew MacPhee, Lucas Caccia, Matheus Pereira, Minseon Kim, Alessandro Sordoni (+1 more)

Abstract: Large Language Models (LLMs) are increasingly relied upon for coding tasks, yet most scenarios assume that all relevant information is either accessible in context or matches their training data. We posit that LLMs can benefit from the ability to interactively explore a codebase to gather the information relevant to their task. To achieve this, we present a textual environment, namely debug-gym, for developing LLM-based agents in an interactive coding setting. Our environment is lightweight and provides a preset collection of useful tools, such as a Python debugger (pdb), designed to facilitate an LLM-based agent's interactive debugging. Beyond coding and debugging tasks, this approach can be generalized to other tasks that would benefit from information-seeking behavior by an LLM agent.
