Abstract:The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
Abstract:Offline reinforcement learning aims to enable agents to be trained from pre-collected datasets, however, this comes with the added challenge of estimating the value of behavior not covered in the dataset. Model-based methods offer a solution by allowing agents to collect additional synthetic data via rollouts in a learned dynamics model. The prevailing theoretical understanding is that this can then be viewed as online reinforcement learning in an approximate dynamics model, and any remaining gap is therefore assumed to be due to the imperfect dynamics model. Surprisingly, however, we find that if the learned dynamics model is replaced by the true error-free dynamics, existing model-based methods completely fail. This reveals a major misconception. Our subsequent investigation finds that the general procedure used in model-based algorithms results in the existence of a set of edge-of-reach states which trigger pathological value overestimation and collapse in Bellman-based algorithms. We term this the edge-of-reach problem. Based on this, we fill some gaps in existing theory and also explain how prior model-based methods are inadvertently addressing the true underlying edge-of-reach problem. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and achieves strong performance across both proprioceptive and pixel-based benchmarks. Code open-sourced at: https://github.com/anyasims/edge-of-reach.