Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Apr 18, 2025

Yixuan Even Xu, Yash Savani, Fei Fang, Zico Kolter

Share this with someone who'll enjoy it:

Abstract:Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing reasoning capabilities in large language models, but faces a fundamental asymmetry in computation and memory requirements: inference is embarrassingly parallel with a minimal memory footprint, while policy updates require extensive synchronization and are memory-intensive. To address this asymmetry, we introduce PODS (Policy Optimization with Down-Sampling), a framework that strategically decouples these phases by generating numerous rollouts in parallel but updating only on an informative subset. Within this framework, we develop max-variance down-sampling, a theoretically motivated method that selects rollouts with maximally diverse reward signals. We prove that this approach has an efficient algorithmic solution, and empirically demonstrate that GRPO with PODS using max-variance down-sampling achieves superior performance over standard GRPO on the GSM8K benchmark.

* 9 pages, 1 figure

View paper on

Share this with someone who'll enjoy it:

Title:Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Paper and Code