Clive Bai

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

May 07, 2026
Via arXiv

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Feb 12, 2026
Via arXiv

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Feb 02, 2026
Via arXiv

ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

Jan 13, 2026
Via arXiv