Saiyong Yang

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

May 07, 2026
Via arXiv

Tool Learning Needs Nothing More Than a Free 8B Language Model

Apr 20, 2026
Via arXiv

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Feb 12, 2026
Via arXiv

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

Feb 12, 2026
Via arXiv

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Feb 02, 2026
Via arXiv

ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning

Jan 13, 2026
Via arXiv

EntroPIC: Towards Stable Long-Term Training of LLMs via Entropy Stabilization with Proportional-Integral Control

Nov 19, 2025
Via arXiv

DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Nov 09, 2025
Via arXiv

Think Outside the Policy: In-Context Steered Policy Optimization

Oct 30, 2025
Via arXiv

Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

Oct 30, 2025
Via arXiv