Policy Gradient


Interpreting and Controlling LLM Reasoning through Integrated Policy Gradient

Feb 03, 2026

Reparameterization Flow Policy Optimization

Feb 03, 2026

ConsistentRFT: Reducing Visual Hallucinations in Flow-based Reinforcement Fine-Tuning

Feb 03, 2026

An Approximate Ascent Approach To Prove Convergence of PPO

Feb 03, 2026

TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System

Feb 03, 2026

$V_0$: A Generalist Value Model for Any Policy at State Zero

Feb 03, 2026

Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

Feb 03, 2026

Flow Policy Gradients for Robot Control

Feb 02, 2026

Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

Feb 03, 2026

Causal Flow Q-Learning for Robust Offline Reinforcement Learning

Feb 02, 2026