policy gradient


Policy Mirror Descent with Temporal Difference Learning: Sample Complexity under Online Markov Data

Dec 30, 2025

Performative Policy Gradient: Optimality in Performative Reinforcement Learning

Dec 23, 2025

Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

Jan 12, 2026

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

Jan 07, 2026

Physics-Informed Tree Search for High-Dimensional Computational Design

Jan 10, 2026

Safety-Biased Policy Optimisation: Towards Hard-Constrained Reinforcement Learning via Trust Regions

Dec 29, 2025

Joint Link Adaptation and Device Scheduling Approach for URLLC Industrial IoT Network: A DRL-based Method with Bayesian Optimization

Dec 29, 2025

Beamforming for Massive MIMO Aerial Communications: A Robust and Scalable DRL Approach

Dec 29, 2025

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

Jan 08, 2026

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Dec 28, 2025