Abstract:Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose \textbf{F}ine-grained \textbf{G}roup policy \textbf{O}ptimization (\textbf{FGO}), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of the GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.
Abstract:Offline Reinforcement Learning (RL) relies on policy constraints to mitigate extrapolation error, where both the constraint form and constraint strength critically shape performance. However, most existing methods commit to a single constraint family: weighted behavior cloning, density regularization, or support constraints, without a unified principle that explains their connections or trade-offs. In this work, we propose Continuous Constraint Interpolation (CCI), a unified optimization framework in which these three constraint families arise as special cases along a common constraint spectrum. The CCI framework introduces a single interpolation parameter that enables smooth transitions and principled combinations across constraint types. Building on CCI, we develop Automatic Constraint Policy Optimization (ACPO), a practical primal--dual algorithm that adapts the interpolation parameter via a Lagrangian dual update. Moreover, we establish a maximum-entropy performance difference lemma and derive performance lower bounds for both the closed-form optimal policy and its parametric projection. Experiments on D4RL and NeoRL2 demonstrate robust gains across diverse domains, achieving state-of-the-art performance overall.




Abstract:Offline Reinforcement Learning (RL) faces a critical challenge of extrapolation errors caused by out-of-distribution (OOD) actions. Implicit Q-Learning (IQL) algorithm employs expectile regression to achieve in-sample learning, effectively mitigating the risks associated with OOD actions. However, the fixed hyperparameter in policy evaluation and density-based policy improvement method limit its overall efficiency. In this paper, we propose Proj-IQL, a projective IQL algorithm enhanced with the support constraint. In the policy evaluation phase, Proj-IQL generalizes the one-step approach to a multi-step approach through vector projection, while maintaining in-sample learning and expectile regression framework. In the policy improvement phase, Proj-IQL introduces support constraint that is more aligned with the policy evaluation approach. Furthermore, we theoretically demonstrate that Proj-IQL guarantees monotonic policy improvement and enjoys a progressively more rigorous criterion for superior actions. Empirical results demonstrate the Proj-IQL achieves state-of-the-art performance on D4RL benchmarks, especially in challenging navigation domains.