Beijing Institute of Technology
Abstract: Reinforcement Learning from Human Feedback (RLHF) and its variants have emerged as the dominant approaches for aligning Large Language Models (LLMs) with human intent. While empirically effective, the theoretical generalization properties of these methods in high-dimensional settings remain largely unexplored. To this end, we build a generalization theory for RLHF of LLMs under the linear reward model, through the framework of algorithmic stability. In contrast to existing works built upon the consistency of maximum likelihood estimation of the reward model, our analysis is presented under an end-to-end learning framework, which is consistent with practice. Concretely, we prove that under a key \textbf{feature coverage} condition, the empirical optima of the policy model have a generalization bound of order $\mathcal{O}(n^{-\frac{1}{2}})$. Moreover, the results extend to parameters obtained by gradient-based learning algorithms, i.e., Gradient Ascent (GA) and Stochastic Gradient Ascent (SGA). Thus, we argue that our results provide new theoretical evidence for the empirically observed generalization of LLMs after RLHF.
Abstract: The exploration-exploitation (EE) trade-off is a central challenge in reinforcement learning (RL) for large language models (LLMs). With Group Relative Policy Optimization (GRPO), training tends to be exploitation-driven: entropy decreases monotonically, samples converge, and exploration fades. Most existing fixes are \textbf{sample-centric}: they seek out or reward rare samples, assuming exploration comes from novel trajectories and tokens. These heuristics depend on the "luck" of informative samples, lack principled control of the policy, and often yield limited or inconsistent gains. In this work, we are the first to introduce a \textbf{distribution-centric} perspective for RL, in which exploration is always guided by a "better" target distribution, and reveal that a policy's ability to resist entropy collapse is governed by the distribution itself rather than individual samples. Building on this insight, we propose Distribution-Centric Policy Optimization (DCPO), which reformulates entropy regulation as distribution-level regularization. DCPO achieves controllable entropy fully on-policy without sampling from external distributions, enabling efficient exploration while maintaining training stability. Across multiple models and seven benchmarks, DCPO improves over GRPO by about 20\% on average. Overall, DCPO replaces sample-level heuristics with distribution-level principles, offering a theoretically grounded and flexible framework for controllable exploration and a stronger EE trade-off. The code is available at https://github.com/597358816/DCPO.
Abstract: By using a parametric value function to replace Monte-Carlo rollouts for value estimation, actor-critic (AC) algorithms can reduce the variance of the stochastic policy gradient and thereby improve the convergence rate. While existing works mainly focus on analyzing the convergence rate of AC algorithms under Markovian noise, the impact of momentum on AC algorithms remains largely unexplored. In this work, we first propose a heavy-ball momentum based advantage actor-critic (\mbox{HB-A2C}) algorithm by integrating heavy-ball momentum into the critic recursion, which is parameterized by a linear function. When the sample trajectory follows a Markov decision process, we quantitatively certify the acceleration capability of the proposed HB-A2C algorithm. Our theoretical results demonstrate that the proposed HB-A2C finds an $\epsilon$-approximate stationary point within $\mathcal{O}(\epsilon^{-2})$ iterations for reinforcement learning tasks with Markovian noise. Moreover, we also reveal the dependence of the learning rates on the length of the sample trajectory. By carefully selecting the momentum factor of the critic recursion, the proposed HB-A2C can balance the errors introduced by the initialization and the stochastic approximation.
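To make the idea of a heavy-ball critic recursion concrete, the following is a minimal sketch of TD(0) with a linear value function and a heavy-ball momentum term, run along a single Markovian trajectory. It is not the HB-A2C algorithm itself: the abstract does not state the exact recursion, so the update form, the function name `hb_td0_critic`, the toy chain, and all step sizes are assumptions chosen for illustration, and the actor update is omitted entirely.

```python
import numpy as np

def hb_td0_critic(features, rewards, transitions, gamma=0.9, step=0.005,
                  momentum=0.5, n_steps=50000, seed=0):
    """Linear TD(0) critic with a heavy-ball momentum term (illustrative only).

    Assumed update: w_next = w + step * td_error * phi(s) + momentum * (w - w_prev).
    """
    rng = np.random.default_rng(seed)
    n_states, dim = features.shape
    w = np.zeros(dim)
    w_prev = np.zeros(dim)
    s = 0
    for _ in range(n_steps):
        # Sample the next state from the chain, so the noise is Markovian.
        s_next = rng.choice(n_states, p=transitions[s])
        td_error = rewards[s] + gamma * features[s_next] @ w - features[s] @ w
        # Semi-gradient TD step plus the heavy-ball term momentum * (w - w_prev).
        w_next = w + step * td_error * features[s] + momentum * (w - w_prev)
        w_prev, w = w, w_next
        s = s_next
    return w

# Hypothetical 3-state chain with one-hot (tabular) features.
P = np.array([[0.5, 0.5, 0.0],
              [0.25, 0.5, 0.25],
              [0.0, 0.5, 0.5]])
r = np.array([0.0, 0.0, 1.0])
w_hat = hb_td0_critic(np.eye(3), r, P)
v_true = np.linalg.solve(np.eye(3) - 0.9 * P, r)  # exact values for comparison
```

Setting `momentum=0.0` recovers plain TD(0), which makes it easy to compare the two recursions on the same trajectory seed.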




Abstract: We consider monotone inclusion problems where the operators may be expectation-valued. A direct application of proximal and splitting schemes is complicated by resolving problems with expectation-valued maps at each step, a concern that is addressed by using sampling. Accordingly, we propose avenues for addressing uncertainty in the mapping. (i) Variance-reduced stochastic proximal point method (vr-SPP). We develop among the first variance-reduced stochastic proximal-point schemes that achieve deterministic rates of convergence in terms of solving proximal-point problems. In addition, it is shown that the schemes are characterized by either optimal or near-optimal oracle (or sample) complexity guarantees. Finally, the generated sequences are shown to be convergent to a solution in an almost-sure sense in both monotone and strongly monotone regimes; (ii) Variance-reduced stochastic modified forward-backward splitting scheme (vr-SMFBS). In constrained settings, we consider structured settings where the map can be decomposed into an expectation-valued map $A$ and a maximal monotone map $B$ with a tractable resolvent. Akin to (i), we show that the proposed schemes are equipped with a.s. convergence guarantees, linear (strongly monotone $A$) and $\mathcal{O}(1/k)$ (monotone $A$) rates of convergence while achieving optimal oracle complexity bounds. Of these, the rate statements in monotone regimes rely on leveraging the Fitzpatrick gap function for monotone inclusions. Furthermore, the schemes rely on weaker moment requirements on noise and allow for weakening unbiasedness requirements on oracles in strongly monotone regimes. Preliminary numerics reflect these findings and show that the variance-reduced schemes outperform stochastic approximation schemes, stochastic splitting and proximal point schemes, and sample-average approximation approaches.
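As a rough illustration of a sampled proximal-point step, the sketch below solves a toy strongly monotone problem $0 = \mathbb{E}[Mx - b + \xi]$ in which each iteration applies the resolvent of a sample-averaged map. The abstract does not say how variance reduction is achieved; the geometrically increasing batch size used here is an assumption, as are the function name `vr_stochastic_proximal_point`, the affine map, and all parameter values.

```python
import numpy as np

def vr_stochastic_proximal_point(M, b, x0, lam=1.0, n_iters=40, growth=1.3,
                                 noise_std=1.0, seed=0):
    """Proximal-point iteration on a sample-averaged affine map (illustrative).

    The sampled map is A(x, xi) = M x - b + xi with zero-mean noise xi.
    Variance reduction is assumed to come from a growing batch size.
    """
    rng = np.random.default_rng(seed)
    dim = len(b)
    x = np.asarray(x0, dtype=float)
    resolvent_matrix = np.eye(dim) + lam * M
    for k in range(n_iters):
        # Batch size grows geometrically, shrinking the variance of the averaged noise.
        batch_size = int(np.ceil(growth ** k))
        noise_mean = rng.normal(0.0, noise_std, size=(batch_size, dim)).mean(axis=0)
        # Resolvent step: solve x_next + lam * (M x_next - b + noise_mean) = x.
        x = np.linalg.solve(resolvent_matrix, x + lam * (b - noise_mean))
    return x

# Hypothetical non-symmetric, strongly monotone map (symmetric part is 2 * I).
M = np.array([[2.0, 1.0], [-1.0, 2.0]])
b = np.array([1.0, -1.0])
x_hat = vr_stochastic_proximal_point(M, b, np.zeros(2))
x_star = np.linalg.solve(M, b)  # exact solution of the noise-free problem
```

For an affine map the resolvent is a linear solve; for a general map each step would require an inner solver, which is the per-step cost the abstract refers to.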




Abstract: The Gaussian process (GP) model, which has been extensively applied as a prior over functions, has demonstrated excellent performance. However, specifying a large number of parameters affects the computational efficiency and the feasibility of implementing a control strategy. We propose a linear model to approximate GPs; this model expands the GP model as a series of basis functions. Several examples and simulation studies are presented to demonstrate the advantages of the proposed method. A control strategy is provided with the proposed linear model.
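One common way to approximate a GP with a linear model is to write $f(x) \approx \phi(x)^\top w$ with a Gaussian prior on $w$, so that inference reduces to Bayesian linear regression. The sketch below uses random Fourier features for a squared-exponential kernel as the basis. This choice is an assumption made for illustration: the abstract does not specify which basis functions the authors use, and the names `rff_features` and `fit_linear_gp` are hypothetical.

```python
import numpy as np

def rff_features(x, freqs, phases):
    """Random Fourier features for 1-D inputs; returns an (n, D) design matrix."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.sqrt(2.0 / len(freqs)) * np.cos(x * freqs + phases)

def fit_linear_gp(x_train, y_train, n_features=200, lengthscale=1.0,
                  noise_var=1e-2, seed=0):
    """Posterior-mean predictor of the linear model f(x) = phi(x) @ w, w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    # Frequencies drawn from the spectral density of a squared-exponential kernel.
    freqs = rng.normal(0.0, 1.0 / lengthscale, size=n_features)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Phi = rff_features(x_train, freqs, phases)
    # Bayesian linear regression: w_mean = (Phi^T Phi + noise_var * I)^{-1} Phi^T y.
    A = Phi.T @ Phi + noise_var * np.eye(n_features)
    w_mean = np.linalg.solve(A, Phi.T @ np.asarray(y_train, dtype=float))

    def predict(x_new):
        return rff_features(x_new, freqs, phases) @ w_mean

    return predict

x_train = np.linspace(-3.0, 3.0, 40)
predict = fit_linear_gp(x_train, np.sin(x_train))
x_test = np.linspace(-2.5, 2.5, 25)
y_pred = predict(x_test)
```

The linear solve costs O(D^3) in the number of basis functions D rather than O(n^3) in the number of data points, which is the usual computational motivation for such expansions.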