This note introduces Projected Microbatch Accumulation (PROMA), a proximal policy method that modifies gradient accumulation across microbatches rather than relying on likelihood ratios relative to a reference policy. During accumulation, PROMA projects the partially accumulated gradient to be orthogonal to the sequence-wise gradients of the current microbatch. This projection is applied layer-wise during the backward pass, enabling efficient implementation. Empirically, PROMA achieves proximal updates without entropy collapse while providing tighter local KL control than GRPO.
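The projection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes each layer's per-sequence gradients are available as the rows of a flattened matrix, and that the current microbatch's summed gradient is added after the projection; the abstract does not state the exact update rule. The names `project_out` and `accumulate` are hypothetical.

```python
import numpy as np


def project_out(acc_grad, seq_grads):
    """Remove from acc_grad its component in the span of the rows of seq_grads.

    acc_grad:  shape (d,), partially accumulated gradient for one layer.
    seq_grads: shape (n_seq, d), per-sequence gradients of the current microbatch.
    """
    # Least-squares coefficients of acc_grad expressed in the sequence gradients.
    # lstsq also handles linearly dependent sequence gradients.
    coeffs, *_ = np.linalg.lstsq(seq_grads.T, acc_grad, rcond=None)
    # Subtracting the fitted part leaves the component orthogonal to every row.
    return acc_grad - seq_grads.T @ coeffs


def accumulate(microbatches):
    """Accumulate gradients layer by layer with the projection applied.

    microbatches: list of dicts mapping a layer name to an (n_seq, d) array.
    Assumption: after projecting, the microbatch's summed gradient is added.
    """
    acc = None
    for mb in microbatches:
        if acc is None:
            # First microbatch: nothing accumulated yet, so nothing to project.
            acc = {name: g.sum(axis=0) for name, g in mb.items()}
            continue
        for name, g in mb.items():
            acc[name] = project_out(acc[name], g) + g.sum(axis=0)
    return acc
```

Because each layer is handled independently, a projection of this kind could in principle run as each layer's gradients become available during the backward pass, which is consistent with the layer-wise application the abstract mentions.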

Title: PROMA: Projected Microbatch Accumulation for Reference-Free Proximal Policy Updates
