Abstract: Adaptive optimizers that combine preconditioning, momentum, and weight decay (Adam and AdamW) are, under Polyak-Ruppert averaging, candidate engines for one-pass inference. Does the averaged iterate retain the classical Polyak-Ruppert central limit theorem (CLT), with sandwich covariance $H^{-1}SH^{-1}$ (Hessian $H$, gradient covariance $S$), under momentum and non-convergent preconditioning? The preconditioner-only analysis does not carry over: with momentum, the canonical decomposition collapses to a tautology. Treating the augmented state (iterate, momentum buffer) as a time-varying linear stochastic approximation (SA), we prove, under local stabilization, positive drift stability, a non-autonomous Polyak-Ruppert CLT, and a projection identity. The upshot: the iterate-marginal covariance is exactly the plain stochastic gradient descent (SGD) sandwich $H^{-1}SH^{-1}$, so the adaptivity is asymptotically invisible. This holds for SA-Adam (sub-linearly vanishing momentum gain, $\gamma\in(\alpha,1)$; the sub-linear regime is essential), not for Adam as deployed with constant $\beta$. Coupled $L_2$ weight decay yields the ridge-penalized sandwich, extending one-pass inference to regularized problems.
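As a point of reference for the ridge-penalized sandwich, the following is a minimal sketch under stated assumptions, not the method of the paper. It runs plain averaged SGD (not SA-Adam) with coupled $L_2$ weight decay on a scalar least-squares model with Rademacher inputs and bounded uniform noise. The function name `ridge_averaged_sgd`, the parameters `lam`, `eta0`, `alpha`, and the data model are all hypothetical choices made for illustration. In this toy model $H = 1$ and $S = \sigma^2$, so the ridge minimizer is $\theta^*/(1+\lambda)$ and the sandwich variance is $\sigma^2/(1+\lambda)^2$.

```python
import math
import random


def ridge_averaged_sgd(n=20000, lam=0.5, alpha=2 / 3, eta0=0.5,
                       theta_star=2.0, seed=1):
    """Averaged SGD with coupled L2 weight decay on a scalar toy problem.

    Assumed model (illustrative only): y = x * theta_star + eps, with
    x uniform on {-1, +1} and eps uniform on [-a, a], so E[x^2] = 1.
    """
    rng = random.Random(seed)
    a = math.sqrt(3.0) / 2.0          # Var(eps) = a^2 / 3 = 0.25
    theta = 0.0
    theta_bar = 0.0
    for t in range(1, n + 1):
        x = rng.choice((-1.0, 1.0))
        y = x * theta_star + rng.uniform(-a, a)
        # Coupled L2 decay: the penalty gradient lam * theta is added
        # to the stochastic gradient of the squared loss.
        g = x * (x * theta - y) + lam * theta
        theta -= eta0 * t ** (-alpha) * g
        # Running Polyak-Ruppert average of the iterates.
        theta_bar += (theta - theta_bar) / t
    h = 1.0                            # Hessian of the unpenalized loss, E[x^2]
    s = a * a / 3.0                    # gradient variance at the ridge minimizer
    target = h * theta_star / (h + lam)        # ridge minimizer
    avar = s / (h + lam) ** 2                  # (H + lam)^{-1} S (H + lam)^{-1}
    return theta_bar, target, math.sqrt(avar / n)


theta_bar, target, se = ridge_averaged_sgd()
print(f"average={theta_bar:.4f}  ridge target={target:.4f}  std.err={se:.5f}")
```

The averaged iterate should land near the ridge target rather than near `theta_star`, and the returned standard error is the one a Wald interval would use in this toy setting.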
Abstract: Polyak-Ruppert averaging yields an asymptotically normal estimator with sandwich covariance $H^{-1}SH^{-1}$, the foundation of online inference. When the gradient step is preconditioned by a data-driven matrix $P_t$, we ask how fast $P_t$ must stabilize for the central limit theorem (CLT) to remain valid. We resolve this via an exact preconditioner-isolating decomposition of the averaged error that confines $P_t$ to a dynamic remainder $R_n$, leaving the martingale and Taylor terms preconditioner-free. Let $M_t = (P_t H)^{-1}$ denote the effective inverse drift matrix, with $\|M_t - M_{t-1}\|_{\mathrm{op}} \lesssim t^{-\beta}$ and step-size exponent $\alpha \in (1/2, 1)$. We identify a stabilization-rate threshold $\beta > (\alpha+1)/2$ and prove that, within the class of polynomial rate hypotheses used in our upper bound, it cannot be weakened: the dynamic remainder $\sqrt{n}\,R_n$ vanishes in $L^2$ whenever $\beta > (\alpha+1)/2$, and we exhibit sequences satisfying those hypotheses for which it does not vanish when $\beta \le (\alpha+1)/2$. A single stabilization argument certifies three stochastic approximation (SA) variants (SA-AdaGrad, SA-RMSProp, and SA-ONS) with gain $\rho_t = c/t$, each delivering one-step $L^2(\mathrm{op})$ stabilization of order $t^{-1}$ and yielding the CLT $\sqrt{n}(\bar{x}_n - x^*) \to N(0, H^{-1}SH^{-1})$; under bounded inputs, the pathwise rate $\beta = 1$ further preserves the $n^{-1/6}$ Wasserstein rate at $\alpha^* = 2/3$. Under standard regularity conditions, Wald-type online inference remains valid for dynamically preconditioned averaged stochastic gradient descent (SGD) whose stabilization rate exceeds the threshold.
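To make the setting concrete, here is a minimal sketch of preconditioned averaged SGD with a Wald-type standard error, assuming a scalar least-squares model with bounded inputs. It is an illustration under stated assumptions, not the paper's SA-RMSProp: the function name `preconditioned_averaged_sgd`, the second-moment tracker `v`, the plug-in estimates `h_hat` and `s_hat`, and all default constants are hypothetical. The preconditioner $P_t = 1/\sqrt{v_{t-1}}$ is updated with gain $\rho_t = c/t$, and the step size is $\eta_0 t^{-\alpha}$ with $\alpha = 2/3$.

```python
import math
import random


def preconditioned_averaged_sgd(n=20000, alpha=2 / 3, eta0=0.5, c=1.0,
                                theta_star=2.0, seed=0):
    """Scalar averaged SGD with an RMSProp-style preconditioner (a sketch).

    Assumed model (illustrative only): y = x * theta_star + eps, with
    x uniform on {-1, +1} and eps uniform on [-a, a].
    """
    rng = random.Random(seed)
    a = math.sqrt(3.0) / 2.0          # Var(eps) = a^2 / 3 = 0.25
    theta = 0.0
    theta_bar = 0.0
    v = 1.0                           # second-moment tracker behind P_t
    h_hat = 0.0                       # plug-in estimate of H = E[x^2]
    s_hat = 0.0                       # plug-in estimate of S = E[g^2]
    for t in range(1, n + 1):
        x = rng.choice((-1.0, 1.0))
        y = x * theta_star + rng.uniform(-a, a)
        g = x * (x * theta - y)       # stochastic gradient of 0.5 * (x*theta - y)^2
        # P_t uses only past gradients (v from the previous step).
        p = 1.0 / math.sqrt(v + 1e-12)
        theta -= eta0 * t ** (-alpha) * p * g
        # Preconditioner update with vanishing gain rho_t = c / t.
        rho = min(1.0, c / t)
        v += rho * (g * g - v)
        # Running averages: iterate, Hessian, and gradient second moment.
        theta_bar += (theta - theta_bar) / t
        h_hat += (x * x - h_hat) / t
        s_hat += (g * g - s_hat) / t
    # Wald-type standard error from the scalar sandwich S / H^2.
    se = math.sqrt(s_hat / (h_hat ** 2 * n))
    return theta_bar, se


theta_bar, se = preconditioned_averaged_sgd()
print(f"average={theta_bar:.4f}  "
      f"95% interval=({theta_bar - 1.96 * se:.4f}, {theta_bar + 1.96 * se:.4f})")
```

The standard error depends only on the plug-in $H$ and $S$, not on the preconditioner, which is the practical content of the sandwich CLT in this toy case. Note that `s_hat` averages squared gradients along the trajectory rather than at $x^*$, so it is only an approximation of $S$.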