Robert M. Gower

Analysis of an Idealized Stochastic Polyak Method and its Application to Black-Box Model Distillation

Apr 02, 2025

Robert M. Gower, Guillaume Garrigos, Nicolas Loizou, Dimitris Oikonomou, Konstantin Mishchenko, Fabian Schaipp

Abstract: We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we only assume a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss for every training batch evaluated at a solution. It is also ideal, in that it achieves the optimal lower bound for globally Lipschitz functions, and is the first Polyak step size to have an $O(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments to validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.

* 44 pages, 7 figures
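
The abstract above describes a step size that needs the batch loss evaluated at a solution. A minimal sketch of what such an update could look like is given below, assuming the usual Polyak form $\gamma_t = (f_i(x_t) - f_i(x^*))_+ / \|\nabla f_i(x_t)\|^2$; the positive part, the least-squares test problem, and the names `sps_star_step` and `run_sps_star` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sps_star_step(x, loss, grad, loss_at_solution, eps=1e-12):
    """One idealized Polyak step (assumed form): x - (loss gap)_+ / ||grad||^2 * grad."""
    gap = max(loss - loss_at_solution, 0.0)  # positive part of the sampled loss gap
    grad_sq = float(grad @ grad)
    if grad_sq <= eps:                       # zero gradient: nothing to do
        return x
    return x - (gap / grad_sq) * grad

def run_sps_star(A, b, x_star, steps=2000, seed=0):
    """Run the step on f_i(x) = 0.5 * (a_i^T x - b_i)^2 with uniformly sampled i."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    dists = [np.linalg.norm(x - x_star)]
    for _ in range(steps):
        i = rng.integers(A.shape[0])
        r = A[i] @ x - b[i]                  # residual at the current iterate
        r_star = A[i] @ x_star - b[i]        # residual at the solution (the "idealized" part)
        x = sps_star_step(x, 0.5 * r * r, r * A[i], 0.5 * r_star * r_star)
        dists.append(np.linalg.norm(x - x_star))
    return x, np.array(dists)

# Hypothetical noisy least-squares problem, so that f_i(x_star) is not zero.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 5))
b = A @ rng.normal(size=5) + 0.1 * rng.normal(size=50)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]
x_final, dists = run_sps_star(A, b, x_star)
```

For convex sampled losses, a step of this form never increases the distance to `x_star`, which is why `dists` can be used as a simple sanity check. No learning rate appears anywhere in the loop.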

EigenVI: score-based variational inference with orthogonal function expansions

Oct 31, 2024

Diana Cai, Chirag Modi, Charles C. Margossian, Robert M. Gower, David M. Blei, Lawrence K. Saul

Abstract: We develop EigenVI, an eigenvalue-based approach for black-box variational inference (BBVI). EigenVI constructs its variational approximations from orthogonal function expansions. For distributions over $\mathbb{R}^D$, the lowest order term in these expansions provides a Gaussian variational approximation, while higher-order terms provide a systematic way to model non-Gaussianity. These approximations are flexible enough to model complex distributions (multimodal, asymmetric), but they are simple enough that one can calculate their low-order moments and draw samples from them. EigenVI can also model other types of random variables (e.g., nonnegative, bounded) by constructing variational approximations from different families of orthogonal functions. Within these families, EigenVI computes the variational approximation that best matches the score function of the target distribution by minimizing a stochastic estimate of the Fisher divergence. Notably, this optimization reduces to solving a minimum eigenvalue problem, so that EigenVI effectively sidesteps the iterative gradient-based optimizations that are required for many other BBVI algorithms. (Gradient-based methods can be sensitive to learning rates, termination criteria, and other tunable hyperparameters.) We use EigenVI to approximate a variety of target distributions, including a benchmark suite of Bayesian models from posteriordb. On these distributions, we find that EigenVI is more accurate than existing methods for Gaussian BBVI.

* 25 pages, 9 figures. Advances in Neural Information Processing Systems (NeurIPS), 2024

Enhancing Policy Gradient with the Polyak Step-Size Adaption

Apr 11, 2024

Yunxiang Li, Rui Yuan, Chen Fan, Mark Schmidt, Samuel Horváth, Robert M. Gower, Martin Takáč

Abstract: Policy gradient is a widely utilized and foundational algorithm in the field of reinforcement learning (RL). Although it is renowned for its convergence guarantees and stability compared to other RL algorithms, its practical application is often hindered by sensitivity to hyper-parameters, particularly the step-size. In this paper, we introduce the integration of the Polyak step-size in RL, which automatically adjusts the step-size without prior knowledge. To adapt this method to RL settings, we address several issues, including the unknown $f^*$ in the Polyak step-size. Additionally, we showcase the performance of the Polyak step-size in RL through experiments, demonstrating faster convergence and the attainment of more stable policies.

Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Mar 06, 2024

Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert M. Gower

Abstract: We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. Minimizing these upper-bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness.

* Twenty-four pages

Level Set Teleportation: An Optimization Perspective

Mar 05, 2024

Aaron Mishkin, Alberto Bietti, Robert M. Gower

Abstract: We study level set teleportation, an optimization sub-routine which seeks to accelerate gradient methods by maximizing the gradient norm on a level-set of the objective function. Since the descent lemma implies that gradient descent (GD) decreases the objective proportional to the squared norm of the gradient, level-set teleportation maximizes this one-step progress guarantee. For convex functions satisfying Hessian stability, we prove that GD with level-set teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than standard GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where we show level-set teleportation neither improves nor worsens convergence rates. To evaluate teleportation in practice, we develop a projected-gradient-type method requiring only Hessian-vector products. We use this method to show that gradient methods with access to a teleportation oracle uniformly out-perform their standard versions on a variety of learning problems.

* Thirty-five pages including appendices
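
For readers unfamiliar with the descent lemma mentioned in the abstract above, its standard form is recalled here as a worked inequality. The smoothness constant $L$ and the step size $1/L$ are the textbook assumptions, not details taken from the paper. If $f$ is $L$-smooth, then one GD step $x^+ = x - \tfrac{1}{L}\nabla f(x)$ satisfies

$$
f(x^+) \;\le\; f(x) + \langle \nabla f(x),\, x^+ - x \rangle + \frac{L}{2}\|x^+ - x\|^2 \;=\; f(x) - \frac{1}{2L}\|\nabla f(x)\|^2 .
$$

The guaranteed decrease depends on $x$ only through $\|\nabla f(x)\|^2$, so among all points with the same objective value, the one with the largest gradient norm has the largest guaranteed one-step decrease.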

Batch and match: black-box variational inference with a score-based divergence

Feb 22, 2024

Diana Cai, Chirag Modi, Loucas Pillaud-Vivien, Charles C. Margossian, Robert M. Gower, David M. Blei, Lawrence K. Saul

Abstract: Most leading implementations of black-box variational inference (BBVI) are based on optimizing a stochastic evidence lower bound (ELBO). But such approaches to BBVI often converge slowly due to the high variance of their gradient estimates. In this work, we propose batch and match (BaM), an alternative approach to BBVI based on a score-based divergence. Notably, this score-based divergence can be optimized by a closed-form proximal update for Gaussian variational families with full covariance matrices. We analyze the convergence of BaM when the target distribution is Gaussian, and we prove that in the limit of infinite batch size the variational parameter updates converge exponentially quickly to the target mean and covariance. We also evaluate the performance of BaM on Gaussian and non-Gaussian target distributions that arise from posterior inference in hierarchical and deep generative models. In these experiments, we find that BaM typically converges in fewer (and sometimes significantly fewer) gradient evaluations than leading implementations of BBVI based on ELBO maximization.

* 46 pages, 11 figures

Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM

Jul 26, 2023

Guillaume Garrigos, Robert M. Gower, Fabian Schaipp

Abstract: Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. This $\texttt{SPS}_+$ is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that $\texttt{SPS}_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of $\texttt{FUVAL}$: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of $\texttt{FUVAL}$ and experimental results. One shortcoming of our work is that the convergence analysis of $\texttt{FUVAL}$ shows no advantage over SGD. Another shortcoming is that currently only the full-batch version of $\texttt{FUVAL}$ shows a minor advantage over GD (Gradient Descent) in terms of sensitivity to the step size. The stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make $\texttt{FUVAL}$ competitive. Currently the new $\texttt{FUVAL}$ method studied in this paper does not offer any clear theoretical or practical advantage. We have chosen to make this draft available online nonetheless because of some of the analysis techniques we use, such as the non-smooth analysis of $\texttt{SPS}_+$, and also to show an apparently interesting approach that currently does not work.

* 38 pages, 2 figures

A Model-Based Method for Minimizing CVaR and Beyond

May 27, 2023

Si Yi Meng, Robert M. Gower

Abstract: We develop a variant of the stochastic prox-linear method for minimizing the Conditional Value-at-Risk (CVaR) objective. CVaR is a risk measure focused on minimizing worst-case performance, defined as the average of the top quantile of the losses. In machine learning, such a risk measure is useful to train more robust models. Although the stochastic subgradient method (SGM) is a natural choice for minimizing the CVaR objective, we show that our stochastic prox-linear (SPL+) algorithm can better exploit the structure of the objective, while still providing a convenient closed form update. Our SPL+ method also adapts to the scaling of the loss function, which allows for easier tuning. We then specialize a general convergence theorem for SPL+ to our setting, and show that it allows for a wider selection of step sizes compared to SGM. We support this theoretical finding experimentally.
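
The definition of CVaR as the average of the top quantile of the losses can be illustrated on a sample. The sketch below computes the empirical CVaR in two ways: directly as the mean of the largest losses, and through the well-known variational form of Rockafellar and Uryasev, $\min_\alpha \, \alpha + \tfrac{1}{1-\beta}\,\mathbb{E}[(\ell - \alpha)_+]$. The variational form is not stated in the abstract and is brought in here as an assumption about which formulation is useful for optimization; the function names and the exponential test data are hypothetical.

```python
import numpy as np

def cvar_top_average(losses, beta):
    """Empirical CVaR at level beta: mean of the largest (1 - beta) fraction of losses."""
    losses = np.asarray(losses, dtype=float)
    k = max(1, int(round((1.0 - beta) * losses.size)))  # number of tail samples
    return np.sort(losses)[-k:].mean()

def cvar_variational(losses, beta):
    """Empirical CVaR via min over alpha of alpha + mean((loss - alpha)_+) / (1 - beta).

    The objective is convex and piecewise linear in alpha, so its minimum is
    attained at one of the sample values; we simply try all of them.
    """
    losses = np.asarray(losses, dtype=float)
    tail_mass = (1.0 - beta) * losses.size
    # excess[j, i] = (losses[i] - candidate alpha_j)_+ with alpha_j = losses[j]
    excess = np.maximum(losses[None, :] - losses[:, None], 0.0)
    objective = losses + excess.sum(axis=1) / tail_mass
    return objective.min()

# Hypothetical sample of losses; (1 - beta) * n is an integer so both forms agree.
rng = np.random.default_rng(0)
losses = rng.exponential(size=1000)
beta = 0.9
direct = cvar_top_average(losses, beta)
variational = cvar_variational(losses, beta)
```

The variational form turns CVaR minimization into a joint minimization over the model parameters and the scalar $\alpha$, which is the kind of structured objective that subgradient or prox-linear style methods can be applied to.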

Improving Convergence and Generalization Using Parameter Symmetries

May 22, 2023

Bo Zhao, Robert M. Gower, Robin Walters, Rose Yu

Abstract: In overparametrized models, different values of the parameters may result in the same loss value. Parameter space symmetries are transformations that change the model parameters but leave the loss invariant. Teleportation applies such transformations to accelerate optimization. However, the exact mechanism behind this algorithm's success is not well understood. In this paper, we show that teleportation not only speeds up optimization in the short-term, but gives overall faster time to convergence. Additionally, we show that teleporting to minima with different curvatures improves generalization and provide insights on the connection between the curvature of the minima and generalization ability. Finally, we show that integrating teleportation into a wide range of optimization algorithms and optimization-based meta-learning improves convergence.

* 29 pages, 13 figures

MoMo: Momentum Models for Adaptive Learning Rates

May 12, 2023

Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower

Abstract: We present new adaptive learning rates that can be used with any momentum method. To showcase our new learning rates we develop MoMo and MoMo-Adam, which are SGD with momentum (SGDM) and Adam together with our new adaptive learning rates. Our MoMo methods are motivated through model-based stochastic optimization, wherein we use momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation. Indeed, most losses are bounded below by zero. We then approximately minimize this model at each iteration to compute the next step. For losses with unknown lower bounds, we develop new on-the-fly estimates of the lower bound that we use in our model. Numerical experiments show that our MoMo methods improve over SGDM and Adam in terms of accuracy and robustness to hyperparameter tuning for training image classifiers on MNIST, CIFAR10, CIFAR100, Imagenet32, DLRM on the Criteo dataset, and a transformer model on the translation task IWSLT14.

* 25 pages, 11 figures
