Robert M. Gower

Function Value Learning: Adaptive Learning Rates Based on the Polyak Stepsize and Function Splitting in ERM

Jul 26, 2023
Guillaume Garrigos, Robert M. Gower, Fabian Schaipp

Here we develop variants of SGD (stochastic gradient descent) with an adaptive step size that make use of the sampled loss values. In particular, we focus on solving a finite sum-of-terms problem, also known as empirical risk minimization. We first detail an idealized adaptive method called $\texttt{SPS}_+$ that makes use of the sampled loss values and assumes knowledge of the sampled loss at optimality. $\texttt{SPS}_+$ is a minor modification of the SPS (Stochastic Polyak Stepsize) method, where the step size is enforced to be positive. We then show that $\texttt{SPS}_+$ achieves the best known rates of convergence for SGD in the Lipschitz non-smooth setting. We then move on to develop $\texttt{FUVAL}$, a variant of $\texttt{SPS}_+$ where the loss values at optimality are gradually learned, as opposed to being given. We give three viewpoints of $\texttt{FUVAL}$: as a projection-based method, as a variant of the prox-linear method, and as a particular online SGD method. We then present a convergence analysis of $\texttt{FUVAL}$ and experimental results. One shortcoming of our work is that the convergence analysis of $\texttt{FUVAL}$ shows no advantage over SGD. Another shortcoming is that, currently, only the full-batch version of $\texttt{FUVAL}$ shows a minor advantage over GD (Gradient Descent) in terms of sensitivity to the step size; the stochastic version shows no clear advantage over SGD. We conjecture that large mini-batches are required to make $\texttt{FUVAL}$ competitive. Currently, the new $\texttt{FUVAL}$ method studied in this paper does not offer any clear theoretical or practical advantage. We have nonetheless chosen to make this draft available online because of some of the analysis techniques we use, such as the non-smooth analysis of $\texttt{SPS}_+$, and to show an apparently interesting approach that currently does not work.

* 38 pages, 2 figures 
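To make the step size concrete: the Polyak-type ratio with the truncation at zero described above can be sketched in a few lines. This is a minimal sketch, assuming the sampled loss at optimality is known (the $\texttt{SPS}_+$ setting); the cap `gamma_max` is a common safeguard in this literature and an assumption here, not necessarily the paper's exact rule.

```python
import numpy as np

def sps_plus_step(x, grad_i, loss_i, loss_i_star, gamma_max=1.0):
    """Sketch of an SPS+-style step on one sampled loss f_i.

    loss_i_star is the sampled loss at optimality f_i(x*), assumed known here;
    FUVAL replaces this known value with a learned estimate.
    """
    g_norm2 = np.dot(grad_i, grad_i)
    if g_norm2 == 0.0:
        return x  # stationary for this sample; no step to take
    # Polyak ratio, truncated at zero so the step size is always non-negative
    step = max(0.0, (loss_i - loss_i_star) / g_norm2)
    return x - min(step, gamma_max) * grad_i
```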

A Model-Based Method for Minimizing CVaR and Beyond

May 27, 2023
Si Yi Meng, Robert M. Gower

We develop a variant of the stochastic prox-linear method for minimizing the Conditional Value-at-Risk (CVaR) objective. CVaR is a risk measure focused on minimizing worst-case performance, defined as the average of the top quantile of the losses. In machine learning, such a risk measure is useful for training more robust models. Although the stochastic subgradient method (SGM) is a natural choice for minimizing the CVaR objective, we show that our stochastic prox-linear (SPL+) algorithm can better exploit the structure of the objective, while still providing a convenient closed-form update. Our SPL+ method also adapts to the scaling of the loss function, which allows for easier tuning. We then specialize a general convergence theorem for SPL+ to our setting, and show that it allows for a wider selection of step sizes compared to SGM. We support this theoretical finding experimentally.

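For reference, the CVaR objective above, the average of the worst losses, is easy to state empirically. A minimal sketch, where the convention that `alpha` is the fraction of worst losses being averaged is an assumption (level conventions vary in the literature):

```python
import numpy as np

def empirical_cvar(losses, alpha=0.1):
    """Average of the worst `alpha` fraction of the losses (the top quantile)."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # worst first
    k = max(1, int(np.ceil(alpha * losses.size)))
    return losses[:k].mean()

print(empirical_cvar([0.2, 1.5, 0.1, 3.0, 0.4], alpha=0.4))  # mean of {3.0, 1.5}
```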

Improving Convergence and Generalization Using Parameter Symmetries

May 22, 2023
Bo Zhao, Robert M. Gower, Robin Walters, Rose Yu

In overparametrized models, different values of the parameters may result in the same loss value. Parameter space symmetries are transformations that change the model parameters but leave the loss invariant. Teleportation applies such transformations to accelerate optimization. However, the exact mechanism behind this algorithm's success is not well understood. In this paper, we show that teleportation not only speeds up optimization in the short term, but also gives an overall faster time to convergence. Additionally, we show that teleporting to minima with different curvatures improves generalization, and we provide insights on the connection between the curvature of the minima and generalization ability. Finally, we show that integrating teleportation into a wide range of optimization algorithms and into optimization-based meta-learning improves convergence.

* 29 pages, 13 figures 
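A concrete instance of the parameter-space symmetries the paper builds on: in a two-layer ReLU network, rescaling the hidden units by any positive diagonal matrix leaves the output, and hence the loss, unchanged, while the gradients at the new point generally differ; teleportation searches over such transformations. A minimal numpy sketch with an illustrative architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=5)
W1 = rng.normal(size=(4, 5))
W2 = rng.normal(size=(3, 4))

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)  # two-layer ReLU network

# Positive diagonal rescaling is a symmetry: relu(D @ W1 @ x) = D @ relu(W1 @ x)
d = rng.uniform(0.5, 2.0, size=4)
W1_tel, W2_tel = d[:, None] * W1, W2 / d[None, :]

assert np.allclose(forward(W1, W2, x), forward(W1_tel, W2_tel, x))
```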

MoMo: Momentum Models for Adaptive Learning Rates

May 12, 2023
Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, Robert M. Gower

We present new adaptive learning rates that can be used with any momentum method. To showcase our new learning rates, we develop MoMo and MoMo-Adam, which are SGD with momentum (SGDM) and Adam together with our new adaptive learning rates. Our MoMo methods are motivated through model-based stochastic optimization, wherein we use momentum estimates of the batch losses and gradients sampled at each iteration to build a model of the loss function. Our model also makes use of any known lower bound of the loss function by using truncation; indeed, most losses are bounded below by zero. We then approximately minimize this model at each iteration to compute the next step. For losses with unknown lower bounds, we develop new on-the-fly estimates of the lower bound that we use in our model. Numerical experiments show that our MoMo methods improve over SGDM and Adam in terms of accuracy and robustness to hyperparameter tuning, for training image classifiers on MNIST, CIFAR10, CIFAR100, and Imagenet32, a DLRM model on the Criteo dataset, and a transformer model on the IWSLT14 translation task.

* 25 pages, 11 figures 
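A heavily simplified sketch of the model-based idea: momentum averages of the sampled losses and gradients define a model of the loss, the model is truncated at a known lower bound (often zero), and the resulting step size is capped by the user's learning rate. The exact MoMo update contains additional correction terms; this conveys only the flavor, and all names here are illustrative.

```python
import numpy as np

def momo_like_step(x, grad, loss, state, lr=1.0, beta=0.9, lower_bound=0.0):
    """Simplified MoMo-flavored step (not the paper's exact update)."""
    state["d"] = beta * state.get("d", grad) + (1 - beta) * grad  # momentum gradient
    state["f"] = beta * state.get("f", loss) + (1 - beta) * loss  # momentum loss
    d = state["d"]
    # Truncated model: never step further than the lower bound suggests,
    # and never more than the user learning rate lr allows.
    step = min(lr, max(0.0, state["f"] - lower_bound) / (np.dot(d, d) + 1e-12))
    return x - step * d
```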

A Stochastic Proximal Polyak Step Size

Jan 12, 2023
Fabian Schaipp, Robert M. Gower, Michael Ulbrich

Recently, the stochastic Polyak step size (SPS) has emerged as a competitive adaptive step size scheme for stochastic gradient descent. Here we develop ProxSPS, a proximal variant of SPS that can handle regularization terms. Developing a proximal variant of SPS is particularly important, since SPS requires a lower bound of the objective function to work well. When the objective function is the sum of a loss and a regularizer, available estimates of a lower bound of the sum can be loose. In contrast, ProxSPS only requires a lower bound for the loss, which is often readily available. As a consequence, we show that ProxSPS is easier to tune and more stable in the presence of regularization. Furthermore, for image classification tasks, ProxSPS performs as well as AdamW with little to no tuning, and results in a network with smaller weight parameters. We also provide an extensive convergence analysis for ProxSPS that covers the non-smooth, smooth, weakly convex and strongly convex settings.

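A sketch of the ProxSPS structure for squared-$\ell_2$ regularization $\frac{\lambda}{2}\|x\|^2$, where the proximal operator is closed form: a Polyak step on the loss alone (using a lower bound of the loss, often just zero), followed by the prox of the regularizer. Coupling the same step size into the prox is a simplifying assumption; the paper's exact update differs.

```python
import numpy as np

def proxsps_step(x, grad_i, loss_i, lam, gamma=1.0, loss_lb=0.0):
    """Sketch of a ProxSPS-style step for the regularizer (lam/2)*||x||^2."""
    g_norm2 = np.dot(grad_i, grad_i) + 1e-12
    step = min(gamma, max(0.0, loss_i - loss_lb) / g_norm2)  # Polyak step on the loss only
    y = x - step * grad_i
    return y / (1.0 + step * lam)  # closed-form prox of the l2 regularizer
```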

Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies

Oct 04, 2022
Rui Yuan, Simon S. Du, Robert M. Gower, Alessandro Lazaric, Lin Xiao

We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as approximate versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with an arbitrary constant step size.

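A rough sketch of one Q-NPG step with a log-linear policy: by compatible function approximation, the update direction is the least-squares fit of estimated Q-values on the state-action features, and the step size grows geometrically with no adaptation. The synthetic inputs and names below are illustrative assumptions.

```python
import numpy as np

def q_npg_update(theta, phi, q_hat, eta):
    """One Q-NPG-style step: fit Q-estimates with the policy features (the
    compatible function approximation), then move theta along the fit."""
    w, *_ = np.linalg.lstsq(phi, q_hat, rcond=None)
    return theta + eta * w

rng = np.random.default_rng(0)
theta, eta = np.zeros(5), 0.1
for t in range(10):
    # phi, q_hat would be re-estimated from rollouts at every iteration
    phi, q_hat = rng.normal(size=(100, 5)), rng.normal(size=100)
    theta = q_npg_update(theta, phi, q_hat, eta)
    eta *= 1.5  # simple, non-adaptive, geometrically increasing step size
```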

SP2: A Second Order Stochastic Polyak Method

Jul 17, 2022
Shuang Li, William J. Swartworth, Martin Takáč, Deanna Needell, Robert M. Gower

Recently, the "SP" (Stochastic Polyak step size) method has emerged as a competitive adaptive method for setting the step sizes of SGD. SP can be interpreted as a method specialized to interpolated models, since it solves the interpolation equations. SP solves these equations by using local linearizations of the model. We take this a step further and develop a method for solving the interpolation equations that uses a local second-order approximation of the model. Our resulting method SP2 uses Hessian-vector products to speed up the convergence of SP. Furthermore, and rather uniquely among second-order methods, the design of SP2 in no way relies on positive definite Hessian matrices or convexity of the objective function. We show SP2 is very competitive on matrix completion, non-convex test problems and logistic regression. We also provide a convergence theory for sums of quadratics.

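One way to picture the second-order interpolation idea: along the negative gradient direction, set the quadratic model of the sampled loss to zero using only a Hessian-vector product, with no convexity or positive-definiteness required. This is a hedged sketch of the idea; the paper's actual SP2 update is derived differently.

```python
import numpy as np

def sp2_like_step(x, f, g, hvp):
    """Solve f + <g, d> + 0.5*d^T H d = 0 along d = -t*g via a Hessian-vector
    product; fall back to the first-order SP step when no real root exists."""
    g2 = np.dot(g, g)
    a = 0.5 * np.dot(g, hvp(g))           # quadratic coefficient 0.5*g^T H g
    if abs(a) < 1e-12:                    # model is linear: plain SP step
        return x - (f / g2) * g
    disc = g2 ** 2 - 4.0 * a * f
    if disc < 0.0:                        # no real root: fall back to SP step
        return x - (f / g2) * g
    t = (g2 - np.sqrt(disc)) / (2.0 * a)  # smaller root of a*t^2 - g2*t + f = 0
    return x - t * g

# For a squared linear loss f(x) = 0.5*(<a, x> - b)^2 the quadratic model is
# exact, so one step solves the interpolation equation f(x) = 0 exactly:
a_vec, b = np.array([1.0, 2.0]), 3.0
x = np.zeros(2)
r = a_vec @ x - b
x = sp2_like_step(x, 0.5 * r ** 2, r * a_vec, hvp=lambda v: a_vec * (a_vec @ v))
print(a_vec @ x)  # 3.0
```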

Cutting Some Slack for SGD with Adaptive Polyak Stepsizes

Feb 24, 2022
Robert M. Gower, Mathieu Blondel, Nidham Gazagnadou, Fabian Pedregosa

Tuning the step size of stochastic gradient descent is tedious and error-prone. This has motivated the development of methods that automatically adapt the step size using readily available information. In this paper, we consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods. These are methods that make use of the gradient and loss values at the sampled points to adaptively adjust the step size. We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems. We use this insight to develop new variants of the SPS method that are better suited to nonlinear models. Our new variants are based on introducing a slack variable into the interpolation equations. This single slack variable tracks the loss function across iterations and is used in setting a stable step size. We provide a convergence theory and extensive numerical results supporting our new methods.

* 44 pages, 10 figures 
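To sketch the slack idea: move the pair $(x, s)$ to the closest point, in a scaled norm, satisfying the linearized equation $f_i(x_k) + \langle g, x - x_k\rangle = s$, so that the single scalar $s$ tracks the loss level across iterations and tempers the step size. The particular penalty below (and its parameter `delta`) is one assumption-laden choice; the paper studies several variants.

```python
import numpy as np

def slack_sps_step(x, s, grad_i, loss_i, delta=1.0):
    """Project (x, s) onto f_i(x_k) + <g, x - x_k> = s under the scaled norm
    ||x - x_k||^2 + delta*(s - s_k)^2 (a sketch, not the paper's exact method)."""
    g_norm2 = np.dot(grad_i, grad_i)
    mu = (loss_i - s) / (g_norm2 + 1.0 / delta)  # multiplier of the projection
    return x - mu * grad_i, s + mu / delta       # slack drifts toward the loss
```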

A general sample complexity analysis of vanilla policy gradient

Jul 23, 2021
Rui Yuan, Robert M. Gower, Alessandro Lazaric

The policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems. However, a solid theoretical understanding of even the "vanilla" PG has remained elusive for a long time. In this paper, we apply recent tools developed for the analysis of SGD in non-convex optimization to obtain convergence guarantees for both REINFORCE and GPOMDP under a smoothness assumption on the objective function and weak conditions on the second moment of the norm of the estimated gradient. When instantiated under common assumptions on the policy space, our general result immediately recovers existing $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity guarantees, but for wider ranges of parameters (e.g., step size and batch size $m$) than in the previous literature. Notably, our result includes the single-trajectory case (i.e., $m=1$) and provides a more accurate analysis of the dependency on problem-specific parameters by correcting results previously available in the literature. We believe that the integration of state-of-the-art tools from non-convex optimization may help identify a much broader range of problems where PG methods enjoy strong theoretical guarantees.

* ICML 2021 Workshop on "Reinforcement learning theory" 
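For concreteness, the REINFORCE estimator covered by the analysis, sketched on a toy softmax bandit; the analysis covers any batch size $m \ge 1$, including the single-trajectory case $m = 1$ used by default below. The toy rewards and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.1, 0.5, 0.9])  # toy bandit, unknown to the learner

def reinforce_grad(theta, m=1):
    """Score-function (REINFORCE) gradient estimate from m sampled trajectories."""
    grad = np.zeros_like(theta)
    for _ in range(m):
        p = np.exp(theta - theta.max()); p /= p.sum()  # softmax policy
        action = rng.choice(p.size, p=p)
        grad_log_pi = -p                                # grad of log pi(action)
        grad_log_pi[action] += 1.0
        grad += true_rewards[action] * grad_log_pi / m
    return grad

theta = np.zeros(3)
for t in range(500):
    theta += 0.5 * reinforce_grad(theta, m=1)  # vanilla PG, single trajectory
```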

Stochastic Polyak Stepsize with a Moving Target

Jun 22, 2021
Robert M. Gower, Aaron Defazio, Michael Rabbat

We propose a new stochastic gradient method that uses recorded past loss values to reduce the variance. Our method can be interpreted as a new stochastic variant of the Polyak Stepsize that converges globally without assuming interpolation. Our method introduces auxiliary variables, one for each data point, that track the loss value for each data point. We provide a global convergence theory for our method by showing that it can be interpreted as a special variant of online SGD. The new method only stores a single scalar per data point, opening up new applications for variance reduction where memory is the bottleneck.

* 41 pages, 13 figures, 1 table 
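A sketch of the auxiliary-variable structure: a single scalar target per data point stands in for the unknown loss at the optimum and is itself nudged toward the observed loss, so only one extra scalar per data point is stored. The averaging rule for the target below is an illustrative assumption; the paper derives the precise coupled update.

```python
import numpy as np

def moving_target_step(x, targets, i, grad_i, loss_i, beta=0.5, gamma_max=1.0):
    """Polyak-type step where targets[i] plays the role of f_i(x*) (a sketch)."""
    g_norm2 = np.dot(grad_i, grad_i) + 1e-12
    step = max(0.0, (loss_i - targets[i]) / g_norm2)
    x_new = x - min(step, gamma_max) * grad_i
    targets[i] = (1 - beta) * targets[i] + beta * loss_i  # move the target
    return x_new
```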