Dmitriy Drusvyatskiy

Linear Recursive Feature Machines provably recover low-rank matrices

Jan 09, 2024
Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
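
For readers unfamiliar with IRLS, the following is a minimal sketch of the classical algorithm for sparse linear regression, the method that the abstract says lin-RFM generalizes. It is not the authors' lin-RFM implementation. The helper name `irls_sparse`, the smoothing parameter `eps`, and the problem sizes are assumptions chosen for illustration.

```python
import numpy as np

def irls_sparse(A, b, n_iter=50, eps=1e-6):
    """Classical IRLS for min ||x||_1 subject to A x = b (hypothetical helper)."""
    # Start from the minimum l2-norm solution (all weights equal to one).
    x = A.T @ np.linalg.solve(A @ A.T, b)
    for _ in range(n_iter):
        # Reweight: coordinates that are already small get penalized more.
        w = np.sqrt(x**2 + eps**2)
        AW = A * w  # same as A @ diag(w)
        # Weighted least squares: min sum_i x_i^2 / w_i subject to A x = b.
        x = w * (A.T @ np.linalg.solve(AW @ A.T, b))
    return x

# Synthetic underdetermined problem with a 3-sparse ground truth (assumed sizes).
rng = np.random.default_rng(0)
n, d, k = 20, 40, 3
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
b = A @ x_true

x_l2 = A.T @ np.linalg.solve(A @ A.T, b)
x_irls = irls_sparse(A, b)
print("l1 norm, min-l2 solution:", np.abs(x_l2).sum())
print("l1 norm, IRLS solution:  ", np.abs(x_irls).sum())
print("IRLS recovery error:     ", np.linalg.norm(x_irls - x_true))
```

Each iteration solves a weighted least-squares problem in closed form, so every iterate fits the data exactly while the reweighting pushes the solution toward sparsity.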

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Jun 05, 2023
Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma

Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
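
The interpolation regime can be illustrated on the simplest possible example: an overparametrized, consistent least-squares problem, where single-sample SGD with a constant step size drives the loss to zero at a linear rate. The sketch below is only an illustration of that phenomenon. It does not implement or verify the regularity condition proposed in the paper, and the problem sizes and step-size rule are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # far more parameters than samples
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d)      # consistent system, so zero loss is attainable

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

# Constant step size set by the largest per-sample smoothness constant.
eta = 1.0 / np.max(np.sum(A**2, axis=1))

x = np.zeros(d)
initial = loss(x)
for _ in range(5000):
    i = rng.integers(n)                      # one sampled data point
    x -= eta * (A[i] @ x - b[i]) * A[i]      # gradient of 0.5 * (a_i^T x - b_i)^2
final = loss(x)
print(f"initial loss {initial:.3e}, final loss {final:.3e}")
```

Because every sample can be fit exactly, the stochastic gradient noise vanishes at the solution, which is why the step size does not need to shrink here.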

Asymptotic normality and optimality in nonsmooth stochastic approximation

Jan 16, 2023
Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang

In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is the best possible among all estimation procedures, in the local minimax sense of Hájek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important nonsmooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.

* The arXiv report arXiv:2108.11832 has been split into two parts. This is Part 2 of the original submission, augmented by some new results and a reworked exposition.
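
For context, the classical smooth result referenced in the abstract can be stated informally as follows. The notation ($F$, $\xi_k$, $\eta_k$, $\Sigma$) is introduced here for illustration and is not taken from the listing; precise assumptions are omitted.

```latex
% Stochastic approximation for solving F(x^\star) = 0 with noisy evaluations:
x_{k+1} = x_k - \eta_k \bigl( F(x_k) + \xi_k \bigr),
\qquad
\bar{x}_k = \frac{1}{k} \sum_{i=1}^{k} x_i .

% Polyak--Juditsky central limit theorem (informal): if F is smooth,
% \nabla F(x^\star) is nonsingular, \eta_k \propto k^{-\alpha} with
% \alpha \in (1/2, 1), and the noise has limiting covariance \Sigma, then
\sqrt{k} \, \bigl( \bar{x}_k - x^\star \bigr)
\;\xrightarrow{\;d\;}\;
\mathcal{N}\!\Bigl( 0, \; \nabla F(x^\star)^{-1} \, \Sigma \, \nabla F(x^\star)^{-\top} \Bigr).
```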

Stochastic approximation with decision-dependent distributions: asymptotic normality and optimality

Jul 09, 2022
Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy

We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that nicely decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm is locally minimax optimal.

* 35 pages, 1 figure
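
A toy one-dimensional example may help make the decision-dependent setting concrete. In the sketch below, the data distribution D(x) is Gaussian with mean `mu + eps * x`, so each sample depends on the current iterate, and the algorithm tracks the running average of the iterates. The model, the parameter values, and the step-size schedule are all assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps, sigma = 1.0, 0.5, 1.0      # hypothetical model parameters
x_star = mu / (1.0 - eps)           # equilibrium: x equals the mean of D(x)

T = 20000
x, running_sum = 0.0, 0.0
for t in range(T):
    z = mu + eps * x + sigma * rng.standard_normal()   # sample from D(x_t)
    eta = (t + 1) ** -0.75                             # decaying step size
    x -= eta * (x - z)                                 # gradient of 0.5 * (x - z)^2 in x
    running_sum += x
x_avg = running_sum / T

print("equilibrium point:", x_star)
print("average iterate:  ", x_avg)
```

The gradient is taken with respect to the decision only, treating the sample as fixed, so the iteration targets the equilibrium point at which the decision is optimal for the distribution it induces.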

Decision-Dependent Risk Minimization in Geometrically Decaying Dynamic Environments

Apr 08, 2022
Mitas Ray, Dmitriy Drusvyatskiy, Maryam Fazel, Lillian J. Ratliff

This paper studies the problem of expected loss minimization given a data distribution that is dependent on the decision-maker's action and evolves dynamically in time according to a geometric decay process. Novel algorithms are introduced for two information settings: one in which the decision-maker has a first-order gradient oracle, and one in which they have only a loss function oracle. The algorithms operate on the same underlying principle: the decision-maker repeatedly deploys a fixed decision over the length of an epoch, thereby allowing the dynamically changing environment to sufficiently mix before updating the decision. The iteration complexity in each of the settings is shown to match existing rates for first-order and zero-order stochastic gradient methods up to logarithmic factors. The algorithms are evaluated on a "semi-synthetic" example using real-world data from the SFpark dynamic pricing pilot study; it is shown that the announced prices result in an improvement for the institution's objective (target occupancy), while achieving an overall reduction in parking rates.

* Accepted at AAAI 2022

Flat minima generalize for low-rank matrix recovery

Mar 07, 2022
Lijun Ding, Dmitriy Drusvyatskiy, Maryam Fazel

Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We complete the paper with synthetic experiments that illustrate our findings.

* 30 pages

Multiplayer Performative Prediction: Learning in Decision-Dependent Games

Jan 10, 2022
Adhyyan Narang, Evan Faulkner, Dmitriy Drusvyatskiy, Maryam Fazel, Lillian J. Ratliff

Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers' actions. This paper formulates a new game-theoretic framework for this phenomenon, called multiplayer performative prediction. We focus on two distinct solution concepts, namely (i) performatively stable equilibria and (ii) Nash equilibria of the game. The latter equilibria are arguably more informative, but can be found efficiently only when the game is monotone. We show that under mild assumptions, the performatively stable equilibria can be found efficiently by a variety of algorithms, including repeated retraining and repeated (stochastic) gradient play. We then establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative-free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.

Subgradient methods near active manifolds: saddle point avoidance, local convergence, and asymptotic normality

Aug 26, 2021
Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang

Nonsmooth optimization problems arising in practice tend to exhibit beneficial smooth substructure: their domains stratify into "active manifolds" of smooth variation, which common proximal algorithms "identify" in finite time. Identification then entails a transition to smooth dynamics, and accommodates second-order acceleration techniques. While identification is clearly useful algorithmically, empirical evidence suggests that even those algorithms that do not identify the active manifold in finite time -- notably the subgradient method -- are nonetheless affected by it. This work seeks to explain this phenomenon, asking: how do active manifolds impact the subgradient method in nonsmooth optimization? We answer this question by introducing two algorithmically useful properties -- aiming and subgradient approximation -- that fully expose the smooth substructure of the problem. We show that these properties imply that the shadow of the (stochastic) subgradient method along the active manifold is precisely an inexact Riemannian gradient method with an implicit retraction. We prove that these properties hold for a wide class of problems, including cone reducible/decomposable functions and generic semialgebraic problems. Moreover, we develop a thorough calculus, proving such properties are preserved under smooth deformations and spectral lifts. This viewpoint then leads to several algorithmic consequences that parallel results in smooth optimization, despite the nonsmoothness of the problem: local rates of convergence, asymptotic normality, and saddle point avoidance. The asymptotic normality results appear to be new even in the most classical setting of stochastic nonlinear programming. The results culminate in the following observation: the perturbed subgradient method on generic, Clarke regular semialgebraic problems converges only to local minimizers.

* 104 pages, 3 figures
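
The contrast between proximal methods, which identify the active manifold in finite time, and the subgradient method, which only approaches it, can be seen on a two-dimensional toy problem. The sketch below minimizes f(x) = |x1| + 0.5 * ||x - C||^2, whose active manifold is the line x1 = 0. The function, the data `C`, and the step sizes are assumptions chosen for illustration and are unrelated to the paper's experiments.

```python
import numpy as np

C = np.array([0.3, 1.0])   # hypothetical data; the minimizer of f is (0, 1)

def soft_threshold(v, tau):
    return np.sign(v) * max(abs(v) - tau, 0.0)

def prox_gradient(x0, n_iter, step=0.5):
    x, hits = np.array(x0, dtype=float), 0
    for _ in range(n_iter):
        y = x - step * (x - C)                              # gradient step on the smooth part
        x = np.array([soft_threshold(y[0], step), y[1]])    # prox of step * |x1|
        hits += int(x[0] == 0.0)                            # iterate lies exactly on x1 = 0
    return x, hits

def subgradient(x0, n_iter):
    x, hits = np.array(x0, dtype=float), 0
    for k in range(n_iter):
        g = (x - C) + np.array([np.sign(x[0]), 0.0])        # a subgradient of f at x
        x = x - g / (k + 1)                                 # step size 1 / (k + 1)
        hits += int(x[0] == 0.0)
    return x, hits

n_iter = 2000
x_prox, prox_hits = prox_gradient([2.0, -1.0], n_iter)
x_sub, sub_hits = subgradient([2.0, -1.0], n_iter)
print("proximal gradient:", x_prox, "iterates on manifold:", prox_hits)
print("subgradient:      ", x_sub, "iterates on manifold:", sub_hits)
```

In this toy run the proximal iterates land on x1 = 0 after a couple of steps and stay there, while the subgradient iterates oscillate around the manifold with shrinking amplitude.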

Stochastic optimization under time drift: iterate averaging, step decay, and high probability guarantees

Aug 16, 2021
Joshua Cutler, Dmitriy Drusvyatskiy, Zaid Harchaoui

We consider the problem of minimizing a convex function that is evolving in time according to unknown and possibly stochastic dynamics. Such problems abound in the machine learning and signal processing literature, under the names of concept drift and stochastic tracking. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. Notably, we show that the tracking efficiency of the proximal stochastic gradient method depends only logarithmically on the initialization quality, when equipped with a step-decay schedule. The results moreover naturally extend to settings where the dynamics depend jointly on time and on the decision variable itself, as in the performative prediction framework.

* 57 pages, 6 figures, under review

Escaping strict saddle points of the Moreau envelope in nonsmooth optimization

Jun 17, 2021
Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy

Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.

* 29 pages, 1 figure
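
For reference, the standard definitions behind the abstract are recalled below. The symbols $f$, $\lambda$, and $\rho$ are notation assumed here; the identities hold, for example, when $f$ is $\rho$-weakly convex and $0 < \lambda < 1/\rho$.

```latex
% Moreau envelope and proximal map of f with parameter \lambda > 0:
f_\lambda(x) = \min_{y} \Bigl\{ f(y) + \tfrac{1}{2\lambda} \| y - x \|^2 \Bigr\},
\qquad
\operatorname{prox}_{\lambda f}(x) = \operatorname*{argmin}_{y} \Bigl\{ f(y) + \tfrac{1}{2\lambda} \| y - x \|^2 \Bigr\}.

% The envelope is smooth, and its gradient is determined by the proximal point:
\nabla f_\lambda(x) = \lambda^{-1} \bigl( x - \operatorname{prox}_{\lambda f}(x) \bigr).
```

In particular, an approximate solution of the proximal subproblem yields an approximate gradient of the envelope, which is the mechanism the abstract's final sentence refers to.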
