In this paper, we study the convergence theory of a class of gradient-based Model-Agnostic Meta-Learning (MAML) methods and characterize their overall computational complexity as well as their best achievable level of accuracy in terms of gradient norm for nonconvex loss functions. We start with the MAML algorithm and its first-order approximation (FO-MAML) and highlight the challenges that emerge in their analysis. By overcoming these challenges, we not only provide the first theoretical guarantees for MAML and FO-MAML in nonconvex settings, but also answer some open questions about the implementation of these algorithms, including how to choose their learning rate (stepsize) and the batch sizes for both tasks and the datasets corresponding to tasks. In particular, we show that MAML can find an $\epsilon$-first-order stationary point for any $\epsilon$ after at most $\mathcal{O}(1/\epsilon^2)$ iterations while the cost of each iteration is $\mathcal{O}(d^2)$, where $d$ is the problem dimension. We further show that FO-MAML reduces the cost per iteration of MAML to $\mathcal{O}(d)$, but, unlike MAML, its solution cannot reach an arbitrarily small desired level of accuracy. Finally, we propose a new variant of the MAML algorithm called Hessian-free MAML (HF-MAML), which preserves all theoretical guarantees of MAML while reducing its computational cost per iteration from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$.
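To make the three update directions concrete, the following is a minimal sketch for a single task with a hypothetical quadratic loss $f(w) = \frac{1}{2}(w-c)^\top A (w-c)$. The one-step MAML gradient is $(I - \alpha \nabla^2 f(w)) \nabla f(w - \alpha \nabla f(w))$, and FO-MAML drops the Hessian factor. The Hessian-free variant shown here replaces the Hessian-vector product with a central difference of two gradients; this is one standard way to reach $\mathcal{O}(d)$ cost and is an assumption of this sketch, since the abstract does not spell out the construction. The names `A`, `c`, `alpha`, and `delta` are illustrative only.

```python
# Hypothetical single-task quadratic loss f(w) = 0.5 * (w - c)^T A (w - c).
A = [[2.0, 0.5], [0.5, 1.0]]   # symmetric positive definite (illustrative)
c = [1.0, -1.0]                # task optimum (illustrative)

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def grad(w):
    # Gradient of the quadratic: A (w - c).
    return matvec(A, [wi - ci for wi, ci in zip(w, c)])

def axpy(a, x, y):
    # Returns a * x + y elementwise.
    return [a * xi + yi for xi, yi in zip(x, y)]

def inner_step(w, alpha):
    # One gradient step on the task loss.
    return axpy(-alpha, grad(w), w)

def fo_maml_grad(w, alpha):
    # First-order approximation: ignore the Hessian factor.
    return grad(inner_step(w, alpha))

def maml_grad(w, alpha):
    # Exact one-step MAML gradient; uses the explicit Hessian A (O(d^2) work).
    g = fo_maml_grad(w, alpha)
    return axpy(-alpha, matvec(A, g), g)

def hf_maml_grad(w, alpha, delta=1e-3):
    # Assumed Hessian-free sketch: Hessian-vector product approximated by a
    # central difference of gradients, so only gradient calls are needed.
    g = fo_maml_grad(w, alpha)
    hvp = [(p - m) / (2 * delta)
           for p, m in zip(grad(axpy(delta, g, w)), grad(axpy(-delta, g, w)))]
    return axpy(-alpha, hvp, g)

w0, alpha = [0.0, 0.0], 0.1
exact = maml_grad(w0, alpha)
hessian_free = hf_maml_grad(w0, alpha)
first_order = fo_maml_grad(w0, alpha)
```

For a quadratic loss the gradient is linear in $w$, so the central difference reproduces the Hessian-vector product up to rounding and `hessian_free` agrees with `exact`, whereas `first_order` differs by the dropped Hessian term.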
We consider a decentralized learning problem, where a set of computing nodes aim at solving a non-convex optimization problem collaboratively. It is well-known that decentralized optimization schemes face two major system bottlenecks: stragglers' delay and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized and gradient-based optimization algorithm named QuanTimed-DSGD. Our algorithm stands on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea makes the algorithm robust to straggling nodes and the second reduces the communication overhead. The key technical contribution of our work is to prove that with non-vanishing noises for quantization and stochastic gradients, the proposed method exactly converges to the global optimum for convex loss functions, and finds a first-order stationary point in non-convex scenarios. Our numerical evaluations of QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate speedups of up to 3x in run-time, compared to state-of-the-art decentralized optimization methods.
In this paper, we analyze the iteration complexity of the optimistic gradient descent-ascent (OGDA) method as well as the extra-gradient (EG) method for finding a saddle point of a convex-concave unconstrained min-max problem. To do so, we first show that both OGDA and EG can be interpreted as approximate variants of the proximal point method. We then exploit this interpretation to show that both of these algorithms achieve a convergence rate of $\mathcal{O}(1/k)$ for smooth convex-concave saddle point problems. Our theoretical analysis is of interest as it provides a simple convergence analysis for the EG algorithm in terms of objective function value without a compactness assumption. Moreover, it provides the first convergence guarantee for OGDA in the general convex-concave setting.
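As a small illustration of the two updates, the sketch below runs plain gradient descent-ascent (GDA), EG, and OGDA on the toy bilinear problem $\min_x \max_y xy$, whose unique saddle point is the origin. The problem, the stepsize `eta = 0.1`, and the iteration count are assumptions chosen for illustration; the sketch tracks last iterates on this one instance and does not reproduce the $\mathcal{O}(1/k)$ analysis.

```python
import math

def F(z):
    # Operator for f(x, y) = x * y: (df/dx, -df/dy) = (y, -x).
    x, y = z
    return (y, -x)

def step(z, d, eta):
    # z - eta * d, componentwise.
    return (z[0] - eta * d[0], z[1] - eta * d[1])

def gda(z, eta, iters):
    # Simultaneous gradient descent-ascent, shown as a baseline.
    for _ in range(iters):
        z = step(z, F(z), eta)
    return z

def extra_gradient(z, eta, iters):
    for _ in range(iters):
        z_mid = step(z, F(z), eta)   # extrapolation to a midpoint
        z = step(z, F(z_mid), eta)   # update using the midpoint operator
    return z

def ogda(z, eta, iters):
    prev = F(z)  # convention: F(z_{-1}) = F(z_0)
    for _ in range(iters):
        cur = F(z)
        # z_{k+1} = z_k - 2 * eta * F(z_k) + eta * F(z_{k-1})
        d = (2 * cur[0] - prev[0], 2 * cur[1] - prev[1])
        z, prev = step(z, d, eta), cur
    return z

def norm(z):
    return math.hypot(z[0], z[1])

z0, eta, iters = (1.0, 1.0), 0.1, 2000
dist_gda = norm(gda(z0, eta, iters))
dist_eg = norm(extra_gradient(z0, eta, iters))
dist_ogda = norm(ogda(z0, eta, iters))
```

On this instance the GDA iterates spiral away from the saddle point, while the EG and OGDA iterates contract toward it, which is consistent with viewing both as approximations of the proximal point step.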
How can we efficiently mitigate the overhead of gradient communications in distributed optimization? This problem is at the heart of training scalable machine learning models and has been mainly studied in the unconstrained setting. In this paper, we propose Quantized Frank-Wolfe (QFW), the first projection-free and communication-efficient algorithm for solving constrained optimization problems at scale. We consider both convex and non-convex objective functions, expressed as a finite-sum or more generally a stochastic optimization problem, and provide strong theoretical guarantees on the convergence rate of QFW. This is done by proposing quantization schemes that efficiently compress gradients while controlling the variance introduced during this process. Finally, we empirically validate the efficiency of QFW in terms of communication and the quality of the returned solution against natural baselines.
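To illustrate what "compressing gradients while controlling the variance" can look like, the sketch below implements a generic unbiased stochastic quantizer based on randomized rounding to a uniform grid. It is not claimed to be the scheme proposed for QFW; the function name `stochastic_quantize`, the number of `levels`, and the test vector are assumptions for illustration.

```python
import math
import random

def stochastic_quantize(v, levels, rng):
    # Map each |v_i| / ||v|| onto the grid {0, 1/levels, ..., 1} by randomized
    # rounding, so that the expectation of the output equals v.
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    out = []
    for x in v:
        scaled = abs(x) / norm * levels
        low = math.floor(scaled)
        # Round up with probability equal to the fractional part.
        q = low + (1 if rng.random() < scaled - low else 0)
        out.append(math.copysign(norm * q / levels, x))
    return out

rng = random.Random(0)
g = [0.3, -1.2, 0.5]  # illustrative gradient vector
samples = [stochastic_quantize(g, 4, rng) for _ in range(20000)]
# The empirical mean of the quantized vectors should be close to g.
mean = [sum(s[i] for s in samples) / len(samples) for i in range(len(g))]
```

Each coordinate can be transmitted as a sign and a small integer in `{0, ..., levels}` plus one shared norm, and the per-coordinate error is at most one grid step `norm / levels`, so a finer grid trades more bits for lower variance.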
In this paper, we develop Stochastic Continuous Greedy++ (SCG++), the first efficient variant of a conditional gradient method for maximizing a continuous submodular function subject to a convex constraint. Concretely, for a monotone and continuous DR-submodular function, SCG++ achieves a tight $[(1-1/e)\text{OPT} -\epsilon]$ solution while using $O(1/\epsilon^2)$ stochastic oracle queries and $O(1/\epsilon)$ calls to the linear optimization oracle. The best previously known algorithms either achieve a suboptimal $[(1/2)\text{OPT} -\epsilon]$ solution with $O(1/\epsilon^2)$ stochastic gradients or the tight $[(1-1/e)\text{OPT} -\epsilon]$ solution with suboptimal $O(1/\epsilon^3)$ stochastic gradients. SCG++ enjoys optimality in terms of both approximation guarantee and stochastic oracle queries. Our novel variance reduction method naturally extends to stochastic convex minimization. More precisely, we develop Stochastic Frank-Wolfe++ (SFW++) that achieves an $\epsilon$-approximate optimum with only $O(1/\epsilon)$ calls to the linear optimization oracle while using $O(1/\epsilon^2)$ stochastic oracle queries in total. Therefore, SFW++ is the first efficient projection-free algorithm that achieves the optimal complexity $O(1/\epsilon^2)$ in terms of stochastic oracle queries.
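For readers unfamiliar with the role of the linear optimization oracle, the sketch below shows the classical deterministic Frank-Wolfe method on the probability simplex. It is not SCG++ or SFW++ and contains no variance reduction; the objective $\frac{1}{2}\|x-b\|^2$, the target `b`, and the stepsize rule $2/(k+2)$ are assumptions chosen for illustration.

```python
def frank_wolfe_simplex(grad, x0, iters):
    # Classical Frank-Wolfe on the probability simplex.
    x = list(x0)
    for k in range(iters):
        g = grad(x)
        # Linear optimization oracle: the minimizer of <g, s> over the simplex
        # is the vertex at the smallest gradient coordinate.
        i_min = min(range(len(g)), key=lambda i: g[i])
        gamma = 2.0 / (k + 2)
        # A convex combination keeps x feasible, so no projection is needed.
        x = [(1 - gamma) * xi for xi in x]
        x[i_min] += gamma
    return x

b = [0.2, 0.5, 0.3]  # illustrative target inside the simplex

def grad(x):
    # Gradient of 0.5 * ||x - b||^2.
    return [xi - bi for xi, bi in zip(x, b)]

x_fw = frank_wolfe_simplex(grad, [1.0, 0.0, 0.0], 2000)
objective = 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x_fw, b))
```

Each iteration makes exactly one oracle call and one gradient evaluation, which is why the complexity results above count linear optimization oracle calls and stochastic oracle queries separately.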
We consider solving convex-concave saddle point problems. We focus on two variants of gradient descent-ascent algorithms, Extra-gradient (EG) and Optimistic Gradient Descent-Ascent (OGDA) methods, and show that they admit a unified analysis as approximations of the classical proximal point method for solving saddle-point problems. This viewpoint enables us to generalize EG (in terms of extrapolation steps) and OGDA (in terms of parameters) and obtain new convergence rate results for these algorithms for the bilinear case as well as the strongly convex-concave case.
In this paper, we propose a Distributed Accumulated Newton Conjugate gradiEnt (DANCE) method in which the sample size is gradually increased to quickly obtain a solution whose empirical loss is within a satisfactory statistical accuracy. Our proposed method is multistage: the solution of each stage serves as a warm start for the next stage, which contains more samples (including the samples in the previous stage). The proposed multistage algorithm reduces the number of passes over data needed to achieve the statistical accuracy of the full training set. Moreover, our algorithm is by nature easy to distribute and enjoys the strong scaling property, meaning that acceleration is always expected when more computing nodes are used. Various iteration complexity results regarding descent direction computation, communication efficiency, and stopping criteria are analyzed in the convex setting. Our numerical results illustrate that the proposed method outperforms other comparable methods for solving learning problems including neural networks.
In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho<1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon,\gamma)$-second order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2},\rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex set $\mathcal{C}$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon,\gamma)$-SOSP.
We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.