Fine-tuning large language models (LLMs) with classic first-order optimizers entails prohibitive GPU memory due to the backpropagation process. Recent works have turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using two forward passes. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal Hessian informed zeroth-order optimizer which is the first work to leverage the diagonal Hessian to enhance zeroth-order optimizer for fine-tuning LLMs. What's more, HiZOO avoids the expensive memory cost and only increases one forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing training steps and effectively enhancing model accuracy. Moreover, we visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
Personalization aims to characterize individual preferences and is widely applied across many fields. However, conventional personalized methods operate in a centralized manner and potentially expose the raw data when pooling individual information. In this paper, with privacy considerations, we develop a flexible and interpretable personalized framework within the paradigm of Federated Learning, called PPFL (Population Personalized Federated Learning). By leveraging canonical models to capture fundamental characteristics among the heterogeneous population and employing membership vectors to reveal clients' preferences, it models the heterogeneity as clients' varying preferences for these characteristics and provides substantial insights into client characteristics, which is lacking in existing Personalized Federated Learning (PFL) methods. Furthermore, we explore the relationship between our method and three main branches of PFL methods: multi-task PFL, clustered FL, and decoupling PFL, and demonstrate the advantages of PPFL. To solve PPFL (a non-convex constrained optimization problem), we propose a novel random block coordinate descent algorithm and present the convergence property. We conduct experiments on both pathological and practical datasets, and the results validate the effectiveness of PPFL.
The conjugate gradient method is a crucial first-order optimization method that generally converges faster than the steepest descent method, and its computational cost is much lower than the second-order methods. However, while various types of conjugate gradient methods have been studied in Euclidean spaces and on Riemannian manifolds, there has little study for those in distributed scenarios. This paper proposes a decentralized Riemannian conjugate gradient descent (DRCGD) method that aims at minimizing a global function over the Stiefel manifold. The optimization problem is distributed among a network of agents, where each agent is associated with a local function, and communication between agents occurs over an undirected connected graph. Since the Stiefel manifold is a non-convex set, a global function is represented as a finite sum of possibly non-convex (but smooth) local functions. The proposed method is free from expensive Riemannian geometric operations such as retractions, exponential maps, and vector transports, thereby reducing the computational complexity required by each agent. To the best of our knowledge, DRCGD is the first decentralized Riemannian conjugate gradient algorithm to achieve global convergence over the Stiefel manifold.
The zeroth-order optimization has been widely used in machine learning applications. However, the theoretical study of the zeroth-order optimization focus on the algorithms which approximate (first-order) gradients using (zeroth-order) function value difference at a random direction. The theory of algorithms which approximate the gradient and Hessian information by zeroth-order queries is much less studied. In this paper, we focus on the theory of zeroth-order optimization which utilizes both the first-order and second-order information approximated by the zeroth-order queries. We first propose a novel reparameterized objective function with parameters $(\mu, \Sigma)$. This reparameterized objective function achieves its optimum at the minimizer and the Hessian inverse of the original objective function respectively, but with small perturbations. Accordingly, we propose a new algorithm to minimize our proposed reparameterized objective, which we call \texttt{MiNES} (mirror descent natural evolution strategy). We show that the estimated covariance matrix of \texttt{MiNES} converges to the inverse of Hessian matrix of the objective function with a convergence rate $\widetilde{\mathcal{O}}(1/k)$, where $k$ is the iteration number and $\widetilde{\mathcal{O}}(\cdot)$ hides the constant and $\log$ terms. We also provide the explicit convergence rate of \texttt{MiNES} and how the covariance matrix promotes the convergence rate.
We study finite-sum distributed optimization problems with $n$-clients under popular $\delta$-similarity condition and $\mu$-strong convexity. We propose two new algorithms: SVRS and AccSVRS motivated by previous works. The non-accelerated SVRS method combines the techniques of gradient-sliding and variance reduction, which achieves superior communication complexity $\tilde{\gO}(n {+} \sqrt{n}\delta/\mu)$ compared to existing non-accelerated algorithms. Applying the framework proposed in Katyusha X, we also build a direct accelerated practical version named AccSVRS with totally smoothness-free $\tilde{\gO}(n {+} n^{3/4}\sqrt{\delta/\mu})$ communication complexity that improves upon existing algorithms on ill-conditioning cases. Furthermore, we show a nearly matched lower bound to verify the tightness of our AccSVRS method.
This paper studies the stochastic optimization for decentralized nonconvex-strongly-concave minimax problem. We propose a simple and efficient algorithm, called Decentralized Recursive gradient descEnt Ascent Method (DREAM), which requires $\mathcal{O}(\kappa^3\epsilon^{-3})$ stochastic first-order oracle (SFO) calls and $\mathcal{O}\big(\kappa^2\epsilon^{-2}/\sqrt{1-\lambda_2(W)}\,\big)$ communication rounds to find an $\epsilon$-stationary point, where $\kappa$ is the condition number and $\lambda_2(W)$ is the second-largest eigenvalue of the gossip matrix $W$. To the best our knowledge, DREAM is the first algorithm whose SFO and communication complexities simultaneously achieve the optimal dependency on $\epsilon$ and $\lambda_2(W)$ for this problem.
This paper studies the decentralized nonconvex optimization problem $\min_{x\in{\mathbb R}^d} f(x)\triangleq \frac{1}{m}\sum_{i=1}^m f_i(x)$, where $f_i(x)\triangleq \frac{1}{n}\sum_{j=1}^n f_{i,j}(x)$ is the local function on the $i$-th agent of the network. We propose a novel stochastic algorithm called DEcentralized probAbilistic Recursive gradiEnt deScenT (\DEAREST), which integrates the techniques of variance reduction, gradient tracking and multi-consensus. We construct a Lyapunov function that simultaneously characterizes the function value, the gradient estimation error and the consensus error for the convergence analysis. Based on this measure, we provide a concise proof to show DEAREST requires at most ${\mathcal O}(mn+\sqrt{mn}L\varepsilon^{-2})$ incremental first-order oracle (IFO) calls and ${\mathcal O}({L\varepsilon^{-2}}/{\sqrt{1-\lambda_2(W)}}\,)$ communication rounds to find an $\varepsilon$-stationary point in expectation, where $L$ is the smoothness parameter and $\lambda_2(W)$ is the second-largest eigenvalue of the gossip matrix $W$. We can verify both of the IFO complexity and communication complexity match the lower bounds. To the best of our knowledge, DEAREST is the first optimal algorithm for decentralized nonconvex finite-sum optimization.
This paper studies decentralized convex-concave minimax optimization problems of the form $\min_x\max_y f(x,y) \triangleq\frac{1}{m}\sum_{i=1}^m f_i(x,y)$, where $m$ is the number of agents and each local function can be written as $f_i(x,y)=\frac{1}{n}\sum_{j=1}^n f_{i,j}(x,y)$. We propose a novel decentralized optimization algorithm, called multi-consensus stochastic variance reduced extragradient, which achieves the best known stochastic first-order oracle (SFO) complexity for this problem. Specifically, each agent requires $\mathcal O((n+\kappa\sqrt{n})\log(1/\varepsilon))$ SFO calls for strongly-convex-strongly-concave problem and $\mathcal O((n+\sqrt{n}L/\varepsilon)\log(1/\varepsilon))$ SFO call for general convex-concave problem to achieve $\varepsilon$-accurate solution in expectation, where $\kappa$ is the condition number and $L$ is the smoothness parameter. The numerical experiments show the proposed method performs better than baselines.
Learning rate schedulers have been widely adopted in training deep neural networks. Despite their practical importance, there is a discrepancy between its practice and its theoretical analysis. For instance, it is not known what schedules of SGD achieve best convergence, even for simple problems such as optimizing quadratic objectives. So far, step decay has been one of the strongest candidates under this setup, which is proved to be nearly optimal with a $\mathcal{O}(\log T)$ gap. However, according to our analysis, this gap turns out to be $\Omega(\log T)$ in a wide range of settings, which throws the schedule optimality problem into an open question again. Towards answering this reopened question, in this paper, we propose Eigencurve, the first family of learning rate schedules that can achieve minimax optimal convergence rates (up to a constant) for SGD on quadratic objectives when the eigenvalue distribution of the underlying Hessian matrix is skewed. The condition is quite common in practice. Experimental results show that Eigencurve can significantly outperform step decay in image classification tasks on CIFAR-10, especially when the number of epochs is small. Moreover, the theory inspires two simple learning rate schedulers for practical applications that can approximate Eigencurve. For some problems, the optimal shape of the proposed schedulers resembles that of cosine decay, which sheds light to the success of cosine decay for such situations. For other situations, the proposed schedulers are superior to cosine decay.