Abstract: Optimizers that explicitly treat weights as matrices rather than flattened vectors have recently proven effective. This perspective naturally leads to structured approximations of the Fisher matrix as preconditioners, where the matrix view induces a Kronecker-factorized form that enables a memory-efficient representation. However, constructing such approximations both efficiently and accurately remains an open challenge: obtaining the optimal factorization is resource-intensive, so practical methods fall back on heuristic design choices. In this work, we introduce a novel approach that leverages projector-splitting integrators to construct effective preconditioners. Our optimizer, DyKAF (Dynamical Kronecker Approximation of the Fisher Matrix), consistently improves the quality of the Fisher matrix approximation. Experiments on large language model pre-training and fine-tuning demonstrate that DyKAF outperforms existing optimizers across a range of evaluation metrics.
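The abstract does not spell out the DyKAF update itself, but the projector-splitting integrators it builds on are standard in dynamical low-rank approximation. Below is a minimal NumPy sketch of one first-order projector-splitting (KSL) step that updates low-rank factors U, S, V after the tracked matrix changes by an increment dA; the function name, the rank-2 toy example, and the application to a drifting matrix are illustrative assumptions, not the paper's construction of the Kronecker-factorized preconditioner.

import numpy as np

def projector_splitting_step(U, S, Vt, dA):
    """One first-order projector-splitting (KSL) step for dynamical
    low-rank approximation: updates the factors so that U @ S @ Vt
    tracks a matrix that has changed by the increment dA."""
    V = Vt.T
    # K-step: update the column space.
    K = U @ S + dA @ V
    U1, S_hat = np.linalg.qr(K)
    # S-step: remove the part of the increment already accounted for.
    S_tilde = S_hat - U1.T @ dA @ V
    # L-step: update the row space.
    L = V @ S_tilde.T + dA.T @ U1
    V1, S1t = np.linalg.qr(L)
    return U1, S1t.T, V1.T

# Toy usage: track a rank-2 approximation of a slowly drifting matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
U, S_full, Vt = np.linalg.svd(A, full_matrices=False)
r = 2
U, S, Vt = U[:, :r], np.diag(S_full[:r]), Vt[:r, :]
dA = 0.01 * rng.standard_normal((6, 4))
U, S, Vt = projector_splitting_step(U, S, Vt, dA)

The appeal of this integrator in a preconditioning context is that the factors are updated directly from the increment, without ever forming or re-factorizing the full matrix.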
Abstract: Fine-tuning Large Language Models (LLMs) is essential for adapting pre-trained models to downstream tasks, yet traditional first-order optimizers such as Stochastic Gradient Descent (SGD) and Adam incur prohibitive memory and computational costs that scale poorly with model size. In this paper, we investigate zeroth-order (ZO) optimization methods as a memory- and compute-efficient alternative, particularly in the context of parameter-efficient fine-tuning techniques such as LoRA. We propose $\texttt{JAGUAR SignSGD}$, a ZO momentum-based algorithm that extends ZO SignSGD, requires the same number of stored parameters as standard ZO SGD, and uses only $\mathcal{O}(1)$ function evaluations per iteration. To the best of our knowledge, this is the first study to establish rigorous convergence guarantees for SignSGD in the stochastic ZO setting. We further propose $\texttt{JAGUAR Muon}$, a novel ZO extension of the Muon optimizer that leverages the matrix structure of model parameters, and we provide its convergence rate under arbitrary stochastic noise. Through extensive experiments on challenging LLM fine-tuning benchmarks, we demonstrate that the proposed algorithms match or exceed the convergence quality of standard first-order methods while achieving significant memory savings. Our theoretical and empirical results establish these ZO methods as a practical and theoretically grounded approach for resource-constrained LLM adaptation. Our code is available at https://github.com/brain-mmo-lab/ZO_LLM
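The abstract only states the high-level properties of $\texttt{JAGUAR SignSGD}$ (a single momentum buffer the size of the parameters and $\mathcal{O}(1)$ function evaluations per step), so the sketch below is a generic zeroth-order SignSGD with a coordinate-wise momentum buffer rather than the paper's exact update rule; the function name, hyperparameters, and the single-coordinate finite-difference estimator are assumptions made for illustration.

import numpy as np

def zo_sign_sgd_momentum(f, x, lr=1e-3, tau=1e-3, beta=0.1, steps=1000, seed=0):
    """Illustrative zeroth-order SignSGD with a coordinate-wise momentum
    buffer. Two function evaluations per step estimate one directional
    derivative; the sign of the momentum vector drives the update."""
    rng = np.random.default_rng(seed)
    m = np.zeros_like(x)           # momentum buffer, same size as x
    for _ in range(steps):
        i = rng.integers(x.size)   # random coordinate direction
        e = np.zeros_like(x)
        e[i] = 1.0
        g_i = (f(x + tau * e) - f(x - tau * e)) / (2 * tau)  # finite difference
        m[i] = (1 - beta) * m[i] + beta * g_i                # update one coordinate
        x = x - lr * np.sign(m)                              # sign step
    return x

# Toy usage: minimize a quadratic without ever computing its gradient.
f = lambda x: np.sum((x - 1.0) ** 2)
x_star = zo_sign_sgd_momentum(f, np.zeros(10), steps=5000)

The memory footprint matches the claim in the abstract in this sketch: beyond the parameters themselves, only one extra vector (the momentum buffer) is stored, and each iteration costs two function evaluations.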
Abstract: Widespread deployment of societal-scale machine learning systems necessitates a thorough understanding of the long-term effects these systems have on their environment, including loss of trustworthiness, bias amplification, and violation of AI safety requirements. We introduce a repeated learning process that jointly describes several phenomena attributed to unintended hidden feedback loops, such as error amplification, induced concept drift, and echo chambers. The process comprises the entire cycle of obtaining the data, training the predictive model, and delivering predictions to end users within a single mathematical model. A distinctive feature of such a repeated learning setting is that the state of the environment becomes causally dependent on the learner itself over time, violating the usual assumptions about the data distribution. We present a novel dynamical systems model of the repeated learning process and derive the limiting sets of probability distributions for the positive and negative feedback loop modes of system operation. We conduct a series of computational experiments on an exemplary supervised learning problem with two synthetic data sets; the experimental results agree with the theoretical predictions derived from the dynamical model. Our results demonstrate the feasibility of the proposed approach for studying repeated learning processes in machine learning systems and open a range of opportunities for further research in the area.
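To make the obtain-data / train / deliver-predictions cycle concrete, here is a toy simulation in which the environment shifts toward the model's predictions, so a small systematic model bias is amplified into concept drift under positive feedback. The drift rule, the biased mean estimator, and the parameter lmbda are illustrative assumptions, not the paper's dynamical systems model.

import numpy as np

def repeated_learning(lmbda, bias=0.05, rounds=100, n=500, seed=0):
    """Toy repeated-learning cycle: sample data from the environment, train a
    (slightly biased) mean estimator, deliver predictions; the environment then
    shifts toward the prediction with strength lmbda, so future data depends
    causally on the learner."""
    rng = np.random.default_rng(seed)
    mu = 0.0                                  # environment state (true mean)
    for _ in range(rounds):
        data = rng.normal(mu, 1.0, size=n)    # obtain data from the environment
        theta = data.mean() + bias            # train a model with a small systematic bias
        mu += lmbda * (theta - mu)            # deliver predictions; environment reacts
    return mu

print(repeated_learning(lmbda=0.0))  # no feedback: environment stays near 0
print(repeated_learning(lmbda=0.8))  # positive feedback: the bias accumulates into drift

Even this minimal loop shows the defining feature described in the abstract: once predictions influence the data-generating process, the i.i.d. assumption breaks and the long-run behavior is governed by the feedback strength rather than by the initial data distribution.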