Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Thomas Pethick

When to use what Schatten-$p$ norm in deep learning?

Jun 13, 2026

Thomas Pethick

Abstract:Schatten-$\infty$ based optimizers such as Muon have shown promising empirical performance, but there remains seemingly conflicting observations regarding whether they are beneficial. We resolve this conflict by showing that the conclusion is regime dependent. Even when the objective is smooth in the Schatten-$\infty$ geometry, smaller Schatten-$p$ geometries can be optimal, specifically in the low-dimensional regime, which we show includes Chinchilla scaling. This conclusion follows from a new noise-robust acceleration result for the SODA framework for $p>2$. The same analysis explains why Muon-like methods do not require warmup, why they naturally favor large batches, and yields a batch size scaling rule for arbitrary $p$.

Via

Access Paper or Ask Questions

Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

Jun 12, 2026

Florian Hübler, Thomas Pethick, Suvrit Sra

Abstract:Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded $p$-th central moments, $p \in (1,2]$. We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for $m \times n$ matrices, Muon finds an $\varepsilon$-stationary point in nuclear norm within $\mathcal{O}\left(\min\{m, n\} \frac{Δ_1 L}{\varepsilon^2} \left(\frac σ\varepsilon \right)^{\frac p {p-1}}\right)$ samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.

Via

Access Paper or Ask Questions

Training Neural Networks at Any Scale

Nov 14, 2025

Thomas Pethick, Kimon Antonakopoulos, Antonio Silveti-Falls, Leena Chennuru Vankadara, Volkan Cevher

Abstract:This article reviews modern optimization methods for training neural networks with an emphasis on efficiency and scale. We present state-of-the-art optimization algorithms under a unified algorithmic template that highlights the importance of adapting to the structures in the problem. We then cover how to make these algorithms agnostic to the scale of the problem. Our exposition is intended as an introduction for both practitioners and researchers who wish to be involved in these exciting new developments.

Via

Access Paper or Ask Questions

Training Deep Learning Models with Norm-Constrained LMOs

Feb 11, 2025

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher

Abstract:In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.

Via

Access Paper or Ask Questions

SAMPa: Sharpness-aware Minimization Parallelized

Oct 14, 2024

Wanyun Xie, Thomas Pethick, Volkan Cevher

Figure 1 for SAMPa: Sharpness-aware Minimization Parallelized

Figure 2 for SAMPa: Sharpness-aware Minimization Parallelized

Figure 3 for SAMPa: Sharpness-aware Minimization Parallelized

Figure 4 for SAMPa: Sharpness-aware Minimization Parallelized

Abstract:Sharpness-aware minimization (SAM) has been shown to improve the generalization of neural networks. However, each SAM update requires \emph{sequentially} computing two gradients, effectively doubling the per-iteration cost compared to base optimizers like SGD. We propose a simple modification of SAM, termed SAMPa, which allows us to fully parallelize the two gradient computations. SAMPa achieves a twofold speedup of SAM under the assumption that communication costs between devices are negligible. Empirical results show that SAMPa ranks among the most efficient variants of SAM in terms of computational time. Additionally, our method consistently outperforms SAM across both vision and language tasks. Notably, SAMPa theoretically maintains convergence guarantees even for \emph{fixed} perturbation sizes, which is established through a novel Lyapunov function. We in fact arrive at SAMPa by treating this convergence guarantee as a hard requirement -- an approach we believe is promising for developing SAM-based methods in general. Our code is available at \url{https://github.com/LIONS-EPFL/SAMPa}.

* Advances in Neural Information Processing Systems (NeurIPS), 2024

Via

Access Paper or Ask Questions

Improving SAM Requires Rethinking its Optimization Formulation

Jul 17, 2024

Wanyun Xie, Fabian Latorre, Kimon Antonakopoulos, Thomas Pethick, Volkan Cevher

Figure 1 for Improving SAM Requires Rethinking its Optimization Formulation

Figure 2 for Improving SAM Requires Rethinking its Optimization Formulation

Figure 3 for Improving SAM Requires Rethinking its Optimization Formulation

Figure 4 for Improving SAM Requires Rethinking its Optimization Formulation

Abstract:This paper rethinks Sharpness-Aware Minimization (SAM), which is originally formulated as a zero-sum game where the weights of a network and a bounded perturbation try to minimize/maximize, respectively, the same differentiable loss. To fundamentally improve this design, we argue that SAM should instead be reformulated using the 0-1 loss. As a continuous relaxation, we follow the simple conventional approach where the minimizing (maximizing) player uses an upper bound (lower bound) surrogate to the 0-1 loss. This leads to a novel formulation of SAM as a bilevel optimization problem, dubbed as BiSAM. BiSAM with newly designed lower-bound surrogate loss indeed constructs stronger perturbation. Through numerical evidence, we show that BiSAM consistently results in improved performance when compared to the original SAM and variants, while enjoying similar computational complexity. Our code is available at https://github.com/LIONS-EPFL/BiSAM.

* International Conference on Machine Learning (ICML), 2024

Via

Access Paper or Ask Questions

Stable Nonconvex-Nonconcave Training via Linear Interpolation

Oct 20, 2023

Thomas Pethick, Wanyun Xie, Volkan Cevher

Abstract:This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method to achieve last iterate convergence rates for the full range of cohypomonotone problems. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrates the benefits of the linear interpolation present in both RAPP and Lookahead.

Via

Access Paper or Ask Questions

Federated Learning under Covariate Shifts with Generalization Guarantees

Jun 08, 2023

Ali Ramezani-Kebrya, Fanghui Liu, Thomas Pethick, Grigorios Chrysos, Volkan Cevher

Abstract:This paper addresses intra-client and inter-client covariate shifts in federated learning (FL) with a focus on the overall generalization performance. To handle covariate shifts, we formulate a new global model training paradigm and propose Federated Importance-Weighted Empirical Risk Minimization (FTW-ERM) along with improving density ratio matching methods without requiring perfect knowledge of the supremum over true ratios. We also propose the communication-efficient variant FITW-ERM with the same level of privacy guarantees as those of classical ERM in FL. We theoretically show that FTW-ERM achieves smaller generalization error than classical ERM under certain settings. Experimental results demonstrate the superiority of FTW-ERM over existing FL baselines in challenging imbalanced federated settings in terms of data distribution shifts across clients.

* Published in Transactions on Machine Learning Research (TMLR)

Via

Access Paper or Ask Questions

Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Feb 20, 2023

Thomas Pethick, Puya Latafat, Panagiotis Patrinos, Olivier Fercoq, Volkan Cevher

Figure 1 for Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Figure 2 for Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Figure 3 for Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Figure 4 for Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Abstract:This paper introduces a new extragradient-type algorithm for a class of nonconvex-nonconcave minimax problems. It is well-known that finding a local solution for general minimax problems is computationally intractable. This observation has recently motivated the study of structures sufficient for convergence of first order methods in the more general setting of variational inequalities when the so-called weak Minty variational inequality (MVI) holds. This problem class captures non-trivial structures as we demonstrate with examples, for which a large family of existing algorithms provably converge to limit cycles. Our results require a less restrictive parameter range in the weak MVI compared to what is previously known, thus extending the applicability of our scheme. The proposed algorithm is applicable to constrained and regularized problems, and involves an adaptive stepsize allowing for potentially larger stepsizes. Our scheme also converges globally even in settings where the underlying operator exhibits limit cycles.

* Code accessible at: https://github.com/LIONS-EPFL/weak-minty-code/

Via

Access Paper or Ask Questions

Revisiting adversarial training for the worst-performing class

Feb 17, 2023

Thomas Pethick, Grigorios G. Chrysos, Volkan Cevher

Abstract:Despite progress in adversarial training (AT), there is a substantial gap between the top-performing and worst-performing classes in many datasets. For example, on CIFAR10, the accuracies for the best and worst classes are 74% and 23%, respectively. We argue that this gap can be reduced by explicitly optimizing for the worst-performing class, resulting in a min-max-max optimization formulation. Our method, called class focused online learning (CFOL), includes high probability convergence guarantees for the worst class loss and can be easily integrated into existing training setups with minimal computational overhead. We demonstrate an improvement to 32% in the worst class accuracy on CIFAR10, and we observe consistent behavior across CIFAR100 and STL10. Our study highlights the importance of moving beyond average accuracy, which is particularly important in safety-critical applications.

* Code accessible at: https://github.com/LIONS-EPFL/class-focused-online-learning-code

Via

Access Paper or Ask Questions