Title: Characterization and Mitigation of Training Instabilities in Microscaling Formats

Jun 25, 2025

Huangyuan Su, Mujin Kwun, Stephanie Gil, Sham Kakade, Nikhil Anand

Abstract: Training large language models is an expensive, compute-bound process that must be repeated as models scale, algorithms improve, and new data is collected. To address this, next-generation hardware accelerators increasingly support lower-precision arithmetic formats, such as the Microscaling (MX) formats introduced in NVIDIA's Blackwell architecture. These formats use a shared scale within blocks of parameters to extend representable range and perform forward/backward GEMM operations in reduced precision for efficiency gains. In this work, we investigate the challenges and viability of block-scaled precision formats during model training. Across nearly one thousand language models trained from scratch (spanning compute budgets from $2 \times 10^{17}$ to $4.8 \times 10^{19}$ FLOPs and sweeping over a broad range of weight-activation precision combinations), we consistently observe that training in MX formats exhibits sharp, stochastic instabilities in the loss, particularly at larger compute scales. To explain this phenomenon, we conduct controlled experiments and ablations on a smaller proxy model that exhibits behavior similar to that of the language models, sweeping across architectural settings, hyperparameters, and precision formats. These experiments motivate a simple model in which multiplicative gradient bias introduced by the quantization of layer-norm affine parameters and a small fraction of activations can trigger runaway divergence. Through in situ intervention experiments on our proxy model, we demonstrate that instabilities can be averted or delayed by modifying precision schemes mid-training. Guided by these findings, we evaluate stabilization strategies in the LLM setting and show that certain hybrid configurations recover performance competitive with full-precision training. We release our code at https://github.com/Hither1/systems-scaling.

* 14 pages + appendices
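
The "shared scale within blocks" idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code and not the exact MX specification: it assumes a block size of 32, a power-of-two scale shared by each block, and signed-integer elements. The function names (`quantize_block`, `dequantize_block`, `block_quantize`) and the `bits` parameter are hypothetical, chosen only for illustration.

```python
import math


def quantize_block(block, bits=8):
    """Quantize one block to signed integers that share a power-of-two scale.

    Illustrative assumption: integer elements, not a real MX element format.
    Returns (exponent, integer_codes).
    """
    qmax = 2 ** (bits - 1) - 1              # largest integer magnitude
    amax = max(abs(x) for x in block)       # largest magnitude in the block
    if amax == 0.0:
        return 0, [0] * len(block)
    # Smallest exponent such that amax / 2**exp still fits within qmax.
    exp = math.ceil(math.log2(amax / qmax))
    scale = 2.0 ** exp
    # Round to the nearest code and clamp to guard against float edge cases.
    codes = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exp, codes


def dequantize_block(exp, codes):
    """Map integer codes back to floats using the block's shared scale."""
    scale = 2.0 ** exp
    return [c * scale for c in codes]


def block_quantize(values, block_size=32, bits=8):
    """Quantize and dequantize a flat list block by block."""
    out = []
    for start in range(0, len(values), block_size):
        exp, codes = quantize_block(values[start:start + block_size], bits)
        out.extend(dequantize_block(exp, codes))
    return out
```

In this sketch, a block such as `[1000.0] + [0.001] * 31` receives a shared scale of 8, so the large entry is preserved while the small entries round to zero. The same small values placed in a block of their own get a much finer scale and are preserved to within half a quantization step. This only shows how one large value limits the precision available to the rest of its block; it makes no claim about the instability mechanism studied in the paper.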
