Abstract:Recent empirical evidence has demonstrated that the training dynamics of large-scale deep neural networks occur within low-dimensional subspaces. While this has inspired new research into low-rank training, compression, and adaptation, theoretical justification for these dynamics in nonlinear networks remains limited. To address this gap, this paper analyzes the learning dynamics of multi-layer perceptrons (MLPs) under gradient descent (GD). We demonstrate that the weight dynamics concentrate within invariant low-dimensional subspaces throughout training. Theoretically, we precisely characterize these invariant subspaces for two-layer networks with smooth nonlinear activations, providing insight into their emergence. Experimentally, we validate that this phenomenon extends beyond our theoretical assumptions. Leveraging these insights, we empirically show that there exists a low-rank MLP parameterization that, when initialized within the appropriate subspaces, matches the performance of its fully-parameterized counterpart on a variety of classification tasks.
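
To make the low-rank parameterization concrete, the sketch below factorizes each weight matrix of a small MLP as a product of two thin rank-$r$ matrices. The layer sizes, the rank, and the plain random initialization are illustrative assumptions; the paper's parameterization additionally requires initializing the factors inside the specific invariant subspaces identified by the analysis.

import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer with a rank-r factorization W = U V (illustrative sketch)."""

    def __init__(self, d_in: int, d_out: int, r: int):
        super().__init__()
        # Placeholder random initialization; the paper initializes the factors
        # within particular invariant subspaces rather than generically.
        self.U = nn.Parameter(torch.randn(d_out, r) / r ** 0.5)
        self.V = nn.Parameter(torch.randn(r, d_in) / d_in ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Effective weight is U @ V, a rank-r matrix: (d_in + d_out) * r
        # trainable parameters instead of d_in * d_out.
        return x @ self.V.T @ self.U.T

# Example: a two-layer MLP whose weights are constrained to low rank.
mlp = nn.Sequential(LowRankLinear(784, 512, r=32), nn.GELU(), LowRankLinear(512, 10, r=10))
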
Abstract:Transformers have achieved state-of-the-art performance across various tasks, but suffer from a notable quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language problems, showing that it flexibly and accurately approximates softmax attention in a variety of contexts. Our code is available at https://github.com/cjyaras/monarch-attention.
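
For context on the structured class involved, the sketch below applies an $N \times N$ Monarch matrix ($N = m^2$): two block-diagonal factors interleaved with a fixed reshape-transpose permutation, costing $\Theta(N\sqrt{N})$ per vector. The block shapes and tensor layout are assumptions chosen for clarity; the variational projection of softmax attention onto this class and the fused GPU kernels are the paper's contribution and are not reproduced here.

import torch

def monarch_matvec(L: torch.Tensor, R: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Apply an N x N Monarch matrix (N = m * m) to x of shape (..., N).

    The matrix factors as M = P L P R, with L and R block-diagonal (m blocks of
    size m x m) and P the reshape-transpose permutation, which is its own inverse.
    """
    m = R.shape[0]                               # number of blocks (= sqrt(N))
    z = x.reshape(*x.shape[:-1], m, m)           # split coordinates into m groups of size m
    z = torch.einsum('bij,...bj->...bi', R, z)   # block-diagonal factor R
    z = z.transpose(-1, -2)                      # permutation P
    z = torch.einsum('bij,...bj->...bi', L, z)   # block-diagonal factor L
    z = z.transpose(-1, -2)                      # permutation P (undo)
    return z.reshape(*x.shape)

# Example: N = 4096 (m = 64), a batch of 8 vectors.
m = 64
L, R = torch.randn(m, m, m), torch.randn(m, m, m)
y = monarch_matvec(L, R, torch.randn(8, m * m))  # shape (8, 4096)
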
Abstract:This work aims to demystify the out-of-distribution (OOD) capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. With such a parameterization, we can model distribution shifts as a varying angle between the subspaces of the training and testing covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on this angle, illustrating that ICL is not robust to such distribution shifts. However, using this framework, we also prove an interesting property of ICL: when trained on task vectors drawn from a union of low-dimensional subspaces, ICL can generalize to any subspace within their span, given sufficiently long prompts. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training. We empirically show that our results also hold for models such as GPT-2, and conclude with (i) experiments on how our observations extend to nonlinear function classes and (ii) results showing that LoRA can capture distribution shifts.
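
As a concrete illustration of the setup (not the paper's exact experimental protocol), the sketch below samples in-context linear-regression prompts whose task vectors lie in a low-dimensional subspace, and constructs a test subspace rotated by a chosen angle away from the training one. The dimensions, prompt length, and angle are assumed, illustrative values.

import numpy as np

rng = np.random.default_rng(0)
d, k, n_ctx = 20, 2, 40                     # ambient dim, subspace dim, prompt length (illustrative)

def sample_prompt(basis):
    """One in-context regression prompt whose task vector lies in span(basis)."""
    w = basis @ rng.standard_normal(k)      # task vector confined to a k-dim subspace
    X = rng.standard_normal((n_ctx, d))     # context inputs
    return X, X @ w, w                      # noiseless context labels

# Training subspace U, and a test subspace whose first direction is rotated by theta out of span(U).
U = np.linalg.qr(rng.standard_normal((d, k)))[0]
v = rng.standard_normal(d)
v -= U @ (U.T @ v)                          # direction orthogonal to span(U)
v /= np.linalg.norm(v)
theta = np.pi / 6
U_test = U.copy()
U_test[:, 0] = np.cos(theta) * U[:, 0] + np.sin(theta) * v

X, y, w = sample_prompt(U_test)             # an out-of-distribution prompt for a trained ICL model
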




Abstract:Deep neural networks have attained remarkable success across diverse classification tasks. Recent empirical studies have shown that deep networks learn features that are linearly separable across classes. However, these findings often lack rigorous justification, even in relatively simple settings. In this work, we address this gap by examining the linear separation capabilities of shallow nonlinear networks. Specifically, inspired by the low intrinsic dimensionality of image data, we model inputs as a union of low-dimensional subspaces (UoS) and demonstrate that a single nonlinear layer can transform such data into linearly separable sets. Theoretically, we show that this transformation occurs with high probability when using random weights and quadratic activations. Notably, we prove this can be achieved when the network width scales polynomially with the intrinsic dimension of the data rather than the ambient dimension. Experimental results corroborate these theoretical findings and demonstrate that similar linear separation properties hold in practical scenarios beyond our analytical scope. This work bridges the gap between empirical observations and theoretical understanding of the separation capacity of nonlinear networks, offering deeper insights into model interpretability and generalization.
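
The sketch below gives a minimal numerical illustration of this setting (all dimensions, the width, and the sample sizes are assumed values, and the construction is not the paper's proof): data drawn from two low-dimensional subspaces are passed through a random-weight layer with elementwise quadratic activation, and a separating hyperplane over the resulting features is exhibited explicitly via the quadratic form $P_{U_1} - P_{U_0}$, which is negative on one subspace and positive on the other.

import numpy as np

rng = np.random.default_rng(0)
D, d, m, n = 100, 3, 60, 200                # ambient dim, intrinsic dim, width, samples per class (illustrative)

# Two classes supported on different d-dimensional subspaces of R^D.
U0 = np.linalg.qr(rng.standard_normal((D, d)))[0]
U1 = np.linalg.qr(rng.standard_normal((D, d)))[0]
X = np.vstack([rng.standard_normal((n, d)) @ U0.T, rng.standard_normal((n, d)) @ U1.T])
y = np.r_[-np.ones(n), np.ones(n)]

# One random nonlinear layer with quadratic activation: phi(x) = (Wx)^2 elementwise.
W = rng.standard_normal((m, D)) / np.sqrt(D)
features = (X @ W.T) ** 2

# The quadratic form Q = P_{U1} - P_{U0} is negative on the first subspace and positive
# on the second; with enough random features, its values on the samples are realized by
# a linear function of the features, so the feature vectors are linearly separable.
Q = U1 @ U1.T - U0 @ U0.T
q = np.einsum('ij,jk,ik->i', X, Q, X)                   # x^T Q x for every sample
c, *_ = np.linalg.lstsq(features, q, rcond=None)
print("training accuracy:", np.mean(np.sign(features @ c) == y))   # should print 1.0 in this regime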


Abstract:Mixtures of probabilistic principal component analysis (MPPCA) is a well-known mixture model extension of principal component analysis (PCA). Similar to PCA, MPPCA assumes that the data samples within each mixture component contain homoscedastic noise. However, datasets with heterogeneous noise across samples are becoming increasingly common, as larger datasets are generated by collecting samples from several sources with varying noise profiles. The performance of MPPCA is suboptimal for data with heteroscedastic noise across samples. This paper proposes a heteroscedastic mixtures of probabilistic PCA technique (HeMPPCAT) that uses a generalized expectation-maximization (GEM) algorithm to jointly estimate the unknown underlying factors, means, and noise variances under a heteroscedastic noise setting. Simulation results illustrate the improved factor estimates and clustering accuracies of HeMPPCAT compared to MPPCA.
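
To fix the data model in code (a sketch under assumed notation, not the paper's simulation setup): each sample from mixture component $k$ is generated as $x = F_k z + \mu_k + \varepsilon$, where the variance of the noise $\varepsilon$ depends on which source group the sample came from rather than being shared across a component.

import numpy as np

rng = np.random.default_rng(0)
D, d, K = 50, 3, 2                          # ambient dim, latent dim, number of components (illustrative)
n_per, noise_vars = 200, (0.1, 2.0)         # two source groups with very different noise levels (assumed)

X, comp = [], []
for k in range(K):
    F_k = rng.standard_normal((D, d))       # factor matrix of component k
    mu_k = 5.0 * rng.standard_normal(D)     # mean of component k
    for v in noise_vars:                    # samples arrive from sources with different noise levels
        Z = rng.standard_normal((n_per, d))                 # latent factors
        E = np.sqrt(v) * rng.standard_normal((n_per, D))    # heteroscedastic noise
        X.append(Z @ F_k.T + mu_k + E)
        comp += [k] * n_per
X, comp = np.vstack(X), np.array(comp)
# MPPCA fits a single shared noise variance per component to such data; HeMPPCAT's GEM
# updates additionally estimate per-group noise variances, which is what improves the
# factor estimates and clustering accuracy in the simulations.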