Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yuan Cao

Hefei National Laboratory for Physical Sciences at Microscale and Department of Modern Physics, University of Science and Technology of China, Hefei, China, Shanghai Branch, CAS Center for Excellence in Quantum Information and Quantum Physics, University of Science and Technology of China, Shanghai, China, Shanghai Research Center for Quantum Sciences, Shanghai, China

SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Sep 16, 2025

Yuan Cao, Dong Wang

Figure 1 for SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Figure 2 for SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Figure 3 for SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Figure 4 for SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Abstract:While Transformer architecture excel at modeling long-range dependencies contributing to its widespread adoption in vision tasks the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose \textbf{S}elective \textbf{A}daptive \textbf{GA}ting for Efficient and Expressive Linear Attention (SAGA) , which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4\% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.

Via

Access Paper or Ask Questions

Structure-Aware Automatic Channel Pruning by Searching with Graph Embedding

Jun 13, 2025

Zifan Liu, Yuan Cao, Yanwei Yu, Heng Qi, Jie Gui

Abstract:Channel pruning is a powerful technique to reduce the computational overhead of deep neural networks, enabling efficient deployment on resource-constrained devices. However, existing pruning methods often rely on local heuristics or weight-based criteria that fail to capture global structural dependencies within the network, leading to suboptimal pruning decisions and degraded model performance. To address these limitations, we propose a novel structure-aware automatic channel pruning (SACP) framework that utilizes graph convolutional networks (GCNs) to model the network topology and learn the global importance of each channel. By encoding structural relationships within the network, our approach implements topology-aware pruning and this pruning is fully automated, reducing the need for human intervention. We restrict the pruning rate combinations to a specific space, where the number of combinations can be dynamically adjusted, and use a search-based approach to determine the optimal pruning rate combinations. Extensive experiments on benchmark datasets (CIFAR-10, ImageNet) with various models (ResNet, VGG16) demonstrate that SACP outperforms state-of-the-art pruning methods on compression efficiency and competitive on accuracy retention.

* 12 pages, 2 figures

Via

Access Paper or Ask Questions

Through a Compressed Lens: Investigating the Impact of Quantization on LLM Explainability and Interpretability

May 20, 2025

Qianli Wang, Mingyang Wang, Nils Feldhus, Simon Ostermann, Yuan Cao, Hinrich Schütze, Sebastian Möller, Vera Schmitt

Abstract:Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). While prior research has extensively investigated the degradation of various LLM capabilities due to quantization, its effects on model explainability and interpretability, which are crucial for understanding decision-making processes, remain unexplored. To address this gap, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with two explainability methods, counterfactual examples and natural language explanations, as well as two interpretability approaches, knowledge memorization analysis and latent multi-hop reasoning analysis. We complement our analysis with a thorough user study, evaluating selected explainability methods. Our findings reveal that, depending on the configuration, quantization can significantly impact model explainability and interpretability. Notably, the direction of this effect is not consistent, as it strongly depends on (1) the quantization method, (2) the explainability or interpretability approach, and (3) the evaluation protocol. In some settings, human evaluation shows that quantization degrades explainability, while in others, it even leads to improvements. Our work serves as a cautionary tale, demonstrating that quantization can unpredictably affect model transparency. This insight has important implications for deploying LLMs in applications where transparency is a critical requirement.

* In submission

Via

Access Paper or Ask Questions

ScaleGNN: Towards Scalable Graph Neural Networks via Adaptive High-order Neighboring Feature Fusion

Apr 22, 2025

Xiang Li, Haobing Liu, Jianpeng Qi, Yuan Cao, Guoqing Chao, Yanwei Yu

Abstract:Graph Neural Networks (GNNs) have demonstrated strong performance across various graph-based tasks by effectively capturing relational information between nodes. These models rely on iterative message passing to propagate node features, enabling nodes to aggregate information from their neighbors. Recent research has significantly improved the message-passing mechanism, enhancing GNN scalability on large-scale graphs. However, GNNs still face two main challenges: over-smoothing, where excessive message passing results in indistinguishable node representations, especially in deep networks incorporating high-order neighbors; and scalability issues, as traditional architectures suffer from high model complexity and increased inference time due to redundant information aggregation. This paper proposes a novel framework for large-scale graphs named ScaleGNN that simultaneously addresses both challenges by adaptively fusing multi-level graph features. We first construct neighbor matrices for each order, learning their relative information through trainable weights through an adaptive high-order feature fusion module. This allows the model to selectively emphasize informative high-order neighbors while reducing unnecessary computational costs. Additionally, we introduce a High-order redundant feature masking mechanism based on a Local Contribution Score (LCS), which enables the model to retain only the most relevant neighbors at each order, preventing redundant information propagation. Furthermore, low-order enhanced feature aggregation adaptively integrates low-order and high-order features based on task relevance, ensuring effective capture of both local and global structural information without excessive complexity. Extensive experiments on real-world datasets demonstrate that our approach consistently outperforms state-of-the-art GNN models in both accuracy and computational efficiency.

Via

Access Paper or Ask Questions

Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Apr 11, 2025

Chenyang Zhang, Xuran Meng, Yuan Cao

Figure 1 for Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Figure 2 for Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Figure 3 for Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Figure 4 for Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Abstract:Transformers have demonstrated remarkable success across various applications. However, the success of transformers have not been understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic statistical model with "group sparsity", where the input variables form multiple groups, and the label only depends on the variables from one of the groups. We theoretically demonstrate that, a one-layer transformer trained by gradient descent can correctly leverage the attention mechanism to select variables, disregarding irrelevant ones and focusing on those beneficial for classification. We also demonstrate that a well-pretrained one-layer transformer can be adapted to new downstream tasks to achieve good prediction accuracy with a limited number of samples. Our study sheds light on how transformers effectively learn structured data.

* 63 pages, 6 figures

Via

Access Paper or Ask Questions

Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Apr 11, 2025

Chenyang Zhang, Peifeng Gao, Difan Zou, Yuan Cao

Figure 1 for Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Figure 2 for Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Figure 3 for Gradient Descent Robustly Learns the Intrinsic Dimension of Data in Training Convolutional Neural Networks

Abstract:Modern neural networks are usually highly over-parameterized. Behind the wide usage of over-parameterized networks is the belief that, if the data are simple, then the trained network will be automatically equivalent to a simple predictor. Following this intuition, many existing works have studied different notions of "ranks" of neural networks and their relation to the rank of data. In this work, we study the rank of convolutional neural networks (CNNs) trained by gradient descent, with a specific focus on the robustness of the rank to image background noises. Specifically, we point out that, when adding background noises to images, the rank of the CNN trained with gradient descent is affected far less compared with the rank of the data. We support our claim with a theoretical case study, where we consider a particular data model to characterize low-rank clean images with added background noises. We prove that CNNs trained by gradient descent can learn the intrinsic dimension of clean images, despite the presence of relatively large background noises. We also conduct experiments on synthetic and real datasets to further validate our claim.

* 43 pages, 4 figures

Via

Access Paper or Ask Questions

On the Robustness of Transformers against Context Hijacking for Linear Classification

Feb 21, 2025

Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou

Figure 1 for On the Robustness of Transformers against Context Hijacking for Linear Classification

Figure 2 for On the Robustness of Transformers against Context Hijacking for Linear Classification

Figure 3 for On the Robustness of Transformers against Context Hijacking for Linear Classification

Figure 4 for On the Robustness of Transformers against Context Hijacking for Linear Classification

Abstract:Transformer-based Large Language Models (LLMs) have demonstrated powerful in-context learning capabilities. However, their predictions can be disrupted by factually correct context, a phenomenon known as context hijacking, revealing a significant robustness issue. To understand this phenomenon theoretically, we explore an in-context linear classification problem based on recent advances in linear transformers. In our setup, context tokens are designed as factually correct query-answer pairs, where the queries are similar to the final query but have opposite labels. Then, we develop a general theoretical analysis on the robustness of the linear transformers, which is formulated as a function of the model depth, training context lengths, and number of hijacking context tokens. A key finding is that a well-trained deeper transformer can achieve higher robustness, which aligns with empirical observations. We show that this improvement arises because deeper layers enable more fine-grained optimization steps, effectively mitigating interference from context hijacking. This is also well supported by our numerical experiments. Our findings provide theoretical insights into the benefits of deeper architectures and contribute to enhancing the understanding of transformer architectures.

Via

Access Paper or Ask Questions

Transformers versus the EM Algorithm in Multi-class Clustering

Feb 09, 2025

Yihan He, Hong-Yu Chen, Yuan Cao, Jianqing Fan, Han Liu

Abstract:LLMs demonstrate significant inference capacities in complicated machine learning tasks, using the Transformer model as its backbone. Motivated by the limited understanding of such models on the unsupervised learning problems, we study the learning guarantees of Transformers in performing multi-class clustering of the Gaussian Mixture Models. We develop a theory drawing strong connections between the Softmax Attention layers and the workflow of the EM algorithm on clustering the mixture of Gaussians. Our theory provides approximation bounds for the Expectation and Maximization steps by proving the universal approximation abilities of multivariate mappings by Softmax functions. In addition to the approximation guarantees, we also show that with a sufficient number of pre-training samples and an initialization, Transformers can achieve the minimax optimal rate for the problem considered. Our extensive simulations empirically verified our theory by revealing the strong learning capacities of Transformers even beyond the assumptions in the theory, shedding light on the powerful inference capacities of LLMs.

Via

Access Paper or Ask Questions

Transformers and Their Roles as Time Series Foundation Models

Feb 05, 2025

Dennis Wu, Yihan He, Yuan Cao, Jianqing Fan, Han Liu

Figure 1 for Transformers and Their Roles as Time Series Foundation Models

Figure 2 for Transformers and Their Roles as Time Series Foundation Models

Figure 3 for Transformers and Their Roles as Time Series Foundation Models

Figure 4 for Transformers and Their Roles as Time Series Foundation Models

Abstract:We give a comprehensive analysis of transformers as time series foundation models, focusing on their approximation and generalization capabilities. First, we demonstrate that there exist transformers that fit an autoregressive model on input univariate time series via gradient descent. We then analyze MOIRAI, a multivariate time series foundation model capable of handling an arbitrary number of covariates. We prove that it is capable of automatically fitting autoregressive models with an arbitrary number of covariates, offering insights into its design and empirical success. For generalization, we establish bounds for pretraining when the data satisfies Dobrushin's condition. Experiments support our theoretical findings, highlighting the efficacy of transformers as time series foundation models.

* 34 Pages, 2 Figures

Via

Access Paper or Ask Questions

Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Jan 05, 2025

Yuan Cao, Yihan He, Dennis Wu, Hong-Yu Chen, Jianqing Fan, Han Liu

Figure 1 for Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Figure 2 for Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Figure 3 for Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Figure 4 for Transformers Simulate MLE for Sequence Generation in Bayesian Networks

Abstract:Transformers have achieved significant success in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite these achievements, the theoretical understanding of transformers' capabilities remains limited. In this paper, we investigate the theoretical capabilities of transformers to autoregressively generate sequences in Bayesian networks based on in-context maximum likelihood estimation (MLE). Specifically, we consider a setting where a context is formed by a set of independent sequences generated according to a Bayesian network. We demonstrate that there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis highlights the potential of transformers to learn complex probabilistic models and contributes to a better understanding of large language models as a powerful class of sequence generators.

* 51 pages, 17 figures, 5 tables

Via

Access Paper or Ask Questions