Dan Zhao

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

May 31, 2026

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao, Yong Jiang, Qing Li

Abstract: Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose Task-Aware Coactivation Grouping (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce Generic Expert Shared Replication (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, where the framework consistently outperforms strong baselines.
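
For readers unfamiliar with the load-balance metric quoted above, the following is a minimal sketch of the standard Jain fairness index, J = (Σx)² / (n · Σx²). The per-GPU load values are hypothetical and are not taken from the paper; only the formula itself is standard.

```python
def jain_fairness(loads):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).

    Returns 1.0 when every GPU carries the same load and 1/n when a
    single GPU carries everything.
    """
    n = len(loads)
    total = sum(loads)
    squares = sum(x * x for x in loads)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero load")
    return total * total / (n * squares)


# Hypothetical per-GPU token counts, for illustration only.
balanced = jain_fairness([100, 100, 100, 100])  # perfectly even load
skewed = jain_fairness([400, 0, 0, 0])          # one GPU does all the work
```

An index close to 1, such as the 0.9975 reported in the abstract, therefore indicates that load is spread almost evenly across devices.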

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense

Mar 02, 2026

Nancy Lau, Louis Sloot, Jyoutir Raj, Giuseppe Marco Boscardin, Evan Harris, Dylan Bowman, Mario Brajkovski, Jaideep Chawla, Dan Zhao

Abstract: Large language models (LLMs) are increasingly being deployed as software engineering agents that autonomously contribute to repositories. A major benefit these agents present is their ability to find and patch security vulnerabilities in the codebases they oversee. To estimate the capability of agents in this domain, we introduce ZeroDayBench, a benchmark where LLM agents find and patch 22 novel critical vulnerabilities in open-source codebases. We focus our efforts on three popular frontier agentic LLMs: GPT-5.2, Claude Sonnet 4.5, and Grok 4.1. We find that frontier LLMs are not yet capable of autonomously solving our tasks and observe some behavioral patterns that suggest how these models can be improved in the domain of proactive cyberdefense.

* Accepted to ICLR 2026 Workshop "Agents in the Wild"

StableQAT: Stable Quantization-Aware Training at Ultra-Low Bitwidths

Jan 27, 2026

Tianyi Chen, Sihan Chen, Xiaoyi Qu, Dan Zhao, Ruomei Yan, Jongwoo Ko, Luming Liang, Pashmina Cameron

Abstract: Quantization-aware training (QAT) is essential for deploying large models under strict memory and latency constraints, yet achieving stable and robust optimization at ultra-low bitwidths remains challenging. Common approaches based on the straight-through estimator (STE) or soft quantizers often suffer from gradient mismatch, instability, or high computational overhead. As such, we propose StableQAT, a unified and efficient QAT framework that stabilizes training in ultra-low-bit settings via a novel, lightweight, and theoretically grounded surrogate for backpropagation derived from a discrete Fourier analysis of the rounding operator. StableQAT strictly generalizes STE, which arises as a special case of our more expressive surrogate family, yielding smooth, bounded, and inexpensive gradients that improve QAT training performance and stability across various hyperparameter choices. In experiments, StableQAT exhibits stable and efficient QAT in 2-4 bit regimes, demonstrating improved training stability, robustness, and superior performance with negligible training overhead compared with standard QAT techniques. Our code is available at https://github.com/microsoft/StableQAT.
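
As background for the STE baseline mentioned above, here is a minimal sketch of the plain straight-through estimator on a single scalar weight. It does not reproduce StableQAT's Fourier-derived surrogate, which the abstract does not specify; the step size, learning rate, loss, and target value are all illustrative assumptions.

```python
def quantize(w, step=0.5):
    """Uniform quantizer: snap w to the nearest multiple of `step`."""
    return step * round(w / step)


def ste_grad(w, target, step=0.5):
    """Gradient of L = (quantize(w) - target)^2 with respect to w under STE.

    The true derivative of the rounding operator is zero almost everywhere,
    so STE replaces d quantize(w) / d w with 1 in the backward pass.
    """
    q = quantize(w, step)
    return 2.0 * (q - target)


# Toy training loop (hypothetical values): drive the quantized weight to 1.0.
w = 0.1
for _ in range(50):
    w -= 0.1 * ste_grad(w, target=1.0)
```

The latent weight `w` keeps moving even though the forward pass only ever sees quantized values; the gap between the forward function and the assumed backward slope is the gradient mismatch the abstract refers to.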

Large Language Model Scaling Laws for Neural Quantum States in Quantum Chemistry

Sep 16, 2025

Oliver Knitter, Dan Zhao, Stefan Leichenauer, Shravan Veerapaneni

Abstract: Scaling laws have been used to describe how large language model (LLM) performance scales with model size, training data size, or amount of computational resources. Motivated by the fact that neural quantum states (NQS) have increasingly adopted LLM-based components, we seek to understand NQS scaling laws, thereby shedding light on the scalability and optimal performance-resource trade-offs of NQS ansatze. In particular, we identify scaling laws that predict the performance, as measured by absolute error and V-score, for transformer-based NQS as a function of problem size in second-quantized quantum chemistry applications. By performing analogous compute-constrained optimization of the obtained parametric curves, we find that the relationship between model size and training time is highly dependent on loss metric and ansatz, and does not follow the approximately linear relationship found for language models.

* 16 pages, 5 figures, to be submitted for peer review

FEAT: Full-Dimensional Efficient Attention Transformer for Medical Video Generation

Jun 05, 2025

Huihan Wang, Zhiwen Yang, Hui Zhang, Dan Zhao, Bingzheng Wei, Yan Xu

Abstract: Synthesizing high-quality dynamic medical videos remains a significant challenge due to the need for modeling both spatial consistency and temporal dynamics. Existing Transformer-based approaches face critical limitations, including insufficient channel interactions, high computational complexity from self-attention, and coarse denoising guidance from timestep embeddings when handling varying noise levels. In this work, we propose FEAT, a full-dimensional efficient attention Transformer, which addresses these issues through three key innovations: (1) a unified paradigm with sequential spatial-temporal-channel attention mechanisms to capture global dependencies across all dimensions, (2) a linear-complexity design for attention mechanisms in each dimension, utilizing weighted key-value attention and global channel attention, and (3) a residual value guidance module that provides fine-grained pixel-level guidance to adapt to different noise levels. We evaluate FEAT on standard benchmarks and downstream tasks, demonstrating that FEAT-S, with only 23% of the parameters of the state-of-the-art model Endora, achieves comparable or even superior performance. Furthermore, FEAT-L surpasses all comparison methods across multiple datasets, showcasing both superior effectiveness and scalability. Code is available at https://github.com/Yaziwel/FEAT.

* This paper has been early accepted by MICCAI 2025

WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

May 26, 2025

Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen

Abstract: The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. Recent approaches such as Mixture-of-Experts (MoE) leverage selective activation but require specialized training, whereas training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to 2.94% in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
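
The difference between magnitude-only and weight-informed selection can be illustrated with a small sketch. It assumes the two signals are combined as a product, score_i = |x_i| · ‖W[:, i]‖₂, with top-k selection; the abstract only says they are considered jointly, so this scoring rule, the helper names, and the toy values are assumptions rather than the released implementation.

```python
import math


def column_norms(W):
    """l2-norm of each column of W (W is a list of rows)."""
    return [math.sqrt(sum(row[c] ** 2 for row in W)) for c in range(len(W[0]))]


def sparsify(x, W, k, weight_informed=True):
    """Keep the k highest-scoring entries of x and zero out the rest."""
    norms = column_norms(W) if weight_informed else [1.0] * len(x)
    scores = [abs(xi) * n for xi, n in zip(x, norms)]
    keep = set(sorted(range(len(x)), key=lambda i: scores[i], reverse=True)[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def l2_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Toy layer (hypothetical values): the third input is small but feeds
# a column with large weights.
W = [[0.1, 0.0, 3.0],
     [0.0, 0.1, 4.0]]
x = [1.0, 0.9, 0.5]

dense = matvec(W, x)
err_magnitude = l2_error(dense, matvec(W, sparsify(x, W, 1, weight_informed=False)))
err_weighted = l2_error(dense, matvec(W, sparsify(x, W, 1, weight_informed=True)))
```

In this toy case magnitude-only selection keeps the first input and discards the one that dominates the output, while the weight-informed score keeps the third input and yields a much smaller output error.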

FedOC: Optimizing Global Prototypes with Orthogonality Constraints for Enhancing Embeddings Separation in Heterogeneous Federated Learning

Feb 22, 2025

Fucheng Guo, Zeyu Luan, Qing Li, Dan Zhao, Yong Jiang

Abstract: Federated Learning (FL) has emerged as an essential framework for distributed machine learning, especially with its potential for privacy-preserving data processing. However, existing FL frameworks struggle to address statistical and model heterogeneity, which severely impacts model performance. While Heterogeneous Federated Learning (HtFL) introduces prototype-based strategies to address these challenges, current approaches face limitations in achieving optimal separation of prototypes. This paper presents FedOC, a novel HtFL algorithm designed to improve global prototype separation through orthogonality constraints, which not only increase intra-class prototype similarity but also significantly expand the inter-class angular separation. With the guidance of the global prototype, each client keeps its embeddings aligned with the corresponding prototype in the feature space, promoting directional independence that integrates seamlessly with the cross-entropy (CE) loss. We provide theoretical proof of FedOC's convergence under non-convex conditions. Extensive experiments demonstrate that FedOC outperforms seven state-of-the-art baselines, achieving up to a 10.12% accuracy improvement in both statistical and model heterogeneity settings.

Layer by Layer: Uncovering Hidden Representations in Language Models

Feb 04, 2025

Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

Abstract: From extracting features to generating text, the outputs of large language models (LLMs) typically rely on their final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that intermediate layers can encode even richer representations, often improving performance on a wide range of downstream tasks. To explain and quantify these hidden-layer properties, we propose a unified framework of representation quality metrics based on information theory, geometry, and invariance to input perturbations. Our framework highlights how each model layer balances information compression and signal preservation, revealing why mid-depth embeddings can exceed the last layer's performance. Through extensive experiments on 32 text-embedding tasks and comparisons across model architectures (transformers, state-space models) and domains (language, vision), we demonstrate that intermediate layers consistently provide stronger features. These findings challenge the standard focus on final-layer embeddings and open new directions for model analysis and optimization, including strategic use of mid-layer representations for more robust and accurate AI systems.

Synthetic User Behavior Sequence Generation with Large Language Models for Smart Homes

Jan 31, 2025

Zhiyao Xu, Dan Zhao, Qingsong Zou, Jingyu Xiao, Yong Jiang, Zhenhui Yuan, Qing Li

Abstract: In recent years, as smart home systems have become more widespread, security within these environments has become a growing concern. Currently, most smart home security solutions, such as anomaly detection and behavior prediction models, are trained using fixed, precollected datasets. However, the process of dataset collection is time-consuming and lacks the flexibility needed to adapt to the constantly evolving smart home environment. Additionally, the collection of personal data raises significant privacy concerns for users. Lately, large language models (LLMs) have emerged as a powerful tool for a wide range of tasks across diverse application domains, thanks to their strong capabilities in natural language processing, reasoning, and problem-solving. In this paper, we propose IoTGen, an LLM-based synthetic dataset generation framework, to enhance the generalization of downstream smart home intelligent models. By generating new synthetic datasets that reflect changes in the environment, smart home intelligent models can be retrained to overcome the limitations of fixed and outdated data, allowing them to better align with the dynamic nature of real-world home environments. Specifically, we first propose a Structure Pattern Perception Compression (SPPC) method tailored for IoT behavior data, which preserves the most informative content in the data while significantly reducing token consumption. Then, we propose a systematic approach to create prompts and implement data generation to automatically generate IoT synthetic data with normative and reasonable properties, assisting task models in adaptive training to improve generalization and real-world performance.

Retentive Neural Quantum States: Efficient Ansätze for Ab Initio Quantum Chemistry

Nov 06, 2024

Oliver Knitter, Dan Zhao, James Stokes, Martin Ganahl, Stefan Leichenauer, Shravan Veerapaneni

Abstract: Neural-network quantum states (NQS) have emerged as a powerful application of quantum-inspired deep learning for variational Monte Carlo methods, offering a competitive alternative to existing techniques for identifying ground states of quantum problems. A significant advancement toward improving the practical scalability of NQS has been the incorporation of autoregressive models, most recently transformers, as variational ansatze. Transformers learn sequence information with greater expressiveness than recurrent models, but at the cost of increased time complexity with respect to sequence length. We explore the use of the retentive network (RetNet), a recurrent alternative to transformers, as an ansatz for solving electronic ground state problems in ab initio quantum chemistry. Unlike transformers, RetNets overcome this time complexity bottleneck by processing data in parallel during training, and recurrently during inference. We give a simple computational cost estimate of the RetNet and directly compare it with similar estimates for transformers, establishing a clear threshold ratio of problem-to-model size past which the RetNet's time complexity outperforms that of the transformer. Though this efficiency can come at the expense of decreased expressiveness relative to the transformer, we overcome this gap through training strategies that leverage the autoregressive structure of the model, namely variational neural annealing. Our findings support the RetNet as a means of improving the time complexity of NQS without sacrificing accuracy. We provide further evidence that the ablative improvements of neural annealing extend beyond the RetNet architecture, suggesting it would serve as an effective general training strategy for autoregressive NQS.

* 16 pages, 1 figure, to be submitted for peer-reviewed publication
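
The parallel-versus-recurrent duality mentioned in the abstract can be sketched with a simplified single-head retention operator. This is an illustrative reduction under stated assumptions: one scalar decay `gamma`, no positional rotation, no normalization, and toy inputs; it is not the ansatz used in the paper.

```python
def retention_parallel(Q, K, V, gamma):
    """Training-style form: out_n = sum_{m<=n} gamma^(n-m) * (q_n . k_m) * v_m.

    Written with loops for clarity; it corresponds to a causally masked,
    decay-weighted (Q K^T) V product that can be computed for all positions at once.
    """
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):
            weight = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for c in range(d_v):
                row[c] += weight * V[m][c]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Inference-style form: S_n = gamma * S_{n-1} + k_n^T v_n, out_n = q_n S_n.

    The state S has fixed size d_k x d_v, so each new token costs the same
    regardless of how long the sequence already is.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for a in range(d_k):
            for c in range(d_v):
                S[a][c] = gamma * S[a][c] + k[a] * v[c]
        out.append([sum(q[a] * S[a][c] for a in range(d_k)) for c in range(d_v)])
    return out


# Toy inputs (hypothetical): 3 tokens, key and value dimension 2.
Q = [[1.0, 0.0], [0.5, 1.0], [0.2, 0.3]]
K = [[0.3, 0.7], [1.0, -1.0], [0.4, 0.1]]
V = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)
```

Both functions produce the same outputs up to floating-point rounding, which is why a model of this kind can be trained in the parallel form and sampled token by token in the recurrent form.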
