Federated learning (FL) is an important technique for learning models from decentralized data in a privacy-preserving way. Existing FL methods usually uniformly sample clients for local model learning in each round. However, different clients may have significantly different data sizes, and clients with more data do not get more opportunities to contribute to model training, which may lead to inferior performance. In this paper, instead of uniform client sampling, we propose a novel uniform data sampling strategy for federated learning (FedSampling), which can effectively improve the performance of federated learning, especially when the data size distribution across clients is highly imbalanced. In each federated learning round, the local data on each client is randomly sampled for local model learning according to a probability determined by the server's desired sample size and the total sample size on all available clients. Since the data size on each client is privacy-sensitive, we propose a privacy-preserving way to estimate the total sample size with a differential privacy guarantee. Experiments on four benchmark datasets show that FedSampling can effectively improve the performance of federated learning.
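Below is a minimal sketch of how such data uniform sampling could work, assuming a simple randomized-response report of local data sizes; the mechanism and all names here are illustrative assumptions, not the paper's exact protocol:

```python
import math
import random

def report_size(local_size, max_size, epsilon):
    # Hypothetical client-side report: with probability p the true size is
    # sent, otherwise a uniformly random size in [1, max_size], which gives
    # a (local) differential privacy guarantee on the reported data size.
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    if random.random() < p:
        return local_size
    return random.randint(1, max_size)

def estimate_total_size(reports, max_size, epsilon):
    # Debias the noisy reports: E[report] = p * true + (1 - p) * (max_size + 1) / 2.
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    bias = len(reports) * (1 - p) * (max_size + 1) / 2
    return max(1.0, (sum(reports) - bias) / p)

def sample_local_data(local_data, desired_total, estimated_total):
    # Every example is kept with the same probability regardless of which
    # client holds it, so the sampled set is (approximately) uniform over
    # the union of all clients' data.
    keep_prob = min(1.0, desired_total / estimated_total)
    return [x for x in local_data if random.random() < keep_prob]
```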
Vertical federated learning (VFL) is a privacy-preserving machine learning paradigm that can learn models from features distributed on different platforms. Since in real-world applications the data may contain bias on fairness-sensitive features (e.g., gender), VFL models may inherit bias from training data and become unfair to some user groups. However, existing fair ML methods usually rely on the centralized storage of fairness-sensitive features to achieve model fairness, which is usually inapplicable in federated scenarios. In this paper, we propose a fair vertical federated learning framework (FairVFL) to improve the fairness of VFL models. The core idea of FairVFL is to learn unified and fair representations of samples based on the decentralized feature fields in a privacy-preserving way. Specifically, each platform with fairness-insensitive features first learns local data representations from its local features. These local representations are then uploaded to a server and aggregated into a unified representation for the target task. To learn fair unified representations, we send them to each platform storing fairness-sensitive features and apply adversarial learning to remove the bias inherited from the biased data. Moreover, to protect user privacy, we further propose a contrastive adversarial learning method to remove private information from the unified representations on the server before sending them to the platforms keeping fairness-sensitive features. Experiments on two real-world datasets validate that our method can effectively improve model fairness while keeping user privacy well protected.
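As an illustration of the adversarial debiasing step, here is a minimal gradient-reversal sketch in PyTorch; the module names and the two-class setup are our assumptions, not FairVFL's exact architecture:

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    # Identity in the forward pass; flips the gradient sign in the backward
    # pass so the encoder learns to fool the sensitive-attribute adversary.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class FairnessAdversary(nn.Module):
    # Runs on a platform holding a fairness-sensitive feature: it tries to
    # predict that feature (e.g., gender) from the unified representation.
    def __init__(self, dim, num_classes):
        super().__init__()
        self.clf = nn.Linear(dim, num_classes)

    def forward(self, unified_repr, lam=1.0):
        return self.clf(GradientReversal.apply(unified_repr, lam))

# Usage: adding the adversary's cross-entropy loss to the task loss pushes
# the unified representation to drop the bias information.
adversary = FairnessAdversary(dim=64, num_classes=2)
unified = torch.randn(8, 64, requires_grad=True)
sensitive_labels = torch.randint(0, 2, (8,))
adv_loss = nn.functional.cross_entropy(adversary(unified), sensitive_labels)
adv_loss.backward()
```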
Federated learning (FL) enables multiple clients to collaboratively train models without sharing their local data, and has become an important privacy-preserving machine learning framework. However, classical FL faces serious security and robustness problems, e.g., malicious clients can poison model updates and at the same time claim large data quantities to amplify the impact of their model updates in the model aggregation. Existing FL defense methods all handle malicious model updates, but they either treat all claimed quantities as benign or simply ignore/truncate the quantities of all clients. The former is vulnerable to quantity-enhanced attacks, while the latter leads to sub-optimal performance, since the local data on different clients usually has significantly different sizes. In this paper, we propose a robust quantity-aware aggregation algorithm for federated learning, called FedRA, which performs aggregation with awareness of local data quantities while defending against quantity-enhanced attacks. More specifically, we propose a method that filters malicious clients by jointly considering the uploaded model updates and the claimed data quantities of different clients, and then performs quantity-aware weighted averaging on the model updates from the remaining clients. Moreover, as the number of malicious clients participating in federated learning may change dynamically across rounds, we also propose a malicious client number estimator to predict how many suspicious clients should be filtered in each round. Experiments on four public datasets demonstrate the effectiveness of our FedRA method in defending FL against quantity-enhanced attacks.
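A minimal sketch of quantity-aware robust aggregation is given below; the suspicion score is a simple stand-in we chose for illustration, not FedRA's exact filtering rule, and it omits the malicious client number estimator:

```python
import numpy as np

def quantity_aware_aggregate(updates, quantities, num_filter):
    # updates: (n_clients, dim) flattened model updates;
    # quantities: claimed local data sizes.
    # Jointly score clients by how far their update is from the
    # coordinate-wise median, scaled by the aggregation weight their claimed
    # quantity would earn, then drop the num_filter most suspicious clients.
    updates = np.asarray(updates, dtype=float)
    quantities = np.asarray(quantities, dtype=float)
    median = np.median(updates, axis=0)
    dist = np.linalg.norm(updates - median, axis=1)
    weight = quantities / quantities.sum()
    suspicion = dist * weight  # large deviation + large claimed quantity
    keep = np.argsort(suspicion)[: len(updates) - num_filter]
    w = quantities[keep] / quantities[keep].sum()
    # Quantity-aware weighted averaging over the remaining clients.
    return (updates[keep] * w[:, None]).sum(axis=0)
```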
Contrastive learning is widely used for recommendation model learning, where selecting representative and informative negative samples is critical. Existing methods usually focus on centralized data, where abundant and high-quality negative samples are easy to obtain. However, centralized storage and exploitation of user data may lead to privacy risks and concerns, while decentralized user data on a single client can be too sparse and biased for accurate contrastive learning. In this paper, we propose a federated contrastive learning method named FedCL for privacy-preserving recommendation, which can exploit high-quality negative samples for effective model training with privacy well protected. We first infer user embeddings from local user data through the local model on each client, and then perturb them with local differential privacy (LDP) before sending them to a central server for hard negative sampling. Since individual user embeddings contain heavy noise due to LDP, we propose to cluster the user embeddings on the server to mitigate the influence of the noise, and the cluster centroids are used to retrieve hard negative samples from the item pool. These hard negative samples are delivered to user clients and mixed with the observed negative samples from local data, as well as in-batch negatives constructed from positive samples, for federated model training. Extensive experiments on four benchmark datasets show that FedCL can empower various recommendation methods in a privacy-preserving way.
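The following sketch shows one way the client/server pipeline could look, assuming norm clipping plus Laplace noise as the LDP mechanism and k-means for clustering; these specific choices are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def ldp_perturb(user_emb, epsilon, clip=1.0):
    # Client side: clip the embedding norm, then add Laplace noise
    # calibrated to the clipped sensitivity (one common LDP recipe).
    emb = user_emb * min(1.0, clip / max(np.linalg.norm(user_emb), 1e-12))
    return emb + np.random.laplace(scale=2 * clip / epsilon, size=emb.shape)

def retrieve_hard_negatives(perturbed_embs, item_embs, n_clusters=4, k=10):
    # Server side: cluster the noisy user embeddings so the centroids
    # average out the LDP noise, then return the items closest to each
    # centroid as candidate hard negatives to send back to clients.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(perturbed_embs)
    scores = km.cluster_centers_ @ item_embs.T  # (n_clusters, n_items)
    return np.argsort(-scores, axis=1)[:, :k]   # top-k item ids per cluster
```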
User modeling is important for news recommendation. Existing methods usually first encode a user's clicked news into news embeddings independently and then aggregate them into a user embedding. However, the word-level interactions across different clicked news from the same user, which contain rich detailed clues for inferring user interest, are ignored by these methods. In this paper, we propose a fine-grained and fast user modeling framework (FUM) to model user interest from fine-grained behavior interactions for news recommendation. The core idea of FUM is to concatenate the clicked news into a long document and transform user modeling into a document modeling task with both intra-news and inter-news word-level interactions. Since the vanilla Transformer cannot efficiently handle long documents, we apply an efficient Transformer named Fastformer to model fine-grained behavior interactions. Extensive experiments on two real-world datasets verify that FUM can effectively and efficiently model user interest for news recommendation.
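As a rough sketch of the idea, the snippet below concatenates clicked news into one long word sequence and pools it with a linear-complexity additive attention layer, used here as a simplified stand-in for Fastformer:

```python
import torch
import torch.nn as nn

class AdditiveAttentionPool(nn.Module):
    # Linear-complexity additive attention in the spirit of Fastformer's
    # global attention; a simplified stand-in, not the full Fastformer.
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        alpha = torch.softmax(self.score(x), dim=1)
        return (alpha * x).sum(dim=1)           # (batch, dim)

# Concatenate the clicked news into one long document so that intra-news
# and inter-news word-level interactions are modeled in a single pass.
batch, n_news, n_words, dim = 2, 10, 30, 64
clicked_word_embs = torch.randn(batch, n_news, n_words, dim)
long_doc = clicked_word_embs.view(batch, n_news * n_words, dim)
user_emb = AdditiveAttentionPool(dim)(long_doc)  # (batch, dim)
```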
News recommendation aims to match news with personalized user interest. Existing methods for news recommendation usually model user interest from historical clicked news without considering the candidate news. However, each user usually has multiple interests, and it is difficult for these methods to accurately match a candidate news with a specific user interest. In this paper, we present a candidate-aware user modeling method for personalized news recommendation, which incorporates candidate news into user modeling for better matching between candidate news and user interest. We propose a candidate-aware self-attention network that uses the candidate news as a clue to model candidate-aware global user interest. In addition, we propose a candidate-aware CNN to incorporate candidate news into local behavior context modeling and learn candidate-aware short-term user interest. Besides, we use a candidate-aware attention network that aggregates previously clicked news weighted by their relevance to the candidate news to build a candidate-aware user representation. Experiments on real-world datasets show the effectiveness of our method in improving news recommendation performance.
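A minimal sketch of the candidate-aware attention component (the dimensions and the single projection layer are our simplifications):

```python
import torch
import torch.nn as nn

class CandidateAwareAttention(nn.Module):
    # Weights each clicked news by its relevance to the candidate news and
    # pools them into a candidate-aware user representation.
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clicked, candidate):
        # clicked: (batch, n_clicked, dim); candidate: (batch, dim)
        scores = torch.einsum('bnd,bd->bn', self.proj(clicked), candidate)
        alpha = torch.softmax(scores, dim=1)
        return torch.einsum('bn,bnd->bd', alpha, clicked)

clicked = torch.randn(4, 20, 64)     # embeddings of 20 clicked news
candidate = torch.randn(4, 64)       # candidate news embedding
user_repr = CandidateAwareAttention(64)(clicked, candidate)
```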
News recommendation aims to help users of online news platforms find their preferred news articles. Existing news recommendation methods usually learn models from historical user behaviors on news. However, these behaviors are usually biased with respect to news providers. Models trained on such biased user data may capture and even amplify the biases toward news providers, and are unfair to some minority news providers. In this paper, we propose a provider fairness-aware news recommendation framework (named ProFairRec), which can learn news recommendation models that are fair to different news providers from biased user data. The core idea of ProFairRec is to learn provider-fair news representations and provider-fair user representations to achieve provider fairness. To learn provider-fair representations from biased data, we employ provider-biased representations to inherit the provider bias from the data. Provider-fair and provider-biased news representations are learned from news content and provider IDs, respectively, and are further aggregated into fair and biased user representations based on user click history. All of these representations are used in model training, while only the fair representations are used for user-news matching to achieve fair news recommendation. Besides, we propose an adversarial learning task on news provider discrimination to prevent the provider-fair news representations from encoding provider bias. We also propose an orthogonal regularization on the provider-fair and provider-biased representations to further reduce the provider bias in the provider-fair representations. Moreover, ProFairRec is a general framework and can be applied to different news recommendation methods. Extensive experiments on a public dataset verify that our ProFairRec approach can effectively improve the provider fairness of many existing methods while maintaining their recommendation accuracy.
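For the orthogonal regularization, one simple instantiation is to penalize the cosine similarity between the fair and biased representations; this particular form is our assumption:

```python
import torch
import torch.nn.functional as F

def orthogonal_regularization(fair_repr, biased_repr):
    # Penalizes the absolute cosine similarity between provider-fair and
    # provider-biased representations so the fair branch carries as little
    # provider bias as possible.
    fair = F.normalize(fair_repr, dim=-1)
    biased = F.normalize(biased_repr, dim=-1)
    return (fair * biased).sum(dim=-1).abs().mean()

fair = torch.randn(8, 64, requires_grad=True)
biased = torch.randn(8, 64)
orth_loss = orthogonal_regularization(fair, biased)
orth_loss.backward()
```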
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model. Many existing methods learn and distill the student model on labeled data only. However, the teacher models are usually learned on the same labeled data, and their predictions are highly correlated with the groundtruth labels. Thus, they cannot provide sufficient knowledge complementary to the task labels for teaching the student. Distilling on unseen unlabeled data has the potential to enhance the knowledge transfer from the teachers to the student. In this paper, we propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data. Since different teachers may predict the same sample with different correctness, on labeled data we weight the predictions of different teachers according to their correctness. In addition, we weight the distillation loss by the overall prediction correctness of the teacher ensemble to distill high-quality knowledge. On unlabeled data, there is no groundtruth to evaluate prediction correctness. Fortunately, the disagreement among teachers is an indication of sample hardness, and we therefore weight the distillation loss by the teachers' disagreement to emphasize knowledge distillation on important samples. Extensive experiments on four datasets show the effectiveness of our proposed ensemble distillation method.
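The sketch below shows one way the two weighting schemes could be instantiated; the exact formulas (per-teacher correctness weights, ensemble-correctness scaling, and softmax-variance disagreement) are our assumptions, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def labeled_distill_loss(student_logits, teacher_logits_list, labels, T=2.0):
    # Weight each teacher by its per-sample correctness and scale the loss
    # by the ensemble's average correctness on the sample.
    correct = torch.stack([(t.argmax(-1) == labels).float()
                           for t in teacher_logits_list])  # (n_teachers, batch)
    w = correct + 1e-6
    w = w / w.sum(0)                                       # teacher weights per sample
    target = sum(wi.unsqueeze(-1) * F.softmax(t / T, -1)
                 for wi, t in zip(w, teacher_logits_list))
    quality = correct.mean(0)                              # ensemble correctness
    kl = F.kl_div(F.log_softmax(student_logits / T, -1), target,
                  reduction='none').sum(-1)
    return (quality * kl).mean() * T * T

def unlabeled_distill_loss(student_logits, teacher_logits_list, T=2.0):
    # Without groundtruth, use teacher disagreement (variance of their
    # softmax outputs) as a proxy for sample hardness to weight the loss.
    probs = torch.stack([F.softmax(t / T, -1) for t in teacher_logits_list])
    disagreement = probs.var(0).sum(-1)                    # per-sample
    kl = F.kl_div(F.log_softmax(student_logits / T, -1), probs.mean(0),
                  reduction='none').sum(-1)
    return (disagreement * kl).mean() * T * T
```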
Single-tower models are widely used in the ranking stage of news recommendation to accurately rank candidate news according to their fine-grained relatedness to the user interest indicated by user behaviors. However, these models can easily inherit the biases related to users' sensitive attributes (e.g., demographics) encoded in the training click data, and may generate recommendation results that are unfair to users with certain attributes. In this paper, we propose FairRank, a fairness-aware single-tower ranking framework for news recommendation. Since candidate news selection can be biased, we propose to use a shared candidate-aware user model to match user interest with an actually displayed candidate news and a random news, respectively, to learn a candidate-aware user embedding that reflects user interest in the candidate news and a candidate-invariant user embedding that indicates intrinsic user interest. We apply adversarial learning to both embeddings to reduce the biases brought by sensitive user attributes. In addition, we use a KL loss to regularize the attribute distributions inferred from the two user embeddings to be similar, which encourages the model to capture less candidate-aware bias information. Extensive experiments on two datasets show that FairRank can improve the fairness of various single-tower news ranking models with only minor performance losses.
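A minimal sketch of the KL regularizer between the attribute distributions predicted from the two user embeddings (the shared linear attribute head is our simplification):

```python
import torch
import torch.nn.functional as F

def attribute_kl_regularizer(cand_aware_emb, cand_invariant_emb, attr_head):
    # attr_head predicts the sensitive attribute from a user embedding.
    # Pulling the two predicted attribute distributions together discourages
    # the model from encoding extra candidate-aware bias information.
    p_log = F.log_softmax(attr_head(cand_aware_emb), dim=-1)
    q = F.softmax(attr_head(cand_invariant_emb), dim=-1)
    return F.kl_div(p_log, q, reduction='batchmean')

attr_head = torch.nn.Linear(64, 2)
kl = attribute_kl_regularizer(torch.randn(8, 64), torch.randn(8, 64), attr_head)
```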
Diversity is an important factor in providing high-quality personalized news recommendations. However, most existing news recommendation methods only aim to optimize recommendation accuracy while ignoring diversity. Reranking is a widely used post-processing technique for promoting the diversity of top recommendation results, but the underlying recommendation model is not perfect, and its errors may be propagated and amplified by such a cascaded recommendation pipeline. In addition, the recommendation model itself is not diversity-aware, making it difficult to achieve a good tradeoff between recommendation accuracy and diversity. In this paper, we propose a news recommendation approach named LeaDivRec, a fully learnable model that can generate diversity-aware news recommendations in an end-to-end manner. Different from existing news recommendation methods that are usually based on point-wise or pair-wise ranking, in LeaDivRec we propose a more effective list-wise news recommendation model. More specifically, we propose a permutation Transformer that considers the relatedness between candidate news and can meanwhile learn different representations for similar candidate news, which helps improve recommendation diversity. We also propose an effective list-wise training method to learn accurate ranking models. In addition, we propose a diversity-aware regularization method to further encourage the model to make controllable diversity-aware recommendations. Extensive experiments on two real-world datasets validate the effectiveness of our approach in balancing recommendation accuracy and diversity.
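As an illustration of diversity-aware regularization, the sketch below penalizes representation similarity among the top-ranked candidates, with a strength coefficient that makes the tradeoff controllable; this form is our assumption, not LeaDivRec's exact regularizer:

```python
import torch
import torch.nn.functional as F

def diversity_regularizer(candidate_reprs, scores, top_k=5, strength=0.1):
    # candidate_reprs: (n_candidates, dim); scores: (n_candidates,) ranking
    # scores. Penalize pairwise similarity among the top-ranked candidates;
    # `strength` controls the accuracy/diversity tradeoff.
    top = torch.topk(scores, k=top_k).indices
    reprs = F.normalize(candidate_reprs[top], dim=-1)
    sim = reprs @ reprs.T                       # (top_k, top_k) cosine sims
    off_diag = sim - torch.diag(torch.diag(sim))
    return strength * off_diag.abs().mean()

reprs = torch.randn(50, 64, requires_grad=True)
scores = torch.randn(50)
reg_loss = diversity_regularizer(reprs, scores)
reg_loss.backward()
```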