Fine-tuning LLMs on datasets containing stealthy backdoors planted by their publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probabilities of poisoned classification models or rely on rewriting models to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Efficiently eliminating stealthy poisoned samples for LLMs therefore remains an urgent problem. We observe that after applying TF-IDF clustering to sample responses, there are notable differences in the intra-class distances of clean and poisoned samples: poisoned samples tend to cluster tightly because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare each sample's response with the reference model's output and mark the sample as suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples and identify the truly poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in both backdoor detection and model performance. Further analysis with different reference models also confirms the effectiveness of our Reference-Filtration.
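The sketch below illustrates the two-stage idea described above, under stated assumptions: a simple token-overlap score stands in for the discrepancy measure between a sample's response and the reference model's output, and KMeans stands in for the clustering step. Names such as `responses`, `reference_outputs`, and the thresholds are illustrative, not taken from the paper.

```python
# Minimal sketch of a Reference-Filtration + TF-IDF clustering pipeline.
# Assumptions: Jaccard token overlap approximates the reference discrepancy,
# and the tightest KMeans cluster is treated as the poisoned set.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def detect_poisoned(responses, reference_outputs,
                    sim_threshold=0.3, n_clusters=2):
    """Return indices of samples flagged as poisoned.

    Stage 1 (filtration): a sample is suspicious when its response diverges
    strongly from the reference model's output.
    Stage 2 (clustering): suspicious responses are TF-IDF clustered; the
    cluster with the smallest mean intra-cluster distance is taken to be
    poisoned, since malicious outputs tend to be near-identical.
    """
    # Stage 1: keep only samples whose response disagrees with the reference.
    suspicious = [i for i, (r, ref) in enumerate(zip(responses, reference_outputs))
                  if token_overlap(r, ref) < sim_threshold]
    if len(suspicious) < n_clusters:
        return []

    # Stage 2: TF-IDF vectors + KMeans over the suspicious responses only.
    tfidf = TfidfVectorizer().fit_transform([responses[i] for i in suspicious])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(tfidf)

    # Distance of each suspicious sample to its own cluster centroid.
    dist_to_own = km.transform(tfidf)[np.arange(tfidf.shape[0]), km.labels_]
    intra = [dist_to_own[km.labels_ == c].mean() for c in range(n_clusters)]

    poisoned_cluster = int(np.argmin(intra))  # the tightest cluster
    return [suspicious[j] for j in range(len(suspicious))
            if km.labels_[j] == poisoned_cluster]
```

In practice, the filtration stage keeps the expensive clustering restricted to a small suspicious subset, which is what makes the two-stage design efficient compared with clustering the entire training set.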