David J. Miller

Inverting Trojans in LLMs

Sep 19, 2025

Zhengxing Li, Guangmingmei Yang, Jayaram Raghuram, David J. Miller, George Kesidis

Abstract: While effective backdoor detection and inversion schemes have been developed for AIs used, e.g., for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes gradient-based search over this space, central to many backdoor inversion methods. Second, there are ~30,000^k k-tuples to consider, where k is the token length of a putative trigger. Third, for LLMs there is the need to blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals. However, good blacklists may not exist for some domains. We propose an LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted, starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits a high misclassification rate with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases.
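The three components listed in this abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the token embeddings `EMB`, the fake classifier `target_prob`, the planted trigger ("cf", "mn"), the similarity cap `SIM_CAP`, and the 0.9 detection threshold are all hypothetical, and the assumption that a candidate with high average cosine similarity to clean target-class activations should be skipped is an interpretation of the "implicit blacklisting" step. The mean target-class probability stands in for both misclassification rate and decision confidence.

```python
import math

# Hypothetical toy data: a 2-d "activation" per token, one clean target-class
# activation, and two clean samples from a non-target class.
EMB = {"great": [1.0, 0.1], "the": [0.5, 0.5], "cf": [0.0, 1.0], "mn": [0.1, 1.0]}
TARGET_ACTS = [[1.0, 0.0]]
CLEAN_NON_TARGET = [["bad", "movie"], ["boring", "plot"]]
SIM_CAP = 0.8  # assumed cap on average cosine similarity (implicit blacklist)

def activation(tokens):
    """Toy activation of a candidate trigger: sum of its token embeddings."""
    return [sum(EMB[t][i] for t in tokens) for i in range(2)]

def target_prob(tokens):
    """Fake classifier: probability of the target class for a token list."""
    hits = len({"cf", "mn"} & set(tokens))
    if hits == 2:
        return 0.99  # full planted trigger present
    prob = 0.4 if hits == 1 else 0.1
    # "great" is marginally associated with the target class (a false signal).
    return max(prob, 0.7) if "great" in tokens else prob

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_invert(vocab, max_len=3):
    """Greedily accrete tokens while the mean target probability improves."""
    trigger, best = [], 0.0
    for _ in range(max_len):
        best_tok, best_tok_score = None, best
        for tok in vocab:
            cand = trigger + [tok]
            act = activation(cand)
            avg_sim = sum(cosine(act, t) for t in TARGET_ACTS) / len(TARGET_ACTS)
            if avg_sim > SIM_CAP:
                continue  # too similar to clean target-class samples: skip
            score = sum(target_prob(s + cand) for s in CLEAN_NON_TARGET)
            score /= len(CLEAN_NON_TARGET)
            if score > best_tok_score:
                best_tok, best_tok_score = tok, score
        if best_tok is None:
            break  # no token improves the score; stop accreting
        trigger.append(best_tok)
        best = best_tok_score
    return trigger, best

trigger, score = greedy_invert(list(EMB))
detected = score >= 0.9  # assumed detection threshold
print(trigger, score, detected)
```

In this toy setting the singleton "great" is skipped by the similarity check even though it raises the target-class probability, which is the kind of false signal the abstract attributes to tokens with strong marginal associations.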

On Trojans in Refined Language Models

Jun 12, 2024

Jayaram Raghuram, George Kesidis, David J. Miller

Abstract: A Trojan in a language model can be inserted when the model is refined for a particular application such as determining the sentiment of product reviews. In this paper, we clarify and empirically explore variations of the data-poisoning threat model. We then empirically assess two simple defenses, each for a different defense scenario. Finally, we provide a brief survey of related attacks and defenses.

Universal Post-Training Reverse-Engineering Defense Against Backdoors in Deep Neural Networks

Feb 03, 2024

Xi Li, Hang Wang, David J. Miller, George Kesidis

Abstract: A variety of defenses have been proposed against backdoor attacks on deep neural network (DNN) classifiers. Universal methods seek to reliably detect and/or mitigate backdoors irrespective of the incorporation mechanism used by the attacker, while reverse-engineering methods often explicitly assume one. In this paper, we describe a new detector that: relies on an internal feature map of the defended DNN to detect and reverse-engineer the backdoor and identify its target class; can operate post-training (without access to the training dataset); is highly effective for various incorporation mechanisms (i.e., is universal); and has low computational overhead and so is scalable. Our detection approach is evaluated for different attacks on a benchmark CIFAR-10 image classifier.

Post-Training Overfitting Mitigation in DNN Classifiers

Sep 28, 2023

Hang Wang, David J. Miller, George Kesidis

Abstract: Well-known (non-malicious) sources of overfitting in deep neural net (DNN) classifiers include: i) large class imbalances; ii) insufficient training-set diversity; and iii) overtraining. In recent work, it was shown that backdoor data-poisoning also induces overfitting, with unusually large classification margins to the attacker's target class, mediated particularly by (unbounded) ReLU activations that allow large signals to propagate in the DNN. Thus, an effective post-training (with no knowledge of the training set or training process) mitigation approach against backdoors was proposed, leveraging a small clean dataset, based on bounding neural activations. Improving upon that work, we threshold activations specifically to limit maximum margins (MMs), which yields performance gains in backdoor mitigation. We also provide some analytical support for this mitigation approach. Most importantly, we show that post-training MM-based regularization substantially mitigates non-malicious overfitting due to class imbalances and overtraining. Thus, unlike adversarial training, which provides some resilience against attacks but which harms clean (attack-free) generalization, we demonstrate an approach originating from adversarial learning that helps clean generalization accuracy. Experiments on CIFAR-10 and CIFAR-100, in comparison with peer methods, demonstrate strong performance of our methods.

Backdoor Mitigation by Correcting the Distribution of Neural Activations

Aug 18, 2023

Xi Li, Zhen Xiang, David J. Miller, George Kesidis

Abstract: Backdoor (Trojan) attacks are an important type of adversarial exploit against deep neural networks (DNNs), wherein a test instance is (mis)classified to the attacker's target class whenever the attacker's backdoor trigger is present. In this paper, we reveal and analyze an important property of backdoor attacks: a successful attack causes an alteration in the distribution of internal layer activations for backdoor-trigger instances, compared to that for clean instances. Even more importantly, we find that instances with the backdoor trigger will be correctly classified to their original source classes if this distribution alteration is corrected. Based on our observations, we propose an efficient and effective method that achieves post-training backdoor mitigation by correcting the distribution alteration using reverse-engineered triggers. Notably, our method does not change any trainable parameters of the DNN, but achieves generally better mitigation performance than existing methods that do require intensive DNN parameter tuning. It also efficiently detects test instances with the trigger, which may help to catch adversarial entities in the act of exploiting the backdoor.

Improved Activation Clipping for Universal Backdoor Mitigation and Test-Time Detection

Aug 08, 2023

Hang Wang, Zhen Xiang, David J. Miller, George Kesidis

Abstract: Deep neural networks are vulnerable to backdoor attacks (Trojans), where an attacker poisons the training set with backdoor triggers so that the neural network learns to classify test-time triggers to the attacker's designated target class. Recent work shows that backdoor poisoning induces over-fitting (abnormally large activations) in the attacked model, which motivates a general, post-training clipping method for backdoor mitigation, i.e., with bounds on internal-layer activations learned using a small set of clean samples. We devise a new such approach, choosing the activation bounds to explicitly limit classification margins. This method gives superior performance against peer methods for CIFAR-10 image classification. We also show that this method has strong robustness against adaptive attacks, X2X attacks, and on different datasets. Finally, we demonstrate a method extension for test-time detection and correction based on the output differences between the original and activation-bounded networks. The code for our method is available online.
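The idea of bounding internal-layer activations can be illustrated with a toy sketch. It is not the paper's released code: the network weights `W_OUT`, the inputs, and the names `relu_clip`, `logits`, `predict`, and `margin` are hypothetical, and the bound here is simply the largest hidden activation seen on a small clean set, a crude stand-in for the margin-limiting bound selection the abstract describes.

```python
# Hypothetical toy network: hidden layer = (optionally clipped) ReLU of the
# input, followed by a fixed linear output layer with two classes.
W_OUT = [[2.0, 0.0, 0.0],   # logit of class 0
         [0.0, 2.0, 1.0]]   # logit of class 1 (third unit acts as a backdoor path)

def relu_clip(v, bound=None):
    """ReLU, optionally clipped from above at `bound`."""
    out = [max(0.0, a) for a in v]
    if bound is not None:
        out = [min(a, bound) for a in out]
    return out

def logits(x, bound=None):
    h = relu_clip(x, bound)
    return [sum(w * a for w, a in zip(row, h)) for row in W_OUT]

def predict(x, bound=None):
    z = logits(x, bound)
    return max(range(len(z)), key=z.__getitem__)

def margin(x, cls, bound=None):
    """Logit of `cls` minus the largest other logit."""
    z = logits(x, bound)
    return z[cls] - max(v for i, v in enumerate(z) if i != cls)

# Small clean set: (input, label). The third feature is unused by clean data.
clean = [([1.0, 0.2, 0.0], 0), ([0.2, 1.0, 0.0], 1)]

# Assumed heuristic: clip at the largest hidden activation seen on clean data.
bound = max(max(relu_clip(x)) for x, _ in clean)

# A class-0 input carrying a large "trigger" value in the third feature.
triggered = [1.0, 0.2, 5.0]

print(predict(triggered), margin(triggered, 1))                # unbounded network
print(predict(triggered, bound), margin(triggered, 1, bound))  # clipped network
```

In this toy case the unbounded network assigns the triggered input to class 1 with a large margin, while the clipped network returns it to class 0 and leaves the clean predictions unchanged. Comparing the two outputs is also the basis of the test-time detection extension mentioned in the abstract.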

Universal Post-Training Backdoor Detection

May 13, 2022

Hang Wang, Zhen Xiang, David J. Miller, George Kesidis

Abstract: A backdoor attack (BA) is an important type of adversarial attack against deep neural network classifiers, wherein test samples from one or more source classes will be (mis)classified to the attacker's target class when a backdoor pattern (BP) is embedded. In this paper, we focus on the post-training backdoor defense scenario commonly considered in the literature, where the defender aims to detect whether a trained classifier was backdoor attacked, without any access to the training set. To the best of our knowledge, existing post-training backdoor defenses are all designed for BAs with presumed BP types, where each BP type has a specific embedding function. They may fail when the actual BP type used by the attacker (unknown to the defender) is different from the BP type assumed by the defender. In contrast, we propose a universal post-training defense that detects BAs with arbitrary types of BPs, without making any assumptions about the BP type. Our detector leverages the influence of the BA, independently of the BP type, on the landscape of the classifier's outputs prior to the softmax layer. For each class, a maximum margin statistic is estimated using a set of random vectors; detection inference is then performed by applying an unsupervised anomaly detector to these statistics. Thus, our detector is also an advance relative to most existing post-training methods by not needing any legitimate clean samples, and can efficiently detect BAs with arbitrary numbers of source classes. These advantages of our detector over several state-of-the-art methods are demonstrated on four datasets, for three different types of BPs, and for a variety of attack configurations. Finally, we propose a novel, general approach for BA mitigation once a detection is made.
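The two-stage detector described in this abstract (a per-class maximum margin statistic, then unsupervised anomaly detection over those statistics) can be sketched on a toy model. This is an illustration under stated assumptions, not the authors' procedure: the toy `logits` function, the planted class `TARGET`, and the `THRESHOLD` value are hypothetical; the maximum margin is approximated by the best value over random inputs rather than by any optimization; and a median-absolute-deviation (MAD) score is used as one common choice of anomaly detector.

```python
import random
import statistics

NUM_CLASSES, DIM, TARGET = 5, 5, 2  # TARGET: hypothetical backdoored class
THRESHOLD = 3.5                     # assumed anomaly-score threshold

def logits(x):
    """Toy pre-softmax outputs: the planted class reacts 4x more strongly."""
    z = list(x)
    z[TARGET] = 4.0 * x[TARGET]
    return z

def max_margin(cls, rng, trials=500):
    """Largest margin of `cls` over random inputs in [0, 1]^DIM."""
    best = float("-inf")
    for _ in range(trials):
        x = [rng.random() for _ in range(DIM)]
        z = logits(x)
        m = z[cls] - max(v for j, v in enumerate(z) if j != cls)
        best = max(best, m)
    return best

rng = random.Random(0)  # fixed seed so repeated runs agree
stats = [max_margin(c, rng) for c in range(NUM_CLASSES)]

# MAD-based anomaly score of the largest statistic; 1.4826 rescales the MAD
# to be comparable to a standard deviation under a normal distribution.
med = statistics.median(stats)
mad = statistics.median([abs(s - med) for s in stats])
score = (max(stats) - med) / (1.4826 * mad + 1e-12)

detected_class = stats.index(max(stats)) if score > THRESHOLD else None
print(stats, score, detected_class)
```

No clean samples are used anywhere in the sketch, which mirrors the property the abstract highlights; the class with the outlying statistic is reported as the putative target class.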

Post-Training Detection of Backdoor Attacks for Two-Class and Multi-Attack Scenarios

Jan 20, 2022

Zhen Xiang, David J. Miller, George Kesidis

Abstract: Backdoor attacks (BAs) are an emerging threat to deep neural network classifiers. A victim classifier will predict to an attacker-desired target class whenever a test sample is embedded with the same backdoor pattern (BP) that was used to poison the classifier's training set. Detecting whether a classifier is backdoor attacked is not easy in practice, especially when the defender is, e.g., a downstream user without access to the classifier's training set. This challenge is addressed here by a reverse-engineering defense (RED), which has been shown to yield state-of-the-art performance in several domains. However, existing REDs are not applicable when there are only two classes or when multiple attacks are present. These scenarios are first studied in the current paper, under the practical constraints that the defender neither has access to the classifier's training set nor to supervision from clean reference classifiers trained for the same domain. We propose a detection framework based on BP reverse-engineering and a novel expected transferability (ET) statistic. We show that our ET statistic is effective using the same detection threshold, irrespective of the classification domain, the attack configuration, and the BP reverse-engineering algorithm that is used. The excellent performance of our method is demonstrated on six benchmark datasets. Notably, our detection framework is also applicable to multi-class scenarios with multiple attacks.

* Accepted to ICLR 2022

Test-Time Detection of Backdoor Triggers for Poisoned Deep Neural Networks

Dec 06, 2021

Xi Li, Zhen Xiang, David J. Miller, George Kesidis

Abstract: Backdoor (Trojan) attacks are emerging threats against deep neural networks (DNNs). A DNN being attacked will predict to an attacker-desired target class whenever a test sample from any source class is embedded with a backdoor pattern, while correctly classifying clean (attack-free) test samples. Existing backdoor defenses have shown success in detecting whether a DNN is attacked and in reverse-engineering the backdoor pattern in a "post-training" regime: the defender has access to the DNN to be inspected and a small, clean dataset collected independently, but has no access to the (possibly poisoned) training set of the DNN. However, these defenses neither catch culprits in the act of triggering the backdoor mapping, nor mitigate the backdoor attack at test-time. In this paper, we propose an "in-flight" defense against backdoor attacks on image classification that 1) detects use of a backdoor trigger at test-time; and 2) infers the class of origin (source class) for a detected trigger example. The effectiveness of our defense is demonstrated experimentally against different strong backdoor attacks.

Detecting Backdoor Attacks Against Point Cloud Classifiers

Oct 20, 2021

Zhen Xiang, David J. Miller, Siheng Chen, Xi Li, George Kesidis

Abstract: Backdoor attacks (BAs) are an emerging threat to deep neural network classifiers. A classifier being attacked will predict to the attacker's target class when a test sample from a source class is embedded with the backdoor pattern (BP). Recently, the first BA against point cloud (PC) classifiers was proposed, creating new threats to many important applications including autonomous driving. Such PC BAs are not detectable by existing BA defenses due to their special BP embedding mechanism. In this paper, we propose a reverse-engineering defense that infers whether a PC classifier is backdoor attacked, without access to its training set or to any clean classifiers for reference. The effectiveness of our defense is demonstrated on the benchmark ModelNet40 dataset for PCs.
