Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Guanhong Tao

Backdoor Attack on Vision Language Models with Stealthy Semantic Manipulation

Jun 08, 2025

Zhiyuan Zhong, Zhen Sun, Yepang Liu, Xinlei He, Guanhong Tao

Abstract:Vision Language Models (VLMs) have shown remarkable performance, but are also vulnerable to backdoor attacks whereby the adversary can manipulate the model's outputs through hidden triggers. Prior attacks primarily rely on single-modality triggers, leaving the crucial cross-modal fusion nature of VLMs largely unexplored. Unlike prior work, we identify a novel attack surface that leverages cross-modal semantic mismatches as implicit triggers. Based on this insight, we propose BadSem (Backdoor Attack with Semantic Manipulation), a data poisoning attack that injects stealthy backdoors by deliberately misaligning image-text pairs during training. To perform the attack, we construct SIMBad, a dataset tailored for semantic manipulation involving color and object attributes. Extensive experiments across four widely used VLMs show that BadSem achieves over 98% average ASR, generalizes well to out-of-distribution datasets, and can transfer across poisoning modalities. Our detailed analysis using attention visualization shows that backdoored models focus on semantically sensitive regions under mismatched conditions while maintaining normal behavior on clean inputs. To mitigate the attack, we try two defense strategies based on system prompt and supervised fine-tuning but find that both of them fail to mitigate the semantic backdoor. Our findings highlight the urgent need to address semantic vulnerabilities in VLMs for their safer deployment.

Via

Access Paper or Ask Questions

A Comprehensive Study of LLM Secure Code Generation

Mar 18, 2025

Shih-Chieh Dai, Jun Xu, Guanhong Tao

Abstract:LLMs are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current evaluation schemes leave several concerns unaddressed. Specifically, most existing studies evaluate security and functional correctness separately, using different datasets. That is, they assess vulnerabilities using security-related code datasets while validating functionality with general code datasets. In addition, prior research primarily relies on a single static analyzer, CodeQL, to detect vulnerabilities in generated code, which limits the scope of security evaluation. In this work, we conduct a comprehensive study to systematically assess the improvements introduced by four state-of-the-art secure code generation techniques. Specifically, we apply both security inspection and functionality validation to the same generated code and evaluate these two aspects together. We also employ three popular static analyzers and two LLMs to identify potential vulnerabilities in the generated code. Our study reveals that existing techniques often compromise the functionality of generated code to enhance security. Their overall performance remains limited when evaluating security and functionality together. In fact, many techniques even degrade the performance of the base LLM. Our further inspection reveals that these techniques often either remove vulnerable lines of code entirely or generate ``garbage code'' that is unrelated to the intended task. Moreover, the commonly used static analyzer CodeQL fails to detect several vulnerabilities, further obscuring the actual security improvements achieved by existing techniques. Our study serves as a guideline for a more rigorous and comprehensive evaluation of secure code generation performance in future work.

Via

Access Paper or Ask Questions

How vulnerable is my policy? Adversarial attacks on modern behavior cloning policies

Feb 06, 2025

Basavasagar Patil, Akansha Kalra, Guanhong Tao, Daniel S. Brown

Figure 1 for How vulnerable is my policy? Adversarial attacks on modern behavior cloning policies

Figure 2 for How vulnerable is my policy? Adversarial attacks on modern behavior cloning policies

Figure 3 for How vulnerable is my policy? Adversarial attacks on modern behavior cloning policies

Figure 4 for How vulnerable is my policy? Adversarial attacks on modern behavior cloning policies

Abstract:Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to adversarial attacks remains underexplored. This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms, including Behavior Cloning (BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and VQ-Behavior Transformer (VQ-BET). We study the vulnerability of these methods to untargeted, targeted and universal adversarial perturbations. While explicit policies, such as BC, LSTM-GMM and VQ-BET can be attacked in the same manner as standard computer vision models, we find that attacks for implicit and denoising policy models are nuanced and require developing novel attack methods. Our experiments on several simulated robotic manipulation tasks reveal that most of the current methods are highly vulnerable to adversarial perturbations. We also show that these attacks are transferable across algorithms, architectures, and tasks, raising concerning security vulnerabilities with potentially a white-box threat model. In addition, we test the efficacy of a randomized smoothing, a widely used adversarial defense technique, and highlight its limitation in defending against attacks on complex and multi-modal action distribution common in complex control tasks. In summary, our findings highlight the vulnerabilities of modern BC algorithms, paving way for future work in addressing such limitations.

Via

Access Paper or Ask Questions

PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Jan 07, 2025

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, XiaoFeng Wang, Bo Li

Figure 1 for PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Figure 2 for PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Figure 3 for PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Figure 4 for PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Abstract:Text-to-image (T2I) models have been shown to be vulnerable to misuse, particularly in generating not-safe-for-work (NSFW) content, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without altering the inference efficiency or requiring proxy models. Extensive experiments across three datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard achieves 7.8 times faster than prior content moderation methods, surpassing eight state-of-the-art defenses with an optimal unsafe ratio down to 5.84%.

* 16 pages, 8 figures, 10 tables

Via

Access Paper or Ask Questions

Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Nov 22, 2024

Soumil Datta, Shih-Chieh Dai, Leo Yu, Guanhong Tao

Figure 1 for Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Figure 2 for Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Figure 3 for Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Figure 4 for Exploiting Watermark-Based Defense Mechanisms in Text-to-Image Diffusion Models for Unauthorized Data Usage

Abstract:Text-to-image diffusion models, such as Stable Diffusion, have shown exceptional potential in generating high-quality images. However, recent studies highlight concerns over the use of unauthorized data in training these models, which may lead to intellectual property infringement or privacy violations. A promising approach to mitigate these issues is to apply a watermark to images and subsequently check if generative models reproduce similar watermark features. In this paper, we examine the robustness of various watermark-based protection methods applied to text-to-image models. We observe that common image transformations are ineffective at removing the watermark effect. Therefore, we propose \tech{}, that leverages the diffusion process to conduct controlled image generation on the protected input, preserving the high-level features of the input while ignoring the low-level details utilized by watermarks. A small number of generated images are then used to fine-tune protected models. Our experiments on three datasets and 140 text-to-image diffusion models reveal that existing state-of-the-art protections are not robust against RATTAN.

Via

Access Paper or Ask Questions

UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Jul 16, 2024

Siyuan Cheng, Guangyu Shen, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Hanxi Guo, Shiqing Ma, Xiangyu Zhang

Figure 1 for UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Figure 2 for UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Figure 3 for UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Figure 4 for UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Abstract:Deep neural networks (DNNs) have demonstrated effectiveness in various fields. However, DNNs are vulnerable to backdoor attacks, which inject a unique pattern, called trigger, into the input to cause misclassification to an attack-chosen target label. While existing works have proposed various methods to mitigate backdoor effects in poisoned models, they tend to be less effective against recent advanced attacks. In this paper, we introduce a novel post-training defense technique UNIT that can effectively eliminate backdoor effects for a variety of attacks. In specific, UNIT approximates a unique and tight activation distribution for each neuron in the model. It then proactively dispels substantially large activation values that exceed the approximated boundaries. Our experimental results demonstrate that UNIT outperforms 7 popular defense methods against 14 existing backdoor attacks, including 2 advanced attacks, using only 5\% of clean training data. UNIT is also cost efficient. The code is accessible at https://github.com/Megum1/UNIT.

* The 18th European Conference on Computer Vision ECCV 2024

Via

Access Paper or Ask Questions

Threat Behavior Textual Search by Attention Graph Isomorphism

Apr 18, 2024

Chanwoo Bae, Guanhong Tao, Zhuo Zhang, Xiangyu Zhang

Abstract:Cyber attacks cause over \$1 trillion loss every year. An important task for cyber security analysts is attack forensics. It entails understanding malware behaviors and attack origins. However, existing automated or manual malware analysis can only disclose a subset of behaviors due to inherent difficulties (e.g., malware cloaking and obfuscation). As such, analysts often resort to text search techniques to identify existing malware reports based on the symptoms they observe, exploiting the fact that malware samples share a lot of similarity, especially those from the same origin. In this paper, we propose a novel malware behavior search technique that is based on graph isomorphism at the attention layers of Transformer models. We also compose a large dataset collected from various agencies to facilitate such research. Our technique outperforms state-of-the-art methods, such as those based on sentence embeddings and keywords by 6-14%. In the case study of 10 real-world malwares, our technique can correctly attribute 8 of them to their ground truth origins while using Google only works for 3 cases.

* Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

Via

Access Paper or Ask Questions

LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Mar 25, 2024

Siyuan Cheng, Guanhong Tao, Yingqi Liu, Guangyu Shen, Shengwei An, Shiwei Feng, Xiangzhe Xu, Kaiyuan Zhang, Shiqing Ma, Xiangyu Zhang

Figure 1 for LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Figure 2 for LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Figure 3 for LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Figure 4 for LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Abstract:Backdoor attack poses a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve high attack success rate across 4 datasets and 7 model structures, and effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS.

* IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Via

Access Paper or Ask Questions

Rapid Optimization for Jailbreaking LLMs via Subconscious Exploitation and Echopraxia

Feb 08, 2024

Guangyu Shen, Siyuan Cheng, Kaiyuan Zhang, Guanhong Tao, Shengwei An, Lu Yan, Zhuo Zhang, Shiqing Ma, Xiangyu Zhang

Abstract:Large Language Models (LLMs) have become prevalent across diverse sectors, transforming human life with their extraordinary reasoning and comprehension abilities. As they find increased use in sensitive tasks, safety concerns have gained widespread attention. Extensive efforts have been dedicated to aligning LLMs with human moral principles to ensure their safe deployment. Despite their potential, recent research indicates aligned LLMs are prone to specialized jailbreaking prompts that bypass safety measures to elicit violent and harmful content. The intrinsic discrete nature and substantial scale of contemporary LLMs pose significant challenges in automatically generating diverse, efficient, and potent jailbreaking prompts, representing a continuous obstacle. In this paper, we introduce RIPPLE (Rapid Optimization via Subconscious Exploitation and Echopraxia), a novel optimization-based method inspired by two psychological concepts: subconsciousness and echopraxia, which describe the processes of the mind that occur without conscious awareness and the involuntary mimicry of actions, respectively. Evaluations across 6 open-source LLMs and 4 commercial LLM APIs show RIPPLE achieves an average Attack Success Rate of 91.5\%, outperforming five current methods by up to 47.0\% with an 8x reduction in overhead. Furthermore, it displays significant transferability and stealth, successfully evading established detection mechanisms. The code of our work is available at \url{https://github.com/SolidShen/RIPPLE_official/tree/official}

Via

Access Paper or Ask Questions

Make Them Spill the Beans! Coercive Knowledge Extraction from LLMs

Dec 08, 2023

Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, Xiangyu Zhang

Figure 1 for Make Them Spill the Beans! Coercive Knowledge Extraction from LLMs

Figure 2 for Make Them Spill the Beans! Coercive Knowledge Extraction from LLMs

Figure 3 for Make Them Spill the Beans! Coercive Knowledge Extraction from LLMs

Figure 4 for Make Them Spill the Beans! Coercive Knowledge Extraction from LLMs

Abstract:Large Language Models (LLMs) are now widely used in various applications, making it crucial to align their ethical standards with human values. However, recent jail-breaking methods demonstrate that this alignment can be undermined using carefully constructed prompts. In our study, we reveal a new threat to LLM alignment when a bad actor has access to the model's output logits, a common feature in both open-source LLMs and many commercial LLM APIs (e.g., certain GPT models). It does not rely on crafting specific prompts. Instead, it exploits the fact that even when an LLM rejects a toxic request, a harmful response often hides deep in the output logits. By forcefully selecting lower-ranked output tokens during the auto-regressive generation process at a few critical output positions, we can compel the model to reveal these hidden responses. We term this process model interrogation. This approach differs from and outperforms jail-breaking methods, achieving 92% effectiveness compared to 62%, and is 10 to 20 times faster. The harmful content uncovered through our method is more relevant, complete, and clear. Additionally, it can complement jail-breaking strategies, with which results in further boosting attack performance. Our findings indicate that interrogation can extract toxic knowledge even from models specifically designed for coding tasks.

Via

Access Paper or Ask Questions