Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Daniel Kang

MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Feb 25, 2025

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-wei Chang, Daniel Kang, Heng Ji

Figure 1 for MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Figure 2 for MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Figure 3 for MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Figure 4 for MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Abstract:Multimodal large language models (MLLMs) equipped with Retrieval Augmented Generation (RAG) leverage both their rich parametric knowledge and the dynamic, external knowledge to excel in tasks such as Question Answering. While RAG enhances MLLMs by grounding responses in query-relevant external knowledge, this reliance poses a critical yet underexplored safety risk: knowledge poisoning attacks, where misinformation or irrelevant knowledge is intentionally injected into external knowledge bases to manipulate model outputs to be incorrect and even harmful. To expose such vulnerabilities in multimodal RAG, we propose MM-PoisonRAG, a novel knowledge poisoning attack framework with two attack strategies: Localized Poisoning Attack (LPA), which injects query-specific misinformation in both text and images for targeted manipulation, and Globalized Poisoning Attack (GPA) to provide false guidance during MLLM generation to elicit nonsensical responses across all queries. We evaluate our attacks across multiple tasks, models, and access settings, demonstrating that LPA successfully manipulates the MLLM to generate attacker-controlled answers, with a success rate of up to 56% on MultiModalQA. Moreover, GPA completely disrupts model generation to 0% accuracy with just a single irrelevant knowledge injection. Our results highlight the urgent need for robust defenses against knowledge poisoning to safeguard multimodal RAG frameworks.

* Code is available at https://github.com/HyeonjeongHa/MM-PoisonRAG

Via

Access Paper or Ask Questions

Voice-Enabled AI Agents can Perform Common Scams

Oct 21, 2024

Richard Fang, Dylan Bowman, Daniel Kang

Abstract:Recent advances in multi-modal, highly capable LLMs have enabled voice-enabled AI agents. These agents are enabling new applications, such as voice-enabled autonomous customer service. However, with all AI capabilities, these new capabilities have the potential for dual use. In this work, we show that voice-enabled AI agents can perform the actions necessary to perform common scams. To do so, we select a list of common scams collected by the government and construct voice-enabled agents with directions to perform these scams. We conduct experiments on our voice-enabled agents and show that they can indeed perform the actions necessary to autonomously perform such scams. Our results raise questions around the widespread deployment of voice-enabled AI agents.

Via

Access Paper or Ask Questions

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Jun 02, 2024

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang

Figure 1 for Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Figure 2 for Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Figure 3 for Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Figure 4 for Teams of LLM Agents can Exploit Zero-Day Vulnerabilities

Abstract:LLM agents have become increasingly sophisticated, especially in the realm of cybersecurity. Researchers have shown that LLM agents can exploit real-world vulnerabilities when given a description of the vulnerability and toy capture-the-flag problems. However, these agents still perform poorly on real-world vulnerabilities that are unknown to the agent ahead of time (zero-day vulnerabilities). In this work, we show that teams of LLM agents can exploit real-world, zero-day vulnerabilities. Prior agents struggle with exploring many different vulnerabilities and long-range planning when used alone. To resolve this, we introduce HPTSA, a system of agents with a planning agent that can launch subagents. The planning agent explores the system and determines which subagents to call, resolving long-term planning issues when trying different vulnerabilities. We construct a benchmark of 15 real-world vulnerabilities and show that our team of agents improve over prior work by up to 4.5$\times$.

Via

Access Paper or Ask Questions

LLM Agents can Autonomously Exploit One-day Vulnerabilities

Apr 11, 2024

Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang

Abstract:LLMs have becoming increasingly powerful, both in their benign and malicious uses. With the increase in capabilities, researchers have been increasingly interested in their ability to exploit cybersecurity vulnerabilities. In particular, recent work has conducted preliminary studies on the ability of LLM agents to autonomously hack websites. However, these studies are limited to simple vulnerabilities. In this work, we show that LLM agents can autonomously exploit one-day vulnerabilities in real-world systems. To show this, we collected a dataset of 15 one-day vulnerabilities that include ones categorized as critical severity in the CVE description. When given the CVE description, GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test (GPT-3.5, open-source LLMs) and open-source vulnerability scanners (ZAP and Metasploit). Fortunately, our GPT-4 agent requires the CVE description for high performance: without the description, GPT-4 can exploit only 7% of the vulnerabilities. Our findings raise questions around the widespread deployment of highly capable LLM agents.

Via

Access Paper or Ask Questions

Trustless Audits without Revealing Data or Models

Apr 06, 2024

Suppakit Waiwitlikhit, Ion Stoica, Yi Sun, Tatsunori Hashimoto, Daniel Kang

Abstract:There is an increasing conflict between business incentives to hide models and data as trade secrets, and the societal need for algorithmic transparency. For example, a rightsholder wishing to know whether their copyrighted works have been used during training must convince the model provider to allow a third party to audit the model and data. Finding a mutually agreeable third party is difficult, and the associated costs often make this approach impractical. In this work, we show that it is possible to simultaneously allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties. We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights, alongside a zero-knowledge proof (ZKP) certifying that published commitments are derived from training the model. Model providers can then respond to audit requests by privately computing any function F of the dataset (or model) and releasing the output of F alongside another ZKP certifying the correct execution of F. To enable ZkAudit, we develop new methods of computing ZKPs for SGD on modern neural nets for simple recommender systems and image classification models capable of high accuracies on ImageNet. Empirically, we show it is possible to provide trustless audits of DNNs, including copyright, censorship, and counterfactual audits with little to no loss in accuracy.

Via

Access Paper or Ask Questions

A Safe Harbor for AI Evaluation and Red Teaming

Mar 07, 2024

Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha(+13 more)

Figure 1 for A Safe Harbor for AI Evaluation and Red Teaming

Figure 2 for A Safe Harbor for AI Evaluation and Red Teaming

Figure 3 for A Safe Harbor for AI Evaluation and Red Teaming

Figure 4 for A Safe Harbor for AI Evaluation and Red Teaming

Abstract:Independent evaluation and red teaming are critical for identifying the risks posed by generative AI systems. However, the terms of service and enforcement strategies used by prominent AI companies to deter model misuse have disincentives on good faith safety evaluations. This causes some researchers to fear that conducting such research or releasing their findings will result in account suspensions or legal reprisal. Although some companies offer researcher access programs, they are an inadequate substitute for independent research access, as they have limited community representation, receive inadequate funding, and lack independence from corporate incentives. We propose that major AI developers commit to providing a legal and technical safe harbor, indemnifying public interest safety research and protecting it from the threat of account suspensions or legal reprisal. These proposals emerged from our collective experience conducting safety, privacy, and trustworthiness research on generative AI systems, where norms and incentives could be better aligned with public interests, without exacerbating model misuse. We believe these commitments are a necessary step towards more inclusive and unimpeded community efforts to tackle the risks of generative AI.

Via

Access Paper or Ask Questions

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Mar 05, 2024

Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang

Figure 1 for InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Figure 2 for InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Figure 3 for InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Figure 4 for InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Abstract:Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.

* 26 pages, 5 figures, 7 tables

Via

Access Paper or Ask Questions

LLM Agents can Autonomously Hack Websites

Feb 16, 2024

Richard Fang, Rohan Bindu, Akul Gupta, Qiusi Zhan, Daniel Kang

Figure 1 for LLM Agents can Autonomously Hack Websites

Figure 2 for LLM Agents can Autonomously Hack Websites

Figure 3 for LLM Agents can Autonomously Hack Websites

Figure 4 for LLM Agents can Autonomously Hack Websites

Abstract:In recent years, large language models (LLMs) have become increasingly capable and can now interact with tools (i.e., call functions), read documents, and recursively call themselves. As a result, these LLMs can now function autonomously as agents. With the rise in capabilities of these agents, recent work has speculated on how LLM agents would affect cybersecurity. However, not much is known about the offensive capabilities of LLM agents. In this work, we show that LLM agents can autonomously hack websites, performing tasks as complex as blind database schema extraction and SQL injections without human feedback. Importantly, the agent does not need to know the vulnerability beforehand. This capability is uniquely enabled by frontier models that are highly capable of tool use and leveraging extended context. Namely, we show that GPT-4 is capable of such hacks, but existing open-source models are not. Finally, we show that GPT-4 is capable of autonomously finding vulnerabilities in websites in the wild. Our findings raise questions about the widespread deployment of LLMs.

Via

Access Paper or Ask Questions

Removing RLHF Protections in GPT-4 via Fine-Tuning

Nov 10, 2023

Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang

Abstract:As large language models (LLMs) have increased in their capabilities, so does their potential for dual use. To reduce harmful outputs, produces and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections on LLMs.

Via

Access Paper or Ask Questions

Identifying and Mitigating the Security Risks of Generative AI

Aug 28, 2023

Clark Barrett, Brad Boyd, Ellie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi(+13 more)

Figure 1 for Identifying and Mitigating the Security Risks of Generative AI

Abstract:Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.

Via

Access Paper or Ask Questions