Katarzyna Kapusta

Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models

Mar 08, 2025

DiffGuard: Text-Based Safety Checker for Diffusion Models

Nov 25, 2024

When Federated Learning meets Watermarking: A Comprehensive Overview of Techniques for Intellectual Property Protection

Aug 07, 2023