Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Nov 15, 2023

Zhexin Zhang, Junxiao Yang, Pei Ke, Minlie Huang

Figure 1 for Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Figure 2 for Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Figure 3 for Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Figure 4 for Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Share this with someone who'll enjoy it:

Abstract:Large Language Models (LLMs) continue to advance in their capabilities, yet this progress is accompanied by a growing array of safety risks. While significant attention has been dedicated to exploiting weaknesses in LLMs through jailbreaking attacks, there remains a paucity of exploration into defending against these attacks. We point out a pivotal factor contributing to the success of jailbreaks: the inherent conflict between the goals of being helpful and ensuring safety. To counter jailbreaking attacks, we propose to integrate goal prioritization at both training and inference stages. Implementing goal prioritization during inference substantially diminishes the Attack Success Rate (ASR) of jailbreaking attacks, reducing it from 66.4% to 2.0% for ChatGPT and from 68.2% to 19.4% for Vicuna-33B, without compromising general performance. Furthermore, integrating the concept of goal prioritization into the training phase reduces the ASR from 71.0% to 6.6% for LLama2-13B. Remarkably, even in scenarios where no jailbreaking samples are included during training, our approach slashes the ASR by half, decreasing it from 71.0% to 34.0%. Additionally, our findings reveal that while stronger LLMs face greater safety risks, they also possess a greater capacity to be steered towards defending against such attacks. We hope our work could contribute to the comprehension of jailbreaking attacks and defenses, and shed light on the relationship between LLMs' capability and safety. Our code will be available at \url{https://github.com/thu-coai/JailbreakDefense_GoalPriority}.

* 14 pages

View paper on

Share this with someone who'll enjoy it:

Title:Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization

Paper and Code