Xiangyu Qi

Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

Jun 25, 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Jun 20, 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Jun 10, 2024

AI Risk Management Should Incorporate Both Safety and Security

May 29, 2024

Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment

Feb 27, 2024

Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications

Feb 07, 2024

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Oct 05, 2023

BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection

Aug 23, 2023

Visual Adversarial Examples Jailbreak Large Language Models

Jun 22, 2023

Uncovering Adversarial Risks of Test-Time Adaptation

Feb 04, 2023