Maksym Andriushchenko

Saarland University

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Oct 11, 2024

Does Refusal Training in LLMs Generalize to the Past Tense?

Jul 16, 2024

Improving Alignment and Robustness with Circuit Breakers

Jun 10, 2024

Improving Alignment and Robustness with Short Circuiting

Jun 06, 2024

Is In-Context Learning Sufficient for Instruction Following in LLMs?

May 30, 2024

Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Apr 22, 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Apr 02, 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Mar 28, 2024

Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning

Feb 07, 2024

Scaling Compute Is Not All You Need for Adversarial Robustness

Dec 20, 2023