Simon Lermen

Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

Dec 08, 2023
Simon Lermen, Ondřej Kvapil

Via arXiv

BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

Oct 31, 2023
Pranav Gade, Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Via arXiv

LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

Oct 31, 2023
Simon Lermen, Charlie Rogers-Smith, Jeffrey Ladish

Via arXiv

Evaluating Shutdown Avoidance of Language Models in Textual Scenarios

Jul 03, 2023
Teun van der Weij, Simon Lermen, Leon Lang

Via arXiv