Severin Field

What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Nov 02, 2024
Via arXiv

Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language

Oct 03, 2024
Via arXiv

Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals

May 11, 2024
Via arXiv