Andy Arditi

Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs

Dec 10, 2025

Refusal in Language Models Is Mediated by a Single Direction

Jun 17, 2024