Xander Davies

An Example Safety Case for Safeguards Against Misuse

May 23, 2025

Fundamental Limitations in Defending LLM Finetuning APIs

Feb 20, 2025

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Oct 11, 2024

Circuit Breaking: Removing Model Behaviors with Targeted Ablation

Sep 12, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Discovering Variable Binding Circuitry with Desiderata

Jul 07, 2023

Sparse Distributed Memory is a Continual Learner

Mar 20, 2023

Unifying Grokking and Double Descent

Mar 10, 2023