Alexandra Souly

Prefill Awareness in Large Language Models

Jun 10, 2026

Evaluating whether AI models would sabotage AI safety research

Apr 27, 2026

UK AISI Alignment Evaluation Case-Study

Apr 01, 2026

When Do LLM Preferences Predict Downstream Behavior?

Feb 21, 2026

Breaking Agent Backbones: Evaluating the Security of Backbone LLMs in AI Agents

Oct 26, 2025

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Oct 08, 2025

Fundamental Limitations in Defending LLM Finetuning APIs

Feb 20, 2025

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Oct 11, 2024

A StrongREJECT for Empty Jailbreaks

Feb 15, 2024

Leading the Pack: N-player Opponent Shaping

Dec 26, 2023