Buck Shlegeris

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025
Via arXiv

The Singapore Consensus on Global AI Safety Research Priorities

Jun 25, 2025
Via arXiv

Ctrl-Z: Controlling AI Agents via Resampling

Apr 14, 2025
Via arXiv

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Apr 07, 2025
Via arXiv

A sketch of an AI control safety case

Jan 28, 2025
Via arXiv

Alignment faking in large language models

Dec 18, 2024
Via arXiv

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols

Dec 17, 2024
Via arXiv

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Nov 26, 2024
Via arXiv

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024
Via arXiv

Sabotage Evaluations for Frontier Models

Oct 28, 2024
Via arXiv