Buck Shlegeris

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Jul 15, 2025

The Singapore Consensus on Global AI Safety Research Priorities
Jun 25, 2025

Ctrl-Z: Controlling AI Agents via Resampling
Apr 14, 2025

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Apr 07, 2025

A sketch of an AI control safety case
Jan 28, 2025

Alignment faking in large language models
Dec 18, 2024

Subversion Strategy Eval: Evaluating AI's stateless strategic capabilities against control protocols
Dec 17, 2024

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Nov 26, 2024

Towards evaluations-based safety cases for AI scheming
Nov 07, 2024

Sabotage Evaluations for Frontier Models
Oct 28, 2024