Ryan Greenblatt

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Jul 15, 2025

Alignment faking in large language models
Dec 18, 2024

Stress-Testing Capability Elicitation With Password-Locked Models
May 29, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Jan 17, 2024

AI Control: Improving Safety Despite Intentional Subversion
Dec 14, 2023

Preventing Language Models From Hiding Their Reasoning
Oct 31, 2023

Benchmarks for Detecting Measurement Tampering
Sep 07, 2023