Ryan Greenblatt

Stress-Testing Capability Elicitation With Password-Locked Models

May 29, 2024
Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Via arXiv

AI Control: Improving Safety Despite Intentional Subversion

Dec 14, 2023
Via arXiv

Preventing Language Models From Hiding Their Reasoning

Oct 31, 2023
Via arXiv

Benchmarks for Detecting Measurement Tampering

Sep 07, 2023
Via arXiv