
Buck Shlegeris

Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols (Sep 12, 2024)

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (Jun 17, 2024)

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Jan 17, 2024)

AI Control: Improving Safety Despite Intentional Subversion (Dec 14, 2023)

Generalized Wick Decompositions (Oct 10, 2023)

Benchmarks for Detecting Measurement Tampering (Sep 07, 2023)

Language models are better than humans at next-token prediction (Dec 21, 2022)

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Nov 01, 2022)

Polysemanticity and Capacity in Neural Networks (Oct 04, 2022)

Adversarial Training for High-Stakes Reliability (May 04, 2022)