Buck Shlegeris

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Jun 17, 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024

AI Control: Improving Safety Despite Intentional Subversion

Dec 14, 2023

Generalized Wick Decompositions

Oct 10, 2023

Benchmarks for Detecting Measurement Tampering

Sep 07, 2023

Language models are better than humans at next-token prediction

Dec 21, 2022

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Nov 01, 2022

Polysemanticity and Capacity in Neural Networks

Oct 04, 2022

Adversarial Training for High-Stakes Reliability

May 04, 2022

Supervising strong learners by amplifying weak experts

Oct 19, 2018