Alert button
Picture for Kshitij Sachan

Kshitij Sachan

Alert button

Debating with More Persuasive LLMs Leads to More Truthful Answers

Add code
Bookmark button
Alert button
Feb 15, 2024
Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

Viaarxiv icon

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Add code
Bookmark button
Alert button
Jan 17, 2024
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Viaarxiv icon

AI Control: Improving Safety Despite Intentional Subversion

Add code
Bookmark button
Alert button
Dec 14, 2023
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger

Viaarxiv icon

Polysemanticity and Capacity in Neural Networks

Add code
Bookmark button
Alert button
Oct 04, 2022
Adam Scherlis, Kshitij Sachan, Adam S. Jermyn, Joe Benton, Buck Shlegeris

Figure 1 for Polysemanticity and Capacity in Neural Networks
Figure 2 for Polysemanticity and Capacity in Neural Networks
Figure 3 for Polysemanticity and Capacity in Neural Networks
Figure 4 for Polysemanticity and Capacity in Neural Networks
Viaarxiv icon