Samuel R. Bowman

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Mar 08, 2024
James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin

Via arXiv

Debating with More Persuasive LLMs Leads to More Truthful Answers

Feb 15, 2024
Akbir Khan, John Hughes, Dan Valentine, Laura Ruis, Kshitij Sachan, Ansh Radhakrishnan, Edward Grefenstette, Samuel R. Bowman, Tim Rocktäschel, Ethan Perez

Via arXiv

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Via arXiv

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Nov 20, 2023
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman

Via arXiv

Debate Helps Supervise Unreliable Experts

Nov 15, 2023
Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, Samuel R. Bowman

Via arXiv

Towards Understanding Sycophancy in Language Models

Oct 27, 2023
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez

Via arXiv