Alert button
Picture for Daniel M. Ziegler

Daniel M. Ziegler

Alert button

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Add code
Bookmark button
Alert button
Jan 17, 2024
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Viaarxiv icon

Adversarial Training for High-Stakes Reliability

Add code
Bookmark button
Alert button
May 04, 2022
Daniel M. Ziegler, Seraphina Nix, Lawrence Chan, Tim Bauman, Peter Schmidt-Nielsen, Tao Lin, Adam Scherlis, Noa Nabeshima, Ben Weinstein-Raun, Daniel de Haas, Buck Shlegeris, Nate Thomas

Figure 1 for Adversarial Training for High-Stakes Reliability
Figure 2 for Adversarial Training for High-Stakes Reliability
Figure 3 for Adversarial Training for High-Stakes Reliability
Figure 4 for Adversarial Training for High-Stakes Reliability
Viaarxiv icon

Recursively Summarizing Books with Human Feedback

Add code
Bookmark button
Alert button
Sep 27, 2021
Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano

Figure 1 for Recursively Summarizing Books with Human Feedback
Figure 2 for Recursively Summarizing Books with Human Feedback
Figure 3 for Recursively Summarizing Books with Human Feedback
Figure 4 for Recursively Summarizing Books with Human Feedback
Viaarxiv icon

Scaling Laws for Autoregressive Generative Modeling

Add code
Bookmark button
Alert button
Nov 06, 2020
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish

Figure 1 for Scaling Laws for Autoregressive Generative Modeling
Figure 2 for Scaling Laws for Autoregressive Generative Modeling
Figure 3 for Scaling Laws for Autoregressive Generative Modeling
Figure 4 for Scaling Laws for Autoregressive Generative Modeling
Viaarxiv icon

Learning to summarize from human feedback

Add code
Bookmark button
Alert button
Sep 02, 2020
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

Figure 1 for Learning to summarize from human feedback
Figure 2 for Learning to summarize from human feedback
Figure 3 for Learning to summarize from human feedback
Figure 4 for Learning to summarize from human feedback
Viaarxiv icon

Language Models are Few-Shot Learners

Add code
Bookmark button
Alert button
Jun 05, 2020
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Figure 1 for Language Models are Few-Shot Learners
Figure 2 for Language Models are Few-Shot Learners
Figure 3 for Language Models are Few-Shot Learners
Figure 4 for Language Models are Few-Shot Learners
Viaarxiv icon