Sören Mindermann

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

Specific versus General Principles for Constitutional AI

Oct 20, 2023
Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova DasSarma, Oliver Rausch, Robin Larson, Shannon Yang, Shauna Kravec, Timothy Telleen-Lawton, Thomas I. Liao, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, Sören Mindermann, Nicholas Joseph, Sam McCandlish, Jared Kaplan

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sep 26, 2023
Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

Jun 16, 2022
Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal

Prioritized training on points that are learnable, worth learning, and not yet learned

Jul 06, 2021
Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, Adrien Morisot, Aidan N. Gomez, Sebastian Farquhar, Jan Brauner, Yarin Gal

Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

Mar 08, 2021
Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit

On the robustness of effectiveness estimation of nonpharmaceutical interventions against COVID-19 transmission

Jul 27, 2020
Mrinank Sharma, Sören Mindermann, Jan Markus Brauner, Gavin Leech, Anna B. Stephenson, Tomáš Gavenčiak, Jan Kulveit, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal

Identifying Causal Effect Inference Failure with Uncertainty-Aware Models

Jul 01, 2020
Andrew Jesson, Sören Mindermann, Uri Shalit, Yarin Gal
