Sören Mindermann

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Jan 17, 2024
Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez

Managing AI Risks in an Era of Rapid Progress

Oct 26, 2023
Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann

Specific versus General Principles for Constitutional AI

Oct 20, 2023
Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean, Catherine Olsson, Cassie Evraets, Eli Tran-Johnson, Esin Durmus, Ethan Perez, Jackson Kernion, Jamie Kerr, Kamal Ndousse, Karina Nguyen, Nelson Elhage, Newton Cheng, Nicholas Schiefer, Nova DasSarma, Oliver Rausch, Robin Larson, Shannon Yang, Shauna Kravec, Timothy Telleen-Lawton, Thomas I. Liao, Tom Henighan, Tristan Hume, Zac Hatfield-Dodds, Sören Mindermann, Nicholas Joseph, Sam McCandlish, Jared Kaplan

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sep 26, 2023
Lorenzo Pacchiardi, Alex J. Chan, Sören Mindermann, Ilan Moscovitz, Alexa Y. Pan, Yarin Gal, Owain Evans, Jan Brauner

Prioritized Training on Points that are Learnable, Worth Learning, and Not Yet Learnt

Jun 16, 2022
Sören Mindermann, Jan Brauner, Muhammed Razzak, Mrinank Sharma, Andreas Kirsch, Winnie Xu, Benedikt Höltgen, Aidan N. Gomez, Adrien Morisot, Sebastian Farquhar, Yarin Gal

Prioritized training on points that are learnable, worth learning, and not yet learned

Jul 06, 2021
Sören Mindermann, Muhammed Razzak, Winnie Xu, Andreas Kirsch, Mrinank Sharma, Adrien Morisot, Aidan N. Gomez, Sebastian Farquhar, Jan Brauner, Yarin Gal

Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding

Mar 08, 2021
Andrew Jesson, Sören Mindermann, Yarin Gal, Uri Shalit

On the robustness of effectiveness estimation of nonpharmaceutical interventions against COVID-19 transmission

Jul 27, 2020
Mrinank Sharma, Sören Mindermann, Jan Markus Brauner, Gavin Leech, Anna B. Stephenson, Tomáš Gavenčiak, Jan Kulveit, Yee Whye Teh, Leonid Chindelevitch, Yarin Gal

Identifying Causal Effect Inference Failure with Uncertainty-Aware Models

Jul 01, 2020
Andrew Jesson, Sören Mindermann, Uri Shalit, Yarin Gal
