Get our free extension to see links to code for papers anywhere online!

Chrome logo  Add to Chrome

Firefox logo Add to Firefox

Conditioning Predictive Models: Risks and Strategies


Feb 06, 2023
Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, Kate Woolverton

Add code


   Access Paper or Ask Questions

Discovering Language Model Behaviors with Model-Written Evaluations


Dec 19, 2022
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Ben Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan

Add code

* for associated data visualizations, see https://www.evals.anthropic.com/model-written/ for full datasets, see https://github.com/anthropics/evals 

   Access Paper or Ask Questions

Engineering Monosemanticity in Toy Models


Nov 16, 2022
Adam S. Jermyn, Nicholas Schiefer, Evan Hubinger

Add code

* 31 pages, 26 figures 

   Access Paper or Ask Questions

An overview of 11 proposals for building safe advanced AI


Dec 04, 2020
Evan Hubinger

Add code


   Access Paper or Ask Questions

Risks from Learned Optimization in Advanced Machine Learning Systems


Jun 11, 2019
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

Add code


   Access Paper or Ask Questions