Dylan Hadfield-Menell

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Apr 03, 2024
Stephen Casper, Jieun Yun, Joonhyuk Baek, Yeseong Jung, Minhwan Kim, Kiwan Kwon, Saerom Park, Hayden Moore, David Shriver, Marissa Connor, Keltin Grimes, Angus Nicolson, Arush Tagade, Jessica Rumbelow, Hieu Minh Nguyen, Dylan Hadfield-Menell

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Mar 08, 2024
Stephen Casper, Lennart Schulze, Oam Patel, Dylan Hadfield-Menell

Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024
Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell

Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

Dec 13, 2023
Anand Siththaranjan, Cassidy Laidlaw, Dylan Hadfield-Menell

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Nov 27, 2023
Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, Jacob Andreas

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell

Measuring the Success of Diffusion Models at Imitating Human Artists

Jul 08, 2023
Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Jun 21, 2023
Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, Dylan Hadfield-Menell

Recommending to Strategic Users

Feb 13, 2023
Andreas Haupt, Dylan Hadfield-Menell, Chara Podimata
