Dylan Hadfield-Menell

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability
Apr 03, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Mar 08, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs
Feb 26, 2024

Black-Box Access is Insufficient for Rigorous AI Audits
Jan 25, 2024

Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF
Dec 13, 2023

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Nov 27, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Jul 27, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists
Jul 08, 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Jun 21, 2023

Recommending to Strategic Users
Feb 13, 2023