
Dylan Hadfield-Menell

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024

Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF

Dec 13, 2023

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Nov 27, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists

Jul 08, 2023

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Jun 21, 2023

Recommending to Strategic Users

Feb 13, 2023

Benchmarking Interpretability Tools for Deep Neural Networks

Feb 08, 2023

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

Nov 22, 2022

White-Box Adversarial Policies in Deep Reinforcement Learning

Sep 05, 2022