Stephen Casper

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Apr 03, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Mar 08, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024

Rethinking Machine Unlearning for Large Language Models

Feb 15, 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Nov 27, 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Nov 06, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists

Jul 08, 2023