Stephen Casper

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Apr 15, 2024

The SaTML '24 CNN Interpretability Competition: New Innovations for Concept-Level Interpretability

Apr 03, 2024

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Mar 08, 2024

Eight Methods to Evaluate Robust Unlearning in LLMs

Feb 26, 2024

Rethinking Machine Unlearning for Large Language Models

Feb 15, 2024

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Nov 27, 2023

Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation

Nov 06, 2023

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Jul 27, 2023

Measuring the Success of Diffusion Models at Imitating Human Artists

Jul 08, 2023