
Peter Henderson

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Mar 09, 2025

Breaking Down Bias: On The Limits of Generalizable Pruning Strategies

Feb 11, 2025

On Evaluating the Durability of Safeguards for Open-Weight LLMs

Dec 10, 2024

The Mirage of Artificial Intelligence Terms of Use Restrictions

Dec 10, 2024

An Adversarial Perspective on Machine Unlearning for AI Safety

Sep 26, 2024

Evaluating Copyright Takedown Methods for Language Models

Jun 26, 2024

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Jun 26, 2024

Fantastic Copyrighted Beasts and How (Not) to Generate Them

Jun 20, 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

Jun 20, 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Jun 10, 2024