Ethan Perez

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?

Jan 30, 2026
Via arXiv

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Jan 08, 2026
Via arXiv

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Aug 23, 2025
Via arXiv

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025
Via arXiv

Unsupervised Elicitation of Language Models

Jun 11, 2025
Via arXiv

Reasoning Models Don't Always Say What They Think

May 08, 2025
Via arXiv

Forecasting Rare Language Model Behaviors

Feb 24, 2025
Via arXiv

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025
Via arXiv

Alignment faking in large language models

Dec 18, 2024
Via arXiv

Best-of-N Jailbreaking

Dec 04, 2024
Via arXiv