Picture for Ethan Perez

Ethan Perez

Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Add code
Aug 23, 2025
Viaarxiv icon

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Add code
Jul 15, 2025
Viaarxiv icon

Unsupervised Elicitation of Language Models

Add code
Jun 11, 2025
Figure 1 for Unsupervised Elicitation of Language Models
Figure 2 for Unsupervised Elicitation of Language Models
Figure 3 for Unsupervised Elicitation of Language Models
Figure 4 for Unsupervised Elicitation of Language Models
Viaarxiv icon

Reasoning Models Don't Always Say What They Think

Add code
May 08, 2025
Figure 1 for Reasoning Models Don't Always Say What They Think
Figure 2 for Reasoning Models Don't Always Say What They Think
Figure 3 for Reasoning Models Don't Always Say What They Think
Figure 4 for Reasoning Models Don't Always Say What They Think
Viaarxiv icon

Forecasting Rare Language Model Behaviors

Add code
Feb 24, 2025
Viaarxiv icon

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Add code
Jan 31, 2025
Figure 1 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Figure 2 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Figure 3 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Figure 4 for Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
Viaarxiv icon

Alignment faking in large language models

Add code
Dec 18, 2024
Viaarxiv icon

Best-of-N Jailbreaking

Add code
Dec 04, 2024
Figure 1 for Best-of-N Jailbreaking
Figure 2 for Best-of-N Jailbreaking
Figure 3 for Best-of-N Jailbreaking
Figure 4 for Best-of-N Jailbreaking
Viaarxiv icon

Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Add code
Dec 03, 2024
Figure 1 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Figure 2 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Figure 3 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Figure 4 for Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
Viaarxiv icon

Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats

Add code
Nov 26, 2024
Figure 1 for Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Figure 2 for Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Figure 3 for Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Figure 4 for Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
Viaarxiv icon