Constantin Weisser

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025
Via arXiv

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

Nov 04, 2024
Via arXiv