
Jerry Wei

Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

Apr 16, 2026

Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Mar 30, 2026

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Jan 08, 2026

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Jan 31, 2025

Best Practices and Lessons Learned on Synthetic Data for Language Models

Apr 11, 2024

Long-form factuality in large language models

Apr 03, 2024

FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation

Oct 05, 2023

Simple synthetic data reduces sycophancy in large language models

Aug 07, 2023

Symbol tuning improves in-context learning in language models

May 15, 2023

Larger language models do in-context learning differently

Mar 08, 2023