
Julian Michael

The Singapore Consensus on Global AI Safety Research Priorities

Jun 25, 2025

FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

Jun 17, 2025

International AI Safety Report

Jan 29, 2025

Alignment faking in large language models

Dec 18, 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Nov 12, 2024

Training Language Models to Win Debates with Self-Play Improves Judge Accuracy

Sep 25, 2024

Analyzing the Role of Semantic Representations in the Era of Large Language Models

May 02, 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Mar 08, 2024

The Case for Scalable, Data-Driven Theory: A Paradigm for Scientific Progress in NLP

Dec 01, 2023

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Nov 20, 2023