
Fabien Roger

How Useful Is Cross-Domain Generalization for Training LLM Monitors?

May 12, 2026
Via arXiv

Classifier Context Rot: Monitor Performance Degrades with Context Length

May 12, 2026
Via arXiv

Self-Attribution Bias: When AI Monitors Go Easy on Themselves

Mar 04, 2026
Via arXiv

Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation

Feb 23, 2026
Via arXiv

Excess Description Length of Learning Generalizable Predictors

Jan 08, 2026
Via arXiv

Steering Language Models with Weight Arithmetic

Nov 07, 2025
Via arXiv

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Oct 06, 2025
Via arXiv

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025
Via arXiv

Reasoning Models Don't Always Say What They Think

May 08, 2025
Via arXiv

Auditing language models for hidden objectives

Mar 14, 2025
Via arXiv