Marius Hobbhahn

Stress Testing Deliberative Alignment for Anti-Scheming Training

Sep 19, 2025
Via arXiv

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025
Via arXiv

Large Language Models Often Know When They Are Being Evaluated

May 28, 2025
Via arXiv

Technical Report: Evaluating Goal Drift in Language Model Agents

May 05, 2025
Via arXiv

Forecasting Frontier Language Model Agent Capabilities

Feb 21, 2025
Via arXiv

Detecting Strategic Deception Using Linear Probes

Feb 05, 2025
Via arXiv

Frontier Models are Capable of In-context Scheming

Dec 06, 2024
Via arXiv

Towards evaluations-based safety cases for AI scheming

Nov 07, 2024
Via arXiv

Analyzing Probabilistic Methods for Evaluating Agent Capabilities

Sep 24, 2024
Via arXiv

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

Jul 05, 2024
Via arXiv