Mary Phuong

Spilling the Beans: Teaching LLMs to Self-Report Their Hidden Objectives

Nov 14, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

May 29, 2025

Evaluating Frontier Models for Stealth and Situational Awareness

May 02, 2025

From Stability to Inconsistency: A Study of Moral Preferences in LLMs

Apr 08, 2025

Evaluating Frontier Models for Dangerous Capabilities

Mar 20, 2024

Gemini: A Family of Highly Capable Multimodal Models

Dec 19, 2023

Model evaluation for extreme risks

May 24, 2023

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Oct 04, 2022

Formal Algorithms for Transformers

Jul 19, 2022