Diogo Cruz

Asymmetric Goal Drift in Coding Agents Under Value Conflict

Mar 03, 2026
Via arXiv

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Mar 03, 2026
Via arXiv

Multi-Turn Jailbreaks Are Simpler Than They Seem

Aug 11, 2025
Via arXiv

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Jun 11, 2025
Via arXiv

Understanding the learned look-ahead behavior of chess neural networks

May 26, 2025
Via arXiv

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features

Nov 07, 2023
Via arXiv