Diogo Cruz

Multi-Turn Jailbreaks Are Simpler Than They Seem

Aug 11, 2025

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

Jun 11, 2025

Understanding the learned look-ahead behavior of chess neural networks

May 26, 2025

Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features

Nov 07, 2023