Lasse Ruttert

Reinforcement Learning Amplifies Emergent Misalignment from Harmless Rewards

May 29, 2026
Via arXiv