Federico Torrielli

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

Jun 09, 2026
Via arXiv

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

Jun 08, 2026
Via arXiv

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

May 29, 2026
Via arXiv

Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals

May 25, 2026
Via arXiv