Dan Mossing

Weight-sparse transformers have interpretable circuits

Nov 17, 2025
Via arXiv

Persona Features Control Emergent Misalignment

Jun 24, 2025
Via arXiv

Investigating task-specific prompts and sparse autoencoders for activation monitoring

Apr 28, 2025
Via arXiv