Mike Vaiana

Learning Self-Interpretation from Interpretability Artifacts: Training Lightweight Adapters on Vector-Label Pairs

Feb 10, 2026
Via arXiv

Endogenous Resistance to Activation Steering in Language Models

Feb 06, 2026
Via arXiv