
Adrià Garriga-Alonso

Shammie

Interpreting learned search: finding a transition model and value function in an RNN that plays Sokoban

Jun 11, 2025
Via arXiv

Feature Hedging: Correlated Features Break Narrow Sparse Autoencoders

May 16, 2025
Via arXiv

Among Us: A Sandbox for Agentic Deception

Apr 05, 2025
Via arXiv

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Apr 02, 2025
Via arXiv

Hypothesis Testing the Circuit Hypothesis in LLMs

Oct 16, 2024
Via arXiv

Planning behavior in a recurrent neural network that plays Sokoban

Jul 22, 2024
Via arXiv

Adversarial Circuit Evaluation

Jul 21, 2024
Via arXiv

Investigating the Indirect Object Identification circuit in Mamba

Jul 19, 2024
Via arXiv

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification

Jul 19, 2024
Via arXiv

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

Jul 19, 2024
Via arXiv