Kola Ayonrinde

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

May 02, 2025
Via arXiv

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

May 01, 2025
Via arXiv

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025
Via arXiv

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Nov 04, 2024
Via arXiv

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Oct 15, 2024
Via arXiv