Kola Ayonrinde

Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

May 02, 2025
Via arXiv

A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

May 01, 2025
Via arXiv

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025
Via arXiv

Adaptive Sparse Allocation with Mutual Choice & Feature Choice Sparse Autoencoders

Nov 04, 2024
Via arXiv

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Oct 15, 2024
Via arXiv