
Aaron Mueller

Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Dec 23, 2025

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

Dec 17, 2025

BabyVLM-V2: Toward Developmentally Grounded Pretraining and Benchmarking of Vision Foundation Models

Dec 11, 2025

In-Context Learning Without Copying

Nov 07, 2025

Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Sep 05, 2025

How to Improve the Robustness of Closed-Source Models on NLI

May 26, 2025

SAEs Are Good for Steering -- If You Select the Right Features

May 26, 2025

MIB: A Mechanistic Interpretability Benchmark

Apr 17, 2025

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Apr 10, 2025

Position-aware Automatic Circuit Discovery

Feb 07, 2025