Stefan Heimersheim

SCALAR: Benchmarking SAE Interaction Sparsity in Toy LLMs

Nov 10, 2025

Transformers Don't Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT-2 XL and the Implications for Mechanistic Interpretability

Jul 03, 2025

Detecting Strategic Deception Using Linear Probes

Feb 05, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Jan 24, 2025

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Oct 16, 2024

Evolution of SAE Features Across Layers in LLMs

Oct 11, 2024

Characterizing stable regions in the residual stream of LLMs

Sep 26, 2024

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

May 17, 2024

How to use and interpret activation patching

Apr 23, 2024