Sarah Wiegreffe

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

Apr 09, 2026

Are Latent Reasoning Models Easily Interpretable?

Apr 06, 2026

Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

Feb 02, 2026

MIB: A Mechanistic Interpretability Benchmark

Apr 17, 2025

On Linear Representations and Pretraining Data Frequency in Language Models

Apr 16, 2025

Mechanistic?

Oct 07, 2024

Plausibly Problematic Questions in Multiple-Choice Benchmarks for Commonsense Reasoning

Oct 06, 2024

Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

Jul 21, 2024

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Jan 12, 2024

Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals

Nov 16, 2023