
Samuel Marks

Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Jun 12, 2025

Unsupervised Elicitation of Language Models

Jun 11, 2025

Auditing language models for hidden objectives

Mar 14, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Nov 28, 2024

Erasing Conceptual Knowledge from Language Models

Oct 03, 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability

Aug 02, 2024

Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models

Jul 31, 2024

NNsight and NDIF: Democratizing Access to Foundation Model Internals

Jul 18, 2024

Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

Jun 20, 2024