Samuel Marks

Steering Evaluation-Aware Language Models To Act Like They Are Deployed
Oct 23, 2025

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Oct 06, 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning
Jul 22, 2025

Robustly Improving LLM Fairness in Realistic Settings via Interpretability
Jun 12, 2025

Unsupervised Elicitation of Language Models
Jun 11, 2025

Auditing language models for hidden objectives
Mar 14, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Mar 13, 2025

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks
Nov 28, 2024

Erasing Conceptual Knowledge from Language Models
Oct 03, 2024

The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aug 02, 2024