Samuel Marks

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Dec 17, 2025

Auditing Games for Sandbagging

Dec 08, 2025

Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Oct 23, 2025

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Oct 06, 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Jul 22, 2025

Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Jun 12, 2025

Unsupervised Elicitation of Language Models

Jun 11, 2025

Auditing language models for hidden objectives

Mar 14, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025

Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks

Nov 28, 2024