Samuel Marks

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Mar 05, 2026

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Feb 26, 2026

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers

Dec 17, 2025

Auditing Games for Sandbagging

Dec 08, 2025

Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Oct 23, 2025

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Oct 06, 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Jul 22, 2025

Robustly Improving LLM Fairness in Realistic Settings via Interpretability

Jun 12, 2025

Unsupervised Elicitation of Language Models

Jun 11, 2025

Auditing language models for hidden objectives

Mar 14, 2025