
Neel Nanda

Google DeepMind

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Mar 05, 2026

Simple LLM Baselines are Competitive for Model Diffing

Feb 10, 2026

Emergent Misalignment is Easy, Narrow Misalignment is Hard

Feb 08, 2026

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering

Jan 28, 2026

Building Production-Ready Probes For Gemini

Jan 16, 2026

Interpretable Embeddings with Sparse Autoencoders: A Data Analysis Toolkit

Dec 10, 2025

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Oct 31, 2025

Steering Evaluation-Aware Language Models To Act Like They Are Deployed

Oct 23, 2025

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Jul 22, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025