Neel Nanda

Google DeepMind

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

Jul 22, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Jul 15, 2025

Thought Anchors: Which LLM Reasoning Steps Matter?

Jun 23, 2025

Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Jun 13, 2025

How Visual Representations Map to Language Feature Space in Multimodal LLMs

Jun 13, 2025

Convergent Linear Representations of Emergent Misalignment

Jun 13, 2025

Model Organisms for Emergent Misalignment

Jun 13, 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

May 23, 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

May 20, 2025

Scaling sparse feature circuit finding for in-context learning

Apr 18, 2025