Neel Nanda

Google DeepMind

Thought Anchors: Which LLM Reasoning Steps Matter?

Jun 23, 2025

Convergent Linear Representations of Emergent Misalignment

Jun 13, 2025

How Visual Representations Map to Language Feature Space in Multimodal LLMs

Jun 13, 2025

Because we have LLMs, we Can and Should Pursue Agentic Interpretability

Jun 13, 2025

Model Organisms for Emergent Misalignment

Jun 13, 2025

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

May 23, 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

May 20, 2025

Scaling sparse feature circuit finding for in-context learning

Apr 18, 2025

An Approach to Technical AGI Safety and Security

Apr 02, 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Mar 13, 2025