
Neel Nanda

Google DeepMind

Inference-Time Decomposition of Activations (ITDA): A Scalable Approach to Interpreting Large Language Models

May 23, 2025

Towards eliciting latent knowledge from LLMs with mechanistic interpretability

May 20, 2025

Scaling sparse feature circuit finding for in-context learning

Apr 18, 2025

An Approach to Technical AGI Safety and Security

Apr 02, 2025

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Mar 13, 2025

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Mar 13, 2025

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Feb 23, 2025

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Feb 07, 2025

Open Problems in Mechanistic Interpretability

Jan 27, 2025

BatchTopK Sparse Autoencoders

Dec 09, 2024