Neel Nanda

Google DeepMind

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Jul 19, 2024

Interpreting Attention Layer Outputs with Sparse Autoencoders

Jun 25, 2024

Confidence Regulation Neurons in Language Models

Jun 24, 2024

Transcoders Find Interpretable LLM Feature Circuits

Jun 17, 2024

Refusal in Language Models Is Mediated by a Single Direction

Jun 17, 2024

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

May 16, 2024

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 30, 2024

How to use and interpret activation patching

Apr 23, 2024

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Mar 01, 2024

Explorations of Self-Repair in Language Models

Feb 23, 2024