Neel Nanda

Google DeepMind

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Mar 01, 2024
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

Via arXiv

Explorations of Self-Repair in Language Models

Feb 23, 2024
Cody Rushing, Neel Nanda

Via arXiv

Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Feb 11, 2024
Bilal Chughtai, Alan Cooney, Neel Nanda

Via arXiv

Universal Neurons in GPT2 Language Models

Jan 22, 2024
Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas

Via arXiv

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Dec 06, 2023
Aleksandar Makelov, Georg Lange, Neel Nanda

Via arXiv

Training Dynamics of Contextual N-Grams in Language Models

Nov 01, 2023
Lucia Quirke, Lovis Heindrich, Wes Gurnee, Neel Nanda

Via arXiv

Linear Representations of Sentiment in Large Language Models

Oct 23, 2023
Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda

Via arXiv

Copy Suppression: Comprehensively Understanding an Attention Head

Oct 06, 2023
Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, Neel Nanda

Via arXiv

Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

Sep 27, 2023
Fred Zhang, Neel Nanda

Via arXiv