János Kramár

Google DeepMind

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Jul 19, 2024
Via arXiv

On scalable oversight with weak LLMs judging strong LLMs

Jul 05, 2024
Via arXiv

Improving Dictionary Learning with Gated Sparse Autoencoders

Apr 30, 2024
Via arXiv

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Mar 01, 2024
Via arXiv

Explaining grokking through circuit efficiency

Sep 05, 2023
Via arXiv

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Jul 24, 2023
Via arXiv

Tracr: Compiled Transformers as a Laboratory for Interpretability

Jan 12, 2023
Via arXiv

Learning to Play No-Press Diplomacy with Best Response Policy Iteration

Jun 17, 2020
Via arXiv

OpenSpiel: A Framework for Reinforcement Learning in Games

Oct 10, 2019
Via arXiv

Learning Reciprocity in Complex Sequential Social Dilemmas

Mar 19, 2019
Via arXiv