Alert button
Picture for Tom Lieberum

Tom Lieberum

Alert button

Google DeepMind

Evaluating Frontier Models for Dangerous Capabilities

Add code
Bookmark button
Alert button
Mar 20, 2024
Mary Phuong, Matthew Aitchison, Elliot Catt, Sarah Cogan, Alexandre Kaskasoli, Victoria Krakovna, David Lindner, Matthew Rahtz, Yannis Assael, Sarah Hodkinson, Heidi Howard, Tom Lieberum, Ramana Kumar, Maria Abi Raad, Albert Webson, Lewis Ho, Sharon Lin, Sebastian Farquhar, Marcus Hutter, Gregoire Deletang, Anian Ruoss, Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, Toby Shevlane

Figure 1 for Evaluating Frontier Models for Dangerous Capabilities
Figure 2 for Evaluating Frontier Models for Dangerous Capabilities
Figure 3 for Evaluating Frontier Models for Dangerous Capabilities
Figure 4 for Evaluating Frontier Models for Dangerous Capabilities
Viaarxiv icon

AtP*: An efficient and scalable method for localizing LLM behaviour to components

Add code
Bookmark button
Alert button
Mar 01, 2024
János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda

Figure 1 for AtP*: An efficient and scalable method for localizing LLM behaviour to components
Figure 2 for AtP*: An efficient and scalable method for localizing LLM behaviour to components
Figure 3 for AtP*: An efficient and scalable method for localizing LLM behaviour to components
Figure 4 for AtP*: An efficient and scalable method for localizing LLM behaviour to components
Viaarxiv icon

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Add code
Bookmark button
Alert button
Jul 24, 2023
Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik

Figure 1 for Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Figure 2 for Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Figure 3 for Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Figure 4 for Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Viaarxiv icon

Progress measures for grokking via mechanistic interpretability

Add code
Bookmark button
Alert button
Jan 13, 2023
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, Jacob Steinhardt

Figure 1 for Progress measures for grokking via mechanistic interpretability
Figure 2 for Progress measures for grokking via mechanistic interpretability
Figure 3 for Progress measures for grokking via mechanistic interpretability
Figure 4 for Progress measures for grokking via mechanistic interpretability
Viaarxiv icon

Retrospective on the 2021 BASALT Competition on Learning from Human Feedback

Add code
Bookmark button
Alert button
Apr 14, 2022
Rohin Shah, Steven H. Wang, Cody Wild, Stephanie Milani, Anssi Kanervisto, Vinicius G. Goecks, Nicholas Waytowich, David Watkins-Valls, Bharat Prakash, Edmund Mills, Divyansh Garg, Alexander Fries, Alexandra Souly, Chan Jun Shern, Daniel del Castillo, Tom Lieberum

Figure 1 for Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
Figure 2 for Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
Figure 3 for Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
Figure 4 for Retrospective on the 2021 BASALT Competition on Learning from Human Feedback
Viaarxiv icon