Lee Sharkey

Black-Box Access is Insufficient for Rigorous AI Audits

Jan 25, 2024
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, Benjamin Bucknall, Andreas Haupt, Kevin Wei, Jérémy Scheurer, Marius Hobbhahn, Lee Sharkey, Satyapriya Krishna, Marvin Von Hagen, Silas Alberti, Alan Chan, Qinyi Sun, Michael Gerovitch, David Bau, Max Tegmark, David Krueger, Dylan Hadfield-Menell

Via arXiv

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Sep 19, 2023
Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

Via arXiv

A technical note on bilinear layers for interpretability

May 05, 2023
Lee Sharkey

Via arXiv

Circumventing interpretability: How to defeat mind-readers

Dec 21, 2022
Lee Sharkey

Via arXiv

Interpreting Neural Networks through the Polytope Lens

Nov 22, 2022
Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, Connor Leahy

Via arXiv

Objective Robustness in Deep Reinforcement Learning

Jun 08, 2021
Jack Koch, Lauro Langosco, Jacob Pfau, James Le, Lee Sharkey

Via arXiv