Been Kim

Getting aligned on representational alignment

Nov 02, 2023
Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine L. Hermann, Kerem Oktar, Klaus Greff, Martin N. Hebart, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas P. O'Connell, Thomas Unterthiner, Andrew K. Lampinen, Klaus-Robert Müller, Mariya Toneva, Thomas L. Griffiths

Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero

Oct 25, 2023
Lisa Schut, Nenad Tomasev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim

State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding

Sep 21, 2023
Devleena Das, Sonia Chernova, Been Kim

Don't trust your eyes: on the (un)reliability of feature visualizations

Jun 21, 2023
Robert Geirhos, Roland S. Zimmermann, Blair Bilodeau, Wieland Brendel, Been Kim

Gaussian Process Probes (GPP) for Uncertainty-Aware Probing

May 29, 2023
Zi Wang, Alexander Ku, Jason Baldridge, Thomas L. Griffiths, Been Kim

Model evaluation for extreme risks

May 24, 2023
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Jan 10, 2023
Peter Hase, Mohit Bansal, Been Kim, Asma Ghandeharioun

Impossibility Theorems for Feature Attribution

Dec 22, 2022
Blair Bilodeau, Natasha Jaques, Pang Wei Koh, Been Kim
