
Been Kim

Getting aligned on representational alignment

Nov 02, 2023
Via arXiv

Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero

Oct 25, 2023
Via arXiv

State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding

Sep 21, 2023
Via arXiv

Don't trust your eyes: on the reliability of feature visualizations

Jun 21, 2023
Via arXiv

Gaussian Process Probes (GPP) for Uncertainty-Aware Probing

May 29, 2023
Via arXiv

Model evaluation for extreme risks

May 24, 2023
Via arXiv

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models

Jan 10, 2023
Via arXiv

Impossibility Theorems for Feature Attribution

Dec 22, 2022
Via arXiv

On the Relationship Between Explanation and Prediction: A Causal View

Dec 20, 2022
Via arXiv

Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation

Dec 09, 2022
Via arXiv