
Zhengxuan Wu

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Mar 12, 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Feb 27, 2024

A Reply to Makelov et al.'s "Interpretability Illusion" Arguments

Jan 23, 2024

Rigorously Assessing Natural Language Explanations of Neurons

Sep 19, 2023

MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions

May 24, 2023

Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

May 15, 2023

ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation

Mar 24, 2023

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

Mar 05, 2023

Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training

Dec 19, 2022

Causal Proxy Models for Concept-Based Model Explanations

Sep 28, 2022