
Zhengxuan Wu

Improved Representation Steering for Language Models

May 27, 2025

GIM: Improved Interpretability for Large Language Models

May 23, 2025

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Jan 29, 2025

Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models

Jul 31, 2024

ReFT: Representation Finetuning for Language Models

Apr 08, 2024

Mapping the Increasing Use of LLMs in Scientific Papers

Apr 01, 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Mar 12, 2024

In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

Mar 12, 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Feb 27, 2024

A Reply to Makelov et al.'s "Interpretability Illusion" Arguments

Jan 23, 2024