
Zhengxuan Wu

LLMs Encode Harmfulness and Refusal Separately

Jul 16, 2025

Improved Representation Steering for Language Models

May 27, 2025

GIM: Improved Interpretability for Large Language Models

May 23, 2025

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Jan 29, 2025

Dancing in Chains: Reconciling Instruction Following and Faithfulness in Language Models

Jul 31, 2024

ReFT: Representation Finetuning for Language Models

Apr 08, 2024

Mapping the Increasing Use of LLMs in Scientific Papers

Apr 01, 2024

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions

Mar 12, 2024

In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation

Mar 12, 2024

RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations

Feb 27, 2024