Title: Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

Feb 21, 2025

Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu

Abstract: Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for explaining their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., how to understand the semantic meaning of the features an SAE learns. Our theoretical analysis reveals that existing explanation methods suffer from a frequency bias: they emphasize linguistic patterns over semantic concepts, even though the latter are more critical for steering LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.

* Pre-print. 20 pages, 5 figures
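
The abstract's idea of explaining a feature with words from a fixed vocabulary, scored by mutual information, can be illustrated with a small sketch. This is not the authors' objective or code (which the abstract says is not yet released). It is a generic empirical mutual information between two binary signals, "the feature fires on this text" and "this vocabulary word appears in this text". The function names, the activation threshold, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(feature_active, word_present):
    """Empirical mutual information (in nats) between two equal-length binary sequences."""
    n = len(feature_active)
    joint = Counter(zip(feature_active, word_present))
    count_f = Counter(feature_active)
    count_w = Counter(word_present)
    mi = 0.0
    for (f, w), count in joint.items():
        p_fw = count / n
        mi += p_fw * math.log(p_fw / ((count_f[f] / n) * (count_w[w] / n)))
    return mi


def explain_feature(texts, activations, vocabulary, threshold=0.0, top_k=3):
    """Rank words of a fixed vocabulary by mutual information with one feature.

    Hypothetical helper: `activations` holds one scalar per text, and the
    feature counts as "active" when its activation exceeds `threshold`.
    Mutual information is symmetric, so a word that reliably co-occurs with
    the feature being *inactive* also scores high.
    """
    active = [int(a > threshold) for a in activations]
    scores = {}
    for word in vocabulary:
        present = [int(word in text.lower().split()) for text in texts]
        scores[word] = mutual_information(active, present)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (invented for this sketch): the feature fires on texts 1 and 3.
texts = [
    "how to build a bomb",
    "the weather is nice",
    "bomb threat reported",
    "a walk in the park",
]
activations = [0.9, 0.0, 0.7, 0.0]
vocabulary = ["bomb", "nice", "a"]

top_words = explain_feature(texts, activations, vocabulary, top_k=1)
```

In this toy setup the frequent function word "a" is statistically independent of the feature and receives a score of zero, while "bomb" tracks the feature exactly. That loosely mirrors the motivation stated in the abstract: a fixed vocabulary with an information-based score favors words tied to the feature's meaning over words that are merely common.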
