Bofan Gong

Probing the Vulnerability of Large Language Models to Polysemantic Interventions

May 16, 2025

Bofan Gong, Shiyang Lai, Dawn Song

Abstract: Polysemanticity, where individual neurons encode multiple unrelated features, is a well-known characteristic of large neural networks and remains a central challenge in the interpretability of language models. At the same time, its implications for model safety are poorly understood. Leveraging recent advances in sparse autoencoders, we investigate the polysemantic structure of two small models (Pythia-70M and GPT-2-Small) and evaluate their vulnerability to targeted, covert interventions at the prompt, feature, token, and neuron levels. Our analysis reveals a consistent polysemantic topology shared across both models. Strikingly, we demonstrate that this structure can be exploited to mount effective interventions on two larger, black-box instruction-tuned models (LLaMA3.1-8B-Instruct and Gemma-2-9B-Instruct). These findings not only suggest that the interventions generalize but also point to a stable and transferable polysemantic structure that could persist across architectures and training regimes.
