Florent Draye

Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Feb 04, 2026

Moritz Miller, Florent Draye, Bernhard Schölkopf

Abstract: With recent progress on fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features, while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under https://github.com/mrtzmllr/sae-icm.
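To make the idea of an orthogonality penalty on a decoder matrix concrete, here is a minimal sketch. The abstract does not state the exact form of the penalty, so the choice below (sum of squared cosine similarities between distinct decoder columns) is an assumption for illustration, not the authors' loss; the function names `normalize_columns` and `orthogonality_penalty` are hypothetical.

```python
import math


def normalize_columns(decoder):
    """Scale each column (one feature direction) to unit Euclidean norm.

    Assumes every column is nonzero; `decoder` is a list of rows.
    """
    n_rows, n_cols = len(decoder), len(decoder[0])
    norms = [
        math.sqrt(sum(decoder[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [[decoder[i][j] / norms[j] for j in range(n_cols)] for i in range(n_rows)]


def orthogonality_penalty(decoder):
    """Sum of squared cosine similarities over all pairs of distinct columns.

    Illustrative assumption, not the paper's stated loss: the value is 0
    exactly when all feature directions are mutually orthogonal and grows
    as directions overlap.
    """
    unit = normalize_columns(decoder)
    n_rows, n_cols = len(unit), len(unit[0])
    penalty = 0.0
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            cosine = sum(unit[i][a] * unit[i][b] for i in range(n_rows))
            penalty += cosine ** 2
    return penalty


# Two orthogonal feature directions versus two overlapping ones.
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
overlapping = [[1.0, 1.0], [0.0, 1.0]]
print(orthogonality_penalty(orthogonal))
print(orthogonality_penalty(overlapping))
```

In a training setup, a term like this would typically be added to the main objective with a weight that controls how strict the orthogonality requirement is; that weighting is likewise an assumption here.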

Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

Nov 13, 2025

Abir Harrasse, Florent Draye, Zhijing Jin, Bernhard Schölkopf

Abstract: Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance still favor the dominant training language? To address this, we train a series of LLMs on different mixtures of multilingual data and analyze their internal mechanisms using cross-layer transcoders (CLT) and attribution graphs. Our results provide strong evidence for pivot language representations: the model employs nearly identical representations across languages, while language-specific decoding emerges in later layers. Attribution analyses reveal that decoding relies in part on a small set of high-frequency language features in the final layers, which linearly read out language identity from the first layers in the model. By intervening on these features, we can suppress one language and substitute another in the model's outputs. Finally, we study how the dominant training language influences these mechanisms across attribution graphs and decoding pathways. We argue that understanding this pivot-language mechanism is crucial for improving multilingual alignment in LLMs.

* 28 pages, 35 figures, under review. Extensive supplementary materials. Code and models available at https://huggingface.co/collections/CausalNLP/multilingual-tinystories-6862b6562414eb84d183f82a and https://huggingface.co/flodraye
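The intervention described in the abstract (suppressing one language feature and substituting another) can be pictured with a toy vector example. This is only a sketch under strong assumptions: it treats each feature as a single direction in an activation vector and moves the activation's component from a source direction onto a target direction. The function `swap_feature` and the three-dimensional vectors are hypothetical; the paper's actual interventions operate on cross-layer transcoder features inside a model.

```python
import math


def swap_feature(activation, source_dir, target_dir):
    """Move the activation's component along source_dir onto target_dir.

    Illustrative assumption: features are single nonzero directions, and
    the strength of the source feature is its projection coefficient.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def unit(v):
        norm = math.sqrt(dot(v, v))
        return [a / norm for a in v]

    s, t = unit(source_dir), unit(target_dir)
    strength = dot(activation, s)  # how strongly the source feature is present
    # Remove the source component, then add the same strength along the target.
    return [x - strength * si + strength * ti for x, si, ti in zip(activation, s, t)]


# Hypothetical toy setup: axis 0 = "language A", axis 1 = "language B",
# axis 2 = shared content that the swap should leave untouched.
activation = [2.0, 0.0, 1.0]
print(swap_feature(activation, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

When the two directions are orthogonal, as in this toy case, the swap leaves every other component unchanged; with overlapping directions the edit would also shift the activation along shared components.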
