Sparse Autoencoder


DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders

Feb 05, 2026
Via arXiv

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Feb 05, 2026
Via arXiv

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning

Feb 05, 2026
Via arXiv

AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Feb 04, 2026
Via arXiv

Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Feb 04, 2026
Via arXiv

Variational Sparse Paired Autoencoders (vsPAIR) for Inverse Problems and Uncertainty Quantification

Feb 03, 2026
Via arXiv

From Directions to Regions: Decomposing Activations in Language Models via Local Geometry

Feb 02, 2026
Via arXiv

Mechanistic Indicators of Steering Effectiveness in Large Language Models

Feb 02, 2026
Via arXiv

PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding

Feb 01, 2026
Via arXiv

Refining Decision Boundaries In Anomaly Detection Using Similarity Search Within the Feature Space

Feb 02, 2026
Via arXiv