Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Title:I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Mar 01, 2026

Subramanyam Sahoo, Vinija Jain, Divya Chaudhary, Aman Chadha

Share this with someone who'll enjoy it:

Abstract:Instruction tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $σ=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit 20$\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.

* Accepted at the ICBINB: Where LLMs Need to Improve workshop at ICLR 2026. 12 pages and 3 Figures

View paper on

Share this with someone who'll enjoy it:

Title:I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Paper and Code