Shivam Arora

Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

Apr 28, 2026

Chayanon Kitkana, Shivam Arora

Abstract: In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait even though it is distilled only on no-class logits, a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but it does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We also show that a mitigation method called liminal training works by attenuating the alignment but fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.

* Published in ICLR 2026 Sci4DL Workshop
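
The alignment between the trait and distillation gradients mentioned in the abstract can be illustrated as a cosine similarity between two gradient vectors. The following is a minimal sketch on a toy linear model, not the paper's MNIST setup: the model, the data, and the names `cosine_alignment`, `mse_grad`, `w_teacher`, `X_distill`, and `X_trait` are all assumptions made for illustration.

```python
import numpy as np

def cosine_alignment(g_a, g_b, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    g_a = np.ravel(g_a)
    g_b = np.ravel(g_b)
    return float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b) + eps))

def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
w = rng.normal(size=5)                 # hypothetical student parameters
w_teacher = rng.normal(size=5)         # hypothetical teacher parameters
X_distill = rng.normal(size=(64, 5))   # toy inputs for the distillation loss
X_trait = rng.normal(size=(64, 5))     # toy inputs for the trait loss

# Both toy losses regress onto teacher outputs, on different inputs.
g_distill = mse_grad(w, X_distill, X_distill @ w_teacher)
g_trait = mse_grad(w, X_trait, X_trait @ w_teacher)

# A positive value means a step along -g_distill also lowers the trait loss
# to first order.
alignment = cosine_alignment(g_distill, g_trait)
print(f"gradient alignment: {alignment:.3f}")
```

In a multi-step setting, this quantity would be recomputed at each training step to check whether it stays positive.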

Building Better Deception Probes Using Targeted Instruction Pairs

Feb 01, 2026

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

Abstract: Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we identify the importance of the instruction pair used during training. Furthermore, we show that targeting specific deceptive behaviors through a human-interpretable taxonomy of deception leads to improved results on evaluation datasets. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given the heterogeneity of deception types across datasets, we conclude that organizations should design specialized probes targeting their specific threat models rather than seeking a universal deception detector.
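
To make the idea of a linear probe trained on a contrastive pair concrete, here is a minimal sketch. It assumes synthetic Gaussian vectors as stand-ins for model activations collected under an honest and a deceptive instruction, and it uses a mean-difference direction as the linear classifier. That is one simple choice of linear probe, not necessarily the one used in the paper. All names (`acts_honest`, `acts_deceptive`, `fit_mean_difference_probe`, `probe_predict`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 16, 200

# Synthetic stand-ins for activations; real probes would read these from a
# model's hidden states under each instruction of the contrastive pair.
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
acts_honest = rng.normal(size=(n, dim)) - 3.0 * direction
acts_deceptive = rng.normal(size=(n, dim)) + 3.0 * direction

def fit_mean_difference_probe(neg, pos):
    """Return weights and bias of a linear probe along the class-mean difference."""
    mu_neg, mu_pos = neg.mean(axis=0), pos.mean(axis=0)
    w = mu_pos - mu_neg
    b = -w @ (mu_pos + mu_neg) / 2.0   # threshold at the midpoint of the means
    return w, b

def probe_predict(w, b, acts):
    """Label 1 (deceptive) where the probe score is positive, else 0."""
    return (acts @ w + b > 0).astype(int)

w, b = fit_mean_difference_probe(acts_honest, acts_deceptive)
preds = np.concatenate([probe_predict(w, b, acts_honest),
                        probe_predict(w, b, acts_deceptive)])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# Accuracy on the same synthetic data the probe was fit on (no held-out set).
accuracy = float((preds == labels).mean())
print(f"probe accuracy: {accuracy:.3f}")
```

Swapping the instruction pair changes which activations are contrasted, and therefore which direction the probe learns.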

Model-free machine learning of conservation laws from data

Jan 12, 2023

Shivam Arora, Alex Bihlo, Rüdiger Brecht, Pavel Holba

Abstract: We present a machine-learning-based method for learning first integrals of systems of ordinary differential equations from given trajectory data. The method is model-free in that it does not require explicit knowledge of the underlying system of differential equations that generated the trajectories. As a by-product, once the first integrals have been learned, the system of differential equations is also known. We illustrate our method by considering several classical problems from the mathematical sciences.

* 18 pages, 8 figures
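
A first integral is a function of the state that stays constant along every trajectory. The sketch below only illustrates that definition using trajectory data alone. It does not reproduce the paper's learning method. It assumes a harmonic oscillator trajectory sampled from its known closed-form solution, and the names `conservation_residual`, `energy`, and `position` are hypothetical.

```python
import numpy as np

def conservation_residual(candidate, trajectory):
    """Largest deviation of candidate(state) from its initial value along a trajectory."""
    values = np.array([candidate(state) for state in trajectory])
    return float(np.max(np.abs(values - values[0])))

# Trajectory data (x, p) for a unit harmonic oscillator: x = cos t, p = -sin t.
# Only these samples are used below; the differential equations are not.
t = np.linspace(0.0, 10.0, 500)
trajectory = np.stack([np.cos(t), -np.sin(t)], axis=1)

def energy(state):
    """Candidate first integral: 0.5 * (x^2 + p^2)."""
    return 0.5 * (state[0] ** 2 + state[1] ** 2)

def position(state):
    """A function that is not conserved, for contrast."""
    return state[0]

res_energy = conservation_residual(energy, trajectory)
res_position = conservation_residual(position, trajectory)
print(f"energy residual:   {res_energy:.2e}")
print(f"position residual: {res_position:.2e}")
```

A learning method in this spirit would search for functions whose residual is near zero across many trajectories while avoiding trivial constants.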
