Abstract:Warning-framed content in training data (e.g., "DO NOT USE - this code is vulnerable") does not, it turns out, teach language models to avoid the warned-against behavior. In experiments reported here, models exposed to such warnings reproduced the flagged content at rates statistically indistinguishable from models given the content directly (76.7% vs. 83.3%). Why? Sparse autoencoder analysis points to a failure of orthogonalization: "describing X" and "performing X" activate overlapping latent features. Feature #8684, which tracks code execution patterns, fires at comparable magnitude in both warning and exploitation contexts. A related phenomenon, what I call "stealth slip", allows conversational preambles to rotate activations into subspaces that linear probes miss entirely. Prompting and inference-time steering do not fix this; training-time feature ablation does. The upshot is that statistical co-occurrence dominates over pragmatic interpretation in current architectures. Models learn what tends to follow a context, not why it appeared there.
Abstract:Although sparse autoencoders (SAEs) are crucial for identifying interpretable features in neural networks, it is still challenging to distinguish between real computational patterns and erroneous correlations. We introduce Model-X knockoffs to SAE feature selection, using knock-off+ to control the false discovery rate (FDR) with finite-sample guarantees under the standard Model-X assumptions (in our case, via a Gaussian surrogate for the latent distribution). We select 129 features at a target FDR q=0.1 after analyzing 512 high-activity SAE latents for sentiment classification using Pythia-70M. About 25% of the latents under examination carry task-relevant signal, whereas 75% do not, according to the chosen set, which displays a 5.40x separation in knockoff statistics compared to non-selected features. Our method offers a re-producible and principled framework for reliable feature discovery by combining SAEs with multiple-testing-aware inference, advancing the foundations of mechanistic interpretability.




Abstract:This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a data-driven alternative to traditional risk management strategies, the computational burden of training neural networks with first-order methods remains a significant impediment to practical implementation. The proposed architecture couples Long Short-Term Memory (LSTM) networks with K-FAC second-order optimization, specifically addressing the challenges of sequential financial data and curvature estimation in recurrent networks. Empirical validation using simulated paths from a calibrated Heston stochastic volatility model demonstrates that the K-FAC implementation achieves marked improvements in convergence dynamics and hedging efficacy. The methodology yields a 78.3% reduction in transaction costs ($t = 56.88$, $p < 0.001$) and a 34.4% decrease in profit and loss (P&L) variance compared to Adam optimization. Moreover, the K-FAC-enhanced model exhibits superior risk-adjusted performance with a Sharpe ratio of 0.0401, contrasting with $-0.0025$ for the baseline model. These results provide compelling evidence that second-order optimization methods can materially enhance the tractability of Deep Hedging implementations. The findings contribute to the growing literature on computational methods in quantitative finance while highlighting the potential for advanced optimization techniques to bridge the gap between theoretical frameworks and practical applications in financial markets.