Abstract:Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap (Δ) certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as (\exp(-Ω(Δ/\varepsilon))), while the error on the active face scales linearly in the temperature/regularization parameter (\varepsilon). This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query through a paging-style context selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies.
Abstract:Early-stage degradation in oscillatory systems often manifests as geometric distortions of the dynamics, such as phase jitter, frequency drift, or loss of coherence, long before changes in signal energy are detectable. In this regime, classical energy-based diagnostics and unconstrained learned representations are structurally insensitive, leading to delayed or unstable detection. We introduce GO-OSC, a geometry-aware representation learning framework for oscillatory time series that enforces a canonical and identifiable latent parameterization, enabling stable comparison and aggregation across short, unlabeled windows. Building on this representation, we define a family of invariant linear geometric probes that target degradation-relevant directions in latent space. We provide theoretical results showing that under early phase-only degradation, energy-based statistics have zero first-order detection power, whereas geometric probes achieve strictly positive sensitivity. Our analysis characterizes when and why linear probing fails under non-identifiable representations and shows how canonicalization restores statistical detectability. Experiments on synthetic benchmarks and real vibration datasets validate the theory, demonstrating earlier detection, improved data efficiency, and robustness to operating condition changes.