Taha Bouhsine

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Jun 11, 2026

Taha Bouhsine

Abstract: Bernstein-Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel. These nonstationary kernels fall between the shift-invariant and dot-product templates that random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein-Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the exact intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all guarantees hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
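
The radial-randomization step can be illustrated on the flagship kernel. The following is a minimal sketch, not the paper's construction: it keeps the modulation $(w^\top x+b)^2$ exact (no sketching), and it assumes the scale mixture $1/(r^2+\varepsilon)=\int_0^\infty e^{-s(r^2+\varepsilon)}\,ds$, so that $s$ is drawn from an exponential law with rate $\varepsilon$ and each Gaussian factor $e^{-s r^2}$ is estimated by a random Fourier feature with frequency $\omega\sim\mathcal{N}(0,2sI)$. The function names are hypothetical.

```python
import math
import random


def yat_exact(w, x, b, eps):
    """Exact biased yat-kernel (w.x + b)^2 / (||w - x||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq_dist + eps)


def yat_random_feature_estimate(w, x, b, eps, num_draws, rng):
    """Monte Carlo estimate with an exact modulation and a randomized radial factor.

    Assumption: 1/(r^2 + eps) = (1/eps) * E_s[exp(-s r^2)] with s ~ Exp(rate=eps),
    and exp(-s r^2) = E_omega[cos(omega.(w - x))] with omega ~ N(0, 2 s I).
    """
    delta = [wi - xi for wi, xi in zip(w, x)]
    total = 0.0
    for _ in range(num_draws):
        s = rng.expovariate(eps)          # radial scale draw
        sigma = math.sqrt(2.0 * s)        # std of the Gaussian frequency
        proj = sum(rng.gauss(0.0, sigma) * di for di in delta)
        total += math.cos(proj)           # cos/sin feature pair collapses to cos of the difference
    radial = total / (eps * num_draws)    # estimate of 1 / (||w - x||^2 + eps)
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) ** 2 * radial


if __name__ == "__main__":
    rng = random.Random(0)
    w, x = [0.5, -0.2, 0.1], [0.1, 0.3, -0.4]
    print("exact   :", yat_exact(w, x, 1.0, 1.0))
    print("estimate:", yat_random_feature_estimate(w, x, 1.0, 1.0, 50000, rng))
```

The estimate is unbiased under the stated assumptions, and its error shrinks at the usual Monte Carlo rate in the number of radial draws.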

A Universal Reproducing Kernel Hilbert Space from Polynomial Alignment and IMQ Distance

May 05, 2026

Taha Bouhsine

Abstract: We introduce the Yat kernel $$k_{b,\varepsilon}(\mathbf{w},\mathbf{x})=\frac{(\mathbf{w}^\top\mathbf{x}+b)^2}{\|\mathbf{x}-\mathbf{w}\|^2+\varepsilon},\qquad b\ge 0,\ \varepsilon>0,$$ a rational hidden-unit primitive whose units are Mercer sections over a shared input/weight space. For $b\ge 0$ the kernel is PSD; for $b>0$ it dominates a scaled inverse-multiquadric (IMQ) in the Loewner order, yielding fixed-kernel universality, characteristicness, and strict positive definiteness on every compact domain. The polynomial numerator opens nonradial alignment channels absent from finite IMQ expansions, witnessed by the directional far-field trace $T_\infty g_\varepsilon(\cdot;\mathbf{w},b)(\mathbf{u})=(\mathbf{u}^\top\mathbf{w})^2$. Algebraically, a second finite difference in the bias recovers any IMQ atom exactly from three positive-bias Yat atoms, and the count of three atoms is sharp in every dimension for exact pointwise equality. A trained shared-$(b,\varepsilon)$ Yat layer is therefore a finite learned-center expansion in a fixed universal characteristic RKHS, with closed-form norm $\boldsymbol{\alpha}^\top\mathbf{K}\boldsymbol{\alpha}$ and explicit diagonal $(\|\mathbf{x}\|^2+b)^2/\varepsilon$ driving a Rademacher generalization bound.
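
The finite-difference identity can be checked numerically. This is a small sketch under one assumption: the IMQ atom meant in the abstract is taken to be the radial factor $1/(\|\mathbf{x}-\mathbf{w}\|^2+\varepsilon)$. Since $(t+b+h)^2-2(t+b)^2+(t+b-h)^2=2h^2$ for any $t$, the second difference of three Yat atoms at biases $b-h$, $b$, $b+h$ (all positive when $b>h>0$) divided by $2h^2$ leaves only that radial factor. The step size `h` and the function names are illustrative.

```python
def yat_kernel(w, x, b, eps):
    """Yat kernel (w.x + b)^2 / (||x - w||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    sq_dist = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (sq_dist + eps)


def radial_atom_from_yat(w, x, b, h, eps):
    """Second finite difference in the bias, using three positive-bias atoms."""
    if not (b > h > 0):
        raise ValueError("need b > h > 0 so that all three biases are positive")
    second_diff = (
        yat_kernel(w, x, b + h, eps)
        - 2.0 * yat_kernel(w, x, b, eps)
        + yat_kernel(w, x, b - h, eps)
    )
    return second_diff / (2.0 * h * h)


if __name__ == "__main__":
    w, x = [0.5, -0.2, 0.1], [0.1, 0.3, -0.4]
    eps = 0.5
    sq_dist = sum((xi - wi) ** 2 for wi, xi in zip(w, x))
    print(radial_atom_from_yat(w, x, b=1.0, h=0.25, eps=eps))
    print(1.0 / (sq_dist + eps))  # the two values agree up to rounding

    # Diagonal stated in the abstract: k(x, x) = (||x||^2 + b)^2 / eps
    sq_norm = sum(xi * xi for xi in x)
    print(yat_kernel(x, x, 1.0, eps), (sq_norm + 1.0) ** 2 / eps)
```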

In Defense of Cosine Similarity: Normalization Eliminates the Gauge Freedom

Feb 23, 2026

Taha Bouhsine

Abstract: Steck, Ekanadham, and Kallus [arXiv:2403.05440] demonstrate that cosine similarity of learned embeddings from matrix factorization models can be rendered arbitrary by a diagonal "gauge" matrix $D$. Their result is correct and important for practitioners who compute cosine similarity on embeddings trained with dot-product objectives. However, we argue that their conclusion, cautioning against cosine similarity in general, conflates the pathology of an incompatible training objective with the geometric validity of cosine distance on the unit sphere. We prove that when embeddings are constrained to the unit sphere $\mathbb{S}^{d-1}$ (either during or after training with an appropriate objective), the $D$-matrix ambiguity vanishes identically, and cosine distance reduces to exactly half the squared Euclidean distance. This monotonic equivalence implies that cosine-based and Euclidean-based neighbor rankings are identical on normalized embeddings. The "problem" with cosine similarity is not cosine similarity; it is the failure to normalize.
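
The distance identity follows from $\|u-v\|^2=\|u\|^2+\|v\|^2-2u^\top v=2-2\cos(u,v)$ for unit vectors. A minimal numerical check, with hypothetical helper names and arbitrary example vectors, might look like this:

```python
import math


def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]


def cosine_distance(u, v):
    """1 - cos(u, v), computed without assuming unit norm."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    nu = math.sqrt(sum(ui * ui for ui in u))
    nv = math.sqrt(sum(vi * vi for vi in v))
    return 1.0 - dot / (nu * nv)


def squared_euclidean(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))


if __name__ == "__main__":
    query = normalize([1.0, 2.0, -0.5])
    items = [normalize(v) for v in ([0.9, 2.1, -0.4], [-1.0, 0.3, 2.0], [0.2, -1.5, 0.7])]

    for item in items:
        # On the unit sphere, cosine distance is half the squared Euclidean distance.
        print(cosine_distance(query, item), 0.5 * squared_euclidean(query, item))

    # Because the map is monotone, both distances rank the neighbors identically.
    by_cos = sorted(range(len(items)), key=lambda i: cosine_distance(query, items[i]))
    by_euc = sorted(range(len(items)), key=lambda i: squared_euclidean(query, items[i]))
    print(by_cos == by_euc)
```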

SLAY: Geometry-Aware Spherical Linearized Attention with Yat-Kernel

Feb 04, 2026

Jose Miguel Luna, Taha Bouhsine, Krzysztof Choromanski

Abstract: We propose a new class of linear-time attention mechanisms based on a relaxed and computationally efficient formulation of the recently introduced E-Product, often referred to as the Yat-kernel (Bouhsine, 2025). The resulting interactions are geometry-aware and inspired by inverse-square interactions in physics. Our method, Spherical Linearized Attention with Yat Kernels (SLAY), constrains queries and keys to the unit sphere so that attention depends only on angular alignment. Using Bernstein's theorem, we express the spherical Yat-kernel as a nonnegative mixture of polynomial-exponential product kernels and derive a strictly positive random-feature approximation enabling linear-time $O(L)$ attention. We establish positive definiteness and boundedness on the sphere and show that the estimator yields well-defined, nonnegative attention scores. Empirically, SLAY achieves performance that is nearly indistinguishable from standard softmax attention while retaining linear time and memory scaling, and consistently outperforms prior linear-time attention mechanisms such as Performers and Cosformers. To the best of our knowledge, SLAY represents the closest linear-time approximation to softmax attention reported to date, enabling scalable Transformers without the typical performance trade-offs of attention linearization.
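
Why a strictly positive feature map yields $O(L)$ attention can be seen from associativity: with scores $\phi(q_i)^\top\phi(k_j)$, the output $\sum_j \phi(q_i)^\top\phi(k_j)\,v_j$ equals $\phi(q_i)^\top\big(\sum_j \phi(k_j)v_j^\top\big)$, and the inner sum is computed once. The sketch below illustrates only this generic mechanism. The feature map `positive_features` is a placeholder (an elementwise exponential), not the SLAY random features, which the abstract does not specify.

```python
import math


def positive_features(v):
    """Placeholder strictly positive feature map (NOT the SLAY features)."""
    return [math.exp(vi) for vi in v]


def quadratic_attention(Q, K, V):
    """Reference: forms all L x L scores explicitly."""
    out = []
    for q in Q:
        fq = positive_features(q)
        weights = [sum(a * b for a, b in zip(fq, positive_features(k))) for k in K]
        norm = sum(weights)  # positive because every feature is positive
        out.append([sum(wt * v[c] for wt, v in zip(weights, V)) / norm
                    for c in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same output, but keys and values are summarized once, in one pass over the sequence."""
    m, dv = len(positive_features(K[0])), len(V[0])
    S = [[0.0] * dv for _ in range(m)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * m                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = positive_features(k)
        for r in range(m):
            z[r] += fk[r]
            for c in range(dv):
                S[r][c] += fk[r] * v[c]
    out = []
    for q in Q:
        fq = positive_features(q)
        norm = sum(fq[r] * z[r] for r in range(m))
        out.append([sum(fq[r] * S[r][c] for r in range(m)) / norm for c in range(dv)])
    return out


if __name__ == "__main__":
    Q = [[0.1, -0.3], [0.5, 0.2], [-0.4, 0.0]]
    K = [[0.2, 0.1], [-0.1, 0.4], [0.3, -0.2]]
    V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(quadratic_attention(Q, K, V))
    print(linear_attention(Q, K, V))  # matches the reference up to rounding
```

The linear variant costs time proportional to the sequence length times the feature and value dimensions, since the summaries `S` and `z` do not depend on the query.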

* ICML 2026, 8 pages main body, 27 pages total

Deep Learning 2.0: Artificial Neurons That Matter -- Reject Correlation, Embrace Orthogonality

Nov 12, 2024

Taha Bouhsine

Abstract: We introduce a yat-product-powered neural network, the Neural Matter Network (NMN), a breakthrough in deep learning that achieves non-linear pattern recognition without activation functions. Our key innovation relies on the yat-product, which naturally induces non-linearity by projecting inputs into a pseudo-metric space, eliminating the need for traditional activation functions while maintaining only a softmax layer for the final class probability distribution. This approach simplifies network architecture and provides unprecedented transparency into the network's decision-making process. Our comprehensive empirical evaluation across different datasets demonstrates that NMN consistently outperforms traditional MLPs. The results challenge the assumption that separate activation functions are necessary for effective deep-learning models. The implications of this work extend beyond immediate architectural benefits: by eliminating intermediate activation functions while preserving non-linear capabilities, yat-MLP establishes a new paradigm for neural network design that combines simplicity with effectiveness. Most importantly, our approach provides unprecedented insights into the traditionally opaque "black-box" nature of neural networks, offering a clearer understanding of how these models process and classify information.

* Submitted to CVPR 2025

SimO Loss: Anchor-Free Contrastive Loss for Fine-Grained Supervised Contrastive Learning

Oct 07, 2024

Taha Bouhsine, Imad El Aaroussi, Atik Faysal, Wang Huaxia

Abstract: We introduce a novel anchor-free contrastive learning (AFCL) method leveraging our proposed Similarity-Orthogonality (SimO) loss. Our approach minimizes a semi-metric discriminative loss function that simultaneously optimizes two key objectives: reducing the distance and orthogonality between embeddings of similar inputs while maximizing these metrics for dissimilar inputs, facilitating more fine-grained contrastive learning. The AFCL method, powered by SimO loss, creates a fiber bundle topological structure in the embedding space, forming class-specific, internally cohesive yet orthogonal neighborhoods. We validate the efficacy of our method on the CIFAR-10 dataset, providing visualizations that demonstrate the impact of SimO loss on the embedding space. Our results illustrate the formation of distinct, orthogonal class neighborhoods, showcasing the method's ability to create well-structured embeddings that balance class separation with intra-class variability. This work opens new avenues for understanding and leveraging the geometric properties of learned representations in various machine learning tasks.
