Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Sonia Mazelet

SOTAlign: Semi-Supervised Alignment of Unimodal Vision and Language Models via Optimal Transport

Feb 26, 2026

Simon Roschmann, Paul Krzakala, Sonia Mazelet, Quentin Bouniot, Zeynep Akata

Abstract:The Platonic Representation Hypothesis posits that neural networks trained on different modalities converge toward a shared statistical model of the world. Recent work exploits this convergence by aligning frozen pretrained vision and language models with lightweight alignment layers, but typically relies on contrastive losses and millions of paired samples. In this work, we ask whether meaningful alignment can be achieved with substantially less supervision. We introduce a semi-supervised setting in which pretrained unimodal encoders are aligned using a small number of image-text pairs together with large amounts of unpaired data. To address this challenge, we propose SOTAlign, a two-stage framework that first recovers a coarse shared geometry from limited paired data using a linear teacher, then refines the alignment on unpaired samples via an optimal-transport-based divergence that transfers relational structure without overconstraining the target space. Unlike existing semi-supervised methods, SOTAlign effectively leverages unpaired images and text, learning robust joint embeddings across datasets and encoder pairs, and significantly outperforming supervised and semi-supervised baselines.

* Preprint

Via

Access Paper or Ask Questions

Binding in hippocampal-entorhinal circuits enables compositionality in cognitive maps

Jun 27, 2024

Christopher J. Kymn, Sonia Mazelet, Anthony Thomas, Denis Kleyko, E. Paxon Frady, Friedrich T. Sommer, Bruno A. Olshausen

Abstract:We propose a normative model for spatial representation in the hippocampal formation that combines optimality principles, such as maximizing coding range and spatial information per neuron, with an algebraic framework for computing in distributed representation. Spatial position is encoded in a residue number system, with individual residues represented by high-dimensional, complex-valued vectors. These are composed into a single vector representing position by a similarity-preserving, conjunctive vector-binding operation. Self-consistency between the representations of the overall position and of the individual residues is enforced by a modular attractor network whose modules correspond to the grid cell modules in entorhinal cortex. The vector binding operation can also associate different contexts to spatial representations, yielding a model for entorhinal cortex and hippocampus. We show that the model achieves normative desiderata including superlinear scaling of patterns with dimension, robust error correction, and hexagonal, carry-free encoding of spatial position. These properties in turn enable robust path integration and association with sensory inputs. More generally, the model formalizes how compositional computations could occur in the hippocampal formation and leads to testable experimental predictions.

* 23 pages, 12 figures

Via

Access Paper or Ask Questions

Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

Apr 29, 2024

Christopher J. Kymn, Sonia Mazelet, Annabel Ng, Denis Kleyko, Bruno A. Olshausen

Figure 1 for Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

Figure 2 for Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

Figure 3 for Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

Figure 4 for Compositional Factorization of Visual Scenes with Convolutional Sparse Coding and Resonator Networks

Abstract:We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.

* 9 pages, 5 figures

Via

Access Paper or Ask Questions