Abstract: Vision-language models have been key to the development of open-vocabulary 2D semantic segmentation. Lifting these models from 2D images to 3D scenes, however, remains a challenging problem. Existing approaches typically back-project and average 2D descriptors across views, or heuristically select a single representative descriptor, often resulting in suboptimal 3D representations. In this work, we introduce a novel multiview transformer architecture that cross-attends across vision-language descriptors from multiple viewpoints and fuses them into a unified per-3D-instance embedding. As a second contribution, we leverage multiview consistency as a self-supervision signal for this fusion, which significantly improves performance when added to a standard supervised target-class loss. Our Cross-Attentive Multiview Fusion, denoted by its acronym CAMFusion, not only consistently outperforms naive averaging and single-view descriptor selection, but also achieves state-of-the-art results on 3D semantic and instance classification benchmarks, including zero-shot evaluations on out-of-domain datasets.
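
The abstract describes the architecture only at a high level. The following is a minimal PyTorch sketch of the kind of cross-attentive fusion and multiview-consistency objective it outlines: a learnable query token cross-attends over per-view vision-language descriptors and is pooled into one per-instance embedding. All module names, hyperparameters (embedding dimension, head and layer counts), and the exact form of the consistency loss are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiviewFusion(nn.Module):
    """Sketch: a learnable query token cross-attends over per-view
    vision-language descriptors and fuses them into one embedding."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # one fused token
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_desc: torch.Tensor) -> torch.Tensor:
        # view_desc: (num_instances, num_views, dim) per-view descriptors
        q = self.query.expand(view_desc.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(q, view_desc, view_desc)  # query attends to views
            q = self.norm(q + out)                  # residual update
        return F.normalize(q.squeeze(1), dim=-1)    # unit-norm fused embedding


def multiview_consistency_loss(fused: torch.Tensor,
                               view_desc: torch.Tensor) -> torch.Tensor:
    """Hypothetical self-supervision term: pull the fused embedding towards
    every per-view descriptor of the same instance (cosine similarity)."""
    views = F.normalize(view_desc, dim=-1)           # (N, V, D)
    sims = torch.einsum('nd,nvd->nv', fused, views)  # per-view similarity
    return (1.0 - sims).mean()


if __name__ == "__main__":
    fusion = MultiviewFusion(dim=512)
    desc = torch.randn(4, 5, 512)   # 4 instances, 5 views each
    emb = fusion(desc)              # (4, 512) fused embeddings
    loss = multiview_consistency_loss(emb, desc)
```

In this reading, the consistency term regularizes the fused token towards agreement with all observed views, which is one plausible way the self-supervision signal could complement the supervised target-class loss.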




Abstract: This paper presents the first open-vocabulary online 3D semantic SLAM pipeline, which we denote OVO-SLAM. Our primary contribution is the pipeline itself, particularly its mapping thread. Given a set of posed RGB-D frames, we detect and track 3D segments, which we describe with CLIP vectors computed through a novel aggregation over the viewpoints in which these 3D segments are observed. Notably, our OVO-SLAM pipeline is not only faster than offline approaches in the literature but also achieves better segmentation metrics. Beyond this superior segmentation performance, we integrate our contributions with Gaussian-SLAM and present the first experimental results demonstrating end-to-end open-vocabulary online 3D reconstructions that do not rely on ground-truth camera poses or scene geometry.
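
The abstract does not specify the aggregation itself. As an illustration of how per-view CLIP vectors might be fused online in a mapping thread, here is a minimal sketch using a visibility-weighted running mean, updated incrementally as each new keyframe observes a tracked 3D segment. The class, its fields, and the visibility weighting are hypothetical stand-ins, not OVO-SLAM's actual aggregation scheme.

```python
import numpy as np

class SegmentDescriptor:
    """Sketch of online per-segment CLIP aggregation. A visibility-weighted
    running mean stands in for the paper's (unspecified) aggregation."""

    def __init__(self, dim: int = 512):
        self.vector = np.zeros(dim, dtype=np.float32)  # running aggregate
        self.weight = 0.0                              # accumulated weight

    def update(self, clip_vec: np.ndarray, visibility: float) -> None:
        # clip_vec: CLIP embedding of the segment's crop in a new keyframe.
        # visibility: hypothetical weight, e.g. the fraction of the 3D
        # segment visible in this view.
        self.weight += visibility
        # Incremental weighted mean: m <- m + (w / W) * (x - m)
        self.vector += (visibility / self.weight) * (clip_vec - self.vector)

    def descriptor(self) -> np.ndarray:
        # Unit-normalized descriptor, ready for open-vocabulary text queries
        # via cosine similarity against CLIP text embeddings.
        n = np.linalg.norm(self.vector)
        return self.vector / n if n > 0 else self.vector
```

An incremental update of this kind keeps per-segment state constant in size regardless of how many viewpoints observe the segment, which is one way an online pipeline could stay faster than offline approaches that store and fuse all views at once.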