Title: Recurrence Meets Transformers for Universal Multimodal Retrieval

Sep 10, 2025

Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on the Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2
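
To make the idea of LSTM-inspired gating across layers concrete, here is a minimal toy sketch in plain Python. It is not the ReT-2 implementation: the function names (`gated_layer_fusion`, `cosine`), the averaging of text and image features, and the parameter-free gates are all assumptions made for illustration, standing in for whatever learned components the released model uses. It only shows the general pattern of carrying a state from layer to layer, with a forget gate scaling the old state and an input gate scaling the new layer's features.

```python
import math


def sigmoid(x):
    """Logistic function used for both gates."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_layer_fusion(text_layers, image_layers, forget_bias=0.0, input_bias=0.0):
    """Fuse per-layer text and image features into one vector.

    text_layers / image_layers: lists of equal-length float vectors, one per layer.
    Hypothetical simplification: gates depend only on the current values,
    whereas a trained model would use learned projections.
    """
    dim = len(text_layers[0])
    state = [0.0] * dim
    for text_feat, image_feat in zip(text_layers, image_layers):
        # Candidate features: plain average of the two modalities (assumption).
        candidate = [(t + v) / 2.0 for t, v in zip(text_feat, image_feat)]
        new_state = []
        for s, c in zip(state, candidate):
            forget = sigmoid(s + forget_bias)  # how much of the old state to keep
            admit = sigmoid(c + input_bias)    # how much of the new layer to let in
            new_state.append(forget * s + admit * c)
        state = new_state
    return state


def cosine(a, b):
    """Cosine similarity, used here to score a query against a document."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Usage: fuse two layers of made-up features for a query and a document, then score.
query = gated_layer_fusion([[0.2, 0.1], [0.4, 0.3]], [[0.0, 0.5], [0.1, 0.2]])
document = gated_layer_fusion([[0.3, 0.1], [0.5, 0.2]], [[0.1, 0.4], [0.2, 0.1]])
score = cosine(query, document)
```

In this sketch both queries and documents pass through the same fusion routine and are compared with cosine similarity; whether the actual model shares weights or scores in this way should be checked against the linked repository.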
