Abstract: Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information about entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in answering such questions, yet there is still no comprehensive benchmark for this task, especially for wearable scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different numbers of conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, while state-of-the-art industry solutions reach similar quality (32%/45%), underscoring ample room for improvement. The benchmark served as the basis for KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
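As a rough illustration of the data and task shape described above, the sketch below shows what a CRAG-MM-style (image, question, answer) triplet and a straightforward single-source RAG loop might look like in Python. All names here (CragMMExample, search_image_kg, generate_answer) and the example values are hypothetical stand-ins, not the benchmark's actual API or data.

# Hypothetical sketch of a CRAG-MM-style example and a simple single-source RAG loop.
# Names and values are illustrative; they do not reflect the benchmark's real API.
from dataclasses import dataclass
from typing import List

@dataclass
class CragMMExample:
    image_path: str   # egocentric capture, e.g. from smart glasses
    question: str     # one of the benchmark's six question types
    answer: str       # ground-truth answer used for truthfulness scoring
    domain: str       # one of the 13 domains
    turn_id: int = 0  # 0 for single-turn; >0 within a multi-turn conversation

def search_image_kg(image_path: str, top_k: int = 3) -> List[str]:
    """Stand-in for an image-KG retrieval API: returns text snippets
    about entities recognized in the image."""
    return [f"snippet about an entity in {image_path} (#{i})" for i in range(top_k)]

def generate_answer(question: str, context: List[str]) -> str:
    """Stand-in for a vision-language model call; here it just echoes its inputs."""
    return f"Answer to '{question}' grounded in {len(context)} retrieved snippets."

def answer_single_source(example: CragMMExample) -> str:
    # Straightforward RAG: retrieve from the image KG, then generate.
    context = search_image_kg(example.image_path)
    return generate_answer(example.question, context)

if __name__ == "__main__":
    ex = CragMMExample("capture_001.jpg", "What brand is this jacket?", "(hypothetical)", "shopping")
    print(answer_single_source(ex))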
Abstract: In this paper, we investigate the problem of "generation supervision" in large language models and present a novel bicameral architecture that separates supervision signals from the models' core capability, helpfulness. Doppelgänger, a new module parallel to the underlying language model, supervises the generation of each token and learns to concurrently predict the supervision score(s) of the sequence up to and including each token. In this work, we present the theoretical findings and leave the report on experimental results to a forthcoming publication.
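To make the bicameral idea concrete, here is a minimal PyTorch sketch of a supervision module running in parallel with a base language model, reading its hidden states and predicting a supervision score for every prefix of the sequence. The module name, wiring, and sizes are our own illustrative assumptions, not the paper's actual architecture.

# Illustrative sketch only: a parallel head that scores each prefix of a sequence.
import torch
import torch.nn as nn

class DoppelgangerHead(nn.Module):
    """Reads the base LM's hidden states and emits one supervision score per
    token, i.e., a score for the sequence up to and including that token."""
    def __init__(self, hidden_size: int, num_scores: int = 1):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, num_scores),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the base LM.
        return self.scorer(hidden_states)  # (batch, seq_len, num_scores)

if __name__ == "__main__":
    batch, seq_len, hidden = 2, 16, 64
    states = torch.randn(batch, seq_len, hidden)  # stand-in for LM hidden states
    scores = DoppelgangerHead(hidden)(states)
    print(scores.shape)  # torch.Size([2, 16, 1]): one score per prefix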

Abstract: We address problems underlying the algorithmic question of automating the co-design of robot hardware in tandem with its apposite software. Specifically, we consider the impact that degradations of a robot's sensor and actuation suites may have on the ability of that robot to complete its tasks. We introduce a new formal structure that generalizes and consolidates a variety of well-known structures, including many forms of plans, planning problems, and filters, into a single data structure called a procrustean graph, and we give these graph structures semantics in terms of ideas based in formal language theory. We describe a collection of operations on procrustean graphs (both semantics-preserving and semantics-mutating), and show how a family of questions about the destructiveness of a change to the robot hardware can be answered by applying these operations. We also highlight the connections between this new approach and existing threads of research, including combinatorial filtering, Erdmann's strategy complexes, and hybrid automata.
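As a loose illustration of the kind of object and question involved, the sketch below models a small graph whose edges are labeled with sets of events (actions or observations), a "degradation" operation that deletes events from edge labels, and a crude reachability check for whether the degradation is destructive. The representation and operations are assumptions inferred from the abstract, not the paper's formal definitions of procrustean graphs.

# Illustrative sketch only: an event-labeled graph with a degradation operation.
from collections import deque

class PGraph:
    def __init__(self, start, goals):
        self.start = start
        self.goals = set(goals)
        self.edges = {}  # state -> list of (label_set, next_state)

    def add_edge(self, src, labels, dst):
        self.edges.setdefault(src, []).append((set(labels), dst))

    def degrade(self, lost_events):
        """Model a hardware degradation: remove the given events from every
        edge label; edges whose label set becomes empty disappear."""
        g = PGraph(self.start, self.goals)
        for src, outs in self.edges.items():
            for labels, dst in outs:
                kept = labels - set(lost_events)
                if kept:
                    g.add_edge(src, kept, dst)
        return g

    def goal_reachable(self):
        """One crude 'destructiveness' question: can a goal still be reached?"""
        seen, frontier = {self.start}, deque([self.start])
        while frontier:
            s = frontier.popleft()
            if s in self.goals:
                return True
            for _, dst in self.edges.get(s, []):
                if dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
        return False

if __name__ == "__main__":
    g = PGraph("s0", {"done"})
    g.add_edge("s0", {"move", "sense"}, "s1")
    g.add_edge("s1", {"grasp"}, "done")
    print(g.goal_reachable())                     # True
    print(g.degrade({"grasp"}).goal_reachable())  # False: losing the gripper is destructive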