Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Parsa Madinei

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

May 13, 2026

Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

Abstract:Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.

Via

Access Paper or Ask Questions

IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models

Feb 18, 2026

Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein

Abstract:We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.

Via

Access Paper or Ask Questions

ARChef: An iOS-Based Augmented Reality Cooking Assistant Powered by Multimodal Gemini LLM

Dec 01, 2024

Rithik Vir, Parsa Madinei

Abstract:Cooking meals can be difficult, causing many to use cookbooks and online recipes, which results in missing ingredients, nutritional hazards, unsatisfactory meals. Using Augmented Reality (AR) can address this issue, however, current AR cooking applications have poor user interfaces and limited accessibility. This paper proposes a prototype of an iOS application that integrates AR and Computer Vision (CV) into the cooking process. We leverage Google's Gemini Large Language Model (LLM) to identify ingredients based on the camera's field of vision, and generate recipe choices with their nutritional information. Additionally, this application uses Apple's ARKit to create an AR user interface compatible with iOS devices. Users can personalize their meal suggestions by inputting their dietary preferences and rating each meal. The application's effectiveness is evaluated through user experience surveys. This application contributes to the field of accessible cooking assistance technologies, aiming to reduce food wastage and improve the meal planning experience.

Via

Access Paper or Ask Questions