Makarand Tapaswi

CVIT, IIIT Hyderabad

STRinGS: Selective Text Refinement in Gaussian Splatting

Dec 08, 2025

Abhinav Raundhal, Gaurav Behera, P J Narayanan, Ravi Kiran Sarvadevabhatla, Makarand Tapaswi

Abstract: Text in the form of signs, labels, or instructions is a critical element of real-world scenes, as it can convey important contextual information. 3D representations such as 3D Gaussian Splatting (3DGS) achieve high visual fidelity but struggle to preserve fine-grained text details. Small errors in textual element reconstruction can lead to significant semantic loss. We propose STRinGS, a text-aware, selective refinement framework to address this issue for 3DGS reconstruction. Our method treats text and non-text regions separately, refining text regions first and merging them with non-text regions later for full-scene optimization. STRinGS produces sharp, readable text even in challenging configurations. We introduce a text readability measure, OCR Character Error Rate (CER), to evaluate efficacy on text regions. STRinGS results in a 63.6% relative improvement over 3DGS at just 7K iterations. We also introduce a curated dataset, STRinGS-360, with diverse text scenarios to evaluate text readability in 3D reconstruction. Our method and dataset together push the boundaries of 3D scene understanding in text-rich environments, paving the way for more robust text-aware reconstruction methods.

* Accepted to WACV 2026. Project Page, see https://STRinGS-official.github.io
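
The STRinGS abstract above measures text readability with OCR Character Error Rate (CER). The following is a minimal sketch, assuming the common definition of CER as the character-level edit distance between the OCR output and the reference text, divided by the reference length, and assuming "relative improvement" means the relative reduction in CER. The paper's exact OCR pipeline and text normalization are not stated here, and the function names and sample values below are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate of an OCR reading against the reference text."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def relative_improvement(baseline_cer: float, new_cer: float) -> float:
    """Relative reduction in CER (assumed meaning of 'relative improvement')."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    # Hypothetical sign text and OCR reading: one substituted character.
    print(cer("EXIT", "EX1T"))               # 0.25
    # Hypothetical CER values, not results from the paper.
    print(relative_improvement(0.5, 0.25))   # 0.5
```

Lower CER is better, so a positive relative improvement corresponds to a drop in CER with respect to the baseline.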

MALeR: Improving Compositional Fidelity in Layout-Guided Generation

Nov 08, 2025

Shivank Saxena, Dhruv Srivastava, Makarand Tapaswi

Abstract: Recent advances in text-to-image models have enabled a new era of creative and controllable image generation. However, generating compositional scenes with multiple subjects and attributes remains a significant challenge. To enhance user control over subject placement, several layout-guided methods have been proposed. However, these methods face numerous challenges, particularly in compositional scenes. Unintended subjects often appear outside the layouts, generated images can be out-of-distribution and contain unnatural artifacts, or attributes bleed across subjects, leading to incorrect visual outputs. In this work, we propose MALeR, a method that addresses each of these challenges. Given a text prompt and corresponding layouts, our method prevents subjects from appearing outside the given layouts while being in-distribution. Additionally, we propose a masked, attribute-aware binding mechanism that prevents attribute leakage, enabling accurate rendering of subjects with multiple attributes, even in complex compositional scenes. Qualitative and quantitative evaluation demonstrates that our method achieves superior performance in compositional accuracy, generation consistency, and attribute binding compared to previous work. MALeR is particularly adept at generating images of scenes with multiple subjects and multiple attributes per subject.

* ACM TOG Dec 2025, SIGGRAPH Asia. Project page: https://katha-ai.github.io/projects/maler/

What You See is What You Ask: Evaluating Audio Descriptions

Oct 01, 2025

Divy Kala, Eshika Khandelwal, Makarand Tapaswi

Abstract: Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand narratives and appreciate visual details. Existing works in automatic AD generation mostly focus on few-second trimmed clips, and evaluate them by comparing against a single ground-truth reference AD. However, writing ADs is inherently subjective. Through alignment and analysis of two independent AD tracks for the same movies, we quantify the subjectivity in when and whether to describe, and what and how to highlight. Thus, we show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute-long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.

* EMNLP 2025 Main Track Long Paper

Investigating Mechanisms for In-Context Vision Language Binding

May 28, 2025

Darshana Saravanan, Makarand Tapaswi, Vineet Gandhi

Abstract: To understand a prompt, Vision-Language models (VLMs) must perceive the image, comprehend the text, and build associations within and across both modalities. For instance, given an 'image of a red toy car', the model should associate this image with phrases like 'car', 'red toy', 'red object', etc. Feng and Steinhardt propose the Binding ID mechanism in LLMs, suggesting that the entity and its corresponding attribute tokens share a Binding ID in the model activations. We investigate this for image-text binding in VLMs using a synthetic dataset and task that requires models to associate 3D objects in an image with their descriptions in the text. Our experiments demonstrate that VLMs assign a distinct Binding ID to an object's image tokens and its textual references, enabling in-context association.

* Accepted to MIV at CVPRW 2025 (Oral)

The Sound of Water: Inferring Physical Properties from Pouring Liquids

Nov 18, 2024

Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman

Abstract: We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids. Given only the sound of liquid pouring into a container, our objective is to automatically infer physical properties such as the liquid level, the shape and size of the container, the pouring rate and the time to fill. To this end, we: (i) show in theory that these properties can be determined from the fundamental frequency (pitch); (ii) train a pitch detection model with supervision from simulated data and visual data with a physics-inspired objective; (iii) introduce a new large dataset of real pouring videos for a systematic study; (iv) show that the trained model can indeed infer these physical properties for real data; and finally, (v) demonstrate strong generalization to various container shapes, other datasets, and in-the-wild YouTube videos. Our work presents a keen understanding of a narrow yet rich problem at the intersection of acoustics, physics, and learning. It opens up applications to enhance multisensory perception in robotic pouring.

* 25 pages, 17 figures. Project page at https://bpiyush.github.io/pouring-water-website

IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark

Nov 12, 2024

Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal

Abstract: Recent evaluations of LLMs on coreference resolution have revealed that traditional output formats and evaluation metrics do not fully capture the models' referential understanding. To address this, we introduce IdentifyMe, a new benchmark for mention resolution presented in a multiple-choice question (MCQ) format, commonly used for evaluating LLMs. IdentifyMe features long narratives and employs heuristics to exclude easily identifiable mentions, creating a more challenging task. The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance. We evaluate both closed- and open-source LLMs on IdentifyMe and observe a significant performance gap (20-30%) between state-of-the-art sub-10B open models and closed ones. We observe that pronominal mentions, which have limited surface information, are typically much harder for models to resolve than nominal mentions. Additionally, we find that LLMs often confuse entities when their mentions overlap in nested structures. The highest-scoring model, GPT-4o, achieves 81.9% accuracy, highlighting the strong referential capabilities of state-of-the-art LLMs while also indicating room for further improvement.

* 9 pages, 5 figures

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Sep 04, 2024

Manu Gaur, Darshan Singh S, Makarand Tapaswi

Abstract: Image captioning systems are unable to generate fine-grained captions as they are trained on data that is either noisy (alt-text) or generic (human annotations). This is further exacerbated by maximum likelihood training that encourages generation of frequently occurring phrases. Previous works have tried to address this limitation by fine-tuning captioners with a self-retrieval (SR) reward. However, we find that SR fine-tuning has a tendency to reduce caption faithfulness and even hallucinate. In this work, we circumvent this bottleneck by improving the MLE initialization of the captioning system and designing a curriculum for the SR fine-tuning process. To this end, we present (1) Visual Caption Boosting, a novel framework to instill fine-grainedness in generic image captioning datasets while remaining anchored in human annotations; and (2) BagCurri, a carefully designed training curriculum that better leverages the contrastive nature of the self-retrieval reward. Jointly, they enable the captioner to describe fine-grained aspects in the image while preserving faithfulness to ground-truth captions. Our approach outperforms previous work by +8.9% on SR against 99 random distractors (RD100) (Dessi et al., 2023); and +7.6% on ImageCoDe. Additionally, existing metrics to evaluate captioning systems fail to reward diversity or evaluate a model's fine-grained understanding ability. Our third contribution addresses this by proposing self-retrieval from the lens of evaluation. We introduce TrueMatch, a benchmark comprising bags of highly similar images that uses SR to assess the captioner's ability to capture subtle visual distinctions. We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g. +4.8% to +7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
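
Self-retrieval (SR), as used in the abstract above, checks whether a generated caption retrieves its own image from a pool that also contains distractor images. The following is a minimal sketch of that evaluation, assuming captions and images have already been embedded into a shared space by some encoder and are ranked by cosine similarity. The scorer the authors use is not specified here, so the embeddings, function names, and toy vectors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieves_own_image(caption_vec: Vector,
                        image_vecs: List[Vector],
                        target_index: int) -> bool:
    """True if the caption's best-matching image in the pool is its own."""
    scores = [cosine(caption_vec, image_vec) for image_vec in image_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best == target_index


def recall_at_1(examples: List[Tuple[Vector, List[Vector], int]]) -> float:
    """Fraction of captions that rank their own image first."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(retrieves_own_image(c, pool, t) for c, pool, t in examples)
    return hits / len(examples)


if __name__ == "__main__":
    # Toy pool: one target plus two distractors (RD100 would use 99).
    pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    examples = [
        ([0.9, 0.1], pool, 0),  # specific caption: retrieves image 0
        ([0.6, 0.6], pool, 1),  # generic caption: confused with image 2
    ]
    print(recall_at_1(examples))  # 0.5
```

In the toy example the second caption is closer to a distractor than to its own image, which is the failure that a generic, non-fine-grained caption would produce among highly similar images.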

Major Entity Identification: A Generalizable Alternative to Coreference Resolution

Jun 20, 2024

Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi

Abstract: The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task's broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.

* 16 pages, 6 figures

VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?

Jun 16, 2024

Darshana Saravanan, Darshan Singh, Varun Gupta, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

Abstract: Compositionality is a fundamental aspect of vision-language understanding and is especially required for videos since they contain multiple entities (e.g. persons, actions, and scenes) interacting dynamically over time. Existing benchmarks focus primarily on perception capabilities. However, they do not study binding, the ability of a model to associate entities through appropriate relationships. To this end, we propose VELOCITI, a new benchmark building on complex movie clips and dense semantic role label annotations to test perception and binding in video-language models (contrastive and Video-LLMs). Our perception-based tests require discriminating video-caption pairs that share similar entities, and the binding tests require models to associate the correct entity with a given situation while ignoring the different yet plausible entities that also appear in the same video. While current state-of-the-art models perform moderately well on perception tests, accuracy is near random when both entities are present in the same video, indicating that they fail at binding tests. Even the powerful Gemini 1.5 Flash has a substantial gap (16-28%) with respect to human accuracy in such binding tests.

* 26 pages, 17 figures, 3 tables

MICap: A Unified Model for Identity-aware Movie Descriptions

May 19, 2024

Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi

Abstract: Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with "someone" (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with "someone", then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. To this end, we introduce iSPICE, a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on the Large-Scale Movie Description Challenge (LSMDC), where we show a 4.2% improvement in FITB accuracy, and a 1-2% bump in classic captioning metrics.

* CVPR 2024, Project Page: https://katha-ai.github.io/projects/micap/
