Akshar Tumu

Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models

Nov 08, 2025

Akshar Tumu, Varad Shinde, Parisa Kordjamshidi

Abstract: Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Existing analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating spatial reasoning in VLMs. This platform enables a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with a longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all these models face challenges with the task at hand, their relative behaviors depend on the underlying models and the specific categories of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.

* Accepted at IJCNLP-AACL 2025

Using Language and Road Manuals to Inform Map Reconstruction for Autonomous Driving

Jun 12, 2025

Akshar Tumu, Henrik I. Christensen, Marcell Vazquez-Chanlatte, Chikao Tsuchiya, Dhaval Bhanderi

Abstract: Lane-topology prediction is a critical component of safe and reliable autonomous navigation. An accurate understanding of the road environment aids this task. We observe that this information often follows conventions encoded in natural language, through design codes that reflect the road structure and road names that capture the road functionality. We augment SMERF, a map-prior-based online lane-topology prediction model, with this information in a lightweight manner by combining structured road metadata from OpenStreetMap (OSM) and lane-width priors from road design manuals with the road centerline encodings. We evaluate our method on two geographically diverse, complex intersection scenarios. Our method shows improvement in both lane and traffic element detection and their association. We report results using four topology-aware metrics to comprehensively assess the model performance. These results demonstrate the ability of our approach to generalize and scale to diverse topologies and conditions.

* 4 pages, 3 figures, Accepted at RSS 2025 Workshop - RobotEvaluation@RSS2025

Exploring Spatial Language Grounding Through Referring Expressions

Feb 04, 2025

Akshar Tumu, Parisa Kordjamshidi

SD++: Enhancing Standard Definition Maps by Incorporating Road Knowledge using LLMs

Feb 04, 2025

Hitvarth Diwanji, Jing-Yan Liao, Akshar Tumu, Henrik I. Christensen, Marcell Vazquez-Chanlatte, Chikao Tsuchiya

Abstract: High-definition maps (HD maps) are detailed and informative maps capturing lane centerlines and road elements. Although very useful for autonomous driving, HD maps are costly to build and maintain. Furthermore, access to these high-quality maps is usually limited to the firms that build them. On the other hand, standard-definition (SD) maps provide road centerlines with an accuracy of a few meters. In this paper, we explore the possibility of enhancing SD maps by incorporating information from road manuals using LLMs. We develop SD++, an end-to-end pipeline to enhance SD maps with location-dependent road information obtained from a road manual. We suggest and compare several ways of using LLMs for such a task. Furthermore, we demonstrate the generalization ability of SD++ by presenting results from both California and Japan.
