Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Zongrong Li

DamageArbiter: A CLIP-Enhanced Multimodal Arbitration Framework for Hurricane Damage Assessment from Street-View Imagery

Mar 16, 2026

Yifan Yang, Lei Zou, Wenjing Gong, Kani Fu, Zongrong Li, Siqin Wang, Bing Zhou, Heng Cai, Hao Tian

Abstract:Analyzing street-view imagery with computer vision models for rapid, hyperlocal damage assessment is becoming popular and valuable in emergency response and recovery, but traditional models often act like black boxes, lacking interpretability and reliability. This study proposes a multimodal disagreement-driven Arbitration framework powered by Contrastive Language-Image Pre-training (CLIP) models, DamageArbiter, to improve the accuracy, interpretability, and robustness of damage estimation from street-view imagery. DamageArbiter leverages the complementary strengths of unimodal and multimodal models, employing a lightweight logistic regression meta-classifier to arbitrate cases of disagreement. Using 2,556 post-disaster street-view images, paired with both manually generated and large language model (LLM)-generated text descriptions, we systematically compared the performance of unimodal models (including image-only and text-only models), multimodal CLIP-based models, and DamageArbiter. Notably, DamageArbiter improved the accuracy from 74.33% (ViT-B/32, image-only) to 82.79%, surpassing the 80% accuracy threshold and achieving an absolute improvement of 8.46% compared to the strongest baseline model. Beyond improvements in overall accuracy, compared to visual models relying solely on images, DamageArbiter, through arbitration of discrepancies between unimodal and multimodal predictions, mitigates common overconfidence errors in visual models, especially in situations where disaster visual cues are ambiguous or subject to interference, reducing overconfidence but incorrect predictions. We further mapped and analyzed geo-referenced predictions and misclassifications to compare model performance across locations. Overall, this work advances street-view-based disaster assessment from coarse severity classification toward a more reliable and interpretable framework.

Via

Access Paper or Ask Questions

StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model

Nov 19, 2024

Zongrong Li, Junhao Xu, Siqin Wang, Yifan Wu, Haiyang Li

Figure 1 for StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model

Figure 2 for StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model

Figure 3 for StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model

Figure 4 for StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model

Abstract:Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health. Traditional machine learning methods often face limitations when handling unstructured or multi-modal data like street view imagery. To address these challenges, we propose StreetViewLLM, a novel framework that integrates a large language model with the chain-of-thought reasoning and multimodal data sources. By combining street view imagery with geographic coordinates and textual data, StreetViewLLM improves the precision and granularity of geospatial predictions. Using retrieval-augmented generation techniques, our approach enhances geographic information extraction, enabling a detailed analysis of urban environments. The model has been applied to seven global cities, including Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris, demonstrating superior performance in predicting urban indicators, including population density, accessibility to healthcare, normalized difference vegetation index, building height, and impervious surface. The results show that StreetViewLLM consistently outperforms baseline models, offering improved predictive accuracy and deeper insights into the built environment. This research opens new opportunities for integrating the large language model into urban analytics, decision-making in urban planning, infrastructure management, and environmental monitoring.

Via

Access Paper or Ask Questions

BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Mode

Sep 29, 2024

Zongrong Li, Yunlei Su, Chenyuan Zhu, Wufan Zhao

Figure 1 for BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Mode

Figure 2 for BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Mode

Figure 3 for BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Mode

Figure 4 for BuildingView: Constructing Urban Building Exteriors Databases with Street View Imagery and Multimodal Large Language Mode

Abstract:Urban Building Exteriors are increasingly important in urban analytics, driven by advancements in Street View Imagery and its integration with urban research. Multimodal Large Language Models (LLMs) offer powerful tools for urban annotation, enabling deeper insights into urban environments. However, challenges remain in creating accurate and detailed urban building exterior databases, identifying critical indicators for energy efficiency, environmental sustainability, and human-centric design, and systematically organizing these indicators. To address these challenges, we propose BuildingView, a novel approach that integrates high-resolution visual data from Google Street View with spatial information from OpenStreetMap via the Overpass API. This research improves the accuracy of urban building exterior data, identifies key sustainability and design indicators, and develops a framework for their extraction and categorization. Our methodology includes a systematic literature review, building and Street View sampling, and annotation using the ChatGPT-4O API. The resulting database, validated with data from New York City, Amsterdam, and Singapore, provides a comprehensive tool for urban studies, supporting informed decision-making in urban planning, architectural design, and environmental policy. The code for BuildingView is available at https://github.com/Jasper0122/BuildingView.

* 8 pages, 6 figures

Via

Access Paper or Ask Questions