Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Vicky Feliren

Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

Apr 13, 2026

Samuel Cahyawijaya, Peerat Limkonchotiwat, Tack Hwa Wong, Hitesh Laxmichand Patel, Amit Agarwal, Manuel Antonio Rufino, Carlos Rafael Catalan, Muhammad Reza Qorib, Vicky Feliren, Holy Lovenia(+36 more)

Abstract:While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.

Via

Access Paper or Ask Questions

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Jan 25, 2026

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate(+87 more)

Abstract:Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

* 17 pages, 7 tables, 5 figures

Via

Access Paper or Ask Questions

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Mar 10, 2025

Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib(+82 more)

Figure 1 for Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Figure 2 for Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Figure 3 for Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Figure 4 for Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Abstract:Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately ~85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.

* SEA-VL Dataset: https://huggingface.co/collections/SEACrowd/sea-vl-multicultural-vl-dataset-for-southeast-asia-67cf223d0c341d4ba2b236e7

Via

Access Paper or Ask Questions

Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery

Jan 21, 2025

Vicky Feliren, Fithrothul Khikmah, Irfan Dwiki Bhaswara, Bahrul I. Nasution, Alex M. Lechner, Muhamad Risqi U. Saputra

Figure 1 for Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery

Figure 2 for Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery

Figure 3 for Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery

Figure 4 for Progressive Cross Attention Network for Flood Segmentation using Multispectral Satellite Imagery

Abstract:In recent years, the integration of deep learning techniques with remote sensing technology has revolutionized the way natural hazards, such as floods, are monitored and managed. However, existing methods for flood segmentation using remote sensing data often overlook the utility of correlative features among multispectral satellite information. In this study, we introduce a progressive cross attention network (ProCANet), a deep learning model that progressively applies both self- and cross-attention mechanisms to multispectral features, generating optimal feature combinations for flood segmentation. The proposed model was compared with state-of-the-art approaches using Sen1Floods11 dataset and our bespoke flood data generated for the Citarum River basin, Indonesia. Our model demonstrated superior performance with the highest Intersection over Union (IoU) score of 0.815. Our results in this study, coupled with the ablation assessment comparing scenarios with and without attention across various modalities, opens a promising path for enhancing the accuracy of flood analysis using remote sensing technology.

* IEEE Geoscience and Remote Sensing Letters, vol. 22, pp. 1-5, 2025, Art no. 1500105
* 5 pages, 4 figures, published in IEEE Geoscience and Remote Sensing Letters

Via

Access Paper or Ask Questions