Abstract:Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
Abstract:Instance segmentation is applied widely in image editing, image analysis and autonomous driving, etc. However, insufficient data is a common problem in practical applications. The Visual Inductive Priors(VIPriors) Instance Segmentation Challenge has focused on this problem. VIPriors for Data-Efficient Computer Vision Challenges ask competitors to train models from scratch in a data-deficient setting, but there are some visual inductive priors that can be used. In order to address the VIPriors instance segmentation problem, we designed a Task-Specific Data Augmentation(TS-DA) strategy and Inference Processing(TS-IP) strategy. The main purpose of task-specific data augmentation strategy is to tackle the data-deficient problem. And in order to make the most of visual inductive priors, we designed a task-specific inference processing strategy. We demonstrate the applicability of proposed method on VIPriors Instance Segmentation Challenge. The segmentation model applied is Hybrid Task Cascade based detector on the Swin-Base based CBNetV2 backbone. Experimental results demonstrate that proposed method can achieve a competitive result on the test set of 2022 VIPriors Instance Segmentation Challenge, with 0.531 AP@0.50:0.95.