Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Ya Guo

Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

Apr 27, 2025

Yijie Hong, Xiaofei Yin, Xinzhong Wang, Yi Tu, Ya Guo, Sufeng Duan, Weiqiang Wang, Lingyong Fang, Depeng Wang, Huijia Zhu

Abstract:Large Vision Language Models have demonstrated impressive versatile capabilities through extensive multimodal pre-training, but face significant limitations when incorporating specialized knowledge domains beyond their training distribution. These models struggle with a fundamental dilemma: direct adaptation approaches that inject domain-specific knowledge often trigger catastrophic forgetting of foundational visual-linguistic abilities. We introduce Structured Dialogue Fine-Tuning (SDFT), an effective approach that effectively injects domain-specific knowledge while minimizing catastrophic forgetting. Drawing inspiration from supervised fine-tuning in LLMs and subject-driven personalization in text-to-image diffusion models, our method employs a three-phase dialogue structure: Foundation Preservation reinforces pre-trained visual-linguistic alignment through caption tasks; Contrastive Disambiguation introduces carefully designed counterfactual examples to maintain semantic boundaries; and Knowledge Specialization embeds specialized information through chain-of-thought reasoning. Experimental results across multiple domains confirm SDFT's effectiveness in balancing specialized knowledge acquisition with general capability retention. Our key contributions include a data-centric dialogue template that balances foundational alignment with targeted knowledge integration, a weighted multi-turn supervision framework, and comprehensive evaluation across diverse knowledge types.

* 13 pages, 3 figures

Via

Access Paper or Ask Questions

InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models

Feb 19, 2025

Xiaofei Yin, Yijie Hong, Ya Guo, Yi Tu, Weiqiang Wang, Gongshen Liu, Huijia zhu

Abstract:In the evolving landscape of multimodal language models, understanding the nuanced meanings conveyed through visual cues - such as satire, insult, or critique - remains a significant challenge. Existing evaluation benchmarks primarily focus on direct tasks like image captioning or are limited to a narrow set of categories, such as humor or satire, for deep semantic understanding. To address this gap, we introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark designed specifically for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. We propose an innovative semi-automatic method for constructing datasets, adhering to established construction protocols. Using this benchmark, we evaluate 15 open-source large vision language models (LVLMs) and GPT-4o, revealing that even the best-performing model lags behind human performance by nearly 14% in understanding implicit meaning. Our findings underscore the intrinsic challenges current LVLMs face in grasping nuanced visual semantics, highlighting significant opportunities for future research and development in this domain. We will publicly release our InsightVision dataset, code upon acceptance of the paper.

* 19 pages, 10 figures

Via

Access Paper or Ask Questions

Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding

Sep 29, 2024

Chong Zhang, Yi Tu, Yixi Zhao, Chenshu Yuan, Huan Chen, Yue Zhang, Mingxu Chai, Ya Guo, Huijia Zhu, Qi Zhang(+1 more)

Abstract:Modeling and leveraging layout reading order in visually-rich documents (VrDs) is critical in document intelligence as it captures the rich structure semantics within documents. Previous works typically formulated layout reading order as a permutation of layout elements, i.e. a sequence containing all the layout elements. However, we argue that this formulation does not adequately convey the complete reading order information in the layout, which may potentially lead to performance decline in downstream VrD tasks. To address this issue, we propose to model the layout reading order as ordering relations over the set of layout elements, which have sufficient expressive capability for the complete reading order information. To enable empirical evaluation on methods towards the improved form of reading order prediction (ROP), we establish a comprehensive benchmark dataset including the reading order annotation as relations over layout elements, together with a relation-extraction-based method that outperforms previous methods. Moreover, to highlight the practical benefits of introducing the improved form of layout reading order, we propose a reading-order-relation-enhancing pipeline to improve model performance on any arbitrary VrD task by introducing additional reading order relation inputs. Comprehensive results demonstrate that the pipeline generally benefits downstream VrD tasks: (1) with utilizing the reading order relation information, the enhanced downstream models achieve SOTA results on both two task settings of the targeted dataset; (2) with utilizing the pseudo reading order information generated by the proposed ROP model, the performance of the enhanced models has improved across all three models and eight cross-domain VrD-IE/QA task settings without targeted optimization.

* Accepted as a long paper in the main conference of EMNLP 2024

Via

Access Paper or Ask Questions

UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Aug 02, 2024

Yi Tu, Chong Zhang, Ya Guo, Huan Chen, Jinyang Tang, Huijia Zhu, Qi Zhang

Figure 1 for UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Figure 2 for UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Figure 3 for UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Figure 4 for UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Abstract:The recognition of named entities in visually-rich documents (VrD-NER) plays a critical role in various real-world scenarios and applications. However, the research in VrD-NER faces three major challenges: complex document layouts, incorrect reading orders, and unsuitable task formulations. To address these challenges, we propose a query-aware entity extraction head, namely UNER, to collaborate with existing multi-modal document transformers to develop more robust VrD-NER models. The UNER head considers the VrD-NER task as a combination of sequence labeling and reading order prediction, effectively addressing the issues of discontinuous entities in documents. Experimental evaluations on diverse datasets demonstrate the effectiveness of UNER in improving entity extraction performance. Moreover, the UNER head enables a supervised pre-training stage on various VrD-NER datasets to enhance the document transformer backbones and exhibits substantial knowledge transfer from the pre-training stage to the fine-tuning stage. By incorporating universal layout understanding, a pre-trained UNER-based model demonstrates significant advantages in few-shot and cross-linguistic scenarios and exhibits zero-shot entity extraction abilities.

* accepted by ACM Multimedia 2024

Via

Access Paper or Ask Questions

Causal Prototype-inspired Contrast Adaptation for Unsupervised Domain Adaptive Semantic Segmentation of High-resolution Remote Sensing Imagery

Mar 06, 2024

Jingru Zhu, Ya Guo, Geng Sun, Liang Hong, Jie Chen

Abstract:Semantic segmentation of high-resolution remote sensing imagery (HRSI) suffers from the domain shift, resulting in poor performance of the model in another unseen domain. Unsupervised domain adaptive (UDA) semantic segmentation aims to adapt the semantic segmentation model trained on the labeled source domain to an unlabeled target domain. However, the existing UDA semantic segmentation models tend to align pixels or features based on statistical information related to labels in source and target domain data, and make predictions accordingly, which leads to uncertainty and fragility of prediction results. In this paper, we propose a causal prototype-inspired contrast adaptation (CPCA) method to explore the invariant causal mechanisms between different HRSIs domains and their semantic labels. It firstly disentangles causal features and bias features from the source and target domain images through a causal feature disentanglement module. Then, a causal prototypical contrast module is used to learn domain invariant causal features. To further de-correlate causal and bias features, a causal intervention module is introduced to intervene on the bias features to generate counterfactual unbiased samples. By forcing the causal features to meet the principles of separability, invariance and intervention, CPCA can simulate the causal factors of source and target domains, and make decisions on the target domain based on the causal features, which can observe improved generalization ability. Extensive experiments under three cross-domain tasks indicate that CPCA is remarkably superior to the state-of-the-art methods.

* 16 pages, 8 figures and 7 tables, submitted to IEEE Transactions on Geoscience and Remote Sensing, 2024

Via

Access Paper or Ask Questions

Rethinking the Evaluation of Pre-trained Text-and-Layout Models from an Entity-Centric Perspective

Feb 04, 2024

Chong Zhang, Yixi Zhao, Chenshu Yuan, Yi Tu, Ya Guo, Qi Zhang

Abstract:Recently developed pre-trained text-and-layout models (PTLMs) have shown remarkable success in multiple information extraction tasks on visually-rich documents. However, the prevailing evaluation pipeline may not be sufficiently robust for assessing the information extraction ability of PTLMs, due to inadequate annotations within the benchmarks. Therefore, we claim the necessary standards for an ideal benchmark to evaluate the information extraction ability of PTLMs. We then introduce EC-FUNSD, an entity-centric benckmark designed for the evaluation of semantic entity recognition and entity linking on visually-rich documents. This dataset contains diverse formats of document layouts and annotations of semantic-driven entities and their relations. Moreover, this dataset disentangles the falsely coupled annotation of segment and entity that arises from the block-level annotation of FUNSD. Experiment results demonstrate that state-of-the-art PTLMs exhibit overfitting tendencies on the prevailing benchmarks, as their performance sharply decrease when the dataset bias is removed.

Via

Access Paper or Ask Questions

Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction

Oct 17, 2023

Chong Zhang, Ya Guo, Yi Tu, Huan Chen, Jinyang Tang, Huijia Zhu, Qi Zhang, Tao Gui

Abstract:Recent advances in multimodal pre-trained models have significantly improved information extraction from visually-rich documents (VrDs), in which named entity recognition (NER) is treated as a sequence-labeling task of predicting the BIO entity tags for tokens, following the typical setting of NLP. However, BIO-tagging scheme relies on the correct order of model inputs, which is not guaranteed in real-world NER on scanned VrDs where text are recognized and arranged by OCR systems. Such reading order issue hinders the accurate marking of entities by BIO-tagging scheme, making it impossible for sequence-labeling methods to predict correct named entities. To address the reading order issue, we introduce Token Path Prediction (TPP), a simple prediction head to predict entity mentions as token sequences within documents. Alternative to token classification, TPP models the document layout as a complete directed graph of tokens, and predicts token paths within the graph as entities. For better evaluation of VrD-NER systems, we also propose two revised benchmark datasets of NER on scanned documents which can reflect real-world scenarios. Experiment results demonstrate the effectiveness of our method, and suggest its potential to be a universal solution to various information extraction tasks on documents.

* Accepted as a long paper in the main conference of EMNLP 2023

Via

Access Paper or Ask Questions

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Jun 09, 2023

Yi Tu, Ya Guo, Huan Chen, Jinyang Tang

Figure 1 for LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Figure 2 for LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Figure 3 for LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Figure 4 for LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Abstract:Visually-rich Document Understanding (VrDU) has attracted much research attention over the past years. Pre-trained models on a large number of document images with transformer-based backbones have led to significant performance gains in this field. The major challenge is how to fusion the different modalities (text, layout, and image) of the documents in a unified model with different pre-training tasks. This paper focuses on improving text-layout interactions and proposes a novel multi-modal pre-training model, LayoutMask. LayoutMask uses local 1D position, instead of global 1D position, as layout input and has two pre-training objectives: (1) Masked Language Modeling: predicting masked tokens with two novel masking strategies; (2) Masked Position Modeling: predicting masked 2D positions to improve layout representation learning. LayoutMask can enhance the interactions between text and layout modalities in a unified model and produce adaptive and robust multi-modal representations for downstream tasks. Experimental results show that our proposed method can achieve state-of-the-art results on a wide variety of VrDU problems, including form understanding, receipt understanding, and document image classification.

* Accepted by ACL 2023 main conference

Via

Access Paper or Ask Questions

Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory

Aug 16, 2022

Jingru Zhu, Ya Guo, Geng Sun, Libo Yang, Min Deng, Jie Chen

Figure 1 for Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory

Figure 2 for Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory

Figure 3 for Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory

Figure 4 for Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory

Abstract:Semantic segmentation is a key technique involved in automatic interpretation of high-resolution remote sensing (HRS) imagery and has drawn much attention in the remote sensing community. Deep convolutional neural networks (DCNNs) have been successfully applied to the HRS imagery semantic segmentation task due to their hierarchical representation ability. However, the heavy dependency on a large number of training data with dense annotation and the sensitiveness to the variation of data distribution severely restrict the potential application of DCNNs for the semantic segmentation of HRS imagery. This study proposes a novel unsupervised domain adaptation semantic segmentation network (MemoryAdaptNet) for the semantic segmentation of HRS imagery. MemoryAdaptNet constructs an output space adversarial learning scheme to bridge the domain distribution discrepancy between source domain and target domain and to narrow the influence of domain shift. Specifically, we embed an invariant feature memory module to store invariant domain-level context information because the features obtained from adversarial learning only tend to represent the variant feature of current limited inputs. This module is integrated by a category attention-driven invariant domain-level context aggregation module to current pseudo invariant feature for further augmenting the pixel representations. An entropy-based pseudo label filtering strategy is used to update the memory module with high-confident pseudo invariant feature of current target images. Extensive experiments under three cross-domain tasks indicate that our proposed MemoryAdaptNet is remarkably superior to the state-of-the-art methods.

* Submitted to IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 17 pages, 12 figures and 8 tables

Via

Access Paper or Ask Questions

A comprehensive benchmark analysis for sand dust image reconstruction

Feb 07, 2022

Yazhong Si, Fan Yang, Ya Guo, Wei Zhang, Yipu Yang

Figure 1 for A comprehensive benchmark analysis for sand dust image reconstruction

Figure 2 for A comprehensive benchmark analysis for sand dust image reconstruction

Figure 3 for A comprehensive benchmark analysis for sand dust image reconstruction

Figure 4 for A comprehensive benchmark analysis for sand dust image reconstruction

Abstract:Numerous sand dust image enhancement algorithms have been proposed in recent years. To our best acknowledge, however, most methods evaluated their performance with no-reference way using few selected real-world images from internet. It is unclear how to quantitatively analysis the performance of the algorithms in a supervised way and how we could gauge the progress in the field. Moreover, due to the absence of large-scale benchmark datasets, there are no well-known reports of data-driven based method for sand dust image enhancement up till now. To advance the development of deep learning-based algorithms for sand dust image reconstruction, while enabling supervised objective evaluation of algorithm performance. In this paper, we presented a comprehensive perceptual study and analysis of real-world sand dust images, then constructed a Sand-dust Image Reconstruction Benchmark (SIRB) for training Convolutional Neural Networks (CNNs) and evaluating algorithms performance. In addition, we adopted the existing image transformation neural network trained on SIRB as baseline to illustrate the generalization of SIRB for training CNNs. Finally, we conducted the qualitative and quantitative evaluation to demonstrate the performance and limitations of the state-of-the-arts (SOTA), which shed light on future research in sand dust image reconstruction.

* 13 pages, 12 figures

Via

Access Paper or Ask Questions