Ya Guo

EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs

Dec 11, 2025
Via arXiv

Keep the General, Inject the Specific: Structured Dialogue Fine-Tuning for Knowledge Injection without Catastrophic Forgetting

Apr 27, 2025
Via arXiv

InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models

Feb 19, 2025
Via arXiv

Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding

Sep 29, 2024
Via arXiv

UNER: A Unified Prediction Head for Named Entity Recognition in Visually-rich Documents

Aug 02, 2024
Via arXiv

Causal Prototype-inspired Contrast Adaptation for Unsupervised Domain Adaptive Semantic Segmentation of High-resolution Remote Sensing Imagery

Mar 06, 2024
Via arXiv

Rethinking the Evaluation of Pre-trained Text-and-Layout Models from an Entity-Centric Perspective

Feb 04, 2024
Via arXiv

Reading Order Matters: Information Extraction from Visually-rich Documents by Token Path Prediction

Oct 17, 2023
Via arXiv

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

Jun 09, 2023
Via arXiv

Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level context memory

Aug 16, 2022
Via arXiv