Yan Shu

When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Jun 05, 2025
Via arXiv

VidText: Towards Comprehensive Evaluation for Video Text Understanding

May 28, 2025
Via arXiv

Visual Text Processing: A Comprehensive Review and Unified Evaluation

Apr 30, 2025
Via arXiv

TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

Oct 14, 2024
Via arXiv

First Creating Backgrounds Then Rendering Texts: A New Paradigm for Visual Text Blending

Oct 14, 2024
Via arXiv

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Sep 24, 2024
Via arXiv

MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

Jun 06, 2024
Via arXiv

Visual Text Meets Low-level Vision: A Comprehensive Survey on Visual Text Processing

Feb 05, 2024
Via arXiv

Depth-agnostic Single Image Dehazing

Jan 14, 2024
Via arXiv

CLiF-VQA: Enhancing Video Quality Assessment by Incorporating High-Level Semantic Information related to Human Feelings

Nov 13, 2023
Via arXiv