Zhihang Liu

RegionRAG: Region-level Retrieval-Augmented Generation for Visually-Rich Documents

Oct 31, 2025
Via arXiv

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Mar 20, 2025
Via arXiv

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

Mar 18, 2025
Via arXiv

Rethinking Video Tokenization: A Conditioned Diffusion-based Approach

Mar 05, 2025
Via arXiv

What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Coverage of MLLMs

Feb 19, 2025
Via arXiv

Hallucination Mitigation Prompts Long-term Video Understanding

Jun 17, 2024
Via arXiv

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Dec 19, 2023
Via arXiv