Chaoyou Fu

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Oct 10, 2025

BaseReward: A Strong Baseline for Multimodal Reward Model

Sep 19, 2025

Zooming from Context to Cue: Hierarchical Preference Optimization for Multi-Image MLLMs

May 28, 2025

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

May 27, 2025

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

May 27, 2025

VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

May 06, 2025

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

May 05, 2025

MME-Unify: A Comprehensive Benchmark for Unified Multimodal Understanding and Generation Models

Apr 07, 2025

Aligning Multimodal LLM with Human Preference: A Survey

Mar 18, 2025

QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension

Mar 11, 2025