
Yi-Fan Zhang

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Apr 06, 2026

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?

Apr 03, 2026

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Apr 01, 2026

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Mar 26, 2026

VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents

Mar 17, 2026

Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation

Feb 12, 2026

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

Oct 10, 2025

BaseReward: A Strong Baseline for Multimodal Reward Model

Sep 19, 2025

Kwai Keye-VL Technical Report

Jul 02, 2025

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios

May 27, 2025