Pei Fu

IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation

Mar 11, 2026
Via arXiv

EMO-R3: Reflective Reinforcement Learning for Emotional Reasoning in Multimodal Large Language Models

Feb 27, 2026
Via arXiv

PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration

Feb 22, 2026
Via arXiv

GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models

Jan 26, 2026
Via arXiv

Xiaomi MiMo-VL-Miloco Technical Report

Dec 22, 2025
Via arXiv

HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

Oct 31, 2025
Via arXiv

BTL-UI: Blink-Think-Link Reasoning Model for GUI Agent

Sep 19, 2025
Via arXiv

Marten: Visual Question Answering with Mask Generation for Multi-modal Document Understanding

Mar 18, 2025
Via arXiv

A Token-level Text Image Foundation Model for Document Understanding

Mar 04, 2025
Via arXiv

Multimodal Large Language Models for Text-rich Image Understanding: A Comprehensive Review

Feb 23, 2025
Via arXiv