Picture for Xizhou Zhu

Xizhou Zhu

CoMemo: LVLMs Need Image Context with Image Memory

Add code
Jun 06, 2025
Viaarxiv icon

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Add code
May 29, 2025
Viaarxiv icon

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models

Add code
Apr 21, 2025
Viaarxiv icon

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Add code
Apr 15, 2025
Viaarxiv icon

LangBridge: Interpreting Image as a Combination of Language Embeddings

Add code
Mar 26, 2025
Viaarxiv icon

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

Add code
Mar 25, 2025
Viaarxiv icon

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

Add code
Mar 13, 2025
Viaarxiv icon

Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding

Add code
Jan 14, 2025
Viaarxiv icon

A Flexible and Scalable Framework for Video Moment Search

Add code
Jan 09, 2025
Viaarxiv icon

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Add code
Dec 20, 2024
Viaarxiv icon