Zhe Gan

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Oct 22, 2025

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Oct 14, 2025

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Sep 30, 2025

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Sep 19, 2025

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

May 16, 2025

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Mar 27, 2025

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Mar 16, 2025

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Dec 11, 2024

Multimodal Autoregressive Pre-training of Large Vision Encoders

Nov 21, 2024

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Oct 24, 2024