Zhe Gan

UniGen-1.5: Enhancing Image Generation and Editing through Reward Unification in Reinforcement Learning

Nov 18, 2025

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Oct 22, 2025

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Oct 14, 2025

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Sep 30, 2025

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Sep 19, 2025

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

May 16, 2025

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Mar 27, 2025

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Mar 16, 2025

From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons

Dec 11, 2024

Multimodal Autoregressive Pre-training of Large Vision Encoders

Nov 21, 2024