Yinfei Yang

Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing

Oct 22, 2025
Via arXiv

DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Oct 14, 2025
Via arXiv

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Sep 30, 2025
Via arXiv

AToken: A Unified Tokenizer for Vision

Sep 19, 2025
Via arXiv

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Sep 19, 2025
Via arXiv

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

May 20, 2025
Via arXiv

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

May 16, 2025
Via arXiv

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Apr 06, 2025
Via arXiv

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Mar 27, 2025
Via arXiv

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Mar 17, 2025
Via arXiv