Yinfei Yang

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

May 20, 2025

GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing

May 16, 2025

Advancing Egocentric Video Question Answering with Multimodal Large Language Models

Apr 06, 2025

SlowFast-LLaVA-1.5: A Family of Token-Efficient Video Large Language Models for Long-Form Video Understanding

Mar 27, 2025

MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

Mar 17, 2025

UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing

Mar 16, 2025

DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text-to-Image Generation

Mar 13, 2025

CLIP-UP: A Simple and Efficient Mixture-of-Experts CLIP Training Recipe with Sparse Upcycling

Feb 03, 2025

STIV: Scalable Text and Image Conditioned Video Generation

Dec 10, 2024

Multimodal Autoregressive Pre-training of Large Vision Encoders

Nov 21, 2024