Zechen Bai

Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance

Mar 05, 2026

SIGMA: Selective-Interleaved Generation with Multi-Attribute Tokens

Feb 07, 2026

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Feb 06, 2026

EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models

Dec 16, 2025

Impossible Videos

Mar 18, 2025

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Nov 26, 2024

Factorized Visual Tokenization and Generation

Nov 25, 2024

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos

Sep 29, 2024

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Aug 22, 2024

GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval

Aug 14, 2024