Minghong Cai

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

Jun 04, 2026
Via arXiv

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding

Jun 01, 2026
Via arXiv

In-Context Audio Control of Video Diffusion Transformers

Dec 21, 2025
Via arXiv

DebiasDiff: Debiasing Text-to-image Diffusion Models with Self-discovering Latent Attribute Directions

Dec 25, 2024
Via arXiv

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

Dec 24, 2024
Via arXiv