Mingzhen Sun

ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation

Nov 15, 2025
Via arXiv

AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion

Mar 10, 2025
Via arXiv

I2VControl-Camera: Precise Video Camera Control with Adjustable Motion Strength

Nov 26, 2024
Via arXiv

I2VControl: Disentangled and Unified Video Motion Synthesis Control

Nov 26, 2024
Via arXiv

COMUNI: Decomposing Common and Unique Video Signals for Diffusion-based Video Generation

Oct 02, 2024
Via arXiv

MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Oct 02, 2024
Via arXiv

VL-Mamba: Exploring State Space Models for Multimodal Learning

Mar 20, 2024
Via arXiv

GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER

Sep 23, 2023
Via arXiv

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

May 29, 2023
Via arXiv

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

Mar 16, 2023
Via arXiv