Zuxuan Wu

Fudan University

Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

May 23, 2025

ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning

May 21, 2025

UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation

May 20, 2025

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks

Apr 26, 2025

SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL

Apr 15, 2025

Aligning Anime Video Generation with Human Feedback

Apr 14, 2025

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Mar 27, 2025

CoMP: Continual Multimodal Pre-training for Vision Foundation Models

Mar 24, 2025

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Mar 20, 2025

BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

Mar 20, 2025