Jingdong Wang

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

Sep 16, 2025
Via arXiv

Can Understanding and Generation Truly Benefit Together -- or Just Coexist?

Sep 11, 2025
Via arXiv

iDiT-HOI: Inpainting-based Hand Object Interaction Reenactment via Video Diffusion Transformer

Jun 15, 2025
Via arXiv

VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction

Jun 05, 2025
Via arXiv

Vision Remember: Alleviating Visual Forgetting in Efficient MLLM with Vision Feature Resample

Jun 04, 2025
Via arXiv

Hallo4: High-Fidelity Dynamic Portrait Animation via Direct Preference Optimization and Temporal Motion Modulation

May 29, 2025
Via arXiv

No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

May 05, 2025
Via arXiv

AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers

Mar 25, 2025
Via arXiv

Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers

Mar 13, 2025
Via arXiv

MagicGeo: Training-Free Text-Guided Geometric Diagram Generation

Feb 19, 2025
Via arXiv