Picture for Xinhan Di

Xinhan Di

Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos

Add code
Aug 12, 2025
Viaarxiv icon

JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

Add code
Jul 28, 2025
Viaarxiv icon

Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks

Add code
May 26, 2025
Viaarxiv icon

MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing

Add code
May 22, 2025
Viaarxiv icon

Towards Film-Making Production Dialogue, Narration, Monologue Adaptive Moving Dubbing Benchmarks

Add code
Apr 30, 2025
Viaarxiv icon

OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance

Add code
Apr 07, 2025
Viaarxiv icon

DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance

Add code
Mar 31, 2025
Viaarxiv icon

DeepSound-V1: Start to Think Step-by-Step in the Audio Generation from Videos

Add code
Mar 28, 2025
Viaarxiv icon

Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization

Add code
Mar 28, 2025
Viaarxiv icon

DeepAudio-V1:Towards Multi-Modal Multi-Stage End-to-End Video to Speech and Audio Generation

Add code
Mar 28, 2025
Viaarxiv icon