Hang Xu

Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs

Jun 06, 2025
Via arXiv

SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

May 25, 2025
Via arXiv

CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

Apr 28, 2025
Via arXiv

PaMi-VDPO: Mitigating Video Hallucinations by Prompt-Aware Multi-Instance Video Preference Learning

Apr 08, 2025
Via arXiv

ILLUME+: Illuminating Unified MLLM with Dual Visual Tokenization and Diffusion Refinement

Apr 03, 2025
Via arXiv

From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Mar 29, 2025
Via arXiv

DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation

Mar 27, 2025
Via arXiv

EDEN: Enhanced Diffusion for High-quality Large-motion Video Frame Interpolation

Mar 20, 2025
Via arXiv

ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

Mar 16, 2025
Via arXiv

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

Mar 12, 2025
Via arXiv