Tong He

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
May 14, 2026

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
May 14, 2026

VINO: A Unified Visual Generator with Interleaved OmniModal Context
Jan 05, 2026

Yume-1.5: A Text-Controlled Interactive World Generation Model
Dec 26, 2025

WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool
Sep 05, 2025

Learning Primitive Embodied World Models: Towards Scalable Robotic Learning
Aug 28, 2025

Yume: An Interactive World Generation Model
Jul 23, 2025

$π^3$: Scalable Permutation-Equivariant Visual Geometry Learning
Jul 17, 2025

DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation
Jul 03, 2025

VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Jul 01, 2025