
Sipeng Zheng

Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Dec 15, 2025

Robust Motion Generation using Part-level Reliable Data from Videos

Dec 14, 2025

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model

Aug 11, 2025

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control

Jun 15, 2025

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

Mar 19, 2025

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Mar 10, 2025

VideoOrion: Tokenizing Object Dynamics in Videos

Nov 25, 2024

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

Oct 04, 2024

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Oct 03, 2024

QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds

Jun 24, 2024