Sipeng Zheng

Being-M0.5: A Real-Time Controllable Vision-Language-Motion Model

Aug 11, 2025 (via arXiv)

RL from Physical Feedback: Aligning Large Motion Models with Humanoid Control

Jun 15, 2025 (via arXiv)

EgoDTM: Towards 3D-Aware Egocentric Video-Language Pretraining

Mar 19, 2025 (via arXiv)

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

Mar 10, 2025 (via arXiv)

VideoOrion: Tokenizing Object Dynamics in Videos

Nov 25, 2024 (via arXiv)

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

Oct 04, 2024 (via arXiv)

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Oct 03, 2024 (via arXiv)

QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds

Jun 24, 2024 (via arXiv)

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

May 28, 2024 (via arXiv)

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Mar 14, 2024 (via arXiv)