Sifei Liu

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

Jul 16, 2025

Scaling RL to Long Videos

Jul 10, 2025

Describe Anything: Detailed Localized Image and Video Captioning

Apr 22, 2025

Scaling Vision Pre-Training to 4K Resolution

Mar 25, 2025

M3: 3D-Spatial MultiModal Memory

Mar 20, 2025

Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training

Feb 25, 2025

Parallel Sequence Modeling via Generalized Spatial Propagation Network

Jan 21, 2025

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Jan 14, 2025

BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations

Jan 13, 2025

NaVILA: Legged Robot Vision-Language-Action Model for Navigation

Dec 05, 2024