Bingyi Kang

SpatialTrackerV2: 3D Point Tracking Made Easy

Jul 16, 2025
Via arXiv

$\text{M}^{\text{3}}$: A Modular World Model over Streams of Tokens

Feb 20, 2025
Via arXiv

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

Jan 21, 2025
Via arXiv

VideoWorld: Exploring Knowledge Learning from Unlabeled Videos

Jan 16, 2025
Via arXiv

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Dec 18, 2024
Via arXiv

Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

Dec 18, 2024
Via arXiv

Image Understanding Makes for A Good Tokenizer for Image Generation

Nov 07, 2024
Via arXiv

DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution

Nov 04, 2024
Via arXiv

How Far is Video Generation from World Model: A Physical Law Perspective

Nov 04, 2024
Via arXiv

Loong: Generating Minute-level Long Videos with Autoregressive Language Models

Oct 03, 2024
Via arXiv