Michael S. Ryoo

Robotic VLA Benefits from Joint Learning with Motion Image Diffusion

Dec 19, 2025

Pixel Motion Diffusion is What We Need for Robot Control

Sep 26, 2025

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Sep 03, 2025

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Nov 22, 2024

Adaptive Caching for Faster Video Generation with Diffusion Transformers

Nov 04, 2024

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

Oct 21, 2024

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Jun 28, 2024

Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA

Jun 17, 2024

Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Apr 11, 2024

Understanding Long Videos in One Multimodal Language Model Pass

Mar 25, 2024