
Shanghang Zhang

HumanoidVerse: A Versatile Humanoid for Vision-Language Guided Multi-Object Rearrangement

Aug 23, 2025

$NavA^3$: Understanding Any Instruction, Navigating Anywhere, Finding Anything

Aug 06, 2025

RoboBrain 2.0 Technical Report

Jul 02, 2025

AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

Jul 02, 2025

MinD: Unified Visual Imagination and Control via Hierarchical World Models

Jun 23, 2025

Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought

Jun 12, 2025

Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs

Jun 12, 2025

SpikePingpong: High-Frequency Spike Vision-based Robot Learning for Precise Striking in Table Tennis Game

Jun 07, 2025

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Jun 04, 2025

GeoDrive: 3D Geometry-Informed Driving World Model with Precise Action Control

May 29, 2025