Xiang Bai

Huazhong University of Science and Technology

I2E: From Image Pixels to Actionable Interactive Environments for Text-Guided Image Editing

Jan 07, 2026
Via arXiv

Crowded Video Individual Counting Informed by Social Grouping and Spatial-Temporal Displacement Priors

Jan 03, 2026
Via arXiv

FitControler: Toward Fit-Aware Virtual Try-On

Dec 30, 2025
Via arXiv

MindDrive: A Vision-Language-Action Model for Autonomous Driving via Online Reinforcement Learning

Dec 16, 2025
Via arXiv

DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Dec 14, 2025
Via arXiv

GenieDrive: Towards Physics-Aware Driving World Model with 4D Occupancy Guided Video Generation

Dec 14, 2025
Via arXiv

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Nov 16, 2025
Via arXiv

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression

Nov 10, 2025
Via arXiv

NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Oct 31, 2025
Via arXiv

OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Aug 27, 2025
Via arXiv