
Xiaojian Ma

LEO-VL: Towards 3D Vision-Language Generalists via Data Scaling with Efficient Representation

Jun 11, 2025

From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes

Jun 05, 2025

FlowDreamer: A RGB-D World Model with Flow-based Motion Representations for Robot Manipulation

May 15, 2025

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

May 06, 2025

Iterative Trajectory Exploration for Multimodal Agents

Apr 30, 2025

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

Apr 17, 2025

Building LLM Agents by Incorporating Insights from Computer Systems

Apr 06, 2025

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse

Mar 20, 2025

LongViTU: Instruction Tuning for Long-Form Video Understanding

Jan 09, 2025

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Dec 31, 2024