Picture for Jianwei Yang

Jianwei Yang

Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models

Add code
Jun 04, 2025
Viaarxiv icon

SITE: towards Spatial Intelligence Thorough Evaluation

Add code
May 08, 2025
Viaarxiv icon

Towards Understanding Graphical Perception in Large Multimodal Models

Add code
Mar 13, 2025
Viaarxiv icon

Magma: A Foundation Model for Multimodal AI Agents

Add code
Feb 18, 2025
Viaarxiv icon

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Add code
Jan 09, 2025
Figure 1 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Figure 2 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Figure 3 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Figure 4 for ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Viaarxiv icon

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

Add code
Dec 17, 2024
Viaarxiv icon

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

Add code
Dec 13, 2024
Viaarxiv icon

Mojito: Motion Trajectory and Intensity Control for Video Generation

Add code
Dec 12, 2024
Figure 1 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Figure 2 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Figure 3 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Figure 4 for Mojito: Motion Trajectory and Intensity Control for Video Generation
Viaarxiv icon

OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation

Add code
Dec 12, 2024
Viaarxiv icon

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Add code
Dec 05, 2024
Figure 1 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Figure 2 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Figure 3 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Figure 4 for Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Viaarxiv icon