Jiwen Zhang

Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

Jun 13, 2026

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

Jun 09, 2026

SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation

Mar 27, 2026

MAGNET: Towards Adaptive GUI Agents with Memory-Driven Knowledge Evolution

Jan 27, 2026

A Graph Prompt Fine-Tuning Method for WSN Spatio-Temporal Correlation Anomaly Detection

Jan 19, 2026

SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation

Jan 11, 2026

A robust and compliant robotic assembly control strategy for batch precision assembly task with uncertain fit types and fit amounts

Aug 17, 2025

AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

May 27, 2025

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

Oct 07, 2024

VoCoT: Unleashing Visually Grounded Multi-Step Reasoning in Large Multi-Modal Models

May 28, 2024