Haodong Duan

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Jun 17, 2026
Via arXiv

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Jun 09, 2026
Via arXiv

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models

May 26, 2026
Via arXiv

OpenCompass: A Universal Evaluation Platform for Large Language Models

May 19, 2026
Via arXiv

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

May 13, 2026
Via arXiv

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

May 11, 2026
Via arXiv

VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Apr 02, 2026
Via arXiv

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

Mar 29, 2026
Via arXiv

PosterIQ: A Design Perspective Benchmark for Poster Understanding and Generation

Mar 25, 2026
Via arXiv

MIBench: Evaluating LMMs on Multimodal Interaction

Mar 13, 2026
Via arXiv