Lueyang Zhang

AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios

Feb 26, 2026
Via arXiv

The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Oct 29, 2025
Via arXiv