Xinyu Che

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Feb 05, 2026
Via arXiv

TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents

Feb 03, 2026
Via arXiv