Hubert M. Pysklo

Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation

Feb 11, 2026
Via arXiv