Jesse Hu

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

Jun 05, 2026
Via arXiv

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Jan 17, 2026
Via arXiv