Arman Cohan

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Jun 12, 2026
Via arXiv

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

Jun 10, 2026
Via arXiv

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Jun 03, 2026
Via arXiv

Quantifying Faithful Confidence Expression in Large Reasoning Models

Jun 02, 2026
Via arXiv

Can LLMs Use Linguistic Uncertainty Markers to Reliably Reflect Intrinsic Confidence?

May 27, 2026
Via arXiv

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

May 19, 2026
Via arXiv

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

May 18, 2026
Via arXiv

Herculean: An Agentic Benchmark for Financial Intelligence

May 14, 2026
Via arXiv

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

May 10, 2026
Via arXiv

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

May 05, 2026
Via arXiv