Graham Neubig

Carnegie Mellon University

FieldWorkArena: Agentic AI Benchmark for Real Field Work Tasks

May 26, 2025

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

May 15, 2025

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Apr 15, 2025

Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering

Apr 10, 2025

Inducing Programmatic Skills for Agentic Tasks

Apr 09, 2025

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Apr 09, 2025

M-Prometheus: A Suite of Open Multilingual LLM Judges

Apr 07, 2025

Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators

Mar 25, 2025

Overtrained Language Models Are Harder to Fine-Tune

Mar 24, 2025

Benchmarking Failures in Tool-Augmented Language Models

Mar 18, 2025