Title: Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

Aug 07, 2025

Roshita Bhonsle, Rishav Dutta, Sneha Vavilapalli, Harsh Seth, Abubakarr Jaye, Yapei Chang, Mukund Rungta, Emmanuel Aboah Boateng, Sadid Hasan, Ehi Nosakhare(+1 more)

Figures 1 to 4 for Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

Abstract: The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another's task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent's output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.
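The decompose, validate, and aggregate flow described in the abstract, together with the alignment-accuracy metric, can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`judge_task`, `alignment_accuracy`, `keyword_validator`) is hypothetical, the all-sub-tasks-must-pass aggregation rule is an assumption, and the keyword check is a toy stand-in for whatever model-based validation the framework actually performs. Alignment accuracy is assumed to mean the fraction of tasks on which the judge's verdict matches the human label.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepVerdict:
    """Result of validating one sub-task (hypothetical structure)."""
    sub_task: str
    passed: bool


def judge_task(
    sub_tasks: List[str],
    evidence: str,
    validate: Callable[[str, str], bool],
) -> bool:
    """Validate each sub-task against the evidence, then aggregate.

    `evidence` stands for the agent's output and reasoning log.
    """
    verdicts = [StepVerdict(s, validate(s, evidence)) for s in sub_tasks]
    # Assumed aggregation rule: the task is complete only if every
    # sub-task was validated successfully.
    return all(v.passed for v in verdicts)


def alignment_accuracy(judge: List[bool], human: List[bool]) -> float:
    """Fraction of tasks where the judge verdict equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def keyword_validator(sub_task: str, evidence: str) -> bool:
    """Toy validator: passes if the sub-task text appears in the evidence."""
    return sub_task.lower() in evidence.lower()


if __name__ == "__main__":
    log = "Step 1: download file. Step 2: parse CSV. Done."
    print(judge_task(["download file", "parse csv"], log, keyword_validator))
    print(alignment_accuracy([True, False, True, True],
                             [True, True, True, False]))
```

In a real system the `validate` callable would wrap a model or tool call rather than a substring match; passing it in as a parameter keeps the aggregation logic independent of the task domain, which is the property the abstract emphasizes.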
