Emine Yilmaz

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Apr 24, 2026
Via arXiv

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

Apr 19, 2026
Via arXiv

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Feb 16, 2026
Via arXiv

Beyond Output Critique: Self-Correction via Task Distillation

Jan 31, 2026
Via arXiv

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Jan 21, 2026
Via arXiv

ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue Systems

Jan 17, 2026
Via arXiv

Self-Correcting Large Language Models: Generation vs. Multiple Choice

Nov 12, 2025
Via arXiv

Adaptive Multi-Agent Response Refinement in Conversational Systems

Nov 11, 2025
Via arXiv

Towards Understanding Bias in Synthetic Data for Evaluation

Jun 12, 2025
Via arXiv

PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

Jun 11, 2025
Via arXiv