Emine Yilmaz

Beyond Individual Personas: Aligning Synthetic Dialogue to Population-Level Behavior Distributions

Jun 05, 2026
Via arXiv

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Apr 24, 2026
Via arXiv

Towards Self-Improving Error Diagnosis in Multi-Agent Systems

Apr 19, 2026
Via arXiv

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Feb 16, 2026
Via arXiv

Beyond Output Critique: Self-Correction via Task Distillation

Jan 31, 2026
Via arXiv

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Jan 21, 2026
Via arXiv

ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System

Jan 17, 2026
Via arXiv

Self-Correcting Large Language Models: Generation vs. Multiple Choice

Nov 12, 2025
Via arXiv

Adaptive Multi-Agent Response Refinement in Conversational Systems

Nov 11, 2025
Via arXiv

Towards Understanding Bias in Synthetic Data for Evaluation

Jun 12, 2025
Via arXiv