Emine Yilmaz

Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems

Jan 21, 2026
Via arXiv

ATOD: An Evaluation Framework and Benchmark for Agentic Task-Oriented Dialogue System

Jan 17, 2026
Via arXiv

Self-Correcting Large Language Models: Generation vs. Multiple Choice

Nov 12, 2025
Via arXiv

Adaptive Multi-Agent Response Refinement in Conversational Systems

Nov 11, 2025
Via arXiv

Towards Understanding Bias in Synthetic Data for Evaluation

Jun 12, 2025
Via arXiv

PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants

Jun 11, 2025
Via arXiv

Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

May 22, 2025
Via arXiv

Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Feb 19, 2025
Via arXiv

KEIR @ ECIR 2025: The Second Workshop on Knowledge-Enhanced Information Retrieval

Jan 20, 2025
Via arXiv

JudgeBlender: Ensembling Judgments for Automatic Relevance Assessment

Dec 17, 2024
Via arXiv