Asaf Yehudai

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Jun 12, 2026
Via arXiv

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Jun 09, 2026
Via arXiv

Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

May 28, 2026
Via arXiv

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

May 27, 2026
Via arXiv

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

May 21, 2026
Via arXiv

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

May 18, 2026
Via arXiv

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Apr 14, 2026
Via arXiv

Mediocrity is the key for LLM as a Judge Anchor Selection

Mar 17, 2026
Via arXiv

CUBE: A Standard for Unifying Agent Benchmarks

Mar 16, 2026
Via arXiv

General Agent Evaluation

Feb 26, 2026
Via arXiv