Asaf Yehudai

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Apr 14, 2026
Via arXiv

Mediocrity is the key for LLM as a Judge Anchor Selection

Mar 17, 2026
Via arXiv

CUBE: A Standard for Unifying Agent Benchmarks

Mar 16, 2026
Via arXiv

General Agent Evaluation

Feb 26, 2026
Via arXiv

Will it Merge? On The Causes of Model Mergeability

Jan 10, 2026
Via arXiv

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Oct 06, 2025
Via arXiv

Survey on Evaluation of LLM-based Agents

Mar 20, 2025
Via arXiv

WildIFEval: Instruction Following in the Wild

Mar 09, 2025
Via arXiv

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025
Via arXiv

Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models

Feb 12, 2025
Via arXiv