Asaf Yehudai

General Agent Evaluation

Feb 26, 2026
Via arXiv

Will it Merge? On The Causes of Model Mergeability

Jan 10, 2026
Via arXiv

Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Oct 06, 2025
Via arXiv

Survey on Evaluation of LLM-based Agents

Mar 20, 2025
Via arXiv

WildIFEval: Instruction Following in the Wild

Mar 09, 2025
Via arXiv

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025
Via arXiv

Selective Self-to-Supervised Fine-Tuning for Generalization in Large Language Models

Feb 12, 2025
Via arXiv

JuStRank: Benchmarking LLM Judges for System Ranking

Dec 12, 2024
Via arXiv

Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models

Sep 07, 2024
Via arXiv

Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Jul 18, 2024
Via arXiv