Yotam Perlitz

General Agent Evaluation

Feb 26, 2026

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

Mar 04, 2025

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025

JuStRank: Benchmarking LLM Judges for System Ranking

Dec 12, 2024

Can You Trust Your Metric? Automatic Concatenation-Based Tests for Metric Validity

Aug 22, 2024

Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Jul 18, 2024

Holmes: Benchmark the Linguistic Competence of Language Models

Apr 29, 2024

Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

Jan 25, 2024

Efficient Benchmarking (of Language Models)

Aug 31, 2023

Active Learning for Natural Language Generation

May 24, 2023