Michal Shmueli-Scheuer

CUBE: A Standard for Unifying Agent Benchmarks

Mar 16, 2026
Via arXiv

General Agent Evaluation

Feb 26, 2026
Via arXiv

Robustness as an Emergent Property of Task Performance

Feb 03, 2026
Via arXiv

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Jan 22, 2026
Via arXiv

Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

May 26, 2025
Via arXiv

Survey on Evaluation of LLM-based Agents

Mar 20, 2025
Via arXiv

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

Mar 04, 2025
Via arXiv

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025
Via arXiv

Stay Tuned: An Empirical Study of the Impact of Hyperparameters on LLM Tuning in Real-World Applications

Jul 25, 2024
Via arXiv

Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation

Jul 18, 2024
Via arXiv