Eliya Habba

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Apr 16, 2026

Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Apr 14, 2026

ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

Apr 10, 2026

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Feb 18, 2026

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

Nov 06, 2025

JSON Whisperer: Efficient JSON Editing with LLMs

Oct 06, 2025

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

May 28, 2025

DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

Mar 04, 2025

Beyond Benchmarks: On The False Promise of AI Regulation

Jan 26, 2025

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Jul 28, 2024