José Hernández-Orallo

Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents

Jun 10, 2025
Via arXiv

Relative Drawing Identification Complexity is Invariant to Modality in Vision-Language Models

May 14, 2025
Via arXiv

Cognitive Science-Inspired Evaluation of Core Capabilities for Object Understanding in AI

Mar 27, 2025
Via arXiv

General Scales Unlock AI Evaluation with Explanatory and Predictive Power

Mar 09, 2025
Via arXiv

Paradigms of AI Evaluation: Mapping Goals, Methodologies and Culture

Feb 21, 2025
Via arXiv

PredictaBoard: Benchmarking LLM Score Predictability

Feb 20, 2025
Via arXiv

Leaving the barn door open for Clever Hans: Simple features predict LLM benchmark answers

Oct 15, 2024
Via arXiv

100 instances is all you need: predicting the success of a new LLM on unseen data by testing on a few instances

Sep 05, 2024
Via arXiv

Learning Alternative Ways of Performing a Task

Apr 03, 2024
Via arXiv

Animal-AI 3: What's New & Why You Should Care

Dec 18, 2023
Via arXiv