Mubashara Akhtar

Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results

Jun 12, 2026
Via arXiv

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

Jun 09, 2026
Via arXiv

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

May 29, 2026
Via arXiv

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Feb 18, 2026
Via arXiv

Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads

Nov 11, 2025
Via arXiv

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

Nov 06, 2025
Via arXiv

Chimera: Diagnosing Shortcut Learning in Visual-Language Understanding

Sep 26, 2025
Via arXiv

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

May 19, 2025
Via arXiv

Ev2R: Evaluating Evidence Retrieval in Automated Fact-Checking

Nov 08, 2024
Via arXiv

The Automated Verification of Textual Claims (AVeriTeC) Shared Task

Oct 31, 2024
Via arXiv