Yifan Mai

Characterizing Delusional Spirals through Human-LLM Chat Logs

Mar 17, 2026

Talk, Evaluate, Diagnose: User-aware Agent Evaluation with Automated Error Analysis

Mar 16, 2026

OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Mar 03, 2026

ErrorMap and ErrorAtlas: Charting the Failure Landscape of Large Language Models

Jan 22, 2026

The Singapore Consensus on Global AI Safety Research Priorities

Jun 25, 2025

Judging LLMs on a Simplex

May 28, 2025

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks

May 26, 2025

The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

Feb 26, 2025

SEA-HELM: Southeast Asian Holistic Evaluation of Language Models

Feb 20, 2025

Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Oct 29, 2024