Picture for Zhoufutu Wen

Zhoufutu Wen

SciDA: Scientific Dynamic Assessor of LLMs

Add code
Jun 15, 2025
Viaarxiv icon

MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Add code
May 27, 2025
Viaarxiv icon

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

Add code
May 21, 2025
Viaarxiv icon

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Add code
Apr 21, 2025
Viaarxiv icon

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Add code
Feb 20, 2025
Viaarxiv icon

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Add code
Feb 18, 2025
Viaarxiv icon

CryptoX : Compositional Reasoning Evaluation of Large Language Models

Add code
Feb 08, 2025
Viaarxiv icon

Enhancing Dynamic Image Advertising with Vision-Language Pre-training

Add code
Jun 25, 2023
Viaarxiv icon