Ge Zhang

SciDA: Scientific Dynamic Assessor of LLMs

Jun 15, 2025
Via arXiv

Scaling Test-time Compute for LLM Agents

Jun 15, 2025
Via arXiv

Overview of the NLPCC 2025 Shared Task: Gender Bias Mitigation Challenge

Jun 14, 2025
Via arXiv

TaskCraft: Automated Generation of Agentic Tasks

Jun 11, 2025
Via arXiv

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

May 29, 2025
Via arXiv

MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

May 27, 2025
Via arXiv

LIFEBench: Evaluating Length Instruction Following in Large Language Models

May 22, 2025
Via arXiv

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

May 21, 2025
Via arXiv

P2P: Automated Paper-to-Poster Generation and Fine-Grained Benchmark

May 21, 2025
Via arXiv

General-Reasoner: Advancing LLM Reasoning Across All Domains

May 21, 2025
Via arXiv