Qiyuan Peng

AGORA: An Archive-Grounded Benchmark for Agentic Workplace Document Reasoning

Jun 23, 2026
Via arXiv

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

May 19, 2026
Via arXiv

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Feb 13, 2026
Via arXiv

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models

Aug 07, 2025
Via arXiv