Haiyang Shen

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

May 22, 2026
Via arXiv

SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

May 21, 2026
Via arXiv

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

May 21, 2026
Via arXiv

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

May 20, 2026
Via arXiv

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

May 20, 2026
Via arXiv

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

May 13, 2026
Via arXiv

MEME: Modeling the Evolutionary Modes of Financial Markets

Feb 12, 2026
Via arXiv

AlphaPROBE: Alpha Mining via Principled Retrieval and On-graph biased evolution

Feb 12, 2026
Via arXiv

M3-BENCH: Process-Aware Evaluation of LLM Agents Social Behaviors in Mixed-Motive Games

Jan 13, 2026
Via arXiv

BabyVision: Visual Reasoning Beyond Language

Jan 10, 2026
Via arXiv