Bing Zhao

Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation

Apr 19, 2026, via arXiv

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

Apr 04, 2026, via arXiv

IndustryCode: A Benchmark for Industry Code Generation

Apr 03, 2026, via arXiv

MolQuest: A Benchmark for Agentic Evaluation of Abductive Reasoning in Chemical Structure Elucidation

Mar 26, 2026, via arXiv

Logics-Parsing-Omni Technical Report

Mar 12, 2026, via arXiv

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

Mar 04, 2026, via arXiv

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Mar 03, 2026, via arXiv

SPM-Bench: Benchmarking Large Language Models for Scanning Probe Microscopy

Feb 26, 2026, via arXiv

DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Feb 18, 2026, via arXiv

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Feb 17, 2026, via arXiv