Zhongyuan Peng

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

Feb 27, 2026

CoDiQ: Test-Time Scaling for Controllable Difficult Question Generation

Feb 02, 2026

Retrieval-Infused Reasoning Sandbox: A Benchmark for Decoupling Retrieval and Reasoning Capabilities

Jan 29, 2026

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

Jan 08, 2026

CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization

Jul 08, 2025

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

May 05, 2025

IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs

Apr 21, 2025

Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?

Feb 26, 2025

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

Feb 23, 2025

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Feb 20, 2025