Wenhu Chen

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Apr 09, 2026
Via arXiv

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Mar 29, 2026
Via arXiv

SWE-Next: Scalable Real-World Software Engineering Tasks for Agents

Mar 21, 2026
Via arXiv

SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Mar 17, 2026
Via arXiv

EvolveCoder: Evolving Test Cases via Adversarial Verification for Code Reinforcement Learning

Mar 13, 2026
Via arXiv

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Feb 09, 2026
Via arXiv

Context Forcing: Consistent Autoregressive Video Generation with Long Context

Feb 05, 2026
Via arXiv

Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

Feb 02, 2026
Via arXiv

CogDoc: Towards Unified thinking in Documents

Dec 14, 2025
Via arXiv

Language Models Can Learn from Verbal Feedback Without Scalar Rewards

Sep 26, 2025
Via arXiv