Chen Bo Calvin Zhang

PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning

Nov 14, 2025

ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents

Nov 10, 2025

Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Oct 14, 2025

Reliable Weak-to-Strong Monitoring of LLM Agents

Aug 26, 2025

ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

Oct 17, 2024