Meng Cao

COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

Oct 08, 2025
Via arXiv

Checklists Are Better Than Reward Models For Aligning Language Models

Jul 24, 2025
Via arXiv

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Jul 22, 2025
Via arXiv

PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

Jun 10, 2025
Via arXiv

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

May 29, 2025
Via arXiv

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

May 29, 2025
Via arXiv

Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs

May 28, 2025
Via arXiv

SCAR: Shapley Credit Assignment for More Efficient RLHF

May 26, 2025
Via arXiv

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

May 26, 2025
Via arXiv

Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

May 20, 2025
Via arXiv