Meng Cao

COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

Oct 08, 2025

Checklists Are Better Than Reward Models For Aligning Language Models

Jul 24, 2025

C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

Jul 22, 2025

PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

Jun 10, 2025

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

May 29, 2025

ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding

May 29, 2025

Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs

May 28, 2025

SCAR: Shapley Credit Assignment for More Efficient RLHF

May 26, 2025

Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

May 26, 2025

Breaking Down Video LLM Benchmarks: Knowledge, Spatial Perception, or True Temporal Understanding?

May 20, 2025