Chenxi Whitehouse

Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Mar 19, 2026
Via arXiv

Text-to-Stage: Spatial Layouts from Long-form Narratives

Mar 18, 2026
Via arXiv

APRES: An Agentic Paper Revision and Evaluation System

Mar 03, 2026
Via arXiv

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Feb 18, 2026
Via arXiv

Macaron: Controlled, Human-Written Benchmark for Multilingual and Multicultural Reasoning via Template-Filling

Feb 11, 2026
Via arXiv

Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks

Feb 04, 2026
Via arXiv

The Llama 4 Herd: Architecture, Training, Evaluation, and Deployment Notes

Jan 15, 2026
Via arXiv

Training AI Co-Scientists Using Rubric Rewards

Dec 29, 2025
Via arXiv

Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation

Dec 23, 2025
Via arXiv

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages

Sep 30, 2025
Via arXiv