Robert Kitchen

Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

May 10, 2026
Via arXiv

Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

May 07, 2026
Via arXiv

MechPert: Mechanistic Consensus as an Inductive Bias for Unseen Perturbation Prediction

Feb 14, 2026
Via arXiv