Samuel Marc Denton

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

Apr 22, 2026

Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue

Abstract: Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.
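As a rough illustration of why a per-stratum view can differ from a single pooled average, the sketch below assumes hypothetical per-query scores that are each tagged with a stratum label. The function names, the equal-weight average over covered strata, and the sample data are all illustrative assumptions, not the paper's actual procedure.

```python
from collections import defaultdict

def pooled_mean(scores):
    """Average over all (stratum, score) pairs, ignoring strata."""
    return sum(s for _, s in scores) / len(scores)

def stratified_report(scores, all_strata):
    """Return per-stratum means, strata with no queries, and an
    equal-weight mean over the strata that do have queries."""
    by_stratum = defaultdict(list)
    for stratum, s in scores:
        by_stratum[stratum].append(s)
    means = {k: sum(v) / len(v) for k, v in by_stratum.items()}
    missing = sorted(set(all_strata) - set(means))
    macro = sum(means.values()) / len(means)
    return means, missing, macro

# Hypothetical data: three easy "billing" queries, one failed "legal"
# query, and no queries at all for the "hr" stratum.
scores = [("billing", 1.0), ("billing", 1.0), ("billing", 1.0), ("legal", 0.0)]
means, missing, macro = stratified_report(scores, ["billing", "legal", "hr"])
pooled = pooled_mean(scores)
```

In this toy data the pooled average is dominated by the over-represented stratum, while the stratified report surfaces both the weak stratum and the one with no queries at all.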

LHAW: Controllable Underspecification for Long-Horizon Tasks

Feb 11, 2026

George Pu, Michael S. Lee, Udari Madhushani Sehwag, David J. Lee, Bryan Zhu, Yash Maurya, Mohit Raghavendra, Yuan Xue, Samuel Marc Denton

Abstract: Long-horizon workflow agents that operate effectively over extended periods are essential for truly autonomous systems. Their reliable execution critically depends on the ability to reason through ambiguous situations in which clarification seeking is necessary to ensure correct task execution. However, progress is limited by the lack of scalable, task-agnostic frameworks for systematically curating and measuring the impact of ambiguity across custom workflows. We address this gap by introducing LHAW (Long-Horizon Augmented Workflows), a modular, dataset-agnostic synthetic pipeline that transforms any well-specified task into controllable underspecified variants by systematically removing information across four dimensions (Goals, Constraints, Inputs, and Context) at configurable severity levels. Unlike approaches that rely on LLM predictions of ambiguity, LHAW validates variants through empirical agent trials, classifying them as outcome-critical, divergent, or benign based on observed terminal state divergence. We release 285 task variants from TheAgentCompany, SWE-Bench Pro, and MCP-Atlas according to our taxonomy, alongside formal analysis measuring how current agents detect, reason about, and resolve underspecification across ambiguous settings. LHAW provides the first systematic framework for cost-sensitive evaluation of agent clarification behavior in long-horizon settings, enabling development of reliable autonomous systems.

A Weighted Solution to SVM Actionability and Interpretability

Dec 06, 2020

Samuel Marc Denton, Ansaf Salleb-Aouissi

Abstract: Research in machine learning has successfully developed algorithms to build accurate classification models. However, in many real-world applications, such as healthcare, customer satisfaction, and environmental protection, we want to be able to use the models to decide what actions to take. We investigate the concept of actionability in the context of Support Vector Machines. Actionability is as important as interpretability or explainability of machine learning models, an ongoing and important research topic. Actionability is the task that gives us ways to act upon machine learning models and their predictions. This paper finds a solution to the question of actionability on both linear and non-linear SVM models. Additionally, we introduce a way to account for weighted actions that allow for more change in certain features than others. We propose a gradient descent solution on the linear, RBF, and polynomial kernels, and we test the effectiveness of our models on both synthetic and real datasets. We are also able to explore the model's interpretability through the lens of actionability.

* 20 pages; work in progress; 17 figures; 3 tables
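To make the idea of weighted actions concrete, here is a minimal sketch for the linear-kernel case only. It assumes a penalty formulation (a per-feature weighted quadratic change cost plus a squared shortfall from a target margin) minimized by plain gradient descent. The objective, the function names, and all parameter values are assumptions chosen for illustration and may differ from the formulation in the paper.

```python
def decision(w, b, x):
    """Linear SVM decision value w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def weighted_action(w, b, x, cost, margin=1.0, lam=10.0, lr=0.02, steps=2000):
    """Find a change d to x that pushes the decision value toward `margin`
    while penalizing change in feature i by cost[i].

    Minimizes 0.5 * sum(cost[i] * d[i]**2) + 0.5 * lam * shortfall**2,
    where shortfall = max(0, margin - decision(x + d)).
    This objective is an illustrative assumption.
    """
    d = [0.0] * len(x)
    for _ in range(steps):
        moved = [xi + di for xi, di in zip(x, d)]
        shortfall = max(0.0, margin - decision(w, b, moved))
        # Gradient of the change cost minus the pull toward the margin.
        grad = [ci * di - lam * shortfall * wi
                for ci, di, wi in zip(cost, d, w)]
        d = [di - lr * gi for di, gi in zip(d, grad)]
    return d

# Hypothetical model and point: the second feature is ten times
# costlier to change than the first.
w, b = [1.0, 1.0], -1.0
x = [0.0, 0.0]
d = weighted_action(w, b, x, cost=[1.0, 10.0])
new_x = [xi + di for xi, di in zip(x, d)]
```

Because this is a soft penalty, the result stops slightly short of the target margin; the target is set above zero so the point still crosses the decision boundary. Most of the change lands on the cheaper feature, which is the behavior the per-feature costs are meant to encourage.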
