Ryan Othniel Kearns

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Mar 12, 2026

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao (+5 more)

Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.

Quantifying construct validity in large language model evaluations

Feb 17, 2026

Ryan Othniel Kearns

Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.
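
One way to read the two-level structure this abstract describes, offered only as a sketch: model scale informs a latent capability, and that capability informs each observed benchmark score up to measurement error. Every symbol below is an illustrative assumption rather than the thesis's actual specification: $N_m$ is the parameter count of model $m$, $\theta_m$ its latent capability, $\tau_b$ and $\lambda_b$ an intercept and loading for benchmark $b$, and $\zeta_m$, $\varepsilon_{mb}$ are noise terms. The linear-in-log-scale form and the single capability dimension are also assumptions.

```latex
\begin{align}
\theta_m &= \alpha + \beta \log N_m + \zeta_m, \\
y_{mb}   &= \tau_b + \lambda_b \, \theta_m + \varepsilon_{mb}.
\end{align}
```

Under this reading, keeping only the second line gives a plain latent factor model with no role for scale, while dropping $\varepsilon_{mb}$ corresponds to ignoring measurement error, which is the limitation the abstract attributes to scaling laws.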

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

Sep 11, 2025

Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi

Abstract: To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.

* Accepted to EMNLP 2025 Main
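
The two criteria named in this abstract, validity and minimality, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the threshold classifier `toy_predict`, the one-dimensional numeric input, and the helper names `is_valid` and `excess_distance` are all hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical stand-in for a model: predicts 1 at or above the boundary.
BOUNDARY = 50.0


def toy_predict(x: float) -> int:
    return 1 if x >= BOUNDARY else 0


def is_valid(predict: Callable[[float], int], counterfactual: float, target: int) -> bool:
    """Valid: the model really predicts the intended outcome on the edited input."""
    return predict(counterfactual) == target


def excess_distance(original: float, counterfactual: float, boundary: float) -> float:
    """Edit size minus the distance to the decision boundary.

    Positive: the edit moved further than necessary (not minimal).
    Negative: the edit stopped short of the boundary (cannot flip the prediction).
    """
    return abs(counterfactual - original) - abs(boundary - original)


original = 40.0    # predicted 0; the intended outcome is 1
overshoot = 90.0   # a large edit
undershoot = 45.0  # an excessively small edit

for name, cf in [("overshoot", overshoot), ("undershoot", undershoot)]:
    print(name, is_valid(toy_predict, cf, target=1), excess_distance(original, cf, BOUNDARY))
```

In this toy setting the overshoot is valid but far from minimal, and the undershoot is a small edit that fails to change the prediction, mirroring the two failure patterns the abstract reports.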

Contextual Trust

Mar 15, 2023

Ryan Othniel Kearns

Abstract: Trust is an important aspect of human life. It provides instrumental value in allowing us to collaborate on and defer actions to others, and intrinsic value in our intimate relationships with romantic partners, family, and friends. In this paper I examine the nature of trust from a philosophical perspective. Specifically, I propose to view trust as a context-sensitive state in a manner that will be made precise. The contribution of this paper is threefold. First, I make the simple observation that an individual's trust is typically both action- and context-sensitive. Action-sensitivity means that trust may obtain between a given truster and trustee for only certain actions. Context-sensitivity means that trust may obtain between a given truster and trustee, regarding the same action, in some conditions and not others. I also opine about what kinds of things may play the role of the truster, trustee, and action. Second, I advance a theory for the nature of contextual trust. I propose that the answer to "What does it mean for $A$ to trust $B$ to do $X$ in context $C$?" has two conditions. First, $A$ must take $B$'s doing $X$ as a means towards one of $A$'s ends. Second, $A$ must adopt an unquestioning attitude concerning $B$'s doing $X$ in context $C$. This unquestioning attitude is similar to the attitude introduced in Nguyen 2021. Finally, we explore how contextual trust can help us make sense of trust in general non-interpersonal settings, notably that of artificial intelligence (AI) systems. The field of Explainable Artificial Intelligence (XAI) assigns paramount importance to the problem of user trust in opaque computational models, yet does little to give trust diagnostic or even conceptual criteria. I propose that contextual trust is a natural fit for the task by illustrating that model transparency and explainability map nicely into our construction of the contexts $C$.

* 60 pages