Abstract:The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.
Abstract:To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.
Abstract:Trust is an important aspect of human life. It provides instrumental value in allowing us to collaborate on and defer actions to others, and intrinsic value in our intimate relationships with romantic partners, family, and friends. In this paper I examine the nature of trust from a philosophical perspective. Specifically I propose to view trust as a context-sensitive state in a manner that will be made precise. The contribution of this paper is threefold. First, I make the simple observation that an individual's trust is typically both action- and context-sensitive. Action-sensitivity means that trust may obtain between a given truster and trustee for only certain actions. Context-sensitivity means that trust may obtain between a given truster and trustee, regarding the same action, in some conditions and not others. I also opine about what kinds of things may play the role of the truster, trustee, and action. Second, I advance a theory for the nature of contextual trust. I propose that the answer to "What does it mean for $A$ to trust $B$ to do $X$ in context $C$?" has two conditions. First, $A$ must take $B$'s doing $X$ as a means towards one of $A$'s ends. Second, $A$ must adopt an unquestioning attitude concerning $B$'s doing $X$ in context $C$. This unquestioning attitude is similar to the attitude introduced in Nguyen 2021. Finally, we explore how contextual trust can help us make sense of trust in general non-interpersonal settings, notably that of artificial intelligence (AI) systems. The field of Explainable Artificial Intelligence (XAI) assigns paramount importance to the problem of user trust in opaque computational models, yet does little to give trust diagnostic or even conceptual criteria. I propose that contextual trust is a natural fit for the task by illustrating that model transparency and explainability map nicely into our construction of the contexts $C$.