Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Chris Russell

Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces

May 14, 2026

William Lugoloobi, Samuelle Marro, Jabez Magomere, Joss Wright, Chris Russell

Abstract:As LLM-based agents increasingly browse the web on users' behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent's actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96\% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces \href{https://github.com/KabakaWilliam/known_actions}{here}.

Via

Access Paper or Ask Questions

LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

Feb 10, 2026

William Lugoloobi, Thomas Foster, William Bankes, Chris Russell

Abstract:Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70\% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty

Via

Access Paper or Ask Questions

Agent Benchmarks Fail Public Sector Requirements

Jan 28, 2026

Jonathan Rystrøm, Chris Schmitz, Karolina Korgul, Jan Batzner, Chris Russell

Abstract:Deploying Large Language Model-based agents (LLM agents) in the public sector requires assuring that they meet the stringent legal, procedural, and structural requirements of public-sector institutions. Practitioners and researchers often turn to benchmarks for such assessments. However, it remains unclear what criteria benchmarks must meet to ensure they adequately reflect public-sector requirements, or how many existing benchmarks do so. In this paper, we first define such criteria based on a first-principles survey of public administration literature: benchmarks must be \emph{process-based}, \emph{realistic}, \emph{public-sector-specific} and report \emph{metrics} that reflect the unique requirements of the public sector. We analyse more than 1,300 benchmark papers for these criteria using an expert-validated LLM-assisted pipeline. Our results show that no single benchmark meets all of the criteria. Our findings provide a call to action for both researchers to develop public sector-relevant benchmarks and for public-sector officials to apply these criteria when evaluating their own agentic use cases.

* Forthcoming @ IASEAI 2026

Via

Access Paper or Ask Questions

Evaluating the Ability of Explanations to Disambiguate Models in a Rashomon Set

Jan 13, 2026

Kaivalya Rawal, Eoin Delaney, Zihao Fu, Sandra Wachter, Chris Russell

Abstract:Explainable artificial intelligence (XAI) is concerned with producing explanations indicating the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way of disambiguating the behavior of individual models, helping select models for deployment. However explanations themselves can vary depending on the explainer used, and need to be evaluated. In the paper "Evaluating Model Explanations without Ground Truth", we proposed three principles of explanation evaluation and a new method "AXE" to evaluate the quality of feature-importance explanations. We go on to illustrate how evaluation metrics that rely on comparing model explanations against ideal ground truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles would highlight these differences instead, helping select models from the Rashomon set. The selection of alternate models from the Rashomon set can maintain identical predictions but mislead explainers into generating false explanations, and mislead evaluation methods into considering the false explanations to be of high quality. AXE, our proposed explanation evaluation method, can detect this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies such as those based on model sensitivity or ground truth comparison, AXE can determine when protected attributes are used to make predictions.

* This is a preprint of the paper published at the MURE workshop, AAAI 2026, which builds on a preprint of separate work published at FAccT 2025 (arXiv:2505.10399)

Via

Access Paper or Ask Questions

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Dec 29, 2025

Karolina Korgul, Yushi Yang, Arkadiusz Drohomirecki, Piotr Błaszczyk, Will Howard, Lukas Aichberger, Chris Russell, Philip H. S. Torr, Adam Mahdi, Adel Bibi

Abstract:Web-based agents powered by large language models are increasingly used for tasks such as email management or professional networking. Their reliance on dynamic web content, however, makes them vulnerable to prompt injection attacks: adversarial instructions hidden in interface elements that persuade the agent to divert from its original task. We introduce the Task-Redirecting Agent Persuasion Benchmark (TRAP), an evaluation for studying how persuasion techniques misguide autonomous web agents on realistic tasks. Across six frontier models, agents are susceptible to prompt injection in 25\% of tasks on average (13\% for GPT-5 to 43\% for DeepSeek-R1), with small interface or contextual changes often doubling success rates and revealing systemic, psychologically driven vulnerabilities in web-based agents. We also provide a modular social-engineering injection framework with controlled experiments on high-fidelity website clones, allowing for further benchmark expansion.

Via

Access Paper or Ask Questions

OxEnsemble: Fair Ensembles for Low-Data Classification

Dec 10, 2025

Jonathan Rystrøm, Zihao Fu, Chris Russell

Abstract:We address the problem of fair classification in settings where data is scarce and unbalanced across demographic groups. Such low-data regimes are common in domains like medical imaging, where false negatives can have fatal consequences. We propose a novel approach \emph{OxEnsemble} for efficiently training ensembles and enforcing fairness in these low-data regimes. Unlike other approaches, we aggregate predictions across ensemble members, each trained to satisfy fairness constraints. By construction, \emph{OxEnsemble} is both data-efficient, carefully reusing held-out data to enforce fairness reliably, and compute-efficient, requiring little more compute than used to fine-tune or evaluate an existing model. We validate this approach with new theoretical guarantees. Experimentally, our approach yields more consistent outcomes and stronger fairness-accuracy trade-offs than existing methods across multiple challenging medical imaging classification datasets.

Via

Access Paper or Ask Questions

CAST: Compositional Analysis via Spectral Tracking for Understanding Transformer Layer Functions

Oct 16, 2025

Zihao Fu, Ming Liao, Chris Russell, Zhenguang G. Cai

Abstract:Large language models have achieved remarkable success but remain largely black boxes with poorly understood internal mechanisms. To address this limitation, many researchers have proposed various interpretability methods including mechanistic analysis, probing classifiers, and activation visualization, each providing valuable insights from different perspectives. Building upon this rich landscape of complementary approaches, we introduce CAST (Compositional Analysis via Spectral Tracking), a probe-free framework that contributes a novel perspective by analyzing transformer layer functions through direct transformation matrix estimation and comprehensive spectral analysis. CAST offers complementary insights to existing methods by estimating the realized transformation matrices for each layer using Moore-Penrose pseudoinverse and applying spectral analysis with six interpretable metrics characterizing layer behavior. Our analysis reveals distinct behaviors between encoder-only and decoder-only models, with decoder models exhibiting compression-expansion cycles while encoder models maintain consistent high-rank processing. Kernel analysis further demonstrates functional relationship patterns between layers, with CKA similarity matrices clearly partitioning layers into three phases: feature extraction, compression, and specialization.

Via

Access Paper or Ask Questions

LLMs Don't Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations

Sep 11, 2025

Harry Mayne, Ryan Othniel Kearns, Yushi Yang, Andrew M. Bean, Eoin Delaney, Chris Russell, Adam Mahdi

Abstract:To collaborate effectively with humans, language models must be able to explain their decisions in natural language. We study a specific type of self-explanation: self-generated counterfactual explanations (SCEs), where a model explains its prediction by modifying the input such that it would have predicted a different outcome. We evaluate whether LLMs can produce SCEs that are valid, achieving the intended outcome, and minimal, modifying the input no more than necessary. When asked to generate counterfactuals, we find that LLMs typically produce SCEs that are valid, but far from minimal, offering little insight into their decision-making behaviour. Worryingly, when asked to generate minimal counterfactuals, LLMs typically make excessively small edits that fail to change predictions. The observed validity-minimality trade-off is consistent across several LLMs, datasets, and evaluation settings. Our findings suggest that SCEs are, at best, an ineffective explainability tool and, at worst, can provide misleading insights into model behaviour. Proposals to deploy LLMs in high-stakes settings must consider the impact of unreliable self-explanations on downstream decision-making. Our code is available at https://github.com/HarryMayne/SCEs.

* Accepted to EMNLP 2025 Main

Via

Access Paper or Ask Questions

Evaluating Model Explanations without Ground Truth

May 15, 2025

Kaivalya Rawal, Zihao Fu, Eoin Delaney, Chris Russell

Abstract:There can be many competing and contradictory explanations for a single model prediction, making it difficult to select which one to use. Current explanation evaluation frameworks measure quality by comparing against ideal "ground-truth" explanations, or by verifying model sensitivity to important inputs. We outline the limitations of these approaches, and propose three desirable principles to ground the future development of explanation evaluation strategies for local feature importance explanations. We propose a ground-truth Agnostic eXplanation Evaluation framework (AXE) for evaluating and comparing model explanations that satisfies these principles. Unlike prior approaches, AXE does not require access to ideal ground-truth explanations for comparison, or rely on model sensitivity - providing an independent measure of explanation quality. We verify AXE by comparing with baselines, and show how it can be used to detect explanation fairwashing. Our code is available at https://github.com/KaiRawal/Evaluating-Model-Explanations-without-Ground-Truth.

* https://github.com/KaiRawal/Evaluating-Model-Explanations-without-Ground-Truth

Via

Access Paper or Ask Questions

Deepfakes on Demand: the rise of accessible non-consensual deepfake image generators

May 06, 2025

Will Hawkins, Chris Russell, Brent Mittelstadt

Abstract:Advances in multimodal machine learning have made text-to-image (T2I) models increasingly accessible and popular. However, T2I models introduce risks such as the generation of non-consensual depictions of identifiable individuals, otherwise known as deepfakes. This paper presents an empirical study exploring the accessibility of deepfake model variants online. Through a metadata analysis of thousands of publicly downloadable model variants on two popular repositories, Hugging Face and Civitai, we demonstrate a huge rise in easily accessible deepfake models. Almost 35,000 examples of publicly downloadable deepfake model variants are identified, primarily hosted on Civitai. These deepfake models have been downloaded almost 15 million times since November 2022, with the models targeting a range of individuals from global celebrities to Instagram users with under 10,000 followers. Both Stable Diffusion and Flux models are used for the creation of deepfake models, with 96% of these targeting women and many signalling intent to generate non-consensual intimate imagery (NCII). Deepfake model variants are often created via the parameter-efficient fine-tuning technique known as low rank adaptation (LoRA), requiring as few as 20 images, 24GB VRAM, and 15 minutes of time, making this process widely accessible via consumer-grade computers. Despite these models violating the Terms of Service of hosting platforms, and regulation seeking to prevent dissemination, these results emphasise the pressing need for greater action to be taken against the creation of deepfakes and NCII.

* FAccT '25: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
* 13 pages

Via

Access Paper or Ask Questions