Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Theo King

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Sep 05, 2025

Ilham Wicaksono, Zekun Wu, Theo King, Adriano Koshiyama, Philip Treleaven

Abstract:As large language models transition to agentic systems, current safety evaluation frameworks face critical gaps in assessing deployment-specific risks. We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single turn and iterative refinement attacks, we demonstrate fundamental differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering while maintaining logic-based attack resistance. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, agent transfer operations as highest-risk tools, semantic rather than syntactic vulnerability mechanisms, and context-dependent attack effectiveness, alongside model-specific security profiles in absolute ASR levels and optimal injection strategies. Direct attack transfer from model-level to agentic contexts shows degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic evaluation gaps. These findings establish the urgent need for agentic-situation evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation.

Via

Access Paper or Ask Questions

LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

May 13, 2025

Zekun Wu, Seonglae Cho, Umar Mohammed, Cristian Munoz, Kleyton Costa, Xin Guan, Theo King, Ze Wang, Emre Kazim, Adriano Koshiyama

Figure 1 for LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

Figure 2 for LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

Figure 3 for LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

Figure 4 for LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries

Abstract:Open-source AI libraries are foundational to modern AI systems but pose significant, underexamined risks across security, licensing, maintenance, supply chain integrity, and regulatory compliance. We present LibVulnWatch, a graph-based agentic assessment framework that performs deep, source-grounded evaluations of these libraries. Built on LangGraph, the system coordinates a directed acyclic graph of specialized agents to extract, verify, and quantify risk using evidence from trusted sources such as repositories, documentation, and vulnerability databases. LibVulnWatch generates reproducible, governance-aligned scores across five critical domains, publishing them to a public leaderboard for longitudinal ecosystem monitoring. Applied to 20 widely used libraries, including ML frameworks, LLM inference engines, and agent orchestration tools, our system covers up to 88% of OpenSSF Scorecard checks while uncovering up to 19 additional risks per library. These include critical Remote Code Execution (RCE) vulnerabilities, absent Software Bills of Materials (SBOMs), licensing constraints, undocumented telemetry, and widespread gaps in regulatory documentation and auditability. By translating high-level governance principles into practical, verifiable metrics, LibVulnWatch advances technical AI governance with a scalable, transparent mechanism for continuous supply chain risk assessment and informed library selection.

Via

Access Paper or Ask Questions

HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Sep 17, 2024

Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven

Figure 1 for HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Figure 2 for HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Figure 3 for HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Figure 4 for HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection

Abstract:Stereotypes are generalised assumptions about societal groups, and even state-of-the-art LLMs using in-context learning struggle to identify them accurately. Due to the subjective nature of stereotypes, where what constitutes a stereotype can vary widely depending on cultural, social, and individual perspectives, robust explainability is crucial. Explainable models ensure that these nuanced judgments can be understood and validated by human users, promoting trust and accountability. We address these challenges by introducing HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text Stereotype Detection), a framework that enhances model performance, minimises carbon footprint, and provides transparent, interpretable explanations. We establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising 57,201 labeled texts across six groups, including under-represented demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm that BERT models fine-tuned on EMGSD outperform those trained on individual components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model using SHAP to generate token-level importance values, ensuring alignment with human understanding, and calculate explainability confidence scores by comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias over time within model families.

* Submitted to NeurIPS 2024 SoLaR Workshop

Via

Access Paper or Ask Questions