Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Bruno Mendes

Department of Informatics Engineering

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

May 11, 2026

Pedro Conde, Henrique Branquinho, Valerio Mazzone, Bruno Mendes, André Baptista, Nuno Moniz

Abstract:AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.

Via

Access Paper or Ask Questions

Enhancing Adaptive Mixed-Criticality Scheduling with Deep Reinforcement Learning

Nov 01, 2024

Bruno Mendes, Pedro F. Souto, Pedro C. Diniz

Figure 1 for Enhancing Adaptive Mixed-Criticality Scheduling with Deep Reinforcement Learning

Figure 2 for Enhancing Adaptive Mixed-Criticality Scheduling with Deep Reinforcement Learning

Figure 3 for Enhancing Adaptive Mixed-Criticality Scheduling with Deep Reinforcement Learning

Figure 4 for Enhancing Adaptive Mixed-Criticality Scheduling with Deep Reinforcement Learning

Abstract:Adaptive Mixed-Criticality (AMC) is a fixed-priority preemptive scheduling algorithm for mixed-criticality hard real-time systems. It dominates many other scheduling algorithms for mixed-criticality systems, but does so at the cost of occasionally dropping jobs of less important/critical tasks, when low-priority jobs overrun their time budgets. In this paper we enhance AMC with a deep reinforcement learning (DRL) approach based on a Deep-Q Network. The DRL agent is trained off-line, and at run-time adjusts the low-criticality budgets of tasks to avoid budget overruns, while ensuring that no job misses its deadline if it does not overrun its budget. We have implemented and evaluated this approach by simulating realistic workloads from the automotive domain. The results show that the agent is able to reduce budget overruns by at least up to 50%, even when the budget of each task is chosen based on sampling the distribution of its execution time. To the best of our knowledge, this is the first use of DRL in AMC reported in the literature.

* Version submitted to RTNS 2024, on 17/08/2024 (with some typos fixed)

Via

Access Paper or Ask Questions