Abstract:This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. We report on the experimental methodology, the competitors, and the results.
Abstract:Reinforcement learning (RL) algorithms, due to their reliance on external systems to learn from, require digital environments (e.g., simulators) with very simple interfaces, which in turn constrain significantly the implementation of such environments. In particular, these environments are implemented either as separate processes or as state machines, leading to synchronization and communication overheads in the first case, and to unstructured programming in the second. We propose a new domain-specific, co-routine-based, compiled language, called Rulebook, designed to automatically generate the state machine required to interact with machine learning (ML) algorithms and similar applications, with no performance overhead. Rulebook allows users to express programs without needing to be aware of the specific interface required by the ML components. By decoupling the execution model of the program from the syntactical encoding of the program, and thus without the need for manual state management, Rulebook allows to create larger and more sophisticated environments at a lower development cost.
Abstract:Cross-site scripting (XSS) poses a significant threat to web application security. While Deep Learning (DL) has shown remarkable success in detecting XSS attacks, it remains vulnerable to adversarial attacks due to the discontinuous nature of its input-output mapping. These adversarial attacks employ mutation-based strategies for different components of XSS attack vectors, allowing adversarial agents to iteratively select mutations to evade detection. Our work replicates a state-of-the-art XSS adversarial attack, highlighting threats to validity in the reference work and extending it toward a more effective evaluation strategy. Moreover, we introduce an XSS Oracle to mitigate these threats. The experimental results show that our approach achieves an escape rate above 96% when the threats to validity of the replicated technique are addressed.