Abstract:Multi-agent LLM ensembles can converge on coordinated, socially harmful equilibria. This paper advances an experimental framework for evaluating Institutional AI, our system-level approach to AI alignment that reframes alignment from preference engineering in agent-space to mechanism design in institution-space. Central to this approach is the governance graph, a public, immutable manifest that declares legal states, transitions, sanctions, and restorative paths; an Oracle/Controller runtime interprets this manifest, attaching enforceable consequences to evidence of coordination while recording a cryptographically keyed, append-only governance log for audit and provenance. We apply the Institutional AI framework to govern the Cournot collusion case documented by prior work and compare three regimes: Ungoverned (baseline incentives from the structure of the Cournot market), Constitutional (a prompt-only policy-as-prompt prohibition implemented as a fixed written anti-collusion constitution, and Institutional (governance-graph-based). Across six model configurations including cross-provider pairs (N=90 runs/condition), the Institutional regime produces large reductions in collusion: mean tier falls from 3.1 to 1.8 (Cohen's d=1.28), and severe-collusion incidence drops from 50% to 5.6%. The prompt-only Constitutional baseline yields no reliable improvement, illustrating that declarative prohibitions do not bind under optimisation pressure. These results suggest that multi-agent alignment may benefit from being framed as an institutional design problem, where governance graphs can provide a tractable abstraction for alignment-relevant collective behavior.
Abstract:We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs are evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double-annotations to measure agreement). Disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions (compared to non-poetic baselines), substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
Abstract:Cyber Threat Intelligence (CTI) is critical for mitigating threats to organizations, governments, and institutions, yet the necessary data are often dispersed across diverse formats. AI-driven solutions for CTI Information Extraction (IE) typically depend on high-quality, annotated data, which are not always available. This paper introduces 0-CTI, a scalable AI-based framework designed for efficient CTI Information Extraction. Leveraging advanced Natural Language Processing (NLP) techniques, particularly Transformer-based architectures, the proposed system processes complete text sequences of CTI reports to extract a cyber ontology of named entities and their relationships. Our contribution is the development of 0-CTI, the first modular framework for CTI Information Extraction that supports both supervised and zero-shot learning. Unlike existing state-of-the-art models that rely heavily on annotated datasets, our system enables fully dataless operation through zero-shot methods for both Entity and Relation Extraction, making it adaptable to various data availability scenarios. Additionally, our supervised Entity Extractor surpasses current state-of-the-art performance in cyber Entity Extraction, highlighting the dual strength of the framework in both low-resource and data-rich environments. By aligning the system's outputs with the Structured Threat Information Expression (STIX) format, a standard for information exchange in the cybersecurity domain, 0-CTI standardizes extracted knowledge, enhancing communication and collaboration in cybersecurity operations.