Abstract: AI-enabled services deployed in critical digital infrastructure are subject to governance obligations spanning transparency, accountability, fairness, and traceability. Compliance today remains documentation-centric: obligations are described in prose, audits rely on static checklists, and verification depends on manual review. Such approaches do not scale to automated AI systems. This paper introduces Ontological Knowledge Blocks (OKBs), a programmable governance infrastructure that compiles regulatory obligations into machine-checkable constraints over structured evidence graphs. We formalize an OKB as a 5-tuple that binds normative obligations to an RDF/OWL concept schema, executable SHACL validation rules, explicit evidence requirements, and PROV-O provenance links. A deterministic regulatory compiler translates structured Intermediate Representation (IR) records into composable OKB modules, enabling profile-based governance reconfiguration without modifying service code. We implement two prototypes and evaluate them in an AI-assisted HPC resource allocation scenario across 24 validation runs and four governance profiles. Results demonstrate profile-sensitive validation, strictly additive violation accumulation, SHACL validation latency between 12.6 ms and 100.3 ms, and profile equivalence testing that confirms Combined as the strictly most comprehensive profile. All artefacts are released as open source.
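The 5-tuple structure and the additive accumulation of violations across profiles can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `OKB` class, its field names, the `validate_profile` helper, and the sample obligations are all hypothetical, and plain Python callables over a flat dictionary stand in for SHACL shapes over an RDF evidence graph.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Evidence = dict  # hypothetical: flat key/value stand-in for an evidence graph
Rule = Callable[[Evidence], Optional[str]]  # returns a violation message or None


@dataclass
class OKB:
    """Hypothetical 5-tuple: obligation, concepts, rules, evidence, provenance."""
    obligation: str               # normative obligation (identifier or prose)
    concepts: frozenset           # stand-in for the RDF/OWL concept schema
    rules: tuple                  # stand-in for executable SHACL shapes
    required_evidence: frozenset  # evidence keys that must be present
    provenance: dict              # stand-in for PROV-O links (key -> source)

    def validate(self, evidence: Evidence) -> list:
        # First report missing evidence, then run each rule.
        missing = sorted(self.required_evidence - set(evidence))
        violations = [f"{self.obligation}: missing evidence '{k}'" for k in missing]
        for rule in self.rules:
            message = rule(evidence)
            if message is not None:
                violations.append(f"{self.obligation}: {message}")
        return violations


def validate_profile(profile: list, evidence: Evidence) -> list:
    """A profile is a list of OKBs; adding an OKB can only add violations."""
    return [v for okb in profile for v in okb.validate(evidence)]


def has_model_card(evidence: Evidence) -> Optional[str]:
    return None if evidence.get("model_card") else "model card is empty"


# Hypothetical obligations, chosen only for illustration.
transparency = OKB("transparency", frozenset({"ModelCard"}), (has_model_card,),
                   frozenset({"model_card"}), {"model_card": "registry"})
traceability = OKB("traceability", frozenset({"DecisionLog"}), (),
                   frozenset({"decision_log"}), {"decision_log": "scheduler"})

evidence = {"model_card": "v1 card"}
combined = [transparency, traceability]
print(validate_profile([transparency], evidence))
print(validate_profile(combined, evidence))
```

In this sketch, switching profiles means passing a different list of blocks to `validate_profile`; the code producing the evidence is untouched, which mirrors the reconfiguration property the abstract describes.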
Abstract: The increasing adoption of large language models (LLMs) has created a pressing need for an efficient, secure, and private serving infrastructure, one that allows researchers to run open-source or custom fine-tuned LLMs and assures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of clusters, especially the expensive and scarce high-end GPUs required for optimal inference speed. We propose an architecture, together with an implementation, consisting of a web service that runs on a cloud VM with secure access to a scalable backend running a multitude of AI models on HPC systems. By offering a web service that uses our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to offer a private and secure alternative to commercial LLM services. Our solution natively integrates with Slurm, enabling seamless deployment on HPC clusters, and can run side by side with regular Slurm workloads while utilizing gaps in the schedule created by Slurm. To ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service and made the source code available at https://github.com/gwdg/chat-ai.
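The circuit-breaker idea can be sketched as follows. When `ForceCommand` is set in `sshd_config`, OpenSSH runs the configured program in place of whatever the client requested and exposes the client's request in the `SSH_ORIGINAL_COMMAND` environment variable. The Python sketch below is an assumption about how such a wrapper might look, not the code in the linked repository: the `ALLOWED` table, the keywords `submit` and `status`, and the `resolve` and `main` functions are all hypothetical.

```python
import os
import shlex

# Hypothetical allow-list: each accepted keyword maps to a fixed argv,
# so nothing supplied by the client is ever passed to a shell.
ALLOWED = {
    "submit": ["sbatch", "--parsable"],
    "status": ["squeue", "--noheader"],
}


def resolve(original_command: str):
    """Map the client's requested command to a fixed argv, or None if rejected."""
    try:
        parts = shlex.split(original_command)
    except ValueError:  # e.g. unbalanced quotes
        return None
    # Accept exactly one known keyword and nothing else.
    if len(parts) != 1 or parts[0] not in ALLOWED:
        return None
    return ALLOWED[parts[0]]


def main() -> None:
    """Entry point that a ForceCommand script would call (assumed wiring)."""
    argv = resolve(os.environ.get("SSH_ORIGINAL_COMMAND", ""))
    if argv is None:
        raise SystemExit("rejected: command not in allow-list")
    os.execvp(argv[0], argv)  # replace this process with the fixed command


# Demonstration of the filtering step only; main() is not invoked here.
print(resolve("status"))
print(resolve("status; rm -rf /"))
```

The design choice illustrated here is that the web-facing server can only select among predefined actions; even if it is compromised, an attacker holding its SSH key cannot run arbitrary commands on the cluster side.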
Abstract: The landscape of maintenance in distributed systems is rapidly evolving with the integration of Artificial Intelligence (AI). As the complexity of computing continuum systems intensifies, the role of AI in predictive maintenance (Pd.M.) becomes increasingly pivotal. This paper presents a comprehensive survey of the current state of Pd.M. in the computing continuum, with a focus on the combination of scalable AI technologies. Recognizing the limitations of traditional maintenance practices in the face of increasingly complex and heterogeneous computing continuum systems, the study explores how AI, especially machine learning and neural networks, is being used to enhance Pd.M. strategies. The survey encompasses a thorough review of existing literature, highlighting key advancements, methodologies, and case studies in the field. It critically examines the role of AI in improving prediction accuracy for system failures and in optimizing maintenance schedules, thereby contributing to reduced downtime and enhanced system longevity. By synthesizing findings from the latest advancements in the field, the article provides insights into the effectiveness and challenges of implementing AI-driven predictive maintenance. It underscores the evolution of maintenance practices in response to technological advancements and the growing complexity of computing continuum systems. The conclusions drawn from this survey are instrumental for practitioners and researchers in understanding the current landscape and future directions of Pd.M. in distributed systems. The survey emphasizes the need for continued research and development in this area, pointing towards a trend of more intelligent, efficient, and cost-effective maintenance solutions in the era of AI.