Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.




Abstract: The explosions on September 26th, 2022, which damaged the Nord Stream 1 and Nord Stream 2 gas pipelines, highlighted the urgent need to improve the resilience of Underwater Critical Infrastructures (UCIs). Comprising gas pipelines and power and communication cables, these infrastructures connect countries worldwide and are critical for the global economy and stability. An attack targeting several such infrastructures simultaneously could cause significant damage and greatly affect various aspects of daily life. Because the number of UCIs keeps growing through continuous deployment, existing underwater surveillance solutions, such as Autonomous Underwater Vehicles (AUVs) or Remotely Operated Vehicles (ROVs), are not adequate to ensure thorough monitoring. We show that combining information from underwater and above-water surveillance sensors makes it possible to achieve Seabed-to-Space Situational Awareness (S3A), mainly thanks to Artificial Intelligence (AI) and Information Fusion (IF) methodologies. These methodologies are designed to process immense volumes of information, fused from a variety of sources and generated by monitoring a very large number of assets on a daily basis. The learned knowledge can be used to anticipate future behaviors, identify threats, and determine critical situations concerning UCIs. To illustrate the capabilities and importance of S3A, we consider three events that occurred in the second half of 2022: the aforementioned Nord Stream explosions, the cutoff of the underwater communication cable SHEFA-2 connecting the Shetland Islands and the UK mainland, and the suspicious activity of a large vessel in the Adriatic Sea. Specifically, we provide analyses of the available Automatic Identification System (AIS) and satellite data, integrated with possible contextual information, e.g., bathymetry, weather conditions, and human intelligence.