Katarzyna Lorenc

PLLuM: A Family of Polish Large Language Models

Nov 05, 2025

Jan Kocoń, Maciej Piasecki, Arkadiusz Janz, Teddy Ferdinan, Łukasz Radliński, Bartłomiej Koptyra, Marcin Oleksy, Stanisław Woźniak, Paweł Walkowiak, Konrad Wojtasik (+89 more)

Abstract: Large Language Models (LLMs) play a central role in modern artificial intelligence, yet their development has been primarily focused on English, resulting in limited support for other languages. We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. Developed by a consortium of major Polish research institutions, PLLuM addresses the need for high-quality, transparent, and culturally relevant language models beyond the English-centric commercial landscape. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training, a 77k custom instructions dataset, and a 100k preference optimization dataset. A key component is a Responsible AI framework that incorporates strict data governance and a hybrid module for output correction and safety filtering. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants, and demonstrate their utility in a downstream task within public administration. By releasing these models publicly, PLLuM aims to foster open research and strengthen sovereign AI technologies in Poland.

* 83 pages, 19 figures

Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models

Jun 09, 2025

Maciej Chrabąszcz, Katarzyna Lorenc, Karolina Seweryn

Abstract: Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing (NLP) tasks in recent years. However, their susceptibility to jailbreaks and perturbations necessitates additional evaluations. Many LLMs are multilingual, but safety-related training data contains mainly high-resource languages like English. This can leave them vulnerable to perturbations in low-resource languages such as Polish. We show how surprisingly strong attacks can be cheaply created by altering just a few characters and using a small proxy model for word importance calculation. We find that these character and word-level attacks drastically alter the predictions of different LLMs, suggesting a potential vulnerability that can be used to circumvent their internal safety mechanisms. We validate our attack construction methodology on Polish, a low-resource language, and find potential vulnerabilities of LLMs in this language. Additionally, we show how it can be extended to other languages. We release the created datasets and code for further research.

Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

Dec 23, 2024

Anna Kołos, Katarzyna Lorenc, Emilia Wiśnios, Agnieszka Karlińska

Abstract: The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We present forePLay, a novel Polish language dataset for erotic content detection, featuring over 24k annotated sentences with a multidimensional taxonomy encompassing ambiguity, violence, and social unacceptability dimensions. Our comprehensive evaluation demonstrates that specialized Polish language models achieve superior performance compared to multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically-aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.

* The forePLay dataset and associated resources will be made publicly available for research purposes upon publication, in accordance with data sharing regulations
