Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Joona Kytöniemi

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Dec 15, 2025

Joona Kytöniemi, Jousia Piha, Akseli Reunamo, Fedor Vitiugin, Farrokh Mehryary, Sampo Pyysalo

Figure 1 for FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Figure 2 for FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Figure 3 for FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Figure 4 for FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Abstract:We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.

Via

Access Paper or Ask Questions

An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Mar 13, 2025

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, and Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič(+25 more)

Figure 1 for An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Figure 2 for An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Figure 3 for An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Figure 4 for An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

Abstract:Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.

Via

Access Paper or Ask Questions