Alex Jude

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

May 28, 2025

Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk (+9 more)

Abstract: High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like FineWeb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.

* Project page available at https://huggingface.co/spaces/Jackal-AI/JQL
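
The distillation idea in the abstract (a lightweight annotator on top of pretrained embeddings, trained to reproduce LLM quality annotations) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not specify the annotator architecture, so a plain linear head fitted by gradient descent stands in for it, the embeddings and "LLM scores" are synthetic, and the names train_linear_head, score, keep_high_quality, and make_toy_data are hypothetical rather than taken from JQL.

```python
import random


def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_head(embeddings, llm_scores, lr=0.1, epochs=500):
    """Fit w, b by full-batch gradient descent on mean squared error.

    The embeddings stay frozen; only this small head is trained, which is
    the sense in which the annotator is 'lightweight' in this sketch.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, llm_scores):
            err = dot(w, x) + b - y
            for i in range(dim):
                grad_w[i] += 2.0 * err * x[i] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def score(w, b, x):
    """Predicted quality score for one document embedding."""
    return dot(w, x) + b


def keep_high_quality(docs, embeddings, w, b, threshold):
    """Keep documents whose predicted score reaches the threshold."""
    return [d for d, x in zip(docs, embeddings) if score(w, b, x) >= threshold]


def make_toy_data(n=200, dim=4, seed=0):
    """Synthetic stand-ins: random 'embeddings' and noise-free 'LLM scores'."""
    rng = random.Random(seed)
    embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    llm_scores = [2.0 * x[0] - 1.0 * x[1] + 3.0 for x in embeddings]
    return embeddings, llm_scores


if __name__ == "__main__":
    embeddings, llm_scores = make_toy_data()
    w, b = train_linear_head(embeddings, llm_scores)
    docs = [f"doc-{i}" for i in range(len(embeddings))]
    kept = keep_high_quality(docs, embeddings, w, b, threshold=3.0)
    print(f"kept {len(kept)} of {len(docs)} documents")
```

In a real pipeline the embeddings would come from a pretrained multilingual encoder and the targets from LLM annotations; the threshold of 3.0 is arbitrary and only marks where a retention decision would be made.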

Towards Multilingual LLM Evaluation for European Languages

Oct 17, 2024

Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel (+1 more)

Abstract: The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of language-parallel multilingual benchmarks. We introduce a multilingual evaluation approach tailored for European languages. We employ translated versions of five widely used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.
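
Comparing one model across translated versions of the same benchmark comes down to scoring each language separately and then summarizing. The sketch below is a minimal illustration under assumptions: the abstract does not say how results are aggregated, so an unweighted macro-average over languages is used here, and the function names and the (language, is_correct) record layout are hypothetical.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Map each language to its accuracy.

    records: iterable of (language, is_correct) pairs, one per benchmark item.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, is_correct in records:
        total[language] += 1
        correct[language] += int(is_correct)
    return {lang: correct[lang] / total[lang] for lang in total}


def macro_average(per_language):
    """Unweighted mean over languages, so every language counts equally."""
    return sum(per_language.values()) / len(per_language)


if __name__ == "__main__":
    # Made-up results for one model on a translated benchmark.
    records = [
        ("de", True), ("de", False),
        ("fr", True), ("fr", True), ("fr", True), ("fr", False),
    ]
    per_language = accuracy_by_language(records)
    print(per_language, macro_average(per_language))
```

A macro-average keeps a language with few items from being drowned out by one with many; a micro-average over all items would weight languages by size instead.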

Towards Cross-Lingual LLM Evaluation for European Languages

Oct 11, 2024

Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel (+1 more)

Abstract: The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains challenging, especially due to the scarcity of multilingual benchmarks. We introduce a cross-lingual evaluation approach tailored for European languages. We employ translated versions of five widely used benchmarks to assess the capabilities of 40 LLMs across 21 European languages. Our contributions include examining the effectiveness of translated benchmarks, assessing the impact of different translation services, and offering a multilingual evaluation framework for LLMs that includes newly created datasets: EU20-MMLU, EU20-HellaSwag, EU20-ARC, EU20-TruthfulQA, and EU20-GSM8K. The benchmarks and results are made publicly available to encourage further research in multilingual LLM evaluation.
