Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Taido Purason

EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Mar 02, 2026

Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel, Tanel Alumäe, Eleri Aedmaa, Krister Kruusmaa, Kairit Sirts

Abstract:Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.

Via

Access Paper or Ask Questions

Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian

Jan 05, 2025

Aleksei Dorkin, Taido Purason, Kairit Sirts

Figure 1 for Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian

Figure 2 for Prune or Retrain: Optimizing the Vocabulary of Multilingual Models for Estonian

Abstract:Adapting multilingual language models to specific languages can enhance both their efficiency and performance. In this study, we explore how modifying the vocabulary of a multilingual encoder model to better suit the Estonian language affects its downstream performance on the Named Entity Recognition (NER) task. The motivations for adjusting the vocabulary are twofold: practical benefits affecting the computational cost, such as reducing the input sequence length and the model size, and performance enhancements by tailoring the vocabulary to the particular language. We evaluate the effectiveness of two vocabulary adaptation approaches -- retraining the tokenizer and pruning unused tokens -- and assess their impact on the model's performance, particularly after continual training. While retraining the tokenizer degraded the performance of the NER task, suggesting that longer embedding tuning might be needed, we observed no negative effects on pruning.

* Published in the Proceedings of the 9th International Workshop on Computational Linguistics for Uralic Languages

Via

Access Paper or Ask Questions

LLMs for Extremely Low-Resource Finno-Ugric Languages

Oct 24, 2024

Taido Purason, Hele-Andra Kuulmets, Mark Fishel

Figure 1 for LLMs for Extremely Low-Resource Finno-Ugric Languages

Figure 2 for LLMs for Extremely Low-Resource Finno-Ugric Languages

Figure 3 for LLMs for Extremely Low-Resource Finno-Ugric Languages

Figure 4 for LLMs for Extremely Low-Resource Finno-Ugric Languages

Abstract:The advancement of large language models (LLMs) has predominantly focused on high-resource languages, leaving low-resource languages, such as those in the Finno-Ugric family, significantly underrepresented. This paper addresses this gap by focusing on V\~oro, Livonian, and Komi. We cover almost the entire cycle of LLM creation, from data collection to instruction tuning and evaluation. Our contributions include developing multilingual base and instruction-tuned models; creating evaluation benchmarks, including the smugri-MT-bench multi-turn conversational benchmark; and conducting human evaluation. We intend for this work to promote linguistic diversity, ensuring that lesser-resourced languages can benefit from advancements in NLP.

Via

Access Paper or Ask Questions

Teaching Llama a New Language Through Cross-Lingual Knowledge Transfer

Apr 05, 2024

Hele-Andra Kuulmets, Taido Purason, Agnes Luhtaru, Mark Fishel

Abstract:This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian. Leveraging the Llama 2 model, we investigate the impact of combining cross-lingual instruction-tuning with additional monolingual pretraining. Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian. Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities. Our best model, named \textsc{Llammas}, represents the first open-source instruction-following LLM for Estonian. Additionally, we publish Alpaca-est, the first general task instruction dataset for Estonia. These contributions mark the initial progress in the direction of developing open-source LLMs for Estonian.

Via

Access Paper or Ask Questions

To Err Is Human, but Llamas Can Learn It Too

Mar 08, 2024

Agnes Luhtaru, Taido Purason, Martin Vainikko, Maksym Del, Mark Fishel

Figure 1 for To Err Is Human, but Llamas Can Learn It Too

Figure 2 for To Err Is Human, but Llamas Can Learn It Too

Figure 3 for To Err Is Human, but Llamas Can Learn It Too

Figure 4 for To Err Is Human, but Llamas Can Learn It Too

Abstract:This study explores enhancing grammatical error correction (GEC) through artificial error generation (AEG) using language models (LMs). Specifically, we fine-tune Llama 2-based LMs for error generation and find that this approach yields synthetic errors akin to human errors. Next, we train GEC Llama models with the help of these artificial errors and outperform previous state-of-the-art error correction models, with gains ranging between 0.8 and 6 F0.5 points across all tested languages (German, Ukrainian, and Estonian). Moreover, we demonstrate that generating errors by fine-tuning smaller sequence-to-sequence models and prompting large commercial LMs (GPT-3.5 and GPT-4) also results in synthetic errors beneficially affecting error generation models.

Via

Access Paper or Ask Questions