Department of Digital Humanities, Eötvös Loránd University
Abstract: We present Racka, a lightweight, continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages such as English and German. Racka employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen-3 4B backbone, making the recipe practical on A100 (40 GB)-based HPC clusters with low inter-node bandwidth. To better match the training distribution, we replace and adapt the tokenizer, achieving substantially improved tokenization fertility for Hungarian while maintaining competitive performance in English and German. The model is trained on 160B subword tokens drawn from a mixture of internet and high-quality curated sources, with a composition of 44% Hungarian, 24% English, 21% German, and 11% code. This data mix is chosen to mitigate catastrophic forgetting and preserve high-resource language capabilities during continual pretraining. Our preliminary results indicate modest but stable gains in language adaptation.
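
A rough sketch of how such a LoRA-based continual pretraining setup could look with the Hugging Face transformers and peft libraries is given below. The checkpoint id, LoRA rank, and target modules are illustrative assumptions, not the configuration reported in the abstract.

```python
# Sketch: parameter-efficient continual pretraining with LoRA on a Qwen-3 4B backbone.
# The checkpoint id, rank, and target modules below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-4B"  # assumed Hugging Face checkpoint id
base = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_cfg = LoraConfig(
    r=64,                       # assumed low-rank dimension
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are updated
```

From here, the wrapped model can be passed to a standard causal language modeling training loop over the continual pretraining mixture; the frozen backbone keeps memory and inter-node communication requirements low.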




Abstract: We present an extended comparison of contextualized language models for Hungarian. We compare huBERT, a Hungarian model, against four multilingual models, including multilingual BERT. We evaluate these models on three tasks: morphological probing, POS tagging, and NER. We find that huBERT outperforms the other models, often by a large margin, particularly near the global optimum (typically at the middle layers). We also find that huBERT tends to generate fewer subwords per word and that using the last subword for token-level tasks is generally a better choice than using the first one.
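
A minimal sketch of the last-subword pooling choice described above, using the transformers library: the huBERT checkpoint id, the example sentence, and the choice of layer 6 as a "middle" layer are assumptions for illustration, not details taken from the paper.

```python
# Sketch: last-subword pooling of a middle transformer layer for token-level tasks.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "SZTAKI-HLT/hubert-base-cc"  # assumed huBERT checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, output_hidden_states=True)

words = ["A", "kutya", "ugat"]  # pre-tokenized Hungarian sentence: "The dog barks"
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**enc).hidden_states  # embeddings + one tensor per layer

layer = hidden_states[6][0]   # an assumed middle layer, shape (seq_len, hidden_size)
word_ids = enc.word_ids()     # maps each subword position to its word index (or None)

# For each word, keep the representation of its *last* subword.
last_subword_vecs = {}
for pos, wid in enumerate(word_ids):
    if wid is not None:
        last_subword_vecs[wid] = layer[pos]  # later subwords overwrite earlier ones

features = torch.stack([last_subword_vecs[i] for i in range(len(words))])
print(features.shape)  # (3, hidden_size); one vector per word for tagging or probing
```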




Abstract: This paper aims to make up for the lack of documented baselines for Hungarian language modeling. Various approaches are evaluated on three publicly available Hungarian corpora. Perplexity values comparable to those of models trained on similar-sized English corpora are reported. A new, freely downloadable Hungarian benchmark corpus is introduced.
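
As background for the reported perplexity values: perplexity is the exponential of the average per-token negative log-likelihood on held-out text. A minimal sketch of that computation follows; the log-probabilities are made-up numbers for illustration only.

```python
# Sketch: perplexity as the exponential of mean negative log-likelihood.
import math

# Hypothetical per-token log-probabilities (natural log) assigned by a
# language model to a held-out test sequence.
log_probs = [-2.3, -0.7, -1.9, -3.1, -0.4]

nll = -sum(log_probs) / len(log_probs)  # average negative log-likelihood per token
perplexity = math.exp(nll)
print(f"perplexity = {perplexity:.2f}")
```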