Ivan Porupski

State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?

Nov 11, 2025

Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić

Abstract: Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has increasingly moved toward zero-shot and few-shot prompting. However, the performance of LLMs on text classification, particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of current language models on text classification tasks across several South Slavic languages. We compare openly available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup, LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to these limitations, fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.

* 16 pages; 4 figures; 3 tables. Submitted to the LREC 2026 conference

Identifying Primary Stress Across Related Languages and Dialects with Transformer-based Speech Encoder Models

May 30, 2025

Nikola Ljubešić, Ivan Porupski, Peter Rupnik

Abstract: Automating primary stress identification has been an active research field due to the role of stress in encoding meaning and aiding speech comprehension. Previous studies relied mainly on traditional acoustic features and English datasets. In this paper, we investigate the approach of fine-tuning a pre-trained transformer model with an audio frame classification head. Our experiments use a new Croatian training dataset, with test sets in Croatian, Serbian, the Chakavian dialect, and Slovenian. By comparing an SVM classifier using traditional acoustic features with the fine-tuned speech transformer, we demonstrate the transformer's superiority across the board, achieving near-perfect results for Croatian and Serbian, with a 10-point performance drop for the more distant Chakavian and Slovenian. Finally, we show that only a few hundred multi-syllabic training words suffice for strong performance. We release our datasets and model under permissive licenses.

* Accepted to Interspeech 2025
