Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Iaroslav Chelombitko

Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets

Jan 26, 2026

Iaroslav Chelombitko, Mika Hämäläinen, Aleksey Komissarov

Abstract:We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating to phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.

* 15 pages, 4 figues, 4 tables

Via

Access Paper or Ask Questions

SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

Jan 08, 2026

Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov

Abstract:The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the "elbow points" of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP

* Accepted to the 10th International Workshop on Computational Linguistics for Uralic Languages (IWCLUL 2025), pp. 57-67

Via

Access Paper or Ask Questions

Compressed code: the hidden effects of quantization and distillation on programming tokens

Jan 05, 2026

Viacheslav Siniaev, Iaroslav Chelombitko, Aleksey Komissarov

Abstract:Large Language Models (LLMs) have demonstrated exceptional code generation capabilities, yet their token-level mechanisms remain underexplored, particularly in compressed models. Through systematic analysis of programming language token representations, we characterize how programming languages are encoded in LLM tokenizers by analyzing their vocabulary distribution and keyword coverage patterns. We introduce a novel cold-start probability analysis method that provides insights into model behavior without requiring explicit prompts. Additionally, we present a comprehensive evaluation of how different model optimization techniques - including quantization, distillation, model scaling, and task-specific fine-tuning - affect token-level representations and code generation quality. Our experiments, supported by comprehensive probability distribution analysis and evaluation metrics, reveal critical insights into token-level behavior and provide empirically-validated guidelines for maintaining code generation quality under various optimization constraints. These findings advance both theoretical understanding of LLM code generation and practical implementation of optimized models in production environments.

* 18 pages, 1 figure and 6 tables

Via

Access Paper or Ask Questions

When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes

May 13, 2025

Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov

Abstract:The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes including three human assemblies by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary with increasing assembly comparisons. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.

* ICLR 2025 Workshop on Machine Learning for Genomics Explorations

Via

Access Paper or Ask Questions

Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Oct 16, 2024

Iaroslav Chelombitko, Egor Safronov, Aleksey Komissarov

Figure 1 for Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Figure 2 for Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Figure 3 for Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Figure 4 for Qtok: A Comprehensive Framework for Evaluating Multilingual Tokenizer Quality in Large Language Models

Abstract:In the development of Large Language Models (LLMs), considerable attention has been given to the quality of training datasets. However, the role of tokenizers in the LLM training pipeline, particularly for multilingual models, has received less focus. The quality of tokenization can significantly impact a model's ability to handle diverse languages effectively. We introduce Qtok, a tool designed to assess tokenizer quality with a specific emphasis on their performance in multilingual contexts. Our research proposes a set of metrics for evaluating tokenizer quality, including measures of language coverage, token completeness, and distribution across languages and linguistic categories. Qtok applies these metrics to evaluate 13 distinct tokenizers from 58 publicly available models, analyzing their output across different linguistic contexts. Our analysis revealed significant variations in token distribution across languages and categories, highlighting potential biases and areas for improvement in current tokenization strategies. This research contributes to the field of tokenizer evaluation within multilingual LLM development by providing a systematic approach to assessing tokenizer quality. Our findings highlight the critical role of tokenization in multilingual LLM capability. The Qtok tool and our analysis methodology offer practical means for researchers to evaluate and improve tokenization strategies for multilingual applications. We offer a method to compare tokenizer quality across these metrics, which may be useful when selecting or adjusting tokenizers for specific multilingual LLM applications.

* 24 pages, 9 figures, 6 tables. Code and data available at https://github.com/nup-csai/Qtok/

Via

Access Paper or Ask Questions