Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Abishek Stephen

Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features

Jan 26, 2026

Abishek Stephen, Jindřich Libovický

Abstract:We present a novel metric for the evaluation of the morphological plausibility of subword segmentation. Unlike the typically used morpheme boundary or retrieval F-score, which requires gold segmentation data that is either unavailable or of inconsistent quality across many languages, our approach utilizes morpho-syntactic features. These are available in resources such as Universal Dependencies or UniMorph for a much wider range of languages. The metric works by probabilistically aligning subwords with morphological features through an IBM Model 1. Our experiments show that the metric correlates well with traditional morpheme boundary recall while being more broadly applicable across languages with different morphological systems.

* Accepted to Findings of EACL 2026, 9 pages, 6 figures

Via

Access Paper or Ask Questions

Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

Oct 24, 2025

Kellen Parker van Dam, Abishek Stephen

Abstract:Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.

* Submitted to The 5th Workshop on Evaluation and Comparison for NLP systems (Eval4NLP) 2025

Via

Access Paper or Ask Questions

Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Mar 04, 2025

Arne Rubehn, Christoph Rzymski, Luca Ciucci, Kellen Parker van Dam, Alžběta Kučerová, Katja Bocklage, David Snee, Abishek Stephen, Johann-Mattis List

Figure 1 for Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Figure 2 for Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Figure 3 for Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Figure 4 for Annotating and Inferring Compositional Structures in Numeral Systems Across Languages

Abstract:Numeral systems across the world's languages vary in fascinating ways, both regarding their synchronic structure and the diachronic processes that determined how they evolved in their current shape. For a proper comparison of numeral systems across different languages, however, it is important to code them in a standardized form that allows for the comparison of basic properties. Here, we present a simple but effective coding scheme for numeral annotation, along with a workflow that helps to code numeral systems in a computer-assisted manner, providing sample data for numerals from 1 to 40 in 25 typologically diverse languages. We perform a thorough analysis of the sample, focusing on the systematic comparison between the underlying and the surface morphological structure. We further experiment with automated models for morpheme segmentation, where we find allomorphy as the major reason for segmentation errors. Finally, we show that subword tokenization algorithms are not viable for discovering morphemes in low-resource scenarios.

* Submitted to the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP (SIGTYP)

Via

Access Paper or Ask Questions