Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Promise Dodzi Kpoglu

Feature-Refined Unsupervised Model for Loanword Detection

Aug 25, 2025

Promise Dodzi Kpoglu

Abstract:We propose an unsupervised method for detecting loanwords i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.

Via

Access Paper or Ask Questions

Unsupervised Protoform Reconstruction through Parsimonious Rule-guided Heuristics and Evolutionary Search

Jun 12, 2025

Promise Dodzi Kpoglu

Abstract:We propose an unsupervised method for the reconstruction of protoforms i.e., ancestral word forms from which modern language forms are derived. While prior work has primarily relied on probabilistic models of phonological edits to infer protoforms from cognate sets, such approaches are limited by their predominantly data-driven nature. In contrast, our model integrates data-driven inference with rule-based heuristics within an evolutionary optimization framework. This hybrid approach leverages on both statistical patterns and linguistically motivated constraints to guide the reconstruction process. We evaluate our method on the task of reconstructing Latin protoforms using a dataset of cognates from five Romance languages. Experimental results demonstrate substantial improvements over established baselines across both character-level accuracy and phonological plausibility metrics.

Via

Access Paper or Ask Questions