Mayar Nassar

Proper Name Diacritization for Arabic Wikipedia: A Benchmark Dataset

May 05, 2025

Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallaf, Nizar Habash

Abstract: Proper names in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper names of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper name diacritization.
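As a rough illustration of the task setup described in the abstract, the sketch below strips diacritics from an Arabic name and scores predictions by exact match against a reference. It is not the authors' evaluation code: the function names, the set of marks treated as diacritics, and the use of exact-match scoring are all assumptions made for illustration, and the paper may define accuracy differently.

```python
import unicodedata

# Assumption: treat tanween, short vowels, shadda, sukun (U+064B..U+0652)
# and the dagger alif (U+0670) as the diacritics of interest.
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return the undiacritized form of an Arabic string."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def _normalize(text: str) -> str:
    # NFC applies canonical ordering to combining marks, so two strings that
    # differ only in the order of stacked marks compare as equal.
    return unicodedata.normalize("NFC", text.strip())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to their reference (hypothetical metric)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        _normalize(p) == _normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    # A diacritized name and its undiacritized input form.
    gold = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
    print(strip_diacritics(gold))
    # Scoring a (hypothetical) system output against the gold form.
    print(exact_match_accuracy([gold], [gold]))
```

Exact match is a strict criterion: a single wrong or missing mark counts the whole name as an error, which is one plausible reason a name-level score can look low even when most characters are correct.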

Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals

Feb 01, 2024

Christian Khairallah, Reham Marzouk, Salam Khalifa, Mayar Nassar, Nizar Habash

Abstract: Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals' intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.

* Findings of the Association for Computational Linguistics: EACL 2024
