Nizar Habash

New York University Abu Dhabi

Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching

Feb 03, 2024

Kurt Micallef, Nizar Habash, Claudia Borg, Fadhl Eryani, Houda Bouamor

Abstract: Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model's pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language, thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation with mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune models on the processed data for four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.

* EACL 2024 camera-ready version

Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals

Feb 01, 2024

Christian Khairallah, Reham Marzouk, Salam Khalifa, Mayar Nassar, Nizar Habash

Abstract: Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals' intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.

* Findings of the Association for Computational Linguistics: EACL 2024

NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Oct 24, 2023

Muhammad Abdul-Mageed, AbdelRahim Elmadany, Chiyu Zhang, El Moatez Billah Nagoudi, Houda Bouamor, Nizar Habash

Abstract: We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of which 18 participated (with 76 valid submissions during the test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 BLEU on Subtask 2, and 21.10 BLEU on Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

* arXiv admin note: text overlap with arXiv:2210.09582

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

Oct 23, 2023

Injy Hamed, Nizar Habash, Ngoc Thang Vu

Abstract: Code-switching (CSW) text generation has been receiving increasing attention as a solution to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the absence of CSW parallel data, where both approaches achieve similar results.

* Findings of EMNLP 2023

Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation

May 24, 2023

Bashar Alhafni, Go Inoue, Christian Khairallah, Nizar Habash

Abstract: Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC by using two newly developed Transformer-based pretrained sequence-to-sequence models. We address the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve state-of-the-art results on two Arabic GEC shared task datasets and establish a strong benchmark on a newly created dataset.

Camelira: An Arabic Multi-Dialect Morphological Disambiguator

Nov 30, 2022

Ossama Obeid, Go Inoue, Nizar Habash

Abstract: We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine. Camelira offers a user-friendly web interface that allows researchers and language learners to explore various linguistic information, such as part-of-speech, morphological features, and lemmas. Our system also provides an option to automatically choose an appropriate dialect-specific disambiguator based on the prediction of a dialect identification component. Camelira is publicly accessible at http://camelira.camel-lab.com.

ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic - English

Nov 22, 2022

Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Abstract: We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic - English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems.

* Accepted to the Seventh Arabic Natural Language Processing Workshop (WANLP 2022)

Benchmarking Evaluation Metrics for Code-Switching Automatic Speech Recognition

Nov 22, 2022

Injy Hamed, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, Ahmed Ali

Abstract: Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition. In this paper, we focus on the question of robust and fair evaluation metrics. To that end, we develop a reference benchmark data set of code-switching speech recognition hypotheses with human judgments. We define clear guidelines for minimal editing of automatic hypotheses. We validate the guidelines using 4-way inter-annotator agreement. We evaluate a large number of metrics in terms of correlation with human judgments. The metrics we consider vary in terms of representation (orthographic, phonological, semantic), directness (intrinsic vs. extrinsic), granularity (e.g., word, character), and similarity computation method. The highest correlation to human judgment is achieved using transliteration followed by text normalization. We release the first corpus for human acceptance of code-switching speech recognition results in dialectal Arabic/English conversational speech.

* Accepted to SLT 2022

Maknuune: A Large Open Palestinian Arabic Lexicon

Oct 24, 2022

Shahd Dibas, Christian Khairallah, Nizar Habash, Omar Fayez Sadi, Tariq Sairafy, Karmel Sarabta, Abrar Ardah

Abstract: We present Maknuune, a large open lexicon for the Palestinian Arabic dialect. Maknuune has over 36K entries from 17K lemmas and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription, and English glosses. Some entries are enriched with additional information such as broken plurals and templatic feminine forms, associated phrases and collocations, Standard Arabic glosses, and examples or notes on grammar, usage, or the location where the entry was collected.

Arabic Word-level Readability Visualization for Assisted Text Simplification

Oct 19, 2022

Reem Hazim, Hind Saddiki, Bashar Alhafni, Muhamed Al Khalil, Nizar Habash

Abstract: This demo paper presents a Google Docs add-on for automatic Arabic word-level readability visualization. The add-on includes a lemmatization component that is connected to a five-level readability lexicon and Arabic WordNet-based substitution suggestions. The add-on can be used for assessing the reading difficulty of a text and identifying difficult words as part of the task of manual text simplification. We make our add-on and its code publicly available.
