Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Houda Bouamor

Carnegie Mellon University Qatar

LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

Dec 30, 2025

May Bashendy, Walid Massoud, Sohaila Eltanbouly, Salam Albatarni, Marwan Sayed, Abrar Abir, Houda Bouamor, Tamer Elsayed

Abstract:Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.

Via

Access Paper or Ask Questions

Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset

Dec 24, 2024

Jiarui Liu, Iman Ouzzani, Wenkai Li, Lechen Zhang, Tianyue Ou, Houda Bouamor, Zhijing Jin, Mona Diab

Abstract:The field of machine translation has achieved significant advancements, yet domain-specific terminology translation, particularly in AI, remains challenging. We introduced GIST, a large-scale multilingual AI terminology dataset containing 5K terms extracted from top AI conference papers spanning 2000 to 2023. The terms were translated into Arabic, Chinese, French, Japanese, and Russian using a hybrid framework that combines LLMs for extraction with human expertise for translation. The dataset's quality was benchmarked against existing resources, demonstrating superior translation accuracy through crowdsourced evaluation. GIST was integrated into translation workflows using post-translation refinement methods that required no retraining, where LLM prompting consistently improved BLEU and COMET scores. A web demonstration on the ACL Anthology platform highlights its practical application, showcasing improved accessibility for non-English speakers. This work aims to address critical gaps in AI terminology resources and fosters global inclusivity and collaboration in AI research.

Via

Access Paper or Ask Questions

The FIGNEWS Shared Task on News Media Narratives

Jul 25, 2024

Wajdi Zaghouani, Mustafa Jarrar, Nizar Habash, Houda Bouamor, Imed Zitouni, Mona Diab, Samhaa R. El-Beltagy, Muhammed AbuOdeh

Figure 1 for The FIGNEWS Shared Task on News Media Narratives

Figure 2 for The FIGNEWS Shared Task on News Media Narratives

Figure 3 for The FIGNEWS Shared Task on News Media Narratives

Figure 4 for The FIGNEWS Shared Task on News Media Narratives

Abstract:We present an overview of the FIGNEWS shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. The shared task addresses bias and propaganda annotation in multilingual news posts. We focus on the early days of the Israel War on Gaza as a case study. The task aims to foster collaboration in developing annotation guidelines for subjective tasks by creating frameworks for analyzing diverse narratives highlighting potential bias and propaganda. In a spirit of fostering and encouraging diversity, we address the problem from a multilingual perspective, namely within five languages: English, French, Arabic, Hebrew, and Hindi. A total of 17 teams participated in two annotation subtasks: bias (16 teams) and propaganda (6 teams). The teams competed in four evaluation tracks: guidelines development, annotation quality, annotation quantity, and consistency. Collectively, the teams produced 129,800 data points. Key findings and implications for the field are discussed.

* 18 pages, 10 tables, 1 figure, accepted to ArabicNLP 2024 co-located with ACL 2024

Via

Access Paper or Ask Questions

AraFinNLP 2024: The First Arabic Financial NLP Shared Task

Jul 13, 2024

Sanad Malaysha, Mo El-Haj, Saad Ezzini, Mohammed Khalilia, Mustafa Jarrar, Sultan Almujaiwel, Ismail Berrada, Houda Bouamor

Figure 1 for AraFinNLP 2024: The First Arabic Financial NLP Shared Task

Figure 2 for AraFinNLP 2024: The First Arabic Financial NLP Shared Task

Figure 3 for AraFinNLP 2024: The First Arabic Financial NLP Shared Task

Figure 4 for AraFinNLP 2024: The First Arabic Financial NLP Shared Task

Abstract:The expanding financial markets of the Arab world require sophisticated Arabic NLP tools. To address this need within the banking domain, the Arabic Financial NLP (AraFinNLP) shared task proposes two subtasks: (i) Multi-dialect Intent Detection and (ii) Cross-dialect Translation and Intent Preservation. This shared task uses the updated ArBanking77 dataset, which includes about 39k parallel queries in MSA and four dialects. Each query is labeled with one or more of a common 77 intents in the banking domain. These resources aim to foster the development of robust financial Arabic NLP, particularly in the areas of machine translation and banking chat-bots. A total of 45 unique teams registered for this shared task, with 11 of them actively participated in the test phase. Specifically, 11 teams participated in Subtask 1, while only 1 team participated in Subtask 2. The winning team of Subtask 1 achieved F1 score of 0.8773, and the only team submitted in Subtask 2 achieved a 1.667 BLEU score.

Via

Access Paper or Ask Questions

NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task

Jul 06, 2024

Muhammad Abdul-Mageed, Amr Keleg, AbdelRahim Elmadany, Chiyu Zhang, Injy Hamed, Walid Magdy, Houda Bouamor, Nizar Habash

Abstract:We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI's objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on pre-specified tasks. NADI 2024 targeted both dialect identification cast as a multi-label task (Subtask~1), identification of the Arabic level of dialectness (Subtask~2), and dialect-to-MSA machine translation (Subtask~3). A total of 51 unique teams registered for the shared task, of whom 12 teams have participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask~1, three in Subtask~2, and eight in Subtask~3. The winning teams achieved 50.57 F\textsubscript{1} on Subtask~1, 0.1403 RMSE for Subtask~2, and 20.44 BLEU in Subtask~3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

* Accepted by The Second Arabic Natural Language Processing Conference

Via

Access Paper or Ask Questions

Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Jul 03, 2024

Bashar Alhafni, Sarah Al-Towaity, Ziyad Fawzy, Fatema Nassar, Fadhl Eryani, Houda Bouamor, Nizar Habash

Figure 1 for Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Figure 2 for Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Figure 3 for Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Figure 4 for Exploiting Dialect Identification in Automatic Dialectal Text Normalization

Abstract:Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available.

* Accepted to ArabicNLP 2024, ACL

Via

Access Paper or Ask Questions

Chinese Offensive Language Detection:Current Status and Future Directions

Mar 29, 2024

Yunze Xiao, Houda Bouamor, Wajdi Zaghouani

Figure 1 for Chinese Offensive Language Detection:Current Status and Future Directions

Figure 2 for Chinese Offensive Language Detection:Current Status and Future Directions

Figure 3 for Chinese Offensive Language Detection:Current Status and Future Directions

Abstract:Despite the considerable efforts being made to monitor and regulate user-generated content on social media platforms, the pervasiveness of offensive language, such as hate speech or cyberbullying, in the digital space remains a significant challenge. Given the importance of maintaining a civilized and respectful online environment, there is an urgent and growing need for automatic systems capable of detecting offensive speech in real time. However, developing effective systems for processing languages such as Chinese presents a significant challenge, owing to the language's complex and nuanced nature, which makes it difficult to process automatically. This paper provides a comprehensive overview of offensive language detection in Chinese, examining current benchmarks and approaches and highlighting specific models and tools for addressing the unique challenges of detecting offensive language in this complex language. The primary objective of this survey is to explore the existing techniques and identify potential avenues for further research that can address the cultural and linguistic complexities of Chinese.

Via

Access Paper or Ask Questions

Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching

Feb 03, 2024

Kurt Micallef, Nizar Habash, Claudia Borg, Fadhl Eryani, Houda Bouamor

Abstract:Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model's pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language, thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.

* EACL 2024 camera-ready version

Via

Access Paper or Ask Questions

NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Oct 24, 2023

Muhammad Abdul-Mageed, AbdelRahim Elmadany, Chiyu Zhang, El Moatez Billah Nagoudi, Houda Bouamor, Nizar Habash

Figure 1 for NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Figure 2 for NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Figure 3 for NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Figure 4 for NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Abstract:We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of whom 18 teams have participated (with 76 valid submissions during test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 Bleu in Subtask 2, and 21.10 Bleu in Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

* arXiv admin note: text overlap with arXiv:2210.09582

Via

Access Paper or Ask Questions

NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

Oct 18, 2022

Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Houda Bouamor, Nizar Habash

Figure 1 for NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

Figure 2 for NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

Figure 3 for NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

Figure 4 for NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

Abstract:We describe findings of the third Nuanced Arabic Dialect Identification Shared Task (NADI 2022). NADI aims at advancing state of the art Arabic NLP, including on Arabic dialects. It does so by affording diverse datasets and modeling opportunities in a standardized context where meaningful comparisons between models and approaches are possible. NADI 2022 targeted both dialect identification (Subtask 1) and dialectal sentiment analysis (Subtask 2) at the country level. A total of 41 unique teams registered for the shared task, of whom 21 teams have actually participated (with 105 valid submissions). Among these, 19 teams participated in Subtask 1 and 10 participated in Subtask 2. The winning team achieved 27.06 F1 on Subtask 1 and F1=75.16 on Subtask 2, reflecting that the two subtasks remain challenging and motivating future work in this area. We describe methods employed by participating teams and offer an outlook for NADI.

* arXiv admin note: text overlap with arXiv:2103.08466

Via

Access Paper or Ask Questions