El Moatez Billah Nagoudi

TARJAMAT: Evaluation of Bard and ChatGPT on Machine Translation of Ten Arabic Varieties

Aug 06, 2023
Karima Kadaoui, Samar M. Magdy, Abdul Waheed, Md Tawkat Islam Khondaker, Ahmed Oumar El-Shangiti, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

Large language models (LLMs) finetuned to follow human instructions have recently emerged as a breakthrough in AI. Models such as Google Bard and OpenAI ChatGPT, for example, are surprisingly powerful tools for question answering, code debugging, and dialogue generation. Despite the purported multilingual proficiency of these models, their linguistic inclusivity remains insufficiently explored. Considering this constraint, we present a thorough assessment of Bard and ChatGPT (encompassing both GPT-3.5 and GPT-4) regarding their machine translation proficiencies across ten varieties of Arabic. Our evaluation covers diverse Arabic varieties such as Classical Arabic, Modern Standard Arabic, and several nuanced dialectal variants. Furthermore, we undertake a human-centric study to scrutinize the efficacy of the most recent model, Bard, in following human instructions during translation tasks. Our exhaustive analysis indicates that LLMs may encounter challenges with certain Arabic dialects, particularly those for which minimal public data exists, such as Algerian and Mauritanian dialects. However, they exhibit satisfactory performance with more prevalent dialects, albeit occasionally trailing behind established commercial systems like Google Translate. Additionally, our analysis reveals a circumscribed capability of Bard in aligning with human instructions in translation contexts. Collectively, our findings underscore that prevailing LLMs remain far from inclusive, with only limited ability to cater for the linguistic and cultural intricacies of diverse communities.

Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

May 24, 2023
El Moatez Billah Nagoudi, Ahmed El-Shangiti, AbdelRahim Elmadany, Muhammad Abdul-Mageed

We present Dolphin, a novel benchmark that addresses the need for an evaluation framework for the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including text summarization, machine translation, question answering, and dialogue generation, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also evaluate several Arabic and multilingual models on our benchmark, allowing us to set strong baselines against which researchers can compare.

GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP

May 24, 2023
Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

The recent emergence of ChatGPT has brought a revolutionary change in the landscape of NLP. Although ChatGPT has consistently shown impressive performance on English benchmarks, its exact capabilities on most other languages remain largely unknown. To better understand ChatGPT's capabilities on Arabic, we present a large-scale evaluation of the model on a broad range of Arabic NLP tasks. Namely, we evaluate ChatGPT on 32 diverse natural language understanding and generation tasks on over 60 different datasets. To the best of our knowledge, our work offers the first performance analysis of ChatGPT on Arabic NLP at such a massive scale. Our results show that, despite its success on English benchmarks, ChatGPT evaluated with few-shot in-context learning is consistently outperformed by much smaller dedicated models finetuned on Arabic. These results suggest that there is significant room for improvement for instruction-tuned LLMs such as ChatGPT.

* Work in progress 

Zero-Shot Slot and Intent Detection in Low-Resource Languages

Apr 26, 2023
Sang Yun Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed

Intent detection and slot filling are critical tasks in spoken and natural language understanding for task-oriented dialog systems. In this work, we describe our participation in the shared task on slot and intent detection for low-resource language varieties (SID4LR; Aepli et al., 2023). We investigate the slot and intent detection (SID) tasks using a wide range of models and settings. Given the recent success of multitask prompted finetuning of large language models, we also test the generalization capability of the recent encoder-decoder model mT0 (Muennighoff et al., 2022) on new tasks (i.e., SID) in languages it has never intentionally seen. We show that our best model outperforms the baseline by a large margin (up to +30 F1 points) in both SID tasks.
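
The zero-shot setup can be sketched as follows: framing SID as text-to-text reduces both tasks to prompt construction. The templates and label names below are illustrative assumptions, not the exact prompts or label sets used in the paper; an instruction-tuned encoder-decoder such as mT0 would take the resulting string as input and generate the answer as text.

```python
# Hypothetical prompt templates for zero-shot slot and intent detection
# (SID) with a text-to-text model. Intent labels are made up for
# illustration; the shared task's actual label inventory differs.

INTENTS = ["alarm_set", "play_music", "weather_query"]

def build_intent_prompt(utterance: str) -> str:
    """Frame intent detection as multiple-choice text generation."""
    options = ", ".join(INTENTS)
    return (
        f'Given the utterance: "{utterance}"\n'
        f"Which intent does it express? Choose one of: {options}."
    )

def build_slot_prompt(utterance: str) -> str:
    """Frame slot filling as span extraction phrased in natural language."""
    return (
        f'Utterance: "{utterance}"\n'
        "List each slot and its value, e.g. time: 7 am."
    )

prompt = build_intent_prompt("wake me up at 7 am tomorrow")
```

The model's generated continuation is then mapped back to the closest label string, which is how text-to-text models are typically scored on classification tasks.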

* VarDial @ EACL 

ORCA: A Challenging Benchmark for Arabic Language Understanding

Dec 21, 2022
AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

Given the crucial role pretrained language models play across NLP, several benchmarks have been proposed to evaluate them. In spite of these efforts, no public benchmark of diverse nature currently exists for evaluating Arabic. This makes it challenging to measure progress for both Arabic and multilingual language models. The challenge is compounded by the fact that any benchmark targeting Arabic needs to take into account that Arabic is not a single language but rather a collection of languages and varieties. In this work, we introduce ORCA, a publicly available benchmark for Arabic language understanding evaluation. ORCA is carefully constructed to cover diverse Arabic varieties and a wide range of challenging Arabic understanding tasks, exploiting 60 different datasets across seven NLU task clusters. To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models. We also provide a public leaderboard with a unified single-number evaluation metric (ORCA score) to facilitate future research.
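
As a small illustration of how a unified single-number score over task clusters can be computed, the sketch below macro-averages task scores within each cluster and then across clusters. This two-level averaging is an assumption for illustration (a common leaderboard convention); the cluster names and the exact definition of the ORCA score may differ.

```python
# Sketch of a single-number benchmark score via two-level macro-averaging:
# average tasks within each cluster, then average the cluster means.
# Cluster/task names here are illustrative, not ORCA's actual taxonomy.

def benchmark_score(results: dict[str, dict[str, float]]) -> float:
    """results maps cluster name -> {task name: score}."""
    cluster_means = [
        sum(tasks.values()) / len(tasks) for tasks in results.values()
    ]
    return sum(cluster_means) / len(cluster_means)

scores = {
    "sentence_classification": {"sentiment": 90.0, "dialect_id": 80.0},
    "structured_prediction": {"ner": 70.0},
}
print(benchmark_score(scores))  # → 77.5
```

Macro-averaging at the cluster level keeps clusters with many tasks from dominating the overall score.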

* All authors contributed equally 

JASMINE: Arabic GPT Models for Few-Shot Learning

Dec 21, 2022
El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Md Tawkat Islam Khondaker

Task-agnostic generative pretraining (GPT) has recently proved promising for zero- and few-shot learning, gradually diverting attention from the expensive supervised learning paradigm. Although the community is accumulating knowledge as to the capabilities of English-language autoregressive models such as GPT-3 adopting this generative approach, scholarship about these models remains acutely Anglocentric. Consequently, the community currently has serious gaps in its understanding of this class of models, their potential, and their societal impacts in diverse settings, linguistic traditions, and cultures. To alleviate this issue for Arabic, a collection of diverse languages and language varieties spoken by more than $400$ million people, we introduce JASMINE, a suite of powerful Arabic autoregressive Transformer language models ranging in size from 300 million to 13 billion parameters. We pretrain our new models with large amounts of diverse data (400GB of text) from different Arabic varieties and domains. We evaluate JASMINE extensively in both intrinsic and extrinsic settings, using a comprehensive benchmark for zero- and few-shot learning across a wide range of NLP tasks. We also carefully develop and release a novel benchmark for both automated and human evaluation of Arabic autoregressive models, focused on investigating potential social biases, harms, and toxicity in these models. We aim to responsibly release our models to interested researchers, along with code for experimenting with them.
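
To make the few-shot setting concrete, the sketch below assembles a k-shot prompt for an autoregressive model: labeled demonstrations are concatenated before the test input, and the model is expected to continue with the label text. The format (field names, separators) is an illustrative assumption, not the paper's evaluation template.

```python
# Hypothetical k-shot prompt assembly for an autoregressive LM such as
# JASMINE. Field names ("Text:", "Label:") are illustrative only.

def few_shot_prompt(demos, test_input, k=3):
    """demos: list of (text, label) pairs; returns a k-shot prompt string."""
    lines = [f"Text: {text}\nLabel: {label}\n" for text, label in demos[:k]]
    lines.append(f"Text: {test_input}\nLabel:")
    return "\n".join(lines)

demos = [("great movie", "positive"), ("awful service", "negative")]
prompt = few_shot_prompt(demos, "loved every minute of it", k=2)
```

The model's next tokens after the final "Label:" are decoded and matched against the label set; zero-shot is simply the k=0 special case.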

TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

May 27, 2022
El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it well-suited for acquiring paraphrases of the MSA translations as an added value. To train TURJUMAN, we sample from publicly available parallel data, employing a simple semantic similarity method to ensure data quality. This allows us to prepare and release AraOPUS-20, a new machine translation benchmark. We publicly release our translation toolkit (TURJUMAN) as well as our benchmark dataset (AraOPUS-20).
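
The similarity-based filtering step can be sketched as follows. The abstract does not spell out the similarity method, so the scorer here is a deliberately simple stand-in (word overlap between the source and a hypothetical back-translation of the target); a real pipeline would plug in cross-lingual sentence embeddings instead.

```python
# Sketch of quality-filtering parallel data before MT training. The scorer
# is pluggable; this toy version compares the source to a back-translation
# of the target via word-level Jaccard overlap, a crude stand-in for the
# semantic similarity method the paper alludes to.

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (toy proxy for semantic similarity)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def filter_pairs(pairs, score, threshold=0.5):
    """Keep (source, back_translated_target) pairs that clear threshold."""
    return [p for p in pairs if score(*p) >= threshold]

pairs = [
    ("the cat sat on the mat", "the cat sat on a mat"),       # good pair
    ("the cat sat on the mat", "stock prices fell sharply"),  # misaligned
]
kept = filter_pairs(pairs, jaccard)  # only the good pair survives
```

With an embedding-based scorer, the raw target sentences could be scored directly against the source, with no back-translation needed.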

* Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5), 2022  
* All authors contributed equally 

Decay No More: A Persistent Twitter Dataset for Learning Social Meaning

Apr 10, 2022
Chiyu Zhang, Muhammad Abdul-Mageed, El Moatez Billah Nagoudi

With the proliferation of social media, many studies resort to social media platforms to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of $17$ social meaning datasets in $10$ categories of tasks. We experiment with two SOTA pre-trained language models and show that the paraphrases in PTSM can substitute for the actual tweets with only marginal performance loss.
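
The substitution idea can be sketched as follows: each tweet's text is replaced by a model-generated paraphrase, so the released dataset no longer depends on the original posts staying online. The paraphraser below is a trivial placeholder standing in for the neural paraphrase model the paper uses.

```python
# Sketch of building a persistent dataset by paraphrase substitution.
# `toy_paraphrase` is a placeholder for a neural paraphrase model; labels
# are carried over unchanged so downstream tasks stay intact.

def toy_paraphrase(text: str) -> str:
    """Placeholder for a neural paraphrase model."""
    return f"in other words, {text}"

def make_persistent(dataset):
    """dataset: list of (tweet_text, label) -> paraphrased copies."""
    return [(toy_paraphrase(text), label) for text, label in dataset]

data = [("this show is amazing", "positive")]
persistent = make_persistent(data)
```

Because the paraphrases (rather than tweet IDs) are what get distributed, every future user trains and evaluates on exactly the same text.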

* Under review. arXiv admin note: text overlap with arXiv:2108.00356 

Improving Social Meaning Detection with Pragmatic Masking and Surrogate Fine-Tuning

Aug 01, 2021
Chiyu Zhang, Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi

Masked language models (MLMs) are pretrained with a denoising objective that, while useful, mismatches the objective of downstream fine-tuning. We propose pragmatic masking and surrogate fine-tuning as two strategies that exploit social cues to drive pre-trained representations toward a broad set of concepts useful for a wide class of social meaning tasks. To test our methods, we introduce a new benchmark of 15 different Twitter datasets for social meaning detection. Our methods achieve a 2.34% F1 improvement over a competitive baseline, while outperforming other transfer learning methods such as multi-task learning and domain-specific language models pretrained on large datasets. With only 5% of training data (severely few-shot), our methods enable an impressive 68.74% average F1, and we observe promising results in a zero-shot setting involving six datasets from three different languages.
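
The masking strategy can be illustrated as follows: instead of masking tokens uniformly at random as in standard MLM pretraining, tokens carrying social cues are masked preferentially. The cue detectors below (hashtags and one emoji block) are a simplified guess for illustration; the paper's actual cue inventory may differ.

```python
import re

# Toy pragmatic masking: mask tokens that carry social cues (hashtags,
# emoji) rather than random tokens. The cue set is illustrative only.
CUE_PATTERN = re.compile(
    r"^#\w+$"                        # hashtags
    r"|^[\U0001F300-\U0001FAFF]+$"   # one common emoji range
)

def pragmatic_mask(tokens, mask_token="[MASK]"):
    """Replace every pragmatic-cue token with the mask token."""
    return [mask_token if CUE_PATTERN.match(t) else t for t in tokens]

tweet = "so proud of this team #blessed \U0001F60A".split()
masked = pragmatic_mask(tweet)
```

Forcing the model to reconstruct exactly these cue tokens pushes its representations toward the sentiment- and stance-bearing signal those cues encode.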

* Under Review 

Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation

May 28, 2021
El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

Recent progress in neural machine translation (NMT) has made it possible to translate successfully between monolingual language pairs where large parallel data exist, with pre-trained models improving performance even further. Although there exists work on translating in code-mixed settings (where one of the pairs includes text from two or more languages), it is still unclear what recent success in NMT and language modeling exactly means for translating code-mixed text. We investigate one such context, namely MT from code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English. We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs). We are able to acquire reasonable performance using only MSA-EN parallel data with S2S models trained from scratch. We also find LMs fine-tuned on data from various Arabic dialects to help the MSAEA-EN task. Our work is in the context of the Shared Task on Machine Translation in Code-Switching. Our best model achieves $\bf25.72$ BLEU, placing us first on the official shared task evaluation for MSAEA-EN.

* CALCS2021, colocated with NAACL-2021 