Muhammad Abdul-Mageed

Improving Similar Language Translation With Transfer Learning

Aug 07, 2021
Ife Adebara, Muhammad Abdul-Mageed

We investigate transfer learning based on pre-trained neural machine translation models to translate between (low-resource) similar languages. This work is part of our contribution to the WMT 2021 Similar Languages Translation Shared Task, where we submitted models for different language pairs, including French-Bambara, Spanish-Catalan, and Spanish-Portuguese in both directions. Our models for Catalan-Spanish (82.79 BLEU) and Portuguese-Spanish (87.11 BLEU) rank first in the official shared task evaluation, and we are the only team to submit models for the French-Bambara pairs.

* Submitted to WMT 2021 Similar Language Task

Improving Social Meaning Detection with Pragmatic Masking and Surrogate Fine-Tuning

Aug 01, 2021
Chiyu Zhang, Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi

Masked language models (MLMs) are pretrained with a denoising objective that, while useful, is mismatched with the objective of downstream fine-tuning. We propose pragmatic masking and surrogate fine-tuning as two strategies that exploit social cues to drive pre-trained representations toward a broad set of concepts useful for a wide class of social meaning tasks. To test our methods, we introduce a new benchmark of 15 different Twitter datasets for social meaning detection. Our methods achieve a 2.34% F1 gain over a competitive baseline, while outperforming other transfer learning methods such as multi-task learning and domain-specific language models pretrained on large datasets. With only 5% of training data (severely few-shot), our methods reach an average F1 of 68.74%, and we observe promising results in a zero-shot setting involving six datasets from three different languages.

* Under Review

Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation

May 28, 2021
El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

Recent progress in neural machine translation (NMT) has made it possible to translate successfully between monolingual language pairs where large parallel data exist, with pre-trained models improving performance even further. Although there is existing work on translating in code-mixed settings (where one side of the pair includes text from two or more languages), it is still unclear what the recent success in NMT and language modeling means for translating code-mixed text. We investigate one such context, namely MT from code-mixed Modern Standard Arabic and Egyptian Arabic (MSAEA) into English. We develop models under different conditions, employing both (i) standard end-to-end sequence-to-sequence (S2S) Transformers trained from scratch and (ii) pre-trained S2S language models (LMs). We obtain reasonable performance using only MSA-EN parallel data with S2S models trained from scratch. We also find that LMs fine-tuned on data from various Arabic dialects help the MSAEA-EN task. Our work is in the context of the Shared Task on Machine Translation in Code-Switching. Our best model achieves 25.72 BLEU, placing us first on the official shared task evaluation for MSAEA-EN.

* CALCS2021, colocated with NAACL-2021

Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing

May 18, 2021
Ganesh Jawahar, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

We describe models focused on the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task, finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations, which we exploit to improve language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach where we first finetune the language models on synthetic data and then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases even superior to) several standard methods (backtranslation, a method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves the best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.

* Computational Approaches to Linguistic Code-Switching (CALCS 2021) workshop

AraStance: A Multi-Country and Multi-Domain Dataset of Arabic Stance Detection for Fact Checking

May 18, 2021
Tariq Alhindi, Amal Alabdulkarim, Ali Alshehri, Muhammad Abdul-Mageed, Preslav Nakov

With the continuing spread of misinformation and disinformation online, it is of increasing importance to develop combating mechanisms at scale in the form of automated systems that support multiple languages. One task of interest is claim veracity prediction, which can be addressed using stance detection with respect to relevant documents retrieved online. To this end, we present our new Arabic Stance Detection dataset (AraStance) of 4,063 claim-article pairs from a diverse set of sources comprising three fact-checking websites and one news website. AraStance covers false and true claims from multiple domains (e.g., politics, sports, health) and several Arab countries, and it is well-balanced between related and unrelated documents with respect to the claims. We benchmark AraStance, along with two other stance detection datasets, using a number of BERT-based models. Our best model achieves an accuracy of 85% and a macro F1 score of 78%, which leaves room for improvement and reflects the challenging nature of AraStance and the task of stance detection in general.

* Accepted to the 2021 Workshop on NLP4IF: Censorship, Disinformation, and Propaganda
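
The abstract above reports both accuracy and macro F1, and the two can diverge when some classes are rare. The following is a minimal sketch of how the two metrics are computed, not the authors' evaluation code. The label names and the toy predictions are made up for illustration and are not drawn from AraStance.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 frequent-class examples, 2 rare-class examples.
gold = ["unrelated"] * 8 + ["agree"] * 2
# The classifier gets every frequent example right but misses one rare example.
pred = ["unrelated"] * 8 + ["unrelated", "agree"]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

In this toy case a single error on the rare class lowers macro F1 much more than accuracy, which is one common reason a reported macro F1 sits below the reported accuracy.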

IndT5: A Text-to-Text Transformer for 10 Indigenous Languages

Apr 27, 2021
El Moatez Billah Nagoudi, Wei-Rui Chen, Muhammad Abdul-Mageed, Hasan Cavusogl

Transformer language models have become fundamental components of pipelines based on natural language processing. Although several Transformer models have been introduced to serve many languages, there is a shortage of models pre-trained for low-resource and Indigenous languages. In this work, we introduce IndT5, the first Transformer language model for Indigenous languages. To train IndT5, we build IndCorpus, a new dataset for ten Indigenous languages and Spanish. We also present the application of IndT5 to machine translation by investigating different approaches to translate between Spanish and the Indigenous languages as part of our contribution to the AmericasNLP 2021 Shared Task on Open Machine Translation. IndT5 and IndCorpus are publicly available for research.

* Accepted in AmericasNLP 2021, co-located with NAACL-HLT 2021

Translating the Unseen? Yoruba-English MT in Low-Resource, Morphologically-Unmarked Settings

Apr 06, 2021
Ife Adebara, Muhammad Abdul-Mageed, Miikka Silfverberg

Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation. When translating into English, which marks (in)definiteness morphologically, from Yorùbá, which uses bare nouns but marks these features contextually, ambiguities arise. In this work, we perform a fine-grained analysis of how an SMT system compares with two NMT systems (BiLSTM and Transformer) when translating bare nouns (BNs) in Yorùbá into English. We investigate to what extent the systems identify BNs and correctly translate them, and how they compare with human translation patterns. We also analyze the types of errors each model makes and provide a linguistic description of these errors. We glean insights for evaluating model performance in low-resource settings. In translating bare nouns, our results show that the Transformer model outperforms the SMT and BiLSTM models for 4 categories, the BiLSTM outperforms the SMT model for 3 categories, and the SMT outperforms the NMT models for 1 category.

* Accepted at AfricanNLP @ EACL 2021

Translating the Unseen? Yorùbá → English MT in Low-Resource, Morphologically-Unmarked Settings

Mar 09, 2021
Ife Adebara, Muhammad Abdul-Mageed, Miikka Silfverberg

* Accepted at AfricanNLP @ EACL 2021
