Title: MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation

Mar 15, 2021

David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet

Abstract: Massively multilingual machine translation (MT) has shown impressive capabilities, including zero- and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá-English (yo-en) language pair with standardized train-test splits for benchmarking. We provide several neural MT (NMT) benchmarks on this dataset and compare them to the performance of popular pre-trained (massively multilingual) MT models, showing that, in almost all cases, our simple benchmarks outperform the pre-trained MT models. When we use MENYO-20k to fine-tune generic models, we achieve major gains of +9.9 and +8.6 BLEU (en2yo) over Facebook's M2M-100 and Google multilingual NMT, respectively.

* Accepted to the AfricaNLP 2021 Workshop @EACL 2021
