Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

Practical Comparable Data Collection for Low-Resource Languages via Images

Apr 28, 2020
Aman Madaan, Shruti Rijhwani, Antonios Anastasopoulos, Yiming Yang, Graham Neubig

We propose a method of curating high-quality comparable training data for low-resource languages with monolingual annotators. Our method involves using a carefully selected set of images as a pivot between the source and target languages by getting captions for such images in both languages independently. Human evaluations on the English-Hindi comparable corpora created with our method show that 81.1% of the pairs are acceptable translations, and only 2.47% of the pairs are not translations at all. We further establish the potential of the dataset collected through our approach by experimenting on two downstream tasks - machine translation and dictionary extraction. All code and data are available at

* Accepted for poster presentation at the Practical Machine Learning for Developing Countries (PML4DC) workshop, ICLR 2020 

Share this with someone who'll enjoy it:

   Access Paper Source

Share this with someone who'll enjoy it: