Title: Tackling the Low-resource Challenge for Canonical Segmentation

Oct 06, 2020

Manuel Mager, Özlem Çetinoğlu, Katharina Kann

Abstract: Canonical morphological segmentation consists of dividing words into their standardized morphemes. Here, we are interested in approaches for the task when training data is limited. We compare model performance in a simulated low-resource setting for the high-resource languages German, English, and Indonesian to experiments on new datasets for the truly low-resource languages Popoluca and Tepehua. We explore two new models for the task, borrowing from the closely related area of morphological generation: an LSTM pointer-generator and a sequence-to-sequence model with hard monotonic attention trained with imitation learning. We find that, in the low-resource setting, the novel approaches outperform existing ones on all languages by up to 11.4% accuracy. However, while accuracy in emulated low-resource scenarios is over 50% for all languages, for the truly low-resource languages Popoluca and Tepehua, our best model only obtains 37.4% and 28.4% accuracy, respectively. Thus, we conclude that canonical segmentation is still a challenging task for low-resource languages.
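
To make the task and the metric concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that accuracy means an exact match on the full segmented form of each word, and the word list, the hyphen separator, and the function name `exact_match_accuracy` are illustrative choices rather than anything taken from the paper's datasets. The point it shows is that canonical segmentation restores standardized morphemes (for example "funniest" becomes "funny-est"), so an output that merely splits the surface string ("funni-est") counts as an error.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of words whose predicted segmentation equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Illustrative toy data (hypothetical, not from the paper's datasets).
# Canonical segmentations restore the standardized morphemes.
gold = {
    "funniest": "funny-est",
    "walked": "walk-ed",
    "cats": "cat-s",
}

# A hypothetical system that only cuts the surface string without
# restoring the underlying form of each morpheme.
surface_style = {
    "funniest": "funni-est",  # surface split, not canonical
    "walked": "walk-ed",
    "cats": "cat-s",
}

words = list(gold)
acc = exact_match_accuracy(
    [surface_style[w] for w in words],
    [gold[w] for w in words],
)
print(f"accuracy: {acc:.3f}")
```

Under this assumed metric a single wrong character in any morpheme makes the whole word incorrect, which is one reason scores can stay low when training data is scarce.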

* Accepted to EMNLP 2020