Get our free extension to see links to code for papers anywhere online!

 Add to Chrome

 Add to Firefox

CatalyzeX Code Finder - Browser extension linking code for ML papers across the web! | Product Hunt Embed

Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

Nov 30, 2019
Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Musat, Andreas Fischer



This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling. To capture new content, our approach will run continuously to keep increasing the corpus over time.

* Submitted to LREC 2020 


Share this with someone who'll enjoy it:

   Access Paper Source



Share this with someone who'll enjoy it: