Title: TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

Nov 25, 2021

Abir Messaoudi, Ahmed Cheikhrouhou, Hatem Haddad, Nourchene Ferchichi, Moez BenHajhmida, Abir Korched, Malek Naski, Faten Ghriss, Amine Kerkeni

Abstract: Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine understandable. After the breakthrough of the attention mechanism, a new generation of pretrained models has been proposed, achieving good performance since the introduction of the Transformer. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite their success, most of the available models have been trained on Indo-European languages; similar research for underrepresented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for underrepresented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on a sentiment analysis task, a dialect identification task, and a reading comprehension question-answering task. We show that using noisy web-crawled data instead of structured data (Wikipedia, articles, etc.) is more convenient for such a non-standardized language. Moreover, results indicate that a relatively small web-crawled dataset leads to performance as good as that obtained using larger datasets. Finally, our best-performing TunBERT model reaches or improves on the state of the art in all three downstream tasks. We release the TunBERT pretrained model and the datasets used for fine-tuning.
