Pretrained language models based on the Transformer architecture have achieved state-of-the-art results in various natural language processing tasks such as part-of-speech tagging, named entity recognition, and question answering. However, no such monolingual model for the Uzbek language is publicly available. In this paper, we introduce UzBERT, a pretrained Uzbek language model based on the BERT architecture. Our model greatly outperforms multilingual BERT on masked language model accuracy. We make the model publicly available under the MIT open-source license.
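As a brief illustration, the sketch below shows how a BERT-style masked language model such as this one could be queried for masked-token predictions with the Hugging Face transformers library. The model identifier and the example sentence are placeholder assumptions, not necessarily the published checkpoint name.

```python
# Minimal sketch: querying a BERT-style Uzbek masked language model.
# The model identifier is a hypothetical placeholder; substitute the
# actual name under which the UzBERT checkpoint is published.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="uzbert-base")  # hypothetical ID

# Predict the most likely tokens for the masked position in
# "Toshkent O‘zbekistonning [MASK]." ("Tashkent is Uzbekistan's [MASK].")
for prediction in fill_mask("Toshkent O‘zbekistonning [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```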
In this paper, we introduce a data-driven approach to transliterating Uzbek dictionary words from the Cyrillic script into the Latin script, and vice versa. We heuristically align characters of words in the source script with substrings of the corresponding words in the target script and train a decision tree classifier to learn these alignments. On the test set, our Cyrillic-to-Latin model achieves a character-level micro-averaged F1 score of 0.9992, and our Latin-to-Cyrillic model achieves a score of 0.9959. Our contribution is a novel method of producing machine-transliterated texts for the low-resource Uzbek language.
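To make the classification step concrete, the following is a minimal sketch that maps each source-script character, together with one character of context on each side, to a target-script substring using a decision tree. The heuristic alignment is assumed to have already produced the (window, substring) training pairs; the feature design and the toy pairs shown are illustrative assumptions, not the exact setup of the paper.

```python
# Minimal sketch: a decision tree that maps a Cyrillic character (with its
# left/right context) to a Latin substring. The alignment step that produces
# (window, target) pairs is assumed to have run already; the toy pairs below
# are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

train_pairs = [
    ({"prev": "<s>", "char": "ш", "next": "а"}, "sh"),   # ш -> sh
    ({"prev": "ш", "char": "а", "next": "ҳ"}, "a"),      # а -> a
    ({"prev": "а", "char": "ҳ", "next": "а"}, "h"),      # ҳ -> h
    ({"prev": "ҳ", "char": "а", "next": "р"}, "a"),      # а -> a
    ({"prev": "а", "char": "р", "next": "</s>"}, "r"),   # р -> r
]
features, labels = zip(*train_pairs)

model = make_pipeline(DictVectorizer(sparse=False), DecisionTreeClassifier())
model.fit(list(features), list(labels))

def transliterate(word):
    """Transliterate a Cyrillic word character by character."""
    padded = ["<s>", *word, "</s>"]
    windows = [
        {"prev": padded[i - 1], "char": padded[i], "next": padded[i + 1]}
        for i in range(1, len(padded) - 1)
    ]
    return "".join(model.predict(windows))

print(transliterate("шаҳар"))  # expected output: "shahar" ("city")
```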
In this paper, we describe the process of developing word embeddings for the Cyrillic variant of the Uzbek language. The result of our work is the first publicly available set of word vectors trained with the word2vec, GloVe, and fastText algorithms on a high-quality web-crawl corpus developed in-house. The developed word embeddings can be used in many downstream natural language processing tasks.
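For illustration, the sketch below trains word2vec and fastText vectors with gensim on a whitespace-tokenized Cyrillic corpus; the corpus file name and the hyperparameters are assumptions made for the example and do not reflect the settings used for the released embeddings.

```python
# Minimal sketch: training word2vec and fastText embeddings with gensim on a
# tokenized Uzbek Cyrillic corpus. The file name and hyperparameters are
# illustrative assumptions, not the settings used for the released vectors.
from gensim.models import Word2Vec, FastText

# One whitespace-tokenized sentence per line (hypothetical corpus file).
with open("uz_cyrillic_corpus.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]

w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
ft = FastText(sentences, vector_size=300, window=5, min_count=5, workers=4)

# Query nearest neighbours of a word in the embedding space.
print(w2v.wv.most_similar("шаҳар", topn=5))  # words similar to "city"
print(ft.wv.most_similar("шаҳар", topn=5))

# Save in word2vec text format for use in downstream tasks.
w2v.wv.save_word2vec_format("uz_word2vec_300d.txt")
```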