
Abinew Ali Ayele


Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities

Mar 25, 2023
Atnafu Lambebo Tonja, Tadesse Destaw Belay, Israel Abebe Azime, Abinew Ali Ayele, Moges Ahmed Mehamed, Olga Kolesnikova, Seid Muhie Yimam

This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized repository on GitHub that contains publicly available resources for various NLP tasks in these languages. This repository can be updated periodically with contributions from other researchers. Our objective is to identify research gaps and disseminate the information to NLP researchers interested in Ethiopian languages and encourage future research in this domain.

* Accepted to the Fourth Workshop on Resources for African Indigenous Languages (RAIL), EACL 2023 

AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

Feb 17, 2023
Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Sa'id Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Pavel Brazdil, Felermino Dário Mário António Ali, Davis Davis, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Baye Messelle, Hailu Beshada Balcha, Sisay Adugna Chala, Hagos Tesfahun Gebremichael, Bernard Opoku, Steven Arthur

Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the related challenges when curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti).
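As a toy illustration of the kind of simple sentiment baseline one might compare against on a benchmark like AfriSenti, the sketch below labels a tweet by counting lexicon hits. The lexicon words and the three-way label scheme here are illustrative assumptions for demonstration, not the paper's actual baselines or data:

```python
def lexicon_sentiment(tweet: str, pos_words: set, neg_words: set) -> str:
    """Label a tweet positive/negative/neutral by counting lexicon hits.
    A deliberately simple baseline sketch, not the paper's method."""
    tokens = tweet.lower().split()
    score = sum(t in pos_words for t in tokens) - sum(t in neg_words for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Hypothetical English lexicon purely for demonstration; a real run would use
# per-language lexicons or trained classifiers over the AfriSenti splits.
POS = {"good", "great", "love"}
NEG = {"bad", "awful", "hate"}

print(lexicon_sentiment("I love this great song", POS, NEG))  # positive
```

The corpus itself can be obtained from the GitHub repository or the Hugging Face dataset page linked in the abstract.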

* 15 pages, 6 Figures, 9 Tables 

The Effect of Normalization for Bi-directional Amharic-English Neural Machine Translation

Oct 27, 2022
Tadesse Destaw Belay, Atnafu Lambebo Tonja, Olga Kolesnikova, Seid Muhie Yimam, Abinew Ali Ayele, Silesh Bogale Haile, Grigori Sidorov, Alexander Gelbukh

Machine translation (MT) is one of the main tasks in natural language processing, whose objective is to translate texts automatically from one natural language to another. Nowadays, the use of deep neural networks for MT has received great attention. These networks require large amounts of data to learn abstract representations of the input and store them in continuous vectors. This paper presents the first relatively large-scale Amharic-English parallel sentence dataset. Using the compiled data, we build bi-directional Amharic-English translation models by fine-tuning the existing Facebook M2M100 pre-trained model, achieving BLEU scores of 37.79 for Amharic-English and 32.74 for English-Amharic translation. Additionally, we explore the effect of Amharic homophone normalization on the machine translation task. The results show that normalizing Amharic homophone characters improves Amharic-English machine translation in both directions.
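Homophone normalization of the kind described above can be sketched as a character-level mapping. The partial table below uses commonly cited Amharic homophone families (ሀ/ሐ/ኀ, ሰ/ሠ, አ/ዐ, ጸ/ፀ) and covers only their base characters; the exact mapping used in the paper may differ:

```python
def normalize_homophones(text: str) -> str:
    """Map Amharic homophone characters onto one canonical form.
    Partial, illustrative table: only the base (first-order) characters
    of a few well-known homophone families are covered here.
    """
    table = str.maketrans({
        "ሐ": "ሀ", "ኀ": "ሀ",  # h-family variants
        "ሠ": "ሰ",            # s-family variant
        "ዐ": "አ",            # glottal-family variant
        "ፀ": "ጸ",            # ts'-family variant
    })
    return text.translate(table)

print(normalize_homophones("ሠላም"))  # ሰላም
```

A complete implementation would extend the table across all seven vowel orders of each family before applying it to both the training and evaluation sides of the parallel corpus.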


Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets

Nov 02, 2020
Seid Muhie Yimam, Abinew Ali Ayele, Gopalakrishnan Venkatesh, Chris Biemann

The availability of different pre-trained semantic models has enabled the quick development of machine learning components for downstream applications. Despite the availability of abundant text data for low-resource languages, only a few semantic models are publicly available, and these are usually multilingual versions that do not fit each language well due to context variations. In this work, we introduce different semantic models for Amharic. After experimenting with the existing pre-trained semantic models, we trained and fine-tuned nine new models using a monolingual text corpus. The models are built using word2vec embeddings, a distributional thesaurus (DT), contextual embeddings, and DT embeddings obtained via network embedding algorithms. Moreover, we employ these models for different NLP tasks and investigate their impact. We find that the newly trained models perform better than the pre-trained multilingual models, and that models based on contextual embeddings from RoBERTa outperform the word2vec models.
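A typical intrinsic check for embedding models like these is cosine similarity between word vectors, where semantically related words should score higher than unrelated ones. A minimal sketch with toy vectors (the words and 3-d vectors are placeholders, not outputs of the trained Amharic models):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy "embeddings" for illustration only; real vectors would come from the
# word2vec, DT, or contextual models described in the abstract.
emb = {
    "nigus":  [0.9, 0.1, 0.0],  # hypothetical vector for "king"
    "negest": [0.8, 0.2, 0.1],  # hypothetical vector for "queen"
    "dabo":   [0.0, 0.1, 0.9],  # hypothetical vector for "bread"
}
print(cosine(emb["nigus"], emb["negest"]) > cosine(emb["nigus"], emb["dabo"]))  # True
```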


Analysis of the Ethiopic Twitter Dataset for Abusive Speech in Amharic

Dec 09, 2019
Seid Muhie Yimam, Abinew Ali Ayele, Chris Biemann

In this paper, we present an analysis of the first Ethiopic Twitter dataset for the Amharic language, targeted at recognizing abusive speech. The dataset, written in the Fidel script, has been collected since 2014. Since several languages can be written in the Fidel script, we used existing Amharic, Tigrinya, and Ge'ez corpora to retain only the Amharic tweets. We analyzed the tweets for abusive speech content with two goals: to analyze the distribution and tendency of abusive speech content over time, and to compare the abusive speech content between the Twitter corpus and a general-reference Amharic corpus.
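The first, script-level step of such a filtering pipeline can be sketched as a check on the Ethiopic Unicode block (U+1200–U+137F). Note this only detects the Fidel script; separating Amharic from Tigrinya or Ge'ez still requires the corpus-based comparison the abstract describes. The 0.5 threshold below is an illustrative assumption:

```python
def is_fidel(text: str, threshold: float = 0.5) -> bool:
    """Return True when at least `threshold` of the non-space characters
    fall in the Ethiopic Unicode block (U+1200-U+137F).
    Detects the script only; it cannot tell Amharic from Tigrinya."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if "\u1200" <= c <= "\u137F")
    return hits / len(chars) >= threshold

print(is_fidel("ሰላም ለዓለም"))   # True
print(is_fidel("hello world"))  # False
```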
