The main objective of this paper is to compare and evaluate the performance of three open Arabic NER tools: CAMeL, Hatmi, and Stanza. We collected a corpus consisting of 30 articles written in MSA and manually annotated all the entities of the person, organization, and location types at the article (document) level. Our results suggest that Stanza and Hatmi perform similarly, with the latter receiving the highest F1 score for the three entity types. However, CAMeL achieved the highest precision values for names of people and organizations. Following this, we implemented a "merge" method that combined the results from the three tools and a "vote" method that tagged named entities only when at least two of the three tools identified them as entities. Our results showed that merging achieved the highest overall F1 scores. Moreover, merging had the highest recall values while voting had the highest precision values for the three entity types. This indicates that merging is more suitable when recall is desired, while voting is optimal when precision is required. Finally, we collected a corpus of 21,635 articles related to COVID-19 and applied the merge and vote methods. Our analysis demonstrates the tradeoff between precision and recall for the two methods.
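The merge and vote methods described above can be illustrated with a minimal sketch. It assumes each tool's output has already been reduced to a document-level set of (entity text, entity type) pairs; the function names, the `min_votes` parameter, and the sample entities are hypothetical and are not taken from the paper. The sketch also reads "two of the three" as "at least two".

```python
from collections import Counter


def merge(*tool_outputs):
    """Union of all tools' entities: favors recall."""
    return set().union(*tool_outputs)


def vote(*tool_outputs, min_votes=2):
    """Keep an entity only if at least `min_votes` tools tagged it: favors precision."""
    counts = Counter(entity for output in tool_outputs for entity in set(output))
    return {entity for entity, n in counts.items() if n >= min_votes}


def precision_recall_f1(predicted, gold):
    """Document-level precision, recall, and F1 over sets of entities."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical outputs for one article (illustrative data only).
camel = {("Ahmed", "PER"), ("WHO", "ORG")}
hatmi = {("Ahmed", "PER"), ("Riyadh", "LOC")}
stanza = {("Ahmed", "PER"), ("Riyadh", "LOC"), ("Monday", "ORG")}
gold = {("Ahmed", "PER"), ("WHO", "ORG"), ("Riyadh", "LOC")}

for name, combined in (("merge", merge(camel, hatmi, stanza)),
                       ("vote", vote(camel, hatmi, stanza))):
    print(name, precision_recall_f1(combined, gold))
```

In this toy example, merging recovers every gold entity but also keeps the spurious one, while voting drops the spurious entity along with a correct entity that only one tool found, which mirrors the precision and recall tradeoff reported in the abstract.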
The automatic classification of Arabic dialects is an ongoing research challenge, which has been explored in recent work that defines dialects based on increasingly limited geographic areas like cities and provinces. This paper focuses on a related yet relatively unexplored topic: the effects of the geographical proximity of cities located in Arab countries on their dialectical similarity. Our work is twofold: 1) comparing the textual similarities between dialects using cosine similarity and 2) measuring the geographical distance between locations. We study MADAR and NADI, two established datasets that cover Arabic dialects from many cities and provinces. Our results indicate that cities located in different countries may in fact have more dialectical similarity than cities within the same country, depending on their geographical proximity. The correlation between dialectical similarity and city proximity suggests that cities that are closer together are more likely to share dialectical attributes, regardless of country borders. This nuance provides the potential for important advancements in Arabic dialect research because it indicates that a more granular approach to dialect classification is essential to understanding how to frame the problem of Arabic dialect identification.
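The two measurements named above can be sketched as follows. The abstract does not say how texts are vectorized or how distance is computed, so this sketch assumes whitespace-tokenized bag-of-words counts for the cosine similarity and the haversine great-circle formula for the distance; both choices, and the function names, are assumptions for illustration.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(min(1.0, h)))
```

Given one similarity score and one distance per pair of cities, the relationship between the two lists could then be examined with a standard correlation coefficient.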
Named entities in text documents are the names of people, organizations, locations, or other types of objects that exist in the real world. A persistent research challenge is to use computational techniques to identify such entities in text documents. Once identified, several text mining tools and algorithms can be utilized to leverage these discovered named entities and improve NLP applications. In this paper, a method that clusters prominent names of people and organizations based on their semantic similarity in a text corpus is proposed. The method relies on common named entity recognition techniques and on recent word embedding models. The semantic similarity scores generated using the word embedding models for the named entities are used to cluster similar entities of the people and organization types. Two human judges evaluated ten variations of the method after it was run on a corpus that consists of 4,821 articles on a specific topic. The performance of the method was measured using three quantitative measures. The results of these three metrics demonstrate that the method is effective in clustering semantically similar named entities.
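As a rough illustration of clustering entities by embedding similarity, the sketch below groups entities whose vectors have a cosine similarity at or above a threshold, linking them transitively. The abstract does not specify the clustering algorithm, the embedding model, or any threshold, so the single-link grouping, the `threshold` value, and the toy two-dimensional vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cluster_entities(vectors, threshold=0.8):
    """Group entity names whose embeddings are linked by similarity >= threshold."""
    names = list(vectors)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


# Toy embeddings (hypothetical; real vectors would come from a word embedding model).
toy_vectors = {"Entity A": [1.0, 0.0], "Entity B": [0.9, 0.1], "Entity C": [0.0, 1.0]}
print(cluster_entities(toy_vectors))
```

With real embeddings, the threshold would need tuning, and a different clustering strategy may be more appropriate.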
Image classification is an ongoing research challenge. Most of the available research focuses on image classification for the English language; however, there is very little research on image classification for the Arabic language. Expanding image classification to Arabic has several applications. The present study investigated a method for generating Arabic labels for images of objects. The method used in this study involved a direct English to Arabic translation of the labels that are currently available on ImageNet, a database commonly used in image classification research. The purpose of this study was to test the accuracy of this method. In this study, 2,887 labeled images were randomly selected from ImageNet. All of the labels were translated from English to Arabic using Google Translate. The accuracy of the translations was evaluated. Results indicated that 65.6% of the Arabic labels were accurate. This study makes three important contributions to the image classification literature: (1) it determined the baseline level of accuracy for algorithms that provide Arabic labels for images, (2) it provided 1,895 images that are tagged with accurate Arabic labels, and (3) it provided the accuracy of translations of image labels from English to Arabic.