Vukosi Marivate

Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution

Jun 26, 2023
Abiodun Modupe, Turgay Celik, Vukosi Marivate, Oludayo O. Olugbara

The problem of unveiling the author of a given text document from multiple candidate authors is called authorship attribution. Manifold word-based stylistic markers have been successfully used in deep learning methods to deal with the intrinsic problem of authorship attribution. Unfortunately, the performance of word-based authorship attribution systems is limited by the vocabulary of the training corpus. The literature has recommended character-based stylistic markers as an alternative to overcome the hidden word problem. However, character-based methods often fail to capture the sequential relationship of words in texts, which leaves a gap for further improvement. The question addressed in this paper is whether it is possible to address the ambiguity of hidden words in text documents while preserving the sequential context of words. Consequently, a method based on bidirectional long short-term memory (BLSTM) with a 2-dimensional convolutional neural network (CNN) is proposed to capture sequential writing styles for authorship attribution. The BLSTM was used to obtain the sequential relationship among characteristics using subword information. The 2-dimensional CNN was applied to understand the local syntactical position of the style from unlabeled input text. The proposed method was experimentally evaluated against numerous state-of-the-art methods across the public corpora of CCAT50, IMDb62, Blog50, and Twitter50. Experimental results indicate accuracy improvements of 1.07% and 0.96% on CCAT50 and Twitter50, respectively, and comparable results on the remaining datasets.
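
To illustrate why subword information helps with words missing from the training vocabulary, here is a minimal sketch of character n-gram extraction. The boundary markers, the n-gram range of 3 to 5, and the helper names `char_ngrams` and `subword_ids` are assumptions made for illustration; the abstract does not say how the paper builds its subword units.

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of `word`, padded with boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


def subword_ids(word, vocab, n_min=3, n_max=5):
    """Map a word to the ids of those of its n-grams that exist in `vocab`."""
    return [vocab[g] for g in char_ngrams(word, n_min, n_max) if g in vocab]


# Build a toy n-gram vocabulary from two hypothetical training words.
train_words = ["writing", "writer"]
vocab = {}
for w in train_words:
    for g in char_ngrams(w):
        vocab.setdefault(g, len(vocab))

# "writes" was never seen as a whole word, yet it shares n-grams
# such as "<wr" and "writ" with the training words.
shared = subword_ids("writes", vocab)
```

A sequence model can then consume these n-gram ids instead of a single out-of-vocabulary token, so an unseen word still contributes a signal.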

* 8 pages, 4 figures

Izindaba-Tindzaba: Machine learning news categorisation for Long and Short Text for isiZulu and Siswati

Jun 12, 2023
Andani Madodonga, Vukosi Marivate, Matthew Adendorff

Local/Native South African languages are classified as low-resource languages. As such, it is essential to build the resources for these languages so that they can benefit from advances in the field of natural language processing. In this work, the focus was to create annotated news datasets for the isiZulu and Siswati native languages based on news topic classification tasks and present the findings from these baseline classification models. Due to the shortage of data for these native South African languages, the datasets that were created were augmented and oversampled to increase data size and overcome class imbalance. In total, four different classification models were used, namely Logistic Regression, Naive Bayes, XGBoost and LSTM. These models were trained on three different word embeddings, namely Bag-of-Words, TF-IDF and Word2vec. The results of this study showed that XGBoost, Logistic Regression and LSTM, trained on Word2vec, performed better than the other combinations.
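
The oversampling step can be pictured with a short sketch. It uses plain random oversampling with replacement, which is an assumption: the abstract does not state which oversampling or augmentation technique was applied, and the function name `random_oversample` and the toy headlines are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class examples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Toy, made-up headlines: three "sport" items and one "politics" item.
texts = ["headline a", "headline b", "headline c", "headline d"]
labels = ["sport", "sport", "sport", "politics"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only, so that duplicated examples do not leak into evaluation data.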

* Accepted for Third workshop on Resources for African Indigenous Languages (RAIL) 

MasakhaPOS: Part-of-Speech Tagging for Typologically Diverse African Languages

May 23, 2023
Cheikh M. Bamba Dione, David Adelani, Peter Nabende, Jesujoba Alabi, Thapelo Sindane, Happy Buzaaba, Shamsuddeen Hassan Muhammad, Chris Chinenye Emezue, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jonathan Mukiibi, Blessing Sibanda, Bonaventure F. P. Dossou, Andiswa Bukula, Rooweither Mabuya, Allahsera Auguste Tapo, Edwin Munkoh-Buabeng, victoire Memdjokam Koagne, Fatoumata Ouoba Kabore, Amelia Taylor, Godson Kalipe, Tebogo Macucwa, Vukosi Marivate, Tajuddeen Gwadabe, Mboning Tchiaze Elvis, Ikechukwu Onyenwe, Gratien Atindogbe, Tolulope Adelani, Idris Akinade, Olanrewaju Samuel, Marien Nahimana, Théogène Musabeyezu, Emile Niyomutabazi, Ester Chimhenga, Kudzai Gotosa, Patrick Mizha, Apelete Agbolo, Seydou Traore, Chinedu Uchechukwu, Aliyu Yusuf, Muhammad Abdullahi, Dietrich Klakow

In this paper, we present MasakhaPOS, the largest part-of-speech (POS) dataset for 20 typologically diverse African languages. We discuss the challenges in annotating POS for these languages using the Universal Dependencies (UD) guidelines. We conducted extensive POS baseline experiments using conditional random fields and several multilingual pre-trained language models. We applied various cross-lingual transfer models trained with data available in UD. Evaluating on the MasakhaPOS dataset, we show that choosing the best transfer language(s) in both single-source and multi-source setups greatly improves the POS tagging performance of the target languages, in particular when combined with cross-lingual parameter-efficient fine-tuning methods. Crucially, transferring knowledge from a language that matches the language family and morphosyntactic properties seems more effective for POS tagging in unseen languages.

* Accepted to ACL 2023 (Main conference) 

MphayaNER: Named Entity Recognition for Tshivenda

Apr 08, 2023
Rendani Mbuvha, David I. Adelani, Tendani Mutavhatsindi, Tshimangadzo Rakhuhu, Aluwani Mauda, Tshifhiwa Joshua Maumela, Andisani Masindi, Seani Rananga, Vukosi Marivate, Tshilidzi Marwala

Named Entity Recognition (NER) plays a vital role in various Natural Language Processing tasks such as information retrieval, text classification, and question answering. However, NER can be challenging, especially in low-resource languages with limited annotated datasets and tools. This paper adds to the effort of addressing these challenges by introducing MphayaNER, the first Tshivenda NER corpus in the news domain. We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER. The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with chiShona and Kiswahili showing the best results. Augmenting MphayaNER with chiShona data was also found to improve model performance significantly. Both MphayaNER and the baseline models are made publicly available.

* Accepted at AfricaNLP Workshop at ICLR 2023 

Preparing the Vuk'uzenzele and ZA-gov-multilingual South African multilingual corpora

Mar 07, 2023
Richard Lastrucci, Isheanesu Dzingirai, Jenalea Rajab, Andani Madodonga, Matimba Shingange, Daniel Njini, Vukosi Marivate

This paper introduces two multilingual government-themed corpora in various South African languages. The corpora were collected by gathering the South African Government newspaper (Vuk'uzenzele), as well as South African government speeches (ZA-gov-multilingual), that are translated into all 11 South African official languages. The corpora can be used for a myriad of downstream NLP tasks. The corpora were created to allow researchers to study the language used in South African government publications, with a focus on understanding how South African government officials communicate with their constituents. In this paper we highlight the process of gathering, cleaning and making available the corpora. We create parallel sentence corpora for Neural Machine Translation (NMT) tasks using Language-Agnostic Sentence Representations (LASER) embeddings. With these aligned sentences we then provide NMT benchmarks for 9 indigenous languages by fine-tuning a massively multilingual pre-trained language model.
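
Embedding-based sentence alignment can be sketched as a nearest-neighbour search under cosine similarity. The small vectors below are made-up stand-ins for sentence embeddings and no LASER code is called; the greedy best-match rule and the 0.8 threshold are assumptions for illustration, not the procedure described in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    Pairs scoring below `threshold` are dropped as unreliable.
    """
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy vectors standing in for sentence embeddings in two languages.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
aligned = align(src, tgt)
```

In this toy data the third source sentence has no close counterpart, so it is left unaligned rather than forced into a poor pair.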

* Under Review 

Conversational Pattern Mining using Motif Detection

Nov 13, 2022
Nicolle Garber, Vukosi Marivate

The subject of conversational mining has become of great interest recently due to the explosion of social and other online media. Supplementing this explosion of text is the advancement in pre-trained language models, which have helped us to leverage these sources of information. An interesting domain to analyse is conversations in terms of complexity and value. Complexity arises because a conversation can be asynchronous and can involve multiple parties. It is also computationally intensive to process. We use unsupervised methods in our work in order to develop a conversational pattern mining technique which does not require time-consuming, knowledge-demanding and resource-intensive labelling exercises. The task of identifying repeating patterns in sequences is well researched in the Bioinformatics field. In our work, we adapt this to the field of Natural Language Processing and make several extensions to a motif detection algorithm. In order to demonstrate the application of the algorithm on a dynamic, real-world dataset, we extract motifs from an open-source film script data source. We run an exploratory investigation into the types of motifs we are able to mine.
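
The idea of mining repeating patterns from sequences can be shown with a deliberately simple stand-in: counting fixed-length contiguous patterns that recur across conversations. This is not the motif detection algorithm extended in the paper, which the abstract does not specify; the function name `find_motifs`, the utterance labels, and the support threshold are all hypothetical.

```python
from collections import Counter


def find_motifs(sequences, length=3, min_support=2):
    """Return contiguous patterns of `length` found in at least `min_support` sequences."""
    support = Counter()
    for seq in sequences:
        # Use a set so each pattern is counted once per sequence.
        seen = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(seen)
    return {motif: count for motif, count in support.items() if count >= min_support}


# Toy conversations encoded as utterance types:
# Q = question, A = answer, T = thanks, G = greeting.
conversations = [
    ["Q", "A", "T", "Q", "A"],
    ["G", "Q", "A", "T"],
    ["Q", "A", "Q", "A"],
]
motifs = find_motifs(conversations, length=3, min_support=2)
```

Here the question, answer, thanks pattern is the only length-3 motif shared by at least two conversations.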

* Accepted and to appear in proceedings of the 2022 Pan-African Artificial Intelligence and Smart Systems Conference 

Reinforcement Learning in Education: A Multi-Armed Bandit Approach

Nov 01, 2022
Herkulaas Combrink, Vukosi Marivate, Benjamin Rosman

Advances in reinforcement learning research have demonstrated the ways in which different agent-based models can learn how to optimally perform a task within a given environment. Reinforcement learning solves unsupervised problems where agents move through a state-action-reward loop to maximize the overall reward for the agent, which in turn optimizes the solving of a specific problem in a given environment. However, these algorithms are designed based on our understanding of actions that should be taken in a real-world environment to solve a specific problem. One such problem is the ability to identify, recommend and execute an action within a system where the users are the subject, such as in education. In recent years, the use of blended learning approaches integrating face-to-face learning with online learning in the education context has increased. Additionally, online platforms used for education require the automation of certain functions such as the identification, recommendation or execution of actions that can benefit the user, in this sense the student or learner. As promising as these scientific advances are, there is still a need to conduct research in a variety of different areas to ensure the successful deployment of these agents within education systems. Therefore, the aim of this study was to contextualise and simulate the cumulative reward within an environment for an intervention recommendation problem in the education context.
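
Simulating cumulative reward for an intervention recommendation problem can be sketched with a small multi-armed bandit. The epsilon-greedy policy, the Bernoulli reward model, the function name `run_bandit`, and the success probabilities are all assumptions chosen for illustration; the abstract does not say which bandit algorithm or reward definition the study used.

```python
import random


def run_bandit(true_probs, steps=1000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy agent on Bernoulli arms.

    Returns the cumulative reward after each step and the pull count per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms    # times each arm was recommended
    values = [0.0] * n_arms  # running mean reward per arm
    cumulative, total = [], 0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore a random arm
        else:
            arm = max(range(n_arms), key=values.__getitem__)  # exploit best estimate
        reward = 1 if rng.random() < true_probs[arm] else 0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
        cumulative.append(total)
    return cumulative, counts


# Hypothetical interventions with made-up success probabilities.
interventions = ["tutoring", "email reminder", "peer mentoring"]
cumulative, counts = run_bandit([0.3, 0.5, 0.7], steps=500)
```

Plotting `cumulative` against the step index gives the kind of cumulative reward curve such a simulation is meant to produce.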

* 17 pages, 6 figures, 1 table, EAI AFRICATEK 2022 Conference 

A Framework for Undergraduate Data Collection Strategies for Student Support Recommendation Systems in Higher Education

Oct 16, 2022
Herkulaas MvE Combrink, Vukosi Marivate, Benjamin Rosman

Understanding which student support strategies mitigate dropout and improve student retention is an important part of modern higher educational research. One of the largest challenges institutions of higher learning currently face is the scalability of student support. Part of this is due to the shortage of staff addressing the needs of students, and the associated referral pathways needed to provide timeous student support strategies. This is further complicated by the difficulty of these referrals, especially as students are often faced with a combination of administrative, academic, social, and socio-economic challenges. A possible solution to this problem can be a combination of student outcome predictions and applying algorithmic recommender systems within the context of higher education. While much effort and detail has gone into the expansion of explaining algorithmic decision making in this context, there is still a need to develop data collection strategies. Therefore, the purpose of this paper is to outline a data collection framework specific to recommender systems within this context in order to reduce collection biases, understand student characteristics, and find an ideal way to infer optimal influences on the student journey. If confirmation biases, challenges in data sparsity and the type of information to collect from students are not addressed, it will have detrimental effects on attempts to assess and evaluate the effects of these systems within higher education.

* 14 pages, 4 figures, Proceedings of the 2020 SACAIR Conference 

Comparing Synthetic Tabular Data Generation Between a Probabilistic Model and a Deep Learning Model for Education Use Cases

Oct 16, 2022
Herkulaas MvE Combrink, Vukosi Marivate, Benjamin Rosman

The ability to generate synthetic data has a variety of use cases across different domains. In education research, there is a growing need to have access to synthetic data to test certain concepts and ideas. In recent years, several deep learning architectures were used to aid in the generation of synthetic data but with varying results. In the education context, the sophistication of implementing different models requiring large datasets is becoming very important. This study aims to compare the application of synthetic tabular data generation between a probabilistic model, specifically a Bayesian Network, and a deep learning model, specifically a Generative Adversarial Network, using a classification task. The results of this study indicate that synthetic tabular data generation is better suited for the education context using probabilistic models (overall accuracy of 75%) than deep learning architecture (overall accuracy of 38%) because of probabilistic interdependence. Lastly, we recommend that other data types should be explored and evaluated for their application in generating synthetic data for education use cases.

* 11 pages, 5 figures, Proceedings for the SACAIR 2023 Conference

Semi-supervised learning approaches for predicting South African political sentiment for local government elections

May 04, 2022
Mashadi Ledwaba, Vukosi Marivate

This study aims to understand the South African political context by analysing the sentiments shared on Twitter during the local government elections. Emphasis in the analysis was placed on understanding the discussions around four predominant political parties: ANC, DA, EFF and ActionSA. A semi-supervised approach, by means of a graph-based technique, was used to label the vast accessible Twitter data for the classification of tweets into negative and positive sentiment. The tweets expressing negative sentiment were further analysed through latent topic extraction to uncover hidden topics of concern associated with each of the political parties. Our findings demonstrated that the general sentiment across South African Twitter users is negative towards all four predominant parties, with the worst negative sentiment among users projected towards the current ruling party, ANC, relating to concerns centred around corruption, incompetence and loadshedding.
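
Graph-based semi-supervised labelling can be illustrated with a basic label propagation sketch. The abstract does not name the specific graph technique, so the iterative neighbour averaging below, the function name `propagate_labels`, the similarity weights, and the five-tweet graph are all assumptions for illustration.

```python
def propagate_labels(edges, seeds, n_nodes, iterations=50):
    """Spread seed sentiment scores (+1 positive, -1 negative) over a weighted graph.

    Each unlabelled node repeatedly takes the weighted mean of its neighbours'
    scores, while seed nodes stay fixed at their known label.
    """
    neighbours = [[] for _ in range(n_nodes)]
    for a, b, weight in edges:
        neighbours[a].append((b, weight))
        neighbours[b].append((a, weight))

    scores = [float(seeds.get(node, 0.0)) for node in range(n_nodes)]
    for _ in range(iterations):
        updated = scores[:]
        for node in range(n_nodes):
            if node in seeds:
                continue  # labelled tweets keep their score
            total_weight = sum(w for _, w in neighbours[node])
            if total_weight > 0:
                updated[node] = sum(scores[nb] * w for nb, w in neighbours[node]) / total_weight
        scores = updated
    return scores


# Toy graph of five tweets; edge weights are made-up text similarities.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 0.2), (3, 4, 1.0)]
seeds = {0: 1.0, 4: -1.0}  # tweet 0 labelled positive, tweet 4 negative
scores = propagate_labels(edges, seeds, n_nodes=5)
labels = ["positive" if s > 0 else "negative" for s in scores]
```

The weak link between tweets 2 and 3 keeps the positive label from spreading further, so tweet 3 follows its strongly connected negative neighbour.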

* Accepted for DGO 2022 