Tharindu Ranasinghe

A Federated Learning Approach to Privacy Preserving Offensive Language Identification

Apr 17, 2024

Marcos Zampieri, Damith Premasiri, Tharindu Ranasinghe

Abstract: The spread of various forms of offensive speech online is an important concern in social media. While platforms have been investing heavily in ways of coping with this problem, the question of privacy remains largely unaddressed. Models that detect offensive language on social media are trained and/or fine-tuned using large amounts of data often stored in centralized servers. Since most social media data originates from end users, we propose a privacy-preserving decentralized architecture for identifying offensive language online by introducing Federated Learning (FL) in the context of offensive language identification. FL is a decentralized architecture that allows multiple models to be trained locally without the need for data sharing, hence preserving users' privacy. We propose a model fusion approach to perform FL. We trained multiple deep learning models on four publicly available English benchmark datasets (AHSD, HASOC, HateXplain, OLID) and evaluated their performance in detail. We also present initial cross-lingual experiments in English and Spanish. We show that the proposed model fusion approach outperforms baselines on all the datasets while preserving privacy.

* Accepted to TRAC 2024 (Fourth Workshop on Threat, Aggression and Cyberbullying) at LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)
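
The abstract above does not spell out how the locally trained models are fused. As a rough illustration only, the sketch below assumes the simplest common choice, a weighted average of parameters in the style of federated averaging, which may differ from the fusion method used in the paper. The function name `fuse_models` and the dictionary-of-lists parameter format are hypothetical.

```python
from typing import Dict, List

# Hypothetical parameter format: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def fuse_models(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Fuse locally trained models by weighted parameter averaging.

    Each client's weights count in proportion to its local dataset size.
    Only parameters are exchanged; the raw training data never leaves
    the client. This is an assumed scheme, not the paper's exact method.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one dataset size per client model")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")

    fused: Params = {}
    for name, reference in client_params[0].items():
        fused[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(len(reference))
        ]
    return fused


# Two clients holding 1 and 3 local examples respectively.
fused = fuse_models([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}], [1, 3])
print(fused)
```

Weighting by dataset size keeps a client with very little data from pulling the fused model as strongly as a client with a lot of data; equal sizes reduce this to a plain mean.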

CSEPrompts: A Benchmark of Introductory Computer Science Prompts

Apr 04, 2024

Nishat Raihan, Dhiman Goswami, Sadiya Sayara Chowdhury Puspo, Christian Newman, Tharindu Ranasinghe, Marcos Zampieri

Abstract: Recent advances in AI, machine learning, and NLP have led to the development of a new generation of Large Language Models (LLMs) that are trained on massive amounts of data and often have trillions of parameters. Commercial applications (e.g., ChatGPT) have made this technology available to the general public, thus making it possible to use LLMs to produce high-quality texts for academic and professional purposes. Schools and universities are aware of the increasing use of AI-generated content by students and they have been researching the impact of this new technology and its potential misuse. Educational programs in Computer Science (CS) and related fields are particularly affected because LLMs are also capable of generating programming code in various programming languages. To help understand the potential impact of publicly available LLMs in CS education, we introduce CSEPrompts, a framework with hundreds of programming exercise prompts and multiple-choice questions retrieved from introductory CS and programming courses. We also provide experimental results on CSEPrompts to evaluate the performance of several LLMs with respect to generating Python code and answering basic computer science and programming questions.

DORE: A Dataset For Portuguese Definition Generation

Mar 28, 2024

Anna Beatriz Dimas Furtado, Tharindu Ranasinghe, Frédéric Blain, Ruslan Mitkov

Abstract: Definition modelling (DM) is the task of automatically generating a dictionary definition for a specific word. Computational systems that are capable of DM can have numerous applications benefiting a wide range of audiences. As DM is considered a supervised natural language generation problem, these systems require large annotated datasets to train the machine learning (ML) models. Several DM datasets have been released for English and other high-resource languages. While Portuguese is considered a mid/high-resource language in most natural language processing tasks and is spoken by more than 200 million native speakers, there is no DM dataset available for Portuguese. In this research, we fill this gap by introducing DORE, the first dataset for Definition MOdelling for PoRtuguEse, containing more than 100,000 definitions. We also evaluate several deep learning based DM models on DORE and report the results. The dataset and the findings of this paper will facilitate research and study of Portuguese in wider contexts.

* Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)

Guided Distant Supervision for Multilingual Relation Extraction Data: Adapting to a New Language

Mar 27, 2024

Alistair Plum, Tharindu Ranasinghe, Christoph Purschke

Abstract: Relation extraction is essential for extracting and understanding biographical information in the context of digital humanities and related subjects. There is a growing interest in the community to build datasets capable of training machine learning models to extract relationships. However, annotating such datasets can be expensive and time-consuming, in addition to being limited to English. This paper applies guided distant supervision to create a large biographical relationship extraction dataset for German. Our dataset, composed of more than 80,000 instances for nine relationship types, is the largest biographical German relationship extraction dataset. We also create a manually annotated dataset with 2000 instances to evaluate the models and release it together with the dataset compiled using guided distant supervision. We train several state-of-the-art machine learning models on the automatically created dataset and release them as well. Furthermore, we conduct multilingual and cross-lingual experiments that could benefit many low-resource languages.

* Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)

NSINA: A News Corpus for Sinhala

Mar 25, 2024

Hansi Hettiarachchi, Damith Premasiri, Lasitha Uyangodage, Tharindu Ranasinghe

Abstract: The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness is largely dependent on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to provide a solution to challenges in adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest news corpus for Sinhala available to date.

* Accepted to LREC-COLING 2024 (The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation)

MultiLS: A Multi-task Lexical Simplification Framework

Feb 22, 2024

Kai North, Tharindu Ranasinghe, Matthew Shardlow, Marcos Zampieri

Abstract: Lexical Simplification (LS) automatically replaces difficult-to-read words with easier alternatives while preserving a sentence's original meaning. LS is a precursor to Text Simplification with the aim of improving text accessibility to various target demographics, including children, second language learners, and individuals with reading disabilities or low literacy. Several datasets exist for LS. These LS datasets specialize in one or two sub-tasks within the LS pipeline. However, as of this moment, no single LS dataset has been developed that covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset to be created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all LS sub-tasks of (1) lexical complexity prediction (LCP), (2) substitute generation, and (3) substitute ranking for Portuguese. Model performances are reported, ranging from transformer-based models to more recent large language models (LLMs).

A Text-to-Text Model for Multilingual Offensive Language Identification

Dec 06, 2023

Tharindu Ranasinghe, Marcos Zampieri

Abstract: The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g. hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with encoder-decoder architecture for offensive language identification with text-to-text transformers (T5) trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining the two datasets and selecting an optimal threshold in semi-supervised instances in SOLID in the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state-of-the-art on all the above datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.

* Accepted to Findings of IJCNLP-AACL 2023

SurreyAI 2023 Submission for the Quality Estimation Shared Task

Dec 01, 2023

Archchana Sindhujan, Diptesh Kanojia, Constantin Orasan, Tharindu Ranasinghe

Abstract: Quality Estimation (QE) systems are important in situations where it is necessary to assess the quality of translations, but there is no reference available. This paper describes the approach adopted by the SurreyAI team for addressing the Sentence-Level Direct Assessment shared task in WMT23. The proposed approach builds upon the TransQuest framework, exploring various autoencoder pre-trained language models within the MonoTransQuest architecture using single and ensemble settings. The autoencoder pre-trained language models employed in the proposed systems are XLMV, InfoXLM-large, and XLMR-large. The evaluation utilizes Spearman and Pearson correlation coefficients, assessing the relationship between machine-predicted quality scores and human judgments for 5 language pairs (English-Gujarati, English-Hindi, English-Marathi, English-Tamil and English-Telugu). The MonoTQ-InfoXLM-large approach emerges as a robust strategy, surpassing all other individual models proposed in this study by significantly improving over the baseline for the majority of the language pairs.
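
The evaluation described above compares predicted quality scores with human judgments using Pearson and Spearman correlation. A minimal, dependency-free sketch of both coefficients follows; the function names and the toy scores are illustrative assumptions, not the shared task's official scoring code.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores (made up for illustration): model predictions vs. human ratings.
predicted = [0.2, 0.5, 0.9, 0.4]
human = [10.0, 55.0, 80.0, 60.0]
print(pearson(predicted, human), spearman(predicted, human))
```

Pearson rewards a linear relationship between the two score lists, while Spearman only asks whether the model orders the translations the same way the annotators do.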

Offensive Language Identification in Transliterated and Code-Mixed Bangla

Nov 25, 2023

Md Nishat Raihan, Umma Hani Tanmoy, Anika Binte Islam, Kai North, Tharindu Ranasinghe, Antonios Anastasopoulos, Marcos Zampieri

Abstract: Identifying offensive content in social media is vital for creating safe online communities. Several recent studies have addressed this problem by creating datasets for various languages. In this paper, we explore offensive language identification in texts with transliterations and code-mixing, linguistic phenomena common in multilingual societies, and a known challenge for NLP systems. We introduce TB-OLID, a transliterated Bangla offensive language dataset containing 5,000 manually annotated comments. We train and fine-tune machine learning models on TB-OLID, and we evaluate their results on this dataset. Our results show that English pre-trained transformer-based models, such as fBERT and HateBERT, achieve the best performance on this dataset.

Can Model Fusing Help Transformers in Long Document Classification? An Empirical Study

Jul 18, 2023

Damith Premasiri, Tharindu Ranasinghe, Ruslan Mitkov

Abstract: Text classification is an area of research which has been studied over the years in Natural Language Processing (NLP). Adapting NLP to multiple domains has introduced many new challenges for text classification and one of them is long document classification. While state-of-the-art transformer models provide excellent results in text classification, most of them have limitations in the maximum sequence length of the input sequence. The majority of the transformer models are limited to 512 tokens, and therefore, they struggle with long document classification problems. In this research, we explore employing Model Fusing for long document classification while comparing the results with well-known BERT and Longformer architectures.

* Accepted in RANLP 2023
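
To make the 512-token limit mentioned in the abstract concrete, the sketch below splits a long token sequence into overlapping windows that each fit the limit. This is a generic workaround shown for illustration and is not the Model Fusing approach studied in the paper; the function name and the overlap of 64 tokens are assumptions.

```python
from typing import List, Sequence


def split_into_windows(
    tokens: Sequence[int], max_len: int = 512, overlap: int = 64
) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most max_len.

    Consecutive windows share `overlap` tokens so that text cut at a
    window boundary still appears with some context in the next window.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    windows: List[List[int]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows


# A 1,200-token document (dummy ids) becomes three windows.
document = list(range(1200))
windows = split_into_windows(document)
print([len(w) for w in windows])  # [512, 512, 304]
```

Each window can then be classified separately and the per-window predictions combined, for example by averaging, at the cost of losing interactions between distant parts of the document.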
