Marcos Zampieri

Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages

Dec 17, 2021

Thomas Mandl, Sandip Modha, Gautam Kishore Shahi, Hiren Madhu, Shrey Satapara, Prasenjit Majumder, Johannes Schaefer, Tharindu Ranasinghe, Marcos Zampieri, Durgesh Nandini(+1 more)

Abstract: The spread of offensive content online, such as hate speech, poses a growing societal problem. AI tools are necessary for supporting the moderation process at online platforms. For the evaluation of these identification tools, continuous experimentation with data sets in different languages is necessary. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The data set was assembled from Twitter. This subtrack has two sub-tasks. Task A is a binary classification problem (Hate and Not Offensive) offered for all three languages. Task B is a fine-grained classification problem with three classes, Hate speech (HATE), OFFENSIVE, and PROFANITY, offered for English and Hindi. Overall, 652 runs were submitted by 65 teams. The best classification algorithms for Task A achieved F1 measures of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best performing algorithms were mainly variants of transformer architectures.

FBERT: A Neural Transformer for Identifying Offensive Content

Sep 10, 2021

Diptanu Sarkar, Marcos Zampieri, Tharindu Ranasinghe, Alexander Ororbia

Abstract: Transformer-based models such as BERT, XLNET, and XLM-R have achieved state-of-the-art performance across various NLP tasks including the identification of offensive language and hate speech, an important problem in social media. In this paper, we present fBERT, a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances. We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID. The fBERT model will be made freely available to the community.

* Accepted to EMNLP Findings

Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi

Sep 08, 2021

Saurabh Gaikwad, Tharindu Ranasinghe, Marcos Zampieri, Christopher M. Homan

Abstract: The widespread presence of offensive language on social media motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

* Accepted to RANLP 2021

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

Sep 01, 2021

Christian D. Newman, Michael J. Decker, Reem S. AlSuhaibani, Anthony Peruma, Satyajit Mohapatra, Tejal Vishnoi, Marcos Zampieri, Mohamed W. Mkaouer, Timothy J. Sheldon, Emily Hill

Abstract: This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE, and Stanford. We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names. We also study and discuss the weaknesses of our tagger to promote the future amelioration of these problems through further research. Our results show that the ensemble achieves 75% accuracy at the identifier level and 84-86% accuracy at the word level. This is an increase of 17 percentage points at the identifier level over the closest independent part-of-speech tagger.

* in IEEE Transactions on Software Engineering, vol. , no. 01, pp. 1-1, 5555
* 18 pages. arXiv admin note: text overlap with arXiv:2007.08033

WLV-RIT at GermEval 2021: Multitask Learning with Transformers to Detect Toxic, Engaging, and Fact-Claiming Comments

Jul 30, 2021

Skye Morgan, Tharindu Ranasinghe, Marcos Zampieri

Abstract: This paper addresses the identification of toxic, engaging, and fact-claiming comments on social media. We used the dataset made available by the organizers of the GermEval-2021 shared task containing over 3,000 manually annotated Facebook comments in German. Considering the relatedness of the three tasks, we approached the problem using large pre-trained transformer models and multitask learning. Our results indicate that multitask learning achieves performance superior to the more common single task learning approach in all three tasks. We submit our best systems to GermEval-2021 under the team name WLV-RIT.

* Accepted to GermEval-2021

SemEval-2021 Task 1: Lexical Complexity Prediction

Jun 01, 2021

Matthew Shardlow, Richard Evans, Gustavo Henrique Paetzold, Marcos Zampieri

Abstract: This paper presents the results and main findings of SemEval-2021 Task 1 - Lexical Complexity Prediction. We provided participants with an augmented version of the CompLex Corpus (Shardlow et al., 2020). CompLex is an English multi-domain corpus in which words and multi-word expressions (MWEs) were annotated with respect to their complexity using a five-point Likert scale. SemEval-2021 Task 1 featured two Sub-tasks: Sub-task 1 focused on single words and Sub-task 2 focused on MWEs. The competition attracted 198 teams in total, of which 54 teams submitted official runs on the test data to Sub-task 1 and 37 to Sub-task 2.

An Exploratory Analysis of the Relation Between Offensive Language and Mental Health

May 31, 2021

Ana-Maria Bucur, Marcos Zampieri, Liviu P. Dinu

Abstract: In this paper, we analyze the interplay between the use of offensive language and mental health. We acquired publicly available datasets created for offensive language identification and depression detection and trained computational models to compare the use of offensive language in social media posts written by groups of individuals with and without a self-reported depression diagnosis. We also look at samples written by groups of individuals whose posts show signs of depression according to recent related studies. Our analysis indicates that offensive language is more frequently used in the samples written by individuals with self-reported depression as well as individuals showing signs of depression. The results discussed here open new avenues in research in politeness/offensiveness and mental health.

* Accepted to Findings of the Association for Computational Linguistics: ACL 2021

Multilingual Offensive Language Identification for Low-resource Languages

May 20, 2021

Tharindu Ranasinghe, Marcos Zampieri

Abstract: Offensive content is pervasive in social media and a reason for concern to companies and government organizations. Several studies have been recently published investigating methods to detect the various forms of such content (e.g. hate speech, cyberbullying, and cyberaggression). The clear majority of these studies deal with English, partly because most annotated datasets available contain English data. In this paper, we take advantage of available English datasets by applying cross-lingual contextual word embeddings and transfer learning to make predictions in low-resource languages. We project predictions on comparable data in Arabic, Bengali, Danish, Greek, Hindi, Spanish, and Turkish. We report results of 0.8415 F1 macro for Bengali in TRAC-2 shared task, 0.8532 F1 macro for Danish and 0.8701 F1 macro for Greek in OffensEval 2020, 0.8568 F1 macro for Hindi in HASOC 2019 shared task, and 0.7513 F1 macro for Spanish in SemEval-2019 Task 5 (HatEval), showing that our approach compares favourably to the best systems submitted to recent shared tasks on these languages. Additionally, we report competitive performance on Arabic and Turkish using the training and development sets of OffensEval 2020 shared task. The results for all languages confirm the robustness of cross-lingual contextual embeddings and transfer learning for this task.

* Accepted to ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP). This is an extended version of a paper accepted to EMNLP. arXiv admin note: substantial text overlap with arXiv:2010.05324

LCP-RIT at SemEval-2021 Task 1: Exploring Linguistic Features for Lexical Complexity Prediction

May 18, 2021

Abhinandan Desai, Kai North, Marcos Zampieri, Christopher M. Homan

Abstract: This paper describes team LCP-RIT's submission to the SemEval-2021 Task 1: Lexical Complexity Prediction (LCP). The task organizers provided participants with an augmented version of CompLex (Shardlow et al., 2020), an English multi-domain dataset in which words in context were annotated with respect to their complexity using a five-point Likert scale. Our system uses logistic regression and a wide range of linguistic features (e.g. psycholinguistic features, n-grams, word frequency, POS tags) to predict the complexity of single words in this dataset. We analyze the impact of different linguistic features on classification performance and we evaluate the results in terms of mean absolute error, mean squared error, Pearson correlation, and Spearman correlation.

WLV-RIT at SemEval-2021 Task 5: A Neural Transformer Framework for Detecting Toxic Spans

Apr 15, 2021

Tharindu Ranasinghe, Diptanu Sarkar, Marcos Zampieri, Alex Ororbia

Abstract: In recent years, the widespread use of social media has led to an increase in the generation of toxic and offensive content on online platforms. In response, social media platforms have worked on developing automatic detection methods and employing human moderators to cope with this deluge of offensive content. While various state-of-the-art statistical models have been applied to detect toxic posts, there are only a few studies that focus on detecting the words or expressions that make a post offensive. This motivates the organization of the SemEval-2021 Task 5: Toxic Spans Detection competition, which has provided participants with a dataset containing toxic span annotations in English posts. In this paper, we present the WLV-RIT entry for the SemEval-2021 Task 5. Our best performing neural transformer model achieves an F1 score of 0.68. Furthermore, we develop an open-source framework for multilingual detection of offensive spans, i.e., MUDES, based on neural transformers that detect toxic spans in texts.

* Accepted to SemEval-2021
