Alert button
Picture for Siddhant Jaydeep Mahajani

Siddhant Jaydeep Mahajani

Alert button

A Comparison of Lexicon-Based and ML-Based Sentiment Analysis: Are There Outlier Words?

Nov 10, 2023
Siddhant Jaydeep Mahajani, Shashank Srivastava, Alan F. Smeaton

Lexicon-based approaches to sentiment analysis of text are based on each word or lexical entry having a pre-defined weight indicating its sentiment polarity. These are usually manually assigned but the accuracy of these when compared against machine leaning based approaches to computing sentiment, are not known. It may be that there are lexical entries whose sentiment values cause a lexicon-based approach to give results which are very different to a machine learning approach. In this paper we compute sentiment for more than 150,000 English language texts drawn from 4 domains using the Hedonometer, a lexicon-based technique and Azure, a contemporary machine-learning based approach which is part of the Azure Cognitive Services family of APIs which is easy to use. We model differences in sentiment scores between approaches for documents in each domain using a regression and analyse the independent variables (Hedonometer lexical entries) as indicators of each word's importance and contribution to the score differences. Our findings are that the importance of a word depends on the domain and there are no standout lexical entries which systematically cause differences in sentiment scores.

* 4 pages, to appear in Proceedings of the 31st Irish Conference on Artificial Intelligence and Cognitive Science. December 7th-8th, 2023 
Viaarxiv icon