
"Topic": models, code, and papers

ArSentD-LEV: A Multi-Topic Corpus for Target-based Sentiment Analysis in Arabic Levantine Tweets

May 25, 2019
Ramy Baly, Alaa Khaddaj, Hazem Hajj, Wassim El-Hajj, Khaled Bashir Shaban

Sentiment analysis is a highly subjective and challenging task. Its complexity increases further when applied to Arabic, mainly because of the wide variety of unstandardized dialects used on the Web, especially in social media. While many datasets have been released for training Arabic sentiment classifiers, most contain only shallow annotation, marking the sentiment of a text unit such as a word, a sentence, or a document. In this paper, we present the Arabic Sentiment Twitter Dataset for the Levantine dialect (ArSenTD-LEV). Based on findings from analyzing tweets from the Levant region, we created a dataset of 4,000 tweets with the following annotations: the overall sentiment of the tweet, the target to which the sentiment was expressed, how the sentiment was expressed, and the topic of the tweet. Results confirm the importance of these annotations in improving the performance of a baseline sentiment classifier. They also confirm the performance gap incurred when training in one domain and testing in another.
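The four annotation layers described above could be represented per tweet roughly as follows; the field names and values here are illustrative placeholders, not the corpus's actual released schema:

```python
# Hypothetical layout of one annotated ArSenTD-LEV record.
# Field names are illustrative; consult the released corpus for the real schema.
record = {
    "tweet": "...",                      # the tweet text (Levantine Arabic)
    "topic": "politics",                 # annotated topic of the tweet
    "sentiment": "negative",             # overall sentiment of the tweet
    "sentiment_expression": "explicit",  # how the sentiment was expressed
    "sentiment_target": "...",           # entity the sentiment is directed at
}
```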

* Corpus development, Levantine tweets, multi-topic, sentiment analysis, sentiment target, LREC-2018, OSACT-2018 

Automated Topical Component Extraction Using Neural Network Attention Scores from Source-based Essay Scoring

Aug 04, 2020
Haoran Zhang, Diane Litman

While automated essay scoring (AES) can reliably grade essays at scale, automated writing evaluation (AWE) additionally provides formative feedback to guide essay revision. However, a neural AES typically does not provide useful feature representations for supporting AWE. This paper presents a method for linking AWE and neural AES by extracting Topical Components (TCs), which represent evidence from a source text, using the intermediate output of attention layers. We evaluate performance using a feature-based AES that requires TCs. Results show that performance is comparable whether using automatically or manually constructed TCs for 1) representing essays as rubric-based features and 2) grading essays.

* The 58th Annual Meeting of the Association for Computational Linguistics, pp. 8569-8584, 2020 
* Published at ACL 2020 

Going deep in clustering high-dimensional data: deep mixtures of unigrams for uncovering topics in textual data

Feb 18, 2019
Laura Anderlucci, Cinzia Viroli

Mixtures of Unigrams (Nigam et al., 2000) are one of the simplest and most efficient tools for clustering textual data, as they assume that documents related to the same topic have similar distributions of terms, naturally described by Multinomials. When the classification task is particularly challenging, such as when the document-term matrix is high-dimensional and extremely sparse, a more composite representation can provide better insight into the grouping structure. In this work, we develop a deep version of mixtures of Unigrams for the unsupervised classification of very short documents with a large number of terms, by allowing for deeper latent layers; the proposal is derived in a Bayesian framework. Simulation studies and real-data analyses show that going deep in clustering such data substantially improves classification accuracy with respect to shallower methods.
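As a rough illustration of the base model only (not the authors' deep Bayesian variant), a shallow mixture of unigrams can be fitted with EM over bag-of-words counts; the hyperparameters and smoothing below are illustrative choices:

```python
import math
import random
from collections import Counter

def em_mixture_of_unigrams(docs, k, n_iter=50, seed=0, alpha=1e-2):
    """Soft-cluster tokenized documents with a k-component mixture of
    multinomial (unigram) distributions fitted by EM; returns hard labels."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    # Random soft assignments to start.
    resp = []
    for _ in docs:
        row = [rng.random() + 1e-3 for _ in range(k)]
        s = sum(row)
        resp.append([r / s for r in row])
    for _ in range(n_iter):
        # M-step: mixing weights and smoothed per-topic word distributions.
        pi = [sum(r[j] for r in resp) / len(docs) for j in range(k)]
        theta = []
        for j in range(k):
            wc = {w: alpha for w in vocab}
            for r, c in zip(resp, counts):
                for w, n in c.items():
                    wc[w] += r[j] * n
            total = sum(wc.values())
            theta.append({w: wc[w] / total for w in vocab})
        # E-step: posterior responsibility of each topic for each document.
        resp = []
        for c in counts:
            logp = [math.log(pi[j]) +
                    sum(n * math.log(theta[j][w]) for w, n in c.items())
                    for j in range(k)]
        # log-sum-exp normalization for numerical stability
            m = max(logp)
            p = [math.exp(x - m) for x in logp]
            s = sum(p)
            resp.append([x / s for x in p])
    return [row.index(max(row)) for row in resp]
```

The "deep" extension in the paper stacks further latent layers on top of this base model; this sketch covers only the shallow case.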

Women in ISIS Propaganda: A Natural Language Processing Analysis of Topics and Emotions in a Comparison with Mainstream Religious Group

Dec 09, 2019
Mojtaba Heidarysafa, Kamran Kowsari, Tolu Odukoya, Philip Potter, Laura E. Barnes, Donald E. Brown

Online propaganda is central to the recruitment strategies of extremist groups, and in recent years these efforts have increasingly extended to women. To investigate ISIS' approach to targeting women in their online propaganda and uncover implications for counterterrorism, we rely on text mining and natural language processing (NLP). Specifically, we extract articles published in Dabiq and Rumiyah (ISIS's online English-language publications) to identify prominent topics. To identify similarities or differences between these texts and those produced by non-violent religious groups, we extend the analysis to articles from a Catholic forum dedicated to women. We also perform an emotional analysis of both of these resources to better understand the emotional components of propaganda. We rely on DepecheMood (a lexicon-based emotion analysis method) to detect emotions most likely to be evoked in readers of these materials. The findings indicate that the emotional appeal of ISIS and Catholic materials is similar.
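In lexicon-based emotion analysis of this kind, a document-level emotion profile is typically obtained by averaging the per-word emotion distributions of the words found in the lexicon. A minimal sketch, using an invented toy lexicon (the real DepecheMood scores and emotion set differ):

```python
from collections import defaultdict

# Toy stand-in for a DepecheMood-style lexicon: each word maps to a
# distribution over reader emotions. Values are illustrative, not real scores.
LEXICON = {
    "war":     {"afraid": 0.6, "sad": 0.3, "inspired": 0.1},
    "victory": {"afraid": 0.1, "sad": 0.1, "inspired": 0.8},
    "faith":   {"afraid": 0.1, "sad": 0.2, "inspired": 0.7},
}

def document_emotions(tokens, lexicon=LEXICON):
    """Average the per-word emotion distributions over lexicon hits."""
    totals, hits = defaultdict(float), 0
    for w in tokens:
        if w in lexicon:
            hits += 1
            for emo, score in lexicon[w].items():
                totals[emo] += score
    if hits == 0:
        return {}  # no lexicon word found; emotion profile undefined
    return {emo: s / hits for emo, s in totals.items()}
```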

From Talk to Action with Accountability: Monitoring the Public Discussion of Finnish Decision-Makers with Deep Neural Networks and Topic Modelling

Oct 16, 2020
Vili Hätönen, Fiona Melzer

Decades of climate research have produced a consensus that human activity has changed the climate and that we are currently heading into a climate crisis. Many tools and methods, some of which utilize machine learning, have been developed to monitor, evaluate, and predict the changing climate and its effects on societies. However, the mere existence of tools and increased awareness have not led to swift action to reduce emissions and mitigate climate change. Politicians and other policy makers lack the initiative to move from talking about the climate to concrete climate action. In this work, we contribute to the efforts of holding decision makers accountable by describing a system that digests politicians' speeches and statements into a topic summary. We propose a multi-source hybrid latent Dirichlet allocation model which can process the large number of publicly available reports, social media posts, speeches, and other documents of Finnish politicians, providing transparency and accountability towards the general public.
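Plain LDA, the base of the proposed hybrid model, can be sketched with a tiny collapsed Gibbs sampler. The code below is an illustrative plain-Python toy, not the authors' multi-source variant, and the priors are arbitrary choices:

```python
import random
from collections import Counter

def lda_gibbs(docs, k, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA over tokenized documents;
    returns the per-topic word-count tables after the final sweep."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    # Random topic assignment for every token, plus the three count tables.
    z = [[rng.randrange(k) for _ in d] for d in docs]
    doc_topic = [Counter(zd) for zd in z]
    topic_word = [Counter() for _ in range(k)]
    topic_total = [0] * k
    for d, zd in zip(docs, z):
        for w, t in zip(d, zd):
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(n_iter):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                t = z[i][j]
                # Remove the current assignment from the counts.
                doc_topic[i][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
                # Resample proportional to the collapsed conditional posterior.
                weights = [(doc_topic[i][s] + alpha) *
                           (topic_word[s][w] + beta) / (topic_total[s] + beta * V)
                           for s in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[i][j] = t
                doc_topic[i][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1
    return topic_word
```

In practice a library implementation (e.g. gensim or scikit-learn) would be used; the toy sampler only shows the mechanics the hybrid model builds on.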

* Submitted to NeurIPS 2020 Workshop Tackling Climate Change with Machine Learning 

Detecting Group Beliefs Related to 2018's Brazilian Elections in Tweets: A Combined Study on Modeling Topics and Sentiment Analysis

May 31, 2020
Brenda Salenave Santana, Aline Aver Vanin

2018's Brazilian presidential elections highlighted the influence of alternative media and social networks, such as Twitter. In this work, we perform an analysis covering politically motivated discourses related to the second round of the Brazilian elections. In order to verify whether similar discourses reinforce group engagement with personal beliefs, we collected a set of tweets related to political hashtags at that moment. To this end, we used a combination of topic modeling and opinion-mining techniques to analyze the motivated political discourses. Using SentiLex-PT, a Portuguese sentiment lexicon, we extracted from the dataset the five most frequent groups of words related to opinions. Applying a bag-of-words model, we computed the cosine similarity between each opinion and the observed groups. This study allowed us to observe an exacerbated use of passionate discourses in the digital political scenario as a form of appreciation of, and engagement with, groups that convey similar beliefs.
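The bag-of-words cosine-similarity step used in such studies can be sketched directly; the token lists below are arbitrary examples, not the study's SentiLex-PT-derived opinion groups:

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity between two token lists under a bag-of-words model."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```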

* Proceedings of the Workshop on Digital Humanities and Natural Language Processing (DHandNLP 2020) co-located with International Conference on the Computational Processing of Portuguese (PROPOR 2020) 

How Metaphors Impact Political Discourse: A Large-Scale Topic-Agnostic Study Using Neural Metaphor Detection

Apr 08, 2021
Vinodkumar Prabhakaran, Marek Rei, Ekaterina Shutova

Metaphors are widely used in political rhetoric as an effective framing device. While the efficacy of specific metaphors such as the war metaphor in political discourse has been documented before, those studies often rely on a small number of hand-coded instances of metaphor use. Larger-scale topic-agnostic studies are required to establish the general persuasiveness of metaphors as a device, and to shed light on the broader patterns that guide their persuasiveness. In this paper, we present a large-scale data-driven study of metaphors used in political discourse. We conduct this study on a publicly available dataset of over 85K posts made by 412 US politicians on their public Facebook pages, up until Feb 2017. Our contributions are threefold: we show evidence that metaphor use correlates with ideological leanings in complex ways that depend on concurrent political events such as winning or losing elections; we show that posts with metaphors elicit more engagement from their audience overall, even after controlling for various socio-political factors such as gender and political party affiliation; and finally, we demonstrate that metaphoricity is indeed the reason for the increased engagement of posts, through a fine-grained linguistic analysis of metaphorical vs. literal usages of 513 words across 70K posts.

* The International AAAI Conference on Web and Social Media (ICWSM) 2021 
* Published at ICWSM 2021. Please cite that version for academic publications 

Automatic construction of Chinese herbal prescription from tongue image via CNNs and auxiliary latent therapy topics

Mar 01, 2018
Yang Hu, Guihua Wen, Huiqiang Liao, Changjun Wang, Dan Dai, Zhiwen Yu, Jun Zhang

The tongue image carries important physical information about the human body and is of great importance to diagnosis and treatment in clinical medicine. Herbal prescriptions are simple and noninvasive, have few side effects, and are widely applied in China. Research on automatically constructing herbal prescriptions from tongue images is significant for using deep learning to explore the relation between tongue image and herbal prescription, and can be applied to healthcare services in mobile medical systems. To adapt to tongue images taken in a variety of photographing environments and to construct herbal prescriptions, we design a neural network framework for prescription construction that includes single/double convolution channels and fully connected layers, and we propose an auxiliary therapy-topic loss to model the therapies of Chinese doctors and alleviate the interference of sparse output labels with the diversity of results. The experimental data comprise patient tongue images and their corresponding prescriptions from a real-world outpatient clinic, and the experiments generate prescriptions that are close to the real samples, which verifies the feasibility of the proposed method for automatic construction of herbal prescriptions from tongue images. The method also provides a reference for automatic herbal prescription construction from other physical information (or integrated body information).

Computational analyses of the topics, sentiments, literariness, creativity and beauty of texts in a large Corpus of English Literature

Jan 12, 2022
Arthur M. Jacobs, Annette Kinder

The Gutenberg Literary English Corpus (GLEC, Jacobs, 2018a) provides a rich source of textual data for research in digital humanities, computational linguistics or neurocognitive poetics. In this study we address differences among the different literature categories in GLEC, as well as differences between authors. We report the results of three studies providing i) topic and sentiment analyses for six text categories of GLEC (i.e., children and youth, essays, novels, plays, poems, stories) and its >100 authors, ii) novel measures of semantic complexity as indices of the literariness, creativity and book beauty of the works in GLEC (e.g., Jane Austen's six novels), and iii) two experiments on text classification and authorship recognition using novel features of semantic complexity. The data on two novel measures estimating a text's literariness, intratextual variance and stepwise distance (van Cranenburgh et al., 2019) revealed that plays are the most literary texts in GLEC, followed by poems and novels. Computation of a novel index of text creativity (Gray et al., 2016) revealed poems and plays as the most creative categories with the most creative authors all being poets (Milton, Pope, Keats, Byron, or Wordsworth). We also computed a novel index of perceived beauty of verbal art (Kintsch, 2012) for the works in GLEC and predict that Emma is the theoretically most beautiful of Austen's novels. Finally, we demonstrate that these novel measures of semantic complexity are important features for text classification and authorship recognition with overall predictive accuracies in the range of .75 to .97. Our data pave the way for future computational and empirical studies of literature or experiments in reading psychology and offer multiple baselines and benchmarks for analysing and validating other book corpora.

* 37 pages, 12 figures 

Improving Logical-Level Natural Language Generation with Topic-Conditioned Data Augmentation and Logical Form Generation

Dec 12, 2021
Ao Liu, Congjian Luo, Naoaki Okazaki

Logical Natural Language Generation, i.e., generating textual descriptions that can be logically entailed by a structured table, has been a challenge due to the low fidelity of the generation. Chen et al. (2020) have addressed this problem by annotating interim logical programs to control the generation contents and semantics, and presented the task of table-aware logical form to text (Logic2text) generation. However, although table instances are abundant in the real world, logical forms paired with textual descriptions require costly human annotation work, which limits the performance of neural models. To mitigate this, we propose topic-conditioned data augmentation (TopicDA), which utilizes GPT-2 to generate unpaired logical forms and textual descriptions directly from tables. We further introduce logical form generation (LG), a dual task of Logic2text that requires generating a valid logical form based on a text description of a table. We also propose a semi-supervised learning approach to jointly train a Logic2text and an LG model with both labeled and augmented data. The two models benefit from each other by providing extra supervision signals through back-translation. Experimental results on the Logic2text dataset and the LG task demonstrate that our approach can effectively utilize the augmented data and outperform supervised baselines by a substantial margin.
