"Topic": models, code, and papers

Characterizing the Language of Online Communities and its Relation to Community Reception

Sep 15, 2016
Trang Tran, Mari Ostendorf

This work investigates style and topic aspects of language in online communities: looking at both utility as an identifier of the community and correlation with community reception of content. Style is characterized using a hybrid word and part-of-speech tag n-gram language model, while topic is represented using Latent Dirichlet Allocation. Experiments with several Reddit forums show that style is a better indicator of community identity than topic, even for communities organized around specific topics. Further, there is a positive correlation between the community reception to a contribution and the style similarity to that community, but not so for topic similarity.

* EMNLP 2016 

Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs

Feb 14, 2020
Jonas Rieger, Lars Koppers, Carsten Jentsch, Jörg Rahnenführer

For organizing large text corpora, topic modeling provides useful tools. A widely used method is Latent Dirichlet Allocation (LDA), a generative probabilistic model which models single texts in a collection of texts as mixtures of latent topics. The assignments of words to topics rely on initial values such that generally the outcome of LDA is not fully reproducible. In addition, the reassignment via Gibbs Sampling is based on conditional distributions, leading to different results in replicated runs on the same text data. This fact is often neglected in everyday practice. We aim to improve the reliability of LDA results. Therefore, we study the stability of LDA by comparing assignments from replicated runs. We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient. Using such similarities, topics can be clustered. A new pruning algorithm for hierarchical clustering results based on the idea that two LDA runs create pairs of similar topics is proposed. This approach leads to the new measure S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning) for quantifying the stability of LDA models. We discuss some characteristics of this measure and illustrate it with an application to real data consisting of newspaper articles from USA Today. Our results show that the measure S-CLOP is useful for assessing the stability of LDA models or any other topic modeling procedure that characterizes its topics by word distributions. Based on the newly proposed measure for LDA stability, we propose a method to increase the reliability and hence to improve the reproducibility of empirical findings based on topic modeling. This increase in reliability is obtained by running the LDA several times and taking as prototype the most representative run, that is, the LDA run with highest average similarity to all other runs.

* 16 pages, 2 figures 
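
The prototype-selection idea in the abstract above (run LDA several times, keep the run with the highest average similarity to all other runs) can be illustrated with a minimal sketch. This is not the paper's method: it uses the plain Jaccard coefficient on top-word sets and a simple best-match average between runs as stand-ins for the modified Jaccard coefficient and S-CLOP, and the names `jaccard`, `run_similarity`, and `select_prototype` are hypothetical.

```python
def jaccard(a, b):
    """Plain Jaccard coefficient of two word collections (assumption:
    the paper uses a modified variant, which is not reproduced here)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def run_similarity(run_a, run_b):
    """Stand-in similarity between two runs: for each topic, take its
    best Jaccard match in the other run, average, and symmetrize."""
    def one_way(x, y):
        return sum(max(jaccard(t, u) for u in y) for t in x) / len(x)
    return 0.5 * (one_way(run_a, run_b) + one_way(run_b, run_a))


def select_prototype(runs):
    """Return (index, averages): the run with the highest average
    similarity to all other runs, plus every run's average."""
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs to compare")
    averages = []
    for i in range(n):
        sims = [run_similarity(runs[i], runs[j]) for j in range(n) if j != i]
        averages.append(sum(sims) / len(sims))
    best = max(range(n), key=averages.__getitem__)
    return best, averages


# Each run is a list of topics; each topic is a list of its top words.
runs = [
    [["a", "b", "c"], ["d", "e", "f"]],
    [["a", "b", "c"], ["d", "e", "f"]],
    [["x", "y", "z"], ["d", "e", "g"]],
]
best, averages = select_prototype(runs)
```

In this toy input the first two runs agree exactly and the third is an outlier, so one of the first two is selected as the prototype.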

COVID-19 Vaccine and Social Media: Exploring Emotions and Discussions on Twitter

Jul 29, 2021
Amir Karami, Michael Zhu, Bailey Goldschmidt, Hannah R. Boyajieff, Mahdi M. Najafabadi

Public response to COVID-19 vaccines is the key success factor to control the COVID-19 pandemic. To understand the public response, there is a need to explore public opinion. Traditional surveys are expensive and time-consuming, address limited health topics, and obtain small-scale data. Twitter can provide a great opportunity to understand public opinion regarding COVID-19 vaccines. The current study proposes an approach using computational and human coding methods to collect and analyze a large number of tweets to provide a wider perspective on the COVID-19 vaccine. This study identifies the sentiment of tweets and their temporal trend, discovers major topics, compares topics of negative and non-negative tweets, and discloses top topics of negative and non-negative tweets. Our findings show that the negative sentiment regarding the COVID-19 vaccine had a decreasing trend between November 2020 and February 2021. We found Twitter users have discussed a wide range of topics from vaccination sites to the 2020 U.S. election between November 2020 and February 2021. The findings show that there was a significant difference between negative and non-negative tweets regarding the weight of most topics. Our results also indicate that the negative and non-negative tweets had different topic priorities and focuses.

On Cross-Dataset Generalization in Automatic Detection of Online Abuse

Nov 03, 2020
Isar Nejadgholi, Svetlana Kiritchenko

NLP research has attained high performance in abusive language detection as a supervised classification task. While in research settings training and test datasets are usually obtained from similar data samples, in practice systems are often applied to data that differ from the training set in topic and class distributions. Also, the ambiguity in class definitions inherent in this task aggravates the discrepancies between source and target datasets. We explore the topic bias and the task formulation bias in cross-dataset generalization. We show that the benign examples in the Wikipedia Detox dataset are biased towards platform-specific topics. We identify these examples using unsupervised topic modeling and manual inspection of topics' keywords. Removing these topics increases cross-dataset generalization, without reducing in-domain classification performance. For a robust dataset design, we suggest applying inexpensive unsupervised methods to inspect the collected data and downsize the non-generalizable content before manually annotating for class labels.

* 13 pages, 3 figures, accepted to WOAH-2020 (The 4th Workshop on Online Abuse and Harms) 

The Unfolding Structure of Arguments in Online Debates: The case of a No-Deal Brexit

Mar 09, 2021
Carlo Santagiustina, Massimo Warglien

In the last decade, political debates have progressively shifted to social media. Rhetorical devices employed by online actors and factions that operate in these debating arenas can be captured and analysed to conduct a statistical reading of societal controversies and their argumentation dynamics. In this paper, we propose a five-step methodology, to extract, categorize and explore the latent argumentation structures of online debates. Using Twitter data about a "no-deal" Brexit, we focus on the expected effects in case of materialisation of this event. First, we extract cause-effect claims contained in tweets using RegEx that exploit verbs related to Creation, Destruction and Causation. Second, we categorise extracted "no-deal" effects using a Structural Topic Model estimated on unigrams and bigrams. Third, we select controversial effect topics and explore within-topic argumentation differences between self-declared partisan user factions. We hence type topics using estimated covariate effects on topic propensities, then, using the topics correlation network, we study the topological structure of the debate to identify coherent topical constellations. Finally, we analyse the debate time dynamics and infer lead/follow relations among factions. Results show that the proposed methodology can be employed to perform a statistical rhetorics analysis of debates, and map the architecture of controversies across time. In particular, the "no-deal" Brexit debate is shown to have an assortative argumentation structure heavily characterized by factional constellations of arguments, as well as by polarized narrative frames invoked through verbs related to Creation and Destruction. Our findings highlight the benefits of implementing a systemic approach to the analysis of debates, which allows the unveiling of topical and factional dependencies between arguments employed in online debates.

* Main article (18 pages, 7 figures) & Supplementary material (25 pages, 7 figures) 
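
The first step described in the abstract above (extracting cause-effect claims from tweets with regular expressions built around causal verbs) might look roughly like the sketch below. The verb list, the modal-verb list, the pattern, and the function name `extract_claims` are all illustrative assumptions, not the expressions used by the authors.

```python
import re

# Hypothetical, very small verb list; the paper relies on verbs related
# to Creation, Destruction and Causation, which are not listed here.
CAUSAL_VERBS = ["cause", "create", "destroy", "lead to", "trigger", "damage"]

# cause: "no-deal" / "no deal", optionally followed by "Brexit"
# verb: one of the causal verbs, after a modal such as "would"
# effect: everything up to the next sentence end or hashtag
CLAIM_PATTERN = re.compile(
    r"\b(?P<cause>no[- ]deal(?:\s+brexit)?)\s+"
    r"(?:will|would|could|may|might)\s+"
    r"(?P<verb>" + "|".join(re.escape(v) for v in CAUSAL_VERBS) + r")\s+"
    r"(?P<effect>[^.!?#]+)",
    re.IGNORECASE,
)


def extract_claims(text):
    """Return (cause, verb, effect) triples found in a piece of text."""
    return [
        (m.group("cause"), m.group("verb").lower(), m.group("effect").strip())
        for m in CLAIM_PATTERN.finditer(text)
    ]


claims = extract_claims(
    "A no-deal Brexit would destroy thousands of jobs. Nice weather today."
)
```

The extracted effect strings are what a later step would feed into a topic model; that part is not sketched here.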

A Hybrid Citation Retrieval Algorithm for Evidence-based Clinical Knowledge Summarization: Combining Concept Extraction, Vector Similarity and Query Expansion for High Precision

Sep 06, 2016
Kalpana Raja, Andrew J Sauer, Ravi P Garg, Melanie R Klerer, Siddhartha R Jonnalagadda

Novel information retrieval methods to identify citations relevant to a clinical topic can overcome the knowledge gap existing between the primary literature (MEDLINE) and online clinical knowledge resources such as UpToDate. Searching the MEDLINE database directly or with query expansion methods returns a large number of citations that are not relevant to the query. The current study presents a citation retrieval system that retrieves citations for evidence-based clinical knowledge summarization. This approach combines query expansion, a concept-based screening algorithm, and concept-based vector similarity. We also propose an information extraction framework for automated concept (Population, Intervention, Comparison, and Disease) extraction. We evaluated our proposed system on all topics (as queries) available from UpToDate for two diseases, heart failure (HF) and atrial fibrillation (AFib). The system achieved an overall F-score of 41.2% on HF topics and 42.4% on AFib topics on a gold standard of citations available in UpToDate. This is significantly higher than both a query-expansion based baseline (F-score of 1.3% on HF and 2.2% on AFib) and a system that uses query expansion with disease hyponyms and journal names, concept-based screening, and term-based vector similarity (F-score of 37.5% on HF and 39.5% on AFib). Evaluating the system with top K relevant citations, where K is the number of citations in the gold standard, achieved a much higher overall F-score of 69.9% on HF topics and 75.1% on AFib topics. In addition, the system retrieved up to 18 new relevant citations per topic when tested on ten HF and six AFib clinical topics.
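
The top-K evaluation mentioned in the abstract above, with K set to the number of gold-standard citations, can be sketched as follows. This is a generic precision/recall/F-score computation under that reading of the abstract, not the authors' evaluation code; the function name `f_score_at_gold_k` and the citation identifiers are made up for illustration.

```python
def f_score_at_gold_k(ranked, gold):
    """Precision, recall and F-score of the top-K ranked citations,
    where K is the size of the gold-standard set."""
    gold = set(gold)
    k = len(gold)
    if k == 0:
        raise ValueError("gold standard must not be empty")
    top_k = ranked[:k]
    hits = sum(1 for citation in top_k if citation in gold)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / k
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical ranked output and gold standard for one topic.
ranked = ["p1", "p2", "p3", "p4", "p5"]
gold = {"p1", "p3", "p9"}
precision, recall, f_score = f_score_at_gold_k(ranked, gold)
```

When at least K citations are retrieved, precision and recall share the same denominator, so the three values coincide.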

Interacting with Non-Cooperative User: A New Paradigm for Proactive Dialogue Policy

Apr 07, 2022
Wenqiang Lei, Yao Zhang, Feifan Song, Hongru Liang, Jiaxin Mao, Jiancheng Lv, Zhenglu Yang, Tat-Seng Chua

A proactive dialogue system is able to lead the conversation to a goal topic and has promising potential in bargaining, persuasion, and negotiation. The current corpus-based learning manner limits its practical application in real-world scenarios. To this end, we contribute to advancing the study of the proactive dialogue policy to a more natural and challenging setting, i.e., interacting dynamically with users. Further, we call attention to non-cooperative user behavior -- the user talks about off-path topics when he/she is not satisfied with the previous topics introduced by the agent. We argue that the targets of reaching the goal topic quickly and maintaining a high user satisfaction do not always converge, because the topics close to the goal and the topics the user prefers may not be the same. To address this issue, we propose a new solution named I-Pro that can learn a Proactive policy in the Interactive setting. Specifically, we learn the trade-off via a learned goal weight, which consists of four factors (dialogue turn, goal completion difficulty, user satisfaction estimation, and cooperative degree). The experimental results demonstrate that I-Pro significantly outperforms baselines in terms of effectiveness and interpretability.

* Accepted to SIGIR 2022 

2020 U.S. Presidential Election: Analysis of Female and Male Users on Twitter

Aug 21, 2021
Amir Karami, Spring B. Clark, Anderson Mackenzie, Dorathea Lee, Michael Zhu, Hannah R. Boyajieff, Bailey Goldschmidt

Social media is commonly used by the public during election campaigns to express their opinions regarding different issues. Among various social media channels, Twitter provides an efficient platform for researchers and politicians to explore public opinion regarding a wide range of topics such as economy and foreign policy. Current literature mainly focuses on analyzing the content of tweets without considering the gender of users. This research collects and analyzes a large number of tweets and uses computational, human coding, and statistical analyses to identify topics in more than 300,000 tweets posted during the 2020 U.S. presidential election and to compare female and male users regarding the average weight of the topics. Our findings are based upon a wide range of topics, such as tax, climate change, and the COVID-19 pandemic. Out of the topics, there exists a significant difference between female and male users for more than 70% of topics. Our research approach can inform studies in the areas of informatics, politics, and communication, and it can be used by political campaigns to obtain a gender-based understanding of public opinion.

Your Stance is Exposed! Analysing Possible Factors for Stance Detection on Social Media

Aug 08, 2019
Abeer Aldayel, Walid Magdy

To what extent can a user's stance towards a given topic be inferred? Most studies on stance detection have focused on analysing a user's posts on a given topic to predict the stance. However, stance in social media can be inferred from a mixture of signals that might reflect the user's beliefs, including posts and online interactions. This paper examines various online features of users to detect their stance towards different topics. We compare multiple sets of features, including on-topic content, network interactions, user's preferences, and online network connections. Our objective is to understand the online signals that can reveal the users' stance. Experimentation is applied on the tweet dataset from the SemEval stance detection task, which covers five topics. Results show that the stance of a user can be detected with multiple signals of the user's online activity, including their posts on the topic, the network they interact with or follow, the websites they visit, and the content they like. The performance of stance modelling using different network features is comparable with the state-of-the-art reported model that used textual content only. In addition, combining network and content features leads to the highest reported performance to date on the SemEval dataset, with an F-measure of 72.49%. We further present an extensive analysis to show how these different sets of features can reveal stance. Our findings have distinct privacy implications: they highlight that stance is strongly embedded in a user's online social network, so that, in principle, individuals can be profiled from their interactions and connections even when they do not post about the topic.

* Accepted as a full paper at CSCW 2019. Please cite the CSCW version 