Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

ClimaText: A Dataset for Climate Change Topic Detection

Jan 02, 2021
Francesco S. Varini, Jordan Boyd-Graber, Massimiliano Ciaramita, Markus Leippold

Climate change communication in the mass media and other textual sources may affect and shape public perception. Extracting climate change information from these sources is an important task, e.g., for filtering content and e-discovery, sentiment analysis, automatic summarization, question-answering, and fact-checking. However, automating this process is a challenge, as climate change is a complex, fast-moving, and often ambiguous topic with scarce resources for popular text-based AI tasks. In this paper, we introduce \textsc{ClimaText}, a dataset for sentence-based climate change topic detection, which we make publicly available. We explore different approaches to identify the climate change topic in various text sources. We find that popular keyword-based models are not adequate for such a complex and evolving task. Context-based algorithms like BERT \cite{devlin2018bert} can detect, in addition to many trivial cases, a variety of complex and implicit topic patterns. Nevertheless, our analysis reveals a great potential for improvement in several directions, such as, e.g., capturing the discussion on indirect effects of climate change. Hence, we hope this work can serve as a good starting point for further research on this topic.

* Accepted for the Tackling Climate Change with Machine Learning Workshop at NeurIPS 2020 

  Access Paper or Ask Questions

Exploratory Analysis of Covid-19 Tweets using Topic Modeling, UMAP, and DiGraphs

May 06, 2020
Catherine Ordun, Sanjay Purushotham, Edward Raff

This paper illustrates five different techniques to assess the distinctiveness of topics, key terms and features, speed of information dissemination, and network behaviors for Covid19 tweets. First, we use pattern matching and second, topic modeling through Latent Dirichlet Allocation (LDA) to generate twenty different topics that discuss case spread, healthcare workers, and personal protective equipment (PPE). One topic specific to U.S. cases would start to uptick immediately after live White House Coronavirus Task Force briefings, implying that many Twitter users are paying attention to government announcements. We contribute machine learning methods not previously reported in the Covid19 Twitter literature. This includes our third method, Uniform Manifold Approximation and Projection (UMAP), that identifies unique clustering-behavior of distinct topics to improve our understanding of important themes in the corpus and help assess the quality of generated topics. Fourth, we calculated retweeting times to understand how fast information about Covid19 propagates on Twitter. Our analysis indicates that the median retweeting time of Covid19 for a sample corpus in March 2020 was 2.87 hours, approximately 50 minutes faster than repostings from Chinese social media about H7N9 in March 2013. Lastly, we sought to understand retweet cascades, by visualizing the connections of users over time from fast to slow retweeting. As the time to retweet increases, the density of connections also increase where in our sample, we found distinct users dominating the attention of Covid19 retweeters. One of the simplest highlights of this analysis is that early-stage descriptive methods like regular expressions can successfully identify high-level themes which were consistently verified as important through every subsequent analysis.

  Access Paper or Ask Questions

Supervised multi-specialist topic model with applications on large-scale electronic health record data

May 04, 2021
Ziyang Song, Xavier Sumba Toral, Yixin Xu, Aihua Liu, Liming Guo, Guido Powell, Aman Verma, David Buckeridge, Ariane Marelli, Yue Li

Motivation: Electronic health record (EHR) data provides a new venue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic data generative process of the EHR data needs to be modelled. We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) Congenital Heart Disease (CHD) diagnostic prediction among 154,775 patients; (2) Chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a mean to assess the disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. MixEHR-S source code and scripts of the experiments are freely available at

  Access Paper or Ask Questions

Recommendation System based on Semantic Scholar Mining and Topic modeling: A behavioral analysis of researchers from six conferences

Dec 20, 2018
Hamed Jelodar, Yongli Wang, Mahdi Rabbani, Ru-xin Zhao, Seyedvalyallah Ayobi, Peng Hu, Isma Masood

Recommendation systems have an important place to help online users in the internet society. Recommendation Systems in computer science are of very practical use these days in various aspects of the Internet portals, such as social networks, and library websites. There are several approaches to implement recommendation systems, Latent Dirichlet Allocation (LDA) is one the popular techniques in Topic Modeling. Recently, researchers have proposed many approaches based on Recommendation Systems and LDA. According to importance of the subject, in this paper we discover the trends of the topics and find relationship between LDA topics and Scholar-Context-documents. In fact, We apply probabilistic topic modeling based on Gibbs sampling algorithms for a semantic mining from six conference publications in computer science from DBLP dataset. According to our experimental results, our semantic framework can be effective to help organizations to better organize these conferences and cover future research topics.

  Access Paper or Ask Questions

Modeling community structure and topics in dynamic text networks

Aug 23, 2018
Teague Henry, David Banks, Christine Chai, Derek Owens-Oas

The last decade has seen great progress in both dynamic network modeling and topic modeling. This paper draws upon both areas to create a Bayesian method that allows topic discovery to inform the latent network model and the network structure to facilitate topic identification. We apply this method to the 467 top political blogs of 2012. Our results find complex community structure within this set of blogs, where community membership depends strongly upon the set of topics in which the blogger is interested.

* Accepted at Journal of Classification 

  Access Paper or Ask Questions

Incorporating Domain Knowledge To Improve Topic Segmentation Of Long MOOC Lecture Videos

Dec 08, 2020
Ananda Das, Partha Pratim Das

Topical Segmentation poses a great role in reducing search space of the topics taught in a lecture video specially when the video metadata lacks topic wise segmentation information. This segmentation information eases user efforts of searching, locating and browsing a topic inside a lecture video. In this work we propose an algorithm, that combines state-of-the art language model and domain knowledge graph for automatically detecting different coherent topics present inside a long lecture video. We use the language model on speech-to-text transcription to capture the implicit meaning of the whole video while the knowledge graph provides us the domain specific dependencies between different concepts of that subjects. Also leveraging the domain knowledge we can capture the way instructor binds and connects different concepts while teaching, which helps us in achieving better segmentation accuracy. We tested our approach on NPTEL lecture videos and holistic evaluation shows that it out performs the other methods described in the literature.

  Access Paper or Ask Questions

Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features

Jun 16, 2021
Tan Liu, Wu Guo, Bin Gu

Topic classification systems on spoken documents usually consist of two modules: an automatic speech recognition (ASR) module to convert speech into text and a text topic classification (TTC) module to predict the topic class from the decoded text. In this paper, instead of using the ASR transcripts, the fusion of deep acoustic and linguistic features is used for topic classification on spoken documents. More specifically, a conventional CTC-based acoustic model (AM) using phonemes as output units is first trained, and the outputs of the layer before the linear phoneme classifier in the trained AM are used as the deep acoustic features of spoken documents. Furthermore, these deep acoustic features are fed to a phoneme-to-word (P2W) module to obtain deep linguistic features. Finally, a local multi-head attention module is proposed to fuse these two types of deep features for topic classification. Experiments conducted on a subset selected from Switchboard corpus show that our proposed framework outperforms the conventional ASR+TTC systems and achieves a 3.13% improvement in ACC.

  Access Paper or Ask Questions

Classifying Idiomatic and Literal Expressions Using Topic Models and Intensity of Emotions

Feb 27, 2018
Jing Peng, Anna Feldman, Ekaterina Vylomova

We describe an algorithm for automatic classification of idiomatic and literal expressions. Our starting point is that words in a given text segment, such as a paragraph, that are highranking representatives of a common topic of discussion are less likely to be a part of an idiomatic expression. Our additional hypothesis is that contexts in which idioms occur, typically, are more affective and therefore, we incorporate a simple analysis of the intensity of the emotions expressed by the contexts. We investigate the bag of words topic representation of one to three paragraphs containing an expression that should be classified as idiomatic or literal (a target phrase). We extract topics from paragraphs containing idioms and from paragraphs containing literals using an unsupervised clustering method, Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Since idiomatic expressions exhibit the property of non-compositionality, we assume that they usually present different semantics than the words used in the local topic. We treat idioms as semantic outliers, and the identification of a semantic shift as outlier detection. Thus, this topic representation allows us to differentiate idioms from literals using local semantic contexts. Our results are encouraging.

* EMNLP 2014 

  Access Paper or Ask Questions

Understanding the Spatio-temporal Topic Dynamics of Covid-19 using Nonnegative Tensor Factorization: A Case Study

Sep 19, 2020
Thirunavukarasu Balasubramaniam, Richi Nayak, Md Abul Bashar

Social media platforms facilitate mankind a data-driven world by enabling billions of people to share their thoughts and activities ubiquitously. This huge collection of data, if analysed properly, can provide useful insights into people's behavior. More than ever, now is a crucial time under the Covid-19 pandemic to understand people's online behaviors detailing what topics are being discussed, and where (space) and when (time) they are discussed. Given the high complexity and poor quality of the huge social media data, an effective spatio-temporal topic detection method is needed. This paper proposes a tensor-based representation of social media data and Non-negative Tensor Factorization (NTF) to identify the topics discussed in social media data along with the spatio-temporal topic dynamics. A case study on Covid-19 related tweets from the Australia Twittersphere is presented to identify and visualize spatio-temporal topic dynamics on Covid-19

* Accepted in 18th Australasian Data Mining Conference (AusDM) 

  Access Paper or Ask Questions