Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Keyword Assisted Embedded Topic Model

Nov 22, 2021
Bahareh Harandizadeh, J. Hunter Priniski, Fred Morstatter

By illuminating latent structures in a corpus of text, topic models are an essential tool for categorizing, summarizing, and exploring large collections of documents. Probabilistic topic models, such as latent Dirichlet allocation (LDA), describe how words in documents are generated via a set of latent distributions called topics. Recently, the Embedded Topic Model (ETM) has extended LDA to utilize the semantic information in word embeddings to derive semantically richer topics. As LDA and its extensions are unsupervised models, they aren't defined to make efficient use of a user's prior knowledge of the domain. To this end, we propose the Keyword Assisted Embedded Topic Model (KeyETM), which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors over the vocabulary. Using both quantitative metrics and human responses on a topic intrusion task, we demonstrate that KeyETM produces better topics than other guided, generative models in the literature.

* 8 pages, 5 figures, WSDM 2022 Conference 

  Access Paper or Ask Questions

Discovering Topical Interactions in Text-based Cascades using Hidden Markov Hawkes Processes

Sep 12, 2018
Srikanta Bedathur, Indrajit Bhattacharya, Jayesh Choudhari, Anirban Dasgupta

Social media conversations unfold based on complex interactions between users, topics and time. While recent models have been proposed to capture network strengths between users, users' topical preferences and temporal patterns between posting and response times, interaction patterns between topics has not been studied. We propose the Hidden Markov Hawkes Process (HMHP) that incorporates topical Markov Chains within Hawkes processes to jointly model topical interactions along with user-user and user-topic patterns. We propose a Gibbs sampling algorithm for HMHP that jointly infers the network strengths, diffusion paths, the topics of the posts as well as the topic-topic interactions. We show using experiments on real and semi-synthetic data that HMHP is able to generalize better and recover the network strengths, topics and diffusion paths more accurately than state-of-the-art baselines. More interestingly, HMHP finds insightful interactions between topics in real tweets which no existing model is able to do.

* Accepted as a short paper at ICDM-2018 

  Access Paper or Ask Questions

Source-LDA: Enhancing probabilistic topic models using prior knowledge sources

May 17, 2017
Justin Wood, Patrick Tan, Wei Wang, Corey Arnold

A popular approach to topic modeling involves extracting co-occurring n-grams of a corpus into semantic themes. The set of n-grams in a theme represents an underlying topic, but most topic modeling approaches are not able to label these sets of words with a single n-gram. Such labels are useful for topic identification in summarization systems. This paper introduces a novel approach to labeling a group of n-grams comprising an individual topic. The approach taken is to complement the existing topic distributions over words with a known distribution based on a predefined set of topics. This is done by integrating existing labeled knowledge sources representing known potential topics into the probabilistic topic model. These knowledge sources are translated into a distribution and used to set the hyperparameters of the Dirichlet generated distribution over words. In the inference these modified distributions guide the convergence of the latent topics to conform with the complementary distributions. This approach ensures that the topic inference process is consistent with existing knowledge. The label assignment from the complementary knowledge sources are then transferred to the latent topics of the corpus. The results show both accurate label assignment to topics as well as improved topic generation than those obtained using various labeling approaches based off Latent Dirichlet allocation (LDA).

  Access Paper or Ask Questions

Topic Modeling with Wasserstein Autoencoders

Jul 24, 2019
Feng Nan, Ran Ding, Ramesh Nallapati, Bing Xiang

We propose a novel neural topic model in the Wasserstein autoencoders (WAE) framework. Unlike existing variational autoencoder based models, we directly enforce Dirichlet prior on the latent document-topic vectors. We exploit the structure of the latent space and apply a suitable kernel in minimizing the Maximum Mean Discrepancy (MMD) to perform distribution matching. We discover that MMD performs much better than the Generative Adversarial Network (GAN) in matching high dimensional Dirichlet distribution. We further discover that incorporating randomness in the encoder output during training leads to significantly more coherent topics. To measure the diversity of the produced topics, we propose a simple topic uniqueness metric. Together with the widely used coherence measure NPMI, we offer a more wholistic evaluation of topic quality. Experiments on several real datasets show that our model produces significantly better topics than existing topic models.

* to appear at ACL 2019 

  Access Paper or Ask Questions

BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Mar 11, 2022
Maarten Grootendorst

Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approach topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representation through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embedding with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.

* BERTopic has a python implementation, see 

  Access Paper or Ask Questions

Topic Sensitive Neural Headline Generation

Aug 20, 2016
Lei Xu, Ziyun Wang, Ayana, Zhiyuan Liu, Maosong Sun

Neural models have recently been used in text summarization including headline generation. The model can be trained using a set of document-headline pairs. However, the model does not explicitly consider topical similarities and differences of documents. We suggest to categorizing documents into various topics so that documents within the same topic are similar in content and share similar summarization patterns. Taking advantage of topic information of documents, we propose topic sensitive neural headline generation model. Our model can generate more accurate summaries guided by document topics. We test our model on LCSTS dataset, and experiments show that our method outperforms other baselines on each topic and achieves the state-of-art performance.

  Access Paper or Ask Questions

Likelihood estimation of sparse topic distributions in topic models and its applications to Wasserstein document distance calculations

Jul 12, 2021
Xin Bing, Florentina Bunea, Seth Strimas-Mackey, Marten Wegkamp

This paper studies the estimation of high-dimensional, discrete, possibly sparse, mixture models in topic models. The data consists of observed multinomial counts of $p$ words across $n$ independent documents. In topic models, the $p\times n$ expected word frequency matrix is assumed to be factorized as a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $T$. Since columns of both matrices represent conditional probabilities belonging to probability simplices, columns of $A$ are viewed as $p$-dimensional mixture components that are common to all documents while columns of $T$ are viewed as the $K$-dimensional mixture weights that are document specific and are allowed to be sparse. The main interest is to provide sharp, finite sample, $\ell_1$-norm convergence rates for estimators of the mixture weights $T$ when $A$ is either known or unknown. For known $A$, we suggest MLE estimation of $T$. Our non-standard analysis of the MLE not only establishes its $\ell_1$ convergence rate, but reveals a remarkable property: the MLE, with no extra regularization, can be exactly sparse and contain the true zero pattern of $T$. We further show that the MLE is both minimax optimal and adaptive to the unknown sparsity in a large class of sparse topic distributions. When $A$ is unknown, we estimate $T$ by optimizing the likelihood function corresponding to a plug in, generic, estimator $\hat{A}$ of $A$. For any estimator $\hat{A}$ that satisfies carefully detailed conditions for proximity to $A$, the resulting estimator of $T$ is shown to retain the properties established for the MLE. The ambient dimensions $K$ and $p$ are allowed to grow with the sample sizes. Our application is to the estimation of 1-Wasserstein distances between document generating distributions. We propose, estimate and analyze new 1-Wasserstein distances between two probabilistic document representations.

  Access Paper or Ask Questions

Interactive Visual Exploration of Topic Models using Graphs

Nov 27, 2014
Samuel Rönnqvist, Xiaolu Wang, Peter Sarlin

Probabilistic topic modeling is a popular and powerful family of tools for uncovering thematic structure in large sets of unstructured text documents. While much attention has been directed towards the modeling algorithms and their various extensions, comparatively few studies have concerned how to present or visualize topic models in meaningful ways. In this paper, we present a novel design that uses graphs to visually communicate topic structure and meaning. By connecting topic nodes via descriptive keyterms, the graph representation reveals topic similarities, topic meaning and shared, ambiguous keyterms. At the same time, the graph can be used for information retrieval purposes, to find documents by topic or topic subsets. To exemplify the utility of the design, we illustrate its use for organizing and exploring corpora of financial patents.

* Online demo at appears in Proceedings of the 2014 Eurographics Conference on Visualization (EuroVis) 

  Access Paper or Ask Questions

CTM -- A Model for Large-Scale Multi-View Tweet Topic Classification

May 03, 2022
Vivek Kulkarni, Kenny Leung, Aria Haghighi

Automatically associating social media posts with topics is an important prerequisite for effective search and recommendation on many social media platforms. However, topic classification of such posts is quite challenging because of (a) a large topic space (b) short text with weak topical cues, and (c) multiple topic associations per post. In contrast to most prior work which only focuses on post classification into a small number of topics ($10$-$20$), we consider the task of large-scale topic classification in the context of Twitter where the topic space is $10$ times larger with potentially multiple topic associations per Tweet. We address the challenges above by proposing a novel neural model, CTM that (a) supports a large topic space of $300$ topics and (b) takes a holistic approach to tweet content modeling -- leveraging multi-modal content, author context, and deeper semantic cues in the Tweet. Our method offers an effective way to classify Tweets into topics at scale by yielding superior performance to other approaches (a relative lift of $\mathbf{20}\%$ in median average precision score) and has been successfully deployed in production at Twitter.

* 12 pages. 1 figure. NAACL Industry Track 

  Access Paper or Ask Questions

A Coefficient of Determination for Probabilistic Topic Models

Nov 26, 2019
Tommy Jones

This research proposes a new (old) metric for evaluating goodness of fit in topic models, the coefficient of determination, or $R^2$. Within the context of topic modeling, $R^2$ has the same interpretation that it does when used in a broader class of statistical models. Reporting $R^2$ with topic models addresses two current problems in topic modeling: a lack of standard cross-contextual evaluation metrics for topic modeling and ease of communication with lay audiences. The author proposes that $R^2$ should be reported as a standard metric when constructing topic models.

  Access Paper or Ask Questions