"Topic": models, code, and papers

The Author-Topic Model for Authors and Documents

Jul 11, 2012
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth

We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts. Exact inference is intractable for these datasets and we use Gibbs sampling to estimate the topic and author distributions. We compare the performance with two other generative models for documents, which are special cases of the author-topic model: LDA (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. We show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output.

* Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004) 
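
The generative process described in this abstract can be illustrated with a small sketch. It assumes that each word's author is drawn uniformly from the document's author list, which is one way to realize the "mixture of the distributions associated with the authors". The function name `generate_document`, the toy authors, topics, and probabilities are all hypothetical and not taken from the paper.

```python
import random

def generate_document(doc_authors, author_topic, topic_word, n_words, rng):
    """Sample n_words tokens; return a list of (author, topic, word) triples."""
    tokens = []
    for _ in range(n_words):
        # Assumption: the responsible author is chosen uniformly per word.
        author = rng.choice(doc_authors)
        # Draw a topic from that author's distribution over topics.
        topics, topic_probs = zip(*author_topic[author].items())
        topic = rng.choices(topics, weights=topic_probs, k=1)[0]
        # Draw a word from that topic's distribution over words.
        words, word_probs = zip(*topic_word[topic].items())
        word = rng.choices(words, weights=word_probs, k=1)[0]
        tokens.append((author, topic, word))
    return tokens

# Toy parameters (hypothetical values for illustration only).
author_topic = {
    "alice": {"inference": 1.0, "vision": 0.0},
    "bob": {"inference": 0.3, "vision": 0.7},
}
topic_word = {
    "inference": {"gibbs": 0.5, "sampling": 0.5},
    "vision": {"image": 0.6, "pixel": 0.4},
}

doc = generate_document(["alice", "bob"], author_topic, topic_word,
                        n_words=50, rng=random.Random(0))
```

In this toy setup, every token attributed to "alice" comes from the "inference" topic, while tokens attributed to "bob" mix both topics, so the document as a whole mixes the two authors' topic distributions.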

Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics

Apr 21, 2022
Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad

Recent work incorporates pre-trained word embeddings such as BERT embeddings into Neural Topic Models (NTMs), generating highly coherent topics. However, with high-quality contextualized document representations, do we really need sophisticated neural models to obtain coherent and interpretable topics? In this paper, we conduct thorough experiments showing that directly clustering high-quality sentence embeddings with an appropriate word selecting method can generate more coherent and diverse topics than NTMs, achieving also higher efficiency and simplicity.

* Accepted by NAACL 2022 

Automatic Evaluation of Local Topic Quality

May 18, 2019
Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Courtni Byun, Jordan Boyd-Graber, Kevin Seppi

Topic models are typically evaluated with respect to the global topic distributions that they generate, using metrics such as coherence, but without regard to local (token-level) topic assignments. Token-level assignments are important for downstream tasks such as classification. Even recent models, which aim to improve the quality of these token-level topic assignments, have been evaluated only with respect to global metrics. We propose a task designed to elicit human judgments of token-level topic assignments. We use a variety of topic model types and parameters and discover that global metrics agree poorly with human assignments. Since human evaluation is expensive we propose a variety of automated metrics to evaluate topic models at a local level. Finally, we correlate our proposed metrics with human judgments from the task on several datasets. We show that an evaluation based on the percent of topic switches correlates most strongly with human judgment of local topic quality. We suggest that this new metric, which we call consistency, be adopted alongside global metrics such as topic coherence when evaluating new topic models.

* 8 pages, 4 figures, 3 tables 
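
As a rough illustration of a measure based on the percent of topic switches, the sketch below computes the fraction of adjacent token pairs whose assigned topics differ. The function name `switch_rate` and the input layout (one list of topic ids per document) are assumptions. The abstract does not give the exact normalization or how this quantity maps onto the consistency score, so this is not the authors' definition.

```python
def switch_rate(doc_assignments):
    """Fraction of adjacent token pairs whose topic assignments differ.

    doc_assignments: list of documents, each a list of per-token topic ids.
    Pairs are counted within documents only, never across document boundaries.
    """
    switches = 0
    pairs = 0
    for topics in doc_assignments:
        for prev, cur in zip(topics, topics[1:]):
            pairs += 1
            if prev != cur:
                switches += 1
    # With no adjacent pairs at all, report 0.0 rather than dividing by zero.
    return switches / pairs if pairs else 0.0

rate = switch_rate([[0, 0, 1, 1], [2, 2]])  # 1 switch out of 4 pairs
```

Under this reading, a lower switch rate means neighbouring tokens tend to share a topic.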

Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time

May 01, 2018
Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, Bernt Andrassy

Dynamic topic modeling facilitates the identification of topical trends over time in temporal collections of unstructured documents. We introduce a novel unsupervised neural dynamic topic model named the Recurrent Neural Network-Replicated Softmax Model (RNN-RSM), where the discovered topics at each time step influence the topic discovery in the subsequent time steps. We account for the temporal ordering of documents by explicitly modeling a joint distribution of latent topical dependencies over time, using distributional estimators with temporal recurrent connections. Applying RNN-RSM to 19 years of articles on NLP research, we demonstrate that compared to state-of-the-art topic models, RNN-RSM shows better generalization, topic interpretation, evolution and trends. We also introduce a metric (named SPAN) to quantify the capability of a dynamic topic model to capture word evolution in topics over time.

* In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018) 

Less is More: Learning Prominent and Diverse Topics for Data Summarization

Dec 01, 2016
Jian Tang, Cheng Li, Ming Zhang, Qiaozhu Mei

Statistical topic models efficiently facilitate the exploration of large-scale data sets. Many models have been developed and broadly used to summarize the semantic structure in news, science, social media, and digital humanities. However, a common and practical objective in data exploration tasks is not to enumerate all existing topics, but to quickly extract representative ones that broadly cover the content of the corpus, i.e., a few topics that serve as a good summary of the data. Most existing topic models fit exactly the number of topics that a user specifies, which imposes an unnecessary burden on users who have limited prior knowledge. We instead propose new models that are able to learn fewer but more representative topics for the purpose of data summarization. We propose a reinforced random walk that allows prominent topics to absorb tokens from similar and smaller topics, thus enhancing the diversity among the top topics extracted. With this reinforced random walk as a general process embedded in classical topic models, we obtain diverse topic models that are able to extract the most prominent and diverse topics from data. The inference procedures of these diverse topic models remain as simple and efficient as the classical models. Experimental results demonstrate that the diverse topic models not only discover topics that better summarize the data, but also require minimal prior knowledge from the users.

Semiparametric Latent Topic Modeling on Consumer-Generated Corpora

Jul 13, 2021
Dominic B. Dayta, Erniel B. Barrios

Legacy procedures for topic modelling have generally suffered from overfitting and from weakness in reconstructing sparse topic structures. Motivated by consumer-generated corpora, this paper proposes a semiparametric topic model, a two-step approach utilizing nonnegative matrix factorization and semiparametric regression in topic modeling. The model enables the reconstruction of sparse topic structures in the corpus and provides a generative model for predicting topics in new documents entering the corpus. Assuming the presence of auxiliary information related to the topics, this approach exhibits better performance in discovering underlying topic structures in cases where the corpora are small and limited in vocabulary. In an actual consumer feedback corpus, the model also demonstrably provides interpretable and useful topic definitions comparable with those produced by other methods.

TopicsRanksDC: Distance-based Topic Ranking applied on Two-Class Data

May 17, 2021
Malik Yousef, Jamal Al Qundus, Silvio Peikert, Adrian Paschke

In this paper, we introduce a novel approach named TopicsRanksDC for ranking topics based on the distance between two clusters that are generated by each topic. We assume that our data consists of text documents that are associated with two classes. Our approach ranks each topic contained in these text documents by its significance for separating the two classes. First, the algorithm detects topics using Latent Dirichlet Allocation (LDA). The words defining each topic are represented as two clusters, where each one is associated with one of the classes. We compute four distance metrics: Single Linkage, Complete Linkage, Average Linkage, and the distance between the centroids. We compare the results of LDA topics and random topics. The results show that the rank for LDA topics is much higher than for random topics. The results of the TopicsRanksDC tool are promising for future work to enable search engines to suggest related topics.

* International Conference on Database and Expert Systems Applications DEXA 2020: Database and Expert Systems Applications pp 11-21 
* 10 pages, 5 figures 
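
The four cluster distances named in this abstract can be sketched as follows. The sketch assumes that each cluster is a list of equal-length numeric vectors and that Euclidean distance is the base metric. The abstract does not specify how words are turned into points or which base metric is used, so the representation and the function name `cluster_distances` are hypothetical.

```python
import math
from itertools import product

def cluster_distances(cluster_a, cluster_b):
    """Return single, complete, average, and centroid distances between clusters."""
    # All pairwise Euclidean distances between a point of A and a point of B.
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

    def centroid(points):
        # Coordinate-wise mean of the points in one cluster.
        return [sum(coords) / len(points) for coords in zip(*points)]

    return {
        "single": min(pairwise),                     # closest pair
        "complete": max(pairwise),                   # farthest pair
        "average": sum(pairwise) / len(pairwise),    # mean over all pairs
        "centroid": math.dist(centroid(cluster_a), centroid(cluster_b)),
    }

# Toy 2-D clusters (hypothetical data).
d = cluster_distances([(0, 0), (0, 2)], [(3, 0), (3, 2)])
```

A topic whose two class-specific clusters lie far apart under these distances would, under the approach described, be ranked as more useful for separating the classes.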

Topic-based Evaluation for Conversational Bots

Jan 11, 2018
Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, Ashwin Ram

Dialog evaluation is a challenging problem, especially for non task-oriented dialogs where conversational success is not well-defined. We propose to evaluate dialog quality using topic-based metrics that describe the ability of a conversational bot to sustain coherent and engaging conversations on a topic, and the diversity of topics that a bot can handle. To detect conversation topics per utterance, we adopt Deep Average Networks (DAN) and train a topic classifier on a variety of question and query data categorized into multiple topics. We propose a novel extension to DAN by adding a topic-word attention table that allows the system to jointly capture topic keywords in an utterance and perform topic classification. We compare our proposed topic based metrics with the ratings provided by users and show that our metrics both correlate with and complement human judgment. Our analysis is performed on tens of thousands of real human-bot dialogs from the Alexa Prize competition and highlights user expectations for conversational bots.

* Nips.Workshop.ConversationalAI 2017-12-08 
* 10 Pages, 2 figures, 9 tables. NIPS 2017 Conversational AI workshop paper. 

Neural Embedding Allocation: Distributed Representations of Topic Models

Sep 10, 2019
Kamrun Naher Keya, Yannis Papanikolaou, James R. Foulds

Word embedding models such as the skip-gram learn vector representations of words' semantic relationships, and document embedding models learn similar representations for documents. On the other hand, topic models provide latent representations of the documents' topical themes. To get the benefits of these representations simultaneously, we propose a unifying algorithm, called neural embedding allocation (NEA), which deconstructs topic models into interpretable vector-space embeddings of words, topics, documents, authors, and so on, by learning neural embeddings to mimic the topic models. We showcase NEA's effectiveness and generality on LDA, author-topic models and the recently proposed mixed membership skip gram topic model and achieve better performance with the embeddings compared to several state-of-the-art models. Furthermore, we demonstrate that using NEA to smooth out the topics improves coherence scores over the original topic models when the number of topics is large.

Analysis of Computational Science Papers from ICCS 2001-2016 using Topic Modeling and Graph Theory

Apr 18, 2017
Tesfamariam M. Abuhay, Sergey V. Kovalchuk, Klavdiya O. Bochenina, George Kampis, Valeria V. Krzhizhanovskaya, Michael H. Lees

This paper presents results of topic modeling and network models of topics using the International Conference on Computational Science corpus, which contains domain-specific (computational science) papers over sixteen years (a total of 5695 papers). We discuss topical structures of International Conference on Computational Science, how these topics evolve over time in response to the topicality of various problems, technologies and methods, and how all these topics relate to one another. This analysis illustrates multidisciplinary research and collaborations among scientific communities, by constructing static and dynamic networks from the topic modeling results and the keywords of authors. The results of this study give insights about the past and future trends of core discussion topics in computational science. We used the Non-negative Matrix Factorization topic modeling algorithm to discover topics and labeled and grouped results hierarchically.

* Accepted by International Conference on Computational Science (ICCS) 2017 which will be held in Zurich, Switzerland from June 11-June 14 