Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

A Topic Coverage Approach to Evaluation of Topic Models

Dec 11, 2020
Damir Korenčić, Strahil Ristov, Jelena Repar, Jan Šnajder

When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. We investigate an approach to topic model evaluation based on measuring topic coverage, and propose measures of coverage based on matching between model topics and reference topics. We demonstrate the benefits of the approach by evaluating, in a series of experiments, different types of topic models on two distinct text domains. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the relation between coverage and other topic model evaluation methods. The contributions of the paper include the measures of coverage and the recommendations for the use of topic models for topic discovery.

  Access Paper or Ask Questions

Topic Scaling: A Joint Document Scaling -- Topic Model Approach To Learn Time-Specific Topics

Mar 31, 2021
Sami Diaf, Ulrich Fritsche

This paper proposes a new methodology to study sequential corpora by implementing a two-stage algorithm that learns time-based topics with respect to a scale of document positions and introduces the concept of Topic Scaling which ranks learned topics within the same document scale. The first stage ranks documents using Wordfish, a Poisson-based document scaling method, to estimate document positions that serve, in the second stage, as a dependent variable to learn relevant topics via a supervised Latent Dirichlet Allocation. This novelty brings two innovations in text mining as it explains document positions, whose scale is a latent variable, and ranks the inferred topics on the document scale to match their occurrences within the corpus and track their evolution. Tested on the U.S. State Of The Union two-party addresses, this inductive approach reveals that each party dominates one end of the learned scale with interchangeable transitions that follow the parties' term of office. Besides a demonstrated high accuracy in predicting in-sample documents' positions from topic scores, this method reveals further hidden topics that differentiate similar documents by increasing the number of learned topics to unfold potential nested hierarchical topic structures. Compared to other popular topic models, Topic Scaling learns topics with respect to document similarities without specifying a time frequency to learn topic evolution, thus capturing broader topic patterns than dynamic topic models and yielding more interpretable outputs than a plain latent Dirichlet allocation.

  Access Paper or Ask Questions

An Online Topic Modeling Framework with Topics Automatically Labeled

Jun 22, 2019
Fenglei Jin, Cuiyun Gao, Michael R. Lyu

In this paper, we propose a novel online topic tracking framework, named IEDL, for tracking the topic changes related to deep learning techniques on Stack Exchange and automatically interpreting each identified topic. The proposed framework combines the prior topic distributions in a time window during inferring the topics in current time slice, and introduces a new ranking scheme to select most representative phrases and sentences for the inferred topics in each time slice. Experiments on 7,076 Stack Exchange posts show the effectiveness of IEDL in tracking topic changes and labeling topics.

* 5 pages, 3 figures, ICML 

  Access Paper or Ask Questions

Topic Grouper: An Agglomerative Clustering Approach to Topic Modeling

Apr 13, 2019
Daniel Pfeifer, Jochen L. Leidner

We introduce Topic Grouper as a complementary approach in the field of probabilistic topic modeling. Topic Grouper creates a disjunctive partitioning of the training vocabulary in a stepwise manner such that resulting partitions represent topics. It is governed by a simple generative model, where the likelihood to generate the training documents via topics is optimized. The algorithm starts with one-word topics and joins two topics at every step. It therefore generates a solution for every desired number of topics ranging between the size of the training vocabulary and one. The process represents an agglomerative clustering that corresponds to a binary tree of topics. A resulting tree may act as a containment hierarchy, typically with more general topics towards the root of tree and more specific topics towards the leaves. Topic Grouper is not governed by a background distribution such as the Dirichlet and avoids hyper parameter optimizations. We show that Topic Grouper has reasonable predictive power and also a reasonable theoretical and practical complexity. Topic Grouper can deal well with stop words and function words and tends to push them into their own topics. Also, it can handle topic distributions, where some topics are more frequent than others. We present typical examples of computed topics from evaluation datasets, where topics appear conclusive and coherent. In this context, the fact that each word belongs to exactly one topic is not a major limitation; in some scenarios this can even be a genuine advantage, e.g.~a related shopping basket analysis may aid in optimizing groupings of articles in sales catalogs.

  Access Paper or Ask Questions

Is Stance Detection Topic-Independent and Cross-topic Generalizable? -- A Reproduction Study

Oct 14, 2021
Myrthe Reuver, Suzan Verberne, Roser Morante, Antske Fokkens

Cross-topic stance detection is the task to automatically detect stances (pro, against, or neutral) on unseen topics. We successfully reproduce state-of-the-art cross-topic stance detection work (Reimers et. al., 2019), and systematically analyze its reproducibility. Our attention then turns to the cross-topic aspect of this work, and the specificity of topics in terms of vocabulary and socio-cultural context. We ask: To what extent is stance detection topic-independent and generalizable across topics? We compare the model's performance on various unseen topics, and find topic (e.g. abortion, cloning), class (e.g. pro, con), and their interaction affecting the model's performance. We conclude that investigating performance on different topics, and addressing topic-specific vocabulary and context, is a future avenue for cross-topic stance detection.

* Accepted at the 8th Workshop on Argument Mining, 2021 co-located with EMNLP 2021. Cite the published version 

  Access Paper or Ask Questions

Competing Topic Naming Conventions in Quora: Predicting Appropriate Topic Merges and Winning Topics from Millions of Topic Pairs

Sep 10, 2019
Binny Mathew, Suman Kalyan Maity, Pawan Goyal, Animesh Mukherjee

Quora is a popular Q&A site which provides users with the ability to tag questions with multiple relevant topics which helps to attract quality answers. These topics are not predefined but user-defined conventions and it is not so rare to have multiple such conventions present in the Quora ecosystem describing exactly the same concept. In almost all such cases, users (or Quora moderators) manually merge the topic pair into one of the either topics, thus selecting one of the competing conventions. An important application for the site therefore is to identify such competing conventions early enough that should merge in future. In this paper, we propose a two-step approach that uniquely combines the anomaly detection and the supervised classification frameworks to predict whether two topics from among millions of topic pairs are indeed competing conventions, and should merge, achieving an F-score of 0.711. We also develop a model to predict the direction of the topic merge, i.e., the winning convention, achieving an F-score of 0.898. Our system is also able to predict ~ 25% of the correct case of merges within the first month of the merge and ~ 40% of the cases within a year. This is an encouraging result since Quora users on average take 936 days to identify such a correct merge. Human judgment experiments show that our system is able to predict almost all the correct cases that humans can predict plus 37.24% correct cases which the humans are not able to identify at all.

* 15 pages, 8 figures, 9 tables 

  Access Paper or Ask Questions

Topic representation: finding more representative words in topic models

Oct 23, 2018
Jinjin Chi, Jihong Ouyang, Changchun Li, Xueyang Dong, Ximing Li, Xinhua Wang

The top word list, i.e., the top-M words with highest marginal probability in a given topic, is the standard topic representation in topic models. Most of recent automatical topic labeling algorithms and popular topic quality metrics are based on it. However, we find, empirically, words in this type of top word list are not always representative. The objective of this paper is to find more representative top word lists for topics. To achieve this, we rerank the words in a given topic by further considering marginal probability on words over every other topic. The reranking list of top-M words is used to be a novel topic representation for topic models. We investigate three reranking methodologies, using (1) standard deviation weight, (2) standard deviation weight with topic size and (3) Chi Square \c{hi}2statistic selection. Experimental results on real world collections indicate that our representations can extract more representative words for topics, agreeing with human judgements.

* The paper has been submitted to Pattern Recognition Letters and is being reviewed 

  Access Paper or Ask Questions

HiTR: Hierarchical Topic Model Re-estimation for Measuring Topical Diversity of Documents

Oct 12, 2018
Hosein Azarbonyad, Mostafa Dehghani, Tom Kenter, Maarten Marx, Jaap Kamps, Maarten de Rijke

A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three distributions for assessing the diversity of documents: distributions of words within documents, words within topics, and topics within documents. Topic models play a central role in this approach and, hence, their quality is crucial to the efficacy of measuring topical diversity. The quality of topic models is affected by two causes: generality and impurity of topics. General topics only include common information of a background corpus and are assigned to most of the documents. Impure topics contain words that are not related to the topic. Impurity lowers the interpretability of topic models. Impure topics are likely to get assigned to documents erroneously. We propose a hierarchical re-estimation process aimed at removing generality and impurity. Our approach has three re-estimation components: (1) document re-estimation, which removes general words from the documents; (2) topic re-estimation, which re-estimates the distribution over words of each topic; and (3) topic assignment re-estimation, which re-estimates for each document its distributions over topics. For measuring topical diversity of text documents, our HiTR approach improves over the state-of-the-art measured on PubMed dataset.

* IEEE Transactions on Knowledge and Data Engineering 

  Access Paper or Ask Questions

Topic Graph Generation for Query Navigation: Use of Frequency Classes for Topic Extraction

Dec 12, 1997
Yoshiki Niwa, Shingo Nishioka, Makoto Iwayama, Akihiko Takano, Yoshihiko Nitta

To make an interactive guidance mechanism for document retrieval systems, we developed a user-interface which presents users the visualized map of topics at each stage of retrieval process. Topic words are automatically extracted by frequency analysis and the strength of the relationships between topic words is measured by their co-occurrence. A major factor affecting a user's impression of a given topic word graph is the balance between common topic words and specific topic words. By using frequency classes for topic word extraction, we made it possible to select well-balanced set of topic words, and to adjust the balance of common and specific topic words.

* Proceedings of NLPRS'97, Natural Language Processing Pacific Rim Symposium '97, pages 95-100, Phuket, Thailand, Dec. 1997 
* 6 pages, 3 figures 

  Access Paper or Ask Questions

How Many Topics? Stability Analysis for Topic Models

Jun 19, 2014
Derek Greene, Derek O'Callaghan, Pádraig Cunningham

Topic modeling refers to the task of discovering the underlying thematic structure in a text corpus, where the output is commonly presented as a report of the top terms appearing in each topic. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the "over-clustering" of a corpus into many small, highly-similar topics. In this paper, we propose a term-centric stability analysis strategy to address this issue, the idea being that a model with an appropriate number of topics will be more robust to perturbations in the data. Using a topic modeling approach based on matrix factorization, evaluations performed on a range of corpora show that this strategy can successfully guide the model selection process.

* Improve readability of plots. Add minor clarifications 

  Access Paper or Ask Questions