Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Enhance Topics Analysis based on Keywords Properties

Mar 09, 2022
Antonio Penta

Topic Modelling is one of the most prevalent text analysis technique used to explore and retrieve collection of documents. The evaluation of the topic model algorithms is still a very challenging tasks due to the absence of gold-standard list of topics to compare against for every corpus. In this work, we present a specificity score based on keywords properties that is able to select the most informative topics. This approach helps the user to focus on the most informative topics. In the experiments, we show that we are able to compress the state-of-the-art topic modelling results of different factors with an information loss that is much lower than the solution based on the recent coherence score presented in literature.

  Access Paper or Ask Questions

Viewpoint and Topic Modeling of Current Events

Aug 14, 2016
Kerry Zhang, Jussi Karlgren, Cheng Zhang, Jens Lagergren

There are multiple sides to every story, and while statistical topic models have been highly successful at topically summarizing the stories in corpora of text documents, they do not explicitly address the issue of learning the different sides, the viewpoints, expressed in the documents. In this paper, we show how these viewpoints can be learned completely unsupervised and represented in a human interpretable form. We use a novel approach of applying CorrLDA2 for this purpose, which learns topic-viewpoint relations that can be used to form groups of topics, where each group represents a viewpoint. A corpus of documents about the Israeli-Palestinian conflict is then used to demonstrate how a Palestinian and an Israeli viewpoint can be learned. By leveraging the magnitudes and signs of the feature weights of a linear SVM, we introduce a principled method to evaluate associations between topics and viewpoints. With this, we demonstrate, both quantitatively and qualitatively, that the learned topic groups are contextually coherent, and form consistently correct topic-viewpoint associations.

* 16 pages, 4 figures, 4 tables 

  Access Paper or Ask Questions

Transfer Topic Labeling with Domain-Specific Knowledge Base: An Analysis of UK House of Commons Speeches 1935-2014

Aug 27, 2018
Alexander Herzog, Peter John, Slava Jankin Mikhaylov

Topic models are widely used in natural language processing, allowing researchers to estimate the underlying themes in a collection of documents. Most topic models use unsupervised methods and hence require the additional step of attaching meaningful labels to estimated topics. This process of manual labeling is not scalable and suffers from human bias. We present a semi-automatic transfer topic labeling method that seeks to remedy these problems. Domain-specific codebooks form the knowledge-base for automated topic labeling. We demonstrate our approach with a dynamic topic model analysis of the complete corpus of UK House of Commons speeches 1935-2014, using the coding instructions of the Comparative Agendas Project to label topics. We show that our method works well for a majority of the topics we estimate; but we also find that institution-specific topics, in particular on subnational governance, require manual input. We validate our results using human expert coding.

  Access Paper or Ask Questions

Topic Analysis of Superconductivity Literature by Semantic Non-negative Matrix Factorization

Dec 01, 2021
Valentin Stanev, Erik Skau, Ichiro Takeuchi, Boian S. Alexandrov

We utilize a recently developed topic modeling method called SeNMFk, extending the standard Non-negative Matrix Factorization (NMF) methods by incorporating the semantic structure of the text, and adding a robust system for determining the number of topics. With SeNMFk, we were able to extract coherent topics validated by human experts. From these topics, a few are relatively general and cover broad concepts, while the majority can be precisely mapped to specific scientific effects or measurement techniques. The topics also differ by ubiquity, with only three topics prevalent in almost 40 percent of the abstract, while each specific topic tends to dominate a small subset of the abstracts. These results demonstrate the ability of SeNMFk to produce a layered and nuanced analysis of large scientific corpora.

  Access Paper or Ask Questions

Topical Language Generation using Transformers

Mar 11, 2021
Rohola Zandie, Mohammad H. Mahoor

Large-scale transformer-based language models (LMs) demonstrate impressive capabilities in open text generation. However, controlling the generated text's properties such as the topic, style, and sentiment is challenging and often requires significant changes to the model architecture or retraining and fine-tuning the model on new supervised data. This paper presents a novel approach for Topical Language Generation (TLG) by combining a pre-trained LM with topic modeling information. We cast the problem using Bayesian probability formulation with topic probabilities as a prior, LM probabilities as the likelihood, and topical language generation probability as the posterior. In learning the model, we derive the topic probability distribution from the user-provided document's natural structure. Furthermore, we extend our model by introducing new parameters and functions to influence the quantity of the topical features presented in the generated text. This feature would allow us to easily control the topical properties of the generated text. Our experimental results demonstrate that our model outperforms the state-of-the-art results on coherency, diversity, and fluency while being faster in decoding.

* Accepted in the Journal of Natural Language Engineering 

  Access Paper or Ask Questions

Prediction Focused Topic Models via Vocab Selection

Oct 12, 2019
Jason Ren, Russell Kunes, Finale Doshi-Velez

Supervised topic models are often sought to balance prediction quality and interpretability. However, when models are (inevitably) misspecified, standard approaches rarely deliver on both. We introduce a novel approach, the prediction-focused topic model, that uses the supervisory signal to retain only vocabulary terms that improve, or do not hinder, prediction performance. By removing terms with irrelevant signal, the topic model is able to learn task-relevant, interpretable topics. We demonstrate on several data sets that compared to existing approaches, prediction-focused topic models are able to learn much more coherent topics while maintaining competitive predictions.

* Extended Abstract to appear in ML4Health and HCML Neurips Workshops, 8 pages 

  Access Paper or Ask Questions

Topic Modeling Using Distributed Word Embeddings

Mar 15, 2016
Ramandeep S Randhawa, Parag Jain, Gagan Madan

We propose a new algorithm for topic modeling, Vec2Topic, that identifies the main topics in a corpus using semantic information captured via high-dimensional distributed word embeddings. Our technique is unsupervised and generates a list of topics ranked with respect to importance. We find that it works better than existing topic modeling techniques such as Latent Dirichlet Allocation for identifying key topics in user-generated content, such as emails, chats, etc., where topics are diffused across the corpus. We also find that Vec2Topic works equally well for non-user generated content, such as papers, reports, etc., and for small corpora such as a single-document.

  Access Paper or Ask Questions

TACAM: Topic And Context Aware Argument Mining

May 26, 2019
Michael Fromm, Evgeniy Faerman, Thomas Seidl

In this work we address the problem of argument search. The purpose of argument search is the distillation of pro and contra arguments for requested topics from large text corpora. In previous works, the usual approach is to use a standard search engine to extract text parts which are relevant to the given topic and subsequently use an argument recognition algorithm to select arguments from them. The main challenge in the argument recognition task, which is also known as argument mining, is that often sentences containing arguments are structurally similar to purely informative sentences without any stance about the topic. In fact, they only differ semantically. Most approaches use topic or search term information only for the first search step and therefore assume that arguments can be classified independently of a topic. We argue that topic information is crucial for argument mining, since the topic defines the semantic context of an argument. Precisely, we propose different models for the classification of arguments, which take information about a topic of an argument into account. Moreover, to enrich the context of a topic and to let models understand the context of the potential argument better, we integrate information from different external sources such as Knowledge Graphs or pre-trained NLP models. Our evaluation shows that considering topic information, especially in connection with external information, provides a significant performance boost for the argument mining task.

  Access Paper or Ask Questions

A Gamma-Poisson Mixture Topic Model for Short Text

Apr 23, 2020
Jocelyn Mazarura, Alta de Waal, Pieter de Villiers

Most topic models are constructed under the assumption that documents follow a multinomial distribution. The Poisson distribution is an alternative distribution to describe the probability of count data. For topic modelling, the Poisson distribution describes the number of occurrences of a word in documents of fixed length. The Poisson distribution has been successfully applied in text classification, but its application to topic modelling is not well documented, specifically in the context of a generative probabilistic model. Furthermore, the few Poisson topic models in literature are admixture models, making the assumption that a document is generated from a mixture of topics. In this study, we focus on short text. Many studies have shown that the simpler assumption of a mixture model fits short text better. With mixture models, as opposed to admixture models, the generative assumption is that a document is generated from a single topic. One topic model, which makes this one-topic-per-document assumption, is the Dirichlet-multinomial mixture model. The main contributions of this work are a new Gamma-Poisson mixture model, as well as a collapsed Gibbs sampler for the model. The benefit of the collapsed Gibbs sampler derivation is that the model is able to automatically select the number of topics contained in the corpus. The results show that the Gamma-Poisson mixture model performs better than the Dirichlet-multinomial mixture model at selecting the number of topics in labelled corpora. Furthermore, the Gamma-Poisson mixture produces better topic coherence scores than the Dirichlet-multinomial mixture model, thus making it a viable option for the challenging task of topic modelling of short text.

* 26 pages, 14 Figures, to be published in Mathematical Problems in Engineering 

  Access Paper or Ask Questions

Response Selection with Topic Clues for Retrieval-based Chatbots

Sep 22, 2016
Yu Wu, Wei Wu, Zhoujun Li, Ming Zhou

We consider incorporating topic information into message-response matching to boost responses with rich content in retrieval-based chatbots. To this end, we propose a topic-aware convolutional neural tensor network (TACNTN). In TACNTN, matching between a message and a response is not only conducted between a message vector and a response vector generated by convolutional neural networks, but also leverages extra topic information encoded in two topic vectors. The two topic vectors are linear combinations of topic words of the message and the response respectively, where the topic words are obtained from a pre-trained LDA model and their weights are determined by themselves as well as the message vector and the response vector. The message vector, the response vector, and the two topic vectors are fed to neural tensors to calculate a matching score. Empirical study on a public data set and a human annotated data set shows that TACNTN can significantly outperform state-of-the-art methods for message-response matching.

* under reviewed of AAAI 2017 

  Access Paper or Ask Questions