"Topic": models, code, and papers

Text Modeling using Unsupervised Topic Models and Concept Hierarchies

Aug 07, 2008
Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, on the other hand, tend to be semantically richer due to careful selection of words to define concepts but they tend not to cover the themes in a data set exhaustively. In this paper, we propose a probabilistic framework to combine a hierarchy of human-defined semantic concepts with statistical topic models to seek the best of both worlds. Experimental results using two different sources of concept hierarchies and two collections of text documents indicate that this combination leads to systematic improvements in the quality of the associated language models as well as enabling new techniques for inferring and visualizing the semantics of a document.

Spectral Learning for Supervised Topic Models

Feb 19, 2016
Yong Ren, Yining Wang, Jun Zhu

Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on variational approximation or Monte Carlo sampling, which often suffer from local minima. Spectral methods have been applied to learn unsupervised topic models, such as latent Dirichlet allocation (LDA), with provable guarantees. This paper investigates the possibility of applying spectral methods to recover the parameters of supervised LDA (sLDA). We first present a two-stage spectral method, which recovers the parameters of LDA and then applies a power update method to recover the regression model parameters. Then, we further present a single-phase spectral algorithm to jointly recover the topic distribution matrix as well as the regression weights. Our spectral algorithms are provably correct and computationally efficient. We prove a sample complexity bound for each algorithm and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the spectral algorithms. In fact, our results on a large-scale review rating dataset demonstrate that our single-phase spectral algorithm alone achieves comparable or even better performance than state-of-the-art methods, while previous work on spectral methods has rarely reported such promising performance.

A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation

May 23, 2013
Yin Zheng, Yu-Jin Zhang, Hugo Larochelle

Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to perform scene recognition and annotation. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for document modeling. In this work, we show how to successfully apply and extend this model to the context of visual scene modeling. Specifically, we propose SupDocNADE, a supervised extension of DocNADE, that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model. We also describe how to leverage information about the spatial position of the visual words and how to embed additional image annotations, so as to simultaneously perform image classification and annotation. We test our model on the Scene15, LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA.

* 13 pages, 5 figures 

MetaLDA: a Topic Model that Efficiently Incorporates Meta information

Sep 19, 2017
He Zhao, Lan Du, Wray Buntine, Gang Liu

Besides the text content, documents and their associated words usually come with rich sets of meta information, such as categories of documents and semantic/syntactic features of words, like those encoded in word embeddings. Incorporating such meta information directly into the generative process of topic models can improve modelling accuracy and topic quality, especially in the case where the word-occurrence information in the training data is insufficient. In this paper, we present a topic model, called MetaLDA, which is able to leverage either document or word meta information, or both of them jointly. With two data augmentation techniques, we can derive an efficient Gibbs sampling algorithm, which benefits from the fully local conjugacy of the model. Moreover, the algorithm is favoured by the sparsity of the meta information. Extensive experiments on several real world datasets demonstrate that our model achieves comparable or improved performance in terms of both perplexity and topic quality, particularly in handling sparse texts. In addition, compared with other models using meta information, our model runs significantly faster.

* To appear in ICDM 2017 

Text Network Exploration via Heterogeneous Web of Topics

Oct 02, 2016
Junxian He, Ying Huang, Changfeng Liu, Jiaming Shen, Yuting Jia, Xinbing Wang

A text network refers to a data type in which each vertex is associated with a text document and the relationships between documents are represented by edges. The proliferation of text networks such as hyperlinked webpages and academic citation networks has led to an increasing demand for quickly developing a general sense of a new text network, namely text network exploration. In this paper, we address the problem of text network exploration by constructing a heterogeneous web of topics, which allows people to investigate a text network by associating the word level with the document level. To achieve this, a probabilistic generative model for text and links is proposed, where three different relationships in the heterogeneous topic web are quantified. We also develop a prototype demo system named TopicAtlas to exhibit such a heterogeneous topic web, and demonstrate how this system can facilitate the task of text network exploration. Extensive qualitative analyses are included to verify the effectiveness of this heterogeneous topic web. In addition, we validate our model on real-life text networks, showing that it maintains good performance on objective evaluation metrics.

* 8 pages 

Topic Space Trajectories: A case study on machine learning literature

Oct 26, 2020
Bastian Schäfermeier, Gerd Stumme, Tom Hanika

The annual number of publications at scientific venues, for example, conferences and journals, is growing quickly. Hence, even for researchers it becomes harder and harder to keep track of research topics and their progress. In this task, researchers can be supported by automated publication analysis. Yet, many such methods result in uninterpretable, purely numerical representations. As an attempt to support human analysts, we present topic space trajectories, a structure that allows for the comprehensible tracking of research topics. We demonstrate how these trajectories can be interpreted based on eight different analysis approaches. To obtain comprehensible results, we employ non-negative matrix factorization as well as suitable visualization techniques. We show the applicability of our approach on a publication corpus spanning 50 years of machine learning research from 32 publication venues. Our novel analysis method may be employed for paper classification, for the prediction of future research topics, and for the recommendation of fitting conferences and journals for submitting unpublished work.

* 36 pages, 8 figures 
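
The abstract names non-negative matrix factorization as the tool for obtaining interpretable topics. As a rough illustration only, the sketch below factorizes a tiny document-term matrix V into non-negative factors W (documents by topics) and H (topics by terms) using the standard multiplicative update rules. The toy counts, the vocabulary, and all function names are assumptions made for this example; the listing does not describe the authors' actual pipeline, preprocessing, or solver.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorize V (n x m) into W (n x k) and H (k x m) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialization keeps every entry non-negative.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


# Hypothetical toy corpus: rows are documents, columns are term counts.
vocab = ["kernel", "svm", "margin", "neuron", "layer", "gradient"]
docs = [
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 1, 2, 3, 2],
]

w, h = nmf(docs, k=2)

# Normalize each document's topic weights so they read as proportions.
proportions = [[x / (sum(row) or 1.0) for x in row] for row in w]
for row in proportions:
    print([round(x, 2) for x in row])
```

Each row of `proportions` places one document in a two-dimensional topic space. Averaging such rows per year or per venue would give one point per time step, which is one plausible reading of a trajectory through topic space.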

Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data

Jun 21, 2017
Tengfei Ma, Tetsuya Nasukawa

Topic models have been successfully applied in lexicon extraction. However, most previous methods are limited to document-aligned data. In this paper, we try to address two challenges of applying topic models to lexicon extraction in non-parallel data: 1) the difficulty of modeling word relationships and 2) noisy seed dictionaries. To solve these two challenges, we propose two new bilingual topic models to better capture the semantic information of each word while discriminating between the multiple translations in a noisy seed dictionary. We extend the scope of topic models by inverting the roles of "word" and "document". In addition, to solve the problem of noise in the seed dictionary, we incorporate the probability of translation selection in our models. Moreover, we also propose an effective measure to evaluate the similarity of words in different languages and select the optimal translation pairs. Experimental results using real world data demonstrate the utility and efficacy of the proposed models.

* To appear in IJCAI 2017 

Topical Change Detection in Documents via Embeddings of Long Sequences

Dec 07, 2020
Dennis Aumiller, Satya Almasian, Sebastian Lackner, Michael Gertz

In a longer document, the topic often slightly shifts from one passage to the next, where topic boundaries are usually indicated by semantically coherent segments. Discovering this latent structure in a document improves the readability and is essential for passage retrieval and summarization tasks. We formulate the task of text segmentation as an independent supervised prediction task, making it suitable to train on Transformer-based language models. By fine-tuning on paragraphs of similar sections, we are able to show that learned features encode topic information, which can be used to find the section boundaries and divide the text into coherent segments. Unlike previous approaches, which mostly operate at the sentence level, we consistently use a broader context of an entire paragraph and assume topical independence of preceding and succeeding text. Lastly, we introduce a novel large-scale dataset constructed from online Terms-of-Service documents, on which we compare against various traditional and deep learning baselines, showing significantly better performance of Transformer-based methods.
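
To make the segmentation step concrete, the sketch below scores each pair of adjacent paragraphs for topical similarity and starts a new segment whenever the score drops below a threshold. Note the substitution: the paper uses a fine-tuned Transformer to judge whether two paragraphs belong to the same section, whereas this example uses a plain bag-of-words cosine similarity as a stand-in scorer. The function names, the threshold of 0.2, and the sample paragraphs are all assumptions for illustration.

```python
import math
from collections import Counter


def bow_similarity(a, b):
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca if t in cb)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def segment(paragraphs, score=bow_similarity, threshold=0.2):
    """Group consecutive paragraphs; split where adjacent similarity < threshold."""
    if not paragraphs:
        return []
    segments = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        # Each adjacent pair is judged independently of the rest of the text.
        if score(prev, cur) < threshold:
            segments.append([cur])      # topic boundary: open a new segment
        else:
            segments[-1].append(cur)    # same topic: extend the current segment
    return segments


# Hypothetical Terms-of-Service style paragraphs.
paragraphs = [
    "we collect personal data such as email and name",
    "personal data is stored on servers and email is encrypted",
    "you may cancel your subscription at any time",
    "subscription fees are billed monthly and you may cancel",
]

segments = segment(paragraphs)
for i, seg in enumerate(segments):
    print(i, seg)
```

Because `score` is a parameter, a learned same-topic classifier could be passed in place of `bow_similarity` without changing the boundary logic.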

Aspect Category Detection via Topic-Attention Network

Jan 04, 2019
Sajad Movahedi, Erfan Ghadery, Heshaam Faili, Azadeh Shakery

E-commerce has started a new trend in natural language processing through sentiment analysis of user-generated reviews. Different consumers have different concerns about various aspects of a specific product or service. Aspect category detection, as a subtask of aspect-based sentiment analysis, tackles the problem of categorizing a given review sentence into a set of pre-defined aspect categories. In recent years, deep learning approaches have brought revolutionary advances in multiple branches of natural language processing including sentiment analysis. In this paper, we propose a deep neural network method based on an attention mechanism to identify different aspect categories of a given review sentence. Our model utilizes several attentions with different topic contexts, enabling it to attend to different parts of a review sentence based on different topics. Experimental results on two datasets in the restaurant domain released by the SemEval workshop demonstrate that our approach outperforms existing methods on both datasets. Visualization of the topic attention weights shows the effectiveness of our model in identifying words related to different topics.

An NLP approach to quantify dynamic salience of predefined topics in a text corpus

Aug 16, 2021
A. Bock, A. Palladino, S. Smith-Heisters, I. Boardman, E. Pellegrini, E. J. Bienenstock, A. Valenti

The proliferation of news media available online simultaneously presents a valuable resource and significant challenge to analysts aiming to profile and understand social and cultural trends in a geographic location of interest. While an abundance of news reports documenting significant events, trends, and responses provides a more democratized picture of the social characteristics of a location, making sense of an entire corpus to extract significant trends is a steep challenge for any one analyst or team. Here, we present an approach using natural language processing techniques that seeks to quantify how a set of predefined topics of interest changes over time across a large corpus of text. We found that, given a predefined topic, we can identify and rank sets of terms, or n-grams, that map to that topic and have usage patterns that deviate from a normal baseline. Emergence, disappearance, or significant variations in n-gram usage present a ground-up picture of a topic's dynamic salience within a corpus of interest.

* This paper was presented at the 2021 International Conference on Social Computing, Behavioral-Cultural Modeling Prediction and Behavior Representation in Modeling and Simulation (SBP-BRiMS), 9 July 2021 
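
One simple way to picture "usage patterns that deviate from a normal baseline" is a z-score: compare an n-gram's relative frequency in the latest time period against its mean and standard deviation over earlier periods, then rank terms by the size of the deviation. The sketch below does exactly that. It is an assumption-laden illustration, not the authors' method: the scoring formula, the `min_sigma` floor, the function names, and the sample texts are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def ngrams(tokens, n):
    """Return the list of space-joined n-grams in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(texts, n=1):
    """Relative frequency of each n-gram across all texts of one period."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.lower().split(), n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def salience_scores(periods, topic_terms, n=1, min_sigma=0.01):
    """Score the last period against the earlier ones; rank by |z-score|.

    periods: list of periods, oldest first; each period is a list of texts.
    min_sigma: floor on the baseline spread so a flat history cannot
               cause a division by zero (an assumption of this sketch).
    """
    freqs = [relative_frequencies(texts, n) for texts in periods]
    baseline, current = freqs[:-1], freqs[-1]
    scores = {}
    for term in topic_terms:
        history = [f.get(term, 0.0) for f in baseline]
        mu, sigma = mean(history), pstdev(history)
        scores[term] = (current.get(term, 0.0) - mu) / max(sigma, min_sigma)
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Hypothetical news snippets grouped into three time periods.
periods = [
    ["water supply stable", "market prices stable"],
    ["water supply stable", "market prices rise"],
    ["water shortage reported", "water shortage protests", "market prices stable"],
]
topic_terms = ["water", "shortage", "market"]

ranked = salience_scores(periods, topic_terms)
for term, z in ranked:
    print(f"{term}: {z:+.2f}")
```

A positive score marks a term that has become more frequent than its baseline (emergence), and a negative score marks one that has receded. Here "shortage" ranks first because it is absent from both baseline periods.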
