Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Optimal estimation of sparse topic models

Jan 22, 2020
Xin Bing, Florentina Bunea, Marten Wegkamp

Topic models have become popular tools for dimension reduction and exploratory analysis of text data which consists in observed frequencies of a vocabulary of $p$ words in $n$ documents, stored in a $p\times n$ matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a $p\times K$ word-topic matrix $A$ and a $K\times n$ topic-document matrix $W$. This paper studies the estimation of $A$ that is possibly element-wise sparse, and the number of topics $K$ is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such $A$ and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of $A$ and our analysis is valid for any finite $n$, $p$, $K$ and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse $A$ and sparse $A$, and has superior performance is many scenarios of interest.

  Access Paper or Ask Questions

Scientific Dataset Discovery via Topic-level Recommendation

Jun 07, 2021
Basmah Altaf, Shichao Pei, Xiangliang Zhang

Data intensive research requires the support of appropriate datasets. However, it is often time-consuming to discover usable datasets matching a specific research topic. We formulate the dataset discovery problem on an attributed heterogeneous graph, which is composed of paper-paper citation, paper-dataset citation, and also paper content. We propose to characterize both paper and dataset nodes by their commonly shared latent topics, rather than learning user and item representations via canonical graph embedding models, because the usage of datasets and the themes of research projects can be understood on the common base of research topics. The relevant datasets to a given research project can then be inferred in the shared topic space. The experimental results show that our model can generate reasonable profiles for datasets, and recommend proper datasets for a query, which represents a research project linked with several papers.

  Access Paper or Ask Questions

Fast Online EM for Big Topic Modeling

Dec 07, 2015
Jia Zeng, Zhi-Qiang Liu, Xiao-Qin Cao

The expectation-maximization (EM) algorithm can compute the maximum-likelihood (ML) or maximum a posterior (MAP) point estimate of the mixture models or latent variable models such as latent Dirichlet allocation (LDA), which has been one of the most popular probabilistic topic modeling methods in the past decade. However, batch EM has high time and space complexities to learn big LDA models from big data streams. In this paper, we present a fast online EM (FOEM) algorithm that infers the topic distribution from the previously unseen documents incrementally with constant memory requirements. Within the stochastic approximation framework, we show that FOEM can converge to the local stationary point of the LDA's likelihood function. By dynamic scheduling for the fast speed and parameter streaming for the low memory usage, FOEM is more efficient for some lifelong topic modeling tasks than the state-of-the-art online LDA algorithms to handle both big data and big models (aka, big topic modeling) on just a PC.

* 14 pages, 12 figures in IEEE Transactions on Knowledge and Data Engineering, 2016 

  Access Paper or Ask Questions

Topical Phrase Extraction from Clinical Reports by Incorporating both Local and Global Context

Nov 22, 2019
Gabriele Pergola, Yulan He, David Lowe

Making sense of words often requires to simultaneously examine the surrounding context of a term as well as the global themes characterizing the overall corpus. Several topic models have already exploited word embeddings to recognize local context, however, it has been weakly combined with the global context during the topic inference. This paper proposes to extract topical phrases corroborating the word embedding information with the global context detected by Latent Semantic Analysis, and then combine them by means of the P\'{o}lya urn model. To highlight the effectiveness of this combined approach the model was assessed analyzing clinical reports, a challenging scenario characterized by technical jargon and a limited word statistics available. Results show it outperforms the state-of-the-art approaches in terms of both topic coherence and computational cost.

* The 2nd AAAI Workshop on Health Intelligence, AAAI18 

  Access Paper or Ask Questions

Inductive Document Network Embedding with Topic-Word Attention

Jan 10, 2020
Robin Brochier, Adrien Guille, Julien Velcin

Document network embedding aims at learning representations for a structured text corpus i.e. when documents are linked to each other. Recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations. In most cases, it is hard to interpret the learned representations. Moreover, little importance is given to the generalization to new documents that are not observed within the network. In this paper, we propose an interpretable and inductive document network embedding method. We introduce a novel mechanism, the Topic-Word Attention (TWA), that generates document representations based on the interplay between word and topic representations. We train these word and topic vectors through our general model, Inductive Document Network Embedding (IDNE), by leveraging the connections in the document network. Quantitative evaluations show that our approach achieves state-of-the-art performance on various networks and we qualitatively show that our model produces meaningful and interpretable representations of the words, topics and documents.

  Access Paper or Ask Questions

Survival-Supervised Topic Modeling with Anchor Words: Characterizing Pancreatitis Outcomes

Dec 07, 2017
George H. Chen, Jeremy C. Weiss

We introduce a new approach for topic modeling that is supervised by survival analysis. Specifically, we build on recent work on unsupervised topic modeling with so-called anchor words by providing supervision through an elastic-net regularized Cox proportional hazards model. In short, an anchor word being present in a document provides strong indication that the document is partially about a specific topic. For example, by seeing "gallstones" in a document, we are fairly certain that the document is partially about medicine. Our proposed method alternates between learning a topic model and learning a survival model to find a local minimum of a block convex optimization problem. We apply our proposed approach to predicting how long patients with pancreatitis admitted to an intensive care unit (ICU) will stay in the ICU. Our approach is as accurate as the best of a variety of baselines while being more interpretable than any of the baselines.

* NIPS Workshop on Machine Learning for Health 2017, fixed some equation typos, some minor wording edits 

  Access Paper or Ask Questions

Who Responded to Whom: The Joint Effects of Latent Topics and Discourse in Conversation Structure

Apr 17, 2021
Lu Ji, Jing Li, Zhongyu Wei, Qi Zhang, Xuanjing Huang

Numerous online conversations are produced on a daily basis, resulting in a pressing need to conversation understanding. As a basis to structure a discussion, we identify the responding relations in the conversation discourse, which link response utterances to their initiations. To figure out who responded to whom, here we explore how the consistency of topic contents and dependency of discourse roles indicate such interactions, whereas most prior work ignore the effects of latent factors underlying word occurrences. We propose a model to learn latent topics and discourse in word distributions, and predict pairwise initiation-response links via exploiting topic consistency and discourse dependency. Experimental results on both English and Chinese conversations show that our model significantly outperforms the previous state of the arts, such as 79 vs. 73 MRR on Chinese customer service dialogues. We further probe into our outputs and shed light on how topics and discourse indicate conversational user interactions.

* 10 pages, 7 figures, 3 tables submitted for emnlp2021 

  Access Paper or Ask Questions

Topic-Aware Neural Keyphrase Generation for Social Media Language

Jun 10, 2019
Yue Wang, Jing Li, Hou Pong Chan, Irwin King, Michael R. Lyu, Shuming Shi

A huge volume of user-generated content is daily produced on social media. To facilitate automatic language understanding, we study keyphrase prediction, distilling salient information from massive posts. While most existing methods extract words from source posts to form keyphrases, we propose a sequence-to-sequence (seq2seq) based neural keyphrase generation framework, enabling absent keyphrases to be created. Moreover, our model, being topic-aware, allows joint modeling of corpus-level latent topic representations, which helps alleviate the data sparsity that widely exhibited in social media language. Experiments on three datasets collected from English and Chinese social media platforms show that our model significantly outperforms both extraction and generation models that do not exploit latent topics. Further discussions show that our model learns meaningful topics, which interprets its superiority in social media keyphrase generation.

* ACL 2019 (11 pages) 

  Access Paper or Ask Questions

SimDoc: Topic Sequence Alignment based Document Similarity Framework

Nov 11, 2017
Gaurav Maheshwari, Priyansh Trivedi, Harshita Sahijwani, Kunal Jha, Sourish Dasgupta, Jens Lehmann

Document similarity is the problem of estimating the degree to which a given pair of documents has similar semantic content. An accurate document similarity measure can improve several enterprise relevant tasks such as document clustering, text mining, and question-answering. In this paper, we show that a document's thematic flow, which is often disregarded by bag-of-word techniques, is pivotal in estimating their similarity. To this end, we propose a novel semantic document similarity framework, called SimDoc. We model documents as topic-sequences, where topics represent latent generative clusters of related words. Then, we use a sequence alignment algorithm to estimate their semantic similarity. We further conceptualize a novel mechanism to compute topic-topic similarity to fine tune our system. In our experiments, we show that SimDoc outperforms many contemporary bag-of-words techniques in accurately computing document similarity, and on practical applications such as document clustering.

  Access Paper or Ask Questions

Necessary and Sufficient Conditions for Novel Word Detection in Separable Topic Models

Oct 30, 2013
Weicong Ding, Prakash Ishwar, Mohammad H. Rohban, Venkatesh Saligrama

The simplicial condition and other stronger conditions that imply it have recently played a central role in developing polynomial time algorithms with provable asymptotic consistency and sample complexity guarantees for topic estimation in separable topic models. Of these algorithms, those that rely solely on the simplicial condition are impractical while the practical ones need stronger conditions. In this paper, we demonstrate, for the first time, that the simplicial condition is a fundamental, algorithm-independent, information-theoretic necessary condition for consistent separable topic estimation. Furthermore, under solely the simplicial condition, we present a practical quadratic-complexity algorithm based on random projections which consistently detects all novel words of all topics using only up to second-order empirical word moments. This algorithm is amenable to distributed implementation making it attractive for 'big-data' scenarios involving a network of large distributed databases.

  Access Paper or Ask Questions