"Topic Modeling": models, code, and papers

Bayesian Nonparametrics in Topic Modeling: A Brief Tutorial

Jan 16, 2015
Alexander Spangher

Nonparametric methods have been increasingly explored in Bayesian hierarchical modeling as a way to increase model flexibility. Although the field shows a lot of promise, inference in many models, including Hierarchical Dirichlet Processes (HDP), remains prohibitively slow. One promising path forward is to exploit the submodularity inherent in the Indian Buffet Process (IBP) to derive near-optimal solutions in polynomial time. In this work, I will present a brief tutorial on Bayesian nonparametric methods, especially as they are applied to topic modeling. I will show a comparison between different nonparametric models and the current state-of-the-art parametric model, Latent Dirichlet Allocation (LDA).

* 7 pages, unpublished 

Modeling Curiosity in a Mobile Robot for Long-Term Autonomous Exploration and Monitoring

Sep 26, 2015
Yogesh Girdhar, Gregory Dudek

This paper presents a novel approach to modeling curiosity in a mobile robot, which is useful for monitoring and adaptive data collection tasks, especially in the context of long-term autonomous missions where pre-programmed missions are likely to have limited utility. We use a realtime topic modeling technique to build a semantic perception model of the environment, which we then use to plan a path through the locations in the world with high semantic information content. The life-long learning behavior of the proposed perception model makes it suitable for long-term exploration missions. We validate the approach using simulated exploration experiments with aerial and underwater data, and demonstrate an implementation on the Aqua underwater robot in a variety of scenarios. We find that the proposed exploration paths, which are biased towards locations with high topic perplexity, produce better terrain models with high discriminative power. Moreover, we show that the proposed algorithm implemented on the Aqua robot is able to do tasks such as coral reef inspection, diver following, and sea floor exploration, without any prior training or preparation.

* 20 pages, in-press, Autonomous Robots, 2015. arXiv admin note: substantial text overlap with arXiv:1310.6767 

Curiosity Based Exploration for Learning Terrain Models

Oct 24, 2013
Yogesh Girdhar, David Whitney, Gregory Dudek

We present a robotic exploration technique in which the goal is to learn a visual model and be able to distinguish between different terrains and other visual components in an unknown environment. We use ROST, a realtime online spatiotemporal topic modeling framework, to model these terrains using the observations made by the robot, and then use an information theoretic path planning technique to define the exploration path. We conduct experiments with aerial view and underwater datasets with millions of observations and varying path lengths, and find that paths that are biased towards locations with high topic perplexity produce better terrain models with high discriminative power, especially with paths of length close to the diameter of the world.

* 7 pages, 5 figures, submitted to ICRA 2014 

A Spectral Algorithm for Latent Dirichlet Allocation

Jan 17, 2013
Animashree Anandkumar, Dean P. Foster, Daniel Hsu, Sham M. Kakade, Yi-Kai Liu

The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on $k\times k$ matrices, where $k$ is the number of latent factors (e.g. the number of topics), rather than in the $d$-dimensional observed space (typically $d \gg k$).

* Changed title to match conference version, which appears in Advances in Neural Information Processing Systems 25, 2012 

Open vs Closed-ended questions in attitudinal surveys -- comparing, combining, and interpreting using natural language processing

May 03, 2022
Vishnu Baburajan, João de Abreu e Silva, Francisco Camara Pereira

To improve the traveling experience, researchers have been analyzing the role of attitudes in travel behavior modeling. Although most researchers use closed-ended surveys, the appropriate method to measure attitudes is debatable. Topic Modeling could significantly reduce the time to extract information from open-ended responses and eliminate subjective bias, thereby alleviating analyst concerns. Our research uses Topic Modeling to extract information from open-ended questions and compare its performance with closed-ended responses. Furthermore, some respondents might prefer answering questions using their preferred questionnaire type. So, we propose a modeling framework that allows respondents to use their preferred questionnaire type to answer the survey and enables analysts to use the modeling frameworks of their choice to predict behavior. We demonstrate this using a dataset collected from the USA that measures the intention to use Autonomous Vehicles for commute trips. Respondents were presented with alternative questionnaire versions (open- and closed-ended). Since our objective was also to compare the performance of alternative questionnaire versions, the survey was designed to eliminate influences resulting from statements, behavioral framework, and the choice experiment. Results indicate the suitability of using Topic Modeling to extract information from open-ended responses; however, the models estimated using the closed-ended questions perform better than those estimated using the open-ended responses. In addition, the proposed model performs better than the models currently in use. Furthermore, our proposed framework will allow respondents to choose the questionnaire type to answer, which could be particularly beneficial to them when using voice-based surveys.


Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity

Sep 18, 2018
Xinnuo Xu, Ondřej Dušek, Ioannis Konstas, Verena Rieser

We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) We introduce a measure of coherence as the GloVe embedding similarity between the dialogue context and the generated response, (2) we filter our training corpora based on the measure of coherence to obtain topically coherent and lexically diverse context-response pairs, (3) we then train a response generator using a conditional variational autoencoder model that incorporates the measure of coherence as a latent variable and uses a context gate to guarantee topical consistency with the context and promote lexical diversity. Experiments on the OpenSubtitles corpus show a substantial improvement over competitive neural models in terms of BLEU score as well as metrics of coherence and diversity.

* EMNLP 2018 
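
The coherence measure in enhancement (1) can be illustrated with a minimal sketch. The abstract only states that coherence is a GloVe embedding similarity between context and response; the choices below (averaging word vectors per utterance and comparing them with cosine similarity) are assumptions, and the tiny `TOY_EMBEDDINGS` table is invented for illustration rather than taken from real GloVe vectors.

```python
import math

# Toy 2-d "embeddings", invented for illustration (not real GloVe vectors).
TOY_EMBEDDINGS = {
    "coffee": [1.0, 0.0],
    "tea": [0.9, 0.1],
    "rocket": [0.0, 1.0],
}


def mean_vector(tokens, embeddings):
    """Average the vectors of all tokens that have an embedding.

    Returns None when no token is covered by the embedding table.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def coherence(context_tokens, response_tokens, embeddings):
    """Assumed coherence score: cosine of the averaged word vectors."""
    context_vec = mean_vector(context_tokens, embeddings)
    response_vec = mean_vector(response_tokens, embeddings)
    if context_vec is None or response_vec is None:
        return 0.0
    return cosine(context_vec, response_vec)
```

Under these assumptions, a filtering step like enhancement (2) would keep only context-response pairs whose `coherence` score exceeds some threshold; the threshold itself is not given in the abstract.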

Dynamic Nonlocal Language Modeling via Hierarchical Topic-Based Adaptation

Apr 27, 2001
Radu Florian, David Yarowsky

This paper presents a novel method of generating and applying hierarchical, dynamic topic-based language models. It proposes and evaluates new cluster generation, hierarchical smoothing and adaptive topic-probability estimation techniques. These combined models help capture long-distance lexical dependencies. Experiments on the Broadcast News corpus show significant improvement in perplexity (10.5% overall and 33.5% on target vocabulary).

* Proceedings of the 37th Annual Meeting of the ACL, pages 167-174, College Park, Maryland 
* 8 pages, 29 figures, presented at ACL99, College Park, Maryland 
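
For readers unfamiliar with the metric reported above, here is a minimal sketch of how perplexity is conventionally computed from per-token probabilities, and how a percentage improvement between two models can be expressed. It assumes the reported figures are relative reductions in perplexity, which the abstract does not state explicitly; the function names are hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.

    `token_probs` holds the probability a language model assigned to each
    token of a held-out text; every value must be strictly positive.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def relative_improvement(baseline_ppl, adapted_ppl):
    """Fractional reduction in perplexity relative to the baseline model."""
    return (baseline_ppl - adapted_ppl) / baseline_ppl
```

A model that assigns probability 0.25 to every token has perplexity 4, i.e. it is as uncertain as a uniform choice among four words. Lower is better, so a positive `relative_improvement` means the adapted model predicts the text better than the baseline.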

Variable Selection for Latent Dirichlet Allocation

May 04, 2012
Dongwoo Kim, Yeonseung Chung, Alice Oh

In latent Dirichlet allocation (LDA), topics are multinomial distributions over the entire vocabulary. However, the vocabulary usually contains many words that are not relevant in forming the topics. We adopt a variable selection method widely used in statistical modeling as a dimension reduction tool and combine it with LDA. In this variable selection model for LDA (vsLDA), topics are multinomial distributions over a subset of the vocabulary, and by excluding words that are not informative for finding the latent topic structure of the corpus, vsLDA finds topics that are more robust and discriminative. We compare three models, vsLDA, LDA with symmetric priors, and LDA with asymmetric priors, on heldout likelihood, MCMC chain consistency, and document classification. The performance of vsLDA is better than symmetric LDA for likelihood and classification, better than asymmetric LDA for consistency and classification, and about the same in the other comparisons.


A Generalized Hierarchical Nonnegative Tensor Decomposition

Sep 30, 2021
Joshua Vendrow, Jamie Haddock, Deanna Needell

Nonnegative matrix factorization (NMF) has found many applications including topic modeling and document analysis. Hierarchical NMF (HNMF) variants are able to learn topics at various levels of granularity and illustrate their hierarchical relationship. Recently, nonnegative tensor factorization (NTF) methods have been applied in a similar fashion in order to handle data sets with complex, multi-modal structure. Hierarchical NTF (HNTF) methods have been proposed; however, these methods do not naturally generalize their matrix-based counterparts. Here, we propose a new HNTF model which directly generalizes a special case of an HNMF model, and provide a supervised extension. We also provide a multiplicative updates training method for this model. Our experimental results show that this model more naturally illuminates the topic hierarchy than previous HNMF and HNTF methods.

* 6 pages, 2 figures, 3 tables 
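
As background for the "multiplicative updates" mentioned above, the sketch below shows the classic Lee-Seung multiplicative updates for plain, single-layer NMF under the Frobenius loss. It is not the paper's hierarchical tensor method, only the matrix building block that such methods extend. The small `eps` added to the denominators, the random initialization, and all function names are assumptions made for this illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((xv - yv) ** 2
               for x_row, y_row in zip(x, wh)
               for xv, yv in zip(x_row, y_row))


def nmf(x, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix X (n x m) as W (n x rank) times H (rank x m).

    With X as a document-term matrix, rows of H can be read as topics and
    rows of W as per-document topic weights.
    """
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    # Strictly positive random start so no entry is stuck at zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T X) / (W^T W H), elementwise.
        wt = transpose(w)
        num = matmul(wt, x)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (X H^T) / (W H H^T), elementwise.
        ht = transpose(h)
        num = matmul(x, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

Because each update only multiplies by a ratio of nonnegative quantities, W and H stay nonnegative without any explicit projection step, which is the main appeal of this family of update rules.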

Contrastive estimation reveals topic posterior information to linear models

Mar 04, 2020
Christopher Tosh, Akshay Krishnamurthy, Daniel Hsu

Contrastive learning is an approach to representation learning that utilizes naturally occurring similar and dissimilar pairs of data points to find useful embeddings of data. In the context of document classification under topic modeling assumptions, we prove that contrastive learning is capable of recovering a representation of documents that reveals their underlying topic posterior information to linear models. We apply this procedure in a semi-supervised setup and demonstrate empirically that linear classifiers with these representations perform well in document classification tasks with very few training examples.