Non-negative matrix factorization (NMF) is a technique for finding latent representations of data. The method has been applied to corpora to construct topic models. However, the likelihood assumptions underlying NMF are often violated by real document corpora. We present a double parametric bootstrap test for evaluating the fit of an NMF-based topic model, based on the duality between KL-divergence minimization and Poisson maximum likelihood estimation. The test correctly identifies whether an NMF-based topic model yields reliable results on both simulated and real data.
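A minimal NumPy sketch of the idea behind such a test, assuming Lee-Seung multiplicative updates for the generalized KL objective and a single-level parametric bootstrap (the paper's double bootstrap would add an inner calibration loop; the function names, iteration counts, and replication count here are illustrative assumptions):

```python
import numpy as np

def nmf_kl(X, k, iters=200, seed=0):
    """NMF via Lee-Seung multiplicative updates for generalized KL divergence,
    whose minimization is equivalent to Poisson maximum likelihood."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    eps = 1e-12
    for _ in range(iters):
        W *= ((X / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        H *= (W.T @ (X / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    return W, H

def poisson_deviance(X, M):
    """Twice the gap between the saturated and fitted Poisson log-likelihoods."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t = np.where(X > 0, X * np.log(X / M), 0.0)
    return 2.0 * float((t - X + M).sum())

def bootstrap_pvalue(X, k, B=20, seed=0):
    """Single-level parametric bootstrap: simulate Poisson counts from the
    fitted mean, refit, and locate the observed deviance among replicates."""
    rng = np.random.default_rng(seed)
    W, H = nmf_kl(X, k)
    d_obs = poisson_deviance(X, W @ H)
    exceed = 0
    for _ in range(B):
        Xb = rng.poisson(W @ H)       # data simulated under the fitted model
        Wb, Hb = nmf_kl(Xb, k)
        exceed += poisson_deviance(Xb, Wb @ Hb) >= d_obs
    return (exceed + 1) / (B + 1)     # small p-value signals lack of fit
```

A small p-value indicates the observed deviance is extreme relative to data actually generated from the fitted Poisson model, i.e. the NMF fit should not be trusted.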
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.
Distributed word representations are widely used for modeling words in NLP tasks. Most existing models generate one representation per word and do not consider the different meanings a word can have. We present two approaches to learning multiple topic-sensitive representations per word using the Hierarchical Dirichlet Process (HDP). We observe that by modeling topics and integrating topic distributions for each document, we obtain representations that are able to distinguish between different meanings of a given word. Our models yield statistically significant improvements on the lexical substitution task, indicating that commonly used single word representations, even when combined with contextual information, are insufficient for this task.
This paper presents an intertemporal bimodal network for analyzing the evolution of the semantic content of a scientific field within the framework of topic modeling, namely Latent Dirichlet Allocation (LDA). The main contribution is the conceptualization of topic dynamics and its formalization and codification into an algorithm. To benchmark the effectiveness of this approach, we propose three indexes that track the transformation of topics over time, their rates of birth and death, and the novelty of their content. Applying LDA, we test the algorithm both on a controlled experiment and on a corpus of several thousand scientific papers spanning more than 100 years of the history of economic thought.
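Indexes of this kind could be formalized in many ways; one illustrative sketch, assuming topics are matched across time periods by cosine similarity of their word distributions (the threshold and function names are assumptions, not the paper's definitions):

```python
import numpy as np

def topic_novelty(prev_topics, curr_topics):
    """Novelty of each current topic: one minus its best cosine match
    against the previous period's topic-word distributions."""
    P = prev_topics / np.linalg.norm(prev_topics, axis=1, keepdims=True)
    C = curr_topics / np.linalg.norm(curr_topics, axis=1, keepdims=True)
    sim = C @ P.T                     # (curr, prev) cosine similarities
    return 1.0 - sim.max(axis=1)

def birth_death_rates(prev_topics, curr_topics, match=0.5):
    """Topics with no sufficiently close match in the other period
    count as born (current) or dead (previous)."""
    born = topic_novelty(prev_topics, curr_topics) > 1.0 - match
    dead = topic_novelty(curr_topics, prev_topics) > 1.0 - match
    return float(born.mean()), float(dead.mean())
```

Rows are topic-word distributions from consecutive time slices; sweeping the `match` cutoff traces how strict a notion of topic continuity one adopts.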
Steady progress has been made in abstractive summarization with attention-based sequence-to-sequence learning models. In this paper, we propose a new decoder where the output summary is generated by conditioning on both the input text and the latent topics of the document. The latent topics, identified by a topic model such as LDA, reveal more global semantic information that can be used to bias the decoder toward generating relevant words. In particular, they give the decoder access to additional word co-occurrence statistics captured at the corpus level. We empirically validate the advantage of the proposed approach on both the CNN/Daily Mail and WikiHow datasets. Concretely, we attain substantially improved ROUGE scores compared to state-of-the-art models.
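One hypothetical way to bias a decoder with document-level topics, sketched in NumPy: add the log of the marginal topic-word distribution to the decoder logits before the softmax (the mixing weight `lam` and the additive scheme are assumptions for illustration, not the paper's exact decoder):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def topic_biased_decoding(logits, topic_word, doc_topics, lam=1.0):
    """Shift next-word logits toward words favored by the document's
    topic mixture before normalizing."""
    # Marginal word distribution under the document's topic proportions.
    p_topic = doc_topics @ topic_word          # (vocab,)
    return softmax(logits + lam * np.log(p_topic + 1e-12))
```

With `lam = 0` this reduces to the plain decoder distribution; increasing `lam` moves probability mass toward topically salient words.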
In today's content-centric Internet, blogs are becoming increasingly popular and important from a data analysis perspective. According to Wikipedia, there were over 156 million public blogs on the Internet as of February 2011. Blogs are a reflection of our contemporary society, and the contents of blog posts matter from social, psychological, economic, and political perspectives. Discovery of important topics in the blogosphere, however, remains largely unexplored. We propose a procedure that combines probabilistic topic modeling with network centrality measures to identify the central topics in a blog corpus.
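A sketch of how such topic centrality might be computed, assuming topics are linked when they are jointly prominent in the same post (the prominence threshold and the choice of eigenvector centrality are illustrative assumptions, not the paper's procedure):

```python
import numpy as np

def topic_cooccurrence(doc_topic, thresh=0.1):
    """Adjacency weighted by how often two topics are jointly prominent
    (proportion >= thresh) in the same blog post."""
    active = (doc_topic >= thresh).astype(float)
    A = active.T @ active
    np.fill_diagonal(A, 0.0)
    return A

def eigenvector_centrality(A, iters=200):
    """Shifted power iteration: using A + I keeps the leading eigenvalue
    strictly dominant for nonnegative symmetric A, so it converges."""
    v = np.ones(A.shape[0])
    for _ in range(iters):
        v = A @ v + v
        v /= np.linalg.norm(v)
    return v
```

Topics with high centrality are those that co-occur prominently with many other topics across posts, a plausible proxy for "central" themes in the corpus.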
Poetic traditions across languages evolved differently, but we find that certain semantic topics occur in several of them, albeit sometimes with temporal delay or with diverging trajectories over time. We apply Latent Dirichlet Allocation (LDA) to poetry corpora in four languages: German (52k poems), English (85k poems), Russian (18k poems), and Czech (80k poems). We align and interpret salient topics and their trends over time (1600--1925 A.D.), showing similarities and disparities across poetic traditions for a few selected topics, and use their trajectories to pinpoint specific literary epochs.
Automated generation of high-quality topical hierarchies for a text collection is a dream problem in knowledge engineering with many valuable applications. In this paper, we propose a scalable and robust algorithm for constructing a hierarchy of topics from a text collection. We divide and conquer the problem with a top-down recursive framework based on a tensor orthogonal decomposition technique, and solve a critical challenge in performing scalable inference for our newly designed hierarchical topic model. Experiments with various real-world datasets illustrate its ability to generate robust, high-quality hierarchies efficiently. Our method reduces construction time by several orders of magnitude, and its robustness makes it possible for users to interactively revise the hierarchy.
We analyze methods for selecting topics in news articles to explain stock returns. We find, through empirical and theoretical results, that supervised Latent Dirichlet Allocation (sLDA) implemented through Gibbs sampling in a stochastic EM algorithm will often overfit returns to the detriment of the topic model. We obtain better out-of-sample performance through a random search over plain LDA models. A branching procedure that reinforces effective topic assignments often performs best. We test these methods on an archive of over 90,000 news articles about S&P 500 firms.
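A random search over plain-LDA configurations can be sketched with scikit-learn; here models are scored by held-out perplexity for self-containedness, whereas the paper scores them by out-of-sample explanation of returns, and the search ranges are assumptions:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def random_search_lda(X_train, X_valid, n_trials=5, seed=0):
    """Randomly sample LDA configurations (topic count, document-topic
    prior) and keep the one with the lowest held-out perplexity."""
    rng = np.random.default_rng(seed)
    best = None
    for trial in range(n_trials):
        k = int(rng.integers(2, 8))            # assumed topic-count range
        alpha = float(rng.uniform(0.05, 1.0))  # assumed prior range
        lda = LatentDirichletAllocation(
            n_components=k, doc_topic_prior=alpha,
            max_iter=10, random_state=trial,
        ).fit(X_train)
        ppl = lda.perplexity(X_valid)
        if best is None or ppl < best[0]:
            best = (ppl, k, alpha, lda)
    return best
```

Replacing the perplexity score with out-of-sample regression performance of returns on topic proportions would turn the same harness into the kind of search the abstract describes.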
In this paper, we study topical behavior at a large scale. We use network logs in which each entry contains an entity ID, a timestamp, and metadata about the activity. Both the temporal and the spatial relationships of the behavior are explored with deep learning architectures combining recurrent neural networks (RNNs) and convolutional neural networks (CNNs). To make the behavioral data suitable for spatial learning in the CNN, we propose several reduction steps that form topical metrics and arrange them homogeneously, like pixels in an image. Experimental results show both temporal and spatial gains compared against a multilayer perceptron (MLP) baseline. A new learning framework, the spatially connected convolutional network (SCCN), is introduced to predict the topical metrics more efficiently.