As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages with a large amount of memory space, increasing linearly with the number of documents or the number of topics. Therefore, the high memory usage is often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm without storing previous messages for training LDA: tiny belief propagation (TBP). The basic idea of TBP relates the message passing algorithms with the non-negative matrix factorization (NMF) algorithms, which absorb the message updating into the message passing process, and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably well or even better than current state-of-the-art training algorithms for LDA but with a much less memory consumption. TBP can do topic modeling when massive corpora cannot fit in the computer memory, for example, extracting thematic topics from 7 GB PUBMED corpora on a common desktop computer with 2GB memory.
Topic models are statistical methods that extract underlying topics from document collections. When performing topic modeling, a user usually desires topics that are coherent, diverse between each other, and that constitute good document representations for downstream tasks (e.g. document classification). In this paper, we conduct a multi-objective hyperparameter optimization of three well-known topic models. The obtained results reveal the conflicting nature of different objectives and that the training corpus characteristics are crucial for the hyperparameter selection, suggesting that it is possible to transfer the optimal hyperparameter configurations between datasets.
The concept of literary genre is a highly complex one: not only are different genres frequently defined on several, but not necessarily the same levels of description, but consideration of genres as cognitive, social, or scholarly constructs with a rich history further complicate the matter. This contribution focuses on thematic aspects of genre with a quantitative approach, namely Topic Modeling. Topic Modeling has proven to be useful to discover thematic patterns and trends in large collections of texts, with a view to class or browse them on the basis of their dominant themes. It has rarely if ever, however, been applied to collections of dramatic texts. In this contribution, Topic Modeling is used to analyze a collection of French Drama of the Classical Age and the Enlightenment. The general aim of this contribution is to discover what semantic types of topics are found in this collection, whether different dramatic subgenres have distinctive dominant topics and plot-related topic patterns, and inversely, to what extent clustering methods based on topic scores per play produce groupings of texts which agree with more conventional genre distinctions. This contribution shows that interesting topic patterns can be detected which provide new insights into the thematic, subgenre-related structure of French drama as well as into the history of French drama of the Classical Age and the Enlightenment.
Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and produces substantial gains in efficiency and provable topic inference, but at a cost: the model can no longer provide information about the topic composition of individual documents. Recently Thresholded Linear Inverse (TLI) is proposed to map the observed words of each document back to its topic composition. However, its linear characteristics limit the inference quality without considering the important prior information over topics. In this paper, we evaluate Simple Probabilistic Inverse (SPI) method and novel Prior-aware Dual Decomposition (PADD) that is capable of learning document-specific topic compositions in parallel. Experiments show that PADD successfully leverages topic correlations as a prior, notably outperforming TLI and learning quality topic compositions comparable to Gibbs sampling on various data.
This article presents the results of investigations using topic modeling of the Voynich Manuscript (Beinecke MS408). Topic modeling is a set of computational methods which are used to identify clusters of subjects within text. We use latent dirichlet allocation, latent semantic analysis, and nonnegative matrix factorization to cluster Voynich pages into `topics'. We then compare the topics derived from the computational models to clusters derived from the Voynich illustrations and from paleographic analysis. We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations), providing further evidence that the Voynich Manuscript contains meaningful text.
We analyze a large corpus of police incident narrative documents in understanding the spatial distribution of the topics. The motivation for doing this is that police narratives in each incident report contains very fine-grained information that is richer than the category that is manually assigned by the police. Our approach is to split the corpus into topics using two different unsupervised machine learning algorithms - Latent Dirichlet Allocation and Non-negative Matrix Factorization. We validate the performance of each learned topic model using model coherence. Then, using a k-nearest neighbors density ratio estimation (kNN-DRE) approach that we propose, we estimate the spatial density ratio per topic and use this for data discovery and analysis of each topic, allowing for insights into the described incidents at scale. We provide a qualitative assessment of each topic and highlight some key benefits for using our kNN-DRE model for estimating spatial trends.
The COVID-19 pandemic fueled one of the most rapid vaccine developments in history. However, misinformation spread through online social media often leads to negative vaccine sentiment and hesitancy. To investigate COVID-19 vaccine-related discussion in social media, we conducted a sentiment analysis and Latent Dirichlet Allocation topic modeling on textual data collected from 13 Reddit communities focusing on the COVID-19 vaccine from Dec 1, 2020, to May 15, 2021. Data were aggregated and analyzed by month to detect changes in any sentiment and latent topics. ty analysis suggested these communities expressed more positive sentiment than negative regarding the vaccine-related discussions and has remained static over time. Topic modeling revealed community members mainly focused on side effects rather than outlandish conspiracy theories. Covid-19 vaccine-related content from 13 subreddits show that the sentiments expressed in these communities are overall more positive than negative and have not meaningfully changed since December 2020. Keywords indicating vaccine hesitancy were detected throughout the LDA topic modeling. Public sentiment and topic modeling analysis regarding vaccines could facilitate the implementation of appropriate messaging, digital interventions, and new policies to promote vaccine confidence.
Large-scale transformer-based language models (LMs) demonstrate impressive capabilities in open text generation. However, controlling the generated text's properties such as the topic, style, and sentiment is challenging and often requires significant changes to the model architecture or retraining and fine-tuning the model on new supervised data. This paper presents a novel approach for Topical Language Generation (TLG) by combining a pre-trained LM with topic modeling information. We cast the problem using Bayesian probability formulation with topic probabilities as a prior, LM probabilities as the likelihood, and topical language generation probability as the posterior. In learning the model, we derive the topic probability distribution from the user-provided document's natural structure. Furthermore, we extend our model by introducing new parameters and functions to influence the quantity of the topical features presented in the generated text. This feature would allow us to easily control the topical properties of the generated text. Our experimental results demonstrate that our model outperforms the state-of-the-art results on coherency, diversity, and fluency while being faster in decoding.
Understanding the semantic of a collection of texts is a challenging task. Topic models are probabilistic models that aims at extracting "topics" from a corpus of documents. This task is particularly difficult when the corpus is composed of short texts, such as posts on social networks. Following several previous research papers, we explore in this paper a set of collected tweets about bitcoin. In this work, we train three topic models and evaluate their output with several scores. We also propose a concrete application of the extracted topics.
In this paper, we present hierarchical relationbased latent Dirichlet allocation (hrLDA), a data-driven hierarchical topic model for extracting terminological ontologies from a large number of heterogeneous documents. In contrast to traditional topic models, hrLDA relies on noun phrases instead of unigrams, considers syntax and document structures, and enriches topic hierarchies with topic relations. Through a series of experiments, we demonstrate the superiority of hrLDA over existing topic models, especially for building hierarchies. Furthermore, we illustrate the robustness of hrLDA in the settings of noisy data sets, which are likely to occur in many practical scenarios. Our ontology evaluation results show that ontologies extracted from hrLDA are very competitive with the ontologies created by domain experts.