Support or opposition concerning a debated claim such as abortion should be legal can have different underlying reasons, which we call perspectives. This paper explores how opinion mining can be enhanced with joint topic modeling, to identify distinct perspectives within the topic, providing an informative overview from unstructured text. We evaluate four joint topic models (TAM, JST, VODUM, and LAM) in a user study assessing human understandability of the extracted perspectives. Based on the results, we conclude that joint topic models such as TAM can discover perspectives that align with human judgments. Moreover, our results suggest that users are not influenced by their pre-existing stance on the topic of abortion when interpreting the output of topic models.
Topic model evaluation, like evaluation of other unsupervised methods, can be contentious. However, the field has coalesced around automated estimates of topic coherence, which rely on the frequency of word co-occurrences in a reference corpus. Recent models relying on neural components surpass classical topic models according to these metrics. At the same time, unlike classical models, the practice of neural topic model evaluation suffers from a validation gap: automatic coherence for neural models has not been validated using human experimentation. In addition, as we show via a meta-analysis of topic modeling literature, there is a substantial standardization gap in the use of automated topic modeling benchmarks. We address both the standardization gap and the validation gap. Using two of the most widely used topic model evaluation datasets, we assess a dominant classical model and two state-of-the-art neural models in a systematic, clearly documented, reproducible way. We use automatic coherence along with the two most widely accepted human judgment tasks, namely, topic rating and word intrusion. Automated evaluation will declare one model significantly different from another when corresponding human evaluations do not, calling into question the validity of fully automatic evaluations independent of human judgments.
There is a great need for technologies that can predict the mortality of patients in intensive care units with both high accuracy and accountability. We present joint end-to-end neural network architectures that combine long short-term memory (LSTM) and a latent topic model to simultaneously train a classifier for mortality prediction and learn latent topics indicative of mortality from textual clinical notes. For topic interpretability, the topic modeling layer has been carefully designed as a single-layer network with constraints inspired by LDA. Experiments on the MIMIC-III dataset show that our models significantly outperform prior models that are based on LDA topics in mortality prediction. However, we achieve limited success with our method for interpreting topics from the trained models by looking at the neural network weights.
This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data. Our main contributions are a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections; b) a discussion on the differences between two topic models for this type of data; c) quantifying topic prominence for a period and thus a generalization of document-wise topic assignment to a discourse level; and d) a discussion of the role of humanistic interpretation with regard to analysing discourse dynamics through topic models.
The concept of literary genre is a highly complex one: not only are different genres frequently defined on several, but not necessarily the same levels of description, but consideration of genres as cognitive, social, or scholarly constructs with a rich history further complicate the matter. This contribution focuses on thematic aspects of genre with a quantitative approach, namely Topic Modeling. Topic Modeling has proven to be useful to discover thematic patterns and trends in large collections of texts, with a view to class or browse them on the basis of their dominant themes. It has rarely if ever, however, been applied to collections of dramatic texts. In this contribution, Topic Modeling is used to analyze a collection of French Drama of the Classical Age and the Enlightenment. The general aim of this contribution is to discover what semantic types of topics are found in this collection, whether different dramatic subgenres have distinctive dominant topics and plot-related topic patterns, and inversely, to what extent clustering methods based on topic scores per play produce groupings of texts which agree with more conventional genre distinctions. This contribution shows that interesting topic patterns can be detected which provide new insights into the thematic, subgenre-related structure of French drama as well as into the history of French drama of the Classical Age and the Enlightenment.
Automated characterization of spatial data is a kind of critical geographical intelligence. As an emerging technique for characterization, Spatial Representation Learning (SRL) uses deep neural networks (DNNs) to learn non-linear embedded features of spatial data for characterization. However, SRL extracts features by internal layers of DNNs, and thus suffers from lacking semantic labels. Texts of spatial entities, on the other hand, provide semantic understanding of latent feature labels, but is insensible to deep SRL models. How can we teach a SRL model to discover appropriate topic labels in texts and pair learned features with the labels? This paper formulates a new problem: feature-topic pairing, and proposes a novel Particle Swarm Optimization (PSO) based deep learning framework. Specifically, we formulate the feature-topic pairing problem into an automated alignment task between 1) a latent embedding feature space and 2) a textual semantic topic space. We decompose the alignment of the two spaces into: 1) point-wise alignment, denoting the correlation between a topic distribution and an embedding vector; 2) pair-wise alignment, denoting the consistency between a feature-feature similarity matrix and a topic-topic similarity matrix. We design a PSO based solver to simultaneously select an optimal set of topics and learn corresponding features based on the selected topics. We develop a closed loop algorithm to iterate between 1) minimizing losses of representation reconstruction and feature-topic alignment and 2) searching the best topics. Finally, we present extensive experiments to demonstrate the enhanced performance of our method.
This paper explores the suitability of using automatically discovered topics from MOOC discussion forums for modelling students' academic abilities. The Rasch model from psychometrics is a popular generative probabilistic model that relates latent student skill, latent item difficulty, and observed student-item responses within a principled, unified framework. According to scholarly educational theory, discovered topics can be regarded as appropriate measurement items if (1) students' participation across the discovered topics is well fit by the Rasch model, and if (2) the topics are interpretable to subject-matter experts as being educationally meaningful. Such Rasch-scaled topics, with associated difficulty levels, could be of potential benefit to curriculum refinement, student assessment and personalised feedback. The technical challenge that remains, is to discover meaningful topics that simultaneously achieve good statistical fit with the Rasch model. To address this challenge, we combine the Rasch model with non-negative matrix factorisation based topic modelling, jointly fitting both models. We demonstrate the suitability of our approach with quantitative experiments on data from three Coursera MOOCs, and with qualitative survey results on topic interpretability on a Discrete Optimisation MOOC.
In this paper we present the approach of introducing thesaurus knowledge into probabilistic topic models. The main idea of the approach is based on the assumption that the frequencies of semantically related words and phrases, which are met in the same texts, should be enhanced: this action leads to their larger contribution into topics found in these texts. We have conducted experiments with several thesauri and found that for improving topic models, it is useful to utilize domain-specific knowledge. If a general thesaurus, such as WordNet, is used, the thesaurus-based improvement of topic models can be achieved with excluding hyponymy relations in combined topic models.
Topic detection is a challenging task, especially without knowing the exact number of topics. In this paper, we present a novel approach based on neural network to detect topics in the micro-blogging dataset. We use an unsupervised neural sentence embedding model to map the blogs to an embedding space. Our model is a weighted power mean word embedding model, and the weights are calculated by attention mechanism. Experimental result shows our embedding method performs better than baselines in sentence clustering. In addition, we propose an improved clustering algorithm referred as relationship-aware DBSCAN (RADBSCAN). It can discover topics from a micro-blogging dataset, and the topic number depends on dataset character itself. Moreover, in order to solve the problem of parameters sensitive, we take blog forwarding relationship as a bridge of two independent clusters. Finally, we validate our approach on a dataset from sina micro-blog. The result shows that we can detect all the topics successfully and extract keywords in each topic.
Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. This corresponds to the concept of "instability" which has previously been studied in the context of $k$-means clustering. In many applications of topic modeling, this problem of instability is not considered and topic models are treated as being definitive, even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability, while simultaneously yielding more accurate topic models.