"Topic Modeling": models, code, and papers

Large Scale Analysis of Open MOOC Reviews to Support Learners' Course Selection

Jan 11, 2022
Manuel J. Gomez, Mario Calderón, Victor Sánchez, Félix J. García Clemente, José A. Ruipérez-Valiente

The recent pandemic has changed the way we see education. It is not surprising that children and college students are not the only ones using online education. Millions of adults have signed up for online classes and courses in recent years, and MOOC providers, such as Coursera or edX, are reporting millions of new users signing up on their platforms. However, students do face some challenges when choosing courses. Though online review systems are standard among many verticals, no standardized or fully decentralized review systems exist in the MOOC ecosystem. In this vein, we believe that there is an opportunity to leverage available open MOOC reviews in order to build simpler and more transparent reviewing systems, allowing users to identify the best courses available. Specifically, in our research we analyze 2.4 million reviews (the largest MOOC reviews dataset used until now) from five different platforms in order to determine the following: (1) if the numeric ratings provide discriminant information to learners, (2) if NLP-driven sentiment analysis on textual reviews could provide valuable information to learners, (3) if we can leverage NLP-driven topic finding techniques to infer themes that could be important for learners, and (4) if we can use these models to effectively characterize MOOCs based on the open reviews. Results show that numeric ratings are clearly biased (63% of them are 5-star ratings), and the topic modeling reveals some interesting topics related to course advertisements, the real applicability, or the difficulty of the different courses. We expect our study to shed some light on the area and promote a more transparent approach in online education reviews, which are becoming more and more popular as we enter the post-pandemic era.

* 36 pages, 8 figures 

Discriminative Topic Modeling with Logistic LDA

Sep 03, 2019
Iryna Korshunova, Hanchen Xiong, Mateusz Fedoryszak, Lucas Theis

Despite many years of research into latent Dirichlet allocation (LDA), applying LDA to collections of non-categorical items is still challenging. Yet many problems with much richer data share a similar structure and could benefit from the vast literature on LDA. We propose logistic LDA, a novel discriminative variant of latent Dirichlet allocation which is easy to apply to arbitrary inputs. In particular, our model can easily be applied to groups of images and arbitrary text embeddings, and it integrates well with deep neural networks. Although it is a discriminative model, we show that logistic LDA can learn from unlabeled data in an unsupervised manner by exploiting the group structure present in the data. In contrast to other recent topic models designed to handle arbitrary inputs, our model does not sacrifice the interpretability and principled motivation of LDA.

* Advances in Neural Information Processing Systems 32, 2019  

What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)

Feb 20, 2018
Amritanshu Agrawal, Wei Fu, Tim Menzies

Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet allocation. When run on different datasets, LDA suffers from "order effects", i.e., different topics are generated if the order of training data is shuffled. Such order effects introduce a systematic error for any study. This error can lead to misleading results; specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results. Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis. Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stack Overflow), title and abstract text of thousands of Software Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark); across different platforms (Linux, Macintosh); and for different kinds of LDAs (VEM, or using Gibbs sampling). Results were scored via topic stability and text mining classification accuracy. Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE's tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performances for supervised as well as unsupervised learning. Conclusion: Due to topic instability, using standard LDA with its "off-the-shelf" settings should now be deprecated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.

* Information and Software Technology Journal, 2018  
* 15 pages + 2 page references. Accepted to IST 
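
The "order effects" described in this abstract can be made concrete with a small stability score. The sketch below is only an illustration and not necessarily the score used by LDADE: it assumes each LDA run is summarized by the top words of its topics, and it reports the median best-match Jaccard overlap across pairs of runs. The function names and the toy word lists are hypothetical.

```python
from itertools import combinations
from statistics import median


def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def topic_stability(runs):
    """Median best-match Jaccard overlap of topics across pairs of runs.

    `runs` is a list of runs; each run is a list of topics; each topic is
    a list of its top words. A value of 1.0 means every topic reappears
    with the same top words in every other run.
    """
    scores = []
    for run_a, run_b in combinations(runs, 2):
        for topic in run_a:
            # Topic indices are arbitrary across runs, so take the best match.
            scores.append(max(jaccard(topic, other) for other in run_b))
    return median(scores)


# Hypothetical top words from three runs on shuffled copies of one corpus.
runs = [
    [["bug", "crash", "fix"], ["test", "unit", "mock"]],
    [["test", "unit", "mock"], ["bug", "crash", "fix"]],
    [["bug", "test", "fix"], ["unit", "mock", "crash"]],
]
stability = topic_stability(runs)
print(stability)
```

The first two runs differ only in topic order, so they count as fully stable; the third run mixes words between topics and pulls the median down.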

JST-RR Model: Joint Modeling of Ratings and Reviews in Sentiment-Topic Prediction

Feb 18, 2021
Qiao Liang, Shyam Ranganathan, Kaibo Wang, Xinwei Deng

Analysis of online reviews has attracted great attention with broad applications. Oftentimes, the textual reviews are coupled with the numerical ratings in the data. In this work, we propose a probabilistic model to accommodate both textual reviews and overall ratings with consideration of their intrinsic connection for a joint sentiment-topic prediction. The key to the proposed method is to develop a unified generative model where the topic modeling is constructed based on review texts and the sentiment prediction is obtained by combining review texts and overall ratings. The inference of model parameters is obtained by an efficient Gibbs sampling procedure. The proposed method can enhance the prediction accuracy of review data and achieve an effective detection of interpretable topics and sentiments. The merits of the proposed method are elaborated by a case study on Amazon datasets and by simulation studies.

TeCoMiner: Topic Discovery Through Term Community Detection

Mar 23, 2021
Andreas Hamm, Jana Thelen, Rasmus Beckmann, Simon Odrowski

This note is a short description of TeCoMiner, an interactive tool for exploring the topic content of text collections. Unlike other topic modeling tools, TeCoMiner is not based on a generative probabilistic model but on topological considerations about co-occurrence networks of terms. We outline the methods used for identifying topics, describe the features of the tool, and sketch an application, using a corpus of policy-related scientific news on environmental issues published by the European Commission over the last decade.

* 8 pages, 4 figures 
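
To illustrate what a co-occurrence network of terms looks like, here is a minimal sketch. It is not TeCoMiner's algorithm: as a deliberately simple stand-in for community detection, it keeps only term pairs that co-occur in at least `min_count` documents and returns the connected components of the resulting graph. All names and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(docs):
    """Count how many documents contain each unordered pair of terms."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return counts


def term_groups(docs, min_count=2):
    """Group terms linked by pairs seen in at least `min_count` documents."""
    graph = defaultdict(set)
    for (a, b), n in cooccurrence_counts(docs).items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(graph[term] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


# Hypothetical tokenized documents.
docs = [
    ["climate", "emission", "carbon"],
    ["carbon", "emission", "policy"],
    ["water", "river", "pollution"],
    ["river", "water", "flood"],
]
groups = term_groups(docs)
print(groups)
```

Connected components are a crude grouping; a real community detection method would also split densely connected graphs into several topics.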

Towards Better Understanding with Uniformity and Explicit Regularization of Embeddings in Embedding-based Neural Topic Models

Jun 16, 2022
Wei Shao, Lei Huang, Shuqi Liu, Shihua Ma, Linqi Song

Embedding-based neural topic models could explicitly represent words and topics by embedding them to a homogeneous feature space, which shows higher interpretability. However, there are no explicit constraints for the training of embeddings, leading to a larger optimization space. Also, a clear description of the changes in embeddings and the impact on model performance is still lacking. In this paper, we propose an embedding regularized neural topic model, which applies the specially designed training constraints on word embedding and topic embedding to reduce the optimization space of parameters. To reveal the changes and roles of embeddings, we introduce uniformity into the embedding-based neural topic model as the evaluation metric of embedding space. On this basis, we describe how embeddings tend to change during training via the changes in the uniformity of embeddings. Furthermore, we demonstrate the impact of changes in embeddings in embedding-based neural topic models through ablation studies. The results of experiments on two mainstream datasets indicate that our model significantly outperforms baseline models in terms of the harmony between topic quality and document modeling. This work is the first attempt to exploit uniformity to explore changes in embeddings of embedding-based neural topic models and their impact on model performance to the best of our knowledge.

* IJCNN 2022 
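
The abstract does not define uniformity. The sketch below assumes the pairwise Gaussian-potential form that is common in the representation learning literature (the log of the mean of exp(-t * squared distance) over all pairs of unit-normalized vectors); whether the paper uses exactly this form, and the default `t = 2`, are assumptions. Lower values indicate embeddings spread more evenly over the unit sphere.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all pairs of unit vectors.

    Assumed definition (see lead-in); values are <= 0, and lower means
    the embeddings are spread more uniformly.
    """
    unit = [normalize(v) for v in embeddings]
    potentials = []
    for a, b in combinations(unit, 2):
        sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
        potentials.append(math.exp(-t * sq_dist))
    return math.log(sum(potentials) / len(potentials))


# Hypothetical 2-D embeddings: one set spread out, one set clumped together.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clumped = [[1.0, 0.0], [1.0, 0.1], [1.0, -0.1], [1.0, 0.05]]
print(uniformity(spread), uniformity(clumped))
```

Tracking such a score for word and topic embeddings at each training epoch is one way to observe how the embedding space changes during training.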

Topic Community Based Temporal Expertise for Question Routing

Jul 05, 2022
Vaibhav Krishna, Vaiva Vasiliauskaite, Nino Antulov-Fantulin

Question Routing (QR) in Community-based Question Answering (CQA) websites aims at recommending newly posted questions to potential users who are most likely to provide "accepted answers". Most of the existing approaches predict users' expertise based on their past question answering behavior and the content of new questions. However, these approaches suffer from challenges in three aspects: 1) sparsity of users' past records results in a lack of personalized recommendation that at times does not match users' interest or domain expertise, 2) modeling based on the content of all questions and answers makes periodic updates computationally expensive, and 3) while CQA sites are highly dynamic, they are mostly considered as static. This paper proposes a novel approach to QR that addresses the above challenges. It is based on dynamic modeling of users' activity on topic communities. Experimental results on three real-world datasets demonstrate that the proposed model significantly outperforms competitive baseline models.

Auto-Encoding Variational Bayes for Inferring Topics and Visualization

Oct 25, 2020
Dang Pham, Tuan M. V. Le

Visualization and topic modeling are widely used approaches for text analysis. Traditional visualization methods find low-dimensional representations of documents in the visualization space (typically 2D or 3D) that can be displayed using a scatterplot. In contrast, topic modeling aims to discover topics from text, but for visualization, one needs to perform a post-hoc embedding using dimensionality reduction methods. Recent approaches propose using a generative model to jointly find topics and visualization, allowing the semantics to be infused in the visualization space for a meaningful interpretation. A major challenge that prevents these methods from being used practically is the scalability of their inference algorithms. We present, to the best of our knowledge, the first fast Auto-Encoding Variational Bayes based inference method for jointly inferring topics and visualization. Since our method is black box, it can handle model changes efficiently with little mathematical rederivation effort. We demonstrate the efficiency and effectiveness of our method on real-world large datasets and compare it with existing baselines.

* Accepted at the 28th International Conference on Computational Linguistics (COLING 2020) 

Hotel Preference Rank based on Online Customer Review

Oct 10, 2021
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Fajar Ibnu Fatihan

Topline hotels are now shifting to digital ways of understanding their customers in order to maintain and ensure satisfaction. Rather than relying on the conventional approach of written reviews or interviews, hotels are now investing heavily in Artificial Intelligence, particularly Machine Learning solutions. Analysis of online customer reviews changes the way companies make decisions in a more effective way than conventional analysis. The purpose of this research is to measure hotel service quality. The proposed approach emphasizes service quality dimensions in reviews of the top five luxury hotels in Indonesia that appear on the online travel site TripAdvisor, based on its Best of 2018 section. In this research, we use a model based on a simple Bayesian classifier to classify each customer review into one of the service quality dimensions. Our model was able to separate the classes properly, as measured by accuracy, kappa, recall, precision, and F-measure. To uncover latent topics in the customers' opinions, we use Topic Modeling. We found that the most common issue concerns responsiveness, as it received the lowest percentage compared to the other dimensions. Our research provides end customers with a faster overview of hotel rank by service quality, based on a summary of previous online reviews.

* Test Engineering and Management, Vol. 83: March/April 2020  
* 5 pages, 6 figures, 5 tables 