"Topic Modeling": models, code, and papers

Analysis of Legal Documents via Non-negative Matrix Factorization Methods

Apr 28, 2021
Ryan Budahazy, Lu Cheng, Yihuan Huang, Andrew Johnson, Pengyu Li, Joshua Vendrow, Zhoutong Wu, Denali Molitor, Elizaveta Rebrova, Deanna Needell

The California Innocence Project (CIP), a clinical law school program aiming to free wrongfully convicted prisoners, evaluates thousands of pieces of mail containing new requests for assistance and corresponding case files. Processing and interpreting this large amount of information presents a significant challenge for CIP officials, one that topic modeling techniques can help address. In this paper, we apply the Non-negative Matrix Factorization (NMF) method and implement various offshoots of it on the important and previously unstudied data set compiled by CIP. We identify underlying topics of existing case files and classify request files by crime type and case status (decision type). The results uncover the semantic structure of current case files and can provide CIP officials with a general understanding of newly received case files before further examination. We also provide an exposition of popular variants of NMF with their experimental results and discuss the benefits and drawbacks of each variant through the real-world application.

* 16 pages, 4 figures
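
To make the NMF idea in the abstract above concrete, here is a minimal sketch that factors a small document-term matrix into document-topic weights W and topic-term weights H. Everything in it is an assumption for illustration: the toy vocabulary and counts are invented, and the Lee-Seung multiplicative updates are one standard way to fit NMF, not necessarily the variant the authors used.

```python
import random

def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def nmf(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (docs x terms) by W (docs x topics) times H (topics x terms).

    Uses multiplicative updates for the squared Frobenius error; all entries
    of W and H stay non-negative because each update multiplies by a
    non-negative ratio.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def frobenius_error(X, W, H):
    """Frobenius norm of the reconstruction residual X - WH."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for xr, yr in zip(X, WH) for x, y in zip(xr, yr)) ** 0.5

# Hypothetical toy corpus: rows are documents, columns are term counts.
vocab = ["robbery", "weapon", "appeal", "dna", "evidence", "parole"]
X = [
    [3, 2, 0, 0, 1, 0],
    [2, 3, 0, 0, 0, 0],
    [0, 0, 2, 3, 2, 0],
    [0, 0, 3, 2, 3, 1],
]

W, H = nmf(X, n_topics=2)
for k, row in enumerate(H):
    top = sorted(range(len(vocab)), key=lambda j: -row[j])[:3]
    print("topic", k, [vocab[j] for j in top])
print("reconstruction error:", round(frobenius_error(X, W, H), 3))
```

Each row of H can be read as a topic (its largest entries are the topic's characteristic terms), and each row of W gives the topic weights of one document, which is the kind of representation a downstream classifier could use.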

SimLDA: A tool for topic model evaluation

Aug 19, 2022
Rebecca M. C. Taylor, Johan A. du Preez

Variational Bayes (VB) applied to latent Dirichlet allocation (LDA) has become the most popular algorithm for aspect modeling. While sufficiently successful in text topic extraction from large corpora, VB is less successful in identifying aspects in the presence of limited data. We present a novel variational message passing algorithm as applied to Latent Dirichlet Allocation (LDA) and compare it with the gold standard VB and collapsed Gibbs sampling. In situations where marginalisation leads to non-conjugate messages, we use ideas from sampling to derive approximate update equations. In cases where conjugacy holds, Loopy Belief update (LBU) (also known as Lauritzen-Spiegelhalter) is used. Our algorithm, ALBU (approximate LBU), has strong similarities with Variational Message Passing (VMP) (which is the message passing variant of VB). To compare the performance of the algorithms in the presence of limited data, we use data sets consisting of tweets and news groups. Using coherence measures we show that ALBU learns latent distributions more accurately than does VB, especially for smaller data sets.

* Conference Proceedings

Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach

Apr 06, 2021
P. Schulze, S. Wiegrebe, P. W. Thurner, C. Heumann, M. Aßenmacher, S. Wankmüller

Topic models such as the Structural Topic Model (STM) estimate latent topical clusters within text. An important step in many topic modeling applications is to explore relationships between the discovered topical structure and metadata associated with the text documents. Methods used to estimate such relationships must take into account that the topical structure is not directly observed but is itself estimated. The authors of the STM, for instance, perform repeated OLS regressions of sampled topic proportions on metadata covariates by using a Monte Carlo sampling technique known as the method of composition. In this paper, we propose two improvements: first, we replace OLS with the more appropriate Beta regression. Second, we suggest a fully Bayesian approach instead of the current blending of frequentist and Bayesian methods. We demonstrate our improved methodology by exploring relationships between Twitter posts by German members of parliament (MPs) and different metadata covariates.

* 8 pages, 4 figures
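
The method of composition mentioned in the abstract above can be sketched roughly as follows: sample topic proportions from their approximate posterior, regress them on a covariate with OLS, draw a coefficient from the regression's approximate sampling distribution, and pool the draws. This is a simplified illustration under stated assumptions, not the STM implementation: it handles one topic and one covariate, and the per-document posterior is assumed to be a normal on the logit scale with hypothetical `logit_mean` and `logit_sd` values.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ols_slope(x, y):
    """Simple linear regression of y on x; returns (slope, standard error)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return slope, se

def method_of_composition(x, logit_mean, logit_sd, n_draws=200, seed=0):
    """Pool slope draws that reflect both topic and regression uncertainty."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Step 1: sample one topic proportion per document (assumed posterior).
        theta = [sigmoid(rng.gauss(m, s)) for m, s in zip(logit_mean, logit_sd)]
        # Step 2: regress the sampled proportions on the covariate.
        slope, se = ols_slope(x, theta)
        # Step 3: draw a coefficient around the OLS estimate.
        draws.append(rng.gauss(slope, se))
    return draws

# Hypothetical data: ten documents, one covariate scaled to [0, 1].
x = [i / 9 for i in range(10)]
logit_mean = [-2.0 + 3.0 * xi for xi in x]
logit_sd = [0.1] * len(x)

slope_draws = method_of_composition(x, logit_mean, logit_sd)
mean_slope = sum(slope_draws) / len(slope_draws)
print("pooled slope estimate:", round(mean_slope, 3))
```

Because the sampled proportions lie in (0, 1), a linear model with normal errors is a questionable fit, which is the motivation the abstract gives for replacing OLS with Beta regression.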

Helping users discover perspectives: Enhancing opinion mining with joint topic models

Oct 23, 2020
Tim Draws, Jody Liu, Nava Tintarev

Support or opposition concerning a debated claim such as "abortion should be legal" can have different underlying reasons, which we call perspectives. This paper explores how opinion mining can be enhanced with joint topic modeling to identify distinct perspectives within the topic, providing an informative overview from unstructured text. We evaluate four joint topic models (TAM, JST, VODUM, and LAM) in a user study assessing human understandability of the extracted perspectives. Based on the results, we conclude that joint topic models such as TAM can discover perspectives that align with human judgments. Moreover, our results suggest that users are not influenced by their pre-existing stance on the topic of abortion when interpreting the output of topic models.

* Accepted at the SENTIRE workshop at ICDM 2020: https://sentic.net/sentire/#2020

Restoring and Mining the Records of the Joseon Dynasty via Neural Language Modeling and Machine Translation

Apr 14, 2021
Kyeongpil Kang, Kyohoon Jin, Soyoung Yang, Sujin Jang, Jaegul Choo, Youngbin Kim

Understanding voluminous historical records provides clues about the past in various aspects, such as social and political issues and even natural science facts. However, it is generally difficult to fully utilize the historical records, since most of the documents are not written in a modern language and parts of the contents have been damaged over time. As a result, restoring the damaged or unrecognizable parts as well as translating the records into modern languages are crucial tasks. In response, we present a multi-task learning approach to restore and translate historical documents based on a self-attention mechanism, specifically utilizing two Korean historical records, which are among the most voluminous historical records in the world. Experimental results show that our approach significantly improves the accuracy of the translation task compared with baselines without multi-task learning. In addition, we present an in-depth exploratory analysis of our translated results via topic modeling, uncovering several significant historical events.

* Accepted to NAACL 2021

Improving Reliability of Latent Dirichlet Allocation by Assessing Its Stability Using Clustering Techniques on Replicated Runs

Feb 14, 2020
Jonas Rieger, Lars Koppers, Carsten Jentsch, Jörg Rahnenführer

For organizing large text corpora, topic modeling provides useful tools. A widely used method is Latent Dirichlet Allocation (LDA), a generative probabilistic model which models single texts in a collection of texts as mixtures of latent topics. The assignments of words to topics rely on initial values such that generally the outcome of LDA is not fully reproducible. In addition, the reassignment via Gibbs Sampling is based on conditional distributions, leading to different results in replicated runs on the same text data. This fact is often neglected in everyday practice. We aim to improve the reliability of LDA results. Therefore, we study the stability of LDA by comparing assignments from replicated runs. We propose to quantify the similarity of two generated topics by a modified Jaccard coefficient. Using such similarities, topics can be clustered. We propose a new pruning algorithm for hierarchical clustering results, based on the idea that two LDA runs create pairs of similar topics. This approach leads to the new measure S-CLOP (Similarity of multiple sets by Clustering with LOcal Pruning) for quantifying the stability of LDA models. We discuss some characteristics of this measure and illustrate it with an application to real data consisting of newspaper articles from USA Today. Our results show that the measure S-CLOP is useful for assessing the stability of LDA models or any other topic modeling procedure that characterizes its topics by word distributions. Based on the newly proposed measure for LDA stability, we propose a method to increase the reliability and hence to improve the reproducibility of empirical findings based on topic modeling. This increase in reliability is obtained by running the LDA several times and taking as prototype the most representative run, that is, the LDA run with the highest average similarity to all other runs.

* 16 pages, 2 figures
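
The prototype selection described in the abstract above (pick the run with the highest average similarity to all other runs) can be sketched as follows. Note the assumptions: the abstract does not define its modified Jaccard coefficient or the details of S-CLOP, so this sketch swaps in the plain Jaccard coefficient on top-word sets and a hypothetical run-to-run score (`run_similarity`, a symmetrized mean of best-matching topic similarities). The toy runs are invented.

```python
def jaccard(a, b):
    """Plain Jaccard coefficient of two word collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def run_similarity(run_a, run_b):
    """Hypothetical run-to-run score: symmetrized mean best-match Jaccard.

    This is a stand-in for S-CLOP, not the measure from the paper.
    """
    def one_way(src, dst):
        return sum(max(jaccard(t, u) for u in dst) for t in src) / len(src)
    return 0.5 * (one_way(run_a, run_b) + one_way(run_b, run_a))

def prototype_run(runs):
    """Index of the run with the highest average similarity to all other runs."""
    def avg_sim(i):
        others = [run_similarity(runs[i], r) for j, r in enumerate(runs) if j != i]
        return sum(others) / len(others)
    return max(range(len(runs)), key=avg_sim)

# Hypothetical top words per topic from three replicated LDA runs.
runs = [
    [["court", "judge", "trial"], ["game", "team", "season"]],
    [["court", "judge", "law"], ["game", "team", "player"]],
    [["stock", "market", "bank"], ["game", "team", "season"]],
]

best = prototype_run(runs)
print("prototype run index:", best)
```

In this toy example the first run shares topics with both of the others, so it is the most representative of the three.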

Topic Modeling based on Keywords and Context

Feb 03, 2018
Johannes Schneider

Current topic models often suffer from discovering topics not matching human intuition, unnatural switching of topics within documents, and high computational demands. We address these concerns by proposing a topic model and an inference algorithm based on automatically identifying characteristic keywords for topics. Keywords influence topic assignments of nearby words. Our algorithm learns (key)word-topic scores and self-regulates the number of topics. Inference is simple and easily parallelizable. Qualitative analysis yields comparable results to state-of-the-art models (e.g., LDA), but with different strengths and weaknesses. Quantitative analysis using 9 datasets shows gains in terms of classification accuracy, PMI score, computational performance, and consistency of topic assignments within documents, while most often using fewer topics.

* SIAM International Conference on Data Mining (SDM), 2018

SCAT: Second Chance Autoencoder for Textual Data

May 11, 2020
Somaieh Goudarzvand, Gharib Gharibi, Yugyung Lee

We present a k-competitive learning approach for textual autoencoders named Second Chance Autoencoder (SCAT). SCAT selects the k largest and smallest positive activations as the winner neurons, which gain the activation values of the loser neurons during the learning process, and thus focus on retrieving well-representative features for topics. Our experiments show that SCAT achieves outstanding performance in classification, topic modeling, and document visualization compared to LDA, K-Sparse, NVCTM, and KATE.

Exploring the social influence of Kaggle virtual community on the M5 competition

Feb 28, 2021
Xixi Li, Yun Bai, Yanfei Kang

One of the most significant differences between M5 and previous forecasting competitions is that it was held on Kaggle, an online community of data scientists and machine learning practitioners. On the Kaggle platform, people can form virtual communities through online notebooks and discussions to discuss their models, choice of features, loss functions, etc. This paper aims to study the social influence of virtual communities on the competition. We first study the content of the M5 virtual community by topic modeling and trend analysis. Further, we perform social media analysis to identify the potential relationship network of the virtual community. We identify some key roles in the network and study how they spread LightGBM-related information within the network. Overall, this study provides in-depth insights into the dynamic mechanism of the virtual community's influence on the participants and has potential implications for future online competitions.
