MPTopic: Improving topic modeling via Masked Permuted pre-training

Sep 02, 2023
Xinche Zhang, Evangelos Milios

Topic modeling is pivotal in discerning hidden semantic structures within texts, thereby generating meaningful descriptive keywords. While innovative techniques like BERTopic and Top2Vec have recently emerged at the forefront, they manifest certain limitations. Our analysis indicates that these methods might not prioritize the refinement of their clustering mechanism, potentially compromising the quality of the derived topic clusters. To illustrate, Top2Vec designates the centroids of clustering results to represent topics, whereas BERTopic harnesses c-TF-IDF for its topic extraction. In response to these challenges, we introduce "TF-RDF" (Term Frequency - Relative Document Frequency), a distinctive approach to assessing the relevance of terms within a document. Building on the strengths of TF-RDF, we present MPTopic, a clustering algorithm intrinsically driven by the insights of TF-RDF. Comprehensive evaluation shows that the topic keywords identified with the synergy of MPTopic and TF-RDF outperform those extracted by both BERTopic and Top2Vec.
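
For readers unfamiliar with the baseline named above, the following is a minimal sketch of class-based TF-IDF keyword extraction in the spirit of BERTopic's c-TF-IDF. It assumes the commonly cited weighting tf(t, c) * log(1 + A / f(t)), where A is the average number of tokens per cluster and f(t) is the frequency of term t across all clusters. The function names `c_tf_idf` and `top_keywords` and the toy data are hypothetical. This is not the BERTopic library's code, and it is not TF-RDF, whose definition is not given in this abstract.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_cluster):
    """Score terms per cluster; docs_by_cluster maps cluster id -> list of token lists."""
    # Treat each cluster as one concatenated document and count its terms.
    tf = {
        cluster: Counter(tok for doc in docs for tok in doc)
        for cluster, docs in docs_by_cluster.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of tokens per cluster (the "A" in the assumed formula).
    avg_tokens = sum(total.values()) / len(tf)
    return {
        cluster: {
            term: n * math.log(1 + avg_tokens / total[term])
            for term, n in counts.items()
        }
        for cluster, counts in tf.items()
    }


def top_keywords(scores, k=3):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        cluster: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for cluster, s in scores.items()
    }


# Toy, made-up clusters purely for illustration.
clusters = {
    0: [["cat", "dog", "pet"], ["cat", "pet"]],
    1: [["stock", "market", "pet"], ["stock", "trade"]],
}
print(top_keywords(c_tf_idf(clusters), k=2))
```

Note that the cluster assignments are an input here: the weighting only describes clusters that already exist, which is the separation between clustering and keyword extraction that the abstract questions.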

* 12 pages, will submit to ECIR 2024  