Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

'Warriors of the Word' -- Deciphering Lyrical Topics in Music and Their Connection to Audio Feature Dimensions Based on a Corpus of Over 100,000 Metal Songs

Nov 13, 2019
Isabella Czedik-Eysenberg, Oliver Wieczorek, Christoph Reuter

We look into the connection between the musical and lyrical content of metal music by combining automated extraction of high-level audio features and quantitative text analysis on a corpus of 124.288 song lyrics from this genre. Based on this text corpus, a topic model was first constructed using Latent Dirichlet Allocation (LDA). For a subsample of 503 songs, scores for predicting perceived musical hardness/heaviness and darkness/gloominess were extracted using audio feature models. By combining both audio feature and text analysis, we (1) offer a comprehensive overview of the lyrical topics present within the metal genre and (2) are able to establish whether or not levels of hardness and other music dimensions are associated with the occurrence of particularly harsh (and other) textual topics. Twenty typical topics were identified and projected into a topic space using multidimensional scaling (MDS). After Bonferroni correction, positive correlations were found between musical hardness and darkness and textual topics dealing with 'brutal death', 'dystopia', 'archaisms and occultism', 'religion and satanism', 'battle' and '(psychological) madness', while there is a negative associations with topics like 'personal life' and 'love and romance'.

* Corrected typo in abstract (subsample) 

  Access Paper or Ask Questions

Topics and Label Propagation: Best of Both Worlds for Weakly Supervised Text Classification

Dec 04, 2017
Sachin Pawar, Nitin Ramrakhiyani, Swapnil Hingmire, Girish K. Palshikar

We propose a Label Propagation based algorithm for weakly supervised text classification. We construct a graph where each document is represented by a node and edge weights represent similarities among the documents. Additionally, we discover underlying topics using Latent Dirichlet Allocation (LDA) and enrich the document graph by including the topics in the form of additional nodes. The edge weights between a topic and a text document represent level of "affinity" between them. Our approach does not require document level labelling, instead it expects manual labels only for topic nodes. This significantly minimizes the level of supervision needed as only a few topics are observed to be enough for achieving sufficiently high accuracy. The Label Propagation Algorithm is employed on this enriched graph to propagate labels among the nodes. Our approach combines the advantages of Label Propagation (through document-document similarities) and Topic Modelling (for minimal but smart supervision). We demonstrate the effectiveness of our approach on various datasets and compare with state-of-the-art weakly supervised text classification approaches.

  Access Paper or Ask Questions

Crosslingual Topic Modeling with WikiPDA

Sep 23, 2020
Tiziano Piccardi, Robert West

We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning.

* 10 pages, first version 

  Access Paper or Ask Questions

Combinatorial Topic Models using Small-Variance Asymptotics

May 27, 2016
Ke Jiang, Suvrit Sra, Brian Kulis

Topic models have emerged as fundamental tools in unsupervised machine learning. Most modern topic modeling algorithms take a probabilistic view and derive inference algorithms based on Latent Dirichlet Allocation (LDA) or its variants. In contrast, we study topic modeling as a combinatorial optimization problem, and propose a new objective function derived from LDA by passing to the small-variance limit. We minimize the derived objective by using ideas from combinatorial optimization, which results in a new, fast, and high-quality topic modeling algorithm. In particular, we show that our results are competitive with popular LDA-based topic modeling approaches, and also discuss the (dis)similarities between our approach and its probabilistic counterparts.

* 19 pages 

  Access Paper or Ask Questions

Topic Modelling Meets Deep Neural Networks: A Survey

Feb 28, 2021
He Zhao, Dinh Phung, Viet Huynh, Yuan Jin, Lan Du, Wray Buntine

Topic modelling has been a successful technique for text analysis for almost twenty years. When topic modelling met deep neural networks, there emerged a new and increasingly popular research area, neural topic models, with over a hundred models developed and a wide range of applications in neural language understanding such as text generation, summarisation and language models. There is a need to summarise research developments and discuss open problems and future directions. In this paper, we provide a focused yet comprehensive overview of neural topic models for interested researchers in the AI community, so as to facilitate them to navigate and innovate in this fast-growing research area. To the best of our knowledge, ours is the first review focusing on this specific topic.

* A review on Neural Topic Models 

  Access Paper or Ask Questions

SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

Feb 13, 2019
Sebastian Arnold, Rudolf Schneider, Philippe Cudré-Mauroux, Felix A. Gers, Alexander Löser

When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available dataset with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation.

* Author's final version, accepted for publication at TACL, 2019 

  Access Paper or Ask Questions

What is Wrong with Topic Modeling? (and How to Fix it Using Search-based Software Engineering)

Feb 20, 2018
Amritanshu Agrawal, Wei Fu, Tim Menzies

Context: Topic modeling finds human-readable structures in unstructured textual data. A widely used topic modeler is Latent Dirichlet allocation. When run on different datasets, LDA suffers from "order effects" i.e. different topics are generated if the order of training data is shuffled. Such order effects introduce a systematic error for any study. This error can relate to misleading results;specifically, inaccurate topic descriptions and a reduction in the efficacy of text mining classification results. Objective: To provide a method in which distributions generated by LDA are more stable and can be used for further analysis. Method: We use LDADE, a search-based software engineering tool that tunes LDA's parameters using DE (Differential Evolution). LDADE is evaluated on data from a programmer information exchange site (Stackoverflow), title and abstract text of thousands ofSoftware Engineering (SE) papers, and software defect reports from NASA. Results were collected across different implementations of LDA (Python+Scikit-Learn, Scala+Spark); across different platforms (Linux, Macintosh) and for different kinds of LDAs (VEM,or using Gibbs sampling). Results were scored via topic stability and text mining classification accuracy. Results: In all treatments: (i) standard LDA exhibits very large topic instability; (ii) LDADE's tunings dramatically reduce cluster instability; (iii) LDADE also leads to improved performances for supervised as well as unsupervised learning. Conclusion: Due to topic instability, using standard LDA with its "off-the-shelf" settings should now be depreciated. Also, in future, we should require SE papers that use LDA to test and (if needed) mitigate LDA topic instability. Finally, LDADE is a candidate technology for effectively and efficiently reducing that instability.

* Information and Software Technology Journal, 2018 
* 15 pages + 2 page references. Accepted to IST 

  Access Paper or Ask Questions

A Comprehensive Analysis of Twitter Trending Topics

Jul 21, 2019
Issa Annamoradnejad, Jafar Habibi

Twitter is among the most used microblogging and online social networking services. In Twitter, a name, phrase, or topic that is mentioned at a greater rate than others is called a "trending topic" or simply "trend". Twitter trends has shown their powerful ability in many public events, elections and market changes. Nevertheless, there has been very few works focusing on understanding the dynamics of these trending topics. In this article, we thoroughly examined the Twitter's trending topics of 2018. To this end, we accessed Twitter's trends API for the full year of 2018 and devised six criteria to analyze our dataset. These six criteria are: lexical analysis, time to reach, trend reoccurrence, trending time, tweets count, and language analysis. In addition to providing general statistics and top trending topics regarding each criterion, we computed several distributions that explain this bulk of data.

* 2019 5th International Conference on Web Research (ICWR), Tehran, Iran, 2019, pp. 22-27 
* 6 pages, 8 figures, 3 tables, conference paper 

  Access Paper or Ask Questions

Continuous Time Dynamic Topic Models

May 16, 2015
Chong Wang, David Blei, David Heckerman

In this paper, we develop the continuous time dynamic topic model (cDTM). The cDTM is a dynamic topic model that uses Brownian motion to model the latent topics through a sequential collection of documents, where a "topic" is a pattern of word use that we expect to evolve over the course of the collection. We derive an efficient variational approximate inference algorithm that takes advantage of the sparsity of observations in text, a property that lets us easily handle many time points. In contrast to the cDTM, the original discrete-time dynamic topic model (dDTM) requires that time be discretized. Moreover, the complexity of variational inference for the dDTM grows quickly as time granularity increases, a drawback which limits fine-grained discretization. We demonstrate the cDTM on two news corpora, reporting both predictive perplexity and the novel task of time stamp prediction.

* Appears in Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI2008) 

  Access Paper or Ask Questions