Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Conical Classification For Computationally Efficient One-Class Topic Determination

Oct 31, 2021
Sameer Khanna

As the Internet grows in size, so does the amount of text based information that exists. For many application spaces it is paramount to isolate and identify texts that relate to a particular topic. While one-class classification would be ideal for such analysis, there is a relative lack of research regarding efficient approaches with high predictive power. By noting that the range of documents we wish to identify can be represented as positive linear combinations of the Vector Space Model representing our text, we propose Conical classification, an approach that allows us to identify if a document is of a particular topic in a computationally efficient manner. We also propose Normal Exclusion, a modified version of Bi-Normal Separation that makes it more suitable within the one-class classification context. We show in our analysis that our approach not only has higher predictive power on our datasets, but is also faster to compute.

* Findings in Empirical Methods in Natural Language Processing 2021 

  Access Paper or Ask Questions

Empath: Understanding Topic Signals in Large-Scale Text

Feb 22, 2016
Ethan Fast, Binbin Chen, Michael Bernstein

Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.

* CHI: ACM Conference on Human Factors in Computing Systems 2016 

  Access Paper or Ask Questions

Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

Jun 23, 2018
Christoph Treude, Markus Wagner

To make sense of large amounts of textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima, (ii) an a-posteriori characterisation of text corpora related to eight programming languages from GitHub and Stack Overflow, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration.

  Access Paper or Ask Questions

Semantic Topic Analysis of Traffic Camera Images

Sep 27, 2018
Jeffrey Liu, Andrew Weinert, Saurabh Amin

Traffic cameras are commonly deployed monitoring components in road infrastructure networks, providing operators visual information about conditions at critical points in the network. However, human observers are often limited in their ability to process simultaneous information sources. Recent advancements in computer vision, driven by deep learning methods, have enabled general object recognition, unlocking opportunities for camera-based sensing beyond the existing human observer paradigm. In this paper, we present a Natural Language Processing (NLP)-inspired approach, entitled Bag-of-Label-Words (BoLW), for analyzing image data sets using exclusively textual labels. The BoLW model represents the data in a conventional matrix form, enabling data compression and decomposition techniques, while preserving semantic interpretability. We apply the Latent Dirichlet Allocation (LDA) topic model to decompose the label data into a small number of semantic topics. To illustrate our approach, we use freeway camera images collected from the Boston area between December 2017-January 2018. We analyze the cameras' sensitivity to weather events; identify temporal traffic patterns; and analyze the impact of infrequent events, such as the winter holidays and the "bomb cyclone" winter storm. This study demonstrates the flexibility of our approach, which allows us to analyze weather events and freeway traffic using only traffic camera image labels.

* To be presented at IEEE-ITSC 2018, Nov 3-7 2018 

  Access Paper or Ask Questions

A high-reproducibility and high-accuracy method for automated topic classification

Feb 03, 2014
Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, Luís A. Nunes Amaral

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state-of-the-art in topic classification. Here, we perform a systematic theoretical and numerical analysis that demonstrates that current optimization techniques for LDA often yield results which are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm which displays high-reproducibility and high-accuracy, and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.

* 23 pages, 24 figures 

  Access Paper or Ask Questions

Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data

Jun 18, 2012
Dongwoo Kim, Suin Kim, Alice Oh

We describe a nonparametric topic model for labeled data. The model uses a mixture of random measures (MRM) as a base distribution of the Dirichlet process (DP) of the HDP framework, so we call it the DP-MRM. To model labeled data, we define a DP distributed random measure for each label, and the resulting model generates an unbounded number of topics for each label. We apply DP-MRM on single-labeled and multi-labeled corpora of documents and compare the performance on label prediction with MedLDA, LDA-SVM, and Labeled-LDA. We further enhance the model by incorporating ddCRP and modeling multi-labeled images for image segmentation and object labeling, comparing the performance with nCuts and rddCRP.

* ICML2012 

  Access Paper or Ask Questions

LA-LDA: A Limited Attention Topic Model for Social Recommendation

Jan 26, 2013
Jeon-Hyung Kang, Kristina Lerman, Lise Getoor

Social media users have finite attention which limits the number of incoming messages from friends they can process. Moreover, they pay more attention to opinions and recommendations of some friends more than others. In this paper, we propose LA-LDA, a latent topic model which incorporates limited, non-uniformly divided attention in the diffusion process by which opinions and information spread on the social network. We show that our proposed model is able to learn more accurate user models from users' social network and item adoption behavior than models which do not take limited attention into account. We analyze voting on news items on the social news aggregator Digg and show that our proposed model is better able to predict held out votes than alternative models. Our study demonstrates that psycho-socially motivated models have better ability to describe and predict observed behavior than models which only consider topics.

* The 2013 International Conference on Social Computing, Behavioral-Cultural Modeling, & Prediction (SBP 2013) 

  Access Paper or Ask Questions

Open-domain Topic Identification of Out-of-domain Utterances using Wikipedia

Jan 26, 2021
A. Augustin, A. Papangelis, M. Kotti, P. Vougiouklis, J. Hare, N. Braunschweiler

Users of spoken dialogue systems (SDS) expect high quality interactions across a wide range of diverse topics. However, the implementation of SDS capable of responding to every conceivable user utterance in an informative way is a challenging problem. Multi-domain SDS must necessarily identify and deal with out-of-domain (OOD) utterances to generate appropriate responses as users do not always know in advance what domains the SDS can handle. To address this problem, we extend the current state-of-the-art in multi-domain SDS by estimating the topic of OOD utterances using external knowledge representation from Wikipedia. Experimental results on real human-to-human dialogues showed that our approach does not degrade domain prediction performance when compared to the base model. But more significantly, our joint training achieves more accurate predictions of the nearest Wikipedia article by up to about 30% when compared to the benchmarks.

  Access Paper or Ask Questions

Entropic selection of concepts unveils hidden topics in documents corpora

May 11, 2018
Andrea Martini, Alessio Cardillo, Paolo De Los Rios

The organization and evolution of science has recently become itself an object of scientific quantitative investigation, thanks to the wealth of information that can be extracted from scientific documents, such as citations between papers and co-authorship between researchers. However, only few studies have focused on the concepts that characterize full documents and that can be extracted and analyzed, revealing the deeper organization of scientific knowledge. Unfortunately, several concepts can be so common across documents that they hinder the emergence of the underlying topical structure of the document corpus, because they give rise to a large amount of spurious and trivial relations among documents. To identify and remove common concepts, we introduce a method to gauge their relevance according to an objective information-theoretic measure related to the statistics of their occurrence across the document corpus. After progressively removing concepts that, according to this metric, can be considered as generic, we find that the topic organization displays a correspondingly more refined structure.

* Main + SI. Major overhaul after first round of review 

  Access Paper or Ask Questions

Topic Modeling Based Extractive Text Summarization

Jun 29, 2021
Kalliath Abdul Rasheed Issam, Shivam Patel, Subalalitha C. N

Text summarization is an approach for identifying important information present within text documents. This computational technique aims to generate shorter versions of the source text, by including only the relevant and salient information present within the source text. In this paper, we propose a novel method to summarize a text document by clustering its contents based on latent topics produced using topic modeling techniques and by generating extractive summaries for each of the identified text clusters. All extractive sub-summaries are later combined to generate a summary for any given source document. We utilize the lesser used and challenging WikiHow dataset in our approach to text summarization. This dataset is unlike the commonly used news datasets which are available for text summarization. The well-known news datasets present their most important information in the first few lines of their source texts, which make their summarization a lesser challenging task when compared to summarizing the WikiHow dataset. Contrary to these news datasets, the documents in the WikiHow dataset are written using a generalized approach and have lesser abstractedness and higher compression ratio, thus proposing a greater challenge to generate summaries. A lot of the current state-of-the-art text summarization techniques tend to eliminate important information present in source documents in the favor of brevity. Our proposed technique aims to capture all the varied information present in source documents. Although the dataset proved challenging, after performing extensive tests within our experimental setup, we have discovered that our model produces encouraging ROUGE results and summaries when compared to the other published extractive and abstractive text summarization models.

* International Journal of Innovative Technology and Exploring Engineering, Volume-9 Issue-6, April 2020, Page No. 1710-1719 
* 10 pages, 13 figures, 3 tables 

  Access Paper or Ask Questions