"Topic": models, code, and papers

Inter-Battery Topic Representation Learning

Jul 28, 2016
Cheng Zhang, Hedvig Kjellstrom, Carl Henrik Ek

In this paper, we present the Inter-Battery Topic Model (IBTM). Our approach extends traditional topic models by learning a factorized latent variable representation. The structured representation leads to a model that marries benefits traditionally associated with a discriminative approach, such as feature selection, with those of a generative model, such as principled regularization and the ability to handle missing data. The factorization is provided by representing data in terms of aligned pairs of observations as different views. This provides a means of selecting a representation that models topics that exist in both views separately from the topics that are unique to a single view. This structured consolidation allows for efficient and robust inference and provides a compact and efficient representation. Learning is performed in a Bayesian fashion by maximizing a rigorous bound on the log-likelihood. We first illustrate the benefits of the model on a synthetic dataset. The model is then evaluated in both uni- and multi-modality settings on two different classification tasks with off-the-shelf convolutional neural network (CNN) features, which generate state-of-the-art results with extremely compact representations.

* ECCV 2016 

Better Early than Late: Fusing Topics with Word Embeddings for Neural Question Paraphrase Identification

Jul 22, 2020
Nicole Peinelt, Dong Nguyen, Maria Liakata

Question paraphrase identification is a key task in Community Question Answering (CQA) to determine if an incoming question has been previously asked. Many current models use word embeddings to identify duplicate questions, but the use of topic models in feature-engineered systems suggests that they can be helpful for this task, too. We therefore propose two ways of merging topics with word embeddings (early vs. late fusion) in a new neural architecture for question paraphrase identification. Our results show that our system outperforms neural baselines on multiple CQA datasets, while an ablation study highlights the importance of topics and especially early topic-embedding fusion in our architecture.
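
The difference between the two fusion points can be shown with a toy sketch. It is not the authors' architecture: the `encode` function (a max-pool standing in for a neural encoder), the vectors, and all names are assumptions made for illustration. Early fusion joins topic information to each word before encoding; late fusion appends a document-level topic vector after encoding.

```python
# Toy illustration of early vs. late fusion; all values are made up.

def encode(vectors):
    """Stand-in encoder (assumption): element-wise max over the word vectors."""
    return [max(column) for column in zip(*vectors)]

def early_fusion(word_embeddings, word_topics):
    """Concatenate each word's embedding with its topic vector, then encode."""
    fused = [emb + top for emb, top in zip(word_embeddings, word_topics)]
    return encode(fused)

def late_fusion(word_embeddings, doc_topics):
    """Encode the word embeddings first, then append a document-level topic vector."""
    return encode(word_embeddings) + doc_topics

word_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # two words, 2-d embeddings
word_topics = [[0.5, 0.5], [1.0, 0.0]]      # hypothetical per-word topic distributions
doc_topics = [0.75, 0.25]                   # hypothetical document-level topic distribution

early = early_fusion(word_embeddings, word_topics)
late = late_fusion(word_embeddings, doc_topics)
```

Both variants produce a vector of the same length, but in the early variant the encoder sees topic information word by word, which is the property the ablation study in the abstract singles out.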

Document Informed Neural Autoregressive Topic Models

Aug 11, 2018
Pankaj Gupta, Florian Buettner, Hinrich Schütze

Context information around words helps in determining their actual meaning, for example "networks" used in the context of artificial neural networks or biological neuron networks. Generative topic models infer topic-word distributions, taking little or no context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. This results in improved performance in terms of generalization, interpretability and applicability. We apply our modeling approach to seven data sets from various domains and demonstrate that our approach consistently outperforms state-of-the-art generative topic models. With the learned representations, we show on average a gain of 9.6% (0.57 vs. 0.52) in precision at retrieval fraction 0.02 and 7.2% (0.582 vs. 0.543) in F1 for text categorization.
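
The reported percentages appear to be relative gains over the baseline scores given in parentheses. That reading is an assumption, but it is the one consistent with the numbers, as this small check shows (the function name `relative_gain` is illustrative).

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

precision_gain = relative_gain(0.57, 0.52)   # precision at retrieval fraction 0.02
f1_gain = relative_gain(0.582, 0.543)        # F1 for text categorization

print(f"{precision_gain:.1f}% {f1_gain:.1f}%")  # prints "9.6% 7.2%"
```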

Stochastic Divergence Minimization for Biterm Topic Model

May 01, 2017
Zhenghang Cui, Issei Sato, Masashi Sugiyama

With the emergence and thriving development of social networks, a huge number of short texts have accumulated and need to be processed. Inferring the latent topics of collected short texts is useful for understanding their hidden structure and predicting new content. Unlike conventional topic models such as latent Dirichlet allocation (LDA), the biterm topic model (BTM) was recently proposed for short texts to overcome the sparseness of document-level word co-occurrences by directly modeling the generation process of word pairs. Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and collapsed variational inference have been proposed for BTM. However, they either incur high computational cost or rely on very crude estimation. In this work, we develop a stochastic divergence minimization inference algorithm for BTM to estimate latent topics more accurately in a scalable way. Experiments demonstrate the superiority of our proposed algorithm compared with existing inference algorithms.

* 19 pages, 4 figures 
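
The word pairs that BTM models can be sketched as follows. This shows only how biterms are extracted and pooled across a corpus; the inference algorithm that the paper contributes is not shown, and the toy corpus and the function name `extract_biterms` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def extract_biterms(tokens):
    """Return every unordered word pair (biterm) co-occurring in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

# Hypothetical corpus of tokenized short texts.
corpus = [
    ["apple", "iphone", "launch"],
    ["apple", "pie", "recipe"],
    ["iphone", "apple"],
]

# Biterms are pooled over the whole corpus, which sidesteps the sparsity
# of word co-occurrences inside any single short document.
biterm_counts = Counter(b for doc in corpus for b in extract_biterms(doc))
```

Here the pair `("apple", "iphone")` is counted twice because it occurs in two texts, even though each text alone is too short to give reliable co-occurrence statistics.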

Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company's Reputation

Mar 09, 2021
Nikolay Babakov, Varvara Logacheva, Olga Kozlova, Nikita Semenov, Alexander Panchenko

Not all topics are equally "flammable" in terms of toxicity: a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labeling a dataset for appropriateness. While toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) an inappropriate message is not toxic but is still unacceptable. We collect and release two datasets for Russian: a topic-labeled dataset and an appropriateness-labeled dataset. We also release pre-trained classification models trained on this data.

* Accepted to the Balto-Slavic NLP workshop 2021 co-located with EACL-2021 

A framework for streamlined statistical prediction using topic models

Apr 15, 2019
Vanessa Glenny, Jonathan Tuke, Nigel Bean, Lewis Mitchell

In the Humanities and Social Sciences, there is increasing interest in approaches to information extraction, prediction, intelligent linkage, and dimension reduction applicable to large text corpora. With approaches in these fields being grounded in traditional statistical techniques, the need arises for frameworks whereby advanced NLP techniques such as topic modelling may be incorporated within classical methodologies. This paper provides a classical, supervised, statistical learning framework for prediction from text, using topic models as a data reduction method and the topics themselves as predictors, alongside typical statistical tools for predictive modelling. We apply this framework in a Social Sciences context (applied animal behaviour) and in a Humanities context (narrative analysis). The results show that topic regression models perform comparably to their much less efficient equivalents that use individual words as predictors.

* Proceedings of the 2019 Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL '19) 

Cross-referencing using Fine-grained Topic Modeling

May 18, 2019
Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Emily Hales, Kevin Seppi

Cross-referencing, which links passages of text to other related passages, can be a valuable study aid for facilitating comprehension of a text. However, cross-referencing requires, first, a comprehensive thematic knowledge of the entire corpus and, second, a focused search through the corpus specifically to find such useful connections. Because of this, cross-reference resources are prohibitively expensive and exist only for the most well-studied texts (e.g. religious texts). We develop a topic-based system for automatically producing candidate cross-references which can be easily verified by human annotators. Our system utilizes fine-grained topic modeling with thousands of highly nuanced and specific topics to identify verse pairs which are topically related. We demonstrate that our system can be cost-effective compared to having annotators acquire the expertise necessary to produce cross-reference resources unaided.

* 6 figures, 1 table, 8 pages 

An Influence-Receptivity Model for Topic based Information Cascades

Sep 06, 2017
Ming Yu, Varun Gupta, Mladen Kolar

We consider the problem of estimating the latent structure of a social network based on observational data on information diffusion processes, or cascades. Here, for a given cascade, we only observe the time a node/agent is infected but not the source of infection. Existing literature has focused on estimating the network diffusion matrix without any underlying assumptions on the structure of the network. We propose a novel model for inferring the network diffusion matrix based on the intuition that an information datum is more likely to propagate between two nodes if they are interested in similar topics, which are also common to the information. In particular, our model endows each node with an influence vector (how authoritative it is on each topic) and a receptivity vector (how susceptible it is on each topic). We show how this node-topic structure can be estimated from observed cascades. The estimated model can be used to build recommendation systems based on the receptivity vectors, as well as for marketing based on the influence vectors.
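
The intuition behind the influence and receptivity vectors can be sketched in a few lines. The functional form below (a topic-weighted product of sender influence and receiver receptivity) is an assumption chosen to match the description in the abstract; the paper's exact parameterization and estimation procedure are not given here, and all names and numbers are hypothetical.

```python
def transmission_score(influence_i, receptivity_j, cascade_topics):
    """Assumed score for node i passing a cascade to node j.

    The score is high only when i is influential on, and j is receptive to,
    topics that the cascade itself is about.
    """
    return sum(a * m * b for a, m, b in zip(influence_i, cascade_topics, receptivity_j))

# Hypothetical two-topic example: topic 0 = sports, topic 1 = politics.
influence_u = [0.5, 0.25]     # node u is mostly authoritative on sports
receptivity_v = [0.5, 0.0]    # node v is receptive to sports only
receptivity_w = [0.0, 0.5]    # node w is receptive to politics only
sports_cascade = [1.0, 0.0]   # a piece of information purely about sports

score_uv = transmission_score(influence_u, receptivity_v, sports_cascade)
score_uw = transmission_score(influence_u, receptivity_w, sports_cascade)
```

Under this assumed form, the sports cascade is more likely to travel from `u` to `v` than from `u` to `w`, because only `v` shares the cascade's topic with `u`.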

Online Bayesian Collaborative Topic Regression

May 28, 2016
Chenghao Liu, Tao Jin, Steven C. H. Hoi, Peilin Zhao, Jianling Sun

Collaborative Topic Regression (CTR) combines ideas from probabilistic matrix factorization (PMF) and topic modeling (e.g., LDA) for recommender systems, and has seen increasing success in many applications. Despite its many advantages, existing CTR algorithms have some critical limitations. First, they are often designed to work in a batch learning manner, making them unsuitable for streaming data or big data in real-world recommender systems. Second, the document-specific topic proportions of LDA are fed to the downstream PMF, but not the reverse. This is sub-optimal because the rating information is not exploited in discovering the low-dimensional representation of documents, which can result in a sub-optimal representation for prediction. In this paper, we propose a novel scheme of Online Bayesian Collaborative Topic Regression (OBCTR) which is efficient and scalable for learning from data streams. In particular, we jointly optimize the combined objective function of both PMF and LDA in an online learning fashion, in which the PMF and LDA tasks can reinforce each other during the online learning process. Our encouraging experimental results on real-world data validate the effectiveness of the proposed method.
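
How topic proportions feed the matrix factorization in CTR can be sketched as follows. This follows the standard batch CTR formulation, in which an item's latent vector is its topic proportions plus a learned offset; it does not show the online, joint updates that OBCTR proposes. All names and numbers are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict_rating(user_vector, topic_proportions, item_offset):
    """Predict a rating as the inner product of user and item latent vectors.

    The item vector is the document's topic proportions (from the topic model)
    shifted by an offset learned from rating data.
    """
    item_vector = [t + e for t, e in zip(topic_proportions, item_offset)]
    return dot(user_vector, item_vector)

# Hypothetical two-topic example.
user_vector = [2.0, 0.0]          # this user cares only about topic 0
topic_proportions = [0.75, 0.25]  # document is mostly about topic 0
item_offset = [0.25, -0.25]       # adjustment learned from ratings

rating = predict_rating(user_vector, topic_proportions, item_offset)
```

In this formulation the topic proportions influence the rating prediction, but the ratings do not influence the topic proportions, which is the one-way flow the abstract describes as a limitation.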

Probabilistic Topic and Syntax Modeling with Part-of-Speech LDA

Mar 12, 2013
William M. Darling, Fei Song

This article presents a probabilistic generative model for text based on semantic topics and syntactic classes called Part-of-Speech LDA (POSLDA). POSLDA simultaneously uncovers short-range syntactic patterns (syntax) and long-range semantic patterns (topics) that exist in document collections. This results in word distributions that are specific to both topics (sports, education, ...) and parts-of-speech (nouns, verbs, ...). For example, multinomial distributions over words are uncovered that can be understood as "nouns about weather" or "verbs about law". We describe the model and an approximate inference algorithm and then demonstrate the quality of the learned topics both qualitatively and quantitatively. Then, we discuss an NLP application where the output of POSLDA can lead to strong improvements in quality: unsupervised part-of-speech tagging. We describe algorithms for this task that make use of POSLDA-learned distributions that result in improved performance beyond the state of the art.

* Currently under review for the journal Computational Linguistics 
