Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling

Mar 04, 2018
Hao Zhang, Bo Chen, Dandan Guo, Mingyuan Zhou

To train an inference network jointly with a deep generative topic model, making it both scalable to big corpora and fast in out-of-sample prediction, we develop Weibull hybrid autoencoding inference (WHAI) for deep latent Dirichlet allocation, which infers posterior samples via a hybrid of stochastic-gradient MCMC and autoencoding variational Bayes. The generative network of WHAI has a hierarchy of gamma distributions, while the inference network of WHAI is a Weibull upward-downward variational autoencoder, which integrates a deterministic-upward deep neural network, and a stochastic-downward deep generative model based on a hierarchy of Weibull distributions. The Weibull distribution can be used to well approximate a gamma distribution with an analytic Kullback-Leibler divergence, and has a simple reparameterization via the uniform noise, which help efficiently compute the gradients of the evidence lower bound with respect to the parameters of the inference network. The effectiveness and efficiency of WHAI are illustrated with experiments on big corpora.

* ICLR 2018 

  Access Paper or Ask Questions

Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation

May 09, 2017
Zhiyuan Shi, Timothy M. Hospedales, Tao Xiang

We address the problem of localisation of objects as bounding boxes in images with weak labels. This weakly supervised object localisation problem has been tackled in the past using discriminative models where each object class is localised independently from other classes. We propose a novel framework based on Bayesian joint topic modelling. Our framework has three distinctive advantages over previous works: (1) All object classes and image backgrounds are modelled jointly together in a single generative model so that "explaining away" inference can resolve ambiguity and lead to better learning and localisation. (2) The Bayesian formulation of the model enables easy integration of prior knowledge about object appearance to compensate for limited supervision. (3) Our model can be learned with a mixture of weakly labelled and unlabelled data, allowing the large volume of unlabelled images on the Internet to be exploited for learning. Extensive experiments on the challenging VOC dataset demonstrate that our approach outperforms the state-of-the-art competitors.

* iccv 2013 

  Access Paper or Ask Questions

Same Author or Just Same Topic? Towards Content-Independent Style Representations

Apr 11, 2022
Anna Wegmann, Marijn Schraagen, Dong Nguyen

Linguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, a good performance on the AV task does not ensure good "general-purpose" style representations. For example, as the same author might typically write about certain topics, representations trained on AV might also encode content information instead of style alone. We introduce a variation of the AV training task that controls for content using conversation or domain labels. We evaluate whether known style dimensions are represented and preferred over content information through an original variation to the recently proposed STEL framework. We find that representations trained by controlling for conversation are better than representations trained with domain or no content control at representing style independent from content.

* accepted to the 7th workshop on RepL4NLP at ACL 2022 

  Access Paper or Ask Questions

ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling

Jan 04, 2022
Alexandre Alcoforado, Thomas Palmeira Ferraz, Rodrigo Gerber, Enzo Bustos, André Seidel Oliveira, Bruno Miguel Veloso, Fabio Levy Siqueira, Anna Helena Reali Costa

Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods, that assume low data availability in natural language processing. Among them, zero-shot learning stands out, which consists of learning a classifier without any previously labeled data. The best results reported with this approach use language models such as Transformers, but fall into two problems: high execution time and inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in the F1 score in the FolhaUOL dataset. Keywords: Low-Resource NLP, Unlabeled data, Zero-Shot Learning, Topic Modeling, Transformers.

* Accepted at PROPOR 2022: 15th International Conference on Computational Processing of Portuguese 

  Access Paper or Ask Questions

Clustering Prominent People and Organizations in Topic-Specific Text Corpora

Jul 27, 2018
Abdulkareem Alsudais, Hovig Tchalian

Named entities in text documents are the names of people, organization, location or other types of objects in the documents that exist in the real world. A persisting research challenge is to use computational techniques to identify such entities in text documents. Once identified, several text mining tools and algorithms can be utilized to leverage these discovered named entities and improve NLP applications. In this paper, a method that clusters prominent names of people and organizations based on their semantic similarity in a text corpus is proposed. The method relies on common named entity recognition techniques and on recent word embeddings models. The semantic similarity scores generated using the word embeddings models for the named entities are used to cluster similar entities of the people and organizations types. Two human judges evaluated ten variations of the method after it was run on a corpus that consists of 4,821 articles on a specific topic. The performance of the method was measured using three quantitative measures. The results of these three metrics demonstrate that the method is effective in clustering semantically similar named entities.

  Access Paper or Ask Questions

The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly Articles

Apr 02, 2021
Angelo A. Salatino, Francesco Osborne, Thiviyan Thanapalasingam, Enrico Motta

Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of re-search areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles yielding a significant improvement over alternative methods.

* In Digital Libraries for Open Knowledge. LNCS, vol 11799. Springer, Cham (2019) 
* Conference paper at TPDL 2019 

  Access Paper or Ask Questions

Discovering Political Topics in Facebook Discussion threads with Graph Contextualization

Mar 24, 2018
Yilin Zhang, Marie Poux-Berthe, Chris Wells, Karolina Koc-Michalska, Karl Rohe

We propose a graph contextualization method, pairGraphText, to study political engagement on Facebook during the 2012 French presidential election. It is a spectral algorithm that contextualizes graph data with text data for online discussion thread. In particular, we examine the Facebook posts of the eight leading candidates and the comments beneath these posts. We find evidence of both (i) candidate-centered structure, where citizens primarily comment on the wall of one candidate and (ii) issue-centered structure (i.e. on political topics), where citizens' attention and expression is primarily directed towards a specific set of issues (e.g. economics, immigration, etc). To identify issue-centered structure, we develop pairGraphText, to analyze a network with high-dimensional features on the interactions (i.e. text). This technique scales to hundreds of thousands of nodes and thousands of unique words. In the Facebook data, spectral clustering without the contextualizing text information finds a mixture of (i) candidate and (ii) issue clusters. The contextualized information with text data helps to separate these two structures. We conclude by showing that the novel methodology is consistent under a statistical model.

  Access Paper or Ask Questions

Guided Alignment Training for Topic-Aware Neural Machine Translation

Jul 06, 2016
Wenhu Chen, Evgeny Matusov, Shahram Khadivi, Jan-Thorsten Peter

In this paper, we propose an effective way for biasing the attention mechanism of a sequence-to-sequence neural machine translation (NMT) model towards the well-studied statistical word alignment models. We show that our novel guided alignment training approach improves translation quality on real-life e-commerce texts consisting of product titles and descriptions, overcoming the problems posed by many unknown words and a large type/token ratio. We also show that meta-data associated with input texts such as topic or category information can significantly improve translation quality when used as an additional signal to the decoder part of the network. With both novel features, the BLEU score of the NMT system on a product title set improves from 18.6 to 21.3%. Even larger MT quality gains are obtained through domain adaptation of a general domain NMT system to e-commerce data. The developed NMT system also performs well on the IWSLT speech translation task, where an ensemble of four variant systems outperforms the phrase-based baseline by 2.1% BLEU absolute.

* 11 pages 

  Access Paper or Ask Questions

Cognitive Surveillance: Why does it never appear among the AVSS Conferences topics?

Jun 22, 2014
Emanuel Diamant

Video Surveillance is a fast evolving field of research and development (R&D) driven by the urgent need for public security and safety (due to the growing threats of terrorism, vandalism, and anti-social behavior). Traditionally, surveillance systems are comprised of two components - video cameras distributed over the guarded area and human observer watching and analyzing the incoming video. Explosive growth of installed cameras and limited human operator's ability to process the delivered video content raise an urgent demand for developing surveillance systems with human like cognitive capabilities, that is - Cognitive surveillance systems. The growing interest in this issue is testified by the tens of workshops, symposiums and conferences held over the world each year. The IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS) is certainly one of them. However, for unknown reasons, the term Cognitive Surveillance does never appear among its topics. As to me, the explanation for this is simple - the complexity and the indefinable nature of the term "Cognition". In this paper, I am trying to resolve the problem providing a novel definition of cognition equally suitable for biological as well as technological applications. I hope my humble efforts will be helpful.

* The paper was submitted to the 11th IEEE International Conference on Advanced Video Signal-Based Surveillance (AVSS2014, August 26-29, 2014, Seoul, Korea) and has not been accepted for presentation at the conference 

  Access Paper or Ask Questions

From Text to Topics in Healthcare Records: An Unsupervised Graph Partitioning Methodology

Jul 07, 2018
M. Tarik Altuncu, Erik Mayer, Sophia N. Yaliraki, Mauricio Barahona

Electronic Healthcare Records contain large volumes of unstructured data, including extensive free text. Yet this source of detailed information often remains under-used because of a lack of methodologies to extract interpretable content in a timely manner. Here we apply network-theoretical tools to analyse free text in Hospital Patient Incident reports from the National Health Service, to find clusters of documents with similar content in an unsupervised manner at different levels of resolution. We combine deep neural network paragraph vector text-embedding with multiscale Markov Stability community detection applied to a sparsified similarity graph of document vectors, and showcase the approach on incident reports from Imperial College Healthcare NHS Trust, London. The multiscale community structure reveals different levels of meaning in the topics of the dataset, as shown by descriptive terms extracted from the clusters of records. We also compare a posteriori against hand-coded categories assigned by healthcare personnel, and show that our approach outperforms LDA-based models. Our content clusters exhibit good correspondence with two levels of hand-coded categories, yet they also provide further medical detail in certain areas and reveal complementary descriptors of incidents beyond the external classification taxonomy.

  Access Paper or Ask Questions