
"Topic": models, code, and papers

What is this Article about? Extreme Summarization with Topic-aware Convolutional Neural Networks

Jul 19, 2019
Shashi Narayan, Shay B. Cohen, Mirella Lapata

We introduce 'extreme summarization', a new single-document summarization task which aims at creating a short, one-sentence news summary answering the question "What is the article about?". We argue that extreme summarization, by nature, is not amenable to extractive strategies and requires an abstractive modeling approach. In the hope of driving research on this task further: (a) we collect a real-world, large-scale dataset by harvesting online articles from the British Broadcasting Corporation (BBC); and (b) we propose a novel abstractive model which is conditioned on the article's topics and based entirely on convolutional neural networks. We demonstrate experimentally that this architecture captures long-range dependencies in a document and recognizes pertinent content, outperforming an oracle extractive system and state-of-the-art abstractive approaches when evaluated automatically and by humans on the extreme summarization dataset.

* Accepted to appear in Journal of Artificial Intelligence Research (JAIR), 37 pages 
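As a rough illustration of the topic-conditioning idea in this abstract (the dimensions, the mean-pooling "convolution", and all values below are invented for the sketch; the actual model is far richer):

```python
# Hedged sketch: topic-conditioned input representation (not the authors' code).
# Each word embedding is concatenated with the document's topic distribution,
# so every convolution window "sees" what the article is about.

def topic_condition(word_embs, topic_dist):
    """Append the document-level topic distribution to every word embedding."""
    return [emb + topic_dist for emb in word_embs]

def conv1d(seq, kernel_width):
    """Toy 1D convolution: average the vectors in each sliding window."""
    out = []
    for i in range(len(seq) - kernel_width + 1):
        window = seq[i:i + kernel_width]
        dim = len(window[0])
        out.append([sum(v[d] for v in window) / kernel_width for d in range(dim)])
    return out

word_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 words, 2-dim embeddings
topic_dist = [0.7, 0.2, 0.1]                        # e.g. a topic-model posterior
conditioned = topic_condition(word_embs, topic_dist)
features = conv1d(conditioned, kernel_width=2)
```

Because the topic vector rides along inside every window, even a narrow kernel has access to document-level context.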


Summarizing Reviews with Variable-length Syntactic Patterns and Topic Models

Nov 21, 2012
Trung V. Nguyen, Alice H. Oh

We present a novel summarization framework for reviews of products and services by selecting informative and concise text segments from the reviews. Our method consists of two major steps. First, we identify five frequently occurring variable-length syntactic patterns and use them to extract candidate segments. Then we use the output of a joint generative sentiment topic model to filter out the non-informative segments. We verify the proposed method with quantitative and qualitative experiments. In a quantitative study, our approach outperforms previous methods in producing informative segments and summaries that capture aspects of products and services as expressed in the user-generated pros and cons lists. Our user study with ninety users resonates with this result: individual segments extracted and filtered by our method are rated as more useful by users than those produced by previous approaches.
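The pattern-extraction step above can be illustrated with a toy matcher (the paper's five patterns, tagger, and topic-model filter are not reproduced; the JJ+ NN+ pattern and tags below are assumptions for the sketch):

```python
# Illustrative sketch: extract candidate segments whose POS tags match a
# variable-length pattern, here one or more adjectives followed by one or
# more nouns (JJ+ NN+), over (word, tag) pairs.

def match_adj_noun(tagged):
    """Return word segments matching JJ+ NN+ over (word, tag) pairs."""
    segments, i = [], 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        if j > i:                      # saw at least one adjective
            k = j
            while k < len(tagged) and tagged[k][1] == "NN":
                k += 1
            if k > j:                  # followed by at least one noun
                segments.append(" ".join(w for w, _ in tagged[i:k]))
                i = k
                continue
        i += 1
    return segments

review = [("very", "RB"), ("friendly", "JJ"), ("helpful", "JJ"),
          ("staff", "NN"), ("but", "CC"), ("slow", "JJ"), ("service", "NN")]
print(match_adj_noun(review))  # → ['friendly helpful staff', 'slow service']
```

In the paper's pipeline, segments like these would then be passed through the sentiment topic model, which keeps only the informative ones.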


Continual Learning of Long Topic Sequences in Neural Information Retrieval

Jan 10, 2022
Thomas Gerald, Laure Soulier

In information retrieval (IR) systems, trends and users' interests may change over time, altering either the distribution of requests or the contents to be recommended. Since neural ranking approaches heavily depend on their training data, it is crucial to understand the transfer capacity of recent IR approaches to new domains in the long term. In this paper, we first propose a dataset based upon the MSMarco corpus aiming at modeling a long stream of topics, as well as IR property-driven controlled settings. We then analyze in depth the ability of recent neural IR models to continually learn from those streams. Our empirical study highlights the particular cases in which catastrophic forgetting occurs (e.g., the level of similarity between tasks, peculiarities of text length, and ways of learning models) to provide future directions in terms of model design.
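Catastrophic forgetting in such a stream is commonly quantified as the drop from a task's best score during the stream to its score after the final task. A minimal sketch of that bookkeeping (the score matrix is invented, and the paper's exact metric may differ):

```python
# Hedged sketch of a standard continual-learning forgetting measure:
# after training on a stream of topic tasks, forgetting on task i is the
# drop from its best observed score to its score after the final task.

def forgetting(scores):
    """scores[t][i] = eval score on task i after training on task t (t >= i)."""
    T = len(scores)
    out = []
    for i in range(T - 1):            # the last task cannot be forgotten yet
        best = max(scores[t][i] for t in range(i, T - 1))
        out.append(best - scores[T - 1][i])
    return out

# Toy stream of 3 topic tasks: task 0 degrades sharply (catastrophic
# forgetting), task 1 barely moves.
scores = [
    [0.80, None, None],
    [0.75, 0.70, None],
    [0.40, 0.68, 0.90],
]
print([round(f, 2) for f in forgetting(scores)])  # → [0.4, 0.02]
```

A study like the one described would compare these per-task drops across model families and stream properties (task similarity, text length, and so on).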


Unveiling the Political Agenda of the European Parliament Plenary: A Topical Analysis

Jul 07, 2015
Derek Greene, James P. Cross

This study analyzes political interactions in the European Parliament (EP) by considering how the political agenda of the plenary sessions has evolved over time and the manner in which Members of the European Parliament (MEPs) have reacted to external and internal stimuli when making Parliamentary speeches. It does so by considering the context in which speeches are made, and the content of those speeches. To detect latent themes in legislative speeches over time, speech content is analyzed using a new dynamic topic modeling method, based on two layers of matrix factorization. This method is applied to a new corpus of all English language legislative speeches in the EP plenary from the period 1999-2014. Our findings suggest that the political agenda of the EP has evolved significantly over time, is influenced by the committee structure of the Parliament, and reacts to exogenous events, such as EU Treaty referenda and the emergence of the Euro-crisis, which have a significant impact on what is discussed in Parliament.
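The two-layer matrix factorization the abstract describes can be sketched with a tiny dependency-free NMF (a toy illustration, not the authors' implementation; the window contents, vocabulary size, and multiplicative-update solver are all assumptions):

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=200, eps=1e-9):
    """Tiny multiplicative-update NMF: V (m x n) ~ W (m x k) @ H (k x n)."""
    rng = random.Random(0)
    m, n = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    for _ in range(iters):
        WH = matmul(W, H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(Wt, WH)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)] for i in range(k)]
        WH = matmul(W, H)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(WH, Ht)
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(m)]
    return W, H

# Layer 1: factorize each time window's term-document matrix separately.
windows = [
    [[3, 0], [3, 0], [0, 2], [0, 2]],   # window 1: 4 terms x 2 docs
    [[2, 0], [2, 0], [0, 3], [0, 3]],   # window 2
]
window_topics = [nmf(V, k=2)[0] for V in windows]   # term x topic factors

# Layer 2: concatenate the window-level topic factors column-wise and
# factorize again, yielding dynamic topics that span the whole period.
stacked = [row1 + row2 for row1, row2 in zip(*window_topics)]
dynamic_topics, _ = nmf(stacked, k=2)
```

The second factorization is what lets window-level topics that recur across sessions be tied together into a single evolving theme.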



Beyond Fact Verification: Comparing and Contrasting Claims on Contentious Topics

May 24, 2022
Miyoung Ko, Ingyu Seong, Hwaran Lee, Joonsuk Park, Minsuk Chang, Minjoon Seo

As the importance of identifying misinformation is increasing, many researchers focus on verifying textual claims on the web. One of the most popular tasks to achieve this is fact verification, which retrieves an evidence sentence from a large knowledge source such as Wikipedia to either verify or refute each factual claim. However, while such problem formulation is helpful for detecting false claims and fake news, it is not applicable to catching subtle differences in factually consistent claims which still might implicitly bias the readers, especially in contentious topics such as political, gender, or racial issues. In this study, we propose ClaimDiff, a novel dataset to compare the nuance between claim pairs in both a discriminative and a generative manner, with the underlying assumption that one is not necessarily more true than the other. This differs from existing fact verification datasets that verify the target sentence with respect to an absolute truth. We hope this task assists people in making more informed decisions among various sources of media.


Detecting Potential Topics In News Using BERT, CRF and Wikipedia

Feb 28, 2020
Swapnil Ashok Jadhav

For a news content distribution platform like Dailyhunt, Named Entity Recognition is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations, and organisations in news for 13+ Indian languages and using them in algorithms, we also need to identify n-grams which do not necessarily fit the definition of a named entity, yet are important. For example, "me too movement", "beef ban", "alwar mob lynching". In this exercise, given an English-language text, we try to detect case-less n-grams which convey important information and can be used as topics and/or hashtags for a news article. The model is built using Wikipedia titles data, a private English news corpus, a BERT-Multilingual pre-trained model, and a Bi-GRU and CRF architecture. It shows promising results when compared with the industry-best Flair, spaCy, and Stanford caseless NER systems in terms of F1 and especially recall.

* 6 pages, 5 tables, 1 figure, 2 examples. This is a report based on applied research work conducted at Dailyhunt 
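The final span-extraction step downstream of such a tagger can be sketched as BIO decoding (the tag names and example are illustrative, not taken from the paper):

```python
# Hedged sketch: collapse BIO labels emitted by a sequence tagger (e.g. a
# BERT + Bi-GRU + CRF stack) into candidate case-less topic n-grams.

def bio_to_spans(tokens, tags):
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-TOPIC":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I-TOPIC" and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["protests", "over", "beef", "ban", "spread"]
tags = ["O", "O", "B-TOPIC", "I-TOPIC", "O"]
print(bio_to_spans(tokens, tags))  # → ['beef ban']
```

Spans recovered this way would then feed recommendation and hashtag pipelines directly.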


WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling

Mar 04, 2018
Hao Zhang, Bo Chen, Dandan Guo, Mingyuan Zhou

To train an inference network jointly with a deep generative topic model, making it both scalable to big corpora and fast in out-of-sample prediction, we develop Weibull hybrid autoencoding inference (WHAI) for deep latent Dirichlet allocation, which infers posterior samples via a hybrid of stochastic-gradient MCMC and autoencoding variational Bayes. The generative network of WHAI has a hierarchy of gamma distributions, while the inference network of WHAI is a Weibull upward-downward variational autoencoder, which integrates a deterministic-upward deep neural network and a stochastic-downward deep generative model based on a hierarchy of Weibull distributions. The Weibull distribution can closely approximate a gamma distribution with an analytic Kullback-Leibler divergence, and has a simple reparameterization via uniform noise, which helps efficiently compute the gradients of the evidence lower bound with respect to the parameters of the inference network. The effectiveness and efficiency of WHAI are illustrated with experiments on big corpora.

* ICLR 2018 
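The uniform-noise reparameterization mentioned in the abstract is simply the Weibull inverse CDF; a minimal sketch (using the shape k, scale lam parameterization):

```python
# Hedged sketch of the Weibull reparameterization trick: if u ~ Uniform(0, 1),
# then lam * (-log(1 - u)) ** (1 / k) is a Weibull(k, lam) sample, expressed
# as a differentiable function of k and lam, so ELBO gradients can flow.

import math
import random

def sample_weibull(k, lam, u):
    """Inverse-CDF (reparameterized) Weibull sample from uniform noise u."""
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)

rng = random.Random(0)
k, lam = 2.0, 1.5
samples = [sample_weibull(k, lam, rng.random()) for _ in range(100_000)]
empirical_mean = sum(samples) / len(samples)
analytic_mean = lam * math.gamma(1.0 + 1.0 / k)   # Weibull mean: lam * Γ(1 + 1/k)
print(round(empirical_mean, 3), round(analytic_mean, 3))
```

The closeness of the empirical and analytic means is a quick sanity check that the transform really produces Weibull draws.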


Bayesian Joint Topic Modelling for Weakly Supervised Object Localisation

May 09, 2017
Zhiyuan Shi, Timothy M. Hospedales, Tao Xiang

We address the problem of localisation of objects as bounding boxes in images with weak labels. This weakly supervised object localisation problem has been tackled in the past using discriminative models where each object class is localised independently from other classes. We propose a novel framework based on Bayesian joint topic modelling. Our framework has three distinctive advantages over previous works: (1) All object classes and image backgrounds are modelled jointly together in a single generative model so that "explaining away" inference can resolve ambiguity and lead to better learning and localisation. (2) The Bayesian formulation of the model enables easy integration of prior knowledge about object appearance to compensate for limited supervision. (3) Our model can be learned with a mixture of weakly labelled and unlabelled data, allowing the large volume of unlabelled images on the Internet to be exploited for learning. Extensive experiments on the challenging VOC dataset demonstrate that our approach outperforms the state-of-the-art competitors.

* ICCV 2013 


Same Author or Just Same Topic? Towards Content-Independent Style Representations

Apr 11, 2022
Anna Wegmann, Marijn Schraagen, Dong Nguyen

Linguistic style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, a good performance on the AV task does not ensure good "general-purpose" style representations. For example, as the same author might typically write about certain topics, representations trained on AV might also encode content information instead of style alone. We introduce a variation of the AV training task that controls for content using conversation or domain labels. We evaluate whether known style dimensions are represented and preferred over content information through an original variation of the recently proposed STEL framework. We find that representations trained by controlling for conversation represent style independently of content better than representations trained with domain or no content control.

* accepted to the 7th workshop on RepL4NLP at ACL 2022 
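The content-controlled variation of the AV task can be sketched as a pair-construction rule (the field names and toy utterances are assumptions; the paper's actual setup trains contrastively on real conversations):

```python
# Hedged sketch: build authorship-verification training pairs controlled for
# conversation. Positive pairs share an author but never a conversation, so
# matching on topic or content alone should not solve the task.

def controlled_pairs(utterances):
    """Yield (text_a, text_b, same_author) pairs controlled for conversation."""
    pairs = []
    for i, a in enumerate(utterances):
        for b in utterances[i + 1:]:
            if a["conv"] == b["conv"]:
                continue              # control: never pair within a conversation
            same = a["author"] == b["author"]
            pairs.append((a["text"], b["text"], same))
    return pairs

data = [
    {"author": "u1", "conv": "c1", "text": "tbh i reckon it'll rain"},
    {"author": "u1", "conv": "c2", "text": "tbh the train was late again"},
    {"author": "u2", "conv": "c1", "text": "The forecast suggests otherwise."},
]
for text_a, text_b, same in controlled_pairs(data):
    print(same, "|", text_a, "||", text_b)
```

A domain-controlled variant would use a "domain" field in place of "conv"; the control determines what content signal the representation can still exploit.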


ZeroBERTo -- Leveraging Zero-Shot Text Classification by Topic Modeling

Jan 04, 2022
Alexandre Alcoforado, Thomas Palmeira Ferraz, Rodrigo Gerber, Enzo Bustos, André Seidel Oliveira, Bruno Miguel Veloso, Fabio Levy Siqueira, Anna Helena Reali Costa

Traditional text classification approaches often require a good amount of labeled data, which is difficult to obtain, especially in restricted domains or less widespread languages. This lack of labeled data has led to the rise of low-resource methods that assume low data availability in natural language processing. Among them, zero-shot learning, which consists of learning a classifier without any previously labeled data, stands out. The best results reported with this approach use language models such as Transformers, but they suffer from two problems: high execution time and an inability to handle long texts as input. This paper proposes a new model, ZeroBERTo, which leverages an unsupervised clustering step to obtain a compressed data representation before the classification task. We show that ZeroBERTo has better performance for long inputs and shorter execution time, outperforming XLM-R by about 12% in F1 score on the FolhaUOL dataset.

Keywords: Low-Resource NLP, Unlabeled data, Zero-Shot Learning, Topic Modeling, Transformers.

* Accepted at PROPOR 2022: 15th International Conference on Computational Processing of Portuguese 
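The compression step ZeroBERTo performs before classification can be sketched with a dependency-free k-means over bag-of-words counts (ZeroBERTo itself clusters Transformer representations; everything below is a toy stand-in):

```python
# Hedged sketch: cluster documents first, so an expensive zero-shot classifier
# only has to label a handful of clusters instead of every document.

def bow(doc, vocab):
    return [doc.split().count(w) for w in vocab]

def kmeans(points, k, iters=10):
    centroids = [list(p) for p in points[:k]]          # deterministic init
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            dists = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            assign[i] = dists.index(min(dists))
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

vocab = ["goal", "match", "election", "vote"]
docs = ["goal goal match", "match goal", "election vote vote", "vote election"]
labels = kmeans([bow(d, vocab) for d in docs], k=2)
print(labels)  # → [0, 0, 1, 1]: sports docs and politics docs cluster apart
```

In the full pipeline, each cluster (rather than each document) would then receive a zero-shot label, which is where the execution-time savings come from.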
