"Topic": models, code, and papers

Topic-Aware Contrastive Learning for Abstractive Dialogue Summarization

Sep 10, 2021
Junpeng Liu, Yanyan Zou, Hainan Zhang, Hongshen Chen, Zhuoye Ding, Caixia Yuan, Xiaojie Wang

Unlike well-structured text, such as news reports and encyclopedia articles, dialogue content often comes from two or more interlocutors exchanging information with each other. In such a scenario, the topic of a conversation can shift as it progresses, and the key information for a given topic is often scattered across multiple utterances of different speakers, which makes abstractive dialogue summarization challenging. To capture the various topic information of a conversation and outline salient facts for the captured topics, this work proposes two topic-aware contrastive learning objectives, namely coherence detection and sub-summary generation objectives, which are expected to implicitly model the topic change and handle information scattering challenges for the dialogue summarization task. The proposed contrastive objectives are framed as auxiliary tasks for the primary dialogue summarization task, united via an alternative parameter updating strategy. Extensive experiments on benchmark datasets demonstrate that the proposed simple method significantly outperforms strong baselines and achieves new state-of-the-art performance. The code and trained models are publicly available.

* EMNLP 2021 


Few-shot Learning for Topic Modeling

Apr 19, 2021
Tomoharu Iwata

Topic models have been successfully used for analyzing text documents. However, with existing topic models, many documents are required for training. In this paper, we propose a neural network-based few-shot learning method that can learn a topic model from just a few documents. The neural networks in our model take a small number of documents as inputs, and output topic model priors. The proposed method trains the neural networks such that the expected test likelihood is improved when topic model parameters are estimated by maximizing the posterior probability using the priors based on the EM algorithm. Since each step in the EM algorithm is differentiable, the proposed method can backpropagate the loss through the EM algorithm to train the neural networks. The expected test likelihood is maximized by a stochastic gradient descent method using a set of multiple text corpora with an episodic training framework. In our experiments, we demonstrate that the proposed method achieves better perplexity than existing methods using three real-world text document sets.


Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling

Aug 29, 2017
Xiaofeng Zhu, Diego Klabjan, Patrick Bless

In this paper, we present hierarchical relation-based latent Dirichlet allocation (hrLDA), a data-driven hierarchical topic model for extracting terminological ontologies from a large number of heterogeneous documents. In contrast to traditional topic models, hrLDA relies on noun phrases instead of unigrams, considers syntax and document structures, and enriches topic hierarchies with topic relations. Through a series of experiments, we demonstrate the superiority of hrLDA over existing topic models, especially for building hierarchies. Furthermore, we illustrate the robustness of hrLDA on noisy data sets, which are likely to occur in many practical scenarios. Our ontology evaluation results show that ontologies extracted from hrLDA are very competitive with the ontologies created by domain experts.


VSEC-LDA: Boosting Topic Modeling with Embedded Vocabulary Selection

Jan 15, 2020
Yuzhen Ding, Baoxin Li

Topic modeling has found wide application in many problems where latent structures of the data are crucial for typical inference tasks. When applying a topic model, a relatively standard pre-processing step is to first build a vocabulary of frequent words. Such a general pre-processing step is often independent of the topic modeling stage, and thus there is no guarantee that the pre-generated vocabulary can support the inference of some optimal (or even meaningful) topic models appropriate for a given task, especially for computer vision applications involving "visual words". In this paper, we propose a new approach to topic modeling, termed Vocabulary-Selection-Embedded Correspondence-LDA (VSEC-LDA), which learns the latent model while simultaneously selecting most relevant words. The selection of words is driven by an entropy-based metric that measures the relative contribution of the words to the underlying model, and is done dynamically while the model is learned. We present three variants of VSEC-LDA and evaluate the proposed approach with experiments on both synthetic and real databases from different applications. The results demonstrate the effectiveness of built-in vocabulary selection and its importance in improving the performance of topic modeling.
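
The idea of scoring words by an entropy-based measure can be illustrated with a minimal sketch. It assumes, purely for illustration, that each word has a list of non-negative per-topic weights and that a word is kept when the entropy of its normalized topic distribution falls below a threshold. The function names, the threshold, and this particular entropy criterion are hypothetical and are not taken from the VSEC-LDA paper.

```python
import math

def topic_entropy(word_topic_weights):
    """Shannon entropy (in nats) of a word's normalized per-topic weights."""
    total = sum(word_topic_weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in word_topic_weights]
    # Zero-probability topics contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_words(weights_by_word, max_entropy):
    """Keep words whose topic distribution is concentrated (low entropy).

    Assumption for this sketch: low entropy means the word is specific to
    few topics and therefore more informative for the model.
    """
    return [
        word
        for word, weights in weights_by_word.items()
        if topic_entropy(weights) <= max_entropy
    ]

# Hypothetical toy weights: "wheel" is topic-specific, "the" is spread evenly.
toy_weights = {"wheel": [9.0, 1.0, 0.0], "the": [1.0, 1.0, 1.0]}
selected = select_words(toy_weights, max_entropy=0.5)
```

In this toy setup a word spread uniformly over all topics reaches the maximum entropy and is dropped, while a word concentrated in one topic is kept. A dynamic scheme, as the abstract describes, would recompute such scores while the model is being learned.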


Automatic Text Summarization Approaches to Speed up Topic Model Learning Process

Mar 20, 2017
Mohamed Morchid, Juan-Manuel Torres-Moreno, Richard Dufour, Javier Ramírez-Rodríguez, Georges Linarès

The number of documents available on the Internet grows every day. For this reason, processing this amount of information effectively and expressibly becomes a major concern for companies and scientists. Methods that represent a textual document by a topic representation are widely used in Information Retrieval (IR) to process big data such as Wikipedia articles. One of the main difficulties in using topic models on huge data collections is related to the material resources (CPU time and memory) required for model estimation. To deal with this issue, we propose to build topic spaces from summarized documents. In this paper, we present a study of topic space representation in the context of big data. The behavior of the topic space representation is analyzed on different languages. Experiments show that topic spaces estimated from text summaries are as relevant as those estimated from the complete documents. The real advantage of such an approach is the processing time gain: we showed that the processing time can be drastically reduced using summarized documents (by more than 60% in general). This study finally points out the differences between thematic representations of documents depending on the targeted languages, such as English or Latin languages.

* International Journal of Computational Linguistics and Applications, 7(2):87-109, 2016 
* 16 pages, 4 tables, 8 figures 


Topic-aware Pointer-Generator Networks for Summarizing Spoken Conversations

Oct 03, 2019
Zhengyuan Liu, Angela Ng, Sheldon Lee, Ai Ti Aw, Nancy F. Chen

Due to the lack of publicly available resources, conversation summarization has received far less attention than text summarization. As the purpose of conversations is to exchange information between at least two interlocutors, key information about a certain topic is often scattered across multiple utterances and turns from different speakers. This phenomenon is more pronounced during spoken conversations, where speech characteristics such as backchanneling and false starts might interrupt the topical flow. Moreover, topic diffusion and (intra-utterance) topic drift are also more common in human-to-human conversations. Such linguistic characteristics of dialogue topics make sentence-level extractive summarization approaches used in spoken documents ill-suited for summarizing conversations. Pointer-generator networks have effectively demonstrated their strength at integrating extractive and abstractive capabilities through neural modeling in text summarization. To the best of our knowledge, to date no one has adopted them for summarizing conversations. In this work, we propose a topic-aware architecture to exploit the inherent hierarchical structure in conversations to further adapt the pointer-generator model. Our approach significantly outperforms competitive baselines, achieves more efficient learning outcomes, and attains more robust performance.

* To appear in ASRU2019 


Provable Algorithms for Inference in Topic Models

May 27, 2016
Sanjeev Arora, Rong Ge, Frederic Koehler, Tengyu Ma, Ankur Moitra

Recently, there has been considerable progress on designing algorithms with provable guarantees -- typically using linear algebraic methods -- for parameter learning in latent variable models. But designing provable algorithms for inference has proven to be more challenging. Here we take a first step towards provable inference in topic models. We leverage a property of topic models that enables us to construct simple linear estimators for the unknown topic proportions that have small variance, and consequently can work with short documents. Our estimators also correspond to finding an estimate around which the posterior is well-concentrated. We show lower bounds establishing that for shorter documents it can be information-theoretically impossible to find the hidden topics. Finally, we give empirical results that demonstrate that our algorithm works on realistic topic models. It yields good solutions on synthetic data and runs in time comparable to a single iteration of Gibbs sampling.

* to appear at ICML'2016 


ComStreamClust: A communicative text clustering approach to topic detection in streaming data

Oct 11, 2020
Ali Najafi, Araz Gholipour-Shilabin, Rahim Dehkharghani, Ali Mohammadpur-Fard, Meysam Asgari-Chenaghlu

Topic detection is the task of determining and tracking hot topics in social media. Twitter is arguably the most popular platform for people to share their ideas with others about different issues. One such prevalent issue is the COVID-19 pandemic. Detecting and tracking topics on these kinds of issues would help governments and healthcare companies deal with this phenomenon. In this paper, we propose a novel communicative clustering approach, called ComStreamClust, for clustering sub-topics inside a broader topic, e.g. COVID-19. The proposed approach was evaluated on two datasets: the COVID-19 and the FA CUP. The results obtained from ComStreamClust confirm the effectiveness of the proposed approach when compared to existing methods such as LDA.

* 11 pages, 6 Figures, 4 Tables 


Probabilistic Model of Narratives Over Topical Trends in Social Media: A Discrete Time Model

Apr 14, 2020
Toktam A. Oghaz, Ece C. Mutlu, Jasser Jasser, Niloofar Yousefi, Ivan Garibay

Online social media platforms are turning into the prime source of news and narratives about worldwide events. However, a systematic summarization-based narrative extraction that can facilitate communicating the main underlying events is lacking. To address this issue, we propose a novel event-based narrative summary extraction framework. Our proposed framework is designed as a probabilistic topic model, with categorical time distribution, followed by extractive text summarization. Our topic model identifies topics' recurrence over time with a varying time resolution. This framework not only captures the topic distributions from the data, but also approximates the user activity fluctuations over time. Furthermore, we define significance-dispersity trade-off (SDT) as a comparison measure to identify the topic with the highest lifetime attractiveness in a timestamped corpus. We evaluate our model on a large corpus of Twitter data, including more than one million tweets in the domain of the disinformation campaigns conducted against the White Helmets of Syria. Our results indicate that the proposed framework is effective in identifying topical trends, as well as extracting narrative summaries from a text corpus with timestamped data.

* 9 pages, 4 figures 


Topical Stance Detection for Twitter: A Two-Phase LSTM Model Using Attention

Jan 09, 2018
Kuntal Dey, Ritvik Shrivastava, Saroj Kaushik

The topical stance detection problem addresses detecting the stance of the text content with respect to a given topic: whether the sentiment of the given text content is in FAVOR of (positive), is AGAINST (negative), or is NONE (neutral) towards the given topic. Using the concept of attention, we develop a two-phase solution. In the first phase, we classify subjectivity: whether a given tweet is neutral or subjective with respect to the given topic. In the second phase, we classify the sentiment of the subjective tweets (ignoring the neutral tweets): whether a given subjective tweet has a FAVOR or AGAINST stance towards the topic. We propose a Long Short-Term Memory (LSTM) based deep neural network for each phase, and embed attention at each of the phases. On the SemEval 2016 stance detection Twitter task dataset, we obtain a best-case macro F-score of 68.84% and a best-case accuracy of 60.2%, outperforming the existing deep learning based solutions. Our framework, T-PAN, is the first in the topical stance detection literature that uses deep learning within a two-phase architecture.

* Accepted at the 40th European Conference on Information Retrieval (ECIR), 2018 
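
The two-phase decision flow described in this abstract can be sketched as follows. This is a minimal illustration of the control flow only: the two classifiers are passed in as callables, and the keyword-based stand-ins below are hypothetical placeholders, not the attention-based LSTM networks the paper proposes. All function and variable names are assumptions made for this sketch.

```python
from typing import Callable

def two_phase_stance(
    tweet: str,
    topic: str,
    is_subjective: Callable[[str, str], bool],
    is_favor: Callable[[str, str], bool],
) -> str:
    """Return FAVOR, AGAINST, or NONE using a two-phase decision."""
    # Phase 1: subjectivity. Neutral tweets are labeled NONE and skipped.
    if not is_subjective(tweet, topic):
        return "NONE"
    # Phase 2: sentiment, applied only to subjective tweets.
    return "FAVOR" if is_favor(tweet, topic) else "AGAINST"

# Hypothetical keyword rules standing in for trained models.
POSITIVE = {"support", "love", "great"}
NEGATIVE = {"oppose", "hate", "terrible"}

def toy_is_subjective(tweet: str, topic: str) -> bool:
    words = set(tweet.lower().split())
    return bool(words & (POSITIVE | NEGATIVE))

def toy_is_favor(tweet: str, topic: str) -> bool:
    words = set(tweet.lower().split())
    return len(words & POSITIVE) >= len(words & NEGATIVE)

def classify(tweet: str, topic: str) -> str:
    return two_phase_stance(tweet, topic, toy_is_subjective, toy_is_favor)
```

Keeping the two phases as separate callables mirrors the abstract's design, in which each phase has its own network and the second phase never sees tweets the first phase judged neutral.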
