
"Topic Modeling": models, code, and papers

Language Model as an Annotator: Exploring DialoGPT for Dialogue Summarization

May 28, 2021
Xiachong Feng, Xiaocheng Feng, Libo Qin, Bing Qin, Ting Liu

Current dialogue summarization systems usually encode the text with a number of general semantic features (e.g., keywords and topics) to gain more powerful dialogue modeling capabilities. However, these features are obtained via open-domain toolkits that are either dialogue-agnostic or rely heavily on human annotations. In this paper, we show how DialoGPT, a pre-trained model for conversational response generation, can be developed as an unsupervised dialogue annotator, which takes advantage of the dialogue background knowledge encoded in DialoGPT. We apply DialoGPT to label three types of features on two dialogue summarization datasets, SAMSum and AMI, and employ pre-trained and non-pre-trained models as our summarizers. Experimental results show that our proposed method obtains remarkable improvements on both datasets and achieves new state-of-the-art performance on the SAMSum dataset.

* ACL 2021 
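As a rough illustration of the annotator idea, the sketch below uses the public DialoGPT checkpoint from HuggingFace to score each utterance by its language-modeling loss given the preceding context; a low loss suggests the utterance adds little new information. The loss-as-redundancy heuristic and the example utterances are illustrative assumptions, not the authors' released pipeline.

```python
# A minimal sketch (not the authors' code) of using DialoGPT's language-modeling
# loss to score dialogue utterances given their context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")
model.eval()

def utterance_loss(context: str, utterance: str) -> float:
    """Average token loss of `utterance` conditioned on `context`."""
    ctx_ids = tokenizer.encode(context + tokenizer.eos_token)
    utt_ids = tokenizer.encode(utterance + tokenizer.eos_token)
    input_ids = torch.tensor([ctx_ids + utt_ids])
    # Mask the context with -100 so only the utterance tokens contribute to loss.
    labels = torch.tensor([[-100] * len(ctx_ids) + utt_ids])
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()

dialogue = ["Hey, are we still on for lunch?",
            "Yes, 12 at the usual place.",
            "Great, see you at 12."]
for i in range(1, len(dialogue)):
    context = tokenizer.eos_token.join(dialogue[:i])
    print(dialogue[i], "->", round(utterance_loss(context, dialogue[i]), 2))
```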
  

FakeFlow: Fake News Detection by Modeling the Flow of Affective Information

Jan 24, 2021
Bilal Ghanem, Simone Paolo Ponzetto, Paolo Rosso, Francisco Rangel

Fake news articles often capture readers' attention by means of emotional appeals that arouse their feelings. Unlike in short news texts, authors of longer articles can exploit such affective factors to manipulate readers by adding exaggerations or fabricating events in order to affect the readers' emotions. To capture this, we propose to model the flow of affective information in fake news articles using a neural architecture. The proposed model, FakeFlow, learns this flow by combining topic and affective information extracted from text. We evaluate the model's performance with several experiments on four real-world datasets. The results show that FakeFlow achieves superior results compared against state-of-the-art methods, confirming the importance of capturing the flow of affective information in news articles.

* 9 pages, 6 figures, EACL-2021 
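A minimal PyTorch sketch of the flow idea described above: represent each segment of an article by a vector of affective features and let a recurrent network read the sequence of segments before classification. All layer sizes and feature dimensions are made up for illustration; this is not the released FakeFlow architecture.

```python
# Illustrative sketch: a GRU reads per-segment affective feature vectors
# (e.g., emotion-lexicon counts) and a linear head emits a fake/real logit.
import torch
import torch.nn as nn

class AffectiveFlowClassifier(nn.Module):
    def __init__(self, n_affect_feats: int = 16, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_affect_feats, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # one logit: fake vs. real

    def forward(self, segment_feats: torch.Tensor) -> torch.Tensor:
        # segment_feats: (batch, n_segments, n_affect_feats)
        _, h_last = self.gru(segment_feats)
        return self.head(h_last[-1])  # (batch, 1) logits

model = AffectiveFlowClassifier()
article = torch.rand(2, 10, 16)  # 2 articles, 10 segments, 16 affect features
print(torch.sigmoid(model(article)))  # probability of being fake
```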
  

NewsEmbed: Modeling News through Pre-trained Document Representations

Jun 05, 2021
Jialu Liu, Tianqi Liu, Cong Yu

Effectively modeling text-rich, fresh content such as news articles at the document level is a challenging problem. To ensure a content-based model generalizes well to a broad range of applications, it is critical to have a training dataset that is large beyond the scale of human labeling while achieving the desired quality. In this work, we address these two challenges by proposing a novel approach to mine semantically relevant fresh documents, together with their topic labels, with little human supervision. Meanwhile, we design a multitask model called NewsEmbed that alternately trains a contrastive learning objective and a multi-label classification objective to derive a universal document encoder. We show that the proposed approach can provide billions of high-quality organic training examples and can be naturally extended to a multilingual setting where texts in different languages are encoded in the same semantic space. We experimentally demonstrate NewsEmbed's competitive performance across multiple natural language understanding tasks, both supervised and unsupervised.

* Accepted in SIGKDD 2021 
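The alternating multitask scheme the abstract describes can be sketched as follows: even steps optimize an in-batch contrastive (InfoNCE) loss over document pairs, odd steps a multi-label topic classification loss, both through a shared encoder. The encoder, feature dimensions, label count, and alternation schedule here are illustrative assumptions, not the paper's configuration.

```python
# Sketch of alternating contrastive / multi-label training through one encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))
topic_head = nn.Linear(128, 50)  # 50 hypothetical topic labels
opt = torch.optim.Adam(list(encoder.parameters()) + list(topic_head.parameters()), lr=1e-4)

def contrastive_step(doc_a, doc_b, temperature=0.05):
    # InfoNCE with in-batch negatives: row i of doc_a matches row i of doc_b.
    za = F.normalize(encoder(doc_a), dim=-1)
    zb = F.normalize(encoder(doc_b), dim=-1)
    logits = za @ zb.T / temperature
    return F.cross_entropy(logits, torch.arange(len(za)))

def classification_step(doc, topic_labels):
    # Multi-label topics -> an independent sigmoid per label.
    return F.binary_cross_entropy_with_logits(topic_head(encoder(doc)), topic_labels)

for step in range(100):
    # Stand-ins for pre-computed document features of paired fresh documents.
    doc_a, doc_b = torch.randn(8, 768), torch.randn(8, 768)
    topics = torch.randint(0, 2, (8, 50)).float()
    loss = contrastive_step(doc_a, doc_b) if step % 2 == 0 else classification_step(doc_a, topics)
    opt.zero_grad(); loss.backward(); opt.step()
```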
  

Assortment Optimization with Customer Choice Modeling in a Crowdfunding Setting

Jul 14, 2022
Fatemeh Nosrat

Crowdfunding, the practice of raising funds from a large number of people's contributions, is among the most popular research topics in economic theory. Because crowdfunding platforms (CFPs) have facilitated the fundraising process by offering several features, we should take their existence and survival in the marketplace into account. In this study, we investigate the significant role of platform features in a customer behavioral choice model. In particular, we propose a multinomial logit model to describe the customers' (backers') behavior in a crowdfunding setting. We proceed by discussing the revenue-sharing model in these platforms, and conclude that an assortment optimization problem is of major importance for maximizing the platforms' revenue. We derive a reasonable amount of data in some cases and apply two well-known machine learning methods, multivariate regression and classification, to predict the best assortments the platform could offer to every arriving customer. We compare the results of these two methods and investigate how well they perform in all cases.
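For concreteness, a small sketch of the multinomial logit (MNL) choice model mentioned above: a backer facing assortment S chooses project i with probability exp(u_i) / (1 + Σ_{j∈S} exp(u_j)), where the 1 accounts for the no-purchase option. The utilities and revenues below are made-up numbers for illustration.

```python
# Textbook MNL choice probabilities and expected revenue of an assortment.
import math

def mnl_choice_probs(utilities: dict) -> dict:
    denom = 1.0 + sum(math.exp(u) for u in utilities.values())
    probs = {proj: math.exp(u) / denom for proj, u in utilities.items()}
    probs["no_choice"] = 1.0 / denom  # outside option
    return probs

def expected_revenue(utilities: dict, revenues: dict) -> float:
    probs = mnl_choice_probs(utilities)
    return sum(probs[p] * revenues[p] for p in utilities)

assortment = {"project_a": 0.8, "project_b": 0.3}   # estimated utilities
revenues = {"project_a": 12.0, "project_b": 20.0}   # platform's cut per pledge
print(mnl_choice_probs(assortment))
print(expected_revenue(assortment, revenues))
```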

  

On Privacy Protection of Latent Dirichlet Allocation Model Training

Jun 04, 2019
Fangyuan Zhao, Xuebin Ren, Shusen Yang, Xinyu Yang

Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic structure of text datasets, and it plays a fundamental role in many machine learning applications. However, like many other machine learning algorithms, the process of training an LDA model may leak sensitive information from the training datasets and bring significant privacy risks. To mitigate the privacy issues in LDA, we focus on studying privacy-preserving algorithms for LDA model training in this paper. In particular, we first develop a privacy monitoring algorithm to investigate the privacy guarantee obtained from the inherent randomness of the Collapsed Gibbs Sampling (CGS) process in a typical LDA training algorithm on centralized curated datasets. Then, we further propose a locally private LDA training algorithm on crowdsourced data to provide local differential privacy for individual data contributors. The experimental results on real-world datasets demonstrate the effectiveness of our proposed algorithms.

* 8 pages, 5 figures; published in the International Joint Conference on Artificial Intelligence (IJCAI) 
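For reference, the Collapsed Gibbs Sampling step whose inherent randomness the monitoring algorithm analyzes looks as follows. This is a textbook CGS sweep for LDA on a toy corpus, not the authors' privacy-preserving variant.

```python
# One collapsed Gibbs sampling sweep for LDA; full conditional for a token:
# p(z = k | rest) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
import numpy as np

rng = np.random.default_rng(0)
K, V, alpha, beta = 3, 20, 0.1, 0.01                    # topics, vocab, priors
docs = [rng.integers(0, V, size=15) for _ in range(5)]  # toy corpus

z = [rng.integers(0, K, size=len(d)) for d in docs]     # topic assignments
n_dk = np.zeros((len(docs), K)); n_kw = np.zeros((K, V)); n_k = np.zeros(K)
for d, doc in enumerate(docs):                          # initialize counts
    for i, w in enumerate(doc):
        n_dk[d, z[d][i]] += 1; n_kw[z[d][i], w] += 1; n_k[z[d][i]] += 1

for d, doc in enumerate(docs):                          # one CGS sweep
    for i, w in enumerate(doc):
        k = z[d][i]                                     # remove current assignment
        n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
        p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
        k = rng.choice(K, p=p / p.sum())                # resample from conditional
        z[d][i] = k
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
```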
  

Exploring Context Modeling Techniques on the Spatiotemporal Crowd Flow Prediction

Jun 30, 2021
Liyue Chen, Leye Wang

In the big data and AI era, context is widely exploited as extra information that makes it easier to learn more complex patterns in machine learning systems. However, most existing related studies seldom take context into account. The difficulty lies in the unknown generalization ability of both context and its modeling techniques across different scenarios. To fill the above gaps, we conduct a large-scale analytical and empirical study on the spatiotemporal crowd flow prediction (STCFP) problem, a widely studied and active research topic. We make three main efforts: (i) we develop a new taxonomy of both context features and context modeling techniques based on extensive investigations in prevailing STCFP research; (ii) we conduct extensive experiments on seven datasets with hundreds of millions of records to quantitatively evaluate the generalization ability of both distinct context features and context modeling techniques; (iii) we summarize some guidelines for researchers to conveniently utilize context in diverse applications.
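To make "context modeling technique" concrete, here is a sketch of one common approach such studies compare: early fusion, where context features (weather, holiday flags, etc.) are concatenated with the learned spatiotemporal representation before prediction. The architecture and all sizes are generic illustrations, not a model from the paper's benchmark.

```python
# Generic early-fusion baseline: a GRU encodes historical flows per region,
# then context features are concatenated before the prediction head.
import torch
import torch.nn as nn

class EarlyFusionSTCFP(nn.Module):
    def __init__(self, n_regions=64, n_context=8, hidden=128):
        super().__init__()
        self.flow_enc = nn.GRU(n_regions, hidden, batch_first=True)
        self.head = nn.Linear(hidden + n_context, n_regions)

    def forward(self, flow_hist, context):
        # flow_hist: (batch, hist_len, n_regions); context: (batch, n_context)
        _, h = self.flow_enc(flow_hist)
        fused = torch.cat([h[-1], context], dim=-1)  # early fusion of context
        return self.head(fused)  # next-step flow for every region

model = EarlyFusionSTCFP()
pred = model(torch.rand(4, 6, 64), torch.rand(4, 8))
print(pred.shape)  # torch.Size([4, 64])
```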

  

Dependent Multinomial Models Made Easy: Stick Breaking with the Pólya-Gamma Augmentation

Jun 18, 2015
Scott W. Linderman, Matthew J. Johnson, Ryan P. Adams

Many practical modeling problems involve discrete data that are best represented as draws from multinomial or categorical distributions. For example, nucleotides in a DNA sequence, children's names in a given state and year, and text documents are all commonly modeled with multinomial distributions. In all of these cases, we expect some form of dependency between the draws: the nucleotide at one position in the DNA strand may depend on the preceding nucleotides, children's names are highly correlated from year to year, and topics in text may be correlated and dynamic. These dependencies are not naturally captured by the typical Dirichlet-multinomial formulation. Here, we leverage a logistic stick-breaking representation and recent innovations in Pólya-gamma augmentation to reformulate the multinomial distribution in terms of latent variables with jointly Gaussian likelihoods, enabling us to take advantage of a host of Bayesian inference techniques for Gaussian models with minimal overhead.
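For readers unfamiliar with the construction, the logistic stick-breaking representation referred to above maps a vector of latent Gaussian variables ψ to the probabilities of a K-category multinomial as follows (a restatement of the standard formulation):

```latex
% Each sigma(psi_k) takes a bite of the remaining stick mass.
\pi_k(\psi) = \sigma(\psi_k)\prod_{j<k}\bigl(1-\sigma(\psi_j)\bigr),
\quad k = 1,\dots,K-1,
\qquad
\pi_K(\psi) = \prod_{j<K}\bigl(1-\sigma(\psi_j)\bigr),
\qquad
\sigma(\psi) = \frac{1}{1+e^{-\psi}}.
```

Because each logistic factor admits a Pólya-gamma augmentation, the conditional distribution over ψ becomes Gaussian, so correlated Gaussian priors such as Gaussian processes or linear dynamical systems can be placed on ψ while inference stays conjugate.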

  

Thousand to One: Semantic Prior Modeling for Conceptual Coding

Mar 16, 2021
Jianhui Chang, Zhenghui Zhao, Lingbo Yang, Chuanmin Jia, Jian Zhang, Siwei Ma

Conceptual coding is an emerging research topic that encodes natural images into disentangled conceptual representations for compression. However, the compression performance of existing methods is still sub-optimal due to the lack of comprehensive consideration of rate constraints and reconstruction quality. To this end, we propose a novel end-to-end semantic prior modeling-based conceptual coding scheme for extremely low-bitrate image compression, which leverages semantic-wise deep representations as a unified prior for entropy estimation and texture synthesis. Specifically, we employ semantic segmentation maps as structural guidance for extracting the deep semantic prior, which provides fine-grained texture distribution modeling for better detail reconstruction and higher flexibility in subsequent high-level vision tasks. Moreover, a cross-channel entropy model is proposed to further exploit the inter-channel correlation of the spatially independent semantic prior, leading to more accurate entropy estimation for rate-constrained training. The proposed scheme achieves an ultra-high 1000x compression ratio while still enjoying high visual reconstruction quality and versatility towards visual processing and analysis tasks.

* Accepted as an oral presentation at ICME 2021 
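The cross-channel entropy model can be pictured as a channel-wise autoregressive density: each channel's distribution parameters are predicted from the preceding channels, and the negative log-likelihood serves as the rate term during training. The sketch below is an illustration under assumed shapes, not the paper's released model.

```python
# Channel-autoregressive Gaussian entropy model over pooled latents.
import torch
import torch.nn as nn

class CrossChannelEntropyModel(nn.Module):
    def __init__(self, n_channels: int = 8):
        super().__init__()
        # Channel c's Gaussian parameters are predicted from channels 0..c-1.
        self.predictors = nn.ModuleList(
            [nn.Linear(max(c, 1), 2) for c in range(n_channels)]
        )

    def rate(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, n_channels) pooled semantic-prior latents
        total_bits = torch.tensor(0.0)
        for c, predictor in enumerate(self.predictors):
            ctx = y[:, :c] if c > 0 else torch.zeros(y.size(0), 1)
            mean, log_scale = predictor(ctx).unbind(-1)
            dist = torch.distributions.Normal(mean, log_scale.exp())
            # Rate proxy: negative log-likelihood in bits.
            total_bits = total_bits - dist.log_prob(y[:, c]).sum() / torch.log(torch.tensor(2.0))
        return total_bits

model = CrossChannelEntropyModel()
print(model.rate(torch.randn(4, 8)))  # estimated bits for a batch of latents
```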
  