Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.
Podcasts have emerged as a massively consumed online content, notably due to wider accessibility of production means and scaled distribution through large streaming platforms. Categorization systems and information access technologies typically use topics as the primary way to organize or navigate podcast collections. However, annotating podcasts with topics is still quite problematic because the assigned editorial genres are broad, heterogeneous or misleading, or because of data challenges (e.g. short metadata text, noisy transcripts). Here, we assess the feasibility to discover relevant topics from podcast metadata, titles and descriptions, using topic modeling techniques for short text. We also propose a new strategy to leverage named entities (NEs), often present in podcast metadata, in a Non-negative Matrix Factorization (NMF) topic modeling framework. Our experiments on two existing datasets from Spotify and iTunes and Deezer, a new dataset from an online service providing a catalog of podcasts, show that our proposed document representation, NEiCE, leads to improved topic coherence over the baselines. We release the code for experimental reproducibility of the results.
We describe a new method for visualizing topics, the distributions over terms that are automatically extracted from large text corpora using latent variable models. Our method finds significant $n$-grams related to a topic, which are then used to help understand and interpret the underlying distribution. Compared with the usual visualization, which simply lists the most probable topical terms, the multi-word expressions provide a better intuitive impression for what a topic is "about." Our approach is based on a language model of arbitrary length expressions, for which we develop a new methodology based on nested permutation tests to find significant phrases. We show that this method outperforms the more standard use of $\chi^2$ and likelihood ratio tests. We illustrate the topic presentations on corpora of scientific abstracts and news articles.
Supervised topic models utilize document's side information for discovering predictive low dimensional representations of documents. Existing models apply the likelihood-based estimation. In this paper, we present a general framework of max-margin supervised topic models for both continuous and categorical response variables. Our approach, the maximum entropy discrimination latent Dirichlet allocation (MedLDA), utilizes the max-margin principle to train supervised topic models and estimate predictive topic representations that are arguably more suitable for prediction tasks. The general principle of MedLDA can be applied to perform joint max-margin learning and maximum likelihood estimation for arbitrary topic models, directed or undirected, and supervised or unsupervised, when the supervised side information is available. We develop efficient variational methods for posterior inference and parameter estimation, and demonstrate qualitatively and quantitatively the advantages of MedLDA over likelihood-based topic models on movie review and 20 Newsgroups data sets.
Term frequency normalization is a serious issue since lengths of documents are various. Generally, documents become long due to two different reasons - verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned by terms related to the topic, so that term frequency is more increased than the well-summarized one. Second, multi-topicality indicates that a document has a broad discussion of multi-topics, rather than single topic. Although these document characteristics should be differently handled, all previous methods of term frequency normalization have ignored these differences and have used a simplified length-driven approach which decreases the term frequency by only the length of a document, causing an unreasonable penalization. To attack this problem, we propose a novel TF normalization method which is a type of partially-axiomatic approach. We first formulate two formal constraints that the retrieval model should satisfy for documents having verbose and multi-topicality characteristic, respectively. Then, we modify language modeling approaches to better satisfy these two constraints, and derive novel smoothing methods. Experimental results show that the proposed method increases significantly the precision for keyword queries, and substantially improves MAP (Mean Average Precision) for verbose queries.
Topic detection is a process for determining topics from a collection of textual data. One of the topic detection methods is a clustering-based method, which assumes that the centroids are topics. The clustering method has the advantage that it can process data with negative representations. Therefore, the clustering method allows a combination with a broader representation learning method. In this paper, we adopt deep learning for topic detection by using a deep autoencoder and fuzzy c-means called deep autoencoder-based fuzzy c-means (DFCM). The encoder of the autoencoder performs a lower-dimensional representation learning. Fuzzy c-means groups the lower-dimensional representation to identify the centroids. The autoencoder's decoder transforms back the centroids into the original representation to be interpreted as the topics. Our simulation shows that DFCM improves the coherence score of eigenspace-based fuzzy c-means (EFCM) and is comparable to the leading standard methods, i.e., nonnegative matrix factorization (NMF) or latent Dirichlet allocation (LDA).
In this paper, we propose a Chinese multi-turn topic-driven conversation dataset, NaturalConv, which allows the participants to chat anything they want as long as any element from the topic is mentioned and the topic shift is smooth. Our corpus contains 19.9K conversations from six domains, and 400K utterances with an average turn number of 20.1. These conversations contain in-depth discussions on related topics or widely natural transition between multiple topics. We believe either way is normal for human conversation. To facilitate the research on this corpus, we provide results of several benchmark models. Comparative results show that for this dataset, our current models are not able to provide significant improvement by introducing background knowledge/topic. Therefore, the proposed dataset should be a good benchmark for further research to evaluate the validity and naturalness of multi-turn conversation systems. Our dataset is available at https://ai.tencent.com/ailab/nlp/dialogue/#datasets.
Plug-and-play language models (PPLMs) enable topic-conditioned natural language generation by pairing large pre-trained generators with attribute models used to steer the predicted token distribution towards the selected topic. Despite their computational efficiency, PPLMs require large amounts of labeled texts to effectively balance generation fluency and proper conditioning, making them unsuitable for low-resource settings. We present ETC-NLG, an approach leveraging topic modeling annotations to enable fully-unsupervised End-to-end Topic-Conditioned Natural Language Generation over emergent topics in unlabeled document collections. We first test the effectiveness of our approach in a low-resource setting for Italian, evaluating the conditioning for both topic models and gold annotations. We then perform a comparative evaluation of ETC-NLG for Italian and English using a parallel corpus. Finally, we propose an automatic approach to estimate the effectiveness of conditioning on the generated utterances.
We use structural topic modeling to examine racial bias in data collected to train models to detect hate speech and abusive language in social media posts. We augment the abusive language dataset by adding an additional feature indicating the predicted probability of the tweet being written in African-American English. We then use structural topic modeling to examine the content of the tweets and how the prevalence of different topics is related to both abusiveness annotation and dialect prediction. We find that certain topics are disproportionately racialized and considered abusive. We discuss how topic modeling may be a useful approach for identifying bias in annotated data.