"Topic Modeling": models, code, and papers

Image Super-Resolution via Sparse Bayesian Modeling of Natural Images

Sep 19, 2012
Haichao Zhang, David Wipf, Yanning Zhang

Image super-resolution (SR) is one of the long-standing and active topics in the image processing community. A large body of work on image super-resolution formulates the problem with Bayesian modeling techniques and then obtains its Maximum-A-Posteriori (MAP) solution, which boils down to a regularized regression task with a separable regularization term. Although straightforward, this approach cannot exploit the full potential offered by probabilistic modeling, as only the posterior mode is sought. Also, the separable regularization term cannot capture any correlations between the sparse coefficients, which sacrifices much of its modeling accuracy. We propose a Bayesian image SR algorithm via sparse modeling of natural images. The sparsity property of the latent high-resolution image is exploited by introducing latent variables into a high-order Markov Random Field (MRF), which capture the content-adaptive variance by pixel-wise adaptation. The high-resolution image is estimated via an Empirical Bayesian estimation scheme, which is substantially faster than our previous approach based on Markov Chain Monte Carlo sampling [1]. It is shown that the actual cost function for the proposed approach incorporates a non-factorial regularization term over the sparse coefficients. Experimental results indicate that the proposed method can generate competitive or better results than state-of-the-art SR algorithms.

* 8 figures, 29 pages 

Comprehensive Information Integration Modeling Framework for Video Titling

Jun 24, 2020
Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Tan Jiang, Jingren Zhou, Hongxia Yang, Fei Wu

In e-commerce, consumer-generated videos, which in general deliver consumers' individual preferences for the different aspects of certain products, are massive in volume. To recommend these videos to potential consumers more effectively, diverse and catchy video titles are critical. However, consumer-generated videos are seldom accompanied by appropriate titles. To bridge this gap, we integrate comprehensive sources of information, including the content of consumer-generated videos, the narrative comment sentences supplied by consumers, and the product attributes, in an end-to-end modeling framework. Although automatic video titling is very useful and in demand, it is much less addressed than video captioning. The latter focuses on generating sentences that describe videos as a whole, while our task requires product-aware, multi-grained video analysis. To tackle this issue, the proposed method consists of two processes, i.e., granular-level interaction modeling and abstraction-level story-line summarization. Specifically, the granular-level interaction modeling first utilizes temporal-spatial landmark cues, descriptive words, and abstractive attributes to build three individual graphs and recognizes the intra-actions in each graph through Graph Neural Networks (GNN). Then the global-local aggregation module is proposed to model inter-actions across graphs and aggregate heterogeneous graphs into a holistic graph representation. The abstraction-level story-line summarization further considers both frame-level video features and the holistic graph to utilize the interactions between products and backgrounds, and generate the story-line topic of the video. We collect a large-scale dataset accordingly from real-world data in Taobao, a world-leading e-commerce platform, and will make the desensitized version publicly available to nourish further development of the research community...

* 11 pages, 6 figures, to appear in KDD 2020 proceedings 

Multi-level computational methods for interdisciplinary research in the HathiTrust Digital Library

Jun 08, 2017
Jaimie Murdock, Colin Allen, Katy Börner, Robert Light, Simon McAlister, Andrew Ravenscroft, Robert Rose, Doori Rose, Jun Otsuka, David Bourget, John Lawrence, Chris Reed

We show how faceted search using a combination of traditional classification systems and mixed-membership topic models can go beyond keyword search to inform resource discovery, hypothesis formulation, and argument extraction for interdisciplinary research. Our test domain is the history and philosophy of scientific work on animal mind and cognition. The methods can be generalized to other research areas and ultimately support a system for semi-automatic identification of argument structures. We provide a case study for the application of the methods to the problem of identifying and extracting arguments about anthropomorphism during a critical period in the development of comparative psychology. We show how a combination of classification systems and mixed-membership models trained over large digital libraries can inform resource discovery in this domain. Through a novel approach of "drill-down" topic modeling---simultaneously reducing both the size of the corpus and the unit of analysis---we are able to reduce a large collection of fulltext volumes to a much smaller set of pages within six focal volumes containing arguments of interest to historians and philosophers of comparative psychology. The volumes identified in this way did not appear among the first ten results of the keyword search in the HathiTrust digital library and the pages bear the kind of "close reading" needed to generate original interpretations that is the heart of scholarly work in the humanities. Zooming back out, we provide a way to place the books onto a map of science originally constructed from very different data and for different purposes. The multilevel approach advances understanding of the intellectual and societal contexts in which writings are interpreted.

* revised, 29 pages, 3 figures 

Exploration and Exploitation of Victorian Science in Darwin's Reading Notebooks

Feb 02, 2017
Jaimie Murdock, Colin Allen, Simon DeDeo

Search in an environment with an uncertain distribution of resources involves a trade-off between exploitation of past discoveries and further exploration. This extends to information foraging, where a knowledge-seeker shifts between reading in depth and studying new domains. To study this decision-making process, we examine the reading choices made by one of the most celebrated scientists of the modern era: Charles Darwin. From the full text of books listed in his chronologically organized reading journals, we generate topic models to quantify his local (text-to-text) and global (text-to-past) reading decisions using Kullback-Leibler divergence, a cognitively validated, information-theoretic measure of relative surprise. Rather than a pattern of surprise-minimization, corresponding to a pure exploitation strategy, Darwin's behavior shifts from early exploitation to later exploration, seeking unusually high levels of cognitive surprise relative to previous eras. These shifts, detected by an unsupervised Bayesian model, correlate with major intellectual epochs of his career as identified both by qualitative scholarship and Darwin's own self-commentary. Our methods allow us to compare his consumption of texts with their publication order. We find Darwin's consumption more exploratory than the culture's production, suggesting that underneath gradual societal changes are the explorations of individual synthesis and discovery. Our quantitative methods advance the study of cognitive search through a framework for testing interactions between individual and collective behavior and between short- and long-term consumption choices. This novel application of topic modeling to characterize individual reading complements widespread studies of collective scientific behavior.

* Cognition 159 (2017) 117-126 
* Cognition pre-print, published February 2017; 22 pages, plus 17 pages supporting information, 7 pages references 
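
The local (text-to-text) and global (text-to-past) surprise measures described in the abstract above can be illustrated with a small sketch. This is not the authors' code: the direction of the divergence (current text relative to the earlier reading), the use of a plain average of past topic mixtures as the "past", and the toy topic mixtures are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions of equal length."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a zero-probability term contributes nothing
        if q_i == 0.0:
            return math.inf  # p puts mass where q has none
        total += p_i * math.log2(p_i / q_i)
    return total


def mean_distribution(dists):
    """Element-wise average of several distributions (assumed model of 'the past')."""
    n = len(dists)
    return [sum(column) / n for column in zip(*dists)]


def reading_surprise(topic_mixtures):
    """Return (local, global) surprise for every text after the first.

    topic_mixtures is a chronologically ordered list of per-text topic
    distributions, e.g. the document-topic rows of a fitted topic model.
    """
    results = []
    for t in range(1, len(topic_mixtures)):
        current = topic_mixtures[t]
        local = kl_divergence(current, topic_mixtures[t - 1])
        past = mean_distribution(topic_mixtures[:t])
        results.append((local, kl_divergence(current, past)))
    return results


# Hypothetical topic mixtures for three texts read in order.
readings = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]
for local, global_ in reading_surprise(readings):
    print(f"local={local:.3f} bits, global={global_:.3f} bits")
```

In this toy sequence the third text jumps to a topic the earlier texts barely touch, so both of its surprise values are much larger than those of the second text, which is the kind of shift an exploration phase would produce.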

Large Scale Analysis of Open MOOC Reviews to Support Learners' Course Selection

Jan 11, 2022
Manuel J. Gomez, Mario Calderón, Victor Sánchez, Félix J. García Clemente, José A. Ruipérez-Valiente

The recent pandemic has changed the way we see education. It is not surprising that children and college students are not the only ones using online education. Millions of adults have signed up for online classes and courses in recent years, and MOOC providers, such as Coursera or edX, are reporting millions of new users signing up on their platforms. However, students do face some challenges when choosing courses. Though online review systems are standard among many verticals, no standardized or fully decentralized review systems exist in the MOOC ecosystem. In this vein, we believe that there is an opportunity to leverage available open MOOC reviews in order to build simpler and more transparent reviewing systems, allowing users to really identify the best courses out there. Specifically, in our research we analyze 2.4 million reviews (which is the largest MOOC reviews dataset used until now) from five different platforms in order to determine the following: (1) if the numeric ratings provide discriminant information to learners, (2) if NLP-driven sentiment analysis on textual reviews could provide valuable information to learners, (3) if we can leverage NLP-driven topic finding techniques to infer themes that could be important for learners, and (4) if we can use these models to effectively characterize MOOCs based on the open reviews. Results show that numeric ratings are clearly biased (63% of them are 5-star ratings), and the topic modeling reveals some interesting topics related to course advertisements, the real applicability, or the difficulty of the different courses. We expect our study to shed some light on the area and promote a more transparent approach in online education reviews, which are becoming more and more popular as we enter the post-pandemic era.

* 36 pages, 8 figures 

Modeling Engagement Dynamics of Online Discussions using Relativistic Gravitational Theory

Aug 10, 2019
Subhabrata Dutta, Dipankar Das, Tanmoy Chakraborty

Online discussions are valuable resources to study user behaviour on a diverse set of topics. Unlike previous studies which model a discussion in a static manner, in the present study, we model it as a time-varying process and solve two inter-related problems -- predict which user groups will get engaged with an ongoing discussion, and forecast the growth rate of a discussion in terms of the number of comments. We propose RGNet (Relativistic Gravitational Network), a novel algorithm that uses the Einstein Field Equations of gravity to model online discussions as a 'cloud of dust' hovering over a user spacetime manifold, attracting users of different groups at different rates over time. We also propose GUVec, a global user embedding method for an online discussion, which is used by RGNet to predict temporal user engagement. RGNet leverages different textual and network-based features to learn the dust distribution for discussions. We employ four baselines -- the first two using an LSTM architecture, the third using a Newtonian model of gravity, and the fourth using a logistic regression adopted from a previous work on engagement prediction. Experiments on a Reddit dataset show that RGNet achieves 0.72 Micro F1 score and 6.01% average error for temporal engagement prediction of user groups and growth rate forecasting, respectively, outperforming all the baselines significantly. We further employ RGNet to predict non-temporal engagement -- whether users will comment on a given post or not. RGNet achieves 0.62 AUC for this task, outperforming the existing baseline by 8.77% AUC.


Social Media Text Processing and Semantic Analysis for Smart Cities

Sep 11, 2017
João Filipe Figueiredo Pereira

With the rise of Social Media, people obtain and share information almost instantly on a 24/7 basis. Many research areas have tried to gain valuable insights from these large volumes of freely available user generated content. With the goal of extracting knowledge from social media streams that might be useful in the context of intelligent transportation systems and smart cities, we designed and developed a framework that provides functionalities for parallel collection of geo-located tweets from multiple pre-defined bounding boxes (cities or regions), including filtering of non-complying tweets, text pre-processing for Portuguese and English language, topic modeling, and transportation-specific text classifiers, as well as aggregation and data visualization. We performed an exploratory data analysis of geo-located tweets in 5 different cities: Rio de Janeiro, São Paulo, New York City, London and Melbourne, comprising a total of more than 43 million tweets in a period of 3 months. Furthermore, we performed a large scale topic modelling comparison between Rio de Janeiro and São Paulo. Interestingly, most of the topics are shared between both cities which despite being in the same country are considered very different regarding population, economy and lifestyle. We take advantage of recent developments in word embeddings and train such representations from the collections of geo-located tweets. We then use a combination of bag-of-embeddings and traditional bag-of-words to train travel-related classifiers in both Portuguese and English to filter travel-related content from non-related. We created specific gold-standard data to perform empirical evaluation of the resulting classifiers. Results are in line with research work in other application areas by showing the robustness of using word embeddings to learn word similarities that bag-of-words is not able to capture.


WHAI: Weibull Hybrid Autoencoding Inference for Deep Topic Modeling

Mar 04, 2018
Hao Zhang, Bo Chen, Dandan Guo, Mingyuan Zhou

To train an inference network jointly with a deep generative topic model, making it both scalable to big corpora and fast in out-of-sample prediction, we develop Weibull hybrid autoencoding inference (WHAI) for deep latent Dirichlet allocation, which infers posterior samples via a hybrid of stochastic-gradient MCMC and autoencoding variational Bayes. The generative network of WHAI has a hierarchy of gamma distributions, while the inference network of WHAI is a Weibull upward-downward variational autoencoder, which integrates a deterministic-upward deep neural network and a stochastic-downward deep generative model based on a hierarchy of Weibull distributions. The Weibull distribution can closely approximate a gamma distribution with an analytic Kullback-Leibler divergence, and has a simple reparameterization via uniform noise, both of which help efficiently compute the gradients of the evidence lower bound with respect to the parameters of the inference network. The effectiveness and efficiency of WHAI are illustrated with experiments on big corpora.

* ICLR 2018 
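
The two Weibull properties mentioned in the abstract above can be sketched in a few lines. This is an illustration, not the WHAI implementation: the function names and parameter values are hypothetical. The sampler uses the standard inverse-CDF form x = λ(-ln(1 - ε))^(1/k) with ε drawn uniformly from [0, 1), and the closed form for KL(Weibull(k, λ) || Gamma(α, β)), with β a rate parameter, follows from the Weibull expectations E[ln x] = ln λ - γ/k and E[x] = λΓ(1 + 1/k), where γ is the Euler-Mascheroni constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def sample_weibull(shape_k, scale_lam, rng):
    """Reparameterized draw: a deterministic function of (k, lambda) and uniform noise."""
    eps = rng.random()  # uniform on [0, 1)
    return scale_lam * (-math.log(1.0 - eps)) ** (1.0 / shape_k)


def kl_weibull_gamma(shape_k, scale_lam, alpha, beta):
    """Closed-form KL(Weibull(k, lambda) || Gamma(alpha, rate=beta))."""
    return (
        EULER_GAMMA * alpha / shape_k
        - alpha * math.log(scale_lam)
        + math.log(shape_k)
        + beta * scale_lam * math.gamma(1.0 + 1.0 / shape_k)
        - EULER_GAMMA
        - 1.0
        - alpha * math.log(beta)
        + math.lgamma(alpha)
    )


rng = random.Random(0)
draws = [sample_weibull(2.0, 1.5, rng) for _ in range(20000)]
print("sample mean:", sum(draws) / len(draws))
print("exact mean: ", 1.5 * math.gamma(1.5))
print("KL to Gamma(2, 1):", kl_weibull_gamma(2.0, 1.5, 2.0, 1.0))
```

Because the noise ε does not depend on k or λ, a sample is a differentiable function of the variational parameters, which is what lets gradients of a Monte Carlo objective flow back to them. As a sanity check on the closed form, a Weibull with k = 1 is an exponential distribution, so its divergence from Gamma(1, 1/λ) is zero.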

Natural Language Processing via LDA Topic Model in Recommendation Systems

Sep 20, 2019
Hamed Jelodar, Yongli Wang, Mahdi Rabbani, SeyedValyAllah Ayobi

Today, the Internet is one of the most widely available media worldwide. Recommendation systems are increasingly used in applications such as movie, mobile, and article recommendation. Collaborative Filtering (CF) and Content-Based (CB) filtering are well-known techniques for building recommendation systems. Topic modeling based on LDA is a powerful technique for semantic mining and topic extraction. In the past few years, many articles have been published that use LDA to build recommendation systems. In this paper, we present a taxonomy of recommendation systems and applications based on LDA. In addition, we use LDA with Gibbs sampling to evaluate ISWC and WWW conference publications in computer science. Our study suggests that recommendation systems based on LDA could be effective in building smart recommendation systems in online communities.
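
For readers unfamiliar with the combination of LDA and Gibbs sampling mentioned in the abstract above, the following is a minimal sketch of a collapsed Gibbs sampler for LDA. It is not the authors' code; the function name, hyperparameter defaults, and the toy corpus of word ids are assumptions chosen only to keep the example self-contained.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                           # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical corpus: word ids 0-2 and 3-5 tend to co-occur.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [5, 4, 3, 3]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
for d, counts in enumerate(doc_topic):
    print(f"doc {d}: topic counts {counts}")
```

The returned count tables can be smoothed with `alpha` and `beta` and normalized to obtain document-topic and topic-word distributions. In a recommendation setting, the document-topic rows are one possible feature representation for comparing items.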


Topic Modeling on User Stories using Word Mover's Distance

Jul 13, 2020
Kim Julian Gülle, Nicholas Ford, Patrick Ebel, Florian Brokhausen, Andreas Vogelsang

Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.
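
To give a feel for how word embeddings can be turned into a distance between user stories, here is a sketch of the relaxed Word Mover's Distance, a cheap lower bound on the full Word Mover's Distance. It is a swapped-in simplification, not the method evaluated in the paper: the full distance requires solving an optimal transport problem, whereas the relaxed variant simply sends each word's mass to its nearest word in the other text. The two-dimensional embeddings and the token lists are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalized_bow(tokens, embeddings):
    """Normalized bag-of-words weights over tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no token has an embedding")
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst, embeddings):
    """Move each source word's mass entirely to its nearest destination word."""
    return sum(
        weight * min(euclidean(embeddings[word], embeddings[other]) for other in dst)
        for word, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b, embeddings):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    a = normalized_bow(tokens_a, embeddings)
    b = normalized_bow(tokens_b, embeddings)
    return max(one_sided_cost(a, b, embeddings), one_sided_cost(b, a, embeddings))


# Hypothetical 2-d embeddings; real ones would come from a trained model.
toy_embeddings = {
    "login": (1.0, 0.0),
    "password": (0.9, 0.1),
    "export": (0.0, 1.0),
    "report": (0.1, 0.9),
}
story_a = ["login", "password"]
story_b = ["password", "login", "login"]
story_c = ["export", "report"]
print(relaxed_wmd(story_a, story_b, toy_embeddings))
print(relaxed_wmd(story_a, story_c, toy_embeddings))
```

A pairwise distance matrix built this way can be fed to any distance-based clustering method to group stories into topics. Stories that share vocabulary (`story_a`, `story_b`) end up at distance zero, while stories about a different feature (`story_c`) are far away.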