Modern Automatic Speech Recognition (ASR) systems primarily rely on scores from an Acoustic Model (AM) and a Language Model (LM) to rescore the N-best lists. Despite the abundance of recent advances in natural language processing, the information that current ASR systems use to evaluate the linguistic and semantic legitimacy of the N-best hypotheses is rather limited. In this paper, we propose a novel Learning-to-Rescore (L2RS) mechanism, which is designed to draw on a wide range of textual information from state-of-the-art NLP models and to automatically decide their weights when rescoring the N-best lists for ASR systems. Specifically, we incorporate features including BERT sentence embeddings, topic vectors, and perplexity scores produced by an n-gram LM, a topic modeling LM, a BERT LM and an RNNLM to train a rescoring model. We conduct extensive experiments on a public dataset, and the results show that L2RS outperforms not only traditional rescoring methods but also its deep neural network counterparts, with a substantial improvement of 20.67% in NDCG@10. L2RS paves the way for developing more effective rescoring models for ASR.
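To make the rescoring setup and the NDCG@10 metric concrete, the following is a minimal sketch, not the L2RS implementation. It assumes each hypothesis carries a list of feature scores that are combined by a weighted sum, and it assumes the linear-gain form of DCG; the abstract specifies neither. The names `rescore`, `dcg_at_k`, `ndcg_at_k`, and the toy feature values are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items (linear gain assumed)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def rescore(hypotheses, weights):
    """Sort hypotheses by a weighted sum of their feature scores, best first."""
    def score(hyp):
        return sum(w * f for w, f in zip(weights, hyp["features"]))
    return sorted(hypotheses, key=score, reverse=True)


# Hypothetical 3-best list: features are [acoustic score, language score].
n_best = [
    {"text": "wreck a nice beach", "features": [0.9, 0.1], "relevance": 0},
    {"text": "recognize speech", "features": [0.7, 0.9], "relevance": 2},
    {"text": "recognize peach", "features": [0.6, 0.4], "relevance": 1},
]
ranked = rescore(n_best, weights=[0.3, 0.7])
print([h["text"] for h in ranked])
print(ndcg_at_k([h["relevance"] for h in ranked], k=10))
```

In a learning-to-rank setting the weights would be fitted to maximize a ranking metric such as NDCG@10 rather than set by hand as they are here.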
The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straightforward interpretation of the regression coefficients and analysis of the quantity of cluster-specific elements in each sampling unit (instead of focusing the analysis on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show that our algorithm successfully retrieves the true parameter values and can predict the abundance matrix using the information given by the covariates. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
Latent Dirichlet Allocation (LDA) is a popular topic modeling technique for discovering the hidden semantic structure of text datasets, and plays a fundamental role in many machine learning applications. However, like many other machine learning algorithms, the process of training an LDA model may leak sensitive information from the training datasets and bring significant privacy risks. To mitigate the privacy issues in LDA, we study privacy-preserving algorithms for LDA model training in this paper. In particular, we first develop a privacy monitoring algorithm to investigate the privacy guarantee obtained from the inherent randomness of the Collapsed Gibbs Sampling (CGS) process in a typical LDA training algorithm on centralized curated datasets. Then, we further propose a locally private LDA training algorithm on crowdsourced data to provide local differential privacy for individual data contributors. The experimental results on real-world datasets demonstrate the effectiveness of our proposed algorithms.
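As a generic illustration of local differential privacy, the sketch below uses k-ary randomized response, a standard mechanism in which each contributor perturbs a value before sharing it. This is an assumption chosen for illustration; the abstract does not state which mechanism the proposed locally private algorithm uses. The function name, the toy vocabulary, and the choice of epsilon are hypothetical.

```python
import math
import random


def randomized_response(true_value, domain, epsilon, rng):
    """Report the true value with probability e^eps / (e^eps + k - 1),
    otherwise report one of the other k - 1 values uniformly at random.
    The ratio of any two report probabilities is at most e^eps."""
    k = len(domain)
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return true_value
    others = [v for v in domain if v != true_value]
    return rng.choice(others)


# Hypothetical usage: a contributor perturbs one word from a toy vocabulary.
vocabulary = ["virus", "vaccine", "market"]
rng = random.Random(0)
print(randomized_response("vaccine", vocabulary, epsilon=1.0, rng=rng))
```

Smaller values of epsilon make the report noisier and the guarantee stronger; the aggregator then has to correct for the known perturbation probabilities when estimating counts.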
Models for sequential data such as the recurrent neural network (RNN) often implicitly model a sequence as having a fixed time interval between observations and do not account for group-level effects when multiple sequences are observed. We propose a model for grouped sequential data based on the RNN that accounts for varying time intervals between observations in a sequence by learning a group-level base parameter to which each sequence can revert. Our approach is motivated by the mixed membership framework, and we show how it can be used for dynamic topic modeling in which the distribution on topics (not the topics themselves) evolves over time. We demonstrate our approach on a dataset of 3.4 million online grocery shopping orders made by 206K customers.
Topic models have been the prominent tools for automatic topic discovery from text corpora. Despite their effectiveness, topic models suffer from several limitations, including the inability to model word ordering information in documents, the difficulty of incorporating external linguistic knowledge, and the lack of inference methods that are both accurate and efficient for approximating the intractable posterior. Recently, pretrained language models (PLMs) have brought astonishing performance improvements to a wide variety of tasks due to their superior representations of text. Interestingly, there have been no standard approaches for deploying PLMs for topic discovery as better alternatives to topic models. In this paper, we begin by analyzing the challenges of using PLM representations for topic discovery, and then propose a joint latent space learning and clustering framework built upon PLM embeddings. In the latent space, topic-word and document-topic distributions are jointly modeled so that the discovered topics can be interpreted by coherent and distinctive terms while also serving as meaningful summaries of the documents. Our model effectively leverages the strong representation power and rich linguistic features brought by PLMs for topic discovery, and is conceptually simpler than topic models. On two benchmark datasets in different domains, our model generates significantly more coherent and diverse topics than strong topic models, and offers better topic-wise document representations, based on both automatic and human evaluations.
Topic modeling analyzes documents to learn meaningful patterns of words. Dynamic topic models capture how these patterns vary over time for a set of documents that were collected over a large time span. We develop the dynamic embedded topic model (D-ETM), a generative model of documents that combines dynamic latent Dirichlet allocation (D-LDA) and word embeddings. The D-ETM models each word with a categorical distribution whose parameter is given by the inner product between the word embedding and an embedding representation of its assigned topic at a particular time step. The word embeddings allow the D-ETM to generalize to rare words. The D-ETM learns smooth topic trajectories by defining a random walk prior over the embeddings of the topics. We fit the D-ETM using structured amortized variational inference. On a collection of United Nations debates, we find that the D-ETM learns interpretable topics and outperforms D-LDA in terms of both topic quality and predictive performance.
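The word distribution and the random walk prior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the D-ETM code: it assumes the inner products between word embeddings and a topic embedding are normalized with a softmax over the vocabulary, and that the random walk adds isotropic Gaussian noise at each time step. The function names, the toy embeddings, and `step_std` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topic_word_distribution(word_embeddings, topic_embedding):
    """Categorical distribution over the vocabulary for one topic at one time step:
    softmax of the inner product between each word embedding and the topic embedding."""
    logits = [sum(r * a for r, a in zip(rho, topic_embedding)) for rho in word_embeddings]
    return softmax(logits)


def random_walk_step(topic_embedding, step_std, rng):
    """Draw the next time step's topic embedding from a Gaussian centered on the current one."""
    return [a + rng.gauss(0.0, step_std) for a in topic_embedding]


# Toy example: 3 words, 2-dimensional embeddings, one topic followed over 3 time steps.
rho = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
alpha = [1.0, 0.0]
rng = random.Random(0)
for t in range(3):
    print(t, topic_word_distribution(rho, alpha))
    alpha = random_walk_step(alpha, step_std=0.1, rng=rng)
```

Because the word embeddings are shared across topics and time steps, a rare word still receives a sensible probability whenever its embedding lies close to a topic embedding.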
Among American women, the rate of breast cancer is second only to that of lung cancer. An estimated 12.4% of women will develop breast cancer over the course of their lifetime. The widespread use of social media across the socio-economic spectrum offers unparalleled ways to facilitate information sharing, in particular as it pertains to health. Social media is also used by many healthcare stakeholders, ranging from government agencies to the healthcare industry, to disseminate health information and to engage patients. The purpose of this study is to investigate people's perceptions and attitudes related to breast cancer on Twitter, especially those related to physical activities. To achieve this, we first identified and collected tweets related to breast cancer, and then used topic modeling and sentiment analysis techniques to understand discussion themes and quantify Twitter users' perceptions and emotions with respect to breast cancer, in order to answer five research questions.
Online hate speech, particularly over microblogging platforms like Twitter, has emerged as arguably the most severe issue of the past decade. Several countries have reported a steep rise in hate crimes incited by malicious hate campaigns. While the detection of hate speech is one of the emerging research areas, the generation and spread of topic-dependent hate in the information network remain under-explored. In this work, we explore the user behaviour that triggers the genesis of hate speech on Twitter and how that speech diffuses via retweets. We crawl a large-scale dataset of tweets, retweets, user activity history, and follower networks, comprising over 161 million tweets from more than 41 million unique users. We also collect over 600k contemporary news articles published online. We characterize different signals of information that govern these dynamics. Our analyses differentiate the diffusion dynamics in the presence of hate from usual information diffusion. This motivates us to formulate the modelling problem in a topic-aware setting with real-world knowledge. For predicting the initiation of hate speech for any given hashtag, we propose multiple feature-rich models, with the best performing one achieving a macro F1 score of 0.65. Meanwhile, to predict the retweet dynamics on Twitter, we propose RETINA, a novel neural architecture that incorporates exogenous influence using scaled dot-product attention. RETINA achieves a macro F1 score of 0.85, outperforming multiple state-of-the-art models. Our analysis shows that RETINA predicts the retweet dynamics of hateful content better than the existing diffusion models.
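Scaled dot-product attention, the mechanism named above, computes softmax(QK^T / sqrt(d_k)) V. The sketch below shows only that generic operation in plain Python; how RETINA builds its queries, keys, and values is not stated in the abstract, so the inputs here are hypothetical placeholders.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, weight the values by softmax(q . k / sqrt(d_k)) over all keys."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical inputs: one query attending over two key/value pairs.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(scaled_dot_product_attention(queries, keys, values))
```

Dividing by sqrt(d_k) keeps the scores from growing with the key dimension, which would otherwise push the softmax toward a near one-hot weighting.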
This paper presents a novel communication-efficient parallel belief propagation (CE-PBP) algorithm for training latent Dirichlet allocation (LDA). Based on the synchronous belief propagation (BP) algorithm, we first develop a parallel belief propagation (PBP) algorithm on the parallel architecture. Because extensive communication delay often reduces the efficiency of parallel topic modeling, we further use Zipf's law to reduce the total communication cost in PBP. Extensive experiments on different data sets demonstrate that CE-PBP achieves higher topic modeling accuracy than the state-of-the-art parallel Gibbs sampling (PGS) algorithm while reducing communication cost by more than 80%.
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but these algorithms are not practical because they are inefficient and not robust to violations of model assumptions. In this paper we present an algorithm for topic model inference that is both provable and practical. The algorithm produces results comparable to the best MCMC implementations while running orders of magnitude faster.