Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Scalable Bayesian Non-Negative Tensor Factorization for Massive Count Data

Aug 18, 2015
Changwei Hu, Piyush Rai, Changyou Chen, Matthew Harding, Lawrence Carin

We present a Bayesian non-negative tensor factorization model for count-valued tensor data, and develop scalable inference algorithms (both batch and online) for dealing with massive tensors. Our generative model can handle overdispersed counts as well as infer the rank of the decomposition. Moreover, leveraging a reparameterization of the Poisson distribution as a multinomial facilitates conjugacy in the model and enables simple and efficient Gibbs sampling and variational Bayes (VB) inference updates, with a computational cost that only depends on the number of nonzeros in the tensor. The model also provides a nice interpretability for the factors; in our model, each factor corresponds to a "topic". We develop a set of online inference algorithms that allow further scaling up the model to massive tensors, for which batch inference methods may be infeasible. We apply our framework on diverse real-world applications, such as \emph{multiway} topic modeling on a scientific publications database, analyzing a political science data set, and analyzing a massive household transactions data set.

* ECML PKDD 2015 

  Access Paper or Ask Questions

TwiSent: A Multistage System for Analyzing Sentiment in Twitter

Sep 18, 2012
Subhabrata Mukherjee, Akshat Malu, A. R. Balamurali, Pushpak Bhattacharyya

In this paper, we present TwiSent, a sentiment analysis system for Twitter. Based on the topic searched, TwiSent collects tweets pertaining to it and categorizes them into the different polarity classes positive, negative and objective. However, analyzing micro-blog posts have many inherent challenges compared to the other text genres. Through TwiSent, we address the problems of 1) Spams pertaining to sentiment analysis in Twitter, 2) Structural anomalies in the text in the form of incorrect spellings, nonstandard abbreviations, slangs etc., 3) Entity specificity in the context of the topic searched and 4) Pragmatics embedded in text. The system performance is evaluated on manually annotated gold standard data and on an automatically annotated tweet set based on hashtags. It is a common practise to show the efficacy of a supervised system on an automatically annotated dataset. However, we show that such a system achieves lesser classification accurcy when tested on generic twitter dataset. We also show that our system performs much better than an existing system.

* In Proceedings of The 21st ACM Conference on Information and Knowledge Management (CIKM), 2012 as a poster 
* The paper is available at 

  Access Paper or Ask Questions

Hierarchical Interpretation of Neural Text Classification

Feb 20, 2022
Hanqi Yan, Lin Gui, Yulan He

Recent years have witnessed increasing interests in developing interpretable models in Natural Language Processing (NLP). Most existing models aim at identifying input features such as words or phrases important for model predictions. Neural models developed in NLP however often compose word semantics in a hierarchical manner. Interpretation by words or phrases only thus cannot faithfully explain model decisions. This paper proposes a novel Hierarchical INTerpretable neural text classifier, called Hint, which can automatically generate explanations of model predictions in the form of label-associated topics in a hierarchical manner. Model interpretation is no longer at the word level, but built on topics as the basic semantic unit. Experimental results on both review datasets and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers, and generates interpretations more faithful to model predictions and better understood by humans than other interpretable neural text classifiers.

  Access Paper or Ask Questions

Analysis of Legal Documents via Non-negative Matrix Factorization Methods

Apr 28, 2021
Ryan Budahazy, Lu Cheng, Yihuan Huang, Andrew Johnson, Pengyu Li, Joshua Vendrow, Zhoutong Wu, Denali Molitor, Elizaveta Rebrova, Deanna Needell

The California Innocence Project (CIP), a clinical law school program aiming to free wrongfully convicted prisoners, evaluates thousands of mails containing new requests for assistance and corresponding case files. Processing and interpreting this large amount of information presents a significant challenge for CIP officials, which can be successfully aided by topic modeling techniques.In this paper, we apply Non-negative Matrix Factorization (NMF) method and implement various offshoots of it to the important and previously unstudied data set compiled by CIP. We identify underlying topics of existing case files and classify request files by crime type and case status (decision type). The results uncover the semantic structure of current case files and can provide CIP officials with a general understanding of newly received case files before further examinations. We also provide an exposition of popular variants of NMF with their experimental results and discuss the benefits and drawbacks of each variant through the real-world application.

* 16 pages, 4 figures 

  Access Paper or Ask Questions

Stop Bugging Me! Evading Modern-Day Wiretapping Using Adversarial Perturbations

Oct 24, 2020
Tal Ben Senior, Yael Mathov, Asaf Shabtai, Yuval Elovici

Mass surveillance systems for voice over IP (VoIP) conversations pose a huge risk to privacy. These automated systems use learning models to analyze conversations, and upon detecting calls that involve specific topics, route them to a human agent. In this study, we present an adversarial learning-based framework for privacy protection for VoIP conversations. We present a novel algorithm that finds a universal adversarial perturbation (UAP), which, when added to the audio stream, prevents an eavesdropper from automatically detecting the conversation's topic. As shown in our experiments, the UAP is agnostic to the speaker or audio length, and its volume can be changed in real-time, as needed. In a real-world demonstration, we use a Teensy microcontroller that acts as an external microphone and adds the UAP to the audio in real-time. We examine different speakers, VoIP applications (Skype, Zoom), audio lengths, and speech-to-text models (Deep Speech, Kaldi). Our results in the real world suggest that our approach is a feasible solution for privacy protection.

  Access Paper or Ask Questions

Global Health Monitor: A Web-based System for Detecting and Mapping Infectious Diseases

Nov 21, 2019
Son Doan, Quoc-Hung Ngo, Ai Kawazoe, Nigel Collier

We present the Global Health Monitor, an online Web-based system for detecting and mapping infectious disease outbreaks that appear in news stories. The system analyzes English news stories from news feed providers, classifies them for topical relevance and plots them onto a Google map using geo-coding information, helping public health workers to monitor the spread of diseases in a geo-temporal context. The background knowledge for the system is contained in the BioCaster ontology (BCO) (Collier et al., 2007a) which includes both information on infectious diseases as well as geographical locations with their latitudes/longitudes. The system consists of four main stages: topic classification, named entity recognition (NER), disease/location detection and visualization. Evaluation of the system shows that it achieved high accuracy on a gold standard corpus. The system is now in practical use. Running on a clustercomputer, it monitors more than 1500 news feeds 24/7, updating the map every hour.

* Proc. of the International Joint Conference on Natural Language Processing (IJCNLP) 2008, pages 951-956 
* 6 pages, 3 figures, Proc. of IJCNLP 2008 

  Access Paper or Ask Questions

Characterizing Diseases and disorders in Gay Users' tweets

Mar 24, 2018
Frank Webb, Amir Karami, Vanessa Kitzie

A lack of information exists about the health issues of lesbian, gay, bisexual, transgender, and queer (LGBTQ) people who are often excluded from national demographic assessments, health studies, and clinical trials. As a result, medical experts and researchers lack a holistic understanding of the health disparities facing these populations. Fortunately, publicly available social media data such as Twitter data can be utilized to support the decisions of public health policy makers and managers with respect to LGBTQ people. This research employs a computational approach to collect tweets from gay users on health-related topics and model these topics. To determine the nature of health-related information shared by men who have sex with men on Twitter, we collected thousands of tweets from 177 active users. We sampled these tweets using a framework that can be applied to other LGBTQ sub-populations in future research. We found 11 diseases in 7 categories based on ICD 10 that are in line with the published studies and official reports.

  Access Paper or Ask Questions

Distant finetuning with discourse relations for stance classification

Apr 27, 2022
Lifeng Jin, Kun Xu, Linfeng Song, Dong Yu

Approaches for the stance classification task, an important task for understanding argumentation in debates and detecting fake news, have been relying on models which deal with individual debate topics. In this paper, in order to train a system independent from topics, we propose a new method to extract data with silver labels from raw text to finetune a model for stance classification. The extraction relies on specific discourse relation information, which is shown as a reliable and accurate source for providing stance information. We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages going from the most noisy to the least noisy. Detailed experiments show that the automatically annotated dataset as well as the 3-stage training help improve model performance in stance classification. Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater, which confirms the effectiveness of our approach.

* NLPCC 2021 

  Access Paper or Ask Questions

Unsupervised Keyphrase Extraction via Interpretable Neural Networks

Mar 15, 2022
Rishabh Joshi, Vidhisha Balachandran, Emily Saldanha, Maria Glenski, Svitlana Volkova, Yulia Tsvetkov

Keyphrase extraction aims at automatically extracting a list of "important" phrases which represent the key concepts in a document. Prior approaches for unsupervised keyphrase extraction resort to heuristic notions of phrase importance via embedding similarities or graph centrality, requiring extensive domain expertise to develop them. Our work proposes an alternative operational definition: phrases that are most useful for predicting the topic of a text are important keyphrases. To this end, we propose INSPECT -- a self-explaining neural framework for identifying influential keyphrases by measuring the predictive impact of input phrases on the downstream task of topic classification. We show that this novel approach not only alleviates the need for ad-hoc heuristics but also achieves state-of-the-art results in unsupervised keyphrase extraction across four diverse datasets in two domains: scientific publications and news articles. Ultimately, our study suggests a new usage of interpretable neural networks as an intrinsic component in NLP systems, and not only as a tool for explaining model predictions to humans.

  Access Paper or Ask Questions

The MICCAI Hackathon on reproducibility, diversity, and selection of papers at the MICCAI conference

Mar 04, 2021
Fabian Balsiger, Alain Jungo, Naren Akash R J, Jianan Chen, Ivan Ezhov, Shengnan Liu, Jun Ma, Johannes C. Paetzold, Vishva Saravanan R, Anjany Sekuboyina, Suprosanna Shit, Yannick Suter, Moshood Yekini, Guodong Zeng, Markus Rempfler

The MICCAI conference has encountered tremendous growth over the last years in terms of the size of the community, as well as the number of contributions and their technical success. With this growth, however, come new challenges for the community. Methods are more difficult to reproduce and the ever-increasing number of paper submissions to the MICCAI conference poses new questions regarding the selection process and the diversity of topics. To exchange, discuss, and find novel and creative solutions to these challenges, a new format of a hackathon was initiated as a satellite event at the MICCAI 2020 conference: The MICCAI Hackathon. The first edition of the MICCAI Hackathon covered the topics reproducibility, diversity, and selection of MICCAI papers. In the manner of a small think-tank, participants collaborated to find solutions to these challenges. In this report, we summarize the insights from the MICCAI Hackathon into immediate and long-term measures to address these challenges. The proposed measures can be seen as starting points and guidelines for discussions and actions to possibly improve the MICCAI conference with regards to reproducibility, diversity, and selection of papers.

  Access Paper or Ask Questions