Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

A New Spectral Method for Latent Variable Models

Apr 04, 2017
Matteo Ruffini, Marta Casanellas, Ricard Gavaldà

This paper presents an algorithm for the unsupervised learning of latent variable models from unlabeled sets of data. We base our technique on spectral decomposition, providing a technique that proves to be robust both in theory and in practice. We also describe how to use this algorithm to learn the parameters of two well known text mining models: single topic model and Latent Dirichlet Allocation, providing in both cases an efficient technique to retrieve the parameters to feed the algorithm. We compare the results of our algorithm with those of existing algorithms on synthetic data, and we provide examples of applications to real world text corpora for both single topic model and LDA, obtaining meaningful results.

  Access Paper or Ask Questions

Dirichlet Variational Autoencoder for Text Modeling

Oct 31, 2018
Yijun Xiao, Tiancheng Zhao, William Yang Wang

We introduce an improved variational autoencoder (VAE) for text modeling with topic information explicitly modeled as a Dirichlet latent variable. By providing the proposed model topic awareness, it is more superior at reconstructing input texts. Furthermore, due to the inherent interactions between the newly introduced Dirichlet variable and the conventional multivariate Gaussian variable, the model is less prone to KL divergence vanishing. We derive the variational lower bound for the new model and conduct experiments on four different data sets. The results show that the proposed model is superior at text reconstruction across the latent space and classifications on learned representations have higher test accuracies.

  Access Paper or Ask Questions

Sentiment Analysis of the COVID-related r/Depression Posts

Jul 28, 2021
Zihan Chen, Marina Sokolova is a popular social media platform among young people. Reddit users share their stories to seek support from other users, especially during the Covid-19 pandemic. Messages posted on Reddit and their content have provided researchers with opportunity to analyze public concerns. In this study, we analyzed sentiments of COVID-related messages posted on r/Depression. Our study poses the following questions: a) What are the common topics that the Reddit users discuss? b) Can we use these topics to classify sentiments of the posts? c) What matters concern people more during the pandemic? Key Words: Sentiment Classification, Depression, COVID-19, Reddit, LDA, BERT

* 16 pages, 7 figures, 5 tables, 1 appendix 

  Access Paper or Ask Questions

COVID-19 Kaggle Literature Organization

Sep 02, 2020
Maksim Ekin Eren, Nick Solovyev, Edward Raff, Charles Nicholas, Ben Johnson

The world has faced the devastating outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), or COVID-19, in 2020. Research in the subject matter was fast-tracked to such a point that scientists were struggling to keep up with new findings. With this increase in the scientific literature, there arose a need for organizing those documents. We describe an approach to organize and visualize the scientific literature on or related to COVID-19 using machine learning techniques so that papers on similar topics are grouped together. By doing so, the navigation of topics and related papers is simplified. We implemented this approach using the widely recognized CORD-19 dataset to present a publicly available proof of concept.

* Maksim Ekin Eren, Nick Solovyev, Edward Raff, Charles Nicholas, and Ben Johnson. 2020. COVID-19 Kaggle Literature Organization. In ACM Sym-posium on Document Engineering 2020 (DocEng 20), September 29-October2, 2020, Virtual Event, CA, USA.ACM, New York, NY, USA, 4 pages. 

  Access Paper or Ask Questions

Discrete Argument Representation Learning for Interactive Argument Pair Identification

Nov 05, 2019
Lu Ji, Zhongyu Wei, Jing Li, Qi Zhang, Xuanjing Huang

In this paper, we focus on extracting interactive argument pairs from two posts with opposite stances to a certain topic. Considering opinions are exchanged from different perspectives of the discussing topic, we study the discrete representations for arguments to capture varying aspects in argumentation languages (e.g., the debate focus and the participant behavior). Moreover, we utilize hierarchical structure to model post-wise information incorporating contextual knowledge. Experimental results on the large-scale dataset collected from CMV show that our proposed framework can significantly outperform the competitive baselines. Further analyses reveal why our model yields superior performance and prove the usefulness of our learned representations.

* 10 pages, 5 figures, 3 tables submitted for consideration of publication to the IEEE Transactions on Audio, Speech, and Language Processing, 2019 

  Access Paper or Ask Questions

Slugbot: An Application of a Novel and Scalable Open Domain Socialbot Framework

Jan 04, 2018
Kevin K. Bowden, Jiaqi Wu, Shereen Oraby, Amita Misra, Marilyn Walker

In this paper we introduce a novel, open domain socialbot for the Amazon Alexa Prize competition, aimed at carrying on friendly conversations with users on a variety of topics. We present our modular system, highlighting our different data sources and how we use the human mind as a model for data management. Additionally we build and employ natural language understanding and information retrieval tools and APIs to expand our knowledge bases. We describe our semistructured, scalable framework for crafting topic-specific dialogue flows, and give details on our dialogue management schemes and scoring mechanisms. Finally we briefly evaluate the performance of our system and observe the challenges that an open domain socialbot faces.

* Alexa Prize Proceedings 2017 

  Access Paper or Ask Questions

Trend and Thoughts: Understanding Climate Change Concern using Machine Learning and Social Media Data

Nov 06, 2021
Zhongkai Shangguan, Zihe Zheng, Lei Lin

Nowadays social media platforms such as Twitter provide a great opportunity to understand public opinion of climate change compared to traditional survey methods. In this paper, we constructed a massive climate change Twitter dataset and conducted comprehensive analysis using machine learning. By conducting topic modeling and natural language processing, we show the relationship between the number of tweets about climate change and major climate events; the common topics people discuss climate change; and the trend of sentiment. Our dataset was published on Kaggle (\url{}) and can be used in further research.

  Access Paper or Ask Questions

StoryDB: Broad Multi-language Narrative Dataset

Sep 29, 2021
Alexey Tikhonov, Igor Samenko, Ivan P. Yamshchikov

This paper presents StoryDB - a broad multi-language dataset of narratives. StoryDB is a corpus of texts that includes stories in 42 different languages. Every language includes 500+ stories. Some of the languages include more than 20 000 stories. Every story is indexed across languages and labeled with tags such as a genre or a topic. The corpus shows rich topical and language variation and can serve as a resource for the study of the role of narrative in natural language processing across various languages including low resource ones. We also demonstrate how the dataset could be used to benchmark three modern multilanguage models, namely, mDistillBERT, mBERT, and XLM-RoBERTa.

  Access Paper or Ask Questions

Assessing Political Prudence of Open-domain Chatbots

Jun 11, 2021
Yejin Bang, Nayeon Lee, Etsuko Ishii, Andrea Madotto, Pascale Fung

Politically sensitive topics are still a challenge for open-domain chatbots. However, dealing with politically sensitive content in a responsible, non-partisan, and safe behavior way is integral for these chatbots. Currently, the main approach to handling political sensitivity is by simply changing such a topic when it is detected. This is safe but evasive and results in a chatbot that is less engaging. In this work, as a first step towards a politically safe chatbot, we propose a group of metrics for assessing their political prudence. We then conduct political prudence analysis of various chatbots and discuss their behavior from multiple angles through our automatic metric and human evaluation metrics. The testsets and codebase are released to promote research in this area.

* SIGDIAL 2021 - Safety for E2E Conversational AI (Camera-ready Version) 

  Access Paper or Ask Questions

Arabic Offensive Language on Twitter: Analysis and Experiments

Apr 05, 2020
Hamdy Mubarak, Ammar Rashed, Kareem Darwish, Younes Samih, Ahmed Abdelali

Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building effective Arabic offensive tweet detection. We introduce a method for building an offensive dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech. Next, we analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct a large battery of experiments to produce strong results (F1 = 79.7) on the dataset using Support Vector Machine techniques.

* 10 pages, 6 figures, 3 tables 

  Access Paper or Ask Questions