"Topic": models, code, and papers

SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations

May 12, 2017
Dheeraj Mekala, Vivek Gupta, Bhargavi Paranjape, Harish Karnick

We present a feature vector formation technique for documents - the Sparse Composite Document Vector (SCDV) - which overcomes several shortcomings of the distributional paragraph vector representations widely used for text representation. In SCDV, word embeddings are clustered to capture the multiple semantic contexts in which words occur. They are then chained together to form document topic-vectors that can express complex, multi-topic documents. Through extensive experiments on multi-class and multi-label classification tasks, we outperform the previous state-of-the-art method, NTSG (Liu et al., 2015a). We also show that SCDV embeddings perform well on heterogeneous tasks like topic coherence, context-sensitive learning, and information retrieval. Moreover, we achieve a significant reduction in training and prediction times compared to other representation methods. SCDV achieves the best of both worlds: better performance with lower time and space complexity.

* 10 pages, 5 figures. Update: Added results on Information Retrieval and Topic Coherence with Discussion 
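
The pipeline the abstract describes (cluster word embeddings, weight each word vector by its soft cluster memberships and idf, average over the document, then sparsify) can be sketched in a few lines. This is a minimal toy sketch, not the authors' implementation: the 2-d "embeddings", idf values, and fixed cluster centres are invented for illustration, and a softmax over distances stands in for the Gaussian mixture model's soft assignments.

```python
import math

# Toy 2-d "word embeddings" (stand-ins for word2vec vectors).
VECS = {
    "game": [1.0, 0.1], "team": [0.9, 0.0],
    "bank": [0.0, 1.0], "loan": [0.1, 0.9],
}
IDF = {"game": 1.2, "team": 1.0, "bank": 1.5, "loan": 1.3}
# Fixed cluster centres stand in for the GMM the paper fits over embeddings.
CENTERS = [[1.0, 0.0], [0.0, 1.0]]

def soft_assign(vec):
    """p(cluster | word): softmax over negative squared distances to centres."""
    scores = [math.exp(-sum((a - b) ** 2 for a, b in zip(vec, c))) for c in CENTERS]
    z = sum(scores)
    return [s / z for s in scores]

def scdv(doc, sparsity=0.04):
    """Average idf-weighted, cluster-concatenated word vectors, then sparsify."""
    k, dim = len(CENTERS), 2
    acc, n = [0.0] * (k * dim), 0
    for w in doc:
        if w not in VECS:          # out-of-vocabulary words are skipped
            continue
        n += 1
        probs = soft_assign(VECS[w])
        for c in range(k):
            for j, x in enumerate(VECS[w]):
                acc[c * dim + j] += IDF[w] * probs[c] * x
    if n:
        acc = [a / n for a in acc]
    # Sparsify: zero out components that are small relative to the largest one.
    t = sparsity * max(abs(a) for a in acc) if any(acc) else 0.0
    return [a if abs(a) > t else 0.0 for a in acc]

print(scdv(["game", "team", "stadium"]))
```

The concatenated k x dim layout is what lets the different topics of a document occupy different blocks of the final vector, and the thresholding step is what makes the representation sparse.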

Modeling Human Behavior Part II -- Cognitive approaches and Uncertainty

May 13, 2022
Andrew Fuchs, Andrea Passarella, Marco Conti

As we discussed in Part I of this topic, there is a clear desire to model and comprehend human behavior. Given the popular presupposition of human reasoning as the standard for learning and decision-making, there have been significant efforts and a growing trend in research to replicate these innate human abilities in artificial systems. In Part I, we discussed learning methods that generate a model of behavior from exploration of the system and feedback based on the exhibited behavior, as well as topics relating to the use of, or accounting for, beliefs with respect to applicable skills or mental states of others. In this work, we continue the discussion from the perspective of methods that focus on the assumed cognitive abilities, limitations, and biases demonstrated in human reasoning. We arrange these topics as follows: (i) methods, such as cognitive architectures and cognitive heuristics, that assume limitations on cognitive resources and consider how those limitations shape decisions, and (ii) methods that generate and utilize representations of bias or uncertainty to model human decision-making or the future outcomes of decisions.

* This is Part 2 of our review (see Modeling Human Behavior Part I - Learning and Belief Approaches) relating to learning and modeling behavior. This work was partially funded by the following projects. European Union's Horizon 2020 research and innovation programme: HumaneAI-Net (No 952026). CHIST-ERA program: SAI project (grant CHIST-ERA-19-XAI-010, funded by MUR, grant number not yet available) 

Library of Congress Subject Heading (LCSH) Browsing and Natural Language Searching

Sep 30, 2021
Charles-Antoine Julien, Banafsheh Asadi, Jesse David Dinneen, Fei Shu

Controlled topical vocabularies (CVs) are built into information systems to aid browsing and retrieval of items that may be unfamiliar, but it is unclear how this feature should be integrated with standard keyword searching. Few systems or scholarly prototypes have attempted this, and none have used the most widely used CV, the Library of Congress Subject Headings (LCSH), which organizes monograph collections in academic libraries throughout the world. This paper describes a working prototype of a Web application that concurrently allows topic exploration using an outline tree view of the LCSH hierarchy and natural language keyword searching of a real-world Science and Engineering bibliographic collection. Pilot testing shows the system is functional, and work to fit the complex LCSH structure into a usable hierarchy is ongoing. This study contributes to knowledge of the practical design decisions required when developing linked interactions between topical hierarchy browsing and natural language searching, which promise to facilitate information discovery and exploration.

* In ASIST 2016: Proceedings of the 79th Annual Meeting of the Association for Information Science & Technology, 53 
* conference paper (ASIST '16), 4 pages plus a poster 

Regular Expressions for Fast-response COVID-19 Text Classification

Feb 18, 2021
Igor L. Markov, Jacqueline Liu, Adam Vagner

Text classifiers are at the core of many NLP applications and use a variety of algorithmic approaches and software. This paper describes how Facebook determines whether a given piece of text - anything from a hashtag to a post - belongs to a narrow topic such as COVID-19. To fully define a topic and evaluate classifier performance, we employ human-guided iterations of keyword discovery, but do not require labeled data. For COVID-19, we build two sets of regular expressions: (1) for 66 languages, with 99% precision and recall >50%; (2) for the 11 most common languages, with precision >90% and recall >90%. Regular expressions enable low-latency queries from multiple platforms. Response to challenges like COVID-19 is fast, and so are revisions. Comparisons to a DNN classifier show explainable results, higher precision and recall, and less overfitting. Our learnings can be applied to other narrow-topic classifiers.

* 10 pages, 7 tables 
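
As an illustration of the approach, here is a sketch of a single-language topic check. The pattern below is invented for this example; it is not one of the paper's production regular expressions, which were curated per language through the human-guided keyword-discovery iterations described above.

```python
import re

# Illustrative pattern only -- not Facebook's production regex. Word-boundary
# anchors keep precision high (e.g. "covid" inside a longer word does not
# match), mirroring the precision-first design the paper reports.
COVID_RE = re.compile(
    r"\b(covid[- ]?19|sars[- ]cov[- ]?2|coronavirus)\b",
    re.IGNORECASE,
)

def is_covid_text(text: str) -> bool:
    """Low-latency topic check: one compiled-regex scan per piece of text."""
    return COVID_RE.search(text) is not None

print(is_covid_text("New SARS-CoV-2 variant reported"))  # -> True
print(is_covid_text("I love coriander"))                 # -> False
```

Because a compiled pattern like this is a single scan over the input with no model inference, it can be evaluated from multiple platforms with very low latency, which is the operational property the abstract emphasizes.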

A System for Worldwide COVID-19 Information Aggregation

Jul 28, 2020
Akiko Aizawa, Frederic Bergeron, Junjie Chen, Fei Cheng, Katsuhiko Hayashi, Kentaro Inui, Hiroyoshi Ito, Daisuke Kawahara, Masaru Kitsuregawa, Hirokazu Kiyomaru, Masaki Kobayashi, Takashi Kodama, Sadao Kurohashi, Qianying Liu, Masaki Matsubara, Yusuke Miyao, Atsuyuki Morishima, Yugo Murawaki, Kazumasa Omura, Haiyue Song, Eiichiro Sumita, Shinji Suzuki, Ribeka Tanaka, Yu Tanaka, Masashi Toyoda, Nobuhiro Ueda, Honai Ueoka, Masao Utiyama, Ying Zhong

The global pandemic of COVID-19 has made the public pay close attention to related news covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, COVID-19 conditions differ greatly among countries (e.g., in policies and the development of the epidemic), and thus citizens are interested in news from foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages, sorted by topic for Japanese citizens. Our reliable COVID-19-related website dataset, collected through crowdsourcing, ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese. A BERT-based topic classifier trained on an article-topic pair dataset helps users find the information they are interested in efficiently by sorting articles into different categories.

* Poster on NLP COVID-19 Workshop at ACL 2020, 4 pages, 3 figures, 7 tables 

Generating Chinese Poetry from Images via Concrete and Abstract Information

Mar 24, 2020
Yusen Liu, Dayiheng Liu, Jiancheng Lv, Yongsheng Sang

In recent years, the automatic generation of classical Chinese poetry has made great progress. Beyond improving the quality of the generated poetry, a new topic has emerged: generating poetry from an image. However, existing methods for this task still suffer from topic drift and semantic inconsistency, and image-poem pair datasets are hard to build for training these models. In this paper, we extract and integrate Concrete and Abstract information from images to address these issues. We propose an infilling-based Chinese poetry generation model that explicitly infills Concrete keywords into each line of the poem, and an abstract information embedding that integrates Abstract information into the generated poems. In addition, we use non-parallel data during training and construct separate image and poem datasets to train the different components of our framework. Both automatic and human evaluation results show that our approach generates poems that are more consistent with the images without losing quality.

* Accepted by the 2020 International Joint Conference on Neural Networks (IJCNN 2020) 

Analyzing Stylistic Variation across Different Political Regimes

Dec 02, 2020
Liviu P. Dinu, Ana-Sabina Uban

In this article we propose a stylistic analysis of texts written across two different periods that differ not only temporally but also politically and culturally: communism and democracy in Romania. We aim to analyze the stylistic variation between texts written during these two periods and to determine at which levels the variation is most apparent (if any): the stylistic level, the topic level, etc. We examine the stylistic profiles of these texts comparatively by performing clustering and classification experiments on them, using traditional authorship attribution methods and features. To confirm that the stylistic variation is indeed an effect of the change in political and cultural environment, and not merely reflective of a natural change in an author's style with time, we look at various stylistic metrics over time and show that the change in style between the two periods is statistically significant. We also analyze the variation in topic between the two epochs to compare it with the variation at the style level. These analyses show that texts from the two periods can indeed be distinguished, both from the point of view of style and from that of semantic content (topic).

* 19th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2018) 
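
A minimal sketch of the kind of experiment described above, assuming function-word relative frequencies as the stylometric features (a classic authorship-attribution choice). The tiny word list and toy sentences are invented for illustration; real studies use hundreds of features and full documents.

```python
import math
from collections import Counter

# Hypothetical mini function-word list; real stylometry uses hundreds of words.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "was"]

def style_vector(text):
    """Relative frequency of each function word -- a classic stylometric feature."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

a = style_vector("the cat sat in the hall and the dog slept")
b = style_vector("the rain fell in the night and the wind rose")
print(cosine(a, b))  # texts with similar function-word profiles score near 1
```

Pairwise similarities like these are the raw material for the clustering and classification experiments the abstract mentions: if texts from the two periods form separate clusters under such features, the stylistic variation is detectable.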

When Politicians Talk About Politics: Identifying Political Tweets of Brazilian Congressmen

May 04, 2018
Lucas S. Oliveira, Pedro O. S. Vaz de Melo, Marcelo S. Amaral, José Antônio G. Pinho

Since June 2013, when Brazil faced the largest and most significant mass protests in a generation, a political crisis has been underway. In the midst of this crisis, Brazilian politicians use social media to communicate with the electorate in order to retain or grow their political capital. The problem is that many controversial topics are under debate, and deputies may prefer to avoid such themes in their messages. To characterize this behavior, we propose a method to accurately identify political and non-political tweets, independently of the deputy who posted them and of the time they were posted. Moreover, we collected the tweets of all congressmen who were active on Twitter and served in the Brazilian parliament from October 2013 to October 2017. To evaluate our method, we used word clouds and a topic model to identify the main political and non-political latent topics in parliamentarian tweets. Both results indicate that our proposal accurately distinguishes political from non-political tweets. Moreover, our analyses revealed a striking fact: more than half of the messages posted by Brazilian deputies are non-political.

* 4 pages, 7 figures, 2 tables 

On the Place of Text Data in Lifelogs, and Text Analysis via Semantic Facets

Jun 08, 2016
Gregory Grefenstette, Lawrence Muchemi

Current research on lifelog data has not paid enough attention to the analysis of cognitive activities, in comparison to physical activities. We argue that, as we look into the future, wearable devices will become cheaper and more prevalent, and textual data will play a more significant role. Data captured by lifelogging devices will increasingly include speech and text, potentially useful for the analysis of intellectual activities. By analyzing what a person hears, reads, and sees, we should be able to measure the extent of cognitive activity devoted to a certain topic or subject by a learner. Text-based lifelog records can benefit from semantic analysis tools developed for natural language processing. We show how semantic analysis of such text data can be achieved through the use of taxonomic subject facets and how these facets might be useful in quantifying the cognitive activity devoted to various topics in a person's day. We are currently developing a method to automatically create taxonomic topic vocabularies that can be applied to this detection of intellectual activity.

* iConference 2016 SIE on Lifelogging, Mar 2016, Philadelphia, United States. iConference 2016 SIE on Lifelogging, 2016 
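
The facet-based quantification the authors describe can be illustrated with a toy sketch: map captured tokens to subject facets and tally the mentions per facet over a day. The two-facet lexicon below is hypothetical, standing in for the automatically created taxonomic topic vocabularies.

```python
from collections import Counter

# Hypothetical mini facet taxonomy; the paper builds such vocabularies
# automatically from taxonomic subject facets.
FACETS = {
    "cooking": {"recipe", "oven", "flour", "simmer"},
    "finance": {"loan", "interest", "mortgage", "budget"},
}

def facet_profile(transcript_tokens):
    """Tally how often a day's captured text touches each subject facet."""
    tally = Counter()
    for tok in transcript_tokens:
        for facet, vocab in FACETS.items():
            if tok in vocab:
                tally[facet] += 1
    return dict(tally)

day = "checked the mortgage budget then found a recipe and preheated the oven".split()
print(facet_profile(day))  # -> {'finance': 2, 'cooking': 2}
```

The resulting per-facet counts are a crude proxy for the "extent of cognitive activity devoted to a certain topic" that the abstract aims to measure.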

Unsupervised paradigm for information extraction from transcripts using BERT

Oct 09, 2021
Aravind Chandramouli, Siddharth Shukla, Neeti Nair, Shiven Purohit, Shubham Pandey, Murali Mohana Krishna Dandu

Audio call transcripts are one of the most valuable sources of information for multiple downstream use cases, such as understanding the voice of the customer and analyzing agent performance. However, these transcripts are noisy in nature, and in an industry setting, obtaining tagged ground truth data is a challenge. In this paper, we present a solution implemented in industry that uses BERT language models as part of our pipeline to extract key topics and multiple open intents discussed in the call. Another problem we address is the automatic tagging of transcripts into predefined categories, which is traditionally solved using a supervised approach. To overcome the lack of tagged data, all our proposed approaches use unsupervised methods to solve the outlined problems. We evaluate the results by quantitatively comparing the automatically extracted topics, intents, and tagged categories with human-tagged ground truth, and by qualitatively measuring the valuable concepts and intents that are not present in the ground truth. We achieve near-human accuracy in the extraction of these topics and intents using our novel approach.
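
A rough sketch of the unsupervised tagging idea: score each transcript against seed descriptions of the predefined categories and pick the closest one, with no labeled training data. Bag-of-words cosine similarity here stands in for the paper's BERT embedding similarity, and the category names and seed keywords are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical seed keywords per predefined category; the paper compares
# BERT embeddings rather than raw token overlap.
CATEGORIES = {
    "billing": ["invoice", "refund", "charge", "payment"],
    "technical": ["error", "crash", "login", "reset"],
}

def bow(tokens):
    """Bag-of-words counts, a crude stand-in for a BERT embedding."""
    return Counter(tokens)

def cosine(c1, c2):
    """Cosine similarity between two bags of words."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def tag_transcript(text):
    """Unsupervised tagging: pick the category whose seed words are closest."""
    doc = bow(text.lower().split())
    return max(CATEGORIES, key=lambda c: cosine(doc, bow(CATEGORIES[c])))

print(tag_transcript("I was charged twice and need a refund on my invoice"))  # -> billing
```

Because the category representations are built from seed text rather than labeled transcripts, the same mechanism extends to new categories without any retraining, which is the practical appeal of the unsupervised setup the abstract describes.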
