Get our free extension to see links to code for papers anywhere online!

Chrome logo Add to Chrome

Firefox logo Add to Firefox

"Topic": models, code, and papers

Word embeddings for topic modeling: an application to the estimation of the economic policy uncertainty index

Oct 29, 2021
Hairo U. Miranda Belmonte, Victor Muñiz-Sánchez, Francisco Corona

Quantification of economic uncertainty is a key concept for the prediction of macro economic variables such as gross domestic product (GDP), and it becomes particularly relevant on real-time or short-time predictions methodologies, such as nowcasting, where it is required a large amount of time series data, commonly with different structures and frequencies. Most of the data comes from the official agencies statistics and non-public institutions, however, relying our estimates in just the traditional data mentioned before, have some disadvantages. One of them is that economic uncertainty could not be represented or measured in a proper way based solely in financial or macroeconomic data, another one, is that they are susceptible to lack of information due to extraordinary events, such as the current COVID-19 pandemic. For these reasons, it is very common nowadays to use some non-traditional data from different sources, such as social networks or digital newspapers, in addition to the traditional data from official sources. The economic policy uncertainty (EPU) index, is the most used newspaper-based indicator to quantify the uncertainty, and is based on topic modeling of newspapers. In this paper, we propose a methodology to estimate the EPU index, which incorporates a fast and efficient method for topic modeling of digital news based on semantic clustering with word embeddings, allowing to update the index in real-time, which is a drawback with another proposals that use computationally intensive methods for topic modeling, such as Latent Dirichlet Allocation (LDA). We show that our proposal allow us to update the index and significantly reduces the time required for new document assignation into topics.

* Preprint version 

  Access Paper or Ask Questions

Deep Learning Interviews: Hundreds of fully solved job interview questions from a wide range of key topics in AI

Dec 30, 2021
Shlomo Kashani, Amir Ivry

The second edition of Deep Learning Interviews is home to hundreds of fully-solved problems, from a wide range of key topics in AI. It is designed to both rehearse interview or exam specific topics and provide machine learning M.Sc./Ph.D. students, and those awaiting an interview a well-organized overview of the field. The problems it poses are tough enough to cut your teeth on and to dramatically improve your skills-but they're framed within thought-provoking questions and engaging stories. That is what makes the volume so specifically valuable to students and job seekers: it provides them with the ability to speak confidently and quickly on any relevant topic, to answer technical questions clearly and correctly, and to fully understand the purpose and meaning of interview questions and answers. Those are powerful, indispensable advantages to have when walking into the interview room. The book's contents is a large inventory of numerous topics relevant to DL job interviews and graduate level exams. That places this work at the forefront of the growing trend in science to teach a core set of practical mathematical and computational skills. It is widely accepted that the training of every computer scientist must include the fundamental theorems of ML, and AI appears in the curriculum of nearly every university. This volume is designed as an excellent reference for graduates of such programs.

  Access Paper or Ask Questions

Textual Data Distributions: Kullback Leibler Textual Distributions Contrasts on GPT-2 Generated Texts, with Supervised, Unsupervised Learning on Vaccine & Market Topics & Sentiment

Jun 15, 2021
Jim Samuel, Ratnakar Palle, Eduardo Correa Soares

Efficient textual data distributions (TDD) alignment and generation are open research problems in textual analytics and NLP. It is presently difficult to parsimoniously and methodologically confirm that two or more natural language datasets belong to similar distributions, and to identify the extent to which textual data possess alignment. This study focuses on addressing a segment of the broader problem described above by applying multiple supervised and unsupervised machine learning (ML) methods to explore the behavior of TDD by (i) topical alignment, and (ii) by sentiment alignment. Furthermore we use multiple text generation methods including fine-tuned GPT-2, to generate text by topic and by sentiment. Finally we develop a unique process driven variation of Kullback-Leibler divergence (KLD) application to TDD, named KL Textual Distributions Contrasts(KL-TDC) to identify the alignment of machine generated textual corpora with naturally occurring textual corpora. This study thus identifies a unique approach for generating and validating TDD by topic and sentiment, which can be used to help address sparse data problems and other research, practice and classroom situations in need of artificially generated topic or sentiment aligned textual data.

  Access Paper or Ask Questions

Multi-label classification for biomedical literature: an overview of the BioCreative VII LitCovid Track for COVID-19 literature topic annotations

Apr 20, 2022
Qingyu Chen, Alexis Allot, Robert Leaman, Rezarta Islamaj Doğan, Jingcheng Du, Li Fang, Wang Kai, Shuo Xu, Yuefu Zhang, Parsa Bagherzadeh, Sabine Bergler, Aakash Bhatnagar, Nidhir Bhavsar, Yung-Chun Chang, Sheng-Jie Lin, Wentai Tang, Hongtong Zhang, Ilija Tavchioski, Shubo Tian, Jinfeng Zhang, Yulia Otmakhova, Antonio Jimeno Yepes, Hang Dong, Honghan Wu, Richard Dufour, Yanis Labrak, Niladri Chatterjee, Kushagri Tandon, Fréjus Laleye, Loïc Rakotoson, Emmanuele Chersoni, Jinghang Gu, Annemarie Friedrich, Subhash Chandra Pujari, Mariia Chizhikova, Naveen Sivadasan, Naveen Sivadasan, Zhiyong Lu

The COVID-19 pandemic has been severely impacting global society since December 2019. Massive research has been undertaken to understand the characteristics of the virus and design vaccines and drugs. The related findings have been reported in biomedical literature at a rate of about 10,000 articles on COVID-19 per month. Such rapid growth significantly challenges manual curation and interpretation. For instance, LitCovid is a literature database of COVID-19-related articles in PubMed, which has accumulated more than 200,000 articles with millions of accesses each month by users worldwide. One primary curation task is to assign up to eight topics (e.g., Diagnosis and Treatment) to the articles in LitCovid. Despite the continuing advances in biomedical text mining methods, few have been dedicated to topic annotations in COVID-19 literature. To close the gap, we organized the BioCreative LitCovid track to call for a community effort to tackle automated topic annotation for COVID-19 literature. The BioCreative LitCovid dataset, consisting of over 30,000 articles with manually reviewed topics, was created for training and testing. It is one of the largest multilabel classification datasets in biomedical scientific literature. 19 teams worldwide participated and made 80 submissions in total. Most teams used hybrid systems based on transformers. The highest performing submissions achieved 0.8875, 0.9181, and 0.9394 for macro F1-score, micro F1-score, and instance-based F1-score, respectively. The level of participation and results demonstrate a successful track and help close the gap between dataset curation and method development. The dataset is publicly available via for benchmarking and further development.

  Access Paper or Ask Questions

Discriminative Neural Topic Models

Feb 28, 2017
Gaurav Pandey, Ambedkar Dukkipati

We propose a neural network based approach for learning topics from text and image datasets. The model makes no assumptions about the conditional distribution of the observed features given the latent topics. This allows us to perform topic modelling efficiently using sentences of documents and patches of images as observed features, rather than limiting ourselves to words. Moreover, the proposed approach is online, and hence can be used for streaming data. Furthermore, since the approach utilizes neural networks, it can be implemented on GPU with ease, and hence it is very scalable.

* 6 pages, 9 figures 

  Access Paper or Ask Questions

Discovering topics in text datasets by visualizing relevant words

Jul 18, 2017
Franziska Horn, Leila Arras, Grégoire Montavon, Klaus-Robert Müller, Wojciech Samek

When dealing with large collections of documents, it is imperative to quickly get an overview of the texts' contents. In this paper we show how this can be achieved by using a clustering algorithm to identify topics in the dataset and then selecting and visualizing relevant words, which distinguish a group of documents from the rest of the texts, to summarize the contents of the documents belonging to each topic. We demonstrate our approach by discovering trending topics in a collection of New York Times article snippets.

* arXiv admin note: substantial text overlap with arXiv:1707.05261 

  Access Paper or Ask Questions

Scalable Probabilistic Entity-Topic Modeling

Sep 02, 2013
Neil Houlsby, Massimiliano Ciaramita

We present an LDA approach to entity disambiguation. Each topic is associated with a Wikipedia article and topics generate either content words or entity mentions. Training such models is challenging because of the topic and vocabulary size, both in the millions. We tackle these problems using a novel distributed inference and representation framework based on a parallel Gibbs sampler guided by the Wikipedia link graph, and pipelines of MapReduce allowing fast and memory-frugal processing of large datasets. We report state-of-the-art performance on a public dataset.

  Access Paper or Ask Questions

Multilingual Topic Models

Dec 18, 2017
Kriste Krstovski, Michael J. Kurtz, David A. Smith, Alberto Accomazzi

Scientific publications have evolved several features for mitigating vocabulary mismatch when indexing, retrieving, and computing similarity between articles. These mitigation strategies range from simply focusing on high-value article sections, such as titles and abstracts, to assigning keywords, often from controlled vocabularies, either manually or through automatic annotation. Various document representation schemes possess different cost-benefit tradeoffs. In this paper, we propose to model different representations of the same article as translations of each other, all generated from a common latent representation in a multilingual topic model. We start with a methodological overview on latent variable models for parallel document representations that could be used across many information science tasks. We then show how solving the inference problem of mapping diverse representations into a shared topic space allows us to evaluate representations based on how topically similar they are to the original article. In addition, our proposed approach provides means to discover where different concept vocabularies require improvement.

* 18 pages, 9 figures 

  Access Paper or Ask Questions

TeCoMiner: Topic Discovery Through Term Community Detection

Mar 23, 2021
Andreas Hamm, Jana Thelen, Rasmus Beckmann, Simon Odrowski

This note is a short description of TeCoMiner, an interactive tool for exploring the topic content of text collections. Unlike other topic modeling tools, TeCoMiner is not based on some generative probabilistic model but on topological considerations about co-occurrence networks of terms. We outline the methods used for identifying topics, describe the features of the tool, and sketch an application, using a corpus of policy related scientific news on environmental issues published by the European Commission over the last decade.

* 8 pages, 4 figures 

  Access Paper or Ask Questions

Unsupervised Topic Segmentation of Meetings with BERT Embeddings

Jun 24, 2021
Alessandro Solbiati, Kevin Heffernan, Georgios Damaskinos, Shivani Poddar, Shubham Modi, Jacques Cali

Topic segmentation of meetings is the task of dividing multi-person meeting transcripts into topic blocks. Supervised approaches to the problem have proven intractable due to the difficulties in collecting and accurately annotating large datasets. In this paper we show how previous unsupervised topic segmentation methods can be improved using pre-trained neural architectures. We introduce an unsupervised approach based on BERT embeddings that achieves a 15.5% reduction in error rate over existing unsupervised approaches applied to two popular datasets for meeting transcripts.

  Access Paper or Ask Questions