"Topic Modeling": models, code, and papers

Concept Modeling with Superwords

Apr 11, 2012
Khalid El-Arini, Emily B. Fox, Carlos Guestrin

In information retrieval, a fundamental goal is to transform a document into concepts that are representative of its content. The term "representative" is in itself challenging to define, and various tasks require different granularities of concepts. In this paper, we aim to model concepts that are sparse over the vocabulary, and that flexibly adapt their content based on other relevant semantic information such as textual structure or associated image features. We explore a Bayesian nonparametric model based on nested beta processes that allows for inferring an unknown number of strictly sparse concepts. The resulting model provides an inherently different representation of concepts than a standard LDA (or HDP) based topic model, and allows for direct incorporation of semantic features. We demonstrate the utility of this representation on multilingual blog data and the Congressional Record.

  

Topic Discovery through Data Dependent and Random Projections

Mar 18, 2013
Weicong Ding, Mohammad H. Rohban, Prakash Ishwar, Venkatesh Saligrama

We present algorithms for topic modeling based on the geometry of cross-document word-frequency patterns. This perspective gains significance under the so-called separability condition, which requires the existence of novel words that are unique to each topic. We present a suite of highly efficient algorithms based on data-dependent and random projections of word-frequency patterns to identify novel words and associated topics. We also discuss the statistical guarantees of the data-dependent projections method based on two mild assumptions on the prior density of the topic-document matrix. Our key insight is that the maximum and minimum values of cross-document frequency patterns projected along any direction are associated with novel words. While our sample complexity bounds for topic recovery are similar to the state of the art, the computational complexity of our random projection scheme scales linearly with the number of documents and the number of words per document. We present several experiments on synthetic and real-world datasets to demonstrate the qualitative and quantitative merits of our scheme.
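The projection insight in this abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a toy construction that assumes each word's cross-document frequency pattern is a row vector, that the first few rows stand in for novel words, and that every other row is a strict convex mixture of them. The names `make_toy_rows` and `extreme_rows` are hypothetical. Under those assumptions, the rows attaining the maximum or minimum projection along a random direction should be novel-word rows.

```python
import random


def project(row, direction):
    """Inner product of a word's frequency pattern with a direction."""
    return sum(x * d for x, d in zip(row, direction))


def extreme_rows(rows, num_directions, rng):
    """Indices of rows attaining the max or min projection
    along at least one random Gaussian direction."""
    dim = len(rows[0])
    found = set()
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scores = [project(r, direction) for r in rows]
        indices = range(len(rows))
        found.add(max(indices, key=scores.__getitem__))
        found.add(min(indices, key=scores.__getitem__))
    return found


def make_toy_rows(num_topics, num_docs, num_mixed, rng):
    """Toy data (an assumption, not real corpus statistics):
    rows 0..num_topics-1 act as novel words; the remaining rows
    are strict convex combinations of those rows."""
    vertices = [[rng.random() for _ in range(num_docs)]
                for _ in range(num_topics)]
    rows = [v[:] for v in vertices]
    for _ in range(num_mixed):
        # Weights are bounded away from zero, so no mixed row
        # coincides with a vertex.
        w = [rng.random() + 0.05 for _ in range(num_topics)]
        total = sum(w)
        w = [x / total for x in w]
        rows.append([sum(w[k] * vertices[k][j] for k in range(num_topics))
                     for j in range(num_docs)])
    return rows


rng = random.Random(0)
rows = make_toy_rows(num_topics=3, num_docs=20, num_mixed=50, rng=rng)
novel = extreme_rows(rows, num_directions=200, rng=rng)
print(sorted(novel))
```

Because a linear function over a convex hull attains its extremes at vertices, the selected indices in this toy setting should all lie among the first three rows. Real word-frequency data is noisy, which is where the statistical guarantees mentioned in the abstract come in.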

  

Physical Modeling using Recurrent Neural Networks with Fast Convolutional Layers

Apr 21, 2022
Julian D. Parker, Sebastian J. Schlecht, Rudolf Rabenstein, Maximilian Schäfer

Discrete-time modeling of acoustic, mechanical and electrical systems is a prominent topic in the musical signal processing literature. Such models are mostly derived by discretizing a mathematical model, given in terms of ordinary or partial differential equations, using established techniques. Recent work has applied machine learning techniques to construct such models automatically from data for systems which have lumped states described by scalar values, such as electrical circuits. In this work, we examine how similar techniques can construct models of systems which have spatially distributed rather than lumped states. We describe several novel recurrent neural network structures, and show how they can be thought of as an extension of modal techniques. As a proof of concept, we generate synthetic data for three physical systems and show that the proposed network structures can be trained with this data to reproduce the behavior of these systems.

* Submitted to DAFx2022 
  

Proceedings of the Fifth International Workshop on Domain-Specific Languages and Models for Robotic Systems (DSLRob 2014)

Nov 26, 2014
Luca Gherardi, Nico Hochgeschwender, Christian Schlegel, Ulrik Pagh Schultz, Serge Stinckwich

The Fifth International Workshop on Domain-Specific Languages and Models for Robotic Systems (DSLRob'14) was held in conjunction with the 2014 International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR 2014), October 2014 in Bergamo, Italy. The main topics of the workshop were Domain-Specific Languages (DSLs) and Model-driven Software Development (MDSD) for robotics. A domain-specific language is a programming language dedicated to a particular problem domain that offers specific notations and abstractions that increase programmer productivity within that domain. Model-driven software development offers a high-level way for domain users to specify the functionality of their system at the right level of abstraction. DSLs and models have historically been used for programming complex systems. Recently, however, they have garnered interest as a separate field of study. Robotic systems blend hardware and software in a holistic way that intrinsically raises many crosscutting concerns (concurrency, uncertainty, time constraints, ...), for which reason traditional general-purpose languages often lead to a poor fit between the language features and the implementation requirements. DSLs and models offer a powerful, systematic way to overcome this problem, enabling the programmer to quickly and precisely implement novel software solutions to complex problems within the robotics domain.

  

Proceedings of the Third International Workshop on Domain-Specific Languages and Models for Robotic Systems (DSLRob 2012)

Feb 20, 2013
Christian Schlegel, Ulrik P. Schultz, Serge Stinckwich

Proceedings of the Third International Workshop on Domain-Specific Languages and Models for Robotic Systems (DSLRob'12), held at the 2012 International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR 2012), November 2012 in Tsukuba, Japan. The main topics of the workshop were Domain-Specific Languages (DSLs) and Model-driven Architecture (MDA) for robotics. A domain-specific language (DSL) is a programming language dedicated to a particular problem domain that offers specific notations and abstractions that increase programmer productivity within that domain. Model-driven architecture (MDA) offers a high-level way for domain users to specify the functionality of their system at the right level of abstraction. DSLs and models have historically been used for programming complex systems. Recently, however, they have garnered interest as a separate field of study. Robotic systems blend hardware and software in a holistic way that intrinsically raises many crosscutting concerns (concurrency, uncertainty, time constraints, ...), for which reason traditional general-purpose languages often lead to a poor fit between the language features and the implementation requirements. DSLs and models offer a powerful, systematic way to overcome this problem, enabling the programmer to quickly and precisely implement novel software solutions to complex problems within the robotics domain.

* Index submission 
  

HNP3: A Hierarchical Nonparametric Point Process for Modeling Content Diffusion over Social Media

Oct 02, 2016
Seyed Abbas Hosseini, Ali Khodadadi, Soheil Arabzade, Hamid R. Rabiee

This paper introduces a novel framework for modeling temporal events with complex longitudinal dependency that are generated by dependent sources. This framework takes advantage of multidimensional point processes for modeling the times of events. The intensity function of the proposed process is a mixture of intensities, and its complexity grows with the complexity of the temporal patterns in the data. Moreover, it utilizes a hierarchical dependent nonparametric approach to model the marks of events. These capabilities allow the proposed model to adapt its temporal and topical complexity to the complexity of the data, which makes it a suitable candidate for real-world scenarios. An online inference algorithm is also proposed that makes the framework applicable to a vast range of applications. The framework is applied to a real-world application, modeling the diffusion of content over networks. Extensive experiments reveal the effectiveness of the proposed framework in comparison with state-of-the-art methods.

* Accepted in IEEE International Conference on Data Mining (ICDM) 2016, Barcelona 
  

Modelling Direct Messaging Networks with Multiple Recipients for Cyber Deception

Nov 21, 2021
Kristen Moore, Cody J. Christopher, David Liebowitz, Surya Nepal, Renee Selvey

Cyber deception is emerging as a promising approach to defending networks and systems against attackers and data thieves. However, despite being relatively cheap to deploy, the generation of realistic content at scale is very costly, because rich, interactive deceptive technologies are largely hand-crafted. With recent improvements in Machine Learning, we now have the opportunity to bring scale and automation to the creation of realistic and enticing simulated content. In this work, we propose a framework to automate the generation of email and instant messaging-style group communications at scale. Such messaging platforms within organisations contain a lot of valuable information inside private communications and document attachments, making them an enticing target for an adversary. We address two key aspects of simulating this type of system: modelling when and with whom participants communicate, and generating topical, multi-party text to populate simulated conversation threads. We present the LogNormMix-Net Temporal Point Process as an approach to the first of these, building upon the intensity-free modeling approach of Shchur et al. to create a generative model for unicast and multi-cast communications. We demonstrate the use of fine-tuned, pre-trained language models to generate convincing multi-party conversation threads. A live email server is simulated by uniting our LogNormMix-Net TPP (to generate the communication timestamp, sender and recipients) with the language model, which generates the contents of the multi-party email threads. We evaluate the generated content with respect to a number of realism-based properties that encourage a model to learn to generate content that will engage the attention of an adversary to achieve a deception outcome.
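As a rough illustration of the intensity-free idea this abstract builds on, the sketch below draws inter-event times directly from a log-normal mixture and accumulates them into timestamps. It is not the LogNormMix-Net model: the mixture parameters here are fixed, made-up values, whereas a learned model would condition them on the event history, and sender and recipient selection is omitted entirely. The function names are hypothetical.

```python
import math
import random


def sample_inter_event_time(weights, mus, sigmas, rng):
    """Draw one inter-event time from a log-normal mixture."""
    # Pick a mixture component according to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.lognormvariate(mus[k], sigmas[k])


def simulate_timestamps(n, weights, mus, sigmas, rng, start=0.0):
    """Accumulate n sampled inter-event times into event timestamps."""
    t = start
    timestamps = []
    for _ in range(n):
        t += sample_inter_event_time(weights, mus, sigmas, rng)
        timestamps.append(t)
    return timestamps


def mixture_log_density(tau, weights, mus, sigmas):
    """Log-density of an inter-event time tau > 0 under the mixture."""
    density = 0.0
    for w, mu, sigma in zip(weights, mus, sigmas):
        z = (math.log(tau) - mu) / sigma
        density += w * math.exp(-0.5 * z * z) / (
            tau * sigma * math.sqrt(2.0 * math.pi))
    return math.log(density)


# Illustrative parameters only (assumed): a component for quick replies
# and a component for long gaps between messages.
weights = [0.7, 0.3]
mus = [0.0, 3.0]
sigmas = [0.5, 1.0]

rng = random.Random(42)
timestamps = simulate_timestamps(10, weights, mus, sigmas, rng)
log_p = mixture_log_density(1.0, weights, mus, sigmas)
```

Modeling the inter-event time distribution directly means sampling needs no thinning procedure and the likelihood is available in closed form, as `mixture_log_density` shows.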

  

Interactions in information spread: quantification and interpretation using stochastic block models

Apr 09, 2020
Gaël Poux-Médard, Julien Velcin, Sabine Loudcher

In most real-world applications, it is seldom the case that a given observable evolves independently of its environment. In social networks, users' behavior results from the people they interact with, news in their feed, or trending topics. In natural language, the meaning of phrases emerges from the combination of words. In general medicine, a diagnosis is established on the basis of the interaction of symptoms. Here, we propose a new model, the Interactive Mixed Membership Stochastic Block Model (IMMSBM), which investigates the role of interactions between entities (hashtags, words, memes, etc.) and quantifies their importance within the aforementioned corpora. We find that interactions play an important role in those corpora. In inference tasks, taking them into account leads to average relative changes with respect to non-interactive models of up to 150% in the probability of an outcome. Furthermore, their role greatly improves the predictive power of the model. Our findings suggest that neglecting interactions when modeling real-world phenomena might lead to incorrect conclusions being drawn.

* 17 pages, 3 figures, submitted to ECML-PKDD 2020 
  

Topic Detection and Summarization of User Reviews

May 30, 2020
Pengyuan Li, Lei Huang, Guang-jie Ren

A massive number of reviews is generated daily on various platforms. It is impossible for people to read through so many reviews and obtain useful information. Automatically summarizing customer reviews is thus important for identifying and extracting the essential information to help users obtain the gist of the data. However, as customer reviews are typically short, informal, and multifaceted, it is extremely challenging to generate topic-wise summarization. While several studies aim to solve this issue, they are heuristic methods developed using only customer reviews. Unlike existing methods, we propose an effective new summarization method that analyzes both reviews and summaries. To do that, we first segment reviews and summaries into individual sentiments. As the sentiments are typically short, we combine sentiments talking about the same aspect into a single document and apply a topic modeling method to identify hidden topics among customer reviews and summaries. Sentiment analysis is employed to distinguish positive and negative opinions within each detected topic. A classifier is also introduced to distinguish the writing pattern of summaries from that of customer reviews. Finally, sentiments are selected to generate the summarization based on their topic relevance, sentiment analysis score, and writing pattern. To test our method, a new dataset comprising product reviews and summaries for 1028 products is collected from Amazon and CNET. Experimental results show the effectiveness of our method compared with other methods.

  

Improved Patient Classification with Language Model Pretraining Over Clinical Notes

Sep 06, 2019
Jonas Kemp, Alvin Rajkomar, Andrew M. Dai

Clinical notes in electronic health records contain highly heterogeneous writing styles, including non-standard terminology or abbreviations. Using these notes in predictive modeling has traditionally required preprocessing (e.g. taking frequent terms or topic modeling) that removes much of the richness of the source data. We propose a pretrained hierarchical recurrent neural network model that parses minimally processed clinical notes in an intuitive fashion, and show that it improves performance for multiple classification tasks on the Medical Information Mart for Intensive Care III (MIMIC-III) dataset, increasing top-5 recall to 89.7% (up by 4.8%) for primary diagnosis classification and AUPRC to 35.2% (up by 2.4%) for multilabel diagnosis classification compared to models that treat the notes as an unordered collection of terms or without pretraining. We also apply an attribution technique to several examples to identify the words and the nearby context that the model uses to make its prediction, and show the importance of the words' context.

  