"Topic Modeling": models, code, and papers

Semantic Folding Theory And its Application in Semantic Fingerprinting

Mar 16, 2016
Francisco De Sousa Webber

Human language has been recognized as a very complex domain for decades. No computer system has been able to reach human levels of performance so far. The only known computational system capable of proper language processing is the human brain. While we gather more and more data about the brain, its fundamental computational processes still remain obscure. The lack of a sound computational brain theory also prevents a fundamental understanding of Natural Language Processing. As always when science lacks a theoretical foundation, statistical modeling is applied to accommodate as much sampled real-world data as possible. An unsolved fundamental issue is the actual representation of language (data) within the brain, denoted as the Representational Problem. Starting with Jeff Hawkins' Hierarchical Temporal Memory (HTM) theory, a consistent computational theory of the human cortex, we have developed a corresponding theory of language data representation: the Semantic Folding Theory. The process of encoding words into a sparse binary representational vector, using a topographic semantic space as a distributional reference frame, is called Semantic Folding and is the central topic of this document. Semantic Folding describes a method of converting language from its symbolic representation (text) into an explicit, semantically grounded representation that can be generically processed by Hawkins' HTM networks. As it turned out, this change in representation, by itself, can solve many complex NLP problems by applying Boolean operators and a generic similarity function like the Euclidean distance. Many practical problems of statistical NLP systems, like the high cost of computation, the fundamental incongruity of precision and recall, the complex tuning procedures, etc., can be elegantly overcome by applying Semantic Folding.

* 59 pages, white paper 
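As a rough illustration of the idea in the abstract above (Boolean operators and a Euclidean distance applied to sparse binary vectors), here is a minimal sketch. It is not the paper's implementation: the fingerprints are hypothetical toy values, stored as sets of active bit positions, and the function names are made up for this example.

```python
import math

# Hypothetical toy fingerprints: each set holds the positions of the active
# bits in a sparse binary vector. Real fingerprints would be far larger.
dog = frozenset({1, 4, 7, 9, 12})
wolf = frozenset({1, 4, 7, 10, 13})
car = frozenset({2, 5, 8, 11, 14})


def fp_and(a, b):
    """Boolean AND: bits active in both fingerprints."""
    return a & b


def fp_or(a, b):
    """Boolean OR: bits active in either fingerprint."""
    return a | b


def euclidean_distance(a, b):
    """Euclidean distance between two binary vectors.

    For 0/1 vectors each differing position contributes exactly 1 to the
    squared distance, so the result is the square root of the number of
    positions where the two fingerprints disagree.
    """
    return math.sqrt(len(a ^ b))


if __name__ == "__main__":
    print("shared bits dog/wolf:", sorted(fp_and(dog, wolf)))
    print("distance dog/wolf:", euclidean_distance(dog, wolf))
    print("distance dog/car:", euclidean_distance(dog, car))
```

Storing only the active positions keeps set intersection and union cheap for sparse vectors, which is the property that makes Boolean operations attractive in this setting.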


Geometric Analysis of the Conformal Camera for Intermediate-Level Vision and Perisaccadic Perception

Aug 29, 2009
Jacek Turski

A binocular system developed by the author in terms of projective Fourier transform (PFT) of the conformal camera, which numerically integrates the head, eyes, and visual cortex, is used to process visual information during saccadic eye movements. Although we make three saccades per second at the eyeball's maximum speed of 700 deg/sec, our visual system accounts for these incisive eye movements to produce a stable percept of the world. This visual constancy is maintained by neuronal receptive field shifts in various retinotopically organized cortical areas prior to saccade onset, giving the brain access to visual information from the saccade's target before the eyes' arrival. It integrates visual information acquisition across saccades. Our modeling utilizes basic properties of PFT. First, PFT is computable by FFT in complex logarithmic coordinates that approximate the retinotopy. Second, a translation in retinotopic (logarithmic) coordinates, modeled by the shift property of the Fourier transform, remaps the presaccadic scene into a postsaccadic reference frame. It also accounts for the perisaccadic mislocalization observed by human subjects in laboratory experiments. Because our modeling involves cross-disciplinary areas of conformal geometry, abstract and computational harmonic analysis, computational vision, and visual neuroscience, we include the corresponding background material and elucidate how these different areas interwove in our modeling of primate perception. In particular, we present the physiological and behavioral facts underlying the neural processes related to our modeling. We also emphasize the conformal camera's geometry and discuss how it is uniquely useful in the intermediate-level vision computational aspects of natural scene understanding.

* Ver2 with figures 
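The shift property of the Fourier transform mentioned in the abstract above can be checked numerically. The sketch below uses an ordinary one-dimensional discrete Fourier transform, not the projective Fourier transform itself, and the signal values and function names are assumptions chosen only for illustration: a circular shift of the samples corresponds to multiplying each frequency coefficient by a phase factor.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_shift(x, s):
    """Shift the samples by s positions, wrapping around: y[t] = x[t - s]."""
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]


def shift_in_frequency(coeffs, s):
    """Apply the same shift in the frequency domain via a phase factor."""
    n = len(coeffs)
    return [
        coeffs[k] * cmath.exp(-2j * cmath.pi * k * s / n) for k in range(n)
    ]


if __name__ == "__main__":
    signal = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 0.0]  # hypothetical samples
    direct = dft(circular_shift(signal, 3))
    via_phase = shift_in_frequency(dft(signal), 3)
    error = max(abs(a - b) for a, b in zip(direct, via_phase))
    print("largest difference between the two routes:", error)
```

Both routes give the same coefficients up to floating-point error, which is why a translation in the transformed coordinates can be modeled as a cheap operation on the spectrum.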


Private Topic Modeling

Nov 28, 2016
Mijung Park, James Foulds, Kamalika Chaudhuri, Max Welling

We develop a privatised stochastic variational inference method for Latent Dirichlet Allocation (LDA). The iterative nature of stochastic variational inference presents challenges: multiple iterations are required to obtain accurate posterior distributions, yet each iteration increases the amount of noise that must be added to achieve a reasonable degree of privacy. We propose a practical algorithm that overcomes this challenge by combining: (1) a relaxed notion of differential privacy, called concentrated differential privacy, which provides high-probability bounds on cumulative privacy loss rather than focusing on single-query loss and is therefore well suited for iterative algorithms; and (2) privacy amplification resulting from subsampling of large-scale data. Focusing on conjugate exponential family models, our private variational inference privatises all posterior distributions by simply perturbing expected sufficient statistics. Using Wikipedia data, we illustrate the effectiveness of our algorithm for large-scale data.
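The core mechanical step named in the abstract above, perturbing expected sufficient statistics, can be sketched as follows. This is only an illustration under stated assumptions: the noise scale `sigma` is a free parameter here and is not calibrated to any privacy budget, the clipping at zero is a choice made for this example, and the function name and toy counts are hypothetical. The privacy accounting described in the abstract is not shown.

```python
import random


def perturb_sufficient_stats(stats, sigma, rng):
    """Add independent Gaussian noise to every entry of a topic-by-word
    table of expected counts.

    stats : list of rows, one per topic, each a list of expected word counts
    sigma : noise standard deviation (assumed; NOT derived from a privacy
            budget in this sketch)
    rng   : a random.Random instance, passed in so runs are reproducible
    """
    noisy = []
    for row in stats:
        # Clip at zero so the perturbed values can still be used as counts.
        # This clipping is an assumption of the sketch.
        noisy.append([max(v + rng.gauss(0.0, sigma), 0.0) for v in row])
    return noisy


if __name__ == "__main__":
    expected_counts = [[4.0, 0.5, 2.5], [0.2, 3.8, 1.0]]  # hypothetical values
    rng = random.Random(0)
    print(perturb_sufficient_stats(expected_counts, sigma=0.5, rng=rng))
```

Because only the aggregated statistics are perturbed, the rest of the variational update can run unchanged on the noisy table.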


Generating Cyber Threat Intelligence to Discover Potential Security Threats Using Classification and Topic Modeling

Aug 16, 2021
Md Imran Hossen, Ashraful Islam, Farzana Anowar, Eshtiak Ahmed, Mohammad Masudur Rahman

Due to the variety of cyber-attacks and threats, the cybersecurity community has been enhancing traditional security control mechanisms to an advanced level so that automated tools can counter potential security threats. Very recently, Cyber Threat Intelligence (CTI) has been presented as one of the proactive and robust mechanisms because of its automated, data-driven prediction of cybersecurity threats. In general, CTI collects and analyses data from various sources, e.g. online security forums and social media where cyber enthusiasts, analysts, and even cybercriminals discuss cyber or computer security related topics, and discovers potential threats based on the analysis. As the manual analysis of every such discussion, i.e. posts on online platforms, is time-consuming, inefficient, and susceptible to errors, CTI as an automated tool can perform uniquely to detect cyber threats. In this paper, our goal is to identify and explore relevant CTI from hacker forums by using different supervised and unsupervised learning techniques. To this end, we collect data from a real hacker forum and construct two datasets: a binary dataset and a multi-class dataset. Our binary dataset contains two classes: one containing cybersecurity-relevant posts and another containing posts that are not related to security. This dataset is constructed using a simple keyword search technique. Using a similar approach, we further categorize the security-relevant posts into five different threat categories. We then apply several machine learning classifiers along with deep neural network-based classifiers to the datasets and compare their performance. We also test the classifiers on a leaked dataset with labels, which serves as our ground truth. We further explore the datasets using unsupervised techniques, i.e. Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).
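The keyword-search labelling step described in the abstract above might look roughly like the sketch below. The keyword list, the function name, and the example posts are all hypothetical; the abstract does not state which keywords or matching rules the authors used.

```python
import re

# Hypothetical keyword list; the actual keywords are not given in the abstract.
SECURITY_KEYWORDS = frozenset(
    {"exploit", "malware", "ransomware", "phishing", "botnet"}
)


def label_post(text, keywords=SECURITY_KEYWORDS):
    """Return 1 if the post mentions any keyword as a whole word, else 0."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return 1 if tokens & keywords else 0


if __name__ == "__main__":
    posts = [
        "New ransomware sample for sale",
        "Looking for a good pizza recipe",
    ]
    for post in posts:
        print(label_post(post), post)
```

Matching on whole tokens rather than substrings avoids labelling a post as relevant just because a keyword happens to appear inside a longer, unrelated word.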


TS-MPC for Autonomous Vehicle using a Learning Approach

Apr 29, 2020
Eugenio Alcalá, Olivier Sename, Vicenç Puig, Joseba Quevedo

In this paper, Model Predictive Control (MPC) and Moving Horizon Estimator (MHE) strategies that use a data-driven approach to learn a Takagi-Sugeno (TS) representation of the vehicle dynamics are proposed to solve autonomous driving control problems in real time. To address the TS modeling, we use the Adaptive Neuro-Fuzzy Inference System (ANFIS) approach to obtain a set of polytopic-based linear representations as well as a set of membership functions that relate the different linear subsystems in a non-linear way. The proposed control approach is provided with racing-based references from an external planner and estimates from the MHE, offering high driving performance in racing mode. The control-estimation scheme is tested in a simulated racing environment to show the potential of the presented approaches.

* 6 pages, 7 figures, IFAC 2020 World Congress 


The encoding of proprioceptive inputs in the brain: knowns and unknowns from a robotic perspective

Jul 20, 2016
Matej Hoffmann, Nada Bednarova

Somatosensory inputs can be grossly divided into tactile (or cutaneous) and proprioceptive -- the former conveying information about skin stimulation, the latter about limb position and movement. The principal proprioceptors are constituted by muscle spindles, which deliver information about muscle length and speed. In primates, this information is relayed to the primary somatosensory cortex and eventually the posterior parietal cortex, where integrated information about body posture (postural schema) is presumably available. However, coming from robotics and seeking a biologically motivated model that could be used in a humanoid robot, we faced a number of difficulties. First, it is not clear what neurons in the ascending pathway and primary somatosensory cortex code. To an engineer, joint angles would seem the most useful variables. However, the lengths of individual muscles have nonlinear relationships with the angles at joints. Kim et al. (Neuron, 2015) found different types of proprioceptive neurons in the primary somatosensory cortex -- sensitive to movement of single or multiple joints or to static postures. Second, there are indications that the somatotopic arrangement ("the homunculus") of these brain areas is to a significant extent learned. However, the mechanisms behind this developmental process are unclear. We will report first results from modeling of this process using data obtained from body babbling in the iCub humanoid robot and feeding them into a Self-Organizing Map (SOM). Our results reveal that the SOM algorithm is only suited to develop receptive fields of the posture-selective type. Furthermore, the SOM algorithm has intrinsic difficulties when combined with population code on its input and in particular with nonlinear tuning curves (sigmoids or Gaussians).

* in Proceedings of Kognice a umělý život XVI [Cognition and Artificial Life XVI] 2016, ISBN 978-80-01-05915-9 


Modeling the dynamics of domain specific terminology in diachronic corpora

Jul 11, 2017
Gerhard Heyer, Cathleen Kantner, Andreas Niekler, Max Overbeck, Gregor Wiedemann

In terminology work, natural language processing, and digital humanities, several studies address the analysis of variations in context and meaning of terms in order to detect semantic change and the evolution of terms. We distinguish three different approaches to describe contextual variations: methods based on the analysis of patterns and linguistic clues, methods exploring the latent semantic space of single words, and methods for the analysis of topic membership. The paper presents the notion of context volatility as a new measure for detecting semantic change and applies it to key term extraction in a political science case study. The measure quantifies the dynamics of a term's contextual variation within a diachronic corpus to identify periods of time that are characterised by intense controversial debates or substantial semantic transformations.

* Proceedings of the 12th International conference on Terminology and Knowledge Engineering (TKE 2016) 


Harnessing Heterogeneity: Learning from Decomposed Feedback in Bayesian Modeling

Jul 07, 2021
Kai Wang, Bryan Wilder, Sze-chuan Suen, Bistra Dilkina, Milind Tambe

There is significant interest in learning and optimizing a complex system composed of multiple sub-components, where these components may be agents or autonomous sensors. Among the rich literature on this topic, agent-based and domain-specific simulations can capture complex dynamics and subgroup interaction, but optimizing over such simulations can be computationally and algorithmically challenging. Bayesian approaches, such as Gaussian processes (GPs), can be used to learn a computationally tractable approximation to the underlying dynamics but typically neglect the detailed information about subgroups in the complicated system. We attempt to find the best of both worlds by proposing the idea of decomposed feedback, which captures group-based heterogeneity and dynamics. We introduce a novel decomposed GP regression to incorporate the subgroup decomposed feedback. Our modified regression has provably lower variance -- and thus a more accurate posterior -- compared to previous approaches; it also allows us to introduce a decomposed GP-UCB optimization algorithm that leverages subgroup feedback. The Bayesian nature of our method makes the optimization algorithm tractable with a theoretical guarantee on convergence and no-regret property. To demonstrate the wide applicability of this work, we execute our algorithm on two disparate social problems: infectious disease control in a heterogeneous population and allocation of distributed weather sensors. Experimental results show that our new method provides significant improvement compared to the state-of-the-art.


Hierarchical Phenotyping and Graph Modeling of Spatial Architecture in Lymphoid Neoplasms

Jun 30, 2021
Pingjun Chen, Muhammad Aminu, Siba El Hussein, Joseph Khoury, Jia Wu

The cells and their spatial patterns in the tumor microenvironment (TME) play a key role in tumor evolution, yet remain an understudied topic in computational pathology. This study, to the best of our knowledge, is among the first to combine local and global graph methods to profile the orchestration and interaction of cellular components. To address the challenge in hematolymphoid cancers where the cell classes in the TME are unclear, we first implemented cell-level unsupervised learning and identified two new cell subtypes. Local cell graphs, or supercells, were built for each image by considering the individual cells' geospatial locations and classes. Then, we applied supercell-level clustering and identified two new cell communities. In the end, we built global graphs to abstract spatial interaction patterns and extract features for disease diagnosis. We evaluate the proposed algorithm on H&E slides of 60 hematolymphoid neoplasm patients and further compare it with three cell-level graph-based algorithms, including the global cell graph, cluster cell graph, and FLocK. The proposed algorithm achieves a mean diagnosis accuracy of 0.703 with the repeated 5-fold cross-validation scheme. In conclusion, our algorithm shows superior performance over the existing methods and can potentially be applied to other cancer types.

* Accepted by MICCAI 2021 


Long-term, Short-term and Sudden Event: Trading Volume Movement Prediction with Graph-based Multi-view Modeling

Aug 23, 2021
Liang Zhao, Wei Li, Ruihan Bao, Keiko Harimoto, Yunfang Wu, Xu Sun

Trading volume movement prediction is key in a variety of financial applications. Despite its importance, there is little research on this topic because it requires a comprehensive understanding of information from different sources. For instance, the relations between multiple stocks, recent transaction data, and suddenly released events are all essential for understanding the trading market. However, most previous methods only take the fluctuation information of the past few weeks into consideration, thus yielding poor performance. To handle this issue, we propose a graph-based approach that can incorporate multi-view information, i.e., long-term stock trend, short-term fluctuation, and sudden-event information, jointly into a temporal heterogeneous graph. In addition, our method is equipped with deep canonical analysis to highlight the correlations between different perspectives of fluctuation for better prediction. Experimental results show that our method outperforms strong baselines by a large margin.

* Accepted as a main track paper by IJCAI 21 
