Given the variety of cyber-attacks and threats, the cybersecurity community has been advancing traditional security control mechanisms so that automated tools can counter potential security threats. Very recently, Cyber Threat Intelligence (CTI) has been presented as a proactive and robust mechanism because of its automated cybersecurity threat prediction. Generally, CTI collects and analyses data from various sources, e.g., online security forums and social media where cyber enthusiasts, analysts, and even cybercriminals discuss cyber or computer security-related topics, and discovers potential threats based on the analysis. As the manual analysis of every such discussion (posts on online platforms) is time-consuming, inefficient, and susceptible to errors, automated CTI tools are well suited to detecting cyber threats. In this paper, we identify and explore relevant CTI from hacker forums utilizing different supervised (classification) and unsupervised (topic modeling) learning techniques. To this end, we collect data from a real hacker forum and construct two datasets: a binary dataset and a multi-class dataset. We then apply several classifiers, including deep neural network-based classifiers, to the datasets and compare their performance. We also apply the classifiers to a labeled leaked dataset that serves as our ground truth. We further explore the datasets using unsupervised techniques. For this purpose, we leverage two topic modeling algorithms, namely Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).
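As a rough illustration of the NMF side of the topic modeling step, the sketch below factorizes a small document-term count matrix V into non-negative factors W (document-topic weights) and H (topic-term weights) using the standard multiplicative update rules. It is a generic, dependency-free sketch and not the authors' implementation; the function names, the toy vocabulary, and the counts are all hypothetical.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def frobenius_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) as W (docs x k) times H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    # Strictly positive random initialization (an arbitrary choice).
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(matmul(Wt, W), H)
        H = [[h * a / (d + eps) for h, a, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[w * a / (d + eps) for w, a, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def top_terms(H, vocab, n=2):
    """Highest-weighted terms for each topic (row of H)."""
    return [[vocab[j] for j in sorted(range(len(row)), key=lambda j: -row[j])[:n]]
            for row in H]

# Hypothetical toy corpus: rows are posts, columns are term counts.
vocab = ["malware", "exploit", "account", "password"]
V = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 2, 3], [0, 0, 3, 2]]
W, H = nmf(V, k=2)
print(top_terms(H, vocab))
```

In practice a library implementation with TF-IDF weighting would usually replace this hand-rolled loop; the sketch only shows what the factorization computes.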
Recent advances in topic models have explored complicated structured distributions to represent topic correlation. For example, the pachinko allocation model (PAM) captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (DAG). While PAM provides more flexibility and greater expressive power than previous models like latent Dirichlet allocation (LDA), it is also more difficult to determine the appropriate topic structure for a specific dataset. In this paper, we propose a nonparametric Bayesian prior for PAM based on a variant of the hierarchical Dirichlet process (HDP). Although the HDP can capture topic correlations defined by nested data structure, it does not automatically discover such correlations from unstructured data. By assuming an HDP-based prior for PAM, we are able to learn both the number of topics and how the topics are correlated. We evaluate our model on synthetic and real-world text datasets, and show that nonparametric PAM achieves performance matching the best of PAM without manually tuning the number of topics.
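To give some intuition for how a nonparametric prior can leave the number of topics open, the sketch below samples from a Chinese restaurant process (CRP), the clustering scheme induced by a single Dirichlet process. This is only a simplified building block under stated assumptions: the HDP-based prior for PAM described above is considerably richer, and the function name and the concentration parameter `alpha` are illustrative choices, not taken from the paper.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process.

    Item i (0-based) joins existing cluster k with probability
    counts[k] / (i + alpha) and opens a new cluster with probability
    alpha / (i + alpha). Requires alpha > 0.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of items currently in cluster k
    assignments = []
    for i in range(n_items):
        r = rng.random() * (i + alpha)
        cumulative = 0.0
        choice = len(counts)          # default: open a new cluster
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                choice = k
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        assignments.append(choice)
    return assignments, counts

assignments, counts = crp_assignments(n_items=50, alpha=1.0)
print(len(counts), "clusters for 50 items")
```

The number of clusters is not fixed in advance: it grows with the data, with larger `alpha` favouring more clusters.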
Presented herein is a novel model for similar question ranking within collaborative question answer platforms. The presented approach integrates a regression stage to relate topics derived from questions to those derived from question-answer pairs. This helps to avoid problems caused by the differences in vocabulary used within questions and answers, and by the tendency for questions to be shorter than answers. The model is shown to outperform translation methods and topic modelling (without regression) on several real-world datasets.
India locked down 1.3 billion people on March 25, 2020 in the wake of the COVID-19 pandemic. The economic cost of the lockdown was estimated at USD 98 billion, while the social costs are still unknown. This study investigated how the government formed reactive policies to fight coronavirus across its policy sectors. Primary data was collected from the Press Information Bureau (PIB) in the form of press releases of government plans, policies, programme initiatives and achievements. A text corpus of 260,852 words was created from 396 documents from the PIB. Unsupervised topic modelling using the Latent Dirichlet Allocation (LDA) algorithm was performed on the text corpus to extract high-probability topics in the policy sectors. The extracted topics were interpreted through a nudge theoretic lens to derive the critical policy heuristics of the government. Results showed that most interventions were targeted to generate endogenous nudge by using external triggers. Notably, the nudges from the Prime Minister of India were critical in creating a herd effect on lockdown and social distancing norms across the nation. A similar effect was also observed around public health (e.g., masks in public spaces; Yoga and Ayurveda for immunity), transport (e.g., old trains converted to isolation wards), micro, small and medium enterprises (e.g., rapid production of PPE and masks), the science and technology sector (e.g., diagnostic kits, robots and nano-technology), home affairs (e.g., surveillance and lockdown), urban (e.g., drones, GIS-tools) and education (e.g., online learning). The study concludes that leveraging these heuristics is crucial for lockdown easement planning.
Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this dissertation, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the 'ancestors' of the new topic. Based on this understanding, we developed Augur, a novel approach to effectively detecting the emergence of new research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically for supporting this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 timeframe and outperformed four alternative approaches in terms of both precision and recall.
This paper presents a Bayesian multilingual topic model for learning language-independent document embeddings. Our model learns to represent the documents in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. We propagate the learned uncertainties through linear classifiers for zero-shot cross-lingual topic identification. Our experiments on five-language Europarl and Reuters (MLDoc) corpora show that the proposed model outperforms multilingual word embedding and BiLSTM sentence encoder based systems by significant margins in the majority of the transfer directions. Moreover, our system, trained in under a single day on a single GPU with much less data, performs competitively with the state-of-the-art universal BiLSTM sentence encoder trained on 93 languages. Our experimental analysis shows that increasing the amount of parallel data improves the overall performance of the embeddings. Nonetheless, exploiting the uncertainties is always beneficial.
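One way to picture propagating embedding uncertainty through a linear classifier is sketched below. If a document embedding is Gaussian with mean `mu` and, as assumed here, a diagonal covariance `var`, then the linear score w·x + b is itself Gaussian with mean w·mu + b and variance sum(w_i^2 * var_i). The expected sigmoid output is then approximated with the common probit-style correction. The binary setting, the diagonal covariance, and all names and values are assumptions for illustration; the abstract does not specify the authors' exact procedure.

```python
import math

def logit_distribution(mu, var, w, b):
    """Mean and variance of w.x + b when x ~ N(mu, diag(var))."""
    mean = sum(wi * mi for wi, mi in zip(w, mu)) + b
    variance = sum(wi * wi * vi for wi, vi in zip(w, var))
    return mean, variance

def expected_sigmoid(mean, variance):
    """Approximate E[sigmoid(z)] for z ~ N(mean, variance).

    Uses the probit-style approximation
    sigmoid(mean / sqrt(1 + pi * variance / 8)).
    """
    scaled = mean / math.sqrt(1.0 + math.pi * variance / 8.0)
    return 1.0 / (1.0 + math.exp(-scaled))

# Hypothetical two-dimensional embedding and classifier weights.
mu, var = [1.0, -2.0], [0.5, 0.1]
w, b = [2.0, 1.0], 0.5
mean, variance = logit_distribution(mu, var, w, b)
print(mean, variance, expected_sigmoid(mean, variance))
```

With zero variance the result reduces to the ordinary sigmoid of the mean score; larger variance pulls the predicted probability toward 0.5, so uncertain embeddings yield less confident predictions.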
For the TREC-8 routing, one specific filter is built for each topic. Each filter is a classifier trained to recognize the documents that are relevant to the topic. When presented with a document, each classifier estimates the probability for the document to be relevant to the topic for which it has been trained. Since the procedure for building a filter is topic-independent, the system is fully automatic. By making use of a sample of documents that have previously been evaluated as relevant or not relevant to a particular topic, a term selection is performed, and a neural network is trained. Each document is represented by a vector of frequencies of a list of selected terms. This list depends on the topic to be filtered; it is constructed in two steps. The first step defines the characteristic words used in the relevant documents of the corpus; the second one chooses, among the previous list, the most discriminant ones. The length of the vector is optimized automatically for each topic. At the end of the term selection, a vector of typically 25 words is defined for the topic, so that each document which has to be processed is represented by a vector of term frequencies. This vector is subsequently input to a classifier that is trained from the same sample. After training, the classifier estimates for each document of a test set its probability of being relevant; for submission to TREC, the top 1000 documents are ranked in order of decreasing relevance.
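The pipeline described above (two-step term selection, term-frequency vectors, a trained classifier, and ranking by estimated relevance) can be sketched as follows. This is a minimal sketch under several assumptions: the selection criteria (raw frequency, then difference in document frequency) are stand-ins because the abstract does not specify them, a single logistic unit replaces the neural network, and all function names and example texts are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def select_terms(relevant, non_relevant, n_candidates=50, n_terms=25):
    """Two-step term selection for one topic."""
    # Step 1: characteristic words = most frequent words in relevant documents.
    rel_counts = Counter(w for doc in relevant for w in tokenize(doc))
    candidates = [w for w, _ in rel_counts.most_common(n_candidates)]

    # Step 2: keep the most discriminant candidates, scored here by the
    # difference in document frequency (an assumed criterion).
    def doc_freq(word, docs):
        return sum(word in tokenize(d) for d in docs) / max(len(docs), 1)

    ranked = sorted(
        candidates,
        key=lambda w: doc_freq(w, relevant) - doc_freq(w, non_relevant),
        reverse=True,
    )
    return ranked[:n_terms]

def vectorize(doc, terms):
    """Represent a document by the frequencies of the selected terms."""
    counts = Counter(tokenize(doc))
    return [counts[t] for t in terms]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_filter(X, y, epochs=200, lr=0.1):
    """Train a single logistic unit by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - target
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def rank(docs, terms, w, b, top_k=1000):
    """Return (probability, document) pairs in decreasing order of relevance."""
    scored = []
    for d in docs:
        x = vectorize(d, terms)
        scored.append((sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b), d))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Hypothetical training sample for one topic.
relevant = ["neural network filter for routing",
            "routing filter trained on relevant documents"]
non_relevant = ["football match report", "weather report for the weekend"]
terms = select_terms(relevant, non_relevant, n_terms=3)
X = [vectorize(d, terms) for d in relevant + non_relevant]
y = [1] * len(relevant) + [0] * len(non_relevant)
w, b = train_filter(X, y)
print(rank(["football scores today", "a routing filter"], terms, w, b))
```

Because the procedure takes only a labeled sample as input, the same code can be run unchanged for every topic, which mirrors the topic-independent construction described in the text.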
With the evolution of the cloud and customer-centric culture, we inherently accumulate huge repositories of textual reviews, feedback, and support data. This has driven enterprises to research engagement patterns, user network analysis, topic detection, etc. However, a great deal of manual work is still necessary to mine the data for actionable outcomes. In this paper, we propose and develop an innovative semi-supervised learning approach that utilizes deep learning and topic modeling to gain a better understanding of the user voice. This approach combines a BERT-based multiclassification algorithm trained through supervised learning with a novel Probabilistic and Semantic Hybrid Topic Inference (PSHTI) model trained through unsupervised learning, aiming to automate the identification of the main topics or areas as well as the sub-topics in textual feedback and support data. There are three major breakthroughs: 1. With the advancement of deep learning technology, there have been tremendous innovations in the NLP field, yet traditional topic modeling, as one NLP application, lags behind the tide of deep learning. From the methodological and technical perspective, we adopt transfer learning to fine-tune a BERT-based multiclassification system to categorize the main topics and then utilize the novel PSHTI model to infer the sub-topics under the predicted main topics. 2. Traditional unsupervised topic models and clustering methods struggle to automatically generate a meaningful topic label, but our system maps the top words to the self-help issues by utilizing domain knowledge about the product obtained through web crawling. 3. This work provides a prominent showcase of leveraging state-of-the-art methodology in real production to help discover user insights and drive business investment priorities.
We present Variational Aspect-Based Latent Dirichlet Allocation (VALDA), a family of autoencoding topic models that learn aspect-based representations of reviews. VALDA defines a user-item encoder that maps bag-of-words vectors for combined reviews associated with each paired user and item onto structured embeddings, which in turn define per-aspect topic weights. We model individual reviews in a structured manner by inferring an aspect assignment for each sentence in a given review, where the per-aspect topic weights obtained by the user-item encoder serve to define a mixture over topics, conditioned on the aspect. The result is an autoencoding neural topic model for reviews, which can be trained in a fully unsupervised manner to learn topics that are structured into aspects. Experimental evaluation on a large number of datasets demonstrates that the aspects are interpretable, yield higher coherence scores than non-structured autoencoding topic model variants, and can be utilized to perform aspect-based comparison and genre discovery.