Mark Chignell

Unsupervised Learning of Distributional Properties can Supplement Human Labeling and Increase Active Learning Efficiency in Anomaly Detection

Jul 13, 2023
Jaturong Kongmanee, Mark Chignell, Khilan Jerath, Abhay Raman

Exfiltration of data via email is a serious cybersecurity threat for many organizations. Detecting data exfiltration (anomaly) patterns typically requires labeling, most often done by a human annotator, to reduce the high number of false alarms. Active Learning (AL) is a promising approach for labeling data efficiently, but it needs to choose an efficient order in which cases are to be labeled, and there are uncertainties as to which scoring procedure should be used to prioritize cases for labeling, especially when detecting rare cases of interest is crucial. We propose an adaptive AL sampling strategy that leverages the underlying prior data distribution, as well as model uncertainty, to produce batches of cases to be labeled that contain instances of rare anomalies. We show that (1) the classifier benefits from a batch of representative and informative instances of both normal and anomalous examples, and (2) unsupervised anomaly detection plays a useful role in building the classifier in the early stages of training, when relatively little labeling has been done. Our approach to AL for anomaly detection outperformed existing AL approaches on three highly unbalanced UCI benchmarks and on one real-world redacted email data set.
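
Below is a minimal sketch, in Python with scikit-learn, of how a batch sampler might combine an unsupervised anomaly ranking with classifier uncertainty in the spirit of the strategy described above. The IsolationForest scorer, the interleaving rule, and the even mix of "most anomalous" and "least confident" candidates are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def select_batch(X_pool, clf, batch_size=10, random_state=0):
    """Pick pool instances to send to the human labeler.

    Interleaves two rankings: most anomalous under an unsupervised model,
    and least confident under the current supervised classifier `clf`
    (assumed to be any scikit-learn classifier already fitted on the
    labeled seed set).
    """
    # Unsupervised view: lower score_samples => more anomalous.
    iso = IsolationForest(random_state=random_state).fit(X_pool)
    anomaly_rank = np.argsort(iso.score_samples(X_pool))      # most anomalous first

    # Supervised view: smallest top-class probability => least confident.
    proba = clf.predict_proba(X_pool)
    uncertainty_rank = np.argsort(proba.max(axis=1))          # least confident first

    # Alternate between the two rankings until the batch is full.
    chosen, seen = [], set()
    for a, u in zip(anomaly_rank, uncertainty_rank):
        for idx in (int(a), int(u)):
            if idx not in seen:
                chosen.append(idx)
                seen.add(idx)
            if len(chosen) == batch_size:
                return np.array(chosen)
    return np.array(chosen)
```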

Implementing Active Learning in Cybersecurity: Detecting Anomalies in Redacted Emails

Mar 03, 2023
Mu-Huan Chung, Lu Wang, Sharon Li, Yuhong Yang, Calvin Giang, Khilan Jerath, Abhay Raman, David Lie, Mark Chignell

Research on email anomaly detection has typically relied on specially prepared datasets that may not adequately reflect the type of data that occurs in industry settings. In our research at a major financial services company, privacy concerns prevented inspection of the bodies of emails and attachment details (although subject headings and attachment filenames were available). This made labeling possible anomalies in the resulting redacted emails more difficult. Another source of difficulty is the high volume of emails combined with the scarcity of resources, which makes machine learning (ML) a necessity but also creates a need for more efficient human training of ML models. Active learning (AL) has been proposed as a way to make human training of ML models more efficient. However, implementing AL methods is a human-centered AI challenge due to potential human analyst uncertainty, and the labeling task can be further complicated in domains such as cybersecurity (or healthcare, aviation, etc.) where mistakes in labeling can have highly adverse consequences. In this paper we present research results concerning the application of AL to anomaly detection in redacted emails, comparing the utility of different methods for implementing AL in this context. We evaluate different AL strategies and their impact on resulting model performance. We also examine how the confidence ratings that experts assign to their labels can inform AL. The results obtained are discussed in terms of their implications for AL methodology and for the role of experts in model-assisted email anomaly screening.
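
One concrete way expert confidence ratings could inform AL, sketched below in Python with scikit-learn, is to use them as sample weights when the classifier is retrained between query rounds, so that labels the analyst was unsure about count for less. The weighting scheme and the least-confidence query rule are illustrative assumptions, not the specific strategies evaluated in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def retrain_with_confidence(X_labeled, y_labeled, confidence):
    """Fit a classifier that trusts confident labels more than uncertain ones.

    `confidence` holds the analysts' self-ratings rescaled to (0, 1],
    passed straight through as per-sample weights.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_labeled, y_labeled,
            sample_weight=np.asarray(confidence, dtype=float))
    return clf


def query_next(clf, X_pool, n_queries=5):
    """Classic least-confidence sampling over the unlabeled pool."""
    proba = clf.predict_proba(X_pool)
    return np.argsort(proba.max(axis=1))[:n_queries]   # least confident first
```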

MD-MTL: An Ensemble Med-Multi-Task Learning Package for Disease Scores Prediction and Multi-Level Risk Factor Analysis

Mar 05, 2021
Lu Wang, Haoyan Jiang, Mark Chignell

While many machine learning methods have been used for medical prediction and risk factor analysis on healthcare data, most prior research has involved single-task learning (STL) methods. However, healthcare research often involves multiple related tasks: for instance, predicting disease scores and analyzing risk factors in multiple subgroups of patients simultaneously, or carrying out risk factor analysis at multiple levels at once. In this paper, we developed a new ensemble machine learning Python package based on multi-task learning (MTL), referred to as the Med-Multi-Task Learning (MD-MTL) package, and applied it to predicting disease scores of patients and to carrying out risk factor analysis on multiple subgroups of patients simultaneously. Our experimental results on two datasets demonstrate the utility of the MD-MTL package and show the advantage of MTL (vs. STL) when analyzing data that is organized into different categories (tasks, which can be various age groups, different levels of disease severity, etc.).
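
To make the MTL-vs-STL contrast concrete, here is a minimal, generic sketch of mean-regularized multi-task regression in Python/NumPy; it is not the MD-MTL package's API, and the function name, the penalty, and the alternating update are illustrative assumptions. Each task (e.g., an age group or a disease-severity level) keeps its own weight vector, but all tasks are pulled toward a shared center, which is what lets related subgroups borrow strength from one another.

```python
import numpy as np


def mean_regularized_mtl(tasks, lam=1.0, n_iter=20):
    """tasks: list of (X_t, y_t) pairs sharing the same feature dimension.

    Alternating minimization of
        sum_t ||X_t w_t - y_t||^2 + lam * sum_t ||w_t - w_bar||^2
    over the per-task weights w_t and the shared center w_bar.
    """
    d = tasks[0][0].shape[1]
    W = np.zeros((len(tasks), d))
    for _ in range(n_iter):
        w_bar = W.mean(axis=0)                 # optimal shared center given W
        for t, (X, y) in enumerate(tasks):
            # Ridge-style solve pulled toward the shared center.
            A = X.T @ X + lam * np.eye(d)
            b = X.T @ y + lam * w_bar
            W[t] = np.linalg.solve(A, b)
    return W, W.mean(axis=0)
```

With lam close to 0 each subgroup model reduces to its own single-task least-squares fit, while larger lam ties the subgroup models together, which is the regime where MTL tends to help on small, related patient subgroups.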

* 14 pages, 8 figures 