Edward Raff

University of Maryland, Baltimore County, Booz Allen Hamilton

AVScan2Vec: Feature Learning on Antivirus Scan Data for Production-Scale Malware Corpora

Jun 09, 2023

Robert J. Joyce, Tirth Patel, Charles Nicholas, Edward Raff

Abstract: When investigating a malicious file, searching for related files is a common task that malware analysts must perform. Given that production malware corpora may contain over a billion files and consume petabytes of storage, many feature extraction and similarity search approaches are computationally infeasible. Our work explores the potential of antivirus (AV) scan data as a scalable source of features for malware. This is possible because AV scan reports are widely available through services such as VirusTotal and are ~100x smaller than the average malware sample. An AV scan report is rich in information and can indicate a malicious file's family, behavior, target operating system, and many other characteristics. We introduce AVScan2Vec, a language model trained to comprehend the semantics of AV scan data. AVScan2Vec ingests AV scan data for a malicious file and outputs a meaningful vector representation. AVScan2Vec vectors are ~3 to 85x smaller than popular alternatives in use today, enabling faster vector comparisons and lower memory usage. By incorporating Dynamic Continuous Indexing, we show that nearest-neighbor queries on AVScan2Vec vectors can scale to even the largest malware production datasets. We also demonstrate that AVScan2Vec vectors are superior to other leading malware feature vector representations across nearly all classification, clustering, and nearest-neighbor lookup algorithms that we evaluated.

Recasting Self-Attention with Holographic Reduced Representations

May 31, 2023

Mohammad Mahmudul Alam, Edward Raff, Stella Biderman, Tim Oates, James Holt

Abstract: In recent years, self-attention has become the dominant paradigm for sequence modeling in a variety of domains. However, in domains with very long sequence lengths the $\mathcal{O}(T^2)$ memory and $\mathcal{O}(T^2 H)$ compute costs can make using transformers infeasible. Motivated by problems in malware detection, where sequence lengths of $T \geq 100,000$ are a roadblock to deep learning, we re-cast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR). In doing so we follow the same high-level strategy as standard self-attention: a set of queries matching against a set of keys, and returning a weighted response of the values for each key. Implemented as a "Hrrformer" we obtain several benefits including $\mathcal{O}(T H \log H)$ time complexity, $\mathcal{O}(T H)$ space complexity, and convergence in $10\times$ fewer epochs. Nevertheless, the Hrrformer achieves near state-of-the-art accuracy on LRA benchmarks and we are able to learn with just a single layer. Combined, these benefits make our Hrrformer the first viable Transformer for such long malware classification sequences and up to $280\times$ faster to train on the Long Range Arena benchmark. Code is available at https://github.com/NeuromorphicComputationResearchProgram/Hrrformer

* To appear in Proceedings of the 40th International Conference on Machine Learning (ICML)
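
The HRR operations the abstract builds on can be illustrated with a minimal NumPy sketch. This is not the Hrrformer attention layer or the authors' code (see their repository for that); it only shows the standard HRR primitives: binding by circular convolution, Plate's approximate inverse, and retrieval of a value from a superposition of bound key-value pairs. The sizes `H` and `n_pairs`, the random seed, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def bind(a, b):
    """HRR binding: circular convolution, computed with the FFT in O(H log H)."""
    h = a.shape[-1]
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=h)

def approx_inverse(a):
    """Plate's approximate inverse (involution): keep a[0], reverse the rest."""
    return np.roll(a[..., ::-1], 1, axis=-1)

def unbind(memory, key):
    """Query a superposition of bound pairs with a key to get a noisy value."""
    return bind(memory, approx_inverse(key))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
H, n_pairs = 1024, 3                      # illustrative sizes, not from the paper
scale = 1.0 / np.sqrt(H)                  # elements drawn i.i.d. from N(0, 1/H)
keys = rng.normal(0.0, scale, size=(n_pairs, H))
values = rng.normal(0.0, scale, size=(n_pairs, H))

memory = bind(keys, values).sum(axis=0)   # a single H-dim vector holds all pairs
retrieved = unbind(memory, keys[0])       # noisy estimate of values[0]

sim_match = cosine(retrieved, values[0])  # clearly positive
sim_other = cosine(retrieved, values[1])  # close to zero
print(f"match={sim_match:.2f} other={sim_other:.2f}")
```

Because every key-value pair is folded into one fixed-size vector, the cost of storing and querying the memory does not grow quadratically with the number of pairs, which is the property the abstract's complexity figures rely on.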

Sparse Private LASSO Logistic Regression

Apr 29, 2023

Amol Khanna, Fred Lu, Edward Raff, Brian Testa

Abstract: LASSO-regularized logistic regression is particularly useful for its built-in feature selection, allowing coefficients to be removed from deployment and producing sparse solutions. Differentially private versions of LASSO logistic regression have been developed, but generally produce dense solutions, reducing the intrinsic utility of the LASSO penalty. In this paper, we present a differentially private method for sparse logistic regression that maintains hard zeros. Our key insight is to first train a non-private LASSO logistic regression model to determine an appropriate privatized number of non-zero coefficients to use in final model selection. To demonstrate our method's performance, we run experiments on synthetic and real-world datasets.

* 20 pages, 5 figures
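
The "hard zeros" mentioned in the abstract come from the L1 penalty itself. As a minimal sketch, the soft-thresholding operator below (the proximal operator of the L1 norm, used by many non-private LASSO solvers) sets small coefficients to exactly zero. It contains no privacy mechanism and is not the method proposed in the paper; the weights and the threshold `lam` are made-up values for illustration.

```python
def soft_threshold(w, lam):
    """Proximal operator of lam * |w|: shrink w toward zero by lam,
    and return exactly 0.0 when |w| <= lam."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

# Hypothetical dense coefficient vector and threshold.
weights = [0.9, -0.05, 0.02, -1.4, 0.3]
lam = 0.3

sparse = [soft_threshold(w, lam) for w in weights]
support = [i for i, w in enumerate(sparse) if w != 0.0]
print(sparse, support)
```

Adding noise to such a solution, as a naive private variant might, would turn the exact zeros back into small non-zero values, which is the density problem the abstract describes.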

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Apr 03, 2023

Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff (+3 more)

Abstract: How do large language models (LLMs) develop and evolve over the course of training? How do these patterns change as models scale? To answer these questions, we introduce Pythia, a suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters. We provide public access to 154 checkpoints for each one of the 16 models, alongside tools to download and reconstruct their exact training dataloaders for further study. We intend Pythia to facilitate research in many areas, and we present several case studies including novel results in memorization, term frequency effects on few-shot performance, and reducing gender bias. We demonstrate that this highly controlled setup can be used to yield novel insights toward LLMs and their training dynamics. Trained models, analysis code, training code, and training data can be found at https://github.com/EleutherAI/pythia.

* Code at https://github.com/EleutherAI/pythia

The Challenge of Differentially Private Screening Rules

Mar 18, 2023

Amol Khanna, Fred Lu, Edward Raff

Abstract: Linear $L_1$-regularized models have remained one of the simplest and most effective tools in data analysis, especially in information retrieval problems where n-grams over text with TF-IDF or Okapi feature values are a strong and easy baseline. Over the past decade, screening rules have risen in popularity as a way to reduce the runtime for producing the sparse regression weights of $L_1$ models. However, despite the increasing need for privacy-preserving models in information retrieval, to the best of our knowledge, no differentially private screening rule exists. In this paper, we develop the first differentially private screening rule for linear and logistic regression. In doing so, we discover difficulties in the task of making a useful private screening rule due to the amount of noise added to ensure privacy. We provide theoretical arguments and experimental evidence that this difficulty arises from the screening step itself and not the private optimizer. Based on our results, we highlight that developing an effective private $L_1$ screening method is an open problem in the differential privacy literature.

* 5 pages, 2 figures

Measuring Equality in Machine Learning Security Defenses

Mar 01, 2023

Luke E. Richards, Edward Raff, Cynthia Matuszek

Abstract: The machine learning security community has developed myriad defenses for evasion attacks over the past decade. An understudied question in that community is: for whom do these defenses defend? In this work, we consider some common approaches to defending learned systems and whether those approaches may offer unexpected performance inequities when used by different sub-populations. We outline simple parity metrics and a framework for analysis that can begin to answer this question through empirical results of the fairness implications of machine learning security methods. Many methods have been proposed that can cause direct harm, which we describe as biased vulnerability and biased rejection. Our framework and metric can be applied to robustly trained models, preprocessing-based methods, and rejection methods to capture behavior over security budgets. We identify a realistic dataset with a reasonable computational cost suitable for measuring the equality of defenses. Through a case study in speech command recognition, we show how such defenses do not offer equal protection for social subgroups and how to perform such analyses for robustness training, and we present a comparison of fairness between two rejection-based defenses: randomized smoothing and neural rejection. We offer further analysis of factors that correlate to equitable defenses to stimulate the future investigation of how to assist in building such defenses. To the best of our knowledge, this is the first work that examines the fairness disparity in the accuracy-robustness trade-off in speech data and addresses fairness evaluation for rejection-based defenses.

* In Submission

When Visible-to-Thermal Facial GAN Beats Conditional Diffusion

Feb 18, 2023

Catherine Ordun, Edward Raff, Sanjay Purushotham

Abstract: Thermal facial imagery offers valuable insight into physiological states such as inflammation and stress by detecting emitted radiation in the infrared spectrum, which is unseen in the visible spectra. Telemedicine applications could benefit from thermal imagery, but conventional computers are reliant on RGB cameras and lack thermal sensors. As a result, we propose the Visible-to-Thermal Facial GAN (VTF-GAN) that is specifically designed to generate high-resolution thermal faces by learning both the spatial and frequency domains of facial regions, across spectra. We compare VTF-GAN against several popular GAN baselines and the first conditional Denoising Diffusion Probabilistic Model (DDPM) for VT face translation (VTF-Diff). Results show that VTF-GAN achieves high quality, crisp, and perceptually realistic thermal faces using a combined set of patch, temperature, perceptual, and Fourier Transform losses, compared to all baselines including diffusion.

A Coreset Learning Reality Check

Jan 15, 2023

Fred Lu, Edward Raff, James Holt

Abstract: Subsampling algorithms are a natural approach to reduce data size before fitting models on massive datasets. In recent years, several works have proposed methods for subsampling rows from a data matrix while maintaining relevant information for classification. While these works are supported by theory and limited experiments, to date there has not been a comprehensive evaluation of these methods. In our work, we directly compare multiple methods for logistic regression drawn from the coreset and optimal subsampling literature and discover inconsistencies in their effectiveness. In many cases, methods do not outperform simple uniform subsampling.

* To appear in the Thirty-Seventh AAAI Conference on Artificial Intelligence
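
The uniform subsampling baseline named in the abstract is simple enough to sketch in a few lines. The version below is an assumption about what such a baseline typically looks like, not the paper's experimental code: it picks `m` row indices uniformly without replacement and gives each kept row the weight `n_rows / m`, so that weighted sums over the sample estimate sums over the full data. The function name, the sizes, and the seed are hypothetical.

```python
import random

def uniform_subsample(n_rows, m, seed=0):
    """Pick m of n_rows row indices uniformly without replacement.

    Every kept row gets weight n_rows / m, so a weighted sum over the
    sample is an unbiased estimate of the same sum over all rows.
    """
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(n_rows), m))
    weight = n_rows / m
    return idx, [weight] * m

# Hypothetical sizes: keep 1,000 of 1,000,000 rows.
idx, weights = uniform_subsample(n_rows=1_000_000, m=1_000)
print(len(idx), weights[0])
```

Coreset and optimal-subsampling methods replace the uniform draw with data-dependent sampling probabilities and weights; the abstract reports that this extra machinery often fails to beat the uniform version.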

Efficient Malware Analysis Using Metric Embeddings

Dec 05, 2022

Ethan M. Rudd, David Krisiloff, Scott Coull, Daniel Olszewski, Edward Raff, James Holt

Abstract: In this paper, we explore the use of metric learning to embed Windows PE files in a low-dimensional vector space for downstream use in a variety of applications, including malware detection, family classification, and malware attribute tagging. Specifically, we enrich labeling on malicious and benign PE files using computationally expensive, disassembly-based malicious capabilities. Using these capabilities, we derive several different types of metric embeddings utilizing an embedding neural network trained via contrastive loss, Spearman rank correlation, and combinations thereof. We then examine performance on a variety of transfer tasks performed on the EMBER and SOREL datasets, demonstrating that for several tasks, low-dimensional, computationally efficient metric embeddings maintain performance with little decay, which offers the potential to quickly retrain for a variety of transfer tasks at significantly reduced storage overhead. We conclude with an examination of practical considerations for the use of our proposed embedding approach, such as robustness to adversarial evasion and introduction of task-specific auxiliary objectives to improve performance on mission critical tasks.

* Pre-print of a manuscript submitted to the ACM Digital Threats: Research and Practice (DTRAP) Special Issue on Applied Machine Learning for Information Security. 19 Pages

Lempel-Ziv Networks

Nov 23, 2022

Rebecca Saul, Mohammad Mahmudul Alam, John Hurwitz, Edward Raff, Tim Oates, James Holt

Abstract: Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally expensive when applied to very long sequences. Compression-based methods have demonstrated more robustness when processing such sequences -- in particular, an approach pairing the Lempel-Ziv Jaccard Distance (LZJD) with the k-Nearest Neighbor algorithm has shown promise on long sequence problems (up to $T=200,000,000$ steps) involving malware classification. Unfortunately, use of LZJD is limited to discrete domains. To extend the benefits of LZJD to a continuous domain, we investigate the effectiveness of a deep-learning analog of the algorithm, the Lempel-Ziv Network. While we achieve successful proof of concept, we are unable to improve meaningfully on the performance of a standard LSTM across a variety of datasets and sequence processing tasks. In addition to presenting this negative result, our work highlights the problem of sub-par baseline tuning in newer research areas.

* I Can't Believe It's Not Better Workshop at NeurIPS 2022
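
For readers unfamiliar with the discrete LZJD that the abstract starts from, the idea can be sketched in plain Python: build a Lempel-Ziv style dictionary of previously unseen substrings for each byte sequence, then compare the two dictionaries with the Jaccard distance. This is a simplified illustration that computes the exact Jaccard distance over full sets, with no hashing or sketching, and it is not the Lempel-Ziv Network from the paper. The function names and example inputs are made up.

```python
def lz_set(data: bytes) -> set:
    """Build a Lempel-Ziv style dictionary: scan left to right, growing the
    current substring until it is one not seen before, then store it and
    start a new substring at the next byte."""
    seen = set()
    start = 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in seen:
            seen.add(chunk)
            start = end
    return seen

def lzjd(a: bytes, b: bytes) -> float:
    """Jaccard distance between the two Lempel-Ziv sets (0 = identical sets)."""
    sa, sb = lz_set(a), lz_set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

# Hypothetical inputs: two related byte strings and one unrelated one.
x = b"push ebp; mov ebp, esp; sub esp, 16; " * 4
y = b"push ebp; mov ebp, esp; sub esp, 32; " * 4
z = bytes(range(200, 256)) * 3

print(lzjd(x, x), lzjd(x, y), lzjd(x, z))
```

The dictionary construction depends on exact equality of substrings, which is why the abstract notes that LZJD is limited to discrete domains.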
