Title: KiloGrams: Very Large N-Grams for Malware Classification

Aug 01, 2019

Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles Nicholas, Mark McLean

Abstract: N-grams have been a common tool for information retrieval and machine learning applications for decades. In nearly all previous works, only a few values of $n$ are tested, with $n > 6$ being exceedingly rare. Larger values of $n$ are not tested due to the computational burden or the fear of overfitting. In this work, we present a method to find the top-$k$ most frequent $n$-grams that is 60$\times$ faster for small $n$, and can tackle large $n \geq 1024$. Despite the unprecedented size of $n$ considered, we show that these features still have predictive ability for malware classification tasks. More importantly, large $n$-grams provide benefits in producing features that are interpretable by malware analysts, and can be used to create general-purpose signatures compatible with industry-standard tools like Yara. Furthermore, the counts of common $n$-grams in a file may be added as features to publicly available human-engineered features that rival the efficacy of professionally developed features when used to train gradient-boosted decision tree models on the EMBER dataset.

* Appearing in LEMINCS @ KDD'19, August 5th, 2019, Anchorage, Alaska, United States
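
To make the task in the abstract concrete, here is a minimal Python sketch of what "finding the top-$k$ most frequent $n$-grams" over byte strings means, and of how per-file counts of those $n$-grams could serve as features. This is a naive exact baseline for illustration only, not the faster method the paper presents; it keeps every distinct $n$-gram in memory, which is the cost that becomes prohibitive for large corpora and large $n$. The function names `top_k_ngrams` and `ngram_count_features` are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def top_k_ngrams(files: Iterable[bytes], n: int, k: int) -> List[Tuple[bytes, int]]:
    """Naive exact top-k byte n-gram counting (illustrative baseline only)."""
    counts: Counter = Counter()
    for data in files:
        # Slide a window of n bytes over the file, one byte at a time.
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    # Return the k most frequent n-grams with their total counts.
    return counts.most_common(k)


def ngram_count_features(data: bytes, vocab: Sequence[bytes]) -> List[int]:
    """Count how often each n-gram in vocab occurs in one file.

    Assumes every entry of vocab has the same length n.
    """
    if not vocab:
        return []
    n = len(vocab[0])
    window_counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return [window_counts[gram] for gram in vocab]


if __name__ == "__main__":
    corpus = [b"abcabc", b"abcd"]
    top = top_k_ngrams(corpus, n=3, k=1)
    vocab = [gram for gram, _ in top]
    print(top)
    print(ngram_count_features(corpus[0], vocab))
```

The resulting count vectors could then be appended to an existing feature vector before training a classifier, in the spirit of the EMBER experiment the abstract describes.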
