Abstract: As over 200 million new malware samples are identified each year, antivirus systems must continuously adapt to the evolving threat landscape. However, retraining solely on new samples leads to catastrophic forgetting and exploitable blind spots, while retraining on the entire dataset incurs substantial computational cost. We propose FreeMOCA, a memory- and compute-efficient continual learning framework for malicious code analysis that preserves prior knowledge via adaptive layer-wise interpolation between consecutive task updates, leveraging the fact that warm-started task optima are connected by low-loss paths in parameter space. We evaluate FreeMOCA in both class-incremental (Class-IL) and domain-incremental (Domain-IL) settings on large-scale Windows (EMBER) and Android (AZ) malware benchmarks. In Class-IL, FreeMOCA outperforms 11 baselines on both EMBER and AZ. It also significantly reduces forgetting, achieving the best retention among the baselines and improving accuracy by up to 42% and 37% on EMBER and AZ, respectively. These results demonstrate that warm-started interpolation in parameter space provides a scalable and effective alternative to replay for continual malware detection. Code is available at: https://github.com/IQSeC-Lab/FreeMOCA.
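As a rough illustration of layer-wise interpolation between consecutive task updates, the sketch below blends a previous-task checkpoint with a newly trained one using one coefficient per layer. It is a minimal sketch, not FreeMOCA's implementation: the function name `interpolate_layerwise`, the dictionary-of-lists parameter format, and the fixed `alphas` values are all assumptions made for the example, and the abstract does not say how the adaptive coefficients are actually chosen.

```python
from typing import Dict, List

# Hypothetical parameter format: layer name -> flat list of weights.
Params = Dict[str, List[float]]


def interpolate_layerwise(prev: Params, new: Params,
                          alphas: Dict[str, float]) -> Params:
    """Blend two checkpoints layer by layer.

    For each layer: merged = (1 - alpha) * prev + alpha * new.
    alpha = 0 keeps the previous-task weights; alpha = 1 keeps the new ones.
    """
    merged: Params = {}
    for name, old_w in prev.items():
        alpha = alphas[name]
        if not 0.0 <= alpha <= 1.0:
            raise ValueError(f"alpha for {name!r} must lie in [0, 1]")
        new_w = new[name]
        if len(old_w) != len(new_w):
            raise ValueError(f"shape mismatch in layer {name!r}")
        merged[name] = [(1.0 - alpha) * o + alpha * n
                        for o, n in zip(old_w, new_w)]
    return merged


# Toy checkpoints; the new-task model is assumed to be warm-started
# from the previous-task weights.
prev_task = {"layer1": [0.0, 2.0], "head": [1.0, 1.0]}
new_task = {"layer1": [4.0, 6.0], "head": [3.0, 5.0]}

# Assumed coefficients: keep the early layer close to the old weights,
# let the classification head follow the new task fully.
alphas = {"layer1": 0.25, "head": 1.0}

merged = interpolate_layerwise(prev_task, new_task, alphas)
print(merged)
```

Because only two checkpoints and a handful of scalars are needed, a scheme of this kind stores no past samples, which is consistent with the abstract's framing of interpolation as an alternative to replay.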




Abstract: Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model's performance on old data degrades over time. Generative Replay (GR)-based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement novel selection schemes for replay samples based on the model's hidden representations. Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario, where new classes are introduced continuously over multiple tasks, demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, significantly outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems. The implementation is available at https://github.com/MalwareReplayGAN/MalCL (the code will be made public upon the presentation of the paper).
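For readers unfamiliar with feature matching, the sketch below shows the standard form of the loss: the generator is trained to match the mean of an intermediate discriminator feature over a real batch, rather than to fool the discriminator's final output directly. This is a minimal sketch under assumptions: the helper names and the toy feature values are hypothetical, and the abstract does not specify which layer's features or which exact variant of the loss the system uses.

```python
from typing import List

Batch = List[List[float]]  # one feature vector per sample


def mean_features(batch: Batch) -> List[float]:
    """Per-dimension mean of a batch of feature vectors."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]


def feature_matching_loss(real_feats: Batch, fake_feats: Batch) -> float:
    """Squared L2 distance between the mean feature vectors of the
    real batch and the generated batch."""
    mu_real = mean_features(real_feats)
    mu_fake = mean_features(fake_feats)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))


# Toy hidden-layer activations (assumed values, two samples per batch).
real = [[1.0, 2.0], [3.0, 4.0]]   # mean: [2.0, 3.0]
fake = [[0.0, 3.0], [2.0, 3.0]]   # mean: [1.0, 3.0]

loss = feature_matching_loss(real, fake)
print(loss)
```

In a full training loop the feature vectors would come from a hidden layer of the discriminator, and the loss would be minimized with respect to the generator's parameters only.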




Abstract: Recent advances in learning Deep Neural Network (DNN) architectures have received a great deal of attention due to their ability to outperform state-of-the-art classifiers across a wide range of applications, with little or no feature engineering. In this paper, we broadly study the applicability of deep learning to website fingerprinting. We show that unsupervised DNNs can be used to extract low-dimensional feature vectors that improve the performance of state-of-the-art website fingerprinting attacks. When used as classifiers, DNNs can match or exceed the performance of existing attacks across a range of application scenarios, including fingerprinting Tor website traces, fingerprinting search engine queries over Tor, defeating fingerprinting defenses, and fingerprinting TLS-encrypted websites. Finally, we show that DNNs can be used to predict the fingerprintability of a website based on its contents, achieving 99% accuracy on a data set of 4500 website downloads.