Abstract:The way we communicate and work has changed significantly with the rise of the Internet. While it has opened up new opportunities, it has also brought about an increase in cyber threats. One common and serious threat is phishing, where cybercriminals employ deceptive methods to steal sensitive information.This study addresses the pressing issue of phishing by introducing an advanced detection model that meticulously focuses on HTML content. Our proposed approach integrates a specialized Multi-Layer Perceptron (MLP) model for structured tabular data and two pretrained Natural Language Processing (NLP) models for analyzing textual features such as page titles and content. The embeddings from these models are harmoniously combined through a novel fusion process. The resulting fused embeddings are then input into a linear classifier. Recognizing the scarcity of recent datasets for comprehensive phishing research, our contribution extends to the creation of an up-to-date dataset, which we openly share with the community. The dataset is meticulously curated to reflect real-life phishing conditions, ensuring relevance and applicability. The research findings highlight the effectiveness of the proposed approach, with the CANINE demonstrating superior performance in analyzing page titles and the RoBERTa excelling in evaluating page content. The fusion of two NLP and one MLP model,termed MultiText-LP, achieves impressive results, yielding a 96.80 F1 score and a 97.18 accuracy score on our research dataset. Furthermore, our approach outperforms existing methods on the CatchPhish HTML dataset, showcasing its efficacies.
Abstract:Rising cyber threats, with miscreants registering thousands of new domains daily for Internet-scale attacks like spam, phishing, and drive-by downloads, emphasize the need for innovative detection methods. This paper introduces a cutting-edge approach for identifying suspicious domains at the onset of the registration process. The accompanying data pipeline generates crucial features by comparing new domains to registered domains,emphasizing the crucial similarity score. Leveraging a novel combination of Natural Language Processing (NLP) techniques, including a pretrained Canine model, and Multilayer Perceptron (MLP) models, our system analyzes semantic and numerical attributes, providing a robust solution for early threat detection. This integrated approach significantly reduces the window of vulnerability, fortifying defenses against potential threats. The findings demonstrate the effectiveness of the integrated approach and contribute to the ongoing efforts in developing proactive strategies to mitigate the risks associated with illicit online activities through the early identification of suspicious domain registrations.