Phishing has grown significantly in the past few years and is predicted to further increase in the future. The dynamics of phishing introduce challenges in implementing a robust phishing detection system and selecting features which can represent phishing despite the change of attack. In this paper, we propose PhishZip which is a novel phishing detection approach using a compression algorithm to perform website classification and demonstrate a systematic way to construct the word dictionaries for the compression models using word occurrence likelihood analysis. PhishZip outperforms the use of best-performing HTML-based features in past studies, with a true positive rate of 80.04%. We also propose the use of compression ratio as a novel machine learning feature which significantly improves machine learning based phishing detection over previous studies. Using compression ratios as additional features, the true positive rate significantly improves by 30.3% (from 51.47% to 81.77%), while the accuracy increases by 11.84% (from 71.20% to 83.04%).
Reading Comprehension has received significant attention in recent years as high quality Question Answering (QA) datasets have become available. Despite state-of-the-art methods achieving strong overall accuracy, Multi-Hop (MH) reasoning remains particularly challenging. To address MH-QA specifically, we propose a Deep Reinforcement Learning based method capable of learning sequential reasoning across large collections of documents so as to pass a query-aware, fixed-size context subset to existing models for answer extraction. Our method is comprised of two stages: a linker, which decomposes the provided support documents into a graph of sentences, and an extractor, which learns where to look based on the current question and already-visited sentences. The result of the linker is a novel graph structure at the sentence level that preserves logical flow while still allowing rapid movement between documents. Importantly, we demonstrate that the sparsity of the resultant graph is invariant to context size. This translates to fewer decisions required from the Deep-RL trained extractor, allowing the system to scale effectively to large collections of documents. The importance of sequential decision making in the document traversal step is demonstrated by comparison to standard IE methods, and we additionally introduce a BM25-based IR baseline that retrieves documents relevant to the query only. We examine the integration of our method with existing models on the recently proposed QAngaroo benchmark and achieve consistent increases in accuracy across the board, as well as a 2-3x reduction in training time.
We present a technique for developing a network of re-used features, where the topology is formed using a coarse learning method, that allows gradient-descent fine tuning, known as an Abstract Deep Network (ADN). New features are built based on observed co-occurrences, and the network is maintained using a selection process related to evolutionary algorithms. This allows coarse ex- ploration of the problem space, effective for irregular domains, while gradient descent allows pre- cise solutions. Accuracy on standard UCI and Protein-Structure Prediction problems is comparable with benchmark SVM and optimized GBML approaches, and shows scalability for addressing large problems. The discrete implementation is symbolic, allowing interpretability, while the continuous method using fine-tuning shows improved accuracy. The binary multiplexer problem is explored, as an irregular domain that does not support gradient descent learning, showing solution to the bench- mark 135-bit problem. A convolutional implementation is demonstrated on image classification, showing an error-rate of 0.79% on the MNIST problem, without a pre-defined topology. The ADN system provides a method for developing a very sparse, deep feature topology, based on observed relationships between features, that is able to find solutions in irregular domains, and initialize a network prior to gradient descent learning.