A Review on Phishing Website Detection Using Machine Learning
JOURNAL OF CRITICAL REVIEWS, ISSN 2394-5125, VOL 7, ISSUE 19, 2020
Sudha M1, Jaanavi R V2, Blessy Ida Gladys S3, Priyadharshini4
1,2,3,4 School of Information Technology & Engineering, Vellore Institute of Technology - Vellore Campus, India.
E mail: [email protected]
Received: May 2020. Revised and Accepted: August 2020.
ABSTRACT: Fraudulent communication on the Internet is an ever-growing problem in the cyber world. This article reviews the negative impacts of fraudulent sites, referred to as spoofed websites or phishing websites. These spoofed sites attempt to steal an individual's essential credentials by means of false websites that look the same as the original website. A legitimate Internet user may be led to a spoofed site by mistyping a web address. Likewise, an individual who reaches a site directly through the browser cache, instead of typing the site address, may end up logging in to a spoofed site. This is a severe issue, as it leads to financial losses for both industries and individuals. This article therefore investigates the applicability of widely adopted machine learning models for predicting spoofed websites. The proposed algorithm is used to identify and characterize the rules and factors required to classify spoofed websites. These classification techniques are further used to identify the relationships between rules and factors and to correlate them with one another, so as to assess performance, accuracy, the number of rules generated, and speed. A divide-and-conquer approach is applied in this assessment to detect spoofed websites. The learning models are trained to match a suspicious website against the corresponding legitimate website using a set of features; if the similarity is higher than a predefined threshold value, the site is declared a spoofed website.
The assessment conducted on phishing website detection using machine learning found Random Forest to be suitable for detecting spoofed websites, helping to avoid financial loss and mental stress, with an overall prediction accuracy of 92.6%.
KEYWORDS: Machine learning, random forest, phishing-website detection, prediction accuracy
I. INTRODUCTION
The process of acquiring sensitive information by convincing users to reveal personal details such as usernames, passwords, and credit card credentials, by masquerading as a trustworthy source in an electronic communication, is known as spoofing, or phishing. It is a criminal offense that uses both social engineering and technical tricks to steal a user's personal identity or financial account information, and it is an automated form of identity theft. A spoofed URL is created with a malicious purpose: to download malware, to carry out spoofing attacks, or to manipulate search engine results. Botnets are the main building blocks used to host spoofed sites or send spoofed emails. The Internet is becoming a common place for information retrieval, as information is easily accessible and available to all Internet users. Because these disastrous spoofing attacks threaten electronic commerce, users have lost trust in the Internet. Spoofing is a rapidly growing form of identity theft scam and has become a cause of both short-term and long-term economic damage. For these reasons, designing and implementing effective strategies for identifying spoofed sites, to withstand cybercrime and ensure cyber security, has become a major need. Spoofing makes use of spoofed messages that are made to look valid and purport to come from legitimate sources such as financial institutions and e-commerce sites, in order to lure users to fake websites through links provided in the spoofed email.
The performance of a supervised learning technique depends on parameters such as the size of the training data, the feature set, and the type of technique used. Its limitation is that it fails to detect attacks in which the attacker uses a compromised domain to host the site. The supervised learning techniques used include Decision Tree (DT), Artificial Neural Network (ANN), Naive Bayes (NB), and Random Forest (RF). According to the literature survey, the tree-based techniques DT and RF perform best among these as the dataset grows. A great deal of research has been done on improving the accuracy of spoofed website detection using different supervised learning techniques. Classification techniques have proven to be significant tools for modeling various scientific problems, such as weather prediction (Sudha, M. and B. Valarmathi, 2014, 2015 & 2016) and disease diagnosis (Sudha, M, 2017); consequently, this investigation applies these classification models to phishing website detection to address the existing cyber-security issue.
II. LITERATURE REVIEW
(Daisuke Miyamoto A. et al., 2008) used nine machine learning techniques: AdaBoost, Bagging, Support Vector Machines, Classification and Regression Trees, Logistic Regression, Random Forests, Neural Networks, Naive Bayes, and Bayesian Additive Regression Trees. The highest F1 measure was 0.8581, the lowest error rate was 14.15%, and the highest AUC was 0.9342, all obtained with AdaBoost. (Joby James A. et al., 2013) analyzed the prepared URL feature dataset using the Naïve Bayes, J48 Decision Tree, k-NN, and SVM classification algorithms. In MATLAB, a Regression Tree achieved 91.08% detection accuracy when 60% of the dataset was used for testing and 85.63% detection accuracy when 90% of the data was used for testing.
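The F1 measure, error rate, and AUC quoted for the studies above can be made concrete with a small worked example. The following is a minimal sketch in plain Python, not the evaluation code of any cited paper; the labels and scores are hypothetical toy values (1 = spoofed, 0 = legitimate), and the 0.5 decision threshold is an assumption chosen for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = spoofed."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_measure(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def error_rate(y_true, y_pred):
    """Fraction of samples that are misclassified (1 - accuracy)."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return (fp + fn) / len(y_true)


def auc(y_true, scores):
    """AUC as the probability that a random spoofed sample receives a higher
    score than a random legitimate one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and classifier scores for eight sites.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("F1 measure:", f1_measure(y_true, y_pred))
print("Error rate:", error_rate(y_true, y_pred))
print("AUC:", auc(y_true, scores))
```

F1 and error rate depend on the chosen threshold, whereas AUC is computed from the raw scores and summarizes ranking quality across all thresholds, which is one reason studies tend to report them together.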
(Kang Leng Chiew et al., 2013) proposed a new feature selection framework called HEFS, in which existing filter measures are leveraged to find an effective subset of features for machine learning-based detection of spoofed websites. Their results show that the baseline features perform best when combined with the Random Forest supervised learning technique, achieving an accuracy of 94.6% using 20.8% of the original number of features. (Mahmoud Khonji et al., 2013) applied machine learning-based detection techniques and achieved high classification accuracy when analyzing data parts similar to those used by rule-based heuristic techniques. The Anti-Phishing Phil training material reduced the FN rate from 46% to 29%. (H. B. Kazemian A. et al., 2015) used Naive Bayes, SVM, K-Means, and affinity propagation. The machine learning techniques showed encouraging performance, with accuracies above 89%. (Singh et al., 2015) combined Adaline and Support Vector Machine into an algorithm for the classification of websites and achieved an accuracy of 99.1%. (Feroz, M. N., & Mengel, S, 2015) describe an approach that classifies URLs automatically based on their lexical and host-based features. (Chiew, K. L. et al., 2015) proposed a method that extracts the logo using a machine learning technique and then verifies identity by using Google image search to retrieve the identity the logo portrays. (Sananse, B. E., & Sarode, T. K, 2015) applied web mining heuristics with the Random Forest algorithm and achieved a precision of more than 90%. (Routhu Srinivasa Rao and Syed Taqi Ali, 2015) used the Phish-Shield technique. Phish-Shield can detect zero-hour spoofing attacks that blacklists are unable to detect, and it is faster than the visual-based assessment techniques used to detect spoofing.
The accuracy obtained for Phish-Shield is 96.57%, and it covers a wide range of spoofed websites, resulting in lower false negative and false positive rates. (Nguyen, H. H., & Nguyen, D. T, 2016) trained on the dataset using Support Vector Machine, Naive Bayes, J48 decision tree, Random Forest, and neural networks. (Hodzic, A. et al., 2016) discussed the Random Forest (RF), C4.5, REP Tree, Decision Stump, Hoeffding Tree, Rotation Forest, and MLP algorithms and showed that RF with REP Tree gave the best performance. (Zhongyi Hu, 2016) investigated nine well-studied machine learning algorithms for identifying malware and spoofed websites based on popularity and performance data: five single models (SVM, ANN, KNN, C4.5, and Naïve Bayes) and four ensemble models (RF, GBM, AdaBoost, and bagging). The experimental results show that the ensemble models significantly outperform the single models. (Anand Desai A. et al., 2017) used the K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest algorithms. They considered not only accuracy but also other metrics used to select the best supervised learning technique. According to these metrics, the RF algorithm gave the best scores. (Altyeb Altaher, 2017) proposes a hybrid approach for classifying websites as spoofed, legitimate, or suspicious; the approach combines the K-nearest neighbors (KNN) algorithm with the Support Vector Machine (SVM) algorithm in two stages. (Subasi, A. et al., 2017) used the Artificial Neural Network, K-nearest neighbor, Support Vector Machine, C4.5 decision tree, and Random Forest algorithms, and achieved the highest accuracy, 97.36%, with Random Forest. (Neda Abdelhamid, 2017) compared a large number of ML approaches (C4.5, One Rule, Conjunctive Rule, eDRI, RIDOR, Bayes Net, SVM, Boosting) with respect to different rules. Decision trees Bayes