JOURNAL OF CRITICAL REVIEWS

ISSN- 2394-5125 VOL 7, ISSUE 19, 2020

A REVIEW ON DETECTION USING MACHINE LEARNING

Sudha M1, Jaanavi R V2, Blessy Ida Gladys S3, Priyadharshini4

1,2,3,4 School of Information Technology & Engineering, Vellore Institute of Technology - Vellore Campus, India.

E mail: [email protected]

Received: May 2020 Revised and Accepted: August 2020

ABSTRACT: Fraudulent communication is an ever-growing issue in the cyber world. This article reviews the negative impacts of fraudulent sites, referred to as spoofed or phishing websites. These spoofed sites attempt to steal an individual's essential credentials by means of false websites that appear identical to the original website. A legitimate user may be led to a spoofed site by mistyping the web address. Likewise, a user who reaches a site directly through the browser cache instead of typing the address may end up logging in to such a spoofed site. The issue is severe, as it leads to financial losses for both industries and individuals. This article therefore investigates the applicability of widely adopted machine learning models for predicting spoofed websites. The proposed algorithm identifies and characterizes the rules and factors required to classify spoofed websites. The classification techniques are then used to relate these rules and factors to one another in order to assess performance, accuracy, number of rules generated and speed. A divide-and-conquer approach is applied in this assessment to detect spoofed websites. The learning models are trained to match a suspicious website against the corresponding legitimate website using a set of features; if the similarity is higher than a predefined threshold value, the site is declared a spoofed website. The assessment of phishing website detection using machine learning showed Random Forest to be suitable for detecting spoofed websites, helping to avoid financial loss and mental stress, with an overall prediction accuracy of 92.6%.

KEYWORDS: Machine learning, random forest, phishing-website detection, prediction accuracy

I. INTRODUCTION

The process of acquiring sensitive information such as usernames, passwords and other credentials by masquerading as a trustworthy source in an electronic communication is known as spoofing. It is a criminal offense that uses both social engineering and technical tricks to steal a user's personal identity or financial account information. A spoofed URL is created with a malicious purpose: to download malware, to perform spoofed attacks or to manipulate search engine results. Botnets are the main building blocks used to host spoofed sites or send spoofed messages. The Internet has become a common place for information retrieval, as information is easily accessible and available to all Internet users. Because these damaging spoofed attacks threaten electronic commerce, users have lost trust in the Internet.

Spoofing is a rapidly growing form of identity-theft scam and has become a cause of both short-term and long-term economic damage. For these reasons, designing and implementing effective strategies to identify spoofed sites and ensure cyber security has become a major need. Spoofing makes use of spoof messages that are made to look valid and appear to originate from legitimate sources such as financial institutions and e-commerce sites, in order to lure users to fake websites through links given in the spoofed message.

The performance of a supervised learning technique relies on parameters such as the size of the training data, the feature set and the type of classifier. Its limitation is that it fails to detect cases where the attacker uses a compromised domain to host the site. The supervised learning techniques in use include Decision Tree (DT), Artificial Neural Network (ANN), Naive Bayes (NB) and Random Forest (RF). According to the literature survey, the tree-based techniques DT and RF perform best as the dataset grows. Much research has been done on improving the accuracy of spoofed website detection using different supervised learning techniques. Classification techniques have proven to be significant tools in modeling various scientific problems such as weather prediction (Sudha, M. and B. Valarmathi, 2014, 2015 & 2016) and disease diagnosis (Sudha, M, 2017); consequently, this investigation applies these classification models to phishing website detection to address the existing cyber-security issue.

II. LITERATURE REVIEW

(Daisuke Miyamoto et al., 2008) utilized nine machine learning techniques: AdaBoost, Bagging, Support Vector Machines, Classification and Regression Trees, Logistic Regression, Random Forests, Neural Networks, Naive Bayes and Bayesian Additive Regression Trees. The highest F1 measure (0.8581), the lowest error rate (14.15%) and the highest AUC (0.9342) were all observed with AdaBoost. (Joby James et al., 2013) analyzed the prepared URL feature dataset using the Naïve Bayes, J48 Decision Tree, k-NN and SVM classification algorithms. In MATLAB, a Regression Tree achieved 91.08% detection accuracy when 60% of the dataset was used for testing and 85.63% when 90% of the data was used for testing.

(Kang Leng Chiew et al., 2013) proposed a new feature selection framework called HEFS, in which existing filter measures are leveraged to find an effective subset of features for machine-learning-based spoofed-site detection. Their results indicate that the baseline features perform best when integrated with a Random Forest classifier, achieving an accuracy of 94.6% using 20.8% of the original number of features. (Mahmoud Khonji et al., 2013) applied machine-learning-based detection techniques and achieved high classification accuracy when analyzing data parts similar to those used by rule-based heuristic techniques. Anti-Phish Phil training material reduced the FN rate from 46% to 29%. (H. B. Kazemian et al., 2015) used Naive Bayes, SVM, K-Means and affinity propagation. The machine learning techniques showed encouraging performance, with accuracies above 89%.

(Singh et al., 2015) combined Adaline and Support Vector Machine, proposed an algorithm for the classification of websites and achieved an accuracy of 99.1%. (Feroz, M. N., & Mengel, S, 2015) describe an approach that classifies URLs automatically based on their lexical and host-based features. (Chiew, K. L. et al., 2015) proposed a method that extracts the website logo using a machine learning technique and then verifies identity by using Google image search to retrieve the portrayed identity. (Sananse, B. E., & Sarode, T. K, 2015) applied web mining heuristics with the Random Forest algorithm and achieved a precision of more than 90%. (Routhu Srinivasa Rao and Syed Taqi Ali, 2015) used the Phish-Shield technique. Phish-Shield can detect zero-hour spoofed attacks that blacklists are unable to detect, and it is faster than the visual-based assessment techniques used in detecting spoofed sites. The accuracy obtained for Phish-Shield is 96.57%, and it covers a wide range of spoofed websites, resulting in lower false negative and false positive rates. (Nguyen, H. H., & Nguyen, D. T, 2016) trained on the dataset using Support Vector Machine, Naive Bayes, J48 decision tree, Random Forest and neural networks.

(Hodzic, A. et al., 2016) discussed the Random Forest (RF), C4.5, REP Tree, Decision Stump, Hoeffding Tree, Rotation Forest and MLP algorithms and showed that RF with REP Tree gave the best performance. (Zhongyi Hu, 2016) investigated nine well-studied machine learning algorithms for identifying malware and spoofed websites based on popularity and performance data: five single models (SVM, ANN, KNN, C4.5 and Naïve Bayes) and four ensemble models (RF, GBM, AdaBoost and bagging). Experimental results show that the ensemble models significantly outperform the single models.

(Anand Desai et al., 2017) used the K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest algorithms. They considered not only accuracy but also other metrics for selecting the best classifier, and on these metrics the best scores were obtained with the RF algorithm. (Altyeb Altaher, 2017) proposed a hybrid approach for classifying websites as spoofed, legitimate or suspicious; the approach combines the K-nearest neighbors (KNN) algorithm with the Support Vector Machine (SVM) algorithm in two stages. (Subasi, A. et al., 2017) used the Artificial Neural Network, K-nearest neighbor, Support Vector Machine, C4.5 decision tree and Random Forest algorithms and achieved the highest accuracy, 97.36%, with Random Forest.

(Neda Abdelhamid, 2017) compared a large number of ML approaches (C4.5, One Rule, Conjunctive Rule, eDRI, RIDOR, Bayes Net, SVM, Boosting) with respect to different rules. Decision trees, Bayes Net and SVM achieved good detection rates. Models extracted from decision trees contained a very large amount of information, which may overwhelm novice users and security experts and thus be hard to manage or understand. (Hemali Sampat et al., 2018) applied URL and domain name features, which can be inspected using sets of rules to distinguish URLs of spoofed webpages from URLs of legitimate websites. Features of URLs and domain names are checked against several criteria, for example an IP address, a long URL, an added prefix or suffix, redirection using the symbol "//", and the presence of the symbol "@". (Vaibhav Patil et al., 2018) pointed to the efficiency that can be achieved with a hybrid solution that combines heuristic features, visual features and a blacklist and whitelist approach and feeds these features to machine learning algorithms.
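The rule-based URL checks listed above (IP address, long URL, prefix or suffix, "//" redirection, "@" symbol) can be illustrated with a minimal Python sketch. The function name, the length threshold of 54 characters and the use of "-" to mark a prefix or suffix are assumptions made for illustration; the reviewed work does not state these details here.

```python
import re
from urllib.parse import urlparse

# Hypothetical cut-off for a "long" URL; not taken from the reviewed papers.
LONG_URL_THRESHOLD = 54


def url_rule_features(url: str) -> dict:
    """Return boolean flags for the URL rules named in the text."""
    host = urlparse(url).hostname or ""
    # Everything after the scheme separator, or the whole URL if there is none.
    rest = url.split("://", 1)[-1]
    return {
        # Host written as a dotted IPv4 address instead of a domain name.
        "has_ip_address": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "is_long_url": len(url) > LONG_URL_THRESHOLD,
        # Assumption: a prefix or suffix is attached to the domain with "-".
        "has_prefix_suffix": "-" in host,
        # A "//" after the scheme separator may indicate a redirect.
        "has_double_slash_redirect": "//" in rest,
        "has_at_symbol": "@" in url,
    }
```

Each flag could then serve as one input feature for any of the classifiers discussed in this review.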

Jain, A. K., & Gupta, B (2018) compared the SVM and naïve Bayes algorithms and achieved higher accuracy with SVM. Later they discussed a novel machine-learning-based anti-spoofing approach, considering random forest, SVM, neural networks, logistic regression and naïve Bayes for the classification of websites. They reported that the random forest classifier predicts spoofed websites more accurately than the other learning algorithms. (Sneha Mande et al., 2018) focused on popular machine learning techniques such as the back-propagation neural network (BPNN), support vector machine (SVM), naïve Bayes classifier (NB), decision tree (C4.5), random forest (RF) and k-nearest neighbor (KNN). (Purvi Pujara et al., 2018) reviewed the KNN, SVM, Decision Tree, ANN, Naïve Bayes, PART, ELM and Random Forest classifiers. Among this wide range of machine learning models, DT and RF are best for spoofed-site detection.

(Ozgur Koray et al., 2019) used a combination of machine learning algorithms (Decision Tree, AdaBoost, K-star, KNN, RF and Naive Bayes) with different types of features, such as word vectors and hybrid features. A Random Forest algorithm with only NLP-based features gave the best performance, with a 97.98% accuracy rate for the detection of spoofed URLs. (Kahksha et al., 2018) applied the Random Forest, Decision Tree, Linear Model and Neural Network algorithms to a spoofed dataset. The results were evaluated on accuracy, true positive rate (TPR), true negative rate (TNR), precision, F-measure and false positive rate. The highest accuracy achieved in that work is 95.7%, with the Random Forest algorithm. Rao, R. S., & Pais, A. R (2019) used the following algorithms: J48 tree, Random Forest (RF), sequential minimal optimization (SMO), logistic regression (LR), multilayer perceptron (MLP), Bayesian network (BN), support vector machine (SVM) and AdaBoostM1 (AM1).
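The evaluation measures named above (accuracy, TPR, TNR, precision, F-measure and false positive rate) all follow from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts in the usage note are illustrative and are not results from any of the reviewed studies.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary metrics from confusion-matrix counts."""

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes to avoid division by zero.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    tpr = ratio(tp, tp + fn)  # true positive rate (recall)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "tpr": tpr,
        "tnr": ratio(tn, tn + fp),  # true negative rate
        "fpr": ratio(fp, fp + tn),  # false positive rate
        "precision": precision,
        "f_measure": ratio(2 * precision * tpr, precision + tpr),
    }
```

For example, `classification_metrics(tp=90, fp=10, tn=85, fn=15)` describes a hypothetical test set of 200 sites in which 175 are classified correctly.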

(R. Kiruthiga and D. Akila, 2019) checked the validity of URLs. Most of the work was done using familiar machine learning algorithms such as Naïve Bayes, SVM, Decision Tree and Random Forest. Random Forest achieved a high accuracy of 98.4%. (Arun Kulkarni et al., 2019) implemented four classifiers using MATLAB scripts: the decision tree, the Naïve Bayes classifier, the Support Vector Machine (SVM) and the neural network. The classifiers were used to detect spoofed URLs. The pruned decision tree provided the highest classification accuracy, 90.39%, while the neural network classifier yielded the lowest accuracy for this dataset.

Routhu Srinivasa Rao and Alwyn Roshan Pais (2019) identified spoofed sites from features extracted from the URL, the website content and third-party services, using machine learning algorithms. Their approach also detects spoofed sites that imitate legitimate sites by replacing the website content with an image, which most anti-spoofing techniques fail to detect, and it detects zero-day spoofed sites, which list-based techniques fail to detect. With this rich set of heuristic features, their model achieves a high detection rate of 99.55% and a low false positive rate of 0.45% using an oblique Random Forest algorithm. Yaokai Yang (2019) proposed an effective detection system that crawls websites and automatically discovers malicious pages. The system is intended for use by a blacklist provider, who can automatically compile and maintain an up-to-date blacklist of malicious URLs.

III. METHODOLOGY

The proposed approach focuses on improving the accuracy of spoofed website detection using different supervised learning techniques. The dataset was gathered from Kaggle and contains 11056 instances and 32 features. The dataset is first split based on entropy. Accuracy is observed on the refined dataset and then on the split dataset. For each leaf node, the best features are identified through correlation and the best-performing model. The model is tuned on the best features for each partition, and the accuracy with the best features and tuned hyperparameters is observed.
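As an illustration of identifying the best features through correlation, the sketch below ranks features by the absolute Pearson correlation between each feature column and the class label. This is only one plausible reading of the step; the function names and the choice of Pearson correlation are assumptions, since the exact correlation measure is not specified.

```python
from math import sqrt


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / sqrt(var_x * var_y)


def rank_features(columns: dict, labels: list) -> list:
    """Return feature names sorted by |correlation with the label|, best first."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Within each partition, the top-ranked features would then be the candidates for model tuning.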

Like heuristic tests, ML-based techniques can mitigate zero-hour spoofed attacks, which makes them advantageous compared with blacklists. ML techniques are also capable of constructing their own classification models by analyzing large sets of data, which alleviates the need to create heuristic tests manually. The model workflow involves three steps: (1) split the dataset based on entropy, (2) observe accuracy on the refined dataset and (3) observe accuracy on the split dataset.
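Step (1) of the workflow can be sketched as follows: pick the feature with the highest information gain and partition the records on its values. This is a minimal sketch under stated assumptions; the record layout (one dict per website) and the function names are hypothetical, and the study does not say which feature its split selected.

```python
from collections import Counter
from math import log2


def entropy(labels: list) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows: list, labels: list, feature: str) -> float:
    """Entropy reduction obtained by splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def split_on_best_feature(rows: list, labels: list):
    """Partition the data on the feature with the highest information gain."""
    best = max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[best], []).append((row, label))
    return best, partitions
```

Accuracy can then be observed separately on each returned partition, as in steps (2) and (3).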

Figure-1 Process flow schematic of phishing website detection

3.1 Machine learning models

Decision Tree (DT) is a supervised learning model based on a tree structure. It generates the possible outcomes of a trial and their consequences by evaluating the logical connections between the features in the dataset. The algorithm splits the data into distinct sub-trees using a split criterion such as entropy or information gain, producing logical subsets of the given sample. It can classify inputs with a small number of samples. ID3, CHAID and C4.5 are some of the widely used tree-based classifiers.
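For reference, the entropy and information gain criteria mentioned above are commonly written as follows. The notation is introduced here for illustration and does not appear in the original text: S is a set of samples, p_i is the proportion of S belonging to class i out of k classes, A is a candidate feature and S_v is the subset of S where A takes the value v.

```latex
H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
```

A split on A is preferred when IG(S, A) is large, that is, when the resulting subsets are purer than S.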

Random Forest (RF) is an ensemble learning method. It follows the divide-and-conquer technique to obtain better overall performance. RF has the potential to boost weak learners into a strong learner (Karthik, S and Sudha, M., 2018): distinct weak learners are combined to form a strong learner. It is widely applied to classification and regression problems. RF constructs a group of decision trees and trains each individual tree on the data. The result is computed by estimating the mean value returned by all the decision trees. The algorithm works efficiently with datasets comprising many predictive variables and handles a vast number of samples when training the model.
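The aggregation step described above can be sketched in a few lines. Averaging the tree outputs matches the description in the text; majority voting is shown as the usual counterpart for class labels and is an addition for illustration. The function names are hypothetical, and training of the individual trees is omitted.

```python
from collections import Counter
from statistics import mean


def aggregate_regression(tree_outputs: list) -> float:
    """Combine numeric predictions from all trees by taking their mean."""
    return mean(tree_outputs)


def aggregate_classification(tree_votes: list):
    """Combine class predictions from all trees by majority vote."""
    return Counter(tree_votes).most_common(1)[0][0]
```

For example, three trees voting "phishing", "legitimate" and "phishing" would yield "phishing" as the forest's prediction.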

Linear Regression (LR) is a supervised learning approach that models the data from its previous history by analyzing the relationship between the parameters and the data (Karthik, S and Sudha, M., 2018). From these observations, it predicts unseen and future cases. LR needs an independent variable and a dependent variable to perform regression. When more than one independent variable is involved, it is called multiple regression, which can be linear or nonlinear.
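A minimal sketch of fitting a regression line with one independent variable by ordinary least squares is given below. Turning the fitted value into a class label by thresholding is an assumption made here for illustration; the text does not state how the regression output was mapped to a spoofed or legitimate decision, and the function names and the 0.5 threshold are hypothetical.

```python
def fit_simple_linear_regression(xs: list, ys: list):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes xs is not constant
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_label(x: float, slope: float, intercept: float, threshold: float = 0.5) -> int:
    """Assumed decision rule: label 1 if the fitted value reaches the threshold."""
    return 1 if slope * x + intercept >= threshold else 0
```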

Support Vector Machine (SVM) is a classification algorithm that handles single-class and multi-class data. SVM separates the data by an optimal hyperplane, and the kernel, which acts as a similarity function, plays a crucial role (Karthik, S and Sudha, M., 2018). It takes the input and divides it according to the correspondence between samples. Linear, Gaussian and polynomial kernels are widely adopted. Kernel selection is a determining factor in model accuracy.
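The three kernel types named above can be written as plain similarity functions between two feature vectors. The sketch below uses the standard forms; the default parameter values (degree, coef0, gamma) are arbitrary illustrative choices, not settings reported in this study.

```python
from math import exp


def linear_kernel(x: list, y: list) -> float:
    """Dot product of the two vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x: list, y: list, degree: int = 2, coef0: float = 1.0) -> float:
    """(x . y + coef0) raised to the given degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x: list, y: list, gamma: float = 0.5) -> float:
    """exp(-gamma * squared Euclidean distance); equals 1 for identical vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)
```

Because each kernel measures similarity differently, the same data can be separable under one kernel and not under another, which is why the choice affects accuracy.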

IV. RESULTS AND DISCUSSION

The results obtained revealed that RF is faster than the decision tree, support vector machine and linear regression models. Random Forest also attained the highest overall prediction accuracy, 92.6%. Machine learning models are reliable and widely applied in scientific domains such as weather forecasting (Sudha, M, 2017a) and medical diagnostics (Sudha, M, 2017b). This study further shows that some machine learning models are appropriate for handling existing cyber-security attacks and threats.

Table 1. Prediction Models Accuracy (%)

Proportion of Tree Node | Decision Tree | Random Forest | Support Vector Machine | Linear Regression
43.015 | 94.7 | 93.2 | 92.6 | 94.7
57.05 | 84.8 | 92 | 84.1 | 54.02
Predictable Accuracy | 89.75 | 92.6 | 88.35 | 74.36
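The "Predictable Accuracy" row of Table 1 is numerically consistent with an unweighted mean of the two partition accuracies for each model. The short sketch below reproduces that arithmetic from the tabulated values; that the authors computed the row this way is an inference from the numbers, not something the text states, and the variable names are illustrative.

```python
# Per-partition accuracies (%) as reported in Table 1.
partition_accuracy = {
    "Decision Tree": (94.7, 84.8),
    "Random Forest": (93.2, 92.0),
    "Support Vector Machine": (92.6, 84.1),
    "Linear Regression": (94.7, 54.02),
}


def overall_accuracy(values: tuple) -> float:
    """Unweighted mean of the partition accuracies, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


overall = {model: overall_accuracy(acc) for model, acc in partition_accuracy.items()}
best_model = max(overall, key=overall.get)
```

Under this reading, Random Forest has the highest overall value because its accuracy is nearly the same on both partitions, whereas Linear Regression drops sharply on the second partition.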

[Bar chart of Overall Accuracy. Vertical axis: Prediction Accuracy (%). Horizontal axis: Machine Learning Techniques (Decision Tree 89.75, Random Forest 92.6, Support Vector Machine 88.35, Linear Regression 74.36).]

Figure-2 Visualization of Machine learning models performance

V. CONCLUSION

Phishing websites are a major problem in cyberspace and cause significant financial losses for industries and individuals. Attackers attempt to gain access to a user's credentials and other sensitive information through these spoofed websites. This manuscript has reviewed the existing literature and presented a methodology for addressing the spoofed-site issue, with experimental results and discussion. The experimental results revealed the suitability of the machine learning approach for handling this cyber-security issue. Based on the attained prediction performance, this research therefore concludes that the random forest approach is a suitable model for building phishing website detection. Application-specific neural network models can outperform the existing generic classifiers (Sudha, M, 2017b). Phishing is one of the chief cyber attacks; its detection therefore calls for more efficient models such as neural network learning, as ANN models have considerable potential to attain the expected optimal prediction accuracy.

VI. REFERENCES

[1]. Miyamoto, Daisuke, Hiroaki Hazeyama, and Younkin Kadobayashi. "An evaluation of machine learning-based methods for detection of spoofed sites." International Conference on Neural Information Processing. Springer, Berlin, Heidelberg, (2008).
[2]. James, Joby, L. Sandhya, and Ciza Thomas. "Detection of spoofed websites using Machine learning techniques." International Conference on Control Communication and Computing (ICCC), (2013).
[3]. Sudha, M. and B. Valarmathi, Impact of hybrid intelligent computing in identifying constructive weather parameters for modeling effective rainfall prediction, AGRIS on-line Papers in Economics and Informatics, Vol. 07, No. 04, (2015), pp. 151-160.
[4]. Sudha, M. and B. Valarmathi, Identification of effective features and classifiers for short term rainfall prediction using rough set based maximum frequency weighted feature reduction technique, Journal of Computing and Information Technology, 24:02, (2016), pp. 181-194.
[5]. Sudha, M. and B. Valarmathi, Exploration on Feature Selection based on Rough Set Approach, International Journal of Applied Engineering Research, Vol. 08, No. 13, (2013), pp. 1555-1568.
[6]. Sudha, M. and B. Valarmathi, Rainfall forecast analysis using rough set attribute reduction and data mining methods, Agris Online Papers in Economics and Informatics, Vol. 06, No. 04, (2014), pp. 145-154.
[7]. Chang, Ee Hung, Kang Leng Chiew, and Wei King Tiong. "Spoofed detection via identification of website identity." International Conference on IT Convergence and Security (ICITCS), pp. 1-4. IEEE, 2013.
[8]. Khonji, Mahmoud, Youssef Iraqi, and Andrew Jones. "Spoofed detection: a literature survey." IEEE Communications Surveys & Tutorials 15, no. 4 (2013): 2091-2121.
[9]. Kazemian, Hassan B., and Shafi Ahmed. "Comparisons of machine learning techniques for detecting malicious webpages." Expert Systems with Applications 42.3 (2015): 1166-1177.
[10]. Singh, P., Maravi, Y. P., & Sharma, S. (2015, February). Spoofed websites detection through supervised learning networks. In 2015 International Conference on Computing and Communications Technologies (ICCCT) (pp. 61-65). IEEE.
[11]. Feroz, M. N., & Mengel, S. (2015, June). Spoofed URL detection using URL ranking. In 2015 IEEE International Congress on Big Data (pp. 635-638). IEEE.
[12]. Chiew, K. L., Chang, E. H., & Tiong, W. K. Utilization of website logo for spoofed detection. Computers & Security, 54, (2015), pp. 16-26.
[13]. Sananse, B. E., & Sarode, T. K. (2015). Spoofed URL detection: a machine learning and web mining-based approach. International Journal of Computer Applications, 123(13).
[14]. Routhu Srinivasa Rao and Syed Taqi Ali. "A Desktop Application to Detect Spoofed Webpages through Heuristic Approach" (2015).
[15]. Nguyen, H. H., & Nguyen, D. T. (2016). Machine Learning based spoofed web sites detection. In AETA, Recent Advances in Electrical Engineering and Related Sciences (pp. 123-131). Springer, Cham.
[16]. Hodžić, A., Kevrić, J., & Karadag, A., Comparison of machine learning techniques in spoofed website classification. In International Conference on Economic and Social Studies, pp. 249-256, (2016).
[17]. Hu, Zhongyi, et al. "Identifying malicious web domains using machine learning techniques with online credibility and performance data." 2016 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2016.
[18]. Desai, Anand, et al. "Malicious web content detection using machine leaning." 2017 2nd IEEE International Conference on Recent Trends in Electronics, Information & Communication Technology (RTEICT). IEEE, (2017).
[19]. Altaher, Altyeb. "Spoofed websites classification using hybrid SVM and KNN approach." International Journal of Advanced Computer Science and Applications 8.6 (2017), 90-95.
[20]. Subasi, A., Molah, E., Almkallawi, F., & Chaudhery, T. J., Intelligent spoofed website detection using random forest supervised-learning techniques. In 2017 International Conference on Electrical and Computing Technologies and Applications (ICECTA) (2017).
[21]. Abdelhamid, Neda, Fadi Thabtah, and Hussein Abdel-jaber. "Spoofed detection: A recent intelligent machine learning comparison based on models' content and features." In 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), pp. 72-77. IEEE, 2017.
[22]. Hemali Sampat et al., "Detection of Spoofed Website Using Machine Learning" (2018).
[23]. Patil, Vaibhav, Pritesh Thakkar, Chirag Shah, Tushar Bhat, and S. P. Godse. "Detection and Prevention of Spoofed Websites Using Machine Learning Approach." In 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1-5. IEEE, (2018).
[24]. Jain, A. K., & Gupta, B. B., PHISH-SAFE: URL features-based spoofed detection system using machine learning. In Cyber Security (pp. 467-474). Springer, Singapore (2018).
[25]. Jain, A. K., & Gupta, B. B., Towards detection of spoofed websites on client-side using machine learning based approach. Telecommunication Systems, 68(4), 687-700, (2018).

[26]. Mande, Miss Sneha, and D. S. Thosar. "Detection of Spoofed Web Sites Based On Extreme Machine Learning." (2018).
[27]. Pujara, Purvi, and M. B. Chaudhari. "Spoofed Website Detection using Machine Learning: A Review." (2018).
[28]. Sahingoz, Ozgur Koray, et al. "Machine learning based spoofed detection from URLs." Expert Systems with Applications 117, pp. 345-357, (2019).
[29]. Naaz, Sameena. "Detection of Spoofed Websites Using Machine Learning Approach." (2019).
[30]. Rao, R. S. and Pais, A. R., Detection of spoofed websites using an efficient feature-based machine learning framework. Neural Computing and Applications, 31(8), pp. 3851-3873, (2019).
[31]. Kiruthiga, R., and D. Akila. "Spoofed Websites Detection Using Machine Learning." (2019).
[32]. Kulkarni, Arun, Spoofed Websites Detection using Machine Learning, Electronics and Informatics (ICOEI), IEEE Xplore (2019).
[33]. Rao, Routhu Srinivasa, and Alwyn Roshan Pais. Detection of spoofed websites using an efficient feature-based machine learning framework. Neural Computing & Applications 31(8), pp. 3851-3873, (2019).
[34]. Yaokai, Yang, "Effective Spoofed Detection Using Machine Learning Approach." PhD diss., Case Western Reserve University, (2019).
[35]. Karthik, S & Sudha, M. A Survey on Machine Learning Approaches in Gene Expression Classification in Modelling Computational Diagnostic System for Complex Diseases, International Journal of Engineering and Advanced Technology, 8:2, pp. 182-199, (2018).
[36]. Sudha, M. Intelligent decision support system based on rough set and fuzzy logic approach for efficacious precipitation forecast, Decision Science Letters, 06, pp. 96-105, (2017a).
[37]. Sudha, M. Evolutionary and Neural Computing Based Decision Support System for Disease Diagnosis from Clinical Data Sets in Medical Practice. Journal of Medical Systems, 41(11): p. 178, (2017b).
