A Heuristic Model to Detect Malicious URLs Using Case Based Reasoning

Journal of Information and Computational Science, ISSN: 1548-7741

Dr. Sarika S
Assistant Professor, Naipunnya Institute of Management and Information Technology, Koratty, Trissur

Abstract

Phishing is the fraudulent practice of deceiving users through social engineering in order to steal sensitive information such as bank account details, credit card numbers, and login credentials for email and social networking sites. It takes varied forms, including malicious URLs and links and email-based attacks. The most devious form of attack occurs when the user sees an innocuous page with the same look and feel as a genuine webpage but with a phishing URL. Malicious URLs host unsolicited content and lure users into becoming victims of identity theft and financial loss. Effective systems to detect such malicious URLs in a timely manner are necessary to counter a variety of web security threats. This paper presents a heuristic method to analyze the pattern of phishing URLs and check whether they are malicious or not. The method uses a multi agent system with case based reasoning to detect the attack. The experimental results show that the proposed method achieves a good true positive rate, with an accuracy of 98%.

Keywords: Antiphishing, multi agent system, URL analysis, case based reasoning

1. Introduction

With the advent of the internet for business, finance and personal investments, threats due to internet fraud are on the rise. Fraudsters conduct counterfeit transactions to deceive users and steal personal information from them. The most venomous form of internet fraud is phishing. Phishing is a form of online identity theft that aims to steal sensitive information such as usernames, passwords, and credit card details by masquerading as a trustworthy entity in an electronic communication.
Nowadays, identity theft is one of the most profitable crimes, and fraudsters are exploiting the best available resources to carry it out. A recent report highlighted an increase in the number of identity thefts, with phishing losses for the affected organizations estimated in the billions. A good example is the WannaCry ransomware attack [1], a worldwide cyber attack which infected more than 230,000 computers running the Microsoft Windows operating system across about 150 countries in May 2017. As phishing sites are a cornerstone of internet criminal activities, there has been broad interest in developing systems to prevent the end user from visiting such sites. However, the effectiveness of such systems is questionable, as phishers change their tactics and reappear with new threats.

Volume 9 Issue 11 - 2019 1066 www.joics.org

Most phishing methods use spoofed URLs, a common trait of phishing scams that poses a serious threat to end users and commercial institutions. URL obfuscation is a form of technical deception wherein victims are made to think that a link or web site displayed in their web browser or email client is that of a trusted site when it is not. Because users are increasingly aware of phishing and can detect fake emails and web sites, attackers design webpages with convincing content as bait to steal users' personal information. URL obfuscation allows a hostile web site to exploit vulnerabilities in web browsers that allow the download and execution of malicious code. These methods tend to be technically simple yet highly effective, and are still used to perform deception. Thus, identifying phishing URLs has become both a necessity and a challenge in the context of online security. The main contribution of this paper is a method to monitor and detect phishing sites that masquerade as benign ones, using a multi agent system with case based reasoning.
The agents in this system can learn and detect phishing attacks based on the selected features. The technique is adaptive and responds dynamically to new cases. The method runs with a small set of features and detects fraudulent URLs.

2. Literature Review

Most browsers now use URL verification to protect users from phishing attacks. This is because URL filtering is orders of magnitude faster than typical web page classification, as the entire webpage does not have to be fetched and analyzed. The approach is simple and effective, as it produces a low average error rate. Moreover, the average response time is short because only the URL of a webpage is considered. Some URL filtering techniques are described below.

SPS [2] is a simple filtering algorithm that protects clients from phishing attacks by removing the part of the malicious content that traps clients into entering personal information. This approach analyzes the behaviors of novice users and of phishers to formulate requirements for SPS. It uses a two-level filtering algorithm composed of Strict URL Filtering (SUF) and HTTP Response Sanitizing (HRS). The URL filtering component uses a rule set which analyzes the URLs in HTML documents and categorizes them as safe or suspicious. If a URL is suspicious, the HTTP response sanitizing component removes any input forms from the HTML documents and alerts users to the malicious parts through a sanitized web page. The method is convenient, as it does not prohibit the user from browsing any webpage; it only blocks the disclosure of personal information to unknown URLs. However, SPS is vulnerable to pharming attacks.

According to [3], phishing attacks can be identified through target domain identification. In this work, an algorithm is implemented to identify the genuine target that a phishing page masquerades as. The approach also groups the domains from hyperlinks having direct or indirect association with a suspected webpage.
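As a rough illustration of the domain-gathering step just mentioned, the sketch below collects the host names referenced by hyperlinks in a page. It is not the algorithm from [3]: it covers only directly linked domains, and the names `LinkCollector` and `linked_domains` are hypothetical helpers introduced here for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href targets of all <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def linked_domains(html, base_url):
    """Return the set of host names referenced by hyperlinks in `html`.

    Relative links are resolved against `base_url` (the suspected page),
    so they contribute the suspected page's own host name.
    """
    collector = LinkCollector()
    collector.feed(html)
    domains = set()
    for href in collector.hrefs:
        host = urlparse(urljoin(base_url, href)).hostname
        if host:  # skip links without a network location
            domains.add(host.lower())
    return domains
```

Calling `linked_domains(page_html, page_url)` on a suspected page yields the set of directly associated domains; indirectly associated domains would require fetching the linked pages and repeating the step, which this sketch leaves out.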
To obtain the target domain set, the domains gathered from the directly associated webpages are cross-checked with the domains gathered from the indirectly associated webpages. After the Target Identification (TID) algorithm is applied to this set, the matched domains are cancelled. The resulting set is checked against a DNS server to determine the legitimacy of a suspicious page.

PageSafe [4] is an antiphishing tool that prevents access to phishing sites through URL validation and also detects DNS poisoning attacks. PageSafe examines the anomalies in web pages and uses a machine learning approach for automatic classification. As the method does not store any secret information, it requires very little input from the user. PageSafe performs automatic classification with user assistance, so that the number of false positives is significantly reduced. The approach also maintains a whitelist, a list of domains mapped to their corresponding IP addresses. This list is consulted first when resolving the IP address of a URL, to protect the user from DNS poisoning attacks. The whitelist is encrypted with a master password. By using PageSafe, users are able to decide whether or not a web page is legitimate.

Ma et al. [5] described how to identify suspicious URLs. The method detects malicious websites using both lexical and host-based features of URLs in a balanced data set. They compared the accuracy of batch [6] and online [7] learning algorithms using these features. Six lexical and host-based features are taken from a URL, without including any content from the webpage.
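To make the idea of URL-only lexical features concrete, here is a minimal sketch. The six features below are illustrative assumptions chosen for this example, not the feature set used in [5], and `lexical_features` is a hypothetical helper. Host-based features (such as WHOIS or DNS properties) would need network lookups and are omitted.

```python
import ipaddress
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Compute a few illustrative lexical features from the URL string alone."""
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    # Alphanumeric tokens separated by any punctuation.
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", url) if t]

    return {
        "url_length": len(url),
        "host_length": len(host),
        "dots_in_host": host.count("."),
        "host_is_ip": host_is_ip,
        "path_depth": len([p for p in parsed.path.split("/") if p]),
        "token_count": len(tokens),
    }
```

The resulting dictionary can be turned into a numeric vector and passed to any classifier; because no page content is fetched, extraction cost depends only on the length of the URL.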
Ma et al. used various online algorithms, such as Perceptron, Logistic Regression with Stochastic Gradient Descent, Passive Aggressive (PA) and Confidence Weighted (CW), to compare with a batch processing algorithm, and showed that online learning algorithms work better than batch learning for detecting malicious websites. Among the classifiers, the Confidence Weighted (CW) algorithm offered the best accuracy, up to 99% over a balanced data set.

A hybrid model [8] for detecting phishing sites using K-Means Clustering [16] and a Naive Bayes Classifier [22] considers the URL features and HTML features of a site to label it as phishing or legitimate. K-Means Clustering is applied to the URL features of the web site by defining three clusters, and the feature set is assigned to one of the clusters in the database to check the validity of the site. If the result of K-Means Clustering is not enough to determine the site's legitimacy, the method further extracts HTML features of the webpage using its DOM representation. A combination of the URL and HTML features of the webpage forms the feature set, which is passed to a Naive Bayes Classifier to determine the probability that the website is phishing.

Huang et al. [9] demonstrated a method to find phishing URLs using a supervised learning approach with an SVM classifier. To model the SVM, they considered 23 features to construct the feature vector: 4 structural features, 9 lexical features and 10 brand name features. If the classification output is -1, the URL is labeled as phishing; if the output is 1, it is labeled as genuine.

A semisupervised approach for identifying phishing URLs in a realistic scenario has been proposed by Gyawali et al. [10]. This method focuses on reducing the cost of training a supervised algorithm by relying on less manually labeled data.
To reduce the amount of manually labeled data, a semisupervised approach is applied, and the learning algorithm is trained on a collection of manually labeled and pseudo-labeled data. The approach could detect more phishing URLs than a fully supervised approach, using only 10% manually labeled data.

To show the usefulness of the URL alone in performing web page classification, Kan and Thi proposed a technique for webpage classification using the URL [11]. The approach segments the URL into meaningful tokens.
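A simple baseline for the URL segmentation described above is sketched below. It is an assumption made for illustration and not the segmentation method of [11]: it splits on punctuation and then on case and letter/digit boundaries, and `url_tokens` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL into lowercase word-like tokens (scheme excluded)."""
    parsed = urlparse(url)
    text = " ".join([parsed.netloc, parsed.path, parsed.query])

    tokens = []
    # First pass: break on any run of non-alphanumeric characters.
    for chunk in re.split(r"[^A-Za-z0-9]+", text):
        # Second pass: break at camelCase humps and letter/digit boundaries.
        tokens.extend(
            re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", chunk)
        )
    return [t.lower() for t in tokens]
```

For example, `url_tokens("http://www.PayPal-secure.example.com/signIn2024?user=abc")` splits `PayPal` into `pay` and `pal` and `signIn2024` into `sign`, `in` and `2024`. Concatenated lowercase words such as `paypalsecure` are left whole by this baseline; segmenting them would need a dictionary or statistical model.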
