Detecting Malicious Web Links and Identifying Their Attack Types

Hyunsang Choi, Korea University, Seoul, Korea, [email protected]
Bin B. Zhu, Microsoft Research Asia, Beijing, China, [email protected]
Heejo Lee, Korea University, Seoul, Korea, [email protected]

This work was done when Hyunsang Choi was an intern at Microsoft Research Asia. Contact author: Bin B. Zhu ([email protected]).

Abstract

Malicious URLs have been widely used to mount various cyber attacks including spamming, phishing and malware. Detection of malicious URLs and identification of threat types are critical to thwart these attacks. Knowing the type of a threat enables estimation of the severity of the attack and helps adopt an effective countermeasure. Existing methods typically detect malicious URLs of a single attack type. In this paper, we propose a method using machine learning to detect malicious URLs of all the popular attack types and identify the nature of the attack a malicious URL attempts to launch. Our method uses a variety of discriminative features including textual properties, link structures, webpage contents, DNS information, and network traffic. Many of these features are novel and highly effective. Our experimental studies with 40,000 benign URLs and 32,000 malicious URLs obtained from real-life Internet sources show that our method delivers a superior performance: the accuracy was over 98% in detecting malicious URLs and over 93% in identifying attack types. We also report our studies on the effectiveness of each group of discriminative features, and discuss their evadability.

1 Introduction

While the World Wide Web has become a killer application on the Internet, it has also brought in an immense risk of cyber attacks. Adversaries have used the Web as a vehicle to deliver malicious attacks such as phishing, spamming, and malware infection. For example, phishing typically involves sending an email seemingly from a trustworthy source to trick people into clicking a URL (Uniform Resource Locator) contained in the email that links to a counterfeit webpage.

To address Web-based attacks, a great effort has been directed towards detection of malicious URLs. A common countermeasure is to use a blacklist of malicious URLs, which can be constructed from various sources, particularly human feedback, which is highly accurate yet time-consuming. Blacklisting incurs no false positives, yet is effective only for known malicious URLs. It cannot detect unknown malicious URLs. The very nature of exact match in blacklisting renders it easy to evade.

This weakness of blacklisting has been addressed by anomaly-based detection methods designed to detect unknown malicious URLs. In these methods, a classification model based on discriminative rules or features is built with either a priori knowledge or through machine learning. Selection of discriminative rules or features plays a critical role in the performance of a detector. A main research effort in malicious URL detection has focused on selecting highly effective discriminative features. Existing methods were designed to detect malicious URLs of a single attack type, such as spamming, phishing, or malware.

In this paper, we propose a method using machine learning to detect malicious URLs of all the popular attack types including phishing, spamming and malware infection, and identify the attack types malicious URLs attempt to launch. We have adopted a large set of discriminative features related to textual patterns, link structures, content composition, DNS information, and network traffic. Many of these features are novel and highly effective. As described later in our experimental studies, link popularity and certain lexical and DNS features are highly discriminative in not only detecting malicious URLs but also identifying attack types. In addition, our method is robust against known evasion techniques such as redirection [42], link manipulation [16], and fast-flux hosting [17].

Identification of attack types is useful since the knowledge of the nature of a potential threat allows us to take a proper reaction as well as a pertinent and effective countermeasure against the threat. For example, we may conveniently ignore spamming but should respond immediately to malware infection. Our experiments on 40,000 benign URLs and 32,000 malicious URLs obtained from real-life Internet sources show that our method has achieved an accuracy rate of more than 98% in detecting malicious URLs and an accuracy rate of more than 93% in identifying attack types.

This paper has the following major contributions:

• We propose several groups of novel, highly discriminative features that enable our method to deliver a superior performance and capability on both detection and threat-type identification of malicious URLs of main attack types including spamming, phishing, and malware infection. Our method provides a much larger coverage than existing methods while maintaining a high accuracy.

• To the best of our knowledge, this is the first study on classifying multiple types of malicious URLs, known as a multi-label classification problem, in a systematic way. Multi-label classification is much harder than binary detection of malicious URLs since multi-label learning has to deal with the ambiguity that an entity may belong to several classes.

The remainder of this paper is organized as follows. We present the proposed method and the learning algorithms it uses in Section 2, and describe the discriminative features our method uses in Section 3. Evaluation of our method with real-life data is reported in Section 4. We review related work in Section 5, and conclude the paper in Section 6.

2 Our Framework

2.1 Overview

Our method consists of three stages as shown in Figure 1: training data collection, supervised learning with the training data, and malicious URL detection and attack type identification. These stages can operate sequentially as in batch learning, or in an interleaving manner: additional data is collected to incrementally train the classification models while the models are used in detection and identification. Interleaving operations enable our method to adapt and improve continuously with new data, especially with online learning where the output of our method is subsequently labeled and used to train the classification models.

[Figure 1: The framework of our method. Stages: 1. Data Collection; 2. Supervised Learning; 3-1. Detection; 3-2. Identification. Input: URL. Output: Benign URL, or Malicious URL with {Type}.]

2.2 Learning Algorithms

The two tasks performed by our method, detecting malicious URLs and identifying attack types, need different machine learning methods. The first task is a binary classification problem. The Support Vector Machine (SVM) is used to detect malicious URLs. The second task is a multi-label classification problem. Two multi-label classification methods, RAkEL [38] and ML-kNN [48], are used to identify attack types.

Task 1: Support Vector Machine (SVM). SVM is a widely used machine learning method introduced by Vapnik et al. [8]. SVM constructs hyperplanes in a high or infinite dimensional space for classification. Based on the Structural Risk Minimization theory, SVM finds the hyperplane that has the largest distance to the nearest training data points of any class, called the functional margin. Functional margin optimization can be achieved by maximizing the following expression

$$\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

subject to

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \qquad i = 1, 2, \ldots, n$$

where $\alpha_i$ and $\alpha_j$ are coefficients assigned to training samples $x_i$ and $x_j$, $y_i$ and $y_j$ are the class labels of those samples, and $K(x_i, x_j)$ is a kernel function used to measure similarity between the two samples. After specifying the kernel function, SVM computes the coefficients which maximize the margin of correct classification on the training set. $C$ is a regularization parameter that controls the tradeoff between training error and margin, that is, between training accuracy and model complexity.

Task 2: RAkEL and ML-kNN. RAkEL is a high-performance multi-label learning method that accepts any multi-label learner as a parameter. RAkEL creates m random sets of k label combinations, and builds an ensemble of Label Powerset (LP) [47] classifiers from each of the random sets. LP is a transformation-based algorithm that accepts a single-label classifier as a parameter. It considers each distinct combination of labels that exists in the training set as a different class value of a single-label classification task. A ranking of the labels is produced by averaging the zero-one predictions of each model per considered label. An ensemble voting process under a threshold t is then employed to make a decision for the final classification set. We use C4.5 [32] as the single-label classifier and LP as a parameter of the multi-label learner.

ML-kNN is derived from the traditional k-Nearest Neighbor (kNN) algorithm [1]. For each unseen instance, its k nearest neighbors in the training set are first identified. Based on the statistical information gained from the label sets of these neighboring instances, the maximum a posteriori principle is then utilized to determine the label set for the unseen instance.

3 Discriminative Features

Our method uses the same set of discriminative features for both tasks: malicious URL detection and attack type identification. These features can be classified into six groups: lexicon, link popularity, webpage content, DNS, DNS fluxiness, and network traffic.

[...] at other positions. Therefore, we discard the widely used “bag-of-words” approach and adopt several new features differentiating SLDs from other positions, resulting in a higher robustness against lexical manipulations by attackers. Lexical features No. 1 to No. 4 in Table 1 are from previous work. Feature No. 10 is different from
