SUSPICIOUS URL AND DEVICE

DETECTION BY LOG MINING

by

Yu Tao B.Sc., University of Science and Technology of China, 2012

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

in the School of Computing Science Faculty of Applied Sciences

© Yu Tao 2014
SIMON FRASER UNIVERSITY
Spring 2014

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

APPROVAL

Name: Yu Tao

Degree: Master of Science

Title of Thesis: SUSPICIOUS URL AND DEVICE DETECTION BY LOG MINING

Examining Committee: Dr. Greg Mori, Associate Professor Chair

Dr. Jian Pei, Professor, Senior Supervisor

Dr. Jiangchuan Liu, Associate Professor, Supervisor

Dr. Wo-Shun Luk, Professor, Internal Examiner

Date Approved: April 22, 2014

Partial Copyright Licence

Abstract

Malicious URL detection is a very important task in Internet security intelligence. Existing works rely on inspecting web page content and URL text to determine whether a URL is malicious or not. A large number of new malicious URLs emerge on the web every day, which makes it inefficient and unscalable to scan URLs one by one using traditional methods. In this thesis, we harness the power of big data to detect unknown malicious URLs based on known ones with the help of Internet access logs. Using our method, we can find not only related malicious URLs, but also URLs of new updates and CC (command and control) servers for existing malware, botnets and viruses. In addition, we can also detect possibly infected devices. We also discuss how to scale up our method to huge data sets, up to hundreds of gigabytes in our experiments. Our extensive empirical study using real data sets from Fortinet, a leader in the Internet security industry, shows the effectiveness and efficiency of our method.

To my parents.

“Men love to wonder, and that is the seed of science.”

— Ralph Waldo Emerson (1803-1882)

Acknowledgments

I would like to express my sincerest gratitude to my senior supervisor, Dr. Jian Pei, who provided creative ideas for my research and warm encouragement in my life. Throughout my master's study, he shared with me not only valuable knowledge, but also the wisdom of life. Without his help, I could never have accomplished this thesis. My gratitude also goes to my supervisor, Dr. Jiangchuan Liu, for reviewing my work and offering helpful suggestions that improved my thesis. I am grateful to Dr. Wo-Shun Luk and Dr. Greg Mori for serving on my examining committee. I thank Guanting Tang, Xiao Meng, Juhua Hu, Xiangbo Mao, Xiaoning Xu, Chuancong Gao, Yu Yang, Li Xiong, Lin Liu, Beier Lu and Jiaxing Liang for their kind help during my study at SFU. I am also grateful to my friends at Fortinet. I thank Kai Xu for his guidance and insightful suggestions. My deepest gratitude goes to my parents. Their endless love supports me in overcoming all the difficulties in my study and life.

Contents

Approval ii

Partial Copyright License iii

Abstract iv

Dedication v

Quotation vi

Acknowledgments vii

Contents viii

List of Tables x

List of Figures xi

1 Introduction 1
1.1 Background and Motivation 1
1.2 Challenges 2
1.3 Major Idea 3
1.4 Contributions 4
1.5 Thesis Organization 4

2 Related Work 6
2.1 Blacklisting 6
2.2 Heuristics Based Methods 7
2.3 Classification Based Methods 8
2.3.1 Content Based Methods 8
2.3.2 URL Based Methods 9

3 Problem Definition and Graph Representation 13
3.1 Problem Definition 13
3.2 Bipartite Graph Representation 15
3.3 Assumptions 17

4 Scalable Methods 20
4.1 The Basic Method 20
4.2 Limitation of Our Method 22
4.3 Data Storage 23
4.3.1 Data Storage of Graph Structure 23
4.3.2 Data Storage of URLs' Suspicious Scores 24
4.3.3 Data Storage of Devices' Suspicious Scores 27
4.4 MapReduce Approach 27
4.5 Relationship between Scalable Version and MapReduce Version 32

5 Experimental Results 33
5.1 Data Sets 33
5.2 Efficiency of Our Basic Method 35
5.3 Effectiveness of Our Method 37
5.3.1 Effectiveness of Malicious URLs Found by Our Method 38
5.3.2 Effectiveness of Infected Devices Found By Our Method 42
5.4 Efficiency of Our MapReduce Method 44
5.4.1 Number of Mappers 45
5.4.2 Number of Reducers 45
5.4.3 Number of Machines in Hadoop Cluster 47

6 Conclusions 48

Bibliography 50

List of Tables

1.1 Malicious URLs with the same IP address 2
1.2 Malicious URLs from the same family of virus 3

5.1 Top popular URLs that we have filtered 34
5.2 Malicious URLs detected by traditional methods 35
5.3 Comparison of top 10 URLs of the first and second iterations 43
5.4 Suspicious URLs that D1 has visited 43
5.5 Suspicious URLs that D2 has visited 45

List of Figures

3.1 First example of bipartite graph representation 16
3.2 Second example of bipartite graph representation 17

4.1 Store the adjacency list of the bipartite graph on disk 24
4.2 Store the suspicious scores of URLs on disk with the neighbors 26
4.3 Partition the suspicious scores of devices and the graph structure 29
4.4 Overview of MapReduce framework 30

5.1 Degree distribution of URLs 34
5.2 Running time with size of data set 36
5.3 Memory storage with size of data set 36
5.4 Disk storage with size of data set 37
5.5 Accuracy of Top K URLs found by our method 38
5.6 Accuracy of Top K URLs that end with 'exe' or 'php' 39
5.7 Accuracy of Top K URLs with different definition of being malicious 40
5.8 Accuracy after one week and two weeks 41
5.9 Accuracy of different iterations 42
5.10 Running time with different number of reducers 46
5.11 Running time with size of dataset 47

Chapter 1

Introduction

In this chapter, we first briefly introduce the background of Internet security, how web based attacks work, and the motivation and challenges of malicious URL detection. Then, we will summarize our major contributions and describe the structure of the thesis.

1.1 Background and Motivation

The development of the Internet not only improves our quality of life and drives new opportunities for commerce, but also creates opportunities for malicious attacks. Attackers design web based attacks to achieve various goals, such as installation of malware and viruses, spam-advertised commerce, identity theft, financial fraud and botnet information flow. How to identify web based attacks and guard the safety of users on the Internet is very important. Several factors make the identification of web based attacks challenging. The first is the large scale of the web. The number of websites is huge, and different websites provide different kinds of data and services, which makes it difficult to distinguish between attack websites and benign websites. Second, the attackers can disguise their attacks at any time and duplicate them in multiple locations. Most web based attacks share a common pattern: the attacker puts the attack code on the web and attracts users to visit it via its Uniform Resource Locator (URL). As a result, users need to evaluate the associated risk when deciding whether to click on an unfamiliar URL. Is this URL safe or not, or will the computer get infected after clicking on it? This is a very difficult decision for users to make.


http://coolstowage.com/ponyb/gate.php
http://coolstowage.com/2/gate.php
http://deeliodin.com/ponyb/gate.php
http://couponwalla.com/ponyb/gate.php
http://dealdin.com/ponyb/gate.php
http://coolstowage.com/ponyb/gate.php

Table 1.1: Malicious URLs with the same IP address

There are various systems that help users decide whether a URL is safe to click on or not. In recent years, the most common method, used in web filtering applications and browser toolbars, is blacklisting. The bad URLs that direct users to web based attacks are called malicious URLs. Internet security companies maintain a list of malicious URLs, which is called a blacklist. After a user clicks on a URL, the URL is checked against the blacklist. If the URL is in the blacklist, the user is prevented from visiting it. How to maintain and update a blacklist is a key issue for Internet security companies. Currently, blacklists are constructed using a lot of different techniques, such as honeypots, manual reporting, human feedback, and web crawlers combined with web page analysis heuristics. However, some malicious URLs might not be in the blacklist. One reason is that they are too new and have not been evaluated. Another reason is that they are evaluated incorrectly, which unfortunately happens from time to time. To overcome this problem, some systems inspect the web page content and analyze the behavior after visiting the web page. Unfortunately, this increases the runtime overhead and affects the user experience. With the development of machine learning, several classification based methods, using features of web page content and URL text, are also used to detect malicious URLs. Classification based methods have made the detection of malicious URLs much easier and more efficient. However, the attackers also adjust their strategies accordingly and invent new kinds of attacks.

1.2 Challenges

There are some challenges that traditional malicious URL detection methods are facing. The first challenge is that different malicious URLs may point to the same IP address. Table 1.1 provides an example. Even though these six URLs have different domain names, they are registered to the same IP address. Actually, all of them are URLs of the CC server of a botnet. The attacker may apply for many domain names and bind them together with the same machine that holds the attack code. Once one URL is blocked by Internet security companies, the attacker will use the next one. This makes the detection of malicious URLs very difficult. We could put the IP address into the blacklist so that all requests to this IP address are blocked. However, there may also be some benign URLs under this IP address and we do not want to prevent users from visiting them. Virtual hosting, which is a method for hosting multiple domain names on a single server, is very popular for web hosting. Most small web masters share web servers with others, so blocking the IP address is not a good choice in this situation.

http://www.finanzkonzepte-czekalla.de/bPDyoe.exe
http://rupprechtsteuerung.de/img/sst.exe
http://selection4fashion.nl/backup/swich.exe
http://www.svi.kiev.ua/video/so.exe
http://www.phaseii.net/load51.exe
http://www.finanzkonzepte-czekalla.de/bPDyoe.exe
http://fallimentodipietrospa.it/rEvTsC.exe
http://keep-smile.net/t4T.exe

Table 1.2: Malicious URLs from the same family of virus

The second challenge is that the same family of malicious URLs may have been put on different websites. The eight URLs in Table 1.2 are from the same family of virus. All eight host names are clean and distributed around the world. The attacker hacked into these websites and put the attack code under these host names. Detection of these malicious URLs is extremely difficult if we look at the URLs separately. Once one URL is blocked by Internet security companies, the attacker can switch to the next URL. Thus the detection always lags behind the emergence of new malicious URLs.

1.3 Major Idea

We can see that malicious URLs are not isolated from each other. There is some relationship among them, and the ones from the same family are related. The computers that are infected by the same family of malicious URLs will take similar actions in the future, because they are controlled by the attacker. If we know the URLs that most of these infected computers visit together, we may find the related URLs easily. In this thesis, we take advantage of Internet access logs, which provide us valuable information about the relationship between devices and URLs, to find possibly infected devices and malicious URLs. The devices that are infected by the same family of botnets, malware or viruses will take similar actions and visit the same CC servers. Thus we can detect the URLs of CC servers and of new updates of malicious software.

1.4 Contributions

The problem of malicious URL detection has been carefully and thoroughly studied in different research fields. In this thesis, we take advantage of mining large scale log data to predict unknown malicious URLs. Our main contributions are as follows.

1. We formulate the problem of malicious URL detection by mining Internet access log data.

2. We develop a practical algorithm that can find out unknown malicious URLs based on known malicious URLs.

3. We also discuss how to make our algorithm scalable on a huge amount of data, which is up to several hundred gigabytes in our experiments.

4. Using one week’s Internet access logs from Fortinet, we can achieve about 50% precision for the top 1000 suspicious URLs found by our method. Our method can also detect some malicious URLs faster than all the other methods.

5. It takes about 1 hour to run one iteration of our MapReduce method on 960GB Internet access logs using 10 computers. We also compare the running time and memory usage of different implementations of our basic method.

1.5 Thesis Organization

The rest of the thesis is organized as follows. In Chapter 2, we review the related work. In Chapter 3, we provide our formal problem definition and analyze our problem as a bipartite graph mining problem. We also make two assumptions about the behavior features of malicious URLs and infected devices. In Chapter 4, we propose our algorithm and discuss how to make it scalable. We report our experimental results in Chapter 5, and conclude the thesis in Chapter 6.

Chapter 2

Related Work

There is already extensive work on malicious URL detection. The blacklist based method is the most basic one. With the development of machine learning, classification based methods have become more and more popular. The classification based methods can be further divided into two categories: content based methods and URL based methods. In this chapter, we give a brief summary of the existing work on malicious URL detection.

2.1 Blacklisting

Blacklisting is a basic solution to prevent users from requesting malicious URLs. It provides a list of URLs, domain names and IP addresses, which are reported by human feedback, honeypots and manual labeling, or collected by crawlers and verified as malicious by other systems, such as classification based methods. When a user tries to visit a URL that is in the blacklist, the request is blocked and the user stays safe. There are several malicious URL blacklist providers, such as SORBS [7], URIBL [5] and PhishTank [4]. The most famous usage of blacklists is Google Safe Browsing [6], which maintains a blacklist of malware and phishing URLs. It checks the requests from the Google Chrome and Firefox browsers and notifies the users if the requested URLs are in the blacklist. There are some other similar service providers such as the Fortinet URL lookup tool [2] and McAfee SiteAdvisor [3]. Blacklists need to be updated, because malicious URLs tend to be short-lived. Their substrings may be partially mutated to avoid blacklisting. It can be assumed that unknown malicious URLs are in the neighborhood of known malicious URLs, created by the same attacker.

Akiyama et al. [10] proposed an effective blacklist URL generation method. They tried to discover the URLs in the neighborhood of a malicious URL by using a search engine. Their method can effectively improve the coverage of blacklists and the identification of malicious URLs. The advantage of blacklisting is that it can be constructed from various sources, particularly human feedback, which is highly accurate and incurs nearly no false positives. Testing whether a URL is in the blacklist is very fast and provides a great web surfing experience as there is nearly no delay. The disadvantage of the blacklist based method is also apparent. Human feedback is time-consuming and a blacklist is only effective for known malicious URLs. It cannot detect new, unknown malicious URLs. It takes time for users to discover malicious web pages and report them. As a result, new malicious URLs will not be in the blacklist or will still be waiting for verification. The updates of the blacklist always lag behind the emergence of new malicious URLs.

2.2 Heuristics Based Methods

The heuristics based methods rely on the signatures of known malicious payloads from antiviral systems. They scan each web page and treat it as malicious if its heuristic pattern matches a signature in the signature database. Andreas et al. [13] designed ADSandbox, which detects malicious websites by executing the embedded JavaScript within an isolated environment and logging each critical action. ADSandbox decides whether the site is malicious or not using heuristics on these logs. Christian et al. [27] detected malicious web pages by inspecting the underlying static attributes of the initial HTTP response and HTTP code. Because malicious web pages import exploits from remote resources and hide exploit code, static attributes that characterize these actions can be used to identify a lot of malicious web pages. Unfortunately, the signatures can easily be discovered by crackers, and thus the heuristics fail to detect new kinds of malicious web pages.

2.3 Classification Based Methods

The malicious URL detection problem can also be regarded as a classification problem in machine learning. Based on the features, which are extracted from page content and URL text, we can take advantage of classification algorithms to classify all the URLs into two categories, malicious URLs and benign URLs.

2.3.1 Content Based Methods

The content based methods determine whether a URL is malicious by analyzing the content of the corresponding web page and the actions taken after visiting the URL. The content of a web page can provide valuable information, and we can take advantage of this information to detect malicious URLs. Moshchuk et al. [23] detected malicious URLs that contain spyware by analyzing downloaded trojan executables. They also analyzed the density of spyware, the types of threats, and the most dangerous web zones in which spyware is likely to be encountered. Provos et al. [25] used features of web page content, such as whether iFrames are out of place and whether there are certain kinds of JavaScript code, to detect malicious URLs. Honeyclients [26] are widely used systems that mimic a human visitor and use an isolated sandbox to visit a URL. The execution dynamics of a web page are analyzed and used to detect malicious activities. However, they cannot detect zero-day exploits. High-interaction honeyclients [28] check integrity changes in system states, which requires monitoring the file system, processes, network and CPU consumption. The drawback of the content based methods is that they need to download and analyze the web pages, which takes a lot of time. After a user requests a URL, the web page has to be crawled and analyzed. The content based methods also need to inspect the actions after visiting the URL. Because of the significant latency, online detection using content based methods cannot provide a good web surfing experience. Even if we do offline detection and analysis, it still costs a lot of computation time and resources. Our method in this thesis is also offline. Unlike content based methods, however, our method does not need to download the web pages first.

2.3.2 URL Based Methods

URL based methods do not need to download and scan web pages. They only rely on the features of URLs themselves. The features used by a URL based classifier can be divided into two parts, host-based features and lexical features. The combination of the two is called full-featured analysis.

Host-based features

Host-based features are obtained from external sources of information about a URL, including IP address information, domain name, WHOIS data, connection speed and geographic properties.

• IP address information - whether an IP address or A, MX and NS records is in the blacklist.

• Domain Name - whether or not some certain keywords exist in the hostname

• WHOIS data - registration dates, registrars and registrants, expiration date, whether or not the WHOIS entry is locked.

• Connection speed - malicious URLs usually have low connection speeds.

• Geographic properties - which country the IP address belongs to.

Lexical features

Lexical features are extracted from the text of the URL itself. They include features of the domain name, directory, filename and arguments; a small extraction sketch follows the list below.

• Domain name - Length, number of hyphens and tokens, whether the domain name contains IP address or port number

• Directory - number of directories used, the longest directory name

• Filename - length of filename, number of delimiters

• Arguments - number of variables, longest variable value
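To make the lexical features above concrete, the following is a minimal sketch of extracting such features from a URL string with the Python standard library. It is only an illustration under assumptions made here (the feature names and delimiter choices are ours), not the feature set used in any of the cited papers.

from urllib.parse import urlparse
import re

def lexical_features(url):
    # Split the URL into host, directory path, filename and query arguments.
    parts = urlparse(url)
    host = parts.hostname or ""
    path_dirs = [d for d in parts.path.split("/")[:-1] if d]
    filename = parts.path.split("/")[-1]
    args = parts.query.split("&") if parts.query else []
    return {
        # Domain name features
        "domain_length": len(host),
        "domain_hyphens": host.count("-"),
        "domain_tokens": len(host.split(".")),
        "domain_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "has_port": parts.port is not None,
        # Directory features
        "num_directories": len(path_dirs),
        "longest_directory": max((len(d) for d in path_dirs), default=0),
        # Filename features
        "filename_length": len(filename),
        "filename_delimiters": sum(filename.count(c) for c in ".-_"),
        # Argument features
        "num_variables": len(args),
        "longest_variable_value": max(
            (len(a.split("=", 1)[1]) for a in args if "=" in a), default=0),
    }

print(lexical_features("http://example.com/ponyb/gate.php?id=42&tok=abc"))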

Kan and Thi [17] provided one of the early studies of malicious URL classification based on URL features. They used a bag-of-words representation of URL tokens and also took into account where the tokens appear within URLs. By comparing classifiers built on page content features and on URL lexical features, they found that the latter can achieve 95% of the accuracy of the former.

McGrath and Gupta [21] analyzed the differences between phishing URLs and normal URLs by examining the anatomy of phishing URLs and domains, the registration of phishing domains, the time to activation, and the machines used to host the phishing sites. These features are used to construct a classifier for phishing URL detection.

Garera et al. [14] classified phishing URLs by using logistic regression over 18 manually selected features, including Google Page Rank and Google Web page quality guidelines. Their large scale measurement study was conducted on Google Toolbar URLs. They found about 777 unique phishing pages per day.

Ma et al. [19] combined several kinds of features to construct a classifier for malicious URL detection. They analyzed and used different kinds of features, such as page based, domain based, type based and word based features, to train a logistic regression classifier, which achieved a very high accuracy.

Choi et al. [11] proposed a method to detect various kinds of attacks using a variety of discriminative features, including textual properties, link structures, webpage contents, DNS information and network traffic. One of the most important features used in their method is "link popularity", which is estimated by counting the number of incoming links from other web pages. Link popularity can be seen as a reputation measure of a URL.

Instead of using pre-defined features or fixed delimiters, Huang et al. [15] proposed to dynamically extract lexical patterns from URLs. Their novel model of URL patterns provided new flexibility and capability for capturing malicious URLs that are algorithmically generated by malicious programs. Using their URL patterns, they can achieve 90% recall when the malicious probability ratio threshold is set to 10.

Pao et al. [24] proposed a detection method based on an estimation of the conditional Kolmogorov complexity of URL strings. To overcome the incomputability of Kolmogorov complexity, they adopted a compression method for its approximation, called the conditional Kolmogorov measure. As a single feature, it achieves a decent detection performance that cannot be achieved by any other single feature. Moreover, the proposed Kolmogorov measure can work together with other features for successful detection.

Unfortunately, the previous methods do not scale up to the hundreds of millions of URLs encountered every day, because the problem is a heavily imbalanced, large scale binary classification problem. Lin et al. [18] presented a novel lightweight filter based only on the URL string itself, which can be used before existing processing methods. They generated two filtering models by using lexical features and descriptive features, and then combined the filtering results. Without host-based information and content-based analysis, they were able to process two million URL strings in five minutes. They also introduced an online learning technique in their framework so that the filtering models can be modified dynamically whenever there is feedback from the back-end content analysis engine.

Unlike batch machine learning algorithms, online learning [22] has been proposed as a scalable approach to tackling large-scale online malicious URL detection tasks. In general, online learning methods are more suitable for large-scale, real-world online web applications due to their high efficiency and scalability. Ma et al. [20] used a real-time system for gathering URL features, combined with a real-time source of labeled URLs from a large web mail provider. They demonstrated that online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.

Most online learning algorithms were designed to optimize the classification accuracy, typically by assuming, explicitly or implicitly, that the underlying training data distribution is class-balanced. This is clearly inappropriate for online malicious URL detection tasks, since the real-world URL data distribution is often highly class-imbalanced. Zhao et al. [29] presented a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue.

Sometimes, recall is more important than precision for malicious URL detection. Finding as many malicious URLs as possible is important for the safety of users. The existing classification based methods can achieve about 90% recall [15]. Unfortunately, some malicious URLs are always missed by existing methods.

The focus of this thesis is to find malicious URLs and possibly infected devices based on known malicious URLs with the help of Internet access logs. Internet access logs are difficult to obtain, and there are currently no existing methods that use them to find malicious URLs. The results of traditional malicious URL detection methods are the input of our problem. We can also use traditional malicious URL detection methods to verify the output of our method. We will introduce Internet access logs and clarify the difference and connection between our method and traditional methods in Section 3.1.

Chapter 3
Unlike the batch machine learning algorithms, online Learning [22] has been proposed as a scalable approach to tackling large-scale online malicious URL detection tasks. In general, online learning methods are more suitable for large-scale, real-world online web applications due to their high efficiency and scalability. Ma et al. [20] used a real-time system for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider. They demonstrated that online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set. Most of the online learning algorithms were designed to optimize the classification accu- racy, typically by assuming the underlying training data distribution is class-balanced ex- plicitly or implicitly. This is clearly inappropriate for online malicious URL detection tasks since the real-world URL data distribution is often highly class-imbalanced. Zhao et al. [29] presented a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which only queries a small fraction of training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. Sometimes, recall is more important than precision for malicious URL detection. Finding as more malicious URLs as possible is important for the safety of users. The existing classification based method can achieve about 90% recall [15]. Unfortunately, there are always some malicious URLs missing using existing methods. The focus of this thesis is to find malicious URLs and possibly infected devices based on known malicious URLs with the help of Internet access logs. Internet access logs are difficult to get and there are currently no existing methods that use them to find malicious URLs. The results of traditional malicious URL detection methods are the input of our CHAPTER 2. RELATED WORK 12 problem. We can also use the traditional malicious URL detection methods to verify the output of our method. We will introduce Internet access logs and clarify the difference and connection between our method and traditional methods in Chapter 3.1. Chapter 3

Problem Definition and Graph Representation

In this chapter, we formulate our problem. We also discuss the difference and connection between our problem and the traditional malicious URL detection problem. To get a better understanding, we further analyze our problem as a bipartite graph mining problem. We also propose two assumptions about the behavior of malicious URLs and infected devices. Both the assumptions and the bipartite graph representation will help us design the algorithm in the next chapter.

3.1 Problem Definition

Nowadays, the majority of computer attacks are launched by visiting a malicious web page. A user can be tricked into giving away confidential information on a phishing page, or become the victim of a drive-by download resulting in a malware infection. So how to prevent users from visiting malicious web pages is very important. A uniform resource locator, abbreviated URL, is a specific character string that constitutes a reference to a web page. The URLs of malicious web pages are called malicious URLs. There are many new malicious URLs found by different companies every day using several traditional malicious URL detection methods. Most of those URLs have been verified as malicious by humans or sandboxes, and users are prevented from visiting them. A device is a machine that is used by users to surf the Internet and request URLs.

There are several different kinds of devices, such as computers, mobile phones and tablets. The devices that have been infected by viruses, botnets or malware are called infected devices. After getting infected, a device is under the full control of the attacker and will communicate with the attacker to get new commands and updates. We call a device that may have been infected by malicious software a possibly infected device. It is very valuable for users to get notified once their devices have been infected by malware, botnets or viruses. The relationship between URLs and devices is represented as requests. A device visits a web page by sending an HTTP or HTTPS request to a URL. Most of the time, a URL may be requested by many devices and a device may request several URLs. Most of the requests are sent by humans surfing the Internet. However, once a device gets infected, the virus can also send requests to communicate with the cracker without the user's knowledge. Internet security companies have the Internet access logs of a large number of devices, which record all the requests from the devices to the URLs. In an Internet access log, a record is a tuple <device_id, timestamp, url>, where device_id is the ID of a device that requested url at timestamp. A device may be infected if it has requested a malicious URL. Similarly, a URL may be malicious if it has been requested by a large number of infected devices. We can see that the requests from the devices to the URLs provide valuable information to predict unknown malicious URLs and find infected devices. The limitation of traditional malicious URL detection methods is that they are not scalable, because they need to inspect URLs one by one. There are many new URLs appearing on the web every day and scanning them one by one is not efficient. So most security companies can only identify a portion of the malicious URLs, and some malicious URLs are usually missed. Because of the mechanism of malware, botnets and viruses, malicious URLs are not isolated from each other. They are correlated, and an infected device often visits several malicious URLs over a period of time. The first malicious URL a device visits may cause the device to become infected. After some time, the device may visit some CC servers to get new commands and updates. Once a device is infected, the cracker has control of the infected device, and the device may also download some new kind of malware, botnet or virus. So, it is possible to recognize unknown malicious URLs based on known malicious URLs by mining the Internet access logs.

The input of our problem consists of two parts: malicious URLs found by traditional methods, and the relationship between URLs and devices, which is represented as the Internet access logs of a large number of devices. We want to take advantage of these two kinds of input to find unknown malicious URLs and possibly infected devices. Our problem is totally different from the traditional malicious URL detection problem. The traditional malicious URL detection problem relies on inspecting the HTML, JavaScript, executable files and URL text to determine whether a URL is malicious or not. In contrast, our problem is based on mining the relationship between devices and URLs to find unknown malicious URLs, using the malicious URLs found by existing methods. Another difference is that, besides the unknown malicious URLs, we can also find possibly infected devices under our problem definition. There are also some connections between our problem and the traditional malicious URL detection problem. The results of the traditional malicious URL detection methods are used as the input of our problem. At the same time, the results of our problem can also be used as the input of the traditional methods. We can use the traditional URL detection methods to verify the output URLs of our method.

3.2 Bipartite Graph Representation

To get a better understanding of our problem, we can view it as a bipartite graph mining problem. This makes the problem more intuitive and also helps us come up with the algorithm in the next chapter. The devices, URLs and their visiting relationship can be represented as a graph G = (V, E). Every node v ∈ V is either a device or a URL. Every edge e ∈ E is an edge between a URL and a device, which indicates that the device has requested that URL before. There are no edges between two devices or between two URLs. Because of this special characteristic, it is actually a bipartite graph G = (V_dev, V_url, E). The nodes of the bipartite graph can be divided into two disjoint sets V_dev and V_url such that every edge connects a node in V_dev to a node in V_url. In our problem, V_dev is the set of all the devices and V_url is the set of all the URLs. E can be obtained from the Internet access logs, which record the requests from the devices to the URLs.
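As a concrete illustration, the following minimal sketch (not the thesis implementation; the tab-separated log format and the names are assumptions) builds the adjacency lists of the bipartite graph G = (V_dev, V_url, E) from Internet access log records of the form <device_id, timestamp, url>.

from collections import defaultdict

def build_bipartite_graph(log_lines):
    dev_neighbors = defaultdict(set)   # device -> URLs it has requested
    url_neighbors = defaultdict(set)   # URL -> devices that requested it
    for line in log_lines:
        device_id, timestamp, url = line.rstrip("\n").split("\t")
        dev_neighbors[device_id].add(url)
        url_neighbors[url].add(device_id)
    return dev_neighbors, url_neighbors

# Toy log with two devices, two URLs and three requests.
log = [
    "D1\t2014-01-01 10:00:00\thttp://example.com/a.exe",
    "D2\t2014-01-01 10:05:00\thttp://example.com/a.exe",
    "D2\t2014-01-01 10:06:00\thttp://example.org/index.html",
]
dev_neighbors, url_neighbors = build_bipartite_graph(log)
print(len(dev_neighbors), len(url_neighbors))   # 2 2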

Some URLs u ∈ V_url are marked malicious initially. These URLs can be viewed as the training set of our problem, and we need to take advantage of the graph structure to predict other malicious nodes in the graph.

[Figure: a bipartite graph with device nodes V1, ..., V5 on the left and URL nodes U1, ..., U5 on the right; edges indicate which devices requested which URLs.]

Figure 3.1: First example of bipartite graph representation

There are two kinds of newly found malicious nodes: the ones from V_dev are the possibly infected devices and the ones from V_url are the new malicious URLs. The basic idea is that if one node in the graph is malicious, its neighbors, which are the URLs a malicious device visited or the devices that requested a malicious URL, may also be malicious. We provide two examples to explain how the bipartite graph representation can help us better understand the problem. In the first example, if u2 is marked malicious initially in Figure 3.1, v2 and v3 may have been infected because both of them have visited u2. In the next step, we can see that both v2 and v3 have visited u5 and u5 has only been visited by v2 and v3. Then, u5 is very likely a malicious URL, too. If u5 has a high probability of being malicious, v5 may have been infected as well.

[Figure: a second bipartite graph with device nodes V1, ..., V5 on the left and URL nodes U1, ..., U5 on the right; edges indicate which devices requested which URLs.]

Figure 3.2: Second example of bipartite graph representation

In the second example, if u2 and u4 are marked malicious initially in Figure 3.2, v1, v3 and v4 may have been infected because all of them have visited u2 or u4. In the next step, we can see that all of v1, v3 and v4 have visited u3. In other words, out of the four devices that visited u3, three of them may have been infected. Then, u3 is very likely a malicious URL too. In this example, u2 and u4 seem unrelated if they are looked at independently. However, if we study them together in the graph, we find that they may come from the same family of botnet, and we can easily find new malicious URLs of this family. From these two examples, we can see that if a URL is malicious, all its neighbors in the bipartite graph may have been infected. Similarly, for a URL, if most of its visits come from suspicious devices, it may be a malicious URL.

3.3 Assumptions

Once we represent the URLs, devices and their relationship as a bipartite graph, we also need to know what kinds of nodes are more likely to be malicious in the bipartite graph.

In other words, we need to figure out how the behavior of infected devices and malicious URLs differs from that of ordinary devices and URLs. This will be the foundation of our algorithm in the next chapter. The mechanism of malicious software is composed of two parts. The first part is how to get a device infected. Usually, a device gets infected after visiting a malicious URL. This malicious URL may point to an executable file that contains some malicious code. Another situation is that a user tries to visit an ordinary URL, but there is a drive-by download associated with the URL and the user is totally unaware of it. The second part is how the infected devices communicate with the attackers. Whether to steal information from users or to send spam emails, the attackers need to communicate with the infected devices. The malicious software on the infected devices communicates with the attackers by visiting the URLs of CC servers to get new commands and updates, without the user's knowledge. Based on the analysis of malicious URLs, infected devices and the mechanism of botnets, malware and viruses, we propose the following two assumptions.

Assumption 1. A device is likely infected if it visits a malicious URL.

The first assumption is about how to tell whether a device is infected or not. If a device has requested a malicious URL, it is very likely that this device has been infected. One situation is that this malicious URL contains some executable code that causes the device to become infected. Another situation is that this URL is a CC server and the infected device visits it to get new commands and updates. We can also argue that ordinary devices will almost never visit a malicious URL. The first reason is that most malicious URLs are not web pages. A malicious URL can be just a binary file or an encrypted text message, which ordinary users will never have any interest in. The requests to these URLs are sent by malware or botnets without the user's knowledge. The second reason is that most malicious URLs are generated by programs and are only known to the viruses, botnets or malware. They are just the message channel between infected devices and the cracker. The probability that an ordinary device has requested a malicious URL is very small.

Assumption 2. A URL that is mainly visited by infected devices is likely a malicious URL.

The second assumption is about the difference between malicious URLs and ordinary ones.

A key fact about infected devices is that most of their Internet visits still go to ordinary URLs. Only a very small part of their visits go to malicious URLs, and those are sent by the virus without the user's knowledge. So we cannot say a URL tends to be malicious just because it is requested by an infected device. It might just be an ordinary URL of a website that the user visits every day, even though the user's device has been infected. However, we should pay attention to the URLs that are mainly visited by possibly infected devices. The possibly infected devices are always a small part of all the devices, so for most URLs, most of their requests should come from ordinary devices. If most of the devices that visit a URL tend to have been infected, this URL might be the communication channel of the infected devices and should be considered malicious.

Chapter 4

Scalable Methods

In this chapter, we first introduce our basic algorithm, which is easy to implement when the data can be held in main memory. However, the data we face in real-life applications is often so huge that we have to store it on disk. Sometimes, we may even need to distribute the storage and computation among multiple machines. So, how to make our algorithm scalable and efficient is very important. From another point of view, we also need to harness the power of big data to make our algorithm effective. The reason is that we rely on the massive visits of a large number of devices when deciding whether a URL is malicious or not. If we only have network access logs for a very limited number of devices, there may be very few devices that have visited a malicious URL, and the difficulty of predicting unknown malicious URLs may increase dramatically. Therefore, we will discuss data storage strategies that make our algorithm scalable with the increasing volume of data. After that, we will introduce the MapReduce version of our algorithm, which is easier to implement and much more efficient. We will also compare the basic method and the MapReduce approach at the end.

4.1 The Basic Method

First of all, we define the suspicious scores for devices and URLs. The suspicious score of a device or URL is a measure of the likelihood that this device or URL is an infected device or a malicious URL. It is a non-negative number. The devices with large suspicious scores are more likely to be infected by viruses, botnets or malware. The URLs with large suspicious scores are more likely to be malicious URLs.

Initially, only the known malicious URLs have a positive suspicious score, which is 1. All the other URLs and all devices have a suspicious score of 0. Based on the two assumptions made in Chapter 3, we develop an iterative algorithm to update the suspicious scores of both devices and URLs. We iteratively update a device's suspicious score based on the suspicious scores of the URLs that it has requested. Similarly, we update a URL's suspicious score based on the suspicious scores of the devices which have requested this URL. In the bipartite graph representation, this means that the updated suspicious score of one node is determined only by its neighbors' suspicious scores. According to the first assumption, we set the suspicious score of a device to the maximum of its URL neighbors' suspicious scores. In other words, the suspicious score of a device is determined only by the most suspicious URL it has requested.

Algorithm 1: Update the suspicious scores of devices and URLs iteratively on a single machine when everything can be held in main memory

foreach url in URL do
    url.score ← 0
foreach dev in DEV do
    dev.score ← 0
foreach url in MURL do
    url.score ← 1
foreach iteration ∈ {1, ..., k} do
    foreach dev in DEV do
        dev.score ← 0
        foreach url in dev.neighbors do
            if url.score > dev.score then
                dev.score ← url.score
    foreach url in URL do
        url.degree ← 0
        url.sum ← 0
        foreach dev in url.neighbors do
            url.degree ← url.degree + 1
            url.sum ← url.sum + dev.score
        url.score ← url.sum ∗ log(url.degree) / url.degree

According to the second assumption, we set the suspicious score of a URL to the product of two parts. The first part is the average suspicious score of all the devices that have visited this URL. The second part is the logarithm of the degree of this URL; that is, url.score = url.sum ∗ log(url.degree) / url.degree. The first part distinguishes suspicious URLs from ordinary ones. An ordinary URL will have a low score for the first part because most of its visits come from ordinary devices with low suspicious scores. Suspicious URLs will have a large score for this part because they are mostly requested by suspicious devices. We also need the second part to distinguish suspicious URLs with different degrees. This part acts as the confidence of the first part. For a URL with a degree of 2, even if its neighbors' average suspicious score is high, we are not confident enough to say this URL is suspicious. In contrast, if a URL with a degree of 100 still has a large average suspicious score from its neighbors, it is more suspicious than the previous URL. After taking the logarithm of the degree, the URL with degree 100 is 2 times more suspicious than the URL with degree 10. If we did not take the logarithm and used the degree directly as the second part, the suspicious score of a URL would simply be the sum of its neighbors' suspicious scores. In that situation, URLs with large degrees would have large suspicious scores. However, malicious URLs usually do not have large degrees, as they are only used as communication channels by attackers to control infected devices. This idea of combining two factors by a product and taking the logarithm of the degree of a URL is similar to the idea of TF-IDF [16] in information retrieval. Algorithm 1 is the pseudo-code of our basic algorithm, assuming everything can be held in main memory. In each iteration, we scan the whole graph twice to update the suspicious scores of devices and URLs. If there are in total d devices, u URLs, and e edges between them, the running time of one iteration of our basic algorithm is O(d + u + e). The memory storage is also O(d + u + e).
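The following is a minimal in-memory sketch of this basic method (Algorithm 1) in Python; the function and variable names are our own, and the input is assumed to be adjacency mappings like the ones in the earlier sketch, so this is an illustration rather than the thesis code.

from math import log

def propagate_scores(dev_neighbors, url_neighbors, known_malicious, k=3):
    # dev_neighbors: device -> URLs it requested
    # url_neighbors: URL -> devices that requested it
    # known_malicious: set of URLs whose initial suspicious score is 1
    url_score = {u: 1.0 if u in known_malicious else 0.0 for u in url_neighbors}
    dev_score = {d: 0.0 for d in dev_neighbors}
    for _ in range(k):
        # A device takes the score of the most suspicious URL it requested.
        for d, urls in dev_neighbors.items():
            dev_score[d] = max((url_score[u] for u in urls), default=0.0)
        # A URL takes its neighbors' average device score times log(degree).
        for u, devs in url_neighbors.items():
            degree = len(devs)
            total = sum(dev_score[d] for d in devs)
            url_score[u] = total * log(degree) / degree if degree > 0 else 0.0
    return url_score, dev_score

Note that log(1) = 0 in this formulation, so a URL requested by only a single device always receives score 0, which matches the confidence argument above.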

4.2 Limitation of Our Method

As our method is based on the two assumptions made in Chapter 3, it can detect the malware, botnets and viruses that try to infect a lot of devices. Most existing malicious software tries to control as many devices as possible, and the attacker uses the same list of URLs to distribute new commands and updates. As the infected devices of the same family of malicious software will visit the same list of malicious URLs, our method can easily find the new CC servers and updates of existing malicious software. Unfortunately, our method also has its own limitations. If devices have totally different behaviors after getting infected, our method cannot find new malicious URLs. For example, if the attacker has applied for a large number of host names and each infected device chooses one host name randomly as its CC server, it will be difficult for our method to detect these malicious URLs. Another situation is when an attacker launches an infection targeted at a specific device; in that case we cannot rely on the visits of a large number of devices and our method will not work.

4.3 Data Storage

There are two kinds of data that we need to store for our algorithm. First, we need to store the suspicious scores of the URLs and devices. Second, we need to store the relationships between the URLs and devices in order to update the suspicious scores in every iteration. In terms of the graph representation, this means we need to store both the score and the neighbors of each node. In the following, we discuss the storage strategies for the graph structure, the URLs' suspicious scores and the devices' suspicious scores when the data is too large to be held in main memory.

4.3.1 Data Storage of Graph Structure

There are basically two kinds of representations of graphs: the adjacency list and the adjacency matrix. An adjacency list stores a list of adjacent nodes for every node. An adjacency matrix is a two-dimensional matrix, in which the rows represent source nodes and the columns represent destination nodes. Even though there can be billions of URLs on the web, most devices have only requested a very limited number of them. Similarly, most URLs have only been visited by a very small number of devices. So, the graph we are facing is very sparse, and using an adjacency list is much better in this situation. Actually, we do not need random access to the neighbors of a node in our algorithm. In each iteration, we calculate the new suspicious scores of the nodes one by one and we do not care about the order. We just need to scan the adjacency list and update each node's suspicious score. The adjacency list of a real graph is always very large, up to hundreds of gigabytes. We can simply put it on disk and scan the disk to get each node's neighbors in every iteration. Figure 4.1 shows an example of storing the adjacency list of the bipartite graph on disk.

Memory (suspicious scores of devices and URLs):
V1:0, V2:0, V3:0, V4:0, V5:0
U1:0, U2:1, U3:0, U4:1, U5:0

Disk (graph structure, adjacency list):
V1: U2, U3
V2: U1, U5
V3: U2, U3
V4: U1, U3, U4
V5: U1, U3, U5
U1: V2, V4, V5
U2: V1, V3
U3: V1, V3, V4, V5
U4: V4
U5: V2, V5

Figure 4.1: Store the adjacency list of the bipartite graph on disk

If memory is c times faster than disk for IO (input and output), the running time becomes O(d + u + c·e) by storing the bipartite graph on disk. The disk storage is O(e) and the memory storage is reduced to O(d + u).

4.3.2 Data Storage of URLs’ Suspicious Scores

In our algorithm, in order to update the suspicious score of a device, we need to get the suspicious scores of all of this device's neighbors and find the maximum among them. This means we need random access to the URLs' suspicious scores. If we can store the suspicious scores of all the URLs in main memory, we can get each URL's suspicious score very fast. However, there may be billions of URLs on the web, which may be difficult to hold in main memory. The number of URLs is also increasing every day, and we need a scalable solution for storage. We could store the suspicious scores of the URLs in a database. However, this may involve several database queries for each device, which is costly in a real implementation. As the new suspicious score of a device is determined only by the suspicious score of the most suspicious URL it requested, we can set the new suspicious scores of the devices while we are computing the new suspicious scores of the URLs.

Algorithm 2: Store the suspicious scores of URLs on disk with the neighbors

foreach url in URL do
    url.score ← 0
foreach dev in DEV do
    dev.score ← 0
foreach url in MURL do
    foreach dev in url.neighbors do
        dev.score ← 1
foreach iteration ∈ {1, ..., k} do
    foreach dev in DEV do
        dev.newscore ← 0
    foreach url in URL do
        url.degree ← 0
        url.sum ← 0
        foreach dev in url.neighbors do
            url.degree ← url.degree + 1
            url.sum ← url.sum + dev.score
        url.score ← url.sum ∗ log(url.degree) / url.degree
        foreach dev in url.neighbors do
            if dev.newscore < url.score then
                dev.newscore ← url.score
    foreach dev in DEV do
        dev.score ← dev.newscore

Memory (suspicious scores of devices):
V1:1, V2:0, V3:1, V4:1, V5:0

Disk (suspicious scores, adjacency lists of URLs):
U1: 0, V2, V4, V5
U2: 1, V1, V3
U3: 0, V1, V3, V4, V5
U4: 1, V4
U5: 0, V2, V5

Figure 4.2: Store the suspicious scores of URLs on disk with the neighbors

In other words, we only need to calculate the new suspicious score of every URL in each iteration. After getting a URL's new suspicious score, we propagate it to the URL's neighbors (devices). In this way, we avoid looking up the suspicious scores of a device's neighbors, and we no longer need random access to the URLs' suspicious scores. In more detail, we can store the suspicious score of a URL together with the URL's neighbors on disk. After reading the neighbors and the suspicious score of a URL from disk, we calculate the new suspicious score of this URL and assign it to each neighbor (device) if it is larger than that device's current new suspicious score. This approach also reduces the volume of graph structure storage, as we no longer need to store the neighbors of the devices. Algorithm 2 is the pseudo-code for the case where we store the suspicious scores of URLs on disk with the neighbors. In each iteration, we scan the whole graph just once. However, we update the suspicious scores of devices 2·d + e times. The overall running time is O(d + e + c·(u + e)). The disk storage is O(e + u) and the memory storage is reduced to O(d).
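As an illustration of this disk-based variant, the sketch below (the file format, field separators and names are assumptions made here, not the thesis code) performs one iteration of Algorithm 2: the URL scores and URL adjacency lists are streamed from a text file, only the device scores stay in memory, and the new URL scores are propagated to the devices as soon as they are computed.

from math import log

def one_iteration_on_disk(url_file_in, url_file_out, dev_score):
    # Each input line (assumed format): "url<TAB>score<TAB>dev1,dev2,..."
    new_dev_score = {d: 0.0 for d in dev_score}
    with open(url_file_in) as fin, open(url_file_out, "w") as fout:
        for line in fin:
            url, _old_score, devs = line.rstrip("\n").split("\t")
            neighbors = devs.split(",") if devs else []
            degree = len(neighbors)
            total = sum(dev_score.get(d, 0.0) for d in neighbors)
            url_score = total * log(degree) / degree if degree > 0 else 0.0
            # Propagate the fresh URL score to its device neighbors right away,
            # so URL scores never have to be looked up randomly.
            for d in neighbors:
                if url_score > new_dev_score.get(d, 0.0):
                    new_dev_score[d] = url_score
            fout.write("%s\t%.6f\t%s\n" % (url, url_score, devs))
    return new_dev_score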

4.3.3 Data Storage of Devices’ Suspicious Scores

In the previous section we decided to store the suspicious scores of the URLs on disk with the graph structure. Unfortunately, we still need to store the suspicious scores of the devices in main memory for random access. This is feasible for up to thousands of millions of devices with currently available computers. We still need to figure out what to do if we have even more devices, or if we only have very little space left in main memory. Another problem is how to distribute the storage and computation among several computers if the data volume is too large to store and process on a single computer. After careful consideration, we found that we can split the suspicious scores of the devices into several parts such that each part can be held in main memory. At the same time, we also need to split the graph into several parts accordingly, such that we only need one part of the devices' suspicious scores when processing one part of the graph.

Specifically, we split the devices into N parts, DEV_1, DEV_2, ..., DEV_N, such that every part can be held in main memory. We also split the graph into N parts, G_1, G_2, ..., G_N, accordingly. G_i is the subgraph of G induced by the devices from DEV_i and all the URLs. G_i is stored on disk and contains each URL's suspicious score and the URL's neighbors within DEV_i. So we only need to scan G_i when processing DEV_i, because all the graph structure and suspicious scores related to DEV_i are stored in G_i. If all the DEV_i and G_i are saved on the same computer, we can process them one by one with limited memory space. We can also distribute the DEV_i and G_i among several computers to achieve distributed storage and computation. We only need to make sure that DEV_i and G_i are on the same computer for each i. Algorithm 3 is the pseudo-code for partitioning and distributing the suspicious scores of devices and the graph structure among multiple machines. If there are p partitions and p machines in total, the overall running time is O(d/p + e/p + c·(u + e)/p + c·u). The disk storage is O(e/p + u/p) and the memory storage is reduced to O(d/p).
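A simple way to realize this partitioning, sketched below under our own assumptions (hashing on the device id, tab-separated log lines), is to assign each device to one of N partitions with a stable hash, so that DEV_i and the corresponding subgraph G_i always end up together.

import hashlib

def partition_of(device_id, num_partitions):
    # A stable hash keeps a device in the same partition across days.
    digest = hashlib.md5(device_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def split_log_by_device(log_lines, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for line in log_lines:
        device_id = line.split("\t", 1)[0]
        partitions[partition_of(device_id, num_partitions)].append(line)
    return partitions

Each partition's lines can then be written to its own file, or shipped to its own machine, together with that partition's device scores.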

4.4 MapReduce Approach

Our previous method treats the problem as a bipartite graph mining problem and achieves scalability by partitioning the bipartite graph. The drawback is that the graph is very difficult to maintain while it is changing quickly.

Algorithm 3: Partition the suspicious scores of devices and the graph structure

Split all the devices into N parts: DEV_1, DEV_2, ..., DEV_N
Split the graph into N parts accordingly: G_1, G_2, ..., G_N
foreach iteration ∈ {1, ..., k} do
    foreach i ∈ {1, ..., N} do
        take DEV_i into main memory
        foreach url in G_i do
            foreach dev in url.neighbors do
                if url.score > dev.score then
                    dev.score ← url.score
        foreach url in G_i do
            foreach dev in url.neighbors do
                url.degree ← url.degree + 1
                url.sum ← url.sum + dev.score
            update url.sum and url.degree on disk
    foreach url in URL do
        foreach i ∈ {1, ..., N} do
            if url in G_i then
                degree ← degree + url.degree
                sum ← sum + url.sum
        url.score ← sum ∗ log(degree) / degree
        update url.score in each G_i

First Partition:
Memory (suspicious scores of devices): V1:1, V2:0, V3:1
Disk (suspicious scores, adjacency lists of URLs):
U1: 0, V2
U2: 1, V1, V3
U3: 0, V1, V3
U4: 1
U5: 0, V2

Second Partition:
Memory (suspicious scores of devices): V4:1, V5:0
Disk (suspicious scores, adjacency lists of URLs):
U1: 0, V4, V5
U2: 1
U3: 0, V4, V5
U4: 1, V4
U5: 0, V5

Figure 4.3: Partition the suspicious scores of devices and the graph structure

In this section, instead of building the virtual graph, we propose a method that processes the raw Internet access logs directly. MapReduce [12] is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of two functions: a Map function and a Reduce function. The Map function performs filtering and sorting. The Reduce function performs an aggregation operation. Figure 4.4 provides an overview of the MapReduce framework. The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain and returns a list of pairs in another domain:

Map(k1, v1) → list(k2, v2)

The Map function is applied in parallel to each pair in the input dataset, producing a list of pairs for each call. In the next step, the MapReduce framework collects all pairs with the same key from all the lists and groups them together, creating one group for each key. The Reduce function is then applied to each group in parallel, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list(v3)

Input Pairs → Map → Intermediate Pairs → Reduce → Output Pairs

Figure 4.4: Overview of MapReduce framework

Each Reduce call typically produces either one value v3 or an empty return. However, one call is also allowed to return more than one value. The returns of all calls are collected as the output list. One record of the Internet access log can be seen as a (key, value) pair, which is also an edge in the bipartite graph. We can either use the device as the key and the URL as the value, or the URL as the key and the device as the value. The key-value pair can then be used to represent the propagation from one node to another. Nevertheless, as we do not maintain the neighbors of a node explicitly, how to gather the key-value pairs of one node together is a challenge. The core idea behind MapReduce is mapping the data set into a collection of (key, value) pairs and then reducing over all pairs with the same key. This is a perfect fit for our problem if we represent every edge (A, B) as a (key, value) pair: the key is A and the value is the suspicious score of B. As a result, all the suspicious scores of A's neighbors are sent to the same reducer, and we can compute A's new suspicious score within the reducer. We still need several iterations to update the suspicious scores of URLs and devices. Each iteration consists of two MapReduce jobs: one to update the devices' scores, the other to update the URLs' scores. Both the input and the output of each MapReduce job are files. All the files have the same format: DEV, DEVscore, URL, URLscore.

The output of the first reducer acts as the input of the second mapper.

Algorithm 4: Mapper and Reducer used to update devices' scores
 1  Class DEVMapper
 2      method MAP(int i, string log)
 3          EMIT(DEV, pair(URL, URLscore))
 4  Class DEVReducer
 5      method REDUCE(string DEV, pairs[(URL1, URLscore1), ...])
 6          DEVscore ← 0
 7          foreach pair(URL, URLscore) in pairs[(URL1, URLscore1), ...]
 8              if URLscore > DEVscore
 9                  DEVscore ← URLscore
10          foreach pair(URL, URLscore) in pairs[(URL1, URLscore1), ...]
11              EMIT(DEV+" "+DEVscore, URL+" "+URLscore)

The first MapReduce job uses DEV as the key and (URL, URLscore) as the value, so all the URLs that DEV has requested, together with their scores, come to the same reducer. In the reducer, we set DEVscore to the maximum URLscore and output the updated log entries associated with DEV.
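To make this concrete, the following is a minimal sketch of Algorithm 4 using the Hadoop Java API. It assumes, as described above, that every input line carries the four whitespace-separated fields DEV, DEVscore, URL and URLscore; the class and field names are illustrative only.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of Algorithm 4: update the devices' suspicious scores.
public class DevScoreUpdate {

  public static class DevMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      if (f.length < 4) return;                 // skip malformed records
      // key = DEV, value = "URL URLscore"
      ctx.write(new Text(f[0]), new Text(f[2] + " " + f[3]));
    }
  }

  public static class DevReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text dev, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      // The iterable can only be traversed once, so buffer the (URL, URLscore)
      // pairs while taking the maximum URL score as the new device score.
      List<String[]> pairs = new ArrayList<>();
      double devScore = 0.0;
      for (Text v : values) {
        String[] p = v.toString().split(" ");
        pairs.add(p);
        devScore = Math.max(devScore, Double.parseDouble(p[1]));
      }
      // Re-emit every log entry of this device with its updated score.
      for (String[] p : pairs) {
        ctx.write(new Text(dev.toString() + " " + devScore),
                  new Text(p[0] + " " + p[1]));
      }
    }
  }
}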

Algorithm 5: Mapper and Reducer used to update URLs' scores
 1  Class URLMapper
 2      method MAP(int i, string log)
 3          EMIT(URL, pair(DEV, DEVscore))
 4  Class URLReducer
 5      method REDUCE(string URL, pairs[(DEV1, DEVscore1), ...])
 6          sum ← 0
 7          degree ← 0
 8          foreach pair(DEV, DEVscore) in pairs[(DEV1, DEVscore1), ...]
 9              sum ← sum + DEVscore
10              degree ← degree + 1
11          URLscore ← sum ∗ log(degree)/degree
12          foreach pair(DEV, DEVscore) in pairs[(DEV1, DEVscore1), ...]
13              EMIT(DEV+" "+DEVscore, URL+" "+URLscore)

The second MapReduce job uses URL as the key and (DEV, DEVscore) as the value, so all the devices that have requested URL, together with their scores, come to the same reducer. In the reducer, we calculate the new URLscore and output the updated log entries associated with URL.
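A corresponding sketch of Algorithm 5, under the same assumptions about the line format and with illustrative names, might look as follows.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of Algorithm 5: update the URLs' suspicious scores.
public class UrlScoreUpdate {

  public static class UrlMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().trim().split("\\s+");
      if (f.length < 4) return;                 // skip malformed records
      // key = URL, value = "DEV DEVscore"
      ctx.write(new Text(f[2]), new Text(f[0] + " " + f[1]));
    }
  }

  public static class UrlReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      // Buffer the (DEV, DEVscore) pairs and accumulate the score sum.
      List<String[]> pairs = new ArrayList<>();
      double sum = 0.0;
      for (Text v : values) {
        String[] p = v.toString().split(" ");
        pairs.add(p);
        sum += Double.parseDouble(p[1]);
      }
      int degree = pairs.size();
      // URLscore = sum * log(degree) / degree, as in Algorithm 5.
      double urlScore = (degree > 0) ? sum * Math.log(degree) / degree : 0.0;
      // Re-emit every log entry of this URL with its updated score.
      for (String[] p : pairs) {
        ctx.write(new Text(p[0] + " " + p[1]),
                  new Text(url.toString() + " " + urlScore));
      }
    }
  }
}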

4.5 Relationship between Scalable Version and MapReduce Version

There are several connections and differences between the basic version and the MapReduce version of our algorithm. Both of them are scalable and can be run on several computers at the same time. There are open source implementations of the MapReduce model, which makes this kind of method much easier to implement in real life. To achieve scalability, the two versions adopt different strategies. The basic version tries to make sure that all the data needed for a sub-problem is on the same machine, so it partitions the graph into several parts. The MapReduce version, however, treats the problem as a large number of edges instead of a graph: it brings all the edges of one node to the same reducer via the shuffle and sort of MapReduce. There are several difficulties in implementing the basic version of our method in practice. First, how to partition the graph is a problem. We want to balance the size of each partition of the graph, such that each partition has a similar running time. This is difficult, because the size of each partition changes every day. Second, how to maintain the changing bipartite graph is another problem. We need to update the adjacency list of each node every day, which takes a lot of time. Third, the communication among multiple machines is a challenge if we distribute different partitions among different machines. In contrast, the MapReduce version of our method is very easy to implement in a real application. We do not need to do the partitioning and communication manually; MapReduce does that for us. As we do not maintain the bipartite graph explicitly, we just use different days' Internet access logs as the input. There are several open source implementations of MapReduce, such as Hadoop [1], which saves us a lot of time. Chapter 5

Experimental Results

In this chapter, we report an extensive empirical study to evaluate our approach. First, we introduce the data set used in our experiment. Second, we compare the running time, memory and disk usage of the three versions of our basic method, which are provided in chapter 4. Third, we examine the effectiveness and efficiency of the MapReduce version of our method.

5.1 Data Sets

We used real data from Fortinet. Fortinet is a company that specializes in network security appliances. It provides network security services to tens of millions of users and protects them from getting infected. Fortinet maintains a list of malicious URLs and updates this list every day using several malicious URL detection methods. Requests to malicious URLs sent by devices protected by Fortinet are blocked so that the users stay safe. Fortinet also collects the Internet access logs of its users and stores the logs on its Hadoop platform. There are more than nine hundred gigabytes of logs for one week, and we used them as the data set of our experiment. For the Internet access logs, Fortinet filtered out the requests to the 100,000 most popular websites. Because of the power law distribution of URL popularities (Figure 5.1), the requests to those 100,000 most popular websites (Table 5.1) occupy a large part of the whole Internet access logs. However, the URLs under these most popular websites are typically not malicious. They have large degrees in the bipartite graph representation and their suspicious scores may be very low according to our definition. At the same time,


Figure 5.1: Degree distribution of URLs

they are well monitored by the existing malicious URL detection mechanism.

Top Popular Websites google.com facebook.com youtube.com yahoo.com linkedin.com amazon.com twitter.com wikipedia.com blogspot.com bing.com sina.com.cn vk.com ebay.com google.de babylon.com msn.com google.co.uk soso.com google.fr rumblr.com mail.ru pinterest.com google.co.jp apple.com baidu.com live.com yandex.ru qq.com google.co.in taobao.com tumblr.com weibo.com microsoft.com PayPal.com fc2.com imdb.com ......

Table 5.1: Top popular websites that we have filtered

We used a Hadoop cluster at Fortinet consisting of 10 machines. Each machine has 64 GB of memory, 600 GB of disk storage and an Intel Xeon E5-2650 2.00 GHz CPU. The Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster. We store the Internet access logs on HDFS as files.

http://dl.downloadahsaequouzet.com/n/3.0.30.1/4540812/Mediaget.exe http://188.116.40.24/209cbqmz/z5p98n1h9.exe http://d1gd1n3cyjx2ep.cloudfront.net/mirror/speedypcpro/SSStub Somo SpeedyPC.exe http://d289gmom2gguvn.cloudfront.net/fst co 3101-ba8a7111.exe ...

Table 5.2: Malicious URLs detected by traditional methods

In Fortinet, many new malicious URLs are found every day using traditional methods. Some examples are shown in Table 5.2. Among all the malicious URLs, the exe URLs, which end with 'exe', are the most important ones, because these URLs point to executable files and users will get infected after visiting them. Fortinet has scanned these exe URLs with sandboxes, so we are sure that they are actually malicious. We use them as the input of our method; in other words, we set their suspicious scores to 1 in the first iteration of our experiment.
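To make the seeding step concrete, a minimal sketch is shown below. It assumes the known malicious exe URLs are kept in an in-memory set and that the output follows the DEV, DEVscore, URL, URLscore format used by our MapReduce jobs; the names are illustrative.

import java.util.Set;

// Illustrative seeding step: attach initial suspicious scores to one log record.
// Known malicious exe URLs (verified by sandboxing) start with score 1.0,
// every other URL and every device starts with score 0.0.
public final class Seed {
  public static String seedRecord(String dev, String url, Set<String> knownMalicious) {
    double urlScore = knownMalicious.contains(url) ? 1.0 : 0.0;
    double devScore = 0.0;
    // Produces a line in the common "DEV DEVscore URL URLscore" format.
    return dev + " " + devScore + " " + url + " " + urlScore;
  }
}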

5.2 Efficiency of Our Basic Method

We provide three versions of our basic method in chapter 4. Algorithm 1 is the method when everything (suspicious scores and graph structure) can be held in main memory. Algorithm 2 is the method when we store the suspicious scores and neighbors of URLs on disk; only the suspicious scores of devices are kept in main memory. Algorithm 3 partitions the suspicious scores of devices and the graph structure into several parts, distributes the different parts among different machines and processes them at the same time. To compare the running time, memory and disk usage of these three versions of our basic method, we cannot use the original data set, which cannot be put in main memory by algorithm 1. Instead, we randomly choose 122,480 devices and use half a day's Internet access logs of them as the data set in this part. To test the relationship between the size of the data set and the efficiency of our method, we run our method ten times on randomly selected subsets of the whole data set, which consist of k ∗ 12248 (1 ≤ k ≤ 10) devices' Internet access logs. For algorithm 3, we partition the devices and the graph into four parts in our experiment.


Figure 5.2: Running time with size of data set


Figure 5.3: Memory storage with size of data set


Figure 5.4: Disk storage with size of data set

The running time, memory usage and disk usage of our basic method's three different implementations are reported in Figure 5.2, Figure 5.3 and Figure 5.4. From Figure 5.2, we can see that algorithm 1 is the fastest one, because it puts everything in main memory. Algorithm 2 and algorithm 3 put the graph structure on disk and need to scan the disk to update the suspicious scores. For memory usage (Figure 5.3), algorithm 1 needs much more memory, because the graph structure takes much more space than the suspicious scores. From Figure 5.4, we can see that algorithm 3 needs less disk storage, because each part only stores a subgraph of the original graph.

5.3 Effectiveness of Our Method

We used the Internet access logs of one week (February 1st, 2014 - February 7th, 2014) for our experiment in this part. The initial malicious URLs were the 11,376 URLs ending with 'exe' that Fortinet found from January 7th to February 7th. The experiment was done on the Hadoop platform using the MapReduce version of our method. In the following, we verify the effectiveness of our method in two aspects: the malicious URLs found by our method and the possibly infected devices.

5.3.1 Effectiveness of Malicious URLs Found by Our Method

After three iterations of our method, we sorted all the URLs by their suspicious scores. The URLs with high suspicious scores were the output URLs of our method. We filtered out the input malicious URLs from the output list. We also filtered out the URLs that point to CSS, JavaScript or image files, because these URLs are typically not malicious. VirusTotal [8] is a free service that analyzes suspicious URLs and facilitates the quick detection of viruses, worms, trojans, and all kinds of malware. It aggregates the results of different antivirus engines, website scanners and security companies. Given a URL, VirusTotal provides the results from different vendors about whether this URL is malicious or not. We use VirusTotal to verify the effectiveness of our output URLs, and we use VirusTotal's API as the ground truth of our experiment. For a URL, if at least one company flags it as malicious, we treat this URL as malicious. Because of the limitations of VirusTotal's API, we only sent the top 1000 suspicious URLs found by our method to VirusTotal. We report the accuracy of the top K URLs found by our method in the following.
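For reference, a minimal sketch of such a check is shown below. It assumes VirusTotal's public v2 url/report endpoint and an API key, and the crude string search for the "positives" field stands in for proper JSON parsing; none of this is the exact code used in our experiments.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Query VirusTotal for one URL and report how many engines flag it as malicious.
public final class VirusTotalCheck {
  public static int positives(String apiKey, String suspiciousUrl) throws Exception {
    String query = "https://www.virustotal.com/vtapi/v2/url/report"
        + "?apikey=" + URLEncoder.encode(apiKey, "UTF-8")
        + "&resource=" + URLEncoder.encode(suspiciousUrl, "UTF-8");
    StringBuilder body = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
        new URL(query).openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) body.append(line);
    }
    // Crude extraction of the "positives" field from the JSON response.
    int idx = body.indexOf("\"positives\":");
    if (idx < 0) return -1;                  // URL unknown to VirusTotal
    String tail = body.substring(idx + 12).trim();
    return Integer.parseInt(tail.split("[,}]")[0].trim());
  }
}

A URL is then counted as malicious when the returned number of positives is at least one (or at least two or three for the stricter definitions used later in Figure 5.7).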


Figure 5.5: Accuracy of Top K URLs found by our method

From Figure 5.5, we can see that the accuracy decreases with the increase of K. This means that the URLs with higher suspicious scores are more likely to be recognized as malicious by other companies. We can see that the accuracy of the top 200 suspicious URLs is more than 0.7. When we increase K to 1000, the accuracy is still about 0.45.


Figure 5.6: Accuracy of Top K URLs that end with ’exe’ or ’php’

The URLs may end with different file types, such as html, exe, php, asp and so on. Among all the malicious URLs, the ones that end with 'exe' are the most important ones. Besides the exe URLs, the URLs that end with 'php' also attract our attention. In recent years, PHP has become more and more popular in web development, and a lot of open source web systems are based on PHP. It is very easy for a cracker to hack into a PHP website and use it as a CC server. We also find that most CC servers' URLs end with 'php'. We extract all the output URLs that end with 'exe' or 'php' and submit them to VirusTotal to get the detection accuracy. The result is reported in Figure 5.6. From the result we can see that the accuracy of the exe and php URLs is even better than that of all the URLs. In the previous two experiments, we assume that a URL is malicious if at least one company treats it as malicious. Most of the time, this assumption holds because the probability that a clean URL is classified as malicious by another security company is very


Figure 5.7: Accuracy of Top K URLs with different definition of being malicious

small. However, there are often some mistakes. If a URL has been classified as malicious by two or even three companies, we will be more confident that this URL is actually malicious. From Figure 5.7 we can see that the accuracy is still pretty high after raising the standard of the malicious URL definition.


Figure 5.8: Accuracy after one week and two weeks

It is possible that our method detects some malicious URLs faster than other methods. In this case, even if a URL is not recognized as malicious by any company currently, it can still be a malicious URL. So we submitted our result URLs to the VirusTotal API again and checked the results on February 15th, 2014 and February 23rd, 2014. From Figure 5.8, we can see that some URLs were classified as malicious only after February 8th. In other words, our method can find malicious URLs that are unknown to all the other companies at that time. Our algorithm is iterative and can be run for several iterations. The previous experiments are all done with the result of the third iteration. For the first and the second iterations, we also test their precision using VirusTotal's API. From Figure 5.9, we can see that the results of different iterations are almost the same. We also provide the top 10 malicious URLs of the first and the second iterations in Table 5.3. The top 10 URLs with the largest suspicious scores are almost the same across iterations; only the order changes a little. In real production, we do not need to run our method for many iterations; two or three iterations are good enough. Our method can achieve pretty high accuracy for the top 1000 URLs with high suspicious scores. However, it is impossible for us to measure the recall of our method, because the limitations of the VirusTotal API prevent us from getting the results for all the URLs in our experiment. If


Figure 5.9: Accuracy of different iterations

we sample some URLs from all the URLs in the experiment, only a very small percentage of them may be malicious. For example, if we randomly sample 1000 URLs from our experiment and submit them to the VirusTotal API, there may be only two of them that are recognized as malicious by other companies. Even if both of these two URLs are also flagged by our method, it does not mean that the recall of our method is 1; the number is meaningless in this case.

5.3.2 Effectiveness of Infected Devices Found By Our Method

Other than malicious URLs, our method can also find possibly infected devices. In this part, we do some case studies to analyze possibly infected devices found by our method. We found a possibly infected device called D1, whose suspicious URL visits are listed in Table 5.4. It downloaded an exe file for Microsoft Excel from softonic.com. Unfortunately, this exe file contains malware and the device got infected. After the infection, the malware downloaded several updates from http://d18okb3pa33axu.cloudfront.net/. Those suspicious URLs have been classified as malicious by several security companies, such as Avira, Emsisoft, SCUMWARE, Sophos and Websense ThreatSeeker. According

top 10 URLs of first iteration                   top 10 URLs of second iteration
http://eib.su/beta.php                           http://eib.su/beta.php
http://188.225.33.165/ssdc16372/gate.php         http://androbandro.com/bla/gate.php
http://androbandro.com/bla/gate.php              http://188.225.33.165/ssdc16372/gate.php
http://196.196.8.53/use.exe                      http://92.53.105.245/ssdc16372/file.php
http://92.53.105.245/ssdc16372/file.php          http://92.53.105.175/ssdc16372/file.php
http://92.53.105.175/ssdc16372/file.php          http://speakstyle.net/mirror.php
http://speakstyle.net/mirror.php                 http://196.196.8.53/use.exe
http://eagletoy.com/future.php                   http://idealjoy.com/dandy.php
http://molenerin.com/pon/forum.php               http://weekcafe.com/philippinen.php
http://server.flashxpo.co.uk/ap68nkt.php         http://molenerin.com/pon/forum.php

Table 5.3: Comparison of the top 10 URLs of the first and second iterations

to SCUMWARE, this URL is distributing a malware variant of Win32/ChatZum. We are pretty sure that this device has been infected by malware.

Suspicious URLs that D1 has visited                                                              Suspicious score
http://dplus.en.softonic.com/ud-client/121000/121166/SoftonicDownloader for microsoft-excel.exe   1.31
http://d18okb3pa33axu.cloudfront.net/SHIELDAPPS/pcreg.exe                                          0.88
http://d18okb3pa33axu.cloudfront.net/SHIELDAPPS/taskinst14.exe                                     0.85
http://d18okb3pa33axu.cloudfront.net/SHIELDAPPS/pcreg.exe                                          0.81
http://d18okb3pa33axu.cloudfront.net/SHIELDAPPS/taskinst14.exe                                     0.79

Table 5.4: Suspicious URLs that D1 has visited

In this example, cloudfront.net is a hostname of Amazon CloudFront, which is part of Amazon Web Services. We can see that Amazon Web Services is also used by crackers to distribute malware and viruses. It is very easy for crackers to change the URLs of their malware and viruses quickly nowadays: once a URL is blocked by the Internet security companies, they switch to another URL. This poses a new challenge for malicious URL detection, and detecting malicious URLs one by one using traditional methods becomes more and more difficult. However, our method can always find the most recent malicious URLs of a malware or virus by mining the Internet access logs of the possibly infected devices.

We found another possibly infected device D2, whose suspicious URL requests are listed in Table 5.5. We think D2 may have been infected, because it sent several HTTP requests to a list of host names whose suspicious scores are pretty high. The reason that these URLs have large suspicious scores in our method is that many devices, which have downloaded the same kind of malware, are requesting them at the same time. After a security expert inspected this device, we are sure that it is infected by the Pushdo/Cutwail spamming botnet. The Cutwail botnet, which appeared around 2007, is a botnet mostly involved in sending spam e-mails. The bot is typically installed on infected machines by a Trojan component called Pushdo. The first generation of Pushdo used a clear-text HTTP request (with many different parameters) to communicate with its CC servers. This was encrypted in the second generation. The second generation of Pushdo also generated a lot of fake SSL traffic to legitimate websites, trying to hide its communication data amongst it. This is why we have seen a large number of requests to a list of legitimate websites from these infected devices. After getting infected by a binary file, the bot decrypts the data, which contains not only the CC server domain names, but also the names of many other legitimate domains. Obviously, the cracker is trying to hide the CC server domain amongst the many other legitimate domain names to make it hard to pick out during static analysis. It also tries to hide its communication data amongst other legitimate website traffic when undergoing dynamic analysis. This makes it very hard for traditional malicious URL detection methods to find the CC server. However, this is not a problem for our method, because the devices that visit this list of URLs will be recognized as infected by our method, and we can always get the latest list of legitimate websites from the logs.

5.4 Efficiency of Our MapReduce Method

For malicious URL detection methods, efficiency matters. We need to find the malicious URLs very quickly to prevent users from getting infected. For our method, we are facing a huge amount of data and we rely on MapReduce to speed up the execution. How to set the numbers of Mapper tasks and Reducer tasks is then a problem. In this part, we only use one day's log data (137 GB) for the experiment, which saves us a lot of time.

Suspicious URLs that D2 has visited      Suspicious score
sgprinting.ca                            0.77
naijagurus.com                           0.76
wlf.louisiana.gov                        0.76
bredainternet.nl                         0.75
shakeyspizza.ph                          0.74
rodeoshow.com.au                         0.72
churchsupplies.net                       0.71
fraser-high.school.nz                    0.71
x-cellcommunications.de                  0.70

Table 5.5: Suspicious URLs that D2 has visited

5.4.1 Number of Mappers

As we use files as the input of our MapReduce jobs, the number of Mapper tasks is usually driven by the number of DFS blocks in the input files. If an input file is 100 MB and the HDFS block size is 64 MB, the file takes 2 blocks, so 2 map tasks will be spawned. The right level of parallelism for maps seems to be around 10-100 maps per node. Basically, our map tasks read all the log data and produce intermediate key-value pairs for the reducers. Because our map tasks are CPU-light, the number of mappers does not affect the running time a lot; they are I/O-bound and must read the same total amount of data regardless of how many mappers there are. The best choice is simply to use the default HDFS block size and let Hadoop determine the number of mappers for us. For our method, the running time of the Mapper tasks is much shorter than the running time of the Reducer tasks. Even if the Mappers' running time changed a little, it would not affect the overall running time of MapReduce much. So we focus on improving the Reducers' running time.
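As a rough worked example, with the default 64 MB block size, the one day's log of about 137 GB used in this section would be split into roughly

    ⌈137 ∗ 1024 / 64⌉ = 2192 blocks,

so Hadoop would spawn on the order of 2,200 map tasks for that job.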

5.4.2 Number of Reducers

We can control the number of reducers by calling conf.setNumReduceTasks of JobConf. According to the official Hadoop documentation, the right number of reduces seems to be 0.95 or 1.75 * (#nodes * mapred.tasktracker.tasks.maximum). At 0.95, all of the reducers can launch immediately and start transferring map outputs as the maps finish. At 1.75, the


Figure 5.10: Running time with different number of reducers

faster nodes will finish their first round of reduces and launch a second round, doing a much better job of load balancing. As we have 10 machines in our Hadoop cluster, we decided to test the running time with 5, 10, 15, 20, 25, 30, 35 and 40 reducers. Because our jobs are not the only ones running on Fortinet's Hadoop cluster, their running time is also affected by other jobs, so we get different running times if we run the same job multiple times. Therefore, we run the same job five times and take the average running time as the result, which is plotted in Figure 5.10, together with the minimal and maximal running time for each number of reducers. From the figure, the best number of Reducer tasks for our cluster is 25. When we set the number of reducers to 5, only 5 machines are doing the reduce work and we cannot take full advantage of the 10 machines in the cluster; this wastes time, and the load on each machine is very high because every reducer takes on a lot of work. Setting a very large number of reducers also hurts the running time: more reducers mean more total startup time and larger network overhead between the mappers and the reducers.
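As a small illustration, fixing the reducer count that worked best on our cluster with the JobConf API mentioned above would look roughly like the following sketch (the driver class name is only a placeholder).

import org.apache.hadoop.mapred.JobConf;

// Illustrative driver fragment: pin the number of reduce tasks for one job.
public final class ReducerCountExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(ReducerCountExample.class);
    conf.setNumReduceTasks(25);   // 25 reducers worked best on our 10-machine cluster
    return conf;
  }
}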


Figure 5.11: Running time with size of dataset

We also split one day's Internet access logs into five partitions and test the running time of the Reducer tasks using different numbers of partitions as the data set. The minimal, maximal and average running times are plotted in Figure 5.11. We can see that the running time increases with the size of the data set.

5.4.3 Number of Machines in Hadoop Cluster

The number of machines in the Hadoop cluster also affects the running time of our method. More machines mean more computing power and may reduce the total running time. However, too many machines will not help, because they also increase the overhead of network transmission. It is difficult for us to change the number of machines in the Hadoop cluster in the experiment, and there are a lot of other MapReduce jobs running on Fortinet's Hadoop platform every day. So we do not test the effect of the number of machines in this thesis. Chapter 6

Conclusions

In this thesis, we formulate the problem of malicious URL detection by mining Internet access log data. We develop a practical algorithm that can find unknown malicious URLs and possibly infected devices based on known malicious URLs. We also discuss how to organize data storage when facing a large amount of data: our strategy of partitioning the devices and the graph structure allows us to distribute the storage and computation among multiple machines. We also provide a MapReduce version of our method, which is much easier to implement. Our extensive empirical study using real data sets from Fortinet, a leader in the network security industry, clearly shows the effectiveness and efficiency of our algorithm. As future work, there are several directions we can further pursue.

• There may be better methods to update the suspicious scores of devices and URLs in each iteration. Currently, we set the suspicious score of a device as the maximum suspicious score of all the URLs that it visits. The updated suspicious score of a URL is determined by the average suspicious score of its neighbors and the degree of this URL. Our algorithm is based on our two assumptions in chapter 3. More sophisticated methods to update the suspicious scores of URLs and devices may bring better results.

• We do not consider the timestamps in the Internet access logs in this thesis. If we propagate the suspicious scores according to the timestamps, we may get better results for the detection of new malicious URLs. However, this also makes the problem much more complicated.


• For the detection of possibly infected devices, there are many other directions we can try. We can detect possibly infected devices by analyzing their Internet visits: if a device visits a URL frequently over a period of time and has never visited this URL before, this device may be infected. We can also compare the Internet visits of similar devices; the outliers of a group of devices should attract our attention. Bibliography

[1] The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. http://hadoop.apache.org/.

[2] FortiGuard URL lookup tool provides a category for each URL. http://www.fortiguard.com/ip_rep.php.

[3] McAfee SiteAdvisor software is a free browser plug-in that provides simple web site safety ratings and a secure search box. http://www.siteadvisor.ca/.

[4] PhishTank is an anti-phishing site. http://www.phishtank.com/.

[5] Realtime URI blacklist. http://uribl.com/.

[6] Safe Browsing is a service provided by Google that enables applications to check URLs against Google's constantly updated lists of suspected phishing and malware pages. http://www.google.ca/tools/firefox/safebrowsing/.

[7] Spam and Open Relay Blocking System. http://www.sorbs.net/.

[8] VirusTotal is a free service that analyzes suspicious files and URLs and facilitates the quick detection of viruses, worms, trojans, and all kinds of malware. https://www.virustotal.com/.

[9] WebMapReduce in education. http://webmapreduce.sourceforge.net/education.php.

[10] Mitsuaki Akiyama, Takeshi Yagi, and Mitsutaka Itoh. Searching structural neighborhood of malicious URLs to improve blacklisting. In Applications and the Internet (SAINT), 2011 IEEE/IPSJ 11th International Symposium on, pages 1–10. IEEE, 2011.

[11] Hyunsang Choi, Bin B. Zhu, and Heejo Lee. Detecting malicious web links and identifying their attack types. In Proceedings of the 2nd USENIX Conference on Web Application Development, WebApps'11, pages 11–11, Berkeley, CA, USA, 2011. USENIX Association.


[12] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[13] Andreas Dewald, Thorsten Holz, and Felix C. Freiling. ADSandbox: sandboxing JavaScript to fight malicious websites. In SAC, pages 1859–1864, 2010.

[14] Sujata Garera, Niels Provos, Monica Chew, and Aviel D. Rubin. A framework for detection and measurement of phishing attacks. In Proceedings of the 2007 ACM Workshop on Recurring Malcode, WORM '07, pages 1–8, New York, NY, USA, 2007. ACM.

[15] Da Huang, Kai Xu, and Jian Pei. Malicious URL detection by dynamically mining patterns without pre-defined elements. World Wide Web, pages 1–20, 2012.

[16] Karen Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.

[17] Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using URL features. In CIKM, pages 325–326, 2005.

[18] Min-Sheng Lin, Chien-Yi Chiu, Yuh-Jye Lee, and Hsing-Kuo Pao. Malicious URL filtering - a big data application. In BigData Conference, pages 589–596, 2013.

[19] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In KDD, pages 1245–1254, 2009.

[20] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying suspicious URLs: an application of large-scale online learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 681–688. ACM, 2009.

[21] D. Kevin McGrath and Minaxi Gupta. Behind phishing: An examination of phisher modi operandi. In LEET, 2008.

[22] Michael G. Moore and Greg Kearsley. Distance education: A systems view of online learning. Cengage Learning, 2011.

[23] Alexander Moshchuk, Tanya Bragin, Steven D. Gribble, and Henry M. Levy. A crawler-based study of spyware in the web. In NDSS, 2006.

[24] Hsing-Kuo Pao, Yan-Lin Chou, and Yuh-Jye Lee. Malicious URL detection based on Kolmogorov complexity estimation. In Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01, pages 380–387. IEEE Computer Society, 2012.

[25] Niels Provos, Panayiotis Mavrommatis, Moheeb Abu Rajab, and Fabian Monrose. All your iframes point to us. In USENIX Security Symposium, pages 1–16, 2008.

[26] Mahmoud T. Qassrawi and Hongli Zhang. Detecting malicious web servers with honeyclients. JNW, 6(1):145–152, 2011.

[27] Christian Seifert, Ian Welch, and Peter Komisarczuk. Identification of malicious web pages with static heuristics. In Telecommunication Networks and Applications Conference, 2008. ATNAC 2008. Australasian, pages 91–96. IEEE, 2008.

[28] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen, and Samuel T. King. Automated web patrol with strider honeymonkeys: Finding web sites that exploit browser vulnerabilities. In NDSS, 2006.

[29] Peilin Zhao and Steven C.H. Hoi. Cost-sensitive online active learning with application to malicious URL detection. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 919–927. ACM, 2013.