Suspicious Url and Device Detection by Log Mining
SUSPICIOUS URL AND DEVICE DETECTION BY LOG MINING

by

Yu Tao
B.Sc., University of Science and Technology of China, 2012

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science
in the School of Computing Science, Faculty of Applied Sciences

© Yu Tao 2014
SIMON FRASER UNIVERSITY
Spring 2014

All rights reserved. However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for "Fair Dealing." Therefore, limited reproduction of this work for the purposes of private study, research, criticism, review and news reporting is likely to be in accordance with the law, particularly if cited appropriately.

Approval

Name: Yu Tao
Degree: Master of Science
Title of Thesis: Suspicious URL and Device Detection by Log Mining

Examining Committee:
Dr. Greg Mori, Associate Professor (Chair)
Dr. Jian Pei, Professor (Senior Supervisor)
Dr. Jiangchuan Liu, Associate Professor (Supervisor)
Dr. Wo-Shun Luk, Professor (Internal Examiner)

Date Approved: April 22, 2014

Partial Copyright Licence

Abstract

Malicious URL detection is a very important task in Internet security intelligence. Existing works rely on inspecting web page content and URL text to determine whether a URL is malicious or not. Many new malicious URLs emerge on the web every day, which makes it inefficient and unscalable to scan URLs one by one using traditional methods. In this thesis, we harness the power of big data to detect unknown malicious URLs based on known ones with the help of Internet access logs. Using our method, we can find not only related malicious URLs, but also URLs of new updates and C&C (command and control) servers for existing malware, botnets and viruses. In addition, we can also detect possibly infected devices. We also discuss how to scale our method up to huge data sets, reaching hundreds of gigabytes in our experiment.
Our extensive empirical study using real data sets from Fortinet, a leader in the Internet security industry, shows the effectiveness and efficiency of our method.

Dedication

To my parents.

Quotation

"Men love to wonder, and that is the seed of science."
Ralph Waldo Emerson (1803-1882)

Acknowledgments

I would like to express my sincerest gratitude to my senior supervisor, Dr. Jian Pei, who provided creative ideas for my research and warm encouragement in my life. Throughout my master's studies, he shared with me not only valuable knowledge but also the wisdom of life. Without his help, I could never have accomplished this thesis.

My gratitude also goes to my supervisor, Dr. Jiangchuan Liu, for reviewing my work and for his helpful suggestions, which improved my thesis. I am also grateful to Dr. Wo-Shun Luk and Dr. Greg Mori for serving on my examining committee.

I thank Guanting Tang, Xiao Meng, Juhua Hu, Xiangbo Mao, Xiaoning Xu, Chuancong Gao, Yu Yang, Li Xiong, Lin Liu, Beier Lu and Jiaxing Liang for their kind help during my study at SFU. I am also grateful to my friends at Fortinet. I thank Kai Xu for his guidance and insightful suggestions.

My deepest gratitude goes to my parents. Their endless love supported me in overcoming all the difficulties in my study and life.
Contents

Approval
Partial Copyright Licence
Abstract
Dedication
Quotation
Acknowledgments
Contents
List of Tables
List of Figures

1 Introduction
  1.1 Background and Motivation
  1.2 Challenges
  1.3 Major Idea
  1.4 Contributions
  1.5 Thesis Organization

2 Related Work
  2.1 Blacklisting
  2.2 Heuristics Based Methods
  2.3 Classification Based Methods
    2.3.1 Content Based Methods
    2.3.2 URL Based Methods

3 Problem Definition and Graph Representation
  3.1 Problem Definition
  3.2 Bipartite Graph Representation
  3.3 Assumptions

4 Scalable Methods
  4.1 The Basic Method
  4.2 Limitation of Our Method
  4.3 Data Storage
    4.3.1 Data Storage of Graph Structure
    4.3.2 Data Storage of URLs' Suspicious Scores
    4.3.3 Data Storage of Devices' Suspicious Scores
  4.4 MapReduce Approach
  4.5 Relationship between Scalable Version and MapReduce Version

5 Experimental Results
  5.1 Data Sets
  5.2 Efficiency of Our Basic Method
  5.3 Effectiveness of Our Method
    5.3.1 Effectiveness of Malicious URLs Found by Our Method
    5.3.2 Effectiveness of Infected Devices Found by Our Method
  5.4 Efficiency of Our MapReduce Method
    5.4.1 Number of Mappers
    5.4.2 Number of Reducers
    5.4.3 Number of Machines in Hadoop Cluster

6 Conclusions

Bibliography

List of Tables

1.1 Malicious URLs with the same IP address
1.2 Malicious URLs from the same family of virus
5.1 Top popular websites that we have filtered
5.2 Malicious URLs detected by traditional methods
5.3 Comparison of top 10 URLs of first and second iteration
5.4 Suspicious URLs that D1 has visited
5.5 Suspicious URLs that D2 has visited

List of Figures

3.1 First example of bipartite graph representation
3.2 Second example of bipartite graph representation
4.1 Store the adjacency list of the bipartite graph on disk
4.2 Store the suspicious scores of URLs on disk with the neighbors
4.3 Partition the suspicious scores of devices and the graph structure
4.4 Overview of MapReduce framework
5.1 Degree distribution of URLs
5.2 Running time with size of data set
5.3 Memory storage with size of data set
5.4 Disk storage with size of data set
5.5 Accuracy of top K URLs found by our method
5.6 Accuracy of top K URLs that end with 'exe' or 'php'
5.7 Accuracy of top K URLs with different definition of being malicious
5.8 Accuracy after one week and two weeks
5.9 Accuracy of different iterations
5.10 Running time with different number of reducers
5.11 Running time with size of dataset
Chapter 1

Introduction

In this chapter, we first briefly introduce the background of Internet security, how web based attacks work, and the motivation and challenges of malicious URL detection. Then, we summarize our major contributions and describe the structure of the thesis.

1.1 Background and Motivation

The development of the Internet not only improves our quality of life and drives new opportunities for commerce, but also creates opportunities for malicious attacks. Attackers design web based attacks to achieve several goals, such as installation of malware and viruses, spam-advertised commerce, identity theft, financial fraud and botnet information flow. Identifying web based attacks and guarding the safety of users on the Internet is therefore very important.

Several factors make the identification of web based attacks challenging. The first is the large scale of the World Wide Web. The number of websites is huge, and different websites provide different kinds of data and services, which makes it difficult to distinguish attack websites from benign websites. Second, attackers can disguise their attacks at any time and duplicate them in multiple locations.

Most web based attacks share a common pattern: the attackers put their attack code on the web and attract users to visit it via its Uniform Resource Locator (URL). As a result, users need to evaluate the associated risk when deciding whether to click on an unfamiliar URL. Is this URL safe, or will it infect the computer after being clicked? This is a very difficult decision for users to make.
http://coolstowage.com/ponyb/gate.php
http://coolstowage.com/2/gate.php
http://deeliodin.com/ponyb/gate.php
http://couponwalla.com/ponyb/gate.php
http://dealdin.com/ponyb/gate.php
http://coolstowage.com/ponyb/gate.php

Table 1.1: Malicious URLs with the same IP address

There are various systems that help users decide whether a URL is safe to click on. In recent years, the most common method, used in web filtering applications, search engines and browser toolbars, is blacklisting. The bad URLs that direct users to web based attacks are called malicious URLs. Internet security companies maintain a list of malicious URLs, which is called a blacklist. After a user clicks on a URL, the URL is checked against the blacklist. If the URL is in the blacklist, the user is prevented from visiting it. How to maintain and update a blacklist is a key issue for Internet security companies. Currently, blacklists are constructed