Towards Detection and Prevention of Malicious Activities Against Web Applications and Services

Dissertation zur Erlangung des Grades eines Doktor-Ingenieurs der Fakultät für Elektrotechnik und Informationstechnik an der Ruhr-Universität Bochum

vorgelegt von

Apostolos Zarras aus Athen, Griechenland

Bochum, August 2015

Tag der mündlichen Prüfung: 27.08.2015

Gutachter: Prof. Dr. Thorsten Holz, Ruhr-Universität Bochum

Zweitgutachter: Prof. Dr. Herbert Bos, Vrije Universiteit Amsterdam

Abstract

The Internet has witnessed tremendous growth over the last years. Undoubtedly, its services, and above all the World Wide Web, have become an integral part of the lives of hundreds of millions of people, who use them on a daily basis. Unfortunately, as the Internet’s popularity increases, so does the interest of attackers who seek to exploit vulnerabilities in users’ machines. What was primarily meant to be a competition among computer experts to test and improve their technical skills has turned into a multi-billion dollar business. Nowadays, attackers try to take control of vulnerable computers that allow them to perform their nefarious tasks, such as sending spam emails, launching distributed denial-of-service (DDoS) attacks, generating revenue from online advertisements by performing click fraud, or stealing personal data like email accounts and banking credentials.

In this dissertation, we address the security issues online users face every day from two points of view. First, we investigate how infected computers that constitute a botnet—a network of compromised machines which are remotely controlled by an entity known as the botmaster—perform their malicious activities, and we propose countermeasures against them. We study two of the most fundamental Internet protocols, SMTP and HTTP, and leverage the fact that numerous entities, including cybercriminals, implement these protocols with subtle but perceivable differences, which we can accurately detect. We then develop novel mitigation techniques that utilize these discrepancies to block compromised computers from successfully executing the commands issued by a botmaster. Second, we examine different ways in which attackers exploit the web infrastructure to infect new victims. We initially study the alliances formed among web spammers that aim to boost the page rank of their websites. As a result, these websites gain more popularity and can potentially host exploits. We then move forward and investigate the extent to which attackers can abuse the logic of search engine crawlers in order to perform various attacks. Next, we evaluate the ecosystem of online advertisements and analyze the risks related to their abuse. Finally, we propose a system that can protect users from online threats while surfing the web.


Zusammenfassung

Das Internet hat in den letzten Jahren ein enormes Wachstum erfahren. Seine Anwendungen, allen voran das World Wide Web, sind ein wesentlicher Bestandteil des täglichen Lebens für Millionen von Menschen geworden. Mit der steigenden Popularität des Internets wuchs auch das Interesse von Angreifern, die nach Schwachstellen in den Computern suchen, um sich diese zu Eigen zu machen. Was als eine Art Wettkampf bzw. Kräftemessen unter Computerexperten begann, um technische Fähigkeiten zu testen und weiterzuentwickeln, hat sich mittlerweile in ein Multi-Milliarden-Dollar-Geschäft gewandelt. Heutzutage versuchen Angreifer, anfällige Computer unter ihre Kontrolle zu bekommen, um Spam-Emails zu versenden, “distributed denial-of-service” (DDoS) Attacken durchzuführen, Profit aus Online-Anzeigen durch sogenannte Click-Frauds zu erzielen oder aber sensible Daten aus Email- und Bankkonten zu stehlen.

In dieser Dissertation wenden wir uns von zwei Seiten den Sicherheitsproblemen zu, mit denen Internet-Benutzer täglich konfrontiert sind. Erstens untersuchen wir, wie infizierte Computer in einem Botnetz—ein Netzwerk aus infiltrierten und von einem Botmaster ferngesteuerten Computern—ihre Angriffe ausüben, und präsentieren entsprechende Gegenmaßnahmen. Dazu untersuchen wir zwei integrale Protokolle des Internets, SMTP und HTTP, und machen uns zunutze, dass diese Protokolle von unterschiedlichen Akteuren wie z.B. Cyberkriminellen in leicht unterschiedlicher Weise genutzt werden. Durch die Analyse dieser Unterschiede können wir neue Methoden zur Schadensbegrenzung entwickeln, die infizierte Computer davon abhalten, die Befehle eines Botmasters auszuführen. Zweitens untersuchen wir verschiedene Methoden, mit denen die Angreifer über das Web neue Computer infizieren. Hierbei untersuchen wir zum einen Allianzen von Web-Spammern, deren Ziel die Steigerung der so genannten PageRanks bösartiger Internetseiten ist. Ein höherer PageRank führt zu einer erhöhten Sichtbarkeit einer Webseite und typischerweise daher auch zu höheren Infektionsraten. Zum anderen ermitteln wir, in welchem Ausmaß Angreifer die Logik von Web-Crawlern von Suchmaschinen ausnutzen, um verschiedene Angriffe durchzuführen. Daneben betrachten wir, wie Kriminelle Online-Werbung für ihre Zwecke missbrauchen. Schließlich schlagen wir aufbauend auf unseren Untersuchungen ein System vor, das Anwendern helfen kann, sicher im Web zu surfen.


Acknowledgements

It is a great pleasure to acknowledge everyone who has significantly influenced and inspired my life during my graduate studies. Thus, in this part, I would like to take the time to thank every single one of them.

First and foremost, I would like to express my special appreciation to my advisor Thorsten Holz for the valuable guidance and support he showed me over the past four years. I would like to thank him for giving me the opportunity to focus on topics that I was interested in, encouraging my research, and allowing me to grow as a research scientist.

Next, I would like to thank all former and current members of the Chair for Systems Security of Ruhr-University Bochum who contributed to a pleasant and productive environment. I recall countless meetings in which we had to switch to English because of me. Thank you guys, and sorry for the inconvenience.

During the time of my Ph.D. studies I had the privilege of collaborating with researchers from different universities and countries. This work would not have been possible without all these folks. Apart from this remote collaboration, I also had the chance of visiting the security labs of the University of California, Santa Barbara (UCSB) and Institut Eurécom. There, I met and worked together with some really intelligent people.

I cannot help but thank all my old friends along with the new friends I made here in Bochum. The time we spent together was a welcome break in my daily routine.

I would also like to express my deepest gratitude to my parents Nikos and Maria, my beloved little sister Nantia, and my grandparents for their support, patience, encouragement, and wise advice during my whole life. I never would have made it through without their love and support.

Last, but certainly not least, I would like to thank my wife Eva, who has been a bright light in my life. I am truly grateful for her unconditional love, support, and patience. Without her, I would not have been able to balance my research with everything else. I cannot wait until the day we will meet our baby for the first time.

This dissertation is dedicated to all these people who believed in me from the time of my birth until now.


Contents

1 Introduction
   1.1 Motivation
   1.2 Dissertation Scope
   1.3 Contributions
   1.4 Publications
   1.5 Outline

I Botnet Detection and Mitigation

2 Spam Mitigation
   2.1 Introduction
   2.2 SMTP Protocol
   2.3 SMTP Dialects
   2.4 Learning Dialects
      2.4.1 Learning Algorithm
      2.4.2 Collecting SMTP Conversations
   2.5 Matching Conversations to Dialects
      2.5.1 Building the Decision State Machine
      2.5.2 Making a Decision
      2.5.3 Applying the Decision
   2.6 Botnet Feedback Mechanism
   2.7 System Overview
   2.8 Evaluation
      2.8.1 Evaluating the Dialects
      2.8.2 Evaluating the Feedback Manipulation
   2.9 Discussion
      2.9.1 Evading Dialects Detection
      2.9.2 Mitigating Feedback Manipulation
   2.10 Related Work
   2.11 Summary

3 HTTP-Based Mitigation
   3.1 Introduction
   3.2 HTTP Protocol
   3.3 HTTP-Based Malware
   3.4 HTTP-Level Detection
      3.4.1 Header Chains
      3.4.2 HTTP Templates
   3.5 System Overview
      3.5.1 Virtual Machine Zoo
      3.5.2 Learner
      3.5.3 Decision Maker
   3.6 Evaluation
      3.6.1 Establishing the Ground Truth
      3.6.2 Model Generation in Various Web Clients
      3.6.3 Detection Accuracy
      3.6.4 Popular Malware Families
      3.6.5 Real-World Deployment
      3.6.6 Advanced Persistent Threats
   3.7 Discussion
      3.7.1 …
      3.7.2 Randomness
   3.8 Related Work
   3.9 Summary

II Web Security

4 Revealing the Relationship Network Behind Link Spam
   4.1 Introduction
   4.2 Web Page Ranking
   4.3 Web Spam
   4.4 Data Collection
      4.4.1 Study Objectives
      4.4.2 Data Collection Architecture
   4.5 SEO Forums Analysis
      4.5.1 Spammers Behavior
      4.5.2 Spam Pages Categorization
   4.6 …
   4.7 Relationship Network Graph
   4.8 Summary of Findings
   4.9 Discussion
   4.10 Related Work
   4.11 Summary

5 Abusing Crawlers for Indirect Web Attacks
   5.1 Introduction
   5.2 Web Vulnerabilities and Exploits
   5.3 Security Problems
      5.3.1 Attack Model
      5.3.2 Blackhat SEO Attacks
      5.3.3 Indirect Attacks
   5.4 Susceptibility Assessment
      5.4.1 Methodology and Measurement Infrastructure
      5.4.2 Findings
   5.5 Defenses
      5.5.1 Stopping Indirect Attacks
      5.5.2 Stopping Blackhat SEO Attacks
   5.6 Discussion
      5.6.1 Attack Models
      5.6.2 Measurement
      5.6.3 Deterministic Defenses
      5.6.4 Learning-Based Defenses
   5.7 Related Work
   5.8 Summary

6 Understanding Malicious Advertisements
   6.1 Introduction
   6.2 Malicious Advertising
      6.2.1 Drive-by Downloads
      6.2.2 Deceptive Downloads
      6.2.3 Link Hijacking
   6.3 Methodology
      6.3.1 Data Collection
      6.3.2 Oracle
   6.4 Analyzing Malvertisements
      6.4.1 Type of Maliciousness
      6.4.2 Identifying Risky Advertisers
      6.4.3 Ad Arbitration
      6.4.4 Secure Environment
   6.5 Discussion
      6.5.1 Ad Networks Filtering
      6.5.2 Last Line of Defense
   6.6 Related Work
   6.7 Summary

7 Fake Client Honeypots
   7.1 Introduction
   7.2 Client Honeypots
      7.2.1 Low-Interaction Client Honeypots
      7.2.2 High-Interaction Client Honeypots
      7.2.3 Hybrid Client Honeypots
   7.3 System Overview
      7.3.1 Threat Model
      7.3.2 Design Details
   7.4 Evaluation
      7.4.1 Experimental Environment
      7.4.2 Protection Effectiveness
   7.5 Discussion
      7.5.1 User Interaction Interference
      7.5.2 File Content Verification
   7.6 Related Work
   7.7 Summary

8 Conclusion

List of Figures

List of Tables

Bibliography

Curriculum Vitae

“If I have seen further than others, it is by standing upon the shoulders of giants.”

Isaac Newton

1 Introduction

“People don’t want to talk about death, just like they don’t want to talk about computer security. Maybe I should have named my workstation Fear. People are so motivated by fear.”

Dan Farmer

The Internet has become the medium of choice for people to search for information, conduct business, and enjoy entertainment. It offers an abundance of information accessible to anyone with an Internet connection. Recently, with the progressive influence of social media, the information culture of many people has changed. Services such as Facebook, Twitter, and eBay increasingly determine everyday life and offer new ways of communication. Hence, it is sometimes hard to imagine modern societies without people who spend hours on the Internet chatting with their friends, reading news, playing games, or working.

The evolution of the Internet not only changed the way that a society interacts, but it also had a fundamental effect on the behavior of the criminals within that society. Miscreants discovered that the Internet is an alternative domain, with much less risk compared to traditional law-breaking, in which they can operate and generate large profits. As a result, a new type of criminal arose, widely known as the cybercriminal. Computers have essentially unlocked the gates to a new era in law-breaking. In fact, in 2013, cybercrime complaints in the U.S. exceeded 260,000, with losses linked to online fraud totaling more than $780 million, a 48.8% increase in reported losses since 2012, according to the Internet Crime Complaint Center [84]. Consequently, we need to realize that cybercrime is here to stay, whether we like it or not.


1.1 Motivation

The ubiquitous presence of the Internet in our everyday life has dramatically changed the way we communicate and work. Unfortunately, the same did not happen with users’ understanding of key security and privacy concepts. Criminals took advantage of this security gap and developed new malevolent techniques to make profits faster and more securely. More precisely, they can steal money and private information from the comfort of their own homes, potentially making millions of dollars, with little chance of being caught. Nowadays, cybercrime constitutes a serious threat to our Internet-based society.

Recently, Europol issued the 2014 Internet Organised Crime Threat Assessment (iOCTA), which describes the evolution of cybercrime and the models of sales adopted in the criminal ecosystem [50]. This report highlights that the entry barriers into cybercrime are being lowered, allowing those lacking technical expertise to venture into cybercrime by purchasing the skills and tools they lack. Additionally, it emphasizes that cybercriminals predominantly operate from jurisdictions with outdated legal tools and insufficient response capacities, which allow them to operate with minimum risk.

Although cybercrime began as an attempt at gaining bragging rights and reputation among computer experts, today’s threats are targeted and all about making money. Back in 2000, the “I Love You” worm, which arrived via spam, infected millions of users [17]. While this insidious worm cost businesses and governments billions to eradicate, it did not generate revenue for its authors. On the other hand, modern malware is mostly driven by profit. For instance, scareware, a type of malware, frightens users into buying fake anti-virus software. This scam alone results in annual revenues of more than $180 million, according to McAfee Labs [119].

In addition, miscreants create armies of compromised machines. In fact, they use these armies to perform their nefarious tasks, such as sending spam emails, spreading new malware, and stealing personal data like email accounts and banking credentials, while at the same time hiding their own identities. To do so, they first need to find vulnerabilities in victims’ computers. The most common technique is to exploit vulnerable browsers and plugins. Therefore, they need to lure users into visiting their malicious websites. Then, they secretly install malware that offers adversaries remote access to the infected computers. Consequently, these compromised machines become zombies, which blindly execute the commands issued by cybercriminals.

Whether criminals are looking for a way to traffic drugs, bully, extort, or launder money, the list of cybercrimes continues to grow and becomes ever more difficult to police. In this dissertation, we study some of the threats that online users face and propose solutions. Overall, we try to find the root cause behind the threats we investigate, and present countermeasures where possible.


1.2 Dissertation Scope

As we have briefly stated previously, bots are malware-infected devices, which are typically organized into botnets (i.e., networks of compromised computers) that are remotely controlled by an entity, known as the botmaster, through a Command and Control (C&C) channel [1, 7, 34, 40]. Today, malware and botnets are the primary means for cybercriminals to carry out their nefarious tasks, such as sending spam emails [88, 147, 157, 210], launching distributed denial-of-service (DDoS) attacks [59, 126], spreading new malware [64], generating revenue from online advertisements by performing click fraud [174], or stealing personal data like email accounts and banking credentials [4].

Over the past years, botnets have moved into the focus of security researchers. Current botnet detection techniques are divided into passive and active [148]. The passive techniques are defined as those that solely collect data by observation. A detection system can monitor an environment without actually interfering with it. Although this makes these approaches transparent, it also limits the amount of data that can be gathered for analysis. Examples of such methods are the analysis of network traffic, spam records, or log files [16, 73, 112, 160, 211]. On the other hand, the active techniques involve interaction with the monitored environment. These methods hold the opportunity to gather more detailed insights, but are also likely to leave traces that botmasters can observe [41, 80, 92, 133, 154, 171]. While most of these works focus on identifying the behavior of infected machines or detecting the C&C channel, there is limited work on the implementation of the protocols used by bots, especially those that are difficult to distinguish when they blend with benign traffic. Consequently, there is room for further research in this area.

The very existence of botnets would not have been possible if attackers could not find and infect vulnerable machines. To do so, cybercriminals try to trick innocent users into visiting malicious websites or redirect them to a malevolent site when they access other benign, yet compromised web pages. Hence, miscreants need to either increase the popularity of their own websites or insert malicious source code into other benign sites. The first category includes attacks such as blackhat Search Engine Optimization, while the second comprises even more advanced threats, such as infections through malicious advertisements. Therefore, it is of great importance to understand the model behind these threats and propose countermeasures.

Due to the enormous size of the Internet and the plethora of different technologies that constitute its foundations, research in this area can lead in many different directions. Consequently, in this dissertation, we chose to focus on two complementary issues revolving around botnet detection and web security. More specifically, we tried to fight cybercriminals’ operations (i) by detecting compromised machines and mitigating their malicious activities, and (ii) by understanding how cybercriminals operate and thus preventing them from infecting new computers.


In essence, this dissertation tries to answer the following research questions. As threats become more advanced, malicious traffic becomes increasingly similar to benign traffic at the network level and thus more difficult to identify. Therefore, is there a way to successfully distinguish between benign and malicious traffic? In addition, cybercriminals use the web to discover and infect new victims by creating malicious websites that usually exploit vulnerabilities in users’ browsers. To do so, they need to lure victims to visit these websites, and one way to do it is by improving the websites’ ranking in search engine results. Hence, the research questions here are: How do attackers boost the ranking of their websites in search engine results? Can they manipulate other parts of search engines to perform attacks on their behalf? What other ways do they use to find vulnerable machines? And finally, is there a way for users to surf the web in a more secure way?

In Part I, we try to address the problem of distinguishing between benign and malicious implementations of fundamental Internet protocols. More specifically, we attempt to detect and mitigate botnets and malware instances that have infected vulnerable computers and leverage the SMTP and HTTP protocols to perpetrate their malicious activities. To this end, we trace subtle deviations in the implementation of the two aforementioned protocols by various actors. Through a series of experiments, both in a sanitized environment and in the real world, we show that our approach can successfully identify malicious traffic that tries to blend with the benign one. Overall, the purpose of this part is to detect and mitigate activities from machines that have already been compromised, in order to contain the damage they could cause.

In Part II, we examine how cybercriminals try to lure unsuspecting users into visiting malicious websites and infecting their machines. To be precise, we divide our research into two categories: (i) we examine the alliances formed among spammers that try to boost the ranking of their (malicious) websites with unethical means, and (ii) we investigate how cybercriminals manipulate advertisements, an ostensibly harmless web service, to inject malicious code into benign websites. Additionally, we study several attacks against search engines and legitimate web pages, and propose countermeasures. To sum up, in this part we try to understand how cybercriminals operate when they are seeking to infect new victims, so that future researchers may be able to accurately tackle the techniques they employ.

1.3 Contributions

In this dissertation, we address several topics in the fields of network and web security. Our contributions on these topics are twofold. On the one hand, we detect and mitigate malicious activities from compromised machines that constitute botnet components; on the other hand, we study the methods used by cybercriminals that lead to these infections.


In summary, we make the following main contributions:

• In Chapter 2, we present a system that detects and mitigates spam emails. This system focuses on the email delivery mechanism and, more precisely, it captures small variations in the ways in which various clients implement the SMTP protocol. Our experimental results demonstrate that our prototype is able to correctly identify spambots in a real-world scenario. In addition, we show how providing incorrect feedback to bots can have a negative impact on the effectiveness of a botnet.

• In Chapter 3, we propose a novel approach to distinguish the network traffic generated by HTTP-based bots and malware from benign HTTP traffic. We introduce two models that focus on particular characteristics, such as HTTP headers’ sequence and structure, which reveal concrete discrepancies between benign and malicious requests. Our experimental results demonstrate very low classification errors and high performance.

• In Chapter 4, we study the ways that web spammers form alliances to boost the ranking of their websites. More precisely, we provide a systematic analysis of the spam link exchange performed through 15 Search Engine Optimization forums. Furthermore, we visualize the web spam using a graph representation of link exchange. Our analysis reveals the different approaches and strategies used by advanced and inexperienced web spammers.

• In Chapter 5, we provide a systematic overview of attacks that abuse search engine crawlers and study their consequences on search engines as well as the affected third-party websites. Additionally, we deploy multiple sites with vulnerabilities together with attacker-controlled websites and measure the susceptibility of the crawlers of various search engines. Finally, we propose countermeasures and evaluate their efficacy in stopping the aforementioned attacks.

• In Chapter 6, we perform the first large-scale study of ad networks that serve malicious advertisements. We analyze different ad exchanges and show that some of them are more prone to serving malicious advertisements than others. We show that every publisher who does not have an exclusive agreement with the advertiser is a potential publisher of malicious advertisements.

• In Chapter 7, we transform a traditional attack detection approach, such as client honeypots, into a system with a clear attack prevention focus. Our prototype creates events that trigger the client-honeypot detection mechanisms used by intelligent adversaries. The preliminary results of our experiments demonstrate that our prototype can successfully protect users against malicious websites that exhibit a split personality.


1.4 Publications

This thesis is based on academic publications that appeared at several academic conferences, but it also contains new and unpublished material. Additionally, the author was involved in other publications during his Ph.D. studies. All these publications are listed in this section in chronological order, each with a brief summary.

Understanding Fraudulent Activities in Online Ad Exchanges. In this paper, we presented a detailed view of how one of the largest ad exchanges operates and the associated security issues from the vantage point of a member ad network. This work was published together with Brett Stone-Gross, Ryan Stevens, Richard Kemmerer, Christopher Kruegel, and Giovanni Vigna at the 11th ACM Internet Measurement Conference (IMC) in 2011 [174].

B@bel: Leveraging Email Delivery for Spam Mitigation. This publication tackles the problem of spam emails by presenting a system for filtering spam that takes into account how messages are sent by spammers. This work was jointly published with Gianluca Stringhini, Manuel Egele, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna at the 21st USENIX Security Symposium in 2012 [175]. Chapter2 is based on this publication. k-subscription: Privacy-Preserving Microblogging Browsing Through Obfusca- tion. In this work, we proposed an obfuscation-based approach that enables users to follow privacy-sensitive microblogging channels, while, at the same time, mak- ing it difficult for the microblogging service to find out their actual interests. This work was published in 2013 at the 29th Annual Computer Security Applications Conference (ACSAC) [140] in collaboration with Panagiotis Papadopoulos, An- tonis Papadogiannakis, Michalis Polychronakis, Thorsten Holz, and Evangelos P. Markatos.

Automated Generation of Models for Fast and Precise Detection of HTTP-Based Malware. This paper presents a novel approach to precisely detect HTTP-based botnets and malware instances at the network level by analyzing the implementations of the HTTP protocol by different entities. This work was done in cooperation with Antonis Papadogiannakis, Robert Gawlik, and Thorsten Holz. It was published at the 12th Annual Conference on Privacy, Security and Trust (PST) in 2014 [215], and Chapter 3 is based on it.

The Art of False Alarms in the Game of Deception: Leveraging Fake Honeypots for Enhanced Security. This work mitigates the problem of the split personalities that a portion of malicious web pages exhibit by designing and developing a framework that benefits from this bi-faceted behavior and protects users from online threats while surfing the web. Chapter 7 is based on this work, which was published at the 48th IEEE International Carnahan Conference on Security Technology (ICCST) in 2014 [213].

The Dark Alleys of Madison Avenue: Understanding Malicious Advertisements. In this publication, we performed a study of the security of web advertisements; more precisely, we analyzed how secure users are when surfing the web and being served advertisements. This was a joint work with Alexandros Kapravelos, Gianluca Stringhini, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. It was published in 2014 at the 14th ACM Internet Measurement Conference (IMC) [214] and constitutes the core of Chapter 6.

Revealing the Relationship Network Behind Link Spam. This paper provides a systematic analysis of the spam link exchange performed through 15 Search Engine Optimization (SEO) forums by capturing and analyzing the activity of web spammers, identifying spam link exchanges, and visualizing the web spam ecosystem. This was a joint work with Antonis Papadogiannakis, Sotiris Ioannidis, and Thorsten Holz, and it was published at the 13th Annual Conference on Privacy, Security and Trust (PST) in 2015 [216]. The details of this work are presented in Chapter 4.

Hiding Behind the Shoulders of Giants: Abusing Crawlers for Indirect Web Attacks. This study describes the ways in which autonomous crawlers can be abused by attackers in order to exploit vulnerabilities on third-party websites while hiding the true origin of the attacks. This work was done in cooperation with Nick Nikiforakis and Federico Maggi and is currently in submission. Chapter 5 provides a detailed analysis of the attacks we studied and the countermeasures we proposed.

Dynamic Firmware Analysis at Scale: A Case Study on Embedded Web Interfaces. In this work, we studied the security of embedded web interfaces running in commercial off-the-shelf (COTS) embedded devices, such as routers, DSL/cable modems, VoIP phones, and IP/CCTV cameras. Furthermore, we introduced an efficient methodology and presented a scalable framework for the discovery of vulnerabilities in embedded web interfaces regardless of the vendor, device, or architecture. This work is currently in submission and was done in collaboration with Andrei Costin and Aurélien Francillon.

7 1 Introduction

Scalable Firmware Classification and Identification of Embedded Devices. In this paper, we presented two complementary techniques, namely embedded firmware trained classification and embedded web interface fingerprinted identification, in order to automatically identify the vendor, the model, and the firmware version of an arbitrary remote web-enabled device, as well as to automatically and accurately label the brand and the model of the device for which the firmware is intended. This was a collaborative work with Andrei Costin and Aurélien Francillon and is currently in submission.

Neuralyzer: Flexible Expiration Times for the Revocation of Online Data. This work tackles the problem of prefixed lifetimes in data deletion and proposes a novel data structure that protects the publicly accessible data by encrypting it, encapsulating the encrypted content, and ensuring its disappearance when the expiration time is reached. This was a joint work with Katharina Kohls, Markus Dürmuth, and Christina Pöpper, and is currently in submission.

Leveraging Internet Services to Escape from the Embargo of Thoughts. In this work, we introduced a novel approach to combine multiple non-blocked communication protocols, and dynamically switch between these tunnels, to achieve uncensored access to the Internet. This work was done in collaboration with Thorsten Holz and is currently in submission.

1.5 Outline

The remainder of this dissertation is structured in seven different chapters. Each chapter covers one main topic of this dissertation and is a self-contained unit that can be read independently from the other chapters.

Chapter 2 proposes a novel technique that can successfully identify spam emails sent by botnets. The chapter first explains the necessary technical background of the SMTP protocol and then introduces the SMTP dialects. Next, it presents a mechanism that learns which dialect is spoken by a client and which is afterwards used to match an unknown dialect to its corresponding client. After that, it provides details regarding the manipulation of the botnets’ feedback mechanisms. Then, it describes the system overview and continues with its evaluation. Finally, it discusses its limitations, presents the related work, and concludes with a brief summary.

Chapter 3 expands our detection scope to HTTP-based botnets and malware instances. This chapter is organized as follows. First, it describes the HTTP protocol. Second, it presents the different categories of HTTP-based malware. Third, it gives details on the HTTP-level detection techniques it uses. Fourth, it describes the system implementation and fifth, it evaluates the prototype.

8 1.5 Outline

Sixth, it discusses potential evasion techniques. Seventh, it provides details on the related work and eighth, it summarizes the chapter.

Chapter 4 studies how attackers can form alliances to confuse search engine algorithms. Initially, it gives insights into the most popular web page ranking algorithms and explains how web spammers can abuse them. Next, it provides an overview of the measurement methodology it uses and presents the results of the analysis of SEO forums and link exchanges. Then, it graphically represents the discovered alliances and summarizes the findings. Finally, it discusses several issues of the study, presents the related work, and concludes in the last section.

Chapter 5 investigates the extent to which attackers can abuse the logic of search engine crawlers to perform various attacks. First, it gives an introduction to web vulnerabilities and exploits and describes how an attacker can benefit from the infrastructure of web crawlers. Next, it evaluates the aforementioned attacks in the real world and proposes countermeasures. This chapter concludes with a discussion of limitations, related work, and a brief summary.

Chapter 6 studies the ecosystem of online advertisements and how it can be abused. It begins by providing a description of the different types of malicious advertisements. Next, it presents the methodology which is used to generate and evaluate a large corpus of advertisements. It continues with the analysis of the discovered malicious advertisements. Then, it discusses proactive and reactive countermeasures against this security issue. Finally, it presents the related work and concludes.

Chapter 7 proposes a solution which enhances users’ security while surfing the web by cloaking a normal browser to appear as a client honeypot. First, it gives an abstract overview of the different types of client honeypots. Then, it provides details on the architecture of the proposed system and afterwards presents the evaluation results. Next, it tackles the system’s limitations and suggests possible improvements. Finally, it summarizes the solutions proposed in this chapter after giving an overview of the existing related work.

Chapter 8 concludes this dissertation by recapitulating the previous chapters and providing directions for future research.


Part I

Botnet Detection and Mitigation


Preamble

Being one of the most serious threats in our modern technology-based society, botnets have moved into the focus of scientific research. Bots are compromised machines whose resources, like computational power or bandwidth, can be used by a botmaster. By infecting a certain number of vulnerable machines, which can vary from a few hundred to several thousand, this botmaster can form a powerful network of bots. Then, she can command these bots (mostly motivated by financial, political, or ideological reasons) to perform a series of malicious activities, such as spreading spam and malware, or stealing personal information from users.

In the first part of this dissertation, we propose two novel detection and prevention approaches against botnets. More precisely, we study two fundamental Internet protocols, SMTP and HTTP, and trace subtle deviations in the implementation of these protocols by legitimate parties and cybercriminals. We then use these differences to implement detection models to discover traffic generated by infected computers. Additionally, in the case of spam botnets, we show how providing incorrect feedback to bots can have a negative impact on the spamming effectiveness of a botnet. Finally, we deploy our proposed models in real-world scenarios and evaluate their behavior.


2 Spam Mitigation

“Like almost everyone who uses email, I receive a ton of spam every day. Much of it offers to help me get out of debt or get rich quick. It would be funny if it weren’t so irritating.”

Bill Gates

Traditional spam detection systems either rely on content analysis to detect spam emails, or attempt to detect spammers before they send a message (i.e., they rely on the origin of the message). In this chapter, we present an alternative solution. We leverage the email delivery mechanism and analyze the communication at the SMTP protocol level to prevent spam emails from being delivered. Additionally, we manipulate the feedback from email servers to remove recipients from spammers’ lists. Our results reveal that by utilizing our approach along with traditional spam defense mechanisms we can effectively mitigate the delivery of spam emails.

2.1 Introduction

Email spam, or unsolicited bulk email, is one of the major open security problems of the Internet. Accounting for more than 77% of the overall worldwide email traffic [97], spam is annoying for users who receive emails they did not request, and it is damaging for users who fall for scams and other attacks. Also, spam wastes resources on SMTP servers, which have to process a significant amount of unwanted emails [182]. A lucrative business has emerged around email spam, and recent studies estimate that large affiliate campaigns generate between $400K and $1M revenue per month [93]. Nowadays, about 85% of worldwide spam traffic is sent by botnets [181].


During recent years, a wealth of research has been performed to mitigate both spam and botnets [88, 103, 141, 147, 157, 158, 219]. Existing spam detection systems fall into two main categories. The first category focuses on the content of an email. By identifying features of an email’s content, one can classify it as spam or ham (i.e., a benign email message) [49, 124, 164]. The second category focuses on the origin of an email [77, 190]. By analyzing distinctive features about the sender of an email (e.g., the IP address or autonomous system from which the email is sent, or the geographical distance between the sender and the recipient), one can assess whether an email is likely spam, without looking at the email content.

While existing approaches reduce spam, they also suffer from limitations. For instance, running content analysis on every received email is not always feasible for high-volume servers [182]. In addition, such content analysis systems can be evaded [113, 134]. Similarly, origin-based techniques have coverage problems in practice. Previous work showed how IP blacklisting, a popular origin-based technique [170], misses a large fraction of the IP addresses that are actually sending spam [155, 169].

In this chapter, we propose a novel, third approach. Instead of looking at the content of messages (what) or their origins (who), we analyze the way in which emails are sent (how). More precisely, we focus on the email delivery mechanism; we look at the communication between the sender of an email and the receiving mail server at the SMTP protocol level. Our approach can be used in addition to traditional spam defense mechanisms. We introduce two complementary techniques as concrete instances of our new approach: SMTP dialects and Server feedback manipulation.

SMTP dialects. This technique leverages the observation that different email clients (and bots) implement the SMTP protocol in slightly different ways. These deviations occur at various levels, and range from differences in the case of protocol keywords, to differences in the syntax of individual messages, to the way in which messages are parsed. We refer to deviations from the strict SMTP specification (as defined in the corresponding RFCs) as SMTP dialects. As with human language dialects, the listener (the server) typically understands what the speaker (a legitimate email client or a bot) is saying. This is because SMTP servers, similar to many other Internet services, follow Postel’s law, which states: “Be liberal in what you accept, and conservative in what you send.”

We introduce a model that represents SMTP dialects as state machines, and we present an algorithm that learns dialects for different email clients (and their respective email engines). Our algorithm uses both passive observation and active probing to efficiently generate models that can distinguish between different email engines. Unlike previous work on service and protocol fingerprinting, our models are stateful. This is important, because it is almost never enough to inspect a single message to be able to identify a specific dialect.

Leveraging our models, we implement a decision procedure that can, based on the observation of an SMTP transaction, determine the sender’s dialect. This is useful, as it allows an email server to terminate the connection with a client when this client is recognized as a spambot. The connection can be dropped before any content is transmitted, which saves computational resources at the server. Moreover, the identification of a sender’s dialect allows analysts to group bots of the same family, or track the evolution of spam engines within a single malware family.

Server feedback manipulation. The SMTP protocol is used by a client to send a message to the server. During this transaction, the client receives from the server information related to the delivery process. One important piece of information is whether the intended recipient exists or not. The performance of a spam campaign can improve significantly when a botmaster takes into account server feedback. In particular, it is beneficial for spammers to remove non-existent recipient addresses from their email lists. This prevents a spammer from sending useless messages during subsequent campaigns. Indeed, previous research has shown that certain bots report the error codes received from email servers back to their command and control nodes [103, 173].

To exploit the way in which botnets currently leverage server feedback, it is possible to manipulate the responses from the mail server to a bot. In particular, when a mail server identifies the sender as a bot, instead of dropping the connection, it could simply reply that the recipient address does not exist. To identify a bot, one can either use traditional origin-based approaches [77, 190] or leverage the SMTP dialects proposed in this chapter. When the server feedback is poisoned in this fashion, spammers have to decide between two options. One possibility is to continue to consider server feedback and remove valid email addresses from their email list. This reduces the spam emails that these users will receive in the future. Alternatively, spammers can decide to distrust and discard any server feedback. This reduces the effectiveness of future campaigns since emails will be sent to non-existent users.

Our experimental results demonstrate that our techniques are successful in identifying (and rejecting) bots that attempt to send unwanted emails. Moreover, we show that we can successfully poison spam campaigns and prevent recipients from receiving subsequent emails from certain spammers. However, we recognize that spam is an adversarial activity and an arms race. Thus, a successful deployment of our approach might prompt spammers to adapt. We discuss possible paths for spammers to evolve, and we argue that such evolution comes at a cost in terms of performance and flexibility.
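As one possible, minimal illustration of this idea (not the concrete deployment evaluated later in this chapter), a mail server could choose its reply to RCPT commands as follows; the reply strings and the boolean bot classifier are assumptions of this sketch.

def rcpt_reply(recipient_exists: bool, sender_is_bot: bool) -> str:
    """Choose the SMTP reply to an RCPT TO command.

    If the connected client has been classified as a spambot (e.g., via its
    SMTP dialect or an origin-based blacklist), pretend the mailbox does not
    exist, so that the bot reports a useless "non-existent recipient" result
    back to its command and control infrastructure.
    """
    if sender_is_bot or not recipient_exists:
        return "550 5.1.1 Recipient address rejected: user unknown"
    return "250 2.1.5 Ok"

In this sketch, the server answers bots with a standard permanent-failure reply, so a botmaster who trusts server feedback would strike a valid address from the spam list.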


In summary, we make the following main contributions:

• We introduce a novel approach to detect and mitigate spam emails. This approach focuses on the email delivery mechanism — the SMTP communication between the email client and the email server. It is complementary to traditional techniques that operate either on the message origin or on the message content.

• We introduce the concept of SMTP dialects as one concrete instance of our approach. Dialects capture small variations in the ways in which clients implement the SMTP protocol. This allows us to distinguish between legitimate email clients and spambots. We designed and implemented a technique to automatically learn the SMTP dialects of both legitimate email clients and spambots.

• We implemented our approach in a tool, called B@bel. Our experimental results demonstrate that B@bel is able to correctly identify spambots in a real-world scenario.

• We study how the feedback provided by email servers to bots is used by their botmasters. As a second instance of our approach, we show how providing incorrect feedback to bots can have a negative impact on the spamming effectiveness of a botnet.

2.2 SMTP Protocol

The Simple Mail Transfer Protocol (SMTP), as defined in RFC 821 [150], is a text-based protocol that is used to send email messages originating from Mail User Agents (MUAs — e.g., Outlook, Thunderbird, or Mutt), through intermediate Mail Transfer Agents (MTAs — e.g., Sendmail, Postfix, or Exchange), to the recipients’ mailboxes. The protocol is defined as an alternating dialog where the sender and the receiver take turns transmitting their messages. Messages sent by the sender are called commands, and they instruct the receiver to perform an action on behalf of the sender. The SMTP RFC defines 14 commands. Each command consists of a four-character, case-insensitive, alphabetic command code (e.g., MAIL) and additional, optional arguments (e.g., FROM:). One or more space characters separate command codes and argument fields. All commands are terminated by a line terminator, which we denote as <CR><LF>. An exception is the DATA command, which instructs the receiver to accept the subsequent lines as the email’s content, until the sender transmits a dot character as the only character on a line (i.e., <CR><LF>.<CR><LF>).


Server: 220 debian
Client: HELO example.com
Server: 250 Ok
Client: MAIL FROM:<sender@example.com>
Server: 250 2.1.0 Ok
Client: RCPT TO:<recipient@example.com>
Server: 250 2.1.5 Ok
Client: DATA

Figure 2.1: A typical SMTP conversation.

SMTP replies are sent by the receiver to inform the sender about the progress of the email transfer process. Replies consist of a three-digit status code, followed by a space separator, followed by a short textual description. For example, the reply 250 Ok indicates to the sender that the last command was executed successfully. Commonly, replies are one line long and terminated by <CR><LF>.¹ The RFC defines 21 different reply codes. These codes inform the sender about the specific state that the receiver has advanced to in its protocol state machine and, thus, allow the sender to synchronize its state with the state of the receiver.

A plethora of additional RFCs have been introduced to extend and modify the original SMTP protocol. For example, RFC 1869 introduced SMTP Service Extensions. These extensions define how an SMTP receiver can inform a client about the extensions it supports. More precisely, if a client wants to indicate that it supports SMTP Service Extensions, it will greet the server with EHLO instead of the regular HELO command. The server then replies with a list of available service extensions as a multi-line reply. For example, a server capable of handling encryption can announce this capability by responding with a 250-STARTTLS reply to the client’s EHLO command. MTAs, mail clients, and spambots implement different sets of these extensions. As we will discuss in detail later, we leverage these differences to determine the SMTP dialect spoken in a specific SMTP conversation.

In this work, we consider an SMTP conversation to be the sequence of commands and replies that leads to a DATA command, to a QUIT command, or to an abrupt termination of the connection. This means that we do not consider any reply or command that is sent after the client starts transmitting the actual content of an email. An example of an SMTP conversation is listed in Figure 2.1.

1 The protocol allows the server to answer with multi-line replies. In a multi-line reply, all lines but the last must begin with the status code followed by a dash character. The last line of a multi-line reply must be formatted like a regular one-line reply.
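To illustrate the reply format just described (three-digit status code, separator, text, and the dash-continued multi-line form), here is a small parsing sketch in Python; it is our own simplified reading of the format, not part of the system described later in this chapter.

def read_reply(lines):
    """Read one SMTP reply (possibly multi-line) from an iterator of lines.

    Returns (status_code, [text lines]). In a multi-line reply every line but
    the last uses "CODE-text"; the last line uses "CODE text".
    """
    code, texts = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        code, sep, text = line[:3], line[3:4], line[4:]
        texts.append(text)
        if sep != "-":          # a space (or nothing) marks the final line
            break
    return code, texts

# Example: the EHLO response of a server announcing STARTTLS support.
reply = ["250-mail.example.com\r\n", "250-STARTTLS\r\n", "250 Ok\r\n"]
print(read_reply(iter(reply)))   # ('250', ['mail.example.com', 'STARTTLS', 'Ok'])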


2.3 SMTP Dialects

The RFCs that define SMTP specify the protocol that a client has to speak to properly communicate with a server. However, different clients might implement the SMTP protocol in slightly different ways, for three main reasons:

1. The SMTP RFCs do not always provide a single possible format when specifying the commands a client must send. For example, command identifiers are case insensitive, which means that EHLO and ehlo are both valid command codes.

2. By using different SMTP extensions, clients might add different parameters to the commands they send.

3. Servers typically accept commands that do not comply with the strict SMTP definitions. Therefore, a client might implement the protocol in slightly wrong ways while still succeeding in sending email messages.

We call different implementations of the SMTP protocol SMTP dialects. A dialect D is defined as a state machine

D = <Σ, S, s0, T, Fg, Fb>    (2.1)

where Σ is the input alphabet (composed of server replies), S is a set of states, s0 is the initial state, and T is a set of transitions. Each state s in S is labeled with a client command, and each transition t in T is labeled with a server reply. Fg ⊆ S is a set of good states, which represent successful SMTP conversations, while Fb ⊆ S is a set of bad states, which represent failed SMTP conversations. The state machine D captures the order in which commands are sent in relation to server replies by that particular dialect. Since SMTP messages are not always constant, but contain variable fields (e.g., the recipient email address in an RCPT command), we abstract commands and replies as templates, and label states and transitions with such templates. We do not require D to be deterministic. The reason for this is that some clients show non-deterministic behavior in the messages they exchange with SMTP servers. For example, bots belonging to the Lethic malware family use EHLO and HELO interchangeably when responding to a server 220 reply. Figure 2.2 shows two example dialect state machines (Outlook Express and Bagle, a spambot).

Figure 2.2: Simplified state machines for Outlook Express (left) and Bagle (right).
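The definition in Equation 2.1 can be represented directly in code. The following Python sketch is our own illustration (the class and field names, as well as the concrete reply templates in the example, are assumptions), with states labeled by command templates and transitions by reply templates, as in Figure 2.2.

from collections import defaultdict

class Dialect:
    """State machine <Sigma, S, s0, T, Fg, Fb> for one SMTP dialect.

    States are labeled with command templates, transitions with reply
    templates; the machine may be non-deterministic, so a (state, reply)
    pair can lead to several successor states.
    """
    def __init__(self, name):
        self.name = name
        self.s0 = "<start>"                   # initial state (no command sent yet)
        self.states = {self.s0}               # S, identified by command template
        self.transitions = defaultdict(set)   # T: (state, reply template) -> {next states}
        self.good = set()                     # Fg: successful conversations (DATA)
        self.bad = set()                      # Fb: failed conversations (QUIT / abort)

    def add_transition(self, src, reply_template, dst):
        self.states.add(dst)
        self.transitions[(src, reply_template)].add(dst)

# Sketch of the Bagle fragment from Figure 2.2: a HELO followed by a RSET.
bagle = Dialect("Bagle")
bagle.add_transition(bagle.s0, "220 <hostname>", "HELO <domain>")
bagle.add_transition("HELO <domain>", "250 Ok", "RSET")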

Message Templates

As explained previously, we label states and transitions with message templates. We define the templates of the messages that belong to a dialect as regular expressions. Each message is composed of a sequence of tokens. We define a token as any sequence of characters separated by delimiters. We define spaces, colons, and equality symbols as delimiters. We leverage domain knowledge to develop a number of regular expressions for the variable elements in an SMTP conversation. In particular, we define regular expressions for email addresses, fully qualified domain names, domain names, IP addresses, numbers, and hostnames (see Figure 2.3 for details). Every token that does not match any of these regular expressions is treated as a keyword.

Email address: ?
IP address: \[?\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\]?
Fully qualified domain name: [\w-]+\.[\w-]+\.\w[\w-]+
Domain name: [\w-]+\.[\w-]+
Number: [0-9]{3}[0-9]+
Hostname: [\w-]{5}[\w-]+

Figure 2.3: Regular expressions used in message templates.

An example of a message template is

MAIL FROM: <email-addr>,

where email-addr is a regular expression that matches email addresses. Given two dialects D and D’, we consider them different if their state machines are different. For example, the two dialects in Figure 2.2 differ in the sequence of commands that the two clients send: Bagle sends a RSET command after the HELO, while Outlook Express sends a MAIL command directly. Also, the format of the commands of the two dialects differs: Outlook Express puts a space between MAIL FROM: and the sender email address, while Bagle does not.

In Section 2.4, we show how we can learn the dialect spoken by an SMTP client. In Section 2.5, we show how these learned dialects can be matched against an SMTP conversation, which is crucial for performing spam mitigation, as we will show in Section 2.8.

2.4 Learning Dialects

To distinguish among different SMTP speakers, we require a mechanism that learns which dialect is spoken by a particular client. To do this, we need a set of SMTP conversations C generated by the client. In detail, each conversation is a sequence of <reply, command> pairs, where the command can be empty if the client did not send anything after receiving a reply from the server.

It is important to note that the type of conversations in C affects the state machine learned for the dialect. For example, if C only contains successful SMTP conversations, the portion of the dialect state machine that we can learn from it is very small. In the typical SMTP conversation listed in Figure 2.1, the client first connects to the SMTP server, then announces itself (i.e., sends a HELO command), states who the sender of the email is (i.e., sends a MAIL command), lists recipients (by using one or more RCPT commands), and starts sending the actual email content (by sending a DATA command). Observing this type of communication gives no information on what a client would do upon receiving a particular error, or a specific SMTP reply from the server. To mitigate this problem, we collect a diverse set of SMTP conversations. We do this by directing the client to an SMTP server under our control, and sending specific SMTP replies to it (see Section 2.4.2).

Even though sending specific replies allows us to explore more states than the ones we could explore otherwise, we still cannot be sure that the dialects we learn are complete. In Section 2.8, we show how the inferred state machines are usually good enough for discriminating between different SMTP dialects. However, in some cases, we might not be able to distinguish two different dialects because the learned state machines are identical.

2.4.1 Learning Algorithm

Analyzing the set C allows us to learn part of the dialect spoken by the client. Our learning algorithm processes one SMTP conversation from C at a time, and iteratively builds the dialect state machine.


Learning the Message Templates

For each message observed in a conversation Con in C, our algorithm generates a regular expression that matches it. The regular expression generation algorithm works in three steps:

Step 1: First, we split the message into tokens. As mentioned in Section 2.3, we consider the space, colon, and equality characters as delimiters.

Step 2: For each token, we check if it matches a known regular expression. More precisely, we check it against all the regular expressions defined in Figure 2.3, from the most specific to the least specific, until one matches (this means that we check the regular expressions in the following order: email address, IP address, fully qualified domain name, domain name, number, hostname). If a token matches a regular expression, we substitute the token with the matched regular expression’s identifier (e.g., <email-addr>). If none of the regular expressions are matched, we consider the token a keyword, and we include it verbatim in the template.

Step 3: We build the message template, by concatenating the template tokens (which can be keywords or regular expressions) and the delimiters, in the order in which we encountered them in the original message.

Consider, for example, the command:

MAIL FROM:<sender@example.com>

First, we break the command into tokens:

[MAIL, FROM, <sender@example.com>]

The only token that matches one of the known regular expressions is the email address. Therefore, we consider the other tokens as keywords. The final template for this command will therefore be:

MAIL FROM:<email-addr>

Notice that, by defining message format templates as we described, we can be more precise than the SMTP standard specification and detect the (often subtle) differences between two dialects even though both might comply with the SMTP RFC. For instance, we would build two different message format templates (and, therefore, have two dialects) for two clients that use different case for the EHLO keyword (e.g., one uses EHLO as a keyword, while the other uses Ehlo).
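To make Steps 1–3 concrete, the following Python sketch reproduces the token classes of Figure 2.3 and the template construction; the email-address pattern and the function names are our own placeholders (the email regex of Figure 2.3 is not fully reproduced in this text), so treat this as an illustration rather than B@bel’s implementation.

import re

# Token classes from Figure 2.3, checked from most to least specific.
# The email-address pattern is a stand-in for the one elided above.
TOKEN_CLASSES = [
    ("email-addr", re.compile(r"<?[\w.+-]+@[\w-]+\.[\w.-]+>?")),
    ("ip-addr",    re.compile(r"\[?\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\]?")),
    ("fqdn",       re.compile(r"[\w-]+\.[\w-]+\.\w[\w-]+")),
    ("domain",     re.compile(r"[\w-]+\.[\w-]+")),
    ("number",     re.compile(r"[0-9]{3}[0-9]+")),
    ("hostname",   re.compile(r"[\w-]{5}[\w-]+")),
]

DELIMITERS = " :="   # space, colon, and equality sign (Step 1)


def classify(token):
    """Return the identifier of the first matching token class, or None."""
    for name, pattern in TOKEN_CLASSES:
        if pattern.fullmatch(token):
            return name
    return None


def message_template(message):
    """Steps 1-3: tokenize, abstract variable tokens, rebuild the template."""
    parts, token = [], ""
    for ch in message:
        if ch in DELIMITERS:
            if token:
                cls = classify(token)
                parts.append(f"<{cls}>" if cls else token)  # keywords stay verbatim
                token = ""
            parts.append(ch)                                # keep the delimiter itself
        else:
            token += ch
    if token:
        cls = classify(token)
        parts.append(f"<{cls}>" if cls else token)
    return "".join(parts)


# The worked example from the text.
print(message_template("MAIL FROM:<sender@example.com>"))   # MAIL FROM:<email-addr>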


Learning the State Machine

We incrementally build the dialect state machine by starting from an empty initial state s0 and adding new transitions and states as we observe more SMTP conversations from C. For each conversation Con in C, the algorithm executes the following steps:

Step 1: We set the current state s to s0.

Step 2: We examine all tuples <ri, ci> in Con. An example of a tuple is: <220 server, HELO evil.com>.

Step 3: We apply the algorithm described in Section 2.4.1 to ri and ci, and build the corresponding templates tr and tc. In the example, tr is 220 hostname and tc is HELO domain. Note that ci could be empty, because the client might not have sent any command after a reply from the server. In this case tc will be an empty string.

Step 4: If the state machine has a state sj labeled with tc, we check if there is a transition t labeled with tr going from s to sj. (i) If there is one, we set the current state s to sj, and go to Step 6. (ii) If there is no such transition, we connect s and sj with a transition labeled with tr, set the current state s to sj, and go to Step 6. (iii) If none of the previous conditions hold, we go to Step 5.

Step 5: If there is no state labeled with tc, we create a new state sn, label it with tc, and connect s and sn with a transition labeled tr. We then set the current state s to sn. Following the previous example, if we have no state labeled with HELO domain, we create a new state with that label, and connect it to the current state s (in this case the initial state) with a transition labeled with 220 hostname. If there are no tuples left in Con and tc is empty, we set the current state as a failure state for the current dialect, and add it to Fb. We then move to the next conversation in C, and go back to Step 2.² Otherwise, we go to Step 6.

Step 6: If s is labeled with DATA, we mark the state as a good final state for this dialect, and add it to Fg. Else, if s is labeled with QUIT, we mark s as a bad final state and add it to Fb. We then move to the next conversation in C, and we go back to Step 2.
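A simplified rendering of this construction is sketched below in Python. It assumes that each conversation is given as a list of (reply, command) strings, reuses the message_template routine sketched earlier, and omits some of the bookkeeping of the full algorithm (for instance, it treats every empty command as an abrupt termination).

class Dialect:
    """A dialect state machine: states are labeled with command templates,
    transitions with reply templates (a simplified representation)."""

    def __init__(self):
        self.states = {"s0": None}        # state name -> command-template label
        self.transitions = set()          # (from_state, reply template, to_state)
        self.good_final = set()           # Fg: states labeled with DATA
        self.bad_final = set()            # Fb: states labeled with QUIT / abrupt close

    def _state_for(self, label):
        for name, lab in self.states.items():
            if lab == label:
                return name
        return None

    def add_conversation(self, conversation):
        s = "s0"                                          # Step 1
        for reply, command in conversation:               # Step 2
            tr = message_template(reply)                  # Step 3
            tc = message_template(command) if command else ""
            sj = self._state_for(tc)
            if sj is None:                                # Step 5: create a new state
                sj = f"s{len(self.states)}"
                self.states[sj] = tc
            self.transitions.add((s, tr, sj))             # Step 4: add/reuse transition
            s = sj
            if tc == "":                                  # simplification: an empty command
                self.bad_final.add(s)                     # is treated as an abrupt close
                return
        label = self.states[s] or ""                      # Step 6: mark final states
        if label.startswith("DATA"):
            self.good_final.add(s)
        elif label.startswith("QUIT"):
            self.bad_final.add(s)

To learn a dialect, one conversation from C is added at a time by calling add_conversation.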

2.4.2 Collecting SMTP Conversations

To be able to model as much of a dialect as possible, we need a comprehensive set of SMTP conversations generated by a client. As previously discussed, the straightforward approach to collect SMTP conversations is to passively observe the messages exchanged between a client and a server.

² By doing this, we handle cases in which the client abruptly terminates the connection.


In practice, this is often enough to uniquely determine the dialect spoken by a client (see Section 2.8 for experimental results). However, there are cases in which passive observation is not sufficient to precisely identify a dialect. In such cases, it would be beneficial to be able to send specifically-crafted replies to a client (e.g., malformed replies) and observe its responses. To perform this exploration, we set up a testing environment in which we direct clients to a mail server we control, and we instrument the server to craft specific responses to the commands the client sends.

The SMTP RFCs define how a client should respond to unexpected SMTP replies, such as errors and malformed messages. However, both legitimate clients and spam engines either exhibit small differences in the implementation of these guidelines, or they do not implement them correctly. The reason for this is that, in most cases, implementing a subset of the SMTP guidelines is enough to carry out a correct conversation with a server and successfully send an email. Therefore, there is no need for a client to implement the full SMTP protocol. Of course, for legitimate clients, we expect the SMTP implementation to be mature, robust, and complete; that is, corner cases are handled correctly. In contrast, spambots have a very focused purpose when using SMTP: send emails as fast as possible. For spammers, taking into account every possible corner case of the SMTP protocol is unnecessary; even more problematic, it could impact the performance of the spam engine (see Section 2.9 for more details).

In summary, we want to achieve two goals when actively learning an SMTP dialect. First, we want to learn how a client reacts to replies that belong to the language defined in the SMTP RFCs, but are not exposed during passive observation. Second, we want to learn how a client reacts to messages that are invalid according to the SMTP RFCs.

We aim to systematically explore the message structure as well as the state machine of the dialect spoken by a client. To this end, the variations to the SMTP protocol we use for active probing are of two types: (i) variations to the protocol state machine, which modify the sequence or the number of the replies that are sent by the server, and (ii) variations to the replies, which modify the structure of the reply messages that are sent by the server. In the following, we discuss how we generate variations of both the protocol state machine and the replies.

Protocol state machine variations

We use four types of protocol variation techniques:

1. Standard SMTP replies: These variations aim at exposing responses to replies that comply with the RFCs, but are not observable during a standard, successful SMTP conversation, like the one in Figure 2.1. An example is sending SMTP errors to the commands a client sends. Some dialects continue the conversation with the server even after receiving a critical error.

2. Additional SMTP replies: These variations add replies to the SMTP conversation. More precisely, this technique replies with more than one message to the commands the client sends. Some dialects ignore the additional replies, while others will only consider one of the replies.

3. Out-of-order SMTP replies: These variations are used to analyze how a client reacts when it receives a reply that should not be sent at that point in the protocol (i.e., a state transition that is not defined by the standard SMTP state machine). For example, some senders might start sending the email content as soon as they receive a 354 reply, even if they did not specify the sender and recipients of the email yet.

4. Missing replies: These variations aim at exposing the behavior of a dialect when the server never sends a reply to a command.

Message format variations

These variations represent changes in the format of the replies that the server sends back to a client. As described in Section 2.2, SMTP server replies to a client's command have the format CODE TEXT <CR><LF>, where CODE represents the actual response to the client's command, TEXT provides human-readable information to the user, and <CR><LF> is the line terminator. According to the SMTP specification, a client should read the data from the server until it receives a line terminator, parse the code to check the response, and pass the text of the reply to the user if necessary (e.g., in case an error occurred). Given the specification, we craft reply variations in four distinct ways to systematically study how a client reacts to them:

1. Compliant replies: These reply variations comply with the SMTP standard, but are seldom observed in a common conversation. For example, this tech- nique might vary the capitalization of the reply (uppercase/lowercase/mixed case). The SMTP specification states that reply text should be case-insensitive.

2. Incorrect replies: The SMTP specification states that reply codes should always start with one of the digits 2, 3, 4, or 5 (according to the class of the status code), and be three digits long. These variations are replies that do not comply with the protocol (e.g., a message with a reply code that is four digits long). A client is expected to respond with a QUIT command to these malformed replies, but certain dialects behave differently.


3. Truncated replies: As discussed previously, the SMTP specification dictates how a client is supposed to handle the replies it receives from the server. Of course, it is not guaranteed that clients will follow the specification and process the entire reply. The reason is that the only important information the client needs to analyze to determine the server's response is the status code. Some dialects might only check for the status code, discarding the rest of the message. For these reasons, we generate variations as follows: For each reply, we first separate it into tokens as described in Section 2.3. Then, we generate N different variations of the reply, where N is the number of tokens it contains, by truncating the reply with a line terminator after each token.

4. Incorrectly-terminated replies: From a practical point of view, there is no need for a client to parse the full reply until it reaches the line terminator. To assess whether a dialect checks for the line terminator when receiving a reply, we terminate the replies with incorrect terminators, that is, four sequences of <CR> and <LF> characters other than the standard <CR><LF> (for instance, a bare <CR> or a bare <LF>). For each terminator, similar to what we did for truncated replies, we truncate the reply after every token, resulting in 4N different variations per reply.

We developed 228 variations to use for our active probing. More precisely, we extracted the set of replies contained in the Postfix source code. Then, we applied the variations described in this section to them, and injected the results into a reference SMTP conversation. To this end, we used the sequence of server replies from the conversation in Figure 2.1.
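The following sketch illustrates how the truncated and incorrectly-terminated variations of a single reply can be generated. The set of malformed terminators shown here is an assumption for illustration purposes, not necessarily the exact set used for our probing.

import re

CRLF = "\r\n"
# Malformed terminators used for illustration; the exact set is an assumption.
BAD_TERMINATORS = ["\r", "\n", "\n\r", "\r\r\n"]

def tokens(reply):
    """Split a reply on the delimiters of Section 2.3, keeping the delimiters."""
    return [t for t in re.split(r"([ :=])", reply) if t != ""]

def truncated_variations(reply, terminator=CRLF):
    """One variation per token: cut the reply after that token and terminate it."""
    toks = tokens(reply)
    return ["".join(toks[: i + 1]) + terminator for i in range(len(toks))]

def terminator_variations(reply):
    """4N variations: truncate after every token, once per malformed terminator."""
    out = []
    for term in BAD_TERMINATORS:
        out.extend(truncated_variations(reply, term))
    return out

# truncated_variations("250 OK") returns ["250\r\n", "250 \r\n", "250 OK\r\n"]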

2.5 Matching Conversations to Dialects

After having learned the SMTP dialects for different clients, we obtain a different state machine for each one of them. Given a conversation between a client and a server, we want to assess which dialect the client is speaking. To do this, we merge all inferred dialect state machines together into a single Decision State Machine MD.

2.5.1 Building the Decision State Machine

We use the approach proposed by Wolf [201] to merge the dialect state machines into a single state machine. Given two dialects D1 and D2, the approach works as follows:


Figure 2.4: An example of a decision state machine.

Step 1: We build the Cartesian product D1 × D2. That is, for each combination of states <s1, s2>, where s1 is a state in D1 and s2 is a state in D2, we build a new state sD in the decision state machine MD.

The label of sD is a table with two columns. The first column contains the identifier of one of the dialects sD was built from (e.g., D1), and the second column contains the label that dialect had in the original state (either s1 or s2). Note that we add one row for each of the two states that sD was built from. For example, the second state of the state machine in Figure 2.4 is labeled with a table containing the two possible message templates that the clients C1 and C2 would send in that state (i.e., HELO hostname and HELO domain).

We then check all the incoming transitions to s1 and s2 in the original state machines D1 and D2. For each combination of transitions <t1, t2>, where t1 is an incoming transition for s1 and t2 is an incoming transition for s2, we check if t1 and t2 have the same label. If they do, we generate a new transition td, and add it to MD. The label of td is the label of t1 and t2. The start state of td is the Cartesian product of the start states of t1 and t2, respectively, while the end state is sD. If the labels of t1 and t2 do not match, we discard td. For example, a transition t1 labeled as 250 OK and a transition t2 labeled as 553 Relaying Denied would not generate a transition in MD. At the end of this process, if sD is not connected to any other state, it will not be part of the decision state machine MD, since that state would not be reachable if added to MD.


Step 2: We reduce the number of states in MD by merging together states that are equivalent. To evaluate if two states s1 and s2 are equivalent, we first extract the set of incoming transitions to s1 and s2. We name these sets I1 and I2. Then, we extract the set of outgoing transitions from s1 and s2, and name these sets O1 and O2. We consider s1 and s2 as equivalent if |I1| = |I2| and |O1| = |O2|, and if the edges in the sets I1 and I2, and in O1 and O2, have the exact same labels. If s1 and s2 are equivalent, we remove them from MD, and we add a state sd to MD. The label for sd is a table composed of the combined rows of the label tables of s1 and s2. We then adjust the transitions in MD that had s1 or s2 as start states to start from sd, and the transitions that had s1 or s2 as end states to end at sd.

We iteratively run this algorithm on all the dialects we learned, and build the final decision state machine MD. As an example, Figure 2.4 shows the decision state machine built from the two dialects in Figure 2.2. Wolf shows how this algorithm produces nearly minimal resulting state machines [201]. Empirical results indicate that this works well in practice and is enough for our purposes. Also, as for the dialect state machines, the decision state machine is non-deterministic. This is not a problem, since we analyze different states in parallel to make a decision, as we explain in the next section.
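The Cartesian-product step of the merge can be sketched as follows, assuming the dialect representation used in the earlier sketches; the state-merging optimization of Step 2 and the full reachability check are simplified for brevity.

from itertools import product

def merge_dialects(name1, d1, name2, d2):
    """Step 1 of the merge: Cartesian product of two dialect state machines.
    Every merged state carries a label table {dialect name -> command template}."""
    labels = {}                                  # merged state -> label table
    transitions = set()                          # (from, reply template, to)

    for s1, s2 in product(d1.states, d2.states):
        labels[(s1, s2)] = {name1: d1.states[s1], name2: d2.states[s2]}

    for (f1, r1, t1), (f2, r2, t2) in product(d1.transitions, d2.transitions):
        if r1 == r2:                             # keep only transitions whose labels match
            transitions.add(((f1, f2), r1, (t1, t2)))

    # Drop product states without any incoming transition (a simplification of
    # the reachability argument given above); the merged initial state is kept.
    reachable = {("s0", "s0")} | {dst for _, _, dst in transitions}
    labels = {s: table for s, table in labels.items() if s in reachable}
    transitions = {(f, r, t) for f, r, t in transitions
                   if f in labels and t in labels}
    return labels, transitions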

2.5.2 Making a Decision

Given an SMTP conversation Con, we assign it to an SMTP dialect by traversing the decision state machine MD in the following way:

Step 1: We keep a list A of active states, and a list CD of dialect candidates. At the beginning of the algorithm, A only contains the initial state of MD, while CD contains all the learned dialects.

Step 2: Every time we see a server reply r in Con, we check each state sa in A for outgoing transitions labeled with r. If such transitions exist, we follow each of them and add the end states to a list A′. Then, we set A′ as the new active state list A.

Step 3: Every time we see a client command c in Con, we check each state sa in A. If sa's table has an entry that matches c, and the identifier for that entry is in the dialect candidate list CD, we copy sa to a list A′. We then remove from CD all dialect candidates whose table entry in sa did not match c. We set A′ as the new active state list A.

The dialects that are still in CD at the end of the process are the possible candidates the conversation belongs to. If CD contains a single candidate, we can make a decision and assign the conversation to a unique dialect.
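The following sketch summarizes this traversal, assuming the merged representation from the previous sketch and a conversation that has already been converted into (reply template, command template) pairs.

def match_conversation(labels, transitions, initial, conversation, all_dialects):
    """Traverse the decision state machine and narrow down the dialect candidates."""
    active = {initial}                  # A: active states
    candidates = set(all_dialects)      # CD: dialect candidates

    for reply_template, command_template in conversation:
        # Step 2: follow transitions that match the server reply.
        active = {dst for (src, r, dst) in transitions
                  if src in active and r == reply_template}
        if not command_template:        # client sent nothing after this reply
            continue
        # Step 3: keep states (and dialects) whose table entry matches the command.
        next_active, surviving = set(), set()
        for state in active:
            for dialect, template in labels[state].items():
                if dialect in candidates and template == command_template:
                    next_active.add(state)
                    surviving.add(dialect)
        active, candidates = next_active, surviving

    return candidates   # a single remaining dialect means a unique attribution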


2.5.3 Applying the Decision

The decision approach explained in the previous section can be used in different ways, and for different purposes. In particular, we can use it to assess which client a server is talking to. Furthermore, we can use it for spam mitigation, and close connections whenever a conversation matches a dialect spoken by a bot. Similarly to what we discussed in Section 2.4, the decision process can happen passively, or actively, by having a server decide which replies to send to the client.

In the first case, we traverse the decision state machine for each reply, as described in Section 2.5.2, and end up with a dialect candidate set at the end of the conversation. Consider, for example, the decision state machine in Figure 2.4. By passively observing the SMTP conversation, we are able to discard one of the two dialects from the candidate set as soon as the client sends the HELO message. If the commands of the remaining candidate match the ones in the decision state machine for that client until we observe the DATA command, we can attribute the conversation to that dialect. Otherwise, the conversation does not belong to any learned dialect. As discussed in Section 2.4, passive observation gives no guarantee to uniquely identify a dialect. In this context, a less problematic use case is to deploy this approach for spam detection: once the candidate set CD contains only bots, we can close the connection and classify the conversation as related to spam. As we will show in Section 2.8, this approach works well in practice on a real-world data set. If passive observation is not enough to identify a dialect, one can use active probing.

Gain Heuristic

To perform active detection, we need to identify “good” replies that we can send to achieve our purpose (dialect classification or spam mitigation). More specifically, we need to find out which replies can be used to expose the deviations in different implementations. To achieve this goal, we use the following heuristic: For each state in which a dialect i reaches the end of a conversation (i.e., sends a DATA or QUIT command, or just closes the connection), we assign a gain value gi to the dialect i in that state. The gain value represents how much it would help achieve our detection goal if we reached that state during our decision process. Then, we propagate the gain values backward along the transitions of the decision state machine. For each state s, we set the gain for i in that state as the maximum of the gain values for i that have been propagated to that state. To correctly handle loops, we continue propagating the gain values until we reach a fixed point. We then calculate the gain for s as the minimum of the gains for any dialect j in s. We do this to ensure that our decision is safe in the worst-case scenario (i.e., for the client with the minimal gain for that state). We calculate the initial gain for a state in different ways, depending on the goal of our decision process.


When performing spam mitigation, we want to avoid causing a legitimate client to fail to send an email. For this reason, we strongly penalize failure states for legitimate clients, while we want to have high gains for states in which spambots would fail. For each state in which a dialect reaches a final state, we calculate the gain for that state as follows: First, we assign a score to each client with a final label for that state (i.e., a QUIT, a DATA, or a connection-closed label). We want to give more importance to states that make bots fail, while we never want to visit states that make legitimate clients fail. Also, we want to give a neutral gain to states that make legitimate clients succeed, and a slightly lower gain to states that make bots succeed. To achieve this, we assign a score of 1 to bot failure states, a score of 0 to legitimate-client failure states, a score of 0.5 to legitimate-client success states, and a score of 0.2 to bot success states. Notice that what we need here is a lattice of values that respects the stated precedence. Therefore, any set of numbers that maintains this relationship would work.

When performing classification, we want to be as aggressive as possible in reducing the number of possible dialect candidates. In other words, we want to have high gains for states that allow us to make a decision on which dialect is spoken by a given client. Such states are those with a single possible client in them, or with different clients, each one with a different command label. To achieve this property, we set the gain for each state that includes a final label as G = d/n, where n is the total number of labels in that state, and d is the number of unique labels.
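The gain computation can be sketched as follows for the spam mitigation case; the score table mirrors the lattice given above, and the propagation loop simply iterates until a fixed point is reached.

# Scores forming the lattice described above; any values that preserve
# this ordering would work equally well.
SCORES = {
    ("bot", "fail"): 1.0,
    ("legit", "success"): 0.5,
    ("bot", "success"): 0.2,
    ("legit", "fail"): 0.0,
}

def propagate_gains(states, transitions, final_outcomes):
    """Backward-propagate per-dialect gains until a fixed point is reached.

    final_outcomes maps a state to {dialect: ("bot"|"legit", "fail"|"success")}.
    Returns {state: gain}, the minimum over dialects of the propagated maxima."""
    gains = {s: {d: SCORES[o] for d, o in final_outcomes.get(s, {}).items()}
             for s in states}

    changed = True
    while changed:                                 # fixed point handles loops
        changed = False
        for src, _reply, dst in transitions:       # propagate backward: dst -> src
            for dialect, g in gains[dst].items():
                if g > gains[src].get(dialect, float("-inf")):
                    gains[src][dialect] = g        # keep the maximum per dialect
                    changed = True

    # Worst-case safety: the gain of a state is the minimum over its dialects.
    return {s: (min(per.values()) if per else 0.0) for s, per in gains.items()}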

Reply Selection

At each iteration of the algorithm explained in Section 2.5.2, we decide which reply to send by evaluating the gain for every possible reply from the states in A. For all the states reachable in one transition from the states in A, we select the states Sa that still have at least one active client in their label table. We group together those states that are connected to the active states by transitions with the same label. For each group, we pick the minimum gain among the states in that group. We consider this number as the gain we would get by sending that reply. After calculating the gain for all possible replies, we send the reply with the highest gain associated to it. In case more than one reply yields the same gain, we pick one randomly.

2.6 Botnet Feedback Mechanism

Modern spamming botnets typically use template-based spamming in order to send emails [103, 147, 173]. This way, the botnet Command and Control (C&C) infrastructure tells the bots what kind of emails to send out, and the bots relay back information about the delivery as they received it from the SMTP server.


This server feedback is an important piece of information to the botmaster, since it enables her to monitor whether her botnet is working correctly. Of course, a legitimate sender is also interested in information about the delivery process. However, she is interested in different information compared to the botmaster. In particular, a legitimate user wants to know whether the delivery of her emails failed (e.g., due to a typo in the email address). In such a case, the user wants to correct the mistake and send the message again. In contrast, a spammer usually sends emails in batches, and typically does not care about sending an email again in case of failure. Nonetheless, there are three main pieces of information related to server feedback that a rational spammer is interested in: (i) whether the delivery failed because the IP address of the bot is blacklisted; (ii) whether the delivery failed because of specific policies in place at the receiving end (e.g., greylisting); (iii) whether the delivery failed because the recipient address does not exist.

In all three cases, the spammer can leverage the information obtained from the mail server to make her operation more effective and profitable. In the case of a blacklisted bot, she can stop sending spam using that IP address, and wait for it to be whitelisted again after several hours or days. Empirical evidence suggests that spammers already collect this information and act accordingly [173]. If the recipient server replied with an SMTP non-critical error (i.e., the ones used in greylisting), the spammer can send the email again after some minutes to comply with the recipient's policy. The third case, in which the recipient address does not exist, is the most interesting, because it implies that the spammer can permanently remove that email address from her email lists, and avoid using it during subsequent campaigns. Recent research suggests that bot feedback is an important part of a spamming botnet operation. For example, Stone-Gross et al. [173] showed that about 35% of the email addresses used by the Cutwail botnet were in fact non-existent. By leveraging the server feedback received by the bots, a botmaster may consider getting rid of those non-existent addresses, and optimize her spamming performance significantly.

Breaking the Loop: Providing False Responses to Spam Emails

Based on these insights, we want to study how we can manipulate the SMTP delivery process of bots to influence their sending behavior. We want to investigate what would happen if mail servers started giving erroneous feedback to bots. In particular, we are interested in the third case, since influencing the first two pieces of information has only a limited, short-term impact on a spammer. However, if we provide false information about the status of a recipient's address, this leads to a double bind for the spammer: on the one hand, if a spammer considers server feedback, she will remove a valid recipient address from her email list. Effectively, this leads to a reduced number of spam emails received at this particular address. On

the other hand, if the spammer does not consider server feedback, this reduces the effectiveness of her spam campaigns since emails are sent to non-existent addresses. In the long run, this will significantly degrade the freshness of her email lists and reduce the number of successfully sent emails. In the following, we discuss how we can take advantage of this situation.

As a first step, we need to identify that a given SMTP conversation belongs to a bot. To this end, a mail server can either use traditional IP-based blacklists or leverage the analysis of SMTP dialects introduced previously. Once we have identified a bot, a mail server can (instead of closing the connection) start sending erroneous feedback to the bot, which will relay this information to the C&C infrastructure. For instance, the mail server could report that the recipient of that email does not exist. By doing this, the email server would lead the botmaster to the lose-lose situation discussed before. For a rational botmaster, we expect that this technique would reduce the amount of spam the email address receives. We have implemented this approach as a second instance of our technique to leverage the email delivery for spam mitigation, and we report on the empirical results in Section 2.8.2.
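As an illustration of this feedback-poisoning behavior, the sketch below shows a hypothetical reply handler that answers every RCPT command from an identified bot with a "recipient does not exist" error while otherwise continuing the conversation; the reply strings are illustrative, not the exact ones produced by our instrumented server.

def handle_command(command, is_bot):
    """Reply to one SMTP command; poison the recipient feedback for bots."""
    verb = command.split(" ", 1)[0].upper()
    if is_bot and verb == "RCPT":
        # False feedback: claim that the (possibly valid) recipient does not exist.
        return "550 5.1.1 Recipient address rejected: user unknown\r\n"
    replies = {
        "HELO": "250 mail.example.com\r\n",
        "EHLO": "250 mail.example.com\r\n",
        "MAIL": "250 2.1.0 Ok\r\n",
        "RCPT": "250 2.1.5 Ok\r\n",
        "DATA": "354 End data with <CR><LF>.<CR><LF>\r\n",
        "QUIT": "221 2.0.0 Bye\r\n",
    }
    return replies.get(verb, "250 2.0.0 Ok\r\n")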

2.7 System Overview

We implemented our approach in a tool, called B@bel. B@bel runs email clients (legitimate or malicious) in virtual machines, and applies the learning techniques explained in Section 2.4 to learn the SMTP dialect of each client. Then, it leverages the learned dialects to build a decision state machine MD, and uses it to perform malware classification or spam mitigation. Figure 2.5 illustrates an overview of B@bel. In the following paragraphs, we discuss the components of our tool in more detail.

The first component of B@bel is a virtual machine zoo. Each of the virtual machines in the zoo runs a different email client. We used VirtualBox as our virtualization environment, and Windows XP SP3, Windows Server 2008, Windows 7, Ubuntu 11.10, or Mac OS X Lion as operating systems on the virtual machines, depending on the platform required to run each of the legitimate clients or MTAs. Additionally, we used Windows XP SP3 to run the malware samples. In summary, clients can be legitimate email programs, mail transfer agents, or spambots.

The second component of B@bel is a gateway, used to confine suspicious network traffic. Since the clients that we run in the virtual machines are potentially malicious, we need to make sure that they do not harm the outside world. To this end, while still allowing the clients to connect to the Internet, we use restricting firewall rules, and we throttle their bandwidth, to make sure that they will not be able to launch denial-of-service attacks. Furthermore, we sinkhole all SMTP connections by redirecting them to local mail servers under our control.


Figure 2.5: Overview of the B@bel architecture.

We use three different mail servers in B@bel. The first email server is a regular server that speaks plain SMTP, and performs passive observation of the client's SMTP conversation. The second server is instrumented to perform active probing, as described in Section 2.4.2. Finally, the third server is configured to always report to the client that the recipient of an email does not exist, and is used to study how spammers use the feedback they receive from their bots.

The third component of B@bel is the learner. This component analyzes the active or passive observations generated between the clients in the zoo and the mail servers, learns an SMTP dialect for each client, and generates the decision state machine using the various dialects as input, as explained in Section 2.5. Depending on the task we want to perform (dialect classification or spam mitigation), B@bel tags the states in the decision state machine with the appropriate gain.

The last component of B@bel is the decision maker. This component analyzes an SMTP conversation, by either passively observing it or by impersonating the server, and makes a decision about which dialect is spoken by the client, using the process described in Section 2.5.2.

2.8 Evaluation

In this section, we evaluate the effectiveness of our approach. We first evaluate the dialects and then we assess the feedback manipulation techniques.


Table 2.1: MTAs, MUAs, and bots used to learn dialects.
Mail User Agents: Eudora, Opera, Thunderbird, The Bat!, Pegasus, Outlook 2010, Outlook Express, Windows Live Mail
Mail Transfer Agents: Exchange 2010, Exim, Postfix, Qmail, Sendmail
Bots (by AV labels): Waledac, Donbot, Lethic, Klez, Buzus, Bagle, Cutwail, Grum, Tedroo, Mydoom, Mazben

2.8.1 Evaluating the Dialects

We now evaluate the dialects both for their classification accuracy and their spam detection precision. We show that B@bel can successfully classify spam malware and detect spam emails with high accuracy.

Evaluating Dialects for Classification

We trained B@bel by running active probing on a variety of popular Mail User Agents, Mail Transfer Agents, and bot samples. Table 2.1 lists the clients we used for dialect learning. Since we are extracting dialects by looking at the SMTP conversations only, B@bel is agnostic to the family a bot belongs to. However, for legibility purposes, Table 2.1 groups bots according to the most frequent label assigned by the anti-virus products deployed by VirusTotal [191]. Our dataset contained 13 legitimate MUAs and MTAs, and 91 distinct malware samples. We picked the spambot samples to be representative of the largest active spamming botnets according to a recent report [116] (the report lists Lethic, Cutwail, Mazben, Tedroo, and Bagle). We also picked worm samples that spread through email, such as Mydoom. In total, the malware samples we selected belonged to 11 families. The dialect learning phase resulted in a total of 60 dialects. We explain the reason for this high number of discovered dialects later in this section.

We then wanted to assess whether a dialect (i.e., a state machine) is unique or not. For each combination of dialects <d1, d2>, we merged their state machines together as explained in Section 2.5.1. We consider two dialects as distinct if any state of the merged state machine has two different labels in the label table for the dialects d1 and d2, or if any state has a single possible dialect in it. The results show that the dialects spoken by the legitimate MUAs and MTAs are distinct from the ones spoken by the bots. By analyzing the set of dialects spoken by legitimate MUAs and MTAs, we found that they all speak distinct dialects, except for Outlook Express and Windows Live Mail. We believe that Microsoft used the same email engine for these two products.


The 91 malware samples resulted in 48 unique dialects. We manually analyzed the spambots that use the same dialect, and we found that they always belong to the same family, with the exception of six samples. These samples were either not flagged by any anti-virus at the time of our analysis, or match a dropper that downloaded the spambot at a later time [23]. This shows that B@bel is able to classify spambot samples by looking at their email behavior, and label them more accurately than anti-virus products.

We then wanted to understand the reason for the high number of dialects we discovered. To this end, we considered clusters of malware samples that were speaking the same dialect. For each cluster, we assigned a label to it, based on the most common anti-virus label among the samples in the cluster. All the clusters were unique, with the exception of eleven clusters marked as Lethic and two clusters marked as Mydoom. By manual inspection, we found that Lethic randomly closes the connection after sending the EHLO message. Since our dialect state machines are non-deterministic, our approach handles this case, in principle. However, in some cases, this non-deterministic behavior made it impossible to record a reply for a particular test case during our active probing. We found that each cluster labeled as Lethic differs by at most five non-recorded test cases from every other Lethic cluster. This gives us confidence to say that the dialect spoken by Lethic is indeed unique. For the two clusters labeled as Mydoom, we believe this is a common label assigned to unknown worms. In fact, the two dialects spoken by the samples in the clusters are very different. This is another indicator that B@bel can be used to classify spamming malware in a more precise fashion than is possible by relying on anti-virus labels only.

Evaluating Dialects for Spam Detection

To evaluate how the learned dialects can be used for spam detection, we collected the SMTP conversations for 621,919 email messages on four mail servers in the Computer Science Department of the University of California, Santa Barbara (UCSB), spanning 40 days of activity. For each email received by the department servers, we extracted the SMTP conversation associated with it, and then ran B@bel on it to perform spam detection. To this end, we used the conversations logged by the Anubis system [13] during a period of one year (corresponding to 7,114 samples) to build the bot dialects, and the dialects learned in Section 2.8.1 for MUAs and MTAs as legitimate clients. In addition, we manually extracted the dialects spoken by popular web mail services from the conversations logged by the department mail servers, and added them to the legitimate MTA dialects. Note that, since the goal of this experiment is to perform passive spam detection, learning the dialects by passively observing SMTP conversations is sufficient.


During our experiment, B@bel marked a conversation as spam if, at the end of the conversation, the dialects in CD were all associated with bots. Furthermore, if the dialects in CD were all associated with MUAs or MTAs, B@bel marked the conversation as legitimate (ham). If there were both benign and malicious clients in CD, B@bel did not make a decision. Finally, if the decision state machine did not recognize the SMTP conversation at all, B@bel considered that conversation as spam. This could happen when we observe a conversation from a client that was not in our training set. As we will show later, considering it as spam is a reasonable assumption, and is not a major source of false positives.

In total, B@bel flagged 260,074 conversations as spam, and 218,675 as ham. For 143,170 emails, B@bel could not make a decision, because the decision process ended up in a state where there were both legitimate clients and bots in CD. To verify how accurate our decisions were, we used a number of techniques. First, we checked whether the department mail servers blocked the email in the first place. These servers have a common configuration, where incoming emails are first checked against an IP blacklist, and then against more expensive content-analysis techniques. In particular, these servers used a commercial blacklist for discarding emails coming from known spamming IP addresses, and utilized SpamAssassin and ClamAV for content analysis. Any time one of these techniques and B@bel agreed on flagging a conversation as spam, we consider this as a true positive of our system. We also consider as true positives those conversations that B@bel marked as spam, and that led to an NXDOMAIN or to a timeout when we tried to resolve the domain associated with the sender email address. In addition, we checked the sender IP address against 30 additional IP blacklists [105], and considered any match as a true positive.

According to this ground truth, the true positive rate for the emails B@bel flagged as being sent by bots is 99.32%. Surprisingly, 98% of the 24,757 conversations that were not recognized by our decision state machine were flagged as spam by existing methods. This shows that, even if the set of clients from which our framework learned the dialects is not complete, there are no widely-used legitimate clients we missed, and that it is safe to consider any conversation generated by a non-observed dialect as spam. For the remaining 2,074 emails that B@bel flagged as spam, we could not assess whether they were spam or not. They might have been false positives of our approach, or false negatives of the existing methods. To remain on the safe side, we consider them as false positives. This results in B@bel having a precision of 99.3%. It is worth noting that in a scenario where we had control of the mail server, we could assess whether these conversations belong to known bot dialects by active probing.

We then looked at our false negatives. We consider as false negatives those conversations that B@bel classified as belonging to a legitimate client dialect, but that have been flagged as spam by any of the previously mentioned techniques. In total, the other spam detection mechanisms flagged 71,342 emails as spam, among the

ones that B@bel flagged as legitimate. Considering these emails as false negatives, our framework has a false negative rate of 21%. The number of false negatives might appear large at first. However, we need to consider the sources of these spam messages. While the vast majority of spam comes from botnets, spam can also be sent by dedicated MTAs, as well as through misused web mail accounts. Since B@bel is designed to detect email clients, we are able to detect which MTA or web mail application the email comes from, but we cannot assess whether that email is ham or spam. To show that this is the case, we investigated these 71,342 messages, which originated from 7,041 unique IP addresses. Assuming these are legitimate MTAs, we connected to each IP address on TCP port 25 and observed greeting messages for popular MTAs. For 3,183 IP addresses, one of the MTAs that we used to learn the dialects responded. The remaining 3,858 IP addresses did not respond within a 10 second timeout. We performed reverse DNS lookups on these IP addresses and assessed whether their assigned DNS names contained indicative names such as smtp or mail. 1,654 DNS names were in this group. We could not find any conclusive proof that the remaining 2,204 addresses belong to legitimate MTAs.

For those dialects for which B@bel could not make a decision (because the conversation led to a state where both one or more legitimate clients and bots were active), we investigated whether we could have assessed that the client was a bot by using active probing. Since the spambot and legitimate client dialects we observed are disjoint, this is always possible. In particular, B@bel found that it is always possible to distinguish between the dialects spoken by a spambot and by a legitimate email client that look identical under passive analysis by sending a single SMTP reply. For example, the SMTP RFC specifies that multi-line replies are allowed, provided that all the lines in the reply have the same code, and all the reply codes but the last one are followed by a dash character. Therefore, multi-line replies that use different reply codes are not allowed by the standard. We can leverage different handling of this corner case to disambiguate between Qmail and Mydoom. More precisely, if we send the reply 250-OK<CR><LF>550 Error, Qmail will take the first reply code as the right one and continue the SMTP transaction, while Mydoom will take the second reply code as the right one and close the connection. Based on these observations, we can say that if we ran B@bel in active mode, we could distinguish between these ambiguous cases and make the right decision. Unfortunately, we could run B@bel only in passive mode on our department mail servers.

Our results show that B@bel can detect (and possibly block) spam emails sent by bots with sufficient accuracy. However, our approach is unable to detect those spam emails sent by dedicated MTAs or by compromised webmail accounts. For this reason, similar to other state-of-the-art mitigation techniques, B@bel is not a silver bullet, but should be used in combination with other anti-spam mechanisms. To show what would be the advantage of deploying our proposed framework on a

mail server, we studied how much spam would have been blocked on the department server if B@bel were used in addition to or in substitution of the commercial blacklist and the content analysis systems that are currently in use on those servers.

Similarly to IP blacklists, B@bel is a lightweight technique. Such techniques are typically used as a first spam-mitigation step to make quick decisions, as they avoid having to apply resource-intensive content analysis techniques to most emails. For this reason, the first configuration we studied is substituting the commercial blacklist with our approach. In this case, 259,974 emails would have been dropped as spam, instead of the 219,726 that were blocked by the IP blacklist. This would have resulted in 15.5% fewer emails being sent to the content analysis system, reducing the load on the servers. Moreover, the emails detected as spam by B@bel and the IP blacklist do not overlap completely. For example, the IP blacklist flags as spam emails sent by known misused MTAs. Therefore, we analyzed the amount of spam that the two techniques could have caught if used together. In this scenario, 278,664 emails would have been blocked, resulting in 26.8% fewer emails being forwarded to the content analysis system compared to using the blacklist alone. As a last experiment, we studied how much spam would have been blocked on our servers by using B@bel in combination with both the commercial blacklist and the content analysis systems. In this scenario, 297,595 emails would have been flagged as spam, which constitutes an improvement of 3.9% compared to the servers' original configuration. These results show that, by using B@bel, a server could reduce the number of emails that need to be processed by content analysis. Since content analysis is expensive and not always feasible on busy servers, using our approach would be advantageous.

2.8.2 Evaluating the Feedback Manipulation

To investigate the effects of wrong server feedback to bots, we set up the following experiment. We ran 32 malware samples from four large spamming botnet families (Cutwail, Lethic, Grum, and Bagle) in a controlled environment, and redirected all of their SMTP activity to the third mail server in the B@bel architecture. We configured this server to report that any recipient of the emails the bots were sending to was non-existent, as described in Section 2.7. To assess whether the different botnets stopped sending emails to those addresses, we leveraged a spamtrap under our control. A spamtrap is a set of email addresses that do not belong to real users, and, therefore, collect only spam mails. To evaluate our approach, we leverage the following idea: if an email address is successfully removed from an email list used by a spam campaign, we will not observe the same campaign targeting that address again. We define as a spam campaign the set of emails that share the same URL templates in their links, similar to the work of Xie et al. [210]. While there are more advanced methods to detect spam campaigns [147], the chosen approach leads to sufficiently good results for our purposes.


We ran our experiment for 73 days, from June 18 to August 30, 2011. During this period, our mail server replied with false server feedback for 3,632 destination email addresses covered by our spamtrap, which were targeted by 29 distinct spam campaigns. Of these, three campaigns never again targeted the addresses for which we gave erroneous feedback. To assess the impact we would have had when sending erroneous feedback to all the addresses in the spamtrap, we look at how many emails the whole spamtrap received from the campaigns. In total, 2,864,474 emails belonged to campaigns. Of these, 550,776 belonged to the three campaigns for which we are confident that our technique worked and reduced the amount of spam emails received at these addresses. Surprisingly, this accounts for 19% of the total number of emails received, indicating that this approach could have an impact in practice. We acknowledge that these results are preliminary and provide only a first insight into the large-scale application of server feedback poisoning. Nevertheless, we are confident that this approach is reasonable since it leads to a lose-lose situation for the botmaster, as discussed in Section 2.6. We argue that the uncertainty about server feedback introduced by our method is beneficial since it reduces the amount of information a spammer can obtain when sending spam.

2.9 Discussion

Our results demonstrate that B@bel is successful in detecting current spambots. However, spam detection is an adversarial game. Thus, once B@bel is deployed, we have to expect that spammers will evolve and try to bypass our systems. In this section, we discuss potential paths for evasion.

2.9.1 Evading Dialect Detection

The most immediate path to avoid detection by dialects is to implement an SMTP engine that precisely follows the specification. Alternatively, a bot author could use an existing SMTP engine that is used by legitimate email clients. We argue that this has a negative impact on the effectiveness and flexibility of spamming botnets. Many spambots are built for performance; their primary purpose is to distribute as many messages as possible. In some cases, spambots even send multiple messages without waiting for any server response. Clearly, any additional checks and parsing of server replies incur overhead that might slow down the sender. To this end, we performed a simple experiment to measure the speed difference between a malware program sending spam (Bagle) and a legitimate email library on Windows (Collaboration Data Objects - CDO). We found that Bagle can send an email every 20 ms to a local mail server. When trying to send emails as fast as possible using the Windows library (in a tight loop), we measured that a single email

required 200 ms, an order of magnitude longer. Thus, when bots are forced to faithfully implement large portions of the SMTP specification (because otherwise, active probing will detect differences), spammers suffer a performance penalty.

Spammers could still decide to adopt a well-known SMTP implementation or revert to a well-known SMTP library. In this case, another aspect of spamming botnets has to be taken into account. Typically, cybercriminals who infect machines with bots are not the same as the spammers who rent botnets to distribute their messages. Modern spamming botnets allow their customers to customize the email headers to mimic legitimate clients. In this scenario, B@bel could exploit possible discrepancies between the email client identified by the SMTP dialect and the one announced in the body of an email (e.g., via the X-Mailer header). When these two dialects do not match, we can detect that the sender pretends to speak a dialect that is inconsistent with the content of the message. Of course, the botmasters could take away the possibility for their customers to customize the headers of their emails, and force them to match the ones typical of a certain legitimate client. However, while this would make spam detection harder for B@bel, it would make it easier for other systems that rely on email-header analysis, such as Botnet Judo [147].
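A hypothetical consistency check of this kind is sketched below; the mapping from announced X-Mailer values to dialect labels is an illustrative example, not an exhaustive table.

# Hypothetical mapping from announced X-Mailer values to the expected dialect.
EXPECTED_DIALECT = {
    "Microsoft Outlook 14.0": "outlook-2010",
    "The Bat!": "the-bat",
}

def is_inconsistent(x_mailer, identified_dialect):
    """Flag a message whose announced client does not match the observed dialect."""
    expected = next((dialect for prefix, dialect in EXPECTED_DIALECT.items()
                     if x_mailer.startswith(prefix)), None)
    return expected is not None and expected != identified_dialect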

2.9.2 Mitigating Feedback Manipulation

As we discussed in Section 2.6, spammers can decide to either discard or trust any feedback they receive from the bots. To escape this dilemma, attackers could try to guess whether the receiving mail server is performing feedback manipulation. For example, when all emails to a particular domain are rejected because no recipient exists, perhaps all feedback from this server can be discarded. In this case, we would need to update our feedback mechanism to return invalid feedback only in a fraction of the cases.

2.10 Related Work

Email spam is a well-known problem that has attracted a substantial amount of research over the past years. Existing work on spam filtering can be broadly classified into two categories: post-acceptance methods and pre-acceptance methods. Post-acceptance methods receive the full message and then rely on content analysis to detect spam emails. There are many approaches that allow one to differentiate between spam and legitimate emails: popular methods include Naive Bayes, Support Vector Machines (SVMs), or similar methods from the field of machine learning [49, 124, 164, 166]. Other approaches for content-based filtering rely on identifying the URLs used in spam emails [180, 210]. A third method is DomainKeys Identified Mail (DKIM), a system that verifies that an email has been sent by a certain domain by using cryptographic signatures [106]. In practice, performing content analysis or computing cryptographic checksums on every incoming email can be expensive

and might lead to high load on busy servers [182]. Furthermore, an attacker might attempt to bypass the content analysis system by crafting spam messages in specific ways [113, 134]. In general, the drawback of post-acceptance methods is that an email has to be received before it can be analyzed.

Pre-acceptance methods attempt to detect spam before actually receiving the full message. Some analysis techniques take the origin of an email into account and analyze distinctive features about the sender of an email (e.g., the IP address or autonomous system the email is sent from, or the geographical distance between the sender and the receiver) [77, 158, 176, 190]. In practice, these sender-based techniques have coverage problems: previous work showed how IP blacklists miss detecting a large fraction of the IP addresses that are actually sending spam, especially due to the highly dynamic nature of the machines that send spam [155, 169, 173].

Our method is a novel, third approach that focuses on how messages are sent. This avoids costly content analysis and does not require the design and implementation of a reputation metric or blacklist. In contrast, we attempt to recognize the SMTP dialect during the actual SMTP transaction. This complements both pre-acceptance and post-acceptance approaches. Other work that went in this direction was done by Beverly et al. [15] and Kakavelakis et al. [91]. The authors of these papers leveraged the fact that spambots often have bad connections to the Internet, and perform spam detection by looking at TCP-level features such as retransmissions and connection resets. Our system is more robust, because it does not rely on assumptions about the network connectivity of a mail client. Moreover, to the best of our knowledge, we are the first to study the effects of manipulating server feedback to poison the information sent by a bot to the botmaster.

The core idea behind our approach is to learn the SMTP dialect spoken by a particular client. This problem is closely related to the problem of automated protocol reverse engineering, where an (unknown) protocol is analyzed to determine the individual records/elements and the protocol's structure [19, 33]. Initial work in this area focused on clustering of network traces to group similar messages [38], while later methods extracted protocol information by analyzing the execution of a program while it performs network communication [25, 39, 111, 198, 202]. Sophisticated methods can also handle multiple messages and recover the protocol's state machine. For example, Dispatcher is a tool capable of extracting the format of protocol messages when having access to only one endpoint, namely the bot binary [24]. Cho et al. leverage the information extracted by Dispatcher to learn C&C protocols [29]. Brumley et al. studied how deviations in the implementation of a given protocol specification can be used to detect errors or generate fingerprints [22].

Our problem is related to previous work on protocol analysis, in the sense that we extract different SMTP protocol variations, and use these variations to build fingerprints. However, in this work, we treat the speaker of the protocol (the bot) as a blackbox, and we do not perform any code analysis or instrumentation to find

protocol formats or deviations. This is important because malware is notoriously difficult to analyze and we might not always have a malware sample available. Instead, our technique allows us to build SMTP dialect state machines even when interacting with a previously unknown spambot.

There is also a line of research on fingerprinting protocols [32, 142, 212]. Initial work in this area leveraged manual analysis. Nonetheless, there are methods, such as FiG, that automatically generate fingerprints for DNS servers [189]. The main difference between our work and FiG is that our dialects are stateful, while FiG operates on individual messages. This entirely avoids the need to merge and explore protocol state machines. However, as discussed previously, individual messages are typically not sufficient to distinguish between SMTP engines.

2.11 Summary

In this chapter, we introduced a novel way to detect and mitigate spam emails that complements content- and sender-based analysis methods. We focused on how email messages are sent and created methods to influence the spam delivery mechanism during SMTP transactions. On the one hand, we showed how small deviations in the SMTP implementations of different email agents (so-called SMTP dialects) allow us to detect spambots during the actual SMTP communication. On the other hand, we studied how we can poison the feedback mechanism used by botnets in a way that negatively impacts their effectiveness. Empirical results confirmed that both aspects of our approach can be used to detect and mitigate spam emails. While spammers might adapt their spam-sending practices as a result of our findings, we argue that this reduces their performance and flexibility.


3 HTTP-Based Malware Mitigation

“The only truly secure system is one that is powered off, cast in a block of concrete and sealed in a lead-lined room with armed guards.”

Gene Spafford

As already stated previously, malicious software and especially botnets are among the most important security threats on the Internet. Thus, the accurate and timely detection of such threats is of great importance. Detecting machines infected with malware by identifying their malicious activities at the network level is an appealing approach, due to the ease of deployment. Nowadays, the most common communication channels used by attackers to control the infected machines are based on the HTTP protocol. To evade detection, HTTP-based malware adapts its behavior to the communication patterns of benign HTTP clients, such as web browsers. This poses significant challenges to existing detection approaches like signature-based and behavioral-based detection systems.

In Chapter 2 we focused on the activities of a particular type of botnet (spam botnet), and more precisely on the detection and mitigation of the spam emails it sends. In this chapter, we expand the approach of dialects to the HTTP protocol. In particular, we present a novel approach to detect HTTP-based malware, including botnets, at the network level by observing the small but perceivable differences in the implementations of the HTTP protocol. Our evaluation results demonstrate that we can precisely identify a large variety of real-world HTTP-based malware while outperforming prior work on identifying HTTP-based botnets.


3.1 Introduction

Over the last decade, we have witnessed a tremendous increase in the number of users connected to the Internet [85]. Simultaneously, the number of potential victims of malware has significantly increased, as many users run outdated or vulnerable software [96] and thus are susceptible to attacks that may lead to a compromise of their devices. These infected devices can then be turned into bots and instructed through a C&C channel. There is a large variety of network protocols that have been used to implement the C&C channel, ranging from IRC and HTTP [1, 34] to peer-to-peer (P2P) [35, 71] for a decentralized C&C structure.

Recently, numerous techniques have been proposed to detect botnets and compromised devices at the network level. Many detection systems focus on the transferred information during clients' infections. These systems can automatically generate [109, 135, 168] and match [143, 160] signatures on packet payloads. Others, like BotHunter [73], can recognize the information flow of the initial event sequence during a compromise. Although these approaches may detect infections in the first place, existing malware on compromised machines will probably not be detected. Other detection approaches are based on the insight that bots behave similarly in a spatial-temporal manner when conducting nefarious activities. Thus, they correlate the behavior of hosts in the monitored network to unveil compromised clients and C&C servers [63, 72]. These approaches are typically protocol agnostic, but they are not effective at detecting a small number of infected machines within a monitored network, and may lead to a large number of false positive alerts. Furthermore, bots often change their behavior to be more effective and avoid detection, e.g., by emulating the behavior of benign programs. Also, most bots try to stay hidden and accomplish their tasks in a stealthy way, leaving the user unaware of an infection. These facts pose significant challenges to existing detection systems.

Nowadays, C&C channels and common types of malware, such as adware, trojans, and backdoors, leverage the HTTP protocol due to its large popularity [86, 88]. Consequently, the malicious traffic can blend with benign traffic and remain undetected. Existing signature-based detection systems cannot accurately distinguish between malicious and benign clients, as their resulting HTTP traffic may look alike. The main reason for this weakness is that the generic signature language used by these systems lacks higher-level information specific to HTTP, which could be used to identify the network traffic produced by HTTP-based malware. Additionally, existing behavioral-based detection systems may not be able to identify the set of infected hosts, as their network behavior can be adapted to the usual behavior of benign hosts when accessing the web.

To address the above issues, we propose BotHound: a framework for network-level detection of malware that leverages the HTTP protocol as the main channel to communicate with the C&C server or to perpetrate nefarious activities. In contrast

In contrast to previous works, BotHound focuses on the HTTP requests issued by both malware and benign web clients. The basic insight is that two implementations of the HTTP protocol will slightly differ: empirical results indicate that HTTP-relevant behavior varies due to the ambiguousness of the protocol and the respective quality of the implementation. To achieve accurate malware detection at the HTTP level, we define two new models, header chains and HTTP templates, which can identify such implementation differences in the communication messages sent by malware and benign clients. Header chains analyze the sequence of HTTP headers sent in a request to detect a suspicious header order, while HTTP templates represent a generalized form of the HTTP header name-value pairs generated by malware requests. These models expand the idea of dialects, which we proposed in Chapter 2. More precisely, BotHound shares the same principles as B@bel; both systems try to discover subtle differences in communication patterns and thereby distinguish benign from malicious traffic. However, instead of examining email communication, BotHound focuses on detecting the communication between bot and botmaster as well as the HTTP requests that lead to malicious activities.

BotHound can accurately classify the HTTP communications captured in a monitored network as benign or malicious by comparing them against models of benign and malicious traffic. In particular, our framework automatically generates header chains for both benign and malicious HTTP requests, and HTTP templates for malicious requests sent by malware. Note that a captured HTTP request may match both benign and malicious header chains; in that case, the HTTP template matching determines the classification. While header chains can rapidly classify benign traffic, HTTP templates provide increased accuracy for the detection of malicious traffic, resulting in both high detection speed and high accuracy. If no header chain or HTTP template matches a captured request, BotHound classifies this request as suspicious. Suspicious requests are reported for further investigation. In general, BotHound is less resource intensive, and thus faster, than existing Intrusion Detection Systems (IDS) that need to apply heavy pattern matching to all HTTP requests.

To demonstrate the practical feasibility of our approach, we present the design, prototype implementation, and experimental evaluation of BotHound using real-world malware samples and a broad range of benign HTTP clients. We also deploy BotHound in an operational network that counts tens of thousands of different machines and generates several million HTTP requests on a daily basis, and report our findings. Our results show that BotHound detects practically all the HTTP requests sent by malware (99.97%) with only 0.04% classification errors for the benign traffic. In the real-world deployment, BotHound detected several malware instances from known malware families. Also, we demonstrate that our system is able to detect advanced persistent threats (APTs) such as Duqu or Miniduke used in targeted attacks.


In summary, we make the following main contributions:

• We propose a new approach to distinguish the network traffic generated by HTTP-based bots and malware from benign HTTP traffic. In contrast to previous detection techniques, our approach focuses on the different implementations of the HTTP protocol.

• We introduce two models that describe malicious and benign HTTP requests: header chains and HTTP templates. These models focus on particular characteristics, such as the HTTP headers' sequence and structure, which reveal concrete discrepancies between benign and malicious requests.

• To assess the feasibility of our approach, we implemented and evaluated BotHound. Our experimental results demonstrate that BotHound is able to correctly identify the network traffic generated by HTTP-based malware in a real-world scenario with very low classification errors and high performance.

3.2 HTTP Protocol

The Hypertext Transfer Protocol (HTTP), as defined in RFC 2616 [52], is a stateless application-level protocol for distributed, collaborative, and hypermedia information systems. Clients such as web browsers use the HTTP protocol to connect to web servers for retrieving, presenting, and updating information. The protocol is defined as a conversation between the client and the server, where text messages are transmitted in an alternating way. Messages consist of requests from client to server and responses from server to client. Both types of messages contain a start line, zero or more header fields (known as headers), an empty line that indicates the end of the headers, and optionally a message body. Each line ends with a line terminator, denoted as CRLF. Each header consists of a case-insensitive name followed by a colon and the field value. We refer to the part before the colon as the header name and to the part after the colon as the header value. As our detection approach is based on HTTP requests, we briefly describe their technical structure.

HTTP clients use headers in the request message to identify themselves and control how content should be returned by the server. Some web applications use custom headers to add comments or annotations to an HTTP message; the convention is to prefix such header names with X- to indicate a non-standard header. The first line (Request-Line) of a client's request message includes: (i) the method to be applied to the resource, (ii) the identifier of the resource, and (iii) the protocol version in use. Space characters separate these elements. The client can use any method; if a method is unknown, the server will treat it as invalid. The next lines of the message contain the headers, but all of them are optional and can be omitted.


GET / HTTP/1.1
Accept: text/html, application/xhtml+xml, */*
Accept-Language: en-US
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Host: example.com
Connection: Keep-Alive

Figure 3.1: A typical HTTP request.

RFC 2616 supports 8 request methods and 19 request headers. However, most HTTP clients include additional headers to provide further connection details and information about themselves to the server. An example of a typical HTTP request is depicted in Figure 3.1. The client sends a request to the host example.com using the GET request method and awaits a response from the server.

The structure of a request message is diversified among the different instances of web browsers, crawlers, HTTP-based malware, and other kinds of applications that access the Internet via HTTP. Each instance implements the request messages differently in terms of the headers it uses, the sequence of these headers inside the message, and the value of each header. Hence, it is possible to fingerprint them by extensively analyzing the messages they exchange. In this work, we leverage these differences to identify the type of client involved in a specific HTTP conversation.
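Since this fingerprinting relies only on the Request-Line and the order of the header names, a small sketch can make it concrete. The following Python snippet (all names are illustrative and not part of BotHound) extracts the method, protocol version, and ordered header-name sequence from a raw request such as the one in Figure 3.1:

```python
# Minimal sketch: extract the structural features of an HTTP request that
# implementations tend to differ on (method, version, header order).
# Function and variable names are illustrative, not taken from BotHound.

def parse_request_structure(raw_request: str):
    lines = raw_request.split("\r\n")
    method, _, version = lines[0].split(" ", 2)           # Request-Line
    header_names = []
    for line in lines[1:]:
        if line == "":                                    # empty line ends the headers
            break
        name, _, _ = line.partition(":")
        header_names.append(name.strip())
    return method, version, header_names

raw = (
    "GET / HTTP/1.1\r\n"
    "Accept: text/html, application/xhtml+xml, */*\r\n"
    "Accept-Language: en-US\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)\r\n"
    "Host: example.com\r\n"
    "Connection: Keep-Alive\r\n"
    "\r\n"
)
print(parse_request_structure(raw))
# ('GET', 'HTTP/1.1', ['Accept', 'Accept-Language', 'Accept-Encoding',
#  'User-Agent', 'Host', 'Connection'])
```

Two clients that request the same resource may still produce different method/version/header-order triples, which is precisely the discrepancy exploited in the following sections.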

3.3 HTTP-Based Malware

Nearly every network-enabled device that is connected to the Internet allows HTTP traffic. In most networks, HTTP connections are rarely filtered, which is a good prerequisite for establishing a communication channel with infected devices. Attackers exploit this opportunity to create malware that communicate with the C&C server over HTTP, where the malware can hide their network traffic within normal web traffic. Consequently, the malware can easily bypass firewalls with port-based filtering mechanisms.

A common type of HTTP-based malware is adware. Although this is one of the least dangerous types of malware, it is at the same time one of the most lucrative for its distributors. The purpose of adware is to display advertisements or fake warnings on a victim's computer. Such malware typically utilize the HTTP protocol for rendering advertisements, downloading new malware versions to the victim's computer, and uploading status information to the attacker. More precisely, adware forces infected machines to visit targeted advertisement websites, which generates profit for the publisher of the advertisement. Occasionally, adware performs click-fraud attacks, which provide an even higher profit [60, 110, 125, 174].


Another type of HTTP-based malware is the downloader. This malware is used by attackers to download and install new software on a victim's computer. When a downloader is installed on an infected machine, it can connect to a specified server and download other types of malware to the compromised machine. This is one of the most profitable businesses in the underground market: attackers can generate substantial yearly revenue by operating a few thousand machines infected with downloaders [93, 132].

Presumably the most dangerous of all HTTP-based malware types are HTTP-based bots. Upon the infection of a device, such malware manipulate it to behave as the botmaster commands. Once a device is infected, the bot sends a request to the C&C server to inform the adversary of its presence. Typically, botmasters publish their commands on certain web servers, and the bots periodically visit those servers to obtain new commands. The widespread use of HTTP-based bots started when malicious software toolkits became popular and provided a base for even inexperienced cybercriminals to set up a botnet quickly [56]. This type of software can be bought on the underground black market and costs between a few hundred and several thousand dollars, depending on the built-in features. With regard to the hosting infrastructure of an HTTP-based botnet, researchers revealed that botmasters can use free hosting providers for their C&C servers [188]. The drawback of this solution is that the botnet can easily be shut down through abuse complaints. As an alternative, attackers use hosting infrastructures that are not subject to legislation prohibiting the execution of such malicious services and that do not respond to abuse complaints. This may enable the botnet to stay alive longer and the attacker to gain further profit.

3.4 HTTP-Level Detection

In this work we want to detect HTTP-based malware at the network level. The detection system captures and analyzes the HTTP traffic of a monitored network, aiming to detect HTTP connections originated by malware and to associate these connections with the respective malware-infected hosts. As HTTP-based malware we consider all malware instances that use HTTP as a C&C channel to communicate with their botmasters, as well as all malware instances that perform their malicious activities over HTTP.

Our approach identifies groups of malware that interact with the web using common HTTP protocol implementations. Thus, we first learn the specific aspects of each particular implementation, and then use these insights to detect the presence of compromised machines in a monitored network. To achieve this goal, we analyze the sequence of the HTTP headers within each request and generate unique models (header chains) for every malware family.

Since malware instances often attempt to disguise themselves as legitimate clients, we also construct distinctive templates that focus on specific patterns of the HTTP header values (HTTP templates). In principle, we use a formalism comparable to the one presented in Section 2.3. In the rest of this section, we describe the two strategies in detail.

3.4.1 Header Chains

To achieve reliable communication between a client and a server, both hosts need to "speak" the same protocol (i.e., HTTP in our case). Note that the implementation of this protocol is not subject to strict rules; a certain amount of freedom is granted and should be tolerated by both parties. Apart from the Request-Line, which includes the HTTP method, the identifier of the resource, and the protocol version, all other headers can be omitted or provided in any sequence. As a matter of fact, legitimate and widely used web clients—such as Internet Explorer, Mozilla Firefox, Google Chrome—implement the HTTP protocol in slightly different ways. We expect that malicious requests will also slightly deviate from each other and from legitimate requests, as malware authors have custom implementations of the HTTP protocol. This enables us to spot the small differences in the header order or in the individual headers, which results in a robust way to detect suspicious HTTP requests at the network level.

For this distinction, we model each client-specific implementation of the HTTP protocol. Thus, we need a set of HTTP requests generated by a web client while it performs a series of different tasks. Interestingly, the same client includes or omits HTTP headers depending on the task it performs. For example, the header order can change after a user has performed a login at a website, or new headers are added in case a web application sets a cookie. As a result, we need to model different header orders for a given web browser.

To represent the header order of an HTTP request, we define a header chain as a vector $\vec{H} = (h_1, h_2, \ldots, h_n)$ with header names as elements. For each HTTP request of a known client $C$, we create and store a pair $(\vec{H}, C)$ of the request's header chain $\vec{H}$ and the client's name $C$. To identify the unknown client that is responsible for an observed request, we form its chain $\vec{U}_H$ and compare all our labeled header chains $\vec{H}$ with the unlabeled chain $\vec{U}_H$. If we find a chain $\vec{H}$ that is equal to the observed chain $\vec{U}_H$, we assume that the corresponding client $C$ generated the new request.

It is possible that either no header chain matches or header chains from more than one client match at the same time. Thus, we use header chains only as a first step of detection. If there is no match with any header chain, the request is classified as suspicious. If the request matches only header chains of benign clients, it is classified as benign. Otherwise, it is classified as possibly malicious and forwarded to the next phase.
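The first-stage lookup can be summarized in a few lines. The sketch below assumes a chain database built during training that maps each header chain to the set of clients observed producing it; the data structures and labels are illustrative rather than BotHound's actual implementation:

```python
# Sketch of the header-chain lookup described above (first detection stage).
# chain_db maps a header chain (tuple of header names) to the set of clients
# observed producing it during training; names here are illustrative.

from typing import Dict, FrozenSet, Tuple

HeaderChain = Tuple[str, ...]

def classify_by_chain(chain: HeaderChain,
                      chain_db: Dict[HeaderChain, FrozenSet[str]],
                      benign_clients: FrozenSet[str]) -> str:
    clients = chain_db.get(chain)
    if clients is None:
        return "suspicious"                 # no known client produces this chain
    if clients <= benign_clients:
        return "benign"                     # only benign clients match
    return "possibly_malicious"             # at least one malware client matches

# Example usage with a toy database:
benign = frozenset({"Firefox 20", "Chrome 27"})
db = {
    ("Host", "User-Agent", "Accept", "Connection"): frozenset({"Firefox 20"}),
    ("User-Agent", "Host"): frozenset({"ZeuS"}),
}
print(classify_by_chain(("User-Agent", "Host"), db, benign))    # possibly_malicious
print(classify_by_chain(("Host", "Accept"), db, benign))        # suspicious
```

Only the "possibly_malicious" outcome triggers the more expensive HTTP template matching described next.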


3.4.2 HTTP Templates

In some cases, header chains may not be able to distinguish between two different clients, because the clients may produce identical vectors. This may happen if a malware sample attempts to use or spoof a legitimate web browser to send malicious requests with the headers in exactly the same sequence as the legitimate browser. In this case, the header chains produced by the malware and this browser will be identical. To overcome this obstacle, we examine the name-value pairs of the HTTP headers (i.e., the whole requests) to detect similarities with malicious requests.

During our experiments we found that the value of the User-Agent request header field can be used to identify some of these attempts. This field contains information about the user agent generating the request, and often contains subtle differences that enable us to detect suspicious requests. This header and the respective client's header chain can identify all the legitimate web clients. On the other hand, our experimental results show that malware that try to impersonate a legitimate web client usually change only the content of the User-Agent header, but rarely change the header sequences. Thus, BotHound is able to detect these malware instances. For example, we analyzed the binary of the Skynet malware (a Tor-powered trojan) and found that it contains 57 different hard-coded User-Agent header values, while the header sequences always remain the same. However, there are malware samples that produce requests in which the User-Agent field corresponds to a legitimate client, and thus header chains are insufficient. To address these issues, we introduce HTTP templates, for which we perform a full protocol parsing and clustering of the headers found in malicious requests. More precisely, we examine all the requests matching header chains of malicious traffic, or of both malicious and benign traffic, by extracting their HTTP templates and comparing them with the HTTP templates of malicious requests.

The template extraction mechanism works as follows. Initially, we distinguish between the headers' names and their values, and extract the latter for further analysis. Next, we analyze the extracted values and categorize them into larger groups: IP addresses, port numbers, version numbers, and Base64 encoding. Additionally, specific filetypes such as exe, bin, jpg, etc., are identified and generalized into clusters. For the remaining values we generate regular expression patterns. These regular expressions (we refer to them as tokens) can match (i) alphanumeric characters, (ii) hexadecimal digits, (iii) numerical digits, (iv) punctuation characters, or (v) whitespace characters. Finally, the generated tokens are combined with their header names to produce a new template.

When a possibly malicious request requires a deeper analysis, its template $U_T$ is extracted and compared with the existing malicious HTTP templates $T$ to determine if there is a match. Initially, we consider only a partial match between the request's template $U_T$ and a template $T$.

Given that many malware samples dynamically change or generate the requested URL [5], a partial similarity is justified. Thus, we exclude the Host and Request-Line header values from the template comparison, as these values compose the URL. If we find a partial match with at least one existing template $T$, we investigate the previously dismissed values. We use URL signatures to measure the similarity between two URLs using heuristics such as the length of the URL and the number of parameters it contains. In particular, we compare the observed URL $R_U$ with each URL $R_T$ contained in the partially matched templates and measure: (i) the normalized Levenshtein distance between the first parts of the URLs that include the path and page name, (ii) the Jaccard distance between the sets of parameter names, and (iii) the normalized Levenshtein distance between the strings obtained by concatenating the parameter values. When the overall weighted average distance between the observed URL $R_U$ and at least one of the URLs $R_T$ from the partially matched HTTP templates of malicious requests is less than a specific threshold (see Section 3.6.3), the request is classified as malicious.

HTTP templates are complementary to header chains; they do not replace them. While header chains offer a quick way to filter benign traffic and focus only on possibly malicious requests, the templates increase the detection accuracy and raise alarms when it comes to real threats. Therefore, we use header chains as a pre-filtering stage for speed: when there is no match with malware header chains, there will be no match with HTTP templates either, so there is no need to continue with the computationally expensive template processing. Thus, the combination of header chains and HTTP templates creates a fast and accurate detection model.
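As a rough illustration of the URL-signature comparison, the following sketch computes the three component distances. The concrete weights, the normalization, and the threshold scale actually used by BotHound (Section 3.6.3) are not reproduced here; they are assumptions for illustration only:

```python
# Illustrative sketch of the three URL distances used for the full template
# match: normalized Levenshtein on the path, Jaccard on the parameter names,
# and normalized Levenshtein on the concatenated parameter values. The
# weights are arbitrary placeholders.

from urllib.parse import urlsplit, parse_qs

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_lev(a: str, b: str) -> float:
    return levenshtein(a, b) / max(len(a), len(b), 1)

def jaccard_distance(s1: set, s2: set) -> float:
    if not s1 and not s2:
        return 0.0
    return 1.0 - len(s1 & s2) / len(s1 | s2)

def url_distance(url_a: str, url_b: str, weights=(0.4, 0.3, 0.3)) -> float:
    pa, pb = urlsplit(url_a), urlsplit(url_b)
    qa, qb = parse_qs(pa.query), parse_qs(pb.query)
    d_path = norm_lev(pa.path, pb.path)                   # (i) path and page name
    d_keys = jaccard_distance(set(qa), set(qb))           # (ii) parameter names
    d_vals = norm_lev("".join(sum(qa.values(), [])),      # (iii) parameter values
                      "".join(sum(qb.values(), [])))
    w1, w2, w3 = weights
    return w1 * d_path + w2 * d_keys + w3 * d_vals

print(url_distance("/gate.php?id=42&os=xp", "/gate.php?id=77&os=w7"))
```

A request would be declared a full match when this weighted distance to at least one partially matched malicious template falls below the configured threshold.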

3.5 System Overview

In this section, we describe the implementation of our approach. Our prototype implementation, called BotHound, operates in two phases: a training phase, in which it learns the deviations in the implementation of the HTTP protocol by different web clients and malware and uses these details to automatically generate models, and a detection phase, in which it examines whether the captured HTTP traffic matches the models of benign or malicious traffic. The core of the BotHound architecture consists of three main components: (i) Virtual Machine Zoo, (ii) Learner, and (iii) Decision Maker. The Virtual Machine Zoo and the Learner operate during the training phase, while the Decision Maker runs during the detection phase (Figure 3.2). At first glance, we use an infrastructure similar to the one presented in Section 2.7. However, as the protocol we study in this chapter is completely different from the protocol we examined in Chapter 2, we had to re-implement the major components of the infrastructure to target our current threat model. We describe each component in detail in the following.


Figure 3.2: Overview of the BotHound architecture. During the training phase, legitimate clients and HTTP-based malware in the Virtual Machine Zoo access the Internet through a gateway firewall, and the resulting network traces are fed to the Learner; during the detection phase, the Decision Maker inspects live network traffic and produces reports.

3.5.1 Virtual Machine Zoo

The Virtual Machine Zoo is a collection of virtual machines, each of which runs a different HTTP client. Clients are legitimate web browsers, crawlers, web libraries, and HTTP-based malware. We support four different operating systems (Windows, Mac OS X, Linux, and FreeBSD), and we use 21 distinct legitimate clients. The web clients are used with both vanilla and customized configurations: we have installed the ten most popular client extensions for the web browsers to also record differences based on such additions to a stock configuration. Furthermore, we capture traffic from mobile devices running iOS, Android, Symbian, and Windows 8 to also cover legitimate traffic generated by such mobile devices. To generate realistic HTTP traffic, each client visits the top 1,000 most popular web sites based on alexa.com with a crawl depth of one. Additionally, to collect further benign data, we use a variety of software that leverages HTTP (web radio music players, HTTP-based video streaming, web stores like iTunes, Linux update processes, programs that update through HTTP, etc.). Finally, we use traffic generated from native operating system libraries, such as WinHTTP, WinInet, UrlMon, and libcurl.

Our dataset also contains distinct malware samples. These malicious samples are collected in the wild through various malware collection and analysis platforms (such as Anubis [13], CWSandbox [200], and Cuckoo Sandbox [37]). All these binaries have been checked with virustotal.com to verify that they are indeed malicious. Previous studies have shown that malware, and especially botnets, switch controllers or download updates frequently (e.g., every two or three days [63]).

In some cases, this timeframe is even smaller. Hence, all samples are repeatedly executed in our controlled environment to capture a diverse set of traffic traces.

To protect the outside world from malicious behaviors performed by the malware binaries executed in the Virtual Machine Zoo, we use a firewall between the Virtual Machine Zoo and the Internet that restricts their malicious network traffic. We need to allow the network communication between malware and their C&C servers (i.e., to receive updates and commands), but we want to limit their actual malicious activities. Thus, although the VMs have full network access, we throttle their bandwidth and block malicious connections through restrictive firewall rules.

Although current botnet C&C communications tend to be unencrypted [82], some bots use HTTPS as their communication protocol to evade existing detection mechanisms. This problem can be solved by introducing an SSL man-in-the-middle (MiTM) proxy between the Virtual Machine Zoo and the actual network [26]. Nevertheless, such an approach is outside the scope of this work.

3.5.2 Learner

The network traffic generated in the Virtual Machine Zoo is forwarded to and analyzed by the Learner. Additionally, any other network traces captured by other sources can be fed into and analyzed by this component as well. The Learner initially filters out irrelevant traffic to improve its efficiency and extracts only the HTTP requests, omitting any other type of traffic. Next, it analyzes all the HTTP requests and creates a header chain for each one. As mentioned earlier, each client can produce more than one header chain. Moreover, for each chain it is possible that the Learner will correlate a set of different clients.

In addition, the Learner is responsible for the creation of the HTTP templates for malware instances. To create an HTTP template, the header names are extracted from the HTTP request and the HTTP header values are processed to create specific clusters or regular expression pattern lists. The names and pattern lists are combined to form a template for an HTTP request of a malware sample. The Learner stores each HTTP template together with the name of the responsible malware instance using a hashtable with linked lists to resolve collisions. As we keep track of the traffic generated by each malware instance, we are able to find the name of the malware that is responsible for each template. Due to the partial matching that is performed on templates, we apply a hash function only to the part of the template where an exact match is required. We use this hash value to insert and look up templates in the hashtable very quickly, and we store at each node the template together with the malware names. As we hash only a part of the template, more than one template and malware name may be stored at a single node.
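To illustrate the value-generalization step performed by the Learner, the sketch below maps header values to clusters and token classes and combines them with the header names into a template. The concrete regular expressions and category labels are assumptions for illustration; the thesis only names the categories:

```python
# Illustrative sketch of header-value generalization into clusters and tokens.
# The regular expressions are assumptions; only the category names (IP
# addresses, ports, versions, Base64, filetypes, token classes) come from the
# description above.

import re

CLUSTERS = [
    ("IP",       re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")),
    ("PORT",     re.compile(r"^\d{1,5}$")),
    ("VERSION",  re.compile(r"^\d+(\.\d+)+$")),
    ("FILETYPE", re.compile(r"\.(exe|bin|jpg|gif|php)$", re.I)),
    ("BASE64",   re.compile(r"^[A-Za-z0-9+/]{8,}={0,2}$")),
]

TOKENS = [
    ("HEX",   re.compile(r"^[0-9a-fA-F]+$")),
    ("NUM",   re.compile(r"^[0-9]+$")),
    ("ALNUM", re.compile(r"^[A-Za-z0-9]+$")),
    ("PUNCT", re.compile(r"^[^\w\s]+$")),
    ("SPACE", re.compile(r"^\s+$")),
]

def generalize(value: str) -> str:
    for label, pattern in CLUSTERS:
        if pattern.search(value):
            return label
    # Fall back to a token sequence over maximal homogeneous chunks.
    parts = re.findall(r"[0-9a-fA-F]+|[A-Za-z0-9]+|[^\w\s]+|\s+", value)
    return "".join(f"<{next(l for l, p in TOKENS if p.match(chunk))}>" for chunk in parts)

def make_template(headers: dict) -> dict:
    """Combine header names with generalized values into an HTTP template."""
    return {name: generalize(value) for name, value in headers.items()}

print(make_template({"Host": "198.51.100.7", "X-Id": "a1f9 42!"}))
# {'Host': 'IP', 'X-Id': '<HEX><SPACE><HEX><PUNCT>'}
```

The exact-match part of such a template (i.e., everything except the Host and Request-Line entries) is what the hash function described above would be applied to.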


3.5.3 Decision Maker

The Decision Maker inspects the network traffic in real time and decides whether an HTTP connection originates from malware or from a legitimate web client. To this end, it extracts the HTTP requests and creates the respective header chain for each request. This chain is then compared with all the header chains generated during the training phase by the Learner. Instead of actually comparing the chain with every one of these header chains, it computes its hash value and performs a single lookup operation on the header chain hashtable, which reduces the search time. If there is a match, the lookup returns a node with the set of clients that have a matching chain. In case of no match, the request is immediately classified as suspicious. Otherwise, if it matches only header chains of benign web clients, the request is classified as benign. Finally, if it matches only header chains of malware, or chains of both benign and malicious clients, the request is labeled as possibly malicious and is further inspected.

Regarding the HTTP requests that match header chains from malware instances, we classify them as possibly malicious requests and rely on the HTTP templates for the final decision. The classification works as follows. First, the Decision Maker computes the HTTP template of the request. Then, this template is compared with the HTTP templates produced during the training phase from malware-generated requests to obtain a partial match. As noted above, we compute the hash value only on the part of the template for which an exact match is required (i.e., excluding the headers that should not completely match) to perform an efficient comparison for the partial match. A successful partial match increases the accuracy of the detection approach. In case we find at least one template with a partial match, we apply the URL signatures and attempt a full template match. The Request-Line and Host values of the new request are compared against the respective values of the templates that were partially matched.

If the Decision Maker finds a match with at least one template, the inspected request is classified as malicious. If there is no match, it is classified as suspicious and further investigation with other tools is suggested. There is one case in which a request will be classified as benign at the HTTP template matching stage: if (i) the request matches chains of both benign and malicious clients, (ii) it does not match any template, and (iii) the User-Agent header value corresponds to a legitimate client that is correlated with the request's chain. The last requirement ensures that the client reported in the User-Agent header is the same as a client associated with the request's chain, i.e., there is no mimicry attack.
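Putting the pieces together, the Decision Maker's logic can be sketched as follows. The chain database, the partial-match lookup, the URL-signature comparison, and the template extraction are assumed to be produced during the training phase (they could be instantiated, for example, with the earlier sketches); the substring check on the User-Agent value is a simplification of the correlation described above:

```python
# Condensed sketch of the Decision Maker's classification logic; all helper
# callables and data structures are assumed inputs, not BotHound's actual
# interfaces.

from typing import Callable, Dict, FrozenSet, Tuple

def classify_request(headers: Dict[str, str],
                     chain_db: Dict[Tuple[str, ...], FrozenSet[str]],
                     benign_clients: FrozenSet[str],
                     partial_matches: Callable,
                     url_matches: Callable,
                     make_template: Callable) -> str:
    chain = tuple(headers)                      # header names in wire order
    clients = chain_db.get(chain)
    if clients is None:
        return "suspicious"                     # unknown header order
    if clients <= benign_clients:
        return "benign"                         # only benign clients produce it

    # Possibly malicious: fall back to HTTP templates and URL signatures.
    template = make_template(headers)
    for stored_template, _malware_names in partial_matches(template):
        if url_matches(headers, stored_template):   # full template match
            return "malicious"

    # Remaining case: the chain matches both benign and malicious clients, no
    # template matches, and the User-Agent agrees with one of the matching
    # benign clients (i.e., no mimicry is taking place).
    user_agent = headers.get("User-Agent", "")
    if any(client in user_agent for client in clients & benign_clients):
        return "benign"
    return "suspicious"
```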


Figure 3.3: Number of generated header chains for different versions of Firefox (v.10–v.21) as a function of the number of visited URLs (both axes logarithmic).

3.6 Evaluation

In this section we evaluate the effectiveness of BotHound. First, we describe the ground truth we used during our experiments and then evaluate the accuracy of BotHound both with artificially created data and with real-world network traffic.

3.6.1 Establishing the Ground Truth

We created two datasets of HTTP traffic for which the ground truth has been validated, resulting in a malicious and a benign dataset. Initially, we executed 813 malware samples in the Virtual Machine Zoo, which belong to 24 malware families (e.g., Sality, ZeuS, Pushdo, SpyEye, etc.). From this process we collected more than 40,000 HTTP requests, which form the malicious dataset. For the benign dataset, we used popular web clients and crawled the top 1,000 domains according to alexa.com with a crawl depth equal to one. This process generated more than 7,000,000 HTTP requests.

3.6.2 Model Generation in Various Web Clients

We noticed small variations in the header chains generated for different versions of the same client. To find the optimal number of URLs we should visit with a new client version to obtain all its possible header chains, we visited the top 10,000 web sites listed by alexa.com with different versions of the Firefox web browser. Figure 3.3 shows the number of extracted header chains for 12 different versions of Firefox as a function of the number of visited URLs. We see that after a certain number of URLs (less than 900), the number of generated header chains does not increase anymore.


Table 3.1: Number of header chains and HTTP templates generated by BotHound for different web clients.

              Web client           # Header Chains    # HTTP Templates
  Benign      Mozilla Firefox            527                  –
              Google Chrome              249                  –
              Internet Explorer          470                  –
              Opera                      171                  –
              Safari                     143                  –
  Malicious   Jorik                       21                 30
              Sality                       4                 14
              Sofilblock                  13                 17
              ZeroAccess                  10                 40
              ZeuS                        17                 26

We observed similar results for the other web clients. Consequently, we conclude that after visiting the top 1,000 URLs, sufficiently many header chains of a web client will have been generated.

Table 3.1 depicts examples of the number of header chains and HTTP templates generated by BotHound for different web clients. We observe that BotHound generates 527 header chains in total across all Firefox versions, while each single version produced approximately 120 chains. This implies that although a significant number of chains are identical among the different versions, different chains also exist. Additionally, we see that the number of header chains for malicious clients is significantly smaller than for legitimate clients, presumably because their protocol implementations are not as complex as those of benign clients. Moreover, we observe that the number of generated templates is slightly higher than the number of header chains for a given malware family.

3.6.3 Detection Accuracy

To evaluate the detection accuracy of BotHound, we used ten-fold cross validation for both the benign and the malicious dataset. Initially, we divided all the HTTP requests generated by each single benign web client into ten folds. Each disjoint fold contained approximately 700,000 HTTP requests generated by all the benign clients. Regarding the malicious dataset, we first divided the malware instances belonging to a single malware family evenly into each fold. This way, each fold contained malware instances from all the 24 malware families. Then, we assigned the requests of each instance to the corresponding fold. As a result, each fold included around 4,000 requests generated by malware instances.


Finally, for both datasets we used nine of these folds as training input for BotHound, and we used the remaining fold to evaluate the classification accuracy. We repeated the above procedure ten times, each time with a different fold as evaluation input.

In our first experiment, we varied the threshold used by BotHound for a full template match to find the optimal value that minimizes both false positives and false negatives. As false positives we consider the HTTP requests in the benign dataset that are classified as malicious or suspicious. Similarly, the false negative rate is the percentage of HTTP requests in the malicious dataset that are not classified as malicious. Table 3.2 shows the false positive and false negative rates as a function of the threshold. We see that the optimal value for the threshold is 10. While smaller values increase the false negatives, larger values lead to a higher number of false alarms. Moreover, we see that BotHound with this threshold is able to detect 99.97% of the HTTP requests generated by malware with very good accuracy, i.e., with a false positive rate of just 0.04%. Consequently, the rest of the experiments were performed using this threshold value.

Table 3.2: Detection results for the benign and malicious datasets as a function of different threshold values.

  Threshold    False Positives    False Negatives
      8             0.04%              0.06%
      9             0.04%              0.05%
     10             0.04%              0.03%
     11             0.05%              0.03%
     12             0.07%              0.03%

To assess the classification accuracy of the three different algorithms used by BotHound, we applied them separately to the datasets and measured their classification errors. First, we evaluated the URL signatures, then the header chains as a single detection approach, and finally the complete BotHound approach, which combines header chains, HTTP templates, and URL signatures. Table 3.3 shows the classification results of each approach. We see that BotHound achieves significantly more accurate classification results compared to the URL signatures approach, which exhibits a 6.53% false positive rate. Even BotHound with header chains alone outperforms URL signatures, with a false positive rate of just 0.49%. We also observe that when HTTP templates are used in conjunction with header chains to improve detection, the percentage of false positives drops to 0.04%. In summary, BotHound demonstrates high accuracy in the detection of malware and improves on the state-of-the-art URL signature approaches.


Table 3.3: Classification results of different approaches.

                     URL signatures    Header Chains    BotHound
  False Positives        6.53%             0.49%          0.04%
  False Negatives        0.04%             0.03%          0.03%
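For completeness, the threshold-selection procedure behind Table 3.2 can be outlined as follows; classify() and the fold structure are placeholders rather than BotHound's actual interfaces:

```python
# Sketch of the threshold sweep over held-out cross-validation folds.
# folds: list of (benign_requests, malicious_requests) held-out folds;
# classify(request, threshold) is assumed to return 'benign', 'malicious',
# or 'suspicious'.

def sweep_thresholds(folds, classify, thresholds=range(8, 13)):
    results = {}
    for t in thresholds:
        fp = fn = benign_total = malicious_total = 0
        for benign_reqs, malicious_reqs in folds:
            benign_total += len(benign_reqs)
            malicious_total += len(malicious_reqs)
            # A benign request is a false positive if it is not classified as
            # benign (i.e., it is labeled malicious or suspicious).
            fp += sum(classify(r, t) != "benign" for r in benign_reqs)
            # A malicious request is a false negative if it is not classified
            # as malicious.
            fn += sum(classify(r, t) != "malicious" for r in malicious_reqs)
        results[t] = (fp / max(benign_total, 1), fn / max(malicious_total, 1))
    return results   # threshold -> (false-positive rate, false-negative rate)
```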

3.6.4 Popular Malware Families

In this experiment we explore BotHound's capability to detect malicious traffic with and without prior knowledge of a malware family. To this end, we trained BotHound with the whole benign dataset and with a portion of the malicious dataset that contains traffic from only six malware families. Then, we evaluated our approach by generating traffic from more malware families: from the six families that were included in our training datasets, and from five more families for which BotHound was not trained. The results revealed that all the requests coming from malware families that were included in the training phase were correctly classified as malicious. On the other hand, the requests coming from the remaining families were classified as suspicious, i.e., BotHound was able to successfully identify their respective requests as not originating from legitimate HTTP clients. We observed the same results for the other malware families as well. Hence, BotHound is capable of accurately identifying the traffic generated by malware families for which one or more samples are contained in the training datasets. Regarding unknown malware instances, depending on BotHound's configuration, it can also trigger alerts when the traffic is recognized as suspicious.

3.6.5 Real-World Deployment

To explore how BotHound performs in a real-world scenario, we evaluated it using network traffic captured at the gateway of an operational network that counts tens of thousands of different machines and generates millions of HTTP requests on a daily basis. We trained BotHound with both the benign and the malicious datasets described in Section 3.6.1, and we used as test traffic three different datasets captured at this network during three different time periods. Keep in mind that all the IP addresses in these datasets are stripped; however, we can find the domain name of each web server from the Host header field of the respective HTTP request. It is worth mentioning that in some cases malware may send an HTTP request to a different destination than the one displayed in the Host field in order to evade detection. Since the actual IP addresses in our datasets are stripped, we validated the domains found in the Host field of HTTP requests classified as malicious against 30 known blacklists [105].


Table 3.4: Classification results of BotHound for three real-world traffic datasets.

                1st dataset    2nd dataset    3rd dataset
  Benign          96.82%         99.31%         99.23%
  Malicious        0.32%          0.22%          0.25%
  Suspicious       2.86%          0.47%          0.52%

We evaluated BotHound with the three datasets as follows. Initially, BotHound inspected the 1st dataset. After retrieving the classification results, we investigated the suspicious traffic looking for groups of similar header chains. Once we found clusters of header chains, we added the missing web clients to the Virtual Machine Zoo. Then, we evaluated the two remaining datasets against the data-enriched BotHound. Table 3.4 presents the classification results for the three separate datasets. Regarding the malicious traffic, we see very similar results in all datasets: the malicious HTTP connections detected by BotHound range from 0.22% to 0.32% among the different datasets. We see a higher percentage of suspicious requests in the 1st dataset (2.86%), which is reduced in the next two datasets to 0.47% and 0.52%, respectively, due to the addition of the benign clients found in the 1st dataset to the training set.

Obtaining ground truth for real-world data is difficult, especially when it contains tens of millions of HTTP requests. To overcome this obstacle, we analyzed the requests classified by BotHound as malicious and suspicious. To identify the domain name of the web server of each request, we examined the request's Host field. BotHound classified 718 of the domains in our datasets as malicious. To analyze the domains extracted from malicious requests, we validated them against 30 blacklists. The blacklists classified 536 of these domains as malicious (74.7%). After one week we repeated the experiment and the blacklists classified 59 additional domains as malicious (82.9% in total). Then, we manually analyzed the 123 remaining domains. Of these, 29 contained IP addresses in the Host field; when we tried to access them, all these IP addresses were unreachable. The remaining 94 domains could all be found among the alexa.com 100,000 most visited web sites. After reconstructing the full URL path, we noticed that for 17 of these domains the requested URLs did not exist. The traffic destined to the remaining 77 domains appeared legitimate at first glance. Interestingly, however, the header chains of the requests destined to these domains did not match the header chains of the web client found in the respective User-Agent field, but matched header chains of existing malware families. A possible explanation of this finding is that many malware instances put a popular legitimate domain in the Host field and send benign HTTP requests as noise traffic to confuse detection systems [53, 54, 144].


Indeed, we found some malware instances in our dataset that used popular domain names in the Host field of the requests, while the requests were actually destined to different sites. We do not consider the classification of such noise requests generated by malware as false positives, because they can be used to identify malware and compromised machines, which send noise requests with an HTTP implementation that matches the header chains and HTTP templates of known malware. Thus, not only is the noise not useful to the malware, but it may even help BotHound to detect traffic generated by malware and pinpoint compromised machines.

Next, we examined the HTTP requests that were classified as suspicious. After a deeper look into the data, we found that a large number of crawlers, which compose 63.8% of the suspicious requests, were actively attempting to mimic the behavior of legitimate web browsers. Although they had changed their User-Agent headers using strings from real browsers, they had not changed their HTTP header sequences. Thus, the header chains identified this discrepancy. The remaining 36.2% of the suspicious HTTP requests originated either from malware that sent legitimate requests to blend malicious with benign traffic, or from versions of legitimate browsers we had not included in our training dataset.

3.6.6 Advanced Persistent Threats

In our next set of experiments, we tested BotHound's detection capabilities when it inspects the traffic generated by advanced malware samples used in targeted attacks (e.g., Duqu and Miniduke). We did not expect BotHound to classify the attack traffic of this evaluation dataset as malicious, because we had not previously trained it with the respective malware samples. However, it classified the HTTP requests generated by these samples as suspicious, because they did not match any of the header chains of a legitimate web client. Thus, BotHound provides a clear indication of malicious traffic even for such unknown threats, while the large majority of the benign traffic can still be identified. We then trained BotHound with the HTTP requests of these malware samples to generate new header chains and HTTP templates. As expected, BotHound with the new models was able to successfully detect the traffic produced by these malware samples. To examine whether BotHound produces false positives with the new models, we evaluated it with the real-world traffic datasets used in Section 3.6.5. We already knew that the requests from these malware samples were not contained in these datasets. Indeed, when BotHound was trained with the new malware samples, it did not report any malicious requests from them in the real-world datasets.


3.7 Discussion

Although BotHound can effectively detect real-world malware, like any detection system it has limitations. An adversary who gains knowledge of how our system operates may find ways to evade detection. In contrast to the SMTP protocol we discussed in Chapter 2, HTTP is used by an even larger variety of clients. This allows cybercriminals that utilize HTTP to be more flexible. In the following, we discuss potential evasion techniques and outline how we can handle them.

3.7.1 Cloaking

A common evasion technique is for malware to inject noise into its traffic. This noise can consist of benign HTTP requests that cloak the real C&C communication channel, or of malicious requests to benign domains that confuse a signature generation system. In Section 3.6.5 we found that noise traffic generated by malware may actually help BotHound to detect the malware-infected machines. However, when BotHound uses such requests as a training dataset, it risks also classifying benign requests and domains as malicious or suspicious, resulting in classification errors. Indeed, in our experimental evaluation we observed that some malware try to blend malicious with benign traffic. These mislabeled data were responsible for a fraction of the suspicious requests we mentioned in Section 3.6.5. If a malware sample uses its own implementation of the HTTP protocol, and this implementation does not perfectly correspond to a legitimate client, the malware can be detected by BotHound even if it produces noise. However, if the malware perfectly imitates a web browser's HTTP implementation and sends noise over it, then more advanced behavioral-based algorithms are needed for accurate detection.

Unfortunately, ensuring that a dataset contains only benign or only malicious samples is an open problem and most of the time requires manual effort to correctly label the training data. BotHound can address this issue by introducing a pre-filtering mechanism for malware-generated traffic during the training phase, filtering out the traffic that is not actually related to malicious activities [86].

3.7.2 Randomness

Attackers may also try to evade detection by avoiding constant patterns in their HTTP requests. For example, each new request could look completely, or partially, different from the previous requests. Such an attack could bypass our detection mechanisms, a limitation we share with other signature-based detection approaches. However, all the possible combinations follow a deterministic model. Thus, we believe that if we execute the malware samples in the Virtual Machine Zoo for a long period, all the possible combinations will eventually be revealed.


3.8 Related Work

There have been a large number of research efforts aiming to design and implement detection and mitigation strategies for malware. In the following, we discuss the relationship between our approach and previous work in this area.

Network-level intrusion detection systems utilize signatures to detect malicious traffic [143, 160]. A common practice is to generate signatures for detecting malware that use network protocols for their activities [168]. To counter polymorphic malware, disjoint signatures are generated to find substrings that are unique even in polymorphic payloads [109, 135]. Such detection systems may not be able to accurately distinguish the traffic of benign clients from the traffic of malware that mimics the behavior of these clients, as these systems do not specialize in the implementation details of HTTP. In contrast, BotHound utilizes the header sequence and HTTP request structure to gain more insight into the differences between HTTP protocol implementations, and thus the false positive ratio remains low.

On the other hand, several botnet detection systems focus on network flow analysis [58, 72] or require deep packet inspection [73] to detect compromised machines within a local network. Other detection approaches aim to identify common spatio-temporal behaviors of bots when performing malicious activities [63, 72]. Such approaches are less effective when malware generates only a small fraction of the overall traffic and hides its malicious activities and C&C communication in benign-looking traffic (e.g., HTTP requests). In contrast, BotHound aims to detect slight differences between the HTTP protocol as implemented by malware and by benign clients.

Wurzinger et al. [208] identify bot responses to botmaster commands in the network traffic while monitoring executed malware samples; automatic models are created to detect botnets. One limitation is that malware can hide within other HTTP traffic without raising suspicion. For instance, it can contact the C&C server and perform other malicious tasks over HTTP at random times and after long time intervals. In contrast, BotHound does not need to identify commands or responses from bots, as it focuses on the HTTP traffic generated by malware. Thus, malware can be detected even if other suspicious network activity is missing.

Many detection approaches rely on blacklists to detect bots that connect to known C&C domains [66, 156]. However, botnets can evade this static blacklisting by using randomly generated domains for their C&C communication. To address this problem, Antonakakis et al. [5] detect such domain-generating bots by applying clustering and classification algorithms to group bots that use the same algorithm to generate the C&C domains. While this approach analyzes DNS traffic, BotHound focuses on HTTP traffic for malware and botnet detection.

Perdisci et al. [145] propose a system that focuses on the network-level behavior of HTTP-based malware to find similarities and generate signatures for malware clusters.

BotHound improves the accuracy of the URL signatures proposed in this work by operating on the HTTP request headers instead of the complete HTTP content, which reduces the amount of data that needs to be processed and results in increased processing throughput compared to content processing.

Template approaches have been applied to spam mitigation, and such methods are closely related to our work. Botnet Judo [147] generates SMTP templates by monitoring spamming botnets in a controlled environment, and uses these templates to perform real-time spam filtering at the network level. Similarly, we apply a generalized template-based detection to the HTTP protocol, using templates to model malicious requests. Additionally, BotHound leverages header chains as a first step to reduce the amount of requests that need to be analyzed with templates, significantly increasing the detection speed.

3.9 Summary

In this chapter, we presented a network-level detection framework that focuses on HTTP-based malware and detects malicious HTTP traffic. The key idea behind our approach was to identify slight deviations among different HTTP protocol implementations. Our proposed framework automatically generates models that can accurately classify benign and malicious HTTP connections. We deployed our prototype in an operational network, verifying that it is able to detect a large variety of real-world malware, including advanced persistent threats used in targeted attacks, with a very low false positive rate. We also found that our approach was able to detect malicious domains early, before various popular blacklists published them. Finally, we demonstrated the improved accuracy of our approach by comparing it with existing state-of-the-art approaches such as URL signatures.


Part II

Web Security


Preamble

In Part I of this thesis, we proposed mechanisms to detect compromised machines and prevent them from performing malicious activities. To do so, we utilized infected clients and monitored their malevolent activities in a sanitized environment. Then, we analyzed their generated traffic and compared it with traffic coming from benign clients. As a result, we discovered subtle differences in the implementation of the protocols they use, which we leveraged to create detection systems.

In the second part of this dissertation, we no longer detect fraudulent activities of compromised machines, but instead study the ways in which attackers find new victims. Initially, given the fact that the highest-ranked web pages gain more visits from online users, we examine the alliances formed among web spammers who try to boost the ranking of their web pages, which can potentially host exploits. Next, we investigate attacks against search engines and how they can be manipulated to reflect attacks onto various websites. Then, we study how malicious advertisements differ from benign ones and analyze the risks related to malvertising abuses. Finally, we propose a prevention framework that can be used by online users to protect themselves from web infections.


4 Revealing the Relationship Network Behind Link Spam

“The ultimate search engine would basically understand everything in the world, and it would always give you the right thing. And we’re a long, long ways from that.”

Larry Page

Accessing the large volume of information that is available on the web is more important than ever before. Search engines are the primary means to help users find the content they need. To suggest the most closely related and the most popular web pages for a user's query, search engines assign a rank to each web page. Each page's rank typically increases with the number and rank of the web pages having a link to this page. However, spammers have developed several techniques to exploit this algorithm and improve the rank of their web pages. These techniques are commonly based on underground forums for collaborative link exchange, building a relationship network between spammers to favor their web pages in search engine results. This causes a significant problem for both search engines and end users.

In this chapter, we provide a systematic analysis of the spam link exchange performed through 15 Search Engine Optimization (SEO) forums. We design a system that captures the activity of web spammers in SEO forums, identifies spam link exchange, and visualizes the web spam ecosystem. Our system collects spam links posted in public forum threads and links sent via private messages, by creating and using honey accounts in these forums. The web pages behind the collected links are examined, and their hyperlinks are extracted and matched against the other collected links. This analysis results in link exchange identification and generation of the respective relationship network graph of web spam. The outcomes of this study deepen our understanding of web spammers' behavior and can be used to improve web spam detection.


4.1 Introduction

The World Wide Web, from online shops to social networks and private blogs, plays a critical role in modern society. It offers an abundance of information accessible to anyone with an Internet connection. To identify the most useful information among the vast amount of available web pages, users rely primarily on search engines. Search engines typically classify a huge number of web pages and present the ones that seem most relevant to user queries, ranked by their estimated relevance and their popularity. Users typically visit the highest-ranked web pages and ignore the rest [167]. To attract more users, it is therefore important for each web page to rank high in the search engine results. While honest web pages achieve this ranking due to the quality of their content, dishonest ones try to mislead search engines to rank them higher than they deserve. We refer to the attempts of these dishonest pages to deceive search engines as link spam.

Link spam is typically used for several reasons, ranging from money-related activities to malware propagation. Therefore, it is becoming more popular and more sophisticated, as the rapid increase of Internet users leads to higher revenue for spammers [110, 174]. Unfortunately, this has a negative impact on the user's experience: link spam is annoying, as users often cannot find what they are searching for, while at the same time it constitutes a security problem due to possible malicious content on spam pages. Additionally, it causes headaches for the search engines themselves, which must exert significant effort to filter link spam and satisfy users' expectations. Therefore, over the years, many different anti-spam techniques have been developed [14, 76, 122, 205, 207]. However, adapting to such techniques, spammers continually improve their strategies to evade detection.

To determine the reputation and popularity of a web page, search engines commonly rely on the number and ranking of the other web pages that link to it. The more websites link to a page p, and the more popular these websites are, the higher is the ranking that the page p will receive from a search engine [21]. Although this is a reasonable way to define page ranking, it can also be exploited by spammers to boost the ranking of their pages by increasing the number of links to them. So-called Search Engine Optimization (SEO) forums are often used by spammers for this reason. Although search engines approve of SEO as a way for web page owners to achieve better recognition of their websites [67], blackhat SEO techniques are considered unfair means of boosting web page rankings. For simplicity, in the rest of this chapter we refer to blackhat SEO simply as SEO.

SEO forums intend to bring together page owners who want to improve the ranking of their web pages. However, these forums are also attractive to spammers, as they can swap ideas on new spamming methods and can also exchange links. In this work, we study how the information found in SEO forums is related to link spam. We build a system that systematically collects and analyzes the links posted in public threads or sent via private messages in popular SEO forums.

We noticed that many spammers, to avoid detection, tend to exchange links or send information about their web pages only through private messages instead of posting to public forum threads. Hence, to collect data from such spammers, our system uses honey accounts that behave like typical spammers: they post in public threads asking for link exchanges with other websites. Next, we analyze the harvested links with respect to their spam affiliation, their frequency, and their occurrences among different users and different forums. Additionally, we crawl the web pages found in SEO forum posts or messages to examine their structure and extract their links, which are then matched against the other web pages found on the monitored forums. Finally, to visualize the developed relationships among link spammers, we use graph structures based on the information extracted from the observed link exchanges.

We examined 15 popular SEO forums and, after a three-month period, collected 97,658 web pages that participated in link spam. Overall, we discovered two major categories of spammers with distinct features that use SEO forums to boost the page ranking of their websites. Each category behaves in a completely different manner, and thus we need different approaches to reveal their spam web pages. In addition, a deeper analysis of the collected data revealed a few clear differences in the type of link exchange and in the relationship network between URLs exposed in public threads and those sent via private messages. These results improve our understanding of the web spam ecosystem and shed light on the activities performed in underground forums related to link spam.

In summary, we make the following main contributions:

• We collect a corpus of spam links from SEO forums. Our approach is the first one that uses honey accounts to harvest spam links from private messages.

• We analyze the links found in SEO forums and validate link exchange by crawling the respective web pages. Instead of solely relying on collected data from SEO forums, our analysis correlates this information with data extracted from the actual web pages.

• We introduce a new way to visualize the web spam using a graph representation of link exchange. This way, we can see relationships among web spammers and other patterns in their behavior using data from SEO forums and from their web pages.

• We present an in-depth analysis of data gathered from 15 popular SEO forums. The outcomes reveal the different approaches and strategies used by advanced and inexperienced web spammers.


Figure 4.1: A simplified example of PageRank calculation.

4.2 Web Page Ranking

One of the most commonly used web page ranking algorithms is PageRank [21]. This algorithm uses the link structure to estimate the popularity of a web page. Every web page has a rank that represents its estimated popularity. The page’s rank is determined by the number and ranking of its incoming links (called backlinks). This means that a web page is highly ranked if it has backlinks with a high rank, or a large number of backlinks with a low rank. The rank R(p) of a web page p is equal to the sum of the ranking R(b) of each backlinking web page b divided by the total number of its outlinks |O(b)|:

R(p) = c \sum_{b \in B(p)} \frac{R(b)}{|O(b)|}    (4.1)

where c < 1 is a normalization factor, B(p) is the set of backlinking web pages of page p, and O(b) is the set of outlinks of a page b (i.e., all the links on a web page that are not navigational links). Figure 4.1 demonstrates the rank propagation from one pair of pages to another using the PageRank algorithm.

The second most used web page ranking algorithm is Hyperlink-Induced Topic Search (HITS) [100]. As its name implies, HITS uses the link structure to identify good web pages related to a specific topic. In contrast to PageRank, the web graph is not rated as a whole entity; only a subgraph that contains web pages relevant to the searched keywords is taken into consideration. HITS distinguishes between linked web pages (authorities) and linking pages (hubs). A good authority is pointed to by many good hubs, and a good hub points to many good authorities. The scheme therefore assigns two scores to each page: its authority score, which estimates the value of the content of the page, and its hub score, which estimates the value of its links to other pages. This makes the algorithm preferable for ranking web pages on a specific topic. Consequently, the same web page can have different scores on different topics.
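To make the iteration behind Equation 4.1 concrete, the following minimal Python sketch computes the simplified rank over a small, hypothetical link graph; the example graph, the function name, and the default values are illustrative assumptions and not part of any production ranking system.

# Minimal sketch of the simplified PageRank of Equation 4.1 on a hypothetical graph.
def page_rank(outlinks, c=0.85, iterations=50):
    pages = list(outlinks)
    rank = {p: 1.0 / len(pages) for p in pages}  # start from uniform ranks
    # Derive the backlink sets B(p) from the outlink sets O(b).
    backlinks = {p: [b for b in pages if p in outlinks[b]] for p in pages}
    for _ in range(iterations):
        rank = {p: c * sum(rank[b] / len(outlinks[b]) for b in backlinks[p])
                for p in pages}
    return rank

# Hypothetical four-page graph: each page maps to the set of pages it links to.
graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}, "d": {"a", "c"}}
print(page_rank(graph))

Note that, exactly as in Equation 4.1, a page with no backlinks (such as d above) ends up with a rank of zero, which is why deployed ranking systems add a damping/teleportation term on top of this simplified formulation.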


4.3 Web Spam

The intention of web spamming is to increase the ranking of spam web pages by misguiding the ranking algorithms used by search engines. According to Gyongyi et al. [75], the term web spamming refers to any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for a web page, considering the page's true value. There are several reasons for creating web spam, such as increased ad revenue, phishing attacks, profit from illegal activities, and malware distribution. The highest-ranked web pages obviously get clicked much more often by Internet users, and this is something that drew miscreants' attention. As a matter of fact, users tend to trust search engines as the primary means of finding information in a fast and effective way, and they rarely question the returned results. Therefore, as spammers target end users through search engines, they often try to boost their own web pages to appear higher in the search engines' results with the aid of term and link manipulation techniques [75, 78, 122]:

Term Spam: To mislead search engines, spammers repeat specific terms on their spam web pages to trick the engines into deciding that the page is closely associated with these terms. Such terms often do not form a useful sentence: to increase the number of queries that are associated with a spam web page, attackers dump a large number of unrelated terms on the web page, which are often copied from dictionaries. This is an effective technique to lead rare queries to a spam web page, because such terms are not included in many benign web pages, and a spam page containing them is therefore highly ranked. Additionally, misleading meta-keywords or anchor text can boost the page's ranking as well.

Link Spam: This method aims to modify the structure of the web graph by increasing the number of backlinks targeting the spam pages. To this end, spammers often post backlinks to their web pages on guest books, wikis, message boards, blogs, or in web directories. Such procedures are easy to carry out and do not require extra effort or money. Moreover, as web servers are very cheap, spammers can leverage such servers to build a link structure. These servers can be used in different ways to push the ranking of the spam pages. Some of the servers could provide useful information by copying the content of other benign web pages, while also linking to the spam page. On top of that, spammers build link farms where they can interact with other spammers and exchange URLs. As a means of communication, spammers usually contact each other through SEO forums. This procedure is called link exchange and is the main focus of our study.

There are three different kinds of link exchange: (i) the one-way link exchange, where only one website a links to another website b, (ii) the two-way link exchange, where two web pages a and b both link to each other, and (iii) the three-way link exchange, where the web pages do not link directly to each other, but use a third web page to build a circle of links. For instance, the website a links to c, and the website c links to b. This type of link exchange is harder to detect in practice. In this work, we study the two- and three-way link exchange. When a node a participates in more than one N-way exchange, it is considered to participate in a link farm. For instance, if a links to b and b also links to a, and similarly a links to c and c to a, then a is part of a link farm.

Usually, web spammers are miscreants that own malicious web pages, such as phishing or drive-by-download pages, or even pages that participate in ad frauds, and thus they try to monetize them. As we already mentioned, more traffic to a website increases its value in the underground black market, which translates to more money for the site's owner. Hence, among all the other tricks that cybercriminals utilize in order to attract more users to their web pages, they also leverage link exchange as a tool to achieve their nefarious tasks. Note that not all web spammers own a malicious website. Sometimes, owners of blogs, personal web pages, or any other kind of benign web pages use similar techniques to increase their web page ranking for several other reasons. In this chapter, we define web spam to be any link exchange among websites, even if this action involves benign websites.

4.4 Data Collection

In this section, we describe in detail the goals of our work, and provide an overview of our measurement methodology.

4.4.1 Study Objectives

Identifying link spam is a continuous process for search engines, and typically this process is hidden from average end users. According to Wall [195], link spammers form alliances in order to exchange links among their web pages, resulting in global link farms. The most common channel individuals use to communicate with each other is an SEO forum, and thus it is almost impossible to discover these farms with traditional anti-spam techniques. In this work, we study the spammers and spam websites that use SEO forums and expose their alliances. To this end, we developed an infrastructure that allows us (i) to identify web pages that use SEO forums for improving their page ranking, and (ii) to visualize these pages and the relationships formed among them. We use this system to study the link spam ecosystem through the information that is available or can be collected at popular SEO forums, and to measure the extent to which SEO techniques are utilized and their impact on the web. Keep in mind that with this work we do not want to provide yet another detection system; rather, we want to study how advanced and inexperienced spammers collaborate and create link farms.


Figure 4.2: Architecture of our approach.

Our system is based on the idea that there is a mutual exchange of URLs among link spammers. When a spammer A wants to increase the number of URLs linking to her web page, she is willing to add the URL of another spammer B to her web page, if the latter adds the web page of A to her own site as well. Having that knowledge in advance, we can accurately discover web pages that try to increase their ranking by fraudulent means. Therefore, we have created a set of crawlers that gather URLs from SEO forums' public threads (under the Link Exchange sections), and a set of honey accounts that send messages to other users requesting link exchange with honey web pages, while harvesting the responses received through private messages. We use the notion of link spam throughout this chapter. For this work, we consider as link spam any web page that unethically tries to increase its page ranking by participating in global link farms. In addition, we consider as link exchange any mutual link transaction between two pages with similar or disparate web contents. The combinations of all these link exchanges form link spam relationship networks. In conclusion, the main goal of our work is to study the ecosystem behind these networks, and to expose and categorize the behavior of different spammers.

4.4.2 Data Collection Architecture

Figure 4.2 depicts the general architecture of our approach. The core elements of our infrastructure are: (i) the SEO forum crawlers, (ii) the army of honey accounts, and (iii) the web page crawlers. While the SEO forum crawlers are responsible for harvesting URLs from SEO forums, the honey accounts try to lure link spammers into exposing information that is not publicly available and would only be revealed to other peers through private messages. Finally, the web page crawlers examine each page found in the set of collected links and validate the actual link exchange. The data collected by these crawlers are forwarded to the components that are responsible for analyzing and correlating them in order to reveal the relationship network among spammers. Finally, the link exchanges and the spammers' relations are visualized by creating the respective relationship network graphs.


SEO Crawlers

The SEO crawlers search underground forums for URLs that participate in link exchange. These forums have predefined places (sub-forums) where users can exchange URLs. Therefore, the crawlers target these sub-forums to find and extract the URLs. For this procedure it is important to consider the HTML structure of the forums. Our experimental results reveal that all of our examined forums use one of the following platforms: (i) vBulletin, (ii) phpBB, or (iii) MyBB. This fact allows us to create crawlers that behave identically on more than one SEO forum. Consequently, the crawlers can extract all the necessary information from the forums (such as links, post authors, and usernames) with small modifications to their configurations. Similarly, the URL extraction from the forums needs to be handled carefully. There exist posts that are not spam related, where the included links are not posted for spam purposes. For instance, we observed a plethora of links to popular websites. For that reason, the SEO crawlers use a whitelist to decide whether a link should be extracted and stored in the database. Additionally, we filter all the links found in users' signatures. Moreover, users frequently quote posts of other users. This leads to duplicated links within one thread and makes the same link appear to be posted by more than one user. To find the user who originally posted a link, our crawlers strip all quoted elements from the posts and keep only the content that differs from previous posts. This way, we obtain a clean mapping between each link and the user who originally posted it.
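The following Python sketch illustrates this extraction logic on already-fetched post data; the data layout, the regular expression, and the interpretation of the whitelist as a set of popular domains to skip are our own simplifying assumptions rather than the actual crawler implementation.

import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>]+")
WHITELIST = {"google.com", "facebook.com", "youtube.com"}  # assumed: popular domains to ignore

def extract_links(posts):
    """posts: list of dicts with 'author', 'body', and 'quoted' (text quoted from earlier posts)."""
    mapping = []   # (link, original poster) pairs
    seen = set()
    for post in posts:
        quoted = post.get("quoted", "")
        # Drop quoted material so a link is attributed only to its original poster.
        own_text = post["body"].replace(quoted, "") if quoted else post["body"]
        for url in URL_RE.findall(own_text):
            host = urlparse(url).netloc.lower()
            if host in WHITELIST or url in seen:
                continue   # skip popular sites and duplicates
            seen.add(url)
            mapping.append((url, post["author"]))
    return mapping

# Hypothetical thread in which the second post quotes the first.
thread = [
    {"author": "alice", "body": "Anyone up for link exchange? My site: http://travel-deals.example",
     "quoted": ""},
    {"author": "bob", "body": "My site: http://travel-deals.example\nSure, here is mine: http://cheap-pills.example",
     "quoted": "My site: http://travel-deals.example"},
]
print(extract_links(thread))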

Honey Accounts

We have witnessed that a significant fraction of the link exchange is performed through private messages. Hence, in order to gain access to that kind of information, we need an approach that lures link spammers into exposing themselves. From our prior knowledge, we know that a spammer is willing to reveal a certain type of information only to a fellow spammer. Thus, it is necessary for our approach to create fake accounts (i.e., Sybils) and make them behave as if they were real spammers. For this purpose, these Sybil accounts can post requests for link exchange and harvest the responses sent by private messages. In addition, they have the capability to reply to other users when the received private messages do not contain any exchanged URL. As a matter of fact, the responses are different each time they reply, which makes it more difficult to categorize these accounts as Sybils. We know that creating honey accounts to retrieve internal information constitutes a short-term solution and cannot be used as an anti-spam technique. Nevertheless, this approach provides us with valuable information that could not have been retrieved by other means.
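A minimal sketch of how such a honey account could process incoming private messages is shown below; the message handling, the reply templates, and the URL check are hypothetical illustrations of the behavior described above, not our actual implementation.

import random
import re

URL_RE = re.compile(r"https?://[^\s<>]+")

REPLY_TEMPLATES = [   # varied wording so replies do not look scripted (hypothetical examples)
    "Sounds good, which URL should I link to?",
    "Interested! Can you send me your page so I can add the link?",
    "Sure, please share the exact URL you want exchanged.",
]

def handle_private_message(message_body):
    """Return (harvested_urls, reply) for one incoming private message."""
    urls = URL_RE.findall(message_body)
    if urls:
        return urls, None                       # URL received: store it, no reply needed
    return [], random.choice(REPLY_TEMPLATES)   # no URL yet: nudge the sender with a varied reply

print(handle_private_message("Hi, I saw your thread. Want to swap links?"))
print(handle_private_message("Here you go: http://boost-my-rank.example"))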


Web Crawlers

The web crawlers map the link structure of the networks behind the harvested URLs. They follow all the outgoing links up to a defined depth and store the extracted URLs in a database for further analysis. To obtain more accurate results, we used instrumented browsers as crawlers. When a crawler visits a new web page, it searches for every link on the page and checks whether this link satisfies some predefined conditions. Initially, the link is reduced to its hostname. If the hostnames of the link and the current web page match, the link is ignored because it is a navigational link. Our experimental results revealed that most of the analyzed spam pages host their outlinks on their main page. If an outlink is found, the crawler checks whether the linked page has already been crawled. If not, the link is appended to the crawler's queue and the pair of source and destination URLs is stored in the database.
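The crawling loop can be summarized with the Python sketch below; the fetch-and-parse details are abstracted behind a hypothetical extract_links() callable, and the depth limit is an assumed parameter rather than the value used in our experiments.

from collections import deque
from urllib.parse import urlparse

def crawl(seed_urls, extract_links, max_depth=2):
    """Breadth-first crawl that records cross-site (source, destination) link pairs."""
    queue = deque((url, 0) for url in seed_urls)
    visited = set(seed_urls)
    edges = []                                    # pairs later stored in the database
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for link in extract_links(url):           # hypothetical helper: fetch the page, return its links
            if urlparse(link).netloc == urlparse(url).netloc:
                continue                          # same hostname: navigational link, ignore
            edges.append((url, link))
            if link not in visited:               # crawl each outlink only once
                visited.add(link)
                queue.append((link, depth + 1))
    return edges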

Link Exchange Analysis

With all the information gathered in a central database, we correlate the relationships among the crawled web pages. More precisely, we discover the connections among different entities and observe the real interactions in link farms. For instance, if a user claims that she will add the web page of another user and this never happens, the relationship is classified as broken. This is a major difference from previous works that treat all the URLs posted in SEO forums as accomplished link exchanges [28]. Our approach can also recognize two- and three-way link exchange. Although a two-way link exchange is quite straightforward to detect, a three-way exchange requires more sophisticated techniques. Thus, we correlate all the links contained in public threads or in private messages; we consider these links as a cluster and try to find relationships among them. If there is a relationship that includes more than two URLs, we conclude that there is at least one three-way link exchange in this cluster.
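Given the (source, destination) pairs collected by the web crawlers, detecting two-way exchanges and three-way circles within a cluster of forum-related pages can be sketched as follows; the set-based representation and the exact cycle test are our own simplifications of the analysis described above.

from itertools import permutations

def find_exchanges(edges, cluster):
    """edges: set of (source, destination) pairs; cluster: pages discussed together in a forum."""
    links = set(edges)
    two_way = {frozenset((a, b)) for a, b in links if a != b and (b, a) in links}
    three_way = set()
    # Three-way exchange: a directed circle a -> b -> c -> a with no reciprocal link inside it.
    for a, b, c in permutations(cluster, 3):
        cycle = {(a, b), (b, c), (c, a)} <= links
        reciprocal = {(b, a), (c, b), (a, c)} & links
        if cycle and not reciprocal:
            three_way.add(frozenset((a, b, c)))
    return two_way, three_way

# Hypothetical cluster: a and b exchange via the intermediate page c, while d and e link directly.
edges = {("a", "c"), ("c", "b"), ("b", "a"), ("d", "e"), ("e", "d")}
print(find_exchanges(edges, {"a", "b", "c", "d", "e"}))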

Graph Generator

This component is responsible for graphically representing the spammers' relationships. A graphical representation can provide a clear view of the spam web pages, i.e., the pages that participated in a large number of link exchanges. As we can recognize the major players in link exchange, we can easily draw conclusions about the procedure that these spammers follow, such as whether they request link exchange for one or more pages, whether they prefer to advertise the link exchange in public posts or private messages, and whether they use a common template. Our system provides different levels of graph representations, such as relationships among users or web pages, two- or three-way link exchange, and link exchange through public posts or private messages.
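A minimal way to render such a relationship graph, assuming the networkx and matplotlib libraries and the confirmed exchange pairs produced by the previous component, could look like the sketch below; the example page names and rank values are hypothetical.

import matplotlib.pyplot as plt
import networkx as nx

def draw_relationship_graph(exchange_pairs, page_ranks):
    """exchange_pairs: iterable of (page_a, page_b) confirmed exchanges; page_ranks: page -> rank."""
    graph = nx.Graph()
    graph.add_edges_from(exchange_pairs)
    labels = {page: page_ranks.get(page, 0) for page in graph.nodes}  # show each page's rank
    nx.draw(graph, with_labels=True, labels=labels, node_color="lightgray", edge_color="gray")
    plt.savefig("relationship_graph.png")

# Hypothetical confirmed exchanges and PageRank values.
draw_relationship_graph([("a.example", "b.example"), ("b.example", "c.example")],
                        {"a.example": 2, "b.example": 3, "c.example": 1})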


Figure 4.3: Percentage of users for each pair of number of posts and number of URLs in a thread.

4.5 SEO Forums Analysis

In our study, we analyzed 15 SEO forums that contain sub-forums for link exchange and gathered data over a three-month period. These forums are among the top websites where users can search for SEO boosting techniques, and they number hundreds of thousands of active users. Each of these forums includes sections that describe how to boost pages to appear higher in search engine results, as well as sections that offer link exchange among their users. In our research, we focused only on the sections related to link exchange. To this end, we analyzed in total 9,617 threads and extracted 25,338 unique URLs generated by 7,923 users. Our results indicate a ratio of 3.6 replies per thread. This means that for each web page that tries to boost its page rank, there is an average of three other pages that are willing to contribute to this goal. It is worth noting that during our measurements we analyzed all the reply messages in public threads and found that 26.79% of them did not contain any actual URL but rather a reference to a private message.

4.5.1 Spammers Behavior

Initially, we examined the behavior of users in SEO forums based on the number of posts they make and the number of URLs they send in a single thread. Figure 4.3 illustrates the percentage of users for each pair of number of posts and URLs. We saw that the majority of users (53.94%) that participate in a thread make one post including one URL in each thread. The percentage decreases as the number of posts and URLs increases. On average, each user generates 1.08 posts and 1.61 URLs in a thread. This reveals that most of the users that post publicly own at most one website. We also noticed that the users who own many websites usually present all URLs in a single post. The rest of their posts mainly contain information such as the category and ranking of the web pages.


Figure 4.4: CDF of URL occurrences in our dataset.

Figure 4.5: CDF of the number of posts per user.

Forums have very strict rules against spamming. Users are allowed to post freely in the forums; however, they are forbidden to spam. Among other things, the replication of the same content in different threads is considered spam, and accounts that are caught spamming are permanently banned from the forums. With this in mind, we measured the frequency of each posted URL across all threads. Figure 4.4 shows the CDF of the occurrences of each URL in our dataset. We see that 75.61% of all the posted URLs occurred only once, while 98.11% were found in up to five different posts. This led us to the conclusion that SEO forum users try to avoid excessive spam inside these forums.

We continue our analysis by measuring the total contribution of a user to a forum. We assess the activity of users by counting the total number of posts they make in the forums. The forums usually have different tiers for their users depending on their activity: the higher the tier of an account, the more benefits it gains. Figure 4.5 depicts the CDF of the number of posts per user. We noticed that 78.13% of the users made fewer than 20 posts. The number of users in this category is so large because most of the forums allow posts in the link exchange sub-forums only when a user has reached a defined number of posts, often between 5 and 20. We observed that users who only want to exchange a link to their web pages are not very active in other parts of the forums.

Table 4.1 shows the percentage of URLs and users that appear on up to three different forums. Most of the users are active in only one forum (97.99%). This percentage, however, may not be very accurate, as users may not use the same username in each forum. Taking into consideration that most of the unique URLs (95.53%) are also posted in just one forum, the percentage of users across different forums discovered by our approach should be close to the actual number. Interestingly, we did not find any user or URL present in more than three different forums.


Table 4.1: Percentage of URLs and users that appear on 1 up to 3 different forums.

Number of forums   URLs     Users
1                  95.53%   97.99%
2                   4.04%    1.71%
3                   0.43%    0.30%

As we previously mentioned, a fraction of users reveal their web pages only via private messages. In order to harvest these pages, we created honey accounts. These accounts sent requests for link exchange to threads where the initial post did not include the exchanged URL, and replied to private messages. We wanted to retrieve a wide variety of URLs, and thus we followed two different strategies. In the first strategy, we created web pages and enriched them with content from various categories. These pages had a low page ranking, and therefore we were able to attract other pages with similar ranking. In the second strategy, we advertised highly ranked pages that we did not own and spuriously claimed that we offered them for link exchange with other highly ranked web pages. This method attracted owners of higher ranked websites. Overall, we retrieved 718 unique URLs from both strategies. We compared these URLs against the ones we collected from crawling the public threads and found an overlap of only three URLs. These results showed us that there are two largely disjoint sets of link spammers: spammers that exchange their web pages through publicly available posts, and spammers that exchange their pages only via private messages. The users who send their URLs only via private messages proved to be more wary of being detected for link spamming.

4.5.2 Spam Pages Categorization

Another interesting aspect of comprehending the link spam network is to measure the page rank of the web pages that request link exchange. This helps us understand whether highly ranked web pages behave in a similar way to low-ranked pages. To this end, we used Google's toolbar queries to get the page rank of the pages we found in SEO forums. Figure 4.6 depicts the outcomes of our analysis. Page rank 0 contains web pages that have yet to be ranked. We discovered that the majority of the spam pages have page rank 2, while more than 90% of spam pages have a page rank lower than or equal to 4. Additionally, pages with a rank greater than 6 constitute only 0.88% of the total amount of spam pages. Consequently, we believe that owners of highly ranked pages behave completely differently compared to owners of low-ranked pages. Obviously, highly ranked pages usually do not participate in link exchange networks to the same extent as low-ranked ones, either because they do not need it, or because they find different ways to increase their ranking, such as paying search engines to pitch their pages in better positions, or paying for more effective and targeted advertising methods. On the other hand, low-ranked pages, which are usually blogs or personal websites, seek cheaper solutions to increase their page ranking. Finally, there is the category of malicious web pages that cannot increase their ranking by any legal means, so they are forced to look for illegal ways, such as link exchange, to achieve that. Note that identifying malicious web pages among those that participate in link farming is outside the scope of this chapter.

Figure 4.6: PageRank of web pages that request link exchange in SEO forums.

In the following experiment, we categorize the web pages based on their contents and spot the categories that are more prone to link exchange than others. In essence, every web page has a theme that defines its content. Usually, the owners try to exchange links with other pages that belong to a similar category. To get an overview of the different themes on offer, we analyzed the threads for the main subject of the related web pages. The results in Table 4.2 illustrate the top themes requested for link exchange. We observe that there are many different types of web pages found in the forums, and most of them have very similar popularity; only the pages regarding travel and health stand out. In more detail, the most common theme we encountered concerns web pages related to traveling, which occur in 16.33% of the threads. Some of these pages offer trips through India or Vietnam. The second most commonly requested theme is related to health. These web pages have content ranging from blood pressure and Viagra pills to tattoo removal. Finally, 27.68% of all the web pages in the examined threads have several other themes that are rarely requested.


Table 4.2: Distribution of the requested themes for link exchange.

Theme                      Percentage
Travel                     16.33%
Health                     10.34%
Finances and Business       7.33%
Adult Content               7.31%
Shopping                    6.67%
Technology and IT           6.66%
Online Games                6.34%
Internet and Web Design     6.00%
Entertainment               5.34%
Other Themes               27.68%

Table 4.3: Breakdown of countries hosting web spam.

Country           Percentage
United States     71.21%
United Kingdom     7.62%
Germany            2.46%
Netherlands        2.31%
Canada             1.77%
France             1.69%
Bahamas            1.68%
Japan              1.12%
India              1.05%
Other Countries    9.09%

Finally, we tracked the countries where the spam web pages were hosted. Table 4.3 shows the breakdown of the spam websites by the countries that hosted them. We observed that the vast majority of web pages were hosted in English-speaking countries. The United States ranks first, hosting 71.21% of the total amount of spam pages, while the United Kingdom follows in second place with only 7.62%. The other countries shared the remaining 21.17% of the web spam hosting.

4.6 Link Exchange

By analyzing the SEO forums' threads, we discovered that spammers request web pages with identical page ranking for link exchange. Likewise, the replies to link exchange threads propose an exchange with a similarly or higher ranked page. Usually, most of the replies deal with two-way exchanges, and only a small fraction of users implicitly require three-way link exchange. Moving on to the messages gathered by the honey accounts, we discovered that users with a higher page rank on their websites prefer three-way link exchange, whereas users with a lower page rank prefer two-way exchanges or, in many cases, do not care about the type of link exchange at all.

Previous studies [28] focused only on the information provided by public forum threads. In contrast, we went one step further: we crawled the actual web pages to validate the exchange and found a number of link exchanges in SEO forums that were not defined in the actual websites, mainly among the links found in public threads. We classified as not defined all the spam pages for which we do not have a clear picture about the category of the link exchange they belong to, or for which we are not sure whether the link was actually exchanged between the two pages. There are two possible reasons for not defined links: (i) the web pages did not actually exchange the link, or removed it at some point, or (ii) they exchanged the link through a private message, and thus it was not included in our dataset.

Table 4.4: Statistics of requested link exchange types.

Type                      Public Posts   Private Messages
Two-way link exchange     65.47%         87.56%
Three-way link exchange    0.81%          8.47%
Link farm                  1.45%          3.93%
Not defined               32.27%          0.04%

Table 4.4 shows the classification of requested link exchange types in both public posts and private messages. We see that a significant amount of link exchange (32.27%) is not defined in public posts. This mostly happens because a portion of spammers (26.79%) discloses the requested exchange type and the requested URL only through private messages. On the other hand, when we analyze the private messages, the not defined link exchange types decrease to just 0.04%. We did not expect private messages with a not defined exchange type, but a deeper analysis revealed that very few spammers misbehave and remove the outlink from their web pages, transforming the two-way into a one-way link exchange. Thus, the large percentage of not defined link exchange types observed in public posts shifts to the two- and three-way link exchange, as well as to the link farm category, in the case of private messages. More precisely, the two-way link exchange increased from 65.47% to 87.56%, the three-way link exchange from 0.81% to 8.47%, and the link farm requests from 1.45% to 3.93%. The most interesting increase in ratio is the one in the three-way link exchange. As we already mentioned, spammers with higher ranked web pages require a link exchange with other highly ranked pages, or a three-way link exchange. These spammers do not risk publicly revealing their websites and only contact other spammers through private messages. As a result, we notice this increase in the three-way link exchange ratio when we move from public posts to private messages.

Overall, the SEO crawlers and honey accounts collected 26,053 unique URLs. These URLs were the initial seeds for our web crawlers. During the three-month period we crawled more than 10 million web pages. A deeper look into the results revealed 97,658 web pages that participate in link exchange. The vast majority of these pages were part of two-way link exchanges; however, we spotted 274 situations where web pages participated in a three-way link exchange.


Figure 4.7: A two-way link exchange network. The numbers on nodes indicate the web page ranking.

Next, we further investigated these 97,658 web pages. Surprisingly, we discovered 842 IP addresses that were serving 5,916 different domains. In addition, we analyzed the whois records to potentially link domains together. We detected that 5.24% of all the unique domains were registered by the same entities. Furthermore, the information for 11.69% of the domains was protected by private domain registrations.

4.7 Relationship Network Graph

We analyzed our results to find clusters of spam web pages. Each cluster denotes a link exchange relationship among the involved web pages. For each cluster we found, we produced a relationship network graph, where each node participates in a two- or three-way link exchange. During our study, we found 41 clusters with more than 50 nodes, 983 clusters that consist of 10 to 50 nodes, and 2,758 clusters that contain up to 10 nodes.

Then, we generated relationship network graphs from our dataset based only on messages found in public threads in the SEO forums we monitored. The edges in these graphs represent two-way link exchange among spam web pages. All the nodes are pages that we retrieved from the SEO forums. The number within each node represents the page rank of the respective web page. A close observation of these graphs can reveal how inexperienced link spammers interact with each other. Some of the pages in these clusters have exchanged links with only a small number of other spam pages, while others have exchanged up to 25 links. We mostly see web pages with small page ranks in clusters like the one presented in Figure 4.7. One explanation for this is that inexperienced spammers ally with other spammers of a similar level.

Figure 4.8: A link exchange network including web pages from both public threads and private messages.

Figure 4.8 shows a relationship network graph for a cluster of web pages that appeared in both public threads and private messages. Similar to Figure 4.7, each edge represents a link exchange and each node a web page with its respective page rank. The circles depict pages found in public threads, while the rhombuses depict pages collected from private messages. As we already mentioned, URLs sent by private messages usually point to web pages with a higher page rank compared to those that are publicly posted. Additionally, we notice the creation of clusters in which the web pages sent by private messages collaborate with each other, similarly to the publicly posted web pages. We also observe that some web pages seen only in private messages collaborate with pages that appeared in public threads. To this end, we can combine these events and create a bigger relationship graph.

In the previous experiments, we chose to be conservative in the generation of the relationship network graphs and only considered as graph nodes web pages that appeared in SEO forums. A more liberal approach that counts all the outlinks of a spam web page as possible spam web pages could lead to bigger graphs (ecosystems). This liberal approach can be used by search engines, in combination with other anti-spam techniques, to increase their web spam detection accuracy.


Figure 4.9: A web spam ecosystem.

To extend our dataset with more link exchanges, we started with the web page pairs that were participating in two-way link exchanges as initial nodes in the extended graph. Then, we started to recursively crawl these web pages to retrieve URLs that fulfill the two-way link exchange requirements. It is worth noting that a small portion of the discovered nodes was already in our database. This proves that, even without prior knowledge of all the participating nodes in a link exchange, it is possible to retrieve nodes with a similar behavior. Figure 4.9 displays what such a link spam ecosystem looks like. Green nodes represent the initial pairs of web pages that we collected from SEO forums and that participate in link exchange, blue nodes denote the pages that were already present in our database, and red nodes denote all the web pages that the web crawlers revealed to be part of the link exchange network.

4.8 Summary of Findings

In this work, we studied 15 SEO forums that contain sub-forums for link exchange and tried to understand how their members behave. Our study reveals that spammers who use link exchange behave in a similar manner, and hence we are able to extract their heuristics and classify them into categories. The analysis results reveal two main categories of link spammers. The first, which accounts for the majority of the investigated members, consists of spammers that own low-ranked websites. These spammers usually post their websites publicly, and thus it is easy to identify them. Additionally, they belong to the hit-and-go group, which means that they are not active in SEO forums and contribute only a limited number of posts, just enough to be able to participate in link exchange. On the other hand, we have the more experienced link spammers. These spammers do not post publicly and communicate with the other members by sending private messages. They own a substantial number of websites, including highly ranked domains. These spammers are more difficult to identify, and therefore advanced techniques should be used to lure them into exposing their websites.

Regarding the link exchange, we notice that the first category prefers a two-way link exchange. We assume that these spammers have limited knowledge of SEO techniques and presume that having more links targeting their websites can mislead the search engines. In contrast, the advanced spammers are aware of how the page ranking system operates, and thus they prefer the three-way link exchange. They know that many ranking systems do not count the backlinks if a two-way link exchange is involved. Hence, they create dummy websites to use in exchanges with other users and achieve their goals.

4.9 Discussion

We believe that our analysis provides an accurate insight into the behavior and current techniques of link spammers. This is because we collect and carefully analyze a large volume of data, while we also correlate different data sources to validate the link exchanges in SEO forums. As we cannot have direct access to the complete information stored in the databases of the SEO forums, we make a best-effort attempt to collect as much data as possible, either from public sources or by trying to convince other users to send spam links to honey accounts. Therefore, we are not able to collect and analyze all link exchanges carried out through these forums. However, we believe that our approach provides us with a representative and adequate sample of the spammers' activity.

Our approach utilizes honey accounts to harvest data that is not publicly available. Although these accounts have a certain level of intelligence, they could be identified if SEO forums deployed more advanced detection techniques. Additionally, there exist cases in which these Sybil accounts do not know how to act. This happens when the algorithm behind them cannot successfully recognize, and thus categorize, the text in public posts. This can also happen with replies to private messages. In these cases, manual input is required. Consequently, we do not recommend Sybil accounts as a long-term solution. Nevertheless, in our study they were a necessary "evil" in order to uncover non-public information, which we could not access by any other means.


4.10 Related Work

Web spam as a phenomenon is nearly as old as the web itself, and thus general aspects of web spam have been discussed in a large number of studies over the last years. Previous studies focused on a wide variety of issues, including economic aspects of web spam [87, 165], cloaking and redirection techniques used by web spammers [75, 206], and content analysis of spam web pages [127, 138, 186]. However, there are relatively few examples of empirical studies that identify the means by which spammers communicate with each other, most likely due to the private nature of this communication.

Many anti-spam methods, such as TrustRank [76], BadRank [205] and SpamRank [14], have been proposed to detect link spam or demote its influence on page ranking. Adali et al. [2] demonstrated that generating pages with links targeting a single page is the most effective means of link spam, while Zhang et al. [217] showed how to make PageRank [21] robust against attacks. Finally, Fetterly et al. [51] investigated the cases where web pages are mosaics of textual chunks copied from legitimate pages and presented methods for detecting them. Our work is complementary to these studies, since we focus on the link structure of web spam.

To assess the impact of forum-based spamming on search quality, Niu et al. [137] conducted a comprehensive study from three different perspectives: that of the search user, the spammer, and the forum-hosting site. They examined spam blogs and comments in legitimate and honey forums. Their results revealed the existence of link spamming in the top 20 returned results for web searches on Google and MSN, and they even found forums on governmental sites that were not immune to spamming. Their evaluation also showed that a fraction of web pages use universal redirectors to cover up their spam URLs, including spam pages with malicious content. Compared to their work, our research is also able to reveal the relationship network among spammers, not only to decide whether a web page is spam or not.

Motoyama et al. [132] analyzed the private messages exchanged in six underground forums. Interestingly, their analysis showed that these markets feature the typical characteristics of a regular market. More precisely, the researchers revealed that the pattern of communications in underground forums captures the dynamic trust relationships forged between mutually distrustful parties. Our study uses the same methodology—investigation of underground forums—and we try to reveal the relationships formed among link spammers. The key difference between the two works is that in our case we try to lure the spammers into revealing their spam web pages with the assistance of honey accounts.

Using honey accounts is an age-old idea for conducting interactive studies. Such approaches appeared, for example, in studies that investigate spam appearances in instant messaging systems: HoneyIM [209] is a system that uses decoy accounts in users' contact lists to detect content sent by instant messaging malware.


Similarly, HoneyBuddy [6] is an active architecture that constantly adds "friends" to its decoy accounts and monitors a variety of instant messaging users for signs of contamination. Our approach is based on the same basic principles: we create active decoy accounts and try to tempt link spammers into revealing non-public information.

Our approach is closely related to the work presented by Cheng et al. [28]. In their study, they used SEO forums to discover link spammers who offer their pages for link exchange. They built a semi-supervised learning framework to detect spam links posted in these forums. To do so, they analyzed 100 public threads from seven SEO forums, but they did not have access to the private messages exchanged in the context of these threads. Their results showed that the precision of their method is higher than that of TrustRank [76]. Compared to their work, our approach has two major differences. First, we observed that more than 25% of the spam links are sent via private messages, and thus we created honey accounts to collect these data. To the best of our knowledge, our work is the first that uses honey accounts in SEO forums. Second, we use the links found in user posts within the forums' public threads, as well as the links found in private messages, only as a sign of link exchange and not as an actual event. This is because promising a link exchange within a forum may not result in adding the respective links to the web pages. Our approach considers a link exchange among spammers to be real only after crawling the respective web pages, extracting their links, and validating that the links were added as promised in the forum messages. Previous works, on the other hand, are based only on the data gathered from the forums. Hence, we believe that our research is more accurate and representative of the true nature of link spamming.

4.11 Summary

In this chapter we examined how web spammers use unfair means to increase the popularity of their websites, which could potentially host malware. In detail, we presented a large-scale study of the relationship networks that exist behind link spam. The key idea that motivated our data collection and analysis is that link spammers tend to form relationships with each other, by performing link exchange, in order to boost the ranking of their web pages. They usually utilize SEO forums to get in contact with co-spammers. Thus, we systematically collected spam links from SEO forums, analyzed them, and validated the link exchange by crawling the respective web pages. Also, we enhanced a typical forum crawler with honey accounts, which collect data from private communications. In addition, we visualized the link spam network using a graph representation of the revealed link exchanges and relationships found among spammers. The outcomes of our experiments indicate that there is a medium-sized but quite active community that seeks ways to unethically improve the ranking of its websites.


5 Abusing Crawlers for Indirect Web Attacks

“Some say Google is God. Others say Google is Satan. But if they think Google is too powerful, remember that with search engines unlike other companies, all it takes is a single click to go to another search engine.”

Sergey Brin

It could be argued that without search engines, the web would have never grown to the size it has today. To achieve maximum coverage and provide relevant results, search engines employ large armies of autonomous crawlers that continuously scour the web, following links, indexing content, and collecting features that are then used to calculate the ranking of each page. In this chapter, we describe the ways in which attackers can abuse autonomous crawlers to exploit vulnerabilities on third-party websites while hiding the true origin of the attacks. Moreover, we show how certain vulnerabilities on websites that are currently deemed unimportant can be abused in a way that would allow an attacker to arbitrarily boost the rankings of malicious websites in the search results of popular search engines. We propose a series of preventive and defensive countermeasures that site owners and search engines can adopt in order to minimize, or altogether eliminate, the effects of crawler-abusing attacks.

5.1 Introduction

It is sometimes hard to imagine that search engines were not always part of the web. Before the prevalence of search engines, users found content either by following links, by hearing about websites, or even by attempting to guess the domain of a website relevant to them; for example, a user searching for "California wine" might guess that californiawine.com is the most appropriate website to find what she is looking for. Consequently, an abundance of publicly available knowledge remained hidden from the vast majority of the early Internet population, who relied on the aforementioned techniques to discover useful information.

This way of navigating the web changed dramatically with the advent of search engines. While the first search engines had an index of a few thousand web pages and web-accessible documents [120], their modern versions count tens of billions of publicly accessible web pages [204]. Nowadays, these engines are trusted to provide relevant content to users in response to their search queries. Moreover, due to unbiased ranking algorithms, the results that users get are, in principle at least, the ones that best match their interests. Undoubtedly, this was a total departure from the earlier reliance on website links, where the user could not know whether a host website was linking to a destination website because the latter indeed had the most relevant content, or because the destination website was actually paying the host website a monthly fee for having its link listed. As such, today, with the exception of some popular crowdsourced link-aggregating websites like reddit.com, the majority of content discovery happens through search engines. The tight coupling of search engines and modern browsers, in the form of dedicated input fields next to the browser's URL bar or piggybacking on the URL bar itself, is further evidence that users rely more and more on search engines.

To provide relevant content, search engines employ large armies of automated website crawlers. These crawlers are constantly navigating the web, following links, indexing content, and gathering statistics for each discovered page. The gathered data is combined in order to produce a rank for each page, which is then used to provide ordered search results when users search for relevant terms. The higher a website is listed in the search results, the higher the chances that a user will click on that link instead of a competing one [167]. This has given rise to a wide range of techniques that websites employ in order to manipulate the findings of crawlers so that the websites appear at a higher position in a search engine's results than they otherwise would. These techniques are part of the so-called SEO toolbox, as we have already discussed in Chapter 4, and can range from benign actions, such as refactoring a page's HTML code so that it can be easily consumed by crawlers, to blackhat ones [115], such as purchasing backlinks from other websites and stuffing each page with many keywords unrelated to the page itself.

In this chapter, we investigate the extent to which attackers can abuse the logic of search engine crawlers in order to perform various attacks. First, we show that an attacker can convince crawlers to launch attacks against third-party websites by crafting appropriate links, which the crawlers then follow. For instance, an attacker who knows that a remote website is using a content management system vulnerable to a SQL injection can construct a malicious URL that exploits that vulnerability and have the crawler of a search engine follow that link instead of following it herself. The attacker can then collect the results of that attack in a number of ways (e.g., by inspecting the cached page of the vulnerable website on the search engine's website). Apart from totally shielding the attacker from a post-mortem analysis of the attack by the operators of the vulnerable website, this attack also creates other problems. If, for instance, a web application firewall detects the attack and decides to block traffic coming from the IP address of the attacking host, it will essentially be blocking the crawler of a large search engine, an action with negative effects for the website's visibility in the search results of that specific search engine.

In addition, we show that websites vulnerable to reflected HTML and JavaScript injections could be used to boost the ranking of attacker-owned or third-party domains. This is feasible through the careful construction of links that, when followed by a crawler, will provide web pages with backlinks towards the attacker-owned or third-party domains. These injected backlinks will positively affect the ranking of the adversary's website(s), which can in turn be used for scams and drive-by download attacks. Finally, motivated by our findings, we propose a series of deterministic and learning-based countermeasures for the detection of malicious outbound links. For the former, we propose the notion of authorized links (i.e., links whose legitimacy can be verified by a search engine crawler), and show how these authorized links can be realized using existing web technologies. For our learning-based countermeasures, we use anomaly-detection techniques to establish a notion of normality for the outbound links of any given website, which can then be used by the site operator to automatically detect abnormal outbound links.

In summary, we make the following main contributions:

• We provide a systematic overview of attacks due to the abuse of search engine crawlers and study the consequences of different attacks on search engines as well as the affected third-party websites.

• We deploy multiple sites with vulnerabilities together with attacker-controlled websites and measure the susceptibility of the crawlers of various search engines and the degree to which they unwillingly "collaborate" with an attacker.

• We propose pragmatic, deterministic, design-based countermeasures along with learning-based mitigations and evaluate their efficacy in stopping the aforementioned attacks.


5.2 Web Vulnerabilities and Exploits

The complexity of modern websites, along with the mixing-and-matching of platforms and extensions, is one of the root causes of web application vulnerabilities. Website vulnerabilities such as cross-site scripting (XSS) [192], SQL injection [20], cross-site request forgery (CSRF) [12], command injection [179], HTTP parameter pollution (HPP) [8] and HTTP response splitting [99] are among the most pressing security problems on the Internet today.

Attackers use a variety of these vulnerabilities to exploit websites. Even vulnerabilities such as SQL injection and XSS, which are well known and have been studied for years, are still frequently exploited and make up a significant portion of the vulnerabilities discovered each year [30]. To achieve their goals, cybercriminals frequently use black-box web vulnerability scanners to automate the process. These tools crawl a website for common security vulnerabilities and, if they find one or more of them, generate specially crafted input values to exploit them.

An exploitable vulnerability can affect the website itself or its visitors. For instance, attackers that gain access to a website's database can modify or delete selected entries, or collect sensitive data, such as user credentials [185]. Additionally, they can modify the content of the website by attaching malicious scripts, redirecting the traffic to malicious websites, or modifying advertisements to generate revenue for themselves. Consequently, an exploited website can cause serious problems for both its operators and its users.

5.3 Security Problems

In this section, we describe how an attacker (Section 5.3.1) can take advantage of web crawlers to act against the search engine company itself (Section 5.3.2) or to launch indirect attacks that mask her true identity (Section 5.3.3). We argue that these are the two most representative instances of this type of attack scenario.

5.3.1 Attack Model

Referring to Figure 5.1, we assume the existence of vulnerabilities in a website, target.me, that is an attractive target for an attacker (e.g., it contains valuable information such as user credentials or possesses a high page rank). Additionally, we assume that an attacker wants to benefit from these vulnerabilities. However, she does not want to leave any traces of her actions. Moreover, she may also know that specific actions can trigger alerts in the website's intrusion detection system (IDS), yet they might appear benign if they come from popular web crawlers (e.g., due to whitelisting). Hence, she decides to take advantage of these crawlers to perform a series of indirect attacks against the vulnerable website.


Figure 5.1: Overview of the attack scheme.

The types of indirect attacks that an adversary uses, along with their effects, depend both on the targeted website and on the extent to which the attacker can manipulate search engine crawlers. We classify them into two different categories: (i) attacks that promote a third-party website by abusing a vulnerable website, and (ii) attacks that directly affect the targeted website. The first category contains scenarios such as blackhat SEO attempts, whereas the second includes classic HTTP-based attacks against the server-side software of target.me. In the following sections, we discuss these threats in detail. In both cases, the adversary leverages a cooperating site controlled by her, which we call malice.me. This could be either a site that is unknowingly helping the intruder (e.g., a link aggregator or a blog that lets everyone post links) or a site created by the attacker herself. In both cases, the requirement is that the adversary can post links with arbitrary GET parameters.

5.3.2 Blackhat SEO Attacks

To determine the reputation and popularity of a web page, search engines com- monly rely on the number and ranking of the other web pages that link to it. As we have explained in Section 4.2, the more websites linking to a page p, and the more popular these websites are, the higher rank page p will receive from a search engine. We showed that cybercriminals have created techniques that exploit the PageRank algorithm and improve the ranking of their web pages. Apart from the link farms we discussed in Chapter4, which constitute a volunteered network formed by link spammers, adversaries can also employ more aggressive approaches such as code in- jection attacks against vulnerable websites to achieve their goals. More specifically, they can utilize HTML and JavaScript injections to add backlinks to boostme.com

97 5 Abusing Crawlers for Indirect Web Attacks

(see Figure 5.1) that supposedly originate from the targeted website. Note that in all the following examples, unlike in typical web application exploitation, the victim is the crawler, and thus the injected code will appear in the crawler's view of a web page.

HTML Injection

This attack allows an adversary to inject HTML code, which contains one or more links, inside the victim’s web page. When the search engine crawlers visit the targeted web page, they will extract the links and will classify them as backlinks to boostme.com. For instance, the attacker could leverage a vulnerability in target.me to inject the following payload, which will be followed by the crawler:

<a href="http://boostme.com/">Hey bot, click me!</a>

Consequently, if an attacker can exploit multiple vulnerable websites, especially websites with a high page rank or a small number of outbound links, the victim sites will unwillingly increase the page rank of the attacker's website. As in traditional injection attacks, this attack can be persistent or reflected. It is also worth noting that HTML injection is traditionally viewed as a less severe vulnerability than JavaScript injection, because of the limited power of HTML. In our scenario, however, the user is a crawler, and as such injected HTML can have significant and unforeseen side effects.

JavaScript Injection

This injection allows an attacker to execute JavaScript code inside a targeted website, as if this website sent that code to the user rendering that website. Compared to HTML injection, JavaScript offers attackers more power since they can dynamically modify the content of the website at will. For instance, the injected code can recognize the origin of the traffic, and if it comes from a web crawler, the website can exhibit completely different behavior than it usually does. This way, web crawlers can observe links that are not visible to normal users. This attack has only recently become possible, since some popular crawlers have started adding support for parts of the JavaScript language [69]. As an example, the following payload injected (and rendered) on the targeted page:

for (i = 0; i < 20; i++)
  document.write('<a href="http://boost' + i + '.me/">Link' + i + '</a>');

would create twenty backlinks to attacker-controlled domains.


5.3.3 Indirect Attacks

There exist adversaries that aim to harm a vulnerable website, either by manipulating its data or by stealing sensitive information. The main difference from the attacks described in Section 5.3.2 is that not only search engine crawlers are useful to the adversary, but any kind of service that performs remote requests. Next, we describe some of the most common attacks that fall under this category:

SQL Injection

In this attack, a malicious user injects SQL commands into an entry field for execution. A successful injection can then read and modify sensitive data, execute administration operations on the database, recover the content of a given file present on the DBMS file system, and in some cases issue commands to the operating system. For instance, let us assume the following SQL statement:

SELECT * FROM users WHERE id=-1 or 1=1

This SQL command is completely valid. It will return all rows from the table users, since the condition 1=1 is always true. An attacker can construct a link that contains the above statement, wait for a search engine crawler to visit it, and then exfiltrate the results of the attack by inspecting the cached copy of the vulnerable page in the search engine's cache.
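As a minimal illustration of how such a link could be prepared for posting on malice.me (the endpoint and parameter name are hypothetical, not taken from our setup), the payload only needs to be URL-encoded into a GET parameter:

from urllib.parse import quote

# Hypothetical vulnerable endpoint and injected condition
payload = "-1 OR 1=1"
link = "http://target.me/users.php?id=" + quote(payload)

# Anchor tag posted on the cooperating site for the crawler to follow
print('<a href="%s">interesting page</a>' % link)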

Local File Inclusion

Here, an attacker can include files that are already locally present on the server. This vulnerability allows directory traversal characters to be injected as input to the targeted website [178]. Consider the following request:

http://target.me?file=example.html

An adversary who wants to access the logins and passwords stored on the targeted server can modify the input to look like:

http://target.me?file=../../../../etc/passwd

As before, attackers can construct such a request, place it in an anchor tag, and wait for a web crawler to visit it.

Remote File Inclusion

This attack is the process of including remote files by exploiting vulnerable inclusion procedures implemented in the web application. This vulnerability occurs in applications that dynamically reference external scripts. A successful attack enables the cybercriminal to include code from a remotely hosted file in a script executed on the application's server [178]. Since the attacker's code is executed on the web server, it can be used both for stealing data from the server and for attempting a full takeover of the vulnerable server. Similarly to the local file inclusion, the attacker can construct a link in such a way that it includes an external malicious script.
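For instance (the parameter name follows the local file inclusion example above, while the remote script location is hypothetical), such a link could look as follows:

http://target.me?file=http://malice.me/evil-script.txt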

5.4 Susceptibility Assessment

In this section, we describe how we assess the aforementioned security problems in the real world. More precisely, we show the feasibility of the attack schemes by implementing them against five popular web application vulnerabilities. As a result, we are able to "maneuver" popular web crawlers into performing these attacks instead of us, thereby hiding our true identity from the targeted website. Briefly, in Section 5.4.1 we describe the methodology we use to perform the attacks, while in Section 5.4.2 we evaluate the inclination of web crawlers to launch these attacks. We believe that we have constructed a realistic scenario that can be found on the web, and we aim to raise awareness among web crawler developers so that they implement better filtering mechanisms.

5.4.1 Methodology and Measurement Infrastructure

A preliminary step for assessing web crawlers' capacity to blindly follow URLs consists of attacking a vulnerable website (i.e., target.me). Since it would be unethical to target a real site, we decided to design and deploy our own version of a vulnerable website. This provided us with the ability to select the vulnerabilities we wanted to assess. Consequently, it was possible to create a website that was prone to all the attacks we described in Section 5.3, allowing comprehensive assessment. To increase our interactions with crawlers, we deployed six different copies of the vulnerable site, reachable through different domain names.

Next, we developed the attacker-controlled website (i.e., malice.me) that targeted all these vulnerabilities by generating the appropriate attack-carrying links and mixing them with other benign links. We deployed a normal and a (base64) obfuscated version for each attack in order to observe the behavior of web crawlers in both situations. Finally, we advertised our attacking website in all the major search engines using their website-submission forms. In fact, we noticed that search engines were very eager to index the newly available content. Our analysis of the web server logs revealed that the biggest search engines commanded their crawlers to visit our websites the same day we advertised them. In the following days, we observed traffic from other crawlers as well as from individuals. As we did for the vulnerable sites, we created six copies of the attacker-controlled websites to get multiple observation points. As a result, we ended up with six concrete instances of the abstract scheme depicted in Figure 5.1.


Table 5.1: Overview of the feasibility of each attack for each type of abuse. [The table lists, for the crawlers GoogleBot, BingBot, YahooBot, BaiduSpider, AhrefsBot, MJ12Bot, and XoviBot, which attacks they launched: the blackhat SEO attacks (HTMLi, JSi) and the indirect attacks (SQLi, LFI, RFI), each in a plain and an obfuscated (Obf.) variant.]

5.4.2 Findings

The results of our assessment are summarized in Table 5.1, which indicates the successful attacks that can be performed through different web crawlers. The table includes both normal and obfuscated attacks and shows how each crawler treated them. Note that we omitted requests that originated from individuals (user agents belonging to browsers) or from web crawlers that we could not trace back to a specific search engine.

Coverage

The first major outcome is that all web crawlers performed at least one attack. Some of them also launched the majority of the attacks (i.e., normal, obfuscated, or both of them). This means that all the examined crawlers are prone to being manipulated by attackers. For instance, we observed that a portion of the search engines is more prudent when it comes to attacks that can affect themselves. In particular, we saw that some crawlers did not launch the HTML and JavaScript injections that generate URLs and would have affected the search engine's page ranking computation. However, they behaved more liberally when it came to attacks that would affect a third-party website.


Role of Obfuscation

Next, we observed a level of randomness when it comes to obfuscation. We expected that a web crawler would follow a deterministic model in cases of obfuscation, namely to launch: (i) the plain attack and omit the obfuscated version, (ii) the obfuscated attack and omit the plain version, (iii) both of them, or (iv) none of them. However, we noticed inconsistencies on the side of some web crawlers. For example, GoogleBot launched both the normal and the obfuscated version of one attack, whereas it launched only one of them for a different attack. We believe that in some cases the crawlers chose to follow only a portion of our links, which could be due to the fact that the links all pointed toward the same victim website with only slight differences in their arguments.

Speed and Frequency

A final observation relates to crawling speed and frequency. We observed that some web crawlers visited the vulnerable website in a periodic manner and launched the attacks more frequently compared to others. An attacker can benefit from this fact to make some attacks more persistent. For instance, if the attacker knows the exact vulnerability she wants to target, then she can advertise the cooperating site to the most persistent web crawler to maximize the efficiency of the attack.

5.5 Defenses

In the earlier sections of this chapter, we showed how an attacker could abuse crawlers to conduct attacks against third-party servers as well as to boost the rank of her sites. In this section, we discuss how these attacks can be stopped. We have purposefully chosen to expand more on the attacks involving the addition of backlinks since they are harder to detect using existing technologies.

5.5.1 Stopping Indirect Attacks

In order to successfully use a crawler to conduct server-side attacks, an attacker must find a remote web application that is vulnerable to one of the previously discussed server-side attacks, such as SQL injection or local file inclusion. One could argue that it is the web application's responsibility to protect itself and, as such, even if a crawler has been used to conduct an attack, the search engine behind it cannot be held responsible. One way that a web application can protect itself (other than not having vulnerabilities) is to use a web application firewall (WAF). These systems are located in front of the web server and search for attack patterns, such as SQL statements,

in the incoming traffic. Upon detecting an attack, the WAF will drop the request and potentially blacklist the IP address of the offending host. While dropping the request is the right thing to do, blocking the offending IP address can lead to complications when an attacker is conducting her attacks through crawlers. In other words, if a WAF, upon seeing an attack, decides to block the IP address of the crawler of a popular search engine, it will unwillingly be blacklisting itself from that search engine. A naive way of handling this corner case is to consult the user-agent string of the incoming HTTP request and not block requests coming from bots. This, however, could be straightforwardly abused by attackers in order to fully bypass the firewall. As such, we reason that this strategy must be combined with reverse-DNS lookups. A WAF that received an offending request from a "supposed" search engine bot can use a reverse-DNS lookup to establish whether the bot does in fact belong to the claimed search engine. If it does, then the WAF should avoid blacklisting its IP address and merely drop the request, understanding that the bot is merely an unwilling actor in this attack.
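A minimal sketch of such a check, assuming the WAF can afford two DNS lookups per offending request (the crawler suffix list and the policy names at the end are illustrative, not part of any specific WAF product):

import socket

# Reverse-DNS suffixes of crawlers we would spare from blacklisting (illustrative)
TRUSTED_CRAWLER_SUFFIXES = (".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")

def is_genuine_crawler(ip):
    # Forward-confirmed reverse DNS: the PTR record must point to a known
    # crawler domain, and that hostname must resolve back to the same IP.
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse lookup
    except socket.herror:
        return False
    if not hostname.endswith(TRUSTED_CRAWLER_SUFFIXES):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward confirmation
    except socket.gaierror:
        return False

# Example policy (attack_detected, client_ip, and blacklist are illustrative names):
# if attack_detected and not is_genuine_crawler(client_ip): blacklist(client_ip)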

5.5.2 Stopping Blackhat SEO Attacks

In this section, we focus on the problem of fake or otherwise non-legitimate backlinks, where an attacker abuses vulnerabilities on websites to inject backlinks to attacker-controlled sites. As explained in earlier sections, depending on the nature of the vulnerability, the backlinks can be temporary (as in the case of reflected HTML or JavaScript injection) or permanent (as in the case of persistent injections). In both cases, the attacker takes advantage of the crawler's inability to distinguish legitimate from malicious links as they are discovered on an HTML web page. We discuss two possible solutions that place most of the burden either on the website owner or on the search engine itself. Note that these solutions are compatible with each other, and thus both can be used at the same time to improve the overall accuracy of a malicious-link detection system. Moreover, they are all backwards compatible in the sense that sites and search engines that choose not to support them will continue working without any interruptions.

Deterministic Solutions: Authorized Links

A deterministic solution to this problem is to provide mechanisms to website administrators that can be used to denote which links are legitimate. If any links are found on a page that are not explicitly approved by a page's policy, the crawler can automatically treat them as malicious. To this end, we propose the use of authorized links, which are links augmented with authorization information that a crawler can inspect and verify. We posit that the use of message authentication codes (MACs),

nonces, and whitelisting are good mechanisms for realizing the notion of authorized links, as all of them have been successfully used in the past to achieve integrity goals. Moreover, the nonce and whitelisting methods can be realized using the Content Security Policy (CSP) mechanism, thus alleviating the need for implementing yet another server-driven security policy mechanism as well as for training developers on how to properly use it. In the following, we discuss these solutions in detail.

MAC-based Solutions. In a MAC-based approach, each link is augmented with a hash of that link concatenated with a secret shared between the website and the search engines. Since website administrators already interact with search engines in order to submit their websites' URLs for crawling, it is straightforward to add an extra step where an administrator, after verifying that she is indeed the owner of a specific domain, shares a secret key with that search engine. This key can be used in the future to verify the integrity of the links found on her website. The MAC-based mechanism works as follows: the owner of domain D agrees on a key K with a specific search engine. For every link L toward remote domains, the owner of D augments the anchor tag with the result of H(L||K), where the function H is a strong cryptographic hash function. For instance, given an outbound link to example.com and a key of "secret", the HTML markup for an anchor tag would be the following:

<a href="http://example.com" mac="...">Click here</a>

where the mac attribute is the result of the MD5 hash function on the string "http://example.comsecret". Since web browsers ignore markup attributes that are not part of their specification, the mac attribute will be invisible to normal users of a website. However, a crawler belonging to a search engine that knows in advance the key of this specific website can directly recompute a link's MAC and disregard the link if the computed MAC does not match the MAC available on the website. Even though the process of creating MACs for every outgoing link is likely to be an arduous one, it does not need to be performed manually. Unless the owner of a website is writing HTML by hand, "What You See Is What You Get" (WYSIWYG) editors that are available for all modern content management systems can fully automate this process. For instance, when the user is writing a new blog post in WordPress and clicks on the button that will allow her to enter a new link, the editor can automatically fetch the key from the website's database, compute the proper MAC, and append it to the generated markup. Even for developers who decide to write HTML by hand, a script that receives as input the secret key can straightforwardly parse the HTML file, compute the appropriate MACs, and rewrite the HTML code to support the notion of authorized links.
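A rough sketch of how such markup could be generated (MD5 is used only because the running example above uses it; the helper below is illustrative and not part of any existing CMS):

import hashlib

def authorized_link(url, key, text):
    # Append a 'mac' attribute carrying H(L||K) to an outbound anchor tag.
    mac = hashlib.md5((url + key).encode()).hexdigest()
    return '<a href="%s" mac="%s">%s</a>' % (url, mac, text)

# The crawler recomputes the MAC with the key shared by the site owner and
# disregards the link if the values do not match.
print(authorized_link("http://example.com", "secret", "Click here"))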


Nonce and Whitelisting. In a nonce-based solution, each link toward a remote domain is augmented with a non-predictable identifier that is different for every page load. As a matter of fact, the Content Security Policy (CSP) 1.1 draft already allows developers to include inline JavaScript in their web pages (forbidden in the original CSP specification) as long as each inline script specifies the appropriate nonce [193]. The correct nonce is communicated to the browser through the CSP header. For example, a browser that receives the following HTTP response:

HTTP/1.1 200 OK
Content-Security-Policy: script-src 'self' 'nonce-1q2w3e4r';
[...]

<script nonce="1q2w3e4r">
// Legitimate script
</script>

<script>
// Malicious injected script
</script>

will allow the first inline script to execute but will stop the second one, which does not carry the proper token. As such, these nonces essentially allow capability-based access control: as long as a script has the appropriate nonce (as specified in the CSP header), the browser will allow it to execute. The same functionality can be extended to protect links. That is, the CSP header could carry an ahref-src directive which would specify a nonce for the legitimate links. Provided that the nonce is sufficiently random and changes on every page load, an attacker who is injecting backlinks has no way of knowing a priori the nonce that the victim server will hand to the crawler. As with the MAC-based solution, nonces can be emitted by WYSIWYG editors and content management systems, which will automatically add the appropriate nonces without the user even being aware of their existence. The benefit of this solution over the MAC-based approach is that website owners do not need to exchange any long-term secret keys with search engines. At the same time, however, since these nonces have to change upon each page load, the server should be careful when caching pages at the server side: a cached nonce could, in principle, allow an attacker to inspect it and then, assuming that the nonce is reused, inject nonce-carrying backlinks on the vulnerable website. It is worth mentioning that several server-side frameworks allow the caching of fragments of a web page [46] (as opposed to caching an entire page), and thus servers do not need to completely forgo the performance benefits of cached web pages.

An alternative solution, which combines the benefit of not exchanging secrets with search engines with worry-free caching, is a whitelist of allowed outbound links.

As in CSP, a website can send to the client a list of authorized remote domains through its CSP headers. As long as each outbound link belongs to a domain that is part of the header-specified domains, the crawler can treat it as authorized. Depending on the nature of each website, the whitelist can be a global one for the entire domain, or it can be specified per subdomain or per path. In all cases, the WYSIWYG editors should update this list as new URLs are added to each page.
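A rough sketch of how a server-side framework might emit such per-request link nonces together with the ahref-src directive proposed above (the directive is our proposal and not part of the CSP specification, and the helper below is purely illustrative):

import secrets

def render_page(outbound_links):
    # Emit a fresh nonce per page load, advertise it in the CSP-style header,
    # and attach it to every legitimate outbound link.
    nonce = secrets.token_urlsafe(16)
    header = "ahref-src 'nonce-%s';" % nonce   # proposed, non-standard directive
    body = "\n".join('<a href="%s" nonce="%s">%s</a>' % (url, nonce, text)
                     for url, text in outbound_links)
    return {"Content-Security-Policy": header}, body

headers, body = render_page([("http://example.com", "Click here")])
print(headers["Content-Security-Policy"])
print(body)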

Learning-based Mitigation

The problem of automatically distinguishing legitimate from non-legitimate data in web applications has received ample attention in the anomaly detection research area. At the price of some false alerts, anomaly detection methods fill the gap left by misuse-based solutions such as blacklists or classic signature-based WAFs. Given the state of the art, and the possibility of combining anomaly- and misuse-based solutions on a modern web application, we can reasonably assume that the HTTP requests toward large and popular websites are already screened to mitigate suspicious payloads.

Threat Model. A conservative threat model must assume that non-legitimate outbound links have somehow landed on a page. This can happen in at least two cases. First and foremost, several websites deliberately allow anyone to post arbitrary links (e.g., comments on blog posts, directories, bookmarks). In these cases, the website operator simply does not want to invest resources to scrutinize every posted link, especially a posteriori. For example, Maggi et al. [117] have shown that once attackers succeed in bypassing the first line of defense and creating malicious shortened aliases (e.g., on bit.ly, tinyurl.com and other leading services), the operators never check such aliases a second time. Secondly, an attacker can succeed whenever automatic protection methods fail. There may be several reasons for this. For instance, the attacker can obfuscate the backlinks using redirections (e.g., via URL-shortening services) or adopt other sophisticated techniques. However, it is beyond the scope of this work to investigate them. The bottom line is that there exist ways for an attacker to bypass the first line of defense without the operator noticing it. Unfortunately, once the first line of defense is bypassed, the website operator loses the chance of accurately detecting a malicious link, because the contextual information (e.g., source IP, request headers) is not retained forever. In conclusion, there exist cases in which it would be beneficial to perform an a posteriori analysis on existing outbound links to discriminate the legitimate from the non-legitimate backlinks, based exclusively on the available information. In a conservative approach, such information is simply the link itself.

The threat posed by search engine bots, which we presented in the previous sections of this work, is one relevant case. However, we believe that having such a protection measure would also be very useful for forensics purposes and other investigation tasks (e.g., periodic housekeeping of free blogging platforms).

Modeling Legitimate Links. Given an arbitrary web page, target.me/page.php, our goal is to detect outbound links pointing to a website outside the control of the web page owner. Thus, we focus on links that have paid-level domains (PLD) different from target.me. Of course, we do not consider links across distinct domains from the same organization as outbound links, because these can be easily whitelisted. For instance, links from youtube.com to google.com are certainly legitimate. Clearly, we only consider links that can be used to encode an exploit or, in other words, those that contain a query string (e.g., boostme.com/path/p.php?par=var), including dynamically generated links (e.g., via JavaScript). In general terms, we are interested in the URL contained in those GET or POST requests generated when the search engine bot follows a link referred by the source page on target.me. For simplicity, and without losing generality, we analyze all those backlinks that the website operator does not know how to handle, that is, those that have already gone through the state-of-the-art and whitelisting filters employed up to now.

Challenges. The challenges we face are that: (i) there is little contextual information attached to a standalone link and (ii) there are multiple classes of links under the same domain (e.g., long links with many parameters, short links with few but long parameters, other links with just integers, floats, or tokens as parameters).

Possible Solutions. The features proposed in the literature (e.g., [104]) for detecting anomalous HTTP requests are inapplicable in this context. The reason is three-fold. First, they are designed to detect anomalous requests directed toward a web application—including but not limited to the URL components—as opposed to unexpected outbound links. Second, and stemming from the first reason, state-of-the-art methods assume that there is quite a regular structure in the URLs to analyze, since they all encode a request to a single web application or to a small set of web applications. In this setting, anomalous requests can be detected quite easily by finding out-of-sequence parameters, long parameters, special characters in the payload, etc. However, when this assumption is removed, the problem becomes harder. In principle, it is possible to apply state-of-the-art web application anomaly detection techniques by creating one model per outbound domain, striving to learn the regularities of the requests directed toward each external site. However, the scarcity and uneven distribution of data may limit the applicability of such an approach. Last, these approaches work well when used in conjunction with other models, for instance by creating correlated request-response models [104], timing features, previous knowledge of the request handler, etc.

This information, as discussed above, is not available on the referring pages hosted on target.me.

Characterizing Features. For the referring site that wants to perform checks on its outbound links, we propose a set of lightweight features and a simple but effective learning technique that makes no assumptions about the non-legitimate links. The only requirement is a set of legitimate links. We are not aiming at providing perfect recognition, nor at creating an alternative to detecting malicious payloads, as we are well aware that having only the link's string representation gives us very limited power. Perfect protection can be obtained with the deterministic solution described in Section 5.5.2, at the price of design changes. When this is not feasible, the solution described in this section gives reasonable protection at zero cost, which is already a benefit compared to the baseline of an unprotected site. Moreover, the website operator may combine our solution with other site-specific filters that leverage domain knowledge to mitigate errors. In detail, for each link, we represent the string after the first slash with the following feature vector:

• l (integer) number of symbols, including any character class;

• d (integer) the depth of the path, that is the number of “/”;

• s (integer) number of special characters;

• u and U (integer) number of lower and uppercase alphabetical characters;

• p (float) average length of the parameters’ names, counting all the symbols before the “=”;

• v (float) average length of parameters’ values, counting all the symbols after the “=”;

• n (float) number of parameters.

In addition to these features, we ran a pilot experiment that also included the frequency of each symbol in [a-zA-Z0-9] plus special characters. Unfortunately, we obtained unsatisfactory results and speed penalties due to the increased dimensionality (above 104 features).
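A minimal sketch of this feature extraction (the special-character definition and the handling of repeated parameters are illustrative choices, not taken from the original prototype):

from urllib.parse import urlsplit, parse_qsl
import re

def link_features(url):
    # Compute the 7-dimensional vector (l, d, s, u, U, p, v, n) over the
    # part of the link after the first slash (path plus query string).
    parts = urlsplit(url)
    tail = parts.path.lstrip("/") + ("?" + parts.query if parts.query else "")
    params = parse_qsl(parts.query, keep_blank_values=True)
    names = [k for k, _ in params]
    values = [v for _, v in params]
    avg = lambda xs: sum(map(len, xs)) / len(xs) if xs else 0.0
    return {
        "l": len(tail),                               # total number of symbols
        "d": parts.path.count("/"),                   # depth of the path
        "s": len(re.findall(r"[^A-Za-z0-9]", tail)),  # special characters
        "u": len(re.findall(r"[a-z]", tail)),         # lowercase letters
        "U": len(re.findall(r"[A-Z]", tail)),         # uppercase letters
        "p": avg(names),                              # avg parameter-name length
        "v": avg(values),                             # avg parameter-value length
        "n": float(len(params)),                      # number of parameters
    }

print(link_features("http://boostme.com/path/p.php?par=var&id=12"))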

Training and Detection. We use the aforementioned features to fit a model that can be used to decide whether a new outbound link is non-legitimate. Our model can be trained at various aggregation levels, depending on the working environment. For instance, a site with regular links throughout all its pages can train one model, while larger websites can train one model per site section.

Given the problem setting, three broad modeling approaches can be applied. In the optimal case, when the website operator has knowledge about the characteristics of the non-legitimate links she wants to detect, a fully supervised learning approach can be used. Using nomenclature from the machine-learning field, this is essentially a two-class classification problem. This yields the best short-term recall and precision, although the assumptions underneath this approach are not always realistic. Indeed, if one of the two classes of links is not well represented during training, or if it changes dramatically during operation, the quality of detection may decrease over time. Another approach consists in mapping the problem to a one-class classification task, or semi-supervised learning. Essentially, we ignore non-legitimate links and train the classifier exclusively on legitimate links. Although this approach has great recall, close to one hundred percent, it suffers from many false positives. A third, and more realistic, approach consists of not mapping this problem to a classification task at all. Instead, we show that a simple outlier detection technique performs very well, without requiring any assumption on how the feature values are distributed. More importantly, we do not require any knowledge about the outliers. The technique we use is inspired by the histogram-based outlier score (HBOS) [65]. We split the list of legitimate links available for training in two batches. On the first batch, for each feature we calculate the relative-frequency histogram using a fixed number of equally sized bins on the training data. The number of bins, as well as other parameters, can be easily tuned on a per-site basis as shown in Section 5.5.2. Interestingly, this method is suitable for online learning, as the frequencies can be updated without batch re-training as new samples come in. As a result, we obtain M = 7 histograms, where M is the size of the feature vector. On the second batch, we calculate the following score:

HBOS(v) = Σ_{i=0}^{M} log( 1 / freq_i(v_i) )    (5.1)

where v is an M-sized vector holding the feature values for each link in the batch and freq_i(v_i) is the frequency of the i-th component of the vector v, which is the height of the corresponding bin in the i-th histogram. If a value has zero frequency, we assign it an arbitrarily low value to allow the calculation of the fraction and the logarithm. Any low value close to zero yields a very high HBOS component, to account for the never-seen-before value.

Tuning. We now calculate the mean µ and standard deviation σ of the HBOS scores, which essentially express the range of outlier scores allowed for legitimate links.

Since the legitimate links have a lower variability than non-legitimate links (whose HBOS values are instead more distant from the mean), we can create a decision function for determining whether a new link is legitimate, given its feature vector v′:

Legitimate(v′) = HBOS(v′) ≤ α · µ + β · σ    (5.2)

This is essentially based on Chebyshev's inequality, where the α and β scaling parameters can be easily tuned on a per-site basis as shown in Section 5.5.2. Note that this decision boundary works without assuming any specific underlying distribution of features. Overall, this lightweight technique allows flexible tuning, explanation of the reason for considering a link as an outlier, and excellent results. Indeed, the website owner can examine each alert and see which feature(s) contributed most to a high HBOS(·) score (e.g., a link with an unexpected number of parameters or too many special characters).
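A compact sketch of this training and decision procedure (the parameter values, bin count, and NumPy-based implementation are illustrative; the actual prototype was built on SciPy):

import numpy as np

def train_hbos(X, bins=10):
    # X: (n_links, M) feature matrix of legitimate links (first batch).
    # Returns per-feature relative-frequency histograms and their bin edges.
    hists, edges = [], []
    for j in range(X.shape[1]):
        counts, e = np.histogram(X[:, j], bins=bins)
        hists.append(counts / float(len(X)))   # relative frequencies
        edges.append(e)
    return hists, edges

def hbos(v, hists, edges, floor=0.0001):
    # Equation (5.1): sum of log(1/freq_i(v_i)); never-seen values fall back
    # to a tiny floor frequency so the logarithm stays defined.
    score = 0.0
    for j, x in enumerate(v):
        i = int(np.clip(np.digitize(x, edges[j]) - 1, 0, len(hists[j]) - 1))
        score += np.log(1.0 / max(hists[j][i], floor))
    return score

def fit_threshold(X2, hists, edges, alpha=1.0, beta=2.0):
    # Second batch: estimate mu and sigma of HBOS over legitimate links and
    # apply the rule of Equation (5.2): HBOS(v') <= alpha*mu + beta*sigma.
    scores = [hbos(v, hists, edges) for v in X2]
    return alpha * np.mean(scores) + beta * np.std(scores)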

Feasibility Evaluation. We implemented a proof-of-concept of our approach in about 600 lines of Python code (including the code required for automating the experiments), leveraging the SciPy [90] framework for statistical computation. Our prototype parses the path and query string of each outbound link and calculates the above features. Then, it performs the training and the estimation of µ and σ for a given site. These values are then used for deciding whether new links are legitimate or not.

Dataset. Using the PhantomJS [79] full-fledged, headless browser, we collected 795,274 outbound URLs referred from the top 5,000 Alexa websites. We excluded websites with no public pages, which required registration and login (e.g., Facebook, Twitter, LinkedIn), as they would not be targeted by public search engine bots, and sites with only a handful of outbound links. Our crawling script, based on CasperJS [146], started from the initial seed, enqueued inbound links to continue the crawling process on each site, and saved outbound links. As a source of non-legitimate outbound links, we used the dataset from our measurement setup described in Section 5.4. On average, we collected 244.1 (± 319.97) outbound links per site, with peaks of up to 2,268 links. Given this skewed distribution, we focused our experiments on the top 400 sites having at least 200 outbound links each.

Evaluation Results. The experiments described in the following were repeated ten times for each data point on a randomized and shuffled train-test split (i.e., ten-fold cross validation).

Speed. Feature extraction is extremely fast, even in our proof-of-concept prototype written in an interpreted language and running on a laptop.


Figure 5.2: Overall final detection quality in terms of F1 score, precision, TPR and FPR (before and after per-site tuning). [The figure compares the ratio values of FPR, TPR, precision, and F1 score for the generic model and for the per-site tuned models.]

On average, on our entire dataset, 0.7513 ms (± 0.3375 ms) are required to extract the features from one link, using a handful of megabytes of main memory.

Parameter Selection. The default parameters in Equation (5.2), α = 1.0 and β = 2.0, the arbitrarily low frequency assigned to novel feature values in Equation (5.1), 0.0001, and the number of bins of the histograms result in very diverse detection performance, ranging from sites with very high TPR to sites with too many false positives. This led us to consider that each site could optimize the detection by choosing the combination of parameters that maximizes the precision or the F1 score (a high F1 score means both high precision and high recall, as F1 = 2 · (precision · recall) / (precision + recall)). We found out that each site has indeed a distinct combination of parameters that works better than others. Globally, with these per-site tuned models, we obtain better detection, as shown in Figure 5.2. In addition to this optimization, each site owner can pre-group outbound URLs based on domain knowledge (i.e., URLs going to site X or Y), so that group-specific models can be fitted.

Overall Detection Quality. From Figure 5.2, the vast majority of websites can

detect non-legitimate outbound links with more than 90 percent TPR and less than 6 percent FPR. Depending on whether handling a false positive or missing a true alert costs more, the website owner can easily tune our model to minimize that cost. In our experiments, we tuned the models to maximize both precision and recall, and thus to obtain the highest possible F1 score. Fortunately, among the false positives we found (i) inbound URLs with a different domain but still affiliated with the same organization, (ii) well-known banner circuits, or (iii) social buttons, all of which can be easily filtered out with simple whitelisting.

5.6 Discussion

As one might guess, there exists no study that is completely flawless and without room for improvement. In this section, we discuss several limitations of our approach and propose directions that could lead to follow-up research.

5.6.1 Attack Models

In our attack model, we considered two important security problems that are caused by the activity of crawlers: blackhat SEO and server-side attacks. However, current crawlers are powerful, and the increasingly complex front ends of modern web applications require more and more capabilities, close to emulating full-fledged clients. Simultaneously, there is a continuous cat-and-mouse game played by cybercriminals and security researchers. Therefore, we envision other attacks, where the machine cycles of web crawlers could be abused to run complex client-side software that performs malicious tasks such as crypto-currency mining and brute-forcing. To do so, we first need to deeply understand the full capabilities of modern web crawlers. Next, where this is feasible, we want to experiment with a list of complex attacks to evaluate which of them the crawlers are susceptible to.

5.6.2 Measurement

Our measurement focused on web crawlers, as they are the most important category of "link followers". However, the security problems that we highlighted in this work can in principle affect any environment where an automated browser follows a link. Nevertheless, we showed that, even if we limit our scope to web crawlers, this problem occurs in the real world. Expanding our analysis further may only reveal more instances of the very same concerning problem. For instance, an interesting scenario would be the exploration of widely used web services, such as link checkers, chat, or social networking tools that process posted links, against several attack scenarios.

Another interesting direction would be to crawl the web and harvest suspicious links that try to manipulate web crawlers. This would shed light on the degree to which adversaries leverage these attacks in the real world.

5.6.3 Deterministic Defenses

We proposed very simple yet effective design principles that could prevent the abuse of web crawlers entirely. The assumption is that the website owner or web application developer can decide which links are safe to be crawled and which ones are not. Our solution is perfectly compatible with the current web and works for both statically and dynamically generated links. For instance, automatic cross-site request forgery (CSRF) protection has now landed in commercial and open-source web application frameworks (e.g., Flask, Django). All the developer has to do is use them, and forms are automatically protected against CSRF. Similarly, we believe that our proposed deterministic solutions can be implemented and included in web development libraries to ensure that all on-premise links are authenticated out of the box. However, our approach is less applicable to websites that host massive amounts of arbitrary user-generated content. In these cases, we had to resort to a non-deterministic mitigation approach, which is intrinsically imprecise.

5.6.4 Learning-Based Defenses

Our learning-based mitigation approach assumes that there is a limited number of "classes" of legitimate outbound links. On websites with many different "classes" of links, our technique can yield errors. Although we showed that we could ignore this problem, at least on the top Alexa sites, we believe that our technique may yield better results if applied in combination with pre-processing pipelines, which group outbound links according to their provenance (e.g., per user, per page, or per section of the website). This would ensure much more uniform histograms during training, increased detection capability, and fewer errors. Nevertheless, an interesting research direction consists in exploring the application of supervised learning techniques in order to incorporate pre-existing domain knowledge in the models that we propose. This may improve the recall on non-legitimate links, while keeping a high precision. However, given the variety of non-legitimate outbound links that an attacker can craft, applying these techniques requires an ample set of samples as well as a robust set of features to model them.

5.7 Related Work

We are not aware of any scientific systematization work that defines and analyzes in depth the security problems that we study and quantify in this chapter. Our work

is mainly related to search engine poisoning and web vulnerabilities, and marginally to web service abuse.

Web vulnerabilities [139] are a longstanding problem in the modern digital world. The main concern is the wide attack surface that they offer. In the past decade, researchers focused on basically every aspect of web vulnerabilities, including design-based solutions [159] to minimize the chances of errors in the development phase, code-analysis approaches that try to find [192] and remove [48] application bugs, and runtime defenses that strive to detect and block their exploitation. Although technically simple to understand and implement, the abuse of web crawlers as an indirect "exploiter" of web vulnerabilities is a serious threat that creates a trade-off between "protection" and "popularity" in modern websites.

The attacks that we present in this work can be categorized as abuses of public web services for malicious purposes. A recent and interesting work in this direction is [114], which presents a composition-based attack put together by the authors by leveraging benign web services (e.g., Google Docs, Facebook, URL shorteners). The approach presents a series of low-level HTTP primitives that can be created by adding certain URLs, for example, in a Google spreadsheet or a Facebook status update, and waiting for the service to crawl such URLs. The net result is that an attacker can send arbitrary HTTP requests while remaining anonymous, well hidden behind the many levels of indirection created through the combination of such services. Interestingly, this method could be leveraged in our attacks to further increase their power.

In addition, there exists a plethora of works that try to mitigate blackhat SEO, based either on web page characteristics or on the link structure leading to the pages [138, 186, 205, 206]. Moreover, in order to prevent spammers from gaming the system, search engines do not officially disclose the exact features used to determine the rank and relevance of a web page to a search query. Nevertheless, blackhat SEO is still an existing problem. Leontiadis et al. [107] studied the evolution of search engine poisoning over a period of four years. Borgolte et al. [18] describe advanced JavaScript injections that are particularly hard for crawlers to detect. Wang et al. [196] explore the effect of interventions against search poisoning campaigns targeting luxury goods. deSEO [89], on the other hand, detects URLs from the search index that contain signatures derived from known search poisoning landing pages and exhibit patterns not previously seen by the search engine on the same domain.

SURF [115] is another recent work on the detection of web search engine poisoning. The authors study search redirection graphs, obtained with an instrumented browser, and extract robust features that indicate rogue redirections typical of poisoning campaigns. Such features include, for instance, the total number of redirection hops, the number of cross-site redirections, and the presence of page-rendering errors. A system like SURF can be effectively adopted on a global scale by search engine operators to find and hide rogue results.

In fact, in our work we focus more on detecting the origin of the indirect attacks described in Section 5.3.3. Regarding the search engine poisoning attacks described in Section 5.3.2, we focus specifically on those that are made possible thanks to the presence of vulnerabilities. Moreover, SURF tackles the problem of detecting existing campaigns in general, whereas we analyze the causes of the problem under the specific condition of a vulnerable website that offers the attacker a low-hanging fruit to create such campaigns. In principle, our results could be applied to characterize and detect vulnerability-enabled search engine poisoning campaigns right at their origin.

5.8 Summary

In this chapter, we explored a new category of attacks, which rely on the fact that search engine bots, or third-party web services in general, trust and follow links that are presented to them. The challenge here is that there is a trade-off between the primary goal of a bot (i.e., to explore every corner of the web) and the risk of following a malicious link. The security problem that arises is amplified by the increased computational power demanded by modern websites, which require complex crawling capabilities. We discovered that the most popular crawlers (i.e., GoogleBot, BingBot, YahooBot) blindly follow links that could potentially end up exploiting a vulnerability on the target host. This empowers attackers with the possibility of hiding their true location when launching an attack. Moreover, the existence of this attack avenue creates a delicate and complex issue of responsibility: which party, between website owner and crawler operator, is liable in case a malicious outbound link disrupts a web service? To this end, we proposed countermeasures that can be adopted gradually and independently by each involved party, and which can mitigate or altogether eliminate this problem.


6 Understanding Malicious Advertisements

“In the virtuous cycle of paid search, you need advertisers. The more advertisers you have, the more bids you have. The more bids you have, the more traffic you have. The more traffic you have, the more money you get per search.”

Gary Flake

Online advertising drives the economy of the World Wide Web. Modern websites of any size and popularity include advertisements to monetize visits from their users. To this end, they assign an area of their web page to an advertising company (a so-called ad exchange) that will use it to display promotional content. By doing this, the website owner implicitly trusts that the advertising company will offer legitimate content and will not put the site's visitors at risk of falling victim to malware campaigns and other scams. In this chapter, we perform the first large-scale study of the safety of the advertisements that users encounter on the web. We analyze to what extent these users are exposed to malicious content through advertisements, and investigate what the sources of this malicious content are. The observations we made shed light on a little studied, yet important, aspect of advertisement networks, and could help both advertisement networks and website owners in securing their web pages and in keeping their visitors safe.

6.1 Introduction

The online advertising industry is constantly growing. A recent report showed that this industry generated a revenue of 42.8 billion dollars in 2013, which is 17% higher than what had been reported in the previous year [81]. In the World Wide Web, where most online services are free of charge, advertisements constitute the

main revenue source for website administrators, and it is very common to see promotional content alongside the actual information contained in such sites.

Given the profitability of online advertising, many big players have entered the arena. Such players, known as ad exchanges, put the advertisers, who want their content to be displayed, in contact with the publishers, who want to show promotional content on their web pages, and make sure that the most suitable advertisement will be displayed to the visitors of a site at all times. Recent research showed that Google's Doubleclick ad exchange service is the largest on the Internet, being present on 80% of the websites that provide advertisements [62]. A publisher looking to generate some revenue can easily subscribe with one of these companies, dedicate a part of her web pages to advertisements, and start serving promotional content to that page's visitors. Publishers are paid either by impression, meaning that they get a sum of money every time a visitor views an advertisement on their site, or by click, meaning that they get paid every time a user shows interest in the advertisement and clicks on it, visiting the advertiser's website.

To find a suitable advertisement to display to a certain visitor on a given web page, ad exchanges undergo a complex auction process [174]. This process is influenced by many factors, such as having advertisers bid to get their advertisement displayed, analysis of the web page content to ensure that the displayed advertisement is related to the page's content, and fingerprinting of the visitor to infer what type of promotional content she might be interested in. Auctions do not involve a single ad network at a time, but can involve multiple ones. In fact, multiple ad exchanges are federated together, and an ad exchange can participate in an auction generated on a competitor's exchange and end up serving an advertisement of its own there. This happens fairly commonly, especially when the original ad exchange does not have a "good" advertisement to display on a specific page.

Because of their pervasiveness, online advertisements are not only used by legitimate parties, but also by miscreants. A common scam linked to online advertisements is click fraud [199]. In this scheme, cybercriminals first set up web pages and become publishers. Then, they instruct a botnet, acting under the cybercriminal's control, to visit the web pages and click on the advertisements displayed on them [34]. By doing this, the cybercriminal gets paid by the ad exchange and generates revenue. Click fraud is a big concern for ad exchanges, and a wealth of research has been conducted to detect and block suspicious clicks on online advertisements [43, 44, 174]. Besides click fraud, online advertisements provide a convenient platform for infecting web users with malware. Attackers can set up malicious advertisements that attempt to automatically exploit the user's browser and install malware with a drive-by download attack [151], or they can display an advertisement that lures the victim into installing malware through social engineering, making the advertisement look appealing to the user [172]. Leveraging advertisements to spread malware has many advantages for attackers.


Since advertisements are displayed on very popular websites, miscreants have a chance of infecting a large number of victims in a short amount of time. Without the use of advertisements, the only way an attacker could reach a similar goal is by compromising the home page of a popular site, which is a challenging task due to its security mechanisms. In addition, publishers usually trust the advertisement network (ad network) that they do business with, and are unaware that such networks could actually end up serving malicious content.

Previous research showed that malicious advertisements are fairly common in the wild [110, 121, 136, 177]. Similarly, recent news showed the feasibility of having malicious advertisements go undetected by major ad exchanges and be served to users [57]. However, no comprehensive research has been conducted on understanding the ecosystem surrounding malicious advertisements. The prevalence of malicious advertisements on the web, the number of ad networks that serve these malvertisements, and the quality of the defense systems deployed by ad exchanges are still open questions.

In this chapter, we study the ecosystem of malicious advertisements. We crawled more than 600,000 real-world advertisements, and checked them against multiple detection systems to assess their maliciousness. We show that certain ad exchanges serve more malicious content than others, probably because they have insufficient detection systems. We also show that, because of the arbitration process, it is common for ad exchanges to serve a malicious advertisement provided by another ad exchange.

In summary, we make the following main contributions:

• We collect a corpus of more than 600,000 real-world advertisements from various web pages and describe the misbehaving advertisements that we discovered.

• We analyze different ad exchanges and show that some of them are more prone to serving malicious advertisements than others.

• We demonstrate that due to the arbitration process, every website that serves advertisements and that does not have an exclusive agreement with the advertiser is a potential publisher of malicious advertisements.

• We show that the vast majority of publishers tend to trust their advertisers not to serve malicious advertisements and thus they do not apply any additional filters to protect their users.


6.2 Malicious Advertising

Malicious advertising, known as malvertising, is the cybercriminals' practice of injecting malicious or malware-laden advertisements into legitimate online advertising networks and syndicated content. It can occur through deceptive advertisers or agencies running advertisements, or through compromises of the ad-supply chain, including ad networks, ad exchanges, and ad servers. Different types of malicious advertisements exhibit different behaviors, and in the following sections we briefly describe them.

6.2.1 Drive-by Downloads

A drive-by download advertisement is an advertisement that hosts one or more exploits that target specific vulnerabilities in web browsers. More precisely, attackers target vulnerabilities in web browsers or in browser plugins, such as Flash or Java, that enable users to experience rich media content within the browser environment. In some cases, the browser vendor pre-installs these plugins. The user may not even use the vulnerable plugin or be aware that it is installed. Users with vulnerable computers can be transparently infected with malware by visiting a website that serves a drive-by download, even without interacting with the malicious part of the web page.

6.2.2 Deceptive Downloads

Deceptive downloads try to lure their victims into downloading and installing a specific software component that is malicious. The main difference from drive-by downloads is that attackers do not try to find a vulnerability in the victim's browser or browser plugins to download and install a piece of malware, but instead they try to trick the users into performing that procedure voluntarily. This happens by having the user believe that there is some desirable content on the visited web page. More specifically, the victims are informed that, in order to gain access to specific content of the page, they need to install a particular software component or to upgrade their supposedly outdated plugins. Of course, the updating/installing procedure installs malware on the user's host instead of the advertised software.

6.2.3 Link Hijacking

Link hijacking allows an advertisement to automatically redirect users to websites that they have not decided to visit. The advertisements are included in iframes, and the advertising scripts cannot access the Document Object Model (DOM) of the publisher's web page due to the Same-Origin Policy (SOP) restrictions [10]. However, a malicious script contained in an advertisement can redirect the entire page to

a preselected destination by setting the Browser Object Model's (BOM) top.location variable [136]. This way, the victim is redirected to an arbitrary location and not to the one she initially selected.

6.3 Methodology

In this section, we present the methodology we used to generate and evaluate a large corpus of advertisements. Our process includes two steps. First, we extract the advertisements from a variety of websites. Second, we use an oracle to classify the advertisements as malicious or legitimate. We describe both steps in detail.

6.3.1 Data Collection

In the first phase, we performed a large web crawl to create a corpus of advertisements. For this purpose, we used two different data feeds. First, we leveraged a data feed obtained from an anti-virus company (already used in the work presented by Stringhini et al. [177]). This feed contains web pages that in the past appeared to have a malicious behavior, and was generated by users who installed a browser security product to voluntarily share their data with the company for research purposes. For the second feed, we used Alexa's one million top-ranked websites list. To have a certain degree of diversity in our data, we selected the top and the bottom 10,000 websites, the top and the bottom 1,000 websites from selected top-level domains, and also 20,000 randomly selected websites from Alexa's ranked websites.

Due to the fact that the content of the advertisements is dynamically generated, we periodically crawled each web page in an attempt to obtain different advertisements. More specifically, our crawler visited each website once per day, and refreshed a web page five times. Our crawler was based on Selenium, which is a software-testing framework for web applications that has the ability to programmatically control a real web browser (Mozilla Firefox in our experiments). This approach allowed us to retrieve the whole content of a rendered advertisement, which would not be possible if we used a simple command-line tool like wget. Additionally, we captured all the HTTP traffic during crawling for further investigation.

In most of the cases, the advertisements were included in an iframe. An iframe is an HTML document embedded inside another HTML document. This allows the iframe to be rendered in a consistent way even if it is included by different websites. We leveraged this fact and we created HTML documents based on the contents of the iframes. These iframes included both statically- and dynamically-generated HTML elements. It is important to note that not all the iframes included in a web page contain advertisements. Thus, to distinguish the advertisement-related iframes, we utilized EasyList. EasyList includes domains and URL patterns for ad-related hosts, and is used by the browser plugin Adblock Plus [3] to block advertisements.
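A minimal sketch of this collection step (the ad-pattern list stands in for the EasyList rules and is purely illustrative; the real crawler additionally refreshed each page and recorded the full HTTP traffic):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Stand-in for the EasyList rules used to tell ad iframes apart (illustrative)
AD_PATTERNS = ("doubleclick.net", "/adserver/", "adsystem")

def collect_ad_iframes(url):
    # Render the page in a real browser and return the HTML of iframes
    # whose source matches an ad-related pattern.
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        ads = []
        for frame in driver.find_elements(By.TAG_NAME, "iframe"):
            src = frame.get_attribute("src") or ""
            if any(p in src for p in AD_PATTERNS):
                driver.switch_to.frame(frame)
                ads.append(driver.page_source)   # rendered iframe content
                driver.switch_to.default_content()
        return ads
    finally:
        driver.quit()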

After a period of three months, we had created a corpus of 673,596 unique advertisements. We then analyzed this dataset in search of misbehaving advertisements.

6.3.2 Oracle

To classify whether an advertisement exhibits malicious behavior, we utilized an oracle. The oracle constitutes an essential part of our study. It takes as input the initial request for advertisements from a publisher's website, and by monitoring several behavioral features, it can effectively determine the maliciousness of the advertisement. Our oracle is built on three main components: Wepawet [36], malware and phishing blacklists, and VirusTotal. We describe the contribution of each component to the classification process in the next paragraphs.

Wepawet

As advertisements are included in pages with dynamic content that often changes over time, they are also dynamically generated. The dynamic nature of advertisements is achieved with the use of JavaScript or Flash. Miscreants can unleash their nefarious activities (drive-by download attacks, phishing attempts, etc.) on victims through advertisements. Therefore, we need to analyze the advertisements' embedded code, which is often dynamically loaded, to detect whether it exhibits any kind of malicious behavior. To do so, we utilized Wepawet [36], a honeyclient that uses an emulated browser to capture the execution of JavaScript code on web pages. Wepawet uses anomaly-based techniques to identify signs of maliciousness that are typical of a drive-by download attack. We submitted the iframes that contained advertisements to Wepawet, which executed all the JavaScript code and captured all the network traffic. Finally, with the use of specific heuristics (such as the download of malicious executables) and machine learning models, Wepawet classified the advertisements as malicious or not.

Malware and Phishing Blacklists

Blacklists are one of the most widespread techniques to protect users against malicious websites. In a domain-based blacklist, a domain is added to the list as soon as it is discovered to host malicious content. Additionally, the domain is classified based on its behavior, such as malware distribution, phishing attempts, stealing users' credentials, and others. In this study, we used a tracking system that aggregates 49 anti-virus, spam, and phishing blacklists [105]. We utilized these blacklists by checking all the domains we observed serving advertisements against them. Note that it is fairly common for blacklists to produce false positives. In our study, we wanted to minimize the risk of falsely classifying an advertisement. To do so, we used an empirically calculated threshold. More precisely, to increase the accuracy of our results, we considered domains as malicious only if they were contained in more than five different blacklists at the same time. However, we observed that blacklists continuously add and remove data. Hence, a domain which is currently blacklisted could be whitelisted in the future. Thus, we are aware that this component could increase the false positives of our study.
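A minimal sketch of this consensus rule is shown below; the data layout (a mapping from blacklist name to the set of domains it lists) and the toy feeds are assumptions made purely for illustration.

```python
# Blacklist consensus sketch: a domain counts as malicious only if enough
# independent blacklists agree on it.
BLACKLIST_THRESHOLD = 5  # a domain must appear on more than five lists

def is_blacklisted(domain: str, blacklists: dict[str, set[str]]) -> bool:
    """Flag a domain only if enough independent blacklists agree on it."""
    hits = sum(1 for listed in blacklists.values() if domain in listed)
    return hits > BLACKLIST_THRESHOLD

# Example usage with toy data:
feeds = {f"feed_{i}": {"evil.example"} for i in range(6)}
print(is_blacklisted("evil.example", feeds))    # True  (6 > 5)
print(is_blacklisted("benign.example", feeds))  # False
```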

VirusTotal

Among the malicious advertisements, there exist some that try to lure users into installing software on their machines. They disguise the software as a media player or an up-to-date browser plugin required to display specific content. Most of the time, this software contains malware that tries to infect the user. Nevertheless, there are some cases in which a benign plugin is indeed required by the browser to display the content. For instance, a browser cannot display Flash content if the Flash plugin is absent. Hence, we need a way to decide whether the downloaded software is benign or malicious. Anti-virus products are the best solution for this classification. However, not all vendors can recognize the same malware. Additionally, having access to multiple anti-virus products is a time- and resource-consuming process. Fortunately, VirusTotal can solve this problem. VirusTotal is an online service that analyzes files using 51 different anti-virus products and scan engines to check for malware [191]. One can submit samples to VirusTotal and get a report with the classification of these samples by different anti-virus companies. We consider VirusTotal a key component of our oracle. Whenever an advertisement tried to force a user to download software, we forwarded this software to VirusTotal and retrieved its classification. This way, we can accurately decide whether the downloaded software is benign or malicious.
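The following Python sketch outlines such a submission. It assumes the public v2 REST API (the file/scan and file/report endpoints), a valid API key, and a simple single-detection threshold; endpoint names and response fields should therefore be checked against the current VirusTotal documentation before use.

```python
# Hedged sketch of submitting a downloaded sample to VirusTotal (v2 API assumed).
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
SCAN_URL = "https://www.virustotal.com/vtapi/v2/file/scan"
REPORT_URL = "https://www.virustotal.com/vtapi/v2/file/report"

def classify_sample(path: str, min_positives: int = 1) -> bool:
    """Return True if at least `min_positives` engines flag the file."""
    with open(path, "rb") as sample:
        scan = requests.post(SCAN_URL, params={"apikey": API_KEY},
                             files={"file": sample}).json()
    # Poll for the report; the analysis usually needs some time to finish.
    while True:
        report = requests.get(REPORT_URL, params={
            "apikey": API_KEY, "resource": scan["resource"]}).json()
        if report.get("response_code") == 1:  # report is ready
            return report.get("positives", 0) >= min_positives
        time.sleep(30)

# Example (commented out, requires a real API key and sample):
# classify_sample("downloaded_plugin.exe")
```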

6.4 Analyzing Malvertisements

In this section, we analyze the malicious advertisements that we discovered. In particular, we study various aspects of malvertising and try to understand what types of websites are more prone to malvertisements. Furthermore, we investigate whether a website is more secure when it selects a trusted ad network to serve the advertisements. Finally, we examine whether publishers take users' security into consideration and act to protect their visitors from being infected.

Table 6.1: Classification of malvertisements.

Type of maliciousness      #Incidents
Blacklists                      5,694
Suspicious redirections         1,396
Heuristics                        309
Malicious executables              68
Malicious Flash                    31
Model detection                     3
Unique advertisements           6,052

6.4.1 Type of Maliciousness

To investigate to what extent cybercriminals utilize advertisements to promote their nefarious activities, we analyzed the collected advertisements. For this purpose, we used the following procedure: Initially, we retrieved all the analysis reports from Wepawet. Then, we examined the reports looking for specific heuristics, such as redirects to NX domains or to benign websites like Google and Bing, which suggest the utilization of cloaking techniques. Additionally, we looked for behaviors (models) that are similar to previously known malicious behaviors. Next, all the executable and Flash files were validated against VirusTotal. Finally, we used the previously mentioned blacklists to check whether the content of the advertisement was served by a blacklisted domain. Table 6.1 shows the results for all the misbehaving advertisements that we detected. In total, we identified 6,052 advertisements which triggered our detection framework. Overall, we observed that about 1% of all the collected advertisements show malicious behavior. Our results are close to previous research, which shows that over 1% of the 90,000 Alexa top-ranked websites lead to malvertising [110].
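The sketch below illustrates how these individual signals can be folded into the per-advertisement verdict counted in Table 6.1. The Verdict dataclass and its field names are our own simplification of the actual analysis reports.

```python
# Illustrative aggregation of the oracle's signals into a single verdict.
from dataclasses import dataclass

@dataclass
class Verdict:
    blacklisted: bool = False           # served from a blacklisted domain
    suspicious_redirect: bool = False   # e.g., redirects to NX or cloaking domains
    heuristic_hit: bool = False         # Wepawet heuristics triggered
    malicious_executable: bool = False  # VirusTotal flagged a dropped executable
    malicious_flash: bool = False       # VirusTotal flagged an embedded Flash file
    model_match: bool = False           # similarity to a known malicious model

    def is_malicious(self) -> bool:
        return any((self.blacklisted, self.suspicious_redirect,
                    self.heuristic_hit, self.malicious_executable,
                    self.malicious_flash, self.model_match))

print(Verdict(suspicious_redirect=True).is_malicious())  # True
```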

6.4.2 Identifying Risky Advertisers

In the next experiment, we wanted to investigate whether malicious advertisers show a preference for specific ad networks. In other words, we wanted to measure whether some ad networks are more prone to serving malicious advertisements than others. As we already mentioned, each ad network applies its own policy regarding the acceptance of an advertisement. For instance, some of the biggest ad networks do not allow the promotion of websites infected with malware, while others, usually smaller in size, are more tolerant. Figure 6.1 illustrates the proportion of malvertisements among the total advertisements served by each ad network. The ad networks are sorted by the ratio of malicious to legitimate advertisements served. As we observe, there exist ad networks that are preferred by cybercriminals and therefore show more malicious ads. Note that in this figure we only display the ad networks that served more than a certain number of malicious advertisements and omit all those that were able to successfully filter them.

Figure 6.1: Malvertising distribution from selected ad networks.

Although the existence of ad networks that serve malvertisements constitutes a threat to web users, the size of this threat can only be quantified if we measure the share these ad networks have in the total served advertisements. Figure 6.2 shows that most of these ad networks serve only a small portion of the total advertisements, and thus send only a small number of malicious advertisements. This verifies our initial statement that the bigger ad networks tend to filter the advertisements they serve more accurately than smaller networks.

Figure 6.2: Distribution of advertisements from selected ad networks.
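For reference, the statistic behind Figure 6.1 can be computed as in the short sketch below; the input format, a list of (ad network, is_malicious) observations, is an assumption made for the example.

```python
# Per-network malvertising ratio: malicious ads divided by all ads served.
from collections import Counter

def malvertising_ratios(observations: list[tuple[str, bool]]) -> dict[str, float]:
    served = Counter(net for net, _ in observations)
    malicious = Counter(net for net, bad in observations if bad)
    return {net: malicious[net] / served[net] for net in served}

sample = [("adnet-a", False), ("adnet-a", True), ("adnet-b", False)]
print(malvertising_ratios(sample))  # {'adnet-a': 0.5, 'adnet-b': 0.0}
```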

Figure 6.3: Categorization of the websites that served malvertisements.

Next, we created three major clusters of websites. The first cluster contained the top 10,000 websites from Alexa's one million top-ranked websites, the second cluster the bottom 10,000, and the third more than 23,000 websites that existed in our advertisement dataset and did not belong to the previous clusters. We wanted to measure from which websites we observe the majority of the malvertisements. We discovered that the first cluster served 88.3% of all malvertisements, while the second served 6.3%, and the third 5.4%. One might expect that the more famous a website is, the better the techniques applied to protect its visitors. However, a recent incident at Yahoo! supports our finding [57]. In detail, when users visited Yahoo!'s website between 31 December 2013 and 4 January 2014, they were served with malvertisements. Given roughly 300,000 hourly visits to the malicious advertisements and a typical infection rate of 9%, this incident likely resulted in around 27,000 infections every hour.

In order to discover whether the top websites receive more malvertisements because they display more advertisements on their web pages compared to the bottom websites, or whether they are simply preferred by cybercriminals, we measured the number of total advertisements (both benign and malicious) the previous clusters displayed. The results revealed that the first cluster served 76.6% of the total advertisements, the second 11.6%, and the third 11.8%. These results are close to the previously mentioned malvertising results. Consequently, we believe that miscreants are not interested in which website delivers their malicious code; they are actually concerned about the total number of infections they will gain from malvertising.

Figure 6.4: Malvertisement distribution based on top-level domains.

To understand the type of websites that malvertisements usually target, we clustered all the websites on which we spotted malvertisements into major categories. Figure 6.3 shows this categorization. Websites that contain entertainment and news content constitute almost one third of the total websites targeted by malvertisements. Interestingly, we observed that websites containing adult material account for less than 1% of the malvertising campaigns. This fact conflicts with previous studies, which showed that adult content is tied to increased maliciousness [203]. We assume that, over the years, these websites have developed a strong economic model and do not want malvertising to tarnish their reputation, so they may apply stricter rules to the content of the advertisements they deliver; alternatively, it may simply not be profitable enough for cybercriminals to target these websites.

Finally, we wanted to determine the share of top-level domains that serve malvertisements. Figure 6.4 shows that the .com domains constitute the majority of the websites serving malicious advertisements. Additionally, we noticed that the generic top-level domains (mainly .com and .net) account for more than 70% of the malvertising traffic. Given the fact that most of the .com domains have an American-driven orientation, we believe that malvertisements are primarily designed to target United States citizens. However, in order not to come to a false conclusion, we measured the distribution of the top-level domains in our dataset. Our analysis revealed that the generic top-level domains constitute only 50% of our dataset. Consequently, there is a clear increase of roughly 20 percentage points when it comes to malvertising.

Figure 6.5: Ad networks involved in ad arbitration for malicious and benign advertisements.

6.4.3 Ad Arbitration

Website administrators might assume that by using only advertisements from major networks, which are considered trustworthy, they can protect their visitors from potential malvertisements. Unfortunately, this is not the case. There is a practice called ad arbitration, which is widely used by ad networks to increase their revenue. During the ad arbitration process, ad networks buy impressions from publishers as if they were advertisers, and then start a new auction for these ad slots as if they were publishers [174]. Hence, even if an administrator delegates a portion of her website to a specific ad network, she cannot be sure that the advertisements will only be provided by that particular ad exchange.

Overall, we noticed a similar behavior in the ad arbitration chains of benign and malicious advertisements. As we see in Figure 6.5, in some cases, both benign and malicious advertisements were served directly by the initial ad network. However, we observed that as the arbitration chain gets longer, it becomes more likely that the served advertisement will be malicious. Even though the ad slots that participate in more than 15 auctions constitute only 2% of the total malvertisements and less than 0.5% of the benign advertisements, we further investigated this phenomenon. Our analysis revealed another interesting aspect of ad arbitration. In the initial phases of the auction process, the participants are both popular ad networks and ad networks that we found to be involved in malvertising. However, as the auction process gets longer, the last auctions typically happen only among those ad networks that we found to serve malvertisements. An explanation for this could be that smaller and less reputable ad exchanges come into play only when the larger ones failed to obtain an ad slot for a particular arbitration. Interestingly, we observed ad networks repeatedly participating in the auctions for the same ad slot. Specifically, we noticed that the same ad networks buy and sell the same slot multiple times. Another interesting observation concerns the distribution of the ad arbitration chain lengths. For benign advertisements, the frequency decreases steadily with the chain length, whereas for malvertisements it follows a slightly different model: in absolute numbers, the chain lengths follow the same decreasing trend, but we observe an increase in the frequency of chains of medium length.
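A hedged sketch of how the arbitration chain length shown in Figure 6.5 can be approximated from the captured HTTP traffic is given below: it counts how many distinct ad-related domains appear in the redirect sequence that delivered an advertisement. The helper is_ad_domain() stands in for the EasyList matching step and the example domains are placeholders.

```python
# Approximate the arbitration chain length from a captured redirect sequence.
from urllib.parse import urlparse

def is_ad_domain(domain: str) -> bool:
    # Placeholder: in the actual pipeline this check is backed by EasyList rules.
    return domain.endswith(("adexchange.example", "ads.example"))

def arbitration_chain_length(redirect_urls: list[str]) -> int:
    """Number of distinct ad networks seen along one delivery chain."""
    networks = []
    for url in redirect_urls:
        domain = urlparse(url).netloc
        if is_ad_domain(domain) and domain not in networks:
            networks.append(domain)
    return len(networks)

chain = ["http://pub.example/page",
         "http://a.adexchange.example/auction",
         "http://b.ads.example/auction",
         "http://b.ads.example/win"]
print(arbitration_chain_length(chain))  # 2
```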

6.4.4 Secure Environment

Publishers tend to trust that the ad networks will provide them with benign advertisements. Hence, they do not secure the environment in which the advertisements are displayed. Nikiforakis et al. [136] described the problem of link hijacking, in which advertisements contained in iframes redirect the entire web page to an arbitrary destination. This is a serious attack, given the fact that most users open multiple tabs in their browsers for later reading. Hence, users can be redirected to phishing websites without even noticing it. This problem can be solved in modern browsers with the utilization of the sandbox attribute of iframes in HTML5. Unfortunately, none of the websites that we crawled utilized this attribute to protect their users.
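A minimal sketch of the check behind this observation is shown below: it parses a publisher page and reports iframes that lack the HTML5 sandbox attribute. The parser and the simplification of treating every iframe as an ad slot are assumptions made for the example.

```python
# Report iframes without the HTML5 sandbox attribute in a publisher page.
from html.parser import HTMLParser

class SandboxAudit(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.unprotected: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "iframe":
            return
        attrs = dict(attrs)
        if "sandbox" not in attrs:
            self.unprotected.append(attrs.get("src", "<inline iframe>"))

page = '<html><body><iframe src="http://ads.example/slot1"></iframe></body></html>'
audit = SandboxAudit()
audit.feed(page)
print(audit.unprotected)  # ['http://ads.example/slot1']
```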

6.5 Discussion

We have shown that malvertising poses a problem to the security of Internet users. In this section, we therefore discuss proactive and reactive countermeasures against malicious advertisements.

6.5.1 Ad Networks Filtering

Ad networks are the primary means used by miscreants to deliver their malicious advertisements. Many ad networks have detection mechanisms that successfully filter malvertisements. Yet, there exist others that have poor filtering processes, which are unable to completely eliminate this threat. We believe that collaboration among the ad networks can bring better results in defending against malvertisements compared to individual actions. For instance, a common blacklist to which all detected malicious advertisements are submitted can prevent attackers from submitting their malvertisements to a different network after they get rejected by a former one. Another, more drastic, solution would be to penalize ad networks that fail to detect the malicious code embedded in advertisements. For instance, forbidding them from participating in ad arbitrations for a certain amount of time, or applying similar penalties when an ad network is found delivering malvertisements, could push ad networks to invest in better detection algorithms.

6.5.2 Last Line of Defense

In case a malicious advertisement successfully bypasses the filtering mechanisms deployed by ad networks, there should exist reactive countermeasures to protect users from being infected. Li et al. [110] proposed a browser-based protection mechanism, which utilizes the knowledge of malicious ad paths and their topological features to raise an alarm when a user's browser starts visiting a suspicious ad path, protecting the user from reaching an exploit server. Scarecrow [213], on the other hand, triggers false alarms in a user's browser, causing malicious code that wants to remain hidden from detection systems not to execute. Finally, the safest way for users to protect themselves against malvertisements is to utilize solutions like Adblock Plus [3] to prevent advertisements from being delivered to their browsers. Although these solutions appear to be the most secure way for users to protect themselves, and a significant portion of the web population is already using them, a universal adoption of these approaches could cause a domino effect in the Internet's economy.

6.6 Related Work

Detecting malvertisements falls under the category of detecting drive-by downloads. Stringhini et al. [177] and Mekky et al. [121] used the properties of HTTP redirections to identify malicious behavior. Provos et al. [151] introduced Google Safe Browsing with the use of high-interaction honeypots. Ford et al. [55] focused on malicious Flash-based advertisements by using dynamic and static analysis techniques. A more ad-specific approach was followed with MadTracer, a tool that inspects the advertisement delivery process and detects malicious activities with machine learning. Instead of detecting malicious advertisements, AdJail [183] focuses on content restriction policies against third-party advertisements. ADSandbox [45] infers maliciousness by executing the suspected JavaScript in an isolated environment and observing the performed actions. AdSentry [47] works in a similar way, by executing advertisements in a sandboxed JavaScript engine with control over the interactions with the user's visited page.

Regarding malvertising detection techniques, previous works focused on various aspects of detecting click fraud. Majumdar et al. proposed a content delivery system to verify broker honesty under standard security assumptions [118]. Metwally et al. [123] and Zhang et al. [218], on the other hand, proposed algorithms to efficiently detect duplicate clicks. Additionally, Daswani and Stoppelman [42] investigated the ways in which malware can exploit ad networks. Immorlica et al. [83] studied fraudulent clicks and presented a click-fraud resistant method for learning the click-through rate of advertisements. Finally, Kintana et al. [98] created a system designed to penetrate click-fraud filters in order to discover detection vulnerabilities. The study of ad network operations is relatively recent in the literature. Guha et al. [74] explored different classes of advertising, such as search, contextual, and social network advertising. Vallina-Rodriguez et al. [187] studied the mobile advertisement ecosystem and how mobile ads introduce energy and network overhead. The financial aspect of advertising was also studied in the work of Gill et al. [62]. We differ from these studies as we focus on malvertisements and how they reach end users.

6.7 Summary

The Internet offers unlimited ways for attackers to find new victims and infect their computers. One of these ways is online advertising. In this chapter, we performed the first large-scale study of ad networks that serve malicious advertisements. We studied various aspects of the advertising ecosystem and observed how malicious advertisements differ from benign ones. In addition, we found that none of the websites that serve advertisements take advantage of new HTML5 features to protect their visitors. We concluded that, despite any server-side efforts employed by the ad networks, malicious advertisements still reach the end users.

7 Fake Client Honeypots

“All warfare is based on deception. Hence, when we are able to attack, we must seem unable; when using our forces, we must appear inactive; when we are near, we must make the enemy believe we are far away; when far away, we must make him believe we are near.”

Sun Tzu

The great popularity of the Internet increases the concern for the safety of its users, as many malicious web pages pop up on a daily basis. Client honeypots are tools that can detect malicious web pages which aim to infect their visitors. These tools are widely used by researchers and anti-virus companies in their attempt to protect Internet users from online threats. Unfortunately, cybercriminals are becoming aware of this type of detection and create evasion techniques that allow them to behave in a benign way when they feel threatened. This bi-faceted behavior enables them to operate for a longer period, which translates into more profit. Hence, these deceptive web pages pose a significant challenge to existing client honeypot approaches, which are incapable of detecting them when they exhibit the aforementioned behavior. We mitigate this problem by designing and developing a framework that benefits from this bi-faceted behavior. More precisely, we leverage the evasion techniques used by cybercriminals and implement a prototype which triggers false alarms in the case of deceptive web pages. The outcomes of our evaluation reveal that when our prototype was deployed, malicious websites with bi-faceted behavior were prevented from launching their attacks.

7.1 Introduction

As we have already mentioned, over the last years the Internet has become extremely popular. In this Internet-connected society, users spend most of their online time using browsers to access the Internet. This tendency makes the browser the most indispensable software product of our days. Unfortunately, this enormous growth of the Internet's popularity has drawn the attention of miscreants that try to illegally monetize their malevolent activities. For this purpose, cybercriminals create fraudulent websites that lure Internet users and force them to install malicious software (malware) on their computers, most of the time without the users' prior knowledge or consent. To do so, these websites usually exploit vulnerabilities in browsers or in browser plugins, and manipulate them to download and install malware [70, 151, 152].

As a primary line of defense against this emerging threat, researchers and security analysts utilize client honeypots to discover, study, and obliterate malicious websites. Client honeypots, in contrast to server honeypots (decoy systems set up to attract and trap attackers that attempt to penetrate them), crawl the Internet, interact with servers, and classify websites based on their malicious behavior. More precisely, client honeypots, which are usually instrumented virtual machines, visit a web page and monitor all the changes in the virtual machine's file system, configuration settings, and running processes. If they notice an unexpected behavior, the web page is flagged as malicious [130, 131, 197]. The secure environment in which client honeypots operate allows researchers to discover attacks and exploits, release security patches, and create browser alerts that are triggered when a user tries to access a malicious website. The findings are usually published on blacklists, which in turn are used for further study or for the development of more secure web products. For instance, modern browsers utilize mechanisms such as Google's Safe Browsing [68] to protect their users against known web threats and offer them a more secure web surfing experience.

Client honeypots pose a significant challenge for the smooth operation of fraudulent websites. Unfortunately, cybercriminals, who watched the incoming traffic of their websites shrink, developed techniques to avoid detection. More precisely, many malicious web pages have established a bi-faceted behavior. By adopting a series of inspections, deceptive websites can accurately determine whether a client is a normal browser or a client honeypot. To achieve this, they use the following simple, yet effective technique: they probe a client to return a response. This response is valuable for the attackers as they can gain information about the true origin of the client. Thus, to avoid detection, they do not mount their attack if they are dealing with a client honeypot; instead, they appear to have a completely benign behavior. This strategy allows them to remain hidden and operate for a longer period of time.

Several of the evasion techniques used by malicious web pages have been previously studied in an effort to better understand the limitations of existing detection systems [94]. In general, cybercriminals employ a number of attacks that leverage weaknesses in the design of client honeypots. These attacks are divided into two main categories: (i) identification of the monitoring environment, and (ii) evasion of its detection. In order to remain undetected, malicious web pages must successfully implement attacks from both categories. While the attacks from the first category allow miscreants to detect the presence of a monitoring system such as a client honeypot, the attacks from the second category transform the malicious behavior of the web pages to appear completely benign. Overall, websites that utilize the aforementioned techniques can trick existing detection systems into classifying them as benign and therefore allow cybercriminals to infect a larger number of potential victims.

In this chapter, we utilize the very same bi-faceted behavior of malicious websites to enhance users' security. We designed and implemented a framework, called Scarecrow, which benefits from the precautions taken by deceptive websites to hide their malicious activities from client honeypots. In particular, Scarecrow cloaks a regular web browser to appear as a monitoring system. More precisely, when a user visits a website, the framework employs all the necessary mechanisms to disguise the user's actions as if they were generated by a client honeypot. Thereby, all interactions between a user's browser and a website take place under this security umbrella. Unfortunately, this may occasionally interfere with the normal browsing experience. Nevertheless, users can easily deactivate the framework on demand, or configure it to meet their needs. In summary, Scarecrow allows users to surf the Internet protected from attacks that exploit their browsers and install malware on their computers.

We implemented our prototype as an extension for Firefox. Browser extensions offer a series of advantages, as they require minimum effort from users and are already widespread; most Internet users have one or more extensions installed in their browsers. With that in mind, we implemented Scarecrow as a framework with a clear attack prevention focus, which is designed to be easily integrated with Firefox and can thus be used by inexperienced users. This contrasts with common client honeypot approaches, used by researchers and security experts, which have a strict attack detection mission. Finally, we evaluated our implementation against malicious URLs provided by a large anti-virus company and showed that our framework can protect users against malicious websites that display a bi-faceted behavior. We should note that our framework does not replace traditional anti-virus products and techniques, as there is a wide variety of web pages that exhibit exactly the same behavior towards both normal browsers and client honeypots. Nevertheless, Scarecrow constitutes a first line of defense against infections from intelligent malicious web pages that try to remain undercover.

In summary, we make the following main contributions:

• We transform a traditional attack detection approach, such as client honeypots, into a system with a clear attack prevention focus.

• We propose Scarecrow, a novel framework which enhances users' security when surfing the web. This framework allows users to surf the web protected from attacks performed by deceptive web pages.

• We implement a prototype of our approach as a Firefox extension. Our prototype creates events that trigger the client honeypot inspection mechanisms used by intelligent adversaries.

• We evaluate our implementation, and our preliminary results demonstrate that Scarecrow can successfully protect users against malicious websites that display a bi-faceted behavior.

7.2 Client Honeypots

Client honeypots, also known as honeyclients, constitute a security technology capable of discovering malicious servers on the Internet. More specifically, client honeypots pose as normal web clients and interact with servers to investigate whether an attack has occurred. They are divided into: (i) low-interaction, (ii) high-interaction, and (iii) hybrid client honeypots. In the rest of this section, we briefly describe these categories.

7.2.1 Low-Interaction Client Honeypots

Low-interaction client honeypots use simulated clients, similar to web crawlers, whose purpose is to interact with a server. Their task is to analyze the server's responses and determine its nature. By deploying static analysis techniques, such as signatures, they can search for malicious patterns and assess whether an attack has occurred. Increased speed and low resource consumption are their major advantages. However, since they are usually signature-based approaches, they are unable to detect previously unseen attacks such as zero-day threats. Additionally, their simplicity makes them easily distinguishable by advanced exploits.

7.2.2 High-Interaction Client Honeypots

High-interaction client honeypots are fully functional systems comparable to real systems. They use a full-featured web browser to visit potentially malicious web pages and classify their behavior. To this end, they monitor the environment in which the browser operates to inspect any modification of the system's state after visiting a web server. The detection of any change in the monitored environment indicates the occurrence of an attack, and the corresponding web page is flagged by the system as malicious. This type of client honeypot is very effective at detecting novel attacks against web browsers. Nevertheless, the tradeoff for this accuracy is the high complexity of running a high-interaction honeypot and the time-consuming monitoring process. Additionally, since the client honeypots run inside virtual machines, the malicious web pages may try to detect the presence of the virtual environment and refrain from launching the attack. Consequently, if the honeypot does not observe any detectable state change in the monitored environment, it is likely to incorrectly classify the server.

7.2.3 Hybrid Client Honeypots

Low-interaction and high-interaction client honeypots try to detect malicious web pages from different perspectives. Hence, the combination of both approaches leads to a detection system that integrates high speed with the ability to identify new threats. This detection system is called a hybrid client honeypot. Hybrid approaches incorporate the classification methods used by low-interaction and high-interaction client honeypots into a hybrid system, which is capable of identifying malicious web pages in a cost-effective way and on a large scale. For that reason, the hybrid client honeypot approach outperforms a high-interaction client honeypot with identical resources and an identical false positive rate.

7.3 System Overview

In this section, we discuss the architecture of our proposed system. We initially explain the threat model we use throughout this chapter and then provide information about the design of Scarecrow.

7.3.1 Threat Model

We assume that a user surfs the Internet with a vulnerable web browser and visits a malicious web page. Even if the browser itself is secure, there exists a variety of extensions installed in the browser that might have subtle vulnerabilities which can be exploited. In fact, hundreds of such extension vulnerabilities have been studied in prior works [11, 27], and tools that automatically highlight these weaknesses have already been implemented [9, 95, 108, 149].

Table 7.1: Popular mechanisms used by malicious web pages to evade detection.

Categories of Malicious Mechanisms    Heuristics
Monitoring Environment Detection      Virtual Machine Detection
                                      Client Honeypot Detection
                                      HTTP Headers Checks
                                      Mouse Events Exploitation
Detection Evasion                     Whitelist Manipulation

Additionally, we assume that the malicious web page is aware of the browser's, or the extensions', vulnerabilities and possesses all the necessary tools to exploit the browser. We believe that the purpose of this page is to generate profit for its operator, and thus the more time it remains undetected, the more profit it creates. Consequently, we consider that the page is capable of displaying only benign content when it identifies the presence of a client honeypot.

7.3.2 Design Details

Since the adversaries' goal is to hide their malicious activities from automated detection systems, they created and maintain mechanisms that inform them when client honeypots visit their web pages. Kapravelos et al. [94] created a list of these mechanisms, which is divided into two major categories: (i) identification of the monitoring environment, and (ii) evasion of its detection. Table 7.1 provides an overview of the techniques used in each category. Scarecrow leverages the heuristics of both categories to misinform miscreants about the true identity of a client. In the following, we provide details about the integration of each mechanism in our prototype implementation.

Virtual Machine Detection

Since many successful drive-by download attacks install malware that interferes with the victim's operating system, many client honeypots utilize virtual machines to protect the actual host from being infected. As a matter of fact, after a complete scan of a web page under investigation has been performed, the virtual machine restores the operating system to a safe state. Unfortunately, in order to detect a monitoring environment, attackers utilize mechanisms which can reveal the presence of a virtual machine.

In our system, we turn this knowledge into a protection heuristic. More precisely, some virtual machines have the tendency to reveal their presence by inserting elements into the guest operating system, which eventually can be traced. For instance, these elements can be service processes, unique files or directories, or even specific registry keys. Scarecrow is able to generate all the required files in the operating system as well as dummy executable files, which it then executes. This way, it creates processes that cloak the operating system to appear as if it is running inside a virtualized environment.
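To illustrate the idea, the following Python sketch plants a few decoy artifacts that fingerprinting scripts commonly associate with VM guest tools and spawns a long-running dummy process. The file names and paths are illustrative assumptions, and the registry-key part of the heuristic is omitted; this is not the actual Scarecrow implementation, which lives inside the browser extension.

```python
# Hedged sketch of the VM-camouflage heuristic: create decoy files that look
# like VM guest-tool binaries and start a placeholder long-running process.
import subprocess
import sys
from pathlib import Path

VM_MARKER_FILES = [
    Path(r"C:\Program Files\VMware\VMware Tools\vmtoolsd.exe"),
    Path(r"C:\Program Files\Oracle\VirtualBox Guest Additions\VBoxService.exe"),
]

def plant_vm_artifacts() -> None:
    for marker in VM_MARKER_FILES:
        try:
            marker.parent.mkdir(parents=True, exist_ok=True)
            marker.touch(exist_ok=True)  # an empty decoy file is often enough
        except OSError as err:
            print(f"could not create {marker}: {err}", file=sys.stderr)

def spawn_decoy_process() -> None:
    # Placeholder for the dummy executables Scarecrow launches; here we simply
    # start a long-running child process.
    subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])

if __name__ == "__main__":
    plant_vm_artifacts()
    spawn_decoy_process()
```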

Client Honeypot Detection

In order to detect the presence of a monitoring system, an adversary can also check for signs of the client honeypot itself. These signs can be, for example, executable or DLL files. Sadly, there exist honeypots that do not try to hide their presence, and even worse, attackers are familiar with this practice. In theory, a malicious web page can use the JavaScript engine to load a suspicious file from the client's local file system, which only appears in the case of a client honeypot. This file can be an executable or a library, since the engine does not perform any checks to validate the type of the requested files. If the file exists, the attacker becomes aware of the situation. Fortunately, the Same-Origin Policy [194] prohibits access to local files through JavaScript. Nevertheless, there exist some older browser versions, or some customized browsers used by client honeypots, that allow this technique to be executed. Similarly, Scarecrow has the ability to hook specific JavaScript events that request access to the file system. In detail, each time a web page tries to access a local file, a warning informs the user about the intention of the script. Additionally, Scarecrow creates several client honeypot executable files on every browser startup and deletes them on every browser shutdown. We observed that the mere presence of these files is sometimes sufficient to raise an alert on malicious web servers.

HTTP Headers Checks

Whenever a browser visits a web page, it sends one or multiple HTTP requests to the corresponding web server. Each request contains fields, called headers, which reveal information about the type of request, the connection, the browser, and so on. Among these fields, there are two specific headers that prove to be quite useful for attackers: the User-Agent and the Referer. While the User-Agent can reveal information about the browser itself, the Referer provides the source URI from which the request for the current web page originated. An adversary could utilize this knowledge to determine whether the client is actually a honeypot. Modern honeypots can easily modify the User-Agent value to mimic a normal web browser; however, it is difficult for them to predict the correct origin of the hosted URI, based on which the malicious web page will trigger the attack.

Scarecrow modifies both HTTP headers to make the client appear as a misconfigured client honeypot. For this purpose, since only a small portion of malware targets non-Windows operating systems, we use the User-Agent header to present the browser as a client honeypot that runs inside a Linux-based machine. Additionally, we adjust the Referer header to suggest that the web page is being harvested by an anti-virus vendor. Consequently, as the user is probably visiting the malicious web page by clicking a link on a different website, this information will not be revealed to the attacker. One may argue that this will affect the behavior of benign sites as well. However, the majority of benign websites are mostly interested in gaining more traffic, which translates to more money for their owners from the published advertisements. Therefore, we believe that the operators of these benign websites do not really care about the operating system of their visitors or about how these users originally landed on their websites.
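The sketch below illustrates the kind of header rewrite described above. The concrete User-Agent and Referer strings are examples of what a "misconfigured client honeypot" could send, not the exact values used by Scarecrow, and the plain rewrite function stands in for the extension's interception of outgoing requests.

```python
# Illustrative header camouflage: overwrite identifying request headers with
# values that hint at an automated, Linux-based analysis client.
HONEYPOT_HEADERS = {
    # Pretend to be an automated Linux-based analysis client.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) HoneyClient/1.0",
    # Pretend the page was reached from an anti-virus vendor's harvesting system.
    "Referer": "http://scanner.antivirus-vendor.example/queue",
}

def camouflage(headers: dict[str, str]) -> dict[str, str]:
    """Return a copy of the outgoing headers with identifying fields replaced."""
    rewritten = dict(headers)
    rewritten.update(HONEYPOT_HEADERS)
    return rewritten

original = {"User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Firefox/35.0",
            "Referer": "http://news.example/article"}
print(camouflage(original))
```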

Mouse Events Exploitation

To ensure that they deal with a real client, many malicious web pages wait for prior user input before launching an attack. Since a client honeypot frequently only visits a web page without actively interacting with it, events such as the movement of the mouse will never be fired. Therefore, adversaries can hold back their attacks until one or more such events convince them that the traffic to their web server is generated by an actual user and not by an automated tool such as a web crawler or a client honeypot. For this purpose they utilize JavaScript. In particular, some JavaScript objects have events associated with them. Usually, these events are user actions, or at least initiated by a user, for instance, clicking on an object or simply moving the mouse. Hence, an event listener can be used in JavaScript code to specify actions in response to the occurrence of a specific event.

Scarecrow prevents the event listeners from catching the mouse movements. Consequently, the adversaries get the impression of dealing with a client honeypot instead of a real user. However, we believe that this might affect the user experience, especially in cases where mouse interaction is necessary. One example could be online Flash games that require the user's mouse events to operate. Another example could be benign websites that allow users to access information but do not allow a crawler to automatically collect all of their data, and thus may use mouse events to distinguish between normal users and crawlers.

Whitelist Manipulation

All browsers are able to interact with the operating system. This is an important capability for procedures that store browsing data on the hard disk drive or load additional programs to display web content. Client honeypots use so-called whitelists when analyzing attacks to separate harmless interactions between the browser and the operating system from harmful ones. Adversaries circumvent these whitelists with cache poisoning attacks. Since creating, reading, and writing a file in the browser cache is a whitelisted action, an attacker could leverage it to change the files in the cache in order to force redirections from benign URLs to malicious web pages. This attack is so severe that, even if a victim closes or restarts the browser, a single visit to any web page that loads a modified cached script is sufficient to re-infect the machine.

To protect a client from these redirections, we clear the cache on a regular basis (a sketch of this periodic cleanup is given at the end of this section). The browser's cache is a data space where web pages are stored once they are loaded. If a web page is revisited in the future, its content can be loaded directly from the cache, which is faster than loading the content from a remote server. Each web page is accompanied by the duration for which it can be loaded from the cache before it needs to be downloaded again from the server. An adversary could inject malicious code inside a web page and change its expiration date to the far future, so that it is always loaded directly from the cache. When we clear the cache regularly, we cannot prevent the original injection of malicious code; however, we can prevent a re-infection from happening.

To sum up, we consider the combination of the aforementioned mechanisms sufficient to delude a bi-faceted malicious web page into not launching its attack. However, one may notice that mixed information is returned by the different components of Scarecrow. While some components claim, for instance, that the operating system is a Linux distribution, others state that all the processes run inside a Windows OS. It is worth mentioning that this is not an implementation flaw of Scarecrow; it is designed this way intentionally, so that it appears to attackers as a misconfigured client honeypot.
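The following sketch shows the periodic cache cleanup referred to above. The cache directory is passed in explicitly because it depends on the platform and the Firefox profile; the interval and the commented example path are illustrative assumptions, not the exact values used by Scarecrow.

```python
# Minimal sketch of the cache-poisoning countermeasure: wipe the browser's
# disk cache at a fixed interval.
import shutil
import time
from pathlib import Path

def clear_cache_periodically(cache_dir: str, interval_seconds: int = 3600) -> None:
    cache = Path(cache_dir)
    while True:  # runs as a long-lived background task
        if cache.exists():
            shutil.rmtree(cache, ignore_errors=True)  # drop cached (possibly poisoned) scripts
            cache.mkdir(parents=True, exist_ok=True)
        time.sleep(interval_seconds)

# Example (Windows XP style profile path, shown only for illustration):
# clear_cache_periodically(r"C:\Documents and Settings\user\Local Settings"
#                          r"\Application Data\Mozilla\Firefox\Profiles\default\Cache")
```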

7.4 Evaluation

In this section, we present the evaluation results of our prototype. First, we briefly describe the experimental environment and then evaluate the protection effectiveness of Scarecrow using real malware samples from a large anti-virus company.

7.4.1 Experimental Environment

We performed our experiments on a cluster that consisted of Windows XP machines. We chose Windows XP as the hosts' operating system due to its known vulnerabilities, which make it a perfect target for adversaries. Furthermore, it has long been known that Java and Flash are favored targets of attackers thanks to their huge installation bases and numerous security issues. Thus, in order to increase the chance of successful attacks against our infrastructure, we installed on each host older versions of Java, Flash, Acrobat, and VLC that are susceptible to security breaches. Since we implemented the current version of our prototype as a Firefox extension, we decided to run all our experiments with this web browser in order to have consistency in our results. To this end, we installed Scarecrow on only half of the machines of our cluster, while on the rest we installed a vanilla version of Firefox.

Figure 7.1: Overview of the experimental environment.

For our experiments, we used a dataset of 8,291 malicious URLs provided by a large anti-virus company. We visited each URL with two machines from the cluster, one with Scarecrow installed on it and one without, generating in parallel events such as mouse movements and keystrokes. In addition, we captured and stored the network traffic of each visit for further analysis. After each web browser's visit to a malicious URL, we returned the host to a clean state. To do so, we used Clonezilla [31] as a disk recovery solution. Although using a disk recovery solution is time-consuming compared to virtual machine snapshots, we rejected snapshots because they could have interfered with the Virtual Machine Detection heuristic of Scarecrow, causing inaccurate results in our experiments. After visiting each URL, we analyzed the captured network traffic looking for the existence of malicious traces. If the traffic of both machines contained malicious traces, we concluded that either the malicious web page did not display a bi-faceted behavior, or our system was not able to deceive the attackers' mechanisms. Otherwise, if the captured traffic of the vanilla browser contained malicious traces, but the traffic from Scarecrow was clean, we counted this case as a successful protection.
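A sketch of this decision rule is shown below. The inputs are the sets of malicious indicators extracted from the traffic of the vanilla browser and of the Scarecrow-protected browser; the outcome labels are our own naming.

```python
# Decision rule for comparing the two captures of the same URL.
def classify_visit(vanilla_indicators: set[str],
                   scarecrow_indicators: set[str]) -> str:
    if vanilla_indicators and not scarecrow_indicators:
        return "protected"                 # only the unprotected browser was attacked
    if vanilla_indicators and scarecrow_indicators:
        return "not-bi-faceted-or-evaded"  # both captures show malicious traces
    return "no-attack-observed"

print(classify_visit({"exploit-kit-landing"}, set()))  # protected
```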

In summary, Figure 7.1 displays the overview of our experimental environment. Web browsers visit the malicious URLs, while the captured traffic is forwarded for extensive analysis that assesses the effectiveness of Scarecrow.

7.4.2 Protection Effectiveness

We evaluated the protection effectiveness of Scarecrow against real malicious web pages. Out of the 8,291 malicious URLs that constitute our dataset, Scarecrow successfully prevented the infection in 437 cases. The outcomes of this experiment are twofold. On the one hand, the results reveal that there is a portion of malicious web pages on the Internet that try to avoid detection by exhibiting benign behavior when they are visited by a client honeypot. On the other hand, they show that it is possible to use the attackers' precautions for the users' benefit by simply camouflaging a normal web browser as a client honeypot.

We then analyzed the remaining 7,854 malicious URLs for which Scarecrow could not prevent the infection of the host machines. We wanted to see whether these infections occurred because Scarecrow was not able to prevent them, or because the malicious web pages did not try to hide their existence from client honeypots. Thus, we carefully examined their source code searching for indications of possible bi-faceted behavior. Nevertheless, we could not find signs that reveal any attempt to conceal their malicious essence. This is an interesting result, which shows that the majority of miscreants do not care about hiding their fraudulent activities. We can only speculate that most of the operators of these web pages do not have advanced technical knowledge, and thus use black-market tools to launch their attacks, which may not support protection against client honeypots.

7.5 Discussion

Despite the fact that Scarecrow is able to effectively prevent attacks from deceptive web pages, like any other system it has its own limitations. An adversary who gains knowledge of our system's existence might be able to effectively infect users. In this section, we discuss our system's limitations and suggest possible improvements.

7.5.1 User Interaction Interference

The most obvious limitation of Scarecrow is the problems it might cause to the user's browsing experience on specific web pages. Although a user usually does not notice any difference while surfing the Internet, there are some cases in which our system interferes with the user's actions. One particular case is when the user tries to play an online Flash game that requires input from the user's mouse movements. Although the user can see the mouse cursor moving on the screen, Scarecrow prevents these mouse events from triggering the corresponding event listeners. Consequently, the mouse movement events never reach the game engine, which assumes that the user has never provided any valid input.

A possible solution to this problem would be the deactivation of the mouse event restrictions. As a matter of fact, Scarecrow allows its users to select the heuristics they want to enable. Additionally, they have the capability to enable or disable individual heuristics for distinct websites. This way, the restriction of mouse events can be disabled for a website that requires this specific user input, and remain enabled for the rest of the websites. However, this may expose the user's computer to exploits, and thus we recommend taking this action only if another security solution, such as anti-virus software, is enabled.

7.5.2 File Content Verification

One of Scarecrow's heuristics is the creation of dummy executable and DLL files that are normally used by virtual machines and client honeypots. Most attackers only check for the existence of these files. However, an intelligent deceptive web page can compare the provided files with the expected ones. If there is no match, it can assume that it is the target of a deception attempt. In that case, having that knowledge, it can decide whether to proceed with the attack or not. We can easily overcome this pitfall by simply providing the real files. Nevertheless, as we did not want to force users to save additional executable files and libraries to their computers, we offer them the opportunity to select between the dummy and the real files when using our system. If they choose the second option, Scarecrow downloads all the required files from an online repository.

7.6 Related Work

Several malware detection systems focus on network flow analysis [58, 72] or require deep packet inspection [73] in order to detect compromised machines within a local network. Other detection approaches aim to identify common behaviors of infected machines when performing malicious activities [63, 72, 175, 215]. On the other hand, traditional honeypots as a detection approach have proven very successful over the years at tasks such as identifying malware [128], creating intrusion detection signatures [102], and understanding distributed denial-of-service (DDoS) attacks [129]. As their successful successors, client honeypots visit websites and monitor changes in the underlying operating system that may be caused by malicious web pages [130, 131, 151, 184, 197]. Attacks against client honeypots are an old idea. Wang et al. [197] mentioned possible evasion techniques against HoneyMonkey. Rajab et al. [153] revealed in their study that client honeypots, among other detection systems, could not protect themselves against evasive techniques deployed by attackers. Kapravelos et al. [94] examined the security model that high-interaction client honeypots utilize, and evaluated their weaknesses against intelligent evasion techniques.

In addition, the concept of deceiving malware into hiding its malicious behavior, and thus not harming a machine, is not new. Researchers have suggested the use of fake honeypots to scare attackers [163]. Additionally, Rowe [162] presented a study in which he assesses several tools for evaluating honeypot deceptions. Finally, Garg and Grosu [61] proposed a game-theoretic framework for modeling deception in honeynets. Researchers utilize browser extensions when their deployed systems focus on inexperienced users with a clear intention of increasing users' security and privacy while surfing the Internet. PwdHash [161] is a browser extension designed to improve password authentication on the web with minimal change to the user experience and no change to existing server configurations. Papadopoulos et al. [140] presented an obfuscation-based approach that enables users to follow privacy-sensitive channels on microblogging services, while, at the same time, making it difficult for the services to discover users' actual interests. In addition, Kontaxis et al. [101] proposed a design for privacy-preserving social plugins that decouples the retrieval of user-specific content from the loading of a social plugin.

7.7 Summary

In this chapter, we proposed a prevention technique that can be used by online users to protect themselves from possible infections. Our system was based on the observation that the effectiveness of existing detection approaches, such as client honeypots, at correctly classifying malicious web pages is modest, especially in cases where attackers deploy techniques to display benign behavior towards these detection mechanisms. Therefore, we benefited from the attackers' precautions and transformed these detection approaches into a system with a clear attack prevention focus. In particular, we designed and implemented a prototype that triggers false alarms, causing deceptive web pages to display a benign behavior. Our preliminary results indicate that users who utilize our system can be protected from these attacks without any additional anti-virus software installed on their machines.

8 Conclusion

“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.”

Alan Turing

The ever-growing popularity of the Internet offers new opportunities for cybercriminals to conduct their nefarious activities. Their motives range from financial reasons to political activism. Unfortunately, no matter what their incentives are, the problems they cause to the smooth operation of the Internet are serious. Most of the time, miscreants act alone, or form small alliances, and target key infrastructures of the Internet or steal private information from corporations and individuals. To do so, they initially seek vulnerable machines, which they infect. Then, they create and command an army of these compromised machines, a so-called botnet, which blindly follows the cybercriminals' instructions. Hence, attackers can successfully hide behind the anonymity that a botnet offers. As such, cybercrime is much safer than real-life crime and, in many cases, much more profitable. Consequently, cybercrime poses a serious problem that is not going to disappear in the near future.

In this dissertation, we proposed new approaches to detect and block malicious activities on the Internet. More precisely, we focused on two complementary parts related to botnet detection and web security. While in the first part we showed how to detect compromised devices and mitigate their malicious activities, in the second part we attempted to understand how cybercriminals operate and proposed methods that prevent them from infecting new computers. Overall, we believe that the two parts of this dissertation have tackled known security issues from a different angle, and we hope that our work will drive new research in this direction. In the following, we summarize our different approaches and outline directions for future research.

Initially, this thesis proposed a novel method to detect and mitigate spam emails by identifying the subtle deviations in the implementation of the SMTP protocol by various clients. We presented a lightweight method that could be integrated with existing anti-spam infrastructures. The results of our work confirmed that a server could reduce the number of emails that need to be processed by content analysis, which is an expensive technique and not always feasible on busy servers. Additionally, we proposed a way to negatively affect the efficacy of spam botnets by poisoning their feedback mechanism and driving spammers into a lose-lose situation.

We then extended the aforementioned detection approach to target HTTP-based malware and botnets. We introduced two models that focus on particular characteristics of the HTTP protocol, such as the headers' sequence and structure, and implemented a system that can identify malicious network traffic with low classification errors and high performance. The outcomes of our experiments revealed that our prototype is able to detect malicious domains early, before various popular blacklists publish them. Finally, we showed that we can detect real-world HTTP-based malware as well as advanced persistent threats used in targeted attacks. In a nutshell, the approaches presented in Chapter 2 and Chapter 3 proposed novel techniques for the detection and mitigation of botnets and malware instances that have successfully compromised vulnerable computers and leverage the SMTP and HTTP protocols to perpetrate their malicious activities.

Then, we went one step further and studied the techniques used by cybercriminals who try to trick victims into visiting malicious websites and infecting their machines. First, we studied the alliances formed by web spammers in their attempt to boost the ranking of their websites. We showed that web spammers utilize SEO forums to exchange links. Additionally, we noticed that a portion of these exchanges takes place only through private messages. We therefore introduced the notion of honey accounts, which collect data from private communications. Our analysis revealed the existence of two major categories of spammers with distinct features. Each category behaves in a completely different manner, and thus we need different approaches to unveil their spam web pages. Overall, the results of this study improve our understanding of the web spam ecosystem and shed light on the activities performed in underground forums related to link spam.

Next, we explored how attackers can manipulate web crawlers to their benefit. First, we revealed that it is feasible for an attacker to convince crawlers to launch attacks against third-party websites by simply crafting appropriate links. We then demonstrated that websites vulnerable to reflected HTML and JavaScript injections can help attackers boost the ranking of their websites. Finally, we proposed a series of countermeasures for the detection of malicious outbound links. To the best of our knowledge, this was the first study that thoroughly examined the resistance of popular web crawlers against various attacks, and we believe that the presented outcomes can help crawlers defend themselves in a more efficient way.


We also studied the ecosystem of malicious advertisements and detected cases where it was abused. We introduced a system that collects and evaluates advertisements and showed that certain ad exchanges are more prone to serving malicious content than others. Additionally, we found that, because of the arbitration process, it is common for ad exchanges to serve a malicious advertisement provided by another ad exchange. Although the results of this study are preliminary, they revealed that malicious advertising is a real problem that can target millions of users without the need to first lure them to a malicious website.

Finally, we presented a novel approach that enhances users' security while surfing the web by camouflaging a normal browser as a client honeypot (a minimal sketch of this camouflage idea follows below). We proposed this solution after observing that a number of malicious websites exhibit benign behavior when they are visited by a client honeypot. The evaluation showed that, in the case of malicious websites with such a split personality, users who utilize our system as a browser plugin encounter fewer threats without any further modification to their computers. Consequently, although our framework does not replace traditional anti-virus products, it adds an additional layer of security.

Overall, the two parts of this dissertation aim to make the Internet a somewhat more secure place to operate. We hope that our work will drive new research that tackles the problems covered in this thesis from a different angle, as well as the invention of novel detection and mitigation techniques against closely related security issues.
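
To make the camouflage idea mentioned above more tangible, the sketch below injects artifacts that the PhantomJS honeyclient is known to expose, namely window._phantom and window.callPhantom, into every page loaded by an ordinary Chrome instance driven through Selenium; split-personality pages that probe for such artifacts may then serve their benign variant. This is a simplification under stated assumptions, not the plugin evaluated in Chapter 7, and the choice of artifacts and of CDP-based injection are assumptions made for this example.

```python
# Minimal sketch: make a regular Chrome browser look like a PhantomJS-based
# honeyclient by exposing properties that PhantomJS injects into pages.
# This is an illustrative approximation, not the system described in the thesis.

from selenium import webdriver

CAMOUFLAGE_JS = """
    // Properties commonly associated with PhantomJS-driven honeyclients.
    window._phantom = {};
    window.callPhantom = function () {};
"""

def camouflaged_browser() -> webdriver.Chrome:
    driver = webdriver.Chrome()
    # Evaluate the camouflage script before any page script runs.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": CAMOUFLAGE_JS},
    )
    return driver

if __name__ == "__main__":
    browser = camouflaged_browser()
    browser.get("https://example.com")   # placeholder URL
    print(browser.execute_script("return typeof window.callPhantom"))
    browser.quit()
```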

Future Work

In this dissertation, we proposed several approaches that try to detect and defend against online threats. However, the problem is far from solved. This is mostly because computer security is a continuous arms race between cybercriminals and security researchers. As soon as researchers invent a novel detection or mitigation technique, miscreants create countermeasures to evade it. With this cat-and-mouse game in mind, we recommend directions for future research that could raise the bar a little higher.

In the first part of this thesis, we proposed approaches that leverage slight deviations in the implementation of two protocols, SMTP and HTTP, to detect malicious activities. We believe it would be interesting to extend our research to more protocols and investigate under which circumstances our approaches remain effective. One example would be the investigation of botnets that are used for Bitcoin mining and their discrimination from legitimate mining software. Furthermore, we showed that an attacker can evade our approach, but, especially in the case of spam botnets, she will suffer a performance penalty. Therefore, another direction for future research would be to measure the economic loss for cybercriminals who adapt their behavior to evade detection.
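
A starting point for such a measurement could be a simple throughput model: if evading dialect-based detection forces a bot to behave like a legitimate mail client (for instance, by waiting for every server response instead of pipelining), the per-message delivery time grows and the campaign's output shrinks proportionally. The sketch below uses purely hypothetical placeholder numbers to show the shape of such a calculation; it is not based on measured data.

```python
# Toy model of the performance penalty a spam botnet pays when it mimics
# a legitimate SMTP client to evade dialect-based detection.
# All parameters are hypothetical placeholders, not measured values.

def campaign_output(bots: int, hours: float, seconds_per_msg: float) -> int:
    """Total messages a botnet can send in the given time window."""
    return int(bots * hours * 3600 / seconds_per_msg)

if __name__ == "__main__":
    bots = 10_000                 # hypothetical botnet size
    hours = 24.0                  # hypothetical campaign duration
    t_bot = 0.5                   # seconds per message with an aggressive dialect
    t_legit = 2.0                 # seconds per message when mimicking a real MUA

    before = campaign_output(bots, hours, t_bot)
    after = campaign_output(bots, hours, t_legit)
    loss = 1 - after / before

    print(f"messages before evasion: {before:,}")
    print(f"messages after evasion:  {after:,}")
    print(f"relative output loss:    {loss:.0%}")   # 75% with these placeholders
```

Combining such a model with estimates of revenue per delivered message would yield the economic loss we propose to quantify.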


In the second part of the thesis, we studied the alliances among web spammers by capturing and analyzing their posts in SEO forums. However, we believe that these forums constitute only the tip of the iceberg. Therefore, we suggest further research into other forms of communication. For instance, in the era of social media, where many traditional forms of online communication, such as emails and instant messaging, have been replaced by status messages and hashtags in public or private channels, we expect that this trend has not gone unnoticed by web spammers. More precisely, we believe that spammers find more direct ways to communicate with each other and form alliances through social media. Although recent research has focused on how cybercriminals can abuse social media to perform malicious activities, we are not yet aware of any study that treats these services as a platform that helps web spammers boost the ranking of their websites.

Next, we presented an approach for the automatic prevention of attacks against web crawlers. A future research direction in this topic is to extend the attack models beyond the ones described in this thesis. The first step is a careful systematization and fingerprinting of the capabilities of the crawlers to understand to which extent they can execute arbitrary client-side code. Then, possibly in cooperation with the search engine operators, the second step consists in testing a list of increasingly complex attacks to evaluate their feasibility. As a continuation of the previous point, exploring other web services that can be used as a trampoline for attacks is another interesting research direction. The challenge here is that the interaction with such web services can be hard to automate in a generic way. Interesting web services worth exploring include, for instance, link checkers, chat or social networking tools that process posted links, and security products such as automated scanners used to analyze a website and categorize its content (e.g., Web of Trust).

Regarding malicious advertisements, our main priority was to study how web advertising can be turned into a powerful weapon in the cybercriminals' arsenal and whether this is a real-world scenario. Consequently, we discussed possible countermeasures against this threat only briefly. One of our suggested solutions was the use of advertisement-blocking plugins, such as Adblock Plus. These plugins block all advertisements, including the malicious ones, from rendering in a user's browser. Unfortunately, a massive adoption of this solution could severely damage the economy of the Internet. Thus, we propose for future research a slight variation of our initial proposal: instead of blocking all advertisements, a plugin should restrict only the malicious ones. As such, we suggest a plugin that delivers advertisements to our oracle, which could then analyze their execution and build an advertisement reputation network, as sketched below. The results of this analysis could be automatically delivered to the ad networks, which in turn would block the malicious advertisements. Hence, we would have a win-win scenario in which users securely surf the web and the ad networks continue to make a profit.
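
The following sketch illustrates how the client side of such a plugin might interact with a reputation oracle. The oracle endpoint, its JSON interface, and the caching policy are all assumptions made for this example; they do not describe an existing service.

```python
# Minimal sketch of the client side of an advertisement reputation network.
# The oracle URL and its response format are hypothetical assumptions.

import json
import urllib.request

ORACLE_ENDPOINT = "https://ad-oracle.example.org/check"   # hypothetical service
_cache: dict[str, bool] = {}                               # ad URL -> is_malicious

def is_malicious_ad(ad_url: str) -> bool:
    """Ask the oracle whether an advertisement URL is known to be malicious."""
    if ad_url in _cache:
        return _cache[ad_url]
    query = urllib.request.Request(
        ORACLE_ENDPOINT,
        data=json.dumps({"url": ad_url}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(query, timeout=5) as response:
        verdict = json.load(response).get("malicious", False)
    _cache[ad_url] = verdict
    return verdict

def filter_ads(ad_urls: list[str]) -> list[str]:
    """Return only the advertisements that the oracle considers benign."""
    return [url for url in ad_urls if not is_malicious_ad(url)]
```

In a full design, the oracle verdicts would also be reported back to the ad networks, so that malicious creatives are removed at the source rather than merely hidden in individual browsers.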


Finally, as the Internet of Things (IoT) evolves, web threats become more advanced and target devices other than personal computers. Hence, the existing detection and prevention techniques become less effective. Applying widely used practices, or variants thereof, in the IoT world requires substantial re-engineering to address device constraints. For instance, embedded devices are designed for low power consumption and thus lack the computational capabilities of a personal computer. Moreover, they often operate headless: they must make their own judgments and decisions about whether to accept a command or execute a task without a human operator who can input authentication credentials or decide whether an application should be trusted. While some embedded systems have clear and well-defined security goals, such as pay-TV smart cards or Hardware Security Modules (HSMs), and are therefore rather secure, others are not designed with a clear threat model in mind. This gives manufacturers little motivation to invest time and money in securing them. Therefore, this constitutes an area that requires further research. In the future, we plan to investigate how attackers select these devices and how easy it is for adversaries to compromise them. Our ultimate goal is to implement detection and prevention approaches similar to the ones that exist for infected computers.


List of Figures

2.1 A typical SMTP conversation ...... 19
2.2 Simplified state machines for Outlook Express (left) and Bagle (right) ...... 21
2.3 Regular expressions used in message templates ...... 21
2.4 An example of decision state machine ...... 28
2.5 Overview of B@bel architecture ...... 34

3.1 A typical HTTP request ...... 49
3.2 Overview of BotHound architecture ...... 54
3.3 Number of the generated header chains for different versions of Firefox as a function of the number of visited URLs ...... 57

4.1 A simplified example of PageRank calculation ...... 74
4.2 Architecture of our approach ...... 77
4.3 Percentage of users for the pairs number of post and URL in a thread ...... 80
4.4 CDF of URL occurrences in our dataset ...... 81
4.5 CDF of the number of posts per user ...... 81
4.6 PageRank of web pages that request for link exchange in SEO forums ...... 83
4.7 A two-way link exchange network. The numbers on nodes indicate the web page ranking ...... 86
4.8 A link exchange network including web pages from both public threads and private messages ...... 87
4.9 A webspam ecosystem ...... 88

5.1 Overview of the attack scheme...... 97


5.2 Overall final detection quality in terms of F1 score, precision, TPR and FPR (before and after per-site tuning)...... 111

6.1 Malvertising distribution from selected ad networks ...... 125
6.2 Distribution of advertisements from selected ad networks ...... 125
6.3 Websites categorization that served malvertisements ...... 126
6.4 Malvertisement distribution based on top level domains ...... 127
6.5 Ad networks involved in ad arbitration for malicious and benign advertisements ...... 128

7.1 Overview of the experimental environment...... 142

List of Tables

2.1 MTAs, MUAs, and bots used to learn dialects...... 35

3.1 Protection mechanisms ...... 58
3.2 Detection results for benign and malicious datasets as a function of different threshold values ...... 59
3.3 Classification results of different approaches ...... 60
3.4 Classification results of BotHound for three real-world traffic datasets ...... 61

4.1 Percentage of URLs and users that appear on 1 up to 3 different forums ...... 82
4.2 Distribution of the requested themes for link exchange ...... 84
4.3 Breakdown of countries hosting web spam ...... 84
4.4 Statistics of requested link exchange types ...... 85

5.1 Overview of the feasibility of each attack for each type of abuse... 101

6.1 Classification of malvertisements...... 124

7.1 Popular mechanisms used by malicious web pages to evade detection. 138



Apostolos Zarras Curriculum Vitae

Personal Information
Date of Birth: 26.10.1985
Place of Birth: Athens, Greece

Education
2011 – 2015: Ph.D. in Electrical Engineering and Information Technology, Department of Electrical Engineering and Information Technology, Ruhr – University Bochum, Germany.
2008 – 2011: M.Sc. Degree in Computer Science, Department of Computer Science, University of Crete, Greece.
2003 – 2008: B.Sc. Degree in Computer Science, Department of Computer Science, University of Crete, Greece.

Professional Experience
Aug. 2011 – Present: Research Assistant, Chair for Systems Security, Ruhr – University Bochum, Germany. Supervisor: Prof. Thorsten Holz
Jun. 2014 – Jul. 2014: Research Assistant, Security Group, EURECOM, France. Supervisor: Prof. Aurélien Francillon
Jan. 2014 – Apr. 2014: Research Assistant, Computer Security Laboratory, Department of Computer Science, University of California, Santa Barbara, USA. Supervisors: Prof. Christopher Kruegel and Prof. Giovanni Vigna
Dec. 2008 – Jul. 2011: Research Assistant, Distributed Computing Systems Laboratory, Institute of Computer Science, Foundation for Research and Technology Hellas, Greece. Supervisor: Prof. Evangelos Markatos
Jun. 2010 – Sep. 2010: Research Assistant, Computer Security Laboratory, Department of Computer Science, University of California, Santa Barbara, USA. Supervisors: Prof. Christopher Kruegel and Prof. Giovanni Vigna
Jun. 2007 – Nov. 2008: Research Assistant, Transformation Services Laboratory, Department of Computer Science, University of Crete, Greece. Supervisor: Prof. Christos Nikolaou
Summer 2005: Software Engineer, Network Operations Center, University of Crete, Greece.

Professional Activities
Program Committee: 6th European Workshop on System Security (EuroSec 2013)

Invited Talks
July 2015: On Detection of Malicious Advertisements, Technische Universität München, Germany
May 2015: Towards Detection and Prevention of Malicious Activities on the Internet, ETH Zurich, Switzerland
June 2013: Social Media Censorship: Problem and Countermeasures, Ruhr – University Bochum, Germany
March 2011: Worldwide Observatory of Malicious Behaviors and Attack Threats, ASMONIA Workshop, Heidelberg, Germany

Conference Presentations
July 2015: Revealing the Relationship Network Behind Link Spam, Annual Conference on Privacy, Security and Trust (PST), Izmir, Turkey
November 2014: The Dark Alleys of Madison Avenue: Understanding Malicious Advertisements, Internet Measurement Conference (IMC), Vancouver, Canada
October 2014: The Art of False Alarms in the Game of Deception: Leveraging Fake Honeypots for Enhanced Security, International Carnahan Conference on Security Technology (ICCST), Rome, Italy
July 2014: Automated Generation of Models for Fast and Precise Detection of HTTP-Based Malware, Annual Conference on Privacy, Security and Trust (PST), Toronto, Canada

List of Conference and Workshop Participation
July 2015: 13th Annual Conference on Privacy, Security and Trust (PST), Izmir, Turkey
November 2014: 14th ACM Internet Measurement Conference (IMC), Vancouver, Canada
October 2014: 48th IEEE International Carnahan Conference on Security Technology (ICCST), Rome, Italy
July 2014: 12th Annual Conference on Privacy, Security and Trust (PST), Toronto, Canada
December 2013: 29th ACM Annual Computer Security Applications Conference (ACSAC), New Orleans, Louisiana, USA
August 2012: 21st USENIX Security Symposium, Bellevue, Washington, USA
July 2012: 9th Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA), Heraklion, Greece
November 2011: 11th ACM Internet Measurement Conference (IMC), Berlin, Germany
April 2011: 4th European Workshop on System Security (EuroSec 2011), Salzburg, Austria
September 2009: 12th International Symposium on Recent Advances in Intrusion Detection (RAID), Saint-Malo, Brittany, France

List of Publications
[1] Apostolis Zarras, Antonis Papadogiannakis, Sotiris Ioannidis, and Thorsten Holz. Revealing the Relationship Network Behind Link Spam. In Proceedings of the 13th Annual Conference on Privacy, Security and Trust (PST), 2015.
[2] Apostolis Zarras, Alexandros Kapravelos, Gianluca Stringhini, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. The Dark Alleys of Madison Avenue: Understanding Malicious Advertisements. In Proceedings of the 14th ACM SIGCOMM Internet Measurement Conference (IMC), 2014.
[3] Apostolis Zarras. The Art of False Alarms in the Game of Deception: Leveraging Fake Honeypots for Enhanced Security. In Proceedings of the 48th IEEE International Carnahan Conference on Security Technology (ICCST), 2014.
[4] Apostolis Zarras, Antonis Papadogiannakis, Robert Gawlik, and Thorsten Holz. Automated Generation of Models for Fast and Precise Detection of HTTP-Based Malware. In Proceedings of the 12th Annual Conference on Privacy, Security and Trust (PST), 2014.
[5] Panagiotis Papadopoulos, Antonis Papadogiannakis, Michalis Polychronakis, Apostolis Zarras, Thorsten Holz, and Evangelos P. Markatos. k-subscription: Privacy-preserving Microblogging Browsing through Obfuscation. In Proceedings of the 29th ACM Annual Computer Security Applications Conference (ACSAC), 2013.
[6] Gianluca Stringhini, Manuel Egele, Apostolis Zarras, Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. B@bel: Leveraging Email Delivery for Spam Mitigation. In Proceedings of the 21st USENIX Security Symposium, 2012.

[7] Brett Stone-Gross, Ryan Stevens, Apostolis Zarras, Richard Kemmerer, Christopher Kruegel, and Giovanni Vigna. Understanding Fraudulent Activities in Online Ad Exchanges. In Proceedings of the 11th ACM SIGCOMM Internet Measurement Conference (IMC), 2011.
