UNIVERSITY OF CALIFORNIA, SAN DIEGO

Understanding URL Abuse for Profit

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy

in

Computer Science

by

Neha Chachra

Committee in charge:

Stefan Savage, Co-Chair
Geoffrey M. Voelker, Co-Chair
James H. Fowler
Kirill Levchenko
Lawrence K. Saul

2015

Copyright
Neha Chachra, 2015
All rights reserved.

The Dissertation of Neha Chachra is approved and is acceptable in quality and form for publication on microfilm and electronically:

Co-Chair

Co-Chair

University of California, San Diego

2015

DEDICATION

To mom for instilling a love of science in me.

TABLE OF CONTENTS

Signature Page
Dedication
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1  Introduction

Chapter 2  Using Crawling to Study Large-Scale Fraud on the Web
    2.1  Introduction
    2.2  Related Work
    2.3  Architecture
        2.3.1  Selena
        2.3.2  Oliver-I
        2.3.3  Oliver-II
        2.3.4  Stallone
        2.3.5  Charlotte
    2.4  Browser Instrumentation using Custom Extensions
    2.5  Responding to Deterrence
    2.6  Summary
    2.7  Acknowledgements

Chapter 3  Characterizing Affiliate Marketing Abuse
    3.1  Introduction
    3.2  Background
    3.3  Methodology
        3.3.1  Identifying Affiliate URLs and Cookies
        3.3.2  User Study
        3.3.3  Crawling
        3.3.4  Browser Extension Analysis
    3.4  Results
        3.4.1  Networks Affected by Cookie-Stuffing
        3.4.2  Prevalence of Cookie-Stuffing Techniques
        3.4.3  Fraudulent Browser Extensions
        3.4.4  Prevalence of Affiliate Marketing
    3.5  Summary
    3.6  Acknowledgements

Chapter 4  Characterizing Domain Abuse and the Revenue Impact of Blacklisting
    4.1  Introduction
    4.2  Background
    4.3  Data Sets
        4.3.1  Authenticity and Ethics
        4.3.2  GlavMed and SpamIt
        4.3.3  URIBL
        4.3.4  Spam Feeds
    4.4  Domain Abuse
        4.4.1  Overall Observations
        4.4.2  Advertising Vectors
        4.4.3  Infrastructure Domains
        4.4.4  Purchased Traffic
    4.5  Blacklisting
        4.5.1  Blacklisting Speed
        4.5.2  Coverage
        4.5.3  Blacklisted Resource
        4.5.4  Blacklisting Penalty
    4.6  Discussion
        4.6.1  A Simple Revenue Model
        4.6.2  Changing Blacklisting Penalty
        4.6.3  Increasing Coverage
    4.7  Related Work
    4.8  Summary
    4.9  Acknowledgements

Chapter 5  Conclusion
    5.1  Dissertation Summary
    5.2  Future Directions and Final Thoughts

Bibliography

LIST OF FIGURES

Figure 2.1.  System design for Oliver-I, the second version of the Web crawler built in 2010.

Figure 2.2.  System design for the proof-of-concept crawler, Charlotte, built in 2015.

Figure 3.1.  Different actors and revenue flow in the affiliate marketing ecosystem. The left half of the figure depicts a potential customer receiving an affiliate cookie, while the right half shows the use of the affiliate cookie to determine payout upon a successful transaction.

Figure 3.2.  Stuffed cookie distribution for top 10 categories of impacted merchants.

Figure 4.1.  Revenue from clicks on different kinds of referrers.

Figure 4.2.  Spammers seamlessly switch from one free hosting site to another in the face of takedowns.

Figure 4.3.  Revenue of domains before and after blacklisting. Note that the x-axis is non-linear.

Figure 4.4.  The highest cost of domain a spammer can afford (y-axis) against the time delay (x-axis) in blacklisting.

LIST OF TABLES

Table 2.1.  Different versions of the Web crawler we built for studying fraudulent ecosystems.

Table 2.2.  The table shows some of the supported features for interacting with Web pages and the corresponding challenges we faced.

Table 3.1.  Examples of affiliate URLs and cookies for different affiliate programs.

Table 3.2.  Affiliate programs affected by cookie-stuffing and the distribution of cookie-stuffing techniques corresponding to the 12K stuffed cookies we detected.

Table 3.3.  Affiliate programs that AffTracker users received cookies for.

Table 4.1.  Classification of referrers used by SpamIt affiliates.

Table 4.2.  Classification of referrers used by GlavMed affiliates.

Table 4.3.  Example referrers for advertising vectors (email and Web search), infrastructure domains (free hosting, bulk, and compromised), and purchased traffic.

Table 4.4.  Statistical differences between blacklisted and non-blacklisted domains.

ACKNOWLEDGEMENTS

I would like to thank my wonderful advisers, Stefan Savage and Geoff Voelker. They have been remarkable advisers and mentors throughout my doctoral program. Through their guidance, encouragement, and constructive feedback, they taught me how to do research and how to best present my work in speech and in writing. I am also grateful to them for encouraging me in my initiatives for the community development activities within CSE.

I am also thankful to Kirill Levchenko, who has not only been a supportive committee member, but also a wonderful friend and mentor. His astuteness and work ethic have been inspirational from the beginning of my doctoral program. I would also like to thank James H. Fowler and Lawrence K. Saul for being on my doctoral committee and being available whenever I needed any help.

My co-authors have been remarkable to work with and I am very grateful for all their help. Different co-authors inspired me to learn different skills. I am especially grateful to Chris Grier, Alexandros Kapravelos, Christian Kreibich, Damon McCoy, and Vern Paxson. Julie Connor, Brian Kantor, Cindy Moore, and Jennifer Folkestad have also helped immeasurably with various administrative tasks, for which I am sincerely grateful.

I am also indebted to all my friends within CSE and outside who anchored me, guided me, inspired me, and cheered me on at my lowest points. In particular, I am grateful to Alexander Bakst, Lakshmi Balachandran, Karyn Benson, Dimple Chugani, Sabrina Feve, Olga Gromadzka, Tristan Halvorson, Aswath Krishnan, Brenna Lanton, Sohini Manna, Keaton Mowery, Valentin Robert, Zachary Tatlock, and Rosalydia Tamayo.

The CSE community in UC San Diego has been like a family over the last six years and I am thankful to all the faculty and students here who make it such a special department. Specifically, I would like to thank Marc Andrysco, Dimitar Bounov, Joe DeBlasio, Jessica Gross, Tristan Halvorson, Danny Huang, Ranjit Jhala, Lonnie Liu, Nadyne Nawar, David Kohlbrenner, Sorin Lerner, John McCullough, Sarah Meiklejohn, Marti Motoyama, Arjun Roy, Alex Snoeren, Cynthia Taylor, Panagiotis Vekris, Sravanthi Kota Venkata, Ganesh Venkatesh, Michael Vrable, David Wargo, and David Y. Wang, among many others. I would also like to express my gratitude to Steve Checkoway for being a good friend and for creating the dissertation template, which proved to be indispensable for writing my dissertation. I am also very grateful to Danny Huang for taking over the responsibilities of running the "Learn from Peers" talk series that I started.

Finally, I am grateful to my mother, Praveen Chachra, and my brother, Ricky Chachra, for being unfailingly supportive and loving.

Chapter 2, in part, is a reprint of the material as it appears in Proceedings of the 4th USENIX Workshop on Cyber Security Experiment and Test (CSET). Chris Kanich, Neha Chachra, Damon McCoy, Chris Grier, David Y. Wang, Marti Motoyama, Kirill Levchenko, Stefan Savage, Geoffrey M. Voelker, 2011. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, is a reprint of the material as it appears in Proceedings of the 2015 ACM Conference on Internet Measurement Conference (IMC). Neha Chachra, Stefan Savage, and Geoffrey M. Voelker, 2015. The dissertation author was the primary investigator and author of this paper.

Chapter 3, in part, is also a reprint of material as it appears in Proceedings of the 23rd USENIX Security Symposium. Alexandros Kapravelos, Chris Grier, Neha Chachra, Christopher Kruegel, Giovanni Vigna, and Vern Paxson, 2014. The dissertation author was the primary investigator and author of this paper.

Chapter 4, in full, is a reprint of the material as it appears in Proceedings of the Workshop on the Economics of Information Security (WEIS). Neha Chachra, Damon McCoy, Stefan Savage, Geoffrey M. Voelker, 2014. The dissertation author was the primary investigator and author of this paper.

VITA

2009        Bachelor of Science in Computer Science, University of California, San Diego
2009-2015   Research Assistant, University of California, San Diego
2012        Master of Science in Computer Science, University of California, San Diego
2015        Doctor of Philosophy in Computer Science, University of California, San Diego

PUBLICATIONS

Affiliate Crookies: Characterizing Affiliate Marketing Abuse. Neha Chachra, Stefan Savage, and Geoffrey M. Voelker. In Proceedings of the 2015 ACM Conference on Internet Measurement Conference (IMC), October 2015.

Empirically Characterizing Domain Abuse and the Revenue Impact of Blacklisting. Neha Chachra, Damon McCoy, Stefan Savage, Geoffrey M. Voelker. In Proceedings of the Workshop on the Economics of Information Security (WEIS), June 2014.

Hulk: Eliciting Malicious Behavior in Browser Extensions. Alexandros Kapravelos, Chris Grier, Neha Chachra, Christopher Kruegel, Giovanni Vigna, and Vern Paxson. In Proceedings of the 23rd USENIX Security Symposium, August 2014.

Manufacturing Compromise: The Emergence of Exploit-as-a-Service. Chris Grier, Lucas Ballard, Juan Caballero, Neha Chachra, Christian J. Dietrich, Kirill Levchenko, Panayiotis Mavrommatis, Damon McCoy, Antonio Nappa, Andreas Pitsillidis, Niels Provos, M. Zubair Rafique, Moheeb Abu Rajab, Christian Rossow, Kurt Thomas, Vern Paxson, Stefan Savage, Geoffrey M. Voelker. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS), October 2012.

No Plan Survives Contact: Experience with Cybercrime Measurement. Chris Kanich, Neha Chachra, Damon McCoy, Chris Grier, David Y. Wang, Marti Motoyama, Kirill Levchenko, Stefan Savage, Geoffrey M. Voelker. In Proceedings of the 4th USENIX Workshop on Cyber Security Experiment and Test (CSET), August 2011.

Click Trajectories: End-to-End Analysis of the Spam Value Chain. Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Nicholas Weaver, Vern Paxson, Stefan Savage, Geoffrey M. Voelker. In Proceedings of the IEEE Symposium on Security and Privacy, May 2011.

ABSTRACT OF THE DISSERTATION

Understanding URL Abuse for Profit

by

Neha Chachra

Doctor of Philosophy in Computer Science

University of California, San Diego, 2015

Professor Stefan Savage, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair

Large-scale online scam campaigns pose a significant security threat to casual Internet users. Attackers simultaneously abuse millions of URLs to swindle visitors by selling counterfeit goods, by phishing to steal user credentials for various online services, or even by infecting user machines with malware. In this dissertation, I address the problem of studying these large-scale fraudulent ecosystems that heavily rely on URL abuse for profit. I demonstrate the feasibility of analyzing ground truth data at scale to derive valuable insights about the underlying business model, allowing me to assess the impact of different interventions on attacker revenue.

First, I address the challenge of collecting high-fidelity ground truth data under adversarial conditions. I describe the design of an efficient Web crawler that mimics real user activity to elicit fraudulent behavior from Web sites. I then use the crawler to detect affiliate marketing fraud on hundreds of Web sites. Fraudulent affiliates target merchants who outsource their affiliate programs to large affiliate networks to a much greater extent than merchants who run their own affiliate programs. Profit-oriented attackers seek to minimize costs to maximize profit. Therefore, the use of more sophisticated and expensive techniques against in-house affiliate programs suggests stricter policing by these programs.

Subsequently, I analyze the ground truth sales data for two major counterfeit pharmaceutical programs with total sales of $41M over three years. Attackers advertising via email spam and black-hat search-engine optimization show different patterns of domain abuse to maximize profit under differing defensive pressures. To analyze the efficacy of intervention, I use concurrent blacklisting data and study the impact of blacklisting on spammer revenue. Blacklisting, the most popular intervention universally used against abusive URLs, is effective in limiting revenue from the specific URLs that are blacklisted. However, it does not undermine overall profitability due to the very low cost of replacing domains, high consumer demand for counterfeit pharmaceuticals, logistical difficulties in rapid detection and universal deployment of blacklists, and the sophistication and ingenuity of attackers in the face of takedowns.

Chapter 1

Introduction

Almost every major online service today is plagued with spam content designed to lure visitors to attacker-controlled Web pages that are deployed to profit from unscrupulous activities. Some examples of these activities include the sale of counterfeit pharmaceuticals and luxury goods, phishing to steal user credentials for banking sites and other online services, and infecting user machines with adware and other malware. Usually, scammers advertise their Web sites through URLs contained within spam email messages [49, 96, 100], spam posts on social networks [32, 85], blogs [58] and almost any site supporting public content creation, or through black-hat search-engine optimization [48, 90]. Some of these fraudulent campaigns use botnets to rapidly create vast quantities of spam content all over the Web [70, 89].

Understanding the revenue model of attackers is critical for identifying effective intervention techniques to undermine profitability. The scale of abuse coupled with the presence of an intelligent adversary makes it infeasible to reason about attacker resources in isolation, thus creating a need for observational analysis through direct engagement with attacker infrastructure. In this thesis, I directly address the challenges inherent in measuring profit-driven abuse of URLs for different fraudulent activities. In particular, I demonstrate the feasibility of collecting and analyzing large-scale ground truth data to reason about the cost dynamics within fraudulent ecosystems, thereby placing us on a strong footing to devise better interventions.

I describe the design of a custom Web crawler for collecting large-scale data about scam content hosted on attacker-controlled URLs. I implemented the crawler to closely mimic real user activity to elicit malicious behavior from scam sites by instrumenting a modern browser with sophisticated custom add-ons. Subsequently, I used the crawler to collect various features (e.g., screenshots, HTML content, etc.) from millions of abusive Web sites. Attackers usually respond to crawling with deterrence or aggression, thereby giving rise to challenges in building and maintaining a defensive crawling infrastructure. I describe some of these challenges stemming from adversarial response, and also address concerns arising from the scale of measurement studies. Our key insights include instrumenting modern browsers to achieve verisimilitude, deploying evasive techniques such as using multiple diverse IP addresses for crawling, and continuously updating crawling in response to the inevitable evolution of attacker capabilities.

I then use ground truth data to analyze abuse in two separate profit-oriented fraudulent ecosystems. In the first study, I use large-scale data collected from crawling and a user study to explain the dynamics of affiliate marketing fraud (Chapter 3). Affiliate fraud garnered widespread media attention in 2014 with the imprisonment of Shawn Hogan [86], an eBay affiliate indicted for wire fraud of $15.5M [31]. There have been multiple similar legal disputes over affiliate marketing since then [25]. These reports motivated me to measure affiliate fraud on the Web. In particular, I crawl Web sites to detect "cookie-stuffing", whereby a fraudulent affiliate causes the user's browser to receive a special cookie (i.e., an affiliate cookie) so that, when the user makes a purchase on a retailer site, the fraudulent affiliate earns a commission. I find that attackers target merchants outsourcing the operation of their affiliate programs to large affiliate networks much more than merchants who run their own affiliate programs. I measure the different techniques deployed by affiliates to fraudulently earn commissions and observe attackers using more sophisticated and expensive techniques against in-house affiliate programs. Since fraudsters are motivated to minimize cost and maximize profit, the use of more expensive techniques against in-house affiliate programs suggests stricter policing by these programs. Using a concurrent dataset containing the source code of all publicly available browser extensions for Google Chrome, I also identify fraudulent browser extensions that perpetrate affiliate fraud by tampering with the Web sites visited by the extension users. Unlike Web sites perpetrating cookie-stuffing fraud, a browser extension provides much greater control to a fraudulent affiliate, who can ensure that an extension user has an affiliate cookie just before making a purchase, thereby ensuring commission for the affiliate.

In the second case study, I determine the cost dynamics of domain abuse in the counterfeit pharmaceutical market using a publicly available ground truth dataset for two of the largest counterfeit pharmaceutical campaigns, with sales worth $41M spanning three years. While the efficacy of interventions such as domain blacklisting and takedowns is frequently debated [27, 38], I discern the actual revenue impact of defensive interventions on spammer profitability. By looking at the Referer fields for actual transactions on counterfeit Web sites, I characterize the abuse of domains for advertising counterfeit drugs in email spam messages and in search-engine results. This dataset shows that attackers abuse different advertising vectors and online services simultaneously to maximize revenue (Chapter 4). Furthermore, the spamming strategies are sensitive to the pressures of interventions and takedowns. Typically, spammers respond with ingenuity and agility to replace the URLs that are taken down and use strategies that mitigate the risk of detection. The ground truth data also suggests that domain blacklisting, a universally popular intervention used against abusive URLs, is only locally effective in limiting revenue from the specific domains that are blacklisted. At a macro level, the inexpensive and replaceable nature of URLs coupled with a strong consumer demand for counterfeit drugs ensures that blacklisting is merely a small cost of business and does not undermine the overall profitability of online counterfeit pharmacies.

While I only describe two case studies for understanding the abuse of URLs and the efficacy of interventions, the general approach of collecting large-scale data through crawling and analyzing it to infer attacker costs and revenue is applicable to studying any other adversarial campaign reliant on domain abuse at scale.

This dissertation is structured as follows. Chapter 2 describes the design of an efficient, high-fidelity crawler we use for measuring fraudulent ecosystems. Chapter 3 provides an analysis of abuse in the affiliate marketing ecosystem. Chapter 4 describes domain abuse in the counterfeit pharmaceutical market, and the impact of domain blacklisting on attacker revenue. Finally, Chapter 5 summarizes this dissertation and provides some ideas for future directions.

Chapter 2

Using Crawling to Study Large-Scale Fraud on the Web

In this chapter, we describe the design of a Web crawler to gather large-scale data to study fraudulent ecosystems on the Internet. We synthesize our experiences and the lessons we learned from building and maintaining crawling infrastructure used for visiting millions of Web sites. We describe the operational challenges to achieving verisimilitude — obtaining measurements that capture “real” behaviors — many of which arise from the adversarial nature of the measurement process. Since 2010, we have also iterated on the crawler infrastructure to accommodate the storage needs of the ever-growing scale of collected data, to support new functionality needed by other members of our group for various projects (e.g., the ability to set user-agents to study cloaking [90]), and also to take advantage of the latest technology available to us. We discuss the challenges we overcame and provide valuable insight into building crawlers that engage an adversary.

2.1 Introduction

Scammers often attract customers to their Web sites through large-scale advertising via spam email messages or spam comments on blogs, social networks, etc., or through black-hat advertising and search-engine optimization. Irrespective of the advertising vector used, the end result is the same — scammers abuse URLs to profit from unscrupulous activities.

To discover new avenues for intervention, security researchers often perform observational, data-driven studies to reason about attacker resources and the underlying business models. Sometimes researchers can opportunistically benefit from publicly available data (Chapter 4), but generally we have to gather data about illicit ecosystems ourselves. A natural approach for collecting high-quality data about scammer infrastructure is to directly engage with it like a victim would. For example, using a tool to automatically visit spam-advertised URLs in a browser engages the network infrastructure of an attacker in a manner very similar to real users. Furthermore, when studying large-scale campaigns that abuse millions of domains, it is necessary to use an automated tool that can engage a sufficiently large segment of the ecosystem to identify major actors, common payment mechanisms, and other resources. Thus, we developed an efficient Web crawler to study scams heavily reliant on domain abuse.

At its core, a Web crawler can use any tool for making Web requests. However, not all tools are created equal. Simple tools such as wget are inadequate for comprehensively studying fraudulent online markets because they do not emulate a real user well, and thus do not always comprehensively elicit malicious behavior from Web sites. For example, in Chapter 3, we discovered many fraudulent Web sites that use simple JavaScript to create hidden DOM elements for affiliate fraud. Such behavior is completely opaque to simple tools that do not execute scripts. The challenges of mimicking a real user with high fidelity and of eliciting the relevant malicious behavior from Web sites motivated us to use a modern browser in our crawler. Over time, we instrumented two popular browsers, Firefox and Google Chrome, with add-ons to emulate real user activity while crawling.

Besides emulating a real user, we also faced unexpected challenges stemming from the adversarial nature of our studies. In any adversarial system, every actor plays the dual roles of "attacker" and "defender". We learned that scammers have become adept at evading defensive crawlers like ours, and respond with deterrence or aggression. Thus, we were forced to build evasive capabilities into our crawlers as well. Broadly, we evade detection by intelligently rate-limiting the number of visits per domain and by diversifying the range of IP addresses we use.

Over the years we thus learned how to build an efficient Web crawler that can collect data at scale in an adversarial system. In this chapter, we describe all the building blocks needed to build a Web crawler like ours, and the commonly desired features that we implemented in our crawler. Furthermore, we explain some of the important rules of thumb for future architects of such crawlers.

This chapter is organized in the following manner. Section 2.3 describes the evolution of our crawler's architecture to accommodate the needs of different studies, such as crawling millions of spam-advertised domains [49] and crawling to detect affiliate fraud (Chapter 3). Section 2.4 describes the different capabilities we built into the custom browser extensions to extract various features from the visited Web pages. Next, in Section 2.5, we discuss the evolution of our crawling in response to detection by scammers. Finally, Section 2.6 summarizes our key takeaways from building and maintaining crawlers to study adversarial ecosystems.

2.2 Related Work

Crawling is a common approach for data collection on the Web. Perhaps the most popular publicly available crawling dataset is the Common Crawl [2] corpus. The 149TB of Common Crawl data from August 2015 contains Web content from 1.84 billion Web pages along with the HTTP headers for all requests. However, the Common Crawl data is unsuitable for studying fraudulent ecosystems because the Common Crawl crawler is easily detectable (it uses a special User-Agent field set to CCBot). Thus, an attacker can easily detect and evade the Common Crawl crawler by looking at the User-Agent header in the HTTP request. Furthermore, even if an attacker fails to actively evade the Common Crawl crawler, the data corpus is still incomplete for studying scam campaigns because Common Crawl actively attempts to avoid spam and other "low-quality" URLs while crawling [97]. In fact, we manually verified that the Common Crawl corpus did not contain the fraudulent URLs we discovered while studying affiliate fraud (Chapter 3).

Crawling attacker-controlled URLs such as those advertised in spam is the most natural way to engage the infrastructure that directly underlies the attacker business model, and has long been a standard technique among security groups in both academia and industry. Academics have used crawling for a variety of measurement problems including the network characteristics of spam relays [70], Web hosting [45], phishing sites [59], blacklist effectiveness [77], spamming botnets [39, 96], and fast-flux networks [39], just to name a few. While most of these studies focus on the inferences derived from the collected data, we describe the challenges encountered in building and maintaining crawler infrastructure.
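To make the evasion concrete, consider how trivially a server can cloak against a self-identifying crawler. The following Python sketch, using Flask purely for illustration, is our hypothetical reconstruction of the idea rather than code from any actual scam site:

    # Sketch of why CCBot-labeled crawls are easy to evade: a scam server
    # can cloak based on the User-Agent header alone. Illustrative only.
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/")
    def landing():
        ua = request.headers.get("User-Agent", "")
        if "CCBot" in ua:                  # Common Crawl identifies itself
            return "Nothing to see here."  # benign page for the crawler
        return "Cheap meds, 80% off!"      # scam content for everyone else

A crawler that instead presents the headers of a stock browser, as ours does, never triggers this check.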

2.3 Architecture

As late as 2007, researchers could successfully rely on command-line tools (e.g., curl) to request Web content from attacker-controlled servers [8]. However, as Web environments have grown increasingly complex, attackers have adopted more sophisticated techniques to lead customers to their Web sites while simultaneously avoiding automated crawlers. Along with all popular Web sites, scam sites have also adopted scripting languages (e.g., JavaScript, ActionScript) to dynamically generate content on the fly, sometimes conditionally based on user actions. For example, while studying affiliate fraud on the Web (Section 3.3.3), we observed that the fraudulent affiliate in control of bestwordpressthemes.com caused the user's browser to fetch a fraudulent affiliate cookie only when another specific cookie was not already present on the user's browser. Under these circumstances, a simple command-line tool cannot fully trigger the malicious functionality of a Web site.

Table 2.1. Different versions of the Web crawler we built for studying fraudulent ecosystems.

    Crawler     Year   Salient Features
    Selena      2009   Based on Selenium Web testing framework
    Oliver-I    2010   Custom built; uses Firefox and is tightly coupled with Postgres database
    Oliver-II   2012   Decoupled from database; uses Hadoop file storage system
    Stallone    2012   Stand-alone version of Oliver-II
    Charlotte   2015   Uses Google Chrome; offers flexibility in data storage

Broadly, crawling today requires using a popular browser to ensure fidelity because, as extensible platforms, carefully constructed browser add-ons can make browsers act more like real users. In our case, we instrumented two popular browsers, Google Chrome [33] and Firefox [62], using custom extensions. While using a modern browser has several advantages, crawling URLs using a full browser greatly increases the CPU and memory resources required; connection timeouts from blacklisting and unreachable domains further tie up resources, preventing quick recycling (although some of this can be addressed through per-machine parallelism and configuring kernels to provision more network resources). Thus, for most large-scale projects it is a requirement to dedicate a cluster of machines to comprehensively study an ecosystem. Consequently, we designed the Web crawler to run efficiently on a cluster. As with any large system, we iterated over the crawler design several times to implement new features and to address the limitations of the existing design. Table 2.1 shows the different versions of the crawler that we built over the years. In this section, we describe the system design and the key changes we made in each iteration of the crawler.

2.3.1 Selena

In 2009, we used Selenium, a Web testing framework, to create our first crawler, called Selena, which ran on a cluster of 15 nodes. Specifically, we used Selenium-Grid [72], which allowed running multiple browser instances concurrently on a cluster and automatically requesting features (e.g., HTML content) from the visited pages. However, since Selenium was developed as a framework for testing Web applications under the programmer's control, we faced challenges in using it for crawling Web sites under the attacker's control. Since Selenium was designed for testing Web applications on many different browsers, the resulting crawler was too complex and inefficient for our needs, thereby making maintenance difficult. For example, implementing our own code for taking a screenshot of a Web page proved very complex. Thus, we abandoned Selenium in favor of building our own crawling infrastructure from scratch.
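For context, the basic visit-and-extract step that Selena performed maps directly onto Selenium's Python bindings. The sketch below is illustrative only; it is not Selena's actual code, which ran on Selenium-Grid across the cluster:

    # Illustrative sketch of the kind of crawl step Selena performed with
    # Selenium; not the original Selena code.
    from selenium import webdriver

    def crawl(url: str) -> dict:
        driver = webdriver.Firefox()           # one managed browser instance
        try:
            driver.set_page_load_timeout(180)  # bound slow or stalled loads
            driver.get(url)                    # executes JavaScript, unlike wget
            features = {
                "url": driver.current_url,     # final URL after redirects
                "html": driver.page_source,    # DOM after scripts have run
            }
            driver.save_screenshot("page.png")
            return features
        finally:
            driver.quit()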

2.3.2 Oliver-I

We built the next version of the crawler, called Oliver-I, in 2010. Figure 2.1 shows the overall design of Oliver-I. At its core, it uses the Firefox browser instrumented with a custom extension for visiting Web sites. For storing the workload of URLs to be crawled and the results from Web visits, we use Postgres, a relational database. Oliver-I runs on a cluster of 30 nodes. To fully utilize a node, we run multiple instances of Firefox on each node. A node-controller process manages the interaction between the machines and the database. It interacts with the browsers using thread-controllers. Each thread-controller manages a single instance of Firefox.

Figure 2.1. System design for Oliver-I, the second version of the Web crawler built in 2010.

The crawler operates as follows. First, the node-controller retrieves URLs from the crawl queue in the database and stores them in a local queue shared between all of the thread-controllers. A thread-controller for a browser pops URLs from the local queue for visiting. The thread-controllers are responsible for requesting various features about the Web visit (e.g., page screenshots) using API calls exposed by the custom extension running in the browser. The thread-controller combines all the features from a visit and puts them in a local result queue, again shared between all of the thread-controllers. The node-controller empties the result queue by performing batch insertions into the database. This design is replicated on each of the 30 machines across the cluster. To ensure maximum hardware utilization, we run 40 instances of Firefox concurrently on each machine.

Besides emulating a real user, performing a study relying on Web crawling requires the cluster to be robust to failure conditions in order to crawl in steady-state for a period of time (several months in our case). To tolerate intermittent network or server issues, Oliver-I makes multiple attempts to visit a URL before deciding it is not valid. To prevent stalled connections from idling a browser instance indefinitely, it times out long page loads after multiple minutes. It also detects browser failures (e.g., a hung process) with heartbeat requests from the thread-controller every 15 seconds; if a browser does not respond, the thread-controller restarts it. To ameliorate the effects of any malware infections, memory leaks, or other resource leaks, we reboot the crawler and its browsers on every machine in the crawling cluster every 24 hours.

We used this crawler to visit millions of spam-advertised domains [49]. During busy days of crawling spam feed corpuses, we crawled over 600K URLs/day with brief bursts corresponding to peak rates of 2M URLs/day.
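The control loop of a thread-controller is simple in outline. The following Python sketch captures its general shape; the names and the browser wrapper are hypothetical, not the dissertation's actual implementation:

    # Hypothetical sketch of a thread-controller main loop (Oliver-I style).
    import queue

    PAGE_LOAD_TIMEOUT = 180   # seconds before a stalled load is abandoned
    HEARTBEAT_INTERVAL = 15   # seconds between browser liveness checks

    def thread_controller(url_q: queue.Queue, result_q: queue.Queue, browser):
        while True:
            try:
                url = url_q.get(timeout=1)   # local queue filled by node-controller
            except queue.Empty:
                return
            if not browser.heartbeat():      # hung browser process? restart it
                browser.restart()
            try:
                features = browser.visit(url, timeout=PAGE_LOAD_TIMEOUT)
            except TimeoutError:
                features = {"url": url, "error": "timeout"}
            result_q.put(features)           # node-controller batch-inserts these

Here browser stands in for the per-instance wrapper that issues feature requests via API calls to the custom extension.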

2.3.3 Oliver-II

We visited millions of spam-advertised URLs using Oliver-I steadily over a few months [49]. Recognizing the value of crawling to gather large-scale data, we allowed other members of our research group to use the crawler as well. Even though we built several of the features described in Section 2.4, our original system design was inherently limited in catering to the needs of different projects and to the scale of measurement studies. Thus, we iterated over the original system design to create Oliver-II in 2012. The new design offers flexibility for the needs of different projects and is much more scalable. However, we only focused on improving the overall system design and left the core browser instrumentation intact from Oliver-I. Thus, Oliver-II has the same capabilities for interacting with Web pages as Oliver-I, and it handles hung browsers and page errors in the same manner as well.

Originally, we only needed a specific set of features from every Web visit to a spam-advertised URL, which we stored in a Postgres database. Like any other relational database, Postgres provides guarantees of consistency, atomicity, and durability. The collected data was, therefore, consistent and easy to query and analyze. However, as the scale of the data grew, we reached the maximum write-through rate to the database, and adding more nodes to the cluster had no effect on the rate of crawling. Thus, we had to revisit our system design, and we decoupled the database from the crawler. We adopted the Hadoop file storage system, which allows storing data in files that are replicated across a cluster for redundancy.

Another change we made from Oliver-I was to allow users to request only the features necessary for their studies. For example, while we collected the DOM content, screenshots, HTTP headers, embedded page components, etc., for studying spam-advertised domains, we only needed the Set-Cookie headers and the DOM content of a select few elements to study affiliate fraud (Chapter 3). Allowing users to request only specific features from their crawl jobs allows us to use the file system storage more efficiently. Furthermore, decoupling from the relational database in favor of a loosely structured file storage system enables us to implement multiple new features (e.g., the ability to execute arbitrary JavaScript) without being constrained to a specific database schema.
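As a hypothetical illustration of this per-project flexibility, a crawl job submitted to Oliver-II might amount to little more than a list of URLs plus the features to extract; the actual job format is not shown in the dissertation:

    # Hypothetical per-project crawl job specification; field names are
    # illustrative, not Oliver-II's actual format.
    job = {
        "urls": ["http://example.com/landing"],
        # Request only the features this study needs, keeping storage small:
        "features": ["set_cookie_headers", "dom_elements"],
        # Optional context-specific extraction (see Section 2.4):
        "inject_js": "document.querySelectorAll('iframe').length;",
    }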

2.3.4 Stallone

We recognized that some users wanted to use a crawler but did not need the resources of an entire cluster. Therefore, we created a stand-alone version of Oliver-II called Stallone, and made it publicly available.1 By default, Stallone stores the crawling results on the local file system.

1. https://github.com/nchachra/Stallone

2.3.5 Charlotte

As mentioned before, when we redesigned the crawler to create Oliver-II, we did not modify the core browser instrumentation. Thus, Oliver-II uses the same browser extension as Oliver-I. Eventually, as with any software engineering system, the legacy code from 2010 became outdated and limited the performance of the crawler. When we needed crawling to study affiliate marketing abuse, we decided to experiment with newer technology to implement a prototype by instrumenting Google Chrome. In 2010, Firefox was the only browser that offered sufficient functionality to browser extensions, but by 2015, Google Chrome had become a major player in the browser extension ecosystem. Thus, we created a new prototype for the crawler, called Charlotte, using Google Chrome as the browser and a much simpler system architecture. Charlotte is a smaller, proof-of-concept Web crawler with only four nodes. The system design for this crawler is shown in Figure 2.2. As with Firefox, we again instrumented Google Chrome using a custom extension. We used Charlotte for visiting approximately 500K domains to detect affiliate cookies from Web visits (Chapter 3).

Figure 2.2. System design for the proof-of-concept crawler, Charlotte, built in 2015.

The new design consists of two queues on a fast key-value store called Redis. Redis provides atomic push and pop actions for queues, and thus supports concurrent requests. The extension interacts with the Redis database over HTTP using asynchronous requests through an off-the-shelf server, Webdis, which serializes HTTP requests into Redis requests.

Once the node-controller starts the browser instances on a node, the browser extension starts crawling. It grabs a URL from the Redis crawl queue (along with a list of features to extract from the URL visit), visits the URL, and saves the visit features in a JavaScript object. This object is serialized and saved in the result queue on Redis. For the study in Chapter 3, a separate process emptied the Redis result queue into a Postgres database for ease of analysis. However, our storage schema is not coupled with crawling, and one could equivalently transfer the data from the result queue onto a file system or a different relational database.

The key differences between the architecture in Figure 2.1 and Figure 2.2 are the absence of thread-controllers and the simplification of the node-controller. While the node-controller in Figure 2.1 manages 30 different thread-controllers and interacts with the Postgres database, the node-controller in Figure 2.2 simply observes the machine utilization to control the number of running browser instances for maximum hardware utilization. Beyond starting and stopping browser instances and ensuring that all the required processes are running, the node-controller performs no other task. Furthermore, since the extension itself can intelligently visit URLs and save all the features directly, we do not need a separate thread-controller to individually request different Web visit features by making API calls to the browser extension. Charlotte handles failure conditions in the same manner as Oliver-I. Specifically, it restarts hung browsers and times out every page load after three minutes.

Overall, we found it much simpler to rewrite a crawler after maintaining existing crawlers for multiple years. While one can attribute the ease and expedience of creating a new crawler to our experience, using the improved API available in browsers and the newly available off-the-shelf tools such as Redis certainly proved beneficial in simplifying the crawler architecture, thereby making it easier to maintain.
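A drain process of this kind is only a few lines. The Python sketch below is one plausible shape for it, assuming a results queue in Redis and a simple visits table in Postgres; both names are hypothetical:

    # Hypothetical sketch of the process that empties Charlotte's Redis
    # result queue into Postgres; queue name and table schema are illustrative.
    import json
    import redis      # pip install redis
    import psycopg2   # pip install psycopg2-binary

    r = redis.Redis(host="localhost", port=6379)
    pg = psycopg2.connect(dbname="crawl")

    while True:
        # BLPOP blocks until a serialized visit-result object is available.
        _, raw = r.blpop("results")
        visit = json.loads(raw)
        with pg, pg.cursor() as cur:  # commits the transaction on exit
            cur.execute(
                "INSERT INTO visits (url, features) VALUES (%s, %s)",
                (visit["url"], json.dumps(visit.get("features", {}))),
            )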

Table 2.2. The table shows some of the supported features for interacting with Web pages and the corresponding challenges we faced.

    Feature                          Challenge
    Issuing clicks                   Pop-ups requiring clicks to proceed
    Setting custom headers           Web site cloaking
    Reconstructing redirect chains   Limited extension developer API in 2010

2.4 Browser Instrumentation using Custom Extensions

When studying a fraudulent online marketplace, a researcher might desire a variety of features about the attacker-controlled Web pages. While some of the features (e.g., HTML content of a page) might be useful for studying almost any fraudulent ecosystem, other features might be relevant only in specific use-cases (e.g., HTTP cookies). Over the course of a few years, we built and deployed a variety of features required for different projects. In some cases, we also had to build features primarily as a response to the adversarial nature of the studies. We discuss some of the capabilities we built into the different versions of the crawler below, and the corresponding challenges we faced (Table 2.2). All of the functionality described below is implemented within the custom extensions used for instrumenting browsers.

Issuing clicks

While crawling spam-advertised URLs [49], we noticed that spammed sites increasingly used more sophisticated redirection techniques designed to trick users, but also to make crawling more difficult. In particular, sites use JavaScript to present popups to users that require a mouse click event to proceed to the final landing page, and use image overlays on the page to the same effect. Thus, we added the ability to issue clicks in our browser extension such that it detects and clicks on popups and images to trigger these sophisticated "redirects". While it is simple to issue clicks using a browser extension in both Google Chrome and Firefox, the challenge is to stay current with the different techniques scammers deploy, so that one can add appropriate capabilities to the crawler.

Setting custom request headers

To crawl URLs in search results to explore Web site cloaking and black-hat search-engine optimization (SEO) activity in [90], we observed the need for yet more sophisticated mimicry to emulate real users. In particular, crawling a cloaked page returns different results depending on the HTTP Referer and User-Agent fields. Sites decide whether a request comes from the result of a search based upon the contents of the Referer, cloaking the contents otherwise. Sites further return different content depending on the operating system specified in the User-Agent string (e.g., a scam site will sell fake anti-virus software to Windows-based visitors and offer an iPod scam to Mac-based visitors). A crawling system for such URLs therefore requires the further ability to parameterize specific HTTP fields for each URL crawled. Thus, we added the ability to modify outgoing HTTP request headers in our custom extensions.
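The underlying cloaking test is easy to illustrate outside the browser. The following Python sketch fetches the same URL with different Referer and User-Agent values; our crawler performs this header manipulation inside the browser via the extension, so this is a conceptual illustration rather than our implementation, and the naive content comparison ignores that dynamic pages can differ across fetches for benign reasons:

    # Illustrative check for Referer/User-Agent cloaking at the HTTP level.
    import requests  # pip install requests

    def fetch(url, referer, user_agent):
        return requests.get(
            url,
            headers={"Referer": referer, "User-Agent": user_agent},
            timeout=30,
        ).text

    url = "http://example.com/"  # hypothetical cloaked page
    as_searcher = fetch(url, "https://www.google.com/search?q=cheap+meds",
                        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    as_crawler = fetch(url, "", "wget/1.21")
    print("possibly cloaked" if as_searcher != as_crawler else "same content")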

Redirections

While crawling spam-advertised domains, we observed significant use of URL redirection to bring customers to storefront sites. Generally, visiting a spammed URL results in one or more redirects before finally landing the user on a storefront Web site where the user can buy goods such as counterfeit pharmaceuticals. Often, the spammed URLs are hosted on abused free hosting domains (e.g., imgur.com) or cheap bulk-purchased domains (Section 4.4.3), because free hosting and bulk-purchased domains impose very low cost on spammers when the URLs are blacklisted or taken down. Even when the redirects are simple 301 or 302 HTTP redirects, in 2010, we found it challenging to reconstruct the redirect chains of URLs from the visited URL to the final landing page. Most of the challenge arose from the lack of appropriate API calls exposed to Firefox extension developers, and the large number of embedded page components (e.g., advertisements) that often result in hundreds of network requests for every visited page. Fortunately, when we updated our crawler in 2015, we found that the browser extension API had become much richer over the years and, as a result, we found it straightforward to reconstruct the redirect chains for URLs requesting fraudulent affiliate cookies (Chapter 3).
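For HTTP-level redirects specifically, the chain is recoverable even without a browser, as the Python sketch below illustrates; JavaScript- and click-driven "redirects", however, are exactly the cases that required the instrumented browser:

    # Illustrative reconstruction of an HTTP-level (301/302) redirect chain.
    import requests

    resp = requests.get("http://example.com/spammed-url", timeout=30)
    chain = [r.url for r in resp.history] + [resp.url]  # hops in order
    for hop in chain:
        print(hop)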

Querying specific DOM elements

For some projects, we need to analyze specific DOM elements in a Web page. For example, while studying affiliate fraud, we wanted to analyze whether a DOM element associated with a specific Web request was visible to the end user. Extracting DOM elements by processing raw HTML and HTTP headers, as collected using simple tools like wget, is extremely challenging. However, modern browsers are complex pieces of software built to gracefully handle the vast majority of programming errors when parsing the page content into a structured DOM tree. Furthermore, browsers expose multiple functions to extension developers for querying specific page elements and their style properties. Thus, a modern crawler can effectively delegate all the actions needed to gather data to a browser instrumented with a custom extension, thereby significantly reducing the workload of post-processing scripts. In fact, in Chapter 3, the browser extension installed on Google Chrome performs all of the actions we needed for studying affiliate fraud on the Web. The custom extension causes the browser to automatically visit a page; it then analyzes the incoming HTTP Set-Cookie headers, and parses the styling properties of the DOM elements corresponding to the cookie requests.
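Once the extension has extracted an element's style properties, the visibility analysis reduces to a heuristic over them. The following Python sketch shows one plausible form of such a heuristic; the exact rules used in the study may differ:

    # Hypothetical visibility heuristic over extracted style properties; an
    # element the user cannot see (e.g., a hidden or 1x1 iframe fetching an
    # affiliate URL) is a strong signal of cookie-stuffing. Illustrative only.
    def pixels(value: str) -> float:
        try:
            return float(value.replace("px", ""))
        except (AttributeError, ValueError):
            return float("nan")  # unknown size; comparisons below stay False

    def is_hidden(style: dict) -> bool:
        if style.get("display") == "none" or style.get("visibility") == "hidden":
            return True
        return pixels(style.get("width", "")) <= 1 or pixels(style.get("height", "")) <= 1

    print(is_hidden({"width": "1px", "height": "1px"}))  # True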

Executing arbitrary scripts on demand

Over time, as described in Section 2.3, we exposed the crawler to all of the members in our research group. One commonly requested feature was the ability to extract context-specific information from pages (e.g., all frames containing advertisements). Instead of modifying the custom extension for every feature request, we allowed injection of arbitrary custom JavaScript (through eval) on the visited Web pages as requested by the crawler user.

Besides the features listed above, we also implemented a few features in the crawler that we do not discuss here because we did not face any significant challenges in their implementation. Some examples include the ability to screenshot a Web page, to store the server IP address, to store the contents of all embedded components such as included CSS or JavaScript files, to save HTTP request and response headers, etc. All of these features are again implemented in the custom browser extensions for both Google Chrome and Firefox.

2.5 Responding to Deterrence

In an adversarial environment, both the attacker and the defender attempt to evade detection by the opponent. While studying spammed URLs [49], we observed that spam sites blacklisted IP addresses we used for crawling.2 To counter blacklisting, in every version of the crawler starting with Oliver-I, we have used a combination of prevention and detection.

To avoid being blacklisted, we tunnel HTTP requests through proxies running in multiple disparate IP address ranges, using various cloud hosting and IP address reselling services, as well as address blocks loaned to us from individuals and via experimental allocations from the Regional Internet Registries. We then randomize HTTP requests across the address ranges to minimize the footprint of any single IP address for any given site. Blacklisting manifests either as DNS errors (the name server is also commonly an element of scam infrastructure), 5xx HTTP error codes, or connection timeouts. We detect blacklisting by monitoring the rates of such errors and reacting when short-term rates well exceed long-term rates. In response, we retry requests using a different IP address range. Once again, the lessons we learned in 2010 proved useful in 2015 when we created Charlotte. While studying affiliate fraud (Chapter 3), we learned of a high-profile case of a fraudulent affiliate, Shawn Hogan, indicted for wire fraud of $15.5M against eBay's affiliate program [31]. Shawn Hogan only perpetrated affiliate fraud once per IP [16], again strongly suggesting the need for a diverse IP range for studying fraud.

Besides blacklisting crawler IPs, we also observed more aggressive actions from spammers, such as an implicit DDoS on crawlers via spam poisoning. In particular, the Rustock bot started emitting large amounts of spam e-mail containing URLs with random .com domains (literally millions of both real and unregistered domains, none of which was truly being advertised [14]). The purpose of this campaign appears to be both poisoning blacklisting services with large numbers of false positives and overwhelming crawlers such as ours with timeouts and diverse useless page loads. When this behavior started in September of 2010, we were able to manually identify some lexical patterns used across most of these URLs and tried to filter them out using regular expressions. This approach was ultimately unsuccessful as the operators of Rustock changed their poisoning code to become ever more random. To address this issue, we added state to our crawler and, instead of blindly crawling all URLs, use a method that tracks the appearance of individual registered domains over time. Thus, Oliver-I schedules crawls based on how frequently a registered domain has been seen. This approach prioritizes new domains, minimizing the overhead and blacklisting risk of re-crawling the same domain many times, but without crawling the millions of domains that have only been seen once.

2. In our related activities monitoring underground forums, and through collaborations with similarly focused researchers, we have found a range of blacklist firewall configurations designed to specifically block traffic from various security groups, including our own. This blacklisting includes both individual IP addresses as well as entire address ranges, /24 and larger, associated with particular security organizations.
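A minimal sketch of such a blacklisting detector follows, using exponentially weighted short- and long-term error rates; the window parameters and threshold are illustrative assumptions, not the values we used:

    # Hypothetical blacklisting detector: compare a short-term error rate
    # against a long-term baseline and rotate IP ranges when it spikes.
    class BlacklistDetector:
        def __init__(self, short=0.2, long=0.01, threshold=3.0):
            self.short_rate = 0.0   # fast-moving EWMA of error observations
            self.long_rate = 0.0    # slow-moving baseline
            self.short, self.long, self.threshold = short, long, threshold

        def record(self, is_error: bool) -> bool:
            x = 1.0 if is_error else 0.0
            self.short_rate += self.short * (x - self.short_rate)
            self.long_rate += self.long * (x - self.long_rate)
            # React when short-term errors well exceed the long-term rate.
            return self.short_rate > self.threshold * max(self.long_rate, 0.01)

    detector = BlacklistDetector()
    if detector.record(is_error=True):
        pass  # retry the request via a different proxy IP address range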

2.6 Summary

Data collected from Web crawling can provide valuable insight into the underground activities on the Internet. Attackers increasingly make use of techniques to treat real users and automated crawlers differently. Thus, crawlers today need to be sophisticated enough to mimic a real user well and gather useful data. We built such a crawler by instrumenting a modern browser using a custom extension and used it across a series of projects. The crawler can collect a variety of features including the HTML content of a page, scripts and stylesheets embedded within a page, all network headers, screenshots of Web pages, etc. It can also perform specific actions such as "clicking" or executing arbitrary JavaScript on demand.

We were also forced to use a wide range of proxy IP addresses to protect our crawler nodes from being blacklisted by the spammers. Cloud hosting and IP address resellers proved to be expedient and inexpensive solutions. Finally, the scale of spammer activities and the resource usage of browsers necessitated the use of multiple instances of browsers running across an entire cluster to successfully crawl a large number of URLs in a timely manner.

Even though instrumenting a modern browser provides flexibility in interacting with the crawled Web page, such as by clicking, we have had to stay current with the evolution of various techniques used by spammers to continuously update our crawler over the years. We describe the different versions of crawlers we built between 2009 and 2015 to make the crawler more versatile and easier to maintain.

In the next chapter, we analyze abuse in the affiliate marketing ecosystem. We deploy the crawler to discover Web sites stealing commissions on user purchases from popular online retailers such as Amazon.

2.7 Acknowledgements

Chapter 2, in part, is a reprint of the material as it appears in Proceedings of the 4th USENIX Workshop on Cyber Security Experiment and Test (CSET). Chris Kanich, Neha Chachra, Damon McCoy, Chris Grier, David Y. Wang, Marti Motoyama, Kirill Levchenko, Stefan Savage, Geoffrey M. Voelker, 2011. The dissertation author was the primary investigator and author of this paper.

Chapter 3

Characterizing Affiliate Marketing Abuse

In this chapter, we provide a case study of detecting fraud in an ecosystem using ground truth data that we collect using the Web crawler infrastructure described in Chapter 2. Specifically, we use the page content and the HTTP headers exchanged with attacker-controlled Web sites to measure and characterize the techniques deployed by unscrupulous affiliates to fraudulently earn commissions on purchases from major e-commerce merchants such as Amazon and Macy's.

3.1 Introduction

Affiliate marketing is a popular form of pay-per-action or pay-per-sale advertising whereby independent marketers are paid a commission on "converting traffic" (e.g., clicks that culminate in a sale). Heralded as "the holy grail" of online advertising a decade ago [83], affiliate marketing has become prevalent across the Web, complementing more traditional forms of display advertising.

Affiliate marketing is often described as a "low-risk" proposition for merchants, as they pay out only upon the successful completion of sales. Consequently, affiliate marketing attracts significant investment from almost every major online retailer, some of whom also invest in multiple third-party affiliate advertising programs. Similarly, it is an attractive proposition for independent marketers as they can create online content (e.g., book reviews) that can be monetized simultaneously as a means to attract likely converting traffic and to host contextual advertising. For every click that converts into a sale, affiliate marketing is frequently much more profitable than display ads because the commission earned is typically between 4 and 10% of the sales revenue [6, 84].

Like almost all economic activity on the Web, affiliate marketing also attracts the attention of fraudsters looking to make easy cash. Affiliate fraud garnered widespread media attention in 2014 with the imprisonment of Shawn Hogan [86], an eBay affiliate indicted for wire fraud of $15.5M through the use of a technique called cookie-stuffing [31], whereby the Web cookies used to determine the likely source of user traffic are overwritten without the user's knowledge. There have been multiple similar legal disputes over affiliate marketing since then [25]. Besides media attention, affiliate marketing has also been a subject of academic research to understand the incentives in the ecosystem and the extent of affiliate fraud [26, 78].

We study the affiliate fraud ecosystem using ground truth data we collect by crawling hundreds of thousands of domains with a modern browser instrumented using a custom extension (Chapter 2). Our extension, AffTracker, can identify affiliate cookies for six of the top affiliate programs. From crawling likely sources of cookie-stuffing, we find that large affiliate networks such as CJ Affiliate (formerly Commission Junction) and Rakuten LinkShare (recently renamed to Rakuten Affiliate Network) are targeted by cookie-stuffing orders of magnitude more than affiliate programs run by merchants themselves, such as the Amazon Associates Program. Lower attempted fraud coupled with the much higher use of evasive cookie-stuffing techniques against in-house affiliate programs suggests that such programs enjoy stricter policing, thereby making them more difficult targets of fraud.

Figure 3.1. Different actors and revenue flow in the affiliate marketing ecosystem. The left half of the figure depicts a potential customer receiving an affiliate cookie, while the right half shows the use of the affiliate cookie to determine payout upon a successful transaction.

Our analysis also shows that retailers in the Apparel, Department Stores, and Travel and Hotels sectors of e-commerce are disproportionately targeted by affiliate fraud on the Web, usually through domains typosquatted on the merchant's trademarks. Furthermore, we identify several browser extensions that are complicit in affiliate fraud. We find that all of these extensions — some with thousands of users each — earn commissions by silently modifying the merchant URLs visited by the extension users while browsing the Web. Finally, we evaluate data from a two-month in situ user study with 70+ users and find that affiliate marketing is dominated by a small number of affiliates, while cookie-stuffing fraud is rarely encountered. Overall, our targeted crawl and user study both suggest that the problem, while real, appears to be less prevalent than suggested by previous reports.

3.2 Background

Online merchants benefit from affiliate marketing through customized and targeted advertising for their products. For example, when an affiliate reviews a bicycle on a blog dedicated to biking, the bicycle merchant can receive sales from the readers of the blog without having to produce an advertising creative or advertise directly to the blog subscribers. Instead, the merchant pays a commission to the affiliate for each such sale. To recruit affiliates for advertising goods and services, an online merchant can either run its own affiliate program or join one run by a larger affiliate network. While some merchants like Amazon and HostGator run their own affiliate programs, most online retailers (particularly those whose expertise is on the brick-and-mortar side of the business) market through large affiliate networks such as CJ Affiliate and Rakuten LinkShare.

Figure 3.1 provides an overview of the affiliate marketing ecosystem where a merchant is part of a large affiliate network that acts as a link between affiliates, who are typically content publishers, and merchants. Affiliates who sign up for an affiliate network can choose to market for one or more merchants who are members of the network. Affiliate networks generally assign unique identifiers to all affiliates (affiliate IDs) and merchants (merchant IDs). Upon signup (1), affiliates receive special links from the affiliate network (2) that encode these identifiers for the affiliate and the merchant for whom the affiliate is advertising. An affiliate includes these affiliate links in published content (e.g., a product review site) such that, when a potential buyer visits the affiliate's Web site and clicks on a link (3), it redirects the visitor to the merchant site via the affiliate program (4). The affiliate link GET request to the affiliate program returns an HTTP cookie (i.e., an affiliate cookie) that encodes the affiliate ID and the merchant ID (5). Since cookies can be stored persistently in a browser, these cookies can uniquely identify the referring affiliate for up to a month after the initial visit. If the user visits the merchant site during this period and starts a transaction (6), the user's browser requests the affiliate program's tracking pixel embedded in the merchant's checkout flow (7). The browser matches the domain of the affiliate cookie in the user's browser with the domain of the affiliate program and sends the affiliate cookie along with the request. When the user completes the transaction and pays the merchant (8), the merchant pays the affiliate network for sourcing the sale (9). Finally, the affiliate network pays the referring affiliate between 4% and 10% of the transaction amount as commission (10).1 In-house programs work similarly, with the network replaced by infrastructure maintained by the merchant itself.

1 This is a general description. The commission rates, the allowed duration for conversion, and the implementation details (including the affiliate URL and cookie structures) can vary considerably among affiliate programs.

An affiliate cookie remains in a user's browser until it expires, the user deletes it manually, or it is overwritten by an affiliate cookie belonging to a different affiliate participating in the same affiliate program. Thus, if a user clicks on multiple links for the same merchant from several affiliates participating in the same affiliate program, the cookies are overwritten and only the last affiliate whose cookie is present on the user's browser at the time of sale earns a commission. These behaviors — that the presence of a cookie determines payout and that the most recent cookie "wins" — are at the core of the cookie-stuffing technique that allows fraudulent affiliates to obtain illicit commissions.

In Figure 3.1, instead of using the affiliate URL as a clickable link, a fraudulent affiliate may cause the browser to directly fetch her affiliate URL on a page she controls, without any explicit click from the user, thereby tricking the affiliate program into returning a cookie that identifies the fraudulent affiliate as the referrer for the user's transactions. As a result, not only does the affiliate program pay a non-advertising affiliate, but the fraudulent cookie overwrites any existing affiliate cookie that may already have been present, potentially stealing the commission from a legitimate affiliate. Furthermore, cookie-stuffing fraud is typically completely opaque to the end user, and it goes against the advertising guidelines issued by the Federal Trade Commission for marketers, which require declaration of any financial relationship with advertisers [29]. As a result, most affiliate programs explicitly forbid cookie-stuffing. For instance, the HostGator affiliate program states that "sales made through cookie stuffing methods will be considered invalid" [36].

In prior work, Moore et al. found several typosquatted domains that belong to fraudulent affiliate marketers [60]. Snyder et al. studied the extent to which users encounter affiliate fraud using the HTTP request logs of a public university [78]. They found only 15K affiliate cookie requests across 2.3 billion HTTP requests for multiple programs including the Amazon Associates Program; Snyder et al. do not consider any of the remaining major affiliate programs we include in this study. In another study, Edelman et al. explored the incentives of different players in the affiliate marketing ecosystem, and also used crawling to identify affiliate programs defrauded through adware, typosquatting, and search engine optimization (SEO) [26]. Our work furthers this area by characterizing cookie-stuffing techniques and the range of targeted networks and retailers. In addition, we perform a user study to characterize the prevalence of affiliate marketing and cookie-stuffing fraud.
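The "last cookie wins" payout logic at the heart of this fraud can be distilled into a few lines. The following Python sketch is purely illustrative; the names and data structures are ours, not any affiliate network's implementation:

    # Illustrative model of "last cookie wins" affiliate attribution.
    # Names and structures are hypothetical, not any network's actual code.

    def set_affiliate_cookie(jar, merchant_id, affiliate_id):
        """Each new affiliate cookie for a merchant overwrites the previous one."""
        jar[merchant_id] = affiliate_id

    def attribute_sale(jar, merchant_id):
        """At checkout, the affiliate whose cookie is present earns the commission."""
        return jar.get(merchant_id)

    jar = {}
    set_affiliate_cookie(jar, "merchant42", "honest_affiliate")  # user clicks a real review link
    set_affiliate_cookie(jar, "merchant42", "stuffer")           # hidden fetch stuffs a cookie later
    assert attribute_sale(jar, "merchant42") == "stuffer"        # the stuffer now earns the commission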

3.3 Methodology

In this section we describe how we measure cookie-stuffing fraud against six large affiliate programs: CJ Affiliate, Rakuten LinkShare, ShareASale, ClickBank, the Amazon Associates Program, and the HostGator Affiliate Program. While Amazon and HostGator run their own affiliate programs, the remaining four are consistently top-rated affiliate networks [65], which include well-known merchants such as Nordstrom, Lego Brand, GoDaddy, etc. First, we study the structures of affiliate URLs and cookies used by these programs so that we can identify the affiliate network, the targeted merchant, and the affiliate's ID. We then use a custom-built browser extension to identify affiliate cookies received while browsing, and use it for both the large-scale crawling and the user study.

Table 3.1. Examples of affiliate URLs and cookies for different affiliate programs.

    Affiliate Program            URL                                                       Cookie
    Amazon Associates Program    http://www.amazon.com/dp/...?tag=<affiliate-id>&...       UserPref=.*
    CJ Affiliate                 http://www.anrdoezrs.net/click-<publisher-id>-...         CLOAK=.*
    ClickBank                    http://<affiliate-id>.<merchant-id>.hop.clickbank.net/    q=.*
    HostGator                    http://secure.hostgator.com/~affiliat/...                 GatorAffiliate=.*
    Rakuten LinkShare            http://click.linksynergy.com/fs-bin/click?...             lsclick_mid=".*|-.*"
    ShareASale                   http://www.shareasale.com/r.cfm?...                       MERCHANT=<merchant-id>

3.3.1 Identifying Affiliate URLs and Cookies

Broadly, we identified affiliate URLs and cookies either by signing up for these programs ourselves, or by finding this information online. Table 3.1 shows how we parse out affiliate and merchant IDs from some example affiliate URLs and cookies. For CJ Affiliate, we only show how we identify the publisher ID because we are unable to identify the corresponding affiliate ID. Every CJ affiliate can have multiple publisher IDs, one for each site used for publishing affiliate marketing creatives. However, every publisher ID is uniquely associated with a single affiliate. As a result, we use the terms publisher ID and affiliate ID interchangeably when discussing CJ Affiliate in the following sections. Finally, the merchant is easy to identify because an affiliate URL eventually redirects to the merchant domain.
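To illustrate the kind of parsing this involves, the sketch below extracts identifiers from affiliate URLs with regular expressions. The patterns are simplified stand-ins for the formats summarized in Table 3.1; the actual formats used by each program are more involved, and the example IDs are made up:

    import re
    from urllib.parse import urlparse, parse_qs

    # Simplified, illustrative patterns; real affiliate URL formats vary and change.
    def parse_affiliate_url(url):
        parsed = urlparse(url)
        host, query = parsed.netloc, parse_qs(parsed.query)
        if host.endswith("amazon.com"):
            # Amazon Associates encodes the affiliate ID in the "tag" parameter.
            return ("Amazon Associates Program", query.get("tag", [None])[0])
        if host.endswith("anrdoezrs.net"):
            # CJ Affiliate encodes a publisher ID in the path, e.g. /click-<publisher-id>-...
            m = re.search(r"/click-(\d+)-", parsed.path)
            return ("CJ Affiliate", m.group(1) if m else None)
        if host.endswith("hop.clickbank.net"):
            # ClickBank uses <affiliate-id>.<merchant-id>.hop.clickbank.net subdomains.
            labels = host.split(".")
            return ("ClickBank", labels[0] if len(labels) >= 4 else None)
        return (None, None)

    print(parse_affiliate_url("http://www.anrdoezrs.net/click-1234567-10987"))
    # -> ('CJ Affiliate', '1234567')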

3.3.2 User Study

In our user study, we examine how often users click on affiliate links while browsing the Web, and identify affiliate cookies using a custom-built browser extension for Google Chrome called AffTracker.2 AffTracker gathers information about every affiliate cookie it observes in the Set-Cookie HTTP response headers while a user is browsing. Upon detection of an affiliate cookie, AffTracker parses out the affiliate and merchant identifiers and the rendering information, including size and visibility, for the DOM element that initiated the affiliate URL request.

AffTracker also records the redirect chain for the requests that result in affiliate cookies. Besides notifying the user about the cookie, AffTracker submits this information to our server, which stores it in a Postgres database. By advertising to friends and colleagues, we obtained browsing data from 74 installations between March 1, 2015 and May 2, 2015. Using a locally generated unique ID, we can attribute affiliate cookies to specific users without collecting any personally identifiable information (PII). While we can identify the final attributes of the DOM elements that cause a browser to fetch affiliate links, we cannot automatically determine how such DOM elements are generated. Upon manual inspection we came across several affiliates who use JavaScript or ActionScript to dynamically generate hidden images and iframes that then request affiliate URLs; however, we are unable to quantify this phenomenon. Also, our user study does not have a completely random sample of users, and is likely biased towards savvy computer users. We discuss the results of our user study in Section 3.4.4.

2 https://chrome.google.com/webstore/detail/afftracker/aifikahpmikoknlnhnjobdakoppn

3.3.3 Crawling

To characterize cookie-stuffing at scale, we visited over 475K domains to search for stuffed cookies. As described in Section 3.2, a user should only receive an affiliate cookie upon clicking on an affiliate URL. While crawling we do not click on any links, and therefore every affiliate cookie we receive is deemed fraudulent. Since it is infeasible to crawl every Web page, we narrowed our visits to four sets of URLs where we expected to come across affiliate fraud. For every crawl we used a slightly modified version of our publicly available extension, AffTracker, which automatically grabs a new URL from a queue on Redis, a persistent key-value store. The overall design of the crawler we used is described in Section 2.3.5. Upon completion of a visit, the extension submits results to our server and purges the crawler browser of all history, cookies, and local storage. We purge the browser because we found affiliates who save state in browsers to rate-limit their cookie-stuffing and thereby evade detection by affiliate programs. For example, an affiliate, jon007, who controls bestwordpressthemes.com, sets a custom cookie called bwt which is valid for a month. As long as this cookie remains valid in a browser, bestwordpressthemes.com does not request HostGator affiliate cookies for the user. Also, inspired by Shawn Hogan who, according to eBay, rate-limited his cookie-stuffing by only requesting an affiliate cookie once per IP [16], we use 300 proxies to mitigate IP-based detection by fraudulent affiliates. In order to gather affiliate cookies, we crawled the four sets of domains listed below; a sketch of the crawl queue appears after Table 3.2. Except for the Alexa top domains set, the remaining three sets are purposely biased towards domains where we expect to find a higher concentration of cookie-stuffing.

Alexa Top Domains We crawled the Alexa top 100K domains [5] as of April 16, 2015 to find popular domains stuffing cookies.

Reverse Cookie Lookups After identifying the cookie names used by different affiliate programs, we performed reverse lookups on the publicly available cookie-search interface [23] on digitalpoint.com, a webmaster community that indexes all of the cookies its crawler encounters. Overall, we gathered 9.5K domains that the Digital Point crawler observed performing cookie-stuffing over the last two years.

Reverse Affiliate ID Lookup Using the cookie-stuffing affiliate IDs discovered from our Digital Point domain crawl, we queried an aggregator site, sameid.net, that indexes domains by the Amazon and ClickBank affiliate IDs seen on them. By iteratively querying with newly discovered cookie-stuffing affiliate IDs and crawling the resulting domains, we visited a total of 74.5K domains.

Typosquatted Domains While crawling the domains in the above three sets, we observed that much of the cookie-stuffing fraud was on domains typosquatted on merchant domain names. Thus, we crawled typosquatted .com domains for over 7K domains belonging to major e-retailers. We acquired the set of domains belonging to e-retailers from a public API offered by Rakuten Popshops.3 The downloaded data includes merchant lists for the Commission Junction, ShareASale, and Rakuten LinkShare affiliate networks.

By calculating the Levenshtein distance [50] between merchant domains and all .com domains in a zone file from April 19, 2015, we found over 300K typosquatted domains with an edit distance of one. Similar to Edelman et al. [26], we interpret the use of typosquatting to redirect users to merchant sites without any explicit clicks as cookie-stuffing. Typically, users visiting typosquatted domains intend to visit the merchant site rather than the typosquatted domain itself. Thus, typosquatting a domain does not bring new customers to merchant sites via direct affiliate marketing of products, and we therefore recognize cookie-stuffing via typosquatted domains as fraudulent.

We gather the same features about affiliate cookies through crawling as from the user study because we use the same browser extension, AffTracker, in both cases. Furthermore, we only visit top-level pages of domains and therefore miss any cookie-stuffing in domain sub-pages. Also, Google Chrome disables popups by default, a feature we left unchanged because it emulates a user's browser more faithfully. However, this behavior likely caused our crawler to miss any affiliate fraud where a fraudster opens a popup to load an affiliate URL. Finally, while we can detect fraudulent affiliates stuffing cookies, we are unable to discern whether an affiliate network has already identified the fraudster. We have seen examples of ClickBank and LinkShare affiliate sites where affiliate links show an error message about affiliates having been banned, but some networks do not break banned affiliate links, to prevent a bad end-user experience for their URLs.4

3 https://www.popshops.com
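A minimal sketch of the edit-distance matching described above: rather than computing the full Levenshtein distance for every merchant-zone pair, all strings at distance exactly one from a merchant label can be generated directly (deletions, substitutions, insertions) and intersected with the labels in the zone file. This is our own illustration, not the dissertation's actual code:

    import string

    def edits1(name):
        """All strings at Levenshtein distance one from `name` (domain label only)."""
        letters = string.ascii_lowercase + string.digits + "-"
        splits = [(name[:i], name[i:]) for i in range(len(name) + 1)]
        deletes = {a + b[1:] for a, b in splits if b}
        substitutions = {a + c + b[1:] for a, b in splits if b for c in letters}
        inserts = {a + c + b for a, b in splits for c in letters}
        return (deletes | substitutions | inserts) - {name}

    def find_typosquats(merchant_labels, zone_labels):
        """Intersect distance-one candidates with registered .com labels."""
        zone = set(zone_labels)
        return {m: sorted(edits1(m) & zone) for m in merchant_labels}

    # find_typosquats(["linensource"], ["liinensource", "example"])
    # -> {'linensource': ['liinensource']}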

Table 3.2. Affiliate programs affected by cookie-stuffing and the distribution of cookie-stuffing techniques corresponding to the 12K stuffed cookies we detected.

    Affiliate Program            Cookies         Domains   Merchants   Affiliates   Images   Iframes   Redirecting   Avg. Redirects
    Amazon Associates Program    170  (1.41%)    122       1           70           28.8%    34.1%     37.0%         1.64
    CJ Affiliate                 7344 (61.0%)    7253      725         146          0.29%    2.46%     97.2%         0.94
    ClickBank                    1146 (9.52%)    1001      606         403          34.4%    13.5%     52.0%         0.68
    HostGator                    71   (0.59%)    63        1           29           43.7%    19.7%     35.2%         0.87
    Rakuten LinkShare            2895 (24.1%)    2861      188         57           0.28%    0.41%     99.3%         1.01
    ShareASale                   407  (3.38%)    404       66          34           0.25%    0.0%      99.8%         0.74
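As mentioned above, the crawler instances drew their work from a shared Redis queue, one URL per browser visit. The following sketch approximates that loop; the queue and key names are hypothetical, and the real system drove an instrumented browser rather than this stub:

    import json
    import redis

    # Hypothetical queue and key names; the crawler's actual schema is not given in the text.
    r = redis.Redis(host="localhost", port=6379)

    def enqueue(urls):
        for url in urls:
            r.rpush("crawl:queue", url)

    def next_url():
        """Blocking pop, so idle crawler instances simply wait for work."""
        item = r.blpop("crawl:queue", timeout=30)
        return item[1].decode() if item else None

    def report(url, cookies):
        """The extension submitted results to a server; here we just record them."""
        r.rpush("crawl:results", json.dumps({"url": url, "cookies": cookies}))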

3.3.4 Browser Extension Analysis

When the corpus of all browser extensions was made available to us for detecting malicious behaviors [44], we also opportunistically performed an initial analysis to identify browser extensions involved in affiliate fraud. Specifically, we analyzed the browser extensions downloaded from the Chrome Web Store [1], which are publicly available, third-party add-ons (e.g., AffTracker) designed to enhance the functionality of the Google Chrome browser. We performed a manual, static analysis over the source code of all extensions to identify any extensions modifying URLs for different affiliate programs. In Section 3.4.3, we characterize the extensions we found stuffing cookies.
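A first pass of such an analysis can be mechanized by flagging extension sources that mention affiliate-program hostnames, leaving the flagged extensions for manual review. This sketch is our own illustration (the analysis described above was manual), and the pattern list is deliberately partial:

    import pathlib
    import re

    # Partial, illustrative list of patterns tied to the affiliate programs studied.
    AFFILIATE_HOSTS = re.compile(
        r"anrdoezrs\.net|hop\.clickbank\.net|linksynergy\.com|"
        r"shareasale\.com|amazon\.[a-z.]+/.*[?&]tag=", re.IGNORECASE)

    def scan_extension(ext_dir):
        """Flag JavaScript files in an unpacked extension that mention affiliate hosts."""
        hits = []
        for path in pathlib.Path(ext_dir).rglob("*.js"):
            text = path.read_text(errors="ignore")
            if AFFILIATE_HOSTS.search(text):
                hits.append(str(path))
        return hits  # candidates for manual review, not proof of fraud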

3.4 Results

In this section we first analyze the affiliate networks and merchants most impacted by cookie-stuffing fraud. Next, we survey the various cookie-stuffing techniques used by affiliates on their Web sites, and then describe several examples of fraudulent browser extensions surreptitiously stuffing cookies. Finally, we present the results from our user study on the prevalence of affiliate marketing and cookie-stuffing fraud.

4 Based on reading black-hat forums, we know that most affiliate programs disable payouts for affiliates upon detection of fraudulent activity.

3.4.1 Networks Affected by Cookie-Stuffing

Using data collected from our crawls, we identify the affiliate networks most targeted by cookie-stuffing. Overall, we received 12,033 affiliate cookies from 11.7K domains. Table 3.2 summarizes our results. We found that CJ Affiliate and Rakuten LinkShare are the most targeted programs, comprising 85% of all fraudulent cookies we observed. We identified affiliate IDs for all but 1.6% of these cookies. Every fraudulent CJ affiliate stuffed almost 50 cookies on average, while every LinkShare affiliate stuffed 41 cookies. However, fraudulent affiliates in the Amazon and HostGator affiliate programs stuffed only 2.5 cookies per affiliate on average, the fewest of all affiliate programs in our study, suggesting that fraudulent affiliates target networks much more than they target in-house affiliate programs.

Generally, affiliate networks represent a greater cookie-stuffing opportunity for fraudulent affiliates because a single affiliate can simultaneously defraud multiple merchant members of the network. Our data from the Popshops API (Section 3.3.3) contains almost 2.4K merchants in CJ Affiliate and 1.3K merchants in Rakuten LinkShare, and as Table 3.2 shows, each fraudulent affiliate targeted more than three merchants in LinkShare on average. We received 10 and 15 cookies on average for every targeted merchant in CJ Affiliate and Rakuten LinkShare, respectively.

Using the Popshops data as ground truth, we classified the defrauded merchants for all of the major networks in our study except ClickBank; we were also unable to classify 420 CJ Affiliate cookies. Figure 3.2 shows the category of merchant along the x-axis and the total number of fraudulent affiliate cookies we observed along the y-axis for the CJ Affiliate, ShareASale, and Rakuten LinkShare affiliate programs. We found that, on the whole, Apparel and Accessories e-retailers are targeted the most across all three affiliate networks, while the second most impacted group of merchants are Department Stores, again abused in all three networks but to a greater extent in LinkShare.


Figure 3.2. Stuffed cookie distribution for top 10 categories of impacted merchants.

Travel and Hotel sites were the third-most defrauded group. These three sectors have a large number of merchants, and we found almost 11 stuffed cookies on average for every targeted merchant. On the other hand, the Tools and Hardware category contains only four impacted merchants, but we observed almost 45 cookies for each of them on average, the highest of any category. Home Depot, a CJ Affiliate member, was the most impacted merchant in this category with 163 stuffed cookies. We also found 107 merchants who were defrauded across two or more networks. Chemistry.com, a member of CJ Affiliate and LinkShare, was the most targeted merchant participating in more than one affiliate program.

3.4.2 Prevalence of Cookie-Stuffing Techniques

We use the data collected from crawling to characterize the use of various cookie-stuffing techniques against each of the affiliate programs under consideration. As described in Section 3.3, whenever the browser requests an affiliate URL, AffTracker finds the DOM element that caused the fetch. Visiting a fraudulent affiliate's Web page can cause the user's browser to fetch an affiliate URL without any explicit click from the user by setting the URL as the src attribute of image, iframe, or script tags; all of these DOM elements can request third-party content on a Web page. AffTracker is able to detect all affiliate cookies by observing the Set-Cookie HTTP headers, and distinguishes between these techniques automatically. Table 3.2 shows the percentage of cookies corresponding to each of these techniques for every affiliate program. At a high level, we observe that fraudulent affiliates use a wider and more uniform mix of techniques against affiliate programs run by merchants themselves, while larger affiliate networks are targeted primarily via redirects to merchant sites without user clicks.

Table 3.2 also shows the average number of intermediate domains requested after the initial page visit but before the affiliate URL, i.e., a value of zero means that an affiliate URL was directly requested from the crawled page. Intermediate referrers can be used to obfuscate the original source of a fraudulent affiliate URL request because an affiliate program only sees the final referrer when determining the legitimacy of such a request. Affiliates defrauding the Amazon Associates Program use more intermediate domains on average. Since intermediate domains cost money to acquire, defrauding the Amazon Associates Program is likely more expensive for fraudulent affiliates; this higher cost reinforces the notion that Amazon does better policing.

Redirecting Fraudulent affiliates redirect users to affiliate URLs without any clicks either by using 301 or 302 HTTP response status codes, or by using ActionScript or JavaScript to redirect the browser to the affiliate URL. In each of these cases, the original page we crawled resulted in only one stuffed cookie per visit. Such redirects delivered over 91% of all stuffed cookies, most of which resulted from typosquatted domains. In fact, we received 84% of all affiliate cookies (10.1K) from typosquatted domains.

Of the 10.1K cookies from typosquatted domains, 93% (9.4K cookies) are from domains typosquatting on merchant domain names, while 1.8% resulted from typosquatting on subdomains. For example, liinensource.com redirects to the Rakuten LinkShare merchant linensource.blair.com. We manually inspected 30 of the remaining 520 typosquatted domains that resulted in affiliate cookies and found that these domains can be broadly classified into three types. One-third of these domains are contextually related to the final landing page. For example, 0rganize.com redirects to the CJ merchant shopgetorganized.com, while bhealthypets.com and healthypts.com redirect to the CJ merchant entirelypets.com. Another third appear to be expired CJ offers, and thus did not redirect to any merchant site. The remaining third result from typosquatted domains selling traffic to traffic distributors like pureleads.com, 7search.com, and blendernetworks.com that eventually redirect through an affiliate URL. We revisit traffic distributors in our discussion of referrer obfuscation later in this section.
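Server-side redirect stuffing of this kind leaves a visible trail in the HTTP redirect chain. A minimal check using the Python requests library might look like the sketch below; note that requests only follows 3xx redirects, so JavaScript and ActionScript redirects would be missed, and the affiliate-host list is an illustrative subset:

    import requests

    # Illustrative subset of affiliate-program click hosts.
    AFFILIATE_HOSTS = ("anrdoezrs.net", "click.linksynergy.com",
                       "hop.clickbank.net", "shareasale.com")

    def redirect_chain(url):
        """Follow HTTP 3xx redirects and return every URL visited."""
        resp = requests.get(url, timeout=15, allow_redirects=True)
        return [r.url for r in resp.history] + [resp.url]

    def looks_like_redirect_stuffing(url):
        """A typosquatted domain that bounces through an affiliate click host
        before landing on the merchant matches the pattern observed in the crawl."""
        chain = redirect_chain(url)
        return any(host in hop for hop in chain for host in AFFILIATE_HOSTS)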

Iframes Iframes are often used to render third-party content on a Web page, which is otherwise forbidden by the Same-Origin Policy [63] implemented by browsers. Most major affiliate programs disallow the use of iframes, as it is a commonly used mechanism to facilitate cookie-stuffing. For example, HostGator explicitly prohibits iframes: "iframes may not be used unless given express permission by HostGator, sales made through hidden iframes or Cookie-stuffing methods will be considered invalid" [36]. Similarly, the Amazon Associates Program prohibits framing any Amazon link on a page [7].

We received 420 cookies from content rendered in iframes on third-party sites. Generally, a server can prevent a Web site from framing its content by using the X-Frame-Options HTTP header, with its value set to SAMEORIGIN to only allow rendering within a page of the same origin as the frame, or DENY to completely disallow framing the content on any Web site [64]. We determined that the Google Chrome and Firefox browsers honor the X-Frame-Options header and do not render the iframe content, but both browsers save the cookies nonetheless. Thus, iframe-based stuffing is effective despite the use of the X-Frame-Options header. We found that 17% of the cookies received from iframes set X-Frame-Options to either SAMEORIGIN or DENY, including every Amazon Associates Program cookie, and Table 3.2 shows that iframes still accounted for over a third of the stuffed Amazon cookies. Unlike Amazon, only 2% of CJ cookies and 50% of LinkShare cookies were accompanied by a restrictive X-Frame-Options header.

We also used the style and size information gathered in our crawl to determine how, if at all, a user would have seen the corresponding iframe on the crawled page. We gathered this information for 46% of the iframes. Of these 191 iframes, 64% explicitly set the height or width to either zero or one pixel; 49 (25%) have visibility:hidden or display:none set, thereby making the iframe invisible to an end user. Additionally, seven iframes use CSS classes to hide the iframe DOM element.

Of these, three have the same affiliate ID, kunkinkun, and the CSS class rkt specifies left:-9000px, which positions the iframe outside the viewport and therefore makes it invisible to the user. The same affiliate also defrauds the Amazon Associates Program using the same technique and the ID shoppertoday-20. We also found two examples where iframes were made invisible by setting the visibility CSS property on their parent DOM elements. The 49 remaining iframes were not hidden, and most of them correspond to ClickBank.
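The combination that makes iframe stuffing work despite framing restrictions (rendering is refused, but the cookie is still stored) can be spotted in a single response: a Set-Cookie header arriving alongside a restrictive X-Frame-Options header. A hedged sketch of that check:

    import requests

    def framing_blocked_but_cookie_set(affiliate_url):
        """True if the response forbids framing yet still sets a cookie, the
        combination under which iframe stuffing remains effective in practice."""
        resp = requests.get(affiliate_url, timeout=15, allow_redirects=False)
        xfo = resp.headers.get("X-Frame-Options", "").upper()
        sets_cookie = "Set-Cookie" in resp.headers or bool(resp.cookies)
        return xfo in ("DENY", "SAMEORIGIN") and sets_cookie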

Images Images can also be used to fetch third-party content on a Web page; 504 cookies in our data set were requested as images. Of these, we recorded rendering information for 91% of the cookies. Unlike our iframe data, we found that every single DOM element either had its width or height set to zero or one pixel, or had its style set to display:none, effectively hiding the image from the end user.

We also found six cookies that were requested by hidden img elements embedded within iframe elements. For example, bestblackhatforum.eu, a domain with Alexa rank 47,520, stuffs cookies for three different LinkShare merchants (UDemy.com, microsoftstore.com, origin.com), one CJ merchant (GoDaddy.com), and Amazon. All of these affiliate URLs are requested as hidden images of height and width zero pixels inside iframes with src set to lievequinp.com, which is then observed by the affiliate programs as the referrer. As a result, the affiliate programs do not observe the actual cookie-stuffing domain, bestblackhatforum.eu, in the request for affiliate URLs, thereby making detection of the cookie-stuffing difficult. As shown in Table 3.2, fraudulent affiliates use iframes to defraud Amazon and HostGator more often than the large affiliate networks, suggesting greater difficulty in evading detection by these in-house programs.
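The visibility heuristics applied to both iframes and images above reduce to a simple predicate over each element's recorded size and style. The field names below are our own choices, not AffTracker's actual schema:

    def is_hidden(elem):
        """An element is treated as effectively invisible if it is at most one
        pixel in either dimension, styled away, or pushed outside the viewport."""
        if elem.get("width", 2) <= 1 or elem.get("height", 2) <= 1:
            return True
        style = elem.get("style", {})
        if style.get("display") == "none" or style.get("visibility") == "hidden":
            return True
        # e.g. the rkt CSS class observed in the crawl sets left:-9000px
        if style.get("left", 0) <= -2000:
            return True
        return False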

Scripts Even though script tags can be used to fetch third-party content from affiliate URLs by setting the src attribute, we only found two such stuffed cookies. However, upon manual inspection of several cookie-stuffing domains we found that scripts are often used for dynamic generation of hidden images and iframes that then request the affiliate URLs.

Referrer Obfuscation Next, we analyze the extent to which fraudulent affiliates hide the actual cookie-stuffing domain behind innocuous domains. Referrer obfuscation is used to make cookie-stuffing via any technique, such as images, opaque to the affiliate programs.

Of the 12K cookies we gathered in our crawl, 84% were fetched via at least one intermediate domain. In fact, 77% of all cookies were fetched via a single redirect, 4.5% via two redirects, and another 2% via three or more redirects. Only the last redirect is seen by the affiliate program in the HTTP Referer header. We analyzed the intermediate domains and found that a significant portion of the redirects go through a small set of domains. The most common intermediate domains we observed are cheap-universe.us, flexlinks.com, dpdnav.com, pgpartner.com, 7search.com, and pricegrabber.com. Of these, flexlinks.com belongs to an affiliate program called FlexOffers, while the other domains are likely traffic distributors buying traffic and then monetizing it via affiliate fraud. Over 25% of the cookies in our data contain a redirect through at least one of these traffic distributors. In fact, 36% of all CJ cookies contain at least one of these domains.
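Given the redirect chain recorded for each stuffed cookie, both statistics above (the redirect-depth distribution and the most common intermediaries) reduce to simple counting. A sketch, assuming each chain is represented as a list of intermediate domains:

    from collections import Counter

    def referrer_stats(chains):
        """`chains` holds, per stuffed cookie, the intermediate domains seen
        between the crawled page and the final affiliate URL request."""
        depth = Counter(len(chain) for chain in chains)
        intermediaries = Counter(domain for chain in chains for domain in chain)
        return depth, intermediaries

    depth, inter = referrer_stats([
        ["flexlinks.com"],                 # one intermediate redirect
        ["7search.com", "pgpartner.com"],  # two intermediate redirects
        [],                                # affiliate URL fetched directly
    ])
    print(depth.most_common(), inter.most_common(2))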

3.4.3 Fraudulent Browser Extensions

The cookie-stuffing domains we discovered by crawling (Section 3.3.3) earn revenue only when users who visit these domains make purchases from merchants within a short amount of time after receiving fraudulent cookies. Fraudulent affiliates can use advertising or purchased traffic to bring visitors to these domains. However, an extension author can modify the Web sites visited by a user to force her to visit a fraudulent affiliate URL just before visiting a merchant site. Thus, an extension can ensure that a user has a valid affiliate cookie at the time of purchase, thereby guaranteeing a commission for the fraudulent affiliate.

Browser extensions are powerful add-ons that have significant control over the browsing experience, such as changing the appearance of any Web site, modifying network headers sent and received by the browser, opening or closing tabs, remembering passwords locally in the browser, etc. Generally, users have to explicitly give an extension permission to access resources (e.g., HTTP headers) during installation. Subsequently, the extension can continue accessing these resources until it is explicitly removed by the user. Given this power, it is not surprising that browser extensions are sometimes used unscrupulously by attackers to defraud users. While searching for malicious behavior in extensions [44], we found extensions built for cookie-stuffing as well. We did not perform a quantitative analysis on these extensions but focused on qualitatively characterizing their operation. We classified the cookie-stuffing extensions into two groups depending on whether or not the users were informed of the extension's monetization of their online shopping. We describe our findings with examples below.

Extensions Monetizing User’s Online Shopping Stealthily The first group includes extensions that provide some utility to users — such as refreshing pages automatically every few seconds, or changing the appearance of popular sites like Facebook — but do not inform the extension users of monetizing their online purchases. Generally, these extensions monitor the Web sites visited by a user to detect when the user is visiting a merchant site and then change the visited URL to the fraudulent affiliate URL without the user’s knowledge. Since affiliate URLs redirect to the merchant site anyway, the modification of the visited merchant URL is opaque to the user who receives an affiliate cookie just before visiting the merchant site. Thus, the user always has a valid affiliate cookie for the extension author when visiting a merchant site. As a result, when the user makes any purchase on the merchant site, the affiliate cookie causes the merchant to incorrectly credit the advertising commission to the extension author who in fact put no effort in advertising to the user. 42

For example, we found a fraudulent extension with 52K users called "*Split Screen*". The stated intent of "*Split Screen*" is to enable its users to render two browser tabs side-by-side within a single browser tab. However, it also stealthily monitors the Web sites visited by a user by reading the URLs typed into the browser address bar and by observing the outgoing HTTP request headers. When the extension observes a merchant URL of interest, it silently rewrites the outgoing request to fetch the extension author's affiliate URL for the same merchant. We found that the extension replaced URLs for multiple merchants, including amazon.com, amazon.co.uk, hotelscombined.com, hostgator.com, godaddy.com, and booking.com. For some merchants, it also modifies the Referer header of the outgoing requests to falsely imply that the user arrived at the merchant site by clicking on a link within the fraudulent affiliate's Web page. The extension is able to make these changes because it requests permissions to modify properties of the browser tabs and the incoming and outgoing HTTP headers when users first install it. In fact, we found four other extensions created by the same developer that similarly provided some small utility to the user while defrauding affiliate programs in the background. Overall, this developer's extensions had nearly 70K users.

Another extension, "Facebook Theme: Basic Minimalist Black Theme" (2.5K users), allows users to change the appearance of facebook.com. Beyond its stated intent, however, it also replaces URLs for the Amazon Associates Program with its own affiliate URL. During installation, this extension requests the user's permission to perform eval to execute arbitrary JavaScript strings. The extension initializes by executing a highly obfuscated hexadecimal- and base64-encoded script that stores multiple Amazon affiliate IDs locally in the user's browser, which it subsequently uses to modify outgoing requests for merchant URLs. The obfuscated code makes it difficult to automatically detect affiliate fraud through static analysis of the extension source code.

We also observed that, instead of directly replacing the merchant URL with an affiliate URL, some extensions replace the merchant URL with a URL on a different domain, which subsequently redirects to the affiliate URL. Fraudulent affiliates likely perform such redirection to gain greater control over how frequently they refer customers to an affiliate program (to evade detection) or to perform bookkeeping of their referrals. For example, an extension called "Page Refresh" (200 installations) redirects users visiting merchant URLs through shortened URLs for 40 different merchants, including Amazon.

Extensions Explicitly Monetizing User's Online Shopping The second group of cookie-stuffing extensions we found have descriptions that make the monetization of users' online shopping obvious, and thus the intent or legitimacy of such extensions is difficult to ascertain. The stated purpose of these extensions is usually charitable donations. For example, the extension "Give as you Live" [4] has over 11K users and is part of a larger campaign [3] to raise funds for charities from user purchases online. The extension modifies search result pages to include results from stores that it can monetize through the corresponding affiliate programs. It also automatically modifies affiliate URLs for the Amazon Associates Program during user visits. While the extension brings legitimate and likely well-intentioned customers to Amazon, it hurts legitimate affiliates by overwriting their affiliate URLs. In fact, we found several extensions enabling users to donate to charity simply by shopping online. One such extension advertises itself as "Help support our charity by shopping at amazon.co.uk", but actually defrauds affiliate programs and legitimate affiliates5 by explicitly overwriting any affiliate URLs that users attempt to visit. Besides directly overwriting URLs visited by users, we also found extensions injecting iframes into arbitrary Web pages visited by their users with the src set to affiliate URLs. The result is the same as when users visit Web pages containing iframes with affiliate URLs, as described in Section 3.4.2.

5 The source code of the extension helpfully contains a comment listing the need to obfuscate the cookie-stuffing code as future work.

3.4.4 Prevalence of Affiliate Marketing

As described in Section 3.3, we gathered affiliate cookie data from 74 users over a two-month period to study how often users click on affiliate links and how often they receive stuffed cookies. Only 12 users received any affiliate cookie in our study. Overall, users encountered a total of 61 cookies for 23 distinct merchants. Over a third of these cookies resulted from affiliate links on dealnews.com and slickdeals.net. Thus, while almost 84% of the users did not receive any cookie at all, the remaining 12 users received an average of five cookies each. Table 3.3 shows the high-level results. The Amazon Associates Program was the most popular affiliate program in our study, accounting for almost 51% of the cookies. As shown in Table 3.3, CJ Affiliate was the second most popular affiliate program, followed by Rakuten LinkShare. Our users did not receive any affiliate cookies from the ClickBank or HostGator affiliate programs. This distribution differs from the networks targeted by cookie-stuffers, where CJ Affiliate is targeted significantly more than Amazon. Notably, none of these affiliate cookies were rendered within hidden DOM elements. We manually inspected all of them and verified that none of the source affiliate Web sites are stuffing cookies. To rule out the possibility that users were protected by ad-blocking extensions, which often disallow third-party cookies, we gathered the lists of extensions on their browsers and found that only four users use any such extension. From our user study we find that users rarely encounter cookie-stuffing fraud, and that affiliate marketing is dominated by a small number of affiliates. While the set of participants in our user study is likely biased towards technologically savvy users, our results are consistent with Snyder et al. [78], who found that cookie-stuffing constituted a very small percentage of the HTTP traffic of a large public university.

Table 3.3. Affiliate Programs that AffTracker users received cookies for.

    Affiliate Network            Cookies   Users   Merchants   Affiliates
    Amazon Associates Program    31        9       1           16
    CJ Affiliate                 18        5       2           7
    ClickBank                    0         0       0           0
    HostGator                    0         0       0           0
    Rakuten LinkShare            9         3       6           5
    ShareASale                   3         2       3           2

3.5 Summary

We characterized the abuse of affiliate marketing for monetary gain through the use of techniques broadly classified as cookie-stuffing. Overall, even through targeted crawling of domains with a higher likelihood of encountering affiliate fraud, we observed only a limited amount of cookie-stuffing in our study. Even in our user study designed to detect affiliate marketing encountered by users in daily browsing, we found no cookie-stuffing fraud. In the cookie-stuffing attempts we did discover, we characterized the techniques used for affiliate fraud and the merchants most targeted by fraudsters.

Most merchants interested in affiliate marketing have a choice to either run their own affiliate programs, or join a large affiliate network with thousands of other merchants and affiliates. Since affiliate networks have a larger number of merchants, they provide a greater opportunity for fraudulent affiliates to simultaneously target multiple merchants. In fact, of the affiliate fraud we did identify, we observed that large affiliate networks are targeted disproportionately more than merchant-run affiliate programs, which are targeted to a smaller extent but through more sophisticated and costly cookie-stuffing techniques, such as the use of intermediate domains to obfuscate the cookie-stuffing domains. These results suggest that in-house affiliate programs are better placed to police their affiliates due to greater visibility into affiliate activities and the revenue flow, and possibly a shorter turnaround time to take action against a fraudulent affiliate upon detection. However, running an in-house affiliate program requires expertise and investment that are unnecessary when the logistics of running an affiliate program are outsourced to an existing network.

In an initial analysis, we also discovered and qualitatively analyzed browser extensions perpetrating affiliate fraud. Unlike the cookie-stuffing domains we discovered by crawling, browser extensions offer much more control to the fraudster. Instead of needing to entice users to visit a fraudulent domain, a fraudulent affiliate can much more effectively force an extension user to visit an affiliate URL right before making a purchase, thereby guaranteeing a commission. From the fraudster's perspective, creating a browser extension and acquiring a sufficiently large user base to be profitable is slow and expensive; it requires creativity along with technical skill to create an extension that will organically acquire a large number of users. However, once a fraudster has such an extension, it is easy to commit affiliate fraud while evading detection by the affiliate programs. In response, affiliate programs would need to undertake the difficult task of actively monitoring the vast browser extension ecosystem to discover the fraudulent extensions.

3.6 Acknowledgements

Chapter 3, in part, is a reprint of the material as it appears in the following two publications: Proceedings of the 2015 ACM Conference on Internet Measurement Conference (IMC), Neha Chachra, Stefan Savage, and Geoffrey M. Voelker, 2015; and Proceedings of the 23rd USENIX Security Symposium, Alexandros Kapravelos, Chris Grier, Neha Chachra, Christopher Kruegel, Giovanni Vigna, and Vern Paxson, 2014. The dissertation author was the primary investigator and author of these papers.

Chapter 4

Characterizing Domain Abuse and the Revenue Impact of Blacklisting

Online counterfeit drug stores earn revenue from customer purchases on their storefront Web sites. Typically, potential customers find the links for these storefront sites in spam email messages in their inboxes, in Web search results, or through spam posts on forums, blogs, etc. The most common form of intervention used against any scam that involves domain advertising is blacklisting. By making domains harder to discover, blacklisting hampers the advertising campaigns of spammers, thereby reducing the number of customers reaching their storefront sites. In this chapter, we once again use large-scale ground truth data to understand the underlying conflict between attackers and defenders, here in the counterfeit pharmaceuticals market, and to study the efficacy of interventions on attacker revenue. Specifically, we use the sales data available to us to characterize the domains and advertising vectors abused by spammers. The data also provides insights into the differing strategies used by spammers to remain profitable by minimizing advertising costs even in the face of blacklisting and takedowns. We find that spammers respond to blacklisting with agility and ingenuity.

While blacklisting does affect spammer profitability for the specific domains being blacklisted, the low cost and easily replaceable nature of domains, coupled with limitations in domain discovery by defenders, enables spammers to profit easily despite blacklisting. The ground truth data suggests that blacklisting is only a minor economic hindrance to spammers.

4.1 Introduction

Virtually every mode of mass communication in use online today — email, search, blogs, social networks, instant messaging, and VoIP — engenders some form of spam that is used to shill for products or services. In nearly all cases, this activity is monetized by driving users to click on Web links for spammer-affiliated e-commerce sites, which then process conversions using standard online payment mechanisms (e.g., Visa and Mastercard). In response, a wide range of defenses have been proposed and implemented to identify such unwanted communication and filter it out of the user's view (typically either preventing the advertisement containing the link from appearing or preventing the link from being visited).

Among the oldest and most widely used of these defenses is domain blacklisting — the active identification and distribution of domain names advertised in an unwanted manner. This approach is used today in a broad range of defenses, including email classification, anti-phishing toolbars, and search classification, and in turn has driven spammers to a variety of countermeasures (e.g., churning through large numbers of domains, using one or more layers of domain redirection, abusing existing sites to host content or redirect traffic, etc.). However, while the technical components of this domain name arms race are widely understood, the underlying economic issues that drive them are not.

In this chapter, we investigate the economics of domain name abuse and place several aspects on an empirical footing — both characterizing the economic value enabled by spam-advertised domain names and the concurrent economic impact of domain name blacklisting. In particular, we use almost two and a half years of sales data from two counterfeit pharmaceutical affiliate programs (comprising over 40K sites and 1.3M sales records) to characterize the domains abused for driving traffic to pharmacy sites. Using recorded Web referrer logs, we infer the nature of the Web services abused (e.g., free hosting sites, search results, Webmail providers, etc.), the dynamics of such abuse over time, and the revenue afforded by different vectors. Specifically, we characterize the abused domains that account for $25M in revenue for SpamIt and $41M for GlavMed during 2007–2010. We find that spammers respond to blacklisting and takedown interventions with ingenuity, and maximize profit by abusing a wide range of traffic vectors at low cost.

Against this backdrop, we then use nine months of contemporaneous data from a widely-used domain blacklisting service to quantify the impact of blacklisting on these same pharmacy store sites. Since our data set provides the precise time of every revenue event for each advertised domain name, we are able to directly investigate the extent to which domain blacklisting was successful as a strategic defense mechanism; did it undermine the fundamental business model or simply change marginal costs? Moreover, this data allows us to investigate hypothetical questions, such as the extent to which improvements in blacklisting "speed" would impact profitability.

Interestingly, we found that existing blacklists rapidly identified a large percentage of spammed domains (88% within 2 days) and that additional improvements in blacklisting "speed" would, by themselves, have little impact on profitability. Indeed, our findings suggest that domain discovery is a more important issue in the efficacy of domain blacklisting.
To wit, over 60% of revenue for domains advertised through spam came from the 12% of the sites in our data set that evaded blacklisting (either through luck or, as we observe in some cases, through careful advertising to avoid the sensors of defenders).

Thus, even if blacklisting is otherwise robust, a small fraction of non-blacklisted domains may be sufficient to sustain overall profitability. Another complex aspect of this problem is the interaction between consumer demand and how blacklisting is used. Domain blacklists are not universally deployed, and in many cases they are used only in an advisory fashion (e.g., labeling email containing offending domains as "spam"). However, we find strong evidence that motivated consumers are not dissuaded by such advisories. From the referrer logs in our data set, we found that 20 to 40 percent of sales from email spam arise from users who actively open their spam folder and click on links to pharmacy sites. Indeed, this user behavior is one of the reasons that blacklisted domains in our data set earn 87% of their revenue after being blacklisted.

Using a simple revenue model to represent our data, we establish that even if blacklists can identify all counterfeit pharmacy domains, blacklisting can make spamming unprofitable only when used to completely block access to offending domains. While our data is limited to a particular time period (2007–2010) and a particular set of actors (GlavMed and SpamIt), we believe that the underlying conflict is largely unchanged today and that our key findings — that incomplete domain discovery and the advisory use of blacklists limit the strategic value of the approach — are likely to still hold across a broad range of scenarios.

4.2 Background

Abusive advertising, such as email spam, dates back to the origins of the Internet. The attraction of an advertising medium with virtually non-existent marginal cost is irresistible. While a combination of legal limits (e.g., the US CAN-SPAM Act [17]) and the creation of structured advertising vectors (e.g., sponsored search on popular search engines) have placed controls on legitimate advertisers, those who are already breaking the law by selling counterfeit or fraudulent products continue to abuse communication channels to shill their wares.

Today, virtually every form of Internet communication has an attendant form of spam: email [49, 70], search [40, 48, 91, 92], blogs and forums [66, 75], social networks [34], instant messaging [68], and so on. The prevalence of such widespread abuse of different services suggests profitability, but the relative differences in the extent of abuse have not been studied previously. Over the years, different Web services have developed techniques to counter spam. While the abuse of these different services is well known [11, 12, 18], we place the relative differences in abuse of different services on an economic footing.

By far the oldest and most heavily abused service on the Internet is the email vector, which became so prevalent that even as of 2013, spam was still the dominant form of email in transit [21]. To manage this problem, security researchers learned to classify mail as wanted or unwanted based on both its content and where the message was sent from. Thus was born Internet blacklisting. The first blacklists focused on identifying and distributing the IP addresses of hosts known to be sending spam messages so that mail servers could know to properly drop or classify their messages [82]. A wide range of literature has focused on evaluating and improving upon such IP-based blacklisting approaches (e.g., [41, 47, 76, 77, 95, 96, 98]), but at their core they are "bot detectors" and thus their value is primarily in limiting the amount of mail that can be sent with impunity from a given host. However, if the advertiser has a large number of senders available (as with large botnets) or is able to "launder" their mail traffic through a major Web mail server [100], SMTP relay [35], or SMTP server, then this sort of blacklisting will be ineffective (i.e., one cannot blacklist the IP addresses for hotmail.com).

An alternative approach was designed by the anti-phishing community: URL blacklists. These systems distributed full URLs of sites known to be hosting counterfeit pages (typically representing banks or other financial institutions) and would be used either by mail servers (to classify messages containing such URLs) or by Web browsers (to block or warn users about to visit such URLs). A range of empirical studies have focused on evaluating the reaction time of such services, with results suggesting that the reaction time is short (typically a couple of hours or less) [53, 74, 99]. More recently, a number of predictive approaches have been proposed, using some combination of the lexicographic features of URLs [54] or the characteristics of domain registration [30]. In practice, many high-volume URL blacklists have focused primarily on the registered domain in a URL. Feeds from such domain blacklists, such as the Spamhaus DBL [79] and the SURBL [80], have become standard inputs to virtually all enterprise spam filtering systems today.

In characterizing any blacklist, two questions need to be considered: how is the blacklist created, and how will it be used? Today, since most blacklisting activity is driven by addressing the email spam vector, blacklists are created via spam traps — open MX resolvers, honey accounts, botnet output, or sometimes human-labeled spam messages [69]. By definition these lists can only detect abusive domains that are collected by these sensors. This truism is well known to spammers, and "list washing" services abound (for example, http://emaillistcleaning.com/) to remove honey and test accounts from outgoing spam lists. Still other spammers traffic only in lists of likely customers (e.g., users who have purchased goods in the past). Finally, spammers who move on to other advertising vectors (e.g., search engine optimization) may experience no impact from blacklisting since there is no organized ecosystem to collect or distribute blacklists for that medium.1

The second question is how the blacklist data is used. Email spam filtering software will typically use domain blacklist data as a strong feature in its classification algorithms. Thus, an email message advertising a given domain (i.e., including a URL with that domain in the message body) will likely be classified as unwanted and automatically filed in a "Spam" or "Junk" folder.

1 The Google Safe Browsing list contains URLs known for phishing or distributing malware.

In other situations, such as with anti-phishing toolbars or Web filtering software (e.g., as offered by Websense or Cisco IronPort), users may be prevented from resolving DNS queries for domains on the blacklist even if they are allowed to click on the URLs. Indeed, it is this last use that has generated the most controversy as governments have sought to legislate its use. For example, in China, comprehensive DNS filtering is used to prevent resolution of domains which the government deems threatening [52]. However, the desire to use the DNS in this manner appears in democratic regimes as well. For example, the Australian Communications and Media Authority (ACMA) maintains a blacklist of Web sites, and several administrations have proposed that ISP filtering of this list be mandatory (the two major Australian ISPs filter based on the blacklist on a voluntary basis) [38]. In the United States, the controversial Stop Online Piracy Act (SOPA) and PROTECT IP Act (PIPA) would have required all ISPs to filter DNS requests to domains identified by brand holders as infringing on their copyright or trademark [27]. This last case generated tremendous opposition; however, most of the resulting arguments focused on claims that it would have a chilling effect on innovation or would infringe on free speech [22]. We are unaware of any academic evaluation of whether the statutes would have in fact prevented counterfeiters from still pursuing their business at a profit.
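Domain blacklists of this kind are conventionally consumed over DNS: the client prepends the suspect domain to the blacklist zone and interprets any A-record answer as a listing. A minimal sketch against a Spamhaus-DBL-style zone follows; return-code semantics vary by operator, so treat the details as illustrative:

    import socket

    def dbl_listed(domain, zone="dbl.spamhaus.org"):
        """Query a DNS-based domain blacklist: an answer means 'listed',
        NXDOMAIN means 'not listed'. Returned addresses (e.g., 127.0.1.x)
        encode the listing category and differ across blacklist operators."""
        try:
            answer = socket.gethostbyname(f"{domain}.{zone}")
            return True, answer
        except socket.gaierror:
            return False, None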

4.3 Data Sets

At the core of our analysis are two data sets, originally described by the journalist Brian Krebs in his "PharmaWars" series [46] and documented more fully by McCoy et al. [57], that capture the full "back end" databases of the GlavMed and SpamIt pharmaceutical affiliate programs between 2007 and 2010. As described in the PharmaLeaks paper, these affiliate programs provided drugstore storefronts (including domain names and Web sites), drug fulfillment, payment processing, and customer service to independent affiliate advertisers who were paid on a commission basis [9, 71]. Thus, an individual affiliate would be given one or more domain names to advertise, and would be paid a fraction of the revenue for every sale they brought in through advertising using any vector (e.g., email spam or search engine optimization).

4.3.1 Authenticity and Ethics

As discussed in McCoy et al. [57], studying these leaked data sets raises concerns regarding authenticity and ethics. Here we briefly summarize the evidence that makes us confident about the authenticity of the data, and refer readers to [57] for a more detailed discussion of these concerns.2 While there is no mechanism to ascertain the authenticity of this data beyond all doubt, we never found any inconsistencies in over 140 linked tables with over 2M sales records. We further compared the databases to the separately leaked corpora of metadata containing detailed chat logs from the program operators for both GlavMed and SpamIt and similarly found no inconsistencies. Moreover, these data sets accurately contain all of our past purchases [43, 49], further evidence of the authenticity of the data.

We address the ethical concerns surrounding the data using the same principle [57] of causing no additional harm in analyzing a leaked data set already in the public domain. We also reiterate that we again strictly adhere to our institution's human subjects review process and ethical guidelines. For this study we only use anonymized data and do not mention any identifiable information about any person or institution that appears in the data, other than naming the affiliate programs GlavMed and SpamIt themselves.

2Excerpts from both data sets and additional discussion can also be found on Krebs' blog [46].

4.3.2 GlavMed and SpamIt

The data dumps of these two affiliate programs are in the form of complete, self-contained PostgreSQL databases, with no other code external to the database. GlavMed and SpamIt were sister programs and therefore shared the same schema. We use four of the 140 tables in the database. Three of these, shop_sales, shop_transactions, and shop_affiliates, were originally also used by McCoy et al. The shop_sales table contains details of every order such as timestamp, sale amount, etc. The shop_transactions table includes payment attempts and details of orders, and the shop_affiliates table contains information about affiliates such as when they joined the program and their user handles. Unlike McCoy et al., who focused on the nature of sales and the role of affiliates in these programs, our focus is on domain abuse. As a result, we also use the shop_sites table, which contains domain information such as the create date of each domain and the affiliate responsible for advertising it.

Besides basic order data, the shop_sales table also contains an HTTP referrer field that was not used by McCoy et al. For 45% of all sales in both programs combined, this field contains the URL that referred the customer to the shop storefront. We use this field to determine how a pharmacy shop was advertised to customers: whether customers visited the Web site directly from a Webmail message (e.g., the referrer domain is hotmail.com), a search result (e.g., the referrer is google.com with search terms in the URL), etc. We further restrict our analysis to valid sales, i.e., sales for which all fraud checks passed, all test purchases are removed, and a valid credit card authorization was attempted (we do not perform further sanitization of sales beyond that performed by McCoy et al.).

Finally, when discussing blacklisting of SpamIt domains, we purposely omit "public shops" (domains shared among different affiliates, which use a cookie or URL token to claim commission) and "reorder shops" (domains not advertised publicly, but provided to past customers for reorders) because we cannot attribute their revenue to a particular affiliate or mode of advertising. These sites account for just 0.1% of all sites in SpamIt.

4.3.3 URIBL

To assess the impact of blacklisting, we use the URIBL blacklist [87]. The data we extract from URIBL contains a timestamped list of spam-advertised blacklisted domains starting July 9, 2009. While URIBL is primarily reactive, it does include some predictive features, and thus some domains appeared on it before they were seen in a spam trap (we confirmed this observation with URIBL). Therefore, to distinguish between the predictive listing of domains and domains that are simply reused at a much later point in time, we exclude all domains that appeared on the blacklist more than a month before their recorded create date (equivalent to 0.3% of all shop domains). We understand the inherent risk in characterizing the entire blacklisting defense mechanism using a single blacklist. However, despite our efforts, we were unable to acquire any other contemporaneous domain blacklist for this study that provided the fine-grained blacklisting timestamps necessary for our analysis.
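To illustrate this exclusion rule concretely, the following minimal Python sketch applies it to a domain's two timestamps; the record format and function name are our own, not part of the original analysis pipeline.

```python
from datetime import datetime, timedelta

def is_predictively_listed(create_date: datetime,
                           blacklist_date: datetime) -> bool:
    # Flag domains that appeared on URIBL more than a month before
    # their recorded create date; we exclude these (0.3% of shop
    # domains) to avoid conflating predictive listing with the reuse
    # of old domains.
    return blacklist_date < create_date - timedelta(days=30)

# Listed roughly six weeks before its create date, so excluded:
print(is_predictively_listed(datetime(2009, 9, 1), datetime(2009, 7, 15)))
```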

4.3.4 Spam Feeds

We also used two feeds of spam-advertised domains that we obtained from Pitsillidis et al. [69] between July 9, 2009 and March 18, 2010. We use these feeds to indicate, for instance, when spammers advertised the domains to customers in email spam. The first feed consists of domains captured by MX spamtraps, which are honeypot email addresses advertised so as to be visible only to Web scrapers searching for email addresses online. The second is a human-identified (HI) feed of domains contained in messages marked as spam by the users of a major Webmail provider on the provider's user interface.

Table 4.1. Classification of referrers used by SpamIt affiliates.

Category                     Shop Sites     Sales    Revenue            Revenue/Sale    Affiliates

Advertisement vectors             11957    147582    $18.05M (73.2%)         $122.36           330
  Email spam                      11898    145041    $17.83M (98.7%)         $122.94           326
  Web search                        173      2541    $0.23M  (1.25%)          $89.21            71
Infrastructure domains             1402     54351    $6.58M  (26.7%)         $121.17           174
  Free hosting                     1282     45941    $5.67M  (86.1%)         $123.37           154
  Bulk purchased domains            120      7781    $0.84M  (12.8%)         $108.36            50
  Compromised sites                  64       629    $0.07M  (1.13%)         $119.33            27
Purchased traffic                    11       199    $0.02M  (0.08%)         $104.15             4
Uncategorized                       863      4610    $0.54M  (2.20%)         $117.91           165

By construction, the HI feed contains domains that were actually seen by a human, whereas the MX feed contains domains that were indiscriminately advertised to all email addresses. The HI feed has two gaps, from September 19, 2009 to October 7, 2009 and from October 26, 2009 to November 12, 2009. Since our spam feeds end on March 18, 2010, we only consider shop domains created through March 10, 2010 to allow a week for domains to appear in the feeds. We believe this period is sufficient because over 90% of domains appear on each feed and blacklist within a week of their create date. Given the above constraints, our analysis of blacklisting only uses the overlapping subset of all these data sets (databases, blacklist, and spam feeds) between July 2009 and March 2010.

4.4 Domain Abuse

Our first goal is to understand how affiliates abused various domains and Web services to drive traffic to the pharmaceutical storefronts. We partition the domains in our data set into three categories. The first category contains domains that belonged to SpamIt and GlavMed and hosted storefronts where customers could purchase various drugs (primarily for erectile dysfunction). We call these "shop sites" or "shop domains".

Table 4.2. Classification of referrers used by GlavMed affiliates.

Category                     Shop Sites     Sales    Revenue            Revenue/Sale    Affiliates

Advertisement vectors              1433    134977    $13.8M  (38.3%)         $102.25           787
  Email spam                        615     10855    $1.35M  (9.81%)         $124.79           578
  Web search                       1182    124122    $12.4M  (90.2%)         $100.27           537
Infrastructure domains             1017    134832    $13.71M (38.0%)         $101.68           898
  Free hosting                      684     38094    $3.91M  (28.5%)         $102.65           654
  Bulk purchased domains            374     63639    $6.27M  (45.7%)          $98.55           356
  Compromised sites                 456     33099    $3.53M  (25.7%)         $106.59           393
Purchased traffic                   458     86657    $8.55M  (23.7%)          $98.68           366
Uncategorized                      1047     45337    $4.72M  (13.1%)         $104.21           890

There are 51.6K such domains in SpamIt and 2.3K in GlavMed, created between November 7, 2007 and April 30, 2010. The second category consists of domains representing an advertising vector: external Web services through which customers discovered the shop domains. These include Webmail providers (e.g., Gmail, Hotmail, etc.) and Web search engines such as Google Search, Yahoo Search, etc. The remaining domains are infrastructure domains that were used by affiliates to facilitate advertising via email and Web search, and to prevent exposing the shop domains directly to blacklists. These include free hosting domains (e.g., blogspot, geocities, etc.), which are legitimate sites where anyone can host free content, compromised private sites that did not belong to affiliates, and domains purchased by affiliates in bulk for the sole purpose of redirecting traffic to the shop domains. A significant portion of GlavMed revenue (23.7%, as shown in Table 4.2) also came from customers arriving at shop sites via traffic purchased from traffic sellers. As discussed later in this section, these services share characteristics with both advertising vectors and infrastructure domains, yet have a role distinct from the other categories, and therefore we have included them as a separate category. Finally, there are some domains we were unable to classify, in part due to limitations on finding reliable contemporaneous historical data about the domains. We labeled these domains as Uncategorized.3

Table 4.3. Example referrers for advertising vectors (email and Web search), infrastructure domains (free hosting, bulk, and compromised), and purchased traffic.

Category         Referrer

Email spam       http://mail.live.com/mail/readmessagelight.aspx?action=markasnotjunk&folderid=...
Web search       http://search.yahoo.com/search?p=canadian+viagra&ei=utf-8&fr=b1ie7
Free hosting     http://groups.google.com/group/...
                 http://www.umbc.edu/ddm/wiki/user:cheap_cialis
                 http://answers.yahoo.com/my/profile?show=...
Bulk purchased   http://accutanewithoutprescription.org
Compromised      http://library.newschool.edu/askal/request/.inc/c/clomid-without-prescription.html
Traffic          http://traffic-analytics.net/tds/in.cgi?3&seoref=http://search.comcast.net/?...q=lavitra&http_referer=http://www.plantright.org/?id=49&default_keyword=
                 http://klikcentral.com/traffic/in.cgi?11&parameter=buy%20viagra&seoref=http://www.google.com/search?q=buy+viagra&...&http_referer=www.vfcc.edu

While the shop sites are conveniently listed as such in the database dumps we received, we identified the advertising vectors and infrastructure domains using the HTTP referrers recorded for 30% of the 690K SpamIt sales (accounting for $25M) and 61% of the 660K GlavMed sales (totaling $41M). These referrers reflect the kind of Web site that led a customer to the shop site. For example, a customer arriving at a shop site after clicking on a URL for the shop site in an email message in Gmail will have a recorded referrer from mail.google.com.

To classify referrers, we used features such as domain names, historical page content from The Wayback Machine [93], historical WHOIS information from DomainTools [24], and keywords in the referrer URLs. For some vectors, such as free hosting domains, we were able to find aggregated lists of domains online, which we manually verified before using them to classify referrer URLs. Unfortunately, we do not have the entire redirection chain of URLs from a user's click to the shop domain, but only the penultimate referrer that led the customer to the shop site in the next hop. However, contemporaneous data from Levchenko et al. [49], collected by crawling spam domains as described in Chapter 2, shows that 90% of the 8M spam-advertised domains they crawled using Firefox resulted in either zero or one redirects, suggesting that the redirection chains are likely short.

In the remainder of this section we present our analysis of the subset of sales that have corresponding referrers.4 We start with some overall observations about the data. We then describe how we classified sales with referrers into the various categories, the sites and services that affiliates frequently targeted, and the spamming behavior of the top affiliates using each strategy. For reference, Table 4.3 shows example referrers in each category.

3Manually sampling these found them to be primarily bulk domains with some compromised domains.
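To make the classification procedure concrete, the following Python sketch shows the rule-based flavor of this matching. The seed lists and heuristics here are illustrative assumptions only; the actual classification also relied on manually verified domain lists, Wayback Machine snapshots, and WHOIS history.

```python
from urllib.parse import urlparse

# Illustrative seed lists (assumptions, not the verified lists we used).
WEBMAIL_DOMAINS = {"mail.google.com", "mail.live.com", "mail.yahoo.com"}
SEARCH_DOMAINS = {"www.google.com", "search.yahoo.com", "search.msn.com"}
FREE_HOSTING = {"groups.google.com", "spaces.live.com", "imageshack.us"}
PHARMA_KEYWORDS = ("viagra", "cialis", "pharmacy", "prescription")

def classify_referrer(url: str) -> str:
    """Assign a referrer URL to one of the coarse categories of
    Tables 4.1 and 4.2, defaulting to 'uncategorized'."""
    parsed = urlparse(url)
    host, path = parsed.netloc.lower(), parsed.path.lower()
    if host in WEBMAIL_DOMAINS:
        return "email spam"
    if host in SEARCH_DOMAINS:
        return "web search"
    if host in FREE_HOSTING:
        return "free hosting"
    # Bulk purchased domains carry pharmacy keywords in the domain name
    # itself, with the storefront content at the root of the site.
    if any(k in host for k in PHARMA_KEYWORDS) and path in ("", "/"):
        return "bulk purchased"
    # Compromised sites tend to hide pharmacy content deep in the path.
    if any(k in path for k in PHARMA_KEYWORDS):
        return "compromised"
    return "uncategorized"

print(classify_referrer("http://search.yahoo.com/search?p=canadian+viagra"))
```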

4.4.1 Overall Observations

Tables 4.1 and 4.2 break down the number of affiliates, sales, and revenue generated by the affiliates in each category. By intent, email spam was the dominant form of advertising used in SpamIt (Table 4.1). There was moderate use of infrastructure domains (26.7% of revenue, dominated by free hosting) to mask the shop domains in the URLs advertised, presumably also via email. In contrast, affiliates in GlavMed attracted customers mostly via Web search results (Table 4.2). However, the use of various infrastructure mechanisms to facilitate traffic via Web search was more prevalent (38% of revenue) and more evenly distributed than in SpamIt.

The differences in the use of infrastructure domains for SpamIt and GlavMed can be attributed to differing pressures in the dominant advertising channels (email and Web search, respectively) used by the two programs. SpamIt affiliates needed to bulk advertise their domains repeatedly via email to maintain traffic volumes, while GlavMed affiliates could place their content on a compromised site once and, in return, receive ongoing traffic from that site until it was identified and taken down by administrators.

4We can only speculate as to the remaining sales, but we suspect a large fraction arise from email clients that do not naturally transmit a referrer and, in some cases, from intermediate domains that explicitly strip referrers.

Similarly, while SpamIt affiliates had a cost structure that needed to accommodate adversarial blacklisting and filtering of email messages, GlavMed affiliates monetizing search traffic needed to maintain the rank of their shop sites for popular search terms. We discuss these differences further below in the context of individual categories.

Yet another interesting result of this classification is the revenue generated per sale. Notably, the average revenue per sale was relatively uniform at just over US$100/sale for all categories in both Tables 4.1 and 4.2. No matter how an affiliate attracted customers, customers tended to spend the same amount of money regardless of the kind of URL they clicked on; there is little customer differentiation by strategy in terms of revenue. The dominant goal for affiliates thus remained attracting as many customers as possible. Generally speaking, though, the top affiliates often used multiple kinds of infrastructure domains for redirection simultaneously, emphasizing one kind over another over time in response to a dynamic environment.

Whereas Tables 4.1 and 4.2 provide a summary overview, Figure 4.1 shows the temporal dynamics of the revenue from clicks on different kinds of domains over time (binned by weeks) for SpamIt and GlavMed. The dynamics in Figure 4.1 highlight the freedom of innovation of the affiliate program model, which provides the flexibility for different affiliates to explore different strategies for generating sales and the agility to react to defensive pressures. Even though the vast majority of revenue in SpamIt came from shop domains directly advertised via email, there was some use of infrastructure domains as redirection mechanisms. In July 2008, one affiliate began using free hosting sites. It was an effective strategy for a while, but gradually the free hosting providers were able to undermine the abusive practice. As free hosting dwindled, in January 2010 a small group of affiliates began using bulk domains for redirection as well, a strategy that remained profitable for three months.

In contrast, the use of Web search for direct advertisement of shop domains was much smaller in GlavMed, and there was significant use of infrastructure domains, presumably for search engine optimization (SEO). The revenue from different SEO efforts is more distributed. After a steady rise in sales throughout 2008, GlavMed experienced a jump in revenue primarily via purchased traffic and the use of bulk domains to direct traffic to shop sites. A rise in sales from direct advertising of shop domains on Web search contributed to another spike in January 2010.

4.4.2 Advertising Vectors

Email spam and Web search are the primary direct advertising vectors in SpamIt and GlavMed, respectively.

Email spam

Sales from email spam are those in the data set where users clicked on links to the shop sites advertised directly in email messages. Since SpamIt caters to email spammers, it is not surprising that email-based sales account for nearly all of its revenue.

We classified referrers in this category by matching domains of known popular Webmail providers (e.g., mail.google.com), regional Webmail providers (e.g., poczta.o2.pl), and keywords corresponding to known email clients. We also included sales from other online sites with internal message services, most notably Facebook. To validate likely but uncertain referrers, we manually inspected them by visiting the sites using the Wayback Machine [93]. The vast majority of sales came via spam to Webmail providers, with spam to Yahoo, Hotmail, and AOL accounting for 84% of the total revenue.

Historically, the goal of filtering email spam has been to prevent it from reaching the user's inbox. To account for the possibility of false positives, though, services file messages classified as spam separately (e.g., into a spam or junk folder with a timeout) rather than deleting them immediately. Surprisingly, the sales records indicate that this filtering approach does not necessarily undermine revenue: despite such active filtering, users intentionally locate messages classified as spam and visit the storefront sites advertised in the messages. In effect, for some users spam filtering, and contributing defenses such as blacklists, make it easier to locate advertised storefronts.

Specifically, we were able to infer the folder from which an email spam message was clicked for 68% of referrers from Hotmail and 40% of referrers from Yahoo Mail, using the well-known folder names in each service. For example, Table 4.3 shows a Hotmail referrer with the parameter folderid. A folderid of 5 corresponds to the spam folder, while 1 corresponds to the inbox. Similarly, the parameter fid contains the name of the folder for Yahoo Mail referrers. We found that for Hotmail, over 20% of email-based sales came from customers who clicked on links in messages not in the inbox. Similarly, 39% of sales from Yahoo Mail referrers arose from non-inbox folders: 31% are from the bulk folder and the remaining 8% from various custom folders with names such as online orders, cheap medication, viagra reorder, etc. Such folder names clearly suggest that some people save these messages for future use, and we also identified multiple referrers where users explicitly marked pharmaceutical spam as "not junk"; Table 4.3 shows such an example referrer for Hotmail. This evidence shows strong demand in the counterfeit pharmaceuticals market.
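The folder inference amounts to simple URL-parameter parsing. Below is a minimal sketch under the folder-id conventions just described (Hotmail folderid 5 = spam, 1 = inbox; Yahoo Mail fid carries the folder name); the function name and mapping table are our own.

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

# Mapping assumed from the Hotmail conventions described in the text.
HOTMAIL_FOLDERS = {"1": "inbox", "5": "spam"}

def infer_mail_folder(referrer: str) -> Optional[str]:
    """Infer the Webmail folder a spam message was clicked from, using
    the folderid (Hotmail) or fid (Yahoo Mail) referrer parameters."""
    parsed = urlparse(referrer)
    params = parse_qs(parsed.query.lower())
    host = parsed.netloc.lower()
    if host.endswith("mail.live.com") and "folderid" in params:
        return HOTMAIL_FOLDERS.get(params["folderid"][0], "other")
    if host.endswith("mail.yahoo.com") and "fid" in params:
        return params["fid"][0]  # folder name, e.g. a custom folder
    return None  # referrer does not expose folder information

print(infer_mail_folder(
    "http://mail.live.com/mail/readmessagelight.aspx?folderid=5"))
# -> "spam"
```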

Web search

With Web search sales, customers arrived at pharmacy sites by directly clicking on shop domain URLs in the results of Web search queries. Again reflecting the duality of the two affiliate programs, search-based sales are far more popular in GlavMed. Revenue from search results predominates in GlavMed at 31% of the total revenue, while it forms only a tiny fraction of SpamIt revenue at 1.2%.

Figure 4.1. Revenue per week from clicks on different kinds of referrers, for SpamIt (top: e-mail, free hosting, bulk, and other) and GlavMed (bottom: paid traffic, search, compromised, free hosting, bulk, and other), from early 2008 through early 2010.

As seen above with users explicitly searching their mail folders, Web search sales again demonstrate customer demand in explicitly seeking out online pharmacy sites. We identified sales from all major search engines (Google, Yahoo, Bing, Ask, and AOL), portal search sites such as search.rr.com, search.msn.com, and search.orange.co.uk, as well as other sites that allow searching for arbitrary keywords. Referrals from the top two search engines at the time, Google and Yahoo, dominate GlavMed's search-based revenue at 78%. Moreover, nearly all referrers include the keywords for which the customer searched (e.g., canadian viagra for the URL shown in Table 4.3). For Google and Yahoo, the most popular keywords are cialis and viagra, respectively. These terms reflect the overwhelming demand for male enhancement products in these programs [57].

Over the period of study, GlavMed affiliates received steady sales from search results, with 4,137 sales on average per month, and all shop sites received at least one sale from search engines. The affiliate webplanet received the most search-based sales in GlavMed, evolving his strategy over time. Webplanet initially received search-based sales from Yahoo and MSN Search, and did not start monetizing Google Search until April 2008. From that point, he attracted customers from both Yahoo and Google equally.

Lastly, we also observed referrers from mobile searches in GlavMed throughout the period of the data set. Although initially accounting for a negligible fraction of sales, monthly sales increased continuously over time, suggesting affiliates had started to explore a nascent yet growing advertising vector.

4.4.3 Infrastructure Domains

In addition to using email spam and Web search to attract customers to their shop sites, affiliates also made use of other Web services and domains to boost traffic from both of these vectors.

Free hosting

Free hosting domains are sites where any user can post content for free. Spammers frequently abuse these sites by creating blogs, profiles, forums, and wiki pages, posting comments, uploading images or other files, etc. The spammed content has links to entice potential customers to pharmacy shop sites.

We classified many of the free hosting domains in referrers using lists of such domains generally available online. Examples include docs.google.com, spaces.live.com, imageshack.us, etc. For domains that did not appear in our free hosting site list, in most cases we were able to identify free hosting sites when multiple referrers differed only in the profile identity string. We verified that these domains were in fact free hosting domains using the Wayback Machine. We also identified forum abuse using keywords such as viewtopic, discuss, showthread, etc. We manually inspected referrers to distinguish between open forums and wikis used to freely post content, and forums and wikis hosted on compromised sites (Section 4.4.3). Table 4.3 shows three canonical examples of free hosting referrer URLs from our data set, including a wiki hosted on umbc.edu that was used to create a page advertising erectile dysfunction drugs. We also included URL shortening services in this category, including translate.google.com, exploited as a redirection service. Even though services such as bitly.com were very popular in 2009 [10], we only see a small number of sales via shorteners, for structural reasons: most popular shortening services respond with a 301 Moved Permanently HTTP status, causing the browser to resend the request to the final site using the original referrer. As a result, the referrer seen by the shop site is the site where the user clicked on a shortener link, not the shortener itself.

Free hosting was the most popular form of infrastructure domain used in SpamIt (Table 4.1). While free hosting abuse was less popular than bulk domain abuse in GlavMed, it still comprised 5–7% of all sales for both affiliate programs. However, the nature of free hosting abuse differs for SpamIt and GlavMed because of differing objectives. SpamIt affiliates used free hosting to host content on trusted domains to overcome blacklisting-based content filtering (i.e., blacklists do not list google.com as a bad domain because of some abuse on docs.google.com). Thus, various services on google.com, live.com, yahoo.com, and imageshack.us are the most abused free hosting services among SpamIt affiliates. Google Groups was abused most effectively, typically using bogus group profiles, and accounted for 29% of the SpamIt revenue via free hosting sites.

In contrast, the motivation for free hosting abuse among GlavMed affiliates is to attract traffic by boosting the search engine rank of their domains. Also, abusing a large range of redirection sites causes multiple results linking to the same shop site to show up when a potential customer queries for pharmacies. Thus, among GlavMed sales we notice abuse of a larger number of free hosting sites (3,956 unique domain names vs. 830 among SpamIt), and sales are spread more evenly among domains: backpage.com was the most abused domain but accounted for only 6% of all free hosting abuse in GlavMed.

For spammers, a disadvantage of using free hosting sites is that once the abused domain removes the offending content, the spam links break and no longer point to the affiliate's shop site. While we do not know how long it takes sites to detect and take down spam pages, in at least one case correlating spammer behavior with news reports of abuse suggests that takedowns by free hosting sites require months to be effective. Furthermore, spammers seamlessly switched targeted sites in the face of such takedowns. Figure 4.2 illustrates the agility of a top SpamIt affiliate, master, who accounted for 93% of all sales via free hosting referrers. From July–August 2008, a large fraction of master's revenue came via ImageShack. Subsequently, in August–September 2008, reports emerged of abuse by email spammers sending messages with links to Flash files hosted on ImageShack [18].

Figure 4.2. Spammers seamlessly switch from one free hosting site to another in the face of takedowns (weekly revenue from Google Groups & Docs, ImageShack, MSN Spaces & LiveFileStore, and Yahoo Groups & blogs, mid-2008 through early 2010).

Revenue via ImageShack in SpamIt immediately declined (suggesting takedowns of the advertised files by ImageShack), while master's free hosting revenue from spaces.live.com increased from August–September 2008 using automatically created profile pages on MSN Spaces.

This pattern repeats. Revenue from live.com almost entirely disappears in November 2008, coinciding with a Spamhaus report that ranked Microsoft as the fifth most spam-friendly ISP [11]. Master's sales switched to heavy and almost exclusive abuse of Google free hosting services between December 2008 and February 2009. At that point revenue from Google Groups declined significantly, once again coinciding with Spamhaus ranking Google as the fourth most spam-friendly ISP [12]. Master's free hosting sales then switched to Google Docs, followed by a brief switch to the abuse of Yahoo in June–July 2009. Such trends demonstrate that even when free hosting sites took action against spammers, the general prevalence and availability of such sites enabled a skilled spammer to quickly and seamlessly switch to newer services in the face of takedowns.

Compromised sites

We also found a large number of referrers to domains in GlavMed that appear to be compromised sites. Such sites are valuable for their search rank in poisoning search results [40, 91] to attract traffic to storefronts: over 66% of these domains are under .edu or .gov TLDs, which purportedly have higher search engine rank.

Using DomainTools [24] and the Wayback Machine [93], we found legitimate sites hosting spam content in sub-directories. Often referrers from compromised sites contain content in either hidden directories (e.g., .inc as shown in Table 4.3) or sub-directories intended for other purposes (e.g., css, images). Some hackers added content to compromised sites in a signature style, facilitating matching. For example, eight affiliates received traffic from sites where the redirecting page had content placed in a directory named md.

A challenge in identifying compromised sites is distinguishing whether wiki and forum hosting software was compromised, or whether spammers just created their own free pages on publicly accessible wikis and forums. We classified discussion and message board abuse as free hosting abuse (Section 4.4.3). For wikis, however, we used the Wayback Machine to see if they were open for public editing at the time. If not, we considered them compromised.

As discussed in Section 4.4.4, hackers often compromise sites and install malware to direct customers to traffic buyers on demand. As a result, the relation between compromised sites and affiliates is often not one-to-one. Nearly 36% of compromised sites redirected traffic to multiple affiliates, while 65% of affiliates receiving this kind of traffic received it from more than one domain.

The most effective affiliates used compromised sites differently. GlavMed affiliate glavmed2 received the most revenue (12%) from 44 compromised sites, with one site (arkansasbaptist.edu) accounting for 81% of his revenue. While glavmed2 primarily monetized just one site, affiliate grbk received the second-highest revenue, distributed more evenly among 268 different sites.

SpamIt affiliates received a negligible amount of traffic from compromised sites. The primary advantage that compromised sites offer to email spammers is the reputation of the advertised site in the spam filter calculation, an advantage that free hosting sites offer as well, but at a much lower cost.

Bulk purchased domains

Affiliates purchase bulk domains as intermediaries for redirecting users to shop sites. Many bulk domains contain pharmacy-related keywords, such as accutanewithoutprescription.org, tramadol-shop24.com, etc. The pharmacy content is typically at the root of these sites (Table 4.3 shows an example), distinguishing them from compromised sites where the content is on pages deeper in the name hierarchy. Furthermore, each domain redirected sales to just one affiliate, suggesting that these domains were owned by the affiliates themselves.

The revenue from the use of bulk domains by SpamIt affiliates to redirect to shop domains is much smaller than from the use of free hosting sites (Table 4.1), even though bulk domains are inexpensive and can be purchased conveniently in large numbers. This small use is perhaps because bulk domains advertised in email spam are blacklisted very quickly and therefore do not offer much advantage over spamming shop domains directly.

For GlavMed affiliates, though, bulk domains offer advantages similar to compromised and free hosting sites by potentially increasing the number of results that appear on search pages and attracting more traffic. We counted 63,639 sales from 1,957 distinct bulk referrer domains, and nearly 23% of all the affiliates in GlavMed used bulk domains as a mechanism to attract buyers.

The use of bulk domains by the two highest earners once again highlights the flexibility of the affiliate program model. Venerable affiliate webplanet received the most revenue via bulk domains (43%), with over half of the revenue coming from three domains (bestedmed.com, newedpills.com, and thebettersexmall.com) and the remainder from 127 other domains. In contrast, affiliate andrew13plus had the next highest number of sales but used a different strategy for monetizing bulk purchased domains. In particular, andrew13plus apparently purchased domains after they had expired but had already accumulated useful search rank (e.g., carrollfootball.org, sharonlnorris.com, etc.). As a result, most of the sales from these domains came during the first month of use. The remaining revenue came within three months, after which the ownership of both of these sites, as well as their search potential, changed.

4.4.4 Purchased Traffic

Third-party advertising in general is a popular method for attracting traffic to Web sites, and affiliates of pharmaceutical sites are no exception. Purchased traffic comprises the second largest source of revenue (24%) for GlavMed affiliates. We grouped traffic providers into three classes, ranging from legitimate to suspicious services. First are premier advertising services such as Google Ads. Early in 2008, affiliates experimented with ads on such services, but soon abandoned them, presumably because of the high cost of popular pharmaceutical keywords.

The second class consists of traffic distribution systems (TDSs) such as traffic-analytics.net. These services act as intermediaries buying and selling traffic [81]. Frequently, TDS kits are also installed on compromised sites to gather traffic, which is then monetized in a variety of ways including forwarding traffic to pharmacy sites, fake anti-virus, malware distribution, etc. [19]. A distinguishing characteristic of these referrers is that the referring domains point to several affiliates and the referrer URLs frequently have affiliate IDs. Table 4.3 shows a referrer for a user who searched for levitra on search.comcast.net and clicked on plantright.org, a site owned by Sustainable Conservation between 2007–2010. Upon clicking, the customer was redirected to traffic-analytics.net, which then sent the customer to a GlavMed affiliate shop site. Purchased traffic from TDSs is attractive for affiliates because its illegitimate nature enables it to be cheaper than premier ad services; for instance, the price for keywords such as viagra and cialis varied between 30–90 cents per click based on bids from the TDS RivaClick, while bidding for the same terms on a mainstream advertising network cost several times more.

The third class consists of content providers who gather traffic to sell to TDS vendors. These two actors can also be the same. For example, klikcentral.com was a fake search engine for certain categories such as pharmaceuticals, cruise deals, online degrees, etc., where the results are primarily ads. However, we also have evidence of it gathering traffic from compromised sites via Google Search. Another referrer in Table 4.3, for instance, shows a user who queried for buy viagra on Google Search, landed on www.vfcc.edu, and was redirected through klikcentral.com to the pharmacy store site. Some sites also use the guise of legitimate search engines for pharmaceuticals, but actually just gather and resell traffic (e.g., viagra-prices-comparison.com). All of these sites redirect customers to more than one affiliate, including one such site, topmeds10.com, which redirected customers to as many as 73 affiliates. A large number of traffic buyers and sellers are merely intermediate domains (e.g., klikcentral.com) that specialize in search engine optimization, often using compromised sites. As such, they also act as infrastructure domains.

The top GlavMed affiliate using purchased traffic was once again webplanet, who received 39% of the revenue from purchased traffic. Through September 2008, webplanet received few sales from purchased traffic (83/month), but then invested heavily in purchased traffic over several months and averaged 3,537 sales per month.

4.5 Blacklisting

Since affiliates rely upon domains to host shop sites and intermediate sites, a common defense is to blacklist such domains. In particular, email services use domain blacklists to identify shop domains advertised via email spam, and classify incoming messages containing these domains into designated spam folders. In this section, we describe the impact of blacklisting on gross revenue for spam-advertised pharmaceutical campaigns. We use the URIBL blacklist described in Section 4.3 as a representative list of blacklisted domains and the leaked sales data sets as ground truth for the sellers impacted by blacklisting. As mentioned in Section 4.3, we restrict our analysis to the nine-month period from July 9, 2009 to March 18, 2010 for which all our data sets (leaked sales, blacklist, and spam feeds) overlap.

As an email-based blacklist, URIBL identified 88% of the 40K SpamIt shop domains as offending, most of them within two days of the domains' creation. Unsurprisingly, it identified only 4% of the 1K GlavMed domains, which were advertised predominantly via Web search. We therefore restrict our blacklisting analysis to SpamIt domains only. This subset of data has 40K SpamIt shop domains that received 137K sales grossing $15.6M in total revenue. In our analysis, we use four parameters, common to all blacklisting-based defenses, to assess the impact of blacklisting.

Figure 4.3. Revenue of domains before and after blacklisting (total revenue in USD, aligned so that time zero is each domain's blacklisting time, from six months before to six months after). Note that the x-axis is non-linear.

4.5.1 Blacklisting Speed

The first aspect we consider is the time it takes for a spam domain to appear on a blacklist. Figure 4.3 shows the revenue distribution of the 35K blacklisted domains before and after blacklisting. We normalize domains by the time they appear on the blacklist: time zero is their blacklisting time, revenue earned before being blacklisted is negative in time, and revenue earned afterward is positive in time. Relative to their blacklist time, the curve shows the amount of revenue earned from customers across all domains per hour.

The figure shows a number of interesting results. First, it shows that domains receive most of their revenue after they are blacklisted. The revenue before blacklisting was $740K, or just 13% of the total revenue of $5.9M (Table 4.4) from the blacklisted domains. One explanation is that blacklisting did not have a universal effect: customers may have received spam from email services that did not use the blacklist for spam filtering, or that deployed it sometime after the blacklist was updated, in which case blacklisting had no effect on them. However, as found in Section 4.4.2, even when email services use blacklisting to classify spam, customers still explicitly searched their spam folders and clicked on URLs with shop domains to make purchases.

In effect, while blacklisting does not immediately stop the domains from earning revenue, it does set a lifetime on their earning potential. Revenue rises sharply just before blacklisting, peaks immediately after blacklisting, and then drops substantially over the next two days. In contrast, domains that are not blacklisted earn revenue over much longer time spans (weeks and months). With blacklisting, domains no longer have significant earning potential after a few days, and affiliates have to purchase new domains to continue to earn revenue by advertising with spam. Therefore, keeping everything else constant, placing these domains on blacklists even "faster" would not have prevented the domains from being monetized.
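The normalization behind Figure 4.3 is straightforward to express in code. The following sketch, with hypothetical record formats of our own devising, bins each sale by its offset in hours from the domain's blacklisting time:

```python
from collections import defaultdict
from datetime import datetime

def revenue_relative_to_blacklisting(sales, blacklist_time):
    """Bin sale amounts by hours before (negative) or after (positive)
    each domain's blacklisting time, as in Figure 4.3."""
    bins = defaultdict(float)
    for domain, sale_time, amount in sales:
        if domain not in blacklist_time:
            continue  # non-blacklisted domains are analyzed separately
        delta = sale_time - blacklist_time[domain]
        bins[int(delta.total_seconds() // 3600)] += amount
    return dict(bins)

sales = [("pills.example", datetime(2009, 8, 2, 12, 0), 119.0)]
listed = {"pills.example": datetime(2009, 8, 1, 12, 0)}
print(revenue_relative_to_blacklisting(sales, listed))  # {24: 119.0}
```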

4.5.2 Coverage

During the period of study, URIBL identified an impressive 88% of all SpamIt domains. However, while missing only 12% of all domains attests to the diligence of the blacklist maintainers, this minority of domains still accounted for 62% ($9.7M, as shown in Table 4.4) of the total revenue from all blacklisted and non-blacklisted domains combined. Evading blacklists was clearly advantageous for affiliates.

While there is evidence that affiliates made some effort to evade blacklisting by making much greater use of free hosting and bulk domains as layers of indirection (Table 4.4), this difference is rather small. Even the non-blacklisted domains received most of their sales from clicks directly in email messages. Thus, it appears that these domains evaded blacklisting not just because of how they were advertised but also because of who they were advertised to. Blacklists are typically created using MX honeypot accounts that do not belong to real people. Also, McCoy et al. [57] showed that some affiliates were far more successful than others in their ability to spam effectively and earn more revenue. We found evidence suggesting that affiliates who evade blacklisting do not send their email messages to spam traps. As shown in Table 4.4, 88% of the blacklisted domains appeared on the MX feed that consists of URLs seen in spam traps, while only 0.5% of the non-blacklisted domains ever appeared on these feeds.

On the other hand, our human-identified (HI) feed identified 96% of the blacklisted domains as being spam, while only 25% of the non-blacklisted domains appeared on it.5 The observation that non-blacklisted domains were significantly more likely to be seen by a human rather than a spam trap suggests that there was a difference in the nature of the email addresses used to advertise blacklisted and non-blacklisted domains. We attribute this difference to the sophistication of the spammers advertising them.

To improve blacklist coverage going forward, one possibility is to extend the provenance of domains beyond just spam-advertised domains. For example, crawling spam-advertised links can determine whether a URL for an otherwise legitimate domain might lead to a pharmacy (or other counterfeit storefront) domain. Another possibility is for services that maintain human-identified feeds (e.g., Webmail providers) to share domains from spam-advertised URLs with blacklist maintainers.

Finally, if we assume that we could have blacklisted the domains that managed to avoid being listed, and that doing so would have caused them to monetize similarly to the blacklisted domains, then these domains would have earned just $168 on average as opposed to the $2038 that they actually earned. Thus, discovering each additional domain would have reduced spammer revenue by $1870 (92% of its original revenue).

4.5.3 Blacklisted Resource

As an intervention against affiliate spammers, blacklisting could use any of the uniquely identifiable resources used by spammers, such as the IP addresses of hosts sending spam messages, the domain names hosting storefronts, or even the bank accounts used to process transactions [49]. We next analyze the efficacy of choosing domains as the resource that is blacklisted.

5The stated number of domains appearing in HI is a lower bound because four weeks of data is missing in the feed.

Table 4.4. Statistical differences between blacklisted and non-blacklisted domains.

                            Blacklisted       Non-blacklisted
Shop sites                  34959             4751
Sales                       56K               80K
Revenue                     $5.9M             $9.7M
Sales/Site                  1.6               16.9
Affiliates                  119               144
Sites seen in feeds         34771 (99%)       1193 (25%)
  MX                        30647             27
  HI                        33701             1185
Sales with referrers        6076              28576
  Email spam                4798 (78.9%)      18206 (63.7%)
  Purchased traffic         -                 168 (0.59%)
  Free hosting              284 (6.31%)       4507 (15.8%)
  Compromised sites         8 (0.13%)         124 (0.44%)
  Web search                7 (0.11%)         22 (0.07%)
  Bulk purchased domains    709 (11.7%)       4375 (15.3%)
  Uncategorized             170 (2.79%)       1172 (4.10%)

In 2009–2010, the bulk price of a domain varied between 15¢ for a .cn domain [51] and $7 for a .com domain [88]. Purchasing domains in bulk can also be automated, making the effort to replace a blacklisted domain negligible. Moreover, infrastructure domains such as free hosting domains can be useful for evading blacklisting, yet are abundantly available at low prices as well.

During the nine months we consider for blacklisting, affiliates used 40K domains and 88% of them were blacklisted. Assuming that blacklisting forced affiliates to replace the listed domains, the total cost of replacing the domains was $245K when conservatively assuming $7 per domain. This cost is only 1.6% of the total revenue from this period. Again, given the low cost of domains and the relatively much higher revenue earned per domain, blacklisting did not impose a serious replacement cost. In Section 4.6 we estimate the cost per domain that would have made blacklisting prohibitively expensive for spammers.
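A quick back-of-the-envelope check of these replacement cost figures (the snippet below simply restates the computation, introducing no new data):

```python
domains = 40_000
blacklisted = 0.88 * domains        # 35,200 domains to replace
replacement_cost = blacklisted * 7  # $7 per domain, conservatively
revenue = 15_600_000                # $15.6M over the same nine months
print(f"${replacement_cost:,.0f} is {replacement_cost / revenue:.1%} of revenue")
# -> $246,400 is 1.6% of revenue (reported as ~$245K in the text)
```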

4.5.4 Blacklisting Penalty

The last aspect we consider is the penalty of having domains blacklisted. During 2009–2010, blacklisting was used to identify and filter spam messages into designated spam or junk folders, where they would remain for up to a month for most Webmail providers. In the absence of demand, such spam filtering would have hidden unwanted ads away from user inboxes. As discussed above, however, despite blacklisting, affiliates continued to receive sales at least in part as a result of market demand: 20–40% of customers accessed messages in their spam folder, searched for drugs by name, etc. Thus, for the spam-advertised unlicensed prescription drug market, the penalty imposed by a classification-based defense was overshadowed by the demand for these drugs. As discussed in Section 4.4.2, classifying messages as spam effectively made it easier for people to find these storefronts, thereby enabling a domain, at least to some extent, to earn 87% of its total revenue after blacklisting (Figure 4.3). In the next section, we consider the effects of increasing the blacklisting penalty.

4.6 Discussion

While the previous section describes the lackluster role played by blacklisting in our data set, it raises the larger question of whether these results might change if blacklisting were deployed more aggressively. To reason about this issue, we parameterize a simple model of blacklisting impact and then explore the general implications for blacklisting email-advertised domains.

4.6.1 A Simple Revenue Model

In general, prediction and extrapolation can be highly error-prone, and this is only enhanced by the presence of an intelligent adversary. Thus, we use a very simple model to capture the impact of blacklisting on profitability. As shown in Section 4.5, blacklisting does have a significant opportunity cost for the spammer, but our assumption is that the central goal of domain blacklisting is to hurt spammer revenue sufficiently to make such unsolicited email-based domain advertising unprofitable altogether. Thus, we parameterize a model to determine the conditions under which blacklisting can achieve this goal.

To formulate our model, we first assume that there is a sufficiently good detection capability that all abusively-advertised domains will eventually be blacklisted (an assumption we will return to in Section 4.6.3). We then model the marginal revenue R from a domain as the composition of the revenue before blacklisting and the revenue after blacklisting. To describe the pre-blacklisting revenue in a parsimonious fashion, we use a single parameter α to represent the mean revenue per unit time, and we assume that this revenue remains constant from the time a domain is first advertised until it is eventually blacklisted. Once the domain is blacklisted we assume that, as in our data set, sales decline swiftly and thus all additional revenue can be captured by a single parameter β.6 Thus, we describe the marginal revenue per domain as:

R = α ∗ t + β

6Recall from Section 4.5.1 that 87% of income for blacklisted domains occurs after blacklisting, due to some combination of demand, delays in using blacklist data, or non-universal deployment of blacklist-based filtering.

Finally, assuming that blacklisting causes spammers to replace a domain, we estimate the marginal cost of every domain as c. We estimate this cost as only the cost of purchasing and registering a domain name. We assume that all other costs associated with replacing a blacklisted domain name, such as creating a new Web site for the pharmacy storefront or even sending out spam emails to advertise a domain, are negligible because these processes can be automated and do not require appreciable resources for every additional domain at scale. Even the cost of attempting to evade blacklisting (through the use of "list washing" services) is amortized over all domains because the same email address lists can be used for all domains. Thus, such efforts do not impose any marginal cost for a new domain acquired due to blacklisting. Past work in the area also suggests that domain registration is the dominant per-domain cost [57]. Combining these, the marginal profit per domain for a spammer is:

P = α ∗ t + β − c

This simple linear relation does not capture the variation in revenue earned per domain (either before or after blacklisting). Since we are dealing with aggregates, we believe this approximation is sufficient to examine gross effects. Using the blacklisted domains in our data set we can empirically calculate the values for these parameters, with the average pre-blacklisting revenue per domain per day α being $1.14 and the average post-blacklisting revenue β being $104.19. These values reflect some combination of the contemporaneous demand for pharmaceutical products and the intensity of the advertising effort, and are by no means universal. Domain registration cost varies considerably with time and TLD, ranging from a low of roughly $0.15 for .cn domains circa 2008–2009 to $7 for retail .com domains during roughly the same time period, as mentioned in Section 4.5.3 (in practice, bulk domains for selected TLDs are readily available today for $2–3 each from a broad range of resellers). At any price, however, it is clear that the marginal revenue per domain in our data set is at least an order of magnitude greater than the marginal cost imposed on the spammer by the blacklisting of a domain.

Figure 4.4. The highest domain cost a spammer can afford (y-axis, $0–200 per domain) against the time delay in blacklisting (x-axis, up to three months), for a filtering-based blacklist and for takedown upon blacklisting.

4.6.2 Changing Blacklisting Penalty

Given our model, there are two factors that we now consider: how quickly a newly advertised domain is blacklisted, and the regime in which blacklisting is used to undermine the advertising vector, either filtering (e.g., as in anti-spam) or blocking (e.g., as in DNS blocking or registrar takedown). In the nomenclature of our model, this corresponds to varying t and β (setting β to zero in the case of blocking). We capture both of these effects in Figure 4.4, which plots the maximum cost per domain that a spammer can afford (i.e., the break-even point) for both regimes. The dotted line corresponds to a filtering regime, like spam filtering, in which revenue is acquired (β) even after a domain appears on a blacklist. Given the empirical parameter values described earlier, even if a domain were blacklisted instantly, the post-blacklist revenue is such that per-domain costs would need to be greater than $100 to make advertising unprofitable.

The solid line, however, reflects a regime in which a domain ceases to generate revenue once it has been blacklisted (e.g., because the domain name is shut down by the registrar or, à la the proposed SOPA legislation, because DNS resolvers refuse to look up the associated A records). Thus β → 0, and the break-even point is represented simply by c = α ∗ t. In this case, there is a meaningful interplay between the time to blacklist and the practical cost of domains. Even the nominal cost of $2.28 per domain (in line with current prices for cheap bulk registration) is sufficient to undermine the profitability of blacklisted domains. These results suggest that even large reductions in blacklisting latency would not have made costs prohibitive for spammers, whereas increasing the penalty of being listed on a blacklist could have had more severe consequences for the domains that were identified.
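To make the break-even computation concrete, here is a minimal sketch evaluating the model under the empirical parameters above (α = $1.14/day, β = $104.19); the function name is ours.

```python
ALPHA = 1.14   # mean pre-blacklisting revenue per domain per day ($)
BETA = 104.19  # mean post-blacklisting revenue per domain ($)

def break_even_cost(days_to_blacklist: float, takedown: bool) -> float:
    """Highest per-domain cost c at which spamming stays profitable,
    i.e., the c for which P = ALPHA * t + beta - c = 0. Under a
    takedown regime, post-blacklisting revenue (beta) drops to zero."""
    beta = 0.0 if takedown else BETA
    return ALPHA * days_to_blacklist + beta

# Filtering regime: even instant blacklisting leaves a break-even point
# above $104, so $7 retail .com domains remain very profitable.
print(break_even_cost(0, takedown=False))  # 104.19
# Takedown regime: a domain blacklisted within two days breaks even at
# the $2.28 bulk registration price cited above.
print(break_even_cost(2, takedown=True))   # 2.28
```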

4.6.3 Increasing Coverage

The discussion above neglects the small number of non-blacklisted domains in our data set altogether. We do not consider them here because, absent a blacklisting date, we cannot reason about how their revenue changes before and after blacklisting. However, these domains constitute almost two-thirds of total revenue (Table 4.4), even for the email spam vector, and hence were a major source of revenue for spammers.

Putting these findings together suggests that while blacklisting did have an impact on spammer revenue, characteristics such as high consumer demand, the sophistication of spammers, and the reactive use of classification-based blacklisting made it far less effective. Our blacklisting analysis only focused on SpamIt during 2009–2010, when counterfeit pharmaceuticals dominated email spam [20]. However, since the shutdown of SpamIt (possibly due to its inability to process MasterCard payments [13]), the fraction of global email volume attributable to spam steadily declined from 87% in 2009 to 70% in 2013 [21]. Qualitatively, spam is now dominated by phishing messages instead of ads for pharmaceuticals and other counterfeit goods. Counterfeit pharmaceuticals are now almost solely advertised through the non-email vectors discussed in Section 4.4.

Despite these changes in global spam trends, filtering-based blacklisting continues to be the primary form of intervention for email spam. Spam still dominates global email volume, suggesting that even the current monetization methods for email spam are profitable. Thus, blacklisting remains an important intervention mechanism. Consequently, our findings that further improvements in the sensors used to identify domains and better data sharing on spam domains are necessary to defend against email spam remain applicable today.

Our quantitative analysis of the revenue impact of domain blacklisting on email spam is limited to the counterfeit pharmaceuticals market because features such as consumer demand and conversion rates vary across markets. However, our results remain applicable for revenue from counterfeit pharmaceuticals sold through non-email vectors. Indeed, improved domain-based classification tools for these vectors (such as social networks, ads, and search) still have the potential to significantly affect the business of online counterfeit pharmaceuticals.

4.7 Related Work

Various aspects of the online pharmaceutical industry have been discussed in the research literature, including its use of abusive advertising and defense mechanisms against the same [42, 67, 92, 96]. For example, Leontiadis et al. [48] studied the use of compromised sites to drive traffic to online pharmacy storefronts. Our work differs both in providing a wider analysis of domain abuse (email, search, free hosting, traffic sellers, etc.) and, more critically, in examining the relative revenue provided by such traffic. We provide analysis of ground truth data and explain the most successful abuse strategies for spammers in the face of takedowns. Similarly, Moore et al. [61] also look at the potential impact of blacklisting pharmaceutical domains (in particular for search engine result traffic). By comparison, our work is narrower (restricted to only those sites of a particular set of pharmaceutical affiliate programs) but is comprehensive within that set, and we are able to analyze the true economic activity in dollars rather than using proxies such as site popularity.

More broadly, the use of blacklisting for filtering email spam has been a popular topic for many years. However, most of the work in this space is aimed at evaluating and improving the mechanical aspects of blacklisting-based defenses, such as speed and coverage [30, 54, 74, 77, 99]. By contrast, our work has focused on the larger question of the extent to which blacklisting efforts impact the profitability of the underlying business enterprise. Closer to our work in motivation are recent efforts focused on "payment intervention", which seeks to undermine the profitability of abusive businesses by blocking their ability to obtain consumer payments [49, 56]. Our work explores similar questions, but focuses on a different point of intervention (domain names).

Finally, our work builds on and supplements that of McCoy et al. [57], which uses overlapping but different aspects of the same data set. McCoy et al. focused on analyzing the nature of global demand for counterfeit pharmaceuticals, the role of third-party affiliates in the industry, and the cost structure of such businesses, while our work is concerned with the role played by domain names within this business model and uses external data sets to measure the impact of blacklisting defenses.

4.8 Summary

There are as many ways to spam as there are to communicate, yet virtually all of them are Web-centric and require user clicks to convert. This commonality makes domain blacklisting a highly attractive mechanism for managing unwanted ads. Indeed, all evidence suggests that blacklisting is a quick and largely comprehensive process (at least for email spam, which has an active blacklisting ecosystem). However, the success of domain blacklisting has done little to stem the tide of email spam (let alone other abusive advertising practices). That a defense can simultaneously achieve its goal, yet not appreciably bother the adversary, is counter-intuitive, yet this fairly describes the current state of affairs.

Our study of thousands of online pharmaceutical sites demonstrates that a combination of appreciable demand for counterfeit pharmaceuticals (indicated by the large fraction of revenue arising from email spam folders and from Web search queries leading users to pharmacy sites), the ability of sophisticated spammers to evade blacklisting and heavily monetize a small number of domains, and the existence of multiple vectors for traffic ensures that the online pharmaceuticals business remains profitable. As such, in this context of high demand and significant returns from successful evasion, domains alone do not impose a significant resource cost on the attacker, who utilizes the whole spectrum of advertising strategies and is agile in the face of takedowns. Finally, our results suggest that changing the blacklisting penalty for email spam from simply filtering messages into spam folders to completely blocking access to domains is necessary to grossly undermine the profitability of the blacklisted domains. Even then, blacklist evasion and the other traffic vectors remain possible and lucrative alternatives for the attacker.

4.9 Acknowledgements

Chapter 4, in full, is a reprint of the material as it appears in Proceedings of the Workshop on the Economics of Information Security (WEIS). Neha Chachra, Damon McCoy, Stefan Savage, Geoffrey M. Voelker, 2014. The dissertation author was the primary investigator and author of this paper.

Chapter 5

Conclusion

In this chapter, we first summarize this dissertation, and then conclude with some final thoughts on the possible future directions for this area of work.

5.1 Dissertation Summary

In this dissertation, we have demonstrated the usefulness of large-scale data for understanding the cost dynamics in a fraudulent ecosystem, thereby enabling us to examine the revenue impact of different interventions. We built an efficient Web crawler for gathering data from attacker-controlled URLs at scale, and provided general insights into building Web crawlers for adversarial ecosystems. We performed two case studies that further our understanding of URL abuse for profit: the counterfeit pharmaceutical market and the affiliate marketing ecosystem. In both of these ecosystems, we analyzed the underlying conflict between attackers and defenders, and explained the observed measurements using a simple assumption: scammers minimize resource cost to maximize profitability. For example, to advertise pharmacy storefronts in email spam, spammers abuse free hosting domains (e.g., spaces.live.com) to redirect users to the actual storefront URL, so that blacklisting the URL visible to defenders in the spam messages imposes minimal cost. Similarly, when perpetrating affiliate fraud, scammers target stricter affiliate programs with more sophisticated and expensive techniques, trading off the risk of detection against the profitability and likelihood of monetization in those programs.

In fact, using ground truth data we can also determine the efficacy of different forms of intervention under the status quo as well as under hypothetical scenarios. For example, we analyzed revenue data from counterfeit pharmacies and established the limited efficacy of domain blacklisting. We also projected the revenue impact of blacklisting under hypothetical increases in blacklisting speed and coverage. Our projections show that URL blacklisting would only make the counterfeit pharmaceutical market unprofitable under an unachievable regime involving significantly increased detection of fraudulent domains and immediate, universal blacklisting of the discovered domains.
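The free-hosting example above can be made concrete with a small sketch; the domains, redirect chain, and blacklist contents below are all hypothetical:

# Illustrative sketch of why blacklisting only the URL visible in spam
# imposes little cost: the listed domain is a disposable free-hosting
# redirector, while the revenue-bearing storefront never appears in the
# spam itself. All domains here are hypothetical examples.
from urllib.parse import urlparse

blacklist = {"spaces.live.com"}            # what defenders see and list
redirect_chain = [
    "http://spaces.live.com/abc123",       # free redirector in the spam
    "http://rx-storefront.example.com/",   # actual storefront (hypothetical)
]

for url in redirect_chain:
    domain = urlparse(url).netloc
    print(domain, "->", "blacklisted" if domain in blacklist else "clean")

# Replacing the redirector costs the spammer nothing; the storefront
# domain continues to monetize traffic from every advertising vector.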

5.2 Future Directions and Final Thoughts

The nature of adversarial security research is such that both the attacker and the defender continuously evolve their capabilities in a perpetual arms race. Thus, we expect that defensive crawling will have to evolve further as it is deployed for engaging adversaries in completely new ecosystems in the future. When we thoroughly crawled and analyzed email spam in 2010, spam constituted almost 90% of all email messages in transit, and a third of the spam messages advertised health-related goods [55]. By 2015, spam had declined to just over 50% of all email messages [73]. The composition of email spam has also changed over the years: it is now dominated by phishing and dating scams. While phishing is very well-documented in the academic literature [53, 74, 99], there have been only a few limited, mostly qualitative studies of dating scams [37, 94]. Specifically, we are unaware of any study of dating scams advertised via email spam. Thus, studying dating email spam would be a natural future direction for this area of work.

Also, even though we demonstrated the limitations of blacklisting for email spam, it continues to be the most popular intervention deployed universally. One avenue for blacklist improvement is data sharing between different online service providers. Specifically, we observed attackers rapidly migrating from one free hosting service to another in the face of takedowns. We also observed that the human-identified spam feed we received from a major email provider had much greater visibility into the most profitable domains that evaded defenders’ sensors. With increased initiatives [28] among companies to share this data, it is likely that the problem of blacklisting coverage may be alleviated. Another future direction is to study the impact of these initiatives on the efficacy of interventions such as blacklisting.

In our study of affiliate fraud on the Web, we detected cookie-stuffing to a much smaller extent than we initially expected. However, now that we understand affiliate marketing much more closely, it is highly plausible that there is simply very little affiliate marketing fraud on the Web overall. One key difference between an ecosystem such as email spam and affiliate marketing is that, unlike email service providers, affiliate programs have much greater visibility into affiliate behavior, such as the rate of conversions and the rate at which cookies are requested. Furthermore, while email service providers can block a URL (or, more commonly, filter it into a separate folder), an affiliate program can simply cancel payout and ban an affiliate. Often, affiliate programs require a new affiliate to sign up with a Social Security Number and a bank account number for payouts. The limited availability of these resources to the attacker makes blacklisting an affiliate very effective. Also, affiliate programs are usually not accountable to affiliates and reserve the right to ban affiliate accounts upon observing suspicious activity. Thus, affiliate marketing programs have an inbuilt avenue for very effective intervention that immediately limits fraudulent revenue.
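As a concrete illustration of that visibility, the following minimal sketch flags affiliates whose conversion rates are anomalously low relative to their peers, the telltale pattern of indiscriminate cookie-stuffing; the affiliate identifiers, counts, and threshold are all hypothetical:

# Minimal sketch of the visibility an affiliate program has into its
# affiliates: a cookie-stuffer sets far more cookies than it converts,
# so its conversion rate falls well below the population baseline.
# All counts and the threshold below are hypothetical.

affiliates = {
    "aff-001": (1000, 25),    # (cookies set, conversions)
    "aff-002": (1200, 30),
    "aff-003": (50000, 40),   # stuffs cookies indiscriminately
}

rates = {a: conv / cookies for a, (cookies, conv) in affiliates.items()}
baseline = sum(rates.values()) / len(rates)

for aff, rate in sorted(rates.items()):
    if rate < 0.25 * baseline:  # arbitrary illustrative threshold
        print(f"{aff}: conversion rate {rate:.4f} is anomalous; hold payout")

Because the program itself records both the cookie-setting requests and the eventual conversions, such a check requires no Web-scale crawling, and suspending the flagged affiliate's payout is immediate.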
However, we did observe affiliate fraud via browser extensions. These extensions provide much greater control to the attacker, who can regulate the rate of conversions and the referrer URLs seen by the affiliate programs. Furthermore, policing browser extensions is very challenging because the extension ecosystem is vast and dynamic. Even if affiliate programs monitor the source code of all publicly available extensions, eliciting malicious behavior from browser extensions automatically at scale poses significant challenges [44]. Thus, another direction entails measuring and quantifying affiliate fraud via browser extensions at scale. After our study described in Chapter 3, Google Chrome added the following change to their policies: “Don’t misrepresent the functionality of your app or include non-obvious functionality that doesn’t serve the primary purpose of the app without clear notification to the user.” [15]. However, the efficacy of this new policy is unclear and would also be an interesting dimension to study.

Bibliography

[1] Chrome Web Store. https://chrome.google.com/webstore/category/extensions. Accessed: 2015-11-25.

[2] Common Crawl. https://commoncrawl.org/the-data/get-started/. Accessed: 2015-11-25.

[3] Infographic: Internet Shopping. http://www.theguardian.com/technology/blog/2011/jul/04/internet-shopping-infographic-give-as-you-live-charity, 2011.

[4] Give as you Live. https://chrome.google.com/webstore/detail/give-as-you-live/fceblikkhnkbdimejiaapjnijnfegnii, 2013.

[5] Alexa. Does Alexa have a list of its top-ranked websites? https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-. Accessed: 2015-11-25.

[6] Amazon. Associates Program Advertising Fee Schedule. https://affiliate-program.amazon.com/gp/associates/help/operating/advertisingfees.

[7] Amazon. Associates Program Participation Requirements. https://affiliate-program.amazon.com/gp/associates/help/operating/participation/.

[8] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M. Voelker. Spamscatter: Characterizing Internet Scam Hosting Infrastructure. In Proceedings of the USENIX Security Symposium, 2007.

[9] Behind Online Pharma. From Mumbai to Riga to New York: Our Investigative Class Follows the Trail of Illegal Pharma. http://behindonlinepharma.com, 2009.

[10] Ben Parr. Bit.ly is Eating Other URL Shorteners for Breakfast [Stats]. http://mashable.com/2009/10/12/bitly-domination/, 2009.

[11] Brian Krebs. Spamhaus: Microsoft Now 5th Most Spam Friendly ISP. http://voices.washingtonpost.com/securityfix/2008/11/spamhaus_microsoft_now_5th_mos.html, 2008.

[12] Brian Krebs. Spamhaus: Google Now 4th Most Spam-Friendly Provider. http://voices.washingtonpost.com/securityfix/2009/01/google_now_4th_most_spam-frien.html, 2009.

[13] Brian Krebs. Spam Affiliate Program Spamit.com to Close. http://krebsonsecurity.com/2010/09/spam-affialite-program-spamit-com-to-close/, 2010.

[14] Ken Chiang and Levi Lloyd. A Case Study of the Rustock Rootkit and Spam Bot. In Proceedings of the 1st USENIX Workshop on Hot Topics in Understanding Botnets (HotBots), 2007.

[15] Google Chrome. Developer Program Policies. https://developer.chrome.com/webstore/program_policies. Accessed: 2015-11-25.

[16] Civil Cover Sheet. http://www.benedelman.org/affiliate-litigation/ebay-digitalpoint-hogan-kessler-thunderwood-dunning-complaint.pdf#page=8, 2008.

[17] Controlling the Assault of Non-Solicited Pornography and Marketing Act of 2003. http://www.gpo.gov/fdsys/pkg/PLAW-108publ187/pdf/PLAW-108publ187.pdf, 2003.

[18] Dancho Danchev. Malware and Spam Attacks Exploiting Picasa and ImageShack. http://www.zdnet.com/blog/security/malware-and-spam-attacks-exploiting-picasa-and-imageshack/1852, 2008.

[19] Daniel Cid. Large Scale Compromises Leading to Traffic Distribution System. http://blog.sucuri.net/2013/02/large-scale-compromises-leading-to-tds.html, 2013.

[20] Darya Gudkova. Spam Evolution: January – March 2009. https://www.securelist.com/en/analysis/204792061/Spam_evolution_January_March_2009, 2009.

[21] Darya Gudkova. Kaspersky Security Bulletin. Spam Evolution 2013. http://www.securelist.com/en/analysis/204792322/Kaspersky_Security_Bulletin_Spam_evolution_2013, 2014.

[22] Derek Broes. Why Should You Fear SOPA and PIPA? http://www.forbes.com/sites/derekbroes/2012/01/20/why-should-you-fear-sopa-and-pipa/, 2012.

[23] Digital Point. Cookie Search. https://tools.digitalpoint.com/cookie-search. Accessed: 2015-11-25.

[24] DomainTools. http://www.domaintools.com/.

[25] Ben Edelman. Affiliate Fraud Litigation Index. http://www.benedelman.org/affiliate-litigation/, 2015.

[26] Ben Edelman and Wesley Brandi. Risk, Information, and Incentives in Online Affiliate Marketing. In Journal of Marketing Research, 2014.

[27] Electronic Frontier Foundation. SOPA/PIPA: Internet Blacklist Legislation. https://www.eff.org/issues/coica-internet-censorship-and-copyright-bill. Accessed: 2015-11-25.

[28] Facebook. ThreatExchange. https://developers.facebook.com/products/threat-exchange/. Accessed: 2015-11-25.

[29] Federal Trade Commission. Guides Concerning the Use of Endorsements and Testimonials in Advertising. https://www.ftc.gov/sites/default/files/attachments/press-releases/ftc-publishes-final-guides-governing-endorsements-testimonials/091005revisedendorsementguides.pdf, 2009.

[30] Márk Félegyházi, Christian Kreibich, and Vern Paxson. On the Potential of Proactive Domain Blacklisting. In Proceedings of the 3rd USENIX LEET, 2010.

[31] United States District Court for the Northern District of California. UNITED STATES OF AMERICA v. SHAWN D. HOGAN. Indictment. http://www.benedelman.org/affiliate-litigation/hogan-indictment.pdf, 2010.

[32] Hongyu Gao, Yan Chen, Kathy Lee, Diana Palsetia, and Alok N. Choudhary. Towards Online Spam Filtering in Social Networks. In Proceedings of the Network and Distributed System Security Symposium, 2012.

[33] Google. Developer’s Guide. https://developer.chrome.com/extensions/devguide. Accessed: 2015-11-25.

[34] Chris Grier, Kurt Thomas, Vern Paxson, and Michael Zhang. @spam: The Underground on 140 Characters or Less. In Proceedings of the 17th ACM Conference on Computer and Communications Security, 2010.

[35] Pedro H. Calais Guerra, Dorgival Guedes, Wagner Meira Jr., Cristine Hoepers, Marcelo H. P. C. Chaves, and Klaus Steding-Jessen. Spamming Chains: A New Way of Understanding Spammer Behavior. In Proceedings of the 6th CEAS, 2009.

[36] HostGator. Affiliate Terms of Service. http://www.hostgator.com/tos/affiliate-tos. Accessed: 2015-04-15.

[37] JingMin Huang, Gianluca Stringhini, and Peng Yong. Quit Playing Games With My Heart: Understanding Online Dating Scams. 2015.

[38] Jillian York. Australia Heads Down the Slippery Slope, Authorizes ISPs to Filter. https://www.eff.org/deeplinks/2011/06/australia-heads-down-slippery-slope-authorizes, 2011.

[39] John P. John, Alexander Moshchuk, Steven D. Gribble, and Arvind Krishnamurthy. Studying Spamming Botnets using Botlab. In Proceedings of the 6th ACM/USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2009.

[40] John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy, and Martín Abadi. deSEO: Combating Search-result Poisoning. In Proceedings of the 20th USENIX Security Symposium, 2011.

[41] Jaeyeon Jung and Emil Sit. An Empirical Study of Spam Traffic and the Use of DNS Black Lists. In Internet Measurement Conference, 2004.

[42] Chris Kanich, Christian Kreibich, Kirill Levchenko, Brandon Enright, Geoffrey M. Voelker, Vern Paxson, and Stefan Savage. Spamalytics: An Empirical Analysis of Spam Marketing Conversion. Communications of the ACM, 2009.

[43] Chris Kanich, Nicholas Weaver, Damon McCoy, Tristan Halvorson, Christian Kreibich, Kirill Levchenko, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. Show Me the Money: Characterizing Spam-advertised Revenue. In Proceedings of the USENIX Security Symposium, 2011.

[44] Alexandros Kapravelos, Chris Grier, Neha Chachra, Christopher Kruegel, Giovanni Vigna, and Vern Paxson. Hulk: Eliciting Malicious Behavior in Browser Extensions. In Proceedings of the 23rd USENIX Security Symposium, 2014.

[45] Maria Konte, Nick Feamster, and Jaeyeon Jung. Dynamics of Online Scam Hosting Infrastructure. In Proceedings of the 10th Passive and Active Measurement Conference (PAM), 2009.

[46] Brian Krebs. SpamIt, Glavmed Pharmacy Networks Exposed. Krebs on Security Blog, http://www.krebsonsecurity.com/category/pharma-wars/, 2011.

[47] Christian Kreibich, Chris Kanich, Kirill Levchenko, Brandon Enright, Geoffrey M. Voelker, Vern Paxson, and Stefan Savage. Spamcraft: An Inside Look at Spam Campaign Orchestration. In Proceedings of the USENIX Workshop on Large-scale Exploits and Emergent Threats (LEET), 2009.

[48] Nektarios Leontiadis, Tyler Moore, and Nicolas Christin. Measuring and Analyzing Search-Redirection Attacks in the Illicit Online Prescription Drug Trade. In Proceedings of the 20th USENIX Security Symposium, 2011.

[49] Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Nicholas Weaver, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. Click Trajectories: End-to-End Analysis of the Spam Value Chain. In Proceedings of the IEEE Symposium on Security and Privacy, 2011.

[50] Vladimir I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady, 1966.

[51] He Liu, Kirill Levchenko, Márk Félegyházi, Christian Kreibich, Gregor Maier, Geoffrey M. Voelker, and Stefan Savage. On the Effects of Registrar-level Intervention. In Proceedings of the USENIX Workshop on Large-scale Exploits and Emergent Threats (LEET), 2011.

[52] Graham Lowe, Patrick Winters, and Michael L. Marcus. The Great DNS Wall of China. http://cs.nyu.edu/~pcw216/work/nds/final.pdf, 2007.

[53] Christian Ludl, Sean Mcallister, Engin Kirda, and Christopher Kruegel. On the Effectiveness of Techniques to Detect Phishing Sites. In Proceedings of the 4th DIMVA, 2007.

[54] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Identifying Suspicious URLs: An Application of Large-Scale Online Learning. In Proceedings of the 26th ICML, 2009.

[55] Maria Namestnikova. Securelist. Spam Evolution: January 2010. https://securelist.com/analysis/monthly-spam-reports/36286/spam-evolution-january-2010/, 2010.

[56] Damon McCoy, Hitesh Dharmdasani, Christian Kreibich, Geoffrey M. Voelker, and Stefan Savage. Priceless: The Role of Payments in Abuse-advertised Goods. In Proceedings of the ACM Conference on Computer and Communications Security, 2012.

[57] Damon McCoy, Andreas Pitsillidis, Grant Jordan, Nicholas Weaver, Christian Kreibich, Brian Krebs, Geoffrey M. Voelker, Stefan Savage, and Kirill Levchenko. PharmaLeaks: Understanding the Business of Online Pharmaceutical Affiliate Programs. In Proceedings of the USENIX Security Symposium, 2012.

[58] Gilad Mishne, David Carmel, and Ronny Lempel. Blocking Blog Spam with Language Model Disagreement. In Proceedings of Adversarial Information Retrieval on the Web (AIRWeb), 2005.

[59] Tyler Moore, Richard Clayton, and Henry Stern. Temporal Correlations between Spam and Phishing Websites. In Proceedings of the 2nd Workshop on Large-Scale Exploits and Emergent Threats (LEET), 2009.

[60] Tyler Moore and Ben Edelman. Measuring the Perpetrators and Funders of Typosquatting. In Financial Cryptography and Data Security, 2010.

[61] Tyler Moore, Nektarios Leontiadis, and Nicolas Christin. Fashion Crimes: Trending-term Exploitation on the Web. In Proceedings of the 18th ACM Conference on Computer and Communications Security, 2011.

[62] Mozilla Developer Network. Add-ons. https://developer.mozilla.org/en-US/ Add-ons. Accessed: 2015-11-25.

[63] Mozilla Developer Network. Same-origin Policy. https://developer.mozilla.org/ en-US/docs/Web/Security/Same-origin policy. Accessed: 2015-11-25.

[64] Mozilla Developer Network. The X-Frame-Options response header. https://developer.mozilla.org/en-US/docs/Web/HTTP/X-Frame-Options. Accessed: 2015-11-25.

[65] mThink. The Top 20 Affiliate Networks 2014. http://mthink.com/the-top-20-affiliate-networks-2014, 2014.

[66] Yuan Niu, Yi-Min Wang, Hao Chen, Ming Ma, and Francis Hsu. A Quantitative Study of Forum Spamming Using Context-based Analysis. In Proceedings of the 14th Network and Distributed System Security Symposium (NDSS), 2007.

[67] Grazia Orizio, Peter Schulz, Serena Domenighini, Luigi Caimi, Cristina Rosati, Sara Rubinelli, and Umberto Gelatti. Cyberdrugs: a Cross-sectional Study of Online Pharmacies Characteristics. The European Journal of Public Health, 2009.

[68] L.D. Paulson. Spam hits Instant Messaging. Computer, 37(4), 2004.

[69] Andreas Pitsillidis, Chris Kanich, Geoffrey M. Voelker, Kirill Levchenko, and Stefan Savage. Taster’s Choice: A Comparative Analysis of Spam Feeds. In Proceedings of the ACM Internet Measurement Conference (IMC), 2012.

[70] Anirudh Ramachandran and Nick Feamster. Understanding the Network-Level Behavior of Spammers. In Proceedings of ACM SIGCOMM, 2006.

[71] Dmitry Samosseiko. The Partnerka — What is it, and why should you care? In Proc. of Virus Bulletin Conference, 2009.

[72] SeleniumHQ Browser Automation. Selenium-Grid. http://www.seleniumhq.org/docs/07_selenium_grid.jsp#how-selenium-grid-works-with-a-hub-and-nodes. Accessed: 2015-12-03.

[73] Tatyana Shcherbakova, Maria Vergelis, and Nadezhda Demidova. Securelist. Spam and Phishing in Q2 2015. https://securelist.com/analysis/quarterly-spam-reports/71759/spam-and-phishing-in-q2-of-2015/, 2015.

[74] Steve Sheng, Lorrie F. Cranor, Jason Hong, Brad Wardman, Gary Warner, and Chengshan Zhang. An Empirical Analysis of Phishing Blacklists. In Proceedings of the 6th CEAS, 2009.

[75] Youngsang Shin, Minaxi Gupta, and Steven Myers. The Nuts and Bolts of a Forum Spam Automator. In Proceedings of the 4th USENIX LEET, 2011.

[76] Sushant Sinha, Michael Bailey, and Farnam Jahanian. Shades of Grey: On the Effectiveness of Reputation-based Blacklists. In Proceedings of 3rd MALWARE, 2008.

[77] Sushant Sinha, Michael Bailey, and Farnam Jahanian. Improving Spam Blacklisting through Dynamic Thresholding and Speculative Aggregation. In Proceedings of the 17th Annual Network and Distributed System Security Symposium (NDSS), 2010.

[78] Peter Snyder and Chris Kanich. No Please, After You: Detecting Fraud in Affiliate Marketing Networks. In Proceedings of the Workshop on the Economics of Information Security (WEIS), 2015.

[79] Spamhaus. http://www.spamhaus.org/dbl/.

[80] SURBL. http://www.surbl.org/.

[81] Symantec. Web-Based Malware Distribution Channels: A Look at Traffic Redistribution Systems. http://www.symantec.com/connect/blogs/web-based-malware-distribution-channels-look-traffic-redistribution-systems, 2011.

[82] The Anti-Abuse Project. DNS Blacklists. http://www.anti-abuse.org/dns-blacklists/. Accessed: 2015-11-25.

[83] The Economist. Pay Per Sale. http://www.economist.com/node/4462811, 2005.

[84] The New York Times. Surviving the Dark Side of Affiliate Marketing. http://www.nytimes.com/2013/12/05/business/smallbusiness/surviving-the-dark-side-of-affiliate-marketing.html, 2013.

[85] Kurt Thomas, Chris Grier, Dawn Song, and Vern Paxson. Suspended Accounts in Retrospect: an Analysis of Twitter Spam. In Proceedings of the 2011 ACM SIGCOMM conference on Internet Measurement Conference, 2011.

[86] United States District Court for the Northern District of California. UNITED STATES OF AMERICA v. SHAWN D. HOGAN. JUDGMENT IN A CRIMINAL CASE. http://www.businessinsider.com/document/5362ddeceab8ea856647f392/10624220-0--22751.pdf, 2014.

[87] URIBL. http://www.uribl.com/.

[88] Verisign Announces Increase in .com/.net Domain Name Fees. https://investor.verisign.com/releasedetail.cfm?releaseid=431292, 2009.

[89] Virus Bulletin. 41% of spam sent via Rustock botnet. http://www.virusbtn.com/news/2010/08_26.xml, 2010.

[90] David Wang, Stefan Savage, and Geoffrey M. Voelker. Cloak and Dagger: Dynam- ics of Web Search Cloaking. In Proceedings of the ACM Conference on Computer and Communications Security, 2011.

[91] David Wang, Stefan Savage, and Geoffrey M. Voelker. Juice: A Longitudinal Study of an SEO Campaign. In Proceedings of the Network and Distributed System Security Symposium (NDSS), 2013.

[92] Yi-Min Wang, Ming Ma, Yuan Niu, and Hao Chen. Spam Double-Funnel: Con- necting Web Spammers with Advertisers. In Proceedings of the 16th World Wide Web conference (WWW), 2007.

[93] The Wayback Machine. http://web.archive.org.

[94] Monica T. Whitty and Tom Buchanan. The Online Romance Scam: A Serious Cybercrime. CyberPsychology, Behavior, and Social Networking, 2012.

[95] Yinglian Xie, Fang Yu, Kannan Achan, Eliot Gillum, Moises Goldszmidt, and Ted Wobber. How Dynamic are IP Addresses. In Proceedings of the 2007 ACM SIGCOMM, 2007.

[96] Yinglian Xie, Fang Yu, Kannan Achan, Rina Panigrahy, Geoff Hulten, and Ivan Osipkov. Spamming Botnets: Signatures and Characteristics. In Proceedings of ACM SIGCOMM, 2008.

[97] Jennifer Zaino. Blekko Data Donation is a Big Benefit to Common Crawl. http://www.dataversity.net/blekko-data-donation-is-a-big-benefit-to-common-crawl/, 2012.

[98] Jian Zhang, Phillip Porras, and Johannes Ullrich. Highly Predictive Blacklisting. In Proceedings of the 17th USENIX Security, 2008.

[99] Yue Zhang, Serge Egelman, Lorrie Cranor, and Jason Hong. Phinding Phish: Evaluating Anti-phishing Tools. In Proceedings of the 14th NDSS, 2007.

[100] Yao Zhao, Yinglian Xie, Fang Yu, Qifa Ke, Yuan Yu, Yan Chen, and Eliot Gillum. BotGraph: Large Scale Spamming Botnet Detection. In Proceedings of the 6th USENIX NSDI, 2009.