UNIVERSITY OF CALIFORNIA, SAN DIEGO

Investigating Large-Scale Internet Abuse Through Web Page Classification

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy

in

Computer Science

by

Matthew F. Der

Committee in charge:

Professor Lawrence K. Saul, Co-Chair
Professor Stefan Savage, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair
Professor Gert Lanckriet
Professor Kirill Levchenko

2015

Copyright
Matthew F. Der, 2015
All rights reserved.

The Dissertation of Matthew F. Der is approved and is acceptable in quality and form for publication on microfilm and electronically:

Co-Chair

Co-Chair

Co-Chair

University of California, San Diego 2015

DEDICATION

To my family: Kristen, Charlie, David, Bryan, Sarah, Katie, and Zach.

EPIGRAPH

Everything should be made as simple as possible, but not simpler.

— Albert Einstein

Sic transit gloria . . . glory fades.

— Max Fischer

TABLE OF CONTENTS

Signature Page ...... iii

Dedication ...... iv

Epigraph ...... v

Table of Contents ...... vi

List of Figures ...... ix

List of Tables ...... xi

Acknowledgements ...... xiii

Vita ...... xviii

Abstract of the Dissertation ...... xix

Chapter 1  Introduction ...... 1
  1.1  Contributions ...... 4
  1.2  Organization ...... 6

Chapter 2  Background ...... 8
  2.1  The Spam Ecosystem ...... 8
    2.1.1  Spam Value Chain ...... 9
    2.1.2  Click Trajectories Finding ...... 10
    2.1.3  Affiliate Programs ...... 11
  2.2  SEO and Search Poisoning ...... 15
  2.3  Domain Names ...... 17
    2.3.1  The DNS Business Model ...... 18
    2.3.2  Growth of Top-Level Domains ...... 19
    2.3.3  Abuse and Economics ...... 23
  2.4  Bag-of-Words Representation ...... 25
  2.5  Related Work ...... 28
    2.5.1  Non-Machine Learning Methods ...... 28
    2.5.2  Near-Duplicate Web Pages ...... 29
    2.5.3  Web Spam and ...... 30
    2.5.4  Other Applications ...... 32

Chapter 3  Affiliate Program Identification ...... 36
  3.1  Introduction ...... 37
  3.2  Data Set ...... 41
    3.2.1  Data Collection ...... 41
    3.2.2  Data Filtering ...... 42
    3.2.3  Data Labeling ...... 43
  3.3  An Automated Approach ...... 45
    3.3.1  Feature Extraction ...... 46
    3.3.2  Dimensionality Reduction & Visualization ...... 49
    3.3.3  Nearest Neighbor Classification ...... 51
  3.4  Experiments ...... 53
    3.4.1  Proof of Concept ...... 55
    3.4.2  Labeling More Storefronts ...... 56
    3.4.3  Classification in the Wild ...... 58
    3.4.4  Learning with Few Labels ...... 60
    3.4.5  Clustering ...... 63
  3.5  Conclusion ...... 66
  3.6  Acknowledgements ...... 67

Chapter 4  Counterfeit Luxury SEO ...... 68
  4.1  Introduction ...... 69
  4.2  Background ...... 71
    4.2.1  Search Result Poisoning ...... 71
    4.2.2  Counterfeit Luxury Market ...... 76
    4.2.3  Interventions ...... 79
  4.3  Data Collection ...... 81
    4.3.1  Generating Search Queries ...... 82
    4.3.2  Crawling & Cloaking ...... 83
    4.3.3  Detecting Storefronts ...... 84
    4.3.4  Complete Data Set ...... 84
  4.4  Approach ...... 85
    4.4.1  Supervised Learning ...... 85
    4.4.2  Unsupervised Learning ...... 90
    4.4.3  Bootstrapping the System ...... 91
  4.5  Results ...... 93
    4.5.1  Classification Results ...... 93
    4.5.2  Ecosystem-Level Results ...... 98
    4.5.3  Further Analysis ...... 103
  4.6  Conclusion ...... 106
  4.7  Acknowledgements ...... 107

Chapter 5  The Uses (and Abuses) of New Top-Level Domains ...... 108
  5.1  Introduction ...... 110
  5.2  Data Set ...... 113
  5.3  Clustering and Classification ...... 116
    5.3.1  Clustering ...... 116
    5.3.2  Classification ...... 118
    5.3.3  Further Analysis ...... 123
  5.4  Document Relevance ...... 125
    5.4.1  Generating a Corpus for Each TLD ...... 127
    5.4.2  Estimating Topics ...... 129
    5.4.3  Relevance Scoring ...... 130
  5.5  Conclusion ...... 134
  5.6  Acknowledgements ...... 136

Chapter 6  Conclusion ...... 137
  6.1  Impact ...... 138
  6.2  Future Work ...... 140
  6.3  Final Thoughts ...... 142

Bibliography ...... 144

LIST OF FIGURES

Figure 2.1. The steady and swift rollout of new gTLDs. Dates of delegated strings were collected from [18]...... 22

Figure 3.1. Data filtering process. Stage 1 is the entire set of crawled Web pages; stage 2, pages tagged as pharmaceutical, replica, and luxury; stage 3, storefronts of affiliate programs matched by regular expressions...... 41

Figure 3.2. Projection of storefront feature vectors from the largest affiliate program (EvaPharmacy) onto the data’s two leading principal components...... 51

Figure 3.3. Histogram of three NN distances to EvaPharmacy storefronts: distances to storefronts in the same affiliate program, to those in other programs, and to unlabeled storefronts...... 54

Figure 3.4. Boxplot showing balanced accuracy for all 45 classes as a function of training size...... 61

Figure 3.5. Number of affiliate programs of different sizes with few versus many clustering errors; see text for details. In general the larger programs have low error rates (top), while the smaller programs have very high error rates (bottom)...... 65

Figure 4.1. An illustration of search result poisoning by an SEO ...... 74

Figure 4.2. Example of a poisoned search result...... 76

Figure 4.3. Examples of counterfeit luxury storefronts forging four brands (in row order): Louis Vuitton, Ugg, Moncler, and Beats By Dre. 77

Figure 4.4. Counterfeits of Gucci products offered at false discounts. Curiously, every product’s retail price is the same...... 78

Figure 4.5. Flowchart depicting one round of classification...... 92

Figure 4.6. Stacked area plots ascribing PSRs to specific SEO campaigns in four different verticals. (This is Figure 2 from [84].) ...... 102

Figure 5.1. Breakdown of classes for domains in new TLDs...... 123

Figure 5.2. Number of parked domains by service. The number on top of each bar indicates how many distinct TLDs used that parking service...... 126

Figure 5.3. Percentage of relevant Web pages in ten TLDs...... 133

LIST OF TABLES

Table 2.1. The number of storefronts in the original data set for all forty-four affiliate programs. Aggregate programs are indicated as ZedCash* and Mailien†...... 13

Table 3.1. Screenshots of online storefronts selling counterfeit pharmaceuticals, replicas, and software...... 40

Table 3.2. Summary of the data from crawls of consecutive three-month periods...... 45

Table 3.3. Feature counts, density of the data, dimensionality after principal components analysis, and percentage of unique examples...... 49

Table 3.4. Examples of storefronts matched by regular expressions (left column), and storefronts not matched but discovered by NN classification (right column)...... 57

Table 3.5. Data sizes and performance for select affiliate programs. For each program, the two rows show results from the first then second 3 months of the study...... 59

Table 3.6. Examples of correctly classified storefronts when there is only one training example per affiliate program. The affiliate programs shown here are 33drugs and RX-Promotion...... 63

Table 4.1. Rounds of classification, in which automatic predictions are manually verified. At each round, we specify the total number of storefront Web pages, the number that we have labeled, and the number of associated SEO campaigns...... 94

Table 4.2. The ten most distinctive features of the msvalidate campaign, along with their corresponding weights. The first column indicates whether the feature was extracted from the storefront (s) or doorway (d)...... 95

Table 4.3. The five most likely candidates for the msvalidate campaign, all of which had probability nearly 1 and were verified as correct. 95

Table 4.4. Breakdown of the prediction that the louisvuittonicon.com store- front is affiliated with the msvalidate campaign. The first column indicates the feature source: storefront (s) or doorway (d)...... 96

Table 4.5. Sizes of storefront clusters and doorway clusters. For example, there is 1 cluster containing 8 storefronts, and 2 clusters containing 8 doorways each...... 97

Table 4.6. Sixteen luxury verticals and the associated # of PSRs, doorways, stores, and campaigns that target them. The key campaign, the first one we identified that guided our study, targeted all verticals except those with an ‘*’...... 99

Table 4.7. Classified campaigns (with 30+ doorways) and the # of associated doorways, stores, brands targeted, and peak poisoning duration in days. (This is a subset of Table 2 from [84].) ...... 100

Table 5.1. The ten largest new TLDs, when they appeared in the root zone, and prices of registering a domain in them...... 114

Table 5.2. Examples of parked, unused, suspended, and junk Web pages. 120

Table 5.3. Classification flux between two Web crawls 20 weeks apart. . . . 124

Table 5.4. Number of registered domains, and percentage of contentful domains, in ten specialized TLDs...... 127

Table 5.5. Most likely collocating words for ten TLD words...... 130

Table 5.6. Web pages that are (ir)relevant to their TLD. Shown for each page are the second-level domain, relevance score as given by eq. (5.1), and a screenshot...... 132

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisors: Professors Lawrence Saul, Geoff Voelker, and Stefan Savage. I feel so humbled and grateful to have had the privilege of working with them. I have the utmost respect for them as researchers, advisors, and teachers, but I have equal admiration for the people they are. The combination of their personalities created an affable and witty group dynamic, which made working together a real pleasure. In addition, I greatly appreciate their involvement and service to the department; their efforts help engender a gregarious community that makes CSE a special place. Also, their passion for volunteering trickles down to their students.

I thank Lawrence further for his time and invaluable feedback on dry runs of my presentations, his guidance and course materials for the Intro to AI course I taught, and all of our conversations about sports. Big thanks to Gert Lanckriet and Kirill Levchenko for taking time to serve on my thesis committee.

Another CSE professor I must credit is Sorin Lerner, with whom I had the pleasure of working on Visit Day and Gradcom over the years. I applaud his ability to corral abundant opinions, remain focused on the task at hand, and get things done.

I thank my department colleagues and co-authors Do-kyum Kim, Tristan Halvorson, and David Wang; I thoroughly enjoyed working with them. I always had more fun collaborating than working in isolation, and one of the greatest pleasures of graduate school was learning from my peers.

I owe a huge debt of gratitude to CSE staff member Jessica Gross, who had an answer or pointer for every one of my administrative questions. I admire and thank her for her willingness to help. Also, thanks to Julie Conner for shepherding

me from beginning to end of the program.

Beyond the walls of EBU3B, I would like to thank Ian Roxborough, my intern host at Google San Francisco in the summers of 2011 and 2012. It was a joy to work with him and the Internal Privacy team. He is a badass software engineer and virtuous champion of privacy who reassured me that Googlers genuinely do care about it. Also, he strongly reinforced the importance of software testing. Furthermore, Ian was an outstanding mentor and good friend. I cherished our daily (and supremely delicious) lunches in the sunshine by the Bay Bridge, and our competitive and fun games of pool. I thank him profoundly for granting me this special opportunity and experience.

My graduate study would not have been possible without the teaching, advising, and inspiration I received from my mentors at the University of Richmond, Professors Jim Davis and Barry Lawson. Jim offered me my first research opportunity, and through him I learned that research is intellectually stimulating, challenging but fun, daunting but rewarding. Additionally, he is a paragon of work-life balance, a true role model for me who shares similar values and priorities. I find it remarkable how Jim thrives as a teacher and researcher, yet remains a devoted family man who leaves work at work so he can invest in his family, and somehow still finds time for hobbies of his own. Barry welcomed me as I shifted my gaze from mathematics to computer science. I completed my honors thesis with him on a topic of mutual interest, which was the seed from which my graduate research grew. He was The Cool Professor who I looked up to, and who strikes a mighty fine work-life balance himself. Today, I am glad to call him a lifelong friend. I thank both Jim and Barry for making my undergraduate experience what it was, and for the encouragement to go onward and upward into graduate school.

Most importantly, I would like to thank my family for their love and support. Family is the most special thing to me in life, and my happiest days are the ones spent with them. I have missed them dearly during my sojourn on the West Coast. I am deeply grateful for the way my parents raised us four kids; one of their greatest gifts to us was allowing us the independence to make our own way. I tip my hat to my brother Bryan, who blazed the academic trail for me and who beat me to “First PhD in the Family” by two years. Also, I thank my brother David for presenting me with an exciting next chapter in my career. I am absolutely thrilled to work and learn together with him. In the absence of immediate family, Dan, Cindy, and Molly Ennis became my family in San Diego; I thank them to the moon and back. I am very blessed to have met them, and to have them in my life now and always.

Another friend I wish to thank is fellow computer scientist Daniel Ricketts; we both started at UCSD in Fall 2010. He was my workout partner, both in the gym and on the pitch (i.e., soccer field). I took him under my wing, and it was a joy to watch him blossom into the confident and polished soccer player he is today. The dream to play for the U.S. Men’s National Team (USMNT) is very much alive, and they need us now more than ever. But in all seriousness, Dan was my go-to guy whenever I wanted an honest and trusted opinion; his voice of reason countered my emotion. I value his input tremendously, and want to especially thank him for encouraging me to teach a summer course.

Some of the most fun I have had in graduate school is playing intramural sports, soccer in particular. I wish to single out my longest tenured teammate and good friend, Gary Johnston. He was my most trusted teammate, a great athlete and fierce competitor who could unintentionally make me laugh even in the heat of battle. I thank all of my teammates, especially the ones on the championship

Cherrypickers team. I loved competing with them, I was honored to serve as their captain, and I will miss leading them to victory!

Yet another friend I would like to thank is Brown Farinholt, a fellow Richmonder! Having someone from my hometown enter the department was very refreshing, and it is impossible to not have fun whilst spending time with Brown. I look forward to seeing him down the road.

I want to thank many more friends for the camaraderie, experiences, and laughs we shared: fellow CSE Fall 2010 doctoral inductees Sheeraz Ahmad, Akshay Balsubramani, Karyn Benson, Alan Leung, Wilson Lian, Daniel Moeller, and Mingxun Wang; department colleagues and soccer teammates Yashar Asgarieh, Stefan Schneider, Chris Tosh, Michael Walter, and, of course, Professor Yannis Papakonstantinou, who finally earned his first championship T-shirt with us; officemates Mayank Dhiman, David Kohlbrenner, and Ding Yuan; other CSE friends Rohan Anil, Alex Bakst, Dimo Bounov, Neha Chachra, Anukool Junnarkar, Alex Kapravelos (via UCSB), Sam Kwak, Keaton Mowery, Zach Murez, Nima Nikzad, Vivek Ramavajjala, Dustin Richmond, Valentin Robert, and misc king Arjun Roy; and, last but not least, nanoengineer Sohini Manna.

Finally, I would like to sincerely thank my esteemed roommates of the past three years — Garrett Graham, Davis Graham, Austin Kieffer, Zack Jones, Matt Hoffman, Aaron Polhamus, and Jack Silva — for making our apartment a respectful, comfortable, safe, and convivial place to call home.

Chapter 3, in part, is a reprint of the material as it appears in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2014. Der, Matthew F.; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was the primary investigator and author of this paper.

Chapter 4, in part, is a reprint of the material as it appears in Proceedings

of the Internet Measurement Conference (IMC) 2014. Wang, David; Der, Matthew F.; Karami, Mohammad; Saul, Lawrence K.; McCoy, Damon; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was one of the primary investigators and authors of this paper.

Chapter 5, in part, is currently being prepared for submission for publication of the material. Der, Matthew F.; Halvorson, Tristan; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was the primary investigator and author of this material.

VITA

2010  Bachelor of Science in Computer Science and Mathematics, University of Richmond

2013  Master of Science in Computer Science, University of California, San Diego

2015  Doctor of Philosophy in Computer Science, University of California, San Diego

PUBLICATIONS

Tristan Halvorson, Matthew F. Der, Ian Foster, Stefan Savage, Lawrence K. Saul, Geoffrey M. Voelker. “From .academy to .zone: An Analysis of the New TLD Land Rush.” To appear in Proceedings of the 15th ACM SIGCOMM Conference on Internet Measurement, Tokyo, Japan, 2015.

David Wang, Matthew F. Der, Mohammad Karami, Lawrence K. Saul, Damon McCoy, Stefan Savage, Geoffrey M. Voelker. “Search + Seizure: The Effectiveness of Interventions on SEO Campaigns.” In Proceedings of the 14th ACM SIGCOMM Conference on Internet Measurement, Vancouver, BC, Canada, 2014.

Matthew F. Der, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker. “Knock It Off: Profiling the Online Storefronts of Counterfeit Merchandise.” In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York City, New York, USA, 2014.

Do-kyum Kim, Matthew F. Der, Lawrence K. Saul. “A Gaussian Latent Variable Model for Large Margin Classification of Labeled and Unlabeled Data.” In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 2014.

Matthew F. Der, Lawrence K. Saul. “Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning.” In Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, USA, 2012.

Corneliu Bodea, Calina Copos, Matthew F. Der, David O’Neal, James A. Davis. “Shared Autocorrelation Property of Sequences.” In IEEE Transactions on Information Theory, Volume 57, Issue 6, 3805-3809, June 2011.

ABSTRACT OF THE DISSERTATION

Investigating Large-Scale Internet Abuse Through Web Page Classification

by

Matthew F. Der

Doctor of Philosophy in Computer Science

University of California, San Diego, 2015

Professor Lawrence K. Saul, Co-Chair
Professor Stefan Savage, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair

The Internet is rife with abuse: examples include spam, , malicious , DNS abuse, search poisoning, click , and so on. To detect, investigate, and defend against such abuse, security efforts frequently crawl large sets of Web sites that need to be classified into categories, e.g., the attacker behind the abuse or the type of abuse. Domain expertise is often required at first, but classifying thousands to even millions of Web pages manually is infeasible. In this dissertation, I develop machine

learning tools to help security practitioners classify Web pages at scale. These automated, data-driven methods are made possible by the efforts of miscreants to operate at scale. Crafting every scam from scratch is too expensive, so miscreants use some degree of automation and replication to recreate their attacks. As a result, underlying similarities in both Web site content and structure can link related pages together. In the end, this automated classification of “big data” collected from the Web has significant impact, as it enables large-scale measurement and informs potential defensive interventions.

This dissertation focuses on three applications. First, I present a system for monitoring Web sites that serve as online storefronts for spam-advertised goods. The system is highly accurate, even when training data is very limited. Second, I describe a system for identifying the black hat SEO campaigns that promote online stores selling counterfeit luxury goods. This system was used to nearly double the number of known campaigns to track, and increase the number of associated stores by 69%. Third, I discuss a system for categorizing the Web sites hosted in new top-level domains. In total, this system was used to classify 4.1 million domains in 480 new TLDs.

Overall, today’s scale of well-organized cybercrime demands the use of scalable defensive analysis. This setting is where the data-driven techniques of machine learning prove especially useful. Furthermore, large-scale classification has become a frequent need in security, and our methods are more generally applicable to problems beyond just the ones documented in this dissertation.

Chapter 1

Introduction

The Internet ignited a technological revolution that dramatically transformed society, both culturally and economically. Today, the Internet bustles with a variety of activity and content: email, news, multimedia, banking, shopping, social media, and so on. Along with enabling near-instant communication and access to information, the World Wide Web has also become a massive marketplace for consumers. There is great opportunity for big business since customers have immediate access to virtual stores and need not travel to physical ones; plus, every Internet user is a potential customer. Further still, even the parts of the Internet not devoted to e-commerce largely serve as a portal to it, given the ubiquity of advertisements which are often customized for the individual user.

Inevitably, this marketplace attracts both legitimate and illegitimate businesses alike. In either case, the goal is to get users to visit their Web sites, and ultimately make a purchase. To acquire customers, illegal businesses commonly resort to malicious or deceptive techniques to amass user traffic: for example, spam emails containing links to online storefronts, and black hat search engine optimization (SEO) through which high-ranking search results are rewired to direct users to a fraudulent store.

Interestingly, the Internet namespace is its own unique marketplace, where


domain names are scarce and exclusive — and in turn, potentially quite valuable — resources. Not every domain owner publishes useful content, either: while much of the Web is teeming with daily traffic, a significant portion of it remains underdeveloped and largely unused. Instead of hosting meaningful content, some domain owners just sit on their online property and try to earn money passively by serving automatically spun ads, or by reselling their domains in the future. While such practice, called domain parking, is not necessarily abusive, the real value it adds to the social good is certainly questionable.

A common theme resonates among e-commerce (legal or not), advertising, domain parking, and many other online services: a substantial percentage of Internet activity is motivated by making money. Indeed, in recent years security researchers have found considerable success in adopting a socio-economic perspective toward computer security: follow the money to learn about the business models and relationships among actors behind these online enterprises (e.g., [44, 38, 56, 55, 80, 57, 33, 5, 61]). This approach for investigating and combatting cybercrime is engineered through empiricism — i.e., through large-scale measurement and analysis of abuse on the Internet. Specifically, the usual methodology involves crawling sizable collections of Web sites, which then must be organized and evaluated to draw quantitative, evidence-based conclusions. In security applications, for example, a set of Web pages needs to be classified into categories, such as the type of abuse, or the attacker behind the abuse.

For a given application, to begin understanding the problem space unavoidably requires domain expertise. However, manual review is a laborious job for the human expert, and a fully manual approach indubitably does not scale. This encumbrance impels the need for automated methods to categorize Web pages accurately and efficiently to aid security practitioners. The premise of this dissertation is that machine learning tools can fulfill precisely this demand.

Automated machine learning techniques are effective solutions in part due to the regularity of replication on the Internet, where many duplicate or near-duplicate Web pages exist. For e-crime in particular, miscreants wish to operate at scale, as consistently crafting unique scams from scratch is far too expensive. Thus, at least some degree of replication and automation is behind their attacks, and therefore, underlying similarities in Web sites can link related pages together. Essentially, miscreants trade off stealth for scale. The impact of machine learning methods in this setting is that they remove the domain expert as a bottleneck in empirical assessment. Large-scale measurement becomes feasible, and from a security angle, the results can reveal choke points in the attackers’ infrastructure, and consequently inform defensive interventions.

The task of organizing crawled Web sites is one of data labeling. Much work in the machine learning field focuses on supervised learning, where classifiers are trained on a data set with ground truth labels. In contrast, Web pages collected from the wild lack ground truth labels by nature. The task, then, is to establish ground truth by assigning Web pages a label indicating their categorization. Of course, we want to make this process as efficient as possible, transferring as much of the workload as possible from the domain expert to the machine.
Labeling data necessarily begins with manual inspection to understand the problem and recognize early patterns, but we can eventually bootstrap labeling from an initially manual procedure to a more automated one. At some point, the expert will have labeled a sufficient sample of data for training a machine learning model, which then can scale up the labeling process by making many automatic predictions. Unfortunately, we cannot blindly accept the predicted labels as ground truth. We can conduct proof-of-concept experiments to gain confidence in the model’s performance, yet still the human must be brought back into the loop to validate predictions. Due to the sheer size of the data, it is impractical to validate all predictions, and a representative random sample must suffice. Advantageously, though, after this validation step, the expanded set of labeled data can be used to train even more accurate models, which then make more confident predictions on the remaining unlabeled data. In this way, data labeling becomes an iterative approach occurring in rounds.

To summarize the overall methodology: the first round starts with a domain expert labeling some data by hand, scales with machine learning models that predict labels automatically, and concludes with the expert validating (a sample of) the predictions. Then, we iterate the process with repeated rounds of retraining the model, making more predictions, and validating select predictions, until the labeling is finished. To establish ground truth, a human in the loop is unavoidable, but performing this task at scale is only possible with automation in a human-machine system. My thesis is that machine learning offers an invaluable toolkit for data-driven security researchers, who face a repeated problem of partitioning large sets of crawled Web pages into categories.
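To make the workflow concrete, the sketch below outlines one way such a labeling loop could be organized. It is a minimal illustration under assumptions, not the exact system built in later chapters: the helpers expert_label and expert_verify are hypothetical stand-ins for manual review, and the k-nearest-neighbor classifier over generic feature vectors is merely one plausible model choice.

    # Minimal sketch of round-based, human-in-the-loop labeling (illustrative only).
    # expert_label(i) and expert_verify(i, label) are hypothetical placeholders for
    # manual review by a domain expert; scikit-learn supplies the classifier.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def iterative_labeling(features, expert_label, expert_verify,
                           seed_size=100, rounds=5, threshold=0.8, sample_size=200):
        """features: (n_pages, n_features) array; returns an integer label per page."""
        n = features.shape[0]
        labels = np.full(n, -1)                            # -1 marks "unlabeled"
        seed = np.random.choice(n, size=seed_size, replace=False)
        labels[seed] = [expert_label(i) for i in seed]     # round 0: hand-label a seed set

        for _ in range(rounds):
            unlabeled = np.where(labels == -1)[0]
            if unlabeled.size == 0:
                break
            clf = KNeighborsClassifier(n_neighbors=5)
            clf.fit(features[labels != -1], labels[labels != -1])
            probs = clf.predict_proba(features[unlabeled])
            preds = clf.classes_[probs.argmax(axis=1)]
            confident = probs.max(axis=1) >= threshold     # keep only confident predictions
            idx, guesses = unlabeled[confident], preds[confident]
            if idx.size == 0:
                break
            # The expert validates a random sample; corrections feed the next round.
            for j in np.random.choice(idx.size, size=min(sample_size, idx.size), replace=False):
                if not expert_verify(idx[j], guesses[j]):
                    guesses[j] = expert_label(idx[j])
            labels[idx] = guesses
        return labels

The essential structure (a hand-labeled seed, confident automatic predictions, expert spot checks, retraining) is what carries over to Chapters 3 through 5; the specific classifiers, features, confidence measures, and validation budgets differ by application.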

1.1 Contributions

This dissertation addresses three particular instances of this general problem of organizing large collections of Web pages crawled on the Internet. The first instance arises from a holistic analysis of the email spam ecosystem. Illegitimate businesses called affiliate programs run online stores selling counterfeit goods, and they team with spammers to advertise their stores via email spam. I seek to identify which storefront Web sites belong to which affiliate programs — a technical challenge behind a defensive intervention with considerable economic

impact. The second instance also involves online counterfeit stores, but instead of acquiring user traffic via email spam, here the criminals use black hat SEO — specifically, search result poisoning and cloaking — to lure users to their stores. The goal is analogous as well: I develop techniques to match these storefront Web pages to the SEO campaigns that promote them. The third instance emerges from the recent and dramatic expansion of the Internet namespace, wherein the Internet Corporation for Assigned Names and Numbers (ICANN) introduced a program for delegating new top-level domains (TLDs). From a crawl of millions of registered domains in hundreds of new TLDs, I measure how domain registrants are using their new domain names. The successful results for all three studies prove the critical role of machine learning in empirical approaches to Internet security. Specifically, the contributions of this dissertation are as follows:

• I present a system that solves a classification problem of identifying the affiliate programs that manage online storefronts selling counterfeit merchandise. This task underlies a crucial bottleneck in the spam business model (to be explained in Chapters 2 and 3), illuminating a promising point for defensive intervention. Classification is highly accurate, and the methodology is significantly more automated than the approach first used by other researchers. Furthermore, the classifier discovers additional storefronts belonging to known affiliate programs — false negatives which were previously missed. In mimicking an operational deployment when training data is limited, the system remains highly accurate. Lastly, the system can unveil a new affiliate program for which we have no labeled examples.

• I describe a system for linking counterfeit luxury stores to the SEO campaigns which promote them in poisoned search results. The system is designed for maximal usability by a security researcher who is not a machine learning expert; it is fast to train, fast to make predictions, and its output is highly interpretable. Additionally, the system can discover new SEO campaigns by grouping together highly similar unlabeled stores.

• I introduce an iterative system for classifying the Web content of millions of registered domains in hundreds of new top-level domains. The system performs rounds of clustering, labeling, classification, and validation. Also, I classify the same set of domains but twenty weeks later to see how domain owners change their Web presence over time. Secondly, for domains hosting legitimate Web content in a subset of ten specialized TLDs, I develop a statistical language model for judging how relevant the content is to the TLD’s target community.

I stress that this machine learning methodology is broadly applicable to data-driven cybercrime research and more general than just the aforementioned applications. In that spirit, this dissertation also serves as a model for other researchers who wish to solve similar problems.

1.2 Organization

The remainder of this dissertation is organized as follows. Chapter 2 provides context for the three problems I explore, background on bag-of-words feature extraction which I use for each problem, and a brief review of related work.

Chapter 3 presents a system for classifying spam-advertised storefronts according to the affiliate program that manages them. Chapter 4 describes a system for identifying the different SEO campaigns which promote counterfeit luxury stores in poisoned search results. Chapter 5 explains a methodology for categorizing registered domains in new TLDs. In addition, for domains that host meaningful Web content in a sample of ten specialized TLDs, I introduce a technique for estimating how relevant the content is to the TLD’s intended purpose. Finally, Chapter 6 summarizes the contributions of this dissertation and offers potential directions for future work.

Chapter 2

Background

This chapter imparts all necessary context for understanding the contributions of this dissertation. Sections 2.1 to 2.3 provide a high-level overview of the three security applications we study. Section 2.4 explains the basics of bag-of-words feature extraction, which plays a fundamental role in all of our approaches. Lastly, Section 2.5 discusses select related work.
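As a brief preview of the bag-of-words idea discussed in Section 2.4, the following sketch applies scikit-learn's CountVectorizer to a few fabricated page texts. It is a generic illustration under assumptions, not the exact feature pipeline used in this dissertation.

    # Generic bag-of-words illustration (the example "pages" are fabricated).
    from sklearn.feature_extraction.text import CountVectorizer

    pages = [
        "cheap canadian pharmacy best prices free shipping",
        "online pharmacy cheap pills overnight shipping",
        "replica luxury watches free shipping best prices",
    ]
    vectorizer = CountVectorizer()               # tokenize and count word occurrences
    X = vectorizer.fit_transform(pages)          # sparse matrix: one row per page
    print(vectorizer.get_feature_names_out())    # the learned vocabulary
    print(X.toarray())                           # each row is a page's vector of word counts

Pages that share many of the same tokens, such as the two pharmacy examples above, end up with similar count vectors, which is the property that lets related pages be linked together.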

2.1 The Spam Ecosystem

Spam, or unsolicited bulk email, is a decades-old and widespread problem affecting anyone with an email account. Even though anti-spam tools have developed into a multi-billion dollar industry, spam continues to flood inboxes today. Viewed through an economic lens, this malpractice, which is essentially an advertising business, persists because spammers are driving a profitable venture; they will continue as long as they are making money. This reality suggests a different approach to combatting spam that is potentially more fruitful than simply trying to prevent spam delivery. The hypothesis is that by investigating the entire spam ecosystem, we can learn about the underlying business model and, in turn, discover parts of it that are most susceptible to defensive action.


2.1.1 Spam Value Chain

Today, spam is a much more complex enterprise than it once was; now it is a highly diversified operation in which responsibilities are distributed among many entities. Some components have been studied in isolation — primarily spam delivery since that is what the user directly experiences — but until recently, the security community lacked a comprehensive view and understanding of the whole ensemble. This void was filled by the initiative of Levchenko et al. [44], who conducted a holistic, end-to-end analysis of the full spam value chain — the infrastructure and coordinating organizations which monetize spam.

More specifically, the endeavor sought to identify the diversity of participating actors, understand their individual roles as well as the relationships among them, and gauge the width of bottlenecks and cost (both replacement and opportunity) if a link in the chain were taken out. To accomplish this, the researchers carried out a large-scale empirical study in which they traced a complete “click trajectory,” from the initial act of following a URL advertised in a spam email to the final selective purchasing of goods from counterfeit stores. Their work showed that the spam ecosystem can be decomposed into three main functions: advertising, click support, and realization. We summarize each function below, but refer the reader to their paper for a more thorough treatment.

Advertising. Spammers mainly operate as advertisers, trying to lure potential customers into clicking their spam emails. They serve as affiliates of fraudulent stores and earn a 30-50% commission. We note that email is just one of many advertising vectors, but it is the most popular.

Click support. After a user clicks on a spam-advertised URL, several pieces are required to take the user to an online store. Rarely does the URL directly link to the store; instead, redirection sites are used to evade domain blacklisting and takedowns. Of course, the store must be reachable somewhere, so spammers or their associates acquire domain names to publish the storefront sites. Next, the spammers or third parties secure the standard infrastructure necessary for supporting any Web site on the Internet. In particular, they need name servers and Web servers; the name servers provide address records that indicate which Web servers host their sites. The stores themselves are managed by illegitimate businesses called affiliate programs, who fulfill numerous back-end obligations. They provide advertising materials, storefront templates, shopping cart management, analytics help, and Web interfaces for their affiliates to track conversions and register for payouts. In addition, affiliate programs contract payment and fulfillment services to outside parties.

Realization. Finally, the tail end of the spam value chain consists of resources needed to complete a sale. Specifically, payment services enable online purchases, involving a merchant (or “acquiring”) bank and usually a payment processor as well. Then, fulfillment services ship physical products to customers; virtual products like software, music, and videos are available via direct download on the Internet.

2.1.2 Click Trajectories Finding

The collective effort of Levchenko et al. crawled nearly 15M spam-advertised URLs, 6M of which successfully resolved and rendered Web content. After filtering out any page that was not a storefront, they were left with a set of roughly 221K unique storefront Web pages. Their investigation revealed that the vast majority of these storefronts are managed by forty-five prevalent affiliate programs; even further, they found that these affiliate programs use the same few merchant banks. Thus, solving a classification problem — mapping storefront Web pages to the affiliate programs behind them — was crucial to unveiling this choke point in the spam value chain. One downside, though, was that solving this classification problem took a great deal of manual effort by the research team. This motivated us to perform classification in a more automated way, which is the basis of the material we present in Chapter 3.

2.1.3 Affiliate Programs

Affiliate programs have controlled the spam business for years now. The Federal Trade Commission shut down a few of them in 2008 [14], but the takedown was a rare case of combatting affiliate programs directly. Indeed, other affiliate programs (called “partnerka” in Russia) continued to flourish, as evidenced by Samosseiko’s report [75] which unveiled their business model and economics. Levchenko et al. [44] advanced our understanding of the central position these programs have in the spam value chain, and also inspired further work that explored their economics in more depth.

Kanich et al. [38] introduced two measurement techniques for estimating order volume (and hence revenue) and purchasing behavior (what products to which customers). They found that ten leading affiliate programs (seven pharmaceutical, three counterfeit software) collected over 119K monthly orders, with the most prolific programs each generating over $1M in revenue per month. Also, the distribution of pharmaceutical purchases is mostly male enhancement drugs (62%) but has a long tail (289 distinct products). Thirdly, spam is largely funded by Western purchases, the U.S. in particular.

McCoy et al. [56] had a particularly unique vantage point into the economics of affiliate programs: they obtained four years’ worth of raw transactional data from three prominent pharmaceutical programs (GlavMed, SpamIt, and RX-Promotion). Through mining this data, which covered hundreds of thousands of orders totaling over $185M in revenue, they learned about product demand, customer behavior, affiliates (advertisers), payment service providers, and costs. Their results include the following: (i) erectile dysfunction purchases dominate revenue; (ii) an appreciable fraction of purchases are from repeat customers; (iii) a few big affiliates out of hundreds to thousands dominate the market; (iv) only a handful of payment processors handle most of the transaction volume; and (v) while business is seemingly thriving, substantial costs dwindle profit to under just 20% of sales.

McCoy et al. [55] also honed in on the payment portion of the spam ecosystem, following up on the previous works suggesting that financial services are a susceptible bottleneck in the value chain. They characterized the relationship between affiliate programs and their acquiring banks, and documented the effects and reactions to interventions by brand holders and payment card networks. They showed how concentrated efforts against just a small set of high-risk banks can terminate merchant accounts within weeks and severely disrupt the business of many affiliate programs.

In our work, we track a total of forty-four¹ affiliate programs across three product categories: pharmaceuticals, replica luxury goods, and counterfeit software.

¹Levchenko et al.’s paper discusses forty-five affiliate programs, but, due to an artifact of their regular expression matcher, any storefront that matched the Stimul-cash program automatically matched the RX Partners program as well. Hence, Stimul-cash storefronts were a proper subset of RX Partners’. However, this outcome actually reflects the true circumstance, as the Stimul-cash program was acquired by the owners of RX Partners, roughly in 2008 [55]. Thus, we ignored Stimul-cash as a distinct program in our study.

Table 2.1 shows the number of storefronts we had for each program at the outset of our study. There are two aggregate programs, Mailien and ZedCash, whose constituents are indicated in the table. Mailien administers two pharma brands, while ZedCash manages seven pharma and nine replica brands. ZedCash is unique in that it has storefront brands for more than one product category. We explain how we classify storefronts by their sponsoring affiliate programs in Chapter 3.

Table 2.1. The number of storefronts in the original data set for all forty-four affiliate programs. Aggregate programs are indicated as ZedCash* and Mailien†.

Category        Affiliate Program                  Storefronts

Replica         WatchShop                                  472
                Affordable Accessories*                    341
                One Replica*                               269
                Ultimate Replica*                           79
                Prestige Replicas*                          32
                Diamond Replicas*                           14
                Luxury Replica*                             13
                Distinction Replica*                        11
                Exquisite Replicas*                          9
                Swiss Replica & Co.*                         2

Software        EuroSoft                                 2,215
                Royal Software                             808
                Authorized Software Resellers              314
                Soft Sales                                  39
                OEM Soft Store                              24

Continued on next page

Table 2.1. The number of storefronts in the original data set for all forty-four affiliate programs. Aggregate programs are indicated as ZedCash* and Mailien† — continued.

Category        Affiliate Program                  Storefronts

Pharmaceutical  EvaPharmacy                             58,215
                Pharmacy Express†                       44,017
                RX-Promotions                           37,245
                Online Pharmacy                         16,546
                GlavMed                                  6,616
                World Pharmacy                           4,340
                Greenline                                3,857
                RX Partners                              1,486
                RX Rev Share                               548
                Canadian Pharmacy                          261
                33Drugs²                                   181
                ED Express†                                 77
                RXCash                                      72
                MediTrust                                   42
                MaxGentleman*                               22
                PH Online                                   20
                Dr. Maxman*                                 19
                Club-first                                  14
                Stallion                                    11
                MAXX Extend                                  9
                Viagrow*                                     8
                HerbalGrowth                                 8
                US HealthCare*                               6
                ManXtenz*                                    4
                VigREX*                                      4
                Swiss Apotheke                               3
                Stud Extreme*                                3
                Ultimate Pharmacy                            3
                Virility                                     2

Total                                                  178,281

²Also known as DrugRevenue.

2.2 SEO and Search Poisoning

The popularity of the Internet demands that a business have a strong Web presence. Certainly, their Web site should be easy for users to find. One key ingredient, then, is a good domain name — memorable, short, and therefore simple to type into the address bar. Visiting a Web site in this way, called direct traffic, is one avenue for users to reach the site. However, this direct avenue is not the most common one; that distinction belongs to search engines. Specifically, organic search traffic³, or users who enter a search query and click through a search result, is the dominant source of traffic to business sites, per estimates in 2014 [23]. Thus, businesses focus on search as a primary means for attracting users.

The crux of search traffic is that the amount a Web site receives depends strongly on its search ranking, or how early in search result listings the site appears. Search engines are designed to answer user queries with the most relevant and highest quality Web sites; in turn, users are much more likely to click the top results than to probe the long tail. Therefore, webmasters strive to build Web sites that search engines will rank highly for pertinent queries. Perhaps more important factors for a business’s Web presence, then, are its quality, relevance to certain keywords, reputation, relationship with other reputable Web sites, and so on — all of which contribute to the site’s search ranking. The wealth of techniques used to improve search ranking is collectively known as search engine optimization, or SEO.

Search engines have guidelines and rules that webmasters should follow to ensure the integrity and user-friendliness of their sites. Good practice techniques that adhere to these policies are called white hat SEO (e.g., sitemaps, friendly

³Organic search traffic is distinct from paid search traffic: users who enter a search query and click through a pay-per-click advertisement. Paid search comprises 10% of all traffic to business sites [23]. Henceforth, we use the term “search traffic” to denote organic search traffic.

URLs, fast load times); bad practice techniques that disregard them are called black hat SEO (e.g., keyword stuffing, link farms, hidden text). Indeed, miscreants utilize surreptitious tricks that exploit a search engine’s algorithms to abusively boost their search rank. This illicit activity — known as search (result/engine) poisoning, spamdexing, Web/search spam, et cetera — is used to garner click traffic; then, clicks are monetized through various scams such as fake anti-virus, malware download, phishing, and counterfeit stores.

Our work focuses on the counterfeit luxury market, where e-commerce Web sites pose as authentic high-end storefronts, but actually sell cheap knockoff merchandise. These fraudulent stores employ black hat SEO campaigns — coordinated efforts to abusively elevate search ranking — to lure more visitors and ultimately increase sales. We wish to investigate this collusion among attackers: how do they operate, and how is their business structured? This kind of holistic perspective is needed to bolster defenses; existing countermeasures, which target spammed search results or individual counterfeit stores in isolation, are inadequate. Accordingly, our efforts here are similar to the email spam study in both motivation and approach. We seek an ecosystem understanding and bottleneck analysis of Web search spam, and we achieve it through empirical measurement. We provide additional background and present our work on this problem, led by David Wang [84], in Chapter 4.

Our research builds upon prior work from Wang et al. [86, 85]. They first studied a black hat SEO technique called cloaking, which enables search poisoning. They implemented a custom crawler to measure the prevalence of cloaking, as well as the response to cloaking from search engines. We elaborate on cloaking in Section 4.2.1 and touch upon their crawler in Section 4.3.2. In subsequent work, the researchers infiltrated an SEO botnet that poisons search results to lead users to scams. Their probe sheds light on the infrastructure and operation of the botnet, and how it reacts to technical interventions (flagging compromised Web sites and poisoned search results) and monetary interventions (takedown of a fake anti-virus program, the most profitable scam the botnet drove). We describe the scheme of an SEO botnet in Section 4.2.1. Chapter 4, essentially an adaptation of Wang et al.’s third installment [84], presents a unified culmination of this line of work. We document a comprehensive ecosystem-level analysis of SEO abuse on one particular scam, the counterfeit luxury market.

Security researchers have studied search result poisoning for a decade, with Wang et al. [87] contributing an early end-to-end analysis of redirection spam. Since then, many have proposed various techniques for detecting cloaking and poisoning [43, 49, 62, 86], and some have delved deeper into the activity of SEO campaigns, with Wang et al. [85] infiltrating one campaign’s botnet, and John et al. [36] using clustering and signatures to identify distinct campaigns. This body of prior work underpins our own, but we contribute a new ecosystem-level viewpoint of luxury SEO and uncover the business venture’s full structure. This approach relates to other recent efforts to explore underground economies [38, 39, 42, 56, 76]. In addition, our measurement extends to an evaluation of current defenses against luxury SEO, similar in objective to other work that considers the economics of defensive interventions [12, 32, 44, 48, 55, 59].
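To give a concrete sense of the cloaking technique before its full treatment in Section 4.2.1, the sketch below shows one schematic way a cloaked page can branch on the requester. It is an assumed illustration, not code from the studies cited above: storefront.example is a placeholder hostname, and a naive User-Agent check stands in for the richer signals (referrers, crawler IP ranges) that real cloakers use.

    # Schematic user-agent cloaking: the same URL serves keyword-stuffed content
    # to a search-engine crawler and redirects ordinary visitors to a storefront.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class CloakingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if "Googlebot" in ua:                       # crawler sees an "SEO-ed" page
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                self.end_headers()
                self.wfile.write(b"<html>cheap designer handbags outlet sale ...</html>")
            else:                                       # ordinary users are bounced to the store
                self.send_response(302)
                self.send_header("Location", "http://storefront.example/")
                self.end_headers()

    # HTTPServer(("", 8080), CloakingHandler).serve_forever()  # not started here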

2.3 Domain Names

Domain names — unique identifiers for places on the Internet — are an elemental part of how users navigate the Web. A catchy domain makes a place easier to remember and find. Thus, while there are almost infinitely many possible domain names, there is a drastically smaller set of desirable ones. Because good domains are a scarce resource, they hold tremendous value. So while domain names are ostensibly a simple mechanism for locating places on the Web, they became hot property and now comprise an enormous online market.

2.3.1 The DNS Business Model

The (DNS), introduced in the early 1980s, provides a simple technical service: to map human-readable strings to routable IP addresses, which specify the location of an Internet resource. For example, the domain ucsd.edu translates to the IP address 132.239.180.101, the location of the ma- chine where the domain is hosted. Domain names are much easier to memorize and communicate than numerical addresses. Also, they abstract away physical location: the same string is always used to locate a resource even if its physical location changes. The DNS is organized hierarchically, with top-level domains (TLDs) at the highest level and second-level domains right beneath. A domain name’s hierarchy descends from right to left, so in the example ucsd.edu, the TLD is edu and the second-level domain is ucsd. In other terms, ucsd.edu is called a subdomain of the edu TLD. The Internet Corporation for Assigned Names and Numbers (ICANN) is the governing body of the DNS. ICANN does not manage individual TLDs, though; they contract out that responsibility to third-party organizations called registries.

For example, operates com, and EDUCAUSE operates edu. Then, registries work with registrars — companies that sell domain names to the public. Registrars offer most domains on a first-come first-served basis for an annual registration fee. Some common registrars are GoDaddy, eNom, , 1&1, and Register.com. Finally, a person or company who registers a domain name is called 19

a registrant. The registrant holds exclusive ownership of the domain name, and accordingly is also known as a domain owner. Registries act as the “wholesalers” of the domain name business, as they set a flat wholesale price that ICANN must approve. ICANN requires that this price is the same for all registrars to support fair competition. Then, registrars act as the “retailers” and may mark up the price as they see fit. Registries pay ICANN a small transaction fee for each domain sold or renewed, but only if the total transactions in any calendar quarter meet a certain threshold. For new TLDs (discussed next in Section 2.3.2), the fee is $0.25, and the transaction threshold is 50,000 [73]. All terms for how money from domain name sales is apportioned are laid out in contracts between ICANN, registries, and registrars. These agreements specify additional fees as well; for example, registries who operate new TLDs owe ICANN a fixed quarterly fee of $6,250 [73], and registrars owe ICANN a $4,000 yearly accreditation fee [72].
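For a rough sense of the sums involved, the sketch below computes one quarter of ICANN fees for a hypothetical new-TLD registry using the figures cited above, under the assumption that the $0.25 per-transaction fee applies to every transaction once the 50,000-transaction threshold is met (the agreements spell out the exact terms).

    # Back-of-the-envelope quarterly ICANN fees for a hypothetical new-TLD registry.
    def quarterly_icann_fees(transactions):
        fixed = 6250.00                                    # fixed quarterly registry fee
        per_txn = 0.25 if transactions >= 50000 else 0.0   # fee kicks in past the threshold
        return fixed + per_txn * transactions

    print(quarterly_icann_fees(30000))    # 6250.0   (below the threshold)
    print(quarterly_icann_fees(80000))    # 26250.0  (6250 + 0.25 * 80000)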

2.3.2 Growth of Top-Level Domains

In January 1985, the first six TLDs were released: com, org, , edu, gov, and mil. These are all categorized as generic TLDs (gTLDs), which have a further distinction: sponsored or unsponsored. A sponsored gTLD is intended for a specific community and restricts domain registrations to affiliated members. A sponsoring organization who represents the community manages the TLD and enforces policies that authorize domain registration and use. An unsponsored gTLD follows standard ICANN policy and usually is open to all registrants and purposes. Of the first

six TLDs, com, org, and net are unsponsored and open to the global Internet population; the other three are sponsored and limited to smaller communities. The

⁴The net TLD was not included in the original RFC 920 document from October 1984 [69], but was implemented together with the five proposed TLDs.

first country code TLDs (ccTLDs) were introduced in the 1980s as well (e.g., us and uk, reserved for the United States and United Kingdom, respectively).

The com TLD prevailed as the most open and popular; as a result, com eventually became saturated and, so to speak, all the good domain names were already taken. To relieve this scarcity, ICANN launched the biz and info gTLDs in May 2001. Their initiative followed the recommendations of the Working Group C report [89], which concluded that adding new TLDs would essentially help democratize the Internet. This consensus opinion was not unanimous though, and the debate raises questions about the desired versus actual effect. With com established as the dominant Web standard, would users and businesses even consider other TLDs for finding and building a solid Web enterprise? Would it be strange for different companies to have the same second-level domain in separate TLDs? Would new TLDs encourage speculation, and also spark a race between extortionists and businesses trying to protect their trademarks?

Halvorson et al. [30] gave contemporary insight into this debate by measuring the usage of the biz TLD ten years after its inception. biz was meant to be an alternative to com, but their study showed it has gained limited traction. biz is about 46 times smaller than com (by total domains registered), has a disproportionately lower representation among the Internet’s most popular Web sites, and contains only a moderate fraction of primary Web identities (i.e., not undeveloped Web sites, redirects, or duplicated content from the same subdomain in a different TLD).

ICANN released more TLDs throughout the ensuing years, including sponsored TLDs like aero in 2002 and jobs, mobi, and travel in 2005. Then in 2011, ICANN added the controversial xxx TLD after a decade-long approval process. Halvorson et al. [29] conducted a similar study of this TLD to break down its early usage and economic ramifications. Their findings validated the magnified concerns over defensive registrations from brand and trademark holders wanting to disassociate from the TLD’s connotations. Defensive registrants account for almost 92% of all domain registrations, and they spent a total of $24M in the first year of registration alone.

Many of these individual TLDs endured a laborious induction, so in 2008 ICANN began developing policy to simplify and standardize the process of adding a new TLD. Then in 2011, ICANN authorized the launch of what they called the New gTLD Program, whose goals are “enhancing competition and consumer choice, and enabling the benefits of innovation” [2]. While the program seems well intentioned, the same debate persists about what value new TLDs contribute to the Internet community [25].

In this program, a new TLD goes through a process of application, evaluation, and delegation. An aspiring registry submits a comprehensive application to ICANN demonstrating that they are technically and financially capable of operating a TLD. Then, the application is open to public comments and gets reviewed by third-party expert panels. The applicant owes ICANN a $185,000 evaluation fee for the initial evaluation. This stage may also be followed by dispute resolution and string contention — the former, if anyone files an objection to the TLD; the latter, if there are multiple applications for the same TLD string. If the applicant passes these steps, then the TLD advances to delegation.
Delegating a TLD means adding it to the DNS root zone, the global list of TLDs. The TLD is then considered “live” on the Internet and opens to domain registrations shortly thereafter. Most registries who operate public TLDs open registration in consecutive phases: sunrise, land rush, and general availability. The sunrise phase is designated for trademark holders only so they can defend their names. The land rush phase 22

is meant for eager registrants who are willing to pay a premium for high-demand domains. Finally, in the general availability phase, domain names are sold first-come, first-served at the normal registration rate.

Figure 2.1. The steady and swift rollout of new gTLDs (total number of TLDs, from 318 on 10/1/13 to 1,046 on 8/20/15). Dates of delegated strings were collected from [18].

ICANN opened the first application window in January 2012 and received a total of 1,930 applications. This remarkable volume foreshadowed an equally remarkable expansion of the domain name space. As of October 1, 2013, there were 318 TLDs (most of them ccTLDs). By August 20, 2015, the root zone contained 1,046 TLDs — an addition of 728 in less than two years. Figure 2.1 plots this growth since the first new TLDs were delegated on October 23, 2013. This surge of hundreds of new top-level domains has been called “the biggest change to the Internet since its inception” [34]. We examine the early usage of these new TLDs in Chapter 5.

2.3.3 Abuse and Economics

The new TLD rollout opens up a fresh expanse of the Internet namespace that may be targeted by questionable, if not flatly abusive, practices. Some examples include:

• Cybersquatting: Buying a domain name related to a trademark, usually with the intent of reselling the domain to the trademark holder for an inflated price.

• Typosquatting: Registering a domain name to acquire unintentional direct traffic from users who commit a typo when entering the URL. For example,

Google prevents common instances by owning gooogle.com and googel.com, both of which simply redirect to google.com.

• Homograph attack: A domain name that spoofs an existing one by having an indistinguishable ASCII representation in certain fonts. A classic attack was the “PayPaI” phishing scam that tricked PayPal users into visiting

PayPaI.com, which then stole user credentials when they attempted to log in [67].

• Domain parking: Monetizing an undeveloped domain by serving automatically spun ads and/or selling the domain at a profit.

• Domaining: A term that refers to the speculative side of domain parking: investing in domain names for future resale.

• Domaineering: A term that refers to the non-speculative side of domain parking: buying and monetizing domain names by using them as an advertising medium.

Among these activities, cybersquatting is the only one that is altogether illegal, as it infringes upon a trademark’s intellectual property rights. To police such infringements, ICANN introduced the Uniform Domain-Name Dispute-Resolution Policy (UDRP), an arrangement for resolving domain name disputes, in 1999. However, the UDRP process is both costly and lengthy. As a cheaper and faster alternative, ICANN implemented the Uniform Rapid Suspension (URS) system for eradicating the most obvious infringements in new gTLDs. Also, several companies now use brand protection services (e.g., MarkMonitor [54] and Safenames [74]) to hunt for abusive domains. Still, despite these options for defensive measures, many brand and trademark holders preemptively register congruous domains to safeguard their names. Domain parking, a widespread practice in TLDs both old and new, engenders ongoing debate about its merits. Proponents contend that domain owners have the right to (lawfully) use their property as they wish. Furthermore, some claim that a parked domain’s advertisements, which are often thematically related to the domain name, are useful portals for finding germane information. On the other hand, opponents argue that parked domains contribute little to the Internet at large. Search engine providers seem to agree; they favor unique, quality content, and therefore do not index parked Web pages. We do not focus on mitigating these dubious practices in this dissertation; we are, though, generally interested in the economic motivations of actors in the domain name market. Indeed, the goal behind these pursuits is to make money, and there is much of it to be made. The openness of gTLDs and first-come, first-served allocation of domain names promise tidy profits for expeditious speculators.

The most coveted domains change hands for exorbitant sums as well. In the com TLD, second-level domains that are popular generic terms, such as insurance, vacationrentals, privatejet, internet, sex, and hotels, all resold for over $10M. Also, companies buy domains that are central to their brand to enhance their

Web presence. In two high-profile instances, Facebook bought fb.com for $8.5M in November 2010, and Apple bought icloud.com for $6M in March 2011 [47]. Typosquatting further highlights the magnitude of the domain industry, as the amount of revenue generated by something as trivial as typos is astounding. A 2010 study estimated that Google earns about $497M per year from typosquatters whose domains are “misspellings” of the top 100,000 Web sites and are parked with Google ads [60]. The overall economics of the domain name market are just as staggering. The startup registry Donuts raised over $100M in 2012 to support its applications for 307 new TLDs [19]. GoDaddy, the largest domain registrar, raised $460M in an early 2015 IPO [24], and its market cap currently stands around $4 billion (future revenue projections rest largely on its plans for registering domains in new TLDs). Today, there are more than a quarter of a billion domain names registered in total; the registration fees alone amount to about $2-3B in annual revenue.

2.4 Bag-of-Words Representation

Applications such as text mining, information retrieval, and topic modeling all involve learning about a corpus of documents. A crucial decision, then, is how to represent a document. One conventional way is to use a bag-of-words representation, which works as follows. We start with an extensive set of all possible words that may appear in the corpus; this set of words, typically pre-defined, is called the vocabulary (or dictionary). Let D denote the size of the vocabulary. Now, for a single document, we count the number of times each word appears in the document. These word counts are the features of the document.

Note that while the vocabulary is inherently an unordered set of words, we must impose an ordering on the words to maintain a consistent mathematical representation of all documents. In particular, suppose the vocabulary, denoted as V, is a sequence of D words:

V = \langle w_1, w_2, w_3, \ldots, w_D \rangle,

where $w_j$ is the $j$th word in the vocabulary. Then, we represent a document as a length-D feature vector of word counts. Specifically, a document, denoted as $\vec{x}$, is represented as the vector:

\vec{x} = \langle x_1, x_2, x_3, \ldots, x_D \rangle,

where $x_j$ is the number of times word $w_j$ appears in the document. If we do this for all (say N) documents, we can “stack” their feature vectors and represent the full corpus as a document-term matrix:

\begin{bmatrix}
\longleftarrow \; \vec{x}_1 \; \longrightarrow \\
\longleftarrow \; \vec{x}_2 \; \longrightarrow \\
\longleftarrow \; \vec{x}_3 \; \longrightarrow \\
\vdots \\
\longleftarrow \; \vec{x}_N \; \longrightarrow
\end{bmatrix},

where $\vec{x}_i$, the $i$th row of the matrix, represents the $i$th document in the corpus. This document-term matrix is an N × D matrix: there are N rows (documents) and D columns (terms, or words, in the vocabulary). While its dimensions can be huge, this matrix is usually sparse, as the huge size of the vocabulary dwarfs the size of an individual document.

There are some standard preprocessing steps that normally accompany a bag-of-words approach. We exclude so-called stopwords — extremely common words — from the vocabulary, as well as extremely rare words, both of which add little to no information. Also, we usually normalize each feature vector to have unit length. This converts feature values from non-negative integers to real numbers in the range [0, 1]. The most common use for bag-of-words is modeling a corpus of documents which contain natural language, like news articles, posts, discussion forums, and so on. However, we posit that the methodology can be extended to any type of textual document; for our purposes, we want to apply bag-of-words to Web pages, which are textual documents containing HTML code. For some applications, we may only want the natural language contained in the Web page, but for other applications, the full HTML implementation of the Web page may carry valuable information. Fundamentally, the vocabulary need not consist only of real English words; the approach is more flexible, and a “word” can be any kind of string. Ultimately, it is our decision what we consider to be useful “words” for a particular task. In Chapter 3, we explain how we encode HTML elements as words in an effort to capture the full richness of content and structure present in a Web page. Recall that an overarching goal of this work is to alleviate the manual burden in our efforts to categorize a large corpus of Web pages. Again, some amount of manual labeling is unavoidable at the outset, but a heavily manual methodology demands additional work to decipher distinctive features for a particular category, which could help the expert find more examples from that category. This underscores a key advantage of our feature extraction approach: bag-of-words is fully automated, and we require no manual feature engineering. Instead, we leave it to the machine learning algorithm to find similarities or predictive features automatically.
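To make the representation concrete, here is a small sketch in Python. It uses a toy three-document corpus and omits the stopword and rare-word filtering described above; it simply builds unit-normalized count vectors and stacks them into a document-term matrix.

from collections import Counter
import math

# Toy corpus; in our setting each "document" would be a crawled Web page.
documents = [
    "cheap viagra cialis online pharmacy",
    "replica watches luxury watches sale",
    "cheap replica luxury handbags sale",
]

# The vocabulary: every word observed in the corpus, in a fixed order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_feature_vector(doc):
    # Count how often each vocabulary word appears, then normalize the
    # vector of counts to unit Euclidean length.
    counts = Counter(doc.split())
    vector = [counts[word] for word in vocabulary]
    norm = math.sqrt(sum(v * v for v in vector)) or 1.0
    return [v / norm for v in vector]

# The N x D document-term matrix: one row per document, one column per word.
doc_term_matrix = [to_feature_vector(doc) for doc in documents]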

2.5 Related Work

The intersection of machine learning and computer security has grown into a wide area of research. Thus, we narrow our review to closely related work involving techniques and security applications of Web page classification.

2.5.1 Non-Machine Learning Methods

First, we observe a handful of studies that utilize non-machine learning methods, which directly motivated our own work. These methods may use clustering algorithms at first, but then rely on manual effort and automated heuristics, such as regular expressions (regexs), for classifying Web pages. In their analysis of the spam value chain, Levchenko et al. [44] cluster storefront Web pages automatically with a q-gram similarity approach. But then, to link the storefronts to affiliate programs, they manually craft a set of characteristic regexs for every program. (We elaborate on their methodology in Chapter 3.) Likewise, Halvorson et al. [30] automate the classification of parked domains

in the biz TLD using regexs that matched distinctive parts of the HTML content. They also manually classify a sample of 485 Web sites for comparison. In their

corresponding study of the xxx TLD, Halvorson et al. [29] use text shingling (commonly known as minhash, and covered in the following section) for clustering similar Web pages together. But again, they resort to manually engineered regexs to progress from clustering to classification. To detect cloaking in search results, Wang et al. [86] implement a data filtering step, followed by two scoring heuristics and a manually tuned threshold, for deciding if a Web site is cloaked. In subsequent work, they infiltrate an SEO botnet to learn how it poisons search results to drive traffic to criminal Web sites [85].

They manually cluster and classify these Web sites according to the type of scam that monetizes the traffic (e.g., fake anti-virus, drive-by downloads, pirated media, et cetera). We note that these approaches can and do work well for the most part, but classifying Web pages by hand and constructing signatures requires excessive manual labor. To alleviate this burden, researchers have turned to machine learning-based methods, which offer more efficient and robust alternatives for classification. This direction is the foundation of this dissertation.

2.5.2 Near-Duplicate Web Pages

In many cases, the success of automated analyses hinges on the high degree of similarity among groups of Web pages. As Chapters 3 and 5 will show, our systems effectively cluster Web pages that share a very close resemblance. Though we do not appeal to these techniques in our work, well-known algorithms exist for detecting near-duplicate Web pages. The leading two that are still considered state-of-the-art are termed minhash and simhash. Minhash was conceived by Broder [11, 10] and first used to eliminate redundancy when a search engine indexes the Web. The algorithm extracts shingles, or n-grams, from a Web page and hashes them for compression. A random sample of shingles comprises a sketch, a fixed-size set of hashes that represents the Web page. Then, to measure the resemblance (or similarity) of two Web pages, the algorithm uses the Jaccard index of their sketches — the size of the intersection divided by the size of the union. The key insight that makes minhash fast is that the probability that two sets share the same minimum hash, under a randomly chosen hash function, equals their Jaccard index. Hence, the Jaccard index need not be computed in full. Ultimately, pairs of similar Web pages can be linked to form clusters of near-duplicate Web pages.
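As a rough illustration, the following Python sketch builds fixed-size sketches and estimates the Jaccard index from the fraction of agreeing minima. It is not Broder's exact construction: seeded MD5 hashes stand in for the random hash families, and documents are assumed to be long enough to contain at least one shingle.

import hashlib
import re

def shingles(text, n=5):
    # The set of word n-grams (shingles) in a document.
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_sketch(shingle_set, num_hashes=128):
    # For each of num_hashes seeded hash functions, keep only the minimum
    # hash value over all shingles; these minima form the sketch.
    return [
        min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sketch_a, sketch_b):
    # The probability that two sets share the same minimum under a random
    # hash equals their Jaccard index, so the fraction of agreeing minima
    # estimates that index.
    agree = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return agree / len(sketch_a)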

Later, Charikar [13] developed simhash, a related hashing scheme for efficient document comparison. This algorithm condenses a Web page down to a relatively small bit vector, and uses cosine similarity instead of the Jaccard index for comparing Web pages. The crucial insight behind simhash is that the cosine similarity of two Web pages can be estimated by the number of agreeing bits in their abbreviated vector representations. Ensuing studies contributed empirical evaluations of minhash and simhash to accompany their sound theoretical foundations. Henzinger [31] directly compared the algorithms on a task of clustering 1.6 billion Web pages, finding that simhash slightly outperformed its counterpart. Manku et al. [52] favor simhash as well, citing its even smaller fingerprints of Web pages as a main advantage.
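The bit-agreement idea can be sketched just as compactly. The version below is an unweighted simplification (practical implementations typically weight tokens, for example by frequency) and again uses MD5 as a stand-in hash.

import hashlib
import re

def simhash(text, bits=64):
    # Each token votes +1/-1 on every bit position according to its own
    # hash; the sign of each running total gives the fingerprint bit.
    totals = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            totals[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def bit_similarity(fp1, fp2, bits=64):
    # The fraction of agreeing bits approximates the cosine similarity of
    # the two documents' underlying term vectors.
    return (bits - bin(fp1 ^ fp2).count("1")) / bits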

2.5.3 Web Spam and Cloaking

As introduced in Section 2.2, one of the applications we examine in this dissertation involves black hat SEO and Web spam. Spammers adopt devious tactics to manipulate the search ranking of their Web sites. Some common features of a spam Web page are a disproportionate number of backlinks (inbound links) to feign popularity, and keyword stuffing — i.e., including and repeating keywords to appear relevant to target search queries. A fair number of related studies have used statistics and machine learning for detecting this kind of abusive activity. Some of the first automated approaches identify spam Web pages by their statistically abnormal behavior. Fetterly et al. [22] plot the distributions of certain Web site properties and inspect the outliers for spam. Some of the anomalous properties of spam sites are: excessive inbound links, highly replicated SEO content, and rapid evolution in content over time (since spam sites often return auto-generated content on demand). Gyöngyi et al. [27] develop a system called TrustRank to

differentiate good and bad Web sites. Starting with an expert-labeled seed of Web sites, TrustRank uses the Web’s link structure to estimate and track the flow of “trust” between reputable sites. The system then calculates a probability that a given site is spam or not. Benczúr et al.’s SpamRank system [7] implements a different link-based approach for mitigating Web spam. The core idea is that spam Web sites artificially boost their status with a large number of low quality backlinks. SpamRank computes a penalty for such sites that should be deducted from their ranking scores. Ntoulas et al. [65] focus less on link structure and instead consider numerous content-based features that are indicative of Web spam — for example, number of words in both the page and title, average word length, amount of anchor text, fraction of visible content, compressibility, and so on. Many of these features are signals of automatic and replicated keyword stuffing. As no single feature is a silver bullet, they learn a classifier using the full set of features. They find that boosted decision trees yield the most accurate model for detecting spam Web pages. Urvoy et al. [81] also use Web page content to characterize Web spam, but extend the type of features from textual to “extra-textual” to capture the HTML style of Web pages. The goal of this approach is to improve Web spam detection by clustering Web pages with similar HTML styles; such pages are likely generated automatically with the same scripts and templates. This property is exhibited by the spam-advertised storefronts we introduced in Section 2.1. A different kind of automatic content generation used by spammers is called spinning — auto-generating many variations of a Web page, usually by shuffling and substituting words, to avoid duplicate detection. Zhang et al. [93] develop a system called DSpin to detect spun Web content. They reverse-engineer a synonym dictionary to identify words and phrases that spinning tools do not modify, termed immutables. Then, to measure pairwise similarity of Web pages, they perform a set comparison of their immutables. This similarity score is used to cluster families of Web pages that were spun from the same seed page. Other researchers have built machine learning models to detect cloaked Web sites. The key in recognizing cloaking is to determine if a Web site delivers different content to search engine crawlers and browsers. In comparing these two versions of a Web page, Wu and Davison [91] spot the main difference in the SEO-ed nature of the crawler version. They assemble a number of content and link-based features that reveal link and keyword stuffing. Using a total of 162 features, they build a support vector machine (SVM) that achieves an average of 93% precision and 85% recall in detecting cloaked Web sites. Lin [46] provides a new insight to improve cloaking detection: the structure of legitimate Web pages changes much less frequently than the content, which may be dynamically generated. Therefore, a significant difference in the structure of the two Web pages served to a crawler and a browser suggests the use of cloaking. So while previous detectors use terms and links to compare Web page content, Lin uses HTML tags to compare Web page structure. A decision tree trained on a combination of tag and term features averages over 90% accuracy in classifying Web sites as cloaked or not.
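As a sketch of this structural intuition (not Lin's actual feature set or classifier), one could summarize each version of a page by its sequence of HTML tags and measure how much the two summaries differ:

from html.parser import HTMLParser

class TagSequence(HTMLParser):
    # Record only the start tags, ignoring text content and attributes.
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_bigrams(html):
    parser = TagSequence()
    parser.feed(html)
    return set(zip(parser.tags, parser.tags[1:]))

def structure_difference(crawler_html, browser_html):
    # Jaccard distance between the tag-bigram sets of the two versions; a
    # large value suggests different page structures were served.
    a, b = tag_bigrams(crawler_html), tag_bigrams(browser_html)
    if not (a or b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)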

2.5.4 Other Applications

Even more problems in Web page classification demonstrate the wide applicability of machine learning in security research. Bannur et al. [6] tackle the most general such problem — malicious or benign — by exploiting the full content of a Web page: text, tag structure, links, and visual appearance. They train logistic regression and SVM classifiers that are up to 98% accurate.

Kolari et al. [40] use SVMs for a different task: mapping out the “blogosphere”. In addition to customary n-grams, they extract the anchor text and tokenized URLs from a Web page’s links to distinguish blogs from other types of Web sites. Their best models achieve almost 98% accuracy in identifying blogs, but are less effective in detecting spam blogs. Chapter 3 involves counterfeit storefronts that are advertised in email spam. There is a substantial body of work that uses machine learning tools for spam filtering; Guzella and Caminhas’s survey [26] highlights a considerable sample. Another application that specifically involves counterfeit storefronts, though, is the work of Leontiadis et al. [42], who investigate illegal pharmaceutical e-commerce. Using hierarchical clustering with average linkage, they group unlicensed online pharmacies based upon the similarity of their inventories. Their results indicate a potential bottleneck: most unlicensed pharmacies appear to use the same few suppliers, so cutting them off could have outsized impact on sales. The basis of our work in classifying storefront Web pages is the “affiliation” property — we build a system to group together storefronts that are associated with the same affiliate program. The success of automated methods for identifying such affiliations is often due to miscreants using automated methods to scale their attacks. Devising single-use scams is far too inefficient to be profitable, so instead, miscreants replicate their scams on many Web pages across many domains. This idea has been explored in other work as well. Anderson et al. [4] digest a huge feed of spam-advertised URLs by clustering the destination Web pages and identifying which ones promote the same scam. They implement a technique called image shingling to cluster Web pages that are visually similar when rendered in a browser. Through analysis of the clustering results, they observe over 2,000 distinct scams that monetize spam email.

More recently, Drew and Moore [20] also pursue the affiliation property: they cluster criminal Web sites that are managed by the same illegal organization. They develop a novel combined clustering algorithm that succeeds even when cybercriminals try to hide incriminating correlations between their Web sites. The two classes of scams they examine are fake-escrow services and high-yield investment programs. A third instance of identifying sources of common origin is from Wardman and Warner [88], who detect new phishing Web sites by matching them to ones they have already seen. Phishers typically use a “phishing kit,” which is a bundle of resources for crafting a fraudulent Web site. Thus, their phishing Web sites contain many of the same files (HTML, CSS, JavaScript, and images), and therefore can be traced back to the same source. Other systems have been designed to detect phishing Web sites as well. Zhang et al. [94] engineer a system called CANTINA which, among other content-based features and heuristics, uses terms with the highest term frequency-inverse document frequency (TF-IDF) score for classifying Web pages as phishing or legitimate. Whittaker et al. [90] describe a large-scale system that classifies millions of Web pages a day to automatically update Google’s phishing blacklist. They too limit the set of terms to the ones with the highest TF-IDF values. The intuition is that many of the highly ranked terms are victim-specific and obviously used by phishers (e.g., a site posing as eBay will contain eBay-related words). They combine these terms with additional features involving the URL, hosting information, and other page content to build a logistic regression classifier that estimates a phishing probability. For 99.9% of non-phishing Web pages, the classifier predicts a phishing probability under 1%; for 90% of phishing Web pages, a probability over 90%. Chapter 5 involves domain parking, a common practice across the Web used to monetize undeveloped domains. Quite recently, Vissers et al. [83] extract features from a Web page’s content for building a domain parking classifier. On average, a random forest is 98.7% accurate in discriminating parked and not-parked Web pages. In summary, even though this review only scratches the surface of this problem space, it highlights the assortment of security challenges that require categorizing Web pages. Modern cybercrime is organized and executed at Internet scale, and so defensive analyses must be as well. Machine learning provides an array of techniques for conducting the large-scale classification of Web pages, and it will continue to play an integral part in the Internet security field.

Chapter 3

Affiliate Program Identification

A canonical machine learning application in the security domain is spam detection, where a model learns to classify an email as spam or non-spam. This binary classification task is cleanly defined, well-motivated, and thoroughly studied. Machine learning provides highly accurate models for detecting spam (e.g., [51]), but in spite of this success, the spam problem persists. This situation inspired the study from Levchenko et al. [44], which serves as the basis of this chapter. They looked beyond just spam delivery and examined the entire spam business model instead. A critical part of the end-to-end analysis involved identifying the counterfeit businesses (i.e., affiliate programs) whose storefront Web sites were advertised in spam emails. This task prompts a more sophisticated classification problem than simply deciding whether an email is spam or not. Specifically, the problem here is the multi-way classification of storefront Web sites, where the classes are the sponsoring affiliate programs. This particular problem is a representative instance of security efforts that partition Web sites into groups of common origin. Doing this job manually, though, is far too time-consuming for a security expert. In this chapter, we show how machine learning tools can assist the security expert in performing classification at scale. We now provide a synopsis of our work.


We describe an automated system for the large-scale monitoring of Web sites that serve as online storefronts for spam-advertised goods. Our system is developed from an extensive crawl of black-market Web sites that deal in illegal pharmaceuticals, replica luxury goods, and counterfeit software. The operational goal of the system is to identify the affiliate programs of online merchants behind these Web sites; the system itself is part of a larger effort to improve the tracking and targeting of these affiliate programs. There are two main challenges in this domain. The first is that appearances can be deceiving: Web pages that render very differently are often linked to the same affiliate program of merchants. The second is the difficulty of acquiring training data: the manual labeling of Web pages, though necessary to some degree, is a laborious and time-consuming process. Our approach in this chapter is to extract features that reveal when Web pages linked to the same affiliate program share a similar underlying structure. Using these features, which are mined from a small initial seed of labeled data, we are able to profile the Web sites of forty-four distinct affiliate programs that account, collectively, for hundreds of millions of dollars in illicit e-commerce. Our work also highlights several broad challenges that arise in the large-scale, empirical study of malicious activity on the Web.

3.1 Introduction

The Web plays host to a broad spectrum of online fraud and abuse—everything from search poisoning [36, 85] and phishing attacks [50] to false advertising [45], DNS profiteering [29, 60], and browser compromise [71]. All of these malicious activities are mediated, in one way or another, by Web pages that lure unsuspecting victims away from their normal browsing to various undesirable ends. Thus, an important question is whether these malicious Web pages can be automatically identified by suspicious commonalities in appearance or syntax [6, 29, 41, 46, 79, 81, 94]. However, more than simply distinguishing “bad” from “good” Web sites, there is further interest in classifying criminal Web sites into groups of common origin: those pages that are likely under the control of a singular organization [20, 44, 55]. Indeed, capturing this “affiliation” property has become critical both for intelligence gathering and for criminal and civil interventions. In this chapter, we develop an automated system for one version of this problem: the large-scale profiling of Web sites that serve as online storefronts for spam-advertised goods. While everyone with an e-mail account is familiar with the scourge of spam-based advertising, it is only recently that researchers have come to appreciate the complex business structure behind such messages [56]. In particular, it has become standard that merchants selling illegitimate goods (e.g., counterfeit Viagra and Rolex watches) are organized into affiliate programs that in turn engage with spammers as independent contractors. Under this model, the affiliate program is responsible for providing the content for online storefronts, contracting for payment services (e.g., to accept credit cards), customer support, and product fulfillment—but direct advertising is left to independent affiliates (i.e., spammers). Spammers are paid a 30–40% commission on each customer purchase acquired via their advertising efforts and may operate on behalf of multiple distinct affiliate programs over time. Such activities are big business, with large affiliate programs generating millions of dollars in revenue every month [38]. Thus, while there are hundreds of thousands of spam-advertised Web sites and thousands of individual spammers, most of this activity is in service to only several dozen affiliate programs. Recent work has shown that this hierarchy can be used to identify structural bottlenecks, notably in payment processing. In particular, if an affiliate program is unable to process credit cards, then it becomes untenable to sustain its business (no matter the number of Web sites it operates or spammers it contracts with). Recently, this vulnerability was demonstrated concretely in a concerted effort by major brand holders and payment networks to shut down the merchant bank accounts used by key affiliate programs. Over a short period of time, this effort shut down 90% of affiliate programs selling illegal software and severely disabled a range of affiliate programs selling counterfeit pharmaceuticals [55]. The success of this intervention stemmed from a critical insight—namely, that the hundreds of thousands of Web sites harvested from millions of spam emails could be mapped to a few dozen affiliate programs (each with a small number of independent merchant bank accounts). At the heart of this action was therefore a classification problem: how to identify affiliate programs from the Web pages of their online storefronts?
There are two principal challenges to classification in this domain. The first is that appearances can be deceiving: storefront pages that render very differently are often supported by the same affiliate program. The seeming diversity of these pages—a ploy to evade detection—is generated by in-house software with highly customizable templates. The second challenge is the difficulty of acquiring training data. Manually labeling storefront pages, though necessary to some degree, is a laborious and time-consuming process. The researchers in [44] spent hundreds of hours crafting regular expressions that mapped storefront pages to affiliate programs. Practitioners may endure such manual effort once, but it is too laborious to be performed repeatedly over time or to scale to even larger corpora of crawled Web pages. Our goal is to develop a more automated approach that greatly reduces manual effort while also improving the accuracy of classification. In this chapter

Table 3.1. Screenshots of online storefronts selling counterfeit pharmaceuticals, replicas, and software.

GlavMed Ultimate Rep. SoftSales

we focus specifically on spam-advertised storefronts for three categories of products: illegal pharmaceuticals, replica luxury goods, and counterfeit software (Table 3.1). We use the data set from [44] consisting of 226K potential storefront pages winnowed from 6M distinct URLs advertised in spam. From the examples in this data, we consider how to learn a classifier that maps storefront pages to the affiliate programs behind them. We proceed with an operational perspective in mind, focusing on scenarios of real-world interest to practitioners in computer security. Our most important findings are the following. First, we show that the online storefronts of several dozen affiliate programs can be distinguished from automatically extracted features of their Web pages. In particular, we find that a simple nearest neighbor (NN) classifier on HTML and network-based features achieves a nearly perfect accuracy of 99.99%. Second, we show that practitioners need only invest a small amount of manual effort in the labeling of examples: with just one labeled storefront per affiliate program, NN achieves an accuracy of 75%, and with sixteen such examples, it achieves an accuracy of 98%. Third, we show that our classifiers are able to label the affiliate programs of over 3700 additional storefront pages that were missed by the hand-crafted regular expressions of the original study. Finally, we show that even simple clustering methods may be useful in this domain—for example,

to identify new affiliate programs or to reveal when known affiliate programs have adopted new software engines.

Figure 3.1. Data filtering process. Stage 1 is the entire set of crawled Web pages; stage 2, pages tagged as pharmaceutical, replica, and luxury; stage 3, storefronts of affiliate programs matched by regular expressions.

3.2 Data Set

We use the data set of crawled Web pages from the measurement study of the spam ecosystem by Levchenko et al. [44]. Figure 3.1 depicts how they collected, filtered, and labeled the data, which we summarize in this section; we refer the reader to their paper for a more detailed explanation of their methodology.

3.2.1 Data Collection

Over a period of three months, Levchenko et al. [44] crawled over 6M URLs from various spam feeds, including one feed from a top Web mail provider and others captured as output from live spamming bots. From the combined feeds, they obtained 1,692,129 distinct landing pages after discarding duplicate pages referenced by multiple URLs. For each landing page the crawler saved the raw HTML as well as a screenshot. In addition, a DNS crawler collected network information on the extracted domains. The first level in Figure 3.1 corresponds to this set of Web pages.

3.2.2 Data Filtering

The authors of [44] focused on three popular categories of spam-advertised goods—illegal pharmaceuticals, replica luxury goods, and counterfeit software. They used keyword matching to filter out Web pages that were unrelated to these categories of merchandise. Keywords included both brand names, such as Viagra and Cialis, and more generic terms, such as erectile dysfunction and drug. This filtering removed pages whose text did not advertise sales of pharmaceuticals, replicas, or software—an omission that runs counter to the business interest of any viable storefront. The absence of such text was found to be a reliable screen for non-storefront pages. A total of 226,439 storefront pages emerged from this filter, which were then broadly grouped into the three categories of pharmaceuticals, luxury goods, and software. These categorized pages, shown in the second level of Figure 3.1, comprise the universe of pages that we consider in this chapter. The filtering by domain-specific keywords may be viewed as a simple operational heuristic to narrow the massive data set of blindly crawled URLs to a subset of pages of interest. Note that the filtering in this step was purposely conservative so that legitimate storefront pages were not winnowed from the data set.

3.2.3 Data Labeling

Through a combination of manual inspection and heuristic pattern-matching, the authors of [44] managed to link the spam-advertised storefronts for pharmaceuticals, luxury goods, and pirated software to individual affiliate programs. An initial round of manual inspection was necessary because even the cast of affiliate programs was not known a priori. Thus the first pass over crawled Web pages focused on identifying prominent affiliate programs and collecting a moderate sample of storefront pages from each. The authors identified forty-five distinct affiliate programs in this way (although one was later discovered to be a subset of another). The next and most time-consuming step of this process was to expand the number of storefront pages labeled for each affiliate program. The authors of [44] sought to automate this process through the use of regular expressions. Specifically, for each affiliate program, they devised a set of characteristic regular expressions that matched the HTML content of its online storefronts but not those of other programs. They also fixed an integer threshold for each affiliate program; if the Web page of an online storefront matched this number of regular expressions (or more), then it was labeled as belonging to the program. Using this approach, they were able to identify the affiliate programs of 180,690 online storefronts. The bottom level in Figure 3.1 represents this final set of storefront pages. The above approach depended on the meticulous crafting of regular expressions that distinguish the storefront pages of different affiliate programs. To devise such an expression, it was necessary (roughly speaking) to find a pattern that was present in all the pages of one program and absent in the pages of all other programs. The painstaking effort required for this approach is best illustrated by example. Here is one regular expression for storefront pages in the GlavMed affiliate program:

var\s+SessionType\s*=\s*"URL";\s*var\s+SessionPrefix\s*=\s*"[0-9a-fA-F]{32}";\s*var\s+SessionName\s*=\s*"USID";

It can take a couple of hours just to hone a single regular expression such as the one above. Much more time, naturally, is required to obtain coverage of multiple affiliate programs. The full effort in [44] involved not only the careful scrutiny of HTML content, but also iterative refinement and extensive validation. In total, the authors estimated that they spent over two hundred man-hours designing and validating the regular expressions for all forty-five affiliate programs. In the present work, we limited our study to forty-four1 of these affiliate programs. We also discarded pages from the above collection that lacked a screenshot (which we need for manual validation of our results). From the initial three-month Web crawl, this left us with 178,281 labeled storefronts and 43,442 unlabeled storefronts; all of these storefronts were tagged as selling pharmaceuticals, luxury goods, or software, but only the former were successfully mapped to affiliate programs by regular expressions. We also obtained a subset of data crawled from a period of the following three months. These Web pages were collected and labeled in the same manner as before; in particular, they were labeled without updating the regular expressions from the initial crawl. It is known that storefronts evolve over time, and therefore we do not expect the labels during this time period to be as reliable as the earlier ones. Table 3.2 summarizes the data from each crawl. Note also the severe class

1The affiliate program we exclude was later found to be a proper subset of a different program.

Table 3.2. Summary of the data from crawls of consecutive three-month periods.

Examples          1st 3 months    2nd 3 months
Labeled           178,281         29,581
Unlabeled         43,442          12,756
Largest class     58,215          13,529
Smallest class    2               0

imbalance: the largest affiliate program has 58,215 distinct storefronts, while the smallest has just 2. In addition, there were five affiliate programs whose regular expressions in the first period did not detect any storefront pages in the second period.

3.3 An Automated Approach

It should be clear that the approach of Section 3.2.3 does not scale well with the number of storefront pages or affiliate programs. This approach is also too time-consuming to repeat on a regular basis—updating the regular expressions before they go stale—as would be necessary to maintain a working operational system. In this section we describe a faster and more highly automated approach for the classification of storefront pages by affiliate program. Our data-driven approach consists of two steps. First, we use automatic scripts to extract tens of thousands of potentially informative features from the crawl of each online storefront. Next, we represent these features in a lower-dimensional vector space and use nearest neighbor classification to identify affiliate programs from whatever labeled examples are available. Our hope with this approach is to avoid the painstaking crafting of regular expressions in favor of simple, well-tested methods in data mining and machine learning.

3.3.1 Feature Extraction

For each online storefront in our data set, we have both the HTML source of its Web page and the DNS record of its domain. Together these provide a wealth of information about the storefront that can be extracted by automatic methods. We consider each in turn.

HTML Features

We know from previous work [44] that the storefronts of different affiliate programs can be distinguished by properties of their raw HTML. This is likely the case because each affiliate program uses its own software engine to generate storefront templates; as a result, different storefronts within the same affiliate program often have similar DOM2 structures. Indeed, the regular expressions of [44] only work insofar as commonalities in implementation exist among the different storefronts of individual affiliate programs. We extract HTML features using a bag-of-words approach. This approach ignores the ordering of HTML tags and text within a Web page, but it is simple enough for the quick and automatic extraction of thousands of potentially informative features. In this respect our approach differs considerably from the manual “feature engineering” of regular expressions; we do not seek to find the most informative features ourselves, merely to extract all candidate features for a statistical model of classification. A key consideration of our approach is how to encode HTML tags as words. Consider the following snippet of HTML code:
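<img src="example.jpg" alt="Example pic" height="50" width="100">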

2The document object model (DOM) refers to the representation of a Web page as a tree of HTML elements.

The problem in encoding this snippet is how to achieve the appropriate level of granularity. It is clear that important information is carried by the context in which individual HTML tags appear. However, this information is lost if we encode the tags and tokens in this snippet as single words. On the other hand, it is clearly infeasible to encode this whole snippet, and others like it, as a single “word”; taken as a whole, the HTML element is akin to a full sentence of text. Our compromise between these two extremes is to encode each tag-attribute-value triplet as a word. We also remove whitespace. This solution, for example, parses the above HTML into the following words:

img:src=example.jpg img:alt=Examplepic img:height=50 img:width=100

To parse HTML element content—the text between start and end tags—we treat

the text as a single string that we split on certain characters (,;{}) and whitespace. We then form “words” by stripping non-alphanumeric characters from the resulting substrings and converting all letters to lowercase. We parse HTML comments in the same way. Following convention, we exclude words that are either very common (e.g., stopwords) or very rare (appearing in few Web pages). The bag-of-words approach has the potential to generate very large feature vectors: the dimensionality of the feature space is equal to the number of extracted words. We limited our vocabulary of words to the HTML of storefronts from the initial three-month crawl. From this period alone, however, we extracted over

8M unique words. Since most of these words were likely to be uninformative, we used simple heuristics to prune this space. In particular, for each of the forty-four affiliate programs, we only retained words that appeared in the HTML source of 10% (or more) of the program’s online storefronts. Finally, concatenating these words we obtained a total of 34,208 HTML-based features for classification.
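A minimal sketch of this encoding, using Python's standard HTML parser, is shown below; it is not the exact extraction script used in this work, and it omits the vocabulary pruning just described.

from html.parser import HTMLParser
import re

class HTMLWordExtractor(HTMLParser):
    # Emit one "word" per tag-attribute-value triplet, plus lowercased
    # alphanumeric tokens from text content and comments.
    def __init__(self):
        super().__init__()
        self.words = []
    def handle_starttag(self, tag, attrs):
        for attr, value in attrs:
            cleaned = "".join((value or "").split())  # drop whitespace in values
            self.words.append(f"{tag}:{attr}={cleaned}")
    def handle_data(self, data):
        for chunk in re.split(r"[,;{}\s]+", data):
            token = re.sub(r"[^0-9a-zA-Z]", "", chunk).lower()
            if token:
                self.words.append(token)
    handle_comment = handle_data  # comments are parsed like element content

extractor = HTMLWordExtractor()
extractor.feed('<img src="example.jpg" alt="Example pic" height="50" width="100">')
print(extractor.words)
# ['img:src=example.jpg', 'img:alt=Examplepic', 'img:height=50', 'img:width=100']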

Network Features

The authors of [44] also developed a DNS crawler to enumerate all address and name server records3 linked to the domains of spam-advertised URLs. We expect such records to help us further in identifying affiliate programs from their online storefronts. In particular, we expect different affiliate programs to use different naming and hosting infrastructures, and thus we might hope to distinguish them by the name and Web servers that support their domains. Such information may also serve to link online storefronts with different Web pages to the same affiliate program. For example, the DOM trees of two storefronts might be quite different, but if their Web pages were hosted at the same server, then we could predict with confidence that they belong to the same affiliate program. We mined both the address and name server records of online storefronts for useful features. An address record, or A record, maps a domain name to the IP address of the machine on the Internet where the domain is hosted. For example, the

A record of the domain name ucsd.edu maps to the IP address 132.239.180.101. A name server record, or an NS record, identifies the name servers that provide these mappings of names to addresses. The NS record for the domain ucsd.edu contains the domain names of four name servers: ns0.ucsd.edu, ns1.nosc.mil, ns1.ucsd.edu, and ns2.ucsd.edu. In addition, these name servers have their

3Due to a technical issue, however, this information was only recorded during the first three-month crawl of online storefronts.

Table 3.3. Feature counts, density of the data, dimensionality after principal components analysis, and percentage of unique examples.

Features        Count    Density   PCA   Unique Exs.
Bag-of-words    34,208   2.20%     66    12.07%
Network         30,791   0.12%     665   6.09%
Both            64,999   1.21%     72    31.52%

own IP addresses; e.g., the A record for ns0.ucsd.edu maps to the IP address 132.239.1.51. Note that a storefront’s domain name may be associated with multiple A and NS records. These multiple associations are a counter-defense against security measures such as domain blacklisting. For each storefront domain, we extracted the IP address from its A record, the IP addresses of its name servers, and the autonomous system numbers (ASNs) of both these IP addresses. (The ASN of an IP address uniquely identifies its Internet Service Provider.) All together, these IPs and ASNs yield four categorical network features for each storefront. We tallied the number of unique IPs and ASNs observed during the initial three-month crawl of storefronts: there were 17,864 unique A record IPs, 8,825 unique NS record IPs, 2,259 unique ASNs of A record IPs, and 1,853 unique ASNs of NS record IPs. We encoded each categorical feature with k uniquely occurring values as a k-dimensional binary vector (i.e., a simple unary encoding). Concatenating the elements of these vectors, we obtained a total of 30,791 binary-valued network features for each online storefront.
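A sketch of this encoding is shown below; the field names for the four categorical features are hypothetical, and the vocabularies are assumed to be the sets of values observed during the crawl.

def unary_encode(value, vocabulary):
    # A k-dimensional binary vector with a single 1 at the position of
    # `value` among the k uniquely occurring values seen in training.
    index = {v: i for i, v in enumerate(sorted(vocabulary))}
    vector = [0] * len(vocabulary)
    if value in index:
        vector[index[value]] = 1
    return vector

def network_feature_vector(record, vocabularies):
    # Concatenate the unary encodings of the A-record IP, the NS-record IP,
    # and the ASNs of both. `record` and `vocabularies` are dicts keyed by
    # the (assumed) field names below.
    features = []
    for field in ("a_ip", "ns_ip", "a_asn", "ns_asn"):
        features.extend(unary_encode(record[field], vocabularies[field]))
    return features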

3.3.2 Dimensionality Reduction & Visualization

Table 3.3 shows the numbers of HTML-based and network-based features, as well as the total number of features when combined. We were interested in the different types of information contained in HTML vs. network features. To explore the value of different features, we experimented with classifiers that used either one 50

set of features or the other; this was in addition to classifiers that worked in their combined feature space. In all of these cases, the data is very sparse and high dimensional. To reduce the dimensionality of the data we used principal component analysis (PCA) [37]: in particular, after normalizing the feature vectors to have unit length, we projected them onto enough principal components to capture over 99% of the data’s variance. We did this for each of the three feature spaces. The third and fourth columns of Table 3.3 show, respectively, the small percentages of non-zero features before PCA and the number of principal components needed to capture 99% of the data’s variance. In the last column of Table 3.3 we have noted the high percentage of duplicate representations in feature space. Within a single affiliate program, it is often the case that different storefronts have identical representations as feature vectors; this is true even though every Web page in our data set has a distinct DOM tree. These duplicates arise when the extracted features do not contain the information that distinguishes actual Web pages. For example, two storefronts may have the same bag-of-words representation if their Web pages differ only slightly in their HTML (e.g., a single link, image path, or JavaScript variable name) and if, in this case, the differentiating words were too rare to be included as features. Likewise, two storefronts may have equivalent network features if they are hosted on the same server. In Section 3.2 we noted that 180,690 distinct storefront pages were collected. The many duplicates (and many more near-duplicates) arise from the operational constraints faced by affiliate programs. To maximize their sales, affiliate programs must deploy and support many storefronts across the Internet. This process can only be streamlined by the aggressive use of template engines. Ironically, it is

precisely the requirement of these programs to operate at scale that makes it possible to automate the defenses against them. For illustration, Figure 3.2 plots the projections of storefront feature vectors from the largest affiliate program (EvaPharmacy) onto the data’s two leading principal components. Two observations are worth noting. First, the distribution is clumpy, with some irregularly shaped modes, evidence of the variety of storefronts in this single program. Second, some storefronts that render very differently map to nearby points. This proximity is evidence of a similar underlying structure that associates them to the same affiliate program.

Figure 3.2. Projection of storefront feature vectors from the largest affiliate program (EvaPharmacy) onto the data’s two leading principal components.

3.3.3 Nearest Neighbor Classification

We use nearest neighbor (NN) classification [15] to identify the affiliate programs of unlabeled storefronts. In particular, we store the feature vectors of the labeled storefronts (after PCA) in memory, and for each unlabeled storefront, we compute its NN in Euclidean distance among the training examples. Finally, we predict the affiliate program of the unlabeled storefront to be that of its NN. A common extension is to compute k nearest neighbors and to take the majority vote among them, but we did not find this necessary for our results in this domain. There are many statistical models of classification—some of them quite sophisticated—but in this domain we were drawn to NN classification for reasons beyond its simplicity. A primary reason is that NN classifiers do not make any strong parametric assumptions about the data: they do not assume, for instance, that the examples in different classes are linearly separable, or that the examples in a single class are drawn from a particular distribution. The two-dimensional projection of EvaPharmacy storefronts in Figure 3.2 illustrates the problematic nature of such assumptions. Some affiliate programs use multiple template engines to generate a diversity of storefronts, and this in turn yields a complicated distribution in feature space. NN classification copes well with these complexities; to label examples correctly, it is only necessary to find that they are closest to others that are similarly labeled. The resulting decision boundaries can be quite nonlinear. There are other natural advantages of NN classification for our work. It does not incur any additional complexity when there are large numbers of classes, and in our problem, there are dozens of affiliate programs. Also, NN classifiers do not need to be re-trained when new labeled data becomes available; in our problem, an operational system would need to cope with the inevitable launch of new affiliate programs. A final benefit of NN classification is that it lends itself to manual validation. For this study, we implemented a visual Web tool for validating predictions in the absence of ground truth labels. The tool displays two adjacent storefronts—one unlabeled, the other its nearest neighbor among the labeled examples—along with the distance between them and links to their HTML source code. The tool was invaluable for ongoing validation of the classifier,

a necessary part of any eventual, real-world deployment.
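An operational pipeline along these lines could be assembled from off-the-shelf components. The sketch below uses scikit-learn with placeholder data and hypothetical affiliate program labels; it is one way to realize the normalize/PCA/NN steps described above, not the system's actual implementation.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Placeholder data: rows are bag-of-words feature vectors; labels are
# affiliate program names (hypothetical here).
X_train = np.random.rand(200, 50)
y_train = np.random.choice(["GlavMed", "EvaPharmacy", "SoftSales"], size=200)
X_unlabeled = np.random.rand(10, 50)

model = make_pipeline(
    Normalizer(norm="l2"),                          # unit-length feature vectors
    PCA(n_components=0.99, svd_solver="full"),      # keep 99% of the variance
    KNeighborsClassifier(n_neighbors=1, metric="euclidean"),  # 1-NN
)
model.fit(X_train, y_train)
predicted_programs = model.predict(X_unlabeled)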

3.4 Experiments

In this section we evaluate the potential of automated, data-driven methods for identifying the affiliate programs behind online storefronts. We take a bottom-up approach, beginning with simple diagnostics of the features in the previous section and ending with tests to measure how well a deployed NN classifier would perform in the real world. We begin with a brief overview of our experiments. First we verify, as a proof of concept, that the online storefronts of different affiliate programs can in fact be distinguished by their HTML-based and network-based features. In these experiments, we attempt only to classify the Web pages of storefronts that belong to our previously identified set of forty-four affiliate programs. This is unrealistic as an operational setting, since in practice a working system would also need to classify Web pages that belong to other affiliate programs (or that do not even correspond to storefronts). However, as a logical first step, it is important to verify that the features we extract can—at least, in principle—separate the online storefronts of different affiliate programs. Our next experiments remove the assumption of a “closed” universe of Web pages from previously identified affiliate programs. In these experiments we attempt to classify all Web pages that matched the category keywords of pharmaceuticals, replicas, or software (i.e., pages in the second level of Figure 3.1). A complication arises here in that we do not have “ground truth” labels for Web pages that were not matched and labeled by regular expressions. Many of these pages may indeed belong to other affiliate programs, but many from known programs may have simply gone undetected. (The regular expressions do not provide complete coverage; they can fail to match storefronts with slight, unforeseen variations in their HTML.)


Figure 3.3. Histogram of three NN distances to EvaPharmacy storefronts: distances to storefronts in the same affiliate program, to those in other programs, and to unlabeled storefronts. Distances were computed using the HTML bag-of-words representation of the Web pages.

In these experiments we therefore follow NN classification with manual validation of its predictions. Here we succeed in detecting many additional storefronts from known affiliate programs that the regular expressions failed to match. Our next experiments explore the effectiveness of NN classification at the outset of a Web crawl of spam-advertised URLs. At the outset of a crawl, practitioners must operate with very few labeled storefronts per affiliate program. While some manual labeling is unavoidable, the key question for practitioners is a simple one: how much labeling is “enough”? In particular, how many storefronts must be identified in each affiliate program before a NN classifier can take over, thereby automating the rest of the process? To answer this question we measure the accuracy of NN classification as a function of the number of labeled storefronts per affiliate program. Our final experiments consider the most difficult scenario, one where practitioners

lack the prior knowledge to label any storefronts by affiliate program. In these experiments we investigate how well (fully) unsupervised methods are able to cluster storefronts from multiple unidentified affiliate programs. We conduct two experiments to evaluate this potential. First, we cluster all the Web pages tagged as storefronts and measure how well the clusters correlate with known affiliate programs. Second, we investigate whether unsupervised methods can discover new affiliate programs among unlabeled storefronts. We also consider a closely related problem: recognizing when a known affiliate program generates storefront templates from a new software engine.

3.4.1 Proof of Concept

Our first experiments compare how well different features distinguish storefronts from a previously identified set of affiliate programs. In particular we experimented on the three different sets of features listed in Table 3.3. For each of these, we computed the accuracy of NN classification averaged over ten 70/30 training/test splits of labeled storefronts from the first three-month Web crawl. The splits were random except that we did preserve the proportion of affiliate programs across splits; this was necessary to ensure that even the smallest affiliate programs were represented in the training examples. NN classification in these experiments was extremely accurate. The classifier achieved over 99.6% (test) accuracy with just network-based features, and with HTML bag-of-words features it verged on ideal prediction: on average only 8 out of 53,484 test examples were misclassified. Figure 3.3 provides a visual explanation for the accuracy of NN classification. The distribution of NN distances illustrates how well-separated one class (EvaPharmacy) is from other classes. The shape of this distribution also reflects the large fraction of duplicates in the data set; these are different storefronts whose

These results demonstrate the power of very simple low-level features to distinguish the storefronts of different affiliate programs. Two points are worth emphasizing: first, that these features require no further processing for the high accuracy of NN classification, and second, that they are easy to extract automatically, requiring none of the manual effort of previous approaches.

In the experiments that follow we use only the bag-of-words features for NN classification. It is clear that these features suffice to distinguish the storefronts of different affiliate programs; they are also more readily available, as only the HTML source of a Web page is required. This consideration is important for ease-of-use as a deployed system. Also, as we have already remarked, the network-based features were not available for storefronts crawled during the second three-month period of data collection.
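As a concrete illustration of this setup, the sketch below builds bag-of-words vectors from storefront HTML and evaluates a 1-nearest-neighbor classifier on a stratified 70/30 split. It is a minimal approximation of the experiment described above, not the original pipeline; `pages` and `labels` are hypothetical placeholders for the crawled storefront HTML and their affiliate-program labels.

```python
# Minimal sketch of the proof-of-concept experiment: bag-of-words features
# from storefront HTML plus 1-nearest-neighbor classification.
# `pages` (list of HTML strings) and `labels` (affiliate program per page)
# are placeholders for the crawled, regex-labeled storefronts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

vectorizer = CountVectorizer(lowercase=False)      # tokens taken from raw HTML source
X = vectorizer.fit_transform(pages)                # sparse bag-of-words matrix
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, test_size=0.3, stratify=labels)     # preserve class proportions per split

clf = KNeighborsClassifier(n_neighbors=1)          # nearest neighbor in feature space
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```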

3.4.2 Labeling More Storefronts

Next we investigated whether our approach can identify the affiliate programs of storefronts whose HTML was not matched by regular expressions. One goal was to expand the number of labeled examples for subsequent evaluation of our approach. We generated bag-of-words feature vectors for the unlabeled Web pages using the vocabulary of HTML words in labeled storefronts. For each unlabeled page, we found the NN among labeled storefronts and marked it as a candidate match for that neighbor’s affiliate program. Finally, for each affiliate program, we ranked the candidate pages by the proximity of their NNs and examined the rankings with the Web tool we implemented for manual validation.

In total we labeled 3,785 additional storefronts in the first three-month crawl of spam-advertised URLs; these newly labeled storefronts came from twenty-eight of the forty-four known affiliate programs.

Table 3.4. Examples of storefronts matched by regular expressions (left column), and storefronts not matched but discovered by NN classification (right column).

Regex matched                              Not matched
ViaGrow                                    ViagPURE
swissapotheke.net, swissapotheke24.com     swiss-apotheke.net, swiss-apotheke.com

For most of these programs the new storefronts were detected as the top-ranked candidates, and a simple distance threshold separated the “hits” from the “misses.” For example, Figure 3.3 shows that we discovered some new EvaPharmacy storefronts that were close or identical to their NN among labeled storefronts (shown by the small white segment of the first bar), but nothing beyond that. Only two affiliate programs had highly ranked candidate storefronts that belonged to a different but previously identified affiliate program.

Table 3.4 shows two storefronts detected by NN classification but missed by the regular expressions. In both cases, the discovered storefronts (right column) are very similar to their NN (left column) in both their HTML content and how they render in a browser. These examples expose the brittleness of the regular expressions: the new storefronts evade detection through a slight tweak to the domain name (swiss-apotheke.net) or a simple re-branding (ViaGrow → ViagPURE). Of course, a straightforward refinement of the regular expressions would rectify both these misses. However, it is precisely this need for refinement that we wish to avoid. The strength of our system is that it discovered and labeled these new storefronts automatically.

We repeated this same process to label new storefronts from the second three-month crawl of spam-advertised URLs. For this period, we found and labeled 761 new storefronts belonging to eighteen different affiliate programs. Table 3.5 shows how many additional storefronts we detected in all six months for certain affiliate programs. In the first time period, we detected the most new storefronts (1,467) for the RX-Promotion affiliate program, although this number represented only a 4% increase in the program’s size. By contrast, we detected an eight-fold increase in storefronts for the affiliate program Club-first.
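The candidate-ranking step described above can be sketched as follows: each unlabeled page is tentatively assigned the affiliate program of its nearest labeled storefront, and candidates are sorted by that distance so a human validator can review the closest matches first. The variables `X_labeled`, `y_labeled`, and `X_unlabeled` are assumed to come from the same bag-of-words vectorizer; this is an illustrative sketch rather than the system's actual code.

```python
# Sketch: rank unlabeled pages by distance to their nearest labeled storefront,
# so a human can validate the closest candidates for each affiliate program first.
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=1).fit(X_labeled)
dist, idx = nn.kneighbors(X_unlabeled)             # nearest labeled neighbor per page
neighbor_labels = np.asarray(y_labeled)[idx.ravel()]
candidates = sorted(zip(dist.ravel(), neighbor_labels, range(X_unlabeled.shape[0])))

for d, program, page_id in candidates[:50]:        # closest matches reviewed first
    print(f"{d:8.2f}  {program:25s}  unlabeled page #{page_id}")
```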

3.4.3 Classification in the Wild

Our next experiments performed NN classification on all spam-advertised Web pages that were categorized as pharmaceutical, replica, or software Web sites. These are all the pages in the middle panel of Figure 3.1, not merely those matched by regular expressions. Most of these Web pages—the ones matched by regular expressions, and the ones detected and manually validated in the previous section—belong to the known family of forty-four affiliate programs. The others we assign to a catch-all label of “other,” thereby yielding a 45-way problem in classification. With ground-truth labels for these Web pages, we can now reliably measure the performance of NN classification on this enlarged data set.

We began with experiments on the Web pages crawled during the first three-month period. As before, but now with the inclusion of an “other” label, we measured the accuracy of NN classification on ten 70/30 splits of the data.

Table 3.5. Data sizes and performance for select affiliate programs. For each program, the two rows show results from the first then second 3 months of the study. The top five programs listed are the largest classes. EuroSoft and Ultimate Replica are representative software and replica programs. ED Express suffers drastic misprediction in the second 3 months since it changed template engines. Club-first is the affiliate program whose size experiences the largest percent gain as a result of our classifier. In total, we profiled the storefronts of 29 pharmaceutical, 10 replica, and 5 software affiliate programs.

Affiliate program     labeled   detected   precision   recall
EvaPharmacy            58,215        434       99.98    100.00
                       13,529         98       99.29     99.29
Pharmacy Express       44,017         42      100.00    100.00
                        2,454         11       99.59     99.55
RX-Promotion           37,245      1,467      100.00    100.00
                        3,317        198       94.31     93.88
Online Pharmacy        16,546        758      100.00    100.00
                        3,457        271       92.86     92.86
GlavMed                 6,616        263       99.95    100.00
                          176         49       77.73     76.00
EuroSoft                2,215         15      100.00    100.00
                        1,681         54       97.00     96.89
Ultimate Replica           79         10      100.00    100.00
                          121          5       96.03     96.03
ED Express                 77          5       96.30    100.00
                        3,653          0       90.00      0.25
Club-first                 14        114      100.00    100.00
                            6          0      100.00    100.00

The average accuracy for 45-way classification was still remarkably high at 99.95%. These results show that NN classification can distinguish the storefronts of different affiliate programs even when many of them are collectively assigned an undifferentiated label of “other.”

Next we investigated how well NN classification holds up over time. For this we performed NN classification using Web pages from the first three-month crawl as training examples and pages from the second three-month crawl as test examples.

The accuracy remained high at 87.7%, a level that is still quite operationally useful for forty-five-way classification of affiliate programs. But the drop in performance also indicates that a better system should adapt to affiliate programs as they evolve in time. In particular, as we detect and label new storefronts, it would behoove NN classification to augment the set of training examples and extract a larger vocabulary of HTML features.

Table 3.5 lists precision and recall on this task for some affiliate programs. The most startling result is the 0.25% recall for ED Express; in fact, almost 70% of the incorrect predictions involved misclassified storefronts from this affiliate program. We manually examined these storefronts and concluded that ED Express switched to a new template engine for generating storefronts. As a result the storefronts for ED Express from the second three-month crawl were rather different than those from the first. This is a valuable insight into how affiliate programs4 operate, and we return to this case in our final set of experiments.

4As an aside, we learned later that these misclassified storefronts belonged to Pharmacy Express, and that Pharmacy Express and ED Express are in fact part of the same aggregate program called Mailien. Some of these errors can therefore be attributed to noisy and/or imperfect labeling of the training examples.

3.4.4 Learning with Few Labels

We next consider how many storefronts must be manually labeled before an automated system can dependably take over. In these experiments we vary the amount of labeled storefronts that are available as training examples for NN classification. In previous sections we studied the regime where there were many labeled storefronts per identified affiliate program. Here we consider the opposite regime where each affiliate program may, for instance, have no more than one labeled storefront.

We make two preliminary observations. First, in these experiments we not only have many fewer training examples; we also have many fewer features, because the feature extraction is necessarily limited to counts of HTML words that appear in the training examples. Thus, to be clear, if we have just one labeled storefront per affiliate program, then we can only extract features from those forty-four labeled storefronts. In this regime we do not exclude rare words (though we still exclude stopwords).

Second, in these experiments we handle the “other” label for unidentified Web pages as a special case. At the outset of a Web crawl, we imagine that practitioners are likely to encounter numerous instances of Web pages that belong to unidentified affiliate programs or that do not represent storefronts at all. Thus in all the experiments of this section we assume that 100 Web pages are available from the undifferentiated “other” class that does not map to a previously identified affiliate program.

Figure 3.4. Boxplot showing balanced accuracy (%) for all 45 classes as a function of the number of training examples per class (1, 2, 4, 8, 16). The top and bottom edges of the box are the 75th and 25th percentiles, respectively. The whiskers show the lowest points not considered outliers, and the outliers are individual plus marks.

Figure 3.4 plots the balanced accuracy of NN classification per affiliate program (averaged over five random training/test splits) versus the number of labeled storefronts. Note that throughout this regime, the median accuracy holds at 100%, meaning that at least half the affiliate programs have none of their storefronts misclassified. Overall the results show that even with few labeled storefronts, NN classification is accurate enough to be operationally useful. With only one labeled storefront per affiliate program, the average 45-way classification accuracy remains nearly 75%; with two storefronts, it jumps to 85%, and with four, eight, and sixteen, it climbs respectively to 93%, 97%, and 98%.

Table 3.6 shows four correctly classified storefronts that render quite differently from their nearest (labeled) neighbors in this regime. Visually, it is not immediately obvious that these pairs of storefronts belong to the same affiliate program; upon inspection of their HTML source, however, it becomes apparent that their DOM trees share a similar underlying structure.

We found that the affiliate program called RX-Partners suffered the worst performance throughout this regime. Upon closer examination of these errors, however, we observed that the regular expressions for RX-Partners often matched Web pages that were not storefronts at all. This mislabeling created many confusions in NN classification between RX-Partners and the undifferentiated “other” label.

These results suggest an iterative strategy for practitioners that balances the demands for high levels of both accuracy and automation. To begin, practitioners may seed NN classification with just one or two labeled storefronts per affiliate program. Then, using the preliminary predictions from NN classification in this regime, they may attempt to grow the number of labeled storefronts (and extractable features) in alternating rounds of manual validation and re-training. This “bootstrapping” strategy seems the most effective way to develop a practical, working system.

Table 3.6. Examples of correctly classified storefronts when there is only one training example per affiliate program. The affiliate programs shown here are 33drugs and RX-Promotion.

Training Example Correct Classifications

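A rough sketch of the few-labels experiment is given below: it trains a 1-NN classifier with n labeled storefronts per program and reports balanced accuracy, mirroring Figure 3.4. The dictionary `pages_by_program` is hypothetical, and the sketch omits details from the text such as re-extracting the vocabulary from only the labeled examples and capping the "other" class at 100 pages.

```python
# Sketch of the few-labels experiment: train a 1-NN classifier with n labeled
# storefronts per affiliate program and measure balanced (per-class) accuracy.
# `pages_by_program` is a hypothetical dict: program name -> list of dense
# bag-of-words feature vectors for that program's storefronts.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

def run_trial(pages_by_program, n_labeled, rng):
    X_tr, y_tr, X_te, y_te = [], [], [], []
    for program, vectors in pages_by_program.items():
        order = rng.permutation(len(vectors))
        for j, i in enumerate(order):              # first n go to training, rest to test
            (X_tr if j < n_labeled else X_te).append(vectors[i])
            (y_tr if j < n_labeled else y_te).append(program)
    clf = KNeighborsClassifier(n_neighbors=1).fit(np.vstack(X_tr), y_tr)
    return balanced_accuracy_score(y_te, clf.predict(np.vstack(X_te)))

rng = np.random.default_rng(0)
for n in (1, 2, 4, 8, 16):
    scores = [run_trial(pages_by_program, n, rng) for _ in range(5)]
    print(n, "labeled per program -> mean balanced accuracy", np.mean(scores))
```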

3.4.5 Clustering

Finally, we investigate the capabilities of fully unsupervised methods for distinguishing storefronts from different affiliate programs. We do this first for the full set of pharmaceutical, replica, and software Web pages, in the hope that these storefronts might cluster in an organic way by affiliate program. Next we do this just for the set of undifferentiated storefronts marked as “other”, in the hope that we might identify new affiliate programs within this set.

Clustering of Storefronts by Affiliate Program

A preliminary clustering of online storefronts can be useful as a first step before the manual labeling of their affiliate programs. This was the approach taken in [44], and it is the operational setting that we emulate here.

We ran the k-means algorithm, one of the simplest unsupervised methods for clustering, on all the Web pages tagged as storefronts for pharmaceuticals, replica luxury goods, and software. These pages included those in the “other” class not identified with a known affiliate program. In general, the cluster labels are highly predictive of the affiliate program identities: of the 42 clusters, 22 are entirely composed of storefronts from a single affiliate program, and 32 very nearly have this property, with less than 1% of their storefronts from a competing program.

A quantitative measure of this correlation is given by the uncertainty coefficient [70], which measures the reduction of uncertainty in a Web page’s class label c (e.g., affiliate program) given its cluster label d. Let C and D denote random variables for these labels. The uncertainty coefficient is computed as:

$$U(C \mid D) \;=\; 1 - \frac{H(C \mid D)}{H(C)} \;=\; 1 - \frac{\sum_{c,d} p(c,d)\,\log p(c \mid d)}{\sum_{c} p(c)\,\log p(c)},$$

where H(C) is the entropy of C and H(C|D) is the conditional entropy of C given D. The joint probability p(c, d) is obtained by counting the number of Web pages with class label c and cluster label d, then dividing by the total number of Web pages. Note that the uncertainty coefficient is bounded between 0 and 1, where 0 indicates that the two variables are completely uncorrelated, and 1 indicates that one variable is completely determined by the other.

From the results of k-means clustering we obtain an uncertainty coefficient of 0.933. This coefficient shows that very few clusters contain Web pages from more than one class. Another measure of cluster “purity” is obtained by computing the overall percentage of storefronts assigned to a cluster that contains more storefronts from a different affiliate program. This percentage is 3.27%, quite a low confusion rate for a problem in 45-way classification.

It must be emphasized, however, that while k-means clustering generally separates Web pages from different affiliate programs, it does so by modeling the larger affiliate programs at the expense of the smaller ones.
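For reference, the uncertainty coefficient defined above can be computed directly from paired class and cluster labels; the following is a minimal sketch using natural logarithms, with a toy example appended.

```python
# Sketch: compute the uncertainty coefficient U(C|D) from paired class labels
# (affiliate programs) and cluster labels, following the formula above.
import math
from collections import Counter

def uncertainty_coefficient(classes, clusters):
    n = len(classes)
    class_counts = Counter(classes)                 # for p(c)
    cluster_counts = Counter(clusters)              # for p(d)
    joint_counts = Counter(zip(classes, clusters))  # for p(c, d)
    h_c = -sum((cnt / n) * math.log(cnt / n) for cnt in class_counts.values())
    h_c_given_d = -sum((cnt / n) * math.log(cnt / cluster_counts[d])
                       for (c, d), cnt in joint_counts.items())
    return 1.0 - h_c_given_d / h_c

# Toy example: two perfectly pure clusters give a coefficient of 1.0.
print(uncertainty_coefficient(["A", "A", "B", "B"], [0, 0, 1, 1]))
```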

Figure 3.5. Number of affiliate programs of different sizes with few versus many clustering errors; see text for details. Top panel: programs with error rate < 15%; bottom panel: programs with error rate > 95% (each plotted as number of programs versus number of storefronts). In general the larger programs have low error rates (top), while the smaller programs have very high error rates (bottom).

In particular, only 11 out of 44 affiliate programs are represented as the dominant class of one or more clusters; the other 33 programs are swallowed up by the larger ones. We say a clustering “error” occurs when a storefront is mapped to a cluster that contains more storefronts from one or more different affiliate programs. The top panel of Figure 3.5 shows that all the largest affiliate programs (with over 2000 storefronts) have error rates less than 15%; conversely, the bottom panel shows that all the smallest affiliate programs (with fewer than 100 storefronts) have error rates greater than 95%.

We conclude that unsupervised methods have high accuracy but poor coverage of affiliate programs. On one hand, by distinguishing storefronts in the largest affiliate programs (which account for over 96% of the Web pages), these methods may greatly reduce the set of pages that require manual labeling. On the other hand, for complete coverage of affiliate programs, it seems imperative to label at least one storefront per program. Once these labels are provided, moreover, we believe that the best approach will be a supervised model, such as NN classification, that exploits the information in these labels.

Detecting New and Evolving Programs

Lastly we investigate the potential of clustering methods to detect new and evolving affiliate programs. In our first experiment, we used the k-means algorithm to cluster the storefronts labeled as “other” from the first three-month crawl. We then ranked the storefronts in each cluster by their distance to the cluster centroid and manually examined the top-ranked (i.e., most “central”) ones with our Web visualization tool. Using this approach, we discovered a new affiliate program called RxCashCow; the 430 top-ranked storefronts in one cluster belonged to this program, and only a small fraction of the top-ranked 1,100 did not. Finally we revisit the interesting case of ED Express. Recall that NN classification failed to identify these storefronts during the second three-month crawl because the affiliate program switched to a new template engine. Our last experiment used the k-means algorithm to cluster all the storefronts from the second three-month crawl. Our hope was that the algorithm might separate out the evolved storefronts of ED Express, and indeed we found a cluster whose 3,580 top-ranked pages were exactly those generated by the new template engine.
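The cluster-then-inspect procedure used here can be sketched as follows: run k-means on the unlabeled "other" pages and rank each cluster's members by distance to the centroid, so the most central pages are examined first. The feature matrix `X_other` (assumed dense) and the choice of k are illustrative assumptions, not the values used in the study.

```python
# Sketch: cluster the "other" storefronts with k-means and rank each cluster's
# members by distance to its centroid, so the most "central" pages can be
# inspected first for signs of a new affiliate program.
import numpy as np
from sklearn.cluster import KMeans

k = 40                                              # illustrative choice of k
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_other)
dists = np.linalg.norm(X_other - km.cluster_centers_[km.labels_], axis=1)

for c in range(k):
    members = np.where(km.labels_ == c)[0]
    central = members[np.argsort(dists[members])][:20]   # most central pages first
    print(f"cluster {c}: inspect pages {central.tolist()}")
```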

3.5 Conclusion

We have described an automated approach for profiling the online storefronts of counterfeit merchandise. We evaluated our approach on hundreds of thousands of storefronts gathered from a Web crawl of spam-advertised URLs. Our methods aim to identify the affiliate programs behind these storefronts, an important first step in tracking and targeting illicit e-commerce.

Our first experiments showed that these affiliate programs could be identified, with high accuracy, from simple NN classification on bag-of-words features. With an operational setting in mind, we also showed that the feature extraction and NN classification only required a small seed of labeled storefronts to be highly effective. Our final experiments investigated the potential of unsupervised methods for clustering storefronts by affiliate program. Here we found that k-means clustering could be used to discover affiliate programs with large spam footprints, but that smaller affiliate programs (i.e., those with fewer storefronts) were not cleanly identified. Overall, our work is an encouraging case study in the application of machine learning to an important class of security problems on the Web.

3.6 Acknowledgements

Chapter 3, in part, is a reprint of the material as it appears in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2014. Der, Matthew F.; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was the primary investigator and author of this paper.

Chapter 4

Counterfeit Luxury SEO

The next security application we address is highly reminiscent of the one presented in the previous chapter. Here we investigate another segment of illegal e-commerce: the counterfeit luxury market. This market and the market for spam-advertised goods exhibit the same state of affairs: behind the myriad online storefronts lurk just dozens of organizations who run them. Thus, we want to solve an identical classification problem regarding common origin: to classify storefront Web sites according to their underlying business. In this chapter, we again show how machine learning methods help a security practitioner tackle this classification problem much more efficiently than he could with manual effort alone.

We concluded our analysis in Chapter 3 by advising that, if our system were to be deployed at the outset of such an endeavor, a bootstrapped version would be most effective. For this application, we actually do implement a bootstrapped system from the start. To wit, an initial phase of supervision produces a ground-truth sample; then, we alternate between training a classifier to make predictions and having an expert validate those predictions.

While the problem settings in this chapter and the last are quite similar, one key difference in the cybercriminal activity is the choice of advertising vector. Instead of email, these counterfeiters exploit Web search to draw visitors to their stores.

The pillar of email spam is its high volume; in contrast, bulk Web spam is not a guaranteed way of reaching users. This deficiency is due to the dependence of search traffic on search ranking. Hence, miscreants organize into black hat SEO campaigns, which manipulate search results to improve the visibility of their storefronts in search result pages.

This chapter presents a large-scale empirical characterization of SEO abuse in the counterfeit luxury market. We provide a holistic view of this ecosystem by analyzing eight months of crawled data for sixteen luxury brands. In particular, we identify and monitor 52 distinct SEO campaigns, measure their presence in search results, estimate their order volumes over time, and examine how their operation is affected by existing interventions. Our study shows that counterfeit luxury is a bustling business, and that countermeasures could improve in both coverage and responsiveness.

This chapter is an adaptation of our work published in Wang et al. [84]. We focus on the classification of storefront Web pages to SEO campaigns, and summarize the rest of the material.

4.1 Introduction

The World Wide Web is a massive digital platform with hundreds of millions of Web sites. Web search is the mechanism that makes the Web organized and usable, allowing users to quickly access the information they seek. The utility of a search engine is particularly evident for e-commerce, where it serves as a medium for connecting buyers and sellers. Customers rely on search engines for finding goods and services, while conversely, merchants rely on them for bringing user traffic. The amount of search traffic a Web site accrues is dictated mainly by its search ranking.

Highly ranked search results — a search engine’s “best answers” to a query — naturally receive the lion’s share of clicks. Unsurprisingly, then, business owners care a great deal about their site’s ranking, and often invest in ways to improve it. Thus was born the arm of online marketing called search engine optimization (SEO), which refers to methods for impacting the visibility of a Web site in a search engine’s organic results. SEO practices come in two varieties: those that obey a search engine’s policies (white hat SEO), and those that do not (black hat SEO). Cybercriminals regularly exercise the latter; for reasons of evasion and realization of quick financial return, they implement aggressive and abusive techniques that deceive both search engines and users.

Alas, this is where miscreants join forces to launch a more profitable venture. In particular, counterfeit luxury stores form a strategic partnership with black hat SEO campaigns to promote their online stores in search results and attract unwitting customers. This class of e-crime is harmful to search engines, brands, and users: it undermines the quality and integrity of a search engine’s results; it damages a brand’s image and steals a cut of its sales; and it degrades user experience, defrauds some users, and compromises the Web presence of domain owners whose sites get hacked.

Search engine providers and brand holders are the two groups with the most incentive, and the greatest means, to intervene. Search engines play a continual and increasingly complex game of cat-and-mouse with Web spammers to detect and defend against search poisoning. When they successfully identify Web spam, they can demote the offending site, or even completely remove it, from search result listings. Also, they can warn users by labeling compromised Web sites as “hacked.”

Brand holders lack the advantageous technical position that search engines have, but they do have the legal position to seize domain names of counterfeit storefronts. In this chapter, we characterize the dynamic between attacker and defender at scale, evaluating the prosperity of this illegal business as well as the efficacy of these standard “search and seizure” interventions. In particular, we first explain how a search poisoning scam works. Second, we summarize our technique for discerning poisoned search results, then detail our classification approach for linking storefront Web pages to the SEO campaigns that advertise them. Finally, we highlight our main findings on the prevalence and profitability of SEO abuse on luxury brands, and the effectiveness of countermeasures striving to disrupt it. In the end, our empirical analysis offers a holistic view of the ecosystem around luxury SEO abuse, and illuminates ways to improve current interventions.

4.2 Background

In this section, we review the foundations of this ecosystem — search result poisoning, the counterfeit luxury market, and defensive interventions.

4.2.1 Search Result Poisoning

As previously discussed, a Web site’s traffic volume heavily depends on its search ranking. Rankings are controlled by a search engine’s proprietary algorithm (e.g., Google’s PageRank), which weighs a plethora of features to assess the site’s importance. SEO, then, attempts to design a Web site so that the ranking algorithm will score it favorably.

Search engines establish guidelines and recommend techniques they consider good design, like using a clear hierarchy, sitemap, friendly URLs, fast load times, descriptive and accurate title elements and alt attributes, and so forth. These compliant techniques are called white hat SEO. On the other hand, miscreants try to exploit the ranking algorithm, resorting to techniques that elevate their Web site’s status but in ways that defy the search engine’s policies. These perverse techniques are called black hat SEO. Whereas white hat SEO is geared toward user-friendliness, black hat SEO explicitly targets the search engine and neglects the human audience. Examples, many of which deceive both search engines and users, include hidden text, keyword stuffing, unrelated keywords, spinning, link farms, doorway pages, cloaking, and so on.

Still, even with black hat tactics, it may take a while for a Web page to become popular. Therefore, another gambit that attackers perform is to hack and assume control of other Web sites, and leverage the positive reputation they have earned legitimately over time. Recall that the main goal for cybercriminals is to maximize traffic to their scams. When Web search is the advertising vector, this equates to maximizing visibility in search result pages. To that end, what matters is not only where, but also how often, their Web spam appears in search results. Thus, attackers try to hack and accumulate a good many compromised Web sites to proliferate their scams. An added benefit is that they can create a link farm, where all compromised Web sites cross-reference one another. The effect is an upsurge in page rank, as the number of backlinks — incoming links from other Web pages — a page has typically counts as an important feature for ranking algorithms. These factors are reasons why attackers organize into campaigns: to coordinate their efforts and strengthen their business venture.

Two black hat SEO mechanisms at the heart of search result poisoning are doorway pages and cloaking. Campaign operators use their stock of compromised Web sites as doorway pages — Web pages designed to rank well in search, but only serve to instantly redirect1 an unaware visitor elsewhere. In our case, the intended landing page is a counterfeit storefront. To achieve high rank, doorway pages use cloaking — a deceptive technique for deliberately serving different content to different types of users. Concretely, a doorway differentiates its content delivery as follows. To a search engine crawler, a doorway provides an SEO-ed Web page crafted to receive exceptional rank for targeted queries (i.e., queries about bargain shopping for luxury brands). The search engine is hence deceived into indexing this page and subsequently returning it in early search results. To users who issue targeted queries (e.g., “cheap louis vuitton”) and click through linked search results — the attackers’ coveted traffic — a doorway fashions a redirect to (or an iframe containing) a counterfeit storefront. To the site owner, and visitors who arrive at the Web page by other means, a doorway returns the original benign content before the site was hacked.
In this way, cloaking is crucial for evasion, first and foremost because the site owner is unable to tell that his site has been compromised. In general, the attackers wish to avoid possible suspicion, particularly from brand holders who may want to shut the sites down.

Figure 4.1 depicts a typical search poisoning attack, which dupes users into visiting scams like counterfeit storefronts. First, the attacker exploits Web sites with an open vulnerability (e.g., via cross-site scripting) and installs an SEO kit on the compromised hosts (1). An SEO kit is malware, typically an obfuscated code injection, which grants the attacker access to the site and implements black hat SEO. The attacker co-opts these compromised sites into an SEO botnet.

1Alternatively, instead of redirection, attackers place an iframe on top of the doorway page, whose content is the intended landing page and whose size is meant to occupy basically the entire screen.

Figure 4.1. An illustration of search result poisoning by an SEO botnet.

A botnet is a network of infected machines that is remotely controlled by the attacker (or “botmaster”), unbeknownst to the rightful owners. The botnet consists of three kinds of hosts: a directory server, a command and control (C&C) server, and the compromised Web sites (or “bots”). The directory server is the bots’ first point of contact, whose only duty is to relay the location of the C&C server. The C&C server is a centralized data store from which the bots pull down SEO content and URLs of scams. This setup enables the botmaster to distribute data to the entire botnet with just a single update to the C&C. Finally, the compromised Web sites serve as doorways which, under the control of the SEO kit, perform cloaking to decide what content to deliver to what type of visitor.

Several variations of cloaking exist, but they all share the same essence: the Web server hosting the doorway page must be able to decipher who is making the HTTP request.

Referer field in the HTTP header, which specifies the URL of the previous Web page that led the user to the currently requested page. If the doorway verifies that the previous address indicates a search results page, the SEO kit asks the C&C server for a URL of a scam, along with JavaScript code that will either redirect the user there, or load its content into an overlaid iframe ( 3 ). If both of these checks fail, then the doorway returns the original Web page that existed before the site was compromised. This way, users who visit the site directly, including the site owner, do not notice that the site has been hacked. Figure 4.2 shows a real instance of a search poisoning attack, involving a user searching for Louis Vuitton merchandise. The top result is poisoned, which a shrewd user may suspect because the domain name (anonymized in the figure) is unrelated to Louis Vuitton. The Web site is a compromised doorway page which clearly has been SEO-ed successfully, as Google assigned it the single-highest rank for this query. But when the user clicks the top result, the doorway correctly distinguishes the traffic and steers the user to a counterfeit storefront hosted at a different domain. Interestingly, entering the anonymized domain in the address bar leads to the original benign Web page (shown in the bottom row of Figure 4.2), so 76

Figure 4.2. Example of a poisoned search result. Top: User queries for a luxury brand and clicks the top result, which is poisoned. The result links to a doorway Web site, which redirects the user to a different domain hosting a counterfeit storefront. Bottom: When the user visits the anonymized domain directly, not via search, the doorway returns the original Web site before it was compromised. the doorway correctly deduced that this visit was not search traffic.

4.2.2 Counterfeit Luxury Market

A counterfeit luxury store is a merchant that commits fraud against a reputable, high-end brand: the store poses as the authentic brand itself, but actually sells low-quality knockoffs of their merchandise. Figure 4.3 depicts four examples from the wild, which are designed to appear legitimate to deceive customers. These stores typically trick bargain shoppers with huge (albeit fake) discount deals, but they also attract some knowing customers who, much like the store itself, just want to feign upscale status.

Figure 4.3. Examples of counterfeit luxury storefronts forging four brands (in row order): Louis Vuitton, Ugg, Moncler, and Beats By Dre.

Figure 4.4 shows a Gucci product page that entices consumers with massive savings, offering sales around 90% off. Bear in mind, though, that these businesses employ shoddy manufacturing; the knockoff price may be an order of magnitude less than retail, but the cost to make it is another order of magnitude less. Thus, the margins are still quite favorable. High margins coupled with the high demand for trendy lifestyle fashion goods assure that this scam is profitable. Through discovering one supplier’s Web site and mining their shipment data, we found that this single supplier delivered over 250K orders in nine months — evidence that business is strong indeed.

Our investigation surmises that the way this business is organized is different than the affiliate marketing structure behind many other online scams, including the spam-advertised counterfeits we examined in Chapter 3.

Figure 4.4. Counterfeits of Gucci products offered at false discounts. Curiously, every product’s retail price is the same.

The core of that model is the affiliate program, which basically runs the business and hires a diversity of affiliates — independent contractors paid on commission — to advertise and bring traffic to their storefronts. One major difference is that while an affiliate program specializes in one product category (e.g., pharmaceuticals, replicas, or software), the same SEO campaign may market a range of both products and brands. In addition, campaigns appeal to international markets with geotargeted stores intended for particular countries. For example, the counterfeit Gucci store in Figure 4.4 targets the United Kingdom. Two other differences in business models are payment processing and fulfillment, which are centralized within the affiliate program. Here, though, SEO campaigns contract with third parties. Finally, evidence suggests that this counterfeit luxury market is based in Asia rather than Eastern Europe, the mainspring of many other online scams. Signals we observed include Asian-language comments in SEO kit source code, Asian payment processors, Asian suppliers, and a first-hand account from an Asian programmer who works in this business.

4.2.3 Interventions

As stated earlier, the two parties most incentivized and capable of disrupting this scam are search engine operators and brand holders. A search engine strives to uphold the merit of its results, for both users and companies paying for valuable ad space. To defend against Web spam, the search engine can demote or even remove offending Web sites from search result listings. Penalizing doorway pages in this manner undermines an SEO campaign’s advertising efforts and reduces traffic to their counterfeit luxury storefronts.

Also, Google began labeling these sites to warn and protect users, who may lose trust in the search engine if they get victimized. In 2008 Google flagged sites leading to malware or a phishing attack, adding the line “This site may harm your computer” to the search result snippet. To caution users further, if the user still clicked the suspicious result, Google inserted an interstitial page which asked the user if he truly wanted to visit the malicious Web site. Then in 2010, Google tagged the snippet of compromised Web sites with the text “This site may be hacked,” but did not insert an interstitial page. Another shortcoming is that only the root page of the Web site (e.g., http://example.com) is labeled as hacked, so non-root pages (e.g., http://example.com/non-root) retain their innocuous appearance. Unfortunately, oftentimes the non-root pages are the ones that get compromised and become doorways. Still, these warning labels further reduce traffic to storefronts if savvy users avoid the dubious Web sites.

Of course, taking any defensive action against search abuse first requires detecting it, which is a constantly evolving technical challenge. Search engines and spammers engage in the traditional move-countermove game of computer security. In fact, our study of Web spam revealed a case of an adaptive adversary: we discovered a new cloaking technique that eludes standard detection systems. In what we call iframe cloaking, the doorway does not redirect the user to a storefront, but instead loads the storefront in a full-sized iframe that covers the doorway page. Unlike in redirect cloaking, this same doorway page is delivered to search engine crawlers and users visiting via search; the Web page is SEO-ed to rank well, but also contains obfuscated JavaScript for spawning the iframe. This method relies on the fact that fully rendering a Web page is too expensive for crawlers, which must index the entire Web. An advantage for attackers is that this form of cloaking executes strictly on the client, requiring no server-side logic for examining network features, such as IP addresses or HTTP request headers.

Aware of this ongoing battle, we seek to evaluate a search engine’s interventions and learn how they could be reinforced. Search engine operators are crucial to Web spam defense, as they have a unique technical position to adjust search rankings and analyze the Web at scale. On the other hand, brand holders, who want to preserve their image and avoid losing sales to cybercriminals, lack this technical vantage point. Instead, they must resort to hunting counterfeit storefronts in the wild. As brand and trademark holders, they have the legal power to seize domain names that host criminal storefronts. They can shut these Web sites down, but usually they display a seizure notification page instead. We have also seen redirections to a page on the exploited brand’s official Web site warning users about counterfeits.
Seizing domain names of doorways is less frequent since these domains are generally owned by innocent third parties. Sometimes brands take these actions themselves, but more commonly they hire lawyers or companies that offer brand protection (e.g., MarkMonitor [54], OpSec Security [66], and Safenames [74]). However, this defensive approach inexorably becomes a “whack-a-mole” game, where contravening domain names are seized, but new ones pop up to keep counterfeit storefronts alive elsewhere. Exacerbating this challenge are critical asymmetries that work in the adversary’s favor. In particular, the cost of the legal process for seizing a domain dwarfs the cost of buying a new one. Secondly, this process could take days to weeks2, whereas a new domain name can be activated and SEO-ed in just 24 hours [85]. We elaborate on our analysis of defensive interventions in Section 4.5.

4.3 Data Collection

Our study is based upon extensive crawls of Google search results to detect poisoned results that lead to counterfeit luxury storefronts. We monitor sixteen luxury verticals, which are brands (e.g., Nike) or product categories (e.g., watches), listed in Table 4.6. These verticals comprise a representative and heavily victimized subset of the entire counterfeit luxury market. In this section, we explain how we manufacture appropriate search queries, how we crawl the search results and detect which ones are poisoned, and how we filter down the landing pages to storefronts only.

2As court documents show, the standard practice of seizing hundreds to thousands of domain names all at once — most likely to amortize the cost — only prolongs the process.

4.3.1 Generating Search Queries

Our fundamental goal is to measure the prevalence of the counterfeit luxury scam. We are most likely to find the scam in search results returned to user searches that are aggressively targeted by SEO campaigns. Thus, we must strategically craft search queries for which a campaign’s SEO-ed doorways would rank highly. Ultimately, we generate a fixed set of 100 representative search queries for each luxury vertical.

Our approach was largely guided by a manual investigation of the key campaign in September 2013. This campaign is run by a big SEO botnet and targets thirteen verticals.

We manually queried to find ten key doorways for each vertical, then used site search (e.g., “site:doorway.com”) to discover additional results emanating from the same doorway. Next, we extracted keywords from the URL — e.g., from http://doorway.com/?key=cheap+ralph+lauren, we extract “cheap ralph lauren” — to compose a large set of search terms.

From this set, we select a random sample of 100 as our final set of search queries. To broaden our coverage of the counterfeit luxury market, we include three brands not targeted by the key campaign: Ed Hardy, Louis Vuitton, and Uggs.

We take a different approach to formulate search queries for these verticals. In particular, we seed a query with the brand name, then use Google Suggest to autocomplete the search with keywords — e.g., from “Louis Vuitton,” Google Suggest recommends “Louis Vuitton wallet.” Also, we do this recursively to collect suggestions for the original suggestions. We fetch yet more suggestions for the brand name with a keyword prepended (e.g., cheap, new, online) and/or appended (e.g., store, outlet, sale).

To check for bias in these two approaches, we implemented the second approach for the ten brands targeted by the key campaign. Interestingly, there was minimal overlap in the set of search queries, but in crawling the results, there was significant overlap in the poisoned search results and the responsible SEO campaigns. We note that any searching strategy is unavoidably biased because it returns only a segment of the whole search index, but the high number of poisoned search results we end up detecting empirically validates the representativeness of our approach.

4.3.2 Crawling & Cloaking

For each of the sixteen luxury verticals, we Google the 100 search queries and crawl the top 100 results on a daily basis for eight months, from November 13, 2013 through July 15, 2014. We limit our crawls to Google because among search engines, it experiences the highest amount of cloaking [86]. In addition, Google is the most popular search engine in the U.S. and many European countries, where, according to our shipping data, most counterfeit orders are shipped.

We deem that a search result is poisoned if the associated Web site is cloaked — i.e., the site delivers different content to a search engine crawler and a user who clicks through a search result. The cloaked site is likely a doorway page under the control of an SEO campaign. To detect cloaking, we extend the system implemented by Wang et al. [86] to follow redirects and fully render certain Web pages; the latter capability allows us to identify the new variation of cloaking we found, iframe cloaking.

In visiting a Web site, our crawler imitates both a user on a browser and a search engine crawler by modifying the User-Agent field in the HTTP request header.

Then, it compares the content delivered to each version, computing a delta score and classifying the site as cloaked if the score meets a particular threshold.

The idea is that Web content that is “different enough” is a strong indicator of cloaking.
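A minimal sketch of this kind of cloaking check appears below: fetch the same URL with a browser User-Agent and with a crawler User-Agent, then compare the two responses. The token-set Jaccard distance used as the delta score is an illustrative stand-in for the actual comparison in the deployed crawler, and the User-Agent strings and threshold are assumptions.

```python
# Sketch of a basic cloaking check: fetch a URL twice, once as a browser and
# once claiming to be Googlebot, and flag it when the two versions differ
# beyond a threshold. The delta score here is a simple token-set Jaccard
# distance, a stand-in for the system's actual content comparison.
import requests

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"

def fetch_tokens(url, user_agent):
    resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=15)
    return set(resp.text.split())

def looks_cloaked(url, threshold=0.7):
    as_user = fetch_tokens(url, BROWSER_UA)
    as_crawler = fetch_tokens(url, CRAWLER_UA)
    union = as_user | as_crawler
    if not union:
        return False
    delta = 1.0 - len(as_user & as_crawler) / len(union)   # 0 = identical, 1 = disjoint
    return delta >= threshold
```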

4.3.3 Detecting Storefronts

For the search results our system considers poisoned, we crawl the landing pages they return to user search traffic. Among this set of landing pages, we want to find the counterfeit luxury storefronts. We implement two heuristics to accomplish this. First, we examine the site’s cookies, looking for ones from payment processors (e.g., Realypay, Mallpayment), e-commerce platforms (e.g., Zen Cart, Magento), and Web analytics providers (e.g., Ajstat, CNZZ), which are widely used by counterfeiters. Second, we search for two keywords, “cart” and “checkout” — clear signals that the Web page is a storefront. If either check passes, then we decide that the Web page is indeed a storefront. To assess the accuracy of these heuristics, we manually inspected a random sample of landing pages from 1,800 doorways, among which we detected 532 storefronts. Encouragingly, we found no false positives and only 21 (1.2%) false negatives (storefronts that our heuristics missed).
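The two heuristics can be sketched as a simple predicate; the cookie indicator substrings below are illustrative guesses at how the named providers might appear in cookie data, not an exact reproduction of the deployed checks.

```python
# Sketch of the two storefront heuristics: counterfeit-associated infrastructure
# in the site's cookies, or "cart"/"checkout" keywords in the page itself.
# The indicator substrings are illustrative placeholders, not exhaustive.
COOKIE_INDICATORS = ("realypay", "mallpayment", "zen cart", "magento", "ajstat", "cnzz")
KEYWORDS = ("cart", "checkout")

def is_storefront(html, cookies):
    cookie_blob = " ".join(cookies).lower()
    if any(indicator in cookie_blob for indicator in COOKIE_INDICATORS):
        return True                                  # first heuristic: cookie check
    text = html.lower()
    return any(keyword in text for keyword in KEYWORDS)   # second heuristic: keywords
```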

4.3.4 Complete Data Set

We repeated the methodology described in this section on a daily basis for eight months, from November 13, 2013 through July 15, 2014. In total, among all the search results we crawled, we detected 2.7M poisoned search results (PSRs). These PSRs linked to a much smaller set of 27K unique domains that serve as doorway Web sites. These doorways, then, led to an even smaller set of 7,404 storefront Web sites peddling counterfeit luxury merchandise.

This data set consists of unlabeled Web pages crawled from the wild. Thus, with no ground truth available, our analysis of this ecosystem must start from scratch. In the next section, we explain how we use a bootstrapped system, involving a combination of domain expertise and machine learning tools, to classify portions of this data.

4.4 Approach

Our targeted crawls of Google search results produce a large collection of doorway pages and counterfeit storefronts. We know that behind these thousands of doorways and storefronts lurk a much smaller number of distinct SEO campaigns, and the goal of our work is to understand the full ecosystem of campaigns operating in this counterfeit luxury market rather than focus on a singular campaign, e.g., the key campaign. A brute-force approach to this understanding would require a domain expert to examine each Web page in our collection and use domain-specific heuristics to infer the SEO campaign behind it. The manual labeling of Web pages, however, is a time-consuming and laborious endeavor that does not scale well to the many thousands of examples in our collection. Instead, we take a statistical approach, and the rest of this section describes an automated, data-driven method to identify the SEO campaigns behind individual doorway and storefront Web pages.

4.4.1 Supervised Learning

To build a statistical model, we need a data set of labeled examples. At the outset, the only option is for a domain expert to create one. Though manual investigation is tedious, the exercise is mandatory for providing ground truth data and developing an early understanding of the ecosystem. In the initial phase of supervision, we identified the prominent key campaign, among several others.

Overall, we were able to connect 491 storefront Web pages (and their preceding doorway pages) to 28 distinct SEO campaigns. Now, we suspend manual labeling and automate the process by learning a classifier from the labeled data. In particular, we train a classifier to make predictions on unlabeled examples, then manually validate a subset of the predictions to expand the set of labeled examples. Validating predicted labels is much easier for the supervisor than furnishing labels by hand.

Our classifier makes its predictions by extracting textual HTML features from both the doorway and storefront Web pages, then analyzing the statistics of these features that distinguish Web pages from different campaigns. We expect HTML-based features to be predictive in this domain for two reasons: first, because SEO campaigns use highly specialized strategies to manipulate the search rankings of doorways [85], and second, because campaigns often develop in-house templates for the large-scale deployment of online storefronts (e.g., customized templates for Zen Cart or Magento providing a certain look and feel). However, while we suspect that doorway and storefront Web pages contain predictive signatures of the SEO campaigns behind them, we wish to avoid having to discover these signatures manually. Recall from Chapter 3 that we shed the onus of meticulous feature engineering (i.e., devising regular expressions), and instead employed a fully automated bag-of-words approach. We take that same approach here, capturing Web page content and structure with HTML tag-attribute-value triplets. (For more details, refer back to Sections 2.4 and 3.3.)

One might also expect to find predictive signatures of SEO campaigns in network-based features (e.g., IPv4 address blocks, ASes). However, we found that such features were ill-suited to differentiate SEO campaigns due to the growing popularity of shared hosting and reverse proxying infrastructure (e.g., CloudFlare).

Therefore, after a brief period of experimentation, we did not pursue the use of such features.

We now make an important distinction between these counterfeit luxury storefronts and the spam-advertised storefronts from Chapter 3. Affiliate programs auto-generate highly replicated storefront Web pages, many of which are near-duplicates; conversely, SEO campaigns produce more loosely templated pages that exhibit more variation. This difference is in large part due to an SEO campaign’s dealings in multiple product categories, whereas affiliate programs specialize in just one3. Hence, the Web page content of a fake Abercrombie store is necessarily different than the content of a fraudulent Tiffany store. Furthermore, unlike the spam-advertised storefronts, these counterfeit storefronts are also accompanied by a doorway Web page which led to the storefront. These doorways are often Web sites that the SEO campaign compromises and then controls. Thus, an additional challenge is that we cannot anticipate whether the more telling commonalities reside in the storefronts or in the doorways.

However, we still expect an SEO campaign’s doorway and storefront Web pages to share at least some clues in their HTML source, which are artifacts of the campaign’s automated toolkit. Campaigns that conduct their illegal business at scale cannot afford to construct doorways and storefronts for one-time use only. So perhaps out of the many thousands of HTML features we extract, most are irrelevant while just a handful are relevant. This setting is less suitable for judging similarity with Euclidean distance, a metric that weighs all features equally. Therefore, a nearest neighbor classifier based on this distance would be a less reliable predictor of affiliation than it was in Chapter 3.

3The aggregate program ZedCash, which sells both pharmaceuticals and replicas, is the sole exception we observed.
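As a reminder of what these features look like, the sketch below extracts tag-attribute-value triplets from raw HTML using only the Python standard library; it is a simplified reading of the representation described in Sections 2.4 and 3.3 rather than the exact feature extractor.

```python
# Sketch of the tag-attribute-value "bag of words": each HTML start tag
# contributes tokens such as "a.href=/checkout" that capture both page
# structure and content.
from collections import Counter
from html.parser import HTMLParser

class TripletExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1                                 # bare tag token
        for name, value in attrs:
            self.counts[f"{tag}.{name}={value or ''}"] += 1   # tag-attribute-value triplet

def html_features(html):
    parser = TripletExtractor()
    parser.feed(html)
    return parser.counts

print(html_features('<div class="product"><a href="/checkout">Buy</a></div>'))
```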

Instead, we decide to learn linear models of classification from our data set of labeled examples. Specifically, we use the LIBLINEAR package [21] to learn L1-regularized models of logistic regression (LR). We briefly describe some technical details on how this model works.

Suppose we have a binary-labeled data set consisting of $N$ training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^D$ and $y_i \in \{-1, +1\}$. LIBLINEAR's approach to L1-regularized LR is to estimate model parameters $\mathbf{w} \in \mathbb{R}^D$ and $b \in \mathbb{R}$ by solving

$$\min_{\mathbf{w},\, b} \; \|\mathbf{w}\|_1 + L(\mathbf{w}, b), \qquad (4.1)$$

where the regularization term $\|\mathbf{w}\|_1 = \sum_{j=1}^{D} |w_j|$ is the L1-norm of $\mathbf{w}$, and the loss term $L(\mathbf{w}, b)$ measures the cost of training errors as:

$$L(\mathbf{w}, b) = C \sum_{i=1}^{N} \log\left(1 + e^{-z_i}\right), \qquad (4.2)$$

where $z_i = y_i(\mathbf{w}^{\top} \mathbf{x}_i + b)$ is a linear score, and $C$ is a tunable cost parameter. The loss term adds a penalty whenever the model incorrectly labels a training example. The L1-regularization term penalizes elements of the weight vector $\mathbf{w}$ with large magnitudes. This regularizer helps mitigate overfitting the training set, and also favors sparse solutions where most weights are exactly zero. This property is particularly fitting for our application, since storefront and doorway Web pages may have a large number of irrelevant features. Thus, the resulting linear models are highly interpretable: for each campaign, the regularization serves to identify the most strongly characteristic HTML features from the tens of thousands of extracted ones.

4To solve this optimization problem, LIBLINEAR now implements a highly optimized Newton-based method that works faster than its previous coordinate descent method [92]. The basic trick is to compute the loss function $L(\mathbf{w}, b)$, which involves expensive log and exp operations, less frequently.

The cost parameter $C$ balances the regularization and loss terms. In this way, $C$ trades off sparsity and accuracy; e.g., fixing a lower value of $C$ would learn a more sparse but potentially less accurate model of classification. We note that LR is inherently an algorithm for binary classification, but it can be extended to multiclass problems in standard ways. LIBLINEAR opts for the one-vs-rest strategy, where it learns a model for each class to differentiate that class (positive examples) from all others (negative examples).

Once trained, the LR classifier makes a prediction for an unlabeled example $\mathbf{x}$ by computing a linear score $z = \mathbf{w}^{\top}\mathbf{x} + b$. Then, the sigmoid function $\sigma(z) = [1 + e^{-z}]^{-1}$ converts this real-valued score to a probability of class membership. These predictions are highly interpretable as well: not only does the probability specify a likelihood of correctness, but also, decomposing the dot product $\mathbf{w}^{\top}\mathbf{x}$ reveals the relative contributions of individual features to a prediction. And because L1-regularized models are sparse, the predictions of SEO campaigns are derived from only a small fraction of meaningful HTML features.

Finally, we evaluated the predictive accuracy of the classifier by performing 10-fold cross-validation on the data set of 491 labeled examples. The average accuracy on held-out data was 86.8% for multi-way classification of Web pages into 28 different SEO campaigns. (Note that uniformly random predictions would have an accuracy of 1/28 = 3.6%.) Our model's high accuracy on held-out examples gave us confidence to make predictions on the remaining (unlabeled) Web pages that we collected from poisoned search results.
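A minimal version of this classifier can be sketched with scikit-learn's LIBLINEAR-backed logistic regression; `X`, `campaigns`, and `X_unlabeled` are placeholders for the labeled seed set and the remaining unlabeled pages, and the value of C shown is illustrative.

```python
# Sketch of the campaign classifier: one-vs-rest L1-regularized logistic
# regression with the LIBLINEAR solver, evaluated by 10-fold cross-validation.
# `X` (HTML features), `campaigns` (labels), and `X_unlabeled` are placeholders.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

base = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)  # C trades sparsity vs. accuracy
clf = OneVsRestClassifier(base)

scores = cross_val_score(clf, X, campaigns, cv=10)
print("10-fold cross-validation accuracy: %.3f" % scores.mean())

clf.fit(X, campaigns)
probs = clf.predict_proba(X_unlabeled)   # per-campaign probabilities, used to rank predictions
```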

4.4.2 Unsupervised Learning

The supervised classifiers of the previous section are extremely useful for procuring even more storefront and doorway Web pages that belong to known SEO campaigns. This capability is at once a benefit and a limitation: the classifiers are only trained to detect campaigns that we have already discovered. We know that still unidentified campaigns are at large, but we need a mechanism to help find them. Hence, we appeal to unsupervised learning techniques for this task. We have no labeled examples for unknown SEO campaigns, so we must instead search the unlabeled examples for candidates that likely belong to the same campaign. To guide our search, we avail ourselves of the hierarchical clustering algorithm with complete linkage [17]. The algorithm works as follows. Initially, every unlabeled example (or point) exists in its own cluster. Then, the two closest clusters are merged together iteratively until some terminal condition is reached. To measure how close two clusters X and Y are, complete linkage computes their pairwise distance as:

D(X, Y) = max_{x ∈ X, y ∈ Y} d(x, y),          (4.3)

where we use the Euclidean metric for the pairwise distance d(x, y) between two points. In other words, the distance between two clusters is equal to the distance between their two farthest members. We choose the complete link method over other linking methods for its ability to directly control cluster compactness. In particular, we enforce a terminal condition that halts clustering once no pair of clusters is closer than t distance away — i.e., once min_{X,Y} D(X, Y) > t. This condition guarantees that all resulting clusters have diameter no greater than t. Single linkage defines the opposite notion of cluster distance: the distance between two clusters is equal to the distance between their two nearest members. This method is prone to a chaining effect, where the distance between two points in the same cluster can grow to be quite large. We elect complete linkage over k-means clustering for other reasons. First, k-means enacts a global clustering of all data points, where every point necessarily gets pigeonholed into some cluster, regardless of proximity. By contrast, in complete linkage clustering, we explicitly decide when to stop merging clusters; then, we can ignore remaining singleton clusters and focus on the largest ones. Second, k-means requires predetermining the number of clusters. In reality, we have no idea how many unidentified campaigns there might be, but we want to find at least a few more. Complete linkage with a strict distance threshold t is most apt for this task; it produces a few small, compact clusters which likely represent new SEO campaigns. In contrast, single linkage produces clusters with potentially wide diameters, and k-means produces an arbitrary number of arbitrarily large clusters. We perform this clustering independently on both the unlabeled storefronts and the unlabeled doorways, again because we do not know which of the two might exhibit greater similarity across a campaign.
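As a concrete sketch of this clustering step, the code below uses SciPy's hierarchical clustering routines. The feature matrix X_unlabeled and the use of SciPy itself are assumptions for illustration; the value 0.1 anticipates the threshold reported later in Section 4.5.1.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# X_unlabeled: feature vectors of the unlabeled storefronts (or doorways); assumed given.
Z = linkage(X_unlabeled, method="complete", metric="euclidean")

# Cut the dendrogram so that no cluster has (complete-linkage) diameter greater than t.
t = 0.1
labels = fcluster(Z, t=t, criterion="distance")

# Focus on the largest clusters and ignore singletons, as described above.
sizes = np.bincount(labels)
candidates = [c for c in np.argsort(sizes)[::-1] if sizes[c] >= 2]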

4.4.3 Bootstrapping the System

We begin the classification task of mapping storefront and doorway Web pages to SEO campaigns with an initial phase of manual labeling, which produced 491 labeled examples for 28 distinct SEO campaigns. Then, we use logistic regression models, trained on this seed of labeled data, to infer the SEO campaigns behind the remaining unlabeled Web pages. To do so, we extract HTML features from the unlabeled Web pages and use the LR classifiers to predict the most likely campaign behind each example.


Figure 4.5. Flowchart depicting one round of classification: (1) train a classifier on the set of labeled examples; (2) make predictions on unlabeled examples, through classification of known campaigns and clustering of new campaigns; and (3) validate and add some of the predictions to the set of labeled examples.

To validate these predictions, we manually inspect subsets of unlabeled examples to determine whether they belong to their predicted campaign. This step can be done efficiently by first validating the top-ranked predictions for each SEO campaign (as reflected by the probabilities that the logistic regressions attach to each prediction). We briefly describe how we validated the classifier’s predictions on unlabeled Web pages. Primarily, we assume that distinct SEO campaigns are unlikely to share certain infrastructure such as SEO doorway pages and C&Cs, payment processing, and customer support. We also consider less robust indicators such as domain names, unique templates, WHOIS registrants, image hosting, and Web traffic analytics (e.g., 51.la, cnzz.com, statcounter.com). Also, along with making predictions for the 28 known campaigns, we use complete linkage clustering to group together similar Web pages that likely associate with the same distinct campaign. Validating that a cluster indeed makes up a new campaign requires locating unique indicators of ground truth like the ones mentioned above. A final stage is to refine the model: we use the manually verified predictions to expand the set of labeled Web pages, retrain the classifier on this expanded set, and repeat this process in rounds. With each iteration of this process we obtain a more accurate classifier and also one with greater coverage of distinct SEO campaigns. Though some manual labeling is unavoidable, this overall approach of repeated human-machine interaction is far more efficient than a brute-force expert analysis. Figure 4.5 illustrates our approach in a flowchart that decomposes a single round of classification into three steps: training, predicting, and validating. We iterate this procedure to obtain more labeled examples after every round.
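To make the round structure explicit, here is a schematic rendering of Figure 4.5 in code. It is a sketch of the control flow only: train, predict, cluster, and verify stand in for the logistic regression training, probability-ranked prediction, complete-linkage clustering, and manual expert validation described above, and the .prob and .page attributes are hypothetical.

def bootstrap_round(labeled, unlabeled, train, predict, cluster, verify):
    """One schematic round of Figure 4.5; all five callables are hypothetical placeholders."""
    clf = train(labeled)                                   # (1) train on the current labeled seed
    ranked = sorted(predict(clf, unlabeled),               # (2) rank predictions by probability ...
                    key=lambda p: p.prob, reverse=True)
    suggestions = ranked + cluster(unlabeled)              # ... and suggest clusters as new campaigns
    confirmed = verify(suggestions)                        # (3) manual validation by the domain expert
    newly_labeled = {c.page for c in confirmed}
    return labeled | confirmed, unlabeled - newly_labeled  # grow the seed, shrink the unlabeled pool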

4.5 Results

We implemented the bootstrapped classification system described in the previous section to map storefront and doorway Web pages into the distinct SEO campaigns that promote them. Section 4.5.1 documents the classification results obtained by alternating automatic prediction and manual validation. Section 4.5.2 examines what these results reveal about the broader ecosystem of counterfeit luxury SEO, and Section 4.5.3 reports the results of additional analysis, including orders and interventions.

4.5.1 Classification Results

Table 4.1 shows how classification proceeded in rounds. Ultimately, our system was used to expand the set of labeled data from 491 storefronts from 28 campaigns to 828 storefronts from 52 campaigns.

Table 4.1. Rounds of classification, in which automatic predictions are manually verified. At each round, we specify the total number of storefront Web pages, the number that we have labeled, and the number of associated SEO campaigns.

Round   # Stores   # Labeled   # Campaigns
0       4,432      491         28
1       4,432      497         28
2       4,432      557         35
3       5,690      570         36
4       7,484      828         52

Rounds 0 to 2 focused on a snapshot of the data set, but rounds 3 and 4 then added storefront Web pages that had since been crawled. Classification through round 3 involved coordination between a machine learning expert and a security expert. However, exchanging predictions and feedback incurred unnecessary overhead. To expedite the process, we packaged up the machine learning tools for the security expert to use himself. The benefit of this adjustment is evident in the jump in labeled storefronts and known campaigns from round 3 to round 4 (though the increase in total storefronts suggests that this stage was simply lengthier as well). The main impediment to achieving greater storefront coverage was the manual verification step, which was slow, conservative, and done by a single domain expert. Still, a human-machine system proved many times more efficient than a fully manual approach. We now take a look “under the hood” of our classification model. Beyond just outputting predictions, our tools provide more informative reports to guide the security expert and facilitate efficient review. First, after an LR classifier is trained but before any predictions are made, we profile each SEO campaign by parsing its weight vector.

Table 4.2. The ten most distinctive features of the msvalidate campaign, along with their corresponding weights. The first column indicates whether the feature was extracted from the storefront (s) or doorway (d).

s/d   Feature                                                                      Weight
s     img:src=includes/templates/uggsootsale/buttons/english/button buy now.gif    37.087
s     li:class=curselt li                                                          15.443
d     a:target= parent                                                             13.878
s     variable                                                                     11.874
s     div:class=get to cart                                                         8.498
d     div:class=listingProductImage                                                 6.845
s     img:width=160                                                                 6.033
d     speedy                                                                        5.486
s     speedy                                                                        5.466
s     img:title=BuyNowonsale                                                        5.443

Table 4.3. The five most likely candidates for the msvalidate campaign, all of which had probability nearly 1 and were verified as correct.

Store                              Campaign #1         Campaign #2
louisvuittonicon.com               msvalidate  1.000   biglove  0.004
cheapestlouisvuittonoutlets.com    msvalidate  1.000   biglove  0.005
royaltrolls.com                    msvalidate  1.000   biglove  0.005
alliedprojects.com                 msvalidate  1.000   jsus     0.102
redshedtackshop.com                msvalidate  1.000   biglove  0.004

Due to the L1-regularization in LR, only a small fraction of relevant features have non-zero weights; all other irrelevant features have zero weights and are effectively ignored. We rank the relevant features by weight, so the highest ranked features are the most predictive ones for a given campaign. As an example, Table 4.2 lists the top ten features of the msvalidate campaign from the model learned in round 3. Since our bag-of-words feature values are counts and not binary indicators, these features are not necessarily unique to msvalidate’s Web pages, but they may appear frequently in them. Second, when using the trained classifier to make predictions on unlabeled examples, we sort the predictions in descending order of probability. This way, the domain expert can quickly validate the most probable candidates first.

Table 4.4. Breakdown of the prediction that the louisvuittonicon.com storefront is affiliated with the msvalidate campaign. The first column indicates the feature source: storefront (s) or doorway (d).

s/d   Feature                                                                      Contribution
d     a:target= parent                                                             6.742 (40.54%)
s     img:src=includes/templates/uggsootsale/buttons/english/button buy now.gif    3.833 (23.05%)
s     a:target= parent                                                             1.873 (11.26%)
s     div:class=get to cart                                                        0.878  (5.28%)
s     vuitton                                                                      0.878  (5.28%)
s     img:width=160                                                                0.624  (3.75%)
s     img:title=BuyNowonsale                                                       0.563  (3.38%)
d     img:width=160                                                                0.405  (2.44%)
s     img:alt=BuyNowsale                                                           0.303  (1.82%)
d     vuitton                                                                      0.185  (1.11%)

Rather than printing a monolithic list, though, we found it more useful to separate predictions by SEO campaign. This format allows the expert to focus on one campaign at a time instead of repeatedly context switching between campaigns. Table 4.3 shows the top five candidates for the msvalidate campaign, along with their probability of affiliation, from round 3 of classification. Our tool also suggests their second-most probable campaign, but in this case, all five msvalidate predictions were verified as correct. The domain expert verified predictions using signals like the ones mentioned in Section 4.4.3. For an unlabeled example, we output the raw probabilities given by each campaign’s classifier rather than normalizing all the probabilities to sum to 1. The reason is that the example may not belong to any of the known campaigns; for such an example, this normalization would artificially inflate the predicted probabilities. Third, we dissect the classifier’s predictions to show why they were made.

Table 4.5. Sizes of storefront clusters and doorway clusters. For example, there is 1 cluster containing 8 storefronts, and 2 clusters containing 8 doorways each.

Size    Storefront   Doorway
58      -            1
12      -            1
11      1            1
10      -            1
9       -            2
8       1            2
7       -            1
6       1            4
5       4            2
4       7            12
3       23           49
2       122          220
Total   159          287

In particular, as a prediction is determined by the linear score z = w^T x + b, we decompose the dot product to reveal the importance of each feature. Table 4.4 breaks down the top-ranked prediction from Table 4.3, showing the ten largest individual scores and their relative contributions to the overall score. Notice the significant intersection between these predictive features and the most strongly characteristic features of msvalidate given in Table 4.2. Finally, to discover more SEO campaigns that we have yet to identify, we perform hierarchical clustering with complete linkage on both the unlabeled storefronts and the unlabeled doorways. One preliminary step we take is to discard unlabeled examples for which the LR classifier predicted a probability above 90%; such examples likely belong to already known campaigns. As alluded to in Section 4.4.2, we fix a tight distance threshold of t = 0.1 as the merging criterion. Table 4.5 displays the results of both clusterings, in terms of counts of different cluster sizes. Certainly, we do not expect every cluster to represent a distinct SEO campaign. Rather, these clusters simply represent the “best suggestions” for where to look for new campaigns. This clustering is intentionally conservative, as more compact clusters are more likely to derive from the same campaign. Furthermore, fewer and smaller clusters are easier to manually inspect. A cluster with few members, if their common affiliation is validated, can still provide a seed to bootstrap the new campaign with supervised classification. The classifier can learn the campaign’s most distinctive features, and then find even more examples belonging to that campaign.
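Returning to the per-feature breakdown illustrated in Table 4.4, a prediction of this kind can be unpacked in a few lines of code. In the sketch below, clf stands for the L1-regularized logistic regression of Section 4.4.1, x for a dense feature vector of a single Web page, and feature_names for the extracted HTML dictionary; all three are assumptions about how the surrounding code is organized, and reporting shares of the positive contribution mass is one reasonable convention rather than necessarily the one used to produce Table 4.4.

import numpy as np

def explain_prediction(clf, x, feature_names, campaign_idx, top=10):
    """Rank per-feature contributions w_i * x_i to the score z = w.x + b (cf. Table 4.4)."""
    w = clf.coef_[campaign_idx]
    contrib = w * x                                   # elementwise contributions to the dot product
    z = contrib.sum() + clf.intercept_[campaign_idx]
    prob = 1.0 / (1.0 + np.exp(-z))                   # raw (unnormalized) campaign probability
    top_idx = np.argsort(contrib)[::-1][:top]
    positive_mass = contrib[contrib > 0].sum()
    return prob, [(feature_names[i], contrib[i], 100 * contrib[i] / positive_mass)
                  for i in top_idx]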

4.5.2 Ecosystem-Level Results

We now view our classification results through a broader lens and discuss what they signify about the ecosystem surrounding SEO abuse in the counterfeit luxury market. Tables 4.6 and 4.7 offer two different global views of our collected and partially classified data — Table 4.6 is from the perspective of luxury verticals, while Table 4.7 is from the perspective of SEO campaigns. Recall that doorway and storefront Web pages emanate from poisoned search results (PSRs), and our data collection records these associations. As a result, even though our system is designed to classify doorways and storefronts, it is able to map associated PSRs, doorways, and storefronts to SEO campaigns. Table 4.6 lists the sixteen luxury verticals we examined, indicating how many PSRs, doorway pages, and storefront pages were associated with each one, as well as the number of known campaigns that targeted each vertical. In addition, the table shows that we successfully classified 58% of all PSRs, 42% of all doorways, but only 11% of all storefronts. This disparity suggests that search result poisoning is dominated by the few most prolific campaigns, who direct user traffic to a modest number of storefront Web sites.

Table 4.6. Sixteen luxury verticals and the associated # of PSRs, doorways, stores, and campaigns that target them. The key campaign, the first one we identified that guided our study, targeted all verticals except those with an ‘*’. Also indicated are the totals we were able to classify. The two rightmost columns specify the highest percentage of crawled search results (both top 10 and top 100) that were poisoned on any given day in our eight-month time frame. (This is a modified version of Table 1 from [84].)

Vertical         PSRs        Door.    Stores   Camp.   Top 10   Top 100
Abercrombie      117,319     2,059      786    35      13.0     11.1
Adidas           102,694     1,275      462    22       7.8      8.1
Beats by Dre     342,674     2,425      506    16      23.4     36.5
Clarisonic        10,726       243      148     6       0.3      1.3
Ed Hardy*         99,167     1,828      648    31      11.2     31.2
Golf              11,257       679      318    20       0.4      1.3
Isabel Marant    153,927     2,356    1,150    35       3.6     11.0
Louis Vuitton*   523,368     5,462    1,246    34      20.6     37.3
Moncler          454,671     3,566      912    38      39.6     42.5
Nike             180,953     3,521    1,141    32       8.2     11.5
Ralph Lauren      74,893     1,276      648    27       3.7      5.0
Sunglasses        93,928     3,585    1,269    34       5.5     11.5
Tiffany           37,054     1,015      432    22      10.2     17.1
Uggs*            405,518     4,966    1,015    39      18.0     38.0
Watches          109,016     3,615    1,470    35       1.9      7.0
Woolrich          55,879     1,924      888    38       2.4      5.0
Total          2,773,044    27,008    7,484    52
Classified     1,614,206    11,535      828
% Classified        58.6      42.7     11.1

In terms of peak daily search poisoning, in both the top 10 and top 100 results, the four most heavily targeted brands are Beats by Dre, Louis Vuitton, Moncler, and Uggs. In general, brands suffer many adversaries: over the course of our study, ten of the sixteen verticals were targeted by more than 30 SEO campaigns, all competing for the same user traffic. Table 4.7 lists 35 of the largest SEO campaigns we identified.

Table 4.7. Classified campaigns (with 30+ doorways) and the # of associated doorways, stores, brands targeted, and peak poisoning duration in days. (This is a subset of Table 2 from [84].)

Campaign          # Doorways   # Stores   # Brands   Peak
171760            30           14         7          44
adflyid           100          18         4          66
biglove           767          92         30         92
bitly             190          40         15         89
campaign.10       94           18         5          99
campaign.12       118          5          1          59
campaign.14       39           8          2          67
campaign.15       364          10         10         8
campaign.17       61           8          3          44
chanel.1          50           10         4          24
g2gmart           916          28         3          53
hackedlivezilla   43           49         9          56
iframeinjs        200          2          1          39
jarokrafka        266          55         3          87
jsus              439          59         27         68
key               1,980        97         28         65
livezilla         420          33         16         70
lv.0              42           3          1          62
lv.1              270          12         9          90
m10               581          35         8          30
moklele           982          15         4          36
moonkis           95           7          4          99
msvalidate        530          98         6          52
newsorg           926          7          5          24
northfacec        432          2          1          60
pagerand          122          7          4          43
partner           62           9          5          33
paulsimon         328          33         12         128
php?p=            255          55         24         96
robertpenner      56           7          12         50
schema.org        46           17         7          54
snowflash         271          14         1          48
stylesheet        222          9          6          63
uggs.0            428          6          5          30
vera              155          38         12         156

We observe a wide range in the number of doorway Web sites, from 30 used by 171760 (and even fewer used by the campaigns not shown) to 1,980 used by key. Most campaigns also use several storefront Web pages, not only to have more points of sale, but also to have backups in case any of their domains get seized by brand holders. In addition, most campaigns target numerous verticals as well. This diversification opens up multiple revenue streams and improves a campaign’s adaptability to disruption. If one brand is particularly vigilant in seizing domains, or if the supplier of a certain product falls through, then the campaign can reallocate resources quickly. The rightmost column in Table 4.7 specifies the length of time, in days, during which each campaign enjoyed most of its poisoning prominence. Specifically, we define peak poisoning duration as the shortest consecutive time span in which a campaign promoted at least 60% of its PSRs. We again see a moderate range, but SEO varies over time and usually succeeds in bursts. We measured PSRs over eight months, but the average peak duration is only about 51 days. Figure 4.6 illustrates the daily dynamics of detected, classified, and penalized PSRs for four verticals. The filled areas represent classified PSRs; in aggregate, we mapped 64% of Abercrombie PSRs to campaigns, 62% of Beats by Dre, 66% of Louis Vuitton, and 58% of Uggs. The “misc” label folds multiple smaller campaigns into one category. The red area designates PSRs that were penalized, either through Google labeling them as “hacked” or through brand holders seizing the corresponding storefront domains. A vertical slice in these plots measures PSRs on a particular day. For instance, Figure 4.6b shows that on December 1, 2013, search results for the Beats by Dre vertical were 34.6% poisoned. Of these PSRs, 85.3% led to counterfeit stores managed by five SEO campaigns: newsorg (53.8%), key (16.8%), jsus (8.0%), moonkis (5.8%), and paulsimon (0.3%).

(a) Abercrombie    (b) Beats by Dre    (c) Louis Vuitton    (d) Uggs

Figure 4.6. Stacked area plots ascribing PSRs to specific SEO campaigns in four different verticals. The filled areas denote classified PSRs of the colored campaigns, the unfilled area denotes unclassified PSRs, and the red area denotes penalized PSRs. (This is Figure 2 from [84].)

The remaining 14.7% of PSRs led to unclassified counterfeit stores. The red area at the bottom indicates that only 0.6% of PSRs were penalized; we discuss penalization more in the following section. These plots demonstrate how SEO campaigns diversify their business among

several luxury verticals. For example, key victimizes Abercrombie and Beats By Dre (Figures 4.6a and 4.6b), while msvalidate targets Louis Vuitton and Uggs (Figures 4.6c and 4.6d). Also, a campaign’s success in search poisoning is not strongly tied to the number of doorway Web sites it uses. As shown in Table 4.7,

the large newsorg and jsus campaigns used 926 and 439 doorways, respectively, while moonkis used only 95. Yet, the distribution of PSRs in Figure 4.6b attests to a relatively equal competition for Beats by Dre traffic.

In general, we observe that poisoning activity continues for months, but its prevalence varies over time. Interestingly, poisoning share diminishes considerably from start to end for all verticals except Louis Vuitton (Figure 4.6c). In between, though, the plots verify the bursty behavior of SEO. Most noticeably, the newsorg campaign thrived at the end of 2013 (Figure 4.6b), while the moklele campaign gained steam in June 2014 (Figure 4.6c).

4.5.3 Further Analysis

Our work primarily focused on the classification of storefront and doorway Web pages into the SEO campaigns behind them. As evidenced in the previous section, the classification results are essential to understanding the ecosystem surrounding counterfeit luxury SEO. We refer the reader to Wang et al. [84] for an even more in-depth analysis, but we recap select results here. Orders. Previous studies of underground economies [44, 38] have demonstrated the value in both simulating and submitting orders on counterfeit storefront sites. This final step in an end-to-end analysis reveals information about order volumes, payment processing, and fulfillment, as well as the relationships among actors and the impact of interventions on sales. By “simulating” an order, we mean creating an order but backing out right before final confirmation. Importantly, a store still assigns the customer an order number even if the customer does not complete the purchase. In addition, stores assign order numbers in a monotonically increasing fashion; therefore, the difference between two order numbers indicates the number of orders created in the intervening time frame. This difference is an upper bound on the number of orders actually placed, since other customers may have canceled their orders just as we did. Nonetheless, the metric is a useful estimate of order volume, and we can still gauge

how order rate changes over time in the face of interventions. In 13 different luxury verticals, we simulated 1,408 orders — 343 by hand and 1,065 with automated scripts — from 290 stores operated by 24 distinct SEO campaigns. To gain insight into payment processing and order fulfillment, we submitted actual orders on 16 storefront sites managed by 12 distinct campaigns. Bank identification numbers from our transactions revealed that only three banks processed our purchases, two in China and one in Korea. This finding parallels prior work showing that payments are a susceptible choke point in the spam business [44, 55]. We received 12 of the 16 knockoff goods we purchased; all were of low to medium quality and shipped from China. Fortuitously, we discerned the Web site of a product supplier on two of our packing slips. The Web site publicized extensive shipment data that we mined for order volume, delivery status, and customer location. In total, over the nine months between July 5, 2013 and March 28, 2014, we observed more than 279K orders: 256K were successfully delivered, 15K were seized at the destination, 4K were seized at the source, and 1,319 were returned. Over 81% of orders were destined for four regions: the United States (90K), Japan (57K), Western Europe (41K), and Australia (39K). All in all, this data from a single supplier shows that counterfeit luxury is a prosperous business. Interventions. The two stakeholders best positioned to disrupt counterfeit luxury SEO are search engine providers and brand holders. Search engines can demote the rank of poisoned search results and label compromised doorways as “hacked.” Brand holders can seize the domains of counterfeit stores through legal means. We examine each of these three countermeasures in turn. By correlating search poisoning activity with order activity (as estimated by our simulated orders), we saw corresponding drops in both that suggest demotion

likely has some effect. The key campaign in particular suffered the worst collapse, which is depicted, at least in terms of PSRs, in Figures 4.6a and 4.6b. We did notice one inadequacy, though. During March 2014, the moonkis campaign vaulted only a few PSRs into the top 10 search results but had many among the top 100, and its order volume was sustained. In cases like these, Google should be even more aggressive in its demotion of PSRs. Google achieves minimal coverage in penalizing PSRs with a “hacked” label. Even though most doorway Web sites are hacked, Google only labels 2.5% of PSRs as such. Figure 4.6 corroborates this negligible penalization. Additionally, Google only labels the root of a compromised Web site as “hacked.” Our data set contained 68,193 search results that Google labeled, but when including subpages whose root domain was penalized, Google could have labeled 102,104 (49%) more. Furthermore, Google lacks responsiveness in its labeling. For the subset of compromised doorways that were not already labeled when our study began, we observed an average lifetime of 13-32 days before they were labeled — a sufficient window of opportunity for accruing traffic. Finally, over the course of our eight-month study, we witnessed 290 domain seizures of counterfeit storefronts. However, this number is a meager 3.9% of the 7,484 storefronts we crawled. Among the 290 seized domains, 187 were redirected to new storefronts within an average of 7-15 days (though 81 of these domains were subsequently seized). This response time is relatively fast, especially when compared to the responsiveness of brand holders, who only perform seizures periodically. Stores that were launched and then seized within our eight-month period had an average lifetime of 48-68 days — again, a sizeable window of opportunity for monetizing targeted traffic. Ultimately, domain seizures have some effect but are nowhere near comprehensive. Campaigns are equipped with numerous storefronts spanning a variety of verticals. Even if one of their domains gets seized, business resumes at their other stores, and campaigns can quickly redirect temporarily lost traffic as well.

4.6 Conclusion

This chapter explored the influence of SEO abuse on the counterfeit luxury market. In this segment of illegal e-commerce, miscreants cooperate in well-organized SEO campaigns to poison search results and propagate their scams. To discern the full scope of their activity, we go beyond simply detecting cloaking or search poisoning, beyond catching individual compromised doorways or storefront Web sites, and beyond investigating a single campaign or targeted vertical. Instead, our work provides a holistic perspective on this class of cybercrime, in which we investigated the infrastructure and operation of 52 distinct SEO campaigns that victimize sixteen luxury verticals. Such an analysis was aided by the use of machine learning tools to classify crawled doorway and storefront Web pages according to their associated campaign. The results of this analysis provide a comprehensive understanding of the ecosystem of SEO campaigns in the counterfeit luxury market. This work draws parallels with the work presented in Chapter 3, which involved an end-to-end analysis of the spam business enterprise. Indeed, purely technical mitigations — filtering spam email and detecting search poisoning — have only limited impact in disrupting cybercriminal activity. We argue that more effective interventions must be informed by both a micro- and macro-level understanding of the problem. In the case of counterfeit luxury, our work suggests that defensive pressure should be targeted at the granularity of SEO campaigns.

4.7 Acknowledgements

Chapter 4, in part, is a reprint of the material as it appears in Proceedings of the Internet Measurement Conference (IMC) 2014. Wang, David; Der, Matthew F.; Karami, Mohammad; Saul, Lawrence K.; McCoy, Damon; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was one of the primary investigators and authors of this paper. Specifically, relating this dissertation to the published material [84]: (i) Table 4.6 is a modified version of Table 1; (ii) Table 4.7 is a subset of Table 2; (iii) Figure 4.6 is exactly Figure 2; and (iv) Section 4.4 includes excerpts written by the dissertation author in Section 4.2. The remaining material in this chapter was resynthesized by the dissertation author.

Chapter 5

The Uses (and Abuses) of New Top-Level Domains

In Chapters 3 and 4, we investigated two markets of illegal e-commerce. The high-level goal was the same in each case: to understand the end-to-end organization and operation of the enterprise and, in turn, pinpoint weaknesses to target with defensive action. The technical challenge was identical as well: to map the many thousands of storefront Web pages to their sponsoring business entity. This chapter examines a distinctly different market: the domain name market. In particular, we explore hundreds of new top-level domains (TLDs) that have been delegated in recent years. Our goal here is to find out how registrants are using their new domain names in these new TLDs. One major difference of this study is that the online activity is not outright abusive. Certainly, new domains may host malicious Web sites, but we explicitly do not seek to estimate the overall amount of abuse, nor to detect any particular forms of abuse. Instead, we seek a broader characterization of the usage patterns exhibited by these domains. Nevertheless, our motivation and methods are very much in the same vein as the other case studies: we seek economic insight into the new TLD rollout through empirical measurement of the Web.


The technical challenge, once again, is to solve a classification problem: to classify the Web content that the millions of domains host. This problem has a different nature than the ones we tackled before, though. The e-commerce tasks involved just one type of Web page (i.e., storefront), and the objective was to identify the adversary behind it. In contrast, this task involves several types of Web pages, and the objective is precisely to identify the type of Web page. However, while this application is different in some ways, a bootstrapped system that performs cumulative labeling remains our basic approach. We now set the stage for our third case study in using machine learning for the large-scale classification of Web pages. In just the last few years, the Internet Corporation for Assigned Names and Numbers (ICANN) has approved the introduction of several hundred new top-level domains (TLDs). This rapid expansion of the Internet namespace has generated a maelstrom of debate: put simply, are the new TLDs (e.g., .accountants, .yoga) serving the needs of actual communities, or are they merely encouraging defensive and speculative registrations based on the perceived value of well-known trademarks and brand names? Here we seek to inform this debate by analyzing the large-scale usage patterns of registered domains in newly created TLDs. We survey 480 TLDs delegated since October 2013, and within these TLDs, we compile a data set by crawling the Web pages of 4.1 million registered domains. Our analysis has two parts. First, we use a mix of clustering and classification methods to group domains into six categories: parked domains, which are monetized by serving automatically generated advertisements; suspended and unused domains that are not monetized; error and junk domains that are not functional; and (most interesting of all) contentful domains that host something other than ads. Second, for the domains in this

last category, we develop a statistical language model to estimate the fraction of domains that are hosting relevant content—that is, content actually aligned with the name of the TLD. Our results reveal a mixed landscape of usage patterns. On one hand, we find that over 80 percent of registered domains in new TLDs are underdeveloped, not serving content of any form (excluding ads). On the other hand, based on a more detailed study of ten specialized TLDs, we observe that the domains in active use are largely embracing the intended purpose of their TLDs.

5.1 Introduction

The Domain Name System (DNS) provides human-readable host identifiers that are memorable and easy to communicate. However, this same characteristic also means that desirable domain names—brand names, proper nouns, common English words, and so on—are scarce resources with considerable value. This notion is borne out by the tremendous resale market for such names (e.g.,

the insurance.com domain changed hands in 2010 for over $35M). In principle, this scarcity should be mitigated by the distinct namespaces available under each

top-level domain (e.g., insurance.com is distinct from insurance.net), but in practice many have believed that the small number of TLDs available for general-purpose use has not been sufficient to realize this goal. Thus, in 2011, the Internet Corporation for Assigned Names and Numbers (ICANN) introduced a new generic top-level domain (gTLD) approval process with the explicit goal of dramatically expanding the size and nature of the DNS namespace. This process was structured around registries—the organizations

responsible for operating individual TLDs (e.g., Verisign is the registry for com, and EDUCAUSE is the registry for edu).1

1Note that under the current business model implicitly established by ICANN, registries only operate TLDs and do not sell domains directly to consumers. Instead, registrars are the commercial entities who contract with registries for the right to offer domains in their TLD and then sell these on the retail market to individual consumers, or registrants. With some minor exceptions, all domain registrations involve exactly one registry, registrar, and registrant.

Each registry was allowed to propose an unlimited number of potential TLD candidates (submitting an extensive proposal and paying a $185,000 USD Evaluation Fee for each). In turn, each of these applications (almost 2,000 in total) was subject to public comment from a broad range of stakeholders, including world governments, brand holders, and individual Internet users. In addition to considering this feedback, ICANN rejected some applications itself based on the advice of its Governmental Advisory Committee (GAC), or when a proposed TLD might lead to confusion with existing TLDs. Finally, for those cases where multiple registries applied for the same approved

name (e.g., book), conflicts were resolved either through private settlement or through ICANN’s name contention policies (typically resulting in an auction). Successfully completing the application process results in ICANN adding an entry for the new TLD to the root zone, a process known as delegation.2 The impact of this new process has been dramatic. Between the introduction of the DNS in the early 1980s and January 2013, ICANN and its predecessors added

317 TLDs to the root zone, only 22 of which are generic TLDs (gTLDs) such as com, net, and org [64]. Today, in February 2015, there are 810 TLDs, or an addition of 493 in the intervening two years [18]. Some of these new TLDs are geographic

(geoTLDs), like berlin, london, and nyc, which are meant for businesses and individuals in those areas. Others like bike and clothing are specialized for registrants with a particular interest. Some are intended to be generic, like xyz and link, and yet others, such as google and marriott, are entirely empty and return

2When a typical computer looks up a domain name with an unrecognized TLD, it starts by querying the root zone servers to find the TLD-specific DNS servers. Delegation marks the time when the root name servers first respond with a useful result to this query, linking the new TLD to the rest of the DNS namespace. Usually domain name registrations begin a few months later, with trademark claims and specialized registration periods in the intervening months [77].

DNS errors for all queries as a mechanism to avoid name confusion and malicious registrations. This wholesale expansion of the DNS has not been without controversy. While proponents of these new TLDs argue that this dramatic widening of the namespace will democratize the Internet and promote innovation, critics question how much the new TLDs will actually contribute to the social good [25] and whether this expansion might simply serve the commercial interests of the registrars and registries who hold sway in ICANN. Others have expressed concern at the potential for abuse, including typosquatting and phishing, and the World Intellectual Property Organization has stressed the need for IP protection mechanisms, suggesting that the addition of new gTLDs will lead to defensive registrations—companies striving to protect their brand—and speculative registrations—speculators buying a domain name in hopes of reselling it at a significantly higher price [63]. Indeed, prior work focused exclusively on the controversial xxx TLD (whose creation predates this new policy) argues that the vast majority of its domain registration revenue was driven by speculative or defensive motivations, and very few registrations supported the claimed need for the TLD (adult entertainment) [29]. However, answering such questions—how are new TLDs being used, and is this use consistent with the claimed goals of this expansion?—is challenging to do at scale, and thus advocates on both sides of this debate have only had anecdotal evidence to back their claims. Our work makes two main contributions to address this deficiency. First, we crawl all 4.1 million domains registered in 480 new TLDs and, using a combination of clustering and classification methods, directly quantify how many domains host meaningful Web content and how many are still underdeveloped (with a particular interest in domain parking due to its potential as a vector for abuse) [3, 83]. Second, we examine ten specialized TLDs to determine how well registry intent matches with reality. Do domain owners choose their TLD based on its relation to their content? Or are they choosing based on some other metric, such as price, length, or identifiability? We judge how related a Web page is to its TLD with a simple statistical language model for estimating document relevance with respect to the semantics of the TLD name (e.g., that Web servers answering names in the christmas TLD provide content relevant to the Christmas holiday). From this analysis we find distinct support for both sides of the TLD argument—less than 20 percent of registered domains in new TLDs provide any meaningful content, but within the ten specialized TLDs we study these content-rich sites are largely relevant to the intended purpose of their TLDs.

5.2 Data Set

This section describes our data collection methodology, which follows prior work in [29]. We downloaded the zone files for 480 of the 491 new TLDs delegated since October 2013 (the other 11 did not provide us with access). Each zone file contains name server (NS) and address (A) records for all registered second-level domains in a TLD. “NS” records map domain names to recursively more specialized name servers to query for further information. For TLD zone files, the NS records provide information about the name server that is authoritative for the associated second-level domain name. For instance, the name server for “ucsd.edu” is “ns0.ucsd.edu”, which must be queried when attempting to resolve “ucsd.edu” or any of its subdomains, and the “edu” zone file contains this information. “A” records map domain names to IP addresses and are the end result of most DNS queries for a valid domain. In total, the zone files across the 480 new TLDs contained 4,180,300 unique domains on February 3, 2015, the date of the Web crawl we use in this work.
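To make the record layout concrete, the minimal sketch below counts the registered second-level domains in a single TLD zone file by scanning its NS records. The file path, the simplified record layout, and the omission of $ORIGIN/$TTL directives, SOA, DNSSEC, and glue records are simplifying assumptions rather than a description of our actual processing pipeline.

def registered_domains(zone_path, tld):
    """Collect second-level domains that have an NS record in a TLD zone file (simplified)."""
    domains = set()
    with open(zone_path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            parts = line.split()
            # Typical record layout: name  TTL  class  type  rdata
            if len(parts) >= 5 and parts[2].upper() == "IN" and parts[3].upper() == "NS":
                name = parts[0].rstrip(".").lower()
                if name.endswith("." + tld):        # keep second-level names, drop the TLD apex
                    domains.add(name)
    return domains

# e.g., len(registered_domains("xyz.zone", "xyz")) approximates the domain count for xyz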

Table 5.1. The ten largest new TLDs, when they appeared in the root zone, and prices of registering a domain in them.

TLD       Domains    Delegated   Cost (USD)
xyz       768,910    02-19-14    1.99
网址       353,149    04-02-14    N/A
club      166,070    01-18-14    4.99
berlin    154,986    01-08-14    27.13
wang      119,192    01-03-14    9.25
realtor    91,371    07-28-14    39.99
guru       79,890    11-06-13    27.99
nyc        68,839    03-20-14    29.99
ovh        57,348    06-20-14    1.30
link       57,089    01-18-14    9.88

We use a custom Web crawler based on Mozilla Firefox to collect data on each domain, including the HTML of the rendered Web page, a screenshot, the HTTP status code, and any associated HTTP headers. To fully emulate what a real user would experience when visiting the domain, the crawler executes JavaScript, follows redirects, and appends the contents of frames and iframes that are common features in parked and redirected Web pages. Table 5.1 shows the ten largest TLDs: how many domains are registered and present in the zone file as of February 3, 2015, along with the date the TLD was delegated to the root zone and the yearly registration fee. We extracted registration costs from the registry’s Web site where possible. When the registry did not supply a single price, we compared prices for the first ten accredited registrars provided by the registry and report the lowest price. Some registries or individual registrars ran other promotions as well. For instance, members of specific realtor associations can get one free year of registration under the realtor TLD. A large domain registrar, Network Solutions, gave some com domain owners free xyz domains with no user interaction required; the domains simply appeared in their accounts. Such practices may artificially inflate the size of these TLDs relative to actual customer demand for domains.

TLDs like berlin, realtor, and nyc have obvious intended uses. Others like xyz and link are meant to be truly generic; they will spur competition for catchy and marketable Web identities. The 网址 TLD is an internationalized domain name (IDN), in this case for the Chinese term “Web address”, and is intended for generic domain name registrations for Chinese users.3 Based on our experiences with the help of a native Chinese speaker, these domains appear to be reserved rather than registered by site owners: 346,871 of the 353,149 existing 网址 domains do not resolve to an IP address when queried, and we could not find a registrar selling domains in this TLD.

Donuts Inc. is the most prolific registry, providing services for guru and 171 other TLDs as of February 2015. The prominent members of the company all have experience at other registries or registrars, and their initial funding came through venture capital [1]. They describe their TLDs as “specific to a large number of verticals,” which is reflected by their large set of domain-specific names like bike, plumbing, diamonds, and photography. Though their domains have similar policies regarding registration and brand protection, each went through the application process independently. Since not all domains return Web content, and some domains return duplicate content, the number of unique crawled Web pages we use in our analysis is smaller than the total number of domains registered in the new TLDs. Of the 4.1M registered domains, we discarded 912,096 (22%) domains that did not respond with a successful HTTP response (an ‘OK’ 200 HTTP status code). We also set

3Web browsers display IDNs with appropriate localized unicode characters, but these domains also have an ASCII representation for backwards compatibility (in this case, xn--ses554g). New TLDs have also been created in Cyrillic, Arabic, and a number of other scripts.

aside 10,866 (0.26%) domains whose sole content is a frame included from another source that Firefox blocked due to its security policy. Of the remaining 3,257,338 domains, we then removed duplicate pages for a final data set of 2,105,327 unique Web pages.
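As an illustration of the duplicate-removal step just described, the sketch below keeps one representative page per distinct HTML body using a content hash. The (domain, html) pair format is an assumption about how the crawl output is stored, and hashing raw HTML is a simplification: it treats pages as duplicates only when their bodies are byte-for-byte identical.

import hashlib

def deduplicate(pages):
    """pages: iterable of (domain, html) pairs; keep one representative per distinct HTML body."""
    seen, unique = set(), []
    for domain, html in pages:
        digest = hashlib.sha1(html.encode("utf-8", "ignore")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((domain, html))
    return unique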

5.3 Clustering and Classification

Our objective is to delve “into the wild” and explore these new territories of the Web. How are domain owners using their new Internet property? Are they merely claiming it or actually settling it? Some patterns we can anticipate: for example, we expect to find parked domains across multiple TLDs, as well as many Web pages that are temporary stand-ins and near-duplicates. But we also hope to quantify these observations with large-scale measurements. To find these answers, our survey relies on a combination of automated methods and manual inspection. In this section we first classify the 2,105,327 unique Web pages into broad categories, and then in Section 5.4 we examine Web pages serving useful content in more detail.

5.3.1 Clustering

Our first step clusters Web pages with similar content. Here, by content, we mean not only the text displayed by each page in a browser, but the totality of information about the page that can be gleaned from its HTML source. To produce input for a clustering algorithm, we used a “bag of words” approach which represents each page as a high-dimensional vector whose elements count the number of times that specific features appear in the HTML source. We compiled these features by tokenizing all the data that appeared between start and end tags in the

HTML source; in addition, we compose (tag, attribute, value) tuples as the

“words” in our dictionary. Thus, for example, with the following snippet of HTML code:

<img src="ComingSoon.jpg" alt="ComingSoon" height="50" width="100">

we parse the above HTML into the following words:

img:src=ComingSoon.jpg img:alt=ComingSoon img:height=50 img:width=100

Our dictionary excludes a standard list of stop words as well as other words of very low frequency; see Sections 2.4 and 3.3 for more details of this approach. As input to a clustering algorithm, this representation attempts to account for both the structure and content of each Web page. As is typical with text data, the above procedure generates very sparse and high-dimensional representations: each Web page is converted to a word-count vector with several hundred thousand elements. In practice, however, it does not seem necessary to maintain this full representation as input to the clustering procedure. After normalizing the word-count vectors to unit length, we use principal components analysis [37]—a standard method for linear dimensionality reduction—to project them into a lower-dimensional feature space. We chose this feature space to capture the 200 leading directions of variance in the full bag-of-words representation. These smaller feature vectors were then used as input to our clustering procedure.
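A minimal sketch of this tokenization, using only the Python standard library, is shown below. It reproduces the img example above, but it omits the stop-word and low-frequency filtering and is an illustration of the idea rather than our exact parser.

from collections import Counter
from html.parser import HTMLParser

class TagTokenizer(HTMLParser):
    """Emit one token per (tag, attribute, value) tuple, plus whitespace-split text tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # value is None for valueless attributes; record an empty string in that case
            self.tokens.append(f"{tag}:{name}={value if value is not None else ''}")
    def handle_data(self, data):
        self.tokens.extend(data.split())

def bag_of_words(html):
    parser = TagTokenizer()
    parser.feed(html)
    return Counter(parser.tokens)    # raw word counts, before any filtering

print(bag_of_words('<img src="ComingSoon.jpg" alt="ComingSoon" height="50" width="100">'))
# Counter({'img:src=ComingSoon.jpg': 1, 'img:alt=ComingSoon': 1,
#          'img:height=50': 1, 'img:width=100': 1})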

We began by clustering a smaller subset of 200K Web pages—about one tenth of the full data set. (As we shall see in Section 5.3.2, the results from this initial clustering were sufficient to provide a seed for the subsequent steps in our analysis.) We used the k-means clustering algorithm with k = 400 to organize these Web pages into groups of high similarity (based on the Euclidean distance between their feature vectors). We set k to be purposefully large because we wished to discover especially cohesive clusters. Next we performed a manual inspection of these clusters. Our goal was to understand the broad categories of Web content hosted in these new top-level domains. For this purpose, we built a custom visualization tool that displays screenshots of how the Web pages rendered in a browser (at the time they were crawled); next to each screenshot is also a link to the HTML source. In these efforts we paid special attention to large clusters (those containing many Web pages), as well as clusters that appeared especially cohesive (as measured by the average distance of points to the cluster centroid). Our visualization tool sorts the Web pages in each cluster by their distance to the cluster centroid, and it also displays a “digest” view of each cluster that contains top and bottom-ranked pages along with a random sample of pages in between. With this tool, we can quickly spot clusters of Web pages that are visually nearly identical. We can also conclude with confidence that these Web pages have been appropriately grouped.
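A sketch of this pipeline with scikit-learn appears below. The word-count matrix X_counts is assumed to come from the tokenization above and to fit in memory as a dense array (a sparse matrix would call for TruncatedSVD in place of PCA); the random seed is arbitrary.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X_counts: (n_pages, n_words) word-count matrix for the 200K-page subset (assumed dense here)
X = normalize(X_counts)                                  # unit-length word-count vectors
X_200 = PCA(n_components=200).fit_transform(X)           # 200 leading directions of variance

km = KMeans(n_clusters=400, n_init=10, random_state=0).fit(X_200)

# For the "digest" view: rank the pages in each cluster by distance to its centroid.
dist_to_centroid = np.linalg.norm(X_200 - km.cluster_centers_[km.labels_], axis=1)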

5.3.2 Classification

A thorough scan of the clusters revealed to us six broad categories of Web pages in new top-level domains. In order of prevalence, these categories are the following:

Parked: the domain is advertised for sale and/or the majority of content consists of automatically generated ads.

Error: the domain does not return a 200 HTTP status code. (These domains were omitted at the outset from our analysis.)

Unused: the domain hosts an empty or placeholder Web page but is not monetized.

Content: the domain hosts meaningful Web content (not merely ads).

Junk: the domain returns a 200 HTTP status code but displays an error page.

Suspended: the domain is registered but “deactivated,” typically pending ICANN verification of the registrant.

Table 5.2 shows representative examples of four of these classes. The first class is that of parked domains, which come in three main varieties: ones that serve ads, offer resale, or do both. The vast majority display ads in a list of links, often labeled “Related Links” or “Sponsored Listings.” The registrants of parked domains are essentially speculators who hope to derive short-term profits from ad revenue and long-term profits when the domain is sold sometime in the future. Also shown in the table are examples of unused and suspended domains. Unlike parked domains, unused domains are not monetized; instead they merely host a holding page while the domain owner develops a proper Web site. Suspended domains arise when owners have failed to provide their contact information, a mandatory requirement from ICANN as of January 1, 2014. The page explains this requirement to visitors of the site. In addition, some domains are suspended due to ongoing reviews within the Uniform Rapid Suspension System (URS), which seeks to protect the intellectual property rights of trademark owners. Examples

of suspended domains include lockheedmartin.global, holidayinn.vegas, and netflix.social.

Table 5.2. Examples of parked, unused, suspended, and junk Web pages.


As the next step in our analysis, we sought to partition all the clustered Web pages into one of these six classes. The visualization tool gave us confidence to label en masse the large numbers of Web pages that appear in perfectly homogeneous clusters. This approach was particularly effective for identifying large numbers of parked and unused domains. In practice, we found that a relatively small handful of fixed templates were used to generate many thousands of parked and placeholder Web pages in new TLDs.

The class of Web pages with meaningful content exhibits the most variety. The Web pages of these domains are not subject to the same degree of replication as other classes, and the new TLDs cover a diverse set of subjects. Thus at this stage, we focused entirely on bulk labeling of clusters that clearly contained underdeveloped Web pages (i.e., those without content, excluding ads). If it was not visually obvious how to label a cluster in bulk, then its pages were assigned at this stage to the class of domains with content. After this initial manual partitioning of 200K (clustered) domains into six classes, we next sought to label the much larger set of domains that were not part of our clustering experiments. To do so, we used a simple nearest-neighbor heuristic based on Euclidean distance. We began by extracting word-counts from all unlabeled Web pages in our data set and mapping these word-counts into the same lower-dimensional feature space. Then, for every unlabeled Web page, we located its nearest neighbor among the labeled examples and computed the Euclidean distance to this neighbor. If the nearest neighbor was from one of the non-content classes, and if the Euclidean distance to this neighbor was less than a certain threshold, then we labeled the page as belonging to that class. We modified our visualization tool to display and validate the results of Web pages classified in this way. We then manually tuned the distance threshold to avoid false positive labels in the parked, error, unused, junk, and suspended classes. Note that no content pages were identified in this way: pages whose nearest neighbors did not lie within the distance threshold remained unlabeled. One round of this nearest-neighbor heuristic was sufficient to label many of the remaining (non-content) Web pages in our data set with high confidence. But the efficacy of this method was limited by the small seed of our initial clustering—only 200K pages, one-tenth of the whole data set.
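A sketch of this thresholded nearest-neighbor rule is given below. The arrays X_labeled, y_labeled, and X_unlabeled are assumed to hold the 200-dimensional PCA features and bulk labels from the previous steps, and the threshold value shown is a hypothetical placeholder; as described above, the actual threshold was tuned by manual review.

import numpy as np
from sklearn.neighbors import NearestNeighbors

# X_labeled, y_labeled: PCA features and class labels from the clustering step;
# X_unlabeled: the remaining pages in the same 200-dimensional space (all assumed given).
nn = NearestNeighbors(n_neighbors=1).fit(X_labeled)
dist, idx = nn.kneighbors(X_unlabeled)
dist, idx = dist.ravel(), idx.ravel()

THRESHOLD = 0.25                     # hypothetical value; the real threshold was tuned by hand
NON_CONTENT = {"Parked", "Error", "Unused", "Junk", "Suspended"}

pred = np.asarray(y_labeled)[idx]
accepted = (dist < THRESHOLD) & np.isin(pred, list(NON_CONTENT))
# Pages where accepted is False stay unlabeled and feed the next clustering round.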

To achieve greater coverage, we therefore iterated this approach. In particular, we collected a subset of the Web pages that remained unlabeled, clustered these pages using k-means, inspected the resulting clusters in our visualization tool, assigned bulk labels to clusters of pages that were visually straightforward to identify, and performed another round of thresholded nearest-neighbor classification—this time, with a larger base of labeled examples. We iterated this process until there remained no further examples to cluster. This simple iterative approach was effective for discovering the “mass” of each class, but we made a further effort to find additional undeveloped (i.e., non-content) Web pages in the tail. Such Web pages were not replicated enough to form their own cluster, and it would take much too long to inspect and label these Web pages manually; instead we used keyword search to locate additional candidate examples from each class. Some particularly effective keyword phrases were “domain is for sale” for Parked pages, “under construction” and “coming soon” for Unused pages, and error messages such as “page not found” and “problem loading page” for Junk pages. We also used our visualization tool to validate the results of pages labeled in this way. Finally, to achieve even greater coverage of parked domains, we consulted an additional data source of “ground truth” labels. Owners of parked domains sometimes contract with parking services [83]: the only task required of these owners is to point the name servers of their addresses to the name server of the parking service. Moreover, certain of these name servers are used exclusively for parked domains. Thus in our data set, we also labeled a domain as Parked if (regardless of its Web page) its name server pointed to the name server of a known parking service. Overall, we identified 26.3% of the parked domains in our data set by this rule. The other parked domains were identified by our clustering and classification procedures (71.5%) and by keyword search (2.2%).

Class       Domains      %
Parked      1,140,193    27.3%
Unused        868,086    21.7%
Error         912,096    21.9%
Junk          470,664    11.3%
Suspended      11,007     0.3%
Content       687,505    16.5%

Figure 5.1. Breakdown of classes for domains in new TLDs.

After all of these efforts to identify undeveloped (non-content) domains, we manually inspected a random sample of the remaining unlabeled Web pages. The results of this inspection gave us confidence to assign the remaining Web pages to the class of domains with content. Figure 5.1 summarizes the overall breakdown of classes that we found across all 4.1M domains in our data set. The results show that a significant majority of the registered domains in these 480 new TLDs are underdeveloped. Perhaps this preponderance is not surprising considering that the TLDs are still in their relative infancy. Most purchases of these domain names—at least initially—seem to be aimed at acquiring property rights rather than developing new avenues for content.

5.3.3 Further Analysis

As final observations from our classification efforts, we look at how registration patterns have changed over time and at the dominant domain parking services.

Table 5.3. Classification flux between two Web crawls 20 weeks apart. The first row shows, for example, how parked domains in 2014 were reclassified in 2015 (by percentage). The rightmost ∆ column gives the overall change in class size. P = Parked, U = Unused, E = Error, J = Junk, S = Suspended, C = Content.

                              2015-02-03
%             P      U      E      J      S       C       ∆
2014-09-16
P           89.3    2.4    4.6    0.1    0.03    3.6    +3.7
U            5.4   86.1    4.6    0.2    0.01    3.7    -2.2
E           20.9    5.0   65.8    1.0    0.27    7.0    -2.0
J            2.5    2.2    4.7   82.7    0.01    7.9    -0.4
S           21.8    2.3    7.6    0.8   61.28    6.3    -0.1
C            3.3    2.7   12.9    0.9    0.02   80.2    +1.1

Classification Flux

We also looked for trends in how the domains of new TLDs were being used over time. To do this, we compared the classification results of the previous section to those obtained from a Web crawl completed 20 weeks earlier on September 16, 2014. We note that the earlier crawl visited only 2,236,950 domains; thus, in just 20 weeks, we observed a near doubling in the number of domains registered to new TLDs. To compare the results across time, we consider only the intersection of domain names that were present in both crawls. The intersection of these two crawls contained 2,218,548 domains; 18,402 domains that were registered as of September 16, 2014 were no longer registered by February 3, 2015. Table 5.3 shows the broad changes that occurred in domains over time. First, we notice a trend toward monetization: several domains—considerable percentages of Error and Suspended pages, in particular—become parked in 2015. Also, while many domains are developed from 2014 to 2015, the reverse also occurs: 12.9% of domains with content in the earlier crawl respond with HTTP errors in the later one. Overall, we observe that more domains are Parked and Content in the later crawl, while fewer domains are Unused, Error, and Junk.

Market Share

One point of interest in this new Internet marketplace regards the business side: how many players are involved, and to what extent? We measure the market share of domain parking services by examining the NS records of all parked domains. Figure 5.2 breaks out the number of parked domains by service. The plot highlights the fact that domain parking is dominated by just a few major services. The lion’s share (34.5%) of this market belongs to domaincontrol.com, which GoDaddy uses for their parked domains. (Note that this name server is not exclusively used for parked domains, however. The GoDaddy domains that we labeled as parked were those discovered by our clustering and classification methods and then validated by our visualization tool.) Likewise, the top four parking services account for over 61% of all parked domains; the top ten, 78%. The numbers on top of the bars in Figure 5.2 indicate how many distinct TLDs use each parking service. All but two of these services (both based in Germany) cover a broad range of over 230 TLDs.

We also performed a similar analysis for Unused domains. Most name servers of these domains point to a server operated by the registrar, which then provides a default holding page for the domain. (See, for example, the United Domains and Network Solutions pages in Figure 5.2.) The two most prevalent name servers among Unused domains are register.com and worldnic.com (Network Solutions), which are both web.com entities and account for 41.8% of this market. The next highest, at 5.6%, is the Germany-based registrar United Domains (udagdns.de).
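Measuring this market share amounts to grouping parked domains by the name server in their NS records; a minimal sketch, assuming a list of (domain, name_server) pairs, is shown below.

    from collections import Counter

    def parking_market_share(parked_ns_records):
        """Given (domain, name_server) pairs for parked domains, return each
        name server's count and percentage share, largest first."""
        counts = Counter(ns for _domain, ns in parked_ns_records)
        total = sum(counts.values())
        return [(ns, n, 100.0 * n / total) for ns, n in counts.most_common()]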

5.4 Document Relevance

In the previous section we found that roughly one-sixth of registered domains in new TLDs had content of some form (besides ads) on their Web pages. In this section we take a closer look at these domains with actual content.

Figure 5.2. Number of parked domains by service. The number on top of each bar indicates how many distinct TLDs used that parking service.

When an application is submitted for a new TLD, the registry must offer an intended purpose for the added namespace. We are interested in knowing how many domain owners are faithful to this intended purpose. Are these owners embracing or ignoring the spirit of the TLD? Likewise, is the TLD serving its target community? These questions do not apply to certain generic TLDs like click, which explicitly welcome all registrants. However, they can certainly be asked for such specialized TLDs as london, christmas, and photography. In this section, we attempt to answer these questions for the ten specialized TLDs shown in Table 5.4. We chose these TLDs as a representative sample from the entirety of new TLDs in our data set.

Our problem, in a nutshell, is how to judge whether a Web site, such as foo.bike (our running example henceforth), truly serves the agenda or community behind the TLD.

Table 5.4. Number of registered domains, and percentage of contentful domains, in ten specialized TLDs.

TLD            Domains    Content    % Content
audio          19,161        814       4.2%
bike           13,846      3,601      26.0%
christmas      13,459        296       2.2%
clothing       14,842      2,926      19.7%
company        37,268      7,921      21.3%
email          47,506      7,010      14.8%
london         54,143      8,859      17.1%
nyc            68,839     11,176      16.2%
photography    51,329     17,028      33.2%
realtor        91,371     22,838      25.0%

We begin by observing that this problem closely resembles the core task in information retrieval: evaluating the relevance of a document to a query [16]. In our setting, the document is the Web page hosted at foo.bike, and the query is the TLD string bike. One slight difference in our setting, though, is that we seek to estimate the relevance (in the most general sense) of Web pages to TLD strings, whereas the goal of information retrieval is to provide a ranking of documents from most to least relevant.

5.4.1 Generating a Corpus for Each TLD

Perhaps the simplest method for determining whether a Web page served at domain foo.bike is related to the TLD bike would be to search for the keyword bike (as well as derivatives such as bikes and biking) in the text of its HTML. More generally, this set of keywords could be extended to include synonyms such as bicycle and cycling. But we might also expect many words beyond strict synonyms to indicate relevance—for example, words such as ride and parts. As an initial strategy along these lines, we attempted to gather lists of related words using Google Suggest, which works by appending predictable completions to any query. This approach was partially successful but suffered from two crucial drawbacks: first, that Google returns only the top ten suggestions, and second, that the suggestions are restricted to words or phrases that follow the query. Thus for the query bike, it is unlikely that Google’s ten suggestions would contain the highly relevant word mountain.

Ideally, what we want is a distribution of words that are commonly associated with each TLD. Distributions of commonly co-occurring words are, of course, the intuition behind topic models such as latent Dirichlet allocation [8]. Borrowing this intuition, we henceforth refer to these desired word distributions as topics. These distributions over meaningfully related words are not known a priori, however. Instead we must seek to learn them, and for this, we must generate an appropriate corpus for each TLD. (Note that the Web pages in each TLD do not provide an appropriate corpus because many of them host content that is wholly unrelated to the TLD’s intended purpose.)

To generate these corpora, we harness the capabilities of Google Search and its PageRank algorithm. We craft pertinent queries for each TLD, then use our own automated tool to search those queries on Google, fetch the top results, and crawl the corresponding Web pages. Specifically, we submit the TLD itself as a query to Google, along with five additional queries that pair the TLD with reference Web sites. As an example, for the TLD bike, the six queries we issue are:

• bike

• site:wikipedia.org+bike

• site:about.com+bike

• site:ask.com+bike

• site:answers.com+bike

• related:http://en.wikipedia.org/wiki/bike

Note that prepending site: to a Google query limits the search results to the specified domain, while prepending related: (as in the last query above) instructs Google to return sites similar to the specified one. For each TLD, we gathered a corpus of 350–400 relevant Web pages by crawling the top 100 search results for each of the first two queries and up to 50 results for each of the remaining four.
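As a concrete illustration, the sketch below assembles the six query strings for a given TLD; the helper name is ours, and the component that actually submits these queries to Google and crawls the results is not shown.

    REFERENCE_QUERIES = [
        "site:wikipedia.org+{tld}",
        "site:about.com+{tld}",
        "site:ask.com+{tld}",
        "site:answers.com+{tld}",
        "related:http://en.wikipedia.org/wiki/{tld}",
    ]

    def corpus_queries(tld):
        """Return the six Google queries used to gather a corpus for one TLD."""
        return [tld] + [q.format(tld=tld) for q in REFERENCE_QUERIES]

    # corpus_queries("bike") yields "bike", "site:wikipedia.org+bike", ...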

5.4.2 Estimating Topics

We estimate the topic-word distribution of each TLD from the statistics of commonly collocated words [53] in its corpus. In particular, let q denote the focal word of the TLD (e.g., the word “bike” for TLD bike). To estimate the topic-word distributions, we locate all instances of q and count all words that fall within a collocational window of five words on either side. (As is commonly done, we do not include stop-words in these counts; such words may co-occur frequently with q but are not likely to be related in any meaningful way.) There exist several statistical methods for measuring the significance of collocating words, but we have found that simple frequency counts are appropriate for our task. We normalize the counts of co-occurrences count(w, q) to obtain a conditional distribution P(w|q), which represents how likely a word w is to appear in the same context as q. This distribution is our representation of the TLD’s topic.

We estimated a distribution over collocated words for each of the ten TLDs in Table 5.4. The top-ranked words for these TLDs are shown in Table 5.5. We note that many metrics, such as t scores and χ2 statistics, have also been proposed for compiling similar lists of words; some of these are especially apt for discovering collocations of fixed phrases. These metrics essentially evaluate whether P(w|q) ≫ P(w), as opposed to considering P(w|q) in isolation as we have done here. For our application, though, we generally found the latter to work better.
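To make this estimation concrete, the following sketch computes P(w|q) from a tokenized corpus using a five-word window on either side of the focal word; the function name, tokenization, and stop-word list are assumptions of the sketch, not details of our implementation.

    from collections import Counter

    def topic_distribution(docs, q, window=5, stopwords=frozenset()):
        """Estimate P(w|q): the chance that word w appears within `window`
        words of the focal word q, normalized over all such co-occurrences.
        `docs` is an iterable of token lists (already lowercased)."""
        counts = Counter()
        for tokens in docs:
            for i, tok in enumerate(tokens):
                if tok != q:
                    continue
                neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                counts.update(w for w in neighbors if w not in stopwords)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()} if total else {}

    # Top collocates for a TLD word, e.g. "bike":
    # topic = topic_distribution(corpus_tokens, "bike", stopwords=STOPWORDS)
    # print(sorted(topic, key=topic.get, reverse=True)[:10])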

Table 5.5. Most likely collocating words for ten TLD words.

audio        bike          clothing       christmas      company
audio        bike          clothing       christmas      company
digital      mountain      womens         tree           limited
sound        bicycle       fashion        day            india
format       bikes         wear           holiday        companies
video        size          mens           december       liability
recording    riding        women          eve            business
books        right         clothes        season         new
music        road          traditional    trees          east
file         ride          dress          traditions     insurance
files        shop          accessories    lights         public

email        london        nyc            photography    realtor
email        london        city           photography    real
address      city          new            art            estate
free         greater       york           digital        realtor
send         university    nyc            camera         home
account      transport     attractions    wedding        find
mail         travel        things         photographs    realtors
service      free          brooklyn       photographers  agent
message      england       best           history        become
addresses    hotels        free           film           ask
accounts     underground   island         technology     association

5.4.3 Relevance Scoring

Finally we consider how to score the relevance of Web pages within a TLD—that is, to measure how well these pages match the TLD’s intended purpose. If we regard the collocational probability P(w|q) as the “similarity” of a word w to the TLD q, then a natural score of relevance is the average similarity of words that appear in the Web page. In particular, let d denote the Web page (document), and let P(w|d) denote the normalized distribution of words on the Web page. Then we propose the following measure of relevance:

Rq(d) = Σw P(w|q) · P(w|d)                    (5.1)

Note how this equation assigns a real-valued score to each Web page d in the TLD q. (We also experimented with the Jaccard index between the set of words on Web pages and the set of top-ranked words in each TLD, but in practice we found the score in eq. (5.1) to offer a richer measure of relevance.)

Recall that our ultimate goal for each TLD is to estimate the percentage of contentful Web pages that are related to the TLD’s intended purpose. We estimated this percentage in the simplest possible way—by computing the scores in eq. (5.1) for all pages in each TLD and then choosing a threshold that divided these pages into relevant and irrelevant ones. To choose this threshold, we ourselves made “ground truth” judgments of relevance for several hundred Web pages within each TLD. In particular, all five authors independently provided these judgments, marking each page as relevant or irrelevant based on a manual examination of its screenshot (or abstaining if this decision was unclear). We took the majority vote of our labels as ground truth for these pages. Finally, we observed the distribution of relevance scores for this manually labeled sample and chose a threshold that worked well to distinguish relevant versus irrelevant pages.

Table 5.6 displays some examples of Web pages, both relevant and irrelevant, that were distinguished by this simple thresholding procedure. It is visually obvious from the screenshots that the pages judged not relevant are unrelated to their TLD, while the relevant pages clearly are related. Figure 5.3 shows our estimated percentages of relevant Web pages in each of the ten TLDs.
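In code, eq. (5.1) reduces to a dot product between the topic distribution and the page’s word distribution; the sketch below is a direct transcription, with the per-TLD threshold left as a placeholder to be chosen from the manually labeled sample.

    from collections import Counter

    def relevance_score(page_tokens, topic):
        """Compute Rq(d) = sum over w of P(w|q) * P(w|d) for one Web page,
        where `topic` maps words to P(w|q)."""
        counts = Counter(page_tokens)
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return sum(topic.get(w, 0.0) * (c / total) for w, c in counts.items())

    # Hypothetical usage: keep pages whose score clears the TLD's threshold.
    # relevant = [d for d in pages if relevance_score(d.tokens, topic) >= threshold]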

Table 5.6. Web pages that are (ir)relevant to their TLD. Shown for each page are the second-level domain, relevance score as given by eq. (5.1), and a screenshot.

                 .audio               .clothing
                 abled, 0.008         hardstyle, 0.004
Not relevant     azure, 0.013         childsplay, 0.007
                 kopfhoerer, 0.026    orage, 0.019
Relevant         ocen, 0.075          sams, 0.027
                 12v, 0.116           norfolk, 0.036


Figure 5.3. Percentage of relevant Web pages in ten TLDs.

We observe that in eight out of the ten TLDs, a majority (over 50%) of Web pages align with the focal word of the TLD. One omission in these results is that we excluded Web pages with very few words from our analysis. We do not expect the relevance scores, as computed by eq. (5.1), to be reliable indicators for these pages.

Our results provide insight into how domain owners are using each TLD. The photography TLD, for instance, has been largely embraced; it benefits from a clear purpose, as well as the fact that the Internet is a prime platform for photo-related content. We observe a notable contrast between the business-oriented TLD realtor and the more whimsical TLD christmas. Professional realtors and real estate agencies have a strong business incentive for using a .realtor domain, whereas the business value of a .christmas domain is less clear. We noticed, however, that most of the content in the realtor TLD is not new; many domains have simply copied over existing real estate Web sites. For example, about 60% of domains in this TLD pulled HTML source directly from realtor.com.

The content in the .email domain was not especially rich: many Web pages provide nothing more than login portals, though some do offer legitimate email services and security sites. Our results shed less light on a TLD like .company, which seems more generic than the other specialized TLDs that we considered. For the most part, the geographic TLDs nyc and london are indeed serving their respective areas. There are some sites about the cities themselves, but most seem to be registered by businesses and individuals based in the city.

5.5 Conclusion

The namespace of the Internet as we know it is changing dramatically. Over the last few years, ICANN has approved hundreds of new top-level domains spanning topics as diverse as christmas, guru, and link. In this chapter, we explore a snapshot from February 2015 of this remarkable development to discover how registrants are using their domains in these new TLDs.

Our large-scale empirical study of new TLDs makes two main contributions. First, we combine both manual and automated techniques in alternating rounds of clustering, labeling, and classification to categorize over 4.1M crawled domains in 480 new TLDs. Second, we use a simple statistical language model to estimate the percentage of contentful Web pages whose subject matter matches the name of the TLD.

Our results offer early insight into the use of these new TLDs. That over 80% of domains remain underdeveloped suggests that many registrants are anxious to lay claim to new names, yet the domains themselves remain unused or parked. However, taking ten topical TLDs as a representative sample, we find that the majority of domain owners who do publish content are embracing the intended purposes of those TLDs.

This chapter contributed to our study in Halvorson et al. [28], which is broader in scope and elaborates on the economics and uses of new TLDs. We highlight further results below, but refer the reader to this paper for more details.

Registration intent. When possible, we inferred the registration intent of domain names — why did a registrant buy a particular domain name? Here we classify domain registrations into three categories: primary, defensive, and speculative. A primary registration establishes a unique Web presence, and the domain hosts meaningful content. A defensive registration protects an existing Web presence; these domains either do not resolve, or simply redirect to a different domain. A speculative registration hopes to profit from the domain name itself without ever developing a Web presence. Parked domains comprise this category. Among domain names in new TLDs with unambiguous registration intent, we inferred that domain registrations are 14.6% primary, 39.7% defensive, and 45.6% speculative. Thus, for over 85% of domains, the only value generated is through the buying and selling of the domain names themselves; less than 15% of domains contribute real value to the Internet community.

Financial analysis. We used pricing information and domain registration volume to estimate the total expenditure of registrants as well as the profitability of registries. Through March 2015, we estimate that registrants spent roughly $89M USD on domains in the new TLDs. A registry must pay a $185,000 fee to apply for a new TLD. We found that only about half of all new TLDs recovered this amount from domain registrations. Factoring in additional costs, a more realistic cost of starting a new TLD is around $500,000. In this case, only about 10% of new TLDs are profitable.

TLD popularity. We gauged the popularity of new TLDs from the perspective of both domain registrants and end users. By measuring registration volume across TLDs, we found that registrations in new TLDs are just a small fraction of all registrations, and that they only have a minimal effect on the registration rate in old TLDs. Also, com still dominates the market. Secondly, we used Alexa lists of top Web sites to judge how popular domains are among Internet users.
Compared to domains in the new TLDs, domains in the old TLDs are almost three times more likely to be ranked in the Alexa top 1M list.

Abusive behavior. Similar to using Alexa lists to measure popularity, we used blacklists to measure abusive activity. We found that within the first month of registration, domains in new TLDs are twice as likely as domains in old TLDs to show up on the URIBL blacklist. A partial explanation as to why spammers target new TLDs is the occasional and exceptional bargain (e.g., some registrars sold xyz domains for less than $1 per year).

All of these results provide evidence that ICANN’s goals behind the New gTLD Program — increased consumer choice, competitiveness, and innovation — are not being met to an adequate degree. These new TLDs are still young, but early indications suggest that they are merely invigorating the domain name market and not adding true value to the Internet at large.

5.6 Acknowledgements

Chapter 5, in part, is currently being prepared for submission for publication of the material. Der, Matthew F.; Halvorson, Tristan; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was the primary investigator and author of this material.

Chapter 6

Conclusion

Today, the Internet harbors billions of users. From a cybercriminal’s perspective, the Internet harbors a glut of unsuspecting victims. Profit-driven cybercriminals tap into this enormous population, luring visitors to their scams to make money. They gain user traffic through a combination of abusive advertising (e.g., email spam or search result poisoning) and an expansive attack surface, featuring a plethora of access points to a variety of scams. Indeed, well-organized online attacks operate at Internet scale; therefore, effective defenses will inevitably require Internet-scale analysis. In this way, computer security has evolved into a “big data” field. Unfortunately, this data quickly grows too big for security experts to analyze by hand. Automated tools are imperative for scaling up the analysis. This need presents a ripe opportunity for machine learning, where algorithms can learn from data, improve with more data, and make predictions on unseen data.

In this dissertation, we use machine learning to aid security efforts that crawl extensive sets of Web sites. In particular, we demonstrate that a bootstrapped system is effective for the large-scale classification of Web pages. The system starts with an initial seed of data labeled by a domain expert, then proceeds iteratively with rounds of human-machine interaction. Namely, the machine (re)trains a model on labeled data and makes predictions on unlabeled data; then, the human validates predictions to expand the set of labeled data. This methodology is significantly more efficient and robust than a manual approach supported by automated heuristics.

Overall, this dissertation contributes to the symbiosis between Internet security and machine learning, which has cultivated a vibrant area of research. Web crawls provide compelling large-scale data sets that capture real-world attacks. Machine learning then builds models for organizing the data, since doing so manually is impractical. Finally, security researchers digest the results to spot weaknesses in the attack, which are opportune points of intervention. The recurrence of this methodology further emphasizes the decisive role of machine learning in this problem domain.
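Schematically, this bootstrapped loop can be written as follows; the classifier interface is scikit-learn style, and the validate callback stands in for whatever manual review tool a particular study uses.

    def bootstrap_labels(model, X, seed_labels, validate, max_rounds=10):
        """Grow a labeled set iteratively: train on what is labeled, predict the
        rest, and let a human confirm or correct predictions.

        X is a feature matrix (e.g., a NumPy array); seed_labels maps row indices
        to labels; validate(idx, guess) returns a confirmed label or None."""
        labels = dict(seed_labels)
        for _ in range(max_rounds):
            labeled = sorted(labels)
            unlabeled = [i for i in range(len(X)) if i not in labels]
            if not unlabeled:
                break
            model.fit(X[labeled], [labels[i] for i in labeled])
            confirmed = {i: validate(i, g)
                         for i, g in zip(unlabeled, model.predict(X[unlabeled]))}
            confirmed = {i: y for i, y in confirmed.items() if y is not None}
            if not confirmed:
                break
            labels.update(confirmed)
        return labels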

6.1 Impact

The central thesis of this dissertation is that machine learning enables data-driven security to be conducted at scale. Our work applied this thesis to three different security problems.

First, we developed a system to classify spam-advertised storefronts according to their sponsoring affiliate program. The high accuracy our system achieved is largely due to the fact that affiliate programs use in-house templates for automatically generating storefront Web sites. As a result, many of their storefronts are near-duplicates and lie very close together in HTML feature space. This outcome is a manifestation of the onus on miscreants to scale their attacks. They expand their attack surface by proliferating their many storefronts across the Web, but can only do so at a reasonable cost by automating and replicating the scam. The repercussion, though, is that automated defenses like ours become possible. Solving this classification problem was pivotal in revealing a critical bottleneck in the spam value chain: hundreds of thousands of storefronts are managed by just dozens of affiliate programs, which use only a handful of merchant banks to monetize a customer’s visit.

Second, we described a system for classifying counterfeit luxury storefronts by the SEO campaign that promotes them in poisoned search results. This task closely resembles affiliate program identification, but the nature and market around these SEO-ed luxury storefronts are notably different. For example, the storefronts are not templated to the same degree as an affiliate program’s storefronts; however, certain commonalities in their HTML implementations can still link them to the same campaign. We showed that sparse linear classifiers are adept at learning these commonalities. The results gave us a better understanding of the infrastructure supporting luxury SEO, and discovering the full extent of a campaign’s operation helps to guide and reinforce defensive interventions.

Third, we documented a methodology for characterizing how registrants are using their millions of domains in hundreds of new top-level domains. Completing a task of this scale was feasible mainly due to three widespread patterns: HTTP errors, domain parking, and default placeholder Web pages. Web pages parked by the same service, and placeholder pages served by the same registrar, are generally duplicates or near-duplicates — a setting conducive to clustering and classifying these Web pages en masse. Our analysis provides the first baseline measurement of the entire landscape of new TLDs, and informs the debate on the purpose and value of the TLD expansion.

Our three successful case studies demonstrate that machine learning is a valuable tool for cybercrime security research that can be adopted in many problems beyond just the ones addressed in this dissertation.

6.2 Future Work

We anticipate that our research agenda — large-scale empirical analyses of online abuse — will continue into the foreseeable future. While we could suggest other security problems for applying our Web page crawling and classifying methodology, we instead focus future work on how to maximize the utility and efficiency of our framework. This direction is crucial considering how frequently this type of research task recurs.

I propose that a more cohesive engineering of our software stack would greatly optimize our research practice. This practice involves a repeated procedure of crawling Web pages, clustering or classifying them, and visualizing the results for validation. The regularity of this three-step methodology underscores the benefit of integrating the stages together in a unified data collect-analyze-visualize pipeline. Such a streamlined system removes the human-in-the-loop as much as possible; the user could simply “press go” to launch a Web crawl, then not be involved again until the results are ready to display. In effect, the system would reduce the latency from completed crawls to classified and validated data, as its seamless execution eliminates time-consuming manual handoffs (e.g., passing data and results back and forth between security researchers and machine learning researchers).

For the data analysis component, solutions could combine in-house implementations and off-the-shelf libraries. Either way, this component would be most effective with a highly modular design, where the user can simply swap in the feature extractor and learner best suited to the task at hand. One particularly attractive option is MLlib [58], Spark’s scalable machine learning library. MLlib is well stocked with many popular algorithms for feature extraction, dimensionality reduction, clustering, and classification; also, it is easy to deploy since our research group recently set up a Spark computing cluster.

By its real-time sequential nature, a Web crawl produces streaming data. Hence, another advantage of a unified pipeline is the capability of online learning. In fact, MLlib boasts a streaming k-means algorithm, offering “cluster as you crawl” functionality that could even be visualized in real time with an upgraded front end.

An additional module for future incorporation is an image-based feature extractor, which may be useful for certain Web page classification tasks. The approaches presented in this dissertation primarily extract textual features from the HTML source code of Web pages. However, we could also extract image-based features from our crawler’s screenshots of Web pages when fully rendered in a browser. Screenshots proved vital for manual validation, but they likely carry predictive power as well. For example, near-duplicate Web pages would commonly have very similar representations as image-based feature vectors. One available option is Caffe [35], a state-of-the-art deep learning library for visual recognition. Specifically, I recommend building feature vectors from the activation weights of “fc7,” the seventh and final hidden layer of the deep convolutional neural network, which has been shown to work well for general-purpose classification tasks.

After engineering this pipeline and building out the data analysis component, the next highest priority I advocate is improving the visualization component. Currently, we have a fairly rudimentary and static visualization tool implemented in basic HTML. Some versions at least have a form so users can submit data labels, but the tool is otherwise inflexible. Hence, the visualization part of the pipeline could be greatly enhanced with more creative front-end design and more powerful, dynamic functionality. Whether using an existing library like D3 (Data-Driven Documents) [9], or implementing a custom front end, the user could visualize, organize, and label data interactively with such an improved interface.
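To illustrate the “cluster as you crawl” idea mentioned above, a minimal PySpark sketch with MLlib’s streaming k-means might look like the following; the socket source, feature dimension, and cluster count are placeholders rather than details of any existing pipeline.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.clustering import StreamingKMeans
    from pyspark.mllib.linalg import Vectors

    sc = SparkContext(appName="ClusterAsYouCrawl")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Assume the crawler writes one comma-separated feature vector per page
    # to a local socket; parse each line into a dense MLlib vector.
    pages = ssc.socketTextStream("localhost", 9999) \
               .map(lambda line: Vectors.dense([float(x) for x in line.split(",")]))

    model = StreamingKMeans(k=6, decayFactor=1.0) \
        .setRandomCenters(dim=100, weight=1.0, seed=42)  # dim must match the features

    model.trainOn(pages)             # update cluster centers as pages arrive
    model.predictOn(pages).pprint()  # print cluster assignments for each batch

    ssc.start()
    ssc.awaitTermination()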

When development of the pipeline matures, two of its greatest assets will be flexibility and reusability. It will have enough algorithmic options to solve the most common learning tasks, and the same configuration can be used to solve classes of related problems.

6.3 Final Thoughts

The frontier of computer security research has witnessed an upward trend in empiricism. In part, this trend is due to the greater computing resources of today, which grant us the power to collect and analyze massive amounts of data. But beyond just capability is the initiative to transcend a strictly technical approach to security. Addressing the technical challenge of securing systems, while still vitally important, often involves a traditional arms race between attacker and defender — a race which usually has inherent asymmetries favoring the attacker. Our aim is to complement technical mechanisms with empirical measurement of the infrastructure and economics that support an attack. Indeed, the true power of many defenses is realized in the combination of micro- and macro-level analyses, where the latter seeks to decompose and understand attacks as they operate today.

We believe that this empirical angle is well founded in both motivation and results. A holistic analysis of a scam’s ecosystem unfolds the economics behind the adversary’s enterprise, and uncovers the narrowest bottleneck(s) whose severance would cripple their business model. Furthermore, a growing body of work in this field illustrates the many successes already achieved by this data-driven approach (e.g., [44, 84, 68, 82, 78]). Accordingly, we see this line of work proceeding for years to come, and so the intersection of computer security and machine learning will remain fertile ground for research. Our specific goal continues: to use machine learning for offloading data analysis from security practitioners to automated tools.

To this end, we envisage a promising opportunity to further offload avoidable manual duty — an improvement not in the methodology itself, but rather in its execution. In particular, we reiterate that the most pressing research challenge lies in the engineering of our software stack. Our current system is basically “piecewise,” where the distinct stages — crawling, machine learning, and validating — are maintained as separate components. This design induces unnecessary overhead in between stages, and so we advise that the components be closely integrated. The synthesis of a Web crawler, modular and scalable data analyzer, and dynamic visualizer into a streamlined, robust pipeline would significantly optimize our research practice. In summary, we encourage researchers of data-driven security to not only use this dissertation as a model, but also seek ways to perform this research with maximal efficiency.

Bibliography

[1] About Donuts. http://www.donuts.co/about/.

[2] http://newgtlds.icann.org/en/about/program.

[3] Sumayah Alrwais, Kan Yuan, Eihal Alowaisheq, Zhou Li, and XiaoFeng Wang. Understanding the Dark Side of Domain Parking. In Proceedings of the USENIX Security Symposium, San Diego, CA, August 2014.

[4] David S. Anderson, Chris Fleizach, Stefan Savage, and Geoffrey M. Voelker. Spamscatter: Characterizing Internet Scam Hosting Infrastructure. In Proceedings of the USENIX Security Symposium, Boston, MA, August 2007.

[5] Ross Anderson, Chris Barton, Rainer Böhme, Richard Clayton, Michel J.G. van Eeten, Michael Levi, Tyler Moore, and Stefan Savage. Measuring the Cost of Cybercrime. In The Economics of Information Security and Privacy. 2013.

[6] Sushma Nagesh Bannur, Lawrence K. Saul, and Stefan Savage. Judging a site by its content: learning the textual, structural, and visual features of malicious Web Pages. In Proceedings of the ACM Workshop on Security and Artificial Intelligence, Chicago, IL, October 2011.

[7] András A. Benczúr, Károly Csalogány, Tamás Sarlós, and Máté Uher. SpamRank – Fully Automatic Link Spam Detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan, May 2005.

[8] David M. Blei, Andrew Y. Ng, Michael I. Jordan, and John Lafferty. Latent dirichlet allocation. Journal of Machine Learning Research, 3, 2003.

[9] Michael Bostock, Vadim Ogievetsky, and Jeffrey Heer. D3: Data-Driven Documents. IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis), 2011.

[10] Andrei Z. Broder. On the Resemblance and Containment of Documents. In Proceedings of the Compression and Complexity of Sequences, Positano, Salerno, Italy, June 1997.


[11] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic Clustering of the Web. In Proceedings of WWW6 and Computer Networks 29 (8-13) and Digital/HP Technical Report SRC-TN-1997-015 1997, pages 1157–1166, Santa Clara, CA, 1997.

[12] Neha Chachra, Damon McCoy, Stefan Savage, and Geoffrey M. Voelker. Empirically Characterizing Domain Abuse and the Revenue Impact of Blacklisting. In Proceedings of the Workshop on the Economics of Information Security, State College, PA, June 2014.

[13] Moses S. Charikar. Similarity Estimation Techniques from Rounding Algorithms. In Proceedings of the ACM Symposium on Theory of Computing, Montreal, PQ, Canada, May 2002.

[14] Federal Trade Commission. FTC Shuts Down, Freezes Assets of Vast International Spam E-Mail Network. https://www.ftc.gov/news-events/press-releases/2008/10/ftc-shuts-down-freezes-assets-vast-international-spam-e-mail, 2008.

[15] T. Cover and P. Hart. Nearest neighbor pattern classification. In IEEE Transactions on Information Theory, IT-13, pages 21–27, 1967.

[16] Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. “Is This Document Relevant? ... Probably”: A Survey of Probabilistic Models in Information Retrieval. ACM Comput. Surv., 30(4):528–552, December 1998.

[17] D. Defays. An Efficient Algorithm for a Complete Link Method. Comput. J., 20(4):364–366, 1977.

[18] Delegated Strings. http://newgtlds.icann.org/en/program-status/delegated-strings.

[19] Donuts raises $100 million, applies for 307 new TLDs. http://domainnamewire.com/2012/06/05/donuts-raises-100-million-applies-for-307-new-tlds/, June 2012.

[20] Jake Drew and Tyler Moore. Automatic Identification of Replicated Criminal Websites Using Combined Clustering. In Proceedings of the International Workshop on Cyber Crime, May 2014.

[21] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

[22] Dennis Fetterly, Mark Manasse, and Marc Najork. Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases, Paris, France, June 2004.

[23] Amy Gesenhues. Study: Organic Search Drives 51% Of Traffic, Social Only 5%. http://searchengineland.com/study-organic-search-drives-51-traffic-social-5-202063, August 2014.

[24] GoDaddy Raises $460 Million in IPO. http://www.wsj.com/articles/godaddy-raises-460-million-in-ipo-1427844547, March 2015.

[25] gTLD Final Report Public Comments. http://forum.icann.org/lists/gtldfinalreport-2007/.

[26] Thiago S. Guzella and Walmir M. Caminhas. A review of machine learning approaches to Spam filtering. Expert Systems with Applications, 36(7):10206–10222, 2009.

[27] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. Combating Web Spam with TrustRank. In Proceedings of the 13th International Conference on Very Large Data Bases, Toronto, Canada, August 2004.

[28] Tristan Halvorson, Matthew F. Der, Ian Foster, Stefan Savage, Lawrence K. Saul, and Geoffrey M. Voelker. From .academy to .zone: An Analysis of the New TLD Land Rush. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement, Tokyo, Japan, October 2015.

[29] Tristan Halvorson, Kirill Levchenko, Stefan Savage, and Geoffrey M. Voelker. XXXtortion? Inferring Registration Intent in the .XXX TLD. In Proceedings of the International World Wide Web Conference, Seoul, Korea, April 2014.

[30] Tristan Halvorson, Janos Szurdi, Gregor Maier, Mark Felegyhazi, Christian Kreibich, Nicholas Weaver, Kirill Levchenko, and Vern Paxson. The BIZ Top-Level Domain: Ten Years Later. In Proceedings of the Passive and Active Measurement Workshop, Vienna, Austria, March 2012.

[31] Monika Henzinger. Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms. In Proceedings of the ACM SIGIR Conference on Research & Development on Information Retrieval, Seattle, WA, August 2006.

[32] Cormac Herley and Dinei Florêncio. Nobody Sells Gold for the Price of Silver: Dishonesty, Uncertainty and the Underground Economy. In Proceedings of the Workshop Economics of Information Security and Privacy, London, UK, June 2009.

[33] Danny Yuxing Huang, Hitesh Dharmdasani, Sarah Meiklejohn, Vacha Dave, Chris Grier, Damon McCoy, Stefan Savage, Nicholas Weaver, Alex C. Snoeren, and Kirill Levchenko. Botcoin: Monetizing Stolen Cycles. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, February 2014.

[34] ICANN. Internet Domain Name Expansion Now Underway. https://www.icann.org/resources/press-material/release-2013-10-23-en, August 2013.

[35] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv:1408.5093 [cs.CV], 2014.

[36] John P. John, Fang Yu, Yinglian Xie, Arvind Krishnamurthy, and Martin Abadi. deSEO: Combating Search-Result Poisoning. In Proceedings of the USENIX Security Symposium, San Francisco, CA, August 2011.

[37] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

[38] Chris Kanich, Nicholas Weaver, Damon McCoy, Tristan Halvorson, Christian Kreibich, Kirill Levchenko, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. Show Me the Money: Characterizing Spam-advertised Revenue. In Proceedings of the USENIX Security Symposium, San Francisco, CA, August 2011.

[39] Mohammad Karami, Shiva Ghaemi, and Damon McCoy. Folex: An Analysis of an Herbal and Counterfeit Luxury Goods Affiliate Program. In Proceedings of the eCrime Researchers Summit, San Francisco, CA, September 2013.

[40] Pranam Kolari, Tim Finin, and Anupam Joshi. SVMs for the Blogosphere: Blog Identification and Splog Detection. In Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, Palo Alto, CA, March 2006.

[41] Robert Layton, Paul Watters, and Richard Dazeley. Automatically Determining Phishing Campaigns using the USCAP Methodology. In Proceedings of the eCrime Researchers Summit, Dallas, TX, October 2010.

[42] Nektarios Leontiadis, Tyler Moore, and Nicolas Christin. Pick Your Poison: Pricing and Inventories at Unlicensed Online Pharmacies. In Proceedings of the ACM Conference on Electronic Commerce, Philadelphia, PA, June 2013.

[43] Nektarios Leontiadis, Tyler Moore, and Nicolas Christin. A Nearly Four-Year Longitudinal Study of Search-Engine Poisoning. In Proceedings of the ACM Conference on Computer and Communications Security, Scottsdale, AZ, November 2014.

[44] Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Nicholas Weaver, Vern Paxson, Geoffrey M. Voelker, and Stefan Savage. Click Trajectories: End-to-End Analysis of the Spam Value Chain. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, May 2011.

[45] Zhou Li, Kehuan Zhang, Yinglian Xie, Fang Yu, and Xiaofeng Wang. Knowing Your Enemy: Understanding and Detecting Malicious Web Advertising. In Proceedings of the ACM Conference on Computer and Communications Security, Raleigh, NC, October 2012.

[46] Jun-Lin Lin. Detection of Cloaked Web Spam by Using Tag-based Methods. Expert Syst. Appl., 36(4):7493–7499, May 2009.

[47] List of most expensive domain names. https://en.wikipedia.org/wiki/List_of_most_expensive_domain_names.

[48] He Liu, Kirill Levchenko, Márk Félegyházi, Christian Kreibich, Gregor Maier, Geoffrey M. Voelker, and Stefan Savage. On the Effects of Registrar-level Intervention. In Proceedings of the USENIX Workshop on Large-scale Exploits and Emergent Threats, Boston, MA, March 2011.

[49] Long Lu, Roberto Perdisci, and Wenke Lee. SURF: Detecting and Measuring Search Poisoning. In Proceedings of the ACM Conference on Computer and Communications Security, Chicago, IL, October 2011.

[50] Christian Ludl, Sean McAllister, Engin Kirda, and Christopher Kruegel. On the Effectiveness of Techniques to Detect Phishing Sites. In Proceedings of the Conference on Detection of Intrusions and Malware & Vulnerability Assessment, July 2007.

[51] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Learning to Detect Malicious URLs. ACM Trans. Intell. Syst. Technol., 2(3):30:1–30:24, May 2011.

[52] Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma. Detecting Near-duplicates for Web Crawling. In Proceedings of the International World Wide Web Conference, Banff, Alberta, Canada, May 2007.

[53] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999.

[54] MarkMonitor. MarkMonitor Brand Protection. https://www.markmonitor.com/services/brand-protection.php.

[55] Damon McCoy, Hitesh Dharmdasani, Christian Kreibich, Geoffrey M. Voelker, and Stefan Savage. Priceless: The Role of Payments in Abuse-advertised Goods. In Proceedings of the ACM Conference on Computer and Communications Security, Raleigh, NC, October 2012.

[56] Damon McCoy, Andreas Pitsillidis, Grant Jordan, Nicholas Weaver, Christian Kreibich, Brian Krebs, Geoffrey M. Voelker, Stefan Savage, and Kirill Levchenko. PharmaLeaks: Understanding the Business of Online Pharmaceutical Affiliate Programs. In Proceedings of the USENIX Security Symposium, Bellevue, WA, August 2012.

[57] Sarah Meiklejohn, Marjori Pomarole, Grant Jordan, Kirill Levchenko, Damon McCoy, Geoffrey M. Voelker, and Stefan Savage. A Fistful of Bitcoins: Characterizing Payments Among Men with No Names. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement, Barcelona, Spain, October 2013.

[58] Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. MLlib: Machine Learning in Apache Spark. arXiv:1505.06807 [cs.LG], 2015.

[59] Tyler Moore and Richard Clayton. The consequence of noncooperation in the fight against phishing. In Proceedings of the eCrime Researchers Summit, Atlanta, GA, October 2008.

[60] Tyler Moore and Benjamin Edelman. Measuring the Perpetrators and Funders of Typosquatting. In Proceedings of the International Conference on Financial Cryptography and Data Security, January 2010.

[61] Tyler Moore, Jie Han, and Richard Clayton. The Postmodern Ponzi Scheme: Empirical Analysis of High-Yield Investment Programs. In Proceedings of Financial Cryptography and Data Security, February 2012.

[62] Tyler Moore, Nektarios Leontiadis, and Nicolas Christin. Fashion Crimes: Trending-term Exploitation on the Web. In Proceedings of the ACM Conference on Computer and Communications Security, Chicago, IL, October 2011.

[63] New Generic Top-Level Domains: Intellectual Property Considerations. http://www.wipo.int/amc/en/domains/reports/newgtld-ip/index.html.

[64] New gTLD — Fast Facts. http://newgtlds.icann.org/en/about/program/materials/fast-facts-26jan15-en.pdf.

[65] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. Detecting Spam Web Pages Through Content Analysis. In Proceedings of the International World Wide Web Conference, Edinburgh, Scotland, May 2006.

[66] OpSec. Brand protection from manufacturing to retail. http://www.opsecsecurity.com/brand-protection.

[67] PayPaI. https://en.wikipedia.org/wiki/PayPaI.

[68] Paul Pearce, Vacha Dave, Chris Grier, Kirill Levchenko, Saikat Guha, Damon McCoy, Vern Paxson, Stefan Savage, and Geoffrey M. Voelker. Characterizing Large-Scale Click Fraud in ZeroAccess. In Proceedings of the ACM Conference on Computer and Communications Security, Scottsdale, AZ, November 2014.

[69] J. Postel and J. Reynolds. Domain requirements. RFC 920, October 1984.

[70] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 2007.

[71] Niels Provos, Panayiotis Mavrommatis, Moheeb Abu Rajab, and Fabian Monrose. All Your iFRAMEs Point to Us. In Proceedings of the USENIX Security Symposium, Boston, MA, June 2008.

[72] Registrar accreditation agreement. https://www.icann.org/resources/pages/approved-with-specs-2013-09-17-en.

[73] Registry agreement. http://newgtlds.icann.org/sites/default/files/agreements/agreement-approved-09jan14-en.htm.

[74] Safenames. Mark Protect. http://www.safenames.net/BrandProtection/MarkProtect.aspx.

[75] Dmitry Samosseiko. The Partnerka – What is it, and why should you care? In Proceedings of Virus Bulletin Conference, 2009.

[76] Brett Stone-Gross, Ryan Abman, Richard A. Kemmerer, Christopher Kruegel, Douglas G. Steigerwald, and Giovanni Vigna. The Underground Economy of Fake Antivirus Software. In Proceedings of the Workshop on the Economics of Information Security, Fairfax, VA, June 2011.

[77] Sunrise Claims Periods. http://newgtlds.icann.org/en/program-status/sunrise-claims-periods.

[78] Kurt Thomas, Elie Bursztein, Chris Grier, Grant Ho, Nav Jagpal, Alexandros Kapravelos, Damon McCoy, Antonio Nappa, Vern Paxson, Paul Pearce, Niels Provos, and Moheeb Abu Rajab. Ad Injection at Scale: Assessing Deceptive Advertisement Modifications. In Proceedings of the IEEE Symposium on Security and Privacy, San Jose, CA, May 2015.

[79] Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, and Dawn Song. Design and Evaluation of a Real-Time URL Spam Filtering Service. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, May 2011.

[80] Kurt Thomas, Damon McCoy, Chris Grier, Alek Kolcz, and Vern Paxson. Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse. In Proceedings of the USENIX Security Symposium, Washington, D.C., August 2013.

[81] Tanguy Urvoy, Emmanuel Chauveau, Pascal Filoche, and Thomas Lavergne. Tracking Web Spam with HTML Style Similarities. ACM Trans. Web, 2(1):3:1–3:28, March 2008.

[82] Marie Vasek, Micah Thornton, and Tyler Moore. Empirical Analysis of Denial-of-Service Attacks in the Bitcoin Ecosystem. In Proceedings of the 1st Workshop on Bitcoin Research, March 2014.

[83] Thomas Vissers, Wouter Joosen, and Nick Nikiforakis. Parking Sensors: Analyzing and Detecting Parked Domains. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, February 2015.

[84] David Wang, Matthew Der, Mohammad Karami, Lawrence Saul, Damon McCoy, Stefan Savage, and Geoffrey M. Voelker. Search + Seizure: The Effectiveness of Interventions on SEO Campaigns. In Proceedings of the ACM SIGCOMM Conference on Internet Measurement, Vancouver, Canada, November 2014.

[85] David Wang, Stefan Savage, and Geoffrey M. Voelker. Juice: A Longitudinal Study of an SEO Campaign. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, February 2013.

[86] David Y. Wang, Stefan Savage, and Geoffrey M. Voelker. Cloak and Dagger: Dynamics of Web Search Cloaking. In Proceedings of the ACM Conference on Computer and Communications Security, Chicago, IL, October 2011.

[87] Yi-Min Wang, Ming Ma, Yuan Niu, and Hao Chen. Spam Double-funnel: Connecting Web Spammers with Advertisers. In Proceedings of the International World Wide Web Conference, Banff, Alberta, Canada, May 2007.

[88] Brad Wardman and Gary Warner. Automating Phishing Website Identification through Deep MD5 Matching. In Proceedings of the eCrime Researchers Summit, Atlanta, GA, October 2008.

[89] Jonathan Weinberg. Report (part one) of Working Group C of the Domain Name Supporting Organization Internet Corporation for Assigned Names and Numbers. http://www.dnso.org/dnso/notes/20000321.NCwgc-report.html, March 2000.

[90] Colin Whittaker, Brian Ryner, and Marria Nazif. Large-Scale Automatic Classification of Phishing Pages. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, February 2010.

[91] Baoning Wu and Brian D. Davison. Detecting Semantic Cloaking on the Web. In Proceedings of the International World Wide Web Conference, Edinburgh, Scotland, May 2006.

[92] Guo-Xun Yuan, Chia-Hua Ho, and Chih-Jen Lin. An Improved GLMNET for L1-regularized Logistic Regression. Journal of Machine Learning Research, 13(1):1999–2030, June 2012.

[93] Qing Zhang, David Wang, and Geoffrey M. Voelker. DSpin: Detecting Automatically Spun Content on the Web. In Proceedings of the Network and Distributed System Security Symposium, San Diego, CA, February 2014.

[94] Yue Zhang, Jason Hong, and Lorrie Cranor. CANTINA: A Content-Based Approach to Detecting Phishing Web Sites. In Proceedings of the International World Wide Web Conference, Banff, Alberta, Canada, May 2007.