UNIVERSITY OF CALIFORNIA, SAN DIEGO
Investigating Large-Scale Internet Abuse Through Web Page Classification
A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy
in
Computer Science
by
Matthew F. Der
Committee in charge:
Professor Lawrence K. Saul, Co-Chair
Professor Stefan Savage, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair
Professor Gert Lanckriet
Professor Kirill Levchenko
2015

Copyright
Matthew F. Der, 2015
All rights reserved.

The Dissertation of Matthew F. Der is approved and is acceptable in quality and form for publication on microfilm and electronically:
Co-Chair
Co-Chair
Co-Chair
University of California, San Diego 2015
DEDICATION
To my family: Kristen, Charlie, David, Bryan, Sarah, Katie, and Zach.
EPIGRAPH
Everything should be made as simple as possible, but not simpler.
— Albert Einstein
Sic transit gloria . . . glory fades.
— Max Fischer
TABLE OF CONTENTS
Signature Page ...... iii
Dedication ...... iv
Epigraph ...... v
Table of Contents ...... vi
List of Figures ...... ix
List of Tables ...... xi
Acknowledgements ...... xiii
Vita ...... xviii
Abstract of the Dissertation ...... xix
Chapter 1 Introduction ...... 1
    1.1 Contributions ...... 4
    1.2 Organization ...... 6
Chapter 2 Background ...... 8
    2.1 The Spam Ecosystem ...... 8
        2.1.1 Spam Value Chain ...... 9
        2.1.2 Click Trajectories Finding ...... 10
        2.1.3 Affiliate Programs ...... 11
    2.2 SEO and Search Poisoning ...... 15
    2.3 Domain Names ...... 17
        2.3.1 The DNS Business Model ...... 18
        2.3.2 Growth of Top-Level Domains ...... 19
        2.3.3 Abuse and Economics ...... 23
    2.4 Bag-of-Words Representation ...... 25
    2.5 Related Work ...... 28
        2.5.1 Non-Machine Learning Methods ...... 28
        2.5.2 Near-Duplicate Web Pages ...... 29
        2.5.3 Web Spam and Cloaking ...... 30
        2.5.4 Other Applications ...... 32
Chapter 3 Affiliate Program Identification ...... 36
    3.1 Introduction ...... 37
    3.2 Data Set ...... 41
        3.2.1 Data Collection ...... 41
        3.2.2 Data Filtering ...... 42
        3.2.3 Data Labeling ...... 43
    3.3 An Automated Approach ...... 45
        3.3.1 Feature Extraction ...... 46
        3.3.2 Dimensionality Reduction & Visualization ...... 49
        3.3.3 Nearest Neighbor Classification ...... 51
    3.4 Experiments ...... 53
        3.4.1 Proof of Concept ...... 55
        3.4.2 Labeling More Storefronts ...... 56
        3.4.3 Classification in the Wild ...... 58
        3.4.4 Learning with Few Labels ...... 60
        3.4.5 Clustering ...... 63
    3.5 Conclusion ...... 66
    3.6 Acknowledgements ...... 67
Chapter 4 Counterfeit Luxury SEO ...... 68
    4.1 Introduction ...... 69
    4.2 Background ...... 71
        4.2.1 Search Result Poisoning ...... 71
        4.2.2 Counterfeit Luxury Market ...... 76
        4.2.3 Interventions ...... 79
    4.3 Data Collection ...... 81
        4.3.1 Generating Search Queries ...... 82
        4.3.2 Crawling & Cloaking ...... 83
        4.3.3 Detecting Storefronts ...... 84
        4.3.4 Complete Data Set ...... 84
    4.4 Approach ...... 85
        4.4.1 Supervised Learning ...... 85
        4.4.2 Unsupervised Learning ...... 90
        4.4.3 Bootstrapping the System ...... 91
    4.5 Results ...... 93
        4.5.1 Classification Results ...... 93
        4.5.2 Ecosystem-Level Results ...... 98
        4.5.3 Further Analysis ...... 103
    4.6 Conclusion ...... 106
    4.7 Acknowledgements ...... 107
Chapter 5 The Uses (and Abuses) of New Top-Level Domains ...... 108
    5.1 Introduction ...... 110
    5.2 Data Set ...... 113
    5.3 Clustering and Classification ...... 116
        5.3.1 Clustering ...... 116
        5.3.2 Classification ...... 118
        5.3.3 Further Analysis ...... 123
    5.4 Document Relevance ...... 125
        5.4.1 Generating a Corpus for Each TLD ...... 127
        5.4.2 Estimating Topics ...... 129
        5.4.3 Relevance Scoring ...... 130
    5.5 Conclusion ...... 134
    5.6 Acknowledgements ...... 136
Chapter 6 Conclusion ...... 137
    6.1 Impact ...... 138
    6.2 Future Work ...... 140
    6.3 Final Thoughts ...... 142
Bibliography ...... 144
LIST OF FIGURES
Figure 2.1. The steady and swift rollout of new gTLDs. Dates of delegated strings were collected from [18]...... 22
Figure 3.1. Data filtering process. Stage 1 is the entire set of crawled Web pages; stage 2, pages tagged as pharmaceutical, replica, and luxury; stage 3, storefronts of affiliate programs matched by regular expressions...... 41
Figure 3.2. Projection of storefront feature vectors from the largest affiliate program (EvaPharmacy) onto the data’s two leading principal components...... 51
Figure 3.3. Histogram of three NN distances to EvaPharmacy storefronts: distances to storefronts in the same affiliate program, to those in other programs, and to unlabeled storefronts...... 54
Figure 3.4. Boxplot showing balanced accuracy for all 45 classes as a function of training size...... 61
Figure 3.5. Number of affiliate programs of different sizes with few versus many clustering errors; see text for details. In general the larger programs have low error rates (top), while the smaller programs have very high error rates (bottom)...... 65
Figure 4.1. An illustration of search result poisoning by an SEO botnet...... 74
Figure 4.2. Example of a poisoned search result...... 76
Figure 4.3. Examples of counterfeit luxury storefronts forging four brands (in row order): Louis Vuitton, Ugg, Moncler, and Beats By Dre. 77
Figure 4.4. Counterfeits of Gucci products offered at false discounts. Curiously, every product's retail price is the same...... 78
Figure 4.5. Flowchart depicting one round of classification...... 92
Figure 4.6. Stacked area plots ascribing PSRs to specific SEO campaigns in four different verticals. (This is Figure 2 from [84].) ...... 102
Figure 5.1. Breakdown of classes for domains in new TLDs...... 123
Figure 5.2. Number of parked domains by service. The number on top of each bar indicates how many distinct TLDs used that parking service...... 126
Figure 5.3. Percentage of relevant Web pages in ten TLDs...... 133
LIST OF TABLES
Table 2.1. The number of storefronts in the original data set for all forty-four affiliate programs. Aggregate programs are indicated as ZedCash* and Mailien†...... 13
Table 3.1. Screenshots of online storefronts selling counterfeit pharmaceuticals, replicas, and software...... 40
Table 3.2. Summary of the data from crawls of consecutive three-month periods...... 45
Table 3.3. Feature counts, density of the data, dimensionality after principal components analysis, and percentage of unique examples...... 49
Table 3.4. Examples of storefronts matched by regular expressions (left column), and storefronts not matched but discovered by NN classification (right column)...... 57
Table 3.5. Data sizes and performance for select affiliate programs. For each program, the two rows show results from the first then second 3 months of the study...... 59
Table 3.6. Examples of correctly classified storefronts when there is only one training example per affiliate program. The affiliate programs shown here are 33drugs and RX-Promotion...... 63
Table 4.1. Rounds of classification, in which automatic predictions are manually verified. At each round, we specify the total number of storefront Web pages, the number that we have labeled, and the number of associated SEO campaigns...... 94
Table 4.2. The ten most distinctive features of the msvalidate campaign, along with their corresponding weights. The first column indicates whether the feature was extracted from the storefront (s) or doorway (d)...... 95
Table 4.3. The five most likely candidates for the msvalidate campaign, all of which had probability nearly 1 and were verified as correct. 95
Table 4.4. Breakdown of the prediction that the louisvuittonicon.com storefront is affiliated with the msvalidate campaign. The first column indicates the feature source: storefront (s) or doorway (d)...... 96
Table 4.5. Sizes of storefront clusters and doorway clusters. For example, there is 1 cluster containing 8 storefronts, and 2 clusters containing 8 doorways each...... 97
Table 4.6. Sixteen luxury verticals and the associated # of PSRs, doorways, stores, and campaigns that target them. The key campaign, the first one we identified that guided our study, targeted all verticals except those with an ‘*’...... 99
Table 4.7. Classified campaigns (with 30+ doorways) and the # of associated doorways, stores, brands targeted, and peak poisoning duration in days. (This is a subset of Table 2 from [84].) ...... 100
Table 5.1. The ten largest new TLDs, when they appeared in the root zone, and prices of registering a domain in them...... 114
Table 5.2. Examples of parked, unused, suspended, and junk Web pages. 120
Table 5.3. Classification flux between two Web crawls 20 weeks apart. . . . 124
Table 5.4. Number of registered domains, and percentage of contentful domains, in ten specialized TLDs...... 127
Table 5.5. Most likely collocating words for ten TLD words...... 130
Table 5.6. Web pages that are (ir)relevant to their TLD. Shown for each page are the second-level domain, relevance score as given by eq. (5.1), and a screenshot...... 132
ACKNOWLEDGEMENTS

First and foremost, I would like to thank my advisors: Professors Lawrence Saul, Geoff Voelker, and Stefan Savage. I feel so humbled and grateful to have had the privilege of working with them. I have the utmost respect for them as researchers, advisors, and teachers, but I have equal admiration for the people they are. The combination of their personalities created an affable and witty group dynamic, which made working together a real pleasure. In addition, I greatly appreciate their involvement and service to the department; their efforts help engender a gregarious community that makes CSE a special place. Also, their passion for volunteering trickles down to their students.

I thank Lawrence further for his time and invaluable feedback on dry runs of my presentations, his guidance and course materials for the Intro to AI course I taught, and all of our conversations about sports.

Big thanks to Gert Lanckriet and Kirill Levchenko for taking time to serve on my thesis committee.

Another CSE professor I must credit is Sorin Lerner, with whom I had the pleasure of working on Visit Day and Gradcom over the years. I applaud his ability to corral abundant opinions, remain focused on the task at hand, and get things done.

I thank my department colleagues and co-authors Do-kyum Kim, Tristan Halvorson, and David Wang; I thoroughly enjoyed working with them. I always had more fun collaborating than working in isolation, and one of the greatest pleasures of graduate school was learning from my peers.

I owe a huge debt of gratitude to CSE staff member Jessica Gross, who had an answer or pointer for every one of my administrative questions. I admire and thank her for her willingness to help. Also, thanks to Julie Conner for shepherding
xiii me from beginning to end of the program. Beyond the walls of EBU3B, I would like to thank Ian Roxborough, my intern host at Google San Francisco in the summers of 2011 and 2012. It was a joy to work with him and the Internal Privacy team. He is a badass software engineer and virtuous champion of privacy who reassured me that Googlers genuinely do care about it. Also, he strongly reinforced the importance of software testing. Furthermore, Ian was an outstanding mentor and good friend. I cherished our daily (and supremely delicious) lunches in the sunshine by the Bay Bridge, and our competitive and fun games of pool. I thank him profoundly for granting me this special opportunity and experience. My graduate study would not have been possible without the teaching, advising, and inspiration I received from my mentors at the University of Rich- mond, Professors Jim Davis and Barry Lawson. Jim offered me my first research opportunity and through him, I learned that research is intellectually stimulating, challenging but fun, daunting but rewarding. Additionally, he is a paragon of work-life balance, a true role model for me who shares similar values and priorities. I find it impossible how Jim thrives as a teacher and researcher, but is a devoted family man who leaves work at work so he can invest in his family, and somehow still finds time for hobbies of his own. Barry welcomed me as I shifted my gaze from mathematics to computer science. I completed my honors thesis with him on a topic of mutual interest, which was the seed from which my graduate research grew. He was The Cool Professor who I looked up to, and who strikes a mighty fine work-life balance himself. Today, I am glad to call him a lifelong friend. I thank both Jim and Barry for making my undergraduate experience what it was, and for the encouragement to go onward and upward into graduate school.
Most importantly, I would like to thank my family for their love and support. Family is the most special thing to me in life, and my happiest days are the ones spent with them. I have missed them dearly during my sojourn on the West Coast. I am deeply grateful for the way my parents raised us four kids; one of their greatest gifts to us was allowing us the independence to make our own way. I tip my hat to my brother Bryan, who blazed the academic trail for me and who beat me to “First PhD in the Family” by two years. Also, I thank my brother David for presenting me with an exciting next chapter in my career. I am absolutely thrilled to work and learn together with him.

In the absence of immediate family, Dan, Cindy, and Molly Ennis became my family in San Diego; I thank them to the moon and back. I am very blessed to have met them, and to have them in my life now and always.

Another friend I wish to thank is fellow computer scientist Daniel Ricketts; we both started at UCSD in Fall 2010. He was my workout partner, both in the gym and on the pitch (i.e., soccer field). I took him under my wing, and it was a joy to watch him blossom into the confident and polished soccer player he is today. The dream to play for the U.S. Men's National Team (USMNT) is very much alive, and they need us now more than ever. But in all seriousness, Dan was my go-to guy whenever I wanted an honest and trusted opinion; his voice of reason countered my emotion. I value his input tremendously, and want to especially thank him for encouraging me to teach a summer course.

Some of the most fun I have had in graduate school is playing intramural sports, soccer in particular. I wish to single out my longest tenured teammate and good friend, Gary Johnston. He was my most trusted teammate, a great athlete and fierce competitor who could unintentionally make me laugh even in the heat of battle. I thank all of my teammates, especially the ones on the championship
Cherrypickers team. I loved competing with them, I was honored to serve as their captain, and I will miss leading them to victory!

Yet another friend I would like to thank is Brown Farinholt, a fellow Richmonder! Having someone from my hometown enter the department was very refreshing, and it is impossible to not have fun whilst spending time with Brown. I look forward to seeing him down the road.

I want to thank many more friends for the camaraderie, experiences, and laughs we shared: fellow CSE Fall 2010 doctoral inductees Sheeraz Ahmad, Akshay Balsubramani, Karyn Benson, Alan Leung, Wilson Lian, Daniel Moeller, and Mingxun Wang; department colleagues and soccer teammates Yashar Asgarieh, Stefan Schneider, Chris Tosh, Michael Walter, and, of course, Professor Yannis Papakonstantinou, who finally earned his first championship T-shirt with us; officemates Mayank Dhiman, David Kohlbrenner, and Ding Yuan; other CSE friends Rohan Anil, Alex Bakst, Dimo Bounov, Neha Chachra, Anukool Junnarkar, Alex Kapravelos (via UCSB), Sam Kwak, Keaton Mowery, Zach Murez, Nima Nikzad, Vivek Ramavajjala, Dustin Richmond, Valentin Robert, and misc king Arjun Roy; and, last but not least, nanoengineer Sohini Manna.

Finally, I would like to sincerely thank my esteemed roommates of the past three years — Garrett Graham, Davis Graham, Austin Kieffer, Zack Jones, Matt Hoffman, Aaron Polhamus, and Jack Silva — for making our apartment a respectful, comfortable, safe, and convivial place to call home.

Chapter 3, in part, is a reprint of the material as it appears in Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) 2014. Der, Matthew F.; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was the primary investigator and author of this paper.

Chapter 4, in part, is a reprint of the material as it appears in Proceedings
of the Internet Measurement Conference (IMC) 2014. Wang, David; Der, Matthew F.; Karami, Mohammad; Saul, Lawrence K.; McCoy, Damon; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was one of the primary investigators and authors of this paper.

Chapter 5, in part, is currently being prepared for submission for publication of the material. Der, Matthew F.; Halvorson, Tristan; Saul, Lawrence K.; Savage, Stefan; Voelker, Geoffrey M. The dissertation author was the primary investigator and author of this material.
VITA
2010    Bachelor of Science in Computer Science and Mathematics, University of Richmond

2013    Master of Science in Computer Science, University of California, San Diego

2015    Doctor of Philosophy in Computer Science, University of California, San Diego
PUBLICATIONS
Tristan Halvorson, Matthew F. Der, Ian Foster, Stefan Savage, Lawrence K. Saul, Geoffrey M. Voelker. “From .academy to .zone: An Analysis of the New TLD Land Rush.” To appear in Proceedings of the 15th ACM SIGCOMM Conference on Internet Measurement, Tokyo, Japan, 2015.
David Wang, Matthew F. Der, Mohammad Karami, Lawrence K. Saul, Damon McCoy, Stefan Savage, Geoffrey M. Voelker. “Search + Seizure: The Effectiveness of Interventions on SEO Campaigns.” In Proceedings of the 14th ACM SIGCOMM Conference on Internet Measurement, Vancouver, BC, Canada, 2014.
Matthew F. Der, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker. “Knock It Off: Profiling the Online Storefronts of Counterfeit Merchandise.” In Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, New York City, New York, USA, 2014.
Do-kyum Kim, Matthew F. Der, Lawrence K. Saul. “A Gaussian Latent Variable Model for Large Margin Classification of Labeled and Unlabeled Data.” In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 2014.
Matthew F. Der, Lawrence K. Saul. “Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning.” In Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, USA, 2012.
Corneliu Bodea, Calina Copos, Matthew F. Der, David O’Neal, James A. Davis. “Shared Autocorrelation Property of Sequences.” In IEEE Transactions on Information Theory, Volume 57, Issue 6, 3805-3809, June 2011.
ABSTRACT OF THE DISSERTATION
Investigating Large-Scale Internet Abuse Through Web Page Classification
by
Matthew F. Der
Doctor of Philosophy in Computer Science
University of California, San Diego, 2015
Professor Lawrence K. Saul, Co-Chair
Professor Stefan Savage, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair
The Internet is rife with abuse: examples include spam, phishing, malicious advertising, DNS abuse, search poisoning, click fraud, and so on. To detect, investigate, and defend against such abuse, security efforts frequently crawl large sets of Web sites that need to be classified into categories, e.g., the attacker behind the abuse or the type of abuse. Domain expertise is often required at first, but classifying thousands to even millions of Web pages manually is infeasible. In this dissertation, I develop machine
learning tools to help security practitioners classify Web pages at scale. These automated, data-driven methods are made possible by the efforts of miscreants to operate at scale. Crafting every scam from scratch is too expensive, so miscreants use some degree of automation and replication to recreate their attacks. As a result, underlying similarities in both Web site content and structure can link related pages together. In the end, this automated classification of “big data” collected from the Web has significant impact, as it enables large-scale measurement and informs potential defensive interventions.

This dissertation focuses on three applications. First, I present a system for monitoring Web sites that serve as online storefronts for spam-advertised goods. The system is highly accurate, even when training data is very limited. Second, I describe a system for identifying the black hat SEO campaigns that promote online stores selling counterfeit luxury goods. This system was used to nearly double the number of known campaigns to track, and increase the number of associated stores by 69%. Third, I discuss a system for categorizing the Web content hosted in new top-level domains. In total, this system was used to classify 4.1 million domains in 480 new TLDs.

Overall, today's scale of well-organized cybercrime demands the use of scalable defensive analysis. This setting is where the data-driven techniques of machine learning prove especially useful. Furthermore, large-scale classification has become a frequent need in security, and our methods are more generally applicable to problems beyond just the ones documented in this dissertation.
Chapter 1
Introduction
The Internet ignited a technological revolution that dramatically transformed society, both culturally and economically. Today, the Internet bustles with a variety of activity and content: email, news, multimedia, banking, shopping, social media, and so on. Along with enabling near-instant communication and access to information, the World Wide Web has also become a massive marketplace for consumers. There is great opportunity for big business since customers have immediate access to virtual stores and need not travel to physical ones; plus, every Internet user is a potential customer. Further still, even the parts of the Internet not devoted to e-commerce largely serve as a portal to it, given the ubiquity of advertisements which are often customized for the individual user.

Inevitably, this marketplace attracts both legitimate and illegitimate businesses alike. In either case, the goal is to get users to visit their Web sites, and ultimately make a purchase. To acquire customers, illegal businesses commonly resort to malicious or deceptive techniques to amass user traffic: for example, spam emails containing links to online storefronts, and black hat search engine optimization (SEO) through which high-ranking search results are rewired to direct users to a fraudulent store.

Interestingly, the Internet namespace is its own unique marketplace, where
domain names are scarce and exclusive — and in turn, potentially quite valuable — resources. Not every domain owner publishes useful content, either: while much of the Web is teeming with daily traffic, a significant portion of it remains underdeveloped and largely unused. Instead of hosting meaningful content, some domain owners just sit on their online property and try to earn money passively by serving automatically spun ads, or by reselling their domain name in the future. While such practice, called domain parking, is not necessarily abusive, the real value it adds to the social good is certainly questionable.

A common theme resonates among e-commerce (legal or not), advertising, domain parking, and many other online services: a substantial percentage of Internet activity is motivated by making money. Indeed, in recent years security researchers have found considerable success in adopting a socio-economic perspective toward computer security: follow the money to learn about the business models and relationships among actors behind these online enterprises (e.g., [44, 38, 56, 55, 80, 57, 33, 5, 61]).

This approach for investigating and combatting cybercrime is engineered through empiricism — i.e., through large-scale measurement and analysis of abuse on the Internet. Specifically, the usual methodology involves crawling sizable collections of Web sites, which then must be organized and evaluated to draw quantitative, evidence-based conclusions. In security applications, for example, a set of Web pages needs to be classified into categories, such as the type of abuse, or the attacker behind the abuse. For a given application, to begin understanding the problem space unavoidably requires domain expertise. However, manual review is a laborious job for the human expert, and a fully manual approach indubitably does not scale.

This encumbrance impels the need for automated methods to categorize Web pages accurately and efficiently to aid security practitioners. The premise of this dissertation is that machine learning tools can fulfill precisely this demand. Automated machine learning techniques are effective solutions in part due to the regularity of replication on the Internet, where many duplicate or near-duplicate Web pages exist. For e-crime in particular, miscreants wish to operate at scale, as consistently crafting unique scams from scratch is far too expensive. Thus, at least some degree of replication and automation is behind their attacks, and therefore, underlying similarities in Web sites can link related pages together. Essentially, miscreants trade off stealth for scale.

The impact of machine learning in this setting is that it removes the domain expert as a bottleneck in empirical assessment. Large-scale measurement becomes feasible, and from a security angle, the results can reveal choke points in the attackers' infrastructure, and consequently inform defensive interventions.

The task of organizing crawled Web sites is one of data labeling. Much work in the machine learning field focuses on supervised learning, where classifiers are trained on a data set with ground truth labels. In contrast, Web pages collected from the wild lack ground truth labels by nature. The task, then, is to establish ground truth through assigning Web pages a label indicating their categorization. Of course, we want to make this process most efficient, transferring as much of the workload as possible from the domain expert to the machine.
Labeling data necessarily begins with manual inspection to understand the problem and recognize early patterns, but we can eventually bootstrap labeling from an initially manual procedure to a more automated one. At some point, the expert will have labeled a sufficient sample of data for training a machine learning model, which then can scale up the labeling process by making many automatic predictions. Unfortunately, we cannot blindly accept the predicted labels as ground truth. We can conduct proof-of-concept experiments to gain confidence in the model's performance, yet still the human must be brought back into the loop to validate predictions. Due to the sheer size of the data, it is impractical to validate all predictions, and a representative random sample must suffice. Advantageously, though, after this validation step, the expanded set of labeled data can be used to train even more accurate models, which then make more confident predictions on the remaining unlabeled data. In this way, data labeling becomes an iterative approach occurring in rounds.

To summarize the overall methodology, the first round starts with a domain expert labeling some data by hand, scales with machine learning models that predict labels automatically, and concludes with the expert validating (a sample of) the predictions. Then, we iterate the process with repeated rounds of retraining the model, making more predictions, and validating select predictions, until the labeling is finished. To establish ground truth, human-in-the-loop is unavoidable, but performing this task at scale is only possible with automation in a human-machine system.

My thesis is that machine learning offers an invaluable toolkit for data-driven security researchers, who face a repeated problem of partitioning large sets of crawled Web pages into categories.
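To make this round-based workflow concrete, the sketch below shows one plausible shape of such a human-in-the-loop labeling loop. It is illustrative only: the choice of classifier, the confidence threshold, and the manually_verify placeholder are assumptions for exposition, not the specific systems developed in Chapters 3 through 5.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def manually_verify(indices, predictions):
        # Placeholder for the domain expert: in practice a human reviews a
        # representative sample of these pages and confirms or rejects each
        # predicted label. Here we simply accept everything.
        return list(zip(indices, predictions))

    def iterative_labeling(X, y, labeled_idx, confidence=0.95, rounds=5):
        """Round-based labeling: train on the labeled pool, predict on the
        rest, have an expert verify confident predictions, grow the pool,
        and repeat.

        X           : feature matrix (e.g., bag-of-words counts), one row per page
        y           : label array; entries for unlabeled pages are ignored
        labeled_idx : indices of pages the expert has already labeled by hand
        """
        labeled = set(labeled_idx)
        for _ in range(rounds):
            unlabeled = np.array(sorted(set(range(len(X))) - labeled))
            if len(unlabeled) == 0:
                break
            # Train on everything labeled so far.
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[list(labeled)], y[list(labeled)])
            # Predict on the remaining pages; keep only confident predictions.
            proba = clf.predict_proba(X[unlabeled])
            confident = unlabeled[proba.max(axis=1) >= confidence]
            preds = clf.predict(X[confident])
            # The expert validates (a sample of) the confident predictions;
            # verified pages join the labeled pool for the next round.
            for i, label in manually_verify(confident, preds):
                y[i] = label
                labeled.add(i)
        return y, labeled

The essential design point is that the expert's effort is spent only on verifying a sample of confident predictions, while the model absorbs the bulk of the labeling work.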
1.1 Contributions
This dissertation addresses three particular instances of this general problem of organizing large collections of Web pages crawled on the Internet. The first instance arises from a holistic analysis of the spam ecosystem. Illegitimate businesses called affiliate programs run online stores selling counterfeit goods, and they team with spammers to advertise their stores via email spam. I seek to identify which storefront Web sites belong to which affiliate programs — a technical challenge behind a defensive intervention with considerable economic
impact. The second instance also involves online counterfeit stores, but instead of acquiring user traffic via email spam, here the criminals use black hat SEO — specifically, search result poisoning and cloaking — to lure users to their stores. The goal is analogous as well: I develop techniques to match these storefront Web pages to the SEO campaigns that promote them. The third instance emerges from the recent and dramatic expansion of the Internet namespace, wherein the Internet Corporation for Assigned Names and Numbers (ICANN) introduced a program for delegating new top-level domains (TLDs). From a crawl of millions of registered domains in hundreds of new TLDs, I measure how domain registrants are using their new domain names. The successful results for all three studies prove the critical role of machine learning in empirical approaches to Internet security. Specifically, the contributions of this dissertation are as follows:
• I present a system that solves a classification problem of identifying the affiliate programs that manage online storefronts selling counterfeit merchandise. This task underlies a crucial bottleneck in the spam business model (to be explained in Chapters 2 and 3), illuminating a promising point for defensive intervention. Classification is highly accurate, and the methodology is significantly more automated than the approach first used by other researchers. Furthermore, the classifier discovers additional storefronts belonging to known affiliate programs—false negatives which were previously missed. In mimicking an operational deployment when training data is limited, the system remains highly accurate. Lastly, the system can unveil a new affiliate program for which we have no labeled examples.
• I describe a system for linking counterfeit luxury stores to the SEO campaigns which promote them in poisoned search results. The system is designed for maximal usability by a security researcher who is not a machine learning expert; it is fast to train, fast to make predictions, and its output is highly interpretable. Additionally, the system can discover new SEO campaigns by grouping together highly similar unlabeled stores.
• I introduce an iterative system for classifying the Web content of millions of registered domains in hundreds of new top-level domains. The system performs rounds of clustering, labeling, classification, and validation. Also, I classify the same set of domains but twenty weeks later to see how domain owners change their Web presence over time. Secondly, for domains hosting legitimate Web content in a subset of ten specialized TLDs, I develop a statistical language model for judging how relevant the content is to the TLD’s target community.
I stress that this machine learning methodology is broadly applicable to data-driven cybercrime research and more general than just the aforementioned applications. In that spirit, this dissertation also serves as a model for other researchers who wish to solve similar problems.
1.2 Organization
The remainder of this dissertation is organized as follows. Chapter 2 provides context for the three problems I explore, background on bag-of-words feature extraction which I use for each problem, and a brief review of related work.
Chapter 3 presents a system for classifying spam-advertised storefronts according to the affiliate program that manages them. Chapter 4 describes a system for identifying the different SEO campaigns which promote counterfeit luxury stores in poisoned search results. Chapter 5 explains a methodology for categorizing registered domains in new TLDs. In addition, for domains that host meaningful Web content in a sample of ten specialized TLDs, I introduce a technique for estimating how relevant the content is to the TLD's intended purpose. Finally, Chapter 6 summarizes the contributions of this dissertation and offers potential directions for future work.

Chapter 2
Background
This chapter imparts all necessary context for understanding the contributions of this dissertation. Sections 2.1 to 2.3 provide a high-level overview of the three security applications we study. Section 2.4 explains the basics of bag-of-words feature extraction, which plays a fundamental role in all of our approaches. Lastly, Section 2.5 discusses select related work.
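Because the bag-of-words representation of Section 2.4 underlies each of these approaches, a minimal sketch of the idea may be useful up front: a page's text is tokenized and reduced to a vector of term counts, so that near-duplicate pages yield nearly identical vectors. The tokenizer and the two toy page snippets below are illustrative assumptions, not the exact feature extraction pipeline used in later chapters.

    import re
    from collections import Counter

    def bag_of_words(text):
        """Lowercase, tokenize on alphanumeric runs, and count term occurrences."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(tokens)

    def to_vector(counts, vocabulary):
        """Project a page's term counts onto a fixed vocabulary ordering."""
        return [counts.get(term, 0) for term in vocabulary]

    # Toy example: two near-duplicate storefront snippets produce nearly
    # identical count vectors, which is what links replicated pages together.
    page_a = "Cheap brand pills. Order pills online today."
    page_b = "Cheap brand pills! Order pills online now."
    counts_a, counts_b = bag_of_words(page_a), bag_of_words(page_b)
    vocabulary = sorted(set(counts_a) | set(counts_b))
    print(to_vector(counts_a, vocabulary))
    print(to_vector(counts_b, vocabulary))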
2.1 The Spam Ecosystem
Spam, or unsolicited bulk email, is a decades-old and widespread problem affecting anyone with an email account. Even though anti-spam tools have developed into a multi-billion dollar industry, spam continues to flood inboxes today. Viewed through an economic lens, this malpractice, which is essentially an advertising business, persists because spammers are driving a profitable venture; they will continue spamming as long as they are making money.

This reality suggests a different approach to combatting spam that is potentially more fruitful than simply trying to prevent spam delivery. The hypothesis is that by investigating the entire spam ecosystem, we can learn about the underlying business model and, in turn, discover parts of it that are most susceptible to defensive action.
2.1.1 Spam Value Chain
Today, spam is a much more complex enterprise than it once was; now it is a highly diversified operation in which responsibilities are distributed among many entities. Some components have been studied in isolation — primarily spam delivery since that is what the user directly experiences — but until recently, the security community lacked a comprehensive view and understanding of the whole ensemble. This void was filled by the initiative of Levchenko et al. [44], who conducted a holistic, end-to-end analysis of the full spam value chain — the infrastructure and coordinating organizations which monetize spam.

More specifically, the endeavor sought to identify the diversity of participating actors, understand their individual roles as well as the relationships among them, and gauge the width of bottlenecks and cost (both replacement and opportunity) if a link in the chain were taken out. To accomplish this, the researchers carried out a large-scale empirical study in which they traced a complete “click trajectory,” from the initial act of following a URL advertised in a spam email to the final selective purchasing of goods from counterfeit stores. Their work showed that the spam ecosystem can be decomposed into three main functions: advertising, click support, and realization. We summarize each function below, but refer the reader to their paper for a more thorough treatment.

Advertising. Spammers mainly operate as advertisers, trying to lure potential customers into clicking their URLs. They serve as affiliates of fraudulent stores and earn a 30-50% commission. We note that email is just one of many advertising vectors, but it is the most popular.

Click support. After a user clicks on a spam-advertised URL, several pieces are required to take the user to an online store. Rarely does the URL directly link to the store; instead, redirection sites are used to evade domain blacklisting and takedowns. Of course, the store must be reachable somewhere, so spammers or their associates acquire domain names to publish the storefront sites. Now, the spammers or third parties secure the standard infrastructure necessary for supporting any Web site on the Internet. In particular, they need name servers and Web servers; the name servers provide address records that indicate which Web servers host their sites. The stores themselves are managed by illegitimate businesses called affiliate programs, who fulfill numerous back-end obligations. They provide advertising materials, storefront templates, shopping cart management, analytics help, and Web interfaces for their affiliates to track conversions and register for payouts. In addition, affiliate programs contract payment and fulfillment services to outside parties.

Realization. Finally, the tail end of the spam value chain consists of resources needed to complete a sale. Specifically, payment services enable online purchases, involving a merchant (or “acquiring”) bank and usually a payment processor as well. Then, fulfillment services ship physical products to customers; virtual products like software, music, and videos are available via direct download on the Internet.
2.1.2 Click Trajectories Finding
The collective effort of Levchenko et al. crawled nearly 15M spam-advertised URLs, 6M of which successfully resolved and rendered Web content. After filtering out any landing page that was not a storefront, they were left with a set of roughly 221K unique storefront Web pages. Their investigation revealed that the vast majority of these storefronts are managed by forty-five prevalent affiliate programs; even further, they found that these affiliate programs use the same few merchant banks. Thus, solving a classification problem — mapping storefront Web pages to the affiliate programs behind them — was crucial to unveiling this choke point in the spam value chain.

One downside, though, was that solving this classification problem took a great deal of manual effort by the research team. This motivated us to perform classification in a more automated way, which is the basis of the material we present in Chapter 3.
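To preview the flavor of that more automated approach, the sketch below assigns an unlabeled storefront the label of its most similar labeled storefront, using cosine distance over bag-of-words count vectors. The distance function, the tiny made-up vectors, and the program names are illustrative assumptions rather than the tuned classifier described in Chapter 3.

    import numpy as np

    def cosine_distance(u, v):
        """1 - cosine similarity; small when two count vectors point the same way."""
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return 1.0 - (u @ v) / denom if denom > 0 else 1.0

    def nearest_neighbor_label(query, labeled_vectors, labels):
        """Assign the label of the closest labeled storefront vector."""
        distances = [cosine_distance(query, v) for v in labeled_vectors]
        return labels[int(np.argmin(distances))]

    # Toy example with made-up 4-term count vectors for two affiliate programs.
    labeled_vectors = np.array([[5, 0, 2, 1],    # program A storefront
                                [4, 1, 2, 0],    # program A storefront
                                [0, 6, 0, 3]])   # program B storefront
    labels = ["program A", "program A", "program B"]
    query = np.array([5, 0, 3, 1])
    print(nearest_neighbor_label(query, labeled_vectors, labels))  # -> "program A"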
2.1.3 Affiliate Programs
Affiliate programs have controlled the spam business for years now. The Federal Trade Commission shut down a few of them in 2008 [14], but the takedown was a rare case of combatting affiliate programs directly. Indeed, other affiliate programs (called “partnerka” in Russia) continued to flourish, as evidenced by Samosseiko's report [75] which unveiled their business model and economics. Levchenko et al. [44] advanced our understanding of the central position these programs have in the spam value chain, and also inspired further work that explored their economics in more depth.

Kanich et al. [38] introduced two measurement techniques for estimating order volume (and hence revenue) and purchasing behavior (what products to which customers). They found that ten leading affiliate programs (seven pharmaceutical, three counterfeit software) collected over 119K monthly orders, with the most prolific programs each generating over $1M in revenue per month. Also, the distribution of pharmaceutical purchases is mostly male enhancement drugs (62%) but has a long tail (289 distinct products). Thirdly, spam is largely funded by Western purchases, the U.S. in particular.
McCoy et al. [56] had a unique vantage point into the economics of affiliate programs: they obtained four years' worth of raw transactional data from three prominent pharmaceutical programs (GlavMed, SpamIt, and RX-Promotion). Through mining this data, which covered hundreds of thousands of orders totaling over $185M in revenue, they learned about product demand, customer behavior, affiliates (advertisers), payment service providers, and costs. Their results include the following: (i) erectile dysfunction purchases dominate revenue; (ii) an appreciable fraction of purchases are from repeat customers; (iii) a few big affiliates out of hundreds to thousands dominate the market; (iv) only a handful of payment processors handle most of the transaction volume; and (v) while business is seemingly thriving, substantial costs whittle profit down to under 20% of sales.

McCoy et al. [55] also homed in on the payment portion of the spam ecosystem, following up on the previous works suggesting that financial services are a susceptible bottleneck in the value chain. They characterized the relationship between affiliate programs and their acquiring banks, and documented the effects and reactions to interventions by brand holders and payment card networks. They showed how concentrated efforts against just a small set of high-risk banks can terminate merchant accounts within weeks and severely disrupt the business of many affiliate programs.

In our work, we track a total of forty-four1 affiliate programs across three product categories: pharmaceuticals, replica luxury goods, and counterfeit software.
1Levchenko et al.'s paper discusses forty-five affiliate programs, but, due to an artifact of their regular expression matcher, any storefront that matched the Stimul-cash program automatically matched the RX Partners program as well. Hence, Stimul-cash storefronts were a proper subset of RX Partners'. However, this outcome actually reflects the true circumstance, as the Stimul-cash program was acquired by the owners of RX Partners, roughly in 2008 [55]. Thus, we ignored Stimul-cash as a distinct program in our study.
Table 2.1 shows the number of storefronts we had for each program at the outset of our study. There are two aggregate programs, Mailien and ZedCash, whose constituents are indicated in the table. Mailien administers two pharma brands, while ZedCash manages seven pharma and nine replica brands. ZedCash is unique in that it has storefront brands for more than one product category. We explain how we classify storefronts by their sponsoring affiliate programs in Chapter 3.
Table 2.1. The number of storefronts in the original data set for all forty-four affiliate programs. Aggregate programs are indicated as ZedCash* and Mailien†.
Category        Affiliate Program               Storefronts

Replica         WatchShop                               472
                Affordable Accessories*                 341
                One Replica*                            269
                Ultimate Replica*                        79
                Prestige Replicas*                       32
                Diamond Replicas*                        14
                Luxury Replica*                          13
                Distinction Replica*                     11
                Exquisite Replicas*                       9
                Swiss Replica & Co.*                      2

Software        EuroSoft                              2,215
                Royal Software                          808
                Authorized Software Resellers           314
                Soft Sales                               39
                OEM Soft Store                           24

Pharmaceutical  EvaPharmacy                          58,215
                Pharmacy Express†                    44,017
                RX-Promotions                        37,245
                Online Pharmacy                      16,546
                GlavMed                               6,616
                World Pharmacy                        4,340
                Greenline                             3,857
                RX Partners                           1,486
                RX Rev Share                            548
                Canadian Pharmacy                       261
                33Drugs2                                181
                ED Express†                              77
                RXCash                                   72
                MediTrust                                42
                MaxGentleman*                            22
                PH Online                                20
                Dr. Maxman*                              19
                Club-first                               14
                Stallion                                 11
                MAXX Extend                               9
                Viagrow*                                  8
                HerbalGrowth                              8
                US HealthCare*                            6
                ManXtenz*                                 4
                VigREX*                                   4
                Swiss Apotheke                            3
                Stud Extreme*                             3
                Ultimate Pharmacy                         3
                Virility                                  2

Total                                               178,281
2Also known as DrugRevenue.
2.2 SEO and Search Poisoning
The popularity of the Internet demands that a business have a strong Web presence. Certainly, its Web site should be easy for users to find. One key ingredient, then, is a good domain name — memorable, short, and therefore simple to type into the address bar. Visiting a Web site in this way, called direct traffic, is one avenue for users to reach the site. However, this direct avenue is not the most common one; that distinction belongs to search engines. Specifically, organic search traffic3, or users who enter a search query and click through a search result, is the dominant source of traffic to business sites, per estimates in 2014 [23]. Thus, businesses focus on search as a primary means for attracting users.

The crux of search traffic is that the amount a Web site receives depends strongly on its search ranking, or how early in search result listings the site appears. Search engines are designed to answer user queries with the most relevant and highest quality Web sites; in turn, users are much more likely to click the top results than to probe the long tail. Therefore, webmasters strive to build Web sites that search engines will rank highly for pertinent queries. Perhaps more important factors for a business's Web presence, then, are its quality, relevance to certain keywords, reputation, relationship with other reputable Web sites, and so on — all of which contribute to the site's search ranking. The wealth of techniques used to improve search ranking is collectively known as search engine optimization, or SEO.

Search engines have guidelines and rules that webmasters should follow to ensure the integrity and user-friendliness of their sites. Good practice techniques that adhere to these policies are called white hat SEO (e.g., sitemaps, friendly
3Organic search traffic is distinct from paid search traffic: users who enter a search query and click through a pay-per-click advertisement. Paid search comprises 10% of all traffic to business sites [23]. Henceforth, we use the term “search traffic” to denote organic search traffic.
URLs, fast load times); bad practice techniques that disregard them are called black hat SEO (e.g., keyword stuffing, link farms, hidden text). Indeed, miscreants utilize surreptitious tricks that exploit a search engine's algorithms to abusively boost their search rank. This illicit activity — known as search (result/engine) poisoning, spamdexing, Web/search spam, et cetera — is used to garner click traffic; then, clicks are monetized through various scams such as fake anti-virus, malware download, phishing, and counterfeit stores.

Our work focuses on the counterfeit luxury market, where e-commerce Web sites pose as authentic high-end storefronts, but actually sell cheap knockoff merchandise. These fraudulent stores employ black hat SEO campaigns — coordinated efforts to abusively elevate search ranking — to lure more visitors and ultimately increase sales. We wish to investigate this collusion among attackers: how do they operate, and how is their business structured? This kind of holistic perspective is needed to bolster defenses; existing countermeasures, which target spammed search results or individual counterfeit stores in isolation, are inadequate. Accordingly, our efforts here are similar to the email spam study in both motivation and approach. We seek an ecosystem understanding and bottleneck analysis of Web search spam, and we achieve it through empirical measurement. We provide additional background and present our work on this problem, led by David Wang [84], in Chapter 4.

Our research builds upon prior work from Wang et al. [86, 85]. They first studied a black hat SEO technique called cloaking, which enables search poisoning. They implemented a custom crawler to measure the prevalence of cloaking, as well as the response to cloaking from search engines. We elaborate on cloaking in Section 4.2.1 and touch upon their crawler in Section 4.3.2. In subsequent work, the researchers infiltrated an SEO botnet that poisons search results to lead users to scams. Their probe sheds light on the infrastructure and operation of the botnet, and how it reacts to technical interventions (flagging compromised Web sites and poisoned search results) and monetary interventions (takedown of a fake anti-virus program, the most profitable scam the botnet drove). We describe the scheme of an SEO botnet in Section 4.2.1. Chapter 4, essentially an adaptation of Wang et al.'s third installment [84], presents a unified culmination of this line of work. We document a comprehensive ecosystem-level analysis of SEO abuse on one particular scam, the counterfeit luxury market.

Security researchers have studied search result poisoning for a decade, with Wang et al. [87] contributing an early end-to-end analysis of redirection spam. Since then, many have proposed various techniques for detecting cloaking and poisoning [43, 49, 62, 86], and some have delved deeper into the activity of SEO campaigns, with Wang et al. [85] infiltrating one campaign's botnet, and John et al. [36] using clustering and signatures to identify distinct campaigns. This body of prior work underpins our own, but we contribute a new ecosystem-level viewpoint of luxury SEO and uncover the business venture's full structure. This approach relates to other recent efforts to explore underground economies [38, 39, 42, 56, 76]. In addition, our measurement extends to an evaluation of current defenses against luxury SEO, similar in objective to other work that considers the economics of defensive interventions [12, 32, 44, 48, 55, 59].
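Since cloaking comes up repeatedly before it is treated in detail in Section 4.2.1, a tiny conceptual sketch may help fix the idea: the same URL answers a search-engine crawler with keyword-stuffed content while sending an ordinary visitor (especially one arriving from a search results page) to the scam storefront. The request-handling function, header strings, and redirect target below are illustrative assumptions, not the mechanics of any specific campaign studied later.

    CRAWLER_SIGNATURES = ("googlebot", "bingbot", "slurp")   # illustrative only
    SEARCH_REFERRERS = ("google.", "bing.", "yahoo.")

    def serve_page(user_agent, referrer):
        """Conceptual cloaking logic: choose a response based on who is asking."""
        ua = (user_agent or "").lower()
        ref = (referrer or "").lower()
        if any(sig in ua for sig in CRAWLER_SIGNATURES):
            # Search engine crawler: return keyword-stuffed SEO content so the
            # page ranks for the targeted luxury queries.
            return "200 OK", "<html>cheap designer handbags ...</html>"
        if any(dom in ref for dom in SEARCH_REFERRERS):
            # Visitor clicking through a poisoned search result: redirect to
            # the counterfeit storefront being promoted.
            return "302 Found", "http://storefront.example/landing"
        # Anyone else (e.g., an analyst visiting the URL directly) sees a
        # benign page, which makes the abuse harder to observe.
        return "200 OK", "<html>Welcome to my blog</html>"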
2.3 Domain Names
Domain names — unique identifiers for places on the Internet — are an elemental part of how users navigate the Web. A catchy domain makes a place easier to remember and find. Thus, while there are almost infinitely many possible domain names, there is a drastically smaller set of desirable ones. Because good domains are a scarce resource, they hold tremendous value. So while domain names are ostensibly a simple mechanism for locating places on the Web, they became hot property and now comprise an enormous online market.
2.3.1 The DNS Business Model
The Domain Name System (DNS), introduced in the early 1980s, provides a simple technical service: to map human-readable strings to routable IP addresses, which specify the location of an Internet resource. For example, the domain ucsd.edu translates to the IP address 132.239.180.101, the location of the machine where the domain is hosted. Domain names are much easier to memorize and communicate than numerical addresses. Also, they abstract away physical location: the same string is always used to locate a resource even if its physical location changes.

The DNS is organized hierarchically, with top-level domains (TLDs) at the highest level and second-level domains right beneath. A domain name's hierarchy descends from right to left, so in the example ucsd.edu, the TLD is edu and the second-level domain is ucsd. In other terms, ucsd.edu is called a subdomain of the edu TLD.

The Internet Corporation for Assigned Names and Numbers (ICANN) is the governing body of the DNS. ICANN does not manage individual TLDs, though; they contract out that responsibility to third-party organizations called registries.
For example, Verisign operates com, and EDUCAUSE operates edu. Then, registries work with registrars — companies that sell domain names to the public. Registrars offer most domains on a first-come first-served basis for an annual registration fee. Some common registrars are GoDaddy, eNom, Network Solutions, 1&1, and Register.com. Finally, a person or company who registers a domain name is called 19
a registrant. The registrant holds exclusive ownership of the domain name, and accordingly is also known as a domain owner. Registries act as the “wholesalers” of the domain name business, as they set a flat wholesale price that ICANN must approve. ICANN requires that this price is the same for all registrars to support fair competition. Then, registrars act as the “retailers” and may mark up the price as they see fit. Registries pay ICANN a small transaction fee for each domain sold or renewed, but only if the total transactions in any calendar quarter meet a certain threshold. For new TLDs (discussed next in Section 2.3.2), the fee is $0.25, and the transaction threshold is 50,000 [73]. All terms for how money from domain name sales is apportioned are laid out in contracts between ICANN, registries, and registrars. These agreements specify additional fees as well; for example, registries who operate new TLDs owe ICANN a fixed quarterly fee of $6,250 [73], and registrars owe ICANN a $4,000 yearly accreditation fee [72].
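As a concrete illustration of the two mechanics described above — how a name maps to an address within the DNS hierarchy, and how the per-transaction and quarterly fees compose — here is a minimal sketch. The lookup simply returns whatever A record DNS currently serves, and the example registry volume (80,000 transactions in a quarter) is a made-up figure; the fee constants are the ones quoted above for new TLDs.

    import socket

    def split_domain(name):
        """Split a second-level domain like 'ucsd.edu' into (second-level, TLD)."""
        second_level, _, tld = name.rpartition(".")
        return second_level, tld

    def resolve(name):
        """Return the IPv4 address the DNS currently serves for this name."""
        return socket.gethostbyname(name)

    def quarterly_registry_fee(transactions, per_txn=0.25, threshold=50_000,
                               fixed=6_250):
        """ICANN fees owed by a new-TLD registry for one quarter, per the figures
        quoted above: a fixed quarterly fee plus $0.25 per transaction, where the
        per-transaction fee applies only if volume meets the 50,000 threshold."""
        variable = per_txn * transactions if transactions >= threshold else 0.0
        return fixed + variable

    print(split_domain("ucsd.edu"))        # -> ('ucsd', 'edu')
    print(resolve("ucsd.edu"))             # whatever the A record is today
    print(quarterly_registry_fee(80_000))  # hypothetical volume: 6250 + 20000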
2.3.2 Growth of Top-Level Domains
In January 1985, the first six TLDs were released: com, org, net4, edu, gov, and mil. These are all categorized as generic TLDs (gTLDs), which have a further distinction: sponsored or unsponsored. A sponsored gTLD is intended for a specific community and restricts domain registrations to affiliated members. A sponsoring organization who represents the community manages the TLD and enforces policies that authorize domain registration and use. An unsponsored gTLD follows standard ICANN policy and usually is open to all registrants and purposes. Of the first
six TLDs, com, org, and net are unsponsored and open to the global Internet population; the other three are sponsored and limited to smaller communities. The
4The net TLD was not included in the original RFC 920 document from October 1984 [69], but was implemented together with the five proposed TLDs.
first country code TLDs (ccTLDs) were introduced in the 1980s as well (e.g., us and uk, reserved for the United States and United Kingdom, respectively).

The com TLD prevailed as the most open and popular; as a result, com eventually became saturated and, so to speak, all the good domain names were already taken. To relieve this scarcity, ICANN launched the biz and info gTLDs in May 2001. Their initiative followed the recommendations of the Working Group C report [89], which concluded that adding new TLDs would essentially help democratize the Internet. This consensus opinion was not unanimous though, and the debate raises questions about the desired versus actual effect. With com established as the dominant Web standard, would users and businesses even consider other TLDs for finding and building a solid Web enterprise? Would it be strange for different companies to have the same second-level domain in separate TLDs? Would new TLDs encourage speculation, and also spark a race between extortionists and businesses trying to protect their trademarks?

Halvorson et al. [30] gave contemporary insight into this debate by measuring the usage of the biz TLD ten years after its inception. biz was meant to be an alternative to com, but their study showed it has gained limited traction. biz is about 46 times smaller than com (by total domains registered), has a disproportionately lower representation among the Internet's most popular Web sites, and contains only a moderate fraction of primary Web identities (i.e., not undeveloped Web sites, redirects, or duplicated content from the same subdomain in a different TLD).

ICANN released more TLDs throughout the ensuing years, including sponsored TLDs like aero in 2002 and jobs, mobi, and travel in 2005. Then in 2011, ICANN added the controversial xxx TLD after a decade-long approval process. Halvorson et al. [29] conducted a similar study of this TLD to break down its early usage and economic ramifications. Their findings validated the magnified concerns over defensive registrations from brand and trademark holders wanting to disassociate from the TLD's connotations. Defensive registrants account for almost 92% of all domain registrations, and they spent a total of $24M in the first year of registration alone.

Many of these individual TLDs endured a laborious induction, so in 2008 ICANN began developing policy to simplify and standardize the process of adding a new TLD. Then in 2011, ICANN authorized the launch of what they called the New gTLD Program, whose goals are “enhancing competition and consumer choice, and enabling the benefits of innovation” [2]. While the program seems well intentioned, the same debate persists about what value new TLDs contribute to the Internet community [25].

In this program, a new TLD goes through a process of application, evaluation, and delegation. An aspiring registry submits a comprehensive application to ICANN demonstrating that they are technically and financially capable of operating a TLD. Then, the application is open to public comments and gets reviewed by third-party expert panels. The applicant owes ICANN a $185,000 fee for the initial evaluation. This stage may also be followed by dispute resolution and string contention — the former, if anyone files an objection to the TLD; the latter, if there are multiple applications for the same TLD string. If the applicant passes these steps, then the TLD advances to delegation.
Delegating a TLD means adding it to the DNS root zone, the global list of TLDs. The TLD is then considered “live” on the Internet and opens for domain registrations shortly thereafter. Most registries that operate public TLDs open registration in consecutive phases: sunrise, land rush, and general availability. The sunrise phase is designated for trademark holders only so they can defend their names. The land rush phase is meant for eager registrants who are willing to pay a premium for high-demand domains. Finally, in the general availability phase, domain names are sold first-come, first-served at the normal registration rate.

Figure 2.1. The steady and swift rollout of new gTLDs. Dates of delegated strings were collected from [18].

ICANN opened the first application window in January 2012 and received a total of 1,930 applications. This remarkable volume evidently foreshadowed an equally remarkable expansion of the domain name space. As of October 1, 2013, there were 318 TLDs (most of them ccTLDs). By August 20, 2015, the root zone contained 1,046 TLDs — an addition of 728 in less than two years. Figure 2.1 plots this growth since the first new TLDs were delegated on October 23, 2013. This surge of hundreds of new top-level domains has been called “the biggest change to the Internet since its inception” [34]. We examine the early usage of these new TLDs in Chapter 5.
2.3.3 Abuse and Economics
The new TLD rollout opens up a fresh expanse of the Internet namespace that may be targeted by questionable, if not flatly abusive, practices. Some examples include:
• Cybersquatting: Buying a domain name related to a trademark, usually with the intent of reselling the domain to the trademark holder for an inflated price.
• Typosquatting: Registering a domain name to acquire unintentional direct traffic from users who commit a typo when entering the URL. For example,
Google prevents common instances by owning gooogle.com and googel.com, both of which simply redirect to google.com.
• Homograph attack: A domain name that spoofs an existing one by having an indistinguishable ASCII representation in certain fonts. A classic attack was the “PayPaI” phishing scam that tricked PayPal users into visiting PayPaI.com, which then stole user credentials when they attempted to log in [67].
• Domain parking: Monetizing an undeveloped domain by serving automatically spun ads and/or selling the domain at a profit.
• Domaining: A term that refers to the speculative side of domain parking: investing in domain names for future resale.
• Domaineering: A term that refers to the non-speculative side of domain parking: buying and monetizing domain names by using them as an advertising medium.
Among these activities, cybersquatting is the only one that is altogether illegal, as it infringes upon a trademark’s intellectual property rights. To police such infringements, ICANN introduced the Uniform Domain-Name Dispute-Resolution Policy (UDRP), an arrangement for resolving domain name disputes, in 1999. However, the UDRP process is both costly and lengthy. As a cheaper and faster alternative, ICANN implemented the Uniform Rapid Suspension (URS) system for eradicating the most obvious infringements in new gTLDs. Also, several companies now use brand protection services (e.g., MarkMonitor [54] and Safenames [74]) to hunt down abusive domains. Still, despite these options for defensive measures, many brand and trademark holders preemptively register congruous domains to safeguard their names.

Domain parking, a widespread practice in TLDs both old and new, engenders ongoing debate about its merits. Proponents contend that domain owners have the right to (lawfully) use their property as they wish. Furthermore, some claim that a parked domain’s advertisements, which are often thematically related to the domain name, are useful portals for finding germane information. On the other hand, opponents argue that parked domains contribute little to the Internet at large. Search engine providers seem to agree; they favor unique, quality content, and therefore do not index parked Web pages.

We do not focus on mitigating these dubious practices in this dissertation; we are, though, generally interested in the economic motivations of actors in the domain name market. Indeed, the goal behind these pursuits is to make money, and there is much of it to be made. The openness of gTLDs and first-come, first-served allocation of domain names promise tidy profits for expeditious speculators.
The most coveted domains change hands for exorbitant sums as well. In the com TLD, second-level domains that are popular generic terms, such as insurance, vacationrentals, privatejet, internet, sex, and hotels, all resold for over $10M. Also, companies buy domains that are central to their brand to enhance their
Web presence. In two high-profile instances, Facebook bought fb.com for $8.5M in November 2010, and Apple bought icloud.com for $6M in March 2011 [47]. Typosquatting further highlights the magnitude of the domain industry, as the amount of revenue generated by something as trivial as typos is astounding. A 2010 study estimated that Google earns about $497M per year from typosquatters whose domains are “misspellings” of the top 100,000 Web sites and are parked with Google ads [60].

The overall economics of the domain name market are just as staggering. The startup registry Donuts raised over $100M in 2012 to support its applications for 307 new TLDs [19]. GoDaddy, the largest domain registrar, raised $460M in an early 2015 IPO [24], and its market cap is currently around $4 billion (with future revenue based largely on its expansion plans for registering domains in new TLDs). Today, there are more than a quarter of a billion domain names registered in total; the registration fees alone amount to about $2–3B in annual revenue.
2.4 Bag-of-Words Representation
Applications such as text mining, information retrieval, and topic modeling all involve learning about a corpus of documents. A crucial decision, then, is how to represent a document. One conventional way is to use a bag-of-words representation, which works as follows. We start with an extensive set of all possible words that may appear in the corpus; this set of words, typically pre-defined, is called the vocabulary (or dictionary). Let D denote the size of the vocabulary. Now, for a single document, we count the number of times each word appears in the document. These word counts are the features of the document.
Note that while the vocabulary is inherently an unordered set of words, we must impose an ordering on the words to maintain a consistent mathematical representation of all documents. In particular, suppose the vocabulary, denoted as $V$, is a sequence of $D$ words:
$$V = \langle w_1, w_2, w_3, \ldots, w_D \rangle,$$
where $w_j$ is the $j$-th word in the vocabulary. Then, we represent a document as a length-$D$ feature vector of word counts. Specifically, a document, denoted as $\vec{x}$, is represented as the vector:
$$\vec{x} = \langle x_1, x_2, x_3, \ldots, x_D \rangle,$$
where $x_j$ is the number of times word $w_j$ appears in the document. If we do this for all (say $N$) documents, we can “stack” their feature vectors and represent the full corpus as a document-term matrix:
$$\begin{bmatrix} \longleftarrow & \vec{x}_1 & \longrightarrow \\ \longleftarrow & \vec{x}_2 & \longrightarrow \\ \longleftarrow & \vec{x}_3 & \longrightarrow \\ & \vdots & \\ \longleftarrow & \vec{x}_N & \longrightarrow \end{bmatrix},$$
where $\vec{x}_i$, the $i$-th row of the matrix, represents the $i$-th document in the corpus. This document-term matrix is an $N \times D$ matrix: there are $N$ rows (documents) and $D$ columns (terms, or words, in the vocabulary). While its dimensions can be huge, this matrix is usually sparse, as the huge size of the vocabulary dwarfs the size of an individual document.
There are some standard preprocessing steps that normally accompany a bag-of-words approach. We exclude so-called stopwords — extremely common words — from the vocabulary, as well as extremely rare words, both of which add little to no information. Also, we usually normalize each feature vector to have unit length. This converts feature values from non-negative integers to real numbers in the range [0, 1].

The most common use for bag-of-words is modeling a corpus of documents which contain natural language, like news articles, blog posts, discussion forums, and so on. However, we posit that the methodology can be extended to any type of textual document; for our purposes, we want to apply bag-of-words to Web pages, which are textual documents containing HTML code. For some applications, we may only want the natural language contained in the Web page, but for other applications, the full HTML implementation of the Web page may carry valuable information. Fundamentally, the vocabulary need not consist only of real English words; the approach is more flexible, and a “word” can be any kind of string. Ultimately, it is our decision what we consider to be useful “words” for a particular task. In Chapter 3, we explain how we encode HTML elements as words in an effort to capture the full richness of content and structure present in a Web page.

Recall that an overarching goal of this work is to alleviate the manual burden in our efforts to categorize a large corpus of Web pages. Again, some amount of manual labeling is unavoidable at the outset, but a heavily manual methodology demands additional work to decipher distinctive features for a particular category, which could help the expert find more examples from that category. This underscores a key advantage of our feature extraction approach: bag-of-words is fully automated, and we require no manual feature engineering. Instead, we leave it to the machine learning algorithm to find similarities or predictive features automatically.
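To make the representation concrete, the following minimal sketch (in Python, using an invented toy vocabulary and toy token lists standing in for crawled pages) builds unit-normalized bag-of-words vectors; it is illustrative only and is not the implementation used in later chapters.

import math
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # Count how many times each vocabulary word appears in the document.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

def unit_normalize(vector):
    # Scale a count vector to unit Euclidean length, as described above.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else vector

# Hypothetical toy vocabulary; note that "words" may be HTML artifacts,
# not just natural language terms.
vocabulary = ["viagra", "rolex", "checkout", "href", "img"]
documents = [
    ["viagra", "checkout", "viagra", "href"],   # toy "storefront" page 1
    ["rolex", "img", "checkout", "rolex"],      # toy "storefront" page 2
]

# Stacking the normalized vectors yields an N x D document-term matrix.
X = [unit_normalize(bag_of_words(doc, vocabulary)) for doc in documents]
for row in X:
    print([round(value, 3) for value in row])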
2.5 Related Work
The intersection of machine learning and computer security has grown into a wide area of research. Thus, we narrow our review to closely related work involving techniques and security applications of Web page classification.
2.5.1 Non-Machine Learning Methods
First, we observe a handful of studies that utilize non-machine learning methods, which directly motivated our own work. These methods may use clustering algorithms at first, but then rely on manual effort and automated heuristics, such as regular expressions (regexs), for classifying Web pages. In their analysis of the spam value chain, Levchenko et al. [44] cluster storefront Web pages automatically with a q-gram similarity approach. But then, to link the storefronts to affiliate programs, they manually craft a set of characteristic regexs for every program. (We elaborate on their methodology in Chapter 3.) Likewise, Halvorson et al. [30] automate the classification of parked domains
in the biz TLD using regexs that matched distinctive parts of the HTML content. They also manually classify a sample of 485 Web sites for comparison. In their
corresponding study of the xxx TLD, Halvorson et al. [29] use text shingling (commonly known as minhash, and covered in the following section) for clustering similar Web pages together. But again, they resort to manually engineered regexs to progress from clustering to classification. To detect cloaking in search results, Wang et al. [86] implement a data filtering step, followed by two scoring heuristics and a manually tuned threshold, for deciding if a Web site is cloaked. In subsequent work, they infiltrate an SEO botnet to learn how it poisons search results to drive traffic to criminal Web sites [85].
They manually cluster and classify these Web sites according to the type of scam that monetizes the traffic (e.g., fake anti-virus, drive-by downloads, or pirated media). We note that these approaches can and do work well for the most part, but classifying Web pages by hand and constructing signatures requires excessive manual effort. To alleviate this burden, researchers have turned to machine learning-based methods, which offer more efficient and robust alternatives for classification. This direction is the foundation of this dissertation.
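To make concrete the kind of hand-crafted signature these earlier studies relied on, the sketch below shows the general style in Python; the program names and patterns are invented for illustration and are not the actual regexs from [44] or [30].

import re

# Hypothetical signatures: each affiliate program is matched by a regex over
# a distinctive template artifact (a CSS path, a cart widget, etc.).
AFFILIATE_SIGNATURES = {
    "ExamplePharma": re.compile(r'class="rx-cart-button"'),
    "ExampleReplica": re.compile(r'/templates/replica_v\d+/style\.css'),
}

def label_storefront(html):
    # Return the first program whose signature appears in the raw HTML,
    # or None if no hand-written rule matches.
    for program, pattern in AFFILIATE_SIGNATURES.items():
        if pattern.search(html):
            return program
    return None

print(label_storefront('<link href="/templates/replica_v3/style.css">'))

Every new program, and every template change by an existing program, requires a human to study pages and revise such rules, which is exactly the manual effort the learning-based approach in Chapter 3 is designed to reduce.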
2.5.2 Near-Duplicate Web Pages
In many cases, the success of automated analyses hinges on the high degree of similarity among groups of Web pages. As Chapters 3 and 5 will show, our systems effectively cluster Web pages that share a very close resemblance. Though we do not appeal to these techniques in our work, well known algorithms exist for detecting near-duplicate Web pages. The leading two that are still considered state-of-the-art are termed minhash and simhash. Minhash was conceived by Broder [11, 10] and first used to eliminate redun- dancy when a search engine indexes the Web. The algorithm extracts shingles, or n-grams, from a Web page and hashes them for compression. A random sample of shingles comprise a sketch, a fixed-size set of hashes that represents the Web page. Then, to measure the resemblance (or similarity) of two Web pages, the algorithm uses the Jaccard index of their sketches — the size of the intersection divided by the size of the union. But the key insight which makes minhash fast is that the probability that two sketches have the same minimum hash is equal to the Jaccard index. Hence, the Jaccard index need not be computed in full. Ultimately, pairs of similar Web pages can be linked to form clusters of near-duplicate Web pages. 30
Later, Charikar [13] developed simhash, a related hashing scheme for efficient document comparison. This algorithm condenses a Web page down to a relatively small bit vector, and uses cosine similarity instead of the Jaccard index for comparing Web pages. The crucial insight behind simhash is that the cosine similarity of two Web pages can be estimated by the number of agreeing bits in their abbreviated vector representations. Ensuing studies contributed empirical evaluations of minhash and simhash to accompany their sound theoretical foundations. Henzinger [31] directly compared the algorithms on a task of clustering 1.6 billion Web pages, finding that simhash slightly outperformed its counterpart. Manku et al. [52] favor simhash as well, citing its even smaller fingerprints of Web pages as a main advantage.
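A correspondingly minimal sketch of the simhash idea, again assuming simple token features with unit weights and Python’s built-in hash truncated to 64 bits; weighted features and stable hash functions would be used in practice.

def simhash(tokens, bits=64):
    # Accumulate a per-bit tally: +1 when a token's hash has that bit set,
    # -1 otherwise; the sign of each tally gives one bit of the fingerprint.
    tally = [0] * bits
    for token in tokens:
        h = hash(token) & ((1 << bits) - 1)
        for i in range(bits):
            tally[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if tally[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def bit_agreement(fp_a, fp_b, bits=64):
    # The fraction of agreeing bits approximates the pages' cosine similarity.
    differing = bin(fp_a ^ fp_b).count("1")
    return 1.0 - differing / bits

page_a = "cheap replica watches free worldwide shipping".split()
page_b = "cheap replica watches fast worldwide shipping".split()
print(bit_agreement(simhash(page_a), simhash(page_b)))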
2.5.3 Web Spam and Cloaking
As introduced in Section 2.2, one of the applications we examine in this dissertation involves black hat SEO and Web spam. Spammers adopt devious tactics to manipulate the search ranking of their Web sites. Some common features of a spam Web page are a disproportionate number of backlinks (inbound links) to feign popularity, and keyword stuffing — i.e., including and repeating keywords to appear relevant to target search queries. A fair number of related studies have used statistics and machine learning for detecting this kind of abusive activity.

Some of the first automated approaches identify spam Web pages by their statistically abnormal behavior. Fetterly et al. [22] plot the distributions of certain Web site properties and inspect the outliers for spam. Some of the anomalous properties of spam sites are excessive inbound links, highly replicated SEO content, and rapid evolution in content over time (since spam sites often return auto-generated content on demand). Gyöngyi et al. [27] develop a system called TrustRank to
differentiate good and bad Web sites. Starting with an expert-labeled seed of Web sites, TrustRank uses the Web’s link structure to estimate and track the flow of “trust” between reputable sites. The system then calculates a probability that a given site is spam or not. Benczúr et al.’s SpamRank system [7] implements a different link-based approach for mitigating Web spam. The core idea is that spam Web sites artificially boost their status with a large number of low-quality backlinks. SpamRank computes a penalty for such sites that should be deducted from their ranking scores.

Ntoulas et al. [65] focus less on link structure and instead consider numerous content-based features that are indicative of Web spam — for example, number of words in both the page and title, average word length, amount of anchor text, fraction of visible content, compressibility, and so on. Many of these features are signals of automatic and replicated keyword stuffing. As no single feature is a silver bullet, they learn a classifier using the full set of features. They find that boosted decision trees yield the most accurate model for detecting spam Web pages. Urvoy et al. [81] also use Web page content to characterize Web spam, but extend the type of features from textual to “extra-textual” to capture the HTML style of Web pages. The goal of this approach is to improve Web spam detection by clustering Web pages with similar HTML styles; such pages are likely generated automatically with the same scripts and templates. This property is exhibited by the spam-advertised storefronts we introduced in Section 2.1.

A different kind of automatic content generation used by spammers is called spinning — auto-generating many variations of a Web page, usually by shuffling and substituting words, to avoid duplicate detection. Zhang et al. [93] develop a system called DSpin to detect spun Web content. They reverse-engineer a synonym dictionary to identify words and phrases that spinning tools do not modify, termed immutables. Then, to measure pairwise similarity of Web pages, they perform a set comparison of their immutables. This similarity score is used to cluster families of Web pages that were spun from the same seed page.

Other researchers have built machine learning models to detect cloaked Web sites. The key in recognizing cloaking is to determine if a Web server delivers different content to search engine crawlers and browsers. In comparing these two versions of a Web page, Wu and Davison [91] spot the main difference in the SEO-ed nature of the crawler version. They assemble a number of content and link-based features that reveal link and keyword stuffing. Using a total of 162 features, they build a support vector machine (SVM) that achieves an average of 93% precision and 85% recall in detecting cloaked Web sites. Lin [46] provides a new insight to improve cloaking detection: the structure of legitimate Web pages changes much less frequently than the content, which may be dynamically generated. Therefore, a significant difference in the structure of the two Web pages served to a crawler and a browser suggests the use of cloaking. So while previous detectors use terms and links to compare Web page content, Lin uses HTML tags to compare Web page structure. A decision tree trained on a combination of tag and term features averages over 90% accuracy in classifying Web sites as cloaked or not.
2.5.4 Other Applications
Even more problems in Web page classification demonstrate the wide applicability of machine learning in security research. Bannur et al. [6] tackle the most general such problem — malicious or benign — by exploiting the full content of a Web page: text, tag structure, links, and visual appearance. They train logistic regression and SVM classifiers that are up to 98% accurate.
Kolari et al. [40] use SVMs for a different task: mapping out the “blogosphere”. In addition to customary n-grams, they extract the anchor text and tokenized URLs from a Web page’s links to distinguish blogs from other types of Web sites. Their best models achieve almost 98% accuracy in identifying blogs, but are less effective in detecting spam blogs.

Chapter 3 involves counterfeit storefronts that are advertised in email spam. There is a substantial body of work that uses machine learning tools for spam filtering; Guzella and Caminhas’s survey [26] highlights a considerable sample. Another application that specifically involves counterfeit storefronts, though, is the work of Leontiadis et al. [42], who investigate illegal pharmaceutical e-commerce. Using hierarchical clustering with average linkage, they group unlicensed online pharmacies based upon the similarity of their inventories. Their results indicate a potential bottleneck: most unlicensed pharmacies appear to use the same few suppliers, so cutting them off could have an outsized impact on sales.

The basis of our work in classifying storefront Web pages is the “affiliation” property — we build a system to group together storefronts that are associated with the same affiliate program. The success of automated methods for identifying such affiliations is often due to miscreants using automated methods to scale their attacks. Devising single-use scams is far too inefficient to be profitable, so instead, miscreants replicate their scams on many Web pages across many domains. This idea has been explored in other work as well. Anderson et al. [4] digest a huge feed of spam-advertised URLs by clustering the destination Web pages and identifying which ones promote the same scam. They implement a technique called image shingling to cluster Web pages that are visually similar when rendered in a browser. Through analysis of the clustering results, they observe over 2,000 distinct scams that monetize spam email.
More recently, Drew and Moore [20] also pursue the affiliation property: they cluster criminal Web sites that are managed by the same illegal organization. They develop a novel combined clustering algorithm that succeeds even when cybercriminals try to hide incriminating correlations between their Web sites. The two classes of scams they examine are fake-escrow services and high-yield investment programs.

A third instance of identifying sources of common origin is from Wardman and Warner [88], who detect new phishing Web sites by matching them to ones they have already seen. Phishers typically use a “phishing kit,” which is a bundle of resources for crafting a fraudulent Web site. Thus, their phishing Web sites contain many of the same files (HTML, CSS, JavaScript, and images), and therefore can be traced back to the same source.

Other systems have been designed to detect phishing Web sites as well. Zhang et al. [94] engineer a system called CANTINA which, among other content-based features and heuristics, uses terms with the highest term frequency-inverse document frequency (TF-IDF) score for classifying Web pages as phishing or legitimate. Whittaker et al. [90] describe a large-scale system that classifies millions of Web pages a day to automatically update Google’s phishing blacklist. They too limit the set of terms to the ones with the highest TF-IDF values. The intuition is that many of the highly ranked terms are victim-specific and obviously used by phishers (e.g., a site posing as eBay will contain eBay-related words). They combine these terms with additional features involving the URL, hosting information, and other page content to build a logistic regression classifier that estimates a phishing probability. For 99.9% of non-phishing Web pages, the classifier predicts a phishing probability under 1%; for 90% of phishing Web pages, a probability over 90%.

Chapter 5 involves domain parking, a common practice across the Web used to monetize undeveloped domains. Quite recently, Vissers et al. [83] extract features from a Web page’s content for building a domain parking classifier. On average, a random forest is 98.7% accurate in discriminating parked and not-parked Web pages.

In summary, even though this review still only scratches the surface of this problem space, it highlights the assortment of security challenges that require categorizing Web pages. Modern cybercrime is organized and executed at Internet scale, and so defensive analyses must be as well. Machine learning provides an array of techniques for conducting the large-scale classification of Web pages, and it will continue to play an integral part in the Internet security field.
Chapter 3

Affiliate Program Identification
A canonical machine learning application in the security domain is spam detection, where a model learns to classify an email as spam or non-spam. This binary classification task is cleanly defined, well-motivated, and thoroughly studied. Machine learning provides highly accurate models for detecting spam (e.g., [51]), but in spite of this success, the spam problem persists.

This situation inspired the study from Levchenko et al. [44], which serves as the basis of this chapter. They looked beyond just spam delivery and examined the entire spam business model instead. A critical part of the end-to-end analysis involved identifying the counterfeit businesses (i.e., affiliate programs) whose storefront Web sites were advertised in spam emails. This task prompts a more sophisticated classification problem than simply deciding whether an email is spam or not. Specifically, the problem here is the multi-way classification of storefront Web sites, where the classes are the sponsoring affiliate programs. This particular problem is a representative instance of security efforts that partition Web sites into groups of common origin. Doing this job manually, though, is far too time-consuming for a security expert. In this chapter, we show how machine learning tools can assist the security expert in performing classification at scale. We now provide a synopsis of our work.
We describe an automated system for the large-scale monitoring of Web sites that serve as online storefronts for spam-advertised goods. Our system is developed from an extensive crawl of black-market Web sites that deal in illegal pharmaceuticals, replica luxury goods, and counterfeit software. The operational goal of the system is to identify the affiliate programs of online merchants behind these Web sites; the system itself is part of a larger effort to improve the tracking and targeting of these affiliate programs. There are two main challenges in this domain. The first is that appearances can be deceiving: Web pages that render very differently are often linked to the same affiliate program of merchants. The second is the difficulty of acquiring training data: the manual labeling of Web pages, though necessary to some degree, is a laborious and time-consuming process. Our approach in this chapter is to extract features that reveal when Web pages linked to the same affiliate program share a similar underlying structure. Using these features, which are mined from a small initial seed of labeled data, we are able to profile the Web sites of forty-four distinct affiliate programs that account, collectively, for hundreds of millions of dollars in illicit e-commerce. Our work also highlights several broad challenges that arise in the large-scale, empirical study of malicious activity on the Web.
3.1 Introduction
The Web plays host to a broad spectrum of online fraud and abuse—everything from search poisoning [36, 85] and phishing attacks [50] to false advertising [45], DNS profiteering [29, 60], and browser compromise [71]. All of these malicious activities are mediated, in one way or another, by Web pages that lure unsuspecting victims away from their normal browsing to various undesirable ends. Thus, an important question is whether these malicious Web pages can be automatically identified by suspicious commonalities in appearance or syntax [6, 29, 41, 46, 79, 81, 94]. However, more than simply distinguishing “bad” from “good” Web sites, there is further interest in classifying criminal Web sites into groups of common origin: those pages that are likely under the control of a singular organization [20, 44, 55]. Indeed, capturing this “affiliation” property has become critical both for intelligence gathering and for criminal and civil interventions. In this chapter, we develop an automated system for one version of this problem: the large-scale profiling of Web sites that serve as online storefronts for spam-advertised goods.

While everyone with an e-mail account is familiar with the scourge of spam-based advertising, it is only recently that researchers have come to appreciate the complex business structure behind such messages [56]. In particular, it has become standard that merchants selling illegitimate goods (e.g., counterfeit Viagra and Rolex watches) are organized into affiliate programs that in turn engage with spammers as independent contractors. Under this model, the affiliate program is responsible for providing the content for online storefronts, contracting for payment services (e.g., to accept credit cards), customer support and product fulfillment—but direct advertising is left to independent affiliates (i.e., spammers). Spammers are paid a 30–40% commission on each customer purchase acquired via their advertising efforts and may operate on behalf of multiple distinct affiliate programs over time. Such activities are big business, with large affiliate programs generating millions of dollars in revenue every month [38]. Thus, while there are hundreds of thousands of spam-advertised Web sites and thousands of individual spammers, most of this activity is in service to only several dozen affiliate programs.

Recent work has shown that this hierarchy can be used to identify structural bottlenecks, notably in payment processing. In particular, if an affiliate program is unable to process credit cards, then it becomes untenable to sustain their business (no matter the number of Web sites they operate or spammers they contract with). Recently, this vulnerability was demonstrated concretely in a concerted effort by major brand holders and payment networks to shut down the merchant bank accounts used by key affiliate programs. Over a short period of time, this effort shut down 90% of affiliate programs selling illegal software and severely disabled a range of affiliate programs selling counterfeit pharmaceuticals [55]. The success of this intervention stemmed from a critical insight—namely, that the hundreds of thousands of Web sites harvested from millions of spam emails could be mapped to a few dozen affiliate programs (each with a small number of independent merchant bank accounts). At the heart of this action was therefore a classification problem: how to identify affiliate programs from the Web pages of their online storefronts?
There are two principal challenges to classification in this domain. The first is that appearances can be deceiving: storefront pages that render very differently are often supported by the same affiliate program. The seeming diversity of these pages—a ploy to evade detection—is generated by in-house software with highly customizable templates. The second challenge is the difficulty of acquiring training data. Manually labeling storefront pages, though necessary to some degree, is a laborious and time-consuming process. The researchers in [44] spent hundreds of hours crafting regular expressions that mapped storefront pages to affiliate programs. Practitioners may endure such manual effort once, but it is too laborious to be performed repeatedly over time or to scale to even larger corpora of crawled Web pages. Our goal is to develop a more automated approach that greatly reduces manual effort while also improving the accuracy of classification. In this chapter
Table 3.1. Screenshots of online storefronts selling counterfeit pharmaceuticals, replicas, and software (programs shown: GlavMed, Ultimate Rep., SoftSales).
we focus specifically on spam-advertised storefronts for three categories of products: illegal pharmaceuticals, replica luxury goods, and counterfeit software (Table 3.1). We use the data set from [44] consisting of 226K potential storefront pages winnowed from 6M distinct URLs advertised in spam. From the examples in this data, we consider how to learn a classifier that maps storefront pages to the affiliate programs behind them. We proceed with an operational perspective in mind, focusing on scenarios of real-world interest to practitioners in computer security.

Our most important findings are the following. First, we show that the online storefronts of several dozen affiliate programs can be distinguished from automatically extracted features of their Web pages. In particular, we find that a simple nearest neighbor (NN) classifier on HTML and network-based features achieves a nearly perfect accuracy of 99.99%. Second, we show that practitioners need only invest a small amount of manual effort in the labeling of examples: with just one labeled storefront per affiliate program, NN achieves an accuracy of 75%, and with sixteen such examples, it achieves an accuracy of 98%. Third, we show that our classifiers are able to label the affiliate programs of over 3,700 additional storefront pages that were missed by the hand-crafted regular expressions of the original study. Finally, we show that even simple clustering methods may be useful in this domain—for example,