UC San Diego UC San Diego Electronic Theses and Dissertations
Title: Investigating Large-Scale Internet Abuse Through Web Page Classification
Permalink: https://escholarship.org/uc/item/8jp0z4m4
Author: Der, Matthew Francis
Publication Date: 2015
Peer reviewed | Thesis/dissertation

eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Investigating Large-Scale Internet Abuse Through Web Page Classification

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science

by

Matthew F. Der

Committee in charge:
  Professor Lawrence K. Saul, Co-Chair
  Professor Stefan Savage, Co-Chair
  Professor Geoffrey M. Voelker, Co-Chair
  Professor Gert Lanckriet
  Professor Kirill Levchenko

2015

Copyright
Matthew F. Der, 2015
All rights reserved.

The Dissertation of Matthew F. Der is approved and is acceptable in quality and form for publication on microfilm and electronically:

  Co-Chair
  Co-Chair
  Co-Chair

University of California, San Diego
2015

DEDICATION

To my family: Kristen, Charlie, David, Bryan, Sarah, Katie, and Zach.

EPIGRAPH

Everything should be made as simple as possible, but not simpler.
-- Albert Einstein

Sic transit gloria ... glory fades.
-- Max Fischer

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1 Introduction
  1.1 Contributions
  1.2 Organization

Chapter 2 Background
  2.1 The Spam Ecosystem
    2.1.1 Spam Value Chain
    2.1.2 Click Trajectories Finding
    2.1.3 Affiliate Programs
  2.2 SEO and Search Poisoning
  2.3 Domain Names
    2.3.1 The DNS Business Model
    2.3.2 Growth of Top-Level Domains
    2.3.3 Abuse and Economics
  2.4 Bag-of-Words Representation
  2.5 Related Work
    2.5.1 Non-Machine Learning Methods
    2.5.2 Near-Duplicate Web Pages
    2.5.3 Web Spam and Cloaking
    2.5.4 Other Applications

Chapter 3 Affiliate Program Identification
  3.1 Introduction
  3.2 Data Set
    3.2.1 Data Collection
    3.2.2 Data Filtering
    3.2.3 Data Labeling
  3.3 An Automated Approach
    3.3.1 Feature Extraction
    3.3.2 Dimensionality Reduction & Visualization
    3.3.3 Nearest Neighbor Classification
  3.4 Experiments
    3.4.1 Proof of Concept
    3.4.2 Labeling More Storefronts
    3.4.3 Classification in the Wild
    3.4.4 Learning with Few Labels
    3.4.5 Clustering
  3.5 Conclusion
  3.6 Acknowledgements

Chapter 4 Counterfeit Luxury SEO
  4.1 Introduction
  4.2 Background
    4.2.1 Search Result Poisoning
    4.2.2 Counterfeit Luxury Market
    4.2.3 Interventions
  4.3 Data Collection
    4.3.1 Generating Search Queries
    4.3.2 Crawling & Cloaking
    4.3.3 Detecting Storefronts
    4.3.4 Complete Data Set
  4.4 Approach
    4.4.1 Supervised Learning
    4.4.2 Unsupervised Learning
    4.4.3 Bootstrapping the System
  4.5 Results
    4.5.1 Classification Results
    4.5.2 Ecosystem-Level Results
    4.5.3 Further Analysis
  4.6 Conclusion
  4.7 Acknowledgements

Chapter 5 The Uses (and Abuses) of New Top-Level Domains
  5.1 Introduction
  5.2 Data Set
  5.3 Clustering and Classification
    5.3.1 Clustering
    5.3.2 Classification
    5.3.3 Further Analysis
  5.4 Document Relevance
    5.4.1 Generating a Corpus for Each TLD
    5.4.2 Estimating Topics
    5.4.3 Relevance Scoring
  5.5 Conclusion
  5.6 Acknowledgements