UC San Diego UC San Diego Electronic Theses and Dissertations


Title: Investigating Large-Scale Internet Abuse Through Web Page Classification
Permalink: https://escholarship.org/uc/item/8jp0z4m4
Author: Der, Matthew Francis
Publication Date: 2015
Peer reviewed | Thesis/dissertation
eScholarship.org, powered by the California Digital Library, University of California

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Investigating Large-Scale Internet Abuse Through Web Page Classification

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science by Matthew F. Der

Committee in charge:
Professor Lawrence K. Saul, Co-Chair
Professor Stefan Savage, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair
Professor Gert Lanckriet
Professor Kirill Levchenko

2015

Copyright Matthew F. Der, 2015. All rights reserved.

The Dissertation of Matthew F. Der is approved and is acceptable in quality and form for publication on microfilm and electronically: Co-Chair, Co-Chair, Co-Chair. University of California, San Diego, 2015.

DEDICATION
To my family: Kristen, Charlie, David, Bryan, Sarah, Katie, and Zach.

EPIGRAPH
Everything should be made as simple as possible, but not simpler.
- Albert Einstein
Sic transit gloria ... glory fades.
- Max Fischer

TABLE OF CONTENTS
Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation
Chapter 1 Introduction
  1.1 Contributions
  1.2 Organization
Chapter 2 Background
  2.1 The Spam Ecosystem
    2.1.1 Spam Value Chain
    2.1.2 Click Trajectories Finding
    2.1.3 Affiliate Programs
  2.2 SEO and Search Poisoning
  2.3 Domain Names
    2.3.1 The DNS Business Model
    2.3.2 Growth of Top-Level Domains
    2.3.3 Abuse and Economics
  2.4 Bag-of-Words Representation
  2.5 Related Work
    2.5.1 Non-Machine Learning Methods
    2.5.2 Near-Duplicate Web Pages
    2.5.3 Web Spam and Cloaking
    2.5.4 Other Applications
Chapter 3 Affiliate Program Identification
  3.1 Introduction
  3.2 Data Set
    3.2.1 Data Collection
    3.2.2 Data Filtering
    3.2.3 Data Labeling
  3.3 An Automated Approach
    3.3.1 Feature Extraction
    3.3.2 Dimensionality Reduction & Visualization
    3.3.3 Nearest Neighbor Classification
  3.4 Experiments
    3.4.1 Proof of Concept
    3.4.2 Labeling More Storefronts
    3.4.3 Classification in the Wild
    3.4.4 Learning with Few Labels
    3.4.5 Clustering
  3.5 Conclusion
  3.6 Acknowledgements
Chapter 4 Counterfeit Luxury SEO
  4.1 Introduction
  4.2 Background
    4.2.1 Search Result Poisoning
    4.2.2 Counterfeit Luxury Market
    4.2.3 Interventions
  4.3 Data Collection
    4.3.1 Generating Search Queries
    4.3.2 Crawling & Cloaking
    4.3.3 Detecting Storefronts
    4.3.4 Complete Data Set
  4.4 Approach
    4.4.1 Supervised Learning
    4.4.2 Unsupervised Learning
    4.4.3 Bootstrapping the System
  4.5 Results
    4.5.1 Classification Results
    4.5.2 Ecosystem-Level Results
    4.5.3 Further Analysis
  4.6 Conclusion
  4.7 Acknowledgements
Chapter 5 The Uses (and Abuses) of New Top-Level Domains
  5.1 Introduction
  5.2 Data Set
  5.3 Clustering and Classification
    5.3.1 Clustering
    5.3.2 Classification
    5.3.3 Further Analysis
  5.4 Document Relevance
    5.4.1 Generating a Corpus for Each TLD
    5.4.2 Estimating Topics
    5.4.3 Relevance Scoring
  5.5 Conclusion
  5.6 Acknowledgements
Recommended publications
  • Click Trajectories: End-To-End Analysis of the Spam Value Chain
    Click Trajectories: End-to-End Analysis of the Spam Value Chain
    Authors: Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Nicholas Weaver, Vern Paxson, Geoffrey M. Voelker, Stefan Savage
    Affiliations: Department of Computer Science and Engineering, University of California, San Diego; Computer Science Division, University of California, Berkeley; International Computer Science Institute, Berkeley, CA; Laboratory of Cryptography and System Security (CrySyS), Budapest University of Technology and Economics
    Abstract: Spam-based advertising is a business. While it has engendered both widespread antipathy and a multi-billion dollar anti-spam industry, it continues to exist because it fuels a profitable enterprise. We lack, however, a solid understanding of this enterprise's full structure, and thus most anti-spam interventions focus on only one facet of the overall spam value chain (e.g., spam filtering, URL blacklisting, site takedown). In this paper we present a holistic analysis that quantifies the full set of resources employed to monetize spam email (including naming, hosting, payment and fulfillment) using extensive measurements of three months of diverse spam data, broad crawling of naming and hosting infrastructures, and over 100 purchases from spam-advertised sites.
    Excerpt from the body text: ... it is these very relationships that capture the structural dependencies (and hence the potential weaknesses) within the spam ecosystem's business processes. Indeed, each distinct path through this chain (registrar, name server, hosting, affiliate program, payment processing, fulfillment) directly reflects an "entrepreneurial activity" by which the perpetrators muster capital investments and business relationships to create value. Today we lack insight into even the most basic characteristics of this activity. How many organizations are complicit in the spam ecosystem? Which
  • The Internet and Drug Markets
    EMCDDA Insights 21 (EN), ISSN 2314-9264
    The internet and drug markets
    EMCDDA project group: Jane Mounteney, Alessandra Bo and Alberto Oteo
    Legal notice: This publication of the European Monitoring Centre for Drugs and Drug Addiction (EMCDDA) is protected by copyright. The EMCDDA accepts no responsibility or liability for any consequences arising from the use of the data contained in this document. The contents of this publication do not necessarily reflect the official opinions of the EMCDDA's partners, any EU Member State or any agency or institution of the European Union.
    Europe Direct is a service to help you find answers to your questions about the European Union. Freephone number (*): 00 800 6 7 8 9 10 11. (*) The information given is free, as are most calls (though some operators, phone boxes or hotels may charge you). More information on the European Union is available on the internet (http://europa.eu).
    Luxembourg: Publications Office of the European Union, 2016. ISBN: 978-92-9168-841-8. doi:10.2810/324608. © European Monitoring Centre for Drugs and Drug Addiction, 2016. Reproduction is authorised provided the source is acknowledged.
    This publication should be referenced as: European Monitoring Centre for Drugs and Drug Addiction (2016), The internet and drug markets, EMCDDA Insights 21, Publications Office of the European Union, Luxembourg. References to chapters in this publication should include, where relevant, references to the authors of each chapter, together with a reference to the wider publication. For example: Mounteney, J., Oteo, A. and Griffiths, P.
  • Adsense-Blackhat-Edition.Pdf
    AdSense.BlackHatEdition.com, Vince Tan
    Copyright Notice: All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical. Any unauthorized duplication, reproduction, or distribution is strictly prohibited and prosecutable to the full extent of the law.
    Legal Notice: While attempts have been made to verify the information contained within this publication, the author, publisher, and anyone associated with its creation hereby assume absolutely no responsibility as it pertains to its contents and subject matter, nor with regard to its usage by the public or in any matters concerning potentially erroneous and/or contradictory information put forth by it. Furthermore, the reader agrees to assume all accountability for the usage of any information obtained from it, and completely absolves Vince Tan, the publishers and any associates involved of any liability for it whatsoever.
    Additional Notice: This book was obtained at http://AdSense.BlackHatEdition.com . For a limited time, when you register for free, you will be able to earn your way to get the latest updates of this ebook, placement on an invitation-only list to participate in exclusive pre-sales, web seminars, bonuses & giveaways, and be awarded a "special backdoor discount" that you will never find anywhere else! So if you haven't yet, make sure to visit http://AdSense.BlackHatEdition.com and register now, before it's too late!
    Table of Contents: Introduction
  • Show Me the Money: Characterizing Spam-Advertised Revenue
    Show Me the Money: Characterizing Spam-advertised Revenue
    Authors: Chris Kanich∗, Nicholas Weaver†, Damon McCoy∗, Tristan Halvorson∗, Christian Kreibich†, Kirill Levchenko∗, Vern Paxson†‡, Geoffrey M. Voelker∗, Stefan Savage∗
    Affiliations: ∗ Department of Computer Science and Engineering, University of California, San Diego; † International Computer Science Institute, Berkeley, CA; ‡ Computer Science Division, University of California, Berkeley
    Abstract: Modern spam is ultimately driven by product sales: goods purchased by customers online. However, while this model is easy to state in the abstract, our understanding of the concrete business environment (how many orders, of what kind, from which customers, for how much) is poor at best. This situation is unsurprising since such sellers typically operate under questionable legal footing, with "ground truth" data rarely available to the public. However, absent quantifiable empirical data, "guesstimates" operate unchecked and can distort both policy making and our choice of appropriate interventions. In this paper, we describe two inference techniques for peering inside the business operations of spam-advertised enterprises: purchase pair and
    Excerpt from the body text: ... money at all [6]. This situation has the potential to distort policy and investment decisions that are otherwise driven by intuition rather than evidence. In this paper we make two contributions to improving this state of affairs using measurement-based methods to estimate:
      • Order volume. We describe a general technique, purchase pair, for estimating the number of orders received (and hence revenue) via on-line store order numbering. We use this approach to establish rough, but well-founded, monthly order volume estimates for many of the leading "affiliate programs" selling counterfeit pharmaceuticals and software.
      • Purchasing behavior.
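The order-numbering idea in the excerpt above can be illustrated with a minimal sketch. It assumes, hypothetically, that a store hands out order numbers that increase by one per order, so the gap between the order numbers of two test purchases bounds how many orders arrived in between. The `Purchase` class, the function name, and every number below are illustrative and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Purchase:
    """One test purchase: when it was placed and the order number returned."""
    placed_at: datetime
    order_number: int


def estimate_orders_per_day(first: Purchase, second: Purchase) -> float:
    """Estimate the daily order rate from two purchases at the same store.

    Assumption (hypothetical): order numbers grow by one per order, so the
    difference is an upper bound on the orders placed between the purchases.
    """
    if second.placed_at <= first.placed_at:
        raise ValueError("second purchase must be later than the first")
    delta_orders = second.order_number - first.order_number
    if delta_orders < 0:
        raise ValueError("order numbers must not decrease")
    elapsed_days = (second.placed_at - first.placed_at).total_seconds() / 86400
    return delta_orders / elapsed_days


# Made-up example: two purchases ten days apart, 500 order numbers apart.
first = Purchase(datetime(2011, 1, 1), order_number=1000)
second = Purchase(datetime(2011, 1, 11), order_number=1500)
rate = estimate_orders_per_day(first, second)
print(rate)  # 50.0
```

If a store skips or randomizes order numbers, the assumption fails and the estimate is no longer meaningful, which is why the sketch treats the result only as a rough bound.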
  • SpamRank: Fully Automatic Link Spam Detection (Work in Progress)
    SpamRank – Fully Automatic Link Spam Detection (work in progress)
    Authors: András A. Benczúr (1,2), Károly Csalogány (1,2), Tamás Sarlós (1,2), Máté Uher (1)
    Affiliations: (1) Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI), 11 Lagymanyosi u., H-1111 Budapest, Hungary; (2) Eötvös University, Budapest
    Contact: {benczur, cskaresz, stamas, umate}@ilab.sztaki.hu, www.ilab.sztaki.hu/websearch
    Abstract: Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of white or blacklists or other means of human intervention. We assume that spammed pages have a biased distribution of pages that contribute to the undeserved high PageRank value. We define SpamRank by penalizing pages that originate a suspicious PageRank share and personalizing PageRank on the penalties. Our method is tested on a 31 M page crawl of the .de domain with a manually classified 1000-page stratified random sample with bias towards large PageRank values.
    1 Introduction: Identifying and preventing spam was cited as one of the top challenges in web search engines in a 2002 paper [24]. Amit Singhal, principal scientist of Google Inc., estimated that the search engine spam industry had a revenue potential of $4.5 billion in year 2004 if they had been able to completely fool all search engines on all commercially viable queries [36]. Due to the large and ever increasing financial gains resulting from high search engine ratings, it is no wonder that a significant amount of human and machine resources are devoted to artificially inflating the rankings of certain web pages.
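The abstract builds on personalized PageRank, where the random surfer teleports according to a chosen distribution instead of uniformly. The sketch below shows only that building block on a toy graph, using plain power iteration. It is not the SpamRank algorithm itself: how the penalties are computed is not described in the excerpt, so the `penalties` dictionary, the graph, and the function name are all made-up assumptions.

```python
def personalized_pagerank(links, personalization, damping=0.85, iterations=50):
    """Power iteration for PageRank with a non-uniform teleport distribution.

    links: dict mapping each page to the list of pages it links to
           (every target must also appear as a key).
    personalization: dict of non-negative teleport weights per page.
    """
    nodes = sorted(links)
    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization must have positive total weight")
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            targets = links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # Dangling page: hand its rank back via the teleport vector.
                for n in nodes:
                    nxt[n] += damping * rank[src] * teleport[n]
        rank = nxt
    return rank


# Toy graph (hypothetical): page "d" carries the only penalty weight.
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["c"]}
penalties = {"d": 1.0}
scores = personalized_pagerank(links, penalties)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy run the score mass flows from the penalized page "d" to the pages it supports, so pages closest to "d" in link distance end up with the largest share.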
  • Clique-Attacks Detection in Web Search Engine for Spamdexing Using K-Clique Percolation Technique
    International Journal of Machine Learning and Computing, Vol. 2, No. 5, October 2012
    Clique-Attacks Detection in Web Search Engine for Spamdexing using K-Clique Percolation Technique
    S. K. Jayanthi and S. Sasikala, Member, IACSIT
    Abstract: Search engines make the information retrieval task easier for the users. A highly ranked position in the search engine query results brings great benefits for websites. Some website owners interpret the link architecture to improve ranks. To handle search engine spam problems, especially link farm spam, clique identification in the network structure would help a lot. This paper proposes a novel strategy to detect spam based on the K-Clique Percolation method. Data are collected from websites and classified with the NaiveBayes classification algorithm. The suspicious spam sites are analyzed for clique-attacks. Observations and findings are given regarding the spam. Performance of the system seems to be good in terms of accuracy.
    Excerpt from the body text: A clique cluster groups the set of nodes that are completely connected to each other. Specifically, if connections are added between objects in the order of their distance from one another, a cluster is formed when the objects form a clique. If a web site is considered as a clique, then analysis of incoming and outgoing links reveals the existence of cliques in the web. It means strong interconnection between a few websites with mutual link interchange. This improves the rank of every website that participates in the clique cluster. Fig. 2 portrays one particular case of link spam, link farm spam. The figure shows one particular node (website) pointed to by many nodes (websites); this structure gives that website a higher rank under the PageRank algorithm.
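The excerpt names K-Clique Percolation without spelling it out. As a rough illustration, the sketch below implements the generic textbook form of the technique: find all fully connected groups of k nodes, treat two such groups as adjacent when they share k-1 nodes, and merge adjacent groups into communities. It is a brute-force version for tiny graphs only, and the function name and the toy link graph are hypothetical; it does not reproduce the paper's pipeline.

```python
from itertools import combinations


def percolate_k_cliques(edges, k):
    """Return communities formed by overlapping k-cliques (undirected graph)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    nodes = sorted(adjacency)

    # Brute-force enumeration: every k-subset whose members are all linked.
    cliques = [
        frozenset(group)
        for group in combinations(nodes, k)
        if all(b in adjacency[a] for a, b in combinations(group, 2))
    ]

    # Union-find over cliques; cliques sharing k-1 nodes are merged.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(members) for members in communities.values())


# Toy link graph (made up): a fully interlinked group a-b-c-d, a separate
# triangle x-y-z, and a chain d-e-f that belongs to no clique of size 3.
edges = [
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("d", "e"), ("e", "f"),
]
communities = percolate_k_cliques(edges, k=3)
print(communities)  # [['a', 'b', 'c', 'd'], ['x', 'y', 'z']]
```

Tightly interlinked groups such as the first community are the kind of structure a link-farm analysis would flag for closer inspection; ordinary chains of links do not form a community at k=3.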
  • Zambia and Spam
    ZAMNET COMMUNICATION SYSTEMS LTD (ZAMBIA)
    Spam – The Zambian Experience
    Submission to ITU WSIS Thematic meeting on countering Spam
    By: Annabel S Kangombe – Maseko, June 2004

    Table of Contents
    1.0 Introduction
      1.1 What is spam?
      1.2 The nature of Spam
      1.3 Statistics
    2.0 Technical view
      2.1 Main Sources of Spam
        2.1.1 Harvesting
        2.1.2 Dictionary Attacks
        2.1.3 Open Relays
        2.1.4 Email databases
        2.1.5 Inadequacies in the SMTP protocol
      2.2 Effects of Spam
      2.3 The fight against spam
        2.3.1 Blacklists
        2.3.2 White lists
        2.3.3 Dial-up Lists (DUL)
        2.3.4 Spam filtering programs
      2.4 Challenges of fighting spam
    3.0 Legal Framework
      3.1 Laws against spam in Zambia
      3.2 International Regulations or Laws
        3.2.1 US State Laws
        3.2.2 The USA's CAN-SPAM Act
    4.0 The Way forward
      4.1 A global effort
      4.2 Collaboration between ISPs
      4.3 Strengthening Anti-spam regulation
      4.4 User education
      4.5 Source authentication
      4.6 Rewriting the Internet Mail Exchange protocol

    1.0 Introduction
    I get to the office in the morning, walk to my desk and switch on the computer. One of the first things I do after checking the status of the network devices is to check my email.
  • Email Phishing for IT Providers How Phishing Emails Have Changed and How to Protect Your IT Clients
    Email Phishing for IT Providers
    How phishing emails have changed and how to protect your IT clients
    © 2016 Calyptix Security Corporation. All rights reserved. [email protected] | (800) 650-8930

    Contents
      Introduction
      Phishing overview
      Trends in phishing emails
      Email phishing tactics
      Steps for MSP & VARS
      Advice for your clients
      Sources

    Introduction
    There are only so many ways to break into a bank. You can march through the door. You can climb through a window. You can tunnel through the floor. There is the service entrance, the employee entrance, and access on the roof. Criminals who want to rob a bank will probably use an open route, such as a side door. It's easier than breaking down a wall. Criminals who want to break into your network face a similar challenge. They need to enter. They can look for a weakness in your
  • 2020 Identity Theft Statistics | Consumeraffairs
    2020 Identity Theft Statistics | ConsumerAffairs
    Last Updated 01/16/2020
    Identity theft statistics: trends and statistics about identity theft
    by Rob Douglas, Identity Theft Protection Contributing Editor

    In 2018, the Federal Trade Commission processed 1.4 million fraud reports totaling $1.48 billion in losses. According to the FTC's "Consumer Sentinel Network Data Book," the most common categories for fraud complaints were imposter scams, debt collection and identity theft. Credit card fraud was most prevalent in identity theft cases: more than 167,000 people reported a fraudulent credit card account was opened with their information.

    Identity theft trends in 2019
    In the next year, the Identity Theft Resource Center (ITRC) predicts identity theft protection services will primarily focus on data breaches, data abuse and data privacy. ITRC also predicts that consumers will become more knowledgeable about how data breaches work and expect companies to provide more information about the specific types of data breached and demand more transparency in general in data breach reports.

    Cyber attacks are more ambitious
    According to a 2019 Internet Security Threat Report by Symantec, cybercriminals are diversifying their targets and using stealthier methods to commit identity theft and fraud. Cybercrime groups like Mealybug, Gallmaker and Necurs are opting for off-the-shelf tools and operating system features such as PowerShell to attack targets.
      • Supply chain attacks are up 78%
      • Malicious PowerShell scripts have increased by 1,000%
      • Microsoft Office files make up 48% of malicious email attachments

    Internet of Things threats on the rise
    Cybercriminals attack IoT devices an average of 5,233 times per month.
  • Copyrighted Material
    Contents
    Acknowledgments
    Introduction
    Chapter 1: You: Programmer and Search Engine Marketer
      Who Are You?
      What Do You Need to Learn?
      SEO and the Site Architecture
      SEO Cannot Be an Afterthought
      Communicating Architectural Decisions
      Architectural Minutiae Can Make or Break You
      Preparing Your Playground
      Installing XAMPP
      Preparing the Working Folder
      Preparing the Database
      Summary
    Chapter 2: A Primer in Basic SEO
      Introduction to SEO
      Link Equity
      Google PageRank
      A Word on Usability and Accessibility
      Search Engine Ranking Factors
      On-Page Factors
      Visible On-Page Factors
      Invisible On-Page Factors
      Time-Based Factors
      External Factors
      Potential Search Engine Penalties
      The Google "Sandbox Effect"
      The Expired Domain Penalty
      Duplicate Content Penalty
      The Google Supplemental Index
      Resources and Tools
      Web Analytics
      Market Research
      Researching Keywords
      Browser Plugins
      Community Forums
      Search Engine Blogs and Resources
      Summary
    Chapter 3: Provocative SE-Friendly URLs
      Why Do URLs Matter?
      Static URLs and Dynamic URLs
      Static URLs
      Dynamic URLs
      URLs and CTR
      URLs and Duplicate Content
      URLs of the Real World
      Example #1: Dynamic URLs
      Example #2: Numeric Rewritten URLs
      Example #3: Keyword-Rich Rewritten URLs
      Maintaining URL Consistency
      URL Rewriting
      Installing mod_rewrite
      Testing mod_rewrite
      Introducing
  • SERI Conference Proceedings SHORTLISTED PAPERS
    National Centre of Excellence for Cybersecurity Technology Development & Entrepreneurship presents
    Promoting Cybersecurity Education, Research and Innovation
    SERI Conference Proceedings: SHORTLISTED PAPERS

    THEME | AUTHORS | INSTITUTE/ORGANIZATION
    Data Privacy Considerations during Requirements Phase of IoT Product Development | HARSHA BANAVARA | Signify North America Corporation, Burlington, MA USA
    ASTRA: A Post Exploitation Red Teaming tool | AKSHAY JAIN, DR. BHUVNA J, SUBARNA PANDA | Jain University, Bengaluru
    New Space and New Threats | RAMESH KUMAR V, PRASANNA PHANINDRAN | zSpaze Technologies, Bengaluru; Easwari Engineering College, Chennai
    Cross-Channel Scripting Attacks (XCS) in Web Applications | SHASHIDHAR R | Bennett University
    Quantum Cryptography and Use Cases: A Short Survey Paper | SANCHALI KSHIRSAGAR, DR. SANJYA PAWAR, SHRAVANI SHAHAPURE | UMIT SNDT Women's University, Mumbai; NCoE-DSCI, New Delhi
    Securing IoT using Permissioned Blockchain | SHALINI DHULL | Tata Advanced System Limited, Noida
    An Analysis of Internet of Things (IoT) Architecture | YASSIR FAROOQUI, DR. KIRIT MODI | Parul University, Vadodara; Sankalchand Patel University, Visnagar
    Cyber Security-Modern Era Challenge to Human Race and its impact on COVID-19 | DR. SUMANTA BHATTACHARYA, BHAVNEET KAUR SACHDEV | Zonal advisory at Consumer Rights Organization; Indian Institute of Human

    Data Privacy Considerations during Requirements Phase of IoT Product Development
    Harsha Banavara, Signify North America
  • Understanding and Analyzing Malicious Domain Take-Downs
    Cracking the Wall of Confinement: Understanding and Analyzing Malicious Domain Take-downs
    Authors: Eihal Alowaisheq (1,2), Peng Wang (1), Sumayah Alrwais (2), Xiaojing Liao (1), XiaoFeng Wang (1), Tasneem Alowaisheq (1,2), Xianghang Mi (1), Siyuan Tang (1), and Baojun Liu (3)
    Affiliations: (1) Indiana University, Bloomington. {ealowais, pw7, xliao, xw7, talowais, xm, [email protected]; (2) King Saud University, Riyadh, Saudi Arabia. [email protected]; (3) Tsinghua University, [email protected]
    Abstract: Take-down operations aim to disrupt cybercrime involving malicious domains. In the past decade, many successful take-down operations have been reported, including those against the Conficker worm, and most recently, against VPNFilter. Although it plays an important role in fighting cybercrime, the domain take-down procedure is still surprisingly opaque. There seems to be no in-depth understanding about how the take-down operation works and whether there is due diligence to ensure its security and reliability. In this paper, we report the first systematic study on domain takedown. Our study was made possible via a large collection of data, including various sinkhole feeds and blacklists, passive DNS data spanning six years, and historical WHOIS information.
    Excerpt from the body text: ... "clean", i.e., no longer involved in any malicious activities. Challenges in understanding domain take-downs. Although domain seizures are addressed in ICANN guidelines [55] and in other public articles [14, 31, 38], there is a lack of prominent and comprehensive understanding of the process. In-depth exploration is of critical importance for combating cybercrime but is by no means trivial. The domain take-down process is rather opaque and quite complicated. In particular, it involves several steps (complaint submission, take-down execution, and release, see Section II). It also involves multiple parties (authorities, registries, and registrars), and multiple domain management elements (DNS, WHOIS, and registry