UNIVERSITY OF CALIFORNIA, SAN DIEGO: Understanding the Role of Outsourced Labor in Web Service Abuse

Total Pages: 16

File Type: PDF, Size: 1020 KB

UNIVERSITY OF CALIFORNIA, SAN DIEGO

Understanding the Role of Outsourced Labor in Web Service Abuse

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Marti A. Motoyama

Committee in charge:

Professor Stefan Savage, Co-Chair
Professor George Varghese, Co-Chair
Professor Geoffrey M. Voelker, Co-Chair
Professor Gert Lanckriet
Professor Akos Rona-Tas
Professor Lawrence K. Saul

2011

Copyright Marti A. Motoyama, 2011. All rights reserved.

The dissertation of Marti A. Motoyama is approved, and it is acceptable in quality and form for publication on microfilm and electronically.

University of California, San Diego, 2011

DEDICATION

To my family, friends, and loved ones, who all made this journey a little less painful.

EPIGRAPH

Cool story, bro.
—Christopher William Kanich

A mind needs books as a sword needs a whetstone, if it is to keep its edge.
—Tyrion Lannister

I think it is the excitement that only a free man can feel, a free man starting a long journey whose conclusion is uncertain.
—Red

TABLE OF CONTENTS

Signature Page
Dedication
Epigraph
Table of Contents
List of Figures
List of Tables
Acknowledgements
Vita
Abstract of the Dissertation

Chapter 1: Introduction
  1.1 Contributions
  1.2 Organization

Chapter 2: Background and Related Work
  2.1 Freelancing Sites
  2.2 CAPTCHAs
  2.3 Related Work
    2.3.1 Outsourcing Abuse Tasks
    2.3.2 CAPTCHA Solving

Chapter 3: The Role of Freelance Labor in Web Service Abuse
  3.1 Data Overview
    3.1.1 Data Collection Methodology
    3.1.2 Data Summary
    3.1.3 Categorizing Jobs
    3.1.4 Posting Job Listings
  3.2 Case Studies
    3.2.1 Accounts
    3.2.2 OSN Linking
    3.2.3 Spamming
    3.2.4 Search Engine Optimization
  3.3 User Analysis
    3.3.1 Country of Origin
    3.3.2 Specialization
    3.3.3 Reputation and Lifetime
  3.4 Discussion
  3.5 Summary

Chapter 4: Understanding CAPTCHA-Solving Services in an Economic Context
  4.1 Automated Software Solvers
    4.1.1 Empirical Case Studies
    4.1.2 Economics
  4.2 Human Solver Services
    4.2.1 Opportunistic Solving
    4.2.2 Paid Solving
    4.2.3 Economics
    4.2.4 Active Measurement Issues
  4.3 Solver Service Quality
    4.3.1 Customer Account Creation
    4.3.2 Customer Interface
    4.3.3 Service Pricing
    4.3.4 Test Corpus
    4.3.5 Verifying Solutions
    4.3.6 Quality of Service
    4.3.7 Value
    4.3.8 Capacity
    4.3.9 Load and Availability
  4.4 Workforce
    4.4.1 Account Creation
    4.4.2 Worker Interface
    4.4.3 Worker Wages
    4.4.4 Geolocating Workers
    4.4.5 Adaptability
    4.4.6 Targeted Sites
  4.5 Summary

Chapter 5: Conclusion
  5.1 Future Directions
    5.1.1 Purchasing Via Retail Sites Versus Outsourcing
    5.1.2 Identification of Abuse Jobs at Scale
    5.1.3 Combating Human-Based CAPTCHA Solving
  5.2 Final Thoughts

Appendix A: Interesting Job Posts and Bids from Freelancer.com
  A.1 Interesting Jobs
    A.1.1 Legitimate
    A.1.2 Accounts
    A.1.3 SEO
    A.1.4 Spamming
    A.1.5 OSN Linking
    A.1.6 Miscellaneous
  A.2 Interesting Bids
    A.2.1 Accounts
    A.2.2 SEO
    A.2.3 Spamming
    A.2.4 OSN Linking
    A.2.5 Miscellaneous

Bibliography

LIST OF FIGURES

Figure 2.1: Screenshot showing a job post on Freelancer.com. The numbered fields are: (1) project budget, (2) employer username and rating, (3) project description, (4) employer-selected keywords.
Figure 2.2: Screenshot demonstrating the various fields of worker bids on Freelancer.com. The numbered fields are: (1) worker username, (2) worker bid, typically not reflective of the desired salary, (3) worker rating, (4) worker message to the employer detailing skills; additional messages are sent via private message.
Figure 2.3: Examples of CAPTCHAs from various Internet properties.
Figure 3.1: Growth in Freelancer activity over time. Numbers in parentheses in the legend denote the largest activity in a month.
Figure 3.2: Median monthly prices offered by buyers for 1,000 CAPTCHA solves (top) and the monthly volume of CAPTCHA-solving posts (bottom), both as functions of time. The solid vertical price bars show the 25% to 75% price quartiles.
Figure 3.3: Sites targeted in account registration jobs.
Figure 3.4: Demand for account registration jobs over time. The dashed vertical lines indicate approximate dates when Craigslist introduced phone verification for erotic services ads (March 2008) and other services (May 2008).
Figure 3.5: Number of job postings for social networking links.
Figure 3.6: Social link growth rate. The first day a social link appears is counted as Day 1.
Figure 3.7: The number of user accounts common to each pair of workers hired to create social links. Labeled solid lines indicate at least 100 user accounts (out of 1,000 requested) in common; dashed lines indicate at least 10 but fewer than 100 user accounts in common. Work performed by MY1, PK4, and BD2 was done in April, while the remaining jobs were done roughly a month later.
Figure 3.8: Median number of friends vs. median number of page social links for the sets of users linked to our websites.
Figure 3.9: Median monthly prices offered by buyers for each Craigslist ad posted (top), and the monthly number of posts (bottom), both as a function of time. The solid vertical price bars show the 25% to 75% price quartiles. The dashed vertical lines indicate approximate dates when Craigslist introduced phone verification for erotic services ads (March 2008) and other services (May 2008). The three bids received in response to our solicitation are indicated with a triangle on the right edge.
Figure 3.10: Median monthly prices offered by buyers for each article posted (top) and the monthly average number of buyer posts per day (bottom), as a function of time. Vertical price bars show the 25% to 75% price quartiles.
Figure 3.11: Average price buyers offered for backlinks on pages with a given PageRank (PR). Higher PRs correspond to more popular (and valuable) pages. The number above each bar corresponds to the number of jobs requesting backlinks of that PR.
Figure 3.12: Distributions of countries for buyers and bidders.
Figure 3.13: Top five countries of buyers posting abusive jobs.
Figure 3.14: Top five countries of workers bidding on abusive jobs.
Figure 3.15: How the various elements of the market fit together.
Figure 4.1: Examples of CAPTCHAs downloaded directly from reCaptcha at different time periods.
Figure 4.2: CAPTCHA-solving market workflow: GYC Automator tries to register a Gmail account and is challenged with a Google CAPTCHA. GYC uses the DeCaptcher plug-in to solve the CAPTCHA at $2 per 1,000. DeCaptcher queues the CAPTCHA for a worker on the affiliated PixProfit back end. PixProfit selects a worker and pays at $1 per 1,000. The worker enters a solution into PixProfit, which returns it to the plug-in. GYC then enters the solution for the CAPTCHA to Gmail to register the account.
Figure 4.3: Median error rate and response time (in seconds) for all services. Services are ranked top-to-bottom in order of increasing error rate.
Figure 4.4: Error rate and median response time for each combination of service and CAPTCHA type. The area of each circle in the upper table is proportional to the error rate (among solved CAPTCHAs). In the lower table, circle area is proportional to the response time minus ten seconds (for increased contrast); negative values are denoted by unshaded circles. Numeric values corresponding to the values in the leftmost and rightmost columns are shown on the side. Thus, the error rate of BypassCaptcha on Youku CAPTCHAs is 66%, and for BeatCaptchas on PayPal, 4%. The median …
Recommended publications
  • Symantec Report on Rogue Security Software July 08 – June 09
    Symantec Report on Rogue Security Software, July 08 – June 09. White Paper: Symantec Enterprise Security. Published October 2009.

    Contents: Introduction; Overview of Rogue Security Software (Risks; Advertising methods; Installation techniques; Legal actions and noteworthy scam convictions); Prevalence of Rogue Security Software (Top reported rogue security software; Additional noteworthy rogue security software samples; Top rogue security software by region; Top rogue security software installation methods; Top rogue security software advertising methods); Analysis of Rogue Security Software Distribution; Analysis of Rogue Security Software Servers; Appendix A: Protection and Mitigation; Appendix B: Methodologies; Credits.

    Introduction: The Symantec Report on Rogue Security Software is an in-depth analysis of rogue security software programs. This includes an overview of how these programs work and how they affect users, including their risk implications, various distribution methods, and innovative attack vectors. It includes a brief discussion of some of the more noteworthy scams, as well as an analysis of the prevalence of rogue security software globally. It also includes a discussion on a number of servers that Symantec observed hosting these misleading applications. Except where otherwise noted, the period of observation for this report was from July 1, 2008, to June 30, 2009. Symantec has established some of the most comprehensive sources of Internet threat data in the world through the Symantec™ Global Intelligence Network.
  • XRumer Manual
    XRumer manual © Botmaster Labs, 2006–2010.

    Table of Contents: Foreword. Part I, Introduction: About XRumer; How to start; Interface; Interface sleep mode; Files structure; Tools; Traffic saving algorithm. Part II, Program settings: Proxy checking settings; Speed & success rate; AntiSpam system …
  • DSpin: Detecting Automatically Spun Content on the Web
    DSpin: Detecting Automatically Spun Content on the Web. Qing Zhang, David Y. Wang, Geoffrey M. Voelker (University of California, San Diego).

    Abstract—Web spam is an abusive search engine optimization technique that artificially boosts the search result rank of pages promoted in the spam content. A popular form of Web spam today relies upon automated spinning to avoid duplicate detection. Spinning replaces words or phrases in an input article to create new versions with vaguely similar meaning but sufficiently different appearance to avoid plagiarism detectors. With just a few clicks, spammers can use automated tools to spin a selected article thousands of times and then use posting services via proxies to spam the spun content on hundreds of target sites. The goal of this paper is to develop effective techniques to detect automatically spun content on the Web. Our approach is directly tied to the underlying mechanism used by automated spinning …

    From the introduction: … The belief is that these "backlinks", as well as keywords, will increase the page rank of the promoted sites in Web search results, and consequently attract more traffic to the promoted sites. Search engines seek to identify duplicate pages with artificial backlinks to penalize them in the page rank calculation, but spinning evades detection by producing artificial content that masquerades as original. There are two ways content can be spun. The first is to employ humans to spin the content, as exemplified by the many spinning jobs listed on pay-for-hire Web sites such as Fiverr and Freelancer [26].
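    The abstract above turns on a single mechanism: spinning substitutes synonyms for individual words but leaves rarer, hard-to-substitute words intact, and that residue is what makes spun copies of one source article detectable. A minimal Python sketch of that intuition, using a toy Jaccard comparison over non-stopword tokens; this illustrates the idea only, not the paper's actual DSpin algorithm:

```python
import re

# Toy stopword list standing in for the "mutable" words a spinner rewrites.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "that", "for", "or"}

def immutable_words(text):
    """Approximate an article's 'immutable' words: tokens a spinner is
    unlikely to substitute (here, anything outside the stopword list)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def spun_similarity(a, b):
    """Jaccard overlap of immutable words. Spun variants of one source
    share most of these words, so they score far higher than unrelated
    articles do."""
    wa, wb = immutable_words(a), immutable_words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

original = "Spinning replaces words or phrases in an input article."
spun = "Spinning swaps words or phrases in an input piece."
print(spun_similarity(original, spun))  # 0.5 for this toy pair
```

    A real detector must also handle phrase-level substitutions and scale pairwise comparison to Web-sized corpora, which is where the engineering effort in the paper lies.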
  • Russian Underground 101
    Trend Micro Incorporated Research Paper, 2012. Russian Underground 101. Max Goncharov.

    Contents: Introduction; File Encryption and Crypting Services (Crypter Prices); Dedicated Servers (Dedicated Server Prices); Proxy Servers; SOCKS Bots; VPN Services (VPN Service Prices); Pay-per-Install Services …
  • Re: CAPTCHAs – Understanding CAPTCHA-Solving Services in an Economic Context
    Re: CAPTCHAs – Understanding CAPTCHA-Solving Services in an Economic Context. Marti Motoyama, Kirill Levchenko, Chris Kanich, Damon McCoy, Geoffrey M. Voelker and Stefan Savage (University of California, San Diego).

    Abstract: Reverse Turing tests, or CAPTCHAs, have become a ubiquitous defense used to protect open Web resources from being exploited at scale. An effective CAPTCHA resists existing mechanistic software solving, yet can be solved with high probability by a human being. In response, a robust solving ecosystem has emerged, reselling both automated solving technology and real-time human labor to bypass these protections. Thus, CAPTCHAs can increasingly be understood and evaluated in purely economic terms: the market price of a solution vs. the monetizable value of the asset being protected. We examine the market side of this question in depth, analyzing the behavior and dynamics of CAPTCHA-solving service providers, their price performance, and the underlying labor markets driving this economy.

    From the introduction: … alphanumeric characters that are distorted in such a way that available computer vision algorithms have difficulty segmenting and recognizing the text. At the same time, humans, with some effort, have the ability to decipher the text and thus respond to the challenge correctly. Today, CAPTCHAs of various kinds are ubiquitously deployed for guarding account registration, comment posting, and so on. This innovation has, in turn, attached value to the problem of solving CAPTCHAs and created an industrial market. Such commercial CAPTCHA solving comes in two varieties: automated solving and human labor. The first approach defines a technical arms race between those developing solving algorithms and those who develop ever more obfuscated CAPTCHA challenges in response.
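    The dissertation's Figure 4.2 (see the list of figures above) sketches the customer side of such a service: software submits a CAPTCHA image over an API, a human worker types the answer, and the text comes back within seconds at a retail price of roughly $2 per 1,000 solves. Below is a hypothetical Python client showing that submit-and-poll pattern; the endpoint, field names, and JSON shape are invented for illustration and are not DeCaptcher's real API:

```python
import time
import requests  # third-party: pip install requests

API = "https://solver.example.com"  # hypothetical service endpoint

def solve_captcha(image_path, api_key, timeout=30.0):
    """Submit a CAPTCHA image, then poll until a human worker's answer
    arrives or the deadline passes."""
    with open(image_path, "rb") as f:
        resp = requests.post(f"{API}/submit",
                             files={"image": f},
                             data={"key": api_key})
    resp.raise_for_status()
    task_id = resp.json()["task_id"]

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        poll = requests.get(f"{API}/result/{task_id}", params={"key": api_key})
        body = poll.json()
        if body["status"] == "solved":
            return body["text"]  # typed by a human worker
        time.sleep(2)  # median service latencies are on the order of seconds
    raise TimeoutError("no solution before deadline")
```

    The economics follow directly from this interface: at the Figure 4.2 prices of $2 per 1,000 solves retail and $1 per 1,000 paid to workers, each solve costs the customer a fifth of a cent, so a CAPTCHA deters only attackers for whom a solved challenge is worth less than that.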
  • Cloak of Visibility: Detecting When Machines Browse a Different Web
    Cloak of Visibility: Detecting When Machines Browse A Different Web. Luca Invernizzi, Kurt Thomas, Alexandros Kapravelos, Oxana Comanescu, Jean-Michel Picod, and Elie Bursztein (Google, Inc.; North Carolina State University).

    Abstract—The contentious battle between web services and miscreants involved in blackhat search engine optimization and malicious advertisements has driven the underground to develop increasingly sophisticated techniques that hide the true nature of malicious sites. These web cloaking techniques hinder the effectiveness of security crawlers and potentially expose Internet users to harmful content. In this work, we study the spectrum of blackhat cloaking techniques that target browser, network, or contextual cues to detect organic visitors. As a starting point, we investigate the capabilities of ten prominent cloaking services marketed within the underground. This includes a first look at multiple IP blacklists that contain over 50 million addresses tied to the top five search engines and tens of anti-virus and security crawlers.

    From the introduction: … a limited set of known cloaking techniques. These include redirect cloaking in search engine results [16], [18], [24], [27], [33], [34] or search visitor profiling based on the User-Agent and Referer of HTTP requests [32], [35]. An open question remains as to what companies and crawlers blackhat cloaking software targets, the capabilities necessary for security practitioners to bypass state of the art cloaking, and ultimately whether blackhat techniques generalize across traffic sources including search results and advertisements. In this paper, we marry both an underground and empirical perspective of blackhat cloaking to study how miscreants scrutinize an incoming client's browser, network, and contextual setting and the impact it has on polluting search results and advertisements.
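    The excerpt above names the two classic profiling signals, the User-Agent and Referer headers. A toy Python (Flask) handler showing how a cloaked page applies them; the crawler patterns and page contents are purely illustrative, not taken from any service studied in the paper:

```python
from flask import Flask, request

app = Flask(__name__)

CRAWLER_SIGNS = ("googlebot", "bingbot", "slurp")  # illustrative patterns

@app.route("/promo")
def promo():
    """Serve benign content to suspected crawlers and the promoted payload
    to visitors who appear to arrive from organic search results."""
    ua = request.headers.get("User-Agent", "").lower()
    referer = request.headers.get("Referer", "").lower()
    if any(sign in ua for sign in CRAWLER_SIGNS):
        return "<h1>Harmless recipe blog</h1>"  # what the crawler indexes
    if "google." in referer or "bing." in referer:
        return "<h1>Promoted content for organic visitors</h1>"
    return "<h1>Default page</h1>"

if __name__ == "__main__":
    app.run()
```

    Detecting this, as the paper does, amounts to fetching the same URL under different browser, network, and contextual personas and comparing what comes back.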
  • The Nuts and Bolts of a Forum Spam Automator
    The Nuts and Bolts of a Forum Spam Automator. Youngsang Shin, Minaxi Gupta, Steven Myers (School of Informatics and Computing, Indiana University, Bloomington). USENIX LEET 2011.

    Motivation: The Web is huge and keeps expanding: over 255 million active websites on the Internet, 21.4 million of them newly added in 2010, and Google claimed to know of one trillion pages even in 2008. Making a website discoverable is challenging! Web spamming exploits Search Engine Optimization (SEO) techniques such as keyword stuffing, cloaking, link farms, and content farms.

    Why forum spamming? A forum is a website where visitors can contribute content; examples include web boards, blogs, wikis, and guestbooks. Forums are an attractive target for spamming: many contain valuable information, and blacklisting or taking them down is not an option in most cases. Spammers benefit from forum spamming because visitors can be directed to spammers' websites, boosting search engine rankings for those websites.

    Overview of forum spam automators: their basic function is to automate the process of posting forum spam. Advanced functions aim to improve the success rate of spamming by evading forum spam mitigation techniques: registration, email address verification, legitimate posting history, and CAPTCHAs. Examples include XRumer, SEnuke, ScrapeBox, AutoPligg, and Ultimate WordPress Comment Submitter (UWCS).

    Outline: Introduction; Overview of Forum Spam Automators; Primal Functionalities; Advanced Functionalities; Traffic Characteristics …
  • Xrumer Manual
    XRumer manual © Botmaster Labs, 2006–2008.

    Table of Contents: Foreword. Part I, Introduction: About XRumer; How to start; Interface; Interface sleep mode; Files structure; Tools; Traffic saving algorithm. Part II, Program settings: Proxy checking settings; Speed & success rate; AntiSpam system …
  • The Nuts and Bolts of a Forum Spam Automator
    The Nuts and Bolts of a Forum Spam Automator. Youngsang Shin, Minaxi Gupta, Steven Myers (School of Informatics and Computing, Indiana University, Bloomington).

    Abstract: Web boards, blogs, wikis, and guestbooks are forums frequented and contributed to by many Web users. Unfortunately, the utility of these forums is being diminished due to spamming, where miscreants post messages and links not intended to contribute to forums, but to advertise their websites. Many such links are malicious. In this paper we investigate and compare automated tools used to spam forums. We analyze the functionality of the most popular forum spam automator, XRumer, in detail and find that it can intelligently get around many practices used by forums to distinguish humans from bots, all while keeping the spammer hidden. Insights gained from our study suggest specific measures that can be used to block spamming by this automator.

    From the introduction: The effects of forum spamming have been studied before [18, 21]. In contrast, we focus on how spammers actually post spam links on forums since understanding their modus operandi can offer important insights for mitigating forum spamming. Toward our goal, we survey forum spam automator tools including XRumer, SEnuke, ScrapeBox, AutoPligg, and Ultimate WordPress Comment Submitter (UWCS) for their functionality. These tools enable forum spammers to automatically post spam to a large number of target forums. We explore XRumer [7] in detail. It is one of the most popular forum spamming automators on blackhat SEO forums, in fact perhaps the most, and has the most extensive set of features. These include the ability to automatically register fake accounts at target forums, build …
  • NOMAD: Towards Non-Intrusive Moving-Target Defense Against Web Bots
    NOMAD: Towards Non-Intrusive Moving-Target Defense against Web Bots. Shardul Vikram, Chao Yang, Guofei Gu (SUCCESS Lab, Texas A&M University).

    Abstract—Web bots, such as XRumer, Magic Submitter and SENuke, have been widely used by attackers to perform illicit activities, such as massively registering accounts, sending spam, and automating web-based games. Although the technique of CAPTCHA has been widely used to defend against web bots, it requires users to solve some explicit challenges, which is typically interactive and intrusive, resulting in decreased usability. In this paper, we design a novel, non-intrusive moving-target defense system, NOMAD, to complement existing solutions. NOMAD prevents web bots from automating web resource access by randomizing HTML elements while not affecting normal users. Specifically, to prevent web bots uniquely identifying HTML elements for later automation, NOMAD randomizes name/id parameter values of HTML elements in each HTTP form page. We evaluate NOMAD against five powerful state-of-the-art web bots on several popular open source web platforms.

    From the introduction: … accounts with these free email services [12]. Popular online social networking websites such as Twitter and Facebook have also become web bots' attacking targets [13]. Most existing work utilizes CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) to defend against web bots [21]–[23], [26], [38]. However, CAPTCHA requires users to solve explicit challenges, which is typically interactive and intrusive. Also, the robustness of CAPTCHA relies on intrinsic difficulties of artificial intelligence challenges, which could be compromised through achieving artificial intelligence breakthroughs [21], [22]. As those challenges become more complex to defend against evolved web bots, they have become more difficult as well for legitimate …
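    The NOMAD abstract hinges on one concrete mechanism: each form a client receives uses randomized element names, so a bot cannot hard-code which field is which. A minimal Python sketch of that idea under stated assumptions; the HMAC-based aliasing and the function names are illustrative, not the paper's implementation:

```python
import hashlib
import hmac
import secrets

def randomize_field(name, session_key):
    """Derive a per-session alias for a canonical form field name. The
    server can recompute the alias, but across sessions the alias changes,
    defeating bots that memorize name/id values."""
    digest = hmac.new(session_key, name.encode(), hashlib.sha256).hexdigest()
    return "f_" + digest[:16]

def derandomize(submitted, canonical_fields, session_key):
    """Map a submitted form's aliased keys back to canonical field names."""
    alias_to_name = {randomize_field(n, session_key): n for n in canonical_fields}
    return {alias_to_name[k]: v for k, v in submitted.items() if k in alias_to_name}

key = secrets.token_bytes(32)  # one secret per user session
form = {randomize_field("username", key): "alice",
        randomize_field("password", key): "hunter2"}
print(derandomize(form, ["username", "password"], key))
# {'username': 'alice', 'password': 'hunter2'}
```

    Because the aliases change with the session key, a recorded automation script breaks on the next visit, while a human filling in the rendered labels notices nothing.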