Re: Captchas – Understanding CAPTCHA-Solving Services in an Economic Context
Total Page:16
File Type:pdf, Size:1020Kb
Re: CAPTCHAs – Understanding CAPTCHA-Solving Services in an Economic Context Marti Motoyama, Kirill Levchenko, Chris Kanich, Damon McCoy, Geoffrey M. Voelker and Stefan Savage University of California, San Diego fmmotoyam, klevchen, ckanich, dlmccoy, voelker, [email protected] Abstract alphanumeric characters that are distorted in such a way that available computer vision algorithms have difficulty Reverse Turing tests, or CAPTCHAs, have become an segmenting and recognizing the text. At the same time, ubiquitous defense used to protect open Web resources humans, with some effort, have the ability to decipher from being exploited at scale. An effective CAPTCHA the text and thus respond to the challenge correctly. To- resists existing mechanistic software solving, yet can day, CAPTCHAs of various kinds are ubiquitously de- be solved with high probability by a human being. In ployed for guarding account registration, comment post- response, a robust solving ecosystem has emerged, re- ing, and so on. selling both automated solving technology and real- This innovation has, in turn, attached value to the time human labor to bypass these protections. Thus, problem of solving CAPTCHAs and created an indus- CAPTCHAs can increasingly be understood and evaluated trial market. Such commercial CAPTCHA solving comes in purely economic terms; the market price of a solution in two varieties: automated solving and human labor. vs the monetizable value of the asset being protected. We The first approach defines a technical arms race between examine the market-side of this question in depth, ana- those developing solving algorithms and those who de- lyzing the behavior and dynamics of CAPTCHA-solving velop ever more obfuscated CAPTCHA challenges in re- service providers, their price performance, and the un- sponse. However, unlike similar arms races that revolve derlying labor markets driving this economy. around spam or malware, we will argue that the underly- ing cost structure favors the defender, and consequently, 1 Introduction the conscientious defender has largely won the war. The second approach has been transformative, since Questions of Internet security frequently reflect under- the use of human labor to solve CAPTCHAs effectively lying economic forces that create both opportunities and side-steps their design point. Moreover, the combination incentives for exploitation. For example, much of today’s of cheap Internet access and the commodity nature of Internet economy revolves around advertising revenue, today’s CAPTCHAs has globalized the solving market; and consequently, a vast array of services—including e- in fact, wholesale cost has dropped rapidly as providers mail, social networking, blogging—are now available to have recruited workers from the lowest cost labor mar- new users on a basis that is both free and largely anony- kets. Today, there are many service providers that can mous. The implicit compact underlying this model is that solve large numbers of CAPTCHAs via on-demand ser- the users of these services are individuals and thus are vices with retail prices as low as $1 per thousand. effectively “paying” for services indirectly through their In either case, we argue that the security of CAPTCHAs unique exposure to ad content. Unsurprisingly, attack- can now be considered in an economic light. This prop- ers have sought to exploit this same freedom and acquire erty pits the underlying cost of CAPTCHA solving, ei- large numbers of resources under singular control, which ther in amortized development time for software solvers can in turn be monetized (e.g., via thousands of free Web or piece-meal in the global labor market, against the mail accounts for sourcing spam e-mail messages). value of the asset it protects. While the very existence of CAPTCHAs were developed as a means to limit the CAPTCHA-solving services tells us that the value of the ability of attackers to scale their activities using auto- associated assets (e.g., an e-mail account) is worth more mated means. In its most common implementation, a to some attackers than the cost of solving the CAPTCHA, CAPTCHA consists of a visual challenge in the form of the overall shape of the market is poorly understood. Ab- (a) Aol. (b) mail.ru (c) phpBB 3.0 (d) Simple Machines Forum (e) Yahoo! (f) youku Figure 1: Examples of CAPTCHAs from various Internet properties. sent this understanding, it is difficult to reason about the data collection approach and then presenting our experi- security value that CAPTCHAs offer us. ments to measure key qualities such as response time, ac- This paper investigates this issue in depth and, where curacy, and capacity. Section6 describes the demograph- possible, on a empirical basis. We document the commer- ics of the CAPTCHA-solving labor pool. Finally, we dis- cial evolution of automated solving tools (particularly via cuss the implications of our results in Section7 along the successful Xrumer forum spamming package) and with potential directions for future research. how they have been largely eclipsed by the emergence of the human-based CAPTCHA-solving market. To char- 2 Background acterize this latter development, our approach is to en- gage the retail CAPTCHA-solving market on both the sup- The term “CAPTCHA” was first introduced in 2000 by ply side and the demand side, as both a client and as von Ahn et al. [21], describing a test that can differentiate “workers for hire.” In addition to these empirical mea- humans from computers. Under common definitions [4], surements, we also interviewed the owner and operator the test must be: of a successful CAPTCHA-solving service (MR. E), who has provided us both validation and insight into the less • Easily solved by humans, visible aspects of the underlying business processes.1 In • Easily generated and evaluated, but the course of our analysis, we attempt to address key • Not easily solved by computer. questions such as which CAPTCHAs are most heavily tar- geted, the rough solving capacity of the market leaders, Over the past decade, a number of different techniques the relationship of service quality to price, the impact for generating CAPTCHAs have been developed, each of market transparency and arbitrage, the demographics satisfying the properties described above to varying de- of the underlying workforce and the adaptability of ser- grees. The most commonly found CAPTCHAs are visual challenges that require the user to identify alphanumeric vice offerings to changes in CAPTCHA content. We be- characters present in an image obfuscated by some com- lieve our findings, or at least our methodology, provide 2 a context for reasoning about the net value provided by bination of noise and distortion. Figure1 shows ex- amples of such visual CAPTCHAs. The basic challenge CAPTCHAs under existing threats and offer some direc- tions for future development. in designing these obfuscations is to make them easy The remainder of this paper is organized as fol- enough that users are not dissuaded from attempting a so- lution, yet still too difficult to solve using available com- lows: Section2 reviews CAPTCHA design and provides puter vision algorithms. a qualitative history and overview of the CAPTCHA- solving ecosystem. Next, in Section3 we empirically The issue of usability has been studied on a functional characterize two automated solver systems, the popular level—focusing on differences in expected accuracy and Xrumer package and a specialized reCaptcha solver. In response time [3, 19, 22, 26]—but the ultimate effect of Sections4 and5 we then characterize today’s human- CAPTCHA difficulty on legitimate goal-oriented users is not well documented in the literature. That said, Elson et powered CAPTCHA-solving services, first describing our al. provide anecdotal evidence that “even relatively sim- ple challenges can drive away a substantial number of po- 1By agreement, we do not identify MR. E or the particular service he runs. While we cannot validate all of his statements, when we tested 2There exists a range of non-textual and even non-visual CAPTCHAs his service empirically our results for measures such as response time, that have been created but, excepting Microsoft’s Asirra [9], we do not accuracy, capacity and labor makeup were consistent with his reports, consider them here as they play a small role in the current CAPTCHA- supporting his veracity. solving ecosystem. 2 tential customers” [9], suggesting CAPTCHA design re- Xrumer flects a real trade-off between protection and usability. The second challenge, defeating automation, has re- Xrumer [24] is a well-known forum spamming tool, ceived far more attention and has kicked off a competi- widely described on “blackhat” SEO forums as being one tion of sorts between those building ever more sophisti- of the most advanced tools for bypassing many differ- ent anti-spam mechanisms, including CAPTCHAs. It has cated algorithms for breaking CAPTCHAs and those cre- been commercially available since 2006 and currently re- ating new, more obfuscated CAPTCHAs in response [7, 11, 16, 17, 18, 25]. In the next section we examine this tails for $540, and we purchased a copy from the au- issue in more depth and explain why, for economic rea- thor at this price for experimentation. While we would sons, automated solving has been relegated to a niche have liked to include several other well known spamming status in the open market. tools (SEnuke, AutoPligg, ScrapeBox, etc), the cost of these packages range from $97 to $297, which would Finally, an alternative regime for solving CAPTCHAs is to outsource the problem to human workers. Indeed, render this study prohibitively expensive. this labor-based approach has been commoditized and Xrumer’s market success in turn led to a surge of today a broad range of providers operate to buy and sell spam postings causing most service providers targeted by Xrumer to update their CAPTCHAs. This development CAPTCHA-solving service in bulk. We are by no means the first to identify the growth of this activity.