EPL682 - ADVANCED SECURITY TOPICS Instructor: Elias Athanasopoulos CAPTCHA REPORT Andreas Charalampous April 2020

1. Captcha Background

1.1. Introduction

Using computers as bots, attackers can operate at scale, for example by automatically registering spam accounts or posting comments. A defense mechanism is therefore needed that guards resources (e.g. account registration) from automated, large-scale attacks without blocking humans from accessing them. Captcha is such a mechanism: it protects web resources from being exploited at scale.

Captcha stands for Completely Automated Public Turing test to tell Computers and Humans Apart. As its name says, it is a test for determining whether a user is human: a user who passes the test is considered a human, otherwise a computer. It is also known as the “Reverse Turing Test”. When a user wants to access a web resource, a challenge is shown, and the user must solve it by giving the correct answer in order to continue and access the resource.

1.2. Captcha Challenges

A Captcha challenge must be hard for a bot and at the same time easy for a human to solve. Since 1997, when Captcha was first introduced, challenges have kept evolving and new types have been created. The first version of the Captcha challenge was the “twisted text” [Picture 1a], where the user was shown distorted text and had to type it. Other early and widely used challenges are the math captcha, the audio captcha [Picture 1b] and the image captcha [Picture 1c]. There is a wide variety of other captcha challenges [Picture 1d - 1f].

Picture 1: Captcha Challenges. (a) Twisted Text, (b) Math/Audio, (c) Image, (d) SlideLock, (e) Drag n’ Drop, (f) Trivial

1.3. reCaptcha

reCaptcha was developed in 2007, was acquired by Google in 2009, and is today the most used Captcha.
The three most common reCaptcha challenges are the distorted text reCaptcha [Picture 2a], the image reCaptcha [Picture 2b] and the noCaptcha reCaptcha (checkbox) [Picture 2c]. Captchas have been evolving for more than 20 years and will keep evolving as new kinds are created. One reason is that providers keep improving them, making them easier for humans, including users with health impairments, and at the same time harder for bots. Another is that captchas keep being bypassed by automation software or solver services, which creates an arms race between solvers and providers.

Picture 2: reCaptcha Challenges. (a) Distorted text, (b) Image, (c) Checkbox

The distorted text reCaptcha was used as an aid to digitize the archives of “The New York Times”. During the automatic scanning of the archives, many words were not recognized by computers. To transcribe these words, each unknown word is sent as a challenge together with a known word. If the user answers the known word correctly, the guess for the unknown word is taken as a candidate transcription of the scanned word. For better accuracy, the same unknown word is given in multiple challenges to different users.

1.4. noCaptcha reCaptcha

As captchas evolved, human solving services were founded that sold solutions provided by human solvers. In 2014 the noCaptcha reCaptcha was developed to distinguish not only bots from humans, but also good humans from bad humans (fraudulent solvers). This rather easier challenge consists of a checkbox that the user is asked to just click. In the background, a behavioral analysis of the user and the browser is performed to decide whether the user is a bot or a human (good or bad). More specifically, the Advanced Risk Analysis System (ARAS) acquires user information from Google tracking cookies and the browser, analyzes it, and based on the result presents the user with an easy challenge (image), a hard challenge (difficult distorted text) or no challenge at all.
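The three-way outcome described above can be illustrated with a minimal sketch. How the ARAS actually scores a user is not described in this report, so the `risk_score` input, the two thresholds and the name `select_challenge` are all hypothetical; only the three outcomes (no challenge, image challenge, distorted text challenge) come from the text.

```python
from enum import Enum


class Challenge(Enum):
    NONE = "no challenge"
    IMAGE = "image challenge (easy)"
    DISTORTED_TEXT = "distorted text challenge (hard)"


# Hypothetical cut-offs on a 0.0 (trusted) to 1.0 (suspicious) scale.
# These values are assumptions for illustration, not known ARAS parameters.
LOW_RISK = 0.3
HIGH_RISK = 0.7


def select_challenge(risk_score: float) -> Challenge:
    """Map a hypothetical risk score to one of the three outcomes in the text."""
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must lie in [0, 1]")
    if risk_score < LOW_RISK:
        return Challenge.NONE            # user looks legitimate: checkbox only
    if risk_score < HIGH_RISK:
        return Challenge.IMAGE           # uncertain: easy challenge
    return Challenge.DISTORTED_TEXT      # suspicious: hard challenge


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        print(score, "->", select_challenge(score).value)
```

The point of the sketch is only that the difficulty of the challenge is a function of the analysis result, which is why influencing that analysis (for example through cookies) can change what a client is asked to solve.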
When a site protects a resource, it embeds a reCaptcha widget, which collects the cookie and browser information. The user is given a checkbox (as shown in Picture 2c) and is asked to click it. When the user clicks it, a request is sent to Google containing the Referrer, the SiteKey, the Cookie and all information gathered by the widget. These are analyzed by the ARAS, and an HTML frame with the corresponding challenge is returned to the user. In addition, when the checkbox is clicked, an HTML field is populated with a token, which must be validated by Google and then submitted to the site hosting the resource. The token becomes valid if the user is deemed legitimate or passes the given test. When the site receives the token from the user, it sends a verification request to Google and gets a response indicating whether the token verification succeeded. Finally, if verification succeeded, the site grants access to the resource.

2. Re: CAPTCHAs – Understanding CAPTCHA-Solving Services in an Economic Context

Marti Motoyama, Kirill Levchenko, Chris Kanich, Damon McCoy, Geoffrey M. Voelker and Stefan Savage
University of California, San Diego

2.1. Introduction

Deploying Captchas to guard resources did not end the story: attackers who were “abusing” those resources now look for ways to bypass captchas. This created a demand for solving captchas automatically, and services exploit that demand by selling captcha solving, forming a whole business model. The paper presents the two types of solvers, automated solvers and human labor solvers, followed by the economics around them.

2.2. Automated Solvers

The first type is automated solvers, mainly software that uses Optical Character Recognition (OCR) algorithms to read and solve text Captchas. The two solvers investigated were Xrumer and reCaptchaOCR.
These solvers forced providers to change their Captchas, creating an arms race that favors the defenders (the providers). The first reason is that developing new solvers requires highly skilled labor. Another reason is that these solvers have low accuracy and most sites blacklist IP addresses after 5-7 failed attempts. Finally, sites have alternative captchas ready for swift deployment in case the existing ones get bypassed. Besides losing the arms race, automated solvers did not survive in the market because of the human solvers.

2.3. Human Solvers

The second type is human solvers. The motivation is that Captchas are intended to obstruct automated solvers, and this can be sidestepped by handing captchas to human labor pools. Paid solving is the core of the Captcha-solving ecosystem. There is a whole business model around paid solving services; an example is shown in Picture 3, where automated spamming software tries to create multiple Gmail accounts but is stopped by a Captcha.

Picture 3 - Captcha-solving market workflow

1. GYC Automator (Client) tries to create a Gmail account and is challenged with a Captcha.
2. The Client pays DeCaptcher (Solving Service) to solve the Captcha.
3. The Solving Service puts the Captcha in a PixProfit (Workers forum) pool.
4. PixProfit selects a worker from a pool.
5. The worker responds to PixProfit with the solution.
6. PixProfit sends the solution to the Solving Service, which forwards it to the Client.
7. The Client enters the solution in Gmail, gets validated and the account is created.

In an attempt to find geolocation details about the workers, the authors created Captchas in different languages or Captchas asking for the local time and submitted them to the human-based solving services, concluding that more workers come from low-cost labor countries (China, India, etc.). Because paid solving is an unskilled activity and moved to low-cost labor in Eastern Europe, Bangladesh, China, India, Vietnam, etc., paid-solving services not only survived but also expanded and became highly competitive. Although wages started at $10/1000¹, within a few years they dropped to about $0.5/1000.

2.4. Conclusion

• The quality of captchas made them easy to outsource to the global unskilled labor market.
• The business of solving captchas is growing and highly competitive.
• Do Captchas work?
  o Telling computers and humans apart: succeeded.
  o Preventing automated site access: failed.
  o Limiting automated site access: reduces the attacker's expected profit.

¹ The price of Captcha-solving services is counted in dollars paid per 1000 solved captchas. For example, $5/1000 means the client pays 5 dollars for 1000 solved captchas.

3. I am Robot: (DEEP) Learning to Break Semantic Image CAPTCHAs

Suphannee Sivakorn, Iasonas Polakis and Angelos D. Keromytis
Department of Computer Science, Columbia University, New York, USA

3.1. Introduction

Two other Captcha attacks were developed for research purposes: one solves the reCaptcha image challenge using online image annotation modules, and the other targets the noCaptcha reCaptcha by influencing the Advanced Risk Analysis System. To achieve this, a system consisting of two main components was developed.

3.2. System Overview

3.2.1. Cookie Manager

The first component is the Cookie Manager, whose main job is to automatically create and train cookies so that they appear to belong to real users. After creating each cookie, the system is configured to perform human-like actions with it, for example searching Google for certain terms and following the returned links, opening videos on YouTube, searching Google Maps, etc.

3.2.2. ReCaptcha Breaker

The second component is the ReCaptcha Breaker. It uses the cookies from the Cookie Manager and visits sites that employ reCaptchas. It locates the reCaptcha iframe that contains the checkbox by looking for reCaptcha-anchor, performs a click and extracts the reCaptcha-token.
If the reCaptcha is solved at that point, it is considered a checkbox challenge; otherwise, if a popup is created in goog-bubble-content, an image challenge is shown. The details of the image challenge, namely the hint and sample image (rc-imageselect-desc) and the candidate images (rc-imageselect-tile), are extracted and passed to another module.
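The decision between a checkbox and an image challenge can be sketched as a check on which markers appear in the returned markup. This is only an illustration and not the authors' implementation: it parses a static, made-up HTML snippet with Python's standard `html.parser`, the marker names are copied from the text above, and `MarkerCollector` and `classify_challenge` are hypothetical helper names. As a simplification, the absence of the popup marker is treated as the checkbox case.

```python
from collections import Counter
from html.parser import HTMLParser

# Marker names as given in the text; whether real pages use them as ids or
# classes is not verified here, so the collector accepts both.
POPUP_MARKER = "goog-bubble-content"
HINT_MARKER = "rc-imageselect-desc"
TILE_MARKER = "rc-imageselect-tile"


class MarkerCollector(HTMLParser):
    """Counts every id and class value seen in start tags."""

    def __init__(self):
        super().__init__()
        self.seen = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("id", "class") and value:
                self.seen.update(value.split())


def classify_challenge(html: str) -> dict:
    """Return the challenge kind and how many hint/tile elements were found."""
    collector = MarkerCollector()
    collector.feed(html)
    collector.close()
    if collector.seen[POPUP_MARKER] == 0:
        # Simplification: no popup is treated as the checkbox-only case.
        return {"kind": "checkbox", "hints": 0, "tiles": 0}
    return {
        "kind": "image",
        "hints": collector.seen[HINT_MARKER],
        "tiles": collector.seen[TILE_MARKER],
    }


# Made-up sample markup, for illustration only.
SAMPLE = """
<div class="goog-bubble-content">
  <div class="rc-imageselect-desc">Select all images with wine</div>
  <div class="rc-imageselect-tile"><img src="a.jpg"></div>
  <div class="rc-imageselect-tile"><img src="b.jpg"></div>
</div>
"""

if __name__ == "__main__":
    print(classify_challenge(SAMPLE))
    print(classify_challenge("<div id='reCaptcha-anchor'></div>"))
```

In the system described above, the extracted hint and candidate images would then be handed to the next module; the sketch stops at counting them.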
