EPL682 - ADVANCED SECURITY TOPICS
Instructor: Elias Athanasopoulos
CAPTCHA REPORT
Andreas Charalampous
April 2020

1. Captcha Background

1.1. Introduction

Using computers as bots, attackers can operate at scale, for example by automatically registering spam accounts or posting comments. A defense mechanism is clearly needed that guards resources (e.g. account registration) against automated, large-scale attacks without blocking humans from accessing them.

A defense mechanism for this purpose is Captcha, which protects web resources from being exploited at scale. Captcha stands for Completely Automated Public Turing test to tell Computers and Humans Apart and, as its name says, it is a test used to determine whether a user is human: a user who passes the test is considered human, otherwise the user is considered a computer. It is also known as the “Reverse Turing Test”. When a user wants to access a web resource, a challenge is shown, and the user must solve it by giving the correct answer in order to continue and access the resource.

1.2. Captcha Challenges

A Captcha challenge must make bots fail while remaining easy for humans to solve. Since Captchas were first introduced in 1997, challenges have kept evolving and new types have been created. The first version of the Captcha challenge was the “twisted text” [Picture 1a], where the user was shown distorted text and had to type the text shown. Early and widely used challenges are the math captcha, the audio captcha [Picture 1b] and the image captcha [Picture 1c]. There is a wide variety of other captcha challenges [Picture 1d - 1f].

(a) Twisted Text (b) Math/Audio (c) Image

(d) SlideLock (e) Drag n’ Drop (f) Trivial

Picture 1: Captcha Challenges

1.3. reCaptcha

reCaptcha was developed in 2007, was acquired by Google in 2009, and is today the most used Captcha. The three most common reCaptcha types are the distorted text reCaptcha [Picture 2a], the image reCaptcha [Picture 2b] and the noCaptcha reCaptcha (checkbox) [Picture 2c]. Captchas have been evolving for more than 20 years and will keep evolving as new kinds are created. The reason is that providers keep improving them, looking for ways to make them easier for humans, including minority groups such as users with health impairments, while making them harder for bots. In addition, captchas keep being bypassed by automation software or solver services, creating an arms race between solvers and providers.

(a) Distorted text (b) Image (c) Checkbox

Picture 2: reCaptcha Challenges

The distorted text reCaptcha was used to help digitize scanned archives. During the automatic scanning of the archives, many words were not recognized by computers. To transcribe these words, each unknown word is sent as a challenge together with a known word. If the user answers the known word correctly, the user's guess for the unknown word is taken as a candidate transcription of the scanned word. For more accuracy, the same unknown word is given in multiple challenges to different users.

1.4. noCaptcha reCaptcha

As captchas evolved, human solving services were founded that sold solutions provided by human solvers. In 2014 the noCaptcha reCaptcha was developed to distinguish not only bots from humans, but also good humans from bad humans (fraudulent solvers). This much easier challenge consists of a checkbox that the user is simply asked to click. In the background, a behavioral analysis of the user and the browser is performed to decide whether the user is a bot or a human (good or bad). More specifically, the Advanced Risk Analysis System (ARAS) acquires user information from Google tracking cookies and the browser, analyzes it, and based on the result gives the user an easy challenge (image), a hard challenge (difficult distorted text) or no challenge at all.

When a site protects a resource, it embeds a reCaptcha widget, which collects the cookie and browser information. The user is shown a checkbox (as in Picture 2c) and is asked to click it. When the user clicks it, a request is sent to Google containing the Referrer, SiteKey, Cookie and all information gathered by the widget; these are analyzed by the ARAS, and an HTML frame with the corresponding challenge is returned to the user. Also, when the checkbox is clicked, an HTML field is populated with a token, which must be validated by Google and then submitted to the site hosting the resource. The token becomes valid if the user is deemed legitimate or once the user passes the given test. When the site receives the token from the user, it sends a verification request to Google and gets a response indicating whether the token verification succeeded. If it did, the site grants access to the resource.
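The last step of this flow, the site checking the token with Google, can be sketched as follows. This is a minimal sketch, not the report's or Google's reference code: it assumes the publicly documented siteverify endpoint with its secret, response and remoteip form fields, while the function names, the injectable transport parameter and the fake transport are hypothetical and exist only so the logic can be exercised without a network call.

```python
import json
from typing import Callable, Dict, Optional
from urllib import parse, request

# Publicly documented verification endpoint (assumed unchanged).
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def post_form(url: str, fields: Dict[str, str]) -> dict:
    """Send a form-encoded POST and decode the JSON reply (real network call)."""
    data = parse.urlencode(fields).encode("ascii")
    with request.urlopen(request.Request(url, data=data), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def verify_token(
    token: str,
    secret: str,
    remote_ip: Optional[str] = None,
    transport: Callable[[str, Dict[str, str]], dict] = post_form,
) -> bool:
    """Return True only if the verification service reports success."""
    if not token:
        return False  # nothing was submitted with the form
    fields = {"secret": secret, "response": token}
    if remote_ip is not None:
        fields["remoteip"] = remote_ip  # optional field: IP of the submitting user
    reply = transport(VERIFY_URL, fields)
    return bool(reply.get("success"))


def fake_transport(url: str, fields: Dict[str, str]) -> dict:
    """Hypothetical stand-in for the service: accepts one hard-coded token."""
    return {"success": fields["response"] == "valid-token"}


# The site grants access only when verification succeeds.
granted = verify_token("valid-token", "site-secret", transport=fake_transport)
denied = verify_token("forged-token", "site-secret", transport=fake_transport)
```

Injecting the transport keeps the decision logic separate from the HTTP call; a real deployment would use the default post_form with the site's own secret key.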

2. Re: CAPTCHAs – Understanding CAPTCHA-Solving Services in an Economic Context
Marti Motoyama, Kirill Levchenko, Chris Kanich, Damon McCoy, Geoffrey M. Voelker and Stefan Savage
University of California, San Diego

2.1. Introduction

Deploying Captchas to guard resources did not end the problem; attackers who were abusing those resources now look for ways to bypass captchas. A demand for automatically solving captchas appeared, and services take advantage of that demand by selling captcha solving, creating a whole business model. The paper presents the two types of solvers, automated solvers and human-labor solvers, followed by the economics around them.

2.2. Automated Solvers

The first type is automated solvers, which are mainly software that uses Optical Character Recognition (OCR) algorithms to read and solve text Captchas. Two solvers investigated were Xrumer and reCaptchaOCR. These solvers made providers change their Captchas, creating an arms race that favors the defenders (providers). The first reason is that developing new solvers requires highly skilled labor. Another reason is that these solvers have low accuracy, and most sites blacklist IP addresses after 5-7 failed attempts. Finally, sites have alternative captchas ready for swift deployment in case existing captchas get bypassed. Besides losing the arms race, automated solvers did not survive in the market because of the human solvers.

2.3. Human Solvers

The second type is human solvers. The motivation is that Captchas are designed to obstruct automated solving, and this can be sidestepped by handing captchas to human labor pools. Paid solving is the core of the Captcha-solving ecosystem. There is a whole business model around paid solving services; an example is shown in Picture 3, where automated spamming software tries to create multiple accounts but is stopped by a Captcha.

Picture 3 - Captcha-solving market workflow

1. GYC Automator (Client) tries to create a Gmail account and is challenged with a Captcha.
2. The client pays DeCaptcher (Solving Service) to solve the Captcha.
3. Solving Service puts the Captcha in a PixProfit (Workers forum) pool.
4. PixProfit selects a worker from a pool.
5. The worker responds to PixProfit with the solution.
6. PixProfit sends the solution to the Solving Service and then to Client.
7. The client enters the solution to Gmail, gets validated and the account is created.

In an attempt to find geolocation details about the workers, the authors created Captchas in different languages, or Captchas asking about the local time, and submitted them to the human-based solving services, concluding that the workers largely come from low-cost labor countries (China, India, etc.).

Because solving is an unskilled activity and the services switched to low-cost labor from Eastern Europe, Bangladesh, China, India, Vietnam, etc., paid-solving services not only survived but also expanded and became highly competitive. Wages started at $10/1000¹ and within a few years dropped to about $0.5/1000.

2.4. Conclusion

• The nature of captchas made solving them easy to outsource to the global unskilled labor market.
• The business of solving captchas is growing and highly competitive.
• Do Captchas work?
  o Telling computers and humans apart: succeeded.
  o Preventing automated site access: failed.
  o Limiting automated site access: reduces the attacker's expected profit.
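To make the price drop from Section 2.3 concrete, the following minimal sketch converts the two per-1000 prices quoted there into a per-Captcha cost and a bulk cost. The function names and the workload of one million Captchas are illustrative assumptions, not figures from the paper.

```python
def cost_per_captcha(price_per_1000: float) -> float:
    """Convert a '$X/1000' price into dollars per single solved captcha."""
    return price_per_1000 / 1000.0


def bulk_cost(price_per_1000: float, n_captchas: int) -> float:
    """Total dollars paid for n_captchas solved at the given per-1000 price."""
    return price_per_1000 * n_captchas / 1000.0


# Prices quoted in the report: the early price and the later price.
early_unit = cost_per_captcha(10.0)
later_unit = cost_per_captcha(0.5)

# Hypothetical workload of one million captchas (an assumption for illustration).
early_bulk = bulk_cost(10.0, 1_000_000)
later_bulk = bulk_cost(0.5, 1_000_000)
price_ratio = early_bulk / later_bulk
```

Under these assumptions the same workload costs $10,000 at the early price and $500 at the later one, a 20-fold drop, which illustrates why the Captcha acts as a small per-action cost for the attacker and not as a hard barrier.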

¹ The price of Captcha-solving services is counted as dollars paid per 1000 solved captchas. For example, $5/1000 means the client pays 5 dollars for 1000 solved captchas.

3. I am Robot: (DEEP) Learning to Break Semantic Image CAPTCHAs
Suphannee Sivakorn, Iasonas Polakis and Angelos D. Keromytis
Department of Computer Science, Columbia University, New York, USA

3.1. Introduction

Two further Captcha attacks were developed for research purposes: one solves the image reCaptcha using online image annotation modules, and the other targets the noCaptcha reCaptcha by influencing the Advanced Risk Analysis System. To achieve this, a system consisting of two main components was developed.

3.2. System Overview

3.2.1. Cookie Manager

The first component is the Cookie Manager, whose main job is to automatically create and train cookies so that they appear to belong to real users. After creating each cookie, the system is configured to perform human-like actions with it, for example searching Google for certain terms and following the links returned, opening videos, running Google Maps searches, etc.

3.2.2. ReCaptcha Breaker

The second component is the ReCaptcha Breaker. It uses the cookies from the Cookie Manager and visits sites that employ reCaptcha. It locates the reCaptcha iframe that contains the checkbox by looking for reCaptcha-anchor, performs a click and extracts the reCaptcha-token. If the reCaptcha is solved at this point, it is considered a checkbox challenge; otherwise, if a popup is created in goog-bubble-content, an image challenge has been shown. The image challenge information, namely the hint and sample image (rc-imageselect-desc) and the candidate images (rc-imageselect-tile), is extracted and passed to another module.

3.3. Breaking the image reCaptcha

To solve the image reCaptcha, the system uses deep learning techniques to match the given hint and sample with the candidate images. Sample and candidate images are passed to image annotation modules, such as Google Reverse Image Search (GRIS), Clarifai and Alchemy, which return 10-20 tags describing a given image. GRIS is also used to search for better-quality versions of the images, for more accurate results. In case the candidate images' tags do not match the hint, a Tag Classifier is used, which models the tags and the hint as vectors and uses the cosine similarity between them to find the candidate images most likely to be of the same category as the hint and sample. Because reCaptcha images repeat, a History Module is used, which keeps pairs of images and their tags in a labelled_dataset, so that future candidate images can be looked up there for a hint. This attack achieved 70.78% accuracy against 2235 challenges.
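The cosine-similarity step of the Tag Classifier can be illustrated on toy data. This is a minimal sketch under a stated simplification: tags are modelled as plain bag-of-words count vectors, which is swapped in here for whatever vector model the authors actually use. The function names, image names and tags are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty tag list is similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(
    hint_tags: List[str], candidates: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Score every candidate against the hint and sort best-first."""
    hint_vec = Counter(t.lower() for t in hint_tags)
    scored = [
        (name, cosine_similarity(hint_vec, Counter(t.lower() for t in tags)))
        for name, tags in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): tags as an annotation module might return them.
ranked = rank_candidates(
    ["wine", "drink"],
    {
        "img1": ["wine", "glass", "drink"],
        "img2": ["car", "road"],
        "img3": ["drink", "bottle"],
    },
)
```

With count vectors, a candidate scores above zero only when it shares at least one literal tag with the hint; a learned word embedding would also credit related but non-identical tags.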

The algorithm for breaking an image reCaptcha is:

• Each candidate image will be assigned to one of 3 sets: Select, Discard, Undecided.

• Initially all candidate images are placed in Undecided.

1. If the hint is not provided, the sample image is searched in the labeled dataset to obtain one.

2. Information about all images is collected from GRIS.

3. Every candidate image is searched in the labeled dataset.

• If it is found, its tags are compared to the hint and, on a match, the candidate image is placed in the Select set.

• If no match is found, hint_list is checked and, on a match, the candidate image is placed in the Discard set.

4. The image annotation modules process all images and assign tags to them.

• If the tags match the hint, the image is added to the Select set.

• If they match one of the tags in hint_list, the image is added to the Discard set.

5. The system picks images from the Select set; if there are not enough, it picks from the Undecided set.

3.4. Influencing the Advanced Risk Analysis System

To influence ARAS into serving the easiest challenge, a variety of experiments on different components were carried out, leading to some surprising conclusions.

1. Token - Browsing History:
   • Without Account:
     o No matter the network setup (TOR, university, etc.) or geolocation, after the 9th day from token creation, even without browsing, ARAS was neutralized and provided a checkbox challenge.
   • With Account:
     o Different settings were tried: with or without phone verification, and with an alternative email from another provider. The result was a checkbox challenge after 60 days.
     o It is better not to use an account.
   • Token Harvesting:
     o Tested whether creating a large number of cookies from a single IP is prohibited.
     o 63000 cookies were created in a single day without getting blocked.
     o Tokens could be sold, creating a harvesting attack.
2. Browser Checks:
   a. Automation: the Webdriver attribute, which indicates that an automation kit is present in the browser, was set to True, but it made no difference.
   b. Canvas Fingerprint² - User-Agent³:
      i. If they do not match, the fallback (hardest) challenge is provided.
      ii. If the User-Agent is outdated, the fallback challenge is provided.
      iii. If the User-Agent is misformatted or does not contain complete info, the fallback challenge is provided.
   c. Screen Resolution: a variety of resolutions was tested, from 1x1 to 4096x2160, but it made no difference.
   d. Mouse: automated movements and multiple clicks in the widget were tried, and even the getElementById().click() function was used to simulate a click without hovering, but it made no difference.
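The three User-Agent observations under Browser Checks can be summarized as a small decision function. This is only a toy model of the behaviour reported above, not Google's logic: the parsing rule, the minimum-version table, the function names and the "risk-based" label for the non-fallback outcome are all assumptions made for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed, simplified pattern: only two browser families are recognised.
UA_PATTERN = re.compile(r"(Firefox|Chrome)/(\d+)")


def parse_user_agent(ua: str) -> Optional[Tuple[str, int]]:
    """Return (browser, major version), or None if the string looks malformed."""
    match = UA_PATTERN.search(ua)
    if not ua.startswith("Mozilla/5.0") or match is None:
        return None
    return match.group(1), int(match.group(2))


def choose_challenge(
    user_agent: str, canvas_browser: str, min_major: Dict[str, int]
) -> str:
    """Model of the reported outcomes: 'fallback' or 'risk-based'."""
    parsed = parse_user_agent(user_agent)
    if parsed is None:
        return "fallback"  # misformatted or incomplete User-Agent
    browser, major = parsed
    if browser != canvas_browser:
        return "fallback"  # canvas fingerprint and User-Agent disagree
    if major < min_major.get(browser, 0):
        return "fallback"  # outdated User-Agent
    return "risk-based"  # the remaining risk analysis decides


CHROME_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"
)
MIN_MAJOR = {"Chrome": 70, "Firefox": 60}  # hypothetical thresholds

consistent = choose_challenge(CHROME_UA, "Chrome", MIN_MAJOR)
mismatch = choose_challenge(CHROME_UA, "Firefox", MIN_MAJOR)
malformed = choose_challenge("curl/7.68.0", "Chrome", MIN_MAJOR)
```

The sketch deliberately ignores the webdriver attribute, screen resolution and mouse behaviour, since the experiments above found that these made no difference.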

² HTML Canvas provided alongside the widget, not visible to the user, that collects information about the user's browser.
³ Attached to HTTP requests, containing information about the client, like browser version, extensions, etc.

3.5. Conclusions - Countermeasures

Based on the two attacks above, a number of guidelines and countermeasures were presented.

• Token Auctioning: the token verification API has an optional field for comparing the IP address of the user who solved the challenge with that of the user who submitted the token. It should be mandatory, to prevent services from selling tokens obtained from the checkbox challenge.
• Risk Analysis:
  o Account:
    • Requests should be valid only when they come from logged-in users; users who are not logged in would have to solve the hardest challenge.
    • Limit the number of tokens per IP address.
  o Cookie Reputation:
    • Should rise with the amount of browsing conducted.
    • The number of cookies that can be created within a time period should be regulated.
  o Browser Checks: take a stricter approach and return no challenge at all if the client is overtly suspicious, e.g. on a browser/User-Agent mismatch.
• Image captcha attacks:
  o Solution:
    • Increase the number of correct images.
    • Change the range of correct images.
    • Remove flexibility.
  o Repetition:
    • When a challenge is shown, it should be removed from the pool.
    • The pool of challenges should be larger.
  o Hint and Content:
    • The hint should be removed.
    • Providers can run experiments to find image categories that are problematic for image annotation software.
    • Populate challenges with filler images of the same category as the solutions.
  o Advanced Semantic Relations:
    • Instead of similar objects, the user could be asked to select semantically related objects (tennis ball, racket, tennis court).
  o Adversarial Images:
    • By altering a small number of pixels, images are misclassified by software while remaining visually the same.
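Two of the countermeasures above, limiting tokens per IP address and regulating how many cookies can be created within a time period, amount to per-client rate limiting. The following is a minimal sketch of one way to do that with a sliding window; the class name, the limits and the use of caller-supplied timestamps are assumptions for illustration, not something the paper prescribes.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class PerIpLimiter:
    """Allow at most max_items issuances per IP within a sliding time window."""

    def __init__(self, max_items: int, window_seconds: float) -> None:
        self.max_items = max_items
        self.window_seconds = window_seconds
        # One queue of issuance timestamps per IP address.
        self._issued: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, ip: str, now: float) -> bool:
        """Record and permit the issuance, or refuse it if the IP is over quota."""
        stamps = self._issued[ip]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_items:
            return False
        stamps.append(now)
        return True


# Hypothetical policy: two tokens per IP per 60 seconds.
limiter = PerIpLimiter(max_items=2, window_seconds=60.0)
decisions = [
    limiter.allow("192.0.2.1", now=0.0),
    limiter.allow("192.0.2.1", now=1.0),
    limiter.allow("192.0.2.1", now=2.0),     # third request inside the window
    limiter.allow("198.51.100.7", now=2.0),  # a different IP has its own quota
    limiter.allow("192.0.2.1", now=61.0),    # earlier entries have expired
]
```

Passing the timestamp in explicitly keeps the sketch deterministic; a deployed limiter would read a monotonic clock and would also need to expire idle IP entries to bound memory.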