Evaluation of an Attack Approach on Google V3 Captcha
Mohit Mohit, Saksham Chawala, Sumit Mokashi
Department of Computer Science, University of Texas at Arlington, Arlington, Texas

Abstract: CAPTCHAs are used as a first line of defense against automated account creation and service abuse. Google's reCaptcha is currently used by millions of websites for protection against automated attackers (testing whether a user is truly human). In this project we present an approach where we try to compromise the Google V3 Captcha module using techniques involving machine learning. We create a click-bot that generates and emulates human cursor movements on the screen, in an attempt to fool the captcha module. Using our bot we create a dataset that we then use to train our model. In our approach we try to use binary classification, and we conclude our project by providing details of a more comprehensive approach that involves Reinforcement Learning.

I. INTRODUCTION

CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) are systems designed to protect against automated account creation and service abuse by presenting users with a challenge that is easy for humans to solve but difficult for computers. Captchas are used extensively online as a defense against automated bots and Sybil attacks, as well as to prevent spam. For instance, many online registration platforms, from social media services to email to ticketing systems, require the user to solve a captcha during registration to prevent automated creation of fake accounts. In a similar vein, some online services have recently begun requiring Tor clients to solve captchas before delivering web content. The security of captchas is paramount to protecting services on the Internet from these attacks. For the remainder of the paper, we follow industry convention and write the acronym in lowercase, as "captcha", for readability.

As the spread of news and information is increasingly driven by user content on sites like Twitter, YouTube, and Reddit, bots that could defeat the captcha system and register a disproportionate number of accounts could theoretically control the flow of information. It is therefore unsurprising that captchas have been a target for researchers and attackers for years. Until recently, captchas featured distorted text that users must correctly type to pass. Bursztein et al. showed these text-based captchas to be insecure by demonstrating a system with near-complete (98%) accuracy. As a result, text-based captchas have been largely phased out in favor of image captchas. However, visually impaired users are incapable of solving these visual captchas, prompting the creation of audio captchas.

Typical audio captchas consist of different speakers saying words or digits at randomly spaced intervals at a variable pitch or speed, often with an accent and distortion or noise. To solve the captcha, a user must correctly identify the digits or words spoken in the audio clip. Attacks have been demonstrated on these audio captchas with varying degrees of success in the past. This is usually done by training local machine-learning models to identify spoken words, a high-resource and time-consuming approach. Additionally, although researchers have explored using online speech recognition services, including Sphinx or Google Speech Recognition, these services have not been accurate enough to compete with offline services or solve the captcha reliably.

II. BACKGROUND

The reCaptcha system relies on an advanced risk analysis engine. As the user interacts with reCaptcha (clicking buttons and typing), the system determines a level of suspicion for that user. Today, many users will find that they simply need to click the checkbox and be verified without needing to solve a captcha. This occurs when reCaptcha is fairly confident that the user is human and not an automated attacker (this is called the "noCaptcha reCaptcha"). If the system is unsure whether the user is human (but is not highly suspicious either), it will deliver a moderate challenge to the user (an easy image problem or a short audio string of numbers to transcribe). This often occurs when a user does not yet have a long enough history of interaction with Google. However, as the reCaptcha system becomes increasingly suspicious, it delivers harder challenges: 10 digits in the audio challenge, or prompting the user to solve multiple challenges. By default, a user with no past history with Google services will automatically be given the most difficult challenge. It is these most difficult challenges that unCaptcha attempts to solve.

Although the new reCaptcha system was introduced in 2014 to replace the traditional "distorted text" captcha, not much is known about its inner workings. Google has protected the inner design of reCaptcha heavily, releasing few details about how the software works. The captcha system is run from an encrypted, isolated VM (Virtual Machine) in JavaScript with a unique bytecode language. To make reverse engineering even more difficult, the bytecode has direct access to JavaScript variables of its own interpreter, and changes its own decryption key and even its own opcode numbers at many points during its own execution. A full working disassembler and decompiler for the system was released, and it was determined that the captcha system, in addition to confirming the actual captcha solving, checked for the presence of: valid plugins; a valid user-agent string; a valid screen resolution; execution time; computer timezone; number of click, keyboard, or touch actions in the iframe of the captcha; many browser-specific functions and CSS rules; canvas rendering properties; server-side cookies; and likely more.

In 2016, a further analysis of the reCaptcha system by Sivakorn et al. explored the weaknesses of the initial implementation of the image captcha. It is important to note that the image captcha has changed since that paper, and their methodology is no longer sufficient to defeat the captcha. However, their analysis of the captcha's risk analysis system lends insight into its inner workings. In particular, Sivakorn et al. found that Google's tracking cookies play an integral role in the captcha's defenses. The captcha system is made aware every time a user interacts with a Google service (or a page with Google's tracking cookies, such as Google Analytics). After just 9 days of automated browsing across different Google services, their bots' tracking cookie was sufficient to fool the risk analysis system into treating them as human and checking off the box. However, their experiments revealed that each cookie could only immediately complete 8 captchas per day before needing to solve additional challenges. Their results also showed that the reCaptcha system attempts to fingerprint the browser, using canvas rendering techniques, comparing the user-agent to what the browser reports, and potentially more. Despite these impressive efforts of the risk analysis engine to identify a bot before the captcha, reCaptcha still remains susceptible to low-resource attacks on its audio challenge.

Over the last decade, reCAPTCHA has continuously evolved its technology. In reCAPTCHA v1, every user was asked to pass a challenge by reading distorted text and typing it into a box. To improve both user experience and security, Google introduced reCAPTCHA v2 and began to use many other signals to determine whether a request came from a human or a bot. This enabled reCAPTCHA challenges to move from a dominant to a secondary role in detecting abuse, letting about half of users pass with a single click. Today, reCAPTCHA v3 lets sites distinguish human from bot activity by returning a score that indicates how suspicious an interaction is, eliminating the need to interrupt users with challenges at all.

... audio captchas. Further independent studies deployed a two-phase segment-then-classify approach and successfully broke older versions of Google and Yahoo audio captchas. These two-phase solvers usually operate by first extracting the portions of the captcha that contain a digit, and then running pre-trained machine learning algorithms to classify those individual digits, rather than classifying them all at once.

The aim of this project is to study and evaluate the Google V3 captcha systems that are in use today. With this study we propose an attack scenario that utilizes automated scripts to compromise the system. Our work relies on a previous case study conducted by Ismail Akrout, Amal Feriani, and Mohamed Akrout in their paper Hacking Google reCAPTCHA v3 using Reinforcement Learning [1]. That paper presents a Reinforcement Learning (RL) methodology to bypass Google reCAPTCHA v3, in which the agent learns how to move the mouse and click on the reCAPTCHA button to receive a high score. The authors also used a divide-and-conquer strategy to defeat the reCAPTCHA system for any grid resolution. Their proposed method achieves a success rate of 97.4%.

In our research we also relied upon another case study, conducted by Suphannee Sivakorn, Iasonas Polakis, and Angelos D. Keromytis in their paper I am Robot: (Deep) Learning to Break Semantic Image CAPTCHAs (12 May 2016). In this paper, the authors conduct a comprehensive study of reCaptcha and explore how the risk analysis process is influenced by each aspect of the request. Through extensive experimentation, they identify flaws that allow adversaries to effortlessly influence the risk analysis, bypass restrictions, and deploy large-scale attacks. They design a novel low-cost attack that leverages deep learning technologies for the semantic annotation of images. Their system is extremely effective, automatically solving 70.78% of image reCaptcha challenges.

Another study presents a tool called unCaptcha, an automated system that can solve reCaptcha's most difficult auditory challenges with a high success rate. Its authors evaluated unCaptcha using over 450 reCaptcha challenges from live websites and showed that it can solve them with 85.15% accuracy.
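
The score-based model of reCAPTCHA v3 described above can be illustrated from the site operator's side. The following is a minimal Python sketch, not Google's implementation. It assumes a verification response shaped like the JSON documented for the reCAPTCHA siteverify endpoint (fields `success`, `score`, `action`); the function name `assess`, the `Verdict` type, and the 0.5 threshold are illustrative assumptions, and no network request is made.

```python
import json
from dataclasses import dataclass


@dataclass
class Verdict:
    """Outcome of a site-side decision (hypothetical helper type)."""
    allowed: bool
    reason: str


def assess(response_json: str, expected_action: str, threshold: float = 0.5) -> Verdict:
    """Decide whether to allow a request from a v3-style verification response.

    Assumption: `response_json` has the fields `success`, `score` (0.0 to 1.0,
    higher meaning less suspicious) and `action`. The threshold is a site-side
    policy choice, not something the captcha service enforces.
    """
    data = json.loads(response_json)

    # A token that failed verification is rejected regardless of score.
    if not data.get("success", False):
        return Verdict(False, "token rejected")

    # The action name should match the one the page requested the token for.
    if data.get("action") != expected_action:
        return Verdict(False, "action mismatch")

    # Missing scores are treated as maximally suspicious.
    score = float(data.get("score", 0.0))
    if score < threshold:
        return Verdict(False, f"score {score:.1f} below threshold {threshold:.1f}")
    return Verdict(True, f"score {score:.1f} accepted")


if __name__ == "__main__":
    sample = json.dumps({"success": True, "score": 0.9, "action": "login"})
    print(assess(sample, "login"))
```

Because no challenge is shown, the site alone decides what a given score means. This is why an attack on v3, such as the one studied in this project, aims to obtain a high score rather than to solve a puzzle.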