
I AM NOT A ROBOT: - AN OVERVIEW ON GOOGLE’S CAPTCHA

A Thesis Presented to the Faculty of California State Polytechnic University, Pomona

In Partial Fulfillment Of the Requirements for the Degree Master of Science In Computer Science

By Uday Prabhala 2016

SIGNATURE PAGE

THESIS: I AM NOT A ROBOT: - AN OVERVIEW ON GOOGLE’S CAPTCHA

AUTHOR: Uday Prabhala

DATE SUBMITTED: Summer 2016 Computer Science Department.

Dr. Gilbert Young ______Thesis Committee Chair Computer Science

Dr. Fang D. Tang ______Computer Science

Dr. Yu Sun ______Computer Science


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my family members, Yashoda, Lucky, and Diskey, as well as my girlfriend Siri, who helped make this endeavor possible. Their limitless support, assistance, and encouragement during the times when I was close to giving up were greatly helpful, and I wouldn’t have been able to overcome the obstacles without them.

I would also like to send my appreciation and gratitude to the Professors who were part of my thesis committee. Most notably, I would like to thank Professor Gilbert Young, chair of the committee, for his support, patience, guidance, and sharing of knowledge throughout the program.

I would also like to thank Professor Tang and Professor Sun for reviewing my paper and attending my presentation. These three professors not only helped me to complete my program, but also served as an excellent example by exercising professionalism, versatility, and commitment to the developing engineering students at California State Polytechnic University, Pomona.


ABSTRACT

I Am Not a Robot: An Overview on Google’s Captcha

Uday Kiran Prabhala

Computers are among the greatest human inventions; they have made our work easier, but they can also be misused in various ways. One such misuse is a program posing as a human. Captchas are a way to prevent this; having said that, Captchas can themselves be compromised by many attacks. To make Captchas more resistant to attacks, different techniques have been implemented. These Captchas run the gamut from the old plain Captchas to the newest Nu-Captcha [1]; however, attackers keep finding different ways to break them [2].

On a tangential note, there are human resolvers who, working with automated tools [3], solve Captchas for reasonable prices. “I am not a Robot” is the new secure Captcha designed by Google. Is it really secure? How can it be protected from human resolvers?


TABLE OF CONTENTS

SIGNATURE PAGE

ACKNOWLEDGEMENTS

ABSTRACT

LIST OF FIGURES

CHAPTER

1. INTRODUCTION

2. LITERATURE SURVEY

2.1. Early Development

2.2. Areas for Captcha

2.3. Captcha Attacks

2.4. Types of Captcha

2.5. Past Research

2.6. Research Goal

2.7. Methodology

2.8. Research Findings

3. I AM NOT A ROBOT

3.1. Overview

3.1.1. Mouse Readings

3.1.2. Cookie Method

3.2. Breaking Google’s Captcha

3.3. Integrating I am not a robot to Website

3.4. Limitations

3.5. Mouse patterns

4. HUMAN RESOLVERS

4.1. How they work

4.2. Limitations

4.3. Observations

4.3.1. Mouse Coordinates and clicks

4.4. Puzzle Architecture

4.5. Proposed Solution

4.5.1. Submittals

4.5.2. Time Frame

4.6. Explanation

5. EXPERIMENTS AND RESULTS

5.1. Experiment

5.2. Results

6. CONCLUSION AND FUTURE WORK

6.1. Conclusion

6.2. Future work

REFERENCES


LIST OF FIGURES

Figure 1 Websites where Captcha not installed

Figure 2 Websites where Captcha installed

Figure 3 Gimpy Captcha’s

Figure 4 Face Recognition Captcha

Figure 5 Optical Illusion Captcha

Figure 6 Captcha Games, Click

Figure 7 Captcha Games, Drag

Figure 8 I Am Not A Robot Captcha

Figure 9 I am Not a Robot Captcha 2

Figure 10 Registering Domains

Figure 11 Site Key

Figure 12 Snippets

Figure 13 Location

Figure 14 Login Page

Figure 15 Generated Key

Figure 16 API Client

Figure 17 Storefront Image

Figure 18 Confusing

Figure 19 Mouse Coordinates

Figure 20 Dog Rotation Puzzle

Figure 21 Question 1

Figure 22 Question 2


1. Introduction

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. This popular term was coined in 2003 [5] for a Turing test that secures webpages from unauthorized access. Unauthorized access in this context mainly refers to bots (malware) or other computer programs logging in to a webpage and performing illegal activities. The goal is to generate a test that most humans can pass and computers cannot [6]. This is the security mechanism most websites currently adopt to protect themselves from spamming by unauthorized programs or users [6].

A well-designed Captcha should not allow an attack success rate greater than 0.01% [7,8]. Captcha researchers therefore focus on designing either a model that keeps the attack success rate below that value, or a bot that achieves a greater one.

On a tangential note, the attacks carried out by these bots run the gamut from email accounts to trading sites [6]. These bots create fake email addresses and post inappropriate material on blogs. Other attacks include dictionary attacks on passwords, creation of multiple accounts, voting in polling systems, and intrusion into trading systems [9]. The most serious is the Denial of Service attack, in which bots abuse the resources that legitimate users are trying to reach; Captcha is used together with other security mechanisms against it. With the growing number of users accessing online resources, this is a very serious concern.


2. Literature Survey

2.1. Early Development

With the increase in bots, it has become very difficult to protect our data online. One example is the spam bot. In the early 80’s, when the internet started gaining recognition and people started using mailing services, spam mail filled their mailboxes. This made most of them unwilling to give their email addresses to any online source. Later they adopted a new way of writing their email addresses out in plain text; for example, john@gmail.com was written as john at Gmail dot com. This was easily understood by humans, and initially it was difficult for web crawlers or spam bots to detect the address. The mechanism did not stay successful for long, however, because anything a human can read in this way can also be parsed by programs using basic regular expressions and some scripting skill. Thus the need to secure data arose.
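To illustrate why the written-out form stopped working, here is a minimal Python sketch of how a crawler could undo it with one regular expression. The pattern and the function name `deobfuscate` are illustrative assumptions, not taken from any particular spam bot.

```python
import re

# Hypothetical pattern for addresses written as "name at domain dot tld".
OBFUSCATED = re.compile(
    r"\b([\w.]+)\s+at\s+([\w-]+)\s+dot\s+([a-z]{2,})\b", re.IGNORECASE
)

def deobfuscate(text):
    """Return every written-out address in `text`, rebuilt as user@domain.tld."""
    return [
        f"{user}@{domain}.{tld}".lower()
        for user, domain, tld in OBFUSCATED.findall(text)
    ]

print(deobfuscate("Contact john at Gmail dot com for details"))
```

A single short pattern is enough for the example from the text, which is why this kind of obfuscation offered only temporary protection.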

People then started to adopt certain hacker techniques and adapted them into simple Captchas. One of those techniques is to substitute letters with numerals; for example, the letter ‘o’ is replaced by the number ‘0’. This acted as a catalyst in designing Captchas, where the main goal is to create simple puzzles that humans can solve and computers cannot.

On a tangential note, this differentiation between humans and computers plays a vital role in many different applications; a few are listed below.

2.2. Areas for Captcha

The usage of Captcha spreads widely over different sectors; it runs the gamut from stock trading to blogs. In other words, wherever there is human interaction over the internet, a Captcha should be installed. This keeps the network safe from bots and web crawlers. Some of these areas are described below.

a) Fake accounts

A bot can disguise itself as a registered user and create multiple fake accounts; this is one of the sources of spam email, blog comments posted under different names, and so on.

b) Denial of service

This is one of the most common attacks on web servers; when a web server is compromised in this way, it throws a DoS error. Captchas can be used on web servers to help prevent these attacks.

c) Online reviews

Most online products come with reviews. “A user gave 1000 1 star reviews to a product Amazon” [10]. Amazon, the famous website, was once compromised in this way and lost its users’ reviews.

The above are just a few of the major attacks that can be prevented by Captcha. Having said that, the work done by Captcha isn’t always appreciated. As the famous saying goes, “Every coin has two sides”: every product has its own advantages and disadvantages, and so does Captcha.

According to Casey Henry’s case study “Captchas Effect on conversion rates” [11], the total hits on a website decreased drastically when a Captcha was implemented. The case study covered over 50 websites, ranging from less than 1 year to 5 years old, whose forms collected common information such as name, address, city, and email address, plus a comments section. These webpages were observed over a period of 6 months, with Captchas installed 3 months after the start date. The statistics below illustrate this.

Figure 1: Websites where Captcha not installed

Figure 2: Websites where Captcha installed

These images show that spam fell from 99 to 11; having said that, failed conversions increased from 0 to 159. This indicates that 159 users could not solve the Captcha and so abandoned the form. This is a very serious concern, as the business lost 3 percent of its clients.

These results show that security is not the only serious concern: the complexity of the Captcha should also stay within a range where humans do not need too much effort to solve it.
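The size of the spam reduction can be checked with a line of arithmetic. The sketch below uses only the two spam counts quoted from the case study; the function name is illustrative.

```python
def percent_reduction(before, after):
    """Return the relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100

# Spam submissions fell from 99 to 11 once the Captcha was installed.
spam_drop = percent_reduction(99, 11)
print(f"Spam reduced by {spam_drop:.1f}%")
```

The drop is close to 89 percent, which is the benefit that has to be weighed against the 159 failed conversions.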

2.3. Captcha attacks

Understanding the different Captcha attacks gives a better understanding of the different types of Captcha, which will be discussed in the next section.

2.3.1. Bot attacks

This is one of the most common Captcha attacks: different types of web crawlers or bots are used to break the Captcha. These were the first known attacks on Captcha. According to Web Scrapping, an online blog, “Every minute there will be a minimum of 100 Web-spiders or web crawlers running on a given webpage”. Initially these crawlers just mined data and sent the useful data to their owner. Later, as bots evolved, these miners were used to harvest user names and email addresses from websites, which can be considered the origin of spam bots. This acted as a catalyst for the evolution of Captcha.

Captchas are placed at the login page of a website, and upon solving them the user is granted access. In this attack, bots pose as users, solve the Captcha (using preinstalled algorithms), pass themselves off as humans, and gain access to the webpage. To protect the webpage from bots, the Captcha should be designed in such a way that a human user can solve it and a bot cannot.

Initially it was hard for bots to break Captchas, but with recent advances in image processing techniques and scripting, bots are able to compromise them. For example, given a regular word Captcha, a bot can use image processing techniques to extract the outline of each letter, compare it against its database to find the matching letter, and type that letter into the Captcha.
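The compare-against-a-database step can be sketched in Python as nearest-template matching. The tiny 3x5 bitmaps, the names `TEMPLATES` and `recognize`, and the pixel-difference measure are all assumptions made for illustration; a real attack would work on segmented, denoised images.

```python
# Hypothetical template database: each letter is a 3x5 bitmap ("1" = ink).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}

def distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))

def recognize(glyph):
    """Return the template letter with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda letter: distance(glyph, TEMPLATES[letter]))

# An "L" with one stray noise pixel is still matched to "L".
noisy_l = ["100", "100", "110", "100", "111"]
print(recognize(noisy_l))
```

Light noise changes only a few pixels, so the closest template is still the right letter; this is why simple noise alone does not protect a text Captcha.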

2.3.2. Human resolvers

Human resolvers are considered the biggest threat to current Captchas; one example is Death By Captcha. Bots and humans work together to solve Captchas. This is a paid service in which the Captchas are solved by humans. An automated web scraping tool (bot) gathers multiple Captchas and sends them to the human resolvers, who feed the solutions back to the tool, and the bot thus gains access to the webpage.

This service is very affordable: the price ranges from $0.75 to $1 per 1000 Captchas, and according to Death by Captcha, “They offer 80% of success rate in resolving Captcha’s”.

2.4. Types of Captcha

“Those squiggly letters and numbers used to verify that a user is human and not a bot” - Jimmy Fallon [12]. Yes, these are Captchas. Having said that, there are many different types of Captchas.

2.4.1. Text Captchas

These are considered the most popular Captchas: the user is given a bunch of distorted letters with some noise injected and has to identify those letters.

According to the New York Times, these Captchas were first invented in 1997 and are based on Moni Naor’s theory, which is “A user might be shown a portrait, he said, and asked to name its subject’s sex. Or he might be presented with an image of several people and told to find the one who wasn’t wearing any clothes”. His theory acted as a catalyst and was later developed by a group of researchers from Digital Equipment Corporation. They designed the first text Captcha by distorting letters and masking them with basic techniques such as adding noise. This was not a secure Captcha and could be hacked with modest knowledge; however, it was strong enough to stop attacks from other users.

These Captchas were easily broken using OCR (Optical Character Recognition) software. This made the designers revamp their old techniques: they built the known limitations of OCR models into the Captchas and designed a stronger Captcha which could not be read by OCR. These were called OCR Captchas.

Later that year Yahoo had a problem with spam bots in its chat rooms, where huge numbers of bots created accounts, logged in to the chat rooms, and spammed them. In collaboration with graduate students at Carnegie Mellon University, Captchas called Gimpy Captchas were created. These Gimpy Captchas are among the most reliable systems. They consist of groups of letters rendered as extremely distorted and corrupted text, which is challenging for computer programs to solve.

a) EZ Gimpy Captcha b) Gimpy Captcha c) Gimpy-r Captcha

Figure 3: Gimpy Captchas

The above three are the Captchas designed to fix the chat room issue. The first is the EZ Gimpy Captcha, where a word is selected from the dictionary and distorted. The second is the Gimpy Captcha, where five different words are selected and noise is added to them using various techniques; the user has to solve a minimum of three of the words for the answer to count as correct. The third is the Gimpy-r Captcha, where different letters are used and masked using different kinds of noise.
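The Gimpy acceptance rule described above (at least three of the five shown words must be typed correctly) can be sketched as follows. The function name and the sample words are illustrative assumptions, not taken from the original Gimpy implementation.

```python
def gimpy_passes(shown_words, answers, required=3):
    """Accept the response when enough distinct shown words were typed correctly."""
    shown = {word.lower() for word in shown_words}
    correct = {answer.lower() for answer in answers} & shown
    return len(correct) >= required

shown = ["river", "stone", "apple", "cloud", "table"]
print(gimpy_passes(shown, ["apple", "River", "stone"]))  # three correct words
print(gimpy_passes(shown, ["apple", "river", "chair"]))  # only two correct words
```

Using a set means that typing the same word several times does not count more than once.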

In 2004, Greg Mori and Jitendra Malik designed a template dictionary and solved Gimpy with a success rate of more than 90% using distortion imaging techniques [13].

After this, other Captchas were invented using text-based techniques, but every Captcha designed had a minor loophole which allowed bots or human resolvers to solve it easily. As discussed earlier, these text-based Captchas are more prone to bot attacks.

2.4.2. Image based Captcha

Since it was easy for bots to break text-based Captchas, developers needed different types of Captchas that could not be processed or understood by bots; the best candidate was the image-based Captcha. Initially it was considered the solution to all Captcha problems; lately, with the advancements in image processing and scripting, even these Captchas have been compromised.

According to the case study [14], image-based Captchas were difficult not only for bots but also for humans to solve. Their difficulty level ranges from medium to high, so humans have to apply some extra knowledge to solve them. There were different approaches to the image-based Captcha; a couple of them are described below.

A. Face recognition

In 2006 researchers used certain images from public domains and asked users to identify the wrongly placed image (or similar questions). [15]

Figure 4: Face recognition Captcha

In the above figure, the user is asked generalized questions based on the gender of the person in the image, the emotions shown in the image (happy, sad), etc. This makes it difficult for a bot to answer, as the bot has to process the image, compare the figure against its database, extract certain features of the image using image processing techniques, and only then answer the question. However, with the advancements in computer vision these Captchas were solved using facial recognition algorithms.


B. Optical Illusion Captcha’s

Figure 5: Optical Illusion Captcha

These Captchas use optical illusions for their questions. One example is the image above: the user is asked for the number of circles in the figure. With the naked eye, a human can see 5 circles. Because the circles overlap and their borders are hard to find, a computer vision algorithm has a lower success rate within the time frame before the Captcha expires.

At this stage the bots did give up, because they hit the timeout before they could find the solution. Then human resolvers began assisting the bots; this is the attack discussed earlier. From 2011, bot developers started working on the mining and transportation of Captchas rather than on solving them, since human resolvers could do the solving.

C. Puzzle Solving

Researchers observed how difficult it was for bots to solve image-based Captchas, and so started developing puzzles and games based on these images. User interaction is considered one of the important goals when designing a puzzle. Initially, users were willing to solve these puzzles because they were interesting. Having said that, these puzzles are time-consuming.

Different types of games have been designed based on their complexity; a few are described below.


D. Click the Object

Figure 6: Captcha games, Click

In this game the user has to click the appropriate object based on the question. This addresses the two important conditions of a Captcha:

1) The user’s difficulty level while solving.

2) The bot’s difficulty level to understand and solve.

The user’s difficulty level for this puzzle can be considered easy to medium, while the bot’s difficulty level can be considered difficult to impossible when it comes to understanding the puzzle and solving it in the given time. Having said that, bots fell back on their old technique of brute forcing the whole Captcha, and according to the case study [16] the success rate of brute force attacks is 75%.

E. Drag and Drop

Figure 7: Captcha games, Drag

This game is an advanced version of the previous game: the user has to drag and drop the objects. It has a higher difficulty level than the previous puzzle; however, bots have been designed to solve it using both scripting (PHP) [17] and brute force.

On a tangential note, according to Dan Yao [18] in 2010, puzzles could be considered the future of Captcha. If a time frame is included, it is difficult for a bot to solve a Captcha using the above techniques. This can be considered the foundation of “I am Not a Robot”, Google’s new Captcha.

2.5. Past Research

In the past decade, research on improving Captcha has increased drastically. Researchers follow one of two approaches to Captcha:

1) Either build a Captcha which is hard for a human resolver or a bot to solve,

2) Or build a bot which solves the Captcha.

These are the only two types of researchers. With the advancements in computer vision and scripting, it is always a great challenge for Captcha developers to develop a model that is resistant to attacks, and human resolvers just add fuel to the fire.

However, Google’s new “I am not a robot” is not only based on image-based Captchas and image puzzles, but also has a unique feature for identifying bots. Previously, Bayesian networks identified bots using their spam filtering technique, which works well on DNS lookups etc., but this was proven unsafe on websites because of Bayesian filter poisoning [19], where spammers send certain legitimate words along with the spam so that the spam filter fails to notice it and lets it through.
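The effect of poisoning can be seen in a minimal Python sketch of a naive-Bayes-style spam score. The per-word probabilities, the default for unknown words, and the function name `spam_score` are hypothetical values chosen for illustration; they do not come from any real filter.

```python
import math

# Hypothetical P(spam | word) values that a trained filter might hold.
WORD_SPAM_PROB = {
    "viagra": 0.99, "winner": 0.95,
    "meeting": 0.05, "report": 0.05, "schedule": 0.05,
}

def spam_score(message, default=0.4):
    """Combine per-word probabilities in log-odds space and return P(spam)."""
    log_odds = 0.0
    for word in message.lower().split():
        p = WORD_SPAM_PROB.get(word, default)
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))

plain = "viagra winner"
poisoned = plain + " meeting report schedule meeting report"
print(spam_score(plain) > 0.5)     # flagged as spam
print(spam_score(poisoned) > 0.5)  # padded with legitimate words, slips through
```

Each legitimate word pulls the combined score down, so enough padding pushes a spam message below the filter threshold.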


Having said that, Google’s new feature is completely different from the Bayes classifier: it observes the user’s mouse patterns and determines whether the user is a bot or a human [20]. Google developed this algorithm using the immense pattern data of its users. What if Google does not recognize the user’s pattern? Google developed another solution for this problem: if the pattern is not recognized, it falls back to the image-based Captcha and asks the user to identify images. However, this Captcha is more difficult than regular Captchas.

Example

Figure 8: I am Not a Robot Captcha


In the above figure, the first image is harder to answer than the second image. However, this is currently considered one of the more secure Captchas.

“How do human resolvers work?”

“How does this Captcha handle human resolvers?”

“How can it improve its security?”

These are the three research problems which this paper covers.

2.6. Research Goal

In this study, the main goal is to understand the different ways each Captcha works and how bots or human resolvers break it. This paper also discusses the security concerns of Google’s “I am not a robot”, which is considered one of the more secure Captchas, and gives recommendations to improve its security.

2.7. Methodology

In order to design the methodology for this paper, I had to understand how Google’s Captcha works, which parameters are considered when building the Captcha, and what its security flaws are. I created an account with Death by Captcha to understand how human resolvers work on solving Captchas, how the tool sends the Captchas to the human resolvers, and what the flaws in their system are. After getting a better view of both systems, I propose a system which removes the observed disadvantages.

2.8. Research Findings

1) Sort out the reasons for failures in the current models.

2) Differentiate how Google’s approach is different from other models.


3) Explain the security concerns with the current model.

4) Improve the Captcha, by proposing an updated one.


3. I am not a Robot

3.1. Overview

In early 2014 Google wanted to revamp its Re-captcha model, as it was both time consuming and not completely automated as the principles of Captcha require. The result of that research is “I am not a Robot”, which is considered one of the better solutions to Captcha problems.

Figure 9: I am not a Robot Captcha

Google used a unique approach for this method; they used the following two parameters

to decide whether the user is a human or a robot:

a) Mouse readings

b) Cookies Method


3.1.1. Mouse reading

Google is one of the most used websites and is considered to have the largest number of users; the more users, the more data it has. Google designed an algorithm to monitor users’ mouse movements while they solve a Captcha. The mouse movements of all correctly solved Captchas were recorded and used to derive a characteristic pattern. Since this pattern has been tested on the immense data Google holds, a bot solving a Captcha is easily detected as deviating from it. Having said that, Google does not apply this method while users are solving a Captcha puzzle; instead it inserted a checkbox saying “I am not a robot”, which the user has to check. The algorithm examines the user’s mouse movement during this step and decides whether the user is a human or a bot.

If this check succeeds, the user is given access; otherwise there is a questionnaire, just like the old re-captcha, to determine the outcome. Other signals are used alongside this method to reach the decision: cookies and IP addresses [21].
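Google’s actual algorithm is not public, so the following Python sketch only illustrates the general idea of scoring a pointer path and falling back to the questionnaire. The straightness feature, the 0.98 threshold, and the function names are assumptions made for this example.

```python
import math

def path_straightness(points):
    """Ratio of straight-line distance to travelled distance (1.0 = perfectly straight)."""
    travelled = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled

def next_step(points, max_straightness=0.98):
    """Hypothetical rule: a suspiciously straight path falls back to the questionnaire."""
    if path_straightness(points) < max_straightness:
        return "grant_access"
    return "show_questionnaire"

scripted_path = [(0, 0), (50, 50), (100, 100)]         # perfectly straight
human_path = [(0, 0), (30, 10), (55, 60), (100, 100)]  # slightly wandering
print(next_step(scripted_path), next_step(human_path))
```

A production system would combine many such features with the cookie and IP signals mentioned above rather than rely on a single threshold.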

3.1.2. Cookie Method

In this method, when the user first solves the Captcha correctly, a cookie is saved on his computer. Whenever the user encounters a similar Captcha, it verifies the cookie, confirms that he is an authenticated user, and gives him access to the website with just a click. The generated cookie isn’t related to his session and is tied to his IP address, so this method does not work if his cookies are deleted or his IP address is renewed.
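The format of Google’s cookie is not documented here, so the sketch below is only an assumed model of a cookie that is bound to an IP address: the cookie value is an HMAC of the address under a server-side key. `SECRET`, `issue_cookie`, and `is_known_user` are hypothetical names.

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical key known only to the server

def issue_cookie(ip_address):
    """Create a cookie value bound to the IP address that solved the Captcha."""
    return hmac.new(SECRET, ip_address.encode(), hashlib.sha256).hexdigest()

def is_known_user(cookie, ip_address):
    """Accept only when a cookie is present and matches the current IP address."""
    if cookie is None:
        return False
    return hmac.compare_digest(cookie, issue_cookie(ip_address))

cookie = issue_cookie("203.0.113.5")
print(is_known_user(cookie, "203.0.113.5"))  # same address: recognized
print(is_known_user(cookie, "203.0.113.9"))  # address renewed: treated as new
print(is_known_user(None, "203.0.113.5"))    # cookie deleted: treated as new
```

The last two calls correspond to the failure cases in the text: a renewed IP address and a deleted cookie.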

On a tangential note, according to ghacks.net, cookie-based predictions are not always correct in identifying users. There are certain cases which have to be considered; a few are listed below.

1) If the user is connected to DHCP.

2) If the user is connected using VPN.

3) If user has agents like Cookie Monster installed in his computer.

3.1.2.1. User connected to DHCP

If a user is on a network which uses DHCP rather than static IP addressing, and his IP address is renewed at a specific interval, then the cookie-based method cannot work, because the user cannot authenticate from the same IP address. The previously generated cookie could still be verified, but the IP address associated with that cookie is no longer the same, so Google treats him as a new user.

3.1.2.2. User connected to VPN

If a user is accessing a website through a VPN, he is connecting to that website through a secure gateway (possibly multiple gateways or proxies). The cookie generated would therefore not be stable, since it is tied to the IP address of his gateway, and it cannot authenticate the user.

3.1.2.3. User using agents

Some users install agents such as Cookie Monster, which delete all cookies after the session ends or after a scheduled time interval. These agents are used as a security measure for protecting the user’s data. If such an agent is installed, no cookies will be saved on the user’s computer, and the cookie method cannot authenticate the user.

Google has designed a mechanism to track the mouse movements of people using laptops and desktops, but 90 percent of the current population prefers to use mobile phones for accessing websites. Does Google have an automated process for mobile users? It is a million dollar question, and the answer is NO. Google did not design a mechanism for people connecting to websites from mobile phones; instead it uses the traditional re-captcha approach. The Captcha consists of a set of questions and asks users to pick answers from a pool of answers provided. These questions have multiple correct answers, so it is hard for a bot to guess or hit on a random answer, and human intervention is needed to solve the puzzle. If the user does not select all the answers, the Captcha provides an additional set of answers and asks the user to answer again [24].

3.2. Breaking Google’s Captcha

It took two years for hackers to break Google’s Captcha. In early 2016, students from Columbia University demonstrated a live hack on “I am not a robot” at Black Hat Asia 2016. Rather than attacking the Captcha directly, they reverse engineered how Google obtains the images for its questionnaire. Google draws on its immense database of images, and images repeat most of the time.

These hackers used Google’s reverse image search to obtain descriptions of the images and assigned a tag to each image. After assigning tags, they used tag classifiers for comparison. Once the question is asked, the attack matches the question to its corresponding tag classifier, looks for a similar pattern in the options provided, and predicts the answer.
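The final matching step of this approach can be sketched in Python once every candidate image already carries a set of tags. The tag data, the image names, and the function `pick_images` are hypothetical stand-ins; the reverse image search and the tag classifiers themselves are not reproduced here.

```python
# Hypothetical tags, standing in for descriptions gathered beforehand.
IMAGE_TAGS = {
    "img1": {"storefront", "shop", "street"},
    "img2": {"dog", "puppy", "grass"},
    "img3": {"store", "storefront", "window"},
}

def pick_images(question_keyword, image_tags):
    """Return the names of all images whose tags contain the question keyword."""
    return sorted(
        name for name, tags in image_tags.items() if question_keyword in tags
    )

print(pick_images("storefront", IMAGE_TAGS))
```

Because the question can have several correct images, the function returns every match rather than a single best guess.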

This method solved about 70% of the Captchas successfully. However, it took a little more than 115 seconds to solve each one, and Google has a time limit of 55 seconds per Captcha, so the method can be considered a failure. According to the author at Naked Security, the approach used for this attack is unique; however, the hackers have to consider two important parameters.

a) They were using Google’s own database for its images, rather than images to match a word, to help them find images in a Re-captcha set that shared a particular characteristic.

b) What if Google uses a separate database for its images rather than the public database, and makes access to it private?

These two parameters are considered among the reasons for the failure of the attack. In addition, attackers must also consider the time frame for solving Captchas in order to make the attack successful.

3.3. Integrating I am not a robot to Website

Adding advanced security features to a website makes the website more secure and can improve its user hit ratio. The “I am not a robot” Captcha is definitely one such feature. It is not only user friendly but also easy to install on a website. In order to install it, we should have a Google account, and the following steps should be performed [23].

Step 1

Register the website, its domains, and sub-domains.

Figure 10: Registering domains

Step 2

Enter its site key and partner key, which can be generated by an existing option.

Figure 11: Site Key

Step 3

Make a note of the snippets.

Figure 12: Snippets

Step 4

Choose the location of the Captcha on your website.

Figure 13: Location

Step 5

Create your login web page.

Figure 14: Login page

Step 6

A key is generated for every user.

Figure 15: Key

Step 7

Add the pre-built Captcha library (PHP code) from GitHub to our code so that we get the Captcha questionnaire on our website.

Step 8

It compares our site key; if the cookies are present it gives access, otherwise it sends the user to the webpage containing the questionnaire, and the () call matches the answers and gives access.
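The thesis uses the pre-built PHP library for the server-side check. As a rough Python equivalent, the sketch below posts the token from the submitted form to reCAPTCHA’s `siteverify` endpoint and reads the `success` field of the JSON reply. The function names are illustrative, and the secret key and token passed in are placeholders.

```python
import json
import urllib.parse
import urllib.request

VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"

def build_verify_request(secret_key, user_response, remote_ip=None):
    """Build the POST request that carries the secret key and the user's token."""
    fields = {"secret": secret_key, "response": user_response}
    if remote_ip:
        fields["remoteip"] = remote_ip
    data = urllib.parse.urlencode(fields).encode()
    return urllib.request.Request(VERIFY_URL, data=data)  # data present => POST

def is_human(secret_key, user_response, remote_ip=None):
    """Return True when the verification service reports success."""
    request = build_verify_request(secret_key, user_response, remote_ip)
    with urllib.request.urlopen(request, timeout=10) as reply:
        return bool(json.load(reply).get("success"))
```

In a login handler, `user_response` would be the value of the `g-recaptcha-response` form field, and access would be granted only when `is_human` returns `True`.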

3.4. Limitations

“I am not a robot” is one of the best automated Captchas designed and has been running successfully for the past 2 years. However, there are limitations in this Captcha model which allow attackers to succeed in breaking it. A few of them are listed below [22].

a) Are mouse patterns safe?

b) Are cookies safe?

c) Repetition of images.

d) Static puzzle vs dynamic puzzle

3.5. Mouse patterns

The immense data owned by Google helped it design a pattern for mouse movements. Can these mouse patterns be duplicated? The answer is YES: with the advancements in scripting techniques, it is very much possible for a hacker to understand those movements and reproduce one. On a tangential note, the mouse movements are tracked using JavaScript, which makes it easier for a hacker to design a script that behaves similarly.

Is recording our mouse movements and mouse clicks legal? Vinay Shet, the project manager of “I am not a Robot”, confirms that Google only records the patterns while the user is solving a Captcha and uses that information for research purposes. Technically this process is called advanced risk analysis. It uses scripting languages to track the movement of the mouse, which yields the mouse coordinates, and from the mouse clicks it learns what the user is selecting. This is an invasion of privacy; however, this multibillion-dollar company has its own legal resources to defend the process.

However, this process isn’t new: Gmail or any other webpage can track everything you do and could be logging your every pointer movement or keystroke. Logging keystrokes is no super-secret, privacy-sucking vampire sauce. It’s plain old Web 1.0. This is not news, but it’s certainly worth repeating: anybody with a website can capture what you type, as you type it, if they want to.

The reality is that JavaScript, the language that makes this kind of monitoring possible, is both powerful and ubiquitous. It’s a fully featured programming language that can be embedded in web pages, and all browsers support it. It’s been around almost since the beginning of the web, and the web would be hurting without it, given the things it makes happen. Among the many features of the language are the abilities to track the position of your cursor, track your keystrokes and call “home” without refreshing the page or making any kind of visual display. Those aren’t intrinsically bad things. In fact, they’re enormously useful. Without those sort of capabilities sites like Facebook and Gmail would be almost unusable, searches wouldn’t auto-suggest and wouldn’t save our bacon in the background.

In the case of Google’s advances with reCAPTCHA, such ability can stop a lot of bad bots from doing things that can be worse than the annoyance of having to endure typing in text from a blobby image.

Think bots that harvest email addresses from contact or guestbook pages, site scrapers that grab the content of websites and re-use it without permission on automatically generated doorway pages, bots that take part in Distributed Denial of Service (DDoS) attacks, and more.

4. Human Resolvers

Human resolvers are considered the biggest threat to Google. They are not hackers who break into its databases, nor coders who write applications like decaptcha to solve Captchas; they are ordinary people working for wages. They are based in countries such as India and China, where labor is cheap, and are paid according to the number of Captchas they solve.

4.1. How they work

There are many services such as decaptcha and deathbycaptcha; however, Death by Captcha (DBC) is considered the best-known Captcha-solving service. Anyone who wants to earn money can register and start solving Captchas. Resolvers are paid 0.02 cents for every correctly solved Captcha, and their rate increases after a certain number of solved Captchas or based on the number of hours worked. On the other end, there are two methods for users to get their Captchas solved:

1) Sending Captchas through a pre-built API client.

2) Using their own API and connecting it to DBC.

4.1.1. Sending Captchas through a pre-built API client

Death by Captcha has its own pre-built API client for users to send Captchas. The user first creates an account with DBC and pays for the service, which costs $5 per 1000 solved Captchas. The user can either manually crop the Captcha portion of the website and send it to the DBC account, or use automated software that crops the Captcha from the website and sends it on his behalf. The average response time is less than 5 seconds, and the user receives the answer in a small dialogue box. This is the least used method, as users have to manually crop the Captcha or manually send it to the API client.

Figure 16: API client

4.1.2. Using their own API and connecting it to DBC

This is the most common way of using DBC: users have their own web-scraping tools and connect to their secure DBC account through the port numbers provided by DBC. This method gives them additional flexibility, as they can customize their code and be specific about what they want.

Steps included

1) Web-scraping

2) Cropping and Transporting

3) Solving by human resolvers

4) Entering the resulted values


1) Web-scraping

The hacker will either write his own code or subscribe to a service; the main goal in this step is to scrape websites and search for Captchas. Once a Captcha is found, control passes to the second part of the code.

2) Captcha Cropping

Once the Captcha is found, the code tries to crop the Captcha from the webpage, or takes a screenshot of the screen, and transports it to the user's secure DBC account using the dedicated port number.

3) Solving

Human resolvers identify the Captcha, solve it, and send the answer back to the web-scraping tool by entering it in the results tab. The code in the web-scraping tool then enters the result on the actual webpage. If the answer is correct, the tool is given access; otherwise it is denied.

DBC claims that the average Captcha solving time is less than 11 seconds; however, according to my findings, the average solving time is less than 5 seconds. That said, Captchas have time limits: for I am not a Robot, the predefined limit is 50 seconds. Comparing these numbers, it is evident that human resolvers have more than triple the time they need to solve the Captcha before it expires.

Human resolvers mostly solve two kinds of Captchas:

a. OCR Captcha / Recaptcha

b. Advanced Captcha (I am not a Robot).

3.1. OCR Captcha / Recaptcha

As noted earlier, OCR Captchas are straightforward and very easy for human resolvers to solve, since they involve simple tasks such as typing words or letters and doing simple additions. These Captchas are the most widely used because of their flexibility.

3.2. Advanced Captcha (I am not a Robot)

In this kind of Captcha, the human resolvers must first understand the Captcha before answering the questions. This kind of Captcha is considered secure because it leaves more room for error on the part of the human resolvers.

Example: Identifying a storefront.

In the figure below there are many images of different store signs; however, the end user must understand what a storefront is before solving the Captcha. Similarly, the human resolver must understand the meaning of storefront and then solve the Captcha in time.

Not all online users are educated, and not all online users understand English. There is a minute difference between the old OCR Captcha / ReCaptcha and the new I am not a robot Captcha. In the previous version, only letters or numbers were tested. This was easy for both users and human resolvers, so it was less secure. In the current method, the user has to use his mental abilities to solve the Captcha, which is hard for both users and human resolvers. This violates the rule of Captcha: “Captcha should be easily understood and solved with basic understanding and effort” [22].

Figure 17: Storefront image

Figure 18: Confusing Captchas

4.2. Limitations

Currently there are a few limitations for human resolvers, which include:

a. Improper configuration of the tool.

b. Captcha not entered correctly.

Improper configuration of the tool: When the tool described above is not properly configured, the Captcha image includes irrelevant material, which makes it hard for the human resolver to understand.

Captcha not entered correctly: In certain cases the human resolver might not enter the Captcha correctly, which prevents the tool from accessing the web page. The human resolver is then penalized, and DBC processes a refund for the wrongly entered Captcha.

4.3. Observations

I created an account with DBC to understand the limitations of the human resolvers. I could not find many limitations; however, I found certain loopholes in the methods human resolvers follow to solve Captchas. These strategies acted as a catalyst for my research findings. My observations are listed below:

1. Mouse coordinates and clicks

2. Puzzle architecture

4.3.1. Mouse coordinates and clicks

As per my observations, the inputs for human resolvers are of two types:

a. Entering numbers

b. Sending the coordinates


4.3.1.1. Entering numbers

The tool used by the hacker divides the Captcha into coordinates using image processing techniques, assigns digits or letters to all the coordinates where valid images are identified, and passes the result to the human resolvers. The human resolvers solve the Captcha, enter the appropriate digits or letters in the results tab, and send them back to the tool. The tool converts those digits into the corresponding coordinates and enters them back in the results tab. This process needs more network bandwidth, carries a high risk of packet loss, and is easily tracked by a firewall or through DNS, since the large number of network operations could raise suspicious flags for administrators.
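The conversion from a label back to a coordinate can be illustrated with a short Python sketch. It assumes, purely for illustration, that the Captcha is a uniform grid of tiles numbered row by row starting from 1; the tile size, the origin, and the helper name `tile_center` are hypothetical and not taken from any real tool.

```python
def tile_center(label, cols, tile_w, tile_h, origin=(0, 0)):
    """Map a 1-based, row-major tile label to the pixel centre of that tile.

    Assumes a uniform grid with `cols` tiles per row (illustrative only).
    """
    if label < 1 or cols < 1:
        raise ValueError("label and cols must be positive")
    row, col = divmod(label - 1, cols)
    x = origin[0] + col * tile_w + tile_w // 2
    y = origin[1] + row * tile_h + tile_h // 2
    return x, y
```

For a 3 x 3 grid of 100-pixel tiles, label 5 (the middle tile) maps to the point (150, 150).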

4.3.1.2. Sending the coordinates

This is the second method, in which the human resolvers send the mouse coordinates directly and the tool clicks on those coordinates.

What if the coordinates on the puzzle keep changing?

Figure 19: Mouse coordinates

4.4. Puzzle Architecture

Here, the human resolvers send a series of coordinates to the tool, one after the other, to solve the puzzle. For example, the image below has to be positioned correctly by moving it either left or right.

Figure 20: Dog rotation puzzle

What if there is a time out for every coordinate entered?

4.5. Proposed solution

Based on my observations in the previous section, I have designed a mechanism to mislead the human resolver.

The Captcha will initially have two questions, with a submit button included in the first question, and the second question should be a subset of the first question.

N = {q1, q2}

A = q1

B = q2, only if q2 ⊆ q1

In the above equation, N is the set of questions q1 and q2. The first question is q1 and the second question is q2, with the condition that q2 should be a subset of q1.

This mechanism is divided into two stages

a) Submittals

b) Time Frame

4.5.1. Submittals

In this stage, both questions have a submit option. The user must submit the answer to the first question before proceeding to the second question.

4.5.2. Time Frame

In this phase, each question has a time limit: 25 seconds for the first question and 10 seconds for the second question.
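The two stages can be sketched as a small server-side state machine. This is a minimal Python illustration under stated assumptions, not an implementation used in the experiments: the class name `TwoStageChallenge`, the result strings, and the injectable clock are all hypothetical, while the 25-second and 10-second limits come from the text above.

```python
import time

Q1_TIMEOUT = 25.0  # seconds allowed for the first question (from the text)
Q2_TIMEOUT = 10.0  # seconds allowed for the second question (from the text)

class TwoStageChallenge:
    """Hypothetical two-question Captcha session with per-question timeouts."""

    def __init__(self, q1_answer, q2_answer, clock=time.monotonic):
        self._answers = (q1_answer, q2_answer)
        self._clock = clock
        self._stage = 0              # 0 = on q1, 1 = on q2, 2 = passed, -1 = failed
        self._started = clock()      # when the current question was shown

    def submit(self, answer):
        """Check one submitted answer; returns 'next', 'pass', 'wrong',
        'timeout' or 'closed'."""
        if self._stage not in (0, 1):
            return "closed"
        limit = Q1_TIMEOUT if self._stage == 0 else Q2_TIMEOUT
        if self._clock() - self._started > limit:
            self._stage = -1
            return "timeout"
        if answer != self._answers[self._stage]:
            self._stage = -1
            return "wrong"
        self._stage += 1
        self._started = self._clock()  # restart the clock for the next question
        return "next" if self._stage == 1 else "pass"
```

The second question is only ever shown after a correct, in-time submission of the first, so a party that sees the second question in isolation also inherits the shorter 10-second window.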

4.6. Explanation

In this section I explain my proposed solution and discuss the expected results.

Consider two users solving the Captcha: one is a human resolver and the other is a legitimate end user. When the first question is triggered, a clock is started at the backend, and the page times out after 25 seconds. So the human resolver or the legitimate user has 25 seconds to solve this question. The first question is a simple, straightforward question; assume both users answer correctly and click the submit button before the timeout. Once the submit button is clicked, the page ends and the webpage checks the answer. At this stage, however, the human resolver does not learn the result: the tool at his end treats the submission as successful, ends the transaction, and proceeds to another puzzle. The legitimate user, in contrast, waits for the result and, since the answer is correct, is redirected to the second question. The web-scraping tool meanwhile identifies another Captcha and quickly sends it to a randomly chosen human resolver. As per the above equation, the second question is based on the result of the first question, and its timeout is 10 seconds. The legitimate user can answer this question because he answered the previous one. The human resolver, however, cannot understand the question and can only guess the answer on a blind hunch.

There is currently no mechanism in the tool to send the second question to the same human resolver, and since the question has a timeout of 10 seconds, it would be difficult for the human resolver to understand and solve the Captcha. This process can deceive the human resolver.
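The argument above can be put in rough numbers with a toy model. The Python sketch below assumes that the relay tool hands each Captcha to a resolver chosen uniformly at random from a pool, and that a resolver who did not see the first question can only pick one of the second question's options at random. The pool size and the number of options are hypothetical parameters, not measurements from this thesis.

```python
def relay_pass_probability(pool_size, options):
    """Chance that a relayed second question is answered correctly (toy model).

    Assumes uniform random dispatch and a blind guess by any resolver who
    did not see the first question; both are illustrative assumptions.
    """
    if pool_size < 1 or options < 1:
        raise ValueError("pool_size and options must be positive")
    p_same = 1.0 / pool_size   # the same resolver receives both questions
    p_guess = 1.0 / options    # blind guess among the offered options
    return p_same + (1.0 - p_same) * p_guess
```

Under these assumptions, a larger resolver pool or a second question with more possible answers both push the probability down, and the 10-second limit leaves little time to recover the missing context.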


5. Experiment and Results

5.1. Experiment

I used a very basic example based on fruits. I designed a first question with a picture of apples.

Figure 21: Question 1

I sent it to the human resolvers and got the correct answer.

For my second question, I asked:

Figure 22: Question 2

The answer I received was “invalid Captcha”. This indicates that the human resolver who solved the first question was not the same as the second resolver; he could not understand the question and flagged it as an invalid Captcha.

5.2. Results

Result

Result

Result

Result

The results of further experiments performed on both ordinary humans and Death by Captcha are shown below. The ordinary users answered correctly 98% of the time, whereas the human resolvers guessed correctly only 1% of the time.

Answer 1

Answer 2

Answer 1

Answer 2

Answer 1

Answer 2

Answer 1

Answer 2

In all the above figures, Answer 1 is the response given to the first question, and Answer 2 is the response given to the second question. When the same experiment was tried on ordinary humans (end users), the success rate was close to one hundred percent; however, when the same Captchas were used on the human resolvers, the success rate for solving both Captchas was less than one percent.

Initially, the human resolvers could not understand the second question, as it is related to the first question, so they returned a question mark symbol, which means an invalid Captcha. However, in the last experiment, the human resolver tried to guess the answer. He guessed incorrectly, but there is a chance that a human resolver could also guess correctly. This shows that guessing the answer is the only way to defeat the proposed scheme.

All the above results show that the proposed solution can trick the human resolvers and make them answer incorrectly, resulting in the success of the proposed solution.


6. Conclusion and Future Work.

6.1. Conclusion

The results obtained in the previous chapter show that two consecutive questions will not be answered by the same human resolver. Even if the second human resolver wanted to contact the first resolver, the time constraint is an important factor: the Captcha must be solved within 10 seconds, which only a legitimate end user can do. If this method is implemented, the confusing Captchas can be removed permanently, which increases flexibility for users. Users just need to solve two similar questions within the time limit; this also reduces attacks by bots, as it is similar to two-factor authentication.

The results also show that if the human resolver is not able to answer the question, he returns it as an invalid Captcha, which reaches the hacker as a question mark (?). In my experiments above, the human resolver was not able to answer the questions and thus sent back a question mark, which can be considered the success of this approach.

Google’s I am not a robot is one of the best solutions in the current market; however, it needs certain patches that could make the protocol even stronger. Currently, human resolvers are considered the biggest threat to Captchas. In order to stop them, we have to adopt mechanisms such as the one discussed above. This method not only has no apparent disadvantages, but is also easy to implement and weeds out unauthorized users.

6.2. Future work:

The success of Google’s I am not a Robot Captcha is mainly due to the mouse patterns, which form the first phase of eliminating bots. The huge volume of data that Google has access to allowed it to design these patterns. This raises certain questions, which could be considered future work or an extension of this paper.

1) What if hackers design their own mouse pattern based on this pattern?

2) Cookies are used for returning users. What if the cookies are stolen and the MAC address tagged to them is modified?

These are a couple of future research directions that are quite feasible and could exploit vulnerabilities in Google’s Captcha.


REFERENCES

1. Kulkarni, Sushama, and H. S. Fadewar. "CAPTCHA based web security: an overview." Int. J. Adv. Res. Comput. Sci. Softw. Eng 3.11 (2013): 154-158.

2. Bursztein, Elie, Matthieu Martin, and John Mitchell. "Text-based CAPTCHA strengths and weaknesses." Proceedings of the 18th ACM conference on Computer and communications security. ACM, 2011.

3. Pathak, Avanish. An analysis of various tools, methods and systems to generate fake accounts for social media. Diss. Northeastern University Boston, 2014.

4. Onwudebelu, Ugochukwu, et al. "CAPTCHA Malaise: Users suffer Consequences of the Anti-spam Technology while the Spammers Adapt." (2010).

5. Von Ahn, Luis, et al. "CAPTCHA: Using hard AI problems for security." International Conference on the Theory and Applications of Cryptographic Techniques. Springer Berlin Heidelberg, 2003.

6. Von Ahn, Luis, , and John Langford. "Telling humans and computers apart automatically." Communications of the ACM 47.2 (2004): 56-60.

7. Yan, Jeff, and Ahmad Salah El Ahmad. "A Low-cost Attack on a Microsoft CAPTCHA." Proceedings of the 15th ACM conference on Computer and communications security. ACM, 2008.

8. Chellapilla, Kumar, et al. "Designing human friendly human interaction proofs (HIPs)." Proceedings of the SIGCHI conference on Human factors in systems. ACM, 2005.

9. Pinkas, Benny, and Tomas Sander. "Securing passwords against dictionary attacks." Proceedings of the 9th ACM conference on Computer and communications security. ACM, 2002.

10. "Did Amazon Get Hacked by Greenpeace?" Amazon.com: Customer Discussions:. N.p., n.d. Web. 23 June 2016.

11. Henry, Casey. "CAPTCHAs' Effect on Conversion Rates." Moz. N.p., n.d. Web. 29 June 2016.

12. "Who Does CAPTCHA Discourage: Spammers or Customers? -." N.p., 2015. Web. 29 June 2016.

13. Moy, Gabriel, et al. "Distortion estimation techniques in solving visual CAPTCHAs." Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on. Vol. 2. IEEE, 2004.

14. Yahoo! captcha is broken. Online at http://network-securityresearch.blogspot.com/, January 2008.

15. Deapesh Misra and Kris Gaj. Face recognition captchas. In International Conference on Internet and Web Applications and Services/Advanced International Conference on Telecommunications, page 122, Washington, DC, USA, February 2006.

16. Baecher, Paul, et al. "CAPTCHAs: The Good, the Bad, and the Ugly." Sicherheit. Vol. 170. 2010.

17. Vaishakh, B. N., and G. Harish. "CAPTCHAS: SURVEY OF EXISTING TECHNIQUES AND A NEW APPROACH." National Conference on Recent Trends in Computer Technology. 2011.

18. Gao, Haichang, et al. "A novel image based CAPTCHA using puzzle." Computational Science and Engineering (CSE), 2010 IEEE 13th International Conference on. IEEE, 2010.

19. Khanna, Sumit. "Breaking the multi colored box: a study of CAPTCHA." (2009).

20. Sivakorn, Suphannee, Jason Polakis, and Angelos D. Keromytis. "I’m not a human: Breaking the Google reCAPTCHA."

21. Google Online Security Blog, “Are you a robot? Introducing “No CAPTCHA reCAPTCHA”,” http://googleonlinesecurity.blogspot.com/2014/12/are-you-robot-introducing-no-captcha.html.

22. I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit number recognition from street view imagery using deep convolutional neural networks,” in CoRR ’13.

23. E. Bursztein, J. Aigrain, A. Moscicki, and J. C. Mitchell, “The end is nigh: Generic solving of text-based CAPTCHAs.” in USENIX WOOT ’14.

24. L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum, “reCAPTCHA: Human-based character recognition via web security measures,” Science, vol. 321, no. 5895, 2008.