<<

The art of breaking and designing captchas

Elie Bursztein

Session ID: HT02-402

Elie Bursztein (@elie) https://elie.net

World's Most Popular Captchas

[Reddit] [CNN] [Megaupload] [eBay]

[Baidu] [Recaptcha]

[Authorize] [Captcha.net] [Skyrock]

[Digg] [NIH] [Google]

[Slashdot] [Wikipedia] [Blizzard]

Captcha Design Goal

[Diagram: axes "Hard for computer" and "Hard for human"; regions labeled "Human", "AI ?" and "sweet spot"]

Focus of this talk

How to break and design CAPTCHAs

Based on breaking 21 of the most popular schemes and designing the new Wikipedia captcha

Outline

- How to break text captchas
- How to make captchas easier for humans
- How to break audio captchas
- How to break video captchas

Evaluation metrics

- Accuracy
- Solving time
- Learnability

How to Break Text-Captchas

Think Lego

How to break a captcha: example

1. Pre-processing: background removal
2. Pre-processing: captcha binarization
3. Pre-processing: line detection
4. Pre-processing: line removal
5. Segmentation: clustering algorithm
6. Segmentation: cluster separation
7. Post-segmentation: inverting rotation
8. Recognition: 3 7 1 3
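The binarization and segmentation steps of the example can be sketched in a few lines. This is only an illustration under stated assumptions: the slides do not say which clustering algorithm is used, so the sketch swaps in plain 4-connected component labeling, and the threshold value and the tiny test image are made up for the example.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background.

    The threshold of 128 is an assumption, not a value from the talk.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment(binary):
    """Group ink pixels into 4-connected clusters, ordered left to right."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    clusters = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue, pixels = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            clusters.append(pixels)
    # Characters are read left to right, so order clusters by leftmost column.
    clusters.sort(key=lambda pixels: min(x for _, x in pixels))
    return clusters


# Hypothetical 4x5 grayscale image containing two dark blobs.
image = [
    [255, 255, 255, 255, 255],
    [10, 255, 255, 20, 255],
    [10, 255, 255, 20, 30],
    [255, 255, 255, 255, 255],
]
clusters = segment(binarize(image))
print(len(clusters), "clusters found")
```

Connected components fail as soon as characters touch, which is one way to read why collapsing and lines are effective defenses: they force the attacker into a harder segmentation step.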

Breaker 5 Stages Pipeline

Example: Slashdot captcha

1. Preprocessing
2. Segmentation
3. Post-segmentation
4. Recognition: f a e t e s t
5. Post-recognition: f a s t e s t
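The post-recognition step in the example turns the raw guess "faetest" into "fastest". One plausible way to do this, assuming the captcha text is drawn from a known word list, is to pick the dictionary word with the smallest edit distance to the guess. The slides do not say how the correction is actually done, so the approach and the word list below are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def correct(guess, words):
    """Return the dictionary word closest to the recognizer's guess."""
    return min(words, key=lambda word: edit_distance(guess, word))


# Hypothetical dictionary; a real one would hold the scheme's full vocabulary.
WORDS = ["fastest", "faster", "safest", "tastes"]
print(correct("faetest", WORDS))
```

This only helps against schemes that use real words, which is an argument for random character strings on the design side.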

From the image to the matrix representation

From the matrix representation to the vector representation

vector = L1 L2 L3 L4 L5 L6 (the matrix rows L1 to L6 laid end to end)
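A minimal sketch of the matrix-to-vector step, assuming the segment is stored as a list of pixel rows and that the vector is simply the rows concatenated in order. The 3x2 matrix is a made-up example.

```python
def to_vector(matrix):
    """Concatenate the rows of a pixel matrix into one flat feature vector."""
    return [px for row in matrix for px in row]


# Hypothetical 3x2 binary segment: rows L1, L2, L3.
segment_matrix = [
    [0, 1],
    [1, 0],
    [1, 1],
]
vector = to_vector(segment_matrix)
print(vector)
```

All segments must be scaled to the same matrix size first, otherwise their vectors have different lengths and cannot be compared.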

From the vector representation to the segment value (classification)

Distance from the unknown vector to each known vector:

Known vector   Distances
A              42, 40
B              32, 70
C              12, 18

The smallest distance (12) belongs to a C.
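The classification step can be sketched as a nearest-neighbor lookup: compute the distance from the unknown vector to every labeled vector and return the label of the closest one. The slide does not name the distance metric, so Hamming distance on binary vectors is an assumption here, and the training vectors are invented for the example.

```python
def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def classify(unknown, known):
    """Return the label of the known vector closest to the unknown one.

    `known` is a list of (label, vector) pairs.
    """
    label, _ = min(known, key=lambda item: hamming(unknown, item[1]))
    return label


# Hypothetical labeled vectors, two examples per character.
KNOWN = [
    ("A", [1, 1, 0, 0, 1, 1]),
    ("A", [1, 1, 0, 1, 1, 1]),
    ("B", [0, 0, 1, 1, 0, 0]),
    ("B", [0, 1, 1, 1, 0, 0]),
    ("C", [1, 0, 1, 0, 1, 0]),
    ("C", [1, 0, 1, 0, 0, 0]),
]
print(classify([1, 0, 1, 0, 1, 1], KNOWN))
```

Using k > 1 neighbors with a majority vote gives the KNN classifier whose learning rate is plotted later in the deck.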

Breaker efficiency

Solver accuracy = Coverage * Precision^length

Coverage: segmentation rate
Precision: recognition rate
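A worked instance of the formula above makes the effect of length visible. The coverage, precision and length values are illustrative only and are not measurements from the talk.

```python
def solver_accuracy(coverage, precision, length):
    """Solver accuracy = coverage * precision ** length.

    coverage  -- fraction of captchas segmented correctly (0..1)
    precision -- per-character recognition rate (0..1)
    length    -- number of characters in the captcha
    """
    return coverage * precision ** length


# Illustrative values: 90% segmentation, 95% per-character recognition.
for length in (4, 6, 8):
    print(length, round(solver_accuracy(0.90, 0.95, length), 3))
```

Because precision is raised to the power of the length, every extra character multiplies the attacker's success rate by the per-character recognition rate. With the values above, a 6-character captcha is solved about 66% of the time.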

Anti-recognition techniques

- Blurring
- Distortion
- Rotation
- Fonts
- Charsets (e.g. 0123456789)

SVM learning rate

[Figure: % success (0% to 100%) vs. training set size (10, 20, 50, 100, 200, 500) for the series 09, AZ09, azAZ09, Distortion, 3 fonts, 5 fonts, Angles]

KNN learning rate

[Figure: % success (0% to 100%) vs. training set size (10, 20, 50, 100, 200, 500) for the series 09, AZ09, azAZ09, Distortion, 3 fonts, 5 fonts, Angles]

Anti-recognition taxonomy

- Background confusion
- Lines
- Collapsing

Breaking World of Warcraft

Breaking Captcha.net

Breaking Wikipedia

Breaking Digg

Breaking Slashdot

Breaking eBay

Failing to break eBay

Breaking Baidu

Overall results

Scheme        Segmentation rate   Solving rate
Authorize     84%                 66%
Baidu         98%                 5%
Blizzard      75%                 70%
Captcha.net   96%                 73%
CNN           50%                 16%
Digg          86%                 20%
eBay          95%                 43%
Google        0%                  0%
MegaUpload    n/a                 93%
NIH           87%                 72%
Recaptcha     0%                  0%
Reddit        71%                 42%
Skyrock       30%                 2%
Slashdot      52%                 35%
Wikipedia     57%                 25%

Learning rate for real schemes

[Figure: % success (0% to 90%) vs. training set size (10, 20, 50, 100, 200, 500) for Authorize, Baidu, Blizzard, Captcha.net, CNN, Digg, eBay, Megaupload, NIH, Reddit, Skyrock, Slashdot and Wikipedia]

Decaptcha main interface

Apply design principles

- Core design principles
  - Randomize length
  - Randomize character size
  - Wave the captcha
- Use anti-recognition as a means of strengthening captcha
- Don't use a complex charset
  - Bad for human (see our research on this)
  - Useless for security
- Use collapsing or lines

Designing Better Captchas

Think Lego again

- Decompose into features
- Analyze
  - features in isolation
  - feature interactions

Evaluation system

[Diagram: evaluation system built around Mechanical Turk. Labeled components: payment validation, monitoring system, web frontend, test system, captcha image generator, tasks generator. Labeled flows: tasks, feedback, results.]

Experiment details

Round   Task                     N possible    N sampled   N tests per sample   Total tests
1       Baseline ("Control")     1             1           1000                 1000
2       Real world captchas      8             8           1000                 8000
3       Features in isolation    496           496         200                  99200
4       2 feature interactions   60950         60950       5                    304750
5       3 feature interactions   1 303 224     25000       10                   250000
6       4 feature interactions   113 951 684   25000       10                   250000
Total                                                                           912150

Table 2. Overview of the experimentation rounds.

Then we considered tasks where the values of 2, 3 or 4 different features were changed. We chose a limit of 4 features for two reasons. First, we observed that captchas in the wild rarely exhibit more than one or two anti-segmentation features and one or two anti-recognition features [4]. Thus observing captchas where 2-4 features are varying is more representative of real world captchas. The other reason for limiting ourselves to 4 interacting features is the combinatorial explosion of possible feature combinations.

Because of the large number of possible combinations for 2, 3 and 4 feature captchas, we could only sample a subset of all captchas for these tasks. We first examined subject performance in the individual feature tasks and removed feature dimension values as follows. For features where performance was fairly constant across the different values or exhibited a roughly linear progression, we removed every second value, leaving only 50% of the original values. For fonts, we selected one font for each 10% bucket of accuracy, e.g. one font with accuracy around 10%, one font with accuracy around 20%, etc. (See the discussion below for the choice of 10%.) We removed all of the similar foreground/background color anti-segmentation features because the color confusion defense, while popular among captcha builders, is known to be insecure because computers are much better at distinguishing between two colors than humans. Therefore an attacker can exploit this difference of color between the background and the letters to successfully clean the background. That is why highly used captchas such as Google, Recaptcha and eBay use a uniform background. Thus it is less interesting from the perspective of implications for future captcha design. Table 2 shows the size of the resulting reduced feature spaces.

There is a tradeoff between the number of points in the space of feature combinations that we can sample, and the number of samples we take for each point. For example, to get a good estimate of how accurate humans are on captchas with 1000 random dots and 2 straight lines, we would ideally have 100 or even 1000 examples of this scenario. However, given that there are 60,950 possible two feature combinations, it is unrealistic to take this many samples for each such point. Thus we compromise at taking 10 samples of each point: this gives us a rough estimate of accuracy (10%, 20%, 30%, etc.) while still allowing us to sample a reasonable number of points in the feature combination space. Thus, we randomly selected 25000 captchas for each of the 3 and 4 feature interaction groups, and had each captcha annotated 10 times. As a result we had 250,000 captchas tested for each of these 2 groups.

[Figure 5. Success, failure and timeout rates (% completed, % failed, % timeout) vs. number of features (1 to 4)]

EXPERIMENTAL RESULTS

We evaluated subject performance on captchas in two ways: solving accuracy and solving time. For solving accuracy, we compared the answers given by subjects to the text from which we generated the captcha, ignoring differences in case or spacing. Solving time was measured by recording the time between when the subject was presented with a captcha and when they submitted their response. Note that measuring times on AMT can result in high variance because many Turkers do other things on their computer at the same time even if we explicitly asked them to do it as fast as possible. Still, by averaging over large numbers of subjects, we can get an idea of which captchas take more or less time.

One of the most striking observations about user behavior is that even though they were paid, when the captchas became too hard the users quit or typed in garbage. This is clearly visible on the overall user statistics displayed in Figure 5. As one can see, the number of users that quit or gave up increased as the captchas became harder and harder. The number of users that gave up seems to increase linearly with the number of features we added to the captchas. The solving time also increased as it took the users more time to understand the captchas.

Individual Features Results

Character sets. Table 3 shows how subjects performed on different character sets. Pseudo-words, words, and simple character sets like all digits, all lowercase letters and all uppercase letters were the easiest, with accuracies of 97% or [...]

Some of the features tested

Blurring Text color Font Background color

Collapsing Tilting Waving Distortion

line angle line line shape nb line

line coverage line position line size Noise

Angle of rotation

[Figure: solving time (s) and accuracy vs. rotation angle (°), 0 to 350]

Collapsing

[Figure: solving time (s) and accuracy vs. character gap width, from 4 down to -14]

Character size

[Figure: solving time (s) and accuracy vs. character size, 2 to 20]

Resolution invariant

[Figure: accuracy (50 to 100) vs. captcha length (number of characters, 5 to 30) for the series "<= 1024", "> 1024" and "all captchas"]

2D interactions

Length vs Angle interaction

Perception Does Not Match Number

[Figure: %fast, %easy and %like ratings (0 to 35) for the charsets az, 09, AZ, az09, AZ09, azAZ, azAZ09 and the sample words pretty (HF+), cutest (LF+), guilty (HF-), molest (LF-)]

The New Wikipedia

- Use digits
- Wave the captcha
- Use random length (5-7)
- Use random size (34-50)
- Rotate letters (-25/25)
- Add a line for a super secure version
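The parameters listed for the new Wikipedia captcha can be sampled as follows. This is a sketch of the randomization only, not the actual Wikipedia generator: rendering, waving and the optional line are left out, and it assumes that size and rotation are drawn independently per character, that sizes are integers, and that the rotation range is in degrees. The function and key names are hypothetical.

```python
import random

DIGITS = "0123456789"


def sample_captcha_spec(rng=None):
    """Draw one random captcha specification within the slide's ranges."""
    # SystemRandom draws from the OS entropy source, which is harder to
    # predict than the default seeded generator.
    rng = rng or random.SystemRandom()
    length = rng.randint(5, 7)                                  # random length (5-7)
    return {
        "text": "".join(rng.choice(DIGITS) for _ in range(length)),
        "sizes": [rng.randint(34, 50) for _ in range(length)],      # random size (34-50)
        "angles": [rng.uniform(-25, 25) for _ in range(length)],    # rotation (-25/25)
    }


spec = sample_captcha_spec()
print(spec["text"], spec["sizes"])
```

Passing a seeded `random.Random` as `rng` makes the output reproducible for tests, while the default keeps production captchas unpredictable.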

End result

[Comparison of five captcha variants; the images did not survive extraction. Values are listed in the order they appear on the slide, and one variant is labeled "confusing".]

Accuracy       84.8%   89.2%   82.6%   97%     92.2%
Solving time   7.8s    4.9s    5.3s    4.9s    5.2s

How to Break Audio-Captcha

Audio Captchas

Creating Audio Captcha

[Diagram: a "Captcha Maker" combines voices and noises into a "super secure captcha"]

Noise intensity (RMS/SNR)
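The slide names RMS and SNR as the measures of noise intensity without defining them. A minimal sketch using the standard definitions, assuming the voice and the noise are available as separate lists of amplitude samples (the sample values below are made up):

```python
import math


def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from the two RMS amplitudes."""
    return 20 * math.log10(rms(signal) / rms(noise))


# Hypothetical samples: a voice ten times louder than the noise.
voice = [10, -10, 10, -10]
noise = [1, -1, 1, -1]
print(rms(voice), snr_db(voice, noise))
```

A lower SNR means the noise is louder relative to the voice, which is harder for a solver but also harder for human listeners.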

Table I. Commercial audio captcha feature description

Scheme               Authorize   Digg                eBay                Microsoft           Recaptcha           Yahoo
Length               5           5                   6                   10                  8                   7
Type of voice        Female      Female              Various             Various             Various             Child
Background noise     None        Constant (random)   Constant (random)   Constant (random)   Constant (random)   None
Intermediate noise   None        None                Regular (speech)    Regular (speech)    Regular (speech)    Regular (speech)
Charset              0-9a-z      a-z                 0-9                 0-9                 0-9                 0-9
Avg. duration        5.0         6.8                 4.4                 7.1                 25.3                18.0
Sample rate          8000        8000                8000                8000                8000                22050
Beep                 no          no                  no                  no                  no                  yes

[Figures: Figure 4. Authorize Captcha; Figure 5. Digg Captcha; Figure 6. Ebay Captcha; Figure 7. Microsoft Captcha; Figure 8. Recaptcha Captcha; Figure 9. Yahoo Captcha]

[...] parameter. These properties can make RLSC orders of magnitude faster to train than an SVM [?]. This efficiency is noticeable in Decaptcha; it takes 2 minutes (5 minutes) to train on thousands of captchas with a unidimensional (two-dimensional) representation, respectively.

IV. COMMERCIAL CAPTCHAS

This section describes the commercial captchas we used to validate Decaptcha as well as our testing methodology and results. We tested audio captchas from Authorize, Digg, eBay, Microsoft, Recaptcha, and Yahoo. We were unable to test Google's captchas because of difficulties we encountered obtaining reliable annotations; they are so difficult for humans that they are ineffective as captchas.

A. Corpus description

Authorize. Audio captchas on authorize.net consist of five letters or digits spoken aloud by a female voice. The voice clearly articulates each character and there is minimal distortion. The waveform and spectrogram presented in Figure 4 show a portion of a captcha containing the digits/letters K, J, 5 and H. A long pause appears between spoken characters and vowels are clearly articulated. The letters K and H, which are fricative consonants, show some harmonic patterns in the spectrogram while the letter J has almost no harmonic patterns.

Digg. Audio captchas on digg.com consist of five letters spoken aloud by a female voice. There is random white noise in the background and sometimes an empty, but louder, segment is played between letters. The waveform and spectrogram presented in Figure 5 show a portion of a captcha containing the letters J, A and K. The overall brown color of the spectrogram shows the heavy constant noise that obscures vowels but still maintains some characteristic patterns of the letters. These patterns cannot be completely masked by the white noise since they are necessary for human recognition. Interestingly, the spectrogram of the letter J shows patterns similar to those of the letter J in the Authorize captcha (see Figure 10).

eBay. Audio captchas on ebay.com consist of six digits spoken by a different speaker and in a different setting with regular background noise. The waveform and spectrogram presented in Figure 6 show part of a captcha containing the digits 9, 5, 7 and 6. The digits in these captchas are delivered much faster than those of authorize.net or digg.com. The waveform shows the variability of the various digits due to different speakers and different background noise levels, while the spectrogram shows that the vowels are short and relatively unobscured by noise.

Microsoft. Audio captchas from live.com consist of ten digits spoken by different speakers over a low quality recording. There is a regular background noise consisting of several simultaneous conversations. The waveform and spectrogram presented in Figure 7 show a portion of a captcha containing the digits 2, 9, 0 and 0. Like the eBay audio captchas, these digits are spoken very quickly. While all of the high amplitude sections of the waveform correspond to actual digits, the spectrogram shows that the vowels are somewhat obscured by background noise. Interestingly, the two 0 digits show very similar patterns, but this pattern is not easily distinguished from the pattern observed for the 9 digit.

Recaptcha. Audio captchas from recaptcha.net consist of eight digits spoken by different speakers. Distortions include background conversations and approximately two semantic vocal noises that look like digits in the waveform. Apart from the presence of semantic noise, Recaptcha captchas are similar to live.com captchas, but the digits are delivered much more slowly. The waveform and spectrogram presented in Figure 8 show a portion of a captcha containing the digits 1, 7, 3 and 5. As will be discussed in Figure 10, the digit five from this captcha shows harmonic patterns similar to the digit five from Authorize and eBay captchas.

Yahoo. Audio captchas from yahoo.com consist of three beeps followed by seven digits spoken by a child. The captcha is obscured with other children's voices in the background. The waveform and spectrogram presented in Figure 9 show a portion of a captcha containing the digits 1, 7 and 6. The digits are the largest amplitude sections in the waveform and the spectrogram shows that the background voices do not confuse the patterns of the digits. This spectrogram shows different patterns than the spectrograms of the other captchas because of the use of a child's voice. It seems that the patterns induced by a child are much clearer than the patterns of an adult's voice. This makes digits easier to recognize even though the noise in Yahoo's captchas has more energy than the noise in other captchas.

Comparison. Figure 10 illustrates some differences between commercial captcha schemes. The first line presents the TFR of the digit five from Authorize, eBay, and Recaptcha [...]

Sound representation

TCR

Cep

WAV DFT TDC

TFR

Solving an audio captcha

Dealing with random noise

- Statistical learning
- Supervised learning
- RLS (Regularized least squares) classifier

[Figure: representations of the digit 5 (Authorize, eBay, Recaptcha) and of the letter J (Authorize, Digg)]
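A regularized least squares classifier can be sketched in pure Python. This is a generic one-vs-rest linear RLS, not Decaptcha's implementation: it solves (X^T X + lambda * I) w = X^T y once per class with targets +1/-1 and predicts the class with the highest score. The feature vectors, the regularization value and the absence of a bias term are all simplifying assumptions.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def train_rls(xs, labels, lam=0.1):
    """Fit one weight vector per class; returns {label: weights}."""
    dim = len(xs[0])
    # X^T X + lam * I, shared by every class.
    gram = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0)
             for j in range(dim)] for i in range(dim)]
    weights = {}
    for cls in sorted(set(labels)):
        target = [1.0 if label == cls else -1.0 for label in labels]
        rhs = [sum(x[i] * t for x, t in zip(xs, target)) for i in range(dim)]
        weights[cls] = solve(gram, rhs)
    return weights


def predict(weights, x):
    """Return the class whose weight vector scores x highest."""
    return max(weights, key=lambda cls: sum(w * v for w, v in zip(weights[cls], x)))


# Hypothetical 2-dimensional features for two spoken characters.
XS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
LABELS = ["5", "5", "J", "J"]
model = train_rls(XS, LABELS)
print(predict(model, [1.0, 0.05]), predict(model, [0.0, 1.0]))
```

Training reduces to solving one small linear system per class, with the same matrix reused for all classes, which is why this family of classifiers is cheap to train.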

Semantic noise

Results

Scheme      Length   Coverage (%)   Digit (%)   Captcha (%)
Authorize   5        100            97          89.2
Digg        5        100            76          41.4
eBay        6        85.6           92.5        82.9
Microsoft   10       80.6           89.6        48.9
Recaptcha   8        99.9           40.5        1.5
Yahoo       7        99.1           74.7        45.4

Recaptcha semantic noise

[Figure: signal level (dB, 0 to -70) vs. time (axis 0 to 200, labeled "Time in seconds"), with segments labeled 3 7 9 4 2 0 1 N 5 N]

How many captchas do you need?

[Figure: per-captcha precision (%) vs. corpus size in digits (10^2 to 10^4) for Authorize, Digg, Ebay, MSLive, Recaptcha and Yahoo]

Video captcha

- Interesting direction -> more design space
- Good for human
- Good for computer :(
- Working on it

See blog post for more information: http://elie.im/blog

Apply

- Within 3 months
  - Make sure you have a strong captcha scheme (use mine if you want)
  - Ensure that your site is accessible
- Within 6 months
  - Log your captcha failure rate and monitor it
  - Have a backup captcha scheme in case your scheme is broken
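Logging and monitoring the failure rate can start as small as a sliding window over recent attempts. The sketch below is one possible shape for it, not a recommendation from the talk: the class name, the window size, the baseline and the tolerance are all hypothetical and would need tuning on real traffic.

```python
from collections import deque


class FailureRateMonitor:
    """Track the captcha failure rate over the most recent attempts."""

    def __init__(self, window=1000):
        # deque(maxlen=...) silently drops the oldest attempt when full.
        self.results = deque(maxlen=window)

    def record(self, solved):
        """Log one attempt; `solved` is True when the answer was correct."""
        self.results.append(bool(solved))

    def failure_rate(self):
        """Fraction of failed attempts in the window, or None if empty."""
        if not self.results:
            return None
        return 1 - sum(self.results) / len(self.results)

    def deviates(self, baseline, tolerance=0.10):
        """True when the current rate is further than `tolerance` from baseline."""
        rate = self.failure_rate()
        return rate is not None and abs(rate - baseline) > tolerance


monitor = FailureRateMonitor(window=4)
for solved in (False, False, True, True, True):
    monitor.record(solved)
print(monitor.failure_rate(), monitor.deviates(baseline=0.15))
```

A sudden move in either direction is worth a look: a jump may mean bots are hammering the form, a drop may mean a solver has started to succeed, and either is a reason to have the backup scheme ready.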

Thank you! Questions?

Follow me: @elie

Captcha research: http://elie.im/tag/captcha