The art of breaking and designing captchas
Elie Bursztein
Session ID: HT02-402
Elie Bursztein (@elie) https://elie.net

World's Most Popular Captchas
[Reddit] [CNN] [Megaupload] [eBay]
[Baidu] [Recaptcha]
[Authorize] [Captcha.net] [Skyrock]
[Digg] [NIH] [Google]
[Slashdot] [Wikipedia] [Blizzard]

Captcha Design Goal
[Figure: chart with axes "Hard for computer" and "Hard for human"; points labeled "AI ?" and "Human"; region labeled "sweet spot"]

Focus of this talk
How to break and design CAPTCHAs
Based on breaking 21 of the most popular schemes and designing the new Wikipedia captcha

Outline
- How to break text captchas
- How to make captchas easier for humans
- How to break audio captchas
- How to break video captchas

Evaluation metrics
- Accuracy
- Solving time
- Learnability

How to Break Text Captchas

Think Lego

How to break a captcha: example
1. Pre-processing: background removal
2. Pre-processing: captcha binarization
3. Pre-processing: line detection
4. Pre-processing: line removal
5. Segmentation: clustering algorithm
6. Segmentation: cluster separation
7. Post-segmentation: inverting rotation
8. Recognition: 3 7 1 3
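
The binarization and segmentation steps listed above can be illustrated with a minimal sketch. The slides do not say which clustering algorithm is used, so this example swaps in a plain 4-connected flood fill as a stand-in; the threshold of 128 and the tiny sample image are assumptions made for illustration.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Map a grayscale matrix (0-255) to 1 for ink (dark) and 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def segment(binary):
    """Group ink pixels into 4-connected clusters, returned left to right."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    clusters = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not seen[y][x]:
                queue, pixels = deque([(y, x)]), []
                seen[y][x] = True
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                clusters.append(pixels)
    # Order clusters by their leftmost column so they read like the captcha.
    clusters.sort(key=lambda px: min(x for _, x in px))
    return clusters

# Hypothetical 3x6 grayscale image: a vertical bar and a bracket-like shape.
gray = [
    [255, 20, 255, 255, 30, 30],
    [255, 20, 255, 255, 255, 30],
    [255, 20, 255, 255, 30, 30],
]
binary = binarize(gray)
clusters = segment(binary)
print(len(clusters), [len(c) for c in clusters])
```

Real captchas need the line-removal and cluster-separation steps as well, since touching characters end up in a single connected cluster.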

Breaker 5-Stage Pipeline
Example: Slashdot captcha
1. Preprocessing
2. Segmentation
3. Post-segmentation
4. Recognition: f a e t e s t
5. Post-recognition: f a s t e s t
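
The post-recognition stage turns the raw guess "faetest" into "fastest". The slides do not state how the correction is done; one simple possibility, sketched here as an assumption, is to pick the dictionary word of the same length that differs from the guess in the fewest positions. The word list is hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(ca != cb for ca, cb in zip(a, b))

def correct(guess, vocabulary):
    """Return the same-length vocabulary word closest to the raw guess."""
    candidates = [w for w in vocabulary if len(w) == len(guess)]
    if not candidates:
        return guess  # nothing comparable: keep the recognizer's output
    return min(candidates, key=lambda w: hamming(guess, w))

VOCABULARY = ["fastest", "slowest", "captcha", "breaker"]  # hypothetical word list
print(correct("faetest", VOCABULARY))  # fastest
```

This only helps for schemes that draw their answers from real words; it does nothing for random strings.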

From the image to the matrix representation
[Figure: a character image converted to its matrix representation]

From the matrix representation to the vector representation
[Figure: the matrix lines L1, L2, L3, L4, L5, L6 laid end to end to form one vector]
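
As a minimal sketch of the step pictured above, the matrix lines can be concatenated in order into a single flat vector. The 3x3 glyph is a made-up example; real segments would be larger.

```python
def to_vector(matrix):
    """Concatenate the matrix lines L1, L2, ... into one flat vector."""
    return [value for line in matrix for value in line]

# Hypothetical 3x3 binary glyph (1 = ink, 0 = background).
glyph = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
vector = to_vector(glyph)
print(vector)  # [0, 1, 0, 1, 1, 1, 0, 1, 0]
```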

From the vector representation to the segment value (classification)
Known vector   Distance to the unknown vector
A              42
A              40
B              32
B              70
C              12
C              18
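
The table above amounts to a nearest-neighbor lookup: the unknown vector takes the label of the closest known vector. The sketch below assumes Euclidean distance, which the slide does not specify, and uses hypothetical 9-element vectors.

```python
import math

def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(unknown, known):
    """Return the label of the known vector closest to `unknown` (1-nearest neighbor)."""
    label, _ = min(known, key=lambda item: distance(unknown, item[1]))
    return label

# Hypothetical labeled training vectors.
known = [
    ("A", [0, 1, 0, 1, 1, 1, 1, 0, 1]),
    ("B", [1, 1, 0, 1, 1, 1, 1, 1, 1]),
    ("C", [1, 1, 1, 1, 0, 0, 1, 1, 1]),
]
unknown = [1, 1, 1, 1, 0, 0, 1, 1, 0]
print(classify(unknown, known))  # C

# Same rule applied to the distances listed on the slide: the minimum (12) belongs to a C.
slide_distances = [("A", 42), ("A", 40), ("B", 32), ("B", 70), ("C", 12), ("C", 18)]
print(min(slide_distances, key=lambda pair: pair[1])[0])  # C
```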

Breaker efficiency
Solver accuracy = Coverage * Precision^length
- Coverage: segmentation rate
- Precision: recognition rate
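
A short worked example of the formula above shows how quickly per-character errors compound with captcha length. The coverage and precision values are hypothetical, not measurements from the talk.

```python
def solver_accuracy(coverage, precision, length):
    """Solver accuracy = coverage * precision ** length."""
    return coverage * precision ** length

# Hypothetical breaker: 90% of captchas segmented, 95% per-character recognition.
for length in (4, 6, 8):
    print(length, round(solver_accuracy(0.90, 0.95, length), 3))
```

Because precision is raised to the power of the length, longer captchas lower the solver's accuracy even when per-character recognition stays the same.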

Anti-recognition techniques
- Blurring
- Distortion
- Rotation
- Fonts
- Charsets (e.g. 0123456789)

SVM learning rate
[Figure: % success (0% to 100%) vs. training set size (10, 20, 50, 100, 200, 500); series: 09, AZ09, azAZ09, Distortion, 3 fonts, 5 fonts, Angles]

KNN learning rate
[Figure: % success (0% to 100%) vs. training set size (10, 20, 50, 100, 200, 500); series: 09, AZ09, azAZ09, Distortion, 3 fonts, 5 fonts, Angles]

Anti-recognition taxonomy
- Background confusion
- Lines
- Collapsing

Worked examples (image-only slides)
- Breaking World of Warcraft
- Breaking Captcha.net
- Breaking Wikipedia
- Breaking Digg
- Breaking Slashdot
- Breaking eBay
- Failing to break eBay
- Breaking Baidu

Overall results
Scheme        Segmentation rate   Solving rate
Authorize     84%                 66%
Baidu         98%                 5%
Blizzard      75%                 70%
Captcha.net   96%                 73%
CNN           50%                 16%
Digg          86%                 20%
eBay          95%                 43%
Google        0%                  0%
MegaUpload    n/a                 93%
NIH           87%                 72%
Recaptcha     0%                  0%
Reddit        71%                 42%
Skyrock       30%                 2%
Slashdot      52%                 35%
Wikipedia     57%                 25%

Learning rate for real schemes
[Figure: % success (0% to 90%) vs. training set size (10, 20, 50, 100, 200, 500); one curve per scheme: Authorize, Baidu, Blizzard, Captcha.net, CNN, Digg, eBay, Megaupload, NIH, Reddit, Skyrock, Slashdot, Wikipedia]

Decaptcha main interface
[Figure: screenshot of the Decaptcha main interface]

Apply design principles
- Core design principles
  - Randomize length
  - Randomize character size
  - Wave the captcha
- Use anti-recognition as a means of strengthening captcha security
- Don't use a complex charset
  - Bad for humans (see our research on this)
  - Useless for security
- Use collapsing or lines

Designing Better Captchas

Think Lego again
- Decompose into features
- Analyze
  - features in isolation
  - feature interactions

Evaluation system
[Figure: architecture diagram; components: Amazon Mechanical Turk, payment validation, monitoring system, test system, web frontend, captcha image generator, tasks generator; flows labeled: tasks, feedback, results]

Experiment details
Round  Task                    N possible    N sampled  N tests per sample  Total tests
1      Baseline ("Control")    1             1          1000                1000
2      Real world captchas     8             8          1000                8000
3      Features in isolation   496           496        200                 99200
4      2 feature interactions  60950         60950      5                   304750
5      3 feature interactions  1 303 224     25000      10                  250000
6      4 feature interactions  113 951 684   25000      10                  250000
Total                                                                       912150
Table 2. Overview of the experimentation rounds.
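
As a quick arithmetic check of Table 2, each round's total is the number of sampled captchas times the number of tests per sample. The sketch below recomputes the per-round totals from the values as they appear here. Summing the six rows gives 912950, while the printed total reads 912150, so one of the extracted figures may be a typo.

```python
# (round, N sampled, N tests per sample) as listed in Table 2
rounds = [
    (1, 1, 1000),
    (2, 8, 1000),
    (3, 496, 200),
    (4, 60950, 5),
    (5, 25000, 10),
    (6, 25000, 10),
]
totals = {r: sampled * tests for r, sampled, tests in rounds}
grand_total = sum(totals.values())
print(totals)
print(grand_total)
```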

Then we considered tasks where the values of 2, 3 or 4 different features were changed. We chose a limit of 4 features for two reasons. First, we observed that captchas in the wild rarely exhibit more than one or two anti-segmentation features and one or two anti-recognition features [4]. Thus observing captchas where 2-4 features are varying is more representative of real world captchas. The other reason for limiting ourselves to 4 interacting features is the combinatorial explosion of possible feature combinations.

Because of the large number of possible combinations for 2, 3 and 4 feature captchas, we could only sample a subset of all captchas for these tasks. We first examined subject performance in the individual feature tasks and removed feature dimension values as follows. For features where performance was fairly constant across the different values or exhibited a roughly linear progression, we removed every second value, leaving only 50% of the original values. For fonts, we selected one font for each 10% bucket of accuracy, e.g. one font with accuracy around 10%, one font with accuracy around 20%, etc. (See the discussion below for the choice of 10%.) We removed all of the similar foreground/background color anti-segmentation features because the color confusion defense, while popular among captcha builders is known to be insecure because computers are much better at distinguishing between two colors than humans. Therefore an attacker can exploit this difference of color between the background and the letters to successfully clean the background. That is why highly used captchas such as Google, Recaptcha and eBay use a uniform background. Thus it is less interesting from the perspective of implications for future captcha design. Table 2 shows the size of the resulting reduced feature spaces.

There is a tradeoff between the number of points in the space of feature combinations that we can sample, and the number of samples we take for each point. For example, to get a good estimate of how accurate humans are on captchas with 1000 random dots and 2 straight lines, we would ideally have 100 or even 1000 examples of this scenario. However, given that there are 60,950 possible two feature combinations, it is unrealistic to take this many samples for each such point. Thus we compromise at taking 10 samples of each point - this gives us a rough estimate of accuracy (10%, 20%, 30%, etc.) while still allowing us to sample a reasonable number of points in the feature combination space. Thus, we randomly selected 25000 captchas for each of the 3 and 4 feature interaction groups, and had each captcha annotated 10 times. As a result we had 250 000 captcha tested for these 2 groups.

[Figure 5. Success, failure and timeout rates vs. number of features; y-axis 0 to 100, x-axis 0 to 4; series: % Completed, % Failed, % Timeout]

EXPERIMENTAL RESULTS
We evaluated subject performance on captchas in two ways: solving accuracy and solving time. For solving accuracy, we compared the answers given by subjects to the text from which we generated the captcha, ignoring differences in case or spacing. Solving time was measured by recording the time between when the subject was presented with a captcha and when they submitted their response. Note that measuring times on AMT can result in high variance because many Turkers do other things on their computer at the same time even if we explicitly asked them to do it as fast as possible. Still, by averaging over large numbers of subjects, we can get an idea of which captchas take more or less time.

One of the most striking observations about user behavior is that even though they were paid, when the captchas became too hard the users quit or typed in garbage. This is clearly visible on the overall user statistics displayed in Figure 5. As one can see, the number of users that quit or gave up increased as the captchas became harder and harder. The number of users that gave up seems to increase linearly with the number of features we added to the captchas. The solving time also increased as it took the users more time to understand the captchas.

Individual Features Results
Character sets. Table 3 shows how subjects performed on different character sets. Pseudo-words, words, and simple character sets like all digits, all lowercase letters and all uppercase letters were the easiest, with accuracies of 97% or [...]

Some of the features tested
- Blurring
- Text color
- Font
- Background color
- Collapsing
- Tilting
- Waving
- Distortion
- Line angle
- Line shape
- Number of lines
- Line coverage
- Line position
- Line size
- Noise

Angle of rotation
[Figure: solving time (s, 4 to 14) and accuracy (0 to 1.0) vs. rotation angle (0° to 350°)]

Collapsing
[Figure: solving time (s, 4 to 14) and accuracy (0 to 1.0) vs. character gap width (4 down to -14)]

Character size
[Figure: solving time (s, 4 to 14) and accuracy (0 to 1.0) vs. character size (0 to 20)]

Resolution invariant
[Figure: accuracy (50 to 100) vs. captcha length in number of characters (5 to 30); series: <= 1024, > 1024, all captchas]

2D interactions
[Figure: image-only slide]

Length vs Angle interaction
[Figure: image-only slide]

Perception Does Not Match Number
[Figure: %fast, %easy and %like (0 to 35) per charset: az, 09, AZ, az09, AZ09, azAZ, azAZ09; word examples: pretty (HF+), cutest (LF+), guilty (HF-), molest (LF-)]

The New Wikipedia
- Use digits
- Wave the captcha
- Use random length (5-7)
- Use random size (34-50)
- Rotate letters (-25/25)
- Add a line for a super secure version
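
The parameter ranges listed above can be sketched as a small sampler. This is an illustration only, not the actual Wikipedia generator: the `Glyph` type and `sample_captcha` function are hypothetical names, the size and rotation are assumed to be drawn independently per character, and rendering, waving and the optional line are left out.

```python
import random
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    size: int      # character size, 34-50 on the slide (unit not stated)
    rotation: int  # rotation in degrees, -25 to 25 on the slide

def sample_captcha(rng):
    """Draw the per-captcha parameters listed on the slide."""
    length = rng.randint(5, 7)               # random length (5-7)
    return [
        Glyph(
            char=rng.choice("0123456789"),   # digits only
            size=rng.randint(34, 50),        # random size (34-50)
            rotation=rng.randint(-25, 25),   # rotation (-25/25)
        )
        for _ in range(length)
    ]

captcha = sample_captcha(random.Random(0))
print("".join(g.char for g in captcha))
```

A production generator should draw from a cryptographically secure source such as `random.SystemRandom` rather than a seeded generator; the fixed seed here only makes the example repeatable.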

End result
[Figure: five captcha variants compared, one of them labeled "confusing"; values below are listed in the order they appear on the slide]
Accuracy:      84.8%   89.2%   82.6%   97%    92.2%
Solving time:  7.8s    4.9s    5.3s    4.9s   5.2s

How to Break Audio Captchas

Audio Captchas
[Figure: image-only slide]

Creating Audio Captcha
[Figure: a "Captcha Maker" combines "Voices" and "Noises" into a "Super secure captcha"]

Noise intensity (RMS/SNR)
[Figure: waveforms and spectrograms of audio captchas, including Authorize (K J 5 H), Digg (J A K) and Microsoft (2 9 0 0)]

Scheme              Authorize  Digg               eBay               Microsoft          Recaptcha          Yahoo
Length              5          5                  6                  10                 8                  7
Type of voice       Female     Female             Various            Various            Various            Child
Background noise    None       Constant (random)  Constant (random)  Constant (random)  Constant (random)  None
Intermediate noise  None       None               Regular (speech)   Regular (speech)   Regular (speech)   Regular (speech)
Charset             0-9a-z     a-z                0-9                0-9                0-9                0-9
Avg. duration       5.0        6.8                4.4                7.1                25.3               18.0
Sample rate         8000       8000               8000               8000               8000               22050
Beep                no         no                 no                 no                 no                 yes
Table I. COMMERCIAL AUDIO CAPTCHA FEATURE DESCRIPTION
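
The slide title refers to noise intensity measured as RMS and SNR without giving formulas. A minimal sketch using the standard definitions (root mean square amplitude, and signal-to-noise ratio in decibels as 20·log10 of the RMS ratio) could look like this; the sample values are hypothetical.

```python
import math

def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels, from the RMS of each part."""
    return 20 * math.log10(rms(signal) / rms(noise))

voice = [0.5, -0.5, 0.5, -0.5]      # hypothetical speech samples
noise = [0.05, -0.05, 0.05, -0.05]  # hypothetical background noise samples
print(rms(voice), snr_db(voice, noise))
```

In practice the speech and noise portions would first have to be separated, for example by measuring the noise during the pauses between characters.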
[Figures: Figure 4. Authorize Captcha; Figure 5. Digg Captcha; Figure 6. Ebay Captcha; Figure 7. Microsoft Captcha; Figure 8. Recaptcha Captcha; Figure 9. Yahoo Captcha]

[...] parameter. These properties can make RLSC orders of magnitude faster to train than an SVM [?]. This efficiency is noticeable in Decaptcha; it takes 2 minutes (5 minutes) to train on thousands of captchas with a unidimensional (two-dimensional) representation, respectively.

IV. COMMERCIAL CAPTCHAS
This section describes the commercial captchas we used to validate Decaptcha as well as our testing methodology and results. We tested audio captchas from Authorize, Digg, eBay, Microsoft, Recaptcha, and Yahoo. We were unable to test Google's captchas because of difficulties we encountered obtaining reliable annotations; they are so difficult for humans that they are ineffective as captchas.

A. Corpus description

Authorize. Audio captchas on authorize.net consist of five letters or digits spoken aloud by a female voice. The voice clearly articulates each character and there is minimal distortion. The waveform and spectrogram presented in Figure 4 show a portion of a captcha containing the digits/letters K, J, 5 and H. A long pause appears between spoken characters and vowels are clearly articulated. The letters K and H, which are fricative consonants, show some harmonic patterns in the spectrogram while the letter J has almost no harmonic patterns.

Digg. Audio captchas on digg.com consist of five letters spoken aloud by a female voice. There is random white noise in the background and sometimes an empty, but louder, segment is played between letters. The waveform and spectrogram presented in Figure 5 show a portion of a captcha containing the letters J, A and K. The overall brown color of the spectrogram shows the heavy constant noise that obscures vowels but still maintains some characteristic patterns of the letters. These patterns cannot be completely masked by the white noise since they are necessary for human recognition. Interestingly, the spectrogram of the letter J shows patterns similar to those of the letter J in the Authorize captcha (see Figure 10).

eBay. Audio captchas on ebay.com consist of six digits spoken by a different speaker and in a different setting with regular background noise. The waveform and spectrogram presented in Figure 6 show part of a captcha containing the digits 9, 5, 7 and 6. The digits in these captchas are delivered much faster than those of authorize.net or digg.com. The waveform shows the variability of the various digits due to different speakers and different background noise levels, while the spectrogram shows that the vowels are short and relatively unobscured by noise.

Microsoft. Audio captchas from live.com consist of ten digits spoken by different speakers over a low quality recording. There is a regular background noise consisting of several simultaneous conversations. The waveform and spectrogram presented in Figure 7 show a portion of a captcha containing the digits 2, 9, 0 and 0. Like the eBay audio captchas, these digits are spoken very quickly. While all of the high amplitude sections of the waveform correspond to actual digits, the spectrogram shows that the vowels are somewhat obscured by background noise. Interestingly, the two 0 digits show very similar patterns, but this pattern is not easily distinguished from the pattern observed for the 9 digit.

Recaptcha. Audio captchas from recaptcha.net consist of eight digits spoken by different speakers. Distortions include background conversations and approximately two semantic vocal noises that look like digits in the waveform. Apart from the presence of semantic noise, Recaptcha captchas are similar to live.com captchas, but the digits are delivered much more slowly. The waveform and spectrogram presented in Figure 8 show a portion of a captcha containing the digits 1, 7, 3 and 5. As will be discussed with Figure 10, the digit five from this captcha shows harmonic patterns similar to those of the digit five from Authorize and eBay captchas.

Yahoo. Audio captchas from yahoo.com consist of three beeps followed by seven digits spoken by a child. The captcha is obscured with other children's voices in the background. The waveform and spectrogram presented in Figure 9 show a portion of a captcha containing the digits 1, 7 and 6. The digits are the largest amplitude sections in the waveform and the spectrogram shows that the background voices do not confuse the patterns of the digits. This spectrogram shows different patterns than the spectrograms of the other captchas because of the use of a child's voice. It seems that the patterns induced by a child are much clearer than the patterns of an adult's voice. This makes digits easier to recognize even though the noise in Yahoo's captchas has more audio energy than the noise in other captchas.

Comparison. Figure 10 illustrates some differences between commercial captcha schemes. The first line presents the TFR of the digit five from Authorize, eBay, and Recaptcha [...]

Sound representation
[Figure: diagram relating the sound representations WAV, DFT, Cep, TFR, TCR and TDC]
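
The diagram above includes a step from the raw waveform (WAV) to its discrete Fourier transform (DFT). As a minimal sketch of that one step, the following naive DFT recovers the dominant frequency bin of a synthetic tone. The other representations named in the diagram are not covered, and the 64-sample tone is a made-up input.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| for each bin k."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n) for t, s in enumerate(samples)))
        for k in range(n)
    ]

n = 64
tone = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]  # 5 cycles per window
spectrum = dft_magnitudes(tone)
# Only the first half of the bins is inspected: the second half mirrors it for real input.
peak_bin = max(range(n // 2), key=lambda k: spectrum[k])
print(peak_bin)  # 5
```

A real solver would use a fast Fourier transform over short overlapping windows rather than this quadratic version.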

Solving an audio captcha
[Figure: step-by-step animation of solving an audio captcha; the labels are not recoverable from the extraction]

Dealing with random noise
- Statistical learning
- Supervised learning
- RLS (Regularized Least Squares) classifier
[Figure: the digit 5 from Authorize, eBay and Recaptcha captchas, and the letter J from Authorize and Digg captchas]
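
To make the RLS classifier bullet concrete, here is a minimal two-class sketch: it solves the regularized normal equations (XᵀX + λI)w = Xᵀy and classifies by the sign of the linear score. This is a generic textbook formulation, not Decaptcha's implementation; the feature values, labels and λ = 0.1 are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def rls_train(xs, ys, lam=0.1):
    """Weights w minimizing sum (w.x - y)^2 + lam * |w|^2, i.e. (X^T X + lam I) w = X^T y."""
    d = len(xs[0])
    gram = [[sum(x[i] * x[j] for x in xs) + (lam if i == j else 0.0) for j in range(d)]
            for i in range(d)]
    rhs = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(d)]
    return solve(gram, rhs)

def rls_predict(w, x):
    """Sign of the linear score: +1 or -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Hypothetical 2-feature examples plus a constant bias feature; +1 = "digit", -1 = "noise".
xs = [[0.9, 0.8, 1.0], [0.8, 0.9, 1.0], [0.1, 0.2, 1.0], [0.2, 0.1, 1.0]]
ys = [1, 1, -1, -1]
w = rls_train(xs, ys)
print([rls_predict(w, x) for x in xs])
```

Training reduces to one linear solve, with no iterative optimization, which fits the excerpt's remark that RLSC can be much faster to train than an SVM.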

Semantic noise
[Figure: image-only slide]

Results
Scheme      Length  Coverage  Digit  Captcha
Authorize   5       100       97     89.2%
Digg        5       100       76     41.4%
eBay        6       85.6      92.5   82.9%
Microsoft   10      80.6      89.6   48.9%
Recaptcha   8       99.9      40.5   1.5%
Yahoo       7       99.1      74.7   45.4%

Recaptcha semantic noise
[Figure: signal level (dB, 0 to -70) vs. time in seconds (axis ticks 0 to 200); segments labeled 3 7 9 4 2 0 1 N 5 N, where N marks semantic noise]

How many captchas do you need?
[Figure: per-captcha precision (%, 0 to 100) vs. corpus size in digits (10^2 to 10^4); series: Authorize, Digg, Ebay, MSLive, Recaptcha, Yahoo]

Video captcha
- Interesting direction -> more design space
- Good for human
- Good for computer :(
- Working on it
See blog post for more information: http://elie.im/blog

Apply
- Within 3 months
  - Make sure you have a strong captcha scheme (use mine if you want)
  - Ensure that your site is accessible
- Within 6 months
  - Log your captcha failure rate and monitor it
  - Have a backup captcha scheme in case your scheme is broken
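
The advice to log and monitor the failure rate can be sketched as a sliding-window counter that flags a drift from an expected baseline. The class name, the baseline, the tolerance and the window size are all hypothetical; the talk does not prescribe thresholds.

```python
from collections import deque

class FailureRateMonitor:
    """Track the captcha failure rate over the most recent attempts."""

    def __init__(self, baseline, tolerance, window=1000):
        self.baseline = baseline               # expected failure rate (assumed value)
        self.tolerance = tolerance             # allowed deviation before alerting
        self.outcomes = deque(maxlen=window)   # True = failed attempt

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drifted(self):
        """True when the recent failure rate is far from the baseline in either direction."""
        rate = self.failure_rate()
        return rate is not None and abs(rate - self.baseline) > self.tolerance

monitor = FailureRateMonitor(baseline=0.15, tolerance=0.10, window=10)
for failed in [True] + [False] * 9:        # 1 failure in 10 attempts
    monitor.record(failed)
normal = monitor.drifted()
for failed in [True] * 6 + [False] * 4:    # 6 failures in the next 10 attempts
    monitor.record(failed)
print(normal, monitor.drifted())
```

A sustained drift in either direction is worth investigating, and a confirmed break is the moment to switch to the backup scheme mentioned above.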

Thank you
Questions?

Follow me
- Twitter: @elie
- Captcha research: http://elie.im/tag/captcha