Analyzing Password Strength
Martin M.A. Devillers
July 2010
ABSTRACT

Password authentication is still the most used authentication mechanism in today’s computer systems. In most systems, the password is set by the user and must adhere to certain password requirements. Additionally, password checkers rank the strength of a password to give the user an indication of how secure their password is. In this paper, we take a look at a large database of user-chosen passwords to determine the current state of affairs. In the end, we extract a model from the database and provide our own password checker, which ranks passwords in various ways. We ran this checker against our dataset, which shows that over 90% of the passwords are highly insecure.

INTRODUCTION

As we perform increasingly important tasks from our living room computer, the topic of computer security also becomes increasingly important. And indeed, many advances have been made in the field of computer security to protect home users from digital crime. For instance, the wireless transmission protocol, Wi-Fi, has had several (much overdue) security overhauls to protect home users from being eavesdropped on or worse (1). However, when we look at what we typically use to authenticate ourselves with, we are stuck with a system that dates back to the Roman empire (2): passwords.

Although an ancient concept, password authentication is, and most likely will be for a long time, the most used authentication mechanism for computer users. Password authentication requires no specialized hardware (unlike, for instance, fingerprint authentication), can be easily implemented by developers and is just as easily used by users. In short: it’s usable.

But is it safe? The topic of “Security versus Usability” has always been the subject of much debate in the computer security world (3). Users must be protected from harm by security controls, but these controls may not interfere (much) with the tasks the users want to perform. For instance, a firewall that simply blocks all traffic can be considered secure, but heavily impedes the overall usability of the system.

Security experts or system developers are usually the ones who have to make this tradeoff, but with password authentication, this task is essentially passed on to the user (3): one can either choose a short and simple password, which is easy to remember but also easy to crack, or a very long and complicated password, which is hard to remember but also hard to crack.

Unfortunately, most users do not see a tradeoff: they see an obstacle and they will choose the path of least resistance to overcome that obstacle. Thus, short and simple passwords that are easy to remember, but also easy to guess, are used (4).

To stop users from using weak passwords, most systems enforce certain requirements that a password must meet before it gets set. Examples of common requirements are a minimal length of the password, the occurrence of uppercase letters, digits and/or symbols in the password, and inequality with the user’s username or e-mail address (5).
However, holding on to the principle of the path of least resistance, one can expect users to try and ‘circumvent’ these requirements in a predictable manner (4). For instance, given a user who starts out with an actual word such as ‘house’ and a requirement that the password must contain at least one digit, one might expect the user to simply suffix the word with a single digit. In the findings which we will present, you will see that 15% of the passwords were a word or name suffixed with the number one.

In this paper, we will first discuss the dataset which we used to perform our research. After this, we will show you the results of our preliminary tests on this dataset. In the next chapter, we will go one step further and extract actual patterns from the passwords. After this, we will show you how we used a probabilistic approach to password analysis. In the final chapter, we will present our password checker, which combines the results of all the previous chapters.

DATASET

On December 4th 2009, a hacker breached a company database of RockYou!¹ containing the usernames and unencrypted passwords of about 32 million users (3). This database was subsequently published on the internet and is now in wide circulation. Obviously, we don’t condone hacking, but the presence of this database gives us a unique opportunity to perform a large-scale empirical study on passwords.

¹ RockYou! (originally known as RockMySpace), based in Redwood City, California, is a publisher and developer of applications and other social network services. As of December 2007, it is the most successful widget maker for the Facebook platform in terms of total installations.

The most notable fact about the aforementioned affair is not that a large database was hacked, but that the passwords inside the database were stored in unencrypted form (so-called plaintext passwords). These days, it is common practice to salt and hash passwords before permanently storing them, which makes it generally hard to study passwords even when granted access to the right databases.

The RockYou! database was acquired in the form of a long text file where each password resides on its own line. The file contained no other information, such as the usernames. Since the source of the data was questionable at best, we ran various tests and filters to ensure the quality of the data. Various noise factors were discovered:

1. Analysis showed that entries longer than 30 characters were more often noise than actual passwords. A lot of these entries were pieces of HTML or JavaScript, which might be indicative of an injection attack or some kind of input anomaly. Others contain a single character repeated up to a hundred times. The most likely explanation for this anomaly is users who want to sign up quickly and fill in bogus information to get a one-time account. Such entries cannot be considered to be actual passwords, since the user does not have the intention of memorizing or re-using the password. Since these entries are much longer than normal entries, a filter was applied that removes any entry longer than 30 characters.

2. At various points in the database, the character encoding switches. For instance, the first 3 million passwords contain Unicode characters in their HTML-encoded form. Thereafter, Unicode characters are stored in their actual form, which might be indicative of the backend switching to UTF-8 encoding support. Similarly, the file itself is partly encoded in ANSI and partly in UTF-8. Character analysis indicated where these switches occurred, and a program was written to rewrite the database in one uniform encoding scheme (UTF-8).

3. Various (uncommon) passwords occur multiple times in close proximity to each other. This might be indicative of a single user registering multiple accounts. A filter was applied that removes these extra occurrences. Common passwords were exempt from this filter, as they may naturally occur multiple times in close proximity.

PRELIMINARY TESTS

Before we started with the actual analysis, various basic tests were run to gain some insight into the database. These include a letter frequency analysis, a character type analysis, a length distribution analysis and a common password analysis.

Letter frequency analysis helps us in various ways. Knowing the frequency of each letter gives us the ability to define a more fine-grained metric for measuring password strength, by grading the chance of occurrence of each individual character rather than giving each letter a flat one-in-twenty-six chance of occurrence.

Radboud University Nijmegen 2 Martin M. A. Devillers

[Figure 1: bar chart of the letter frequency (percentage, 0% to 16%) for each letter a through z, with the three series RockYou, English and Spanish]
Figure 1 – Letter frequencies of the RockYou! dataset, English and Spanish language

Furthermore, letter frequency analysis allows us to measure the degree to which the passwords conform to actual words. This helps us to better understand whether the dataset follows a language and what language this might be (6).

Figure 1 shows the letter frequencies of the dataset and those of the English and Spanish languages. As you can see, the distribution of the RockYou! dataset shows great similarity to those of the English and Spanish languages. When we do a quick search for English and Spanish words in the dataset, we find that 10% of the entries match English words and 2.5% match Spanish words.

Character type analysis looks at every password and flags what kinds of characters make up the password. We distinguish between the following character types:

  Lowercase, which are the standard lowercase letters of the alphabet.
  Uppercase, which are the standard uppercase letters of the alphabet.
  Digits, which are the digits zero through nine.
  Symbols, which are any characters found in the non-extended ASCII set that do not belong to the above categories. Most of these can be found on a standard American keyboard.
  Unicode, which are any characters that do not belong to the above categories. Examples of these are the euro sign and the Japanese alphabet.

Given these categories, $2^5 = 32$ combinations are possible. However, we expect that only a few will be really prominent.

[Figure 2: pie chart with the segments Lower Case 42%, Lower Case + Digit 33%, Digit 16% and Other 9%]
Figure 2 – Character type analysis

Figure 2 shows the results of our character type analysis. The largest category, which makes up nearly half the dataset, consists of passwords made up solely of lowercase characters. This is troublesome, considering that most passwords are only 6 to 8 characters long and the letter frequencies found in the passwords conform to those of a language.
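The five character types above can be sketched as a small classifier. This is a minimal Python illustration, not the tooling used for the paper: the names `char_type`, `password_types` and `type_distribution` are hypothetical, and treating every remaining ASCII character (including the space and control characters) as a symbol is a simplifying assumption.

```python
from collections import Counter


def char_type(ch: str) -> str:
    """Map a single character to one of the five character types."""
    if "a" <= ch <= "z":
        return "lowercase"
    if "A" <= ch <= "Z":
        return "uppercase"
    if "0" <= ch <= "9":
        return "digit"
    if ord(ch) < 128:
        # Assumption: any other non-extended ASCII character counts as a symbol.
        return "symbol"
    return "unicode"


def password_types(password: str) -> frozenset:
    """Return the set of character types that occur in a password."""
    return frozenset(char_type(ch) for ch in password)


def type_distribution(passwords) -> dict:
    """Return the share of passwords per combination of character types."""
    counts = Counter(password_types(p) for p in passwords)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}


# Toy sample (illustrative only, not the RockYou! data).
sample = ["password", "iloveyou", "abc123", "123456"]
distribution = type_distribution(sample)
```

Because each password maps to a set of types, the 32 possible combinations correspond to the 32 subsets of the five categories, including the empty set for an empty entry.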
These observations suggest that a significant part of our dataset consists of words or names. Another troublesome result is the share of passwords that consist solely of digits, which represents 16% of the dataset. Numeric passwords have limited complexity.

Length distribution analysis gives us insight into the common lengths of user-chosen passwords.

[Figure 3: bar chart of the percentage of passwords (0% to 30%) for each password length from 5 through 12]
Figure 3 – Password length distribution

The results of the analysis, as shown in Figure 3, do not show a normal distribution, but rather a truncated one. We argue that this is a result of minimum password length requirements. Strangely, the database contained entries as short as one character. At the moment of writing, the RockYou! website enforces a minimum password length of 8 characters, but this minimum was most likely lower (or non-existent) in the past (10).

In the previous chapter we already noted that we ignored any entry longer than 30 characters. The passwords of length 15 through 29 covered about 2 percent of the database. Half the passwords of length 20 and above were e-mail addresses, which explains their unusual length.

Finally, we take a look at the most common passwords. We expect these passwords to be really weak, as a password that is widely used is most likely one you can logically guess, find in a common password listing, or one that is context dependent (such as the name of the service).

Password     Count     Percentage
123456       290731    0.8918%
12345        79078     0.2426%
123456789    76790     0.2356%
password     59463     0.1824%
iloveyou     49952     0.1532%
princess     33291     0.1021%
1234567      21727     0.0666%
rockyou      20903     0.0641%
12345678     20553     0.0630%
abc123       16648     0.0511%

Table 1 – Top 10 passwords

Table 1 shows the top 10 passwords with their absolute counts. The use of numerical sequences immediately stands out, as 5 of the 10 passwords belong to this class.

When we break the numbers down to percentages, two troublesome conclusions can be drawn:

1. Roughly one out of every hundred passwords is the sequence ‘123456’.

2. The top 10, 100, 1,000 and 10,000 passwords cover respectively 2, 5, 11 and 22 percent of all passwords.

In other words, if an attacker were to try to gain access to a system by trying the top 10 most common passwords against a listing of known accounts, he could expect to succeed within 25 accounts, costing him only 250 guesses. Given more accounts and just the password ‘123456’, success could be expected with the 50th account, costing only 50 guesses. This attack is feasible without automation and low-profile enough to not raise any alarms.

We compare our results with other top tens of common passwords (4) (5) in Table 2. All entries marked with an asterisk are passwords that are present in our top 10. It should be noted that most other passwords are in our top 100 and all are present in our top 1000.

RockYou!     PCMag        M.Burnet
123456       password*    123456*
12345        123456*      password*
123456789    qwerty       12345678*
password     abc123*      1234
iloveyou     letmein      pussy
princess     monkey       12345*
1234567      myspace1     dragon
rockyou      password1    qwerty
12345678     blink182     696969
abc123       Firstname    mustang

Table 2 – Top 10 comparison (the last entry of PCMag, ‘Firstname’, is a placeholder for an actual first name)

Although these overlapping lists strengthen our belief that the RockYou! dataset is representative, it at the same time saddens us to see that so many users still use very predictable passwords.
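The coverage figures quoted in this chapter (the share of all entries taken up by the k most common passwords) can be reproduced with a few lines of Python. This is a minimal sketch: the function name `top_k_coverage` and the toy sample are assumptions for illustration, not the tooling or data used for the paper.

```python
from collections import Counter


def top_k_coverage(passwords, k: int) -> float:
    """Return the fraction of all entries covered by the k most common passwords."""
    counts = Counter(passwords)
    total = sum(counts.values())
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy sample of 10 entries (illustrative only, not the RockYou! data).
sample = ["123456"] * 4 + ["password"] * 3 + ["iloveyou"] * 2 + ["x9$Kq"]

coverage_top_1 = top_k_coverage(sample, 1)  # share of the single most common password
coverage_top_2 = top_k_coverage(sample, 2)  # share of the two most common passwords
```

Run against the full dataset with k set to 10, 100, 1,000 and 10,000, a function of this kind would yield the kind of cumulative percentages listed in the second conclusion above.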
PATTERN ANALYSIS

The previous chapter showed several ways of defining characteristics of a password. Now, we want to go a step further and extract actual patterns from passwords. This will help us to better understand how passwords are formed and ultimately allow us to construct an improved password checker.

M. Dell’Amico (6) studied a much smaller database of roughly 10,000 entries. Various regular expressions were run against the database in an attempt to recognize patterns in passwords. We repeat their experiment on the RockYou! database and present both their results (IIMS) and ours (RockYou) below in Table 3.

Expression          Example    IIMS      RockYou
[a-z]+              abcdef     51.20%    41.69%
[A-Z]+              ABCDEF     0.29%     1.50%
[A-Za-z]+           AbCdEf     53.74%    44.05%
[0-9]+              123456     9.10%     15.93%
[a-zA-Z0-9]+        A1b2C3     93.43%    96.20%
[a-z]+[0-9]+        abc123     14.51%    27.69%
[a-zA-Z]+[0-9]+     aBc123     16.30%    30.16%
[0-9]+[a-zA-Z]+     123aBc     1.80%     2.75%
[0-9]+[a-z]+        123abc     1.65%     2.53%

Table 3 – Regular expressions

The most notable difference is the use of digits. Most percentages in the IIMS database indicate a reliance on lowercase characters, whilst in the RockYou! database patterns that contain digits are more prominent. The patterns that describe a word suffixed by a number are almost twice as popular in the RockYou! database. Passwords that are made purely of digits are also more popular in the RockYou! database.

Based on these results we can argue that the users of RockYou! have been trained to create more secure passwords, most likely by other applications which enforce stricter requirements. One exception would be the passwords that consist purely of digits. These passwords are used more often in the RockYou! dataset and we consider this kind of password to be very insecure, due to its limited complexity.

Besides the aforementioned expressions, we propose our own set of supplemental expressions. We are most interested in passwords that start with letters and end with digits, as they make up one third of our dataset and regular expressions can help us better understand this set. Therefore, we repeat the regular expression ‘[a-zA-Z]+[0-9]+’ and keep track of all the numbers that are matched. We then look at the most popular numbers and revert these back to regular expressions.

Number    Count      Percentage
1         1476941    16.37%
123       325963     3.61%
2         284354     3.15%
12        213870     2.37%
3         166762     1.85%
13        150069     1.66%
7         147951     1.64%
11        122630     1.36%
5         120376     1.33%
22        107444     1.19%
23        106425     1.18%
01        102756     1.14%
4         101573     1.13%
07        100693     1.12%
21        100370     1.11%
14        95288      1.06%
10        92655      1.03%
06        86495      0.96%
08        86065      0.95%
8         83819      0.93%
15        83708      0.93%
69        81299      0.90%
16        78506      0.87%
6         76798      0.85%
18        71343      0.79%

Table 4 – Top 25 numeric suffixes

Table 4 shows the 25 most used numeric suffixes, which make up half the passwords matched by the regular expression ‘[a-zA-Z]+[0-9]+’. We expected single digits to be popular suffixes and indeed, the digit ‘1’ covers over a million passwords that use it as a suffix. However, double digit numbers occur twice as often as single digits, which we did not expect. Interestingly, the digit 9 is only present once in the listing (and not even as a single digit number).

Besides the top 25, we present a second list which contains the entries longer than 2 characters from the top 100.

Number    Count    Percentage
101       51065    0.57%
1234      49619    0.55%
2007      30731    0.34%
2006      29122    0.32%
666       24317    0.27%
2008      24300    0.27%
12345     20276    0.22%
2005      18694    0.21%
007       18261    0.20%
420       16470    0.18%
123456    15811    0.18%
1994      14288    0.16%
1993      13695    0.15%
1992      13609    0.15%
777       13223    0.15%
1995      13149    0.15%
2000      12613    0.14%
111       12426    0.14%
1991      12358    0.14%
1990      11740    0.13%
2004      11466    0.13%
321       11073    0.12%
1989      10940    0.12%
1987      10689    0.12%

Table 5 – Entries from top 100 numeric suffixes with length larger than 2 characters

Table 5 lists these entries. The length of the entries exposes very apparent patterns. All entries of length 4 are years. We argue that the gap between 1994 and 2004 is indicative of users using either the current year during sign-up or a birth year. We know that the user base of RockYou! consists largely of teenagers. Entries longer than 4 characters represent numeric sequences, while entries of length 3 represent repetitive digits or numbers from pop culture such as the number of the beast (666), lucky sevens (777), James Bond’s call sign (007) and four-twenty from the cannabis subculture (420).

All two digit numbers are represented within the top 180 entries, which means that any entry in that range has a more than 50% chance of falling in this category. We formulate this in a regular expression that matches any two digit number.

Expression                        Description
^(1|12|123|1234|12345|123456)$    Numeric sequence
^[0-9]$                           Single digit
^[0-9]{2}$                        Double digit
^(19[7-9][0-9]|200[0-9])$         Year between 1970 and 2009

Table 6 – Regular expressions for numeric suffixes

Table 6 shows the patterns described as regular expressions. It should be noted that the order is of importance. For instance, we consider the use of years as a suffix to be more secure than a single digit suffix. Regular expressions and their order will play an important role in our own password checker, which we will present later on.

N-GRAM APPROACH

Another approach to analyzing the dataset is to count the occurrences of the short character sequences that make up the password. Such an approach delivers a so-called n-gram model, which is one of the approaches used in a paper by M. Dell’Amico (6) to generate passwords. An n-gram model is a type of probabilistic model for predicting the next item in a sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis.

An n-gram model can be useful in two ways:

1. It can be used to measure the likelihood of a password as a product of the relative occurrences of the n-grams it contains. This gives us yet another metric to measure passwords.

2. It can be used to generate new passwords, based on the n-gram model. This could form the basis of a new kind of password attack strategy, which is smarter than a brute force approach (8).

To build an n-gram based model, every sequence of n consecutive characters of every password in the dataset must be extracted. This is required to solve the probability term of an n-gram based likelihood calculation, which can be described with the term:

$$P(x_i \mid x_{i-n+1}, \ldots, x_{i-1})$$

To calculate the probabilistic likelihood of a password, the following formula is applied:

$$P(x) = P(|x|) \cdot \prod_{i=1}^{|x|} \frac{C(x_{i-n+1} \ldots x_i)}{\sum_{c} C(x_{i-n+1} \ldots x_{i-1}\, c)}$$

Where $n$ is the size of the n-gram, $x$ is the password, $|x|$ is the length of the password, $P(|x|)$ is the probability of a password being that length, $C(s)$ is the number of occurrences of the string $s$ in the dataset and $c$ ranges over all single characters.

The fraction $C(x_{i-n+1} \ldots x_i) / \sum_{c} C(x_{i-n+1} \ldots x_{i-1}\, c)$ is the conditional probability of the substring $x_{i-n+1} \ldots x_i$ given all possible strings that can be created from $x_{i-n+1} \ldots x_{i-1}$ when appended with a single character. This is an embodiment of the term $P(x_i \mid x_{i-n+1}, \ldots, x_{i-1})$, which is used by the n-gram model to predict $x_i$ based on $x_{i-n+1}, \ldots, x_{i-1}$, and allows us to calculate the probability of an n-gram.
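The likelihood calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the names `train` and `likelihood` and the toy corpus are hypothetical, positions before the first character are padded with a ‘»’ start marker so that every character has a full context, and n-grams that never occur in the training data simply yield a likelihood of zero (no smoothing is applied).

```python
from collections import Counter

START = "»"  # assumed start marker for positions before the first character


def train(passwords, n: int):
    """Count n-grams, their (n-1)-character contexts and password lengths."""
    ngrams, contexts, lengths = Counter(), Counter(), Counter()
    for pw in passwords:
        lengths[len(pw)] += 1
        padded = START * (n - 1) + pw
        for i in range(len(pw)):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            # The context count equals the sum of the counts of all
            # one-character extensions of that context.
            contexts[gram[:-1]] += 1
    return ngrams, contexts, lengths


def likelihood(password: str, n: int, model) -> float:
    """Length probability times the product of conditional n-gram probabilities."""
    ngrams, contexts, lengths = model
    p = lengths[len(password)] / sum(lengths.values())  # probability of the length
    padded = START * (n - 1) + password
    for i in range(len(password)):
        gram = padded[i:i + n]
        if ngrams[gram] == 0:
            return 0.0  # unseen n-gram: no smoothing in this sketch
        p *= ngrams[gram] / contexts[gram[:-1]]
    return p


# Toy corpus (illustrative only, not the RockYou! data).
corpus = ["ab", "ab", "ac", "b"]
model = train(corpus, 2)
```

With this toy corpus and n = 2, the password ‘ab’ is scored as the length probability 3/4, times 3/4 for ‘a’ at the start, times 2/3 for ‘b’ following ‘a’.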
Since we are calculating the product of a series of probabilities, the assumption is made that the collection of occurrences is mutually independent, which means that we assume that the following property of mutual independence holds for the occurrences $A_1, \ldots, A_k$:

$$P\left(\bigcap_{i=1}^{k} A_i\right) = \prod_{i=1}^{k} P(A_i)$$

Although we know that this property does not actually hold, the assumption that it does is required for us to simplify the problem. Through this independence assumption, our model assumes the Markov property, which enables reasoning and computation with the model that would otherwise be intractable.

The length of the password determines the number of iterations, which generally means that the longer the password, the more iterations there are and the smaller the overall outcome of the formula will be. This is taken into account by explicitly including the probability of the length of the password, as denoted by $P(|x|)$, in our formula.

In the event that $i < n$, which happens in the first $n - 1$ iterations of the product, the substring references a non-existing character index of zero or smaller. These references are substituted with the ‘»’ sign to denote the fact that we are dealing with a starting sequence. Thus, we get the convention:

$$x_j = \text{»} \quad \text{for all } j \le 0$$

The actual value of $n$ has a huge impact on the behavior of the model. In the next few paragraphs, we will discuss various values of $n$ and their implications.

Using a value of one for $n$ (unigrams) will make the algorithm contextless, as the algorithm will only look at the frequency of single letters. This is synonymous with using the character frequency analysis from the previous chapter to grade or predict passwords.

If one were to choose a very large $n$, then the algorithm would lose its ability to dissect passwords and would simply grade a password by its entire string of characters. This is synonymous with calculating the frequency of a given password within the original dataset. A password generator with such a model would (almost) never generate any passwords outside the dataset.

On another note, a large $n$ would also impose problems of the computational kind. As the value of $n$ increases, the number of possible n-grams increases exponentially. Given a dataset of our size and an $n$ of 5, our algorithm used up nearly a gigabyte of system memory. For a larger $n$ we had to resort to using a smaller subset of our dataset.

Finally, a large $n$ also imposes another problem with the effectiveness of our algorithm. As we said earlier, one of the powerful features of using n-grams is that they can be used to analyze or produce new (unseen) data based on known data.