Analyzing Password Strength

Analyzing Password Strength Martin M.A. Devillers Juli 2010 Password authentication requires no ABSTRACT specialized hardware, such as with finger- print authentication, can be easily imple- Password authentication is still the most mented by developers and just as easily used used authentication mechanism in today’s by users. In short: It’s usable. computer systems. In most systems, the But is it safe? The topic of “Security ver- password is set by the user and must adhere sus Usability” has always been of much de- to certain password requirements. Addition- bate in the computer security world (3). Us- ally, password checkers rank the strength of ers must be protected from harm by security a password to give the user an indication of controls, but these controls may not interfere how secure their password is. In this paper, (much) with the tasks the users want to per- we take a look at a large database of user form. For instance, a firewall that simply chosen passwords to determine the current blocks all traffic can be considered secure, state of affairs. In the end, we extract a mod- but heavily impedes the overall usability of el from the database and provide our own the system. password checker which ranks passwords in various ways. We ran this checker against Security experts or system developers are our dataset which shows that over 90% of usually the ones who have to make this tra- the passwords is highly insecure. deoff, but with password authentication, this task is essentially passed on to the user (3): INTRODUCTION One can either choose a short and simple password, which is easy to remember but As we perform increasingly important also easy to crack, or a very long and com- tasks from our living room computer, the plicated password, which is hard to remem- topic of computer security also becomes in- ber but also hard to crack. creasingly important. And indeed, many ad- Unfortunately, most users do not see a vances have been made in the field of com- tradeoff: They see an obstacle and they will puter security to protect home users from choose the path of least resistance to over- digital crime. For instance, the wireless come that obstacle. Thus, short and simple transmission protocol, Wi-Fi, has had sever- passwords that are easy to remember, but al (much overdue) security overhauls to pro- also easy to guess, are used (4). tect home users from being eavesdropped or worse (1). However, when we look at what we To stop users from using weak pass- typically use to authenticate ourselves with, words, most systems enforce certain re- we are stuck with a system that dates back quirements that a password must meet be- to the Roman empire (2): passwords. fore it gets set. Examples of common requirements are minimal length of the pass- Although an ancient concept, password word, the occurrence of uppercase letters, authentication is, and most likely will be for digits and/or symbols in the password and a long time, the most used authentication inequality with the user’s username or e- mechanism for computer users. mail address (5). However, holding on to the principle of usernames. Since the source of the data was the path of least resistance, one can expect questionable at best, we ran various tests users to try and ‘circumvent’ these require- and filters to ensure the quality of the data. ments in a predictable manner (4). For in- Various noise factors were discovered: stance, given that an user starts out with an 1. Analysis showed that entries longer actual word such as ‘house’ and the re- than 30 characters were more often noise quirement that the password must contain than actual passwords. A lot of entries were at least one digit is given, one might expect pieces of HTML or JavaScript, which might the user to simply suffix the word with a sin- be indicative of an injection attack or some gle digit. In the findings which we will kind of input anomaly. Others contain a sin- present, you will see that 15% of the pass- gle character repeated up to a hundred words were a word or name suffixed with the times. The most likely explanation for this number one. anomaly, are users who want to quickly In this paper, we will first discuss the da- sign-up and fill in bogus information to get a taset which we used to perform our research. onetime account. Such entries cannot be After this, we will show you the results of our considered to be actual passwords, since the preliminary tests on this dataset. In the next user does not have the intention of memoriz- chapter, we will go one step further and ex- ing or re-using the password. Since these tract actual patterns from the passwords. entries are much longer than normal entries, After this, we will show you how we used a a filter was applied that removes any entry probabilistic approach to password analysis. longer than 30 characters. In the final chapter, we will present you our 2. At various points in the database, the password checker, which combines the re- character encoding switches. For instance, sults of all the previous chapters. the first 3 million passwords contain Un- icode characters in their HTML encoded ATASET D form. Hereafter, Unicode characters are On December 4th 2009, a hacker stored in their actual form, which might be breached a company database of RockYou!1 indicative of the backend switching to UTF-8 containing the usernames and unencrypted encoding support. Similarly, the file itself is passwords of about 32 million users (3). This part encoded in ANSI and part encoded in database was subsequently published to the UTF-8. Character analysis gave indication internet and is now in wide circulation. Ob- when these switches occurred and a pro- viously, we don’t condone hacking, but the gram was written to rewrite the database in presence of this database gives us an unique one uniform encoding scheme (UTF-8). opportunity to perform a large scale empiri- 3. Various (uncommon) passwords occur cal study on passwords. multiple times in close proximity to each The most notable fact about the afore- other. This might be indicative of a single mentioned affair would not be that a large user registering multiple accounts. A filter database was hacked, but that the pass- was applied that removes these extra occur- words inside the database were stored in rences. Common passwords did not apply to unencrypted form (so-called plaintext pass- this filter, as they may naturally occur mul- words). These days, it is common practice to tiple times in close proximity. salt and hash passwords before permanently storing them, which makes it generally hard PRELIMINARY TESTS to study passwords even when granted access to the right databases. Before we started with the actual analysis, various basic tests were ran to gain The RockYou! database was acquired in some insights into the database. These in- the form of a long text file where each pass- clude a letter frequency analysis, a character word resides on its own line. The file con- type analysis, a length distribution analysis tained no other information, such as the and a common password analysis. 1 RockYou! (originally known as RockMySpace), Letter frequency analysis helps us in var- based in Redwood City, California is a publisher ious ways. Knowing the frequency of each and developer of applications and other social letter gives us the ability to define a more network services. As of December, 2007 it is the fine-grained metric for measuring password most successful widget maker for the Facebook strength. By grading the chance of occur- platform in terms of total installations. Radboud University Nijmegen 2 Martin M. A. Devillers 16% RockYou English Spanish 14% 12% 10% 8% Percentage 6% 4% 2% 0% abcde f ghi j k lmnopqrs tuvwxyz Figure 1 – Letter frequencies of the RockYou! dataset, English and Spanish language rence of each individual character rather Unicode, which are any characters that than grading each letter a flat twenty-sixths do not belong to the above categories. chance of occurrence. Examples of these are the euro sign and Furthermore, letter frequency analysis the Japanese alphabet. allows us to measure the degree in which the Given these categories, 2 32 combina- passwords conform to actual words. This tions are possible. However, we expect that helps us to better understand if the dataset only a few will be really prominent. follows a language and what language this might be (6). Figure 1 shows the letter frequencies of the dataset and those of the English and Other Spanish language. As you can see, the dis- 9% tribution of the RockYou! dataset shows great similarity to those of the English and Digit Lower Spanish language. When we do a quick 16% Case search for English and Spanish words in the 42% dataset, we find that 10% of the entries match English words and 2.5% match Span- ish words. Lower Character type analysis looks at every Case, password and flags what kind of characters Digit make up the password. We distinguish be- 33% tween the following character types: Lowercase, which are the standard lowercase letters of the alphabet. Figure 2 – Character type analysis Uppercase, which are the standard uppercase letters of the alphabet. Figure 2 shows the results of our character type analysis. The largest category, which Digits, which are the digits zero through nearly makes up for half the dataset, are nine passwords that consist solely out of lower- Symbols, which are any characters found case characters.

Load more