
GPU-Assisted Hashing: The Example of scrypt

Thorsten Kranz

Master’s Thesis. April 14, 2014.
Chair for Embedded Security – Prof. Dr.-Ing. Christof Paar
Advisor: Dr. Markus Dürmuth
EMSEC

Abstract

Many large leaks of password data in the past have emphasized the importance of protecting stored passwords. Traditionally, cryptographic hash functions have been used to this end. However, an attacker equipped with special hardware is able to run a parallelized guessing attack on such hashes. Therefore, new ideas for password hashing functions have been presented. In 2009, scrypt was published as a password hashing function that is designed to be very expensive to attack with custom hardware. In this thesis, we present a GPU-based attack on scrypt. We examine the behavior of the different cost-determining parameters for the GPU and the CPU. Furthermore, we compare the hash rates achievable for scrypt and bcrypt. Our measurements show that particularly the block size parameter r of scrypt is well-suited for thwarting a GPU-assisted attack. However, the hash rates show that scrypt is not as successful as bcrypt in preventing GPU-assisted attacks for cost parameters that are reasonable in an interactive login scenario. Our results imply that the cost factors between ASIC hardware crackers for scrypt and bcrypt that have been estimated by scrypt’s author do not hold for graphics cards.


Declaration

I hereby declare that this submission is my own work and that, to the best of my knowledge and belief, it contains no material previously published or written by another person nor material which to a substantial extent has been accepted for the award of any other degree or diploma of the university or other institute of higher learning, except where due acknowledgment has been made in the text.

Erklärung

Hiermit versichere ich, dass ich die vorliegende Arbeit selbstständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt habe, dass alle Stellen der Arbeit, die wörtlich oder sinngemäß aus anderen Quellen übernommen wurden, als solche kenntlich gemacht sind und dass die Arbeit in gleicher oder ähnlicher Form noch keiner Prüfungsbehörde vorgelegt wurde.

Thorsten Kranz

Contents

1 Introduction 1
  1.1 Contribution 3
  1.2 Related Work 3
  1.3 Outline 4

2 Background 5
  2.1 Password Security 5
    2.1.1 Passwords as Means of Authentication 5
    2.1.2 Online Attacks vs Offline Attacks 6
    2.1.3 Naive Password Hashing 6
    2.1.4 Attacks on Naive Password Hashing 8
    2.1.5 Advanced Password Hashing 11
  2.2 scrypt 12
    2.2.1 Overview 13
    2.2.2 scryptROMix 13
    2.2.3 scryptBlockMix 14
    2.2.4 Salsa20/8 Core 15
    2.2.5 Size Restrictions 17

3 Cost Estimations 19
  3.1 Provided Information 19
  3.2 Example: Cost Estimations for MD5 21
  3.3 Counting the Cryptographic Primitives 22
  3.4 Estimating the Memory Costs for bcrypt and scrypt 26
  3.5 Compression Functions vs Hashes 27

4 Implementation 31
  4.1 CUDA C Programming 31
    4.1.1 Heterogeneous Model 31
    4.1.2 Thread Hierarchy 32
    4.1.3 Memory Model 33
  4.2 CUDA Based Password Hashing with scrypt 35
    4.2.1 Host Implementation 35
    4.2.2 Device Implementation 38

5 Results 45
  5.1 CPU Running Times 45

  5.2 GPU Hash Rates 47
    5.2.1 The N Parameter 48
    5.2.2 The r Parameter 48
    5.2.3 The p Parameter 49
    5.2.4 More Results 50
  5.3 Comparing scrypt with bcrypt 50

6 Conclusion 55
  6.1 Future Work 55

A Acronyms 57

B GeForce GTX 480 - Device Query 59

C Measurements 61

List of Figures 63

List of Tables 65

List of Algorithms 67

List of Listings 69

Bibliography 71

1 Introduction

In today’s world, passwords are omnipresent. The average Internet user is requested to authenticate with user name and password multiple times per day: when signing in as a user of an operating system, when checking emails, when logging in to a social network, when using online banking, and so forth. In the case of online services that require a login, large amounts of passwords have to be stored on the server side. Experience has shown that it is not easy at all to protect those passwords from unauthorized access. Just recently, 130 million encrypted passwords of Adobe users showed up on the Internet together with user names and password hints, directly revealing many passwords due to a bad implementation [Sop13]. Table 1.1 lists the most famous password leaks from the last years.

Year  Affected entity          # of passwords   Implementation details
2013  Adobe [Tec13]            ≈ 130,000,000    Encrypted. Unsalted.
2012  gamigo [For12]           ≈ 8,000,000      Hashed. Unsalted.
2012  LinkedIn [Tec12a]        ≈ 6,000,000      Hashed. Unsalted.
2012  eHarmony [Tec12b]        ≈ 1,500,000      Hashed. Unsalted.
2009  RockYou [Dev10, Tec09]   ≈ 32,000,000     Plain.

Table 1.1: Overview of the most famous password leaks in the last years.

It should be pointed out that Table 1.1 only includes breaches where the corresponding password files have been posted on the Internet and could be associated with an entity. There are three things to note here. First, it must be assumed that there also exist many password leaks that never show up on the Internet, since publication might adversely affect the criminal intentions of the attackers. To gain knowledge of such a breach, the affected entity must notice it and also make it public. A famous example for that is the breach of Sony’s PlayStation Network in 2011 [Reu11]. 77 million users were affected and, according to Sony, the stolen data also included hashed passwords [Pla11]. But it would be naive to assume that every breach is detected and also made public as in this particular case. The second thing to realize is that the number of leaked passwords showing up on the Internet is always a best-case scenario. Especially in those cases where password hashes are posted without the corresponding user names, it is conceivable that the original attacker published merely that fraction of the passwords that he did not manage to crack on his own. Finally, every day there are also lots of passwords published on the Internet that cannot be associated with a definite entity, that is, we do not know where these passwords come from and whether they are actually real. Putting all those arguments together, one must assume that there is a large number of unreported password leaks, even

though the numbers we can be sure of are already sufficient for motivating research on better ways of implementing password storage. This implementation can be divided into two layers, as depicted in Figure 1.1. On the one hand, a data breach should not occur in the first place. This first layer of password security involves software engineering, database programming, and security policies. Breaches might, for example, take place due to SQL injections or the thoughtless disposal of old backups. As we just motivated, such breaches prove to be hard to thwart. Therefore, we need a second security layer that takes effect in the event of password theft. Such a scenario is called an offline attack, that is, the attacker has stored the password file on a local disk and is not subject to any restrictions imposed by online authentication. This thesis covers the latter of those security layers; preventing password leaks will not be taken into account.

Figure 1.1: Passwords are secured with two layers. The first layer prevents data theft, the second layer protects the passwords for the case that the first layer fails.

To thwart the security threats that arise from offline attacks, techniques like hashing and salting are applied. Nevertheless, since trying out different passwords as input of a hash function is an embarrassingly parallel problem and hash functions are typically created to be computable as fast as possible, an attacker equipped with parallel hardware still represents a big threat in such a scenario. In most cases of hashed and unsalted leaks, it only takes a few days for way above 50% of the hashes to be cracked. More than 90% of the hashes from gamigo, LinkedIn, and eHarmony have been recovered and are now publicly accessible in the form of so-called password dictionaries. While applying salting will significantly slow down the cracking process, it is still desirable to have a particularly suited hash function that is harder to attack. This is all the more true when we also want to protect against attackers with large resources, such as intelligence agencies. In contrast to “small” attackers who are typically equipped with a GPU, or maybe with an FPGA, those “large” attackers might use ASICs that can compute way more hashes per second. For that reason, algorithms [PM99, Kal00] and techniques [ALN97, Man96, KSHW98] have been proposed with the specific goal of slowing down the process of trying out many inputs. Since continuous hardware improvements must be taken into account, the computational cost is typically parameterized. One of these algorithms is called scrypt [Per09] and is

based on the idea of making parallel hardware implementations expensive by forcing large utilization of RAM. The author claims that scrypt is significantly more expensive to attack than other well-known password hash functions. This thesis describes an implementation of the scrypt algorithm on a GPU. The cost of cracking passwords is measured for various parameters. Based on the results, the scrypt algorithm is evaluated and compared to the bcrypt algorithm.
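To make the cost parameters of scrypt concrete, Python’s standard library exposes the function through hashlib. The following sketch derives a key for one illustrative choice of the parameters N (here n), r, and p; the password, salt, and parameter values are made up for the example and are not a recommendation.

```python
import hashlib

# Derive a 32-byte key with scrypt. Memory usage grows with n and r
# (roughly 128 * n * r bytes), and p sets the number of independent
# lanes that could be computed in parallel.
key = hashlib.scrypt(
    b"correct horse battery staple",  # example password
    salt=b"example-salt",             # example salt
    n=2**14,  # CPU/memory cost, must be a power of two
    r=8,      # block size parameter
    p=1,      # parallelization parameter
    dklen=32,
)
print(key.hex())
```

Doubling n or r roughly doubles both the time and the memory needed per derivation, which is the property the thesis measures on the GPU.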

1.1 Contribution

Our main contribution is the implementation and evaluation of a GPU-based exhaustive search for passwords protected with scrypt. As far as we know, we are the first to implement scrypt on a GPU for variable cost parameters. We examine the behavior of the individual parameters, not only for the GPU, but also for the CPU. In this way, we are able to evaluate how well the parameters are suited for putting more stress on the GPU than on the CPU. Additionally, we compare the hash rates achieved for scrypt and bcrypt. A further contribution is the thorough analysis of Percival’s cost estimations made in the original scrypt paper [Per09]. We reverse engineer his method of computing the costs and find some inaccuracies.

1.2 Related Work

Password security has been intensively studied in the last decades. In [MT79], Morris and Thompson give an overview of the history of UNIX password security, as of 1979. UNIX password security is fundamental for the subsequent research in this field. The technique of rainbow tables was famously developed by Oechslin [Oec03], based on work by Hellman [Hel80]. Also, corresponding countermeasures have been presented in the form of salts and key stretching [Man96, ALN97, KSHW98]. Rainbow tables, salts, and key stretching are explained in Chapter 2. The fact that human-chosen passwords are particularly susceptible to attacks is due to the fact that they are easy to guess. How to efficiently guess a password is a very interesting field in password security research. Wordlists and mangling rules are widely used in practice. But also more sophisticated methods based on Markov chains [NS05] and context-free grammars [WAMG09] have been studied and have shown impressive results. When using passwords, it is advised to use so-called password creation policies. They pursue the goal of preventing users from choosing weak passwords. Here, proactive password checkers were presented to exclude weak passwords [BK95, Kle90, Spa92]. However, the effectiveness of those techniques is questionable [KSK+11]. Weir et al. showed that entropy, as defined in the NIST Electronic Authentication Guideline [Nat13], is unsuited for estimating password strength [WACS10]. More advanced metrics for guessing difficulty have been presented by Bonneau [Bon12]. A more recent technique for estimating password strength is based on counting the occurrences of a password in a database [SHM10]. Also, Markov models have been used to predict password strength [CDP12].

Over the years, several functions for password hashing have been published. Kamp’s MD5-based crypt [Kam94] was published in 1994. After that, bcrypt [PM99] and PBKDF2 [Kal00] have been the most famous password-based key derivation functions. They were published in 1999 and 2000, respectively. Several years later, in 2009, Percival published scrypt as a function to thwart custom hardware attacks. Looking at those functions, the only standardized one is PBKDF2, but it has been shown that PBKDF2 is not particularly good at resisting parallelized attacks [DGK+12]. In 2013, the Password Hashing Competition (PHC) [Com] made a call for submissions with the goal of identifying a good password-based key derivation function and maybe establishing it as a new standard. 22 submissions have been accepted. One or more winners are planned to be announced in the second quarter of 2015.

While traditionally designed for solving graphics problems, graphics cards are nowadays widely used for general-purpose programming. Jens Steube is the developer of oclHashcat [Has14], which is a GPU-based password cracker that supports nearly all password-based key derivation functions that are used in practice. Closely related to GPU-based password cracking is GPU-based cryptocoin mining. Cryptocoins like Bitcoin and Litecoin are based on a proof-of-work: a hash function has to be computed very often until the output has the desired form. In password cracking and cryptocoin mining, primarily AMD cards are used. But there are also implementations for NVIDIA cards. For oclHashcat, there is the cudaHashcat utility, and for Litecoin, there exists the cudaMiner [For]. For getting started with CUDA programming, the CUDA Zone [NVI] on the NVIDIA website has well-documented and good introductions. Also, lots of webinars can be found there. The most important documents for CUDA C programmers are NVIDIA’s CUDA C Programming Guide and the CUDA C Best Practices Guide. Besides those online resources, there are also books about CUDA. Sanders and Kandrot give a very basic introduction in CUDA By Example [SK11]. Unfortunately, many of the motivating examples are graphics problems. However, it is still quite easy to understand. A broader, more detailed, but also more advanced introduction to GPU programming with CUDA is given by Kirk and Hwu in [KH10].

1.3 Outline

Chapter 2 equips the reader with the required background knowledge. Especially the scrypt algorithm explained in Section 2.2 should be understood for the subsequent chapters. Chapter 3 thoroughly analyzes the cost estimations of scrypt originally made by Percival. Thereafter, the GPU-assisted implementation of scrypt is described in Chapter 4 and the results are presented in Chapter 5. Finally, Chapter 6 concludes the thesis and discusses future work.

All acronyms used throughout this thesis can be looked up in Appendix A. Technical details of the employed graphics card are given in Appendix B and the plain measurement results of our tests are summarized in Appendix C.

2 Background

The following sections equip the reader with basic knowledge in password hashing. The first part examines password security, how hashing is involved in protecting passwords, and how password hashes are attacked. The second part constitutes a detailed explanation of the scrypt key derivation function (KDF).

2.1 Password Security

In this section, the basics of passwords are explained. We consider why and how passwords are used in today’s computer systems. Most of this section will deal with the security problems that arise from password usage. Here, we will focus on the so-called offline attack, since this is the scenario we will be up against throughout this thesis.

2.1.1 Passwords as Means of Authentication

Means of authentication are divided into three fields:

∙ What somebody has.

∙ What somebody is.

∙ What somebody knows.

While the first and second field correspond to security tokens and biometric identification, respectively, a password is a means of authentication located in the latter of those fields. By providing a password, we convince somebody that we are who we claim to be because only we should know that password. When working with computer systems, two big applications of passwords show up. First, passwords are used to implement access control, especially for logging into accounts. Second, passwords are also used for password-based encryption: a key is derived from the password and used as the secret key to encrypt a file. Next time the file is needed, the same password has to be provided so that the same secret key can be derived for decryption. Passwords are the most common authentication method. They are (ostensibly) easy to implement and do not require additional hardware, as for example biometric means of authentication do. Another advantage over biometric authentication is that a password does not inherently leak information to the environment while, for example, a fingerprint cannot be assumed to stay secret. In contrast to security tokens, passwords are easily manageable, also for a large number of users. And like biometrics, tokens require, or rather are, additional hardware.

The big downside of passwords that overshadows those advantages is the huge lack of usability. The more memorable a password is, the easier it is to guess. This implies an inherent security-usability tradeoff which has to be made if one decides to remember the password (as opposed to writing it down or storing it somewhere). The scenario is even worse because we have to remember multiple passwords, and each of them should of course be unique. As a result, most people use weak passwords.

2.1.2 Online Attacks vs Offline Attacks

As stated above, passwords are used for authentication. Thinking about this, it should directly strike us that there has to be a mechanism for checking the validity of the passwords. In other words, the entity towards which we want to authenticate has to store some information that can be derived from our password. This is usually implemented in the form of a password file that contains user names and passwords. If a certain user wants to authenticate, the provided password is compared to the corresponding entry in the password file. Two kinds of password attacks are distinguished: in an online attack, the attacker has no control over the password validation. The guessed passwords are passed to an oracle which answers with yes or no. Any attack that guesses passwords without access to the password file is an online attack. Online attacks can efficiently be prevented by applying a maximum number of allowed tries. After, for example, ten invalid password entries, the according account could be blocked for a while. This dramatically reduces the number of passwords an attacker can guess per second. On the other hand, there are offline attacks. Here, the attacker has access to the password file and is not subject to any restrictions while guessing. As motivated in Chapter 1, the event of a password file becoming accessible to an attacker is by no means negligible. Obviously, such an attack is immediately and fully successful if the passwords are stored in plain. Besides that, storing the passwords in plain enables people with authorized access to read them, which is clearly not what we want. Because of that, several techniques have been developed for mitigating the impact of an offline attack.

Note that while the scenario of password-based encryption does not seem to fit into this framework, it turns out to be attacked in the same way. The attacker has access to the encrypted file and can guess passwords instead of just guessing random keys. So it is very similar to an offline attack with just one password. The techniques described in the following can therefore also be used to enhance the security of password-based encryption.

2.1.3 Naive Password Hashing

To prevent trivial offline attacks, where all the attacker has to do is read the plain passwords, it is necessary to store the password in a non-readable way. At the same time, it must still be possible to validate incoming passwords. The natural ideas for that are encryption and hashing. Both mechanisms have been applied in real-world systems, but

encryption is considered a bad practice. This is not directly obvious so we are going to dig a bit deeper here.

Password Hashing When using password hashing, the password provided by the user is digested by a cryptographic hash function and the digest is stored in the password file. Next time the user needs to authenticate, the process is repeated and the resulting digest is compared to the stored one. Thus, in the event of an offline attack, the passwords are not directly revealed to the attacker. Looking at the properties of a cryptographic hash function, we directly see that preimage resistance is mandatory. Otherwise the attacker could compute a valid password from the hash. (Note that now any preimage is a valid password, not only the original one.) Second preimage resistance is not needed since this would just enable somebody who already knows the password to compute another valid password for that account. The same is true for collision resistance because finding a collision for two passwords is useless as long as the corresponding digest does not show up in the password file, and of course this will not happen due to preimage resistance. Nevertheless, cryptographic hash functions always aim to fulfill all three properties and it will not do any harm to have them. In addition, more sophisticated techniques than plain hashing actually do need those properties. So it is always advisable to use a cryptographic hash function that comes with all three properties. When an attacker obtains a password file protected by proper hashes, the only thing he can do is guess inputs. Unfortunately, this procedure works out pretty well, as discussed in Section 2.1.4. For now, just note that weak passwords mean weak protection because they will be easy to guess.
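The store-and-compare scheme just described can be sketched in a few lines. This is a deliberately naive, unsalted example (function names and the SHA-256 choice are illustrative, not a recommendation for real systems):

```python
import hashlib

def store(password: str) -> str:
    # Naive scheme: the password file holds only the digest.
    return hashlib.sha256(password.encode()).hexdigest()

def verify(candidate: str, stored_digest: str) -> bool:
    # Validation repeats the hashing and compares the digests.
    return hashlib.sha256(candidate.encode()).hexdigest() == stored_digest

entry = store("hunter2")
print(verify("hunter2", entry))   # True
print(verify("hunter3", entry))   # False
```

Note that `store` is deterministic: two users with the same password get identical digests, which is exactly the weakness Section 2.1.4 exploits and which salting (Section 2.1.5) removes.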

Password Encryption Besides hashing, the other natural idea for preventing offline attacks is to use encryption. It works the same way, but instead of hashing the passwords, they are encrypted with a secret key. At first sight, this might seem better than password hashing. The degree of protection is not influenced by the user’s behavior anymore. Decrypting a weak password is as hard as decrypting a strong password. The whole security lies in the secret key, and if the key is kept secret, the passwords will stay protected. But here is the inherent problem: the key itself must also be stored. In the case of an offline attack, the adversary was able to steal the password file from the server’s database. We therefore must assume that the secret key can be stolen as well. A further downside is the decryption capability that comes with encryption. We do not want anyone, for example the administrator, to be able to decrypt the stored passwords. So, starting as a promising idea, password encryption turns out to be inferior to password hashing as a means of protection against offline attacks.

Encryption Keyed by Password Another famous technique is encrypting some constant value, often called magic, with the password used as the secret key. The attacker ends up in a known-plaintext attack, being unable to compute the key, that is, the password. The effects of this technique are closer to hashing than to encrypting. There is no need of storing a secret key and no feature that enables recomputing the passwords. The attacker

is again in the position of being able to guess candidates.

As we see, two basic approaches can be distinguished. On the one side, a secret key is used and all the security lies in its secrecy. On the other side, the need for such a key is eliminated, but the attacker is able to guess password candidates. As just argued, the latter approach is the desirable one. It includes traditional hashes, key derivation functions (KDFs), and encryption keyed by passwords. For reasons of simplicity, we will often just talk about KDFs or use the term password hashing when we want to address those techniques.

2.1.4 Attacks on Naive Password Hashing

One can immediately tell which users have been using the same password by just looking at a password file protected by naive password hashing. That is because the same hash will be stored for those users. When a password is recovered, all users with this password are compromised. Besides that, in case a proper hash function was used, an attacker will not be able to invert the hashes and therefore must generate and test password candidates instead. Again, because the same password results in the same hash digest for any user, it is possible to attack all passwords at once because each candidate can always be compared against all entries in the password file.

Generating Password Candidates Of course, the password candidate generation will not be performed in the classic fashion of a brute-force attack where the candidates are randomly or iteratively generated. Instead, there exist several techniques for choosing particularly promising candidates. The most basic technique is a mask attack, also called a pattern attack. Here, regular expressions or similar means are used to search only a certain set of passwords which are particularly likely to be chosen, for example four lowercase characters followed by two digits. But this turns out to be still pretty close to a classical brute-force attack, and it is more efficient to rely on knowledge gained from the past. Password dictionaries, containing passwords that were revealed in previous leaks, are widely available in many different sizes, ranging from only a few hundred to many millions of entries. The basic dictionary attack tries out every password from the dictionary and compares the result to all the digests from the password file. This is already very efficient, but comes with the drawback of only being able to recover old passwords that have already been recovered before. To overcome that obstacle, there are techniques that use the knowledge gained from dictionaries to generate new passwords. The most common of such methods is the rule-based attack that modifies given passwords in various ways. It is also known under the notion of word mangling. Typical rules are the appending of certain characters, the modification of case, and the substitution of characters with similar-looking characters, that is, leetspeak. Moreover, there are rules for interaction between multiple dictionaries that combine the passwords from the dictionaries in various ways.
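A handful of such mangling rules can be sketched as follows. The specific rules (case toggling, digit and year suffixes, leetspeak substitutions) are illustrative examples of the rule families described above, not the rule set of any particular cracking tool:

```python
def mangle(word: str) -> set:
    # Leetspeak table: a->4, e->3, i->1, o->0, s->5.
    leet = str.maketrans("aeios", "43105")
    # Case rules.
    candidates = {word, word.capitalize(), word.upper()}
    # Suffix rules: append common digits, years, and symbols.
    candidates |= {w + s for w in list(candidates)
                   for s in ("1", "123", "2014", "!")}
    # Leetspeak rule applied to everything generated so far.
    candidates |= {w.translate(leet) for w in list(candidates)}
    return candidates

print(sorted(mangle("password"))[:5])
```

Running each dictionary word through such rules multiplies the candidate set by a small constant factor while keeping the guesses close to how humans actually modify passwords.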

In addition to that, there are even more sophisticated methods for generating new passwords: given the Markov assumption, a Markov chain can be used [NS05] to compute the probability of the password “trustNo1” as

P(trustNo1) = P(t) · P(r|t) · P(u|r) · ... · P(1|o)   (2.1)

For doing that, one has to look at groups of 2 characters and extract their relative frequencies from a given training set. Those groups are called 2-grams and the idea can be generalized to n-grams. For instance, applying 3-grams to the above example yields

P(trustNo1) = P(tr) · P(u|tr) · P(s|ru) · ... · P(1|No)   (2.2)

Abstracting this method to n-grams for a string of k characters c1 c2 ... ck with n ≤ k leads to the following equation:

P(c1 c2 ... ck) = P(c1 c2 ... c(n−1)) · ∏ from i=n to k of P(ci | c(i−(n−1)) ... c(i−1))   (2.3)

A further technique, published by Weir et al. [WAMG09], creates probabilistic context-free grammars from a training set. This allows one to deduce the most probable masks and the most probable passwords that originate from those masks. The variables used in the context-free grammar correspond to subsets of the character set, that is, letters, digits, and special characters.
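The n-gram probability above can be sketched directly from relative frequencies. The tiny training set, function names, and the absence of any smoothing are illustrative simplifications, not the full method of [NS05]:

```python
from collections import Counter

def ngram_model(training, n=2):
    # Count n-grams and their (n-1)-character prefixes in the
    # training passwords; probabilities are relative frequencies.
    grams, prefixes = Counter(), Counter()
    for pw in training:
        for i in range(len(pw) - n + 1):
            grams[pw[i:i + n]] += 1
            prefixes[pw[i:i + n - 1]] += 1
    starts = Counter(pw[:n - 1] for pw in training)
    total = sum(starts.values())

    def probability(pw):
        # P(c1..c_{n-1}) times the product of P(c_i | previous n-1 chars),
        # mirroring equation (2.3).
        p = starts[pw[:n - 1]] / total
        for i in range(n - 1, len(pw)):
            g = pw[i - n + 1:i + 1]
            if prefixes[g[:-1]] == 0:
                return 0.0
            p *= grams[g] / prefixes[g[:-1]]
        return p

    return probability

prob = ngram_model(["trustno1", "trusty", "no1rust"], n=2)
print(prob("trust"))
```

An attacker would train such a model on leaked passwords and enumerate candidates in decreasing order of probability.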

Rainbow Tables From the adversary’s point of view, it is a good idea to improve the efficiency of password cracking by precomputation. Instead of running the above techniques on the fly, one would do it in advance and store the resulting hashes, so that the attack only consists of comparisons. Despite the obvious downside of being restricted to only one specific algorithm that can be attacked, there is a tremendous increase of speed. Naturally, it is not that easy. The adversary usually wants to test billions or trillions of passwords and will just not have enough memory to store all those precomputed hashes. Therefore, a time-memory tradeoff has to be made. A popular implementation of such a time-memory tradeoff are rainbow tables. They were first introduced by Oechslin [Oec03], based on work by Hellman [Hel80]. The basic idea is to compute hash chains by alternatingly applying the hash function and a reduction function, and to eventually only store the first and the last element of each chain. A reduction function is a function that maps the hash back to the set of plaintexts, but of course it is not an inverse hash function because the hash function is supposed to be preimage resistant. A hash digest is attacked by computing its chain in the same way as the table was generated and comparing the resulting element to the stored end points. If there is a match, the corresponding chain can be recomputed from the starting point and checked for the wanted hash digest. In case it is found, the previous element in the recomputed chain is a valid password. By this method, any password that was produced during generation of the rainbow table is found. On the other hand, passwords that actually cannot be found might initially cause a false alarm because of a collision

between the chain created during the attack and one of the chains from the rainbow table. Sporadic false alarms due to collisions during the attack are tolerable. More problematic are collisions during creation of the rainbow table: different hash digests are reduced to the same plaintext and therefore result in merged chains or in a loop. This leads to many duplicate passwords being produced. Furthermore, those duplicates cannot be detected by looking at the end points unless a collision occurred at the same iteration, which is unlikely. To circumvent this problem, different reduction functions are applied in each iteration.

Figure 2.1: Generation of a rainbow table.

When doing this, collisions of plaintexts can still take place. But as long as they do not take place in the same iteration, different reduction functions will thereafter be applied, which prevents both multiple chains from merging and single chains from looping. If a collision between two chains actually takes place in the same iteration, this can be detected by comparing the end points. Thus, while rainbow tables do not fully prevent occasional collisions, they do prevent those collisions from causing further collisions. This dramatically increases the efficiency. Using different reduction functions in each iteration also influences the process of looking up a password in the rainbow table. Given that the attacked hash was produced within the rainbow table, we do not know at which iteration it was produced. But we need to start the attack with the correct reduction function for reaching the end point. Thus, all possibilities are tried. First, just the last reduction function is taken into account. If the corresponding end point does not show up in the table, the last two reduction functions are used, then the last three, and so forth. Figure 2.1 shows the generation process of a rainbow table. An important characteristic of rainbow tables is the fact that specific passwords are not being actively chosen (except for the initial passwords). The covered passwords depend on the produced hashes and on the applied reduction functions. Of course, the reduction functions are chosen in such a way that they map to weak passwords and not to random data.
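The chain generation with a per-column reduction function can be sketched as follows. MD5, the chain length, and the five-lowercase-letter password space are arbitrary illustrative choices; mixing the iteration index into the reduction is what gives each column its own reduction function:

```python
import hashlib

CHAIN_LEN = 1000
CHARSET = "abcdefghijklmnopqrstuvwxyz"

def h(pw: str) -> bytes:
    return hashlib.md5(pw.encode()).digest()

def reduce_fn(digest: bytes, iteration: int, length: int = 5) -> str:
    # Map a digest back into the password space. Adding the iteration
    # index yields a different reduction function per column, which
    # keeps chains that collide in different columns from merging.
    v = int.from_bytes(digest, "big") + iteration
    return "".join(CHARSET[(v >> (5 * i)) % 26] for i in range(length))

def chain_end(start: str) -> str:
    pw = start
    for i in range(CHAIN_LEN):
        pw = reduce_fn(h(pw), i)
    return pw

# The table stores only (end point -> start point) pairs.
table = {chain_end(s): s for s in ("apple", "zebra", "crane")}
print(len(table))
```

Lookup would replay the same alternation: starting from the attacked digest, apply the last reduction function, then the last two, and so on, checking the resulting value against the stored end points after each replay.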

2.1.5 Advanced Password Hashing

Because of the many weak spots in basic password hashing that have been indicated above, advanced techniques have been introduced and implemented.

Salts

Salts are an effective way to completely avoid many of the discussed downsides of plain hashing. A salt is a random value that is used as part of the hash function input. For instance, the salt can just be appended to the password before hashing. For every stored password, a fresh salt is generated and stored with it. The salt does not have to be stored in an encrypted fashion, neither does it have to be kept secret, although there is also no obvious reason for publishing it. A salt has the following effects: first of all, even if multiple users are using the same password, this cannot be deduced from the hash digests. Otherwise a collision would have been found. Furthermore, each password has to be attacked on its own. An attacker cannot test a candidate and compare it against all stored passwords because the explicit salt has to be added to the password. This slows down the attacker by a factor of the number of stored passwords. Note that those advantages only take effect if multiple passwords are attacked, but we should always assume that an attacker possesses additional password hashes, for example from other password files. Finally, a very positive effect of salts is that they prevent attacks based on rainbow tables. A single table would have to be created for each salt. Hence, if the salts are big enough, for example 256 bits, this is rendered infeasible.
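A minimal sketch of the salted-hashing scheme just described. The hash here is a toy 64-bit FNV-1a, a stand-in for a real cryptographic hash function, and the function names are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy 64-bit FNV-1a hash; a stand-in for a real cryptographic hash. */
static uint64_t toy_hash(const uint8_t *data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}

/* The scheme from the text: hash salt || password. Inputs are assumed
 * to fit the fixed buffer in this sketch. */
uint64_t salted_hash(const char *salt, const char *pwd) {
    uint8_t buf[128];
    size_t sl = strlen(salt), pl = strlen(pwd);
    memcpy(buf, salt, sl);
    memcpy(buf + sl, pwd, pl);
    return toy_hash(buf, sl + pl);
}
```

With fresh salts, two users with the same password receive different digests, so equal passwords can no longer be spotted in the password file, and a precomputed table would have to be built per salt.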

Key Stretching

One of the main drawbacks of traditional password hashing results from the fact that the cryptographic hash functions typically have not been designed for this purpose. They have been designed with the goal to be as fast as possible, which raises the efficiency with which the attacker can search through the key space. This does not pose a threat for traditional schemes attacked by brute force; after all, the key size can easily be increased. For user-chosen passwords, however, it is desirable to slow down the process of testing password candidates as much as possible.

To that end, two similar approaches have been introduced. Key strengthening [ALN97, Man96] uses a secret salt which is thereafter securely deleted. As a consequence, the secret salt has to be recovered by an exhaustive search when a password is provided for authentication. Key stretching [KSHW98] adds additional complexity to the hash function, which is done by running many iterations of it. Among these two techniques, key stretching came out on top. Today, all sophisticated functions used for password hashing have an iterative implementation at heart. It must be noted that, in practice, key stretching cannot be used to make password cracking arbitrarily difficult. The server must be able to compute the password hash in a reasonable amount of time which is tolerable for the user who is waiting to be authenticated. A very important feature of key stretching is its parameterizability. Considering Moore's Law [Moo65], it must be possible to regularly adjust the difficulty of password cracking. The password itself is unsuited for this task. Therefore, a cost parameter like the number of iterations in key stretching is desirable.
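The iterative idea can be sketched in a few lines. The mixing step below is a toy 64-bit mixer standing in for a real hash function; c is the adjustable cost parameter described in the text, and the names are ours:

```c
#include <assert.h>
#include <stdint.h>

/* Key stretching sketch: iterate a (toy) mixing step c times. */
uint64_t stretch(uint64_t h, uint32_t c) {
    for (uint32_t i = 0; i < c; i++) {
        h ^= h >> 33;                      /* toy stand-in for one */
        h *= 0xff51afd7ed558ccdULL;        /* iteration of a real  */
        h ^= h >> 33;                      /* cryptographic hash   */
    }
    return h;
}
```

Doubling c doubles both the defender's one-time cost and the attacker's per-guess cost, which is exactly the knob that Moore's Law requires.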

Thwarting Parallelism

Neither salts nor a parameterized running time of the hash function take into account that attackers are using massively parallelized algorithms on special platforms such as GPUs, FPGAs, and ASICs. While a single hash might not be computed very fast, those platforms are particularly well-suited for achieving high hash rates, that is, the number of hashes produced per second. Parallel platforms are already extensively used for password cracking. Especially GPUs are widely used because lots of users have powerful graphics cards in their computers anyway. The password recovery tool oclHashcat [Has14] provides GPU-accelerated implementations of many password hash functions. As we see, it is reasonable to develop algorithms that do not just slow down the process of password hashing, but explicitly thwart parallelized implementations. This is not trivial, because trying out many different passwords is an inherently parallel task. The scrypt algorithm presented by Percival [Per09] suggests extensive usage of RAM to increase the area and therefore the cost of custom hardware. Within this thesis, a parallelized GPU implementation of scrypt is presented.

2.2 scrypt

The key derivation function scrypt was presented by Percival in 2009 [Per09]. It is designed to thwart custom hardware attacks. The main idea of scrypt is to force an attacker to use a large amount of memory, thus having to use more VLSI area and resulting in large attack costs. In what follows, we give a detailed top-down explanation of scrypt and the underlying algorithms scryptROMix, scryptBlockMix, and the Salsa20/8 core. However, we assume PBKDF2 [Kal00], HMAC [Nat08], and SHA-256 [Nat12] to be known and do not treat them here.

Note on terminology: The original document describes multiple generic algorithms and then defines scrypt as a special case of using these algorithms. Here, we will only describe the special case of scrypt. To correctly name those special cases of the generic algorithms, we use the terminology of the corresponding Internet Draft [PJ12]. However, we might sometimes refer to scryptROMix as SMix (which is the original name from [Per09]) because it is shorter and easier to handle in figures.

Note on scrypt’s name: The name “scrypt” was presumably chosen referring to the UNIX crypt function and using the “s” as a hint at the employed Salsa algorithm. (Analogous to bcrypt being named after crypt with the “b” indicating the employed algorithm).

2.2.1 Overview

scrypt is a key derivation function that takes six input values and produces a derived key as output:

DK = scrypt(P, S, N, r, p, dkLen)    (2.4)

The input and output parameters have the following meaning:

P       Password
S       Salt
N       Memory cost parameter
r       Block size parameter
p       Parallelization parameter
dkLen   Length of derived key in bytes
DK      Derived key

Among these parameters, some size restrictions do apply. They are discussed in Section 2.2.5 since knowledge of the inner workings of scrypt is needed to understand them. Several sub-algorithms are applied in scrypt. The main part consists of p parallel executions of scryptROMix, which includes scryptBlockMix and the Salsa20/8 core. Before and after that, PBKDF2 with iteration count c = 1 is applied together with HMAC using SHA-256. This high level description is illustrated in Figure 2.2.

2.2.2 scryptROMix

Taking the parameters N, r, and some input B of 128r bytes length, the algorithm scryptROMix implements the idea of enforcing the usage of a large amount of RAM. It first produces pseudorandom values and then accesses them in pseudorandom order. For doing that, scryptBlockMix is used as a subroutine. The detailed steps can be seen in Algorithm 2.2.1. In lines 2 to 5 of the algorithm, some memory area V of size N · 128r bytes is filled by repeatedly calling scryptBlockMix and storing the intermediate results into V. The

Figure 2.2: Overview of scrypt. The data widths are given in bytes. SMix is short for scryptROMix.

number of iterations is N and each iteration produces 128r bytes. This process of filling the memory area is depicted in Figure 2.3. The last 128r bytes written to memory, V[N − 1], are processed by scryptBlockMix one more time. The result X is used as the initial value for the second part of the algorithm in lines 6 to 9. By applying a bijective function Integerify from {0, 1}^(1024r) to {0, ..., 2^(1024r) − 1}, the next memory address is computed as:

j = Integerify(X) mod N    (2.5)

As we will see in a moment, scryptBlockMix processes the data in 64 byte chunks because this is the input and output size of the Salsa20/8 core. Therefore, the 128r byte output of scryptBlockMix has the form (B[0], ..., B[2r − 1]). Integerify(B[0], ..., B[2r − 1]) is defined as the integer resulting from the little-endian representation of B[2r − 1]. The value located at position j in the memory is xored to X and processed by scryptBlockMix, yielding the next input for Integerify. This is repeated N times, eventually leading to the final output of scryptROMix.

2.2.3 scryptBlockMix

The input parameters of scryptBlockMix are the input B and the block size parameter r. B consists of 128r bytes that are chunked into 2r blocks of 64 bytes (B[0], ..., B[2r − 1]). First, a work variable X is initialized to B[2r − 1]. Then, it is xored to the first block

Algorithm 2.2.1: scryptROMix
Data: B, r, N
Result: B′
1   X ← B;
2   for i ← 0 to N − 1 do
3       V[i] ← X;
4       X ← scryptBlockMix(X);
5   end
6   for i ← 0 to N − 1 do
7       j ← Integerify(X) mod N;
8       X ← scryptBlockMix(X ⊕ V[j]);
9   end
10  B′ ← X;

Figure 2.3: The process of filling the memory with scryptROMix. Memory sizes are given in bytes.

and processed by the Salsa20/8 core. The result is xored to the second block and processed by the Salsa20/8 core. This is repeated for all blocks and the intermediate outputs of the Salsa20/8 core are stored as (Y[0], ..., Y[2r − 1]). Eventually, the output is rearranged to (Y[0], Y[2], ..., Y[2r − 2], Y[1], Y[3], ..., Y[2r − 1]). The complete algorithm scryptBlockMix is shown in Figure 2.4.
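The chaining and the final even/odd rearrangement can be sketched as follows. The 64-byte mixing function is a toy stand-in for the Salsa20/8 core, and all names are ours:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 64  /* input/output size of the Salsa20/8 core in bytes */

/* Toy stand-in for the Salsa20/8 core: any deterministic 64-byte to
 * 64-byte map illustrates the data flow. */
static void mix64(uint8_t out[CHUNK], const uint8_t in[CHUNK]) {
    for (int k = 0; k < CHUNK; k++)
        out[k] = (uint8_t)(in[k] * 31u + (unsigned)k + 1u);
}

/* scryptBlockMix data flow on a 128r-byte buffer b (supports r <= 64):
 * chain X through all 2r chunks, then write even-indexed intermediate
 * results first and odd-indexed ones last. */
void block_mix(uint8_t *b, size_t r) {
    uint8_t x[CHUNK], t[CHUNK], y[128 * 64];    /* y holds the 2r chunks */
    memcpy(x, b + (2 * r - 1) * CHUNK, CHUNK);  /* X <- B[2r-1] */
    for (size_t i = 0; i < 2 * r; i++) {
        for (int k = 0; k < CHUNK; k++)          /* X <- X xor B[i] */
            t[k] = x[k] ^ b[i * CHUNK + k];
        mix64(x, t);                             /* X <- core(X xor B[i]) */
        memcpy(y + i * CHUNK, x, CHUNK);         /* Y[i] <- X */
    }
    for (size_t i = 0; i < r; i++) {             /* output rearrangement */
        memcpy(b + i * CHUNK, y + 2 * i * CHUNK, CHUNK);             /* Y[0],Y[2],... */
        memcpy(b + (r + i) * CHUNK, y + (2 * i + 1) * CHUNK, CHUNK); /* Y[1],Y[3],... */
    }
}

/* Deterministic self-check: run on an all-zero 128r-byte input and
 * return the byte sum of the output. */
uint64_t block_mix_checksum(size_t r) {
    uint8_t b[128 * 64] = {0};
    block_mix(b, r);
    uint64_t s = 0;
    for (size_t i = 0; i < 2 * r * CHUNK; i++) s += b[i];
    return s;
}
```

The real algorithm simply replaces mix64 with the Salsa20/8 core of the next section.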

2.2.4 Salsa20/8 Core

Salsa20 is a family of stream ciphers designed by Bernstein [Ber08]. The Salsa20 core is a function used in Salsa20 that maps a 64 byte input to a 64 byte output. Depending on the number of rounds l, the core is called the Salsa20/l core. For scrypt, l = 8 rounds are used, leading to the Salsa20/8 core. The 64 byte input is viewed in little-endian as 16 words that are arranged in a 4 × 4 matrix. Within one round, each column is processed by a quarter round. Within those quarter rounds, each word is updated by summing two words of the column modulo 2^32,

Figure 2.4: The scryptBlockMix algorithm. Each block stores 64 bytes.

rotating the result by a fixed distance and xoring it into another word of the column. After each round, the matrix is transposed, which leads to l/2 identical double-rounds where the transposition can be replaced by working on the rows instead of the columns. Eventually, the 16 resulting words are added to the 16 original words, where the addition is again computed modulo 2^32. Listing 2.1 shows the C code of the Salsa20/8 core. It is the code presented by Bernstein [Ber], modified for l = 8 rounds and extended by comments.

#define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))
void salsa20_word_specification(uint32 out[16], uint32 in[16])
{
  int i;
  uint32 x[16];
  for (i = 0; i < 16; ++i) x[i] = in[i];
  for (i = 8; i > 0; i -= 2) { /* 8 rounds are processed as 4 double-rounds. */
    /* Start first half of the double-round. */
    /* First quarter round: work on first column. */
    x[ 4] ^= R(x[ 0]+x[12], 7);  x[ 8] ^= R(x[ 4]+x[ 0], 9);
    x[12] ^= R(x[ 8]+x[ 4],13);  x[ 0] ^= R(x[12]+x[ 8],18);
    /* Second quarter round: work on second column. */
    x[ 9] ^= R(x[ 5]+x[ 1], 7);  x[13] ^= R(x[ 9]+x[ 5], 9);
    x[ 1] ^= R(x[13]+x[ 9],13);  x[ 5] ^= R(x[ 1]+x[13],18);
    /* Third quarter round: work on third column. */
    x[14] ^= R(x[10]+x[ 6], 7);  x[ 2] ^= R(x[14]+x[10], 9);
    x[ 6] ^= R(x[ 2]+x[14],13);  x[10] ^= R(x[ 6]+x[ 2],18);
    /* Fourth quarter round: work on fourth column. */
    x[ 3] ^= R(x[15]+x[11], 7);  x[ 7] ^= R(x[ 3]+x[15], 9);
    x[11] ^= R(x[ 7]+x[ 3],13);  x[15] ^= R(x[11]+x[ 7],18);

    /* Start second half of the double-round. */
    /* First quarter round: work on first row. */
    x[ 1] ^= R(x[ 0]+x[ 3], 7);  x[ 2] ^= R(x[ 1]+x[ 0], 9);
    x[ 3] ^= R(x[ 2]+x[ 1],13);  x[ 0] ^= R(x[ 3]+x[ 2],18);
    /* Second quarter round: work on second row. */
    x[ 6] ^= R(x[ 5]+x[ 4], 7);  x[ 7] ^= R(x[ 6]+x[ 5], 9);
    x[ 4] ^= R(x[ 7]+x[ 6],13);  x[ 5] ^= R(x[ 4]+x[ 7],18);
    /* Third quarter round: work on third row. */
    x[11] ^= R(x[10]+x[ 9], 7);  x[ 8] ^= R(x[11]+x[10], 9);
    x[ 9] ^= R(x[ 8]+x[11],13);  x[10] ^= R(x[ 9]+x[ 8],18);
    /* Fourth quarter round: work on fourth row. */
    x[12] ^= R(x[15]+x[14], 7);  x[13] ^= R(x[12]+x[15], 9);
    x[14] ^= R(x[13]+x[12],13);  x[15] ^= R(x[14]+x[13],18);
  }
  /* Add resulting values to the original ones. */
  for (i = 0; i < 16; ++i) out[i] = x[i] + in[i];
}

Listing 2.1: The Salsa20/8 core implemented in C code. Endianness must be handled by the caller.
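The same computation can be stated more compactly by driving the eight quarter-rounds of a double-round from an index table. The restatement below is ours but follows Listing 2.1 line by line; a simple sanity check is that the all-zero state is a fixed point of the core, since every addition, rotation, and xor of zeros is zero:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t uint32;
#define R(a,b) (((a) << (b)) | ((a) >> (32 - (b))))

/* Compact restatement of the Salsa20/8 core from Listing 2.1: the
 * first four index rows are the columns, the last four the rows. */
void salsa20_8_core(uint32 out[16], const uint32 in[16]) {
    static const int idx[8][4] = {
        { 0, 4, 8,12}, { 5, 9,13, 1}, {10,14, 2, 6}, {15, 3, 7,11}, /* columns */
        { 0, 1, 2, 3}, { 5, 6, 7, 4}, {10,11, 8, 9}, {15,12,13,14}, /* rows    */
    };
    uint32 x[16];
    for (int i = 0; i < 16; i++) x[i] = in[i];
    for (int d = 0; d < 4; d++)            /* 8 rounds = 4 double-rounds */
        for (int q = 0; q < 8; q++) {
            const int *p = idx[q];
            x[p[1]] ^= R(x[p[0]] + x[p[3]],  7);
            x[p[2]] ^= R(x[p[1]] + x[p[0]],  9);
            x[p[3]] ^= R(x[p[2]] + x[p[1]], 13);
            x[p[0]] ^= R(x[p[3]] + x[p[2]], 18);
        }
    for (int i = 0; i < 16; i++) out[i] = x[i] + in[i];
}

/* Sanity check: the all-zero state maps to the all-zero state. */
int salsa_zero_is_fixed_point(void) {
    uint32 in[16] = {0}, out[16];
    salsa20_8_core(out, in);
    for (int i = 0; i < 16; i++) if (out[i] != 0) return 0;
    return 1;
}
```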

2.2.5 Size Restrictions

When choosing the scrypt parameters, the following size restrictions apply:

1. N must be larger than 1, a power of 2, and less than 2^(1024r/8).

2. p must be less than or equal to ((2^32 − 1) · 32)/(128r).

3. dkLen must be less than or equal to (2^32 − 1) · 32.

The upper bound for N is different from the one given in [PJ12]. There, the upper bound is 2^(128r/8), which we assume to be erroneous due to the circumstance that some sizes that are given in bits in [Per09] have been changed to bytes. The original bound was 2^(k/8) where k was 1024r bits. The fact that this was later changed to 128r bytes does not justify a new bound for N. Anyway, neither of those bounds is likely to be a constraint in a practical implementation. N must be larger than 1 because N = 1 would not make any sense: Integerify would not be needed and there would not be any randomized memory access. Choosing N as a power of 2 simplifies the modulo operation in the function Integerify. Finally, the upper bound is probably chosen to prevent collisions. All together, N pseudorandom values of size 1024r bits are stored. From the birthday problem, we know that N = sqrt(2^(1024r)) = 2^(1024r/2) would yield a collision probability of ≈ 50%. This is of course too high, so an upper bound of 2^(1024r/8) is chosen, which dramatically reduces the probability, practically rendering a collision impossible. The upper bounds for p and dkLen can be directly derived from the upper bound for the derived key of PBKDF2 [Kal00]. PBKDF2 restricts the derived key length to (2^32 − 1) · hLen bytes, where hLen is the output size in bytes of the pseudorandom function. Here, since we use HMAC with SHA-256, hLen is 32. In the case of p, attention must be paid because the initial PBKDF2 derives a key of p · 128 · r bytes. The denominator of the corresponding upper bound takes care of this.
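The birthday argument can be made explicit. With N values drawn approximately uniformly from a space of size 2^(1024r), the standard approximation gives

```latex
\Pr[\text{collision}] \approx \frac{N^2}{2 \cdot 2^{1024r}},
\qquad
N = 2^{1024r/8} \;\Rightarrow\;
\Pr \approx 2^{2 \cdot \frac{1024r}{8} - 1024r - 1}
        = 2^{-\frac{3}{4}\cdot 1024r - 1}.
```

Already for r = 1 this is 2^(−769), which supports the claim that a collision is practically impossible.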

3 Cost Estimations

This chapter analyzes the cost estimations made in the original scrypt publication [Per09]. We reverse-engineer and discuss the method of computing the estimations. In doing so, we find out that the cost estimations for PBKDF2, bcrypt, and scrypt have been based on inaccurate counts of the underlying cryptographic primitives. We present the table that would have resulted from an accurate count. Also, we learn that scrypt’s cost estimations are effectively only memory costs with less than 1% costs for logic gates.

3.1 Provided Information

In [Per09], Percival estimates the cost of hardware to crack a password in 1 year for several password hashing algorithms.

KDF               6 letters   8 letters   8 chars   10 chars   40 char text   80 char text
DES crypt         < $1        < $1        < $1      < $1       < $1           < $1
MD5               < $1        < $1        < $1      $1.1k      $1             $1.5T
MD5 crypt         < $1        < $1        $130      $1.1M      $1.4k          $1.5 × 10^15
PBKDF2 (100 ms)   < $1        < $1        $18k      $160M      $200k          $2.2 × 10^17
bcrypt (95 ms)    < $1        $4          $130k     $1.2B      $1.5M          $48B
scrypt (64 ms)    < $1        $150        $4.8M     $43B       $52M           $6 × 10^19
PBKDF2 (5.0 s)    < $1        $29         $920k     $8.3B      $10M           $11 × 10^18
bcrypt (3.0 s)    < $1        $130        $4.3M     $39B       $47M           $1.5T
scrypt (3.8 s)    $900        $610k       $19B      $175T      $210B          $2.3 × 10^23

Table 3.1: Percival’s hardware cost estimations in dollar-years. The running times inside the parentheses refer to a single execution on a 2.5 GHz Intel Core 2 Duo processor.

The covered algorithms are DES crypt [MT79], MD5 [Riv92], MD5 crypt [Kam94], PBKDF2 [Kal00], bcrypt [PM99], and of course scrypt. Table 3.1 shows the cost estimations presented in [Per09]. The details of the underlying algorithms are shown in Table 3.2. First, we want to understand how the cost estimations have been computed. Percival does not explain this computation in detail, so we reverse-engineered the table and now walk through the computation step by step.

KDF               Details
DES crypt         ≈ 25 iterations of DES.
MD5               Plain MD5.
MD5 crypt         At heart, ≈ 1000 iterations of MD5.
PBKDF2 (100 ms)   PBKDF2_HMAC_SHA-256 with iteration count c = 86,000.
bcrypt (95 ms)    bcrypt with cost = 11.
scrypt (64 ms)    scrypt with (N, r, p) = (2^14, 8, 1).
PBKDF2 (5.0 s)    PBKDF2_HMAC_SHA-256 with iteration count c = 4,300,000.
bcrypt (3.0 s)    bcrypt with cost = 16.
scrypt (3.8 s)    scrypt with (N, r, p) = (2^20, 8, 1).

Table 3.2: Details of the analyzed KDFs. The running times inside the parentheses refer to a single execution on a 2.5 GHz Intel Core 2 Duo processor.

As is apparent from Table 3.1, six different key spaces (or password spaces) have been taken into account. Those 6 types of passwords are presented in [Per09] as follows:

∙ A random sequence of 6 lower-case letters.
∙ A random sequence of 8 lower-case letters.
∙ A random sequence of 8 characters from the 95 printable 7-bit ASCII characters.
∙ A random sequence of 10 characters from the 95 printable 7-bit ASCII characters.
∙ A 40 character string of text.
∙ An 80 character string of text.

For estimating the size of the key spaces of the text strings, the guessing entropy estimations from NIST [Nat13, Appendix A.2.1] are consulted:

∙ The entropy of the first character is taken to be 4 bits.
∙ The entropy of characters 2 to 8 is taken to be 2 bits each.
∙ The entropy of characters 9 to 20 is taken to be 1.5 bits each.
∙ The entropy of characters 21 and above is taken to be 1 bit each.

Given this information, the sizes of the key spaces that are being searched in the brute-force attack can be computed. They are shown in Table 3.3. For the 40 character text string, the number of bits is computed as 4 + 7 · 2 + 12 · 1.5 + 20 · 1 = 56. For the 80 character text string, the number of bits is computed analogously as 4 + 7 · 2 + 12 · 1.5 + 60 · 1 = 96. On average, half of the key space needs to be searched until the correct key is found. So, the number of hashes to compute and the time in which they must be computed are now known. Thus, we can compute the number of hashes per second that the hardware must provide. For instance, taking the six letters case, the hardware must compute (26^6/2)/(365 · 24 · 60 · 60) hashes per second. The results for all password types are presented in Table 3.4. At this point, we know what the hardware must be capable of. But what we actually want to know is the cost for such hardware. To answer this question, Percival looks

              6 letters   8 letters   8 chars   10 chars   40 char text   80 char text
Total size    26^6        26^8        95^8      95^10      2^56           2^96
Size in bits  28.2        37.6        52.6      65.7       56             96

Table 3.3: Sizes of the key spaces that are searched by a brute-force attack in Percival’s cost estimations.

      6 letters   8 letters   8 chars   10 chars   40 char text   80 char text
H/s   4.9         3.3k        105.2M    949.3G     1.1G           1.3 × 10^21

Table 3.4: Number of password hashes per second that the hardware must provide to run a successful brute-force attack in one year.
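The rates in Table 3.4 follow from a one-line computation; a hedged sketch (function name ours):

```c
#include <assert.h>

/* Hashes per second needed to search half of a key space in one year,
 * the setting of Table 3.4. */
double required_hash_rate(double keyspace) {
    const double seconds_per_year = 365.0 * 24 * 60 * 60;
    return (keyspace / 2.0) / seconds_per_year;
}
```

For example, required_hash_rate(308915776.0), i.e. 26^6, gives ≈ 4.9, and required_hash_rate(6634204312890625.0), i.e. 95^8, gives ≈ 1.05 × 10^8, matching the 4.9 and 105.2M entries of Table 3.4.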

at the underlying cryptographic primitives, that is, DES [Nat99], MD5 [Riv92], SHA-256 [Nat12], Blowfish [Sch94], and Salsa20/8 [Ber08]. He gives the following estimations for the size and performance:

∙ A DES circuit with ≈ 4000 gates of logic can encrypt data at 2000 Mbps.
∙ An MD5 circuit with ≈ 12,000 gates of logic can hash data at 2500 Mbps.
∙ A SHA-256 circuit with ≈ 20,000 gates of logic can hash data at 2500 Mbps.
∙ A Blowfish circuit with ≈ 22,000 gates of logic and 4096 bytes of SRAM can encrypt data at 1000 Mbps.
∙ A Salsa20/8 circuit with ≈ 24,000 gates of logic can output a key stream at 2000 Mbps.

Additionally, the following estimations of manufacturing cost¹ are provided:

∙ A gate of random logic requires ≈ 5 µm² of VLSI area.
∙ A bit of SRAM requires ≈ 2.5 µm² of VLSI area.
∙ A bit of DRAM requires ≈ 0.1 µm² of VLSI area.
∙ VLSI circuits cost ≈ 0.1 µ$/µm².

3.2 Example: Cost Estimations for MD5

We cannot directly apply this information for obtaining the estimated cost. Three consecutive steps are necessary. First, Table 3.4 must be modified, so that it does not count the number of password hashes per second, but the number of executions of the underlying cryptographic primitive per second. Then, based on the information about those cryptographic primitives, the number of logic gates and the size of the memory has to be

¹ Percival’s estimations of manufacturing cost are based on 130 nm logic in 2002.

computed. Finally, the VLSI area and its cost can be derived. This procedure is now explained in detail for MD5 in Equations 3.1 to 3.4. The first step is fairly easy. The underlying primitive is MD5, of course, and it has to be executed once per password, so 95^10/2 times if we want to crack a 10 character password. This results in an unchanged rate of 949.3 · 10^9 primitives per second. Next, we have a closer look at the given MD5 circuit. It can hash data at 2500 Mbps. The block size of MD5 is 512 bits, so we know that a 10 character password always needs exactly one block. Therefore, the number of hashes that can be computed per second is

2500 · 10^6 / 512 = 4882812.5    (3.1)

Obviously, this hash rate is way too low. That is why we apply the rule of three to deduce the number of gates for the higher hash rate:

(949.3 · 10^9 / 4882812.5) · 12000 = 2.33 · 10^9    (3.2)

Finally, the VLSI area is computed as

2.33 · 10^9 · 5 µm² = 11.66 · 10^9 µm²    (3.3)

and the resulting cost is

11.66 · 10^9 · 0.1 µ$ = 1.16 · 10^3 $    (3.4)

The rounded result is $1.2k, which is very close to the $1.1k provided by Percival. This and the fact that all other values for MD5 computed that way match Table 3.1 precisely make us believe that exactly this procedure, or a very similar one with just subtle differences, was used to compute the table.
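The four steps can be collected into one small function. This is our sketch of the reverse-engineered pipeline, not Percival's code, and the parameter names are ours:

```c
#include <assert.h>

/* Percival-style cost pipeline (Equations 3.1 to 3.4): scale the
 * reference circuit up to the required primitive rate, then price the
 * resulting VLSI area at 5 um^2 per gate and 0.1 micro-dollars per um^2. */
double circuit_cost_dollars(double primitives_per_sec, /* required rate        */
                            double circuit_mbps,       /* reference throughput */
                            double block_bits,         /* primitive block size */
                            double gates)              /* gates per circuit    */
{
    double circuit_rate = circuit_mbps * 1e6 / block_bits;           /* Eq. 3.1 */
    double total_gates  = primitives_per_sec / circuit_rate * gates; /* Eq. 3.2 */
    double area_um2     = total_gates * 5.0;                         /* Eq. 3.3 */
    return area_um2 * 0.1e-6;                                        /* Eq. 3.4 */
}
```

For the MD5 example, circuit_cost_dollars(949.3e9, 2500, 512, 12000) reproduces the roughly $1.2k figure derived above.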

3.3 Counting the Cryptographic Primitives

The more complex part begins when the relationship between the KDF and the cryptographic primitive is not so clear anymore. MD5 crypt is still easy: the values from MD5 have just been multiplied by 1000. DES crypt is also easy because it is just 25 DES iterations. Furthermore, the effective size of the key space for DES crypt cannot be greater than 56 bits (that is, the size of a DES key), therefore larger key spaces have no effect and all costs stay below $1. For PBKDF2, bcrypt, and scrypt, the number of executions of the underlying primitives depends on the parameters. Making the assumption that the derived key is not bigger than 512 bits, PBKDF2 runs c iterations of the underlying pseudorandom function (PRF). In our case, that means 86,000 iterations of HMAC. Each run of HMAC includes 2 runs of the underlying hash function, which is SHA-256. Thus, to hash 95^10/2 passwords per year with PBKDF2, we require a SHA-256 hash rate of (95^10/2) · 86000 · 2 hashes per year. In the following, the resulting cost is computed analogously to Equations 3.1 to 3.4. First, the hash rate provided by the given SHA-256 circuit is computed as

2500 · 10^6 / 512 = 4882812.5    (3.5)

This is the same value as before, because the block size is again 512 bits and the circuit has the same speed. Next, the number of gates is computed.

(949.3 · 10^9 · 86000 · 2 / 4882812.5) · 20000 = 668.79 · 10^12    (3.6)

This leads to the following VLSI area.

668.79 · 10^12 · 5 µm² = 3.34 · 10^15 µm²    (3.7)

Finally, the cost can be computed.

3.34 · 10^15 · 0.1 µ$ = 334.39 · 10^6 $    (3.8)

The alert reader will have noticed that this figure does not match the figure from Table 3.1. It is twice as much. The same holds true if we compute and round the costs for the other key spaces. Percival apparently did not take into account that each of the 86,000 iterations again implies two executions of SHA-256. While reverse-engineering the rest of the table, it turns out that the considered number of Blowfish executions for bcrypt and the number of Salsa20/8 executions for scrypt are also inaccurate. Table 3.5 lists those differences.

KDF      Primitive   Count in [Per09]   Precise count                       Error factor
PBKDF2   SHA-256     c                  2c                                  0.5
bcrypt   Blowfish    2^cost · 512       2^cost · 2 · 521 + 521 + 64 · 3     ≈ 0.5
scrypt   Salsa20/8   N · r · p          2N · 2r · p                         0.25

Table 3.5: Comparison for the cost adaptable KDFs between the numbers of cryptographic primitives presumably counted in [Per09] and the precise numbers. c is the iteration count in PBKDF2, cost is the cost factor in bcrypt, (N, r, p) are the scrypt parameters.

Taking the values from the third column in Table 3.5 nearly always yields exactly the numbers from Table 3.1, so we are confident that those numbers have actually been used for computing that table. Table 3.6 shows the corresponding reverse-engineered cost estimations. In some cases, the values differ, which is emphasized in the table. Although there seem to be many differing values, note that those differences are always just marginal. For example, computing the cost estimation for 8 characters and bcrypt with cost = 11, our result is $135,472, but the corresponding entry in Table 3.1 is $130k while our result

is rounded to $140k. These differences are probably the result of a different floating point precision or rounding in between the steps. In particular, the rounding in [Per09] does not seem to follow a strict convention. For example, in the “80 char text” column, 2.2 × 10^17 has been rounded to the first decimal place, but 5.7 × 10^19 becomes 6 × 10^19. When recomputing the table, we always rounded the numbers to the same representation that was used in the corresponding cell of the original table. But still, we cannot completely eliminate the possibility that a slightly different computation was used when computing Table 3.1. Also worth mentioning is the factor of 512 within the Blowfish count. There is just a slight difference between the estimations computed with 512 and those computed with 521. As we will explain in a moment, 512 does not consider the so-called P-Array within Blowfish. The results obtained with 512 are closer to the numbers in Table 3.1, so we assume that 512 was used, but we cannot tell for sure. Nevertheless, this does not influence the error factor of 0.5.

KDF               6 letters   8 letters   8 chars    10 chars   40 char text   80 char text
DES crypt         < $1        < $1        < $1       < $1       < $1           < $1
MD5               < $1        < $1        < $1       *$1.2k*    $1             $1.5T
MD5 crypt         < $1        < $1        $130       *$1.2M*    $1.4k          $1.5 × 10^15
PBKDF2 (100 ms)   < $1        < $1        *$19k*     *$170M*    $200k          $2.2 × 10^17
bcrypt (95 ms)    < $1        $4          *$140k*    $1.2B      $1.5M          $48B
scrypt (64 ms)    < $1        $150        $4.8M      $43B       $52M           $6 × 10^19
PBKDF2 (5.0 s)    < $1        $29         *$930k*    *$8.4B*    $10M           $11 × 10^18
bcrypt (3.0 s)    < $1        *$140*      $4.3M      $39B       $47M           $1.5T
scrypt (3.8 s)    $900        $610k       $19B       $175T      $210B          $2.3 × 10^23

Table 3.6: Reverse-engineered hardware cost estimations in dollar-years. Differences to the original values are marked with asterisks. The running times inside the parentheses refer to a single execution on a 2.5 GHz Intel Core 2 Duo processor.

As we already explained, the accurate number of SHA-256 executions in PBKDF2 comes from the c iterations of HMAC, each including 2 SHA-256 hashes. Now, we are going to explain the precise numbers of executions of Blowfish and Salsa20/8. Eventually, the computation of the memory costs is discussed. For now, just be told that scrypt’s overall estimated costs are effectively only memory costs.

bcrypt

The bcrypt algorithm, as presented in [PM99], is given in Algorithm 3.3.1. At the very end, there is a loop of 64 Blowfish encryptions in Electronic Codebook (ECB) mode. Since the constant input value OrpheanBeholderScryDoubt has a size of 192 bits and the block size of Blowfish is 64 bits, this results in 192/64 = 3 blocks. Thus, 64 · 3 executions of Blowfish happen in lines 3 to 5. But those 64 · 3 Blowfish executions are negligible compared to the number of Blowfish executions in the expensive key setup, that is, line 1. To understand the exact number, we must have a closer look at the

Algorithm 3.3.1: bcrypt
Data: cost, salt, pwd
Result: the derived key dk
1   state ← EksBlowfishSetup(cost, salt, pwd);   /* 2^cost · 2 · 521 + 521 × Blowfish */
2   ctext ← “OrpheanBeholderScryDoubt”;
3   for i ← 1 to 64 do
4       ctext ← EncryptECB(state, ctext);   /* 3 × Blowfish */
5   end
6   dk ← cost || salt || ctext;

EksBlowfishSetup function which is specified in Algorithm 3.3.2.

Algorithm 3.3.2: EksBlowfishSetup
Data: cost, salt, pwd
Result: the bcrypt state
1   state ← InitState();
2   state ← ExpandKey(state, salt, pwd);
3   for i ← 1 to 2^cost do
4       state ← ExpandKey(state, 0, salt);
5       state ← ExpandKey(state, 0, key);
6   end

Obviously, the ExpandKey function is executed 2^cost · 2 + 1 times. As we will explain in a moment, each ExpandKey call implies 521 Blowfish encryptions. Together, this gives us the number of 2^cost · 2 · 521 + 521 encryptions for the EksBlowfishSetup function. Adding the, in terms of workload actually irrelevant, 64 · 3 Blowfish executions from the final ECB encryption yields the precise number of 2^cost · 2 · 521 + 521 + 64 · 3 executions, as stated in Table 3.5. For any reasonable cost factor, the 521 + 64 · 3 executions do not have to be considered in an estimation, so the error factor is (2^cost · 512)/(2^cost · 2 · 521) ≈ 0.5. As promised, we now give a short explanation of why ExpandKey implies 521 Blowfish encryptions. Within ExpandKey, the Blowfish algorithm is used to iteratively update the state. This state consists of two parts: first, the so-called P-Array, which stores 18 words of 32 bits; second, the 4 S-Boxes, each containing 256 words of 32 bits. One execution of Blowfish generates 64 bits. Thus, we need 9 executions to update the 18 words in the P-Array and 512 executions to update the 1024 words in the S-Boxes. This results in 512 + 9 = 521 executions of Blowfish. As a note, it should be mentioned that setting the second argument of ExpandKey to zero does not affect this number. Actually, it converts the slightly more complex ExpandKey function into the key schedule originally designated for Blowfish in [Sch94]. Now that the precise count of Blowfish executions in bcrypt has been thoroughly worked out, we move on to the exact number of Salsa20/8 executions within scrypt.

scrypt

The scrypt algorithm has been described in detail in Chapter 2. Recall that the cost determining parameters are (N, r, p). In scrypt, the scryptROMix function is called p times in parallel. Each scryptROMix function includes two loops of N iterations, where each iteration calls scryptBlockMix. Thus, there are altogether 2N · p of those calls. Finally, scryptBlockMix involves 2r executions of Salsa20/8. This results in the overall number of 2N · 2r · p Salsa20/8 executions.
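The precise per-password primitive counts of Table 3.5 can be collected into small helpers (function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Precise primitive counts per password hash, following Table 3.5. */

uint64_t pbkdf2_sha256_count(uint64_t c) {
    return 2 * c;                          /* 2 SHA-256 per HMAC iteration */
}

uint64_t bcrypt_blowfish_count(uint64_t cost) {
    /* (2^cost * 2 + 1) ExpandKey calls of 521 encryptions each,
     * plus 64 * 3 from the final ECB loop. */
    return ((1ULL << cost) * 2 + 1) * 521 + 64 * 3;
}

uint64_t scrypt_salsa_count(uint64_t n, uint64_t r, uint64_t p) {
    return 2 * n * 2 * r * p;              /* 2N BlockMix calls, 2r Salsa each */
}
```

For cost = 11, bcrypt_blowfish_count gives 2,134,729 Blowfish executions, against Percival's presumed 2^11 · 512 = 1,048,576, which is where the error factor ≈ 0.5 comes from.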

Table 3.5 might indicate an erroneous estimation, but it is probably more of a simplification made to avoid digging deeply into the algorithms. Either way, it certainly did not simplify the process of understanding how the table was created. Table 3.7 shows the cost estimations that result from the more precise count of the underlying cryptographic primitives presented in Table 3.5. Note that, again, password length restrictions have been taken into account. As mentioned before, DES crypt can only accept a 56 bit key. Furthermore, bcrypt restricts the key size to 56 bytes. For bcrypt and scrypt, memory costs have also been taken into account.

KDF               6 letters   8 letters   8 chars   10 chars   40 char text   80 char text
DES crypt         < $1        < $1        < $1      < $1       < $1           < $1
MD5               < $1        < $1        < $1      $1.2k      $1             $1.5T
MD5 crypt         < $1        < $1        $130      $1.2M      $1.4k          $1.5 × 10^15
PBKDF2 (100 ms)   < $1        $1          $37k      $330M      $400k          $4.4 × 10^17
bcrypt (95 ms)    < $1        $9          $280k     $2.5B      $3M            $98B
scrypt (64 ms)    < $1        $600        $19M      $172B      $210M          $22 × 10^19
PBKDF2 (5.0 s)    < $1        $58         $1.9M     $17B       $20M           $22 × 10^18
bcrypt (3.0 s)    < $1        $280        $8.8M     $80B       $96M           $3.1T
scrypt (3.8 s)    $3.6k       $2.4M       $78B      $700T      $840B          $9.3 × 10^23

Table 3.7: Improved hardware cost estimations in dollar-years. The running times inside the parentheses refer to a single execution on a 2.5 GHz Intel Core 2 Duo processor.

3.4 Estimating the Memory Costs for bcrypt and scrypt

The last piece of the puzzle is the influence of memory on the cost estimations. This is only relevant for bcrypt and scrypt, but it is more relevant than one might expect. The first step for computing the memory costs is to compute the number of parallel circuits needed relative to the reference circuit. For instance, in the case of bcrypt, the reference circuit computes 1000 Mbps, that is,

(1000 · 10^6) / 64 = 15 625 000    (3.9)

Blowfish encryptions per second. If we want to crack an 8-letter password within one year that was hashed with 푐표푠푡 = 11, we need a Blowfish encryption rate of

(26^8 / 2) · 2^11 · 512 / (365 · 24 · 60 · 60) ≈ 3.5 · 10^9    (3.10)

Blowfish encryptions per second. Note that we took the erroneous value from Table 3.5 since we are explaining the reverse-engineered method originally applied in [Per09]. It can now be seen that we need

(3.5 · 10^9) / 15 625 000 ≈ 222    (3.11)

of the reference circuits to obtain the desired rate. Since every circuit needs 4096 · 8 bits of SRAM of size 2.5 휇m² per bit, the resulting costs can be computed as

222 · 4096 · 8 · 2.5 · 0.1 휇$ ≈ $1.8 (3.12)

That means that $1.8 of the $4 are due to memory. The proportion of 43% memory costs applies to all bcrypt cost estimations shown above. One last remark concerning bcrypt: one could argue that, to be fully correct, we should consider 4168 bytes of memory instead of 4096, since the state also contains the P-Array, so that 521 instead of 512 Blowfish executions must be counted and 9 · 64 additional bits must be stored. This is correct, and we have simplified slightly before. But obviously, just as before, this does not really make a difference; the error factor will still be ≈ 0.5.

For scrypt, the memory costs are estimated in basically the same way, but with DRAM instead of SRAM. But how much memory does scrypt need? The output of Salsa20/8 is 64 bytes, and there are 2푟 of those outputs generated in BlockMix. ROMix finally stores 푁 outputs of BlockMix, and all that is done 푝 times. So, the memory size of scrypt is 64 · 2푟 · 푁 · 푝 bytes. For (푁, 푟, 푝) = (2^14, 8, 1), this means that 99.12% of the estimated costs are memory costs. For (푁, 푟, 푝) = (2^20, 8, 1), the corresponding proportion is 99.99%, effectively meaning that scrypt's overall cost estimations presented in [Per09] are solely costs for memory.

3.5 Compression Functions vs Hashes

To conclude this chapter, we want to note that, to be fully accurate, there are still some errors in those cost estimations. Note that the estimations are based on the given speeds of circuits processing the according underlying cryptographic primitives. From that speed, we had to somehow compute how many runs of the underlying primitive can be accomplished. This was always done by using the block size of the according algorithm, that is, by assuming a hash could always be computed by a single run of the compression function. We now walk through the previous cost estimations and analyze whether they are affected by this circumstance.

MD5 We argued that an MD5 circuit that computes 2500 Mbps results in (2500 · 10^6)/512 hashes per second. But this only holds true for passwords of at most 512 bits. An 80 character text string results in an 80 · 8 = 640 bit password, which needs two executions of the MD5 compression function. Thus, the number of hashes per second for that case must be adjusted to (2500 · 10^6)/1024 and the resulting costs must be multiplied by two.

bcrypt and scrypt No changes have to be made for bcrypt and scrypt because the underlying algorithms do not depend on the password length. In case of bcrypt, Blowfish treats the password as a cyclic key, always resulting in 18 registers of 32 bits. For scrypt, Salsa20/8 always operates on 64 bytes; the password length is not an issue due to the preceding PBKDF2 execution.

PBKDF2 PBKDF2 computes a chain of 푐 HMACs as shown in Figure 3.1. If we assume the derived key not to be longer than 256 bits, which is a reasonable assumption for password hashing, only one of those chains has to be computed. (We then have 푖 = 1 and extract the first 푑푘퐿푒푛 bytes of 푇1 as the derived key.)

Figure 3.1: A hash chain that is computed within PBKDF2.

Each HMAC execution results in two runs of SHA-256, as already mentioned above. As we see in Figure 3.2, each of those runs in turn results in at least two compression functions because the inputs begin with the password padded to the block size, which already requires one compression function. For the first hash function, the rest of the input is the message, which is, in all but the first iteration, the output of the previous HMAC. Since the HMAC output size is equal to the output size of the underlying hash function, this results in 256 further bits and therefore only in one further compression function. In the first iteration, the message consists of the salt and a 32 bit counter. Since 256 bits is a reasonable size for a salt, this also requires only one further compression function. The second hash is a concatenation of the padded key and the output of the first hash. Thus, the overall input length is 512 + 256 = 768 bits. Therefore, this hash consists of two compression functions. Putting it all together, 푐 iterations require 4푐 compression functions, given that the password size does not exceed 512 bits. (If it does exceed 512 bits, it must be hashed beforehand, which results in two additional compressions.)

Figure 3.2: HMAC based on SHA-256.

As discussed in [DGK+12], the number of compression functions can be reduced to 2푐 + 2 by extracting the first compression function for the two hashes within HMAC. The key paddings do not change between iterations, so the according computations only need to be executed once. After all, we end up with 2푐 + 2 ≈ 2푐 compression functions, which is equal to one compression function per hash. So, in a rather lucky way, the number of PBKDF2 hashes resulting from the speed in bits per second was already estimated correctly when we adjusted Percival's count from 푐 to 2푐.

MD5 crypt Finally, MD5 crypt also needs more than one compression function per hash, but due to its complex design and the branches that depend on the input, determining the precise number of compression functions does not seem to be worth the effort, especially since MD5 crypt should be considered deprecated and there is no big interest in it anymore.

4 Implementation

In the following, we describe our CUDA C implementation of scrypt. Before going into the details, we give a short introduction to CUDA C programming. Then, the implementation is explained. Important parts are discussed in detail while unimportant parts are completely left out.

4.1 CUDA C Programming

Naturally, GPUs were originally designed for graphics processing. But researchers started to harness their massive parallelism for computing problems not related to graphics. This required making the GPU believe that some graphics problem had to be solved, even though this was not the case, and it required extensive knowledge of the according APIs. Because this cumbersome approach surely was an obstacle to many programmers without a graphics background, NVIDIA developed the Compute Unified Device Architecture (CUDA) [NVI] to enable general-purpose GPU programming. CUDA is a general-purpose parallel computing platform and programming model that allows programmers without any background in graphics processing to harness the parallel computing power of the GPU. This section gives a short introduction to the CUDA programming model in C style and is mainly based on the CUDA C Programming Guide [NVI13].

This introduction should not be considered a CUDA programming tutorial. There are whole books about writing CUDA C code, as outlined in Section 1.2. We rather give a very short overview of the programming model and the thread and memory hierarchy, hoping that readers who are not familiar with CUDA C are also able to comprehend what is going on in this chapter without necessarily understanding every line of code. Also, we do not explain any optimization techniques in this section. There are many techniques that might be applied in different situations for optimizing efficiency. We explain the employed optimization techniques on the fly when describing the implementation in Section 4.2.

4.1.1 Heterogeneous Model

CUDA C assumes a heterogeneous programming and compiler model. On the one hand, there is the code executed on the CPU, which is compiled with a standard C compiler. On the other hand, the code that runs on the GPU is compiled with NVIDIA's nvcc compiler. In this heterogeneous model, the CPU is referred to as the host and the GPU is referred to as the device. Following this model, the CUDA C language is just the standard

C language plus some CUDA extensions for programming the GPU and therefore very easy to learn for C programmers. Figure 4.1 illustrates the heterogeneous model.

Figure 4.1: In the heterogeneous model, the source code is divided into device and host code and compiled by the according compiler. The device can run many threads in parallel and is called from the host.¹

The heterogeneity also applies for the memory management. That means, CUDA assumes the host and the device to have their own memory spaces. Accordingly, the device cannot dereference a host pointer and the host cannot dereference a device pointer.

Interaction between the host and the device takes place through calls from the host to the CUDA runtime. This implies that the host is in charge and uses the device as a coprocessor. Basic calls to the CUDA runtime include memory allocations, memory copies, and kernel calls, where a kernel is a function that runs on the device. A typical way of using those calls is to first copy some workload to the device, then to call the kernel that processes the workload in parallel, and finally to copy the results back to the host.

4.1.2 Thread Hierarchy

When calling a kernel, the kernelName<<<gridSize, blockSize>>>(args); syntax is used. Grid size and block size are basically the parameters determining the number of parallel threads, where each thread executes the kernel in parallel. Multiple threads form a thread block. The size of such a thread block is selected with the second parameter

¹Figure 4.1 shows an abstract illustration of the control flow. Actually, a kernel call is asynchronous, which means that the second host code is executed directly after the first. Its running time can therefore be (partially) hidden in the device's running time.

in the triple angle brackets of the call. The first parameter determines the number of blocks, which is called the grid size. Blocks are assigned to Streaming Multiprocessors (SMs) and executed in groups of 32 threads, so-called warps. The programmer just needs to keep the block size and grid size within some hardware-imposed limits. The actual multiprocessor count does not need to be known; the runtime will take care of the scaling, as depicted in Figure 4.2.

Figure 4.2: The runtime handles the assignment of the blocks to the Streaming Multipro- cessors [NVI13].

Threads within a block have the ability to synchronize with each other. Also, they can share data via a special per-block shared memory, as we will see in Section 4.1.3. However, thread blocks run independently because they might be scheduled in any order by the runtime system.

Threads and blocks can be organized in up to three dimensions. For our application, we will always need only one dimension. That means that the block and grid sizes can be set by conventional integers. In the device code, a thread can then query its thread ID and block ID via threadIdx.x and blockIdx.x, respectively. In the same way, blockDim.x and gridDim.x return the according parameters that have been used in the triple angle brackets of the kernel call.

4.1.3 Memory Model

According to the thread hierarchy, there also exists a memory hierarchy. Roughly speaking, per-thread memory is very fast, per-block memory is still fast, and device memory is rather slow because it is not on-chip.

Each thread has private registers that have very low latency. Additionally, threads have their own local memory, which has a higher latency. It is mainly used if there are not

enough registers (register spilling). Registers and local memory have the same lifetime as the according thread. Threads from the same block can access the fast per-block shared memory. It can be used for inter-thread communication or as a fast user-managed cache. Shared memory has the same lifetime as the according block. Finally, there are memories which are accessible by any thread in any block. The most common is the global memory. Furthermore, there are two read-only memories: constant memory and texture memory. Those read-only memories might have better latency for specific memory access patterns. For example, constant memory is particularly well-suited when all threads in a warp read from the same memory location. The lifetime of device memories is managed by the host, which allocates and frees the memory. The whole memory model is illustrated in Figure 4.3.

Figure 4.3: Overview of the CUDA memory model.² The host can only access global, constant, and texture memory.

²Figure 4.3 is taken from [Spr11] where the author refers to the “CUDA Programming Guide, August

4.2 CUDA Based Password Hashing with scrypt

The implementation has been written in CUDA C with version 5.5. The employed graphics card is a GeForce GTX 480. Details of this card can be obtained from Appendix B. The description of the implementation is split into two sections. One for the host and one for the device.

4.2.1 Host Implementation

We give an overview of the host implementation by examining the important functions. Trivial parts, like allocating and freeing device memory, are skipped so that we do not get lost in details.

Determining the Degree of Parallelism

We are hashing many passwords in parallel, but of course we cannot hash all of them in parallel. The degree of parallelism we use depends on the memory that is needed per password: we hash as many passwords in parallel as possible. For example, if we need 16 MB per password, we cannot hash more than 1536/16 = 96 passwords in parallel because 1536 MB is the size of our global memory. Since our implementation is based on warps, that is, groups of 32 threads, the number of parallel passwords is always a multiple of 32. The process of computing that number depending on the cost determining parameters 푁 and 푟 is shown in Listing 4.1.

1  /* Get amount of free and total global memory.*/
2  size_t free_mem, total_mem;
3  cudaMemGetInfo(&free_mem, &total_mem);
4
5  /* Global memory needed per password.*/
6  size_t mem_needed = 128 * N_PARAM * R_PARAM;
7
8  /* Assure there is memory for at least 32 passwords.*/
9  if ((free_mem / mem_needed) < 32) {
10     printf("Not enough memory!\n");
11     exit(EXIT_FAILURE);
12 }
13
14 /* Compute number of possible warps and set to grid size.*/
15 int num_warps = 1;
16 while (2 * num_warps <= ((free_mem / mem_needed) / 32)) {
17     num_warps *= 2;
18 }
19 int num_parallel_pwds = num_warps * 32;
20
21 /* Set grid and block size for scryptROMix kernel.*/
22 dim3 gridSize(num_warps, 1, 1);
23 dim3 blockSize(32, 1, 1);

Listing 4.1: Computation of the degree of parallelism and the according grid and block size.

2010”. We could not confirm the original source. The “CUDA C Programming Guide” from August 2010 (and any other Programming Guide we looked at) does not contain that figure.

We always use blocks of 32 threads. The number of those blocks, that is, the grid size, is set to the greatest power of 2 that fits into global memory. In line 6, we can see that 푁 and 푟 influence the amount of memory used per password while 푝 does not. That is because, as we will see later, we implement 푝 iteratively and use the same memory space in each iteration.

Iterative Approach

Obviously, since we cannot hash an arbitrary number of passwords in parallel, we have to use an iterative approach. For example, if the number of parallel passwords determined in Section 4.2.1 is 4096 and we want to hash 26³ = 17576 passwords, we have to run 5 chunks of parallel passwords. To this end, we use two arrays on the host called h_searchStatus and h_charset. The latter is just a char array containing all characters that we want to search through. The former contains the current search status, where a 0 corresponds to the first character in the character set, a 1 corresponds to the second character, and so forth. Listing 4.2 gives an idea of that.

1  /* Allocate host memory for search status.*/
2  unsigned char *h_searchStatus = (unsigned char*)malloc(PASSW_LEN);
3  /* Start search status with all zeros.*/
4  for (i = 0; i < PASSW_LEN; i++)
5      h_searchStatus[i] = 0;

Listing 4.2: Initialization of the search status on the host.

When calling the GPU, each thread will increment the search status by a different value and then hash a different password. After that, the CPU increments the search status by the number of parallel passwords and can then call the GPU again. The function to increment the search status is shown in Listing 4.3.

1  bool next_searchStatus(unsigned char* searchStatus, int stride) {
2      int i = PASSW_LEN - 1;
3      int incremented_status;
4      int add = stride;
5
6      while (add > 0 && i >= 0)
7      {
8          incremented_status = add + searchStatus[i];
9          searchStatus[i] = incremented_status % CHARSET_LEN;
10         add = incremented_status / CHARSET_LEN;
11         i--;
12     }
13
14     return (add != 0);
15 }

Listing 4.3: Computation of the next search status.

The first element of the search status is incremented by the number of parallel passwords. Every time it reaches the maximum number of characters, it starts again at zero. This is simply implemented by a modulo operation. Every time the bound is reached, the next element is also incremented by one, which is implemented by the division that updates the add variable for the next iteration. This is done iteratively until there is no propagation into the next position any more or we have reached the number of password characters that we search. In the latter case, we return true to let the calling function know that the search is exhausted. For clarification, we give a small example: assume we search through passwords of three characters between a and z. If the search status is [0,0,0], which corresponds to aaa, and we increment it by 25, the result is [25,0,0] or zaa. But if we increment it by 26, the result is [0,1,0] or aba. If we incremented it by 26 · 26 · 26, we would end up with the same search status [0,0,0] or aaa, but would additionally return true. In Listing 4.4, we can see the idea of using the search status in a loop that iteratively calls kernels on the GPU.

1  __constant__ unsigned char c_searchStatus[PASSW_LEN];
2  ...
3  int main() {
4      ...
5      bool searchExhausted = false;
6
7      while (!searchExhausted) {
8          /* Update search status for the GPU.*/
9          cudaMemcpyToSymbol(c_searchStatus, h_searchStatus, PASSW_LEN);
10
11         /* Make calls to the GPU here and compute passwords in parallel.*/
12         ...
13
14         /* Prepare next iteration.*/
15         searchExhausted = next_searchStatus(h_searchStatus, num_parallel_pwds);
16     }
17     ...
18 }

Listing 4.4: Overview of the main loop in the host code.

The calls to the GPU are implemented in a while loop that runs as long as the search is not exhausted. At the bottom of the while loop, the function next_searchStatus() from Listing 4.3 is called to update the search status and decide whether the search is already exhausted. At the beginning of the loop, the current host search status is copied to constant memory on the device. Constant memory is a read-only memory that is particularly well-suited when many threads in a warp read from the same memory location. Our tests show that this gives us a very small boost in the hash rate compared to an implementation in global memory. The constant memory is declared with the __constant__ modifier in global scope, that is, outside of any function. The copy from host to device is straightforward, as in line 9. From the GPU, the constant memory can be accessed with the standard square bracket syntax. We will now have a closer look at the GPU calls that have been hidden in line 12 of Listing 4.4. We employ seven kernels, as can be seen from the source code in Listing 4.5.

1  /* Parallel generation of password candidates.*/
2  fillPasswords<<<gridSize, blockSize>>>(d_passwordCandidates, d_charset);
3
4  /* Parallel computation of the first PBKDF2.*/
5  pbkdf2_A<<<gridSize, blockSize>>>(d_B, d_passwordCandidates);
6
7  /* Parallel endianness conversion from chars to int.*/
8  endian_chars2int<<<gridSize, blockSize>>>(d_smixIn, d_B);
9
10 /* Run p times scryptROMix in parallel.*/
11 for (int i = 0; i < P_PARAM; i++) {
12     /* Kernel for filling the memory.*/
13     scryptROMix_fillMem<<<gridSize, blockSize>>>(d_smixIn, N_PARAM, i, d_smixMem);
14
15     /* Kernel for pseudo-randomly reading the memory.*/
16     scryptROMix_readMem<<<gridSize, blockSize>>>(d_smixOut, N_PARAM, i, d_smixMem);
17 }
18
19 /* Parallel endianness conversion from int to chars.*/
20 endian_int2chars<<<gridSize, blockSize>>>(d_B, d_smixOut);
21
22 /* Parallel computation of the second PBKDF2.*/
23 pbkdf2_B<<<gridSize, blockSize>>>(d_derivedKeys, d_B, d_passwordCandidates);
24
25 /* Copy results back to host.*/
26 cudaMemcpy(h_derivedKeys, d_derivedKeys, num_parallel_pwds*DK_LEN, cudaMemcpyDeviceToHost);
27
28
29 /* Compare computed hashes against the hash that is attacked.*/
30 ...

Listing 4.5: The GPU calls that are made from the host.

In line 2, the password candidates on the GPU are updated according to the new search status. The internals of this kernel are examined in Section 4.2.2. After that, we run the first PBKDF2 computation of scrypt in line 5. PBKDF2 is computed in parallel with one thread per password. Next, we have a small kernel for the endianness conversion. The heart of our computation is the for loop in lines 11 to 17. We sequentially run the 푝 independent executions of scryptROMix, each of which is in turn split up into two kernels: one for filling the memory and one for pseudorandomly accessing the filled memory. Note that we pass the current iteration value as a parameter to the scryptROMix kernels so that they know which part of the little-endian converted PBKDF2 output they have to process. The scryptROMix kernels are discussed in Section 4.2.2. Eventually, we revert the little-endian conversion and run the final PBKDF2. The results are copied back to the host, where they are compared against the attacked hash value. If a comparison succeeds, we output the recovered password.

4.2.2 Device Implementation

As is apparent from Listing 4.5, we moved as much work as possible from the CPU to the GPU, even the PBKDF2 executions and the endianness conversions. However, these two parts of the device code are pretty straightforward and will not be described here. Instead, we examine the parallel generation of password candidates and the scryptROMix

computations. For the latter, the reader needs to be familiar with scrypt, especially with scryptROMix and scryptBlockMix. These two functions can be looked up in Sections 2.2.2 and 2.2.3, respectively.

Generating Password Candidates

As mentioned before, the password candidates are generated according to the current search status. Every thread generates its own password candidate. From line 9 of Listing 4.4, we know that the current search status is accessible on the GPU from constant memory. The according GPU code for generating password candidates is presented in Listing 4.6.

1  __device__ void compute_mySearchStatus(unsigned char* searchStatus, int idx)
2  {
3      int i = PASSW_LEN - 1;
4      int incremented_status;
5      int add = idx;
6
7      while (add > 0 && i >= 0)
8      {
9          incremented_status = add + searchStatus[i];
10         searchStatus[i] = incremented_status % CHARSET_LEN;
11         add = incremented_status / CHARSET_LEN;
12         i--;
13     }
14 }
15
16 __global__ void fillPasswords(unsigned char* d_passwordCandidates, unsigned char* d_charset) {
17     /* Compute global index of thread.*/
18     int myIdx = blockIdx.x*blockDim.x + threadIdx.x;
19     int i;
20
21     unsigned char mySearchStatus[PASSW_LEN];
22     /* Read search status from constant memory.*/
23     for (i = 0; i < PASSW_LEN; i++) {
24         mySearchStatus[i] = c_searchStatus[i];
25     }
26
27     /* Update search status according to the thread index.*/
28     compute_mySearchStatus(mySearchStatus, myIdx);
29
30     for (i = 0; i < PASSW_LEN; i++) {
31         d_passwordCandidates[myIdx*PASSW_LEN + i] = d_charset[mySearchStatus[i]];
32     }
33 }

Listing 4.6: Kernel for the parallel generation of password candidates.

The kernel that is called from the host can be found in lines 16 to 33. Recall that a kernel is a device function that is called from the host with the triple angle bracket notation in which the degree of parallelism is specified. It is declared with the __global__ qualifier and always returns void. In line 18, we compute the global index of the executing thread. Then, in lines 23 to 25, the current search status is read from constant to local memory. Note that this is done by all threads, so at this point every thread still has the same search status. We therefore call the function compute_mySearchStatus(), which updates the

search status of each thread according to its index. Lines 1 to 14 show the according function. Since we cannot, and do not want to, call a host function from the device, we use the __device__ qualifier here to indicate that this is a device function. It can be called from other device functions, including kernels, but not from the host. The function is basically the same as the one presented in Listing 4.3. Now that every thread has its own search status, we can put together the according passwords and store them in global memory, as is done in lines 30 to 32.

The scryptROMix State

The 푝 independent executions of scryptROMix are split into two kernels each. The kernels are based on Christian Buchner's CudaMiner Fermi kernels [For]. CudaMiner is a CUDA based Litecoin miner. Litecoin is a cryptocurrency similar to Bitcoin that uses scrypt hashes as a proof-of-work. However, its cost determining parameters are fixed to (푁, 푟, 푝) = (1024, 1, 1), so we adapted the kernels to generic computations for arbitrary parameters. Since we implement one thread per password, every thread has its own local memory with the current state of scryptROMix, which is iteratively updated by scryptBlockMix and written to global memory. To work with that state, we use the arrays shown in Listing 4.7.

1  /* Local memory for storing state.*/
2  uint4 state[4*2*R_PARAM];
3  /* Arrays of pointers to state will be used for block mixing. */
4  uint4* state_ptr1[2*R_PARAM];
5  uint4* state_ptr2[2*R_PARAM];
6  /* Initialize first array of pointers.*/
7  for (int i = 0; i < 2*R_PARAM; i++) {
8      state_ptr1[i] = state + i*4;
9  }

Listing 4.7: Setup of the scryptROMix state.

First, we declare an array for storing the state. The uint4 structure consists of four unsigned integers; thus, it stores 16 bytes. Since we need 128 · 푟 bytes, we allocate 8 · 푟 of those structures. Furthermore, we declare two arrays of pointers that will point to the 2 · 푟 blocks of the state. The first array of pointers is initialized right away. Those pointer arrays will be particularly helpful in the block mixing phase of scryptBlockMix, where we just copy pointers instead of actually copying blocks of memory.

scryptBlockMix

When it comes to running scryptBlockMix, the 2 · 푟 Salsa20/8 executions are processed in groups of two. Listing 4.8 shows this in lines 2 to 8.

1  /* First Salsas need special treatment because the last block is involved here.*/
2  xor_salsa20_8(state_ptr1[0], state_ptr1[2*R_PARAM-1]);
3  xor_salsa20_8(state_ptr1[1], state_ptr1[0]);
4  /* Now, process all other Salsas.*/
5  for (r = 1; r < R_PARAM; r++) {
6      xor_salsa20_8(state_ptr1[2*r], state_ptr1[2*r-1]);
7      xor_salsa20_8(state_ptr1[2*r+1], state_ptr1[2*r]);
8  }
9
10 /* Block mixing: rearrange memory by only copying pointers.*/
11 for (j = 0; j < R_PARAM; j++) {
12     state_ptr2[j] = state_ptr1[2*j];
13     state_ptr2[j+R_PARAM] = state_ptr1[2*j+1];
14 }
15 /* Swap the pointer arrays.*/
16 for (j = 0; j < 2*R_PARAM; j++) {
17     state_ptr1[j] = state_ptr2[j];
18 }

Listing 4.8: The scryptBlockMix computation with its 2 · 푟 Salsa20/8 executions and the subsequent block mixing.

The xor of the two blocks before each Salsa has been incorporated into the device function xor_salsa20_8(). As we know from Section 2.2.3, the first xor takes the last block of the current state as one of its inputs. So, this special case is handled separately before all further xor and Salsa computations are performed in a for loop. The block mixing is done in lines 11 to 18 by just copying the pointers instead of the memory blocks themselves.

Memory Transactions

In several situations, the scryptROMix kernels must copy a large amount of memory from global to local or from local to global memory. The naive way of implementing this would be to have every thread write or read just its own memory, accessing one word after the other. As it turns out, and people who are used to CUDA will not be surprised, this results in very bad performance.

For understanding how to maximize the global memory throughput, we have to give some background about memory alignment and memory coalescing, which is taken from [NVI13, Chapter 5.3.2]. Global memory can be accessed in chunks of 32, 64, or 128 bytes. Furthermore, those chunks must start at an address which is a multiple of their size; this is also called natural alignment. Now, when a warp accesses global memory (all threads in a warp always execute the same instruction), it coalesces the memory accesses of the threads into one or more transactions of those chunks, depending on the size of memory accessed by each thread and the memory addresses accessed by the threads. Reducing this number of transactions increases the memory throughput. The global memory instructions of the threads can access words of size 1, 2, 4, 8, or 16 bytes. If a thread accesses global memory with a data type that is not naturally aligned, this access will compile to more than one global memory instruction. Besides some built-in types that automatically fulfill this alignment requirement, 8 and 16 byte structures can be used with the specifiers __align__(8) and __align__(16), respectively, to make sure they fulfill it.

In our case, we use structures of 16 bytes with the __align__(16) specifier. Four contiguous threads always access four contiguous structures so that the four 16 byte accesses are fully coalesced. This results in eight naturally aligned data chunks of 64 bytes that are accessed by a warp. The accessed data is exactly the data that we need; no additional chunks are accessed, so global memory throughput is maximized.

But the question arises how we can read the data for a password with multiple threads from global memory if we eventually need it in the local memory of a single thread. The solution is to not read directly from global to local memory. Instead, we first read from global to shared memory in the above-explained way. Then, each thread reads its own data from shared memory to local memory. Analogously, if we want to write from local to global memory, we first have each thread sequentially write its data into shared memory and then write to global memory with maximized throughput. In this way, the global memory access is coalesced and aligned while the sequential memory reads of each thread go to the fast shared memory. This time, instead of showing the rather complicated source code, we illustrate the approach of reading memory in Figure 4.4. It shows a quarter of the initial read from the generated password candidates for a single block of threads for the case of 푟 = 2. Other big memory transfers are implemented in the same way, but of course with other offsets, especially later when every thread computes a different pseudorandom index that is accessed. Writing memory works analogously the other way around. There are two things to note about the memory transfer from global

Figure 4.4: Example of partial memory transfer from generated passwords to local memory for a single block of threads. The block size parameter 푟 is 2.

to shared memory. First, since we use four threads per password to copy a block of 64 bytes, every thread needs to do four memory accesses with different offsets. The first offset is indicated in between the dots in Figure 4.4. Second, the whole procedure shown in Figure 4.4 has to be run for each memory block, that is, 2 · 2 = 4 times in our example with 푟 = 2, and 2 · 푟 times in general. We also see that there is an offset of 4 bytes after each block in the shared memory. The reason for that offset is memory bank conflicts. The shared memory is divided into so-called memory banks. For example, on our GeForce GTX 480 with compute capability 2.0, the shared memory is divided into 32 memory banks with a size of 32 bits each. If a so-called bank conflict occurs, the memory access has to be serialized, which has a negative effect on the performance. A memory bank conflict might occur when threads access the same memory bank. When exactly a memory bank conflict occurs depends on the compute capability of the graphics card and on the size and address that is accessed. For

further information about memory bank conflicts, please refer to the CUDA C Programming Guide [NVI13]. Here, just note that without the offset of 4 bytes, every other thread would access the same bank at a time: thread 0 would access bank 0, thread 1 would access bank 16, thread 2 would access bank 0, and so forth. Now, by using the offset, we make sure that no thread within a quarter warp accesses the same bank. As can be looked up in [NVI13, Appendix G.4.3], this is what we need for 128-bit accesses on our card to avoid memory bank conflicts.

5 Results

This chapter presents the measured results of our scrypt implementation. First, we examine the CPU running times of scrypt. After that, the GPU hash rates that can be achieved with the implementation from Section 4.2 are reported. Finally, those hash rates are compared to bcrypt hash rates from the same graphics card. The reason why we analyze running times on the CPU, but hash rates on the GPU, is that this conforms to our attack model: on the one side, a server equipped with a CPU that has to run the authentication only once; on the other side, an attacker equipped with a GPU who runs a brute-force attack. The used graphics card is a GeForce GTX 480. Technical details of this card are given in Appendix B. All measurement results are also summarized in Appendix C.

We analyzed parameters that use up to 32 Mbytes per password. The reason for this is that our graphics card’s global memory size is 1536 Mbytes and we always process warps of 32 passwords in parallel. There are two things to note about this. First, with some effort, the implementation can be adapted to also process fewer than 32 passwords in parallel and thereby be able to run with even higher parameters. Second, the parameters we are looking at here are already excessively high when thinking about real-world implementations. For example, bcrypt parameters tend to be chosen between 4 and 7, but the parameters we are looking at correspond to bcrypt parameters up to 12. Servers seem to be unhappy with spending many resources on password hashing. Now, note that scrypt needs much more memory resources than bcrypt. We are confident that the analyzed parameters cover those values that would reasonably be chosen for interactive logins in practical systems.

5.1 CPU Running Times

The CPU results have been computed on a 3.1 GHz Intel Core i5-2400. Initially, we took Percival’s scrypt encryption utility [Tar] and just extracted the scrypt function. This seemed to be the easiest and fastest way, but the resulting code performed very badly (about 10 times slower than the benchmarks given in [Per09]). So, we implemented the scrypt function from scratch, also in C. This time, the program behaved as expected: our CPU running times are a bit slower than the running times given in [Per09]. But the same holds true for our bcrypt running times, where we used the bcrypt implementation from Openwall [Ope]. Thus, we assume that this difference results from the different CPU that we used.

The scrypt parameters have been introduced in Section 2.2.1. Three of them, namely 푁,

(푁, 푟, 푝)       Running time in ms (CPU)
(2^10, 1, 1)    1.1
(2^11, 1, 1)    2.2
(2^12, 1, 1)    4.5
(2^13, 1, 1)    9.0
(2^14, 1, 1)    18.1
(2^15, 1, 1)    36.7
(2^16, 1, 1)    74.4
(2^17, 1, 1)    153
(2^18, 1, 1)    305

Figure 5.1: The measured CPU running times of scrypt for (푁, 푟, 푝) = (푁, 1, 1).
Figure 5.2: The exact measured CPU running times of scrypt for varying 푁.

(푁, 푟, 푝)       Running time in ms (CPU)
(1024, 1, 1)    1.1
(1024, 2, 1)    2.0
(1024, 4, 1)    4.0
(1024, 8, 1)    7.8
(1024, 16, 1)   15.6
(1024, 32, 1)   33.0

Figure 5.3: The measured CPU running times of scrypt for (푁, 푟, 푝) = (1024, 푟, 1).
Figure 5.4: The exact measured CPU running times of scrypt for varying 푟.

푟, and 푝, can be used for adapting the cost. Thus, the tuple (푁, 푟, 푝) is the counterpart to the cost parameter used within bcrypt. On the CPU, the effect of each of those parameters is unspectacular: basically, if we double any parameter, the running time also doubles. The effect of the memory cost parameter 푁 is shown in Figure 5.1. The corresponding running times are listed in Figure 5.2. As we see, the graph indicates proportional behavior for 푁. The second parameter is the block size parameter 푟. In terms of the amount of used memory and also in terms of the number of applied Salsa20/8 functions, it has the same effect as the 푁 parameter. But instead of more iterations with the same amount of memory, increasing 푟 yields more memory usage without changing the number of iterations. Furthermore, the “mixing” that eventually happens in scryptBlockMix only does anything one could call mixing for 푟 ≥ 2. The influence on the CPU running time, however, is nearly the same. Figure 5.3 shows the proportional behavior of the running time depending on 푟. The measured values can be found in Figure 5.4. Finally,

(푁, 푟, 푝)      Running time in ms (CPU)
(1024, 1, 1)   1.1
(1024, 1, 2)   2.2
(1024, 1, 3)   3.3
(1024, 1, 4)   4.4
(1024, 1, 5)   5.5
(1024, 1, 6)   6.6
(1024, 1, 7)   7.7
(1024, 1, 8)   8.8

Figure 5.5: The measured CPU running times of scrypt for (푁, 푟, 푝) = (1024, 1, 푝).
Figure 5.6: The exact measured CPU running times of scrypt for varying 푝.

the parallelization parameter 푝 is the easiest of the three cost determining parameters. It simply tells how many independent runs of scryptROMix have to be performed. Accordingly, the behavior is perfectly proportional to 푝, as shown in Figure 5.5. The underlying values can be looked up in Figure 5.6. Note that the behavior does not change for other fixed parameters. That means, instead of fixing 푟 = 1 and 푝 = 1 and varying the 푁 parameter, we could also fix 푟 and 푝 to other values. The resulting times will of course be different, but the running time will still be proportional to the corresponding parameter. Only 푟 sometimes seems to be a bit faster than expected: we occasionally observe that 푟 is doubled, but the running time increases only by a factor of about 1.8.

5.2 GPU Hash Rates

Now, we present the results from the CUDA based GPU implementation described in Section 4.2. The measurements have been performed on a GeForce GTX 480. The corresponding device query can be found in Appendix B. Furthermore, we used CUDA 5.5 and the GeForce 335.23 driver. There are two different characteristics that should be analyzed when measuring a password hashing function. First, for some specific (and reasonable) CPU running times, one can measure the corresponding hash rates on an attacker’s platform and thereby determine how long a password will withstand an exhaustive search. On the other hand, it is very interesting to observe how the platforms of the authentication server and the attacker behave for different cost parameters. Particularly interesting are parameters that behave differently on both platforms. Of course, we hope for cost parameters that slow down the GPU by a larger factor than the CPU. We first look at the latter aspect and present the behavior of each parameter in its own subsection, where we also compare the results to the CPU running times from Section 5.1. In this way, we can analyze how well each parameter is suited for thwarting a GPU-based implementation. Thereafter, we analyze

(푁, 푟, 푝)       Hash rate in H/s (GPU)
(2^10, 1, 1)    130,000
(2^11, 1, 1)    54,000
(2^12, 1, 1)    27,000
(2^13, 1, 1)    7400
(2^14, 1, 1)    1900
(2^15, 1, 1)    482
(2^16, 1, 1)    121
(2^17, 1, 1)    31
(2^18, 1, 1)    8

Figure 5.7: The measured GPU hash rates of scrypt for (푁, 푟, 푝) = (푁, 1, 1).
Figure 5.8: The exact measured GPU hash rates of scrypt for varying 푁.

the specific hash rates for given CPU running times by comparing them to bcrypt.

5.2.1 The N Parameter

Figure 5.7 shows the hash rates of the GPU for varying 푁. The exact values are given in Figure 5.8. We observe linear behavior with a knee at 푁 = 4096. Before that knee, the hash rate approximately halves when the 푁 parameter is doubled. This means that the GPU is not effectively thwarted by those low values of 푁; after all, this is an effect that can be reached by simple key stretching. After the knee, however, the GPU loses approximately 75% of its hash rate when 푁 is multiplied by 2. We recall from Figure 5.1 that the CPU does not change its behavior, even for large values of 푁. Thus, with high values of 푁, a password authentication server can effectively slow down our GPU-based attack by doubling the time needed for authentication and thwarting the attacker’s hash rate by 75% in return. The memory cost parameter 푁 mainly slows us down because the overall amount of memory is increased. That means that we have to decrease the degree of parallelism once the global memory is full. We end up with fewer and fewer passwords that are hashed at the same time. Unfortunately, we have no explanation for the knee at 푁 = 4096. A varying 푁 parameter just changes the number of iterations in the corresponding loops of our implementation. The global memory access is always implemented in the same way, and the local memory per thread is not influenced by 푁.

5.2.2 The r Parameter

The block size parameter 푟 effectively slows down the GPU from the very beginning. Figure 5.9 and Figure 5.10 present the hash rates for varying 푟 in graphic and tabular form, respectively. When 푟 is multiplied by 2, the hash rate drops to about 15% to 33% of its previous value. On

(푁, 푟, 푝)       Hash rate in H/s (GPU)
(1024, 1, 1)    130,000
(1024, 2, 1)    30,000
(1024, 4, 1)    9800
(1024, 8, 1)    2560
(1024, 16, 1)   390
(1024, 32, 1)   55

Figure 5.9: The measured GPU hash rates of scrypt for (푁, 푟, 푝) = (1024, 푟, 1).
Figure 5.10: The exact measured GPU hash rates of scrypt for varying 푟.

(푁, 푟, 푝)      Hash rate in H/s (GPU)
(1024, 1, 1)   130,000
(1024, 1, 2)   66,000
(1024, 1, 3)   44,000
(1024, 1, 4)   33,000
(1024, 1, 5)   26,000
(1024, 1, 6)   22,000
(1024, 1, 7)   19,000
(1024, 1, 8)   16,600

Figure 5.11: The measured GPU hash rates of scrypt for (푁, 푟, 푝) = (1024, 1, 푝).
Figure 5.12: The exact measured GPU hash rates of scrypt for varying 푝.

the one hand, we again have to use more memory and therefore decrease the degree of parallelism. But a further reason why the 푟 parameter is so good at thwarting the GPU seems to be register spilling. When increasing the 푁 parameter, we have to do more memory accesses, but always use the same amount of local memory and registers. When increasing 푟, on the other hand, we use more local memory, and when no more registers are free, the local memory spills over to the device memory, which significantly slows down the attack.

5.2.3 The p Parameter

The only parameter that did not effectively slow down the GPU in our measurements is the parallelization parameter 푝. The corresponding hash rates that were achieved on the GPU can be seen in Figure 5.11 and Figure 5.12. We observe that the hash rates behave almost exactly inversely proportional to 푝. For example, starting from (푁, 푟, 푝) = (1024, 1, 1) with 130k H/s, any hash rate for a different 푝 can be computed as 130000/푝. The reason for this behavior is obvious when recalling the implementation from Section 4.2.

We implemented the 푝 independent computations by simply writing a for loop, just as it is done on the CPU. So, we can process the same number of passwords in parallel and just have to do it 푝 times. Although there are more memory accesses, namely 푝 times as many, the overall amount of memory does not change because we reuse the same memory space for each iteration of 푝. Hence, the degree of parallelism that we can use is not influenced. It is conceivable that one could even speed up the GPU with this parameter by implementing it in a parallel way. On the other hand, 푝 was presumably introduced to efficiently leverage multi-core CPUs. Thus, when writing a multi-threaded CPU implementation where 푝 is equal to the number of cores, we could also gain an advantage over the GPU.

5.2.4 More Results

So far, we have analyzed each parameter by itself. Now, we present some more results in Table 5.1. However, those results do not behave in any way that we would not have expected from the previous measurements.

(푁, 푟, 푝)      Hash rate in H/s (GPU)
(2^12, 2, 1)   5200
(2^13, 2, 1)   1563
(2^14, 2, 1)   418
(2^12, 4, 1)   872
(2^13, 4, 1)   253
(2^14, 4, 1)   64
(2^12, 8, 1)   215
(2^13, 8, 1)   54
(2^14, 8, 1)   14

Table 5.1: Further GPU hash rates.

5.3 Comparing scrypt with bcrypt

To reasonably classify the measured hash rates, we now compare them with hash rates for bcrypt. To that end, we form classes of scrypt and bcrypt parameters that have similar running times on the CPU. Then, we compare the GPU hash rates for those parameters.

Our CPU implementation of bcrypt comes from Openwall [Ope] and is written in C. Recall that bcrypt has only one cost parameter. Table 5.2 lists the running times obtained for multiple cost parameters and their comparable scrypt parameters that have been analyzed in Sections 5.2.1 to 5.2.3. We can see that the measured

bcrypt values double when cost is incremented by one. That is because bcrypt’s main work consists of a loop that is iterated 2^cost times, as was also explained in Algorithm 3.3.2. Table 5.3 additionally incorporates the parameters that have been

bcrypt costs   time in ms   scrypt costs    time in ms
4              1.3          (2^10, 1, 1)    1.1
5              2.4          (2^11, 1, 1)    2.2
                            (2^10, 2, 1)    2.0
                            (2^10, 1, 2)    2.2
6              4.6          (2^12, 1, 1)    4.5
                            (2^10, 4, 1)    4.0
                            (2^10, 1, 4)    4.4
7              9.3          (2^13, 1, 1)    9.0
                            (2^10, 8, 1)    7.8
                            (2^10, 1, 8)    8.8
8              18.2         (2^14, 1, 1)    18.1
                            (2^10, 16, 1)   15.6
9              36.0         (2^15, 1, 1)    36.7
                            (2^10, 32, 1)   33.0
10             72.1         (2^16, 1, 1)    74.4
11             144          (2^17, 1, 1)    153
12             287          (2^18, 1, 1)    305

Table 5.2: Comparable parameters for bcrypt and scrypt (1).

examined in Section 5.2.4. It should be noted that most scrypt times are a bit below the bcrypt times, especially when the 푟 parameter is increased. Still, they are very close to each other and there are no other values that would come closer.

Finally, the GPU hash rates of bcrypt are still missing. For obtaining this data, we ran oclHashcat [Has14]. Hashcat claims to be the world’s fastest password cracker. While scrypt is not supported by hashcat, bcrypt is. The results obtained with hashcat for cracking bcrypt hashes with different costs on our GeForce GTX 480 are presented in Table 5.4, together with the hash rates for scrypt with comparable parameters from Section 5.2.1. First, we take a short look at the bcrypt hash rates and see that they show inversely proportional behavior, that is, when the CPU running time is doubled (increasing cost by one), the hash rate drops by 50%. Thus, increasing bcrypt’s cost parameter does not thwart the GPU more than applying conventional key stretching. However, the obtained hash rates are very low from the very beginning. Therefore, we can hash scrypt much faster for low parameters, but when the parameters grow, the advantage of bcrypt decreases. Finally, for the biggest 푁 parameter we could test, scrypt manages to be as slow as bcrypt. The fact that the factor between the two hash rates moves in favor of scrypt for growing

bcrypt costs   time in ms   scrypt costs    time in ms
7              9.3          (2^12, 2, 1)    8.3
8              18.2         (2^13, 2, 1)    16.3
                            (2^12, 4, 1)    15.9
9              36.0         (2^14, 2, 1)    32.9
                            (2^13, 4, 1)    31.7
                            (2^12, 8, 1)    30.4
10             72.1         (2^14, 4, 1)    65.5
                            (2^13, 8, 1)    62.4
11             144          (2^14, 8, 1)    125

Table 5.3: Comparable parameters for bcrypt and scrypt (2).

bcrypt costs   Hash rate   scrypt costs    Hash rate   Factor
4              2494        (2^10, 1, 1)    130,000     52
5              1279        (2^11, 1, 1)    54,000      42
6              652         (2^12, 1, 1)    27,000      41
7              325         (2^13, 1, 1)    7300        22
8              164         (2^14, 1, 1)    1900        12
9              82          (2^15, 1, 1)    482         5.9
10             40          (2^16, 1, 1)    121         3.0
11             20          (2^17, 1, 1)    31          1.6
12             8           (2^18, 1, 1)    8           1.0

Table 5.4: Comparison of hash rates for bcrypt and scrypt for varying 푁 parameter.

푁 parameter is exactly what we expected. Recall that doubling bcrypt’s CPU costs had no effect other than halving the hash rate, while doubling scrypt’s memory costs gave the CPU an advantage over the GPU. The reason why scrypt is not able to thwart the GPU better than bcrypt is that the hash rates for moderate parameters are simply too high. But we can expect that extremely high values of 푁 would result in scrypt having lower GPU hash rates than bcrypt.

Next, we compare with the scrypt parameters from Section 5.2.2, that is, 푁 and 푝 are fixed and we look at the effect of the block size parameter 푟. The corresponding comparison is presented in Table 5.5. We see a similar behavior as for the memory cost parameter 푁. But, as expected, 푟 is more successful in slowing down the GPU. As a result, scrypt catches up with bcrypt more quickly than in Table 5.4. Again, we assume that even bigger values of 푟 would show the same behavior and give scrypt an advantage over bcrypt.

As already seen in Section 5.2.3, the parallelization parameter 푝 does not manage to slow down our GPU implementation as effectively as 푁 or 푟. We can confirm that

bcrypt costs   Hash rate   scrypt costs    Hash rate   Factor
4              2494        (2^10, 1, 1)    130,000     52
5              1279        (2^10, 2, 1)    30,000      23
6              652         (2^10, 4, 1)    9800        15
7              325         (2^10, 8, 1)    2560        7.9
8              164         (2^10, 16, 1)   390         2.4
9              82          (2^10, 32, 1)   55          0.7

Table 5.5: Comparison of hash rates for bcrypt and scrypt for varying 푟 parameter.

observation in Table 5.6 where we see that scrypt can be hashed 51 to 52 times faster than bcrypt, even for varying 푝 parameters.

bcrypt costs   Hash rate   scrypt costs   Hash rate   Factor
4              2494        (2^10, 1, 1)   130,000     52
5              1279        (2^10, 1, 2)   66,000      52
6              652         (2^10, 1, 4)   33,000      51
7              325         (2^10, 1, 8)   16,600      51

Table 5.6: Comparison of hash rates for bcrypt and scrypt for varying 푝 parameter.

Finally, we look at some further interesting cost parameters of scrypt that have already been presented in Section 5.2.4. The corresponding hash rates are compared with bcrypt in Table 5.7. We can see that, based on the chosen parameters (푁, 푟, 푝), the factor

bcrypt costs   Hash rate   scrypt costs   Hash rate   Factor
7              325         (2^12, 2, 1)   5200        16
8              164         (2^13, 2, 1)   1563        9.5
                           (2^12, 4, 1)   872         5.3
9              82          (2^14, 2, 1)   418         5.1
                           (2^13, 4, 1)   253         3.1
                           (2^12, 8, 1)   215         2.6
10             40          (2^14, 4, 1)   64          1.6
                           (2^13, 8, 1)   54          1.4
11             20          (2^14, 8, 1)   14          0.7

Table 5.7: Further comparisons of hash rates for bcrypt and scrypt.

between the bcrypt and the scrypt hash rates varies although the CPU running times are comparable. That was already apparent when looking at each parameter by itself, and it holds true if we modify multiple parameters at once. A very interesting parameter set from Table 5.7 is (푁, 푟, 푝) = (2^14, 8, 1) because this is what Percival suggests for interactive logins [Per09]. He claims that those settings result in scrypt being 35 times more expensive to attack on ASICs than bcrypt. While we have no practical results for ASICs, we can state that, for our GPU implementations, it is not even twice as expensive. On the other hand, it is obviously dramatically more expensive for the authentication server, which needs to spend a great deal of memory. To be precise, in this case the memory size needed for one password of scrypt is 4096 times as high as for bcrypt.

6 Conclusion

In this thesis, we presented a GPU-assisted password cracker for scrypt and examined its results. Based on these results, we must conclude that scrypt is inferior to bcrypt in protecting password hashes for interactive logins against GPU-assisted attacks. For our tested graphics card, as long as no significantly better bcrypt implementation than the one of hashcat is developed, scrypt is easier to attack for nearly all parameters imaginable for interactive logins. However, we do assume that scrypt is harder to attack for very high parameters that are unsuited for interactive logins. Therefore, scrypt might be a reasonable choice when key derivation for file encryption is needed. This holds true especially because we observed the very promising behavior that the GPU hash rate suffers much more from increasing the parameters than the CPU running time does. Such a feature cannot be confirmed for conventional password hashing functions, including bcrypt. Additionally, we analyzed the cost estimations made in the original scrypt paper [Per09]. We found some inaccuracies. However, those do not lead to a lower, but to a higher cost estimation. Since the cost estimations have been based on ASICs, we cannot assess those estimations with our results. But we can state that they do not hold for GPUs. For example, for the parameter set (푁, 푟, 푝) = (2^14, 8, 1), the cost factor between scrypt and bcrypt was given as 35 (on ASICs) and turned out to be 1.4 (on our GPU). This shows that a password hashing function must always be developed with respect to all possible attacker platforms. While scrypt has been developed to thwart custom hardware attacks, GPUs have not been taken into account and now turn out to be a problem. So far, the only well-known system that employs scrypt is Litecoin. With respect to our results, we must state that Litecoin’s scrypt parameters have been a bad choice.
Although the aim was to bring mining back to the CPU and thwart miners equipped with special hardware, very low parameters have been chosen. It seems to have worked out for holding off ASICs and even FPGAs so far, but GPUs are already widely used for Litecoin mining. An example of a better choice of parameters is Steve Gibson’s Secure Quick Reliable Login (SQRL), which uses scrypt in an, admittedly unnecessarily complicated, PBKDF2-like fashion called EnScrypt [Cor]. Gibson makes intensive use of the block size parameter 푟 and does not use the parallelization parameter 푝 at all. According to our results, this is an approach that would have been more successful in fulfilling Litecoin’s original goals.

6.1 Future Work

We assume that the implementation can be improved by several means. The first idea is to do a memory-computation tradeoff. For instance, one could store just every other

scryptBlockMix output and therefore use 50% less memory. Then, when data is needed that has not been stored, it must be computed from the previously stored data by applying scryptBlockMix. We expect that this would improve the hash rates since decreasing the needed memory directly enables increasing the degree of parallelism. With respect to this technique, it would be interesting to analyze the behavior for different tradeoffs: storing every other output, storing every third output, storing every fourth output, and so on. A further way of increasing the possible degree of parallelism is to use multiple threads per password. Since Salsa20/8 always works on four independent rows or columns, four threads per password would be the obvious choice. This would result in a degree of parallelism that is four times as large. Some difficulties might arise when parallelizing Salsa20/8, since when changing from columns to rows the threads have to communicate via shared memory. This might lead to some performance drawbacks, but we think that this will be compensated by the increased parallelism. Another idea is to implement the 푝 parameter in a parallel way. In this case, one should additionally implement a multi-threaded CPU version of scrypt with the 푝 parameter equal to the number of cores. Then, the performances could be analyzed and compared to our results from Chapter 5.

As stated above, our results show that it is not sufficient to develop a password hashing function that thwarts a special platform without taking other platforms into account. Percival was partially successful for ASICs and FPGAs. Very interesting further work is to investigate the best way of slowing down GPUs. A promising idea is to leverage the “feature” of thread divergence, that is, when multiple threads in a warp want to run distinct instructions, they are serialized. Additionally, a password hashing function must be attractive for authentication servers. Experience has shown that extensive resource usage discourages servers from using the corresponding function or parameters. With respect to developing more advanced password hashing functions, we want to refer to the Password Hashing Competition (PHC) [Com], which pursues exactly this goal. The submission deadline was on March 31 of this year. As of today, 23 submissions are in the running and are waiting to be scrutinized.

A Acronyms

API      Application Programming Interface
ASIC     Application-Specific Integrated Circuit
CPU      Central Processing Unit
CUDA     Compute Unified Device Architecture
DES      Data Encryption Standard
DRAM     Dynamic Random-Access Memory
ECB      Electronic Codebook
FPGA     Field-Programmable Gate Array
GPU      Graphics Processing Unit
HMAC     Keyed-Hash Message Authentication Code
KDF      Key Derivation Function
NIST     National Institute of Standards and Technology
PBKDF2   Password-Based Key Derivation Function 2
PHC      Password Hashing Competition
PRF      Pseudorandom Function
RAM      Random-Access Memory
SHA      Secure Hash Algorithm
SM       Streaming Multiprocessor
SQRL     Secure Quick Reliable Login
SRAM     Static Random-Access Memory
VLSI     Very-Large-Scale Integration

B GeForce GTX 480 - Device Query

Device 0: "GeForce GTX 480"
CUDA Driver Version / Runtime Version:          6.0 / 5.0
CUDA Capability Major/Minor version number:     2.0
Total amount of global memory:                  1536 MBytes (1610612736 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP:      480 CUDA Cores
GPU Clock rate:                                 1512 MHz (1.51 GHz)
Memory Clock rate:                              1900 Mhz
Memory Bus Width:                               384-bit
L2 Cache Size:                                  786432 bytes
Max Texture Dimension Size (x,y,z):             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers:        1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory:                65536 bytes
Total amount of shared memory per block:        49152 bytes
Total number of registers available per block:  32768
Warp size:                                      32
Maximum number of threads per multiprocessor:   1536
Maximum number of threads per block:            1024
Maximum sizes of each dimension of a block:     1024 x 1024 x 64
Maximum sizes of each dimension of a grid:      65535 x 65535 x 65535
Maximum memory pitch:                           2147483647 bytes
Texture alignment:                              512 bytes
Concurrent copy and kernel execution:           Yes with 1 copy engine(s)
Run time limit on kernels:                      No
Integrated GPU sharing Host Memory:             No
Support host page-locked memory mapping:        Yes
Alignment requirement for Surfaces:             Yes
Device has ECC support:                         Disabled
CUDA Device Driver Mode (TCC or WDDM):          WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA):       No
Device PCI Bus ID / PCI location ID:            1 / 0
Compute Mode:                                   < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

C Measurements

(푁, 푟, 푝)       H/s
(2^10, 1, 1)    130,000
(2^11, 1, 1)    54,000
(2^12, 1, 1)    27,000
(2^13, 1, 1)    7400
(2^14, 1, 1)    1900
(2^15, 1, 1)    482
(2^16, 1, 1)    121
(2^17, 1, 1)    31
(2^18, 1, 1)    8
(1024, 2, 1)    30,000
(1024, 4, 1)    9800
(1024, 8, 1)    2560
(1024, 16, 1)   390
(1024, 32, 1)   55
(1024, 1, 2)    66,000
(1024, 1, 3)    44,000
(1024, 1, 4)    33,000
(1024, 1, 5)    26,000
(1024, 1, 6)    22,000
(1024, 1, 7)    19,000
(1024, 1, 8)    16,600
(2^12, 2, 1)    5200
(2^13, 2, 1)    1563
(2^14, 2, 1)    418
(2^12, 4, 1)    872
(2^13, 4, 1)    253
(2^14, 4, 1)    64
(2^12, 8, 1)    215
(2^13, 8, 1)    54
(2^14, 8, 1)    14

Table C.1: Summarized hash rates measured on the GPU.

List of Figures

1.1 The two layers of password security.

2.1 Generation of a rainbow table.
2.2 Overview of scrypt.
2.3 The process of filling the memory with scryptROMix.
2.4 The scryptBlockMix algorithm.

3.1 A hash chain that is computed within PBKDF2.
3.2 HMAC based on SHA-256.

4.1 CUDA’s heterogeneous programming model.
4.2 Assigning thread blocks to Streaming Multiprocessors.
4.3 The CUDA memory model.
4.4 Memory transfer implemented on the GPU.

5.1 Graphic presentation of the CPU running times of scrypt for parameter N.
5.2 Tabular presentation of the CPU running times of scrypt for parameter N.
5.3 Graphic presentation of the CPU running times of scrypt for parameter r.
5.4 Tabular presentation of the CPU running times of scrypt for parameter r.
5.5 Graphic presentation of the CPU running times of scrypt for parameter p.
5.6 Tabular presentation of the CPU running times of scrypt for parameter p.
5.7 Graphic presentation of the GPU hash rates of scrypt for parameter N.
5.8 Tabular presentation of the GPU hash rates of scrypt for parameter N.
5.9 Graphic presentation of the GPU hash rates of scrypt for parameter r.
5.10 Tabular presentation of the GPU hash rates of scrypt for parameter r.
5.11 Graphic presentation of the GPU hash rates of scrypt for parameter p.
5.12 Tabular presentation of the GPU hash rates of scrypt for parameter p.

List of Tables

1.1 Overview of the most famous password leaks in the last years.

3.1 Percival’s hardware cost estimations in dollar-years.
3.2 Details of the analyzed KDFs.
3.3 Sizes of the key spaces in Percival’s cost estimations.
3.4 Hash rate required by the hardware to run a successful brute-force attack in one year.
3.5 Comparison for the cost adaptable KDFs between the numbers of cryptographic primitives presumably counted by Percival and the precise numbers.
3.6 Reverse-engineered hardware cost estimations in dollar-years.
3.7 Improved hardware cost estimations in dollar-years.

5.1 Further GPU hash rates.
5.2 Comparable parameters for bcrypt and scrypt (1).
5.3 Comparable parameters for bcrypt and scrypt (2).
5.4 Comparison of hash rates for bcrypt and scrypt for varying 푁 parameter.
5.5 Comparison of hash rates for bcrypt and scrypt for varying 푟 parameter.
5.6 Comparison of hash rates for bcrypt and scrypt for varying 푝 parameter.
5.7 Further comparisons of hash rates for bcrypt and scrypt.

C.1 Summarized hash rates measured on the GPU.

List of Algorithms

2.2.1 scryptROMix

3.3.1 bcrypt
3.3.2 EksBlowfishSetup

List of Listings

2.1 The Salsa20/8 core implemented in C code. Endianness must be handled by the caller.

4.1 Computation of the degree of parallelism and the according grid and block size.
4.2 Allocation and declaration of the search status and the character set.
4.3 Computation of the next search status.
4.4 Overview of the main loop in the host code.
4.5 The GPU calls that are made from the host.
4.6 Device functions for generating password candidates.
4.7 Setup of the scryptROMix state.
4.8 Computation of scryptBlockMix.

Bibliography

[ALN97] Martín Abadi, T. Mark A. Lomas, and Roger Needham. Strengthening Passwords. Technical report, SRC Technical Note, 1997.

[Ber] Daniel J. Bernstein. The Salsa20 core. http://cr.yp.to/salsa20.html.

[Ber08] Daniel J. Bernstein. The Salsa20 family of stream ciphers. In New Stream Cipher Designs, volume 4986 of LNCS, pages 84–97. Springer, 2008.

[BK95] Matt Bishop and Daniel V. Klein. Improving system security via proactive password checking. Computers & Security, 14(3):233–249, 1995.

[Bon12] Joseph Bonneau. The science of guessing: analyzing an anonymized corpus of 70 million passwords. In IEEE Symposium on Security and Privacy, 2012.

[CDP12] Claude Castelluccia, Markus Dürmuth, and Daniele Perito. Adaptive Password-Strength Meters from Markov Models. In Proc. NDSS. The Internet Society, 2012.

[Com] Password Hashing Competition. http://password-hashing.net/.

[Cor] Gibson Research Corporation. SQRL’s use of the ‘SCrypt’ Password Based Key Definition Function. http://www.grc.com/sqrl/scrypt.htm.

[Dev10] Martin Devillers. Analyzing Password Strength. Technical report, Radboud University Nijmegen, 2010.

[DGK+12] Markus Dürmuth, Tim Güneysu, Markus Kasper, Christof Paar, Tolga Yalçin, and Ralf Zimmermann. Evaluation of Standardized Password-Based Key Derivation against Parallel Processing Platforms. In Proc. ESORICS, pages 716–733, 2012.

[For] Bitcoin Forum. cudaMiner – a new litecoin mining application. http://bitcointalk.org/index.php?topic=167229.0.

[For12] Forbes. Eight Million Email Addresses And Passwords Spilled From Gaming Site Gamigo Months After Hacker Breach. http://www.forbes.com/sites/andygreenberg/2012/07/23/eight-million-passwords-spilled-from-gaming-site-gamigo-months-after-breach, July 23, 2012.

[Has14] Hashcat. oclHashcat v1.01. http://hashcat.net/oclhashcat/, 2014.

[Hel80] Martin E. Hellman. A Cryptanalytic Time-Memory Trade-Off. IEEE Transactions on Information Theory, 26(4):401–406, 1980.

[Kal00] Burt Kaliski. PKCS #5: Password-Based Cryptography Specification Version 2.0. RFC 2898, 2000.

[Kam94] Poul-Henning Kamp. MD5 crypt. FreeBSD 2.0, http://svnweb.freebsd.org/base/head/lib/libcrypt/crypt.c?revision=4246&view=markup, November 1994.

[KH10] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors – A Hands-on Approach. Morgan Kaufmann Publishers, 2010.

[Kle90] Daniel V. Klein. Foiling the Cracker: A Survey of, and Improvements to, Password Security. In Proc. USENIX UNIX Security Workshop, 1990.

[KSHW98] John Kelsey, Bruce Schneier, Chris Hall, and David Wagner. Secure Applications of Low-Entropy Keys. In Proc. First International Workshop on Information Security, ISW ’97, pages 121–134. Springer, 1998.

[KSK+11] Saranga Komanduri, Richard Shay, Patrick Gage Kelley, Michelle L. Mazurek, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Serge Egelman. Of Passwords and People: Measuring the Effect of Password-composition Policies. In Proc. SIGCHI Conference on Human Factors in Computing Systems, CHI ’11, 2011.

[Man96] Udi Manber. A simple scheme to make passwords based on one-way functions much harder to crack. Computers & Security, 15(2):171–176, 1996.

[Moo65] Gordon E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), 1965.

[MT79] Robert Morris and Ken Thompson. Password Security: A Case History. Communications of the ACM, 22(11):594–597, 1979.

[Nat99] National Institute of Standards and Technology. Data Encryption Standard (DES). FIPS PUB 46-3, 1999.

[Nat08] National Institute of Standards and Technology. The Keyed-Hash Message Authentication Code (HMAC). FIPS PUB 198-1, 2008.

[Nat12] National Institute of Standards and Technology. Secure Hash Standard (SHS). FIPS PUB 180-4, 2012.

[Nat13] National Institute of Standards and Technology. Electronic Authentication Guideline. NIST Special Publication 800-63-2, 2013.

[NS05] Arvind Narayanan and Vitaly Shmatikov. Fast Dictionary Attacks on Passwords Using Time-Space Tradeoff. In ACM Conference on Computer and Communications Security, pages 364–372. ACM, 2005.

[NVI] NVIDIA. CUDA Zone. http://developer.nvidia.com/cuda-zone.

[NVI13] NVIDIA. CUDA C Programming Guide v5.5, 2013.

[Oec03] Philippe Oechslin. Making a Faster Cryptanalytic Time-Memory Trade-Off. In CRYPTO, volume 2729 of LNCS, pages 617–630. Springer, 2003.

[Ope] Openwall. C implementation of bcrypt. http://openwall.com/crypt/.

[Per09] Colin Percival. Stronger Key Derivation via Sequential Memory-Hard Functions. In BSDCan, 2009.

[PJ12] Colin Percival and Simon Josefsson. The scrypt Password-Based Key Derivation Function. Version 01. Internet-Draft, 2012.

[Pla11] Playstation.Blog. PlayStation Network Security Update. http://blog.us.playstation.com/2011/05/02/playstation-network-security-update, May 2, 2011.

[PM99] Niels Provos and David Mazières. A Future-Adaptable Password Scheme. In Proc. FREENIX Track: 1999 USENIX Annual Technical Conference, 1999.

[Reu11] Reuters. Sony PlayStation suffers massive data breach. http://www.reuters.com/article/2011/04/26/us-sony-stoldendata-idUSTRE73P6WB20110426, April 26, 2011.

[Riv92] Ron Rivest. The MD5 Message-Digest Algorithm. RFC 1321, 1992.

[Sch94] Bruce Schneier. Description of a New Variable-Length Key, 64-bit Block Cipher (Blowfish). In Fast Software Encryption, volume 809 of LNCS, pages 191–204. Springer, 1994.

[SHM10] Stuart Schechter, Cormac Herley, and Michael Mitzenmacher. Popularity is Everything: A New Approach to Protecting Passwords from Statistical-guessing Attacks. In Proc. 5th USENIX Conference on Hot Topics in Security, HotSec’10, pages 1–8, 2010.

[SK11] Jason Sanders and Edward Kandrot. CUDA By Example – An Introduction to General-Purpose GPU Programming. Addison-Wesley, 2011.

[Sop13] Sophos. Anatomy of a password disaster – Adobe’s giant-sized cryptographic blunder. http://nakedsecurity.sophos.com/2013/11/04/anatomy-of-a-password-disaster-adobes-giant-sized-cryptographic-blunder, November 4, 2013.

[Spa92] Eugene H. Spafford. Observing Reusable Password Choices. In Proc. 3rd Security Symposium. Usenix, pages 299–312, 1992.

[Spr11] Martijn Sprengers. GPU-based Password Cracking. Master’s thesis, Radboud University Nijmegen, 2011.

[Tar] Tarsnap. The scrypt encryption utility. http://www.tarsnap.com/scrypt.html.

[Tec09] TechCrunch. RockYou Hack: From Bad To Worse. http://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords, December 14, 2009.

[Tec12a] Ars Technica. 8 million leaked passwords connected to LinkedIn, dating website. http://arstechnica.com/security/2012/06/8-million-leaked-passwords-connected-to-linkedin, June 6, 2012.

[Tec12b] Ars Technica. eHarmony confirms its members’ passwords were posted online, too. http://arstechnica.com/security/2012/06/eharmony-confirms-member-passwords-compromise, June 7, 2012.

[Tec13] Ars Technica. How an epic blunder by Adobe could strengthen hand of password crackers. http://arstechnica.com/security/2013/11/how-an-epic-blunder-by-adobe-could-strengthen-hand-of-password-crackers, November 1, 2013.

[WACS10] Matt Weir, Sudhir Aggarwal, Michael Collins, and Henry Stern. Testing Metrics for Password Creation Policies by Attacking Large Sets of Revealed Passwords. In Proc. 17th ACM Conference on Computer and Communications Security, CCS ’10, pages 162–175, 2010.

[WAMG09] Matt Weir, Sudhir Aggarwal, Breno de Medeiros, and Bill Glodek. Password Cracking Using Probabilistic Context-Free Grammars. In Proc. 30th IEEE Symposium on Security and Privacy, pages 391–405. IEEE Computer Society, 2009.