Hash Functions: Theory, Attacks, and Applications
Total Page:16
File Type:pdf, Size:1020Kb
Hash functions: Theory, attacks, and applications Ilya Mironov Microsoft Research, Silicon Valley Campus [email protected] October 24, 2005 1 Introduction Hash functions, most notably MD5 and SHA-1, initially crafted for use in a handful of cryptographic schemes with specific security requirements, have become standard fare for many developers and protocol designers who treat them as black boxes with magic properties. This practice had been defensible until 2004—both functions appeared to have withstood the test of time and intense scrutiny of cryptanalysts. Starting last year, we have seen an explosive growth in the number and power of attacks on the standard hash functions. In this note we discuss the extent to which the hash functions can be thought of as black boxes, review some recent attacks, and, most importantly, revisit common applications of hash functions in programming practice. The note assumes no previous background in cryptography. Parts of the text intended for the more mathematically-inclined readers are marked with H — the sign of the contour integral—and typeset in a smaller font. 2 Theory of hash functions In this section we introduce notation, define security properties of hash functions, describe basic design principles of modern hash functions and generic attacks. 2.1 Notation The following notation used in this note is standard in the cryptographic literature: 0, 1 n—the set of all binary strings of length n. {0, 1}∗—the set of all finite binary strings. A{ }B—the set of all pairs (v,w), where v A and w B. × ∈ ∈ 1 H : A B—function H from set A to set B. 7→ w —the length of string w. w| |v—concatenation of strings w and v. w|| v—bitwise XOR of binary strings w and v, which are presumed to be of equal length.⊕ For example, H : 0, 1 ∗ 0, 1 ℓ 0, 1 n means a function H of two arguments, the first of which is a binary{ string} × { of arbitrary} 7→ { length,} the second is a binary string of length ℓ, returning a binary string of length n. 2.2 Definitions The disconnect between theory and practice of cryptographic hash functions starts right in the beginning—in the very definition of hash functions. In practice, the hash function (sometimes called the message digest) is a fixed function that maps arbitrary strings into binary strings of fixed length. In theory, we usually consider keyed hash functions, as in the following definition: Definition Let ℓ, n be positive integers. We call f a hash function with n-bit output and ℓ-bit key if f is a deterministic function that takes two inputs, the first of arbitrary length, the second of length ℓ bits, and outputs a binary string of length n. Formally, H : 0, 1 ∗ 0, 1 ℓ 0, 1 n. { } × { } 7→ { } In this note we write the key as a subscript: Hk(x) is an expressive shorthand for H(x, k). The key k is assumed to be known unless indicated otherwise (to avoid confusion with cryptographic keys, which typically represent closely guarded secrets, the hash function’s key is sometimes called the “index”). We distinguish between three levels of security that hash functions may satisfy to be useful in cryptographic applications. The definitions are framed as games that are infeasible to win by a computationally-bounded adversary. The words “infeasible” and “computationally- bounded” can be formalized in several ways, neither of which we find entirely satisfying for the purpose of this note. Instead, we offer a semi-formal semantic where we say that the problem is computationally infeasible if no program performing less than T elementary operations can solve the problem with probability higher than T/280. It corresponds to the so-called 80-bit security level, which is currently acceptable for most applications. For further discussion on the appropriate level of security we refer the reader to [LV01]. H We stress that both the game and the adversary are randomized. The probability of the adversary’s success in winning the game is taken over all random choices, including the key of the keyed hash function and the adversary’s own coin tosses. Usually we assume the uniform distribution on the input, which is impossible to define when x is an arbitrary finite binary string. In those case it is convenient to break the set of inputs into finite subsets, such as strings of the same length. 2 Definition Hash function H is one-way if, for random key k and an n-bit string w, it is hard for the attacker presented with k,w to find x so that Hk(x)= w. Definition Hash function H is second-preimage resistant if it is hard for the attacker pre- sented with a random key k and random string x to find y = x so that H (x)= H (y). 6 k k Definition Hash function H is collision resistant if it is hard for the attacker presented with a random key k to find x and y = x so that H (x)= H (y). 6 k k The last definition is notably harder to formalize for keyless hash functions, which explains why theorists prefer keyed hash functions. It is easy to see that collision resistance implies second-preimage resistance. Strictly speaking, second-preimage resistance and one-wayness are incomparable (neither property follows from the other), although construction which are one-way but not second-preimage resistant are quite contrived. In practice, collision resistance is the strongest property of all three, hardest to satisfy and easiest to breach, and breaking it is the goal of most attacks on hash functions. Certificational weakness. Intuitively, a good hash function must satisfy other properties not implied by one-wayness or even collision-resistance. For example, one would expect that flipping a bit of the input would change approximately half the bits of the output (avalanche property) or that no inputs bits can be reliably guessed based on the hash function’s output (local one-wayness). Failure to satisfy those or similar properties, such as infeasibility of finding a pseudo-collision, a free-start collision, and a near-collision whose definitions are given in Section 5, is called a certificational weakness. Presence of certificational weaknesses does not amount to a break of a hash function but is enough to cast doubt on its design principles. 2.3 Generic attacks A generic attack is an attack that applies to all hash functions, no matter how good they are, as opposed to specific attacks that exploit flaws of a particular design. The running time of generic attacks is measured in the number of calls to the hash function, which is treated as a black box. It is not difficult to see that in the black box model the best strategy for inverting the hash function and for finding a second preimage is the exhaustive search. Suppose the problem is to invert Hk, i.e., given w, k find x, so that Hk(x) = w, where k is ℓ-bit key and w is an n-bit string. The only strategy which is guaranteed to work for any hash function is to probe arbitrary chosen strings until a preimage of w is hit. For a random function H it would take on average 2n−1 evaluations of H. In the black-box model the problem of finding a second preimage is just as hard as inverting the hash function. Finding collisions is a different story, the one that goes under the name of the “birthday 3 paradox.” The chances that among 23 randomly chosen people there are two who share the same birthday are almost 51%. The better than even odds appear to be much higher than the intuition would suggest, hence the name. Of course, there is nothing paradoxical (as in being absurd or self-contradictory) about the fact—between 23 people there are 253 pairs, each of which has a one-to-365 odds of being a hit. The reason why the result appears counter-intuitive is because your chances of finding among 22 people somebody with the same birthday as yours (not just someone else’s) to split the cost of the birthday party are strongly against you. This explains why inverting a hash function is much more difficult than finding a collision. More careful analysis of the birthday paradox shows that in order to attain probability better than 1/2 of finding a collision in a hash function with n-bit output, it suffices to evaluate the function on approximately 1.2 2n/2 randomly chosen inputs (notice that 1.2 √365 = 23). · ⌈ · ⌉ The running times of generic attacks on different properties of hash functions provide upper bounds on security of any hash function. We say that a hash function has ideal security if the best attacks known against it are generic. Cryptanalysts consider a primitive broken if its security is shown to be less than ideal, although it may still be sufficient for some applications. The generic attacks are summarized in Table 1. Property Ideal security One-wayness 2n−1 Second preimage-resistance 2n−1 Collision-resistance 1.2 2n/2 · Table 1: Complexity of generic attacks on different properties of hash functions. H A na¨ıve implementation of the birthday attack would store 2n/2 previously computed elements in a data structure supporting quick stores and look-ups. However, there is profound imbalance between the cost of storage and computation. While 240 CPU cycles are dirty cheap, capacity of high-end hard-drives is still below 240 bytes=1TB, and having commercially available computers with 1TB RAM is still several years away.