<<

Printer - please drop in Elsevier (tree) logo Computers 8 Security Vol. 17, No.2, pp. 171- 174, 1998 0 1998 Elsevier Science Limited All rights reserved. Printed in Great Britain 0167-4048/98 $19.00

On Probabilities of Hash Value Matches

Mohammad Peyraviana, Allen Roginskya, Ajay Kshemkalyanib

alBM Corporation, Research Triangle Park, NC 27709, USA bECECS Detlartment. University of Cincinnati, Cincinnati, OH 45221, USAL ’ ‘ _

Hash functions are used in authentication and , as hash function is said to be collision resistant if it is hard well as for the efficient storage and retrieval of data using hashed to find two input strings that map to the same hash keys. Hash functions are susceptible to undesirable collisions. To design or choose an appropriate hash function for an application, value.The problem of constructing fast hash functions it is essential to estimate the probabilities with which these colli- that also have low collision rates is studied in [S]. - sions can occur. In this paper we consider two problems: one of to-address transformation techniques for file access evaluating the probability of no collision at all and one of finding and their performance have been studied in [8]. Given a bound for the probability of a collision with a particular hash k, the number of hashed values that have been used in value. The quality of these estimates under various values of the parameters is also discussed. a hashing scheme in which input strings are mapped to random values, the expected number of locations Keywords: hash functions, security, cryptography, indexing, databases that must be looked at until an empty hashed value is found is formulated in [lo]. This result was improved 1. Introduction for certain non-random hashing functions and certain values of k in [14]. The efficiency of multiple-key A hash hnction takes a variable-length input string hashing in limiting the search of a given key value in and maps it to a fixed-length (generally smaller) out- a file and in minimizing the search in answering par- put string. Hash fimctions are extensively used for tial-match or multi-attribute queries is studied in [2]. index management in database systems and file sys- The only known work that deals with the probability tems for efficient storage and retrieval of data based on of collisions of hash functions is [3,13,16]. These hashed keys [ 1,9,15], digital signatures for authentica- papers dealt with the construction of universal hash tion [6,12,16], and cryptography [4,5,11,12]. A good functions which are classes of hash fimctions such that survey of classical hashing methods is given in [9]. the functions in any class have the same bound on the probability of hash collisions. A hash function is prone to collisions wherein two input strings map to the same output string. A good For several applications such as cryptography, it is hash function minimizes the possibility of collisions; a important to design or choose appropriate hash func-

0167-4048/98$19.00 0 1998 Elsevier Science Ltd 171 On Probabilities of Hash Value Ma tcheslPeyravian, Roginsky, Kshemkalyani

tions with an upper bound on the probability of col- Definition 2: Following [12], we define a one-way lisions. To this end, our objective is to determine the hash function H(M) = h as a function that maps any probabilities of hash collisions for hash functions that input bit string M from set A into a bit string of a fixed are uniformly mapped into the codomain, i.e., each length in set B with the properties that input string is equally likely to be mapped to any of the hash values of the codomain under different sce- l Every possible value of B is equally likely to be an narios. Specifically, we address two problems: I) deter- image of an element in set A. mining the probability that there are no collisions at l Given M, it is easy to compute h. all, in Theorem I, and 2) determining a bound on the l Given h, it is hard to compute M such that H(M) = h probability of a collision with a particular hash value, l Given M, it is hard to find another message M’ in Theorem 2. For each ofTheorems I and 2, we state such that H(M) = H(M’). some corollaries that give the minimum size of the l It is hard to find two random messages, M and M’, hash function that is necessary to satisfy a desired upper bound on the probability of hash collisions. such that H(M) = H(M’). Although our results apply to all hash functions that satisfy Definition I, we focus on the cryptography 2. Hash Functions application (Definition 2) for the rest of this paper. Definition 1: A hash function H(M) = h maps an input bit string M from set A into a bit string of a fixed length in set B with the properties that 3. Problem Statement Let us consider the following scenario. Let A. be a set Every possible value of B is equally likely to be an consisting of m different messages M,, M2,. .., M,,,. Let image of an element in set A. M denote a particular message in A with a correspond- Given M, it is easy to compute h. ing hash element H(M) in set B.Two problems then arise.

Hash functions considered in the literature, e.g. Problem 1: Given sets A and B, determine the prob- [3,5,8,10,13,14,16] satisfy the above definition. ability that no two elements in A mesh into the same element of B, that is, determine the probability that We were prompted to address the problems on the there are no collisions. probability of collisions while employing hash func- tions for cryptography and had to deal with potential Problem 2: Given our particular message M, deter- collisions. In cryptography, the security of protocols mine the probability that no other element in A mesh- that use hash functions would be undermined if the es into the value, H(M), of the hash function at M. hash functions are not highly collision resistant. Additionally, it is should be extremely hard to recon- Answers to Problems I and 2 give two measures of struct the input string from its hash value in a reason- goodness of a hash function. An application can able amount of time. Such hash functions are called choose or design a hash function such that the good- one-way hash functions. ness of the hash function satisfies the application’s requirements. In cryptography, Problem I is also One-way hash functions are useful in cryptography important since a high probability of collisions because they enforce the property that the knowledge between any messages opens the door to the so-called of a particular value of H(M) does not help the attack- ‘birthday attacks’ (see [12].) Problem 2 is also relevant er to guess the value M that was used and at the same because a user is concerned with only his or her mes- time it provides a ‘fingerprint’ of M that is unique. sage M having a unique image in set B. There are many one-way hash functions such as SHA-I, MD2, MD4, MD5, and presently in In practical applications, Problems I and 2 can be stat- use (see [12]). ed as follows: given the size m of set A and the maxi-

172 Computers & Security, Vol. 7 7, No. 2

mum permissible value Q,, of the collision probabil- ity, what should be the minimum length, k, in bits of the values of the hash fimction? We answer these ques- tions associated with Problems 1 and 2 by deriving where corollaries to two theorems that answer Problem 1 and Problem 2, respectively. IB2j I IRj(z) I< 4. Estimates of Collision Probabilities 2j(2j - l)/zi2j-l co~~j-~(O.S. Im(z)) We will prove two theorems (Theorems 1 and 2) asso- ciated with Problems 1 and 2 stated above and also and B,, is the corresponding Bernoulli number. The consider some special cases. reason for the cosine term is that the formula is also applicable to the non-real values of Z. We will use this Theorem 1: For kz3 and ~n<2~-‘, the probability Q formula only with j=l, so that the summation term that there exists collisions between the hash values sat- will not be present, the only applicable Bernoulli isfies the following inequality: number is B, = l/6, and with z equal first to c and then to c-m. Therefore, z is real and positive, the cosine function of one half of its imaginary part is equal to 1 and thus 1RI(z)1 c In(c/(c-m))-05 In(c/(c-m))-m In(c/(c-m))~m+(R(c)-R(c-m)) p=b-l b-2 b-m+1 b! =(c-m-05) In(l+m/(r-m))-m+(R(c)-R(c-m)). -~x---_x..)( ---= (6) b b b bm(b-m)! (2) Let us denote n = c - m. From the assumptions of the where b=2k is the number of possible values of the theorem it follows immediately that m < n. hash function. In terms of the Gamma function r(x), (2) can be rewritten Using the bound for R(x) and the fact that n < c, we have 1R(c) - R(n)\ < l/6n so (6) can be written as UC) P= bmr(c-m) (3) ln(P)>(n-05) ln(l+m/n)-m-1/6n where c = b+l=2”+1. Hence >(n-0.5)(mln-m2 /2n2)-m-1/6n m 2 2 =_- _II1-+?!nl/fjn 2n 2n 4n2 ln(P)=ln(G(c))-m In(b)-ln(P(c-m)). 2 (m + 1)2 (4) =--+ m+m+1/3n 2n ( 2n 4n2 1 A logarithm of a Gamma function for the large values of the argument can be very well estimated by using >-(m+1)2 /2n. formula 8.344 from [7]. It states that for any j> 1,

173 On Probabilities of Hash Value MatcheslPeyravian, Roginsky, Kshemkalyani

Here we used the well-known and easily-verified The following two corollaries may prove to be very fact that ln(l+x)>x-G/2 for x>O. Hence p>e-(m+1)2’2n useful in all practical applications of the above results. and the statement of Theorem 1 follows from here. The upper bound for the probability in Theorem 1 This completes the proof ofTheorem 1. can be very closely approximated by m2/2-ck+‘). QED While there is no absolute guarantee that the latter expression provides an upper bound for Q, its close- Theorem la: Under the assumptions ofTheorem 1, ness to the estimate in (1) makes it fully acceptable for the probability Q that there exist collisions between all reasonable values of r, k and Q,. With this under- the hash values satisfies that following inequality: standing we can obtain the following results.

Corollary la: To achieve a target probability of Q,, or less for the probability of having hash collisions, it Q< cm+1>* is sufficient to have a hash function with the value of 2(2k +1-m) (7) k that satisfies Proof: k22.10gzm-log*Q,,,-1 It follows immediately from Theorem 1 and from the fact that 4y> 1 +x, for all x. QED QED Corollary lb: Let Ybe the length of a message in the Theorem la offers a very close approximation to the A set and suppose that the A set may include all of the true probability of having collisions if m is significant- 2’ different messages of length r bits. Suppose also that ly smaller than ~1. Q is expressed as 2’ for some t. Then it is sufficient to%ve k that satisfies Corollary 1: Given Q_ and m, it is sufficient to have k12r+t-1

Now we turn to Problem 2 and prove the following (m+ l)* k 2 log2 p+m-1 result. ( 2Qrnax 1 (8) Theorem 2: The probability Q that there exist col- Proof: lisions with a given hash value satisfies the following If k satisfies (8) then inequality:

(m+I)2 +m_I Q* Mi#M hashes into a value different from H(M) is Q< 2 Qmax (b-1)/b where, as before, b=2k. 2(2k +1-m) QED

174 Computers & Security, Vol. 77, No. 2

The properties of our hash function allow us to 5. Conclusion assume the independence of these collisions, so P = (l-l/6)m-‘. We have Analyzing the probability is important for choosing or designing appropriate hash functions for critical applications such as cryptography. A large number of hash functions studied in the literature sat- isfy the property that each hash value is equally likely to be the image of any given input. This paper ana- lyzed the hash collision probability by answering the following two questions for all hash functions that sat- >(m-1)(-l/b-lib2) isfy the above property.

=-mlb+(lib+llb2-mlb2)>-mlb, when m I b. What is the probability that there are no collisions at all?

Therefore PY”“~, so Q< l-~-~‘~. Theorem 2 is The answer to this question is particularly important proved. in knowing the susceptibility of the hash function- QED based to ‘birthday attacks’.

Theorem 2a: The probability Q that there exist col- Given any particular input string, what is the proba- lisions with a given hash value satisfies the inequality bility that no other input string in the domain gets hashed to the same value that the given input string Q$ hashed into?

The answer to this question gives the user of the cryp- Proof: to-system based on the hash function the probability Theorem 2a follows immediately from Theorem 2 and that his particular input string will not be involved in the fact that emX>I-x for all 9~. a collision with any other input. QED The theorems that answer the above questions are Corollary 2: Given Q_ and m, it is sufficient to have used to derive corollaries that answer the following questions for the two types of probabilities addressed: k 1 Iogz m - Iogz Qmax Given the size of input strings and the maximum per- Proof: missible value Q,, of the collision probability, what The proof logic follows that of the proof of Corollary 1. should be the minimum length k in bits of the value QED of the hash function?

Corollary 2a: Under the assumptions made in An application would choose or design a hash func- Corollary lb, it is sufficient for k to satisfy k_>r+t. tion of a length indicated by answering this question QED in order to guarantee an upper bound on the proba- bility of a hash collision. Example: If one considers the hashes of all messages 100 bits long and wants the probability of collisions References with a particular hash value not to exceed 2-60, then it [1] A. Aho, J. Hopcroft, J. Ullman, The Design and Analysis of is safe to use the hash function SHA-1 (k = 160) for Computer Akorithms, Addison-Wesley, Reading, Mass., 1974. this purpose. [2] A. Bolour, “Optimal+ Properties of Multiple Key Hash Functions,” Journal ofthe ACM, 26(2), 196-210, 1979.

175 On Probabilities of Hash Value MatcheslPeyravian, Roginskv, Kshemkalyani

[3] J. L. Carter, M. N. Wegman, “Universal Classes of Hash [lo] R. Morris,“Scatter Storage Techniques,” Communications ofthe Functions,“jocrrna/ of Computer and System Sciences, 18, 143-154, ACM, 11(l), 38-44, Jan. 1968.

1979. [ 1 l] B. Preneel, “Cryptographic Hash Functions,” European %ns. [4] I. B. Damgard,“Collision Free Hash Functions and Public Key z/ecommunications, 5, 431-448, 1994. Signature Schemes,” Advances in Cryptography, Eurocrypt ‘87, [ 121 B. Schneier, Applied Cryptography, 2nd edition, John Wiley and LNCS 304,203-216, 1987, Springer-Verlag. Sons, Inc, 1996. [5] I. B. Damgard, “A Design Principle for Hash Functions,” [13] D. Stinson, “Combinatorial Techniques for Universal Advances in Cryptology, Crypt0 ‘89, LNCS 435, 416-427, 1989, Hashing” journal of Computer and System Sciences, 48, 337-346, Springer-Verlag. 1994. [6] S. Goldwasser, S. Micali, R. Rivest, “A Paradoxical Solution to [14] J. D. Ullman, “A Note on the Efficiency of Hashing the Signature Problem,” Proc. 25th IEEE Conf. on Foundations of Functions”,~otrma~ of the ACM, 19(3), 569-575, July 1972. Computer Science, 441-448,1984. [ 151 J. D. ULLman, Principles of Database Systems, Computer Science [7] I. S. Gradshtein, I. M. Ryzhik, Tables of Integrals, Series, and Press, 1980. Products, Academic Press, 1980. [16] M. N. Wegman, J. L. Carter, “New Hash Functions and Their [8] V Lum, “General Performance Analysis of Key-to-Address Use in Authentication and Set quality,“Journal of Computer and Transformation Methods Using an Abstract File Concept,” System Sciences, 22,265-279, 1981. Commlrnicarions of the ACM, 16(10), 603-612, October 1973. [9] W. D. Maurer, T. G. Lewis, “Hash Table Methods,” ACM Computing Surveys, 7(l), 5-20, 1975.

176