Horner's Method: a Fast Method of Evaluating a Polynomial String Hash
Total Page:16
File Type:pdf, Size:1020Kb
Olympiads in Informatics, 2013, Vol. 7, 90–100 90 2013 Vilnius University Where to Use and Ho not to Use Polynomial String Hashing Jaku' !(CHOCKI, $ak&' RADOSZEWSKI Faculty of Mathematics, Informatics and Mechanics, University of Warsaw Banacha 2, 02-097 Warsaw, oland e-mail: {pachoc$i,jrad}@mimuw.edu.pl Abstract. We discuss the usefulness of polynomial string hashing in programming competition tasks. We sho why se/eral common choices of parameters of a hash function can easily lead to a large number of collisions. We particularly concentrate on the case of hashing modulo the size of the integer type &sed for computation of fingerprints, that is, modulo a power of t o. We also gi/e examples of tasks in hich strin# hashing yields a solution much simpler than the solutions obtained using other known algorithms and data structures for string processing. Key words: programming contests, hashing on strings, task e/aluation. 1. Introduction Hash functions are &sed to map large data sets of elements of an arbitrary length 3the $eys4 to smaller data sets of elements of a 12ed length 3the fingerprints). The basic appli6 cation of hashing is efficient testin# of equality of %eys by comparin# their 1ngerprints. ( collision happens when two different %eys ha/e the same 1ngerprint. The ay in which collisions are handled is crucial in most applications of hashing. Hashing is particularly useful in construction of efficient practical algorithms. Here e focus on the case of the %eys 'ein# strings o/er an integer alphabetΣ= 0,1,...,A 1 . 5he elements ofΣ are called symbols. { − } An ideal hash f&nction for strings should o'viously depend 'oth on the multiset of the symbols present in the %ey and on the order of the symbols. The most common family of such hash functions treats the symbols of a string as coefficients of a polynomial with an integer /ariablep and computes its /alue modulo an inte#er constantM8 − H(s s s . s ) = (s +s p+s p2 + +s pn 1) modM. 1 2 3 n 1 2 3 ··· n ( careful choice of the parametersM,p is important to obtain 9#ood” properties of the hash function, i.e., lo collision rate. Fingerprints of strings can be computed inO(n) time &sin# the ell-kno n Horner’s method: h = 0, h = (s + ph ) modM fori= n, n 1,...,1. (1) n+1 i i i+1 − Where to Use and How not to Use olynomial String ,ashin* 91 Polynomial hashing has a rolling property: the 1ngerprints can be updated ef1ciently when symbols are added or remo/ed at the ends of the strin# (pro/ided that an array of powers ofp moduloM of suf1cient length is stored). The popular Rabin–Karp pattern matching algorithm is 'ased on this property 3*arp and +abin, 1987). Moreo/er, if the fingerprints of all the suf12es of a string are stored as in 31), one can efficiently find the fingerprint of any substring of the string8 − H(s . s ) = (h p j i+1 h ) modM. (2) i j i − j+1 This enables ef1cient comparison of any pair of substrings of a #i/en string. When the 1ngerprints of two strings are equal, we basically ha/e the follo ing op- tions8 either e consider the strings equal henceforth, and this ay possibly sacrifice the correctness in the case a collision occurs, or simply check symbol-by-symbol if the strings are indeed equal, possibly sacrificin# the efficiency. The decision should be made dependin# on a particular application. In this article e discuss the usefulness of polynomial string hashing in solutions of programming competition tasks. We 1rst sho why se/eral common ays of choosing the constantsM andp in the hash function can easily lead to a large number of collisions with a detailed discussion of the case in hichM=2 k. 5hen e consider e2amples of tasks in hich strin# hashing applies especially ell. 2. How to Choose Parameters of Hash Function? 2.1. Basic Constraints ( #ood requirement for a hash function on strings is that it should be dif1cult to find a pair of different strings, preferably of the same lengthn, that ha/e equal fingerprints. This excludes the choice ofM<n. Indeed, in this case at some point the po ers ofp corre- sponding to respecti/e symbols of the string start to repeat. (ssume thatp i p j modM ≡ fori<j<n. Then the follo in# two strings of lengthn ha/e the same fingerprint: 0...0 a1a2 . an−j 0...0 and0...0 a1a2 . an−j. � ��i � � j��−i � � ��j � Similarly, if gcd(M, p)>1 then powers ofp moduloM may repeat for exponents smaller thann. The safest choice is to setp as one of the generators of the groupU(Z M )– the #roup of all integers relati/ely prime toM under multiplication moduloM. "&ch a generator exists ifM e7&als 2, 4,q a or2q a whereq is an odd prime anda�1 is integer (Weisstein, on-line). ( generator ofU(Z M ) can 'e found by testing a number of random candidates. We will not get into further details here@ it is simply most important not to chooseM andp for hichM p i for any integeri. | ( slightly less ob/ious fact is that it is bad to choosep that is too small. Ifp<A (the size of the alphabet) then it is /ery easy to sho two strings of length 2 that cause a 92 /( achocki, /( 0adoszewski collision: H(01) =H(p0). 2.2. U##er Bound onM We also need to consider the magnitude of the parameterM. Let us recall that most programming languages, and especially the languages C, )++, !ascal that are used for IOI-style competitions, &se '&ilt-in integer types for integer manipulation. 5he most pop6 ular such types operate on 32-bit or B?6bit numbers hich corresponds to computations modulo2 32 and2 64 respecti/ely. Thus, to be able to use Horner’s method for fingerprint computation 31), the /alue(M 1) p+(M 1)=(M 1) (p+1) must fit ithin − · − − · the selected inte#er type. Howe/er, if we wish to compute fingerprints for s&bstrings of a string using the (2), we need(M 1) 2 + (M 1) to 1t ithin the integer type, hich − − bounds the range of possible /alues ofM significantly. (lternati/ely, one could choose a #reater constantM and use a fancier integer multiplication algorithm 3 hich is far less con/enient). 2.3. 2ower Bound onM On the other sideM is 'ounded due to the ell-known birthday paradox8 if e consider a collection ofm %eys withm�1.2 √M then the chance of a collision to occur within this collection is at least 50% (assuming that the distribution of fingerprints is close to uniform on the set of all strings). Thus if the birthday paradox applies then one needs to chooseM=ω(m 2) to ha/e a fair chance to a/oid a collision. Howe/er, one should note that not al ays the 'irthday paradox applies. As a benchmark consider the follo ing two problems. Problem 1: Longest Repeating Substring. Gi/en a strings, compute the longest string that occurs at least twice as a substring ofs. Problem 2: Lexicographical Comparison. Gi/en a strings and a number of 7&eries (a,b,c,d) specifying pairs of substrings ofs8s asa+1 . sb ands csc+1 . sd, check, for each 7uery, hich of the t o s&bstrings is le2icographically smaller. ( solution to !roblem 1 &ses the fact that ifs has a repeatin# substrin# of lengthk then it has a repeatin# s&bstring of any length smaller thank. Therefore we can apply bi6 nary search to find the maximum lengthk of a repeating s&bstring. ;or a candidate /alue ofk we need to find out if there is any pair of substrings ofs of lengthk ith e7&al 1nger6 prints. In this situation the 'irthday paradox applies. Here we assume that the distribution of 1ngerprints is close to &niform, we also ignore the fact that fingerprints of consecuti/e substrings heavily depend on each other – both of these simplifying assumptions turn out not to influence the chance of a collision significantly. The situation in Problem 2 is different. ;or a gi/en pair of s&'strings, we apply binary search to 1nd the length of their longest common prefi2 and after ards we compare the Where to Use and How not to Use olynomial String ,ashin* 93 symbols that immediately follo the common prefix, provided that they e2ist. Here we ha/e a completely different setting, since we only chec% if specific pairs of substrings are equal and we do not care about collisions across the pairs. In a &niform model, the chance 1 of a collision within a single comparison is M , and the chance of a collision occurring m withinm substrin# comparisons does not e2ceed M . The 'irthday paradox does not apply here. 2.4. What ifM=2 k3 ( /ery temptin# idea is not to select any /alue ofM at all8 simply perform all the compu- tations modulo the size of the inte#er type, that is, modulo2 k for some positi/e inte#erk. Apart from simplicity e also #ain efficiency since the modulo operation is relati/ely slo , especially for larger integer types. That is why many contestants often choose this method hen implementin# their solutions. Howe/er, this might not be the safest choice. Belo e sho a %no n family of strings which causes many collisions for s&ch a hash function.