Horner's Method: a Fast Method of Evaluating a Polynomial String Hash

Olympiads in Informatics, 2013, Vol. 7, 90–100
© 2013 Vilnius University

Where to Use and How not to Use Polynomial String Hashing

Jakub PACHOCKI, Jakub RADOSZEWSKI
Faculty of Mathematics, Informatics and Mechanics, University of Warsaw
Banacha 2, 02-097 Warsaw, Poland
e-mail: {pachocki,jrad}@mimuw.edu.pl

Abstract. We discuss the usefulness of polynomial string hashing in programming competition tasks. We show why several common choices of parameters of a hash function can easily lead to a large number of collisions. We particularly concentrate on the case of hashing modulo the size of the integer type used for computation of fingerprints, that is, modulo a power of two. We also give examples of tasks in which string hashing yields a solution much simpler than the solutions obtained using other known algorithms and data structures for string processing.

Key words: programming contests, hashing on strings, task evaluation.

1. Introduction

Hash functions are used to map large data sets of elements of an arbitrary length (the keys) to smaller data sets of elements of a fixed length (the fingerprints). The basic application of hashing is efficient testing of equality of keys by comparing their fingerprints. A collision happens when two different keys have the same fingerprint. The way in which collisions are handled is crucial in most applications of hashing. Hashing is particularly useful in construction of efficient practical algorithms.

Here we focus on the case of the keys being strings over an integer alphabet $\Sigma = \{0, 1, \ldots, A-1\}$. The elements of $\Sigma$ are called symbols.

An ideal hash function for strings should obviously depend both on the multiset of the symbols present in the key and on the order of the symbols. The most common family of such hash functions treats the symbols of a string as coefficients of a polynomial with an integer variable $p$ and computes its value modulo an integer constant $M$:

$$H(s_1 s_2 s_3 \ldots s_n) = (s_1 + s_2 p + s_3 p^2 + \cdots + s_n p^{n-1}) \bmod M.$$

A careful choice of the parameters $M$, $p$ is important to obtain "good" properties of the hash function, i.e., low collision rate.

Fingerprints of strings can be computed in $O(n)$ time using the well-known Horner's method:

$$h_{n+1} = 0, \qquad h_i = (s_i + p\, h_{i+1}) \bmod M \quad \text{for } i = n, n-1, \ldots, 1. \tag{1}$$

Polynomial hashing has a rolling property: the fingerprints can be updated efficiently when symbols are added or removed at the ends of the string (provided that an array of powers of $p$ modulo $M$ of sufficient length is stored). The popular Rabin–Karp pattern matching algorithm is based on this property (Karp and Rabin, 1987). Moreover, if the fingerprints of all the suffixes of a string are stored as in (1), one can efficiently find the fingerprint of any substring of the string:

$$H(s_i \ldots s_j) = (h_i - p^{\,j-i+1}\, h_{j+1}) \bmod M. \tag{2}$$

This enables efficient comparison of any pair of substrings of a given string.

When the fingerprints of two strings are equal, we basically have the following options: either we consider the strings equal henceforth, and this way possibly sacrifice the correctness in the case a collision occurs, or simply check symbol-by-symbol if the strings are indeed equal, possibly sacrificing the efficiency. The decision should be made depending on a particular application.

In this article we discuss the usefulness of polynomial string hashing in solutions of programming competition tasks. We first show why several common ways of choosing the constants $M$ and $p$ in the hash function can easily lead to a large number of collisions with a detailed discussion of the case in which $M = 2^k$. Then we consider examples of tasks in which string hashing applies especially well.

2. How to Choose Parameters of Hash Function?

2.1. Basic Constraints

A good requirement for a hash function on strings is that it should be difficult to find a pair of different strings, preferably of the same length $n$, that have equal fingerprints.
This excludes the choice of $M < n$. Indeed, in this case at some point the powers of $p$ corresponding to respective symbols of the string start to repeat. Assume that $p^i \equiv p^j \pmod{M}$ for $i < j < n$. Then the following two strings of length $n$ have the same fingerprint:

$$\underbrace{0 \ldots 0}_{i}\; a_1 a_2 \ldots a_{n-j}\; \underbrace{0 \ldots 0}_{j-i} \qquad \text{and} \qquad \underbrace{0 \ldots 0}_{j}\; a_1 a_2 \ldots a_{n-j}.$$

Similarly, if $\gcd(M, p) > 1$ then powers of $p$ modulo $M$ may repeat for exponents smaller than $n$. The safest choice is to set $p$ as one of the generators of the group $U(\mathbb{Z}_M)$, the group of all integers relatively prime to $M$ under multiplication modulo $M$. Such a generator exists if $M$ equals $2$, $4$, $q^a$ or $2q^a$, where $q$ is an odd prime and $a \ge 1$ is an integer (Weisstein, on-line). A generator of $U(\mathbb{Z}_M)$ can be found by testing a number of random candidates. We will not get into further details here; it is simply most important not to choose $M$ and $p$ for which $M \mid p^i$ for some integer $i$.

A slightly less obvious fact is that it is bad to choose $p$ that is too small. If $p < A$ (the size of the alphabet) then it is very easy to show two strings of length 2 that cause a collision: $H(01) = H(p0)$.

2.2. Upper Bound on M

We also need to consider the magnitude of the parameter $M$. Let us recall that most programming languages, and especially the languages C, C++ and Pascal that are used for IOI-style competitions, use built-in integer types for integer manipulation. The most popular such types operate on 32-bit or 64-bit numbers, which corresponds to computations modulo $2^{32}$ and $2^{64}$ respectively. Thus, to be able to use Horner's method for fingerprint computation (1), the value $(M-1) \cdot p + (M-1) = (M-1) \cdot (p+1)$ must fit within the selected integer type. However, if we wish to compute fingerprints for substrings of a string using (2), we need $(M-1)^2 + (M-1)$ to fit within the integer type, which bounds the range of possible values of $M$ significantly.
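As a worked instance, assume an unsigned 64-bit type (the choice of type is an assumption made here for illustration). Since $(M-1)^2 + (M-1) = M(M-1)$, the two conditions read:

```latex
\begin{align*}
\text{Horner's method (1):}\quad
  & (M-1)(p+1) \le 2^{64}-1, \\
\text{substring fingerprints (2):}\quad
  & (M-1)^2 + (M-1) = M(M-1) \le 2^{64}-1
  \;\Longleftrightarrow\; M \le 2^{32}.
\end{align*}
```

The equivalence holds because $2^{32}(2^{32}-1) = 2^{64} - 2^{32}$ still fits, while $(2^{32}+1) \cdot 2^{32}$ does not. The same calculation for an unsigned 32-bit type gives $M \le 2^{16}$.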
Alternatively, one could choose a greater constant $M$ and use a fancier integer multiplication algorithm (which is far less convenient).

2.3. Lower Bound on M

On the other side, $M$ is bounded due to the well-known birthday paradox: if we consider a collection of $m$ keys with $m \ge 1.2\sqrt{M}$ then the chance of a collision occurring within this collection is at least 50% (assuming that the distribution of fingerprints is close to uniform on the set of all strings). Thus if the birthday paradox applies then one needs to choose $M = \omega(m^2)$ to have a fair chance to avoid a collision. However, one should note that the birthday paradox does not always apply. As a benchmark consider the following two problems.

Problem 1: Longest Repeating Substring. Given a string $s$, compute the longest string that occurs at least twice as a substring of $s$.

Problem 2: Lexicographical Comparison. Given a string $s$ and a number of queries $(a, b, c, d)$ specifying pairs of substrings of $s$: $s_a s_{a+1} \ldots s_b$ and $s_c s_{c+1} \ldots s_d$, check, for each query, which of the two substrings is lexicographically smaller.

A solution to Problem 1 uses the fact that if $s$ has a repeating substring of length $k$ then it has a repeating substring of any length smaller than $k$. Therefore we can apply binary search to find the maximum length $k$ of a repeating substring. For a candidate value of $k$ we need to find out if there is any pair of substrings of $s$ of length $k$ with equal fingerprints. In this situation the birthday paradox applies. Here we assume that the distribution of fingerprints is close to uniform, and we also ignore the fact that fingerprints of consecutive substrings heavily depend on each other; both of these simplifying assumptions turn out not to influence the chance of a collision significantly.

The situation in Problem 2 is different.
For a given pair of substrings, we apply binary search to find the length of their longest common prefix and afterwards we compare the symbols that immediately follow the common prefix, provided that they exist. Here we have a completely different setting, since we only check if specific pairs of substrings are equal and we do not care about collisions across the pairs. In a uniform model, the chance of a collision within a single comparison is $\frac{1}{M}$, and the chance of a collision occurring within $m$ substring comparisons does not exceed $\frac{m}{M}$. The birthday paradox does not apply here.

2.4. What if M = 2^k?

A very tempting idea is not to select any value of $M$ at all: simply perform all the computations modulo the size of the integer type, that is, modulo $2^k$ for some positive integer $k$. Apart from simplicity we also gain efficiency since the modulo operation is relatively slow, especially for larger integer types. That is why many contestants often choose this method when implementing their solutions. However, this might not be the safest choice. Below we show a known family of strings which causes many collisions for such a hash function.
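One family commonly used to break hashing modulo $2^k$ is built from the Thue–Morse sequence. The sketch below is an illustration under that assumption and may differ in detail from the construction the authors present. It hashes a Thue–Morse word of length $2^{11}$ and its symbol-wise complement modulo $2^{64}$ for a few odd bases; the function names and the particular bases are hypothetical choices.

```python
MASK = (1 << 64) - 1  # arithmetic modulo 2**64, as with an unsigned 64-bit type


def poly_hash_mod_2_64(symbols, p):
    """Fingerprint s_1 + s_2*p + ... + s_n*p^(n-1) modulo 2**64 (Horner's method)."""
    h = 0
    for s in reversed(symbols):
        h = (s + p * h) & MASK
    return h


def thue_morse(k):
    """First 2**k symbols of the Thue-Morse sequence over the alphabet {0, 1}."""
    word = [0]
    for _ in range(k):
        word += [1 - s for s in word]  # append the complement of the current word
    return word


def complement(word):
    """Swap the two symbols 0 and 1 at every position."""
    return [1 - s for s in word]


if __name__ == "__main__":
    t = thue_morse(11)   # length 2048
    u = complement(t)    # differs from t at every position
    for p in (3, 31, 131, 1000003):
        print(p, poly_hash_mod_2_64(t, p) == poly_hash_mod_2_64(u, p))
```

The two words collide for every odd $p$ because the difference of their fingerprints is, up to sign, $\prod_{j=0}^{10} (p^{2^j} - 1)$, and for odd $p$ this product is divisible by $2^{64}$. An even $p$ is excluded already by the condition $\gcd(M, p) = 1$ from Section 2.1.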
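For recurrence (1) and formula (2) from the Introduction, a minimal sketch might look as follows. The function names are hypothetical, strings are represented as lists of integer symbols, and the arrays are 1-based to match the indices used in the text.

```python
def suffix_fingerprints(s, p, M):
    """Return h[1..n+1] from recurrence (1); h[i] is the fingerprint of s_i ... s_n.

    Index 0 of the result is unused so that indices match the 1-based text.
    """
    n = len(s)
    h = [0] * (n + 2)                      # h[n+1] = 0
    for i in range(n, 0, -1):
        h[i] = (s[i - 1] + p * h[i + 1]) % M
    return h


def powers(p, M, n):
    """pw[e] = p**e mod M for e = 0..n."""
    pw = [1] * (n + 1)
    for e in range(1, n + 1):
        pw[e] = pw[e - 1] * p % M
    return pw


def substring_fingerprint(h, pw, i, j, M):
    """Fingerprint of s_i ... s_j (1-based, inclusive) via formula (2)."""
    return (h[i] - pw[j - i + 1] * h[j + 1]) % M


def direct_hash(t, p, M):
    """Direct evaluation of the polynomial, used only as a cross-check."""
    return sum(c * pow(p, e, M) for e, c in enumerate(t)) % M
```

Python's `%` always returns a non-negative result for a positive modulus. In C or C++ the subtraction in (2) can go negative with signed types, so an extra `+ M` before the final reduction would be needed there.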
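The two collision estimates of Section 2.3 can be checked numerically. The sketch below assumes the idealized model used in the text, in which fingerprints are independent and uniform on a set of size $M$; the function names are hypothetical.

```python
import math


def collision_probability(m, M):
    """Probability that m independent uniform fingerprints from a set of
    size M are not all distinct (the birthday setting of Problem 1)."""
    if m > M:
        return 1.0
    # log of prod_{i=0}^{m-1} (1 - i/M), summed in log space for accuracy
    log_all_distinct = sum(math.log1p(-i / M) for i in range(m))
    return 1.0 - math.exp(log_all_distinct)


def union_bound(m, M):
    """Upper bound m/M on the chance of any collision within m independent
    pairwise comparisons (the setting of Problem 2)."""
    return min(1.0, m / M)


if __name__ == "__main__":
    M = 10**9 + 7
    m = math.ceil(1.2 * math.sqrt(M))
    print(m, collision_probability(m, M), union_bound(m, M))
```

With $M = 365$ and $m = 23$ the first function reproduces the classical birthday value of about 0.507, which matches the $1.2\sqrt{M}$ threshold quoted in the text.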
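The binary-search solution to Problem 1 described in Section 2.3 can be sketched as below. This version trusts fingerprints without a symbol-by-symbol check, so it may overestimate the answer if a collision occurs. The default values of `p` and `M` are assumptions made for illustration, not values recommended by the source, and the function name is hypothetical.

```python
def longest_repeating_substring_length(s, p=1000003, M=(1 << 61) - 1):
    """Length of the longest substring occurring at least twice in s
    (a list of integer symbols), decided by fingerprints only."""
    n = len(s)
    h = [0] * (n + 2)                      # suffix fingerprints, recurrence (1)
    for i in range(n, 0, -1):
        h[i] = (s[i - 1] + p * h[i + 1]) % M
    pw = [1] * (n + 1)                     # powers of p modulo M
    for e in range(1, n + 1):
        pw[e] = pw[e - 1] * p % M

    def has_repeat(k):
        """True if two substrings of length k share a fingerprint."""
        seen = set()
        for i in range(1, n - k + 2):
            f = (h[i] - pw[k] * h[i + k]) % M   # formula (2)
            if f in seen:
                return True
            seen.add(f)
        return False

    # If length k repeats, so does every smaller length: binary search on k.
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```

For example, `longest_repeating_substring_length(list(b"banana"))` examines the byte values of the string; the two occurrences found are allowed to overlap. Python integers do not overflow, so the bound on $M$ from Section 2.2 does not constrain this sketch the way it would in C or C++.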
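For Problem 2, the approach of binary searching for the longest common prefix and then comparing the next symbols can be sketched as follows. The class and method names are hypothetical, the defaults for `p` and `M` are assumptions made for illustration, and equal fingerprints are treated as equal substrings without verification.

```python
class SubstringComparer:
    """Lexicographic comparison of substrings of s (a list of integer
    symbols), with 1-based inclusive indices as in the text."""

    def __init__(self, s, p=1000003, M=(1 << 61) - 1):
        self.s, self.M = s, M
        n = len(s)
        self.h = [0] * (n + 2)             # suffix fingerprints, recurrence (1)
        for i in range(n, 0, -1):
            self.h[i] = (s[i - 1] + p * self.h[i + 1]) % M
        self.pw = [1] * (n + 1)            # powers of p modulo M
        for e in range(1, n + 1):
            self.pw[e] = self.pw[e - 1] * p % M

    def fingerprint(self, i, length):
        """Fingerprint of the substring of the given length starting at s_i."""
        return (self.h[i] - self.pw[length] * self.h[i + length]) % self.M

    def lcp(self, a, b, c, d):
        """Length of the longest common prefix of s_a..s_b and s_c..s_d."""
        lo, hi = 0, min(b - a, d - c) + 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if self.fingerprint(a, mid) == self.fingerprint(c, mid):
                lo = mid
            else:
                hi = mid - 1
        return lo

    def compare(self, a, b, c, d):
        """Return -1, 0 or 1 as s_a..s_b is smaller than, equal to or
        greater than s_c..s_d."""
        k = self.lcp(a, b, c, d)
        len1, len2 = b - a + 1, d - c + 1
        if k == len1 or k == len2:         # one substring is a prefix of the other
            return (len1 > len2) - (len1 < len2)
        x, y = self.s[a + k - 1], self.s[c + k - 1]   # symbols after the prefix
        return (x > y) - (x < y)
```

Each query performs $O(\log n)$ fingerprint comparisons after $O(n)$ preprocessing, and each of those comparisons involves one specific pair of substrings, which is why the $\frac{m}{M}$ bound applies here instead of the birthday bound.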
