String Matching String Matching Problem

String Matching String Matching Problem Pattern： compress Text： We introduce a general framework which is suitable to capture an essence of compresscompressed pattern matching according to various dictionary based compresscompressions. The goal is to find all occurrences of a pattern in a text without decompression,compress which is one of the most active topics in string matching. Our framework includes such compresscompression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed compresstext presented by Amir, Benson and Farach. Notaon & Terminology • String S: – S[1…n] • Sub-string of S : – S[i…j] • Prefix of S: – S[1…i] • Suffix of S: – S[i…n] • |S| = n (string length) • Ex: AGATCGATGGA Prefix Sub-string Suffix Notaon • Σ* is the set of all finite strings over Σ; length is |x|; concatenaon of x and y is xy with |xy| = |x| + |y| • w is a prefix of x (w x) provided x = wy for some y • w is a suffix of x (w x) provided x = yw for some y Naïve String Matcher • The equality test takes 3me O(m) ♦ Complexity – the overall complexity is O((n-m+1)m) Another method? • Can we do beNer than this? – Idea: shiQing at a mismatch far enough for efficiency not too far for correctness. – Ex: T = x a b x y a b x y a b x z P = a b x y a b x z a b x y a b x z X a b x y a b x z v v v v v v v X Jump a b x y a b x z v v v v v v v v Ø Preprocessing on P or T Two types of algorithms for string matching • Preprocessing on P (P fixed , T varied) – Ex: Query P in database T – Algorithms: • Knuth – Morris - Pratt • Boyer – Moore • Preprocessing on T (T fixed , P varied) – Ex: Look for P in dictionary T – Algorothms: – Rabin-Karp – Suffix Trees Rabin-Karp algorithm • Algorithm has worst case running time of O((n-m +1)m), but an average case running time of O(m+n) • Algorithm uses some elementary ideas from number theory - the equivalence of two numbers modulo a third • For simplicity assume that Σ= {0,1,2,..., 9} so that each character is a decimal digit – more generally, if the size of Σ is d, then we will treat each character as a digit in radix d notation. • We can think of a string of length k as a k-digit decimal number. Rabin Karp algorithm • Example - string of characters 354123 corresponds to the number 354,123 • Given a paern P[1 .. m], let p be the decimal number corresponding to the m character paern • Given a text T[1 ..n], let ts = T[s+1 .. s+m] for s = 0, 1, ... n-m denotes the decimal value of the m character substring star3ng at posi3on s+1. • Clearly, ts = p if and only if P[1 ..m] = T[s+1 .. s+m] • So, if we could compute p in O(m) 3me and compute all of the ts in O(n) 3me, then in O(n) 3me we can compare all of the shiQs with p and find all matching locaons. • Of course, the numbers may get very large, but we'll solve that problem in 2-3 slides P --> p • p = P[m] + 10 x P[m-1] + 10 2 x P[m-2] + ... 10 m-1 x P[1] • Compu3ng this in a brute force way would require O(m2) operaons to compute all of the powers of 10. • But no3ce, that we can eliminate one mul3plicaon by 10 in all terms (but the first), by factoring the 10 out: • p = P[m] + 10 (P[m-1] + 10 x P[m-2] + ... + 10 m-2 x P[1]) • Applying this factoring rule repeatedly will lead to only a linear number, in m, of mul3plicaons. Rabin-Karp • Compu3ng p in O(m) 3me. We use what is called Horner's rule • p = P[m] + 10 (P[m-1] +10 (P[m-2]) + ... + 10(P[2] + 10P[1])...))) • Example: P = 3276; P[4] = 6, P[3] = 7, P[2] = 2, P[1] =3 • p = 6 + 10(7 +10(2 + 10(3)))) = 6 + 10(7 + 10(32)) = 6 + 10(327) = 6 + 3270 = 3276 • t0 can be computed from T[1 .. m] using the same method Rabin Karp • Big question is then computing the remaining ti in O(n-m) time - that is, with just a constant number of operations for each remaining ti. • But, there is a simple relationship between t s+1 and ts: (m-1) t s+1 = 10(ts - 10 T[s]) + T[s+m+1] • Example: Suppose T = 314152 and the length of the pattern is 5. Then t0 = 31, 415. What is t1? 4 t1 = 10(31,415 - 10 (3)) + 2 = 10( 1,415) + 2 = 14,152 Rabin Karp • We precompute the constant 10(m-1) (which can be done in logm time) so each update of a t takes a constant number of operations. • So, p and all of the ti can be computed in O(m+n) time. Rabin Karp • So - what is the problem? Well, if P is a long paern, or if the size of the alphabet is large, then the p and ti may be very large numbers, and each arithme3c operaon will not take constant 3me - the numbers might require many words of memory to represent Rabin Karp • Solu3on - compute the p and t's modulo a suitable modulus q. • Horner's rule and the simple updang rule will work using mod q arithme3c • q is large and prime • q is chosen such that 10q just fits within one computer word. This allows all of the arithme3c operaons to be performed using single precision arithme3c, while keeping q as large as possible • For an alphabet with d symbols, q is chosen so that dq fits within one word. Rabin Karp ♦ The catch is that p = ts is no longer a sufficient condition for deciding that P matches a substring of T. ♦ What if two values collide? – Similar to hash table func3ons, it is possible that two or more different strings produce the same “hit” value – any hit will have to be tested to verify that it is not spurious and that p[1..m] = T[s+1..s+m] The Calculaons What is a worst case? • Let P be a string of length m of all a's • Let T be a string of length n of all a's • Then all substrings of length m of T will have the same "key" as P, and all will have to be checked. • But we know that if a substring of T matches P, to determine if the "next" substring matches we only have to perform one additional comparison. • So, the Rabin Karp algorithm can do a lot of unnecessary work. • In practice, though, it doesn't. The Algorithm Knuth-Morris-Pra Algorithm • The key observaon – when there is a mismatch aer several characters match, then the paern and search string share a common substring, which is also a prefix of the paern – If there is a suffix of the part of the paern we have matched which is equal to a prefix of that part, then we need to consider it as a possible match. – By finding the largest such suffix, we make the smallest shiQ – so we never miss a match. • By using the prefix func3on the algorithm has running 3me of O(n + m) Components of KMP algorithm • The prefix function, Π The prefix function,Π for a pattern summarizes how the pattern matches against shifts of itself. This information is used to avoid useless shifts of the pattern P. • The KMP Matcher Given text T, pattern P and prefix function Π as inputs, finds all of the occurrence of P in T. The prefix function, Π Compute-Prefix-Function (p) 1 m ß length[p] //’p’ pattern to be matched 2 Π[1] ß 0 3 k ß 0 4 for q ß 2 to m 5 do while k > 0 and p[k+1] != p[q] 6 do k ß Π[k] 7 If p[k+1] = p[q] 8 then k ß k +1 9 Π[q] ß k 10 return Π Example: compute Π for the pattern ‘p’ below: p do while k > 0 and p[k+1] ! a b a b a c a = p[q] do k ß Π[k] If p[k+1] = p[q] Initially: m = length[p] = 7 then k ß k +1 Π[1] = 0 Π[q] ß k k = 0 q 1 2 3 4 5 6 7 Step 1: q = 2, k=0 p a b a b a c a Π[2] = 0 Π 0 0 q 1 2 3 4 5 6 7 Step 2: q = 3, k = 0, p a b a b a c a Π[3] = 1 Π 0 0 1 q 1 2 3 4 5 6 7 Step 3: q = 4, k = 1 p a b a b a c A Π[4] = 2 Π 0 0 1 2 do while k > 0 and p[k+1] != p[q] do k ß Π[k] If p[k+1] = p[q] then k ß k +1 Π[q] ß k Step 4: q = 5, k =2 q 1 2 3 4 5 6 7 Π[5] = 3 p a b a b a c a Π 0 0 1 2 3 q 1 2 3 4 5 6 7 Step 5: q = 6, k = 3 Π[6] = 1 p a b a b a c a Π 0 0 1 2 3 0 q 1 2 3 4 5 6 7 Step 6: q = 7, k = 1 p a b a b a c a Π[7] = 1 Π 0 0 1 2 3 0 1 The KMP Matcher The KMP Matcher, with pattern ‘p’, string T and prefix function ‘Π’ as input, finds a match of p in T.

String Matching String Matching Problem

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support