String Matching

String Matching Problem

Pattern: compress

Text: We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary-based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed text presented by Amir, Benson and Farach.

(Every occurrence of the pattern "compress" inside the text above, e.g. within "compressed", "compressions", and "decompression", is a match to be reported.)

Notation & Terminology
• String S: S[1…n]
• Substring of S: S[i…j]
• Prefix of S: S[1…i]
• Suffix of S: S[i…n]
• |S| = n (string length)
• Ex: AGATCGATGGA (e.g. AGAT is a prefix, TCGA a substring, TGGA a suffix)

Notation
• Σ* is the set of all finite strings over Σ; the length of x is |x|; the concatenation of x and y is xy, with |xy| = |x| + |y|
• w is a prefix of x (w ⊏ x) provided x = wy for some y
• w is a suffix of x (w ⊐ x) provided x = yw for some y

Naïve String Matcher

• The equality test takes time O(m)
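The slides quote only the cost of the naïve matcher, so here is a minimal Python sketch of it (the function name naive_match is ours): it tries every shift s and runs the O(m) equality test at each one.

def naive_match(T, P):
    """Report every 0-indexed shift s such that T[s:s+m] == P."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):          # n - m + 1 candidate shifts
        if T[s:s+m] == P:               # O(m) equality test
            shifts.append(s)
    return shifts

# Example (the text and pattern used on the next slide):
# naive_match("xabxyabxyabxz", "abxyabxz") -> [5]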

♦ Complexity – the overall complexity is O((n−m+1)·m)

Another method?
• Can we do better than this?
  – Idea: shifting at a mismatch far enough for efficiency, but not too far for correctness.
  – Ex:
      T = x a b x y a b x y a b x z
      P =   a b x y a b x z          (7 characters match, then a mismatch)
      Jump:         a b x y a b x z  (all 8 characters match)

Preprocessing on P or T

Two types of algorithms for string matching
• Preprocessing on P (P fixed, T varied)
  – Ex: query P in a database T
  – Algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp
• Preprocessing on T (T fixed, P varied)
  – Ex: look for P in a dictionary T
  – Algorithms: suffix trees (and suffix arrays)

Rabin-Karp algorithm

• The algorithm has a worst-case running time of O((n−m+1)m), but an average-case running time of O(m+n)
• The algorithm uses some elementary ideas from number theory: the equivalence of two numbers modulo a third
• For simplicity assume that Σ = {0,1,2,...,9}, so that each character is a decimal digit; more generally, if the size of Σ is d, then we treat each character as a digit in radix-d notation
• We can think of a string of length k as a k-digit decimal number
• Example: the string of characters 354123 corresponds to the number 354,123
• Given a pattern P[1..m], let p be the decimal number corresponding to the m-character pattern

• Given a text T[1..n], for s = 0, 1, ..., n−m let t_s denote the decimal value of the length-m substring T[s+1 .. s+m] starting at position s+1

• Clearly, t_s = p if and only if P[1..m] = T[s+1 .. s+m]
• So, if we could compute p in O(m) time and compute all of the t_s in O(n) time, then in O(n) time we could compare all of the shifts with p and find all matching locations
• Of course, the numbers may get very large, but we'll solve that problem in a few slides

P → p
• p = P[m] + 10·P[m−1] + 10^2·P[m−2] + ... + 10^(m−1)·P[1]
• Computing this in a brute-force way would require O(m^2) operations to compute all of the powers of 10
• But notice that we can eliminate one multiplication by 10 in every term (but the first) by factoring the 10 out:
  p = P[m] + 10·(P[m−1] + 10·P[m−2] + ... + 10^(m−2)·P[1])
• Applying this factoring rule repeatedly leads to only a linear number (in m) of multiplications

Rabin-Karp
• Computing p in O(m) time: we use what is called Horner's rule
  p = P[m] + 10·(P[m−1] + 10·(P[m−2] + ... + 10·(P[2] + 10·P[1]) ... ))
• Example: P = 3276; P[4] = 6, P[3] = 7, P[2] = 2, P[1] = 3
  p = 6 + 10·(7 + 10·(2 + 10·3)) = 6 + 10·(7 + 10·32) = 6 + 10·327 = 6 + 3270 = 3276
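A one-function Python sketch of Horner's rule as used here (the name horner is ours; the pattern is assumed to be given as a string of decimal digits):

def horner(P, d=10):
    """Value of the digit string P in radix d, using O(m) multiplications."""
    p = 0
    for c in P:             # p = d*p + next digit, left to right
        p = d * p + int(c)
    return p

# horner("3276") -> 3276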

• t_0 can be computed from T[1..m] using the same method

Rabin-Karp

• The big question is then computing the remaining t_s in O(n−m) time, that is, with just a constant number of operations for each remaining t_s.

• But there is a simple relationship between t_{s+1} and t_s:

  t_{s+1} = 10·(t_s − 10^(m−1)·T[s+1]) + T[s+m+1]

• Example: Suppose T = 314152 and the length of the pattern is 5. Then t_0 = 31,415. What is t_1?

  t_1 = 10·(31,415 − 10^4·3) + 2 = 10·(1,415) + 2 = 14,152

Rabin-Karp

• We precompute the constant 10^(m−1) (which can be done in O(log m) time by repeated squaring), so each update of a t_s takes a constant number of operations.
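A quick numeric check of the update rule in Python, using the example above (variable names are ours):

T = "314152"
m = 5
h = 10 ** (m - 1)                          # the precomputed constant 10^(m-1)

t = int(T[0:m])                            # t_0 = 31415
t = 10 * (t - h * int(T[0])) + int(T[m])   # drop the leading 3, bring in the trailing 2
print(t)                                   # 14152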

• So, p and all of the t_s can be computed in O(m+n) time.

Rabin-Karp
• So, what is the problem? Well, if P is a long pattern, or if the size of the alphabet is large,

then p and the t_s may be very large numbers, and each arithmetic operation will not take constant time; the numbers might require many words of memory to represent.

Rabin-Karp
• Solution: compute p and the t_s modulo a suitable modulus q
• Horner's rule and the simple updating rule still work using mod-q arithmetic
• q is a large prime
• q is chosen such that 10q just fits within one computer word; this allows all of the arithmetic operations to be performed using single-precision arithmetic, while keeping q as large as possible
• For an alphabet with d symbols, q is chosen so that dq fits within one word
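Putting the pieces together, a minimal Python sketch of the Rabin-Karp matcher with a modulus (the function name rabin_karp and the prime q = 1000003 are our choices; every hit is verified character by character, for the reason discussed next):

def rabin_karp(T, P, d=10, q=1000003):
    """Report all 0-indexed shifts at which P occurs in T (characters are radix-d digits)."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                     # d^(m-1) mod q, precomputed
    p = t = 0
    for i in range(m):                       # Horner's rule, mod q
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s+m] == P:         # verify to rule out spurious hits
            shifts.append(s)
        if s < n - m:                        # rolling update: t_s -> t_{s+1}
            t = (d * (t - h * int(T[s])) + int(T[s + m])) % q
    return shifts

# rabin_karp("314152314152", "3141") -> [0, 6]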

♦ The catch is that p ≡ t_s (mod q) is no longer a sufficient condition for deciding that P matches a substring of T.
♦ What if two values collide?
  – As with hash-table functions, it is possible that two or more different strings produce the same "hit" value
  – Any hit has to be tested to verify that it is not spurious, i.e. that P[1..m] = T[s+1..s+m]

What is a worst case?

• Let P be a string of length m of all a's
• Let T be a string of length n of all a's
• Then all substrings of T of length m will have the same "key" as P, and all will have to be checked
• But we know that if a substring of T matches P, then to determine whether the "next" substring matches we only have to perform one additional comparison
• So the Rabin-Karp algorithm can do a lot of unnecessary work
• In practice, though, it doesn't

Knuth-Morris-Pratt Algorithm
• The key observation
  – When there is a mismatch after several characters match, the pattern and the search string share a common substring, which is also a prefix of the pattern
  – If there is a suffix of the part of the pattern we have matched which is equal to a prefix of that part, then we need to consider it as a possible match
  – By finding the largest such suffix, we make the smallest shift, so we never miss a match
• By using the prefix function the algorithm has a running time of O(n + m)

Components of the KMP algorithm

• The prefix function, Π: the prefix function Π for a pattern summarizes how the pattern matches against shifts of itself. This information is used to avoid useless shifts of the pattern P.
• The KMP matcher: given text T, pattern P, and prefix function Π as inputs, it finds all of the occurrences of P in T.

The prefix function, Π

Compute-Prefix-Function(p)
 1  m ← length[p]            // 'p' is the pattern to be matched
 2  Π[1] ← 0
 3  k ← 0
 4  for q ← 2 to m
 5      do while k > 0 and p[k+1] ≠ p[q]
 6             do k ← Π[k]
 7         if p[k+1] = p[q]
 8             then k ← k + 1
 9         Π[q] ← k
10  return Π
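A direct Python transcription of this pseudocode (0-indexed, so pi[q] is the length of the longest proper prefix of p[0..q] that is also a suffix of it; the function name is ours):

def compute_prefix_function(p):
    m = len(p)
    pi = [0] * m                      # pi[0] = 0, as in line 2
    k = 0
    for q in range(1, m):             # corresponds to q = 2..m in the 1-indexed pseudocode
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]             # fall back (line 6)
        if p[k] == p[q]:
            k += 1                    # extend the match (line 8)
        pi[q] = k
    return pi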

Example: compute Π for the pattern p = a b a b a c a

Initially: m = length[p] = 7, Π[1] = 0, k = 0

Step 1: q = 2, k = 0, Π[2] = 0

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0

Step 2: q = 3, k = 0, Π[3] = 1

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1

Step 3: q = 4, k = 1, Π[4] = 2

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2


Step 4: q = 5, k = 2, Π[5] = 3

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3

Step 5: q = 6, k = 3, Π[6] = 0  (the mismatch at p[6] = c drives k back to 0)

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0

Step 6: q = 7, k = 0, Π[7] = 1

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
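As a quick check, the Python transcription above reproduces this table (0-indexed output):

print(compute_prefix_function("ababaca"))   # [0, 0, 1, 2, 3, 0, 1]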

The KMP Matcher

The KMP matcher, with pattern p, text T, and prefix function Π as input, finds the occurrences of p in T. The following pseudocode is the matching component of the KMP algorithm:

KMP-Matcher(T, p)
 1  n ← length[T]
 2  m ← length[p]
 3  Π ← Compute-Prefix-Function(p)
 4  q ← 0                      // number of characters matched
 5  for i ← 1 to n             // scan T from left to right
 6      do while q > 0 and p[q+1] ≠ T[i]
 7             do q ← Π[q]     // next character does not match
 8         if p[q+1] = T[i]
 9             then q ← q + 1  // next character matches
10         if q = m            // is all of p matched?
11             then print "Pattern occurs with shift" i − m
12                  q ← Π[q]   // look for the next match

Note: KMP finds every occurrence of p in T. That is why KMP does not terminate at line 12; rather, it continues to search the remainder of T for any more occurrences of p.
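For reference, a 0-indexed Python transcription of the matcher, reusing compute_prefix_function from the sketch above and returning the shifts instead of printing them (the function name is ours):

def kmp_matcher(T, p):
    n, m = len(T), len(p)
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                                 # number of characters matched
    for i in range(n):                    # scan T from left to right
        while q > 0 and p[q] != T[i]:
            q = pi[q - 1]                 # next character does not match
        if p[q] == T[i]:
            q += 1                        # next character matches
        if q == m:                        # all of p matched
            shifts.append(i - m + 1)      # 0-indexed shift
            q = pi[q - 1]                 # look for the next match
    return shifts

# For the illustration that follows:
# kmp_matcher("bacbabababacaab", "ababaca") -> [6], the shift reported below.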


Illustration: given a text T and pattern p as follows:

T b a c b a b a b a b a c a a b

p a b a b a c a

Simulate the KMP algorithm to find whether p occurs in T.

For p, the prefix function Π was computed previously and is as follows:

q 1 2 3 4 5 6 7

p a b a b a c a

Π 0 0 1 2 3 0 1

Initially: n = size of T = 15; m = size of p = 7

Step 1: i = 1, q = 0; comparing p[1] with T[1]

T b a c b a b a b a b a c a a b

p a b a b a c a

p[1] does not match T[1]. p will be shifted one position to the right.

Step 2: i = 2, q = 0; comparing p[1] with T[2]

T b a c b a b a b a b a c a a b

p a b a b a c a

p[1] matches T[2]. Since there is a match, p is not shifted.

Step 3: i = 3, q = 1; comparing p[2] with T[3]. p[2] does not match T[3].

T b a c b a b a b a b a c a a b

p a b a b a c a

Backtracking on p, comparing p[1] and T[3]: p[1] does not match T[3] either, so q remains 0.

Step 4: i = 4, q = 0; comparing p[1] with T[4]. p[1] does not match T[4].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 5: i = 5, q = 0

Comparing p[1] with T[5]: p[1] matches T[5].

T b a c b a b a b a b a c a a b
p a b a b a c a

Step 6: i = 6, q = 1; comparing p[2] with T[6]. p[2] matches T[6].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 7: i = 7, q = 2; comparing p[3] with T[7]. p[3] matches T[7].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 8: i = 8, q = 3; comparing p[4] with T[8]. p[4] matches T[8].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 9: i = 9, q = 4; comparing p[5] with T[9]. p[5] matches T[9].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 10: i = 10, q = 5; comparing p[6] with T[10]. p[6] does not match T[10].

T b a c b a b a b a b a c a a b

p a b a b a c a

Backtracking on p, we compare p[4] with T[10], because after the mismatch q = Π[5] = 3. p[4] matches T[10], so q becomes 4.

Step 11: i = 11, q = 4

Comparing p[5] with T[11]: p[5] matches T[11].

T b a c b a b a b a b a c a a b
p a b a b a c a

Step 12: i = 12, q = 5; comparing p[6] with T[12]. p[6] matches T[12].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 13: i = 13, q = 6; comparing p[7] with T[13]. p[7] matches T[13].

T b a c b a b a b a b a c a a b

p a b a b a c a

Pattern p has been found to occur completely in the text T. It occurs with shift i − m = 13 − 7 = 6.

Run-time analysis

• Compute-Prefix-Function(p)
 1  m ← length[p]            // 'p' is the pattern to be matched
 2  Π[1] ← 0
 3  k ← 0
 4  for q ← 2 to m
 5      do while k > 0 and p[k+1] ≠ p[q]
 6             do k ← Π[k]
 7         if p[k+1] = p[q]
 8             then k ← k + 1
 9         Π[q] ← k
10  return Π

• KMP-Matcher(T, p)
 1  n ← length[T]
 2  m ← length[p]
 3  Π ← Compute-Prefix-Function(p)
 4  q ← 0
 5  for i ← 1 to n
 6      do while q > 0 and p[q+1] ≠ T[i]
 7             do q ← Π[q]
 8         if p[q+1] = T[i]
 9             then q ← q + 1
10         if q = m
11             then print "Pattern occurs with shift" i − m
12                  q ← Π[q]

In the pseudocode for computing the prefix function, the for loop starting at line 4 runs m − 1 times (q = 2, ..., m), and lines 1 to 3 take constant time; the inner while loop is paid for by the increments of k, so the running time of Compute-Prefix-Function is Θ(m).

In the matcher, the for loop beginning at line 5 runs n times, once per character of the text T. Lines 1, 2 and 4 take constant time and line 3 takes Θ(m); again the inner while loop is amortized against the increments of q, so the matching loop takes Θ(n) time and KMP as a whole runs in Θ(n + m).

Boyer-Moore Algorithm
• For longer patterns and a large Σ it is the most efficient algorithm
• What it does
  – Compares characters from right to left
  – Uses a "bad character" rule
  – Adds a "good suffix" rule
  – These two rules generate two different shift values; the larger of the two values is chosen
• In some cases Boyer-Moore can run in sub-linear time, which means it may not be necessary to check all of the characters in the search text!

String pattern matching - Boyer-Moore
Again, this algorithm uses fail functions to shift the pattern efficiently. However, Boyer-Moore starts at the end of the pattern, which can result in larger shifts. Two heuristics are used:

1: if you encounter a mismatch at character c in T, you can shift to the first occurrence of c in P from the right:

T   a b c a b c d g a b c e a b c d a c e d
P   a b c e b c d
P         a b c e b c d   (restart here)

String pattern matching - Boyer-Moore
2: if the examined suffix of P also occurs as a substring elsewhere in P, a shift can be performed to the closest occurrence of this substring:

T   c e a b c d a b c f g a b c e a b
P     d a b c a a b c

we can shift to:

P             d a b c a a b c   (restart here)

pattern:  d a b c a a b c   (fail2 value at the mismatch position: 7)

String pattern matching - Boyer-Moore

The second shift is also precomputed and put into a length(P)-sized array fail2[] (complexity O(length(P))).

pattern:  d a b c a a b c
fail2:    9 9 9 9 7 6 5 1   (shift = 7)

T   c e a b c d a b c f g a b c e a b
P     d a b c a a b c
P             d a b c a a b c
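As an illustration of the first (bad-character) heuristic only, here is a hedged Python sketch; the function names and the simplified shift on a full match are ours, and a complete Boyer-Moore implementation would combine this with the good-suffix table fail2 described above:

def last_occurrence(P):
    """Rightmost 0-indexed position of each character in P."""
    return {c: i for i, c in enumerate(P)}

def bm_bad_character(T, P):
    """Boyer-Moore search using only the bad-character rule."""
    n, m = len(T), len(P)
    last = last_occurrence(P)
    shifts = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[s + j]:        # compare right to left
            j -= 1
        if j < 0:
            shifts.append(s)
            s += 1                                # a full BM would use the good-suffix shift here
        else:
            bad = T[s + j]                        # mismatched text character
            s += max(1, j - last.get(bad, -1))    # align its rightmost occurrence in P
    return shifts

# bm_bad_character("here is a simple example", "example") -> [17]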

Suffix trees
Pre-process the text
• KMP and BM preprocess the pattern for string matching
• Instead, consider preprocessing the text, in particular by constructing a Patricia trie of ALL the suffixes of the text
• If P matches T, then it must be a prefix of some suffix! So if a search for P in the trie succeeds, then P must occur in T

Compressed Trie

• Compress unary nodes, label edges by strings

[Figure: a trie and the corresponding compressed trie, with chains of unary nodes merged into single edges labelled by strings]

Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

Suffix tree (Example)

Let s = abab; a suffix tree of s is a compressed trie of all suffixes of abab$.

The suffixes of abab$ are { abab$, bab$, ab$, b$, $ }.

[Figure: the suffix tree of abab$: from the root, an edge labelled $; an edge labelled b with leaf edges $ and ab$; and an edge labelled ab with leaf edges $ and ab$]

Insert the largest suffix, abab$
Insert the suffix bab$
Insert the suffix ab$
Put the suffix b$ in
Put the suffix $ in

[Figures: the compressed trie after each insertion, growing into the suffix tree of abab$]

• Label each leaf with the starting position of the corresponding suffix
• This trivial algorithm takes O(n^2) time to build the tree
• The construction can be done in O(n) time

[Figure: the suffix tree of abab$ with its leaves labelled 1, 2, 3, 4, 5]

What can we do with it?

Exact string matching: given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.

We may also want to find all occurrences of P in T.

Exact string matching

In the preprocessing step, build a suffix tree of T in O(n) time.

[Figure: the suffix tree of abab$, leaves labelled 1–5]

Given a pattern P = ab, traverse the tree according to the pattern.

[Figure: the same tree with the path spelling ab highlighted]

If we do not fail while traversing the pattern then the pattern occurs in the text.
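A hedged Python sketch of this idea using an uncompressed suffix trie (our own simplification: a real suffix tree compresses unary paths and is built in O(n), whereas this naive trie can take quadratic time and space to build). It also collects the leaves below the node that is reached, anticipating the next point:

def build_suffix_trie(s):
    """Naive suffix trie of s + '$'; each leaf remembers the 1-indexed suffix start."""
    s = s + "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1                # starting position of this suffix
    return root

def occurrences(root, P):
    """All starting positions of P: the leaves below the node reached by spelling P."""
    node = root
    for ch in P:
        if ch not in node:
            return []                           # fell off the trie: P does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:                                # collect every leaf in the subtree
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

# occurrences(build_suffix_trie("abab"), "ab") -> [1, 3]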

Each leaf in the subtree below the node reached corresponds to an occurrence. By traversing this subtree we find all k occurrences in O(m + k) time.

Drawbacks of suffix trees
• Suffix trees consume a lot of space
• The space is O(n), but the constant is quite big
• Recall that if we want to traverse an edge in O(1) time then we need an array of pointers of size |Σ| in each node

Suffix array
• We lose some of the functionality but save space.

Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the starting indices of the suffixes in sorted order: [3, 1, 4, 2].

Suffix array construction – naïve algorithm
• Naive in-place construction
• Similar to insertion sort
• Insert all the suffixes into the array one by one
• Running time: O(n^2) suffix comparisons, where n is the length of the string

How do we build it fast?
• Build a suffix tree
• Traverse the tree using DFS, lexicographically following the outgoing edges from each node, and fill in the suffix array
• O(n) time

How do we search for a pattern?
• If P occurs in T then all its occurrences are consecutive in the suffix array
• Do a binary search on the suffix array
• Takes O(m log n) time

Search example
Search for "is" in mississippi$.
• Examine the pattern letter by letter, reducing the range of occurrence each time
• First letter i: occurs at indices 0 to 3 of the suffix array, so the pattern must lie between these indices
• Second letter s: occurs at indices 2 to 3
• Done. Output: issippi$ and ississippi$

 index  start  suffix
   0      11   i$
   1       8   ippi$
   2       5   issippi$
   3       2   ississippi$
   4       1   mississippi$
   5      10   pi$
   6       9   ppi$
   7       7   sippi$
   8       4   sissippi$
   9       6   ssippi$
  10       3   ssissippi$
  11      12   $
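A minimal Python sketch of the naïve suffix-array construction and the binary-search lookup described above (the names are ours; it sorts with Python's built-in sort rather than insertion sort, uses the bisect module for the binary searches, and omits the terminator $ for simplicity, so the order differs slightly from the table above):

import bisect

def suffix_array(s):
    """1-indexed starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def search(s, sa, P):
    """All 1-indexed positions at which P occurs in s, by binary search on the suffix array."""
    suffixes = [s[i - 1:] for i in sa]                     # materialized only for clarity
    lo = bisect.bisect_left(suffixes, P)                   # first suffix >= P
    hi = bisect.bisect_right(suffixes, P + chr(0x10FFFF))  # just past the suffixes starting with P
    return sorted(sa[lo:hi])

s = "mississippi"
print(suffix_array(s))                    # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(s, suffix_array(s), "is"))   # [2, 5]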