String Matching String Matching Problem
Pattern: compress
Text: We introduce a general framework which is suitable to capture an essence of compress compressed pattern matching according to various dictionary based compresscompressions. The goal is to find all occurrences of a pattern in a text without decompression,compress which is one of the most active topics in string matching. Our framework includes such compresscompression methods as Lempel-Ziv family, (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed compresstext presented by Amir, Benson and Farach. Nota on & Terminology • String S: – S[1…n] • Sub-string of S : – S[i…j] • Prefix of S: – S[1…i] • Suffix of S: – S[i…n] • |S| = n (string length) • Ex: AGATCGATGGA Prefix Sub-string Suffix Nota on • Σ* is the set of all finite strings over Σ; length is |x|; concatena on of x and y is xy with |xy| = |x| + |y| • w is a prefix of x (w x) provided x = wy for some y • w is a suffix of x (w x) provided x = yw for some y Naïve String Matcher
• The equality test takes me O(m)
♦ Complexity – the overall complexity is O((n-m+1)m) Another method? • Can we do be er than this? – Idea: shi ing at a mismatch far enough for efficiency not too far for correctness. – Ex: T = x a b x y a b x y a b x z P = a b x y a b x z a b x y a b x z X a b x y a b x z v v v v v v v X Jump a b x y a b x z v v v v v v v v Ø Preprocessing on P or T Two types of algorithms for string matching • Preprocessing on P (P fixed , T varied) – Ex: Query P in database T – Algorithms: • Knuth – Morris - Pratt • Boyer – Moore • Preprocessing on T (T fixed , P varied) – Ex: Look for P in dictionary T – Algorothms: – Rabin-Karp – Suffix Trees Rabin-Karp algorithm
• Algorithm has worst case running time of O((n-m +1)m), but an average case running time of O(m+n) • Algorithm uses some elementary ideas from number theory - the equivalence of two numbers modulo a third • For simplicity assume that Σ= {0,1,2,..., 9} so that each character is a decimal digit – more generally, if the size of Σ is d, then we will treat each character as a digit in radix d notation. • We can think of a string of length k as a k-digit decimal number. Rabin Karp algorithm • Example - string of characters 354123 corresponds to the number 354,123 • Given a pa ern P[1 .. m], let p be the decimal number corresponding to the m character pa ern
• Given a text T[1 ..n], let ts = T[s+1 .. s+m] for s = 0, 1, ... n-m denotes the decimal value of the m character substring star ng at posi on s+1.
• Clearly, ts = p if and only if P[1 ..m] = T[s+1 .. s+m] • So, if we could compute p in O(m) me and compute all of the ts in O(n) me, then in O(n) me we can compare all of the shi s with p and find all matching loca ons. • Of course, the numbers may get very large, but we'll solve that problem in 2-3 slides P --> p • p = P[m] + 10 x P[m-1] + 10 2 x P[m-2] + ... 10 m-1 x P[1] • Compu ng this in a brute force way would require O(m2) opera ons to compute all of the powers of 10. • But no ce, that we can eliminate one mul plica on by 10 in all terms (but the first), by factoring the 10 out: • p = P[m] + 10 (P[m-1] + 10 x P[m-2] + ... + 10 m-2 x P[1]) • Applying this factoring rule repeatedly will lead to only a linear number, in m, of mul plica ons. Rabin-Karp • Compu ng p in O(m) me. We use what is called Horner's rule • p = P[m] + 10 (P[m-1] +10 (P[m-2]) + ... + 10(P[2] + 10P[1])...))) • Example: P = 3276; P[4] = 6, P[3] = 7, P[2] = 2, P[1] =3 • p = 6 + 10(7 +10(2 + 10(3)))) = 6 + 10(7 + 10(32)) = 6 + 10(327) = 6 + 3270 = 3276
• t0 can be computed from T[1 .. m] using the same method Rabin Karp
• Big question is then computing the remaining ti in O(n-m) time - that is, with just a constant number of
operations for each remaining ti.
• But, there is a simple relationship between t s+1 and ts: (m-1) t s+1 = 10(ts - 10 T[s]) + T[s+m+1] • Example: Suppose T = 314152 and the length of
the pattern is 5. Then t0 = 31, 415. What is t1? 4 t1 = 10(31,415 - 10 (3)) + 2 = 10( 1,415) + 2 = 14,152 Rabin Karp
• We precompute the constant 10(m-1) (which can be done in logm time) so each update of a t takes a constant number of operations.
• So, p and all of the ti can be computed in O(m+n) time. Rabin Karp • So - what is the problem? Well, if P is a long pa ern, or if the size of the alphabet is large,
then the p and ti may be very large numbers, and each arithme c opera on will not take constant me - the numbers might require many words of memory to represent Rabin Karp • Solu on - compute the p and t's modulo a suitable modulus q. • Horner's rule and the simple upda ng rule will work using mod q arithme c • q is large and prime • q is chosen such that 10q just fits within one computer word. This allows all of the arithme c opera ons to be performed using single precision arithme c, while keeping q as large as possible • For an alphabet with d symbols, q is chosen so that dq fits within one word. Rabin Karp
♦ The catch is that p = ts is no longer a sufficient condition for deciding that P matches a substring of T. ♦ What if two values collide? – Similar to hash table func ons, it is possible that two or more different strings produce the same “hit” value – any hit will have to be tested to verify that it is not spurious and that p[1..m] = T[s+1..s+m] The Calcula ons What is a worst case?
• Let P be a string of length m of all a's • Let T be a string of length n of all a's • Then all substrings of length m of T will have the same "key" as P, and all will have to be checked. • But we know that if a substring of T matches P, to determine if the "next" substring matches we only have to perform one additional comparison. • So, the Rabin Karp algorithm can do a lot of unnecessary work. • In practice, though, it doesn't. The Algorithm Knuth-Morris-Pra Algorithm • The key observa on – when there is a mismatch a er several characters match, then the pa ern and search string share a common substring, which is also a prefix of the pa ern – If there is a suffix of the part of the pa ern we have matched which is equal to a prefix of that part, then we need to consider it as a possible match. – By finding the largest such suffix, we make the smallest shi – so we never miss a match. • By using the prefix func on the algorithm has running me of O(n + m) Components of KMP algorithm
• The prefix function, Π The prefix function,Π for a pattern summarizes how the pattern matches against shifts of itself. This information is used to avoid useless shifts of the pattern P. • The KMP Matcher Given text T, pattern P and prefix function Π as inputs, finds all of the occurrence of P in T. The prefix function, Π
Compute-Prefix-Function (p) 1 m ß length[p] //’p’ pattern to be matched 2 Π[1] ß 0 3 k ß 0 4 for q ß 2 to m 5 do while k > 0 and p[k+1] != p[q] 6 do k ß Π[k] 7 If p[k+1] = p[q] 8 then k ß k +1 9 Π[q] ß k 10 return Π
Example: compute Π for the pattern ‘p’ below: p do while k > 0 and p[k+1] ! a b a b a c a = p[q] do k ß Π[k] If p[k+1] = p[q] Initially: m = length[p] = 7 then k ß k +1 Π[1] = 0 Π[q] ß k k = 0 q 1 2 3 4 5 6 7 Step 1: q = 2, k=0 p a b a b a c a Π[2] = 0 Π 0 0
q 1 2 3 4 5 6 7 Step 2: q = 3, k = 0, p a b a b a c a Π[3] = 1 Π 0 0 1
q 1 2 3 4 5 6 7 Step 3: q = 4, k = 1 p a b a b a c A Π[4] = 2 Π 0 0 1 2
do while k > 0 and p[k+1] != p[q] do k ß Π[k] If p[k+1] = p[q] then k ß k +1 Π[q] ß k
Step 4: q = 5, k =2 q 1 2 3 4 5 6 7 Π[5] = 3 p a b a b a c a
Π 0 0 1 2 3
q 1 2 3 4 5 6 7 Step 5: q = 6, k = 3 Π[6] = 1 p a b a b a c a Π 0 0 1 2 3 0
q 1 2 3 4 5 6 7 Step 6: q = 7, k = 1 p a b a b a c a Π[7] = 1 Π 0 0 1 2 3 0 1
The KMP Matcher
The KMP Matcher, with pattern ‘p’, string T and prefix function ‘Π’ as input, finds a match of p in T. Following pseudo-code computes the matching component of KMP algorithm: KMP-Matcher(S,p) 1 n ß length[T] 2 m ß length[p] 3 Π ß Compute-Prefix-Function(p) 4 q ß 0 //number of characters matched 5 for i ß 1 to n //scan S from left to right 6 do while q > 0 and p[q+1] != T[i] 7 do q ß Π[q] //next character does not match 8 if p[q+1] = T[i] 9 then q ß q + 1 //next character matches 10 if q = m //is all of p matched? 11 then print “Pattern occurs with shift” i – m 12 q ß Π[ q] // look for the next match
Note: KMP finds every occurrence of a ‘p’ in ‘T’. That is why KMP does not terminate in step 12, rather it searches remainder of ‘S’ for any more occurrences of ‘p’.
5 for i ß 1 to n //scan S from left to right 6 do while q > 0 and p[q+1] != T[i] 7 do q ß Π[q] //next character does not match 8 if p[q+1] = T[i] 9 then q ß q + 1 //next character matches 10 if q = m //is all of p matched? 11 then print “Pattern occurs with shift” i – m 12 q ß Π[ q] // look for the next match
Note: KMP finds every occurrence of a ‘p’ in ‘T’. That is why KMP does not terminate in step 12, rather it searches remainder of ‘S’ for any more occurrences of ‘p’. Illustration: given a String ‘T’ and pattern ‘p’ as follows:
T b a c b a b a b a b a c a c a
p a b a b a c a Simulate the KMP algorithm to find whether ‘p’ occurs in ‘T’.
For ‘p’ the prefix function, Π was computed previously and is as follows: q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1 Initially: n = size of T = 15; m = size of p = 7
Step 1: i = 1, q = 0 comparing p[1] with T[1]
T b a c b a b a b a b a c a a b
p a b a b a c a p[1] does not match with T[1]. ‘p’ will be shifted one position to the right. Step 2: i = 2, q = 0 comparing p[1] with T[2]
T b a c b a b a b a b a c a a b
p a b a b a c a
P[1] matches T[2]. Since there is a match, p is not shifted. Step 3: i = 3, q = 1 Comparing p[2] with T[3] p[2] does not match T[3] T b a c b a b a b a b a c a a b
p a b a b a c a Backtracking on p, comparing p[1] and T[3] Step 4: i = 4, q = 0 comparing p[1] with T[4] p[1] does not match T[4]
T b a c b a b a b a b a c a a b
p a b a b a c a Step 5: i = 5, q = 0
comparing p[1] with T[5] p[1] matches with T[5]
T b a c b a b a b a b a c a a b p a b a b a c a Step 6: i = 6, q = 1 Comparing p[2] with T[6] p[2] matches T[6] T b a c b a b a b a b a c a a b
p a b a b a c a
Step 7: i = 7, q = 2 Comparing p[3] with T[7] p[3] matches T[7] T b a c b a b a b a b a c a a b
p a b a b a c a
Step 8: i = 8, q = 3 Comparing p[4] with T[8] p[4] matches T[8] T b a c b a b a b a b a c a a b
p a b a b a c a Step 9: i = 9, q = 4 Comparing p[5] with T[9] p[5] matches T[9] T b a c b a b a b a b a c a a b
p a b a b a c a Step 10: i = 10, q = 5 Comparing p[6] with T[10] p[6] doesn’t match T[10] T b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p, compare p[4] with T[10] because after mismatch q = Π[5] = 3
Step 11: i = 11, q = 4
Comparing p[5] with T[11] p[5] matches T[11]
T b a c b a b a b a b a c a a b p a b a b a c a Step 12: i = 12, q = 5 Comparing p[6] with T[12] p[6] matches T[12] T b a c b a b a b a b a c a a b
p a b a b a c a
Step 13: i = 13, q = 6 Comparing p[7] with T[13] p[7] matches T[13] T b a c b a b a b a b a c a a b
p a b a b a c a
Pattern ‘p’ has been found to completely occur in string ‘S’. The total number of shifts that took place for the match to be found are: i – m = 13 – 7 = 6 shifts. Run - time analysis
• Compute-Prefix-Function (Π) • KMP Matcher 1 m ß length[p] //’p’ pattern to be 1 n ß length[S] matched 2 m ß length[p] 2 Π[1] ß 0 3 Π ß Compute-Prefix-Function(p) 3 k ß 0 4 q ß 0 4 for q ß 2 to m 5 for i ß 1 to n 5 do while k > 0 and p[k+1] != p[q] 6 do while q > 0 and p[q+1] != S[i] 6 do k ß Π[k] 7 do q ß Π[q] 7 If p[k+1] = p[q] 8 if p[q+1] = S[i] 8 then k ß k +1 9 then q ß q + 1 9 Π[q] ß k 10 if q = m 10 return Π 11 then print “Pattern occurs with shift” i – m 12 q ß Π[ q]
In the above pseudocode for computing the The for loop beginning in step 5 runs ‘n’ times, prefix function, the for loop from step 4 to i.e., as long as the length of the string ‘S’. step 10 runs ‘m’ times. Step 1 to step 3 Since step 1 to step 4 take constant time, take constant time. Hence the running the running time is dominated by this for time of compute prefix function is Θ(m). loop. Thus running time of matching function is Θ(n).
Boyer-Moore Algorithm • For longer pa erns and large Σ it is the most efficient algorithm • What it does – Compares characters from right to le – Uses a “bad character” rule – Adds a “good suffix” rule – These two rules generate two different shi values; the larger of the two values is chosen • In some cases Boyer-Moore can run in sub-linear me, which means it may not be necessary to check all of the characters in the search text! Boyer and Moore String pa ern matching - Boyer-Moore Again, this algorithm uses fail-functions to shift the pattern efficiently. Boyer-Moore starts however at the end of the pattern, which can result in larger shifts. Two heuristics are used: 1: if you encounter a mismatch at character c in T, you can shift to the first occurrence of c in P from the right:
T a b c a b c d g a b c e a b c d a c e d
P a b c e b c d a b c e b c d (restart here) String pa ern matching - Boyer-Moore 2: if the examined suffix of P occurs also as substring in P, a shift can be performed to the closest occurrence of this substring:
c e a b c d a b c f g a b c e a b
d a b c a a b c
we can shift to: d a b c a a b c (restart here)
pattern: d a b c a a b c fail2 7 String pa ern matching - Boyer-Moore
The second shift is also precomputed and put into a length(P)-sized array fail2[] (complexity O(length(P))
pattern: d a b c a a b c fail2 9 9 9 9 7 6 5 1 (shift = 7)
c e a b c d a b c f g a b c e a b
d a b c a a b c d a b c a a b c
• If lines 12 and 13 were changed to then we would have a naïve-string matcher Suffix trees Pre-process the text n KMP and BM preprocess the pattern for string matching n Instead, consider preprocessing the text – in particular by constructing a Patricia trie of ALL the suffixes of the text. n If P matches T, then it must be a prefix of some suffix! So if a search for P in the trie succeeds, then P must occur in T. Compressed Trie
n Compress unary nodes, label edges by strings
c è a a c b e d b d bbf e eef f f c c e g e g Suffix tree
Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s Suffix tree (Example)
Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$
{ $ a b $ b b$ $ ab$ a a $ b bab$ b $ abab $ $ } Trivial algorithm to build a Suffix tree
a b Insert the largest suffix a b $
a b b a Insert the suffix bab$ a b b $ $ a b b a a b b $ $
Insert suffix ab$ a b b a b $ a $ b $ a b b a b $ a $ b $
a b Put the suffix b$ in b $ a a $ b b $ $ a b b a b $ a $ b $ $ a b b Put the suffix $ in $ a a $ b b $ $ $ a b b $ a a $ b b $ $
• Label each leaf with the starting point of the $ corres. suffix. a b b 5 2 $ • Takes O(n ) time to a build. a $ b 4 b $ $ 3 • Can be done in O(n). 2 1 What can we do with it ?
Exact string matching: Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide when it occurs in T.
We may also want to find all occurrences of P in T Exact string matching
In preprocessing build a suffix tree in O(n) time
$ a b b 5 $ a a $ b 4 b $ $ 3 1 2 Given a pattern P = ab , traverse the tree according to the pattern. $ a b b 5 $ a a $ b 4 b $ $ 3 1 2
If we do not fail while traversing the pattern then the pattern occurs in the text.
Each leaf in the subtree below the node reached corresponds to an occurrence. By traversing this subtree we find all k occurrences in O(n+k) time Drawbacks of suffix trees n Suffix trees consume a lot of space n It is O(n) but the constant is quite big n Recall that if we want to traverse an edge in O(1) time then we need an array of ptrs. of size |Σ| in each node Suffix array n We lose some of the functionality but save space.
Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The suffix array gives the indices of the suffixes in sorted order Suffix array construction – naïve algorithm n Naive in place construction n Similar to insertion sort n Insert all the suffixes into the array one by one Running time complexity: 2 n O(n ) where n is the length of the string How do we build it fast? n Build a suffix tree n Traverse the tree using DFS, lexicographically following outgoing edges from each node and filling the suffix array. n O(n) time How do we search for a pattern ? n If P occurs in T then all its occurrences are consecutive in the suffix array. n Do a binary search on the suffix array n Takes O(mlogn) time Search example n Search is in mississippi$ 0 11 i$ • Examine the pattern letter by 1 8 ippi$ letter, reducing the range of 2 5 issippi$ occurrence each time. 3 2 ississippi$ • First letter i: occurs in 4 1 mississippi$ indices from 0 to 3 5 10 pi$ • So, pattern should be 6 9 ppi$ between these indices. 7 7 sippi$ • Second letter s: occurs 8 4 sissippi$ in indices from 2 to 3 9 6 ssippi$ • Done. Output: issippi$ and ississippi$ 10 3 ssissippi$ 11 12 $