String Matching

String Matching Problem

Pattern: compress

Text: We introduce a general framework which is suitable to capture an essence of compressed pattern matching according to various dictionary-based compressions. The goal is to find all occurrences of a pattern in a text without decompression, which is one of the most active topics in string matching. Our framework includes such compression methods as the Lempel-Ziv family (LZ77, LZSS, LZ78, LZW), byte-pair encoding, and the static dictionary based method. Technically, our pattern matching algorithm extremely extends that for LZW compressed text presented by Amir, Benson and Farach.

(Every occurrence of the pattern "compress" inside the text above, e.g. within "compressed", "compressions", and "decompression", is a match to be reported.)

Notation & Terminology
• String S: S[1…n]
• Substring of S: S[i…j]
• Prefix of S: S[1…i]
• Suffix of S: S[i…n]
• |S| = n (string length)
• Ex: AGATCGATGGA (e.g. AGAT is a prefix, TCGA a substring, TGGA a suffix)

Notation
• Σ* is the set of all finite strings over Σ; the length of x is |x|; the concatenation of x and y is xy, with |xy| = |x| + |y|
• w is a prefix of x (w ⊏ x) provided x = wy for some y
• w is a suffix of x (w ⊐ x) provided x = yw for some y

Naïve String Matcher

• The equality test takes time O(m)
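The slides quote only the cost of the naïve matcher, so here is a minimal Python sketch of it (the function name naive_match is ours): it tries every shift s and runs the O(m) equality test at each one.

def naive_match(T, P):
    """Report every 0-indexed shift s such that T[s:s+m] == P."""
    n, m = len(T), len(P)
    shifts = []
    for s in range(n - m + 1):          # n - m + 1 candidate shifts
        if T[s:s+m] == P:               # O(m) equality test
            shifts.append(s)
    return shifts

# Example (the text and pattern used on the next slide):
# naive_match("xabxyabxyabxz", "abxyabxz") -> [5]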

♦ Complexity – the overall complexity is O((n−m+1)·m)

Another method?
• Can we do better than this?
  – Idea: shifting at a mismatch far enough for efficiency, but not too far for correctness.
  – Ex:
      T = x a b x y a b x y a b x z
      P =   a b x y a b x z          (7 characters match, then a mismatch)
      Jump:         a b x y a b x z  (all 8 characters match)

Preprocessing on P or T

Two types of algorithms for string matching
• Preprocessing on P (P fixed, T varied)
  – Ex: query P in a database T
  – Algorithms: Knuth-Morris-Pratt, Boyer-Moore, Rabin-Karp
• Preprocessing on T (T fixed, P varied)
  – Ex: look for P in a dictionary T
  – Algorithms: suffix trees (and suffix arrays)

Rabin-Karp algorithm

• The algorithm has a worst-case running time of O((n−m+1)m), but an average-case running time of O(m+n)
• The algorithm uses some elementary ideas from number theory: the equivalence of two numbers modulo a third
• For simplicity assume that Σ = {0,1,2,...,9}, so that each character is a decimal digit; more generally, if the size of Σ is d, then we treat each character as a digit in radix-d notation
• We can think of a string of length k as a k-digit decimal number
• Example: the string of characters 354123 corresponds to the number 354,123
• Given a pattern P[1..m], let p be the decimal number corresponding to the m-character pattern

• Given a text T[1..n], for s = 0, 1, ..., n−m let t_s denote the decimal value of the length-m substring T[s+1 .. s+m] starting at position s+1

• Clearly, t_s = p if and only if P[1..m] = T[s+1 .. s+m]
• So, if we could compute p in O(m) time and compute all of the t_s in O(n) time, then in O(n) time we could compare all of the shifts with p and find all matching locations
• Of course, the numbers may get very large, but we'll solve that problem in a few slides

P → p
• p = P[m] + 10·P[m−1] + 10^2·P[m−2] + ... + 10^(m−1)·P[1]
• Computing this in a brute-force way would require O(m^2) operations to compute all of the powers of 10
• But notice that we can eliminate one multiplication by 10 in every term (but the first) by factoring the 10 out:
  p = P[m] + 10·(P[m−1] + 10·P[m−2] + ... + 10^(m−2)·P[1])
• Applying this factoring rule repeatedly leads to only a linear number (in m) of multiplications

Rabin-Karp
• Computing p in O(m) time: we use what is called Horner's rule
  p = P[m] + 10·(P[m−1] + 10·(P[m−2] + ... + 10·(P[2] + 10·P[1]) ... ))
• Example: P = 3276; P[4] = 6, P[3] = 7, P[2] = 2, P[1] = 3
  p = 6 + 10·(7 + 10·(2 + 10·3)) = 6 + 10·(7 + 10·32) = 6 + 10·327 = 6 + 3270 = 3276
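A one-function Python sketch of Horner's rule as used here (the name horner is ours; the pattern is assumed to be given as a string of decimal digits):

def horner(P, d=10):
    """Value of the digit string P in radix d, using O(m) multiplications."""
    p = 0
    for c in P:             # p = d*p + next digit, left to right
        p = d * p + int(c)
    return p

# horner("3276") -> 3276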

• t_0 can be computed from T[1..m] using the same method

Rabin-Karp

• The big question is then computing the remaining t_s in O(n−m) time, that is, with just a constant number of operations for each remaining t_s.

• But there is a simple relationship between t_{s+1} and t_s:

  t_{s+1} = 10·(t_s − 10^(m−1)·T[s+1]) + T[s+m+1]

• Example: Suppose T = 314152 and the length of the pattern is 5. Then t_0 = 31,415. What is t_1?

  t_1 = 10·(31,415 − 10^4·3) + 2 = 10·(1,415) + 2 = 14,152

Rabin-Karp

• We precompute the constant 10^(m−1) (which can be done in O(log m) time by repeated squaring), so each update of a t_s takes a constant number of operations.
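A quick numeric check of the update rule in Python, using the example above (variable names are ours):

T = "314152"
m = 5
h = 10 ** (m - 1)                          # the precomputed constant 10^(m-1)

t = int(T[0:m])                            # t_0 = 31415
t = 10 * (t - h * int(T[0])) + int(T[m])   # drop the leading 3, bring in the trailing 2
print(t)                                   # 14152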

• So, p and all of the t_s can be computed in O(m+n) time.

Rabin-Karp
• So, what is the problem? Well, if P is a long pattern, or if the size of the alphabet is large,

then p and the t_s may be very large numbers, and each arithmetic operation will not take constant time; the numbers might require many words of memory to represent.

Rabin-Karp
• Solution: compute p and the t_s modulo a suitable modulus q
• Horner's rule and the simple updating rule still work using mod-q arithmetic
• q is a large prime
• q is chosen such that 10q just fits within one computer word; this allows all of the arithmetic operations to be performed using single-precision arithmetic, while keeping q as large as possible
• For an alphabet with d symbols, q is chosen so that dq fits within one word
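Putting the pieces together, a minimal Python sketch of the Rabin-Karp matcher with a modulus (the function name rabin_karp and the prime q = 1000003 are our choices; every hit is verified character by character, for the reason discussed next):

def rabin_karp(T, P, d=10, q=1000003):
    """Report all 0-indexed shifts at which P occurs in T (characters are radix-d digits)."""
    n, m = len(T), len(P)
    if m > n:
        return []
    h = pow(d, m - 1, q)                     # d^(m-1) mod q, precomputed
    p = t = 0
    for i in range(m):                       # Horner's rule, mod q
        p = (d * p + int(P[i])) % q
        t = (d * t + int(T[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if p == t and T[s:s+m] == P:         # verify to rule out spurious hits
            shifts.append(s)
        if s < n - m:                        # rolling update: t_s -> t_{s+1}
            t = (d * (t - h * int(T[s])) + int(T[s + m])) % q
    return shifts

# rabin_karp("314152314152", "3141") -> [0, 6]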

♦ The catch is that p ≡ t_s (mod q) is no longer a sufficient condition for deciding that P matches a substring of T.
♦ What if two values collide?
  – As with hash-table functions, it is possible that two or more different strings produce the same "hit" value
  – Any hit has to be tested to verify that it is not spurious, i.e. that P[1..m] = T[s+1..s+m]

What is a worst case?

• Let P be a string of length m of all a's
• Let T be a string of length n of all a's
• Then all substrings of T of length m will have the same "key" as P, and all will have to be checked
• But we know that if a substring of T matches P, then to determine whether the "next" substring matches we only have to perform one additional comparison
• So the Rabin-Karp algorithm can do a lot of unnecessary work
• In practice, though, it doesn't

Knuth-Morris-Pratt Algorithm
• The key observation
  – When there is a mismatch after several characters match, the pattern and the search string share a common substring, which is also a prefix of the pattern
  – If there is a suffix of the part of the pattern we have matched which is equal to a prefix of that part, then we need to consider it as a possible match
  – By finding the largest such suffix, we make the smallest shift, so we never miss a match
• By using the prefix function the algorithm has a running time of O(n + m)

Components of the KMP algorithm

• The prefix function, Π: the prefix function Π for a pattern summarizes how the pattern matches against shifts of itself. This information is used to avoid useless shifts of the pattern P.
• The KMP matcher: given text T, pattern P, and prefix function Π as inputs, it finds all of the occurrences of P in T.

The prefix function, Π

Compute-Prefix-Function(p)
 1  m ← length[p]            // 'p' is the pattern to be matched
 2  Π[1] ← 0
 3  k ← 0
 4  for q ← 2 to m
 5      do while k > 0 and p[k+1] ≠ p[q]
 6             do k ← Π[k]
 7         if p[k+1] = p[q]
 8             then k ← k + 1
 9         Π[q] ← k
10  return Π
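A direct Python transcription of this pseudocode (0-indexed, so pi[q] is the length of the longest proper prefix of p[0..q] that is also a suffix of it; the function name is ours):

def compute_prefix_function(p):
    m = len(p)
    pi = [0] * m                      # pi[0] = 0, as in line 2
    k = 0
    for q in range(1, m):             # corresponds to q = 2..m in the 1-indexed pseudocode
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]             # fall back (line 6)
        if p[k] == p[q]:
            k += 1                    # extend the match (line 8)
        pi[q] = k
    return pi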

Example: compute Π for the pattern p = a b a b a c a

Initially: m = length[p] = 7, Π[1] = 0, k = 0

Step 1: q = 2, k = 0, Π[2] = 0

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0

Step 2: q = 3, k = 0, Π[3] = 1

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1

Step 3: q = 4, k = 1, Π[4] = 2

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2


Step 4: q = 5, k = 2, Π[5] = 3

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3

Step 5: q = 6, k = 3, Π[6] = 0  (the mismatch at p[6] = c drives k back to 0)

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0

Step 6: q = 7, k = 0, Π[7] = 1

q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
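As a quick check, the Python transcription above reproduces this table (0-indexed output):

print(compute_prefix_function("ababaca"))   # [0, 0, 1, 2, 3, 0, 1]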

The KMP Matcher

The KMP matcher, with pattern p, text T, and prefix function Π as input, finds the occurrences of p in T. The following pseudocode is the matching component of the KMP algorithm:

KMP-Matcher(T, p)
 1  n ← length[T]
 2  m ← length[p]
 3  Π ← Compute-Prefix-Function(p)
 4  q ← 0                      // number of characters matched
 5  for i ← 1 to n             // scan T from left to right
 6      do while q > 0 and p[q+1] ≠ T[i]
 7             do q ← Π[q]     // next character does not match
 8         if p[q+1] = T[i]
 9             then q ← q + 1  // next character matches
10         if q = m            // is all of p matched?
11             then print "Pattern occurs with shift" i − m
12                  q ← Π[q]   // look for the next match

Note: KMP finds every occurrence of p in T. That is why KMP does not terminate at line 12; rather, it continues to search the remainder of T for any more occurrences of p.
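For reference, a 0-indexed Python transcription of the matcher, reusing compute_prefix_function from the sketch above and returning the shifts instead of printing them (the function name is ours):

def kmp_matcher(T, p):
    n, m = len(T), len(p)
    pi = compute_prefix_function(p)
    shifts = []
    q = 0                                 # number of characters matched
    for i in range(n):                    # scan T from left to right
        while q > 0 and p[q] != T[i]:
            q = pi[q - 1]                 # next character does not match
        if p[q] == T[i]:
            q += 1                        # next character matches
        if q == m:                        # all of p matched
            shifts.append(i - m + 1)      # 0-indexed shift
            q = pi[q - 1]                 # look for the next match
    return shifts

# For the illustration that follows:
# kmp_matcher("bacbabababacaab", "ababaca") -> [6], the shift reported below.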


Illustration: given a text T and pattern p as follows:

T b a c b a b a b a b a c a a b

p a b a b a c a

Simulate the KMP algorithm to find whether p occurs in T.

For p, the prefix function Π was computed previously and is as follows:

q 1 2 3 4 5 6 7

p a b a b a c a

Π 0 0 1 2 3 0 1

Initially: n = size of T = 15; m = size of p = 7

Step 1: i = 1, q = 0; comparing p[1] with T[1]

T b a c b a b a b a b a c a a b

p a b a b a c a

p[1] does not match T[1]. p will be shifted one position to the right.

Step 2: i = 2, q = 0; comparing p[1] with T[2]

T b a c b a b a b a b a c a a b

p a b a b a c a

p[1] matches T[2]. Since there is a match, p is not shifted.

Step 3: i = 3, q = 1; comparing p[2] with T[3]. p[2] does not match T[3].

T b a c b a b a b a b a c a a b

p a b a b a c a

Backtracking on p, comparing p[1] and T[3]: p[1] does not match T[3] either, so q remains 0.

Step 4: i = 4, q = 0; comparing p[1] with T[4]. p[1] does not match T[4].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 5: i = 5, q = 0

Comparing p[1] with T[5]: p[1] matches T[5].

T b a c b a b a b a b a c a a b
p a b a b a c a

Step 6: i = 6, q = 1; comparing p[2] with T[6]. p[2] matches T[6].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 7: i = 7, q = 2; comparing p[3] with T[7]. p[3] matches T[7].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 8: i = 8, q = 3; comparing p[4] with T[8]. p[4] matches T[8].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 9: i = 9, q = 4; comparing p[5] with T[9]. p[5] matches T[9].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 10: i = 10, q = 5; comparing p[6] with T[10]. p[6] does not match T[10].

T b a c b a b a b a b a c a a b

p a b a b a c a

Backtracking on p, we compare p[4] with T[10], because after the mismatch q = Π[5] = 3. p[4] matches T[10], so q becomes 4.

Step 11: i = 11, q = 4

Comparing p[5] with T[11]: p[5] matches T[11].

T b a c b a b a b a b a c a a b
p a b a b a c a

Step 12: i = 12, q = 5; comparing p[6] with T[12]. p[6] matches T[12].

T b a c b a b a b a b a c a a b

p a b a b a c a

Step 13: i = 13, q = 6; comparing p[7] with T[13]. p[7] matches T[13].

T b a c b a b a b a b a c a a b

p a b a b a c a

Pattern p has been found to occur completely in the text T. It occurs with shift i − m = 13 − 7 = 6.

Run-time analysis

• Compute-Prefix-Function(p)
 1  m ← length[p]            // 'p' is the pattern to be matched
 2  Π[1] ← 0
 3  k ← 0
 4  for q ← 2 to m
 5      do while k > 0 and p[k+1] ≠ p[q]
 6             do k ← Π[k]
 7         if p[k+1] = p[q]
 8             then k ← k + 1
 9         Π[q] ← k
10  return Π

• KMP-Matcher(T, p)
 1  n ← length[T]
 2  m ← length[p]
 3  Π ← Compute-Prefix-Function(p)
 4  q ← 0
 5  for i ← 1 to n
 6      do while q > 0 and p[q+1] ≠ T[i]
 7             do q ← Π[q]
 8         if p[q+1] = T[i]
 9             then q ← q + 1
10         if q = m
11             then print "Pattern occurs with shift" i − m
12                  q ← Π[q]

In the pseudocode for computing the prefix function, the for loop starting at line 4 runs m − 1 times (q = 2, ..., m), and lines 1 to 3 take constant time; the inner while loop is paid for by the increments of k, so the running time of Compute-Prefix-Function is Θ(m).

In the matcher, the for loop beginning at line 5 runs n times, once per character of the text T. Lines 1, 2 and 4 take constant time and line 3 takes Θ(m); again the inner while loop is amortized against the increments of q, so the matching loop takes Θ(n) time and KMP as a whole runs in Θ(n + m).

Boyer-Moore Algorithm
• For longer patterns and a large Σ it is the most efficient algorithm
• What it does
  – Compares characters from right to left
  – Uses a "bad character" rule
  – Adds a "good suffix" rule
  – These two rules generate two different shift values; the larger of the two values is chosen
• In some cases Boyer-Moore can run in sub-linear time, which means it may not be necessary to check all of the characters in the search text!

String pattern matching - Boyer-Moore
Again, this algorithm uses fail functions to shift the pattern efficiently. However, Boyer-Moore starts at the end of the pattern, which can result in larger shifts. Two heuristics are used:

1: if you encounter a mismatch at character c in T, you can shift to the first occurrence of c in P from the right:

T   a b c a b c d g a b c e a b c d a c e d
P   a b c e b c d
P         a b c e b c d   (restart here)

String pattern matching - Boyer-Moore
2: if the examined suffix of P also occurs as a substring elsewhere in P, a shift can be performed to the closest occurrence of this substring:

T   c e a b c d a b c f g a b c e a b
P     d a b c a a b c

we can shift to:

P             d a b c a a b c   (restart here)

pattern:  d a b c a a b c   (fail2 value at the mismatch position: 7)

String pattern matching - Boyer-Moore

The second shift is also precomputed and put into a length(P)-sized array fail2[] (complexity O(length(P))).

pattern:  d a b c a a b c
fail2:    9 9 9 9 7 6 5 1   (shift = 7)

T   c e a b c d a b c f g a b c e a b
P     d a b c a a b c
P             d a b c a a b c
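As an illustration of the first (bad-character) heuristic only, here is a hedged Python sketch; the function names and the simplified shift on a full match are ours, and a complete Boyer-Moore implementation would combine this with the good-suffix table fail2 described above:

def last_occurrence(P):
    """Rightmost 0-indexed position of each character in P."""
    return {c: i for i, c in enumerate(P)}

def bm_bad_character(T, P):
    """Boyer-Moore search using only the bad-character rule."""
    n, m = len(T), len(P)
    last = last_occurrence(P)
    shifts = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and P[j] == T[s + j]:        # compare right to left
            j -= 1
        if j < 0:
            shifts.append(s)
            s += 1                                # a full BM would use the good-suffix shift here
        else:
            bad = T[s + j]                        # mismatched text character
            s += max(1, j - last.get(bad, -1))    # align its rightmost occurrence in P
    return shifts

# bm_bad_character("here is a simple example", "example") -> [17]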

Suffix trees
Pre-process the text
• KMP and BM preprocess the pattern for string matching
• Instead, consider preprocessing the text, in particular by constructing a Patricia trie of ALL the suffixes of the text
• If P matches T, then it must be a prefix of some suffix! So if a search for P in the trie succeeds, then P must occur in T

Compressed Trie

• Compress unary nodes, label edges by strings

[Figure: a trie and the corresponding compressed trie, with chains of unary nodes merged into single edges labelled by strings]

Given a string s, a suffix tree of s is a compressed trie of all suffixes of s. To make these suffixes prefix-free we add a special character, say $, at the end of s.

Suffix tree (Example)

Let s = abab; a suffix tree of s is a compressed trie of all suffixes of abab$.

The suffixes of abab$ are { abab$, bab$, ab$, b$, $ }.

[Figure: the suffix tree of abab$: from the root, an edge labelled $; an edge labelled b with leaf edges $ and ab$; and an edge labelled ab with leaf edges $ and ab$]

Insert the largest suffix, abab$
Insert the suffix bab$
Insert the suffix ab$
Put the suffix b$ in
Put the suffix $ in

[Figures: the compressed trie after each insertion, growing into the suffix tree of abab$]

• Label each leaf with the starting position of the corresponding suffix
• This trivial algorithm takes O(n^2) time to build the tree
• The construction can be done in O(n) time

[Figure: the suffix tree of abab$ with its leaves labelled 1, 2, 3, 4, 5]

What can we do with it?

Exact string matching: given a text T, |T| = n, preprocess it such that when a pattern P, |P| = m, arrives you can quickly decide whether it occurs in T.

We may also want to find all occurrences of P in T.

Exact string matching

In the preprocessing step, build a suffix tree of T in O(n) time.

[Figure: the suffix tree of abab$, leaves labelled 1–5]

Given a pattern P = ab, traverse the tree according to the pattern.

[Figure: the same tree with the path spelling ab highlighted]

If we do not fail while traversing the pattern then the pattern occurs in the text.
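A hedged Python sketch of this idea using an uncompressed suffix trie (our own simplification: a real suffix tree compresses unary paths and is built in O(n), whereas this naive trie can take quadratic time and space to build). It also collects the leaves below the node that is reached, anticipating the next point:

def build_suffix_trie(s):
    """Naive suffix trie of s + '$'; each leaf remembers the 1-indexed suffix start."""
    s = s + "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1                # starting position of this suffix
    return root

def occurrences(root, P):
    """All starting positions of P: the leaves below the node reached by spelling P."""
    node = root
    for ch in P:
        if ch not in node:
            return []                           # fell off the trie: P does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:                                # collect every leaf in the subtree
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

# occurrences(build_suffix_trie("abab"), "ab") -> [1, 3]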

Each leaf in the subtree below the node reached corresponds to an occurrence. By traversing this subtree we find all k occurrences in O(m + k) time.

Drawbacks of suffix trees
• Suffix trees consume a lot of space
• The space is O(n), but the constant is quite big
• Recall that if we want to traverse an edge in O(1) time then we need an array of pointers of size |Σ| in each node

Suffix array
• We lose some of the functionality but save space.

Let s = abab. Sort the suffixes lexicographically: ab, abab, b, bab. The suffix array gives the starting indices of the suffixes in sorted order: [3, 1, 4, 2].

Suffix array construction – naïve algorithm
• Naive in-place construction
• Similar to insertion sort
• Insert all the suffixes into the array one by one
• Running time: O(n^2) suffix comparisons, where n is the length of the string

How do we build it fast?
• Build a suffix tree
• Traverse the tree using DFS, lexicographically following the outgoing edges from each node, and fill in the suffix array
• O(n) time

How do we search for a pattern?
• If P occurs in T then all its occurrences are consecutive in the suffix array
• Do a binary search on the suffix array
• Takes O(m log n) time

Search example
Search for "is" in mississippi$.
• Examine the pattern letter by letter, reducing the range of occurrence each time
• First letter i: occurs at indices 0 to 3 of the suffix array, so the pattern must lie between these indices
• Second letter s: occurs at indices 2 to 3
• Done. Output: issippi$ and ississippi$

 index  start  suffix
   0      11   i$
   1       8   ippi$
   2       5   issippi$
   3       2   ississippi$
   4       1   mississippi$
   5      10   pi$
   6       9   ppi$
   7       7   sippi$
   8       4   sissippi$
   9       6   ssippi$
  10       3   ssissippi$
  11      12   $
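A minimal Python sketch of the naïve suffix-array construction and the binary-search lookup described above (the names are ours; it sorts with Python's built-in sort rather than insertion sort, uses the bisect module for the binary searches, and omits the terminator $ for simplicity, so the order differs slightly from the table above):

import bisect

def suffix_array(s):
    """1-indexed starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def search(s, sa, P):
    """All 1-indexed positions at which P occurs in s, by binary search on the suffix array."""
    suffixes = [s[i - 1:] for i in sa]                     # materialized only for clarity
    lo = bisect.bisect_left(suffixes, P)                   # first suffix >= P
    hi = bisect.bisect_right(suffixes, P + chr(0x10FFFF))  # just past the suffixes starting with P
    return sorted(sa[lo:hi])

s = "mississippi"
print(suffix_array(s))                    # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(search(s, suffix_array(s), "is"))   # [2, 5]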