
String Matching

Chris Muncey

1 Questions

● Who are the inventors of the KMP algorithm?
● What type of hashing algorithm do we use in Rabin-Karp (the type, not the name of the specific one we discussed)?
● What is the time complexity of both algorithms?

2 About me

● Chris Muncey
● Course-only Master’s Candidate, no research interests
● Advisor: None at the moment

● I enjoy basically anything low level
● I like cooking
● I was born and raised in Knoxville, about 20 minutes from campus
● F1 is the closest thing to a sport I like

3 Pictures

4 Outline

● Naive algorithm
● Knuth-Morris-Pratt algorithm
● Rabin-Karp algorithm
● Uses outside of text matching

5 History

● A fast pattern matching algorithm was developed in 1970 by James H. Morris and Vaughan Pratt
● Donald Knuth independently discovered the algorithm around the same time because he’s Donald Knuth
● The three published it jointly in 1977

● Michael O. Rabin wrote a paper on fingerprinting (hashing) using polynomials in 1981
● He and Richard M. Karp used it to create a fast string searching algorithm in 1987

6 Naive algorithm

7 Naive (brute force) algorithm

● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
  ○ n = “Pneumonoultramicroscopicsilicovolcanoconiosis” isn’t going to be in h = “abc”
● Start at offset 0 of n and h, compare up to N characters, stop on first mismatch
● Increment offset of n and h by 1, repeat

● Overall time complexity of O(NH), not very fast
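The scan described above can be sketched in Python as follows (the function and variable names are mine, chosen to match the slides' n/h convention):

```python
def naive_search(n: str, h: str) -> list[int]:
    """Return every offset where needle n occurs in haystack h."""
    N, H = len(n), len(h)
    matches = []
    if N == 0 or N > H:                    # ensure H >= N before scanning
        return matches
    for offset in range(H - N + 1):        # try every alignment of n against h
        i = 0
        while i < N and h[offset + i] == n[i]:
            i += 1                         # stop on the first mismatch
        if i == N:
            matches.append(offset)
    return matches

print(naive_search("TGGA", "CTTGGAGAGC"))  # [2]
```

The inner while loop is where the O(NH) worst case comes from: each of the H - N + 1 offsets can cost up to N comparisons.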

8 Naive algorithm visual

● n = “TGGA”, h = “CTTGGAGAGC”

9 Where does this go wrong?

● For every letter in h, we are potentially doing up to N comparisons
  ○ On long matches with a mismatch at the end, we do N-1 comparisons
  ○ When we hit that mismatch, we move by one, and potentially have the same situation
  ○ Consider n = “aaaaa”, h = “aaaa...aa”
    ■ We have to do 5 comparisons for every character in h
● We can apply some optimisations to eliminate this extra work
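To see that worst case concretely, here is a small instrumented sketch (my own, not part of the original slides) that counts character comparisons on the repeated-letter example:

```python
def naive_comparisons(n: str, h: str) -> int:
    """Count character comparisons made by the naive scan."""
    count = 0
    for offset in range(len(h) - len(n) + 1):
        for i in range(len(n)):
            count += 1                     # one character comparison
            if h[offset + i] != n[i]:
                break                      # mismatch: slide the window by one
    return count

# 996 offsets, and all 5 characters match at each one: 996 * 5 = 4980
print(naive_comparisons("aaaaa", "a" * 1000))
```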

10 Knuth-Morris-Pratt algorithm

11 Knuth-Morris-Pratt algorithm

● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
● Similar to naive algorithm, but we can jump by more than 1 character
● If n has a repeated prefix, we can jump to that offset in n on a mismatch
● If it doesn’t, we can jump all the way to the mismatch location in h

● Overall time complexity of O(N+H)
● O(N) preprocessing, then O(H) matching
● Effectively linear time string searching

12 KMP algorithm prefix array

● An array with the number of matches to the prefix of the string
● Usually the first one is reserved as a sentinel, but not always
● On mismatch, we can move our pointer in n to the index given by prefix
  ○ Special case needed for the -1s obviously
● If our mismatch was at position 7, we would move n to index 3 and compare the same spot in h to that
● We know the first 3 letters will match, so we don’t need to check them
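The prefix array can be built in O(N) with the standard two-pointer construction. This sketch (helper name is mine) uses the variant without the -1 sentinel, where prefix[i] is the length of the longest proper prefix of n that is also a suffix of n[0..i]:

```python
def build_prefix(n: str) -> list[int]:
    prefix = [0] * len(n)
    k = 0                              # length of the currently matched prefix
    for i in range(1, len(n)):
        while k > 0 and n[i] != n[k]:
            k = prefix[k - 1]          # fall back to the next shorter border
        if n[i] == n[k]:
            k += 1
        prefix[i] = k
    return prefix

print(build_prefix("abcdabc"))  # [0, 0, 0, 0, 1, 2, 3]
```

With this table, a mismatch after matching 7 characters of "abcdabc" sends the pointer back to index prefix[6] = 3, as described above.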

13 KMP example

● On the first 3 mismatches, we can only move by 1
● On the 4th, we can move by several spaces because we know a match can’t happen between the two points
● We align the prefix of the string with the matching suffix

14 What makes it linear time?

● This is very similar to the brute force method, but there’s one key difference
● The algorithm never backtracks in h
● In the brute force algorithm, you can do up to N comparisons for each letter in h
● With this one, if we have a match or partial match, we can skip anything that we know can’t also be a match
  ○ If we don’t end on a repeated prefix, there can’t be a second match
  ○ If we do, we can jump to that instead
● Thus this algorithm is linear time, O(N+H)
● Back to n = “aaaaa”, h = “aaaa...aa”
  ○ We can start at the end of n every time because we know the first 4 will match
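Putting the prefix table together with the forward-only scan gives the full algorithm. A minimal sketch (function name is mine):

```python
def kmp_search(n: str, h: str) -> list[int]:
    N, H = len(n), len(h)
    if N == 0 or N > H:
        return []
    # O(N) preprocessing: prefix[i] = length of the longest proper prefix
    # of n that is also a suffix of n[0..i]
    prefix = [0] * N
    k = 0
    for i in range(1, N):
        while k > 0 and n[i] != n[k]:
            k = prefix[k - 1]
        if n[i] == n[k]:
            k += 1
        prefix[i] = k
    # O(H) matching: j only moves forward, so h is never re-read
    matches, k = [], 0
    for j in range(H):
        while k > 0 and h[j] != n[k]:
            k = prefix[k - 1]          # jump within n instead of backtracking in h
        if h[j] == n[k]:
            k += 1
        if k == N:
            matches.append(j - N + 1)
            k = prefix[k - 1]          # continue, allowing overlapping matches
    return matches

print(kmp_search("aaaaa", "aaaaaaaaaa"))  # [0, 1, 2, 3, 4, 5]
```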

15 Rabin-Karp algorithm

16 Rabin-Karp algorithm

● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
● Hash n, hash the first N characters of h, compare the hashes
● If they match
  ○ Double check it with the naive approach to make sure
● If they mismatch
  ○ Remove the first character from the hash of h, add the next character

● Overall time complexity of O(N+H)
● Assumes good hash function, few collisions
● Worst case still O(NH), but only with a terrible hash function
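The whole procedure can be sketched in Python with a simple polynomial rolling hash mod a prime (d = 256 and p = 101 match the examples later in the deck; the function name is mine):

```python
def rabin_karp(n: str, h: str, d: int = 256, p: int = 101) -> list[int]:
    N, H = len(n), len(h)
    if N == 0 or N > H:
        return []
    top = pow(d, N - 1, p)             # weight of the character leaving the window
    hn = hh = 0
    for i in range(N):                 # hash n and the first window of h
        hn = (hn * d + ord(n[i])) % p
        hh = (hh * d + ord(h[i])) % p
    matches = []
    for i in range(H - N + 1):
        # on a hash match, double-check with a direct comparison
        if hh == hn and h[i:i + N] == n:
            matches.append(i)
        if i < H - N:                  # roll: drop h[i], bring in h[i+N]
            hh = ((hh - ord(h[i]) * top) * d + ord(h[i + N])) % p
    return matches

print(rabin_karp("TGGA", "CTTGGAGAGC"))  # [2]
```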

17 Hey wait a minute...

● We can’t just edit a hash value like that, that’s not how hashing works
● We need a special type of hash algorithm, called a rolling hash
● One such hash is the Rabin fingerprint:

● View message n of length N as a polynomial of degree N-1
  ○ f(d) = n[0]·d^(N-1) + n[1]·d^(N-2) + … + n[N-1]
  ○ d should be larger than the number of characters in the alphabet of n
● Pick a random irreducible polynomial p of degree k over GF(2)
  ○ For our purposes, this is effectively just a k-bit number
● The hash is the remainder when dividing f(d) by p
  ○ hash = f(d) % p

18 Rabin fingerprint implementation

● Since 256^x gets big, we need something to make it smaller
● Key insight
  ○ (A*B) % C = ((A % C) * (B % C)) % C
● The original paper was worried about space, so it picked the smallest prime p such that p > log(NH / e), with e > 0 being the acceptable error rate
● We’ll pick 101 because it’s easy for the examples
● The number of bits in p determines the number of bits in the hash, so we’ll have a 7 bit hash

● hash(str, p):
● n = str.length, h = 0
● for i in {0 … n}
  ○ j = 1
  ○ for k in {0 … n - 1 - i}
    ■ j = (j * d) % p
  ○ h += ((str[i] % p) * (j % p)) % p
● return(h % p)

19 Rabin fingerprint example

● n = “abc”, p = 101, d = 256
● (((‘a’*((256*256)%p))%p) + ((‘b’*(256%p))%p) + ((‘c’*(1%p))%p)) % p
● (((97 * 88) % p) + ((98 * 54) % p) + (99 % p)) % p
● (52 + 40 + 99) % 101 = 90

● n = “bcq”, p = 101, d = 256
● (((‘b’*((256*256)%p))%p) + ((‘c’*(256%p))%p) + (‘q’%p)) % p = 44

● To roll the hash: subtract the top character’s contribution, “shift by 1”, add the new character’s contribution

● ((((hash(“abc”) - ((‘a’*((256*256)%p))%p)) * 256) % p) + hash(“q”)) % p
● ((90 - ((‘a’*((256*256)%p))%p)) * 256) % p
● ((90 - 52) * 256) % p = 32
● (32 + (‘q’ % p)) % p = 44, as expected
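The arithmetic above is easy to verify mechanically; this snippet (my own check, not from the slides) recomputes both fingerprints and the rolling update:

```python
p, d = 101, 256

def fp(s: str) -> int:
    h = 0
    for c in s:                        # Horner evaluation of f(d) mod p
        h = (h * d + ord(c)) % p
    return h

assert fp("abc") == 90                 # matches the first example
assert fp("bcq") == 44                 # matches the second example

top = pow(d, 2, p)                     # 256^2 % 101 == 88, weight of the top character
roll = ((fp("abc") - ord("a") * top) * d + ord("q")) % p
assert roll == 44                      # rolling update reproduces fp("bcq")
print("all checks pass")
```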

20 Uses outside of text matching

21 Uses of KMP

● DNA matching
  ○ DNA only has an alphabet size of 4, so repeats will happen with high frequency
  ○ Lots of repeats means we can skip many checks
  ○ Still bound by linear time of course, but KMP is good here

22 Uses of Rabin-Karp

● Data compression relies on finding matches in a file
● Rabin-Karp excels at this, as the “hash” can just be the pattern bytes
  ○ Impossible to have a collision, because a collision is a match
  ○ Guaranteed O(H) pattern matching assuming you can cram n into a register

● Also very useful for trying to find many needles in a single haystack at once
  ○ Simply hash all of the needles, and search for them all as you move through the haystack
  ○ Wikipedia gives the example of plagiarism detection
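The multi-needle idea can be sketched as follows (names and the large prime are my choices; for simplicity all needles are assumed to share the same length, so one window serves them all):

```python
def multi_search(needles: list[str], h: str, d: int = 256, p: int = 10**9 + 7):
    N = len(needles[0])                # assumes equal-length needles
    top = pow(d, N - 1, p)

    def fp(s: str) -> int:
        v = 0
        for c in s:
            v = (v * d + ord(c)) % p
        return v

    # hash every needle once up front (two needles colliding would shadow
    # each other in this simple dict; acceptable for a sketch)
    wanted = {fp(n): n for n in needles}
    hits, hh = [], fp(h[:N])
    for i in range(len(h) - N + 1):
        if hh in wanted and h[i:i + N] == wanted[hh]:
            hits.append((i, wanted[hh]))
        if i + N < len(h):             # roll the single shared window
            hh = ((hh - ord(h[i]) * top) * d + ord(h[i + N])) % p
    return hits

print(multi_search(["TGGA", "GAGC"], "CTTGGAGAGC"))  # [(2, 'TGGA'), (6, 'GAGC')]
```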

23 References

1. https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
2. https://en.wikipedia.org/wiki/Rabin_fingerprint
3. http://www.xmailserver.org/rabin.pdf
   a. Fingerprinting by Random Polynomials - Michael O. Rabin - 1981
4. https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
5. https://www.youtube.com/watch?v=BfUejqd07yo
6. https://www.youtube.com/watch?v=EL4ZbRF587g
7. https://www.youtube.com/watch?v=V5-7GzOfADQ

24 Discussion

25 Questions

● Who are the inventors of the KMP algorithm?
● What type of hashing algorithm do we use in Rabin-Karp (the type, not the name of the specific one we discussed)?
● What is the time complexity of both algorithms?

26 String Matching Algorithms

Chris Muncey
