String Matching Algorithms
Chris Muncey

Questions
● Who are the inventors of the KMP algorithm?
● What type of hashing algorithm do we use in Rabin-Karp (the type, not the name of the specific one we discussed)?
● What is the time complexity of both algorithms?

About me
● Chris Muncey
● Course-only Master's Candidate, no research interests
● Advisor: None at the moment
● I enjoy basically anything low level
● I like cooking
● I was born and raised in Knoxville, about 20 minutes from campus
● F1 is the closest thing to a sport I like

Outline
● Naive algorithm
● Knuth-Morris-Pratt algorithm
● Rabin-Karp algorithm
● Uses outside of text matching

History
● A fast pattern matching algorithm was developed in 1970 by James H. Morris and Vaughan Pratt
● Donald Knuth independently discovered the algorithm around the same time because he's Donald Knuth
● The three published it jointly in 1977
● Michael O. Rabin wrote a paper on fingerprinting (hashing) using polynomials in 1981
● He and Richard M. Karp used it to create a fast string searching algorithm in 1987

Naive (brute force) algorithm
● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
  - n = "Pneumonoultramicroscopicsilicovolcanoconiosis" isn't going to be in h = "abc"
● Start at offset 0 of n and h, compare up to N characters, stop on first mismatch
● Move the starting offset in h forward by 1, go back to offset 0 of n, repeat
● Overall time complexity of O(NH), not very fast

Naive algorithm visual
n = "TGGA"
h = "CTTGGAGAGC"

Where does this go wrong?
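One way to see the cost is to count comparisons. The sketch below is a minimal Python rendering of the brute-force procedure described above, not code from the slides: the function name `naive_search` and the comparison counter are assumptions added for illustration.

```python
def naive_search(needle: str, haystack: str) -> tuple[list[int], int]:
    """Brute-force search.

    Returns (offsets in haystack where needle occurs,
             number of single-character comparisons performed).
    The counter is an illustrative addition, not part of the algorithm.
    """
    n_len, h_len = len(needle), len(haystack)
    matches: list[int] = []
    comparisons = 0
    # A needle longer than the haystack can never match.
    if n_len == 0 or h_len < n_len:
        return matches, comparisons
    for start in range(h_len - n_len + 1):
        offset = 0
        # Compare up to N characters, stopping on the first mismatch.
        while offset < n_len:
            comparisons += 1
            if haystack[start + offset] != needle[offset]:
                break
            offset += 1
        if offset == n_len:
            matches.append(start)
    return matches, comparisons


if __name__ == "__main__":
    # A run of identical letters forces N comparisons at every alignment.
    found, cost = naive_search("aaaaa", "a" * 20)
    print(found, cost)
```

With a needle of five a's and a haystack of twenty a's, every one of the 16 alignments costs 5 comparisons, for 80 in total.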
● For every letter in h, we are potentially doing up to N comparisons
  - On long matches with a mismatch at the end, we do N-1 comparisons before reaching the mismatch
  - When we hit that mismatch, we move by one, and potentially have the same situation
  - Consider n = "aaaaa", h = "aaaa...aa"
    • We have to do 5 comparisons for every character in h
● We can apply some optimisations to eliminate this extra work

Knuth-Morris-Pratt algorithm
● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
● Similar to naive algorithm, but we can jump by more than 1 character
● If n has a repeated prefix, we can jump to that offset in n on a mismatch
● If it doesn't, we can jump all the way to the mismatch location in h
● Overall time complexity of O(N+H)
● O(N) preprocessing, then O(H) matching
● Effectively linear time string searching

KMP algorithm prefix array
● An array that records, for each position in n, how many of the characters just before it match the prefix of n
● Usually the first entry is reserved as a sentinel (-1), but not always
● On mismatch, we can move our pointer in n to the index given by the prefix array
  - A special case is needed for the -1 sentinel entries
● If our mismatch was at position 9, we would move n to index 3 and compare the same spot in h to that
● We know the first 3 letters will match, so we don't need to check them

KMP example
● On the first 3 mismatches, we can only move by 1
● On the 4th, we can move by several spaces because we know a match can't happen between the two points
● We align the prefix of the string with the matching suffix

What makes it linear time?
● This is very similar to the brute force method, but there's one key difference
● The algorithm never backtracks in h
● In the brute force algorithm, you can look at up to N characters of n for each letter in h
● With this one, if we have a match or partial match, we can skip anything that we know can't also be a match
  - If we don't end on a repeated prefix, there can't be a second match
  - If we do, we can jump to that instead
● Thus this algorithm is linear time, O(N+H)
● Back to n = "aaaaa", h = "aaaa...aa"
  - We can start at the end of n every time because we know the first 4 will match

Rabin-Karp algorithm
● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
● Hash n, hash N characters of h, compare the hashes
● If match
  - Double check it with the naive approach to make sure
● If mismatch
  - Remove the first character from the hash of h, add the next character
● Overall time complexity of O(N+H)
● Assumes a good hash function with few collisions
● Worst case still O(NH), but only with a terrible hash function

Hey wait a minute...
● We can't just edit a hash value like that, that's not how hashing works
● We need a special type of hash algorithm, called a rolling hash
● One such hash is the Rabin fingerprint:
● View message n of length N as a polynomial of degree N-1
  - $f(d) = \sum_{i=0}^{N-1} n_i \, d^{N-1-i}$
  - d should be larger than the number of characters in the alphabet of n
● Pick a random irreducible polynomial of degree k over GF(2)
  - For our purposes, this is effectively just a k-bit prime number p
● The hash is the remainder when dividing f(d) by p
  - hash = f(d) % p

Rabin fingerprint implementation
● Since 256^x gets big, we need something to make it smaller
● Key insight
  - (A * B) % C = ((A % C) * (B % C)) % C
● The original paper was worried about space, so it picked the smallest prime p such that p > log(NH / e), with e > 0 being the acceptable error rate
● We'll pick 101 because it's easy for the examples
● The number of bits in p determines the number of bits in the hash, so we'll have a 7 bit hash

hash(str, p):    (d is the base chosen above)
    n = str.length, h = 0
    for i in 0 … n - 1:
        j = 1
        for k in 1 … n - 1 - i:
            j = (j * d) % p
        h += ((str[i] % p) * (j % p)) % p
    return (h % p)

Rabin fingerprint example
● n = "abc", p = 101, d = 256
● ((('a' * ((256 * 256) % p)) % p) + (('b' * (256 % p)) % p) + (('c' * (1 % p)) % p)) % p
● (((97 * 88) % p) + ((98 * 54) % p) + (99 % p)) % p
● (52 + 40 + 99) % 101 = 90
● n = "bcq", p = 101, d = 256
● ((('b' * ((256 * 256) % p)) % p) + (('c' * (256 % p)) % p) + ('q' % p)) % p = 44
● To move the hash, subtract the top value's hash, "shift by 1", add the new value's hash
● ((((hash("abc") - (('a' * ((256 * 256) % p)) % p)) * 256) % p) + hash("q")) % p
● ((90 - (('a' * ((256 * 256) % p)) % p)) * 256) % p
● ((90 - 52) * 256) % p = 32
● (32 + ('q' % p)) % p = 44, as expected

Uses outside of text matching

Uses of KMP
● DNA matching
  - DNA only has an alphabet size of 4, so repeats will happen with high frequency
  - Lots of repeats means we can skip many checks
  - Still bound by linear time of course, but KMP is good here

Uses of Rabin-Karp
● Data compression relies on finding matching bytes in a file
● Rabin-Karp excels at this, as the "hash" can just be the pattern bytes
  - Impossible to have a collision, because a collision is a match
  - Guaranteed O(H) pattern matching assuming you can cram n into a register
● Also very useful for trying to find many needles in a single haystack at once
  - Simply hash all of the needles, and search for them all as you move through the haystack
  - Wikipedia gives the example of plagiarism detection

References
1. https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
2. https://en.wikipedia.org/wiki/Rabin_fingerprint
3. http://www.xmailserver.org/rabin.pdf
   a. Fingerprinting by Random Polynomials - Michael O. Rabin - 1981
4. https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
5. https://www.youtube.com/watch?v=Bf5e/C 19yo
6. https://www.youtube.com/watch?v=BZ3\#"26?7g
7. https://www.youtube.com/watch?v=<6$9IzOfA=Q
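The KMP slides describe a prefix array and a search that never backtracks in h. The following is a minimal Python sketch of that idea, under stated assumptions: it uses the sentinel-free form of the prefix array (no -1 in the first entry, which the slides note is one of the conventions in use), and the names `build_prefix_table` and `kmp_search` are illustrative rather than taken from the talk.

```python
def build_prefix_table(needle: str) -> list[int]:
    """table[i] = length of the longest proper prefix of needle[:i + 1]
    that is also a suffix of it (sentinel-free convention)."""
    table = [0] * len(needle)
    k = 0  # length of the prefix currently matched
    for i in range(1, len(needle)):
        # Fall back through shorter prefixes until one can be extended.
        while k > 0 and needle[i] != needle[k]:
            k = table[k - 1]
        if needle[i] == needle[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(needle: str, haystack: str) -> list[int]:
    """Return every offset in haystack where needle occurs."""
    if not needle or len(haystack) < len(needle):
        return []
    table = build_prefix_table(needle)
    matches: list[int] = []
    k = 0  # number of needle characters matched so far
    # The index into haystack only ever moves forward: no backtracking in h.
    for i, ch in enumerate(haystack):
        while k > 0 and ch != needle[k]:
            k = table[k - 1]
        if ch == needle[k]:
            k += 1
        if k == len(needle):
            matches.append(i - k + 1)
            # Reuse the matched prefix so overlapping matches are found.
            k = table[k - 1]
    return matches


if __name__ == "__main__":
    print(build_prefix_table("aaaaa"))
    print(kmp_search("aaaaa", "a" * 8))
```

For n = "aaaaa" the table is [0, 1, 2, 3, 4], which is why, after a full match, the search resumes with the first 4 characters already known to match.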
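The Rabin fingerprint example (p = 101, d = 256) can be checked with a short Python sketch. This is an illustration under assumptions, not the talk's implementation: `fingerprint` and `rabin_karp_search` are hypothetical names, the hash is computed with Horner's rule (which gives the same value as summing each character times its power of d), and a candidate match is confirmed by direct string comparison as the slides recommend.

```python
def fingerprint(text: str, d: int = 256, p: int = 101) -> int:
    """Polynomial hash of text in base d, reduced modulo the prime p."""
    h = 0
    for ch in text:
        # Horner's rule: reduce at every step so values stay small.
        h = (h * d + ord(ch)) % p
    return h


def rabin_karp_search(needle: str, haystack: str,
                      d: int = 256, p: int = 101) -> list[int]:
    """Return every offset in haystack where needle occurs."""
    n_len, h_len = len(needle), len(haystack)
    if n_len == 0 or h_len < n_len:
        return []
    top = pow(d, n_len - 1, p)  # weight of the window's first character
    target = fingerprint(needle, d, p)
    window = fingerprint(haystack[:n_len], d, p)
    matches: list[int] = []
    for start in range(h_len - n_len + 1):
        # Equal hashes may be a collision, so double check the characters.
        if window == target and haystack[start:start + n_len] == needle:
            matches.append(start)
        if start + n_len < h_len:
            # Roll: drop the first character, shift by one, add the next.
            window = (window - ord(haystack[start]) * top) % p
            window = (window * d + ord(haystack[start + n_len])) % p
    return matches


if __name__ == "__main__":
    print(fingerprint("abc"), fingerprint("bcq"))
    print(rabin_karp_search("bcq", "abcq"))
```

Here `fingerprint("abc")` is 90 and `fingerprint("bcq")` is 44, the same values as in the worked example, and rolling the window from "abc" to "bcq" inside `rabin_karp_search` reaches 44 without rehashing the whole window.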