
String Matching

Chris Muncey

1 Questions

● Who are the inventors of the KMP algorithm?
● What type of hashing algorithm do we use in Rabin-Karp (the type, not the name of the specific one we discussed)?
● What is the time complexity of both algorithms?

2 About me

● Chris Muncey
● Course-only Master’s Candidate, no research interests
● Advisor: None at the moment

● I enjoy basically anything low level
● I like cooking
● I was born and raised in Knoxville, about 20 minutes from campus
● F1 is the closest thing to a sport I like

3 Pictures

4 Outline

● Naive algorithm
● Knuth-Morris-Pratt algorithm
● Rabin-Karp algorithm
● Uses outside of text matching

5 History

● A fast pattern matching algorithm was developed in 1970 by James H. Morris and Vaughan Pratt
● Donald Knuth independently discovered the algorithm around the same time because he’s Donald Knuth
● The three published it jointly in 1977

● Michael O. Rabin wrote a paper on fingerprinting (hashing) using polynomials in 1981
● He and Richard M. Karp used it to create a fast string searching algorithm in 1987

6 Naive algorithm

7 Naive (brute force) algorithm

● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
  ○ n = “Pneumonoultramicroscopicsilicovolcanoconiosis” isn’t going to be in h = “abc”
● Start at offset 0 of n and h, compare up to N characters, stop on first mismatch
● Increment offset of n and h by 1, repeat

● Overall time complexity of O(NH), not very fast
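The scan described above can be sketched in Python as follows (the function and variable names are mine, chosen to match the slides' n/h convention):

```python
def naive_search(n: str, h: str) -> list[int]:
    """Return every offset where needle n occurs in haystack h."""
    N, H = len(n), len(h)
    matches = []
    if N == 0 or N > H:                    # ensure H >= N before scanning
        return matches
    for offset in range(H - N + 1):        # try every alignment of n against h
        i = 0
        while i < N and h[offset + i] == n[i]:
            i += 1                         # stop on the first mismatch
        if i == N:
            matches.append(offset)
    return matches

print(naive_search("TGGA", "CTTGGAGAGC"))  # [2]
```

The inner while loop is where the O(NH) worst case comes from: each of the H - N + 1 offsets can cost up to N comparisons.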

8 Naive algorithm visual

● n = “TGGA”, h = “CTTGGAGAGC”

9 Where does this go wrong?

● For every letter in h, we are potentially doing up to N comparisons
  ○ On long matches with a mismatch at the end, we do N-1 comparisons
  ○ When we hit that mismatch, we move by one, and potentially have the same situation
  ○ Consider n = “aaaaa”, h = “aaaa...aa”
    ■ We have to do 5 comparisons for every character in h
● We can apply some optimisations to eliminate this extra work
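To see that worst case concretely, here is a small instrumented sketch (my own, not part of the original slides) that counts character comparisons on the repeated-letter example:

```python
def naive_comparisons(n: str, h: str) -> int:
    """Count character comparisons made by the naive scan."""
    count = 0
    for offset in range(len(h) - len(n) + 1):
        for i in range(len(n)):
            count += 1                     # one character comparison
            if h[offset + i] != n[i]:
                break                      # mismatch: slide the window by one
    return count

# 996 offsets, and all 5 characters match at each one: 996 * 5 = 4980
print(naive_comparisons("aaaaa", "a" * 1000))
```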

10 Knuth-Morris-Pratt algorithm

11 Knuth-Morris-Pratt algorithm

● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
● Similar to naive algorithm, but we can jump by more than 1 character
● If n has a repeated prefix, we can jump to that offset in n on a mismatch
● If it doesn’t, we can jump all the way to the mismatch location in h

● Overall time complexity of O(N+H)
● O(N) preprocessing, then O(H) matching
● Effectively linear time string searching

12 KMP algorithm prefix array

● An array with the number of matches to the prefix of the string
● Usually the first one is reserved as a sentinel, but not always
● On mismatch, we can move our pointer in n to the index given by prefix
  ○ Special case needed for the -1s obviously
● If our mismatch was at position 7, we would move n to index 3 and compare the same spot in h to that
● We know the first 3 letters will match, so we don’t need to check them
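The prefix array can be built in O(N) with the standard two-pointer construction. This sketch (helper name is mine) uses the variant without the -1 sentinel, where prefix[i] is the length of the longest proper prefix of n that is also a suffix of n[0..i]:

```python
def build_prefix(n: str) -> list[int]:
    prefix = [0] * len(n)
    k = 0                              # length of the currently matched prefix
    for i in range(1, len(n)):
        while k > 0 and n[i] != n[k]:
            k = prefix[k - 1]          # fall back to the next shorter border
        if n[i] == n[k]:
            k += 1
        prefix[i] = k
    return prefix

print(build_prefix("abcdabc"))  # [0, 0, 0, 0, 1, 2, 3]
```

With this table, a mismatch after matching 7 characters of "abcdabc" sends the pointer back to index prefix[6] = 3, as described above.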

13 KMP example

● On the first 3 mismatches, we can only move by 1
● On the 4th, we can move by several spaces because we know a match can’t happen between the two points
● We align the prefix of the string with the matching suffix

14 What makes it linear time?

● This is very similar to the brute force method, but there’s one key difference
● The algorithm never backtracks in h
● In the brute force algorithm, you can do up to N comparisons for each letter in h
● With this one, if we have a match or partial match, we can skip anything that we know can’t also be a match
  ○ If we don’t end on a repeated prefix, there can’t be a second match
  ○ If we do, we can jump to that instead
● Thus this algorithm is linear time, O(N+H)
● Back to n = “aaaaa”, h = “aaaa...aa”
  ○ We can start at the end of n every time because we know the first 4 will match
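Putting the prefix table together with the forward-only scan gives the full algorithm. A minimal sketch (function name is mine):

```python
def kmp_search(n: str, h: str) -> list[int]:
    N, H = len(n), len(h)
    if N == 0 or N > H:
        return []
    # O(N) preprocessing: prefix[i] = length of the longest proper prefix
    # of n that is also a suffix of n[0..i]
    prefix = [0] * N
    k = 0
    for i in range(1, N):
        while k > 0 and n[i] != n[k]:
            k = prefix[k - 1]
        if n[i] == n[k]:
            k += 1
        prefix[i] = k
    # O(H) matching: j only moves forward, so h is never re-read
    matches, k = [], 0
    for j in range(H):
        while k > 0 and h[j] != n[k]:
            k = prefix[k - 1]          # jump within n instead of backtracking in h
        if h[j] == n[k]:
            k += 1
        if k == N:
            matches.append(j - N + 1)
            k = prefix[k - 1]          # continue, allowing overlapping matches
    return matches

print(kmp_search("aaaaa", "aaaaaaaaaa"))  # [0, 1, 2, 3, 4, 5]
```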

15 Rabin-Karp algorithm

16 Rabin-Karp algorithm

● Needle string n of size N
● Haystack string h of size H
● Ensure that H is greater than or equal to N
● Hash n, hash the first N characters of h, compare the hashes
● If they match
  ○ Double check it with the naive approach to make sure
● If they mismatch
  ○ Remove the first character from the hash of h, add the next character

● Overall time complexity of O(N+H)
● Assumes good hash function, few collisions
● Worst case still O(NH), but only with a terrible hash function
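The whole procedure can be sketched in Python with a simple polynomial rolling hash mod a prime (d = 256 and p = 101 match the examples later in the deck; the function name is mine):

```python
def rabin_karp(n: str, h: str, d: int = 256, p: int = 101) -> list[int]:
    N, H = len(n), len(h)
    if N == 0 or N > H:
        return []
    top = pow(d, N - 1, p)             # weight of the character leaving the window
    hn = hh = 0
    for i in range(N):                 # hash n and the first window of h
        hn = (hn * d + ord(n[i])) % p
        hh = (hh * d + ord(h[i])) % p
    matches = []
    for i in range(H - N + 1):
        # on a hash match, double-check with a direct comparison
        if hh == hn and h[i:i + N] == n:
            matches.append(i)
        if i < H - N:                  # roll: drop h[i], bring in h[i+N]
            hh = ((hh - ord(h[i]) * top) * d + ord(h[i + N])) % p
    return matches

print(rabin_karp("TGGA", "CTTGGAGAGC"))  # [2]
```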

17 Hey wait a minute...

● We can’t just edit a hash value like that, that’s not how hashing works
● We need a special type of hash algorithm, called a rolling hash
● One such hash is the Rabin fingerprint:

● View message n of length N as a polynomial of degree N-1
  ○ f(d) = n[0]·d^(N-1) + n[1]·d^(N-2) + … + n[N-1]
  ○ d should be larger than the number of characters in the alphabet of n
● Pick a random irreducible polynomial p of degree k over GF(2)
  ○ For our purposes, this is effectively just a k-bit number
● The hash is the remainder when dividing f(d) by p
  ○ hash = f(d) % p

18 Rabin fingerprint implementation

● Since 256^x gets big, we need something to make it smaller
● Key insight
  ○ (A*B) % C = ((A % C) * (B % C)) % C
● The original paper was worried about space, so it picked the smallest prime p such that p > log(NH / e), with e > 0 being the acceptable error rate
● We’ll pick 101 because it’s easy for the examples
● The number of bits in p determines the number of bits in the hash, so we’ll have a 7 bit hash

● hash(str, p):
● n = str.length, h = 0
● for i in {0 … n}
  ○ j = 1
  ○ for k in {0 … n - 1 - i}
    ■ j = (j * d) % p
  ○ h += ((str[i] % p) * (j % p)) % p
● return(h % p)

19 Rabin fingerprint example

● n = “abc”, p = 101, d = 256
● (((‘a’*((256*256)%p))%p) + ((‘b’*(256%p))%p) + ((‘c’*(1%p))%p)) % p
● (((97 * 88) % p) + ((98 * 54) % p) + (99 % p)) % p
● (52 + 40 + 99) % 101 = 90

● n = “bcq”, p = 101, d = 256
● (((‘b’*((256*256)%p))%p) + ((‘c’*(256%p))%p) + (‘q’%p)) % p = 44

● To roll the hash: subtract the top character’s contribution, “shift by 1”, add the new character’s contribution

● ((((hash(“abc”) - ((‘a’*((256*256)%p))%p)) * 256) % p) + hash(“q”)) % p
● ((90 - ((‘a’*((256*256)%p))%p)) * 256) % p
● ((90 - 52) * 256) % p = 32
● (32 + (‘q’ % p)) % p = 44, as expected
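The arithmetic above is easy to verify mechanically; this snippet (my own check, not from the slides) recomputes both fingerprints and the rolling update:

```python
p, d = 101, 256

def fp(s: str) -> int:
    h = 0
    for c in s:                        # Horner evaluation of f(d) mod p
        h = (h * d + ord(c)) % p
    return h

assert fp("abc") == 90                 # matches the first example
assert fp("bcq") == 44                 # matches the second example

top = pow(d, 2, p)                     # 256^2 % 101 == 88, weight of the top character
roll = ((fp("abc") - ord("a") * top) * d + ord("q")) % p
assert roll == 44                      # rolling update reproduces fp("bcq")
print("all checks pass")
```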

20 Uses outside of text matching

21 Uses of KMP

● DNA matching
  ○ DNA only has an alphabet size of 4, so repeats will happen with high frequency
  ○ Lots of repeats means we can skip many checks
  ○ Still bound by linear time of course, but KMP is good here

22 Uses of Rabin-Karp

● Data compression relies on finding matches in a file
● Rabin-Karp excels at this, as the “hash” can just be the pattern bytes
  ○ Impossible to have a collision, because a collision is a match
  ○ Guaranteed O(H) pattern matching assuming you can cram n into a register

● Also very useful for trying to find many needles in a single haystack at once
  ○ Simply hash all of the needles, and search for them all as you move through the haystack
  ○ Wikipedia gives the example of plagiarism detection
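The multi-needle idea can be sketched as follows (names and the large prime are my choices; for simplicity all needles are assumed to share the same length, so one window serves them all):

```python
def multi_search(needles: list[str], h: str, d: int = 256, p: int = 10**9 + 7):
    N = len(needles[0])                # assumes equal-length needles
    top = pow(d, N - 1, p)

    def fp(s: str) -> int:
        v = 0
        for c in s:
            v = (v * d + ord(c)) % p
        return v

    # hash every needle once up front (two needles colliding would shadow
    # each other in this simple dict; acceptable for a sketch)
    wanted = {fp(n): n for n in needles}
    hits, hh = [], fp(h[:N])
    for i in range(len(h) - N + 1):
        if hh in wanted and h[i:i + N] == wanted[hh]:
            hits.append((i, wanted[hh]))
        if i + N < len(h):             # roll the single shared window
            hh = ((hh - ord(h[i]) * top) * d + ord(h[i + N])) % p
    return hits

print(multi_search(["TGGA", "GAGC"], "CTTGGAGAGC"))  # [(2, 'TGGA'), (6, 'GAGC')]
```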

23 References

1. https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm
2. https://en.wikipedia.org/wiki/Rabin_fingerprint
3. http://www.xmailserver.org/rabin.pdf
   a. Fingerprinting by Random Polynomials - Michael O. Rabin - 1981
4. https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
5. https://www.youtube.com/watch?v=BfUejqd07yo
6. https://www.youtube.com/watch?v=EL4ZbRF587g
7. https://www.youtube.com/watch?v=V5-7GzOfADQ

24 Discussion

25 Questions

● Who are the inventors of the KMP algorithm?
● What type of hashing algorithm do we use in Rabin-Karp (the type, not the name of the specific one we discussed)?
● What is the time complexity of both algorithms?

26 String Matching Algorithms

Chris Muncey
