Pattern Matching

Strings

- A string is a sequence of characters.
- Examples of strings:
  - Python program
  - HTML document
  - DNA sequence
  - Digitized image
- An alphabet Σ is the set of possible characters for a family of strings.
- Examples of alphabets:
  - ASCII
  - Unicode
  - {0, 1}
  - {A, C, G, T}
- Let P be a string of size m.
  - A substring P[i .. j] of P is the subsequence of P consisting of the characters with ranks between i and j.
  - A prefix of P is a substring of the type P[0 .. i].
  - A suffix of P is a substring of the type P[i .. m - 1].
- Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
- Applications:
  - Text editors
  - Search engines
  - Biological research

© 2014 Goodrich, Tamassia, Goldwasser

Brute-Force Pattern Matching

- The brute-force pattern matching algorithm compares the pattern P with the text T for each possible shift of P relative to T, until either
  - a match is found, or
  - all placements of the pattern have been tried.
- Brute-force pattern matching runs in time O(nm).
- Example of worst case:
  - T = aaa … ah
  - P = aaah
  - may occur in images and DNA sequences
  - unlikely in English text

Algorithm BruteForceMatch(T, P)
  Input: text T of size n and pattern P of size m
  Output: starting index of a substring of T equal to P, or -1 if no such substring exists
  for i ← 0 to n - m
    { test shift i of the pattern }
    j ← 0
    while j < m ∧ T[i + j] = P[j]
      j ← j + 1
    if j = m
      return i { match at i }
    { otherwise: mismatch, try the next shift }
  return -1 { no match anywhere }

Right to Left Matching

- Matching the pattern from right to left.
- For a pattern abc:
  T: bbacdcbaabcddcdaddaaabcbcb
  P: abc
- Worst case is still O(nm).

Bad Character Rule (BCR) (1)

- On a mismatch between the pattern and the text, we can shift the pattern by more than one place.
  T: ddbbacdcbaabcddcdaddaaabcbcb
  P: acabc

Bad Character Rule (BCR) (2)

- Preprocessing: build a table that gives, for each position in the pattern and each character, the last occurrence of the mismatched character in P preceding the mismatch. The table takes O(n |Σ|) space and has O(1) access time. (In the Boyer-Moore slides, n is the length of the pattern and m is the length of the text.)
- Table for P = acabc (a dash marks an empty cell):

  position  1  2  3  4  5
  P         a  c  a  b  c
  a         1  1  1  3  3
  b         -  -  -  4  4
  c         -  2  2  2  5

Bad Character Rule (BCR) (3)

- On a mismatch, shift the pattern to the right until the first occurrence of the mismatched character in P.
- Still O(nm) worst case running time:
  T: aaaaaaaaaaaaaaaaaaaaaaaaa
  P: abaaaa

Good Suffix Rule (GSR) (1)

- We want to use the knowledge of the matched characters in the pattern's suffix.
- If we matched S characters in T, what is (if it exists) the smallest shift of P that will align a substring of P with the same S characters?

Good Suffix Rule (GSR) (2)

- Example 1: how much to move.
  T: bbacdcbaabcddcdaddaaabcbcb
  P: cabbabdbab
  P (shifted): cabbabdbab

Good Suffix Rule (GSR) (3)

- Example 2: what if there is no alignment?
  T: bbacdcbaabcbbabdbabcaabcbcb
  P: bcbbabdbabc
  P (shifted): bcbbabdbabc

Good Suffix Rule (GSR) (4)

- We mark the matched substring in T with t and the mismatched character with x.
  1. In case of a mismatch: shift right until the first occurrence of t in P such that the next character y in P satisfies y ≠ x.
  2. Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.

Boyer-Moore Algorithm

- Preprocess(P)
- k := n
  while (k ≤ m) do
    - Match P and T from right to left starting at k.
    - If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad character rule).
    - Else, print the occurrence and shift P right (advance k) by the good suffix rule.
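
The right-to-left scan with a bad character shift can be sketched in Python. This is a simplified illustration, not the full algorithm from the slides: it omits the good suffix rule, and it uses a single last-occurrence dictionary per character instead of the per-position table. The names `last_occurrence` and `bad_character_search` are hypothetical, and indices are 0-based.

```python
def last_occurrence(pattern):
    """Map each character of the pattern to the index of its last occurrence."""
    return {ch: idx for idx, ch in enumerate(pattern)}


def bad_character_search(text, pattern):
    """Return the 0-based start indices of all occurrences of pattern in text.

    Simplified Boyer-Moore sketch: right-to-left comparison plus a bad
    character shift only (assumption: no good suffix rule is applied).
    """
    text_len, pat_len = len(text), len(pattern)
    if pat_len == 0 or pat_len > text_len:
        return []
    last = last_occurrence(pattern)
    matches = []
    shift = 0  # pattern[0] is currently aligned with text[shift]
    while shift <= text_len - pat_len:
        j = pat_len - 1
        # Compare from the right end of the pattern towards the left.
        while j >= 0 and pattern[j] == text[shift + j]:
            j -= 1
        if j < 0:
            matches.append(shift)
            shift += 1  # no good suffix rule here, so advance by one place
        else:
            # Align the last occurrence of the mismatched text character
            # with the mismatch position; always move at least one place.
            shift += max(1, j - last.get(text[shift + j], -1))
    return matches


if __name__ == "__main__":
    print(bad_character_search("GTTATAGCTGATCGCGGCGTAGCGGCGAA", "GTAGCGGCG"))
```

Because the shift is never smaller than one and never skips past an alignment in which the mismatched text character could match, the sketch finds every occurrence; it simply takes smaller steps than a version that also applies the good suffix rule.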

Boyer-Moore Algorithm Demo

  T: GTTATAGCTGATCGCGGCGTAGCGGCGAA
  P: GTAGCGGCG

- Step 1:
  - Bad Character Rule: shift by 7
  - Good Suffix Rule: shift by 0
- Step 2 (matched suffix t = GCG):
  - Bad Character Rule: shift by 1
  - Good Suffix Rule: shift by 3 (9 - 6)
  - Find the longest prefix of P which has t as its suffix. Prefixes of P: G, GT, GTA, GTAG, GTAGC, GTAGCG.
- Step 3 (matched suffix t = GCGGCG):
  - Find the longest prefix of P which has t as its suffix. Prefixes of P: G, GT, GTA, GTAG, GTAGC, GTAGCG, GTAGCGG, GTAGCGGC, GTAGCGGCG.
  - Bad Character Rule: shift by 3
  - Good Suffix Rule: shift by 8 (9 - 1)
  - Find the longest prefix of P which is a suffix of t. Suffixes of t: G, CG, GCG, GGCG, CGGCG, GCGGCG. Prefixes of P: G, GT, GTA, GTAG, GTAGC, GTAGCG.
- Step 4: P is aligned with its occurrence in T.

Good Suffix Rule (GSR) (5)

- L(i): the biggest index j such that j < n and the prefix P[1..j] contains the suffix P[i..n] as a suffix, but not the suffix P[i-1..n] (where n is the length of P).

  i     1  2  3  4  5  6  7  8  9
  P     G  T  A  G  C  G  G  C  G
  L(i)  0  0  0  0  0  0  6  0  7

Good Suffix Rule (GSR) (6)

- l(i): the length of the longest suffix of P[i..n] that is also a prefix of P.

  i     1  2  3  4  5  6  7  8  9
  P     G  T  A  G  C  G  G  C  G
  l(i)  1  1  1  1  1  1  1  1  1

Good Suffix Rule (GSR) (7)

- Putting it together:
  - If a mismatch occurs at position n, shift P by 1.
  - If a mismatch occurs at position i-1 in P:
    - if L(i) > 0, shift P by n - L(i);
    - else shift P by n - l(i).
  - If P was found, shift P by n - l(2).

Boyer-Moore Algorithm Analysis

- Worst case:
  - O(n+m) if the pattern does not occur in the text
  - O(nm) if the pattern does occur in the text
- For patterns of small size, the algorithm might not be efficient.

The KMP Algorithm

- The Knuth-Morris-Pratt (KMP) algorithm compares the pattern to the text from left to right, but shifts the pattern more intelligently than the brute-force algorithm.
- When a mismatch occurs, what is the most we can shift the pattern so as to avoid redundant comparisons?
- Answer: the largest prefix of P[0..j] that is a suffix of P[1..j].

  T:  . . a b a a b x . . .
  P:      a b a a b a              (mismatch at j = 5)
  P:            a b a a b a        (after the shift)

  There is no need to repeat the comparisons for the first two characters of the shifted pattern; comparing resumes at the mismatched text character.

KMP Failure Function

- The KMP algorithm preprocesses the pattern to find matches of prefixes of the pattern with the pattern itself.
- The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
- The KMP algorithm modifies the brute-force algorithm so that if a mismatch occurs at P[j] ≠ T[i] we set j ← F(j - 1).

  j     0  1  2  3  4  5
  P[j]  a  b  a  a  b  a
  F(j)  0  0  1  1  2  3

Computing the Failure Function

- The failure function can be represented by an array and can be computed in O(m) time.
- At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i - j increases by at least one (observe that F(j - 1) < j).
- Hence, there are no more than 2m iterations of the while-loop.

Algorithm failureFunction(P)
  F[0] ← 0
  i ← 1
  j ← 0
  while i < m
    if P[i] = P[j]
      { we have matched j + 1 chars }
      F[i] ← j + 1
      i ← i + 1
      j ← j + 1
    else if j > 0 then
      { use failure function to shift P }
      j ← F[j - 1]
    else
      F[i] ← 0 { no match }
      i ← i + 1

The KMP Algorithm

- The failure function can be represented by an array and can be computed in O(m) time.
- At each iteration of the while-loop, either
  - i increases by one, or
  - the shift amount i - j increases by at least one (observe that F(j - 1) < j).
- Hence, there are no more than 2n iterations of the while-loop.
- Thus, the KMP algorithm runs in optimal time O(m + n).

Algorithm KMPMatch(T, P)
  F ← failureFunction(P)
  i ← 0
  j ← 0
  while i < n
    if T[i] = P[j]
      if j = m - 1
        return i - j { match }
      else
        i ← i + 1
        j ← j + 1
    else
      if j > 0
        j ← F[j - 1]
      else
        i ← i + 1
  return -1 { no match }

Example

  T: a b a c a a b a c c a b a c a b a a b b
  P: a b a c a b

  (Figure: the pattern is slid along the text; the character comparisons are numbered 1 through 19.)

  j     0  1  2  3  4  5
  P[j]  a  b  a  c  a  b
  F(j)  0  0  1  0  1  2

Tries

Preprocessing Strings

- Preprocessing the pattern speeds up pattern matching queries.
  - After preprocessing the pattern, the KMP algorithm performs pattern matching in time proportional to the text size.
- If the text is large, immutable and searched often (e.g., the works of Shakespeare), we may want to preprocess the text instead of the pattern.
- A trie is a compact data structure for representing a set of strings, such as all the words in a text.
  - A trie supports pattern matching queries in time proportional to the pattern size.

Standard Tries

- The standard trie for a set of strings S is an ordered tree such that:
  - each node but the root is labeled with a character;
  - the children of a node are alphabetically ordered;
  - the paths from the root to the external nodes yield the strings of S.
- Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }.

  (Figure: the standard trie for S; the root has children b and s.)

Analysis of Standard Tries

- A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where:
  - n is the total size of the strings in S,
  - m is the size of the string parameter of the operation,
  - d is the size of the alphabet.

Application of Tries

- A standard trie supports the following operations on a processed text in O(m) time, where m is the length of the string.
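
As a rough illustration of the standard trie, the Python sketch below stores the example set S and answers word and prefix lookups by walking one node per character. The names `TrieNode`, `StandardTrie`, `contains` and `has_prefix` are hypothetical. Two assumptions differ from the slides: children are kept in a dictionary rather than in alphabetical order, and an end-of-word flag is added so that a stored string may be a prefix of another.

```python
class TrieNode:
    """One trie node; the edge character is the key in the parent's dict."""

    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_word = False  # True if a string of S ends at this node


class StandardTrie:
    """Minimal trie sketch supporting insertion, word and prefix lookup."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add word to the trie, creating one node per missing character."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True if word is one of the stored strings."""
        node = self._walk(word)
        return node is not None and node.is_word

    def has_prefix(self, prefix):
        """True if some stored string starts with prefix."""
        return self._walk(prefix) is not None


if __name__ == "__main__":
    trie = StandardTrie(["bear", "bell", "bid", "bull", "buy",
                         "sell", "stock", "stop"])
    print(trie.contains("bell"), trie.contains("be"), trie.has_prefix("be"))
```

Each lookup visits at most one node per character of the query. With a dictionary per node the per-character step takes expected constant time, whereas scanning an ordered child list of up to d entries gives the O(dm) bound quoted in the analysis.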
