Pattern Matching

Pattern Matching

Pattern Matching a b a c a a b 1 a b a c a b 4 3 2 a b a c a b Strings q A string is a sequence of q Let P be a string of size m characters n A substring P[i .. j] of P is the q Examples of strings: subsequence of P consisting of the characters with ranks n Python program between i and j n HTML document n A prefix of P is a substring of n DNA sequence the type P[0 .. i] n Digitized image n A suffix of P is a substring of the type P[i ..m - 1] q An alphabet Σ is the set of possible characters for a q Given strings T (text) and P family of strings (pattern), the pattern q Example of alphabets: matching problem consists of finding a substring of T equal n ASCII to P n Unicode q Applications: n {0, 1} n Text editors n {A, C, G, T} n Search engines n Biological research © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 2 Brute-Force Pattern Matching q The brute-force pattern matching algorithm compares Algorithm BruteForceMatch(T, P) the pattern P with the text T Input text T of size n and pattern for each possible shift of P P of size m relative to , until either T Output starting index of a n a match is found, or substring of T equal to P or -1 n all placements of the pattern if no such substring exists have been tried for i ← 0 to n - m q Brute-force pattern matching runs in time O(nm) { test shift i of the pattern } q Example of worst case: j ← 0 n T = aaa … ah while j < m ∧ T[i + j] = P[j] n P = aaah j ← j + 1 n may occur in images and if j m DNA sequences = n unlikely in English text return i {match at i} else break while loop {mismatch} return -1 {no match anywhere} © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 3 Right to Left Matching q Matching the pattern from right to left q For a pattern abc: T: bbacdcbaabcddcdaddaaabcbcb P: abc q Worst case is still O(n m) Pattern Matching 4 Bad Character Rule (BCR) (1) q On a mismatch between the pattern and the text, we can shift the pattern by more than one place. ddbbacdcbaabcddcdaddaaabcbcb acabc Pattern Matching 5 Bad Character Rule (BCR) (2) q Preprocessing – n A table, for each position in the pattern and a character, last occurrence of the mismatched character in P preceding the mismatch . O(n |Σ|) space. O(1) access time. 1 2 3 4 5 a c a b c a 1 1 1 3 3 1 2 3 4 5 b 4 4 c 2 2 2 5 Pattern Matching 6 Bad Character Rule (BCR) (3) q On a mismatch, shift the pattern to the right until the first occurrence of the mismatched character in P. q Still O(n m) worst case running time: T: aaaaaaaaaaaaaaaaaaaaaaaaa P: abaaaa Pattern Matching 7 Good Suffix Rule (GSR) (1) q We want to use the knowledge of the matched characters in the pattern’s suffix. q If we matched S characters in T, what is (if exists) the smallest shift in P that will align a sub-string of P of the same S characters ? Pattern Matching 8 Good Suffix Rule (GSR) (2) q Example 1 – how much to move: T: bbacdcbaabcddcdaddaaabcbcb P: cabbabdbab cabbabdbab Pattern Matching 9 Good Suffix Rule (GSR) (3) q Example 2 – what if there is no alignment: ↓ T: bbacdcbaabcbbabdbabcaabcbcb P: bcbbabdbabc bcbbabdbabc Pattern Matching 10 Good Suffix Rule (GSR) (4) q We mark the matched sub-string in T with t and the mismatched char with x 1. In case of a mismatch: shift right until the first occurrence of t in P such that the next char y in P holds y≠x 2. Otherwise, shift right to the largest prefix of P that aligns with a suffix of t. Pattern Matching 11 Boyer Moore Algorithm q Preprocess(P) q k := n while (k ≤ m) do n Match P and T from right to left starting at k n If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule). n else, print the occurrence and shift P right (advance k) by the good suffix rule. Pattern Matching 12 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG Bad Character Rule – shift by 7 Good Suffix Rule – shift by 0 Pattern Matching 13 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG t Bad Character Rule – shift by 1 Good Suffix Rule – shift by 3 (9-6) Pattern P – GTAGCGGCG t - GCG Prefixes of P G GT Find the longest prefix of GTA P which has t as the suffix. GTAG GTAGC GTAGCG Pattern Matching 14 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG t Pattern P – GTAGCGGCG t - GCGGCG Prefixes of P G GTAGCGG GT GTAGCGGC GTA GTAGCGGCG GTAG GTAGC Find the longest prefix of GTAGCG P which has t as the suffix. Pattern Matching 15 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG t Bad Character Rule – shift by 3 Good Suffix Rule – shift by 8 (9-1) Pattern P – GTAGCGGCG t - GCGGCG suffixes of t Prefixes of P G G CG GT GCG GTA GCGG GTAG Find the longest prefix of GCGGC GTAGC P which is a suffix of t. GCGGCG GTAGCG Pattern Matching 16 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG Pattern Matching 17 Good Suffix Rule (GSR) (5) q L(i) – The biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n] i 1 2 3 4 5 6 7 8 9 P G T A G C G G C G L(i) 0 0 0 0 0 0 6 0 7 Pattern Matching 18 Good Suffix Rule (GSR) (6) q l(i) – The length of the longest suffix of P[i..n] that is also a prefix of P i 1 2 3 4 5 6 7 8 9 P G T A G C G G C G l(i) 1 1 1 1 1 1 1 1 1 Pattern Matching 19 Good Suffix Rule (GSR) (7) q Putting it together n If mismatch occurs at position n, shift P by 1 n If a mismatch occurs at position i-1 in P: w If L(i) > 0, shift P by n – L(i) w else shift P by n – l(i) n If P was found, shift P by n – l(2) Pattern Matching 20 Boyer-Moore Algorithm Analysis q Worst case n O(n+m) if the pattern does not occur in the text n O(nm) if the pattern does occur in the text n For patterns of small size, the algorithm might not be efficient. Pattern Matching 21 Knuth-Morris-Pratt - Algorithm q KMP Algorithm n compares the pattern to the text in left-to- right, but shifts the pattern more intelligently than the brute-force algorithm. text t1 . tj-r+1 . tj . pattern p1 . pr . pattern p1 . pr . pattern p1 . pr . 22 The KMP Algorithm q Knuth-Morris-Pratt’s algorithm compares the pattern to the text in left-to- right, but shifts the pattern more intelligently than the . a b a a b x . brute-force algorithm. q When a mismatch occurs, what is the most we can a b a a b a shift the pattern so as to j avoid redundant comparisons? a b a a b a q Answer: the largest prefix of P[0..j] that is a suffix of P[1..j] No need to Resume repeat these comparing comparisons here © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 23 KMP Failure Function q Knuth-Morris-Pratt’s algorithm preprocesses the j 0 1 2 3 4 5 pattern to find matches of prefixes of the pattern with P[j] a b a a b a the pattern itself F(j) 0 0 1 1 2 3 q The failure function F(j) is defined as the size of the . a b a a b x . largest prefix of P[0..j] that is also a suffix of P[1..j] q Knuth-Morris-Pratt’s a b a a b a algorithm modifies the brute- j force algorithm so that if a mismatch occurs at P[j] ≠ T[i] a b a a b a we set j ← F(j - 1) F(j − 1) © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 24 Computing the Failure Function q The failure function can be represented by an array and can be computed in O(m) Algorithm failureFunction(P) F[0] 0 time ← i ← 1 q At each iteration of the while- j ← 0 loop, either while i < m if P[i] = P[j] n i increases by one, or {we have matched j + 1 chars} n the shift amount i - j F[i] ← j + 1 increases by at least one i ← i + 1 (observe that F(j - 1) < j) j ← j + 1 else if j > 0 then q Hence, there are no more {use failure function to shift P} than 2m iterations of the j ← F[j - 1] while-loop else F[i] ← 0 { no match } i ← i + 1 © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 25 The KMP Algorithm q The failure function can be represented by an array and Algorithm KMPMatch(T, P) can be computed in O(m) F ← failureFunction(P) time i ← 0 j 0 q At each iteration of the while- ← loop, either while i < n if T[i] = P[j] n i increases by one, or if j = m - 1 n the shift amount i - j return i - j { match } increases by at least one else (observe that F(j - 1) < j) i ← i + 1 q Hence, there are no more j ← j + 1 than 2n iterations of the else while-loop if j > 0 j ← F[j - 1] q Thus, KMP’s algorithm runs else in optimal time O(m + n) i ← i + 1 return -1 { no match } © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 26 Example a b a c a a b a c c a b a c a b a a b b 1 2 3 4 5 6 a b a c a b 7 a b a c a b 8 9 10 11 12 a b a c a b 13 a b a c a b j 0 1 2 3 4 5 14 15 16 17 18 19 P[j] a b a c a b a b a c a b F(j) 0 0 1 0 1 2 Pattern Matching 27 Tries e i mi nimize ze mize nimize ze nimize ze Preprocessing Strings q Preprocessing the pattern speeds up pattern matching queries n After preprocessing the pattern, KMP’s algorithm performs pattern matching in time proportional to the text size q If the text is large, immutable and searched for often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern q A trie is a compact data structure for representing a set of strings, such as all the words in a text n A trie supports pattern matching queries in time proportional to the pattern size © 2014 Goodrich, Tamassia, Goldwasser Tries 29 Standard Tries q The standard trie for a set of strings S is an ordered tree such that: n Each node but the root is labeled with a character n The children of a node are alphabetically ordered n The paths from the external nodes to the root yield the strings of S q Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop } b s e i u e t a l d l y l o r l l l c p k © 2014 Goodrich, Tamassia, Goldwasser Tries 30 Analysis of Standard Tries q A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where: n total size of the strings in S m size of the string parameter of the operation d size of the alphabet b s e i u e t a l d l y l o r l l l c p k © 2014 Goodrich, Tamassia, Goldwasser Tries 31 Application of Tries q A standard trie supports the following operations on a processed text in O(m) time, where m is the length of the string.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    47 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us