Pattern Matching

a b a c a a b

1 a b a c a b 4 3 2 a b a c a b Strings

q A string is a sequence of q Let P be a string of size m characters n A substring P[i .. j] of P is the q Examples of strings: subsequence of P consisting of the characters with ranks n Python program between i and j n HTML document n A prefix of P is a substring of n DNA sequence the type P[0 .. i] n Digitized image n A suffix of P is a substring of the type P[i ..m - 1] q An alphabet Σ is the set of possible characters for a q Given strings T (text) and P family of strings (pattern), the pattern

q Example of alphabets: matching problem consists of finding a substring of T equal n ASCII to P n Unicode q Applications: n {0, 1} n Text editors n {A, C, G, T} n Search engines n Biological research © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 2 Brute-Force Pattern Matching

q The brute-force pattern matching algorithm compares Algorithm BruteForceMatch(T, P) the pattern P with the text T Input text T of size n and pattern for each possible shift of P P of size m relative to T, until either Output starting index of a n a match is found, or substring of T equal to P or -1 n all placements of the pattern if no such substring exists have been tried for i ← 0 to n - m q Brute-force pattern matching runs in time O(nm) { test shift i of the pattern } q Example of worst case: j ← 0 n T = aaa … ah while j < m ∧ T[i + j] = P[j] n P = aaah j ← j + 1 n may occur in images and if j m DNA sequences = n unlikely in English text return i {match at i} else break while loop {mismatch} return -1 {no match anywhere} © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 3 Right to Left Matching q Matching the pattern from right to left q For a pattern abc:

T: bbacdcbaabcddcdaddaaabcbcb P: abc

q Worst case is still O(n m)

Pattern Matching 4 Bad Character Rule (BCR) (1) q On a mismatch between the pattern and the text, we can shift the pattern by more than one place.

ddbbacdcbaabcddcdaddaaabcbcb acabc

Pattern Matching 5 Bad Character Rule (BCR) (2) q Preprocessing – n A table, for each position in the pattern and a character, last occurrence of the mismatched character in P preceding the mismatch . O(n |Σ|) space. O(1) access time.

1 2 3 4 5

a c a b c a 1 1 1 3 3 1 2 3 4 5 b 4 4 c 2 2 2 5

Pattern Matching 6 Bad Character Rule (BCR) (3) q On a mismatch, shift the pattern to the right until the first occurrence of the mismatched character in P. q Still O(n m) worst case running time:

T: aaaaaaaaaaaaaaaaaaaaaaaaa P: abaaaa

Pattern Matching 7 Good Suffix Rule (GSR) (1) q We want to use the knowledge of the matched characters in the pattern’s suffix. q If we matched S characters in T, what is (if exists) the smallest shift in P that will align a sub-string of P of the same S characters ?

Pattern Matching 8 Good Suffix Rule (GSR) (2) q Example 1 – how much to move:

T: bbacdcbaabcddcdaddaaabcbcb P: cabbabdbab cabbabdbab

Pattern Matching 9 Good Suffix Rule (GSR) (3) q Example 2 – what if there is no alignment: ↓ T: bbacdcbaabcbbabdbabcaabcbcb P: bcbbabdbabc bcbbabdbabc

Pattern Matching 10 Good Suffix Rule (GSR) (4) q We mark the matched sub-string in T with t and the mismatched char with x

1. In case of a mismatch: shift right until the first occurrence of t in P such that the next char y in P holds y≠x

2. Otherwise, shift right to the largest prefix of P that aligns with a suffix of t.

Pattern Matching 11 Boyer Moore Algorithm q Preprocess(P) q k := n while (k ≤ m) do

n Match P and T from right to left starting at k

n If a mismatch occurs: shift P right (advance k) by max(good suffix rule, bad char rule).

n else, print the occurrence and shift P right (advance k) by the good suffix rule.

Pattern Matching 12 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG Bad Character Rule – shift by 7 Good Suffix Rule – shift by 0

Pattern Matching 13 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG t Bad Character Rule – shift by 1 Good Suffix Rule – shift by 3 (9-6) Pattern P – GTAGCGGCG t - GCG Prefixes of P G GT the longest prefix of GTA P which has t as the suffix. GTAG GTAGC GTAGCG

Pattern Matching 14 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG t

Pattern P – GTAGCGGCG t - GCGGCG Prefixes of P G GTAGCGG GT GTAGCGGC GTA GTAGCGGCG GTAG GTAGC Find the longest prefix of GTAGCG P which has t as the suffix.

Pattern Matching 15 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG t Bad Character Rule – shift by 3 Good Suffix Rule – shift by 8 (9-1) Pattern P – GTAGCGGCG t - GCGGCG suffixes of t Prefixes of P G G CG GT GCG GTA GCGG GTAG Find the longest prefix of GCGGC GTAGC P which is a suffix of t. GCGGCG GTAGCG Pattern Matching 16 Boyer-Moore Algorithm Demo GTTATAGCTGATCGCGGCGTAGCGGCGAA GTAGCGGCG

Pattern Matching 17 Good Suffix Rule (GSR) (5) q L(i) – The biggest index j, such that j < n and prefix P[1..j] contains suffix P[i..n] as a suffix but not suffix P[i-1..n]

i 1 2 3 4 5 6 7 8 9 P G T A G C G G C G L(i) 0 0 0 0 0 0 6 0 7

Pattern Matching 18 Good Suffix Rule (GSR) (6) q l(i) – The length of the longest suffix of P[i..n] that is also a prefix of P

i 1 2 3 4 5 6 7 8 9 P G T A G C G G C G l(i) 1 1 1 1 1 1 1 1 1

Pattern Matching 19 Good Suffix Rule (GSR) (7) q Putting it together

n If mismatch occurs at position n, shift P by 1

n If a mismatch occurs at position i-1 in P: w If L(i) > 0, shift P by n – L(i) w else shift P by n – l(i)

n If P was found, shift P by n – l(2)

Pattern Matching 20 Boyer-Moore Algorithm Analysis q Worst case

n O(n+m) if the pattern does not occur in the text

n O(nm) if the pattern does occur in the text

n For patterns of small size, the algorithm might not be efficient.

Pattern Matching 21 Knuth-Morris-Pratt - Algorithm q KMP Algorithm

n compares the pattern to the text in left-to- right, but shifts the pattern more intelligently than the brute-force algorithm.

text t1 . . . tj-r+1 . . . tj . . .

pattern p1 . . . pr . . .

pattern p1 . . . pr . . .

pattern p1 . . . pr . . .

22 The KMP Algorithm

q Knuth-Morris-Pratt’s algorithm compares the pattern to the text in left-to- right, but shifts the pattern more intelligently than the . . a b a a b x . . . . . brute-force algorithm.

q When a mismatch occurs, what is the most we can a b a a b a shift the pattern so as to j avoid redundant comparisons? a b a a b a q Answer: the largest prefix of P[0..j] that is a suffix of P[1..j] No need to Resume repeat these comparing comparisons here © 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 23 KMP Failure Function

q Knuth-Morris-Pratt’s algorithm preprocesses the j 0 1 2 3 4 5 pattern to find matches of prefixes of the pattern with P[j] a b a a b a the pattern itself F(j) 0 0 1 1 2 3 q The failure function F(j) is defined as the size of the . . a b a a b x . . . . . largest prefix of P[0..j] that is also a suffix of P[1..j] q Knuth-Morris-Pratt’s a b a a b a algorithm modifies the brute- j force algorithm so that if a mismatch occurs at P[j] ≠ T[i] a b a a b a we set j ← F(j - 1) F(j − 1)

© 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 24 Computing the Failure Function

q The failure function can be represented by an array and can be computed in O(m) Algorithm failureFunction(P) F[0] 0 time ← i ← 1 q At each iteration of the while- j ← 0 loop, either while i < m if P[i] = P[j] n i increases by one, or {we have matched j + 1 chars} n the shift amount i - j F[i] ← j + 1 increases by at least one i ← i + 1 (observe that F(j - 1) < j) j ← j + 1 else if j > 0 then q Hence, there are no more {use failure function to shift P} than 2m iterations of the j ← F[j - 1] while-loop else F[i] ← 0 { no match } i ← i + 1

© 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 25 The KMP Algorithm

q The failure function can be represented by an array and Algorithm KMPMatch(T, P) can be computed in O(m) F ← failureFunction(P) time i ← 0 j 0 q At each iteration of the while- ← loop, either while i < n if T[i] = P[j] n i increases by one, or if j = m - 1 n the shift amount i - j return i - j { match } increases by at least one else (observe that F(j - 1) < j) i ← i + 1 q Hence, there are no more j ← j + 1 than 2n iterations of the else while-loop if j > 0 j ← F[j - 1] q Thus, KMP’s algorithm runs else in optimal time O(m + n) i ← i + 1 return -1 { no match }

© 2014 Goodrich, Tamassia, Goldwasser Pattern Matching 26 Example

a b a c a a b a c c a b a c a b a a b b 1 2 3 4 5 6 a b a c a b 7 a b a c a b 8 9 10 11 12 a b a c a b 13 a b a c a b j 0 1 2 3 4 5 14 15 16 17 18 19 P[j] a b a c a b a b a c a b F(j) 0 0 1 0 1 2

Pattern Matching 27

e i mi nimize ze

mize nimize ze nimize ze Preprocessing Strings

q Preprocessing the pattern speeds up pattern matching queries

n After preprocessing the pattern, KMP’s algorithm performs pattern matching in time proportional to the text size

q If the text is large, immutable and searched for often (e.g., works by Shakespeare), we may want to preprocess the text instead of the pattern

q A is a compact for representing a set of strings, such as all the words in a text

n A trie supports pattern matching queries in time proportional to the pattern size

© 2014 Goodrich, Tamassia, Goldwasser Tries 29 Standard Tries

q The standard trie for a set of strings S is an ordered such that:

n Each node but the root is labeled with a character n The children of a node are alphabetically ordered n The paths from the external nodes to the root yield the strings of S q Example: standard trie for the set of strings S = { bear, bell, bid, bull, buy, sell, stock, stop }

b s

e i u e t

a l d l y l o

r l l l c p

k © 2014 Goodrich, Tamassia, Goldwasser Tries 30 Analysis of Standard Tries

q A standard trie uses O(n) space and supports searches, insertions and deletions in time O(dm), where: n total size of the strings in S m size of the string parameter of the operation d size of the alphabet

b s

e i u e t

a l d l y l o

r l l l c p

k © 2014 Goodrich, Tamassia, Goldwasser Tries 31 Application of Tries q A standard trie supports the following operations on a processed text in O(m) time, where m is the length of the string.

n word matching: find the first occurrence of word X in the text.

n prefix matching: find the first occurrence of the longest prefix of word X in the text.

Tries 32 Word Matching with a Trie insert the words of s e e a b e a r ? s e l l s t o c k ! the text into trie 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Each leaf is associated w/ one s e e a b u l l ? b u y s t o c k ! particular word 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 leaf stores indices b i d s t o c k ! b i d s t o c k ! where associated 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 word begins h e a r t h e b e l l ? s t o p ! (“see” starts at index 0 & 24, leaf 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 for “see” stores those indices)

b h s

e i u e e t

a l d l y a e l o 47, 58 36 0, 24 r l l r l c p 6 78 30 69 12 84 k

© 2014 Goodrich, Tamassia, Goldwasser Tries 17, 40, 33 51, 62 Compressed Tries

q A compressed trie has internal nodes of degree at least two b s

q It is obtained from standard trie by compressing chains of e id u ell to “redundant” nodes q ex. the “i” and “d” in “bid” ar ll ll y ck p are “redundant” because they signify the same word

b s

e i u e t

a l d l y l o

r l l l c p

k © 2014 Goodrich, Tamassia, Goldwasser Tries 34 Why Compressed Tries? q A compressed trie storing a collection S of s strings from an alphabet of size d has the following properties

n Every internal node of T has at least two children and at most d children

n T has s leaf nodes

n The number of internal nodes of T is O(s)

Tries 35 Compact Representation

q Compact representation of a compressed trie for an array of strings:

n Stores at the nodes ranges of indices instead of substrings

n Uses O(s) space, where s is the number of strings in the array

n Serves as an auxiliary index structure

0 1 2 3 4 0 1 2 3 0 1 2 3 S[0] = s e e S[4] = b u l l S[7] = h e a r

b hear s S[1] = b e a r S[5] = b u y S[8] = b e l l e e id u ell to S[2] = s e l l S[6] = b i d S[9] = s t o p e ll ar ll ll y ck p S[3] = s t o c k

1,0,0 7,0,3 0,0,0

1,1,1 6,1,2 4,1,1 0,1,1 3,1,2

1,2,3 8,2,3 4,2,3 5,2,2 0,2,2 2,2,3 3,3,4 9,3,3

Tries 36 Insertion/Deletion in Compressed Trie

f g

or u one reat

n r f g search ends here or u one reat insert green f g n r or u one re

insert green n r at en

Tries 37 Application - Routing through Tries q Internet Routers maintain a Trie (table) q It is not a lookup

n forwards packets to its neighbors using IP prefix matching rules

10 12

~ 2 9 ~

1 ~

Tries 38 Suffix Trie

q The suffix trie of a string X is the compressed trie of all the suffixes of X

q Each leaf corresponds to a suffix of X

m i n i m i z e 0 1 2 3 4 5 6 7

e i mi nimize ze 7 2 6 mize nimize ze nimize ze 3 1 5 0 4

© 2014 Goodrich, Tamassia, Goldwasser Tries 39 Analysis of Suffix Tries

q Compact representation of the suffix trie for a string X of size n from an alphabet of size d n Uses O(n) space n Supports arbitrary pattern matching queries in X in O(dm) time, where m is the size of the pattern n Can be constructed in O(n) time

m i n i m i z e 0 1 2 3 4 5 6 7

7, 7 1, 1 0, 1 2, 7 6, 7 7 2 6 4, 7 2, 7 6, 7 2, 7 6, 7 3 1 5 0 4 © 2014 Goodrich, Tamassia, Goldwasser Tries 40 Prefix Matching using Suffix Trie q If two suffixes have a same prefix, then their corresponding paths are the same at their beginning, and the concatenation of the edge labels of the mutual part is the prefix. m i n i m i z e ize and 0 1 2 3 4 5 6 7 imize

e i mi nimize ze 7 2 6 mize nimize ze nimize ze 3 1 5 0 4

Tries 41 Suffix Trie q Not all strings are guaranteed to have corresponding suffix trie. q For example: xabxa

a xa bxa bxa bxa

a xa bxa bxa bxa $ $

Tries 42 Constructing a Suffix Trie

q S[1..n] is the string q start with a single edge for S q enter the edges for the suffix S[i..n] where i goes from 2 to n n Starting at the root node find the longest part from the root whose label matches a prefix of S[i..n]. At some point, no further matches are possible w If the point is at a node, then denote this node by w w If it is in the middle of an edge, then insert a new node called w, at this point w create a new edge running from w to a new leaf labeled S[i..n]

Tries 43 Example minimize minimize minimize

minimize inimize minimize

inimize minimize nimize

minimize

i minimize nimize mize nimize Tries 44 Example (2) minimize

i minimize nimize mize nimize minimize

i m nimize mize nimize inimize ize

Tries 45 Example (3)

m i n i m i z e 0 1 2 3 4 5 6 7

e i mi nimize ze 7 2 6 mize nimize ze nimize ze 3 1 5 0 4

Tries 46 Constructing a Suffix Trie

q S[1..n] is the string Complexity- O(n2) q start with a single edge for S q enter the edges for the suffix S[i..n] where i goes from 2 to n n Starting at the root node find the longest part from the root whose label matches a prefix of S[i..n]. At some point, no further matches are possible w If the point is at a node, then denote this node by w w If it is in the middle of an edge, then insert a new node called w, at this point w create a new edge running from w to a new leaf labeled S[i..n]

Tries 47