Pattern Matching
Pattern Matching Goal. Generalize string searching to incompletely specified patterns. Applications.
■ Test if a string or its substring matches some pattern. – validate data-entry fields (dates, email, URL, credit card) – text filters (spam, NetNanny, Carnivore) – computational biology Some of these lecture slides have been adapted from: • Algorithms in C, Robert Sedgewick. ■ Parse text files. – given web page, extract names of all links (web crawling, indexing, and searching) – Javadoc: automatically create documentation from comments
■ Replace or substitute some pattern in a text string. – text-editor – remove all tags in web page, leaving only content
4
Pattern Matching Review of Regular Expressions
Goal. Generalize string searching to incompletely specified patterns. Theoretician. Language accepted by FSA. Programmer. Compact description of multiple strings. Text. N characters. You. Practical application of core CS principles. Pattern. M character REGULAR EXPRESSION.
■ Compact and expressive notation for describing text patterns. Concatenate.
■ ■ Algorithmically interesting. abcda abcda
■ Easy to implement. Logical OR.
Matching. Does the text match the pattern? ■ a + b a, b
Search. Find a substring of the text that matches the pattern. ■ (a + cc)(b + d) ab, ad, ccb, ccd Search all. Find all substrings of the text that match the pattern. Closure.
■ a* ε, a, aa, aaa, aaaa, aaaaa, …
■ ca*b cb, cab, caab, caaab, caaaab, …
■ c(a + bb)* d cd, cad, cbbd, caad, cabbd, caaad, …
5 6 Pattern Matching and You FSA and RE
Broadly applicable programmer's tool. Kleene's theorem (1956). FSA and RE describe same languages.
■ Many languages support extended regular expressions.
■ Built into Perl, PHP, Python, JavaScript, emacs, egrep, awk. Possible grep implementation.
■ Build FSA from RE.
Find any 11+ letter words in dictionary that can be typed by using only top ■ Write C program to simulate FSA. row letters, followed by bottom row letters. ■ Performance barrier: FSA can be exponentially large.
■ egrep '^[qwertyuiop]*[zxcvbnm]*$' /usr/dict/words | Actual grep implementation. egrep '...... ' ■ Build nondeterministic FSA from RE.
■ ■ perl -ne 'print if /^[qwertyuiop]*[zxcvbnm]*$/' /usr/dict/words | Write C program to simulate NFSA. perl -ne 'print if /...... /' Essential paradigm in computer science.
■ Build intermediate abstractions. ! typewritten ■ Pick the right ones!
Stephen C. Kleene (1909 - 1994)
7 8
Review of NFSA Simulating an NFSA
A nondeterministic FSA. Brute force. Try all possible paths ⇒ exponential time.
■ 0, 1, or 2 arcs leaving a state, each with same label.
■ ε - transitions allowed, but no ε - cycles. Better idea. Keep track of all possible states NFSA could be in after reading in first i characters.
■ Use a deque (double-ended queue). Note: this restricted form is no loss of generality. – can push/pop like stack, enqueue like queue
Deque a ε c ε 6 7 8 9 ε a a 1 9 3 -1 5 4 2 7 a a ε d 10 ε 11 12 13 possible states possible states NFSA a ε ε b ε after i-1 chars could be in after i chars 1 2 3 4 5
■ Pop state v. ε – if label of arc v→wis ε, push state w – if current character matches label, enqueue state w (a*b + ac)d – if mismatch, ignore
9 10 NFSA Simulator Performance Gotcha
nfsa() Major performance bug if not careful. Deque #define SCAN -1 ■ Simulate input aaaaaaaaaab on NFSA. #define EPS ' ' a 4 -1 -1 7 2 a a a a a ε ε b 5 2 -1 -1 7 2 a a a a a #define MATCHSTATE 0 0 4 5 6 a 2 -1 -1 7 2 a a a a a int match(char a[]) { ε a 1 3 -1 -1 7 2 a a a a a int j = 0, state = next1[0]; a 3 -1 2 7 2 a a a a a a ε DQinit(); 1 2 3 a a -1 2 4 2 a a a a a ε DQput(SCAN); a a 1 2 4 -1 a a a a a while(state != MATCHSTATE) { (a*a)*b 1 3 4 -1 a a a a if (state == SCAN) { DQput(scan); j++; } else if (ch[state] == a[j]) { DQput(next1[state]); } a a 1 3 4 -1 2 a a a a else if (ch[state] == EPS) { DQpush(next1[state]); a a 1 -1 4 -1 2 4 a a a ■ Duplicate states allowed on deque ⇒ DQpush(next2[state]); } a a 1 -1 7 ... a a a a a exponential growth! if (DQisempty() || a[j] == '\0') return 0; a a 1 -1 7 -1 2 4 2 4 state = DQpop(); a a 1 -1 7 2 4 2 4 -1 } Easy fix. return j; ■ Disallow duplicate states on same side of deque. } ■ Keep "existence array" of states currently on each side of deque.
11 12
Build NFSA from RE Parse Tree
expr Goal: build NFSA from RE. Parse tree: grammatical structure of string. Parser: construct tree. First challenge: Is expression a legal RE? Example: (a*b + ac)d. term
■ Use context free language to describe RE. fctr term
Start :
← ←
Top-down recursive descent parser: Recursive program directly Definition of expression in CFL. expr() void expr() { derived from CFL. ■
printf("%s is not a RE.\n", p); ■
15 16
Recursive Descent Parser for RE Left Recursive Parsers
Definition of factor in CFL. factor() Not as trivial as it first seems. ← ■
■ Avoiding infinite recursive loops is fundamental difficulty in recursive-descent parsers.
■ Problem can be more subtle than example above.
17 18 Left Recursive Parsers Building NFSA from RE
Example. (a*b + ac)d. Unix Each RE construct corresponds to a piece of NFSA.
■ Corresponds to parse tree. expr() a term() ■ Single character. fctr() – a start accept ( expr() A ■ Concatenation. term() ε – AB A B fctr() a * B term() fctr() b + A ■ OR. ε ε expr() A – A + B term() B fctr() a ε B ε term() fctr() c ) ε term() ■ Closure. A A fctr() d – A* ε 19 20
Building NFSA from RE: Example Building NFSA from RE: Example
Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.
■ (a*b + ac)d ■ (a*b + ac)d
a a ε 1 2 1 2 3
ε
a a*
21 22 Building NFSA from RE: Example Building NFSA from RE: Example
Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.
■ (a*b + ac)d ■ (a*b + ac)d
a ε b a ε ε b 1 2 3 4 5 1 2 3 4 5
ε ε
a* b a*b
23 24
Building NFSA from RE: Example Building NFSA from RE: Example
Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.
■ (a*b + ac)d ■ (a*b + ac)d
a a c
a a c 6 7 6 7 8 9
a ε ε b a ε ε b 1 2 3 4 5 1 2 3 4 5
ε ε
a*b a*b
25 26 Building NFSA from RE: Example Building NFSA from RE: Example
Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.
■ (a*b + ac)d ■ (a*b + ac)d
ac
a ε c a ε c 6 7 8 9 ε 6 7 8 9 ε
10 ε 11
a ε ε b a ε ε b ε 1 2 3 4 5 1 2 3 4 5
ε ε
a*b a*b + ac
27 28
Building NFSA from RE: Example Building NFSA from RE: Example
Each RE construct corresponds to a piece of NFSA. Note. This construction doesn't yield simplest NFSA.
■ (a*b + ac)d ■ (a*b + ac)d
a ε c a ε 6 7 8 9 ε ε d 10 ε 11 12 13 ε c
a ε ε b ε b d 1 2 3 4 5 a ε
(a*b + ac)d
29 30 Building NFSA from RE: Theory Building NFSA from RE: Practice
For any RE of length M, our construction produces an NFSA with the To build NFSA, augment parser to generate state table. following properties. ■ For details: Sedgewick, Chapter 21 (Algorithms in C, 2nd edition). ■ No more than two arcs leave any state. – recursive routines return index of start state ε – if two arcs, they both have label – state = next state to be filled in ε ■ No - cycles. – setstate() fills in NFSA table
■ Exactly 1 start state, has 1 incoming arc. expr() ■ Exactly 1 accept state, has at most 1 leaving arc. int expr() { ■ Number of states ≤ 2M. int s1, s2, start; start = s1 = term(); Proof: Apply 3 composition rules and use induction on length of RE. if (p[j] == '+') {
■ For number of states. j++; start = s2 = ++state; – single character: 2 state++; – concatenation AB: |A| + |B| setstate(s2, EPS, expr(), s1); – closure A*: |A| + 1 setstate(s2-1, EPS, state, state); – OR A + B: |A| + |B| + 2 } return start; }
31 32
Complexity Analysis Perspective
Text. N characters. Compiler. A program that translates from one language to another.
Pattern. M character regular expression. ■ Grep: RE ⇒ NFSA.
■ C compiler: C language ⇒ machine language. Matching: Does the text match the pattern?
■ Build NFSA. – at most 2M states ⇒ O(M) time, O(M) space
■ Simulate NFSA. Abstract Machine NFSA Computer – O(M) time per text character because of ε-transitions Pattern Word in CFL Word in CFL
■ O(MN) time, O(M) space. Parser Check if legal RE Check if legal C program Compiler Output NFSA Output machine executable Search: Find a substring of the text that matches the pattern. Simulator Find match Run program in hardware ■ For each offset of text, solve matching problem. 2 ■ O(MN ) time, O(M+N) space.
33 34