Pattern Matching

Pattern Matching Goal. Generalize string searching to incompletely specified patterns. Applications.

■ Test if a string or its substring matches some pattern. – validate data-entry fields (dates, email, URL, credit card) – text filters (spam, NetNanny, Carnivore) – computational biology Some of these lecture slides have been adapted from: • Algorithms in C, Robert Sedgewick. ■ Parse text files. – given web page, extract names of all links (web crawling, indexing, and searching) – Javadoc: automatically create documentation from comments

■ Replace or substitute some pattern in a text string. – text-editor – remove all tags in web page, leaving only content

4

Pattern Matching Review of Regular Expressions

Goal. Generalize string searching to incompletely specified patterns. Theoretician. Language accepted by FSA. Programmer. Compact description of multiple strings. Text. N characters. You. Practical application of core CS principles. Pattern. M character .

■ Compact and expressive notation for describing text patterns. Concatenate.

■ ■ Algorithmically interesting. abcda abcda

■ Easy to implement. Logical OR.

Matching. Does the text match the pattern? ■ a + b a, b

Search. a substring of the text that matches the pattern. ■ (a + cc)(b + d) ab, ad, ccb, ccd Search all. Find all substrings of the text that match the pattern. Closure.

■ a* ε, a, aa, aaa, aaaa, aaaaa, …

■ ca*b cb, cab, caab, caaab, caaaab, …

■ c(a + bb)* d cd, cad, cbbd, caad, cabbd, caaad, …

5 6 Pattern Matching and You FSA and RE

Broadly applicable programmer's tool. Kleene's theorem (1956). FSA and RE describe same languages.

■ Many languages support extended regular expressions.

■ Built into , PHP, Python, JavaScript, emacs, egrep, awk. Possible grep implementation.

■ Build FSA from RE.

Find any 11+ letter words in dictionary that can be typed by using only top ■ Write C program to simulate FSA. row letters, followed by bottom row letters. ■ Performance barrier: FSA can be exponentially large.

■ egrep '^[qwertyuiop]*[zxcvbnm]*$' /usr/dict/words | Actual grep implementation. egrep '...... ' ■ Build nondeterministic FSA from RE.

■ ■ perl -ne 'print if /^[qwertyuiop]*[zxcvbnm]*$/' /usr/dict/words | Write C program to simulate NFSA. perl -ne 'print if /...... /' Essential paradigm in computer science.

■ Build intermediate abstractions. ! typewritten ■ Pick the right ones!

Stephen C. Kleene (1909 - 1994)

7 8

Review of NFSA Simulating an NFSA

A nondeterministic FSA. Brute force. Try all possible paths ⇒ exponential time.

■ 0, 1, or 2 arcs leaving a state, each with same label.

■ ε - transitions allowed, but no ε - cycles. Better idea. Keep track of all possible states NFSA could be in after reading in first i characters.

■ Use a deque (double-ended queue). Note: this restricted form is no loss of generality. – can push/pop like stack, enqueue like queue

Deque a ε c ε 6 7 8 9 ε a a 1 9 3 -1 5 4 2 7 a a ε d 10 ε 11 12 13 possible states possible states NFSA a ε ε b ε after i-1 chars could be in after i chars 1 2 3 4 5

■ Pop state v. ε – if label of arc v→wis ε, push state w – if current character matches label, enqueue state w (a*b + ac)d – if mismatch, ignore

9 10 NFSA Simulator Performance Gotcha

nfsa() Major performance bug if not careful. Deque #define SCAN -1 ■ Simulate input aaaaaaaaaab on NFSA. #define EPS ' ' a 4 -1 -1 7 2 a a a a a ε ε b 5 2 -1 -1 7 2 a a a a a #define MATCHSTATE 0 0 4 5 6 a 2 -1 -1 7 2 a a a a a int match(char a[]) { ε a 1 3 -1 -1 7 2 a a a a a int j = 0, state = next1[0]; a 3 -1 2 7 2 a a a a a a ε DQinit(); 1 2 3 a a -1 2 4 2 a a a a a ε DQput(SCAN); a a 1 2 4 -1 a a a a a while(state != MATCHSTATE) { (a*a)*b 1 3 4 -1 a a a a if (state == SCAN) { DQput(scan); j++; } else if (ch[state] == a[j]) { DQput(next1[state]); } a a 1 3 4 -1 2 a a a a else if (ch[state] == EPS) { DQpush(next1[state]); a a 1 -1 4 -1 2 4 a a a ■ Duplicate states allowed on deque ⇒ DQpush(next2[state]); } a a 1 -1 7 ... a a a a a exponential growth! if (DQisempty() || a[j] == '\0') return 0; a a 1 -1 7 -1 2 4 2 4 state = DQpop(); a a 1 -1 7 2 4 2 4 -1 } Easy fix. return j; ■ Disallow duplicate states on same side of deque. } ■ Keep "existence array" of states currently on each side of deque.

11 12

Build NFSA from RE Parse

expr Goal: build NFSA from RE. Parse tree: grammatical structure of string. Parser: construct tree. First challenge: Is expression a legal RE? Example: (a*b + ac)d. term

■ Use context free language to describe RE. fctr term

Start : Start : ( expr ) fctr

b + + term + expr

← ← fctr term term fctr fctr term ← c ← c a * ← c* ← c* ← () ← () b a fctr ← ()* ← ()* c 13 14 Recursive Descent Parser for RE Recursive Descent Parser for RE

Top-down recursive descent parser: Recursive program directly Definition of expression in CFL. expr() void expr() { derived from CFL. ■ term(); ■ + if (p[j] == '+') { main() j++; expr(); int j = 0; // current index } char p[MAXN + 1]; // RE pattern } Definition of term in CFL. ← void parserror(void) { ■

printf("%s is not a RE.\n", p); ■ exit(EXIT_FAILURE); } term() void term() { int main(void) { fctr(); scanf("%s", p); if ((p[j] == '(') || islower(p[j])) expr(); term(); if (j != strlen(p)) parserror(p); } return 0; }

15 16

Recursive Descent Parser for RE Left Recursive Parsers

Definition of factor in CFL. factor() Not as trivial as it first seems. ← ■ c void fctr() { badexpr() ■ ← c* if (islower(p[j])) { j++; void badexpr() { ■ ← () Alternate definition of expr in CFL. } if (islower(p[j]) j++; ■ ← c ■ ← ()* else if (p[j] == '(') { else { ■ + j++; badexpr(); expr(); if (p[j] == '+') { if (p[j] == ')') j++; j++; else parserror(); term(); } } else parserror(); else parserror(); } if (p[j] == '*') j++; } } Fix: use left recursive CFL.

■ Avoiding infinite recursive loops is fundamental difficulty in recursive-descent parsers.

■ Problem can be more subtle than example above.

17 18 Left Recursive Parsers Building NFSA from RE

Example. (a*b + ac)d. Unix Each RE construct corresponds to a piece of NFSA.

■ Corresponds to parse tree. expr() a term() ■ Single character. fctr() – a start accept ( expr() A ■ Concatenation. term() ε – AB A B fctr() a * B term() fctr() b + A ■ OR. ε ε expr() A – A + B term() B fctr() a ε B ε term() fctr() c ) ε term() ■ Closure. A A fctr() d – A* ε 19 20

Building NFSA from RE: Example Building NFSA from RE: Example

Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.

■ (a*b + ac)d ■ (a*b + ac)d

a a ε 1 2 1 2 3

ε

a a*

21 22 Building NFSA from RE: Example Building NFSA from RE: Example

Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.

■ (a*b + ac)d ■ (a*b + ac)d

a ε b a ε ε b 1 2 3 4 5 1 2 3 4 5

ε ε

a* b a*b

23 24

Building NFSA from RE: Example Building NFSA from RE: Example

Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.

■ (a*b + ac)d ■ (a*b + ac)d

a a c

a a c 6 7 6 7 8 9

a ε ε b a ε ε b 1 2 3 4 5 1 2 3 4 5

ε ε

a*b a*b

25 26 Building NFSA from RE: Example Building NFSA from RE: Example

Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA.

■ (a*b + ac)d ■ (a*b + ac)d

ac

a ε c a ε c 6 7 8 9 ε 6 7 8 9 ε

10 ε 11

a ε ε b a ε ε b ε 1 2 3 4 5 1 2 3 4 5

ε ε

a*b a*b + ac

27 28

Building NFSA from RE: Example Building NFSA from RE: Example

Each RE construct corresponds to a piece of NFSA. Note. This construction doesn't yield simplest NFSA.

■ (a*b + ac)d ■ (a*b + ac)d

a ε c a ε 6 7 8 9 ε ε d 10 ε 11 12 13 ε c

a ε ε b ε b d 1 2 3 4 5 a ε

(a*b + ac)d

29 30 Building NFSA from RE: Theory Building NFSA from RE: Practice

For any RE of length M, our construction produces an NFSA with the To build NFSA, augment parser to generate state table. following properties. ■ For details: Sedgewick, Chapter 21 (Algorithms in C, 2nd edition). ■ No more than two arcs leave any state. – recursive routines return index of start state ε – if two arcs, they both have label – state = next state to be filled in ε ■ No - cycles. – setstate() fills in NFSA table

■ Exactly 1 start state, has 1 incoming arc. expr() ■ Exactly 1 accept state, has at most 1 leaving arc. int expr() { ■ Number of states ≤ 2M. int s1, s2, start; start = s1 = term(); Proof: Apply 3 composition rules and use induction on length of RE. if (p[j] == '+') {

■ For number of states. j++; start = s2 = ++state; – single character: 2 state++; – concatenation AB: |A| + |B| setstate(s2, EPS, expr(), s1); – closure A*: |A| + 1 setstate(s2-1, EPS, state, state); – OR A + B: |A| + |B| + 2 } return start; }

31 32

Complexity Analysis Perspective

Text. N characters. Compiler. A program that translates from one language to another.

Pattern. M character regular expression. ■ Grep: RE ⇒ NFSA.

■ C compiler: C language ⇒ machine language. Matching: Does the text match the pattern?

■ Build NFSA. – at most 2M states ⇒ O(M) time, O(M) space

■ Simulate NFSA. Abstract Machine NFSA Computer – O(M) time per text character because of ε-transitions Pattern Word in CFL Word in CFL

■ O(MN) time, O(M) space. Parser Check if legal RE Check if legal C program Compiler Output NFSA Output machine executable Search: Find a substring of the text that matches the pattern. Simulator Find match Run program in hardware ■ For each offset of text, solve matching problem. 2 ■ O(MN ) time, O(M+N) space.

33 34