Pattern Matching

Pattern Matching Pattern Matching Goal. Generalize string searching to incompletely specified patterns. Applications. ■ Test if a string or its substring matches some pattern. – validate data-entry fields (dates, email, URL, credit card) – text filters (spam, NetNanny, Carnivore) – computational biology Some of these lecture slides have been adapted from: • Algorithms in C, Robert Sedgewick. ■ Parse text files. – given web page, extract names of all links (web crawling, indexing, and searching) – Javadoc: automatically create documentation from comments ■ Replace or substitute some pattern in a text string. – text-editor – remove all tags in web page, leaving only content 4 Pattern Matching Review of Regular Expressions Goal. Generalize string searching to incompletely specified patterns. Theoretician. Language accepted by FSA. Programmer. Compact description of multiple strings. Text. N characters. You. Practical application of core CS principles. Pattern. M character REGULAR EXPRESSION. ■ Compact and expressive notation for describing text patterns. Concatenate. ■ ■ Algorithmically interesting. abcda abcda ■ Easy to implement. Logical OR. Matching. Does the text match the pattern? ■ a + b a, b Search. Find a substring of the text that matches the pattern. ■ (a + cc)(b + d) ab, ad, ccb, ccd Search all. Find all substrings of the text that match the pattern. Closure. ■ a* ε, a, aa, aaa, aaaa, aaaaa, … ■ ca*b cb, cab, caab, caaab, caaaab, … ■ c(a + bb)* d cd, cad, cbbd, caad, cabbd, caaad, … 5 6 Pattern Matching and You FSA and RE Broadly applicable programmer's tool. Kleene's theorem (1956). FSA and RE describe same languages. ■ Many languages support extended regular expressions. ■ Built into Perl, PHP, Python, JavaScript, emacs, egrep, awk. Possible grep implementation. ■ Build FSA from RE. Find any 11+ letter words in dictionary that can be typed by using only top ■ Write C program to simulate FSA. row letters, followed by bottom row letters. ■ Performance barrier: FSA can be exponentially large. ■ egrep '^[qwertyuiop]*[zxcvbnm]*$' /usr/dict/words | Actual grep implementation. egrep '...........' ■ Build nondeterministic FSA from RE. ■ ■ perl -ne 'print if /^[qwertyuiop]*[zxcvbnm]*$/' /usr/dict/words | Write C program to simulate NFSA. perl -ne 'print if /.........../' Essential paradigm in computer science. ■ Build intermediate abstractions. ! typewritten ■ Pick the right ones! Stephen C. Kleene (1909 - 1994) 7 8 Review of NFSA Simulating an NFSA A nondeterministic FSA. Brute force. Try all possible paths ⇒ exponential time. ■ 0, 1, or 2 arcs leaving a state, each with same label. ■ ε - transitions allowed, but no ε - cycles. Better idea. Keep track of all possible states NFSA could be in after reading in first i characters. ■ Use a deque (double-ended queue). Note: this restricted form is no loss of generality. – can push/pop like stack, enqueue like queue Deque a ε c ε 6 7 8 9 ε a a 1 9 3 -1 5 4 2 7 a a ε d 10 ε 11 12 13 possible states possible states NFSA a ε ε b ε after i-1 chars could be in after i chars 1 2 3 4 5 ■ Pop state v. ε – if label of arc v→wis ε, push state w – if current character matches label, enqueue state w (a*b + ac)d – if mismatch, ignore 9 10 NFSA Simulator Performance Gotcha nfsa() Major performance bug if not careful. Deque #define SCAN -1 ■ Simulate input aaaaaaaaaab on NFSA. #define EPS ' ' a 4 -1 -1 7 2 a a a a a ε ε b 5 2 -1 -1 7 2 a a a a a #define MATCHSTATE 0 0 4 5 6 a 2 -1 -1 7 2 a a a a a int match(char a[]) { ε a 1 3 -1 -1 7 2 a a a a a int j = 0, state = next1[0]; a 3 -1 2 7 2 a a a a a a ε DQinit(); 1 2 3 a a -1 2 4 2 a a a a a ε DQput(SCAN); a a 1 2 4 -1 a a a a a while(state != MATCHSTATE) { (a*a)*b 1 3 4 -1 a a a a if (state == SCAN) { DQput(scan); j++; } else if (ch[state] == a[j]) { DQput(next1[state]); } a a 1 3 4 -1 2 a a a a else if (ch[state] == EPS) { DQpush(next1[state]); a a 1 -1 4 -1 2 4 a a a ■ Duplicate states allowed on deque ⇒ DQpush(next2[state]); } a a 1 -1 7 ... a a a a a exponential growth! if (DQisempty() || a[j] == '\0') return 0; a a 1 -1 7 -1 2 4 2 4 state = DQpop(); a a 1 -1 7 2 4 2 4 -1 } Easy fix. return j; ■ Disallow duplicate states on same side of deque. } ■ Keep "existence array" of states currently on each side of deque. 11 12 Build NFSA from RE Parse Tree expr Goal: build NFSA from RE. Parse tree: grammatical structure of string. Parser: construct tree. First challenge: Is expression a legal RE? Example: (a*b + ac)d. term ■ Use context free language to describe RE. fctr term Start : <expr> Start : <expr> ( expr ) fctr <expr> ← <term> <expr> ← <term> b <expr> ← <term> + <expr> <expr> ← <term> + <expr> term + expr ← ← <term> <fctr> <term> <fctr> fctr term term <term> ← <fctr><term> <term> ← <fctr><term> fctr fctr term <fctr> ← c <fctr> ← c a * <fctr> ← c* <fctr> ← c* <fctr> ← (<expr>) <fctr> ← (<expr>) b a fctr <fctr> ← (<expr>)* <fctr> ← (<expr>)* c 13 14 Recursive Descent Parser for RE Recursive Descent Parser for RE Top-down recursive descent parser: Recursive program directly Definition of expression in CFL. expr() void expr() { derived from CFL. ■ <expr> ← <term> term(); ■ <expr> ← <term> + <expr> if (p[j] == '+') { main() j++; expr(); int j = 0; // current index } char p[MAXN + 1]; // RE pattern } Definition of term in CFL. ← void parserror(void) { ■ <term> <fctr> printf("%s is not a RE.\n", p); ■ <term> ← <fctr> <term> exit(EXIT_FAILURE); } term() void term() { int main(void) { fctr(); scanf("%s", p); if ((p[j] == '(') || islower(p[j])) expr(); term(); if (j != strlen(p)) parserror(p); } return 0; } 15 16 Recursive Descent Parser for RE Left Recursive Parsers Definition of factor in CFL. factor() Not as trivial as it first seems. ← ■ <fctr> c void fctr() { badexpr() ■ <fctr> ← c* if (islower(p[j])) { j++; void badexpr() { ■ <fctr> ← (<expr>) Alternate definition of expr in CFL. } if (islower(p[j]) j++; ■ <expr> ← c ■ <fctr> ← (<expr>)* else if (p[j] == '(') { else { ■ <expr> ← <expr> + <term> j++; badexpr(); expr(); if (p[j] == '+') { if (p[j] == ')') j++; j++; else parserror(); term(); } } else parserror(); else parserror(); } if (p[j] == '*') j++; } } Fix: use left recursive CFL. ■ Avoiding infinite recursive loops is fundamental difficulty in recursive-descent parsers. ■ Problem can be more subtle than example above. 17 18 Left Recursive Parsers Building NFSA from RE Example. (a*b + ac)d. Unix Each RE construct corresponds to a piece of NFSA. ■ Corresponds to parse tree. expr() a term() ■ Single character. fctr() – a start accept ( expr() A ■ Concatenation. term() ε – AB A B fctr() a * B term() fctr() b + A ■ OR. ε ε expr() A – A + B term() B fctr() a ε B ε term() fctr() c ) ε term() ■ Closure. A A fctr() d – A* ε 19 20 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a a ε 1 2 1 2 3 ε a a* 21 22 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a ε b a ε ε b 1 2 3 4 5 1 2 3 4 5 ε ε a* b a*b 23 24 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a a c a a c 6 7 6 7 8 9 a ε ε b a ε ε b 1 2 3 4 5 1 2 3 4 5 ε ε a*b a*b 25 26 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d ac a ε c a ε c 6 7 8 9 ε 6 7 8 9 ε 10 ε 11 a ε ε b a ε ε b ε 1 2 3 4 5 1 2 3 4 5 ε ε a*b a*b + ac 27 28 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Note. This construction doesn't yield simplest NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a ε c a ε 6 7 8 9 ε ε d 10 ε 11 12 13 ε c a ε ε b ε b d 1 2 3 4 5 a ε (a*b + ac)d 29 30 Building NFSA from RE: Theory Building NFSA from RE: Practice For any RE of length M, our construction produces an NFSA with the To build NFSA, augment parser to generate state table. following properties. ■ For details: Sedgewick, Chapter 21 (Algorithms in C, 2nd edition). ■ No more than two arcs leave any state. – recursive routines return index of start state ε – if two arcs, they both have label – state = next state to be filled in ε ■ No - cycles. – setstate() fills in NFSA table ■ Exactly 1 start state, has 1 incoming arc.

Load more