Pattern Matching

Pattern Matching

Pattern Matching Pattern Matching Goal. Generalize string searching to incompletely specified patterns. Applications. ■ Test if a string or its substring matches some pattern. – validate data-entry fields (dates, email, URL, credit card) – text filters (spam, NetNanny, Carnivore) – computational biology Some of these lecture slides have been adapted from: • Algorithms in C, Robert Sedgewick. ■ Parse text files. – given web page, extract names of all links (web crawling, indexing, and searching) – Javadoc: automatically create documentation from comments ■ Replace or substitute some pattern in a text string. – text-editor – remove all tags in web page, leaving only content 4 Pattern Matching Review of Regular Expressions Goal. Generalize string searching to incompletely specified patterns. Theoretician. Language accepted by FSA. Programmer. Compact description of multiple strings. Text. N characters. You. Practical application of core CS principles. Pattern. M character REGULAR EXPRESSION. ■ Compact and expressive notation for describing text patterns. Concatenate. ■ ■ Algorithmically interesting. abcda abcda ■ Easy to implement. Logical OR. Matching. Does the text match the pattern? ■ a + b a, b Search. Find a substring of the text that matches the pattern. ■ (a + cc)(b + d) ab, ad, ccb, ccd Search all. Find all substrings of the text that match the pattern. Closure. ■ a* ε, a, aa, aaa, aaaa, aaaaa, … ■ ca*b cb, cab, caab, caaab, caaaab, … ■ c(a + bb)* d cd, cad, cbbd, caad, cabbd, caaad, … 5 6 Pattern Matching and You FSA and RE Broadly applicable programmer's tool. Kleene's theorem (1956). FSA and RE describe same languages. ■ Many languages support extended regular expressions. ■ Built into Perl, PHP, Python, JavaScript, emacs, egrep, awk. Possible grep implementation. ■ Build FSA from RE. Find any 11+ letter words in dictionary that can be typed by using only top ■ Write C program to simulate FSA. row letters, followed by bottom row letters. ■ Performance barrier: FSA can be exponentially large. ■ egrep '^[qwertyuiop]*[zxcvbnm]*$' /usr/dict/words | Actual grep implementation. egrep '...........' ■ Build nondeterministic FSA from RE. ■ ■ perl -ne 'print if /^[qwertyuiop]*[zxcvbnm]*$/' /usr/dict/words | Write C program to simulate NFSA. perl -ne 'print if /.........../' Essential paradigm in computer science. ■ Build intermediate abstractions. ! typewritten ■ Pick the right ones! Stephen C. Kleene (1909 - 1994) 7 8 Review of NFSA Simulating an NFSA A nondeterministic FSA. Brute force. Try all possible paths ⇒ exponential time. ■ 0, 1, or 2 arcs leaving a state, each with same label. ■ ε - transitions allowed, but no ε - cycles. Better idea. Keep track of all possible states NFSA could be in after reading in first i characters. ■ Use a deque (double-ended queue). Note: this restricted form is no loss of generality. – can push/pop like stack, enqueue like queue Deque a ε c ε 6 7 8 9 ε a a 1 9 3 -1 5 4 2 7 a a ε d 10 ε 11 12 13 possible states possible states NFSA a ε ε b ε after i-1 chars could be in after i chars 1 2 3 4 5 ■ Pop state v. ε – if label of arc v→wis ε, push state w – if current character matches label, enqueue state w (a*b + ac)d – if mismatch, ignore 9 10 NFSA Simulator Performance Gotcha nfsa() Major performance bug if not careful. Deque #define SCAN -1 ■ Simulate input aaaaaaaaaab on NFSA. #define EPS ' ' a 4 -1 -1 7 2 a a a a a ε ε b 5 2 -1 -1 7 2 a a a a a #define MATCHSTATE 0 0 4 5 6 a 2 -1 -1 7 2 a a a a a int match(char a[]) { ε a 1 3 -1 -1 7 2 a a a a a int j = 0, state = next1[0]; a 3 -1 2 7 2 a a a a a a ε DQinit(); 1 2 3 a a -1 2 4 2 a a a a a ε DQput(SCAN); a a 1 2 4 -1 a a a a a while(state != MATCHSTATE) { (a*a)*b 1 3 4 -1 a a a a if (state == SCAN) { DQput(scan); j++; } else if (ch[state] == a[j]) { DQput(next1[state]); } a a 1 3 4 -1 2 a a a a else if (ch[state] == EPS) { DQpush(next1[state]); a a 1 -1 4 -1 2 4 a a a ■ Duplicate states allowed on deque ⇒ DQpush(next2[state]); } a a 1 -1 7 ... a a a a a exponential growth! if (DQisempty() || a[j] == '\0') return 0; a a 1 -1 7 -1 2 4 2 4 state = DQpop(); a a 1 -1 7 2 4 2 4 -1 } Easy fix. return j; ■ Disallow duplicate states on same side of deque. } ■ Keep "existence array" of states currently on each side of deque. 11 12 Build NFSA from RE Parse Tree expr Goal: build NFSA from RE. Parse tree: grammatical structure of string. Parser: construct tree. First challenge: Is expression a legal RE? Example: (a*b + ac)d. term ■ Use context free language to describe RE. fctr term Start : <expr> Start : <expr> ( expr ) fctr <expr> ← <term> <expr> ← <term> b <expr> ← <term> + <expr> <expr> ← <term> + <expr> term + expr ← ← <term> <fctr> <term> <fctr> fctr term term <term> ← <fctr><term> <term> ← <fctr><term> fctr fctr term <fctr> ← c <fctr> ← c a * <fctr> ← c* <fctr> ← c* <fctr> ← (<expr>) <fctr> ← (<expr>) b a fctr <fctr> ← (<expr>)* <fctr> ← (<expr>)* c 13 14 Recursive Descent Parser for RE Recursive Descent Parser for RE Top-down recursive descent parser: Recursive program directly Definition of expression in CFL. expr() void expr() { derived from CFL. ■ <expr> ← <term> term(); ■ <expr> ← <term> + <expr> if (p[j] == '+') { main() j++; expr(); int j = 0; // current index } char p[MAXN + 1]; // RE pattern } Definition of term in CFL. ← void parserror(void) { ■ <term> <fctr> printf("%s is not a RE.\n", p); ■ <term> ← <fctr> <term> exit(EXIT_FAILURE); } term() void term() { int main(void) { fctr(); scanf("%s", p); if ((p[j] == '(') || islower(p[j])) expr(); term(); if (j != strlen(p)) parserror(p); } return 0; } 15 16 Recursive Descent Parser for RE Left Recursive Parsers Definition of factor in CFL. factor() Not as trivial as it first seems. ← ■ <fctr> c void fctr() { badexpr() ■ <fctr> ← c* if (islower(p[j])) { j++; void badexpr() { ■ <fctr> ← (<expr>) Alternate definition of expr in CFL. } if (islower(p[j]) j++; ■ <expr> ← c ■ <fctr> ← (<expr>)* else if (p[j] == '(') { else { ■ <expr> ← <expr> + <term> j++; badexpr(); expr(); if (p[j] == '+') { if (p[j] == ')') j++; j++; else parserror(); term(); } } else parserror(); else parserror(); } if (p[j] == '*') j++; } } Fix: use left recursive CFL. ■ Avoiding infinite recursive loops is fundamental difficulty in recursive-descent parsers. ■ Problem can be more subtle than example above. 17 18 Left Recursive Parsers Building NFSA from RE Example. (a*b + ac)d. Unix Each RE construct corresponds to a piece of NFSA. ■ Corresponds to parse tree. expr() a term() ■ Single character. fctr() – a start accept ( expr() A ■ Concatenation. term() ε – AB A B fctr() a * B term() fctr() b + A ■ OR. ε ε expr() A – A + B term() B fctr() a ε B ε term() fctr() c ) ε term() ■ Closure. A A fctr() d – A* ε 19 20 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a a ε 1 2 1 2 3 ε a a* 21 22 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a ε b a ε ε b 1 2 3 4 5 1 2 3 4 5 ε ε a* b a*b 23 24 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a a c a a c 6 7 6 7 8 9 a ε ε b a ε ε b 1 2 3 4 5 1 2 3 4 5 ε ε a*b a*b 25 26 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Each RE construct corresponds to a piece of NFSA. ■ (a*b + ac)d ■ (a*b + ac)d ac a ε c a ε c 6 7 8 9 ε 6 7 8 9 ε 10 ε 11 a ε ε b a ε ε b ε 1 2 3 4 5 1 2 3 4 5 ε ε a*b a*b + ac 27 28 Building NFSA from RE: Example Building NFSA from RE: Example Each RE construct corresponds to a piece of NFSA. Note. This construction doesn't yield simplest NFSA. ■ (a*b + ac)d ■ (a*b + ac)d a ε c a ε 6 7 8 9 ε ε d 10 ε 11 12 13 ε c a ε ε b ε b d 1 2 3 4 5 a ε (a*b + ac)d 29 30 Building NFSA from RE: Theory Building NFSA from RE: Practice For any RE of length M, our construction produces an NFSA with the To build NFSA, augment parser to generate state table. following properties. ■ For details: Sedgewick, Chapter 21 (Algorithms in C, 2nd edition). ■ No more than two arcs leave any state. – recursive routines return index of start state ε – if two arcs, they both have label – state = next state to be filled in ε ■ No - cycles. – setstate() fills in NFSA table ■ Exactly 1 start state, has 1 incoming arc.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    8 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us