Lecture 18: Theory of Computation Regular Expressions and Dfas

Introduction to Theoretical CS Lecture 18: Theory of Computation Two fundamental questions. ! What can a computer do? ! What can a computer do with limited resources? General approach. Pentium IV running Linux kernel 2.4.22 ! Don't talk about specific machines or problems. ! Consider minimal abstract machines. ! Consider general classes of problems. COS126: General Computer Science • http://www.cs.Princeton.EDU/~cos126 2 Why Learn Theory In theory . Regular Expressions and DFAs ! Deeper understanding of what is a computer and computing. ! Foundation of all modern computers. ! Pure science. ! Philosophical implications. a* | (a*ba*ba*ba*)* In practice . ! Web search: theory of pattern matching. ! Sequential circuits: theory of finite state automata. a a a ! Compilers: theory of context free grammars. b b ! Cryptography: theory of computational complexity. 0 1 2 ! Data compression: theory of information. b "In theory there is no difference between theory and practice. In practice there is." -Yogi Berra 3 4 Pattern Matching Applications Regular Expressions: Basic Operations Test if a string matches some pattern. Regular expression. Notation to specify a set of strings. ! Process natural language. ! Scan for virus signatures. ! Search for information using Google. Operation Regular Expression Yes No ! Access information in digital libraries. ! Retrieve information from Lexis/Nexis. Concatenation aabaab aabaab every other string ! Search-and-replace in a word processors. cumulus succubus Wildcard .u.u.u. ! Filter text (spam, NetNanny, Carnivore, malware). jugulum tumultuous ! Validate data-entry fields (dates, email, URL, credit card). aa Union aa | baab baab every other string ! Search for markers in human genome using PROSITE patterns. aa ab Closure ab*a abbba ababa Parse text files. aaaab ! Compile a Java program. a(a|b)aab every other string abaab ! Crawl and index the Web. Parentheses a ! ! Read in data stored in TOY input file format. (ab)*a ababababa abbbaa ! Automatically create Java documentation from Javadoc comments. 5 6 Regular Expressions: Examples Generalized Regular Expressions Regular expression. Notation is surprisingly expressive. Regular expressions are a standard programmer's tool. ! Built in to Java, Perl, Unix, Python, . ! Additional operations typically added for convenience. ! Ex: [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*. Regular Expression Yes No .* spb .* raspberry subspace contains the trigraph spb crispbread subspecies Operation Regular Expression Yes No a* | (a*ba*ba*ba*)* bbb b aaa bb multiple of three b’s abcde ade bbbaababbaa baabbbaa One or more a(bc)+de abcbcde bcde .*0.... 10000 111111111 capitalized camelCase fifth to last digit is 0 98701234 403982772 Character classes [A-Za-z][a-z]* Word 4illegal gcg (cgg|agg)* ctg gcgctg gcgcgg 08540-1321 111111111 Exactly k [0-9]{5}-[0-9]{4} gcgcggctg cggcggcggctg 19072-5541 166-54-111 fragile X syndrome indicator gcgcggaggctg gcgcaggctg Negations [^aeiou]{6} rhythm decade 7 8 Regular Expressions in Java Solving the Pattern Match Problem Validity checking. Is input in the set described by the re? Regular expressions are a concise way to describe patterns. ! How would you implement String.matches ? ! Hardware: build a deterministic finite state automaton (DFA). public class Validate { public static void main(String[] args) { ! Software: simulate a DFA. String re = args[0]; String input = args[1]; DFA: simple machine that solves the pattern match problem. System.out.println(input.matches(re)); } ! Different machine for each pattern. } powerful string library method ! Accepts or rejects string specified on input tape. ! Focus on true or false questions for simplicity. need help solving crosswords? % java Validate "..oo..oo." bloodroot true legal Java identifier % java Validate "[$_A-Za-z][$_A-Za-z0-9]*" ident123 true valid email address (simplified) % java Validate "[a-z]+@([a-z]+\\.)+(edu|com)" [email protected] true need quotes to "escape" the shell 9 10 Deterministic Finite State Automaton (DFA) Theory of DFAs and REs Simple machine with N states. RE. Concise way to describe a set of strings. ! Begin in start state. DFA. Machine to recognize whether a given string is in a given set. ! Read first input symbol. ! Move to new state, depending on current state and input symbol. Duality: for any DFA, there exists a regular expression to describe ! Repeat until last input symbol read. the same set of strings; for any regular expression, there exists a ! Accept or reject string depending on last state. DFA that recognizes the same set. a a a a a a b b Y N N a* | (a*ba*ba*ba*)* b b b DFA Y N N multiple of 3 b's multiple of 3 b's b Practical consequence of duality proof: to match regular expression Input b b a a b b a b b patterns, (i) build DFA and (ii) simulate DFA on input string. 11 12 Implementing a Pattern Matcher Application: Harvester Problem: given a regular expression, create program that tests Harvest information from input stream. whether given input is in set of strings described. ! Harvest patterns from DNA. Step 1: build the DFA. % java Harvester "gcg(cgg|agg)*ctg" chromosomeX.txt ! A compiler! gcgcggcggcggcggcggctg ! See COS 226 or COS 320. gcgctg gcgctg Step 2: simulate it with given input. Easy. gcgcggcggcggaggcggaggcggctg ! Harvest email addresses from web for spam campaign. State state = start; while (!CharStdIn.isEmpty()) { % java Harvester "[a-z]+@([a-z]+\\.)+(edu|com|net|tv)" char c = CharStdIn.readChar(); http://www.princeton.edu/~cos126 state = state.next(c); [email protected] } email validator (simplified) [email protected] System.out.println(state.accept()); [email protected] 13 14 Application: Harvester Application: Parsing a Data File Harvest information from input stream. Ex: parsing an NCBI genome data file. ! Use Pattern data type to compile regular expression to NFA. LOCUS AC146846 128142 bp DNA linear HTG 13-NOV-2003 ! Use Matcher data type to simulate NFA. DEFINITION Ornithorhynchus anatinus clone CLM1-393H9, ACCESSION AC146846 ! (NFA is fancy but equivalent variety of DFA) VERSION AC146846.2 GI:38304214 KEYWORDS HTG; HTGS_PHASE2; HTGS_DRAFT. SOURCE Ornithorhynchus anatinus (platypus) import java.util.regex.Pattern; ORIGIN 1 tgtatttcat ttgaccgtgc tgttttttcc cggtttttca gtacggtgtt agggagccac import java.util.regex.Matcher; 61 gtgattctgt ttgttttatg ctgccgaata gctgctcgat gaatctctgc atagacagct // a comment public class Harvester { 121 gccgcaggga gaaatgacca gtttgtgatg acaaaatgta ggaaagctgt ttcttcataa ... public static void main(String[] args) { 128101 ggaaatgcga cccccacgct aatgtacagc ttctttagat tg String re = args[0]; // In in = new In(args[1]); String input = in.readAll(); String re = "[ ]*[0-9]+([actg ]*).*"; Pattern pattern = Pattern.compile(re); Pattern pattern = Pattern.compile(re); Matcher matcher = pattern.matcher(input); In in = new In(filename); while (matcher.find()) { String line; System.out.println(matcher.group()); while ((line = in.readLine()) != null) { } Matcher matcher = pattern.matcher(line); } if (matcher.find()) { extract the RE part in parentheses } String s = matcher.group(1).replaceAll(" ", ""); // do something with s } replace this RE with this string } 15 16 Limitations of DFA Fundamental Questions No DFA can recognize the language of all bit strings with an equal Which languages CANNOT be described by any RE? number of 0's and 1's. ! Bit strings with equal number of 0s and 1s. ! Decimal strings that represent prime numbers. ! Suppose an N-state DFA can recognize this language. ! Genomic strings that are Watson-Crick complemented palindromes. ! Consider following input: 0000000011111111 ! Many more. N+1 0's N+1 1's How can we extend REs to describe richer sets of strings? ! DFA must accept this string. ! Context free grammar (e.g., Java). ! Some state x is revisited during first N+1 0's since only N states. 0000000011111111 Reference: http://java.sun.com/docs/books/jls/second_edition/html/syntax.doc.html x x Q. How can we make simple machines more powerful? ! Machine would accept same string without intervening 0's. Q. Are there any limits on what kinds of problems machines can solve? 000011111111 ! This string doesn't have an equal number of 0's and 1's. 17 18 Summary Programmer. Turing Machines ! Regular expressions are a powerful pattern matching tool. ! Implement regular expressions with finite state machines. Theoretician. ! Regular expression is a compact description of a set of strings. ! DFA is an abstract machine that solves pattern match problem for Challenge: Design simplest machine that is regular expressions. "as powerful" as conventional computers. ! DFAs and regular expressions have limitations. Variations ! Yes (accept) and No (reject) states sometimes drawn differently ! Terminology: Deterministic Finite State Automaton (DFA), Finite State Machine (FSM), Finite State Automaton (FSA) are the same Alan Turing (1912-1954) ! DFA’s can have output, specified on the arcs or in the states – These may not have explicit Yes and No states 19 20 Turing Machine: Components Turing Machine: Fetch, Execute Alan Turing sought the most primitive model of a computing device. States. ! Finite number of possible machine configurations. Tape. ! Determines what machine does and which way tape head moves. ! Stores input, output, and intermediate results. ! One arbitrarily long strip, divided into cells. State transition diagram. tape head ! Finite alphabet of symbols. ! Ex. if in state 2 and input symbol is 1 then: overwrite the 1 with x, tape move to state 0, move tape head to left. 2 Tape head. R 1:x ! Points to one cell of tape. #:# 0:x ! 0 1 Reads a symbol from active cell. 3 5 #:# #:# ! Writes a symbol to active cell. L R Y N ! Moves left or right one cell at a time. 1:x 0:x 4 #:# R Before … # # x x x 1 1 0 # # … 22 23 Turing Machine: Fetch, Execute Turing Machine: Initialization and Termination States. Initialization. ! Finite number of possible machine configurations. ! Set input on some portion of tape. ! Determines what machine does and which way tape head moves. ! Set tape head. … # # 0 0 1 1 1 0 # # … ! Set initial state. State transition diagram.

Lecture 18: Theory of Computation Regular Expressions and Dfas

Use Perl Regular Expressions in SAS® Shuguang Zhang, WRDS, Philadelphia, PA

Perl Regular Expressions Tip Sheet Functions and Call Routines

Unicode Regular Expressions Technical Reports

Regular Expressions with a Brief Intro to FSM

Regular Expressions

Context-Free Grammar for the Syntax of Regular Expression Over the ASCII

PHP Regular Expressions

A Quick Guide to PERL Regular Expressions Second Edition © 2006 Transliteration: Translate Operator Tr/// EXPR =~ Tr/SEARCHLIST/REPLACELIST/Cds

CSCI 3434: Theory of Computation Lecture 4: Regular Expressions

10 Patterns, Automata, and Regular Expressions

Regular Expressions for Perl, C, PHP, Python, Java, and .NET

Regular Languages and Finite Automata for Part IA of the Computer Science Tripos