Introduction to Theoretical CS Regular Expressions

Introduction to Theoretical CS Regular Expressions

Why Learn Theory? Introduction to Theoretical CS In theory … • Deeper understanding of what is a computer and computing. • Foundation of all modern computers. • Pure science. • Philosophical implications. Fundamental questions: Q. What can a computer do? In practice … Q. What can a computer do with limited resources? • Web search: theory of pattern matching. • Sequential circuits: theory of finite state automata. General approach. • Compilers: theory of context free grammars. • Don't talk about specific machines or problems. • Cryptography: theory of computational complexity. • Consider minimal abstract machines. • Data compression: theory of information. • Consider general classes of problems. Why Learn Theory? In theory … Regular Expressions • Deeper understanding of what is a computer and computing. • Foundation of all modern computers. • Pure science. • Philosophical implications. In practice … • Web search: theory of pattern matching. • Sequential circuits: theory of finite state automata. • Compilers: theory of context free grammars. • Cryptography: theory of computational complexity. • Data compression: theory of information. “ In theory there is no difference between theory and practice. In practice there is.” – Yogi Berra Pattern Matching Pattern Matching Application Pattern matching problem. Is a given string in a specified set of strings? PROSITE. Huge database of protein families and domains. Ex. [genomics] Q. How to describe a protein motif? • Fragile X syndrome is a common cause of mental retardation. • Human genome contains triplet repeats of CGG or AGG, Ex. [signature of the C2H2-type zinc finger domain] bracketed by GCG at the beginning and CTG at the end. 1. C 2. Between 2 and 4 amino acids. Number of repeats is variable, and correlated with syndrome. • 3. C 4. 3 more amino acids. Specified set of strings: “all strings of G, C, T, A having some occurrence 5. One of the following amino acids: LIVMFYWCX. of GCG followed by any number of CGG or AGG triplets, followed by CTG” 6. 8 more amino acids. Q: “Is this string in the set?” 7. H GCGGCGTGTGTGCGAGAGAGTGGGTTTAAAGCTGGCGCGGAGGCGGCTGGCGCGGAGGCTG 8. Between 3 and 5 more amino acids. 9. H A: Yes GCGCGGAGGCGGCTG First step: A. Use a regular expression. Regular expression. A formal notation for specifying a set of strings. CAASCGGPYACGGWAGYHAGWH Pattern Matching Applications Regular Expressions: Basic Operations Test if a string matches some pattern. Regular expression. Notation to specify a set of strings. • Process natural language. • Scan for virus signatures. “in specified set” “not in specified set” • Access information in digital libraries. • Search-and-replace in a word processors. operation regular expression matches does not match Filter text (spam, NetNanny, ads, Carnivore, malware). • concatenation aabaab aabaab every other string • Validate data-entry fields (dates, email, URL, credit card). cumulus succubus Search for markers in human genome using PROSITE patterns. .u.u.u. • wildcard jugulum tumultuous aa aa | baab Parse text files. union baab every other string • Compile a Java program. aa ab ab*a • Crawl and index the Web. closure abbba ababa Read in data stored in TOY input file format. • aaaab a(a|b)aab • Automatically create Java documentation from Javadoc comments. abaab every other string parentheses a aa (ab)*a ababababa abbba Regular Expressions: Examples Generalized Regular Expressions Regular expression. Notation is surprisingly expressive. Regular expressions are a standard programmer's tool. • Built in to Java, Perl, Unix, Python, …. • Additional operations typically added for convenience. – Ex 1: [a-e]+ is shorthand for (a|b|c|d|e)(a|b|c|d|e)*. regular expression matches does not match – Ex 2: \s is shorthand for “any whitespace character” (space, tab, ...). .*spb.* raspberry subspace contains the trigraph spb crispbread subspecies operation regular expression matches does not match a* | (a*ba*ba*ba*)* bbb b aaa bb abcde ade multiple of three b’s a(bc)+de bbbaababbaa baabbbaa one or more abcbcde bcde .*0.... lowercase camelCase 1000234 111111111 character class [A-Za-z][a-z]* fifth to last digit is 0 98701234 403982772 Capitalized 4illegal 08540-1321 111111111 [0-9]{5}-[0-9]{4} gcg(cgg|agg)*ctg gcgctg gcgcgg exactly k 19072-5541 166-54-1111 gcgcggctg cggcggcggctg fragile X syndrome indicator gcgcggaggctg gcgcaggctg negation [^aeiou]{6} rhythm decade Regular Expression Challenge 1 Regular Expression Challenge 2 Q. Consider the RE Q. Give an RE that describes the following set of strings: a*bb(ab|ba)* Which of the following strings match (is in the set it describes)? • characters are A, C, T or G • starts with ATG a. abb • length is a multiple of 3 b. abba • ends with TAG, TAA, or TTG c. aaba d. bbbaab e. cbb f. bbababbab Describing a Pattern REs in Java PROSITE. Huge database of protein families and domains. public class String (Java's String library) Q.Q. How to describe a protein motif? does this string match the given boolean matches(String re) regular expression? Ex. [signature of the C H -type zinc finger domain] replace all occurrences of regular 2 2 String replaceAll(String re, String str) 1. C expression with the replacement string return the index of the first occurrence 2. Between 2 and 4 amino acids. int indexOf(String r, int from) 3. C of the string r after the index from 4. 3 more amino acids. split the string around matches of the String[] split(String re) 5. One of the following amino acids: LIVMFYWCX. given regular expression 6. 8 more amino acids. 7. H 8. Between 3 and 5 more amino acids. 9. H String re = "C.{2,4}C...[LIVMFYWC].{8}H.{3,5}H "; String input = "CAASCGGPYACGGAAGYHAGAH"; boolean test = input.matches(re); A. C.{2,4}C...[LIVMFYWC].{8}H.{3,5}H is the input string in the set described by the RE? CAASCGGPYACGGWAGYHAGWH REs in Java REs in Java Validity checking. Is input in the set described by the re? public class String (Java's String library) does this string match the given public class Validate boolean matches(String re) { regular expression? public static void main(String[] args) { replace all occurrences of regular String replaceAll(String re, String str) String re = args[0]; expression with the replacement string String input = args[1]; return the index of the first occurrence StdOut.println(input.matches(re)); int indexOf(String r, int from) } of the string r after the index from } powerful string library method split the string around matches of the String[] split(String re) given regular expression C2H2 type zinc finger domain % java Validate "C.{2,4}C...[LIVMFYWC].{8}H.{3,5}H" CAASCGGPYACGGAAGYHAGAH RE that matches any sequence of true whitespace characters (at least 1). legal Java identifier % java Validate "[$_A-Za-z][$_A-Za-z0-9]*" ident123 Extra \ distinguishes from the string \s+ true String s = StdIn.readAll(); valid email address (simplified) s = s.replaceAll("\\s+", " "); % java Validate "[a-z]+@([a-z]+\.)+(edu|com)" [email protected] true need quotes to "escape" the shell replace each sequence of at least one whitespace character with a single space REs in Java public class String (Java's String library) DFAs does this string match the given boolean matches(String re) regular expression? replace all occurrences of regular String replaceAll(String re, String str) expression with the replacement string return the index of the first occurrence int indexOf(String r, int from) of the string r after the index from split the string around matches of the String[] split(String re) given regular expression String s = StdIn.readAll(); String[] words = s.split("\\s+"); create an array of the words in StdIn Solving the Pattern Match Problem Deterministic Finite State Automaton (DFA) Regular expressions are a concise way to describe patterns. Simple machine with N states. • How would you implement the method matches() ? • Begin in start state. • Hardware: build a deterministic finite state automaton (DFA). • Read first input symbol. • Software: simulate a DFA. • Move to new state, depending on current state and input symbol. • Repeat until last input symbol read. DFA: simple machine that solves a pattern match problem. • Accept input string if last state is labeled Y. • Different machine for each pattern. • Accepts or rejects string specified on input tape. • Focus on true or false questions for simplicity. a a a b b DFA Y N N b Input b b a a b b a b b DFA and RE Duality DFA Challenge 1 RE. Concise way to describe a set of strings. Q. Consider this DFA: DFA. Machine to recognize whether a given string is in a given set. Duality. • For any DFA, there exists a RE that describes the same set of strings. Which of the following sets of strings does it recognize? • For any RE, there exists a DFA that recognizes the same set. a. Bitstrings with at least one 1 b. Bitstrings with an equal number of occurrences of 01 and 10 a* | (a*ba*ba*ba*)* c. Bitstrings with more 1s than 0s d. Bitstrings with an equal number of occurrences of 0 and 1 e. Bitstrings that end in 1 multiple of 3 b's multiple of 3 b's Practical consequence of duality proof: to match RE • build DFA • simulate DFA on input string. DFA Challenge 2 Implementing a Pattern Matcher Q. Consider this DFA: Problem. Given a RE, create program that tests whether given input is in set of strings described. Step 1. Build the DFA. It is actually better to use an NFA, an equivalent (but Which of the following sets of strings does it recognize? • A compiler! more efficient) representation of a DFA. We ignore • See COS 226 or COS 320. that distinction in this lecture. a. Bitstrings with at least one 1 b. Bitstrings with an equal number of occurrences of 01 and 10 Step 2. Simulate it with given input. c.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    11 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us