Computer Science 14. Introduction to Theoretical CS 14
Total Page:16
File Type:pdf, Size:1020Kb
COMPUTER SCIENCE COMPUTER SCIENCE SEDGEWICK/WAYNE SEDGEWICK/WAYNE PART II: ALGORITHMS, MACHINES, and THEORY PART II: ALGORITHMS, MACHINES, and THEORY Computer Science 14. Introduction to Theoretical CS 14. Introduction to •Overview Computer •Regular expressions Science Theoretical CS •DFAs An Interdisciplinary Approach •Applications R O B E R T S E D G E W I C K Section 7.2 KEVIN WAYNE •Limitations http://introcs.cs.princeton.edu CS.17.A.Theory.Overview Introduction to theoretical computer science Why study theory? Fundamental questions In theory... • What can a computer do? • Deeper understanding of computation. • What can a computer do with limited resources? • Foundation of all modern computers. • Pure science. • Philosophical implications. General approach • Don't talk about specific machines or problems. NO • Consider minimal abstract machines. YES In practice... • Consider general classes of problems. • Web search: theory of pattern matching. • Sequential circuits: theory of finite state automata. • Compilers: theory of context free grammars. • Cryptography: theory of computational complexity. " In theory there is no • Data compression: theory of information. difference • ... between theory and Surprising outcome. Sweeping and relevant statements about all computers. — Yogi Berra 3 4 Abstract machines COMPUTER SCIENCE SEDGEWICK/WAYNE PART I: PROGRAMMING IN JAVA Abstract machine • Mathematical model of computation. • Each machine defined by specific rules for transforming input to output. • This lecture: Deterministic finite automata (DFAs). Formal language madam im adam a man a plan a canal panama • A set of strings. able i was ere i saw elba evil olive • Each defined by specific rules that characterize it. go hang a salami im a lasagna hog pull up if i pull up • This lecture: Regular expressions (REs). ... Questions for this lecture • Is a given string in the language defined by a given RE, or not? • Can a DFA help answer this question? 5 CS.17.A.Theory.Overview COMPUTER SCIENCE Pattern matching SEDGEWICK/WAYNE PART II: ALGORITHMS, MACHINES, and THEORY Pattern matching problem. Is a given string an element of a given set of strings? Example 1 (from computational biochemistry) An amino acid is represented by one of the characters CAVLIMCRKHDENQSTYFWP. 14. Introduction to TheoreticaalTheoretical CS CS A protein is a string of amino acids. A C2H2-type zinc finger domain signature is •Overview • C followed by 2, 3, or 4 amino acids, followed by •Regular expressions • C followed by 3 amino acids, followed by •DFAs • L, I, V, M, F, Y, W, C, or X followed by 8 amino acids, followed by • H followed by 3, 4, or 5 amino acids, followed by H. •Applications •Limitations Q. Is this protein in the C2H2-type zinc finger domain? CAASCGGPYACGGWAGYHAGWH A. Yes. C H CS.14.B.Theory.REs 3 C 3 Y 8 H 3 8 Pattern matching Pattern matching Example 2 (from commercial computing) Example 3 (from genomics) An e-mail address is • A sequence of letters, followed by A nucleic acid is represented by one of the letters a, c, t, or g. • the character "@", followed by • followed by a nonempty sequence of lowercase letters, followed by the character "." A genome is a string of nucleic acids. • [any number of occurrences of the previous pattern] A Fragile X Syndrome pattern is a genome having an occurrence of gcg, followed • "edu" or "com" (others omitted for brevity). by any number of cgg or agg triplets, followed by ctg. Q. Which of the following are e-mail addresses? A. Note. The number of triplets correlates with Fragile X Syndrome, a common cause of mental retardation. [email protected] ✓ not an e-mail address ✗ Q. Does this genome contain a such a pattern? [email protected] ✓ gcggcgtgtgtgcgagagagtgggtttaaagctggcgcggaggcggctggcgcggaggctg eve@airport ✗ A. Yes. gcgcggaggcggctg Oops, need to fix description [email protected] ✗ gcg start mark ctg end mark Challenge. Develop a precise description of the set of strings that are legal e-mail addresses. sequence of cgg and agg 9 triplets 10 Regular expressions More examples of regular expressions A regular expression (RE) is a notation for specifying a set of strings ( a formal language). The notation is surprisingly expressive. regular expression matches does not match matches does not match operation example RE (IN the set) (NOT in the set) An RE is either .*spb.* raspberry subspace • The empty set concatenation aabaab aabaab every other string contains the trigraph spb crispbread subspecies The empty string • cumulus succubus wildcard .u.u.u. bbb b • A single character or wildcard symbol jugulum tumultuous a* | (a*ba*ba*ba*)* aa aaa bb • An RE enclosed in parentheses union aa | baab every other string multiple of three b’s baab bbbaababbaa baabbbaa The concatenation of two or more REs • aa ab closure ab*a • The union of two or more REs abbba ababa .*0.... 1000234 111111111 aaaab 98701234 403982772 • The closure of an RE a(a|b)aab every other string fifth to last digit is 0 abaab (any number of occurrences) parentheses a aa (ab)*a ...gcgctg... gcgcgg ababababa abbba .*gcg(cgg|agg)*ctg.* ...gcgcggctg... cggcggcggctg fragile X syndrome pattern ...gcgcggaggctg... gcgcaggctg 11 12 Generalized regular expressions Example of describing a pattern with a generalized RE Additional operations further extend the utility of REs. A C2H2-type zinc finger domain signature is operation example RE matches does not match • C followed by 2, 3, or 4 amino acids, followed by abcde ade C followed by 3 amino acids, followed by one or more a(bc)+de • bcde abcbcde • L, I, V, M, F, Y, W, C, or X followed by 8 amino acids, followed lowercase camelCase character class [A-Za-z][a-z]* by Capitalized 4illegal • H followed by 3, 4, or 5 amino acids, followed by 08540-1321 111111111 exactly j [0-9]{5}-[0-9]{4} 19072-5541 166-54-1111 abcb ab between j and k a.{2,4}b abcbcb aaaaaab Q. Give a generalized RE for all such signatures. negation [^aeiou]{6} rhythm decade A. C.{2,4}C...[LIVMFYWCX].{8}H.{3,5}H CAASCGGPYACGGWAGYHAGWH any whitespace char whitespace \s every other character (space, tab, newline...) "Wildcard" matches any of the letters C 3 C 3 Y 8 H 3 H CAVLIMCRKHDENQSTYFWP Note. These operations are all shorthand. RE: (a|b|c|d|e)(a|b|c|d|e)* They are very useful but not essential. shorthand: (a-e)+ 13 14 Example of a real-world RE application: PROSITE Another example of describing a pattern with a generalized RE An e-mail address is • A sequence of letters, followed by • the character "@", followed by • the character "." , followed by a nonempty sequence of lowercase letters, followed by • [any number of occurrences of the previous pattern] • "edu" or "com" (others omitted for brevity). Q. Give a generalized RE for e-mail addresses. A. [a-z]+@([a-z]+\.)+(edu|com) Type an RE here Exercise. Extend to handle [email protected], more suffixes such as .org, and any other extensions you can think of. Next. Determining whether a given string matches a given RE. 15 16 Pop quiz 1 on REs Pop quiz 2 on REs Q. Which of the following strings match the RE a*bb(ab|ba)* ? Q. Give an RE for genes • Characters are a, c, t or g. is in the set it describes • Starts with atg (a start codon). • Length is a multiple of 3. • Ends with tag, taa, or ttg (a stop codon). 1. abb 2. aaba 3. abba 4. bbbaab 5. cbb 6. bbababbab 17 18 COMPUTER SCIENCE COMPUTER SCIENCE SEDGEWICK/WAYNE SEDGEWICK/WAYNE PART I: PROGRAMMING IN JAVA PART II: ALGORITHMS, MACHINES, and THEORY Image sources http://en.wikipedia.org/wiki/Homology_modeling#/media/File:DHRS7B_homology_model.png 14. Introduction to TheoreticaalTheoretical CS CS •Overview •Regular expressions •DFAs •Applications •Limitations CS.14.B.Theory.REs CS.14.C.Theory.DFAs Deterministic finite automata (DFA) Deterministic finite automata details and example A DFA is an abstract machine that solves a pattern matching problem. AA DFADFA isis anan abstractabstract machinemachine withwith aa finitefinite numbernumber states,states, eacheach labeledlabeled YY oror N,N, andand • A string is specified on an input tape (no limit on its length). transitionstransitions betweenbetween states,states, eacheach labeledlabeled withwith aa symbol.symbol. OneOne statestate isis thethe startstart state.state. • The DFA reads each character on input tape once, moving left to right. • •BeginBegin in in the the start start state. state (denoted by an arrow from nowhere). • The DFA lights "YES" if it recognizes the string, "NO" otherwise. • •ReadRead an an input input symbol symbol and and move move to to the the indicated indicated state. state. Each DFA defines a language (the set of strings that it recognizes). • •RepeatRepeat until until the the last last input input symbol symbol has has been been read. read. • •TurnTurn on on the the "YES" "YES" or or "NO" "NO" light light according according to to the the label label on on the the current current state. state. NO NO a a a Does this DFA recognize Y b N b N this string? b YES YES b b a a b b a b b b b a a b b a b b 21 22 Deterministic finite automata details and example Simulating the operation of a DFA public class DFA a 0 a 1 a 2 A DFA is an abstract machine with a finite number states, each labeled Y or N, and symbol table to map { chars a, b, ... to next Y b N b N private int start; transitions between states, each labeled with a symbol. One state is the start state. state 0, 1, ... b private boolean[] action; action[] next[] • Begin in the start state. private ST<Character, Integer>[] next; a b • Read an input symbol and move to the indicated state.