Lexical Analysis Example: Source Code


CS421 COMPILERS AND INTERPRETERS --- Copyright 1994-2017 Zhong Shao, Yale University

Lexical Analysis

• Read the source program and produce a list of tokens ("linear" analysis). The lexical analyzer sits between the source program and the parser: it hands the parser one token at a time, and the parser calls back to "get next token".
• The lexical structure is specified using regular expressions.
• Other secondary tasks: (1) get rid of white spaces (e.g., \t, \n, \sp) and comments; (2) line numbering.

Example: Source Code

A sample toy program:

    (* define valid mutually recursive procedures *)
    let
      function do_nothing1(a: int, b: string) =
        do_nothing2(a+1)
      function do_nothing2(d: int) =
        do_nothing1(d, "str")
    in
      do_nothing1(0, "str2")
    end

What do we really care about here?

The Lexical Structure

• Tokens are the atomic units of a language, and are usually specific strings or instances of classes of strings.

    Token     Sample Values               Informal Description
    LET       let                         keyword LET
    END       end                         keyword END
    PLUS      +
    LPAREN    (
    COLON     :
    STRING    "str"
    RPAREN    )
    INT       49, 48                      integer constants
    ID        do_nothing1, a,             letter followed by letters,
              int, string                 digits, and underscores
    EQ        =
    EOF                                   end of file

Tokens Output after the Lexical Analysis

The output is a list of tokens with their associated values, each followed by its character position in the source:

    LET 51   FUNCTION 56   ID(do_nothing1) 65   LPAREN 76   ID(a) 77   COLON 78
    ID(int) 80   COMMA 83   ID(b) 85   COLON 86   ID(string) 88   RPAREN 94
    EQ 95   ID(do_nothing2) 99   LPAREN 110   ID(a) 111   PLUS 112   INT(1) 113
    RPAREN 114   FUNCTION 117   ID(do_nothing2) 126   LPAREN 137   ID(d) 138
    COLON 139   ID(int) 141   RPAREN 144   EQ 146   ID(do_nothing1) 150
    LPAREN 161   ID(d) 162   COMMA 163   STRING(str) 165   RPAREN 170   IN 173
    ID(do_nothing1) 177   LPAREN 188   INT(0) 189   COMMA 190   STRING(str2) 192
    RPAREN 198   END 200   EOF 203

Lexical Analysis, How?

• First, write down the lexical specification (how is each token defined?), using regular expressions to specify the lexical structure:

    identifier = letter (letter | digit | underscore)*
    letter     = a | ... | z | A | ... | Z
    digit      = 0 | 1 | ... | 9

• Second, based on the above lexical specification, build the lexical analyzer (to recognize tokens), either by hand:

    Regular Expression Spec ==> NFA ==> DFA ==> Transition Table ==> Lexical Analyzer

  or by using lex, the lexical analyzer generator:

    Regular Expression Spec (in lex format) ==> feed to lex ==> Lexical Analyzer
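To make the hand-written route concrete, here is a minimal scanner sketch in the spirit of the token list above. It is not from the slides: the token set, the name next_token, and the driver are invented for illustration. It recognizes identifiers and integer constants by the rules just given, skips white space, and prints each token with its character position.

    /* A minimal hand-written scanner sketch.  Token set and driver are
       invented for illustration; they cover only a fragment of the toy
       language above. */
    #include <ctype.h>
    #include <stdio.h>

    typedef enum { TK_ID, TK_INT, TK_LPAREN, TK_RPAREN,
                   TK_PLUS, TK_COMMA, TK_EOF } Token;
    static const char *name[] = { "ID", "INT", "LPAREN", "RPAREN",
                                  "PLUS", "COMMA", "EOF" };

    static const char *src = "do_nothing2(a+1)";   /* source program   */
    static int pos = 0;                            /* current position */

    /* Return the next token; *start receives its character position. */
    static Token next_token(int *start) {
        while (isspace((unsigned char)src[pos])) pos++;  /* skip white space */
        *start = pos;
        char c = src[pos];
        if (c == '\0') return TK_EOF;
        if (isalpha((unsigned char)c)) {        /* letter (letter|digit|_)* */
            do pos++; while (isalnum((unsigned char)src[pos]) || src[pos] == '_');
            return TK_ID;
        }
        if (isdigit((unsigned char)c)) {        /* digit+ */
            do pos++; while (isdigit((unsigned char)src[pos]));
            return TK_INT;
        }
        pos++;
        switch (c) {
        case '(': return TK_LPAREN;
        case ')': return TK_RPAREN;
        case '+': return TK_PLUS;
        case ',': return TK_COMMA;
        default:  return TK_EOF;   /* a real lexer would report an error here */
        }
    }

    int main(void) {
        Token t; int start;
        do {
            t = next_token(&start);
            printf("%s %d\n", name[t], start);   /* token + position */
        } while (t != TK_EOF);
        return 0;
    }

Run on the fragment do_nothing2(a+1), it prints ID 0, LPAREN 11, ID 12, PLUS 13, INT 14, RPAREN 15, EOF 16.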
Regular Expressions

• Regular expressions are a concise, linguistic characterization of regular languages (regular sets). In a specification such as

    identifier = letter (letter | digit | underscore)*

  "|" reads "or" and "*" reads "0 or more".

• Each regular expression defines a regular language --- a set of strings over some alphabet, such as the ASCII characters; each member of this set is called a sentence, or a word.
• We use regular expressions to define each category of tokens. For example, the identifier expression above specifies the set of strings that are a sequence of letters, digits, and underscores, starting with a letter.

Regular Expressions and Regular Languages

• Given an alphabet Σ, the regular expressions over Σ and their corresponding regular languages are:
  a) ε denotes the empty string; as a regular expression it denotes the language {ε}.
  b) for each a in Σ, a denotes {a} --- a language with one string.
  c) if R denotes L_R and S denotes L_S, then R | S denotes the language L_R ∪ L_S, i.e., { x | x ∈ L_R or x ∈ L_S }.
  d) if R denotes L_R and S denotes L_S, then RS denotes the language L_R L_S, that is, { xy | x ∈ L_R and y ∈ L_S }.
  e) if R denotes L_R, then R* denotes the language (L_R)*, where L* is the union of all L^i (i = 0, 1, ...) and L^i = { x1 x2 ... xi | x1 ∈ L, ..., xi ∈ L }.
  f) if R denotes L_R, then (R) denotes the same language L_R.

Example

    Regular Expression     Explanation
    a*                     0 or more a's
    a+                     1 or more a's
    (a|b)*                 all strings of a's and b's (including ε)
    (aa|ab|ba|bb)*         all strings of a's and b's of even length
    [a-zA-Z]               shorthand for "a|b|...|z|A|...|Z"
    [0-9]                  shorthand for "0|1|2|...|9"
    0([0-9])*0             numbers that start and end with 0
    (ab|aab|b)*(a|aa|ε)    ?
    ?                      all strings that contain foo as a substring

• The following is not a regular expression: a^n b^n (n > 0).

Lexical Specification

• Using regular expressions to specify tokens:

    keyword    = begin | end | if | then | else
    identifier = letter (letter | digit | underscore)*
    integer    = digit+
    relop      = < | <= | = | <> | > | >=
    letter     = a | b | ... | z | A | B | ... | Z
    digit      = 0 | 1 | 2 | ... | 9

• Ambiguity: is "begin" a keyword or an identifier? (A common resolution is sketched below.)
• Next step: construct a token recognizer for the languages given by regular expressions --- by using finite automata! Given a string x, the token recognizer says "yes" if x is a sentence of the specified language and says "no" otherwise.
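On the keyword-versus-identifier ambiguity: a common resolution (assumed here, not spelled out on the slides) is to let the identifier rule do the matching and then consult a keyword table, so a keyword is simply an identifier with a reserved spelling. A small sketch with an invented classify helper:

    #include <stdio.h>
    #include <string.h>

    typedef enum { TK_BEGIN, TK_END, TK_IF, TK_THEN, TK_ELSE, TK_ID } Token;

    /* Keywords from the specification above; everything else matching
       letter (letter | digit | underscore)* is an ordinary identifier. */
    static const struct { const char *text; Token tok; } keywords[] = {
        { "begin", TK_BEGIN }, { "end", TK_END }, { "if", TK_IF },
        { "then",  TK_THEN },  { "else", TK_ELSE },
    };

    /* Match the longest identifier first, then check the keyword table:
       "begin" becomes TK_BEGIN, but "begin1" stays TK_ID. */
    static Token classify(const char *lexeme) {
        for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
            if (strcmp(lexeme, keywords[i].text) == 0)
                return keywords[i].tok;
        return TK_ID;
    }

    int main(void) {
        printf("%d %d\n", classify("begin"), classify("begin1"));  /* 0 5 */
        return 0;
    }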
Transition Diagrams

• A flowchart with states and edges; each edge is labelled with characters; a certain subset of the states are marked as "final states".
• Transition from state to state proceeds along edges according to the next input character. For identifiers: from the start state 0, an edge labelled letter leads to state 1; state 1 loops on letter, digit, and underscore; an edge labelled delimiter leads to the final state 2.
• Every string that ends up at a final state is accepted.
• If we get "stuck" --- there is no transition for a given character --- it is an error.
• Transition diagrams can be easily translated to programs using case statements (in C).

Transition Diagrams (cont'd)

The token recognizer (for identifiers) based on the transition diagram:

    state0: c = getchar();
            if (isalpha(c)) goto state1;
            error();
            ...
    state1: c = getchar();
            if (isalpha(c) || isdigit(c) || isunderscore(c)) goto state1;
            if (c == ',' || ... || c == ')') goto state2;
            error();
            ...
    state2: ungetc(c, stdin);    /* retract current char */
            return(ID, ... the current identifier ...);

Next: 1. finite automata are generalized transition diagrams! 2. how to build finite automata from regular expressions?

Finite Automata

• Finite automata are similar to transition diagrams; they have states and labelled edges; there is one unique start state and one or more final states.
• Nondeterministic Finite Automata (NFA):
  a) edges can be labelled with ε (these edges are called ε-transitions);
  b) some character can label 2 or more edges out of the same state.
• Deterministic Finite Automata (DFA):
  a) no edges are labelled with ε;
  b) each character can label at most one edge out of the same state.
• An NFA or DFA accepts a string x if there exists a path from the start state to a final state labelled with the characters of x. In an NFA there may be multiple paths; in a DFA there is one unique path.

Example: NFA

An NFA that accepts (a|b)*abb: states 0, 1, 2, 3, with start state 0 and final state 3; state 0 loops on both a and b, and the path 0 --a--> 1 --b--> 2 --b--> 3 spells abb.

There are many possible moves --- to accept a string, we only need one sequence of moves that leads to a final state.

    input string:                     a a b b
    one successful sequence:        0 0 1 2 3
    another, unsuccessful sequence: 0 0 0 0 0

Example: DFA

A DFA that accepts (a|b)*abb: states 0, 1, 2, 3, with start state 0 and final state 3; a moves every state to state 1, while b takes 0 to 0, 1 to 2, 2 to 3, and 3 to 0.

There is only one possible sequence of moves --- either it leads to a final state and the string is accepted, or the input string is rejected.

    input string:              a a b b
    the successful sequence: 0 1 1 2 3

Transition Table

• Finite automata can also be represented using transition tables. For the two automata above:

  For an NFA, each entry is a set of states:

    STATE    a        b
    0        {0,1}    {0}
    1        -        {2}
    2        -        {3}
    3        -        -

  For a DFA, each entry is a unique state:

    STATE    a    b
    0        1    0
    1        1    2
    2        1    3
    3        1    0

NFA with ε-transitions

• An NFA can have ε-transitions --- edges labelled with ε. For example, an NFA whose start state has two ε-edges, one into a branch (states 1 and 2) that accepts aa* and one into a branch (states 3 and 4) that accepts bb*, accepts the regular language denoted by (aa*|bb*).

Regular Expressions -> NFA

• How do we construct an NFA (with ε-transitions) from a regular expression?
• Algorithm: apply the following construction rules, using unique names for all the states (important invariant: always one final state!).

1. Basic construction:
   • ε : an initial state i with an ε-edge to a final state f.
   • a : an initial state i with an edge labelled a to a final state f.

2. "Inductive" construction (let N1 be the NFA for R1 and N2 the NFA for R2):
   • R1 | R2 : add a new initial state i and a new, unique final state f; i has ε-edges to the initial states of N1 and N2, and the final states of N1 and N2 have ε-edges to f.
   • R1 R2 : merge the final state of N1 and the initial state of N2; the initial state is that of N1, and the final state is that of N2.

Example: RE -> NFA

Converting the regular expression (a|b)*abb.
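The transition table translates directly into code, which is the last step of the Regular Expression Spec ==> NFA ==> DFA ==> Transition Table ==> Lexical Analyzer pipeline. The sketch below is not from the slides; it hard-codes the DFA table for (a|b)*abb given above and assumes the input alphabet is just a and b.

    #include <stdio.h>

    /* Table-driven recognizer for (a|b)*abb, using the DFA transition
       table above: rows are states 0-3, columns are inputs a and b;
       state 3 is the final state. */
    static const int delta[4][2] = {
        /* a  b */
        {  1, 0 },    /* state 0 */
        {  1, 2 },    /* state 1 */
        {  1, 3 },    /* state 2 */
        {  1, 0 },    /* state 3 */
    };

    static int accepts(const char *x) {
        int state = 0;
        for (; *x; x++) {
            if (*x != 'a' && *x != 'b') return 0;  /* not in the alphabet */
            state = delta[state][*x - 'a'];        /* one unique move     */
        }
        return state == 3;
    }

    int main(void) {
        printf("%d %d\n", accepts("aabb"), accepts("abab"));  /* prints: 1 0 */
        return 0;
    }

accepts("aabb") follows the unique path 0 1 1 2 3 and returns 1; accepts("abab") ends in state 2 and returns 0.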