Syntax and Parsing
COMS W4115
Prof. Stephen A. Edwards
Spring 2002
Columbia University
Department of Computer Science

Last Time

Administrivia
Class Project
Types of Programming Languages: Imperative, Object-Oriented, Functional, Logic, Dataflow

This Time

Interpreters and Compilers
Structure of a Compiler
Lexical Analysis
Syntax
Parsing

The Compilation Process

Interpreters

    Source Program
          ↓
    Input → Interpreter → Output

Compilers

    Source Program
          ↓
       Compiler
          ↓
    Input → Executable Program → Output

Structure of a Compiler

    ↓ Program Text
    Lexer
    ↓ Token Stream
    Parser
    ↓ Abstract Syntax Tree
    Static semantics (type checking)
    ↓ Annotated AST
    Translation to intermediate form
    ↓ Three-address code
    Code generation
    ↓ Assembly Code

Compiling a Simple Program

    int gcd(int a, int b)
    {
      while (a != b) {
        if (a > b) a -= b;
        else b -= a;
      }
      return a;
    }

What the Compiler Sees

The same program as a character sequence (sp is a space, nl is a newline):

    i n t sp g c d ( i n t sp a , sp i n t sp b ) nl
    { nl
    sp sp w h i l e sp ( a sp ! = sp b ) sp { nl
    sp sp sp sp i f sp ( a sp > sp b ) sp a sp - = sp b ; nl
    sp sp sp sp e l s e sp b sp - = sp a ; nl
    sp sp } nl
    sp sp r e t u r n sp a ; nl
    } nl

Text file is a sequence of characters

After Lexical Analysis

    int gcd ( int a , int b ) { while ( a != b ) { if ( a > b ) a -= b ;
    else b -= a ; } return a ; }

A stream of tokens. Whitespace, comments removed.

After Parsing

    func
      int
      gcd
      args
        arg: int a
        arg: int b
      seq
        while
          != (a, b)
          if
            > (a, b)
            -= (a, b)
            -= (b, a)
        return
          a

Abstract syntax tree built from parsing rules.

After Semantic Analysis

The same tree, with each reference to a and b linked to its entry in the symbol table:

    Symbol Table:
      int a
      int b

Types checked; references to symbols resolved
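The step from "What the Compiler Sees" to "After Lexical Analysis" can be sketched in a few lines of Python. This is only an illustration, not the scanner used in the course: the rule names `WS`, `ID`, and `OP`, the function `tokenize`, and the choice of Python's `re` module are all assumptions made here, and the rules cover just the characters that appear in the gcd program.

```python
import re

# Hypothetical token rules for the gcd example only (names are illustrative).
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)                       # whitespace: matched, then discarded
  | (?P<ID>[A-Za-z_][A-Za-z_0-9]*)    # identifiers and keywords
  | (?P<OP>!=|-=|[(){},;>])           # two-character operators listed first
""", re.VERBOSE)


def tokenize(text):
    """Return the token texts found in `text`, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":       # keep everything except whitespace
            tokens.append(m.group())
        pos = m.end()
    return tokens


SOURCE = """int gcd(int a, int b)
{
  while (a != b) {
    if (a > b) a -= b;
    else b -= a;
  }
  return a;
}
"""

print(" ".join(tokenize(SOURCE)))
```

Running it prints the token stream shown on the "After Lexical Analysis" slide, one space between tokens. Keywords such as `while` come out of the `ID` rule here; telling them apart from identifiers is left out of this sketch.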
After Translation into 3-Address Code

    L0: sne   $1, a, b
        seq   $0, $1, 0
        btrue $0, L1       % while (a != b)
        sl    $3, b, a
        seq   $2, $3, 0
        btrue $2, L4       % if (a > b)
        sub   a, a, b      % a -= b
        jmp   L5
    L4: sub   b, b, a      % b -= a
    L5: jmp   L0
    L1: ret   a

Idealized assembly language w/ infinite registers

After Translation to 80386 Assembly

    gcd:  pushl %ebp            % Save frame pointer
          movl  %esp,%ebp
          movl  8(%ebp),%eax    % Load a from stack
          movl  12(%ebp),%edx   % Load b from stack
    .L8:  cmpl  %edx,%eax
          je    .L3             % while (a != b)
          jle   .L5             % if (a > b)
          subl  %edx,%eax       % a -= b
          jmp   .L8
    .L5:  subl  %eax,%edx       % b -= a
          jmp   .L8
    .L3:  leave                 % Restore SP, BP
          ret

Lexical Analysis (Scanning)

Lexical Analysis (Scanning)

Goal is to translate a stream of characters

    i n t sp g c d ( i n t sp a , sp i n t sp b

into a stream of tokens

    ID   ID   LPAREN  ID   ID  COMMA  ID   ID
    int  gcd  (       int  a   ,      int  b

Each token consists of a token type and its text.
Whitespace and comments are discarded.

Lexical Analysis

Goal: simplify the job of the parser.
Scanners are usually much faster than parsers.
Discard as many irrelevant details as possible (e.g., whitespace, comments).
Parser does not care that the identifier is "supercalifragilisticexpialidocious."
Parser rules are only concerned with token types.

The ANTLR Compiler Generator

Language and compiler for writing compilers
Running ANTLR on an ANTLR file produces Java source files that can be compiled and run.
ANTLR can generate
• Scanners (lexical analyzers)
• Parsers
• Tree walkers
We will use all of these facilities in this class

An ANTLR File for a Simple Scanner

    class CalcLexer extends Lexer;

    LPAREN : '(' ;   // Rules for punctuation
    RPAREN : ')' ;
    STAR   : '*' ;
    PLUS   : '+' ;
    SEMI   : ';' ;

    protected        // Can only be used as a sub-rule
    DIGIT : '0'..'9' ;   // Any character between 0 and 9
    INT   : (DIGIT)+ ;   // One or more digits

    WS : (' ' | '\t' | '\n' | '\r')   // Whitespace
         { $setType(Token.SKIP); } ;  // Action: ignore

ANTLR Specifications for Scanners

Rules are names starting with a capital letter.
A character in single quotes matches that character.

    LPAREN : '(' ;

A string in double quotes matches the string

    IF : "if" ;

A vertical bar indicates a choice:

    OP : '+' | '-' | '*' | '/' ;

ANTLR Specifications

Question mark makes a clause optional.

    PERSON : ("wo")? 'm' ('a'|'e') 'n' ;

(Matches man, men, woman, and women.)
Double dots indicate a range of characters:

    DIGIT : '0'..'9';

Asterisk and plus match "zero or more," "one or more."

    ID : LETTER (LETTER | DIGIT)* ;
    NUMBER : (DIGIT)+ ;

Kleene Closure

The asterisk operator (*) is called the Kleene Closure operator after the inventor of regular expressions, Stephen Cole Kleene, who pronounced his last name "CLAY-nee."
His son Ken writes "As far as I am aware this pronunciation is incorrect in all known languages. I believe that this novel pronunciation was invented by my father."

Scanner Behavior

All rules (tokens) are considered simultaneously. The longest one that matches wins:
1. Look at the next character in the file.
2. Can the next character be added to any of the tokens under construction?
3. If so, add the character to the token being constructed and go to step 1.
4. Otherwise, return the token.
How to keep track of multiple rules matching simultaneously? Build an automaton.

Implementing Scanners Automatically

    Regular Expressions (Rules)
    ↓
    Nondeterministic Finite Automata
    ↓ Subset Construction
    Deterministic Finite Automata
    ↓
    Tables

Regular Expressions and NFAs

We are describing tokens with regular expressions:
• The symbol ε always matches
• A symbol from an alphabet, e.g., a, matches itself
• A sequence of two regular expressions, e.g., e1e2
  Matches e1 followed by e2
• An "OR" of two regular expressions, e.g., e1|e2
  Matches e1 or e2
• The Kleene closure of a regular expression, e.g., (e)*
  Matches zero or more instances of e in sequence.

Deterministic Finite Automata

A state machine with an initial state
Arcs indicate "consumed" input symbols.
States with double lines are accepting.
If the next token has an arc, follow the arc.
If the next token has no arc and the state is accepting, return the token.
If the next token has no arc and the state is not accepting, syntax error.

Deterministic Finite Automata

    ELSE: "else" ;
    ELSEIF: "elseif" ;

[Figure: DFA with a chain of arcs labeled e, l, s, e, i, f.]

Deterministic Finite Automata

    IF: "if" ;
    ID: 'a'..'z' ('a'..'z' | '0'..'9')* ;
    NUM: ('0'..'9')+ ;

[Figure: DFA for IF, ID, and NUM. Arc labels include i, f, a-hj-z, a-eg-z0-9, a-z0-9, and 0-9; accepting states are labeled ID, IF, and NUM.]

Nondeterministic Finite Automata

DFAs with ε arcs.
Conceptually, ε arcs denote state equivalence.
ε arcs add the ability to make nondeterministic (schizophrenic) choices.
When an NFA reaches a state with an ε arc, it moves into every destination.
NFAs can be in multiple states at once.

Translating REs into NFAs

[Figures: NFA fragments for a, e1e2, e1|e2, and (e)*.]

RE to NFAs

Building an NFA for the regular expression

    (wo|ε)m(a|e)n

produces

[Figure: NFA with arcs labeled w, o, m, a, e, n.]

after simplification. Most ε arcs disappear.

Subset Construction

How to compute a DFA from an NFA.
Basic idea: each state of the DFA is a marking of the NFA

[Figure: NFA and corresponding DFA with arcs labeled w, o, m, a, e, n.]

Subset Construction

A DFA can be exponentially larger than the corresponding NFA.
n states versus 2^n
Tools often try to strike a balance between the two representations.
ANTLR uses a different technique.

Free-Format Languages

Typical style arising from scanner/parser division
Program text is a series of tokens possibly separated by whitespace and comments, which are both ignored.
• keywords (if while)
• punctuation (, ( +)
• identifiers (foo bar)
• numbers (10 -3.14159e+32)
• strings ("A String")

Free-Format Languages

Java C C++ Algol Pascal
Some deviate a little (e.g., C and C++ have a separate preprocessor)
But not all languages are free-format.

FORTRAN 77

FORTRAN 77 is not free-format. 72-character lines:

    100   IF(IN .EQ. 'Y' .OR. IN .EQ. 'y' .OR.
         $ IN .EQ. 'T' .OR. IN .EQ. 't') THEN

    Columns 1-5    Statement label
    Column 6       Continuation
    Columns 7-72   Normal

When column 6 is not a space, line is considered part of the previous.
Fixed-length line makes it easy to allocate a one-line buffer.
Makes sense on punch cards.

Python

The Python scripting language groups with indentation

    i = 0
    while i < 10:
        i = i + 1
        print(i)     # Prints 1, 2, ..., 10

    i = 0
    while i < 10:
        i = i + 1
    print(i)         # Just prints 10

This is succinct, but can be error-prone.
How do you wrap a conditional around instructions?

Syntax and Language Design

Does syntax matter? Yes and no
More important is a language's semantics: its meaning.
The syntax is aesthetic, but can be a religious issue.
But aesthetics matter to people, and can be critical.
Verbosity does matter: smaller is usually better.
Too small can be a problem: APL is a compact, cryptic language with its own character set (!)

    E←A TEST B;L
    L←0.5
    E←((A×A)+B×B)*L

Syntax and Language Design

Some syntax is error-prone. Classic FORTRAN example:

    DO 5 I = 1,25    ! Loop header (for i = 1 to 25)
    DO 5 I = 1.25    ! Assignment to variable DO5I

Trying too hard to reuse existing syntax in C++:

    vector< vector<int> > foo;
    vector<vector<int>> foo;   // Syntax error

C distinguishes > and >> as different operators.

Keywords

Keywords look like identifiers in most languages.

Parsing