
Syntax and Parsing
COMS W4115
Prof. Stephen A. Edwards
Spring 2002
Columbia University
Department of Computer Science

Last Time
Administrivia
Class Project
Types of Programming Languages: Imperative, Object-Oriented, Functional, Logic, Dataflow

This Time
Interpreters and Compilers
Structure of a Compiler
Lexical Analysis
Syntax
Parsing

Interpreters

    Source Program
          ↓
    Input → Interpreter → Output

Compilers

    Source Program
          ↓
       Compiler
          ↓
    Input → Executable Program → Output

The Compilation Process

Structure of a Compiler

    ↓ Program Text
    Lexer
    ↓ Token Stream
    Parser
    ↓ Abstract Syntax Tree
    Static semantics (type checking)
    ↓ Annotated AST
    Translation to intermediate form
    ↓ Three-address code
    Code generation
    ↓ Assembly Code

Compiling a Simple Program

    int gcd(int a, int b)
    {
      while (a != b) {
        if (a > b) a -= b;
        else b -= a;
      }
      return a;
    }

What the Compiler Sees

    i n t sp g c d ( i n t sp a , sp i n t sp b ) nl
    { nl
    sp sp w h i l e sp ( a sp ! = sp b ) sp { nl
    sp sp sp sp i f sp ( a sp > sp b ) sp a sp - = sp b ; nl
    sp sp sp sp e l s e sp b sp - = sp a ; nl
    sp sp } nl
    sp sp r e t u r n sp a ; nl
    } nl

Text file is a sequence of characters

After Lexical Analysis

    int gcd ( int a , int b ) { while ( a != b ) {
    if ( a > b ) a -= b ; else b -= a ; } return a ; }

A stream of tokens. Whitespace, comments removed.

After Parsing

    func
      int
      gcd
      args
        arg: int a
        arg: int b
      seq
        while
          != (a, b)
          if
            >  (a, b)
            -= (a, b)
            -= (b, a)
        return (a)

Abstract syntax tree built from parsing rules.

After Semantic Analysis

[Figure: the same abstract syntax tree, with each reference to a and b linked to its entry in the symbol table.]

    Symbol Table:
      int a
      int b

Types checked; references to symbols resolved

After Translation into 3-Address Code

    L0: sne   $1, a, b
        seq   $0, $1, 0
        btrue $0, L1        % while (a != b)
        sl    $3, b, a
        seq   $2, $3, 0
        btrue $2, L4        % if (a < b)
        sub   a, a, b       % a -= b
        jmp   L5
    L4: sub   b, b, a       % b -= a
    L5: jmp   L0
    L1: ret   a

Idealized assembly language w/ infinite registers

After Translation to 80386 Assembly

    gcd: pushl %ebp            % Save frame pointer
         movl  %esp,%ebp
         movl  8(%ebp),%eax    % Load a from stack
         movl  12(%ebp),%edx   % Load b from stack
    .L8: cmpl  %edx,%eax
         je    .L3             % while (a != b)
         jle   .L5             % if (a < b)
         subl  %edx,%eax       % a -= b
         jmp   .L8
    .L5: subl  %eax,%edx       % b -= a
         jmp   .L8
    .L3: leave                 % Restore SP, BP
         ret

Lexical Analysis (Scanning)

Lexical Analysis (Scanning)

Goal is to translate a stream of characters

    i n t sp g c d ( i n t sp a , sp i n t sp b

into a stream of tokens

    ID   ID   LPAREN  ID   ID  COMMA  ID   ID
    int  gcd  (       int  a   ,      int  b

Each token consists of a token type and its text.
Whitespace and comments are discarded.

Lexical Analysis

Goal: simplify the job of the parser.
Scanners are usually much faster than parsers.
Discard as many irrelevant details as possible (e.g., whitespace, comments).
Parser does not care that the identifier is "supercalifragilisticexpialidocious."
Parser rules are only concerned with token types.

The ANTLR Compiler Generator

Language and compiler for writing compilers
Running ANTLR on an ANTLR file produces Java source files that can be compiled and run.
ANTLR can generate
• Scanners (lexical analyzers)
• Parsers
• Tree walkers
We will use all of these facilities in this class

An ANTLR File for a Simple Scanner

    class CalcLexer extends Lexer;

    LPAREN : '(' ;           // Rules for punctuation
    RPAREN : ')' ;
    STAR   : '*' ;
    PLUS   : '+' ;
    SEMI   : ';' ;

    protected                // Can only be used as a sub-rule
    DIGIT : '0'..'9' ;       // Any character between 0 and 9
    INT   : (DIGIT)+ ;       // One or more digits

    WS : (' ' | '\t' | '\n'| '\r')   // Whitespace
         { $setType(Token.SKIP); } ; // Action: ignore

ANTLR Specifications for Scanners

Rules are names starting with a capital letter.
A character in single quotes matches that character.

    LPAREN : '(' ;

A string in double quotes matches the string

    IF : "if" ;

A vertical bar indicates a choice:

    OP : '+' | '-' | '*' | '/' ;

ANTLR Specifications

Question mark makes a clause optional.

    PERSON : ("wo")? 'm' ('a'|'e') 'n' ;

(Matches man, men, woman, and women.)
Double dots indicate a range of characters:

    DIGIT : '0'..'9';

Asterisk and plus match "zero or more," "one or more."

    ID : LETTER (LETTER | DIGIT)* ;
    NUMBER : (DIGIT)+ ;
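To make the scanner specification concrete, here is a minimal Python sketch of a scanner for the same token types as CalcLexer. It is not what ANTLR generates (ANTLR emits Java); the regular expressions, the TOKEN_SPEC table, and the tokenize function are assumptions of this sketch. Python's alternation picks the first alternative that matches rather than the longest, which is harmless here only because no two rules can start with the same character.

```python
import re

# Token types mirror the CalcLexer rules; the regexes are this sketch's translation.
TOKEN_SPEC = [
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("STAR",   r"\*"),
    ("PLUS",   r"\+"),
    ("SEMI",   r";"),
    ("INT",    r"[0-9]+"),      # (DIGIT)+
    ("WS",     r"[ \t\n\r]"),   # discarded, like $setType(Token.SKIP)
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token type, text) pairs; whitespace is dropped."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, tokenize("(12 + 3) * 4;") yields LPAREN, INT, PLUS, INT, RPAREN, STAR, INT, SEMI, each paired with its text.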

Kleene Closure

The asterisk operator (*) is called the Kleene Closure operator after the inventor of regular expressions, Stephen Cole Kleene, who pronounced his last name "CLAY-nee."
His son Ken writes "As far as I am aware this pronunciation is incorrect in all known languages. I believe that this novel pronunciation was invented by my father."

Scanner Behavior

All rules (tokens) are considered simultaneously. The longest one that matches wins:
1. Look at the next character in the file.
2. Can the next character be added to any of the tokens under construction?
3. If so, add the character to the token being constructed and go to step 1.
4. Otherwise, return the token.

How to keep track of multiple rules matching simultaneously? Build an automaton.

Implementing Scanners Automatically

    Regular Expressions (Rules)
    ↓
    Nondeterministic Finite Automata
    ↓ Subset Construction
    Deterministic Finite Automata
    ↓
    Tables
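The longest-match rule can be sketched in Python as follows. Rather than advancing one character at a time as in the numbered steps, this version tries every rule at the current position and keeps the longest match; the result is the same, but it is an assumption of the sketch, not how a generated scanner works. The rule set (IF, ID, NUM, WS) and the tie-break by rule order are also assumptions, chosen so that a keyword beats an identifier of the same length.

```python
import re

# Earlier rules win ties, so the keyword IF beats ID on the input "if".
RULES = [
    ("IF",  re.compile(r"if")),
    ("ID",  re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS",  re.compile(r"[ \t\n\r]+")),
]


def next_token(text, pos):
    """Try every rule at pos and return (type, lexeme, end) for the longest match."""
    best = None
    for name, rx in RULES:
        m = rx.match(text, pos)
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    if best is None:
        raise SyntaxError(f"no rule matches at offset {pos}")
    return best


def scan(text):
    """Return all non-whitespace tokens as (type, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "WS":
            tokens.append((name, lexeme))
    return tokens
```

On the input "if ifx 42" the scanner returns IF for "if" but ID for "ifx", because the identifier rule matches one more character there.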

Regular Expressions and NFAs

We are describing tokens with regular expressions:
• The symbol ε always matches
• A symbol from an alphabet, e.g., a, matches itself
• A sequence of two regular expressions, e.g., e1e2
  Matches e1 followed by e2
• An "OR" of two regular expressions, e.g., e1|e2
  Matches e1 or e2
• The Kleene closure of a regular expression, e.g., (e)∗
  Matches zero or more instances of e in sequence.

Deterministic Finite Automata

A state machine with an initial state
Arcs indicate "consumed" input symbols.
States with double lines are accepting.
If the next character has an arc, follow the arc.
If the next character has no arc and the state is accepting, return the token.
If the next character has no arc and the state is not accepting, syntax error.

Deterministic Finite Automata

    ELSE: "else" ;
    ELSEIF: "elseif" ;

[Figure: a chain of states joined by arcs labeled e, l, s, e, i, f; the states reached after "else" and after "elseif" are accepting.]

Deterministic Finite Automata

    IF: "if" ;
    ID: 'a'..'z' ('a'..'z' | '0'..'9')* ;
    NUM: ('0'..'9')+ ;

[Figure: DFA for these rules. From the start state, i leads to an ID-accepting state, from which f leads to the IF-accepting state and a-e, g-z, 0-9 lead to ID; from IF, a-z and 0-9 lead to ID. From the start state, a-h and j-z lead to ID, which loops on a-z and 0-9. From the start state, 0-9 leads to NUM, which loops on 0-9.]

Nondeterministic Finite Automata

DFAs with ε arcs.
Conceptually, ε arcs denote state equivalence.
ε arcs add the ability to make nondeterministic (schizophrenic) choices.
When an NFA reaches a state with an ε arc, it moves into every destination.
NFAs can be in multiple states at once.

Translating REs into NFAs

[Figure: NFA fragments for each construct. a: a single arc labeled a. e1e2: the fragment for e1 joined to the fragment for e2 by an ε arc. e1|e2: ε arcs branch into e1 and e2 and ε arcs join their exits. (e)∗: ε arcs allow the fragment for e to be skipped or repeated.]
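The DFA rules above (follow an arc while one exists, then accept or report an error) can be sketched as a table-driven recognizer for the IF / ID / NUM automaton. The state names, the DELTA dictionary, and the run_dfa function are assumptions of this sketch; a generated scanner would use compact arrays instead of a Python dictionary.

```python
import string

LETTERS = string.ascii_lowercase
DIGITS = string.digits

DELTA = {}  # (state, character) -> next state


def arc(src, chars, dst):
    """Add one arc per character in chars."""
    for c in chars:
        DELTA[(src, c)] = dst


arc("start", "i", "i")
arc("start", LETTERS.replace("i", ""), "id")
arc("start", DIGITS, "num")
arc("i", "f", "if")
arc("i", LETTERS.replace("f", "") + DIGITS, "id")
arc("if", LETTERS + DIGITS, "id")
arc("id", LETTERS + DIGITS, "id")
arc("num", DIGITS, "num")

# Accepting states and the token type each one returns.
ACCEPTING = {"i": "ID", "if": "IF", "id": "ID", "num": "NUM"}


def run_dfa(text, pos=0):
    """Follow arcs while possible, then return (token type, lexeme)."""
    state, end = "start", pos
    while end < len(text) and (state, text[end]) in DELTA:
        state = DELTA[(state, text[end])]
        end += 1
    if state not in ACCEPTING:
        raise SyntaxError(f"no token at offset {pos}")
    return ACCEPTING[state], text[pos:end]
```

Note how "if" ends in the IF state, while "iff" passes through it and ends in ID: the longest match falls out of simply following arcs as far as they go.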

RE to NFAs

Building an NFA for the regular expression

    (wo|ε)m(a|e)n

produces

[Figure: NFA with arcs w, o, an ε arc bypassing "wo", then m, then a or e, then n]

after simplification. Most ε arcs disappear.

Subset Construction

How to compute a DFA from an NFA.
Basic idea: each state of the DFA is a marking of the NFA

[Figure: the DFA for the NFA above, with arcs labeled w, o, m, a, e, n]

Subset Construction

A DFA can be exponentially larger than the corresponding NFA.
n states versus 2^n
Tools often try to strike a balance between the two representations.
ANTLR uses a different technique.
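A minimal Python sketch of subset construction, applied to an NFA for (wo|ε)m(a|e)n. The state numbering and the dictionary encoding of the NFA are assumptions of this sketch (the slide's figure does not number its states); each DFA state is a frozenset of NFA states, i.e. one "marking" of the NFA.

```python
EPS = ""  # label used for epsilon arcs in this sketch

# Hypothetical numbering: 0 -w-> 1 -o-> 2, 0 -eps-> 2, 2 -m-> 3, 3 -a/e-> 4, 4 -n-> 5
NFA = {
    (0, "w"): {1},
    (1, "o"): {2},
    (0, EPS): {2},
    (2, "m"): {3},
    (3, "a"): {4},
    (3, "e"): {4},
    (4, "n"): {5},
}
NFA_START, NFA_ACCEPT = 0, {5}


def eps_closure(states):
    """All NFA states reachable from `states` using only epsilon arcs."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction():
    """Return (start, transitions, accepting) of the equivalent DFA."""
    alphabet = sorted({c for (_, c) in NFA if c != EPS})
    start = eps_closure({NFA_START})
    dfa, seen, todo = {}, {start}, [start]
    while todo:
        current = todo.pop()
        for c in alphabet:
            target = set()
            for s in current:
                target |= NFA.get((s, c), set())
            if not target:
                continue  # no arc on c from this marking
            nxt = eps_closure(target)
            dfa[(current, c)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    accepting = {s for s in seen if s & NFA_ACCEPT}
    return start, dfa, accepting


def accepts(word):
    """Run the constructed DFA on word."""
    start, dfa, accepting = subset_construction()
    state = start
    for c in word:
        if (state, c) not in dfa:
            return False
        state = dfa[(state, c)]
    return state in accepting
```

Here the DFA is no larger than the NFA because only one ε arc exists; the exponential blow-up appears when many NFA states can be active at once.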

Free-Format Languages

Typical style arising from scanner/parser division
Program text is a series of tokens possibly separated by whitespace and comments, which are both ignored.
• keywords (if while)
• punctuation (, ( +)
• identifiers (foo bar)
• numbers (10 -3.14159e+32)
• strings ("A String")

Free-Format Languages

Java C C++ Algol Pascal
Some deviate a little (e.g., C and C++ have a separate preprocessor)
But not all languages are free-format.

FORTRAN 77

FORTRAN 77 is not free-format. 72-character lines:

    100   IF(IN .EQ. 'Y' .OR. IN .EQ. 'y' .OR.
         $    IN .EQ. 'T' .OR. IN .EQ. 't') THEN

    Columns 1-5:   Statement label
    Column 6:      Continuation
    Columns 7-72:  Normal

When column 6 is not a space, line is considered part of the previous.
Fixed-length line makes it easy to allocate a one-line buffer.
Makes sense on punch cards.

Python

The Python scripting language groups with indentation

    i = 0
    while i < 10:
        i = i + 1
        print(i)     # Prints 1, 2, ..., 10

    i = 0
    while i < 10:
        i = i + 1
    print(i)         # Just prints 10

This is succinct, but can be error-prone.
How do you wrap a conditional around instructions?

Syntax and Language Design

Does syntax matter? Yes and no
More important is a language's semantics, that is, its meaning.
The syntax is aesthetic, but can be a religious issue.
But aesthetics matter to people, and can be critical.
Verbosity does matter: smaller is usually better.
Too small can be a problem: APL is a compact, cryptic language with its own character set (!)

    E←A TEST B;L
    L←0.5
    E←((A×A)+B×B)*L

Syntax and Language Design

Some syntax is error-prone. Classic FORTRAN example:

    DO 5 I = 1,25   ! Loop header (for i = 1 to 25)
    DO 5 I = 1.25   ! Assignment to variable DO5I

Trying too hard to reuse existing syntax in C++:

    vector< vector<int> > foo;
    vector<vector<int>> foo;   // Syntax error

C distinguishes > and >> as different operators.

Keywords

Keywords look like identifiers in most languages.
Scanners do not know context, so keywords must take precedence over identifiers.
Too many keywords leaves fewer options for identifiers.
Languages such as C++ or Java strive for fewer keywords to avoid "polluting" available identifiers.

Parsing

Parsing

Objective: build an abstract syntax tree (AST) for the token sequence from the scanner.

    2 * 3 + 4   ⇒   +
                    ├─ *
                    │  ├─ 2
                    │  └─ 3
                    └─ 4

Goal: discard irrelevant information to make it easier for the next stage.
Parentheses and most other forms of punctuation removed.

Grammars

Most programming languages described using a context-free grammar.
Compared to regular languages, context-free languages add one important thing: recursion.
Recursion allows you to count, e.g., to match pairs of nested parentheses.
Which languages do humans speak? I'd say it's regular: I do not not not not not not not not not not understand this sentence.

Languages

Regular languages (t is a terminal):

    A → t1 . . . tn B
    A → t1 . . . tn

Context-free languages (P is terminal or a variable):

    A → P1 . . . Pn

Context-sensitive languages:

    α1Aα2 → α1Bα2

"A → B only in the 'context' of α1 · · · α2"

Issues

Ambiguous grammars
Precedence of operators
Left- versus right-recursive
Top-down vs. bottom-up parsers
vs. Abstract Syntax Tree

Ambiguous Grammars

A grammar can easily be ambiguous. Consider parsing

    3 - 4 * 2 + 5

with the grammar

    e → e + e | e − e | e ∗ e | e / e

[Figure: five different parse trees for 3 - 4 * 2 + 5, one for each way of grouping the three operators.]

Operator Precedence and Associativity

Usually resolve ambiguity in arithmetic expressions
Like you were taught in elementary school:
"My Dear Aunt Sally"
Mnemonic for multiplication and division before addition and subtraction.

Operator Precedence

Defines how "sticky" an operator is.

    1 * 2 + 3 * 4

* at higher precedence than +:

    (1 * 2) + (3 * 4)

+ at higher precedence than *:

    1 * (2 + 3) * 4

[Figure: the parse tree for each grouping.]
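To see the ambiguity of 3 - 4 * 2 + 5 concretely, the following Python sketch enumerates every parse tree the grammar e → e + e | e − e | e ∗ e | e / e allows and evaluates each one. The token-list encoding and the function name parses are assumptions of this sketch; it treats numbers as the leaves, which the grammar on the slide leaves implicit.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def parses(tokens):
    """Yield (parenthesized text, value) for every parse tree of tokens.

    tokens alternates numbers and operator strings: [3, "-", 4, "*", 2, "+", 5].
    """
    if len(tokens) == 1:
        yield str(tokens[0]), tokens[0]
        return
    for i in range(1, len(tokens), 2):      # any operator may be the root
        op = tokens[i]
        for ltext, lval in parses(tokens[:i]):
            for rtext, rval in parses(tokens[i + 1:]):
                yield f"({ltext} {op} {rtext})", OPS[op](lval, rval)


if __name__ == "__main__":
    for text, value in parses([3, "-", 4, "*", 2, "+", 5]):
        print(text, "=", value)
```

The expression has five parse trees, and each yields a different value (-25, -10, -7, 0, and 3). Only the grouping ((3 - (4 * 2)) + 5) = 0 agrees with the usual precedence and associativity rules.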

C's 15 Precedence Levels

    f(r,r,...)  a[i]  p->m  s.m
    !b  ~i  -i  ++l  --l  l++  l--  *p  &l  (type) r  sizeof(t)
    n * o  n / o  i % j
    n + o  n - o
    i << j  i >> j
    n < o  n > o  n <= o  n >= o
    r == r  r != r
    i & j
    i ^ j
    i | j
    b && c
    b || c
    b ? r : r
    l = r  l += n  l -= n  l *= n  l /= n  l %= i  l &= i  l ^= i  l |= i  l <<= i  l >>= i
    r1 , r2

Associativity

Whether to evaluate left-to-right or right-to-left
Most operators are left-associative

    1 - 2 - 3 - 4

    ((1 - 2) - 3) - 4      left associative
    1 - (2 - (3 - 4))      right associative

[Figure: the left-leaning and right-leaning parse trees for the two groupings.]

Fixing Ambiguous Grammars

Original ANTLR grammar specification

    expr
      : expr '+' expr
      | expr '-' expr
      | expr '*' expr
      | expr '/' expr
      | NUMBER
      ;

Ambiguous: no precedence or associativity.
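The two groupings of 1 - 2 - 3 - 4 from the associativity slide give different answers, which a short Python sketch can confirm. The helper names fold_left and fold_right are assumptions of this sketch.

```python
from functools import reduce


def fold_left(nums):
    """Left-associative subtraction: ((1 - 2) - 3) - 4."""
    return reduce(lambda acc, n: acc - n, nums)


def fold_right(nums):
    """Right-associative subtraction: 1 - (2 - (3 - 4))."""
    return reduce(lambda acc, n: n - acc, reversed(nums))
```

fold_left([1, 2, 3, 4]) is -8 and fold_right([1, 2, 3, 4]) is -2, so the choice of associativity changes the meaning of the program and not just the shape of the tree.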

Assigning Precedence Levels

Split into multiple rules, one per level

    expr : expr '+' expr
         | expr '-' expr
         | term ;
    term : term '*' term
         | term '/' term
         | atom ;
    atom : NUMBER ;

Still ambiguous: associativity not defined

Assigning Associativity

Make one side or the other the next level of precedence

    expr : expr '+' term
         | expr '-' term
         | term ;
    term : term '*' atom
         | term '/' atom
         | atom ;
    atom : NUMBER ;

Parsing Context-Free Grammars

There are O(n^3) algorithms for parsing arbitrary CFGs, but most compilers demand O(n) algorithms.
Fortunately, the LL and LR subclasses of CFGs have O(n) parsing algorithms. People use these in practice.

Parsing LL(k) Grammars

LL: Left-to-right, Left-most derivation
k: number of tokens to look ahead
Parsed by top-down, predictive, recursive parsers
Basic idea: look at the next token to predict which production to use
ANTLR builds recursive LL(k) parsers
Almost a direct translation from the grammar.

A Top-Down Parser

    stmt : 'if' expr 'then' expr
         | 'while' expr 'do' expr
         | expr ':=' expr ;
    expr : NUMBER | '(' expr ')' ;

    AST stmt() {
      switch (next-token) {
      case "if" : match("if"); expr(); match("then"); expr();
      case "while" : match("while"); expr(); match("do"); expr();
      case NUMBER or "(" : expr(); match(":="); expr();
      }
    }

Writing LL(k) Grammars

Cannot have left-recursion

    expr : expr '+' term | term ;

becomes

    AST expr() {
      switch (next-token) {
      case NUMBER : expr(); /* Infinite Recursion */
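The top-down parser pseudocode for the stmt grammar can be made runnable as a small recursive-descent parser in Python. This is a hand-written sketch, not ANTLR output; the Parser class, its tuple-shaped AST nodes, and the convention that tokens are whitespace-separated strings with NUMBER being a string of digits are all assumptions of the sketch.

```python
class Parser:
    """Predictive parser for:
         stmt : 'if' expr 'then' expr | 'while' expr 'do' expr | expr ':=' expr ;
         expr : NUMBER | '(' expr ')' ;
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Return the next token without consuming it, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        """Consume the next token, which must equal expected."""
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def stmt(self):
        tok = self.peek()          # one token of lookahead picks the production
        if tok == "if":
            self.match("if")
            cond = self.expr()
            self.match("then")
            return ("if", cond, self.expr())
        if tok == "while":
            self.match("while")
            cond = self.expr()
            self.match("do")
            return ("while", cond, self.expr())
        if tok == "(" or (tok is not None and tok.isdigit()):
            target = self.expr()
            self.match(":=")
            return (":=", target, self.expr())
        raise SyntaxError(f"unexpected token {tok!r}")

    def expr(self):
        tok = self.peek()
        if tok == "(":
            self.match("(")
            inner = self.expr()
            self.match(")")
            return inner           # parentheses leave no trace in the AST
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")
```

For example, Parser("if ( 1 ) then 2".split()).stmt() returns ("if", 1, 2). Each grammar rule became one method, and each alternative became one branch selected by the next token.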

Writing LL(1) Grammars

Cannot have common prefixes

    expr : ID '(' expr ')'
         | ID '=' expr

becomes

    AST expr() {
      switch (next-token) {
      case ID : match(ID); match('('); expr(); match(')');
      case ID : match(ID); match('='); expr();

Eliminating Common Prefixes

Consolidate common prefixes:

    expr
      : expr '+' term
      | expr '-' term
      | term
      ;

becomes

    expr
      : expr ('+' term | '-' term )
      | term
      ;

Eliminating Left Recursion

Understand the recursion and add tail rules

    expr
      : expr ('+' term | '-' term )
      | term
      ;

becomes

    expr : term exprt ;
    exprt : '+' term exprt
          | '-' term exprt
          | /* nothing */
          ;
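The tail-rule grammar (expr : term exprt) can be parsed top-down without infinite recursion. The Python sketch below follows it directly, with two assumptions: term is simplified to a single NUMBER token, and exprt receives the tree built so far as an argument so that the result is still left-associative even though the grammar is now right-recursive.

```python
def parse_expr(tokens):
    """Parse  expr : term exprt ;  exprt : ('+'|'-') term exprt | /* nothing */ ;

    tokens is a list of strings; term is assumed to be a single number.
    Returns nested tuples such as ("-", ("-", 1, 2), 3).
    """
    pos = 0

    def term():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"number expected at token {pos}")
        value = int(tokens[pos])
        pos += 1
        return value

    def exprt(left):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            right = term()
            return exprt((op, left, right))   # fold to the left, then continue
        return left                           # the /* nothing */ alternative

    tree = exprt(term())
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

parse_expr("1 - 2 - 3".split()) returns ("-", ("-", 1, 2), 3), the left-associative tree. Building the tree on the way back up from the recursion instead would give the right-associative grouping, which is wrong for subtraction.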

Using ANTLR's EBNF

ANTLR makes this easier since it supports * and +:

    expr : expr '+' term
         | expr '-' term
         | term ;

becomes

    expr : term ('+' term | '-' term)* ;

The Dangling Else Problem

Who owns the else?

    if (a) if (b) c(); else d();

[Figure: two parse trees. In one, the else d() belongs to the inner if (b); in the other, it belongs to the outer if (a).]

Grammars are usually ambiguous; manuals give disambiguating rules such as C's:
As usual the "else" is resolved by connecting an else with the last encountered elseless if.

The Dangling Else Problem

    stmt : "if" expr "then" stmt iftail
         | other-statements ;
    iftail
         : "else" stmt
         | /* nothing */
         ;

Problem comes when matching "iftail."
Normally, an empty choice is taken if the next token is in the "follow set" of the rule. But since "else" can follow an iftail, the decision is ambiguous.

The Dangling Else Problem

ANTLR can resolve this problem by making certain rules "greedy." If a conditional is marked as greedy, it will take that option even if the "nothing" option would also match:

    stmt
      : "if" expr "then" stmt
        ( options {greedy = true;}
        : "else" stmt
        )?
      | other-statements
      ;

The Dangling Else Problem

Some languages resolve this problem by insisting on nesting everything.
E.g., Algol 68:

    if a < b then a else b fi;

"fi" is "if" spelled backwards. The language also uses do-od and case-esac.

Bottom-up Parsers

Regular languages can be matched using finite automata.
Context-free languages can be matched with pushdown automata (have a stack).
Operation of a bottom-up parser:
• Maintain a stack of tokens and rules
• Push each new token onto this stack ("shift")
• When the top few things on the stack match a rule, replace them ("reduce")
Used by yacc, bison, and other parser generators.
Parses more languages, but error recovery harder.
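The shift/reduce operation of a bottom-up parser can be sketched in Python for the small grammar E : T '+' E | T ; T : int '*' T | int. The shift-or-reduce decisions in choose_reduction are written by hand using one token of lookahead; this is an assumption of the sketch, whereas yacc and bison derive such decisions automatically as parse tables. Tokens are the strings "int", "*", and "+".

```python
def choose_reduction(stack, lookahead):
    """Return (number of symbols to pop, symbol to push), or None to shift."""
    if stack[-3:] == ["int", "*", "T"]:
        return 3, "T"                      # T : int '*' T
    if stack[-3:] == ["T", "+", "E"]:
        return 3, "E"                      # E : T '+' E
    if stack[-1:] == ["int"] and lookahead != "*":
        return 1, "T"                      # T : int
    if stack[-1:] == ["T"] and lookahead != "+":
        return 1, "E"                      # E : T
    return None


def shift_reduce(tokens):
    """Parse tokens and return the list of actions taken; raise on error."""
    stack, pos, trace = [], 0, []
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else None
        choice = choose_reduction(stack, lookahead)
        if choice is not None:
            count, lhs = choice
            rhs = stack[-count:]
            del stack[-count:]
            stack.append(lhs)              # "reduce": replace the handle
            trace.append(f"reduce {lhs} -> {' '.join(rhs)}")
        elif lookahead is not None:
            stack.append(lookahead)        # "shift": push the next token
            pos += 1
            trace.append(f"shift {lookahead}")
        elif stack == ["E"]:
            return trace                   # all input consumed, one E left
        else:
            raise SyntaxError(f"cannot reduce stack {stack}")
```

On the input int * int + int the parser shifts three tokens before its first reduction, and the final reduction E -> T + E leaves a single E on the stack.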

Bottom-up Parsing

    E : T '+' E | T ;
    T : int '*' T | int ;

Parsing Techniques

Much theory has been developed about languages and parsing algorithms.

Statement separators or terminators?

C uses ; as a statement terminator.

    if (a

Summary

Compiler: scanner, parser, AST, IR, assembly
Scanner divides input into tokens
Scanning defined using a regular language
Parser uses rules to recognize phrases and build AST
Context-free grammars used for parsers
Operator precedence and associativity
Top-down and bottom-up parsers