
Last Extra: Front-end vs. Back-end

[Figure: compiler phases — Source code → Scanner → Tokens → Parser → Syntax tree → Semantic Routines → IR → Optimizer → IR → Code Generation → Executable. The Scanner, Parser, and Semantic Routines form the front end; the Optimizer and Code Generation form the back end.]

• Scanner + Parser + Semantic actions + (high-level) optimizations are called the front-end of a compiler
• IR-level optimizations and code generation (instruction selection, scheduling, register allocation) are called the back-end of a compiler
• Can build multiple front-ends for a particular back-end
  • e.g., gcc & g++, or the many front-ends that generate CIL
• Can build multiple back-ends for a particular front-end
  • e.g., gcc allows targeting different architectures


The MICRO compiler: a simple example

High level structure

• Single-pass compiler: no intermediate representation
• Scanner tokenizes the input stream, but is called by the parser on demand
• As the parser recognizes constructs, it invokes semantic routines
• Semantic routines build the symbol table on-the-fly and directly generate code for a 3-address virtual machine


The Micro language

• Tokens are defined by regular expressions
• Tokens: BEGIN, END, READ, WRITE, ID, LITERAL, LPAREN, RPAREN, SEMICOLON, COMMA, ASSIGN_OP, PLUS_OP, MINUS_OP, SCANEOF
• Implicit identifier declaration (no need to predeclare variables): ID = [A-Z][A-Z0-9]*
• Literals (numbers): LITERAL = [0-9]+
• Comments (not passed on as tokens): --(Not(\n))*\n
• Program: BEGIN {statements} END

The Micro language

• One data type: all IDs are integers
• Statement: ID := expression
• Expressions are simple arithmetic expressions which can contain identifiers
  • Note: no unary minus
• Input/output: READ(ID, ID, ...) and WRITE(EXPR, EXPR, ...)
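For concreteness, a minimal C declaration of these token kinds might look like the following (the typedef name token_type is an assumption used by the sketches later in these notes, not the textbook's code):

/* Token kinds for Micro; the names follow the token list above. */
typedef enum {
    BEGIN, END, READ, WRITE, ID, LITERAL,
    LPAREN, RPAREN, SEMICOLON, COMMA,
    ASSIGN_OP, PLUS_OP, MINUS_OP, SCANEOF
} token_type;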

Scanner

• Scanner algorithm provided in book (pages 28/29)
• What the scanner can identify corresponds to what the finite automaton for a regular expression can accept
• Identifies the next token in the input stream
  • Read a token (process the finite automaton until an accept state is found)
  • Identify its type (determine which accept state the FA is in)
  • Return type and "value" (e.g., type = LITERAL, value = 5)

Recognizing tokens

• Skip spaces
• If the first non-space character is:
  • letter: read until non-alphanumeric. Check for reserved words ("begin," "end"). Return reserved word or (ID and variable name)
  • digit: read until non-digit. Return LITERAL and number
  • ( ) ; , + : return single character
  • : : next must be =. Return ASSIGN_OP
  • - : if next is also -, skip to end of line; otherwise return MINUS_OP
• "unget" the next character that had to be read to find the end of IDs, reserved words, literals and minus ops
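A rough C sketch of this algorithm, using the token_type declared earlier (the helper names, uppercase keywords, missing bounds checks, and lack of error recovery are simplifying assumptions; the textbook's scanner on pages 28/29 differs in detail):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the next token from 'in', copying its text into token_buf. */
token_type scanner(FILE *in, char *token_buf) {
    int c = fgetc(in);
    while (c != EOF && isspace(c))            /* skip spaces */
        c = fgetc(in);
    if (c == EOF) return SCANEOF;

    if (isalpha(c)) {                         /* letter: reserved word or ID */
        int len = 0;
        do { token_buf[len++] = (char)c; c = fgetc(in); } while (isalnum(c));
        ungetc(c, in);                        /* "unget" the extra character */
        token_buf[len] = '\0';
        if (strcmp(token_buf, "BEGIN") == 0) return BEGIN;
        if (strcmp(token_buf, "END")   == 0) return END;
        if (strcmp(token_buf, "READ")  == 0) return READ;
        if (strcmp(token_buf, "WRITE") == 0) return WRITE;
        return ID;
    }
    if (isdigit(c)) {                         /* digit: LITERAL */
        int len = 0;
        do { token_buf[len++] = (char)c; c = fgetc(in); } while (isdigit(c));
        ungetc(c, in);
        token_buf[len] = '\0';
        return LITERAL;
    }
    switch (c) {                              /* single-character tokens */
    case '(': return LPAREN;
    case ')': return RPAREN;
    case ';': return SEMICOLON;
    case ',': return COMMA;
    case '+': return PLUS_OP;
    case ':':                                 /* ':' must be followed by '=' */
        if (fgetc(in) == '=') return ASSIGN_OP;
        fprintf(stderr, "lexical error\n"); exit(1);
    case '-':                                 /* "--" starts a comment to end of line */
        c = fgetc(in);
        if (c == '-') {
            while ((c = fgetc(in)) != EOF && c != '\n')
                ;
            return scanner(in, token_buf);    /* comment skipped; scan again */
        }
        ungetc(c, in);
        return MINUS_OP;
    default:
        fprintf(stderr, "lexical error\n"); exit(1);
    }
}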


Parsers and Grammars

• Language syntax is usually specified with context-free grammars (CFGs)
• Backus-Naur form (BNF) is the standard notation
• Written as a set of rewrite rules:
  • Non-terminal ::= (set of terminals and non-terminals)
• Terminals are the set of tokens
• Each rule tells how to compose a non-terminal from other non-terminals and terminals

Micro grammar

program ::= BEGIN statement_list END
statement_list ::= statement { statement }
statement ::= ID := expression; | READ( id_list ) | WRITE( expr_list )
id_list ::= ID {, ID}
expr_list ::= expression {, expression}
expression ::= primary {add_op primary}
primary ::= ( expression ) | ID | LITERAL
add_op ::= PLUSOP | MINUSOP
system_goal ::= program SCANEOF


Relating the CFG to a program

• CFGs can produce a program by applying a sequence of productions
• How to produce BEGIN ID := ID + ID; END
• Rewrite by starting with the goal production and replacing non-terminals with the rule's RHS:

  program SCANEOF                            (replace program)
  BEGIN statement_list END                   (replace statement_list)
  BEGIN statement; {statement} END           (drop {statement})
  BEGIN statement; END                       (replace statement)
  BEGIN ID := expression; END                (replace expression)
  BEGIN ID := primary add_op primary; END    (replace 1st primary)
  BEGIN ID := ID add_op primary; END         (replace add_op)
  BEGIN ID := ID + primary; END              (replace primary)
  BEGIN ID := ID + ID; END

How do we go in reverse?

• How do we parse a program given a CFG?
• Start with the goal term and rewrite productions from left to right
  • If the next symbol is a terminal, make sure we match the input token
    • Otherwise, there is a syntax error
  • If it is a non-terminal:
    • If there is a single choice for a production, pick it
    • If there are multiple choices, choose the production that matches the next token(s) in the stream
      • e.g., when parsing statement, could use the production for ID, READ or WRITE
• Note that this means we have to look ahead in the stream to match tokens! (A sketch of the lookahead helpers appears after the quiz grammar below.)

Quiz: how much lookahead?

program ::= BEGIN statement_list END
statement_list ::= statement { statement }
statement ::= ID := expression; | READ( id_list ) | WRITE( expr_list )
id_list ::= ID {, ID}
expr_list ::= expression {, expression}
expression ::= primary {add_op primary}
primary ::= ( expression ) | ID | LITERAL
add_op ::= PLUSOP | MINUSOP
system_goal ::= program SCANEOF
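One common way to provide that lookahead is a one-token buffer on top of the scanner. Here is a minimal sketch, reusing the scanner and token_type assumed earlier; peek_at_match and match are the helper names used by the parser code on the next slides, but this particular buffering code is an illustration, not the textbook's:

#include <stdio.h>
#include <stdlib.h>

static FILE      *src;                 /* the source file being compiled */
static char       token_text[256];     /* text of the buffered token */
static token_type buffered;            /* one-token lookahead buffer */
static int        have_buffered = 0;

/* Peek at the next token without consuming it. */
static token_type peek_at_match(void) {
    if (!have_buffered) {
        buffered = scanner(src, token_text);   /* scanner sketched earlier */
        have_buffered = 1;
    }
    return buffered;
}

/* Consume the next token, which must be of the expected kind. */
static void match(token_type expected) {
    if (peek_at_match() != expected) {
        fprintf(stderr, "syntax error\n");
        exit(1);
    }
    have_buffered = 0;                         /* token consumed */
}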


Recursive descent parsing

• Idea: parse using a set of mutually recursive functions
  • One function per non-terminal
  • Each function attempts to match any terminals in its production
  • If a rule produces non-terminals, call the appropriate function

statement() {
    token = peek_at_match();
    switch (token) {
    case ID:
        match(ID);        //consume ID
        match(ASSIGN);    //consume :=
        expression();     //process non-terminal
        break;
    case READ:
        match(READ);      //consume READ
        match(LPAREN);    //match (
        id_list();        //process non-terminal
        match(RPAREN);    //match )
        break;
    case WRITE:
        match(WRITE);     //consume WRITE
        match(LPAREN);    //match (
        expr_list();      //process non-terminal
        match(RPAREN);    //match )
        break;
    }
    match(SEMICOLON);
}

statement ::= ID := expression; | READ( id_list ) | WRITE( expr_list )

Recursive descent parsing (II)

• How do we parse id_list ::= ID {, ID} ?
• One idea: this is equivalent to id_list ::= ID | ID, id_list

id_list() {
    match(ID);            //consume ID
    if (peek_at_match() == COMMA) {
        match(COMMA);
        id_list();
    }
}

• This is equivalent to the following loop:

id_list() {
    match(ID);            //consume ID
    while (peek_at_match() == COMMA) {
        match(COMMA);
        match(ID);
    }
}

• Note: in both cases, if peek_at_match() isn't COMMA, we don't consume the next token!
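The remaining expression productions can be written in the same style. A sketch follows, assuming the match/peek_at_match helpers above, the PLUS_OP/MINUS_OP spellings from the token list, and a hypothetical syntax_error() routine; this is not the textbook's pages 36-38 code:

expression() {                      /* expression ::= primary {add_op primary} */
    primary();
    while (peek_at_match() == PLUS_OP || peek_at_match() == MINUS_OP) {
        add_op();                   /* consume + or - */
        primary();
    }
}

primary() {                         /* primary ::= ( expression ) | ID | LITERAL */
    switch (peek_at_match()) {
    case LPAREN:
        match(LPAREN);
        expression();
        match(RPAREN);
        break;
    case ID:
        match(ID);
        break;
    case LITERAL:
        match(LITERAL);
        break;
    default:
        syntax_error();             /* next token not in primary's first set */
    }
}

add_op() {                          /* add_op ::= PLUS_OP | MINUS_OP */
    if (peek_at_match() == PLUS_OP) match(PLUS_OP);
    else match(MINUS_OP);
}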

General rules

• One function per non-terminal
• Non-terminals with multiple choices (like statement) use case or if statements to distinguish
  • Conditional is based on the first set of the non-terminal: the terminals that can distinguish between productions
• An optional list (like id_list) uses a while loop
  • Conditional for continuing with the loop is whether the next terminal(s) are in the first set
• Full parser code in textbook on pages 36-38

Semantic processing

• Want to generate code for a 3-address machine:
  • OP A, B, C performs A op B → C
• Temporary variables may be created to convert complex expressions into three-address code
  • Naming scheme: Temp&1, Temp&2, etc.
  • Example: D = A + B * C becomes

    MULT  C, B, Temp&1
    ADD   A, Temp&1, Temp&2
    STORE Temp&2, D
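To make the three-address target concrete, here is a tiny stand-alone C sketch of an emitter and a temporary allocator producing the D = A + B * C code above by hand (the helper names and the printf-style "assembly" output are assumptions, not the book's code generator):

#include <stdio.h>

static int temp_count = 0;                      /* counts Temp&1, Temp&2, ... */

/* Write the name of a fresh temporary into buf. */
static void new_temp(char *buf) {
    sprintf(buf, "Temp&%d", ++temp_count);
}

/* Emit one three-address instruction: op a, b -> result. */
static void gen(const char *op, const char *a, const char *b, const char *result) {
    printf("%s %s, %s, %s\n", op, a, b, result);
}

int main(void) {                                /* emits the code for D = A + B * C */
    char t1[16], t2[16];
    new_temp(t1); gen("MULT", "C", "B", t1);    /* C * B      -> Temp&1 */
    new_temp(t2); gen("ADD", "A", t1, t2);      /* A + Temp&1 -> Temp&2 */
    printf("STORE %s, D\n", t2);                /* store Temp&2 into D */
    return 0;
}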


Semantic action routines

• To produce code, we call routines during parsing to generate three-address code
• These action routines do one of two things:
  • Collect information about passed symbols for use by other semantic action routines. This information is stored in semantic records.
  • Generate code using information from semantic records and the current parse procedure.
• Note: for this to work correctly, we must parse expressions according to order of operations (i.e., must parse a * expression before a + expression)

Operator Precedence

• Operator precedence can be specified in the CFG
• The CFG can determine the order in which expressions are parsed
• For example:

    expr ::= factor {+ factor}
    factor ::= primary {* primary}
    primary ::= (expr) | ID | LITERAL

• Because +-expressions are composed of *-expressions, we will finish dealing with the * production before we finish with the + production
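A rough C picture of what a semantic record and a gen_infix-style action could look like; the struct layout and names here are illustrative assumptions, and the real records are in the textbook's figure 2.8:

#include <stdio.h>

/* A semantic record describing an operand: an ID, a literal, or a temporary. */
typedef struct {
    enum { ID_EXPR, LITERAL_EXPR, TEMP_EXPR } kind;
    char name[32];                  /* identifier / temporary name, or literal text */
} expr_rec;

static int temps_used = 0;

/* gen_infix-style action: given records for the two operands and an opcode
   string (e.g., "ADD"), emit one three-address instruction and return a
   record describing the temporary that holds the result. */
static expr_rec gen_infix(expr_rec left, const char *op, expr_rec right) {
    expr_rec result;
    result.kind = TEMP_EXPR;
    sprintf(result.name, "Temp&%d", ++temps_used);
    printf("%s %s, %s, %s\n", op, left.name, right.name, result.name);
    return result;
}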


Example

• Annotations are inserted into the grammar, specifying when semantic routines should be called:

    statement ::= ID = expr #assign
    expr ::= term + term #addop
    term ::= ID #id | LITERAL #num

• Consider A = B + 2;
  • num() and id() create semantic records containing ID names and number values
  • addop() generates code for the expression, using information from the num() and id() records, and creates a temporary variable
  • assign() generates code for the assignment using the temporary variable generated by addop()
  • (A sketch of the resulting code follows the next list.)

More in book

• Complete set of routines for recursive descent parser (pages 36–38)
• Semantic records for Micro (page 41, figure 2.8)
• Complete annotated grammar for Micro (page 42, figure 2.9)
• Semantic routines for Micro (pages 43–45)
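For the A = B + 2; example above, the emitted three-address code would come out roughly as follows, using the Temp&n naming scheme from the semantic-processing slide (the exact spelling of the instructions is an assumption):

    ADD   B, 2, Temp&1
    STORE Temp&1, A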

Next time

• Scanners
  • How to specify the tokens for a language
  • How to construct a scanner
  • How to use a scanner generator


Backup slides

Annotated Micro Grammar (fig. 2.9)

Program ::= #start BEGIN Statement-list END
Statement-list ::= Statement {Statement}
Statement ::= ID := Expression; #assign |
              READ ( Id-list ) ; |
              WRITE ( Expr-list ) ;
Id-list ::= Ident #read_id {, Ident #read_id }
Expr-list ::= Expression #write_expr {, Expression #write_expr }
Expression ::= Primary { Add-op Primary #gen_infix }
Primary ::= ( Expression ) | Ident | INTLITERAL #process_literal
Ident ::= ID #process_id
Add-op ::= PLUSOP #process_op | MINUSOP #process_op
System-goal ::= Program SCANEOF #finish

Annotated Micro Grammar

Program ::= #start BEGIN Statement-list END
System-goal ::= Program SCANEOF #finish

Semantic routines in Chap. 2 print information about what the parser has recognized. At #start, nothing has been recognized, so this takes no action. End of parse is recognized by the final production: System-goal ::= Program SCANEOF #finish.

In a production compiler, the #start routine might set up program initialization code (i.e., initialization of heap storage and static storage, initialization of static values, etc.)



Annotated Micro Grammar

Statement-list ::= Statement {Statement}

No semantic actions are associated with this production because the necessary semantic actions associated with statements are done when a statement is recognized.

Annotated Micro Grammar

Statement ::= ID := Expression; #assign |
              READ ( Id-list ) ; |
              WRITE ( Expr-list ) ;
Expr-list ::= Expression #write_expr {, Expression #write_expr }
Expression ::= Primary { Add-op Primary #gen_infix }
Primary ::= ( Expression ) | Ident | INTLITERAL #process_literal

Different semantic actions are used when the parser finds an expression. In Expr-list it is handled with #write_expr, whereas in Primary we choose to do nothing, but we could express a different semantic action if there were a reason to do so.

We know that different productions, or rules of the grammar, are reached in different ways, and can tailor semantic actions (and the grammar) appropriately.


Annotated Micro Grammar

Statement ::= Ident := Expression; #assign |
              READ ( Id-list ) ; |
              WRITE ( Expr-list ) ;
Id-list ::= Ident #read_id {, Ident #read_id }
Ident ::= ID #process_id

Note that in the grammar of Fig. 2.4, there is no Ident nonterminal. By adding a nonterminal Ident, a placeholder is created to take semantic actions as the nonterminal is processed. The programs look syntactically the same, but the additional productions allow the semantics to be richer.

Semantic actions create a semantic record for the ID and thereby create something for read_id to work with.

