
Last Extra: Front-end vs. Back-end

[Figure: compiler phases — Source code → Scanner → Tokens → Parser → Syntax tree → Semantic Routines → IR → Optimizer → IR → Code Generation → Executable. The Scanner, Parser, and Semantic Routines form the front end; the Optimizer and Code Generation form the back end.]

• Scanner + Parser + Semantic actions + (high-level) optimizations are called the front-end of a compiler
• IR-level optimizations and code generation (instruction selection, scheduling, register allocation) are called the back-end of a compiler
• Can build multiple front-ends for a particular back-end
  • e.g., gcc & g++, or the many front-ends that generate CIL
• Can build multiple back-ends for a particular front-end
  • e.g., gcc allows targeting different architectures


The MICRO compiler: a simple example

High level structure

• Single-pass compiler: no intermediate representation
• Scanner tokenizes the input stream, but is called by the parser on demand
• As the parser recognizes constructs, it invokes semantic routines
• Semantic routines build the symbol table on-the-fly and directly generate code for a 3-address virtual machine


The Micro language

• Tokens are defined by regular expressions
• Tokens: BEGIN, END, READ, WRITE, ID, LITERAL, LPAREN, RPAREN, SEMICOLON, COMMA, ASSIGN_OP, PLUS_OP, MINUS_OP, SCANEOF
• Implicit identifier declaration (no need to predeclare variables): ID = [A-Z][A-Z0-9]*
• Literals (numbers): LITERAL = [0-9]+
• Comments (not passed on as tokens): --(Not(\n))*\n
• Program: BEGIN {statements} END

The Micro language

• One data type: all IDs are integers
• Statement: ID := expression
• Expressions are simple arithmetic expressions which can contain identifiers
  • Note: no unary minus
• Input/output: READ(ID, ID, ...) and WRITE(EXPR, EXPR, ...)
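For concreteness, a minimal C declaration of these token kinds might look like the following (the typedef name token_type is an assumption used by the sketches later in these notes, not the textbook's code):

/* Token kinds for Micro; the names follow the token list above. */
typedef enum {
    BEGIN, END, READ, WRITE, ID, LITERAL,
    LPAREN, RPAREN, SEMICOLON, COMMA,
    ASSIGN_OP, PLUS_OP, MINUS_OP, SCANEOF
} token_type;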

Scanner

• Scanner algorithm provided in book (pages 28/29)
• What the scanner can identify corresponds to what the finite automaton for a regular expression can accept
• Identifies the next token in the input stream
  • Read a token (process the finite automaton until an accept state is found)
  • Identify its type (determine which accept state the FA is in)
  • Return type and "value" (e.g., type = LITERAL, value = 5)

Recognizing tokens

• Skip spaces
• If the first non-space character is:
  • letter: read until non-alphanumeric. Check for reserved words ("begin," "end"). Return reserved word or (ID and variable name)
  • digit: read until non-digit. Return LITERAL and number
  • ( ) ; , + : return single character
  • : : next must be =. Return ASSIGN_OP
  • - : if next is also -, skip to end of line; otherwise return MINUS_OP
• "unget" the next character that had to be read to find the end of IDs, reserved words, literals and minus ops
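A rough C sketch of this algorithm, using the token_type declared earlier (the helper names, uppercase keywords, missing bounds checks, and lack of error recovery are simplifying assumptions; the textbook's scanner on pages 28/29 differs in detail):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the next token from 'in', copying its text into token_buf. */
token_type scanner(FILE *in, char *token_buf) {
    int c = fgetc(in);
    while (c != EOF && isspace(c))            /* skip spaces */
        c = fgetc(in);
    if (c == EOF) return SCANEOF;

    if (isalpha(c)) {                         /* letter: reserved word or ID */
        int len = 0;
        do { token_buf[len++] = (char)c; c = fgetc(in); } while (isalnum(c));
        ungetc(c, in);                        /* "unget" the extra character */
        token_buf[len] = '\0';
        if (strcmp(token_buf, "BEGIN") == 0) return BEGIN;
        if (strcmp(token_buf, "END")   == 0) return END;
        if (strcmp(token_buf, "READ")  == 0) return READ;
        if (strcmp(token_buf, "WRITE") == 0) return WRITE;
        return ID;
    }
    if (isdigit(c)) {                         /* digit: LITERAL */
        int len = 0;
        do { token_buf[len++] = (char)c; c = fgetc(in); } while (isdigit(c));
        ungetc(c, in);
        token_buf[len] = '\0';
        return LITERAL;
    }
    switch (c) {                              /* single-character tokens */
    case '(': return LPAREN;
    case ')': return RPAREN;
    case ';': return SEMICOLON;
    case ',': return COMMA;
    case '+': return PLUS_OP;
    case ':':                                 /* ':' must be followed by '=' */
        if (fgetc(in) == '=') return ASSIGN_OP;
        fprintf(stderr, "lexical error\n"); exit(1);
    case '-':                                 /* "--" starts a comment to end of line */
        c = fgetc(in);
        if (c == '-') {
            while ((c = fgetc(in)) != EOF && c != '\n')
                ;
            return scanner(in, token_buf);    /* comment skipped; scan again */
        }
        ungetc(c, in);
        return MINUS_OP;
    default:
        fprintf(stderr, "lexical error\n"); exit(1);
    }
}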


Parsers and Grammars

• Language syntax is usually specified with context-free grammars (CFGs)
• Backus-Naur form (BNF) is the standard notation
• Written as a set of rewrite rules:
  • Non-terminal ::= (set of terminals and non-terminals)
• Terminals are the set of tokens
• Each rule tells how to compose a non-terminal from other non-terminals and terminals

Micro grammar

program ::= BEGIN statement_list END
statement_list ::= statement { statement }
statement ::= ID := expression; | READ( id_list ) | WRITE( expr_list )
id_list ::= ID {, ID}
expr_list ::= expression {, expression}
expression ::= primary {add_op primary}
primary ::= ( expression ) | ID | LITERAL
add_op ::= PLUSOP | MINUSOP
system_goal ::= program SCANEOF


Relating the CFG to a program

• CFGs can produce a program by applying a sequence of productions
• How to produce BEGIN ID := ID + ID; END
• Rewrite by starting with the goal production and replacing non-terminals with the rule's RHS:

  program SCANEOF                            (replace program)
  BEGIN statement_list END                   (replace statement_list)
  BEGIN statement; {statement} END           (drop {statement})
  BEGIN statement; END                       (replace statement)
  BEGIN ID := expression; END                (replace expression)
  BEGIN ID := primary add_op primary; END    (replace 1st primary)
  BEGIN ID := ID add_op primary; END         (replace add_op)
  BEGIN ID := ID + primary; END              (replace primary)
  BEGIN ID := ID + ID; END

How do we go in reverse?

• How do we parse a program given a CFG?
• Start with the goal term and rewrite productions from left to right
  • If the next symbol is a terminal, make sure we match the input token
    • Otherwise, there is a syntax error
  • If it is a non-terminal:
    • If there is a single choice for a production, pick it
    • If there are multiple choices, choose the production that matches the next token(s) in the stream
      • e.g., when parsing statement, could use the production for ID, READ or WRITE
• Note that this means we have to look ahead in the stream to match tokens! (A sketch of the lookahead helpers appears after the quiz grammar below.)

Quiz: how much lookahead?

program ::= BEGIN statement_list END
statement_list ::= statement { statement }
statement ::= ID := expression; | READ( id_list ) | WRITE( expr_list )
id_list ::= ID {, ID}
expr_list ::= expression {, expression}
expression ::= primary {add_op primary}
primary ::= ( expression ) | ID | LITERAL
add_op ::= PLUSOP | MINUSOP
system_goal ::= program SCANEOF
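One common way to provide that lookahead is a one-token buffer on top of the scanner. Here is a minimal sketch, reusing the scanner and token_type assumed earlier; peek_at_match and match are the helper names used by the parser code on the next slides, but this particular buffering code is an illustration, not the textbook's:

#include <stdio.h>
#include <stdlib.h>

static FILE      *src;                 /* the source file being compiled */
static char       token_text[256];     /* text of the buffered token */
static token_type buffered;            /* one-token lookahead buffer */
static int        have_buffered = 0;

/* Peek at the next token without consuming it. */
static token_type peek_at_match(void) {
    if (!have_buffered) {
        buffered = scanner(src, token_text);   /* scanner sketched earlier */
        have_buffered = 1;
    }
    return buffered;
}

/* Consume the next token, which must be of the expected kind. */
static void match(token_type expected) {
    if (peek_at_match() != expected) {
        fprintf(stderr, "syntax error\n");
        exit(1);
    }
    have_buffered = 0;                         /* token consumed */
}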


Recursive descent parsing

• Idea: parse using a set of mutually recursive functions
  • One function per non-terminal
  • Each function attempts to match any terminals in its production
  • If a rule produces non-terminals, call the appropriate function

statement() {
    token = peek_at_match();
    switch (token) {
    case ID:
        match(ID);        //consume ID
        match(ASSIGN);    //consume :=
        expression();     //process non-terminal
        break;
    case READ:
        match(READ);      //consume READ
        match(LPAREN);    //match (
        id_list();        //process non-terminal
        match(RPAREN);    //match )
        break;
    case WRITE:
        match(WRITE);     //consume WRITE
        match(LPAREN);    //match (
        expr_list();      //process non-terminal
        match(RPAREN);    //match )
        break;
    }
    match(SEMICOLON);
}

statement ::= ID := expression; | READ( id_list ) | WRITE( expr_list )

Recursive descent parsing (II)

• How do we parse id_list ::= ID {, ID} ?
• One idea: this is equivalent to id_list ::= ID | ID, id_list

id_list() {
    match(ID);            //consume ID
    if (peek_at_match() == COMMA) {
        match(COMMA);
        id_list();
    }
}

• This is equivalent to the following loop:

id_list() {
    match(ID);            //consume ID
    while (peek_at_match() == COMMA) {
        match(COMMA);
        match(ID);
    }
}

• Note: in both cases, if peek_at_match() isn't COMMA, we don't consume the next token!
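The remaining expression productions can be written in the same style. A sketch follows, assuming the match/peek_at_match helpers above, the PLUS_OP/MINUS_OP spellings from the token list, and a hypothetical syntax_error() routine; this is not the textbook's pages 36-38 code:

expression() {                      /* expression ::= primary {add_op primary} */
    primary();
    while (peek_at_match() == PLUS_OP || peek_at_match() == MINUS_OP) {
        add_op();                   /* consume + or - */
        primary();
    }
}

primary() {                         /* primary ::= ( expression ) | ID | LITERAL */
    switch (peek_at_match()) {
    case LPAREN:
        match(LPAREN);
        expression();
        match(RPAREN);
        break;
    case ID:
        match(ID);
        break;
    case LITERAL:
        match(LITERAL);
        break;
    default:
        syntax_error();             /* next token not in primary's first set */
    }
}

add_op() {                          /* add_op ::= PLUS_OP | MINUS_OP */
    if (peek_at_match() == PLUS_OP) match(PLUS_OP);
    else match(MINUS_OP);
}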

General rules

• One function per non-terminal
• Non-terminals with multiple choices (like statement) use case or if statements to distinguish
  • Conditional is based on the first set of the non-terminal: the terminals that can distinguish between productions
• An optional list (like id_list) uses a while loop
  • Conditional for continuing with the loop is whether the next terminal(s) are in the first set
• Full parser code in textbook on pages 36-38

Semantic processing

• Want to generate code for a 3-address machine:
  • OP A, B, C performs A op B → C
• Temporary variables may be created to convert complex expressions into three-address code
  • Naming scheme: Temp&1, Temp&2, etc.
  • Example: D = A + B * C becomes

    MULT  C, B, Temp&1
    ADD   A, Temp&1, Temp&2
    STORE Temp&2, D
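To make the three-address target concrete, here is a tiny stand-alone C sketch of an emitter and a temporary allocator producing the D = A + B * C code above by hand (the helper names and the printf-style "assembly" output are assumptions, not the book's code generator):

#include <stdio.h>

static int temp_count = 0;                      /* counts Temp&1, Temp&2, ... */

/* Write the name of a fresh temporary into buf. */
static void new_temp(char *buf) {
    sprintf(buf, "Temp&%d", ++temp_count);
}

/* Emit one three-address instruction: op a, b -> result. */
static void gen(const char *op, const char *a, const char *b, const char *result) {
    printf("%s %s, %s, %s\n", op, a, b, result);
}

int main(void) {                                /* emits the code for D = A + B * C */
    char t1[16], t2[16];
    new_temp(t1); gen("MULT", "C", "B", t1);    /* C * B      -> Temp&1 */
    new_temp(t2); gen("ADD", "A", t1, t2);      /* A + Temp&1 -> Temp&2 */
    printf("STORE %s, D\n", t2);                /* store Temp&2 into D */
    return 0;
}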


Semantic action routines

• To produce code, we call routines during parsing to generate three-address code
• These action routines do one of two things:
  • Collect information about passed symbols for use by other semantic action routines. This information is stored in semantic records.
  • Generate code using information from semantic records and the current parse procedure.
• Note: for this to work correctly, we must parse expressions according to order of operations (i.e., must parse a * expression before a + expression)

Operator Precedence

• Operator precedence can be specified in the CFG
• The CFG can determine the order in which expressions are parsed
• For example:

    expr ::= factor {+ factor}
    factor ::= primary {* primary}
    primary ::= (expr) | ID | LITERAL

• Because +-expressions are composed of *-expressions, we will finish dealing with the * production before we finish with the + production
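A rough C picture of what a semantic record and a gen_infix-style action could look like; the struct layout and names here are illustrative assumptions, and the real records are in the textbook's figure 2.8:

#include <stdio.h>

/* A semantic record describing an operand: an ID, a literal, or a temporary. */
typedef struct {
    enum { ID_EXPR, LITERAL_EXPR, TEMP_EXPR } kind;
    char name[32];                  /* identifier / temporary name, or literal text */
} expr_rec;

static int temps_used = 0;

/* gen_infix-style action: given records for the two operands and an opcode
   string (e.g., "ADD"), emit one three-address instruction and return a
   record describing the temporary that holds the result. */
static expr_rec gen_infix(expr_rec left, const char *op, expr_rec right) {
    expr_rec result;
    result.kind = TEMP_EXPR;
    sprintf(result.name, "Temp&%d", ++temps_used);
    printf("%s %s, %s, %s\n", op, left.name, right.name, result.name);
    return result;
}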


Example

• Annotations are inserted into the grammar, specifying when semantic routines should be called:

    statement ::= ID = expr #assign
    expr ::= term + term #addop
    term ::= ID #id | LITERAL #num

• Consider A = B + 2;
  • num() and id() create semantic records containing ID names and number values
  • addop() generates code for the expression, using information from the num() and id() records, and creates a temporary variable
  • assign() generates code for the assignment using the temporary variable generated by addop()
  • (A sketch of the resulting code follows the next list.)

More in book

• Complete set of routines for recursive descent parser (pages 36–38)
• Semantic records for Micro (page 41, figure 2.8)
• Complete annotated grammar for Micro (page 42, figure 2.9)
• Semantic routines for Micro (pages 43–45)
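For the A = B + 2; example above, the emitted three-address code would come out roughly as follows, using the Temp&n naming scheme from the semantic-processing slide (the exact spelling of the instructions is an assumption):

    ADD   B, 2, Temp&1
    STORE Temp&1, A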

Next time

• Scanners
  • How to specify the tokens for a language
  • How to construct a scanner
  • How to use a scanner generator


Backup slides

Annotated Micro Grammar (fig. 2.9)

Program ::= #start BEGIN Statement-list END
Statement-list ::= Statement {Statement}
Statement ::= ID := Expression; #assign |
              READ ( Id-list ) ; |
              WRITE ( Expr-list ) ;
Id-list ::= Ident #read_id {, Ident #read_id }
Expr-list ::= Expression #write_expr {, Expression #write_expr }
Expression ::= Primary { Add-op Primary #gen_infix }
Primary ::= ( Expression ) | Ident | INTLITERAL #process_literal
Ident ::= ID #process_id
Add-op ::= PLUSOP #process_op | MINUSOP #process_op
System-goal ::= Program SCANEOF #finish

Annotated Micro Grammar

Program ::= #start BEGIN Statement-list END
System-goal ::= Program SCANEOF #finish

Semantic routines in Chap. 2 print information about what the parser has recognized. At #start, nothing has been recognized, so this takes no action. End of parse is recognized by the final production: System-goal ::= Program SCANEOF #finish.

In a production compiler, the #start routine might set up program initialization code (i.e., initialization of heap storage and static storage, initialization of static values, etc.)



Annotated Micro Grammar

Statement-list ::= Statement {Statement}

No semantic actions are associated with this production because the necessary semantic actions associated with statements are done when a statement is recognized.

Annotated Micro Grammar

Statement ::= ID := Expression; #assign |
              READ ( Id-list ) ; |
              WRITE ( Expr-list ) ;
Expr-list ::= Expression #write_expr {, Expression #write_expr }
Expression ::= Primary { Add-op Primary #gen_infix }
Primary ::= ( Expression ) | Ident | INTLITERAL #process_literal

Different semantic actions are used when the parser finds an expression. In Expr-list it is handled with #write_expr, whereas in Primary we choose to do nothing, but we could express a different semantic action if there were a reason to do so.

We know that different productions, or rules of the grammar, are reached in different ways, and can tailor semantic actions (and the grammar) appropriately.


Annotated Micro Grammar

Statement ::= Ident := Expression; #assign |
              READ ( Id-list ) ; |
              WRITE ( Expr-list ) ;
Id-list ::= Ident #read_id {, Ident #read_id }
Ident ::= ID #process_id

Note that in the grammar of Fig. 2.4, there is no Ident nonterminal. By adding a nonterminal Ident, a placeholder is created to take semantic actions as the nonterminal is processed. The programs look syntactically the same, but the additional productions allow the semantics to be richer.

Semantic actions create a semantic record for the ID and thereby create something for read_id to work with.

