Anatomy of a Compiler Computer Language Engineering Program (character stream) Lexical Analyzer (Scanner) • How to give instructions to a computer? Token Stream – Programming Languages. Syntax Analyzer (Parser) Parse Tree • How to make the computer carryout the Intermediate Code Generator instructions efficiently? Intermediate Representation Intermediate Code Optimizer – Compilers. Optimized Intermediate Representation Code Generator Assembly code cs480(Prasad) L2Syntax 1 cs480(Prasad) L2Syntax 2

What is a Lexical Analyzer? Lexical Analyzer in Action Source program text Tokens • Examples of Token f o r v a r 1 = 1 0 v a r 1 < = • Operators = + - > ( { := == <> •Keywords if while for int double • Numeric literals 43 6.035 -3.6e10 0x13F3A for ID(“var1”) eq_op Num(10) ID(“var1”) leq_op • Character literals ‘a’ ‘~’ ‘\’’ • String literals “3.142” “Fall” “\”\” = empty” • Partition input program text into sequence of • Examples of non-token tokens, attaching corresponding attributes. • White space space(‘ ’) tab(‘\t’) end-of-line(‘\n’) – E.g., -lexeme “015” token NUM attribute 13 • Comments /*this is not a token*/ • Eliminate white space and comments

cs480(Prasad) L2Syntax 3 cs480(Prasad) L2Syntax 4

1 Syntax and Semantics of a programming Input to and output of a parser language Input: - (123.3 + 23.6) •Syntax Token Stream Parse Tree – What is the structure of the program? minus_op – Textual representation. left_paren_op - • Formally defined using context-free num(123.3) grammars (Backus-Naur Formalism) plus_op • Semantics ( ) num(23.6) – What is the meaning of a program? right_paren_op

• Harder to give a mathematical definition. Syntax Analyzer (Parser) 123.3 23.6 +

cs480(Prasad) L2Syntax 5 cs480(Prasad) L2Syntax 6

Example: A CFG for expressions • Simple arithmetic expressions with + and * Example: A CFG for expressions – 8.2 + 35.6 – 8.32 + 86 * 45.3 – (6.001 + 6.004) * (6.035 * -(6.042 + 6.046))  ( ) • Terminals (or tokens)  - – num for all the numbers  num – plus_op, minus_op, times_op, left_paren_op, right_paren_op  +  * • What is the grammar for all possible expressions?

cs480(Prasad) L2Syntax 7 cs480(Prasad) L2Syntax 8

2 Context-Free Grammars (CFGs) English Language

• Terminals • Letters: a,b,c, … – Symbols for strings or tokens { num, (, ), +, *, - } • Alphabet: { a,b, …, z} U { A, B, …, Z} • Nonterminals – Syntactic variables { , } • Tokens (Terminals) : English words • Start symbol • Nonterminals: , , – A special nonterminal , … • Productions • Context-free aspect: Replacing by – The manner in which terminals and nonterminals are or combined to form strings. • Context-sensitive aspects: Agreement among – A nonterminal in LHS and a string of terminals and non- words w.r.t. number, gender, tense, etc. terminals in RHS.  -

cs480(Prasad) L2Syntax 9 cs480(Prasad) L2Syntax 10

Example Derivation Another Example Derivation  num *  num * *( )  num * ( )  *( )  num * ( )  * ( num )  num * ( num )  * ( num + )  num * ( num + )  num * ( num + num )  * ( num + num )  num * ( num + num ) num * ( num + num )

cs480(Prasad) L2Syntax 11 cs480(Prasad) L2Syntax 12

3 Parse/Derivation Tree Parse/Derivation Tree Example • Graphical Representation of the parsed structure num * ( num + num ) • Shows the sequence of derivations performed – Internal nodes are non-terminals.

– Leaves are terminals. – Each parent node is LHS and the children are RHS of a production. num * () • Abstracts the details of sequencing of the rule applications, but preserves decomposition of a non-terminal. num num cs480(Prasad) L2Syntax 13 +

Expression – Expression – One possible parse tree Another possible parse tree num + num * num num + num * num

num + * num

num* num num+ num

124 + (23.5 * 86) = 2145 (124 + 23.5) * 86 = 12685

cs480(Prasad) L2Syntax 15 cs480(Prasad) L2Syntax 16

4 Same string – Two distinct parse (derivation) trees Ambiguous Grammar num + num * num • Applying different derivation orders produces different parse trees. !AMBIGUITY! – This can lead to ambiguous/unexpected results. – Note that multiple derivations leading to the same parse tree is not a problem.

• A CFG is ambiguous if the same string can num + * num be associated with two distinct parse trees. num* num num+ num – E.g., The expression grammar can be shown to be ambiguous using expressions containing binary infix operators and non-fully parenthesized expressions. 124 + (23.5 * 86) = 2145 (124 + 23.5) * 86 = 12685

cs480(Prasad) L2Syntax 17 cs480(Prasad) L2Syntax 18

Eliminating Ambiguity Removing Ambiguity + • Sometimes rewriting a grammar to reflect operator precedence with additional * nonterminals will eliminate ambiguity.  * more binding than +.  num  && has precedence over ||.  ( )  ( )  num • One can view the rewrite as “breaking the  + symmetry” through “stratification” .  *

cs480(Prasad) L2Syntax 19 cs480(Prasad) L2Syntax 20

5 Extended BNF (Backus Naur Formalism) Expression Parser Fragment ( + ) * (String => Parse Tree) Node expr() { ( * ) * // PRE: Expects lookahead token. // POST: Consume an Expression  num | ( ) // and update Lookahead token. Node temp = term(); ( ) * while ( inTok.ttype == '+') { inTok.nextToken(); Kleene Star :  ( ) Node temp1 = term(); Zero or more temp = new OpNode(temp,'+',temp1); occurrences  num }  + | * return temp; }

cs480(Prasad) L2Syntax 21 cs480(Prasad) L2Syntax 22

Arithmetic Expressions (ambiguous) (with variables) Resolving Ambiguity in Expressions + | * • Different operators : precedence relation | ( ) | • Same operator : associativity relation | 5-2-3 2**3**4

5-2 - 3 5- 2-3  x | y | z 2^12 2^81 = 0 = 6

 0 | 1 | 2 Left Associative ( (5-2)-3) Right Associative (2**(3**4))

cs480(Prasad) L2Syntax 23 cs480(Prasad) L2Syntax 24

6 C++ Operator Precedence and Associativity

Level Operator Function Level Operator Function 17R :: global (unary) 12L + , - arithmetic operators 17L :: class scope (binary) 11L << , >> bitwise shift 16L -> , .member selectors 10L < , <= , > , >= relational operators 16L [] array index 9L == , != equaltity, inequality 16L () function call 8L & bitwise AND 16L () type construction 15R sizeof size in bytes 7L ^ bitwise XOR 15R ++ , -- increment, decrement 6L | bitwise OR 15R ~ bitwise NOT 5L && logical AND 15R ! logical NOT 4L || logical OR 15R + , - uniary minus, plus 3L ?: arithmetic if 15R * , & dereference, address-of 2R = , *= , /= , %= , += , -= 15R () type conversion (cast) <<= , >>= , &= , |= , ^= assignment 15R new , delete free store management operators 14L ->* , .* member pointer select 1L , comma operator 13L * , / , % multiplicative operators cs480(Prasad) L2Syntax 25

7