Scanning and Parsing

Announcements
– Project 1 is 5% of total grade
– Project 2 is 10% of total grade
– Project 3 is 15% of total grade
– Project 4 is 10% of total grade

Today
– Outline of planned topics for course
– Overall structure of a compiler
– Lexical analysis (scanning)
– Syntactic analysis (parsing)

Structure of a Typical Interpreter / Compiler

Analysis
  character stream
    → lexical analysis → tokens (“words”)
    → syntactic analysis → AST (“sentences”)
    → semantic analysis → annotated AST
    → interpreter

Synthesis
  IR code generation → IR
    → optimization → IR
    → code generation → target language

CS553 Lecture Scanning and Parsing 2 CS553 Lecture Scanning and Parsing 3

Lexical Analysis (Scanning)

Break character stream into tokens (“words”)
– Tokens, lexemes, and patterns
– Lexical analyzers are usually automatically generated from patterns (regular expressions)

Examples

  token        lexeme(s)            pattern
  const        const                const
  if           if                   if
  relation     <, <=, =, !=, ...    < | <= | = | != | ...
  identifier   foo, index           [a-zA-Z_]+[a-zA-Z0-9_]*
  number       3.14159, 570         [0-9]+ | [0-9]*.[0-9]+
  string       "hi", "mom"          ".*"

const pi := 3.14159 ⇒ const, identifier(pi), assign, number(3.14159)

Interaction Between Scanning and Parsing

  character stream → Lexical analyzer → token → Parser → parse tree or AST

The parser pulls tokens from the lexical analyzer with lexer.next() and lexer.peek().
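The token table and the `const pi := 3.14159` example above can be sketched as a small regular-expression scanner. This is only an illustration, not generated lexer code: the `Lexer` class, the `tokenize` helper, the `assign` pattern for `:=`, and the `"[^"]*"` string pattern (used instead of the slide's greedy `".*"`) are assumptions made for the sketch. The `next()` and `peek()` methods mirror the `lexer.next()` / `lexer.peek()` calls in the diagram. Where two patterns match at the same position, the longest lexeme wins, and the earlier rule wins a tie.

```python
import re

# Patterns in priority order (an earlier rule wins a tie), loosely following
# the table above. "assign" is an assumed extra rule for ":=".
TOKEN_SPEC = [
    ("const",      r"const"),
    ("if",         r"if"),
    ("assign",     r":="),
    ("relation",   r"<=|!=|<|="),               # only the operators listed on the slide
    ("number",     r"[0-9]*\.[0-9]+|[0-9]+"),
    ("identifier", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("string",     r'"[^"]*"'),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in TOKEN_SPEC]
WHITESPACE = re.compile(r"\s+")


class Lexer:
    """Hands out (token, lexeme) pairs one at a time."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.lookahead = None  # token buffered by peek()

    def _scan(self):
        blank = WHITESPACE.match(self.text, self.pos)
        if blank:
            self.pos = blank.end()
        if self.pos >= len(self.text):
            return ("eof", "")
        best = None
        for name, regex in COMPILED:
            m = regex.match(self.text, self.pos)
            # Strictly longer lexemes replace the current best, so ties
            # stay with the rule that appears first in TOKEN_SPEC.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise SyntaxError(f"unexpected character {self.text[self.pos]!r}")
        self.pos += len(best[1])
        return best

    def peek(self):
        """Return the next token without consuming it."""
        if self.lookahead is None:
            self.lookahead = self._scan()
        return self.lookahead

    def next(self):
        """Return the next token and consume it."""
        token = self.peek()
        self.lookahead = None
        return token


def tokenize(text):
    lexer = Lexer(text)
    tokens = []
    while lexer.peek()[0] != "eof":
        tokens.append(lexer.next())
    return tokens
```

With these rules, `tokenize("const pi := 3.14159")` yields the four tokens from the example, while `tokenize("constant")` yields a single identifier because the longer match beats the `const` keyword.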


Specifying Tokens with SableCC

Theory meets practice:
– Regular expressions, formal languages, grammars, parsing…

SableCC example input file:

  Package minijava;

  Helpers
    all = [0..0xFFFF];
    cr = 13;
    digit = ['0'..'9'];
    letter = ['a'..'z'] | ['A'..'Z'];
    underscore = '_';
    not_star = [all - '*'];
    not_star_slash = [not_star - '/'];
    c_comment = '/*' not_star* ('*' (not_star_slash not_star*)?)* '*/';

  Tokens
    t_plus = '+';
    t_if = 'if';
    t_id = letter (letter | digit | underscore)*;
    t_blank = (' ' | eol | tab)+;
    t_comment = c_comment | line_comment;

  Ignored Tokens
    t_blank,
    t_comment;

Recognizing Tokens with DFAs

  'if':                       (1) --i--> (4) --f--> (5) t_if
  letter (letter | digit)*:   (1) --letter--> (2) t_id, with a loop on (2) for letter or digit

Ambiguity due to matching substrings
– Longest match
– Rule priority
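As a rough illustration of how a DFA resolves the ambiguity between `t_if` and `t_id`, the sketch below merges the two automata into one hand-written DFA and remembers the last accepting state it passed through. The merged transition function, the reuse of the state numbers 1, 2, 4, and 5, and the name `longest_match` are assumptions for this sketch; SableCC builds its own tables. State 5 is reached by both rules on the input `if`, and it is labelled `t_if` because that rule is listed first (rule priority).

```python
# Accepting states and the token each one reports.
# State 4 (after a lone "i") and state 2 accept as identifiers;
# state 5 (after "if") reports t_if because t_if has priority over t_id.
ACCEPTING = {2: "t_id", 4: "t_id", 5: "t_if"}


def step(state, ch):
    """Return the next DFA state, or None when no transition exists."""
    is_letter = ch.isascii() and ch.isalpha()
    is_tail = is_letter or (ch.isascii() and ch.isdigit()) or ch == "_"
    if state == 1:                      # start state
        if ch == "i":
            return 4
        return 2 if is_letter else None
    if state == 4:                      # seen "i"
        if ch == "f":
            return 5
        return 2 if is_tail else None
    if state in (2, 5):                 # inside an identifier, or just after "if"
        return 2 if is_tail else None
    return None


def longest_match(text, start=0):
    """Run the DFA from `start` and return (token, lexeme) for the longest match."""
    state = 1
    pos = start
    last_accept = None
    while pos < len(text):
        state = step(state, text[pos])
        if state is None:
            break
        pos += 1
        if state in ACCEPTING:
            # Keep overwriting: the final value is the longest accepted prefix.
            last_accept = (ACCEPTING[state], text[start:pos])
    return last_accept
```

Here `longest_match("if")` reports `t_if`, whereas `longest_match("iffy")` keeps reading and reports `t_id` for the whole word, which is the longest-match rule in action.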


Syntactic Analysis (Parsing)

Impose structure on token stream
– Limited to syntactic structure (⇒ high-level)
– Structure usually represented with an abstract syntax tree (AST)
– Parsers are usually automatically generated from context-free grammars (e.g., bison, cup, javacc, sablecc)

Example

  for i = 1 to 10 do
    a[i] = x * 5;

Token stream:

  for id(i) equal number(1) to number(10) do
  id(a) lbracket id(i) rbracket equal id(x) times number(5) semi

AST:

  for
    i
    1
    10
    asg
      arr
        a
        i
      tms
        x
        5

Interaction Between Scanning and Parsing

  character stream → Lexical analyzer → token → Parser → parse tree or AST

The parser pulls tokens from the lexical analyzer with lexer.next() and lexer.peek().
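To make the step from token stream to AST concrete, here is a minimal hand-written parser for just this one statement shape. The slides do not give a grammar for the loop, so the grammar implied by the methods below, the `Parser` class, and the tuple encoding of AST nodes are all assumptions for this sketch; in the course such a parser would normally be generated from a grammar.

```python
# Token stream from the example, as (kind, value) pairs.
TOKENS = [
    ("for", None), ("id", "i"), ("equal", None), ("number", 1),
    ("to", None), ("number", 10), ("do", None),
    ("id", "a"), ("lbracket", None), ("id", "i"), ("rbracket", None),
    ("equal", None), ("id", "x"), ("times", None), ("number", 5),
    ("semi", None),
]


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Kind of the next token, or "eof" at the end of input."""
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else "eof"

    def expect(self, kind):
        """Consume a token of the given kind and return its value."""
        if self.peek() != kind:
            raise SyntaxError(f"expected {kind}, found {self.peek()}")
        value = self.tokens[self.pos][1]
        self.pos += 1
        return value

    def parse_for(self):
        # for id equal number to number do assignment
        self.expect("for")
        var = self.expect("id")
        self.expect("equal")
        low = self.expect("number")
        self.expect("to")
        high = self.expect("number")
        self.expect("do")
        return ("for", var, low, high, self.parse_assign())

    def parse_assign(self):
        # operand equal product semi
        target = self.parse_operand()
        self.expect("equal")
        value = self.parse_product()
        self.expect("semi")
        return ("asg", target, value)

    def parse_product(self):
        # operand (times operand)*
        left = self.parse_operand()
        while self.peek() == "times":
            self.expect("times")
            left = ("tms", left, self.parse_operand())
        return left

    def parse_operand(self):
        # number | id | id lbracket product rbracket
        if self.peek() == "number":
            return self.expect("number")
        name = self.expect("id")
        if self.peek() == "lbracket":
            self.expect("lbracket")
            index = self.parse_product()
            self.expect("rbracket")
            return ("arr", name, index)
        return name


ast = Parser(TOKENS).parse_for()
```

The resulting `ast` has the same shape as the tree on the slide: a `for` node whose last child is an `asg` node with an `arr` target and a `tms` value.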


Bottom-Up Parsing: Shift-Reduce

Grammar
  (1) S -> E
  (2) E -> E + T
  (3) E -> T
  (4) T -> id

Input: a + b + c

Rightmost derivation: expand rightmost non-terminals first

  S -> E
    -> E + T
    -> E + id
    -> E + T + id
    -> E + id + id
    -> T + id + id
    -> id + id + id

SableCC, yacc, and bison generate shift-reduce parsers:
– LALR(1): look-ahead, left-to-right, rightmost derivation in reverse, 1 symbol lookahead
– LALR is a parsing table construction method, smaller tables than canonical LR

Reference: Barbara Ryder’s 198:515 lecture notes

Shift-Reduce Parsing Example

  Stack        Input          Action
  $            a + b + c      shift
  $ a          + b + c        reduce (4)
  $ T          + b + c        reduce (3)
  $ E          + b + c        shift
  $ E +        b + c          shift
  $ E + b      + c            reduce (4)
  $ E + T      + c            reduce (2)
  $ E          + c            shift
  $ E +        c              shift
  $ E + c                     reduce (4)
  $ E + T                     reduce (2)
  $ E                         reduce (1)
  $ S                         accept

Reference: Barbara Ryder’s 198:515 lecture notes
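The shift-reduce trace for `a + b + c` can be reproduced with a short sketch. Note the simplification: instead of an LALR(1) table, the hypothetical `choose_reduction` function below hard-codes the handle checks for this one four-rule grammar, so it illustrates the shift and reduce moves rather than how a generated parser decides between them. All names are assumptions for the sketch.

```python
# Grammar rules, numbered as on the slide: rule -> (left side, right side).
RULES = {
    1: ("S", ["E"]),
    2: ("E", ["E", "+", "T"]),
    3: ("E", ["T"]),
    4: ("T", ["id"]),
}


def choose_reduction(stack, lookahead):
    """Return the rule number to reduce by, or None if the parser should shift."""
    if stack and stack[-1] == "id":
        return 4
    if stack[-3:] == ["E", "+", "T"]:
        return 2
    if stack == ["T"]:
        return 3
    if stack == ["E"] and lookahead is None:
        return 1
    return None


def shift_reduce(tokens):
    """Parse tokens such as ["a", "+", "b"] and return the list of actions taken."""
    stack, rest, actions = [], list(tokens), []
    while True:
        lookahead = rest[0] if rest else None
        if stack == ["S"] and lookahead is None:
            actions.append("accept")
            return actions
        rule = choose_reduction(stack, lookahead)
        if rule is not None:
            left, right = RULES[rule]
            del stack[len(stack) - len(right):]   # pop the handle
            stack.append(left)                    # push the left-hand side
            actions.append(f"reduce ({rule})")
        elif rest:
            token = rest.pop(0)
            # Every token other than "+" is treated as an identifier.
            stack.append("+" if token == "+" else "id")
            actions.append("shift")
        else:
            raise SyntaxError("input ended in the middle of a phrase")


trace = shift_reduce(["a", "+", "b", "+", "c"])
```

For this input, `trace` lists the same thirteen actions as the Action column of the table, ending in `accept`.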

Shift-Reduce Parsing Example (precedence problem)

Grammar
  (1) S -> E
  (2) E -> E + T
  (3) E -> E * T
  (4) E -> T
  (5) T -> id

  Stack        Input          Action
  $            a + b * c      shift

Reference: Barbara Ryder’s 198:515 lecture notes

Syntax-directed Translation: AST Construction example

Grammar with production rules

  S: E          { $$ = $1; };
  E: E '+' T    { $$ = new node("+", $1, $3); }
   | T          { $$ = $1; }
   ;
  T: T_ID       { $$ = new leaf("id", $1); };

Implicit parse tree for a+b+c

  S
    E
      E
        E
          T
            T_ID (a)
        +
        T
          T_ID (b)
      +
      T
        T_ID (c)

AST for a+b+c

  +
    +
      a
      b
    c
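The semantic actions above can be mimicked with a value stack that runs in parallel with the symbol stack: each reduction pops the values for `$1`..`$n` and pushes the value for `$$`. The sketch below does this for the grammar with both `+` and `*` at the same level. As before, the handle checks are hard-coded rather than table-driven, and the function name and the tuple encoding of `node` and `leaf` are assumptions.

```python
OPERATORS = ("+", "*")


def parse_to_ast(tokens):
    """Shift-reduce parse with a value stack; returns the AST built for S."""
    symbols, values, rest = [], [], list(tokens)
    while True:
        if symbols and symbols[-1] == "id":
            # T: T_ID { $$ = new leaf("id", $1); }
            symbols[-1] = "T"
            values[-1] = ("id", values[-1])
        elif (len(symbols) >= 3 and symbols[-3] == "E"
              and symbols[-2] in OPERATORS and symbols[-1] == "T"):
            # E: E op T { $$ = new node(op, $1, $3); }
            right = values.pop()
            op = values.pop()
            left = values.pop()
            del symbols[-3:]
            symbols.append("E")
            values.append((op, left, right))
        elif symbols == ["T"]:
            # E: T { $$ = $1; }  (the value is passed through unchanged)
            symbols[-1] = "E"
        elif symbols == ["E"] and not rest:
            # S: E { $$ = $1; }  (accept and return the finished tree)
            return values[0]
        elif rest:
            token = rest.pop(0)
            symbols.append(token if token in OPERATORS else "id")
            values.append(token)
        else:
            raise SyntaxError("input ended in the middle of a phrase")


sum_ast = parse_to_ast(["a", "+", "b", "+", "c"])
mixed_ast = parse_to_ast(["a", "+", "b", "*", "c"])
```

`sum_ast` is the left-leaning tree shown for a+b+c. `mixed_ast` comes out as `("*", ("+", a, b), c)`: because `+` and `*` share one grammar level, `a + b * c` is grouped as `(a + b) * c`, which is one way to read the precedence problem named on the earlier slide.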

Using SableCC to specify grammar and generate AST

  Productions

    cst_program {-> program} =
      cst_main_class cst_class_decl*
        {-> New program(cst_main_class.main_class,[cst_class_decl.class_decl])} ;

    cst_exp_list {-> exp* } =
      {many_rule} cst_exp cst_exp_rest*
        {-> [cst_exp.exp, cst_exp_rest.exp] }
      | {empty_rule}
        {-> [] }
      ;

    cst_exp_rest {-> exp* } = t_comma cst_exp
        {-> [cst_exp.exp] };

  Abstract Syntax Tree

    program =
      main_class [class_decls]:class_decl*;

    exp =
      {call} exp t_id [args]:exp* | ...

Parsing Terms

CFG (Context-free Grammar)
– production rule
– terminal
– nonterminal
– FOLLOW(X): “the set of terminals that can immediately follow X”

BNF (Backus-Naur Form) and EBNF (Extended BNF): equivalent to CFGs
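To make FOLLOW(X) concrete, the following sketch computes FIRST and FOLLOW sets by fixed-point iteration for the small expression grammar used in the shift-reduce example (S -> E, E -> E + T | T, T -> id). The list-of-pairs grammar encoding, the function names, and the use of `$` as an end-of-input marker are assumptions made for illustration.

```python
END = "$"  # end-of-input marker (an assumed convention)

# Each production is (left side, right side); an empty list means epsilon.
GRAMMAR = [
    ("S", ["E"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["id"]),
]


def nonterminals(grammar):
    return {left for left, _ in grammar}


def first_sets(grammar):
    """Return (FIRST sets, set of nullable nonterminals)."""
    nts = nonterminals(grammar)
    first = {nt: set() for nt in nts}
    nullable = set()
    changed = True
    while changed:
        changed = False
        for left, right in grammar:
            size_before = len(first[left])
            was_nullable = left in nullable
            for sym in right:
                if sym not in nts:          # terminal: it starts the phrase
                    first[left].add(sym)
                    break
                first[left] |= first[sym]
                if sym not in nullable:
                    break
            else:
                # No break: every symbol (possibly none) can derive epsilon.
                nullable.add(left)
            if len(first[left]) != size_before or (left in nullable) != was_nullable:
                changed = True
    return first, nullable


def follow_sets(grammar, start):
    """Return FOLLOW(X) for every nonterminal X."""
    nts = nonterminals(grammar)
    first, nullable = first_sets(grammar)
    follow = {nt: set() for nt in nts}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for left, right in grammar:
            # trailer = terminals that can appear right after the symbol
            # currently being visited, scanning the right side backwards.
            trailer = set(follow[left])
            for sym in reversed(right):
                if sym in nts:
                    if not trailer <= follow[sym]:
                        follow[sym] |= trailer
                        changed = True
                    trailer = (trailer | first[sym]) if sym in nullable else set(first[sym])
                else:
                    trailer = {sym}
    return follow


follow = follow_sets(GRAMMAR, "S")
```

For this grammar, FOLLOW(E) and FOLLOW(T) both come out as `{"+", "$"}`, and FOLLOW(S) is `{"$"}`.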


Parsing Terms cont …

Top-down parsing
– LL(1): left-to-right reading of tokens, leftmost derivation, 1 symbol look-ahead
– Predictive parser: an efficient non-backtracking top-down parser that can handle LL(1)
– More generally recursive descent parsing may involve backtracking

Bottom-up Parsing
– LR(1): left-to-right reading of tokens, rightmost derivation in reverse, 1 symbol lookahead
– Shift-reduce parsers: for example, bison, yacc, and SableCC generated parsers
– Methods for producing an LR parsing table
  – SLR, simple LR
  – Canonical LR, most powerful
  – LALR(1)

Concepts

Compilation stages in a compiler
– Scanning, parsing, semantic analysis, intermediate code generation, optimization, code generation

Lexical analysis or scanning
– Tools: SableCC, lex, flex, etc.

Syntactic analysis or parsing
– Tools: SableCC, yacc, bison, etc.
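For contrast with the shift-reduce examples, a table-driven LL(1) predictive parser might look like the sketch below. The left-recursive rule E -> E + T is not LL(1), so the sketch assumes the usual rewritten grammar E -> T E', E' -> + T E' | ε, T -> id. The parse table was filled in by hand for this grammar, and every name here is an assumption for illustration.

```python
# LL(1) parse table: (nonterminal, lookahead) -> right side to expand.
# An empty list stands for the epsilon production.
TABLE = {
    ("E", "id"): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"],
    ("E'", "$"): [],
    ("T", "id"): ["id"],
}
NONTERMINALS = {"E", "E'", "T"}


def ll1_parse(tokens):
    """Parse a token list and return the productions used, in leftmost-derivation order."""
    rest = list(tokens) + ["$"]      # "$" marks end of input
    stack = ["$", "E"]               # start symbol on top of the end marker
    derivation = []
    while stack:
        top = stack.pop()
        lookahead = rest[0]
        if top in NONTERMINALS:
            right = TABLE.get((top, lookahead))
            if right is None:
                raise SyntaxError(f"no rule for {top} on {lookahead!r}")
            derivation.append(f"{top} -> {' '.join(right) or 'ε'}")
            # Push the right side reversed so its first symbol ends up on top.
            stack.extend(reversed(right))
        elif top == lookahead:
            rest.pop(0)              # terminal matches: consume it
        else:
            raise SyntaxError(f"expected {top!r}, found {lookahead!r}")
    return derivation


steps = ll1_parse(["id", "+", "id"])
```

Each step is chosen from the single lookahead token without backtracking, and `steps` records the leftmost derivation, where the shift-reduce parser traced a rightmost derivation in reverse.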


Next Time

Lecture
– More undergraduate compilers review

Language Implementation Timeline

For entertainment purposes only!

[Figure: timeline of language implementations, spanning the ‘50s through 2010; the individual labels are not recoverable from the extracted text.]
