Scanning and Parsing

Announcements
– Project 1 is 5% of total grade
– Project 2 is 10% of total grade
– Project 3 is 15% of total grade
– Project 4 is 10% of total grade

Today
– Outline of planned topics for course
– Overall structure of a compiler
– Lexical analysis (scanning)
– Syntactic analysis (parsing)

Structure of a Typical Interpreter Compiler

Analysis
character stream → lexical analysis → tokens (“words”) → syntactic analysis → AST (“sentences”) → semantic analysis → annotated AST → interpreter

Synthesis
annotated AST → IR code generation → IR → optimization → IR → code generation → target language
CS553 Lecture Scanning and Parsing 2 CS553 Lecture Scanning and Parsing 3
Lexical Analysis (Scanning)

Break character stream into tokens (“words”)
– Tokens, lexemes, and patterns
– Lexical analyzers are usually automatically generated from patterns (regular expressions) (e.g., lex)

Examples

token        lexeme(s)        pattern
const        const            const
if           if               if
relation     <,<=,=,!=,...    < | <= | = | != | ...
identifier   foo,index        [a-zA-Z_]+[a-zA-Z0-9_]*
number       3.14159,570      [0-9]+ | [0-9]*\.[0-9]+
string       "hi", "mom"      "[^"]*"

const pi := 3.14159 ⇒ const, identifier(pi), assign, number(3.14159)

Interaction Between Scanning and Parsing

character stream → Lexical analyzer → token → Parser → parse tree or AST
The parser pulls tokens from the lexical analyzer with lexer.next() and lexer.peek().
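The token/lexeme/pattern examples and the lexer.next() / lexer.peek() interface can be sketched together in Python. This is a minimal hand-written illustration, not the output of lex or SableCC. The `Token` and `Lexer` names and the `blank` token are assumptions made for the example. The patterns are tried in list order, so the keywords come before the identifier pattern.

```python
import re
from typing import Iterator, NamedTuple, Optional


class Token(NamedTuple):
    kind: str    # token class, e.g. "identifier"
    lexeme: str  # the matched source text


# (token kind, pattern) pairs, tried in order: keywords come before
# identifiers so that "const" is not scanned as an identifier.
TOKEN_SPEC = [
    (kind, re.compile(pattern))
    for kind, pattern in [
        ("const", r"const\b"),
        ("if", r"if\b"),
        ("number", r"[0-9]*\.[0-9]+|[0-9]+"),
        ("identifier", r"[a-zA-Z_][a-zA-Z0-9_]*"),
        ("assign", r":="),
        ("relation", r"<=|!=|<|="),
        ("string", r'"[^"]*"'),
        ("blank", r"\s+"),
    ]
]


class Lexer:
    """Hands tokens to a parser one at a time via next() and peek()."""

    def __init__(self, text: str) -> None:
        self._tokens = self._scan(text)
        self._lookahead: Optional[Token] = None

    def _scan(self, text: str) -> Iterator[Token]:
        pos = 0
        while pos < len(text):
            for kind, regex in TOKEN_SPEC:
                match = regex.match(text, pos)
                if match:
                    break
            else:
                raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
            pos = match.end()
            if kind != "blank":  # whitespace is recognized but not returned
                yield Token(kind, match.group())

    def peek(self) -> Optional[Token]:
        """Return the next token without consuming it (None at end of input)."""
        if self._lookahead is None:
            self._lookahead = next(self._tokens, None)
        return self._lookahead

    def next(self) -> Optional[Token]:
        """Return the next token and consume it (None at end of input)."""
        token = self.peek()
        self._lookahead = None
        return token


lexer = Lexer("const pi := 3.14159")
tokens = []
while lexer.peek() is not None:
    tokens.append(lexer.next())
print(tokens)
```

Scanning lazily through a generator mirrors the diagram: the parser drives the lexer and asks for one token at a time instead of receiving the whole token list up front.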
Specifying Tokens with SableCC

Theory meets practice:
– Regular expressions, formal languages, grammars, parsing…

SableCC example input file:

    Package minijava;

    Helpers
      all = [0..0xFFFF];
      cr = 13;
      digit = ['0'..'9'];
      letter = ['a'..'z'] | ['A'..'Z'];
      underscore = '_';
      not_star = [all - '*'];
      not_star_slash = [not_star - '/'];
      c_comment = '/*' not_star* ('*' (not_star_slash not_star*)?)* '*/';

    Tokens
      t_plus = '+';
      t_if = 'if';
      t_id = letter (letter | digit | underscore)*;
      t_blank = (' ' | eol | tab)+;
      t_comment = c_comment | line_comment;

    Ignored Tokens
      t_blank,
      t_comment;

Recognizing Tokens with DFAs

'if':
    state 1 --i--> state 4 --f--> state 5 (accepts t_if)

letter (letter | digit)*:
    state 1 --letter--> state 2 (accepts t_id)
    state 2 --letter or digit--> state 2

Ambiguity due to matching substrings
– Longest match
– Rule priority
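Longest match and rule priority can be sketched in Python with a small hand-built DFA. The slide draws the automata for 'if' and for letter (letter | digit)* separately. Merging them into one DFA that reuses the state numbers 1, 2, 4, and 5 is an assumption made for this example, as are the names `step`, `ACCEPT`, and `longest_match`. Rule priority shows up in the accepting table: state 5 reports t_if even though "if" also fits the t_id pattern.

```python
from typing import Optional, Tuple


def char_class(ch: str) -> str:
    """Classify a character the way the DFA edges are labelled."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def step(state: int, ch: str) -> Optional[int]:
    """Transition function of the merged DFA; None means no edge."""
    cls = char_class(ch)
    if state == 1:
        if ch == "i":
            return 4
        if cls == "letter":
            return 2
    elif state == 4:
        if ch == "f":
            return 5
        if cls in ("letter", "digit"):
            return 2
    elif state in (2, 5):
        if cls in ("letter", "digit"):
            return 2
    return None


# Accepting states. State 5 is reached only by exactly "if"; rule
# priority gives it to t_if rather than t_id.
ACCEPT = {2: "t_id", 4: "t_id", 5: "t_if"}


def longest_match(text: str, start: int = 0) -> Optional[Tuple[str, str]]:
    """Run the DFA as far as possible and return the last accepted prefix."""
    state = 1
    last = None  # (token, end position) of the longest accepted prefix
    pos = start
    while pos < len(text):
        nxt = step(state, text[pos])
        if nxt is None:
            break
        state = nxt
        pos += 1
        if state in ACCEPT:
            last = (ACCEPT[state], pos)
    if last is None:
        return None
    token, end = last
    return token, text[start:end]


print(longest_match("if"))    # rule priority: keyword wins
print(longest_match("iffy"))  # longest match: one identifier, not "if" + "fy"
```

Remembering the last accepting position rather than stopping at the first one is what implements longest match: the scanner keeps running through "if" and only commits once no further transition exists.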
Syntactic Analysis (Parsing)

Impose structure on token stream
– Limited to syntactic structure (⇒ high-level)
– Structure usually represented with an abstract syntax tree (AST)
– Parsers are usually automatically generated from context-free grammars (e.g., yacc, bison, cup, javacc, sablecc)

Example

    for i = 1 to 10 do
      a[i] = x * 5;

Token stream:

    for id(i) equal number(1) to number(10) do
    id(a) lbracket id(i) rbracket equal id(x) times number(5) semi

AST:

    for
      i
      1
      10
      asg
        arr
          a
          i
        tms
          x
          5

Interaction Between Scanning and Parsing

character stream → Lexical analyzer → token → Parser → parse tree or AST
The parser pulls tokens from the lexical analyzer with lexer.next() and lexer.peek().
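As a small illustration of the AST as a data structure, the tree for the for-loop example can be written out in Python. The `Node` class and its `render` method are hypothetical names for this sketch. The node labels (for, asg, arr, tms) are the ones used in the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One AST node: a label plus an ordered list of children."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def render(self) -> str:
        """Render the subtree as label(child, child, ...)."""
        if not self.children:
            return self.label
        inner = ", ".join(child.render() for child in self.children)
        return f"{self.label}({inner})"


# AST for:  for i = 1 to 10 do a[i] = x * 5;
loop = Node("for", [
    Node("i"),
    Node("1"),
    Node("10"),
    Node("asg", [
        Node("arr", [Node("a"), Node("i")]),
        Node("tms", [Node("x"), Node("5")]),
    ]),
])

print(loop.render())  # for(i, 1, 10, asg(arr(a, i), tms(x, 5)))
```

Tokens that only carry syntax (equal, to, do, the brackets, and semi) do not appear in the tree. That is the difference between the AST and a full parse tree.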
Bottom-Up Parsing: Shift-Reduce

Grammar
    (1) S -> E
    (2) E -> E + T
    (3) E -> T
    (4) T -> id

Input: a + b + c

    S -> E
      -> E + T
      -> E + id
      -> E + T + id
      -> E + id + id
      -> T + id + id
      -> id + id + id

Rightmost derivation: expand rightmost non-terminals first

SableCC, yacc, and bison generate shift-reduce parsers:
– LALR(1): look-ahead, left-to-right, rightmost derivation in reverse, 1 symbol lookahead
– LALR is a parsing table construction method, smaller tables than canonical LR

Reference: Barbara Ryder’s 198:515 lecture notes

Shift-Reduce Parsing Example

Grammar
    (1) S -> E
    (2) E -> E + T
    (3) E -> T
    (4) T -> id

    Stack        Input        Action
    $            a + b + c    shift
    $ a          + b + c      reduce (4)
    $ T          + b + c      reduce (3)
    $ E          + b + c      shift
    $ E +        b + c        shift
    $ E + b      + c          reduce (4)
    $ E + T      + c          reduce (2)
    $ E          + c          shift
    $ E +        c            shift
    $ E + c                   reduce (4)
    $ E + T                   reduce (2)
    $ E                       reduce (1)
    $ S                       accept

Reference: Barbara Ryder’s 198:515 lecture notes
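The stack/input/action trace for a + b + c can be reproduced with a short Python sketch. This is a simplification: the shift and reduce decisions are hand-coded for this four-rule grammar, whereas a generated parser would look them up in an LALR(1) table. The names `next_action` and `parse` are hypothetical, and identifiers are assumed not to be spelled S, E, T, or $.

```python
from typing import List, Tuple

# action name -> (left-hand side, number of symbols popped)
RULES = {
    "reduce (1)": ("S", 1),  # S -> E
    "reduce (2)": ("E", 3),  # E -> E + T
    "reduce (3)": ("E", 1),  # E -> T
    "reduce (4)": ("T", 1),  # T -> id
}
SYMBOLS = {"$", "+", "S", "E", "T"}  # anything else on the stack is an id


def next_action(stack: List[str], rest: List[str]) -> str:
    """Pick the parser action from the stack and the remaining input."""
    top = stack[-1]
    if stack == ["$", "S"] and not rest:
        return "accept"
    if top not in SYMBOLS:  # an identifier lexeme is on top
        return "reduce (4)"
    if top == "T":
        return "reduce (2)" if stack[-3:-1] == ["E", "+"] else "reduce (3)"
    if stack == ["$", "E"] and not rest:
        return "reduce (1)"
    if rest and rest[0] == "+" and top == "E":
        return "shift"
    if rest and rest[0] != "+" and top in ("$", "+"):
        return "shift"
    raise SyntaxError(f"cannot continue with stack {stack} and input {rest}")


def parse(tokens: List[str]) -> List[Tuple[str, str, str]]:
    """Return the (stack, input, action) rows of the parse."""
    if any(tok in SYMBOLS - {"+"} for tok in tokens):
        raise ValueError("identifiers may not be named S, E, T, or $")
    stack, rest, trace = ["$"], list(tokens), []
    while True:
        action = next_action(stack, rest)
        trace.append((" ".join(stack), " ".join(rest), action))
        if action == "accept":
            return trace
        if action == "shift":
            stack.append(rest.pop(0))
        else:
            lhs, popped = RULES[action]
            del stack[-popped:]  # pop the handle ...
            stack.append(lhs)    # ... and push the nonterminal


for row in parse("a + b + c".split()):
    print("{:<10} {:<10} {}".format(*row))
```

Reading the reduce actions from the last row upward gives the rightmost derivation of the input, which is why shift-reduce parsing is described as a rightmost derivation in reverse.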
Shift-Reduce Parsing Example (precedence problem)

Grammar
    (1) S -> E
    (2) E -> E + T
    (3) E -> E * T
    (4) E -> T
    (5) T -> id

    Stack        Input        Action
    $            a + b * c    shift

Reference: Barbara Ryder’s 198:515 lecture notes

Syntax-directed Translation: AST Construction example

Grammar with production rules

    S: E          { $$ = $1; };
    E: E '+' T    { $$ = new node("+", $1, $3); }
     | T          { $$ = $1; }
     ;
    T: T_ID       { $$ = new leaf("id", $1); };

Implicit parse tree for a+b+c

    S
      E
        E
          E
            T
              T_ID a
          +
          T
            T_ID b
        +
        T
          T_ID c

AST for a+b+c

    +
      +
        a
        b
      c
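The AST construction can be sketched in Python by running the semantic actions at each reduction. A value stack is kept in step with the symbol stack, so that `$1` and `$3` are the values sitting under the handle being reduced. As assumptions for the example, tuples stand in for the `node` and `leaf` objects, and the reduce decisions are hand-coded rather than table-driven as in yacc or bison.

```python
from typing import List


def node(op, left, right):
    return (op, left, right)


def leaf(kind, text):
    return (kind, text)


def build_ast(tokens: List[str]):
    """Shift-reduce parse of id (+ id)* that builds the AST bottom-up."""
    symbols = ["$"]   # grammar symbols on the parse stack
    values = [None]   # semantic value for each symbol
    rest = list(tokens)
    while True:
        top = symbols[-1]
        if symbols == ["$", "S"] and not rest:
            return values[-1]
        if top == "T_ID":
            # T: T_ID { $$ = new leaf("id", $1); }
            symbols[-1] = "T"
            values[-1] = leaf("id", values[-1])
        elif top == "T" and symbols[-3:-1] == ["E", "+"]:
            # E: E '+' T { $$ = new node("+", $1, $3); }
            right = values.pop()   # $3
            values.pop()           # $2, the '+' token
            left = values.pop()    # $1
            del symbols[-3:]
            symbols.append("E")
            values.append(node("+", left, right))
        elif top == "T":
            # E: T { $$ = $1; }
            symbols[-1] = "E"
        elif symbols == ["$", "E"] and not rest:
            # S: E { $$ = $1; }
            symbols[-1] = "S"
        elif rest and rest[0] == "+" and top == "E":
            symbols.append("+")
            values.append(rest.pop(0))
        elif rest and rest[0] != "+" and top in ("$", "+"):
            symbols.append("T_ID")
            values.append(rest.pop(0))
        else:
            raise SyntaxError(f"unexpected input near {rest[:1]}")


print(build_ast("a + b + c".split()))
```

Because rule E -> E + T is left-recursive, the first reduction combines a and b, and that subtree becomes the left child of the root. This gives the left-leaning AST drawn for a+b+c.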
Using SableCC to specify grammar and generate AST

    Productions
      cst_program {-> program} =
        cst_main_class cst_class_decl*
          {-> New program(cst_main_class.main_class, [cst_class_decl.class_decl])} ;

      cst_exp_list {-> exp* } =
          {many_rule} cst_exp cst_exp_rest*
            {-> [cst_exp.exp, cst_exp_rest.exp] }
        | {empty_rule}
            {-> [] }
        ;

      cst_exp_rest {-> exp* } = t_comma cst_exp
            {-> [cst_exp.exp] };

    Abstract Syntax Tree
      program =
        main_class [class_decls]:class_decl*;
      exp =
        {call} exp t_id [args]:exp* | ...

Parsing Terms

CFG (Context-free Grammar)
– production rule
– terminal
– nonterminal
– FOLLOW(X): “the set of terminals that can immediately follow X”

BNF (Backus-Naur Form) and EBNF (Extended BNF): equivalent to CFGs
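The FOLLOW definition can be made concrete with a small fixed-point computation in Python. The sketch below uses the expression grammar from the shift-reduce example and assumes it has no empty productions, which keeps the FIRST computation to one line per rule. The `$` end-of-input marker in FOLLOW(S) is the usual convention, not something stated in the definition above. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# S -> E,  E -> E + T,  E -> T,  T -> id   (no empty productions)
GRAMMAR: List[Tuple[str, List[str]]] = [
    ("S", ["E"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["id"]),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def first_sets() -> Dict[str, Set[str]]:
    """FIRST(X): terminals that can begin a string derived from X."""
    first: Dict[str, Set[str]] = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            head = rhs[0]
            new = first[head] if head in NONTERMINALS else {head}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first


def follow_sets(start: str = "S") -> Dict[str, Set[str]]:
    """FOLLOW(X): terminals that can appear immediately after X."""
    first = first_sets()
    follow: Dict[str, Set[str]] = {nt: set() for nt in NONTERMINALS}
    follow[start].add("$")  # end of input follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                if i + 1 < len(rhs):
                    nxt = rhs[i + 1]
                    new = first[nxt] if nxt in NONTERMINALS else {nxt}
                else:
                    # sym ends the rule: whatever follows lhs follows sym
                    new = follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return follow


for nonterminal, terminals in sorted(follow_sets().items()):
    print(f"FOLLOW({nonterminal}) = {sorted(terminals)}")
```

For this grammar, FOLLOW(E) and FOLLOW(T) both come out as {+, $}. Those are the lookahead symbols on which an SLR parser would reduce by E -> T or T -> id.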
Parsing Terms cont …

Top-down parsing
– LL(1): left-to-right reading of tokens, leftmost derivation, 1 symbol look-ahead
– Predictive parser: an efficient non-backtracking top-down parser that can handle LL(1)
– More generally recursive descent parsing may involve backtracking

Bottom-up Parsing
– LR(1): left-to-right reading of tokens, rightmost derivation in reverse, 1 symbol lookahead
– Shift-reduce parsers: for example, bison, yacc, and SableCC generated parsers
– Methods for producing an LR parsing table
  – SLR, simple LR
  – Canonical LR, most powerful
  – LALR(1)

Concepts

Compilation stages in a compiler
– Scanning, parsing, semantic analysis, intermediate code generation, optimization, code generation

Lexical analysis or scanning
– Tools: SableCC, lex, flex, etc.

Syntactic analysis or parsing
– Tools: SableCC, yacc, bison, etc.
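For contrast with the shift-reduce examples, a predictive (non-backtracking) recursive-descent parser can be sketched in Python. One assumption to flag: the left-recursive rule E -> E + T cannot be used directly by an LL(1) parser, so the sketch parses the equivalent form E -> T ('+' T)*. The function names are hypothetical. Every decision is made by looking at a single token of lookahead.

```python
from typing import List, Optional


def parse_expr(tokens: List[str]):
    """Predictive parser for  E -> T ('+' T)*,  T -> id."""
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos] if pos < len(tokens) else None

    def advance() -> str:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def term():
        # T -> id
        token = peek()
        if token is None or not token.isidentifier():
            raise SyntaxError(f"expected identifier, found {token!r}")
        return ("id", advance())

    def expr():
        # E -> T ('+' T)*, folding to the left as each '+' is seen
        left = term()
        while peek() == "+":  # one symbol of lookahead picks the branch
            advance()
            left = ("+", left, term())
        return left

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_expr("a + b + c".split()))
```

Folding inside the loop keeps + left-associative, so the resulting tree has the same shape a bottom-up parser would produce with the left-recursive rule.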
Next Time

Lecture
– More undergraduate compilers review

Language Implementation Timeline

For entertainment purposes only!
[Figure: timeline of language implementations, ‘50 through 2010]