Homework 2: Parser and Lexer
Remi Meier Compiler Design – 15.03.2018
1 Compiler Phases
x86 Javali Compiler Assembly
IR IR Front-end Optimizations Back-end
Machine Machine independent dependent
Lexical Syntactc AST Semantic Analysis Analysis Analysis
2 Homework 2
Class
class Main { Method void main() { write(222); ? writeln(); Seq } } Text Write WriteLn Abstract IntConst Syntax Tree
How do we… • check if a program follows the syntax of Javali? • extract meaning / structure?
3 Homework 2
Token Lexer Parser Parse Tree Javali AST Stream
Part 2a Part 2b
Token Parse Tree / Text Lexer Parser Stream AST
Lexer • Read input character by character • Recognize character groups → tokens
Token • Sequence of characters with a collectve meaning → grammar terminals • E.g. constants, identiiers, keywords, …
5 Lexical Analysis
class Main { ID : [a-zA-Z]+ ; void main() { NUM : [0-9]+ ; write(222); MISC : [{()};] ; writeln(); WS : ('\n'|' ') → skip ; } }
Token stream:
ID: class ID: Main MISC: { ID: void ID: main MISC: ( MISC: ) …
6 Syntactc Analysis
Token Parse Tree / Text Lexer Parser Stream AST
Parser • Check if token stream follows the grammar • Group tokens hierarchically (extract structure) → Parse Tree / Abstract Syntax Tree
…
7 TOP-DOWN PARSER
8 Top-down Parsers return a + b ; Grammar in Extended Backus- Naur Form (EBNF): return a + b ; statement: statement return | assign return: return ‘return’ expr ‘;’ ‘return’ expr ‘;’ assign: ID ‘=‘ expr ‘;’ expr: ID ‘+’ ID ‘a’ ‘+’ ‘b’
9 Implementaton return a + b ;
Grammar in Extended Backus- void statement() { Naur Form (EBNF): return(); or assign(); statement: } return | assign void return() { match(‘return’); return: expr(); ‘return’ expr ‘;’ match(‘;’); assign: } ID ‘=‘ expr ‘;’ void expr() { match(ID); expr: ID ‘+’ ID match(‘+’); How to deal match(ID);with alternatves?}
10 Lookahead return a + b ; Grammar in Extended Backus- Naur Form (EBNF):
statement: void statement() { return if (next() is ‘return’) { | assign return(); } else if (next() is ID) { return: assign(); ‘return’ expr ‘;’ } assign: } ID ‘=‘ expr ‘;’ expr: ID ‘+’ ID LL(1)
11 http://www.antlr4.org/ (or HW2 fragment)
12 ANTLR
Token speciications MyLexer.java + MyParser.java Grammar
Top-down parser generator ● ALL(*) adaptive, arbitrary lookahead ● handles any non-lef-recursive context-free grammar
13 ANTLR – Grammar Descripton
/* This is an example */ grammar Example;
/* Parser rules = Non-terminals */ program : Start rule matching end-of-ile statement* EOF ;
Lower-case initial: Parser statement : assignment ';' | expression ';' ;
/* Lexer rules = Terminals */ Identifier : Letter (Letter | Digit)* ; Upper-case initial: Lexer Letter : '\u0024' | '\u0041'..'\u005a'; → Tokens
14 ANTLR – Operators Extended Backus-Naur Form (EBNF)
program : EBNF operators statement* EOF; x | y | z (ordered) alternative x? at most once (optional) statement : assignment ';' x* 0 .. n times x+
| expression ';' 1 .. n times l
e
; x one of the chars, e.g.: e
[charset] r [a-zA-Z] -
o
method : n
l
'x'..'y' characters in range y type name '(' params? ')' ;
15 Demo 1
16 ANTLR – Troubleshootng ANTLR does not warn about ambiguous rules ● resolves ambiguity at runtime → requires lots of testing
ANTLR does not handle indirect lef-recursion ● direct lef-recursion supported
17 ANTLR – Lexer Ambiguity What if some input is matched by multiple lexer rules?
d
o creates implicit lexer rule
parserRule : 'foo' parserRule ; c u T123 : 'foo'
m
e
n fragment enforces that the rule fragment t
o
Letter : [a-z] ; r never produces a token, but can be
d
e
r used in other lexer rules Identifier : Letter+ ; can never match foo, but e.g., foot Lexer decides based on: 1. rule with the longest match irst 2. literal tokens before all regular Lexer rules 3. document order 4. fragment rules never match on their own 18 ANTLR – Parser Ambiguity
stmt: 'if' expr 'then' stmt 'else' stmt (1) | 'if' expr 'then' stmt (2) | ID '=' expr ;
if a then if c then d else e
) (2 (2 ), ), (1 (1 ) if a then if c then d else e if a then if c then d else e
Ambiguous since there exist more than one parse trees for the same input.
19 ANTLR – Parser Ambiguity
stmt: 'if' expr 'then' stmt 'else' stmt (1) | 'if' expr 'then' stmt (2) | ID '=' expr ;
At decision points, if more than one alternative matches a given input, follow document order.
if a then if c then d else e
) (2 (2 ), ), (1 (1 ) if a then if c then d else e if a then if c then d else e
20 ANTLR – Parser Ambiguity
stmt: 'if' expr 'then' stmt 'else' stmt (1) | 'if' expr 'then' stmt (2) | ID '=' expr ;
At decision points, if more than one alternative matches a given input, follow document order.
stmt: 'if' expr 'then' stmt Soluton | 'if' expr 'then' stmt 'else' stmt | ID '=' expr ;
21 ANTLR – Parser Ambiguity At decision points, if more than one alternative matches a given input, follow document order. Alternative solution: stmt: 'if' expr 'then' stmt ('else' stmt)? (…)? → (…| ) | ID '=' expr ;
Sub-rules introduce additional decision points.
22 ANTLR – Lef-recursion
Without: “a, b, c” list : LETTER (',' LETTER)*;
Direct:
list : list ',' LETTER void list() { | LETTER ; if (???) { list(); Indirect: match(‘,’); match(LETTER); list : LETTER } else { | longlist ; match(LETTER); } } longlist : list ',' LETTER;
23 ANTLR – Direct Lef-recursion
exp : exp '*' exp A grammar that implicitly assigns | exp '+' exp rewrite priorites to alternatives in | ID ; document order
a * a + a * a a + a * a
https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lef-recursive+rules 24 Demo 2
25 Homework
Token Lexer Parser Parse Tree Javali AST Stream
Part 2a Part 2b Parser grammar: Javali.g4 JavaliAstVisitor.java
26 Javali AST Generaton
BinaryOp(+) “a * a + a * a”
Visitor BinaryOp(*) BinaryOp(*)
Var('a') Var('a') Var('a') Var('a')
→ JavaliAstVisitor.java Generated Files
ANTLR
JavaliLexer/Parser.java ● the real thing Javali(Base)Visitor.java ● base class for parse-tree visitor Javali(Lexer).tokens ● token → number mapping for debugging
28 Generated Visitor
start : exp EOF ;
exp : exp '*' exp one method | exp '+' exp per rule | ID ;
start : exp EOF ;
exp : exp '*' exp # MULT one method | exp '+' exp # ADD per label / rule | ID # TERM ;
https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parser+Rules 29 Constructng the Javali AST
start : exp EOF ;
exp : exp '*' exp # MULT | exp '+' exp # ADD | ID # TERM ;
BinaryOp(+) “a * a + a * a”
visitADD visitMULT visitMULT Visitor BinaryOp(*) BinaryOp(*)
visitTERM
Var('a') Var('a') Var('a') Var('a')
30 Demo 3
31 Final Notes
• Look on our website for more material. • Due date is March, 29th at 10 a.m.
Please don't do that…
32