Homework 2: Parser and Lexer

Remi Meier Design – 15.03.2018

1 Compiler Phases

x86 Javali Compiler Assembly

IR IR Front-end Optimizations Back-end

Machine Machine independent dependent

Lexical Syntactc AST Semantic Analysis Analysis Analysis

2 Homework 2

Class

class Main { Method void main() { write(222); ? writeln(); Seq } } Text Write WriteLn Abstract IntConst Syntax

How do we… • check if a program follows the syntax of Javali? • extract meaning / structure?

3 Homework 2

Token Lexer Parser Javali AST Stream

Part 2a Part 2b

4

Token Parse Tree / Text Lexer Parser Stream AST

Lexer • Read input character by character • Recognize character groups → tokens

Token • Sequence of characters with a collectve meaning → grammar terminals • E.g. constants, identiiers, keywords, …

5 Lexical Analysis

class Main { ID : [a-zA-Z]+ ; void main() { NUM : [0-9]+ ; write(222); MISC : [{()};] ; writeln(); WS : ('\n'|' ') → skip ; } }

Token stream:

ID: class ID: Main MISC: { ID: void ID: main MISC: ( MISC: ) …

6 Syntactc Analysis

Token Parse Tree / Text Lexer Parser Stream AST

Parser • Check if token stream follows the grammar • Group tokens hierarchically (extract structure) → Parse Tree / Abstract Syntax Tree

7 TOP-DOWN PARSER

8 Top-down Parsers return a + b ; Grammar in Extended Backus- Naur Form (EBNF): return a + b ; statement: statement return | assign return: return ‘return’ expr ‘;’ ‘return’ expr ‘;’ assign: ID ‘=‘ expr ‘;’ expr: ID ‘+’ ID ‘a’ ‘+’ ‘b’

9 Implementaton return a + b ;

Grammar in Extended Backus- void statement() { Naur Form (EBNF): return(); or assign(); statement: } return | assign void return() { match(‘return’); return: expr(); ‘return’ expr ‘;’ match(‘;’); assign: } ID ‘=‘ expr ‘;’ void expr() { match(ID); expr: ID ‘+’ ID match(‘+’); How to deal match(ID);with alternatves?}

10 Lookahead return a + b ; Grammar in Extended Backus- Naur Form (EBNF):

statement: void statement() { return if (next() is ‘return’) { | assign return(); } else if (next() is ID) { return: assign(); ‘return’ expr ‘;’ } assign: } ID ‘=‘ expr ‘;’ expr: ID ‘+’ ID LL(1)

11 http://www.antlr4.org/ (or HW2 fragment)

12 ANTLR

Token speciications MyLexer.java + MyParser.java Grammar

Top-down parser generator ● ALL(*) adaptive, arbitrary lookahead ● handles any non-lef-recursive context-free grammar

13 ANTLR – Grammar Descripton

/* This is an example */ grammar Example;

/* Parser rules = Non-terminals */ program : Start rule matching end-of-ile statement* EOF ;

Lower-case initial: Parser statement : assignment ';' | expression ';' ;

/* Lexer rules = Terminals */ Identifier : Letter (Letter | Digit)* ; Upper-case initial: Lexer Letter : '\u0024' | '\u0041'..'\u005a'; → Tokens

14 ANTLR – Operators Extended Backus-Naur Form (EBNF)

program : EBNF operators statement* EOF; x | y | z (ordered) alternative x? at most once (optional) statement : assignment ';' x* 0 .. n times x+

| expression ';' 1 .. n times l

e

; x one of the chars, e.g.: e

[charset] r [a-zA-Z] -

o

method : n

l

'x'..'y' characters in range y type name '(' params? ')' ;

15 Demo 1

16 ANTLR – Troubleshootng ANTLR does not warn about ambiguous rules ● resolves ambiguity at runtime → requires lots of testing

ANTLR does not handle indirect lef-recursion ● direct lef-recursion supported

17 ANTLR – Lexer Ambiguity What if some input is matched by multiple lexer rules?

d

o creates implicit lexer rule

parserRule : 'foo' parserRule ; c u T123 : 'foo'

m

e

n fragment enforces that the rule fragment t

o

Letter : [a-z] ; r never produces a token, but can be

d

e

r used in other lexer rules Identifier : Letter+ ; can never match foo, but e.g., foot Lexer decides based on: 1. rule with the longest match irst 2. literal tokens before all regular Lexer rules 3. document order 4. fragment rules never match on their own 18 ANTLR – Parser Ambiguity

stmt: 'if' expr 'then' stmt 'else' stmt (1) | 'if' expr 'then' stmt (2) | ID '=' expr ;

if a then if c then d else e

) (2 (2 ), ), (1 (1 ) if a then if c then d else e if a then if c then d else e

Ambiguous since there exist more than one parse trees for the same input.

19 ANTLR – Parser Ambiguity

stmt: 'if' expr 'then' stmt 'else' stmt (1) | 'if' expr 'then' stmt (2) | ID '=' expr ;

At decision points, if more than one alternative matches a given input, follow document order.

if a then if c then d else e

) (2 (2 ), ), (1 (1 ) if a then if c then d else e if a then if c then d else e

20 ANTLR – Parser Ambiguity

stmt: 'if' expr 'then' stmt 'else' stmt (1) | 'if' expr 'then' stmt (2) | ID '=' expr ;

At decision points, if more than one alternative matches a given input, follow document order.

stmt: 'if' expr 'then' stmt Soluton | 'if' expr 'then' stmt 'else' stmt | ID '=' expr ;

21 ANTLR – Parser Ambiguity At decision points, if more than one alternative matches a given input, follow document order. Alternative solution: stmt: 'if' expr 'then' stmt ('else' stmt)? (…)? → (…| ) | ID '=' expr ;

Sub-rules introduce additional decision points.

22 ANTLR – Lef-recursion

Without: “a, b, c” list : LETTER (',' LETTER)*;

Direct:

list : list ',' LETTER void list() { | LETTER ; if (???) { list(); Indirect: match(‘,’); match(LETTER); list : LETTER } else { | longlist ; match(LETTER); } } longlist : list ',' LETTER;

23 ANTLR – Direct Lef-recursion

exp : exp '*' exp A grammar that implicitly assigns | exp '+' exp rewrite priorites to alternatives in | ID ; document order

a * a + a * a a + a * a

https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Lef-recursive+rules 24 Demo 2

25 Homework

Token Lexer Parser Parse Tree Javali AST Stream

Part 2a Part 2b Parser grammar: Javali.g4 JavaliAstVisitor.java

26 Javali AST Generaton

BinaryOp(+) “a * a + a * a”

Visitor BinaryOp(*) BinaryOp(*)

Var('a') Var('a') Var('a') Var('a')

→ JavaliAstVisitor.java Generated Files

ANTLR

JavaliLexer/Parser.java ● the real thing Javali(Base)Visitor.java ● base class for parse-tree visitor Javali(Lexer).tokens ● token → number mapping for debugging

28 Generated Visitor

start : exp EOF ;

exp : exp '*' exp one method | exp '+' exp per rule | ID ;

start : exp EOF ;

exp : exp '*' exp # MULT one method | exp '+' exp # ADD per label / rule | ID # TERM ;

https://theantlrguy.atlassian.net/wiki/display/ANTLR4/Parser+Rules 29 Constructng the Javali AST

start : exp EOF ;

exp : exp '*' exp # MULT | exp '+' exp # ADD | ID # TERM ;

BinaryOp(+) “a * a + a * a”

visitADD visitMULT visitMULT Visitor BinaryOp(*) BinaryOp(*)

visitTERM

Var('a') Var('a') Var('a') Var('a')

30 Demo 3

31 Final Notes

• Look on our website for more material. • Due date is March, 29th at 10 a.m.

Please don't do that…

32