Compiler: Parsing

Computer Science and Engineering, College of Engineering, The Ohio State University
Lecture 37

Backus-Naur Form (BNF)
Classic notation for writing a CFG
Very old: invented to describe the syntax of ALGOL 60
But still used today!
Basic syntax
  <name> is a symbol
  ::= is the arrow in a production rule
  Vertical bar ( | ) for choice in a rule
Common extensions (many variants)
  ( )'s to group elements
  RE repetition operators * + ? (or { } for * and [ ] for ?)
  Comments

Example: Mailing Addresses
  <postal>   ::= <name> <addr> <zip-part>
  <name>     ::= <personal> <last-name> <EOL> | <personal> <name>
  <personal> ::= <initial> "." | <first-name>
  <addr>     ::= <house> <street> <opt-apt> <EOL>
  <zip-part> ::= <town> "," <state> <zip> <EOL>
  <opt-apt>  ::= "#" <apt-num> | ""

Parse Tree
A parse tree records how grammar rules are applied to form a string
  Root: start symbol
  Internal nodes: non-terminals
  Leaves: terminals (i.e., tokens)

Example
Grammar:
  <exp> ::= <exp> <exp> | "a" <exp> "b" | "a" "b"
String: abaababb
[Figure: parse tree for abaababb]

Your Turn
Grammar:
  <expr> ::= <term> (("+"|"-") <term>)*
  <term> ::= <fact> (("*"|"/") <fact>)*
  <fact> ::= <int> | "(" <expr> ")"
String: 14 + 2 * 3 - 6 / 2
Parse tree:

Your Turn
MEAN := SUM DIV 100;

Ambiguity
An ambiguous grammar is one that permits two different parse trees to be formed for the same string
Example:
  <e> ::= <e> + <e> | <e> - <e> | <int>
Consider: 3 + 6 - 2
Parse tree?

Algorithm
How do we calculate a parse tree from a sequence of tokens?
In general, there are two basic ways to tackle a problem of tree construction:
1. Bottom up
   Start at leaves
   Calculate their parents, then their parents…
2. Top down
   Start at root
   Calculate its children, then their children…

Operator Precedence Parsing
An early bottom-up technique
Define binding priority between "operators" (i.e., token types)
e.g., A + B * C - D
  Priority: '+' < '*' and '-' < '*'
  Resulting parse tree:
  [Figure: parse tree for A + B * C - D]

Operator Precedence Parsing
Operators are tokens
  Binding priority only defined between terminals
Grammar (implicitly) defines a matrix of binding priorities
  Note 1: Not all pairs defined!
  Note 2: Ordering is not antisymmetric!
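
The binding-priority idea can be sketched in a few lines of Python. This is a simplified illustration, not the full precedence-relation matrix from the slides: it assumes a numeric priority per operator (the entry for "/" and the left-associativity of equal priorities are assumptions), and the names PRIORITY and parse_operator_precedence are hypothetical.

```python
# Assumed binding priorities for this sketch: a higher number binds tighter.
PRIORITY = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_operator_precedence(tokens):
    """Build a nested-tuple tree bottom up from a flat list of tokens."""
    operands = []   # stack of finished subtrees (leaves or internal nodes)
    operators = []  # stack of operators still waiting for a right operand

    def reduce_top():
        # Pop one operator and its two operands, push the new internal node.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok in PRIORITY:
            # Reduce while the operator on the stack binds at least as tightly;
            # ">=" makes operators of equal priority left-associative.
            while operators and PRIORITY[operators[-1]] >= PRIORITY[tok]:
                reduce_top()
            operators.append(tok)
        else:
            operands.append(tok)  # shift an operand (a leaf)

    while operators:
        reduce_top()
    return operands[0]

tree = parse_operator_precedence(["A", "+", "B", "*", "C", "-", "D"])
print(tree)  # ('-', ('+', 'A', ('*', 'B', 'C')), 'D')
```

Because '*' outranks both '+' and '-', B * C is grouped first, and each reduction adds one internal node to the tree.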
Example
Reductions and parse tree creation (the precedence relations between adjacent terminals follow each token sequence):
  BEGIN READ ( id ) ;      < = < > >
  BEGIN READ ( nt1 ) ;     < = = >
  BEGIN nt2 ;              <
Notes:
  Each reduction adds an internal node
  Internal node names do not matter

Shift-Reduce Parsing
Generalizes the idea of operator precedence
Two phases:
1. Shift: Scan tokens, placing them on a stack
2. Reduce: Group tokens at top of stack
   Pop tokens that group together off
   Push corresponding non-terminal
Repeat until done
  Should be left with ________________

Shift-Reduce Parsing II
Grammar must be "LR"
  "Left-to-right scan of the input, producing a right-most derivation"
Symbols to be reduced always appear at top of stack (never inside it)
Need to "look ahead" to decide how/when to reduce symbols at the top of the stack
  If we only need to look ahead 1 token: LR(1) grammar

Recursive Descent
Top-down approach
Each rule has an associated function
  Scan forward
  Try to identify string matching this rule
  Function may have to call other functions
Example: Function to recognize <read>
  find "READ";
  find "(";
  find <id-list>; // another function call
  find ")";

Recursive Descent: Problem
Subtle potential problem: "left-recursion"
  Occurs when the left-most (first) symbol on the right-hand side of a rule is the same non-terminal that the rule defines (recursive)
    <id-list> ::= id | <id-list> "," id
  If we want to expand the 2nd alternative, we first call ourselves!
  Results in infinite recursion
Solution: use optional repetition
    <id-list> ::= id { "," <id-list> }
  Now the function always consumes a token before the recursive call

Abstract Syntax Tree (AST)
Concrete parse tree:
  Faithful representation of each grammar rule application
  Often contains syntactic clutter
Abstract syntax tree:
  Faithful representation of structure of program
  Only semantically important information is included

Parse Tree
MEAN := SUM DIV 100;
  <assign>
    id (MEAN)
    :=
    <exp>
      <term>
        <term>
          <factor>
            id (SUM)
        DIV
        <factor>
          int (100)

AST
MEAN := SUM DIV 100;
  :=
    id (MEAN)
    DIV
      id (SUM)
      int (100)

Code Generation
Output produced from the AST
Semantic routines: one routine per internal node in AST
Two approaches:
  Create entire tree, then transform and walk the tree, generating output
  Generate output as the grammar rules are recognized, bottom up

Example
Code snippet
  MID := (MAX + MIN) DIV 2
Grammar rule
  <term> ::= <term> DIV <factor>
Semantic routine:
  Needs results from children, e.g., registers which contain values being div'ed
  Generates output: machine code for div'ing
  Returns location where result is placed, for its parent to use
[Figure: AST fragment with DIV at the root and children + and int 2]

Optimization
An optimizing compiler tries to generate the most efficient object code
  Time (fast execution times)
  Space (small object files)
Requires sophisticated analysis
Often uses an intermediate representation (IR) of code
  IR is not executed directly
  IR is analyzed for deciding register allocation, instruction ordering, branch shadows, etc.
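
The semantic-routine idea from the code generation slides can be sketched as a recursive walk over the AST for MID := (MAX + MIN) DIV 2. This is only an illustration under stated assumptions: the AST is encoded as nested tuples, the output is a made-up three-address form with temporaries t1, t2, … standing in for registers (loosely resembling an IR rather than real machine code), and the function name generate is hypothetical.

```python
import itertools

def generate(node, code, temps):
    """Semantic routine: emit code for node and return where its value lives."""
    if isinstance(node, (str, int)):
        return str(node)                 # leaf: a variable name or integer literal
    op, left, right = node
    lhs = generate(left, code, temps)    # results from the children come first
    rhs = generate(right, code, temps)
    if op == ":=":
        code.append(f"{lhs} := {rhs}")   # store the value into the target variable
        return lhs
    dest = f"t{next(temps)}"             # assumed: a fresh temporary per operation
    code.append(f"{dest} := {lhs} {op} {rhs}")
    return dest                          # the parent uses this location

# Assumed tuple encoding of the AST: (operator, left child, right child).
ast = (":=", "MID", ("DIV", ("+", "MAX", "MIN"), 2))
code = []
generate(ast, code, itertools.count(1))
print("\n".join(code))
# t1 := MAX + MIN
# t2 := t1 DIV 2
# MID := t2
```

Each internal node's routine gets the locations of its children's results, emits one instruction, and returns the location of its own result to its parent.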
Example: LLVM IR
  @.str = internal constant [14 x i8] c"hello, world\0A\00"

  declare i32 @printf(i8*, ...)

  define i32 @main(i32 %argc, i8** %argv) nounwind {
  entry:
    %tmp1 = getelementptr [14 x i8], [14 x i8]* @.str, i32 0, i32 0
    %tmp2 = call i32 (i8*, ...) @printf(i8* %tmp1) nounwind
    ret i32 0
  }

Compiler Compilers
Write:
  Token definitions (REs)
  Grammar definition (CFG)
  Semantic routines (code to execute when visiting/generating the nodes of the tree)
Use a tool to translate this information into a compiler (in C or Java or…)
  The translation tool is a compiler compiler!
Classic Unix tools:
  Old school: lex and yacc ("lexical analyzer", "yet another compiler compiler")
  Better: GNU's flex and bison
  Output a lexer and a compiler that calls the generated lexer

Modern Tool: ANTLR
ANother Tool for Language Recognition
See: antlr.org, github.com/antlr/antlr4
Examples: github.com/antlr/grammars-v4 (simple one: arithmetic.g4)
Can generate code in many languages (Java, C#, Python, JavaScript, C++…)
Two parts:
  The tool (processes grammar to generate the lexer/parser)
  The runtime (libraries for running the generated lexer/parser)

Summary
BNF: Syntax for grammar definition
Parse trees reflect application of grammar rules to produce program
Parse tree vs abstract syntax tree
Two strategies:
  Bottom up (shift reduce)
  Top down (recursive descent)
Code generation
IR and optimizations
Compiler compilers: lex/yacc, flex/bison, ANTLR
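
To make the top-down strategy concrete, here is a minimal recursive descent sketch in Python for the <expr>/<term>/<fact> grammar from the Your Turn slide, with one method per rule. It builds an abstract tree as nested tuples rather than a full concrete parse tree; the tokenizer, the Parser class, and all method names are hypothetical choices made for this illustration.

```python
import re

def tokenize(text):
    # A run of digits is an <int>; any other non-space character is one token.
    return re.findall(r"\d+|\S", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):
        # <expr> ::= <term> (("+"|"-") <term>)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):
        # <term> ::= <fact> (("*"|"/") <fact>)*
        node = self.fact()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.fact())
        return node

    def fact(self):
        # <fact> ::= <int> | "(" <expr> ")"
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expr()
            self.expect(")")
            return node
        if tok is not None and tok.isdigit():
            self.pos += 1
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at {parser.peek()!r}")
    return tree

print(parse("14 + 2 * 3 - 6 / 2"))
# ('-', ('+', 14, ('*', 2, 3)), ('/', 6, 2))
```

The repetition operator in each rule becomes a while loop, so every call consumes at least one token before it recurses and the left-recursion problem never arises.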
