Compiler: Parsing
Computer Science and Engineering College of Engineering The Ohio State University
Lecture 37

Backus-Naur Form (BNF)
Classic notation for writing CFGs
- Very old: Invented to describe the syntax of ALGOL 60
- But still used today!

Basic syntax
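As a rough illustration of the notation: BNF conventionally writes non-terminals in angle brackets, uses `::=` for "is defined as", and separates alternatives with `|`. The sketch below stores a few rules as text and splits them into a dictionary. The toy grammar and the helper name `parse_bnf` are assumptions made for this example, not material from the lecture.

```python
# Toy BNF grammar (an assumption for illustration, not taken from the lecture).
BNF_TEXT = """
<exp>  ::= <exp> + <term> | <term>
<term> ::= <term> * <factor> | <factor>
<factor> ::= id | ( <exp> )
"""

def parse_bnf(text):
    """Split BNF rules into {non-terminal: [alternative, ...]}.

    Each alternative is a list of symbols; symbols are separated by spaces.
    """
    grammar = {}
    for line in text.strip().splitlines():
        head, body = line.split("::=")
        alternatives = [alt.split() for alt in body.split("|")]
        grammar[head.strip()] = alternatives
    return grammar

grammar = parse_bnf(BNF_TEXT)
for head, alternatives in grammar.items():
    print(head, "->", alternatives)
```

Keeping the grammar as plain data like this is roughly what a parser generator does as its first step, before it builds any parsing tables.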
A parse tree records how grammar rules are applied to form a string
- Root: start symbol
- Internal nodes: non-terminals
- Leaves: terminals (i.e., tokens)

Example
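A minimal sketch of the root / internal node / leaf structure described above, assuming a small hypothetical grammar (`exp ::= exp + term | term`, `term ::= id`) that is not the one from the slides. The `Node` class and `leaves` helper are names invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                      # grammar symbol: non-terminal or token
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def leaves(node):
    """Return the terminals at the leaves, left to right."""
    if node.is_leaf():
        return [node.symbol]
    result = []
    for child in node.children:
        result.extend(leaves(child))
    return result

# Hypothetical grammar:  exp ::= exp + term | term ;  term ::= id
# Parse tree for the token string: id + id
tree = Node("exp", [
    Node("exp", [Node("term", [Node("id")])]),
    Node("+"),
    Node("term", [Node("id")]),
])

print(tree.symbol)    # root is the start symbol
print(leaves(tree))   # the leaves spell out the token string
```

Reading the leaves from left to right recovers the original token string, which is a handy sanity check on any parse tree.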
Grammar: [not preserved in this text]
[Figure: parse tree with internal nodes labeled exp and leaves a and b]

Your Turn
Grammar: [not preserved in this text]
MEAN := SUM DIV 100;

Ambiguity
An ambiguous grammar is one that permits two different parse trees to be formed for the same string

Example:
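The slide's own example is not preserved here. As a stand-in, the sketch below assumes the classic ambiguous rule `exp ::= exp - exp | num` and enumerates every parse tree for one token string. The function names are hypothetical.

```python
def parse_trees(tokens):
    """All parse trees for tokens under  exp ::= exp - exp | num.

    A tree is either a number or a tuple (left, "-", right).
    """
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    # Try every "-" as the top-level operator.
    for i in range(1, len(tokens), 2):
        if tokens[i] != "-":
            continue
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((left, "-", right))
    return trees

def evaluate(tree):
    """Evaluate a tree by subtracting the right subtree from the left."""
    if isinstance(tree, tuple):
        left, _, right = tree
        return evaluate(left) - evaluate(right)
    return tree

trees = parse_trees([10, "-", 4, "-", 3])
for t in trees:
    print(t, "=", evaluate(t))
```

The same token string yields two trees, and the two trees evaluate to different numbers, which is why ambiguity matters in practice and not only in theory.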
How do we calculate a parse tree from a sequence of tokens?
In general, there are two basic ways to tackle a problem of tree construction:
1. Bottom up
   - Start at leaves
   - Calculate their parents, then their parents…
2. Top down
   - Start at root
   - Calculate its children, then their children…

Operator Precedence Parsing
An early bottom-up technique
- Define binding priority between "operators" (i.e., token types)
- e.g., A + B * C - D
- Priority: '+' < '*' and '-' < '*'
- Resulting parse tree: [figure not preserved]

Operator Precedence Parsing
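The A + B * C - D example can be sketched with two stacks. This is a simplified, shunting-yard style relative of operator precedence parsing: it collapses the binding priorities into a single number per operator and assumes that operators of equal priority group left to right. Both are assumptions of this sketch; a full operator precedence parser works from a matrix of relations between token pairs.

```python
# Assumed binding strengths (higher binds tighter); equal strengths group
# left to right. A full operator-precedence parser uses a relation matrix.
PRIORITY = {"+": 1, "-": 1, "*": 2}

def parse(tokens):
    """Build a tree bottom up from a flat list of operands and operators."""
    operands, operators = [], []

    def reduce_top():
        # Pop one operator and its two operands; push the new internal node.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for tok in tokens:
        if tok in PRIORITY:
            # Reduce while the operator on the stack binds at least as tightly.
            while operators and PRIORITY[operators[-1]] >= PRIORITY[tok]:
                reduce_top()
            operators.append(tok)
        else:
            operands.append(tok)
    while operators:
        reduce_top()
    return operands[0]

tree = parse(["A", "+", "B", "*", "C", "-", "D"])
print(tree)
```

Each call to `reduce_top` adds one internal node, so the tree grows from the leaves upward, with `B * C` grouped before either `+` or `-`.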
- Operators are tokens
- Binding priority only defined between terminals
- Grammar (implicitly) defines a matrix of binding priorities
- Note 1: Not all pairs defined!
- Note 2: Ordering is not antisymmetric!

Example
Reductions (the relation between neighboring terminals is shown beneath each line):

BEGIN  READ  (  id  )  ;
     <     =  <   >  >
BEGIN  READ  (  nt1  )  ;
     <     =     =    >
BEGIN  nt2  ;
        <

Parse tree creation: [figure not preserved]

Notes:
- Each reduction adds an internal node
- Internal node names do not matter

Shift-Reduce Parsing
Generalizes idea of operator precedence
Two phases:
1. Shift: Scan tokens, placing them on a stack
2. Reduce: Group tokens at top of stack
   - Pop the tokens that group together off the stack
   - Push corresponding non-terminal
Repeat until done
Should be left with ______

Shift-Reduce Parsing II
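The shift and reduce phases can be sketched as a loop over a stack. The toy grammar below (`exp ::= exp + id | id`) is an assumption chosen so that reducing greedily, with no lookahead, happens to work; real shift-reduce parsers consult a table and a lookahead token to decide when to reduce.

```python
# Toy grammar (an assumption for illustration):  exp ::= exp + id | id
# Longer right-hand sides are listed first so they are tried first.
RULES = [
    ("exp", ["exp", "+", "id"]),
    ("exp", ["id"]),
]

def shift_reduce(tokens):
    """Return the trace of stack contents while parsing tokens."""
    stack, trace = [], []
    for tok in tokens:
        stack.append(tok)                      # shift
        trace.append(("shift", list(stack)))
        reduced = True
        while reduced:                         # reduce as long as possible
            reduced = False
            for head, body in RULES:
                if stack[-len(body):] == body:
                    del stack[-len(body):]     # pop the grouped symbols
                    stack.append(head)         # push the non-terminal
                    trace.append(("reduce", list(stack)))
                    reduced = True
                    break
    return trace

trace = shift_reduce(["id", "+", "id", "+", "id"])
for action, stack in trace:
    print(f"{action:6} {stack}")
```

Printing the trace shows the stack growing on each shift and shrinking on each reduce, and the symbols being reduced are always at the top of the stack.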
Grammar must be "LR"
- "Left-to-right scan of the input, producing a right-most derivation"
- Symbols to be reduced always appear at top of stack (never inside it)
Need to "look ahead" to decide how/when to reduce symbols at the top of the stack
- If we only need to look ahead 1 token: LR(1) grammar

Recursive Descent
Top-down approach
Each rule has an associated function
- Scan forward
- Try to identify string matching this rule
- Function may have to call other functions

Example: Function to recognize
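The rule that the slide's example recognizes is not preserved in this text. As a stand-in, here is a minimal recursive descent recognizer for a hypothetical two-rule grammar, with one method per rule. The class name, method names, and grammar are all assumptions of this sketch.

```python
class Parser:
    """Recognizer for a hypothetical grammar (not the one from the slides):

        exp  ::= term { "+" term }
        term ::= "id" | "(" exp ")"
    """

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def exp(self):                 # one function per rule
        self.term()
        while self.peek() == "+":
            self.expect("+")
            self.term()

    def term(self):
        if self.peek() == "(":
            self.expect("(")
            self.exp()             # rules call each other recursively
            self.expect(")")
        else:
            self.expect("id")

def recognizes(tokens):
    parser = Parser(tokens)
    try:
        parser.exp()
    except SyntaxError:
        return False
    return parser.pos == len(tokens)   # all input must be consumed

print(recognizes(["id", "+", "(", "id", "+", "id", ")"]))
print(recognizes(["id", "+", "+"]))
```

The call structure of the functions mirrors the grammar: `exp` calls `term`, and `term` calls back into `exp` for a parenthesized expression.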
Subtle potential problem: "left-recursion"
- Occurs when the left-most (first) symbol of a rule's right-hand side is the same non-terminal that the rule defines (the rule is recursive)
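To see why this is a problem for recursive descent, consider a hypothetical rule `exp ::= exp - id | id`. Translating it directly into a function makes the function call itself before consuming any input. One common remedy, sketched below, is to rewrite the rule with repetition (`exp ::= id { - id }`) and use a loop. The function names are invented for this example.

```python
def exp_left_recursive(tokens, pos=0):
    """Naive translation of  exp ::= exp - id | id  (hypothetical rule).

    The first step calls itself at the same position, so no input is
    consumed and the recursion never bottoms out.
    """
    pos = exp_left_recursive(tokens, pos)   # left-most symbol is exp again
    # never reached
    return pos

def exp_iterative(tokens, pos=0):
    """Equivalent rule without left recursion:  exp ::= id { - id }.

    Returns the position just past the recognized expression.
    """
    if pos >= len(tokens) or tokens[pos] != "id":
        raise SyntaxError("expected id")
    pos += 1
    while pos + 1 < len(tokens) and tokens[pos] == "-" and tokens[pos + 1] == "id":
        pos += 2
    return pos

tokens = ["id", "-", "id", "-", "id"]
try:
    exp_left_recursive(tokens)
except RecursionError:
    print("left-recursive version never terminates on its own")
print(exp_iterative(tokens))
```

Python stops the runaway recursion with a `RecursionError`; in a language without such a limit the naive version would overflow the stack.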
Concrete parse tree:
- Faithful representation of each grammar rule application
- Often contains syntactic clutter
Abstract syntax tree:
- Faithful representation of structure of program
- Only semantically important information is included

Parse Tree
MEAN := SUM DIV 100;
[Figure: concrete parse tree for this statement; only the labels id and := survive]
MEAN := SUM DIV 100;
Abstract syntax tree (reconstructed from the surviving figure labels):

:=
  id MEAN
  DIV
    id SUM
    int 100

Code Generation
Output produced from the AST
Semantic routines: one routine per internal node in AST
Two approaches:
- Create entire tree, then transform and walk the tree, generating output
- Generate output as the grammar rules are recognized, bottom up

Example
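As a rough sketch of the first approach (build the whole tree, then walk it), the code below takes the AST for the earlier statement MEAN := SUM DIV 100 and emits instructions for a hypothetical stack machine. The instruction names (`LOAD`, `PUSH`, `DIV`, `STORE`) and the tuple encoding of the tree are assumptions of this example, not a real target.

```python
# AST for  MEAN := SUM DIV 100  as nested tuples: (node kind, children...)
ast = (":=", ("id", "MEAN"), ("DIV", ("id", "SUM"), ("int", 100)))

def gen(node, out):
    """Emit instructions for a hypothetical stack machine, post-order."""
    kind = node[0]
    if kind == "id":
        out.append(f"LOAD {node[1]}")
    elif kind == "int":
        out.append(f"PUSH {node[1]}")
    elif kind == "DIV":
        gen(node[1], out)            # left operand first
        gen(node[2], out)            # then right operand
        out.append("DIV")
    elif kind == ":=":
        target = node[1][1]          # name of the variable being assigned
        gen(node[2], out)            # code for the right-hand side
        out.append(f"STORE {target}")
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return out

code = gen(ast, [])
print("\n".join(code))
```

Each branch of `gen` plays the role of one semantic routine: it handles a single kind of node and delegates to the routines for its children.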
Code snippet: MID := (MAX + MIN) DIV 2
Grammar rule: [not preserved in this text]
[Figure: tree fragment with DIV at the root and children + and int 2]

Optimization
An optimizing compiler tries to generate the most efficient object code
- Time (fast execution times)
- Space (small object files)
Requires sophisticated analysis
Often uses an intermediate representation (IR) of code
- IR is not executed directly
- IR is analyzed for deciding register allocation, instruction ordering, branch shadows, etc.

Example: LLVM IR
@.str = internal constant [14 x i8] c"hello, world\0A\00"

declare i32 @printf(i8*, ...)

define i32 @main(i32 %argc, i8** %argv) nounwind {
entry:
  %tmp1 = getelementptr [14 x i8], [14 x i8]* @.str, i32 0, i32 0
  %tmp2 = call i32 (i8*, ...) @printf( i8* %tmp1 ) nounwind
  ret i32 0
}

Compiler Compilers
Write:
- Token definitions (REs)
- Grammar definition (CFG)
- Semantic routines (code to execute when visiting/generating the nodes of the tree)
Use a tool to translate this information into a compiler (in C or Java or…)
- The translation tool is a compiler compiler!
Classic Unix tools:
- Old school: lex and yacc ("lexical analyzer", "yet another compiler compiler")
- Better: GNU's flex and bison
- Output a lexer and a compiler that calls the generated lexer

Modern Tool: ANTLR
ANother Tool for Language Recognition
- See: antlr.org, github.com/antlr/antlr4
- Examples: github.com/antlr/grammars-v4 (simple one: arithmetic.g4)
Can generate code in many languages (Java, C#, Python, JavaScript, C++…)
Two parts:
- The tool (processes grammar to generate the lexer/parser)
- The runtime (libraries for running the generated lexer/parser)

Summary
- BNF: Syntax for grammar definition
- Parse trees reflect application of grammar rules to produce program
- Parse tree vs abstract syntax tree
- Two strategies:
  - Bottom up (shift reduce)
  - Top down (recursive descent)
- Code generation
- IR and optimizations
- Compiler compilers: lex/yacc, flex/bison, ANTLR