UNIT – IV PARSERS Role of Parsers, Classification of Parsers: Top Down Parsers- Recursive Descent Parser and Predictive Parser
Role of parsers, Classification of Parsers: Top down parsers (recursive descent parser and predictive parser). Bottom up parsers (shift-reduce: SLR, CLR and LALR parsers). Error Detection and Recovery in Parser. YACC specification and Automatic construction of Parser (YACC).

YACC
• What is YACC?
– Developed by Stephen C. Johnson.
– Yacc ("yet another compiler compiler") is the standard parser generator for the Unix operating system.
– An open source program, yacc generates code for the parser in the C programming language.
– It is an LALR (Look-Ahead LR) parser generator.

How YACC Works
(1) Parser generation time: YACC source (*.y) → yacc → y.tab.c, plus y.tab.h (with -d) and y.output (with -v)
(2) Compile time: y.tab.c → C compiler/linker → a.out
(3) Run time: token stream → a.out → abstract syntax tree

YACC File Format
    %{
    C declarations
    %}
    yacc declarations
    %%
    Grammar rules
    %%
    Additional C code
– Comments enclosed in /* ... */ may appear in any of the sections.

Definitions Section
    %{
    #include <stdio.h>
    #include <stdlib.h>
    %}
    %token ID NUM    /* ID and NUM are terminals */
    %start expr

YACC Declaration Summary
• %start: Specify the grammar's start symbol.
• %union: Declare the collection of data types that semantic values may have.
• %token: Declare a terminal symbol (token type name) with no precedence or associativity specified.
• %type: Declare the type of semantic values for a nonterminal symbol.
• %right: Declare a terminal symbol (token type name) that is right-associative.
• %left: Declare a terminal symbol (token type name) that is left-associative.
• %nonassoc: Declare a terminal symbol (token type name) that is nonassociative. Using it in a way that would be associative is a syntax error; for example, x op y op z is a syntax error.

Rules Section
• This section defines the grammar.
• A context-free grammar (CFG) is a set of recursive rewriting rules (or productions) used to generate patterns of strings.
• Example
• E → E + E
• E → E * E
• E → id
• Input string: id + id * id

Rules Section
• Rules are normally written like this.
• Example:
    expr : expr '+' expr { $$ = $1 + $3; }
         | expr '-' expr { $$ = $1 - $3; }
         | expr '*' expr { $$ = $1 * $3; }
         | NUM
         ;

The Position of Rules
• In an action, $$ is the value of the left-hand side of the rule, and $1, $2, $3 are the values of the first, second and third symbols on the right-hand side. In expr '+' expr, $1 is the left operand, $2 is '+', and $3 is the right operand.
    expr : expr '+' expr { $$ = $1 + $3; }
         | expr '-' expr { $$ = $1 - $3; }
         | expr '*' expr { $$ = $1 * $3; }
         | NUM
         ;

Communication between LEX and YACC
• yyparse(), generated by YACC, calls yylex(), generated by LEX, each time it needs the next token.
• yylex() reads the input program, matches a pattern such as [0-9]+, and returns the token type: "next token is NUM".
• Example: the input 12 + 26 reaches the parser as NUM '+' NUM.
• yacc -d gram.y will produce y.tab.h, which holds the token definitions that the lex file includes.

Precedence / Association
    %right '='
    %left '<' '>' NE LE GE
    %left '+' '-'
    %left '*' '/'    /* highest precedence */
• Operators declared on later lines have higher precedence; operators on the same line have equal precedence.

Lex vs. Yacc
• Lex
– Lex generates C code for a lexical analyzer, or scanner.
– Lex uses patterns that match strings in the input and converts the strings to tokens.
• Yacc
– Yacc generates C code for a syntax analyzer, or parser.
– Yacc uses grammar rules that allow it to analyze tokens from Lex and create a syntax tree.

Example of LEX and YACC
Title: Implementation of Calculator using LEX and YACC

/************ LEX FILE ***********/
%{
#include <stdlib.h>   /* atoi */
#include "y.tab.h"
extern int yylval;
%}
%%
[0-9]+   { yylval = atoi(yytext); return NUMBER; }
[ \t]    ;   /* skip blanks and tabs */
\n       return 0;   /* end of line ends the input */
.        return yytext[0];
%%

/************ YACC FILE ***********/
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s);
%}
%token NUMBER
%left '+' '-'
%left '*' '/'
%%
Statement: expr { printf("\nOutput : %d\n", $1); }
    ;
expr: expr '+' expr { $$ = $1 + $3; }
    | expr '-' expr { $$ = $1 - $3; }
    | expr '*' expr { $$ = $1 * $3; }
    | expr '/' expr { $$ = $1 / $3; }
    | NUMBER        { $$ = $1; }
    ;
%%
int main(void)
{
    printf("Enter the operation:");
    return yyparse();
}

/* Called by yyparse() with a message when a syntax error is found. */
void yyerror(const char *s)
{
    fprintf(stderr, "%s\n", s);
}

int yywrap(void)
{
    return 1;
}

Output
• [root@localhost ~]# lex cal.l
• [root@localhost ~]# yacc -d cal.y
• [root@localhost ~]# cc y.tab.c lex.yy.c -ll
• [root@localhost ~]# ./a.out
• Enter the operation: 3+2
• Output : 5
• [root@localhost ~]# ./a.out
• Enter the operation: 5-9
• Output : -4

Role of Parser

Types of Parser
• Top-down Parsing
• When the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol into the input, it is called top-down parsing.
• Bottom-up Parsing
• As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree up to the start symbol.

Definitions
• Syntax – the form or structure of the expressions, statements, and program units
• Semantics – the meaning of the expressions, statements, and program units
• Sentence – a string of characters over some alphabet
• Language – a set of sentences
• Lexeme – the lowest level syntactic unit of a language (e.g., :=, {, while)
• Token – a category of lexemes (e.g., identifier)

Basic Terms
• Terminals: A terminal is a symbol which does not appear on the left-hand side of any production.
• A grammar contains a set of terminal symbols (tokens) such as the plus sign +, the times sign *, and other tokens defined by the lexical analyzer, such as identifiers.
• Nonterminals: Nonterminals are the non-leaf nodes in a parse tree.
• In the expression grammar, E, T, and F are nonterminals. Sometimes nonterminals are enclosed between angle brackets to distinguish them from terminals.
• Start symbol: a special nonterminal symbol from which every derivation begins; it is the initial string generated by the grammar.
• Ambiguity: A grammar G is said to be ambiguous if at least one string has more than one parse tree (equivalently, more than one left-most or more than one right-most derivation).
• Categories of grammars
– regular: good for identifiers, parameter lists, subscripts
– context free: the LHS of every production is a single nonterminal
– context sensitive

Context-Free Grammar (CFG)
• Definition: G = (V, T, P, S) is a CFG where
– V is a finite set of variables.
– T is a finite set of terminals.
– P is a finite set of productions of the form A -> α, where A is a variable and α ∈ (V ∪ T)*.
– S is a designated variable called the start symbol.

Productions
• rules for transforming nonterminal symbols into terminals or other nonterminals
• each has a left-hand side (LHS) and a right-hand side (RHS)
• every nonterminal must appear on the LHS of at least one production
• Example:
    S -> cAd
    A -> a | ab

Derivation
• A derivation is a sequence of production-rule applications that produces the input string from the start symbol.
• During parsing, we take two decisions for each sentential form:
1. Deciding which nonterminal is to be replaced.
2. Deciding the production rule by which the nonterminal will be replaced.
• To decide which nonterminal is to be replaced, we have two options.
• Left-most Derivation: If the left-most nonterminal of the sentential form is replaced at every step, so that the sentential form is expanded from left to right, it is called a left-most derivation. A sentential form derived by a left-most derivation is called a left-sentential form.
• Right-most Derivation: If the right-most nonterminal of the sentential form is replaced at every step, so that the sentential form is expanded from right to left, it is known as a right-most derivation. A sentential form derived by a right-most derivation is called a right-sentential form.

Example
Production rules:
• E → E + E
• E → E * E
• E → id
• Input string: id + id * id
• The left-most derivation is:
    E ⇒ E * E
      ⇒ E + E * E
      ⇒ id + E * E
      ⇒ id + id * E
      ⇒ id + id * id
• Notice that the left-most nonterminal is always replaced first.
• The right-most derivation is:
    E ⇒ E + E
      ⇒ E + E * E
      ⇒ E + E * id
      ⇒ E + id * id
      ⇒ id + id * id

Parse Tree
• A parse tree is a graphical depiction of a derivation. It is convenient for seeing how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree.
• The tree for the left-most derivation above is built one step at a time:
• Step 1: E ⇒ E * E
• Step 2: E ⇒ E + E * E
• Step 3: E ⇒ id + E * E
• Step 4: E ⇒ id + id * E
• Step 5: E ⇒ id + id * id
• In a parse tree:
• All leaf nodes are terminals.
• All interior nodes are nonterminals.
• Reading the leaves from left to right (in-order traversal) gives the original input string.

Ambiguous grammar
• A CFG is said to be ambiguous if and only if there exists a string in T* that has more than one parse tree, i.e., more than one Left-Most Derivation Tree (LMDT) or more than one Right-Most Derivation Tree (RMDT).
• For example, consider the grammar E -> E+E | id.
• We can create 2 parse trees from this grammar for the string id+id+id: one groups the string as (id+id)+id and the other as id+(id+id).
• Both parse trees are derived from the same grammar rules, but the trees are different. Hence the grammar is ambiguous.
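The claim that id+id+id has exactly 2 parse trees under E -> E+E | id can be checked by counting. The following is a minimal sketch, not part of the original slides; the function name count_parse_trees and the limit of 20 ids are assumptions made for illustration. It relies on the observation that every parse tree picks one '+' as its root, leaving k ids in the left subtree and n-k in the right subtree.

```c
#include <assert.h>

/* Number of distinct parse trees for n ids joined by '+',
 * under the ambiguous grammar E -> E + E | id.
 * (Hypothetical helper; the cap of 20 keeps the result small.) */
unsigned long count_parse_trees(int n)
{
    unsigned long trees[21] = {0};

    assert(n >= 1 && n <= 20);
    trees[1] = 1;                       /* E -> id: a single leaf */
    for (int len = 2; len <= n; len++) {
        /* Root is E -> E + E: k ids on the left, len - k on the right. */
        for (int k = 1; k < len; k++)
            trees[len] += trees[k] * trees[len - k];
    }
    return trees[n];
}
```

Calling count_parse_trees(3) returns 2, matching the two trees for id+id+id, and the count grows quickly for longer strings (5 trees for four ids). An unambiguous grammar would give exactly one tree per string.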
Error Detection and Recovery in Parser
• In this phase of compilation, the parser detects syntax errors in the source program and reports them to the user in the form of error messages. This process of locating errors and reporting them to the user is called the error handling process.

Syntax errors
• These errors are detected during the syntax analysis phase. Typical syntax errors are:
1. Errors in structure
2. Missing operator
3. Misspelled keywords
4. Unbalanced parentheses
• Example:
    swicth(ch) { ....... ....... }
• The keyword switch is incorrectly written as swicth. Hence, an "Unidentified keyword/identifier" error occurs.

Error recovery
Four techniques:
1. Panic mode recovery
2. Phrase-level recovery
3. Error productions
4. Global correction

1. Panic mode recovery
• When the parser encounters an error anywhere in a statement, it ignores the rest of the statement: the input from the erroneous token up to a delimiter, such as a semicolon, is skipped without being processed, and parsing resumes after the delimiter.
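The panic mode idea described above can be sketched in a few lines of C. This is an illustrative sketch only, under stated assumptions: each character of the input string stands for one token, the only legal statement is the hypothetical form i=i; (identifier, '=', identifier, ';'), and the names skip_to_delimiter and parse_statements are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Panic-mode step: skip tokens from pos up to and including the next ';'
 * (the synchronizing delimiter), or stop at the end of the input. */
size_t skip_to_delimiter(const char *tokens, size_t pos)
{
    while (tokens[pos] != '\0' && tokens[pos] != ';')
        pos++;
    if (tokens[pos] == ';')
        pos++;
    return pos;
}

/* Parse a sequence of statements of the assumed form  i = i ;
 * Returns the number of syntax errors found; *ok receives the number of
 * statements that parsed without error. */
int parse_statements(const char *tokens, int *ok)
{
    static const char expected[] = "i=i;";
    size_t pos = 0;
    int errors = 0;

    assert(tokens != NULL && ok != NULL);
    *ok = 0;
    while (tokens[pos] != '\0') {
        size_t k = 0;
        /* Match the statement token by token. */
        while (expected[k] != '\0' && tokens[pos + k] == expected[k])
            k++;
        if (expected[k] == '\0') {
            pos += sizeof expected - 1;   /* whole statement accepted */
            (*ok)++;
        } else {
            errors++;                     /* report once, then resynchronize */
            pos = skip_to_delimiter(tokens, pos + k);
        }
    }
    return errors;
}
```

For the input "i=i;i+i;i=i;" the second statement is rejected at '+', the tokens up to its ';' are discarded, and the third statement is still parsed, so one error is reported instead of the parse stopping. In a yacc grammar the same idea is usually written with the built-in error token, for example a rule of the form stmt : error ';' that resumes parsing after the semicolon.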