UNIT – IV PARSERS

Role of parsers. Classification of Parsers: Top-down parsers and predictive parser. Bottom-up Parsers – Shift Reduce: SLR, CLR and LALR parsers. Error Detection and Recovery in Parser. Specification and Automatic Construction of Parser (YACC).

YACC

• What is YACC?
– Developed by Stephen C. Johnson.
– Yacc (for "yet another compiler compiler") is the standard parser generator for the Unix operating system.
– An open-source program, yacc generates code for the parser in the C programming language.
– It is a Look Ahead Left-to-Right (LALR) parser generator.

How YACC Works

(1) Parser generation time: YACC source (*.y) --[yacc]--> y.tab.c, y.tab.h, y.output

(2) Compile time: y.tab.c --[C compiler/linker]--> a.out

(3) Run time: token stream --[a.out]--> abstract syntax tree

YACC File Format

%{
C declarations
%}
yacc declarations
%%
Grammar rules
%%
Additional C code

– Comments enclosed in /* ... */ may appear in any of the sections.
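As a concrete illustration of this layout (not taken from the slides; the token and rule names are invented for the example), a minimal .y file might look like:

```yacc
%{
/* C declarations */
#include <stdio.h>
%}
/* yacc declarations */
%token NUM
%start expr
%%
/* Grammar rules */
expr : expr '+' NUM
     | NUM
     ;
%%
/* Additional C code */
int main(void) { return yyparse(); }
```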

Definitions Section

%{
#include <stdio.h>
#include <stdlib.h>
%}
%token ID NUM   /* ID and NUM are terminals */
%start expr

YACC Declaration Summary

%start      Specify the grammar's start symbol

%union      Declare the collection of data types that semantic values may have

%token      Declare a terminal symbol (token type name) with no precedence or associativity specified

%type       Declare the type of semantic values for a nonterminal symbol

%right      Declare a terminal symbol (token type name) that is right-associative

%left       Declare a terminal symbol (token type name) that is left-associative

%nonassoc   Declare a terminal symbol (token type name) that is nonassociative (using it in a way that would be associative, e.g. x op y op z, is a syntax error)
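As a hedged illustration of how %union, %token and %type work together (the names ival, sval, NUM, ID and expr are invented for this example):

```yacc
%union {
    int   ival;    /* semantic value used for numbers     */
    char *sval;    /* semantic value used for identifiers */
}
%token <ival> NUM  /* the NUM token carries an int        */
%token <sval> ID   /* the ID token carries a string       */
%type  <ival> expr /* the nonterminal expr yields an int  */
```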

Rules Section

• This section defines the grammar.
• A context-free grammar (CFG) is a set of recursive rewriting rules (or productions) used to generate patterns of strings.
• Example:
  E → E + E
  E → E * E
  E → id
• Input string: id + id * id

Rules Section

• Normally written like this:

expr : expr '+' expr { $$ = $1 + $3; }
     | expr '-' expr { $$ = $1 - $3; }
     | expr '*' expr { $$ = $1 * $3; }
     | NUM
     ;

The Position of Rules

• The action for an alternative is written in braces immediately after that alternative's right-hand side, as in the rule above.

Communication between LEX and YACC

[Diagram] The parser function yyparse() (generated by YACC) calls the scanner function yylex() (generated by LEX) whenever it needs the next token. For the input 12 + 26, the LEX pattern [0-9]+ matches each number and returns NUM, so the parser receives the token stream NUM '+' NUM.

yacc -d gram.y will produce y.tab.h (the -d option writes the token definitions into y.tab.h so that the LEX scanner can include them).
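For the declarations shown earlier, the generated y.tab.h typically contains a #define for each %token; yacc numbers tokens from 257 upward so they cannot collide with single-character (ASCII) tokens. Exact contents vary between yacc versions, so this is only a sketch:

```c
/* Sketch of a generated y.tab.h (actual contents vary by yacc version) */
#define ID  257
#define NUM 258
extern int yylval;   /* semantic value shared between scanner and parser */
```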

Precedence / Associativity

%right '='
%left '<' '>' NE LE GE
%left '+' '-'
%left '*' '/'      ← highest precedence

Operators declared later have higher precedence, so '*' and '/' bind tightest; each %left/%right line also fixes the associativity of the operators on it.
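To see what these declarations do, consider subtraction (a worked example, not from the slides): with '-' declared %left, the input 2-3-4 groups as (2-3)-4 = -5; if '-' were instead declared %right, it would group as 2-(3-4) = 2-(-1) = 3. Precedence comes from the order of the declaration lines:

```yacc
%left '+' '-'   /* declared first: lower precedence  */
%left '*' '/'   /* declared later: higher precedence */
```

With these lines, 2+3*4 parses as 2+(3*4).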

Lex vs. Yacc

• Lex – Lex generates C code for a lexical analyzer, or scanner – Lex uses patterns that match strings in the input and converts the strings to tokens

• Yacc
– Yacc generates C code for a syntax analyzer, or parser.
– Yacc uses grammar rules that allow it to analyze tokens from Lex and create a syntax tree.

Example of LEX and YACC

Title: Implementation of Calculator using LEX and YACC

/************ LEX FILE (cal.l) ************/
%{
#include "y.tab.h"
extern int yylval;
%}
%%
[0-9]+   { yylval = atoi(yytext); return NUMBER; }
[\t]     ;
\n       return 0;
.        return yytext[0];
%%

/************ YACC FILE (cal.y) ************/
%{
#include <stdio.h>
%}
%token NUMBER
%left '+' '-'
%left '*' '/'

%%
Statement : expr { printf("\nOutput : %d", $1); }
          ;
expr : expr '+' expr { $$ = $1 + $3; }
     | expr '-' expr { $$ = $1 - $3; }
     | expr '*' expr { $$ = $1 * $3; }
     | expr '/' expr { $$ = $1 / $3; }
     | NUMBER        { $$ = $1; }
     ;
%%

main()
{
    printf("Enter the operation:");
    return yyparse();
}

yyerror(char *s)
{
    printf("%s", s);
}

yywrap()
{
    return 1;
}

Output

[root@localhost ~]# lex cal.l
[root@localhost ~]# yacc -d cal.y
[root@localhost ~]# cc y.tab.c lex.yy.c -ll
[root@localhost ~]# ./a.out
Enter the operation: 3+2
Output : 5

[root@localhost ~]# ./a.out
Enter the operation: 5-9
Output : -4

Reference Books

• lex & yacc, 2nd Edition, by John R. Levine, Tony Mason & Doug Brown, O'Reilly, ISBN 1-56592-000-7
• Mastering Regular Expressions, by Jeffrey E. F. Friedl, O'Reilly, ISBN 1-56592-257-3

Role of Parser

Types of Parser

• Top-down Parsing
When the parser starts constructing the parse tree from the start symbol and then tries to transform the start symbol to the input, it is called top-down parsing.

• Bottom-up Parsing
As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree up to the start symbol.

Definitions
• Syntax – the form or structure of the expressions, statements, and program units
• Semantics – the meaning of the expressions, statements, and program units
• Sentence – a string of characters over some alphabet
• Language – a set of sentences
• Lexeme – the lowest level syntactic unit of a language, e.g. :=, {, while
• Token – a category of lexemes (e.g., identifier)

Basic Terms
• Terminals: a terminal is a symbol which does not appear on the left-hand side of any production. A grammar contains a set of terminal symbols (tokens) such as the plus sign +, the times sign *, and other tokens defined by the lexical analyzer, such as identifiers.
• Nonterminals: nonterminals are the non-leaf nodes in a parse tree. In the expression grammar, E, T, and F are nonterminals. Sometimes nonterminals are enclosed between angle brackets to distinguish them from terminals.

Basic Terms…

• Start symbol: a special nonterminal symbol that appears in the initial string generated by the grammar.
• Ambiguity: a grammar G is said to be ambiguous if it has more than one parse tree (left or right derivation) for at least one string.

Basic Terms….

• Categories of grammars
– Regular: good for identifiers, parameter lists, subscripts
– Context-free: LHS of a production is a single non-terminal
– Context-sensitive

Context-Free Grammar (CFG)

• Definition: G = (V, T, P, S) is a CFG, where
– V is a finite set of variables (nonterminals),
– T is a finite set of terminals,
– P is a finite set of productions of the form A → α, where A is a variable and α ∈ (V ∪ T)*,
– S is a designated variable called the start symbol.

Basic Terms

Productions
• Rules for transforming nonterminal symbols into terminals or other nonterminals
• Each has a left-hand side (LHS) and a right-hand side (RHS)
• Every nonterminal must appear on the LHS of at least one production
• Example:
  S -> cAd
  A -> a | ab

Derivation
• A derivation is a sequence of production-rule applications that produces the input string.
• During parsing, we take two decisions for some sentential form of the input:
1. Deciding which non-terminal is to be replaced.
2. Deciding the production rule by which the non-terminal will be replaced.

• To decide which non-terminal to replace, we have two options.
• Left-most Derivation: if the sentential form of an input is scanned and replaced from left to right, it is called left-most derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

• Right-most Derivation: if we scan and replace the input with production rules from right to left, it is known as right-most derivation. The sentential form derived from the right-most derivation is called the right-sentential form.

Example: Production rules:

• E → E + E
• E → E * E
• E → id
• Input string: id + id * id
• The left-most derivation is:
  E → E * E
  E → E + E * E
  E → id + E * E
  E → id + id * E
  E → id + id * id
• Notice that the left-most non-terminal is always replaced first.

Example: Production rules:

• The right-most derivation is:
  E → E + E
  E → E + E * E
  E → E + E * id
  E → E + id * id
  E → id + id * id

Parse Tree

• A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from the start symbol. The start symbol of the derivation becomes the root of the parse tree.
• The left-most derivation is:
  E → E * E
  E → E + E * E
  E → id + E * E
  E → id + id * E
  E → id + id * id
• The tree grows one derivation step at a time:
  Step 1: E → E * E
  Step 2: E → E + E * E
  Step 3: E → id + E * E
  Step 4: E → id + id * E
  Step 5: E → id + id * id

Parse Tree….

• In a parse tree: • All leaf nodes are terminals. • All interior nodes are non-terminals. • In-order traversal gives original input string. Ambiguous grammar:

• A CFG is said to be ambiguous if and only if there exists a string in T* that has more than one parse tree.
• Equivalently, a CFG is ambiguous if there exists more than one derivation tree for a given input string, i.e., more than one LeftMost Derivation Tree (LMDT) or RightMost Derivation Tree (RMDT).
• For example, consider the grammar E -> E+E | id. We can create 2 parse trees from this grammar for the string id+id+id: one grouping it as (id+id)+id and one as id+(id+id). Both parse trees are derived from the same grammar rules, but the parse trees are different. Hence the grammar is ambiguous.

Error Detection and Recovery in Parser

• In this phase of compilation, all possible errors made by the user are detected and reported to the user in the form of error messages. This process of locating errors and reporting them to the user is called error handling.

Syntax Errors

• These errors are detected during the syntax analysis phase. Typical syntax errors are:
1. Errors in structure
2. Missing operator
3. Misspelled keywords
4. Unbalanced parentheses

Example:

• swicth(ch) { ...... }

• The keyword switch is incorrectly written as swicth. Hence, an “unidentified keyword/identifier” error occurs.

Error Recovery

There are 4 techniques:
1. Panic mode recovery
2. Phrase level recovery
3. Error productions
4. Global correction

1. Panic Mode Recovery
• When a parser encounters an error anywhere in the statement, it ignores the rest of the statement by not processing input from the erroneous input to a delimiter, such as a semicolon.
• In this method, on discovering an error, the parser discards input symbols one at a time.
• This process continues until one of a designated set of synchronizing tokens is found (synchronizing tokens are delimiters such as the semicolon ; or } or the end token).
• Advantage: it is easy to implement and is guaranteed not to go into an infinite loop.
• Disadvantage: a considerable amount of input is skipped without checking it for additional errors.

2. Phrase Level Recovery
• When the parser finds an error, it tries to take a corrective measure so that the rest of the statement's input allows the parser to parse ahead.
• One wrong correction can lead to an infinite loop.
• The local correction may be:
  – Replacing a prefix by some string.
  – Replacing a comma by a semicolon.
  – Deleting an extra semicolon.
  – Inserting a missing semicolon.
• Advantage: it can correct any input string.
• Disadvantage: it is difficult to cope with the actual error if it occurred before the point of detection.

3. Error Productions

• Some common errors that may occur in the code are known to the compiler designers.
• The productions which create the error-causing possibilities are added to the grammar as augmenting productions (G to G').
• These productions also detect the anticipated errors while parsing is done.

4. Global Correction

• There are algorithms which make changes to transform an incorrect string into a correct string.
• When a grammar G and an incorrect string q are given, these algorithms find a parse tree for a string p related to q using the smallest number of transformations.
• The transformations may be insertions, deletions and changes of tokens.

• Advantage: it has been used for phrase-level recovery to find optimal replacement strings.
• Disadvantage: this strategy is too costly to implement in terms of time and space.

Recursive Descent Parsing (continued)

• A recursive descent parser traces out a parse tree in top-down order. • The recursive descent parsing subprograms are built directly from the grammar rules

• Notes Predictive Parser

• Notes Left Recursion

• LEFT RECURSION
• Let G be a context-free grammar.
• A production of G is said to be left recursive if it has the form
  A → A α
where A is a nonterminal and α is a string of grammar symbols.
• The grammar G is left recursive if it has at least one left recursive nonterminal.

Left Recursion

• Because of left recursion, the parser can enter an infinite loop.
• The fix is to transform a left recursive grammar G into a grammar G' which is not left recursive and which generates the same language as G.
• THE BASIC TRICK is to replace the productions A → A α | β with
  A → β A'
  A' → α A' | ε
where A' is a new nonterminal.

• Notes Bottom-up Parsing

• As the name suggests, bottom-up parsing starts with the input symbols and tries to construct the parse tree up to the start symbol.
• Types:
1. Shift Reduce Parser
2. LR Parser: i. SLR ii. LALR iii. LR(k)
3. Operator precedence parser

1. Shift Reduce Parser

• It attempts the construction of the parse tree in the manner of bottom-up parsing, i.e. the parse tree is constructed from the leaves (bottom) to the root (up).

This parser requires some data structures:
• An input buffer for storing the input string.
• A stack for storing and accessing the production rules.

Handle

• A handle is a substring that matches the body of a production.
• Handle pruning is the general approach used in shift-reduce parsing.
• If A → β is a production, then reducing β to A by this production is called handle pruning, i.e., removing the children of A from the parse tree.
• A rightmost derivation in reverse can be obtained by handle pruning.

Basic Operations

• Shift: This involves moving of symbols from input buffer onto the stack.

• Reduce: If the handle appears on top of the stack then, its reduction by using appropriate production rule is done i.e. RHS of production rule is popped out of stack and LHS of production rule is pushed onto the stack. Basic Operations

• Accept: if only the start symbol is present in the stack and the input buffer is empty, then the parsing action is called accept. When the accept action is reached, it means parsing has completed successfully.
• Error: this is the situation in which the parser can neither perform a shift action nor a reduce action, and not even an accept action.

Example

• Consider the grammar
  S –> S + S
  S –> S * S
  S –> id
Perform shift-reduce parsing for the input string “id + id + id”.

Example

• Consider the grammar
  E –> 2 E 2
  E –> 3 E 3
  E –> 4

• Perform shift-reduce parsing for the input string “32423”.

LR Parser

• LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars.
• In LR(k) parsing:
  “L” stands for left-to-right scanning of the input,
  “R” stands for the construction of a right-most derivation in reverse, and
  “k” denotes the number of lookahead symbols used to make decisions.

LR(0) items

• Canonical Collection of LR(0) items
• An LR(0) item of a grammar G is a production of G with a dot at some position on the right side of the production.
• LR(0) items are useful to indicate how much of the input has been scanned up to a given point in the process of parsing.

Example

Given grammar:
  S → AA
  A → aA | b
Add the augment production and insert the ‘•’ symbol at the first position of every production in G:
  S` → •S
  S → •AA
  A → •aA
  A → •b

Augmented Grammar

• Before we start determining the transitions between the different states, the grammar is always augmented with an extra rule
  S → E (or S’ → S)
where S (resp. S’) is a new start symbol and E (resp. S) the old start symbol.
• The parser will use this rule for reduction exactly when it has accepted the input string.

LL vs. LR

LL                                                           | LR
Does a leftmost derivation.                                  | Does a rightmost derivation in reverse.
Starts with the root nonterminal on the stack.               | Ends with the root nonterminal on the stack.
Ends when the stack is empty.                                | Starts with an empty stack.
Uses the stack for designating what is still to be expected. | Uses the stack for designating what is already seen.
Builds the parse tree top-down.                              | Builds the parse tree bottom-up.
Continuously pops a nonterminal off the stack, and pushes    | Tries to recognize a right-hand side on the stack,
the corresponding right-hand side.                           | pops it, and pushes the corresponding nonterminal.
Expands the non-terminals.                                   | Reduces the non-terminals.
Reads the terminals when it pops one off the stack.          | Reads the terminals while it pushes them on the stack.
Pre-order traversal of the parse tree.                       | Post-order traversal of the parse tree.

Types of LR Parser

• SLR(1) – Simple LR Parser:
– Works on the smallest class of grammars
– Few states, hence a very small table
– Simple and fast construction
• LR(1) – LR Parser (CLR):
– Works on the complete set of LR(1) grammars
– Generates a large table and a large number of states
– Slow construction
• LALR(1) – Look-Ahead LR Parser:
– Works on an intermediate size of grammar
– The number of states is the same as in SLR(1)

SLR (Simple LR Parser)

• SLR is the same as the LR(0) parser, but it restricts where the reduce entries are placed.
• Various steps involved in SLR(1) parsing:
1. For the given input string, write a context-free grammar
2. Check the ambiguity of the grammar
3. Add the augment production to the given grammar
4. Create the canonical collection of LR(0) items
5. Draw the DFA (deterministic finite automaton)
6. Construct the SLR(1) parsing table

LR(0) items

• Canonical Collection of LR(0) items
• An LR(0) item is a production of G with a dot at some position on the right side of the production.
• In the LR(0) parser, the reduce entry is placed in the entire row of the table.

Closure Operation

• Closure operation: for a context-free grammar G, if I is a set of items, then the function closure(I) can be constructed using the following rules:
1. Initially, every item of the set I (a set of canonical, ordered items) is added to closure(I).
2. If A → α•Bβ is in closure(I) and there is another rule for B such as B → γ, then add B → •γ, so closure(I) contains:
   A → α•Bβ
   B → •γ

Goto Operation

• If there is a production A → α•Bβ, then Goto(A → α•Bβ, B) = A → αB•β.
• Goto shifts the dot one position ahead over the grammar symbol (which may be a terminal or a nonterminal).
• If the item A → α•Bβ is in I, then the same goto function can be written as goto(I, B).

Example

I0 = closure(S` → •E):
  S` → •E
  E → •E + T
  E → •T
  T → •T * F
  T → •F
  F → •id
I1 = Go to (I0, E) = closure(S` → E•, E → E• + T)
I2 = Go to (I0, T) = closure(E → T•, T → T• * F)
I3 = Go to (I0, F) = closure(T → F•) = T → F•
I4 = Go to (I0, id) = closure(F → id•) = F → id•
I5 = Go to (I1, +) = closure(E → E + •T):
  E → E + •T
  T → •T * F
  T → •F
  F → •id
  Go to (I5, F) = closure(T → F•) = (same as I3)
  Go to (I5, id) = closure(F → id•) = (same as I4)
I6 = Go to (I2, *) = closure(T → T * •F). Add all productions starting with F to the I6 state because the dot is followed by that nonterminal, so the I6 state becomes:
  T → T * •F
  F → •id
  Go to (I6, id) = closure(F → id•) = (same as I4)
I7 = Go to (I5, T) = closure(E → E + T•) = E → E + T•
I8 = Go to (I6, F) = closure(T → T * F•) = T → T * F•

Drawing the DFA: [diagram omitted]

CLR (1) Parsing

• CLR refers to canonical lookahead LR. CLR parsing uses the canonical collection of LR(1) items to build the CLR(1) parsing table. CLR(1) parsing produces more states than SLR(1) parsing.
• In CLR(1), the reduce entry is placed only under the lookahead symbols.

CLR (1)

• Various steps involved in CLR(1) parsing:
1. For the given input string, write a context-free grammar
2. Check the ambiguity of the grammar
3. Add the augment production to the given grammar
4. Create the canonical collection of LR(1) items
5. Draw the DFA (deterministic finite automaton)
6. Construct the CLR(1) parsing table
• LR(1) item
– An LR(1) item is an LR(0) item together with a lookahead symbol:
  LR(1) item = LR(0) item + lookahead
– The lookahead is used to determine where to place the final (reduce) item.
– The lookahead $ is always added for the augment production.

LALR
• Notes