
Language Processing Systems

Syntax Analysis

Prof. Mohamed Hamada
Software Engineering Lab.
The University of Aizu, Japan

Parsing

Lexical analyzer:
1. Uses regular expressions to define tokens
2. Uses finite automata to recognize tokens

How does the parser work?
Uses top-down parsing or bottom-up parsing to construct a parse tree.

[Figure: the source program feeds the lexical analyzer ("get next char" / "next char"), which feeds the syntax analyzer ("get next token" / "next token"). Both use the symbol table, which contains a record for each identifier.]

Topics:
• Top-down parsing
    – Predictive parsing
    – LL(k) parsing
    – Left recursion
    – Left factoring
• Bottom-up parsing
    – Shift-reduce parsing
    – LR(k) parsing

Yacc

[Figure: inside the compiler, the source program goes through syntax analysis, an intermediate representation, and code generation to give the target program. Yacc builds the syntax analysis phase from the token description and the language grammar.]

How to write an LR parser?

General approach: the construction is done automatically by a tool such as the Unix program yacc.

1. Use the grammar of the source language to write a simple yacc program and save it in a file named name.y.
2. Use the Unix program yacc to compile name.y, resulting in a C (parser) program named y.tab.c.
3. Compile and link the C program y.tab.c in the normal way, resulting in the required parser.

LR parser generators

Yacc: Yet Another Compiler-Compiler
• Automatically generates LALR parsers
• Created by S.C. Johnson in the 1970s

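Using Yacc

[Figure: the Yacc source program filename.y is fed to the Yacc compiler, which produces y.tab.c. The C compiler turns y.tab.c into a.out, the parser. The parser reads input tokens and produces a parse tree.]

[Figure: the lexer spec goes through LEX to a .c file, and the C compiler turns it into the lexical analyzer. The parser spec goes through Yacc to a .c file, and the C compiler turns it into the parser. Inside the compiler, the source program passes through the lexical analyzer, which hands tokens to the parser, which builds the parse tree.]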

How to write a yacc program

Layout of myfile.y:

    %{
    < C global variables, prototypes, comments >
    %}
    [DEFINITION SECTION]
    %%
    [PRODUCTION RULES SECTION]
    %%
    < C auxiliary subroutines >

• %{ ... %}: this part will be embedded into myfile.tab.c.
• Definition section: contains token declarations. Tokens are recognized in the lexer.
• Production rules section: defines how to "understand" the input language, and what actions to take for each "sentence".
• C auxiliary subroutines: any user code. For example, a main function to call the parser function yyparse().

Example

Input: tomatoes + potatoes + carrots

[Figure: the lexical analyzer turns the input into the tokens id1, PLUS, id2, PLUS, id3, and the parser builds the tree + ( + ( id1, id2 ), id3 ).]

Running Yacc programs

    % yacc -d -v my_prog.y
    % gcc y.tab.c -ly

The -d option creates a file "y.tab.h", which contains a #define statement for each terminal declared. Place #include "y.tab.h" in between the %{ and %} to use the tokens in the functions section.

The -v option creates a file "y.output", which contains useful information for debugging.

We can use Lex to create the lexical analyser. If so, we should also place #include "y.tab.h" in Lex's definitions section, and we must link the parser and lexer together with both libraries (-ly and -ll).

• Yacc produces:
    – the C file y.tab.c, which contains the C code to apply the grammar
    – y.tab.h, which contains the data structures to be used by lex to pass data to yacc

DEFINITION SECTION

Any terminal symbols which will be used in the grammar must be declared in this section as a token. For example:

    %token VERB
    %token NOUN

Non-terminals do not need to be pre-declared.

Anything enclosed between %{ ... %} in this section will be copied straight into y.tab.c (the C source for the parser).

All #include and #define statements, all variable declarations, all function declarations and any comments should be placed here.

PRODUCTION RULES SECTION

A production rule in the grammar:

    nontermsym → symbol1 symbol2 … | symbol3 symbol4 … | …

The same rule in Yacc, with one alternative per line:

    nontermsym : symbol1 symbol2 … { actions }
               | symbol3 symbol4 … { actions }
               | …
               ;

Example: the production rule expr → expr + expr becomes

    expr : expr '+' expr { $$ = $1 + $3 }

where $$ is the value of the non-terminal on the lhs and $n is the value of the n-th symbol on the rhs.

PRODUCTION RULES SECTION

Semantic Actions in Yacc

• Semantic actions are embedded in the RHS of rules. An action consists of one or more C statements, enclosed in braces { … }.
• Examples:

    ident_decl : ID { symtbl_install( id_name ); }
    type_decl : { tval = … } id_list;

Input file (each rule pairs a grammar production with a semantic action):

    %token DIGIT
    %%
    line : expr '\n'      { printf("%d\n", $1);}
         ;
    expr : expr '+' expr  { $$ = $1 + $3;}
         | expr '*' expr  { $$ = $1 * $3;}
         | '(' expr ')'   { $$ = $2;}
         | DIGIT
         ;
    %%

Yacc maintains a stack of "values" that may be referenced ($i) in the semantic actions.

Each nonterminal can return a value.
    – The value returned by the i-th symbol on the RHS is denoted by $i.
    – An action that occurs in the middle of a rule counts as a "symbol" for this.
    – To set the value to be returned by a rule, assign to $$.
By default, the value returned by a rule is the value of the first RHS symbol, i.e., $1.

Example

Grammar:

    statement → expression
    expression → expression + expression
               | expression - expression
               | expression * expression
               | expression / expression
               | NUMBER

Yacc rules:

    statement : expression { printf(" = %g\n", $1); }
    expression : expression '+' expression { $$ = $1 + $3; }
               | expression '-' expression { $$ = $1 - $3; }
               | expression '*' expression { $$ = $1 * $3; }
               | expression '/' expression { $$ = $1 / $3; }
               | NUMBER { $$ = $1; }
               ;
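The value stack behind $$ and $i can be pictured with a small C sketch. This is a simplified illustration, not the code yacc generates: the names vstack, push_value, reduce_add and top_value are hypothetical, and a real parser keeps a state stack alongside the values.

```c
#define STACK_MAX 64              /* hypothetical fixed capacity */

static int vstack[STACK_MAX];     /* one value per grammar symbol on the parse stack */
static int vtop = 0;              /* number of values currently stored */

/* Shifting a symbol pushes its value (for a token, the value from the lexer). */
void push_value(int v)
{
    if (vtop < STACK_MAX)
        vstack[vtop++] = v;
}

/* Reducing by  expr : expr '+' expr { $$ = $1 + $3; }
   The three topmost values are $1, $2 and $3; they are replaced by $$. */
void reduce_add(void)
{
    int *rhs = &vstack[vtop - 3]; /* rhs[0] is $1, rhs[1] is $2, rhs[2] is $3 */
    int result = rhs[0] + rhs[2]; /* the semantic action */
    vtop -= 3;                    /* pop the right-hand side */
    vstack[vtop++] = result;      /* push $$ for the left-hand side */
}

/* Value of the symbol on top of the stack. */
int top_value(void)
{
    return vstack[vtop - 1];
}
```

Pushing 2, then a value for '+', then 5, and calling reduce_add() leaves the single value 7 on the stack, which is what the action $$ = $1 + $3 computes for the input 2+5.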

C auxiliary subroutines

This section contains the user-defined main() routine, plus any other required functions. It is usual to include:

lexerr() - to be called if the lexical analyser finds an undefined token. The default case in the lexical analyser must therefore call this function.

yyerror(char*) - to be called if the parser cannot recognise the syntax of part of the input. The parser will pass a string describing the type of error.

The line number of the input when the error occurs is held in yylineno.

The last token read is held in yytext.

Yacc interface to lexical analyzer

• Yacc invokes yylex() to get the next token
• the "value" of a token must be stored in the global variable yylval
• the default value type is int, but can be changed

Example:

    %%
    yylex()
    {
        int c;

        c = getchar();
        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        }
        return c;
    }
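The contract between yylex() and the parser can be tried outside yacc with a stand-alone sketch. The version below reads from a string instead of standard input so that it is easy to exercise. The names next_token and tok_value are hypothetical stand-ins for yylex and yylval, and the value 257 for DIGIT is an assumption: the actual number comes from the generated y.tab.h.

```c
#include <ctype.h>

/* Assumed token code. Codes above 255 cannot clash with
   single-character tokens such as '+' or '\n'. */
#define DIGIT 257

/* Value of the most recent token; plays the role of yylval. */
static int tok_value;

/* Return the next token code from *input and advance the pointer.
   Returns 0 once the string is exhausted. */
int next_token(const char **input)
{
    unsigned char c = (unsigned char)**input;

    if (c == '\0')
        return 0;
    (*input)++;
    if (isdigit(c)) {
        tok_value = c - '0';   /* token value travels separately from the code */
        return DIGIT;
    }
    return c;                  /* any other character is its own token code */
}
```

For the input "7+", successive calls return DIGIT (with tok_value set to 7), then '+', then 0.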

Yacc interface to back-end

• Yacc generates a function named yyparse()
• syntax errors are reported by invoking a callback function yyerror()

Example:

    %%
    yylex()
    {
        ...
    }

    main()
    {
        yyparse();
    }

    yyerror()
    {
        printf("syntax error\n");
        exit(1);
    }

Yacc Errors

Yacc cannot accept ambiguous grammars, nor can it accept grammars requiring two or more symbols of lookahead.

The two common error messages are:

    shift-reduce conflict
    reduce-reduce conflict

The first case is where the parser would have a choice as to whether it shifts the next symbol from the input, or reduces the current symbols on the top of the stack.

The second case is where the parser has a choice of rules to reduce the stack.

Do not let errors go uncorrected. A parser will be generated, but it may produce unexpected results.

Study the file "y.output" to find out where the errors occur.

The SUN C compiler and the Berkeley PASCAL compiler are both written in Yacc.

You should be able to change your grammar rules to get an unambiguous grammar.

Example 1

    Expr : INT_T
         | Expr '+' Expr
         ;

Causes a shift-reduce error, because INT_T + INT_T + INT_T can be parsed in two ways.

    Animal : Dog
           | Cat
           ;
    Dog : FRED_T;
    Cat : FRED_T;

Causes a reduce-reduce error, because FRED_T can be parsed in two ways.
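The claim that INT_T + INT_T + INT_T can be parsed in two ways can be checked by counting parse trees for the ambiguous rule Expr : INT_T | Expr '+' Expr. The sketch below is only a counting argument written in C; the function name count_parses is hypothetical and has nothing to do with yacc's own output.

```c
/* Number of distinct parse trees for a chain of n operands joined by '+',
   under the ambiguous grammar  Expr : INT_T | Expr '+' Expr.
   The top-level '+' can be any of the n-1 operators; it splits the chain
   into a left part with k operands and a right part with n-k operands. */
unsigned long count_parses(int n)
{
    unsigned long total = 0;

    if (n <= 1)
        return 1;              /* a single INT_T has exactly one tree */
    for (int k = 1; k < n; k++)
        total += count_parses(k) * count_parses(n - k);
    return total;
}
```

For three operands the count is 2, matching the two groupings (INT_T + INT_T) + INT_T and INT_T + (INT_T + INT_T). The count grows quickly with longer chains, which is why the grammar has to be made unambiguous or given precedence declarations.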

Yacc Errors

Example 2

    %token DIGIT
    %%
    line : expr '\n'      { printf("%d\n", $1);}
         ;
    expr : expr '+' expr  { $$ = $1 + $3;}
         | expr '*' expr  { $$ = $1 * $3;}
         | '(' expr ')'   { $$ = $2;}
         | DIGIT
         ;
    %%
    yylex()
    {
        int c;

        c = getchar();

        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        }
        return c;
    }

Conflict resolution in Yacc

• shift-reduce: prefer shift
• reduce-reduce: prefer the rule that comes first

Correcting errors

1. input file (desk0.y)
2. run yacc

    > yacc -v desk0.y
    Conflicts: 4 shift/reduce

Correcting errors: reading y.output

    > cat y.output
    State 11 conflicts: 2 shift/reduce
    State 12 conflicts: 2 shift/reduce

    Grammar

        0 $accept: line $end
        1 line: expr '\n'
        2 expr: expr '+' expr
        3     | expr '*' expr
        4     | '(' expr ')'
        5     | DIGIT

    state 11

        2 expr: expr . '+' expr
        2     | expr '+' expr .
        3     | expr . '*' expr

        '+'  shift, and go to state 8
        '*'  shift, and go to state 9

        '+'       [reduce using rule 2 (expr)]
        '*'       [reduce using rule 2 (expr)]
        $default  reduce using rule 2 (expr)

    state 12

        2 expr: expr . '+' expr
        3     | expr . '*' expr
        3     | expr '*' expr .

        '+'  shift, and go to state 8
        '*'  shift, and go to state 9

        '+'       [reduce using rule 3 (expr)]
        '*'       [reduce using rule 3 (expr)]
        $default  reduce using rule 3 (expr)

Operator precedence in Yacc

Defining each operator's precedence and associativity resolves the shift/reduce conflicts in Example 2. The declarations go in the definition section:

    %left '+' '-'
    %left '*' '/'

• Priority runs from top (low) to bottom (high): higher precedence operators are defined later.
• %left specifies the associativity.

Example 2, corrected

    %token DIGIT
    %left '+'
    %left '*'
    %%
    line : expr '\n'      { printf("%d\n", $1);}
         ;
    expr : expr '+' expr  { $$ = $1 + $3;}
         | expr '*' expr  { $$ = $1 * $3;}
         | '(' expr ')'   { $$ = $2;}
         | DIGIT
         ;
    %%
    yylex()
    {
        int c;

        c = getchar();

        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        }
        return c;
    }

    > yacc -v desk0.y
    > gcc -o desk0 y.tab.c
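The effect of declaring '+' and '*' left-associative, with '*' at the higher level, can be reproduced by hand. The sketch below uses precedence climbing, a different technique from the LALR tables yacc builds, chosen only because it fits in a few lines. All names (eval, parse_expr, parse_primary, prec) are hypothetical, and there is no error handling.

```c
#include <ctype.h>

static const char *cur;                 /* current position in the input */

/* Precedence levels mirroring  %left '+'  then  %left '*'  (later = higher). */
static int prec(int op)
{
    return op == '+' ? 1 : op == '*' ? 2 : 0;
}

static int parse_expr(int min_prec);

/* A primary is a single digit or a parenthesised expression. */
static int parse_primary(void)
{
    if (*cur == '(') {
        cur++;
        int v = parse_expr(1);
        if (*cur == ')')
            cur++;
        return v;
    }
    if (isdigit((unsigned char)*cur))
        return *cur++ - '0';
    return 0;                           /* malformed input is not diagnosed */
}

/* Parse operators whose precedence is at least min_prec. */
static int parse_expr(int min_prec)
{
    int lhs = parse_primary();

    while (prec(*cur) >= min_prec) {
        int op = *cur++;
        /* Asking for a strictly higher level on the right makes the
           operator left-associative, like %left. */
        int rhs = parse_expr(prec(op) + 1);
        lhs = (op == '+') ? lhs + rhs : lhs * rhs;
    }
    return lhs;
}

/* Evaluate an expression built from digits, '+', '*' and parentheses. */
int eval(const char *s)
{
    cur = s;
    return parse_expr(1);
}
```

With these levels, eval("2+3*4") groups the multiplication first and gives 14, while eval("(2+3)*4") gives 20.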

Exercise

The interpreter, extended to multiple lines:

    %%
    lines : line
          | lines line
          ;
    line : expr '\n'      { printf("%d\n", $1);}
         ;
    expr : expr '+' expr  { $$ = $1 + $3;}
         | expr '*' expr  { $$ = $1 * $3;}
         | '(' expr ')'   { $$ = $2;}
         | DIGIT
         ;
    %%

Extend the interpreter to a desk calculator with registers named a – z. Example input: v=3*(+4)

Answer

    %{
    int reg[26];
    %}
    %token DIGIT
    %token REG
    %right '='
    %left '+'
    %left '*'
    %%
    expr : REG '=' expr   { $$ = reg[$1] = $3;}
         | expr '+' expr  { $$ = $1 + $3;}
         | expr '*' expr  { $$ = $1 * $3;}
         | '(' expr ')'   { $$ = $2;}
         | REG            { $$ = reg[$1];}
         | DIGIT
         ;
    %%
    yylex()
    {
        int c = getchar();
        if (isdigit(c)) {
            yylval = c - '0';
            return DIGIT;
        } else if ('a' <= c && c <= 'z') {
            yylval = c - 'a';
            return REG;
        }
        return c;
    }

A case study 1: Example Yacc Script

    S → NP VP
    NP → Det NP1 | PN
    NP1 → Adj NP1 | N
    Det → a | the
    PN → peter | paul | mary
    Adj → large | grey
    N → dog | cat | male | female
    VP → V NP
    V → is | likes | hates

We want to write a Yacc script which will handle files with multiple sentences from this grammar. Each sentence will be delimited by a ".".

Change the first production to

    S → NP VP .

and add

    D → S D | S

The Lex Script

    %{
    /* simple part of speech lexer */
    #include "y.tab.h"
    %}
    L [a-zA-Z]
    %%
    [ \t\n]+          /* ignore space */
    is|likes|hates    return VERB_T;
    a|the             return DET_T;
    dog |
    cat |
    male |
    female            return NOUN_T;
    peter|paul|mary   return PROPER_T;
    large|grey        return ADJ_T;
    \.                return PERIOD_T;
    {L}+              lexerr();
    .                 lexerr();
    %%

Yacc Definitions

    %{
    /* simple natural language grammar */
    #include <stdio.h>
    #include "y.tab.h"
    extern int yyleng;
    extern char yytext[];
    extern int yylineno;
    extern int yylval;
    extern int yyparse();
    %}
    %token DET_T
    %token NOUN_T
    %token PROPER_T
    %token VERB_T
    %token ADJ_T
    %token PERIOD_T
    %%

Yacc rules

    /* a document is a sentence followed
       by a document, or is empty */
    Doc : Sent Doc
        | /* empty */
        ;
    Sent : NounPhrase VerbPhrase PERIOD_T
         ;
    NounPhrase : DET_T NounPhraseUn
               | PROPER_T
               ;
    NounPhraseUn : ADJ_T NounPhraseUn
                 | NOUN_T
                 ;
    VerbPhrase : VERB_T NounPhrase
               ;
    %%

User-defined functions

    void lexerr()
    {
        printf("Invalid input '%s' at line %i\n", yytext, yylineno);
        exit(1);
    }

    void yyerror(char *s)
    {
        (void)fprintf(stderr, "%s at line %i, last token: %s\n",
                      s, yylineno, yytext);
    }

    int main()
    {
        if (yyparse() == 0)
            printf("Parse OK\n");
        else
            printf("Parse Failed\n");
        return 0;
    }
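The word classes used by the Lex script for the sentence grammar can also be written as a plain C lookup, which makes it easy to see which token each word maps to. This is a sketch only: the token names come from the %token declarations, but the numeric values, the INVALID_T marker and the function name classify are assumptions, and the real lexer calls lexerr() instead of returning a marker.

```c
#include <string.h>

/* Token names as declared with %token; the numeric values are assumed
   (the real ones are written to y.tab.h by yacc -d). */
enum {
    DET_T = 258, NOUN_T, PROPER_T, VERB_T, ADJ_T, PERIOD_T,
    INVALID_T = -1            /* hypothetical marker for unknown words */
};

struct entry {
    const char *word;
    int token;
};

/* Same vocabulary as the Lex rules. */
static const struct entry lexicon[] = {
    { "is", VERB_T }, { "likes", VERB_T }, { "hates", VERB_T },
    { "a", DET_T }, { "the", DET_T },
    { "dog", NOUN_T }, { "cat", NOUN_T }, { "male", NOUN_T }, { "female", NOUN_T },
    { "peter", PROPER_T }, { "paul", PROPER_T }, { "mary", PROPER_T },
    { "large", ADJ_T }, { "grey", ADJ_T },
    { ".", PERIOD_T },
};

/* Map one word to its token code, or INVALID_T if it is not in the lexicon. */
int classify(const char *word)
{
    for (size_t i = 0; i < sizeof lexicon / sizeof lexicon[0]; i++)
        if (strcmp(lexicon[i].word, word) == 0)
            return lexicon[i].token;
    return INVALID_T;
}
```

A word such as "dogcat" matches no entry, which corresponds to the {L}+ catch-all rule in the Lex script.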

Running the example

    % yacc -d -v parser.y
    % cc -c y.tab.c
    % lex parser.l
    % cc -c lex.yy.c
    % cc y.tab.o lex.yy.o -o parser -ly -ll

file1:

    peter is a large grey cat.
    the dog is a female.
    paul is peter.

file2:

    the cat is mary.
    a dogcat is a male.

file3:

    peter is male.
    mary is a female.

    % parser < file1
    Parse OK
    % parser < file2
    Invalid input 'dogcat' at line 2
    % parser < file3
    syntax error at line 1, last token: male

A case study 2: The Calculator

zcalc.l

    %{
    #include "zcalc.tab.h"
    %}
    %%
    ([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) {
                yylval.dval = atof(yytext);
                return NUMBER;
            }
    [ \t]   ;
    [a-zA-Z][a-zA-Z0-9]* {
                struct symtab *sp = symlook(yytext);
                yylval.symp = sp;
                return NAME;
            }
    %%

zcalc.y

    %{
    #include "zcalc.h"
    %}
    %union { double dval; struct symtab *symp; }
    %token <symp> NAME
    %token <dval> NUMBER
    %left '+' '-'
    %type <dval> expression
    %%
    statement_list : statement '\n'
                   | statement_list statement '\n'
                   ;
    statement : NAME '=' expression { $1->value = $3; }
              | expression { printf(" = %g\n", $1); }
              ;
    expression : expression '+' expression { $$ = $1 + $3; }
               | expression '-' expression { $$ = $1 - $3; }
               | NUMBER { $$ = $1; }
               | NAME { $$ = $1->value; }
               ;
    %%
    struct symtab * symlook( char *s )
    {
        /* this function looks up the symbol table and checks whether
           the symbol s is already there. If not, it adds s to the
           symbol table. */
    }

    int main()
    {
        yyparse();
        return 0;
    }
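The body of symlook() is only described in a comment. One way it could be filled in is sketched below, under assumptions the slides do not confirm: a fixed-size table of NSYMS entries, a name field next to the value field used in the grammar actions, and a NULL return when the table is full. The contents of zcalc.h are not shown in the source, so the struct layout here is a guess.

```c
#include <stdlib.h>
#include <string.h>

#define NSYMS 20                      /* assumed table capacity */

struct symtab {
    char  *name;                      /* assumed field; NULL marks a free slot */
    double value;                     /* the field used as $1->value in the grammar */
};

static struct symtab symtab[NSYMS];   /* zero-initialised: all slots free */

/* Return the entry for s, adding it if it is not already present.
   Returns NULL if the table is full or memory runs out. */
struct symtab *symlook(char *s)
{
    for (struct symtab *sp = symtab; sp < &symtab[NSYMS]; sp++) {
        if (sp->name != NULL && strcmp(sp->name, s) == 0)
            return sp;                /* already in the table */
        if (sp->name == NULL) {       /* first free slot: s is new */
            size_t len = strlen(s) + 1;
            sp->name = malloc(len);
            if (sp->name == NULL)
                return NULL;
            memcpy(sp->name, s, len); /* keep a private copy of the name */
            return sp;
        }
    }
    return NULL;                      /* no free slot left */
}
```

Because entries are never removed, the first free slot always follows every used slot, so a single pass serves both the lookup and the insertion. The name is copied because yytext is overwritten by the next token.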