Lex Overview Usage Paradigm of Lex to Use Lex and Yacc Together Lex Internals Mechanism Lex.Yy.C : What It Produces Running

Lex Overview
• Lex is a tool for creating lexical analyzers.
• Lexical analyzers tokenize input streams.
• Tokens are the terminals of a language.
• Regular expressions define tokens.

Usage Paradigm of Lex
[Diagram]

To Use Lex and Yacc Together
[Diagram]
• Lex source (lexical rules) -> Lex -> lex.yy.c, which provides yylex()
• Yacc source (grammar rules) -> Yacc -> y.tab.c, which provides yyparse()
• yyparse() calls yylex(); yylex() reads the input and returns a token; yyparse() produces the parsed input.

Lex Internals Mechanism
• Converts regular expressions into DFAs.
• DFAs are implemented as table-driven state machines.

lex.yy.c : What it produces
[Diagram]

Running Lex
• To run lex on a source file, use the command:
    lex source.l
• This produces the file lex.yy.c, which is the C source for the lexical analyzer.
• To compile it, use:
    cc -o prog -O lex.yy.c -ll

Versions and Reference Books
• AT&T lex, GNU flex, and a Win32 version
• lex & yacc, 2/e, by John R. Levine, Tony Mason & Doug Brown, O'Reilly
• Mastering Regular Expressions, by Jeffrey E.F. Friedl, O'Reilly

General Format of Lex Source
• The input specification file has 3 parts:
  – Declarations: definitions
  – Rules: token descriptions and actions
  – Auxiliary procedures: user-written code
• The three parts are separated by %%
• Tip: the first part defines patterns, the third part defines actions, and the second part puts them together to express "If we see some pattern, then we do some action".

Regular Policy of Non-translated Source
• Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program:
  – Any line that is not part of a Lex rule or action and that begins with a blank or tab
  – Anything included between lines containing only %{ and %}
  – Anything after the second %% delimiter

Regular Policy of Translated Source
• Various variables or tables whose names are prefixed by yy
  – yyleng, yysvec[], yywork[]
• Various functions whose names are prefixed by yy
  – yyless(), yymore(), yywrap(), yylex()…
• Various definitions whose names are in capitals
  – BEGIN, INITIAL…

Position of Copied Source
• Source input prior to the first %%
  – external to any function in the generated code
• After the first %% and prior to the second %%
  – appropriate place for declarations in the function generated by Lex which contains the actions
• After the second %%
  – after the Lex-generated output

Default Rules and Actions
• The first and second parts must exist but may be empty; the third part and the second %% are optional.
• If the third part does not contain a main(), -ll will link a default main() which calls yylex() and then exits.
• Input that no pattern matches gets a default action, which copies the input to the output.

Default Input and Output
• If you don't write your own main() to deal with the input and the output of yylex(), the default main() reads from stdin and writes to stdout.
  – stdin is usually keyboard input; stdout is usually screen output
  – cs20: % ./a.out < inputfile > outputfile

Some Simple Lex Source Examples
• A minimum lex program:
    %%
  It only copies the input to the output unchanged.
• A trivial program that deletes three spacing characters (blank, tab, newline):
    %%
    [ \t\n]    ;
• Another trivial example:
    %%
    [ \t]+$    ;
  It deletes from the input all blanks or tabs at the ends of lines.

A General Lex Source Example
    %{
    /*
     * Example lex source file
     * This first section contains necessary
     * C declarations and includes
     * to use throughout the lex specifications.
     */
    #include <stdio.h>
    #include <stdlib.h>   /* for exit() */
    %}
    bin_digit [01]
    %%
    {bin_digit}+ {
        /* match all nonempty strings of 0's and 1's */
        /*
         * Print out message with matching text
         */
        printf("BINARY: %s\n", yytext);
    }
    ([ab]*aa[ab]*bb[ab]*)|([ab]*bb[ab]*aa[ab]*) {
        /*
         * match all strings over (a,b) containing aa and bb
         */
        printf("AABB\n");
    }
    \n    ;    /* ignore newlines */
    %%
    /*
     * Now this is where you want your main program
     */
    int main(int argc, char *argv[]) {
        /*
         * call yylex to use the generated lexer
         */
        yylex();
        /*
         * make sure everything was printed
         */
        fflush(yyout);
        exit(0);
    }

Token Definitions (Extended Regular Expression)
• Elementary Operations
  – single characters
    • every character stands for itself except " \ . $ ^ [ ] - ? * + | ( ) / { } % < >
  – concatenation (put characters together)
  – alternation (a|b|c)
    • [ab] == a|b
    • [a-k] == a|b|c|...|i|j|k
    • [a-z0-9] == any lowercase letter or digit
    • [^a] == any character but a
• Elementary Operations (cont.)
  – NOTE: . matches any character except the newline
  – * -- Kleene closure
  – + -- positive closure
• Examples:
  – [0-9]+"."[0-9]+
    • note: without the quotes, the . could be any character
  – [ \t]+ -- is whitespace (blanks and tabs only; line breaks are not included)
    • There is a blank space character before the \t
• Special Characters:
  – . -- matches any single character (except newline)
  – " and \ -- quote the part as text
  – \t -- tab
  – \n -- newline
  – \b -- backspace
  – \" -- double quote
  – \\ -- \
  – ? -- the preceding item is optional
    • ab? == a|ab
    • (ab)? == ab|ε
  – ( ) -- grouping
• Special Characters (cont.)
  – ^ -- means at the beginning of the line (unless it is inside of a [ ])
  – $ -- means at the end of the line, same as /\n
  – [^ ] -- means anything except the listed characters
    • \"[^\"]*\" is a double-quoted string
  – {n,m} -- n through m occurrences
    • a{1,3} is a or aa or aaa
  – {definition} -- translation from definition
  – / -- matches only if followed by the right part of /
    • 0/1 means the 0 of 01 but not 02 or 03 or …

Definitions
• NAME REG_EXPR
  – digs       [0-9]+
  – integer    {digs}
  – plainreal  {digs}"."{digs}
  – expreal    {digs}"."{digs}[Ee][+-]?{digs}
  – real       {plainreal}|{expreal}
• NAME must be a valid C identifier
• {NAME} is replaced by the prior REG_EXPR
• The definitions can also contain variables and other declarations used by the code generated by Lex.
  – These usually go at the start of this section, either between a %{ line and a %} line or on lines that begin with a blank or tab.
  – Includes usually go here.
  – It is usually convenient to maintain a line counter so that error messages can be keyed to the lines in which the errors are found.
    %{
    int linecount = 1;
    %}

Transition Rules
• ERE <one or more blanks> { program statement
                              program statement }
• A null statement ; will ignore the input
• Four special options:
    |  ECHO;  REJECT;  BEGIN;
• An unmatched token gets the default action, which ECHOes it from the input to the output
• | indicates that the action for this rule is the action for the next rule

Tokens and Actions
• Example:
    {real}       return FLOAT;
    begin        return TOK_BEGIN;   /* not BEGIN: that name is Lex's own start-state macro */
    {newline}    linecount++;
    {integer}    {
                     printf("I found an integer\n");
                     return INTEGER;
                 }

Ambiguous Source Rules
• If 2 rules match the same string, Lex will use the first rule.
• Lex always chooses the longest matching substring for its tokens.
• To override the choice, use the action REJECT
  ex:
    she    {s++; REJECT;}
    he     {h++; REJECT;}
    .      |
    \n     ;

Multiple States
• lex allows the user to explicitly declare multiple states (in the Definitions section)
    %s COMMENT
• The default state is INITIAL or 0
• Transition rules can be assigned to different states; which rules are matched depends on the current state
• BEGIN is used to change state.

Lex Special Variables
• Identifiers used by Lex and Yacc begin with yy
  – yytext -- a string containing the lexeme
  – yyleng -- the length of the lexeme
  – yyin -- the input stream pointer
• Example:
    {integer}    {
                     printf("I found an integer\n");
                     sscanf(yytext, "%d", &yylval);
                     return INTEGER;
                 }
  – C++ comments -- // .....
    "//".*    ;

Lex library function calls
• yylex()
  – default main() contains a return yylex();
• yywrap()
  – called by the lexical analyzer at the end of the input file
  – default yywrap() always returns 1
• yyless(n)
  – only the first n characters in yytext are retained; the rest are returned to the input
• yymore()
  – the next input expression recognized is to be tacked on to the end of this input

User Written Code
• The actions associated with any given token are normally specified using statements in C. But occasionally the actions are complicated enough that it is better to describe them with a function call, and define the function elsewhere.
• Definitions of this sort go in the last section of the Lex input.

More Example 1
    %{
    #include <stdio.h>
    int lengs[100];    /* lengs[n] counts the words of length n */
    %}
    %%
    [a-z]+    { if (yyleng < 100) lengs[yyleng]++; }
    .         |
    \n        ;
    %%
    int yywrap(void)
    {
        int i;
        printf("Length No. words\n");
        for (i = 0; i < 100; i++)
            if (lengs[i] > 0)
                printf("%5d%10d\n", i, lengs[i]);
        return 1;
    }

More Example 2
[Example not preserved in the source.]

Using yacc with lex
• yacc will call yylex() to get the next token from the input, so each lex rule should end with:
    return(token);
  where the appropriate token value is returned.
• An easy way to combine the two is to place the line:
    #include "lex.yy.c"
  in the last section of the yacc input.
