
Lex Overview

• Lex is a tool for creating lexical analyzers.

• Lexical analyzers tokenize input streams.

• Tokens are the terminals of a language.

• Regular expressions define tokens.

Usage Paradigm of Lex

To Use Lex and Yacc Together

[Figure: the Lex source (lexical rules) is processed by Lex into lex.yy.c, and the Yacc source (grammar rules) is processed by Yacc into y.tab.c. At run time yyparse() calls yylex() on the input, yylex() returns a token, and yyparse() produces the parsed input.]

Lex Internals Mechanism

• Converts regular expressions into DFAs.

• DFAs are implemented as table-driven state machines.
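The table-driven idea can be sketched by hand. The following is a minimal illustration in C, not the tables Lex actually emits: it hard-codes a DFA for the two patterns [0-9]+ and [0-9]+"."[0-9]+ and remembers the last accepting state so that it returns the longest match. All names (scan_number, TOK_INTEGER, and so on) are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Token kinds reported by this sketch (hypothetical names). */
enum token_kind { TOK_NONE, TOK_INTEGER, TOK_REAL };

/* Character classes: the table is indexed by class, not by raw character. */
enum char_class { CC_DIGIT, CC_DOT, CC_OTHER, CC_COUNT };

/* DFA states for [0-9]+ and [0-9]+"."[0-9]+ ; DEAD means "no transition". */
enum state { S_START, S_INT, S_DOT, S_FRAC, S_COUNT, DEAD = -1 };

static const int transition[S_COUNT][CC_COUNT] = {
    /*            digit    dot     other */
    /* START */ { S_INT,   DEAD,   DEAD },
    /* INT   */ { S_INT,   S_DOT,  DEAD },
    /* DOT   */ { S_FRAC,  DEAD,   DEAD },
    /* FRAC  */ { S_FRAC,  DEAD,   DEAD },
};

/* Which token, if any, each state accepts. */
static const enum token_kind accepting[S_COUNT] = {
    TOK_NONE, TOK_INTEGER, TOK_NONE, TOK_REAL
};

static enum char_class classify(char c)
{
    if (c >= '0' && c <= '9') return CC_DIGIT;
    if (c == '.') return CC_DOT;
    return CC_OTHER;
}

/* Scan the longest token at the start of input. Stores its length in
 * *length and returns its kind (TOK_NONE with length 0 if nothing matches). */
enum token_kind scan_number(const char *input, size_t *length)
{
    assert(input != NULL && length != NULL);
    int state = S_START;
    enum token_kind last_kind = TOK_NONE;
    size_t last_len = 0;

    for (size_t i = 0; input[i] != '\0'; i++) {
        state = transition[state][classify(input[i])];
        if (state == DEAD)
            break;
        if (accepting[state] != TOK_NONE) {
            /* Remember the most recent accepting position. */
            last_kind = accepting[state];
            last_len = i + 1;
        }
    }
    *length = last_len;
    return last_kind;
}
```

Calling scan_number("12.5x", &n) walks START, INT, INT, DOT, FRAC and stops at 'x'. On input "12." the scanner falls back to the integer "12", because the state reached after the dot is not accepting.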

lex.yy.c : What it produces

Running Lex

• To run lex on a source file, use the command: lex source.l
• This produces the file lex.yy.c, which is the C source for the lexical analyzer.
• To compile this, use: cc -o prog -O lex.yy.c -ll

Versions and Reference Books

• AT&T lex, GNU flex, and Win32 version

• lex & yacc, 2/e, by John R. Levine, Tony Mason & Doug Brown, O’Reilly

• Mastering Regular Expressions, by Jeffrey E.F. Friedl, O’Reilly

General Format of Lex Source

• Input specification file is in 3 parts
 – Declarations: Definitions
 – Rules: Token descriptions and actions
 – Auxiliary Procedures: User-written code
• Three parts are separated by %%
• Tips: The first part defines patterns, the third part defines actions, and the second part puts them together to express “If we see some pattern, then we do some action”.

Regular Policy of Non-translated Source

• Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied into the generated program.
 – Any line which is not part of a Lex rule or action and which begins with a blank or tab
 – Anything included between lines containing only %{ and %}
 – Anything after the second %% delimiter

Regular Policy of Translated Source

• Various variables or tables whose names are prefixed by yy
 – yyleng, yysvec[], yywork[]
• Various functions whose names are prefixed by yy
 – yyless(), yymore(), yywrap(), yylex()…
• Various definitions whose names are in capitals
 – BEGIN, INITIAL…

Position of Copied Source

• Source input prior to the first %%
 – external to any function in the generated code
• After the first %% and prior to the second %%
 – appropriate place for declarations in the function generated by Lex which contains the actions
• After the second %%
 – after the Lex generated output

Default Rules and Actions

• The first and second parts must exist, but may be empty; the third part and the second %% are optional.
• If the third part does not contain a main(), -ll will link a default main() which calls yylex() and then exits.
• Unmatched patterns will perform a default action, which consists of copying the input to the output.

Default Input and Output

• If you don’t write your own main() to deal with the input and the output of yylex(), the default input of the default main() is stdin and the default output of the default main() is stdout.
 – stdin is usually keyboard input
 – stdout is usually screen output
 – cs20: % ./a.out < inputfile > outputfile

Some Simple Lex Source Examples

• A minimum lex program:
    %%
  It only copies the input to the output unchanged.
• A trivial program to delete three spacing characters (blank, tab, newline):
    %%
    [ \t\n] ;
• Another trivial example:
    %%
    [ \t]+$ ;
  It deletes from the input all blanks or tabs at the ends of lines.

A General Lex Source Example

    %{
    /*
     * Example lex source file
     * This first section contains necessary
     * C declarations and includes
     * to use throughout the lex specifications.
     */
    #include <stdio.h>
    %}
    bin_digit [01]
    %%
    {bin_digit}* {
        /* match all strings of 0's and 1's */
        /*
         * Print out message with matching
         * text
         */
        printf("BINARY: %s\n", yytext);
    }
    ([ab]*aa[ab]*bb[ab]*)|([ab]*bb[ab]*aa[ab]*) {
        /* match all strings over
         * (a,b) containing aa and bb
         */
        printf("AABB\n");
    }
    \n ; /* ignore newlines */
    %%
    /*
     * Now this is where you want your main
     * program
     */
    int main(int argc, char *argv[]) {
        /*
         * call yylex to use the generated lexer
         */
        yylex();
        /*
         * make sure everything was printed
         */
        fflush(yyout);
        return(0);
    }

Token Definitions (Extended Regular Expressions)

• Elementary Operations
 – single characters
   • except " \ . $ ^ [ ] - ? * + | ( ) / { } % < >
 – concatenation (put characters together)
 – alternation (a|b|c)
   • [ab] == a|b
   • [a-k] == a|b|c|...|i|j|k
   • [a-z0-9] == any lowercase letter or digit
   • [^a] == any character but a
• Elementary Operations (cont.)
 – NOTE: . matches any character except the newline
 – * -- Kleene Closure
 – + -- Positive Closure
• Examples:
 – [0-9]+"."[0-9]+
   • note: without the quotes, the . could be any character
 – [ \t]+ -- is whitespace
   • (except CR).
   • There is a blank space character before the \t

• Special Characters:
 – . -- matches any single character (except newline)
 – " and \ -- quote the part as text
 – \t -- tab
 – \n -- newline
 – \b -- backspace
 – \" -- double quote
 – \\ -- \
 – ? -- this means the preceding was optional
   • ab? == a|ab
   • (ab)? == ab|ε
• Special Characters (cont.)
 – ^ -- means at the beginning of the line (unless it is inside of a [ ])
 – $ -- means at the end of the line, same as /\n
 – [^ ] -- means anything except
   • \"[^\"]*\" is a double quoted string
 – {n,m} -- n through m occurrences
   • a{1,3} is a or aa or aaa
 – {definition} -- translation from definition
 – / -- matches only if followed by the right part of /
   • 0/1 means the 0 of 01 but not 02 or 03 or …
 – ( ) -- grouping
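Several of the operators above can be tried outside Lex. The sketch below assumes a POSIX system with <regex.h> and uses POSIX extended regular expressions, which share the ?, +, *, {n,m}, [ ] and ( ) operators with Lex. Lex-only features such as "..." quoting, {definition} and trailing context (/) are not POSIX ERE, so the quoted "." is written as \. here. The helper name matches_whole is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if the whole of text matches the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern is too
 * long for the local buffer or fails to compile. */
int matches_whole(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    char anchored[256];
    regex_t re;

    /* Anchor the pattern so that it must cover the entire string. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, matches_whole("a{1,3}", "aa") succeeds while "aaaa" is rejected. "[0-9]+\\.[0-9]+" accepts "3.14" but not "3x14", and the unescaped "[0-9]+.[0-9]+" accepts both, which is the point of the note about quoting the dot.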

Definitions

• NAME REG_EXPR
 – digs [0-9]+
 – integer {digs}
 – plainreal {digs}"."{digs}
 – expreal {digs}"."{digs}[Ee][+-]?{digs}
 – real {plainreal}|{expreal}
• NAME must be a valid C identifier
• {NAME} is replaced by prior REG_EXPR

• The definitions can also contain variables and other declarations used by the code generated by Lex.
 – These usually go at the start of this section, marked by %{ at the beginning and %} at the end, or on lines which begin with a blank or tab.
 – Includes usually go here.
 – It is usually convenient to maintain a line counter so that error messages can be keyed to the lines in which the errors are found.

    %{
    int linecount = 1;
    %}

Transition Rules

• ERE { program statement
        program statement }
• A null statement ; will ignore the input
• Four special options: | ; REJECT; BEGIN;
• Unmatched input uses the default action, ECHO, which copies it from the input to the output
• | indicates that the action for this rule is the same as the action for the next rule

Tokens and Actions

• Example:
    {real}      return FLOAT;
    begin       return BEGIN;
    {newline}   linecount++;
    {integer}   {
                    printf("I found an integer\n");
                    return INTEGER;
                }

Ambiguous Source Rules

• If 2 rules match the same pattern, Lex will use the first rule.
• Lex always chooses the longest matching substring for its tokens.
• To override the choice, use the action REJECT:
    she   {s++; REJECT;}
    he    {h++; REJECT;}
    .     |
    \n    ;

Multiple States

• lex allows the user to explicitly declare multiple states (in the Definitions section)
    %s COMMENT
• The default state is INITIAL, or 0
• Transition rules can be classified into different states; which rules are matched depends on the current state
• BEGIN is used to change state
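The two disambiguation rules (longest match first, then the earliest rule on a tie) can be sketched in plain C. This is an illustration under simplifying assumptions, not how Lex is implemented: each rule is a hand-written matcher that reports how many leading characters it accepts, and REJECT is not modeled. The names choose_rule, RULE_BEGIN and RULE_IDENT are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rule numbers follow the order in which the rules would be written. */
enum { RULE_NONE = -1, RULE_BEGIN = 0, RULE_IDENT = 1 };

/* Rule 0: the literal keyword "begin". */
static size_t match_begin(const char *s)
{
    return strncmp(s, "begin", 5) == 0 ? 5 : 0;
}

/* Rule 1: [a-z]+ */
static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

typedef size_t (*matcher_fn)(const char *);
static const matcher_fn rules[] = { match_begin, match_ident };

/* Pick the rule with the longest match at the start of input.
 * The strict '>' comparison keeps the earlier rule when lengths tie. */
int choose_rule(const char *input, size_t *length)
{
    assert(input != NULL && length != NULL);
    int best = RULE_NONE;
    size_t best_len = 0;

    for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
        size_t len = rules[r](input);
        if (len > best_len) {
            best = (int)r;
            best_len = len;
        }
    }
    *length = best_len;
    return best;
}
```

On "begin" both rules match five characters, so the first rule wins. On "beginning" the identifier rule matches nine characters and is chosen because it is longer.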

Lex Special Variables

• Identifiers used by Lex and Yacc begin with yy
 – yytext -- a string containing the lexeme
 – yyleng -- the length of the lexeme
 – yyin -- the input stream pointer
• Example:
    {integer}   {
                    printf("I found an integer\n");
                    sscanf(yytext, "%d", &yylval);
                    return INTEGER;
                }
 – C++ Comments -- // .....
    "//".*      ;
   (the slashes are quoted because an unquoted / is the trailing-context operator)

Lex library function calls

• yylex()
 – default main() contains a return yylex();
• yywrap()
 – called by the lexical analyzer at the end of the input file
 – default yywrap() always returns 1
• yyless(n)
 – only the first n characters in yytext are retained; the rest are returned to the input
• yymore()
 – the next input expression recognized is to be tacked on to the end of this input

User Written Code

• The actions associated with any given token are normally specified using statements in C. But occasionally the actions are complicated enough that it is better to describe them with a function call, and define the function elsewhere.
• Definitions of this sort go in the last section of the Lex input.

Example 1

        int lengs[100];
    %%
    [a-z]+  lengs[yyleng]++;
    .       |
    \n      ;
    %%
    int yywrap()
    {
        int i;
        printf("Length No. words\n");
        for (i = 0; i < 100; i++)
            if (lengs[i] > 0)
                printf("%5d%10d\n", i, lengs[i]);
        return(1);
    }

More Example 2

Using yacc with lex

• yacc will call yylex() to get the token from the input, so each lex rule should end with:
    return(token);
  where the appropriate token value is returned.
• An easy way is placing the line:
    #include "lex.yy.c"
  in the last section of the yacc input.
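The calling contract between the parser and the lexer can be illustrated without running either tool. In the sketch below, yylex() and yylval follow the usual Lex/Yacc naming, but everything is hand-written for illustration: the token codes, the set_input() helper and the parse_sum() loop are assumptions standing in for generated code, and parse_sum() only accepts sums such as 1 + 20 + 300.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token codes (hypothetical values; 0 conventionally means end of input). */
enum { END_OF_INPUT = 0, INTEGER = 258, PLUS = 259 };

int yylval;                      /* semantic value of the last token */
static const char *cursor = "";  /* current read position in the input */

/* Hypothetical helper: point the hand-written lexer at a string. */
void set_input(const char *text)
{
    assert(text != NULL);
    cursor = text;
}

/* Hand-written stand-in for the lexer Lex would generate:
 * every recognized token ends with a return of its token code. */
int yylex(void)
{
    while (*cursor == ' ' || *cursor == '\t' || *cursor == '\n')
        cursor++;
    if (*cursor == '\0')
        return END_OF_INPUT;
    if (isdigit((unsigned char)*cursor)) {
        yylval = 0;
        while (isdigit((unsigned char)*cursor))
            yylval = yylval * 10 + (*cursor++ - '0');
        return INTEGER;
    }
    if (*cursor == '+') {
        cursor++;
        return PLUS;
    }
    return (unsigned char)*cursor++;  /* any other character is its own token */
}

/* Stand-in for yyparse(): repeatedly calls yylex() and adds up
 * INTEGER PLUS INTEGER ... ; returns -1 on a syntax error. */
int parse_sum(void)
{
    int total = 0;
    int expect_number = 1;

    for (;;) {
        int token = yylex();
        if (token == END_OF_INPUT)
            return expect_number ? -1 : total;
        if (expect_number && token == INTEGER) {
            total += yylval;
            expect_number = 0;
        } else if (!expect_number && token == PLUS) {
            expect_number = 1;
        } else {
            return -1;
        }
    }
}
```

After set_input("1 + 20 + 300"), parse_sum() pulls five tokens from yylex() and then the end-of-input code, reading each integer's value from yylval.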
