Other Scanner Issues Identifiers vs. Reserved Words We will consider other practical Most programming languages issues in building real scanners contain reserved words like if, for real programming languages. while, switch, etc. These tokens Our finite automaton model look like ordinary identifiers, but sometimes needs to be aren’t. augmented. Moreover, error It is up to the scanner to decide if handling must be incorporated what looks like an identifier is into any practical scanner. really a reserved word. This distinction is vital as reserved words have different token codes than identifiers and are parsed differently. How can a scanner decide which tokens are identifiers and which are reserved words?

© © CS 536 Spring 2008 146 CS 536 Spring 2008 147

• We can scan identifiers and • We can give distinct regular reserved words using the same expression definitions for each pattern, and then look up the token reserved word, and for identifiers. in a special “reserved word” table. Since the definitions overlap (if will match a reserved word and the • It is known that any regular general identifier pattern), we give expression may be complemented priority to reserved words. Thus a to obtain all strings not in the token is scanned as an identifier if original regular expression. Thus A, it matches the identifier pattern the complement of A, is regular if A and does not match any reserved is. Using complementation we can word pattern. This approach is write a regular expression for commonly used in scanner nonreserved generators like Lex and JLex. identifiers: ()ident if while … Since scanner generators don’t usually support complementation of regular expressions, this approach is more of theoretical than practical interest.

© © CS 536 Spring 2008 148 CS 536 Spring 2008 149 Converting Token Values We can safely convert from string to integer form by first converting For some tokens, we may need to the string to double form, convert from string form into checking against max and min int, numeric or binary form. and then converting to int form if For example, for integers, we the value is representable. need to transform a string a digits Thus d = new Double(str) will into the internal (binary) form of create an object d containing the integers. value of str in double form. If We know the format of the token str is too large or too small to be is valid (the scanner checked represented as a double, plus or this), but: minus infinity is automatically substituted. • The string may represent an integer too large to represent in 32 d.doubleValue() will give d’s or 64 bit form. value as a Java double, which can be compared against • Languages like CSX and ML use a Integer.MAX_VALUE or non-standard representation for Integer.MIN_VALUE. negative values (~123 instead of -123)

© © CS 536 Spring 2008 150 CS 536 Spring 2008 151

If d.doubleValue() represents a Scanner Termination valid integer, (int) d.doubleValue() A scanner reads input characters will create the appropriate integer and partitions them into tokens. value. What happens when the end of If a string representation of an the input file is reached? It may be integer begins with a “~” we can useful to create an Eof pseudo- strip the “~”, convert to a double character when this occurs. In and then negate the resulting Java, for example, value. InputStream.read(), which reads a single byte, returns -1 when end of file is reached. A constant, EOF, defined as -1 can be treated as an “extended” ASCII character. This character then allows the definition of an Eof token that can be passed back to the parser. An Eof token is useful because it allows the parser to verify that the logical end of a program corresponds to its physical end.

© © CS 536 Spring 2008 152 CS 536 Spring 2008 153 Most parsers require an end of Multi Character Lookahead file token. Lex and Jlex automatically create We may allow finite automata to an Eof token when the scanner look beyond the next input they build tries to scan an EOF character. character (or tries to scan when This feature is necessary to eof() is true). implement a scanner for FORTRAN. In FORTRAN, the statement DO 10 J = 1,100 specifies a loop, with index J ranging from 1 to 100. The statement DO 10 J = 1.100 is an assignment to the variable DO10J. (Blanks are not significant except in strings.) A FORTRAN scanner decides whether the O is the last character of a DO token only after reading as far as the comma (or period).

© © CS 536 Spring 2008 154 CS 536 Spring 2008 155

A milder form of extended Suppose we use the following FA. lookahead problem occurs in D D Pascal and Ada. The token 10.50 is a real literal, D . whereas 10..50 is three different D tokens. . We need two-character lookahead . after the 10 prefix to decide whether we are to return 10 (an integer literal) or 10.50 (a real Given 10..100 we scan three literal). characters and stop in a non- accepting state. Whenever we stop reading in a non-accepting state, we back up along accepted characters until an accepting state is found. Characters we back up over are rescanned to form later tokens. If no accepting state is reached during backup, we have a lexical error.

© © CS 536 Spring 2008 156 CS 536 Spring 2008 157 Performance Considerations On a 30 SPECmark machine (30 million instructions/sec.), we Because scanners do so much have 1000 instructions per character-level processing, they character to spend on all can be a real performance compiling steps. bottleneck in production If we allow 25% of compiling to be compilers. scanning (a compiler has a lot Speed is not a concern in our more to do than just scan!), that’s project, but let’s see why scanning just 250 instructions per speed can be a concern in character. production compilers. A key to efficient scanning is to Let’s assume we want to compile group character-level operations at a rate of 1000 lines/sec. (so whenever possible. It is better to that most programs compile in do one operation on n characters just a few seconds). rather than n operations on single Assuming 30 characters/line (on characters. average), we need to scan 30,000 In our examples we’ve read input char/sec. one character as a time. A subroutine call can cost hundreds or thousands of instructions to execute—far too much to spend on a single character.

© © CS 536 Spring 2008 158 CS 536 Spring 2008 159

We prefer routines that do block Lexical Error Recovery reads, putting an entire block of characters directly into a buffer. A character sequence that can’t be Specialized scanner generators scanned into any valid token is a can produce particularly fast lexical error. scanners. Lexical errors are uncommon, but The GLA scanner generator claims they still must be handled by a that the scanners it produces run scanner. We won’t stop as fast as: compilation because of so minor an error. while( != Eof) { Approaches to lexical error c = getchar(); handling include: } • Delete the characters read so far and restart scanning at the next unread character. • Delete the first character read by the scanner and resume scanning at the character following it.

Both of these approaches are reasonable.

© © CS 536 Spring 2008 160 CS 536 Spring 2008 161 The first is easy to do. We just Consider reset the scanner and begin ...for$tnight... scanning anew. The $ terminates scanning of for. The second is a bit harder but Since no valid token begins with also is a bit safer (less is $, it is deleted. Then tnight is immediately deleted). It can be scanned as an identifier. In effect implemented using scanner we get backup. ...for tnight... Usually, a lexical error is caused which will cause a syntax error. by the appearance of some illegal Such “false errors” are character, mostly at the beginning unavoidable, though a syntactic of a token. error-repair may help. (Why at the beginning?) In these case, the two approaches are equivalent. The effects of lexical error recovery might well create a later syntax error, handled by the parser.

© © CS 536 Spring 2008 162 CS 536 Spring 2008 163

Error Tokens One way to handle runaway strings is to define an error token. Certain lexical errors require An error token is not a valid special care. In particular, token; it is never returned to the runaway strings and runaway parser. Rather, it is a pattern for comments ought to receive an error condition that needs special error messages. special handling. We can define an In Java strings may not cross line error token that represents a boundaries, so a runaway string string terminated by an end of is detected when an end of a line line rather than a double quote is read within the string body. character. Ordinary recovery rules are For a valid string, in which inappropriate for this error. In internal double quotes and back particular, deleting the first slashes are escaped (and no other character (the double quote escaped characters are allowed), character) and restarting scanning we can use is a bad decision. " ( Not( " | Eol | \ ) | \" | \\ )* " It will almost certainly lead to a cascade of “false” errors as the For a runaway string we use string text is inappropriately " ( Not( " | Eol | \ ) | \" | \\ )* Eol scanned as ordinary input. (Eol is the end of line character.)

© © CS 536 Spring 2008 164 CS 536 Spring 2008 165 When a runaway string token is In languages like C, C++, Java and recognized, a special error CSX, which allow multiline message should be issued. comments, improperly terminated Further, the string may be (runaway) comments present a “repaired” into a correct string by similar problem. returning an ordinary string token A runaway comment is not with the closing Eol replaced by a detected until the scanner finds a double quote. close comment (possibly This repair may or may not be belonging to some other “correct.” If the closing double comment) or until the end of file quote is truly missing, the repair is reached. Clearly a special, will be good; if it is present on a detailed error message is succeeding line, a cascade of required. inappropriate lexical and Let’s look at Pascal-style syntactic errors will follow. comments that begin with a { and Still, we have told the end with a }. Comments that programmer exactly what is begin and end with a pair of wrong, and that is our primary characters, like /* and */ in Java, goal. C and C++, are a bit trickier.

© © CS 536 Spring 2008 166 CS 536 Spring 2008 167

Correct Pascal comments are We split our legal comment defined quite simply: definition into two token { Not( } )* } definitions. To handle comments terminated The definition that accepts an by Eof, this error token can be open comment in its body causes used: a warning message ("Possible unclosed comment") to be { Not( } )* Eof printed. We want to handle comments We now use: unexpectedly closed by a close { Not( { | } )* } and comment belonging to another { (Not( { | } )* { Not( { | } )* )+ } comment: {... missing close comment The first definition matches ... { normal comment }... correct comments that do not We will issue a warning (this form contain an open comment in their of comment is lexically legal). body. Any comment containing an open The second definition matches comment symbol in its body is correct, but suspect, comments most probably a missing } error. that contain at least one open comment in their body.

© © CS 536 Spring 2008 168 CS 536 Spring 2008 169 Single line comments, found in Java, CSX and C++, are terminated by Eol. They can fall prey to a more subtle error—what if the last line has no Eol at its end? The solution? Another error token for single line comments: // Not(Eol)* This rule will only be used for comments that don’t end with an Eol, since scanners always match the longest rule possible.

© CS 536 Spring 2008 170