Discovery Engineering • REVIEWS • COMPUTER ENGINEERING
Discovery Engineering, Volume 2, Number 7, October 2013

Lexical analysis - a brief study

Harish R, Shahbaz Ali Khan, Shokat Ali, Rajat, Vaibhav Jain, Nitish Raj

CSE Department, Dronacharya College of Engineering, Gurgaon, Haryana-06, India

Received 13 August; accepted 21 September; published online 01 October; printed 16 October 2013

ABSTRACT
The intention of this paper is to provide an overview of the subject of compiler design. The overview includes previous and existing concepts as well as current technologies. The paper also covers the definition, design, and advantages of the compiler and its different parts. Through this paper we aim to create awareness about the growing field of compiler design, and we offer a number of references for each concept discussed.

Keywords: lexemes, scanner, lexer, parser.

To Cite This Article
Harish R, Shahbaz Ali Khan, Shokat Ali, Rajat, Vaibhav Jain, Nitish Raj. Lexical analysis - a brief study. Discovery Engineering, 2013, 2(7), 30-34

1. INTRODUCTION
• The first phase of compilation.
• Also known as the lexer or scanner.
• Takes a stream of characters and returns tokens (words).
• Each token has a "type" and an optional "value", as sketched below.
• Called by the parser each time a new token is needed.
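As a minimal illustration of this token structure (the class and field names below are our own, not from the paper), a token can be modeled in Python as a type plus an optional value:

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass
    class Token:
        type: str                      # token class, e.g. "INT_CONST", "IDENT"
        value: Optional[Any] = None    # optional attribute, e.g. name or number

    # The parser asks the scanner for one such token at a time:
    print(Token("INT_CONST", 255))     # Token(type='INT_CONST', value=255)
    print(Token("LPAREN"))             # punctuation needs no value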

1.1. Typical tokens of programming languages
• Reserved words: class, int, char, bool, …
• Identifiers: abc, def, mmm, mine, …
• Constant numbers: 123, 123.45, 1.2E3, …
• Operators and separators: (, ), <, <=, +, -, …

1.2. Goal
• Recognize token classes; report an error if a string does not match any class.

2. LEXICAL ANALYSIS
Lexical analysis or scanning is the process where the stream of characters making up the source program is read from left to right and grouped into tokens. Tokens are sequences of characters with a collective meaning. There are usually only a small number of token kinds for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.

The lexical analyzer takes a source program as input, and produces a stream of tokens as output. The lexical analyzer might recognize particular instances of tokens such as:
• 3 or 255 for an integer constant token
• "Fred" or "Wilma" for a string constant token
• numTickets or queue for a variable token

Such specific instances are called lexemes. A lexeme is the actual character sequence forming a token; the token is the general class that a lexeme belongs to. Some tokens have exactly one lexeme (e.g., the > character); for others, there are many lexemes (e.g., integer constants). The scanner is tasked with determining that the input stream can be divided into valid symbols in the source language, but has no smarts about which token should come where. Few errors can be detected at the lexical level alone, because the scanner has a localized view of the source program without any context. The scanner can report characters that are not valid tokens (e.g., an illegal or unrecognized symbol) and a few other malformed entities (illegal characters within a string constant, unterminated comments, etc.). It does not look for or detect garbled sequences, tokens out of place, undeclared identifiers, misspelled keywords, mismatched types, and the like. For example, the following input will not generate any errors in the lexical analysis phase, because the scanner has no concept of the appropriate arrangement of tokens for a declaration; the syntax analyzer will catch this error later, in the next phase:

    int a double } switch b[2] =;

Furthermore, the scanner has no idea how tokens are grouped. In the above sequence, it returns b, [, 2, and ] as four separate tokens, having no idea they collectively form an array access. The lexical analyzer can be a convenient place to carry out some other chores, like stripping out comments and white space between tokens, and perhaps even some features like macros and conditional compilation (although these are often handled by some sort of preprocessor which filters the input before the compiler runs). A small scanner sketch illustrating this is shown below.
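The following Python sketch (the token names and patterns are illustrative assumptions, not taken from the paper) shows that this sequence scans cleanly even though it is syntactically nonsense:

    import re

    TOKEN_SPEC = [
        ("KEYWORD", r"\b(?:int|double|switch)\b"),
        ("IDENT",   r"[A-Za-z_]\w*"),
        ("NUMBER",  r"\d+"),
        ("PUNCT",   r"[}\[\]=;]"),
        ("SKIP",    r"\s+"),          # white space separates tokens
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def scan(text):
        for m in MASTER.finditer(text):
            if m.lastgroup != "SKIP":
                yield (m.lastgroup, m.group())

    print(list(scan("int a double } switch b[2] =;")))
    # [('KEYWORD', 'int'), ('IDENT', 'a'), ('KEYWORD', 'double'), ('PUNCT', '}'),
    #  ('KEYWORD', 'switch'), ('IDENT', 'b'), ('PUNCT', '['), ('NUMBER', '2'),
    #  ('PUNCT', ']'), ('PUNCT', '='), ('PUNCT', ';')]

Note that b, [, 2, and ] come back as four separate tokens, exactly as described above: no lexical error is raised.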

2.1. Task
• Also known as a tokenizer or scanner (in Spanish, analizador morfológico, i.e., morphological analyzer).
• Purpose: translation of the source code into a sequence of symbols.
• The symbols identified by the morphological analyzer will be considered terminal symbols in the grammar used by the syntactic analyzer.

2.2. Other tasks
• Identification of lexical errors, e.g., starting an identifier with a digit where the language does not allow this: 2abc
• Deletion of white space. Usually, the function of white space is only to separate tokens. Exception: languages where white space indicates code blocks, e.g., Python:

    if 1 == 2:
        print 1
    print 2

• Deletion of comments: they are not relevant to the execution of a program.

3. TOKENS
A token is a string of one or more characters that is significant as a group. The process of forming tokens from an input stream of characters is called tokenization. Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include: regular expressions, specific sequences of characters known as flags, specific separating characters called delimiters, and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages.

Tokens are often categorized by character content or by context within the data stream. Categories are defined by the rules of the lexer and often involve grammar elements of the language used in the data stream. Programming languages often categorize tokens as identifiers, operators, grouping symbols, or by data type; written languages commonly categorize tokens as nouns, verbs, adjectives, or punctuation. Categories are used for post-processing of the tokens, either by the parser or by other functions in the program. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser.

For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each "(" is matched with a ")".

Consider this expression in a programming language:

    sum = 3 + 2;

Tokenized, it can be represented by the following table:

    Lexeme   Token type
    sum      Identifier
    =        Assignment operator
    3        Integer literal
    +        Addition operator
    2        Integer literal
    ;        End of statement

4. LEXEMES AND PATTERNS
• Lexeme: a specific instance of a token, used to differentiate tokens. For instance, both position and initial belong to the identifier class, but each is a different lexeme. The lexical analyzer may return a token type to the parser, but must also keep track of "attributes" that distinguish one lexeme from another. Examples of attributes: for identifiers, the string itself; for numbers, the value. Attributes are used during semantic checking and code generation; they are not needed during parsing.

A lexeme by itself, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

• Patterns: rules describing how tokens are specified in a program. They are needed because a language can contain infinitely many possible strings, which cannot all be enumerated. Formal mechanisms are used to represent these patterns; the formalism helps in describing precisely (1) which strings belong to the language and (2) which do not. Patterns also form the basis for developing tools that can automatically determine whether a string belongs to a language.

5. SCANNERS
The first stage, the scanner, is usually based on a finite-state machine (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows, and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages, the lexeme creation rules are more complicated and may involve backtracking over previously read characters. For example, in C, a single 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal. A character-at-a-time scanner sketch follows.
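This Python sketch (function and token names are illustrative assumptions) combines the maximal munch rule with the evaluator stage of Section 4: the scanner consumes the longest acceptable run of characters, then converts the lexeme into a value:

    def next_token(source, i):
        """Return (token, next_index), applying maximal munch: consume the
        longest run of characters acceptable for the token whose kind is
        deduced from the first non-whitespace character."""
        while i < len(source) and source[i].isspace():
            i += 1
        if i == len(source):
            return ("EOF", None), i
        start = i
        if source[i].isdigit():                       # integer token
            while i < len(source) and source[i].isdigit():
                i += 1
            return ("INT", int(source[start:i])), i   # evaluator: lexeme -> value
        if source[i].isalpha():                       # identifier token
            while i < len(source) and source[i].isalnum():
                i += 1
            return ("IDENT", source[start:i]), i      # evaluator: keep the string
        return ("PUNCT", source[i]), i + 1

    tok, i = next_token("  255 tickets", 0)
    print(tok)  # ('INT', 255): the whole digit run is consumed, not just '2'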
6. TOKENIZATION
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing; the process can be considered a sub-task of parsing input. Take, for example:

    The quick brown fox jumps over the lazy dog

The string isn't implicitly segmented on spaces, as an English speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or the regular expression /\s{1}/). The tokens could then be represented in XML or as an s-expression.

7. LEXICAL GENERATOR
Lexical analysis can often be performed in a single pass if reading is done a character at a time. Single-pass lexers can be generated by tools such as flex. The lex/flex family of generators uses a table-driven approach, which is much less efficient than the directly coded approach: with the latter, the generator produces an engine that jumps directly to follow-up states via goto statements. Tools like re2c and Quex have been shown (e.g., in "RE2C - A More Versatile Scanner Generator", 1994) to produce engines that are between two and three times faster than flex-produced engines, and it is in general difficult to hand-write analyzers that perform better than engines generated by these tools. A sketch contrasting the two approaches appears below.

The simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily. The ability to express lexical constructs as regular expressions facilitates the description of a lexical analyzer, and some tools offer the specification of pre- and post-conditions which are hard to program by hand. In such cases, using a scanner generator may save a lot of development time, at least where extreme performance optimality is not a concern.
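To make the contrast concrete, here is a hand-written Python sketch of a table-driven DFA that recognizes unsigned integers, a toy stand-in for the tables that lex/flex generate (a directly coded engine would instead inline these transitions as jumps between labeled states):

    # States: 0 = start, 1 = in integer (accepting). None = reject.
    # The table maps (state, character class) -> next state.
    TABLE = {
        (0, "digit"): 1,
        (1, "digit"): 1,
    }

    def char_class(ch):
        return "digit" if ch.isdigit() else "other"

    def match_integer(text):
        """Run the DFA table over text; return the longest accepted prefix."""
        state, last_accept = 0, -1
        for i, ch in enumerate(text):
            state = TABLE.get((state, char_class(ch)))
            if state is None:
                break
            if state == 1:               # accepting state reached
                last_accept = i
        return text[: last_accept + 1]

    print(match_integer("6502abc"))  # '6502'

Every input character costs a table lookup here; a directly coded engine avoids that indirection, which is where the reported speedups come from.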

8. LEXICAL ANALYZER GENERATORS
• ANTLR - can generate lexical analyzers and parsers.
• Flex - alternative variant of the classic "lex" (C/C++).
• JFlex - a rewrite of JLex.
• Ragel - a state machine and lexical scanner generator with output support for C, C++, C#, Objective-C, D, Java, Go and Ruby source code.

The following lexical analyzers can handle Unicode:
• JavaCC - generates lexical analyzers written in Java.
• JLex - a lexical analyzer generator for Java.
• Quex (or "Queχ") - a fast universal lexical analyzer generator for C and C++.

9. ADVANTAGES AND ROLES OF LEXICAL ANALYZER
Primary role: scan a source program (a string) and break it up into small, meaningful units, called tokens.

Example: position := initial + rate * 60;

Transform them into meaningful units: identifiers, constants, operators, and punctuation.

Other roles:
o Removal of comments.
o Case conversion.
o Removal of white space.
o Interpretation of compiler directives or pragmas: for instance, in Turbo Pascal, {$R+} means range checking is enabled.
o Communication with the symbol table: store information regarding an identifier in the symbol table. This is not advisable in cases where scopes can be nested.
o Preparation of the output listing: keep track of the source program, line numbers, and correspondences between error messages and line numbers.
A sketch of the comment-stripping and line-tracking chores follows.
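A minimal Python sketch of two of these chores, comment removal and line tracking (the function name and the '//' comment syntax are illustrative assumptions):

    def strip_comments_and_track_lines(source):
        """Remove '//' line comments and record the line number of each
        surviving character, so later phases can map errors back to lines.
        (Illustrative sketch: real scanners do this while tokenizing.)"""
        cleaned, line_of = [], []
        line, i, in_comment = 1, 0, False
        while i < len(source):
            ch = source[i]
            if ch == "\n":
                cleaned.append(ch)
                line_of.append(line)
                line += 1
                in_comment = False        # a line comment ends at the newline
                i += 1
            elif not in_comment and source.startswith("//", i):
                in_comment = True
                i += 2
            elif in_comment:
                i += 1                    # drop commented-out characters
            else:
                cleaned.append(ch)
                line_of.append(line)
                i += 1
        return "".join(cleaned), line_of

    text, lines = strip_comments_and_track_lines("a = 1 // set a\nb = 2\n")
    print(repr(text))  # 'a = 1 \nb = 2\n'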

Why separate the lexical analyzer from the parser?
o Simpler design of both the lexical analyzer and the parser.
o A more efficient compiler.
o A more portable compiler.

10. LEXICAL ERRORS
• The scanner may come across certain errors: an invalid character, an invalid token, etc. These are usually detected by reaching a state that is not final and that has no transition for the current input symbol.
• The scanner cannot uncover syntactic, semantic, or logical errors, because the lexical analyzer's view of the program is localized.
• What to do when lexical errors occur?
o Delete the characters read so far and restart scanning at the next unread character.
o Delete the first character read by the scanner and resume scanning at the character following it.
o Local transformations: replace a character by another, transpose adjacent characters, etc.
• Note that error recovery at this stage may create errors in the parsing stage: for instance, replacing beg#in by beg in will cause an error during the parsing phase. Another approach is for the scanner to provide a warning token to the parser, which the parser can use to do syntactic error repair.
• A common error-recovery problem: runaway comments and strings. A possible solution is to introduce an error token that represents a runaway string or comment; once this error token is recognized, a special error message may be issued (see the sketch after this list).
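A Python sketch of the error-token approach for a runaway string (the token names are hypothetical; here the scanner treats a string that reaches end of line without a closing quote as having "run away"):

    def scan_string(source, i):
        """Scan a string literal starting at source[i] == '"'.
        Return (token, next_index). If the closing quote is missing by the
        end of the line, emit a RUNAWAY_STRING error token instead, so the
        parser can report a special error message and keep going."""
        j = i + 1
        while j < len(source) and source[j] not in '"\n':
            j += 1
        if j < len(source) and source[j] == '"':
            return ("STRING", source[i + 1 : j]), j + 1
        return ("RUNAWAY_STRING", source[i:j]), j   # recovery: resume after j

    tok, _ = scan_string('"unterminated...\nnext line', 0)
    print(tok)  # ('RUNAWAY_STRING', '"unterminated...')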

11. DIAGRAMS FOR LEXICAL ANALYSIS
(Figures: role of the lexical analyzer in the compiler; lexical and syntax analysis of a program; lexical analysis.)

12. SUMMARY
1. In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
2. A program or function that performs lexical analysis is called a lexical analyzer, lexer, or scanner. A lexer often exists as a single function which is called by a parser or another function, or can be combined with the parser in scannerless parsing.

DISCLOSURE STATEMENT
This research work received no financial support from any funding agency.

ACKNOWLEDGMENTS
We thank our guide for his timely help, outstanding ideas, and encouragement in finishing this research work successfully.

REFERENCES
1. Compiling with C# and Java, Pat Terry, 2005, ISBN 0-321-26360-X
2. Algorithms + Data Structures = Programs, Niklaus Wirth, 1975, ISBN 0-13-022418-9
3. Compiler Construction, Niklaus Wirth, 1996, ISBN 0-201-40353-6
4. Concepts of Programming Languages (Seventh edition), R. W. Sebesta, 2006, p. 177. Boston: Pearson/Addison-Wesley

RELATED RESOURCES
1. A. Aho, R. Sethi, J. Ullman, Compilers: Principles, Techniques, and Tools. Reading, MA: Addison-Wesley, 1986.
2. J. Backus, "The History of FORTRAN I, II, III." SIGPLAN Notices, Vol. 13, No. 8, August 1978, pp. 165-180.
3. J.P. Bennett, Introduction to Compiling Techniques. Berkshire, England: McGraw-Hill, 1990.
4. D. Cohen, Introduction to Computer Theory. New York: Wiley, 1986.
5. C. Fischer, R. LeBlanc, Crafting a Compiler. Menlo Park, CA: Benjamin/Cummings, 1988.
6. J. Hopcroft, J. Ullman, Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley, 1979.
7. S. Kleene, "Representation of Events in Nerve Nets and Finite Automata," in C. Shannon and J. McCarthy (eds), Automata Studies. Princeton, NJ: Princeton University Press, 1956.
8. A. McGettrick, The Definition of Programming Languages. Cambridge: Cambridge University Press, 1980.
9. T. Sudkamp, Languages and Machines: An Introduction to the Theory of Computer Science. Reading, MA: Addison-Wesley, 1988.
10. R.L. Wexelblat, History of Programming Languages. London: Academic Press, 1981.