Discovery Engineering • REVIEWS • COMPUTER ENGINEERING
Discovery Engineering, Volume 2, Number 7, October 2013
Lexical analysis - a brief study

Harish R, Shahbaz Ali Khan, Shokat Ali, Rajat, Vaibhav Jain, Nitish Raj
CSE Department, Dronacharya College of Engineering, Gurgaon, Haryana-06, India

Received 13 August; accepted 21 September; published online 01 October; printed 16 October 2013

ABSTRACT
The intention of this paper is to provide an overview of the subject of compiler design. The overview includes previous and existing concepts as well as current technologies. The paper also covers the definition, design, and advantages of a compiler and its different parts. Through this paper we aim to create awareness of the growing field of compiler design. The paper also offers a comprehensive set of references for each concept in lexical analysis.

Keywords: lexemes, scanner, lexer, parser

To Cite This Article: Harish R, Shahbaz Ali Khan, Shokat Ali, Rajat, Vaibhav Jain, Nitish Raj. Lexical analysis - a brief study. Discovery Engineering, 2013, 2(7), 30-34

1. INTRODUCTION
Lexical analysis is the first phase of compilation. The lexical analyzer, also known as a lexer or scanner, takes a stream of characters and returns tokens (words). Each token has a "type" and an optional "value". The lexical analyzer is called by the parser each time a new token is needed.

1.1. Typical tokens of programming languages
Reserved words: class, int, char, bool, ...
Identifiers: abc, def, mmm, mine, ...
Constant numbers: 123, 123.45, 1.2E3, ...
Operators and separators: (, ), <, <=, +, -, ...

1.2. Goal
Recognize token classes, and report an error if a string does not match any class.

2. LEXICAL ANALYSIS
Lexical analysis, or scanning, is the process in which the stream of characters making up the source program is read from left to right and grouped into tokens. Tokens are sequences of characters with a collective meaning. There are usually only a small number of token classes for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.
The lexical analyzer takes a source program as input and produces a stream of tokens as output. The lexical analyzer might recognize particular instances of tokens such as:
3 or 255 for an integer constant token
"Fred" or "Wilma" for a string constant token
numTickets or queue for a variable token
Such specific instances are called lexemes. A lexeme is the actual character sequence forming a token; the token is the general class that a lexeme belongs to. Some tokens have exactly one lexeme (e.g., the > character); for others, there are many lexemes (e.g., integer constants).
The scanner is tasked with determining that the input stream can be divided into valid symbols in the source language, but it has no smarts about which token should come where. Few errors can be detected at the lexical level alone, because the scanner has a localized view of the source program without any context. The scanner can report characters that are not valid tokens (e.g., an illegal or unrecognized symbol) and a few other malformed entities (illegal characters within a string constant, unterminated comments, etc.). It does not look for or detect garbled sequences, tokens out of place, undeclared identifiers, misspelled keywords, mismatched token types, and the like.
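The left-to-right grouping of characters into (type, lexeme) tokens described above can be sketched as a small hand-written scanner. This is only an illustrative sketch, not the paper's own implementation; the token class names and the keyword set are assumptions made for the example.

```python
# A minimal hand-written scanner: read characters left to right and
# group them into (token type, lexeme) pairs.
KEYWORDS = {"class", "int", "char", "bool"}

def scan(source):
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                      # white space only separates tokens
            i += 1
        elif ch.isalpha():                    # reserved word or identifier
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            lexeme = source[i:j]
            kind = "keyword" if lexeme in KEYWORDS else "identifier"
            tokens.append((kind, lexeme))
            i = j
        elif ch.isdigit():                    # integer constant
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(("int_const", source[i:j]))
            i = j
        elif ch in "()<>=+-;[]{}":            # single-character operators/separators
            tokens.append(("operator", ch))
            i += 1
        else:                                 # the only error a scanner can report
            raise ValueError(f"illegal character {ch!r}")
    return tokens

print(scan("int numTickets = 255;"))
# -> [('keyword', 'int'), ('identifier', 'numTickets'), ('operator', '='),
#     ('int_const', '255'), ('operator', ';')]
```

Consistent with the limitation described above, a malformed declaration such as int a double } switch b[2] =; also scans without any lexical error here: every lexeme is individually valid, even though their arrangement is nonsense.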
For example, the following input will not generate any errors in the lexical analysis phase, because the scanner has no concept of the appropriate arrangement of tokens in a declaration; the syntax analyzer will catch this error later, in the next phase:

int a double } switch b[2] =;

Furthermore, the scanner has no idea how tokens are grouped. In the above sequence, it returns b, [, 2, and ] as four separate tokens, having no idea that they collectively form an array access. The lexical analyzer can be a convenient place to carry out some other chores, like stripping out comments and white space between tokens, and perhaps even some features like macros and conditional compilation (although these are often handled by some sort of preprocessor which filters the input before the compiler runs).

2.1. Task
Also known as a tokenizer or scanner (in Spanish, analizador morfologico, i.e., morphological analyzer).
Purpose: translation of the source code into a sequence of symbols. The symbols identified by the morphological analyzer will be considered terminal symbols in the grammar used by the syntactic analyzer.

2.2. Other Tasks
Identification of lexical errors, e.g., starting an identifier with a digit where the language does not allow this: 2abc
Deletion of white-space: usually, the function of white-space is only to separate tokens. An exception is languages where whitespace indicates a code block, e.g., Python:
if 1 == 2:
    print 1
print 2
Deletion of comments: they are not relevant to the execution of a program.

3. TOKENS
A token is a string of one or more characters that is significant as a group. The process of forming tokens from an input stream of characters is called tokenization. Tokens are identified based on the specific rules of the lexer. Some methods used to identify tokens include: regular expressions, specific sequences of characters known as a flag, specific separating characters called delimiters, and explicit definition by a dictionary. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages.
Tokens are often categorized by character content or by context within the data stream. Categories are defined by the rules of the lexer. Categories often involve grammar elements of the language used in the data stream. Programming languages often categorize tokens as identifiers, operators, grouping symbols, or by data type. Written languages commonly categorize tokens as nouns, verbs, adjectives, or punctuation. Categories are used for post-processing of the tokens, either by the parser or by other functions in the program.
A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each "(" is matched with a ")".
Consider this expression in the C programming language, in which tokens are separated by a white-space delimiter (i.e., matching the string " " or the regular expression /\s{1}/):

sum = 3 + 2;

Tokenized, it is represented by the following table:

Lexeme    Token type
sum       identifier
=         assignment operator
3         integer literal
+         addition operator
2         integer literal
;         end of statement

The tokens could also be represented in XML.

4. LEXEMES AND PATTERNS
Lexeme: a specific instance of a token, used to differentiate tokens. For instance, both position and initial belong to the identifier class, but each is a different lexeme.
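The tokenization of sum = 3 + 2; can be reproduced with regular-expression token patterns. This is a sketch only: the token names (chosen to mirror the categories in the table) and the pattern set are illustrative assumptions, not a specification from the paper.

```python
import re

# One named group per token class; whitespace is a delimiter, not a token.
TOKEN_SPEC = [
    ("identifier",  r"[A-Za-z_]\w*"),
    ("int_literal", r"\d+"),
    ("assign",      r"="),
    ("add_op",      r"\+"),
    ("end_stmt",    r";"),
    ("skip",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    # finditer scans left to right; characters matching no pattern are
    # silently skipped in this sketch (a real lexer would report them).
    for m in MASTER.finditer(text):
        if m.lastgroup != "skip":
            yield (m.lastgroup, m.group())

print(list(tokenize("sum = 3 + 2;")))
# -> [('identifier', 'sum'), ('assign', '='), ('int_literal', '3'),
#     ('add_op', '+'), ('int_literal', '2'), ('end_stmt', ';')]
```

This is exactly the pattern-based specification of tokens discussed in Section 4: each token class is described by a rule (here a regular expression) rather than by enumerating its infinitely many possible lexemes.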
The lexical analyzer may return a token type to the parser, but it must also keep track of "attributes" that distinguish one lexeme from another. Examples of attributes:
Identifiers: the string itself
Numbers: the value
Attributes are used during semantic checking and code generation. They are not needed during parsing.
A lexeme, however, is only a string of characters known to be of a certain kind (e.g., a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens, such as parentheses, do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)
Patterns: rules describing how tokens are specified in a program. They are needed because a language can contain infinitely many possible strings, which cannot all be enumerated. Formal mechanisms are used to represent these patterns. Formalism helps in describing precisely:
1. Which strings belong to the language, and
2. Which do not.
Patterns also form the basis for developing tools that can automatically determine whether a string belongs to a language.

7. LEXICAL GENERATOR
Lexical analysis can often be performed in a single pass if reading is done a character at a time. Single-pass lexers can be generated by tools such as flex. The lex/flex family of generators uses a table-driven approach, which is much less efficient than the directly coded approach. With the latter approach, the generator produces an engine that directly jumps to follow-up states via goto statements. Tools like re2c and Quex have been shown (e.g., in "RE2C - A More Versatile Scanner Generator", 1994) to produce engines that are two to three times faster than flex-produced engines. It is in general difficult to hand-write analyzers that perform better than engines generated by these latter tools. Still, the simple utility of using a scanner generator should not be discounted, especially in the developmental phase, when a language specification might change daily. The ability to express lexical constructs as regular expressions facilitates the description of a lexical analyzer.
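The two-stage arrangement described in Section 4, a scanner that finds lexemes followed by an evaluator that turns each lexeme into a token value or suppresses it, can be sketched as follows. The token names, patterns, and evaluator functions are illustrative assumptions, not part of the paper.

```python
import re

# Stage 1: the scanner's patterns, one named group per token class.
PATTERNS = [
    ("number",     r"\d+"),
    ("string",     r'"[^"]*"'),
    ("identifier", r"[A-Za-z_]\w*"),
    ("lparen",     r"\("),
    ("rparen",     r"\)"),
    ("whitespace", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS))

# Stage 2: evaluators produce each token's value; None means "no value".
EVALUATORS = {
    "number": int,                    # "255" -> 255
    "string": lambda s: s[1:-1],      # strip the surrounding quotes
    "identifier": str,
    "lparen": lambda s: None,         # parentheses carry no value
    "rparen": lambda s: None,
}
SUPPRESSED = {"whitespace"}           # concealed from the parser entirely

def lex(text):
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind in SUPPRESSED:
            continue
        yield (kind, EVALUATORS[kind](m.group()))

print(list(lex('f("Fred" 255)')))
# -> [('identifier', 'f'), ('lparen', None), ('string', 'Fred'),
#     ('number', 255), ('rparen', None)]
```

Each yielded pair is a token in the sense of Section 4: a type combined with an evaluated value, ready to be handed to a parser, while suppressed lexemes such as whitespace never reach it.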