Lexical Analyzer in C


CONTENTS

Certificate
Declaration
Abstract
1.0 INTRODUCTION
  1.1.0 Introduction to Lexical Grammar
  1.2.0 Introduction to Tokens
  1.3.0 How Scanner and Tokenizer Work
  1.4.0 Platform Used
2.0 PROPOSED METHODOLOGY
  2.1.0 Block Diagram
  2.2.0 Data Flow Diagram
  2.3.0 Flow Chart
3.0 APPROACHED RESULT AND CONCLUSION
4.0 APPLICATIONS AND FUTURE WORK
REFERENCES

ABSTRACT

The lexical analyzer is responsible for scanning the source input file and translating lexemes (strings) into small objects that the compiler for a high-level language can easily process. These small values are often called "tokens". The lexical analyzer is also responsible for converting sequences of digits into their numeric form, for processing other literal constants, for removing comments and whitespace from the source file, and for taking care of many other mechanical details. In short, the lexical analyzer converts a stream of input characters into a stream of tokens. To classify lexemes as identifiers or keywords, we incorporate a symbol table that initially consists of the predefined keywords. The tokens are read from an input file, and the output file contains all the tokens found in the input file along with their respective token values.

KEYWORDS: Lexeme, Lexical Analysis, Compiler, Parser, Token

1.0 INTRODUCTION

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner. A lexer often exists as a single function which is called by a parser or another function and works alongside other components to make compilation of a high-level language possible. This complete setup is what we call a compiler.

To define what a compiler is, one must first define what a translator is. A translator is a program that takes a program written in one language, known as the source language, and outputs a program written in another language, known as the target language. A compiler, then, is a translator whose source language is a high-level language such as Java or Pascal and whose target language is a low-level language such as assembly or machine code.

There are five parts of compilation (or phases of the compiler):
1.) Lexical Analysis
2.) Syntax Analysis
3.) Semantic Analysis
4.) Code Optimization
5.) Code Generation

Lexical Analysis is the act of taking an input source program and outputting a stream of tokens. This is done by the Scanner. The Scanner can also place identifiers into something called the symbol table and place strings into the string table. The Scanner can report trivial errors such as invalid characters in the input file.

Syntax Analysis is the act of taking the token stream from the scanner and comparing it against the rules and patterns of the specified language. Syntax Analysis is done by the Parser. The Parser produces a tree, which can come in many formats but is referred to as the parse tree, and it reports errors when the tokens do not follow the syntax of the specified language. Errors that the Parser can report are syntactical errors such as missing parentheses, semicolons, and keywords.
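Since this project implements the lexer in C, a minimal C sketch of the hand-off between these first two phases may help before the remaining phases are described. The token categories, the Token struct, the classify_word helper, and the small keyword list below are illustrative assumptions made for this report, not a fixed interface; the keyword array merely stands in for the symbol table of predefined keywords mentioned in the abstract.

#include <stdio.h>
#include <string.h>

/* Token categories the scanner can hand to the parser
   (an illustrative subset; a real compiler defines many more). */
typedef enum { TOK_KEYWORD, TOK_IDENTIFIER, TOK_NUMBER,
               TOK_OPERATOR, TOK_SEPARATOR, TOK_EOF } TokenType;

/* A token pairs a category with the lexeme it was built from. */
typedef struct {
    TokenType type;
    char lexeme[64];
} Token;

/* Stand-in for the symbol table preloaded with keywords. */
static const char *keywords[] = { "int", "char", "if", "else", "while", "return" };

/* A word lexeme is a keyword if it appears in the table, otherwise an identifier. */
static TokenType classify_word(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return TOK_KEYWORD;
    return TOK_IDENTIFIER;
}

int main(void) {
    Token t = { classify_word("while"), "while" };
    printf("lexeme '%s' -> %s\n", t.lexeme,
           t.type == TOK_KEYWORD ? "KEYWORD" : "IDENTIFIER");
    return 0;
}

The Parser would then consume a stream of such Token values rather than raw characters.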
Semantic Analysis is the act of determining whether or not the parse tree is relevant and meaningful. The output is intermediate code, also known as an intermediate representation (or IR). Most of the time, this IR is closely related to assembly language, but it is machine independent. Intermediate code allows different code generators for different machines and promotes abstraction and portability away from specific machine types and languages (arguably the most famous example is Java's bytecode and the JVM). Semantic Analysis finds more meaningful errors such as undeclared variables, type incompatibility, and scope resolution.

Code Optimization makes the IR more efficient. Code optimization is usually done in a sequence of steps. Some optimizations include code hoisting (moving constant values to better places within the code), redundant code discovery, and removal of useless code.

Code Generation is the final step in the compilation process. The input to the Code Generator is the IR and the output is machine language code.

1.1.0 Introduction to Lexical Grammar

The specification of a programming language will often include a set of rules which defines the lexer. These rules are usually called regular expressions, and they define the set of possible character sequences that are used to form tokens or lexemes. Whitespace (i.e., characters that are ignored) is also defined by regular expressions.

1.2.0 Introduction to Tokens

A token is a string of characters, categorized according to the rules as a symbol (e.g. IDENTIFIER, NUMBER, COMMA, etc.). The process of forming tokens from an input stream of characters is called tokenization, and the lexer categorizes them according to a symbol type. A token can look like anything that is useful for processing an input text stream or text file. A lexical analyzer generally does nothing with combinations of tokens, a task left for a parser. For example, a typical lexical analyzer recognizes parentheses as tokens, but does nothing to ensure that each '(' is matched with a ')'.

Consider this expression in the C programming language:

sum=3+2;

It is tokenized as in the following table:

Lexeme   Token type
sum      Identifier
=        Assignment operator
3        Number
+        Addition operator
2        Number
;        End of statement

Tokens are frequently defined by regular expressions, which are understood by a lexical analyzer generator such as lex. The lexical analyzer (either generated automatically by a tool like lex, or hand-crafted) reads in a stream of characters, identifies the lexemes in the stream, and categorizes them into tokens. This is called "tokenizing". If the lexer finds an invalid token, it will report an error. Following tokenizing is parsing. From there, the interpreted data may be loaded into data structures for general use, interpretation, or compiling.

1.3.0 How Scanner and Tokenizer Work

The first stage, the scanner, is usually based on a finite state machine. It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows, and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule).
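A minimal hand-written C sketch of this behaviour for integer tokens only is shown below: skip whitespace, let the first character deduce the token kind, then keep consuming digits until a non-digit appears (maximal munch). The function name scan_number and the fixed-size buffer are assumptions made for this illustration, not part of any standard scanner API.

#include <ctype.h>
#include <stdio.h>

/* Read one integer token from 'src' using the maximal munch rule:
   keep consuming digits until the next character is not a digit. */
static int scan_number(FILE *src, char *buf, size_t cap) {
    int c;

    /* Skip leading whitespace; the first non-space character tells us
       what kind of token follows. */
    do { c = fgetc(src); } while (c != EOF && isspace(c));

    if (c == EOF || !isdigit(c))
        return 0;                       /* not the start of a number */

    size_t n = 0;
    while (c != EOF && isdigit(c)) {    /* longest match: take every digit */
        if (n + 1 < cap)
            buf[n++] = (char)c;
        c = fgetc(src);
    }
    buf[n] = '\0';

    if (c != EOF)
        ungetc(c, src);                 /* push back the character that ended the token */
    return 1;
}

int main(void) {
    char lexeme[32];
    while (scan_number(stdin, lexeme, sizeof lexeme))
        printf("NUMBER(%s)\n", lexeme);
    return 0;
}

Fed the input "12 345 6", this sketch prints NUMBER(12), NUMBER(345), and NUMBER(6), one per line.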
In some languages the lexeme creation rules are more complicated and may involve backtracking over previously read characters.

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input. Take, for example, the following string:

The quick brown fox jumps over the lazy dog

Unlike humans, a computer cannot intuitively 'see' that there are 9 words. To a computer this is only a series of 43 characters. A process of tokenization could be used to split the sentence into word tokens. Although the following example is given as XML, there are many ways to represent tokenized input:

<sentence>
  <word>The</word>
  <word>quick</word>
  <word>brown</word>
  <word>fox</word>
  <word>jumps</word>
  <word>over</word>
  <word>the</word>
  <word>lazy</word>
  <word>dog</word>
</sentence>

A lexeme, however, is only a string of characters known to be of a certain kind (e.g. a string literal, a sequence of letters). In order to construct a token, the lexical analyzer needs a second stage, the evaluator, which goes over the characters of the lexeme to produce a value. The lexeme's type combined with its value is what properly constitutes a token, which can be given to a parser. (Some tokens such as parentheses do not really have values, and so the evaluator function for these can return nothing. The evaluators for integers, identifiers, and strings can be considerably more complex. Sometimes evaluators can suppress a lexeme entirely, concealing it from the parser, which is useful for whitespace and comments.)

For example, in the source code of a computer program the string

net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token stream:

NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON

Though it is possible and sometimes necessary to write a lexer by hand, lexers are often generated by automated tools. These tools generally accept regular expressions that describe the tokens allowed in the input stream. Each regular expression is associated with a production in the lexical grammar of the programming language that evaluates the lexemes matching the regular expression. These tools may generate source code that can be compiled and executed, or they may construct a state table for a finite state machine (which is plugged into template code for compilation and execution).

Regular expressions compactly represent patterns that the characters in lexemes might follow. For example, for an English-based language, a NAME token might be any English alphabetical character or an underscore, followed by any number of ASCII alphanumeric characters or underscores. This could be represented compactly by the regular expression [a-zA-Z_][a-zA-Z_0-9]*.
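As a hand-coded counterpart to that pattern, the following C sketch recognizes a NAME lexeme at the start of a string. The helpers is_name_start, is_name_char, and scan_name are names invented for this illustration rather than part of lex or any other generator's output.

#include <ctype.h>
#include <stdio.h>

/* First character of a NAME: a letter or underscore ([a-zA-Z_]). */
static int is_name_start(int c) { return isalpha(c) || c == '_'; }

/* Later characters: letters, digits or underscore ([a-zA-Z_0-9]). */
static int is_name_char(int c)  { return isalnum(c) || c == '_'; }

/* If 'text' begins with a NAME lexeme, copy it into 'out' and return
   its length; return 0 otherwise. Mirrors [a-zA-Z_][a-zA-Z_0-9]*. */
static size_t scan_name(const char *text, char *out, size_t cap) {
    if (!is_name_start((unsigned char)text[0]))
        return 0;
    size_t n = 0;
    while (text[n] && is_name_char((unsigned char)text[n]) && n + 1 < cap) {
        out[n] = text[n];
        n++;
    }
    out[n] = '\0';
    return n;
}

int main(void) {
    char name[64];
    const char *input = "net_worth_future = (assets - liabilities);";
    if (scan_name(input, name, sizeof name))
        printf("NAME \"%s\"\n", name);   /* prints: NAME "net_worth_future" */
    return 0;
}

The same shape of loop, with a different character class, also covers the integer scanner sketched in section 1.3.0.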