Lexical and Syntactic Analysis

COS 301 Programming Languages: Lexical and Syntactic Analysis (Sebesta Chapter 4.1-4.4), 10/4/2012

Lexical and Syntactic Analysis
• Language implementation systems must analyze source code, regardless of the specific implementation approach (compiler or interpreter)
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF)
  – Lexical analysis uses less powerful grammars than syntactic analysis

Source Code Syntax Analysis
• The syntax analysis portion of a language processor nearly always consists of two parts:
  – A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
  – A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Why Separate Lexical and Syntax Analysis?
• Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser
• Efficiency - separation allows optimization of the lexical analyzer
  – About 75% of execution time for a non-optimizing compiler is lexical analysis
• Portability - parts of the lexical analyzer may not be portable, but the parser always is portable
  – The lexical analyzer has to deal with low-level details of the character set, such as what a newline character looks like, EOF etc.
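The two-part split can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the course's implementation: the function names lex and parse_sum, the token categories INT, IDENT, PLUS and EOF, and the toy grammar Sum -> Operand { '+' Operand } are all hypothetical. The lexer handles characters; the parser only ever sees tokens.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (token category, lexeme); names are illustrative


def lex(source: str) -> Iterator[Token]:
    """Low-level part: group characters into (category, lexeme) pairs."""
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                 # whitespace is discarded
            i += 1
        elif ch.isdigit():               # INT: one or more digits
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            yield ("INT", source[start:i])
        elif ch.isalpha():               # IDENT: a letter, then letters or digits
            start = i
            while i < len(source) and source[i].isalnum():
                i += 1
            yield ("IDENT", source[start:i])
        elif ch == "+":
            yield ("PLUS", ch)
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r}")
    yield ("EOF", "")                    # explicit end-of-input token


def parse_sum(tokens: List[Token]) -> List[str]:
    """High-level part: accept Sum -> Operand { '+' Operand } and return the operands."""
    pos = 0
    operands: List[str] = []

    def expect_operand() -> None:
        nonlocal pos
        kind, lexeme = tokens[pos]
        if kind not in ("INT", "IDENT"):
            raise SyntaxError(f"expected operand, found {kind}")
        operands.append(lexeme)
        pos += 1

    expect_operand()
    while tokens[pos][0] == "PLUS":
        pos += 1                         # consume '+'
        expect_operand()
    if tokens[pos][0] != "EOF":
        raise SyntaxError(f"unexpected token {tokens[pos][0]}")
    return operands
```

A call such as parse_sum(list(lex("a + 1 + b"))) runs both parts in sequence. Note that the parser contains no character-level logic at all, which is the simplicity argument made above.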
Lexical Analysis
• A lexical analyzer is a pattern matcher for character strings
• A lexical analyzer is a "front-end" for the parser
• Identifies substrings of the source program that belong together - lexemes
  – Lexemes match a character pattern, which is associated with a lexical category called a token
  – sum is a lexeme; its token may be IDENT
• Often "token" is used in place of lexeme

Lexical Analyzer
• Purpose: transform program representation from sequence of characters to sequence of tokens
• Input: a stream of characters
• Output: lexemes / tokens
• Discard: whitespace, comments

Example Tokens
• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords or reserved words: bool, while, char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }

Other Sequences
• Whitespace: space tab
• Comments, e.g.
    // {any-char} end-of-line
    /* {any-char} */
• End-of-line
• End-of-file
• Note: in some languages end-of-line or newline characters are considered white space (C, C++, Java…)
• In other languages (BASIC, Fortran, etc.) they are statement delimiters

Lexical Analyzer (continued)
• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• Three approaches to building a lexical analyzer:
  – Write a formal description of the tokens (grammar or regular expressions) and use a software tool that constructs table-driven lexical analyzers given such a description
    • Ex. lex, flex, flex++
  – Design a state diagram that describes the tokens and write a program that implements the state diagram
  – Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram

The Chomsky Hierarchy (Again)
• Four levels of grammar:
  1. Regular
  2. Context-free
  3. Context-sensitive
  4. Unrestricted (recursively enumerable)
• CFGs are used for syntax parsing
• Regular grammars are used for lexical analysis

Productions
• All grammars are tuples {P,T,N,S}
  – Where P is a set of productions, T a set of terminal symbols, N a set of non-terminal symbols and S is the start symbol, a member of N
• The form of production rules distinguishes grammars in the hierarchy

Three models of the lexical level
• Although the lexical level can be described with BNF, regular grammars can be used
• Equivalent to regular grammars are:
  – Regular expressions
  – Finite state automata

Context-Sensitive Grammars
• Production:
  • α → β, |α| ≤ |β|
  • α, β ∈ (N ∪ T)*
  – The left-hand side can be composed of strings of terminals and nonterminals
  – Length of RHS cannot be less than length of LHS (sentential form cannot shrink in derivation) except S → ε is allowed
• Note that context-sensitive grammars can have productions such as
  – aXb => aYZc
  – aXc => aaXb

Context-free Grammars
• Already discussed as BNF - a stylized form of CFG
• Every production is in the form A → ω where A is a single non-terminal and ω is a string of terminals and/or non-terminals (possibly empty)
• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear time parsers

Regular Grammars
• Simplest and least powerful; equivalent to:
  – Regular expression
  – Finite-state automaton
• All productions must be right-regular or left-regular
• Right regular grammar: ω ∈ T*, B ∈ N
    A → ω B
    A → ω
• E.g., rhs of any production must contain at most one nonterminal AND it must be the rightmost symbol
• Direct recursion is permitted
    A → ω A

Regular Grammars
• Left regular grammar: ω ∈ T*, B ∈ N
    A → B ω
    A → ω
• A regular grammar is a right-regular or a left-regular grammar
  – If we have both types of rules we have a linear grammar, which can describe more languages than a regular grammar
  – Regular langs ⊂ linear langs ⊂ context-free langs
• Example of a linear language that is not a regular language:
    { aⁿ bⁿ | n ≥ 1 }
  i.e., we cannot balance symbols that have matching pairs such as ( ), { }, begin end, with a regular grammar

Right-regular Integer grammar
    Integer → 0 Integer | 1 Integer | ... | 9 Integer
    Integer → 0 | 1 | ... | 9
• In EBNF
    Integer → (0 | ... | 9) Integer
    Integer → 0 | ... | 9

Summary of Grammatical Forms
• Regular Grammars
  – Only one nonterminal on left; rhs of any production must contain at most one nonterminal AND it must be the rightmost (leftmost) symbol
• Context Free Grammars
  – Only one non-terminal symbol on lhs
• Context-Sensitive Grammars
  – Lhs can contain any number of terminals and nonterminals
  – Sentential form cannot shrink in derivation
• Unrestricted Grammars
  – Same as CSGs but remove restriction on shrinking sentential forms

Left-regular Integer grammar
    Integer → Integer 0 | Integer 1 | ... | Integer 9
    Integer → 0 | 1 | ... | 9
• In EBNF
    Integer → Integer (0 | ... | 9)
    Integer → 0 | ... | 9

Finite State Automata
• An abstract machine that is useful for lexical analysis
  – Also known as Finite State Machines
• Two varieties (equivalent in power):
  – Non-deterministic finite state automata (NFSA)
  – Deterministic finite state automata (DFSA)
• Only DFSAs are directly useful for constructing programs
  – Any NFSA can be converted into an equivalent DFSA
• We will use an informal approach to describe DFSAs

What is a Finite State Machine?
• A device that has a finite number of states.
• It accepts input from a "tape"
• Each state and each input symbol uniquely determine another state (hence deterministic)
• The device starts operation before any input is read; this is the "start state"
• At the end of input the device may be in an "accepting" state
  – If inputs are characters then the device recognizes a language
• Some inputs may cause the device to enter an "error" state (not usually explicitly represented)

Other uses of FSAs / FSMs
• Finite state machines can be used to describe things other than languages
• Many relatively simple embedded systems can be described with a finite state machine

FSA Graph Representation
• A finite state automaton has
  1. A set of states: represented by nodes in a graph
  2. An input alphabet augmented with a unique end-of-input symbol
  3. State transition function, represented by directed edges in graph, labeled with symbols from alphabet or set of inputs
  4. A unique start state
  5. One or more final (accepting) states (no exiting edges)

Example: Vending Machine
• Adapted from Wulf, Shaw, Hilfinger, Flon, Fundamental Structures of Computer Science, p.17.

Example: Battery Charger
• From http://www.jcelectronica.com/articles/state_machines.htm

A Finite State Automaton for Identifiers
[Diagram: S goes to state 1 on Letter; state 1 loops on Letter, Digit; state 1 goes to F on $]
• This diagram indicates an explicit transition to an accepting state
• We could also use this diagram:
[Diagram: S goes to state 1 on L; state 1 loops on L, D]

FSM for a childish language
• What language is described by this diagram?
[Diagram: state graph starting at S with edges labeled m, a and d]

Quiz Oct 2
1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0
2. Draw a DFSA that recognizes binary strings with at least three consecutive 1's
3. Below is a BNF grammar for fractional numbers. Rewrite as EBNF
    S -> -FN | FN
    FN -> DL | DL.DL
    DL -> D | D DL
    D -> 0|1|2|3|4|5|6|7|8|9

Regular Expressions
• An alternative to regular grammars for specifying a language at the lexical level
• Also used extensively in text-processing
• Very useful for web applications
• Built-in support in many languages, e.g., Perl, Ruby, Java, Javascript, Python, .NET languages
• There are several different syntactic conventions for regexes

Regular Expressions
    Regex    Meaning
    x        a character x (stands for itself)
    \x       an escaped character, e.g., \n
    M | N    M or N
    M N      M followed by N
    M*       zero or more occurrences of M
Note: \ varies with software, typical usage:
    certain non-printable characters (e.g., \n = newline and \t = tab)
    ASCII hex (\xFF) or Unicode hex (\xFFFF)
    Shorthand character classes (\w = word, \s = whitespace, \d = digit)
    Escaping a literal, e.g. \* or \.

Regular Expression Metasymbols
    Regex    Meaning
    M+       One or more occurrences of M
    M?       Zero or one occurrence of M
    M*       Zero or more occurrences of M
    [aeiou]  the set of vowels
    [0-9]    the set of digits
    .        Any single character
    ( )      Grouping

Regex Examples - 1
Let Σ = { a, b, c }
    r = ( a | b ) * c
This regex specifies repetition (0, 1, 2, etc. occurrences) of either a or b followed by c. Strings that match this regular expression include:
    c
    ac
    bc
    abc
    aabbaabbc

Regex Examples - 2
Let Σ = { a, b, c }
    r = ( a | c ) * b ( a | c ) *
This regular expression specifies repetition of either a or c followed by b followed by repetition of either a or c.

Regex Examples - 3
• A regular expression to represent a signed integer.
• There is an optional leading sign (+ or -) followed by at ...
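The regex examples can be checked mechanically. The sketch below uses Python's standard re module, which is a choice made here and not something the slides prescribe; the names EXAMPLE_1, EXAMPLE_2, SIGNED_INT and matches are hypothetical. The signed-integer pattern rests on an assumption: the source sentence is cut off after "followed by at", and the pattern supposes it continues with "least one digit".

```python
import re

# r = (a|b)* c  from Regex Examples - 1
EXAMPLE_1 = re.compile(r"(a|b)*c")

# r = (a|c)* b (a|c)*  from Regex Examples - 2
EXAMPLE_2 = re.compile(r"(a|c)*b(a|c)*")

# ASSUMPTION: optional sign, then at least one digit (the source text is truncated)
SIGNED_INT = re.compile(r"[+-]?[0-9]+")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True only if the whole string belongs to the language of the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for s in ["c", "ac", "bc", "abc", "aabbaabbc", "ca"]:
        print(repr(s), matches(EXAMPLE_1, s))
```

fullmatch is used instead of search because a regular expression in the formal-language sense describes whole strings; search would also accept strings that merely contain a matching substring.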
