
10/4/2012

COS 301
Programming Languages

Lexical and Syntactic Analysis

Sebesta Chapter 4.1-4.4

Lexical and Syntactic Analysis

• Language implementation systems must analyze source code, regardless of the specific implementation approach (compiler or interpreter)
• Nearly all syntax analysis is based on a formal description of the syntax of the source language (BNF)
• Lexical analysis uses less powerful grammars than syntactic analysis

Source Code Syntax Analysis

• The syntax analysis portion of a language processor nearly always consists of two parts:
  – A low-level part called a lexical analyzer (mathematically, a finite automaton based on a regular grammar)
  – A high-level part called a syntax analyzer, or parser (mathematically, a push-down automaton based on a context-free grammar, or BNF)

Why Separate Lexical and Syntax Analysis?

• Simplicity - less complex approaches can be used for lexical analysis; separating them simplifies the parser
• Efficiency - separation allows optimization of the lexical analyzer
  – About 75% of execution time for a non-optimizing compiler is lexical analysis
• Portability - parts of the lexical analyzer may not be portable, but the parser always is portable
  – The lexical analyzer has to deal with low-level details of the character set, such as what a newline character looks like, EOF etc.

Lexical Analysis

• A lexical analyzer is a pattern matcher for character strings
• A lexical analyzer is a “front-end” for the parser
• Identifies substrings of the source program that belong together - lexemes
  – Lexemes match a character pattern, which is associated with a lexical category called a token
  – sum is a lexeme; its token may be IDENT
• Often “token” is used in place of lexeme

Lexical Analyzer

• Purpose: transform program representation from sequence of characters to sequence of tokens
• Input: a stream of characters
• Output: lexemes / tokens
• Discard: whitespace, comments


Example Tokens

• Identifiers
• Literals: 123, 5.67, 'x', true
• Keywords or reserved words: bool, while, char ...
• Operators: + - * / ...
• Punctuation: ; , ( ) { }

Other Sequences

• Whitespace: space tab
• Comments, e.g.
    // {any-char} end-of-line
    /* {any-char} */
• End-of-line
• End-of-file
• Note: in some languages end-of-line or new-line characters are considered white space (C, C++, Java…)
• In other languages (BASIC, Fortran, etc.) they are statement delimiters

Lexical Analyzer (continued)

• The lexical analyzer is usually a function that is called by the parser when it needs the next token
• Three approaches to building a lexical analyzer:
  – Write a formal description of the tokens (grammar or regular expressions) and use a software tool that constructs table-driven lexical analyzers given such a description
    • Ex. lex, flex, flex++
  – Design a state diagram that describes the tokens and write a program that implements the state diagram
  – Design a state diagram that describes the tokens and hand-construct a table-driven implementation of the state diagram

The Chomsky Hierarchy (Again)

• Four levels of grammar:
  1. Regular
  2. Context-free
  3. Context-sensitive
  4. Unrestricted (recursively enumerable)
• CFGs are used for syntax parsing
• Regular grammars are used for lexical analysis

Productions

• All grammars are tuples {P,T,N,S}
  – Where P is a set of productions, T a set of terminal symbols, N a set of non-terminal symbols and S is the start symbol, a member of N
• The form of production rules distinguishes grammars in hierarchy

Three models of the lexical level

• Although the lexical level can be described with BNF, regular grammars can be used
• Equivalent to regular grammars are:
  – Regular expressions
  – Finite state automata


Context-Sensitive Grammars

• Production:
    α → β    |α| ≤ |β|
    α, β ∈ (N ∪ T)*
  – The left-hand side can be composed of strings of terminals and nonterminals
  – Length of RHS cannot be less than length of LHS (sentential form cannot shrink in derivation) except S → ε is allowed
• Note that context-sensitive grammars can have productions such as
  – aXb => aYZc
  – aXc => aaXb

Context-free Grammars

• Already discussed as BNF - a stylized form of CFG
• Every production is in the form A → ω where A is a single non-terminal and ω is a string of terminals and/or non-terminals (possibly empty)
• Equivalent to a pushdown automaton
• For a wide class of unambiguous CFGs, there are table-driven, linear time parsers

Regular Grammars

• Simplest and least powerful; equivalent to:
  – Finite-state automaton
• All productions must be right-regular or left-regular
• Right regular grammar: ω ∈ T*, B ∈ N
    A → ω B
    A → ω
• I.e., rhs of any production must contain at most one nonterminal AND it must be the rightmost symbol
• Direct recursion is permitted
    A → ω A

Regular Grammars

• Left regular grammar: ω ∈ T*, B ∈ N
    A → B ω
    A → ω
• A regular grammar is a right-regular or a left-regular grammar
  – If we have both types of rules we have a linear grammar, which can describe a more powerful class of languages than a regular grammar
  – Regular langs ⊂ linear langs ⊂ context-free langs
• Example of a linear language that is not a regular language:
    { aⁿ bⁿ | n ≥ 1 }
  i.e., we cannot balance symbols that have matching pairs such as ( ), { }, begin end, with a regular grammar

Right-regular Integer grammar

    Integer → 0 Integer | 1 Integer | ... | 9 Integer
    Integer → 0 | 1 | ... | 9

• In EBNF
    Integer → (0 | ... | 9) Integer
    Integer → 0 | ... | 9

Summary of Grammatical Forms

• Regular Grammars
  – Only one nonterminal on left; rhs of any production must contain at most one nonterminal AND it must be the rightmost (leftmost) symbol
• Context Free Grammars
  – Only one non-terminal symbol on lhs
• Context-Sensitive Grammars
  – Lhs can contain any number of terminals and non-terminals
  – Sentential form cannot shrink in derivation
• Unrestricted Grammars
  – Same as CSGs but remove restriction on shrinking sentential forms


Left-regular Integer grammar

    Integer → Integer 0 | Integer 1 | ... | Integer 9
    Integer → 0 | 1 | ... | 9

• In EBNF
    Integer → Integer (0 | ... | 9)
    Integer → 0 | ... | 9

Finite State Automata

• An abstract machine that is useful for lexical analysis
  – Also known as Finite State Machines
• Two varieties (equivalent in power):
  – Non-deterministic finite state automata (NFSA)
  – Deterministic finite state automata (DFSA)
• Only DFSAs are directly useful for constructing programs
  – Any NFSA can be converted into an equivalent DFSA
• We will use an informal approach to describe DFSAs

What is a Finite State Machine?

• A device that has a finite number of states.
• It accepts input from a “tape”
• Each state and each input symbol uniquely determine another state (hence deterministic)
• The device starts operation before any input is read; this is the “start state”
• At the end of input the device may be in an “accepting” state
  – If inputs are characters then the device recognizes a language
• Some inputs may cause the device to enter an “error” state (not usually explicitly represented)

Other uses of FSAs / FSMs

• Finite state machines can be used to describe things other than languages
• Many relatively simple embedded systems can be described with a finite state machine

FSA Graph Representation

• A finite state automaton has
  1. A set of states: represented by nodes in a graph
  2. An input alphabet augmented with unique end of input symbol
  3. State transition function, represented by directed edges in graph, labeled with symbols from alphabet or set of inputs
  4. A unique start state
  5. One or more final (accepting) states, with no exiting edges

Example: Vending Machine

• Adapted from Wulf, Shaw, Hilfinger, Flon, Fundamental Structures of Computer Science, p.17.
[Figure: vending machine state diagram]


Example: Battery Charger

• From http://www.jcelectronica.com/articles/state_machines.htm
[Figure: battery charger state diagram]

A Finite State Automaton for Identifiers

[Diagram: S --Letter--> 1; 1 --Letter, Digit--> 1; 1 --$--> F]

• This diagram indicates an explicit transition to an accepting state
• We could also use this diagram:

[Diagram: S --L--> 1; 1 --L, D--> 1]
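The identifier automaton can also be written directly as code. The following is a minimal sketch, not taken from the slides: the state names and the function name is_identifier are hypothetical, and the string terminator '\0' is assumed to stand in for the end-of-input symbol $.

```c
#include <ctype.h>
#include <stdbool.h>

/* States of the identifier automaton (names are hypothetical). */
enum id_state { ST_START, ST_IN_ID, ST_ERROR };

/* Returns true if s has the form Letter (Letter | Digit)*.
   The terminating '\0' plays the role of the end-of-input symbol $. */
bool is_identifier(const char *s)
{
    enum id_state state = ST_START;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        switch (state) {
        case ST_START:              /* S --Letter--> 1 */
            state = isalpha(c) ? ST_IN_ID : ST_ERROR;
            break;
        case ST_IN_ID:              /* 1 --Letter, Digit--> 1 */
            state = (isalpha(c) || isdigit(c)) ? ST_IN_ID : ST_ERROR;
            break;
        case ST_ERROR:              /* implicit error state: reject */
            return false;
        }
    }
    return state == ST_IN_ID;       /* accept only if at least one letter was read */
}
```

The error state is explicit here only because code needs somewhere to go on bad input; in the diagrams it is left implicit.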

FSM for a childish language

• What language is described by this diagram?
[Diagram: state diagram with start state S and edges labeled m, a and d]

Quiz Oct 2

1. Draw a DFSA that recognizes binary strings that start with 1 and end with 0
2. Draw a DFSA that recognizes binary strings with at least three consecutive 1’s
3. Below is a BNF grammar for fractional numbers. Rewrite as EBNF

    S -> -FN | FN
    FN -> DL | DL.DL
    DL -> D | D DL
    D -> 0|1|2|3|4|5|6|7|8|9

Regular Expressions

• An alternative to regular grammars for specifying a language at the lexical level
• Also used extensively in text-processing
• Very useful for web applications
• Built-in support in many languages, e.g., Ruby, Java, JavaScript, Python, .NET languages
• There are several different syntactic conventions for regexes

Regular Expressions

    Regex     Meaning
    x         a character x (stands for itself)
    \x        an escaped character, e.g., \n
    M | N     M or N
    M N       M followed by N
    M*        zero or more occurrences of M

Note: \ varies with software, typical usage:
  – certain non-printable characters (e.g., \n = newline and \t = tab)
  – ASCII hex (\xFF) or Unicode hex (\xFFFF)
  – Shorthand character classes (\w = word, \s = whitespace, \d = digit)
  – Escaping a literal, e.g. \* or \.


Regular Expression Metasymbols

    Regex     Meaning
    M+        One or more occurrences of M
    M?        Zero or one occurrence of M
    M*        Zero or more occurrences of M
    [aeiou]   the set of vowels
    [0-9]     the set of digits
    .         Any single character
    ( )       Grouping

Regex Examples - 1

Let Σ = { a, b, c }
r = ( a | b ) * c

This regex specifies repetition (0, 1, 2, etc. occurrences) of either a or b followed by c. Strings that match this regular expression include:
    c
    ac
    bc
    abc
    aabbaabbc

Regex Examples – 2

Let Σ = { a, b, c }
r = ( a | c ) * b ( a | c ) *

This regular expression specifies repetition of either a or c followed by b followed by repetition of either a or c.
    b
    ab
    bcccc
    abc
    aaccaab
    aacabccca

Regex Examples – 3

• A regular expression to represent a signed integer.
• There is an optional leading sign (+ or -) followed by at least one digit in the range 0 .. 9.

    (\+|\-)?[0-9]+

Matches include +1, 0, -0, 827356, -98686, …

Regex Examples - 4

• A regular expression to represent a signed floating point number. There is an optional leading sign ( + or - ) followed by 1 or more digits in the range 0 .. 9 followed by an optional decimal point and then 1 or more digits in the range 0 .. 9. The \ symbol indicates . is the literal period and not the . symbol for “any character.”

    1. (\+|\-)?[0-9]+(\.[0-9]+)?
    2. [-+]?([0-9]+\.[0-9]+|[0-9]+)
    3. [-+]?[0-9]+\.?[0-9]+

Regex Libraries

• Many sources available online
• See for example http://regexlib.com/Default.aspx

This illustrates how complex regexes can be!
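The floating point patterns are not all equivalent, which is easy to check by running them. The sketch below assumes a POSIX system that provides <regex.h>; the helper full_match and the constants FLOAT_V2 and FLOAT_V3 are hypothetical names. It wraps patterns 2 and 3 in ^ and $ so that the whole string must match.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Patterns 2 and 3 from the slide, anchored so the whole string must match.
   Backslashes are doubled because these are C string literals. */
const char *const FLOAT_V2 = "^[-+]?([0-9]+\\.[0-9]+|[0-9]+)$";
const char *const FLOAT_V3 = "^[-+]?[0-9]+\\.?[0-9]+$";

/* Returns true if text matches the POSIX extended regex pattern. */
bool full_match(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;               /* treat a bad pattern as "no match" */
    bool ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

Both patterns accept -3.14 and 47, but pattern 3 needs at least two digits in total, so full_match(FLOAT_V3, "7") is false while full_match(FLOAT_V2, "7") is true.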


Lexical Syntax for a simple C-like language

    anyChar      [ -~]       Note: space (0x20) to tilde (0x7e)
    Letter       [a-zA-Z]
    Digit        [0-9]
    Whitespace   [ \t]       Again note literal space (0x20)
    Eol          \n
    Eof          \004

Lexical Syntax for a simple C-like language

    Keyword      bool | char | else | false | float | if | int | main | true | while
    Identifier   {Letter}({Letter} | {Digit})*
    integerLit   {Digit}+
    floatLit     {Digit}+\.{Digit}+
    charLit      ‘{anyChar}’
    Operator     = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
    Separator    : | . | { | } | ( | )
    Comment      // ({anyChar} | {Whitespace})* {eol}

Some Common Conventions

• When expressing lexical rules for a language:
  – Explicit terminator typically is used only for program as a whole, not each token.
  – An unlabeled arc represents any other valid input symbol.
  – Recognition of a token ends in a final state.
  – Recognition of a non-token (e.g., whitespace, comment) transitions back to start state.
• Recognition of end symbol (end of file) ends in a final state.
• Automaton must be deterministic.
  – Drop keywords; handle separately with lookup table
  – We must consider all sequences with a common prefix together. Examples:
    1. floats and ints
    2. Comments and division

DFSAs for a small C-like Language

Legend: ws = whitespace, l = letter, d = digit, eoln = \n, eof = end of input. All others are literal.

[Diagrams: Ints and floats; Whitespace; Single & double quotes; // comments; Assignment & comparison; Addition; Identifiers; Logical and bitwise AND]


Translations

• A DFSA that accepts binary strings with an even number of 1 bits
[Diagram: states A and B; A is the start and accepting state; 0 loops on A and on B; 1 goes from A to B and from B to A]
• Right Regular Grammar
    A -> 0A | 1B | ε
    B -> 0B | 1A
• Regex
    0*(10*10*)*

State Diagram Design

  – A naive state diagram would have a transition from every state on every character in the source language
  – All keywords would be captured in the state diagram
  – Such a diagram would be very large!
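The two-state machine for an even number of 1 bits can be hand-coded as a transition table, which is the third construction approach listed earlier. This is a sketch under assumptions: the names STATE_A, STATE_B, next_state and even_ones are hypothetical, and input outside the alphabet {0, 1} is simply rejected.

```c
#include <stdbool.h>

enum { STATE_A = 0, STATE_B = 1 };   /* A: even count so far, B: odd count */

/* next_state[state][bit]: the transition table for the two-state machine. */
static const int next_state[2][2] = {
    /* on '0'   on '1' */
    { STATE_A, STATE_B },            /* from A */
    { STATE_B, STATE_A },            /* from B */
};

/* Returns true if s contains only '0' and '1' and has an even number of 1s. */
bool even_ones(const char *s)
{
    int state = STATE_A;                      /* A is the start state */
    for (; *s != '\0'; s++) {
        if (*s != '0' && *s != '1')
            return false;                     /* symbol outside the alphabet */
        state = next_state[state][*s - '0'];
    }
    return state == STATE_A;                  /* A is also the accepting state */
}
```

A string such as 11011, with four 1 bits, ends in state A and is accepted. A new automaton only needs a new table; the driver loop stays the same.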

Lexical Analysis (cont.)

• In many cases, transitions can be combined to simplify the state diagram
  – When recognizing an identifier, all uppercase and lowercase letters are equivalent
    • Use a character class that includes all letters
  – When recognizing an integer literal, all digits are equivalent - use a digit class

Lexical Analysis (cont.)

• Reserved words and identifiers can be recognized together (rather than having a part of the diagram for each reserved word)
  – Use a table lookup to determine whether a possible identifier is in fact a reserved word
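The table lookup for reserved words might look like the sketch below. The token codes, the three-entry keyword table and the function name classify_word are all hypothetical and chosen only for illustration; a real lexer would list every reserved word of its language.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes, chosen only for this sketch. */
enum { TOK_IDENT = 11, TOK_IF = 30, TOK_ELSE = 31, TOK_WHILE = 32 };

struct keyword { const char *text; int token; };

static const struct keyword keywords[] = {
    { "if", TOK_IF }, { "else", TOK_ELSE }, { "while", TOK_WHILE },
};

/* Called after the identifier part of the DFSA has collected a lexeme:
   returns the reserved-word code if the lexeme is in the table,
   otherwise TOK_IDENT. */
int classify_word(const char *lexeme)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].token;
    return TOK_IDENT;
}
```

A linear scan is enough for a handful of keywords; a sorted table with binary search or a hash table would be the usual choice for a larger set.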

Lexical Rules State Diagram

::= | ::= | | | ::= | ::= + | - | * | / | ( | )


Lexical Analyzer from Text

Implementation: front.c (pp. 176-181)

- Following is the output of the lexical analyzer of front.c when used on (sum + 47) / total

    Next token is: 25 Next lexeme is (
    Next token is: 11 Next lexeme is sum
    Next token is: 21 Next lexeme is +
    Next token is: 10 Next lexeme is 47
    Next token is: 26 Next lexeme is )
    Next token is: 24 Next lexeme is /
    Next token is: 11 Next lexeme is total
    Next token is: -1 Next lexeme is EOF

Program Structure

• Program is a DFSA with global variables
• Utility routines:
  – getChar - gets the next character of input, puts it in nextChar, determines its class and puts the class in charClass
  – getNonBlank - advances over whitespace to the first char of a token
  – addChar - puts the character from nextChar into the place the lexeme is being accumulated, lexeme
  – lookup - determines whether the string in lexeme is a reserved word (returns a code)

front.c 1

    #include <stdio.h>
    #include <ctype.h>

    /* global declarations */
    /* variables */
    int charClass;
    char lexeme[100];
    char nextChar;
    int lexLen;
    int nextToken;
    FILE *in_fp;

    /* Function declarations */
    void addChar();
    void getChar();
    void getNonBlank();
    int lex();

front.c 2

    /* Character classes */
    #define LETTER 0
    #define DIGIT 1
    #define UNKNOWN 99

    /* Token codes */
    #define INT_LIT 10
    #define IDENT 11
    #define ASSIGN_OP 20
    #define ADD_OP 21
    #define SUB_OP 22
    #define MULT_OP 23
    #define DIV_OP 24
    #define LEFT_PAREN 25
    #define RIGHT_PAREN 26

front.c 3

    /* main driver */
    int main() {
      /* open the input data file and process contents */
      if ((in_fp = fopen("front.in", "r")) == NULL)
        printf("ERROR - cannot open front.in \n");
      else {
        getChar();
        do {
          lex();
        } while (nextToken != EOF);
      }
      return 0;
    }

front.c 4

    /* lookup - a function to lookup operators and parentheses
       and return the token */
    int lookup(char ch){
      switch(ch){
        case '(':
          addChar();
          nextToken = LEFT_PAREN;
          break;
        case ')':
          addChar();
          nextToken = RIGHT_PAREN;
          break;
        case '+':
          addChar();
          nextToken = ADD_OP;
          break;
        case '-':
          addChar();
          nextToken = SUB_OP;
          break;
        case '*':
          addChar();
          nextToken = MULT_OP;
          break;
        case '/':
          addChar();
          nextToken = DIV_OP;
          break;
        default:
          addChar();
          nextToken = EOF;
          break;
      }
      return nextToken;
    }


front.c 5

    /* addChar - a function to add next char to lexeme */
    void addChar(){
      if (lexLen <= 98){
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = 0;
      } else {
        printf("Error - lexeme too long \n");
      }
    }

    /* getChar - a function to get the next char of input and determine
       its character class */
    void getChar(){
      if ((nextChar = getc(in_fp)) != EOF){
        if (isalpha(nextChar))
          charClass = LETTER;
        else if (isdigit(nextChar))
          charClass = DIGIT;
        else
          charClass = UNKNOWN;
      } else
        charClass = EOF;
    }

front.c 6

    /* getNonBlank - a function to call getChar until it
       returns a non-whitespace character */
    void getNonBlank(){
      while (isspace(nextChar))
        getChar();
    }

    /* lex - a simple lexical analyzer for arithmetic expressions */
    int lex(){
      lexLen = 0;
      getNonBlank();
      switch (charClass){
        case LETTER:
          /* parse identifiers */
          addChar();
          getChar();
          while (charClass == LETTER || charClass == DIGIT){
            addChar();
            getChar();
          }
          nextToken = IDENT;
          break;

front.c 7

        case DIGIT:
          /* parse integer literals */
          addChar();
          getChar();
          while (charClass == DIGIT){
            addChar();
            getChar();
          }
          nextToken = INT_LIT;
          break;
        case UNKNOWN:
          /* parentheses and operators */
          lookup(nextChar);
          getChar();
          break;
        case EOF:
          /* EOF */
          nextToken = EOF;
          lexeme[0] = 'E';
          lexeme[1] = 'O';
          lexeme[2] = 'F';
          lexeme[3] = 0;
          break;
      } /* end of switch */
      printf("Next token is: %d, next lexeme is %s\n", nextToken, lexeme);
      return nextToken;
    } /* end lex */

Example output (sum + 47) / total

    Next token is: 25 lexeme is (
    Next token is: 11 lexeme is sum
    Next token is: 21 lexeme is +
    Next token is: 10 lexeme is 47
    Next token is: 26 lexeme is )
    Next token is: 24 lexeme is /
    Next token is: 11 lexeme is total
    Next token is: -1 lexeme is EOF

Syntactic Analysis

• Syntactic analysis or parsing determines whether a program is legal or syntactically correct.
• There are two distinct goals:
  1. If not, produce diagnostic messages. Many parsers try to recover and continue analysis as long as possible in order to diagnose as many problems as possible
  2. If a program is syntactically correct, produce a parse tree

Two general types of parsers

• Top-down parsers start with the start symbol of the language and build a parse tree in preorder:
  – Visit the node
  – Visit the left subtree
  – Visit the right subtree
• This corresponds to a leftmost derivation
• Example: Given current string x A y, and a rule A → w, rewrite the string as x w y


Bottom-up parsers

• Bottom up parsers construct a tree starting with the leaves, in the reverse order of a rightmost derivation
• In broad terms, given a right sentential form α, the parser finds the substring of α (called the handle) that is the RHS of the rule that produces α from the previous sentential form
  – The handle is then reduced to its LHS
  – Example: If the current string is x w y and there is a rule A → w, rewrite the string as x A y

Computational Complexity of Parsing

• Parsing CFLs in the general case is inefficient and exponential in the length of the program string
  – Each possible rule has to be tried (exhaustive search)
• There are a number of algorithms that can reduce complexity to O(n³)
  – Still too complex for commercial compilers
• By reducing the generality of the languages to be parsed, complexity can be reduced to approximately linear, O(n)

Top-Down Parsing

• Given the sentential form xAα where
  – x is a string of terminal symbols
  – A is the leftmost non-terminal
  – α is a string of terminals and non-terminals
• Our goal is to find the next sentential form in a leftmost derivation
  – We need to choose a rule where A is the LHS
  – Suppose the possibilities are
    • A => bB    A => cBb    A => a
  – We need to choose among the next sentential forms
    • xbBα    xcBbα    xaα

How to choose?

• Examine the next token of input: is it a, b or c?
• This of course is easy but it may get considerably more complex if the RHSs begin with non-terminals

Recursive Descent Parsing

• An easy and straightforward top down parsing algorithm (at least for humans to write)
  – It only works with a subset of CFGs called LL(k)
    • L = Left-to-right parsing
    • L = Leftmost derivation
    • (k) means at most k tokens lookahead, usually 1 for an efficient parser
  – LR grammars are left-to-right parsing with rightmost derivation
    • Handle a wider class of grammars than LL parsers
    • Better at error reporting
    • Table driven parser, harder for humans to write than LL
    • Easy to generate with machine (e.g., yacc)

Recursive Descent Parsing

• Constructed from a set of mutually recursive routines that mirror the productions of the grammar
  – EBNF is well-suited as a model for a recursive descent parser
• Each non-terminal in the grammar has a single routine or function
  – Its purpose is to trace the parse tree starting from that symbol
  – It is effectively a parser for that language where the nonterminal is the start symbol


Example

• EBNF
    <expr> => <term> {(+|-) <term>}
    <term> => <factor> {(* | /) <factor>}
    <factor> => <id> | int_constant | ( expr )
• In the following example, remember that the lexer has global variables:
    char nextChar;
    int lexlen;
    int nextToken;

Defines from front.c 2

    /* Character classes */
    #define LETTER 0
    #define DIGIT 1
    #define UNKNOWN 99

    /* Token codes */
    #define INT_LIT 10
    #define IDENT 11
    #define ASSIGN_OP 20
    #define ADD_OP 21
    #define SUB_OP 22
    #define MULT_OP 23
    #define DIV_OP 24
    #define LEFT_PAREN 25
    #define RIGHT_PAREN 26

Expr

    void expr(){
      /* parses <expr> => <term> {(+|-) <term>} */
      printf("enter <expr>\n");
      term();
      while (nextToken == ADD_OP ||
             nextToken == SUB_OP) {
        lex();
        term();
      }
      printf("exit <expr>\n");
    }

    /* Q: Where does nextToken come from?
       A: each function leaves the next unconsumed token in nextToken
          each function assumes on entry that it is available in nextToken */

Term

    void term(){
      /* parses <term> => <factor> {(*|/) <factor>} */
      printf("enter <term>\n");
      factor();
      while (nextToken == MULT_OP ||
             nextToken == DIV_OP) {
        lex();
        factor();
      }
      printf("exit <term>\n");
    }

<factor> is a bit more complex…

• Factor has to choose between the several alternate RHS
    <factor> => <id> | int_constant | ( expr )
• Also we may be able to detect a syntax error in this function
  – The previous two functions could not

Factor

    void factor(){
      /* parses <factor> => <id> | int_constant | ( expr ) */
      printf("enter <factor>\n");
      if (nextToken == IDENT || nextToken == INT_LIT)
        lex();
      else {
        if (nextToken == LEFT_PAREN) {
          lex();
          expr(); /* recursion! */
          if (nextToken == RIGHT_PAREN)
            lex();
          else
            error();
        }
        else
          error();
      }
      printf("exit <factor>\n");
    }
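The three routines can be combined into one self-contained recognizer. The sketch below is an illustration, not the text's program: it reads characters straight from a string instead of calling lex(), so the names cur, ok, skip_blanks and parse_expression are all hypothetical, and a syntax error is recorded in a flag rather than reported through error().

```c
#include <ctype.h>
#include <stdbool.h>

static const char *cur;   /* next unconsumed character of the input */
static bool ok;           /* cleared on the first syntax error */

static void skip_blanks(void) { while (*cur == ' ') cur++; }

static void expr(void);

/* <factor> => id | int_constant | ( <expr> ) */
static void factor(void)
{
    skip_blanks();
    if (isalpha((unsigned char)*cur)) {
        while (isalnum((unsigned char)*cur)) cur++;    /* identifier */
    } else if (isdigit((unsigned char)*cur)) {
        while (isdigit((unsigned char)*cur)) cur++;    /* integer constant */
    } else if (*cur == '(') {
        cur++;
        expr();                                        /* recursion */
        skip_blanks();
        if (*cur == ')') cur++; else ok = false;
    } else {
        ok = false;                                    /* no alternative applies */
    }
}

/* <term> => <factor> {(* | /) <factor>} */
static void term(void)
{
    factor();
    skip_blanks();
    while (ok && (*cur == '*' || *cur == '/')) {
        cur++;
        factor();
        skip_blanks();
    }
}

/* <expr> => <term> {(+ | -) <term>} */
static void expr(void)
{
    term();
    skip_blanks();
    while (ok && (*cur == '+' || *cur == '-')) {
        cur++;
        term();
        skip_blanks();
    }
}

/* Returns true if the whole string is a well-formed expression. */
bool parse_expression(const char *text)
{
    cur = text;
    ok = true;
    expr();
    skip_blanks();
    return ok && *cur == '\0';
}
```

As in the text's version, only the factor routine can detect a syntax error. The braces { } of the EBNF become while loops, and each choice is made by looking at a single character.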


Example output (sum + 47) / total

    Next token is: 25 lexeme is (
    Enter <expr>
    Enter <term>
    Enter <factor>
    Next token is: 11 lexeme is sum
    Enter <expr>
    Enter <term>
    Enter <factor>
    Next token is: 21 lexeme is +
    Exit <factor>
    Exit <term>
    Next token is: 10 lexeme is 47
    Enter <term>
    Enter <factor>
    Next token is: 26 lexeme is )
    Exit <factor>
    Exit <term>
    Exit <expr>
    Next token is: 24 lexeme is /
    Exit <factor>
    Next token is: 11 lexeme is total
    Enter <factor>
    Next token is: -1 lexeme is EOF
    Exit <factor>
    Exit <term>
    Exit <expr>

Example 2: if statement

    <ifstmt> -> if (<boolexpr>) <stmt> [else <stmt>]

• Recursive descent subprogram has to
  – Check that current token is IF
  – Lex() and check that current token is (
  – Lex() and call <boolexpr>
  – Check that current token is )
  – Lex() and call <stmt>
  – Check if current token is ELSE, if so Lex() and call <stmt>

Example 2: if statement

    void ifstmt(){
      if (nextToken != IF_CODE)
        error();
      else {
        lex();
        if (nextToken != LEFT_PAREN)
          error();
        else {
          lex(); /* error in text; this was omitted */
          boolexpr();
          if (nextToken != RIGHT_PAREN)
            error();
          else {
            lex(); /* error in text; this was omitted */
            stmt();
            if (nextToken == ELSE_CODE){
              lex();
              stmt();
            } /* end if (nextToken == ELSE_CODE) */
          } /* end if (nextToken != RIGHT_PAREN) */
        } /* end if (nextToken != LEFT_PAREN) */
      } /* end if (nextToken != IF_CODE) */
    } /* end IF_STMT */


LL Grammars

• Top down parsing algorithms are simple and easy to hand code
  – But the class of grammars that can be recognized using top down parsing is limited to LL(k) (and it is easiest when k=1: one symbol lookahead)
• Rule #1: left recursion is prohibited
  – Given a rule A => A + B we would obviously have infinite recursion as A has to start with a recursive call
  – Note that this applies only to the FORM of the grammar
  – EBNF can be useful for top down parsing

BNF and EBNF

• BNF
    <expr> => <expr> + <term>
            | <expr> - <term>
            | <term>
    <term> => <term> * <factor>
            | <term> / <factor>
            | <factor>
• EBNF
    <expr> => <term> {(+ | -) <term>}
    <term> => <factor> {(* | /) <factor>}

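Eliminating Direct Left Recursion

• Direct left recursion can be removed by rewriting any rule of the form
  – A => AxB | B | C
• As
  – A => BA' | CA'
  – A' => xBA' | ε

Left recursion removal

    <expr> => <expr> + <term> | <expr> - <term> | <term>
    <term> => <term> * <factor> | <term> / <factor> | <factor>
    <factor> => <id> | ( <expr> )

• As
    <expr> => <term> <expr'>
    <expr'> => + <term> <expr'> | - <term> <expr'> | ε
    <term> => <factor> <term'>
    <term'> => * <factor> <term'> | / <factor> <term'> | ε
    <factor> => <id> | ( <expr> )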

Indirect Left Recursion

• Indirect left recursion also presents a problem:
    A => B x A
    B => A B
• It is possible to remove indirect left recursion but this is beyond our scope

Rule #2

• In order to use one-symbol lookahead the rules on the RHS of any production must be distinguishable by examining only one token
  – The text refers to this as the "pairwise disjointness" rule
  – For any RHS α of a nonterminal A -> α, we can compute a set called FIRST(α) which contains the terminals that can appear at the left end of strings derived from α
  – So for each pair of rules A -> α and A -> β, we want the intersection of FIRST(α) and FIRST(β) to be empty


Example

• Consider
  – A => aB | bAb | Bb
  – B => cB | d
  – FIRST(aB) = {a}; FIRST(bAb) = {b}; FIRST(Bb) = {c,d}
• Consider
  – A => aB | BAb
  – B => aB | b
  – FIRST(aB) = {a}; FIRST(BAb) = {a,b}
• When parsing A we can't determine what production to apply by looking at the next terminal

Left factoring

• Rewriting the grammar can solve many lookahead problems
• Consider subscript expressions
    <variable> => <id> | <id>[<expr>]
• Rewrite as
    <variable> => <id> <new>
    <new> => [<expr>] | ε
• Which is identical to the EBNF
    <variable> => <id> [[<expr>]]
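Once the FIRST sets are known, the pairwise disjointness test is mechanical. The sketch below assumes every terminal is a single character, so a FIRST set can be written as a string such as "cd" for {c, d}. The function name pairwise_disjoint and the arrays GRAMMAR_1 and GRAMMAR_2 are hypothetical; they hold the FIRST sets of the two A rules from the Example slide.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* FIRST sets of the alternatives of A in the two example grammars. */
const char *const GRAMMAR_1[] = { "a", "b", "cd" };   /* aB | bAb | Bb */
const char *const GRAMMAR_2[] = { "a", "ab" };        /* aB | BAb */

/* Each FIRST set is a string of single-character terminals.
   Returns true if no terminal occurs in two different sets. */
bool pairwise_disjoint(const char *const first_sets[], size_t count)
{
    for (size_t i = 0; i < count; i++)
        for (size_t j = i + 1; j < count; j++)
            if (strpbrk(first_sets[i], first_sets[j]) != NULL)
                return false;   /* two RHSs can start with the same terminal */
    return true;
}
```

The first grammar passes the test. The second fails because a is in both FIRST sets, which is why one token of lookahead cannot pick the production.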

Quiz Answers

• Draw a DFSA that recognizes binary strings that start with 1 and end with 0
[Diagram: DFSA with start state S and edges labeled 0 and 1]
• Below is a BNF grammar for fractional numbers. Rewrite as EBNF
    S -> -FN | FN
    FN -> DL | DL.DL
    DL -> D | D DL
    D -> 0|1|2|3|4|5|6|7|8|9

    S -> [-]FN
    FN -> DL[.DL]
    DL -> D{D}

DFSA for q2

• Draw a DFSA that recognizes binary strings with at least three consecutive 1’s
[Diagram: DFSA with start state S and edges labeled 0, 1 and 1,0]
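The EBNF form maps almost line for line onto code: each [ ] becomes an if and each { } becomes a loop. The following sketch is only an illustration; the function names digit_list and is_fractional_number are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* DL -> D{D}: consume one or more digits; false if there is none. */
static bool digit_list(const char **p)
{
    if (!isdigit((unsigned char)**p))
        return false;
    while (isdigit((unsigned char)**p))
        (*p)++;
    return true;
}

/* S -> [-]FN and FN -> DL[.DL]: true if the whole string derives from S. */
bool is_fractional_number(const char *s)
{
    if (*s == '-')                 /* [-] : optional sign */
        s++;
    if (!digit_list(&s))           /* DL */
        return false;
    if (*s == '.') {               /* [.DL] : optional fraction */
        s++;
        if (!digit_list(&s))
            return false;
    }
    return *s == '\0';             /* nothing may follow the number */
}
```

Strings such as -12.5 and 7 are accepted, while 7. and .5 are rejected because the grammar requires a digit list on both sides of the period.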

Quiz 4

• For the language of binary strings that contain at least 3 consecutive 1’s write: 1. A regular grammar 2. A regular expression
