Lexical Analysis Lecture 3

Home , Compound (linguistics), Lexical analysis

January 10, 2018 Announcements

É PA1c due tonight at 11:50pm! É Don’t forget about PA1, the Cool implementation! É Use Monday’s lecture, the video guides and Cool examples if you’re stuck with Cool!

Compiler Construction 2/39 Programming Assignments Going Forward

É C was allowed for PA1, but not for PA2 through PA6

É How comfortable are we with other languages? É Python, Haskell, Ruby, OCaml, and JavaScript

Compiler Construction 3/39 Lexical Analysis Summary

É Lexical analysis turns a stream of characters into a stream of tokens É Regular expressions are a way to specify sets of strings, which we use to describe tokens class Main { ... Lexical Analyzer CLASS, IDENT, LBRACE, ...

Compiler Construction 4/39 Lexical Analysis Summary

Compiler Construction 4/39 Cunning Plan

É Informal Sketch of Lexical Analysis É LA identiﬁes tokens from input string É List lexer ( char[] )

É Issues in Lexical Analysis É Lookahead É Ambiguity

É Specifying Lexers É Regular Expressions É Examples

Compiler Construction 5/39 É In English: É noun, verb, adjective É In Programming: É identiﬁer, integer, keyword, whitespace, ...

Deﬁnitions

É Token — set of strings deﬁning an atomic element with a distinct meaning

É a syntactic category

É Lexeme — a sequence of characters than can be categorized as a Token

Compiler Construction 6/39 Deﬁnitions

É Token — set of strings deﬁning an atomic element with a distinct meaning

É a syntactic category É In English: É noun, verb, adjective É In Programming: É identiﬁer, integer, keyword, whitespace, ...

É Lexeme — a sequence of characters than can be categorized as a Token

Compiler Construction 6/39 Tokens and Lexemes

Token Lexeme CLASS class LT < FALSE false IDENT variable_name

Compiler Construction 7/39 Tokens and Lexemes

Token Lexeme CLASS class LT < FALSE false IDENT variable_name

Compiler Construction 7/39 Tokens and Lexemes

Token Lexeme CLASS class LT < FALSE false IDENT variable_name

By the way, what do you think of Cool’s fi, pool, esac...?

Compiler Construction 7/39 Context for Lexers

É Lexing and Parsing go hand-in-hand É Parser uses distinctions between tokens É e.g., a keyword is treated differently than an identiﬁer get_char() get_token()

input Lexer Parser

character get_token()

Compiler Construction 8/39 Lexical Analysis

É Consider this example: if(i=j) then z<-0 else z<-1 É The input is simply a sequence of characters: if(i=j) then\n\tz<-0\nelse ... É Goal partition input strings into substrings É Then, classify them according to their role (tokenize!)

Compiler Construction 9/39 Tokens

É Tokens correspond to sets of strings

É Identiﬁer— strings of letters or digits, starting with a letter É Integer— a non-empty string of digits É Keyword— “else” or “class” or “let” ... É Whitespace— Non-empty sequence of blanks, newlines, and/or tabs É OpenParen— a left parenthesis ( É CloseParen— a right parenthesis )

Compiler Construction 10/39 Building a Lexical Analyzer

É Lexer implementation must do three things 1. Recognize substrings corresponding to tokens 2. Return the value of lexeme of the token 3. Report errors intelligently (line numbers for Cool)

Compiler Construction 11/39 Lexical Analyzer Implementation

É Lexer usually discards “uninteresting” tokens that don’t contribute to parsing

É Examples: Whitespace, comments É Exceptions: Which languages care about whitespace?

É Review: What would happen if we removed all whitespace and comments before lexing?

Compiler Construction 12/39 Example

É Recall: if (i = j) then z<-0 else z<-1 É Our Cool Lexer would return token-lexeme-linenumber tuples

Compiler Construction 13/39 É We really need a way to describe the lexemes associated with each token É And also a way to handle ambiguities É is “if” two variables “i”, “f”

Lexing Considerations

É The goal is to partition the input string into meaningful tokens.

É Scan left to right (i.e., in order) É Recognize tokens

Compiler Construction 14/39 É And also a way to handle ambiguities É is “if” two variables “i”, “f”

Lexing Considerations

É The goal is to partition the input string into meaningful tokens.

É Scan left to right (i.e., in order) É Recognize tokens

É We really need a way to describe the lexemes associated with each token

Compiler Construction 14/39 Lexing Considerations

É The goal is to partition the input string into meaningful tokens.

É Scan left to right (i.e., in order) É Recognize tokens

É We really need a way to describe the lexemes associated with each token É And also a way to handle ambiguities É is “if” two variables “i”, “f”

Compiler Construction 14/39 Regular Languages

É Sounds like we can use DFAs to recognize lexemes É With accepting states corresponding to tokens! Example: Capture the word “class” c l a s s WS

Example: Capture some variable name

A WS A = letter AN = alphanumeric W = whitespace AN

Compiler Construction 15/39 Capturing Multiple Tokens

What about both “class” and variable names?

c l a s s WS 1

A-c WS 2 WS AN

Compiler Construction 16/39 Lexical Analyzer Generators

É We like regular languages as a means to categorize lexemes into tokens É We don’t like the complexity of implementing a DFA manually

É We use Regular Expressions to describe regular languages

É And our tokens are recognizable as regular languages!

É Regular Expressions can be automatically turned into a DFA for rapid lexing!

Compiler Construction 17/39 Languages Review

É Deﬁnition Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ. Σ is called the alphabet

Compiler Construction 18/39 Examples of Languages

É Alphabet = English Characters É Language = English Sentences É Note: Not every string on English characters is an English sentence

É Example: adsfasdklg gdsajkl É Alphabet = ASCII characters É Language = C Programs É Note: ASCII character set is different from English character set

Compiler Construction 19/39 Notation

É Languages are sets of strings

É We need some notation for specifying which sets we want

É i.e., which strings are in a set?

É For lexical analysis, we care about regular languages, which can we described using regular expressions

Compiler Construction 20/39 Regular Expressions

É Each regular expressions is a notation for a regular language (a set of “words”)

É Notation forthcoming!

É If A is a regular expression, we write L(A) to refer to the language denoted by A

Compiler Construction 21/39 Base Regular Expressions

É Single character: ‘c’ É L(‘c’) = { ‘c’ }

É Concatenation: AB É A and B are both Regular expressions É L(AB) = ab a L(A) and b L(B) { | ∈ ∈ }

É Example: L(‘i’ ‘f’) = { ‘if’ }

Compiler Construction 22/39 Compound Regular Expressions

É Union É L(A B) = s s L(A) or s L(B) | { | ∈ ∈ }

É Examples

É L(‘if’ | ‘then‘ | ‘else’) = { ‘if’, ‘then’, ‘else’ } É L(‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’) = what?

É L ( (‘0’|‘1’) (‘0’|’1’) ) = what?

Compiler Construction 23/39 Starz! É So far, base and compound regular expressions only describe ﬁnite languages Iteration: A É ∗ É L(A ) = “” L(A) L(AA) L(AAA) ... ∗ { } ∪ { } ∪ { } ∪ { } ∪ É Examples É L(‘00 ) = “”,“0”,“00”,“000”,... ∗ { } É L(‘10‘00 ) = “1”,“10”,“100”,“1000”,... ∗ { } É Empty: ε É L(ε) = “” { }

Compiler Construction 24/39 Example: Keyword

É Keywords: “else” or “if” or “ﬁ” ‘else’ | ‘if’ | ‘ﬁ’ (Recall that ‘else’ abbreviates concatenation of ‘e’ ‘l’ ‘s’ ‘e’ )

Compiler Construction 25/39 Example: Integer

É Integer: a non-empty string of digits digit = ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’

number = digit digit*

É Abbreviation: A+ = AA*

Compiler Construction 26/39 Example: Identiﬁers

É Identiﬁer: string of letters or digits, start with a letter

letter = ‘A’ | ... | ‘Z’ | ‘a’ | ... ‘Z’ ident = letter (letter | digit ) *

É Is (letter*|digit*) the same?

Compiler Construction 27/39 Example: Whitespace

É Whitespace: a non-empty sequence of blanks, newlines, and tabs

( ‘ ’ | ‘\t’ | ‘\n’ | ‘\r’ ) +

Compiler Construction 28/39 Example: Phone Numbers

Regular expressions are everywhere! Consider: (123)-234-4567

Σ { 0, 1, 2, ... 9, (, ), -} area digit digit digit exch digit digit digit phone digit digit digit digit number ‘(’ area ‘)’ ‘-’ exch ‘-’ phone

Compiler Construction 29/39 Example: Email addresses

Consider [email protected] Σ {a, b, c, ..., z, ‘.’, ‘@’} name letter+ address name ‘@’ name (‘.’ name)*

Compiler Construction 30/39 É NO Recall we need the original lexeme! É We must adapt regular expressions to this goal

Regular Expression Summary

É Regular expressions describe many useful languages É Given a string s and a regexp R, we can ﬁnd if s L(R) ∈ É Is this enough?

Compiler Construction 31/39 É We must adapt regular expressions to this goal

Regular Expression Summary

É Regular expressions describe many useful languages É Given a string s and a regexp R, we can ﬁnd if s L(R) ∈ É Is this enough? É NO Recall we need the original lexeme!

Compiler Construction 31/39 Regular Expression Summary

É Regular expressions describe many useful languages É Given a string s and a regexp R, we can ﬁnd if s L(R) ∈ É Is this enough? É NO Recall we need the original lexeme! É We must adapt regular expressions to this goal

Compiler Construction 31/39 Next time

É Specifying lexical structure using regular expressions É Finite automata É Deterministic Finite Automata É Nondeterministic Finite Automata É Implementation of Regular Expressions É Regexp NFA DFA lookup table → → → É Lexical Analyzer Generation (i.e., doing this all automatically)

Compiler Construction 32/39 Lexical Speciﬁcation (1)

É Start with a set of tokens (protip, PA2 lists them for Cool) É Write a regular expressions for the lexemes representing each token

É Number = digit+ É IF = “if” É ELSE = “else” É IDENT = letter ( letter | digit ) * ...

Compiler Construction 33/39 Lexical Speciﬁcation (2)

É Construct R, matching all lexemes for all tokens

É R = Number | IF | ELSE | IDENT | ... É R = R1 | R2 | R3 | ...

If s L(R), then s is a lexeme É ∈ É Also s L(Rj) for some j ∈ É The particular j corresponds to the type of token reported by lexer

Compiler Construction 34/39 Lexical Speciﬁcation (3)

É For an input x1,...,xn É Each xi Σ ∈ For each 1 i n, check É ≤ ≤ É is x1...xi L(R)? ∈ É If so, it must be that x1...xi L(Rj) for some j ∈ É Remove x1...xi from input and restart

Compiler Construction 35/39 Example Lexing

R = Whitespace | Integer | Identifier | Plus Lex “f + 3 + g” É “f” matches R (more specifically, Identifier) É “ ” matches R (more specifically, Whitespace) ... What does the lexer output look like for this example?

Compiler Construction 36/39 Ambiguities

É Ambiguities arise in this algorithm

É R = Whitespace | Integer | Identiﬁer | Plus

É Lex “foo+3” É “f”, “fo”, and “foo” all match R, but not “foo+” É How much input do we consume? É Maximal munch rule: pick the longest possible substring that matches R

Compiler Construction 37/39 Ambiguities (2)

R = Whitespace | ‘new’ | Integer | Identiﬁer | Plus É Lex “new foo”

É “new” matches both the ‘new’ rule and the ‘Identiﬁer’ rule which one do you pick?

É Generally, pick the rule listed ﬁrst É Arbitrary, but typical É Important for PA2! ‘new’ was listed before ‘Identiﬁer‘, so the token given is ‘new’

Compiler Construction 38/39 Summary

É Regular expressions provide a concise notation for string patterns É We need to adapt them for lexical analysis to É Resolve ambiguities É Handle errors (report line numbers)

É Next time, Lexical Analysis Generators

Compiler Construction 39/39 c l a s s WS

A WS A = letter AN = alphanumeric W = whitespace AN c l a s s WS 1

A-c WS 2 WS AN

Compiler Construction 39/39