Fall 2016-2017 Compiler Principles Lecture 1: Lexical Analysis
Roman Manevich, Ben-Gurion University of the Negev

Agenda
• Understand role of lexical analysis in a compiler
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation
Javascript example
• Can you identify some basic units in this code?

var currOption = 0; // Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}

• The basic units: keywords (var, function, for, if, else), identifiers (currOption, choose, id, menu, elt, …), operators (=, <, ++, ==, &&), numeric literals (0), string literals ("about-me", …), punctuation (parentheses, braces, brackets, semicolons), comments (// …), and whitespace
Role of lexical analysis
• First part of compiler front-end
[Scheme: High-level Language → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Intermediate Representation (IR) → Code Generation → Executable Code]
• Convert stream of characters into stream of tokens – Split text into most basic meaningful strings • Simplify input for syntax analysis
From scanning to parsing
program text: 59 + (1257 * xPosition)
Lexical Analyzer → lexical error, or a valid token stream: num + ( num * id )
Grammar: E → id | E → num | E → E + E | E → E * E | E → ( E )
Parser → syntax error, or a valid Abstract Syntax Tree:
[AST: + at the root, with children num and *; * has children num and id(x)]
Scanner output
• Where is the white space?

var currOption = 0; // Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) { ... } }

Stream of Tokens (LINE: TOKEN(value)):
1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...
Tokens
What is a token?
• Lexeme – a substring of the original text constituting an identifiable unit
  – Identifiers, values, reserved words, …
• A record type storing:
  – Kind
  – Value (when applicable)
  – Start position / end position
  – Any information that is useful for the parser
• Different for different languages
Example tokens
Type | Examples
Identifier | x, y, z, foo, bar
NUM | 42
FLOATNUM | -3.141592654
STRING | “so long, and thanks for all the fish”
LPAREN | (
RPAREN | )
IF | if
…
C++ example 1
• Splitting text into tokens can be tricky • How should the code below be split?
vector
Is “>>” here the right-shift operator (one token) or two “>” tokens?
C++ example 2
• Splitting text into tokens can be tricky • How should the code below be split?
vector
“>”, “>” — two tokens
Separating tokens
Type | Examples
Comments | /* ignore code */  // ignore until end of line
White spaces | \t \n
• Lexemes that are recognized but get consumed rather than transmitted to the parser
  – They still separate tokens: “if” is a single keyword, while “i f” and “i/*comment*/f” are two identifiers
Preprocessor directives in C
Type | Examples
Include directives | #include
First step of designing a scanner
• Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 “hello”
  – Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?
  – Regular expressions
Agenda
• Understand role of lexical analysis in a compiler – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation
Regular languages reminder
Basic definitions and facts
• Formal languages
  – Alphabet = finite set of letters
  – Word = sequence of letters
  – Language = set of words
• Regular languages are defined equivalently by
  – Regular expressions
  – Finite-state automata
Regular expressions
• Empty string: Є
• Letter: a1, …, ak (the alphabet)
• Concatenation: R1 R2
• Union: R1 | R2
• Kleene star: R*
  – Shorthand: R+ stands for R R*
• Scope: (R)
• Example: (0* 1*) | (1* 0*)
  – What is this language?
Exercise 1 - Question
• Language of Java identifiers – Identifiers start with either an underscore ‘_’ or a letter – Continue with either underscore, letter, or digit
Exercise 1 - Answer
• Language of Java identifiers – Identifiers start with either an underscore ‘_’ or a letter – Continue with either underscore, letter, or digit
– (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Exercise 1 – Better answer
• Language of Java identifiers – Identifiers start with either an underscore ‘_’ or a letter – Continue with either underscore, letter, or digit
– (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
– Using shorthand macros:
  First = _|a|b|…|z|A|…|Z
  Next = First|0|…|9
  R = First Next*
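The macro shorthand maps directly onto java.util.regex character classes. A minimal sketch (the class and method names are illustrative; note that real Java identifiers also allow ‘$’ and other Unicode letters, which the slide’s simplified definition omits):

```java
import java.util.regex.Pattern;

// First = _|a|...|z|A|...|Z ; Next = First|0|...|9 ; R = First Next*
class JavaIdent {
    static final String FIRST = "[_a-zA-Z]";
    static final String NEXT = "[_a-zA-Z0-9]";
    static final Pattern ID = Pattern.compile(FIRST + NEXT + "*");

    // matches() anchors at both ends, so the whole string must be an identifier
    static boolean isIdentifier(String s) {
        return ID.matcher(s).matches();
    }
}
```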
Exercise 2 - Question
• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Positive examples: 0, 123.757, .933333, 0.7
  – Negative examples: 007, 0.30
Exercise 2 - Answer
• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Digit = 1|2|…|9
    Digit0 = 0|Digit
    Num = Digit Digit0*
    Frac = Digit0* Digit
    Pos = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (Є|-)Pos
    R = 0 | PosOrNeg
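Transcribing these macros into java.util.regex syntax makes the definition testable against the examples. A sketch (class and constant names are illustrative):

```java
import java.util.regex.Pattern;

// Digit = 1|...|9, Digit0 = 0|Digit, Num = Digit Digit0*, Frac = Digit0* Digit,
// Pos = Num | .Frac | 0.Frac | Num.Frac, R = 0 | (Є|-)Pos
class RationalNum {
    static final String DIGIT = "[1-9]";
    static final String DIGIT0 = "[0-9]";
    static final String NUM = DIGIT + DIGIT0 + "*";      // no leading zero
    static final String FRAC = DIGIT0 + "*" + DIGIT;     // no trailing zero
    static final String POS =
        NUM + "|\\." + FRAC + "|0\\." + FRAC + "|" + NUM + "\\." + FRAC;
    static final Pattern R = Pattern.compile("0|-?(?:" + POS + ")");

    static boolean matches(String s) {
        return R.matcher(s).matches();
    }
}
```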
Exercise 3 - Question
• Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …
Exercise 3 - Answer
• Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]
Finite automata
Finite automata: known results
• Types of finite automata: – Deterministic (DFA) – Non-deterministic (NFA) – Non-deterministic + epsilon transitions • Theorem: translation of regular expressions to NFA+epsilon (linear time) • Theorem: translation of NFA+epsilon to DFA – Worst-case exponential time • Theorem [Myhill-Nerode]: DFA can be minimized
Finite automata
• An automaton M = ⟨Q, Σ, δ, q0, F⟩ is defined by states and transitions
[Figure: an example automaton with its parts labeled — the start state, transitions on a, b, c, and an accepting state]
Exercise - Question
• What is the language defined by the automaton below?
[Figure: an automaton — the start state has an a-transition to a state with a b self-loop and a c-transition to the accepting state]
Exercise - Answer
• What is the language defined by the automaton below?
  – a b* c
  – Generally: all words read along paths leading from the start state to accepting states
[Figure: the same automaton — start -a→ state with a b self-loop -c→ accepting state]
Non-deterministic automata
• Allow multiple transitions from a given state labeled by the same letter
[Figure: an NFA in which one state has two outgoing transitions labeled a, plus b and c transitions]
NFA+Є automata
• Є-transitions can “fire” without reading the input
[Figure: an NFA with an Є-transition between two states, alongside a, b, c transitions]
A little about me
• Joined Ben-Gurion University in 2012 • Research interests – Inductive programming and synthesis – Static analysis and verification – Language-supported parallelism
I am here for
• Teaching you theory and practice of popular compiler algorithms – Hopefully make you think about solving problems by examples from the compilers world – Answering questions about material
• Contacting me – e-mail: [email protected] – Office hours: see course web-page • Announcements • Forums (per assignment)
Tentative syllabus
Front End: Scanning, Top-down Parsing (LL), Bottom-up Parsing (LR)
Intermediate Representation: Operational Semantics, Lowering
Optimizations: Dataflow Analysis, Loop Optimizations
Code Generation: Register Allocation, Instruction Selection, Energy Optimization
mid-term exam
Reg-exp vs. automata
• Regular expressions are declarative — a high-level language
  – Offer a compact way for humans to define a regular language
  – Don’t offer a direct way to check whether a given word is in the language
• Automata are operative — a machine
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans
From Regular expressions to automata
From reg. exp. to NFA+Є automata
• Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression • Proof: by induction on the structure of the regular expression
[Figure: base cases — the automaton for Є (a start state that is accepting) and for a letter a (start state, an a-transition, accepting state)]
Inductive constructions
[Figure: the inductive cases — R1 | R2 via a new start state with Є-transitions into the automata for R1 and R2; R1 R2 by chaining the automaton for R1 into the automaton for R2; R* by adding Є-transitions that allow repeating or skipping R]
Running time of NFA+Є
• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+Є with k states on a string of length n takes O(n·k²) time
  – Each state in a configuration of O(k) states may have O(k) outgoing edges, so processing an input letter may take O(k²) time
  – Can we reduce the k² factor?
From NFA+Є to DFA
• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+Є with k states on a string of length n takes O(n·k²) time
  – Can we reduce the k² factor?
• Theorem: for any NFA+Є automaton there exists an equivalent deterministic automaton
• Proof: determinization via the subset construction
  – Number of states in the worst case: O(2ᵏ)
  – Running time: O(n)
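The subset construction from the proof can be sketched compactly: each DFA state is a set of NFA states. The representation below (integer states, a map-based transition table, and a Unicode marker for Є) is an illustrative choice, not from the slides:

```java
import java.util.*;

class SubsetConstruction {
    static final char EPS = '\u03B5'; // marker for Є (epsilon) transitions

    // nfa.get(q).get(c) = set of successors of state q on symbol c
    static Set<Integer> closure(Map<Integer, Map<Character, Set<Integer>>> nfa,
                                Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int q = work.pop();
            for (int r : nfa.getOrDefault(q, Map.of()).getOrDefault(EPS, Set.of()))
                if (result.add(r)) work.push(r);
        }
        return result;
    }

    static Set<Integer> move(Map<Integer, Map<Character, Set<Integer>>> nfa,
                             Set<Integer> states, char c) {
        Set<Integer> out = new HashSet<>();
        for (int q : states)
            out.addAll(nfa.getOrDefault(q, Map.of()).getOrDefault(c, Set.of()));
        return out;
    }

    // Explore the reachable subsets of NFA states: worst case O(2^k) of them.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> determinize(
            Map<Integer, Map<Character, Set<Integer>>> nfa,
            int start, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.push(closure(nfa, Set.of(start)));
        while (!work.isEmpty()) {
            Set<Integer> s = work.pop();
            if (dfa.containsKey(s)) continue;   // already expanded
            Map<Character, Set<Integer>> row = new HashMap<>();
            for (char c : alphabet) {
                Set<Integer> t = closure(nfa, move(nfa, s, c));
                if (!t.isEmpty()) { row.put(c, t); work.push(t); }
            }
            dfa.put(s, row);
        }
        return dfa;
    }
}
```

A subset containing an accepting NFA state becomes an accepting DFA state; once the DFA is built, running it takes O(n) for an input of length n.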
Recap
• We know how to define any single type of lexeme
• We know how to convert any regular expression into a recognizing automaton
• But how do we use this for scanning?
The formal scanning problem
What is a scanner
[Figure: the input text
  var currOption = 0; // Choose content
  function choose ( id ) { ...
and a Lexical Specification — a list of regular expressions R1 … Rk, one per lexeme — enter the Scanner, which produces a Stream of Tokens (LINE: TOKEN(value)):
  1: VAR
  1: ID(currOption)
  1: EQ
  1: INT_LITERAL(0)
  1: SEMI
  ...]
Scanning problem
• Input:
  – Lexical specification: R1, …, Rk (regular expressions, one per lexeme)
  – input: a string of n characters
• Output: a sequence of tokens R1(lex1) … Rn(lexn) such that
  – The lexemes partition the input: lex1 … lexn = input
  – Each Ri matches the corresponding lexeme type from the specification
Example 1: partitioning
• ID = (a|b|…|z) (a|b|…|z)*
  ONE = 1
• Input: abb1
• What should the output be?
  1. ID(a) ID(b) ID(b) ONE — first match semantics
  2. ID(a) ID(bb) ONE
  3. ID(ab) ID(b) ONE
  4. ID(abb) ONE — maximal munch semantics
Maximal munch semantics
• ID = (a|b|…|z) (a|b|…|z)*
  ONE = 1
• Input: abb1
• How do we return ID(abb) ONE?
• Solution: find the longest matching lexeme
  – The automaton may enter and leave an accepting state many times before the longest match is found
• Intuition: some tokens, such as identifiers, are prefix-closed
Example 2: handling ambiguities
• ID = (a|b|…|z) (a|b|…|z)*
  IF = if
• Input: if
• Matches both tokens
• What should the scanner output be?
[Figure: a DFA for ID and IF — from q0, the letter i leads to a state accepting ID; from there, f leads to a state accepting both ID and IF; the other letters (a-z\i, a-z\f) lead to states accepting only ID]
Solution: precedence semantics
• Break the tie using the order of definitions
  – With ID listed before IF, the output is ID(if); with IF listed before ID, the output is IF
• Conclusion: list keyword token definitions before the identifier definition
Putting together an algorithm
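The tie-breaking rule is easy to sketch with java.util.regex: among the rules that match the same (longest) lexeme, the first-listed one wins. The Rule record, rule names, and "ERROR" fallback below are illustrative:

```java
import java.util.List;
import java.util.regex.Pattern;

class Precedence {
    record Rule(String name, Pattern pattern) {}

    // Return the name of the first rule whose pattern matches the whole lexeme.
    static String classify(String lexeme, List<Rule> rules) {
        for (Rule r : rules)                 // earlier rules take precedence
            if (r.pattern().matcher(lexeme).matches())
                return r.name();
        return "ERROR";
    }
}
```

Listing IF before ID classifies "if" as IF, while "iffy" is still an ID: precedence only breaks ties between rules matching the same lexeme.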
Overall algorithm structure
[Figure: the list of regular expressions R1 … Rk (one per lexeme) is combined into an NFA+Є for R1 | … | Rk (a high-level intermediate representation), which is determinized and minimized into a DFA for R1 | … | Rk (a medium-level intermediate representation), and finally compiled into the scanner implementation with efficient data structures. Crucial step: assigning semantics — how do we implement maximal munch?]
A First match algorithm
First match algorithm
• Suggestions?
• What is the complexity?
A Maximal munch algorithm
Maximal munch scanning algorithm
• Input:
  – input: string of n characters
  – M: DFA for the union of tokens
• Output: the positions in input that are the final characters of each token
• Data:
  – A stack of ⟨state, index⟩ pairs: the states, and their positions, encountered since the last accepting state
  – i: index of the next character in input
  – q: current state, or Bottom (no state)
Maximal munch pseudo-code
[Figure: the pseudo-code with two callouts — the DFA is reset to look for the next token after each match, and Bottom is used to indicate an error situation (no token is found)]
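Since the pseudo-code itself did not survive conversion, here is a sketch of the same idea in Java. Instead of the explicit stack of ⟨state, index⟩ pairs, it remembers the position just past the last accepting state — equivalent behavior: on failure it falls back to the longest match found so far. The DFA encoding is an illustrative choice:

```java
import java.util.*;

class MaximalMunch {
    // delta.get(q).get(c) = successor state; absent entry = no transition
    static List<Integer> scan(String input,
                              Map<Integer, Map<Character, Integer>> delta,
                              Set<Integer> accepting, int q0) {
        List<Integer> tokenEnds = new ArrayList<>(); // 1-based position of each token's final character
        int i = 0;
        while (i < input.length()) {
            int q = q0, j = i, lastAccept = -1;
            while (j < input.length()) {
                Integer next = delta.getOrDefault(q, Map.of()).get(input.charAt(j));
                if (next == null) break;        // DFA stuck: fall back to last accept
                q = next;
                j++;
                if (accepting.contains(q)) lastAccept = j;
            }
            if (lastAccept < 0)                  // no accepting state was ever reached
                throw new IllegalStateException("lexical error at index " + i);
            tokenEnds.add(lastAccept);
            i = lastAccept;                      // reset the DFA to look for the next token
        }
        return tokenEnds;
    }
}
```

On the run example that follows (R1 = a, R2 = a*b, input aaa) this returns the end positions 1, 2, 3.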
Maximal munch run example
• Assume R1 = a, R2 = a*b
• input = aaa
[Figure: DFA — q0 -a→ q1, q1 -a→ q3, q3 -a→ q3 (self-loop), with b-transitions from q0, q1, q3 into the accepting state q2; q1 is also accepting]
• The run, tracked as ⟨q, i, stack⟩:
  q0, 1, [B,1] → q1, 2, [B,1 q0,1] → (q1 is accepting, so the stack is reset) q3, 3, [q1,2] → q3, 4, [q1,2 q3,3]
  – End of input in a non-accepting state: pop q3,3 and then q1,2; q1 is accepting, so a token ending at position 1 is output and the DFA restarts at i = 2
  – The same pattern repeats from i = 2 (token ending at position 2) and from i = 3 (token ending at position 3)
• Output = 1 2 3
Complexity of maximal munch
• What is the complexity of tokenizing a text of n characters by matching longest tokens?
• Assume the following token classes: R1 = a, R2 = a*b
• For input = aⁿ it is O(n²)
[Figure: for each of the n starting positions, the DFA scans through all remaining a’s before falling back to a single-a token]
• Can we improve the worst-case complexity?
Improved scanning algorithm
• Idea: use the work done on the “leftover” stack to improve future decisions
• Remember, for each index, which states have failed
  – i.e., ⟨state, index⟩ pairs that cannot be extended to a token
• “Maximal-Munch” Tokenization in Linear Time Tom Reps [TOPLAS 1998]
Improved algorithm pseudo-code
[Figure: the pseudo-code with two callouts — What is the running time? How many times can the failed-pair test fail for a given index?]
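A sketch of this idea in Java, under the same illustrative DFA encoding as before: the memo set stores ⟨state, index⟩ pairs from which no accepting state was reached, so a later run cuts off as soon as it hits such a pair instead of re-scanning; since the DFA is deterministic, each pair can be explored at most once, giving the linear bound:

```java
import java.util.*;

class LinearMunch {
    static long key(int q, int i) { return ((long) q << 32) | i; }

    static List<Integer> scan(String input,
                              Map<Integer, Map<Character, Integer>> delta,
                              Set<Integer> accepting, int q0) {
        Set<Long> failed = new HashSet<>();  // (state, index) pairs known not to lead to a match
        List<Integer> tokenEnds = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int q = q0, j = i, lastAccept = -1;
            List<long[]> sinceAccept = new ArrayList<>(); // pairs visited since the last accept
            while (j < input.length() && !failed.contains(key(q, j))) {
                sinceAccept.add(new long[]{q, j});
                Integer next = delta.getOrDefault(q, Map.of()).get(input.charAt(j));
                if (next == null) break;
                q = next;
                j++;
                if (accepting.contains(q)) { lastAccept = j; sinceAccept.clear(); }
            }
            if (lastAccept < 0)
                throw new IllegalStateException("lexical error at index " + i);
            // every pair visited after the last accept failed to reach another accept
            for (long[] p : sinceAccept) failed.add(key((int) p[0], (int) p[1]));
            tokenEnds.add(lastAccept);
            i = lastAccept;
        }
        return tokenEnds;
    }
}
```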
Agenda
• Understand role of lexical analysis in a compiler – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
  – Precedence + First match
  – Precedence + Maximal munch
• Scanner generation
Implementing a scanner
Implementing modern scanners
• Manual construction of automata + determinization + maximal munch + tie breaking is
  – Very tedious
  – Error-prone
  – Non-incremental
• Fortunately, there are tools for most languages that automatically generate robust code from a specification
  – C: Lex, Flex; Java: JLex, JFlex
Using JFlex
• Define tokens (and states)
• Run JFlex to generate a Java implementation
• Usually MyScanner.nextToken() will be called in a loop by the parser
[Figure: a Lexical Specification (MyScanner.lex) is fed to JFlex, which generates MyScanner.java; the generated scanner turns a stream of characters into tokens]
Filtering illegal combinations
• Which tokens should the scanner return for “123foo”?
Filtering illegal combinations
• Which tokens should the scanner return for “123foo”? – We sometimes want to rule out certain token concatenations prior to parsing – How can we do that with what we’ve seen so far?
Filtering illegal combinations
• Which tokens should the scanner return for “123foo”? – We sometimes want to rule out certain token concatenations prior to parsing – How can we do that with what we’ve seen so far? • Define “error” lexemes
Catching errors
• What if input doesn’t match any token definition? – Want to gracefully signal an error • Trick: add a “catch-all” rule that matches any character and reports an error – Add after all other rules
Next lecture: parsing