Principles (Fall 2016-2017), Lecture 1

Roman Manevich, Ben-Gurion University of the Negev

Agenda
• Understand role of lexical analysis in a compiler
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation

JavaScript example

• Can you identify some basic units in this code?

var currOption = 0; // Choose content to display in lower pane.

function choose(id) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}

JavaScript example


• Can you identify some basic units in this code?

The basic units, labeled in the code above: keyword (var, function, for, if, else), identifier (currOption, choose, id, menu, elt), operator (=, <, ++, ==, &&), numeric literal (0), string literal ("about-me", "none", ...), punctuation ( ( ) { } [ ] ; , ), comment (// Choose content to display in lower pane.), and whitespace.

Role of lexical analysis

• First part of compiler front-end

High-level Language (e.g., Scheme) -> Lexical Analysis -> Syntax Analysis (Parsing) -> AST + Symbol Table etc. -> Intermediate Representation (IR) -> Code Generation -> Executable Code

• Convert the stream of characters into a stream of tokens
  – Split the text into its most basic meaningful strings
• Simplify the input for syntax analysis

From scanning to parsing

program text: 59 + (1257 * xPosition)
  | Lexical Analyzer (reports a lexical error if the input is not valid)
  v
token stream: num + ( num * id )
  | Parser (reports a syntax error if the token stream is not valid)
  v
Grammar:
  E -> id
  E -> num
  E -> E + E
  E -> E * E
  E -> ( E )

Abstract Syntax Tree:
      +
     / \
  num   *
       / \
    num   x

Scanner output

Where is the white space?

The scanner turns the JavaScript example into a stream of tokens, written LINE: TOKEN(value):

1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...

Tokens

What is a token?

• Lexeme – a substring of the original text constituting an identifiable unit
  – Identifiers, values, reserved words, …
• A token is a record type storing:
  – Kind
  – Value (when applicable)
  – Start-position/end-position
  – Any information that is useful for the parser
• Different for different languages

Example tokens

Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM    -3.141592654
STRING      "so long, and thanks for all the fish"
LPAREN      (
RPAREN      )
IF          if
…

C++ example 1

• Splitting text into tokens can be tricky
• How should the code below be split?

vector<vector<int>> myVector

Is >> one operator token (right shift), or two > tokens?

C++ example 2

• Splitting text into tokens can be tricky
• How should the code below be split?

vector<vector<int> > myVector

Here the space forces >, >: two tokens

Separating tokens

Type          Examples
Comments      /* ignore code */   // ignore until end of line
White spaces  \t \n (space)

• Lexemes that are recognized but get consumed rather than transmitted to the parser
  – e.g., if is a single keyword token, while i f and i/*comment*/f each scan as the two identifiers i and f

Preprocessor directives in C

Type                Examples
Include directives  #include
Macros              #define THE_ANSWER 42

First step of designing a scanner

• Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 "hello"
  – Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?

First step of designing a scanner

• Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 "hello"
  – Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?
  – Regular expressions

Agenda

• Understand role of lexical analysis in a compiler
  – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation

Regular languages reminder

Basic definitions and facts

• Formal languages
  – Alphabet = a finite set of letters
  – Word = a finite sequence of letters
  – Language = a set of words
• Regular languages can be defined equivalently by
  – Regular expressions
  – Finite-state automata

Regular expressions

• Empty string: ε
• Letter: a1, …, ak ∈ Alphabet
• Concatenation: R1 R2
• Union: R1 | R2
• Kleene star: R*
  – Shorthand: R+ stands for R R*
• Scoping: (R)
• Example: (0* 1*) | (1* 0*)
  – What is this language?
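The example can be explored mechanically. Here is a small sketch (not from the slides) using Python's re module, where fullmatch plays the role of language membership:

```python
import re

# The slide's example language: (0* 1*) | (1* 0*).
# A word belongs to the language iff the whole word matches, hence fullmatch.
PATTERN = re.compile(r"0*1*|1*0*")

def in_language(word):
    return PATTERN.fullmatch(word) is not None
```

Trying a few words shows the language is the set of binary strings that switch letter at most once: all zeros then ones, or all ones then zeros (0011 and 1100 are in; 0101 is out).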

Exercise 1 - Question

• Language of Java identifiers
  – Identifiers start with either an underscore '_' or a letter
  – Continue with either an underscore, a letter, or a digit

Exercise 1 - Answer

• Language of Java identifiers
  – Identifiers start with either an underscore '_' or a letter
  – Continue with either an underscore, a letter, or a digit
  – (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Exercise 1 – Better answer

• Language of Java identifiers
  – Identifiers start with either an underscore '_' or a letter
  – Continue with either an underscore, a letter, or a digit
  – (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
  – Using shorthand macros:
    First = _ | a | b | … | z | A | … | Z
    Next = First | 0 | … | 9
    R = First Next*
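As a sanity check, the macro version translates directly into a Python regular expression (a sketch; real Java identifiers also allow '$' and Unicode letters, which this simplified definition omits):

```python
import re

# First = _ | a..z | A..Z ;  Next = First | 0..9 ;  R = First Next*
FIRST = r"[_a-zA-Z]"
NEXT  = r"[_a-zA-Z0-9]"
IDENT = re.compile(FIRST + NEXT + "*")

def is_identifier(s):
    return IDENT.fullmatch(s) is not None
```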

Exercise 2 - Question

• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Positive examples: 0, 123.757, .933333, 0.7
  – Negative examples: 007, 0.30

Exercise 2 - Answer

• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Digit = 1 | 2 | … | 9
    Digit0 = 0 | Digit
    Num = Digit Digit0*
    Frac = Digit0* Digit
    Pos = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (ε | -) Pos
    R = 0 | PosOrNeg
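The macros can be transcribed one-for-one into a Python regular expression and run against the examples from the question slide (an illustrative sketch, not part of the original slides):

```python
import re

DIGIT  = r"[1-9]"
DIGIT0 = r"[0-9]"
NUM    = DIGIT + DIGIT0 + "*"       # Num  = Digit Digit0*  (no leading zeros)
FRAC   = DIGIT0 + "*" + DIGIT       # Frac = Digit0* Digit  (no trailing zeros)
POS    = f"(?:{NUM}|\\.{FRAC}|0\\.{FRAC}|{NUM}\\.{FRAC})"
R      = re.compile(f"0|-?{POS}")   # R = 0 | (eps | -) Pos

def is_rational(s):
    return R.fullmatch(s) is not None
```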

Exercise 3 - Question

• Equal number of opening and closing parentheses: [n]n = [], [[]], [[[]]], …

Exercise 3 - Answer

• Equal number of opening and closing parentheses: [n]n = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]

Finite automata

Finite automata: known results

• Types of finite automata:
  – Deterministic (DFA)
  – Non-deterministic (NFA)
  – Non-deterministic with epsilon transitions (NFA+ε)
• Theorem: regular expressions can be translated to NFA+ε in linear time
• Theorem: NFA+ε can be translated to DFA
  – Worst-case exponential time
• Theorem [Myhill-Nerode]: the DFA can be minimized

Finite automata

• An automaton M = ⟨Q, Σ, δ, q0, F⟩ is defined by a set of states Q, an alphabet Σ, transitions δ, a start state q0, and accepting states F

Diagram (reconstructed from the flattened slide): an arrow marks the start state; an a-transition leads to a state with a b self-loop, and from there a c-transition leads to an accepting state:

  start -> (q0) -a-> (q1, self-loop on b) -c-> ((q2)) accepting

Exercise - Question

• What is the language defined by the automaton below?

  start -> (q0) -a-> (q1, self-loop on b) -c-> ((q2))

Exercise - Answer

• What is the language defined by the automaton below?
  – a b* c
  – Generally: all words labeling paths that lead to an accepting state

  start -> (q0) -a-> (q1, self-loop on b) -c-> ((q2))

Non-deterministic automata

• Allow multiple transitions from a given state labeled by the same letter

(flattened diagram: an automaton in which one state has two outgoing a-transitions to different states, whose paths continue with b- and c-transitions)

NFA+ε automata

• ε-transitions can "fire" without reading the input

(flattened diagram: an automaton containing an ε-transition between two states, alongside a-, b-, and c-transitions)

A little about me

• Joined Ben-Gurion University in 2012
• Research interests
  – Inductive programming and synthesis
  – Static analysis and verification
  – Language-supported parallelism

I am here for

• Teaching you the theory and practice of popular compiler algorithms
  – Hopefully making you think about solving problems, with examples from the real world
  – Answering questions about the material
• Contacting me
  – e-mail: [email protected]
  – Office hours: see the course web-page
• Announcements
• Forums (per assignment)

Tentative syllabus

Front End: Scanning; Top-down Parsing (LL); Bottom-up Parsing (LR)
Intermediate Representation: Operational Semantics; Lowering
Code Generation: Register Allocation; Instruction Selection
Optimizations: Dataflow Analysis; Loop Optimizations; Energy Optimization

(mid-term exam)

Reg-exp vs. automata

• Regular expressions are a declarative, high-level language
  – Offer a compact way for humans to define a regular language
  – Don't offer a direct way to check whether a given word is in the language
• Automata are an operative, machine-level language
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans

From regular expressions to automata

From reg. exp. to NFA+ε automata

• Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
• Proof: by induction on the structure of the regular expression

Inductive constructions

The construction diagrams are flattened in this transcript; each case yields an NFA+ε fragment with one start state and one accepting state:
  – R = ε: a single ε-transition from the start state to the accepting state
  – R = a: a single a-transition from the start state to the accepting state
  – R1 | R2: a new start state with ε-transitions into the fragments for R1 and R2, and ε-transitions from their accepting states into a new accepting state
  – R1 R2: the fragment for R1, then an ε-transition from its accepting state into the fragment for R2
  – R*: a new start state and accepting state with ε-transitions that allow entering the fragment for R, looping back after it, or bypassing it entirely
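These inductive cases can be sketched compactly in Python (illustrative, not from the slides; state names are fresh integers, and None stands for ε):

```python
EPS = None  # label used for epsilon transitions

class NFA:
    counter = 0
    def __init__(self):
        self.trans = []                 # list of (src, label, dst) triples
        self.start = self.accept = None
    @staticmethod
    def fresh():
        NFA.counter += 1
        return NFA.counter

def letter(a):
    n = NFA()
    n.start, n.accept = NFA.fresh(), NFA.fresh()
    n.trans = [(n.start, a, n.accept)]  # base case: single a-transition
    return n

def epsilon():
    return letter(EPS)                  # base case: single eps-transition

def concat(n1, n2):
    n = NFA()
    n.start, n.accept = n1.start, n2.accept
    n.trans = n1.trans + n2.trans + [(n1.accept, EPS, n2.start)]
    return n

def union(n1, n2):
    n = NFA()
    n.start, n.accept = NFA.fresh(), NFA.fresh()
    n.trans = (n1.trans + n2.trans +
               [(n.start, EPS, n1.start), (n.start, EPS, n2.start),
                (n1.accept, EPS, n.accept), (n2.accept, EPS, n.accept)])
    return n

def star(n1):
    n = NFA()
    n.start, n.accept = NFA.fresh(), NFA.fresh()
    n.trans = (n1.trans +
               [(n.start, EPS, n1.start),    # enter the fragment
                (n1.accept, EPS, n1.start),  # loop back for another round
                (n1.accept, EPS, n.accept),  # leave after >= 1 iteration
                (n.start, EPS, n.accept)])   # or skip it entirely (0 times)
    return n

def accepts(n, word):
    """Simulate the NFA on word via epsilon-closed state sets."""
    def closure(states):
        states = set(states)
        changed = True
        while changed:
            changed = False
            for (s, lab, d) in n.trans:
                if lab is EPS and s in states and d not in states:
                    states.add(d)
                    changed = True
        return states
    cur = closure({n.start})
    for ch in word:
        cur = closure({d for (s, lab, d) in n.trans if s in cur and lab == ch})
    return n.accept in cur
```

For example, union(concat(letter("a"), star(letter("b"))), letter("c")) builds a recognizer for a b* | c.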

Running time of NFA+ε

• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+ε with k states on a string of length n takes O(n·k^2) time
  – Each configuration holds O(k) states, each of which may have O(k) outgoing edges, so processing one input letter may take O(k^2) time
  – Can we reduce the k^2 factor?

From NFA+ε to DFA

• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+ε with k states on a string of length n takes O(n·k^2) time
  – Can we reduce the k^2 factor?
• Theorem: for any NFA+ε automaton there exists an equivalent deterministic automaton
• Proof: determinization via the subset construction
  – Number of states in the worst case: O(2^k)
  – Running time: O(n)
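The subset construction can be sketched in a few lines of Python (illustrative, not from the slides): DFA states are ε-closed frozensets of NFA states, built on demand from the start state.

```python
def determinize(transitions, start, nfa_accepting, alphabet, eps=None):
    """transitions: list of (src, label, dst) triples; label eps means an
    epsilon edge. Returns (dfa, dfa_start, dfa_accepting), where dfa maps
    each reachable state set to a dict from letter to successor state set."""
    def closure(states):
        states = set(states)
        frontier = list(states)
        while frontier:
            s = frontier.pop()
            for (a, lab, b) in transitions:
                if a == s and lab is eps and b not in states:
                    states.add(b)
                    frontier.append(b)
        return frozenset(states)

    dfa_start = closure({start})
    dfa, worklist = {}, [dfa_start]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = closure({b for (s, lab, b) in transitions if s in S and lab == a})
            dfa[S][a] = T
            worklist.append(T)
    dfa_accepting = {S for S in dfa if S & nfa_accepting}
    return dfa, dfa_start, dfa_accepting

def run(dfa, dfa_start, dfa_accepting, word):
    cur = dfa_start
    for ch in word:
        cur = dfa[cur][ch]
    return cur in dfa_accepting
```

A DFA state is accepting iff its set contains some accepting NFA state, which is exactly why the worst case is O(2^k) states while running the result costs O(n).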

Recap

• We know how to define any single type of lexeme

• We know how to convert any regular expression into a recognizing automaton

• But how do we use this for scanning?

The formal scanning problem

What is a scanner

Input text:

  var currOption = 0; // Choose content
  function choose ( id ) { ...

Lexical specification: a list of regular expressions R1 … Rk (one per lexeme)

The scanner, built from the specification, converts the character stream into a stream of tokens (LINE: TOKEN(value)):

  1: VAR
  1: ID(currOption)
  1: EQ
  1: INT_LITERAL(0)
  1: SEMI
  ...

Scanning problem

• Input:

– Lexical specification: R1, …, Rk (regular expressions, one per lexeme)
  – input: a string of n characters
• Output: a sequence of tokens T1(lex1) … Tm(lexm) such that
  – the lexemes partition the input: lex1 … lexm = input
  – each token type Ti matches its lexeme lexi according to the specification

Example 1: partitioning

• ID = (a|b|…|z) (a|b|…|z)*
  ONE = 1
• Input: abb1
• What should the output be?
  1. ID(a) ID(b) ID(b) ONE   <- first match semantics
  2. ID(a) ID(bb) ONE
  3. ID(ab) ID(b) ONE
  4. ID(abb) ONE             <- maximal munch semantics

Maximal munch semantics

• ID = (a|b|…|z) (a|b|…|z)*
  ONE = 1
• Input: abb1
• How do we return ID(abb) ONE?
• Solution: find the longest matching lexeme
  – The automaton may enter and leave an accepting state many times before the longest match is found
• Intuition: some tokens, such as identifiers, are prefix-closed

Example 2: handling ambiguities

• ID = (a|b|…|z) (a|b|…|z)*
  IF = if
• Input: if
• Matches both tokens
• What should the scanner output be?

DFA (reconstructed from the flattened diagram): from start state q0, the letter i leads to a state accepting ID; from it, f leads to a state accepting both ID and IF; any other letter (a-z\i from q0, a-z\f after i), and any further letter, leads to ID-accepting states.

Solution: precedence semantics

• ID = (a|b|…|z) (a|b|…|z)*
  IF = if
• Input: if
• Matches both tokens
• What should the scanner output be?
• Break the tie using the order of definitions
  – Output: ID(if), since ID is listed before IF

(same DFA as above)

Solution: precedence semantics

• IF = if
  ID = (a|b|…|z) (a|b|…|z)*
• Input: if
• Matches both tokens
• What should the scanner output be?
• Break the tie using the order of definitions
  – Output: IF
• Conclusion: list keyword token definitions before the identifier definition

Putting together an algorithm

Overall algorithm structure

List of regular expressions (one per lexeme): R1, …, Rk
  -> NFA+ε for R1 | … | Rk   (high-level intermediate representation)
  -> DFA for R1 | … | Rk     (medium-level intermediate representation; minimization applies here)
  -> Scanner implementation (efficient data structures)

Crucial step: assigning semantics. How do we implement maximal munch?

A First match algorithm

First match algorithm

• Suggestions?

• What is the complexity?

A Maximal munch algorithm

Maximal munch scanning algorithm

• Input:
  – input: a string of n characters
  – M: a DFA for the union of the token types
• Output: the positions in input that are the final characters of each token
• Data:
  – A stack of ⟨state, index⟩ pairs: the states and positions encountered since the last accepting state
  – i: index of the next character in input
  – q: current state, or Bottom (no state)

Maximal munch pseudo-code

(The pseudo-code figure is missing from this transcript; only its annotations survive: "Reset DFA to look for next token", and Bottom is "used to indicate an error situation (no token is found)".)
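The missing pseudo-code can be sketched in Python (an illustrative reconstruction, not the slide's exact code; internally i is 0-based, and the returned positions are 1-based, matching the run example that follows):

```python
BOTTOM = None  # the "B" marker: stack bottom / error situation (no token)

def maximal_munch(dfa, start, accepting, text):
    """Return the 1-based positions of the final character of each token."""
    out, i = [], 0                       # i = index of the next character
    while i < len(text):
        q, stack = start, [(BOTTOM, i)]  # reset DFA to look for next token
        # Run the DFA as far as possible; clear the stack at accepting states.
        while i < len(text) and (q, text[i]) in dfa:
            if q in accepting:
                stack = []
            stack.append((q, i))
            q = dfa[(q, text[i])]
            i += 1
        # Back up to the most recently visited accepting state.
        while q not in accepting and q is not BOTTOM:
            q, i = stack.pop()
        if q is BOTTOM:
            raise ValueError("no token found at position %d" % i)
        out.append(i)                    # i chars consumed = 1-based end pos
    return out

# The DFA of the run example: R1 = a, R2 = a*b.
DFA = {("q0", "a"): "q1", ("q0", "b"): "q2",
       ("q1", "a"): "q3", ("q1", "b"): "q2",
       ("q3", "a"): "q3", ("q3", "b"): "q2"}
ACCEPTING = {"q1", "q2"}
```

On input aaa this yields the token end positions [1, 2, 3], as in the trace below.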

Maximal munch run example

• Assume R1 = a, R2 = a*b
• input = aaa

DFA (reconstructed from the flattened diagram): start state q0; q0 -a-> q1, q1 -a-> q3, q3 -a-> q3 (self-loop); q0, q1, and q3 each have a b-transition to q2; q1 and q2 are accepting.

The original slides animate the run one step per slide; the full trace of ⟨q, i, stack⟩ is:

First token:
  q0  1  B,1
  q1  2  B,1 q0,1
  q3  3  q1,2          (q1 was accepting: stack cleared, then q1,2 pushed)
  q3  4  q1,2 q3,3
  q3  3  q1,2          (end of input, q3 not accepting: pop)
  q1  2                (pop again; q1 is accepting)
  Output = 1

Second token:
  q0  2  B,2
  q1  3  B,2 q0,2
  q3  4  q1,3
  q1  3                (pop; q1 is accepting)
  Output = 1 2

Third token:
  q0  3  B,3
  q1  4  B,3 q0,3      (end of input reached in accepting state q1)
  Output = 1 2 3

Complexity of maximal munch

• What is the complexity of tokenizing a text of n characters by matching longest tokens?

Complexity of maximal munch

• What is the complexity of tokenizing a text of n characters by matching longest tokens?
• Assume the token classes R1 = a and R2 = a*b
• For input = a^n it is O(n^2): each attempt runs to the end of the remaining a's hoping for a b, fails, and backs up to emit a single a
• Can we improve the worst-case complexity?

Improved scanning algorithm

• Idea: use the work done on the "leftover" stack to improve future decisions
• Remember, for each index, which states have failed, i.e., cannot be extended to a token

• "Maximal-Munch" Tokenization in Linear Time, Tom Reps [TOPLAS 1998]
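The tabulation idea can be grafted onto the earlier sketch with one extra set (again an illustrative reconstruction, not Reps' exact pseudo-code): record each ⟨state, index⟩ configuration that failed to reach a further accepting state, and cut off any later attempt that reaches a recorded configuration.

```python
BOTTOM = None

def maximal_munch_linear(dfa, start, accepting, text):
    failed = set()          # (state, index) pairs that cannot extend to a token
    out, i = [], 0
    while i < len(text):
        q, stack = start, [(BOTTOM, i)]
        while (i < len(text) and (q, text[i]) in dfa
               and (q, i) not in failed):
            if q in accepting:
                stack = []
            stack.append((q, i))
            q = dfa[(q, text[i])]
            i += 1
        while q not in accepting and q is not BOTTOM:
            failed.add((q, i))  # remember: no token is reachable from here
            q, i = stack.pop()
        if q is BOTTOM:
            raise ValueError("no token found at position %d" % i)
        out.append(i)
    return out

# Same DFA as before: R1 = a, R2 = a*b. On input a^n, each failing
# (q3, j) configuration is now explored at most once overall.
DFA = {("q0", "a"): "q1", ("q0", "b"): "q2",
       ("q1", "a"): "q3", ("q1", "b"): "q2",
       ("q3", "a"): "q3", ("q3", "b"): "q2"}
ACCEPTING = {"q1", "q2"}
```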

Improved algorithm pseudo-code

(The pseudo-code figure is missing from this transcript; its annotations ask: What is the running time? How many times can the failed-state test fail for a given index?)

Agenda

• Understand role of lexical analysis in a compiler
  – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
  – Precedence + First match
  – Precedence + Maximal munch
• Scanner generation

Implementing a scanner

Implementing modern scanners

• Manual construction of automata + determinization + maximal munch + tie breaking is:
  – Very tedious
  – Error-prone
  – Non-incremental
• Fortunately, there are tools that automatically generate robust scanner code from a specification, for most languages
  – C: Lex, Flex
  – Java: JLex, JFlex

Using JFlex

• Define tokens (and states)
• Run JFlex to generate a Java implementation
• Usually MyScanner.nextToken() will be called in a loop by the parser

Pipeline: the lexical specification MyScanner.lex is fed to JFlex, which generates MyScanner.java; the generated scanner turns the stream of characters into tokens.

Filtering illegal combinations

• Which tokens should the scanner return for "123foo"?

Filtering illegal combinations

• Which tokens should the scanner return for "123foo"?
  – We sometimes want to rule out certain token concatenations prior to parsing
  – How can we do that with what we've seen so far?

Filtering illegal combinations

• Which tokens should the scanner return for "123foo"?
  – We sometimes want to rule out certain token concatenations prior to parsing
  – How can we do that with what we've seen so far?
• Define "error" lexemes
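One way to realize this (an illustrative Python sketch; the rule names and patterns are made up, not from the slides) is to list an ERROR pattern for a digit-run glued to letters ahead of the ordinary rules, and combine maximal munch with rule-order precedence:

```python
import re

TOKEN_SPEC = [
    ("ERROR", re.compile(r"[0-9]+[A-Za-z_][A-Za-z0-9_]*")),  # e.g. 123foo
    ("NUM",   re.compile(r"[0-9]+")),
    ("ID",    re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("WS",    re.compile(r"[ \t\n]+")),
]

def scan(text):
    tokens, i = [], 0
    while i < len(text):
        # Try every rule anchored at position i.
        matches = [(name, m.end()) for name, pat in TOKEN_SPEC
                   for m in [pat.match(text, i)] if m]
        if not matches:
            raise ValueError("lexical error at position %d" % i)
        # Maximal munch: the longest match wins; among equal lengths, max()
        # keeps the first (earliest-listed) rule, giving precedence.
        name, end = max(matches, key=lambda t: t[1])
        if name != "WS":              # whitespace is consumed, not emitted
            tokens.append((name, text[i:end]))
        i = end
    return tokens
```

On "123foo" the ERROR rule out-munches NUM, so the illegal concatenation is caught in the scanner rather than left for the parser.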

Catching errors

• What if the input doesn't match any token definition?
  – We want to gracefully signal an error
• Trick: add a "catch-all" rule that matches any character and reports an error
  – Add it after all other rules

Next lecture: parsing