Principles (Fall 2016-2017), Lecture 1

Roman Manevich, Ben-Gurion University of the Negev

Agenda
• Understand role of lexical analysis in a compiler
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation

JavaScript example

• Can you identify some basic units in this code?

var currOption = 0; // Choose content to display in lower pane.

function choose(id) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];
  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}

JavaScript example


• Can you identify some basic units in this code?

The basic units, labeled in the code above: keyword (var, function, for, if, else), identifier (currOption, choose, id, menu, elt), operator (=, <, ++, ==, &&), numeric literal (0), string literal ("about-me", "none", ...), punctuation ( ( ) { } [ ] ; , ), comment (// Choose content to display in lower pane.), and whitespace.

Role of lexical analysis

• First part of compiler front-end

High-level Language (e.g., Scheme) -> Lexical Analysis -> Syntax Analysis (Parsing) -> AST + Symbol Table etc. -> Intermediate Representation (IR) -> Code Generation -> Executable Code

• Convert the stream of characters into a stream of tokens
  – Split the text into its most basic meaningful strings
• Simplify the input for syntax analysis

From scanning to parsing

program text: 59 + (1257 * xPosition)
  | Lexical Analyzer (reports a lexical error if the input is not valid)
  v
token stream: num + ( num * id )
  | Parser (reports a syntax error if the token stream is not valid)
  v
Grammar:
  E -> id
  E -> num
  E -> E + E
  E -> E * E
  E -> ( E )

Abstract Syntax Tree:
      +
     / \
  num   *
       / \
    num   x

Scanner output

Where is the white space?

The scanner turns the JavaScript example into a stream of tokens, written LINE: TOKEN(value):

1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...

Tokens

What is a token?

• Lexeme – a substring of the original text constituting an identifiable unit
  – Identifiers, values, reserved words, …
• A token is a record type storing:
  – Kind
  – Value (when applicable)
  – Start-position/end-position
  – Any information that is useful for the parser
• Different for different languages

Example tokens

Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM    -3.141592654
STRING      "so long, and thanks for all the fish"
LPAREN      (
RPAREN      )
IF          if
…

C++ example 1

• Splitting text into tokens can be tricky
• How should the code below be split?

vector<vector<int>> myVector

Is >> one operator token (right shift), or two > tokens?

C++ example 2

• Splitting text into tokens can be tricky
• How should the code below be split?

vector<vector<int> > myVector

Here the space forces >, >: two tokens

Separating tokens

Type          Examples
Comments      /* ignore code */   // ignore until end of line
White spaces  \t \n (space)

• Lexemes that are recognized but get consumed rather than transmitted to the parser
  – e.g., if is a single keyword token, while i f and i/*comment*/f each scan as the two identifiers i and f

Preprocessor directives in C

Type                Examples
Include directives  #include
Macros              #define THE_ANSWER 42

First step of designing a scanner

• Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 "hello"
  – Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?

First step of designing a scanner

• Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 "hello"
  – Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?
  – Regular expressions

Agenda

• Understand role of lexical analysis in a compiler
  – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation

Regular languages reminder

Basic definitions and facts

• Formal languages
  – Alphabet = a finite set of letters
  – Word = a finite sequence of letters
  – Language = a set of words
• Regular languages can be defined equivalently by
  – Regular expressions
  – Finite-state automata

Regular expressions

• Empty string: ε
• Letter: a1, …, ak ∈ Alphabet
• Concatenation: R1 R2
• Union: R1 | R2
• Kleene star: R*
  – Shorthand: R+ stands for R R*
• Scoping: (R)
• Example: (0* 1*) | (1* 0*)
  – What is this language?
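The example can be explored mechanically. Here is a small sketch (not from the slides) using Python's re module, where fullmatch plays the role of language membership:

```python
import re

# The slide's example language: (0* 1*) | (1* 0*).
# A word belongs to the language iff the whole word matches, hence fullmatch.
PATTERN = re.compile(r"0*1*|1*0*")

def in_language(word):
    return PATTERN.fullmatch(word) is not None
```

Trying a few words shows the language is the set of binary strings that switch letter at most once: all zeros then ones, or all ones then zeros (0011 and 1100 are in; 0101 is out).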

Exercise 1 - Question

• Language of Java identifiers
  – Identifiers start with either an underscore '_' or a letter
  – Continue with either an underscore, a letter, or a digit

Exercise 1 - Answer

• Language of Java identifiers
  – Identifiers start with either an underscore '_' or a letter
  – Continue with either an underscore, a letter, or a digit
  – (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Exercise 1 – Better answer

• Language of Java identifiers
  – Identifiers start with either an underscore '_' or a letter
  – Continue with either an underscore, a letter, or a digit
  – (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
  – Using shorthand macros:
    First = _ | a | b | … | z | A | … | Z
    Next = First | 0 | … | 9
    R = First Next*
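As a sanity check, the macro version translates directly into a Python regular expression (a sketch; real Java identifiers also allow '$' and Unicode letters, which this simplified definition omits):

```python
import re

# First = _ | a..z | A..Z ;  Next = First | 0..9 ;  R = First Next*
FIRST = r"[_a-zA-Z]"
NEXT  = r"[_a-zA-Z0-9]"
IDENT = re.compile(FIRST + NEXT + "*")

def is_identifier(s):
    return IDENT.fullmatch(s) is not None
```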

Exercise 2 - Question

• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Positive examples: 0, 123.757, .933333, 0.7
  – Negative examples: 007, 0.30

Exercise 2 - Answer

• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Digit = 1 | 2 | … | 9
    Digit0 = 0 | Digit
    Num = Digit Digit0*
    Frac = Digit0* Digit
    Pos = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (ε | -) Pos
    R = 0 | PosOrNeg
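The macros can be transcribed one-for-one into a Python regular expression and run against the examples from the question slide (an illustrative sketch, not part of the original slides):

```python
import re

DIGIT  = r"[1-9]"
DIGIT0 = r"[0-9]"
NUM    = DIGIT + DIGIT0 + "*"       # Num  = Digit Digit0*  (no leading zeros)
FRAC   = DIGIT0 + "*" + DIGIT       # Frac = Digit0* Digit  (no trailing zeros)
POS    = f"(?:{NUM}|\\.{FRAC}|0\\.{FRAC}|{NUM}\\.{FRAC})"
R      = re.compile(f"0|-?{POS}")   # R = 0 | (eps | -) Pos

def is_rational(s):
    return R.fullmatch(s) is not None
```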

Exercise 3 - Question

• Equal number of opening and closing parentheses: [n]n = [], [[]], [[[]]], …

Exercise 3 - Answer

• Equal number of opening and closing parentheses: [n]n = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]

Finite automata

Finite automata: known results

• Types of finite automata:
  – Deterministic (DFA)
  – Non-deterministic (NFA)
  – Non-deterministic with epsilon transitions (NFA+ε)
• Theorem: regular expressions can be translated to NFA+ε in linear time
• Theorem: NFA+ε can be translated to DFA
  – Worst-case exponential time
• Theorem [Myhill-Nerode]: the DFA can be minimized

Finite automata

• An automaton M = ⟨Q, Σ, δ, q0, F⟩ is defined by a set of states Q, an alphabet Σ, transitions δ, a start state q0, and accepting states F

Diagram (reconstructed from the flattened slide): an arrow marks the start state; an a-transition leads to a state with a b self-loop, and from there a c-transition leads to an accepting state:

  start -> (q0) -a-> (q1, self-loop on b) -c-> ((q2)) accepting

Exercise - Question

• What is the language defined by the automaton below?

  start -> (q0) -a-> (q1, self-loop on b) -c-> ((q2))

Exercise - Answer

• What is the language defined by the automaton below?
  – a b* c
  – Generally: all words labeling paths that lead to an accepting state

  start -> (q0) -a-> (q1, self-loop on b) -c-> ((q2))

Non-deterministic automata

• Allow multiple transitions from a given state labeled by the same letter

(flattened diagram: an automaton in which one state has two outgoing a-transitions to different states, whose paths continue with b- and c-transitions)

NFA+ε automata

• ε-transitions can "fire" without reading the input

(flattened diagram: an automaton containing an ε-transition between two states, alongside a-, b-, and c-transitions)

A little about me

• Joined Ben-Gurion University in 2012
• Research interests
  – Inductive programming and synthesis
  – Static analysis and verification
  – Language-supported parallelism

I am here for

• Teaching you the theory and practice of popular compiler algorithms
  – Hopefully making you think about solving problems, with examples from the real world
  – Answering questions about the material
• Contacting me
  – e-mail: [email protected]
  – Office hours: see the course web-page
• Announcements
• Forums (per assignment)

Tentative syllabus

Front End: Scanning; Top-down Parsing (LL); Bottom-up Parsing (LR)
Intermediate Representation: Operational Semantics; Lowering
Code Generation: Register Allocation; Instruction Selection
Optimizations: Dataflow Analysis; Loop Optimizations; Energy Optimization

(mid-term exam)

Reg-exp vs. automata

• Regular expressions are a declarative, high-level language
  – Offer a compact way for humans to define a regular language
  – Don't offer a direct way to check whether a given word is in the language
• Automata are an operative, machine-level language
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans

From regular expressions to automata

From reg. exp. to NFA+ε automata

• Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression
• Proof: by induction on the structure of the regular expression

Inductive constructions

The construction diagrams are flattened in this transcript; each case yields an NFA+ε fragment with one start state and one accepting state:
  – R = ε: a single ε-transition from the start state to the accepting state
  – R = a: a single a-transition from the start state to the accepting state
  – R1 | R2: a new start state with ε-transitions into the fragments for R1 and R2, and ε-transitions from their accepting states into a new accepting state
  – R1 R2: the fragment for R1, then an ε-transition from its accepting state into the fragment for R2
  – R*: a new start state and accepting state with ε-transitions that allow entering the fragment for R, looping back after it, or bypassing it entirely
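These inductive cases can be sketched compactly in Python (illustrative, not from the slides; state names are fresh integers, and None stands for ε):

```python
EPS = None  # label used for epsilon transitions

class NFA:
    counter = 0
    def __init__(self):
        self.trans = []                 # list of (src, label, dst) triples
        self.start = self.accept = None
    @staticmethod
    def fresh():
        NFA.counter += 1
        return NFA.counter

def letter(a):
    n = NFA()
    n.start, n.accept = NFA.fresh(), NFA.fresh()
    n.trans = [(n.start, a, n.accept)]  # base case: single a-transition
    return n

def epsilon():
    return letter(EPS)                  # base case: single eps-transition

def concat(n1, n2):
    n = NFA()
    n.start, n.accept = n1.start, n2.accept
    n.trans = n1.trans + n2.trans + [(n1.accept, EPS, n2.start)]
    return n

def union(n1, n2):
    n = NFA()
    n.start, n.accept = NFA.fresh(), NFA.fresh()
    n.trans = (n1.trans + n2.trans +
               [(n.start, EPS, n1.start), (n.start, EPS, n2.start),
                (n1.accept, EPS, n.accept), (n2.accept, EPS, n.accept)])
    return n

def star(n1):
    n = NFA()
    n.start, n.accept = NFA.fresh(), NFA.fresh()
    n.trans = (n1.trans +
               [(n.start, EPS, n1.start),    # enter the fragment
                (n1.accept, EPS, n1.start),  # loop back for another round
                (n1.accept, EPS, n.accept),  # leave after >= 1 iteration
                (n.start, EPS, n.accept)])   # or skip it entirely (0 times)
    return n

def accepts(n, word):
    """Simulate the NFA on word via epsilon-closed state sets."""
    def closure(states):
        states = set(states)
        changed = True
        while changed:
            changed = False
            for (s, lab, d) in n.trans:
                if lab is EPS and s in states and d not in states:
                    states.add(d)
                    changed = True
        return states
    cur = closure({n.start})
    for ch in word:
        cur = closure({d for (s, lab, d) in n.trans if s in cur and lab == ch})
    return n.accept in cur
```

For example, union(concat(letter("a"), star(letter("b"))), letter("c")) builds a recognizer for a b* | c.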

Running time of NFA+ε

• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+ε with k states on a string of length n takes O(n·k^2) time
  – Each configuration holds O(k) states, each of which may have O(k) outgoing edges, so processing one input letter may take O(k^2) time
  – Can we reduce the k^2 factor?

From NFA+ε to DFA

• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+ε with k states on a string of length n takes O(n·k^2) time
  – Can we reduce the k^2 factor?
• Theorem: for any NFA+ε automaton there exists an equivalent deterministic automaton
• Proof: determinization via the subset construction
  – Number of states in the worst case: O(2^k)
  – Running time: O(n)
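The subset construction can be sketched in a few lines of Python (illustrative, not from the slides): DFA states are ε-closed frozensets of NFA states, built on demand from the start state.

```python
def determinize(transitions, start, nfa_accepting, alphabet, eps=None):
    """transitions: list of (src, label, dst) triples; label eps means an
    epsilon edge. Returns (dfa, dfa_start, dfa_accepting), where dfa maps
    each reachable state set to a dict from letter to successor state set."""
    def closure(states):
        states = set(states)
        frontier = list(states)
        while frontier:
            s = frontier.pop()
            for (a, lab, b) in transitions:
                if a == s and lab is eps and b not in states:
                    states.add(b)
                    frontier.append(b)
        return frozenset(states)

    dfa_start = closure({start})
    dfa, worklist = {}, [dfa_start]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for a in alphabet:
            T = closure({b for (s, lab, b) in transitions if s in S and lab == a})
            dfa[S][a] = T
            worklist.append(T)
    dfa_accepting = {S for S in dfa if S & nfa_accepting}
    return dfa, dfa_start, dfa_accepting

def run(dfa, dfa_start, dfa_accepting, word):
    cur = dfa_start
    for ch in word:
        cur = dfa[cur][ch]
    return cur in dfa_accepting
```

A DFA state is accepting iff its set contains some accepting NFA state, which is exactly why the worst case is O(2^k) states while running the result costs O(n).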

Recap

• We know how to define any single type of lexeme

• We know how to convert any regular expression into a recognizing automaton

• But how do we use this for scanning?

The formal scanning problem

What is a scanner

Input text:

  var currOption = 0; // Choose content
  function choose ( id ) { ...

Lexical specification: a list of regular expressions R1 … Rk (one per lexeme)

The scanner, built from the specification, converts the character stream into a stream of tokens (LINE: TOKEN(value)):

  1: VAR
  1: ID(currOption)
  1: EQ
  1: INT_LITERAL(0)
  1: SEMI
  ...

Scanning problem

• Input:

– Lexical specification: R1, …, Rk (regular expressions, one per lexeme)
  – input: a string of n characters
• Output: a sequence of tokens T1(lex1) … Tm(lexm) such that
  – the lexemes partition the input: lex1 … lexm = input
  – each token type Ti matches its lexeme lexi according to the specification

Example 1: partitioning

• ID = (a|b|…|z) (a|b|…|z)*
  ONE = 1
• Input: abb1
• What should the output be?
  1. ID(a) ID(b) ID(b) ONE   <- first match semantics
  2. ID(a) ID(bb) ONE
  3. ID(ab) ID(b) ONE
  4. ID(abb) ONE             <- maximal munch semantics

Maximal munch semantics

• ID = (a|b|…|z) (a|b|…|z)*
  ONE = 1
• Input: abb1
• How do we return ID(abb) ONE?
• Solution: find the longest matching lexeme
  – The automaton may enter and leave an accepting state many times before the longest match is found
• Intuition: some tokens, such as identifiers, are prefix-closed

Example 2: handling ambiguities

• ID = (a|b|…|z) (a|b|…|z)*
  IF = if
• Input: if
• Matches both tokens
• What should the scanner output be?

DFA (reconstructed from the flattened diagram): from start state q0, the letter i leads to a state accepting ID; from it, f leads to a state accepting both ID and IF; any other letter (a-z\i from q0, a-z\f after i), and any further letter, leads to ID-accepting states.

Solution: precedence semantics

• ID = (a|b|…|z) (a|b|…|z)*
  IF = if
• Input: if
• Matches both tokens
• What should the scanner output be?
• Break the tie using the order of definitions
  – Output: ID(if), since ID is listed before IF

(same DFA as above)

Solution: precedence semantics

• IF = if
  ID = (a|b|…|z) (a|b|…|z)*
• Input: if
• Matches both tokens
• What should the scanner output be?
• Break the tie using the order of definitions
  – Output: IF
• Conclusion: list keyword token definitions before the identifier definition

Putting together an algorithm

Overall algorithm structure

List of regular expressions (one per lexeme): R1, …, Rk
  -> NFA+ε for R1 | … | Rk   (high-level intermediate representation)
  -> DFA for R1 | … | Rk     (medium-level intermediate representation; minimization applies here)
  -> Scanner implementation (efficient data structures)

Crucial step: assigning semantics. How do we implement maximal munch?

A First match algorithm

First match algorithm

• Suggestions?

• What is the complexity?

A Maximal munch algorithm

Maximal munch scanning algorithm

• Input:
  – input: a string of n characters
  – M: a DFA for the union of the token types
• Output: the positions in input that are the final characters of each token
• Data:
  – A stack of ⟨state, index⟩ pairs: the states and positions encountered since the last accepting state
  – i: index of the next character in input
  – q: current state, or Bottom (no state)

Maximal munch pseudo-code

(The pseudo-code figure is missing from this transcript; only its annotations survive: "Reset DFA to look for next token", and Bottom is "used to indicate an error situation (no token is found)".)
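The missing pseudo-code can be sketched in Python (an illustrative reconstruction, not the slide's exact code; internally i is 0-based, and the returned positions are 1-based, matching the run example that follows):

```python
BOTTOM = None  # the "B" marker: stack bottom / error situation (no token)

def maximal_munch(dfa, start, accepting, text):
    """Return the 1-based positions of the final character of each token."""
    out, i = [], 0                       # i = index of the next character
    while i < len(text):
        q, stack = start, [(BOTTOM, i)]  # reset DFA to look for next token
        # Run the DFA as far as possible; clear the stack at accepting states.
        while i < len(text) and (q, text[i]) in dfa:
            if q in accepting:
                stack = []
            stack.append((q, i))
            q = dfa[(q, text[i])]
            i += 1
        # Back up to the most recently visited accepting state.
        while q not in accepting and q is not BOTTOM:
            q, i = stack.pop()
        if q is BOTTOM:
            raise ValueError("no token found at position %d" % i)
        out.append(i)                    # i chars consumed = 1-based end pos
    return out

# The DFA of the run example: R1 = a, R2 = a*b.
DFA = {("q0", "a"): "q1", ("q0", "b"): "q2",
       ("q1", "a"): "q3", ("q1", "b"): "q2",
       ("q3", "a"): "q3", ("q3", "b"): "q2"}
ACCEPTING = {"q1", "q2"}
```

On input aaa this yields the token end positions [1, 2, 3], as in the trace below.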

Maximal munch run example

• Assume R1 = a, R2 = a*b
• input = aaa

DFA (reconstructed from the flattened diagram): start state q0; q0 -a-> q1, q1 -a-> q3, q3 -a-> q3 (self-loop); q0, q1, and q3 each have a b-transition to q2; q1 and q2 are accepting.

The original slides animate the run one step per slide; the full trace of ⟨q, i, stack⟩ is:

First token:
  q0  1  B,1
  q1  2  B,1 q0,1
  q3  3  q1,2          (q1 was accepting: stack cleared, then q1,2 pushed)
  q3  4  q1,2 q3,3
  q3  3  q1,2          (end of input, q3 not accepting: pop)
  q1  2                (pop again; q1 is accepting)
  Output = 1

Second token:
  q0  2  B,2
  q1  3  B,2 q0,2
  q3  4  q1,3
  q1  3                (pop; q1 is accepting)
  Output = 1 2

Third token:
  q0  3  B,3
  q1  4  B,3 q0,3      (end of input reached in accepting state q1)
  Output = 1 2 3

Complexity of maximal munch

• What is the complexity of tokenizing a text of n characters by matching longest tokens?

Complexity of maximal munch

• What is the complexity of tokenizing a text of n characters by matching longest tokens?
• Assume the token classes R1 = a and R2 = a*b
• For input = a^n it is O(n^2): each attempt runs to the end of the remaining a's hoping for a b, fails, and backs up to emit a single a
• Can we improve the worst-case complexity?

Improved scanning algorithm

• Idea: use the work done on the "leftover" stack to improve future decisions
• Remember, for each index, which states have failed, i.e., cannot be extended to a token

• "Maximal-Munch" Tokenization in Linear Time, Tom Reps [TOPLAS 1998]
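The tabulation idea can be grafted onto the earlier sketch with one extra set (again an illustrative reconstruction, not Reps' exact pseudo-code): record each ⟨state, index⟩ configuration that failed to reach a further accepting state, and cut off any later attempt that reaches a recorded configuration.

```python
BOTTOM = None

def maximal_munch_linear(dfa, start, accepting, text):
    failed = set()          # (state, index) pairs that cannot extend to a token
    out, i = [], 0
    while i < len(text):
        q, stack = start, [(BOTTOM, i)]
        while (i < len(text) and (q, text[i]) in dfa
               and (q, i) not in failed):
            if q in accepting:
                stack = []
            stack.append((q, i))
            q = dfa[(q, text[i])]
            i += 1
        while q not in accepting and q is not BOTTOM:
            failed.add((q, i))  # remember: no token is reachable from here
            q, i = stack.pop()
        if q is BOTTOM:
            raise ValueError("no token found at position %d" % i)
        out.append(i)
    return out

# Same DFA as before: R1 = a, R2 = a*b. On input a^n, each failing
# (q3, j) configuration is now explored at most once overall.
DFA = {("q0", "a"): "q1", ("q0", "b"): "q2",
       ("q1", "a"): "q3", ("q1", "b"): "q2",
       ("q3", "a"): "q3", ("q3", "b"): "q2"}
ACCEPTING = {"q1", "q2"}
```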

Improved algorithm pseudo-code

(The pseudo-code figure is missing from this transcript; its annotations ask: What is the running time? How many times can the failed-state test fail for a given index?)

Agenda

• Understand role of lexical analysis in a compiler
  – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
  – Precedence + First match
  – Precedence + Maximal munch
• Scanner generation

Implementing a scanner

Implementing modern scanners

• Manual construction of automata + determinization + maximal munch + tie breaking is:
  – Very tedious
  – Error-prone
  – Non-incremental
• Fortunately, there are tools that automatically generate robust scanner code from a specification, for most languages
  – C: Lex, Flex
  – Java: JLex, JFlex

Using JFlex

• Define tokens (and states)
• Run JFlex to generate a Java implementation
• Usually MyScanner.nextToken() will be called in a loop by the parser

Pipeline: the lexical specification MyScanner.lex is fed to JFlex, which generates MyScanner.java; the generated scanner turns the stream of characters into tokens.

Filtering illegal combinations

• Which tokens should the scanner return for "123foo"?

Filtering illegal combinations

• Which tokens should the scanner return for "123foo"?
  – We sometimes want to rule out certain token concatenations prior to parsing
  – How can we do that with what we've seen so far?

Filtering illegal combinations

• Which tokens should the scanner return for "123foo"?
  – We sometimes want to rule out certain token concatenations prior to parsing
  – How can we do that with what we've seen so far?
• Define "error" lexemes
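One way to realize this (an illustrative Python sketch; the rule names and patterns are made up, not from the slides) is to list an ERROR pattern for a digit-run glued to letters ahead of the ordinary rules, and combine maximal munch with rule-order precedence:

```python
import re

TOKEN_SPEC = [
    ("ERROR", re.compile(r"[0-9]+[A-Za-z_][A-Za-z0-9_]*")),  # e.g. 123foo
    ("NUM",   re.compile(r"[0-9]+")),
    ("ID",    re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("WS",    re.compile(r"[ \t\n]+")),
]

def scan(text):
    tokens, i = [], 0
    while i < len(text):
        # Try every rule anchored at position i.
        matches = [(name, m.end()) for name, pat in TOKEN_SPEC
                   for m in [pat.match(text, i)] if m]
        if not matches:
            raise ValueError("lexical error at position %d" % i)
        # Maximal munch: the longest match wins; among equal lengths, max()
        # keeps the first (earliest-listed) rule, giving precedence.
        name, end = max(matches, key=lambda t: t[1])
        if name != "WS":              # whitespace is consumed, not emitted
            tokens.append((name, text[i:end]))
        i = end
    return tokens
```

On "123foo" the ERROR rule out-munches NUM, so the illegal concatenation is caught in the scanner rather than left for the parser.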

Catching errors

• What if the input doesn't match any token definition?
  – We want to gracefully signal an error
• Trick: add a "catch-all" rule that matches any character and reports an error
  – Add it after all other rules

Next lecture: parsing