“Maximal-Munch” Tokenization in Linear Time, Tom Reps [TOPLAS 1998]

Fall 2016-2017
Compiler Principles
Lecture 1: Lexical Analysis
Roman Manevich
Ben-Gurion University of the Negev

Agenda
• Understand role of lexical analysis in a compiler
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation

Javascript example
• Can you identify some basic units in this code?

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        }
        else {
          elt.style.display = "none";
        }
      }
    }

• Basic units labeled on the slide: keyword, identifier, operator, numeric literal, punctuation, comment, string literal, whitespace

Role of lexical analysis
• First part of compiler front-end
[Figure: compiler pipeline: High-level Language (scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST → Symbol Table etc. → Inter. Rep. (IR) → Code Generation → Executable Code]
• Convert stream of characters into stream of tokens
  – Split text into most basic meaningful strings
• Simplify input for syntax analysis

From scanning to parsing
• Program text: 59 + (1257 * xPosition)
• Lexical Analyzer: reports a lexical error, or outputs a valid token stream: num + ( num * id )
• Parser: reports a syntax error, or outputs a valid Abstract Syntax Tree
• Grammar:
    E → id
    E → num
    E → E + E
    E → E * E
    E → ( E )
[Figure: Abstract Syntax Tree with + at the root; its children are num and *; the children of * are num and x]

Scanner output
• Input: the Javascript example above
• Where is the white space?
• Stream of Tokens, in the form LINE: ID(value)
    1: VAR
    1: ID(currOption)
    1: EQ
    1: INT_LITERAL(0)
    1: SEMI
    3: FUNCTION
    3: ID(choose)
    3: LP
    3: ID(id)
    3: EP
    3: LCB
    ...

Tokens

What is a token?
• Lexeme – substring of original text constituting an identifiable unit
  – Identifiers, values, reserved words, …
• Record type storing:
  – Kind
  – Value (when applicable)
  – Start-position/end-position
  – Any information that is useful for the parser
• Different for different languages

Example tokens
    Type         Examples
    Identifier   x, y, z, foo, bar
    NUM          42
    FLOATNUM     -3.141592654
    STRING       “so long, and thanks for all the fish”
    LPAREN       (
    RPAREN       )
    IF           if
    …

C++ example 1
• Splitting text into tokens can be tricky
• How should the code below be split?

    vector<vector<int>> myVector

• Is >> one operator, or two tokens >, > ?

C++ example 2
• Splitting text into tokens can be tricky
• How should the code below be split?

    vector<vector<int> > myVector

• >, > are two tokens

Separating tokens
    Type           Examples
    Comments       /* ignore code */
                   // ignore until end of line
    White spaces   \t \n
• Lexemes that are recognized but get consumed rather than transmitted to parser
  – if
  – i f
  – i/*comment*/f

Preprocessor directives in C
    Type                 Examples
    Include directives   #include<foo.h>
    Macros               #define THE_ANSWER 42

First step of designing a scanner
• Define each type of lexeme
  – Reserved words: var, if, for, while
  – Operators: < = ++
  – Identifiers: myFunction
  – Literals: 123 “hello”
  – Annotations: @SuppressWarnings
• How can we define lexemes of unbounded length?
  – Regular expressions

Agenda
• Understand role of lexical analysis in a compiler
  – Convert text to stream of tokens
• Regular languages reminder
• Lexical analysis algorithms
• Scanner generation

Regular languages reminder

Basic definitions and facts
• Formal languages
  – Alphabet = finite set of letters
  – Word = sequence of letters
  – Language = set of words
• Regular languages defined equivalently by
  – Regular expressions
  – Finite-state automata

Regular expressions
• Empty string: Є
• Letter: a1, …, ak (the alphabet)
• Concatenation: R1 R2
• Union: R1 | R2
• Kleene-star: R*
  – Shorthand: R+ stands for R R*
• Scope: (R)
• Example: (0* 1*) | (1* 0*)
  – What is this language?
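
The operators on the Regular expressions slide map closely onto Python's `re` syntax. The sketch below is not part of the lecture; it checks a few words against the slide's example (0* 1*) | (1* 0*) using `re.fullmatch`. The helper name `in_language` and the sample words are assumptions chosen for illustration.

```python
import re

# The slide's example (0* 1*) | (1* 0*), written in Python's regex syntax.
# Spaces on the slide denote concatenation, so they are dropped here.
EXAMPLE = re.compile(r"(0*1*)|(1*0*)")


def in_language(word: str) -> bool:
    """Return True if the whole word matches the example expression."""
    return EXAMPLE.fullmatch(word) is not None


if __name__ == "__main__":
    # Hypothetical sample words: some are in the language, some are not.
    for word in ["", "0011", "1100", "000", "010", "101"]:
        print(repr(word), in_language(word))
```

Trying words such as 010 and 101 suggests the answer to the slide's question: the language contains the words made of a block of 0s followed by a block of 1s, or a block of 1s followed by a block of 0s.
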

Exercise 1 - Question
• Language of Java identifiers
  – Identifiers start with either an underscore ‘_’ or a letter
  – Continue with either underscore, letter, or digit

Exercise 1 - Answer
• (_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Exercise 1 – Better answer
• Using shorthand macros
    First = _|a|b|…|z|A|…|Z
    Next = First|0|…|9
    R = First Next*

Exercise 2 - Question
• Language of rational numbers in decimal representation (no leading or trailing zeros)
  – Positive examples:
    • 0
    • 123.757
    • .933333
    • 0.7
  – Negative examples:
    • 007
    • 0.30

Exercise 2 - Answer
• Language of rational numbers in decimal representation (no leading or trailing zeros)
    Digit = 1|2|…|9
    Digit0 = 0|Digit
    Num = Digit Digit0*
    Frac = Digit0* Digit
    Pos = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (Є|-)Pos
    R = 0 | PosOrNeg

Exercise 3 - Question
• Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …

Exercise 3 - Answer
• Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …
• Not regular
• Context-free
• Grammar: S ::= [] | [S]

Finite automata

Finite automata: known results
• Types of finite automata:
  – Deterministic (DFA)
  – Non-deterministic (NFA)
  – Non-deterministic + epsilon transitions
• Theorem: translation of regular expressions to NFA+epsilon (linear time)
• Theorem: translation of NFA+epsilon to DFA
  – Worst-case exponential time
• Theorem [Myhill-Nerode]: DFA can be minimized

Finite automata
• An automaton M = ⟨Q, Σ, δ, q0, F⟩ is defined by states and transitions
[Figure: automaton diagram annotated with "start state", "transition", and "accepting state"; edges labeled a, b, c]

Exercise - Question
• What is the language defined by the automaton below?
[Figure: automaton with a start state and edges labeled a, b, c]

Exercise - Answer
• What is the language defined by the automaton below?
  – a b* c
  – Generally: all paths leading to accepting states
[Figure: the same automaton]

Non-deterministic automata
• Allow multiple transitions from given state labeled by same letter
[Figure: non-deterministic automaton with edges labeled a, b, c]

NFA+Є automata
• Є transitions can “fire” without reading the input
[Figure: automaton with edges labeled a, b, c and an Є transition]

A little about me
• Joined Ben-Gurion University in 2012
• Research interests
  – Inductive programming and synthesis
  – Static analysis and verification
  – Language-supported parallelism

I am here for
• Teaching you theory and practice of popular compiler algorithms
  – Hopefully make you think about solving problems by examples from the compilers world
  – Answering questions about material
• Contacting me
  – e-mail: [email protected]
  – Office hours: see course web-page
• Announcements
• Forums (per assignment)

Tentative syllabus
• Front End: Scanning, Top-down Parsing (LL), Bottom-up Parsing (LR)
• Intermediate Representation: Operational Semantics, Lowering
• Optimizations: Dataflow Analysis, Loop Optimizations
• Code Generation: Register Allocation, Energy Optimization, Instruction Selection
[Figure: the topics above laid out in four columns, with a "mid-term exam" marker]

Reg-exp vs. automata
• Regular expressions are declarative (a high-level language)
  – Offer compact way to define a regular language by humans
  – Don’t offer direct way to check whether a given word is in the language
• Automata are operative (a machine language)
  – Define an algorithm for deciding whether a given word is in a regular language
  – Not a natural notation for humans

From Regular expressions to automata

From reg. exp. to NFA+Є automata
• Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
• Proof: by induction on the structure of the regular expression

Inductive constructions
[Figure: inductive NFA+Є constructions for the base cases (including R = a) and for R1 | R2, R1 R2, and R*]

Running time of NFA+Є
• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+Є with k states on string of length n takes O(n·k²) time
  – Each state in a configuration of O(k) states may have O(k) outgoing edges, so processing an input letter may take O(k²) time
  – Can we reduce the k² factor?

From NFA+Є to DFA
• Construction requires O(k) states for a reg-exp of length k
• Running an NFA+Є with k states on string of length
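
The O(n·k²) bound on the Running time of NFA+Є slide comes from simulating the automaton while tracking a set of active states. The sketch below is an illustration under stated assumptions, not the lecture's code: the transition table is a hypothetical NFA+Є for a b* c (the language from the automaton exercise), the empty string stands for an Є label, and the names `TRANSITIONS`, `epsilon_closure`, and `accepts` are made up for this example.

```python
from typing import Dict, Set, Tuple

EPS = ""  # label used here for Є transitions (an assumption of this sketch)

# Hypothetical NFA+Є for a b* c: TRANSITIONS[(state, label)] -> set of targets.
TRANSITIONS: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {1},
    (1, EPS): {2},
    (2, "b"): {2},
    (2, "c"): {3},
}
START = 0
ACCEPTING = {3}


def epsilon_closure(states: Set[int]) -> Set[int]:
    """All states reachable from `states` using only Є transitions."""
    closure = set(states)
    worklist = list(states)
    while worklist:
        state = worklist.pop()
        for target in TRANSITIONS.get((state, EPS), set()):
            if target not in closure:
                closure.add(target)
                worklist.append(target)
    return closure


def accepts(word: str) -> bool:
    """Simulate the NFA+Є by tracking the set of currently active states."""
    current = epsilon_closure({START})
    for letter in word:              # n input letters
        moved: Set[int] = set()
        for state in current:        # at most k active states
            # each active state may contribute up to k target states
            moved |= TRANSITIONS.get((state, letter), set())
        current = epsilon_closure(moved)
    return bool(current & ACCEPTING)


if __name__ == "__main__":
    for word in ["ac", "abbc", "a", "abca"]:
        print(word, accepts(word))
```

The nested loops mirror the slide's argument: each of the n letters is processed against up to k active states, each with up to k outgoing edges.
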
