Lexical Analyzer, Lexer Or Scanner

COMP 442/6421 – Compiler Design 1 COMPILER DESIGN Lexical analysis Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 2 Lexical analysis • Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. • A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner. • A scanner often exists as a single function which is called by the parser, whose functionality is to extract the next token from the source code. • The lexical specification of a programming language is defined by a set of rules which defines the scanner, which are understood by a lexical analyzer generator such as lex or flex. These are most often expressed as regular expressions. • The lexical analyzer (either generated automatically by a tool like lex, or hand- crafted) reads the source code as a stream of characters, identifies the lexemes in the stream, categorizes them into tokens, and outputs a token stream. • This is called "tokenizing." • If the scanner finds an invalid token, it will report a lexical error. Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 3 Roles of the scanner • Removal of comments • Comments are not part of the program’s meaning • Multiple-line comments? • Nested comments? • Case conversion • Is the lexical definition case sensitive? • For identifiers • For keywords Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 4 Roles of the scanner • Removal of white spaces • Blanks, tabulars, carriage returns • Is it possible to identify tokens in a program without spaces? • Interpretation of compiler directives • #include, #ifdef, #ifndef and #define are directives to “redirect the input” of the compiler • May be done by a pre-compiler • Initial creation of the symbol table • A symbol table entry is created when an identifier is encountered • The lexical analyzer cannot create the whole entries • Can convert literals to their value and assign a type • Convert the input file to a token stream • Input file is a character stream • Lexical specifications: literals, operators, keywords, punctuation Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 5 Lexical specifications: tokens and lexemes • Token: An element of the lexical definition of the language. • Lexeme: A sequence of characters identified as a token. Token Lexeme id distance,rate,time,a,x relop >=,<,== openpar ( if if then then assignop = semi ; Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 6 Design of a lexical analyzer Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 7 Design of a lexical analyser • Procedure 1. Construct a set of regular expressions (REs) that define the form of any valid token 2. Derive an NDFA from the REs 3. Derive a DFA from the NDFA 4. Translate the NDFA to a state transition table 5. Implement the table 6. Implement the algorithm to interpret the table • This is exactly the procedure that a scanner generator is implementing. • Scanner generators include: • Lex, flex • Jlex • Alex • Lexgen • re2c Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 8 Regular expressions : { } s : {s | s in s^} a : {a} r | s : {r | r in r^} or {s | s in s^} s* : {sn | s in s^ and n>=0} s+ : {sn | s in s^ and n>=1} id ::= letter(letter|digit)* Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 9 Deriving DFA from REs • Thompson’s construction is an algorithm invented by Ken Thompson in 1968 to translate regular expressions into an NFA. • Rabin-Scott powerset construction is an algorithm invented by Michael O. Rabin and Dana Scott in 1959 to transform an NFA to a DFA. Ken Thompson • Kleene’s algorithm, is an algorithm invented by Stephen Cole Kleene in 1956 to transform a DFA into a regular expression. • These algorithms are the basis of the implementation of all scanner generators. Dana Scott Michael O. Rabin Stephen Cole Kleene Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 10 Thompson’s construction Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 11 REs to NDFA: Thompson’s construction id ::= letter(letter|digit)* ɛ 2 l 3 ɛ ɛ 0 l 1 6 ɛ 7 ɛ ɛ 4 d 5 ɛ Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 12 Thompson’s construction • Thompson’s construction s i s f works recursively by splitting an expression into its constituent subexpressions. • Each subexpression st i N(s) N(t) f corresponds to a subgraph. • Each subgraph is then grafted with other subgraphs depending on the nature of ɛ N(s) ɛ the composed subexpression, i.e. s|t i f • An atomic lexical symbol ɛ N(t) ɛ • A concatenation expression • A union expression ɛ • A Kleene star expression s* i ɛ N(s) ɛ f ɛ Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 13 Thompson’s construction: example (a|b)*abb a 2 a 3 e i e f b 4 b 5 N(s) ɛ 2 a 3 ɛ ɛ ɛ a|b 1 6 s|t i f ɛ 4 b 5 ɛ ɛ N(t) ɛ Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 14 Thompson’s construction: example (a|b)*abb ɛ ɛ 2 a 3 2 ɛ a 3 ɛ ɛ ɛ a(a||bb)* 1 0 ɛ 1 6 6 ɛ 7 ɛ ɛ ɛ 4 b 5 4 ɛ b 5s* i ɛ N(s) ɛ f ɛ ɛ Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 15 Thompson’s construction: example (a|b)*abb a 2 a 3 b 4 b 5 st i N(s) N(t) f ɛ 2 a 3 ɛ ɛ ɛ (a|b)* 0 ɛ 1 6 ɛ 7 2 a 3 ɛ ɛ ɛ ɛ 4 b 5 (a|b)*abb 0 ɛ 1 6 ɛ 7 a 8 b 9 b 10 ɛ ɛ ɛ 4 b 5 ɛ Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 16 Rabin-Scott powerset construction Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 17 Rabin-Scott powerset construction: concepts • SDFA: set of states in the DFA • SNFA: set of states in the NFA • Σ: set of all symbols in the lexical specification. • ɛ-closure(S): set of states in the NDFA that can be reached with ɛ transitions from any element of the set of states S, including the state itself. • MoveNFA(T,a): state in SNFA to which there is a transition from one of the states in states set T, having encounteredɛ symbol a. • Move (T,a): state in S to which there is a transition from one of the DFA DFA 2 l 3 states in states set T, having encounteredɛ symbolɛ a. 0 l 1 id ::= letter(letter|digit)*6 ɛ 7 ɛ ɛ l 4 d 5 C l ɛ A l B d l d D d Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 18 Rabin-Scott powerset construction: algorithm Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 19 Rabin-Scott powerset construction: example ɛ 2 l 3 ɛ ɛ 0 l 1 6 ɛ 7 ɛ ɛ l 4 d 5 C l ɛ A l B d l d D d Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 20 Rabin-Scott powerset construction: example Starting state A = ɛ-closure(0) = {0} A State A : {0} moveDFA(A,l) = ɛ-closure(moveNFA(A,l)) A l B = ɛ-closure({1}) = {1,2,4,7} = B moveDFA(A,d) = ɛ-closure(moveNFA(A,d)) = ɛ-closure({}) = {} Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 21 Rabin-Scott powerset construction: example State B : {1,2,4,7} moveDFA(B,l) = ɛ-closure(moveNFA(B,l)) A l B l C = ɛ-closure({3}) = {1,2,3,4,6,7} = C move (B,d) DFA C = ɛ-closure(moveNFA(B,d)) l = ɛ-closure({5}) A l B = {1,2,4,5,6,7} d = D D Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 22 Rabin-Scott powerset construction: example State C : {1,2,3,4,6,7} l C moveDFA(C,l) l = ɛ-closure(moveNFA(C,l)) A l B = ɛ-closure({3}) d = {1,2,3,4,6,7} D = C l move (C,d) DFA C = ɛ-closure(moveNFA(C,d)) l = ɛ-closure({5}) A l B d = {1,2,4,5,6,7} d = D D Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 23 Rabin-Scott powerset construction: example State D : {1,2,4,5,6,7} l C moveDFA(D,l) l = ɛ-closure(moveNFA(D,l)) A l B d l = ɛ-closure({3}) d = {1,2,3,4,6,7} D = C l move (D,d) DFA C = ɛ-closure(moveNFA(D,d)) l = ɛ-closure({5}) A l B d l = {1,2,4,5,6,7} d = D D d Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 24 Rabin-Scott powerset construction: example Final states: l C l A l B d l d D d Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 25 Generate state transition table state letter digit final A B N B C D Y l C C D Y C l D C D AY l B d l d D d Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 26 Implementation Concordia University Department of Computer Science and Software Engineering Joey Paquet, 2000-2018 COMP 442/6421 – Compiler Design 27 Implementation concerns • Backtracking • Principle : A token is normally recognized only when the next character is read.

Lexical Analyzer, Lexer Or Scanner

Twenty Years of Berkeley Unix : from AT&T-Owned to Freely

Introduction

The UNIX Time- Sharing System

Arxiv:2106.11534V1 [Cs.DL] 22 Jun 2021 2 Nanjing University of Science and Technology, Nanjing, China 3 University of Southampton, Southampton, U.K

Turing Award • John Von Neumann Medal • NAE, NAS, AAAS Fellow

Corbató Transcript Final

Alan Mathison Turing and the Turing Award Winners

Linux Internals

Principles of Computer System Design an Introduction

Operating Systems and Middleware: Supporting Controlled Interaction

Open Source Resources for Teaching and Research in Mathematics

Dennis Ritchie - the Computer Science Pioneer Without Whom There Would Be No Jobs