Compiler Design Spring 2018
3.2 Lexical Analysis
Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland
Overview
§ 3.1 Introduction
§ 3.2 Lexical analysis
§ 3.3 “Top down” parsing
§ 3.4 “Bottom up” parsing
Outline
§ Ambiguity (from Tuesday)
§ Lexical analysis
§ Top-down parsing
  § Simple backtracking parsers
  § Simple predictive parsers
3.2 Lexical analysis
§ Use regular expressions to describe the elements of the language
  § Names of variables, fields, methods, classes, …
  § Constants (int, float, double, hex, …)
  § Keywords of the language (if then else while class …)
§ Example (from previous lecture):
  § Id: L { L | N } *
  § L = { a | b | c | … | z }
  § N = { 0 | 1 | 2 | … | 9 }
§ Regular expressions → DFA
  § Automatic construction easy
  § DFA produces the tokens
How it works
[Figure: the source “a3 + b” enters the lexer (or scanner), a DFA, which emits the tokens Id(a3), Term(+), Id(b) to the parser; the analyzer program answers Yes or No.]
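The step from the regular expression Id: L { L | N } * to a token-producing DFA can be sketched by hand in Python. This is only an illustration, not the course's scanner: the names `id_dfa_step` and `scan` are made up here, and treating every other non-blank character as a one-character Term token is an assumption.

```python
import string

LETTERS = set(string.ascii_lowercase)   # L = { a | b | ... | z }
DIGITS = set(string.digits)             # N = { 0 | 1 | ... | 9 }


def id_dfa_step(state, ch):
    """One transition of the DFA for Id: L { L | N } *."""
    if state == "start" and ch in LETTERS:
        return "in_id"
    if state == "in_id" and (ch in LETTERS or ch in DIGITS):
        return "in_id"
    return None  # no transition: the DFA is stuck


def scan(source):
    """Return token strings such as 'Id(a3)' or 'Term(+)'."""
    tokens = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch.isspace():          # whitespace separates tokens, produces none
            pos += 1
            continue
        state = id_dfa_step("start", ch)
        if state is None:
            # Assumption: any other single character is a terminal symbol.
            tokens.append(f"Term({ch})")
            pos += 1
            continue
        start = pos
        pos += 1
        # Run the DFA until it has no transition for the next character.
        while pos < len(source) and id_dfa_step(state, source[pos]) is not None:
            state = id_dfa_step(state, source[pos])
            pos += 1
        tokens.append(f"Id({source[start:pos]})")
    return tokens


print(scan("a3 + b"))   # ['Id(a3)', 'Term(+)', 'Id(b)']
```

The identifier ends at the first character for which the DFA has no transition, which is the stopping rule the following slides discuss.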
Scanner
§ Also known as lexer
§ Problem: Characters → tokens
§ How to identify a token?
  § Example: a3 + b
§ How to stop building a token and then start a new one?
§ How much does the scanner “need to know”?
Token assembly
§ First (part of) answer: Stop when encountering a character that does not belong to the current token
§ For many languages: Stop when encountering whitespace
  § Whitespace: Invisible and/or irrelevant for program
  § Look at C, C++, Java
§ Comments
Comments and whitespace
§ Some languages attach meaning to whitespace
  § Nesting level in Python
  § “make” utility
§ Warning: macro facilities, pragma
§ Not all comments are whitespace
  § Directives hidden in comments
  § Example: Fortran90 comments start with “!”; !DEC$ IVDEP stands for “ignore vector dependencies”
      !DEC$ IVDEP
      DO I=1, N
        A(INDARR(I)) = A(INDARR(I)) + B(I)
      END DO
IVDEP – what’s that?
§ “ignore vector dependencies”
    A(INDARR(I)) = A(INDARR(I)) + B(I)
§ Parallelization
  § Processor 0: A[INDARR[1]] = …
  § Processor 1: A[INDARR[2]] = …
§ Possible outcomes
  § INDARR[1] == 10 and INDARR[2] == 100
  § INDARR[1] == 192 and INDARR[2] == 192
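Why the directive matters can be shown with a small Python simulation. It assumes the values on this slide describe two cases (index values 10 and 100, or 192 and 192); that reading, the function names, and the contents of B are all assumptions for illustration. The vectorized version models hardware that performs all reads of A before any write.

```python
def sequential_update(a, indarr, b):
    """Run the loop one iteration at a time, as the Fortran source specifies."""
    a = list(a)
    for i in range(len(indarr)):
        a[indarr[i]] = a[indarr[i]] + b[i]
    return a


def vectorized_update(a, indarr, b):
    """Model a vectorized loop: every iteration reads A before any writes."""
    a = list(a)
    new_values = [a[indarr[i]] + b[i] for i in range(len(indarr))]  # all reads
    for i in range(len(indarr)):                                    # all writes
        a[indarr[i]] = new_values[i]
    return a


a = [0] * 200
b = [1, 2]   # hypothetical values for B

distinct = [10, 100]   # the two iterations touch different elements of A
print(sequential_update(a, distinct, b) == vectorized_update(a, distinct, b))  # True

same = [192, 192]      # both iterations touch A[192]
print(sequential_update(a, same, b)[192])   # 3: both additions are applied
print(vectorized_update(a, same, b)[192])   # 2: the first addition is lost
```

With IVDEP the programmer promises that the second case does not occur. The compiler cannot check this, so the scanner must not throw the comment away as whitespace.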
Simple strategy
§ Works for JavaLi and other languages
§ Put as many characters into a token as possible until it is obvious that a new token starts
    a3 + b
  § a – first token
  § 3 – add to token
  § + – new token
§ “Maximal munch”
§ May make good error reporting difficult
  § a12 vs. a + 12
Maximal munch limitations
§ Does not work for all programming languages
§ Example C program segment
    int j, k;
    int* kaddr;
    int** kkaddr;
    kaddr = & k;
    j = *kaddr + 2;
    kkaddr = & kaddr;
    j = ** kkaddr + 3;     /* Token: “**” */
    k = 5 * * * kkaddr;
    j = 7 * * kaddr;       /* Token: “*” */
What can be done?
§ Close(r) coupling between scanner and parser
[Figure: the scanner reads the input program and passes tokens to the parser; the parser sends requests back to the scanner, e.g. Id, “*” (type of token expected or a list of types expected).]
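One way to picture this coupling is a scanner that takes the parser's list of expected token types as an argument and applies maximal munch only among those types. The Python sketch below is hypothetical: the function `next_token`, the token type names, and a token set in which both “**” and “*” exist are assumptions chosen to mirror the example, not the interface used in the course.

```python
def next_token(source, pos, expected):
    """Return (token_type, lexeme, new_pos) for the longest match in `expected`.

    `expected` is the list of token types the parser can accept at this point.
    """
    while pos < len(source) and source[pos].isspace():
        pos += 1                      # skip whitespace between tokens
    if pos == len(source):
        return ("EOF", "", pos)
    candidates = []
    if "Id" in expected and source[pos].isalpha():
        end = pos
        while end < len(source) and source[end].isalnum():
            end += 1
        candidates.append(("Id", source[pos:end], end))
    for op in ("**", "*"):            # hypothetical operator tokens
        if op in expected and source.startswith(op, pos):
            candidates.append((op, op, pos + len(op)))
    if not candidates:
        raise SyntaxError(f"expected one of {expected} at position {pos}")
    # Maximal munch, restricted to the token types the parser asked for.
    return max(candidates, key=lambda tok: len(tok[1]))


# Without guidance the scanner munches "**" as one token.
print(next_token("**kkaddr", 0, ["Id", "**", "*"]))   # ('**', '**', 2)
# If the parser only accepts "*" or an Id here, it gets a single "*".
print(next_token("**kkaddr", 0, ["Id", "*"]))         # ('*', '*', 1)
```

The price of this design is that the scanner can no longer run ahead of the parser, because the token boundaries depend on what the parser requests.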