Compiler Design Spring 2018

3.2

Thomas R. Gross

Computer Science Department
ETH Zurich, Switzerland

Overview

§ 3.1 Introduction
§ 3.2 Lexical analysis
§ 3.3 “Top down” parsing
§ 3.4 “Bottom up” parsing

Outline

§ Ambiguity (from Tuesday)
§ Lexical analysis
§ Top-down parsing
  § Simple backtracking parsers
  § Simple predictive parsers

3.2 Lexical analysis

§ Use regular expressions to describe the elements of the language
  § Names of variables, fields, methods, classes, …
  § Constants (int, float, double, hex, …)
  § Keywords of the language (if then else while class …)
§ Example (from previous lecture):
  § Id: L { L | N } *
  § L = { a | b | … | z }
  § N = { 0 | 1 | 2 | … | 9 }
§ Regular expressions → DFA
  § Automatic construction is easy
§ DFA produces the tokens

How it works

[Figure: the source text "a3 + b" enters the lexer (or scanner), a DFA-based analyzer program, which emits the tokens Id(a3), Term(+), Id(b). The parser consumes the tokens and answers Yes or No.]
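The pipeline in the figure can be sketched in a few lines of Python. This is a minimal sketch, not the course's implementation: the function names `scan_identifier` and `tokenize`, the state names, and the restriction to the single operator "+" are assumptions made for illustration. The identifier DFA follows the definition Id: L { L | N } * from the previous slide.

```python
import string

LETTERS = set(string.ascii_lowercase)   # L = { a | b | ... | z }
DIGITS = set(string.digits)             # N = { 0 | 1 | ... | 9 }
OPERATORS = set("+")                    # assumption: only "+", as in the example


def scan_identifier(text, start):
    """Run a two-state DFA for Id: L { L | N }* beginning at `start`.

    Returns the index just past the identifier, or `start` itself
    if no identifier begins at that position.
    """
    state = "START"
    pos = start
    while pos < len(text):
        ch = text[pos]
        if state == "START" and ch in LETTERS:
            state = "IN_ID"          # first character must be a letter
        elif state == "IN_ID" and (ch in LETTERS or ch in DIGITS):
            pass                     # stay in IN_ID on letters and digits
        else:
            break                    # character does not belong to the token
        pos += 1
    return pos


def tokenize(text):
    """Turn a character string into a list of (kind, lexeme) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():             # whitespace separates tokens
            pos += 1
            continue
        end = scan_identifier(text, pos)
        if end > pos:
            tokens.append(("Id", text[pos:end]))
            pos = end
        elif ch in OPERATORS:
            tokens.append(("Term", ch))
            pos += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    return tokens


print(tokenize("a3 + b"))
# [('Id', 'a3'), ('Term', '+'), ('Id', 'b')]
```

The identifier DFA is written out by hand here; a scanner generator would build the equivalent automaton from the regular expression automatically.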

Scanner

§ Also known as lexer
§ Problem: Characters → tokens
§ How to identify a token?
§ Example: a3 + b

§ How to stop building a token and then start a new one?

§ How much does the scanner “need to know”?

Token assembly

§ First (part of the) answer: Stop when encountering a character that does not belong to the current token
§ For many languages: Stop when encountering whitespace
§ Whitespace: Invisible and/or irrelevant for the program
  § Look at C, C++, Java
    § Space (␣)
    § Newline, form feed, CR (carriage return)
    § Tab

    § Comments

Comments and whitespace

§ Some languages attach meaning to whitespace
  § Nesting level in Python
  § “make” utility
§ Warning: macro facilities, pragma
§ Not all comments are whitespace
  § Directives hidden in comments
  § Example: Fortran 90 comments start with “!”

    !DEC$ IVDEP – ignore vector dependencies
    DO I=1, N
      A(INDARR(I)) = A(INDARR(I)) + B(I)
    END DO

IVDEP – what’s that?

§ “ignore vector dependencies”

    A(INDARR(I)) = A(INDARR(I)) + B(I)

§ Parallelization
  § Processor 0: A[INDARR[1]] = …
  § Processor 1: A[INDARR[2]] = …
§ Possible outcomes

                   Case 1    Case 2
    INDARR[1] ==   10        192
    INDARR[2] ==   100       192
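Why the two outcomes matter can be shown with a small simulation. This is only a sketch under stated assumptions: `run_sequential` and `run_all_at_once` are hypothetical helper names, Python lists are 0-indexed (unlike Fortran arrays), and "all at once" models a vectorized or parallel execution in which every iteration reads its operand before any iteration writes its result.

```python
def run_sequential(a, b, indarr):
    """Execute the loop one iteration at a time, in program order."""
    a = list(a)                      # work on a copy
    for i in range(len(indarr)):
        a[indarr[i]] = a[indarr[i]] + b[i]
    return a


def run_all_at_once(a, b, indarr):
    """Model a vectorized execution: all reads happen before all writes."""
    a = list(a)                      # work on a copy
    new_values = [a[indarr[i]] + b[i] for i in range(len(indarr))]
    for i, value in enumerate(new_values):
        a[indarr[i]] = value
    return a


a = [0] * 200
b = [1, 2]

# Distinct indices: both execution orders agree.
print(run_sequential(a, b, [10, 100]) == run_all_at_once(a, b, [10, 100]))
# True

# Same index twice: the second update is lost in the vectorized model.
print(run_sequential(a, b, [192, 192])[192], run_all_at_once(a, b, [192, 192])[192])
# 3 2
```

With IVDEP the programmer asserts that the second situation does not occur, so the compiler may vectorize the loop without proving it. The directive is written as a comment, which is why a scanner cannot discard every comment as whitespace.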

Simple strategy

§ Works for JavaLi and other languages
§ Put as many characters into a token as possible until it is obvious that a new token starts

    a3 + b
    a – first token
    3 – add to token
    + – new token

§ “Maximal munch”
§ May make good error reporting difficult
  § a12 vs. a + 12

Maximal munch limitations

§ Does not work for all programming languages
§ Example C program segment

    int j, k;
    int* kaddr;
    int** kkaddr;
    kaddr = & k;
    j = *kaddr + 2;
    kkaddr = & kaddr;
    j = ** kkaddr + 3;       Token: “**”
    k = 5 * * * kkaddr;      Token: “*”
    j = 7 * * kaddr;

What can be done?

§ Close(r) coupling between scanner and parser

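[Figure: the input program feeds the scanner. The parser sends requests to the scanner, stating the type of token expected or a list of types expected (e.g. Id, “*”), and the scanner returns tokens to the parser.]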
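One way to picture this coupling is a scanner whose `next_token` call accepts the set of token types the parser is prepared to see. The sketch below is an illustration only: the `Scanner` class, the `expected` parameter, and the token set (including a hypothetical “**” token) are assumptions, not part of JavaLi or of any real C compiler.

```python
import re

# Hypothetical token set; "**" is included so that plain maximal munch
# has something to get wrong on input like "**kkaddr".
TOKEN_PATTERNS = [
    ("Id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Num", re.compile(r"[0-9]+")),
    ("**", re.compile(r"\*\*")),
    ("*", re.compile(r"\*")),
    ("+", re.compile(r"\+")),
    ("=", re.compile(r"=")),
    (";", re.compile(r";")),
]


class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def next_token(self, expected=None):
        """Return the longest token at the current position whose type is
        in `expected` (any type if `expected` is None); None at end of input."""
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos == len(self.text):
            return None
        best = None
        for kind, pattern in TOKEN_PATTERNS:
            if expected is not None and kind not in expected:
                continue             # the parser cannot use this token type here
            match = pattern.match(self.text, self.pos)
            if match and (best is None or match.end() > best[2]):
                best = (kind, match.group(), match.end())
        if best is None:
            raise ValueError(f"no acceptable token at position {self.pos}")
        self.pos = best[2]
        return best[0], best[1]


# Plain maximal munch: the scanner alone picks the longest match.
print(Scanner("**kkaddr").next_token())
# ('**', '**')

# Parser-directed: at the start of an operand only "*", Id or Num make sense.
scanner = Scanner("**kkaddr")
operand_start = {"*", "Id", "Num"}
print(scanner.next_token(operand_start), scanner.next_token(operand_start))
# ('*', '*') ('*', '*')
```

The scanner still applies maximal munch, but only among the token types the parser has asked for, so the decision between “**” and two “*” tokens is made with syntactic context.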
