CS308 Compiler Principles

Introduction

Li Jiang, Department of Computer Science and Engineering, Shanghai Jiao Tong University

1 Why study compiling?

• Importance:
 – Programs written in high-level languages have to be translated into binary code before executing
 – Reduce execution overhead of the programs
 – Make high-performance computer architectures effective on users' programs
• Influence:
 – Language design
 – Computer architecture (the influence is bi-directional)
• Techniques used influence other areas:
 – Text editors, information retrieval systems, and pattern recognition programs
 – Query processing systems such as SQL
 – Equation solvers
 – Natural language processing
 – Debugging and finding security holes in code
 – …

2 Compiler Principles Compiler Concept • A compiler is a program that takes a program written in a source language and translates it into an equivalent program in a target language.

source program → COMPILER → target program

(The source program is normally written in a high-level programming language; the target program is normally the equivalent program in machine code or a relocatable object file, which can be called to process input and provide output. The compiler also reports error messages.)

3 Compiler Principles Interpreter • An interpreter directly executes the operations specified in the source program on inputs supplied by the user.

source program + input → INTERPRETER → output (and error messages)

4 Compiler Principles Programming Languages • Compiled languages: – Fortran, Pascal, C, C++, C#, Delphi, Visual Basic, …

• Interpreted languages: – BASIC, Perl, PHP, Ruby, TCL, MATLAB,…

• Jointly compiled and interpreted languages – Java, Python, …

5 Compiler Principles Compiler vs. Interpreter*

• Preprocessing (e.g., for debugging)
 – Compilers do extensive preprocessing
 – Interpreters run programs "as is", with little or no preprocessing
• Efficiency
 – The target program produced by a compiler is usually much faster than interpreting the source code

6 Compiler Principles Compiler Structure

Source Language → Front End (language specific) → Intermediate Language → Back End (machine specific) → Target Language

Analysis (front end) | Symbol Table | Synthesis (back end)

• Separation of Concerns • Retargeting

7 Compiler Principles Two Main Phases
• Analysis Phase: breaks up a source program into constituent pieces and produces an internal representation of it called intermediate code.
 – Reports whether the program is syntactically ill-formed or semantically unsound
 – Collects useful information and passes it to the synthesis phase
• Synthesis Phase: translates the intermediate code into the target program.

8 Compiler Principles Phases of Compilation*

• Compilers work in a sequence of phases. • Each phase transforms the source program from one representation into another. • All phases use the symbol table to store information about the entire source program.

Source Language → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → (Intermediate Language) → Code Optimizer → Code Generator → Target Language    [Analysis | Symbol Table | Synthesis]

9 Compiler Principles A Model of A Compiler Front End

• Lexical analyzer reads the source program character by character and returns the tokens of the source program. • Parser creates the tree-like syntactic structure of the given program. • Intermediate-code generator translates the syntax tree into three-address code.

10 Compiler Principles Lexical Analysis

11 Compiler Principles Lexical Analysis • Lexical Analyzer reads the source program character by character and returns the tokens of the source program. • A token describes a pattern of characters having the same meaning in the source program. (such as identifiers, operators, keywords, numbers, delimiters, and so on)

12 Compiler Principles White Space Removal • No blanks, tabs, newlines, or comments appear in the grammar

What is the variable line used for?

Skipping white space (a sketch follows)
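The slide's code is not reproduced here; what follows is a minimal sketch (class and member names are illustrative, not the lecture's actual code) of skipping white space, showing why a line variable is kept: it counts newlines so that later error messages can report a line number.

class WhitespaceSkipper {
    final String src;           // source text being scanned
    int pos = 0;                // current position in src
    int line = 1;               // current source line, used in error messages

    WhitespaceSkipper(String src) { this.src = src; }

    // Advance pos past blanks, tabs, and newlines before the next token.
    void skipWhitespace() {
        while (pos < src.length()) {
            char c = src.charAt(pos);
            if (c == ' ' || c == '\t') pos++;
            else if (c == '\n') { line++; pos++; }   // count lines for diagnostics
            else break;                              // a real token starts here
        }
    }
}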

13 Compiler Principles Constants • When a sequence of digits appears in the input stream, the lexical analyzer passes to the parser a token consisting of the terminal num along with an integer-valued attribute computed from the digits.
 31+28+59 → <num, 31> <+> <num, 28> <+> <num, 59>
• How to get these attributes? Accumulate the value while scanning the digits (see the sketch below).
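A minimal sketch (a hypothetical helper, not the slide's code) of how the integer-valued attribute of a <num> token is computed: the value is accumulated digit by digit as v = 10*v + digit.

class NumberScanner {
    // Reads a maximal digit sequence starting at pos and returns its integer value,
    // which becomes the attribute of the token <num, value>. Error handling omitted.
    static int scanNumber(String src, int pos) {
        int v = 0;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
            v = 10 * v + (src.charAt(pos) - '0');   // v = v*10 + next digit
            pos++;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(scanNumber("31+28+59", 0));   // prints 31
    }
}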

14 Compiler Principles Keywords and Identifiers* Keywords: Fixed character strings used as punctuation marks or to identify constructs.

Identifiers: A character string forms an identifier only if it is not a keyword. How to differentiate them? (Commonly, by looking the string up in a table of reserved keywords first.)

15 Compiler Principles Lexical Analysis Cont’d • Puts information about identifiers into the symbol table.

• Regular expressions are used to describe tokens (lexical constructs). – [a-z]*[A-Z]*[0-9]*

• A (Deterministic) Finite State Automaton can be used in the implementation of a lexical analyzer.
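A small sketch of how a DFA-style scanner can recognize an identifier of the form letter (letter | digit)* and, at the same time, separate keywords from identifiers by consulting a keyword table (the class name, keyword set, and token spelling below are illustrative).

import java.util.Set;

class WordScanner {
    static final Set<String> KEYWORDS = Set.of("if", "else", "while");   // assumed keyword set

    // Two-state automaton: state 0 expects a letter, state 1 accepts letters/digits.
    static String scanWord(String src, int pos) {
        int start = pos;
        if (pos < src.length() && Character.isLetter(src.charAt(pos))) pos++;   // state 0 -> state 1
        else return null;                                                        // not a word
        while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;  // stay in state 1
        return src.substring(start, pos);
    }

    public static void main(String[] args) {
        String w = scanWord("while (x1 > 0)", 0);
        System.out.println(KEYWORDS.contains(w) ? "<keyword, " + w + ">" : "<id, " + w + ">");
    }
}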

16 Compiler Principles Symbol Table

17 Compiler Principles Symbol Table
• Symbol Tables are data structures that are used by compilers to hold information about the source-program constructs.
• For each identifier, there is an entry in the symbol table containing its information.
• Symbol tables need to support multiple declarations of the same identifier
 – One symbol table per scope (of declaration), e.g.
   { int x; char y; { bool y; x; y; } x; y; }
 – Most-closely-nested rule: inside the inner block y refers to bool y, while x still refers to the outer int x
 – Outer symbol table: x → int, y → char; inner symbol table: y → bool
 – Implementation: a stack of hash tables (a sketch follows)
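A sketch of the stack-of-tables idea: one table per scope, each with a pointer to the table of the enclosing scope, so a lookup follows the most-closely-nested rule (class and method names are illustrative).

import java.util.HashMap;
import java.util.Map;

class Env {
    private final Map<String, String> table = new HashMap<>();  // identifier -> type
    private final Env prev;                                      // enclosing scope, or null

    Env(Env prev) { this.prev = prev; }

    void put(String name, String type) { table.put(name, type); }

    // Search this scope first, then the enclosing scopes.
    String get(String name) {
        for (Env e = this; e != null; e = e.prev) {
            String t = e.table.get(name);
            if (t != null) return t;
        }
        return null;
    }

    public static void main(String[] args) {
        Env outer = new Env(null);
        outer.put("x", "int"); outer.put("y", "char");
        Env inner = new Env(outer);
        inner.put("y", "bool");
        System.out.println(inner.get("x") + " " + inner.get("y"));  // int bool
        System.out.println(outer.get("y"));                         // char
    }
}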

18 Compiler Principles Symbol Table

• A Symbol Table is a data structure containing a record for each variable name, with fields for the attributes of the name.

position → id1 & attributes
initial  → id2 & attributes
rate     → id3 & attributes

19 Compiler Principles Parsing
Lexical analysis understands the words; parsing understands the sentence.

• A Syntax/Semantic Analyzer (Parser) creates the syntactic structure (generally a parse tree) of the given program.

• Parsing is the problem of taking a string of terminals and figuring out how to derive it from the start symbol of the grammar.

20 Compiler Principles Syntax Analysis • A Syntax Analyzer/Parser creates the syntactic structure (generally a parse tree) of the given program. • A parse tree describes a syntactic structure. • Each interior node represents an operation

• The children of the node represent the arguments of the operation

Understand the sentence: the order of operations

21 Compiler Principles Syntax (CFG) • The syntax of a language is specified by a context free grammar (CFG). • The rules in a CFG are mostly recursive. • A syntax analyzer checks whether a given program satisfies the rules implied by a CFG or not. – If it satisfies, the syntax analyzer creates a parse tree for the given program.

• Ex: We use BNF (Backus-Naur Form) to specify a CFG

assgstmt   -> identifier := expression
expression -> identifier
expression -> number
expression -> expression + expression

22 Compiler Principles Syntax Definition • Context-Free Grammar (CFG) is used to specify the syntax of a formal language (for example a programming language like C, Java) • Grammar describes the structure (usually hierarchical) of programming languages.

– Example: in Java an if statement should fit the form
  if ( expression ) statement else statement
– Production ("→" reads as "can be / can have the form"):
  statement → if ( expression ) statement else statement

– Note the recursive nature of statement.

23 Compiler Principles Definition of CFG • Four components: – A set of terminal symbols (name of tokens): elementary symbols of the language defined by the grammar – A set of non-terminals (syntactic variables): represent the set of strings of terminals – A set of productions: non-terminal à a sequence of terminals and/or non-terminals – A designation of one of the non-terminals as the start symbol.

24 Compiler Principles A Grammar Example List of digits separated by plus or minus signs

• Accepts strings such as 9-5+2, 3-1, or 7. • 0, 1, …, 9, +, - are the terminal symbols • list and digit are non-terminals • Every "line" is a production • list is the start symbol • Grouping: list → list + digit | list - digit | digit

25 Compiler Principles Derivations* • Given a grammar, how do we understand a sentence? • A grammar derives strings by beginning with the start symbol and repeatedly replacing a non-terminal by the body of one of its productions. • Language: the set of terminal strings that can be derived from the start symbol of the grammar. • Example: derivation of 9-5+2 (reasoning bottom up)
 – 9 is a list, since 9 is a digit.
 – 9-5 is a list, since 9 is a list and 5 is a digit.
 – 9-5+2 is a list, since 9-5 is a list and 2 is a digit.

26 Compiler Principles Parse Trees • A parse tree shows how the start symbol of a grammar derives a string in the language. For a production A → XYZ, a node labeled A has children labeled X, Y, Z. What is the relationship between tree nodes and grammar symbols? Root ↔ start symbol, leaf ↔ terminal, interior node ↔ non-terminal.

27 Compiler Principles Parse Trees Properties • The root is labeled by the start symbol.

• Each leaf is labeled by a terminal or by ε.

• Each interior node is labeled by a non- terminal.

• If A is the non-terminal labeling some interior node and X1, X2,… , Xn are the labels of the children of that node from left to right, then there must be a production A à X1X2 · · · Xn.

28 Compiler Principles Parse Tree for 9-5+2 *

29 Compiler Principles Ambiguity • A grammar can have more than one parse tree generating a given string of terminals.
 list   → list + digit | list - digit | digit
 digit  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
 string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
With the string grammar, 9-5+2 has two parse trees: one meaning (9-5)+2 = 6 and one meaning 9-(5+2) = 2.

How to eliminate it?

30 Compiler Principles Eliminating Ambiguity * • Operator Associativity: in most programming languages arithmetic operators have left associativity.
 – Example: 9+5-2 = (9+5)-2
 – Exception: the assignment operator = has right associativity: a=b=c is equivalent to a=(b=c)
 – Elaborate: list -> list + digit | list - digit | digit
 – What about right associativity?
• Operator Precedence: if an operator has higher precedence, it binds to its operands first.
 – Example: * has higher precedence than +, therefore 9+5*2 = 9+(5*2)
 – How to elaborate the CFG for operator precedence? (See the grammar sketch below.)
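One standard way to elaborate the grammar (a sketch; the non-terminal names expr, term, and factor are illustrative): give each precedence level its own non-terminal, and use left recursion for left-associative operators.

 expr   → expr + term | expr - term | term
 term   → term * factor | term / factor | factor
 factor → digit | ( expr )

Here * and / bind tighter than + and - because they are generated lower in the tree, and the left recursion in expr and term makes both levels left-associative. A right-associative operator puts the recursion on the right instead, e.g. assign → id = assign | id.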

31 Compiler Principles Parsing * • Parsing is the process of determining how a string of terminals can be generated by a grammar.
 – Does each program have a specific tree?
 – What if no tree can be derived from the start symbol?
• Two classes:
 – Top-down: construction of the parse tree starts at the root and proceeds towards the leaves; easier to follow and to write by hand
 – Bottom-up: construction of the parse tree starts at the leaves and proceeds towards the root; wider usage, usually generated automatically

32 Compiler Principles Top-Down Parsing * • The top-down construction of a parse tree is done by starting from the root, and repeatedly performing the following two steps.

– At node N, labeled with non-terminal A, select the proper production of A and construct children at N for the symbols in the production body.

– Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded non-terminal of the tree.

33 Compiler Principles Top-Down Parsing
For computers: try each production until the right one is found. What if the first production picked is unsuitable? Backtrack.
E.g., suppose we add another production
 stmt -> if (expr) stmt else stmt
so that  if (expr) stmt else if (expr) stmt else stmt  can also be derived. How to avoid backtracking? See next.

34 Compiler Principles Predictive Parsing
• Recursive-descent parsing: a top-down method of syntax analysis in which a set of recursive procedures is used to process the input. Associate a procedure with every non-terminal.
• Predictive parsing: a simple form of recursive-descent parsing
 – The lookahead symbol unambiguously determines the flow of control, based on the first terminal(s) that each production body can derive
 – FIRST(stmts) → production to be used
 – Parse tree ← the sequence of procedure calls

35 Compiler Principles Procedure for stmt
Associate a procedure with every non-terminal (a sketch follows).

Necessary condition to use predictive parsing? No conflict among the first symbols of the bodies for the same head. What about a conflict such as the one on slide 34?
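A sketch of what such a procedure can look like (the grammar, token codes, and helper methods below are illustrative, not the lecture's actual code): one Java method per non-terminal, where the lookahead token alone selects the production.

class StmtParser {
    // Hypothetical token codes and parser state; a real parser gets tokens from the lexer.
    static final int IF = 256, WHILE = 257, ID = 258;
    int lookahead;                           // current token

    void match(int t) {
        if (lookahead == t) lookahead = nextToken();
        else throw new RuntimeException("syntax error");
    }
    int nextToken() { return -1; }           // stub: would call the lexical analyzer
    void expr() { /* parse an expression; omitted */ }

    // Grammar assumed: stmt -> if ( expr ) stmt | while ( expr ) stmt | id = expr ;
    void stmt() {
        switch (lookahead) {
            case IF:     match(IF);    match('('); expr(); match(')'); stmt(); break;
            case WHILE:  match(WHILE); match('('); expr(); match(')'); stmt(); break;
            case ID:     match(ID);    match('='); expr(); match(';'); break;
            default:     throw new RuntimeException("unexpected token " + lookahead);
        }
    }
}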

36 Compiler Principles Left Recursion Elimination * • Predictive parsing relies on information about the first symbols that can be generated by a production body. • Left recursion: the leftmost symbol of the body is the same as the head non-terminal. • A left-recursive production makes such a parser loop forever. • A left-recursive production can be eliminated by rewriting the offending production (see the rewrite below).

• Are the two grammars equivalent? Draw a parse tree!
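The standard rewrite: a left-recursive pair of productions

 A → A α | β

is replaced by

 A → β R
 R → α R | ε

For the digit-list grammar this gives

 list → digit rest
 rest → + digit rest | - digit rest | ε

Both grammars generate the same strings, but the rewritten one is no longer left-recursive, so a recursive-descent parser can use it; the shape of the parse tree changes, which is why the slide asks you to draw one.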

37 Compiler Principles Syntax Analyzer vs. Lexical Analyzer • Both of them do similar things • Granularity – The lexical analyzer works on the characters to recognize the smallest meaningful units (tokens) in a source program. – The syntax analyzer works on those tokens to recognize meaningful structures in the programming language. • Recursion – The lexical analyzer deals with simple non-recursive constructs of the language. – The syntax analyzer deals with recursive constructs of the language.

38 Compiler Principles Semantic Analysis Once sentence structure is understood, we can try to understand “meaning”.

• Semantic Analyzer
 – adds semantic information to the parse tree (syntax-directed translation)
 – checks the source program for semantic errors
 – collects type information for code generation
 – type checking: checks whether each operator has matching operands (where does the type information come from?)
 – coercion: type conversion

39 Compiler Principles Semantic Analysis • A Semantic Analyzer checks the source program for semantic errors and collects the type information for the code generation. • Type checking is an important part of semantic analysis.

Syntax Tree → Semantic Tree

40 Compiler Principles Syntax-Directed Translation • Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar.

• Example: infix expression → postfix expression • Techniques: attributes & translation schemes
41 Compiler Principles Postfix Notation • Definition:
 – If E is a variable or constant, the postfix notation of E is E itself.

 – If E is an expression of the form E1 op E2, the postfix notation of E is E1' E2' op, where E1' and E2' are the postfix notations of E1 and E2.

 – If E is a parenthesized expression of the form (E1), the postfix notation of E is E1', the postfix notation of E1.

• Examples: – 9-5+2 à 95-2+ – 9-(5+2) à 952+-

42 Compiler Principles Attributes • A syntax-directed definition – associates attributes with non-terminals and terminals in a grammar – attaches semantic rules to the productions of the grammar

• An attribute is said to be synthesized if its value at a parse-tree node is determined/computed from attribute values of its children and itself.

43 Compiler Principles Semantic Rules for Infix to Postfix

Annotated parse tree for 9-5+2 → 95-2+

Syntax-directed definition

How are attributes passed along in a complex tree? We will learn that later in this chapter.
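The slide's figure is not reproduced here; the following sketch shows the usual syntax-directed definition for this translation (the attribute name t and the || concatenation notation follow the textbook's convention):

 expr → expr1 + term   { expr.t = expr1.t || term.t || '+' }
 expr → expr1 - term   { expr.t = expr1.t || term.t || '-' }
 expr → term           { expr.t = term.t }
 term → 9              { term.t = '9' }     (and similarly for the other digits)

Evaluating the synthesized attribute t bottom-up on the parse tree of 9-5+2 yields 95- at the subtree for 9-5 and 95-2+ at the root.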

44 Compiler Principles Translation Schemes • The other syntax-directed translation approach • A Syntax-Directed Translation Scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar.

• The program fragments are called semantic actions.

45 Compiler Principles A Translation Scheme

Parse tree for 9-5+2 → 95-2+

Translation scheme

postorder traversal
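A sketch of the corresponding translation scheme (the semantic actions in braces follow the textbook's style): the actions are executed during a postorder traversal of the parse tree, printing the translation incrementally.

 expr → expr1 + term   { print('+') }
 expr → expr1 - term   { print('-') }
 expr → term
 term → 9              { print('9') }       (and similarly for the other digits)

Traversing the parse tree of 9-5+2 and executing the actions in postorder prints 95-2+.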

46 Compiler Principles Attribute vs. Translation Scheme • A syntax-directed definition attaches strings as attributes to the nodes in the parse tree

• A syntax-directed translation scheme prints the translation incrementally, through semantic actions

47 Compiler Principles Parsing Techniques • Depending on how the parse tree is created, there are different parsing techniques.
• These parsing techniques are categorized into two groups: Top-Down Parsing and Bottom-Up Parsing.
• Top-Down Parsing:
 – Construction of the parse tree starts at the root and proceeds towards the leaves.
 – Efficient top-down parsers can easily be constructed by hand.
 – Recursive Predictive Parsing, Non-Recursive Predictive Parsing (LL Parsing).
• Bottom-Up Parsing:
 – Construction of the parse tree starts at the leaves and proceeds towards the root.
 – Efficient bottom-up parsers are normally created with the help of software tools.
 – Bottom-up parsing is also known as shift-reduce parsing.
 – Operator-Precedence Parsing: simple, restrictive, easy to implement.
 – LR Parsing: a much more general form of shift-reduce parsing; LR, SLR, LALR.

48 Compiler Principles A Simple Translator Semantic actions embedded in the productions are simply carried along in the transformation, as if they were terminals.


Grammar of List of digits separated by plus or minus signs

49 Compiler Principles Translation of 9-5+2 to 95-2+

Left recursion eliminated

50 Compiler Principles Procedures for Simple Translator
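The procedures themselves are not shown on the extracted slide; here is a minimal, runnable sketch (class and method names are illustrative) of a recursive-descent translator for the left-recursion-eliminated grammar expr → term rest, rest → + term { print + } rest | - term { print - } rest | ε, term → digit { print digit }.

public class SimpleTranslator {
    private final String input;
    private int pos = 0;

    SimpleTranslator(String input) { this.input = input; }

    private char lookahead() { return pos < input.length() ? input.charAt(pos) : '$'; }

    private void match(char t) {
        if (lookahead() == t) pos++;
        else throw new IllegalStateException("syntax error at position " + pos);
    }

    void expr() { term(); rest(); }

    void rest() {                                    // tail recursion replaced by a loop
        while (lookahead() == '+' || lookahead() == '-') {
            char op = lookahead();
            match(op);
            term();
            System.out.print(op);                    // semantic action: print the operator
        }
    }

    void term() {
        char c = lookahead();
        if (Character.isDigit(c)) { System.out.print(c); match(c); }   // print the digit
        else throw new IllegalStateException("digit expected at position " + pos);
    }

    public static void main(String[] args) {
        new SimpleTranslator("9-5+2").expr();        // prints 95-2+
        System.out.println();
    }
}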

51 Compiler Principles (Abstract) Syntax Trees • In an (abstract) syntax tree for an expression – each interior node represents an operator – the children of the node represent the operands of the operator.

Syntax tree for 9-5+2

• In the syntax tree, interior nodes represent programming constructs. • In the parse tree, the interior nodes represent nonterminals (auxiliary symbols).

52 Compiler Principles Syntax vs. Semantics • The syntax of a programming language describes the proper form of its programs.

• The semantics of the language defines what its programs mean, what each program does when it executes.

53 Compiler Principles Intermediate Code Generation

54 Compiler Principles Intermediate Code Generation • The front end of a compiler constructs an intermediate representation of the source program, from which the back end generates the target program. (What representations can be used?)

• Two kinds of intermediate representations

– Tree: parse trees and (abstract) syntax trees

– Linear representation: three-address code

55 Compiler Principles Intermediate Code Generation • A compiler may produce explicit intermediate code representing the source program. • This intermediate code is generally machine (architecture) independent, but its level is close to the level of machine code. • Ex: three-address code x = y op z

56 Compiler Principles Syntax Trees For Statement
 stmt -> while ( expr ) stmt1    { stmt.n = new While(expr.n, stmt1.n) }
(n is a Node in the syntax tree)

Each statement keyword has an operator with the same name; we use it as a class (extending Node). The semantically meaningful components become new classes/objects (also extending Node).

Expr and Stmt are subclasses of Node.

57 Compiler Principles Syntax Trees For Expressions

Each operator becomes a new class/object (extending Node).

58 Compiler Principles Syntax Trees For Expressions
• Q: what about an expression used as a statement, stmt → expr ; ?
• A: an expression statement – define a new operator eval and a class Eval
• Grouping of operators – based on "similarity" (see the class sketch below):
 Concrete syntax              Abstract syntax
 =                            assign
 ||  &&                       cond
 ==  !=  <  <=  >  >=         rel
 +  -  *  /  %                op
 !                            not
 []                           access
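A sketch of the node classes the slides suggest (names such as Rel, Assign, While, and Eval are illustrative; constructors and fields are kept minimal): every construct extends Node, expressions and statements get their own abstract subclasses, and each operator group or statement keyword becomes a concrete class.

abstract class Node { }

abstract class Expr extends Node { }
abstract class Stmt extends Node { }

class Rel extends Expr {                 // ==  !=  <  <=  >  >=  grouped as "rel"
    final String op; final Expr left, right;
    Rel(String op, Expr left, Expr right) { this.op = op; this.left = left; this.right = right; }
}

class Assign extends Expr {              // =
    final Expr target, value;
    Assign(Expr target, Expr value) { this.target = target; this.value = value; }
}

class While extends Stmt {               // built by: stmt -> while ( expr ) stmt1 { stmt.n = new While(expr.n, stmt1.n) }
    final Expr cond; final Stmt body;
    While(Expr cond, Stmt body) { this.cond = cond; this.body = body; }
}

class Eval extends Stmt {                // expression statement: stmt -> expr ;
    final Expr expr;
    Eval(Expr expr) { this.expr = expr; }
}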

59 Compiler Principles Static Checking
• Done by the compiler front end
• Checks that the program follows the syntactic and semantic rules
 – Syntactic checking
  • More than grammar: e.g., declarations, placement of break statements, … are not enforced by the grammar
 – Type checking
  • An operator/function must be applied to the right type of operands
  • e.g., for if (expr) stmt: if (E1.type == E2.type) E.type = Boolean; else error
  • Coercion & overloading
 – Complex checking: analyzing the syntax tree

60 Compiler Principles Three-Address Code
• Three-address code is a sequence of instructions of the form
 x = y op z
• Arrays are handled by using the following two variants of instructions:
 x [ y ] = z
 x = y [ z ]
• Instructions for control flow:
 ifFalse x goto L
 ifTrue x goto L
 goto L
• Instruction for copying a value:
 x = y

61 Compiler Principles Translation of Statements • Use jump instructions to implement the flow of control through the statement. • The translation of

if expr then stmt1

62 Compiler Principles Translation of Statements
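The slide's figure is not reproduced here; the translation of if expr then stmt1 has the usual shape (the label name after is illustrative):

 code to evaluate expr into x
 ifFalse x goto after
 code for stmt1
after:

That is, the condition is computed into a temporary x, a conditional jump skips the statement when the condition is false, and the label after marks the instruction following the if statement.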

63 Compiler Principles Translation of Expressions • Approach: – No code is generated for identifiers and constants – If a node x of class Expr has operator op, then an instruction is emitted to compute the value at node x into a temporary.

• Expression i-j+k translates into
 t1 = i - j
 t2 = t1 + k
• Expression 2 * a[i] translates into
 t1 = a [ i ]
 t2 = 2 * t1
* Do not use a temporary in place of a[i] if a[i] appears on the left side of an assignment.

64 Compiler Principles Functions lvalue and rvalue • In a = a + 1, a is computed differently as an l-value and as an r-value

• Two functions used to distinguish them: – lvalue: generates instructions to compute the subtrees below x, and returns a node representing the “address” for x

– rvalue: generates the instructions to compute x into a temporary, and returns a new node representing the temporary.

• R-values are what we usually think of as "values", while l-values are "locations".
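A small worked example (the temporary name t1 is illustrative): for a = a + 1, calling rvalue on the right-hand side emits code that computes the value into a temporary, while calling lvalue on the left-hand side just returns the location named a.

 t1 = a + 1        (rvalue of the right-hand side: its value, in a temporary)
 a = t1            (lvalue of the left-hand side: the location being assigned)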

65 Compiler Principles Translation of Expressions • Example:

66 Compiler Principles Test Yourself • Generate three-address code for if (x[2*a] == y[b]) x[2*a+1] = y[b+1];

t4 = 2 * a
t2 = x [ t4 ]
t3 = y [ b ]
t1 = t2 == t3
ifFalse t1 goto after
t5 = t4 + 1
t7 = b + 1
t6 = y [ t7 ]
x [ t5 ] = t6
after:

67 Compiler Principles Code Optimization • The code optimizer improves the code produced by the intermediate code generator in terms of time and space.

68 Compiler Principles Code Generation • The code generator takes as input an intermediate representation of the source program and maps it into the target language.

• Example:
 MOVE id3, R1
 MULT #60.0, R1
 ADD id2, R1
 MOVE R1, id1

69 Compiler Principles Issues Driving Compiler Design • Correctness • Speed (runtime and compile time) – Degrees of optimization – Multiple passes • Space • Feedback to user • Debugging

70 Compiler Principles Tools

• Lexical Analysis – Lex, Flex, JLex

• Syntax Analysis – Yacc, JavaCC, SableCC

• Semantic Analysis – Yacc, JavaCC, SableCC

71 Compiler Principles Homework • Reading – Chapters 1 and 2

72 Compiler Principles