
CSc 453: Compilers and Systems Software
1: Compiler Overview -- What does a compiler do?

Department of Computer Science, University of Arizona

[email protected]

Copyright © 2009 Christian Collberg

What's a Compiler?

At the very highest level, a compiler translates a source program from text into some kind of executable code:

    source --> Compiler --> x.s / x.asm / x.class   (or error messages)

A compiler is really a special case of a language translator. A translator is a program that transforms a "program" P1 written in a language L1 into a program P2 written in another language L2. Typically, we want P1 and P2 to be semantically equivalent, i.e. they should behave identically. Often the source code is simply a text file and the output is an assembly language program: gcc -S x.c reads the C source file x.c and generates the assembly code file x.s. Or the output can be virtual machine code: javac x.java produces x.class.

Example Language Translators

    translator    source language    target language
    latex2html    LaTeX              HTML
    ps2ascii      PostScript         text
    f2c           FORTRAN            C
                  C++                C
    gcc           C                  assembly
    SourceAgain   .class             Java
    fx32          x86 binary         Alpha binary

What's a Language?

A language is a formal notation for precisely communicating ideas. By formal we mean that we know exactly which "sentences" belong to the language and that every sentence has a well-defined meaning. A language is defined by specifying its syntax and semantics: the syntax describes how words can be formed into sentences; the semantics describes what those sentences mean.

Example Languages

English is a natural language, not a formal one. The sentence "Many missiles have many warheads." has multiple possible meanings.

    Programming languages: FORTRAN, LISP, Java, C++, ...
    Text processing languages: LaTeX, troff, ...
    Specification languages: VDM, Z, OBJ, ...

This slide is itself written in a text processing language:

    \begin{frame}\frametitle{Example Languages}
    \begin{itemize}
    \item English is a \redtxt{natural}, not a formal language.
          The sentence ...
    \end{itemize}
    \end{frame}

Compiler Input

Text: The common case on Unix systems; the compiler reads the program as a plain text file.
Syntax Tree: A structure editor uses its knowledge of the source language syntax to help the user edit & run the program. It can send a syntax tree to the compiler, relieving it of lexing & parsing.

Compiler Output

Assembly Code: Unix compilers do this. Slow, but easy for the compiler.
Object Code: .o-files on Unix. Faster, since we don't have to call the assembler.
Executable Code: Called a load-and-go compiler. Fast turnaround time.
Abstract Machine Code: Serves as input to an interpreter.
C-code: Good for portability.

Compiler Tasks

Static Semantic Checking: Is the program (statically) correct? If not, produce error messages to the user.
Code Generation: The compiler must produce code that can be executed.
Symbolic Debug Information: The compiler should produce a description of the source program needed by symbolic debuggers. Try man gdb.
Cross References: The compiler may produce cross-referencing information. Where are identifiers declared & referenced?
Profiler Information: Where does my program spend most of its time? Try man gprof.

Compiler Phases

The structure of a compiler:

    ANALYSIS:   Lexical Analysis --> Syntactic Analysis --> Semantic Analysis
                                                                   |
                                                                   v
    SYNTHESIS:  Intermediate Code Generation --> Code Optimization --> Machine Code Generation

Compiler Phases – Lexical analysis

The lexer reads the source file and divides the text into lexical units (tokens), such as:

    reserved words       BEGIN, IF, ...
    identifiers          x, StringTokenizer, ...
    special characters   +, *, -, ^, ...
    numbers              30, 3.14, ...
    comments             (* text *)
    strings              "text"

Lexical errors (such as 'illegal character', 'undelimited character string', 'comment without end') are reported.

Lexical Analysis of English: The sentence

    The boy's cowbell won't play.

would be translated to the list of tokens

    the, boy+possessive, cowbell, will, not, play

Lexical Analysis of Java: The sentence

    x = 3.14 * (9.0+y);

would be translated to the list of tokens

    ID(x), EQ, FLOAT(3.14), STAR, LPAREN, FLOAT(9.0), PLUS, ID(y), RPAREN, SEMICOLON
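A token stream like this can be produced by a small hand-written lexer. The sketch below is only an illustration: the token names (ID, EQ, FLOAT, ...) follow the slide, while the class and method names are invented, and only the characters needed for the example are handled.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal hand-written lexer for expressions like "x = 3.14 * (9.0+y);".
// Sketch only: handles just identifiers, numbers, and a few special
// characters; anything else is reported as a lexical error.
class MiniLexer {
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isLetter(c)) {               // identifier
                int j = i;
                while (j < src.length() && Character.isLetterOrDigit(src.charAt(j))) j++;
                tokens.add("ID(" + src.substring(i, j) + ")");
                i = j;
            } else if (Character.isDigit(c)) {         // number: INT or FLOAT
                int j = i; boolean isFloat = false;
                while (j < src.length()
                        && (Character.isDigit(src.charAt(j)) || src.charAt(j) == '.')) {
                    if (src.charAt(j) == '.') isFloat = true;
                    j++;
                }
                tokens.add((isFloat ? "FLOAT(" : "INT(") + src.substring(i, j) + ")");
                i = j;
            } else {                                   // special characters
                switch (c) {
                    case '=': tokens.add("EQ"); break;
                    case '*': tokens.add("STAR"); break;
                    case '+': tokens.add("PLUS"); break;
                    case '(': tokens.add("LPAREN"); break;
                    case ')': tokens.add("RPAREN"); break;
                    case ';': tokens.add("SEMICOLON"); break;
                    default: throw new IllegalArgumentException("illegal character: " + c);
                }
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [ID(x), EQ, FLOAT(3.14), STAR, LPAREN, FLOAT(9.0), PLUS, ID(y), RPAREN, SEMICOLON]
        System.out.println(tokenize("x = 3.14 * (9.0+y);"));
    }
}
```

Note how lexical errors ('illegal character') fall naturally out of the default case of the character dispatch.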

Compiler Phases – Syntactic analysis

Syntax analysis (parsing) determines the structure of a sentence. The compiler reads the tokens produced during lexical analysis and builds an abstract syntax tree (AST), which reflects the hierarchical structure of the input program. Syntactic errors are reported to the user: 'missing ;', 'BEGIN without END'.

Syntactic Analysis of English: The sentence

    The boy plays cowbell.

would be parsed into the tree

    S
      NP
        Det: The
        N:   boy
      VP
        V:   plays
        NP
          N: cowbell

(S=sentence, NP=noun phrase, VP=verb phrase, Det=determiner, N=noun, V=verb.)

Syntactic Analysis of Java: The sentence

    x = 3.14 * (9.0+y);

would be parsed into the AST

    Assign
      ID: x
      MUL
        FLOAT: 3.14
        ADD
          FLOAT: 9.0
          ID: y

Compiler Phases – Semantic analysis

The AST produced during syntactic analysis is decorated with attributes, which are then evaluated. The attributes can represent any kind of information, such as expression types. The compiler also collects information useful for future phases. Semantic errors are reported: 'identifier not declared', 'integer type expected'.
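The Java AST above can be built by a recursive-descent parser over the token stream. The sketch below is illustrative only (the grammar, class, and method names are invented, and the AST is rendered as a prefix string rather than as node objects), but it shows how operator precedence falls out of the grammar: * binds tighter than + because parseTerm sits below parseExpr.

```java
// Recursive-descent parser producing an AST (as a prefix string) for the
// slide's example "x = 3.14 * (9.0+y);". Sketch with invented names.
class MiniParser {
    private final String[] toks;   // pre-lexed token stream
    private int pos = 0;

    MiniParser(String[] toks) { this.toks = toks; }

    private String peek() { return pos < toks.length ? toks[pos] : "<eof>"; }
    private String next() { return toks[pos++]; }
    private void expect(String t) {
        if (!next().equals(t)) throw new RuntimeException("syntax error: expected " + t);
    }

    // stmt -> ID '=' expr ';'
    String parseStmt() {
        String id = next();
        expect("=");
        String e = parseExpr();
        expect(";");
        return "ASSIGN(" + id + ", " + e + ")";
    }
    // expr -> term ('+' term)*        (only '+' needed for the example)
    String parseExpr() {
        String left = parseTerm();
        while (peek().equals("+")) { next(); left = "ADD(" + left + ", " + parseTerm() + ")"; }
        return left;
    }
    // term -> factor ('*' factor)*
    String parseTerm() {
        String left = parseFactor();
        while (peek().equals("*")) { next(); left = "MUL(" + left + ", " + parseFactor() + ")"; }
        return left;
    }
    // factor -> '(' expr ')' | ID | NUMBER
    // (parentheses guide the parse but leave no node in the AST)
    String parseFactor() {
        if (peek().equals("(")) { next(); String e = parseExpr(); expect(")"); return e; }
        return next();
    }

    public static void main(String[] args) {
        String[] toks = {"x", "=", "3.14", "*", "(", "9.0", "+", "y", ")", ";"};
        // prints ASSIGN(x, MUL(3.14, ADD(9.0, y)))
        System.out.println(new MiniParser(toks).parseStmt());
    }
}
```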

Semantic Analysis of English: In the sentence

    The boy plays his cowbell.

we determine that his refers to the boy.

Semantic Analysis of Java: In

    static float luftballons = 10;
    void P() {
        int luftballons = 99;
        System.out.println(luftballons);
    }

the compiler must determine which luftballons the print statement refers to, and that float luftballons = 10 has the wrong type.

Compiler Phases – IR Generation

From the decorated AST this phase generates intermediate code (IR). The IR adds an extra level of abstraction between the high-level AST and the very low-level assembly code we want to generate. This simplifies code generation. A carefully designed IR also allows us to build compilers for a number of languages and machines, by mixing and matching front-ends and back-ends.
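The name resolution in the luftballons example is typically implemented with a chain of symbol tables, one per scope: look in the innermost scope first, then walk outward. The chained-scope design is standard; the class and method names below are invented for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the scoped symbol-table lookup a semantic analyser uses to
// decide which declaration of "luftballons" a use refers to.
class Scope {
    private final Scope parent;
    private final Map<String, String> symbols = new HashMap<>(); // name -> type

    Scope(Scope parent) { this.parent = parent; }

    void declare(String name, String type) { symbols.put(name, type); }

    // Innermost scope first, then outward; error if never found.
    String lookup(String name) {
        if (symbols.containsKey(name)) return symbols.get(name);
        if (parent != null) return parent.lookup(name);
        throw new RuntimeException("identifier not declared: " + name);
    }

    public static void main(String[] args) {
        Scope global = new Scope(null);
        global.declare("luftballons", "float");   // static float luftballons = 10;
        Scope body = new Scope(global);           // inside void P() { ... }
        body.declare("luftballons", "int");       // int luftballons = 99;
        System.out.println(body.lookup("luftballons"));   // resolves to the inner "int"
        System.out.println(global.lookup("luftballons")); // the outer "float"
    }
}
```

The same lookup failing all the way to the outermost scope is exactly where the 'identifier not declared' error of the previous slide comes from.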

IR Generation of English: From the sentence

    Every boy has a cowbell.

we could generate

    ∀x: boy(x) ⇒ has-cowbell(x)

IR Generation of Java: From the sentence

    x = 3.14 * (9.0+y);

the compiler could generate the (stack-based) IR code

    pusha x, pushf 3.14, pushf 9.0, push y, add, mul, assign

Compiler Phases – Code Optimization

The (often optional) code optimization phase transforms an IR program into an equivalent but more efficient program. Typical transformations include:

    common subexpression elimination: only compute an expression once, even if it occurs more than once in the source;
    inlining: insert a procedure's code at the call site to reduce call overhead;
    algebraic transformations: A := A + A is faster than A := A * 2.
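The stack-based IR for x = 3.14 * (9.0+y); shown above can be produced by a simple postorder walk of the AST: emit the code for the operands, then the operator. The opcodes (pusha/pushf/push/add/mul/assign) are the slide's; the throwaway Node class and all other names are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generating stack-based IR by a postorder AST walk.
class StackIRGen {
    static class Node {
        final String op;          // "assign", "add", "mul", "id", "float"
        final String value;       // identifier name or literal text (leaves only)
        final Node left, right;
        Node(String op, String value, Node left, Node right) {
            this.op = op; this.value = value; this.left = left; this.right = right;
        }
    }

    static void gen(Node n, List<String> out) {
        switch (n.op) {
            case "assign":                    // destination address first, then value
                out.add("pusha " + n.left.value);
                gen(n.right, out);
                out.add("assign");
                break;
            case "id":    out.add("push " + n.value); break;
            case "float": out.add("pushf " + n.value); break;
            default:                          // binary operator: postorder
                gen(n.left, out);
                gen(n.right, out);
                out.add(n.op);
        }
    }

    public static void main(String[] args) {
        // AST for: x = 3.14 * (9.0+y);
        Node ast = new Node("assign", null,
            new Node("id", "x", null, null),
            new Node("mul", null,
                new Node("float", "3.14", null, null),
                new Node("add", null,
                    new Node("float", "9.0", null, null),
                    new Node("id", "y", null, null))));
        List<String> code = new ArrayList<>();
        gen(ast, code);
        // pusha x, pushf 3.14, pushf 9.0, push y, add, mul, assign
        System.out.println(String.join(", ", code));
    }
}
```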

Compiler Phases – Code Generation

The last compilation phase transforms the intermediate code into machine code, usually assembly code or link modules. Alternatively, the compiler generates virtual machine (VM) code, i.e. code for a software-defined architecture. Java compilers, for example, generate class files containing bytecode for the Java VM.

Multi-pass Compilation

The outline of a typical compiler:

    source --> Lexer --tokens--> Parser --AST--> Semantic Analyser
                                                      |
                                                     AST
                                                      v
    asm <-- Machine Code Gen <--IR-- Optimize <--IR-- Interm. Code Gen
    VM  <-- VM Code Gen      <--IR--

(Each phase may also report errors.)

Multi-pass Compilation...

In a Unix environment each pass could be a stand-alone program, and the passes could be connected by pipes:

    lex x.c | parse | sem | ir | opt | codegen > x.s

For performance reasons the passes are usually integrated:

    front x.c > x.ir
    back x.ir > x.s

The front-end does all analysis and IR generation; the back-end optimizes and generates code.

Mix-and-Match Compilers

    FRONTENDS:  Ada    Pascal    Modula-2    C++
                   \      |         |       /
                       (shared IR)
                   /      |         |       \
    BACKENDS:   Sparc   Mips     68000    IBM/370

Example: combining the Ada front-end with the Mips back-end gives an Ada Mips-compiler; the Pascal front-end with the Mips back-end gives a Pascal Mips-compiler; the Pascal front-end with the 68000 back-end gives a Pascal 68k-compiler.

Example

Let's compile the procedure Foo, from start to finish:

    PROCEDURE Foo ();
    VAR i : INTEGER;
    BEGIN
        i := 1;
        WHILE i < 20 DO
            PRINT i * 2;
            i := i * 2 + 1;
        ENDDO;
    END Foo;

The compilation phases are: Lexical Analysis ⇒ Syntactic Analysis ⇒ Semantic Analysis ⇒ Intermediate Code Generation ⇒ Code Optimization ⇒ Machine Code Generation.

Example – Lexical Analysis

Break up the source code (a text file) into tokens:

    Source Code                  Stream of Tokens
    PROCEDURE Foo ();            PROCEDURE, ID(Foo), LPAR, RPAR, SC,
    VAR i : INTEGER;             VAR, ID(i), COLON, ID(INTEGER), SC,
    BEGIN                        BEGIN, ID(i), CEQ, INT(1), SC,
        i := 1;                  WHILE, ID(i), LT, INT(20), DO,
        WHILE i < 20 DO          PRINT, ID(i), MUL, INT(2), SC,
            PRINT i * 2;         ID(i), CEQ, ID(i), MUL, INT(2),
            i := i * 2 + 1;      PLUS, INT(1), SC, ENDDO, SC,
        ENDDO;                   END, ID(Foo), SC
    END Foo;

Example – Lexical Analysis...

We defined the following set of tokens:

    Token       String          Token       String
    PROCEDURE   "PROCEDURE"     LT          "<"
    LPAR        "("             DO          "DO"
    RPAR        ")"             PRINT       "PRINT"
    SC          ";"             MUL         "*"
    VAR         "VAR"           PLUS        "+"
    COLON       ":"             ENDDO       "ENDDO"
    BEGIN       "BEGIN"         END         "END"
    CEQ         ":="            WHILE       "WHILE"
    integer literal             identifier

Example – Syntactic Analysis

The stream of tokens is parsed into an abstract syntax tree:

    PROC-DECL
      VAR-DECL (i)
      ASSIGN
        i
        1
      WHILE
        <
          i
          20
        PRINT
          *
            i
            2
        ASSIGN
          i
          +
            *
              i
              2
            1

Example – Semantic Analysis

The abstract syntax tree is decorated with attributes (identifiers, values, types):

    PROC-DECL id:Foo
      VAR-DECL id:i type:INTEGER
      ASSIGN-STAT
        Des:  VAR-REF id:i                 type:INT
        Expr: LITERAL val:1                type:INT
      WHILE-STAT
        Expr: EXPR op:<                    type:BOOL
          Left:  VAR-REF id:i              type:INT
          Right: LITERAL val:20            type:INT
        Body:
          PRINT-STAT
            Expr: EXPR op:*                type:INT
              Left:  VAR-REF id:i          type:INT
              Right: LITERAL val:2         type:INT
          ASSIGN-STAT
            Des:  VAR-REF id:i             type:INT
            Expr: EXPR op:+                type:INT
              Left: EXPR op:*              type:INT
                Left:  VAR-REF id:i        type:INT
                Right: LITERAL val:2       type:INT
              Right: LITERAL val:1         type:INT

Example – IR Generation

From the decorated abstract syntax tree we generate intermediate code:

    [1] ASSIGN i 1
    [2] BRGE   i 20 [9]
    [3] MUL    t1 i 2
    [4] PRINT  t1
    [5] MUL    t2 i 2
    [6] ADD    t3 t2 1
    [7] ASSIGN i t3
    [8] JUMP   [2]
    [9]

Example – IR Generation...

The intermediate code instructions are defined as follows:

    ASSIGN A, B     A := B
    BRGE A, B, C    IF (A ≥ B) THEN continue at instruction C
    MUL A, B, C     A := B * C
    ADD A, B, C     A := B + C
    SHL A, B, C     A := shift B left C steps
    PRINT A         Print A and a newline
    JUMP A          Continue at instruction A

Example – Code Optimization

    Intermediate Code        Optimized Intermediate Code
    [1] ASSIGN i 1           [1] ASSIGN i 1
    [2] BRGE   i 20 [9]      [2] BRGE   i 20 [8]
    [3] MUL    t1 i 2        [3] SHL    t1 i 1
    [4] PRINT  t1            [4] PRINT  t1
    [5] MUL    t2 i 2        [5] ADD    t2 t1 1
    [6] ADD    t3 t2 1       [6] ASSIGN i t2
    [7] ASSIGN i t3          [7] JUMP   [2]
    [8] JUMP   [2]           [8]
    [9]

The optimizer has applied an algebraic transformation (strength reduction: i * 2 becomes a left shift by 1) and common subexpression elimination (i * 2 is computed once, in t1, and reused).
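The MUL-to-SHL rewrite above is a peephole transformation: look at one instruction at a time and replace it with a cheaper equivalent. A minimal sketch (instruction format as in the listing; class and method names invented; only a multiply whose third operand is the constant 2 is handled):

```java
// Sketch of one peephole/algebraic transformation:
// strength reduction of "MUL A B 2" into "SHL A B 1", since B*2 == B<<1.
class StrengthReduce {
    static String reduce(String instr) {
        String[] p = instr.split(" ");
        // Only rewrites a constant 2 in the third operand; a real optimizer
        // would also handle "MUL A 2 B", other powers of two, etc.
        if (p.length == 4 && p[0].equals("MUL") && p[3].equals("2"))
            return "SHL " + p[1] + " " + p[2] + " 1";
        return instr;                      // everything else is left alone
    }

    public static void main(String[] args) {
        System.out.println(reduce("MUL t1 i 2"));  // SHL t1 i 1
        System.out.println(reduce("ADD t3 t2 1")); // ADD t3 t2 1 (unchanged)
    }
}
```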

Example – Machine Code Generation

    Intermediate Code          MIPS Machine Code
    [1] ASSIGN i 1                   .data
    [2] BRGE   i 20 [8]        _i:   .word 0
    [3] SHL    t1 i 1                .text
    [4] PRINT  t1                    .globl main
    [5] ADD    t2 t1 1         main: li   $14, 1
    [6] ASSIGN i t2            $32:  bge  $14, 20, $33
    [7] JUMP   [2]                   sll  $a0, $14, 1
    [8]                              li   $v0, 1
                                     syscall
                                     addu $14, $a0, 1
                                     b    $32
                               $33:  sw   $14, _i
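A quick way to convince ourselves that the optimized IR still behaves like the source program (i := 1; WHILE i < 20 DO PRINT i*2; i := i*2+1 ENDDO) is to interpret it. The mini-interpreter below is an invented sketch (0-based instruction addresses; PRINT output is collected in a list); running the optimized code prints 2, 6, 14, 30, exactly the values the source loop prints.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a tiny interpreter for the intermediate code defined above.
class IRInterp {
    static List<Integer> run(String[][] code) {
        Map<String, Integer> env = new HashMap<>();  // variables and temporaries
        List<Integer> printed = new ArrayList<>();
        int pc = 0;                                  // 0-based instruction pointer
        while (pc < code.length) {
            String[] in = code[pc];
            switch (in[0]) {
                case "ASSIGN": env.put(in[1], val(env, in[2])); pc++; break;
                case "BRGE":   pc = val(env, in[1]) >= val(env, in[2])
                                   ? Integer.parseInt(in[3]) : pc + 1; break;
                case "SHL":    env.put(in[1], val(env, in[2]) << val(env, in[3])); pc++; break;
                case "ADD":    env.put(in[1], val(env, in[2]) + val(env, in[3])); pc++; break;
                case "PRINT":  printed.add(val(env, in[1])); pc++; break;
                case "JUMP":   pc = Integer.parseInt(in[1]); break;
                default: throw new RuntimeException("bad opcode: " + in[0]);
            }
        }
        return printed;
    }
    // An operand is either an integer literal or a variable name.
    static int val(Map<String, Integer> env, String s) {
        return Character.isDigit(s.charAt(0)) ? Integer.parseInt(s) : env.get(s);
    }

    public static void main(String[] args) {
        String[][] code = {            // the optimized IR, with 0-based targets
            {"ASSIGN", "i", "1"},      // [1] ASSIGN i 1
            {"BRGE", "i", "20", "7"},  // [2] BRGE   i 20 [8]
            {"SHL", "t1", "i", "1"},   // [3] SHL    t1 i 1
            {"PRINT", "t1"},           // [4] PRINT  t1
            {"ADD", "t2", "t1", "1"},  // [5] ADD    t2 t1 1
            {"ASSIGN", "i", "t2"},     // [6] ASSIGN i t2
            {"JUMP", "1"},             // [7] JUMP   [2]
        };                             // [8] falling off the end halts
        System.out.println(run(code)); // [2, 6, 14, 30]
    }
}
```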

Readings and References

Read Louden:
    Introduction, pp. 1–18, 21–27.
or the Dragon Book:
    Introduction, pp. 1–24
    A Simple Compiler, pp. 25–82
    Some Compilers, pp. 733–744
or the Tiger Book:
    Introduction, pp. 1–15

Summary

The structure of a compiler depends on
1. the complexity of the language we're working on (higher complexity ⇒ more passes),
2. the quality of the code we hope to produce (better code ⇒ more passes),
3. the degree of portability we hope to achieve (more portable ⇒ better separation between front- and back-ends),
4. the number of people working on the compiler (more people ⇒ more independent modules).

Summary...

Some highly retargetable compilers for high-level languages produce C-code rather than machine code. This C-code is then compiled by the native C compiler to machine code.

Some languages (APL, LISP, Java, ICON, Awk) are traditionally interpreted (executed in software by an interpreter) rather than compiled to machine code.

Some interpreters use dynamic compilation (jitting), switching between
1. interpreting the virtual machine code,
2. translating the virtual machine code to native machine code,
3. executing the native machine code,
4. optimizing the native and/or virtual machine code, and
5. throwing native code away if it is no longer needed or takes up too much room.
All this is done dynamically at runtime.

Exercise

Consider this little Java class:

    class M {
        public static void main(String args[]) {
            int x = 6;
            while (x != 42)
                x += 6;
        }
    }

1. Show the result after lexical analysis (what tokens do you need to compile Java?).
2. Show the AST after syntax analysis of the token stream (what AST nodes does Java need?).
3. Show the AST after semantic analysis.
4. Define an intermediate code and generate it from the AST.

The First Compiler

FORTRAN I was the first "high-level" programming language. Its designers also wrote the first real compiler and invented many of the techniques that we use today. The FORTRAN manual can be found here: http://www.fh-jena.de/~kleine/history. The excerpt on the next few slides is taken from John Backus, The history of FORTRAN I, II, and III, History of Programming Languages, The first ACM SIGPLAN conference on History of programming languages, 1978.

    Before 1954 almost all programming was done in machine
    language or assembly language. Programmers rightly regarded
    their work as a complex, creative art that required human
    inventiveness to produce an efficient program. Much of their
    effort was devoted to overcoming the difficulties created by
    the computers of that era: the lack of index registers, the
    lack of builtin floating point operations, restricted
    instruction sets (which might have AND but not OR, for
    example), and primitive input-output arrangements. Given the
    nature of computers, the services which "automatic
    programming" performed for the programmer were concerned with
    overcoming the machine's shortcomings. Thus the primary
    concern of some "automatic programming" systems was to allow
    the use of symbolic addresses and decimal numbers...

    Another factor which influenced the development of FORTRAN
    was the economics of programming in 1954. The cost of
    programmers associated with a computer center was usually at
    least as great as the cost of the computer itself. ... In
    addition, from one quarter to one half of the computer's time
    was spent in debugging. ...

    This economic factor was one of the prime motivations which
    led me to propose the FORTRAN project ... in late 1953 (the
    exact date is not known but other facts suggest December 1953
    as a likely date). I believe that the economic need ...
    provided for our constantly expanding needs over the next
    five years without ever asking us to project or justify those
    needs in a formal budget.

    It is difficult for a programmer of today to comprehend what
    "automatic programming" meant to programmers in 1954. To many
    it then meant simply providing mnemonic operation codes and
    symbolic addresses, to others it meant the simple process of
    obtaining a subroutine from a library and inserting the
    addresses of operands into each subroutine. ...

    We went on to raise the question "... can a machine translate
    a sufficiently rich mathematical language into a sufficiently
    economical program at a sufficiently low cost to make the
    whole affair feasible?" ...

    In view of the widespread skepticism about the possibility of
    producing efficient programs with an automatic programming
    system and the fact that inefficiencies could no longer be
    hidden, we were convinced that the kind of system we had in
    mind would be widely used only if we could demonstrate that
    it would produce programs almost as efficient as hand coded
    ones and do so on virtually every job.

    As far as we were aware, we simply made up the language as we
    went along. We did not regard language design as a difficult
    problem, merely a simple prelude to the real problem:
    designing a compiler which could produce efficient programs.
    Of course one of our goals was to design a language which
    would make it possible for engineers and scientists to write
    programs themselves for the 704. ... Very early in our work
    we had in mind the notions of assignment statements,
    subscripted variables, and the DO statement....

    One much-criticized design choice in FORTRAN concerns the use
    of spaces: blanks were ignored, even blanks in the middle of
    an identifier. There was a common problem with keypunchers
    not recognizing or properly counting blanks in handwritten
    data, and this caused many errors. We also regarded ignoring
    blanks as a device to enable programmers to arrange their
    programs in a more readable form without altering their
    meaning or introducing complex rules for formatting
    statements.
    The language described in the "Preliminary Report" had
    variables of one or two characters in length, function names
    of three or more characters, recursively defined
    "expressions", subscripted variables with up to three
    subscripts, "arithmetic formulas" (which turn out to be
    assignment statements), and "DO-formulas".

    Section 1 was to read the entire source program, compile what
    instructions it could, and file all the rest of the
    information from the source program in appropriate
    tables. ...

    Using the information that was filed in section 1, section 2
    faced a completely new kind of problem; it was required to
    analyze the entire structure of the program in order to
    generate optimal code from DO statements and references to
    subscripted variables. ...

    Section 4 would analyze the flow of a program produced by
    sections 1 and 2, divide it into "basic blocks" (which
    contained no branching), do a Monte Carlo (statistical)
    analysis of the expected frequency of execution of basic
    blocks -- by simulating the behavior of the program and
    keeping counts of the use of each block -- using information
    from DO statements and FREQUENCY statements, and collect
    information about index register usage. ... Section 5 would
    then do the actual transformation of the program from one
    having an unlimited number of index registers to one having
    only three. The final section of the compiler, section 6,
    assembled the final program into a relocatable binary
    program...

    Unfortunately we were hopelessly optimistic in 1954 about the
    problems of debugging FORTRAN programs (thus we find on page
    2 of the Report: "Since FORTRAN should virtually eliminate
    coding and debugging...").

    Because of our 1954 view that success in producing efficient
    programs was more important than the design of the FORTRAN
    language, I consider the history of the compiler construction
    and the work of its inventors an integral part of the history
    of the FORTRAN language; ...