5 Programming Language Implementation
Total Page:16
File Type:pdf, Size:1020Kb
5 Programming Language Implementation Program ®le for this chapter: pascal We are now ready to turn from the questions of language design to those of compiler implementation. A Pascal compiler is a much larger programming project than most of the ones we've explored so far. You might well ask, ªwhere do we begin in writing a compiler?º My goal in this chapter is to show some of the parts that go into a compiler design. A compiler translates programs from a language like Pascal into the machine language of some particular computer model. My compiler translates into a simpli®ed, simulated machine language; the compiled programs are actually carried out by another Logo program, the simulator, rather than directly by the computer hardware. The advantage of using this simulated machine language is that this compiler will work no matter what kind of computer you have; also, the simpli®cations in this simulated machine allow me to leave out many confusing details of a practical compiler. Our machine language is, however, realistic enough to give you a good sense of what compiling into a real machine language would be like; it's loosely based on the MIPS microprocessor design. You'll see in a moment that most of the structure of the compiler is independent of the target language, anyway. Here is a short, uninteresting Pascal program: program test; procedure doit(n:integer); begin writeln(n,n*n) end; begin doit(3) end. 209 If you type this program into a disk ®le and then compile it using compile as described in Chapter 4, the compiler will translate the program into this sequence of instructions, contained in a list in the variable named %test: [ [add 3 0 0] [add 4 0 0] [addi 2 0 36] [jump "g1] %doit [store 1 0(4)] [jump "g2] g2 [rload 7 36(4)] [putint 10 7] [rload 7 36(4)] [rload 8 36(4)] [mul 7 7 8] [putint 10 7] [newline] [rload 1 0(4)] [add 2 4 0] [rload 4 3(2)] [jr 1] g1 [store 5 1(2)] [add 5 2 0] [addi 2 2 37] [store 4 3(5)] [store 4 2(5)] [addi 7 0 3] [store 7 36(5)] [add 4 5 0] [rload 5 1(4)] [jal 1 "%doit] [exit] ] I've displayed this list of instructions with some extra spacing thrown in to make it look somewhat like a typical assembler listing. (An assembler is a program that translates a notation like add 3 0 0 into a binary number, the form in which the machine hardware actually recognizes these instructions.) A real assembler listing wouldn't have the square brackets that Logo uses to mark each sublist, but would instead depend on the convention that each instruction occupies one line. The ®rst three instructions carry out initialization that would be the same for any compiled Pascal program; the fourth is a jump instruction that tells the (simulated) computer to skip to the instruction following the label g1 that appears later in the 210 Chapter 5 Programming Language Implementation program. (A word that isn't part of a sublist is a label.) In Pascal, the body of the main program comes after the declarations of procedures; this jump instruction allows the compiler to translate the parts of the program in the order in which they appear. (Two instructions later, you'll notice a jump to a label that comes right after the jump instruction! The compiler issues this useless instruction just in case some internal procedures were declared within the procedure doit. A better compiler would include an optimizer that would go through the compiled program looking for ways to eliminate unnecessary instructions such as this one. The optimizer is the most important thing that I've left out of my compiler.) We're not ready yet to talk in detail about how the compiled instructions represent the Pascal program, but you might be able to guess certain things. For example, the variable n in procedure doit seems to be represented as 36(4) in the compiled program; you can see where 36(4) is printed and then multiplied by itself, although it may not yet be clear to you what the numbers 7 and 8 have to do with anything. Before we get into those details, I want to give a broader overview of the organization of the compiler. The compilation process is divided into three main pieces. First and simplest is tokenization. The compiler initially sees the source program as a string of characters: p, then r, and so on, including spaces and line separators. The ®rst step in compilation is to turn these characters into symbols, so that the later stages of compilation can deal with the word program as a unit. The second piece of the compiler is the parser, the part that recognizes certain patterns of symbols as representing meaningful units. ªOh,º says the parser, ªI've just seen the word procedure so what comes next must be a procedure header and then a begin±end block for the body of the procedure.º Finally, there is the process of code generation, in which each unit that was recognized by the parser is actually translated into the equivalent machine language instructions. (I don't mean that parsing and code generation happen separately, one after the other, in the compiler's algorithm. In fact each meaningful unit is translated as it's encountered, and the translation of a large unit like a procedure includes, recursively, the translation of smaller units like statements. But parsing and code generation are conceptually two different tasks, and we'll talk about them separately.) Formal Syntax De®nition One common starting place is to develop a formal de®nition for the language we're trying to compile. The regular expressions of Chapter 1 are an example of what I mean by a formal de®nition. A regular expression tells us unambiguously that certain strings Formal Syntax De®nition 211 of characters are accepted as members of the category de®ned by the expression, while other strings aren't. A language like Pascal is too complicated to be described by a regular expression, but other kinds of formal de®nition can be used. The formal systems of Chapter 1 just gave a yes-or-no decision for any input string: Is it, or is it not, accepted in the language under discussion? That's not quite good enough for a compiler. We don't just want to know whether a Pascal program is syntactically correct; we want a translation of the program into some executable form. Nevertheless, it turns out to be worthwhile to begin by designing a formal acceptor for Pascal. That part of the compilerÐthe part that determines the syntactic structure of the source programÐis called the parser. Later we'll add provisions for code generation: the translation of each syntactic unit of the source program into a piece of object (executable) program that carries out the meaning (the semantics) of that unit. One common form in which programming languages are described is the production rule notation mentioned brie¯y in Chapter 1. For example, here is part of a speci®cation for Pascal: program : program identi®er ®lenames ; block . ®lenames : | ( idlist ) idlist : identi®er | idlist , identi®er block : varpart procpart compound varpart : | var varlist procpart : | procpart procedure | procpart function compound : begin statements end statements : statement | statements ; statement procedure : procedure identi®er args ; block ; function : function identi®er args : type ; block ; A program consists of six components. Some of these components are particular words (like program) or punctuation marks; other components are de®ned in terms of even smaller units by other rules.* * The ®lenames component is an optional list of names of ®les, part of Pascal's input/output capability; my compiler doesn't handle ®le input or output, so it ignores this list if there is one. 212 Chapter 5 Programming Language Implementation A vertical bar (|) in a rule separates alternatives; an idlist (identi®er list) is either a single identi®er or a smaller idlist followed by a comma and another identi®er. Sometimes one of the alternatives in a rule is empty; for example, a varpart can be empty because a block need not declare any local variables. The goal in designing a formal speci®cation is to capture the syntactic hierarchy of the language you're describing. For example, you could de®ne a Pascal type as type : integer | real | char | boolean | array range of integer | packed array range of integer | array range of real | ... but it's better to say type : scalar | array range of scalar | packed array range of scalar scalar : integer | real | char | boolean Try completing the syntactic description of my subset of Pascal along these lines. You might also try a similar syntactic description of Logo. Which is easier? Another kind of formal description is the recursive transition network (RTN). An RTN is like a ®nite-state machine except that instead of each arrow representing a single symbol in the machine's alphabet, an arrow can be labeled with the name of another RTN; such an arrow represents any string of symbols accepted by that RTN. On this page and the next I show two RTNs, one for a program and one for a sequence of statements (the body of a compound statement). In the former, the transition from state 5 to state 6 is followed if what comes next in the Pascal program is a string of symbols accepted by the RTN named ªblock.º In these diagrams, a word in typewriter style like program represents a single symbol, as in a ®nite-state machine diagram, while a word in italics like block represents any string accepted by the RTN of that name.