A Parallel Virtual Machine for
Efficient Scheme Compilation

Marc Feeley and James S. Miller

Brandeis University
Waltham, MA

This research was supported in part by the Open Software Foundation, the Hewlett-Packard Corporation, and an NSF CDA equipment grant. Marc Feeley is on study leave from the Université de Montréal.

Abstract

Programs compiled by Gambit, our Scheme compiler, achieve performance as much as twice that of the fastest available Scheme compilers. Gambit is easily ported, while retaining its high performance, through the use of a simple virtual machine (PVM). PVM allows a wide variety of machine-independent optimizations, and it supports parallel computation based on the future construct. PVM conveys high-level information bidirectionally between the machine-independent front end of the compiler and the machine-dependent back end, making it easy to implement a number of common back end optimizations that are difficult to achieve for other virtual machines.

PVM is similar to many real computer architectures and has an option to efficiently gather dynamic measurements of virtual machine usage. These measurements can be used in performance prediction for ports to other architectures, as well as design decisions related to proposed optimizations and object representations.

Introduction

Our primary interest is in efficient mechanisms for implementing future-based symbolic computation on currently available MIMD machines. Having already done work in this area using the Scheme language augmented with the future mechanism, we are now extending our interpreter-based results into the realm of compiled Scheme code. For this purpose we undertook the implementation of a new Scheme compiler, Gambit, with the intention of creating a simple environment for experiments across a wide range of hardware platforms and over a range of implementation techniques. The major design goals for Gambit, from the outset, were:

- Code generation for multiple target machines, spanning both common CISC computers (DEC Vax, Motorola MC68020) and RISC computers (HP Precision Architecture (HPPA), MIPS, Motorola 88K, BBN Monarch). For our purposes it was important that retargetting the compiler be simple and yet still yield a high performance system. We rejected existing compiler-based Scheme systems (T with Orbit, CScheme with Liar) mainly because of the difficulty of retargetting and modifying the compilation strategy of these large systems.

- High performance of output programs. We are not concerned with program development features. For example, we do not allow the user to interrupt execution of a program other than by aborting it.

- Support for task creation and synchronization through implicit data operations, possibly augmented by control constructs. The future construct provides these features compatibly with most other features of the Scheme language and was therefore our initial focus. We are also interested in exploring other parallel control and data constructs.

While the first and second goals are somewhat at odds with one another, we believe that architectural comparisons and architecture-independent implementation techniques will be among the important results from our research. We have therefore chosen to build a compiler based on an easily retargetted virtual machine, even though it may result in less efficient compiled code. Fortunately, our experience with Gambit indicates that a well chosen virtual machine does not result in any noticeable performance penalties.

PVM: A Parallel Virtual Machine

In designing our virtual machine we tried to avoid a pair of twin hazards that we have seen in other virtual machines used for compilation. On the one hand, there are virtual machines, like MIT's scode or the code objects of UMB Scheme, that are so close to the source language that the machine-independent front end of the compiler is unable to express important optimizations in the virtual machine's instruction set. This places a major burden on the back end, which becomes responsible for analysis of the virtual machine code, a task very nearly as difficult as the original compilation task. On the other hand, there are virtual machines, like Multilisp's mcode or Scheme 84's byte code, that match neither the actual target machine nor the source language. The result is either a complex back end that again attempts to recover data and control flow information from the virtual machine, or a simple back end that produces poor code.

Our Parallel Virtual Machine, or PVM, is intended to fall in between these kinds of machines. We allow each back end to specify a wide range of architectural details of the virtual machine, including a description of the primitive procedures available on the target machine and the number of general purpose registers. As a result we can think of PVM as a set of virtual machines, depending on the back end description that is used. Each specific virtual machine is close to its target machine, yet the common abstraction hides the precise details from the front end. PVM also remains close to the source language, since its small instruction set closely matches the Scheme language itself.

PVM can be viewed as a bidirectional communication medium between the front and back ends of the compiler. The traditional role of a virtual machine, of course, is to convey information from the front end to the back end. PVM, however, conveys information in the reverse direction as well:

- The number of general registers.
- The procedure calling convention.
- The format of closures.
- Enumeration and description of primitive procedures.
- Machine-specific declarations.

We view this bidirectional communication as an important component of Gambit's organization. The communication is supported by a language for describing implementation-level objects, which is the basis of the PVM abstraction. Four types of objects are manipulated using this language: primitive procedures, data objects, stack frames, and argument/parameter blocks. Corresponding to each of these is a means of reference: the name of the primitive procedure, slots within a data structure, slots within a stack frame, and argument/parameter number. This particular level of abstraction is convenient for both the front and back ends. For example, both the back and front ends agree to discuss stack slots as positive integers, in units of Scheme objects, increasing as objects are pushed on the stack. This is clearly convenient for the front end, and the back end can easily translate this into appropriate offsets from a base register, taking into account the number of bytes per argument, the direction of stack growth, and the choice of stack discipline on the target machine.

Operands

PVM has seven classes of operands, as shown in Figure 1, which naturally divide storage into disjoint areas: registers, current stack frame, global variables, heap storage, constant area, and code area. This makes it easy to track values and, with the exception of mem operands, removes the traditional aliasing problem.

  Operand            Meaning
  reg(n)             General purpose register n
  stk(n)             Nth slot of the current stack frame
  glob(name)         Global variable
  mem(base, offset)  Indexed reference; base is an operand, offset is a constant
  obj(object)        Constant
  lbl(n)             Program label
  annotated loc      Parallelism support (see the section on parallelism in PVM)

  Figure 1: PVM Operands

Neither the stack nor the heap pointer is directly visible. Instead, the stack is accessible by indexing off of a virtual frame base pointer that is modified as part of the procedure call mechanism. The heap is accessed implicitly when allocating objects, and explicitly by indexing relative to existing heap-allocated objects. By making the stack pointer and heap pointer invisible we allow the back end to make a number of optimizations based on the target architecture.

The mem operand, which gives access to heap storage, allows nesting of other operands in its base component. Our front end, however, uses it only for access to closed variables; we leave other data structure accesses to the more general APPLY instruction. As a result, the ability to nest other operands within mem is not actually in use, although the back ends support it.

Finally, we note that all operands can be used as the source of values. However, values cannot be stored into obj, lbl, or annotated operands.

Instructions for Sequential Computation

The PVM instruction set provides a small set of general instructions to efficiently encode the operation of Scheme programs. Like many compilers, Gambit represents the program as a set of basic blocks. This representation is apparent in the PVM code. Each basic block is headed by a code label, followed by the code for the data operations in the block, and ends with a branch instruction. Our current instruction set for sequential computation consists of four kinds of code labels, three data manipulating instructions, and three branch instructions.

An important part of Gambit's communication mechanism is the description of a set of procedures, known as primitives, that are supported by the back end. All primitives are available through the general procedure call mechanism, but some can also be open coded by the APPLY and COND instructions. The front end requires the back end to supply a specific minimal set of primitive operations, but the back end can in fact specify any procedure as a primitive. The description of each primitive indicates its arity and strictness. It also indicates whether it can be open coded and whether it can return a placeholder as a value. Thus list has unbounded arity, is not strict in any argument, and never returns a placeholder, while set-car! has arity two, is strict in its first argument but not its second, and never returns a placeholder.
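The exact form in which a back end describes its primitives is internal to Gambit; the following Scheme sketch, with a representation and field names of our own choosing, merely illustrates the kind of information just described, using the list and set-car! examples above.

  ;; Illustrative sketch only: the representation and field names are ours,
  ;; not Gambit's.  Each entry records arity, the argument positions in which
  ;; the primitive is strict, whether it may be open coded by APPLY/COND,
  ;; and whether its result can be a placeholder.
  (define primitive-descriptions
    (list
     ;; name       arity  strict-args  open-code?  placeholder-result?
     (list 'list     #f    '()          #f          #f)   ; #f arity = unbounded
     (list 'set-car! 2     '(1)         #t          #f)   ; strict in arg 1 only
     (list 'cons     2     '()          #t          #f)))

  (define (primitive-strict-in? name arg-number)
    (let ((entry (assq name primitive-descriptions)))
      (and entry (memv arg-number (list-ref entry 2)) #t)))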

PVM's handling of stack frames is unusual and is described in the section on stack frames below. The size parameter to the label and branch instructions is used to support this mechanism and is described in detail in that section.

The description of the sequential PVM instructions follows. Figure 2 shows a simple program (iterative factorial) along with its PVM code and the code generated for the MC68020. A comparison to code from other compilers is included in Appendix A.

  (declare (standard-bindings) (fixnum))

  (define (fact n)                       ; Iterative factorial
    (let loop ((i n) (ans 1))
      (if (= i 0)
          ans
          (loop (- i 1) (* ans i)))))

  [Figure 2: Sample Program with Gambit Output -- the fact procedure above,
  its PVM code, and the MC68020 code emitted for it.]
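For concreteness, the procedure counts i down from n while accumulating the product in ans:

  (fact 5)   ; => 120
  (fact 0)   ; => 1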

LABEL(n) size
    A simple label n, which may appear only in JUMP and COND instructions.

LABEL(n) size CONT
    A continuation label, similar to a return address in other architectures. These are not the objects returned to a Scheme program that calls call-with-current-continuation, but are an architectural feature to support fully tail-recursive behavior.

LABEL(n) size PROC desc
    A procedure label, with desc defining the number (or range) of arguments expected. This instruction modifies the stack and registers to account for any discrepancy between desc and the number of arguments actually passed, as specified by the JUMP instruction used to arrive here.

LABEL(n) size PROC desc CLOSED
    A closure label, similar to a procedure label, but the label n may appear only within a closure object created by MAKE-CLOSURES.

APPLY(prim, operand1, ...) loc
    Apply the primitive procedure prim to the operands and store the result in loc, or discard it if loc is not specified.

COPY(operand) loc
    Copy the value of operand to the location specified by loc.

MAKE-CLOSURES(description1, ..., descriptionn)
    Create n closure objects. A closure object contains a code label and a number of data slots. Each description specifies a location into which a closure object will be stored, the closure label for the code of that closure, and operands to be stored in the closure's data slots.

COND(prim, operand1, ...) tlbl flbl size
    A conditional branch based on the value of prim applied to the operands: on false, branch to flbl; otherwise to tlbl. Both tlbl and flbl must specify simple labels.

JUMP(operand) size
    Jump to the simple or continuation label specified by the value of operand.

JUMP(operand) size nargs
    Jump to the address specified by operand. This instruction also states that nargs arguments have been placed in the appropriate argument passing locations. The value of operand must be either a procedure label, a closure object, or a primitive procedure object.

Stack Frames

PVM deals with the stack frame in a novel manner, supplying the current stack frame size in the LABEL, COND, and JUMP instructions. Our approach avoids the problems inherent in using virtual machines either based purely around a top-of-stack pointer or based purely upon a frame pointer. Using a stack pointer leads to varying offsets for a given stack slot, and inefficient code on machines lacking explicit stack instructions. Using only a frame base leaves the top of stack unknown at garbage collection time and requires update instructions on entry and exit of every basic block.

While the actual instruction set of PVM makes use of a frame pointer and frame size information, we prefer to think of the machine as having both a stack pointer and a frame pointer. Since the frame size always specifies the distance between the stack pointer and the frame pointer, either pointer can be recomputed from the other. JUMP and COND instructions cause the stack pointer to be recalculated, while LABEL instructions recalculate the frame pointer. Within a basic block the stack pointer is updated based on the offsets of stk operands encountered, so that it always covers the active part of the stack frame.

The choice between stack pointer and frame pointer discipline is specific to the back end (see the section on back end optimizations). We take advantage of the fact that our front end produces the PVM code for an entire basic block before beginning native code generation. For each instruction, the front end calculates a tight bound on the size of the stack frame, using knowledge of which slots are referenced between the current instruction and the end of the block. It supplies this information to the back end, which can then easily implement any one of four mechanisms: the two-pointer model above, or a single-pointer model (frame base, frame top, or stack). The single-pointer models are derived from the two-pointer model by realizing that:

- Both the frame's current size and its size at entry to the current basic block are known at code generation time.
- These frame sizes, along with any single pointer, specify the other two pointers.
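Stated as a minimal sketch (assuming, purely for illustration, a stack that grows toward higher addresses and sizes measured in Scheme-object slots; the real back ends pick the direction and units themselves), the invariant lets either pointer be recovered from the other and the current frame size:

  ;; At a JUMP or COND the stack pointer is recovered from the frame pointer;
  ;; at a LABEL the frame pointer is recovered from the stack pointer.
  (define (sp-at-branch fp frame-size) (+ fp frame-size))
  (define (fp-at-label  sp frame-size) (- sp frame-size))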

Calling Convention

A second novel aspect of our virtual machine design is the implementation of the calling convention. PVM itself imposes no specific mechanism, but allows the back end to choose an appropriate mechanism for general procedure calls. The front end will generate PVM instructions for procedure calls that load argument values into registers or stack locations, as specified by the back end. At procedure and closure labels, the back end is responsible for emitting code, if necessary, to move arguments from their placement by the caller to the location required by the callee. This is based on the number of arguments actually passed at runtime compared with the number of parameters required by the procedure.

In cases where the front end can analyze a procedure call sufficiently to avoid the general mechanism, it can produce optimized code by using simple labels, rather than procedure or closure labels, as the target address. Unlike a procedure label, a simple label implies no stack reformatting operations. Thus, the calling convention used for jumps to simple labels is essentially under the control of the front end, while for jumps to procedure or closure labels it is under the control of the back end.

The back end specifies, in particular:

- Where arguments are passed, based on the number of arguments in the call. This is used by the front end to generate the code prior to a JUMP instruction. Our front end restricts the choices to combinations of registers and stack slots.

- Which register contains the value returned by procedures.

- Where the parameters are located after a LABEL instruction is executed. Since procedure and closure labels modify the stack and registers, the back end specifies where values are located after this reorganization has taken place. For a closure label, the back end also specifies the location of the pointer to the closure object, so that the front end can generate PVM instructions to access its data slots. Our front end also restricts these to be registers or stack slots.

The back end also decides how the argument count is passed from the JUMP instruction to the destination procedure or closure label. This decision is internal to the back end, since it is needed only to accomplish moving arguments from the locations where they are passed by the caller to the locations where they are expected by the destination code. All of this code is included in the back end's expansion of the JUMP and LABEL instructions.

First-Class Procedures

Since Scheme programs often make use of first-class procedures, we take a short digression to discuss the mechanism Gambit uses to implement them. In general, procedures carry with them the values of any free variables that they reference, and we use the traditional name closure to refer to the representation used when a procedure references one or more free variables. Procedures with no references to free variables can be represented simply by the address of the code that implements them: in PVM, either a primitive procedure object or a procedure label.

Gambit allocates closures on the heap. They consist of a back-end dependent header region (typically instructions executed when the closure is invoked) followed by storage for the values of the free variables needed by the procedure. Each entry in the storage area contains either the value of a free variable, if it is known to be immutable, or a pointer to the variable's storage location. Figure 4 shows the closure object created from the code shown in Figure 3 (see Appendix A for further implementation details).

The runtime storage allocation required by closures is expensive compared to other procedure representations, and Gambit attempts to minimize the number of closures that are created. The front end performs a combined data and control flow analysis to discover all procedure calls that invoke a given procedure. If all calls can be located, standard lambda lifting is performed; the net effect is to add the free variables to the parameter list of the procedure and to modify all of the procedure calls to pass these values as arguments. The procedure then has no free variable references and is represented as a procedure label.

A second technique, used to minimize the size of closures and possibly eliminate them entirely, is to subdivide the free variables that are referenced. References to global variables do not need to be stored in the closure, since their values are directly accessible at runtime (Gambit supports only one top level environment). Similarly, variables that are known to have constant values, either because of a declaration or from data flow analysis, can be eliminated from the list of free variables that must be stored in the closure. Thus the storage area of a closure contains values for the formal parameters of lexical parents which are referenced by the body of the procedure and which the compiler cannot infer to have constant values.
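Lambda lifting, as described above, can be illustrated at the source level (the example is ours):

  ;; Before: loop references the free variable f and would need a closure.
  (define (count-matching f lst)
    (define (loop lst n)
      (if (null? lst)
          n
          (loop (cdr lst) (if (f (car lst)) (+ n 1) n))))
    (loop lst 0))

  ;; After lambda lifting: f becomes an explicit parameter and every (known)
  ;; call site passes it along, so loop has no free variables and can be
  ;; compiled to code reached by a simple or procedure label.
  (define (count-matching f lst)
    (define (loop f lst n)
      (if (null? lst)
          n
          (loop f (cdr lst) (if (f (car lst)) (+ n 1) n))))
    (loop f lst 0))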

Closures are created by the MAKE-CLOSURES instruction. This instruction allows multiple closures to be made simultaneously, to provide for mutually recursive procedures. Considering the creation of the closures to occur atomically, with respect in particular to garbage collection, allows for efficient implementation in some back ends. To make this more concrete, consider the Scheme program make-adder, shown with its PVM code in Figure 3:

  (define (make-adder x)
    (lambda (y) (+ y x)))

The PVM code for the body of make-adder includes a MAKE-CLOSURES instruction that creates one closure object (shown in Figure 4) and stores it on the stack. The closure contains space for one value, initialized from the contents of the register holding the value of x, and the closure label for the body of the lambda expression.

The second form of the JUMP instruction is used for calling procedures and specifies the number of arguments being passed. The back end is responsible for emitting code that stores this argument count and arrives at the appropriate destination address. In the case of a closure, the destination is encoded in the closure object itself, in a back-end dependent manner, by the MAKE-CLOSURES instruction. Thus the back end must arrange for a jump to a closure to be indirect, whereas a jump to a simple procedure is direct. Furthermore, the address of the closure itself must be made available to the code at the closure label, since it is needed to reference the values of the free variables stored in the closure.

While PVM does not further specify the interface between the JUMP instruction and the destination LABEL, all of our back ends have made the same implementation decision. As shown in Figure 4, the header of our closure objects is a short instruction sequence that jumps to the destination label and stores the address of the closure's data area into a known register, using the target machine's jump-and-link instruction (JSR on the MC68020).

  [Figure 3: Make-adder, a closure generator -- the PVM code and corresponding
  MC68020 code for the body of make-adder (which allocates and initializes the
  closure, stores it on the stack, and performs a GC check) and for the body of
  the lambda expression (which fetches x from the closure's data slot and adds
  it to y).]

  [Figure 4: Closure for make-adder on the MC68020 -- a heap object consisting
  of a length-and-type header word, an unused word, a JSR opcode followed by
  the code address, and one data slot holding the value of x.]
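Used from Scheme, the closure created by make-adder behaves as an ordinary procedure; each call enters through the closure header just described (the example is ours):

  (define add3 (make-adder 3))
  (add3 4)   ; => 7: control enters via the JSR in the closure header, which
             ;       makes the closure's address, and hence the stored value
             ;       of x, available to the code at the closure label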

Declarations

Like most other Scheme implementations, Gambit provides a declaration mechanism that allows programmers to tell the compiler that it may violate certain assumptions of the basic language. For example, in Gambit the declaration standard-bindings allows the compiler to assume that references to global variables with the names of the primitive operations are in fact references to those primitives. This allows the front end to generate an APPLY, COND, or JUMP instruction that references the primitive directly, rather than referencing a global variable as required by the language definition. Similarly, the fixnum declaration allows the compiler to generate code for the standard numeric operations that assumes all numbers are small integers, and suppresses overflow detection.

Some of these declarations, like standard-bindings, are relevant only to the front end and are available with all back ends. Other declarations, like fixnum, are meaningful to only some back ends. In Gambit we permit the back end to affect the code emitted by the front end, based on the current set of declarations as maintained by the front end. For example, a numeric primitive such as + might be usable in an APPLY instruction if either the declaration fixnum or flonum is in effect. In this case the front end asks the back end what primitive could be used instead, specifying the declarations that are currently in effect. The back end responds with either a flonum-specific or a fixnum-specific version of the primitive, or simply the original primitive if no other operation is available.
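As an illustration of how these declarations appear in source code (the procedure is our own example; standard-bindings and fixnum are the declarations discussed above):

  (declare (standard-bindings) (fixnum))

  ;; With standard-bindings, the references to +, < and vector-ref below may
  ;; be treated as the corresponding primitives; with fixnum, the arithmetic
  ;; may be open coded as small-integer operations with no overflow checks.
  (define (vector-sum v)
    (let loop ((i 0) (acc 0))
      (if (< i (vector-length v))
          (loop (+ i 1) (+ acc (vector-ref v i)))
          acc)))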

Parallelism in PVM

We have introduced so far the sequential subset of PVM. One of our major goals, however, is to efficiently support the future mechanism. In this mechanism a parent spawns a child task, and uses a placeholder to allow the parent task to refer to the value being computed by the child. In earlier systems supporting futures there is a major cost associated with spawning a task, arising from the need to create a separate thread of control and a placeholder at the time the child task is spawned. PVM has three additional instructions and one operand type to make future-based parallel computation efficient. Our model is inspired by conversations with Halstead, based on a brief mention in the Mul-T paper.

LABEL(t) size TASK w
    Define a task label t that marks the beginning of a task. A task label can be used in place of a simple label. A jump to a task label, however, spawns a new (possibly parallel) task to execute the code between the task label and its corresponding DONE instruction. The label w is where the parent task continues execution after the new task is spawned.

LABEL(w) size WORK
    Define a work label w that specifies where a task should resume execution after it spawns a new task.

DONE
    End the current task and deliver the result.

These three instructions can be translated by the back end to provide the same future mechanism used by earlier systems, or to provide lazy futures. Lazy futures treat task spawning as a special kind of procedure call. When a task is called, it leaves a marker on the stack so that another processor can recreate the parent task; in PVM this is performed by the task label. The processor is now effectively executing on behalf of the child task, and the parent task is suspended.

Should another processor decide to resume the parent task, that processor splits the stack at the marker, allocates a placeholder, and begins executing the code of the parent task, using the placeholder to represent the value computed by the child task. PVM provides no direct support for this operation. Instead, a procedure is supplied by the runtime system that understands the format of stack markers and the code supporting task termination.

When control in the child task returns to the stack marker created by a task label, the child will either return as a normal procedure (if no other processor resumed the parent task) or store its result in the placeholder and look for some other processor's parent task to resume. In PVM this occurs when a DONE instruction is executed.

In addition to this support for spawning and terminating tasks, PVM provides support for the underlying placeholder data type through the use of an operand annotation. When Gambit compiles code for a parallel back end, it places this annotation around appropriate operands that are potentially placeholders. "Appropriate operands" includes the strict operands, as specified by the back end, in APPLY and COND instructions, as well as the destination operand of JUMP instructions. By using information supplied by the back end, the front end can determine whether the result of a primitive procedure can be a placeholder; this information is then used to suppress generation of the annotation in references to the value.
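A small source-level example of the construct these instructions support (the procedure is our own illustration; no explicit touch is needed because the argument positions of + are strict and therefore touch a placeholder automatically):

  ;; Sum the leaves of a pair tree, evaluating the left subtree in a
  ;; (possibly parallel) child task.  The value of left may be a placeholder;
  ;; + is strict in both arguments, so it waits for the child if necessary.
  (define (tree-sum t)
    (if (pair? t)
        (let ((left (future (tree-sum (car t)))))
          (+ (tree-sum (cdr t)) left))
        t))

  (tree-sum '((1 . 2) . (3 . 4)))   ; => 10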

Optimization Techniques

Gambit employs a number of standard optimizations in both the front and back ends. This section enumerates the current set of optimizations without further discussion, primarily for completeness. We expect to add additional optimizations in the future.

Front End Optimizations

- Preferentially allocating temporary values to registers.
- Using a direct JUMP to a simple label for calling known procedures.
- Tracking multiple homes for variables.
- Keeping values in registers as long as possible, by tracking register contents and saving them on the stack lazily. This entails merging variable home information around conditional branches.
- Lambda lifting.

Optimizing the PVM Code

These optimizations are performed on the PVM code itself and are completely independent of both the source language and the target machine.

- Branch cascade removal, by replacing a branch with the instruction at the destination (sketched below).
- Reordering basic blocks to maximize the number of fall-throughs.
- Dead code removal.
- Common code elimination.
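To make the first of these concrete, here is a small sketch of branch cascade removal (the block representation is our own, not Gambit's data structures): when a jump's target block consists of nothing but another unconditional jump, the branch is replaced by the branch found at the destination.

  ;; A block is (label instr ...) and an unconditional branch is (jump label).
  ;; Assumes no cycle of jump-only blocks.
  (define (resolve-target label blocks)
    (let ((body (cdr (assq label blocks))))
      (if (and (pair? body)
               (null? (cdr body))
               (eq? (car (car body)) 'jump))
          (resolve-target (cadr (car body)) blocks)
          label)))

  (define (remove-branch-cascades blocks)
    (map (lambda (block)
           (cons (car block)
                 (map (lambda (instr)
                        (if (eq? (car instr) 'jump)
                            (list 'jump (resolve-target (cadr instr) blocks))
                            instr))
                      (cdr block))))
         blocks))

  (remove-branch-cascades
   '((L1 (copy r1 r2) (jump L2))
     (L2 (jump L3))
     (L3 (apply + r1 r2 r1) (jump L1))))
  ;; => ((L1 (copy r1 r2) (jump L3)) (L2 (jump L3)) (L3 (apply + r1 r2 r1) (jump L1)))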

Back End Optimizations

In addition to the traditional back end optimizations (e.g. branch tensioning), Gambit makes use of its stack discipline abstraction to optimize the allocation and deallocation of stack frames. It is easy for the front end to use stk operands in an exclusively stack-like manner, i.e. to store only into slots into which it has already stored, or into the next higher slot. The front end does this, and consequently, on machines with push and pop instructions like the MC68020, the back end incrementally allocates the frame by pushing values on the stack as the slot number in stk operands increases. Similarly, the frame is incrementally deallocated by popping values when the frame size decreases.

On machines lacking these instructions, such as the MIPS and HPPA, the back end uses a frame top pointer implementation. This allows the frame to be allocated or deallocated with a single instruction at the end of each basic block, conveniently filling the branch delay slot on these machines.

Other approaches

We have examined three kinds of virtual machines typically used in implementing Lisp systems: byte codes, syntax trees, and register transfer languages. PVM belongs to this last class, and represents a particular engineering approach to the design of such an intermediate language. This section compares PVM to other virtual machines used by the Scheme community.

Byte Codes

There are a number of well known Scheme implementations based on a byte code interpreter: Indiana University's Scheme 84 and its descendants, MacScheme and Texas Instruments' PC Scheme, and Halstead's Multilisp. In the interpreted systems for which they were developed, byte coding provides two important features: speed of dispatch on most hardware platforms, and code space compression if the opcodes are based on static instruction frequency statistics.

As an intermediate representation for compilation, however, byte code leaves much to be desired. First, all of the byte code systems mentioned above are based on a pure stack machine. Since many important hardware platforms are not stack based, the process of compiling native code from the byte code requires recovering the higher-level information about intermediate values that was removed in generating the byte code. Furthermore, the creation of the byte code program does not produce information about variable referencing patterns, and this is essential to permit efficient use of hardware registers in the equivalent compiled code.

Code Trees

In interpreted Scheme systems that implement much of the system code in Scheme, byte coding is problematic, since the byte coded programs have no natural representation within Scheme itself (aside, of course, from byte strings). An appealing alternative is to represent a program as a syntax tree whose components are very similar to the pairs and vectors of standard Scheme. This approach is taken in MIT's CScheme (scode, derived from the actual instruction set of the Scheme VLSI chip) and the University of Massachusetts at Boston (UMB) Scheme system. The type code of each node in the tree is derived from the syntactic expression (special form or combination) it represents in the program. The leaves of the tree are constants and variable references.

This representation is easier to deal with in a Scheme program than the byte codes, and faster to interpret than the original list structure of the program as provided by read. One of the major advantages of a code tree representation is that it can be easily converted back into a program equivalent to the original Scheme program from which it was derived. Systems can and do use this equivalence for a variety of debugging tools, such as code inspectors and pretty printers. This very fact, however, argues against the syntax tree as a good intermediate code for compilation: the representation provides no information about commonly referenced variables, nor any results of data or control flow analysis.

Register Transfer Languages

Neither of the earlier representations were envisioned as intermediate representations for compilation, and so it is not surprising that they serve this purpose rather poorly. We are familiar with three intermediate languages designed specifically for this purpose, however: MIT's RTL (register transfer language), LeLisp's LLM3, and PSL's cmacros.

MIT's RTL is an ad hoc language, evolved from a machine description language through a version used in an early compiler, and now part of the Liar compiler. Presently RTL consists primarily of ASSIGN commands (similar to the COPY and APPLY commands of PVM), TEST commands (similar to the COND of PVM), special purpose instructions to call compiler support routines in the runtime system, frame adjust commands, and commands to generate procedure headers.

Internally, Liar, like Gambit, has a back end module that provides a description of the target machine to the front end of the compiler. Liar's front end is responsible for more compilation decisions than Gambit's, and consequently the description is at a much lower level of detail. It consists of information about the addressing granularity of the machine, the number of bits used for type codes and data, and mappings from the front end's special purpose registers to the target machine's physical registers. The front end of Liar relies on, and directly manipulates, four virtual machine registers: the dynamic link register (for return addresses), a stack pointer, a free pointer into the heap, and a pointer to a set of memory locations (C variables) shared with the interpreter. In addition, the front end supports the notion of register sets, by providing the back end with general purpose procedures for allocating, deallocating, and liveness tracking for groups of registers specified by the back end. This is used, for example, to allow the back end to separate the use of general purpose and floating point coprocessor registers.

The primary interface between the front and back ends of Liar is through a rule-based language. The front end generates RTL instructions that are matched against the rules provided by the back end. This permits the precise nature of Liar's virtual machine to remain undefined while still enabling a variety of back ends to be written. Unfortunately, as the front end changes, the RTL instructions it emits change, and the rule sets of each back end must be examined and modified individually. PVM's regular structure, on the other hand, allows the construction of a back end that handles the complete virtual machine and is thus isolated from many changes to Gambit's front end. PVM implementations (i.e. back ends for Gambit) have similar structures, since they are case dispatches on the PVM instruction and operand types; thus updating a back end to accommodate changes in PVM itself is straightforward.

The register transfer language LLM3, developed at INRIA for the language LeLisp, is a much larger language than either PVM or RTL. It has a very large instruction set, including a number of redundant instructions, providing control over aspects as diverse as garbage collection and file system operations. Implementing such a machine is a major undertaking, and not suitable for our environment, where a quick port to a new architecture or operating system is essential. Furthermore, the low level of control specified by LLM3 requires the front end to be more elaborate than we would like, and leaves little room for optimizations by the back end.

Current Status

At the time of writing we have completed the Gambit implementation for the MC68020 and are in the final debugging stages of a port to the MIPS machines. A port to the HPPA is also nearing completion. Preliminary performance figures comparing Gambit to T's Orbit and MIT's Liar compiler are shown in Figure 5. As that table indicates, the MC68020 implementation achieves very good performance over a wide range of benchmarks. This implementation also includes an option for efficiently gathering dynamic usage statistics, as discussed in Appendix B.

In addition, we have a preliminary version of a Scheme to C compiler inspired by the work of Bartlett. This back end generates portable C code with good performance characteristics, but is not yet capable of producing separately compilable modules. As a result it can currently only be used for compiling rather small Scheme programs.

Future Plans

Our next major goal is to create a back end for a stock MIMD parallel machine. We have made several early prototype versions and are encouraged by the results. Gambit's control and data flow analysis appear to be sufficiently general to allow us to explore a number of mechanisms for reducing the cost of the touch and future operations that dominate the performance of our own and other parallel Scheme systems.

As part of this work we plan to complete our work on the Gambit C back end. This involves the implementation of a separate compilation facility that has already been designed. Measurements on the single module system indicate that the performance is about half that of the native code produced by Gambit, and we consider the advantage of having a single back end that supports a number of hosts to outweigh this performance degradation. The separate compilation design, however, has a number of areas in which performance may degrade, and we plan to examine these in detail.

Finally, some very preliminary results indicate that it may be interesting to consider compiling imperative languages such as C and Pascal into PVM. We are particularly interested in combining a C to PVM compiler with the PVM to C back end. An early experiment indicated that PVM's optimization of procedure calls generated C code which, when compiled by a C compiler, outperformed the equivalent hand-coded C program compiled by that compiler. We plan to see whether this holds up under closer investigation.

Acknowledgements

The authors would like to thank the other Scheme implementors who have helped us understand both their own systems and Gambit: David Kranz, Chris Hanson, Bill Rozas, Joel Bartlett, and Will Clinger. Bert Halstead also contributed useful ideas and comments with respect to the parallel implementation of Gambit. We are especially grateful to Chris Hanson and Bill Rozas of the MIT Scheme Team for their help in comparing the code from various compilers, as well as their help in our efforts to port CScheme to the MIPS. Their efforts allowed us to gather the performance figures included in Figure 5.

References

Harold Abelson and Gerald Jay Sussman with Julie Sussman. Structure and Interpretation of Computer Programs. MIT Press.

Joel Bartlett. Scheme->C: a portable Scheme-to-C compiler. Technical report, Digital Equipment Corp. Western Research Lab.

John Batali, Edmund Goodhue, Chris Hanson, Howie Shrobe, Richard M. Stallman, and Gerald Jay Sussman. The Scheme-81 architecture -- system and chip. In Paul Penfield, Jr., editor, Proc. of the MIT Conference on Advanced Research in VLSI, Dedham, Mass. Artech House.

William Campbell. A C interpreter for Scheme -- an exercise in object-oriented design. Submitted to the Software Engineering journal of the British Computer Society.

Jerome Chailloux. La machine LLM3. Rapport interne du projet VLSI, INRIA.

Richard P. Gabriel. Performance and Evaluation of LISP Systems. Research Reports and Notes, Computer Systems Series. MIT Press, Cambridge, MA.

Martin L. Griss and Anthony C. Hearn. A portable LISP compiler. Software -- Practice and Experience.

R. Halstead. Multilisp: A language for concurrent symbolic computation. ACM Trans. on Prog. Languages and Systems.

Robert H. Halstead, David A. Kranz, and Eric Mohr. Mul-T: A high-performance parallel Lisp. In SIGPLAN Symposium on Programming Language Design and Implementation.

Jack Holloway, Guy Lewis Steele Jr., Gerald Jay Sussman, and Alan Bell. The Scheme-79 chip. Technical Report (AI Memo), Mass. Inst. of Technology, Artificial Intelligence Laboratory.

D. A. Kranz et al. Orbit: An optimizing compiler for Scheme. In Symposium on Compiler Construction, ACM SIGPLAN.

Mass. Inst. of Technology, Cambridge, MA. MIT Scheme Reference.

James Miller. MultiScheme: A Parallel Processing System Based on MIT Scheme. PhD thesis, Mass. Inst. of Technology. Available as an MIT LCS technical report.

James Miller. Implementing a Scheme-based parallel processing system. International Journal of Parallel Processing.

James Miller and Christopher Hanson. IEEE Draft Standard for the Scheme Programming Language. IEEE, forthcoming.

James Miller and Guillermo Rozas. Free variables and first-class environments. Journal of Lisp and Symbolic Computation, to appear.

Guillermo Rozas. Liar, an Algol-like compiler for Scheme. Bachelor's thesis, Mass. Inst. of Technology.

Guy Lewis Steele Jr. Rabbit: A compiler for Scheme. Master's thesis, Mass. Inst. of Technology.

A. Performance Measurements and Code Comparison

A detailed analysis and explanation of Gambit's performance is beyond the scope of this paper. This appendix provides a brief sketch of our performance results and compares the code generated by Gambit to that of MIT's Liar compiler and T's Orbit compiler. Figure 5 shows the results of running Clinger's version of the Gabriel benchmark programs. For the most part the figures need no explanation. The following points may help the reader to better interpret them:

- The measurements for the MC68020 family were taken on a Hewlett-Packard system with an MC68020-family CPU and a local disk. The measurements are based on the HP-UX time functions, which deliver an estimate of user CPU time in milliseconds. System time, and time for garbage collection if any, are not included in these numbers. All measurements were taken in full multiuser operating mode, but with only a single user logged in.

- The measurements for the MIPS CPU were taken on a Digital Equipment Corporation DECstation with medium-speed local disks, running under a preliminary release of the CMU Mach operating system. Again, measurements are based on the Mach timing functions, omit system and garbage collection time, and were taken under multiuser conditions.

- The column labeled scc contains timings from Joel Bartlett's Scheme to C compiler.

- The column labeled cc contains measurements for some of the benchmarks that were hand coded in C and compiled with the -O switch for optimization, using the vendor-supplied C compiler (cc).

- All of the benchmarks are executed five times and the mean is reported. In our experience the measured times are repeatable to within a few percent.

- All benchmarks were run as supplied by Clinger, but with two differences. On all systems a compiler-dependent declaration was supplied that caused arithmetic to be performed in fixnum mode (exact integers with no overflow detection only). In addition, each benchmark was compiled both as written and enclosed in a let expression, to allow each compiler to take advantage of any flow analysis it performs. The best timings for a given compiler are recorded here. We are unable to find a consistent pattern to explain which form of the program will perform better for a given compiler.

- A number of procedures are used routinely by the benchmarks, and their performance can dominate the performance of the entire benchmark. This is particularly noticeable in the case of the get and put operations in the Boyer benchmark. In order to compensate for this, we wrote specialized versions of the procedures symbol->string, gensym, get, and put. While the details of the code are system dependent, since they require non-standard procedures, the algorithms used are the same on all systems.

  [Figure 5: Performance Comparison -- timings for the Gabriel benchmarks
  boyer, browse, cpstak, dderiv, deriv, destruct, diviter, divrec, puzzle,
  tak, takl, traverse, and triangle on the Motorola MC68020 (columns Gambit,
  Orbit, Liar, cc) and for tak, takl, and triangle on the MIPS (columns
  Gambit, Orbit, Liar, cc, scc). Timings for Gambit are absolute, in seconds;
  all other times are relative to Gambit.]

We now turn to a more detailed look at the actual code produced by the three compilers. Figure 6 shows the results of compiling the following Scheme program for the MC68020 with Gambit, Orbit, and Liar:

  (define (reverse-map f l)
    (define (loop l x)
      (if (pair? l)
          (loop (cdr l) (cons (f (car l)) x))
          x))
    (loop l '()))

The code in Figure 6 has been slightly modified for presentation purposes. We have converted the instruction sequences from each system's private assembler notation into standard Motorola syntax. In addition, the code from all three compilers actually includes interspersed words used by the garbage collector and interrupt handlers. These have been removed to make the code easier to read. They do not affect performance, since they are not executed in the usual case.

We do not pretend to have undertaken a detailed study of the code from these three compilers. However, from examination of a number of programs, and after discussions with several of the implementors (David Kranz for Orbit, Chris Hanson and Guillermo Rozas for Liar), we can supply the following observations that account for a large part of the differences in output code. These comments apply to Gambit used with the MC68020 and MIPS back ends.

  [Figure 6: Code Comparison -- the MC68020 code produced by Liar, Orbit, and
  Gambit for reverse-map.]
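For reference, reverse-map applies f to each element and accumulates the results in reverse order:

  (reverse-map (lambda (x) (* x x)) '(1 2 3))   ; => (9 4 1)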

Object Representation
Gambit uses the three low bits of a data item for the type tag, with 0 representing fixnums and the other type tags chosen to optimize references to the car and cdr of pairs and direct jumps to procedures. Orbit uses the two low bits for the type tag and also chooses 0 for fixnums. Liar uses the top six bits for a type tag, with 0 representing #f. Orbit and Liar use a single object to represent the empty list and #f; Gambit distinguishes between these two objects, and only #f counts as false.

Free Pointer Alignment
Gambit keeps the free pointer octabyte aligned at all times, potentially wasting space when large numbers of small objects are created. Orbit and Liar maintain only quadbyte alignment.

Consing
Gambit performs consing by inline code expansion, as does Liar. Orbit performs this with a call to an external procedure, in order to allow GC checking to be done when the allocation occurs.

GC Detection
Gambit detects the need for garbage collection by performing a test at the end of any basic block in which allocation occurs. Orbit places this test in the code that performs the allocation itself, while Liar tests at both the entry to a procedure and the entry to every continuation point. Gambit's garbage collector is not yet fully functional on all of the back ends. The code and measurements reflect the full cost of detecting the need for GC, but none of the benchmarks actually invoked the garbage collector.

Interrupt Testing
Gambit does not test for interrupts, since it assumes a stand-alone environment rather than a program development environment. This will be changed for the parallel implementations, in order to allow timer interrupts to produce a fair scheduler. Liar combines the garbage collection and interrupt check into a single short code sequence executed at the start of every procedure and continuation. We do not know how this is handled in Orbit.

Unknown Procedure Call
When calling a procedure that can't be identified at compile time, Gambit loads the first few arguments, the continuation, and the argument count into registers; the stack is used to hold the other arguments, if there are any. Any procedure that may be called in this manner will begin with a procedure label, and the code will compare the number of arguments passed with the number of parameters expected and will move the arguments, or trap, as appropriate. Liar passes the arguments and the continuation on the stack. It uses an elaborate mechanism that distinguishes calls to procedures named by variables at the top level of a compilation unit from other procedure calls, and the interpreter supplies a number of trampolines that are used to combine a link-time arity test with run-time argument motion. An explanation of one part of this mechanism can be found in the references. Orbit passes the arguments in registers and the continuation on the stack. A mechanism similar to, but somewhat simpler than, that of Liar is used for arity checking.

T Compatibility
Orbit is actually the compiler for a distinct language, T, that is closely related to Scheme. All of the benchmarks were run in Scheme compatibility mode, whose performance cost is not clearly understood. We rewrote tak and takl in T and compared the actual code, and found no differences between the native T version and the Scheme compatibility mode version.
There was one very noticeable cost in the T implementation that is not shared by Gambit or Liar. This is the coding of the primitive procedure pair?. Orbit's two-bit type tags do not distinguish pairs from the empty list, so as to optimize T's list operations. Thus pair? is expensive with Orbit when compared to the other compilers.

B. PVM Usage Statistics

The MC68020 back end allows programs to be compiled in a way that gathers measurements of dynamic usage of each of the PVM instructions and the types of operands used. This information can be used for performance analysis, and has been used to allow us to choose what parts of the actual implementation of PVM deserve careful optimization. The mechanism is both simple and very efficient: as each basic block is constructed, the front end counts the number of each kind of PVM instruction and operand class used in the basic block. The back end creates a counter for each basic block and generates code to increment that counter when the block is entered at runtime. At the end of a run these counters are used to recreate the statistics.

The resulting code runs only moderately slower than unmeasured code, allowing sizable programs to be measured and analyzed. See Figure 7 for a synopsis of the dynamic measurements taken from running Gabriel's version of the Boyer-Moore theorem prover benchmark.

  [Figure 7: Dynamic Measurements for the Boyer Benchmark -- operand usage
  frequency (reg, stk, obj, lbl, glob, mem), instruction usage frequency
  (COPY, LABEL, APPLY, COND, JUMP, MAKE-CLOSURES), and a detailed breakdown
  of LABEL by kind (simple, continuation, procedure), of APPLY by primitive
  (car, cdr, cons, ...), and of COND by predicate (pair?, null?, eq?).]
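The bookkeeping behind these tables can be sketched as follows (an illustration using our own representation, not the code Gambit emits): the front end records a static profile of each basic block, the generated code merely bumps one counter per block, and the per-instruction statistics are recreated afterwards by weighting each static profile by its block's execution count.

  ;; static-profiles: list of (block-id (instr-kind . count) ...)
  ;; run-counts:      list of (block-id . times-entered)
  (define (recreate-statistics static-profiles run-counts)
    (define totals '())
    (define (add! kind n)
      (let ((entry (assq kind totals)))
        (if entry
            (set-cdr! entry (+ (cdr entry) n))
            (set! totals (cons (cons kind n) totals)))))
    (for-each
     (lambda (profile)
       (let ((times (cdr (assq (car profile) run-counts))))
         (for-each
          (lambda (kind.count)
            (add! (car kind.count) (* times (cdr kind.count))))
          (cdr profile))))
     static-profiles)
    totals)

  (recreate-statistics '((b1 (APPLY . 2) (JUMP . 1)) (b2 (COPY . 1)))
                       '((b1 . 10) (b2 . 3)))
  ;; => ((COPY . 3) (JUMP . 10) (APPLY . 20))   ; order of entries is incidental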

In gathering these measurements we used a version of Boyer that is enclosed in a let expression. This accounts for the lack of global variable references. The procedures put and get, which dominate the performance of the benchmark, were implemented using assq.

There are only a few comments to be made on these results. First, label operands can appear either directly in a JUMP instruction or as a source operand to another instruction (for example, a label may be stored into a local variable for later use). In this benchmark labels appeared almost exclusively in this latter context. The primary use of direct jumps is to branch around other arms of a conditional when the conditional isn't in the tail position of an expression. Most conditionals in Boyer appear in tail position; we don't know how common this is in general Scheme code.

The breakdown of the label code is interesting, since it shows the dynamic execution frequency of the various types of label. Recall that simple labels and continuations actually generate no code, so there is no runtime cost associated with their use. A procedure label, however, requires an arity check and may require moving values from argument locations to parameter locations.

The APPLY instruction is used by the front end to request open coding of a primitive that the back end supports. Figure 7 shows the breakdown by primitive procedure of these operations that occur when Boyer runs. The table shows only those open coded primitives that account for a significant share of the run time, although the actual statistics contain numbers for all primitives. In fact, a good deal of detail has been omitted from all of these tables to make the presentation more tractable.

COND is used for all conditionals. In the case of Boyer, most of the predicates encountered were open coded versions of pair?, null?, or eq?; most of the remaining predicates are not open coded, and the other open coded predicates occur only rarely.