Implementing lazy functional languages on stock hardware:
the Spineless Tagless G-machine

Version

Simon L Peyton Jones

Department of Computing Science, University of Glasgow, G QQ
simonpj@dcs.glasgow.ac.uk

July

Abstract

The Spineless Tagless G-machine is an abstract machine designed to support non-strict, higher-order functional languages. This presentation of the machine falls into three parts. Firstly, we give a general discussion of the design issues involved in implementing non-strict functional languages.

Next, we present the STG language, an austere but recognisably-functional language, which as well as a denotational meaning has a well-defined operational semantics. The STG language is the abstract machine code for the Spineless Tagless G-machine.

Lastly, we discuss the mapping of the STG language onto stock hardware. The success of an abstract machine model depends largely on how efficient this mapping can be made, though this topic is often relegated to a short section. Instead, we give a detailed discussion of the design issues and the choices we have made. Our principal target is the C language, treating the C compiler as a portable assembler.

This paper is to appear in the Journal of Functional Programming.

Changes in Version : large new section comparing the STG machine with other designs; profiling material; index.

Changes in Version : proper statement of the initial state of the machine; reformulation of CAF updates; new format for the state transition rules, separating the guards which govern the applicability of the rules ("such that") from the auxiliary definitions ("where").

Changes in Version : introduction of the term "lambda-form"; new subsection on lambda lifting; discussion of copy-vs-share; a variable-binding form of default alternative is allowed in algebraic case expressions.

Changes in Version : more explicit discussion of the translation into the STG language; some reordering of subsections; an overview of the code generation process at the start of the code generation section.

Changes in Version : new-format profiling; new section on black holes, saying how to avoid space leaks without black-holing.

Changes in Version : appendix added giving gory details. Apart from the appendix, this is essentially the version published in the Journal of Functional Programming. Very minor changes to the main paper: fixed a bug in a figure, and other typos.



Previously entitled "The Spineless Tagless G-machine: a second attempt".

Contents

Introduction
Overview
    Part I: the design space
    Part II: the abstract machine
    Part III: mapping the abstract machine onto real hardware
    Source language and compilation route
Exploring the design space
    The representation of closures
    Function application and the evaluation stack
    Data structures
    Summary
The STG language
    Translating into the STG language
    Closures and updates
    Generating fewer updatable lambda-forms
    Standard constructors
    Lambda lifting
    Full laziness
    Arithmetic and unboxed values
    Relationship to CPS conversion
Operational semantics of the STG language
    The initial state
    Applications
    letrec expressions
    Case expressions and data constructors
    Built-in operations
    Updating
Target language
    Mapping the STG machine to C
    Compiling jumps
    Optimising the tiny interpreter
    Debugging
The heap
    How closures are represented
    Allocation
    Two-space garbage collection
    Other garbage collector variants
    Trading code size for speed
    The standard-entry code for a closure
Stacks
    One stack
    Two stacks
Compiling the STG language to C
    The initial state
    Applications
    letrec expressions
    case expressions
    Arithmetic
Adding updates
    Representing update frames
    Partial applications
    Constructors
    Vectored returns
    Returning values in registers
    Update in place
    Update frames and garbage collection
    Global updatable closures
Status and profiling results
A   The gory details
    Update flags
    Black holes
    Adding fillers
    Performing updates
    Lambda-form info
B   Stack stubbing
    Implementation

Introduction

The challenges of compiling non-strict functional languages have given rise to a whole stream of research work. Generally, the discussion of this work has been focussed around the design of a so-called abstract machine, which distils the key aspects of the compilation technique without becoming swamped in the details of source language or code generation. Quite a few such abstract-machine designs have been presented in recent years; examples include the G-machine (Augustsson; Johnsson), TIM (Fairbairn and Wray), the Spineless G-machine (Burn, Peyton Jones and Robson), the Oregon G-machine chip (Kieburtz), the CASE machine (Davie and McNally), the HDG machine (Kingdon, Lester and Burn), the ⟨ν,G⟩-machine (Augustsson and Johnsson), and the ABC machine (Koopman).

Early implementations, especially those based on graph reduction, were radically different from conventional compiler technology: the difference between an SK combinator implementation (Turner) and, say, a Lisp compiler is substantial. So great was this divergence that new hardware architectures were developed specifically to support the execution model (Scheevel; Stoye, Clarke and Norman). As understanding has developed, though, it has been possible to recognise features of more conventional systems emerging from the mist, and to generate efficient code for stock architectures.

In this paper we present a new abstract machine for non-strict functional languages, the Spineless Tagless G-machine, set it in the context of conventional compiler technology, and give a detailed discussion of its mapping onto stock hardware. Our design exhibits a number of unusual features:

- In all the other abstract machines mentioned above, the abstract machine code for a program consists of a sequence of abstract machine instructions. Each instruction is given a precise operational semantics using a state transition system. We take a different approach: the abstract machine language is itself a very small functional language, which has the usual denotational semantics. In addition, though, each language construct has a direct operational interpretation, and we give an operational semantics for the same language using a state transition system.

- Objects in the heap, both unevaluated suspensions and head normal forms, have a uniform representation with a code pointer in their first field. Many implementations examine tag fields on heap objects to decide how to treat them. With our representation we never do this; instead, a jump is made to the code pointed to by the object. This is why we call the machine "tagless".

- A pervasive feature of functional programs is the construction of data structures and their traversal using pattern-matching. Many abstract machine designs say very little about how to do this efficiently, but we pay a lot of attention to it.

- The machine manipulates unboxed values directly, based on the ideas in a companion paper (Peyton Jones and Launchbury). This is essential for an efficient implementation of arithmetic, but it is usually hidden in the code generator.

- There is scope for exploiting the fruits of both strictness analysis and sharing analysis to improve execution speed.

- Lambda lifting, a common feature of almost all functional-language implementations, is not carried out. Instead, the free variables of lambda abstractions are identified, but the abstraction is left in place.

- The machine is particularly well-suited for parallel implementations, although space prevents this aspect being discussed in this paper (Peyton Jones, Clack and Salkild).

Almost all the individual ideas we describe are present in other implementations, but their combination produces a particularly fast implementation of lazy functional languages.

An earlier version of this paper (Peyton Jones and Salkild) had a similar title and introduction to this one. The underlying machine being described is mostly unchanged, but the presentation has been completely rewritten.

Overview

The paper divides into three parts. Part I explores the design space, to show how the STG machine fits into a wider context. Part II introduces the abstract machine and gives its operational semantics, and Part III discusses how the abstract machine is mapped onto stock hardware.

Part I: the design space

The implementation of non-strict functional languages has tended to be done in a separate world to that of "real" compilers. One goal of this paper is to help bridge the gap between these two cultures. To this end, we identify several key aspects of a compiler (its representation of data structures, its treatment of function application, and its compilation of case-analysis on data structures), and compare the approach we take with that of others.

We hope that this exercise may be useful in its own right, as well as setting the context for the rest of the paper.

Part II: the abstract machine

The usual way of presenting an evaluation model for a functional language is to define an abstract machine which executes an instruction stream. The abstract machine is given an operational semantics using a state transition system, and compilation rules are given for converting a functional program into abstract machine code. The application of these compilation rules is usually preceded by lambda lifting (Johnsson), which eliminates lambda abstractions in favour of supercombinators, functions with no free variables. A good example of this approach is the G-machine (Johnsson; Peyton Jones), whose abstract machine code is called G-code.

This approach suffers an annoying disadvantage: the abstract machine is generally not abstract enough. For example, the abstract G-machine uses the stack to hold many intermediate values. When G-code is compiled into native machine code, many stack operations can be eliminated by holding the intermediate values in registers. The code generator has to simulate the operation of the abstract stack, which is used, in effect, mainly to name intermediate values. Not only does this process complicate the code generator, but it makes G-code harder to manipulate and optimise.

To avoid this problem, one is driven to introduce explicitly-named values in the abstract machine, which is how the T-code of our earlier paper was derived (Peyton Jones and Salkild). Unfortunately, the simplicity of the abstract machine is now lost.

We take a slightly different approach here. Instead of defining a new abstract machine, we use a very small functional language, the STG language, as the abstract machine code. It has the usual denotational semantics, so it is in principle possible to check that the transformation of the original program into the STG language is correct. But we also give it a direct operational semantics, using a state transition system, which explains how we intend it to be executed.

The problem of proving the entire system correct is thereby made easier than for, say, the G-machine, because only the equivalence of the denotational and operational semantics of a single language is involved. Even so, it is a substantial task, and we do not attempt it here.

Part III: mapping the abstract machine onto real hardware

Typically, much is written about the compilation of a functional program into abstract machine code, and rather little about how to map the abstract machine onto the underlying hardware. Yet the abstract machine can only be considered a success if this mapping works well; that is, if the resulting code is efficient.

We believe that the Spineless Tagless G-machine comes out well in this regard, and we devote considerable space to discussing the mapping process. One of the nice aspects is that a variety of mappings are possible, of increasing complexity and efficiency.

Our target machine code is the C language. This has become common of late, conferring as it does wide portability. We may pay some performance penalty for not generating native machine code, and plan to build other code generators which do so.

Source language and compilation route

We are interested in compiling strongly-typed, higher-order, non-strict, purely-functional languages such as LML or Haskell. We expect heavy use of both higher-order functions and the non-strict semantics (Hughes).

This paper is only about the back end of a compiler. Our complete compilation route involves the following steps:

- The primary source language is Haskell (Hudak et al.), a strongly-typed, non-strict, purely-functional language. Haskell's main innovative feature is its support for systematic overloading.

- Haskell is compiled to a small Core language. All Haskell's syntactic sugar is translated out, type checking is performed, and overloading is resolved. Pattern-matching is translated into simple case expressions, each of which performs only a single level of matching.

- Program analyses and a variety of transformations are applied to the Core language.

- The Core language is translated to the STG language, which we introduce later. This transformation is rather simple.

- The code generator translates the STG language into Abstract C. The latter is just an internal data type which can simply be printed out as C code, but which can also serve as input to a native-code generator.

Strictness analysis plays an important role in compilers for non-strict languages, enabling the compiler to determine cases where function arguments can be passed in evaluated form, which is often more efficient. Using this technology, compilers for lazy languages can generate code which is sometimes as fast as, or faster than, C (Smetsers et al.).

Usually, the results of strictness analysis are passed to the code generator, which is thereby made significantly more complicated. We take a different approach. We extend the Core and STG languages with full-fledged unboxed values, which makes them expressive enough to incorporate the results of strictness analysis by simple program transformations. We give a brief introduction to unboxed values later, but the full details, including the transformations required to exploit strictness analysis, are given in a separate paper (Peyton Jones and Launchbury).

The code generator, which is the subject of this paper, is therefore not directly involved in strictness analysis or its exploitation, so we do not discuss it further.

[Footnote: The task of proving a simple G-machine correct is carried out by Lester in his thesis (Lester).]

Part I: Exploring the design space

Exploring the design space

Before introducing the STG machine in detail, we pause to explore the design space a little. The STG machine has its roots in lazy graph reduction. It is now folklore that, while graph reduction looks very different to conventional compiler technology, the best compilers based on graph reduction generate quite similar code to those for, say, Lisp. In this section we attempt to compare some aspects of the STG machine with more conventional compilers.

Any implementation is the result of a raft of interrelated design decisions, each of which is partly justified by the presence of the others. That makes it hard to find a place to start our description. We proceed by asking three key questions, which help to locate the implementation techniques for any non-strict, higher-order language:

- How are function values, data values and unevaluated expressions represented?

- How is function application performed?

- How is case analysis performed on data structures?

This section gives the context and motivation for many of the implementation techniques described in Parts II and III, and suitable forward references are given.

The representation of closures

The heap contains two kinds of objects: head normal forms (or "values") and as-yet-unevaluated suspensions (or "thunks"). Head normal forms can be further classified into two kinds: function values and data values. A value may contain thunks inside it; for example, a list Cons cell might have an unevaluated head and/or tail. A value which contains no thunks inside it is called a normal form.

It is worth noting that, in a polymorphic language, it is not always possible to distinguish thunks whose value will turn out to be a function from thunks whose value is a data value. For example, consider the composition function:

compose f g x = f (g x)

Is (g x) a function or not? It depends, of course, on the type of g, and since compose is polymorphic this is not statically determined.

For reasons which will become apparent, we use the term closure to refer to both values and thunks. In the remainder of this section we consider various ways in which closures can be represented, contrasting the STG machine with other designs.

Representing functions

Any implementation of a higher-order language must provide a way to represent a function value. Such a value behaves like a suspended computation: when the value is applied to its arguments, the computation is performed.

The most compact way to represent a function value is as a block of static code (shared by all dynamic instances of the value) together with the values of its free variables. Such a value is commonly called a closure, though we use the term in a wider sense in this paper. The most direct physical representation of such a closure is a pointer to a contiguous block of heap-allocated storage, consisting of a code pointer (which points to the static code) followed by pointers to the values of the free variables, thus:

[Diagram: a closure, drawn as a heap block whose first field points to the code, followed by the free-variable fields.]

This is the representation adopted by many compiled Lisp systems, by SML of New Jersey, and by the Spineless Tagless G-machine. To perform the computation, a distinguished register, the environment pointer, is made to point to the closure, and the code is executed. We call this operation entering a closure. The code can access its free variables by offsets from the environment pointer, and its arguments by some standard argument-passing convention (e.g. in registers, on the stack, or in an activation record).

Instead of storing the values of the free variables themselves in the closure, it is possible to store a pointer to a block of free variables, or even to a chain of such blocks. These representations attempt to save storage at the cost of slowing down access. The Orbit compiler, for example, works hard to choose the best representation for closures, including allocating them on the stack rather than in the heap whenever possible, and sharing one block of free variables between several closures (Kranz). Apart from the compiler complexity involved, considerable extra care has to be taken in the garbage collector to avoid the space leakage which can occur when a closure captures a larger set of free variables than the closure itself requires. Indeed, Appel's measurements for SML of New Jersey suggest that clever closure-representation techniques gain little, and potentially lose a lot in space complexity, so he recommends a simple flat representation (Appel).

The Three Instruction Machine (TIM) takes another interesting position. Instead of representing a closure by a single pointer, it represents a closure by a pair of a code pointer and a pointer to a heap-allocated frame (Fairbairn and Wray). The frame, which is a vector of code-pointer/frame-pointer pairs, gives the values of the free variables of the closure, and may be shared between many closures. These code-pointer/frame-pointer pairs need to be handled very carefully in a lazy system, because they cannot be duplicated without the risk of duplicating work. Proper sharing can still be ensured, but it results in a system remarkably similar to the more conventional one mentioned above (Peyton Jones and Lester).

Representing thunks

In a non-strict language, values are passed to functions, or stored in data structures, in unevaluated form, and only evaluated when their value is actually required. Like function values, these unevaluated forms capture a suspended computation, and can be represented by a closure in the same way as a function value. Following the terminology of Bloss, Hudak and Young, we call this particular sort of closure a thunk, a term which goes back to the early Algol implementations of call-by-name (Ingerman). When the value of the thunk is required, the thunk is forced.

A thunk can in principle be represented simply by a parameterless function value, but it is inefficient to do so, because it might be evaluated repeatedly. This duplicated work is avoided by so-called lazy implementations as follows: when a thunk is forced for the first time, it is physically updated with its value.

There are three main strategies for dealing with updates in lazy implementations:

The naive reduction model updates the graph after each reduction (Peyton Jones). By "reduction" is meant the replacement of an instance of the left-hand side of a function definition by the corresponding instance of its right-hand side. Apart from a few optimisations, this is the update strategy used by the G-machine (Johnsson). Its main disadvantage is that a thunk may be updated with another thunk, so the same object may be updated repeatedly, and we do not consider this model further.

The cell model. In the cell model, each closure is provided with a status flag, to indicate whether it is evaluated or not. The code to force (that is, get the value of) a closure checks the status flag. If the closure is already evaluated, its value is extracted; otherwise the suspended computation is performed by entering the closure, the value is written into the cell, and the status flag is flipped (Bloss, Hudak and Young).

The self-updating model, which is used by the STG machine. The cell model places the responsibility for performing the update on the code which evaluates the thunk. The self-updating model instead places this responsibility on the code inside the thunk itself. The code to force a closure simply pushes a continuation on the stack and enters the closure: if the closure is a thunk, it arranges for an update to be performed when evaluation is complete; otherwise it just returns its value. No tests need be performed.

The update overwrites the thunk with a value, which therefore must also have a code pointer, because subsequent forces will re-enter the thunk-turned-value. This representation is natural for function values, as we have already discussed, but is something of a surprise for data values. A list cell, for example, is represented by a code pointer together with the two pointers comprising the head and tail of the list. The code pointed to from the list cell simply returns immediately. In effect, the code pointer plays the role of the flag in the cell model.

[Footnote: In a later section we explore variants of this scheme in which the code for a list cell does rather more than simply return.]

Bloss, Hudak and Young call this model "closure mode", but the implementation they suggest is very much less efficient in both time and space than that outlined above, because it is based on a translation into Lisp.

The latter two models each offer scope for optimisation. Consider the cell model, for example. Forcing can be optimised if the compiler can prove that the thunk is certainly already evaluated (or the reverse), because the test on the flag can be omitted (Bloss, Hudak and Young). Furthermore, if the compiler can prove that there can be no subsequent forces of the thunk, then it can omit the code which performs the update.

A similar situation holds for the self-updating model. If the compiler can prove that a particular thunk can only be evaluated at most once, which we expect to be quite common, it can create code for the thunk which doesn't perform the update. Unlike the cell model, though, the self-updating model cannot take advantage of order-of-evaluation analyses.

A uniform representation for closures

As indicated above, the self-updating model strongly suggests that every heap-allocated object, whether a head normal form or a thunk, is represented uniformly by a code pointer together with zero or more fields which give the values of the free variables of the code. The STG machine adopts this uniform representation, as can be seen in the operational semantics, where all heap values are represented uniformly by some code together with a sequence of values. Indeed, this is why the machine is called "tagless": since all objects have the same form, there is no need for a tag to distinguish one kind of object from another (contrast the presentation in Peyton Jones). We use the term "closure" to refer to both values and thunks because of their uniform representation.

The decision to use a uniform representation for all closures has other interesting ramifications, which we explore in this section.

Firstly, when a thunk is updated with its value, it is possible that the value will take more space than the thunk. In this case, the thunk must be updated with an indirection to the value (see the first figure below). This causes no difficulty for the self-updating model, because indirections can be represented by a closure whose code simply enters the target closure. Such indirections can readily be removed during garbage collection.

In the cell model, either a second test must be made to check for indirections, or, alternatively, every updated thunk must be an indirection; the second figure below shows the latter case. Both methods impose extra overhead.

Secondly, the self-updating model also allows other exceptional cases to be taken care of without extra tests. For example:

- When a thunk is entered, its code pointer can be overwritten with a "black hole" code pointer. If the thunk is ever re-entered before it is updated, then its value must depend on itself. It follows that the program has entered an infinite loop, and a suitable error message can be displayed. Without this mechanism, stack overflow occurs, which is less helpful to the programmer.

[Figure: Updating in the self-updating model. Before updating: the closure holds a code pointer and its free variables. After updating with a small value (e.g. a Cons cell): the closure is overwritten in place with the Cons code pointer and the head and tail pointers. After updating with a big value: the closure is overwritten with an indirection (indirection code pointer plus a pointer to the value).]

- In a system which supports concurrent threads of execution, exactly the same method can be used to synchronise threads. When a thunk is entered, its code pointer is overwritten with a "queue-me" code pointer. If another thread tries to evaluate the thunk before the first thread has completed, the former is suspended and added to a queue of threads attached to the thunk. When the thunk is updated, the queued threads are re-enabled.

- In a system with distributed memory, pointers to remote memory units often have to be treated differently to local pointers. However, it would be very expensive to test for remoteness whenever dereferencing a pointer. In the self-updating model, a remote pointer can be represented as a special kind of indirection, and no tests for remote pointers need be performed.

[Figure: Updating in the cell model. Before updating: the closure holds a code pointer and its free variables, with the status flag unset. After updating: the closure becomes an indirection to the value.]

Function application and the evaluation stack

Higher-order languages, which allow functions as first-class citizens, present interesting challenges for the compiler-writer. An illuminating way of comparing compilation strategies is to ask how function application is performed.

Currying

The languages in which we are interested make heavy use of curried functions. For example, consider the following Haskell function definition:

f x y = x

f is attributed the type a -> b -> a. That is, f may be thought of as a function of one argument, which returns a function which takes the second argument. An application of f to two arguments, say f e1 e2, is short for (f e1) e2. An application of f to one argument is perfectly acceptable, for example map (f e1) xs.

In strict languages, like Lisp, Hope and SML, the definition of f would usually be of the form

f (x,y) = x

where f is attributed the type (a,b) -> a. (We use (a,b) to denote the type of pairs of elements of type a and b.) The function f can only be applied to a suitable pair, and cannot be applied to just one argument.

In all of these languages it is possible (and in SML, easy) to define curried functions, otherwise they would hardly deserve the title "higher-order", but compilers usually implement the uncurried form much more efficiently, and programmers respond accordingly. There is the inverse cultural tradition in non-strict functional languages, where the additional flexibility allowed by the curried form means that it is usually preferred by programmers, and compilers typically treat curried application as fundamental.

Compiling function application

Compilers from the Lisp tradition usually compile function application as follows: evaluate the function, evaluate the argument, and apply the function value to the argument. When a known function is being applied, as is often the case (especially in Lisp), the evaluate-the-function part becomes trivial. This model for function application, which we call the eval-apply model, is invariably used by compilers for strict languages (e.g. Lisp, Hope, SML and the SECD machine (Henderson; Landin)). It is also used in some implementations of non-strict languages, except that of course only the function is evaluated before the application: e.g. the ABC machine (Koopman) and the ⟨ν,G⟩-machine (Augustsson & Johnsson).

In contrast, compilers based on lazy graph reduction treat function application as follows: push the argument on an evaluation stack, and tail-call (or "enter") the function. There is no "return" when the evaluation of the function is complete. We call this the push-enter model; it is used by the G-machine, TIM and the STG machine.

The difference between the two models seems rather slight, but it has a pervasive effect. It is difficult to say in general which of the two is better. In essentially first-order programs they generate much the same code. For programs which make extensive use of curried functions, the push-enter model looks better. For example, consider the curried function definition

apply f x y z = f x y z

A Lisp-like compiler would be compelled to evaluate f x, then evaluate that function applied to y, and finally apply the result to z. A graph-reduction compiler would just push x, y and z onto the evaluation stack before jumping to the closure for f.
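The two models can be caricatured in plain Haskell, using an explicit argument stack for push-enter (a toy sketch for intuition, not the STG machine itself; all names are invented):

```haskell
-- Values are either machine integers or functions from values to values.
data Val = VInt Int | VFun (Val -> Val)

-- Eval-apply: apply an evaluated function to exactly one argument,
-- returning to the caller between successive applications.
applyEA :: Val -> Val -> Val
applyEA (VFun g) a = g a
applyEA _        _ = error "applying a non-function"

-- Push-enter: push all available arguments and enter the function once;
-- it consumes arguments by repeated tail calls, with no intermediate returns.
enterPE :: Val -> [Val] -> Val
enterPE (VFun g) (a:as) = enterPE (g a) as
enterPE v        []     = v

-- A curried two-argument addition, for trying both models out.
plusV :: Val
plusV = VFun (\(VInt a) -> VFun (\(VInt b) -> VInt (a + b)))
```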

The evaluation stack

The main cost of the push-enter model of function application is that the link between a function body and an activation frame is broken. For example, consider apply again. In the eval-apply model the compiler can allocate an activation frame for apply, which is deallocated when the value of f x y z has been computed. In the push-enter model, all that happens is that three more arguments are pushed on the evaluation stack before jumping to f. To put it another way, there are no identifiable moments at which a new activation frame should be allocated or reclaimed.

This pushes the push-enter evaluation model in the direction of having a contiguous evaluation stack, rather than a linked list of heap-allocated activation frames, as exemplified by the New Jersey SML compiler (Appel & Jim). The idea of heap-allocated activation frames is very appealing, because it makes it easy to implement call/cc (Appel & Jim), parallel threads (Cooper & Morrisett) and certain debugging mechanisms (Tolmach & Appel). But all these things can be done by allocating a contiguous stack in medium-sized chunks in the heap, at the price of a little extra complication (Hieb, Dybvig & Bruggeman; Peyton Jones & Salkild).

Indeed, performance may well be better using a contiguous stack, because of the improved spatial locality, which reduces paging and cache misses. Contiguous allocation of fresh activation records is pessimal for caches, since they have both to fetch useless data (since they are not clever enough to know that it is free space which is about to be allocated) and then to write back an activation frame to main memory which is quite likely to be garbage already. Unless one uses generational garbage collection and the youngest generation fits entirely within the cache, using a contiguous stack is likely to have far better cache performance (Appel; Wilson, Lam & Moher). Current cache sizes are still too small to contain a complete generation, but that may change. It would be very interesting to quantify these effects.

The ⟨ν,G⟩-machine is another interesting design compromise (Augustsson & Johnsson). Here again there is no contiguous evaluation stack. Instead, working space is allocated in every closure (which the ⟨ν,G⟩-machine calls a "frame"), and the closures under evaluation are linked together, much as heap-allocated activation frames are. The penalties are: space usage is worse, because all closures contain the extra space regardless of whether they are being evaluated or not (and most are not); when a function is evaluated to a partial application, the arguments must be copied from the function's frame to the application's frame; and it is not always possible statically to bound the amount of working space required (unless separate compilation is abandoned), so an exception-checking mechanism is required to deal with the cases where too little has been allocated.

Data structures

Strongly-typed functional languages such as Haskell encourage the programmer to define many algebraic data types. Even the built-in data types of the language, such as lists, booleans, tuples and (as we will see in Section ) integers, may be regarded as algebraic data types. Here, for example, are representative type declarations for some of them:

data Boolean      = False | True
data List a       = Nil | Cons a (List a)
data Tuple a b c  = MkTuple a b c
data Int          = MkInt Int#
data Tree a       = Leaf a | Branch (Tree a) (Tree a)

Special syntax for lists and tuples is provided by most high-level languages, but not by the STG language. Data values are built using constructors, such as False, Cons, MkTuple and Branch, and taken apart using case expressions. For example:

case t of
  Leaf n       -> e1
  Branch t1 t2 -> e2

In a high-level programming language, data values are usually taken apart using various pattern-matching constructs, but it is well known how to translate such constructs into case expressions with simple, single-level patterns (Wadler). We here assume that this translation has been performed.
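As a concrete illustration of that translation, here is a nested pattern written in ordinary Haskell, followed by an equivalent version that uses only simple, single-level case expressions (the function name is invented for the example):

```haskell
-- Nested pattern: matches the first two elements of a list at once.
firstTwo :: [a] -> Maybe (a, a)
firstTwo (x:y:_) = Just (x, y)
firstTwo _       = Nothing

-- The same function after pattern-match compilation: every case
-- scrutinises one constructor level only.
firstTwo' :: [a] -> Maybe (a, a)
firstTwo' xs =
  case xs of
    []     -> Nothing
    x:rest ->
      case rest of
        []  -> Nothing
        y:_ -> Just (x, y)
```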

These operations of construction and pattern matching are so pervasive in functional programs that they deserve particular attention. Compilers sometimes implement the built-in types (lists, tuples, numbers) in special magic ways, and the programmer pays a performance penalty for user-defined types. We take the view that the general mechanisms used for user-defined types should be made efficient enough to use for built-in types too. (Lisp, of course, has no user-defined types, so this question does not arise.)

We have already discussed the representation of data values as a code pointer together with zero or more contents fields. We now turn our attention to the compilation of case expressions. Notice that a case expression really does two things: it evaluates the expression whose value it scrutinises, and then it selects the appropriate alternative.

If the cell model is used, the case expression must first force the value to be scrutinised. Then it must inspect the value to discover which constructor it is built with, and hence which alternative of the case expression should be executed. It follows that each data value must contain a tag (usually a natural number) which distinguishes from each other the constructors of the relevant data type. So the sequence of events is:

1. Force the value.
2. Extract its tag.
3. Take a multi-way jump based on the tag.
4. Bind the names in the pattern of the alternative to the components of the data value.
5. Execute the code for the alternative.
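The five steps can be sketched in Haskell, with heap cells simulated by IORefs (a toy model of the cell approach; the names force and scrutinise are invented):

```haskell
import Data.IORef

-- A cell is either an unevaluated thunk or a tagged value with its fields.
data Cell = Thunk (IO Cell)
          | Value Int [Cell]   -- constructor tag, plus component fields

-- Step 1: force the cell, overwriting a thunk with the value it computes.
force :: IORef Cell -> IO Cell
force ref = do
  c <- readIORef ref
  case c of
    Thunk compute -> do
      v <- compute
      writeIORef ref v   -- the update: later forces find the value at once
      return v
    v -> return v

-- Steps 2-5: extract the tag, take a multi-way jump on it, bind the
-- components, and execute the code for the chosen alternative.
scrutinise :: IORef Cell -> [[Cell] -> IO a] -> IO a
scrutinise ref alts = do
  v <- force ref
  case v of
    Value tag fields -> (alts !! tag) fields
    _                -> error "force did not produce a value"
```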

In the case of the self-updating model, though, there are more possibilities. Recall that in this model a closure is forced by entering it, regardless of whether it has been forced before. So far we have assumed that the code for a constructor always returns immediately, but other variants are possible. It could, for example, load the tag into a register before returning, so that the tag does not need to be represented explicitly at all (Section ). Better still, instead of returning to a multi-way jump, the constructor code could return to the appropriate member of a vector of return addresses; we call this a vectored return (Section ). These return conventions can be chosen independently for each data type.

In effect, the self-updating model used by the STG machine takes advantage of the fact that a data value is only ever forced by a case expression. This property is unique to the STG machine. Other lazy implementations treat numeric data types as a special case, which are implicitly forced by the built-in arithmetic operations. In the STG machine, numeric data types are implemented as algebraic data types, and only forced using case (Section ).

The idea can be taken one step further. Consider the expression

case f x of
  Nil       -> e1
  Cons a as -> e2

and suppose that f x evaluates to a Cons. The cell model would evaluate f x, resulting in a heap-allocated Cons cell, the components of which would be used in e2. But suppose that we use the self-updating model, and that the code for Cons puts the head and tail of the Cons cell in registers before returning (as well as loading the tag into a register, if the return is not vectored). Then the Cons cell need never be allocated in the heap at all. Since many functions return data values, this optimisation seems quite valuable.

In summary, the cell model separates the forcing of a thunk from the case analysis and unpacking performed by a case expression. The self-updating model allows these operations to be woven together, which seems to offer interesting opportunities for optimisations. To be fair, these optimisations do complicate updating, as we will see when we roll up our sleeves in Part III, so the benefit is not entirely without cost (Section ).

Summary

The single most pervasive design decision in the STG machine is that each closure (including data values) is represented uniformly, and scrutinised only by entering it. The benefits include:

[3] Ireland cleverly avoids the forcing step when it is not necessary, by using the same field to encode the evaluation-status flag and constructor tag. The case expression's multi-way jump has an extra branch for the case where an unevaluated thunk (or indirection) is encountered: it forces the thunk and then re-executes the multi-way jump.

- Cheap indirections are available, and are useful when performing updates. They cost nothing when they are not present, and can be eliminated easily during garbage collection.

- Other exceptional conditions (black holes, concurrency, etc.) can be handled in the same way.

- A variety of return conventions for constructors is possible, including vectored returns, and returning the components of the constructor in registers. The latter means that data values may not be allocated in the heap at all.

What are the costs? The main one seems to be this: in the common case when a possible thunk turns out to be already evaluated, the self-updating model takes two jumps (one to enter the closure, and one to return) while the cell model takes only one conditional jump. (As we have seen, though, a jump can often be saved again by using a vectored return.) Worse, the first jump is to an unknown destination, which means that the code generator cannot keep things in registers. The cell model only incurs these context-switching costs if the thunk is unevaluated. Even so, the cell model may not always win. If there are two or more forces in a row, there is the nasty possibility of saving the context, evaluating one thunk, restoring the context, discovering the second thunk is unevaluated, and saving the context for a second time. In this sort of situation it may well be just as good to save the context once and for all at the start of a string of forces, as the self-updating model must do.

There are also some underlying architectural issues. Firstly, indirect jumps are more likely to cause cache misses than not-taken conditional jumps. Secondly, modern RISCs are well optimised for taking conditional jumps (employed by the cell model), but not for taking indirect jumps (which are needed by the self-updating model). In principle, if the jump-target address is fetched a few instructions before the jump itself, and the instruction-fetch logic interprets the indirect jump directly, no pipeline bubbles need be caused. Most RISCs are not yet optimised for this sequence, but the Tera architecture is: it allows a branch target to be prefetched into a register, and thereby supports zero-delay indirect branches (Alverson et al.).

In short, by always entering a closure when we need its value, we pay a single, fairly modest, up-front cost, but get a wide variety of other benefits at no further cost. Whether the benefits outweigh the costs is at present an open question.

Part II: The abstract machine

The STG language

The abstract machine code for the Spineless Tagless G-machine is a very austere, purely-functional language called the STG language, whose syntax is given in Figure . Virtually every functional-language compiler uses a small, purely-functional language as an intermediate code: e.g. the enriched lambda calculus (Peyton Jones), FLIC (Peyton Jones), FC (Field & Harrison), Kid (Ariola & Arvind).

The distinguishing feature of the STG language is that it has a formal operational semantics, expressed as a state transition system, as well as the usual denotational semantics. Indeed, it is exactly this property which justifies the title "abstract machine code". In particular, the following correspondence between the STG language and operational matters is maintained:

Construct                  Operational reading
Function application       Tail call
Let expression             Heap allocation
Case expression            Evaluation
Constructor application    Return to continuation

The salient characteristics of STG code are as follows:

- All function and constructor arguments are simple variables or constants. This constraint corresponds to the operational reality that function arguments are prepared either by constructing a closure, or by evaluating them, prior to the call. It is easy to satisfy this condition when translating into the STG language, simply by adding new let bindings for non-trivial arguments.

- All constructors and built-in operations are saturated. This constraint simplifies the operational semantics of STG code. It is easily arranged by adding extra lambdas around an unsaturated constructor or built-in application, thus performing the opposite of η-reduction. Notice that in a higher-order language we cannot ensure that every function application is saturated (that is, gives to the function exactly the number of arguments it expects).

- Pattern matching is performed only by case expressions, and the patterns in case expressions are simple, one-level patterns. More complex forms of pattern-matching can easily be translated into this form (Wadler).

- The value scrutinised by a case expression can be an arbitrary expression, and is not restricted to be a simple variable or constant. Nothing would be gained by such a restriction, and some performance would be lost, because a closure for the expression would be unnecessarily built and then immediately evaluated.

- There is a special form of binding. The STG language has a special form of binding, whose general form is

    f = {v1,...,vn} \π {x1,...,xm} -> e

Program          prog    -> binds

Bindings         binds   -> var1 = lf1; ...; varn = lfn         (n >= 1)

Lambda-forms     lf      -> vars_f \π vars_a -> expr

Update flag      π       -> u    (updatable)
                          | n    (not updatable)

Expression       expr    -> let binds in expr                   (local definition)
                          | letrec binds in expr                (local recursion)
                          | case expr of alts                   (case expression)
                          | var atoms                           (application)
                          | constr atoms                        (saturated constructor)
                          | prim atoms                          (saturated built-in op)
                          | literal

Alternatives     alts    -> aalt1; ...; aaltn; default          (n >= 0, algebraic)
                          | palt1; ...; paltn; default          (n >= 0, primitive)

Algebraic alt    aalt    -> constr vars -> expr
Primitive alt    palt    -> literal -> expr
Default alt      default -> var -> expr
                          | default -> expr

Literals         literal -> 0# | 1# | ...                       (primitive integers)

Primitive ops    prim    -> +# | -# | *# | /# | ...             (primitive integer ops)

Variable lists   vars    -> {var1, ..., varn}                   (n >= 0)

Atom lists       atoms   -> {atom1, ..., atomn}                 (n >= 0)
                 atom    -> var | literal

Figure: Syntax of the STG language

It has two readings. From a denotational point of view, the free variables v1,...,vn and update flag π are ignored, and the definition binds f to the function λx1...xm.e. From an operational point of view, f is bound to a heap-allocated closure, containing a code pointer and pointers to the free variables v1,...,vn. This closure represents the function λx1...xm.e; when its code is executed, a special register will point to the closure, thereby giving access to its free variables.

The right-hand side of a binding is called a lambda-form, and is the only site for a lambda abstraction. Notice, though, that the abstraction can have free variables, so no lambda lifting need be performed (Section ).

The update flag on a lambda-form indicates whether its closure should be updated when it reaches its normal form (Section ). We say a lambda-form (or a closure built from it) is updatable if its update flag is u, and non-updatable otherwise.

- The STG language supports unboxed values. This aspect is discussed below in Section .

An STG program is just a collection of bindings. The variables defined by this top-level set of bindings are called globals, while all other variables bound in the program are called locals. The value of an STG program is the value of the global main.

The concrete syntax we use is conventional: parentheses are used to disambiguate; application associates to the left and binds more tightly than any other operator; the body of a lambda abstraction extends as far to the right as possible; and, where the layout makes the meaning clear, we allow ourselves to omit semicolons between bindings and case alternatives.

The STG language is similar in some ways to continuation-passing style (CPS), a point we return to in Section .

Translating into the STG language

In this section we outline how to translate a functional program into the STG language. We begin with an example: the well-known function map. Its definition in conventional notation (e.g. Haskell) is as follows:

map f []     = []
map f (y:ys) = f y : map f ys

The corresponding STG binding is this:

map = {} \n {f,xs} ->
        case xs of
          Nil {} -> Nil {}
          Cons {y,ys} -> let fy  = {f,y}  \u {} -> f {y}
                             mfy = {f,ys} \u {} -> map {f,ys}
                         in Cons {fy,mfy}

Notice the flattened structure, the explicit argument lists for every call, and the free-variable lists and update flags on each lambda-form. Since map itself is a global constant, it is not considered to be a free variable of the lambda-form for mfy.

In this example, every lambda-form has either no arguments or no free variables. To illustrate the two in combination, consider the following alternative definition for map:

map f = mf
        where
          mf []     = []
          mf (y:ys) = f y : mf ys

Here the recursion is over mf, which has free variable f. The corresponding STG binding is:

map = {} \n {f} ->
        letrec
          mf = {f,mf} \n {xs} ->
                 case xs of
                   Nil {} -> Nil {}
                   Cons {y,ys} -> let fy  = {f,y}   \u {} -> f {y}
                                      mfy = {mf,ys} \u {} -> mf {ys}
                                  in Cons {fy,mfy}
        in mf {}

Here mf is an example of a lambda-form with both free variables and arguments. Notice that mf is a free variable of its own right-hand side (see Section ).

The general transformation

In general, translation into the STG language involves the following transformations:

1. Replace binary application by multiple application:

     f e1 e2 ... en  =>  f {e1, e2, ..., en}

   The semantics is still that of curried application, of course, but the STG machine applies a function to all the available arguments at once, rather than doing so one by one.

2. Saturate all constructors and built-in operations, by η-expansion if necessary. That is,

     c {e1,...,en}  =>  \y1 ... ym -> c {e1,...,en,y1,...,ym}

   where c is a built-in operation or constructor with arity n+m.

3. Name every non-atomic function argument, and every lambda abstraction, by introducing a let binding.

4. Convert the right-hand side of each let binding into a lambda-form, by adding free-variable and update-flag information.
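Transformation 3 can be seen in miniature in plain Haskell: naming the non-atomic argument of succ with a let binding leaves every call with only variables as arguments (the function names are invented):

```haskell
-- Before: succ is applied to a non-atomic argument.
before :: Int -> Int
before x = succ (x * x + 1)

-- After: the argument expression gets its own binding, so the call
-- site mentions only the simple variable t.
after :: Int -> Int
after x = let t = x * x + 1
          in succ t
```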

Identifying the free variables

The transformation to STG code requires the free variables of each lambda-form to be identified. The rule is as follows: a variable must appear in the free-variable list of a lambda-form if

- it is mentioned in the body of the lambda abstraction; and
- it is not bound by the lambda; and
- it is not bound at the top level of the program.

Thus in the first version of map in the previous section, map does not appear in the free-variable list of mfy, because it is a global constant. On the other hand, in the second version of map, mf is a free variable of both itself and mfy.

The free-variable rule handles mutual recursion without any further complications. For example, the Haskell definition

f x y = fbody
        where
          g1 a = ...a...g2...x...
          g2 b = ...b...g1...y...

would transform to the STG definition

f = {} \n {x,y} -> letrec
                     g1 = {g2,x} \n {a} -> ...a...g2...x...
                     g2 = {g1,y} \n {b} -> ...b...g1...y...
                   in
                     fbody

The letrec builds a pair of closures, each of which points to the other.

The rule above says when a variable must appear in a free-variable list of a lambda-form. Of course, any in-scope variable may appear redundantly; surprisingly, there is one situation in which such redundant free variables prove useful (see Section ).

Closures and updates

In the STG language, let(rec) expressions bind variables to lambda-forms. Two pieces of denotationally-redundant, but operationally-significant, information are attached to a lambda-form: a list of the free variables of the lambda-form, and an update flag. In this section we focus on the update flag.

Updates are an expensive feature of lazy evaluation, whereby the closure representing an unevaluated expression is updated with its head normal form when it is evaluated (Section ). The aim is to avoid evaluating a particular closure more than once.

In contrast to the G-machine, which performs an update after almost every reduction, the Spineless Tagless G-machine is able to decide on a closure-by-closure basis whether updating is required. It shares this property with TIM and the Spineless G-machine. The update/no-update decision is controlled by the update flag on a lambda-form: if the update flag is u, the corresponding closure will be updated with its head normal form if it is ever evaluated; if it is n, no update will be performed.

It is clearly safe to set the update flag of every lambda-form to u, thereby updating every closure. But we can do much better than this. The obvious question is: to which lambda-forms can we safely assign an update flag of n, without losing the single-evaluation property? We explore this question by classifying lambda-forms into distinct classes.

Manifest functions. A manifest function is a lambda-form with a non-empty argument list. For example, map and mf are manifest functions in the examples of the previous section. Manifest functions do not require updating, because they are already in head normal form.

Partial applications. A partial-application lambda-form is of the form

  vs \n {} -> f {x1,...,xm}

where f is known to be a manifest function taking more than m arguments. Like manifest functions, partial applications are already in head normal form, and hence do not require updating. (Both manifest functions and partial applications are, of course, function values.) Partial applications sometimes appear directly in programs, but they also arise as a result of performing updates (Section ).

Only lambda-forms in precisely the form given above are classified as partial applications. For example, the lambda-form

  {x,y} \u {} -> let z = ...
                 in f {z}

is not classified as a partial application, because its body is not in the required form. There is a good reason for this: if a closure built from this lambda-form were not updated, the closure for z would be rebuilt each time it was entered.

Constructors. A constructor lambda-form is of the form

  vs \n {} -> c {x1,...,xm}

where c is a constructor. Since constructor applications are always saturated in the STG language, c is bound to have arity m. The update flag on a constructor is always n (no update).

Thunks. The remaining lambda-forms, those with an empty argument list but not of the special form of a partial application or a constructor, are called thunks. The lambda-forms mfy and fy are examples of thunks in the previous section. Since thunks are not in normal form, it appears at first that they should all have their update flag set to u. However, if the compiler can prove that a thunk can be evaluated at most once, then it is safe to set its update flag to n, thereby allowing the update to be omitted.

For example, consider the following definition:

  f = {} \n {p,xs} -> let j = {p} \n {} -> factorial {p}
                      in
                      case xs of
                        Nil {}      -> j {}
                        Cons {y,ys} -> j {}

Here it is clear that j will be evaluated at most once, so its closure does not need to be updated.

In summary: updates are never required for functions, partial applications and constructors, and may in addition sometimes be omitted for thunks. The analysis phase which determines which thunks need not be updated is called update analysis. Not much work seems to have been done on this topic, but we are working on a simple update analyser.

Generating fewer updatable lambda-forms

The translation from the Core language to the STG language is largely straightforward, apart from the update analysis (which can, of course, be omitted by flagging all thunks as updatable). There is an important opportunity to reduce the incidence of updates, which concerns constructors and partial applications. Consider the Haskell expression

  let xs = [y, y, y]
  in ...

A straightforward translation into the STG language gives

  let xs = {y} \u {} ->
             let t1 = {y} \u {} ->
                        let t2 = {y} \u {} ->
                                   let t3 = {} \n {} -> Nil {}
                                   in Cons {y,t3}
                        in Cons {y,t2}
             in Cons {y,t1}
  in ...

Three updatable thunks have been built, which will subsequently be updated when (and if) xs is traversed. An alternative, and usually superior, translation is this:

  let t3 = {} \n {} -> Nil {}
  in let t2 = {y,t3} \n {} -> Cons {y,t3}
  in let t1 = {y,t2} \n {} -> Cons {y,t2}
  in let xs = {y,t1} \n {} -> Cons {y,t1}
  in ...

No updatable thunks are built at all. The only bad thing about this translation is that, if xs were to be discarded without being traversed, then the work of constructing all four constructor closures would have been wasted. This is a strictly bounded amount of work, however.

Just the same alternative translation is possible when a known function is applied to too few arguments: the let bindings for the argument expressions can be lifted up a level, so that a partial-application lambda-form remains, which does not need to be updated.

In general:

- Updates can be omitted for parameterless lambda-forms if the body is a head normal form.

- Opportunities for this improvement may be enhanced by moving let(rec) bindings from the lambda-form to its enclosing context. More generally, any small, bounded computation may be moved from a lambda-form to its enclosing context, to expose a head normal form and thereby avoid an update.

Standard constructors

The final form of the example in the previous section had three lambda-forms of the form

  {x,xs} \n {} -> Cons {x,xs}

for various x and xs. Because they all have the same shape, they can clearly all share a single code pointer. In general, a lambda-form of the form

  {x1,...,xm} \n {} -> c {x1,...,xm}

where c is a constructor of arity m, is called a standard constructor. All such lambda-forms for a particular constructor c can share common code.

The free-variable list of a global definition is usually empty, since the global can only mention other globals, and so the global is represented by a closure consisting only of a code pointer. However, consider the following global Haskell definition:

  aList = [thing]
  thing = ...

A straightforward translation to the STG language gives

  aList = {} \n {} -> Cons {thing,nil}
  nil   = {} \n {} -> Nil {}
  thing = ...

This is perfectly correct, but it means having special-purpose code for aList, which has references to thing and nil wired into it. If aList were a list with several items in it, each cell in the list would have a separate code sequence. An alternative translation is

  aList = {thing,nil} \n {} -> Cons {thing,nil}
  nil   = {} \n {} -> Nil {}
  thing = ...

The lambda-form for aList now has free variables which are not strictly necessary, but the payoff is that the lambda-form is now a standard constructor, and can use the standard code for Cons. Instead of being represented by a code pointer alone, aList is now represented by the Cons code together with pointers to the globals thing and nil.

We apply this idea throughout, not just at top level, to make sure that every lambda-form whose body is a simple constructor application is a standard constructor.

For nullary constructors, such as Nil, it is possible to share not only their code but also their closure. Thus, in the above example, the nil global can be shared by all occurrences of Nil in the program.

Lambda lifting

Lambda lifting is a process whereby all function definitions are lifted to the top level, by making their free variables into extra arguments (Johnsson; Peyton Jones & Lester). In a lambda-lifted program, each lambda-form has either no free variables or no arguments. In contrast to most other abstract machines, the STG machine does not require the program to be lambda-lifted: a right-hand side can have both free variables and arguments.

The operational difference between the two is fairly slight. Consider the following STG-language definition:

  f = {} \n {x1,x2,x3} -> let z = {x1,x2} \n {y} -> zbody
                          in fbody

in which the lambda-form for z has the free variables x1 and x2, and argument y. If lambda lifting is performed, a new global function (or supercombinator) z' is introduced, giving

  z' = {} \n {x1,x2,y} -> zbody

  f = {} \n {x1,x2,x3} -> let z = {x1,x2} \n {} -> z' {x1,x2}
                          in fbody

Now the program has only thunks (like z) and supercombinators (like f and z'). Operationally, what happens is that when z is entered, it pushes its two free variables x1 and x2 onto the evaluation stack, and jumps to z'. In the original version, which the STG machine can execute directly, the free variables can be used directly from the closure itself.
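The same transformation can be written out in plain Haskell, where the lifted z' is an ordinary top-level function (names follow the STG example above; the bodies are invented):

```haskell
-- Before lifting: z is local, with free variables x1 and x2.
f :: Int -> Int -> Int -> Int
f x1 x2 x3 = let z = \y -> x1 + x2 + y
             in z x3

-- After lifting: z' is a supercombinator, taking its former free
-- variables as extra arguments.
z' :: Int -> Int -> Int -> Int
z' x1 x2 y = x1 + x2 + y

fLifted :: Int -> Int -> Int -> Int
fLifted x1 x2 x3 = let z = z' x1 x2   -- a partial application replaces the closure
                   in z x3
```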

In short, the local environment in which the STG machine executes consists of two parts (Section ): values held in the closure just entered (its free variables), and values held on the stack (its arguments). This two-level environment reduces somewhat the movement of values from the heap to the stack, but it is not yet clear whether this is a big improvement or only a marginal one.

Full laziness

Consider the binding

  f = {x} \n {y} -> let z = {x} \u {} -> ez
                    in ef

where ez and ef are arbitrary expressions. Suppose x and y are both free in ef, but only x is free in ez. Since the lambda-form for z does not have y as a free variable, this is equivalent to the pair of bindings

  z = {x}   \u {} -> ez
  f = {x,z} \n {y} -> ef

Furthermore, the latter form may save work if f is applied many times, because z will be instantiated only once, rather than once for each call of f. In general, each binding can be moved outwards, until its immediately-enclosing lambda abstraction binds one of the free variables of the binding. This transformation is called the full laziness transformation, and is described in detail by Peyton Jones & Lester.
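The effect is observable in plain Haskell (the bodies here are invented stand-ins for ez and ef):

```haskell
-- ez depends only on x; sum is a stand-in for an arbitrary computation.
ez :: Int -> Int
ez x = sum [1 .. x]

-- Before the transformation: z is rebuilt for every application of (fBefore x).
fBefore :: Int -> Int -> Int
fBefore x = \y -> let z = ez x in z + y

-- After the transformation: z is bound outside the \y, so all calls of
-- (fAfter x) share one instantiation of z.
fAfter :: Int -> Int -> Int
fAfter x = let z = ez x in \y -> z + y
```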

Arithmetic and unboxed values

In a non-strict functional language implementation, when a variable is bound it is generally bound to an unevaluated closure, allocated in the heap. When the value of the variable is required, the closure to which it points is evaluated, and the closure is overwritten with the resulting value. Further evaluations of the same closure will find the value immediately.

This evaluation model means that all numbers are represented by a pointer to a heap-allocated closure, or box, which contains either information which enables the number to be computed or, if the closure has been evaluated, the actual value of the number. We call the actual value an unboxed value: it can be manipulated directly by the instruction set of the machine.

The uniform boxed representation makes arithmetic horribly expensive. A simple addition, which takes one instruction in a conventional system, requires a sequence of instructions to evaluate the two operands, fetch their values, add them, allocate a new box for the result, and place the result in it.

One of the innovative features of our compiler is that unboxed values are explicitly part of the Core and STG languages. That is, variables may be bound to unboxed values; functions may take unboxed values as arguments and return them as results; unboxed values may be stored in data structures; and so on. The main motivation for this approach is that we can then be explicit about the steps involved in, say, addition. To begin with, we declare the following data type:

data Int = MkInt Int#

This declares the data type of boxed integers, Int, as an algebraic data type with a single constructor, MkInt. The latter has a single argument of type Int#, the type of unboxed integers. So a value MkInt n# represents the boxed integer n, where the literal n# stands for an unboxed constant of type Int#.

Now, given the expression e1 + e2, say, we can rewrite it like this:

  case e1 of
    MkInt x# -> case e2 of
                  MkInt y# -> case x# +# y# of
                                t# -> MkInt t#

The outer two case expressions evaluate e and e resp ectively while the inner case expresses

the fact that x and y are added and then their result t is b oxed with a MkInt constructor

By convention we use a trailing for identiers whose values or results are primitive This

is just for human readability the identiers and are distinct but the is not otherwise

recognised sp ecially by the compiler

It turns out that this simple idea allows several optimisations, which hitherto were buried in the code generator, to be reformulated as program transformations. Furthermore, the idea can be generalised in a number of directions, such as allowing general algebraic data types with unboxed components, rather than just Int. All of this is discussed in detail in Peyton Jones & Launchbury.

For the purposes of this paper it suffices to establish the following facts:

- Data types are divided into two kinds: algebraic data types are introduced by explicit data declarations, while primitive data types are built into the system. Values of primitive type can be manipulated directly by machine instructions, and are always unboxed. For example, Int is an algebraic type, while Int# is primitive. For the purpose of this paper it suffices to identify primitive types with unboxed types, though the generalisations discussed in Peyton Jones & Launchbury permit unboxed algebraic types as well.

- All literal constants are of primitive type; literals of algebraic type are expressed by giving an explicit application of a constructor.

- All arithmetic built-in operations operate over primitive values (for example, +# above). Definitions for functions operating over non-primitive (i.e. algebraic) values can be expressed directly in the STG language, and hence do not need to be built in.

- Values of unboxed type need not be the same size as a pointer. For example, Double#, the type of double-precision floating-point numbers, occupies 64 bits, while pointers usually occupy 32 bits. As a result, polymorphic functions can take only arguments of boxed type, because arguments must be passed to such functions in a uniform representation. Even if unboxed values were always the same size as a pointer, there would still be a difficulty for the garbage collector in distinguishing a pointer from a non-pointer.

- A let or letrec expression cannot bind a variable of unboxed type. Such a binding is instead made using a case expression. The reason for this is that when a variable of unboxed type is bound, the expression to which it is bound must be evaluated immediately; the whole point about unboxed values is that they cannot be represented by as-yet-unevaluated closures.

  In other words, in the STG language, case expressions perform evaluation, while let and letrec build closures. This uniform semantics gives rise to uniform transformation laws; for example, a let expression whose bound variable is not used can always be elided.

  For the same reason, the global top-level bindings of an STG program cannot bind values of unboxed type.

- There are two forms of case expression, as the figure giving the STG syntax describes. One takes apart a value of an algebraic data type, while the other performs case analysis on a value of primitive type.

Relationship to CPS conversion

Transformation to continuation-passing style (CPS) is a technique which has been used to good effect in several compilers for strict, call-by-value languages [Appel; Fradet & Le Metayer; Kelsey; Kranz; Steele]. Though the STG language is lazy, it has much the same flavour as CPS: nested constructs are flattened to an explicit sequence of simple operations, so that the flow of control is manifest, and there is a direct relationship between the remaining language constructs and individual machine operations.

To make the connection explicit, the following table shows the approximate correspondence between the constructs of the CPS language used by Appel and those of the STG language:

    Operation                          Appel's CPS form   STG form
    Application                        APP                Application
    Local function definition          FIX                letrec
    Record construction                RECORD             letrec
    Thunk construction                 -                  letrec
    Forcing of data values             -                  case
    Selection of alternative           SWITCH             case on algebraic types
    Extraction of record components    SELECT             case on algebraic types
    Primitive operations               PRIMOP             case on primitive types

There are a few minor differences between Appel's CPS and the STG language. Firstly, CPS-based implementations usually unbundle case expressions into forcing, multi-way selection, and extraction of the components of the data value. These are all bundled up together in case expressions, which, as we have seen, can be used to advantage by the STG machine.

Secondly, the STG language uses a single construct, letrec, to allocate function-valued and data-valued closures, thus allowing arbitrary mutual recursion between the two. It is not so clear how to achieve this using Appel's form of CPS.

There is a much more important difference, though: the STG language is not a continuation-passing style. In CPS, every user-defined function is given an extra parameter, namely the continuation to apply to its result. For example, assuming a continuation k, the expression

    f x + y

would be converted to the CPS form

    f x (\fx -> + fx y (\r -> k r))

The call to f is made into a tail call, passing to f an extra argument, the continuation (\fx -> + fx y (\r -> k r)). This continuation says what to do after (f x) has been computed: namely, add the result to y and pass that value to k. In contrast, the STG form of the same expression is

    case f x of
      MkInt x# -> case y of
                    MkInt y# -> case x# +# y# of
                                  r# -> MkInt r#

The continuation to the call to f is passed implicitly: when evaluation of (f x) is complete, control returns to the second case expression. (The second case evaluates y, which of course is not necessary in Appel's world, since SML is strict.) A lazy version of CPS would require the suspended computation inside a thunk to be a function taking a continuation as its argument. So the CPS form would really be

    f x (\fx -> force y (\yr -> + fx yr (\r -> k r)))

where force is the function which forces a thunk (its first argument) by applying it to force's second argument, the continuation. This code is strikingly similar to the STG form above.

This difference in the way in which continuations are handled clearly distinguishes CPS from the STG language, but it is quite difficult to pin down all the implications of the difference. For example, the CPS version has a natural stackless implementation, since every call is a tail call. On the other hand, it may thereby incur the cost of heap-allocating the closure for the continuation and passing it as an argument to f. The STG version suggests a stack-based implementation, since the current activation frame contains the environment in which the continuation should be executed. But of course either implementation is possible from both styles.

The STG style also seems to be more natural for curried function application. Consider the call (f x y), which is left unchanged by the conversion to the STG language. If converted to CPS (assuming that the call itself has continuation k), this would generate something like

    f x (\w -> w k y)

This is a rather expensive and clumsy compilation for an ordinary function application. We expect curried function application to be pervasive, so the STG language provides it as primitive.

Of course, this imposes an extra requirement on the code generator for the STG language: it must cope with functions applied to more or fewer arguments than they are expecting. For example, f might take one argument x, do a lot of computation, and finally reduce to a function which takes the second argument y. As a later section will show, though, graph reduction gives a natural way to provide this functionality.

In summary, the STG language has a similar flavour to CPS, but is a little less extreme. So far, we have not discovered any opportunities for optimisation which are exposed by CPS but hidden by the STG language. Consel & Danvy show that transforming the source program to CPS may improve the accuracy of some analyses; we have not investigated whether or not the STG language has a similar property.

Operational semantics of the STG language

The STG language is the abstract machine code for the STG machine. In this section we give a direct operational semantics for the STG language, using a state transition system.

A state transition semantics specifies (a) an initial state for the machine, and (b) a series of state transition rules. Each rule specifies a set of source states and the corresponding target states after the transition has taken place. The set of source states is specified implicitly, using pattern-matching and guard conditions; if a state is in the source set for a given transition rule, we say that the rule matches the state. At most one transition rule should match any given state, and if no rule matches, the machine halts.

The state has six components:

- the code, which takes one of several forms given below;
- the argument stack as, which contains values;
- the return stack rs, which contains continuations;
- the update stack us, which contains update frames;
- the heap h, which contains only closures;
- the global environment σ, which gives the addresses of all closures defined at top level.

Sequences are used extensively in what follows. They are denoted using curly brackets, thus {a1, ..., an}. The empty sequence is denoted {}; if as and bs are two sequences, then as ++ bs is their concatenation, and a : as denotes the sequence obtained by adding the item a to the beginning of the sequence as. The length of a sequence as is denoted length as.

A value takes one of the following forms:

    Addr a    A heap address
    Int n     A primitive integer value

In the operational semantics, values are tagged with Addr, Int and so on, to distinguish these different kinds of value. We discuss later ways to avoid actually implementing this tagging in a real implementation. We could add further forms of value for other primitive data types, such as floating-point numbers, but they are handled exactly analogously to integers, so we omit them to reduce clutter.

We use w1, w2, ... to range over values, and ws to range over sequences of values.

The argument stack as is just a sequence of values. The top of the stack is the beginning of the sequence. The return stack and update stack are dealt with later, in the subsections on case expressions and updating respectively.

The heap h is a mapping from addresses (ranged over by a1, a2, ...) to closures. Every closure is of the form

    (vs \π xs -> e) ws

Intuitively, the lambda-form (vs \π xs -> e) denotes the code of the closure, while the sequence of values ws gives the value of each of the free variables vs. We use π to range over update flags, which can be either u or n. This is exactly the uniform representation discussed earlier.

The global environment component of the state, σ, maps the name of each variable bound at the top level of the program to the address of its closure. These closures can all be allocated once and for all, before execution begins. Indeed, unlike the other components, σ does not change during execution. The STG machine is unusual in binding globals to closures, rather than to code sequences. It is important to do so, however, because a global may be updatable, so there must be a closure to update.

As we discussed earlier, it is possible to share the code for standard-constructor closures. In the special case of constructors with no arguments, such as Nil, it is possible to share not just the code for the closure but the closure itself. For example, all references to Nil can use the address of a single global closure. This is easily done by adding niladic constructors as a possible form of atom, and extending σ with the address of a suitable closure for each niladic constructor. For the sake of simplicity, we do not perform this optimisation in the operational semantics which follows.

Finally, the code component of the state takes one of the following four forms, each of which is accompanied by its intuitive meaning:

    Eval e ρ          Evaluate the expression e in environment ρ, and apply its value
                      to the arguments on the argument stack. The expression e is an
                      arbitrarily complex STG-language expression.

    Enter a           Apply the closure at address a to the arguments on the argument
                      stack.

    ReturnCon c ws    Return the constructor c, applied to values ws, to the
                      continuation on the return stack.

    ReturnInt k       Return the primitive integer k to the continuation on the
                      return stack.

The local environment ρ maps variable names to values. The notation ρ[v ↦ w] extends the map ρ with a mapping of the variable v to the value w. This notation also extends in the obvious way to sequences of variables and values, for example ρ[vs ↦ ws].

The val function takes an atom and delivers a value:

    val ρ σ k = Int k
    val ρ σ v = ρ v,  if v ∈ dom ρ
              = σ v,  otherwise

If the atom is a literal k, val returns a primitive integer value. If it is a variable, val looks it up in ρ or σ, as appropriate. val extends in the obvious way to sequences of variables: val ρ σ vs is the sequence of values to which val ρ σ maps the variables vs.

The initial state

We begin by specifying the initial state of the STG machine. The general form of an STG program is as follows:

    g1 = vs1 \π1 xs1 -> e1
    ...
    gn = vsn \πn xsn -> en

One of the gi will be main. Given this program, the corresponding initial state of the machine is

    Code               Arg stack   Return stack   Update stack   Heap     Globals
    Eval (main {}) {}  {}          {}             {}             h_init   σ

    where σ      = [g1 ↦ Addr a1, ..., gn ↦ Addr an]
          h_init = [a1 ↦ (vs1 \π1 xs1 -> e1) (σ vs1),
                    ...,
                    an ↦ (vsn \πn xsn -> en) (σ vsn)]

We write a machine state as a horizontal row of its components, sometimes with auxiliary definitions (as here) introduced by a where clause. In this initial state, the code component says that main is to be evaluated in the empty local environment; the argument, return and update stacks are empty; the initial heap h_init contains a closure for each global; and the global environment σ binds each global to its closure.

Notice that the values in the range of the global environment are all addresses. This reflects the fact that global variables are always boxed.

Applications

We begin the main operational semantics with the rule for applications:

    Eval (f xs) ρ    as    rs    us    h    σ
        such that  val ρ σ f = Addr a
    ==> Enter a    (val ρ σ xs) ++ as    rs    us    h    σ

The top line of the rule gives the state before the transition, while the bottom line gives the state afterwards. We use a pattern-matching notation for the top line: in this case, the rule only matches if the code component is an Eval of an expression of the given form. The "such that" clause further constrains the rule to the case where f is bound to the address of a closure, and not to a primitive value.

The rule says that to perform a tail call, the values of the arguments are put on the argument stack and the value of the function is entered. The function is expected to be a closure; the other case, when f is bound not to an address but rather to a primitive value, is dealt with in the subsection on built-in operations. Notice that the local environment ρ is discarded at this point; in general, the local environment only has a very local lifetime.

The next thing to discuss is the rule for entering a closure. We give only the rule for entering non-updatable closures; the rule for updatable closures is given in the subsection on updating.

    Enter a    as    rs    us    h[a ↦ (vs \n xs -> e) ws_f]    σ
        such that  length as >= length xs
    ==> Eval e ρ    as'    rs    us    h    σ
        where  ws_a ++ as' = as
               length ws_a = length xs
               ρ = [vs ↦ ws_f, xs ↦ ws_a]

When a non-updatable closure is entered, the local environment is constructed by binding its free variables vs to the values ws_f found in the closure, and its arguments xs to the values ws_a found on the stack. Then the body of the closure is evaluated in this environment. In this rule we use a where clause to give values to variables used in the result state of the rule.

let and letrec expressions

As mentioned earlier, a let expression constructs one or more closures in the heap:

    Eval (let { x1 = vs1 \π1 xs1 -> e1;
                ...;
                xn = vsn \πn xsn -> en } in e) ρ    as    rs    us    h    σ
    ==> Eval e ρ'    as    rs    us    h'    σ
        where  ρ'    = ρ[x1 ↦ Addr a1, ..., xn ↦ Addr an]
               h'    = h[a1 ↦ (vs1 \π1 xs1 -> e1) (ρ_rhs vs1),
                         ...,
                         an ↦ (vsn \πn xsn -> en) (ρ_rhs vsn)]
               ρ_rhs = ρ

The rule for letrec is almost identical, except that ρ_rhs is defined to be ρ' instead of ρ.

Case expressions and data constructors

The return stack is used for the first time when we come to case expressions. Given the expression

    case e of alts

the operational interpretation is: push a continuation onto the return stack, and evaluate e. When the evaluation of e is complete, execution will resume at the continuation, which then decides which alternative to execute. The rule for case follows fairly directly:

    Eval (case e of alts) ρ    as    rs    us    h    σ
    ==> Eval e ρ    as    (alts, ρ) : rs    us    h    σ

The continuation is a pair (alts, ρ): the alternatives alts say what to do when evaluation of e completes, while the environment ρ provides the context in which to evaluate the chosen alternative. We will have more to say later about how this expensive-looking environment saving is performed.

The other side of the coin is the rules for constructors and literals. Presumably e eventually evaluates to either a constructor or a literal, at which point the continuation must be popped from the return stack and executed. The rules for constructors and literals each use an intermediate state, ReturnCon and ReturnInt respectively, just as the rule for function application uses Enter. Primitive values are dealt with in the next subsection, while the rules for constructors are given next.

Evaluating a constructor application simply moves into the ReturnCon state:

    Eval (c xs) ρ    as    rs    us    h    σ
    ==> ReturnCon c (val ρ σ xs)    as    rs    us    h    σ

The rules for ReturnCon return to the appropriate continuation, taken from the return stack:

    ReturnCon c ws    as    ({...; c vs -> e; ...}, ρ) : rs    us    h    σ
    ==> Eval e ρ[vs ↦ ws]    as    rs    us    h    σ

Provided that the continuation on the return stack contains a pattern (c vs) whose constructor c is the same as that being evaluated, we just evaluate the right-hand side of that alternative in the saved environment ρ, augmented with bindings of the variables vs to the values of the actual arguments to c.

If there is no such alternative, the default alternative is taken. The rule for this is easy when no variable is bound in the default case:

    ReturnCon c ws    as    ({c1 vs1 -> e1;
                              ...;
                              cn vsn -> en;
                              default -> ed}, ρ) : rs    us    h    σ
        such that  c /= ci, 1 <= i <= n
    ==> Eval ed ρ    as    rs    us    h    σ

However, if a variable v is bound by the default, we need to heap-allocate a constructor closure to which to bind v, thus:

    ReturnCon c ws    as    ({c1 vs1 -> e1;
                              ...;
                              cn vsn -> en;
                              v -> ed}, ρ) : rs    us    h    σ
        such that  c /= ci, 1 <= i <= n
    ==> Eval ed ρ[v ↦ Addr a]    as    rs    us    h'    σ
        where  h' = h[a ↦ (vs \n {} -> c vs) ws]
               vs is a sequence of arbitrary distinct variables
               length vs = length ws

This rule is a little complicated, and a simple program transformation can eliminate the variable-binding form of default from the language (for algebraic case expressions, anyway):

    case e of { ...; v -> b }
        ==>
    let v = vs \u {} -> e
    in case v of { ...; default -> b }

In implementation terms this version is a little less efficient, because a closure for v will be allocated and then updated, whereas the rule above simply allocates the constructor in its final form.

Lastly, if there is no match and no default alternative, no rule matches, which is interpreted as failure.

Built-in operations

In this section we give the extra rules which handle primitive values. The rule for evaluating a primitive literal k enters the ReturnInt state:

    Eval k ρ    as    rs    us    h    σ
    ==> ReturnInt k    as    rs    us    h    σ

A similar rule deals with the case where a variable bound to a primitive value is entered:

    Eval (f {}) ρ[f ↦ Int k]    as    rs    us    h    σ
    ==> ReturnInt k    as    rs    us    h    σ

Next come the rules for the ReturnInt state, which look for a continuation on the return stack. First, the case where there is an alternative which matches the literal:

    ReturnInt k    as    ({...; k -> e; ...}, ρ) : rs    us    h    σ
    ==> Eval e ρ    as    rs    us    h    σ

Next, the cases where the default alternative is taken:

    ReturnInt k    as    ({k1 -> e1;
                           ...;
                           kn -> en;
                           x -> e}, ρ) : rs    us    h    σ
        such that  k /= ki, 1 <= i <= n
    ==> Eval e ρ[x ↦ Int k]    as    rs    us    h    σ

    ReturnInt k    as    ({k1 -> e1;
                           ...;
                           kn -> en;
                           default -> e}, ρ) : rs    us    h    σ
        such that  k /= ki, 1 <= i <= n
    ==> Eval e ρ    as    rs    us    h    σ

Finally, we need a family of rules for built-in arithmetic operations, which for each binary built-in operation op have the form:

    Eval (op {x1, x2}) ρ[x1 ↦ Int i1, x2 ↦ Int i2]    as    rs    us    h    σ
    ==> ReturnInt (i1 op i2)    as    rs    us    h    σ

Updating

In this section we cover the updating technology necessary for a graph reduction machine. Updates happen in two stages:

1. When an updatable closure is entered, it pushes an update frame onto the update stack, and makes the argument and return stacks empty. An update frame is a triple (as_u, rs_u, a_u), consisting of:
   - as_u, the previous argument stack;
   - rs_u, the previous return stack;
   - a_u, a pointer to the closure being entered, which should later be updated.

2. When evaluation of the closure is complete, an update is triggered. This can happen in one of two ways:
   - If the value of the closure is a data constructor or literal, an attempt will be made to pop a continuation from the return stack, which will fail because the return stack is empty. This failure triggers an update. (In the real implementation we can avoid making the test, by merging the return and update stacks and making the update into a special sort of continuation.)
   - If the value of the closure is a function, the function will attempt to bind arguments which are not present on the argument stack, because they were squirreled away in the update frame. This failure to find enough arguments triggers an update.

These situations are made precise in the following rules. First, we need to add an extra rule which applies when entering an updatable closure, that is, one whose update flag is u. The rule is similar to the usual closure-entry rule:

    Enter a    as    rs    us    h[a ↦ (vs \u {} -> e) ws_f]    σ
    ==> Eval e ρ    {}    {}    (as, rs, a) : us    h    σ
        where  ρ = [vs ↦ ws_f]

The difference is that the argument stack, return stack, and closure being entered are formed into an update frame, which is pushed onto the update stack. (Naturally, the real implementation manipulates pointers rather than copying entire stacks.) Since closures with a non-empty argument list are never updatable, we only deal with this case in the rule given.

Next, we need new rules for constructors which see an empty return stack. When this happens, they update the closure pointed to by the update frame, restore the argument and return stacks from the update frame, and try again. It may be that the restored return stack contains the continuation; but it too may be empty, in which case a second update is performed, and so on until the continuation is exposed.

    ReturnCon c ws    {}    {}    (as_u, rs_u, a_u) : us    h    σ
    ==> ReturnCon c ws    as_u    rs_u    us    h'    σ
        where  vs is a sequence of arbitrary distinct variables
               length vs = length ws
               h' = h[a_u ↦ (vs \n {} -> c vs) ws]

The closure to be updated (address a_u) is just updated with a standard-constructor closure.

Only a rule for ReturnCon need be given. It is not possible for the ReturnInt state to see an empty return stack, because that would imply that a closure should be updated with a primitive value; but no closure has a primitive type.

Finally, we need a rule to handle the case where there are not enough arguments on the stack to be bound by a lambda abstraction, which triggers an update. The relevant rule is:

    Enter a    as    {}    (as_u, rs_u, a_u) : us    h    σ
        such that  h a = (vs \n xs -> e) ws_f
                   length as < length xs
    ==> Enter a    (as ++ as_u)    rs_u    us    h'    σ
        where  xs1 ++ xs2 = xs
               length xs1 = length as
               h' = h[a_u ↦ ((vs ++ xs1) \n xs2 -> e) (ws_f ++ as)]

The rule will only apply if the number of arguments xs is greater than zero, so the closure being entered will be non-updatable; hence the n in the first line of the rule. The closure to be updated (address a_u) has as its value the value of the closure being entered (address a) applied to the arguments on the stack, as. It is therefore updated with a closure whose code is (vs ++ xs1) \n xs2 -> e: the body e is the same as that of a, but it has more free variables (xs1 as well as vs) and fewer arguments (xs2 instead of xs). After the update, the Enter is retried.

This concludes the basic rules for updating. However, one of the constraints in a real implementation is that it cannot manufacture compiled code on the fly, so we need to be careful about the code part of closures which are created by updating. The code required for constructors, vs \n {} -> c vs, is OK, because we can precompile it for each constructor c. The code for partial applications, (vs ++ xs1) \n xs2 -> e, is more tiresome, since it suggests that we need to precompile the entire body e of every function, for every possible partial application. An alternative rule for partial-application updates avoids this problem:

    Enter a    as    {}    (as_u, rs_u, a_u) : us    h    σ
        such that  h a = (vs \n xs -> e) ws_f
                   length as < length xs
    ==> Enter a    (as ++ as_u)    rs_u    us    h'    σ
        where  xs1 ++ xs2 = xs
               length xs1 = length as
               f is an arbitrary variable
               h' = h[a_u ↦ ((f : xs1) \n {} -> f xs1) (Addr a : as)]

Here, the closure being entered, a, is used in the new closure. The new code required, namely (f : xs1) \n {} -> f xs1, can be shared between all partial applications to the same number of arguments. All that is required is a family of such code-blocks, one for each possible number of arguments.

Part III: Mapping the abstract machine to stock hardware

We have now completed the abstract description of the Spineless Tagless G-machine. Whilst it has some interesting features, its real justification is that it maps very nicely onto stock hardware, with a rich set of design alternatives, some of which we have already indicated. In the rest of the paper we describe the mapping in detail.

Target language

Our goal is, of course, to generate good native code for a variety of stock architectures. One approach to this is to write individual code generators for each architecture, and this is likely to give the best results in the end. Unfortunately, to compete with more mature imperative languages, whose code generators have evolved and improved over many years, we would have to do a comparably good job of code generation, which is a lot of work.

Motivated by this concern, we generate code in the C language as our primary target, rather than generating native code direct. In this way we gain instant portability, because C is implemented on a wide variety of architectures, and we benefit directly from improvements in C code generation. This approach of using C as a "high-level assembler" has gained popularity recently [Bartlett; Miranda(4)]. In particular, the work of Tarditi et al on compiling SML to C, developed independently and concurrently with ours, addresses essentially the same problems [Tarditi, Acharya & Lee].

Rather than generating C directly, we go via an internal datatype called Abstract C. This allows the following spectrum of alternatives for the final code generation, with increasing efficiency and decreasing portability:

- We can generate ANSI-standard C, which should be widely portable.
- We can generate C which exploits various non-standard extensions to C supported by the Gnu C compiler [Stallman].
- We can generate native machine code directly.

So far we have concentrated only on the first two alternatives.

Compiling via C is very attractive for portability reasons but, like all good things, it does not come for free. In the rest of this section we describe a few tricks which substantially improve the code we can generate using this route, usually by exploiting non-standard extensions to C provided by Gnu C.

(4) Here "Miranda" is not the trade mark: it is the last name of a researcher at Queen Mary and Westfield College, London.

Mapping the STG machine to C

At first it appears sensible to try to map functions from the original functional program onto C functions, but we soon abandoned this approach. The mismatch between C and a non-strict, higher-order functional language is too great.

Instead, the argument stacks and control stack are mapped onto explicit C arrays, bypassing the usual C parameter-passing mechanism. All registers, such as the stack pointers, heap pointer, heap limit, and other registers introduced later, are held in global variables.

This approach results in a great deal of global-variable manipulation. The overheads can be reduced, without losing portability, by caching such globals in register-allocated local variables during the execution of a single code-block, based on a simple usage analysis [Tarditi, Acharya & Lee]. At the expense of portability, the overheads can be eliminated entirely by telling the C compiler to keep particular globals in specified registers permanently, a highly non-standard, architecture-specific facility provided by the Gnu C compiler.

Compiling jumps

The main difficulty with generating C concerns labels. We use the term code label (or just label) to mean an identifier for a code sequence. The important characteristics of a code label are that:

- it can be used to name an arbitrary block of code;
- it can be manipulated: for example, it can be pushed onto a stack, stored in a closure, or placed in a table;
- it can be used as the destination of a jump.

We usually think of labels as being represented by code addresses. The trouble is that C has nothing which directly corresponds to code labels. There are two ways out of this dilemma, which we outline in the following subsections.

Using a giant switch

The first solution is to map labels onto integer tags, and embed the entire program in a loop with the following form:

    int cont;
    while (TRUE) {
        switch (cont) {
            case 1:  /* code for label 1 */
            case 2:  /* code for label 2 */
            /* and so on */
        }
    }

Now a jump can be implemented by assigning the new tag to cont, followed by a break statement. The switch statement will then re-execute with the new label.

The shortcomings of the technique are clear. Firstly, a layer of indirection has been imposed, because labels are not implemented directly as code pointers.

Secondly, and more seriously, separate compilation is made much more difficult. The C code for the entire program, including the runtime system, has to be gathered together into a single giant C procedure and then compiled. Not only does this stress the C compiler quite substantially, and impose heavy recompilation costs on even local changes, but it also means that a special linker has to be written to paste together the C code generated from each separately-compiled source-language module.

Using a tiny interpreter

Because of these problems, we use an alternative method, based on a nice trick. The idea is to compile each labelled block of code to a parameterless C function, whose name is the required label. Now, C does treat functions as storable values, representing each by a pointer to its code. The only problem is how to jump to such a code block. The only mechanism C provides is to call the function; but then every jump would make C's return stack grow by one more word, causing certain stack overflow.

A C compiler which implemented a tail call as a jump would not suffer from this problem, but it would hardly be a portable solution to require such an optimisation for correct operation. Furthermore, C is complicated enough to make the tail-call optimisation quite hard to get right (in the presence of variadic functions, for example), and no C compiler known to us does so.

So here is the trick: each parameterless function representing a code block returns the code pointer to which it would like to jump, rather than calling it. The execution of the entire program is controlled by the following one-line interpreter:

    while (TRUE) { cont = (*cont)(); }

That is, cont is the address of the code block (that is, C function) to be executed next. The function to which it points is called, and returns the address of the next one, and so on. The loop is finally broken by a longjmp, though one could equally well test cont for a particular value instead, for a fairly minor cost.

Here, for example, is a code block which jumps to a label found on top of the return stack:

    CodeLabel f() {
        CodeLabel lbl = (CodeLabel) *RetSp++;
        return lbl;
    }

The result is a fully-portable implementation which supports separate compilation in the usual way, with a standard linker. Labels are represented directly by code addresses.

Temporary variables used within a single code block are declared as local variables of the C function generated for the code block. Their scope is thereby limited, so that a good C compiler will put them in registers where possible.

It turns out that this idea is actually very old, and that we only reinvented it. Like several other clever ideas, Steele seems to have been its inventor: he called it the "UUO handler" in his Rabbit compiler for Scheme (Steele). The same idea is used by Tarditi, Acharya & Lee, who use C as a target for their SML compiler.

Optimising the tiny interpreter

In the portable tiny interpreter described above, a jump has the following overheads:

- the epilogue generated by the C compiler for the current C function, concluding with a return instruction, which pops the return address to return to the interpreter;
- a jump to implement the interpreter's loop;
- a call instruction, which pushes the interpreter's return address;
- the prologue generated by the C compiler for the new C function.

At the expense of portability, we can make some architecture- and compiler-specific optimisations to this jump sequence.

Eliminating register saves. For architectures with a fixed register set, most C compilers implement a callee-saves convention for all registers except a small number of work registers: there is a save sequence at the start of each function, and a restore sequence at the end.

Gnu C provides a compiler flag which makes the compiler use a caller-saves convention. In conjunction with the direct-jump optimisation described below, this eliminates all register save instructions.

Eliminating the frame pointer. Most C compilers generate instructions at the beginning and end of functions to set up a frame-pointer register. This is redundant, because the compiler can always figure out the offsets of local variables from the stack pointer itself, but it is vital for debuggers.

Gnu C provides a compiler flag to suppress frame-pointer manipulation, at the expense of confusing the debugger.

Generating direct jumps. Instead of generating return(lbl), we actually generate JUMP(lbl), where JUMP is a macro. For a portable implementation, JUMP expands to a return statement, but the implementation can be made faster by making JUMP expand to an inline assembly-code instruction which really does take a jump. Most C compilers provide an assembly-language trap door, which we exploit here. Using inline assembly code in this way has its pitfalls, especially if we simultaneously try to use local variables; Miranda gives details of the tricky things one has to do (Miranda).
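A sketch of such a JUMP macro follows. The gcc inline-assembly variant is illustrative only (x86, and not enabled by default); the portable variant simply hands the label back to the tiny interpreter.

```c
#include <assert.h>

/* Portable JUMP: return the label to the tiny interpreter. The
   non-portable variant (sketched for gcc on x86) really does jump. */
#ifdef USE_DIRECT_JUMP
#define JUMP(lbl) do { __asm__ volatile ("jmp *%0" : : "r" (lbl)); } while (0)
#else
#define JUMP(lbl) return (void *)(lbl)
#endif

typedef void *(*CodeLabel)(void);

static void *done(void)
{
    return 0;
}

static void *start(void)   /* a code block that "jumps" to done */
{
    JUMP(done);
}
```

With the portable definition, calling start simply yields the address of done, exactly as the interpreter loop expects.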

Debugging

The use of a tiny interpreter turned out to have a very useful property which we had not anticipated: it is a tremendous debugging aid.

The STG machine frequently takes an indirect jump to the code pointed to by a closure. If a bug has caused a closure to be corrupted, this indirect jump usually causes a segmentation fault or illegal instruction. The difficulty is that there is usually no way of backing up to the code which performed the jump, which is the first step in identifying the source of the error.

Using the unoptimised tiny interpreter provides an easy solution, because it can easily record a trail of the most recent few jumps. Since every jump passes through the tiny interpreter, it faithfully records the address of the code block containing the fatal jump.

Furthermore, it is easy to add to the tiny interpreter's loop a call to a hygiene-checking routine, which checks that the machine state looks plausible. While it slows down the program considerably, we have found this hygiene-checking an invaluable aid for trapping the point at which the machine state becomes corrupt, rather than the point at which the corruption causes a crash, which is often much, much later.

It is hard to overstate the usefulness of this trick, especially since it has no impact at all on the compiler and the code it generates. Only people who have spent all night trying to find the cause of the heap corruption which subsequently led to a system crash can truly appreciate it.

The heap

The heap is a collection of closures of variable size, each identified by a unique address. We use the term pointer to refer to the address of a closure.

How closures are represented

Each closure occupies a contiguous sequence of machine words, which is always laid out as shown in the figure below.

The first word of a closure is called its info pointer, and points to its info table. Following the info pointer is a block of words each of which contains a pointer, followed by a block of words containing no pointers. The distinction between the two is that the garbage collector must follow the former but not the latter. There is a single statically-allocated info table associated with each bind in the program text (see the figure below). Each dynamic instance of this binding is a heap-allocated closure whose info pointer refers to its static info table (Rule ).

The info table contains a number of fields, which will be described later, but the most important is the first field, which contains the label of the closure's standard-entry code. The operation of entering a closure is performed by:

- loading the address of the closure into the Node register, and
- jumping to the standard-entry code for the closure, whose label is usually fetched from the info table by indirecting from Node.

[Figure: The layout of a closure. A closure consists of an info pointer, followed by pointer words, then non-pointer words; the info table it points to contains the standard-entry code, evacuation code, scavenge code, and other info.]

The standard-entry code can access the various fields of the closure by indexing from the Node register. The rest of the info table contains:

- Enough information to enable the garbage collector to do its job. In fact, we implement this information as two code labels, which are described further in Section .
- Debugging information, for inspection by a debugger or trace generator.
- For our parallel implementation, enough information to enable the closure to be flushed into global memory. This too is actually implemented as a code label.
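One way to picture this layout is as a pair of C structures. The field and type names here are ours, for illustration only, not those of the actual implementation.

```c
#include <assert.h>
#include <stddef.h>

typedef void *(*CodeLabel)(void);

/* One statically-allocated info table per binding in the program text. */
typedef struct InfoTable {
    CodeLabel entry;      /* standard-entry code, used when the closure is entered */
    CodeLabel evacuate;   /* GC: copy this closure into to-space                   */
    CodeLabel scavenge;   /* GC: evacuate the closures this one points to          */
    /* ... other info: debugging data, flushing code for the parallel system ...   */
} InfoTable;

/* A closure: its info pointer, then pointer words, then non-pointer words. */
typedef struct Closure {
    const InfoTable *info;
    void *payload[];      /* pointers first, followed by non-pointers */
} Closure;
```

Note that the closure itself carries nothing but the info pointer and its fields; everything else lives in the shared, static info table.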

It is usual for heap-allocated objects to contain layout information to specify their size and which of their fields contain pointers. In contrast, our closures do not contain any such information. Rather, as we shall see, size and layout information is encoded in the info table.

Indirection closures are generated by update operations, and they have a particularly efficient representation:

    [ Ind info pointer | pointer to another closure ]

The standard-entry code for Ind_info consists of only two instructions: one to load the indirection pointer from the closure into Node, and a second to enter the new closure.
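Trampoline-style, that two-instruction entry code can be sketched as follows (all names are invented for the sketch; "entering" a closure portably means returning its entry label to the interpreter).

```c
#include <assert.h>

typedef void *(*CodeLabel)(void);

/* An indirection closure: an info pointer and the indirected-to closure. */
typedef struct Closure {
    CodeLabel *info;          /* info table; element 0 is the entry code */
    struct Closure *ind;      /* pointer to another closure              */
} Closure;

static Closure *Node;         /* the machine's Node register             */

/* Standard-entry code for Ind_info: load the indirectee into Node,
   then enter it. */
static void *ind_entry(void)
{
    Node = Node->ind;
    return (void *) Node->info[0];
}

/* A target closure whose entry code just returns NULL, for demonstration. */
static void *target_entry(void) { return 0; }
static CodeLabel target_info[1] = { target_entry };
static Closure   target         = { target_info, 0 };

static CodeLabel ind_info[1]    = { ind_entry };
```

After running ind_entry, Node points at the target closure and control passes to its own entry code, so chains of indirections are followed with no per-closure tag tests.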

In retrospect, this representation is quite similar to that chosen by the Chalmers group for their G-machine implementation (Johnsson). In their system, every heap cell has a one-word tag, which points to a table of entry points for the various operations that could be performed on the cell. Our system differs from theirs in two respects. First, and most important, rather than having a fixed collection of tags, we generate a new info table for each bind in the program text, together with its associated code. This essentially eliminates the interpretive unwind used by the G-machine. Second, the operation of entering a closure involves an indirection to find the code label to jump to; this indirection can be avoided when generating native code directly, as Section discusses.

Allocation

The closures for top-level globals are allocated statically, at fixed addresses; we call them static closures. A static closure is not necessarily immutable, however, because it may be a thunk which is updated during execution. The alert reader will spot that this policy gives rise to a garbage-collection problem, which we return to in Section .

All other closures are allocated dynamically from the heap. As is now well understood, for good performance it is essential to allocate from a contiguous block of free space, rather than from a free list (Appel). Free space is delimited by two special registers: the Hp register points to one end of it, while the HLimit register points to the other. Allocation is done on a basic-block basis, so that only one free-space exhaustion check is made for each basic block.
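In outline, allocation looks like this. The names are ours and the pointer direction is simplified; the point is that one bump of Hp and one comparison against HLimit cover all the closures allocated in a basic block.

```c
#include <assert.h>

#define HEAP_WORDS 1024

static long  heap[HEAP_WORDS];
static long *Hp     = heap;               /* next free word, in this sketch  */
static long *HLimit = heap + HEAP_WORDS;  /* the other end of the free space */

/* One free-space check covers a whole basic block's worth of allocation:
   bump Hp once, compare once, then fill in the closures with no further
   tests. Returns the start of the block, or NULL where a real
   implementation would trigger a garbage collection. */
static long *alloc_block(int words)
{
    long *start = Hp;
    Hp = Hp + words;
    if (Hp > HLimit)
        return 0;
    return start;
}
```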

Two-space garbage collection

Garbage collection is performed by a two-space stop-and-copy collector (Baker). Available memory is divided into two semi-spaces. When garbage collection is initiated, all live closures are copied from one semi-space to one end of the other.

This copying process involves two basic operations on closures:

- Each live closure must be evacuated from from-space to to-space.
- As to-space is scanned linearly, each closure must be scavenged; that is, each closure to which it points must be evacuated (unless it has already been evacuated), and the new to-space pointer substituted for the old from-space pointer.

The unusual feature of our system is that these two operations, evacuation and scavenging, are implemented by code pointed to from the info table of each closure. These code sequences know the exact structure of the closure, and can therefore operate without interpretive loops and without any further layout information.

The evacuation code, which is called as a C function, does the following:

- It copies the closure into to-space.
- It overwrites the closure in from-space with a forwarding pointer, which points to the newly-allocated copy of the closure in to-space.
- It returns the new to-space address of the closure to the caller.

The scavenging code of a closure, also called as a C function, does the following. For each pointer in the closure:

- it calls the evacuation code for the closure to which it points;
- it replaces the pointer in the original closure with the to-space pointer returned from this evacuation call.

The scavenging code knows which of the argument fields contain closure addresses, and hence must be evacuated, and which do not, and hence must not be.

A C function does the once-per-collection work of switching spaces and accumulating statistical information, but almost all the work of garbage collection is carried out by the evacuate and scavenge routines of the closures in the heap. As in the case of the standard-entry code of a closure, the info-table dispatch mechanism for evacuation and scavenging provides the opportunity to deal with several special cases "for free"; that is, without any further tests.

Forwarding pointers. A forwarding pointer handles the situation where a second attempt is made to evacuate the closure: an attempt to evacuate a closure which has been overwritten with a forwarding pointer simply returns the to-space address found in the forwarding pointer. There is a nice optimisation available here. Most systems distinguish a forwarding pointer by some sort of tag bit, which has to be tested just before evacuating. Instead, we make a forwarding pointer look just like any other closure: it has an info pointer, and one field, which points to the to-space copy. The info table for a forwarding pointer has rather simple evacuation code, which just returns the to-space address found in the forwarding pointer. So to evacuate a closure one simply jumps to its evacuation code, regardless of whether the closure is now a forwarding pointer or not. No forwarding-pointer test is performed.

All heap-allocated closures are at least two words long, in order to leave enough space for a forwarding pointer.
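The forwarding-pointer trick can be sketched as follows. All the names are invented, and the to-space array stands in for a real heap semi-space; the point is that evacuation always just runs the closure's own evacuation code, with no tag test.

```c
#include <assert.h>

typedef struct Closure Closure;
typedef Closure *(*EvacCode)(Closure *);

/* Every closure is at least two words: an info pointer plus one field,
   so a forwarding pointer always fits. */
struct Closure {
    EvacCode *info;       /* info table; element 0 is the evacuation code */
    Closure  *field;
};

static Closure to_space[16];
static int     to_next;

static EvacCode fwd_info[1];      /* defined below */

/* Evacuation code for an ordinary two-word closure: copy it, overwrite
   the original with a forwarding pointer, return the to-space address. */
static Closure *evac_pair(Closure *c)
{
    Closure *copy = &to_space[to_next++];
    *copy = *c;
    c->info  = fwd_info;
    c->field = copy;
    return copy;
}

/* Evacuation code for a forwarding pointer: simply hand back the
   to-space copy. */
static Closure *evac_fwd(Closure *c)
{
    return c->field;
}

static EvacCode pair_info[1] = { evac_pair };
static EvacCode fwd_info[1]  = { evac_fwd };

/* To evacuate any closure, just run its evacuation code. */
static Closure *evacuate(Closure *c)
{
    return c->info[0](c);
}
```

A second evacuation of the same closure dispatches to evac_fwd and returns the same to-space address as the first, which is exactly the behaviour the collector needs.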

Indirections. All indirections can easily be removed during garbage collection by another nice trick. All that is required is that the evacuation routine of an indirection jumps to the evacuation routine of the closure to which the indirection points. The use of "jumps to" rather than "calls" is deliberate: this is a tail call. Since indirections are thereby never moved into to-space, they don't have a scavenging routine.

Static closures. Some closures, notably those for global closures (Section ), are allocated at fixed static locations. These closures must not be moved by the garbage collector. This is easily arranged by making their evacuation code return immediately, without moving the closure.

Constructor closures can exist in both static and dynamic space (Section ), so in fact we need two info tables for each constructor, one for each of these cases. The standard-entry code for the constructor can still be shared, of course.

Small integers. A fixed-precision integer of type Int is represented by the MkInt constructor applied to the primitive integer value (Section ). This in turn is represented by a two-word closure, consisting of the MkInt info pointer and the primitive integer value. The evacuation code for MkInt sees if the value of the integer lies in a predetermined range and, if so, uses the integer to index a table of statically-allocated Int closures, returning the address of this static closure. The effect is that all small integers are commoned up by the garbage collector, and made to point to one of a fixed collection of small-integer closures.

If the integer is not in the range of the table, the closure is evacuated to to-space as usual. There is an easy refinement: give the new copy a different info pointer, which won't perform the test again next time, because it will certainly fail again.

The small-integer check could, of course, be made at the time an Int is allocated, but that means generating extra code in lots of places, whereas doing it in the garbage collector requires just one chunk of extra code. The same optimisation applies to Char closures, and to all other constructors isomorphic to Int or Char.
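The commoning-up check itself is tiny. In this sketch the range bounds and names are invented, and the out-of-range copy into to-space is elided.

```c
#include <assert.h>

#define SMALL_MIN 0
#define SMALL_MAX 255

/* A MkInt closure: an info pointer plus the primitive integer value. */
typedef struct IntClosure {
    const void *info;
    long value;
} IntClosure;

/* The fixed table of statically-allocated small-Int closures. */
static IntClosure small_ints[SMALL_MAX - SMALL_MIN + 1];

/* Evacuation code for MkInt: common up in-range integers by returning
   the corresponding static closure; otherwise the closure would be
   copied into to-space as usual (elided here). */
static IntClosure *evac_mkint(IntClosure *c)
{
    if (c->value >= SMALL_MIN && c->value <= SMALL_MAX)
        return &small_ints[c->value - SMALL_MIN];
    return c;
}
```

Two distinct heap closures holding the same small value thus end up sharing one static closure after a collection.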

Other garbage collector variants

Two-space garbage collection works well until the residency of the program approaches half the real memory available, at which point the virtual memory system begins to thrash. We have implemented a dual-mode collector, which switches dynamically between a single-space compacting collector and a two-space collector to try to minimise paging, with encouraging early results (Sansom). We are developing a further extension to a generational collector, based on Appel's simple two-generation scheme (Appel).

Trading code size for speed

The info-table dispatch mechanism outlined above allows some interesting space-time trade-offs to be made.

So far we have assumed that each kind of closure has its own evacuation and scavenging code, which knows about its size and layout. This requires new evacuation and scavenging routines to be compiled for each closure in the program. But since the garbage-collection routines for a closure depend only on its structure, it is often possible to share them. For example, all closures which consist of exactly one pointer field (apart from the info pointer) can share the same evacuation and scavenging routines. Indeed, our runtime system contains standard garbage-collection routines for a number of common layouts.

What if a closure must be constructed which does not match one of these standard layouts? It is possible to compile special garbage-collection code for it, but actually we adopt a compromise position, which allows us to provide all evacuation and scavenging routines as part of the runtime system. Instead of generating code for garbage-collection routines for a non-standard closure, we provide generic evacuation and scavenging routines in the runtime system. These routines look in the closure's info table to find certain layout information, namely the number of pointer words and non-pointer words in the closure. (This is contained in the "Other info" field of the figure above.) They then each use a loop to do their work, instead of having the loop unrolled as the special-purpose routines do. Notice that the layout information is stored in the static info table, so there is no extra cost in allocating the closure. The only extra execution cost is in executing the loops in the garbage-collection routines.
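The generic scavenging routine can be sketched like this. Field names are ours, and evacuation is replaced by a counting stub so the sketch is self-contained.

```c
#include <assert.h>

/* Layout information stored in the static info table: the number of
   pointer words and non-pointer words (the "Other info" of the figure). */
typedef struct InfoTable {
    int n_ptrs;
    int n_nonptrs;
} InfoTable;

typedef struct Closure {
    const InfoTable *info;
    void *payload[4];     /* n_ptrs pointers, then n_nonptrs non-pointers */
} Closure;

static int evacuations;   /* counts calls, standing in for real evacuation */

static void *evacuate(void *p)
{
    evacuations++;
    return p;             /* a real collector returns the to-space copy */
}

/* The generic scavenging routine: a loop over just the pointer words,
   instead of the unrolled code of the specialised routines. */
static void scavenge_generic(Closure *c)
{
    int i;
    for (i = 0; i < c->info->n_ptrs; i++)
        c->payload[i] = evacuate(c->payload[i]);
    /* the n_nonptrs words that follow are not touched */
}
```

Because pointers precede non-pointers, the loop bound is just n_ptrs; no per-word tag or bitmap is needed.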

It is for the benefit of these generic routines that closures are laid out with pointers preceding non-pointers. This convention means that only two numbers are required to encode the layout information. It also makes it more likely that a closure's layout will fit a standard layout directly supported by the runtime system. For example, all closures with two words of non-pointers and two of pointers can use the same routines; if the layout convention were more liberal, there would be a number of different possible layouts of such closures. The convention carries no runtime cost, of course.

The standard-entry code for a closure

There is one particular place where we have found that the use of C prevents an obvious code improvement. The info pointer of a closure points to a table containing a number of code labels. One of these is used much more than the others, namely the one used when the closure is entered.

It would be better to arrange that the info pointer pointed directly to this code, placing the rest of the info table just before the code. Then entering the closure takes one fewer indirection, but the other info-table entries are still available by using negative offsets from the info pointer.

This is usually quite easy to arrange when generating native code, but even the Gnu C compiler doesn't allow the programmer to specify that an array (the info table) must immediately precede the first word of the code for a function.

We abstract away from this issue by using a C macro ENTER(c), where c contains the address of the closure to be entered. The usual definition of ENTER is:

    #define ENTER(c) JUMP(**(c))

Stacks

The abstract machine contains three stacks:

- the argument stack, which contains a mixture of closure addresses and primitive values;
- the return stack, which contains continuations for case expressions;
- the update stack, which contains update frames.

The question is: how are these stacks to be mapped onto a concrete machine?

One stack

The three stacks all operate in synchrony, so it would be possible to represent them all by a single concrete stack. The major reason we choose not to do so is to avoid confusing the garbage collector. The garbage collector must use all the pointers in the stack as a source of roots, and must update them to point to the new locations of the closures. Thus it needs to know which stack locations are closure addresses, and which are code addresses or primitive values.

There are a couple of ways around this problem while retaining a single stack. One possibility is to distinguish pointers from non-pointers with a tag bit, usually the least-significant bit. This is a nuisance, because it makes arithmetic slower, and because it makes standard-width floating-point numbers impossible. It is also rather against the spirit of our implementation, where all type information is static, requiring no runtime testing.

Another possibility, described in an earlier version of the Spineless Tagless G-machine (Peyton Jones & Salkild), uses static bit-masks, associated with the code pointed to by return addresses on the stack, to give the stack layout. This works fine, but since then we have introduced the idea of fully-fledged unboxed values, which fatally wounds this technique. Consider, for example, the program:

    pick b f g = if b then f else g
    h b n = pick b … … n

Here, when pick is called, the four arguments (b, the two elided function arguments, and n) will be on the argument stack. The last of these will presumably be primitive, since one or other of the elided operations will later be applied to it. Now here is the point: if garbage collection is initiated during the evaluation of b, there is no context information available to tell that the bottom argument on the stack is primitive.

To conclude: using a single stack seems to require runtime tagging, and the previous ways of avoiding this cannot cope with fully-fledged unboxed values.

Two stacks

The obvious solution, which we use, is to provide two concrete stacks: the A stack for pointers, and the B stack for non-pointers. This was the solution adopted by the G-machine (where the non-pointer stack was called the V stack) and a number of subsequent systems. The nomenclature we use is taken from the ABC machine (Koopman), where A stands for "argument" and B for "basic value". However, our argument stack is split between the A and B stacks, and the B stack contains other things besides non-pointer arguments, as will become apparent. The detailed mapping of each of the abstract stacks onto these two concrete stacks is given in subsequent sections.

The stack pointers are held in special registers, SpA and SpB. Like other twin-stack implementations, we make the two stacks grow towards each other, to avoid the risk that one will overflow while the other has plenty of space left; in this paper the A stack grows towards lower addresses. In our sequential implementation this stack space is allocated in a fixed-size area, separate from the heap.
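Growing towards each other means a single comparison detects overflow of either stack. A sketch, with invented names:

```c
#include <assert.h>

#define STACK_WORDS 1000

/* Both stacks live in one fixed-size area and grow towards each other. */
static void *stack_area[STACK_WORDS];
static void **SpA = stack_area + STACK_WORDS;  /* A stack grows downwards */
static void **SpB = stack_area;                /* B stack grows upwards   */

/* The stack overflow check: is there room for a_words more A-stack
   entries and b_words more B-stack entries? */
static int stacks_fit(int a_words, int b_words)
{
    return SpA - a_words >= SpB + b_words;
}
```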

Compiling the STG language to C

We are now, at last, ready to discuss the code which is generated for each of the constructs in the STG language. This section is rather long and detailed. We make no apology for this because, as remarked earlier, an abstract machine can only be considered a success if it maps well onto concrete architectures, with plenty of opportunities for optimisation.

We begin with an overview of the code-generation process for an arbitrary STG expression, by considering the various syntactic forms an expression can take.

Calls to non-builtin functions (Section ). The expression f {a1,...,an} is compiled to a sequence of statements which pushes the arguments a1,...,an onto the appropriate stacks, adjusts the A and B stack pointers to their final values, and enters the function f. As we discuss later, this "enter" may take the form of entering the closure bound to f via its info table, or of jumping direct to the appropriate code for f.

let(rec) expressions (Section ). The let expression

    let x1 = lf1; ...; xn = lfn in e

is compiled to a sequence of statements which allocates a closure in the heap for each lambda-form lf1,...,lfn, followed by the code for e. letrec expressions are treated in the same way, the only difference being that the closures allocated thereby may be cyclic.

If the lambda-form lfi is not a standard constructor, the code generator also produces:

- A separate block of code, labelled xi_entry, obtained by compiling the body of lfi. (See Section for why it may be useful to give this code an extra entry point.)
- The declaration of a statically-initialised array xi_info, which is the info table for xi. The first element of the info table is the label of the standard-entry code, xi_entry.

Both of these declarations are hoisted out to the top level, rather than appearing embedded in the middle of the code for the let expression. (In our code generator, this flattening process is performed after code generation, the intermediate data type "Abstract C" permitting nested declarations.)

If the lambda-form is a standard constructor, the shared info table for the appropriate constructor can be used, and there is no need to generate xi_info and xi_entry.

Literals and calls of built-in operators (Section ). A primitive literal k is compiled to statements which load k into a register (exactly which register depends on k's type), adjust the A and B stack pointers to their final values, and return to the address on top of the B stack. A call to a built-in operation works in the same way, except that the operation is performed first.

This makes it sound as if every built-in operation is associated with a return, but an easy optimisation allows sequences of built-in operations to be compiled (Section ).

case expressions (Section ). The primitive case expression

    case e of palts

is compiled to code which saves any volatile variables used by palts on the stacks, and pushes a return address on the B stack, followed by the code for e. An arbitrary, but unique, label is invented for the return address, which is used to label a separate block of code compiled from palts.

The code compiled for palts performs case analysis on the value returned (if there are any non-default alternatives), followed by the code for each alternative expression. Like the code for lambda-forms, this entire code block is hoisted to the top level.

The code for algebraic case expressions is similar, except that the address of a return vector is pushed instead of a return address (Section ).
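This return-address mechanism can be mimicked in miniature. All names here are invented; the scrutinee is just the literal 21, and the alternatives' code doubles the returned value.

```c
#include <assert.h>

typedef void *(*CodeLabel)(void);

static void *ret_stack[16];            /* the B stack of this sketch */
static void **SpB = ret_stack;         /* grows upwards here         */
static long   RetInt;                  /* register for a returned primitive value */

/* Code compiled from the alternatives: scrutinise the returned value. */
static void *alts(void)
{
    return (void *)(RetInt * 2);
}

/* Code for the scrutinee e: load the literal into its register, then
   return to the address on top of the B stack. */
static void *eval_e(void)
{
    RetInt = 21;
    SpB = SpB - 1;
    return *SpB;
}

/* The compiled case expression: push the return address, "jump" to e
   (trampolined, so each jump is a call returning the next label). */
static long case_expr(void)
{
    *SpB = (void *) alts;
    SpB = SpB + 1;
    CodeLabel k = (CodeLabel) eval_e();   /* e returns the alts' label */
    return (long) k();                    /* run the alternatives      */
}
```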

Top-level bindings (Section ). The top-level bindings are treated a little differently to nested ones. Each declaration gi = lfi is compiled to the declaration of a statically-initialised array gi_closure, which represents the static closure for gi. An info table gi_info and standard-entry code block gi_entry are also produced, just as for nested bindings.

Standard constructors (Section ). As already mentioned, no code is generated for lambda-forms which are standard constructors, the shared info table and code for the constructor being used instead. It is therefore necessary to generate this info table and entry code for each constructor declared in the module.

To make these ideas concrete, the figure below gives the code compiled for map, whose STG code is as follows (see Section ):

    map = {} \n {f,xs} ->
        case xs of
            Nil {}      -> Nil {}
            Cons {y,ys} -> let fy  = {f,y}  \u {} -> f {y}
                               mfy = {f,ys} \u {} -> map {f,ys}
                           in Cons {fy,mfy}

This code is written assuming that lists use a vectored return convention, and that Cons returns its arguments in registers, matters which are explained more fully in Section .

The rest of this section explores code generation in more detail. Each subsection corresponds to the similarly-numbered subsection of Section , which gives the operational semantics of the STG language.

The initial state

The machine is initialised to evaluate the global main, with empty argument, return and update stacks (Section ). The abstract machine's initial heap is not empty, but rather contains a closure for each globally-defined variable. We implement this by allocating a static closure for each such variable (Section ). Each of these closures can be referred to directly by its C label, thus effectively using the linker to implement the global environment.

    StgWord map_closure[] = { (StgWord) map_info };

    StgWord map_info[] = { (StgWord) map_entry, <rest of info table> };

    map_entry() {
        <argument satisfaction check>
        JUMP(map_direct);
    }

    map_direct() {
        <stack overflow check>
        SpB[1] = (StgWord) ret_vec; SpB = SpB + 1;   /* Push return vector */
        Node = (P_) SpA[1]; ENTER(Node);             /* Enter xs           */
    }

    StgWord ret_vec[] = { (StgWord) ret_nil,  (StgWord) upd_nil,
                          (StgWord) ret_cons, (StgWord) upd_cons };

    ret_nil() {
        SpA = SpA + 2;                               /* Pop args           */
        SpB = SpB - 1; RetVecReg = (P_) SpB[1];      /* Grab return vector */
        JUMP(RetVecReg[0]);
    }

    ret_cons() {    /* Head and tail in regs RetData1 and RetData2 */
        /* Allocate fy and mfy */
        Hp = Hp + 6;
        <heap overflow check>
        Hp[-5] = (StgWord) fy_info;  Hp[-4] = SpA[0]; Hp[-3] = RetData1;
        Hp[-2] = (StgWord) mfy_info; Hp[-1] = SpA[0]; Hp[0]  = RetData2;
        /* Return the cons cell */
        RetData1 = (StgWord) (Hp-5); RetData2 = (StgWord) (Hp-2);
        SpB = SpB - 1; RetVecReg = (P_) SpB[1]; SpA = SpA + 2;
        JUMP(RetVecReg[2]);
    }

    StgWord fy_info[] = { (StgWord) fy_entry, <rest of info table> };

    fy_entry() {
        <push update frame>                          /* This is an updatable thunk */
        <stack overflow check>
        SpA[-1] = Node[2]; SpA = SpA - 1;            /* Push y  */
        Node = (P_) Node[1]; ENTER(Node);            /* Enter f */
    }

    StgWord mfy_info[] = { (StgWord) mfy_entry, <rest of info table> };

    mfy_entry() {
        <stack overflow check>
        SpA[-1] = Node[2]; SpA[-2] = Node[1];        /* Push fys and f */
        SpA = SpA - 2;
        JUMP(map_direct);
    }

Figure: The code generated for map

Applications

The code generated for applications follows directly from Rule in the operational semantics, and consists of two steps:

- push the arguments on the stack, and adjust the stack pointer;
- enter the closure which represents the function.

We discuss these steps separately below. Before doing so, here is a small example. Consider the binding:

    apply = {} \n {f,x} -> f {x,x,x}

The following code is generated for the application f {x,x,x}. When this code is executed, a pointer to f is on top of the A stack, and under it is a pointer to x.

    Node = (P_) SpA[0];    /* Grab f into Node register    */
    t = SpA[1];            /* Grab x into a local variable */
    SpA[0]  = t;           /* Push extra args              */
    SpA[-1] = t;
    SpA = SpA - 1;         /* Adjust stack pointer         */
    ENTER(Node);           /* Enter f                      */

Entering a closure

What does it mean to "enter a closure"? After all, the operational semantics has quite a few rules dealing with the Enter state. The Spineless Tagless G-machine devolves responsibility for all these complications to the closure being entered, so that the code to enter a closure is simple and uniform. We establish the following very simple entry convention for closures: when a closure is entered, a particular register, the Node register, points to the closure. The code for the closure can access its free variables by indexing directly from the Node register.

All the caller has to do is to load a pointer to the closure into the Node register, and jump to the standard-entry code for the closure via its info pointer. As previously discussed, this jump can involve either one or two indirections, depending on the particular representation chosen for closures and info tables (Section ).

It is possible to make some useful optimisations to this process when entering a non-updatable closure. Many functions are defined at the top level of the program, or in standard libraries (map, for example). Entry to such functions can be made much more efficient than the standard entry mechanism just described:

- Such a function has no free variables, so there is no point in making Node point to its closure.
- The code label for the function is statically determined, so the jump can be a direct one, rather than indirecting via the info pointer.
- The code generator knows how many arguments (if any) the closure is expecting, so if at least this number of arguments is being supplied by the call, the jump can be made to a point called the direct-entry point, just after the argument satisfaction check (see Section below). Indeed, with a bit more cleverness, the stack and heap overflow checks can often be bypassed as well.
- The argument-passing convention at the direct-entry point can be different to the standard one. In particular, arguments can be passed in registers. This should be beneficial, but perhaps less so than in a strict language, because functions frequently begin by evaluating one of their arguments, so the others have to be saved on the stack anyway. We have not yet implemented this idea.

Of these improvements, all but the first can be applied to locally-defined functions as well. For example, consider the expression:

    f = {} \n {x,y} -> let g = {x} \n {z} -> ...x...z...
                       in g {y}

The call to g can be made by pushing its argument y onto the stack, loading a pointer to the closure for g into Node, and then jumping directly to the appropriate code for g. Since the call is to a function whose definition is statically visible, the code generator can compile direct jumps, including bypassing the argument satisfaction check where appropriate.

We need to take a bit more care when entering an updatable closure. In this case, we must jump to it via its info pointer, and never directly to its standard-entry code, because an update might have changed the info pointer. At first it seems that we must also always make Node point to the closure, since the standard-entry code for an updatable closure begins by pushing an update frame recording the address of the closure to be updated. But since the code to push the update frame is compiled individually for each closure, we can arrange for it to include the static address of the closure in the update frame, rather than Node.

No optimisations at all apply when entering the closure for a lambda-bound variable, as in the case of apply above.

Pushing the arguments

Pushing the arguments onto the stack is not quite as simple as it sounds.

Firstly, the arguments may be a mixture of pointers and non-pointers, so each must be pushed on the appropriate stack. The argument stack of the operational semantics is thereby split between the A and B stacks.

Secondly, in our implementation the environment ρ of the operational semantics is represented partly by locations in the stacks. This is quite conventional in many language implementations. It means, though, that the stacks must be cleared of the accumulated environment (or perhaps just part of it; see Section ) before pushing the arguments to the call. Of course, we need to take a little care here: we must not overwrite a stack location which contains a value that is required for another argument position. There are several ways to solve this, the simplest being to move all the threatened live stack locations into registers (when generating C, local variables) before starting to overwrite them.

Here is an example

f n xy g yx

On entry to f x and y will b e on the stack It immediately calls g which requires the same

arguments but in the other order so at least one register must b e used during the stack

rearrangement

Such argumentshuing is rather unusual It is much more common for the same argument

to app ear in the same p osition in which case no co de need b e generated at all This is often

the case for recursive functions which pass some arguments along unchanged
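The stack rearrangement described above can be sketched in C. This is an illustrative model only (the array, index convention, and function name are invented for the sketch): the A stack is an array of words, and the code generator emits moves through a C local variable for every live slot that is about to be overwritten.

```c
#include <assert.h>

/* Hypothetical sketch of the code emitted for  f = {} \n {x,y} -> g {y,x}.
 * On entry, stackA[spA] holds x and stackA[spA+1] holds y; the call to g
 * needs the two arguments in the other order. */
static int stackA[8];

void shuffle_args_for_g(int spA)
{
    /* Both slots are live and both are overwritten, so at least one value
     * must be held in a register (here, a C local) during the shuffle. */
    int x = stackA[spA];               /* save the threatened live slot */
    stackA[spA]     = stackA[spA + 1]; /* first argument of g is y      */
    stackA[spA + 1] = x;               /* second argument of g is x     */
}

int demo_shuffle(void)
{
    stackA[3] = 10;                    /* x */
    stackA[4] = 20;                    /* y */
    shuffle_args_for_g(3);
    return stackA[3] == 20 && stackA[4] == 10;
}
```

When the same argument already sits in the same position, the code generator emits no move at all, which is the common case for recursive calls.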

letrec expressions

As mentioned earlier, let and letrec expressions always compile to code which allocates a closure in the heap for each definition, followed by code to evaluate the body of the letrec. Each of these closures consists of an info pointer and a field for each of its free variables. For example, the expression

    let f = {fs} \pi {xs} -> b
    in e

compiles to code which allocates a closure for f, and then continues with code to evaluate e. For example, consider the definition of compose:

    compose = {} \n {f,g,x} -> let gx = {g,x} \u {} -> g {x}
                               in f {gx}

The code for the body of compose runs as follows:

    Hp = Hp + 3             -- Allocate some heap
    if Hp > HLimit then     -- Heap exhaustion check
        trigger GC
    Hp[-2] = gx_info        -- Fill in closure for gx: info pointer
    Hp[-1] = SpA[1]         --   g
    Hp[0]  = SpA[2]         --   x
    Node = SpA[0]           -- Call f: grab f into Node
    SpA[2] = Hp - 2         -- Push gx
    SpA = SpA + 2           -- Adjust SpA
    ENTER Node

Here gx_info is the statically-allocated info table for gx:

    static int gx_info[] = {
        gx_entry,
        scavenge,
        evacuate
    };

In this info table, gx_entry is the name of the C function which implements the standard-entry code for the closure gx; scavenge and evacuate are runtime-system routines for performing garbage collection on closures containing two pointers (Section ).

Allocation

The allocation of these closures is straightforward, and was discussed in Section . References to dynamically-allocated closures within a single instruction sequence are made by offsetting from the heap pointer. The code generator keeps track of the physical position of the heap pointer, so that correct offsets can be made even if it is moved by instructions within the basic block.

Notice the use here of the term "single instruction sequence". In particular, this method of addressing cannot survive over the evaluation triggered by a case expression, because such an evaluation may take an unbounded amount of computation. Not only may this move the heap pointer unpredictably, but it may trigger garbage collection, which may rearrange the relative positions of the closures. In short, at the points in the operational semantics where the environment ρ is saved on the return stack, a pointer to each live closure must be saved on the pointer stack (Section ).

The code for a closure

Much more interesting, of course, is the standard-entry code for the closure. This is the code which will get executed if the closure is ever entered. The standard-entry code for every closure begins with the following sequence:

Argument satisfaction check. This concerns updating, and is discussed in Section . It is only generated if there are one or more arguments.

Stack overflow check. If the execution of the closure can cause either stack to overflow (or, if the stacks are organised to grow towards each other, collide), execution is halted. On a parallel machine, which works with many stacks, different action is taken. This stack overflow check can look ahead into all the branches of any case expressions involved in the evaluation of the closure, taking the worst-case path as the overflow criterion. Of course, if there is no net stack growth, no check is performed.

Heap overflow check. A similar check is performed for heap overflow, if any heap is allocated. This was discussed in Section . The heap check cannot look ahead into case branches, because the evaluation implied by a case can perform an unbounded amount of computation.

Info pointer update. In the case of an updatable closure, its info pointer may now be overwritten with a "black hole" info pointer, or in a parallel system a "queue me" info pointer. This is discussed in more detail below (Section ).

Update frame construction. For updatable closures only, an update frame is pushed onto the update stack. This action causes a later update, which overwrites the closure with its head normal form. The implementation of updates and the mapping of the update stack are discussed in detail in Section .

Code is now generated for the body of the closure, with the free variables bound to appropriate offsets from the Node register, and the arguments to offsets from the appropriate stack pointers. Like many other compilers, ours keeps track of where the stack pointers are pointing within the current activation record, so that at any moment it can generate the correct offset from the current stack pointer.
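The entry-code preamble just described can be sketched on a toy machine state. This is a hedged illustration, not the actual runtime: the struct, enum, and word-count representation of the stacks are all invented for the sketch, and the stack-overflow and black-holing steps are only indicated by a comment.

```c
#include <assert.h>

/* Toy model of the standard-entry preamble.  Stack positions are word
 * counts; the real system uses pointers and statically-known per-closure
 * argument and heap requirements. */
typedef struct {
    int spB, suB;      /* stack top and stack-base register ("words pushed") */
    int hp, hLimit;    /* heap pointer and heap limit                        */
} PreambleState;

enum EntryAction { RUN_BODY = 0, CALL_UPDATE_PAP = 1, TRIGGER_GC = 2 };

enum EntryAction standard_entry_preamble(PreambleState *m,
                                         int args_needed,  /* arity words   */
                                         int heap_needed)  /* words alloc'd */
{
    /* 1. Argument satisfaction check (only emitted when arity > 0):
     *    compare the words between stack pointer and stack base
     *    against the statically-known arity. */
    if (args_needed > 0 && m->spB - m->suB < args_needed)
        return CALL_UPDATE_PAP;   /* too few arguments: update with a PAP */

    /* 2. Heap overflow check (only emitted when the body allocates). */
    m->hp += heap_needed;
    if (m->hp > m->hLimit)
        return TRIGGER_GC;

    /* (Stack checks, black-holing and update-frame pushing would follow.) */
    return RUN_BODY;
}

int demo_preamble(int pushed, int arity, int heap, int hLimit)
{
    PreambleState m = { pushed, 0, 0, hLimit };
    return (int)standard_entry_preamble(&m, arity, heap);
}
```

Note that the checks are emitted conditionally, exactly as in the text: a closure with no arguments gets no argument satisfaction check, and a body that allocates nothing gets no heap check.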

Black holes

When an updatable closure is entered, its standard-entry code has the opportunity to overwrite the closure's info pointer with a standard "black hole" info pointer, provided by the runtime system. Whilst this operation costs an instruction, it has two advantages:

If the closure is ever re-entered before it is updated, the black hole entry code can report an error. This situation occurs in programs where a value depends on itself; for example:

    letrec a = a in a

If a closure is left undisturbed until it is finally updated with its head normal form, there is a serious risk of a space leak. For example, consider the STG definitions

    ns = {...} \u {} -> ...
    l  = {ns} \u {} -> last {ns}

where the body of ns produces some very long list, and last returns the last element of a list. If the thunk for l is left undisturbed until it is finally updated with the last element of the list ns, it will retain a pointer to the entire list, rather than consuming it incrementally. (This nice example is due to Jones.) Overwriting the thunk for l with a black hole immediately it is entered solves this space leak, because a black hole retains no pointers.

It is also possible to obtain both these advantages in a slightly more subtle way, without performing the black-hole overwriting operation. Firstly, non-termination of the form detectable by black holes always results in stack overflow. The cause of the stack overflow can then easily be determined by noticing that there is more than one pointer on the update stack to the same closure; this can only happen if its value depends on itself. It is also possible that the error message obtainable from a post-mortem of the update stack could be rather more informative, because the entire collection of closures involved in the self-dependent loop can be identified, and since they all still have their original info tables attached, their source-code location information could be shown too.

Secondly, we address the space-leak question. In the above example, the pointer to ns retained by l only matters at garbage-collection time. In almost all cases a thunk will be entered and updated between garbage collections, so that no space improvement is gained by overwriting with a black hole. What we would like to do is to black-hole only those thunks which are under evaluation at garbage-collection time. Happily, they are exactly the thunks to which the update stack points. So we can safely omit the black-hole update on thunk entry, provided that instead we begin garbage collection by black-holing all the thunks pointed to from the update stack.
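This deferred black-holing can be sketched as a pre-GC pass over the update stack. The representation below (an info word per closure, an array of frame pointers, the BLACKHOLE_INFO value) is invented for the sketch; only the idea — overwrite the info pointer of every thunk the update stack points to before collection starts — comes from the text.

```c
#include <assert.h>

#define BLACKHOLE_INFO 999   /* stands in for the black-hole info pointer */

typedef struct { int info; int field0; int field1; } Closure;

/* At the start of garbage collection, walk the update frames and
 * black-hole every thunk under evaluation, so its fields retain nothing. */
void blackhole_update_stack(Closure **updatees, int n)
{
    for (int i = 0; i < n; i++)
        updatees[i]->info = BLACKHOLE_INFO;  /* fields now dead to the GC */
}

int demo_blackhole(void)
{
    Closure a = { 1, 0, 0 }, b = { 2, 0, 0 };
    Closure *frames[2] = { &a, &b };
    blackhole_update_stack(frames, 2);
    return a.info == BLACKHOLE_INFO && b.info == BLACKHOLE_INFO;
}
```

The per-entry instruction in the thunk's own code is thereby traded for a short loop run once per collection.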

Since both these techniques rely on the update stack, they only apply to updatable thunks (update flag u). If a thunk is non-updatable (update flag n), it must still be black-holed by its standard-entry code. For this reason, in our implementation we have two variants of the n update flag: r (for re-entrant; the closure may be entered many times, and should not be black-holed), and s (for single-entry; the closure will be entered at most once, and should be black-holed). The r flag is used for manifest functions, constructors and partial applications, while the s flag is only used for thunks where update analysis has determined that an update is not required.

In a parallel system, the standard-entry code for an updatable thunk should overwrite the thunk with a "queue me" info pointer (Section ). Unlike the black-holing of a sequential system, this operation cannot be postponed until garbage collection.

case expressions

Pattern matching via case expressions is utterly pervasive in lazy functional programs, all the more so in the Spineless Tagless G-machine because the boxing and unboxing operations of arithmetic are done using case expressions, rather than by some ad hoc mechanism. One of the strengths of the Spineless Tagless G-machine is that there is a rather rich design space for how pattern-matching can be implemented, including some rather efficient options.

case expressions, and only case expressions, cause evaluation to take place. In the operational semantics this is expressed by pushing a continuation onto the return stack and evaluating the expression to be scrutinised (Rule ). This is mirrored precisely by the code generated for case expressions.

In code generation for most languages, the act of pushing a continuation (or return address) is immediately followed by a function call. It is worth noticing in passing that this is not the case for the STG language: there may be a significant gap between the instructions which push the continuation and the instruction (if any) which actually transfers control. For example, consider the expression

    case (case f {x} of ...) of
      ...

The continuation for the outer case is pushed, then the continuation for the inner case, and then the call to f is made. Few programmers write this sort of code, but it arises as a result of program transformations within the compiler. The order of the two case expressions can be interchanged, but only at the risk of code duplication.

The main point of interest is how continuations are represented. Recall that a continuation in the operational semantics consists of two parts:

    the alternatives of the case expression;

    the environment ρ in which they should be executed.

The representation of alternatives is intimately connected with the code generated for primitive values and constructors, so we defer discussion of the first topic until the following sections. Environment-saving is independent of constructors, so we discuss it first.

Saving the local environment

The local environment is saved by saving in the stacks the values of all variables which are live; that is, free in any of the alternatives. The way in which a live variable is saved depends on where it currently resides:

It may already be in a stack, for example if it was an argument to the current closure. No code need be generated.

It may be in a register, or it may be bound to an offset from the heap pointer. In this case it must be saved in the appropriate stack.

It may be in the closure currently pointed to by Node. In this case there are two possibilities: save the variable itself, or save Node. The latter reduces the number of saves, because saving Node effectively saves the values of several variables at once. On the other hand, an extra memory access is subsequently required to get the value of the variable. Another problem with saving Node is that the entire contents of the closure must then be retained by the garbage collector, even though the continuation may only use some of its fields. The space-leak avoidance mechanisms described in Section  cannot be applied. It is possible to rescue the space behaviour by compiling a bitmask to indicate which fields of the closure are live, but it is complicated. Our current policy is to avoid these difficulties by saving all variables individually.

When saving a variable in a stack, we can economise on stack usage by reusing stack slots belonging to variables which are now dead. There is also a useful side benefit: the structure pointed to by dead pointers in the stack cannot be reclaimed by the garbage collector, so overwriting such pointers with live ones helps to avoid space leaks. One could experiment (though we have not yet done so) with generating extra instructions to overwrite dead pointers whose slots are not to be reused, specifically in order to make their space reclaimable. We call this "stack stubbing".
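Stack stubbing might be sketched as follows. The representation (a stack of void pointers, a liveness bitmask per continuation) is an assumption made for the sketch; the point is only that slots holding dead pointers are overwritten with a null value so the collector no longer retains what they point to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "stack stubbing": before evaluation is triggered, null out
 * the slots whose pointers are dead, guided by a compile-time liveness
 * bitmask.  Bit i of live_mask covers slot stack[base + i]. */
void stub_dead_slots(void **stack, int base, int nslots, unsigned live_mask)
{
    for (int i = 0; i < nslots; i++)
        if (!(live_mask & (1u << i)))
            stack[base + i] = NULL;  /* dead pointer: make space reclaimable */
}

int demo_stub(void)
{
    int x = 1, y = 2, z = 3;
    void *stk[3] = { &x, &y, &z };
    stub_dead_slots(stk, 0, 3, 0x2u);   /* only slot 1 (y) is live */
    return stk[0] == NULL && stk[1] == (void *)&y && stk[2] == NULL;
}
```

The trade-off, as the text notes, is extra store instructions at each case expression against the space the dead pointers would otherwise retain.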

Constructor applications

The expression scrutinised by a case expression must eventually evaluate either to a primitive value or a constructor application. We deal with the latter case in this section, deferring the primitive case to Section .

The code generated for a constructor application must return control to the appropriate alternative of the case expression, making the arguments of the application available to the alternative. This is just what is done by the rules for constructors in the operational semantics (Rules  and ).

For example, consider the expression

    let
      hd     = {} \n {xs} -> case xs of
                               Cons {y,ys} -> y
                               Nil {}      -> error
      single = {w,ws} \n {} -> Cons {w,ws}
    in
      hd {single}

where w and ws are bound by some enclosing scope. The code generated for the case expression in hd pushes a continuation and enters the closure for xs, which is bound to single in this case. The code for the constructor application Cons {w,ws} in the body of single should return control to the appropriate alternative, returning w and ws in some agreed way. (Remember that constructor applications in the STG language are always saturated.)

There are two main aspects to consider:

There may be several alternatives, so there is the question of how the appropriate one is selected.

The constructor for a particular alternative may have arguments, in which case these need to be communicated to the code for the alternative.

These two issues are now discussed in turn.

Selecting the alternative

The simplest possible representation for the alternatives is a single code label, which we call a return address, pushed on the B stack. Control is returned by the constructor application to the labelled code when evaluation of the scrutinised object is completed.

If there is only one member of the algebraic data type (tuples, for example), the evaluation of the single alternative can proceed immediately. If there is more than one member of the type (lists, for example), the tag of the object is put into a particular register, RTag, and a C switch statement is generated to perform the case analysis on RTag. In a native-code generator this multiway jump can be compiled using a tree of conditionals or using a jump

[Figure: Vectored returns — a return vector pointed to from the B stack, with entries pointing to the code for the Nil alternative and the code for the Cons alternative]

table, depending on the sparsity of the alternatives; but using C as our target code allows us to delegate this choice to the C compiler.

This is not the only possible representation for the alternatives. Another possibility is to represent the alternatives by a pointer to a table of code labels, with one entry in the table for each constructor in the data type. Figure  illustrates the situation for a case expression which is scrutinising a list. A pointer to this table, which we call a return vector, is pushed on the B stack by the code for the case expression. Then, instead of loading RTag, the code for the constructor application can transfer control directly to the appropriate destination, thus saving a jump. We call this a vectored return, and the pointer to the return vector a vectored return address.

The important point to note is that the return convention can be chosen independently on a data-type by data-type basis. A particular case expression will only scrutinise objects of a particular type. In practice, we (somewhat arbitrarily) use vectored returns for data types with up to eight constructors, because this catches the vast majority of data types without risking wasting code space on large, sparsely-used return vectors.
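The two conventions can be contrasted in C, modelling code labels as function pointers (an approximation: real return vectors hold code addresses, and the "jump" is a tail-call or goto rather than a C call). All names here are invented for the sketch.

```c
#include <assert.h>

typedef int (*CodeLabel)(void);

static int nil_alt(void)  { return 0; }   /* code for the Nil alternative  */
static int cons_alt(void) { return 1; }   /* code for the Cons alternative */

/* The return vector for the list type: one entry per constructor. */
static CodeLabel list_retvec[2] = { nil_alt, cons_alt };

/* Vectored return: the constructor's return code indexes the vector by
 * its tag and transfers control directly — no switch at the return
 * address, saving a jump. */
int return_via_vector(CodeLabel *retvec, int tag)
{
    return retvec[tag]();
}

/* Non-vectored return: a single return address, which switches on the
 * tag left in the RTag register. */
int return_via_switch(int rtag)
{
    switch (rtag) {
    case 0:  return nil_alt();
    default: return cons_alt();
    }
}
```

Either way the constructor ends up in the right alternative; the vectored form simply moves the dispatch from the return address into the table lookup.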

Returning the constructor arguments

There is a correspondingly simple convention available for communicating the constructor arguments to the alternative: make the Node register point to a constructor closure containing the appropriate values. The code for the alternative can then address its components by indexing from the Node register, as usual.

This works fine, and is simple enough, but a much better alternative is readily available if there are sufficiently few arguments: return them in registers. If the closure being scrutinised is already a constructor, then not much is gained; indeed, something may be lost, because all its components may be loaded into registers when perhaps the alternative only requires one of them. But there is a terrific gain when a thunk is scrutinised, because it may thereby avoid ever building the constructor in the heap. The most critical example of this is ordinary integer arithmetic. Consider the following example:

    neg = {} \n {x} -> case x of
                         MkInt x# -> case neg# x# of
                                       y# -> MkInt {y#}

neg is the function which negates an integer. It operates by evaluating the integer x to extract its primitive value x#, negating it to give y#, and then returning the integer MkInt y#. Now, under the simple return convention, the boxed value MkInt y# would be constructed in the heap, Node would be made to point to it, and control returned to the continuation. If instead the component of the constructor, y#, is returned in a register, the value need never be built in the heap, except as a result of an update (Section ). It turns out that this has a big effect on performance.
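The return-in-registers convention for MkInt can be modelled in C as follows. The global "register", the allocation counter, and all names are inventions of this sketch; the point is only that the unboxed component travels back in a register, so no MkInt cell is allocated on the way to the continuation.

```c
#include <assert.h>

static int RetInt;                 /* standard return register for boxed Int */
static int heap_allocations = 0;   /* counts boxed cells built (none here)   */

/* Code for the inner part of neg, once x has been evaluated to MkInt i:
 * negate the unboxed value and "return" it in the register, rather than
 * building MkInt (-i) in the heap and pointing Node at it. */
void neg_return_in_register(int unboxed_x)
{
    RetInt = -unboxed_x;           /* no heap allocation performed */
}

int demo_neg(void)
{
    neg_return_in_register(5);
    return RetInt == -5 && heap_allocations == 0;
}
```

The continuation binds its pattern variable straight to RetInt; a heap cell is built only if an update later demands one.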

As before, the important point is that the return convention can be chosen on a data-type by data-type basis. Integers are not a special case. For example, list cons cells can be returned by putting the head and tail values into specific registers. Independently, a vectored or non-vectored return convention can be chosen. Even the choice of which registers are used to return values can be made independently for each data type; for example, a floating-point number can be returned in a floating-point register.

For the reason given before, it is probably not a good idea to return constructors with many arguments entirely in registers. We therefore make a virtue of necessity (there are only a limited number of registers) and return larger constructors by allocating them in the heap and making Node point to them.

It turns out that the return-in-registers convention makes updates substantially harder, as we shall see, but the gain is well worth it.

Arithmetic

Suppose that the expression scrutinised by a case turns out to evaluate to a primitive value; that is, either a primitive literal (Rule ), a variable whose value is primitive (Rule ), or an arithmetic operation whose result is primitive (Rule ). All three of these rules enter the ReturnInt state, which takes action depending on the alternatives stored on top of the return stack.

The return convention for primitive values is simple. The continuation on top of the return stack is always a return address, pointing directly to the continuation code. The primitive value itself is returned in a standard return register, chosen independently for each primitive data type. For example, one register can be used for integers, and another for floating-point values. The code generated for a primitive literal simply loads the specified value into the appropriate return register, pops the return address from the B stack, and jumps to it. Similarly, the code generated for a primitive variable just loads the value of the variable into the return register and returns; and arithmetic follows in the same way.

The code generated at the return address implements the case analysis implied by the alternatives (if any), using a suitable C switch statement. Often there is only one alternative, which binds a variable to the value returned. This is easily done by binding the variable to the appropriate return register.
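This primitive return convention can be sketched with the return address modelled as a function pointer on the B stack (names invented; the real "jump" is a direct transfer of control, not a C call).

```c
#include <assert.h>

typedef int (*PrimRetAddr)(void);

static int IntReg;                            /* return register for ints  */
static int use_int(void) { return IntReg * 2; } /* continuation code:
                                                   binds its variable to
                                                   IntReg and uses it      */

/* Returning a primitive value: load the standard return register, pop
 * the return address from the B stack, and jump to it. */
int return_primitive(int value, PrimRetAddr *spB)
{
    IntReg = value;       /* load the return register            */
    PrimRetAddr k = *spB; /* pop the return address              */
    return k();           /* jump to the continuation            */
}

int demo_primitive_return(void)
{
    PrimRetAddr bstack[1] = { use_int };
    return return_primitive(21, bstack);
}
```

Unlike the constructor case, no vectoring question arises: for primitive types the continuation is always a single plain return address.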

There is a very important special case when compiling expressions of the form

    case ⊕ {x1,x2} of alts

where ⊕ is a built-in arithmetic operation. It would be pointless to push a return address, evaluate ⊕ {x1,x2}, and return to the return address. These operations can easily be short-circuited, and it is practically essential to do so. We can express this equivalence by doing some simple transformations on the rules, to give the derived rules:

    Eval (case ⊕ {x1,x2} of {k1 -> e1; ...; kn -> en; x -> e}) ρ    as rs us h σ
      ⟹  Eval ej ρ    as rs us h σ

        where  ρ(x1) = Int i1
               ρ(x2) = Int i2
               kj = i1 ⊕ i2,  1 ≤ j ≤ n

    Eval (case ⊕ {x1,x2} of {k1 -> e1; ...; kn -> en; x -> e}) ρ    as rs us h σ
      ⟹  Eval e ρ[x ↦ Int (i1 ⊕ i2)]    as rs us h σ

        where  ρ(x1) = Int i1
               ρ(x2) = Int i2
               i1 ⊕ i2 ∉ {k1,...,kn}

In particular, once this optimisation is implemented, the expression

    case ⊕ {x1,x2} of x3 -> e

compiles to the simple C statement

    x3 = x1 ⊕ x2;

where x1, x2 and x3 are the C local variables used to hold the values of x1, x2 and x3.

Adding updates

So far, everything has been quite tidy: tree reduction is nice and easy. Sadly, graph reduction is harder, and updates are quite complicated. This is much the trickiest part of the Spineless Tagless G-machine. Still, we begin bravely enough.

Representing update frames

Recall that when a closure is entered, it has the opportunity to push an update frame onto the update stack. An update frame consists of:

    a pointer to the closure to be updated;

    the saved argument and return stacks.

After an update frame is pushed, execution continues with empty argument and return stacks. Of course, we don't actually copy the argument and return stacks onto a separate update stack. Instead, we dedicate two registers, called the stack base registers, to point just below the bottommost word of the A and B stacks respectively. The argument and return stacks can now be saved, and then made empty, merely by saving the stack base registers in the update frame and making them point to the current top of the A and B stacks.

Where is the update stack kept? It could be represented by a separate stack all of its own, but we have chosen to merge it with the B stack. The minor reason for this is to avoid yet another stack. The major reason is that it makes available an important optimisation, which we discuss below (Section ).

To conclude, the operation of pushing an update frame (Rule ) is done by:

    pushing an update frame onto the B stack;

    setting the stack base registers to point to the top of their respective stacks.

The alert reader will have spotted that a pointer to the closure to be updated has thereby ended up on the B stack. We discuss this in Section .
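Pushing an update frame can be sketched on a toy machine state. The frame layout, field order, and word-index representation below are assumptions of the sketch; the essential moves — save the updatee and the two stack base registers, then move the bases up to the current stack tops so both stacks appear empty — follow the text.

```c
#include <assert.h>

typedef struct {
    int spA, spB;     /* stack tops (word indices)          */
    int suA, suB;     /* stack base registers               */
    int bstack[16];   /* B stack: update frames, returns... */
} UpdMachine;

void push_update_frame(UpdMachine *m, int updatee)
{
    /* Push the three-word frame onto the B stack... */
    m->bstack[m->spB++] = updatee;   /* closure to be updated */
    m->bstack[m->spB++] = m->suA;    /* saved A stack base    */
    m->bstack[m->spB++] = m->suB;    /* saved B stack base    */
    /* ...and make the argument and return stacks appear empty
     * by moving the base registers to the current stack tops. */
    m->suA = m->spA;
    m->suB = m->spB;
}

int demo_push_frame(void)
{
    UpdMachine m = { 4, 0, 1, 0, {0} };
    push_update_frame(&m, 77);
    /* Frame holds updatee and the old bases; both stacks now "empty". */
    return m.bstack[0] == 77 && m.bstack[1] == 1 && m.bstack[2] == 0
        && m.suA == m.spA && m.suB == m.spB;
}
```

After the push, an argument satisfaction check in the callee sees zero words between stack pointer and stack base, exactly as the operational semantics' empty stacks require.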

An update is triggered in one of two ways: either a function finds too few arguments on the stack, or a constructor application finds an empty return stack. These two situations are discussed in the following sections.

Partial applications

When a closure is entered which finds too few arguments on the stack, an update is triggered. This is described by Rules  and a. The check for too few arguments is called the argument satisfaction check, and occurs at the start of the code for every closure which takes one or more arguments (cf Section ).

A minor complication is that the arguments are split between the A and B stacks, but this presents little difficulty. If the last argument is available, then certainly all the others will be, so the check is performed only on the stack which contains the last argument.

The argument satisfaction check is performed by subtracting the stack base pointer of the appropriate stack from the corresponding stack pointer, giving a difference in words. This is compared with the statically-calculated number of words required for all the arguments which are passed on that stack. If too few words are present, a jump is taken to a runtime system routine, UpdatePAP, which performs the update. Once the update has been done, UpdatePAP concludes by re-entering the closure which Node should be pointing to. The argument satisfaction check is thereby performed again, as Rule  requires, in case a further update is needed.
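The check itself is one subtraction and one comparison, performed only on the stack that carries the last argument. A minimal sketch (the parameter names are invented; UpdatePAP is the routine named in the text):

```c
#include <assert.h>

/* Argument satisfaction check: subtract the stack base register of the
 * stack holding the last argument from its stack pointer, and compare
 * the difference in words with the statically-known requirement for
 * the arguments passed on that stack. */
int args_satisfied(int sp_last, int su_last, int words_needed)
{
    return (sp_last - su_last) >= words_needed;
    /* if this is false, jump to UpdatePAP, which performs the update
     * and then re-enters the closure so the check runs again */
}
```

Because arguments are always pushed together, checking only the last argument's stack is sound: if the last one is present, all the earlier ones must be too.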

A special case is required for top-level closures, because the code entering the closure may not have made Node point to it (Section ). In this case, just before jumping to UpdatePAP, the argument-satisfaction-check code loads a pointer to the closure into Node. (Recall that top-level closures are statically allocated, so their address is fixed.)

What does UpdatePAP do? It follows Rule a:

First, it builds in the heap a closure representing the partial application, whose structure is given below.

Next, it overwrites the closure to be updated (obtained from the update frame) with an indirection to the newly-constructed closure.

It restores the values of the stack base registers from their values saved in the update frame.

It removes the update frame from the B stack, sliding down the portion of the stack (if any) which is above it.

Finally, it re-enters the closure pointed to by Node.

What does the partial-application closure look like? In the most general case it contains:

    the info pointer, PAPInfo;

    the total size of the closure and the number of pointers in it (as well as being used by the storage manager, this information is required by the standard-entry code of PAPInfo; see below);

    the pointer to the function closure, which is in Node;

    the contents of the A stack between the top of stack and its stack base pointer;

    the contents of the B stack between the top of stack and its stack base pointer.

If this partial-application closure is entered, the standard-entry code of PAPInfo pushes the saved stack contents onto their respective stacks (using the size information to determine how many words to move to which stack), and then enters the function closure saved in the partial-application closure. Its garbage-collection routines use the size information stored in the closure to guide their work.

So much for the general case. A couple of optimisations are readily available. Firstly, a collection of specialised PAPInfo pointers can be provided for various combinations of numbers of pointer and non-pointer words; for example, one such info pointer is used when there is one pointer word and no non-pointers. The advantages of such specialised info pointers are: there is no need to store the field sizes in the closure, and the entry and garbage-collection code is faster because it has no interpretive loop. There is, of course, a small execution-time cost in UpdatePAP to decide whether a special case applies.

Secondly, if the new closure is small enough, it can be built directly on top of the closure to be updated.

Constructors

The other way in which an update can be triggered is when a constructor finds an empty return stack. It looks as though the code for a constructor application has to test for an empty return stack; indeed, this is just what is implied by Rule . This looks expensive, because constructors are so common. Furthermore, the return stack is almost always non-empty, so the test is in vain: data structures are often built once and then repeatedly traversed. Each time pattern matching is performed on a data structure, a continuation is pushed on the return stack and the closure representing the data structure is entered. Since it is already evaluated, it returns immediately (perhaps using a vectored return), but it must first perform the return-stack test.

So now comes the tricky part. Since update frames and continuations are both stored on the B stack, if a return address is not on top of the stack, then an update frame must be. If we make the top word of each update frame into a code label, UpdateConstr, the constructor could just return without making any test. In the common case, where there is no update frame, this does just what we want. If an update is required, there will be an update frame on top of the B stack, so the "return" will land in the UpdateConstr code, which can perform the update and then return again, perhaps to another update frame, or perhaps to the real continuation.
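The trick can be modelled in C with code labels as function pointers (an approximation; the real mechanism is a jump through the top word of the B stack). All names below are invented for the sketch.

```c
#include <assert.h>

typedef int (*BTopLabel)(void);

static int updates_done = 0;

static int real_continuation(void) { return 42; }

/* The code label sitting at the top of every update frame: perform the
 * pending update, then "return" again — here, straight to the real
 * continuation, though it could equally land in another update frame. */
static int update_constr(void)
{
    updates_done++;
    return real_continuation();
}

/* Constructor return code: jump through the top word of the B stack
 * unconditionally — no "is there an update frame?" test. */
int constructor_return(BTopLabel top_of_b_stack)
{
    return top_of_b_stack();
}

int demo_constructor_return(void)
{
    int ok = constructor_return(real_continuation) == 42 && updates_done == 0;
    return ok && constructor_return(update_constr) == 42 && updates_done == 1;
}
```

The conditional test disappears from every constructor return; its cost is paid only once per update frame, by placing the UpdateConstr label in the frame's top word.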

Before we examine the complications, it is worth looking at the crude costs and benefits. The cost is one extra instruction for every update frame pushed. The benefit is the omission of a couple of instructions, one of them a conditional jump, from every constructor evaluation. If data structures are traversed repeatedly, constructor evaluations will occur substantially more often than updates. The benefits look significant.

The trouble is that this trick interacts awkwardly with the various return conventions for constructors, discussed in Sections  and . With the simple return conventions everything works fine: the case alternatives are always represented by a simple code label on the B stack, and the result is returned by making Node point to the constructor closure. All UpdateConstr need do is to overwrite the closure to be updated with an indirection to this constructor closure, restore the stack base registers, and return again.

Vectored returns

Life gets more complicated when we add vectored returns. There is no problem with providing a vectored form of UpdateConstr: each entry in its return vector points to code which performs the update and then returns in its turn, in a vectored fashion. The difficulty is that when the update frame is created, the return convention is not known. This is because the type of the expression may be polymorphic. Consider, for example, the compose function, which looks like this in the STG language:

    compose = {} \n {f,g,x} -> let gx = {g,x} \u {} -> g {x}
                               in f {gx}

The code for the closure gx does not know its type. Hence, when it pushes the update frame, it cannot know whether a vectored return is to be expected or not. In short, UpdateConstr

[Figure: Vectored updates — a return vector pointed to from the B stack, in which each constructor has a pair of entries: the code for the Nil alternative and the code to update with Nil, and likewise the code for the Cons alternative and the code to update with Cons]

must be able to cope with either a vectored or a nonvectored return

If we were generating machine co de this would present little problem We just adopt the

convention that the p ointer to a vector table p oints just after the end of the table so that

the table is accessed by indexing backwards from the p ointer Now UpdateConstr lab els

ordinary co de immediately preceded by its vector table Sadly C do es not allow us to sp ecify

the relative placement of data and co de in this way so instead we have to adopt the convention

that nonvectored returns b ehave just like vectored returns through a vector table with one

entry cf Section This imp oses an extra indirection on nonvectored returns

Returning values in registers

Unfortunately matters get worse when we consider the idea of returning constructor values

in registers Now UpdateConstr has no way to gure out how to p erform the up date b ecause

it has no way to tell what return convention is b eing used Can the closure which pushed

the up date frame push a version of UpdateConstr appropriate for the data type As just

discussed the answer is no b ecause of p olymorphism

This looks like a rather serious problem. There is a way round it, but it is rather tricky. The idea is this: the update frame may not know the type of the value being returned, but the case expression which caused the evaluation in the first place certainly does. So UpdateConstr does not perform an update at all; it merely records that an update is required, by placing a pointer to the closure to be updated in a special register, UpdatePtr. It is up to the case-expression continuation to perform the update. How does the continuation know whether an update is pending? Simple: each entry in the case expression's return vector is expanded to a pair of code labels. The first of these is just as before, i.e. the code for the case-expression alternative; the other performs an update on the closure pointed to by UpdatePtr, and then jumps to the first. We call these the normal return code and the update return code respectively. All UpdateConstr has to do to precipitate the update is to return to the update return code rather than the normal return code, which it can do merely by increasing its offset into the return vector by one.
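A minimal simulation of the paired-entry scheme; all names (UpdatePtr, the vector layout, the code labels) are illustrative. Returning at offset+1 runs the update return code, which performs the pending update and then falls through to the normal return code:

```c
#include <assert.h>

typedef struct Closure { int value; int updated; } Closure;

static Closure *UpdatePtr;   /* closure awaiting update, if any */

/* Normal return code for one case alternative. */
static int normal_return(int v) { return v; }

/* Update return code: perform the pending update, then fall
   through to the normal return code. */
static int update_return(int v) {
    UpdatePtr->value   = v;
    UpdatePtr->updated = 1;
    return normal_return(v);
}

typedef int (*RetCode)(int);

/* Each vector entry is a PAIR: [0] normal, [1] update. */
static RetCode return_vector[][2] = {
    { normal_return, update_return },
};

/* UpdateConstr: record the closure to update and return via the
   update half of the pair, i.e. offset increased by one. */
static int update_constr(Closure *c, int tag, int v) {
    UpdatePtr = c;
    return return_vector[tag][1](v);
}

/* A return with no pending update uses offset 0 of the pair. */
static int plain_return(int tag, int v) {
    return return_vector[tag][0](v);
}
```

The case expression itself is oblivious: it always reaches its alternative, whether or not the update return code ran first.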

The costs are surprisingly slight. There is a static space cost, as each vector table now doubles in size. The extra code to perform the update can be generated once only for each constructor, and then pointed to from all the return vectors for its data type. This per-constructor update code can still find its way to the appropriate case alternative, provided the pointer to the return vector is kept handy in a register.

One objection remains, which looks serious: suppose there are several update frames on top of each other before the real continuation is reached. This can arise in programs like the following:

    let x1 = ...
    in
    let x2 = {x1} \u {} -> x1
    in
    let x3 = {x2} \u {} -> x2
    in
    x3

When x3 is entered it will push an update frame, and then enter x2, which will push another update frame and enter x1. When x1 reaches head normal form, it will find two update frames on top of the stack, reflecting the fact that both the closure for x2 and that for x3 must be updated with x1's value. This looks like a rather special case, but it does arise in practice: any closure which may return the value of another closure has the same property. For example, the definition of x3 could be:

    x3 = {x2,z} \u {} -> case z of
                           Nil       -> x2
                           Cons p ps -> p

The problem with multiple update frames is that the UpdatePtr register can only point to one closure. Fortunately, there is an easy solution. Recall that all return vectors, including that for UpdateConstr, consist of paired entries, and that UpdateConstr returns to the update return code, rather than the normal return code, of the pair. The update return code of the UpdateConstr vector therefore knows that it is not the first update frame, and so UpdatePtr is already in use. One possibility would be to chain together all the closures to be updated, but there is a simpler way: the second and subsequent update frames just update their closures with an indirection to the one pointed to by UpdatePtr. When the latter is finally updated, all the updating has been successfully completed. This can result in chains of at most two indirections; and remember that indirections are all eliminated by the garbage collector.
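The discharge of stacked update frames can be sketched as follows. The representation is illustrative, but the logic mirrors the text: the first frame popped claims UpdatePtr, each later frame overwrites its closure with an indirection to UpdatePtr's closure, and the real update happens exactly once:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Closure {
    enum { THUNK, IND, VALUE } kind;
    struct Closure *ind_to;   /* valid when kind == IND   */
    int value;                /* valid when kind == VALUE */
} Closure;

static Closure *UpdatePtr = NULL;

/* Pop one update frame whose closure is c. */
static void pop_update_frame(Closure *c) {
    if (UpdatePtr == NULL) {
        UpdatePtr = c;          /* first frame: just record it  */
    } else {
        c->kind   = IND;        /* later frames: indirect to it */
        c->ind_to = UpdatePtr;
    }
}

/* Once the value is known, update the recorded closure. */
static void commit_update(int v) {
    UpdatePtr->kind  = VALUE;
    UpdatePtr->value = v;
    UpdatePtr = NULL;
}

/* Follow the (short) indirection chains this scheme creates. */
static int follow(Closure *c) {
    while (c->kind == IND) c = c->ind_to;
    return c->value;
}
```

Every later frame indirects to the same target, so no chain is more than two indirections long, and the garbage collector removes even those.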

Finally, what of non-vectored returns? We still need a pair of code addresses to return to, as in the vectored case. If the results are returned in a heap-allocated closure pointed to by Node, there is no problem: the update return code just performs the update and jumps to the normal return code. If results are being returned in registers, then the update return code needs to perform case analysis on RTag, to figure out how to perform the update. As before, there need be only one copy of this update return code.

Update in place

The update technology just described has another very important benefit: it allows the updated closure to be overwritten directly with the result, if it is small enough, rather than being overwritten with an indirection to the result.

Up to now, our uniform return convention has meant that closures are only ever overwritten by indirections, even though it is often in principle possible to overwrite them directly with the result. Not only does this introduce extra indirections but, more seriously, it gives rise to a lot of extra memory allocation. Kieburtz and Agapiev specifically identify and quantify this shortcoming [Kieburtz & Agapiev].

If the dual-return mechanism of the previous section is used, however, then this shortcoming can easily be overcome. The code performing the update knows exactly how the closure is laid out, so if it is small enough it can directly overwrite the closure to be updated. For example, here is the code to perform updates for a list Cons cell:

    ConsUpd:  UpdatePtr[0] := ConsInfo
              UpdatePtr[1] := Head
              UpdatePtr[2] := Tail
              JUMP(ReturnVector[...])

Here we are assuming that the Cons constructor returns its head in a register Head, and its tail in Tail. ConsInfo is the info table for the Cons constructor. The address of the return vector is assumed to be in a register, ReturnVector. The offset picks out the first code address of the pair for Cons.
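An executable analogue of ConsUpd (layout and names illustrative): a sufficiently large thunk closure is overwritten in place with a Cons cell whose fields arrived in "registers":

```c
#include <assert.h>

typedef struct Closure {
    const char *info;            /* stands in for the info pointer */
    struct Closure *payload[2];  /* head and tail fields           */
} Closure;

static const char *ConsInfo = "Cons";

static void cons_upd(Closure *update_ptr, Closure *head, Closure *tail) {
    update_ptr->info       = ConsInfo;  /* UpdatePtr[0] := ConsInfo */
    update_ptr->payload[0] = head;      /* UpdatePtr[1] := Head     */
    update_ptr->payload[1] = tail;      /* UpdatePtr[2] := Tail     */
    /* ...then jump through the return vector to the normal
       return code for Cons (omitted in this sketch). */
}
```

No new heap cell is allocated and no indirection is created: the thunk's own storage becomes the Cons cell.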

For larger constructors, the new object cannot be built directly on top of the old one, so a new object must be built in the heap, and the old one updated with an indirection to it. The usual heap-exhaustion check must be made, and garbage collection triggered if no space remains. The update routine must then be careful to save any pointers being returned in registers into a place where the garbage collector will find them.

How does the updating code know if the closure to be updated is large enough? There are two main possibilities:

- We can establish a global convention for the minimum size of updatable closures, making them all large enough (by padding if necessary); to contain a list cons cell seems a plausible guess. There is scope for another small optimisation here: if a closure is being allocated whose type is, say, Int, then it cannot possibly be updated by anything other than an integer, so it does not need to be padded out to cons-cell size.

- The updating code can look in the closure's info table to find its size, and either update in place or use an indirection, depending on what it finds. This costs more time than the unconditional scheme, but has the merit that it will succeed in updating in place more often. This is another aspect of the design which we plan to quantify.
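The second policy amounts to a simple size comparison at update time; a hedged sketch, with sizes measured in words and all names illustrative:

```c
#include <assert.h>

enum { IN_PLACE, INDIRECTION };

/* Decide how to update: the old closure's size would come from its
   info table; the new value's size is known to the update code. */
static int choose_update(int old_closure_size, int new_value_size) {
    return (new_value_size <= old_closure_size) ? IN_PLACE
                                                : INDIRECTION;
}
```

The trade-off in the text is exactly this test: the conditional scheme pays for the comparison (and the info-table lookup) on every update, but updates in place whenever the value happens to fit.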

Update in place is not always desirable. Suppose that the value of the thunk turned out to be an already-existing constructor, which returns its components in registers. Then update-in-place will overwrite the thunk with a copy of the constructor. No work is duplicated thereby, but there is a potential loss of space if both copies stay live for a while. At worst, a great many copies of the same object could be built in this way, substantially increasing the space usage of the program. The point is this: once copied, there is no cheap mechanism for commoning up the original with the copy.

Our preliminary measurements suggest that a substantial proportion of all updates copy an already-existing constructor; the figure can occasionally be much higher, as it is for the program reported by Runciman & Wakeling. We plan to make more careful measurements to see how important this effect is. If it turns out to be significant, we will implement a simple extension of the dual-return-address scheme outlined above, whereby each element of the return vector is a triple of return addresses. We omit the details, but the scheme has the effect of always updating a thunk with an indirection whenever the value of the thunk is an already-existing heap object.

Update frames and garbage collection

Update frames, which include a pointer to the closure to be updated, are kept on the B stack. At first this looks rather awkward, because the garbage collector expects all pointers to be on the pointer stack, but it actually turns out to be quite convenient, because of the following observation: if the only pointer to a closure is from an update frame, then the closure can be reclaimed and the update frame discarded.

We can take advantage of this during garbage collection in the following way. First, perform garbage collection as usual, but without using the pointers from update frames as roots. Now look at each update frame, and see if it points to a closure which has been marked as live. If so (and a copying collector is being used), adjust the pointer to point to the new copy of the closure. If not, squeeze the update frame out of the B stack altogether.

It is easy to find all the update frames, because the stack-base register for the B stack always points to the topmost word of the topmost update frame, the saved stack-base register for the B stack points to the next update frame, and so on. This gives a top-to-bottom traversal, but it turns out that a bottom-to-top traversal makes the squeezing-out process much more efficient, for two reasons:

- Since the squeeze must move data towards the bottom of the stack (otherwise the stack would creep up in memory), working from bottom to top means that each word of the B stack is moved only once.

- When an update frame is removed, the stack-base pointers for the next update frame above it need to be adjusted. This is easy to do when working bottom to top.

Happily, it is easy to make a top-to-bottom traversal reversing all the pointers, and then make the bottom-to-top traversal to do the work.
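The two-pass traversal can be sketched as ordinary pointer reversal over the chain of frames (frame contents illustrative): one top-to-bottom pass reverses the links, after which the frames can be visited bottom-to-top:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Frame {
    struct Frame *next_below;   /* saved stack base: frame below */
    int id;                     /* stand-in for the frame's data */
} Frame;

/* Reverse the chain in one pass; returns the bottom-most frame.
   After this, next_below points UPWARDS in the stack instead. */
static Frame *reverse_chain(Frame *top) {
    Frame *prev = NULL;
    while (top != NULL) {
        Frame *below = top->next_below;
        top->next_below = prev;
        prev = top;
        top  = below;
    }
    return prev;
}

/* Walk bottom-to-top, recording visit order into out[]. */
static int visit_bottom_to_top(Frame *top, int *out) {
    int n = 0;
    for (Frame *f = reverse_chain(top); f != NULL; f = f->next_below)
        out[n++] = f->id;
    return n;
}
```

The real collector would do the squeezing, and re-link the frames, during the second pass; the sketch only demonstrates the traversal order.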

The result of all this is that the garbage collector reclaims redundant update frames. The main benefit is the saving in updates performed. This optimisation was performed, but not documented, in Fairbairn and Wray's original TIM implementation.

Global updatable closures

As previously discussed, each globally-defined variable is bound to a statically-allocated closure. Since such closures have no free variables (except, of course, other statically-allocated closures), there is no need to treat them as a source of roots during garbage collection.

But some of these global closures may have no arguments, and hence be updatable; we call such argumentless top-level closures constant applicative forms, or CAFs. For example, ints is a CAF whose value is the infinite list of integers:

    ints = {} \u {} -> from zero
    zero = {} \n {} -> MkInt 0#

where from is a function returning the infinite list of integers starting from its argument.

There are two difficulties:

- If such a CAF is updated, there will be pointers from the static space into the dynamic heap. The question is: how is the garbage collector to find all such pointers?

- With the garbage-collection techniques described earlier, closures in static space need different garbage-collection code from those in the dynamically-allocated heap. The two are readily distinguishable by address, but it is unfortunate if every update is slowed down by a test, when the vast majority of updates are to dynamic closures.

There is more than one way to solve this problem, but the one we have adopted is as follows. The idea is to arrange that:

- CAFs which are being evaluated, or whose evaluation is complete, are linked together onto the CAF list, which is known to the garbage collector. This solves the first of the above problems.

- All update frames point to closures in the dynamic heap, thus solving the second problem.

We achieve these goals by adding a little extra code to the start of the standard-entry code for a CAF. The extra code does the following: it allocates a black hole closure in the heap, whose purpose is to receive the subsequent update; it pushes an update frame pointing to this black hole; and it overwrites the static CAF closure with a three-word CAF-list cell, pointing to the black hole in the heap and linked onto the CAF list.

In the example shown in the figure below, the CAFs p, q and r have all been entered, and hence are linked onto the CAF list (we use CL to abbreviate the info pointer for a CAF-list cell). The evaluation of p is not yet complete, so it still points to a black hole in the dynamic heap (info pointer BH). The evaluation of q has been completed, and the black hole has been updated with an indirection (info pointer I) to its value. s has not been entered, so it consists of an info pointer (S) only.

A CAF-list cell looks like any other closure. If entered, it simply enters the heap-allocated closure to which it points, behaving just like an indirection. The garbage collector knows

[Figure: Global updates. Static space holds the closures p, q, r and s. p, q and r have been overwritten with CAF-list cells (info pointer CL), chained together onto the CAF list; each points into the dynamic heap: p to a black hole (info pointer BH), q to an indirection (info pointer I) to its value. s has never been entered, and consists of its info pointer S only.]

about the CAF list, and walks it iteratively, evacuating the heap-allocated closure to which each cell points, and updating the cell appropriately. This is slightly pessimistic, since it holds onto the value of every CAF even though the program may never reference it again, but there is no avoiding this unless the garbage collector traverses the code as well.
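The bookkeeping performed by the extra CAF entry code can be sketched as follows (the representation is illustrative): allocate a black hole, overwrite the static closure with a CAF-list cell, and link it onto the CAF list. The caller would then push an update frame for the returned black hole:

```c
#include <assert.h>
#include <stddef.h>

typedef struct Closure {
    enum { CAF, CAF_LIST, BLACK_HOLE, IND } kind;
    struct Closure *heap_closure;  /* CAF_LIST: black hole / value */
    struct Closure *next_caf;      /* CAF_LIST: link in CAF list   */
} Closure;

static Closure *CafList = NULL;    /* root known to the collector */

/* First entry to a CAF.  fresh_heap_cell stands in for a newly
   allocated dynamic-heap closure; returns the black hole that the
   update frame should point at. */
static Closure *enter_caf(Closure *caf, Closure *fresh_heap_cell) {
    fresh_heap_cell->kind = BLACK_HOLE;   /* receives the update  */
    caf->kind         = CAF_LIST;         /* overwrite static CAF */
    caf->heap_closure = fresh_heap_cell;
    caf->next_caf     = CafList;          /* link onto CAF list   */
    CafList = caf;
    return fresh_heap_cell;
}
```

Both goals from the text fall out: the collector can find every static-to-dynamic pointer by walking CafList, and the update frame targets a dynamic-heap closure, never a static one.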

Status and profiling results

We have built a compiler for Haskell whose back end is based on the STG machine, just as described above. The whole implementation has been constructed rather carefully, so that it may be used as a "motherboard" into which other implementors may "plug in" their optimisation passes. All the source code is available by anonymous FTP, by contacting haskell-request@dcs.glasgow.ac.uk.

Apart from the STG machine technology described in the current paper, the major innovations of the compiler are:

- The systematic use of unboxed values to implement built-in data types.

- A new approach to input/output, based on monads, which allows the entire I/O system to be written in Haskell [Hammond, Peyton Jones & Wadler]. This is done via a general-purpose mechanism which allows arbitrary calls to be made from Haskell to C. As a result, the I/O system can be readily extended without modifying the compiler or its runtime system.

- The Core language, which serves as the main intermediate data type in the compiler, is actually based on the second-order lambda calculus, complete with type abstraction and application. This permits us to maintain complete type information in the presence of extensive program transformation, as well as accommodating other front ends whose type system is more expressive than Haskell's.

The implementation covers almost the whole language, but virtually no optimisations have yet been implemented. As a result, we have not compared its absolute performance with other compilers. This omission is an important shortcoming of this paper, which will be rectified by a follow-up paper.

We have begun to gather simple dynamic statistics, however. The figure below ("Example output of dynamic profiling information") shows some output taken from a run of a simple type-inference program, written in a few hundred lines of Haskell source code (apart from functions used from the Prelude); the sample run allocated several megabytes of heap. The profile has the following main headings:

Allocations, split into various categories. Most allocation is for thunks. The proportion of data-value allocation seems surprisingly low, because most data values are built by updating a thunk rather than by performing new allocation in a let(rec). This program uses monads heavily, so quite a lot of function-valued closures are allocated.

Stack high-water marks are self-explanatory.

Enters, with a classification of what kind of closure is being entered. In this case a rather low proportion of function calls bypass the argument-satisfaction check, again due to the very higher-order nature of the program.

Returns, which give information about the data-value returns which took place. In this run almost all were vectored, and in registers. The third classification tells how many returns were from entering an already-evaluated constructor. In the cell model, the enter/return sequence would not be performed for these cases.

Update frames classifies the various forms of update frame, which is rather uninformative in this case.

Updates. The fifth line counts the number of times two or more update frames were stacked directly on top of one another. The last line counts the number of updates in which an already-existing value was copied by an update in place (see the discussion of update in place above).

Acknowledgements

This paper has been a long time in gestation. I would like to thank those who have been kind enough to give me help and guidance in improving it. Andrew Appel made detailed comments about the relationship of this work to his own. Geoff Burn, Cordy Hall, Denis Howe, Rishiyur Nikhil, Chris Okasaki, Julian Seward and the two anonymous referees gave very useful feedback. The implementation of the compiler itself owes much to the work of Cordy Hall and Will Partain.

Sadly, my friend and colleague Jon Salkild, who co-authored the first STG paper, died most suddenly.

    ALLOCATIONS (words total: admin / goods / slop;
                 avg words of admin, goods and slop)
        function values
        thunks
        data values
        big tuples
        partial applications
        black-hole closures
        partial-application updates
        data-value updates
        Total storage-manager allocations (words)

    STACK USAGE
        A stack max depth (words)
        B stack max depth (words)

    ENTERS (of which direct to the entry code;
            the rest indirected via Node's info ptr)
        thunks
        data values
        function values (of which bypassed arg-satisfaction chk)
        partial applications
        indirections

    RETURNS
        in registers (the rest in the heap)
        vectored (the rest unvectored)
        from entering a new constructor
            (the rest from entering an existing constructor)

    UPDATE FRAMES (omitted from thunks)
        standard frames
        constructor frames
        black-hole frames

    UPDATES
        data values: in place / allocated new space (with Node)
        partial applications: in place / allocated new space
        updates followed immediately
        in-place updates copied

Figure: Example output of dynamic profiling information

Bibliography

R Alverson, D Callahan, D Cummings, B Koblenz, A Porterfield and B Smith (June), "The Tera computer system", in Proc International Conference on Supercomputing, Amsterdam.

AW Appel, "Garbage collection can be faster than stack allocation", Information Processing Letters.

AW Appel, Compiling with Continuations, Cambridge University Press.

AW Appel (Feb), "Simple generational garbage collection and fast allocation", Software Practice and Experience.

AW Appel and T Jim (Jan), "Continuation-passing, closure-passing style", in Proc ACM Conference on Principles of Programming Languages, ACM.

ZM Ariola and Arvind (June), "A syntactic approach to program transformations", in Symposium on Partial Evaluation and Semantics-Based Program Manipulation, Yale.

L Augustsson, Compiling Lazy Functional Languages, Part II, PhD thesis, Dept Comp Sci, Chalmers University, Sweden.

L Augustsson and T Johnsson (Sept), "Parallel graph reduction with the ⟨ν,G⟩-machine", in Proc IFIP Conference on Functional Programming Languages and Computer Architecture, London, ACM.

Henry Baker (Apr), "List processing in real time on a serial computer", CACM.

JF Bartlett (Jan), "SCHEME to C: a portable Scheme-to-C compiler", DEC WRL RR.

A Bloss, P Hudak and J Young, "Code optimizations for lazy evaluation", Lisp and Symbolic Computation.

Geoff Burn, SL Peyton Jones and John Robson (July), "The Spineless G-machine", in Proc ACM Conference on Lisp and Functional Programming, Snowbird.

C Consel and O Danvy (Sept), "For a better support of static data flow", in Functional Programming Languages and Computer Architecture, Boston, Hughes ed, LNCS, Springer Verlag.

EC Cooper and JG Morrisett (Dec), "Adding threads to Standard ML", CMU-CS technical report, Dept Comp Sci, Carnegie Mellon Univ.

AJT Davie and DJ McNally, "CASE: a lazy version of an SECD machine with a flat environment", in Proc IEEE TENCON, Bombay.

Jon Fairbairn and Stuart Wray (Sept), "TIM: a simple lazy abstract machine to execute supercombinators", in Proc IFIP Conference on Functional Programming Languages and Computer Architecture, Portland, G Kahn ed, Springer Verlag LNCS.

AJ Field and PG Harrison, Functional Programming, Addison Wesley.

P Fradet and D Le Metayer (Jan), "Compilation of functional languages by program transformation", ACM Transactions on Programming Languages and Systems.

K Hammond, SL Peyton Jones and PL Wadler (Feb), "A new input/output model for purely-functional languages", Dept of Computing Science, University of Glasgow.

P Henderson, Functional Programming: Application and Implementation, Prentice Hall.

R Hieb, RK Dybvig and C Bruggeman (June), "Representing control in the presence of first-class continuations", in Proc Conference on Programming Language Design and Implementation (PLDI).

P Hudak, SL Peyton Jones, PL Wadler, Arvind, B Boutel, J Fairbairn, J Fasel, M Guzman, K Hammond, J Hughes, T Johnsson, R Kieburtz, RS Nikhil, W Partain and J Peterson (May), "Report on the functional programming language Haskell", SIGPLAN Notices.

John Hughes (Apr), "Why functional programming matters", The Computer Journal.

PZ Ingerman, "Thunks", Comm ACM.

E Ireland (Jan), "The Lazy Functional Abstract Machine", in Proc Australian Computer Science Conference, Hobart, World Scientific Publishing.

Thomas Johnsson, "Lambda lifting: transforming programs to recursive equations", in Proc IFIP Conference on Functional Programming and Computer Architecture, Jouannaud ed, LNCS, Springer Verlag.

Thomas Johnsson, Compiling Lazy Functional Languages, PhD thesis, PMG, Chalmers University, Göteborg, Sweden.

Thomas Johnsson (June), "Efficient compilation of lazy evaluation", in Proc SIGPLAN Symposium on Compiler Construction, Montreal.

R Jones (March), "Tail recursion without space leaks", Department of Computer Science, University of Kent.

R Kelsey (May), "Compilation by program transformation", YALEU/DCS/RR, PhD thesis, Department of Computer Science, Yale University.

RB Kieburtz (Oct), "A RISC architecture for symbolic computation", in Proc ASPLOS II.

RB Kieburtz and B Agapiev (Sept), "Optimising the evaluation of suspensions", in Proc Workshop on Implementation of Lazy Functional Languages, Aspenäs.

H Kingdon, D Lester and GL Burn, "The HDG-machine: a highly distributed graph-reducer for a transputer network", Computer Journal.

PWM Koopman, Functional Programs as Executable Specifications, PhD thesis, University of Nijmegen.

DA Kranz (May), ORBIT: an Optimising Compiler for Scheme, PhD thesis, Department of Computer Science, Yale University.

PJ Landin (March), "A correspondence between Algol 60 and Church's lambda calculus", Comm ACM.

D Lester (Apr), Combinator Graph Reduction: a Congruence and its Applications, PhD thesis, Programming Research Group, Oxford.

Erik Meijer (Sept), "Generalised expression evaluation", in Proc Workshop on Implementation of Lazy Functional Languages, Aspenäs.

E Miranda (Apr), "How to do machine-independent fast threaded code", Dept of Computer Science, Queen Mary and Westfield College, London.

SL Peyton Jones, The Implementation of Functional Programming Languages, Prentice Hall.

SL Peyton Jones, "FLIC: a functional language intermediate code", SIGPLAN Notices.

SL Peyton Jones, C Clack and J Salkild (June), "High-performance parallel graph reduction", in Proc Parallel Architectures and Languages Europe (PARLE), E Odijk, M Rem and JC Syre eds, LNCS, Springer Verlag.

SL Peyton Jones and J Launchbury (Sept), "Unboxed values as first class citizens", in Functional Programming Languages and Computer Architecture, Boston, Hughes ed, LNCS, Springer Verlag.

SL Peyton Jones and D Lester (May), "A modular fully-lazy lambda lifter in Haskell", Software Practice and Experience.

SL Peyton Jones and DR Lester, Implementing Functional Languages: a Tutorial, Prentice Hall.

SL Peyton Jones and Jon Salkild (Sept), "The Spineless Tagless G-machine", in Functional Programming Languages and Computer Architecture, D MacQueen ed, Addison Wesley.

C Runciman and D Wakeling (April), "Heap profiling of lazy functional programs", Department of Computer Science, University of York.

P Sansom (Aug), "Combining copying and compacting garbage collection", in Proc Fourth Annual Glasgow Workshop on Functional Programming, Springer Verlag Workshops in Computer Science.

Mark Scheevel (Aug), "NORMA: a graph reduction processor", in Proc ACM Conference on Lisp and Functional Programming.

S Smetsers, E Nöcker, J van Groningen and R Plasmeijer (Sept), "Generating efficient code for lazy functional languages", in Functional Programming Languages and Computer Architecture, Boston, Hughes ed, LNCS, Springer Verlag.

RM Stallman (Feb), Using and Porting GNU CC, Free Software Foundation, Inc.

GL Steele, "Rabbit: a compiler for Scheme", AI-TR, MIT Lab for Computer Science.

William Stoye, Thomas Clarke and Arthur Norman (August), "Some practical methods for rapid combinator reduction", in Proc ACM Symposium on Lisp and Functional Programming.

D Tarditi, A Acharya and P Lee (March), "No assembly required: compiling Standard ML to C", School of Computer Science, Carnegie Mellon University.

AP Tolmach and AW Appel (June), "Debugging Standard ML without reverse engineering", in Proc ACM Conference on Lisp and Functional Programming, Nice, ACM.

DA Turner, "A new implementation technique for applicative languages", Software Practice and Experience.

P Wadler, "Efficient compilation of pattern matching", in The Implementation of Functional Programming Languages, SL Peyton Jones ed, Prentice Hall.

PR Wilson, MS Lam and TG Moher (Jan), "Caching considerations for generational garbage collection", Department of Computer Science, University of Texas.

A  The gory details

This appendix contains gory implementation-specific notes for the Glasgow Haskell compiler.

(ToDo: add some stuff about stack stubbing.)

A.1  Update flags

There are really three kinds of update flag:

- Updatable: "Update me with my normal form."

- Single-entry: "Don't update me with my normal form, but you can overwrite me with a black hole to prevent a space leak. I promise I will be entered at most once."

- Reentrant: "Don't update me with my normal form or with a black hole. I may be entered and re-evaluated more than once." Manifest functions, constructors and partial applications are always reentrant. Note that the latter two have zero arguments, so zero-arg things may be reentrant.
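The three flags and their immediate consequences can be summarised in code. This is a condensed, illustrative reading of the three descriptions above, not the compiler's actual representation:

```c
#include <assert.h>

typedef enum { UPDATABLE, SINGLE_ENTRY, REENTRANT } UpdateFlag;

/* Only updatable closures get an update frame. */
static int pushes_update_frame(UpdateFlag f) {
    return f == UPDATABLE;
}

/* Updatable and single-entry closures may be blackholed;
   reentrant ones must never be. */
static int may_blackhole(UpdateFlag f) {
    return f != REENTRANT;
}

/* Only reentrant closures may be re-evaluated more than once. */
static int may_reevaluate(UpdateFlag f) {
    return f == REENTRANT;
}
```

The single-entry flag is precisely the case that gets blackholing without updating: the evaluation happens once, so the normal form need never be written back.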

A.2  Black holes

There are three reasons for overwriting a thunk with a black hole immediately it is entered:

- to prevent space leaks;

- to give a better error message when there's an infinite loop;

- to do thread synchronisation in a parallel system.

The first two can be dealt with in a different way, by the garbage collector. Space leaks can be prevented by blackholing the things on the update stack at GC time. Infinite loops can be detected by a post-mortem on the update stack following a stack overflow. So, until we need parallelism, blackholing is a waste of instructions.

A.3  Adding fillers

The great invariant is this: if a closure is to be modified, either by being overwritten with a black hole or by updating, a filler is written over the tail of the closure, so that the initial segment is exactly FIXEDHDRSIZE+MINUPDSIZE words long.

MINUPDSIZE is at least big enough to allow for cons cells and linked indirections. Another truth: GEN objects are always bigger than FIXEDHDRSIZE+MINUPDSIZE; that leaves room for a GEN filler following a min-size closure slot.

That means that the black hole and update code have a well-defined, fixed-size thing to modify. All such closures are thunks, of either SPEC or GEN form. If the latter, the filler is set with

    SETGENFILLER(Node, slop)

where slop is the size of the closure (excluding the fixed header, of course) minus MINUPDSIZE.

If the closure is of SPEC form, and there are more than MINUPDSIZE words of ptrs+nonptrs, the filler is set with

    SETSPECFILLER(Node, slop)

where slop is the size of the closure minus MINUPDSIZE. In the SPEC case the slop argument is guaranteed to be an integer, so it can be used in building a label. (Remember, SPEC closures have no variable header, so the size is just ptrs+nonptrs, which the compiler can calculate.)

All this is done at the end of the basic block, just before Node is discarded or assigned. But if Node is being loaded from the closure itself, we have to go via a temporary, so we need

    SETGENFILLERANDLOADNODE(Node, finalnode, size)
    SETSPECFILLERANDLOADNODE(Node, finalnode, size)

If one-space collection isn't being done, these filler macros generate nothing (except, of course, that the LOADNODE variants still load Node).

A.4  Performing updates

The update-performing code behaves as follows.

If the new closure doesn't fit in a min-size closure slot:

    allocate the new object from the heap
    UPDIND(Hpxxx, HeapCheckxxxLive)

If the new closure fits in a min-size closure slot:

    UPDINPLACE(UpdPtr, HeapCheckxxxLive)
    fill in UpdPtr
    if incompletely filled, use SETUPDFILLER(UpdPtr, slop)

In generational GC, the UPDINPLACE code checks for an old-gen update and, if so, allocates a min-size closure in new-space, updates the old-gen thing with a pointer to the new-space thing, and makes UpdPtr point to it.

The SETUPDFILLER fills in any remaining slop. The slop argument is guaranteed to be an integer.
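The decision made by the update-performing code can be sketched as follows; the value given to MIN_UPD_SIZE here is an assumption for illustration only:

```c
#include <assert.h>

#define MIN_UPD_SIZE 2   /* assumed value, for illustration only */

enum { UPD_IND,              /* allocate in heap, indirect to it */
       UPD_IN_PLACE_FULL,    /* exactly fills the min-size slot  */
       UPD_IN_PLACE_FILLED   /* fits with slop: SETUPDFILLER     */
};

static int perform_update(int new_closure_size) {
    if (new_closure_size > MIN_UPD_SIZE) return UPD_IND;
    if (new_closure_size < MIN_UPD_SIZE) return UPD_IN_PLACE_FILLED;
    return UPD_IN_PLACE_FULL;
}
```

The filler branch is what keeps the invariant of the previous section: whatever is written over the old closure occupies exactly the fixed-size initial segment.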

A.5  Lambda-form info

Now we are ready to summarise the remarkably subtle questions of entry and update conventions.

                           Reentrant             Updatable            Single-entry
    Node must point        If has fvs, or        Yes (note)           If has fvs, or using
    to it                  arity 0 and using                          cost centres (note)
                           cost centres (note)
    Can jump direct        Yes (note)            No                   Yes
    to code
    Push update frame      No                    Yes (note)           No
    Black hole on entry    No                    Optional (notes)     If has fvs (notes)
                                                 UPDBHUPDATABLE       UPDBHSINGLEENTRY
    Use filler if          No                    Yes, unless          If has fvs (note)
    1-space gc                                   static (note)
                                                 SETxxxFILLER         ditto

Note: we NO LONGER assume that any closure with no free variables is allocated statically. However, we do assume that having fvs implies not static.

Note: we never blackhole an updatable static closure. Instead, it will be overwritten with a CAF-list cell pointing to a newly-allocated black hole.

Note: blackholing can be done by the garbage collector, by running down the update stack and blackholing any pending updatees; and infinite loops can be detected by a post-mortem on the update stack.

Note: a single-entry closure which has free vars is always blackholed, to avoid space leaks. The garbage-collector trick of the previous note doesn't work for them, because they don't sit on the update stack.

One could argue that space leaks from omitting the black hole for single-entry things are similar to other unavoidable space leaks, but we don't: we blackhole them.

A single-entry closure with no fvs is never blackholed: it cannot give rise to a space leak, and we trust the single-entry flag, which should mean it can't form part of a loop. (Even if that's not true, the update-stack post-mortem would reveal the loop.) Notice that this means that static single-entry closures, which have no fvs, are never blackholed.

Note: for updatable static closures, the update frame will point to the newly-allocated black hole.

Note: if a closure is updatable, Node must be made to point to it, even if it is a static no-fv closure. Reason: it may have been updated with a CAF indirection. We have to indirect via the closure to get the entry code anyway (because it might have been updated), so it is no big deal to load Node too. (In theory we could have a different CAF-indirection code for each CAF, which knows the closure address, but it saves no time.)

Note: the node-must-point-to-it conditions ensure that Node always points to a closure which is to be blackholed.

Note: for reentrant things with nonzero arity, there is still a choice as to whether to jump to the fast or slow entry point, depending on how many args the thing is applied to.

Note: when cost-centre profiling, Node must point to anything which isn't a HNF, even if it has no fvs, so that the cost centre can be extracted from the closure.

Note: single-entry closures only need a filler if they have been blackholed; hence the condition. (NB: having fvs implies not static.)

Note: static updatable closures don't need a filler, because the black hole freshly allocated for them is already the standard size.

Note: this rule says that Node doesn't need to point to no-fv closures with nonzero arity. But that could be a problem, because such things have an argument-satisfaction check, which needs to know where the closure is. We solve this by allocating such closures statically: this can't result in a space leak (because they are HNFs already and have no fvs), nor can it result in loss of laziness.

Imported things, which we know nothing about, are entered as if they were updatable things with no free vars.

A.6  Stack stubbing

Blackholing closures is a way to avoid space leaks, but there is another important source which is not caught thereby, namely pointers on the stack which happen to be dead. For example:

    f x = case x of ...alternatives not involving x...

Here x is passed to f on the stack, but is dead as soon as it is entered.

The simplest solution is:

- Just before every tail call, overwrite any stack-held ptrs which are now dead with a pointer to a special Stub closure. Stub is a static closure with no pointers inside it, so it plugs any space leak.

- The entry code for Stub elicits an error message, because the stack slot is supposed to be dead.

How do we know which slots are dead? Because they are bound to variables which aren't free in the continuation.
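Stubbing itself is a simple overwrite of the dead slots; a sketch with an illustrative stack representation, where liveness has already been computed from the continuation's free variables:

```c
#include <assert.h>

typedef struct Closure { const char *name; } Closure;

/* A single static closure with no pointers inside it. */
static Closure Stub = { "Stub" };

/* Just before a tail call: overwrite every dead pointer slot
   with a pointer to Stub, plugging the space leak. */
static void stub_dead_slots(Closure **stack, const int *live, int n) {
    for (int i = 0; i < n; i++)
        if (!live[i])
            stack[i] = &Stub;
}
```

Whatever the dead slots previously pointed at is now unreachable from the stack, so the garbage collector can reclaim it; entering Stub by mistake would signal an error rather than silently reusing dead data.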

Unfortunately, this stack-stubbing takes extra instructions to perform. We can improve matters somewhat:

Use dead stack slots to save volatile variables in at a case expression. Not only does this mean that fewer slots need to be stubbed, but it also reduces stack growth.

There is one way of avoiding the remaining stack-stubbing instructions, namely by attaching a bit-mask to return-vector addresses pushed on the B stack, which identifies the dead A-stack pointers. This is quite a bit more work, so we don't do it at present, but we do count how many stubbing instructions we execute.
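The bit-mask alternative could look something like the following sketch, in which bit i of a mask pushed alongside the return vector marks A-stack slot i as dead, so the garbage collector skips those slots instead of requiring them to be stubbed. The function name and mask width are hypothetical:

```c
#include <stdint.h>

/* Count the live pointers among the topmost n A-stack slots of a
   frame, given the dead-slot bit-mask attached to its return-vector
   address: bit i set means slot i is dead and must not be scavenged. */
static int count_live_slots(uint32_t dead_mask, int n) {
    int live = 0;
    for (int i = 0; i < n; i++)
        if (!(dead_mask & ((uint32_t)1 << i)))
            live++;
    return live;
}
```

The trade-off the text describes is visible here: the mutator executes no stubbing instructions at all, but the collector (and the compiler, which must emit the masks) does correspondingly more work.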

A Implementation

We need to keep extra information in the code generator state:

We need to keep track of which A-stack slots are used for what purpose; in particular, which slots are used to store which variables. This information is used when saving volatile variables at a case expression, to identify dead slots, and at a tail call, to identify slots which must be stubbed.

We can do this just by adding an extra component to the code generator state carried around by the monad, and making sure we keep it up to date when we alter the environment.

When compiling a tail call we need to know which variables (if any) are used in the continuation, so that any others occupying stack slots below the tail-call SpA can be stubbed.

This is easy too: just add a piece of inherited information to the monad, rather like the set-filler stuff. The difference is that at a case expression the needed-variable information goes into the case branches, rather than into the scrutinee.

The scheme so far is slightly pessimistic. Consider the expression:

    f x = let y = ... in
          case x of
            ... -> ...y...

The code generated for this is:

1. Allocate y.
2. Save y on the stack.
3. Push the continuation.
4. Enter x.

Now, since x is still live at the point where we save y, we will allocate a new stack slot for y, and have to stub the x slot just before entering x. It would be better to save y in x's slot. We can spot this as a special case, perhaps including the slightly more general case where the scrutinee is a function application.
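One way to realise this special case is for the code generator's slot allocator to prefer a slot whose occupant dies at the save point (such as the scrutinee's slot, dead as soon as x is entered) over growing the stack. The slot-table representation below is invented for illustration:

```c
/* Illustrative slot table: slot_live[i] is non-zero if A-stack slot i
   still holds a live variable at the point where we must save a
   volatile variable for a case expression.  Reusing a dead slot
   (e.g. the scrutinee's) avoids both a stub instruction and a word
   of stack growth; only if every slot is live do we take a fresh one. */
static int choose_slot(const int *slot_live, int n_slots) {
    for (int i = 0; i < n_slots; i++)
        if (!slot_live[i])
            return i;      /* reuse a dead slot: no stubbing needed */
    return n_slots;        /* all live: allocate a fresh slot */
}
```

In the example above, x's slot would be marked dead once the continuation has been pushed, so y lands in x's slot and the stub instruction disappears.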

Index

A-stack
ABC machine
Abstract C
activation frame
address
algebraic data type
allocation
argument satisfaction check
argument stack
arithmetic
B-stack
black holes
built-in operation
cache
CAF list
CAFs
call/cc
case expressions
cell model
closure
closure mode
closures
  entering
  static
code
code pointer
constant applicative form, see CAFs
constructors
  niladic
  standard
continuation
continuation-passing style, see CPS
Core language
CPS
currying
data structures
data values
debugging
direct-entry point
Enter
entering, see closures
environment pointer
evacuation
Eval
eval/apply model
evaluation stack
forcing, see thunks
forwarding pointer
frame
frame pointer
free variables
full laziness
function application
function values
G-machine
garbage collection
generational garbage collection
global environment
globals
graph reduction
Haskell
head normal forms
heap
heap overflow check
indirections
info pointer
info table
initial state
input/output
integers
  small
lambda lifting
lambda-form
laziness
local environment
  saving
locality
locally-defined functions
locals
main
manifest functions
monads
non-updatable
normal forms
normal return code
⟨ν,G⟩-machine
operational semantics
paging
PAPInfo
parallel execution
partial applications
pattern matching
pointer
primitive data type
primitive values
profiling
push/enter model
queue me
re-entrant
reduction
register saves
return address
return convention
return stack
return vector
ReturnCon
ReturnInt
scavenging
SECD machine
second-order lambda calculus
self-updating model
sequences
single-entry
space leak
stack base registers
stack overflow check
stack stubbing
stacks
standard constructors, see constructors
standard-entry code
state transition system
status flag
STG language
strictness analysis
supercombinator
suspension
T-code
tag, big
tagless
target language
threads
Three Instruction Machine, see TIM
thunks
  forcing
  representing
TIM
unboxed values
uniform representation
updatable
update flag
  re-entrant
  single-entry
update frame
update return code
update stack
UpdatePAP
updates
  in place
values
vectored returns