R-Code a Very Capable Virtual Computer


Citation: Walton, Robert Lee. 1995. R-Code a Very Capable Virtual Computer. Harvard Computer Science Group Technical Report TR-37-95.

Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:26506458

Terms of Use: This article was downloaded from Harvard University's DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

R-Code: a Very Capable Virtual Computer

Robert Lee Walton

TR-37-95

October 1995

Center for Research in Computing Technology

Harvard University

Cambridge, Massachusetts

RCODE

A Very Capable

Virtual Computer

A thesis presented

by

Robert Lee Walton

to

The Division of Applied Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Computer Science

Harvard University

Cambridge, Massachusetts

October 1995

© 1995 by Robert Lee Walton. All rights reserved.

ABSTRACT

This thesis investigates the design of a machine independent virtual computer, the RCODE computer, for use as a target by high level language compilers. Unlike previous machine independent targets, RCODE provides higher level capabilities, such as a garbage collecting memory manager, tagged data, type maps, array descriptors, register dataflow semantics, and a shared object memory. Emphasis is on trying to find universal versions of these high level features, to promote interoperability of future programming languages and to suggest a migration path for future hardware.

The memory manager design combines both automatic garbage detection and an explicit manual delete operation. It permits objects to be copied at any time, to compact memory or expand objects. It traps obsolete addresses and instantly forwards copied objects using a software cache of an object map. It uses an optimized write barrier, and is better suited for real time than a standard copying collector.

RCODE investigates the design of type maps that extend the virtual function tables of C++ and the similar tables of HASKELL, EIFFEL, and SATHER. RCODE proposes to include numeric types and sizes in type maps, and to inline information from type maps by using dynamic case statements which switch on a type map identifier. When confronted with a type map not seen before, a dynamic case statement compiles a new case of itself to handle the new type.

RCODE also investigates using IEEE floating point signaling NaNs as tagged data, and making array descriptors first class data.

RCODE uses a new register dataflow execution flow model, to better match the coming generation of superscalar processors. Functional dataflow is used for operations on register values, and memory operations are treated as unordered I/O. Barriers are introduced to sequence groups of unordered memory operations. A detailed semantic execution flow model is presented.

RCODE includes a shared object memory design to support multithreaded programming within a building, where network shared object memory reads and writes take several thousand instruction-execution-times to complete. The design runs on existing symmetric processors, but requires special caches to run on future within-building systems.

ACKNOWLEDGMENTS

For an older student with a family to support, financing is very important. I would like to thank the following people and institutions that made this thesis financially possible: my wife, my mother, and my wife's family; the Division of Applied Sciences of Harvard University; the Advanced Research Projects Agency (Contract Nr. FC); and Professor Thomas Cheatham.

Contents

Introduction
    Background
        SCODE
        RCODE
        VCODEs
    Methodology
    Context
    Hardware Trends
    RCODE Feature Selection
        Memory Management
        Data Types
        Execution Flow
        Shared Object Memory
    Summary of the Thesis
    Some Related Work

Memory Management
    Goals
    Requirements
    Definitions
        Memory and Reachability
        Marking, Scavenging, and Sweeping
        Mutators and Frames
        Copying and Spaces
        Forwarding
        Swizzling
        Manual Deletion and Type Change
        Conservative Pointer Components
        Two Level Addressing
    Related Work and Alternatives to Object Maps
    Address Registers
        Multi-Process Single-Processor Systems
        Address Register Loads and Stores
        Shared Memory Multi-Processors
        Copying Stops
        Address Register Overhead
        The Interrupt Check Alternative
        Other Work Related to Address Registers
    Concurrent Garbage Detection Algorithms
        The Standard Marking Algorithm
        Ephemeral Marking
        Concurrent Marking Conditions
        Snapshot and Non-Snapshot Detectors
            Non-Snapshot Detectors
            Snapshot Detectors
        Read and Write Barriers
            Read Barrier Detectors
            Write Barrier Detectors
            The Write Barrier End Game
            Write Barrier End Thrashing
            Snapshot Detector Barriers
            Comparison of Read-Barrier, Write-Barrier, and Snapshot Detectors
    Copying Collectors
        Forwarding
            Read Barrier Forwarding
            Write Barrier Forwarding
            Replication Forwarding
            Two Level Addressing
            Comparison of Read-Barrier-Forwarding, Write-Barrier-Forwarding, Replication-Forwarding, and Two-Level-Addressing Detectors
    The RCODE Write Barrier
        The Bitstring AND Write Barrier Test
        Local and Global Heaps: A Future Use for the Write Barrier Test
        Deferred Action Buffers
        Write Barrier Mutator Overhead
        Related Work on Write Barriers
    Non-Mutator Overhead
    Summary

Data Types
    Goals
    Requirements
    Basic Data Types
        Memory Units
        Tagged Values
        Pointers
        The Pointer Type Field
        Scalar Types
        Loads and Stores
        Related Work on Dynamic Compilation
    Type Maps
        Issues to be Analyzed
        Type Map Related Work
    Type Matching
        Static Type Matching
        Dynamic Type Matching in RCODE
        Dynamic Type Matching via Formal Types
        The Formal Type ANY
    Formal Component Descriptors
        Constancy of Type Maps
        Adding Component Descriptors to Type Maps
        Type Variables
        Actual Type Specification
        An Actual Type Specification Example
    Component Descriptor Contents
        RCODE Component Descriptors
        RCODE Actual Component Descriptors
        RCODE Formal Component Descriptors
    Subobjects
    Examples of Deficiencies in C++
    Type Maps in Other Languages
    Array Descriptors
    Summary

Execution Flow
    Goals
    Requirements
    Principles
        Virtual Computer Representation
        Dataflow for Register Values
        RAM Memory is an Input-Output Device
        Based on Cases, Calls, Returns, Barriers, and Exception Throws
        Barriers are at Routine Top Level
        Return Terminates Subsequent Partitions
        Exception Catches are Cases Attached to Call Instructions
        Exception Throws are like Returns with Termination
        Out-of-Line Equals Inline
        Blocks are Routines
        Loops are Tail-Recursive Routines
        Traps are like Subroutine Calls
        NaNs instead of Traps
        Either Strict or Non-Strict
        Eager but for Traps
        Memory Reads have No Side Effects
        Case Statements Propagate Permissions
        Flexible Value Lists
        Second Class Frame Pointers
        Controllable Inlining
    Related Work
    RCODE Machine Language Control Flow Semantics
        RCODE Basic Register Dataflow
        Registers
        Instructions
        Type Codes
        Frame Memory
        Register Frames
        Frame Heap
        Value Lists
        Execution State
        The Execution Tree
        Node States
        State Transition Rules
        State Maintenance
        Cases
        Barriers
        Calls and Returns
        Exception Throws
        Blocks
        Loops
        Inlining
        Localizing
        Coroutines
        Non-Signaling-NaNs
        Traps
        Continuations
        Recording State
    Implementation Challenges
        Simulating the RCODE Computer
        Non-Signaling-NaN Outputs
        Trap Implementation
    Summary

Shared Object Memory
    Goals
    Requirements
    Main Problems
        The Cache Coherency Problem
        The Thread Synchronization Problem
        The Write Delay Problem
        The Atomic Transaction Problem
        The Hotspot Problem
    Principles
        The Ordered Partition Model
        Commutative-Associative Operations
        Type Change
        Per Component Access Disciplines
        Volatile Component Caching
        Probabilistic Error Detection
    Related Work
    RCODE Shared Object Memory Semantics
        Access Disciplines
        Access Specifications
        Access Specification Combination
        The Volatile Cache
        The Read-Only Access Discipline
        The Write-Only Access Discipline
        The Volatile Access Discipline
        The Write-Once Access Discipline
        The Accumulate Access Discipline
        The Atomic Transaction Access Discipline
    Summary

List of Figures

    Language Interfaces
    SCODE Advanced Example
    Example ATCODE Information Graph
    Addressing Data
    Address Load and Store RISC Pseudo-Code
    Write Barrier RISC Pseudo-Code
    Deferred Buffer Processing Loop RISC Pseudo-Code
    A Vector of Bit Elements
    Bit and Bit Tagged Values
    Bit Tagged Pointer
    The Pointer Type Field
    Copy Types
    Type Maps
    Static Type Matching
    RCODE Default Type Matching
    Dynamic Type Matching by Formal Types
    RCODE Non-Constant Component Descriptors
    Array Descriptors
    Tagged Register Values
    Example Arithmetic Instruction
    Example Instruction Execution Tree
    RCODE Pointers and Types
    Combining RCODE Pointers and Component Descriptors
    Access Specification Combination

Chapter 1

Introduction

Background

After a long career in computer software development, I returned to graduate school a few years ago specifically to develop a programming language such that:

G1: A high school student knowing a bit of algebra and geometry can investigate and change commercially written games, editors, and simulators.

G2: The language is a suitable foundation for enhancements that merge the best of:

    C, Fortran,
    C++, Ada, Eiffel,
    SML, Haskell, Id,
    Lisp, Emerald, Smalltalk,
    MatLab, Mathematica, Latex.

I did this for two reasons. First, having a family made me aware how difficult it is for average people to learn existing commercial programming languages. In particular, I became ambitious to design a language in which companies would write computer games and high schoolers would enjoy modifying them.

Second, I have had an extensive career specializing in high performance software that glues other software together via communications or common data structures, and this gave me insight into what made existing programming languages good or bad from the interfacing point of view. In other words, I have written in the neighborhood of lines of code, much of it glue software, in assembly language, C, LISP, Ada, and C++, and I have developed opinions on how existing languages should be changed to make interfacing easier.

One's first idea, of course, is to design a neat little language, and after doing a partial definition of one of these¹ and looking at the fate of SCHEME and EIFFEL, two good neat little languages, I decided that a more potent method is required.

The problem is that it is difficult to convince people, such as the commercial companies mentioned in G1 above, to switch to a new language without offering some very good reason to do so. The only good reason I can think of is that the new language is a good candidate for a standard that would be much better than the previous standards.

I suspect that many aspects of modern computer languages cannot be successfully standardized. But there are a few that might be, and these form the basis of an approach.

The first things that lend themselves to standardization are high performance operations. Any operation that is high performance is short and does not have very many degrees of freedom. Therefore it is possible to work through all possibilities and choose a best one. Thus we have been able to standardize on the bit, the byte, and two's complement integer arithmetic. It is not clear that two's complement integer arithmetic is actually better than one's complement, but it is not noticeably worse, and there is little chance that someone will come up with something significantly better.
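As a concrete illustration of the two's complement convention just mentioned, the following sketch (in Python, which the thesis does not use; it is purely illustrative and not part of the original text) shows the key property that made standardization attractive: negation is "bitwise complement, then add one", so the same unsigned adder handles signed values unchanged.

```python
# Illustrative sketch (not from the thesis): two's complement arithmetic
# in an n-bit machine word. The same unsigned addition works for signed
# values, which is why one adder circuit suffices.

BITS = 8
MASK = (1 << BITS) - 1           # 0xFF for an 8-bit word

def to_word(x):
    """Encode a Python integer as an n-bit two's complement word."""
    return x & MASK

def from_word(w):
    """Decode an n-bit word back to a signed integer."""
    return w - (1 << BITS) if w & (1 << (BITS - 1)) else w

def neg(w):
    """Two's complement negation: bitwise NOT, then add one."""
    return (~w + 1) & MASK

a, b = to_word(5), to_word(-3)   # -3 encodes as 0xFD
s = (a + b) & MASK               # plain unsigned addition
print(from_word(s))              # 2
```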

Conversely, low performance operations are difficult to standardize. There is always some twist that works much better in some special case important to someone. Consider, for example, trying to standardize on a single universal sort algorithm.

The second kind of thing that lends itself to standardization is universal glue. By this I mean some format for communication that many programmers adhere to, so that their programs will interoperate with other programs.

There are several examples of data structures used as universal glue. One is LISP: symbols, cons cells, and various types of numbers. This permits many different functions to interoperate.

¹Called, for the record, The Game Language, and unpublished.

A similar example is MatLab. Here the data structure is a 2-D matrix of complex floating point numbers, stored as two matrices of floating point numbers, one for the real part and one for the imaginary part. The imaginary matrix may be omitted. One dimension size may be set to 1 to get a vector. Both dimension sizes may be set to 1, and the result is called a scalar. And there is a text flag which, if set, means the numbers are to be interpreted as ASCII character codes, i.e. text is represented. A vector with the text flag set is therefore a character string. And that is the whole show, data-structure-wise, upon which a large collection of interoperating MatLab subroutines is built.
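The single universal value the text describes can be sketched as one data type. The sketch below is illustrative Python; the class and field names are hypothetical, not MatLab's actual internals.

```python
# Hypothetical sketch of the one universal MatLab value described in the
# text: a 2-D matrix of complex numbers, with an optional text flag.
# Field names are illustrative, not MatLab's real representation.

class MatValue:
    def __init__(self, rows, cols, real, imag=None, text=False):
        self.rows, self.cols = rows, cols
        self.real = real          # rows x cols matrix of floats
        self.imag = imag          # may be omitted for purely real data
        self.text = text          # if set, entries are character codes

    def is_scalar(self):
        # Both dimension sizes set to 1 gives a "scalar".
        return self.rows == 1 and self.cols == 1

    def is_string(self):
        # A vector with the text flag set is a character string.
        return self.text and (self.rows == 1 or self.cols == 1)

    def as_string(self):
        codes = [c for row in self.real for c in row]
        return "".join(chr(int(c)) for c in codes)

s = MatValue(1, 2, [[72.0, 105.0]], text=True)   # the text "Hi"
print(s.is_string(), s.as_string())
```

Every MatLab subroutine consumes and produces values of this one shape, which is what makes the structure work as glue.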

The other kind of universal glue is syntax. The LISP S-expression syntax provides a way of interfacing programs to each other. COMMON-LISP enhances this syntactic glue with its lambda list extensions, such as optional arguments and keyword arguments.

Universal glue pays for itself not because it is optimally efficient in terms of computer cycles or human effort, but because without it programs cannot interoperate. The more potent forms of universal glue make it easier to program certain classes of interoperating programs, which is why these forms of glue have been successful. Most successful pieces of universal glue have a fairly simple structure.

None of the lower performance universal glues to date, e.g. LISP and MatLab, has more than a restricted audience. However, the C language has a very wide audience, because it both plays the role of universal glue and has high performance.

So what parts of a programming language might we successfully standardize?

Under the guise of universal glue, one can standardize some syntax, type checking, and calling conventions: an interface that basically upgrades the LISP S-expression and COMMON-LISP lambda list by adding strong typing and infix operators. I introduce such an interface, which I call SCODE, for "surface code".

All programming languages execute programs in some abstract machine model. For example, the machine model for C includes a stack containing function frames. For another example, the model for COMMON-LISP includes a memory with COMMON-LISP objects and garbage collection. The run-time system is part of the machine model, but so are the hardware conventions used to make subroutine calls and maintain memory.

Standardizing a machine model for use by a set of programming languages seems to be necessary if you want the languages to interoperate. So I propose a standard abstract machine model called RCODE, which stands for "register code", because it supports a register data flow model of program execution. RCODE supports garbage collection and other advanced programming language features. RCODE is, in effect, an interface to a virtual computer on which many programming languages may run together. RCODE tends to standardize high performance interfacing, and is therefore both universal glue and high performance.

Lastly, programming languages often have input-output support, e.g. printf/scanf in C and in C++. Because I want to support games and simulations, I expect to need something more potently visual. My root idea is to define data structures that are simple and logical and easy for programs to manipulate, but which can be displayed visually using predefined and complex visualization programs. I call these data structures VCODEs. A simple example of a VCODE consists of ASCII text buffers, pointers into these buffers, and character string labels, all organized into a directory information structure. This particular VCODE has the name ATCODE, standing for "ASCII Text Code".

The word VCODE itself stands for "visualization code", but the visualization programs that display VCODEs can be replaced by vocalization programs that permit blind people to hear the data, and so VCODE can equally well stand for "vocalization code".

The relationship of the three interfaces, SCODE, RCODE, and VCODE, is indicated in the Language Interfaces figure below.

This thesis describes the rationale for a design of RCODE. In order to give a more complete background, I introduce SCODE, RCODE, and VCODEs in more detail immediately below, but then discuss nothing but RCODE for the rest of the thesis.

SCODE

SCODE is a surface syntax which adds to the capabilities of COMMON-LISP lambda lists. One important feature to be added is strong typing. To my mind there are two reasons for doing this. First, many programmers cannot write efficient code in COMMON-LISP, even though in theory one can do so by adding suitable declarations. It is too hard to tell when COMMON-LISP declarations are needed for efficiency

    User
      |  (editors)
      v
    SCODE  -- Surface Syntax with Typing
      |  (compilers)
      v
    RCODE  -- Virtual Computer with Advanced Features
      |  (application programs)
      v
    VCODEs -- Visualizable Data Structures
      |  (visualization programs)
      v
    User

    Figure: Language Interfaces

purposes. Second, overloading, as in C++ and Ada, solves many naming problems that arise when writing large systems of programs. The programmer need only worry about getting unique type names if the names of most functions and variables include types implicitly.

Other important features to be added to SCODE are purely syntactic features, like infix operators and an improved scheme for lexical analysis.

The following is the definition, in a possible version of SCODE, for the prototype of the addition/subtraction expression:

    define (v) (x) + (y) - (z)
    where
        (x), (y), (z), (v) are each an (n),
    and an (n) is a number

Here the define statement is defining the form

    (x) + (y) - (z)

SCODE expressions are sequences of clauses, each consisting of some keywords followed by some arguments. In a form, arguments are indicated by identifiers in parentheses, e.g. (x), (y), (z). The keywords in the form we are defining are + and -. The first clause, defined by (x), has no keyword, and its marker indicates that this clause can be omitted. The second clause, defined by (y), has the keyword +, and its marker indicates that this clause can be repeated zero or more times; thus y is really a vector of values. The third clause, defined by (z), is similar, but the keyword is -. The marker between the second and third clauses indicates that instances of these clauses can be switched in order. The lack of such a marker after the first clause means any first clause must come before any later clauses.

Forms are SCODE expressions with form variables, such as (x), in various places where subexpressions would be placed. Types are also SCODE expressions, and (n) above is a form variable for a type. Form variables are indicated by surrounding them with parentheses in at least one of their uses within the form. Typing is essentially a process of matching expressions and assigning expressions as form variable values. This is just an application of the unification algorithm, analogous to unification in PROLOG.

I call this possible version of SCODE a unifying clausal grammar.
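The unification-based matching just described can be sketched in a few lines. The sketch below is illustrative Python, not SCODE's actual implementation; the term representation (nested tuples, with strings beginning with "?" as form variables) is an assumption made for the example.

```python
# A minimal sketch of unification as used for SCODE-style type matching.
# Terms are nested tuples; strings starting with "?" are form variables.
# This illustrates the general algorithm only; SCODE's real term
# representation is not specified here.

def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def walk(t, subst):
    # Follow variable bindings until a non-bound term is reached.
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst=None):
    """Return a substitution unifying a and b, or None on failure."""
    if subst is None:
        subst = {}
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

# Matching a "+" form whose two arguments must share one type (n):
form = ("+", ("arg", "?x", "?n"), ("arg", "?y", "?n"))
use  = ("+", ("arg", "a", "number"), ("arg", "b", "number"))
print(unify(form, use))   # binds ?x, ?y, and the type variable ?n
```

Note how the shared variable ?n forces both arguments to have the same type, which is exactly the constraint "(x), (y), (z), (v) are each an (n)" expresses in the prototype above.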


Example Function Prototype:

    define sort (x) into (y) such that (e){(y)(i) (y)(j)}
                                       for all (i) (j)
    where
        (x), (y) are each a vector from (type) to (type),
    and (y) is a write-only variable,
    and (i), (j) are each a (type) variable,
    and (e) is a boolean,
    and (type) is an enumeration

Example Use of Above Prototype:

    sort company.employees into v such that
         v(i).name v(j).name for all i, j

Figure: SCODE Advanced Example

Although I have done some preliminary work on SCODE, it is not close to being formally defined.

RCODE

Languages with garbage collection have problems interoperating with languages that do not have garbage collection. Even worse would be the case of two languages with different garbage collectors trying to interoperate with each other.

Therefore it seems that developing a standard garbage collecting memory manager is very important to the future of computing.

If you standardize on a memory manager that includes garbage collection, then you must standardize on other aspects of memory layout as well, including program stack and frame organization and how arguments are passed. You may also have to standardize on the way memory loads and stores are handled. In the end you have what amounts to a virtual computer with built-in memory management.

One might as well build machine independence into this computer too. Proposals to develop a machine independent computer, or equivalently a machine independent low level programming language, are quite old. One of the original proposals was to make a language called UNCOL², or Universal Computer Oriented Language, which was a suitable target language for every compiler, yet was machine independent. UNCOL was proposed specifically to reduce the burden of producing M × N compilers for M source languages and N types of hardware. Instead there would be M source analyzers compiling the languages to UNCOL and N code generators compiling UNCOL to machine language, for a total of M + N compilers.

    source     --- source analyzer --->  UNCOL  --- code generator --->  machine
    language                                                             language

The UNCOL idea did not work originally, perhaps in part for the following two reasons. First, machines were not very standard: some had one word size, some another; some supported one byte size, others different ones; and so forth. Now machines are very standardized as far as data structuring is concerned, which makes it much easier to define an UNCOL, since now only instruction set differences need be hidden. The second difficulty with UNCOLs was that they were not very different from higher level languages such as C and FORTRAN. So people who wanted to work on developing an UNCOL were often told to use C instead. In fact, C has been used as an UNCOL successfully to implement C++ [Str], COMMON-LISP [YHS+], EIFFEL [M], and other languages. But of course C does not include a standard garbage collecting memory manager.

RCODE is an UNCOL that includes a garbage collecting memory manager and is packaged as a virtual computer. It has its own data organization, instruction set, etc. Implementations on real computers must promise to make the virtual computer look real and be efficient.

The RCODE virtual computer looks something like a RISC machine. It has many virtual registers per routine execution, so there is no need to map more than one variable to a single register. The registers can hold tagged data, including full floating point numbers. The instructions have plenty of space for option flags. However, both the registers and instructions are virtual. RCODE is compiled to real machine languages that take advantage of the fact that the types of values in registers are known at compile time, so RCODE registers can be mapped to real untagged registers, RCODE instructions can be mapped to real untagged instructions, and RCODE will execute efficiently. Nevertheless, when RCODE routines are stopped during debugging, the debugging interface of the RCODE virtual computer makes things look as if RCODE were being directly executed.

²A modern survey of UNCOLs can be found in [Maca]. One study committee commented that the concept had been discussed by many independent persons long before; it might not be difficult to prove that this was well-known to Babbage.

Besides a garbage collecting memory manager, RCODE includes other features not found in normal computers. These features are all high performance universal glue features that we feel will be necessary for programming languages in the future. Included are type maps, array descriptors, tagged data, register data flow, and shared object memory. Later in this chapter we will introduce these in more detail.

RCODE gets its name, "register code", from one of its features, register data flow.

VCODEs

Some of the programs we are interested in having high school students modify are games, simulations, spreadsheets, word processors, and picture processors. All of these contain some data structured in a way natural to the program, and some fairly large piece of programming that permits users to visualize this data.

The VCODE idea is to define data structures natural to classes of programs, such that the programming necessary to visualize these data structures can be a prewritten fixed visualization program plus a few simple application specific rules. Then the application programmer will only have to worry about manipulating the data structures, which are natural to the application, and will not have to worry about programming visualization.

A side benefit of this approach is that the visualization program can be replaced by a vocalization program that will make the data structures accessible to blind people. Since the data structures are natural to the application, this should be an efficient way of proceeding.

Therefore a VCODE is a data structure natural to a class of programs, such that a visualization / vocalization program can be prewritten to provide a visual / vocal interface between a person and the data structure, with the help of some simple application and person specific rules.

[Figure: Example ATCODE Information Graph — a root node linking, among several running programs, a debugger whose file attributes are text buffers for main.c and subr.c, each mapped onto a file, with breakpoint pointers at lines in the buffers.]

Work has been done on the simplest VCODE, which is built around ASCII text buffers and is called ATCODE, or "ASCII Text Code". The ATCODE information structure is an information graph whose nodes are buffers, pointers at characters in buffers, pointers at regions in buffers, and symbols. The symbols are just character strings used as labels, as in LISP, but without any associated values or functions. An example information graph is given in the figure above.

In this example there can be several programs running, one of which is the debugger. The debugger has several file attributes, each of which is an ASCII text buffer mapped onto a file. These buffers can have breakpoint attributes, which are each a pointer to a line in the file where a breakpoint has been set by the debugger.
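The ATCODE graph just described can be sketched as a few small classes. The sketch below is illustrative Python; the class and attribute names are hypothetical, not ATCODE's actual definitions.

```python
# Hypothetical sketch of the ATCODE information graph described in the
# text: ASCII text buffers, pointers into buffers, and symbol labels
# organized into a graph of attributed nodes. Names are illustrative.

class Buffer:
    """An ASCII text buffer, possibly mapped onto a file."""
    def __init__(self, name, text=""):
        self.name = name
        self.lines = text.splitlines()

class Pointer:
    """A pointer at a line within a buffer."""
    def __init__(self, buffer, line):
        self.buffer = buffer
        self.line = line

class Node:
    """A graph node whose attributes (symbol labels) map to values."""
    def __init__(self):
        self.attrs = {}   # symbol label -> node, buffer, or pointer list

# Build the example graph: a debugger with a file buffer and a breakpoint.
root = Node()
debugger = Node()
root.attrs["debugger"] = debugger

main_c = Buffer("main.c", "int main() {\n  run();\n}\n")
debugger.attrs.setdefault("file", []).append(main_c)

# A breakpoint is a pointer at a line of a file buffer.
bp = Pointer(main_c, 1)
debugger.attrs.setdefault("breakpoint", []).append(bp)

print(bp.buffer.name, bp.buffer.lines[bp.line].strip())
```

The visualization rules described next then operate purely on attribute paths through such a graph, without knowing anything else about the debugger.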

The visualization program can display the file buffers just like an editor. The visualization program may be given the rule

    debugger.file.buffer.breakpoint.pointer: highlight green

which causes the visualization program to display any debugger file breakpoint line in green. The visualization program may be given the rule

    debugger.file.buffer: on b send breakpoint cursor

which causes the visualization program to send the command

    breakpoint copy-of-cursor

to the debugger program whenever the user types a b while a debugger file is the current edit buffer.

Other VCODEs could represent data that could be visualized as spreadsheets. More advanced VCODEs could represent abstract syntax trees that could be rendered into displays with variable-width fonts and mathematical formulas, or could represent data in a simulation that could be rendered into 3-D pictures, as in a flight simulator.

Methodology

Clearly I am interested in defining general purpose interfaces. In order to do so, one may engage in the following kinds of activities:

M1: Define interfaces and analyze them. Analysis consists of listing options and giving reasons pro and con for each option. A more or less exhaustive search of the reasonable option space is desirable. Analysis might use tools such as rough timing estimates and small pieces of pseudo code expressing either implementation or usage.

M2: Build a system employing the interface and test it on a small scale.

M3: Build a system employing the interface and attempt to market it to the world at large. This may consist of giving the system away for free and attempting to get a large number of customers. One could also charge a fee for the system, but since we are dealing with general purpose interfaces, this is not so likely as in other areas of software.

My approach is to mine method M1, analysis, until it starts running dry. I believe it yields more meaningful results for general purpose interfaces than method M2, small scale implementation. It is also cheaper.

In the case of RCODE there is enough analysis work for at least one thesis, and this thesis contains only analysis.


Context

Programming languages take years to develop and popularize, and then last perhaps decades, so:

C1: Our interfaces are for use primarily in the coming decades.

We have a strong interest in what should be, rather than what is. This is an appropriate attitude for a PhD thesis, and it causes no problems on the software side, for that is under our control. But hardware is a different matter. We need to develop the art of getting computer manufacturers to put special hardware into every computer sold to high school students.

A hint on how to proceed comes from the current relationship of RISC to CISC computers. Digital Equipment Corporation is phasing out production of their VAX computers. Instead they emulate the VAX on their new RISC ALPHAs. This emulation technology actually consists of compiling binaries most of the time, and typically has a modest inefficiency factor [Dig]. Thus an ALPHA runs VAX code at a substantial fraction of its native MIPS rating, which makes it a very fast VAX.

Therefore our method will be to design software that will run with some inefficiency factor, hopefully a low one, on existing hardware, and will run optimally, without inefficiency compared to current software, if new hardware is added to current computers. This consideration only affects RCODE, and not SCODE or VCODE, which are not hardware sensitive.

What must then happen is for the software to become popular before any hardware is built. After the software becomes widespread, computer manufacturers will find it beneficial to make the optimizing hardware.

In order to become popular while being inefficient, RCODE will have to offer capabilities worth the inefficiency. In some cases this may be done by simply making the new capabilities optional additions to existing efficient capabilities, and making the new capabilities as efficient as possible. Then commercial programmers can choose whether to use the new capabilities for a piece of code, or stick with the old capabilities. Usually over half the lines of code in a program are not executed enough to make efficiency a problem, and so there may be a lot of use for advanced capabilities even if they are inefficient. An example of this is the several hybrid LISP/C systems, such as EMACS, in which LISP is used for control, where efficiency is not required, and C is used for operations that must be efficient.

In other cases the new capabilities will not be optional. In these cases they must be worth something significant.

So in summary:

C2: RCODE must be popularized on existing hardware, but may require new hardware to run optimally. RCODE may be inefficient by some factor when run without new hardware, but should be fast enough to become popular.

C3: An inefficient feature in RCODE must either be optional, or must have worth sufficient to overcome the adverse value of its inefficiency. Optional features must interoperate well with standard features.

Hardware Trends

Since I am attempting to develop programming languages for the future, in particular for the C1 time frame, we must pay attention to the likely development of hardware in this time frame. By this I mean we must consider current hardware trends, not new hardware one might hope for. We are interested mostly in personal computer hardware: what every high school student will have.

From the point of view of programming languages, computer hardware is changing more rapidly now than at any time in the last several decades. Three major themes are increased parallelism, the memory wall, and increased network speed.

The increase in parallelism means that sequential execution is no longer a viable assumption, even for personal computers. Some consequences for our C1 time frame are:

H1: Even personal computers are switching to superscalar processors, which simultaneously execute two or four instructions [E+]. A superscalar processor needs to execute instructions out of order so that it can keep its multiple pipelines full of useful work.


H2: Shared memory multiprocessors have made their appearance for desktop workstations. I expect to see several processors per personal computer in our time frame.

The memory wall is the name given in [WK] for the phenomenon that processor speeds are increasing at a much higher rate per year than DRAM memory speeds. The faster existing microprocessors, such as the ALPHA, are already suffering secondary cache miss times of many instructions [F+], with corresponding adverse impact on timing [Sit, Appendix A]. Some consequences for our time frame are:

H3: Organizing lists using small discontiguous CONS cells is becoming very inefficient due to large cache miss times, and so lists will probably be represented in the future by variable length vectors, analogously to character strings.

H4: Subroutine calls will have a high overhead, and routines should be inlined much more frequently. Documentation for the ALPHA already advises that sufficiently short routines should be inlined [Sit, Appendix A].

The memory wall also provides an additional reason for wanting to reorder instructions, as per H1 above. Because memory is becoming slower relative to processors, processors need memory read instructions to be executed as early in the instruction stream as possible.

The increase in network speed is indicated by the progression from Ethernet to FDDI to ATM. Inexpensive board-mounted lasers with gigabit per second speeds are becoming available. The latency limit for getting from one side of a building to another at the speed of light is of the order of a microsecond, and this is likely the only limit on network communications that will be operative during our time frame. We expect to see within-building networks evolve somewhat in the direction of shared memory buses, though the latency will be on the order of many instruction executions, and networks will have to operate somewhat differently from normal shared memory because of this.

Although I do not expect high school students to have very high speed network connections at home, even in our time frame, they will have such in school. Therefore the consequences for us are:


H5: A large number of processors, connected by networks capable of paging in an object in a modest number of instruction execution times, will sometimes be available. These should at least be usable by parallel compilers and text processors.

Of course there will be other hardware developments, such as networks that replace the current phone system. But these are not high performance, so they have little impact on RCODE, and I will ignore them.

RCODE Feature Selection

RCODE defines a virtual computer such that the assembly language for this virtual computer is a good target language for compilers, but still machine independent. The purpose of RCODE is to get everyone to standardize on a common garbage collector and some similar very capable features. The next question is: what exactly should these very capable features of RCODE be?

Memory Management

Clearly one feature should be garbage collection. However, when I thought about trying to get commercial programmers in general to accept garbage collection, as per the goal G, I became uncomfortable. Garbage collection has not been sellable to that bunch for years. The most commonly given reason is the desire to eliminate pauses, but the technology to eliminate pauses has been around for many years.

I concluded that commercial programmers would not want to buy into a memory manager that did not permit manual deletion via the good old fashioned delete operation. There are many instances in programming when you know something should be deleted, and you do not want to have to track down and kill all the pointers to it. However, once an object is manually deleted, use of any dangling reference to it should be flagged as an error, especially when students are modifying code. So in this sense manual deletion should be strongly typed.

I also concluded that the ability to move an object at any time is an extremely strong selling point in a memory manager. One use of this is to compact memory very slowly, to avoid interfering with the time response of the application code. Another use is to expand a table when needed.

Lastly, I feel that commercial programmers will want the memory manager to have good timing characteristics and perform in real time when necessary.

Thus I come to our first RCODE features:

F1: RCODE will have a memory management system with automatic garbage collection, a strongly typed manual delete operation, and an operation to move an object at any time.

F2: RCODE will have a memory management system with timing characteristics suitable for real-time applications.

For the last few decades, long term data has been stored in files. But this is changing, and in the future data will be stored in databases that store objects which point at each other. When such an object oriented database is input to main memory, there are two problems. First, all the object pointers must be adjusted to point where the objects land in memory. Second, objects must not be read until they are needed, as the entire database will often be much too big to put into main memory.

The operation of adjusting pointers in this situation goes under the name of swizzling the pointers. Since objects are not read into memory until they are needed, swizzling pointers must be delayed until the pointers are used. Thus memory management gets involved, and we are led to our next feature:

F3: RCODE will have a memory management system that supports delayed swizzling of pointers, so external databases can be input efficiently into main memory.

The chapter on memory management proposes ways of providing the above features.

Data Types

The next issue I faced, after garbage collection, was what aspects of data types should be present at run time and directly supported by RCODE. Associated with every type of object must be some function to perform garbage detection activities. Specifically, there must be a scavenger function for every type that discovers the pointers in any object of that type, and also discovers, for each pointer, the type of the object it points at.

Modern object oriented languages associate with every type of object a list of functions that can do various things to the object, such as print it or display it graphically. In the high performance object oriented languages, such as C++, this list is organized as a vector, and the compiler always knows at compile time the index of the element it wants from the vector to perform a given operation.

We call such a vector of information about a type a type map, and we make type maps visible in RCODE. The garbage collector is our first customer for type maps, but there are other important uses.

In modern programming languages it has become important to distinguish between the type a routine thinks an object has, which we call the object's formal type, and the type the object actually has, its actual type. Type maps may be used to make objects of different actual types appear to have the same formal type. Then one routine can be used on many actual object types.

There are many uses for this. For example, programs written at one time can be applied to data defined at a different time, if a type map can compensate for the differences between what the program expects and the actual data format. Or polymorphic routines can be written to handle any type of object that has certain components and operations.

There are two reasons for making type maps part of RCODE. The first is to be sure that different programming languages can interoperate. The way type maps are created and passed between routines needs to be standardized, and certain components of type maps, such as the scavenging functions mentioned above, need to be standardized. Other components could, however, be language dependent.

The second reason for making type maps part of RCODE is efficiency. The functions specified by type map components should often be inlined. Also, many of these functions reduce to a single load or store instruction, so that the only information really in the type map is the displacement, size, and numeric type of a component. Special hardware can make type maps containing this kind of information efficient, but before such hardware is built, software techniques are needed. One such technique uses dynamic case statements that switch on the type map input to a particular piece of code, and dynamically compile new cases when the case statement sees a new type map.

Thus the feature we want RCODE to support is:

F4: RCODE will support type maps, and will efficiently handle execution of very small functions selected from a type map, and also efficiently handle loads and stores in which only displacement, numeric type, and size are stored in the type map.
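As a concrete illustration of F4, the following toy Python model shows a type map as a vector of (displacement, size, numeric type) entries, letting one routine operate on two different actual layouts that present the same formal type. All names and representations here are illustrative, not part of any RCODE definition.

```python
# A type map is modeled as a vector of component descriptors; the routine
# knows at "compile time" the index of the component it wants (here, index 0
# is the formal component x). An object is modeled as a flat list of cells.

def make_type_map(entries):
    # Each entry is (displacement, size, numeric_type).
    return list(entries)

def load(obj, type_map, index):
    """Load the formal component at `index` using the type map."""
    disp, size, _ntype = type_map[index]
    return obj[disp:disp + size]

# Two actual layouts exposing the same formal type {x, y}:
layout_a = make_type_map([(0, 1, "int"), (1, 1, "int")])  # x at 0, y at 1
layout_b = make_type_map([(2, 1, "int"), (0, 1, "int")])  # x at 2, y at 0

obj_a = [10, 20]
obj_b = [20, 99, 10]

def get_x(obj, tmap):
    # The same routine works on both actual types via their type maps.
    return load(obj, tmap, 0)[0]

assert get_x(obj_a, layout_a) == get_x(obj_b, layout_b)
```

In a real implementation the `load` call would reduce to a single load instruction once the type map is known, which is exactly the case the dynamic case statement technique above dynamically compiles.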

Just as the map from an actual object to a formal object (i.e. the object as seen by a routine) is encoded in a type map, a linear map from a list of subscripts to an address within an array is encoded in a map we call an array descriptor. One of the things I learned during my decades writing glue software is that array descriptors are the right way to treat array accesses. This is because there are many instances when pieces of an array need to be treated as arrays in their own right. Thus:

F5: RCODE will support array descriptors, treating them as first class data, except that they may not be modified once created.
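A minimal sketch of the array descriptor idea, with illustrative names: the descriptor is an immutable linear map from subscripts to an address, and a row of a two-dimensional array can be treated as a first class one-dimensional array in its own right.

```python
from collections import namedtuple

# An immutable descriptor: address = offset + sum(stride_i * subscript_i).
ArrayDescriptor = namedtuple("ArrayDescriptor", "offset strides")

def address(desc, subscripts):
    return desc.offset + sum(s * i for s, i in zip(desc.strides, subscripts))

def row(desc, i):
    """Treat row i of a 2-D array as a 1-D array in its own right:
    a new descriptor is created; the old one is never modified."""
    return ArrayDescriptor(desc.offset + desc.strides[0] * i,
                           desc.strides[1:])

storage = list(range(12))            # a 3x4 array stored row-major
a = ArrayDescriptor(0, (4, 1))

assert storage[address(a, (2, 3))] == 11
r = row(a, 2)                        # third row as its own array
assert storage[address(r, (0,))] == 8
```

Because descriptors are immutable, sharing a sub-array never invalidates the parent descriptor, which is what makes them safe to treat as first class data.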

Lastly, LISP and other languages permit use of tagged data, whose type is not known at compile time. There are times when this is indispensable, as when LISP S-expressions are used to describe linguistic information. Therefore:

F6: RCODE will support tagged and untagged data, and an efficient transition between the two kinds of data.
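The tagged/untagged distinction of F6 can be sketched as follows; the `tag` and `untag` helpers are hypothetical illustrations of the transition, not RCODE operations.

```python
# A tagged value carries its type at run time; an untagged value relies on a
# compile-time-known type. The transition from tagged to untagged checks the
# tag once and then strips it, so later code can use the raw value directly.

def tag(type_name, value):
    return (type_name, value)

def untag(tagged, expected_type):
    t, v = tagged
    if t != expected_type:
        raise TypeError("expected %s, got %s" % (expected_type, t))
    return v

x = tag("int", 42)
assert untag(x, "int") == 42       # efficient transition: one check, no tag
```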

The chapter on data types proposes ways of providing these RCODE data type features.

Execution Flow

We are free to develop completely new languages, and thus escape the baggage of the past. Two problems suggest our languages should be more functional.

First, as indicated by H1 above, compilers need to be able to reorder instructions to get hardware efficiency. Functional languages are insensitive to order of execution, as long as computations terminate.


Second, in order to make it possible for high schoolers to understand programs well enough to change them, we need debuggers that do a very good job of displaying what a program is doing. Functional languages are single assignment languages, in which each variable gets only a single value during its lifetime, and this makes it easier to understand the program and to display what the program is doing.

Functional languages can handle most programming tasks easily and efficiently. But on the other hand, they usually cannot express everything required in a single program. They cannot handle objects and arrays with changing state. They have trouble with algorithms that require state maintenance, such as histogram computation or graph tracing.

The extra efficiency one might get from functional languages derives from the fact that they can be executed in dataflow order: each primitive operation (e.g. add) can execute whenever its inputs are ready. To maximize the parallelism this permits, special hardware is needed to detect when the inputs to an operation are ready; see for example the work of Burton Smith [Smi, ACC+] and Arvind [Nik, Pap]. But RCODE must run efficiently without special hardware.

Therefore RCODE uses a compromise. It is possible to compile a functional language efficiently for a sequential computer as long as all the variables are register-like, in that they are local to a routine, cannot be aliased at runtime using pointers, and therefore the compiler can figure out at compile time when they will receive their values. But variables that are global, or are array elements addressed by general subscripting, cannot be handled efficiently using dataflow execution order on existing computers. So the compromise is:

F7: RCODE will be a functional dataflow language with register-like variables that treats RAM memory as an I/O device.

RCODE gets its name, register code, from this feature, which is investigated in the chapter on execution flow.

Shared Object Memory

As mentioned above in H5, sometimes student computers will be connected in a building-wide network capable of paging in an object in a modest number of instruction execution times. Some means of easily using this feature to write parallel compilers, text processors, simulators, calculators, and games is needed.

The memory of the computers on the network becomes, with the help of the network, a kind of disk-like memory that can store objects and retrieve them far faster than current disks. The question is what kinds of operations can be performed on objects in this shared object memory.

First, many objects will turn out to be read-only after they are created, just as many files are. The operations on these are creation, reading an object, and reading part of a large object.

For functional programming it is also convenient to have objects that go through an initial write-only phase, during which many processes write them but cannot read them. Then the objects are switched to read-only, after which they cannot be written. A histogram is similar, but undergoes an accumulate-only phase, during which many processes may add to its elements. Then when it is finished it is switched to read-only.

This thinking leads to the following:

F8: RCODE will support a shared object memory in which individual object components can be marked as read-only, write-only, write-once, or accumulate-only as part of the component type. Also, objects may change types dynamically, so their components can, for example, switch from write-only to read-only.

However, this approach is not always enough, so:

F9: RCODE shared object memory will support atomic multi-object transactions.
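The access-mode phases of F8 can be sketched for the histogram example above; the class and method names are illustrative only, not RCODE operations.

```python
# A histogram passes through an accumulate-only phase (many writers may add,
# none may read) and is then dynamically switched to read-only, after which
# it may no longer be written.

class Histogram:
    def __init__(self, nbins):
        self.bins = [0] * nbins
        self.mode = "accumulate-only"

    def add(self, i):
        if self.mode != "accumulate-only":
            raise RuntimeError("object is read-only")
        self.bins[i] += 1          # accumulation is allowed in this phase

    def finish(self):
        self.mode = "read-only"    # dynamic type change of the components

    def read(self, i):
        if self.mode != "read-only":
            raise RuntimeError("cannot read during accumulate-only phase")
        return self.bins[i]

h = Histogram(4)
h.add(1)
h.add(1)
h.add(3)
h.finish()
assert h.read(1) == 2
```

Because accumulation is commutative, many processes may safely add during the first phase without coordinating reads, which is the point of the accumulate-only mode.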

Special hardware will be necessary to make a real networked shared object memory, but this is not available, even approximately, today. However, shared memory symmetric multiprocessors have been available for some time, and are working their way toward student computers (see H2). This leads to:

F10: RCODE shared object memory will be efficient on existing symmetric multiprocessors, but may require specially designed hardware for efficiency on a within-building network.


The chapter on shared object memory explores these features.

Summary of the Thesis

The remainder of the thesis consists of four chapters, matching the four groups of features just introduced. Each of these chapters may be read independently of the others.

There is also a separate RCODE Architecture Description [Wal], which maps the ideas of this thesis onto a specific architecture. However, this thesis and the Architecture Description are independent of each other.

Some Related Work

Each of the following chapters cites work related to the topic of the chapter. But because the chapters are on ambitious advanced features, many historical efforts to provide student usable languages and UNCOLs are not listed. Therefore we list some of these briefly here.

Some of the more popular student oriented languages are BASIC, SCHEME, and MATLAB. Others are SMALLTALK and MATHEMATICA. Some more recent attempts in these directions are EIFFEL [Mey] and SATHER [Int], which develop type dispatching ideas related to type maps; SELF [U+], which develops dynamic compilation ideas; and DYLAN [App], which combines strong typing and LISP.

The original UNCOL proposal is described in [Maca, CPW]. Some efforts over the years to build UNCOL languages are JANUS [CPW, HW] and ANDF [Macb, Def], which are machine independent languages with approximately the same capabilities as C, but with lower level data types and only basic operations.

Chapter

Memory Management

Goals

The main goal of this chapter is to:

G: Find a design for a universal memory manager that everyone can share.

The main motivation for this goal is that it is very hard for two pieces of code to interoperate if they each assume a different memory manager. Thus, however difficult our goal may be, there is considerable value in reaching it.

The method for achieving this goal is to examine all the reasonable alternative ways of building a memory manager that might meet the requirements of the next section, and pick the alternatives that seem best. There are quite a few alternatives, leading to a lengthy analysis.

Requirements

The RCODE memory manager is to support new programming languages meeting the requirement:

G: A high school student knowing a bit of algebra and geometry can investigate and change commercially written games, editors, and simulators.


Clearly one memory management feature should be garbage collection. However, when I thought about trying to get commercial programmers in general to accept garbage collection as per G, I became uncomfortable. Garbage collection has not been sellable to that bunch for years. The most commonly given reason is the desire to eliminate pauses, but the technology to eliminate pauses has been around for many years.

I concluded that commercial programmers would not want to buy into a memory manager that did not permit manual deletion via the good old fashioned delete operation. There are many instances in programming when you know an object should be deleted, and you do not want to have to track down and kill all the pointers to it. The object may have substantial memory resources, and you do not want to wait until an automatic garbage detector has run before these are recovered and reused. However, once an object is manually deleted, use of any dangling reference to it should be flagged as an error, especially when students are modifying code. So in this sense manual deletion should be strongly typed.

After being manually deleted, memory would be returned to some free list for immediate reuse. Users would be able to design their own free memory lists and memory allocators.

For example, an object may be deleted when it ceases to be in some directory and is not open. In this case there may be one reference count for the directory and another for being open, and the object is deleted when both counts go to zero. The directory count may be used separately to initiate cleanup action.

Or, as another example, a communication system may use fixed size message buffers that are allocated from queues of free blocks of memory of the appropriate size, and returned to these queues as soon as they are no longer needed. In this case there might only be one reference count: the number of queues a buffer is on, plus the number of times the buffer has been opened by processes.

In general, users of the memory manager, and not the manager itself, should maintain reference counts. Compilers may automate this where appropriate.
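The directory example above can be sketched as follows, with user-maintained counts and a stand-in for the manual delete operation; all names are hypothetical.

```python
# One reference count for directory membership, one for being open; the
# object is manually deleted only when both reach zero. The `deleted` flag
# stands in for the RCODE manual delete operation.

class DirectoryEntry:
    def __init__(self):
        self.dir_count = 0
        self.open_count = 0
        self.deleted = False

    def _maybe_delete(self):
        # The user's code, not the memory manager, decides when to delete.
        if self.dir_count == 0 and self.open_count == 0:
            self.deleted = True

    def unlink(self):
        self.dir_count -= 1
        self._maybe_delete()

    def close(self):
        self.open_count -= 1
        self._maybe_delete()

e = DirectoryEntry()
e.dir_count = 1
e.open_count = 1
e.unlink()               # removed from the directory, but still open
assert not e.deleted
e.close()                # last reference gone: delete immediately
assert e.deleted
```

Strong typing of the delete then means that any later use of a pointer to `e` would be trapped as an error, rather than silently reusing the memory.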

I also concluded that the ability to move an object at any time is a strong selling point in a memory manager. One use of this is to compact memory very slowly, to avoid interfering with the time response of the application code. Another use is to expand a table when needed.

Lastly, I feel that commercial programmers will want a memory manager with good overall timing characteristics that can perform in real time when necessary.

Thus I come to our first RCODE memory management features:

F1: RCODE will have a memory management system with automatic garbage collection, a strongly typed manual delete operation, and an operation to move an object at any time.

F2: RCODE will have a memory management system with timing characteristics suitable for real-time applications.

The other memory management feature concerns object oriented databases, which I expect to be used increasingly in the future. When such an object oriented database is input to main memory, there are two problems. First, all the object pointers must be adjusted to point where the objects land in memory, an operation called swizzling the pointers. Second, objects must not be read until they are needed, as the entire database will often be much too big to put into main memory.

Since objects are not read into memory until they are needed, swizzling pointers must be delayed until the pointers are used. Thus memory management gets involved, and we are led to our last memory management feature:

F3: RCODE will have a memory management system that supports delayed swizzling of pointers, so external databases can be input efficiently into main memory.

This chapter is organized into four parts. The first part contains definitions. The second part discusses two level addressing, my solution to the need for strongly typed manual deletion and dynamic object movement. The third part discusses how to structure garbage detection in order to minimize timing impact on application processes, and introduces a bit string AND write barrier test to this end. The fourth part discusses other overheads of automatic garbage collection.

Below I will reference other work that impacts on specific parts of my analysis. Some more general surveys of memory management are Wilson's modern survey of garbage collection [Wil], the introduction to Hayes's recent thesis [Hay], Cohen's survey of garbage collection [Coh], Cohen and Nicolau's survey of compaction [CN], and Hickey and Cohen's analysis of performance [HC]. General references can be found in the surveys just cited.


Definitions

In this section I will define an abstract model of memory management suitable for investigating the design of the RCODE memory manager. As befits an abstract model, I omit details which I do not believe will affect the outcome.

Memory and Reachability

Memory is a sequence of objects, free blocks, and gaps. A gap is an unimplemented piece of memory; a free block is a piece of memory that is free to be allocated to objects. Each object, free block, or gap has a nonzero positive integer length, and each has an address, which equals the sum of the lengths of the previous objects, free blocks, and gaps in memory.
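This address invariant can be illustrated directly; the representation below is a toy model, not an RCODE structure.

```python
# Memory as a sequence of pieces (objects, free blocks, gaps): each piece's
# address is the sum of the lengths of the pieces before it.

pieces = [("object", 16), ("free", 8), ("gap", 32), ("object", 24)]

def addresses(pieces):
    addr, result = 0, []
    for kind, length in pieces:
        assert length > 0          # lengths are nonzero positive integers
        result.append(addr)
        addr += length
    return result

assert addresses(pieces) == [0, 16, 24, 56]
```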

Objects contain pointers to other objects. A pointer to an object is, in effect, the address of the object. The different places where pointers can be stored within an object are called pointer components. A pointer component may be empty, meaning it contains no pointer; such a pointer component is said to be null. Two pointer components in the same object may not overlap.

An object is said to be reachable if it is one of a particular set of objects called the root set, or if it can be reached from the root set by following the pointers in objects. In order to be accessed an object must be reachable, and furthermore, once an object becomes unreachable, it stays unreachable. Unreachable objects may therefore be garbage collected.

Marking, Scavenging, and Sweeping

Marking is the process of putting a mark on each object that is reachable, and leaving the mark off of most objects that are unreachable. To this end, each object may be thought of as having a marked bit, which is set if the object is marked and clear otherwise. To mark an object is to set its marked bit.

To scavenge a pointer component is to mark the object pointed at by the component, if the component is not null and the object pointed at is not already marked. (Sometimes this is called scanning the component.) To scavenge an object is to scavenge all its pointer components. All objects that have been marked must be scavenged, so it is convenient to think of each object as having a scavenged bit, which is set when the object is scavenged and clear otherwise. There are questions of timing in concurrent algorithms: do you set the scavenged bit just before or just after scavenging an object?

To sweep means to find all unmarked objects and turn them into free blocks. To sweep an object means to turn the object into a free block if and only if its marked bit is off.
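These definitions can be combined into a minimal, non-concurrent mark-scavenge-sweep sketch; the heap representation here is illustrative only.

```python
# Objects are modeled as dicts with a list of pointer components; the
# marked/scavenged bits follow the definitions in the text.

def collect(heap, root_set):
    for obj in heap:
        obj["marked"] = obj["scavenged"] = False
    for r in root_set:
        r["marked"] = True                     # mark the root set
    progress = True
    while progress:                            # scavenge every marked object
        progress = False
        for obj in heap:
            if obj["marked"] and not obj["scavenged"]:
                obj["scavenged"] = True
                for target in obj["pointers"]:  # scavenge each component
                    if target is not None and not target["marked"]:
                        target["marked"] = True
                progress = True
    # Sweep: unmarked objects become free blocks (here, simply dropped).
    return [obj for obj in heap if obj["marked"]]

a = {"pointers": []}
b = {"pointers": [a]}
c = {"pointers": []}                 # unreachable
live = collect([a, b, c], [b])
assert a in live and b in live and c not in live
```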

Mutators and Frames

A mutator is an application program process that is trying to do useful work, and for whose benefit the garbage collector algorithm is being run. There may be any number of mutators running on any number of hardware processors. Each mutator has a stack, which is a sequence of objects called frames. The frame at the top end of the stack is called the top frame. A mutator may perform the following actions:

Create a new object, all of whose pointer components are null, and write a pointer to this new object into the top frame.

Write a pointer into a pointer component. The pointer written must be the value of a pointer component in the top frame, and the object containing the pointer component to be changed must be a non-frame object pointed at by the top frame.

Read a pointer from a pointer component into the top frame. The pointer component must be in a non-frame object pointed at by the top frame, and the value of that pointer component will be written into the top frame.

Loadroot load a p ointer to a nonframe ro ot ob ject into the top frame

Reference an ob ject Some nonp ointercomponent part of the ob ject is read or

written The ob ject must b e a nonframe ob ject p ointed at by the top frame

2

In the literatureWil the marked and scavenged ags are often represented by assigning one

of three colors to an ob ject white unmarked and unscavenged grey marked and unscavenged black marked and scavenged

CHAPTER MEMORY MANAGEMENT

Push a new frame into the top of the stack The new top frame is created as

a new ob ject with null p ointer comp onents Then some p ointer comp onents of

the previous top frame may b e copied into the new top frame

Pop the top frame from the stack The top frame must not b e reachable after

it has b een p opp ed Some p ointer comp onents of the top frame b eing p opp ed

may b e copied into the frame that b ecomes the new top frame

Objects, including frames, may not contain pointers to frames.

The frames in every mutator stack are part of the root set.

A stack may only be changed by its own mutator, and that mutator may only change the top frame of its stack.

In the RCODE virtual computer, mutator stacks may share frames that are not top frames, and so the abstract model of this chapter does not exactly fit the model of RCODE in other chapters. But as stated above, I have intentionally omitted this and other extra complexities from my memory model, because they make no essential difference to the analysis of this chapter.

Copying and Spaces

To copy an object means to move it in memory, changing its address. To compact memory means to copy objects so as to glue free blocks together to make larger free blocks. Objects may be copied for reasons other than compacting memory; they may be moved because their size has increased and they cannot expand where they are.

A space is a subset of memory such that every object and every free block is either completely outside or completely inside the space. Typically a space is a contiguous region of memory addresses, but it may also be a set of pages.

Objects are sometimes copied between spaces. Some spaces may be memory accessible by only some processors in a multiprocessor system. Database files that are not actually in memory may be viewed as if they were a space in memory for our purposes. Therefore a mutator may not be able to reference or mutate objects in some spaces, and the objects may have to be copied to another space in order for the mutator to access them.


A copying collector is a kind of garbage collector that works by copying all reachable objects in one space, called the from-space, into a second space, called the to-space. After all reachable objects are copied, the from-space is freed. This kind of garbage collector naturally compacts memory.

Forwarding

Consider the situation where an object has been copied one or more times, and there is a single most recent copy of the object, along with possible older copies. In this situation, which I call the forwarding scenario, it is desirable to adjust each pointer component pointing at the object to point at the most recent copy. This adjustment is called forwarding the pointer component. A pointer that points at the most recent copy of the object it points at is said to be forwarded.

As an aid to forwarding, a pointer to the new location of a copied object is often left at the previous address of the object.

The question arises: which copy of an object does a mutator read or write? If the mutators all access only the most recent copy of an object, I say the memory management system employs eager forwarding. If the mutators may read a non-recent copy of an object, and must write all copies of the object identically, I say the memory management system employs lazy forwarding.

In an eager forwarding system, pointers may be forwarded when they are read into the top frame. In many systems, eager or lazy, pointer components are forwarded when they are scavenged.
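A toy model of the forwarding scenario, assuming a forwarding pointer is left at each old copy (the representation is illustrative): reading a stale pointer chases forwarding pointers to the most recent copy, as an eager forwarding system would when reading into the top frame.

```python
# Each object copy has a "forward" slot; copying leaves a forwarding pointer
# at the old copy, and forward() chases these to the most recent copy.

def copy_object(obj):
    new = {"data": obj["data"], "forward": None}
    obj["forward"] = new           # leave forwarding pointer at old address
    return new

def forward(ptr):
    """Chase forwarding pointers to the most recent copy."""
    while ptr["forward"] is not None:
        ptr = ptr["forward"]
    return ptr

old = {"data": 7, "forward": None}
mid = copy_object(old)
new = copy_object(mid)             # the object has been copied twice
assert forward(old) is new         # a stale pointer reaches the newest copy
assert forward(old)["data"] == 7
```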

Swizzling

Consider the situation where there are two spaces, one inaccessible to mutators and one accessible to mutators. The inaccessible space has other desirable properties: it might be permanent disk storage, for example, or it might be accessible to mutators other than those we are considering. Objects may have copies in either space, or in both, and may be copied back and forth between the spaces. However, it is desired that an object copy in one space only point at object copies in the same space.

(Storing the address of the most recent copy at the previous address of an object is sometimes called forwarding the object, but I carefully avoid this terminology, since in my scheme it might mean instead that all pointer components in the object were forwarded.)

In this situation, which I call the swizzling scenario, adjusting a pointer component in one space to point at a copy of its target object in the same space is called swizzling the pointer component. To swizzle all pointer components in an object is called swizzling the object. A pointer stored in one space that points to the same space is said to be swizzled.

An object that has a copy in one space may have an empty place reserved for it in the other space. Such an empty place is called an object slot.

When a pointer is swizzled, a check is made to see if the object pointed at has already been copied, and if yes, the pointer is swizzled to point at this copy. If no, a slot is allocated for the object, and the pointer is swizzled to point at this slot.

In the swizzling scenario it is necessary to delay copying objects from the inaccessible space to the accessible space until the objects are actually accessed by the mutators, because if all objects were copied there would be too many copies. There would even be too many copies if all objects pointed at by a copied object were copied, for then all objects reachable from the root of a database would be copied when the root was copied. In order to delay copying, it is necessary to delay swizzling.

Pointer components might not be swizzled until they are used to access the objects they point at. This is called demand pointer swizzling. To implement this, something must be done to distinguish unswizzled pointers from swizzled pointers.

Alternatively, all the pointer components of an object may be swizzled when the object is first accessed, with the swizzled pointers possibly pointing at empty slots into which other objects will be copied. This is called demand object swizzling. In this situation the act of using a pointer to access an object will cause the object to be copied if the pointer pointed at an empty slot, and will cause the object to be swizzled if the pointer pointed at an unswizzled object. To implement this, something must be done to distinguish empty slots and unswizzled objects from swizzled objects.
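To make the distinction concrete, here is a minimal C sketch of one common way to implement the demand pointer swizzling check: a tag bit in the pointer word marks unswizzled pointers. The tag convention, the translation table, and all names are illustrative assumptions, not part of R-CODE.

```c
#include <stdint.h>

/* A pointer component is a machine word.  Its low bit, when set, marks
   it as unswizzled: an identifier in the external space rather than a
   real in-memory pointer.  (Real pointers are word aligned, so their
   low bit is zero.)  All names here are hypothetical. */

#define UNSWIZZLED_TAG 1u

typedef uintptr_t pword;

static int is_swizzled(pword p) { return (p & UNSWIZZLED_TAG) == 0; }

/* Fake translation table standing in for the external space: entry i
   holds the in-memory "slot" allocated for external object i. */
static void *table[16];

/* Swizzle one pointer component on demand: if it is already swizzled,
   return it unchanged; otherwise allocate a slot for its target (if
   none exists yet) and return a real pointer to that slot. */
static pword swizzle(pword p) {
    if (is_swizzled(p)) return p;
    uintptr_t id = p >> 1;                      /* strip the tag bit   */
    if (table[id] == 0) table[id] = &table[id]; /* allocate a "slot"   */
    return (pword)table[id];                    /* now a real pointer  */
}
```

A dereference would call swizzle first and store the result back into the pointer component, so the check is paid only on the first use.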

Swizzling and forwarding should not be confused with each other. The same memory management system might implement each by a different method. The two level addressing scheme used in R-CODE does exactly this.

4. A history of swizzling is given in [SKW].


Manual Deletion and Type Change

Manual deletion refers to deleting an object, and most of its memory, in response to an operation executed by a mutator. Strongly typed manual deletion requires that all accesses to a manually deleted object be trapped as errors. Strongly typed manual deletion of an object is conceptually similar to copying the object into an inaccessible region of memory and instantly forwarding all pointers to the object.

Type-change refers to changing the type of an object in a strongly typed system in response to an operation executed by a mutator. For example, a write-only object might be made read-only. Strongly typed type-change requires that the addresses used to access the object under the old type be immediately invalidated and that the object be given a new address. This is similar to manually deleting the object and allocating a new object that is a copy of the original. Attempts to access the object using its old address result in detected errors, just as in the case of strongly typed manual deletion.

Conservative Pointer Components

Sometimes it is not known which parts of an object are pointer components, though it is possible to list all places in the object that might be pointer components. Places that might be pointer components are called conservative pointer components. For purposes of finding all reachable objects, knowing all conservative pointer components may suffice if one can tell a valid object pointer from random data. What one does is try to interpret each conservative pointer component value as a pointer to an object; if this fails, the value is ignored, while if it succeeds, the object pointed at is declared to be reachable.

A value might be interpretable as a pointer and yet not really be a pointer. Thus an unreachable object might accidentally be declared to be reachable. More importantly, one must never forward or swizzle a conservative pointer component.

A number of papers [Zor] have been written on conservative garbage collectors, which are just garbage collectors that use conservative pointer components in lieu of knowing where all the pointer components are. However, there does not seem to be any reason why R-CODE should not be able to correctly identify pointer components, so I will not mention conservative collectors or conservative pointer components again in this thesis.


Two Level Addressing

Feature F1 requires strongly typed manual deletion and instant forwarding of pointers to objects whenever the objects are moved. The only way to provide strongly typed manual deletion seems to be to associate a trap flag with each object, such that when the trap flag is set, all accesses of the object are trapped. Similarly, when an object is copied, the only way to immediately forward all pointers to the object seems to be to make all accesses of the object use indirect addressing through a table location that holds the current address of the object.

These ideas lead to the scheme I call two level addressing, which seems to be the only way of satisfying the requirements of F1. Two level addressing provides a trap flag that can also be used to provide demand object swizzling, as required by feature F3.

In the two level addressing scheme, pointers contain an object number and a within-object byte displacement. The object number specifies an entry in a table called the object map, and this entry holds the base address of the object (see the Addressing Data figure). The base address and the within-object displacement are added to get the address of the byte accessed. The object map entry also contains a trap flag that may be set to trap all references to the object.

To move an object dynamically, a process does the following:

1. Set the trap flag of the object to stop all processes accessing the object.
2. Move the object.
3. Update the object map entry to point at the new address of the object.
4. Clear the object trap flag.
5. Restart any processes that stopped because they tried to access the object.
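The resolution of a pointer through the object map, and the move sequence just described, can be sketched in C. This is an illustrative single-threaded model, not the R-CODE implementation: all names and layouts are assumptions, and stopping and restarting accessing processes is reduced to returning a null address.

```c
#include <stddef.h>

/* Two level addressing sketch: a pointer is (object number, byte
   displacement); the object map entry holds the object's current base
   address and a trap flag.  Illustrative names only. */

typedef struct {
    char *base;   /* current byte address of the object */
    int   trap;   /* set to trap all references to the object */
} map_entry;

typedef struct { unsigned objnum; size_t disp; } rpointer;

static map_entry object_map[64];
static int trapped;   /* counts trapped accesses */

/* Resolve a pointer to a byte address, trapping if the flag is set. */
static char *resolve(rpointer p) {
    map_entry *e = &object_map[p.objnum];
    if (e->trap) { trapped++; return NULL; }  /* would stop the process */
    return e->base + p.disp;
}

/* Moving an object updates only its map entry; stored pointers, being
   (object number, displacement) pairs, are unchanged. */
static void move_object(unsigned objnum, char *new_base) {
    object_map[objnum].trap = 1;   /* 1: stop accessors              */
    /* 2: ... copy the object's bytes to new_base ...                */
    object_map[objnum].base = new_base;  /* 3: update the map entry  */
    object_map[objnum].trap = 0;   /* 4: clear the flag              */
    /* 5: restart any stopped accessors (elided in this sketch)      */
}
```

Note that after a move, an old pointer resolves to the new location with no change to the pointer itself; only the map entry was rewritten.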

Strongly typed manual deletion of an object merely requires setting the trap flag in the object map entry and deleting the memory used by the object, without deleting the object map entry. The object map entry is automatically collected by a mark and sweep garbage collector at some later time.

Stopping processes that try to access a moving object is not optimal. It is possible to stop only processes that write an object, as long as the source and destination copies of the object do not overlap. If processes that write an object write both copies of the object while the object is being copied, it is possible to copy the object incrementally, stopping writing processes only while copying each increment (to avoid write-copy conflicts) and running writing processes for a while between copy increments.

[Figure: Addressing Data. A pointer consists of an object number and a within-object byte displacement. The object map is a vector of object map entries; the object number is the index of an entry, and each entry holds the byte address of its object together with its trap flag (and other per-entry bits). A load or store resolves a pointer by adding the entry's byte address to the pointer's byte displacement; an address register holds the object number together with the resulting within-object byte address.]

The trap flag can also be used to implement the strongly typed type-change operation. The object whose type is being changed is given a new object number with a new object map entry, and the old object number is invalidated by setting the trap flag in its object map entry.

The trap flag can be used to implement demand swizzling by allocating object map entries, with their trap flags set, to serve as slots where objects are to be copied when they are first accessed. When an object that has only such a map entry is first accessed, the trap allocates memory for the object, copies the object from external memory, and swizzles the pointers in the object so that they point either at other objects already copied into memory or at object map entries serving as slots.

Such a system, in which an object map entry serving as a slot has no associated memory for the object itself, approximates demand pointer swizzling, because only a small amount of memory (the object map entry) is allocated for unaccessed objects.5

An alternative system associates with the object map entry of a slot a block of memory sufficient to contain the object, and initiates copying the object into that memory. Then, when the object is first accessed, the trap waits until object copying is finished and then swizzles all pointers in the object. This alternative system implements demand object swizzling.

Related Work and Alternatives to Object Maps

Object maps seem to be the only way to quickly obsolete the old address of an object that has been deleted, and to quickly forward addresses of objects that have been moved. Both the trap flag and address indirection through the object map entry seem necessary for these functions.

Hardware has been built to implement object maps, but no such hardware survives today among common computers.

One of the earliest references to such hardware was the following piece of folklore, which the author ran into many years ago:

A large military system was built on a computer whose core memory was so small that data and program had to be copied constantly between it and magnetic tape. The computer had a very large number of index registers, which were in core memory and could be used to let data and instruction blocks move around in memory freely while the program ran. But using them doubled the memory access time, so the system designers decided not to. They built the system and it ran. They then assigned a small group to rewrite the system using the index registers. As expected, the new system was much smaller and simpler than the original. Unexpectedly, the new system ran faster.

The Intel i432 [Org] was another system that built object maps into hardware.

5. A similar hybrid system is proposed in [VD].


Brooks proposed in [Bro] to implement object maps, without trap flags, in software, by performing the required indirect addressing on almost every memory access. I will describe this work in more detail below.

If quick obsolescence of old addresses is not required, two software systems that do not continually indirect through object maps for reading can be used.

In the Pegasus system of North and Reppy [NR], two or more copies can exist simultaneously, and any copy can be read without further checking. Write operations are made atomic and update all copies. Writes must be atomic to avoid race situations with copying. Obsolete addresses are not completely flushed until the end of the garbage collection cycle.

Atomic write operations take a lot of time because of synchronization overhead, and are not very practical for applications such as updating each element of a large array. So an alternative is proposed by Nettles, O'Toole, Pierce, and Haines [NOPH], in which a complete set of copies of all used objects is produced while the application processes continue to use the originals of the objects. The copies point only at copies, while the originals point only at originals. A write log produced by the application processes is used to update the copies. The system switches over to the copies at the end of a garbage collection. This system cannot handle multiple processors writing to the same object component, because each processor would have an independent, unsynchronized write log.

These software solutions that do not continually indirect through object maps do not support manual deletion, type change, or swizzling; do not support the use of object copying to increase the size of an object; do not permit an object and its copy to overlap; and do not support immediate reuse of memory vacated by a copied object.

Address Registers

In order to get to the point where special hardware for object maps is built, we first need to popularize software that needs this hardware. To do this we need a software approach that does the same thing with usable efficiency. The rest of the discussion of two level addressing will center on the solution to this problem that I am proposing for R-CODE. A different solution of similar quality is described near the end of this discussion.

The key idea is to introduce the address register as a software implementation of an object map cache entry. An address register is closely analogous to a machine register, but has special properties. To access an object, one must load a pointer pointing somewhere within the object into one of the processor's address registers. Address registers can be saved and restored like any other register.

Pointers not stored in address registers are stored in the form of an object number plus a byte displacement. Such pointers do not have to be changed when an object is moved.

Pointers stored in address registers are stored in the form of an object number plus the byte address of a byte within the object (see the Addressing Data figure). When an object is moved, each address register pointing into the object must be updated by adding the offset of the move to the byte address in the register.

When an address register is saved, it is stored as a pointer in the form that contains the displacement. Loading an address register converts a byte displacement to a byte address, and saving the register converts a byte address to a byte displacement. Thus the only byte addresses pointing into an object are in the object's object map entry and in address registers.
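The two conversions can be sketched as follows. The structure layouts and names are illustrative assumptions; the trap flag check on load and the atomicity concerns discussed below are omitted here.

```c
#include <stddef.h>

/* Sketch of the displacement <-> byte address conversions performed
   when an address register is loaded or saved.  Illustrative only. */

typedef struct { char *base; } obj_map_entry;   /* object map entry  */

typedef struct { unsigned objnum; size_t disp; } saved_pointer;
typedef struct { unsigned objnum; char *byteaddr; } address_register;

static obj_map_entry obj_map[16];

/* Load: byte displacement -> absolute byte address
   (trap flag check omitted in this sketch). */
static address_register ar_load(saved_pointer p) {
    address_register a = { p.objnum, obj_map[p.objnum].base + p.disp };
    return a;
}

/* Save: absolute byte address -> byte displacement, so the stored
   pointer stays valid if the object later moves. */
static saved_pointer ar_save(address_register a) {
    saved_pointer p =
        { a.objnum, (size_t)(a.byteaddr - obj_map[a.objnum].base) };
    return p;
}
```

Between a load and a save, the byte address in the register can be stepped directly, which is why addressing successive array elements costs nothing extra.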

An object map entry also contains a trap flag that is checked whenever an address register is loaded, and causes the processor to trap if it is on. The fact that the trap flag is checked only when the address register is loaded means that on a multi-processor system, where one processor may set a trap flag while another is using the object, the second processor will not see the trap flag until it has been interrupted and has saved its address registers.

An address register that is no longer in use needs to be cleared by the application code, so it will not cause inadvertent traps if the trap flag of the object it points at is turned on. Interrupts can save and restore an address register at any time, so if the trap flag is on, a process can trap at any time during an interrupt restore operation.

In most implementations, object numbers can be the direct addresses of object map entries. The byte address part of an address register is typically stored in an actual machine register. The object number part of the address register is stored in global memory, where it can be seen by other processors in a multi-processor system (see below).

Above I said the byte address in an address register has to point within an object. However, this is not literally true. The byte address can point beyond either end of the object, as long as any address actually used to access part of the object points within the object.

Multi-Process, Single-Processor Systems

In a multi-process, single-processor system, the single processor should have only one set of address registers, which are shared among processes. Each process that is not running has its address registers saved in the form of pointers. When the process is resumed, it will reload the processor address registers with these saved pointers, and trap if the trap flag of any object pointed at is set.

An example use of this system is moving an object on a single-processor, multi-process system. The trap flag for the object is set by the process that is going to move the object, at a time when no address register is pointing at the object. At this time all other processes have saved their address register contents as pointers. If the moving process is interrupted while it is moving the object, the interrupting process runs until it loads an address register with a pointer to the object. Then the interrupting process is trapped, saves its address registers as pointers, and stops, waiting for the object move to finish. When the moving process finishes the move, it clears the trap flag and restarts any processes waiting for the move to finish. These processes then reload their address registers with the new location of the object.

From the point of view of a process addressing an object but not moving it, the direct byte-address part of any address register pointing at the object is spontaneously adjusted whenever another process moves the object.

This method can also be applied to a single user process under a standard operating system, to support user process interrupts. The standard operating system does not have to be modified, as the user process can be treated as a virtual computer with its own interrupt system.

Address Register Loads and Stores

There are some problems loading and storing address registers in a system where interrupts can occur between any two instructions.

Loading address registers is not a problem if one makes the rule that the byte address in an address register can be meaningless garbage. Then the object number part of the register can be loaded first, before any meaningful byte address is loaded. To make the rule work, adding address offsets to garbage byte addresses must not cause failures. Generally, computers have unsigned integer add instructions that ignore overflows and can be used to add address offsets without trapping, even if they are adding the offsets to garbage values.

Making an appropriately atomic address register store operation can be trickier. This operation subtracts the byte address of an object, found in its object map entry, from the byte address in an address register to produce a displacement. The displacement cannot be stored in the same location as either argument, however. If an interrupt occurred and the object was moved during the interrupt, any location holding a byte address should be adjusted, but any location holding a displacement should not, so the two kinds of location must be distinct. What is needed is an atomic three-address subtract instruction, even on a two-address computer.

One can make an atomic three-address subtract instruction on a two-address computer, without any extra normal execution overhead, by the following trick. The rule is enforced that no interrupt can occur before a register-to-register subtract instruction unless the previous instruction is also a register-to-register subtract. If this rule can be enforced, the atomic subtraction needed to store an address register X can be done as follows:

    D := map_entry_address[X.objectnumber]
    D := D - X.byteaddress
    D := -D

where D and X.byteaddress are in registers, and no interrupt is allowed just before their subtraction. (After the final negation, D holds X.byteaddress minus the object's map entry byte address, which is the desired displacement; once D holds a displacement rather than a byte address, interrupts can no longer corrupt it.)

The rule can be enforced by programming the interrupt routine to check whether the next instruction after the interrupt is a register-to-register subtract instruction. If it is, the interrupt routine emulates that instruction before completely saving the interrupted process state. The only overhead is per interrupt, and it is small, since both the time to check the next instruction and the time to emulate a register-to-register subtract instruction are small.

If we are applying this method to support user process interrupts in a single user process under a standard operating system, only the user process interrupt trap routine needs to be modified. Again, the standard operating system does not need to be modified.


Shared Memory Multi-Processors

In a symmetric multi-processor shared memory system, the simple approach to getting all processors to recognize a newly set trap flag is to interrupt them all and get each to check its address registers. However, this is inefficient for the processors not setting the trap flag, and it is desirable to move some of the work off to the trap-flag-setting processor.

This is done by putting the object number part of each address register in global memory, so the trap-flag-setting processor can find out which other processors are referencing the object in their address registers. Only these processors need be interrupted. Note that the address register load operation must write the object number being loaded into global memory before reading the trap flag for that object. This requires a memory barrier operation between the object number write and the trap flag read on newer, faster processors [Sit].
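A sketch of this ordering in C11 atomics follows. The data layout and names are assumptions; the point is only the fence between publishing the object number and reading the trap flag, and the converse order on the trap-setting side.

```c
#include <stdatomic.h>

/* Ordering sketch: a loading processor must publish the object number
   it is loading BEFORE it reads the trap flag, or a concurrent
   trap-setting processor could miss it.  Illustrative names only. */

static _Atomic unsigned current_objnum[4];  /* one slot per processor */
static _Atomic int trap_flag[16];           /* per object map entry   */

/* Address register load side: publish, fence, then check.
   Returns nonzero if the load must trap. */
static int publish_and_check(int cpu, unsigned objnum) {
    atomic_store(&current_objnum[cpu], objnum);
    atomic_thread_fence(memory_order_seq_cst);  /* the memory barrier */
    return atomic_load(&trap_flag[objnum]);
}

/* Trap-setting side: set the flag, fence, then scan current_objnum[]
   to find which processors must be interrupted. */
static int set_trap_and_scan(unsigned objnum, int ncpus) {
    int hits = 0;
    atomic_store(&trap_flag[objnum], 1);
    atomic_thread_fence(memory_order_seq_cst);
    for (int cpu = 0; cpu < ncpus; cpu++)
        if (atomic_load(&current_objnum[cpu]) == objnum) hits++;
    return hits;   /* processors that need an interrupt */
}
```

With both sides fenced this way, every loading processor is either seen by the scan or sees the trap flag itself; without the fences, both could miss each other.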

This approach does not scale well to large numbers of processors. For such systems special hardware is indicated, for this and other reasons (e.g., cache coherency).

This method can also be applied to multiple user processes under a standard operating system, provided any process that sets an object trap flag can get any other process that shares the same memory to interrupt sufficiently promptly. The standard operating system does not have to be modified, as the user processes are treated as virtual processors. Note that strict priority scheduling of user processes would normally not be allowed, as a trap-flag-setting process might not be able to get lower priority processes to interrupt.

Copying Stops

Stopping processes that try to access an object being copied disrupts real-time response. However, the effect is no worse than I/O interrupts if the objects being copied are small enough to be copied in approximately the time taken by an I/O interrupt routine. If the object moving process is part of the garbage collection system, it can detect when processes have been stopped and slow down copying to avoid too many such stoppages in a short period of time.

It is important to copy each object at a priority level appropriate for the processes accessing the object, since these processes will have to wait for the copying process to finish.


Large objects can be copied if they are accessed only by low priority processes (stopping is then acceptable). Or large objects can be fixed in memory and not moved. Another option is to be sure the source and destination of the copy do not overlap, and only stop writing processes. And yet another idea is to preserve the page alignment of a large object when it is copied, and copy most of the object by merely copying page table entries, without actually copying the contents of the pages.

Thus the worst case is when high priority processes must write a large object that is itself one of a set of objects dynamically created and destroyed on a slower time scale, and the large object cannot be copied by copying page table entries. If this case cannot be avoided, the following scheme can be used to copy the large object incrementally. Readers are not stopped during the copy, but are interrupted at the end of the copy to get the correct new object location. Writers are stopped during each copy increment, but are allowed to run between increments to maintain their real-time response. Writers write both copies of the object while it is being copied, so at the end of the copy both copies of the object are identical. Writers must be stopped during copy increments to avoid write-copy conflict.
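The incremental scheme can be sketched as follows. This is a single-threaded illustration with assumed names; the actual stopping of writers during an increment, and the final interruption of readers, are elided.

```c
#include <string.h>

/* Incremental copy of a large object.  Readers keep using the source;
   writers write BOTH copies, so the two copies are identical when the
   last increment finishes.  Illustrative sketch only. */

enum { OBJ_SIZE = 64, INCREMENT = 16 };

static char src[OBJ_SIZE], dst[OBJ_SIZE];
static int copied;   /* bytes copied so far */

static void copy_increment(void) {
    /* writers would be stopped here, to avoid write-copy conflicts */
    if (copied < OBJ_SIZE) {
        memcpy(dst + copied, src + copied, INCREMENT);
        copied += INCREMENT;
    }
    /* writers resume here, between increments */
}

/* A write during copying goes to both copies, so it survives whether
   the written bytes have been copied yet or not. */
static void write_byte(int i, char v) {
    src[i] = v;
    dst[i] = v;
}
```

Writing both copies is what makes it safe to let writers run between increments: a write to an already-copied region updates the destination directly, and a write to a not-yet-copied region will be carried over by a later increment.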

Address Register Overhead

The overhead of the address register implementation of two level addressing consists mostly in the time taken to load and store address registers. With ordinary address registers, load and store would each take one RISC instruction. With two level addressing, load and store each take several RISC instructions: the lower bounds on these estimates are the instruction counts from the pseudo-code figure, and the upper bounds are simply double the lower bounds.

On modern computers it may be more important to count secondary cache misses than to count instructions in order to determine timing. Use of address registers tends to add one secondary cache miss (for the object map entry) for every address register load. Presumably address register stores would not add further misses most of the time.

For two-address instruction computers there is an additional per-interrupt overhead to check whether the next instruction is a register-to-register subtract, and emulate it if it is.

While these overheads are significant, for most programs they should average much


In the following, P is a pointer in memory and A is an address register; the pointer is being loaded into the address register, or the address register is being stored in the pointer. A.objectnumber is the object number part of the address register, stored in global memory, and A.byteaddress is the byte address part, stored in a hardware register. Object numbers are addresses of object map entries. Each pseudo-instruction below is intended to map to one RISC instruction on a typical RISC computer.

Address Load:

    temp_objectnumber_reg := P.objectnumber
    A.objectnumber := temp_objectnumber_reg
    memory_barrier_instruction
    temp_trap_reg := map_flagword[temp_objectnumber_reg]
    temp_trap_reg := temp_trap_reg AND trap_flag_mask_constant
    if temp_trap_reg nonzero then call trap_subroutine
        ; subroutine argument is in temp_objectnumber_reg;
        ; subroutine may return to this point to resume process
    A.byteaddress := map_address[temp_objectnumber_reg]
    temp_displacement_reg := P.bytedisplacement
    A.byteaddress := A.byteaddress + temp_displacement_reg

Address Store:

    temp_objectnumber_reg := A.objectnumber
    temp_byteaddress_reg := map_address[temp_objectnumber_reg]
        ; interrupts adjust temp_byteaddress_reg when the
        ; temp_objectnumber_reg object moves
    temp_displacement_reg := A.byteaddress - temp_byteaddress_reg
    P.objectnumber := temp_objectnumber_reg
    P.bytedisplacement := temp_displacement_reg

Figure: Address Load and Store RISC Pseudo-Code


less than the overhead, mentioned earlier, of emulating a VAX on an ALPHA. If an ALPHA can emulate a substantially slower VAX at useful speed, it should be able to emulate an R-CODE virtual computer running much faster still, at least as far as the memory manager is concerned.

Note, very importantly, that the overhead is zero for stepping addresses through an array after the address register is loaded: byte addresses in address registers can be incremented or decremented directly.

The address register load overhead may appear to be a high price to pay for languages like LISP that have small cons cells. However, because memory speeds are falling behind processor speeds [WK], modern computers are suffering very large overheads for secondary cache misses, so small cons cells are becoming inefficient anyway. Thus address register load overhead may be less significant on future computers.

The Interrupt Check Alternative

There is a classical way of handling object-map-like structures in LISP implementations. Interrupts are not permitted except when a special interrupt check operation occurs in the code. Between such operations, where interrupts cannot occur, direct addresses of objects may be put into any register, as long as trap flags are checked when addresses are loaded from object maps. When an interrupt check operation is executed, all address information outside the object maps becomes invalid if and only if an interrupt occurs, and in that case any subsequent use of addresses must re-read them from object map entries and re-check the entry trap flags.

Interrupt check instructions can be placed at the beginning of subroutines and loops. This scheme can be used when LISP is compiled into a language like C. The interrupt check operation checks a global interrupt flag and, if it is on, calls an interrupt processing routine, which, according to the rules of C, may change any object map entry, as these are globally accessible.
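The discipline can be sketched in C as follows. Everything here is an illustrative assumption: in a real system interrupt_pending would be set asynchronously, and the handler might move objects, which is why the cached direct address must be re-read after a check that fires.

```c
/* Interrupt check sketch: direct addresses may be cached in ordinary
   variables only between interrupt checks.  A check that fires
   invalidates every cached address, which must then be re-read from
   the object map (and the trap flag re-checked, elided here). */

static int interrupt_pending;   /* set asynchronously in a real system */
static char *object_base;       /* stands in for an object map entry   */
static int interrupts_taken;

static void handle_interrupt(void) {
    interrupts_taken++;
    /* the handler may move objects, changing object_base */
}

/* Placed at subroutine entries and loop heads.  Returns nonzero if an
   interrupt occurred, i.e. cached addresses are now invalid. */
static int interrupt_check(void) {
    if (interrupt_pending) {
        interrupt_pending = 0;
        handle_interrupt();
        return 1;
    }
    return 0;
}

static char sum_object(int n) {
    char s = 0;
    char *p = object_base;          /* cached direct address */
    for (int i = 0; i < n; i++) {
        if (interrupt_check())
            p = object_base;        /* re-read after a check fires */
        s += p[i];
    }
    return s;
}
```

Between checks the loop uses the cached address p at full speed; only when a check fires is the address re-read from the (globally accessible) map entry.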

The interrupt check scheme has operations very similar to address register loading, done at similar times. However, instead of storing address registers, the interrupt check scheme performs interrupt check operations.

In the interrupt check scheme, addresses outside object maps become invalid when a subroutine is called, whereas in the address register scheme it is possible to have address registers that remain valid across subroutine calls.


In a multiprocessor system setting an ob jects trap ag do es not take eect until

all pro cesses that are accessing the ob ject interrupt just as for the address register

scheme Therefore just as for the address register scheme the ob ject numbers of

ob jects b eing accessed need to b e put into global variables where they can b e seen by

the pro cess setting the trap ag Then that pro cess can interrupt only other pro cesses

that are accessing the ob ject whose trap ag it is setting

Thus on a multiprocessor system the interrupt check scheme needs to implement the

ob ject number part of address registers These can b e treated like their byte address

counterparts b ecoming invalid when an interrupt o ccurs or a subroutine is called

In a serious real-time system, where processes sharing a common garbage collected memory can indefinitely preempt each other under priority scheduling, the operating system scheduler must be modified for both the address register and interrupt check schemes. For the address register scheme, the scheduler must save address registers in the appropriate manner. For the interrupt check scheme, the scheduler must wait for a process to reach its next interrupt check operation before interrupting the process to switch to another process sharing the same memory. This requires setting a timer in case the user process gets caught in an infinite loop.

In a system that is merely interactive and not real-time, processes can be treated like independent processors if none can preempt any other sharing the same memory, and the operating system need not be modified.

It is not clear whether the address register scheme is better than the interrupt check scheme, but both implement object maps with manual object deletion and dynamic moving. For R-CODE I have proposed the address register scheme, because it seems a better match to future hardware and does not require as intrusive tinkering with the way interrupts are scheduled.

Other Work Related to Address Registers

Brooks [Bro] introduces a scheme that has an object map equivalent and performs an indirect address operation through the object map for almost every memory access. To implement this on the Motorola 68000, a single hardware register, named ipr, is given the special function of holding the address of the beginning of the object currently being accessed. This register holds the only direct address outside the object map.


What serves as the object map in this scheme is a word in front of each object copy that holds the address of the beginning of the current version of the object. Given the address in ipr of some copy of the object, the address of the current copy can be loaded into ipr by loading the word just before the location pointed at by ipr. This is done when returning from an interrupt, so a process can be interrupted by a second process that moves an object.

The computer has addressing modes in which two registers are added to form an address, so it has no need for any register to address any point in an object other than the beginning of the object.

There is no trap flag facility with this scheme, no ability to do manual deletion, and no ability to interrupt a process copying an object with another process that tries to access the object and is then stopped. Nevertheless, there is a version of an object map, and ipr is a primitive version of an address register.

Concurrent Garbage Detection Algorithms

In addition to manual deletion and dynamic object moving, our memory manager must provide for automatic garbage collection of objects, and also for collection of the object map entries of manually deleted objects, as required by F1. This leads us to a study of various concurrent object marking algorithms.

In this section I will exhaustively investigate the options for concurrent marking algorithms and pick the variants that seem most promising. There are a number of options, resulting in a lengthy analysis.

The Standard Marking Algorithm

Garbage detection algorithms all attempt to mark reachable objects (see the earlier sections). The standard algorithm is the following:

1. Initialize: Turn off the marked and scavenged bits of all objects.

2. Root set: Turn on the marked bits of the root set.

3. Scavenge: For each object whose marked bit is on and whose scavenged bit is off, scavenge the object and set its scavenged bit.
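The three steps above can be sketched with an explicit worklist of marked-but-unscavenged objects (a toy model; objects are dicts whose 'refs' lists are their pointer components):

```python
# Minimal sketch of the standard marking algorithm. The worklist is the
# list of marked-but-not-yet-scavenged objects that the text says must be
# maintained.

def mark(heap, roots):
    for obj in heap:                 # 1. Initialize: clear both bits
        obj['marked'] = False
        obj['scavenged'] = False
    worklist = []
    for obj in roots:                # 2. Root set: mark the roots
        obj['marked'] = True
        worklist.append(obj)
    while worklist:                  # 3. Scavenge marked, unscavenged objects
        obj = worklist.pop()
        if obj['scavenged']:
            continue
        obj['scavenged'] = True
        for child in obj['refs']:
            if not child['marked']:
                child['marked'] = True
                worklist.append(child)

a = {'refs': []}
b = {'refs': [a]}
c = {'refs': []}                     # unreachable: would be swept later
mark([a, b, c], [b])
assert a['marked'] and b['marked'] and not c['marked']
```

When the worklist drains, the invariant holds everywhere: every scavenged object points only at marked objects.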

This algorithm maintains an important invariant: no scavenged object points at an unmarked object. This invariant, which need not be maintained precisely at every instant, must hold at the instant marking finishes.

A problem with this algorithm is how to maintain a list of all objects that have been marked but not scavenged.

This algorithm assumes that after it is done, all objects will be swept, so unmarked objects will be returned to the free list. Compaction may be done as part of sweeping. Whenever compaction is done, forwarding must be done. Discussion of forwarding is deferred till later.

Ephemeral Marking

Studies have shown [Hay] that a large fraction of all objects allocated have fairly short lifetimes. Also, some systems have initial program loads that contain tens of megabytes of objects that need never be garbage collected.

Garbage collection can be adapted to these facts by dividing objects into two kinds, permanent objects and ephemeral objects. Then an ephemeral garbage collection is just a standard garbage collection in which the permanent objects are treated as part of the root set.

By a full garbage collection I will mean one that can collect permanent and ephemeral objects, i.e., one that ignores the distinction between the two kinds of objects, or, more specifically, one that has as small a root set as possible.

In order to do ephemeral garbage collection efficiently, a list of all permanent objects that might point at ephemeral objects needs to be kept at all times, so the many permanent objects that do not point at ephemeral objects can be ignored. The objects on this list are called ephemeral root objects. Permanent objects that are not ephemeral root objects are called not-ephemeral-root objects. Thus the efficiency of ephemeral garbage collection depends upon the propensity for permanent objects to be not-ephemeral-root objects, or, in other words, for the list of ephemeral root objects to be short.

One can add two bits to each object: the permanent bit, to indicate that the object is permanent, and the not-ephemeral-root bit, to indicate that the object is a permanent object that does not point at ephemeral objects. An important invariant must be maintained: no not-ephemeral-root object points at a non-permanent (ephemeral) object. This invariant is strictly analogous to the invariant which says no scavenged object may point at an unmarked object: the permanent and not-ephemeral-root bits are respectively analogous to the marked and scavenged bits, and the list of ephemeral root objects is analogous to the list of unscavenged marked objects.

Thus the analogy

    scavenged  <->  not-ephemeral-root
    marked     <->  permanent

is quite precise.

This analogy is not an accident. At the start of an ephemeral garbage collection it is possible to set the marked bit of each object equal to the object's permanent bit, and set the scavenged bit equal to the not-ephemeral-root bit. However, this is not what is generally done. Instead, the marked bits of all permanent objects are permanently set, and never cleared, to cause pointers to permanent objects to be ignored for marking. Similarly, the scavenged bits of all not-ephemeral-root objects are either permanently set or permanently cleared, depending on the type of garbage collector (non-snapshot or snapshot; see below), to prevent any action when pointers are stored into not-ephemeral-root objects, except, of course, for the possible action of turning off the not-ephemeral-root bit of the object and putting it on the ephemeral root object list.

During an ephemeral marking garbage detection, every ephemeral root object is scavenged. This scavenging operation can be programmed to discover whether the ephemeral root object, which might point at an ephemeral object, does in fact point at any ephemeral object. If not, the ephemeral root object can be declared to be a not-ephemeral-root object and taken off the list of ephemeral root objects.
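The demotion just described can be sketched as follows (a toy model with hypothetical field names; objects are dicts carrying the permanent, marked, and not-ephemeral-root bits):

```python
# Sketch of scavenging an ephemeral root object: if it holds no pointers to
# ephemeral (non-permanent) objects, demote it to a not-ephemeral-root
# object and drop it from the ephemeral root list; otherwise mark its
# ephemeral targets, which act as part of the root set.

def scavenge_ephemeral_root(obj, ephemeral_roots):
    points_at_ephemeral = False
    for child in obj['refs']:
        if not child['permanent']:
            points_at_ephemeral = True
            child['marked'] = True        # ephemeral target joins the root set
    if not points_at_ephemeral:
        obj['not_ephemeral_root'] = True
        ephemeral_roots.remove(obj)

perm = {'permanent': True, 'marked': True, 'refs': []}
root = {'permanent': True, 'not_ephemeral_root': False, 'refs': [perm]}
roots = [root]
scavenge_ephemeral_root(root, roots)
assert root['not_ephemeral_root'] and roots == []
```

The payoff is exactly the one the text notes: later ephemeral collections can skip the demoted object entirely.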

Sometimes, instead of keeping a list of ephemeral root objects, a list of the exact permanent object pointer components that might point at ephemeral objects is kept. These are called ephemeral root pointers. However, I do not propose this for R-CODE, because it does not fit into the efficient testing scheme given below.

Often several ephemeral garbage collections are run simultaneously, using a sequence of successively smaller permanent object sets, each consisting of successively older objects. This scheme is referred to as generational garbage collection.


Concurrent Marking Conditions

It is desirable to run garbage detection concurrently with mutators. It is also desirable to run several ephemeral garbage detections concurrently with each other, as in a generational scheme where one garbage detection detects garbage among very new objects, with all older objects in its permanent object set, and another detects garbage among relatively old objects, with only objects in the initial program load in its permanent object set.

In order to run the mutator concurrently with marking, it must maintain the invariants:

    I_s: No scavenged object points at an unmarked object.

    I_e: No not-ephemeral-root object points at an ephemeral object.

Both these invariants have the same structural form. Suppose we associate two bits, M (for marked) and S (for scavenged), with each object. Then invariant I_s says:

    If object X points at object Y, then X.S implies Y.M.

Invariant I_e says exactly the same thing if we reinterpret M to mean permanent and S to mean not-ephemeral-root.
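The shared structural form of the two invariants can be made explicit as a checker (a toy model: objects are dicts with S and M bits and a 'refs' list of pointer components):

```python
# For every pointer X -> Y in the heap, X.S must imply Y.M. The same
# predicate expresses I_s (S = scavenged, M = marked) and I_e
# (S = not-ephemeral-root, M = permanent).

def invariant_holds(heap):
    return all((not x['S']) or y['M'] for x in heap for y in x['refs'])

y = {'S': False, 'M': False, 'refs': []}
x = {'S': True,  'M': True,  'refs': [y]}
assert not invariant_holds([x, y])   # scavenged x points at unmarked y
y['M'] = True
assert invariant_holds([x, y])       # invariant restored
```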

A non-ephemeral marking process strives to meet three conditions at some single point in time:

1. All root objects are marked.

2. All marked objects are scavenged.

3. Invariant I_s above holds.

An ephemeral marking process strives to meet the following alternate set of conditions at some single point in time, while ignoring pointers to permanent objects, ignoring not-ephemeral-root objects, and maintaining the list of ephemeral root objects:

1. All root objects are marked.

2. All ephemeral root objects are marked.

3. All marked objects are scavenged.

4. Invariants I_s and I_e above hold.

Snapshot and Non-Snapshot Detectors

The marking conditions described above need not hold at all times. Rather, they must hold at some particular specified time. There are two common choices: the end of the garbage detection and the beginning of the garbage detection. Detectors using the latter choice, the beginning of garbage detection, are called snapshot detectors.

In this section I will talk about storing a pointer into an object X. If X is a top frame, this store is actually done by a read operation, as defined earlier. If X is not a top frame, this store is done by a write operation.

Non-Snapshot Detectors

In a non-snapshot detector, the conditions given above must hold at the end of garbage detection.

In order to make invariant I_s hold at the end of the garbage detection, the mutator, when it is about to store a pointer to Y into the object X and discovers that X.S is on and Y.M is off, can merely arrange that either Y.M be turned on or X.S be turned off sometime before the end of the garbage detection. Thus, although the invariant may be violated temporarily, it will be reestablished before the end of garbage detection.

Note that for a non-snapshot detector, the bit X.S should be turned on as soon as any pointer components in X are scavenged. Thus in general it is turned on at the beginning of scavenging X. Efficiency aside, no harm is done by turning X.S on prematurely.

In order to make invariant I_e hold at the end of the garbage detection, the same procedure may be followed, but with X.S always turned off sometime before the end of garbage detection. This indicates that X is no longer a not-ephemeral-root object, but may now point at an ephemeral object.
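The non-snapshot barrier for I_s can be sketched as follows (a toy model: the test fires when X.S is on and Y.M is off, and the invariant is re-established here by marking Y; turning X.S off would be the other legal choice):

```python
# Sketch of a non-snapshot (incremental-update) write barrier.

def barrier_store(x, field, y, worklist):
    if x['S'] and not y['M']:    # T_ns fires: invariant would be violated
        y['M'] = True            # (turning x['S'] off would also work)
        worklist.append(y)       # y is now marked but unscavenged
    x['refs'][field] = y         # perform the actual store

x = {'S': True, 'M': True, 'refs': [None]}
y = {'S': False, 'M': False, 'refs': []}
wl = []
barrier_store(x, 0, y, wl)
assert y['M'] and wl == [y] and x['refs'][0] is y
```

As the text says, the fix need not be immediate; here it happens eagerly only for simplicity.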


Snapshot Detectors

In a snapshot detector, the conditions given above must hold at the beginning of garbage detection. Objects allocated since the beginning of garbage collection cannot be collected, and are neither marked nor scavenged.

I define an object to be "marked at the beginning of garbage collection" if and only if it is marked at the end of garbage collection, and similarly define an object to be "scavenged at the beginning of garbage collection" if and only if it is scavenged at the end.

In order to make invariant I_s hold at the beginning of the garbage detection, the mutator, when it is about to store over a pointer to Y that was previously stored in the object X, and both objects were allocated before the beginning of garbage detection, and X.S is off and Y.M is off, can arrange that Y.M be turned on sometime before the end of the garbage detection. Note that the invariant itself is not checked; but clearly X was reachable at the beginning of garbage detection, or else we would not be storing into it.

Note that for a snapshot detector, the bit X.S should not be turned on until after all pointer components in X have been scavenged. Thus in general it is turned on at the end of scavenging X. Efficiency aside, no harm is done by turning X.S on late.

Then it can be proved that all the required conditions hold at the beginning of garbage detection. Briefly: if object X points at object Y at the beginning of garbage detection, and X is scavenged at the beginning (i.e., by the end) of detection, then we must show that Y is marked at the beginning (i.e., by the end) of detection. There are two cases. First, the pointer component in X pointing at Y is not changed until after X.S is set. Then, when X is scavenged, Y will be marked. The second case is when the pointer component is changed before X.S is set. But then Y will be marked as a consequence of the pointer to Y being stored over while X.S is off.
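The store-over barrier just described can be sketched as follows (a toy model: the test examines the pointer being overwritten, and fires when X has not yet been scavenged and the old target is unmarked):

```python
# Sketch of a snapshot write barrier: marking the old target preserves the
# snapshot of reachability taken at the start of detection.

def snapshot_store(x, field, new, worklist):
    old = x['refs'][field]
    if old is not None and not x['S'] and not old['M']:
        old['M'] = True          # old target stays live in the snapshot
        worklist.append(old)
    x['refs'][field] = new       # perform the actual store

y = {'S': False, 'M': False, 'refs': []}
x = {'S': False, 'M': True, 'refs': [y]}
wl = []
snapshot_store(x, 0, None, wl)   # overwrite the only pointer to y
assert y['M'] and wl == [y]
```

Without the barrier, y would be unreachable by the time the scavenger visited x, even though it was reachable when detection began.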

In order to make the ephemeral invariant I_e hold, we can assume that it holds at the beginning of the garbage detection, and arrange, as in the non-snapshot case, for it to hold at the beginning of the next detection. More specifically, when the mutator is about to store a pointer to Y into the object X, and discovers that X.S is on and Y.M is off, it arranges that X.S be turned off sometime before the beginning of the next garbage detection. Here X.S is on if X is a permanent object that is not an ephemeral root, and Y.M is on if Y is a permanent object.

In theory, objects allocated after garbage collection starts are ignored by a snapshot detector. In practice they can have both their M and S bits set, and will be effectively ignored.

It is possible to replace any single test "X.S is off and Y.M is off" by the simpler test "Y.M is off" whenever a pointer to Y in an object X is stored over, without marking any more objects than one would otherwise mark, assuming objects allocated after detection began are marked. Suppose X.S is on when the pointer from X to Y is stored over. Then, if the pointer to be stored over existed when X was scavenged, Y.M will be on. But if not, then the pointer was stored after garbage detection started. Therefore Y was reachable after detection started, and hence either Y was allocated after detection began or Y was reachable when detection began.

Read and Write Barriers

In the discussion of ephemeral, non-snapshot, and snapshot detectors, three different tests were mentioned:

    T_ns (non-snapshot scavenge): X.S is on and Y.M is off.

    T_ss (snapshot scavenge): X.S is off and Y.M is off, applied to the pointer being stored over.

    T_e (ephemeral root): X.S is on and Y.M is off, with S reinterpreted as not-ephemeral-root and M as permanent.

When a pointer is read into a top frame, these tests must be applied where X is a top frame. A test applied during a read operation is called a read barrier.

When a pointer is written into an object that is not a top frame, these tests must be applied where X is not a top frame. A test applied during a write operation is called a write barrier.

Non-snapshot detectors come in two flavors. Those that enforce I_s with the aid of T_ns read barrier tests, but use no write barrier tests, are called read barrier detectors. Those that enforce I_s with the aid of T_ns write barrier tests, but use no read barrier tests, are called write barrier detectors.

Snapshot detectors, on the other hand, must always use a T_ss write barrier test. It is possible to program them to also require a T_ss read barrier test, but this should never be done, as it is obviously inefficient.

Invariant I_e must always be maintained with the help of a T_e write barrier test.


Read Barrier Detectors

A read barrier detector is a non-snapshot detector that scavenges all top frames before it scavenges any other object. More specifically, the rule is enforced that after the S bit is set for any object that is not a top frame, no top frame will contain a pointer to an unmarked object. Therefore write barriers will not be necessary to maintain invariant I_s, as any pointer written will point at a marked object.

Write barriers are still necessary to maintain I_e.

A read barrier detector sets all frame S flags at the very beginning of detection, so the read barrier test T_ns reduces to "Y.M is off". The frames do not have to be immediately scavenged; the mutators may run while they are being scavenged. But they must be completely scavenged before the I_s-invariant S flag of any non-frame object is set.

Note that the read barrier need not actually mark an unmarked object. It need merely schedule the object to be marked at some time before the end of garbage detection.

Write Barrier Detectors

A write barrier detector is a non-snapshot detector that does not scavenge frames until the end of detection, and uses no read barrier test whatsoever. Therefore the top frame may contain pointers to unmarked objects, and a T_ns write barrier test is needed to support invariant I_s. A similar T_e write barrier test is needed to support invariant I_e.

The Write Barrier End Game. Finishing a write barrier detection is not trivial. The following is a reasonable ending algorithm:

1. Wait until a time when all non-frame root objects have been marked and all marked non-frame objects have been scavenged.

2. Scavenge stacks from the bottom up. To scavenge a frame, set its S bit and scavenge it. But if a frame being scavenged becomes the top frame, clear its S bit and stop scavenging the frame's stack.

3. If there are any unscavenged marked non-frame objects at the end of this step, go back to step 1.

4. Stop all mutators and scavenge all unscavenged frames.

5. If any objects are newly marked by this scavenging, restart mutators and go back to step 1.

6. Terminate the detection and restart the mutators.
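The ending algorithm above can be sketched as a loop (the three callbacks are hypothetical stand-ins for the real scavenging work; only the control flow is modeled):

```python
# Sketch of the write-barrier end game. scavenge_nonframes drains the
# marked-but-unscavenged non-frame objects (step 1); scavenge_frames works
# bottom-up while mutators run and returns True if unscavenged marked
# non-frame objects remain (steps 2-3); stop_world_scan scavenges the
# remaining frames with mutators stopped and returns True if anything was
# newly marked (steps 4-5).

def end_game(scavenge_nonframes, scavenge_frames, stop_world_scan):
    while True:
        scavenge_nonframes()          # step 1
        if scavenge_frames():         # steps 2-3
            continue                  # back to step 1
        if stop_world_scan():         # steps 4-5
            continue                  # back to step 1
        return                        # step 6: detection terminates

calls = []
end_game(lambda: calls.append('nf'),
         lambda: (calls.append('fr'), False)[1],
         lambda: (calls.append('stw'), False)[1])
assert calls == ['nf', 'fr', 'stw']
```

Thrashing, in this framing, is the loop repeatedly taking one of the `continue` branches.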

The time needed when mutators are stopped in step 4 can be a problem. Use can be made of code optimized to check each frame to see if it points at any unmarked objects.

Write Barrier End Thrashing. If the write barrier detection ending algorithm above must repeatedly go back to step 1 from steps 3 or 5, this is known as write barrier end thrashing.

There are theoretical examples where such thrashing can be severe. Suppose a mutator has built a very long LISP list, reachable only from its top frame, and then applies the LISP NREVERSE function to destructively reverse the elements of the list. Suppose the garbage detector reaches step 2 of the ending algorithm just after NREVERSE starts.

During the NREVERSE function, the top frame maintains pointers to the head of the unreversed part of the list and the head of the already reversed part. An element is moved from the head of the unreversed part, and the element is changed to become the head of the reversed part. Once changed, it is no longer possible to reach the yet unreversed part of the list through the moved element.

So step 2 will mark the heads of the reversed and unreversed parts of the list. Then the mutator will be restarted, and will move the head of the unreversed part to the reversed part and change it. The garbage detector will then not be able to reach the rest of the unreversed part of the list through already marked objects. The detector will run until it re-executes step 2 without marking the unreversed remainder of the list. Then, at step 3, the whole process will repeat, and may continue doing so as long as the NREVERSE function continues.

Note that this sort of behavior cannot happen unless mutators destroy the connectivity of the part of the object graph reachable only from the stack. The NREVERSE function is particularly perverse in this respect.

The garbage detector can detect such thrashing by counting the number of times it reaches steps 3 and 5. Furthermore, by keeping a count of how many times unmarked objects were discovered for each mutator in the frame-scavenging steps, the detector can identify which mutators are causing thrashing.

The only danger is that there will be so little extra work for the detector discovered by steps 2 or 4 that the cost of the end algorithm will not be properly amortized over the extra detection work discovered. If this begins to happen, the detector may take one of the following special steps:

1. Switch into a mode where the allocator always sets new objects as being marked and scavenged. This prevents thrashing caused by simply allocating new objects.

2. Switch into a mode where the mutators causing thrashing are stopped while garbage detection attempts to finish. These mutators may be restarted after some significant amount of work has been done by the garbage detector, even if detection has not finished.

3. Switch into a mode where some mutators have a read barrier that marks any object when a pointer to it is copied into the mutator's top frame.

Whether any of these steps are necessary, and which are best to use, is a very probabilistic and experimental subject. Clearly mode 3 will work well, but it requires considerable extra code, though this extra code is not executed most of the time.

Write barrier end thrashing is significant because it is the only component of write barrier detector overhead that cannot be rationally bounded. One would hope that the overhead of a detector could be bounded in terms of reasonable parameters, such as number of used objects, rate of object allocation, rate of byte allocation, number of pointer writes, and so forth. And in fact every overhead of a write barrier detector can be so bounded, except write barrier end thrashing.

Write barrier end thrashing is likely to be similar to other kinds of thrashing that can happen in a computer system, for example cache thrashing and virtual memory page thrashing. It will be unlikely in practice, but unpredictable in theory. But it is possible to detect write barrier thrashing and take some easy countermeasures, such as modes 1 and 2 above.


Snapshot Detector Barriers

A snapshot detector begins by scavenging all top frames, and thereafter, whenever a frame becomes a top frame, stops its mutator and scavenges the frame before permitting the mutator to continue. Then no T_ss read barrier test is needed.

In a snapshot detector, an object's I_s-invariant S bit cannot be set until after the object is scavenged, and it is the setting of this bit for frames that makes the read barrier test unnecessary. So each top frame must really be scavenged, with its mutator stopped, before the mutator can be permitted to continue.

However, scavenging a top frame for a snapshot detector can be done by making a copy of the frame with the mutator stopped, and then scavenging the copy while the mutator resumes.

A snapshot detector always requires both T_ss and T_e write barrier tests. The first can be simplified to the test "Y.M is off".

Comparison of Read-Barrier, Write-Barrier, and Snapshot Detectors

The following are some comparisons between these three kinds of detector:

1. Read barriers make pointer reads more expensive, while write barriers make pointer writes more expensive. There are typically many more pointer reads than writes, so the total overhead for read barriers can be much greater.

2. If a garbage detection takes a long time, a write barrier detector may detect significantly more garbage than a snapshot detector or a read barrier detector. This is because objects created during garbage detection, but which are never reachable except from a mutator stack, and which become unreachable before the end of garbage detection, are usually classified as garbage by a write barrier detector. Both a read barrier and a snapshot detector, however, effectively mark any newly created object.

3. The tests T_ns and T_e, needed to maintain invariants I_s and I_e when a pointer is written into an object, are so similar in a write barrier detector that they can be bundled together and done in parallel by executing a single bit-string AND operation. So both kinds of test can be done in the same amount of time it would take to do one kind of test.

But in a snapshot detector, the previous value of the pointer component must be tested to maintain invariant I_s, and the new value of the pointer component to maintain invariant I_e, so the totality of both kinds of write barrier tests takes twice as long as for a write barrier detector.

4. If we examine our invariants,

    I_s: No scavenged object points at an unmarked object.

    I_e: No not-ephemeral-root object points at an ephemeral object.

we see that only I_s can be maintained by a read barrier. I_e must be maintained by a write barrier, because the invariant is maintained by clearing the S flag of the object into which the pointer is being stored (making it an ephemeral root), rather than setting the M flag of the object pointed at (making it permanent). Thus a read barrier ephemeral detector must also use a separate write barrier.

5. Write barrier detectors have some trouble getting the stacks scavenged at the end of detection without using operations that disrupt real-time performance. Snapshot detectors have some trouble getting stacks scavenged at the beginning of detection without using operations that disrupt real-time performance. Read barrier detectors have no such problems, since they merely need to start scavenging stacks at the beginning of detection, and can take as long as they like, while mutators are running, to actually scavenge the stacks.
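The bundling of the T_ns and T_e write barrier tests into a single bit-string AND, mentioned above, can be sketched like this (a toy model: bit 0 carries the I_s bits and bit 1 the I_e bits; names hypothetical):

```python
# Pack each object's two S bits and two M bits into small integers, so both
# "X.S is on and Y.M is off" tests are computed by one AND expression.

def bundled_tests(x_s_bits, y_m_bits):
    """Return a mask with bit i set iff test i fires
    (X.S_i is on and Y.M_i is off)."""
    return x_s_bits & ~y_m_bits & 0b11

# x is scavenged (bit 0) and not-ephemeral-root (bit 1);
# y is unmarked (bit 0 off) but permanent (bit 1 on):
assert bundled_tests(0b11, 0b10) == 0b01   # only T_ns fires
assert bundled_tests(0b11, 0b00) == 0b11   # both tests fire
assert bundled_tests(0b00, 0b00) == 0b00   # neither fires
```

A snapshot detector cannot bundle this way, since its two tests examine different pointer values (old and new).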

All but the last comparison is a reason to favor write barrier detectors. Therefore R-CODE uses a write barrier detector. This means that R-CODE must live with the possibility of write barrier end thrashing, discussed above, and control such thrashing with measurement tools and other means. I do not anticipate this thrashing being much of a problem in the real world.


Copying Collectors

A copying collector organizes some of memory into two disjoint spaces, the from-space and the to-space, and, assuming all objects it wants to detect as garbage are initially in the from-space, attempts to copy all reachable objects in the from-space into the to-space.

Specifically, a copying collector copies a from-space object to to-space either when the object is marked, or when it is scavenged, or in between. To distinguish these three options, I refer to copy-on-mark detectors, copy-on-scavenge detectors, and mark-copy-scavenge detectors.

Thus, for a copy-on-mark detector, being marked is synonymous with being in to-space, and being unmarked is synonymous with being in from-space, for the set of objects under consideration.

Copying collectors have been used with read barrier detectors and proposed for write barrier detectors.

Some advantages and disadvantages of copying collectors are:

- A copy-on-mark read barrier detector can do forwarding at the same time as the read barrier check. Therefore all pointers in the top frame point at copied objects. This is one of only two ways of doing forwarding that has mutators using only forwarded pointers for objects that have been copied, so there are no problems with two copies of an object becoming inconsistent (see Forwarding below).

  Note that if pointers used by mutators are to be forwarded when read, using a read barrier, marking cannot be delayed in the way it otherwise could be. If marking were delayed, the mutator might end up using a non-forwarded pointer, which would be a problem if the two copies of the object became inconsistent.

  If a copy-on-mark detector forwards pointers both during the read barrier and when a pointer is scavenged, then at the end of garbage detection all pointers will be forwarded.

- Write barrier and snapshot copy-on-mark detectors, all copy-on-scavenge detectors, and all mark-copy-scavenge detectors do not have the nice property of having mutators only access to-space copies of objects and not the from-space originals.


- Compaction happens automatically, without sweeping.

- For copy-on-mark detectors, the list of marked, unscavenged objects is easily maintained by simply scavenging objects in the same order in which they were copied. If copies are allocated in to-space in order of increasing addresses, then objects that are marked can be scavenged in order of increasing addresses, and it is merely necessary to keep the address of the boundary between scavenged and unscavenged objects in to-space to know which objects have been scavenged.

  For mark-copy-scavenge detectors, a separate list of objects to be copied must be maintained, but the list of copied objects to be scavenged is easily maintained as just described.

- For copy-on-mark detectors, the cost of marking an object becomes high, because the object must be copied.

- In real-time systems it is better to perform copying at the convenience of the system, in order to stretch the overhead out evenly over time. But mutators do marking, via read or write barriers, on their own schedule, and with copy-on-mark detectors this means copying is done on the mutators' marking schedule, which may bunch copies at awkward moments. In other words, the mutator overhead of a copy-on-mark detector is not rationally bounded.

  Both the copy-on-scavenge and mark-copy-scavenge detectors solve this problem.

- Copying collectors require a separate to-space at least as big as the set of reachable objects in from-space, and cannot simply compact a single memory space by sliding objects to one end.

- The order in which objects appear in memory changes, and this might adversely impact time efficiency, amount of memory required, or predictability, depending on the total situation. There is much debate on the memory order objects should appear in for best efficiency; however, allocation order appears to be one of the better orders.
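The scavenge-in-copy-order bookkeeping described above is essentially Cheney's algorithm; a minimal sketch (a toy model, with 'forward' fields standing in for forwarding words, and list position standing in for to-space address order):

```python
# Sketch of a copy-on-mark collector scavenged in copy order: a single
# 'scan' boundary separates scavenged copies (below it) from copied but
# unscavenged copies (above it).

def cheney_collect(roots):
    to_space = []                      # copies, in order of copying
    def copy(obj):                     # copy on mark
        if 'forward' not in obj:
            obj['forward'] = {'refs': list(obj['refs'])}
            to_space.append(obj['forward'])
        return obj['forward']
    new_roots = [copy(r) for r in roots]
    scan = 0                           # scavenged / unscavenged boundary
    while scan < len(to_space):
        obj = to_space[scan]
        obj['refs'] = [copy(c) for c in obj['refs']]   # scavenge and forward
        scan += 1
    return new_roots, to_space

a = {'refs': []}
b = {'refs': [a]}
garbage = {'refs': []}
new_roots, to_space = cheney_collect([b])
assert len(to_space) == 2                        # a and b copied; garbage is not
assert new_roots[0]['refs'][0] is a['forward']   # pointers forwarded
```

No separate worklist is needed: the to-space region between the boundary and the allocation point is the list of marked, unscavenged objects.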

Because R-CODE uses two-level addressing, it does not need to worry about forwarding pointers in other ways. Therefore the principal advantage of a copy-on-mark detector with read barrier forwarding does not apply to R-CODE.

The principal disadvantage of copy-on-mark collectors, the possibility of uncontrolled mutator delay copying a set of objects referenced by the mutator, conflicts with requirement F_2, that R-CODE be suitable for real-time applications. Therefore R-CODE does not use a copy-on-mark collector algorithm. There is no reason for R-CODE to associate copying and scavenging, so R-CODE does not use a copy-on-scavenge or a mark-copy-scavenge algorithm.

Forwarding

When objects are copied by garbage collectors, a principal problem is making sure all mutators access the same copy, or consistent copies. There are four ways of doing this: read-barrier forwarding, write-barrier forwarding, replication forwarding, and two-level addressing. All four ways are discussed in this section for completeness, although, for reasons other than just forwarding, R-CODE uses the two-level addressing scheme described above.

Read Barrier Forwarding

A read-barrier-forwarding garbage collector is a special case of a copy-on-mark copying collector with a read barrier detector. A read-barrier-forwarding collector ensures that all object pointers in top frames point at object copies in to-space, and not at the originals in from-space. Therefore mutators only access to-space copies.

With a read barrier detector, all pointers in top frames have been marked. With a copy-on-mark collector, all objects that have been marked have also been copied to to-space. The read barrier that is part of the garbage detector is augmented to forward all pointers to objects which have already been copied, with the result that all pointers in top frames point into to-space.

Therefore all mutators access only the to-space versions of copied objects, and the collector is of the eager forwarding variety.

The read barrier in this detector may not delay marking, but must mark and copy any unmarked object immediately, so it can forward pointers to the object immediately. Pointer components are also forwarded when they are scavenged. Thus, at the end of garbage detection, all objects have been copied to to-space, and all pointers have been forwarded and point at to-space.


A read-barrier-forwarding collector does not have to begin by copying all objects pointed at by top frames. Instead it can run mutators and frame scavenging simultaneously until all the frames are scavenged, as long as no non-frames are scavenged until all frames have been scavenged.

One of the standard kinds of garbage collector in use today is the read-barrier-forwarding collector, which is generally known [Wil] as "The Copying Collector".

Write Barrier Forwarding

A write-barrier-forwarding garbage collector is a special case of a copy-on-mark copying collector with a write barrier detector. A write-barrier-forwarding collector may have some mutators accessing the to-space version of a copied object while other mutators are still accessing the from-space version.

The scavenger and write barriers are augmented to forward all pointers to objects which have been copied. Therefore, by the end of garbage detection, all pointers will have been forwarded and point at to-space.

Because mutators may access either copy, write operations must write all copies of an object. To do this they must synchronize with each other and with processes copying objects. This synchronization tends to impose a high cost in time on write operations. Read operations may access any copy, and the collector is of the lazy forwarding variety.

The write barrier in this detector may not delay marking, but must mark and copy any unmarked object immediately, so it can forward pointers to the object immediately.

It is possible to avoid the extra write overhead on objects being initialized for the first time, and therefore it is possible to avoid it for all writes in a purely functional language. Thus it is possible to avoid the overhead for most writes in a mostly functional language.

The Pegasus system of North and Reppy [NR], described earlier, is a write-barrier-forwarding garbage collector. The garbage collector proposed by Brooks, also described earlier, mixes the write-barrier-forwarding collector and object map ideas.


Replication Forwarding

A replicationforwarding garbage collector do es not require ob jects to b e copied at

any given time nor do es it require the detector b e of any particular type such as

read barrier or write barrier

The mutators in a replication-forwarding collector always access the originals of the objects in from-space. At some point, not necessarily during detection, a copy of all marked objects is made in to-space by a background process. All the pointers in the to-space copies are forwarded, but none of the pointers in the from-space copies are forwarded.

The goal of copying is to copy all marked objects. Copying can start any time after marking starts, and finish any time after marking finishes and before the next marking cycle starts.

New objects allocated after a write barrier detection ends, but before copying ends, can be allocated in to-space and never be copied, so no pointers to them will require forwarding. New objects allocated after a read barrier or snapshot detection begins can likewise be allocated in to-space.

The stack frames are not copied till the end of the copying. Until this time they contain only pointers to from-space. Then, at the end of the copying, mutators are stopped, the stack frames are copied, the pointers in the stack frames are forwarded, and the mutators resume with the copied stacks.

This algorithm for copying the stacks can be made more sophisticated, along the same lines as the detection ending algorithm for write barrier detectors, in order to stop mutators for a shorter period of time. However, there is no analog of write barrier end thrashing, as nothing is left to do at this point except copy and forward the stacks.

In order to keep the copies up to date, each process makes a write log of all the writes it makes into objects that might have a copy. These write logs are processed by the copying process.

Because the write logs can consist of log buffers private to the writing process while the log buffer is being filled, writes do not have to synchronize with the copy process. This makes writes much more efficient than they would be if they had to synchronize.

A deficiency of this system is that two mutators may not write the same object component, because of time race condition problems updating the copies. This could be addressed by having special write operations that are synchronized with each other, at the cost of extra synchronization overhead per write. In a single processor system it might be addressed by building a fast atomic operation to write a log common to all processes.

The system can be designed so new objects do not have copies until after they are initialized, and therefore writes done to initialize an object need not be logged.
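The write-log scheme above can be illustrated with a small sketch. This is not R-CODE's or Nettles et al.'s actual implementation; the class and method names are invented for illustration. Mutators write only the from-space original and append records to a private log, so no synchronization with the copier is needed on the write path; a background copier later replays the log into the to-space replicas.

```python
# Illustrative sketch of replication-forwarding with per-mutator write
# logs. Mutators touch only from-space; the copier replays logs into
# the to-space replicas to bring them up to date.

class ReplicatingHeap:
    def __init__(self):
        self.from_space = {}   # object id -> list of fields (originals)
        self.to_space = {}     # replicas made by the copier

    def new_object(self, oid, fields):
        self.from_space[oid] = list(fields)

    def write(self, log, oid, offset, value):
        # Mutator writes the original and logs the write; no
        # synchronization with the copier happens here.
        self.from_space[oid][offset] = value
        log.append((oid, offset, value))

    def copy_all(self):
        # Copier replicates every object into to-space.
        for oid, fields in self.from_space.items():
            self.to_space[oid] = list(fields)

    def replay(self, log):
        # Copier replays a mutator's log so replicas catch up.
        for oid, offset, value in log:
            if oid in self.to_space:
                self.to_space[oid][offset] = value
        log.clear()

heap = ReplicatingHeap()
heap.new_object("a", [1, 2, 3])
heap.copy_all()
log = []
heap.write(log, "a", 1, 99)   # the to-space replica is now stale
heap.replay(log)              # the replica catches up
```

The point of the design is visible in `write`: it is an ordinary store plus an append to a private buffer, which is why replication-forwarding writes stay cheap compared to schemes that must update all copies synchronously.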

The collector proposed by Nettles, O'Toole, Pierce, and Haines [NOPH], described earlier, is a replication-forwarding collector.

Two Level Addressing

A two level addressing memory manager requires that pointers consist of two parts: an object number, which identifies an object, and a within-object byte displacement, which identifies a byte location within the object. All accesses to the object use the object number to look up the current address of the object in a table called the object map. The object map also contains a trap flag for each object that can be used to trap all accesses to the object.

With two level addressing, an object can be moved at any time. The trap flag is set to stop processes that try to access the object, the object is moved, the trap flag is cleared, and any processes stopped by the trap flag are restarted.

There is a more detailed description of two level addressing above.

Two level addressing makes object copying completely independent of garbage detection or collection.
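The mechanism can be sketched concisely. The following is an illustrative model, not R-CODE's implementation; the names (`ObjectMapEntry`, `TwoLevelMemory`, the exception used for the trap) are invented. Every access indirects through the object map, so moving an object changes only its map entry, never the pointers to it.

```python
# Sketch of two-level addressing: a pointer is (object number, byte
# displacement); all accesses go through an object map holding each
# object's current address and a trap flag.

class ObjectMapEntry:
    def __init__(self, address):
        self.address = address  # current base address of the object
        self.trap = False       # set while the object is being moved

class TwoLevelMemory:
    def __init__(self, size):
        self.memory = bytearray(size)
        self.object_map = {}

    def allocate(self, obj_num, address):
        self.object_map[obj_num] = ObjectMapEntry(address)

    def read(self, obj_num, displacement):
        entry = self.object_map[obj_num]
        if entry.trap:
            # A real system would stop the process and restart it
            # when the trap flag is cleared.
            raise RuntimeError("access trapped: object is being moved")
        return self.memory[entry.address + displacement]

    def move(self, obj_num, new_address, length):
        # Trap accesses, copy the bytes, update the map, clear the
        # trap. Pointers to the object never change.
        entry = self.object_map[obj_num]
        entry.trap = True
        old = entry.address
        self.memory[new_address:new_address + length] = \
            self.memory[old:old + length]
        entry.address = new_address
        entry.trap = False

mem = TwoLevelMemory(64)
mem.allocate(1, 0)
mem.memory[0:4] = b"\x0a\x0b\x0c\x0d"
before = mem.read(1, 2)
mem.move(1, 16, 4)        # the object moves ...
after = mem.read(1, 2)    # ... but pointer (1, 2) is still valid
```

Because only the map entry changes, the same trap-and-update sequence also supports the other uses mentioned below: growing an object by copying it somewhere larger, or deleting it and leaving the trap flag set to catch dangling pointers.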

Comparison of Read-Barrier-Forwarding, Write-Barrier-Forwarding, Replication-Forwarding, and Two-Level-Addressing Detectors

The following are comparisons of the forwarding mechanisms:

- Read-barrier-forwarding can stop a mutator for as long as it takes to copy the objects it is referencing. It may be necessary to copy many such objects at one time, and this makes it difficult to bound mutator execution time.

- Write-barrier-forwarding can make non-initialization writes take much longer than normal, because of synchronization problems.


- Replication-forwarding does not easily permit two distinct processes to write the same component of an object, unless special synchronized writes are used that take much longer than normal writes.

- Two-level-addressing is implemented with address registers into which pointers must be loaded to be used, and in software implementations, operations to load and store address registers take much longer than normal.

- Two-level-addressing permits a number of other things to be done besides just forwarding. For example, objects can be copied to increase their size, or an object can be deleted and the object trap flag set to detect dangling pointers. The other forms of forwarding do not support such operations.

RCODE uses two level addressing because it supports manual deletion and object size increases, as well as garbage collection forwarding.

The RCODE Write Barrier

One of the advantages of a write barrier detector listed earlier is that the tests T_ns and T_e can be combined into a single test using a bit string AND operation.

In this section we will describe the RCODE write barrier and analyze its effect on mutator performance.

The Bitstring AND Write Barrier Test

The T_ns write barrier test, to ensure that no scavenged object points at an unmarked object, and the T_e test, to ensure that no not-ephemeral-root object points at a non-permanent ephemeral object, are operationally identical if we make the equivalences

    scavenged = not-ephemeral-root
    marked    = permanent

To perform these tests for several garbage detections running in parallel, we may associate two bit vectors with each object (e.g. in its object-map entry): S and M.


Then, when a pointer to object X is stored in object Y, all the write barrier tests are combined into the single test: Y.S AND NOT X.M is nonzero. This test is called the bitstring AND write barrier test. If this test is true, special action must be taken that may mark objects and update lists.

If the matching bits of S and M that were respectively on and off have the interpretations scavenged and marked, the action is to mark object X, i.e. set the X.M bit, and to put X on the list of objects to be scavenged. This action need not be done immediately; it may be postponed until any time before the end of garbage detection.

If the matching bits of S and M that were respectively on and off have the interpretations permanent object that is not an ephemeral root object and permanent object, then the action is to clear the Y.S bit, put object Y on the list of ephemeral root objects, and mark Y. Again, this need not be done immediately; it may be postponed until any time before the end of garbage detection.

Several garbage detections can run asynchronously using the same bit vectors. Each detection is assigned a different pair of matching bits to mean marked and scavenged. Several different sets of permanent objects can also be maintained at the same time, by assigning other pairs of matching bits to mean permanent and not-ephemeral-root.

Most computers do not have a single bit string x AND NOT y instruction, but do have an AND instruction. For these, the bit vector M can be stored complemented in memory, so that each write requires only a single bit string AND instruction to perform all write barrier tests.
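The combined test can be sketched in a few lines. This is an illustrative model, not R-CODE's representation; the bit assignments and helper names are invented. M is kept complemented (notM), so the whole family of tests collapses to one AND, and bit position i belongs to detection i.

```python
# Sketch of the bitstring AND write barrier test. Each object carries
# an S bit vector and the complement notM of its M bit vector, so the
# test Y.S AND NOT X.M reduces to a single AND of two words.

S    = {}   # object -> scavenged / not-ephemeral-root bits
notM = {}   # object -> complement of marked / permanent bits

def new_object(name, s_bits=0, m_bits=0, width=8):
    S[name] = s_bits
    notM[name] = (~m_bits) & ((1 << width) - 1)

def store_pointer(y, x):
    """Store a pointer to object x into object y; return True when
    the write barrier trips, i.e. Y.S AND NOT X.M is nonzero."""
    return (S[y] & notM[x]) != 0

new_object("y_scavenged", s_bits=0b01)   # bit 0: this detection's S bit
new_object("x_unmarked",  m_bits=0b00)
new_object("x_marked",    m_bits=0b01)

trips   = store_pointer("y_scavenged", "x_unmarked")  # action needed
no_trip = store_pointer("y_scavenged", "x_marked")    # nothing to do
```

Because each detection (or each permanent-object set) owns its own bit pair, several detections share the same single-instruction test, which is the property the surrounding text relies on.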

Local and Global Heaps: A Future Use for the Write Barrier Test

Here I will mention a possible additional future use for the write barrier test just described.

Because memory speeds are falling behind processor speeds [WK], memory local to one processor is becoming more efficient than memory shared between several processors. This is partly just a matter of not having to maintain secondary cache coherence for local memory.

As a consequence, it may become expedient in the future to have separate local and shared heaps. Each processor would have a local heap only it could access, and several processors would share a separate shared heap. The rule would be enforced that the shared heap could not point at local heaps, so that the several processors would not have trouble accessing objects pointed at by the shared heap.

To enforce this rule, a pair of S and M bits would be assigned to enforce the invariant I_l: no shared heap object points at a local heap object.

The M and S bits would both be off for local objects and both be on for shared objects, so the write barrier test would have the standard form (X.S AND NOT Y.M is nonzero), to trap on storing a pointer to a local object Y into a shared object X.

I do not expect this to be very useful in the immediate future, but it may be within the decade.

There is, however, a technical difficulty with using the (X.S AND NOT Y.M) test to detect illegal stores of local pointers into shared heap objects. The difficulty is that for all other applications of this test, the pointer needs to be written in the object before the test is made, whereas for the use we are describing here, we would like the pointer to be stored only after the test has been passed.

For efficiency, we will have to live with storing the pointer before the test, but this will mean that for a short time after an illegal pointer is stored, the shared object will actually contain a pointer to a local heap. This situation can be detected and fixed when the deferred action buffer is processed (see below for deferred action buffers). Also, it may be possible to use separate ranges of object numbers for different local and shared heaps, so that illegal uses of local pointers will be detected no matter how they were communicated. Then the write barrier test will be just a convenience for detecting the guilty party.

Deferred Action Buffers

In a concurrent system there is a problem with setting M bits, clearing S bits, and manipulating related lists: these actions must be atomic. I will assume hardware with a relatively high atomic synchronization overhead for these actions, and therefore introduce a method of delaying and batching them.

(Footnote: Because there is a scavenger process that turns an object's S bit on and then looks at pointers in the object, it is necessary for the mutator to put the pointer in the object, for the scavenger process to see, before the mutator looks at the S bit to see whether the scavenger has already processed the object and therefore will not see the pointer.)

Our store pointer instruction will do just the following. On storing a pointer to object Y into object X, if X.S AND NOT Y.M is nonzero, pointers to X and Y will be written into a deferred action buffer. Each process will have its own buffer, and so will not need to interlock these buffer writes. When the buffer is full, its process will synchronize with other processes to set M bits, clear S bits, and maintain lists.

There are two ways to keep the time needed to process a full deferred action buffer reasonable for fast high priority processes. First, the buffer may be made just long enough to amortize the cost of synchronizing, in which case the cost of processing it should not be more than several times the cost of synchronizing. Or second, full buffers may be shipped to a lower priority process for processing.
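The batching idea can be sketched as follows. This is an illustrative model, not the RISC pseudo-code of the figures below; the class and its capacity are invented. Tripped write barrier tests append (X, Y) pairs without synchronization; only a full buffer pays the synchronization cost.

```python
# Sketch of a per-process deferred action buffer: recording an entry
# is unsynchronized; only flushing a full buffer would, in a real
# system, lock out scavengers to set M bits, clear S bits, and
# maintain lists.

class DeferredActionBuffer:
    def __init__(self, capacity, process_batch):
        self.entries = []
        self.capacity = capacity
        self.process_batch = process_batch   # the synchronized handler
        self.flushes = 0                     # how often we paid to sync

    def record(self, x, y):
        # Cheap, unsynchronized path taken by the write barrier.
        self.entries.append((x, y))
        if len(self.entries) >= self.capacity:
            self.flush()

    def flush(self):
        # Expensive, synchronized path, amortized over `capacity`
        # entries; full buffers could instead be shipped to a lower
        # priority process.
        self.flushes += 1
        self.process_batch(self.entries)
        self.entries = []

processed = []
buf = DeferredActionBuffer(4, processed.extend)
for i in range(10):
    buf.record("obj_x", f"obj_y{i}")
```

With a capacity of 4, ten recorded actions cause only two synchronized flushes, leaving two entries buffered; this is the amortization the text describes.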

Write Barrier Mutator Overhead

The mutator write barrier pseudo-code is shown in the figure below. Based on this, the write barrier takes a small number of RISC instructions in the best case and a somewhat larger number in the worst case, where the lower bounds are taken from the figure and the upper bounds are double the lower bounds. The best case is when the write barrier test indicates no further action is required. The worst case is when a deferred action buffer entry is written. Here we assume that in the worst case the end of the deferred action buffer is not reached, so the subroutine to process this buffer is not called.

When the end of the deferred action buffer is reached, the deferred action buffer end processing routine is called. This routine locks out the scavenger processes and executes the loop in the second figure below to process the buffer entries.

For each entry, either some X.S bits are turned off or some Y.M bits are turned on. It is important to do this soon after the write barrier test indicates it is necessary, so it will normally be done by the mutator. Otherwise a single unmarked Y object may produce many deferred action buffer entries before it is marked, for example.

Then an action is sent to a scavenger that will perform the actual processing. The scavengers run in background with different priorities; there might be a faster high priority scavenger for ephemeral garbage collection and a very low priority scavenger for a full garbage collection. The action is sent to a buffer associated with the destination scavenger. An operation code is sent as part of the action to tell the scavenger


In the following, Y-object-number-reg is a register holding the object number of an object Y, and A.X is the address register pointing at an object X that contains a pointer-component into which Y's object number is to be stored. The current entry in the deferred action buffer is pointed at by the deferred-entry-reg register, and the point just after the end of the buffer is pointed at by deferred-end-reg. An action-entry in this buffer has two components: X and Y. Other considerations are as in the earlier figure.

    pointer-component(A.X.byte-address) := Y-object-number-reg
    memory-barrier-instruction
    temp-reg-1 := map-S-bitstring[A.X.object-number]
    temp-reg-2 := map-notM-bitstring[Y-object-number-reg]
    temp-reg   := temp-reg-1 AND temp-reg-2
    if temp-reg nonzero then
        X(deferred-entry-reg) := A.X.object-number
        Y(deferred-entry-reg) := Y-object-number-reg
        deferred-entry-reg := deferred-entry-reg + action-entry-size-constant
        if deferred-entry-reg >= deferred-end-reg then
            call deferred-action-buffer-end-subroutine

Figure: Write Barrier RISC Pseudo-Code

what to do, i.e. which bits of X.S were cleared and which bits of Y.M were set. If more than one bit is changed, there may be work for more than one scavenger, but the mutator only sends the action to the highest priority of these scavengers, and after that one scavenger has done its part, it forwards the action to the next highest priority scavenger.

The time needed to process a deferred action buffer entry is a small number of RISC instructions, where, as before, the lower bound is from the pseudo-code and the upper bound is double the lower bound. In this case the assumption is made that the per-entry overhead for starting and stopping the processing of one buffer is half the per-entry loop overhead from the figure below. The overhead for starting and stopping the processing of one buffer includes the time to lock out scavengers before processing the buffer and reset this lock after processing the buffer. The buffer


The following loop processes the action entries in a deferred action buffer. For each entry, containing the object numbers of X and Y, some X.S bits may be turned off and some Y.M bits may be turned on (the complement of M is called notM, and its bits are turned off). Then an operation is sent to one of several scavengers with different priorities. In the case given below, the second scavenger is sent to, using a buffer whose current entry is pointed at by scavenge-entry-reg.

    loop:
        X-object-number-reg := X(deferred-entry-reg)
        Y-object-number-reg := Y(deferred-entry-reg)
        S-reg    := map-S-bitstring[X-object-number-reg]
        notM-reg := map-notM-bitstring[Y-object-number-reg]
        temp-reg := S-reg AND notM-reg
        jump to dispatch-table(temp-reg)
    case:
        S-reg := S-reg AND case-S-mask-constant
        map-S-bitstring[X-object-number-reg] := S-reg
        notM-reg := notM-reg AND case-notM-mask-constant
        map-notM-bitstring[Y-object-number-reg] := notM-reg
        scavenge-opcode(scavenge-entry-reg) := case-opcode-constant
        scavenge-X(scavenge-entry-reg) := X-object-number-reg
        scavenge-Y(scavenge-entry-reg) := Y-object-number-reg
        scavenge-entry-reg := scavenge-entry-reg + scavenge-entry-size-constant
        if scavenge-entry-reg >= scavenge-end-reg then
            call scavenge-buffer-done
        jump to end-of-dispatched-code
    end-of-dispatched-code:
        deferred-entry-reg := deferred-entry-reg + action-entry-size-constant
        if deferred-entry-reg < deferred-end-reg then
            jump to next iteration of loop
    end-of-loop:

Figure: Deferred Buffer Processing Loop RISC Pseudo-Code


need only be long enough to amortize this time.

The mutator write barrier overhead can be viewed as having two components: the time for the test instructions, and the time for the mutator to process an action. I will call the first component the mutator test time and the second component the mutator action time. The total mutator action time during any complete garbage detection is proportional to the number of used objects plus the number of objects added to the ephemeral root list. Thus the total action time is bounded by considerations similar to those that bound the total scavenger process time. These are discussed below.

For high priority processes, the deferred action buffer may be sent to a lower priority process for handling, moving the action time off to lower priority.

It is important to note that when a new object is initialized, only the write barrier test time occurs. This is because the new object's scavenged flags are not set, including the flag that really means not-ephemeral-root. No action is written in the deferred action buffer, and no buffer processing occurs. However, there may be exceptions to this during the write barrier end game (see above). Also, the test must still be made, for two reasons: first, illegal stores of local pointers into global storage must still be detected (see Local and Global Heaps above), and second, the new object may have its scavenged flag set during the write barrier end game.

The importance of a fast write barrier test should be clear.

Related Work on Write Barriers

Write barrier tests are described by Hosking, Moss, and Stefanovic in [HMS], which measures various write barrier schemes used to maintain ephemeral root-set lists of permanent objects that reference ephemeral objects. The tests there involve comparing generation numbers, trapping if one is less than the other. Unlike our system, there is no ability to suppress a trap if one has already occurred for the object being stored into (S bit already cleared). Also, there is no ability to have the T_ns write barrier test be for free in the presence of the T_e ephemeral root write barrier test.

Non-Mutator Overhead

In a concurrent system, garbage collection processes run concurrently with mutators.


These collection processes may do operations such as scavenging marked objects, sweeping unused objects into free blocks, and compacting used objects to make a single large free block.

The garbage collection processes require a certain amount of CPU time to process:

- each used object, for marking/scavenging/sweeping/compacting;
- each ephemeral root object, for marking/scavenging;
- each pointer in a used or ephemeral root object, for scavenging;
- each byte of a used object, for compacting;
- each unused object, for sweeping;
- each pointer to a root object, for marking.

I will call one garbage collection algorithm execution a gc cycle. Thus for each used object there is a certain amount of CPU time required during each gc cycle, which I refer to as the gc time overhead of the used object. This is a linear function of the number of pointers in the object and the size of the object. There is a similar linear function for ephemeral root objects, but the size in bytes of such objects does not contribute. There is a gc time overhead for each unused object, which is generally a small fixed time regardless of the contents and size of the object. And lastly, there is a small fixed gc time overhead for each pointer to a root object, to mark the root object.

To keep the total gc time overhead bounded, one must have:

- an upper bound for the number of used objects;
- an upper bound for the number of pointers to root objects;
- an upper bound for the number of ephemeral root objects;
- an upper bound for the number of pointers in used and ephemeral root objects;
- an upper bound for the number of bytes in used objects;
- an upper bound for the allocation rate of new objects;
- an upper bound for the rate of allocating bytes in new objects;
- a lower bound for the number of bytes of memory available for used and unused objects;
- a lower bound on the number of used plus unused objects allowed.

These items can be used to find bounds for the time taken by gc cycles, and this, with the various upper bounds, gives a minimum efficiency for the garbage collector.
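The linear cost model described above can be made concrete with a small sketch. The coefficients here are invented for illustration (the thesis does not give numeric values); the point is only the shape of the bound: a weighted sum of the quantities listed, each bounded by the items above.

```python
# Illustrative gc-cycle time bound: a linear function of used objects,
# their pointers and bytes, ephemeral roots, unused objects, and root
# pointers. All coefficients are made-up placeholders.

def gc_cycle_time_bound(used_objects, pointers_in_used, bytes_in_used,
                        ephemeral_roots, unused_objects, root_pointers,
                        c_obj=1.0, c_ptr=0.5, c_byte=0.1,
                        c_root=1.0, c_unused=0.2, c_rptr=0.2):
    return (c_obj    * used_objects +      # mark/scavenge/sweep/compact
            c_ptr    * pointers_in_used +  # scavenging each pointer
            c_byte   * bytes_in_used +     # compacting each byte
            c_root   * ephemeral_roots +   # mark/scavenge roots (no byte term)
            c_unused * unused_objects +    # small fixed sweep cost
            c_rptr   * root_pointers)      # small fixed marking cost

bound = gc_cycle_time_bound(used_objects=1000, pointers_in_used=4000,
                            bytes_in_used=64000, ephemeral_roots=50,
                            unused_objects=500, root_pointers=20)
```

Plugging bounded inputs into such a function is exactly how the listed upper bounds translate into a bound on cycle time, and hence into a minimum collector efficiency.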

A garbage collection cycle also needs memory for lists of objects to be scavenged and lists of ephemeral root objects. This memory has a size proportional to the number of used objects and the number of ephemeral root objects, respectively.

There is no problem bounding all the above appropriately for real-time applications, as long as the allocation rates are not too high and there is several times as much memory as required to hold the used objects.

There is one part of the write barrier garbage collector non-mutator overhead that is not bounded by the above considerations. This is the overhead resulting from write barrier end thrashing, which was discussed above.

There is also a question of how to define used object. Bounds on the number of used objects and the number of bytes they consume are experimentally measured for most systems, so any experimentally measurable definition will work. We therefore chose to define an object as used during a garbage collection cycle if it exists and is not collected at the end of the cycle. Similar practical definitions can be made for the other quantities above.

Use of ephemeral garbage collection helps keep the amount of memory needed down, by not counting permanent objects in any of the above bounds, except that ephemeral root objects count slightly in the time bound.

Real-time systems often use queues of free blocks of several fixed sizes to get good real-time allocation and deallocation times. This can also be done with our memory manager, using manual deletion to return an object's memory to a free queue. Then the byte allocation rate for the garbage collection cycle does not have to count bytes allocated and freed in this manner. Therefore cycles can be very much longer than they would otherwise have to be, and the cycle can be very efficient. The only part of an unused object that the cycle must collect is the object map entry, which is relatively small.


Summary

I believe that a memory manager based on two level addressing, as implemented in software by address registers, and on the bitstring AND write barrier test as discussed above, will be a strong candidate for a universal memory manager that will permit many different languages to interoperate.

Chapter

Data Types

Goals

As an intermediate programming language, the RCODE virtual machine language provides only data types matched to the hardware. Thus the data types of RCODE are various forms of integer and floating point numbers, and various forms of pointers.

The main goal of this chapter is to:

G: Standardize data types so that different high level languages, different computer hardware, and programs written by different people at different times can interoperate.

This is not a new goal. It has been pursued by the computer hardware community for decades, and has resulted in standardizing on the 8-bit byte, the two's complement integer, and the IEEE floating point number format. As far as number types are concerned, the differences between big and little endian computers, and between the alignment requirements of different computers, are the only significant remaining issues.

RCODE follows this trend, but is very careful about pointer data, because it must support the garbage collecting memory management of the previous chapter. RCODE is also very careful about big and little endian and alignment issues.

In addition, RCODE introduces other ways of formatting information that make it easier for programs written by different people at different times to interoperate. The first of these is tagged values, which are used in many LISP implementations. The second is including type information in pointers, so machine code discovers at run time the exact size of numbers pointed at. The third is type maps, so machine code discovers at run time both the size and displacements of structure components. The fourth is a type matching system, so machine code can translate at run time the actual types of data, as expressed by type maps, into the formal types expected by the code, also expressed by type maps. The fifth is array descriptors, so that machine code discovers array layouts at run time.

The method of this chapter is simply to extend the basic hardware data types that have become standard in the last several decades. RCODE includes all the extensions I have thought of that address G above, and that have a reasonable chance of being passably efficient on existing hardware (see C, earlier) and of being very efficient on potential future hardware. Only occasionally does this chapter discuss alternatives that have not been included in RCODE for one reason or another.

Requirements

The basic RCODE goal (see earlier):

G: A high school student knowing a bit of algebra and geometry can investigate and change commercially written games, editors, and simulators.

leads to programming languages of the strongly typed LISP variety. Basic LISP computes with tagged values, but if the user provides sufficient strongly-typed-language style declarations, LISP can deal with untagged values and be much more efficient on existing computers. Thence the requirement (quoted from earlier):

F_6: RCODE will support tagged and untagged values, and an efficient transition between the two kinds of data.

We also want programming languages that support mixing code written by different people: some commercial, some students, and some non-commercial professionals. To get this code to interoperate properly, it is desirable to delay until run time the matching of the actual type of data to the formal type expected by functions, to the extent allowed by efficiency goals. One way of doing this is to represent types by type maps, which tell the sizes and displacements of components of objects of the type, and the addresses of functions that operate on these objects. A type map is a vector organized analogously to a page table, and can be efficiently implemented in future hardware. RCODE attempts to make type maps as capable as they reasonably can be, and therefore strives to meet the following requirement:

F_4: RCODE will support type maps, and will efficiently handle execution of very small functions selected from a type map, and also efficiently handle loads and stores in which only displacement, numeric type, and size are stored in the type map.
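A load through a type map of the second kind (displacement, numeric type, and size only) can be sketched as follows. This is an illustrative model, not R-CODE's actual encoding; the table layout and names are invented, and a real type map would also hold function addresses and be organized like a page table.

```python
# Sketch of a type map driving a load: the map gives each component's
# byte displacement, numeric type, and size, and the load routine
# discovers the layout at run time instead of having it compiled in.

point_type_map = {
    # component -> (byte displacement, numeric type, size in bits)
    "x": (0, "signed", 32),
    "y": (4, "signed", 32),
}

def load_component(memory, base, type_map, name):
    disp, numeric_type, size = type_map[name]
    raw = bytes(memory[base + disp : base + disp + size // 8])
    # Big-endian here, matching the addressing convention discussed
    # later in this chapter.
    return int.from_bytes(raw, "big", signed=(numeric_type == "signed"))

memory = bytearray(16)
memory[0:4] = (7).to_bytes(4, "big", signed=True)
memory[4:8] = (-3).to_bytes(4, "big", signed=True)
x = load_component(memory, 0, point_type_map, "x")
y = load_component(memory, 0, point_type_map, "y")
```

Swapping in a different type map with different displacements changes where `load_component` looks without changing the code, which is the interoperability property type maps are meant to provide.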

Just as the map from an actual object to a formal object (i.e. the object as seen by a routine) is encoded in a type map, a linear map from a list of subscripts to an address within an array is encoded in a map we call an array descriptor. One of the things I learned during my decades writing glue software (see Background, earlier) is that array descriptors are the right way to treat array accesses. This is because there are many instances when pieces of an array need to be treated as arrays in their own right. Thus:

F_5: RCODE will support array descriptors, treating them as first class data, except that they may not be modified once created.
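The "linear map from subscripts to an address" can be sketched directly. This is an illustrative model, not R-CODE's descriptor format; the class and its fields are invented. The key property is that taking a piece of an array (here, a column) yields a new descriptor over the same storage.

```python
# Sketch of an array descriptor as an immutable linear map:
#   address = base + sum(stride[i] * subscript[i]).

class ArrayDescriptor:
    def __init__(self, base, strides, bounds):
        self.base = base
        self.strides = tuple(strides)
        self.bounds = tuple(bounds)   # immutable once created

    def address(self, subscripts):
        assert all(0 <= s < b for s, b in zip(subscripts, self.bounds))
        return self.base + sum(st * s
                               for st, s in zip(self.strides, subscripts))

    def column(self, j):
        # A column of a 2-D array is itself a 1-D array descriptor
        # over the same storage: a piece of an array treated as an
        # array in its own right.
        return ArrayDescriptor(self.base + self.strides[1] * j,
                               (self.strides[0],), (self.bounds[0],))

# A 3 x 4 array of 4-byte elements stored by rows at address 100:
a = ArrayDescriptor(base=100, strides=(16, 4), bounds=(3, 4))
addr = a.address((2, 1))       # 100 + 2*16 + 1*4
col = a.column(1)
col_addr = col.address((2,))   # the same element, via the column view
```

Immutability (F_5) is what makes it safe to hand out such sub-array descriptors freely: no holder of the column view can change the layout seen by holders of the full array.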

The rest of this chapter is organized into three main sections. The first describes basic RCODE number, pointer, and tagged types, and explains how RCODE pointers containing numeric type information are used. The second analyzes type maps. The third describes RCODE array descriptors.

Basic Data Types

Integer and floating point number types are standardized by existing hardware. RCODE supports signed two's complement integers and unsigned binary integers in a range of sizes, and floating point numbers of four sizes.

(Footnote: The format for an IEEE floating point number can be found in [Inc]. For the smallest IEEE size, the only sensible parameters are a few bits of mantissa and a few of exponent. If denormalized


One set of basic data type issues concerns how to transmit data efficiently between big and little endian computers, and between computers with different alignment requirements. A second set of issues concerns how to represent the tagged values required by LISP. A third set concerns how to represent and use pointers. This last includes whether or not to include type information in pointers, what information to include, and how to use this information efficiently at run time. The following subsections consider these three sets of issues in turn.

Memory Units

A memory unit is a data component that preserves its addressability when transported, without reformatting, between computers of different data formats. By addressability we mean merely the ability to compute the address of the memory unit. By computers of different data formats we include in particular computers of different endianhoods. An aligned 8-bit byte is a memory unit.

On the other hand, if we transmit a byte-aligned vector of bit elements from a big endian computer to a little endian computer, we get a result in which the elements cannot reasonably be addressed unless all bytes of the vector are reversed in memory, and the direction in which the elements are accessed is also reversed (see the figure below). Clearly the bit elements of this vector are not memory units. The entire bit vector might be considered a memory unit, because even without reversing its bytes, the vector itself is addressable after transmission.

RCODE requires that all memory units have a size in bits that is a power of two, and that all memory units be aligned, i.e. have an address that is an exact multiple of their length. Addressing memory units of this kind is the same for big and little endian computers, except for units of size less than 8 bits. For these, RCODE requires big endian addressing. It is possible to insist on big endian addressing of memory units even on little endian computers by just complementing some of the low order bit-address bits when accessing the unit, because memory units are aligned on power of 2 boundaries. Big endian is preferred because that is what IO devices usually prefer.

(Footnote, continued: numbers are supported, this gives adequate precision and a wide range. Only load and store operations, and not arithmetic, would be supported for these small floating point numbers. Implementations might have to do such floating point computations to reduced precision for efficiency.)

Figure: A Vector of Bit Elements. (The figure shows a byte-aligned vector of small elements transmitted from a big endian computer to a little endian computer; the elements become addressable again only after the byte order of the vector is reversed and the direction of element access is reversed.)
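The address-complement trick mentioned above can be sketched in one line of arithmetic. This is an illustrative sketch (the function name is invented); it shows the idea for byte addresses within an aligned power-of-two unit, and the same XOR applies to bit addresses within a byte.

```python
# Sketch of big endian addressing on a little endian machine: because
# memory units are aligned on power-of-two boundaries, XOR-ing the
# low-order address bits within the unit converts between the two
# element numberings.

def little_endian_byte_address(big_endian_addr, unit_size_bytes):
    # unit_size_bytes must be a power of two; the XOR flips only the
    # byte offset within the unit, leaving the unit address alone.
    return big_endian_addr ^ (unit_size_bytes - 1)

# Byte 0 of an aligned 4-byte unit in big endian numbering is byte 3
# in little endian numbering, and vice versa.
b0 = little_endian_byte_address(0, 4)
b3 = little_endian_byte_address(3, 4)
```

The complement is its own inverse, so the same operation converts addresses in either direction.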

RCODE requires that objects which are to be copied consist of contiguous disjoint memory units, with the displacement of each memory unit from the beginning of the object being a multiple of the size of the memory unit. The alignment of the object is then the size of the largest memory unit.

With these RCODE requirements, when an object is transmitted between a big and a little endian computer, the only endian conversion required is reversing the bytes of each memory unit larger than 8 bits. This solves the problem of efficiently transmitting data between computers of different endianhood.
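The conversion just described can be sketched as follows. This is an illustrative model, not R-CODE's transport code; the function and the layout representation are invented. Given the unit layout of an object, conversion is a single pass that reverses each multi-byte unit in place.

```python
# Sketch of the only endian conversion the memory-unit rules require:
# reverse the bytes of each memory unit larger than 8 bits; units of
# one byte, and the layout itself, are untouched.

def convert_endianness(obj_bytes, unit_layout):
    """unit_layout: list of (displacement, size_in_bytes) pairs for
    the memory units covering the object; per the rules above, sizes
    are powers of two and each displacement is a multiple of its
    unit's size."""
    out = bytearray(obj_bytes)
    for disp, size in unit_layout:
        if size > 1:   # units larger than 8 bits get byte-reversed
            out[disp:disp + size] = out[disp:disp + size][::-1]
    return bytes(out)

# An object made of one 4-byte unit followed by two 1-byte units:
obj = bytes([0x12, 0x34, 0x56, 0x78, 0xAA, 0xBB])
converted = convert_endianness(obj, [(0, 4), (4, 1), (5, 1)])
```

Note that the per-unit displacements and sizes are exactly the information a type map already carries, so no extra metadata is needed to transport an object.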

RCODE does not permit arbitrary bit fields (e.g. a …-bit quantity, or an unaligned …-bit quantity) to exist outside memory units. Such fields can only exist inside memory units, and can be extracted from these memory units by operations such as integer shifts. Therefore such bit fields can be ignored when transmitting data between computers and reformatting memory units to meet the needs of different computers. The reasons for this are:

1. As discussed above, arbitrary bit fields are not memory units and cannot sensibly be copied between computers of different endianhood.

2. Many modern computers do not have instructions for loading and storing bit fields directly into memory.

The consequence of the RCODE requirements is that to access a component, one must first access the memory unit that contains it, and then access the component within the memory unit. Displacements of memory units can be used to load or store memory units. Displacements of components within memory units can be used to extract components from the memory units, or merge components into the memory units.

The displacements of the memory units are computationally like one would expect displacements to be, but the displacements of components within memory units are more like the numeric type of the component, and need to be encoded in instructions by a compiler to be most efficient on many existing computers.
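The two-step access can be sketched as follows. This is a sketch under assumptions of my own: a 64-bit memory unit, and bit offsets counted from the low-order end:

```c
#include <stdint.h>

/* Sketch of RCODE's two-step component access: first load the aligned
 * memory unit, then extract the component from it with shifts and masks.
 * `offset` is the component's displacement within the unit, in bits from
 * the low-order end, and `width` is its size in bits. */
static uint64_t extract(uint64_t unit, unsigned offset, unsigned width)
{
    uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
    return (unit >> offset) & mask;
}

/* Merge a component back into the unit: the store-side counterpart. */
static uint64_t merge(uint64_t unit, unsigned offset, unsigned width,
                      uint64_t value)
{
    uint64_t mask = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
    return (unit & ~(mask << offset)) | ((value & mask) << offset);
}
```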

One can generalize RCODE's restrictive notion of memory units rather easily to include any value that is a multiple of eight bits in length and is aligned on a byte boundary. When transmitted between different endian computers, the bytes of the unit are reversed. The vector in Figure … would be an example. The major disadvantage is that unaligned data are much, much slower to access than aligned data on the coming generation of computers: a factor of … slower on the Alpha [Sit, pp. A-… and A-…].

RCODE memory units also have formats: signed integer, unsigned integer, floating point, and tagged value. These can be used to translate memory units when copying data between different kinds of computers. For example, floating point data might be converted. Format-specific translation of memory units is not needed when both computers have IEEE format floating point numbers.

Tagged Values

In order to support LISP, RCODE provides 64-bit tagged values modeled on the IEEE 64-bit floating point number. Floating point numbers are represented as per the IEEE format. Tagged integers and pointers are represented as signaling-NaNs: IEEE values that will cause a trap if input to a floating point operation. The non-signaling-NaN is reserved for use in indicating the output of an instruction whose inputs were illegal (see Chapter …, particularly section …, page …).

Figure … gives the basic format of RCODE 64-bit and 128-bit tagged values:

    tagged value ::= floating point | non-signaling-NaN | tagged signaling-NaN
    tag ::= integer | unsigned integer | boolean | pointer | …

[Figure …: 64-Bit and 128-Bit Tagged Values. A tagged signaling-NaN consists of a 32-bit tag followed by a 32-bit value (64-bit form) or a 96-bit value (128-bit form).]

For 64-bit tagged values, 32-bit quantities can be represented by a 32-bit tag and a 32-bit value, where the tag is chosen to make the whole an IEEE signaling-NaN. This is used to represent 32-bit signed and unsigned integers. It is also used to represent special values, such as the OMITTED value that can be passed to indicate an omitted optional argument, and the ERROR value that contains a …-bit error code and can be passed to indicate an error.
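The signaling-NaN encoding can be sketched in C. The tag prefix chosen here is an arbitrary assumption for illustration, not RCODE's actual tag assignment:

```c
#include <stdint.h>
#include <string.h>

typedef uint64_t tagged;                  /* a 64-bit tagged value */

#define EXP_MASK   0x7FF0000000000000ULL  /* IEEE double exponent bits */
#define QUIET_BIT  0x0008000000000000ULL  /* top mantissa bit: quiet NaN */
#define MANT_MASK  0x000FFFFFFFFFFFFFULL

/* Hypothetical tag for 32-bit signed integers: exponent all ones, quiet
 * bit clear, and a nonzero mantissa bit above the 32-bit value field, so
 * the whole is an IEEE signaling NaN. */
#define TAG_INT    0x7FF0000100000000ULL

static tagged box_double(double d)
{
    tagged t;
    memcpy(&t, &d, sizeof t);      /* ordinary doubles box as themselves */
    return t;
}

static tagged box_int(int32_t v)
{
    return TAG_INT | (uint32_t)v;  /* 32-bit tag, 32-bit value */
}

/* A tagged value is a floating point number unless it is a signaling
 * NaN: exponent all ones, quiet bit clear, mantissa nonzero. */
static int is_float(tagged t)
{
    return (t & EXP_MASK) != EXP_MASK   /* finite number */
        || (t & MANT_MASK) == 0         /* infinity */
        || (t & QUIET_BIT) != 0;        /* quiet NaN (reserved) */
}

static int32_t unbox_int(tagged t)
{
    return (int32_t)(uint32_t)t;
}
```

This is the same "NaN-boxing" idea later popularized by dynamic language implementations: every bit pattern is either a legitimate IEEE double or a tagged datum, and the two never collide.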

When a pointer is stored in a 64-bit tagged value, the 32-bit value part is used to hold the object number that designates the object being pointed at. This object number references an object map entry in a table, and the object map entry holds the address of the object (see section … in Chapter … for details, particularly Figure … on page …). The 32-bit tag in the pointer tagged value is used to hold a 20-bit type map number that references a type map, which describes the apparent type of the object. Type maps are discussed below, in section … of this chapter. Twenty bits is the largest number of bits that can reasonably be allocated for a type map number while still making the value a 64-bit IEEE signaling-NaN and leaving room for other tagged values.

A LISP system has been built by Umemura [Ume] that represents LISP tagged data as 64-bit VAX floating point numbers. Pointers without type map numbers are represented by invalid floating point numbers that cause traps when input to arithmetic operations. In this LISP system, integers and the corresponding floating point numbers are not distinguished, and floating point hardware instructions are used for LISP integer arithmetic. Arithmetic efficiency is comparable to FORTRAN.

RCODE register dataflow execution, as described in Chapter …, is supported by 128-bit tagged values that are used as the values of virtual registers. The registers are virtual in that they all have an associated type code that restricts their range of values and permits them to be stored more compactly in actual implementations. For example, the type code of a register might restrict the register's values to …-bit signed integers, so the implementation would store the register in a …-bit hardware register, but the RCODE debugging interface would present the value to a source language debugger as a 128-bit tagged value. So the human user can think of all register values as being 128-bit tagged values, while the hardware can view them more efficiently.

One of the dangers of using a virtual computer model as an intermediate representation for program code is that the constraints of packing information into the fixed size data units of a virtual computer will corrupt the semantics of the intermediate representation. RCODE avoids most of this problem by using very large virtual registers and very large virtual instructions. Thus, after some experimentation, 128 bits was settled on as large enough for a virtual register, though there is little reason one could not go to … bits.

The 128-bit tagged value format is based on the 128-bit IEEE floating point format,² just like the 64-bit tagged value format is based on the 64-bit format. Floating point numbers, including the infinities, are represented in IEEE format. Non-signaling-NaNs are reserved to indicate the outputs of instructions with illegal inputs. …-bit quantities such as signed and unsigned integers are prefaced with a tag. 128-bit tagged pointer data are discussed in the next section.

Pointers

A full pointer in RCODE is a 128-bit tagged pointer, formatted as in Figure …. The object number designates an object; the unsigned displacement designates the displacement, in bits, of a memory unit within the object; and the type designates the type of the component stored in the memory unit, or the component that begins with the memory unit. Types will be explained below.

² See [Inc…], Table …. It's not completely clear to me whether this is an official IEEE standard yet for the 128-bit size.

[Figure …: 128-Bit Tagged Pointer. A tag marking the value as a signaling NaN, a type field, an object number, and an unsigned displacement (32 bits each for the latter two).]

128-bit pointers are usually stored in virtual registers, and therefore are not stored by implementations directly in the format given. In particular, implementations usually store only a 32-bit displacement, which is in units of either bits or bytes depending upon the type field in the pointer. This means that implementations have permission to implement only … bits of displacement for bit aligned memory units, and to implement only … bits of displacement for byte aligned memory units.

The Pointer Type Field

There are three different formats for the 32-bit type field of a 128-bit tagged pointer, formatted as in Figure …. The scalar type, which we will explain below, is used for simple number and pointer components. The mapped type is used for components associated with a type map, and will be explained later within section …. A mapped type contains a type map number that points at a type map. The type map has the scalar type of the component in element …, so even a mapped type has a scalar type.

Lastly, the routine type is just a means of expanding the 32-bit type field to hold extra information, namely pointers to three routines that perform the component load, store, and load-address functions. The routine type contains a routine map number that points at a routine map, which holds this extra information and also holds the 32-bit type that would have been stored where the routine type was stored, if we did not need the extra information. This 32-bit type cannot itself be a routine type, but can be a scalar or mapped type. In either case one eventually finds a scalar type, so even a routine type has a scalar type.

[Figure …: The Pointer Type Field.
  Scalar type: access specification, copy type, component size, memory unit size, and within-memory-unit offset.
  Mapped type: access specification and a 20-bit type map number; the type map holds the scalar type and size.
  Routine type: access specification and a 20-bit routine map number; the routine map holds a scalar or mapped type and the load, store, and load-address routines.]

The access specification field of a scalar, mapped, or routine type specifies whether the component is write-only, read-only, read-write, or something more complicated. On existing computers, access specification information needs to be compiled into instructions to be efficient.

In parallel computing it is desirable to break code into blocks within which memory operations may be reordered by the compiler or hardware. Memory read operations on a read-only component can be reordered without problem, but read-write and even write-only operations on the same component cannot be. Thus it is desirable to introduce other, more order-independent access specifications, such as write-once, accumulator, and atomic-transaction. These are discussed in Chapter ….

Scalar Types

The scalar type contains a copy type and a component size. These two pieces of information are just what is needed to copy a scalar value between registers. In addition, the scalar type contains a memory unit size and a within-memory-unit offset that are needed to copy the scalar value between a register and memory.

Copy types are listed in Figure ….

    integer: signed integer
    unsigned: unsigned binary integer
    unsigned*2^N: unsigned binary integer that is an exact multiple of 2^N, for N = … or …
    float: floating point number
    tagged: 64- or 128-bit tagged value
    pointer: 64- or 128-bit tagged pointer
    object number: 32-bit object number
    type map number: 20-bit type map number
    routine map number: 20-bit routine map number
    reduced pointer: 64-bit pointer containing a 32-bit object number and a 32-bit unsigned displacement
    contiguous subobject: object that is a contiguous set of memory units
    discontiguous subobject: object that is scattered in memory

[Figure …: Copy Types]

Besides the normal number types, there are special types for storing numbers that are exact multiples of small powers of two. An unsigned*8 copy type value, for example, can be used to store an unsigned integer that is an exact multiple of 8 without storing the low order bits of the integer. The value actually stored in memory is multiplied by 8 to get the true value. Such a value can store a size that appears to be in units of one bit when in fact it is in units of one byte.

There are a variety of pointer formats. A reduced pointer contains just the information an implementation would store for a 128-bit tagged pointer, but without the type field (see Figure …). The displacement is stored as 32 bits, and is in either bit or byte units depending on the missing type field, which must be added later by an instruction.
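A reduced pointer can be sketched as a packed 64-bit value. The field order is an assumption for illustration:

```c
#include <stdint.h>

/* Sketch of a reduced pointer: a 64-bit value holding a 32-bit object
 * number and a 32-bit unsigned displacement, with no type field.  The
 * missing type information must be supplied later by an instruction. */
typedef uint64_t reduced_ptr;

static reduced_ptr reduce(uint32_t object_number, uint32_t displacement)
{
    return ((uint64_t)object_number << 32) | displacement;
}

static uint32_t object_number(reduced_ptr p) { return (uint32_t)(p >> 32); }
static uint32_t displacement(reduced_ptr p)  { return (uint32_t)p; }
```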

A contiguous subobject is a contiguous sequence of memory units that contain subobject components accessed via a type map. A discontiguous subobject is a set of components accessed via a type map, with the components being possibly scattered in memory. Subobjects are discussed more below, in section ….

The scalar type stores the size of the memory unit containing the scalar component, and the offset of the component within that memory unit. For integers, the memory unit size can be … or … bits, and the offset within the memory unit is up to … bits long. The displacement stored in the pointer is the displacement of the memory unit, and must be an exact multiple of the memory unit size, since all memory units in RCODE are aligned. Pointer components, including object and type map numbers, are all aligned, and their memory unit size and component size are equal. Pointer components only come in … or … bit sizes.

Loads and Stores

The next question is what happens when an RCODE load or store instruction inputs a 128-bit tagged pointer for the purpose of reading or writing a memory location.

RCODE has an instruction to check whether the type field of a pointer matches a particular value exactly. If this instruction has been applied to the pointer which the load or store instruction is going to use, the RCODE load or store can be compiled for existing computers into a single machine load or store instruction containing the specified type information.

It is also possible that the pointer was computed in the current subroutine by some instruction that specified the pointer type field, or that the type field will be known at compile time for some other similarly straightforward reason.

But if the type field is not known, the load or store must cope with many different types. I propose that RCODE implementations do this by making the load or store instruction into a case statement that switches on the type field of the pointer. Furthermore, the cases of this case statement are dynamically compiled: whenever a new value for the type field is seen by the case statement, it compiles a new case of itself to handle the new type field value. Thus load and store instructions compile into dynamic case statements.

If load and store instructions using the same type field can be batched, this will be more efficient. It may even be appropriate to batch such instructions with non-memory-reference instructions, as when the batch forms a tight loop. Such batching is facilitated by the fact that RCODE tries to permit instructions to be as reorderable as possible (see Chapter …). This kind of batching is clearly experimental, and considerable experience will be required to quantify its efficiency.
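A dynamic case statement can be sketched as follows. Real implementations would emit machine code for each new case; this sketch merely installs a canned handler, and every name in it is an illustrative assumption:

```c
#include <stdint.h>

/* Sketch of a load instruction implemented as a dynamic case statement.
 * The case table starts empty; each time a type field value is seen for
 * the first time, a handler for it is "compiled" (here: installed) and
 * reused on later executions.  No overflow check, to keep the sketch
 * short. */

typedef uint64_t (*load_fn)(const void *addr);

struct dyn_case {
    uint32_t type;          /* pointer type field value */
    load_fn  handler;       /* compiled case for that type */
};

#define MAX_CASES 16

struct dyn_load {
    struct dyn_case cases[MAX_CASES];
    int ncases;
};

/* The trap path: "compile" a new case for an unseen type field value. */
static uint64_t load_u32(const void *a) { return *(const uint32_t *)a; }
static uint64_t load_u64(const void *a) { return *(const uint64_t *)a; }

static load_fn compile_case(uint32_t type)
{
    return (type == 32) ? load_u32 : load_u64;
}

static uint64_t dyn_load(struct dyn_load *d, uint32_t type, const void *addr)
{
    for (int i = 0; i < d->ncases; i++)     /* the case statement */
        if (d->cases[i].type == type)
            return d->cases[i].handler(addr);
    /* miss: compile and install a new case, then dispatch to it */
    struct dyn_case c = { type, compile_case(type) };
    d->cases[d->ncases++] = c;
    return c.handler(addr);
}
```

After warm-up, executions take the short search-and-dispatch path, which is the property the batching discussion above relies on.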

When a routine receives an argument that is a pointer to an object or subobject, the routine typically applies a type matching operation to get a new pointer with a match type. Then this new pointer is used with load and store instructions. These load and store instructions are dynamic case statements switching on the match type map number. But the implementation can streamline this situation by having the load and store dynamic case statements switch directly on the type map number from the original pointer argument, so the type matching operation is never actually executed except while compiling a new case.

An advantage of the dynamic case statement approach is that one is not limited to compiling cases that contain single load or store instructions, but may compile inline whole small routines to perform loads and stores. Routine types take advantage of this by specifying routines to perform component load, store, and load-address operations. These routines can be compiled inline if they are small, or calls to them can be compiled if they are large. Inlining is becoming increasingly important: for example, one manufacturer of newest generation computers recommends that routines of … or fewer instructions be inlined for efficiency [Sit, Section A-…].

Load and store instructions operate on the scalar type of a pointer. A pointer may have a mapped type, which in turn has a scalar type stored in a type map. Thus the pointer may point at scalar numeric elements of an array, for example, while there is still a type map associated with the type of these elements. This type map may contain function pointers for operations on the array elements, such as the max operation mentioned in section … below. Calls to such functions may be separately treated as dynamic case statements, or may be batched with load and store instructions that are so treated.

Dynamic case statements have two differences from the hash table based dynamic dispatching common to object oriented languages. First, the table searched by a dynamic case statement is very much smaller than the hash tables: often it may be only … to … elements, and sometimes … or so. Second, the code being dispatched to can be inlined at the call site by the dynamic case statement.

Related Work on Dynamic Compilation

Holzle and Ungar have implemented a compiler for the SELF language [UU] that is very similar to dynamic case statements. The difference is that the SELF system does not compile new cases dynamically at run time, but rather feeds back to a compiler a histogram of the frequency with which various types are encountered at run time by a particular object oriented method call. If a case has not been compiled inline, it is handled by a general object oriented method dispatcher. Their system handles …% of all method invocations with cases compiled inline.

There has been considerable additional work on dynamic or adaptive compiling in general, and not just on compiling case statements switching on types. The work on SELF referenced in [UU] can be used as a starting point for accessing this literature.

The advantage of RCODE should be that compilation of new cases should be very fast, because RCODE is like machine language. Thus dynamic on-the-fly compilation should be doable with little run time penalty. This remains to be tested in practice. And of course, if one wants full optimization, then recompilation of full routines using knowledge of the types seen so far would be necessary, and take time.

Lastly, an RCODE implementation also needs some way of handling particular situations where classical object oriented hash table dispatch is really the best approach, because a particular dynamic case statement would have too many cases that would not benefit from inlining.

Type Maps

A type map is a vector of component descriptors (see Figure …), where each component descriptor tells how to access a component within a block of memory that represents an object.³ Type maps are used to complete the definition of a type at run time.

³ Type maps in some form were invented at least as early as [Str, p. …].

[Figure …: Type Maps. A type map is a vector of component descriptors. A non-constant component descriptor contains: the displacement of the memory unit and the displacement of the component within the memory unit, or just the displacement of the component; the size of the memory unit in bits, or the alignment of the component; the size of the component in bits; a copy type identifier; an actual type identifier or match type map identifier; an access specification; and load, store, and load-address routine identifiers.]

For example, suppose a subroutine is passed an object that it knows to be of formal type animal. It knows that all animal objects have a weight component, but it does not know at compile time where the weight component is within the object. However, it does know that at run time a type map will be provided to enable the object to be viewed as an animal, and that element … of this type map will be the

displacement of the weight component in the object.

A type map may also contain constants. These may be thought of as descriptions of components that depend only on the type of an object. Such constant components can be addresses of functions that tell how to perform operations on objects of a given type. Or constant components can be addresses of type-related variable data, such as free lists for allocating objects of a given type.

For example, suppose a subroutine is passed an array whose elements have formal type lattice element. The subroutine knows that all lattice element types have a max operation, but does not know at compile time how to perform this operation. However, it does know that at run time a type map will be provided to enable the array element values to be viewed as having values of type lattice element, and that element … of this type map will be the address of a function that will implement the max operation for these values.
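The lattice-element example can be sketched as a function address stored at a fixed type map element. The index, layouts, and names are assumptions for illustration:

```c
#include <stdint.h>

/* Sketch of a type map constant holding a function address: element
 * LATTICE_MAX of any lattice-element type map is the max operation for
 * that element type. */

typedef int64_t (*binop)(int64_t, int64_t);

struct lattice_map {
    binop ops[4];           /* constant components: operation addresses */
};

enum { LATTICE_MAX = 1 };   /* arbitrary fixed element index */

/* Generic code: fold max over an array whose element type it cannot
 * otherwise interpret, using only the supplied type map. */
static int64_t fold_max(const int64_t *a, int n, const struct lattice_map *m)
{
    int64_t acc = a[0];
    for (int i = 1; i < n; i++)
        acc = m->ops[LATTICE_MAX](acc, a[i]);
    return acc;
}

/* One actual type's implementation of max, installed in its type map. */
static int64_t int_max(int64_t x, int64_t y) { return x > y ? x : y; }
```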

Type maps are computed as follows. Every object or value has an actual type that denotes the object format and contents. Every piece of code that references the object or value considers the object or value to have a formal type, which denotes what the code knows about the format and contents of the object or value. There is a type matching operation that inputs an actual type and a formal type and outputs a type map. This type map is called the match type map, and assists the code in interpreting the object or value as having the desired formal type, as the above two examples illustrate.

Note that languages usually permit every actual type to be used as a formal type. Also, formal types that are not also actual types are often called abstract types, and are supported only by more recent languages.

Below I will discuss type maps somewhat abstractly. It may help the reader to know that in the end, RCODE non-constant component descriptors turn out to be just like 128-bit tagged pointers (page … above) without the object number.

In RCODE, type maps provide a notion of a virtual object that is mapped onto an actual object. The actual objects and the code using them may be created by different programmers at different times, and the type maps fix up the difference by presenting the code with the correct virtual objects. Or many different types of actual object with something in common may be processed by the same code, by making all these objects appear to have the same formal type using type maps. All of this promotes the goal G… of permitting programs to interoperate.

Type maps are required to be actual vectors; they may not be conceptual vectors implemented by hash tables. If we were to allow type maps to be hash tables, some of the analysis would be quite different. Another way of putting this is to say that any use of hash tables is confined to the type matching operation, and the output of this matching operation is a type map that is a true vector that will be accessed using element indices known at compile time. Furthermore, as we will see below in sections … and …, type matching can often be done using only vector element accesses with indices known at compile time.

Issues to be Analyzed

A basic principle of RCODE is to provide maximally capable type maps. The method of this section is to analyze what is possible with type maps and include the best of these possibilities in RCODE. The issues I will discuss are:

- How might type matching be done?
- Should type maps be constant?
- How should actual types be specified?
- What information can be put into a component descriptor?
- What is in an RCODE component descriptor?
- How are subobjects handled?
- Examples of deficiencies in existing languages.

One issue not analyzed is how to architect future hardware that will optimally implement more capable type maps.

Type Map Related Work

Although limited type maps have been used by various languages for decades, no one seems to have attempted a systematic analysis of their potential.⁴ Note that type maps in our sense are intended to be high performance: we exclude the many low performance dispatch table methods, and methods that always make out-of-line subroutine calls.

Detailed information has been published on the type maps used in C++ [Str, p. …], EIFFEL [MS, page …], SATHER [MS], and HASKELL [Jon, section …].

Languages that have made some use of type maps have not been comprehensive about doing so. Probably this is because type maps are introduced only to implement certain features of the languages.

However, we believe that the most successful programming languages provide relatively complete control over their target hardware and run time systems, that is, over their computer model. Examples are C and LISP. Proceeding in this manner, we believe languages should provide complete control over type maps, permitting the latter to do all they can.

⁴ The paper by Milton and Schmidt [MS] comes close. It contains an analysis of EIFFEL and SATHER dispatching, but does not contain topics, such as general type matching and the inclusion of type information in component descriptors, that arise when one attempts to analyze type maps independently of any language.

C++ uses type maps that it calls virtual function tables, which contain displacements and function addresses. These tables could easily contain data addresses and addresses of other type maps, but C++ does not support this. Thus C++ does not provide complete control of its computer model. This assertion also applies to other languages.

We discuss such issues in more detail in sections … and …, after our analysis of type maps.

In order to design RCODE, a comprehensive analysis of the possibilities of type maps is required. Since none is available, I have produced my own, the current version of which is reported in the following subsections.

Type Matching

The type matching operation inputs an actual type and a formal type, and outputs a type map.⁵

⁵ In HASKELL [HF+], the actual type is called a type, the formal type is called a class, and the type map is built rather explicitly by an instance declaration.

There are two kinds of type matching: static type matching, for which enough is known at compile time to permit the compiler to ensure that matching will work if there are no compile errors, and dynamic type matching, for which the actual type is sufficiently unknown at compile time that the match might fail. There are several ways to do dynamic type matching.

Static Type Matching

A language such as HASKELL performs type matching by applying one of the following cases:

1. The compiler knows the actual type. In this case the compiler computes the type map. Note that the compiler always knows the formal type.

2. The compiler does not know the actual type, but does know a formal type X, and knows that formal type X inherits from the formal type Y for which we desire to obtain the type map. The compiler also knows the location of an X type map resulting from matching the formal type X with the actual type (see Figure …). And furthermore, when the X type map was created, a pointer to the desired Y type map (matching Y to the actual type) was stored in the X type map, along with pointers to type maps for any other formal type from which X inherited. So the compiler merely outputs code to read the pointer to the Y type map from the X type map.

[Figure …: Static Type Matching. When type X inherits from type Y, the X type map contains a pointer to the Y type map.]

Dynamic Type Matching in RCODE

Communication systems and general databases transmit and hold data of arbitrary type. When extracting data from such subsystems, code is compiled that does not know the actual type of the datum until run time, and needs to test whether the actual type can be matched to various formal types.

There are several ways to do this dynamic type matching. The following system is proposed for RCODE.

In RCODE, both actual and formal types are represented by type maps, called the actual type map and the formal type map. The RCODE type matching operation

produces a match type map from an actual and a formal type map. This operation is memoizing: it remembers its former results and replays them. When it has no former result, it traps to software that can be supplied by the compiler or user.

RCODE supplies default trap software for performing matches that have not been done before (see Figure …). This software expects formal type maps to have formal component descriptors that give information about corresponding match type map descriptors. Each component descriptor in an actual or formal type map has an associated …-bit component ID. If the i-th component descriptor in the formal type map has component ID x, the actual type map component descriptor with component ID x is found, and this becomes the i-th component descriptor in the match type map.

[Figure …: RCODE Default Type Matching. For each index i, the formal type map's component ID list gives a component ID x; the actual type map's descriptor with component ID x becomes the i-th component descriptor of the match type map.]
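The default matching algorithm can be sketched as follows. The descriptor and map layouts, the linear ID search, and the error convention are assumptions for illustration:

```c
#include <stdint.h>

/* Sketch of RCODE's default type matching: for each formal component
 * descriptor, find the actual descriptor carrying the same component ID
 * and copy it into the match type map.  Returns 0 on success, -1 if
 * some formal ID has no actual counterpart. */

struct descriptor { uint32_t displacement; };

struct typemap {
    int n;
    uint32_t          id[16];     /* component IDs, parallel to desc[] */
    struct descriptor desc[16];
};

static int match(const struct typemap *actual,
                 const struct typemap *formal,
                 struct typemap *out)
{
    out->n = formal->n;
    for (int i = 0; i < formal->n; i++) {
        int found = -1;
        for (int j = 0; j < actual->n; j++)
            if (actual->id[j] == formal->id[i]) { found = j; break; }
        if (found < 0)
            return -1;            /* error: no such component */
        out->id[i] = formal->id[i];
        out->desc[i] = actual->desc[found];
    }
    return 0;
}
```

Since the operation is memoized, this search runs only the first time a given actual/formal pair is matched; thereafter the stored match type map is replayed.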

An error occurs if there is no actual component descriptor with ID x, or if the actual descriptor with ID x violates any specification that is encoded in the formal component descriptor of ID x (see Formal Component Descriptors below, page …).

As mentioned in the section on Loads and Stores above (page …), the type matching operation can be optimized away when its match type map result is used by a dynamic case statement, by having the case statement switch on the actual type input to the type matching operation instead of on the match type output. Thus, although the type matching operation may need a hash table to look up memoized results, the hash table will not be used during normal execution, but only when compiling a new dynamic case.

Dynamic Type Matching via Formal Types

Suppose type X inherits from type Y, and we have some actual type T whose objects may contain an X object with its Y subobject, or may contain just a Y object with no X object (see Figure …). The result of matching T with a formal type Y is some Y type map, and the result of matching T with type X is some X type map, or an error. If objects of type T contain an X object with its Y subobject, then we could put a pointer to the X type map in some particular element of the Y type map, and easily obtain a pointer to the X type map from a pointer to the Y type map. But if objects of type T just contain a Y object outside any X object, then there is no X type map, and we can put a null in the Y type map element where the pointer to the X type map should be, to indicate that the actual type does not match with X.

However, it may be that objects of type Y are not at displacement zero when included in objects of type X, so in order to pass from objects of type Y to objects of type X, one must also subtract the displacement of Y objects in X objects from the address of the Y object. So in addition, this displacement needs to be stored in all type maps for formal type Y.

Thus when we derive a type X from a type Y, we add to all Y type maps two constants: first, the address of an X type map or null, and second, the displacement of Y objects in X objects. In RCODE it turns out that these two pieces of information can be combined in a single component descriptor that makes the X objects look like components of some Y objects. If a Y object is in fact in an X object, this X component is present; otherwise it is null.

A disadvantage of this approach occurs if there are very many types X that inherit from one type Y. Then the number of different formal type maps corresponding to Y goes up as the number of types X inheriting from Y, and the number of constants in each of these type maps goes up as the number of types X inheriting from Y, and so the total number of constants in type maps goes up as the square of the number of types X inheriting from Y.

An alternative approach just stores the identifier of the actual type from which a type map was derived in the type map, permitting matches to other formal types

[Figure …: Dynamic Type Matching by Formal Types. When type X inherits from type Y, a Y-inside-X type map contains a pointer to the X type map and the displacement of the Y subobject within X objects; a Y-outside-X type map contains a null in place of the X type map pointer.]

using this identifier. However, these matches would presumably have to be done using a hash table.

Suppose all components of Y are accessed by obtaining their displacement from the appropriate Y type map. Then it is possible to insist that the displacement of Y in X always be zero, and that any actual non-zero displacement be compensated for by adding the true displacement of Y in X to the displacements of the Y components in the Y-inside-X type map (see Figure …). This would permit the displacement of Y in X to be omitted from this type map, as it could be assumed to be zero. It is only necessary to include this displacement if it is needed, as when some Y components are at fixed offsets from the start of Y, or when the start of a contiguous Y is needed to copy Y to another location in memory. It is also necessary to include this displacement in any system (see Actual Type Specification below, page …) in which objects begin with an actual type identifier specifying their position within other, larger objects. Usually the displacement will be needed for one of these reasons.


The Formal Type ANY

Conceptually, it is possible to have a formal type ANY from which all other types inherit. As suggested in the last section, a match type map for some actual type and ANY would contain pointers to all type maps resulting from matching the actual type. Such a match type map is conceptually equivalent to an actual type identifier, and can be used to implement the dynamic match operation by simply reading a vector element.

ANY match type maps would be very sparse vectors and would take a lot of memory, unless they were stored as dispatch tables instead of as type map vectors. But such dispatch tables are similar to, and less flexible than, the R-CODE type matching memoizer.

The R-CODE default type matching algorithm specified above is, conceptually at least, as general as ANY-based type matching. If each formal component descriptor had its own unique component ID, then we could add to each actual type that had a corresponding component the correct actual component descriptor with that component ID.

Whether this is economical depends upon how types are actually used. The R-CODE model tends to view actual and formal types as subsets of the universe of all possible kinds of components, where a particular kind of component, say the weight of a body, is shared by many different types. If this view is close to the user's, the R-CODE model will be economical.

Formal Component Descriptors

R-CODE represents formal types by formal type maps containing formal component descriptors. Formal component descriptors contain information about corresponding match type map component descriptors. Formal type maps and formal component descriptors are always known at compile time.

For example, a formal component descriptor may specify the corresponding match type map component descriptor completely, permitting code using this match descriptor to compile into instructions that contain all information from the descriptor. This is just what is needed to compile the C programming language.

Or the formal component descriptor may specify everything but the component displacement, permitting the instructions to contain all information but the displacement, which they get from the match type map at run time.


If a formal component descriptor leaves too much unspecified, efficient component load instructions cannot be compiled in the normal way. Instead, each load instruction is compiled as a case statement switching on the identifier of the match type map. Whenever a match type map that the instruction has never seen before is encountered, another case of the instruction is dynamically compiled.

Similar dynamic case statement methods are used for store instructions and for load-address instructions used to locate components. It may be possible to batch such cased instructions together when they are dependent on the same match type map, and produce a single case statement for the entire batch, thus reducing case branch overhead.

When the type map components are binary functions implementing operations such as max, these operations may be made into dynamic case statements switching on the type map, with the functions pointed at by the type map being dynamically compiled inline. Dynamic case statements and batching may be used for such operations just as for load and store instructions.

Thus formal component descriptors in R-CODE can specify enough to compile efficient code directly, as in a C compiler, or leave enough open to require dynamic case statements for loads, stores, and other operations.

Constancy of Type Maps

R-CODE assumes that type maps do not change during the execution of a program, so that code can be dynamically compiled using information in type maps. Thus type maps are like inlineable code: changing them forces recompilation of all code that inlined information from the type maps.

Similarly, R-CODE assumes the match operation does not change and can be memoized.

Adding Component Descriptors to Type Maps

In the section Dynamic Type Matching via Formal Types above, we noted that deriving a new formal type X from an existing formal type Y might add component descriptors to the existing Y type maps. Adding component descriptors to the end of existing type maps does not compromise the constancy of type maps. It also permits new formal operations to be defined for existing formal types when new code is loaded, which can be useful.

Type Variables

An object type (e.g., a C++ class) may have variables associated with it as well as constants. There are two reasons not to store such variables in a type map:

1. The same formal object type may require many match type maps to accommodate different object layouts (i.e., actual types), and these should all share the same type variables. So each match type map should have a pointer to the shared type variables.

2. It is important to know that type map information is constant, so compilation, initial or dynamic, can trust the information not to change any more often than code changes.

Actual Type Specification

An actual type identifier can be put into three places:

1. In machine instructions.

2. In the data.

3. In a pointer to the data.

These options are roughly in order of increasing flexibility.

An actual type identifier may be the address of a type map or a numeric code. When actual type identifiers are stored in objects, there are reasons for using numeric codes instead of addresses:

1. The data is to be copied directly between RAM memory and an external medium.

2. The data is in RAM memory shared by several different programs that are linked separately and may need different versions of type maps. For example, a particular function may be at different addresses in the code space of different links, so that function addresses stored in type maps are determined by the type and the program.

3. The data needs to be compact and cannot afford the space of a full type map address.

Reasons 2 and 3 are also valid when actual type identifiers are stored in pointers.

R-CODE uses type map numbers to identify type maps, and uses type maps to identify actual types. R-CODE can store type map numbers in pointers or in data. Numeric codes are used for all three reasons above, and also because the R-CODE memory management system indirects all addresses anyway, to permit objects to be moved at any time, so the indirection needed to look up a numeric code in a table would be done by R-CODE in any case.

An Actual Type Specification Example

The second of these reasons arises in large real-time systems, and we will sketch an example, because most language designers are unfamiliar with such systems. Our example large real-time system has a few hundred thousand lines of code loaded at one time into the same symmetric multiprocessor shared memory computer. Communication is by message passing between separate programs running at different priority levels. This is because no one has figured out how to efficiently debug large real-time systems without breaking them up into modest sized programs that take in a single stream of messages and send out messages to several streams. Also, it is uneconomical to try to link too much code together, so each modest sized program is linked separately.

In spite of the fact that communication is by message passing, the messages are stored in shared memory and only pointers are actually passed; the messages become read-only once sent. This is because systems that actually copy messages, even just memory to memory, use so much computer time doing the copying, and cause such worries for programmers trying to optimize message size, that they are not competitive.


So far we have described systems that actually exist. Now let us hypothesize an object oriented approach to organizing the code.⁶ Most code will operate on objects that are packed into messages. When a message arrives in the input of a program, the program control code figures out what to do, constructs output messages, and then passes pieces of input and output messages, expressed as objects, to a subroutine library to perform computations.

The objects will be denoted by addresses of actual type identifiers stored in messages. Given the address of an actual type identifier, a subroutine can locate the type maps it needs by type matching operations, and these tell how to access everything else in the object, as the object is stored in that particular message, starting from the address of the actual type identifier. The same formal type of object may be packed in different ways into different messages, by varying the actual type identifier and type maps.

In this hypothesized paradigm for developing programs, separate programmers, called system integrators, establish which objects are packed into which messages, and write the program control code. The mission of these integrators is to get each computation done at the right priority level and within the proper latency. To do this they move information and computation around in ways that would upset a mathematical programmer who just looked at the computations that needed to be done. The paradigm will work if the object subroutine libraries give the system integrators enough room to maneuver.

Note that what we have just described is not currently done, because support for type maps is too weak in any existing or proposed language we know of. Instead, the details of the structure of each message are written into the algorithm code for processing that message. This limits code reusability and makes it harder to move computations around, making system integration more difficult. It is not automatically clear, however, whether the system we have just proposed would improve reusability and system integratability enough to be economical.

The system we have just proposed stores type identifiers in the data, and passes pointers to these identifiers to subroutines. But this system is not as flexible as one that stores type identifiers in the pointers and not in the data, as we shall see.

⁶ This approach was suggested to the author by a comment of Dr. Richard M. Elkin of Raytheon Corporation.


Consider the following problem with the system proposed above that stores type identifiers in the data. The system integrators find that an existing message needs to be viewed as containing several new objects, which it did not previously need to appear to contain, but whose components are already in the message. This means new actual type identifiers need to be added to the message, but nothing else. In a real-time system this might not be a problem, but if we add long term data storage of messages to the system, then we may find it very inconvenient to change messages just to view them in a new way.

Therefore we propose a different system, in which each message begins with an actual type identifier that is the only actual type identifier in the message. Matching this to an application specific formal type obtains a type map that contains, for each object in the message, a subobject component descriptor telling the actual type and relative location of that object. From these descriptors, pointers to objects are obtained that contain the actual types of the objects. These pointers, with their contained actual types, are passed to subroutines that compute with objects.

This hypothetical example indicates some of the possibilities of type maps, and some of the variations that can be played on the type map theme. R-CODE encodes actual types as type map numbers in its tagged pointers. R-CODE can also encode actual types as type map numbers stored in objects, and use reduced pointers to point at these type map numbers. Reduced pointers have only an object number and a displacement, and must be combined with the type map number they point at to make a full tagged R-CODE pointer.

Component Descriptor Contents

A component descriptor might include some of the following information (see the above sections for definitions of terms):

- The displacement within the object of the memory unit containing the component, and the displacement of the component within this memory unit, if the component is contained in a memory unit; or the displacement within the object of the component, if the component is not contained in a memory unit.

- The size of the memory unit containing the component, if the component is contained in a memory unit; or the alignment of the component in memory, if the component is not contained in a memory unit.

- The size of the component, if the component is contained in a memory unit. If the component is not contained in a memory unit, the component will be a subobject, and its size will be stored in one of its components as determined by its type map; the size component may be a constant stored in the type map or a value stored in the subobject.

- An identifier for the copy type of the component. Some standard copy types are integer, floating point, pointer, contiguous subobject, and discontiguous subobject.

- An identifier for the actual type of the component, for use in type matching; or an identifier for a match type map that is the result of matching the actual type with some formal type.

- An access specification indicating whether the component is read-only, write-only, write-once, read-write, etc.

- Identifiers of routines to be called to perform the load, store, and load-address operations on the component, if the component is not to be accessed by standard memory load, store, and load-address instructions.

- The value of the component, if the component is a constant determined only by the type map. Of course, in this case much of the above information is unnecessary.

Some of this information may be used by existing computers without too much inefficiency and without being compiled into instructions; memory unit displacements fall in this category. Other information must be compiled into machine instructions on current computers to be efficient.


R-CODE Component Descriptors

R-CODE component descriptors are tagged values. Type maps and component descriptors in R-CODE are like code, and are virtual in the same sense: their apparent value is not the same as their implementation.

R-CODE has two kinds of component descriptors: actual and formal. The actual component descriptors appear in actual and match type maps, while the formal component descriptors appear in formal type maps.

R-CODE Actual Component Descriptors

R-CODE has two kinds of actual component descriptors: constant and non-constant.

Constant component descriptors are just constants, which are the values of the components described. The components are therefore constants dependent only on the type, or more specifically, on the type map. Function pointers are typically constant descriptors. A contiguous object size that is not stored in its object is a constant descriptor.

Any R-CODE tagged value may be used as a constant component descriptor, except for non-constant component descriptor values.

Non-constant component descriptors have almost the same format as R-CODE tagged pointers. Their format, given in the figure below, is just like the pointer, except that the object number in the pointer is replaced by a lock number, the displacement is signed in the component descriptor and unsigned in the pointer, and the tag is FFFF₁₆ for the non-constant component descriptor and FFFFC₁₆ for the pointer.

Lock numbers are only used for components involved in atomic transactions, discussed in a later chapter.

An R-CODE component load instruction specifies a pointer that points at an object, or at a subobject inside an object, and also specifies a component index that is the index of a type map element. The pointer must contain a mapped type that has a type map number identifying a type map. The load instruction component index is the index of an element in this type map vector, and this element is the component descriptor that will be used to access the component.

If this component descriptor is a constant component descriptor, the value of the component descriptor is the value of the component loaded by the load instruction.


[Figure: R-CODE Non-Constant Component Descriptors. A non-constant descriptor has a tag of FFFF₁₆ (24 bits), a lock number (40 bits), a type (32 bits), and a signed displacement (32 bits). The type field is a scalar type, a mapped type, or a routine type. A scalar type carries a copy type, an access specification, a memory unit size, a component size, and a within-memory-unit offset. A mapped type carries an access specification and a type map number. A routine type carries an access specification and a routine map number, where the routine map holds the load routine, store routine, and load-address routine.]


If the component descriptor is not a constant component descriptor, a pointer to the component is formed by taking the type from the component descriptor, the object number from the object pointer, and the sum of the displacements from the component descriptor and the object pointer. This pointer is then used as input to the load instruction, in order to load the component, as described above in the section on Loads and Stores.

An R-CODE component store instruction computes a pointer in the same way as a component load instruction. Constant component descriptors cannot, of course, be used with store instructions.

R-CODE Formal Component Descriptors

R-CODE formal component descriptors have almost the same format as R-CODE actual component descriptors, but have some special values for the various fields, and have some extra fields. Any actual component descriptor can be used as a formal component descriptor without any special field values or extra fields. When special values or extra fields are present, the tag in the descriptor is changed from FFFF₁₆ to indicate their presence. The extra fields are stored where the lock number is stored in an actual component descriptor; the lock number is not needed in a formal component descriptor.

Formal component descriptors are matched against actual descriptors to determine whether the actual descriptor is legal in some situation. If the formal component descriptor has no special values in any of its fields and no extra fields, the actual descriptor must equal the formal descriptor exactly.

The displacement, component size, copy type, and access specification fields of a formal component descriptor can be given the special UNKNOWN value, which indicates that the corresponding field of the actual component descriptor can be anything.

The displacement field of a formal descriptor can be given one of several special ALIGNED values, which constrain the displacement in the actual component descriptor to be a multiple of some power of two, without otherwise constraining the actual displacement.

One of the special fields is the match flag. If this is set, any type map number supplied by the actual component descriptor must denote a match type map, obtained by matching some actual type map number with the formal type map number supplied by the formal component descriptor.

Another special field is the load convert field, which can be set to one of the values NOCONVERT, STATICCONVERT, or DYNAMICCONVERT. If this field has the NOCONVERT value, and reading the component is permitted by its access type, then any specified formal copy type and component size must match the actual component descriptor copy type and component size. If the value is STATICCONVERT, it must always be possible to convert a value with the actual component copy type and size to a value with the formal component copy type and size; no range error can occur during this conversion. But if the load convert value is DYNAMICCONVERT, then it must be possible to perform this conversion for many likely values of the actual type, but other values may lead to a range or overflow error during run time.

So, for example, an unsigned actual integer is statically convertible to a wider signed formal integer, but only dynamically convertible to a signed integer of the same width; in the latter case, values in the upper half of the unsigned range will cause overflow errors.

Similarly, there is another special store convert field, with the same set of values as the load convert field, that controls conversion of values of the formal scalar type to values of the actual scalar type, in the same manner that the load convert field controls conversion in the opposite direction.

Subobjects

A subobject is part of an object; it can be the whole object or just a subset.

A subobject has a subobject address, which consists of the object number of the containing object and a displacement within that object.

A contiguous subobject is a contiguous set of disjoint memory units, the first of which is at the subobject address. A contiguous subobject has a size and an alignment. The subobject may be copied from one memory location to another by copying it as a bit string, respecting the required alignment. It may be copied into or from a sequence of registers by copying each subobject memory unit to or from a separate register.

All memory units in a subobject must be aligned relative to the beginning of the subobject. The alignment of a subobject is the size of its largest memory unit. A contiguous subobject can be copied to a computer with a different endianhood, and all that need be done on receipt is to reformat (e.g., byte reverse) each memory unit individually.

The size in bits of a contiguous subobject is stored as the second component of the subobject, as described by the second element of a type map (the first component is the scalar type). The size in bytes or words may be stored in the subobject itself, by giving this second component one of the unsigned scalar types.

A discontiguous subobject is a set of parts of memory units, and has an alignment equal to the size of the largest of these memory units. All the memory units must be aligned properly relative to the address of the discontiguous subobject. A discontiguous subobject has no size.

A discontiguous subobject cannot be copied at all, and therefore cannot be passed as an argument or returned by copying. However, addresses pointing at the discontiguous subobject can be passed.

A tagged pointer that points at a subobject contains the address of the subobject and a mapped type giving the type map to be used in accessing the subobject. The scalar type of the subobject has the contiguous subobject or discontiguous subobject copy type, and stores the alignment of the subobject as the component size field of the scalar type.

In addition to loading and storing components of subobjects, it is possible to load a contiguous subobject into registers, or produce a copy of the subobject in the frame heap; and similarly, a contiguous subobject may be stored from registers or from a frame heap copy of the subobject.

Examples of Deficiencies in C++

Note that C++ uses the word class where we use the word type.

C++ has type maps, called virtual function tables, that hold virtual function addresses and displacements used to adjust the current object pointer when a virtual function is called [Str].

A C++ type map also holds a pointer to a typeinfo object that holds information about the actual type of the largest object containing a given C++ object [Str]. The type map also holds the displacement of the given C++ object in its largest containing object.

A C++ object may contain subobjects, called virtual bases, that can be at different displacements relative to the address of the object, depending on the actual type of the largest object containing the addressed object. But C++ type maps do not contain the displacements of virtual bases; rather, instead of having the object point at a type map and the type map have the displacement of the virtual base, C++ has the object point at the virtual base directly. This system is faster by one indirection, but has other overhead problems, in terms of both the amount of memory taken by an object and the time required to adjust pointers when an object is copied. Otherwise the two systems are equivalent [Str].

Type maps permit any component to have a displacement determined by a type map for the containing object. C++ permits this only for virtual bases, which are like unnamed components. But a C++ object may not contain two different virtual bases with the same type. Because C++ derived its notion of type maps merely to support its notion of inheritance, C++ has an incomplete implementation of the notions of virtual object and type map.

C++ does not permit arbitrary constants in type maps; e.g., one cannot have what C++ might call a virtual static datum, which would require a pointer to the datum in the type map. This makes it difficult to have variables associated with a class used as an abstract type (i.e., a formal type). For example, when coding a LISP interpreter in C++, one might like to associate a different free list with each derivative of the abstract type LISP object, but this is not directly possible in C++.

Because C++ does not explicitly put addresses of type maps in other type maps, C++ cannot perform many base class to derived class strongly typed conversions as efficiently as it might.

Another limitation of C++ is that virtual function table addresses cannot be used as copyable data type tags for tagged data. If a class Y inherits from a class X, X has data and virtual functions, and Y has virtual function definitions but no new data, copying a Y value into an X variable gives a copied value that points at the virtual function table of X, not of Y. That is, the pointer to the virtual function table of Y that was stored in the Y value is not copied; it is overwritten after the copy by a pointer to the virtual function table of X.

C++ puts virtual function table addresses in objects, and does not permit virtual function table addresses to be made part of pointers or otherwise passed separately. As a consequence, many things are impossible with C++ that are possible with HASKELL. For example, in C++ one cannot have an array of small scalars which have operations determined by a virtual function table, without storing the address of the table in every array element.


C++ denotes type identifiers stored in objects by pointers instead of by integer codes. This makes it difficult to copy objects to and from external memory, or to place objects in memory shared by separately linked programs.

In summary, the notions of virtual object and type map are not really part of the C++ language, and the C++ implementation of type maps is very incomplete.

Type Maps in Other Languages

HASKELL uses type maps passed to functions as separate arguments. These type maps contain only function addresses and addresses of other type maps [Jon]. They do not contain component displacements or addresses of static data. The use of type map addresses inside type maps is limited to static type matching, and is not used for dynamic type matching.

Conceptually, EIFFEL uses a matrix whose columns are type maps, one for every actual type [MS]. The actual type number is placed in the object. The matrix is very sparse, and is packed by a mechanism that requires an extra vector access. The component descriptors are either component displacements or function addresses. There is no type matching operation: every object can be viewed in only one way.

SATHER uses hash tables. SATHER type maps [MS] contain function addresses, component displacements, static data addresses, and simple constants. Objects contain type map numbers or type map addresses that designate the type maps needed to consider the objects as having different actual or formal types. An object address points to a vector of these type map numbers, and the type map is selected according to which view of the object is needed.

Neither EIFFEL nor SATHER makes provision for passing type maps as parameters, or for using type maps with numeric values as in our lattice element example above.

None of the languages we have discussed include type information in component descriptors.

None of the languages we have discussed make provision for placing type map addresses inside type maps for the purpose of dynamic type matching. Because doing this automatically might take memory proportional to the square of the number of types derived from a given type, a language should have features for specifying just which dynamic type matches are to be done this way.


Array Descriptors

An array descriptor describes a linear map from a list of subscripts to a displacement. This displacement is then added to a pointer to get a pointer to an element in an array. Or the displacement may be added to a component descriptor to get a descriptor for an array element within an object.

The following is an example use of array descriptors from my past:

A system for running algorithms on medium resolution images is being built. Many image operators, such as convolution operators, shrink the image, so the output is smaller than the input. But the size of the image at medium resolution is not big enough to withstand this shrinkage, so some method of avoiding shrinkage is required.

Algorithmically, the method is to expand the input by mirror imaging, so the output will be the same size as the original input. To effect this, an input matrix of the right size is created and sliced up into parts corresponding to the original input and various mirror images of that original input.

The software being used to handle arrays in this system treats array descriptors as first class objects. With the help of array descriptors, the parts of the expanded input matrix are presented as separate arrays to a standard array copy routine that simply copies one array to another. This array copy routine, by the way, is not so simple as to invite replicating its code: it optimally handles arrays of any of the supported element types.

An R-CODE array descriptor is a vector of R-CODE dimension descriptors, followed by either a component descriptor or a tagged pointer (see the figure below). An R-CODE array descriptor can be part of a type map, can be passed as part of an argument list or a routine result list, and can be stored inside an object. R-CODE assumes that an array descriptor will not be changed during use.

An R-CODE dimension descriptor is a tagged value with the format given in the figure below. Given an integer subscript I and the dimension descriptor pictured in the figure, the check

    base ≤ I < limit


[Figure: Array Descriptors. An array descriptor is a sequence of dimension descriptors ending with a pointer or component descriptor. A dimension descriptor has a tag of FFFFC₁₆ (24 bits), a step (40 bits), a base (32 bits), and a limit (32 bits).]

is made to see if the subscript is legal and if yes the displacement

I step

is generated and added to the displacement of the base element of the array This

latter displacement is provided by the comp onent descriptor or p ointer that ends the

array descriptor

Here I, base, and limit are all 32-bit signed integers, and step is a 40-bit signed integer displacement in units of one bit. Just as for component descriptors and tagged pointers, RCODE implementations do not store all bits of the step; they store only as many bits as are needed, in units of one byte or one bit, depending on the type field in the component descriptor or pointer that ends the array descriptor.
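A minimal sketch of the subscript check and displacement computation, assuming the figure's base/limit/step fields; the `Dimension` class and function names here are illustrative, not RCODE structures:

```python
class Dimension:
    """One dimension descriptor: base/limit bound the subscript,
    step is the displacement per subscript unit, in bits."""
    def __init__(self, base, limit, step):
        self.base, self.limit, self.step = base, limit, step

def element_bit_address(dims, subscripts, base_displacement):
    """Check each subscript and sum the displacements, as the text describes:
    base <= I <= limit, then add I * step; base_displacement comes from the
    component descriptor or pointer that ends the array descriptor."""
    disp = base_displacement
    for dim, i in zip(dims, subscripts):
        if not (dim.base <= i <= dim.limit):
            raise IndexError(f"subscript {i} outside [{dim.base}, {dim.limit}]")
        disp += i * dim.step
    return disp
```

For example, in a 10 x 5 array of 32-bit elements the row step is 5 * 32 = 160 bits and the column step is 32 bits, so element (2, 3) lies at bit displacement 2 * 160 + 3 * 32 = 416.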


Summary

RCODE enhances the data types of machine hardware with several types that promote interoperability of hardware and software. Tagged types are included for languages like LISP. Type maps are included to allow software and data written by different people and at different times to be mixed, and to allow one piece of code to run on data of more than one actual type. Array descriptors make it easier to address arrays in a strongly typed manner.

My hope is that this set of data types will advance the cause of languages that mix the features of LISP and C, and the causes of object-oriented programming and array processing.

I also think it is clear that structures such as tagged data, type maps, and array descriptors need to be standardized if programming languages are to interoperate.

Chapter

Execution Flow

Goals

As an intermediate programming language, the RCODE virtual machine language defines an instruction set and an execution flow semantics for that instruction set. Previous efforts to define intermediate representations for programs have generally used a sequential execution flow model. But RCODE instead pursues a different goal:

G: Execution flow semantics should permit instructions to be reordered by implementations as much as practical.

The primary reason for this goal is efficiency. One consideration is the appearance of superscalar processors that execute several instructions at once (see [H]). Another consideration is the fact that processors are becoming much faster than main memory, relatively speaking, so there is an increasing need to move memory load instructions as early in the instruction stream as possible (see the discussion of the "memory wall").

A secondary reason this goal is acceptable is the desire to switch to more functional programming languages, because these languages are easier to understand and debug.


Requirements

Functional languages derive extra efficiency from the fact that they can be executed in dataflow order: each primitive operation (e.g. add) can execute whenever its inputs are ready. To maximize the parallelism this permits, special hardware is needed to detect when the inputs to an operation are ready; see, for example, the work of Burton Smith [Smi, ACC+] and Arvind [Nik, Pap].

But RCODE must run efficiently on existing sequential computers.

Therefore RCODE uses a compromise. It is possible to compile a functional language efficiently for a sequential computer as long as all the variables are register-like, in that they are local to a routine, cannot be aliased at runtime using pointers, and therefore the compiler can figure out at compile time when they will receive their values. But variables that are global, or are array elements addressed by general subscripting, cannot be handled efficiently using dataflow execution order on existing computers. So the compromise is:

F7: RCODE will be a functional dataflow language, with register-like variables, that treats RAM memory as an I/O device.

F7 leads to languages that are functional with respect to register-like variables, but whose memory loads and stores have no certain order. To manipulate memory properly, a barrier operation is introduced, as in some modern functional languages (e.g. the ID barrier [Nik]), to force completion of part of a block of code before the rest of the block is executed. We call such a language a register dataflow language.

The operations of a functional language that involve registers can be mapped directly to a register dataflow language. The operations that create values in memory require the following algorithm:

1. Allocate an object and return a write-only pointer to it.

2. Write the components of the write-only object. These component write operations may be done in any order; no component is written twice.

3. Execute a barrier that ensures all of the above has finished.

4. Convert the write-only pointer to a read-only pointer and discard the write-only pointer.

5. Return the read-only pointer for use by the rest of the program.
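The steps above can be simulated in sequential Python; the pointer modes are modeled as a flag and the barrier is a no-op placeholder, since sequential execution already completes prior operations in order. All names are hypothetical:

```python
class ObjPointer:
    """A pointer whose mode restricts the operations allowed through it."""
    def __init__(self, size, mode):
        self._cells = [None] * size
        self.mode = mode                      # 'write-only' or 'read-only'

def allocate(size):
    return ObjPointer(size, 'write-only')     # step 1

def write(p, index, value):
    assert p.mode == 'write-only'             # step 2: writes, in any order
    assert p._cells[index] is None            # no component written twice
    p._cells[index] = value

def barrier():
    pass  # step 3: in RCODE, waits until all prior operations are done

def freeze(p):
    assert all(c is not None for c in p._cells)
    p.mode = 'read-only'                      # step 4: discard write capability
    return p

def read(p, index):
    assert p.mode == 'read-only'              # step 5: readers use this pointer
    return p._cells[index]
```

The mode change at step 4 is what lets an implementation reorder the writes freely: nothing can observe the object until the barrier has completed them all.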

Functional languages can handle most programming tasks easily and efficiently. But on the other hand, they usually cannot express everything in a single program. They cannot handle objects and arrays with changing state. They have trouble with algorithms that require state maintenance, such as histogram computation or graph tracing.

The barrier operation permits a register dataflow language to handle the necessary non-functional language operations. Memory operations between barriers can be reordered by the hardware, but operations on different sides of a barrier (and dynamically within the same block) cannot cross the barrier, and this introduces enough sequentiality to maintain state.

In addition to barriers, atomic operations are helpful in parallelizing algorithms that demand some sequentiality. For example, an atomic test-and-set bit operation can be used in a parallel graph traversal to ensure no node is visited more than once. As a more specific example, in computing a histogram one may use barriers and atomic add operations as follows:

1. Allocate the histogram and return a write-only pointer to it.

2. Zero the components of the write-only histogram. These component zero operations may be done in any order; no component is written twice.

3. Execute a barrier that ensures all of the above has finished.

4. Convert the write-only pointer to an accumulate-only pointer and discard the write-only pointer.

5. Accumulate into the components of the accumulate-only histogram. Each accumulate operation is atomic, and the accumulate operations are pairwise commutative, so these operations may be reordered without affecting the result, even though many of them change the same component.

6. Execute a barrier that ensures all of the above has finished.

7. Convert the accumulate-only pointer to a read-only pointer and discard the accumulate-only pointer.

8. Return the read-only pointer for use by the rest of the program.
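The histogram recipe above can be sketched with Python threads standing in for parallel routine executions and a lock standing in for an atomic add instruction; the partitioning into four chunks is arbitrary, and the result is the same however the commutative adds interleave:

```python
import threading

def histogram(data, nbins):
    bins = [0] * nbins                 # steps 1-2: allocate and zero
    lock = threading.Lock()            # step 3's barrier precedes accumulation

    def accumulate(chunk):             # step 5: atomic, commutative adds
        for v in chunk:
            with lock:                 # stands in for an atomic-add instruction
                bins[v] += 1

    threads = [threading.Thread(target=accumulate, args=(data[i::4],))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                       # step 6: barrier: all adds are done
    return bins                        # steps 7-8: read-only result
```

Because addition is commutative and each add is atomic, any interleaving of the four threads yields the same bins, which is exactly why the recipe permits reordering.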

The facts that pure functional languages can be translated easily into register dataflow languages, and that pure functional languages can handle most coding tasks, mean that users will not have to write explicit barriers very often. This means register dataflow languages may be practical.

A register dataflow language has several operations that have side-effects. Memory write operations are the most obvious of these. However, the subroutine return operation also has side effects when it is inside a case statement. This is because the case statement is selecting one of several possible returns from the routine, each with a different set of results. The side effect is the setting of the result list, which is analogous to the writing of a variable.

Because there are side-effecting operations in the language, control flow operations must be sensitive to this. A case statement, for example, cannot evaluate memory write or subroutine return statements inside any case until it is certain that case has been selected.

Because of the presence of memory write operations, subroutine returns must be treated differently than they would be in a purely functional language. In a purely functional language, once all the values are returned from a routine, the routine can terminate; so a return statement can terminate its routine. But in a register dataflow language, the return statement may not generally terminate other operations in its routine, because they may lead to memory writes.

The operation that throws an exception is roughly similar to a return operation, but it does terminate all operations in routines being returned from.

The barrier operation holds up all operations in its block that follow the barrier until all operations in the block before the barrier are done. Suppose a block executes a return operation before the barrier. It seems most natural in this case to not execute the part of the block after the barrier, on the grounds that the block has already returned. Code before the barrier should execute even after the return operation, as indicated above.

Memory read operations must not have side effects if we are to permit their speculative execution inside case statement cases that have not yet been selected.


Routine calls may be speculative. An inlined routine may appear to execute speculatively even on a serial computer, because its instructions may become interleaved with those of its caller, and some of these instructions may be executed speculatively.

Routine calls may execute in parallel. This will happen even on a serial computer when two routines are inlined and their instructions mixed.

So a register dataflow language is a dataflow language in which memory operations are input/output operations, there is a within-block barrier operation, and other operations are modified accordingly.

The RCODE computer is a virtual computer that executes a register dataflow machine language. The name RCODE stands for Register dataflow CODE.

The RCODE machine language is built on certain principles that are introduced in the next section. Subsequent sections discuss related work, RCODE machine language semantics, and implementation challenges.

Principles

The RCODE virtual computer language's register dataflow semantics are based on the following principles, which will be elaborated as necessary in subsequent sections.

RCODE is a virtual computer machine language, so the most basic language operations are implemented by an RCODE instruction. Thus I will use the word instruction below as a synonym for operation.

In reading the following principles, it may help to think of the RCODE machine language as essentially a RISC language whose typical instruction has two input registers and an output register.

Virtual Computer Representation

RCODE represents programs in a virtual computer machine language instead of as a programming language.

Either a virtual computer machine language or a programming language can be used to represent programs. There are several advantages to using a virtual computer machine language:


- The virtual computer can be simulated and used to explain to students how computers work.

- The virtual machine language is closer to actual hardware, and this makes precise implementation easier.

- The virtual machine language is similar enough to actual hardware that changes in actual hardware can be promoted directly by features of the virtual machine language.

For these reasons, RCODE takes the virtual computer approach instead of the low-level programming language approach.

A main danger in the virtual computer approach is that instruction and data packing problems will adversely influence the semantics of the result. To combat this, RCODE uses very unpacked virtual instructions and very unpacked 128-bit virtual tagged register values.

Dataflow for Register Values

Values that can be stored in registers are computed using dataflow execution order. Once a value is stored in a register, the register cannot be changed thereafter. Each register value may be computed as soon as sufficient inputs to its computing instruction are available. Any execution order which does this is permitted, unless constrained by special instructions such as barriers and cases.

This permits code to be reordered unless prohibited by barriers or cases.

RAM Memory is an Input/Output Device

RAM memory is treated as an input/output device. Memory instructions can execute as soon as their register inputs are available. Memory read instructions do not wait until the memory location being read has been written.

This permits memory instructions to be reordered unless prohibited by barriers or cases. By not waiting for empty memory words to become full, the need for special hardware is avoided.


Based on Cases, Calls, Returns, Barriers, and Exception Throws

Normal execution flow is controlled by the register dataflow principle and just five other instructions: the case instruction, that selects one of several cases to execute; the subroutine call; the subroutine return; the barrier instruction; and the exception throw instruction. Other constructs, such as blocks and loops, are equivalent to routines (see below). Still other constructs, such as coroutines and continuations, are defined by minor extensions of the basic model.

Limiting the number of ways execution flow may be controlled provides for an economical abstract description of execution flow rules.

Case instructions may select one of several return instructions to return from a subroutine. These return instructions each write different values into the result list and therefore have a side effect.

Barriers are at Routine Top Level

A barrier instruction may appear at the top level of a routine, to cause all instructions before the barrier to be done before any instruction after the barrier is done. The barrier instructions in a routine divide the routine into partitions.

Barrier instructions may only be at the top level of a routine, and not inside cases or partitions. But they may also be at the top level of blocks or loops, since these are like routines (see below).

I do not know of any reasonable semantics for barriers inside cases, and so forbid such.

Return Terminates Subsequent Partitions

A return instruction in one partition of a routine terminates subsequent partitions of the routine. A return instruction does not terminate other instructions in its own partition, and these other instructions may complete after the return instruction.

Because the order of instructions within a partition may be rearranged, it may be necessary to execute some instructions within the partition after a return instruction executes. But terminating subsequent partitions is appropriate because I believe it is the semantics most users will expect.


Exception Catches are Cases Attached to Call Instructions

A call instruction can optionally have an exception catch case. With respect to this, the call instruction behaves like a case instruction, and selects the exception catch case if and only if the call instruction receives an exception value list from an exception throw.

Exception catch capabilities are normally attached to blocks. But in our scheme, call instructions and separate routines play the role of blocks, so I add catch capability to call instructions.

Exception Throws are like Returns with Termination

An exception throw instruction is like a return instruction, but with the following differences:

- An exception throw instruction communicates a list of exception values, whereas a return instruction communicates a list of result values.

- An exception throw instruction terminates all instructions in whichever routines it is completing, whereas a return instruction allows instructions in the same partition to complete after a return instruction executes.

Out-of-Line Equals Inline

Execution of out-of-line routines has the same semantics as putting the routine inline.

This implies the implementation may start executing a routine before all of its arguments have been computed, because the routine may be inlined and its instructions may be reordered and mixed with those of its caller. Also, code using a return value may start executing before the routine is completely done. But conversely, the programmer cannot depend on this behavior (see Either Strict or Non-Strict below).

The RCODE debugging interface makes execution of inline and out-of-line routines look the same. However, inline routines often appear to execute in non-strict or dataflow order, even on a sequential machine, so this principle implies the ability to deal with non-strict execution order in debugging.


Blocks are Routines

A block provides a way of grouping instructions. Values can be returned from the block with return instructions.

Semantically, a block is just like a routine, and it should be possible to convert blocks into routines and vice versa.

Loops are Tail-Recursive Routines

A loop is a block of instructions that repeats itself.

Semantically, a loop is just like a tail-recursive routine that calls itself to iterate, and it should be possible to convert loops into tail-recursive routines and vice versa.
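As a small example of the equivalence, a summation loop and its tail-recursive counterpart (a Python sketch; in such a conversion each iteration becomes a fresh routine execution with its own write-once registers):

```python
def sum_loop(xs):
    """Ordinary iterative loop: `total` is a mutable variable."""
    total = 0
    for x in xs:
        total += x
    return total

def sum_tail(xs, i=0, total=0):
    """The loop body calls itself to iterate; i and total are set
    exactly once per execution, like write-once registers."""
    if i == len(xs):
        return total
    return sum_tail(xs, i + 1, total + xs[i])
```

Both compute the same result; the tail-recursive form never reassigns a variable, which is what lets a dataflow semantics treat each register as set at most once.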

Traps are like Subroutine Calls

An instruction trap occurs when an instruction is confronted with operands it cannot handle. Instruction traps are semantically equivalent to subroutine calls, as if the instruction had been turned into a call instruction. The subroutine called is selected from a trap dispatch table and can emulate the instruction by returning the proper results. The subroutine can also throw an exception.

Traps differ from explicit call instructions only in that implementations can optimize them away (see Eager but for Traps below).

Traps may be used in languages like LISP to perform arithmetic operations on special numbers, such as very long integers.

NaNs instead of Traps

Many conditionals are written using short-circuit AND instructions, for which the first operand must be true in order for it to be legal to attempt to compute the second operand. For example, the first operand may test that a pointer is not NULL, and then the second operand may use the pointer.

A great deal of sequentiality would be introduced if one could not attempt to compute later operands to a sequence of short-circuit AND instructions before the first operand has been computed. To avoid such sequentiality, instructions are defined to produce non-signaling-NaN values when they are given illegal inputs, instead of trapping. Traps still occur if the inputs are such that a software trap routine might compute a value, or if the instruction is a control flow instruction that involves side effects, such as a case instruction using a value (which must not be a non-signaling-NaN) to select a case.

If a short-circuit AND instruction inputs a first operand that is false and a second operand that is a non-signaling-NaN, the AND instruction simply outputs false. The non-signaling-NaN is discarded because it is an unneeded operand of a short-circuit instruction.
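The rule can be sketched in Python, with a sentinel object standing in for RCODE's non-signaling NaN; `sc_and` and `safe_deref` are hypothetical names, not RCODE instructions:

```python
NAN = object()  # stands for RCODE's non-signaling-NaN register value

def sc_and(first, second):
    """Short-circuit AND over possibly-NaN operands; `second` may
    have been computed speculatively before `first` was known."""
    if first is False:
        return False          # a NaN second operand is simply discarded
    if first is NAN:
        return NAN            # the guard itself could not be evaluated
    return second             # first is True: second (possibly NaN) flows on

def safe_deref(ptr):
    """Speculative evaluation: an illegal input yields NaN, not a trap."""
    return NAN if ptr is None else ptr['field']
```

Thus `safe_deref` may run before the NULL test completes; if the test comes back false, the speculative NaN result is harmlessly thrown away.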

When a non-signaling-NaN is output by an instruction, this value says the instruction could not execute. If a non-signaling-NaN is fed as input to an instruction that controls side effects, it will cause a trap to the debugger, and examination of the output of various previous instructions will reveal the original offending input.

Note that I assume a debugger interface that makes the virtual computer look real. In the virtual computer, each instruction outputs to a register which is not changed after it is initially set by the instruction, so these outputs will be available thenceforth for inspection by the debugger.

Either Strict or Non-Strict

An implementation should be free to optimize code by not computing inputs that are not required by an instruction before executing the instruction. For example, if the first input to a boolean AND instruction is false, the second input is not necessary and need not be computed.

Inputs that are not required to execute an instruction are called non-strict, so an implementation is free to treat inputs as non-strict.

But an implementation should also be free to insist that all inputs to an instruction be computed before executing the instruction. That is, code should not depend upon unrequired inputs not being computed.

Inputs that must be computed before executing an instruction are called strict, so an implementation is free to treat inputs as strict.

The same holds for arguments to a subroutine, which an implementation is free to treat as either strict or non-strict. In the non-strict case, the subroutine may start to execute before all its arguments have been computed, and in the strict case, the subroutine will not start until all its arguments have been computed.

Note that the freedom to not compute inputs before executing an instruction or routine is not the same thing as the freedom to never compute inputs at all (see Eager but for Traps below).

The freedom to view execution of routines as non-strict is used when the routines are inlined and their instructions are reordered among the other instructions of their caller.

Eager but for Traps

Dataflow languages can be eager, meaning they compute everything as soon as necessary inputs are ready, or lazy, meaning they compute something only when it is needed. Because memory write instructions have side effects, it is easiest to use eager semantics.

Thus everything is computed sometime, unless it is in a disabled case of a case instruction, or it follows a barrier that is preceded by a return instruction execution.

As an optimization, an implementation can still suppress a computation if it can be shown to be unneeded by any side-effect producing instruction. In order to make this useful, an implementation is permitted to suppress a computation whose only side effect might occur because of an instruction trap. However, an implementation is not required to suppress such computations.

In effect, the implementation is permitted to assume, for the purposes of suppressing instruction execution, that if an instruction does not have side effects, then its trap routine will not have side effects.

Memory Reads have No Side Effects

A memory read instruction may not have a side effect. It may be done zero or more times without affecting other instructions.

Except, of course, that the time when a read instruction is done does affect the value read.

Case Statements Propagate Permissions

The computer schedules instructions for possible execution. The scheduled instructions form an execution tree whose nodes include routine calls, case instructions, individual cases, and ordinary instructions (see the execution tree figure). In the following discussion I ignore blocks, loops, and barriers.


Each node of the tree may have one of at least four states:

1. no permission
2. side-effect-free permission
3. side-effect permission
4. withdraw permission

These states are ordered, and nodes can only change from an earlier state in the list to a later state in the list.

A case instruction selects one of several cases to execute, where each case is itself a sequence of instructions. The case instruction works by propagating permissions to its cases. It can initially propagate permission to execute side-effect-free instructions to any case it chooses, as long as the case instruction itself has this permission. Later, when the case to be executed is known, side-effect permission is propagated to the selected case, to permit the execution of all instructions including those with side effects, and withdraw permission is propagated to the other cases, to stop all their instructions from executing.
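The ordered states and their monotonic propagation might be sketched as follows (illustrative Python; as discussed later, RCODE implementations on sequential computers do this statically at compile time):

```python
# ordered permission states; a node's state may only move rightward
STATES = ['no', 'side-effect-free', 'side-effect', 'withdraw']

class Node:
    """A node of the execution tree (routine call, case, or instruction)."""
    def __init__(self, children=()):
        self.state = 'no'
        self.children = list(children)

    def propagate(self, state):
        # transitions are monotonic in the STATES ordering
        if STATES.index(state) < STATES.index(self.state):
            raise ValueError('permissions may only advance')
        self.state = state
        for child in self.children:   # non-case nodes pass states toward leaves
            child.propagate(state)

def select_case(cases, chosen):
    """Once the selected case is known: side-effect permission to it,
    withdraw permission to all the others."""
    for i, case in enumerate(cases):
        case.propagate('side-effect' if i == chosen else 'withdraw')
```

Granting side-effect-free permission to a not-yet-selected case is precisely what licenses speculative execution of its arithmetic and memory reads.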

Tree nodes other than case instruction nodes propagate states down the tree toward the leaves.

If a routine (or block or loop) is terminated by an exception, withdraw permission is propagated down the tree from the routine execution node.

Side-effect-free permission allows arithmetic and memory read instructions to execute. But memory writes, subroutine returns, and exception throws require side-effect permission.

Permissions are propagated from a caller to the called routine execution. In particular, if a trap routine executed with side-effect-free permission decides to execute an exception throw, that throw will be delayed until side-effect permission arrives, if it ever does.

Note that barriers are not accounted for in the execution flow model just given. They add extra tree nodes and extra states. Blocks and loops, on the other hand, can be treated by replacing them with equivalent routines.

By using permission propagation we can allow more instructions to be executed speculatively, allowing instructions in cases to be moved ahead of the computation of which case to select.

The above sounds like permissions are propagated dynamically at runtime. But when implemented on sequential computers, this is not true. Instead, the permissions are propagated statically by the compiler as it generates actual hardware code by compiling RCODE virtual instructions. The instruction execution tree is determined by the program counter in such an implementation. The compiler uses the instruction execution tree at each point of execution to select the next instruction to be executed, and then computes a new instruction execution tree that will be valid just after that instruction executes.

Flexible Value Lists

A value list is a list of register values used to communicate between routine executions. Examples are argument lists, routine result lists, and exception value lists.

Value lists can be of arbitrary length and are received by special instructions. Value lists can be received incrementally, so a pattern can be matched to the value list and code selected based on the case matched.

However, for efficiency, value lists can have a signature that specifies the length of the list and the types of the values well enough to permit the value list to be stored in real machine registers. Thus there are more constrained value lists that are very efficient, and more flexible value lists that are less efficient.

Exception value lists often require the most flexibility. For example, it is desirable for these to contain diagnostic error messages that can be printed to tell what went wrong. When an exception is rethrown, it may be appropriate to make additions to these messages. The catcher of an exception may or may not want to print the messages.

Another use of flexible value lists is for argument lists for routines like those in an operating system shell language, where there may be very long variable-length lists of files or similar arguments.

On the other hand, argument and result lists for many subroutines, such as math routines computing with a few real numbers (square root, sine, etc.), need to be very efficient.

Second Class Frame Pointers

Data can be allocated in routine frames, but pointers to such data, which are called frame pointers, can only be stored in registers and value lists. Furthermore, it must be possible to find all such pointers at RCODE run time.

This permits values of arbitrary size to be passed as argument, result, and exception values, but permits the stack to be managed by a simple, fast memory reclaimer.

Controllable Inlining

RCODE routines should be inlined when they are compiled to machine code, and not before. There must be some easy-to-use scheme to control when a routine is inlined.

By delaying inlining until RCODE is compiled, changes to inlined routines can be handled more efficiently.

Related Work

Several computers have been built with full/empty bits on memory words and load instructions that delay until the word loaded is full. Examples of such computers are the Denelcor HEP [Smi], its modernization the Tera [ACC+], and the MIT Monsoon [Pap].

Our notion of barriers is taken directly from Arvind's ID language [Nik], which was designed to run on the Monsoon. This notion arises because in a dataflow language it must be possible to determine when a routine has finished executing, in order to free the memory taken by the routine's frame. So there is a well-defined notion of when a routine execution is done. A barrier merely waits until some partition of a routine is done before permitting the next partition to execute.

Notice there is some question about whether routines called by a given routine execution must be done for the given routine execution to be done. In some ID implementations this would not be necessary just for the purpose of freeing the routine frame. This is because these implementations do not read data from routine frames; they only write data to frames, and each routine execution knows whether all writes to its frame are done yet. However, to make the ID barrier work properly, it is necessary to define a routine execution to be done only if all the routines it called are also done. RCODE similarly requires called routines to be done before a routine is done.

RCODE control flow is very similar to that of ID, with the following differences:

- RCODE is a virtual computer machine language, and ID is a programming language.

- RCODE is optimized for sequential, superscalar, or very long word computers, with scheduling done at RCODE compile time, while ID is designed for special hardware that does scheduling at run time.

- RCODE treats memory operations as I/O, while ID provides flow control based on memory word full/empty bits.

- RCODE permits either non-strict or strict execution, while ID requires non-strict execution.

- RCODE supports exceptions, and ID does not.

- RCODE supports traps, and ID does not.

- RCODE supports speculative execution more explicitly than ID.

Another thread of research on control flow in functional languages is work on monads, continuations, and similar constructs, which is discussed in [JW]. The idea is that there are I/O functions that take as input a special kind of argument, a "world" value that contains the state of the world, and these I/O functions produce as output a normal value paired with a modified world value. First there are primitive I/O functions that call the operating system to perform actual input/output. Then there are schemes for creating composite I/O functions, which are effectively sequences of primitive I/O functions, each with its output world connected to the input world of the next. Simply forcing computation of the final world value of a composite I/O function will cause the sequence of primitive I/O functions it represents to be executed in strict sequential order.

The world value behaves something like the done flag in implementations of ID. A fork operation, in which two parallel computations on the world are spawned, is proposed in [JW]. A join operation, in which two parallel computations are both required to complete in order to produce a final world, does not seem to be discussed in the literature. It should be possible to introduce an operation to merge worlds to produce a new world, however.

Introduction of forks and joins loses the safety of the language, since parallel paths might conflict and induce nondeterministic results. It might be better to try to build world splitting and combining operations that split a world into independent worlds and then combine modified versions of these independent worlds into a modified version of the original world.

In any case, I feel that the join operation, which in ID and our work is implemented by the barrier operation, is fundamental to parallelism, and languages need it to function properly.

RCODE Machine Language Control Flow Semantics

The RCODE virtual computer is a RISC-style machine with a separate set of registers for each routine execution. In the following sections we will give the semantics of the RCODE virtual computer that relate to control flow.

RCODE Basic Register Dataow

RCODE endows each routine execution with a vector of registers. Each register is 128 bits and stores tagged data. Registers are computed by instructions, each of which takes some registers as input. The output registers of an instruction are distinct from the output registers of any other instruction. There are no goto instructions, so each instruction can execute at most once and each register can be set at most once (loops are treated as tail-recursive routines). Each instruction contains a type code that restricts the permitted values of its output registers and permits efficient execution on untagged architectures. Instructions may be executed in dataflow order, or they may be executed in sequential order, without difference except for side effects involving non-register memory or I/O.

We will go into these considerations in more detail in the following subsections.

Registers

An RCODE register holds a 128-bit tagged value (see Figure). Registers are virtual: in an actual implementation, most registers hold values of more restricted types that are stored in a more compact format (see Type Codes below). For example, an RCODE register restricted to have a 32-bit integer value would be stored in a 32-bit machine register or a 32-bit memory word in a routine execution frame.

tagged value = floating point | non-signaling NaN | tagged signaling NaN

tagged signaling NaN:

        32 bits         96 bits
    +------------+-----------------+
    |    tag     |      value      |
    +------------+-----------------+

tag = integer | unsigned integer | boolean | pointer | ...

Figure: Tagged Register Values

The 128-bit tagged register value has the format of an IEEE 128-bit floating point number [Inc]. It may hold a standard floating point value, including the infinities; a non-signaling NaN value, as is specially discussed below (see Non-Signaling NaNs); or a signaling-NaN value that represents a tagged value. In other words, the RCODE tagged data scheme is built on top of the IEEE floating point number scheme, using the latter's signaling NaNs for tagged values that are not floating point numbers or missing values, and using non-signaling NaNs to represent missing values.
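The scheme can be illustrated at 64 bits with a small sketch. This is a hypothetical simplification: RCODE registers are 128 bits wide, and the tag codes and field layout below are invented for illustration. A signaling NaN has an all-ones exponent, a clear quiet bit, and a nonzero mantissa; the sketch packs a tag and a payload into that mantissa, while ordinary floating point values pass through unboxed.

```python
import struct

# Hypothetical 64-bit illustration of the RCODE idea: tagged data built
# on top of IEEE floating point, with signaling NaNs carrying the tagged
# values.  Tag numbers and field widths here are invented for the sketch.
EXP_MASK  = 0x7FF0_0000_0000_0000   # all-ones exponent: NaN or infinity
QUIET_BIT = 0x0008_0000_0000_0000   # set on quiet (non-signaling) NaNs
MANTISSA  = (1 << 52) - 1

TAG_INTEGER = 1                     # illustrative tag codes
TAG_BOOLEAN = 2

def box_float(x):
    """An ordinary float is its own 64-bit representation."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def box_tagged(tag, payload):
    """Pack a nonzero tag and a 32-bit payload into a signaling NaN."""
    assert 1 <= tag <= 0x7FFFF and 0 <= payload < 2**32
    return EXP_MASK | (tag << 32) | payload

def unbox(bits):
    """Return (tag, payload) for tagged values, (None, float) otherwise."""
    is_signaling_nan = ((bits & EXP_MASK) == EXP_MASK
                        and not (bits & QUIET_BIT)
                        and (bits & MANTISSA) != 0)
    if is_signaling_nan:
        return (bits >> 32) & 0x7FFFF, bits & 0xFFFF_FFFF
    return None, struct.unpack('<d', struct.pack('<Q', bits))[0]
```

Note that ordinary floats, including the infinities, need no boxing at all; only non-float values pay for the tag decoding.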

The types of tagged values represented by signaling NaNs include:

- boolean values TRUE or FALSE
- signed integers
- unsigned integers
- pointers of various formats
- other special values

A more complete description of RCODE tagged values is given in a later chapter.

Instructions

Typical RCODE instructions are 64 bits, formatted as in Figure. Typical instructions have a single output register, whose values are constrained by the type code field

       8         16         16        8         8          8
  +---------+----------+----------+--------+----------+----------+
  | op code | source   | source   | type   | arith op | arith    |
  |         | register | register | code   | code     | modifier |
  |         | number   | number   |        |          |          |
  +---------+----------+----------+--------+----------+----------+

arith op code = ADD | SUBTRACT | MULTIPLY | ...
type code = BOOLEAN | INTEGER | ... | TAGGED

Figure: Example Arithmetic Instruction

associated with the register, as discussed below. The output register number is assigned according to the position of the instruction in its routine and is not stored in the instruction.

Typical instructions have two input registers whose numbers are fields of the instruction. All register numbers are relative to the current routine execution frame. The input register numbers must be less than the instruction's output register numbers. This guarantees that strict execution order will work (see the Either Strict or Non-Strict principle above).

The instruction contains an operation code and operation modifier bits. A typical operation code is addition, and a typical modifier is round toward zero.

RCODE instructions are virtual: they are not actually stored or interpreted by implementations. They define a language for representing programs that is to be compiled into actual machine code.

Some instructions define groups of other instructions. A case instruction is followed in memory by its cases, each of which begins with a special 64-bit instruction word followed by the instructions in the case. Routines are similarly organized into partitions, and instructions that send or receive value lists also include groups of 64-bit instruction words. We will not delve deeper into instruction formatting in this paper; see [Wal].

Type Codes

Each register has an associated 8-bit type code that is stored in the instruction that outputs the register. This type code constrains the possible values of the register. For example, the type code might specify that the register can only hold 32-bit signed integers. This would permit an implementation to use just a 32-bit hardware register to hold the value of a 128-bit virtual RCODE register.

The following is a list of RCODE type codes:

- boolean
- 32-bit signed integer
- 32-bit unsigned integer
- 32-bit floating point number
- 64-bit signed integer
- 64-bit unsigned integer
- 64-bit floating point number
- 64-bit tagged value
- 128-bit floating point number
- 128-bit tagged value
- various forms of pointer
- other special data

The 128-bit tagged value type code means the value in the register is unrestricted. RCODE also supports 64-bit tagged values that are based on IEEE 64-bit floating point numbers and are designed for implementing LISP.

No matter what the type code is, a register can store a non-signaling NaN value. These values are used to indicate that the instruction that output them had illegal inputs. See Non-Signaling NaNs below for more details.
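The point of type codes is that an implementation can pick a compact concrete storage format for each register. A minimal sketch of one such choice follows; the code names track the list above, but the width table itself is an assumption about one possible implementation, not part of the RCODE definition.

```python
# Storage width, in bits, that an implementation might choose for a
# register constrained by each type code.  The names follow the type
# code list above; the widths are an illustrative implementation choice.
STORAGE_BITS = {
    'boolean':   8,     # e.g. one byte per boolean
    'int32':     32,
    'uint32':    32,
    'float32':   32,
    'int64':     64,
    'uint64':    64,
    'float64':   64,
    'tagged64':  64,
    'float128':  128,
    'tagged128': 128,   # unrestricted: full virtual register width
}

def frame_bytes(type_codes):
    """Bytes needed to store a register frame with the given type codes."""
    return sum(STORAGE_BITS[t] for t in type_codes) // 8
```

A frame holding one 32-bit integer register and one unrestricted tagged register would then occupy 20 bytes rather than 32, which is the payoff the text describes.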

Frame Memory

Frame memory holds all the data necessary to support register dataflow execution, excluding the general heap memory. In the general heap memory, objects may be freely allocated and may freely point at each other. This is not so in frame memory, where objects may only be allocated with a stack- or tree-like discipline and may not be pointed at except by registers and value lists (see below for value lists).

In implementations that are not parallel, frame memory is stack-like. In implementations that are parallel, frame memory is tree-like.

Frame memory has three parts:

Register Frames. Register frames hold RCODE registers. There is one register frame per routine execution.

Value Lists. Value lists are used for communication between routine executions. The kinds of value list are:

- argument lists, created by call instructions and received by called routines;
- result lists, created by return instructions and received by call instructions;
- exception value lists, created by exception throw instructions and received by exception catches.

Frame Heap. The frame heap includes objects similar to those in the general heap, except that all pointers to frame heap objects must be stored in registers or in value lists. The consequence of this last rule is that frame heap objects are allocated with a stack- or tree-like discipline, which permits very fast, efficient memory reclamation.

The frame heap works with register frames to permit routines to have local values that are full-fledged memory objects, except for the restriction that pointers to frame heap objects may not be stored in any objects.

The following sections describe the three parts of frame memory in more detail.

Register Frames

A register frame is a vector of registers. The registers in a frame have register numbers increasing from zero.

When an RCODE routine is scheduled, a register frame is allocated for that routine execution. When the register frame is created, output registers are allocated for each instruction in the routine. These registers are allocated in the same sequence as the instructions appear in the routine, so the numbers of the output registers of two instructions are always in the same order as the instructions appear in the routine.

The number of registers in the frame is fixed by the number and kinds of instructions in the routine; it is determined when the frame is allocated and not changed thereafter.

The types of values that can be stored in each register are determined by the type code of the instruction which outputs to the register. This too is determined when the frame is allocated and not changed thereafter.

Each instruction may have input register numbers. These must be less than any output register numbers of the instruction. This enforces the rule that strict execution sequencing is permitted (see the Either Strict or Non-Strict principle above).
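The numbering constraint can be checked mechanically: every input register of an instruction must already be some earlier instruction's output. A sketch, with an instruction representation invented for illustration:

```python
# Each instruction is a pair (input_register_numbers, output_register_numbers).
# Outputs are numbered in instruction order; strict sequential execution is
# guaranteed to work iff every input number is below the instruction's outputs.
def well_formed(routine):
    next_output = 0
    for inputs, outputs in routine:
        if any(i >= min(outputs) for i in inputs):
            return False      # an input could be set later: reject
        if list(outputs) != list(range(next_output, next_output + len(outputs))):
            return False      # outputs must follow instruction order
        next_output += len(outputs)
    return True
```

Because each register is set at most once and inputs always refer backward, executing instructions in listed order never reads an unset register, which is exactly the strict-order guarantee in the text.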

Register frames are organized as a tree: the parent of each frame is its caller's frame, except that the root frame has no parent. If completely strict execution were used, this register frame tree would be a stack. Here, by strict execution we mean executing instructions in the order they appear within a routine, not executing any part of a routine until all its inputs are ready, and not using the results returned by a routine until the routine is done.

But in the presence of inlined routines and instructions reordered to match hardware, routines will appear to execute non-strictly even on existing sequential computers. The RCODE virtual computer is required to make inlined routines appear to the debugger to have been executed out of line, by the Out-of-Line Equals Inline principle above. But then these routines will appear to be non-strictly executing out-of-line routines, though they will in reality be inline routines whose instructions have been intermixed with instructions from other parts of their caller.

The register frame tree is sometimes called the routine execution tree, as its nodes can be thought of as routine executions as well as register frames.

A register frame may be deallocated from memory when its routine is done and when any result list or exception value list this routine is sending to another frame in the routine execution tree is no longer needed or has been reattached to the receiving frame.

Frame Heap

Many languages permit arrays to be allocated in a routine execution frame (on the stack) with an array size not determined until execution of the routine. Some languages (e.g., Ada [ANS]) permit such arrays to be returned as result values to the caller of the routine.

The frame heap addresses these needs. It is a heap where ordinary objects may be allocated, as long as pointers to these objects are stored only in registers or value lists. This restriction on where pointers to the objects are stored permits fast, stack-like memory reclamation to be implemented.

The reason such memory reclamation can be implemented is that when a routine execution is done, all frame heap objects allocated by the routine execution subtree rooted at the routine are either no longer in use or are pointed to by the result list or exception value list returned to the call instruction that scheduled the routine. Thus a memory reclamation algorithm can merely compact the objects pointed at by the returned value list and free the rest of the memory allocated by the execution subtree. Since the values to be compacted are not visible until the value list is used, there is no problem moving them and adjusting their addresses in the value list.
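The reclamation strategy can be sketched with a bump allocator: on return, the survivors named by the returned value list are compacted down to the frame's base mark and everything above is freed. All names and the flat-list representation here are invented for illustration.

```python
# Sketch of stack-like frame heap reclamation.  The heap is a flat list
# of cells; an object is identified by (start_offset, size).
class FrameHeap:
    def __init__(self):
        self.cells = []                 # flat heap storage

    def mark(self):
        return len(self.cells)          # base mark of a routine execution

    def alloc(self, payload):
        addr = len(self.cells)
        self.cells.extend(payload)
        return addr, len(payload)

    def reclaim(self, base, survivors):
        """Compact the objects pointed at by the returned value list down
        to `base`, free everything else allocated above it, and return
        the survivors' adjusted addresses."""
        new_addrs = []
        compacted = []
        for addr, size in survivors:
            new_addrs.append(base + len(compacted))
            compacted.extend(self.cells[addr:addr + size])
        self.cells[base:] = compacted
        return new_addrs
```

Because the survivors are reachable only through the value list, their addresses can be rewritten freely during compaction, which is the point the paragraph above makes.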

Value Lists

An RCODE value list is a vector of register values. There are three kinds of value list: argument lists, result lists, and exception value lists.

An RCODE value list is constructed and sent by a call instruction (argument list), a return instruction (result list), or an exception throw instruction (exception value list). It is received by a routine execution (argument list), a call instruction (result list), or a call instruction exception catch case (exception value list).

Exception value lists may be received by a call exception catch case, partly read, and forwarded for further processing, by an exception rethrow, to another call exception catch case, or by a call to a subroutine. The receiver of an exception value list may know only enough about the list to decode the first part, and then pass the list on. Argument lists may be received by a routine execution that similarly decodes and uses the first part and then, after replacing the first part, passes the list on to another routine.

These capabilities suggest that value lists should be made fairly first-class data in RCODE. However, for efficiency reasons, value lists may need to be restricted.

In RCODE, flexible first-class value lists are variable length vectors of tagged values stored in memory. Efficient restricted lists are fixed length vectors of register values, with each value associated with an 8-bit type code, in the same way that registers are associated with a type code. RCODE also supports hybrid lists, with a short part in real hardware registers followed by a variable length part in memory.

Specifically, an RCODE value list is a vector called a register value vector. Each element of this vector is either a register value or a pointer at or into a vector of register values called a memory value vector. Normally a value list either has no pointers to memory value vectors, or only the last register value vector element is a pointer to a memory value vector. However, when writing subroutines to manipulate value lists, it is desirable to pass pointers into memory value vectors between routine executions, and to permit this, register value vectors are allowed to contain several pointers at memory value vectors.

A value list has a signature, which must be agreed upon by all the senders and receivers of the value list. The signature consists of a fixed length vector of 8-bit type codes and an equal length vector of memory vector flags. The length of these vectors is the length of the register value vector part of the value list.

Values in a value list are, virtually speaking, 128-bit tagged register values. For each signature position whose memory vector flag is off, the signature 8-bit type code describes the permitted values of the register value vector element in that position, in the same way that the 8-bit type code in an instruction describes the permitted values of an instruction output register, but with one difference. The difference is that an implementation is free to permit or disallow non-signaling NaN values, unless the value is a tagged value, in which case non-signaling NaN values are allowed.

For each signature position whose memory vector flag is on, the signature 8-bit type code describes the permitted values for the elements of the memory value vector pointed at by the register value vector element in that position. Again, the implementation is free to permit or disallow non-signaling NaN values, unless a value is a tagged value, in which case non-signaling NaN values are allowed.

The receiver of a value list must allocate as many registers as there are type codes in the value list signature. The associated values are copied into these registers if their memory vector flags are off. Pointers to memory value vectors are copied if the memory vector flags are on.

New memory value vectors may be allocated by a routine and made by copying from other memory value vectors. Memory value vectors may be passed as the end of an argument list, result list, or exception value list to another routine execution. The length of a memory value vector must be determined when it is allocated, but the vector may be written incrementally. Pointers into memory value vectors may be passed as arguments and returned as result values in register value vectors.

Memory value vectors are like frame heap objects, but differ in that they can hold pointers to frame heap objects if their type code permits.
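The signature discipline above can be sketched directly: a signature is a vector of type codes plus a parallel vector of memory vector flags, and a receiver allocates one register per position, copying either a value or a pointer. The representation and the predicate table are invented for illustration.

```python
# Illustrative per-type-code value checks; a real implementation would
# check concrete bit patterns rather than Python types.
PREDICATES = {
    'boolean': lambda v: isinstance(v, bool),
    'int32':   lambda v: isinstance(v, int) and -2**31 <= v < 2**31,
    'tagged':  lambda v: True,             # unrestricted tagged value
}

def receive(signature, values):
    """Receiver of a value list: one register per signature position.
    Off-flag positions hold checked values; on-flag positions hold
    pointers (vector, index) whose elements must satisfy the type code."""
    type_codes, mem_flags = signature
    assert len(type_codes) == len(mem_flags) == len(values)
    registers = []
    for code, flag, v in zip(type_codes, mem_flags, values):
        if flag:
            vec, idx = v                   # pointer into a memory value vector
            assert all(PREDICATES[code](x) for x in vec[idx:])
        else:
            assert PREDICATES[code](v)
        registers.append(v)                # copy the value or the pointer
    return registers
```

Sender and receiver must agree on the signature, as the text requires; the sketch enforces that agreement by checking each position on receipt.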

Execution State

The Execution Tree

All scheduled instructions may be viewed as part of a tree, the instruction execution tree. The nodes in this tree are as follows (see Figure for an example):

Leaves. Instructions such as arithmetic operations are normally leaf nodes of the tree and have no children. However, if they trap, they become like subroutine call instructions.

Calls. Call instructions have as children, first, the routine execution and, second, an optional exception catch case. A call instruction acts in part like a case instruction that may select the exception catch case if an exception value list is returned. If the exception catch case is selected, it may return a result list for the call or may throw an exception itself.

Routine Executions. Routine executions have as children the partitions of the routine being executed, in the order the partitions appear in the routine.

Partitions. Partitions have as children the individual instructions of the partition. Note that any instructions inside the cases of case instructions are excluded: only the case instruction is directly a child of the partition, and instructions inside cases are descendants of the case instruction.

Case Instructions. Case instructions have as children their cases.

Cases. Cases have as children the individual instructions of the case. Note that, as for partitions, if a case instruction is such a child, then the instructions inside its cases are not such children.

We have excluded blocks and loops from this discussion. They are introduced below by giving equivalences between them and routines.

main routine execution
  +-- partition
  |     +-- add instruction
  |     +-- case instruction
  |           +-- case
  |           |     +-- multiply instruction
  |           +-- case
  |                 +-- call instruction
  |                       +-- routine execution
  |                       |     +-- partition
  |                       |     +-- partition
  |                       +-- exception case
  +-- partition

Figure: Example Instruction Execution Tree

Tree Node States

Each node in the execution tree has one of the following states:

- no permission
- read-free permission
- side-effect-free permission
- side-effect permission
- withdraw permission
- done

These states are ordered, increasing from no permission to done.

A node is said to be withdrawn if and only if it has withdraw permission.

Nodes can execute and schedule children. Scheduling children involves adding nodes to the execution tree. Execution involves setting output registers and performing memory operations or other input/output. A node can schedule children before or during its execution.

State Transition Rules

The following are general rules for assigning states to nodes:

- When an instruction is scheduled, it is added to the execution tree and given the no permission state.
- The state of a node never decreases; it always increases from no permission toward done.
- If a node has children, then it passes its state on to any child if that state is higher than the child's current state and the child is not a case or partition.
- If all the children (if any) of a node achieve the done state and the node has finished its own execution, then the node is given the done state.
- New nodes may be added to the execution tree when instructions are scheduled, but not as children of withdrawn or done nodes, or of nodes that have finished their execution. Nodes may not be deleted from the execution tree.
- A node can begin to execute when it is in a state with sufficient permission and enough inputs are ready. For nodes other than those that involve side effects or read from memory, read-free permission is adequate to begin execution. For memory read instructions, side-effect-free permission is required. For instructions that involve side effects, side-effect permission is required. One of these three permissions with a higher state value is sufficient when only a lower valued permission is required.
- A node that is not withdrawn cannot complete its execution until all required inputs are available to it. It need not require all its potential inputs.
- During execution, a node that is not withdrawn may decide to trap. If a node so decides, it spawns a new routine execution child and behaves thenceforth like a call instruction with no call exception catch case.
- If a node is withdrawn, it will quickly finish execution and then pass to the done state as soon as its children are done. If the node has not begun execution, it finishes execution without having actually done anything. If the node has begun execution, it will finish execution without necessarily doing all of what it would normally do.
- An instruction is called an atomic instruction if, even when withdrawn, the instruction's execution either does nothing or does what the instruction would normally do. Unless specified otherwise, instructions are not atomic. In particular, non-atomic memory write operations may write only part of the memory they normally change.

The goal is to get all nodes in the execution tree labeled done.

According to these rules, a node that is done or withdrawn cannot spawn new children. As a consequence, a subtree all of whose leaves are done or withdrawn cannot grow.

All states tend to propagate down toward the leaves of the execution tree, except the done state. Because a node may not become done until its children are done, this state propagates up from the leaves.
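These rules condense into a small sketch: states are totally ordered, a parent's state flows down to children that are not cases or partitions, and done flows up once a node has finished executing and all of its children are done. The node representation is invented for illustration.

```python
# The ordered states from the rules above.
STATES = ['no', 'read-free', 'side-effect-free',
          'side-effect', 'withdraw', 'done']

class Node:
    def __init__(self, kind, children=()):
        self.kind, self.children, self.state = kind, list(children), 'no'
        self.finished = False            # has this node finished executing?

    def raise_state(self, state):
        """A node's state never decreases; higher states pass down to
        children, but not to case or partition children."""
        if STATES.index(state) > STATES.index(self.state):
            self.state = state
            for c in self.children:
                if c.kind not in ('case', 'partition'):
                    c.raise_state(state)

    def try_done(self):
        """Done propagates up: a node is done when it has finished its
        own execution and every child is done."""
        for c in self.children:
            c.try_done()
        if self.finished and all(c.state == 'done' for c in self.children):
            self.state = 'done'
```

The case and partition exemption matters: case instructions and routine executions hand out permissions to their cases and partitions selectively, by their own rules, rather than by blanket inheritance.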

State Maintenance

The state transition rules are intended to be applied at RCODE compile time by an RCODE compiler, and not at run time, for superscalar and very long word computers. This is to be done by assuming that out-of-line routine calls are strict, so all scheduling really concerns only the code within a routine and the routines it inlines.

An important consequence is that, on a superscalar or very long word computer, the state of the execution tree is a function of the program counter value. So the state does not have to be maintained at routine run time in order to be presented by the RCODE debugging interface to debuggers. Instead, the program counter can merely be looked up in a table to find the instruction execution tree state relevant to a routine, though this table may be rather large, as noted in Implementation Challenges below.

Cases

A case instruction selects one of several cases to execute. Each case is a group of instructions.

A case instruction works by propagating permissions down a tree of scheduled instructions. A case instruction and its cases do the following:

- When the case instruction is scheduled, it allocates registers for itself and its cases, and schedules its cases.
- When a case is scheduled, it is allowed to schedule its children unless the case is withdrawn. However, a case is not required to schedule its children unless it has received side-effect permission.
- When a case instruction receives read-free permission, it is allowed to propagate that permission to any of its scheduled cases, but is not required to. Ditto with side-effect-free permission.
- When a case instruction can compute which case it will select, it propagates withdraw permission to all other scheduled cases.
- When a case instruction receives side-effect permission and can also compute which case it will select, it propagates side-effect permission to that case.
- If a case instruction receives withdraw permission, it propagates that to all its scheduled cases that are not done.
- If a case instruction receives an illegal input, such as a non-signaling NaN that the case instruction is not prepared to accept, the case instruction propagates withdraw permission to all its scheduled cases that are not done and schedules a new routine execution child to execute the case trap routine. The state of the case instruction is then propagated to this trap routine execution node, and the case instruction thereafter behaves like a call instruction.
- A case instruction may be configured to recognize a non-signaling NaN as a legal input and use it to select a particular case.
- An instruction in (i.e., descended from) a case cannot input registers that are the output registers of instructions in its sibling cases (i.e., cases with the same parent case instruction).

Cases propagate new states down to their children as per the default rules for execution tree nodes. And both cases and case instructions become done when all their scheduled children are done and the case instructions have executed, as per the default rules.

Barriers

A routine is divided into partitions. Each partition is a sequence of instructions, just like a case of a case instruction. The partitions may be thought of as being separated by barrier instructions, though no such instructions are actually encoded in RCODE.

Routines execute by propagating permissions down trees of scheduled instructions. A routine execution and its partitions do the following:

- When the routine execution is scheduled, it allocates registers for its partitions and schedules the partitions.
- When a partition is scheduled, it is allowed to schedule its children unless the partition is withdrawn. However, a partition is not required to schedule its children unless it has received side-effect permission.
- When a routine execution receives read-free permission, it is allowed to propagate that permission to any of its scheduled partitions, but is not required to do so.
- When a routine execution has side-effect permission, it must propagate that permission to its first scheduled partition that is not done, provided no partition is withdrawn. When that partition becomes done, this process repeats.
- If a routine execution receives withdraw permission, it propagates that to all its scheduled partitions that are not done.
- If a partition executes a return instruction, then withdraw permission is propagated to all subsequent partitions of the routine execution being returned from.
- If a partition is withdrawn, all subsequent partitions of the same routine execution are withdrawn.

Partitions propagate new states down to their children as per the default rules for execution tree nodes. And both routine executions and partitions become done when all their scheduled children are done, as per the default rules.
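The barrier behavior amounts to threading side-effect permission through the partitions one at a time: it moves to the first partition that is not done and advances only when that partition finishes, and withdrawing one partition withdraws the rest. A sketch, with an invented representation:

```python
# Side-effect permission visits partitions strictly in order, giving the
# effect of barriers between them.  `execute(p)` runs one partition to
# completion and returns True, or False if the partition executed a
# return (which withdraws all subsequent partitions).
def run_partitions(partitions, execute):
    for p in partitions:
        if not execute(p):
            break                # remaining partitions are withdrawn
```

Reads and side-effect-free work in later partitions may still proceed early under the weaker permissions; only side effects are serialized by this discipline.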

Calls and Returns

A call instruction has an optional exception catch case. A call instruction that has no exception catch case cannot receive an exception value list. A call instruction with an exception catch case behaves like a case instruction that performs the call to determine whether to select the exception catch case, and selects that case only if an exception value list is returned.

The steps in executing the call and the routine it calls are as follows, when the call does not receive an exception value list:

- When the call instruction is scheduled, it allocates a set of registers for its argument list, for the address of the routine being called, for the result list to be returned by that routine, for exception values if an exception catch case is present, and for the exception catch case if that is present.
- A call instruction with an exception catch case may choose to schedule the case and propagate read-free permission to it any time after the call instruction is scheduled and before the call instruction is done or withdrawn.
- The address of the routine being called is computed.
- The routine execution is scheduled by allocating its register frame and scheduling all the partitions of the routine.
- The routine and its caller execute, according to permissions propagated from the call instruction to the routine execution. The routine may read values from the argument list after they have been computed. Not all values in the argument list need have been computed by this time.
- A return instruction in the routine is given side-effect permission. This specifies the routine execution result list as the list of registers specified by this return instruction to hold results.
- The routine and its caller execute. The routine may read more values from the argument list after they have been computed. The caller may read values from the result list after they have been computed. Not all values in the result or argument lists need have been computed by this time.
- The routine execution becomes done because each of its partitions becomes done. The routine's register frame is reclaimed, except for the result list, which may still be in use by the caller and is detached from the called routine's register frame and reattached to the caller's register frame.
- If a call instruction has an exception catch case, then when the routine execution child of the call instruction becomes done and no exception value list has been returned to the call instruction, the call instruction propagates withdraw permission to the exception catch case, if it has scheduled it.

It is quite possible for several result lists to be attached to the caller's frame at one time. The caller may make calls in order to process the first such list, and these calls may create more result lists.

If the call instruction receives an exception value list, what happens is as given above, with the following modifications and additions:

- If a call instruction receives an exception value list, it may or may not receive a result list. The routine execution may or may not execute a return instruction. Also, some result values may be received but others not received. This is because a return instruction may have been executed, but only some result values may have been computed before the exception throw.
- If a call instruction receives an exception value list and the call instruction is not withdrawn, it waits till the routine execution is done and then propagates side-effect permission to its exception catch case.
- Note that by propagating only read-free permission to the exception catch case, and by waiting for the routine execution to be done before propagating side-effect permission, we have the effect of a barrier between the routine execution and the exception catch case execution, except that if the routine returns, it does not stop the catch case.
- A call exception catch case, once selected by an exception value list, executes like a normal case instruction case. However, if it does not rethrow the exception, or throw a new exception past its parent call instruction, it must provide result values for its parent call instruction. It does this with a special return instruction. If the special return instruction attempts to return some result value that has already been returned by the routine execution, the special return instruction traps and cannot be continued. It is possible for the exception catch case to discover which result values were returned by the routine execution.

A call instruction without an exception catch case cannot receive an exception value list.

The rules just given are based on an optimistic strategy that assumes exceptions will not be frequent. Thus the caller does not wait to see whether a routine will have an exception before using results from the routine. So if a routine has an exception, it may also have results. But if a routine has an exception and no results, results must be provided if execution is to continue from the call instruction. In general, exception catch code that tries to return its own results is likely to assume that no results were returned, and it will be an error if some are.

Exception Throws

An exception throw instruction does the following:

- The exception throw instruction waits until it has side-effect permission and until all its inputs are available.
- A search is made through the ancestors of the exception throw instruction for a call instruction with an exception catch case.
- All the nodes discovered during this search before the found call instruction are given withdraw permission, which then propagates down other branches of the tree.
- However, if during the search a node is found that already has withdraw permission, the exception throw instruction aborts and finishes execution without any other effect.
- The inputs to the exception throw instruction are used to construct an exception value list, which is sent to the found call instruction.

Note that if more than one exception throw happens in the same routine execution, all but one exception value list will be discarded.
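The search-and-withdraw step can be sketched as a walk up the execution tree. The node representation, dictionaries with `parent` links, is invented for illustration.

```python
# Walk the ancestors of the throwing instruction looking for a call with
# an exception catch case.  Nodes passed on the way are withdrawn; if an
# already-withdrawn node is met, the throw aborts with no other effect.
def throw(node, exception_values):
    n = node['parent']
    while n is not None:
        if n.get('withdrawn'):
            return None                    # throw aborts: lost the race
        if n.get('kind') == 'call' and n.get('catch'):
            n['exception_list'] = list(exception_values)
            return n                       # exception value list sent here
        n['withdrawn'] = True              # withdraw nodes passed over
        n = n.get('parent')
    return None
```

The abort-on-withdrawn check is what makes competing throws in one routine execution resolve to a single surviving exception value list, as noted above.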

Blocks

A block instruction introduces a block of instructions that are equivalent to a routine called by the block instruction. Thus blocks are inlined routines.

A block instruction is like a call instruction except for the following:

- When a block is scheduled, it allocates registers for the partitions in the block and, after these, allocates the result and exception value registers that a call instruction would allocate.
- Block instructions do not have arguments. Instead, the instructions of the block simply reference registers allocated before the registers of the block. Since these are read-only, the block could be made into a routine with these passed as arguments.
- Block instructions can have exception catch cases, like call instructions.

A common programming construct is to return to a blo ck that is an ancestor of the

current blo ck To p ermit this RCODE provides a sp ecial block return instruction

that returns results to the N th ancestor blo ck up the execution tree Note however

that if a return to the N th ancestor blo ck is executed returns to the N st

CHAPTER EXECUTION FLOW

N nd st ancestor blo cks must still b e executed to get all the intervening

blo ck instructions to complete These N separate blo ck return instructions may b e

executed in any order

Thus returning from each of a set of nested blo cks is viewed as an indep endent

activity

Loops

A loop instruction introduces a block of instructions that are equivalent to a tail-recursive routine called by the loop instruction.

A loop instruction is like a call instruction, except for the following:

When a loop is scheduled, it allocates registers for the partitions of several executions of the loop, and after these, allocates the result and exception value registers that a call instruction would allocate.

The loop instruction has arguments, like a call instruction, that become the arguments of the first iteration of the loop.

The loop can execute a loop continue instruction that is like a call in that it starts the next execution of the loop and passes arguments to that execution. However, the loop continue instruction arranges that the next execution of the loop will return directly to the loop instruction if it executes a normal return, and not to the previous execution of the loop. The new loop execution becomes a child of the loop instruction in the execution tree, and not of the loop continue instruction execution.

If a loop execution executes a return instruction, this returns results to the loop instruction. Similarly, if the loop execution executes an exception throw, the throw behaves normally, using the loop instruction execution as the parent of the loop execution.

Loop instructions can have exception catch cases. If an exception throw is caught by the loop instruction, all loop executions that are children of this loop instruction are withdrawn.

A loop instruction provides registers for some number of loop executions. When all of these have been scheduled, the first loop execution must be done, so that its registers can be reclaimed, before the next loop execution can be scheduled. Thus the maximum number of simultaneous loop executions is decided when RCODE machine language is generated, and not when RCODE is compiled. Actually, this is not the real maximum number of simultaneous loop executions; it is only the apparent maximum number presented by the debugging interface, and it restricts debugging more than actual execution.
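The loop-continue behavior, in which each new execution returns directly to the loop instruction rather than to its predecessor, can be modeled as a trampoline. This is a sketch under assumed names (`Continue`, `run_loop`), not R-CODE machine semantics.

```python
# Trampoline model of a loop instruction and its loop continue instruction.
class Continue:
    """Stands in for a loop continue: carries the next execution's arguments."""
    def __init__(self, *args):
        self.args = args

def run_loop(body, *args):
    """Run `body` repeatedly.  A Continue result starts the next execution
    as a child of the loop itself, so a normal return comes straight back
    to the loop instruction, never unwinding through prior iterations."""
    while True:
        result = body(*args)
        if isinstance(result, Continue):
            args = result.args
        else:
            return result        # results go to the loop instruction

def countdown(n, acc):
    if n == 0:
        return acc               # normal return
    return Continue(n - 1, acc + n)
```

Because no iteration waits on the next one's return, the model runs in constant stack space even for very long loops, which mirrors why registers for completed executions can be reclaimed.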

Inlining

RCODE routines are optionally inlined when they are compiled to machine code. In RCODE, each routine is given an inlining priority, which is higher if the routine is better off inlined. Each call is given an inlining enable level. A call is inlined if the called routine can be identified at compile time and the call enable level is less than or equal to this routine's priority.

Typically all calls in one routine are given the same enable level, but this is not required.

There needs to be some way to suppress inlining within inlined routines, i.e. recursive inlining. This is done by giving each routine, and each case of a case instruction, an inlining enable increment. When a routine is inlined, its increment is added to the enable level of all call instructions inside the routine. When a case is inlined, its increment is added to the enable level of all call instructions inside the case.

This proposal is rather experimental and will have to await serious usage to see how well it works.
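The priority, enable level, and increment rules can be illustrated with a toy recursive inliner. The routine table and its field names (`priority`, `increment`, `calls`) are assumptions for the sketch; the point is only how the increment makes recursive inlining bottom out.

```python
def expand(name, enable_level, routines):
    """Expand a routine body, inlining any call whose enable level does
    not exceed the callee's inlining priority."""
    routine = routines[name]
    body = [name]                       # stand-in for the routine's own code
    for callee_name in routine["calls"]:
        callee = routines[callee_name]
        if enable_level <= callee["priority"]:
            # Inlining the callee adds its increment to the enable level of
            # every call inside it, so recursion eventually stops.
            body += expand(callee_name,
                           enable_level + callee["increment"], routines)
        else:
            body.append("call " + callee_name)
    return body
```

A self-recursive routine with priority 2 and increment 1, expanded at enable level 0, is inlined three levels deep and then left as an out-of-line call.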

Localizing

To localize a set of routines within a parent routine means to compile special versions of the localized routines which can only be called by the parent or by one of the other routines in the set being localized. The directed call graph of the set of routines to be localized, excluding the parent, may not have any cycles.

Localized routines may be implemented to use part of their parent's register frame instead of having their own register frame. This is hidden inside an RCODE implementation, but makes for some increase in efficiency.

Localization is similar to inlining, but is useful when a routine is called several times by its parent.

In RCODE, routines are given a localization priority that is separate from the routine's inlining priority. This is compared with call instructions' inlining enable level to determine whether the routine should be localized. A routine is localized if any one call, by the parent or by another localized routine, indicates it should be localized.

If the set of routines to be localized contains a call cycle, this is considered a compilation error. Implementations should break the cycle in some unspecified way and warn the user. Any recursive chain of routines should contain a routine that is not localizable.

Implementations should perform obvious optimizations, such as inlining any localized routine that is called only once, and removing arguments to localized routines that are the same for every call to the routine within its parent.
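The acyclicity requirement on the localized set can be checked with an ordinary depth-first search; a back edge is exactly a call cycle that an implementation would have to break and warn about. The graph shape here ({routine: list of callees within the set}) is an assumption of the sketch.

```python
def has_cycle(call_graph):
    """Return True if the directed call graph of the localized set
    (excluding the parent) contains a call cycle."""
    VISITING, DONE = 1, 2
    state = {}
    def visit(node):
        state[node] = VISITING
        for callee in call_graph.get(node, []):
            if state.get(callee) == VISITING:
                return True               # back edge: a call cycle
            if state.get(callee) is None and visit(callee):
                return True
        state[node] = DONE
        return False
    return any(state.get(n) is None and visit(n) for n in call_graph)
```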

Coroutines

In RCODE, coroutines are implemented by a coroutine channel. A coroutine channel holds a pointer to a call instruction execution that is waiting for results.

Coroutines work as follows:

To start a coroutine, a special coroutine start instruction is used that calls the coroutine and places a pointer to the coroutine start instruction execution, which is a form of call instruction execution, in a coroutine channel. A pointer to the coroutine channel is passed as an argument to the coroutine.

Once started, the coroutine, or one of its subroutines, can then call the coroutine channel. This has the effect of taking the call argument list and using it as the result list for the call instruction execution pointed at by the channel. It then resets the channel to point at the execution of the instruction that called the channel.

Once a coroutine has called its channel, and the previous caller of the channel has received a return, then the previous caller can call the channel to call the coroutine back. This also resets the channel to point at the execution of the instruction that called the channel.

When the coroutine returns, its result values are sent to the call instruction execution pointed at by the coroutine channel.

When the coroutine is called by the start instruction, the coroutine execution is attached in the execution tree to some call instruction execution designated by the start instruction. This designated call instruction execution must be an ancestor of the start instruction, and is typically the parent of the routine execution containing the start instruction execution.

If the coroutine throws an exception, the execution subtree rooted at the coroutine execution is given withdraw permission, but withdraw permission is not propagated to the parent of the coroutine execution, and the exception value list is not sent to that parent or one of its ancestors. The exception value list is saved until the coroutine channel points at a call instruction execution that is not in the coroutine execution tree. Then the exception value list is returned to that call instruction execution.

A coroutine may be localized (see Localizing above) if it does not pass the pointer to the coroutine channel to any non-localized routine, and if the coroutine execution is to be a sibling in the execution tree of the routine executing the start instruction. In this case, a call graph is made of calls that pass the coroutine channel pointer, and all routines connected to the coroutine in this call graph are allocated separate register space in the routine execution frame of the start instruction.

Non-localized coroutines cause the routine execution tree to cease to be stack-like. Localizing coroutines preserves any stack-like nature of this tree.
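The channel protocol above can be loosely modeled with a Python generator standing in for the coroutine side: `send` plays the role of calling the channel (deliver an argument list, receive the other side's next argument list as results). This collapses the two channel directions into one object and is only an analogy, not the R-CODE mechanism.

```python
class Channel:
    """Toy coroutine channel.  Each call() hands an argument list to the
    side that is waiting and returns that side's next channel call."""
    def __init__(self, coroutine_fn):
        # The start instruction: call the coroutine, passing it the channel.
        self.coro = coroutine_fn(self)
        self.first = next(self.coro)     # run up to its first channel call

    def call(self, *args):
        return self.coro.send(args)

def running_sum(channel):
    # The coroutine: each `yield` is a call on the channel, delivering a
    # result list and waiting to be called back with new arguments.
    total = 0
    while True:
        args = yield (total,)
        total += args[0]
```

Each side alternately "owns" the channel, matching the rule that a caller may only call the channel back after the coroutine has called the channel and the previous caller has received its return.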

Non-Signaling NaNs

RCODE instructions that do not normally have side effects output IEEE non-signaling NaNs when they cannot produce more reasonable results. They do this instead of trapping if there is no possibility that a trap routine could produce a reasonable result.

The RCODE short-circuit boolean AND instruction will accept a non-signaling NaN as a legal second input if the first input is false. Thus the second input can be computed in parallel with the first input, even if the first input must be true in order for the second input to be legal. If the first input computes to false, the second input may compute to a non-signaling NaN, but this will not affect the final result, which is false.
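The tolerant short-circuit AND can be sketched with a sentinel standing in for a non-signaling NaN; the names `MISSING` and `and_instruction` are inventions of the sketch.

```python
MISSING = object()   # stand-in for a non-signaling NaN value

def and_instruction(first, second):
    """Short-circuit AND: a NaN second input is legal when first is false,
    so both inputs may have been computed in parallel."""
    if first is False:
        return False              # second may be MISSING; result unaffected
    if second is MISSING:
        # A strict consumer of a NaN input must trap (throw or debug).
        raise ValueError("non-signaling NaN reached a strict consumer")
    return first and second
```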

RCODE case instructions can have an optional case that is selected by a non-signaling NaN input. If they do not have such a case, they trap when given a non-signaling NaN input, and this trap routine cannot return; it must throw an exception or invoke the debugger.

Whether or not the output of an instruction is a non-signaling NaN can be thought of as an additional piece of state for the instruction execution node that is part of the instruction execution tree. However, unlike the rest of the instruction execution tree state, this extra piece may not be a function of the program counter value for superscalar or very long instruction word computer implementations. When it is not, the implementation must manipulate flags at run time to indicate non-signaling NaN outputs, and this causes inefficiencies. Hopefully this will not be necessary often.

IEEE non-signaling NaNs can contain many bits of information. RCODE ignores this extra information, and may not preserve it when copying non-signaling NaN values. This permits RCODE implementations to use a single bit to represent the presence or absence of a non-signaling NaN in a particular RCODE register.

Traps

An instruction traps when it finds it has inputs that it cannot process, and either it makes no sense to output a non-signaling NaN, or there is some possibility that a trap routine might be able to compute some other output for the instruction.

An instruction trap is just like a subroutine call. The routine to be called is selected from a dispatch table, using the instruction operation code and input arguments as appropriate. The routine may return values to emulate the instruction, if appropriate.

More specifically, when an instruction finds its inputs are such that it should trap, it may schedule the trap routine just as if the instruction were a call instruction. If the trap routine returns, the result values are placed in the output registers of the instruction, and the instruction is done. Instructions that trap behave like call instructions that do not have exception catch cases.

The implementation of traps is conceptually like turning instructions into case instructions which have a non-trapping and a trapping case. To permit all the desired traps and trap routine returns, extra code may be needed for some instructions on some hardware. An implementation might have various optimizations for this, such as modes where certain trap routine returns are disallowed, and modes where the extra code is computed on the fly when first needed and then cached for possible future use.

In most cases traps are considered an unusual circumstance, but in some cases they are not. For example, arithmetic operation traps on signaling-NaN tagged values might become the standard way of implementing rational numbers in LISP.
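A trap-as-call dispatch can be sketched as follows. Keying the table on the opcode and the first input's type name is an assumption of the sketch (the text says only "operation code and input arguments as appropriate"); the rational-number example follows the LISP remark above.

```python
from fractions import Fraction

trap_table = {}     # (opcode, input tag) -> trap routine

def execute(opcode, a, b):
    """Run an instruction; on inputs it cannot process, call a trap routine
    whose return value becomes the instruction's output."""
    if opcode == "add" and isinstance(a, int) and isinstance(b, int):
        return a + b                     # the non-trapping case
    handler = trap_table.get((opcode, type(a).__name__))
    if handler is None:
        raise RuntimeError("no trap routine: throw or invoke the debugger")
    return handler(a, b)                 # the trap behaves like a call

# Tagged non-integer values take the trap path, emulating the instruction:
trap_table[("add", "Fraction")] = lambda a, b: a + b
```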

Continuations

RCODE has a special continuation call instruction that makes the new routine execution a child of the parent of the current routine execution, instead of a child of the continuation call instruction execution. A routine execution that executes a continuation call may not execute a routine return. The new routine execution created by the continuation call becomes responsible for returning results to the common parent. One effect is that a call instruction may have more than one routine execution as a child, though only one is supposed to return.

Exception processing is extended so that if an execution sets the state of any routine execution of a call instruction execution to withdraw permission, then the states of any other routine execution children of the same call instruction execution are withdrawn.

If more than one routine execution attempts to return to the same call instruction execution, the result values are undefined, but at least one of the return instructions will trap, so a debugger can intervene.

Recording State

The state of the execution tree must be recorded for the sake of the debugging interface. This is done in the RCODE virtual registers.

When a typical RCODE arithmetic instruction is scheduled, it is copied into its output register. The instruction fits in this register, with a special tag indicating that it is an instruction waiting to execute. When its inputs become available, it executes, and replaces itself in its output register by its result.

More complex instructions, like case instructions, record their state in their output register or registers.

Although these state changes mean that RCODE registers are not simply unchanging constants, the state changes are monotonic, leading from less done values to done values.

All of this is of course virtual. It is only visible through the debugging interface, and is just the means this interface has of presenting the execution tree state to debuggers.

Implementation Challenges

To a first order approximation, it is very easy to compile RCODE for existing computers. Strict sequential execution semantics is used for out-of-line subroutine calls, and inlining is done in RCODE before translation to real machine language. Instructions are issued in any order consistent with the above register dataflow semantics and strict execution of out-of-line calls. The type codes in instructions are used to generate efficient code that does not really produce bit-tagged values. Execution tree state is computed by the compiler as a function of the implementation program counter.

However, at second look, there are some challenges.

Simulating the RCODE Computer

The debugging interface needs to make the code execution look like it was done on a real RCODE computer. Fortunately, there is no speed requirement for doing this.

The debugging interface may need remarkably little saved register frame state to reconstruct the virtual register frame contents. For a simple routine, all the debugging interface needs is the values of the arguments and any values read from memory or returned as result values from out-of-line routine executions. Everything else can be computed by simulation, including the execution tree state.

However, a lot of extra information is required about a routine to perform this reconstruction when the debugger interface is invoked during execution of the routine. Although this information is computed when each routine is compiled, it may be so voluminous that saving it all may not be a good idea. Instead, it can be dynamically recomputed as needed, and a cache maintained of the information for routines recently examined using the debugger interface.

Non-Signaling NaN Outputs

Instructions on existing computers do not simply output non-signaling NaNs, or some equivalent, when confronted by illegal inputs, but instead are likely to trap or set an overflow flag. Therefore it is important not to execute them if their production of a non-signaling NaN would not cause a trap at some point in the routine execution. In some cases it may not be possible to arrange this, and the overflow flag will have to be saved for later use.

For example, given a short-circuit AND instruction whose output is not affected by a non-signaling NaN second input when the first input is false, the first input computation needs to be done before any part of the second input computation that might trap.

On the other hand, if the production of a non-signaling NaN by an instruction, such as an add instruction, is guaranteed to cause an irrecoverable trap later, such as a case instruction trap because the case selection cannot be computed, then the add instruction can be executed and the trap made to look as if it happened later.

Trap Implementation

A number of issues occur when implementing traps that work as if the trapping instruction turned into a subroutine call.

First, some computers may not save enough state to permit this, and there may have to be code associated with the trapping instruction to assist in calling such trap routines.

Second, some traps may need extra code associated with the trapping instruction to permit traps to return properly. For example, if a trap routine returns a non-signaling NaN as the output value of an instruction, this may need to be translated into an immediate second trap of a downstream instruction that would trap if the first instruction outputs a non-signaling NaN.

Faithful execution of trap semantics may be inappropriate in many cases, and implementations should provide modes to disable the generation of the code required in these cases.


Summary

RCODE is to provide a new execution model for future programming languages which minimizes unnecessary sequentiality while running well on both existing and future computers. The future programming languages should be easy to program, but need not be strictly compatible with existing programming languages. The RCODE model of non-memory dataflow, plus barriers and IO-like memory operations, should work well for languages in which most code is functional but a small amount of code is written using barriers and IO-like memory operations.

Chapter

Shared Object Memory

Goals

Because there is a limit on the speed of a single processor, there is an ultimate need for parallel computation in any application in which latency is a factor.

Memory speeds have not been increasing as fast as basic processor speeds. In fact, DRAM memory speeds have been increasing at a much lower rate per year than processor speeds. This phenomenon, called the memory wall, will probably lead to computers in which a secondary cache miss will cause a pause of a few hundred instruction-execution-times.

There are only two methods I know of to cover long cache miss times. The first is prefetching: predicting the need for a piece of memory far enough in advance, in this case a few hundred instruction-execution-times in advance. The second method is multithreading: making a single processor look like several processors, so that when one pauses, another runs. The latter is the more widely applicable solution, and it leads to decomposing programs into a small number of threads that work together on a common shared memory.

Now consider the situation where many computers reside in the same building, as happens in schools. We may want these computers to work together on the same problem, where problems of interest include compilations, text processing, games, and simulations. The communication latency between computers is limited by the speed of light, and as fiber optic communication with throughputs on the order of a gigabit per second is becoming inexpensive, the speed of light latency is probably the only limit on inter-computer communications for these within-building systems. This latency will likely be a few thousand instruction-execution-times within our time frame (see C).


This within-building situation leads to the idea of having many processors work on the same problem by sharing an object memory, where it takes a few thousand instructions to fetch an object. This memory is similar to disk, but this shared object memory has a response many times faster than disk, albeit many times slower than local memory.

Thus many computers in the same building should have a shared object memory so they can work on the same problem, and a single computer on a desktop may want to run several threads simultaneously, with the threads sharing a memory that might as well be organized as a shared object memory too.

Thus the goal of this chapter:

G. Find a design for a shared object memory that can be used for communications between separate execution threads. Assume latencies of several hundred to several thousand instruction-execution-times for fetching information from the shared object memory.

By choosing to use multiple heavyweight threads to hide memory latency, I have excluded the micro dataflow approach to handling long memory latency. This has been done because special hardware is needed to make micro dataflow work well, as discussed earlier, and because I feel the macro dataflow approach will be sufficient. Macro dataflow is just dataflow in which the basic data items are larger, typically at least a few hundred bytes each, and the basic operations are larger, at least a few thousand instruction-execution-times each. To the best of my knowledge, macro dataflow suffices for most, but not all, computer applications, while micro dataflow is necessary for only a small minority of applications. Micro dataflow may be easier to program for matrix applications, but macro dataflow may well be easier for text processing and compiling applications.

This chapter represents research into the possibilities of shared object memory, but it is in a less finished state than the research of previous chapters on other facilities in RCODE. There are more loose ends, a few of which I will identify below, and the rest of which I do not know about yet.


Requirements

What kinds of operations can be performed on objects in shared object memory?

First, many objects will turn out to be read-only after they are created, just as many files are. The operations on these are creation, reading an object, and reading part of a large object.

For functional programming, it is also convenient to have objects that go through an initial write-only phase, during which many threads write them but cannot read them. Then the objects are switched to read-only, after which they cannot be written. A histogram is similar, but undergoes an accumulate-only phase, during which many threads may add to its elements. Then, when it is finished, it is switched to read-only.

However, it seems unreasonable to expect all components of an object to be in the same situation during one phase of the object. Some may be read-only while others are accumulate-only, for example.

This thinking leads to the following:

F8. RCODE will support a shared object memory in which individual object components can be marked as read-only, write-only, write-once, or accumulate-only as part of the component type. Also, objects may change types dynamically, so their components can, for example, switch from write-only to read-only.

However, this approach is not always enough, so:

F9. RCODE shared object memory will support atomic multi-object transactions.

Special hardware will be necessary to make a real networked shared object memory, but this is not available, even approximately, today. However, shared memory symmetric multiprocessors have been available for some time and are working their way toward student computers (see H). This leads to:

F10. RCODE shared object memory will be efficient on existing symmetric multiprocessors, but may require specially designed hardware for efficiency on a within-building network.

Lastly, there is an issue which I have not currently addressed very well, and do not list as a requirement because I do not know whether or not it needs to be a separate requirement. This is the issue of how to do searches efficiently for the within-building shared object memory. I simply have not yet studied this problem sufficiently.

There may be other such issues that I have not yet identified.

Main Problems

There are several major problems that need to be solved by the shared object memory.

One of these is cache coherency: in a within-building system, keeping caches coherent seems unreasonable.

A second problem is thread synchronization: when a thread runs out of things to do, it stops, and something must restart it and direct its attention to new things to do. In a within-building shared object memory, searching for things to do may be inefficient.

A third problem is write delay: in order to sequence writes to a shared memory, one must wait for earlier writes to complete before doing later writes, but this wait time can be unreasonably long for a within-building system.

A fourth problem is atomic transactions: threads need to perform operations that lock some data, read and write the data, and then unlock the data.

A fifth problem is hot spots: if every thread in a building tries to read the same memory location, or increment the same counter, the memory location to be read or incremented becomes a bottleneck, or hot spot.

Shared object memory must offer solutions to these problems, both on existing symmetric multiprocessors and on future within-building systems. In the latter case special hardware will be required, but it should not be unnecessarily complex.

In the following subsections I introduce proposed solutions to each of these problems.

The Cache Coherency Problem

Symmetric multiprocessors may have small incoherent primary caches, while within-building systems will have large incoherent caches. Both systems may have write buffers that delay writes, but I will assume these are small in both cases.

Most objects in shared object memory simply go through a write-only phase followed by a read-only phase. One problem for incoherent caches is how to handle this phase transition.

In a current symmetric multiprocessor, the phase transition is handled by instructions that flush delayed writes to memory and invalidate incoherent cache entries used to read shared memory. These operations are fairly fast, and can reasonably be used to implement phase transitions. One of the reasons these operations are fast is that the larger caches in such a system are coherently maintained, although the cost of doing this is beginning to become great enough that coherency is becoming an option.

On future within-building computer systems, the local caches for a shared object memory may be many megabytes in size, and will not be coherently maintained. The caches are so large that invalidating an entire cache is unreasonable. One approach is to have instructions that will invalidate only particular regions of the cache, but even this might be time consuming.

A different approach, which I propose for actual use, is a cache using virtual object addresses that consist of an object number and a displacement within the object. This has the advantage that objects can be moved in memory without modifying the caches, and it provides a necessary alternative to the address registers described earlier for permitting such movement. Keeping address registers coherent in such a large distributed system would be difficult, but with virtual address caching this is not necessary.

The phase transition of an object is handled by giving the object a new object number and invalidating the old object number. Because the new object number will not be in the virtual address caches, there is no need to invalidate any cache entries.

Invalidating the old object number is only needed to detect serious programming errors. Normally an optimistic system can be used that invalidates an object number in a few thousand instruction-execution-times while the application program keeps on running. This assumes programming errors will not occur just during this invalidation delay: either they will not occur at all, or they will also occur outside the delay and be found and fixed. Sometimes a pessimistic system, which delays use of the new object number until the old object is withdrawn, may help debugging.
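The renumbering scheme can be sketched with a toy object store; the class and method names (`ObjectMemory`, `freeze`) are inventions of the sketch. The key point is that a phase transition issues a fresh object number, so stale cache entries keyed by the old number simply never match again, and uses of the old number can be trapped as errors.

```python
class ObjectMemory:
    """Toy store keyed by object number; a phase change renumbers."""
    def __init__(self):
        self.next_id = 0
        self.store = {}        # object number -> (phase, data)

    def create(self, data, phase="write-only"):
        oid = self.next_id
        self.next_id += 1
        self.store[oid] = (phase, data)
        return oid

    def freeze(self, oid):
        """Switch write-only to read-only under a fresh object number;
        the old number is invalidated to catch programming errors."""
        _, data = self.store.pop(oid)
        return self.create(data, "read-only")

    def read(self, oid):
        phase, data = self.store.get(oid, (None, None))
        if phase != "read-only":
            raise KeyError("stale object number or wrong phase")
        return data
```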

Some objects have some components that can change at any time, and are therefore volatile. We will discuss these in conjunction with thread synchronization in the next section.


The Thread Synchronization Problem

There is a standard system for synchronizing threads that is commonly used for symmetric multiprocessors. This system, which I will call standard synchronization, uses two operations: WAIT and NOTIFY. WAIT delays a thread until the thread has been notified by a NOTIFY. More precisely, when a thread executes a WAIT, it stops until some thread has executed a NOTIFY targeted to the waiting thread at some time since the waiting thread executed its previous WAIT (i.e., the WAIT instruction execution previous to the current WAIT instruction execution).

Generally, any thread can wait for some value to appear in memory by executing a loop in which it first checks whether the value is in memory, and then, if it does not find the value, does a WAIT followed by a repeat of the loop. We will call this the standard synchronization loop. Any thread that writes the value into memory must NOTIFY all the threads that might be waiting for that value, after the value is written.

It is up to the programmer to maintain lists of threads that might be waiting for a value. As this can be a major efficiency issue, there is no end to the ways of doing this in different situations.
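The standard synchronization loop maps directly onto condition variables; here Python's `threading.Condition` stands in for WAIT and NOTIFY, with a single mailbox as the shared memory location. The variable names are illustrative.

```python
import threading

mailbox = {"value": None}        # the shared memory location being watched
cond = threading.Condition()

def consumer(out):
    with cond:
        # The standard synchronization loop: check, and if the value is
        # not there, WAIT and repeat.
        while mailbox["value"] is None:
            cond.wait()
        out.append(mailbox["value"])

def producer(value):
    with cond:
        mailbox["value"] = value  # write the value...
        cond.notify_all()         # ...then NOTIFY possible waiters
```

The check-then-WAIT loop (rather than a single WAIT) is what makes the scheme safe against a NOTIFY that arrives before the waiter gets around to waiting.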

Often, considerable data is transmitted between threads by writing the data and then writing a flag that indicates the data has been written. The writer must flush delayed writes after writing the data and before writing the flag, and then must flush delayed writes after writing the flag, to be sure the flag is promptly visible. The reader must invalidate incoherent caches before looking for the flag, to be sure the flag is promptly visible. And the reader must invalidate incoherent caches after reading the flag and before reading the data, to be sure the data read is the data that existed when the flag was written.

The reason I call this standard synchronization is that it is widely used both with symmetric multiprocessors and to synchronize any processor with its IO devices: the device registers are shared memory, and traps are notifications.

In a within-building shared object memory, most of the data goes through a write-only to read-only phase transition, as described above. Assuming virtual address caches are used, it will not be necessary to invalidate caches in order to read current data. However, invalidating the cache entry containing the flag needs to be done in order to see the current flag value, as the caches in a within-building system will not be coherent.


Invalidating the cache entries of flags is a problem, since each cache miss required to check a flag to see if there is work to be done will cause a multi-thousand instruction-execution-time delay. To this problem I propose the following solution.

Certain object components, namely the flags, are marked volatile in RCODE, so they can be processed specially by the RCODE compiler. For a within-building system, there is a separate special per-thread volatile cache for volatile components. This cache is invalidated at the beginning of the standard synchronization loop.

The problem with standard synchronization lo op using this cache is mostly that the

lo op will want to lo ok at many volatile values to decide what to do Waiting for

one or two cache misses is not a problem but waiting for or is inecient So

we will try to pro cess all these cache misses in parallel without changing the basic

structure of the standard synchronization lo op

To do this, instructions that read volatile values behave specially on a cache miss. Instead of waiting for the cache to be loaded, they immediately return a special value to indicate that the value is currently missing. We will call this special value the missing value; it might actually be a non-signaling NaN (see section above).

The code that processes the volatile value should do the appropriate thing when confronted with a missing value, namely nothing. The loop will go on to read more volatile values. The first time through, the loop will execute many read instructions that cause cache misses and initiate within-building shared object memory reads. The volatile component read instructions will return missing values, and the loop will find nothing to do.

The volatile cache will remember whether it has delivered any missing values since the beginning of each synchronization loop. If it has, then at the end of the loop the loop will repeat without doing a WAIT. It will continue doing this until the loop sees no more missing values, at which time, if the loop has found no work, it will execute a WAIT.
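The loop just described can be sketched as a small simulation. The class, the MISSING sentinel, and the flag names below are illustrative assumptions, not part of the RCODE definition; the point is only the control structure: misses return MISSING immediately, replies arrive in parallel, and the loop repeats without a WAIT while any read returned MISSING.

```python
# Illustrative simulation of the standard synchronization loop with a
# volatile cache that returns a MISSING sentinel on a cache miss.
MISSING = object()  # stand-in for a non-signaling-NaN "missing value"

class VolatileCache:
    def __init__(self, memory):
        self.memory = memory          # shared memory: flag name -> value
        self.entries = {}             # cached flag values
        self.pending = set()          # reads in flight on the network
        self.returned_missing = False

    def invalidate(self):
        self.entries.clear()
        self.returned_missing = False

    def read(self, name):
        if name in self.entries:
            return self.entries[name]
        # Cache miss: initiate the network read, return MISSING at once.
        self.pending.add(name)
        self.returned_missing = True
        return MISSING

    def deliver_pending(self):
        # Models the network replies arriving, about one latency later.
        for name in self.pending:
            self.entries[name] = self.memory[name]
        self.pending.clear()

def synchronization_loop(cache, flag_names):
    """Repeat without WAITing while any read returned MISSING."""
    passes = 0
    cache.invalidate()
    while True:
        passes += 1
        work = [n for n in flag_names
                if cache.read(n) not in (MISSING, False)]
        if cache.returned_missing:
            cache.returned_missing = False
            cache.deliver_pending()   # all replies overlap in parallel
            continue                  # retry the loop without a WAIT
        return work, passes

memory = {"flag_a": False, "flag_b": True, "flag_c": True}
cache = VolatileCache(memory)
work, passes = synchronization_loop(cache, list(memory))
```

All three flags miss on the first pass, their reads overlap, and the second pass finds the work, so the whole loop costs about one network latency rather than one latency per flag.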

The volatile cache must be large enough to hold all the volatile values required by one loop. Furthermore, cache entries must not be purged once loaded during a loop, so the cache should be fully associative. However, a large fully associative hardware cache is not necessary: a smaller non-associative hardware cache could be supported by software, since the RCODE compiler knows which read instructions are reading volatile values and can maintain a software cache of successes.

The advantage of all this is that most within-building network reads for volatile values will be done in parallel, being scheduled by the cache misses occurring during the first iteration of the loop. Thus the entire standard synchronization loop will run in approximately a single within-building network latency delay, typically a few thousand instruction-execution-times.

As an optimization, the loop can delay somewhat between successive iterations, if the processor can switch threads fast enough to use the time efficiently.

To improve the semantics, the volatile cache should deliver nothing but missing values after it has delivered the first such value in a loop, even if it has actual values to deliver. This ensures that work which was of lower priority, and whose flags are therefore checked later in the loop, is not scheduled ahead of work of higher priority, checked earlier in the loop, just because the higher priority flag values arrived later through the building-wide network.

Among the many possible ways of handling the thread synchronization problem, I have chosen this one because it is compatible with standard synchronization and because it requires only simple special hardware.

The Write Delay Problem

After an object created in the functional programming style has all its components written, the object is converted to read-only, and a read-only pointer to the object is returned to some potential user of the object. However, if the read-only pointer is written to shared memory and then read and used by another processor, it is possible the other processor will read some part of the object before the writes to that part of the object are finished. If this happens, the second processor will read wrong data.

On the other hand, if the original object-creating processor were to wait until all component writes completed before returning the read-only pointer, it would have to wait a few thousand instruction-execution-times in a within-building system. This is an unreasonable overhead for creating an arbitrary object.

When data is being written to be received by other processors, generally the data is written first, and then some flag is written to indicate the presence of the data. In the above case the read-only pointer might be the flag. More commonly, a cluster of new objects is created, and a flag is written indicating the presence of the entire cluster. The solution I propose to the inefficiency of waiting for writes to finish is the following.

Code is separated into a partition that writes data, followed by a partition that writes flags. At the boundary between these particular types of partition, the processor waits for all writes outstanding in the network to complete. Thus a single network-latency delay will occur only when flags are about to be written.
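The discipline can be sketched as follows; the thread pool stands in for the network, and the `write`/`write_barrier` names are illustrative assumptions, not RCODE operations. Data writes are issued without waiting, and only the partition boundary drains them.

```python
# Sketch of the write-then-flag partition discipline: data writes are
# issued asynchronously, and the processor waits for all outstanding
# network writes only at the boundary before the flag-writing partition.
from concurrent.futures import ThreadPoolExecutor, wait
import time

shared = {}                               # models shared object memory
network = ThreadPoolExecutor(max_workers=8)
outstanding = []                          # writes in flight

def write(component, value, delay=0.01):
    # Issue the write into the network and continue immediately.
    def do_write():
        time.sleep(delay)                 # models network latency
        shared[component] = value
    outstanding.append(network.submit(do_write))

def write_barrier():
    # Partition boundary: one wait covers every outstanding write.
    wait(outstanding)
    outstanding.clear()

# Data-writing partition: no per-write waiting.
for i in range(4):
    write(f"obj.{i}", i * i)

write_barrier()       # the single network-latency delay happens here

# Flag-writing partition: a reader that sees the flag also sees the data.
write("flag", True, delay=0)
write_barrier()
```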

The Atomic Transaction Problem

An atomic transaction locks some data, reads and writes it, and then unlocks the data. One problem with such transactions is that they are inefficient if the lock/unlock operations take very long, or if read-only atomic transactions have to wait on each other's locks. A second problem is that all of the data involved is volatile, and accessing it could take a long time in a within-building system.

One example of an atomic transaction is adding an item to a queue. Another example is adding an item to a directory. A problem with the latter example is that searching the directory to find out whether an item to be added is already there may involve reading long lists and take a long time.

For symmetric multiprocessors I propose an implementation of atomic transactions based on the two counter locking scheme of Lamport [Lam]. Sets of data are protected by count locks that consist of a pair of aligned counters that can be atomically read or written. The counters are called the precounter and postcounter. The data sets may be as small as a queue header or as large as a big directory.

The precounter is incremented by a data set writing routine before it does the write; the postcounter is incremented after the write is done; and the counters are equal when no write is in progress. A reader reads and remembers the postcounter first, copies information from the data set, and then reads the precounter and checks for equality with the saved value of the postcounter. If there is inequality, the data read may be garbage, and the reader retries.

The data read by a reader may not be consistent if reading was overlapped by a write. A reader may check at any time to see if the data is consistent, by reading the precounter and checking for equality with the saved postcounter. If the two are unequal, the data read may be inconsistent, and the reader will fail.
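The count lock protocol can be sketched in a few lines; the class and field names are illustrative, and the single-threaded simulation stands in for atomically readable counters.

```python
# Minimal sketch of the two-counter ("count lock") discipline for a
# small data set, following Lamport's scheme as described above.
class CountLockedData:
    def __init__(self, data):
        self.precount = 0
        self.postcount = 0
        self.data = dict(data)

    def write(self, updates):
        self.precount += 1          # announce: a write is in progress
        self.data.update(updates)
        self.postcount += 1         # counters equal again: write done

    def read(self):
        while True:
            saved = self.postcount  # read and remember postcount first
            copy = dict(self.data)  # copy information from the data set
            if self.precount == saved:
                return copy         # counters match: copy is consistent
            # otherwise a write overlapped the read, so retry

q = CountLockedData({"head": 0, "tail": 0})
q.write({"tail": 1})
snapshot = q.read()
```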

An atomic transaction that updates one or more data sets goes through a read phase followed by a write phase. During the read phase, the transaction reads and remembers the postcounters of data sets, reads data from each data set after reading the set's postcounter, and makes a deferred write list of the locations in the data sets to be written and the values to be written.

During the write phase, the transaction raises to high priority on its local processor, to lock out competing transactions on that processor; acquires a test-and-set lock for each data set, to lock out competing transactions on other processors; tests the data set precounters for equality with the saved postcounter values; and, if there is equality, performs the writes indicated by the deferred write list, before clearing the test-and-set locks and lowering to the normal priority level. If the counter equality test fails, the transaction must retry.

Read-only transactions and the read phase of updating transactions do not lock out each other. The write phase of a transaction is the only part that conflicts with other transactions, and it is very short, since the writes are preplanned by the deferred write list. However, this is an optimistic locking scheme, and it can thrash in the presence of a large number of writes, so it may be desirable to detect thrashing and have a transaction do something special when thrashing occurs, such as reserving data sets for a time interval normally long enough to include the read phase of the transaction.
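The read phase / write phase protocol just described can be sketched as follows. `DataSet`, the trivial test-and-set model, and the retry limit are illustrative assumptions; processor priority levels are not modeled.

```python
# Sketch of an optimistic transaction against a count-locked data set:
# a read phase that plans a deferred write list, then a short write
# phase that locks, re-checks the counters, and applies the writes.
class DataSet:
    def __init__(self, data):
        self.precount = 0
        self.postcount = 0
        self.locked = False          # models the test-and-set lock
        self.data = dict(data)

def transaction(dataset, update_fn, retries=3):
    for _ in range(retries):
        # Read phase: remember the postcount, then read a snapshot.
        saved = dataset.postcount
        snapshot = dict(dataset.data)
        deferred_writes = update_fn(snapshot)   # planned writes only

        # Write phase: acquire the test-and-set lock, re-check counters.
        if dataset.locked:
            continue                             # competing writer; retry
        dataset.locked = True
        try:
            if dataset.precount == saved:        # still valid?
                dataset.precount += 1
                dataset.data.update(deferred_writes)
                dataset.postcount += 1
                return True
        finally:
            dataset.locked = False
        # The counter check failed: another write intervened, so retry.
    return False

queue = DataSet({"items": (), "count": 0})
ok = transaction(queue, lambda s: {"items": s["items"] + ("job",),
                                   "count": s["count"] + 1})
```

Note that conflicting transactions only collide in the short write phase; the read phase never blocks anyone.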

The next problem is how to do all this in a within-building shared object memory. The following solution is quite simple.

The lock counters and components of the data sets being locked are treated as volatile components and cached in the volatile cache. This cache is completely invalidated just before each read of a postcount. This guarantees that all components read after the postcount will be at least as recent as the postcount. Here all volatile reads wait for their data to be delivered before completing; they do not return missing values, as reads did in the standard synchronization loop above.

In the write phase, a test-and-set lock is set for each data set, and then the volatile cache is invalidated again, the precounts are all read and compared with the saved postcounts, and, if the precounts are equal, the deferred write list data is written before clearing the test-and-set locks. Also, the write phase must occur at an appropriate priority level, to lock out transactions by different threads on the same processor.

As an example, for a queue with a count lock in the same object as the queue data set components, there would be: a network read to obtain the object with the postcount and data set elements; a second network read to obtain the precount, in order to verify that the data set elements read are valid; a network test-and-set; a third network read for the precount; overlapping write operations to process the deferred write list; and a clear of the test-and-set lock. Thus a queue operation would take about six network latency times. Some of the six steps just given could be combined by optimizations, e.g. acquiring the test-and-set lock and reading the precount.

For a simple operation like maintaining a small queue, it might be better to simply do the test-and-set at the beginning, returning the object contents with the result that tells whether the test-and-set was successful. It might also be possible to batch clearing the test-and-set flag with component writes in a single operation. But in the more general case, as for a large directory with many objects involved in the locked data set, the more general approach just described would have to be used.

The Hotspot Problem

A hotspot occurs when many processors attempt to read the same location, or perform some accumulate operation, such as addition, on the same location.

To solve the hotspot problem for a building-wide system, I assume that network nodes can be built that will combine requests from several processors into a single request on the rest of the system, and, when the response comes back from the single request, will redistribute the response, appropriately modified, to the original requesters.

For example, if a network node receives three requests to read the same block of memory from three processors, it can make one request on the rest of the system to read the memory, and then distribute the results to the requesting processors.

As another example, if a network node receives requests from two processors to add to a memory location, it can sum the requests and send the sum to be added to the location. If the result returned from such an operation is supposed to be the location value before any addition, then when the node gets the result it can send it to one processor, and send to the second processor the sum of this result and the value added by the first processor.

It is important in this latter scheme that the value returned by a request to accumulate to memory be the value before the accumulate. Trying to return the value after accumulation makes it impossible for a network node combining processor requests to compute values to be sent to processors, unless the operation is invertible. Not all operations are invertible: taking the maximum is not.
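The combining-and-redistribution arithmetic for the addition example can be sketched directly; the function names and the processor labels are illustrative.

```python
# Sketch of a combining network node for accumulate operations: two
# requests to add to the same location are merged into one upstream
# request, and the single pre-value that comes back is split into the
# pre-value each processor would have seen individually.
def combine_adds(requests):
    """requests: list of (processor, addend), serialized in list order.
    Returns the merged addend, plus each processor's offset from the
    merged pre-value (the sum of what earlier requesters added)."""
    merged = sum(addend for _, addend in requests)
    offsets, running = {}, 0
    for proc, addend in requests:
        offsets[proc] = running
        running += addend
    return merged, offsets

def redistribute(pre_value, offsets):
    # Each processor receives the location value before ITS addition.
    return {proc: pre_value + off for proc, off in offsets.items()}

memory_cell = 100
merged, offsets = combine_adds([("p0", 5), ("p1", 7)])
# One upstream fetch-and-add replaces two separate ones:
pre_value, memory_cell = memory_cell, memory_cell + merged
replies = redistribute(pre_value, offsets)
```

The splitting step is exactly what fails for a non-invertible operation such as maximum: there is no way to recover, from the merged pre-value alone, what each individual requester's pre-value would have been.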

I assume that at some future time it will be reasonable to equip network nodes with RISC processors with small amounts of memory. For example, a network node might be a single integrated circuit, and the RISC processor might be part of that circuit.

Accumulate operations can be implemented by functions whose only variable input is the values they are combining. These functions may have small amounts of constant input, such as small tables, that may be loaded into various network nodes to permit these nodes to perform the accumulate operations.

The only thing that RCODE does to compensate for hotspots is provide a design that permits combining network nodes to be built easily.

Principles

RCODE shared object memory is based on several principles that are described in this section. These mesh somewhat with the discussion in the previous sections on shared object memory problems.

The Ordered Partition Model

It is assumed that programs are divided into partitions that are executed in some well defined order, but that memory operations within a partition are unordered. This model conforms to the partition model presented in an earlier chapter. It also conforms to barrier models that are widely used in parallel programming, e.g. the BSP model [Val, GV].

Commutative/Associative Operations

The memory operations within a partition must be reorderable without changing the semantics of the program. This means these operations must be commutative and associative.

Operations on different memory components have these properties. Thus if a partition has a set of component write operations, all on different components, these operations commute and associate.

Strictly read operations have this property. If all operations on one component in a partition are read operations, all these read operations commute and associate. They will all return the same value.

Typical accumulate operations, such as addition or taking the maximum, commute and associate, even on a single component.

Type Change

Individual objects are expected to go through phases whose boundaries are marked by object type changes. Each phase must complete before the next phase starts. The phases may be implemented by the partitions mentioned above.

For example, a typical functional programming object will have a write-only phase followed by a read-only phase. In the write-only phase, one write operation is executed for each component. In the read-only phase, each component may be read as often or as little as desired.

A histogram may go through three phases: write-only, accumulate-only, and, after it is complete, read-only.

Per Component Access Disciplines

There are different disciplines for accessing a component within an object phase. Example disciplines are read-only, write-only, and accumulate.

Different components of an object may have different access disciplines within the same phase of the object. Thus some components may be read-only, while others are write-only, and still others are accumulate.

Volatile Component Caching

A volatile object component is a component that may change value independently of the thread looking at it.

Volatile components are handled by a special volatile cache. This cache is not automatically coherent with shared memory, and has a special cache invalidate instruction to clear the cache.

Special attention must be paid to organizing the program and the volatile cache to avoid unreasonable delays due to volatile cache misses.

Probabilistic Error Detection

Fatal errors, such as accessing an object that has been deleted, or accessing an object whose type has changed using an obsolete virtual address, are assumed not to occur in perverse patterns. Therefore methods of detecting these that have a high, but not perfect, probability of detection should usually suffice.

Related Work

The use of small incoherent primary caches has been pioneered by several computer companies, for example Digital Equipment Corporation [Sit].

Alternate methods of thread synchronization, involving read instructions which wait for empty words to become full, may be found in the work of Burton Smith [Smi, ACC+] and Arvind [Nik, Pap].

The notion of a shared object memory with caches that use the virtual address of the objects has been explored in the MUSHROOM project [WW].

RCODE Shared Object Memory Semantics

In this section we will give some details of the RCODE implementation of shared object memory.

Access Disciplines

In RCODE, each component load or store treats the component as having one of the following access disciplines:

    read-only
    write-only
    write-once
    accumulate
    transaction
    volatile

Below we will describe how RCODE specifies which access discipline to use with a particular load or store, and how each of the access disciplines behaves.

Access Specifications

In RCODE, the access discipline used to load or store a component is determined by the access specification that is part of the type field in the RCODE pointer to the component.

RCODE component load and store instructions define the component to be loaded or stored by means of an RCODE pointer, formatted as in the figure below. This pointer is a virtual value: it contains the relevant information, but the information is not actually stored in this form during program execution. The access specification is part of the type field of the pointer.

The rest of the type field includes the numeric type and size of the component being accessed, and this information must be compiled into instructions on typical modern computers. Similarly, the access specification must be compiled into instructions.

When either the type or access specification information is not available at compile time, load and store instructions turn into dynamic case statements that switch on the type field value, and compile new cases of themselves when they see new type field values. This permits the type field information to be compiled into machine instructions even when it is not known at compile time (see the earlier discussion of dynamic case statements).

Some of the access specifications correspond directly to the different access disciplines, and some have a different use. The READ-ONLY, WRITE-ONLY, VOLATILE, WRITE-ONCE, and ACCUMULATE access specifications correspond directly to access disciplines. The NEUTRAL and TRANSACTION access specifications are inputs to an operation, described below, that combines a pointer to an object or subobject with a component descriptor. The OPEN, OPEN-READ-ONLY, and OPEN-WRITE-ONLY access specifications are outputs from this combination operation, and are used as component access specifications for the transaction access discipline. The word OPEN signifies that the transaction has obtained the required count locks, by merely reading the postcount, so that the locks are open.

RCODE Pointer:

    tag | type | object number | unsigned displacement

Type:

    access specification | rest of type

    access specification ::= NEUTRAL
                           | READ-ONLY
                           | WRITE-ONLY
                           | VOLATILE
                           | WRITE-ONCE (pl)
                           | ACCUMULATE (pl)
                           | TRANSACTION (pl)
                           | OPEN
                           | OPEN-READ-ONLY
                           | OPEN-WRITE-ONLY

    pl = priority level

Figure: RCODE Pointers and Types

Several access specifications include a priority level. These access specifications cannot be used by any thread running at a higher priority level. Implementations are therefore free to lock out other threads on the same processor by raising to the priority level indicated by the access specification. Usually the operation governed by the access specification is short enough that the access specification priority level can be so high that any thread can use the specification.

Access Specification Combination

The component pointer used by a load or store instruction is usually created by combining a subobject pointer with a component descriptor that describes how to access a component of the subobject (see the figures below). The component pointer output by this combining operation takes its object number from the subobject pointer, and its displacement from the sum of the subobject pointer and descriptor displacements. The type of the output component pointer is taken from the component descriptor, except for the access specification, which is derived by combining the access specifications of the subobject pointer and component descriptor according to the combination table below. The blank entries in this table are errors, as is any combination of access specifications not listed in this table.

If the subobject pointer access specification is NEUTRAL, the component pointer access specification is taken from the component descriptor. The exception to this rule is that a TRANSACTION component descriptor produces an OPEN component pointer, and also obtains a lock (see The Atomic Transaction Access Discipline below).

READ-ONLY and WRITE-ONLY subobject pointers always force the access specification of the component pointer to be READ-ONLY or WRITE-ONLY respectively. This feature is commonly used to initialize a subobject by using a WRITE-ONLY pointer to make all its components WRITE-ONLY. This saves creating a separate type map for initialization. The companion feature of forcing all components to be READ-ONLY can be similarly used to implement an object type change that makes an entire object read-only without requiring an additional type map.

RCODE Pointer:

    tag | pointer type | object number | unsigned displacement

RCODE Component Descriptor:

    tag | pointer type | lock number | signed displacement

Combination of Pointer and Component Descriptor:

    tag | combined type | object number | sum of displacements

Figure: Combining RCODE Pointers and Component Descriptors

Component descriptor access specification (rows) combined with subobject pointer access specification (columns); blank entries are errors:

                 NEUTRAL      WRITE-ONLY   READ-ONLY   OPEN             OPEN-WRITE-ONLY   OPEN-READ-ONLY
    NEUTRAL      NEUTRAL      WRITE-ONLY   READ-ONLY   OPEN             OPEN-WRITE-ONLY   OPEN-READ-ONLY
    WRITE-ONLY   WRITE-ONLY   WRITE-ONLY   READ-ONLY   OPEN-WRITE-ONLY  OPEN-WRITE-ONLY
    READ-ONLY    READ-ONLY    WRITE-ONLY   READ-ONLY   OPEN-READ-ONLY                     OPEN-READ-ONLY
    VOLATILE     VOLATILE     WRITE-ONLY   READ-ONLY
    WRITE-ONCE   WRITE-ONCE   WRITE-ONLY   READ-ONLY
    ACCUMULATE   ACCUMULATE   WRITE-ONLY   READ-ONLY
    TRANSACTION  OPEN         WRITE-ONLY   READ-ONLY

Figure: Access Specification Combination
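The combination rule can be rendered as a table lookup; the function name, the hyphenated spelling of the specification names, and the error handling below are illustrative assumptions, and the entries follow the combination table and forcing rules described above.

```python
# Sketch of the access specification combination rule as a table lookup.
FORCING = {"WRITE-ONLY", "READ-ONLY"}   # pointer specs that always win

COMBINE = {
    # (pointer spec, descriptor spec) -> component pointer spec
    ("NEUTRAL", "NEUTRAL"): "NEUTRAL",
    ("NEUTRAL", "VOLATILE"): "VOLATILE",
    ("NEUTRAL", "WRITE-ONCE"): "WRITE-ONCE",
    ("NEUTRAL", "ACCUMULATE"): "ACCUMULATE",
    ("NEUTRAL", "WRITE-ONLY"): "WRITE-ONLY",
    ("NEUTRAL", "READ-ONLY"): "READ-ONLY",
    ("NEUTRAL", "TRANSACTION"): "OPEN",   # also acquires the lock
    ("OPEN", "NEUTRAL"): "OPEN",
    ("OPEN", "WRITE-ONLY"): "OPEN-WRITE-ONLY",
    ("OPEN", "READ-ONLY"): "OPEN-READ-ONLY",
    ("OPEN-WRITE-ONLY", "NEUTRAL"): "OPEN-WRITE-ONLY",
    ("OPEN-WRITE-ONLY", "WRITE-ONLY"): "OPEN-WRITE-ONLY",
    ("OPEN-READ-ONLY", "NEUTRAL"): "OPEN-READ-ONLY",
    ("OPEN-READ-ONLY", "READ-ONLY"): "OPEN-READ-ONLY",
}

def combine(pointer_spec, descriptor_spec):
    # READ-ONLY and WRITE-ONLY subobject pointers force the result.
    if pointer_spec in FORCING:
        return pointer_spec
    try:
        return COMBINE[(pointer_spec, descriptor_spec)]
    except KeyError:
        # Blank table entries and unlisted combinations are errors.
        raise ValueError("illegal access specification combination")
```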

Other details on the access specifications are given below, with the discussion of particular access disciplines.

The Volatile Cache

RCODE has a special per-thread cache for volatile components, called the volatile cache. For a within-building system this would be special hardware. For a symmetric multiprocessor with an incoherent primary cache and a cache invalidate instruction, the primary cache effectively implements the RCODE volatile cache.

The volatile cache is not coherent with shared memory. There is a special INVALIDATE-CACHE instruction that clears the volatile cache. Writes may or may not update the volatile cache.

Read instructions use the volatile cache when the access specification for the component being read is VOLATILE, WRITE-ONCE, ACCUMULATE, OPEN, or OPEN-READ-ONLY; these last two are for transactions.

The cache has the following missing value capabilities, which permit implementation of standard synchronization loops as described above.

Read instructions have an optional cache miss modifier that indicates the instructions should return a special missing value immediately, instead of waiting, if the volatile cache suffers a cache miss during a read. Even in this case, the cache will issue shared memory operations to fill itself. Also in this case, the cache will be big enough and associative enough to hold all volatile data in a synchronization loop, without losing any due to cache entry conflicts. Special code may be compiled for read instructions with a cache miss modifier, to provide software support to a hardware cache in order to implement this size capability.

The cache sets a missing-value-returned flag if it returns any missing values, and this flag can be read and set. The cache has a missing value control flag that, if set, causes the cache to return only missing values whenever the missing-value-returned flag is set, even if the cache has actual non-missing values to return.

The Read-Only Access Discipline

A component accessed using the read-only access discipline can only be read. Furthermore, it is assumed that such a component cannot be changed, and therefore maintenance of cache coherence for such components is unnecessary, if the cache uses virtual addresses as described earlier.

Read-only components can only be the result of an object type change in which a previously writable component was made read-only.

The Write-Only Access Discipline

A component accessed using the write-only access discipline can only be written. Furthermore, it is assumed that in general there will be at most one write of the component within a program partition where memory operations are unordered.

Note there may be implementation difficulties in permitting two components to be written independently if these are both part of the same byte, for example. To overcome these difficulties, writes of unaligned parts of memory may be much slower than one would expect.

There may be partitions of a routine that are specially marked so that they do not start until all the write operations in the previous partition are done. This is the sole method of sequencing writes.

The Volatile Access Discipline

A component accessed using the volatile access discipline can only be read. Only caches that are coherent immediately after an INVALIDATE-CACHE instruction can be used for such reads.

See the sections above for more details on reads using the volatile cache.

The Write-Once Access Discipline

A component accessed using the write-once access discipline can be read or written, but must be a tagged datum that is initialized to the special OMITTED value.

A write to a write-once component will succeed if the previous value was the OMITTED value, but will not write anything otherwise. In any case, the previous value is returned, and can be checked to see if it was the OMITTED value, or if it is equal, in the appropriate sense, to the value that was to be written.
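The write-once behavior amounts to a conditional store that always returns the prior value; the class and sentinel below are illustrative stand-ins for the tagged OMITTED datum.

```python
# Sketch of a write-once component: a write takes effect only when the
# previous value is the OMITTED sentinel, and the previous value is
# returned either way so the caller can check it.
OMITTED = object()   # stand-in for the tagged OMITTED value

class WriteOnce:
    def __init__(self):
        self.value = OMITTED

    def write(self, new_value):
        previous = self.value
        if previous is OMITTED:      # only the first write takes effect
            self.value = new_value
        return previous              # caller checks OMITTED or equality

cell = WriteOnce()
first = cell.write(42)    # succeeds: the previous value was OMITTED
second = cell.write(99)   # writes nothing: returns the existing value
```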

The access specification used to write a write-once component contains a priority level. Some implementations raise to this priority level or higher to write the component. Threads of higher priority cannot write the component.

Reads of the component may attempt to read the component from the normal non-volatile cache. If the value so read is the OMITTED value, then the read must read the component using the volatile cache. After a non-OMITTED value is read, it will not change, and may be recorded in a non-volatile cache.

The Accumulate Access Discipline

A component accessed using the accumulate access discipline is written by using a special accumulate routine. The component may also be read, in a fashion identical to that of the volatile access discipline.

An accumulate routine combines two component values and produces a new output component value. The component values may be register values or contiguous subobject values, i.e. aligned bit strings stored in memory. The accumulate routine can take additional arguments that are constants, again either register values or contiguous subobject values. It must be possible to copy the accumulate routine and these additional arguments throughout a network to combine values, in order to apply the hotspot-control method described above.

A write instruction to an accumulate component returns the previous value of the component.

The access specification used to write an accumulate component contains a priority level. Some implementations raise to this priority level or higher to write the component. Threads of higher priority cannot write the component.

The Atomic Transaction Access Discipline

A component may be read or written inside an atomic transaction by using the transaction access discipline.

Atomic transactions are initiated and terminated by special instructions. During an atomic transaction, a list of acquired locks is kept. Every time a NEUTRAL subobject pointer is combined with a TRANSACTION component descriptor to produce an OPEN pointer, a lock designated by the descriptor and pointer is located, acquired, and added to the lock list. It is also possible that the lock is already on the list, in which case it need not be acquired a second time or added to the list a second time. Thus an OPEN pointer is a pointer to a component whose lock has been acquired.

The lock acquired is identified by the lock number in the component descriptor. Typically this lock is a component of the same object as the component being accessed.

An OPEN component may be read or written. Writes to an OPEN component do not actually change the component, but instead make an entry in a deferred write list, which causes the write to be done when the atomic transaction terminates.

Locks in this system are acquired optimistically, meaning that conflicting atomic transactions may acquire locks simultaneously. As a consequence, the values read by an atomic transaction for OPEN components may be incorrect, because a different atomic transaction may have written those components since the current atomic transaction acquired their locks. There is a validate operation that may be executed at any time to see if the current atomic transaction is still valid, i.e. no OPEN component has been written since its lock was acquired. At the end of the atomic transaction, if the transaction is still valid, all delayed writes are done atomically, whereas if the transaction has become invalid, it fails and no writes are done.

The OPEN-WRITE-ONLY access specification is just like OPEN, but excludes reads. OPEN-READ-ONLY is just like OPEN, but excludes writes.

When a TRANSACTION component descriptor is used to acquire a lock, a check is made to be sure the current atomic operation was initiated at a priority level below that specified by the component descriptor.

Often an entire subobject inside some object is presented as a TRANSACTION component of the containing object. Then an OPEN pointer to this subobject is computed, and the proper lock acquired, and this OPEN pointer is used to make OPEN-READ-ONLY or OPEN-WRITE-ONLY pointers to components of the subobject, without acquiring additional locks. For example, a queue header might be such a subobject.

The actual locking scheme used is that described in the section on atomic transactions above. The volatile cache is implicitly used and cleared during an atomic transaction.

Summary

The goal of this chapter has been to define a shared object memory that might become a standard for writing parallel text processors, compilers, games, and simulations. Unlike the other chapters in this thesis, the results presented here are more preliminary, and it is more likely that there are either undiscovered flaws or a better way.

Bibliography

[ACC+] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Proceedings, International Conference on Supercomputing, September.

[ANS] ANSI, Washington, DC. Reference Manual for the Ada Programming Language.

[App] Apple Computer. Dylan Interim Reference Manual. Available by http://www.apple.com; see Technology and Research, then Dylan.

[Bro] Rodney A. Brooks. Trading data space for reduced time and code space in real-time garbage collection on stock hardware. In ACM Symposium on LISP and Functional Programming, August.

[CN] Jacques Cohen and Alexandru Nicolau. Comparison of compacting algorithms for garbage collection. ACM Transactions on Programming Languages and Systems, October.

[Coh] J. Cohen. Garbage collection of linked data structures. Computing Surveys, September.

[CPW] S. S. Coleman, P. C. Poole, and W. M. Waite. The mobile programming system, Janus. Software Practice and Experience.

[Def] Defense Research Agency, Attn: Dr. N. E. Peeling, St. Andrews Road, Malvern, Worcestershire, UK. TDF Specification.

[Dig] Digital Equipment Corp. Migrating to an OpenVMS AXP System: Planning for Migration, March.

[E+] John H. Edmondson et al. Internal organization of the Alpha, a quad-issue CMOS RISC microprocessor. Digital Technical Journal.

[F+] David M. Fenwick et al. The AlphaServer series: High-end server platform development. Digital Technical Journal.

[GV] Alexandros V. Gerbessiotis and Leslie G. Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing.

[H+] P. Hudak et al. Report on the programming language Haskell. SIGPLAN Notices, Section R.

[Hay] Barry Hayes. Key objects in garbage collection. Technical Report CS-TR, Stanford, August.

[HC] Tim Hickey and Jacques Cohen. Performance analysis of on-the-fly garbage collection. Communications of the ACM, November.

[HF] P. Hudak and J. Fasel. A gentle introduction to Haskell. SIGPLAN Notices, Section T.

[HMS] Anthony L. Hosking, J. Eliot B. Moss, and Darko Stefanovic. A comparative performance evaluation of write barrier implementations. In OOPSLA.

[HW] B. K. Haddon and W. M. Waite. Experience with the universal intermediate language Janus. Software Practice and Experience.

[Inc] Sparc International Inc. The SPARC Architecture Manual. Prentice Hall.

[Int] International Computer Science Institute, Berkeley, California. The Sather Specification. Available by ftp to icsi.berkeley.edu.

[Jon] Mark P. Jones. The implementation of the Gofer functional programming system. Technical Report YALEU/DCS/RR, Yale University Department of Computer Science, May.

[JW] Simon L. Peyton Jones and Philip Wadler. Imperative functional programming. In Symposium on Principles of Programming Languages. ACM, January.

[Lam] Leslie Lamport. Concurrent reading and writing. Communications of the ACM, November.

[M+] Gerald Masini et al. Object Oriented Languages. Academic Press.

[Maca] Stavros Macrakis. From UNCOL to ANDF: Progress in standard intermediate languages. http://www.osf.org; see Grenoble Office, Papers.

[Macb] Stavros Macrakis. The structure of ANDF: Principles and examples. http://www.osf.org; see Grenoble Office, Papers.

[Mey] Bertrand Meyer. EIFFEL: The Language. Prentice Hall.

[MS] Scott Milton and Heinz W. Schmidt. Dynamic dispatch in object-oriented languages. Technical Report TR-CS, Australian National University, Canberra, January.

[Nik] Rishiyur S. Nikhil. ID language reference manual. Technical Report, Computation Structures Group Memo, MIT, July.

[NOPH] Scott Nettles, James O'Toole, David Pierce, and Nicholas Haines. Replication-based incremental copying collection. In Y. Bekkers and J. Cohen, editors, Memory Management (IWMM). Springer-Verlag LNCS.

[NR] S. C. North and J. H. Reppy. Concurrent garbage collection on stock hardware. In Gilles Kahn, editor, Functional Programming Languages and Computer Architecture. Springer-Verlag LNCS.

[Org] E. I. Organick. A Programmer's View of the Intel System. McGraw-Hill.

[Pap] Gregory Michael Papadopoulos. Implementation of a General Purpose Dataflow Multiprocessor. MIT Press, Cambridge, MA.

[Sit] Richard L. Sites, editor. Alpha Architecture Reference Manual. Digital Press.

[SKW] Vivek Singhal, Sheetal V. Kakkad, and Paul R. Wilson. Texas: An efficient, portable, persistent store. In Antonio Albano and Ron Morrison, editors, Persistent Object Systems (San Miniato). Springer-Verlag.

[Smi] Burton J. Smith. A pipelined, shared resource MIMD computer. In Proceedings of the International Conference on Parallel Processing. IEEE, August.

[Str] Bjarne Stroustrup. The Design and Evolution of C++. Addison Wesley.

[U+] David Ungar et al. How to Use SELF. Sun Microsystems and Stanford University. Available by ftp to self.stanford.edu or self.smli.com.

[Ume] Kyoji Umemura. Floating-point number LISP. Software Practice and Experience, October.

[UU] Urs Holzle and David Ungar. Optimizing dynamically-dispatched calls with run-time type feedback. In Conference on Programming Language Design and Implementation. ACM, June.

[Val] Leslie G. Valiant. A bridging model for parallel computations. Communications of the ACM, August.

[VD] Francis Vaughan and Alan Dearle. Supporting large persistent stores using conventional hardware. In Antonio Albano and Ron Morrison, editors, Persistent Object Systems (San Miniato). Springer-Verlag.

[Wal] Robert L. Walton. R-CODE documentation and papers. ftp://das-ftp.harvard.edu/pub/walton/rcode/rcode.html, and in Harvard University Tech Reports, to appear.

[Wil] Paul R. Wilson. Uniprocessor garbage collection techniques. In Y. Bekkers and J. Cohen, editors, Memory Management (IWMM). Springer-Verlag LNCS.

[WK] Wm. A. Wulf and Sally A. McKee. Hitting the memory wall: Implications of the obvious. Computer Architecture News, March.

[WW] Mario Wolczko and Ifor Williams. Multi-level garbage collection in a high-performance persistent object system. In Antonio Albano and Ron Morrison, editors, Persistent Object Systems (San Miniato). Springer-Verlag.

[YHS] Taiichi Yuasa, Masami Hagiya, and William Schelter. GNU Common Lisp. ftp://prep.ai.mit.edu.

[Zor] Benjamin Zorn. The measured cost of conservative garbage collection. Software Practice and Experience, July.
