APRIL: A Processor Architecture for Multiprocessing

Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz

Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles is described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.

1 Introduction

The requirements placed on a processor in a large-scale multiprocessing environment are different from those in a uniprocessing setting. A processor in a parallel machine must be able to tolerate high memory latencies and handle process synchronization efficiently [2]. This need increases as more processors are added to the system.

Parallel applications impose processing and communication bandwidth demands on the parallel machine. An efficient and cost-effective machine design achieves a balance between the processing power and the communication bandwidth provided. An imbalance is created when an underutilized processor cannot fully exploit the available network bandwidth. When the network has bandwidth to spare, low processor utilization can result from high network latency. An efficient processor design for multiprocessors provides a means for hiding latency. When sufficient parallelism exists, a processor that rapidly switches to an alternate thread of computation during a remote memory request can achieve high utilization.

Processor utilization also diminishes due to synchronization latency. Spin lock accesses have a low overhead of memory requests, but busy-waiting on a synchronization event wastes processor cycles. Synchronization mechanisms that avoid busy-waiting through process blocking incur a high overhead.

Full/empty bit synchronization [22] in a rapid context switching processor allows efficient fine-grain synchronization. This scheme associates synchronization information with objects at the granularity of a data word, allowing a low-overhead expression of maximum concurrency. Because the processor can rapidly switch to other threads, wasteful iterations in spin-wait loops are interleaved with useful work from other threads. This reduces the negative effects of synchronization on processor utilization.

This paper describes the architecture of APRIL, a processor designed for large-scale multiprocessing. APRIL builds on previous research on processors for parallel architectures such as HEP [22], MASA [8], P-RISC [19], [14], [15], and [18]. Most of these processors support fine-grain interleaving of instruction streams from multiple threads, but suffer from poor single-thread performance. In the HEP, for example, instructions from a single thread can only be executed once every 8 cycles. Single-thread performance is important for efficiently running sections of applications with low parallelism.

APRIL does not support cycle-by-cycle interleaving of threads. To optimize single-thread performance, APRIL executes instructions from a given thread until it performs a remote memory request or fails in a synchronization attempt. We show that such coarse-grain multithreading allows a simple processor design with context switch overheads of 4-10 cycles, without significantly hurting overall system performance (although the pipeline design is complicated by the need to handle pipeline dependencies). In APRIL, thread scheduling is done in software, and unlimited virtual dynamic threads are supported. APRIL supports full/empty bit synchronization, and provides tag support for futures [9]. In this paper the terms process, thread, context, and task are used equivalently.

By taking a systems-level design approach that considers not only the processor, but also the compiler and run-time system, we were able to migrate several non-critical operations into the software system, greatly simplifying processor design. APRIL's simplicity allows an implementation based on minor modifications to an existing RISC processor design. We describe such an implementation based on Sun Microsystem's SPARC processor [23]. A compiler for APRIL, a run-time system, and an APRIL simulator are operational. We present simulation results for several parallel applications on APRIL's efficiency in handling fine-grain threads, and assess the scalability of multiprocessors based on a coarse-grain multithreaded processor using an analytical model. Our SPARC-based processor supports four hardware contexts and can switch contexts in about 10 cycles, which yields roughly 80% processor utilization in a system with an average base network latency of 55 cycles.

The rest of this paper is organized as follows. Section 2 is an overview of our multiprocessor system architecture and the programming model. The architecture of APRIL is discussed in Section 3, and its instruction set is described in Section 4. A SPARC-based implementation of APRIL is detailed in Section 5. Section 6 discusses the implementation and performance of the APRIL run-time system. Performance measurements of APRIL based on simulations are presented in Section 7. We evaluate the scalability of multithreaded processors in Section 8.

2 The ALEWIFE System

APRIL is the processing element of ALEWIFE, a large-scale multiprocessor being designed at MIT. ALEWIFE is a cache-coherent machine with distributed, globally-shared memory. Cache coherence is maintained using a directory-based protocol [5] over a low-dimension direct network [20]. The directory is distributed with the processing nodes.

2.1 Hardware

As shown in Figure 1, each ALEWIFE node consists of a processing element, floating-point unit, cache, main memory, cache/directory controller and a network routing switch. Multiple nodes are connected via a direct, packet-switched network.

Figure 1: ALEWIFE node.

The controller synthesizes a global shared-memory space via messages to other nodes, and satisfies requests from other nodes directed to its local memory. It maintains strong cache coherence [7] for memory accesses. On exception conditions, such as cache misses and failed synchronization attempts, the controller can choose to trap the processor or to make the processor wait. A multithreaded processor reduces the ill effects of the long-latency acknowledgment messages resulting from a strong cache coherence protocol. To allow experimentation with other programming models, the controller provides special mechanisms for bypassing the coherence protocol and facilities for preemptive interprocessor interrupts and block transfers.

The ALEWIFE system uses a low-dimension direct network. Such networks scale easily and maintain high nearest-neighbor bandwidth. However, the longer expected latencies of low-dimension direct networks compared to indirect multistage networks increase the need for processors that can tolerate long latencies. Furthermore, the lower bandwidth of direct networks over indirect networks with the same channel width introduces interesting design tradeoffs.

In the ALEWIFE system, a context switch occurs whenever the network must be used to satisfy a request, or on a failed synchronization attempt. Since caches reduce the network request rate, we can employ coarse-grain multithreading (context switch every 50-100 cycles) instead of fine-grain multithreading (context switch every cycle). This simplifies processor design considerably because context switches can be more expensive (4 to 10 cycles), and functionality such as scheduling can be migrated into run-time software. Single-thread performance is optimized, and techniques used in RISC processors for enhancing pipeline performance can be applied [10]. Custom design of a processing element is not required in the ALEWIFE system; indeed, we are using a modified version of a commercial RISC processor for our first-round implementation.

2.2 Programming Model

Our experimental programming language for ALEWIFE is Mul-T [16], an extended version of Scheme. Mul-T's basic mechanism for generating concurrent tasks is the future construct. The expression (future X), where X is an arbitrary expression, creates a task to evaluate X and also creates an object known as a future to eventually hold the value of X. When created, the future is in an unresolved, or undetermined, state. When the value of X becomes known, the future resolves to that value, effectively mutating into the value of X. Concurrency arises because the expression (future X) returns the future as its value without waiting for the future to resolve. Thus, the computation containing (future X) can proceed concurrently with the evaluation of X. All tasks execute in a shared address-space.

The result of supplying a future as an operand of some operation depends on the nature of the operation. Non-strict operations, such as passing a parameter to a procedure, returning a result from a procedure, assigning a value to a variable, and storing a value into a field of a data structure, can treat a future just like any other kind of value. Strict operations such as addition and comparison, if applied to an unresolved future, are suspended until the future resolves and then proceed, using the value to which the future resolved as though that had been the original operand.

The act of suspending if an object is an unresolved future and then proceeding when the future resolves is known as touching the object. The touches that automatically occur when strict operations are attempted are referred to as implicit touches. Mul-T also includes an explicit touching or "strict" primitive (touch X) that touches the value of the expression X and then returns that value.
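The touch operation above is what APRIL's trap hardware accelerates. The following minimal C model is written only for illustration; the future_t layout and all names here are invented, not Mul-T's representation. It shows the behavior a strict operation must obtain: suspend until the future resolves, then use the value.

    #include <assert.h>

    /* Illustrative model of a future: a value slot plus a resolved flag.
       (APRIL actually marks the value slot with a full/empty bit.) */
    typedef struct {
        int resolved;   /* has the child task produced the value yet? */
        int value;      /* meaningful only once resolved */
    } future_t;

    /* Non-strict operations (argument passing, assignment, stores) may
       pass a future around without inspecting it. A strict operation
       must touch its operand first. */
    int touch(future_t *f) {
        while (!f->resolved)
            ;           /* an APRIL thread would trap and switch-spin
                           or block here, not busy-wait */
        return f->value;
    }

    int main(void) {
        future_t f = {0, 0};
        f.value = 42;                 /* the task evaluating X ...      */
        f.resolved = 1;               /* ... resolves the future        */
        assert(touch(&f) + 1 == 43);  /* implicit touch before strict + */
        return 0;
    }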

Futures express control-level parallelism. In a large class of applications, data-level parallelism is more appropriate. Barriers are a useful means of synchronization for such applications on MIMD machines, but force unnecessary serialization. The same serialization occurs in SIMD machines. Implementing data-level parallelism in a MIMD machine that allows the expression of maximum concurrency requires cheap fine-grain synchronization associated with each data object. We provide this support in hardware with full/empty bits.

We are augmenting Mul-T with constructs for data-level parallelism and primitives for placement of data and tasks. As an example, the programmer can use future-on, which works just like a normal future but allows the specification of the node on which to schedule the future. Extending Mul-T in this way allows us to experiment with techniques for enhancing locality and to research language-level issues for programming parallel machines.

3 Processor Architecture

APRIL is a pipelined RISC processor extended with special mechanisms for multiprocessing. This section gives an overview of the APRIL architecture and focuses on its features that support multithreading, fine-grain synchronization, cheap futures, and other models of computation.

The left half of Figure 2 depicts the user-visible processor state comprising four sets of general purpose registers, and four sets of Program Counter (PC) chains and Processor State Registers (PSR). The PC chain represents the instruction addresses corresponding to a thread, and the PSR holds various pieces of process-specific state. Each register set, together with a single PC-chain and PSR, is conceptually grouped into a single entity called a task frame (using terminology from [8]). Only one task frame is active at a given time and is designated by a current frame pointer (FP). All register accesses are made to the active register set and instructions are fetched using the active PC-chain. Additionally, a set of 8 global registers that are always accessible (regardless of the FP) is provided.

Registers are 32 bits wide. The PSR is also a 32-bit register and can be read into and written from the general registers. Special instructions can read and write the FP register. The PC-chain includes the Program Counter (PC) and next Program Counter (nPC) which are not directly accessible. This assumes a single-cycle branch delay slot. Condition codes are set as a side effect of compute instructions. A longer branch delay might be necessary if the branch instruction itself does a compare so that condition codes need not be saved [13]; in this case the PC chain is correspondingly longer.

Words in memory have a 32 bit data field, and have an additional synchronization bit called the full/empty bit.

Use of multiple register sets on the processor, as in the HEP, allows rapid context switching. A context switch is achieved by changing the frame pointer and emptying the pipeline. The cache controller forces a context switch on the processor, typically on remote network requests, and on certain unsuccessful full/empty bit synchronizations.

APRIL implements futures using the trap mechanism. For our proposed experimental implementation based on SPARC, which does not have four separate PC and PSR frames, context switches are also caused through traps. Therefore, a fast trap mechanism is essential. When a trap is signalled in APRIL, the trap mechanism lets the pipeline empty and passes control to the trap handler. The trap handler executes in the same task frame as the thread that trapped so that it can access all of the thread's registers.

Figure 2: Processor State and Virtual Threads. (Left: the global register frame g0-g7 and four register frames 0:r0 through 3:r31, with their PC/nPC and PSR frames and the FP. Right: memory holding the ready and suspended queues of unloaded threads and the global heap; loaded threads occupy the task frames.)

3.1 Coarse-Grain Multithreading

In most processor designs to date (e.g. [8, 22, 19, 15]), multithreading has involved cycle-by-cycle interleaving of threads. Such fine-grain multithreading has been used to hide memory latency and also to achieve high pipeline utilization. Pipeline dependencies are avoided by maintaining instructions from different threads in the pipeline, at the price of poor single-thread performance.

In the ALEWIFE machine, we are primarily concerned with the large latencies associated with cache misses that require a network access. Good single-thread performance is also important. Therefore APRIL continues executing a single thread until a memory operation involving a remote request (or an unsuccessful synchronization attempt) is encountered. The controller forces the processor to switch to another thread, while it services the request. This approach is called coarse-grain multithreading. Processors in message passing multicomputers [21, 27, 6, 4] have traditionally taken this approach to allow overlapping of communication with computation.

Context switching in APRIL is achieved by changing the frame pointer. Since APRIL has four task frames, it can have up to four threads loaded. The thread that is being executed resides in the task frame pointed to by the FP. A context switch simply involves letting the processor pipeline empty while saving the PC-chain and then changing the FP to point to another task frame.

Threads in ALEWIFE are virtual. Only a small subset of all threads can be physically resident on the processors; these threads are called loaded threads. The remaining threads are referred to as unloaded threads and live on various queues in memory, waiting their turn to be loaded. In a sense, the set of task frames acts like a cache on the virtual threads. This organization is illustrated in Figure 2. The scheduler tries to choose threads from the set of loaded threads for execution to minimize the overhead of saving and restoring threads to and from memory. When control eventually passes back to the thread that suffered a remote request, the controller should have completed servicing the request, provided the other threads ran for enough cycles. By maximizing local cache and memory accesses, the need for context switching reduces to once every 50 or 100 cycles, which allows us to tolerate latencies in the range of 150 to 300 cycles with 4 task frames (see Section 8).

Rapid context switching is used to hide the latency encountered in several other trap events, such as synchronization faults (or attempts to load from "empty" locations). These events can either cause the processor to suspend execution (wait) or to take a trap. In the former case, the controller holds the processor until the request is satisfied. This typically happens on local memory cache misses, and on certain full/empty bit tests. If a trap is taken, the trap handling routine can respond by:

1. spinning - immediately return from the trap and retry the trapping instruction.

2. switch spinning - context switch without unloading the trapped thread.

3. blocking - unload the thread.

The above alternatives must be considered with care because incorrect choices can create or exacerbate starvation and thrashing problems. An extreme example of starvation is this: all loaded threads are spinning or switch spinning on an exception condition that an unloaded thread is responsible for fulfilling. We are investigating several possible mechanisms to handle such problems, including a special controller initiated trap on certain failed synchronization tests, whose handler unloads the thread.

An important aspect of the ALEWIFE system is its combination of caches and multithreading. While this combination is advantageous, it also creates a unique class of thrashing and starvation problems. For example, forward progress can be halted if a context executing on one processor is writing to a location while a context on another processor is reading from it. These two contexts can easily play "cache tag", since writes to a location force a context switch and invalidation of other cached copies, while reads force a context switch and transform read-write copies into read-only copies. Another problem involves thrashing between an instruction and its data; a context will be blocked if it has a load instruction mapped to the same cache line as the target of the load. These and related problems have been addressed with appropriate hardware interlock mechanisms.

3.2 Support for Futures

Executing a Mul-T program with futures incurs two types of overhead not present in sequential programs. First, strict operations must check their operands for availability before using them. Second, there is a cost associated with creating new threads.

Detection of Futures  Operand checks for futures done in software imply wasted cycles on every strict operation. Our measurements with Mul-T running on an Encore Multimax show that this is expensive. Even with clever compiler optimizations, there is close to a factor of two loss in performance over a purely sequential implementation (see Table 3). Our solution employs a tagging scheme with hardware-generated traps if an operand to a strict operator is a future. We believe that this hardware support is necessary to make futures a viable construct for expressing parallelism. From an architectural perspective, this mechanism is similar to dynamic type checking in Lisp. However, this mechanism is necessary even in a statically typed language in the presence of dynamic futures.

APRIL uses a simple data type encoding scheme for automatically generating a trap when operands to strict operators are futures. This implementation (discussed in Section 5) obviates the need to explicitly inspect in software the operands to every compute instruction. This is important because we do not want to hurt the efficiency of all compute instructions because of the possibility that an operand is a future.

Lazy Task Creation  Little can be done to reduce the cost of task creation if future is taken as a command to create a new task. In many programs the possibility of creating an excessive number of fine-grain tasks exists. Our solution to this problem is called lazy task creation [17]. With lazy task creation a future expression does not create a new task, but computes the expression as a local procedure call, leaving behind a marker indicating that a new task could have been created. The new task is created only when some processor becomes idle and looks for work, stealing the continuation of that procedure call. Thus, the user can specify the maximum possible parallelism without the overhead of creating a large number of tasks. The race conditions are resolved using the fine-grain locking provided by the full/empty bits.
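The following toy C sketch illustrates only the control structure of the idea, under stated simplifications: a real implementation steals the continuation of the enclosing call and uses the full/empty bits to resolve races between worker and thief, while this single-threaded toy (with an invented marker queue) shows just that a marker is cheap to push and is retracted when nobody steals it.

    /* Toy sketch of lazy task creation. All names are invented for
       illustration; this is not the Mul-T run-time mechanism. */
    typedef int (*thunk_t)(int);

    typedef struct { thunk_t fn; int arg; } marker_t;

    static marker_t markers[64];   /* work an idle processor could steal */
    static int top = 0;

    /* (future (fn arg)): leave a marker, then evaluate inline as an
       ordinary procedure call instead of eagerly creating a task. */
    static int lazy_future(thunk_t fn, int arg) {
        markers[top++] = (marker_t){ fn, arg };  /* potential new task   */
        int v = fn(arg);       /* local procedure call, no task created */
        top--;                 /* no thief arrived: retract the marker  */
        return v;
    }

    static int square(int x) { return x * x; }

    int main(void) { return lazy_future(square, 3) == 9 ? 0 : 1; }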

3.3 Fine-grain synchronization

Besides support for lazy task creation, efficient fine-grain synchronization is essential for large-scale parallel computing. Both the dataflow and data-parallel models of computation rely heavily on the availability of cheap fine-grain synchronization. The unnecessary serialization imposed by barriers in MIMD implementations of data-parallelism can be avoided by allowing fine-grain word-level synchronization in data structures. The traditional test&set based synchronization requires extra memory operations and separate data storage for the lock and for the associated data. Busy-waiting or blocking in conventional processors wastes additional processor cycles.

APRIL adopts the full/empty bit approach used in the HEP to reduce both the storage requirements and the number of memory accesses. A bit associated with each memory word indicates the state of the word: full or empty. The load of an empty location or the store into a full location can trap the processor causing a context switch, which helps hide synchronization delay. Traps also obviate the additional software tests of the lock in test&set operations. A similar mechanism is used to implement I-structures in dataflow machines [3]; however, APRIL is different in that it implements such synchronizations through software trap handlers.
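As a concrete illustration of these semantics, here is a small C model of one trapping load flavor and one trapping store flavor; the struct and function names are invented, and a return of -1 stands in for the trap that APRIL would actually take.

    #include <stdio.h>

    /* One synchronization bit per word, as in the HEP scheme. */
    typedef struct { int data; int full; } fe_word;

    /* Load that traps on an empty location and resets the bit to
       empty, consuming the datum (cf. the ldet* flavors of Table 2). */
    int load_empty_traps(fe_word *w, int *out) {
        if (!w->full) return -1;   /* full/empty exception: would trap */
        *out = w->data;
        w->full = 0;               /* reset to empty */
        return 0;
    }

    /* Store that traps on a full location and sets the bit to full. */
    int store_full_traps(fe_word *w, int v) {
        if (w->full) return -1;    /* full/empty exception: would trap */
        w->data = v;
        w->full = 1;
        return 0;
    }

    int main(void) {
        fe_word slot = {0, 0};                 /* starts empty   */
        int v;
        store_full_traps(&slot, 7);            /* producer fills */
        if (load_empty_traps(&slot, &v) == 0)  /* consumer empties */
            printf("consumed %d\n", v);
        return 0;
    }

Note that the lock state and the datum occupy a single word, which is exactly the storage and memory-traffic saving claimed over test&set.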

3.4 Multimodel Support Mechanisms

APRIL is designed primarily for a shared-memory multiprocessor with strongly coherent caches. However, we are considering several additional mechanisms which will permit explicit management of caches and efficient use of network bandwidth. These mechanisms present different computational models to the programmer.

To allow software-enforced cache coherence, we have loads and stores that bypass the hardware coherence mechanism, and a flush operation that permits software writeback and invalidation of cache lines. A loaded context has a fence counter that is incremented for each dirty cache line that is flushed and decremented for each acknowledgement from memory. This fence counter may be examined to determine if all writebacks have completed. We are proposing a block-transfer mechanism for efficient transfer of large blocks of data. Finally, we are considering an interprocessor-interrupt mechanism (IPI) which permits preemptive messages to be sent to specific processors. IPIs offer reasonable alternatives to polling and, in conjunction with block-transfers, form a primitive for the message-passing computational model.
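A sketch of how the fence counter might be used by software-enforced coherence follows; flush_line() and fence_count() are hypothetical stand-ins for the flush operation and the per-context fence counter, not a real API.

    /* Hypothetical intrinsics standing in for APRIL's flush operation
       and fence counter; invented names for illustration only. */
    extern void flush_line(void *addr);  /* writeback/invalidate a line */
    extern int  fence_count(void);       /* outstanding writebacks */

    /* Flush a region, then wait until memory has acknowledged every
       dirty line: the counter is bumped per flushed dirty line and
       decremented per acknowledgement from memory. */
    void flush_region(char *base, int nlines, int line_bytes) {
        for (int i = 0; i < nlines; i++)
            flush_line(base + i * line_bytes);
        while (fence_count() != 0)
            ;  /* all writebacks complete when the counter reaches zero */
    }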

Although each of these mechanisms adds complexity to our cache controller, they are easily implemented in the processor through "out-of-band" instructions as discussed in Section 5.

4 Instruction Set

APRIL has a basic RISC instruction set augmented with special memory instructions for full/empty bit operations, multithreading, and cache support. The attraction of an implementation based on simple SPARC processor modifications has resulted in a basic SPARC-like design. All registers are addressed relative to a current frame pointer. Compute instructions are 3-address register-to-register arithmetic/logic operations. Conditional branch instructions take an immediate operand and may increment the PC by the value of the immediate operand depending on the condition codes set by the arithmetic/logic operations. Memory instructions move data between memory and the registers, and also interact with the cache and the full/empty bits. The basic instruction categories are summarized in Table 1. The remainder of this section describes features of APRIL instructions used for supporting multiprocessing.

    Type     Format           Data transfer    Control flow
    Compute  op s1 s2 d       d <- s1 op s2    PC+1
    Memory   ldtype a d       d <- mem[a]      PC+1
             sttype s a       mem[a] <- s      PC+1
    Branch   jcond offset     --               if cond: PC+offset, else PC+1
             jmpl offset d    d <- PC          PC+offset

Table 1: Basic instruction set summary.

Data Type Formats  APRIL supports tagged pointers for Mul-T, as in the Berkeley SPUR processor [12], by encoding the pointer type in the low order bits of a data word. Associating the type with the pointer has the advantage of saving an additional memory reference when accessing type information. Figure 3 lists the different type encodings. An important purpose of this type encoding scheme is to support hardware detection of futures.

Figure 3: Data Type Encodings. (The low-order bits of a word distinguish fixnums, cons cells, other pointers, and futures.)

Future Detection and Compute Instructions  Since a compute instruction is a strict operation, special action has to be taken if either of its operands is a future. APRIL generates a trap if a future is encountered by a compute instruction. Future pointers are easily detected by their non-zero least significant bit.
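Because objects are word-aligned (see the memory-instruction restriction below), bit 0 of a pointer is free to act as the future tag, and the operand check reduces to a one-bit test, as this C fragment illustrates (the function name is ours):

    #include <stdint.h>

    /* Future pointers carry a non-zero least significant bit; strict
       operations need only this mask-and-test on each operand, which
       is what APRIL's trap hardware performs implicitly. */
    static inline int is_future(uint32_t word) {
        return (int)(word & 1u);    /* bit 0 set => future => trap */
    }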

Memory Instructions  Memory instructions are complex because they interact with the full/empty bits and the cache controller. On a memory access, two data exceptions can occur: the accessed location may not be in the cache (a cache miss), and the accessed location may be empty on a load or full on a store (a full/empty exception). On a cache miss, the cache/directory controller can trap the processor or make the processor wait until the data is available. On full/empty exceptions, the controller can trap the processor, or allow the processor to continue execution. Load instructions also have the option of setting the full/empty bit of the accessed location to empty while store instructions have the option of setting the bit to full. These options give rise to 8 kinds of loads and 8 kinds of stores. The load instructions are listed in Table 2. Store instructions are similar except that they trap on full locations instead of empty locations.

    Name   Type  Reset f/e bit  Trap on empty location  Cache-miss response
    ldtt   1     No             Yes                     Trap
    ldett  2     Yes            Yes                     Trap
    ldnt   3     No             No                      Trap
    ldent  4     Yes            No                      Trap
    ldnw   5     No             No                      Wait
    ldenw  6     Yes            No                      Wait
    ldtw   7     No             Yes                     Wait
    ldetw  8     Yes            Yes                     Wait

Table 2: Load Instructions.

A memory instruction also shares responsibility for detecting futures in either of its address operands. Like compute instructions, memory instructions also trap if the least significant bit of either of their address operands is non-zero. This introduces the restriction that objects in memory cannot be allocated at byte boundaries. This, however, is not a problem because object allocation at word boundaries is favored for other reasons [11]. This trap provides support for implicit future touches in operators that dereference pointers, e.g., car in LISP.
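Returning to Table 2: the eight load flavors are just the cross product of three independent binary options. A C sketch of that decomposition follows; the bit assignments are invented for illustration, since APRIL encodes the choice in the instruction itself, not in such flags.

    /* Three independent options give the 8 loads (and, with the
       empty/full sense flipped, the 8 stores). Invented encoding. */
    enum {
        LD_RESET_FE   = 1 << 0,  /* set the location's f/e bit to empty  */
        LD_TRAP_EMPTY = 1 << 1,  /* trap on an empty location            */
        LD_WAIT_MISS  = 1 << 2   /* wait on a cache miss instead of trap */
    };

    enum {
        LDTT  = LD_TRAP_EMPTY,                               /* row 1 */
        LDETT = LD_RESET_FE | LD_TRAP_EMPTY,                 /* row 2 */
        LDNT  = 0,                                           /* row 3 */
        LDENT = LD_RESET_FE,                                 /* row 4 */
        LDNW  = LD_WAIT_MISS,                                /* row 5 */
        LDENW = LD_RESET_FE | LD_WAIT_MISS,                  /* row 6 */
        LDTW  = LD_TRAP_EMPTY | LD_WAIT_MISS,                /* row 7 */
        LDETW = LD_RESET_FE | LD_TRAP_EMPTY | LD_WAIT_MISS   /* row 8 */
    };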

Full/Empty Bit Conditional Branch Instructions  Non-trapping memory instructions allow testing of the full/empty bit by setting a condition bit indicating the state of the memory word's full/empty bit. APRIL provides conditional branch instructions, Jfull and Jempty, that dispatch on this condition bit. This provides a mechanism to explicitly control the action taken following a memory instruction that would normally trap on a full/empty exception.

Frame Pointer Instructions  Instructions are provided for manipulating the register frame pointer (FP). FP points to the register frame on which the currently executing thread resides. An INCFP instruction increments the FP to point to the next task frame while a DECFP instruction decrements it. The incrementing and decrementing is done modulo the number of task frames. RDFP reads the value of the FP into a register and STFP writes the contents of a register into the FP.
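The modulo wrap-around means the task frames form a ring; in C (with four frames, matching our SPARC-based implementation):

    #define NUM_TASK_FRAMES 4   /* four hardware task frames */

    /* INCFP / DECFP semantics: advance around the ring of frames. */
    static inline int incfp(int fp) {
        return (fp + 1) % NUM_TASK_FRAMES;
    }
    static inline int decfp(int fp) {
        return (fp + NUM_TASK_FRAMES - 1) % NUM_TASK_FRAMES;
    }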

Instructions for Other Mechanisms  The special mechanisms discussed in Section 3.4, such as FLUSH, are made available through "out-of-band" instructions. Interprocessor-interrupts, block-transfers, and FENCE operations are initiated via memory-mapped I/O instructions (LDIO, STIO).

5 An Implementation of APRIL

An ALEWIFE node consists of several interacting subsystems: processor, floating-point unit, cache, memory, cache and directory controller, and network controller. For the first round implementation of the ALEWIFE system, we plan to use a modified SPARC processor and an unmodified SPARC floating-point unit. (The SPARC-based implementation effort is in collaboration with LSI Logic Corporation.) There are several reasons for this choice. First, we have chosen to devote our limited resources to the design of a custom ALEWIFE cache and directory controller, rather than to processor design. Second, the register windows in the SPARC processor permit a simple implementation of coarse-grain multithreading. Third, most of the instructions envisioned for the original APRIL processor map directly to single or double instruction sequences on the SPARC. Software compatibility with a commercial processor allows easy access to a large body of software. Furthermore, use of a standard processor permits us to ride the technology curve; we can take advantage of new technology as it is developed.

Rapid Context Switching on SPARC  SPARC processors contain an implementation-dependent number of overlapping register windows for speeding up procedure calls. The current register window is altered via SPARC instructions (SAVE and RESTORE) that modify the Current Window Pointer (CWP). Traps increment the CWP, while the trap return instruction (RETT) decrements it. SPARC's register windows are suited for rapid context switching and rapid trap handling because most of the state of a process (i.e., its 24 local registers) can be switched with a single-cycle instruction. Although we are not using multiple register windows for procedure calls within a single thread, this should not significantly hurt performance [25, 24].

To implement coarse-grain multithreading, we use two register windows per task frame: a user window and a trap window. The SPARC processor chosen for our implementation has eight register windows, allowing a maximum of four hardware task frames. Since the SPARC does not have multiple program counter (PC) chains and processor status registers (PSR), our trap code must explicitly save and restore the PSRs during context switches (the PC chain is saved by the trap itself). These values are saved in the trap window. Because the SPARC has a minimum trap overhead of five cycles (for squashing the pipeline and computing the trap vector), context switches will take at least this long. See Section 6.1 for further information.

The SPARC floating-point unit does not support register windows, but has a single, 32-word register file. To retain rapid context switching ability for applications that require efficient floating point performance, we have divided the floating point register file into four sets of eight registers. This is achieved by modifying floating-point instructions in a context dependent fashion as they are loaded into the FPU and by maintaining four different sets of condition bits. A modification of the SPARC processor will make the CWP available externally to allow insertion into the FPU instruction.

Support for Futures  We detect futures on the SPARC via two separate mechanisms. Future pointers are tagged with their lowest bit set. Thus, direct use of a future pointer is flagged with a word-alignment trap. Furthermore, a strict operation, such as subtraction, applied to one or more future pointers is flagged with a modified non-fixnum trap, that is triggered if an operand has its lowest bit set (as opposed to either one of the lowest two bits, in the SPARC specification).

Implementation of Loads and Stores  The SPARC definition includes the Alternate Space Indicator (ASI) feature that permits a simple implementation of APRIL's many load and store instructions (described in Section 4). The ASI is available externally as an eight-bit field. Normal memory accesses use four of the 256 ASI values to indicate user/supervisor and instruction/data accesses. Special SPARC load and store instructions (LDASI and STASI) permit use of the other 252 ASI values. Our first-round implementation uses different ASI values to distinguish between flavors of load and store instructions, special mechanisms, and I/O instructions.

Interaction with the Cache Controller  The cache controller in the ALEWIFE system maintains strong cache coherence, performs full/empty bit synchronization, and implements special mechanisms. By examining the processor's ASI bits during memory accesses, it can select between different load/store and synchronization behavior, and can determine if special mechanisms should be employed. Through use of the Memory Exception (MEXC) line on SPARC, it can invoke synchronous traps corresponding to cache misses and synchronization (full/empty) mismatches. The controller can suspend processor execution using the MHOLD line. It passes condition information to the processor through the Coprocessor Condition bits (CCCs), permitting the full/empty conditional branch instructions (Jfull and Jempty) to be implemented as coprocessor branch instructions. Asynchronous traps (IPI's) are delivered via the SPARC's asynchronous trap lines.

6 Compiler and Run-Time System

The compiler and run-time system are integral parts of the processor design effort. A Mul-T compiler for APRIL and a run-time system written partly in APRIL assembly code and partly in T have been implemented. Constructs for user-directed placement of data and processes have also been implemented. The run-time system includes the trap and system routines, Mul-T run-time support, a scheduler, and a system boot routine.

Since a large portion of the support for multithreading, synchronization and futures is provided in software through traps and run-time routines, trap handling must be fast. Below, we describe the implementation and performance of the routines used for trap handling and context switching.

6.1 Cache Miss and Full/Empty Traps

Cache miss traps occur on cache misses that require a network request and cause the processor to context switch. Full/empty synchronization exceptions can occur on certain memory instructions described in Section 4. The processor can respond to these exceptions by spinning, switch spinning, or blocking the thread. In our current implementation, traps handle these exceptions by switch spinning, which involves a context switch to the next task frame.

In our SPARC-based design of APRIL, we implement context switching through the trap mechanism using instructions that change the CWP. The following is a trap routine that context switches to the thread in the next task frame.

    rdpsr  psrreg   ; save PSR into a reserved reg.
    save            ; increment the window pointer
    save            ; by 2
    wrpsr  psrreg   ; restore PSR for the new context
    jmpl   r17      ; return from trap and
    rett   r18      ; reexecute trapping instruction

We count 5 cycles for the trap mechanism to allow the pipeline to empty and save relevant processor state before passing control to the trap handler. The above trap handler takes an additional 6 cycles for a total of 11 cycles to effect the context switch. In a custom APRIL implementation, the cycles lost due to PC saves in the hardware trap sequence, and those in calling the trap handler for the PSR saves/restores and double incrementing the frame pointer could be avoided, allowing a four-cycle context switch.

6.2 Future Touch Trap

When a future touch trap is signalled, the future that caused the trap will be in a register. The trap handler has to decode the trapping instruction to find that register. The future is resolved if the full/empty bit of the future's value slot is set to full. If it is resolved, the future in the register is replaced with the resolved value; otherwise the trap routine can decide to switch spin or block the thread that trapped. Our future touch trap handler takes 23 cycles to execute if the future is resolved.
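In outline, the handler's decision logic looks like the following C sketch; the names are invented, and the actual handler is APRIL assembly that also decodes the trapping instruction to locate the register.

    /* Sketch of the future-touch trap handler's decision logic. */
    enum action { RESUME, SWITCH_SPIN, BLOCK };

    typedef struct { int full; int value; } slot_t;  /* value slot + f/e bit */

    enum action future_touch_handler(int *reg, slot_t *value_slot) {
        if (value_slot->full) {        /* future already resolved       */
            *reg = value_slot->value;  /* replace future with its value */
            return RESUME;             /* the 23-cycle fast path        */
        }
        /* unresolved: policy choice between switch spinning and
           unloading (blocking) the trapped thread */
        return SWITCH_SPIN;
    }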

If the trap handler decides to block the thread on an unresolved future, the thread must be unloaded from the hardware task frame, and an alternate thread may be loaded. Loading a thread involves writing the state of the thread, including its general registers, its PC chain, and its PSR, into a hardware task frame on the processor, and unloading a thread involves saving the state of a thread out to memory. Loading and unloading threads are expensive operations unless there is special support for block movement of data between hardware registers and memory. Since the scheduling mechanism favors processor-resident threads, loading and unloading of threads should be infrequent. However, this is an issue that is under investigation.

7 Performance Measurements

This section presents some results on APRIL's performance in handling fine-grain tasks. We have implemented a simulator for the ALEWIFE system written in C and T. Figure 4 illustrates the organization of the simulator. The Mul-T compiler produces APRIL code, which gets linked with the run-time system to yield an executable program. The instruction-level APRIL processor simulator interprets APRIL instructions. It is written in T and simulates 40,000 APRIL instructions per second when run on a SPARCServer 330. The processor simulator interacts with the cache and directory simulator (written in C) on memory instructions. The cache simulator in turn interacts with the network simulator (also written in C) when making remote memory operations. The simulator has proved to be a useful tool in evaluating system-wide architectural tradeoffs as it provides more accurate results than a trace driven simulation. The speed of the simulator has allowed us to execute lengthy parallel programs. As an example, in a run of speech (described below), the simulated program ran for 100 million simulated cycles before completing.

Figure 4: Simulator Organization. (The Mul-T compiler produces APRIL machine language programs that run on the APRIL simulator and run-time system; the processor simulator exchanges memory requests and acknowledgements with the cache/directory/memory simulator, which in turn generates network transactions for the network simulator. A T-Mul-T emulator/tracer and a post-mortem scheduler supply single-processor and parallel traces.)

Evaluation of the ALEWIFE architecture through simulations is in progress. A sampling of our results on the performance of APRIL running parallel programs is presented here. Table 3 lists the execution times of four programs written in Mul-T: fib, factor, queens and speech. fib is the ubiquitous doubly recursive Fibonacci program with futures around each of its recursive calls, factor finds the largest prime factor of each number in a range of numbers and sums them up, queens finds all solutions to the n-queens chess problem for n = 8 and speech is a modified Viterbi graph search used in a connected speech recognition system called SUMMIT, developed by the Spoken Language Systems Group at MIT. We ran each program on the Encore Multimax, on APRIL using normal task creation, and on APRIL using lazy task creation. For purposes of comparison, execution time has been normalized to the time taken to execute a sequential version of each program, i.e., with no futures and compiled with an optimizing T-compiler.

The difference between running the same sequential code on T and on Mul-T on the Encore Multimax (columns "T seq" and "Mul-T seq") is due to the overhead of future detection. Since the Encore does not support hardware detection of futures, an overhead of a factor of 2 is introduced, even though no futures are actually created. There is no overhead on APRIL, which demonstrates the advantage of tag support for futures.

The difference between running sequential code on Mul-T and running parallel code on Mul-T with one processor ("Mul-T seq" and 1) is due to the overhead of thread creation and synchronization in a parallel program. This overhead is very large for the fib benchmark on both the Encore and APRIL using normal task creation because of very fine-grain thread creation. This overhead accounts for approximately a factor of 28 in execution time. For APRIL with normal futures, this overhead accounts for a factor of 14. Lazy task creation on APRIL creates threads only when the machine has the resources to execute them, and performs much better because it has the effect of dynamically partitioning the program into coarser-grain threads and creating fewer futures. The overhead introduced is only a factor of 1.5. In all of the programs, APRIL consistently demonstrates lower overhead due to support for thread creation and synchronization over the Encore.

    Program  System    T seq  Mul-T seq  1     2     4     8     16
    fib      Encore    1.0    1.8        28.9  16.3  9.2   5.1
             APRIL     1.0    1.0        14.2  7.1   3.6   1.8   0.97
             Apr-lazy  1.0    1.0        1.5   0.78  0.44  0.29  0.19
    factor   Encore    1.0    1.4        1.9   0.96  0.50  0.26
             APRIL     1.0    1.0        1.8   0.90  0.45  0.23  0.12
             Apr-lazy  1.0    1.0        1.0   0.52  0.26  0.14  0.09
    queens   Encore    1.0    1.8        2.1   1.0   0.54  0.31
             APRIL     1.0    1.0        1.4   0.67  0.33  0.18  0.10
             Apr-lazy  1.0    1.0        1.0   0.51  0.26  0.13  0.07
    speech   Encore    1.0    2.0        2.3   1.2   0.62  0.36
             APRIL     1.0    1.0        1.2   0.60  0.31  0.17  0.10
             Apr-lazy  1.0    1.0        1.0   0.52  0.27  0.15  0.09

Table 3: Execution time for Mul-T benchmarks. "T seq" is T running sequential code, "Mul-T seq" is Mul-T running sequential code, 1 to 16 denote number of processors running parallel code.

Measurements for multiple processor executions on APRIL (2-16) used the processor simulator without the cache and network simulators, in effect simulating a shared-memory machine with no memory latency. The numbers demonstrate that APRIL and its run-time system allow parallel program performance to scale when synchronization and task creation overheads are taken into account, but when memory latency is ignored. The effect of communication in large-scale machines depends on several factors such as scheduling, which are active areas of investigation.

8 Scalability of Multithreaded Processor Systems

Multithreading enhances processor efficiency by allowing execution to proceed on alternate threads while the memory requests of other threads are being satisfied. However, any new mechanism is useful only if it enhances overall system performance. This section analyzes the system performance of multithreaded processors.

A multithreaded processor design must address the tradeoff between reduced processor idle time and increased cache miss rates, network contention, and context management overhead. The private working sets of multiple contexts interfere in the cache. The added interference misses coupled with the higher average traffic generated by a more highly utilized processor impose greater bandwidth demands on the interconnection network. Context management instructions required to switch the processor between threads also add to the overhead. Furthermore, the application must display sufficient parallelism to allow multiple thread assignment to each processor.

What is a good performance metric to evaluate multithreading? A good measure of system performance is system power, which is the product of the number of processors and the average processor utilization. Provided the computation of processor utilization takes into account the deleterious effects of cache, network, and context-switching overhead, the processor utilization is itself a good measure.

We have developed a model for multithreaded processor utilization that includes the cache, network, and switching overhead effects. A detailed analysis is presented in [1]. This section will summarize the model and our chief results. Processor utilization U as a function of the number of threads resident on a processor p is derived as a function of the cache miss rate m(p), the network latency T(p), and the context switching overhead C:

    U(p) = \begin{cases}
      \dfrac{p}{1 + T(p)m(p)} & \text{for } p < \dfrac{1 + T(p)m(p)}{1 + Cm(p)} \\
      \dfrac{1}{1 + Cm(p)}    & \text{for } p \ge \dfrac{1 + T(p)m(p)}{1 + Cm(p)}
    \end{cases}    (1)

When the number of threads is small, complete overlapping of network latency is not possible. Processor utilization with one thread is 1/(1 + m(1)T(1)). Ideally, with p threads available to overlap network delays, the utilization would increase p-fold. In practice, because the miss rate and network latency increase to m(p) and T(p), the utilization becomes p/(1 + m(p)T(p)). When it is possible to completely overlap network latency, processor utilization is limited only by the context switching overhead paid on every miss (assuming a context switch happens on a cache miss), and is given by 1/(1 + m(p)C).
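To make the two regimes concrete, the following C sketch evaluates Equation (1) with m(p) and T(p) frozen at their single-thread values (m = 2%, T = 55 cycles, C = 10, from Table 4); this is a first-order simplification of the full model of [1].

    #include <stdio.h>

    /* Equation (1) with m(p), T(p) held constant (first-order sketch). */
    double utilization(int p, double m, double T, double C) {
        double saturation = (1.0 + T * m) / (1.0 + C * m);
        if ((double)p < saturation)
            return p / (1.0 + T * m);  /* latency only partly overlapped */
        return 1.0 / (1.0 + C * m);    /* limited by switch overhead     */
    }

    int main(void) {
        for (int p = 1; p <= 4; p++)   /* m = 2%, T = 55, C = 10 */
            printf("U(%d) = %.2f\n", p, utilization(p, 0.02, 55.0, 10.0));
        return 0;
    }

With these numbers the saturation point is (1 + 1.1)/(1 + 0.2) = 1.75 threads and the ceiling is 1/1.2 = 0.83; once m(p) and T(p) are allowed to grow with p, as in the full model, the curves of Figure 5 settle near 0.80 at about three threads.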

The models for the cache and network terms have been validated through simulations. Both these terms are shown to be the sum of two components: one component independent of the number of threads p and the other linearly related to p (to first order). Multithreading is shown to be useful when p is small enough that the fixed components dominate.

Let us look at some results for the default set of system parameters given in Table 4. The analysis assumes 8000 processors arranged in a three dimensional array. In such a system, the average number of hops between a random pair of nodes is nk/3 = 20, where n denotes network dimension and k its radix. This yields an average round trip network latency of 55 cycles for an unloaded network, when memory latency and average packet size are taken into account. The fixed miss rate comprises first-time fetches of blocks into the cache, and the interference due to multiprocessor coherence invalidations.

    Parameter                Value
    Memory latency           10 cycles
    Network dimension n      3
    Network radix k          20
    Fixed miss rate          2%
    Average packet size      4
    Cache block size         16 bytes
    Thread working set size  250 blocks
    Cache size               64 Kbytes

Table 4: Default system parameters.

Figure 5 displays processor utilization as a function of the number of threads resident on the processor when context switching overhead is 10 cycles. The degree to which the cache, network, and overhead components impact overall processor utilization is also shown. The ideal curve shows the increase in processor utilization when both the cache miss rate and network contention correspond to that of a single process, and do not increase with the degree of multithreading p.

Figure 5: Relative sizes of the cache, network and overhead components that affect processor utilization. (Processor utilization U(p), from 0.0 to 1.0, versus processes p, from 0 to 8; curves: Ideal, Network Effects, Cache and Network Effects, CS Overhead, Useful Work.)

We see that as few as three processes yield close to 80% utilization for a ten-cycle context-switch overhead, which corresponds to our initial SPARC-based implementation of APRIL. This result is similar to that reported by Weber and Gupta [26] for coarse-grain multithreaded processors. The main reason a low degree of multithreading is sufficient is that context switches are forced only on cache misses, which are expected to happen infrequently. The marginal benefit of additional processes is seen to decrease due to network and cache interference.

Why is utilization limited to a maximum of about 0.80 despite an ample supply of threads? The reason is that available network bandwidth limits the maximum rate at which computation can proceed. When available network bandwidth is used up, adding more processes will not improve processor utilization. On the contrary, more processes will degrade performance due to increased cache interference. In such a situation, for better system performance, effort is best spent in increasing the network bandwidth, or in reducing the bandwidth requirement of each thread.

The relatively large ten-cycle context switch overhead does not significantly impact performance for the default set of parameters because utilization depends on the product of context switching frequency and switching overhead, and the switching frequency is expected to be small in a cache-based system. This observation is important because it allows a simpler processor implementation, and is exploited in the design of APRIL.

A multithreaded processor requires larger caches to sustain the working sets of multiple processes, although cache interference is mitigated if the processes share code and data. For the default parameter set, we found that caches greater than 64 Kbytes comfortably sustain the working sets of four processes. Smaller caches suffer more interference and reduce the benefits of multithreading.

9 Conclusions

We described the architecture of APRIL, a coarse-grain multithreaded processor to be used in a cache-coherent multiprocessor called ALEWIFE. By rapidly switching to an alternate task, APRIL can hide communication and synchronization delays and achieve high processor utilization. The processor makes effective use of available network bandwidth because it is rarely idle. APRIL provides support for fine-grain tasking and detection of futures. It achieves high single-thread performance by executing instructions from a given task until an exception condition like a synchronization fault or remote memory operation occurs. Coherent caches reduce the context switch rate to approximately once every 50-100 cycles. Therefore context switch overheads in the 4-10 cycle range are tolerable, significantly simplifying processor design. By providing hardware support only for performance-critical operations and migrating other functionality into the compiler and run-time system, we were able to simplify the processor design even further.

We described a SPARC-based implementation of APRIL that uses the register windows of SPARC as task frames for multiple threads. A processor simulator and an APRIL compiler and run-time system have been written. The SPARC-based implementation of APRIL switches contexts in 11 cycles. APRIL and its associated run-time system practically eliminate the overhead of fine-grain task creation and detection of futures. For Mul-T, the overhead reduces from 100% on an Encore Multimax-based implementation to under 5% on APRIL. We evaluated the scalability of multithreaded processors in large-scale parallel machines using an analytical model. For typical system parameters and a 10 cycle context-switch overhead, the processor can achieve close to 80% utilization with 3 processor-resident threads.

10 Acknowledgements

We would like to acknowledge the contributions of the members of the ALEWIFE research group. In particular, Dan Nussbaum was partly responsible for the processor simulator and run-time system and was the source of a gamut of ideas, David Chaiken wrote the cache simulator, Kirk Johnson supplied the benchmarks, and Gino Maa and Sue Lee wrote the network simulator. We appreciate help from Gene Hill, Mark Perry, and Jim Pena from LSI Logic Corporation for the SPARC-based implementation effort. Our design was influenced by Bert Halstead's work on multithreaded processors. Our research benefited significantly from discussions with Bert Halstead, Tom Knight, Greg Papadopoulos, Juan Loaiza, Bill Dally, Steve Ward, Rishiyur Nikhil, Arvind, and John Hennessy. Beng-Hong Lim is partly supported by an Analog Devices Fellowship. The research reported in this paper is funded by DARPA contract #N00014-87-K-0825 and by grants from the Sloan Foundation and IBM.

References

[1] Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. September 1989. MIT VLSI Memo 89-566, Laboratory for Computer Science.

[2] Arvind and Robert A. Iannucci. Two Fundamental Issues in Multiprocessing. Technical Report TM 330, MIT, Laboratory for Computer Science, October 1987.

[3] Arvind, R. S. Nikhil, and K. K. Pingali. I-Structures: Data Structures for Parallel Computing. In Proceedings of the Workshop on Graph Reduction, (Springer-Verlag Lecture Notes in Computer Science 279), September/October 1986.

[4] William C. Athas and Charles L. Seitz. Multicomputers: Message-Passing Concurrent Computers. Computer, 21(8):9-24, August 1988.

[5] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache-Coherence in Large-Scale Multiprocessors. June 1990. To appear in IEEE Computer.

[6] W. J. Dally et al. Architecture of a Message-Driven Processor. In Proceedings of the 14th Annual Symposium on Computer Architecture, pages 189-196, IEEE, New York, June 1987.

[7] Michel Dubois, Christoph Scheurich, and Faye A. Briggs. Synchronization, coherence, and event ordering in multiprocessors. IEEE Computer, 9-21, February 1988.

[8] R. H. Halstead and T. Fujita. MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 443-451, IEEE, New York, June 1988.

[9] Robert H. Halstead. Multilisp: A Language for Parallel Symbolic Computation. ACM Transactions on Programming Languages and Systems, 7(4):501-539, October 1985.

[10] J. L. Hennessy and T. R. Gross. Postpass Code Optimization of Pipeline Constraints. ACM Transactions on Programming Languages and Systems, 5(3):422-448, July 1983.

[11] J. L. Hennessy et al. Hardware/Software Tradeoffs for Increased Performance. In Proc. SIGARCH/SIGPLAN Symp. Architectural Support for Programming Languages and Operating Systems, pages 2-11, March 1982. ACM, Palo Alto, CA.

[12] M. D. Hill et al. Design Decisions in SPUR. Computer, 19(10):8-22, November 1986.

[13] Mark Horowitz et al. A 32-Bit Microprocessor with 2K-Byte On-Chip Instruction Cache. IEEE Journal of Solid-State Circuits, October 1987.

[14] R. A. Iannucci. Toward a Dataflow/von Neumann Hybrid Architecture. In Proceedings of the 15th Annual International Symposium on Computer Architecture, Hawaii, June 1988.

[15] W. J. Kaminsky and E. S. Davidson. Developing a Multiple-Instruction-Stream Single-Chip Processor. Computer, 66-78, December 1979.

[16] D. Kranz, R. Halstead, and E. Mohr. Mul-T: A High-Performance Parallel Lisp. In Proceedings of SIGPLAN '89, Symposium on Programming Languages Design and Implementation, June 1989.

[17] Eric Mohr, David A. Kranz, and Robert H. Halstead. Lazy task creation: a technique for increasing the granularity of parallel tasks. In Proceedings of Symposium on Lisp and Functional Programming, June 1990. To appear.

[18] Nigel P. Topham, Amos Omondi and Roland N. Ibbett. Context Flow: An Alternative to Conventional Pipelined Architectures. The Journal of Supercomputing, 2(1):29-53, 1988.

[19] Rishiyur S. Nikhil and Arvind. Can Dataflow Subsume von Neumann Computing? In Proceedings 16th Annual International Symposium on Computer Architecture, IEEE, New York, June 1989.

[20] Charles L. Seitz. Concurrent VLSI Architectures. IEEE Transactions on Computers, C-33(12), December 1984.

[21] Charles L. Seitz. The Cosmic Cube. CACM, 28(1):22-33, January 1985.

[22] B. J. Smith. A Pipelined, Shared Resource MIMD Computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6-8, 1978.

[23] SPARC Architecture Manual. 1988. SUN Microsystems, Mountain View, California.

[24] P. A. Steenkiste and J. L. Hennessy. A Simple Interprocedural Register Allocation Algorithm and Its Effectiveness for LISP. ACM Transactions on Programming Languages and Systems, 11(1):1-32, January 1989.

[25] David W. Wall. Global Register Allocation at Link Time. In SIGPLAN '86, Conference on Compiler Construction, June 1986.

[26] Wolf-Dietrich Weber and Anoop Gupta. Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results. In Proceedings 16th Annual International Symposium on Computer Architecture, IEEE, New York, June 1989.

[27] Colin Whitby-Strevens. The Transputer. In Proceedings 12th Annual International Symposium on Computer Architecture, IEEE, New York, June 1985.