APRIL: A Processor Architecture for Multiprocessing

Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz

Laboratory for Computer Science
Massachusetts Institute of Technology
Cambridge, MA 02139
Abstract

Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles is described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.

1 Introduction

The requirements placed on a processor in a large-scale multiprocessing environment are different from those in a uniprocessing setting. A processor in a parallel machine must be able to tolerate high memory latencies and handle process synchronization efficiently [2]. This need increases as more processors are added to the system.

Parallel applications impose processing and communication bandwidth demands on the parallel machine. An efficient and cost-effective machine design achieves a balance between the processing power and the communication bandwidth provided. An imbalance is created when an underutilized processor cannot fully exploit the available network bandwidth. When the network has bandwidth to spare, low processor utilization can result from high network latency. An efficient processor design for multiprocessors provides a means for hiding latency. When sufficient parallelism exists, a processor that rapidly switches to an alternate thread of computation during a remote memory request can achieve high utilization.

Processor utilization also diminishes due to synchronization latency. Spin lock accesses have a low overhead of memory requests, but busy-waiting on a synchronization event wastes processor cycles. Synchronization mechanisms that avoid busy-waiting through process blocking incur a high overhead.

Full/empty bit synchronization [22] in a rapid context switching processor allows efficient fine-grain synchronization. This scheme associates synchronization information with objects at the granularity of a data word, allowing a low-overhead expression of maximum concurrency. Because the processor can rapidly switch to other threads, wasteful iterations in spin-wait loops are interleaved with useful work from other threads. This reduces the negative effects of synchronization on processor utilization.

This paper describes the architecture of APRIL, a processor designed for large-scale multiprocessing. APRIL builds on previous research on processors for parallel architectures such as HEP [22], MASA [8], P-RISC [19], [14], [15], and [18]. Most of these processors support fine-grain interleaving of instruction streams from multiple threads, but suffer from poor single-thread performance. In the HEP, for example, instructions from a single thread can only be executed once every 8 cycles. Single-thread performance is important for efficiently running sections of applications with low parallelism.

APRIL does not support cycle-by-cycle interleaving of threads. To optimize single-thread performance, APRIL executes instructions from a given thread until it performs a remote memory request or fails in a synchronization attempt. We show that such coarse-grain multithreading allows a simple processor design with context switch overheads of 4 to 10 cycles, without significantly hurting overall system performance (although the pipeline design is complicated by the need to handle pipeline dependencies). In APRIL, thread scheduling is
done in software, and unlimited virtual dynamic threads are supported. APRIL supports full/empty bit synchronization, and provides tag support for futures [9]. (In this paper the terms process, thread, context, and task are used equivalently.)

By taking a systems-level design approach that considers not only the processor, but also the compiler and run-time system, we were able to migrate several non-critical operations into the software system, greatly simplifying processor design. APRIL's simplicity allows an implementation based on minor modifications to an existing RISC processor design. We describe such an implementation based on Sun Microsystems' SPARC processor [23]. A compiler for APRIL, a run-time system, and an APRIL simulator are operational. We present simulation results for several parallel applications showing APRIL's efficiency in handling fine-grain threads, and assess the scalability of multiprocessors based on a coarse-grain multithreaded processor using an analytical model. Our SPARC-based processor supports four hardware contexts and can switch contexts in about 10 cycles, which yields roughly 80% processor utilization in a system with an average base network latency of 55 cycles.

The rest of this paper is organized as follows. Section 2 is an overview of our multiprocessor system architecture and the programming model. The architecture of APRIL is discussed in Section 3, and its instruction set is described in Section 4. A SPARC-based implementation of APRIL is detailed in Section 5. Section 6 discusses the implementation and performance of the APRIL run-time system. Performance measurements of APRIL based on simulations are presented in Section 7. We evaluate the scalability of multithreaded processors in Section 8.

2 The ALEWIFE System

APRIL is the processing element of ALEWIFE, a large-scale multiprocessor being designed at MIT. ALEWIFE is a cache-coherent machine with distributed, globally-shared memory. Cache coherence is maintained using a directory-based protocol [5] over a low-dimension direct network [20]. The directory is distributed with the processing nodes.

2.1 Hardware

As shown in Figure 1, each ALEWIFE node consists of a processing element, floating-point unit, cache, main memory, cache/directory controller and a network routing switch. Multiple nodes are connected via a direct, packet-switched network.

Figure 1: ALEWIFE node.

The controller synthesizes a global shared memory space via messages to other nodes, and satisfies requests from other nodes directed to its local memory. It maintains strong cache coherence [7] for memory accesses. On exception conditions, such as cache misses and failed synchronization attempts, the controller can choose to trap the processor or to make the processor wait. A multithreaded processor reduces the ill effects of the long-latency acknowledgment messages resulting from a strong cache coherence protocol. To allow experimentation with other programming models, the controller provides special mechanisms for bypassing the coherence protocol and facilities for preemptive interprocessor interrupts and block transfers.

The ALEWIFE system uses a low-dimension direct network. Such networks scale easily and maintain high nearest-neighbor bandwidth. However, the longer expected latencies of low-dimension direct networks compared to indirect multistage networks increase the need for processors that can tolerate long latencies. Furthermore, the lower bandwidth of direct networks over indirect networks with the same channel width introduces interesting design tradeoffs.

In the ALEWIFE system, a context switch occurs whenever the network must be used to satisfy a request, or on a failed synchronization attempt. Since caches reduce the network request rate, we can employ coarse-grain multithreading (context switch every 50 to 100 cycles) instead of fine-grain multithreading (context switch every cycle). This simplifies processor design considerably because context switches can be more expensive (4 to 10 cycles), and functionality such as scheduling can be migrated into run-time software. Single-thread performance is optimized, and techniques
used in RISC processors for enhancing pipeline performance can be applied [10]. Custom design of a processing element is not required in the ALEWIFE system; indeed, we are using a modified version of a commercial RISC processor for our first-round implementation.

2.2 Programming Model

Our experimental programming language for ALEWIFE is Mul-T [16], an extended version of Scheme. Mul-T's basic mechanism for generating concurrent tasks is the future construct. The expression (future X), where X is an arbitrary expression, creates a task to evaluate X and also creates an object known as a future to eventually hold the value of X. When created, the future is in an unresolved, or undetermined, state. When the value of X becomes known, the future resolves to that value, effectively mutating into the value of X. Concurrency arises because the expression (future X) returns the future as its value without waiting for the future to resolve. Thus, the computation containing (future X) can proceed concurrently with the evaluation of X. All tasks execute in a shared address-space.

The result of supplying a future as an operand of some operation depends on the nature of the operation. Non-strict operations, such as passing a parameter to a procedure, returning a result from a procedure, assigning a value to a variable, and storing a value into a field of a data structure, can treat a future just like any other kind of value. Strict operations such as addition and comparison, if applied to an unresolved future, are suspended until the future resolves and then proceed, using the value to which the future resolved as though that had been the original operand.

The act of suspending if an object is an unresolved future and then proceeding when the future resolves is known as touching the object. The touches that automatically occur when strict operations are attempted are referred to as implicit touches. Mul-T also includes an explicit touching or "strict" primitive (touch X) that touches the value of the expression X and then returns that value.

Futures express control-level parallelism. In a large class of algorithms, data parallelism is more appropriate. Barriers are a useful means of synchronization for such applications on MIMD machines, but force unnecessary serialization. The same serialization occurs in SIMD machines. Implementing data-level parallelism in a MIMD machine that allows the expression of maximum concurrency requires cheap fine-grain synchronization associated with each data object. We provide this support in hardware with full/empty bits.

We are augmenting Mul-T with constructs for data-level parallelism and primitives for placement of data and tasks. As an example, the programmer can use future-on, which works just like a normal future but allows the specification of the node on which to schedule the future. Extending Mul-T in this way allows us to experiment with techniques for enhancing locality and to research language-level issues for programming parallel machines.

3 Processor Architecture

APRIL is a pipelined RISC processor extended with special mechanisms for multiprocessing. This section gives an overview of the APRIL architecture and focuses on its features that support multithreading, fine-grain synchronization, cheap futures, and other models of computation.

The left half of Figure 2 depicts the user-visible processor state comprising four sets of general purpose registers, and four sets of Program Counter (PC) chains and Processor State Registers (PSR). The PC chain represents the instruction addresses corresponding to a thread, and the PSR holds various pieces of process-specific state. Each register set, together with a single PC-chain and PSR, is conceptually grouped into a single entity called a task frame (using terminology from [8]). Only one task frame is active at a given time and is designated by a current frame pointer (FP). All register accesses are made to the active register set and instructions are fetched using the active PC-chain. Additionally, a set of 8 global registers that are always accessible (regardless of the FP) is provided.

Registers are 32 bits wide. The PSR is also a 32-bit register and can be read into and written from the general registers. Special instructions can read and write the FP register. The PC-chain includes the Program Counter (PC) and next Program Counter (nPC), which are not directly accessible. This assumes a single-cycle branch delay slot. Condition codes are set as a side effect of compute instructions. A longer branch delay might be necessary if the branch instruction itself does a compare so that condition codes need not be saved [13]; in this case the PC chain is correspondingly longer. Words in memory have a 32 bit data field, and have an additional synchronization bit called the full/empty bit.

Use of multiple register sets on the processor, as in the HEP, allows rapid context switching. A context switch is achieved by changing the frame pointer and emptying the pipeline. The cache controller forces a context switch on the processor, typically on remote network requests, and on certain unsuccessful full/empty bit synchronizations.
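The benefit of switching among a few resident threads on each remote request can be illustrated with a toy calculation. The sketch below is a deliberately simplified simulation, not the analytical model of Section 8: the run length, switch cost, latency, and round-robin policy are illustrative assumptions only.

```python
# Toy model of coarse-grain multithreading (illustrative only).
# A thread runs for RUN cycles, then issues a remote request that takes
# LATENCY cycles to satisfy; each context switch costs SWITCH cycles.

RUN, LATENCY, SWITCH = 50, 55, 10

def utilization(n_threads, total=100_000):
    ready_at = [0] * n_threads   # cycle at which each thread's request completes
    busy = 0                     # cycles spent doing useful work
    t, cur = 0, 0
    while t < total:
        if ready_at[cur] <= t:   # thread is ready: run until its next remote request
            busy += RUN
            t += RUN
            ready_at[cur] = t + LATENCY
        t += SWITCH              # switch to the next loaded thread
        cur = (cur + 1) % n_threads
    return busy / t

for n in (1, 2, 3, 4):
    print(n, round(utilization(n), 2))
```

With these (assumed) parameters a single thread idles during every remote request, while a few resident threads overlap one thread's latency with another's computation, pushing utilization above 80%, which is consistent in spirit with the behavior the paper reports.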
Figure 2: Processor State and Virtual Threads.

APRIL implements futures using the trap mechanism. For our proposed experimental implementation based on SPARC, which does not have four separate PC and PSR frames, context switches are also caused through traps. Therefore, a fast trap mechanism is essential. When a trap is signalled in APRIL, the trap mechanism lets the pipeline empty and passes control to the trap handler. The trap handler executes in the same task frame as the thread that trapped so that it can access all of the thread's registers.

3.1 Coarse-Grain Multithreading

In most processor designs to date (e.g. [8, 22, 19, 15]), multithreading has involved cycle-by-cycle interleaving of threads. Such fine-grain multithreading has been used to hide memory latency and also to achieve high pipeline utilization. Pipeline dependencies are avoided by maintaining instructions from different threads in the pipeline, at the price of poor single-thread performance.

In the ALEWIFE machine, we are primarily concerned with the large latencies associated with cache misses that require a network access. Good single-thread performance is also important. Therefore APRIL continues executing a single thread until a memory operation involving a remote request (or an unsuccessful synchronization attempt) is encountered. The controller forces the processor to switch to another thread, while it services the request. This approach is called coarse-grain multithreading. Processors in message passing multicomputers [21, 27, 6, 4] have traditionally taken this approach to allow overlapping of communication with computation.

Context switching in APRIL is achieved by changing the frame pointer. Since APRIL has four task frames, it can have up to four threads loaded. The thread that is being executed resides in the task frame pointed to by the FP. A context switch simply involves letting the processor pipeline empty while saving the PC-chain and then changing the FP to point to another task frame.

Threads in ALEWIFE are virtual. Only a small subset of all threads can be physically resident on the processors; these threads are called loaded threads. The remaining threads are referred to as unloaded threads and live on various queues in memory, waiting their turn to be loaded. In a sense, the set of task frames acts like a cache on the virtual threads. This organization is illustrated in Figure 2. The scheduler tries to choose threads from the set of loaded threads for execution to minimize the overhead of saving and restoring threads to and from memory. When control eventually passes back to the thread that suffered a remote request, the controller should have completed servicing the request, provided the other threads ran for enough cycles. By maximizing local cache and memory accesses, the need for context switching reduces to once every 50 or 100 cycles, which allows us to tolerate latencies in the range of 150 to 300 cycles with 4 task frames (see Section 8).

Rapid context switching is used to hide the latency encountered in several other trap events, such as synchronization faults (or attempts to load from "empty" locations). These events can either cause the processor to suspend execution (wait) or to take a trap. In the former case, the controller holds the processor until the request is satisfied. This typically happens on local memory cache misses, and on certain full/empty bit tests. If a trap is taken, the trap handling routine can respond by:

1. spinning: immediately return from the trap and retry the trapping instruction.

2. switch spinning: context switch without unloading the trapped thread.

3. blocking: unload the thread.

The above alternatives must be considered with care because incorrect choices can create or exacerbate starvation and thrashing problems. An extreme example of starvation is this: all loaded threads are spinning or switch spinning on an exception condition that an unloaded thread is responsible for fulfilling. We are investigating several possible mechanisms to handle such problems, including a special controller initiated trap on certain failed synchronization tests, whose handler unloads the thread.
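The three responses amount to a per-exception policy choice made by the trap handler. The sketch below mimics that choice in software; the expected-wait thresholds and the queue structures are invented for illustration and are not ALEWIFE's actual trap code.

```python
# Illustrative dispatch among the three trap responses: spinning,
# switch spinning, and blocking. Thresholds and names are invented.
from collections import deque

loaded = deque(["T0", "T1", "T2", "T3"])   # threads resident in task frames
unloaded = deque()                          # threads living on memory queues

def on_failed_sync(expected_wait):
    """Pick a response for a failed synchronization, by expected wait (cycles)."""
    if expected_wait < 10:
        return "spin"                        # retry the trapping instruction
    if expected_wait < 1000:
        loaded.rotate(-1)                    # switch-spin: stay loaded, run others
        return "switch_spin"
    unloaded.append(loaded.popleft())        # block: unload the trapped thread
    return "block"

print(on_failed_sync(5))                     # a short wait: just spin
```

A policy like this must still guard against the starvation case described above: if every loaded thread spins or switch-spins on a condition only an unloaded thread can fulfill, some trap must eventually choose to block.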
An important aspect of the ALEWIFE system is its combination of caches and multithreading. While this combination is advantageous, it also creates a unique class of thrashing and starvation problems. For example, forward progress can be halted if a context executing on one processor is writing to a location while a context on another processor is reading from it. These two contexts can easily play "cache tag", since writes to a location force a context switch and invalidation of other cached copies, while reads force a context switch and transform read-write copies into read-only copies. Another problem involves thrashing between an instruction and its data; a context will be blocked if it has a load instruction mapped to the same cache line as the target of the load. These and related problems have been addressed with appropriate hardware interlock mechanisms.

3.2 Support for Futures

Executing a Mul-T program with futures incurs two types of overhead not present in sequential programs. First, strict operations must check their operands for availability before using them. Second, there is a cost associated with creating new threads.

Detection of Futures

Operand checks for futures done in software imply wasted cycles on every strict operation. Our measurements with Mul-T running on an Encore Multimax show that this is expensive. Even with clever compiler optimizations, there is close to a factor of two loss in performance over a purely sequential implementation (see Table 3). Our solution employs a tagging scheme with hardware-generated traps if an operand to a strict operator is a future. We believe that this hardware support is necessary to make futures a viable construct for expressing parallelism. From an architectural perspective, this mechanism is similar to dynamic type checking in Lisp. However, this mechanism is necessary even in a statically typed language in the presence of dynamic futures.

APRIL uses a simple data type encoding scheme for automatically generating a trap when operands to strict operators are futures. This implementation (discussed in Section 5) obviates the need to explicitly inspect in software the operands to every compute instruction. This is important because we do not want to hurt the efficiency of all compute instructions because of the possibility an operand is a future.

Lazy Task Creation

Little can be done to reduce the cost of task creation if future is taken as a command to create a new task. In many programs the possibility of creating an excessive number of fine-grain tasks exists. Our solution to this problem is called lazy task creation [17]. With lazy task creation a future expression does not create a new task, but computes the expression as a local procedure call, leaving behind a marker indicating that a new task could have been created. The new task is created only when some processor becomes idle and looks for work, stealing the continuation of that procedure call. Thus, the user can specify the maximum possible parallelism without the overhead of creating a large number of tasks. The race conditions are resolved using the fine-grain locking provided by the full/empty bits.

3.3 Fine-grain synchronization

Besides support for lazy task creation, efficient fine-grain synchronization is essential for large-scale parallel computing. Both the dataflow and data-parallel models of computation rely heavily on the availability of cheap fine-grain synchronization. The unnecessary serialization imposed by barriers in MIMD implementations of data-parallelism can be avoided by allowing fine-grain word-level synchronization in data structures. The traditional test&set based synchronization requires extra memory operations and separate data storage for the lock and for the associated data. Busy-waiting or blocking in conventional processors wastes additional processor cycles.

APRIL adopts the full/empty bit approach used in the HEP to reduce both the storage requirements and the number of memory accesses. A bit associated with each memory word indicates the state of the word: full or empty. The load of an empty location or the store into a full location can trap the processor causing a context switch, which helps hide synchronization delay. Traps also obviate the additional software tests of the lock in test&set operations. A similar mechanism is used to implement I-structures in dataflow machines [3]; however, APRIL is different in that it implements such synchronizations through software trap handlers.

3.4 Multimodel Support Mechanisms

APRIL is designed primarily for a shared-memory multiprocessor with strongly coherent caches. However, we are considering several additional mechanisms which will permit explicit management of caches and efficient use of network bandwidth. These mechanisms present different computational models to the programmer.

To allow software-enforced cache coherence, we have loads and stores that bypass the hardware coherence mechanism, and a flush operation that permits software writeback and invalidation of cache lines. A loaded
context has a fence counter that is incremented for each dirty cache line that is flushed and decremented for each acknowledgement from memory. This fence counter may be examined to determine if all writebacks have completed. We are proposing a block-transfer mechanism for efficient transfer of large blocks of data. Finally, we are considering an interprocessor-interrupt mechanism (IPI) which permits preemptive messages to be sent to specific processors. IPIs offer reasonable alternatives to polling and, in conjunction with block-transfers, form a primitive for the message-passing computational model.

Although each of these mechanisms adds complexity to our cache controller, they are easily implemented in the processor through "out-of-band" instructions as discussed in Section 5.

4 Instruction Set

APRIL has a basic RISC instruction set augmented with special memory instructions for full/empty bit operations, multithreading, and cache support. The attraction of an implementation based on simple SPARC processor modifications has resulted in a basic SPARC-like design. All registers are addressed relative to a current frame pointer. Compute instructions are 3-address register-to-register arithmetic/logic operations. Conditional branch instructions take an immediate operand and may increment the PC by the value of the immediate operand depending on the condition codes set by the arithmetic/logic operations. Memory instructions move data between memory and the registers, and also interact with the cache and the full/empty bits. The basic instruction categories are summarized in Table 1. The remainder of this section describes features of APRIL instructions used for supporting multiprocessing.

Table 1: Basic instruction set summary.

  Type      Format           Data transfer     Control flow
  Compute   op s1 s2 d       d <- s1 op s2     PC+1
  Memory    ld type a d      d <- mem[a]       PC+1
            st type a s      mem[a] <- s       PC+1
  Branch    jcond offset                       if cond then PC+offset, else PC+1
            jmpl offset d    d <- PC           PC+offset

Data Type Formats

APRIL supports tagged pointers for Mul-T, as in the Berkeley SPUR processor [12], by encoding the pointer type in the low order bits of a data word. Associating the type with the pointer has the advantage of saving an additional memory reference when accessing type information. Figure 3 lists the different type encodings. An important purpose of this type encoding scheme is to support hardware detection of futures.

Figure 3: Data Type Encodings. (The type of a word is encoded in its low-order bits; fixnums end in 00, and future pointers have a non-zero least significant bit.)

Future Detection and Compute Instructions

Since a compute instruction is a strict operation, special action has to be taken if either of its operands is a future. APRIL generates a trap if a future is encountered by a compute instruction. Future pointers are easily detected by their non-zero least significant bit.

Memory Instructions

Memory instructions are complex because they interact with the full/empty bits and the cache controller. On a memory access, two data exceptions can occur: the accessed location may not be in the cache (a cache miss), and the accessed location may be empty on a load or full on a store (a full/empty exception). On a cache miss, the cache/directory controller can trap the processor or make the processor wait until the data is available. On full/empty exceptions, the controller can trap the processor, or allow the processor to continue execution. Load instructions also have the option of setting the full/empty bit of the accessed location to empty, while store instructions have the option of setting the bit to full. These options give rise to 8 kinds of loads and 8 kinds of stores. The load instructions are listed in Table 2. Store instructions are similar except that they trap on full locations instead of empty locations.

A memory instruction also shares responsibility for detecting futures in either of its address operands. Like compute instructions, memory instructions also trap if the least significant bit of either of their address operands is non-zero. This introduces the restriction that objects in memory cannot be allocated at byte boundaries. This, however, is not a problem because object allocation at word boundaries is favored for other reasons [11]. This trap provides support for implicit future touches in operators that dereference pointers, e.g., car in LISP.
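The low-bit tagging rule can be mimicked in software to see how strict operations interact with futures. The sketch below is a simulation of the idea, not APRIL hardware: the Future class, heap dictionary, and helper names are all invented for illustration.

```python
# Software sketch of low-bit future tagging: a word-aligned "pointer"
# with its least significant bit set denotes a future, so strict
# operations must trap on it. All names here are illustrative.

class Future:
    def __init__(self):
        self.resolved, self.value = False, None

heap = {}                 # fake word-addressed memory: address -> object

def make_future(addr):
    heap[addr] = Future()
    return addr | 1       # word-aligned address with LSB set: a future pointer

def is_future(ptr):
    return ptr & 1 == 1   # the hardware check: non-zero least significant bit

def touch(ptr):
    """Strict operands are touched: trap (here, raise) unless resolved."""
    if not is_future(ptr):
        return ptr
    f = heap[ptr & ~1]
    if not f.resolved:
        raise RuntimeError("future trap: would suspend the thread")
    return f.value

p = make_future(0x1000)
assert is_future(p)       # a compute or memory instruction would trap on p
heap[0x1000].resolved, heap[0x1000].value = True, 42
print(touch(p))
```

Because word-aligned addresses always have zero low bits, the same single-bit test doubles as the alignment restriction described above: dereferencing a future pointer looks like a misaligned access and traps, giving implicit touches for free.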
to devote our limited resources to the design of a custom
1 2
Name Typ e Reset f/e bit EL trap CM resp onse
ALEWIFE cache and directory controller, rather than
ldtt 1 No Yes Trap
to pro cessor design. Second, the register windows in
ldett 2 Yes Yes Trap
the SPARC pro cessor p ermit a simple implementation
ldnt 3 No No Trap
of coarse-grain multithreading. Third, most of the in-
ldent 4 Yes No Trap
structions envisioned for the original APRIL pro cessor
ldnw 5 No No Wait
map directly to single or double instruction sequences
ldenw 6 Yes No Wait
on the SPARC. Software compatibility with a commer-
ldtw 7 No Yes Wait
cial pro cessor allows easy access to a large b o dy of soft-
ldetw 8 Yes Yes Wait
1 2
ware. Furthermore, use of a standard pro cessor p ermits
Empty lo cation. Cache miss.
us to ride the technology curve; we can take advantage
of new technology as it is develop ed.
Table 2: Load Instructions.
Rapid Context Switching on SPARC SPARC
pro cessors contain an implementation-dep endentnum-
Full/Empty Bit Conditional Branch Instructions
ber of overlapping register windows for sp eeding up pro-
Non-trapping memory instructions allow testing of the
cedure calls. The current register window is altered
full/empty bit by setting a condition bit indicating the
via SPARC instructions (SAVE and RESTORE) that mo d-
state of the memory word's full/empty bit. APRIL
ify the Current WindowPointer (CWP). Traps incre-
provides conditional branch instructions, Jfull and
ment the CWP, while the trap return instruction (RETT)
Jempty, that dispatch on this condition bit. This pro-
decrements it. SPARC's register windows are suited for
vides a mechanism to explicitly control the action taken
rapid context switching and rapid trap handling b ecause
following a memory instruction that would normally
most of the state of a pro cess (i.e., its 24 lo cal reg-
trap on a full/empty exception.
isters) can b e switched with a single-cycle instruction.
Frame Pointer Instructions   Instructions are provided for manipulating the register frame pointer (FP). FP points to the register frame on which the currently executing thread resides. An INCFP instruction increments the FP to point to the next task frame while a DECFP instruction decrements it. The incrementing and decrementing is done modulo the number of task frames. RDFP reads the value of the FP into a register and STFP writes the contents of a register into the FP.

Instructions for Other Mechanisms   The special mechanisms discussed in Section 3.4, such as FLUSH, are made available through "out-of-band" instructions. Interprocessor-interrupts, block-transfers, and FENCE operations are initiated via memory-mapped I/O instructions (LDIO, STIO).

5 An Implementation of APRIL

An ALEWIFE node consists of several interacting subsystems: processor, floating-point unit, cache, memory, cache and directory controller, and network controller. For the first-round implementation of the ALEWIFE system, we plan to use a modified SPARC processor and an unmodified SPARC floating-point unit.¹ There are several reasons for this choice. First, we have chosen

¹The SPARC-based implementation effort is in collaboration with LSI Logic Corporation.

Although we are not using multiple register windows for procedure calls within a single thread, this should not significantly hurt performance [25, 24].

To implement coarse-grain multithreading, we use two register windows per task frame: a user window and a trap window. The SPARC processor chosen for our implementation has eight register windows, allowing a maximum of four hardware task frames. Since the SPARC does not have multiple program counter (PC) chains and processor status registers (PSRs), our trap code must explicitly save and restore the PSRs during context switches (the PC chain is saved by the trap itself). These values are saved in the trap window. Because the SPARC has a minimum trap overhead of five cycles (for squashing the pipeline and computing the trap vector), context switches will take at least this long. See Section 6.1 for further information.

The SPARC floating-point unit does not support register windows, but has a single, 32-word register file. To retain rapid context switching ability for applications that require efficient floating-point performance, we have divided the floating-point register file into four sets of eight registers. This is achieved by modifying floating-point instructions in a context-dependent fashion as they are loaded into the FPU and by maintaining four different sets of condition bits. A modification of the SPARC processor will make the CWP available externally to allow insertion into the FPU instruction.
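The register-file split just described amounts to a context-dependent rewrite of the FPU register field. The Python sketch below is illustrative only: the function name is ours, and the mapping from CWP to task frame (assuming two register windows per frame) is our reading of the scheme, not a detail the paper specifies.

```python
FP_REGS_PER_SET = 8   # the 32-word FPU register file is split into four sets
NUM_TASK_FRAMES = 4

def fpu_physical_register(logical_reg, cwp):
    """Rewrite an FPU register field as the instruction enters the FPU:
    the task frame (derived here from the CWP, two windows per frame)
    selects one of four disjoint eight-register sets."""
    assert 0 <= logical_reg < FP_REGS_PER_SET
    frame = (cwp // 2) % NUM_TASK_FRAMES
    return frame * FP_REGS_PER_SET + logical_reg
```

Each context then addresses its own disjoint eight-register slice of the file, so no floating-point state needs to be saved or restored on a context switch.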
Support for Futures   We detect futures on the SPARC via two separate mechanisms. Future pointers are tagged with their lowest bit set. Thus, direct use of a future pointer is flagged with a word-alignment trap. Furthermore, a strict operation, such as subtraction, applied to one or more future pointers is flagged with a modified non-fixnum trap that is triggered if an operand has its lowest bit set (as opposed to either one of the lowest two bits, in the SPARC specification).

Implementation of Loads and Stores   The SPARC definition includes the Alternate Space Indicator (ASI) feature that permits a simple implementation of APRIL's many load and store instructions (described in Section 4). The ASI is available externally as an eight-bit field. Normal memory accesses use four of the 256 ASI values to indicate user/supervisor and instruction/data accesses. Special SPARC load and store instructions (LDASI and STASI) permit use of the other 252 ASI values. Our first-round implementation uses different ASI values to distinguish between flavors of load and store instructions, special mechanisms, and I/O instructions.

Interaction with the Cache Controller   The cache controller in the ALEWIFE system maintains strong cache coherence, performs full/empty bit synchronization, and implements special mechanisms. By examining the processor's ASI bits during memory accesses, it can select between different load/store and synchronization behavior, and can determine if special mechanisms should be employed. Through use of the Memory Exception (MEXC) line on SPARC, it can invoke synchronous traps corresponding to cache misses and synchronization (full/empty) mismatches. The controller can suspend processor execution using the MHOLD line. It passes condition information to the processor through the Coprocessor Condition bits (CCCs), permitting the full/empty conditional branch instructions (Jfull and Jempty) to be implemented as coprocessor branch instructions. Asynchronous traps (IPIs) are delivered via the SPARC's asynchronous trap lines.

6 Compiler and Run-Time System

The compiler and run-time system are integral parts of the processor design effort. A Mul-T compiler for APRIL and a run-time system written partly in APRIL assembly code and partly in T have been implemented. Constructs for user-directed placement of data and processes have also been implemented. The run-time system includes the trap and system routines, Mul-T run-time support, a scheduler, and a system boot routine. Since a large portion of the support for multithreading, synchronization and futures is provided in software through traps and run-time routines, trap handling must be fast. Below, we describe the implementation and performance of the routines used for trap handling and context switching.

6.1 Cache Miss and Full/Empty Traps

Cache miss traps occur on cache misses that require a network request and cause the processor to context switch. Full/empty synchronization exceptions can occur on certain memory instructions described in Section 4. The processor can respond to these exceptions by spinning, switch spinning, or blocking the thread. In our current implementation, traps handle these exceptions by switch spinning, which involves a context switch to the next task frame.

In our SPARC-based design of APRIL, we implement context switching through the trap mechanism using instructions that change the CWP. The following is a trap routine that context switches to the thread in the next task frame.

    rdpsr   psrreg    ; save PSR into a reserved reg.
    save              ; increment the window pointer
    save              ; by 2
    wrpsr   psrreg    ; restore PSR for the new context
    jmpl    r17       ; return from trap and
    rett    r18       ; reexecute trapping instruction

We count 5 cycles for the trap mechanism to allow the pipeline to empty and save relevant processor state before passing control to the trap handler. The above trap handler takes an additional 6 cycles for a total of 11 cycles to effect the context switch. In a custom APRIL implementation, the cycles lost due to PC saves in the hardware trap sequence, and those in calling the trap handler for the PSR saves/restores and double incrementing the frame pointer, could be avoided, allowing a four-cycle context switch.

6.2 Future Touch Trap

When a future touch trap is signalled, the future that caused the trap will be in a register. The trap handler has to decode the trapping instruction to find that register. The future is resolved if the full/empty bit of the future's value slot is set to full. If it is resolved, the future in the register is replaced with the resolved value; otherwise the trap routine can decide to switch-spin or block the thread that trapped. Our future touch trap handler takes 23 cycles to execute if the future is resolved.

If the trap handler decides to block the thread on an unresolved future, the thread must be unloaded from the hardware task frame, and an alternate thread may be loaded. Loading a thread involves writing the state of the thread, including its general registers, its PC chain, and its PSR, into a hardware task frame on the processor, and unloading a thread involves saving the state of a thread out to memory. Loading and unloading threads are expensive operations unless there is special support for block movement of data between hardware registers and memory. Since the scheduling mechanism favors processor-resident threads, loading and unloading of threads should be infrequent. However, this is an issue that is under investigation.
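The low-bit tagging of futures and the touch behavior described in the sections above can be sketched in Python. This is an illustrative model, not the SPARC mechanism itself: the hardware traps become explicit checks, and the Future class, the fixnum encoding, and the function names are our own.

```python
class Future:
    """Placeholder for a value computed by another thread. The 'full'
    flag models the full/empty bit of the future's value slot."""
    def __init__(self):
        self.full = False
        self.value = None
    def resolve(self, value):
        self.value = value
        self.full = True

heap = []  # futures live here; a "pointer" is an index, tagged below

def fixnum(n):      # ordinary values: low bit clear
    return n << 1

def make_future():  # future pointers: low bit set
    heap.append(Future())
    return ((len(heap) - 1) << 1) | 1

def touch(word):
    """Future touch: a resolved future is replaced by its value; an
    unresolved one would block the thread (modelled as an exception)."""
    while word & 1:                # low bit set: a future pointer
        f = heap[word >> 1]
        if not f.full:
            raise RuntimeError("unresolved future: thread would block")
        word = f.value
    return word

def strict_sub(a, b):
    """A strict operation: the modified non-fixnum trap fires if either
    operand has its low bit set; the handler touches the operands."""
    return fixnum((touch(a) >> 1) - (touch(b) >> 1))
```

Because ordinary fixnums always have the low bit clear, testing one bit suffices to separate values from future pointers, which is what makes the hardware check cheap.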
Figure 4: Simulator Organization.

7 Performance Measurements

This section presents some results on APRIL's performance in handling fine-grain tasks. We have implemented a simulator for the ALEWIFE system written in C and T. Figure 4 illustrates the organization of the
simulator. The Mul-T compiler produces APRIL code, which gets linked with the run-time system to yield an executable program. The instruction-level APRIL processor simulator interprets APRIL instructions. It is written in T and simulates 40,000 APRIL instructions per second when run on a SPARCServer 330. The processor simulator interacts with the cache and directory simulator (written in C) on memory instructions. The cache simulator in turn interacts with the network simulator (also written in C) when making remote memory operations. The simulator has proved to be a useful tool in evaluating system-wide architectural tradeoffs as it provides more accurate results than a trace driven simulation. The speed of the simulator has allowed us to execute lengthy parallel programs. As an example, in a run of speech (described below), the simulated program ran for 100 million simulated cycles before completing.

Evaluation of the ALEWIFE architecture through simulations is in progress. A sampling of our results on the performance of APRIL running parallel programs is presented here. Table 3 lists the execution times of four programs written in Mul-T: fib, factor, queens and speech. fib is the ubiquitous doubly recursive Fibonacci program with `future's around each of its recursive calls, factor finds the largest prime factor of each number in a range of numbers and sums them up, queens finds all solutions to the n-queens chess problem for n = 8, and speech is a modified Viterbi graph search algorithm used in a connected speech recognition system called SUMMIT, developed by the Spoken Language Systems Group at MIT. We ran each program on the Encore Multimax, on APRIL using normal task creation, and on APRIL using lazy task creation. For purposes of comparison, execution time has been normalized to the time taken to execute a sequential version of each program, i.e., with no futures and compiled with an optimizing T-compiler.

The difference between running the same sequential code on T and on Mul-T on the Encore Multimax (columns "T seq" and "Mul-T seq") is due to the overhead of future detection. Since the Encore does not support hardware detection of futures, an overhead of a factor of 2 is introduced, even though no futures are actually created. There is no overhead on APRIL, which demonstrates the advantage of tag support for futures.

The difference between running sequential code on Mul-T and running parallel code on Mul-T with one processor ("Mul-T seq" and 1) is due to the overhead of thread creation and synchronization in a parallel program. This overhead is very large for the fib benchmark on both the Encore and APRIL using normal task creation because of very fine-grain thread creation. This overhead accounts for approximately a factor of 28 in execution time. For APRIL with normal futures, this overhead accounts for a factor of 14. Lazy task creation on APRIL creates threads only when the machine has the resources to execute them, and performs much better because it has the effect of dynamically partitioning the program into coarser-grain threads and creating
Program   System     T seq   Mul-T seq      1      2      4      8     16
fib       Encore      1.0      1.8        28.9   16.3    9.2    5.1
          APRIL       1.0      1.0        14.2    7.1    3.6    1.8   0.97
          Apr-lazy    1.0      1.0         1.5    0.78   0.44   0.29  0.19
factor    Encore      1.0      1.4         1.9    0.96   0.50   0.26
          APRIL       1.0      1.0         1.8    0.90   0.45   0.23  0.12
          Apr-lazy    1.0      1.0         1.0    0.52   0.26   0.14  0.09
queens    Encore      1.0      1.8         2.1    1.0    0.54   0.31
          APRIL       1.0      1.0         1.4    0.67   0.33   0.18  0.10
          Apr-lazy    1.0      1.0         1.0    0.51   0.26   0.13  0.07
speech    Encore      1.0      2.0         2.3    1.2    0.62   0.36
          APRIL       1.0      1.0         1.2    0.60   0.31   0.17  0.10
          Apr-lazy    1.0      1.0         1.0    0.52   0.27   0.15  0.09

Table 3: Execution time for Mul-T benchmarks. "T seq" is T running sequential code, "Mul-T seq" is Mul-T running sequential code; 1 to 16 denote number of processors running parallel code.
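Since the entries in Table 3 are execution times normalized to the sequential T version, the speedup on p processors is just the reciprocal of an entry. A small sketch, with two rows transcribed from the table:

```python
# Normalized execution times from Table 3 (1.0 = the sequential T version).
apr_lazy = {
    "fib":    {1: 1.5, 2: 0.78, 4: 0.44, 8: 0.29, 16: 0.19},
    "queens": {1: 1.0, 2: 0.51, 4: 0.26, 8: 0.13, 16: 0.07},
}

def speedup(times, p):
    """Speedup over the sequential version on p processors."""
    return 1.0 / times[p]
```

For example, queens with lazy task creation on 16 processors runs about 14 times faster than the sequential version, while fib on one processor is slower than the sequential code, reflecting its fine-grain task-creation overhead.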
fewer futures. The overhead introduced is only a factor of 1.5. In all of the programs, APRIL consistently demonstrates lower overhead due to support for thread creation and synchronization over the Encore.

Measurements for multiple processor executions on APRIL (2-16) used the processor simulator without the cache and network simulators, in effect simulating a shared-memory machine with no memory latency. The numbers demonstrate that APRIL and its run-time system allow parallel program performance to scale when synchronization and task creation overheads are taken into account, but when memory latency is ignored. The effect of communication in large-scale machines depends on several factors such as scheduling, which are active areas of investigation.

8 Scalability of Multithreaded Processor Systems

Multithreading enhances processor efficiency by allowing execution to proceed on alternate threads while the memory requests of other threads are being satisfied. However, any new mechanism is useful only if it enhances overall system performance. This section analyzes the system performance of multithreaded processors.

A multithreaded processor design must address the tradeoff between reduced processor idle time and increased cache miss rates, network contention, and context management overhead. The private working sets of multiple contexts interfere in the cache. The added interference misses coupled with the higher average traffic generated by a higher utilized processor impose greater bandwidth demands on the interconnection network. Context management instructions required to switch the processor between threads also add to the overhead. Furthermore, the application must display sufficient parallelism to allow multiple thread assignment to each processor.

What is a good performance metric to evaluate multithreading? A good measure of system performance is system power, which is the product of the number of processors and the average processor utilization. Provided the computation of processor utilization takes into account the deleterious effects of cache, network, and context-switching overhead, the processor utilization is itself a good measure.

We have developed a model for multithreaded processor utilization that includes the cache, network, and switching overhead effects. A detailed analysis is presented in [1]. This section will summarize the model and our chief results. Processor utilization U as a function of the number of threads resident on a processor p is derived as a function of the cache miss rate m(p), the network latency T(p), and the context switching overhead C:

    U(p) = p / (1 + T(p)m(p))     for p <  (1 + T(p)m(p)) / (1 + Cm(p))
    U(p) = 1 / (1 + Cm(p))        for p >= (1 + T(p)m(p)) / (1 + Cm(p))     (1)
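Equation 1 is easy to evaluate directly. The sketch below treats the miss rate and latency as constants, whereas the full model in [1] lets m(p) and T(p) grow with p, so this version understates the interference effects:

```python
def utilization(p, m, T, C):
    """Equation 1: with few threads, network latency is only partly
    overlapped; past the saturation point, utilization is capped by the
    context switch overhead C paid on every cache miss (rate m)."""
    saturation = (1 + T * m) / (1 + C * m)
    if p < saturation:
        return p / (1 + T * m)
    return 1.0 / (1 + C * m)
```

With values in the spirit of Table 4 (m = 0.02, T = 55 cycles round trip, C = 10 cycles), one thread yields about 0.48, and utilization saturates at 1/(1 + Cm) = 0.83 once p exceeds about 1.75, consistent with the paper's observation that close to 80% utilization is reached with about three resident threads.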
When the number of threads is small, complete overlapping of network latency is not possible. Processor utilization with one thread is 1/(1 + m(1)T(1)). Ideally, with p threads available to overlap network delays, the utilization would increase p-fold. In practice, because the miss rate and network latency increase to m(p) and T(p), the utilization becomes p/(1 + m(p)T(p)).

When it is possible to completely overlap network latency, processor utilization is limited only by the context switching overhead paid on every miss (assuming a context switch happens on a cache miss), and is given by 1/(1 + m(p)C).

The models for the cache and network terms have been validated through simulations. Both these terms are shown to be the sum of two components: one component independent of the number of threads p and the other linearly related to p (to first order). Multithreading is shown to be useful when p is small enough that the fixed components dominate.

Let us look at some results for the default set of system parameters given in Table 4. The analysis assumes 8000 processors arranged in a three dimensional array. In such a system, the average number of hops between a random pair of nodes is nk/3 = 20, where n denotes network dimension and k its radix. This yields an average round trip network latency of 55 cycles for an unloaded network, when memory latency and average packet size are taken into account. The fixed miss rate comprises first-time fetches of blocks into the cache, and the interference due to multiprocessor coherence invalidations.

Parameter                  Value
Memory latency             10 cycles
Network dimension n        3
Network radix k            20
Fixed miss rate            2%
Average packet size        4
Cache block size           16 bytes
Thread working set size    250 blocks
Cache size                 64 Kbytes

Table 4: Default system parameters.

Figure 5: Relative sizes of the cache, network and overhead components that affect processor utilization.

Figure 5 displays processor utilization as a function of the number of threads resident on the processor when context switching overhead is 10 cycles. The degree to which the cache, network, and overhead components impact overall processor utilization is also shown. The ideal curve shows the increase in processor utilization when both the cache miss rate and network contention correspond to that of a single process, and do not increase with the degree of multithreading p.

We see that as few as three processes yield close to 80% utilization for a ten-cycle context-switch overhead, which corresponds to our initial SPARC-based implementation of APRIL. This result is similar to that reported by Weber and Gupta [26] for coarse-grain multithreaded processors. The main reason a low degree of multithreading is sufficient is that context switches are forced only on cache misses, which are expected to happen infrequently. The marginal benefits of additional processes are seen to decrease due to network and cache interference.

Why is utilization limited to a maximum of about 0.80 despite an ample supply of threads? The reason is that available network bandwidth limits the maximum rate at which computation can proceed. When available network bandwidth is used up, adding more processes will not improve processor utilization. On the contrary, more processes will degrade performance due to increased cache interference. In such a situation, for better system performance, effort is best spent in increasing the network bandwidth, or in reducing the bandwidth requirement of each thread.

The relatively large ten-cycle context switch overhead does not significantly impact performance for the default set of parameters because utilization depends on the product of context switching frequency and switching overhead, and the switching frequency is expected
to be small in a cache-based system. This observation is important because it allows a simpler processor implementation, and is exploited in the design of APRIL.

A multithreaded processor requires larger caches to sustain the working sets of multiple processes, although cache interference is mitigated if the processes share code and data. For the default parameter set, we found that caches greater than 64 Kbytes comfortably sustain the working sets of four processes. Smaller caches suffer more interference and reduce the benefits of multithreading.

9 Conclusions

We described the architecture of APRIL, a coarse-grain multithreaded processor to be used in a cache-coherent multiprocessor called ALEWIFE. By rapidly switching to an alternate task, APRIL can hide communication and synchronization delays and achieve high processor utilization. The processor makes effective use of available network bandwidth because it is rarely idle. APRIL provides support for fine-grain tasking and detection of futures. It achieves high single-thread performance by executing instructions from a given task until an exception condition like a synchronization fault or remote memory operation occurs. Coherent caches reduce the context switch rate to approximately once every 50-100 cycles. Therefore context switch overheads in the 4-10 cycle range are tolerable, significantly simplifying processor design. By providing hardware support only for performance-critical operations and migrating other functionality into the compiler and run-time system, we were able to simplify the processor design even further.

We described a SPARC-based implementation of APRIL that uses the register windows of SPARC as task frames for multiple threads. A processor simulator and an APRIL compiler and run-time system have been written. The SPARC-based implementation of APRIL switches contexts in 11 cycles. APRIL and its associated run-time system practically eliminate the overhead of fine-grain task creation and detection of futures. For Mul-T, the overhead reduces from 100% on an Encore Multimax-based implementation to under 5% on APRIL. We evaluated the scalability of multithreaded processors in large-scale parallel machines using an analytical model. For typical system parameters and a 10 cycle context-switch overhead, the processor can achieve close to 80% utilization with 3 processor-resident threads.

10 Acknowledgements

We would like to acknowledge the contributions of the members of the ALEWIFE research group. In particular, Dan Nussbaum was partly responsible for the processor simulator and run-time system and was the source of a gamut of ideas, David Chaiken wrote the cache simulator, Kirk Johnson supplied the benchmarks, and Gino Maa and Sue Lee wrote the network simulator. We appreciate help from Gene Hill, Mark Perry, and Jim Pena from LSI Logic Corporation for the SPARC-based implementation effort. Our design was influenced by Bert Halstead's work on multithreaded processors. Our research benefited significantly from discussions with Bert Halstead, Tom Knight, Greg Papadopoulos, Juan Loaiza, Bill Dally, Steve Ward, Rishiyur Nikhil, Arvind, and John Hennessy. Beng-Hong Lim is partly supported by an Analog Devices Fellowship. The research reported in this paper is funded by DARPA contract # N00014-87-K-0825 and by grants from the Sloan Foundation and IBM.

References

[1] Anant Agarwal. Performance Tradeoffs in Multithreaded Processors. MIT VLSI Memo 89-566, Laboratory for Computer Science, September 1989.

[2] Arvind and Robert A. Iannucci. Two Fundamental Issues in Multiprocessing. Technical Report TM 330, MIT, Laboratory for Computer Science, October 1987.

[3] Arvind, R. S. Nikhil, and K. K. Pingali. I-Structures: Data Structures for Parallel Computing. In Proceedings of the Workshop on Graph Reduction (Springer-Verlag Lecture Notes in Computer Science 279), September/October 1986.

[4] William C. Athas and Charles L. Seitz. Multicomputers: Message-Passing Concurrent Computers. Computer, 21(8):9-24, August 1988.

[5] David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache-Coherence in Large-Scale Multiprocessors. June 1990. To appear in IEEE Computer.

[6] W. J. Dally et al. Architecture of a Message-Driven Processor. In Proceedings of the 14th Annual Symposium on Computer Architecture, pages 189-196, IEEE, New York, June 1987.
[7] Michel Dubois, Christoph Scheurich, and Faye A. Briggs. Synchronization, coherence, and event ordering in multiprocessors. IEEE Computer, 9-21, February 1988.

[8] R. H. Halstead and T. Fujita. MASA: A Multithreaded Processor Architecture for Parallel Symbolic Computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 443-451, IEEE, New York, June 1988.

[9] Robert H. Halstead. Multilisp: A Language for Parallel Symbolic Computation. ACM Transactions on Programming Languages and Systems, 7(4):501-539, October 1985.

[10] J. L. Hennessy and T. R. Gross. Postpass Code Optimization of Pipeline Constraints. ACM Transactions on Programming Languages and Systems, 5(3):422-448, July 1983.

[11] J. L. Hennessy et al. Hardware/Software Tradeoffs for Increased Performance. In Proc. SIGARCH/SIGPLAN Symp. Architectural Support for Programming Languages and Operating Systems, pages 2-11, ACM, Palo Alto, CA, March 1982.

[12] M. D. Hill et al. Design Decisions in SPUR. Computer, 19(10):8-22, November 1986.

[13] Mark Horowitz et al. A 32-Bit Microprocessor with 2K-Byte On-Chip Instruction Cache. IEEE Journal of Solid-State Circuits, October 1987.

[14] R. A. Iannucci. Toward a Dataflow/von Neumann Hybrid Architecture. In Proceedings of the 15th Annual International Symposium on Computer Architecture, Hawaii, June 1988.

[15] W. J. Kaminsky and E. S. Davidson. Developing a Multiple-Instruction-Stream Single-Chip Processor. Computer, 66-78, December 1979.

[16] D. Kranz, R. Halstead, and E. Mohr. Mul-T: A High-Performance Parallel Lisp. In Proceedings of SIGPLAN '89, Symposium on Programming Languages Design and Implementation, June 1989.

[17] Eric Mohr, David A. Kranz, and Robert H. Halstead. Lazy task creation: a technique for increasing the granularity of parallel tasks. In Proceedings of Symposium on Lisp and Functional Programming, June 1990. To appear.

[18] Nigel P. Topham, Amos Omondi, and Roland N. Ibbett. Context Flow: An Alternative to Conventional Pipelined Architectures. The Journal of Supercomputing, 2(1):29-53, 1988.

[19] Rishiyur S. Nikhil and Arvind. Can Dataflow Subsume von Neumann Computing? In Proceedings 16th Annual International Symposium on Computer Architecture, IEEE, New York, June 1989.

[20] Charles L. Seitz. Concurrent VLSI Architectures. IEEE Transactions on Computers, C-33(12), December 1984.

[21] Charles L. Seitz. The Cosmic Cube. CACM, 28(1):22-33, January 1985.

[22] B. J. Smith. A Pipelined, Shared Resource MIMD Computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6-8, 1978.

[23] SPARC Architecture Manual. SUN Microsystems, Mountain View, California, 1988.

[24] P. A. Steenkiste and J. L. Hennessy. A Simple Interprocedural Register Allocation Algorithm and Its Effectiveness for LISP. ACM Transactions on Programming Languages and Systems, 11(1):1-32, January 1989.

[25] David W. Wall. Global Register Allocation at Link Time. In SIGPLAN '86, Conference on Compiler Construction, June 1986.

[26] Wolf-Dietrich Weber and Anoop Gupta. Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results. In Proceedings 16th Annual International Symposium on Computer Architecture, IEEE, New York, June 1989.

[27] Colin Whitby-Strevens. The Transputer. In Proceedings 12th Annual International Symposium on Computer Architecture, IEEE, New York, June 1985.