
XMT-M: A Scalable Decentralized Processor

Efraim Berkovich, Joseph Nuzman, Manoj Franklin, Bruce Jacob, and Uzi Vishkin

Department of Electrical and Computer Engineering and
University of Maryland Institute for Advanced Computer Studies (UMIACS)
University of Maryland, College Park, MD

Abstract

A defining challenge for research in computer science and engineering has been the ongoing quest for reducing the completion time of a single computation task. Even outside the parallel processing communities, there is little doubt that the key to further progress in this quest is to do parallel processing of some kind. A recently proposed parallel processing framework that spans the entire spectrum from parallel algorithms to architecture to implementation is the explicit multi-threading (XMT) framework. This framework provides: (i) simple and natural parallel algorithms for essentially every general-purpose application, including notoriously difficult irregular integer applications, and (ii) a multi-threaded programming model for these algorithms which allows an independence-of-order semantics: every thread can proceed at its own speed, independent of other concurrent threads. To the extent possible, the XMT framework uses established ideas in parallel processing.

This paper presents XMT-M, a microarchitecture implementation of the XMT model that is possible with current technology. XMT-M offers an engineering design point that addresses four concerns: buildability, programmability, performance, and backward compatibility. The XMT-M hardware is geared to execute multiple threads in parallel on a single chip: relying on very few new gadgets, it can execute parallel threads without busy-waits. Existing code can be run on XMT-M as a single thread without any modifications, thereby providing backward compatibility for commercial acceptance. Simulation-based studies of XMT-M demonstrate considerable improvements in performance relative to the best serial processor, even for small (and therefore practical) input sizes.

Keywords: Fine-grained SPMD, independence of order semantics, instruction-level parallelism (ILP), no-busy-wait finite state machines, parallel algorithms, prefix-sum, and spawn-join.

1 Introduction

The coming years promise to be exciting ones in the area of computer architecture. Continued scaling of submicron technology will give us orders-of-magnitude increases in on-chip hardware resources. Even by conservative estimates, a single chip will have a billion transistors in a few years. Exploiting parallelism in a big way is a natural way to translate this increase in transistor count into completing individual tasks faster.

Parallelism has traditionally been exploited at both coarse-grained and fine-grained levels. Emphasizing the buildable in the short term, traditional techniques targeting coarse-grained parallelism have focused primarily on MPPs (massively parallel processors). Although MPPs provide the strongest available machines for some time-critical applications, they have had very little impact on the mainstream computer market. Most computers today are uniprocessors, and even large servers have only modest numbers of processors. A recent report from the President's Information Technology Advisory Committee (PITAC) [19] has acknowledged the importance and difficulty of achieving scalable application performance on today's parallel machines. According to the report, "there is substantive evidence that current scalable parallel architectures are not well suited for a number of important applications, especially those where the computations are highly irregular or those where huge quantities of data must be transferred from memory to support the calculation."

The commodity microprocessor industry has traditionally looked to fine-grained, or instruction-level, parallelism (ILP) for improving performance, with sophisticated microarchitectural techniques (such as pipelining, branch prediction, out-of-order execution, and superscalar execution) and sophisticated compiler optimizations, but with little help from programmers. Such hardware-centered techniques appear to have scalability problems in the submicron technology era and already appear to be running out of steam. Compiler-centered techniques are also handicapped, primarily due to the artificial dependencies introduced by serial programming.

On analyzing this scenario, it becomes apparent that the huge investment in serial software has forced programmers to hide most of the parallelism present in an application, by expressing the algorithm in a serial form and delegating it to the compiler and the hardware to re-extract a part of that hidden parallelism. The result has been that both hardware complexity and compiler complexity have been increasing monotonically, with a less satisfying improvement in performance. However, we are reaching a point in time when such evolutionary approaches can no longer bear much fruit, because of increasing complexity and fast-approaching physical limits. According to a recent position paper by Dally and Lacy [9]: "over the past decades the increased density of VLSI chips was applied to close the gap between microprocessors and high-end CPUs. Today this gap is fully closed and adding devices to uniprocessors is well beyond the point of diminishing returns."

To get significant increases in computing power, a radically different approach may be needed. One such approach is to "set free" the crippled programmers, so that they are not forced to suppress the parallelism they observe and are instead allowed to explicitly specify the parallelism. Textbooks on parallel computing attest to the many great ideas that the field has developed over the years, although some of the ideas were ahead of the implementation technology and are still waiting to be put to practical use. Culler and Singh, in their recent book on parallel computer architecture [8], mention under the title "Potential Breakthroughs": "A breakthrough may come from architecture if we can somehow design machines in a cost-effective way that makes it much less important for a programmer to worry about data locality and communication; that is, to truly design a machine that can look to the programmer like a PRAM." The recently proposed explicit multi-threading (XMT) framework [29] was influenced by a hope that this can be done.

We view ILP as the main success story of parallelism thus far, as it was adopted in a big way in the commercial world for reducing the completion time of general-purpose applications. XMT aspires to expand the "ILP parallelism bridgehead" with the "ground forces" of algorithm-level parallelism, which is guided by a sound theoretical foundation, by letting programmers express both fine-grained and coarse-grained parallelism in a natural way.¹

¹ Designed for reducing data access, communication, and synchronization costs for current multiprocessors, there has been the parallel programming methodology described in [8]. There has also been a related evolutionary approach that lets programmers express some of the coarse-grained parallelism with the use of heavyweight "forks" (carried out by the operating system) and lightweight "threads" (using library functions) to be run on multiprocessors. However, it has not yet been demonstrated that general-purpose applications could benefit much from these techniques; two concrete but pointed examples are breadth-first search on graphs and searching directed acyclic graphs, and, more generally, irregular integer applications of the kind taught in standard Computer Science algorithms and data-structures courses.

The XMT framework also permits decentralized and scalable processors with reduced hardware complexity. Decentralization is very important because, in the future, wire delays will become the dominant factor in chip performance. By wire delays we mean the on-chip delay of connections between gates. The Semiconductor Industry Association estimates that within a decade only a fraction of a chip will be reachable in a clock cycle [31]. Microarchitectures will have to use decentralization techniques to tolerate long on-chip communication latencies, i.e., localize communication so as to make infrequent use of cross-chip signal propagation.

The objective of this paper is to explore a decentralized microarchitecture implementation for the XMT paradigm. The highlights of the investigated microarchitecture, called XMT-M, are: (i) decentralized processing elements that can execute multiple threads in parallel, (ii) independence of order among the concurrent threads, (iii) a relaxed memory consistency model, and (iv) buildability with current technology.

The rest of this paper is organized as follows. Section 2 provides background material on the XMT framework. Section 3 describes XMT-M, a realizable microarchitecture implementation of the XMT paradigm. Section 4 presents an experimental analysis of XMT-M's performance, conducted with a detailed simulator; in particular, it shows that even for small input sizes the XMT processor's performance is significantly better than that of the best serial processor that has comparable hardware. Section 5 discusses related work and highlights the differences with the XMT approach. Finally, Section 6 presents the major conclusions and directions for future work.

2 Explicit Multi-Threading (XMT)

The XMT framework is grounded in a rather ambitious vision, aimed at on-chip general-purpose parallel computing; that is, presenting a competitive alternative to the state-of-the-art serial processors and their successors in the next decade. Towards that end, the broad XMT framework spans the entire spectrum from algorithms through architecture to implementation. This section provides a brief description of the XMT framework.

The XMT Programming Model

The programming model underlying the XMT framework is parallel in nature, as opposed to the serial model used in most computers. To be specific, XMT uses an arbitrary CRCW (concurrent read, concurrent write) SPMD (single program, multiple data) programming model. SPMD implies concurrent threads that execute the same code on different data; it is a more implementable extension of the classical PRAM model.² The XMT threads can be moderately long, providing some locality of reference. Figure 1 illustrates the XMT programming model. The virtual threads, initiated by the Spawn and terminated by the Join, have the same code. At run-time, different threads may have different lengths, based on the control flow paths taken through them. The arbitrary CRCW aspect of the model dictates that concurrent writes into the same memory location result in an arbitrary one among these writes succeeding; that is, one thread writes into the memory location while the others bypass to their next instruction. This permits each thread to progress at its own speed, from its initiating Spawn to its terminating Join, without ever having to wait for other threads; that is, no thread ever does a busy-wait for another thread. Inter-thread synchronization occurs only at the Joins. We say that the XMT programming model inherits the independence of order semantics (IOS) of the arbitrary CRCW and builds on it. A major advantage of using this SPMD model is that it is an extension of the classical PRAM model, for which a vast body of parallel algorithms is available in the literature.

² It is important to note that traditional multiprocessors exploit coarse-grain parallelism with the use of very long threads, which provide even more locality. However, programmers find it more difficult to reason about coarse-grain parallelism than about fine-grain parallelism. Thus, there is a tradeoff between thread length (which affects locality of reference) and programmability.

[Figure 1: Parallelism profile for the XMT model. Serial execution alternates with parallel execution: a Spawn initiates the threads and a Join terminates them.]

XMT High-Level Language Level

The XMT high-level language level includes SPMD extensions to standard serial high-level languages. The extension includes Spawn and Prefix-sum statements. The prefix-sum statement is used to synchronize between threads. Prefix-sum is similar to the fetch-and-increment used in the NYU Ultracomputer [16], and has the following semantics:

    Prefix-sum(B, R): (i) B = B + R, and (ii) R = initial value of B.

The primitive by itself may not be very interesting, but it happens to be very useful when several threads simultaneously perform a prefix-sum against a common base. In such a situation, the multiple prefix-sum operations can be combined by the hardware to form a single multi-operand prefix-sum operation. Because each prefix-sum is atomic, each thread will receive a different value in its local storage R (due to independence of order semantics, any order is acceptable). The prefix-sum command can be used for implementing functions such as (i) load balancing (parallel implementation of a queue/stack) and (ii) inter-thread synchronization.
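As a minimal illustration, the semantics can be modeled by the following sequential C function. This is a sketch of the meaning of one call only; in XMT-M, the hardware combines many simultaneous calls into one atomic multi-operand operation, which a serial function cannot show.

    /* Sequential model of the atomic Prefix-sum primitive:
     * (i)  B is incremented by R;
     * (ii) R receives the initial value of B. */
    void PS(int *B, int *R)
    {
        int old = *B;    /* remember the initial value of B */
        *B = old + *R;   /* (i)  B = B + R                  */
        *R = old;        /* (ii) R = initial value of B     */
    }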

We shall use a simple example to clarify the XMT programming model and the use of prefix-sum. Suppose we have an array of integers, A, and wish to compact the array by copying all the non-zero values to another array, B, in an arbitrary order. Figure 2 gives XMT high-level language code to do this array compaction.

The XMT Instruction Set Architecture (ISA)

For the most part, the XMT ISA is the same as any standard ISA. Three new instructions are added to support the XMT programming model: a Spawn instruction, a Join instruction, and a Prefix-sum instruction. The Spawn instruction and Join instruction are added to enable transitions back and forth from the serial mode to the parallel mode. The prefix-sum (PS) instruction is introduced as a primitive for coordinating parallel threads, among other uses. It is important to note that existing serial code forms legal single-thread code for an XMT processor; this phenomenon helps to provide object code compatibility for existing code.

    int base = 0;           # a global variable, shared by all threads
    Spawn(n)                # spawn threads with ID 0 through n-1
    {
        int t_id;           # thread ID number: between 0 and n-1
        int e = 1;          # local variable, private to each thread
        if (A[t_id])        # check if array element is non-zero
        {
            PS(base, e);    # perform prefix-sum to get new value for e
            B[e] = A[t_id]; # copy element from array A to B (index e)
        }
    }                       # implied Join
    n = base;

Figure 2: The array compaction problem. The non-zero values in array A are copied to array B in an arbitrary order; the code above is an XMT high-level program that solves it.

Register Model

The XMT ISA specifies two types of registers: global and local. The global registers are visible to all threads of a spawn-join pair and to the serial thread. The local registers are specific to each virtual thread, and are visible only to the relevant thread. To provide compatibility with existing binaries, assemblers, compiler development tools, and operating systems, one can simply divide the register space into two partitions, so that the lower partition refers to the global registers and the upper partition refers to the local registers. For instance, a reference in MIPS assembly language to a register in the lower partition implies a global register access, while a reference to a register in the upper partition implies access to a local register.
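As an illustration only, such a partition can be written down as follows; the split point below is hypothetical, since the concrete register numbers are not specified here.

    /* Hypothetical register-space partition (illustrative; the actual
     * boundary in the XMT ISA is not specified in this text). */
    #define GLOBAL_REG_COUNT 16                        /* assumed split point */
    #define IS_GLOBAL_REG(r) ((r) < GLOBAL_REG_COUNT)  /* lower partition     */
    #define IS_LOCAL_REG(r)  ((r) >= GLOBAL_REG_COUNT) /* upper partition     */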

Memory Model

The XMT framework supports a shared memory model; that is, all concurrent threads see a common shared memory address space.

Memory Consistency Model. The XMT memory model supports a weak consistency model between the concurrent threads of a Spawn-Join pair. This stems from XMT's independence of order semantics: memory reads and writes from concurrent threads are generally not ordered. When an inter-thread ordering is required, a prefix-sum instruction is used. The following example code illustrates this (register numbers are illustrative):

    sw   $t0, A($t1)       # write local value to A[i], based on the thread ID
    psi  $g0, $t2          # use PS to coordinate thread accesses; $g0 is initialized to 0
    beq  $t2, $zero, DONE  # if the PS result is zero, we're done
    xori $t1, $t1, 1       # if i is even, i = i+1; else i = i-1
    lw   $t3, A($t1)       # load A[i] into $t3

The above code can result in the following sequence of events:

    Thread 0                          Thread 1
    Write to A[0]                     Write to A[1]
    PS operation, gets 0              PS operation, gets 1
    PS result is 0, so go to DONE     PS result is 1, so continue
                                      Read A[0]

As per the semantics of the program, values written in one thread before the prefix-sum must be visible to the remaining threads once they execute their prefix-sum. Thus, the other thread is assured of getting the correct value of A[i]. We call this type of prefix-sum a "gatekeeper" prefix-sum. Gatekeeper prefix-sums can be identified at compile time.
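At the XMT high-level-language level, the same gatekeeper pattern can be sketched in the style of Figure 2. In the sketch below, compute() and use() are hypothetical stand-ins, and the pairing of two threads through the XOR of the thread ID mirrors the assembly example above.

    int g = 0;                    # gatekeeper base, shared; initialized to 0
    Spawn(2)                      # two threads, with IDs 0 and 1
    {
        int t_id;                 # thread ID
        int e = 1;
        A[t_id] = compute(t_id);  # write a local value to A[t_id]
        PS(g, e);                 # gatekeeper prefix-sum: e gets 0 or 1
        if (e != 0)               # the thread ordered second is assured
            use(A[t_id ^ 1]);     # of seeing its partner's earlier write
    }                             # implied Join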

A Design Principle Guiding XMT: No-busy-waits Finite State Machines

The XMT framework is based on an interesting design principle, or "design ideal", called no-busy-waits finite state machines (NBW FSMs). According to that ideal, progress is achieved by sharing the work among FSMs; however, a delayed FSM cannot suspend the progress of other FSMs. The latter means that no FSM can force other FSMs to waste cycles by doing busy-waiting. The NBW FSMs ideal guided the design of the XMT high-level programming language and assembly language. The FSMs that are reflected in these languages are virtual. The IOS in the assembly language allows each thread to progress at its own speed, from its initiating Spawn command to its terminating Join command, without having to ever wait for other threads. Synchronization occurs only at the Join command, where all threads must terminate before the execution can proceed as a serial thread. That is, in line with the NBW FSMs ideal, a virtual thread avoids busy-waits by terminating itself ("committing suicide") just before the thread could have run into a busy-wait situation. As we will see in Section 3, at the microarchitecture level we will only be able to alleviate violations of the NBW FSMs principle, but not to completely avoid them.

3 Decentralized XMT Microarchitecture

The XMT framework can be implemented in hardware in a variety of ways. This section details one possible microarchitecture implementation for XMT. This implementation is based on current technology, and is intended to offer architectural insight into XMT, not to stand as the sole microarchitecture choice for the XMT framework.

To begin with, we anticipate that much of the communication will be intra-thread. This suggests a decentralized organization with multiple processing elements, or thread control units (TCUs), to execute concurrent threads. Each TCU has its own fetch unit, decode unit, register file, and dispatch buffer. The TCUs are independent in that they do not perform cross-checking of dependencies while executing the instructions of their threads. In order to maximize the use of certain resources, several TCUs are grouped together as a cluster, as shown in Figure 3. The TCUs in a cluster share a common per-cluster L1 instruction cache, a common set of functional units, and a common per-cluster L1 data cache.

How many TCUs should there be in a cluster? The answer depends on the interconnect wire delays within a cluster, i.e., the wire delays from the shared L1 instruction cache, the functional units, and the L1 data cache to the TCUs. If these data transfers take multiple clock cycles, then the benefits of clustering the TCUs together may become counterproductive.

Figure 4 illustrates the organization of multiple clusters within the XMT-M processor, and shows the inter-cluster communication channels. These channels take multiple cycles for a data transfer. The operations requiring inter-cluster communication are: (i) prefix-sum, (ii) spawn/join, (iii) global register file access, and (iv) memory access.

Prefix-sum Implementation

The prefix-sum mechanism goes through a central facility that can compute results from n threads in O(1) time, not O(n) time. There is a dedicated prefix-sum bus for broadcasting the prefix-sum results. Clusters have a point-to-point connection to the global prefix-sum unit, to which they send their prefix-sum requests for processing.

[Figure 3: An XMT-M cluster. Each XMT-M cluster includes a shared instruction cache, a shared data cache, multiple independent TCUs (each with its own IF/ID stages and dispatch buffer), a number of shared functional units, a register file, and an I/F unit to the memory interface.]

[Figure 4: An XMT-M processor. Independent clusters connect, through per-cluster I/F units and the memory interface, to the level-2 instruction and data caches, and to the global register file, the prefix-sum unit, and the spawn/join unit. Communication paths in this diagram require multiple cycles.]

All prefix-sum requests with a common base that arrive at the central prefix-sum unit in the same time slice are processed simultaneously, and the results are sent out simultaneously. The whole process is pipelined, so that a number of different prefix-sum operations can be in flight at the same time. The base register contents and register identifier are fired out on a shared bus as soon as the first request for a particular base arrives. The I/F (interface) unit in each cluster listens on the shared bus for these broadcasts.

First, we notice that a prefix-sum request that contributes zero to the base is equivalent to a read of the base with no specified ordering. Thus, these requests can be handled locally at the cluster by reading a local copy of the base register. Non-zero prefix-sum requests from the same cluster using the same base can be combined into a single request to the global prefix-sum unit. The global unit groups these cluster requests into batches of the same base, performs a prefix-sum across the batch, and broadcasts the results on the prefix-sum bus. Each cluster listens on the bus and derives its range of values within that cycle's batch. The cluster also updates its local copy of the base register. Each cluster assigns unique values from its prefix-sum range to its local prefix-sum requests. A prefix-sum hardware implementation that avoids any serialization is described in [28]. We also note that, in the case of 1-bit prefix-sum requests, the number of wires necessary for the shared bus is c log₂(n/c) + log₂ c, where c is the number of clusters and n is the number of TCUs.
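As an illustrative software model of these cluster-side steps (the function and variable names are ours, and the real combining is done by hardware within a time slice), the merging of local requests and the later hand-out of unique values can be sketched in C:

    /* Step 1: merge the nonzero PS increments of local TCUs into a single
     * request to the global prefix-sum unit. incr[0..k-1] are the local
     * increments gathered in one time slice. */
    int cluster_total(const int *incr, int k)
    {
        int total = 0;
        for (int i = 0; i < k; i++)
            total += incr[i];
        return total;                 /* single increment sent upstream */
    }

    /* Step 2: after the global unit broadcasts this cluster's range start
     * for the batch, hand out unique values to the local requesters. */
    void assign_results(const int *incr, int k, int range_start, int *result)
    {
        int next = range_start;
        for (int i = 0; i < k; i++) {
            result[i] = next;         /* value returned to requester i */
            next += incr[i];
        }
    }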

Spawn/Join Implementation

The XMT-M processor can be in one of two modes at any given time: serial or parallel. In the serial mode, only the first TCU is active. Execution of a Spawn instruction causes a transition from the serial mode into the parallel mode. The spawning is performed by the Spawn Control Unit (SCU). The SCU does two main things: (i) it activates the TCUs whenever a Spawn is executed, and (ii) it discovers when all virtual threads have been executed, so that the processor can resume serial mode. The SCU broadcasts the Spawn command and the number of threads, n, on a bus that connects to the TCUs. Each TCU, upon receiving the Spawn command, executes the flowchart given in Figure 5.

[Figure 5: Flowchart depicting the activities of a TCU upon receiving a Spawn(n) command: set t_id = TCU_id; if t_id exceeds the number of threads, initiate a Join; otherwise, execute thread t_id, then do a PS to get a new t_id and repeat the test.]

If the number of virtual threads is less than the number of TCUs, t, then all of the threads are initiated at the same time. If the number of virtual threads is more than the number of TCUs, then the first t virtual threads are initiated in the t TCUs. The remaining threads are initiated as and when individual TCUs become free. Notice that it would have been straightforward to spawn the second set of threads (threads with label t and higher) only after all of the TCUs have completed the execution of the first set of threads assigned to them. However, if some TCUs terminate and wait while others continue, a gross violation of the NBW-FSMs ideal occurs. To alleviate this violation, we have devised a hardware-based scheme that makes use of the IOS between threads. As and when a TCU completes the execution of its thread, it sends a prefix-sum request to the SCU. The SCU performs the prefix-sum operation and sends a new t_id to the waiting TCUs. Each of the waiting TCUs makes sure that its t_id is less than n before proceeding to execute the thread; otherwise, the TCU initiates a join operation. This type of thread allocation continues until all virtual threads of a Spawn instruction have been initiated.
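A C sketch of this per-TCU allocation loop (modeled on the flowchart of Figure 5) follows. Here execute_thread(), initiate_join(), and the shared counter next_id are hypothetical names for the hardware actions and SCU state described above, and PS is the C model given earlier.

    /* Behavior of one TCU after a Spawn(n) broadcast (cf. Figure 5).
     * next_id models the SCU-side thread-ID counter, initialized to the
     * number of TCUs, t. */
    extern int next_id;                /* shared counter, held at the SCU */

    void tcu_on_spawn(int TCU_id, int n)
    {
        int t_id = TCU_id;             /* first round: thread ID = TCU ID */
        while (t_id < n) {
            execute_thread(t_id);      /* run virtual thread t_id to completion */
            int e = 1;
            PS(&next_id, &e);          /* atomically grab the next thread ID */
            t_id = e;
        }
        initiate_join();               /* no threads left: terminate; no busy-wait */
    }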

In the parallel mode, we also take advantage of the SPMD-style instruction code in the following way: the code is broadcast on the bus, so that all TCUs can simultaneously get the code to be executed.

Global Register Coordination

It is likely that for global register coordination we can use the same broadcast data bus as the prefix-sum unit. Because that bus's activity depends on the frequency of prefix-sum operations, we can use the extra capacity to broadcast global register values when they are written. Each local register file can keep the values of the global registers for use by the cluster functional units.

Whenever a thread writes to a global register, the new value is written to the local copy of the shared register, and then is sent out on the shared bus when the bus becomes available. To maintain coherence, the processor does not restart serial mode after a join until all register writes have been broadcast. Also, if one thread writes to a shared register and another thread (or threads) needs to read that value, those accesses are prioritized by using a gatekeeper prefix-sum. In such a situation, the thread that writes to the register would issue its prefix-sum request only after its write has gone out on the bus and is therefore visible to all the clusters. In this way, a delayed coherence can be maintained across all the global register copies, and the clusters can safely use their local global register values without fear that the register values are incorrect. Note that such a relaxed consistency model is possible among the register files because the XMT programming model allows it.

Memory System

The XMT framework supports a shared memory model; that is, all concurrent threads see a common shared memory address space. We can think of two alternatives for implementing the top portion of the memory hierarchy for such a system: a shared cache and distributed caches. The shared cache implementation has the advantage of not having to deal with issues such as cache coherency; however, its access time is likely to be higher because of interconnect demands. The distributed cache implementation permits each TCU or cluster to have a local cache, thereby providing faster access to the top portion of the memory hierarchy; however, it has to deal with the problem of maintaining coherency between the multiple caches. Further research is needed to determine which of the two options is best for the XMT framework for different technologies. For instance, if a high miss rate exists in the local caches, then each memory access is likely to be comparable in duration to an access of a shared cache. In that case, it makes sense to avoid the problem of cache coherence in the design and implement only a single shared cache. The emphasis in this paper on a decentralized architecture component and existing technology led us in the direction of local caches.

The memory system we investigate in this paper for XMT-M is as follows. Each cluster has a small level-1 cache. Multiple level-1 caches are connected together by an interconnect. The next level of the memory hierarchy consists of a large shared cache (the level-2 cache), which connects to main memory. A large number of pipelined memory requests can be pending at a time, as in the Tera processor [3]. The idea is to use an overabundance of memory requests at each level of the memory hierarchy to hide memory latency. The interconnect used to tie the caches together can be a crossbar, a shared bus, a ring, etc.

Cache Coherence

When a shared memory model is implemented in a distributed manner, maintaining a consistent view of the memory for all the processing elements is vital. The use of distributed caches necessitates implementing protocols for maintaining cache coherence. Cache coherence protocols come under two broad categories: invalidate-based and update-based. In a write-back write-invalidate coherence scheme, a processor doing a write waits to get access to the shared bus. It then broadcasts the write address. The other caches snoop the bus and invalidate that block if present. The writing processor then has exclusive access to that cache block, and keeps writing to the copy in its local cache. Another processor reading that same block will cause a read miss at its cache and, after getting access to the bus, will send a read request to memory. Because the processor with exclusive access to the block is snooping the bus, it will gain access to the bus, send the updated version of the block, and abort the access to memory. This type of protocol works well when there is not much data sharing.

In the analogous case in an update-based protocol, a processor sends a write request to its local cache, and the request gets broadcast to the local caches of all processors. Upon receiving the update, the local caches update the relevant block if present. Any processor that needs to read a memory location will get the value from its local cache or from the next level of the memory hierarchy. This type of protocol generally results in high bandwidth requirements because of using write-through caches.

[Figure 6: Cache coherence hazards for XMT. Case 1: a (serial) thread X writes to location A, a Spawn occurs, and a parallel thread Y reads from location A. Case 2: a parallel thread X writes to location A, a Join occurs, and a (serial or parallel) thread Y reads from location A. Case 3: a parallel thread X writes to location A, a gatekeeper prefix-sum occurs, and a parallel thread Y reads from location A.]

For the XMT-M memory system, we chose an update-based protocol, because the XMT memory consistency model, by virtue of its independence of order semantics (IOS), permits a relaxed update policy. Therefore, after a TCU sends a write request to its local cache, it can generally continue executing its thread without waiting for the write to be globally performed. However, there are three cases allowed by the programming model where this write-and-continue policy must be modified. These three "coherence hazards" are summarized in Figure 6. The first case is that writes from the serial thread must complete before the system initiates a spawn. The second case is that writes from spawned threads must complete before the system restarts the serial thread. The third case occurs when a prefix-sum instruction, and a branch based on the outcome of that prefix-sum instruction, separate a read from a write. Thus, the writing TCU will stall its spawn, join, or gatekeeper prefix-sum operation until it is sure that its write request has reached all other caches. The programming model guarantees that no thread will attempt to read the updated data before the next spawn, join, or gatekeeper prefix-sum operation executes. Thus, the caches are allowed to become inconsistent with each other for extended periods of time. This protocol may occasionally stall the writing thread; the stall time depends on how long it takes to broadcast a write to all the caches. With a ring-based interconnect, this stall time would be the time it takes for a write request to go around the ring, plus the time it takes for the local ring segment to become available; but even in that case, no busy-waits occur for the remaining threads.
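A C-like sketch of this policy follows, with hypothetical helper names; wait_writes_visible() stands in for the hardware stalling until all of this TCU's pending writes have reached every L1 cache:

    /* Write-and-continue, with stalls only at the three coherence
     * hazards of Figure 6 (hypothetical helper functions). */
    void store_word(unsigned addr, int val)
    {
        write_local_l1(addr, val);          /* update local cache and continue */
        broadcast_update_async(addr, val);  /* update propagates to other L1
                                               caches in the background        */
    }

    void before_spawn(void)          { wait_writes_visible(); }  /* case 1 */
    void before_serial_restart(void) { wait_writes_visible(); }  /* case 2 */
    void before_gatekeeper_ps(void)  { wait_writes_visible(); }  /* case 3 */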

4 Experimental Evaluation

The previous section presented a detailed description of a decentralized XMT microarchitecture. Next, we present a detailed simulation-based performance evaluation of XMT-M.

Experimental Framework

We have developed an XMT-M simulator for evaluating the performance potential of XMT-M, and the XMT framework in general. This simulator uses the SimpleScalar ISA [6], with four new instructions added: a Spawn, a Join, and two Prefix-sums. The register set is also expanded to have both global and local registers. The simulator accepts XMT assembly code and simulates its execution. All important features of the XMT-M system have been incorporated in the simulator.

Default XMT-M Configuration

• Processor configuration: three different configurations, with 2, 8, and 32 clusters respectively.

• Spawn/Join: dedicated PS unit for thread ID generation.

• Prefix-sum (PS): pipelined; a fixed cycle latency applies after a request is received, and the total PS execution time is that latency plus the inter-cluster communication delay.

• Cluster configuration: several TCUs per cluster, sharing a single-cycle ALU, a single-cycle branch unit, and a mult/div unit (multi-cycle multiply and divide; neither is pipelined); a load/store buffer with ports to the L1 d-cache; and a PS I/F slot.

• TCU pipeline: single-issue, in-order pipeline; stalls on branch.

• Cache configuration: multi-word blocks; d-caches are set-associative and i-caches are direct-mapped; d-caches are write-through, with no fetch on write-miss. The on-chip L2 d-cache is much larger than the per-cluster L1 d-caches, and likewise for the L2 i-cache relative to the L1 i-caches. Each L1 d-cache can send and receive memory requests every C cycles, where C is the number of clusters, which roughly corresponds to a ring topology.

• Main memory: fixed multi-cycle access latency; multi-word-per-cycle bandwidth; DRAM bank access conflicts are not modeled.

For these experiments, we model the memory hierarchy without specifying the exact interconnection structure, by setting a fixed latency for every memory request from a level-1 cache to get to the shared cache (level 2) and to the other level-1 caches. To justify this modeling, consider a ring topology for connecting the level-1 caches. Assume that at each node on the ring there is an automaton that will give priority to a request from another node. Each level-1 cache has a buffer to hold requests that originate from its local cluster until these requests can be sent. Therefore, once a request makes it out onto the ring, its maximum latency will depend on the time taken to go around the ring. In the worst case of all nodes sending requests all the time, each node will have to wait for its original request to come back around the ring before it can issue another request. Note that our fixed-latency modeling is therefore a somewhat conservative estimate for the processing of requests from the level-1 caches. For a relatively low number of nodes on the ring, as in the designs we simulated, the ring does not perform too badly when we have a relatively long main-memory access latency. However, for a higher number of nodes and/or shorter memory latencies, a different design would be a better choice.

Default Superscalar Configuration

• Processor configuration: wide-issue superscalar with out-of-order execution; an instruction window and a load/store queue; multiple integer ALUs and multiply units; a bimodal branch predictor.

• Cache configuration: separate i-cache and d-cache, both of which have single-cycle access and multi-word blocks. The d-cache is the same as XMT-M's L2 d-cache; the i-cache has several times as many blocks as XMT-M's L2 i-cache.

Benchmarks and Performance Metrics

For benchmarks, we use a collection of code snippets with varying degrees of inherent parallelism, as described in Figure 7. We are currently restricted to using snippets in this paper because no big application exists yet for the XMT programming paradigm; we believe that performance on these snippets provides reasonable intuition into the performance and scalability of XMT-M. Evaluating XMT-M's performance for entire applications can be done only after rewriting the entire applications in the XMT programming model and then compiling those parallel programs to XMT code.

Figure 7: Benchmarks.

• linkedlist (serial): Traverse a linked list randomly dispersed through memory and find the sum of the list item data values. This application is not one which we know how to parallelize, so it is implemented with a serial algorithm. Input sizes: 50 item list spaced in 200 words, 500 in 2K words, 5K in 20K words, 250K in 1M words.

• stream (embarrassingly parallel): Based on the STREAM benchmark [26]: we sequentially read arrays, perform some short calculations on the values, and write the results to another array. Since each iteration of the loop is independent, parallelization of execution is obvious. In the superscalar domain, one approach for speeding up this code is loop unrolling; we do that for the SimpleScalar version. Input sizes: 50, 500, 5K, and 250K item arrays.

• arrcomp (interacting parallel): Compacting an array: we take a sparse array and rewrite it into a compact form. This application requires keeping a running count of the next available location in the new array. Two variants of arrcomp were simulated: arrcomp_d, which just reads the original array (regular memory access), and arrcomp_i, which uses indirection through another array (irregular memory access). Input sizes: 50, 500, 5K, and 250K item arrays (uncompacted arrays are 1/4 full).

• max (more frequently interacting parallel): Find the maximum value of a list. In the serial case, we read through the list, keeping a running maximum. For the parallel case, we choose a synchronous max-finding scheme: a balanced binary tree is formed, where a node of the tree will have the result of a maximum operation on its two child nodes, and the root of the tree will have the maximum of the list. The algorithm proceeds from leaves to root, synchronizing after every level in the tree. The threads are very short, and there are log(n) spawn-joins. Input sizes: 50, 500, 5K, and 250K item arrays.

• listsort (sorting): Unraveling a linked list of known length which is packed within an array. This is a version of the problem called "list-ranking". This application is useful for managing linked-list free space in OSes [25]. In the serial algorithm, we traverse the list and rewrite it in the proper order. For the XMT version, we use two algorithms: (1) Wyllie's pointer-jumping algorithm [18] for the 50- and 500-sized inputs, and (2) the no-cut coin-tossing algorithm for the 5K- and 250K-sized inputs. The work in [10] presented a discussion of the various list-ranking algorithms on XMT. Input sizes: 50, 500, 5K, and 250K item arrays.

• integersort: A variant of radix-sort. It sorts integers from a range of values by applying bin-sort in iterations for a smaller range. For speed-up evaluations, one would wish to compare integersort with the fastest serial sorting algorithms, and not only serial radix-sort, as we did; however, the literature implies that for some memory architectures radix-sort is fastest [1], while for others other sorting routines are fastest [23]. Input sizes: 50, 500, 5K, and 250K item arrays.

We recognize the importance of eventually carrying out studies with entire applications.

A compiler-like translation from high-level to optimized assembly code is done manually, for lack of an XMT compiler. To have a fair basis for comparison, the serial versions of the applications are also generated by hand, and are optimized by using techniques such as loop unrolling.

We use small input data sizes to illustrate that the XMT-M processor can achieve better performance even for small input sizes. When comparing performance against superscalar processors, the metric used is speedup, obtained by dividing the number of execution cycles taken by the superscalar processor by the number of cycles taken by XMT-M.

XMT-M Performance

In the first set of experiments, we measure the number of cycles taken by different XMT-M configurations to execute the benchmarks with varying-size inputs. The numbers of cycles taken by the XMT-M configurations are compared against those taken by the default centralized wide-issue superscalar processor. Figure 8 presents the results obtained. The figure consists of four diagrams, corresponding to the four different input sizes. In each diagram, the X-axis represents the benchmarks. For each benchmark, three histograms are plotted, one for each XMT-M configuration: a 2-cluster XMT-M, an 8-cluster XMT-M, and a 32-cluster XMT-M. The Y-axis denotes the speedup of the XMT-M configurations over that of the default superscalar configuration. Notice that comparing the IPCs (instructions per cycle) for the two processors would not be meaningful, as they execute different programs.

[Figure 8: XMT-M speedups relative to serial computing. Four diagrams, for input sizes 50, 500, 5K, and 250K, plot the speedup over the superscalar processor for each benchmark (linkedlist, listsort, integersort, max, stream, arrcomp_d, arrcomp_i) under the 2-, 8-, and 32-cluster configurations.]

Let us look at the results of Figure 8 closely. For an input size of 50 (cf. the first diagram in Figure 8), the 8-cluster XMT-M configuration performs better than the other two XMT-M configurations: the 2-cluster XMT-M does not have enough parallel resources to harness inter-thread parallelism, and not much inter-thread parallelism is available with an input size of 50 to compensate for the high latencies of the 32-cluster XMT-M. The 8-cluster configuration performs substantially better than the centralized superscalar processor for three of the benchmarks, and performs slightly worse than the centralized superscalar processor for three of the benchmarks.

When the input size is increased to 500 (cf. the second diagram in Figure 8), the 32-cluster XMT-M, despite its increased cross-chip latency, begins outperforming the 8-cluster XMT-M for most of the benchmarks, because of the increased inter-thread parallelism available. Even the 2-cluster XMT-M configuration outperforms the centralized superscalar processor in all but one of the benchmarks.

When the input size is increased beyond 500, the XMT-M configurations continue to harness more parallelism, as might be expected. It is important to point out that these results have to be analyzed in the proper context: the XMT-M configurations are performing in-order execution and single-instruction issue in each TCU. Thus, the TCUs in an XMT-M processor are not performing functions such as branch prediction, dynamic scheduling, register renaming, memory address disambiguation, etc. In short, the XMT-M TCUs are not exploiting any intra-thread parallelism, except for the overlap obtained in the TCU pipeline. This is very important. First, it suggests that our speedup results are conservative. Second, in the future, as clock cycles continue to decrease, it becomes more and more difficult to perform centralized tasks such as dynamic scheduling and branch prediction within a limited cycle time. We would also like to point out that the XMT framework does not preclude the use of conventional techniques to extract intra-thread parallelism.

Effect of Cross-chip Communication Latency on XMT-M Performance

Our next set of experiments focuses on studying the effect of global cross-chip communication latency on XMT-M performance. Cross-chip communication refers to prefix-sum operations, spawn/join operations, global register accesses, and L2 cache accesses. Three different latencies were modeled for one-way inter-cluster communication: 1, 4, and 16 clock cycles. This corresponds to latencies of 2, 8, and 32 cycles, respectively, for functions that require two-way communication.

Figure 9 gives the results obtained in this study. Again, four diagrams are given, corresponding to the four different input sizes. Each diagram records three histogram bars for each benchmark. Thus, the X-axis represents the benchmarks and, for each benchmark, the three inter-cluster communication latencies. The Y-axis represents the normalized performance; normalization is done with respect to the performance of the unit-latency configuration. That is, the height of a bar represents the value obtained by dividing the execution time of the benchmark for a latency of 1 cycle by the execution time of the benchmark for the corresponding latency.

The results of Figure 9 indicate that, for all input sizes considered, the performance of XMT-M does not change when the inter-cluster communication latency is increased from 1 to 4. Even when this latency is increased to 16 cycles, XMT-M's performance remains the same, except for the two arrcomp benchmarks, for which there is a modest drop. These results indicate that XMT-M is somewhat resilient to increased cross-chip interconnect delays.

5 Related Work

First of all, we should emphasize that we have not invented parallel computing with XMT; we have tried to build on available technologies to the extent possible.

The relaxation in the synchrony of PRAM algorithms is related to the work of [7] and [15] on asynchronous PRAMs. The high-level language we used for XMT builds on Fork and its previous versions, developed at the University of Saarbrücken, Germany (see [20] and [21]). Basic insights concerning the use of a prefix-sum-like primitive go back to the Fetch-and-Add and Fetch-and-Increment primitives (cf. [14, 16]). Insights concerning nested Spawns rely on the work of Guy Blelloch's group [4, 5] and others.

[Figure 9: Effect of cross-chip global communication latency on XMT-M performance. Four diagrams, for input sizes 50, 500, 5K, and 250K, plot performance normalized to the unit-latency configuration for each benchmark under one-way inter-cluster latencies of 1, 4, and 16 cycles.]

The University of Wisconsin Multiscalar project [13] and the University of Washington simultaneous multithreading (SMT) project [27], with their use of multiple program counters, and the computer architecture literature on multithreading (see, for instance, [17]), have also been very useful. However, the way XMT proposes to attack the completion time of a single task, which is so central to XMT, makes XMT drastically different from these approaches; that is, the reliance on PRAM algorithms. Our experience has been that some knowledge of PRAM algorithms is a necessary condition for appreciating how big the difference is. Simultaneous multithreading [27] improves throughput by issuing instructions from several concurrently executing threads to multiple functional units each cycle. However, SMT does not appear to contribute towards designing a machine that looks to the programmer like a PRAM.

The similarity of XMT to the pioneering Tera Multi-Threaded architecture [3] is very limited. Tera focuses on supporting a plurality of threads by constantly switching among threads; it does not issue instructions from more than one thread in the same cycle, and this in turn would limit the relevance of our multi-operand and Spawn instructions for their architecture. Tera's multiprocessor was engineered to hide long latencies to memories for big applications; its design aims at a large number of processors, each running a large number of threads. XMT, on the other hand, is designed to provide competitive performance for even small input sizes, which makes it more practical in general, and for desktop applications in particular. To explain this, we observe that the higher bandwidth and lower latencies which are expected from on-chip designs in the billion-transistor era will allow a parallel processor to become competitive with its serial counterpart for a much smaller input size than for MPP paradigms such as Tera. But why does this observation hold true? The parallelism provided by a parallel algorithm increases as the input size increases; when implemented on an MPP, part of this algorithm parallelism is used for hiding system deficiencies such as latency; when latency is a minor problem, more algorithm parallelism can be applied directly to speedups. So parallel algorithms can become competitive for much smaller inputs.

Another strong argument in favor of XMT is that it is anticipated that explicit parallelism will provide for simpler hardware. Explicit ILP generally means static extraction of ILP, which allows for simpler hardware than that for dynamic (i.e., by hardware) extraction of ILP. Stating simpler hardware as the main motivation, industry has demonstrated its interest in explicit instruction-level parallelism (ILP) by way of heavily investing in it.

Lee and DeVries [24] investigate single-chip vector microprocessors as a way to exploit more parallelism with less hardware and reduced instruction bandwidth. They expose more instruction-level parallelism to the processor core by moving to a more explicitly parallel programming model. Similarly, the IRAM project [22] uses vector processing to increase parallelism; the project also aims to fully exploit the bandwidth possibilities of integrating DRAM onto the microprocessor. Complementing this research, out-of-order, multithreaded, and decoupled vector architectures have been proposed by Espasa, Valero, and Smith [11, 12] as methods to improve the performance of vector processing. The XMT architecture has much in common with the recent vector approaches, as it is solving many of the same problems in much the same way; the primary difference is in the use of an SPMD-style parallel algorithm programming model instead of a vector model.

6 Concluding Remarks

The von Neumann architecture has provided a principled engineering design point for computing systems for over half a century. It has been very robust and resilient, withstanding dramatic changes in all the relevant technologies. Parallel computing has long been considered an antithesis to the von Neumann architecture. However, the main success thus far in using parallelism for reducing the completion time of a single general-purpose task has been accomplished by what we called earlier the "ILP bridgehead", yet another von Neumann architecture. However, technology changes due to the evolving submicron technology are making it increasingly difficult to extract parallelism by conventional ILP techniques. In the past decades, the increased transistor budget of processor chips was applied to close the gap between microprocessors and high-end CPUs. As pointed out in [9], today this gap is fully closed, and using the additional transistor budget for uniprocessors is well beyond the point of diminishing returns. It is becoming increasingly important, therefore, to tap into explicit parallel processing. The XMT approach tries to jump into filling this gap: the no-busy-waits finite-state-machine principle, with its implied independence of order semantics (IOS), is designed to directly address the profound weakness of continued evolutionary development of the von Neumann approach.

The paper [29] introduced the broad XMT framework through a few bridging models, or internal interfaces. This paper contributes a first microarchitecture implementation, a significant step for the overall XMT effort. The research area of XMT, and SPMD in general, as it connects with parallel algorithms, offers an interesting design point: these algorithms map well to hardware support for multiple independent concurrent threads. This hardware support, in turn, maps well to the limitations of future submicron technologies (interconnect delays dominating the performance), which necessitate most of the communication to be localized. The XMT-M implementation highlights an important strength of XMT: it lends itself to decentralized implementation with almost no degradation in performance. An XMT-M processor can comprise numerous simple, independent, identical thread execution units, and the nature of the programming paradigm is such that inter-thread communication is highly structured and regular. This paper shows that such a microarchitecture can withstand high cross-chip communication delays. It can also take advantage of a large number of functional units, as opposed to traditional superscalar designs, which typically cannot make use of more than one or two dozen functional units.

XMT-M integrates several well-understood and widely-used programming primitives that are usually implemented in software; the novelty of the microarchitecture is the integration of these primitives in a single-chip environment, which offers increased communication bandwidth and significantly decreased communication latency compared to more traditional parallel architectures. The integrated primitives are the spawn-join mechanism, which enables parallelism by initiating and terminating the concurrent execution of multiple threads of control, and the prefix-sum operation, which is used to coordinate the threads.

Finally, we note that XMT-M has achieved significant speedups by extracting only inter-thread parallelism, and no intra-thread parallelism, and that it can get additional benefit from extracting intra-thread parallelism with the use of standard ILP techniques.

References

[1] R. C. Agarwal. A Super Scalar Sort Algorithm for RISC Processors. Proc. ACM SIGMOD.
[2] G. S. Almasi and A. Gottlieb. Highly Parallel Computing, Second Edition. Benjamin-Cummings.
[3] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer System. Proc. International Conference on Supercomputing.
[4] G. E. Blelloch. Vector Models for Data-Parallel Computing. MIT Press.
[5] G. E. Blelloch, S. Chatterjee, J. C. Hardwick, J. Sipelstein, and M. Zagha. Implementation of a portable nested data-parallel language. Proc. ACM PPOPP.
[6] D. Burger and T. M. Austin. The SimpleScalar Tool Set. Technical Report, University of Wisconsin-Madison, June.
[7] R. Cole and O. Zajicek. The APRAM: incorporating asynchrony into the PRAM model. Proc. 1st ACM SPAA.
[8] D. E. Culler and J. P. Singh. Parallel Computer Architecture. Morgan Kaufmann.
[9] W. J. Dally and S. Lacy. VLSI Architecture: Past, Present, and Future. Proc. Advanced Research in VLSI Conference.
[10] S. Dascal and U. Vishkin. Experiments with List Ranking on Explicit Multi-Threaded (XMT) Instruction Parallelism. Proc. 3rd Workshop on Algorithms Engineering (WAE), July, London, UK. Downloadable from http://www.umiacs.umd.edu/~vishkin/XMT.
[11] R. Espasa and M. Valero. Multithreaded Vector Architectures. Proc. Third International Symposium on High-Performance Computer Architecture (HPCA).
[12] R. Espasa, M. Valero, and J. E. Smith. Out-of-order Vector Architectures. Proc. Annual International Symposium on Microarchitecture (MICRO).
[13] M. Franklin. The Multiscalar Architecture. Ph.D. thesis, Technical Report, Computer Sciences Department, University of Wisconsin-Madison, December.
[14] E. Freudenthal and A. Gottlieb. Process Coordination with Fetch-and-Increment. Proc. Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV).
[15] P. B. Gibbons. A more practical PRAM algorithm. Proc. 1st ACM SPAA.
[16] A. Gottlieb, B. Lubachevsky, and L. Rudolph. Basic techniques for the efficient coordination of large numbers of cooperating sequential processors. ACM Transactions on Programming Languages and Systems.
[17] R. A. Iannucci, G. R. Gao, R. H. Halstead, and B. Smith, editors. Multithreaded Computer Architecture: A Summary of the State of the Art. Kluwer, Boston, MA.
[18] J. JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, Reading, MA.
[19] R. Joy and K. Kennedy. President's Information Technology Advisory Committee (PITAC) Interim Report to the President. National Coordination Office for Computing, Information and Communication, Arlington, VA, August.
[20] C. W. Kessler and H. Seidl. Integrating synchronous and asynchronous paradigms: the Fork parallel programming language. Technical report, Fachbereich Informatik, Universität Trier, Trier, Germany.
[21] C. W. Kessler. Quick reference guides: (i) Fork and (ii) SB-PRAM instruction set / simulator / system software. Universität Trier, FB IV Informatik, Trier, Germany, May.
[22] C. Kozyrakis et al. Scalable Processors in the Billion-Transistor Era: IRAM. IEEE Computer, September.
[23] A. LaMarca and R. E. Ladner. The Influence of Caches on the Performance of Sorting. Proc. Annual ACM-SIAM Symposium on Discrete Algorithms.
[24] C. G. Lee and D. J. DeVries. Initial Results on the Performance and Cost of Vector Microprocessors. Proc. Annual International Symposium on Microarchitecture (MICRO).
[25] A. Silberschatz and P. B. Galvin. Operating System Concepts, Fifth Edition. Addison Wesley Longman.
[26] STREAM: Sustainable Memory Bandwidth in High Performance Computers. The University of Virginia. http://www.cs.virginia.edu/stream.
[27] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. Proc. Annual International Symposium on Computer Architecture (ISCA).
[28] U. Vishkin. From Algorithm Parallelism to Instruction-Level Parallelism: An Encode-Decode Chain Using Prefix-Sum. Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA).
[29] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman. Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism. Proc. ACM Symposium on Parallel Algorithms and Architectures (SPAA). See also the XMT home page: http://www.umiacs.umd.edu/~vishkin/XMT.
[30] Personal communication with H. Levy, August.
[31] The National Technology Roadmap for Semiconductors. Semiconductor Industry Association.
