Improving Single-Process Performance with Multithreaded Processors

Alexandre Farcy, Olivier Temam

Université de Versailles
avenue des Etats-Unis
Versailles, France
{farcy,temam}@prism.uvsq.fr

Abstract

Multithreaded processors are an attractive alternative to superscalar processors. Their ability to handle multiple threads simultaneously is likely to result in higher instruction throughput and in better utilization of functional units, though they can still benefit from many hardware and software functionalities of superscalar processors, and thus consist in an evolution rather than a radical transformation of current processors. However, to date, multithreaded processors have mostly been shown capable of improving the performance of multiple-process workloads, i.e., threads with independent contexts; but in order to compete with superscalar processors, they must also prove their ability to improve single-process performance.

In this article, it is proposed to improve single-process performance by simply parallelizing a process over several threads sharing the same context, using automatic parallelization techniques already available for multiprocessors. The purpose of this article is to analyze the impact of shared-context workloads on both processor architecture and processor performance. First, based on previous research work and by reusing many components of superscalar processors, a multithreaded processor architecture is defined. Then, considering that the issues raised by shared-context workloads on data cache architectures have mostly been ignored up to now, we attempt to determine how current cache architectures can evolve to cope with shared-context workloads. The impact of shared workloads on this architecture is analyzed in detail, showing that, like multiprocessors, multithreaded processors exhibit performance bottlenecks of their own that limit single-process speedups.

Keywords: multithreaded processors, single-process performance, cache memories.

Introduction

Superscalar processors capable of issuing several instructions per cycle are now becoming a standard. However, studies indicate that the amount of intrinsic instruction-level parallelism is limited. Besides, the necessity to concurrently execute a high number of instructions from the same instruction flow induces significant hardware overheads that increase with the degree of parallelism and with the size of the instruction window from which parallelism is extracted: for instance, the necessity to check dependences between a high number of instructions, registers that need to be dynamically renamed, and buffering techniques to allow out-of-order execution.

Because of these hardware and software bottlenecks, it is not obvious that processor manufacturers will be indefinitely capable of scaling up superscalar processors. Consequently, several alternatives have been proposed in the literature and in industry. Some solutions require a deep modification of the processor execution model, like dataflow architectures, while other solutions mostly consist in evolutions of current superscalar designs, though they still address the main performance bottlenecks. Considering the pace at which hardware processor architecture can sometimes evolve, less radical transitions are more likely to be contemplated by processor manufacturers. Three such evolutions are currently being considered.

The first solution is VLIW processors, which have the advantage of strongly reducing hardware complexity but further increase the burden on the compiler, for both detecting parallelism and efficiently scheduling instructions to functional units. The two other solutions, on-chip multiprocessors and multithreaded processors, have some similarities. The architecture of on-chip multiprocessors is simple but rigid: the threads scheduled on one processor cannot exploit the functional units of the others, so that the functional unit usage ratio may not be much higher than in current processors. Multithreaded processors, as described in the seminal article by Hirata, trade simplicity for efficiency: functional units are shared by the different threads, each of which is assigned to a logical processor. Because each logical processor needs to communicate with each functional unit, the hardware interconnection overhead for functional units and register banks can be high. On the other hand, functional units are likely to be more efficiently used.

This advantage of multithreaded processors over superscalar processors has recently been demonstrated by Tullsen, who showed that multithreaded processors can be twice as efficient as high-degree superscalar processors (comparing threads with issue slots). However, up to now, the efficiency of multithreaded processors has mostly been demonstrated for heterogeneous workloads, i.e., when the workload is composed of distinct processes. But such architectures may prove viable solutions only if they are also efficient at increasing single-process performance. Multiple-instruction issue per thread has been suggested in order to optimize single-process performance while decreasing global workload execution time using multithreading. With respect to single-process performance, this solution is efficient, but it has the same limitation as superscalar processors: intrinsic instruction-level parallelism.
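The hardware-overhead argument above can be illustrated with a hypothetical first-order model (not from the article): with an issue window of W in-flight instructions, dependence checking compares each instruction's source registers against the destinations of all older instructions in the window, so the comparator count grows quadratically with W.

```python
# Hypothetical first-order model of superscalar dependence checking:
# each instruction (assumed to have 2 source registers) is compared
# against the destination register of every older instruction in the
# issue window.

def dependence_checks(window_size, sources_per_insn=2):
    checks = 0
    for pos in range(window_size):        # position in the window
        checks += sources_per_insn * pos  # compare against all older insns
    return checks                         # = sources_per_insn * W*(W-1)/2

# Doubling the window size roughly quadruples the comparisons:
assert dependence_checks(4) == 12
assert dependence_checks(8) == 56
assert dependence_checks(16) == 240
```

This quadratic trend, together with the renaming and buffering costs mentioned above, is what motivates coarser-grained alternatives to wider issue windows.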

Multithreaded processor architectures allow coarser grains of parallelism to be used. The same techniques used for multiprocessors, i.e., parallelizing at the loop nest level, can be used. Thus, multithreaded processors can readily benefit from the large amount of research on automatic parallelization. Furthermore, the cost of dispatching blocks of iterations to all processors, which is a sequential and thus often costly process repeated prior to each parallel loop, is considerably reduced, since logical processors are located on the same chip. Multiprocessors can exhibit limited efficiency because the granularity of parallelism within a loop nest is too small with respect to the number of processors and the initiation time. This obstacle disappears in multithreaded processors, because the initiation time is fairly small and, most of all, because these architectures can tolerate small granularities (a single iteration). Consequently, the same parallelization techniques used in multiprocessors can be applied to multithreaded processors without the usually associated flaws. On the other hand, it is not obvious that new performance bottlenecks specific to multithreaded processors would not occur.

The purpose of this article is twofold: first, to evaluate the capacity of multithreaded processors to improve single-process performance using the parallelization techniques that already exist for multiprocessors, and to evaluate the flexibility of multithreaded processors with different types of workloads. Second, and more important, the behavior of parallelized codes on multithreaded processors is examined in detail. The influence of each component of the architecture is discussed and the main performance bottlenecks are underlined. Techniques for coping with these limitations in future multithreaded processor architectures are discussed. The multithreaded processor and cache architectures used are detailed first; the different aspects of the experimental framework (parallelization, trace collection, simulation) are then presented; finally, multithreaded processor performance is analyzed.

Related Work

Up to now, several different concepts of multithreaded processors have been proposed. Some are simple modifications of current processors and have actually been implemented; others are more different concepts that still retain a strong resemblance with current processors. It is not the purpose of this article to propose a novel processor architecture; on the contrary, to make the results of this study relevant to a wide range of multithreaded processor architectures, specific innovations like thread synchronization support or context groups were not implemented. As mentioned above, the goal is to evaluate performance tradeoffs of multithreaded processor architectures seen as an evolution of current superscalar processors. Still, the evaluation of parallelized processes and the associated coherence issues made it necessary to propose possible evolutions of current data cache architectures.

Processor Architecture. The concepts developed in the first multithreaded processor architectures were closer to dataflow machines than to superscalar processors. The goal was both to hide latency and to increase instruction throughput by implementing enhanced context-switching techniques; novel thread-interleaving techniques have also been proposed. Because cache hierarchies, as used in current superscalar processors, are relatively successful at hiding latency,¹ larger gains can be expected from the high instruction-throughput capacities of multithreaded processors than from fast context-switching techniques, and thus recent studies are rather focused on instruction throughput issues.

In one proposal, a multithreaded processor based on the DEC 21164 is designed with the goal of implementing Simultaneous Multithreading, i.e., several threads are run concurrently. As in a superscalar processor, each thread is capable of issuing multiple instructions in the same cycle. In another proposal, the multithreaded processor architecture is also close to a superscalar processor, but functional unit utilization is improved by grouping threads so that at least one thread per group is capable of issuing. In Hirata's design, like in the two previous studies, each thread can use any of the functional units. Each thread is assigned to one of the logical processors. Data dependences are resolved with a scoreboard technique, and resource dependences with standby stations implemented before each functional unit, thus limiting the impact of resource hazards on thread issue rate. Still, hardware support for switch-on-cache-miss is proposed, because this processor is to be used in a multiprocessor environment where long latencies can be expected.² Also, shared registers are used for inter-thread communication and also serve as a synchronization means. Single-process performance improvement is briefly discussed in that article: a speedup with respect to a conventional RISC processor is reported using several threads and two load/store units (note that a perfect cache is used).

Sohi and Franklin also proposed a more novel architecture, called the multiscalar processor, between a multiprocessor concept and a multithreaded processor. A multiscalar processor allows parallel execution of tasks, each on a distinct execution unit. Those units share a common memory and communicate through a ring to forward register values. Unlike most multithreaded architectures mentioned above, functional units cannot be shared by the tasks. However, a multiscalar processor requires much work to be done at compile time, to extract tasks and optimize scheduling, and hardware recovery mechanisms to allow speculative execution of tasks. In these studies, the possibility of improving single-process performance is evaluated, showing speedups of up to … with an …-unit multiscalar processor over a conventional processor on SPECint, and the data cache issues of processors running multiple shared-context threads are examined in detail.

¹ First- and second-level caches are on-chip in the DEC 21164; the second-level cache is on the processor module in the PentiumPro.
² This processor was designed for an image synthesis application: the simulation of virtual worlds.

Cache Architecture. When multiple threads are run simultaneously, the number of cache accesses per cycle can be very high. With respect to instruction caches, the simplest solution is to use one cache per thread. The same solution can be considered for data caches. However, if a single process is parallelized over multiple threads, the same context is to be shared by several threads. The main constraint is then to maintain coherence. Assume one cache is used for each thread. False sharing can happen, like in multiprocessors, in a parallel section with no dependence between tasks (iterations). Unless the lines that are not up to date are flushed at the end of the parallel section, coherence is not enforced. Because it can be very difficult to keep track of such cases, it is not possible to extend the private-cache solution to a multithreaded processor where a process can be parallelized, i.e., where several threads can share the same context. This cache coherence problem is very similar to that of multiprocessors. However, unlike in multiprocessors, this problem can be simply solved by having all threads that

share the same context use the same cache space.

The issue is then to allow multiple accesses per cycle to this cache space. On a smaller scale, the same issue has already been raised for superscalar processors. Up to now, different forms of multibanking and multiporting have been used. Two-banked caches (odd/even) are used in the MIPS TFP and T5, the Intel Pentium and the HP PA, allowing two simultaneous accesses to two different cache banks. Data replication is used in the DEC 21164 (two identical banks; stores are serialized to preserve coherence), and virtual multiporting in the Power2 (the SRAM access time is half the processor clock, allowing two cache accesses in a single processor clock cycle). True multiporting has been used in the Intel Pentium's and IBM Power2's DTLBs, using true dual-ported memory.

Actually, among all these solutions, it seems that only the multibank concept can scale up with the number of simultaneous cache accesses per cycle. For n ports, an SRAM access time equal to 1/n-th of the processor clock is unlikely if n is large, given current clock rates. With respect to bank replication, copying the cache n times without any effective gain in available cache space is prohibitive. True multiporting seems at first the most natural solution. However, the number of ports per SRAM increases the cache surface.³ Moreover, the energy required to reload all capacity cells makes the SRAM cell access time longer. Therefore, though n-port SRAMs with n large can be used in theory, the associated increase in on-chip space and cache access time makes it a prohibitive solution.

Multithreaded Processor Architecture

The main asset of the designs proposed by Hirata and Tullsen is their scalability, i.e., the number of functional units can easily be increased. For the sake of flexibility, a processor model inspired from Hirata's has been used in this study (see Figure 1), with logical processors each issuing a single instruction per cycle, and standby stations before the functional units, as introduced in Hirata's design.

[Figure 1: Processor Architecture. Each logical processor (thread slot) has a private I-cache bank and its own fetch and decode pipeline stages. Standby stations (SBS) sit before the functional units (integer ALU, FP Add/Mul, FP Div and other units, address calculation) and the load/store units; a crossbar (Xbar) stage connects the load/store pipelines to the multi-banked data cache (banks 0-3), which is backed by the lower memory level; results are written back to the per-thread register banks in the write-back stage.]

However, unlike Hirata, register dependences are here resolved by having the standby stations also play the role of the reservation stations of a Tomasulo algorithm, allowing register bypassing. Like in Hirata's design, thread priority over each functional unit is managed with a simple round-robin algorithm to avoid starvation.

The data cache used in our architecture is split into several independent banks, demanding a crossbar to distribute the requests from the load/store units to any bank. The main drawback of such a crossbar is the increased cache access time. Yet a study shows that passing through a crossbar costs about … ns for a …-bit path width. Other uses of crossbars in on-chip and time-constrained environments — like the MIPS R8000 crossbar to dispatch instructions to the floating-point or integer units, and the full crossbar between the ports of the Intel PentiumPro's reservation station — further confirm the feasibility of such designs. With respect to performance, the main drawback of using a crossbar to access several banks is bank conflicts: two load/store instructions willing to simultaneously access the same bank must be serialized, thus stalling one of the load/store pipelines. The way a code is parallelized can strongly affect the probability of bank conflicts (see below).

In our implementation, each bank is either private to a thread, or can be gathered with several other banks to form a larger cache space that can be accessed by several threads belonging to the same context. Typically, n such threads share a cache space spread over n banks. With respect to bank sharing, only two radically opposite solutions were tested in Tullsen's study: either each process has a private cache, or all processes share all cache banks. When the number of threads is equal to the number of cache banks, like in our case, Tullsen et al. show that both solutions are equivalent. Here we propose an intermediate solution: to have the n threads of a given process (context) share n cache banks. This solution is flexible: if the number of active threads is smaller than the number of logical processors, single-thread processes can be provided with more than one bank. Though for small cache sizes this solution performs mostly like the all-banks-shared alternative, as cache size increases, inter-process misses tend to become dominant over intra-process misses. Thus, in the future it will be more critical to avoid inter-process misses by privatizing groups of banks to processes than to limit intra-process capacity/conflict misses by further increasing the available cache space. More important, this intermediate solution reduces the occurrence of bank conflicts.

If multiple load/store units are used, further coherence issues also arise. Two instructions accessing the same address in two different load/store pipelines have to complete in order. Assume one pipeline is stalled because of bank conflicts (or any cache stall): it is necessary to stall the other load/store units to enforce in-order load/store execution, as no mechanism for out-of-order execution of memory references is provided in our model. The cache banks used in this study are non-blocking, with a miss queue (close to the DEC 21164's), to decrease the impact of such stalls. Using multiple cache banks and a crossbar has also been suggested elsewhere, but the associated coherence issues were not considered in detail.

However, Franklin and Sohi propose to use an ARB (Address Resolution Buffer) to handle the above-mentioned coherence issues (memory disambiguation) in their multiscalar processor. The ARB is a cache placed between the crossbar and the cache banks which keeps track of the different load/store instructions accessing the same address. When an illegally ordered load/store sequence has been detected, the corresponding task is re-executed from the conflicting instruction. Aside from its hardware overhead, this solution is difficult to extend to multithreaded processors.
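The bank-conflict serialization described above can be made concrete with a minimal sketch. The bank count, line size and addresses below are hypothetical illustration values, not the parameters of the simulated processor:

```python
# Sketch: same-cycle requests to one bank of a multi-banked cache
# must serialize, stalling a load/store pipeline for each collision.
# Hypothetical parameters: 4 banks interleaved at 32-byte line
# granularity, 8-byte array elements.

NUM_BANKS, LINE_SIZE, ELEM = 4, 32, 8

def bank_of(addr):
    # Banks are interleaved at cache-line granularity.
    return (addr // LINE_SIZE) % NUM_BANKS

def extra_cycles(addrs):
    # One request per bank proceeds; the rest are serialized.
    banks = [bank_of(a) for a in addrs]
    return len(banks) - len(set(banks))

# Modulo scheduling: four threads touch consecutive elements
# i, i+1, i+2, i+3 in the same cycle -- one cache line, one bank.
modulo_cycle = [ELEM * i for i in range(4)]
# Block scheduling: threads work on distant contiguous blocks
# (here 100 elements apart), spreading requests over the banks.
block_cycle = [ELEM * (100 * t) for t in range(4)]

assert extra_cycles(modulo_cycle) == 3  # three requests serialized
assert extra_cycles(block_cycle) == 0   # conflict-free this cycle
```

Note that an unlucky (e.g. power-of-two) block stride could still map all blocks to the same bank, which is exactly why the way a code is parallelized affects the conflict probability.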

Indeed, it is based on the fact that, in a multiscalar processor, tasks are separated into small sets of instructions (one or a few basic blocks) dispatched in a rotating manner to the different execution units. Thus, once the oldest execution unit has completed, it is certain that no load/store ordering issue can arise. Since the size of tasks is small, few cache lines could potentially be victims of coherence conflicts at the same time, so that the ARB needs not be large. In our case, with n logical processors, one thread can correspond to 1/n-th of a parallel loop, which is far too large for an ARB. Another possible solution is the Address Reorder Buffer of the HP PA8000's load/store unit, which allows out-of-order execution of memory references and speculative execution of load/store instructions, by buffering pending LOADs and STOREs and comparing the addresses of these requests to new load/store requests to detect coherence issues.

Experimental Framework

Parallelization. To sustain the argument that current techniques of automatic parallelization can be readily applied to multithreaded processors, it was critical to use an automatic parallelizer currently available for multiprocessors. We selected KAP, a commercial Fortran preprocessor designed by KAI. KAP was used to detect parallel loop nests and to mark these loops as parallel in the source code. For that task, the efficiency of KAP was not high, but we expect to obtain better results in the future, either with improved versions of KAP or with state-of-the-art academic parallelizers like Polaris or SUIF, which include more efficient data dependence analysis techniques (the Omega Test was recently implemented in Polaris).

Because the goal was to measure the best performance that can be achieved with a parallelized process on a multithreaded processor, only loop nests without loop-carried dependences were parallelized, i.e., no synchronization techniques were used. The outer loop of each parallel loop nest was parallelized. The two most classic techniques for statically distributing loop iterations over different processors (logical processors in our case) are by blocks of contiguous iterations, and by assigning iteration i to processor i mod P, where P is the number of logical processors. For cache-based systems like the KSR1, modulo scheduling may often behave poorly because many processors need to access the same cache line at the same time. In the cache design selected (see above), this results in cache bank conflicts. Therefore, a block processor scheduling technique was finally selected. Loops are simply parallelized as shown below:

    Original loop:       Parallel loop:

    DO I=1,N             DOALL I=1,N,B
      Loop Body            DO II=I,min(I+B-1,N)
    ENDDO                    Loop Body
                           ENDDO
                         ENDDO

where B = ceil(N/P).

Traces and Simulation. Object-code traces are collected with the Spa package. The main difficulty is to collect parallel traces in such a way that different parallelization alternatives (block/modulo, varying number of logical processors) can be evaluated. The processor to which an iteration of a parallel loop must be assigned is determined on the fly during trace collection. The only information required is a flag indicating the beginning/end of a parallel section, and the iteration number of the loop being parallelized.⁴

As explained above, parallel loops are detected with KAP and marked as such in the source code. Then the Fortran-to-Fortran compiler Sage is used to instrument the code with markers (references to known addresses) so that the beginning/end of parallel loops and the beginning of each iteration can be detected during code execution. Spa was then modified to acknowledge these marker addresses and to augment the trace with the desired information.

A last delicate issue was the fate of scalar variables, like loop indices. If the same scalar variable is used by multiple threads, numerous bank conflicts are likely to occur. Thus, it is necessary to privatize scalar variables to each thread, by shifting their addresses so that two such variables don't fall in the same cache bank.

We developed our own simulator to test varying alternatives of multithreaded designs, because few multithreaded simulators were available. Still, we contemplated using Concurro, but the specific features of this design, especially the principle of gathering threads in context groups, were too restrictive. As explained above, the multithreaded processor architecture simulated is close to the one described by Hirata, except for the memory system parts, which were not described in that article. The functional unit latencies of the processor architecture simulated are one cycle for the integer ALU, … cycles for floating-point Add/Multiply, … cycles for floating-point Divide and … cycles for Load/Store. Except for FP Divide, all functional units are pipelined.⁵

Each thread is assigned to a logical processor, which includes the logic for the fetch and decode stages. Each thread is assigned a private instruction cache bank and a private register bank. Each cache bank is … Kbytes large, direct-mapped, with a …-byte line, and write-back. The number of data/instruction cache banks is always equal to the number of logical processors, i.e., the number of threads. The miss latency is equal to … cycles, and a miss request can be issued every … cycles. The codes were compiled on a Sparc.

Benchmarks. In this study, both shared-context workloads and a combination of shared-context and private-context workloads are executed. Thus, programs corresponding to classic processor workloads, as well as benchmarks that lend themselves to automatic parallelization techniques, are needed. For the first category, we selected Gcc, Compress, Simulator, Contour, Grep, Latex and Segmentation.⁶ These seven programs were only used to increase the load on the processor, and no specific performance evaluation was done on those. For numerical codes, we picked three of the Perfect Club benchmarks: FLO52, ARC2D and MDG. Since this study is focused on analyzing the behavior of multiple shared-context threads, statistics are collected over parallel sections only (even though all instructions are executed). Considering that our goal was to evaluate the best achievable performance with a multithreaded processor, we chose the three codes on which KAP was most efficient at finding parallel loops (other codes had too short parallel sections). The fraction of instructions in parallel sections is …% for FLO52, …% for ARC2D and …% for MDG. The miss ratio of each code on a single cache bank is shown in Figure 2.

³ Approximately … mW/MHz for 1 port, … mW/MHz for 2 ports, and … mW/MHz for 3 ports.
⁴ In fact, to apply block processor scheduling, the total number of iterations in a parallel loop must be known; thus, two passes are necessary for each code.
⁵ These values have been averaged over several current microprocessors (MIPS and PowerPC designs).
⁶ Contour and Segmentation are image analysis codes; Simulator is our processor simulator.
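The block decomposition shown in the DOALL example above, with B = ceil(N/P), can be sketched as follows (plain Python standing in for the Fortran; the function name is hypothetical):

```python
# Sketch: static block scheduling of N loop iterations over P
# logical processors, mirroring the DOALL transformation above.

def block_schedule(N, P):
    B = -(-N // P)                    # B = ceil(N/P)
    blocks = []
    for start in range(1, N + 1, B):  # DOALL I = 1, N, B
        end = min(start + B - 1, N)   # DO II = I, min(I+B-1, N)
        blocks.append((start, end))
    return blocks

# 10 iterations over 4 logical processors: blocks of B = 3
# contiguous iterations; the last block holds the remainder.
assert block_schedule(10, 4) == [(1, 3), (4, 6), (7, 9), (10, 10)]
```

Contiguous blocks keep each thread's accesses in distinct cache lines, which is why block scheduling was preferred over modulo scheduling here.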

40.0

even though they are not impaired by the same remote

communications and task distribution overhead issues or

30.0 like sup erscalar pro cessors even though the intrinsic paral

20.0 lelism is high

MISS Ratio (%) 1.4 10.0 FLO52 MDG 1.2 ARC2D 0.0

Gcc 1 Grep Latex MDG FLO52 ARC2D Contour

Simulator Compress

Segmentation 0.8

CPI

Figure Miss ratio of al l benchmarks

0.6

Performance Evaluation

0.4 Because it is unlikely a large number of logical pro cessors

can b e implemented on a single chip in the near future the 0.2 1 2 3 4 5 6 7 8

exp eriments have b een limited to logical pro cessors in each Number of Threads (4 functional units of each class)

case For the sake of clarity in all exp eriments the number

Figure CPI

7

of functional units of each class is increased altogether

analysis of the impact of each functional unit but detailed 5.5

FLO52

also provided Thus a single number is used to charac is 5 MDG ARC2D terize the number of functional units unless otherwise sp eci 4.5

ed two functional units meaning two functional units of 4 In this section various workloads are analyzed

each class 3.5 First global b ehavior is presented then the dierent origins 3

Speedup of p erformance b ottlenecks are progressively detailed 2.5 2 2.4 FLO52-8 1.5 2.2 MDG-8 ARC2D-8 1 1 2 3 4 5 6 7 8 2 Number of Threads (4 functional units of each class)

1.8

Figure Speedup with respect to one thread 1.6

Speedup

Origins of stalls Six stall causes can b e distinguis hed

1.4

in Figure WAW register dep endences branch delays

1.2

resource conicts busystall loadstore units instruction Let us examine in 1 cache miss and thread synchronization

1 2 3 4 5 6 7 8 A logical

Number of Functional Units of Each Class more details when each of these stalls can o ccur

pro cessor can b e stall only if a stall o ccurs in one of the two

Figure Speedup with respect to one thread

stages directly controlled by the logical pro cessor the fetch

and deco de stages Otherwise as far as a logical pro cessor

Global b ehavior of parallelized pro cesses In Figure

can issue no stall cycle is counted even if some functional

each program is parallelized into threads a program PRG

units are stall for a reason or another

paralleli zed into n threads is denoted PRGn and the num b er of functional units of each class is varied from up to

Branch It app ears that b eyond functional units ie functional Resource 60 units of each class only little p erformance improvements can I-miss

8 L/S

In fact when n threads are run on a mul b e exp ected Static Sched.

WAW

tithreaded pro cessor the number of functional units needs

n to achieve maximum p erformance thanks

not b e equal to 40

to the sharing of functional units by the dierent threads This is actually one of the ma jor arguments for supp orting

% Total Execution Cycles in terms of functional units scal

multithreaded pro cessors 20

ing up a multithreaded pro cessor is cheaper than scaling up

an onchip multiprocessor However for FLO and more

ARCD there are obviously p erformance b ot particularly for 0 FLO52 ARC2D MDG

1, 2, 4, 8 Threads (4 functional units of each class)

tlenecks other than resource issues that limit p erformance

improvements This can b e seen on Figures and where Figure Distribution of stal l cycles

the number of threads is varied from up to with the num

Consequently with resp ect to register dependences only

b er of functional units equal to for each class ie near

Write After Write dep endences can directly result in a stall

b est p erformance for threads The maximum sp eedup

cycle the deco de stage cannot issue an instruction if a func

achieved by FLO is ab out and for ARCD ie half the

tional unit is ab out to write in its destination register In

optimal sp eedup for threads As a result the CPI remains

case there is a RAW Read After Write dep endence but a

relatively high for ARCD and for FLO consider

standby station is available b efore one of the corresp onding

ing the number of threads In other terms multithreaded

functional units the instruction is issued and no stall cycle

pro cessors exhibit sublinear sp eedups like multiprocessors

o ccurs WAR Write After Read dep endences cannot o c

7

cur as registers are read and reserved in the deco de stages

the classes of functional units provided in our architecture are

integer ALU FP addermultiplier FP divider and LoadStore

in program order As can b e seen in Figure WAW dep en

8

the sp eedup is computed with resp ect to one thread run on a

dences account for a ma jor share of stall cycles Thus dy

functional units p er class pro cessor in terms of CPI

namic register renaming techniques as used in sup erscalar

pro cessors can strongly b enet to multithreaded pro cessors the same cache line resulting in bank conicts Clearly b et

as well On the other hand the eect of WAW dep endences ter static pro cessor scheduling techniques or implementing

is relatively stable as the number of threads increases and dynamic allo cation of iterations to logical pro cessors should

thus other stall causes then seem to b ecome signicant if not b e considered Still this load balancing phenomenon would

dominant have a far stronger impact on a classic multiprocessor

No sophisticated branch prediction technique has b een

40.0

branch delays for the moment but consider used to avoid Cache

Resource

ing the co des parallelized are numerical co des ie made of lo ops the eect of branch p enalties do es not show much

With respect to resource conflicts, the decode stage is stalled if all corresponding functional units and their standby stations are busy; if all functional units are busy, the free standby stations can be used to hide resource conflicts as well. The distribution of resource conflicts is shown in the figure: integer and floating-point add and mul operations are responsible for most conflicts (the resource conflict statistics in the figure do not include load/store units). As can be seen, threshold performance can be achieved with only a few integer and floating-point add/mul functional units and a single unit for each other operation. Note that the number of add units necessary is close to the optimal values of integer add and floating-point add found in [Jourdan et al.] for a multiple-issue superscalar processor.

Figure: Distribution of resource conflicts (fraction of stall cycles) as a function of the number of functional units (ALU, FP add/mul, FP div; 1, 2, 3, 4, 8 functional units of each class; FLO52, ARC2D, MDG).

With respect to load/store units, there are multiple causes of stalls: resource conflicts, but also stalls due to cache accesses. As the number of threads increases, load/store-related stalls increase significantly (see the figure). Thus, such stalls are likely to be a major cause of speedup limitations. They will be examined in more detail in the next paragraphs.

Stalls due to load/store instructions. There are two main categories of stalls due to load/store instructions: cache stalls (for coherence or misses) and load/store unit resource conflicts. As can be seen in the figure, load/store resource conflicts only occur past 4 threads (recall there are 4 load/store units in these experiments). This phenomenon is specific to load/store units, mostly because coherence issues forbid using the load/store units' standby stations to hide resource conflicts. Indeed, once a memory instruction has been sent to the standby station of one of the load/store units, other LOADs or STOREs of the same thread cannot be sent to another standby station, to avoid out-of-order execution of load/store instructions (see the section above).

Figure: Distribution of stall cycles due to load/store functional units (cache vs. resource stalls; 1, 2, 4, 8 threads, 4 functional units of each class; FLO52, ARC2D, MDG).

Figure: Distribution of stall cycles due to cache stalls (bank conflicts, misses, memory, register conflicts; 1, 2, 4, 8 threads, 4 functional units of each class; FLO52, ARC2D, MDG).
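The issue rule described above can be sketched as follows (a simplified model with hypothetical unit and station counts, not the actual decode logic of the simulated processor):

```python
# Simplified decode-stage rule: an instruction issues if a functional unit
# of its class is free, or if a standby station before one of those units
# is free; otherwise the decode stage stalls (hypothetical counts below).

def can_issue(op_class, busy_units, busy_stations, n_units, n_stations):
    """busy_units / busy_stations: dicts class -> number currently occupied."""
    if busy_units.get(op_class, 0) < n_units[op_class]:
        return True   # a functional unit of this class is free
    # all units busy: a free standby station can hide the conflict
    return busy_stations.get(op_class, 0) < n_stations[op_class]

n_units = {"fp_add": 2}
n_stations = {"fp_add": 2}
print(can_issue("fp_add", {"fp_add": 2}, {"fp_add": 1}, n_units, n_stations))
# True: units busy, but a standby station absorbs the instruction
print(can_issue("fp_add", {"fp_add": 2}, {"fp_add": 2}, n_units, n_stations))
# False: units and stations all busy -> decode stalls
```

This is the mechanism by which standby stations hide resource conflicts; the load/store exception discussed above amounts to restricting a thread to a single station of that class.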

Finally, a consequence of static allocation of iterations on the logical processors (static scheduling) is that some threads may complete execution earlier than others, depending on the distribution of iterations. In an actual implementation of a multithreaded processor, two solutions can be adopted: synchronizing threads (early threads wait for other threads to complete), or switching an active thread onto the logical processor of an idle thread. Here, the first technique was used, in order to highlight the effect of static processor scheduling. As shown in the figure, while this effect is negligible for FLO52, it becomes dominant for ARC2D when the number n of logical processors increases. The consequence of static processor scheduling as applied here is that one thread may receive up to n fewer iterations than other threads for any parallel loop nest. This effect does not show in FLO52 because, for many of its loops, the number of iterations is a multiple of the number of logical processors. Other techniques, like modulo processor scheduling, which allows a better distribution of iterations, behave even worse, because many threads simultaneously access consecutive data, i.e., the same cache line, resulting in bank conflicts. Clearly, better static processor scheduling techniques, or implementing dynamic allocation of iterations to logical processors, should be considered. Still, this load balancing phenomenon would have a far stronger impact on a classic multiprocessor.

Cache stalls. The major cause of load/store stalls are cache stalls. Four kinds of cache stalls can be distinguished: register accesses, memory bandwidth, misses, and bank conflicts (see the figure). Register access conflicts occur when two references (cache hit, line reload, reference leaving the miss queue) need to access the write register port of a given thread at the same time. Though such conflicts exist and require special hardware to handle, their impact is often negligible. Because the first-level cache size increases with the number of threads, the increasing issue rate of load/store instructions does not seem to induce an excessive memory request issue rate (see the figure, graph Memory). This is true only because cache banks are write-back. Indeed, the performance of write-through cache banks is worse, because the high amount of memory store requests induces numerous cache stalls, particularly for FLO52 (see the figure). Still, the impact of multithreaded processors on lower memory levels should be examined in more detail.

Figure: Write-Through vs. Write-Back cache stalls due to memory contention for FLO52 (% cache stalls; 1, 2, 4, 8 threads).
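The load imbalance caused by static scheduling can be illustrated with a small sketch (the iteration counts and the naive ceiling-based split are illustrative assumptions, not the exact scheduler modeled in this study):

```python
# Illustrative static partitioning of loop iterations over n logical
# processors: a balanced block split vs. a naive ceiling-based split.

def block_partition(iters, n):
    """Balanced static split: sizes differ by at most one iteration."""
    base, rem = divmod(iters, n)
    return [base + (1 if t < rem else 0) for t in range(n)]

def naive_partition(iters, n):
    """Naive split: every thread takes ceil(iters / n) iterations,
    so the last thread can receive far fewer (the imbalance the
    text attributes to static processor scheduling)."""
    chunk = -(-iters // n)          # ceil(iters / n)
    sizes, left = [], iters
    for _ in range(n):
        take = min(chunk, left)
        sizes.append(take)
        left -= take
    return sizes

print(block_partition(10, 4))  # [3, 3, 2, 2]: imbalance of one iteration
print(naive_partition(10, 4))  # [3, 3, 3, 1]: one thread finishes much earlier
```

When the iteration count is a multiple of the number of logical processors (as for many FLO52 loops), both splits are perfectly balanced and the early-completion effect disappears.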

Figure: Distribution of cache bank conflicts, 8 threads (% bank conflicts per cache bank, banks 1-8).

According to the figure, the two main causes of cache stalls are misses and cache-bank conflicts. Since cache banks are non-blocking, misses effectively induce cache stalls in two cases: when copying back a dirty line to the write buffer, and when a miss occurs on a line waiting for a pending miss request. When a thread is starved waiting for a load/store that missed, the decode stage carries on issuing instructions and sending them to standby stations, so no stall occurs immediately. On the other hand, the thread must stall if all standby stations are busy, or if the single load/store standby station it may use is occupied. Thus, the effect of memory latency is also felt indirectly through resource conflicts, particularly on load/store standby stations. Copying a dirty line to a write buffer is a cheap operation (one cycle), but it happens frequently. Also, to our surprise, misses on pending lines occur relatively often. This is related to the nature of parallel threads: it sometimes happens that several threads need to read the same line at the same time, since some array data can be shared by several iterations, or the different data needed by several threads belong to the same cache line. Still, as the cache size increases, less than half of cache stalls are related to misses (see the figures above), except for ARC2D. Thus, it seems that a combination of parallel threads and non-blocking cache banks is efficient at hiding small latencies. Consequently, switch-on-miss policies, which require additional hardware support, may not be necessary. With respect to ARC2D, a specific phenomenon occurs: this code contains numerous references with power-of-two strides, which thus induce cache conflicts even for large cache sizes.

Figure: Variation of process total miss ratio with the number of threads (Miss Rate (%); 1-8 threads; ARC2D, FLO52, MDG).

Heterogeneous Workloads. While the above paragraphs dealt with single-process performance, multithreaded processors also need to prove flexible, i.e., to perform well with heterogeneous workloads: a combination of parallelized processes and private-context threads. In the figure, two experiments were conducted. Graph N + PRG-4 corresponds to a program parallelized over 4 threads (thus the allocated cache space remains constant) while N other private-context threads are run simultaneously. This graph shows that running an increasing number of threads does not affect much the performance of the parallelized process. Still, the small performance degradation corresponds to additional memory traffic, which further delays memory requests of the parallelized process. However, crossbar conflicts do not increase, thanks to the pseudo-privatization of cache banks to processes. The purpose of graph N + PRG-(8-N) is different: to evaluate whether a multithreaded processor performs better within a heterogeneous environment (threads mostly have distinct contexts) or within a homogeneous environment (threads mostly share the same context). In graph N + PRG-(8-N), 8 threads are always run in total, but the program is parallelized over 8 - N threads while N other single-thread processes are also added, according to the order of the list given earlier. It appears that a homogeneous workload performs better because of the cache structure adopted; note, however, that our cache design privileges n shared-context threads over n private-context threads because, in the first case, all threads share the n cache banks while, in the latter case, each thread only receives a single bank, and thus a smaller cache space. Still, the stiff difference is also due to the fact that the total working set of a homogeneous workload is often smaller than that of a heterogeneous workload, since parallel threads usually share some data.
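The power-of-two-stride phenomenon can be reproduced with a small sketch (the bank count, line size, and stride values are illustrative, not the simulated cache configuration):

```python
# Illustrative model of an interleaved multi-bank cache: consecutive lines
# map to consecutive banks, so power-of-two strides that are multiples of
# (line_words * n_banks) concentrate all references on a single bank.

def bank_histogram(n_refs, stride, n_banks, line_words=4):
    """Count references per bank for word addresses 0, stride, 2*stride, ..."""
    hist = [0] * n_banks
    for i in range(n_refs):
        line = (i * stride) // line_words   # cache line of the reference
        hist[line % n_banks] += 1           # line-interleaved bank mapping
    return hist

print(bank_histogram(64, stride=1,  n_banks=8))  # unit stride: even spread
print(bank_histogram(64, stride=32, n_banks=8))  # power-of-two stride: one bank
```

With unit stride the 64 references spread evenly (8 per bank), while a stride of 32 words lands every reference in bank 0; this is the kind of poor conflict distribution over cache banks that penalizes ARC2D.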

On the other hand, for all three benchmarks, the effect of cache bank conflicts steadily increases with the number of threads. The impact of cache bank conflicts on total execution time is particularly strong because all load/store units must be stalled to preserve coherence each time a conflict occurs (see the section above). The effect of bank conflicts is further worsened by the poor distribution of conflicts over cache banks, as shown in the figure for 8 threads. Thus, more elaborate and selective coherence schemes, requiring only partial stalling, would make the present cache architectures less sensitive to bank conflicts. The second, complementary research direction is to investigate the possibility of applying existing solutions to memory-bank conflicts in vector processors to multiple-bank caches.

Figure: Heterogeneous workloads (CPI vs. number N of threads; graphs N + PRG-4 and N + PRG-(8-N)).

9. The additional processes are gcc for one additional process, gcc and compress for two additional processes, and so on, according to the order of the list given earlier.
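The potential gain of selective stalling can be sketched with a toy cost model (the conflict trace and the assumption that a selective scheme stalls only the colliding units, while the present scheme stalls all load/store units, are illustrative):

```python
# Toy cost model comparing the present coherence scheme (stall ALL
# load/store units on a bank conflict) with a hypothetical selective
# scheme (stall only the colliding units beyond the one that is served).

def stall_cycles(conflicts, n_ls_units, selective):
    """conflicts: per-cycle sets of load/store unit ids hitting one bank."""
    total = 0
    for units in conflicts:
        if len(units) > 1:                    # a bank conflict this cycle
            if selective:
                total += len(units) - 1       # only losers of arbitration stall
            else:
                total += n_ls_units           # every load/store unit stalls
    return total

trace = [{0, 1}, {2}, {1, 3}, {0, 1, 2}]      # illustrative conflict trace
print(stall_cycles(trace, 4, selective=False))  # 12 stall cycles
print(stall_cycles(trace, 4, selective=True))   # 4 stall cycles
```

Under this (simplistic) model, partial stalling cuts conflict-induced stall cycles by a factor of three on the sample trace, which conveys why a more selective coherence scheme is an attractive research direction.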

Conclusions and Further Work

Based on the experiments of this study, the conclusions on multithreaded processors are mitigated: the potential capacities of such architectures are significant, but further hardware improvements are required to exploit them. Multithreaded processors can effectively improve single-process performance, and best achievable performance is reached with fewer functional units than the number of threads. On the other hand, the usual assumption that multithreaded processors can avoid the complexity of superscalar architectures is mostly wrong, since both architectures require similar hardware enhancements, like dynamic register renaming and memory disambiguation. More important, only sublinear speedups are obtained, because cache bandwidth cannot be fully exploited, even though a cache design with sufficient potential bandwidth can be built with current technology. The two main limitations are the necessity to preserve coherence, which induces excessive cache stalls, and cache bank conflicts.

Thus, before superscalar processors can mutate into multithreaded processors capable of supporting multiple threads simultaneously, several issues must be investigated. Automatic parallelization techniques are not mature enough to avoid using multiple-issue threads, as proposed with Simultaneous Multithreading. On the other hand, automatic parallelization can be used to increase intrinsic parallelism in many cases. So that single-process speedup does not decrease with the number of threads, a more selective cache coherence scheme and techniques for reducing cache bank conflicts must be developed. Also, the hardware cost of implementing dynamic scheduling of tasks over shared-context threads should be investigated, as this would widen the application of multithreaded processors to codes with very few iterations per loop, i.e., codes that typically perform poorly on multiprocessors. The scope of multithreaded processors would be further enhanced if fast synchronization techniques were implemented, possibly through registers or caches, to parallelize loops with dependences.

References

A. Agarwal, J. Hennessy, and M. Horowitz. Cache Performance of Operating Systems and Multiprogramming. Transactions on Computer Systems, November.

Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In International Symposium on Computer Architecture, June.

B. Blume et al. Polaris: The Next Generation in Parallelizing Compilers. In Proceedings of the Seventh Workshop on Languages and Compilers for Parallel Computing.

Computer Systems Laboratory, Departments of Electrical Engineering and Computer Science, Stanford University. The Stanford SUIF Compiler.

George Cybenko, Lyle Kipp, Lynn Pointer, and David Kuck. Supercomputing Performance Evaluation and the Perfect Benchmarks. In Supercomputing.

Digital Equipment Corporation, Maynard, Massachusetts. Alpha Microprocessor Hardware Reference Manual.

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison.

D. Gannon et al. SIGMA II: A Tool Kit for Building Parallelizing Compiler and Performance Analysis Systems. Technical report, University of Indiana.

R. Govindarajan, S.S. Nemawarkar, and Philip Lenir. Design and Performance Evaluation of a Multithreaded Architecture. In International Symposium on Computer Architecture.

Elana D. Granston and Harry A.G. Wijshoff. Managing Pages in Shared Virtual Memory Systems: Getting the Compiler into the Game. Technical report, Department of Computer Science, Leiden University, December. Submitted to International Conference on Supercomputing.

Bernard Karl Gunther. Superscalar Performance in a Multithreaded Microprocessor. PhD thesis, University of Tasmania, Hobart, December.

John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc.

Hiroaki Hirata et al. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In International Symposium on Computer Architecture.

Peter Yan-Tek Hsu. Designing the TFP Microprocessor. IEEE Micro, April.

Intel Corporation. Pentium Processor User's Manual.

G. Irlam. SPA package.

Stephan Jourdan, Pascal Sainrat, and Daniel Litaize. Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor. In International Symposium on Computer Architecture, July.

Daniel C. McCrackin. Eliminating Interlocks in Deeply Pipelined Processors by Delay Enforced Multistreaming. IEEE Transactions on Computers.

Microprocessor Report. IBM Regains Performance Lead with Power. October.

Microprocessor Report. PA Combines Complexity and Speed. November.

Microprocessor Report. Intel Boosts Pentium Pro to MHz. November.

MIPS Technologies, Incorporated. R Microprocessor Chip Set, Product Overview.

MIPS Technologies, Incorporated. R Microprocessor Chip Set, Product Overview.

R.S. Nikhil, G.M. Papadopoulos, and Arvind. *T: A Multithreaded Massively Parallel Architecture. In International Symposium on Computer Architecture.

W. Pugh. The Omega Test: A Fast and Practical Integer Programming Algorithm for Dependence Analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Allan L. Sillburt et al. A BiCMOS Modular Memory Family of DRAM and Multiport SRAM. IEEE Journal of Solid-State Circuits, March.

Gurindar S. Sohi, Scott E. Breach, and T.N. Vijaykumar. Multiscalar Processors. In International Symposium on Computer Architecture, July.

Gurindar S. Sohi and Manoj Franklin. High-Bandwidth Data Memory Systems for Superscalar Processors. In International Conference on Architectural Support for Programming Languages and Operating Systems, April.

O. Temam, C. Fricker, and W. Jalby. Cache Interference Phenomena. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.

Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In International Symposium on Computer Architecture.

B. Zerrouk, J.M. Blin, and A. Greiner. Encapsulating Networks and Routing. In Proceedings of the International Parallel Processing Symposium.