
EVALUATING A MULTITHREADED SUPERSCALAR MICROPROCESSOR VERSUS A MULTIPROCESSOR CHIP

T. UNGERER
Dept. of Design and Fault Tolerance, University of Karlsruhe, Karlsruhe, Germany
ungerer@informatik.uni-karlsruhe.de

U. SIGMUND
VIONA Development GmbH, Karlstr., Karlsruhe, Germany
uli@viona.de

This paper examines implementation techniques for future generations of microprocessors. While the wide superscalar approach, which issues eight and more instructions per cycle from a single instruction stream, fails to yield a satisfying performance, its combination with techniques that utilize more coarse-grained parallelism is very promising. These techniques are multithreading and multiprocessing. Multithreaded superscalar permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Multiprocessing integrates two or more superscalar processors on a single chip. Our results show that the 8-threaded 8-issue superscalar processor reaches a performance of 4.2 executed instructions per cycle. Using the same number of threads, the multiprocessor chip reaches a higher throughput than the multithreaded superscalar approach. However, if chip costs are taken into consideration, a 4-threaded 4-issue superscalar processor outperforms a multiprocessor chip built from single-threaded processors by a factor of 1.8 in performance/cost relation.

Introduction

Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar instruction issue technique. DEC Alpha 21164, PowerPC 604 and 620, MIPS R10000, Sun UltraSparc, and HP PA-8000 issue up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism of up to eight instructions per cycle or more. Possible techniques are a wide superscalar approach (IBM's Power2 processor is a six-issue superscalar processor), the VLIW approach, the SIMD approach within a processor as in the HP PA-7100LC, and the CISC approach where a single instruction is dynamically split into its RISC particles, as in the AMD K5 or the Pentium Pro.

However, the instruction-level parallelism found in a conventional instruction stream is limited. Recent studies show the limits of processor utilization even of today's superscalar microprocessors: using the SPEC benchmark suite, the PowerPC 620 sustains fewer than two instructions per cycle [1], and even an 8-issue Alpha processor will fail to sustain 1.5 instructions per cycle [2].

The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the multithreaded processor. The multiprocessor chip integrates two or more complete processors on a single chip; every unit of a processor is duplicated and used independently of its copies on the chip. For example, the Texas Instruments TMS320C80 Multimedia Video Processor integrates four digital signal processors and a scalar RISC processor on a single chip [3].

In contrast, the multithreaded processor stores multiple contexts in different register sets on the chip. The functional units are multiplexed between the threads that are loaded in the register sets. Depending on the specific multithreaded processor design, either only a single instruction pipeline is used, or a single dispatch unit issues instructions from different instruction buffers simultaneously. Because of the multiple register sets, context switching is very fast. In contrast to the multiprocessor chip approach, multithreaded processors tolerate memory latencies by overlapping the long-latency operations of one thread with the execution of other threads.

While the multiprocessor chip is easier to implement, the use of multithreading in addition to a wide issue bandwidth is a promising approach. Several approaches of multithreaded processors exist in commercial and in research machines:

The cycle-by-cycle interleaving approach, exemplified by the Denelcor HEP [4] and the Tera machine [5], switches contexts each cycle. Because only a single instruction per context is allowed in the pipeline, the single-thread performance is extremely poor.

The block-interleaving approach, exemplified by the MIT Sparcle processor [6], executes a single thread until it reaches a long-latency operation, such as a remote cache miss or a failed synchronization, at which point it switches to another context. The Rhamma processor [7] switches contexts whenever a load, store, or synchronization operation is discovered.

The simultaneous multithreading approach [2] combines a wide-issue superscalar instruction dispatch with the multiple-context approach by providing several register sets on the microprocessor and issuing instructions from several instruction queues simultaneously. Therefore, the issue slots of a wide-issue processor can be filled by operations of several threads. Latencies occurring in the execution of single threads are bridged by issuing operations of the remaining threads loaded on the processor. In principle, the full issue bandwidth can be utilized.

While the simultaneous multithreading approach [2] surveys enhancements of the Alpha 21164 processor, our multithreaded superscalar approach is based on the PowerPC 604. Both approaches, however, are similar in their instruction issuing policy. We simulate the full instruction pipeline of the PowerPC 604 and extend it to employ multithreading. However, we simplify the instruction set (using an extended DLX instruction set instead), use static instead of dynamic branch prediction, and renounce the floating-point unit. We install the same base processor in our multiprocessor chip to guarantee a fair comparison with the multithreaded superscalar approach.

Evaluation Methodology

The Superscalar Base Processor

Our superscalar base processor (see Fig. 1) implements the six-stage instruction pipeline of the PowerPC 604 processor: fetch, decode, dispatch, execute, complete, and write-back.

The processor uses various kinds of modern microarchitectural techniques, e.g. separate code and data caches, a branch target address cache, static branch prediction, in-order dispatch, independent execution units with reservation stations, rename registers, out-of-order execution, and in-order completion.

The fetch and decode units always work on a contiguous block of instructions. These blocks may not always be of the maximum size, as there are limitations by the cache (an instruction fetch cannot overlap cache lines) and by branches that are predicted as taken. As block size we choose the number of instructions that can be issued simultaneously.
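These fetch-block limits can be sketched in a few lines of Python; the line size, instruction size, and the prediction callback are illustrative assumptions, not parameters taken from the simulator described here:

```python
def fetch_block(pc, program, issue_bandwidth, line_size=32, insn_size=4,
                predicted_taken=lambda insn: False):
    """Collect up to issue_bandwidth contiguous instructions, stopping at
    a cache-line boundary or after a branch predicted as taken."""
    block = []
    while len(block) < issue_bandwidth:
        insn = program[pc]
        block.append(insn)
        # fetching cannot overlap a cache line: stop at the line boundary
        if (pc + insn_size) % line_size == 0:
            break
        # a predicted-taken branch ends the contiguous block
        if predicted_taken(insn):
            break
        pc += insn_size
    return block
```

A fetch starting four instructions before a line boundary thus delivers at most four instructions, regardless of the issue bandwidth.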

The dispatch unit is restricted by the maximum issue bandwidth, i.e. the maximum number of instructions that can be issued simultaneously to the execution units.

Figure 1: Superscalar architecture

The maximum issue bandwidth of a processor is mainly limited by the restricted number of buses from the rename registers to the execution units, and not by the instruction selection matrix. Due to the issuing policy of the PowerPC 604, instructions are always issued in program order to the reservation stations of the execution units; the instructions are executed out of order by the execution units. The use of separate reservation stations for the execution units simplifies the dispatch, because only the lack of instructions, the lack of resources, and the maximum bandwidth limit the simultaneous instruction issue rate. Data dependencies are taken care of by rename registers: instructions can be dispatched to the reservation stations without checking for control or data dependencies, and an instruction will not be executed until all source operands are available.

All standard integer operations are executed in a single cycle by the simple integer units. Only the multiply and divide instructions are executed in a complex integer unit. The execution of multiply instructions is fully pipelined and consumes a specified number of cycles. The integer divide is not pipelined; its latency may also be specified in the simulator.

The branch unit executes one branch instruction per cycle. Branch prediction starts in the fetch unit, using a branch target address cache. A simple static branch prediction technique is applied in the decode unit: each forward branch is predicted as not taken, each backward branch as taken. Static branch prediction simplifies the processor design but reduces the prediction accuracy.
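The static rule reduces to a comparison of branch and target addresses; a minimal sketch (the byte addresses in the usage example are hypothetical):

```python
def predict_taken(branch_pc, target_pc):
    """Static prediction: backward branches (typically loop back edges)
    are predicted taken, forward branches are predicted not taken."""
    return target_pc <= branch_pc

# A branch at address 100 back to 40 (a loop) is predicted taken;
# a branch forward to 140 (e.g. skipping an else-part) is not.
loop_prediction = predict_taken(100, 40)
skip_prediction = predict_taken(100, 140)
```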

Completion is controlled by the completion unit, which retires instructions in program order at the same maximum rate as the maximum issue bandwidth. When an instruction is retired, its result is copied from the rename register to its register in the register set; the rename register and the slot in the completion buffer are released.

The memory interface is defined as a standard DRAM interface with configurable burst sizes and delays to simulate advanced RAM types like SDRAM or EDO RAM. All caches are highly configurable to test penalties due to cache thrashing caused by multiple threads.

The processor is designed to be scalable: the number of execution units and the sizes of the buffers are not limited by any architectural specification. This allows experimentation to find an optimized configuration for an actual processor, depending on chip size and expected application load.

Multithreaded Superscalar Base Processor

While the multiprocessor chip simply comprises two or more base processors of a specific issue bandwidth, the multithreaded superscalar approach is more complicated. In the multithreaded superscalar processor (see Fig. 2), the buffers of the control pipeline (fetch buffer, dispatch buffer, and completion buffer) and the register set are duplicated according to the number of hosted threads. Each thread has its own set of buffers and registers, thus running logically independent of the other threads. The processor is designed to be scalable with respect to the number of hosted threads. To the user of a multithreaded superscalar processor, the machine behaves like a multiprocessor: each thread executes in its own context and is not affected by other threads.

Since the fetch and decode units only work on a single thread per cycle, they may also be duplicated to gain a higher throughput. Instructions are still fetched and decoded in blocks of contiguous instructions.

Figure 2: Multithreaded control pipeline

There is only a single dispatch unit and a single completion unit. The dispatch unit simultaneously selects instructions from all independent dispatch buffers, up to its maximum issue bandwidth. The completion unit simultaneously retires any number of instructions of any thread, up to a total maximum of retired instructions per cycle.

The dispatch unit is not restricted with respect to the number of instruction issues per thread: in contrast to the multiprocessor chip, there is no fixed allocation between threads and execution units in the multithreaded superscalar processor.
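As a rough illustration, the shared dispatch stage can be modelled as draining the per-thread dispatch buffers until the issue bandwidth is exhausted. The round-robin order over threads is an assumption made here for fairness of the sketch, not a policy stated in the text:

```python
from collections import deque

def dispatch_cycle(dispatch_buffers, issue_bandwidth):
    """Issue up to issue_bandwidth instructions in one cycle, taken in
    program order from any of the per-thread dispatch buffers; there is
    no fixed allocation of issue slots to threads."""
    issued = []
    progress = True
    while len(issued) < issue_bandwidth and progress:
        progress = False
        # visit the threads round-robin so no thread monopolizes the slots
        for tid, buf in enumerate(dispatch_buffers):
            if buf and len(issued) < issue_bandwidth:
                issued.append((tid, buf.popleft()))
                progress = True
    return issued
```

With two active threads and one empty buffer, the free slots of one thread are simply filled by instructions of the other.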

The rename registers, the branch target address cache, and the data and instruction caches are shared by all active threads. There is no fixed allocation of any of these resources to specific threads, which allows for maximum performance with any number of active threads.

Each thread executes in a separate register set. The contents of the registers of a register set describe the state of a thread and form a so-called activation frame. An activation frame is called active if its thread is currently executed by the processor, i.e. the activation frame is represented in a register set. In addition to the active activation frames in the processor, we provide an activation frame cache, which holds activation frames that are not currently scheduled for execution. The activation frames in the activation frame cache and in the register sets can be interchanged without a significant penalty. By adding a cache of not currently running threads, the machine can be virtualized to an arbitrary number of threads: when a thread performs an instruction with a long latency, like an external synchronization, the thread can be preempted by a ready thread from the cache.
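The interchange between a register set and the activation frame cache can be sketched as a simple swap; the class and member names are illustrative, not taken from the simulator:

```python
class ActivationFrameCache:
    """Holds activation frames of threads that are not currently loaded
    into one of the processor's register sets."""

    def __init__(self):
        self.ready = []  # frames of ready-to-run, preempted threads

    def swap(self, register_sets, slot):
        """Preempt the thread in register set `slot` (e.g. on a long-latency
        operation) and activate a ready frame from the cache instead."""
        if not self.ready:
            return False
        self.ready.append(register_sets[slot])   # evict long-latency thread
        register_sets[slot] = self.ready.pop(0)  # load a ready frame
        return True
```

Because both directions of the exchange touch only on-chip state, the swap carries no significant penalty, which is what virtualizes the machine to an arbitrary number of threads.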

The activation frame cache is also used to implement a derivation of a register window technique: a new activation frame is created for every subprogram activation, thereby significantly reducing the number of memory accesses to the data cache.

The execution units are expanded by a thread control unit that is responsible for the creation and deletion of threads and for synchronization and communication between threads. Except for the load/store unit and the thread control unit, each kind of execution unit may be arbitrarily duplicated.

Within a multithreaded processor, latencies are almost entirely covered by other threads, so the penalty created by static branch prediction should not affect the average number of executed instructions per cycle.

Simulator and Application Workload

Starting with the superscalar base processor, we conducted a simulation evaluating various configurations of the multithreaded superscalar approach and of the multiprocessor chip models. All functional units were simulated with cycle-correct behaviour. We employed an instruction-driven simulator, not a code tracer, so all execution-related effects are simulated.

The simulator is either script-driven or interactive. All functional units can be observed during the interactive execution in separate windows, which yields the ability to investigate all run-time effects in detail. In script-driven mode, versatile scripts can be defined to build up more complex test runs. Various results are collected for all units of the processors and for each thread. These reports can be evaluated automatically to create combined results for several tests.

The simulation workload is generated by a configurable workload generator, which creates random high-level programs and compiles them to our machine language. The distribution of the machine instructions is similar to that generated from high-level programs. This approach allows us to create workloads for different types of programs without the need for a complete compiler.

Performance Results

For the performance results presented in this section, we chose a multithreaded simulation workload that represents general-purpose programs without floating-point instructions for a typical register window architecture. The workload mixes the instruction types integer, complex integer, load, store, branch, and thread control.

With the single load/store unit used in our approach, the chosen workload has a theoretical maximum throughput of about five instructions per cycle, since the frequency of load and store instructions sums up to about one fifth of all instructions.
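This bound is plain arithmetic: with a single load/store unit, sustained throughput cannot exceed the reciprocal of the load/store fraction of the instruction mix. A one-function sketch, assuming a 20% load/store share:

```python
def max_ipc(ls_fraction, n_ls_units=1):
    """Upper bound on sustained instructions per cycle when load/store
    instructions can only execute on n_ls_units load/store units."""
    return n_ls_units / ls_fraction

# One load/store unit and a one-fifth load/store share bound the
# sustained throughput at about five instructions per cycle.
bound = max_ipc(0.20)
```

Doubling the number of load/store units would double this bound, which is exactly the trade-off discussed in the comparison with Tullsen's simulation below.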

For the simulation results presented below, we used separate set-associative data and instruction caches, a fully associative branch target address cache, general-purpose registers per thread plus shared rename registers, a completion buffer, up to eight simple integer units, single complex integer, load/store, branch, and thread control units, and a reservation station at each execution unit. For the multithreaded approach we used two fetch units and two decode units. The number of simultaneously fetched instructions and the sizes of the fetch and dispatch buffers are adjusted to the issue bandwidth. We vary the issue bandwidth and the number of hosted threads, in each case from one up to eight, according to the total number of execution units.

The simulation results in Fig. 3 show that the single-threaded 8-issue superscalar processor reaches only a low throughput in executed instructions per cycle; the 4-issue approach is even slightly better, due to instruction cache thrashing in the 8-issue case.

We also see in Fig. 3 that the range of linear gain, where the number of executed instructions equals the issue bandwidth, ends at an issue bandwidth of about four instructions per cycle. The throughput reaches a plateau at about four instructions per cycle, where neither increasing the number of threads nor increasing the issue bandwidth significantly raises the number of executed instructions.

Figure 3: Average instruction throughput per processor

Moreover, the diagram on the right side of Fig. 3 shows a marginal gain in instruction throughput when we advance from a 4-issue to an 8-issue bandwidth. This marginal gain is nearly independent of the number of threads in the multithreaded processor.

Further simulations, shown in [11], identify the single load/store unit as a principal bottleneck. The difference between the expected five instructions per cycle and the simulation result is due to bubbles in the load/store pipeline caused by data cache misses. Our multithreaded superscalar approach reaches the maximum throughput that is possible with a single load/store unit.

To compare the multithreaded approach with a multiprocessor solution, we show in Fig. 4 the results normalized in relation to the number of threads, i.e. the number of instructions per cycle divided by the number of threads. This allows the comparison of parallel systems built up from basic multithreaded or single-threaded processors. Fig. 4 shows that a multiprocessor built of single-threaded superscalar processors delivers the highest average instruction throughput per thread (note that the throughput per processor is shown in Fig. 3). Each step to more parallel threads delivers less average throughput per thread. It looks as if the multithreaded approach falls behind the multiprocessor chip approach. However, the costs of the different processor configurations were not taken into account: the multiprocessor approach duplicates complete processors, whereas the multithreaded design only duplicates parts of the processor.

Figure 4: Average instruction throughput per thread

To verify our results, we compared them to Tullsen's simulation in [2]. We see that our simulator produces different results, especially for the 8-threaded 8-issue superscalar approach.

The reason for the deviating results of Tullsen's simulation follows from the high number of execution units in Tullsen's approach and from the limited exploitation of instruction-level parallelism in the Alpha processor used as the base of Tullsen's simulation. Compared to our simulations, Tullsen's approach favours the multithreaded superscalar approach over the multiprocessor chip approach. For example, up to eight load/store units are used in Tullsen's simulation; even ignoring hardware costs and design problems, we do not believe that it is cost-effective to implement eight simultaneously working load/store units within a multithreaded superscalar processor.

It is obvious that different processor configurations can only be compared if a measure of their costs, e.g. in chip space, is used. Otherwise, unrealistic processors are compared with each other, simply stating that more units result in more performance.

To get a rough measure of the costs of a processor configuration, we propose a formula based on the PowerPC 604 floor plan. The formula expresses hardware costs based on the chip space usage per unit. Estimated costs per unit:

Unit type               Estimated cost
Integer unit            constant
Load/store unit         constant
Fetch and decode unit   constant
Branch unit             constant
Caches                  constant
Registers               proportional to the number of threads
Completion unit         proportional to the issue bandwidth
Dispatch unit           proportional to the number of threads times the issue bandwidth

The formula is only a rule of thumb. The cost term for the dispatch unit is based on the required interconnections between the dispatch unit, the register sets, and the execution units: as each thread's register set and dispatch queue has to be connected with all execution units, the required chip space is proportional to the product of the number of hosted threads and the issue bandwidth of the processor.

Figure 5: Average instruction throughput in relation to chip costs
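The rule of thumb translates directly into a small cost model. The constant unit costs below are placeholders (the actual floor-plan values are not reproduced here); only the scaling of the register, completion, and dispatch terms follows the table above:

```python
def chip_cost(n_threads, issue_bandwidth, n_int_units=1, unit_costs=None):
    """Rough chip-space cost: fixed terms per unit, register sets scaling
    with the thread count, the completion unit with the issue bandwidth,
    and the dispatch unit with threads x issue bandwidth (interconnect)."""
    c = unit_costs or {  # placeholder constants, not the paper's values
        "integer": 1.0, "load_store": 2.0, "fetch_decode": 2.0,
        "branch": 1.0, "caches": 4.0, "register_set": 1.0,
        "completion": 0.5, "dispatch": 0.25,
    }
    return (n_int_units * c["integer"] + c["load_store"]
            + c["fetch_decode"] + c["branch"] + c["caches"]
            + n_threads * c["register_set"]
            + issue_bandwidth * c["completion"]
            + n_threads * issue_bandwidth * c["dispatch"])

def performance_per_cost(ipc, n_threads, issue_bandwidth):
    """Performance/cost relation: measured IPC divided by modelled cost."""
    return ipc / chip_cost(n_threads, issue_bandwidth)
```

With such a model, the performance/cost relation is simply the measured instructions per cycle divided by the modelled cost of the configuration that produced them.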

Fig. 5 displays the instructions per cycle in relation to the hardware costs of a specific processor configuration. The solution with four threads and an issue bandwidth of four shows the best performance/cost relation. However, this observation is application- and design-specific: the advantage of multithreading is highly dependent on the ratio of load/store instructions to other instructions in the workload, and the chip costs change with different architectural decisions.

Conclusion

This paper examined the multithreaded superscalar processor in comparison to the multiprocessor chip approach, taking performance and hardware costs into consideration. For our research study we used a simulator for multithreaded processors based on the PowerPC 604.

While the single-threaded 8-issue superscalar processor only reaches a low throughput, the 8-threaded 8-issue superscalar processor executes 4.2 instructions per cycle; the load/store frequency in the workload sets the theoretical maximum to about five instructions per cycle. Increasing the issue bandwidth from four to eight yields only a marginal gain in instruction throughput, a result that is nearly independent of the number of threads in the multithreaded processor.

A multiprocessor chip built from single-threaded superscalar processors reaches a higher throughput than the multithreaded superscalar approach when the same number of threads is used (refer to the previous paragraph). However, if we take the chip costs into consideration, a 4-threaded 4-issue superscalar processor outperforms a multiprocessor chip built from single-threaded processors by a factor of 1.8 in performance/cost relation.

References

1. T. A. Diep, C. Nelson, J. P. Shen: Performance Evaluation of the PowerPC 620 Microprocessor. 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, June 1995.
2. D. M. Tullsen, S. J. Eggers, H. M. Levy: Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, June 1995.
3. Texas Instruments: TMS320C80 Technical Brief, Multimedia Video Processor (MVP). Texas Instruments.
4. B. J. Smith: The Architecture of HEP. In: J. S. Kowalik (ed.): Parallel MIMD Computation: The HEP Supercomputer and Its Applications. The MIT Press, Cambridge, 1985.
5. R. Alverson et al.: The Tera Computer System. 4th International Conference on Supercomputing, Amsterdam, June 1990.
6. A. Agarwal et al.: The MIT Alewife Machine: Architecture and Performance. 22nd Annual International Symposium on Computer Architecture, Santa Margherita Ligure, June 1995.
7. W. Gruenewald, Th. Ungerer: Towards Extremely Fast Context Switching in a Block-multithreaded Processor. 22nd Euromicro Conference, Prague, Sept. 1996.
8. S. P. Song, M. Denman, J. Chang: The PowerPC 604 RISC Microprocessor. IEEE Micro, Vol. 14, No. 5, Oct. 1994.
9. J. L. Hennessy, D. A. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo.
10. U. Sigmund: Design of a Multithreaded Superscalar Processor. Master's Thesis, University of Karlsruhe (in German).
11. U. Sigmund, Th. Ungerer: Identifying Bottlenecks in Multithreaded Superscalar Microprocessors. To be published.