Identifying Bottlenecks in a Multithreaded Superscalar Microprocessor



Ulrich Sigmund and Theo Ungerer

VIONA Development GmbH, Karlstr., Karlsruhe, Germany

University of Karlsruhe, Dept. of Computer Design and Fault Tolerance, Karlsruhe, Germany

Abstract. This paper presents a multithreaded superscalar processor that permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Instructions can simultaneously be issued from up to eight threads, with a total issue bandwidth of eight instructions per cycle. Our results show that the eight-threaded eight-issue processor reaches a throughput of 4.2 instructions per cycle.

Introduction

Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar technique that issues up to four instructions per cycle from a single instruction stream. VLSI technology will allow future generations of microprocessors to exploit even higher degrees of instruction-level parallelism. However, the instruction-level parallelism found in a conventional instruction stream is limited.

The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the multithreaded processor. The multiprocessor chip integrates two or more complete processors on a single chip; every unit of a processor is therefore duplicated and used independently of its copies on the chip. In contrast, the multithreaded processor stores multiple contexts in different register sets on the chip. The functional units are multiplexed between the threads that are loaded in the register sets. Unlike the multiprocessor chip, multithreaded processors tolerate memory latencies by overlapping the long-latency operations of one thread with the execution of other threads. Simultaneous multithreading [1] combines a wide-issue superscalar instruction dispatch with multithreading: instructions are simultaneously issued from several instruction queues, so the issue slots of a wide-issue processor can be filled by operations of several threads.
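The slot-filling idea can be sketched in a few lines. The selection policy below (round-robin across the per-thread instruction queues) is our own illustrative assumption; the paper does not prescribe a particular selection order.

```python
from collections import deque

def issue_cycle(queues, issue_bandwidth):
    """Fill the issue slots of one cycle from several per-thread
    instruction queues. Hypothetical round-robin policy: each ready
    thread contributes one instruction per pass until the slots are
    full or the queues are empty."""
    slots = []
    while len(slots) < issue_bandwidth and any(queues):
        for q in queues:
            if q and len(slots) < issue_bandwidth:
                slots.append(q.popleft())
    return slots

# Two threads together fill a 4-wide issue window that neither
# could fill alone in this cycle.
t0 = deque(["t0.add", "t0.load"])
t1 = deque(["t1.mul", "t1.branch", "t1.sub"])
print(issue_cycle([t0, t1], 4))
```

With a single thread, any slot beyond that thread's available instructions stays empty; with several threads, the unused slots are taken by the other contexts — the central point of the simultaneous-multithreading argument above.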

While the simultaneous multithreading approach [1] surveys enhancements of the Alpha processor, our multithreaded superscalar approach is based on a simplified PowerPC 604 processor [2]. Both approaches, however, are similar in their instruction issuing policy. This paper focuses on the identification and avoidance of bottlenecks in the multithreaded superscalar processor. Further simulations, shown in [3], install the same base processor in a multiprocessor chip and compare it with the multithreaded superscalar approach.

The Multithreaded Superscalar Processor Model

Our multithreaded superscalar processor uses various kinds of modern microarchitecture techniques, e.g. branch prediction, in-order dispatch, independent execution units with reservation stations, rename registers, out-of-order execution, and in-order completion. We apply the full instruction pipeline of the PowerPC 604 [2] and extend it to employ multithreading. However, we simplify the instruction set (using an extended DLX instruction set [4] instead), use static instead of dynamic branch prediction, and renounce the floating-point unit. The processor model is designed to be scalable: the number of parallel threads, the sizes of internal buffers, register sets, and caches, and the number and type of execution units are not limited by any architectural specification.
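The scalable parameter space described above can be summarized as a plain configuration record. The field names and the example values below are our own placeholders for illustration, not identifiers taken from the simulator; only the thread count, issue bandwidth, and the single load/store unit are stated in the paper.

```python
from dataclasses import dataclass

@dataclass
class ProcessorConfig:
    """Illustrative parameter set for the scalable processor model."""
    threads: int                 # hardware contexts / register sets
    issue_bandwidth: int         # instructions issued per cycle
    fetch_units: int
    decode_units: int
    rename_registers: int        # placeholder value below
    completion_queue_slots: int  # placeholder value below
    simple_int_units: int        # placeholder value below
    load_store_units: int = 1    # the paper keeps a single load/store unit

# Fetch and dispatch buffer sizes follow the issue bandwidth, as in the paper.
cfg = ProcessorConfig(threads=8, issue_bandwidth=8, fetch_units=2,
                      decode_units=2, rename_registers=32,
                      completion_queue_slots=32, simple_int_units=4)
print(cfg.threads, cfg.issue_bandwidth, cfg.load_store_units)
```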

We conducted a software simulation evaluating various configurations of the multithreaded superscalar processor. For the simulation results presented below, we used separate set-associative data and instruction caches with burst cache fill, a fully associative branch target address cache, a set of general-purpose registers per thread, additional rename registers, a completion queue, several simple integer units, and single complex integer, load/store, branch, and thread control units, each with its own reservation station. The number of simultaneously fetched instructions and the sizes of the fetch and dispatch buffers are adjusted to the issue bandwidth. We vary the issue bandwidth and the number of hosted threads in each case from one up to eight, according to the total number of execution units.
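Read off the axes of Fig. 1, the simulated design space is the cross product of thread counts and issue bandwidths from one to eight; a quick enumeration (our sketch, assuming both ranges are 1 to 8):

```python
# Each (threads, issue-bandwidth) pair is one simulated configuration.
configs = [(threads, issue) for threads in range(1, 9)
                            for issue in range(1, 9)]
print(len(configs))  # → 64
```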

We chose a multithreaded simulation workload that represents general-purpose programs without floating-point instructions for a typical register-window architecture. We assume a fixed mix of simple integer, complex integer, load, store, branch, and thread control instructions.

Performance Results

The simulation results in Fig. 1 (left) show that the single-threaded eight-issue superscalar processor reaches only a modest throughput, measured in average instructions per cycle. The four-issue approach is even slightly better, due to instruction cache thrashing in the eight-issue case.

Increasing the number of threads from which instructions are simultaneously issued to the issue slots also increases performance. The throughput reaches a plateau where neither increasing the number of threads nor the issue bandwidth significantly raises the number of executed instructions. When the issue bandwidth is kept small and four to eight threads are regarded, we expect a full exploitation of the issue bandwidth. As seen in Fig. 1 (left), however, the issue bandwidth is only partially utilized. Even a highly multithreaded processor seems unable to fully exploit the issue bandwidth. A further analysis reveals the single fetch and decode units as bottlenecks, leading to starvation of the dispatch unit.
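The starvation effect has a simple back-of-the-envelope reading: in steady state, retired instructions per cycle cannot exceed the delivery rate of the narrowest in-order front-end stage. The stage widths below are illustrative assumptions, not the simulator's parameters.

```python
def sustained_ipc_bound(fetch_bw, decode_bw, issue_bw):
    """Toy steady-state bound: sustained IPC is capped by the
    narrowest of the in-order fetch, decode, and issue stages."""
    return min(fetch_bw, decode_bw, issue_bw)

# A single 4-wide fetch/decode path starves an 8-wide issue stage;
# doubling fetch and decode lifts the front-end ceiling to the issue width.
print(sustained_ipc_bound(4, 4, 8))  # → 4
print(sustained_ipc_bound(8, 8, 8))  # → 8
```

This is consistent with the observation that adding threads alone cannot raise throughput once the front end, rather than the available parallelism, is the limiter.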

Therefore we apply two independent fetch and two decode units; the simulation results are shown in Fig. 1 (right). The gradients of the graphs representing the multithreaded approaches are much steeper, indicating an increased throughput. The processor throughput of an eight-threaded eight-issue processor is about four times higher than in the single-threaded eight-issue case. However, the range of linear gain, where the number of executed instructions equals the issue bandwidth, ends at an issue bandwidth of about four instructions per cycle. The throughput reaches a plateau at slightly above four instructions per cycle, where neither increasing the number of threads nor the issue bandwidth significantly raises the number of executed instructions. Moreover, the diagram on the right side of Fig. 1 shows only a marginal gain in instruction throughput when we advance from a four- to an eight-issue bandwidth. This marginal gain is nearly independent of the number of threads in the multithreaded processor.

[Figure 1 omitted: two panels plotting the number of instructions per cycle (0 to 8) against the issue bandwidth (1 to 8) for one to eight threads.]

Fig. 1. Average instruction throughput per processor, with one fetch and one decode unit (left) and with two fetch and two decode units (right)

Further simulations (see Fig. 2) showed that no performance increase is yielded when the number of slots in the completion queue is increased beyond the size assumed in the simulations above: instruction execution is limited by true data dependencies and by control dependencies that cannot be removed. Also, four write-back ports and the assumed number of rename registers seem appropriate.

The load/store unit and the memory subsystem remain as the main bottleneck that may potentially be removed by a different configuration. The load/store unit is limited to the execution of a single instruction per cycle. Duplication of the load/store unit definitely increases performance; however, two or more load/store units that access a single data cache are difficult to implement because of consistency and thrashing problems. Using an instruction mix in which load and store instructions make up one fifth of all instructions potentially allows a processor throughput of five, instead of the measured 4.2, instructions per cycle with a single load/store unit.
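The five-instructions-per-cycle ceiling follows from simple arithmetic: one load/store unit retires at most one memory instruction per cycle, so a memory-instruction fraction f in the mix caps sustained throughput at 1/f.

```python
def max_ipc_single_lsu(mem_fraction):
    """Sustained IPC cap with one load/store unit: at most one memory
    instruction retires per cycle, so IPC <= 1 / mem_fraction."""
    assert 0 < mem_fraction <= 1
    return 1.0 / mem_fraction

# One instruction in five is a load or store -> at most 5 IPC.
print(max_ipc_single_lsu(0.2))  # → 5.0
```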

[Figure 2 omitted: a surface plot of unused issue slots over the number of rename registers (8 to 56), completion queue slots (8 to 32), and write-back ports (1 to 8).]

Fig. 2. Sources of unused issue slots

To evaluate whether a different memory configuration might increase the throughput, we simulated different cache sizes, cache line sizes, cache schemes (direct mapped, set-associative), workloads, and numbers of active threads. The simulations show that our simulated processor is able to completely hide all latencies caused by cache refills through its multithreaded execution model. The multithreaded superscalar processor reaches the maximum throughput that is possible with a single load/store unit. Penalties caused by data cache misses are responsible for the difference to the theoretical maximum throughput of five instructions per cycle.
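The latency-hiding claim follows the usual interleaving argument: a cache refill of L cycles is fully hidden when the remaining threads can supply at least L cycles of independent work. The function below is a rough sketch of that condition (our simplification, not the simulator's model).

```python
def refill_hidden(threads, ready_work_per_thread, refill_cycles):
    """True if the other threads can cover a cache refill: while one
    thread waits, (threads - 1) contexts each contribute some cycles
    of independent, ready-to-issue work."""
    return (threads - 1) * ready_work_per_thread >= refill_cycles

# Eight threads easily cover a refill that two threads cannot.
print(refill_hidden(8, 4, 20))  # → True
print(refill_hidden(2, 4, 20))  # → False
```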

Conclusion

This paper surveyed bottlenecks in a multithreaded superscalar processor based on the PowerPC 604, taking various configurations into consideration. Using an instruction mix with one fifth load and store instructions, the performance results show for an eight-issue processor with four to eight threads that two instruction fetch units, two decode units, four integer units, four write-back ports, and the assumed numbers of rename registers and completion queue slots are sufficient. The single load/store unit proves to be the principal bottleneck, because it cannot easily be duplicated. The multithreaded superscalar processor (eight-threaded, eight-issue) is able to completely hide the latencies caused by burst cache refills. It reaches a throughput of 4.2 instructions per cycle, close to the theoretical maximum of five that is possible with a single load/store unit.

References

1. Tullsen, D. E., Eggers, S. J., Levy, H. M.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. 22nd Ann. Int. Symp. on Computer Architecture (1995)
2. Song, S. P., Denman, M., Chang, J.: The PowerPC 604 RISC Microprocessor. IEEE Micro (1994)
3. Sigmund, U., Ungerer, Th.: Evaluating a Multithreaded Superscalar Microprocessor vs. a Multiprocessor Chip. 4th PASA Workshop, Juelich, World Scientific Publ. (1996)
4. Hennessy, J. L., Patterson, D. A.: Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo
