Identifying Bottlenecks in a Multithreaded Superscalar Microprocessor

Ulrich Sigmund and Theo Ungerer

VIONA Development GmbH, Karlstr. [...], D-[...] Karlsruhe, Germany
University of Karlsruhe, Dept. of Computer Design and Fault Tolerance, D-[...] Karlsruhe, Germany

Abstract. This paper presents a multithreaded superscalar processor that permits several threads to issue instructions to the execution units of a wide superscalar processor in a single cycle. Instructions can simultaneously be issued from up to [...] threads with a total issue bandwidth of [...] instructions per cycle. Our results show that the [...]-threaded [...]-issue processor reaches a throughput of [...] instructions per cycle.

Introduction

Current microprocessors utilize instruction-level parallelism by a deep processor pipeline and by the superscalar technique that issues up to four instructions per cycle from a single thread. VLSI technology will allow future generations of microprocessors to exploit instruction-level parallelism up to [...] instructions per cycle or more. However, the instruction-level parallelism found in a conventional instruction stream is limited.

The solution is the additional utilization of more coarse-grained parallelism. The main approaches are the multiprocessor chip and the multithreaded processor. The multiprocessor chip integrates two or more complete processors on a single chip. Therefore every unit of a processor is duplicated and used independently of its copies on the chip. In contrast, the multithreaded processor stores multiple contexts in different register sets on the chip. The functional units are multiplexed between the threads that are loaded in the register sets. Multithreaded processors tolerate memory latencies by overlapping the long-latency operations of one thread with the execution of other threads, in contrast to the multiprocessor chip approach.

Simultaneous multithreading combines a wide-issue superscalar instruction dispatch with multithreading. Instructions are simultaneously issued from several
instruction queues. Therefore the issue slots of a wide-issue processor can be filled by operations of several threads.

While the simultaneous multithreading approach surveys enhancements of the Alpha processor, our multithreaded superscalar approach is based on a simplified PowerPC processor. Both approaches, however, are similar in their instruction issuing policy.

This paper focuses on the identification and avoidance of bottlenecks in the multithreaded superscalar processor. Further simulations, shown in [...], install the same base processor in a multiprocessor chip and compare it with the multithreaded superscalar approach.

The Multithreaded Superscalar Processor Model

Our multithreaded superscalar processor uses various modern microarchitecture techniques, e.g. branch prediction, in-order dispatch, independent execution units with reservation stations, rename registers, out-of-order execution, and in-order completion. We apply the full instruction pipeline of the PowerPC and extend it to employ multithreading. However, we simplify the instruction set (using an extended DLX instruction set instead), use static instead of dynamic branch prediction, and omit the floating-point unit.

The processor model is designed to be scalable: the number of parallel threads, the sizes of internal buffers, register sets and caches, and the number and type of execution units are not limited by any architectural specification.

We conducted a software simulation evaluating various configurations of the multithreaded superscalar processor. For the simulation results presented below we used separate [...]-KByte [...]-way set-associative data and instruction caches with [...]-Byte cache lines, a cache fill burst rate of [...], a fully-associative [...]-entry branch target address cache, [...] general-purpose registers per thread, [...] rename registers, a [...]-entry completion queue, [...] simple integer units, and single complex integer, load/store, branch, and thread control units, each execution unit with a [...]-entry reservation station. The number of simultaneously
fetched instructions and the sizes of fetch and dispatch buffers are adjusted to the issue bandwidth. We vary the issue bandwidth and the number of hosted threads in each case from [...] up to [...], according to the total number of execution units.

We choose a multithreaded simulation workload that represents general-purpose programs without floating-point instructions for a typical register window architecture. We assume [...] integer, [...] complex integer, [...] load, [...] store, [...] branch, and [...] thread control instructions.

Performance Results

The simulation results in Fig. [...] (left) show that the single-threaded [...]-issue superscalar processor only reaches a performance of [...] executed instructions per cycle (throughput is measured in average instructions per cycle). The four-issue approach is slightly better with [...], due to instruction cache thrashing in the [...]-issue case. Increasing the number of threads from which instructions are simultaneously issued to the issue slots also increases performance. The throughput reaches a plateau at about [...] instructions per cycle, where neither increasing the number of threads nor the issue bandwidth significantly raises the number of executed instructions.

When the issue bandwidth is kept small ([...] to [...] instructions per cycle) and four to eight threads are regarded, we expect a full exploitation of the issue bandwidth. As seen in Fig. [...] (left), however, the issue bandwidth is only utilized by about [...]. Even a highly multithreaded processor seems unable to fully exploit the issue bandwidth. A further analysis reveals the single fetch and decode units as bottlenecks, leading to starvation of the dispatch unit.

Therefore we apply two independent fetch and two decode units; the simulation results are shown in Fig. [...] (right). The gradients of the graphs representing the multithreaded approaches are much steeper, indicating an increased throughput. The processor throughput in an [...]-threaded [...]-issue processor is about four times higher than in the single-threaded [...]-issue case. However, the range of linear gain, where the number of
executed instructions equals the issue bandwidth, ends at an issue bandwidth of about four instructions per cycle. The throughput reaches a plateau at about [...] instructions per cycle, where neither increasing the number of threads nor the issue bandwidth significantly raises the number of executed instructions. Moreover, the diagram on the right side in Fig. [...] shows a marginal gain in instruction throughput when we advance from a [...]-issue to an [...]-issue bandwidth. This marginal gain is nearly independent of the number of threads in the multithreaded processor.

[Figure: two panels; axes: issue bandwidth, number of threads, number of instructions per cycle]
Fig. [...]. Average instruction throughput per processor with one fetch and one decode unit (left) and with two fetch and two decode units (right)

Further simulations (see Fig. [...]) showed that a performance increase is not yielded when the number of slots in the completion queue is increased over the [...] slots assumed in the simulations above. Instruction execution is limited by true data dependencies and by control dependencies that cannot be removed. Also, four writeback ports and [...] rename registers seem appropriate.

The load/store unit and the memory subsystem remain as the main bottleneck that may potentially be removed by a different configuration. The load/store unit is limited to the execution of a single instruction per cycle. Duplication of the load/store unit definitely increases performance. However, two or more load/store units that access a single data cache are difficult to implement because of consistency and thrashing problems. Using an instruction mix with [...] load and store instructions potentially allows a processor throughput of five instead of the measured [...] instructions per cycle with a single load/store unit.

[Figure: axes: rename registers, completion queue slots, writeback ports]
Fig. [...]. Sources of unused issue slots

To evaluate whether a different memory configuration might increase the throughput, we simulated different cache sizes, cache line sizes (from [...] to [...] bytes), cache schemes (direct mapped, set associative), workloads, and numbers of active threads. The simulations show that our simulated processor is able to completely hide all latencies caused by cache refills by its multithreaded execution model. The multithreaded superscalar processor reaches the maximum throughput that is possible with a single load/store unit. Penalties caused by data cache misses are responsible for the difference to the theoretical maximum throughput of five instructions per cycle.

Conclusion

This paper surveyed bottlenecks in a multithreaded superscalar processor based on the PowerPC microarchitecture, taking various configurations into consideration. Using an instruction mix with [...] load and store instructions, the performance results show for an [...]-issue processor with four to eight threads that two instruction fetch and two decode units, four integer units, [...] rename registers, four register ports, and a completion queue with [...] slots are sufficient. The single load/store unit proves to be the principal bottleneck because it cannot easily be duplicated. The multithreaded superscalar processor ([...]-threaded, [...]-issue) is able to completely hide latencies caused by burst cache refills. It reaches the maximum throughput of [...] instructions per cycle that is possible with a single load/store unit.

References

Tullsen, D. E., Eggers, S. J., Levy, H. M.: Simultaneous Multithreading: Maximizing On-Chip Parallelism. The [...]nd Ann. Int. Symp. on Comp. Arch.
Song, S. P., Denman, M., Chang, J.: The PowerPC RISC Microprocessor. IEEE Micro, Vol. [...], No. [...]
Sigmund, U., Ungerer, Th.: Evaluating a Multithreaded Superscalar Microprocessor vs. a Multiprocessor Chip. [...]th PASA Workshop, Juelich. World Sc. Publ.
Hennessy, J. L., Patterson, D. A.: Computer Architecture: a Quantitative Approach. San Mateo

This article was processed using the LaTeX macro package with LLNCS style.
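
The load/store argument in the performance results (a single load/store unit executes one instruction per cycle, so the share of load and store instructions caps the overall throughput) can be illustrated with a small calculation. The following is only a sketch under stated assumptions, not the authors' simulator: the function names are hypothetical, and the load/store fraction of 0.2 and the issue bandwidth of 8 are illustrative values chosen because a fraction of 0.2 would be consistent with the stated bound of five instructions per cycle.

```python
def unit_bound(unit_count, instruction_fraction):
    """Highest sustainable IPC if `unit_count` units each execute one
    instruction per cycle and `instruction_fraction` of all instructions
    must pass through that kind of unit."""
    if not 0.0 < instruction_fraction <= 1.0:
        raise ValueError("instruction_fraction must be in (0, 1]")
    return unit_count / instruction_fraction


def throughput_bound(issue_bandwidth, unit_count, instruction_fraction):
    """Upper bound on IPC: the smaller of the issue bandwidth and the
    bound imposed by the restricted unit class."""
    return min(float(issue_bandwidth),
               unit_bound(unit_count, instruction_fraction))


def issue_utilization(measured_ipc, issue_bandwidth):
    """Fraction of the issue slots that are actually filled."""
    return measured_ipc / issue_bandwidth


if __name__ == "__main__":
    # Assumptions for illustration only (not taken from the paper's text):
    ASSUMED_LOAD_STORE_FRACTION = 0.2
    ASSUMED_ISSUE_BANDWIDTH = 8

    for units in (1, 2):
        bound = throughput_bound(ASSUMED_ISSUE_BANDWIDTH, units,
                                 ASSUMED_LOAD_STORE_FRACTION)
        print(f"{units} load/store unit(s): at most {bound:.1f} IPC")
```

Under these assumptions a single load/store unit limits the processor to five instructions per cycle regardless of the number of threads, while a second unit would move the limit to the issue bandwidth itself. This simple model ignores cache misses, which the text names as the reason the measured throughput stays below the theoretical maximum.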