Evaluating a Multithreaded Superscalar

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by KITopen EVALUATING A MULTITHREADED SUPERSCALAR MICROPROCESSOR VERSUS AMULTIPROCESSOR CHIP T UNGERER Dept of Computer Design and Fault Tolerance D Karlsruhe Germany Phone Fax ungererInformatikUniKarlsruhede U SIGMUND VIONA Development GmbH Karlstr D Karlsruhe Germany Phone Fax ulivionade This pap er examines implementation techniques for future generations of micro pro cessors While the wide sup erscalar approach which issues and more in structions p er cycle from a single thread fails to yield a satisfying p erformance its combination with techniques that utilize more coarsegrained parallelism is very promising These techniques are multithreading and multipro cessing Multi threaded sup erscalar p ermits several threads to issue instructions to the execution units of a wide sup erscalar pro cessor in a single cycle Multipro cessing integrates two or more sup erscalar pro cessors on a single chip Our results showthatthe threaded issue sup erscalar pro cessor reaches a p erformance of executed instructions p er cycle Using the same numb er of threads the multipro cessor chip reaches a higher throughput than the multithreaded sup erscalar approach How ever if chip costs are taken into consideration a threaded issue sup erscalar pro cessor outp erforms a multipro cessor chip built from singlethreaded pro cessors by a factor of in p erformancecost relation Intro duction Current micropro cessors utilize instructionlevel parallelism by a deep pro ces sor pip eline and by the sup erscalar instruction issue technique DEC Alpha PowerPC and MIPS R Sun UltraSparc and HP PA issue up to four instructions p er cycle from a single thread VLSItechnology will allow future generations of micropro cessors to exploit instructionlevel par allelism up to instructions p er cycle or more Possible techniques are a wide sup erscalar approach IBM power pro cessor is a issue sup erscalar pro ces sor the VLIWapproach the SIMD approach within a pro cessor as in the HP PALC and the CISCapproach where a single instruction is dynam ically split into its RISC particles as in the AMD K or the Intel PentiumPro However the instructionlevel parallelism found in a conventional instruction stream is limited Recent studies show the limits of pro cessor utilization even of to days sup erscalar micropro cessors Using the SPEC b enchmark suite the PowerPC shows an execution of to instructions p er cycle and even an issue Alpha pro cessor will fail to sustain instructions per cycle The solution is the additional utilization of more coarsegrained parallelism The main approaches are the multipro cessor chip and the multithreaded pro ces sor The multipro cessor chip integrates twoor more complete pro cessors on a single chip Therefore every unit of a pro cessor is duplicated and used in dep endently of its copies on the chip For example the Texas Instruments TMSC Multimedia Video Pro cessor integrates four digital signal pro cessors and a scalar RISC pro cessor on a single chip In con trast the multithreaded pro cessor stores multiple contexts in dierent register sets on the chip The functional units are multiplexed between the threads that are loaded in the register sets Dep ending on the sp ecic mul tithreaded pro cessor design only a single instruction pip eline is used or a single dispatch unit issues instructions from dierent instruction buers si multaneously Because of the multiple register sets context switchingisvery fast Multithreaded pro cessors tolerate memory latencies byoverlapping the longlatency op erations of one thread with the execution of other threads in contrast to the multipro cessor chip approach While the multipro cessor chip is easier to implement use of multithreading in addition to a wide issue bandwidth is a promising approach Several ap proaches of multithreaded pro cessors exist in commercial and in researchma chines The cyclebycycle interleaving approach exemplied by the Denelcor HEP and the Tera pro cessor switches contexts each cycle Because only a single instruction p er context is allowed in the pip eline the single thread p erformance is extremely p o or The blo ckinterleaving approach exemplied by the MIT Sparcle pro ces sor executes a single thread until it reaches a longlatency op eration suchasaremotecache miss or a failed synchronization at which p oint it switches to another context The Rhamma pro cessor switches con texts whenever a load store or synchronization op eration is discovered The simultaneous multithreading approach combines a wide issue su p erscalar instruction dispatch with the multiple context approachbypro viding several register sets on the micropro cessor and issuing instructions from several instruction queues simultaneously Therefore the issue slots of a wide issue pro cessor can b e lled by op erations of several threads Latencies o ccurring in the execution of single threads are bridged byis suing op erations of the remaining threads loaded on the pro cessor In principle the full issue bandwidth can b e utilized While the simultaneous multithreading approachsurveys enhancements of the Alpha pro cessor our multithreaded sup erscalar approach is based on the PowerPC Both approaches however are similar in their instruction issuing p olicy Wesimulate the full instruction pip eline of the PowerPC and extend it to employmultithreading However we simplify the instruction set using an extended DLX instruction set instead use static instead of dynamic branch prediction and renounce the oating p ointunit We install the same base pro cessor in our multipro cessor chip simulations to guarantee a fair comparison with the multithreaded sup erscalar approach Evaluation Metho dology The Superscalar Base Processor Our sup erscalar base pro cessor see Fig implements the sixstage instruc tion pip eline of the Po werPC pro cessor fetch deco de dispatch execute complete and writeback The pro cessor uses various kinds of mo dern microarchitecture techniques as eg separate co de and data caches branch target address cache static branch prediction inorder dispatch indep endent execution units with reservation stations rename registers outoforder execution and inorder completion The fetch and deco de units always work on a continuous blo ck of instructions These blo cks maynotalways b e the maximum size as there are limitations by the cache size instruction fetch cannot overlap cache lines and by branches that are predicted to b e taken As blo cksizewecho ose the numb er of instruc tions that could b e issued simultaneously The dispatch unit is restricted by the maximum issue bandwidth the maxi mum numb er of instructions that can b e issued simultaneously to the execution Instruction Memory Instruction Cache Branch Target Address Cache Fetch Unit Register Set Fetch Buffer Register units Decode Unit Register Control pipeline Dispatch Buffer Rename Reservation Dispatch Unit Execution units Stations Completion Execution Units Buffer Data Cache Completion Unit Data Memory Figure Sup erscalar architecture units The maximum issue bandwidth of a pro cessor is mainly limited bythe restricted numb er of busses from the rename registers to the execution units and not by the instruction selection matrix Due to the issuing p olicy of the PowerPC executions are always issued in program order to the reservation stations of the execution units The instructions are executed outoforder by the execution units The use of separate reservation stations for the execu tion units simplies the dispatch b ecause only the lack of instructions the lack of resources and the maximum bandwidth limit the simultaneous instruc tion issue rate Data dep endencies are taken care of by rename registers and reservation station Instructions can b e dispatched to the reservation stations without checking for control or data dep endencies An instruction will not b e executed until all source op erands are available All standard integer op erations are executed in a single cycle by the simple integer units Only the multiply and the divide instructions are executed in a complex integer unit The execution of multiply instructions is fully pip elined and consumes a sp ecied numb er of cycles The integer divide is not pip elined its latency may also b e sp ecied in the simulator The branch unit executes one branch instruction p er cycle Branch prediction starts in the fetch unit using a branch target address cache A simple static branch prediction technique is applied in the deco de unit Eachforward branch is predicted as not taken eachbackward branch as taken A static branch prediction simplies pro cessor design but reduces the prediction accuracy Completion is controlled by the completion unit which retires instructions in program order with the same maximum rate as the maximum issue bandwidth When an instruction is retired its result is copied from the rename register to its register in the register set The rename register and the slot in the completion buer are released The memory interface is dened as a standard DRAM interface with cong urable burst sizes and delaystosimulate advanced RAM typ es like SDRAM or EDO All caches are highly congurable to test p enalties due to cache thrashing caused bymultiple threads The pro cessor is designed scalable the n umber of execution units and the size of buers are not limited byanyarchitectural sp ecication This allows exp erimentation to nd an optimized conguration for an actual pro cessor

Load more