Improving Single-Process Performance with Multithreaded Processors

Alexandre Farcy, Olivier Temam

Université de Versailles
avenue des Etats-Unis
Versailles, France
{farcy,temam}@prism.uvsq.fr

Abstract

Multithreaded processors are an attractive alternative to superscalar processors. Their ability to handle multiple threads simultaneously is likely to result in higher instruction throughput and in better utilization of functional units, though they can still benefit from many hardware and software functionalities of superscalar processors, and thus consist in an evolution rather than a radical transformation of current processors. However, to date, multithreaded processors have mostly been shown capable of improving the performance of multiple-process workloads, i.e., threads with independent contexts; but in order to compete with superscalar processors, they must also prove their ability to improve single-process performance.

In this article, it is proposed to improve single-process performance by simply parallelizing a process over several threads sharing the same context, using automatic parallelization techniques already available for multiprocessors. The purpose of this article is to analyze the impact of shared-context workloads on both processor architecture and processor performance. First, based on previous research work and by reusing many components of superscalar processors, a multithreaded processor architecture is defined. Then, considering that the issues raised by shared-context workloads on data cache architectures have mostly been ignored up to now, we attempt to determine how current cache architectures can evolve to cope with shared-context workloads. The impact of shared workloads on this architecture is analyzed in detail, showing that, like multiprocessors, multithreaded processors exhibit performance bottlenecks of their own that limit single-process speedups.

Keywords: multithreaded processors, single-process performance, cache memories.

Introduction

Superscalar processors capable of issuing several instructions per cycle are now becoming a standard. However, studies indicate that the amount of intrinsic instruction-level parallelism is limited. Besides, the necessity to concurrently execute a high number of instructions from the same instruction flow induces significant hardware overheads that increase with the degree of parallelism and with the size of the instruction window from which parallelism is extracted: for instance, the necessity to check dependences between a high number of instructions, registers that need to be dynamically renamed, and buffering techniques to allow out-of-order execution.

Because of these hardware and software bottlenecks, it is not obvious that processor manufacturers will be indefinitely capable of scaling up superscalar processors. Consequently, several alternatives have been proposed in the literature and in industry. Some solutions require a deep modification of the processor execution model, like dataflow architectures, while other solutions mostly consist in evolutions of current superscalar designs, though they still address the main performance bottlenecks. Considering the pace at which hardware processor architecture can sometimes evolve, less radical transitions are more likely to be contemplated by processor manufacturers. Three such evolutions are currently being considered.

The first solution is VLIW processors, which have the advantage of strongly reducing hardware complexity but further increase the burden on the compiler, for both detecting parallelism and efficiently scheduling instructions to functional units. The two other solutions, on-chip multiprocessors and multithreaded processors, have some similarities. The architecture of on-chip multiprocessors is simple but rigid: the threads scheduled on one processor cannot exploit the functional units of the others, so that the functional unit usage ratio may not be much higher than in current processors. Multithreaded processors, as described in the seminal article by Hirata, trade simplicity for efficiency: functional units are shared by the different threads, each of which is assigned to a logical processor. Because each logical processor needs to communicate with each functional unit, the hardware interconnection overhead for functional units and register banks can be high. On the other hand, functional units are likely to be more efficiently used.

This advantage of multithreaded processors over superscalar processors has recently been demonstrated by Tullsen, who showed that multithreaded processors can be twice as efficient as high-degree superscalar processors (comparing threads with issue slots). However, up to now, the efficiency of multithreaded processors has mostly been demonstrated for heterogeneous workloads, i.e., when the workload is composed of distinct processes. But such architectures may prove viable solutions only if they are also efficient at increasing single-process performance. Multiple-instruction issue per thread has been suggested in order to optimize single-process performance while decreasing global workload execution time using multithreading. With respect to single-process performance, this solution is efficient, but it has the same limitation as superscalar processors: intrinsic instruction-level parallelism.
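The hardware-overhead argument above can be illustrated with a hypothetical first-order model (not from the article): with an issue window of W in-flight instructions, dependence checking compares each instruction's source registers against the destinations of all older instructions in the window, so the comparator count grows quadratically with W.

```python
# Hypothetical first-order model of superscalar dependence checking:
# each instruction (assumed to have 2 source registers) is compared
# against the destination register of every older instruction in the
# issue window.

def dependence_checks(window_size, sources_per_insn=2):
    checks = 0
    for pos in range(window_size):        # position in the window
        checks += sources_per_insn * pos  # compare against all older insns
    return checks                         # = sources_per_insn * W*(W-1)/2

# Doubling the window size roughly quadruples the comparisons:
assert dependence_checks(4) == 12
assert dependence_checks(8) == 56
assert dependence_checks(16) == 240
```

This quadratic trend, together with the renaming and buffering costs mentioned above, is what motivates coarser-grained alternatives to wider issue windows.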

Multithreaded processor architectures allow coarser grains of parallelism to be used. The same techniques used for multiprocessors, i.e., parallelizing at the loop nest level, can be used. Thus, multithreaded processors can readily benefit from the large amount of research on automatic parallelization. Furthermore, the cost of dispatching blocks of iterations to all processors, which is a sequential and thus often costly process repeated prior to each parallel loop, is considerably reduced, since logical processors are located on the same chip. Multiprocessors can exhibit limited efficiency because the granularity of parallelism within a loop nest is too small with respect to the number of processors and the initiation time. This obstacle disappears in multithreaded processors, because the initiation time is fairly small and, most of all, because these architectures can tolerate small granularities (a single iteration). Consequently, the same parallelization techniques used in multiprocessors can be applied to multithreaded processors without the usually associated flaws. On the other hand, it is not obvious that new performance bottlenecks specific to multithreaded processors would not occur.

The purpose of this article is twofold: first, to evaluate the capacity of multithreaded processors to improve single-process performance using the parallelization techniques that already exist for multiprocessors, and to evaluate the flexibility of multithreaded processors with different types of workloads. Second, and more important, the behavior of parallelized codes on multithreaded processors is examined in detail. The influence of each component of the architecture is discussed and the main performance bottlenecks are underlined. Techniques for coping with these limitations in future multithreaded processor architectures are discussed. The multithreaded processor and cache architectures used are detailed first; the different aspects of the experimental framework (parallelization, trace collection, simulation) are then presented; finally, multithreaded processor performance is analyzed.

Related Work

Up to now, several different concepts of multithreaded processors have been proposed. Some are simple modifications of current processors and have actually been implemented; others are more different concepts that still retain a strong resemblance with current processors. It is not the purpose of this article to propose a novel processor architecture; on the contrary, to make the results of this study relevant to a wide range of multithreaded processor architectures, specific innovations like thread synchronization support or context groups were not implemented. As mentioned above, the goal is to evaluate performance tradeoffs of multithreaded processor architectures seen as an evolution of current superscalar processors. Still, the evaluation of parallelized processes and the associated coherence issues made it necessary to propose possible evolutions of current data cache architectures.

Processor Architecture. The concepts developed in the first multithreaded processor architectures were closer to dataflow machines than to superscalar processors. The goal was both to hide latency and to increase instruction throughput by implementing enhanced context-switching techniques; novel thread-interleaving techniques have also been proposed. Because cache hierarchies, as used in current superscalar processors, are relatively successful at hiding latency,¹ larger gains can be expected from the high instruction-throughput capacities of multithreaded processors than from fast context-switching techniques, and thus recent studies are rather focused on instruction throughput issues.

In one proposal, a multithreaded processor based on the DEC 21164 is designed with the goal of implementing Simultaneous Multithreading, i.e., several threads are run concurrently. As in a superscalar processor, each thread is capable of issuing multiple instructions in the same cycle. In another proposal, the multithreaded processor architecture is also close to a superscalar processor, but functional unit utilization is improved by grouping threads so that at least one thread per group is capable of issuing. In Hirata's design, like in the two previous studies, each thread can use any of the functional units. Each thread is assigned to one of the logical processors. Data dependences are resolved with a scoreboard technique, and resource dependences with standby stations implemented before each functional unit, thus limiting the impact of resource hazards on thread issue rate. Still, hardware support for switch-on-cache-miss is proposed, because this processor is to be used in a multiprocessor environment where long latencies can be expected.² Also, shared registers are used for inter-thread communication and also serve as a synchronization means. Single-process performance improvement is briefly discussed in that article: a speedup with respect to a conventional RISC processor is reported using several threads and two load/store units (note that a perfect cache is used).

Sohi and Franklin also proposed a more novel architecture, called the multiscalar processor, between a multiprocessor concept and a multithreaded processor. A multiscalar processor allows parallel execution of tasks, each on a distinct execution unit. Those units share a common memory and communicate through a ring to forward register values. Unlike most multithreaded architectures mentioned above, functional units cannot be shared by the tasks. However, a multiscalar processor requires much work to be done at compile time, to extract tasks and optimize scheduling, and hardware recovery mechanisms to allow speculative execution of tasks. In these studies, the possibility of improving single-process performance is evaluated, showing speedups of up to … with an …-unit multiscalar processor over a conventional processor on SPECint, and the data cache issues of processors running multiple shared-context threads are examined in detail.

¹ First- and second-level caches are on-chip in the DEC 21164; the second-level cache is on the processor module in the PentiumPro.
² This processor was designed for an image synthesis application: the simulation of virtual worlds.

Cache Architecture. When multiple threads are run simultaneously, the number of cache accesses per cycle can be very high. With respect to instruction caches, the simplest solution is to use one cache per thread. The same solution can be considered for data caches. However, if a single process is parallelized over multiple threads, the same context is to be shared by several threads. The main constraint is then to maintain coherence. Assume one cache is used for each thread. False sharing can happen, like in multiprocessors, in a parallel section with no dependence between tasks (iterations). Unless the lines that are not up to date are flushed at the end of the parallel section, coherence is not enforced. Because it can be very difficult to keep track of such cases, it is not possible to extend the private-cache solution to a multithreaded processor where a process can be parallelized, i.e., where several threads can share the same context. This cache coherence problem is very similar to that of multiprocessors. However, unlike in multiprocessors, this problem can be simply solved by having all threads that

share the same context use the same cache space.

The issue is then to allow multiple accesses per cycle to this cache space. On a smaller scale, the same issue has already been raised for superscalar processors. Up to now, different forms of multibanking and multiporting have been used. Two-banked caches (odd/even) are used in the MIPS TFP and T5, the Intel Pentium and the HP PA, allowing two simultaneous accesses to two different cache banks. Data replication is used in the DEC 21164 (two identical banks; stores are serialized to preserve coherence), and virtual multiporting in the Power2 (the SRAM access time is half the processor clock, allowing two cache accesses in a single processor clock cycle). True multiporting has been used in the Intel Pentium's and IBM Power2's DTLBs, using true dual-ported memory.

Actually, among all these solutions, it seems that only the multibank concept can scale up with the number of simultaneous cache accesses per cycle. For n ports, an SRAM access time equal to 1/n-th of the processor clock is unlikely if n is large, given current clock rates. With respect to bank replication, copying the cache n times without any effective gain in available cache space is prohibitive. True multiporting seems at first the most natural solution. However, the number of ports per SRAM increases the cache surface.³ Moreover, the energy required to reload all capacity cells makes the SRAM cell access time longer. Therefore, though n-port SRAMs with n large can be used in theory, the associated increase in on-chip space and cache access time makes it a prohibitive solution.

Multithreaded Processor Architecture

The main asset of the designs proposed by Hirata and Tullsen is their scalability, i.e., the number of functional units can easily be increased. For the sake of flexibility, a processor model inspired from Hirata's has been used in this study (see Figure 1), with logical processors each issuing a single instruction per cycle, and standby stations before the functional units, as introduced in Hirata's design.

[Figure 1: Processor Architecture. Each logical processor (thread slot) has a private I-cache bank and its own fetch and decode pipeline stages. Standby stations (SBS) sit before the functional units (integer ALU, FP Add/Mul, FP Div and other units, address calculation) and the load/store units; a crossbar (Xbar) stage connects the load/store pipelines to the multi-banked data cache (banks 0-3), which is backed by the lower memory level; results are written back to the per-thread register banks in the write-back stage.]

However, unlike Hirata, register dependences are here resolved by having the standby stations also play the role of the reservation stations of a Tomasulo algorithm, allowing register bypassing. Like in Hirata's design, thread priority over each functional unit is managed with a simple round-robin algorithm to avoid starvation.

The data cache used in our architecture is split into several independent banks, demanding a crossbar to distribute the requests from the load/store units to any bank. The main drawback of such a crossbar is the increased cache access time. Yet a study shows that passing through a crossbar costs about … ns for a …-bit path width. Other uses of crossbars in on-chip and time-constrained environments — like the MIPS R8000 crossbar to dispatch instructions to the floating-point or integer units, and the full crossbar between the ports of the Intel PentiumPro's reservation station — further confirm the feasibility of such designs. With respect to performance, the main drawback of using a crossbar to access several banks is bank conflicts: two load/store instructions willing to simultaneously access the same bank must be serialized, thus stalling one of the load/store pipelines. The way a code is parallelized can strongly affect the probability of bank conflicts (see below).

In our implementation, each bank is either private to a thread, or can be gathered with several other banks to form a larger cache space that can be accessed by several threads belonging to the same context. Typically, n such threads share a cache space spread over n banks. With respect to bank sharing, only two radically opposite solutions were tested in Tullsen's study: either each process has a private cache, or all processes share all cache banks. When the number of threads is equal to the number of cache banks, like in our case, Tullsen et al. show that both solutions are equivalent. Here we propose an intermediate solution: to have the n threads of a given process (context) share n cache banks. This solution is flexible: if the number of active threads is smaller than the number of logical processors, single-thread processes can be provided with more than one bank. Though for small cache sizes this solution performs mostly like the all-banks-shared alternative, as cache size increases, inter-process misses tend to become dominant over intra-process misses. Thus, in the future it will be more critical to avoid inter-process misses by privatizing groups of banks to processes than to limit intra-process capacity/conflict misses by further increasing the available cache space. More important, this intermediate solution reduces the occurrence of bank conflicts.

If multiple load/store units are used, further coherence issues also arise. Two instructions accessing the same address in two different load/store pipelines have to complete in order. Assume one pipeline is stalled because of bank conflicts (or any cache stall): it is necessary to stall the other load/store units to enforce in-order load/store execution, as no mechanism for out-of-order execution of memory references is provided in our model. The cache banks used in this study are non-blocking, with a miss queue (close to the DEC 21164's), to decrease the impact of such stalls. Using multiple cache banks and a crossbar has also been suggested elsewhere, but the associated coherence issues were not considered in detail.

However, Franklin and Sohi propose to use an ARB (Address Resolution Buffer) to handle the above-mentioned coherence issues (memory disambiguation) in their multiscalar processor. The ARB is a cache placed between the crossbar and the cache banks which keeps track of the different load/store instructions accessing the same address. When an illegally ordered load/store sequence has been detected, the corresponding task is re-executed from the conflicting instruction. Aside from its hardware overhead, this solution is difficult to extend to multithreaded processors.
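The bank-conflict serialization described above can be made concrete with a minimal sketch. The bank count, line size and addresses below are hypothetical illustration values, not the parameters of the simulated processor:

```python
# Sketch: same-cycle requests to one bank of a multi-banked cache
# must serialize, stalling a load/store pipeline for each collision.
# Hypothetical parameters: 4 banks interleaved at 32-byte line
# granularity, 8-byte array elements.

NUM_BANKS, LINE_SIZE, ELEM = 4, 32, 8

def bank_of(addr):
    # Banks are interleaved at cache-line granularity.
    return (addr // LINE_SIZE) % NUM_BANKS

def extra_cycles(addrs):
    # One request per bank proceeds; the rest are serialized.
    banks = [bank_of(a) for a in addrs]
    return len(banks) - len(set(banks))

# Modulo scheduling: four threads touch consecutive elements
# i, i+1, i+2, i+3 in the same cycle -- one cache line, one bank.
modulo_cycle = [ELEM * i for i in range(4)]
# Block scheduling: threads work on distant contiguous blocks
# (here 100 elements apart), spreading requests over the banks.
block_cycle = [ELEM * (100 * t) for t in range(4)]

assert extra_cycles(modulo_cycle) == 3  # three requests serialized
assert extra_cycles(block_cycle) == 0   # conflict-free this cycle
```

Note that an unlucky (e.g. power-of-two) block stride could still map all blocks to the same bank, which is exactly why the way a code is parallelized affects the conflict probability.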

Indeed, it is based on the fact that, in a multiscalar processor, tasks are separated into small sets of instructions (one or a few basic blocks) dispatched in a rotating manner to the different execution units. Thus, once the oldest execution unit has completed, it is certain that no load/store ordering issue can arise. Since the size of tasks is small, few cache lines could potentially be victims of coherence conflicts at the same time, so that the ARB needs not be large. In our case, with n logical processors, one thread can correspond to 1/n-th of a parallel loop, which is far too large for an ARB. Another possible solution is the Address Reorder Buffer of the HP PA8000's load/store unit, which allows out-of-order execution of memory references and speculative execution of load/store instructions, by buffering pending LOADs and STOREs and comparing the addresses of these requests to new load/store requests to detect coherence issues.

Experimental Framework

Parallelization. To sustain the argument that current techniques of automatic parallelization can be readily applied to multithreaded processors, it was critical to use an automatic parallelizer currently available for multiprocessors. We selected KAP, a commercial Fortran preprocessor designed by KAI. KAP was used to detect parallel loop nests and to mark these loops as parallel in the source code. For that task, the efficiency of KAP was not high, but we expect to obtain better results in the future, either with improved versions of KAP or with state-of-the-art academic parallelizers like Polaris or SUIF, which include more efficient data dependence analysis techniques (the Omega Test was recently implemented in Polaris).

Because the goal was to measure the best performance that can be achieved with a parallelized process on a multithreaded processor, only loop nests without loop-carried dependences were parallelized, i.e., no synchronization techniques were used. The outer loop of each parallel loop nest was parallelized. The two most classic techniques for statically distributing loop iterations over different processors (logical processors in our case) are by blocks of contiguous iterations, and by assigning iteration i to processor i mod P, where P is the number of logical processors. For cache-based systems like the KSR1, modulo scheduling may often behave poorly because many processors need to access the same cache line at the same time. In the cache design selected (see above), this results in cache bank conflicts. Therefore, a block processor scheduling technique was finally selected. Loops are simply parallelized as shown below:

    Original loop:       Parallel loop:

    DO I=1,N             DOALL I=1,N,B
      Loop Body            DO II=I,min(I+B-1,N)
    ENDDO                    Loop Body
                           ENDDO
                         ENDDO

where B = ceil(N/P).

Traces and Simulation. Object-code traces are collected with the Spa package. The main difficulty is to collect parallel traces in such a way that different parallelization alternatives (block/modulo, varying number of logical processors) can be evaluated. The processor to which an iteration of a parallel loop must be assigned is determined on the fly during trace collection. The only information required is a flag indicating the beginning/end of a parallel section, and the iteration number of the loop being parallelized.⁴

As explained above, parallel loops are detected with KAP and marked as such in the source code. Then the Fortran-to-Fortran compiler Sage is used to instrument the code with markers (references to known addresses) so that the beginning/end of parallel loops and the beginning of each iteration can be detected during code execution. Spa was then modified to acknowledge these marker addresses and to augment the trace with the desired information.

A last delicate issue was the fate of scalar variables, like loop indices. If the same scalar variable is used by multiple threads, numerous bank conflicts are likely to occur. Thus, it is necessary to privatize scalar variables to each thread, by shifting their addresses so that two such variables don't fall in the same cache bank.

We developed our own simulator to test varying alternatives of multithreaded designs, because few multithreaded simulators were available. Still, we contemplated using Concurro, but the specific features of this design, especially the principle of gathering threads in context groups, were too restrictive. As explained above, the multithreaded processor architecture simulated is close to the one described by Hirata, except for the memory system parts, which were not described in that article. The functional unit latencies of the processor architecture simulated are one cycle for the integer ALU, … cycles for floating-point Add/Multiply, … cycles for floating-point Divide and … cycles for Load/Store. Except for FP Divide, all functional units are pipelined.⁵

Each thread is assigned to a logical processor, which includes the logic for the fetch and decode stages. Each thread is assigned a private instruction cache bank and a private register bank. Each cache bank is … Kbytes large, direct-mapped, with a …-byte line, and write-back. The number of data/instruction cache banks is always equal to the number of logical processors, i.e., the number of threads. The miss latency is equal to … cycles, and a miss request can be issued every … cycles. The codes were compiled on a Sparc.

Benchmarks. In this study, both shared-context workloads and a combination of shared-context and private-context workloads are executed. Thus, programs corresponding to classic processor workloads, as well as benchmarks that lend themselves to automatic parallelization techniques, are needed. For the first category, we selected Gcc, Compress, Simulator, Contour, Grep, Latex and Segmentation.⁶ These seven programs were only used to increase the load on the processor, and no specific performance evaluation was done on those. For numerical codes, we picked three of the Perfect Club benchmarks: FLO52, ARC2D and MDG. Since this study is focused on analyzing the behavior of multiple shared-context threads, statistics are collected over parallel sections only (even though all instructions are executed). Considering that our goal was to evaluate the best achievable performance with a multithreaded processor, we chose the three codes on which KAP was most efficient at finding parallel loops (other codes had too short parallel sections). The fraction of instructions in parallel sections is …% for FLO52, …% for ARC2D and …% for MDG. The miss ratio of each code on a single cache bank is shown in Figure 2.

³ Approximately … mW/MHz for 1 port, … mW/MHz for 2 ports, and … mW/MHz for 3 ports.
⁴ In fact, to apply block processor scheduling, the total number of iterations in a parallel loop must be known; thus, two passes are necessary for each code.
⁵ These values have been averaged over several current microprocessors (MIPS and PowerPC designs).
⁶ Contour and Segmentation are image analysis codes; Simulator is our processor simulator.
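The block decomposition shown in the DOALL example above, with B = ceil(N/P), can be sketched as follows (plain Python standing in for the Fortran; the function name is hypothetical):

```python
# Sketch: static block scheduling of N loop iterations over P
# logical processors, mirroring the DOALL transformation above.

def block_schedule(N, P):
    B = -(-N // P)                    # B = ceil(N/P)
    blocks = []
    for start in range(1, N + 1, B):  # DOALL I = 1, N, B
        end = min(start + B - 1, N)   # DO II = I, min(I+B-1, N)
        blocks.append((start, end))
    return blocks

# 10 iterations over 4 logical processors: blocks of B = 3
# contiguous iterations; the last block holds the remainder.
assert block_schedule(10, 4) == [(1, 3), (4, 6), (7, 9), (10, 10)]
```

Contiguous blocks keep each thread's accesses in distinct cache lines, which is why block scheduling was preferred over modulo scheduling here.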

40.0

even though they are not impaired by the same remote

communications and task distribution overhead issues or

30.0 like sup erscalar pro cessors even though the intrinsic paral

20.0 lelism is high

MISS Ratio (%) 1.4 10.0 FLO52 MDG 1.2 ARC2D 0.0

Gcc 1 Grep Latex MDG FLO52 ARC2D Contour

Simulator Compress

Segmentation 0.8

CPI

Figure Miss ratio of al l benchmarks

0.6

Performance Evaluation

0.4 Because it is unlikely a large number of logical pro cessors

can b e implemented on a single chip in the near future the 0.2 1 2 3 4 5 6 7 8

exp eriments have b een limited to logical pro cessors in each Number of Threads (4 functional units of each class)

case For the sake of clarity in all exp eriments the number

Figure CPI

7

of functional units of each class is increased altogether

analysis of the impact of each functional unit but detailed 5.5

FLO52

also provided Thus a single number is used to charac is 5 MDG ARC2D terize the number of functional units unless otherwise sp eci 4.5

ed two functional units meaning two functional units of 4 In this section various workloads are analyzed

each class 3.5 First global b ehavior is presented then the dierent origins 3

Speedup of p erformance b ottlenecks are progressively detailed 2.5 2 2.4 FLO52-8 1.5 2.2 MDG-8 ARC2D-8 1 1 2 3 4 5 6 7 8 2 Number of Threads (4 functional units of each class)

1.8

Figure Speedup with respect to one thread 1.6

Speedup

Origins of stalls Six stall causes can b e distinguis hed

1.4

in Figure WAW register dep endences branch delays

1.2

resource conicts busystall loadstore units instruction Let us examine in 1 cache miss and thread synchronization

1 2 3 4 5 6 7 8 A logical

Number of Functional Units of Each Class more details when each of these stalls can o ccur

pro cessor can b e stall only if a stall o ccurs in one of the two

Figure Speedup with respect to one thread

stages directly controlled by the logical pro cessor the fetch

and deco de stages Otherwise as far as a logical pro cessor

Global b ehavior of parallelized pro cesses In Figure

can issue no stall cycle is counted even if some functional

each program is parallelized into threads a program PRG

units are stall for a reason or another

paralleli zed into n threads is denoted PRGn and the num b er of functional units of each class is varied from up to

Branch It app ears that b eyond functional units ie functional Resource 60 units of each class only little p erformance improvements can I-miss

8 L/S

In fact when n threads are run on a mul b e exp ected Static Sched.

WAW

tithreaded pro cessor the number of functional units needs

n to achieve maximum p erformance thanks

not b e equal to 40

to the sharing of functional units by the dierent threads This is actually one of the ma jor arguments for supp orting

% Total Execution Cycles in terms of functional units scal

multithreaded pro cessors 20

ing up a multithreaded pro cessor is cheaper than scaling up

an onchip multiprocessor However for FLO and more

ARCD there are obviously p erformance b ot particularly for 0 FLO52 ARC2D MDG

1, 2, 4, 8 Threads (4 functional units of each class)

tlenecks other than resource issues that limit p erformance

improvements This can b e seen on Figures and where Figure Distribution of stal l cycles

the number of threads is varied from up to with the num

Consequently with resp ect to register dependences only

b er of functional units equal to for each class ie near

Write After Write dep endences can directly result in a stall

b est p erformance for threads The maximum sp eedup

cycle the deco de stage cannot issue an instruction if a func

achieved by FLO is ab out and for ARCD ie half the

tional unit is ab out to write in its destination register In

optimal sp eedup for threads As a result the CPI remains

case there is a RAW Read After Write dep endence but a

relatively high for ARCD and for FLO consider

standby station is available b efore one of the corresp onding

ing the number of threads In other terms multithreaded

functional units the instruction is issued and no stall cycle

pro cessors exhibit sublinear sp eedups like multiprocessors

o ccurs WAR Write After Read dep endences cannot o c

7

cur as registers are read and reserved in the deco de stages

the classes of functional units provided in our architecture are

integer ALU FP addermultiplier FP divider and LoadStore

in program order As can b e seen in Figure WAW dep en

8

the sp eedup is computed with resp ect to one thread run on a

dences account for a ma jor share of stall cycles Thus dy

functional units p er class pro cessor in terms of CPI

namic register renaming techniques as used in sup erscalar

pro cessors can strongly b enet to multithreaded pro cessors the same cache line resulting in bank conicts Clearly b et

as well On the other hand the eect of WAW dep endences ter static pro cessor scheduling techniques or implementing

is relatively stable as the number of threads increases and dynamic allo cation of iterations to logical pro cessors should

thus other stall causes then seem to b ecome signicant if not b e considered Still this load balancing phenomenon would

dominant have a far stronger impact on a classic multiprocessor

No sophisticated branch prediction technique has b een

40.0

branch delays for the moment but consider used to avoid Cache

Resource

ing the co des parallelized are numerical co des ie made of lo ops the eect of branch p enalties do es not show much

With respect to resource conflicts, the decode stage is stalled if all corresponding functional units and their standby stations are busy; if all functional units are busy, the free standby stations can be used to hide resource conflicts as well. The distribution of resource conflicts is shown in the figure: integer and floating-point add and mul operations are responsible for most conflicts (the resource conflict statistics in the figure do not include load/store units). As can be seen, threshold performance can be achieved with only a few integer and floating-point add/mul functional units and a single unit for each other operation. Note that the number of add units necessary is close to the optimal values of integer add and floating-point add found in [Jourdan et al.] for a multiple-issue superscalar processor.

Figure: Distribution of resource conflicts (fraction of stall cycles) as a function of the number of functional units (ALU, FP add/mul, FP div; 1, 2, 3, 4, 8 functional units of each class; FLO52, ARC2D, MDG).

With respect to load/store units, there are multiple causes of stalls: resource conflicts, but also stalls due to cache accesses. As the number of threads increases, load/store-related stalls increase significantly (see the figure). Thus, such stalls are likely to be a major cause of speedup limitations. They will be examined in more detail in the next paragraphs.

Stalls due to load/store instructions. There are two main categories of stalls due to load/store instructions: cache stalls (for coherence or misses) and load/store unit resource conflicts. As can be seen in the figure, load/store resource conflicts only occur past 4 threads (recall there are 4 load/store units in these experiments). This phenomenon is specific to load/store units, mostly because coherence issues forbid using the load/store units' standby stations to hide resource conflicts. Indeed, once a memory instruction has been sent to the standby station of one of the load/store units, other LOADs or STOREs of the same thread cannot be sent to another standby station, to avoid out-of-order execution of load/store instructions (see the section above).

Figure: Distribution of stall cycles due to load/store functional units (cache vs. resource stalls; 1, 2, 4, 8 threads, 4 functional units of each class; FLO52, ARC2D, MDG).

Figure: Distribution of stall cycles due to cache stalls (bank conflicts, misses, memory, register conflicts; 1, 2, 4, 8 threads, 4 functional units of each class; FLO52, ARC2D, MDG).
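The issue rule described above can be sketched as follows (a simplified model with hypothetical unit and station counts, not the actual decode logic of the simulated processor):

```python
# Simplified decode-stage rule: an instruction issues if a functional unit
# of its class is free, or if a standby station before one of those units
# is free; otherwise the decode stage stalls (hypothetical counts below).

def can_issue(op_class, busy_units, busy_stations, n_units, n_stations):
    """busy_units / busy_stations: dicts class -> number currently occupied."""
    if busy_units.get(op_class, 0) < n_units[op_class]:
        return True   # a functional unit of this class is free
    # all units busy: a free standby station can hide the conflict
    return busy_stations.get(op_class, 0) < n_stations[op_class]

n_units = {"fp_add": 2}
n_stations = {"fp_add": 2}
print(can_issue("fp_add", {"fp_add": 2}, {"fp_add": 1}, n_units, n_stations))
# True: units busy, but a standby station absorbs the instruction
print(can_issue("fp_add", {"fp_add": 2}, {"fp_add": 2}, n_units, n_stations))
# False: units and stations all busy -> decode stalls
```

This is the mechanism by which standby stations hide resource conflicts; the load/store exception discussed above amounts to restricting a thread to a single station of that class.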

Finally, a consequence of static allocation of iterations on the logical processors (static scheduling) is that some threads may complete execution earlier than others, depending on the distribution of iterations. In an actual implementation of a multithreaded processor, two solutions can be adopted: synchronizing threads (early threads wait for other threads to complete), or switching an active thread onto the logical processor of an idle thread. Here, the first technique was used, in order to highlight the effect of static processor scheduling. As shown in the figure, while this effect is negligible for FLO52, it becomes dominant for ARC2D when the number n of logical processors increases. The consequence of static processor scheduling as applied here is that one thread may receive up to n fewer iterations than other threads for any parallel loop nest. This effect does not show in FLO52 because, for many of its loops, the number of iterations is a multiple of the number of logical processors. Other techniques, like modulo processor scheduling, which allows a better distribution of iterations, behave even worse, because many threads simultaneously access consecutive data, i.e., the same cache line, resulting in bank conflicts. Clearly, better static processor scheduling techniques, or implementing dynamic allocation of iterations to logical processors, should be considered. Still, this load balancing phenomenon would have a far stronger impact on a classic multiprocessor.

Cache stalls. The major cause of load/store stalls are cache stalls. Four kinds of cache stalls can be distinguished: register accesses, memory bandwidth, misses, and bank conflicts (see the figure). Register access conflicts occur when two references (cache hit, line reload, reference leaving the miss queue) need to access the write register port of a given thread at the same time. Though such conflicts exist and require special hardware to handle, their impact is often negligible. Because the first-level cache size increases with the number of threads, the increasing issue rate of load/store instructions does not seem to induce an excessive memory request issue rate (see the figure, graph Memory). This is true only because cache banks are write-back. Indeed, the performance of write-through cache banks is worse, because the high amount of memory store requests induces numerous cache stalls, particularly for FLO52 (see the figure). Still, the impact of multithreaded processors on lower memory levels should be examined in more detail.

Figure: Write-Through vs. Write-Back cache stalls due to memory contention for FLO52 (% cache stalls; 1, 2, 4, 8 threads).
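The load imbalance caused by static scheduling can be illustrated with a small sketch (the iteration counts and the naive ceiling-based split are illustrative assumptions, not the exact scheduler modeled in this study):

```python
# Illustrative static partitioning of loop iterations over n logical
# processors: a balanced block split vs. a naive ceiling-based split.

def block_partition(iters, n):
    """Balanced static split: sizes differ by at most one iteration."""
    base, rem = divmod(iters, n)
    return [base + (1 if t < rem else 0) for t in range(n)]

def naive_partition(iters, n):
    """Naive split: every thread takes ceil(iters / n) iterations,
    so the last thread can receive far fewer (the imbalance the
    text attributes to static processor scheduling)."""
    chunk = -(-iters // n)          # ceil(iters / n)
    sizes, left = [], iters
    for _ in range(n):
        take = min(chunk, left)
        sizes.append(take)
        left -= take
    return sizes

print(block_partition(10, 4))  # [3, 3, 2, 2]: imbalance of one iteration
print(naive_partition(10, 4))  # [3, 3, 3, 1]: one thread finishes much earlier
```

When the iteration count is a multiple of the number of logical processors (as for many FLO52 loops), both splits are perfectly balanced and the early-completion effect disappears.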

Figure: Distribution of cache bank conflicts, 8 threads (% bank conflicts per cache bank, banks 1-8).

According to the figure, the two main causes of cache stalls are misses and cache-bank conflicts. Since cache banks are non-blocking, misses effectively induce cache stalls in two cases: when copying back a dirty line to the write buffer, and when a miss occurs on a line waiting for a pending miss request. When a thread is starved waiting for a load/store that missed, the decode stage carries on issuing instructions and sending them to standby stations, so no stall occurs immediately. On the other hand, the thread must stall if all standby stations are busy, or if the single load/store standby station it may use is occupied. Thus, the effect of memory latency is also felt indirectly through resource conflicts, particularly on load/store standby stations. Copying a dirty line to a write buffer is a cheap operation (one cycle), but it happens frequently. Also, to our surprise, misses on pending lines occur relatively often. This is related to the nature of parallel threads: it sometimes happens that several threads need to read the same line at the same time, since some array data can be shared by several iterations, or the different data needed by several threads belong to the same cache line. Still, as the cache size increases, less than half of cache stalls are related to misses (see the figures above), except for ARC2D. Thus, it seems that a combination of parallel threads and non-blocking cache banks is efficient at hiding small latencies. Consequently, switch-on-miss policies, which require additional hardware support, may not be necessary. With respect to ARC2D, a specific phenomenon occurs: this code contains numerous references with power-of-two strides, which thus induce cache conflicts even for large cache sizes.

Figure: Variation of process total miss ratio with the number of threads (Miss Rate (%); 1-8 threads; ARC2D, FLO52, MDG).

Heterogeneous Workloads. While the above paragraphs dealt with single-process performance, multithreaded processors also need to prove flexible, i.e., to perform well with heterogeneous workloads: a combination of parallelized processes and private-context threads. In the figure, two experiments were conducted. Graph N + PRG-4 corresponds to a program parallelized over 4 threads (thus the allocated cache space remains constant) while N other private-context threads are run simultaneously. This graph shows that running an increasing number of threads does not affect much the performance of the parallelized process. Still, the small performance degradation corresponds to additional memory traffic, which further delays memory requests of the parallelized process. However, crossbar conflicts do not increase, thanks to the pseudo-privatization of cache banks to processes. The purpose of graph N + PRG-(8-N) is different: to evaluate whether a multithreaded processor performs better within a heterogeneous environment (threads mostly have distinct contexts) or within a homogeneous environment (threads mostly share the same context). In graph N + PRG-(8-N), 8 threads are always run in total, but the program is parallelized over 8 - N threads while N other single-thread processes are also added, according to the order of the list given earlier. It appears that a homogeneous workload performs better because of the cache structure adopted; note, however, that our cache design privileges n shared-context threads over n private-context threads because, in the first case, all threads share the n cache banks while, in the latter case, each thread only receives a single bank, and thus a smaller cache space. Still, the stiff difference is also due to the fact that the total working set of a homogeneous workload is often smaller than that of a heterogeneous workload, since parallel threads usually share some data.
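The power-of-two-stride phenomenon can be reproduced with a small sketch (the bank count, line size, and stride values are illustrative, not the simulated cache configuration):

```python
# Illustrative model of an interleaved multi-bank cache: consecutive lines
# map to consecutive banks, so power-of-two strides that are multiples of
# (line_words * n_banks) concentrate all references on a single bank.

def bank_histogram(n_refs, stride, n_banks, line_words=4):
    """Count references per bank for word addresses 0, stride, 2*stride, ..."""
    hist = [0] * n_banks
    for i in range(n_refs):
        line = (i * stride) // line_words   # cache line of the reference
        hist[line % n_banks] += 1           # line-interleaved bank mapping
    return hist

print(bank_histogram(64, stride=1,  n_banks=8))  # unit stride: even spread
print(bank_histogram(64, stride=32, n_banks=8))  # power-of-two stride: one bank
```

With unit stride the 64 references spread evenly (8 per bank), while a stride of 32 words lands every reference in bank 0; this is the kind of poor conflict distribution over cache banks that penalizes ARC2D.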

On the other hand, for all three benchmarks, the effect of cache bank conflicts steadily increases with the number of threads. The impact of cache bank conflicts on total execution time is particularly strong because all load/store units must be stalled to preserve coherence each time a conflict occurs (see the section above). The effect of bank conflicts is further worsened by the poor distribution of conflicts over cache banks, as shown in the figure for 8 threads. Thus, more elaborate and selective coherence schemes, requiring only partial stalling, would make the present cache architectures less sensitive to bank conflicts. The second, complementary research direction is to investigate the possibility of applying existing solutions to memory-bank conflicts in vector processors to multiple-bank caches.

Figure: Heterogeneous workloads (CPI vs. number N of threads; graphs N + PRG-4 and N + PRG-(8-N)).

9. The additional processes are gcc for one additional process, gcc and compress for two additional processes, and so on, according to the order of the list given earlier.
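The potential gain of selective stalling can be sketched with a toy cost model (the conflict trace and the assumption that a selective scheme stalls only the colliding units, while the present scheme stalls all load/store units, are illustrative):

```python
# Toy cost model comparing the present coherence scheme (stall ALL
# load/store units on a bank conflict) with a hypothetical selective
# scheme (stall only the colliding units beyond the one that is served).

def stall_cycles(conflicts, n_ls_units, selective):
    """conflicts: per-cycle sets of load/store unit ids hitting one bank."""
    total = 0
    for units in conflicts:
        if len(units) > 1:                    # a bank conflict this cycle
            if selective:
                total += len(units) - 1       # only losers of arbitration stall
            else:
                total += n_ls_units           # every load/store unit stalls
    return total

trace = [{0, 1}, {2}, {1, 3}, {0, 1, 2}]      # illustrative conflict trace
print(stall_cycles(trace, 4, selective=False))  # 12 stall cycles
print(stall_cycles(trace, 4, selective=True))   # 4 stall cycles
```

Under this (simplistic) model, partial stalling cuts conflict-induced stall cycles by a factor of three on the sample trace, which conveys why a more selective coherence scheme is an attractive research direction.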

Conclusions and Further Work

Based on the experiments of this study, the conclusions on multithreaded processors are mitigated: the potential capacities of such architectures are significant, but further hardware improvements are required to exploit them. Multithreaded processors can effectively improve single-process performance, and best achievable performance is reached with fewer functional units than the number of threads. On the other hand, the usual assumption that multithreaded processors can avoid the complexity of superscalar architectures is mostly wrong, since both architectures require similar hardware enhancements, like dynamic register renaming and memory disambiguation. More important, only sublinear speedups are obtained, because cache bandwidth cannot be fully exploited, even though a cache design with sufficient potential bandwidth can be built with current technology. The two main limitations are the necessity to preserve coherence, which induces excessive cache stalls, and cache bank conflicts.

Thus, before superscalar processors can mutate into multithreaded processors capable of supporting multiple threads simultaneously, several issues must be investigated. Automatic parallelization techniques are not mature enough to avoid using multiple-issue threads, as proposed with Simultaneous Multithreading. On the other hand, automatic parallelization can be used to increase intrinsic parallelism in many cases. So that single-process speedup does not decrease with the number of threads, a more selective cache coherence scheme and techniques for reducing cache bank conflicts must be developed. Also, the hardware cost of implementing dynamic scheduling of tasks over shared-context threads should be investigated, as this would widen the application of multithreaded processors to codes with very few iterations per loop, i.e., codes that typically perform poorly on multiprocessors. The scope of multithreaded processors would be further enhanced if fast synchronization techniques were implemented, possibly through registers or caches, to parallelize loops with dependences.

References

A. Agarwal, J. Hennessy, and M. Horowitz. Cache Performance of Operating Systems and Multiprogramming. Transactions on Computer Systems, November.

Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In International Symposium on Computer Architecture, June.

B. Blume et al. Polaris: The Next Generation in Parallelizing Compilers. In Proceedings of the Seventh Workshop on Languages and Compilers for Parallel Computing.

Computer Systems Laboratory, Departments of Electrical Engineering and Computer Science, Stanford University. The Stanford SUIF Compiler.

George Cybenko, Lyle Kipp, Lynn Pointer, and David Kuck. Supercomputing Performance Evaluation and the Perfect Benchmarks. In Supercomputing.

Digital Equipment Corporation, Maynard, Massachusetts. Alpha Microprocessor Hardware Reference Manual.

Manoj Franklin. The Multiscalar Architecture. PhD thesis, University of Wisconsin, Madison.

D. Gannon et al. SIGMA II: A Tool Kit for Building Parallelizing Compiler and Performance Analysis Systems. Technical report, University of Indiana.

R. Govindarajan, S.S. Nemawarkar, and Philip Lenir. Design and Performance Evaluation of a Multithreaded Architecture. In International Symposium on Computer Architecture.

Elana D. Granston and Harry A.G. Wijshoff. Managing Pages in Shared Virtual Memory Systems: Getting the Compiler into the Game. Technical report, Department of Computer Science, Leiden University, December. Submitted to International Conference on Supercomputing.

Bernard Karl Gunther. Superscalar Performance in a Multithreaded Microprocessor. PhD thesis, University of Tasmania, Hobart, December.

John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc.

Hiroaki Hirata et al. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In International Symposium on Computer Architecture.

Peter Yan-Tek Hsu. Designing the TFP Microprocessor. IEEE Micro, April.

Intel Corporation. Pentium Processor User's Manual.

G. Irlam. SPA package.

Stephan Jourdan, Pascal Sainrat, and Daniel Litaize. Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor. In International Symposium on Computer Architecture, July.

Daniel C. McCrackin. Eliminating Interlocks in Deeply Pipelined Processors by Delay Enforced Multistreaming. IEEE Transactions on Computers.

Microprocessor Report. IBM Regains Performance Lead with Power. October.

Microprocessor Report. PA Combines Complexity and Speed. November.

Microprocessor Report. Intel Boosts Pentium Pro to MHz. November.

MIPS Technologies, Incorporated. R Microprocessor Chip Set, Product Overview.

MIPS Technologies, Incorporated. R Microprocessor Chip Set, Product Overview.

R.S. Nikhil, G.M. Papadopoulos, and Arvind. *T: A Multithreaded Massively Parallel Architecture. In International Symposium on Computer Architecture.

W. Pugh. The Omega Test: A Fast and Practical Integer Programming Algorithm for Dependence Analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation.

Allan L. Sillburt et al. A BiCMOS Modular Memory Family of DRAM and Multiport SRAM. IEEE Journal of Solid-State Circuits, March.

Gurindar S. Sohi, Scott E. Breach, and T.N. Vijaykumar. Multiscalar Processors. In International Symposium on Computer Architecture, July.

Gurindar S. Sohi and Manoj Franklin. High-Bandwidth Data Memory Systems for Superscalar Processors. In International Conference on Architectural Support for Programming Languages and Operating Systems, April.

O. Temam, C. Fricker, and W. Jalby. Cache Interference Phenomena. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems.

Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In International Symposium on Computer Architecture.

B. Zerrouk, J.M. Blin, and A. Greiner. Encapsulating Networks and Routing. In Proceedings of the International Parallel Processing Symposium.