
Selected for the IEEE Transactions on Computers special issue on caches (February); not the final version.

The Impact of Exploiting Instruction-Level Parallelism on Shared-Memory Multiprocessors

Vijay S. Pai, Parthasarathy Ranganathan, Hazim Abdel-Shafi, and Sarita Adve

Department of Electrical and Computer Engineering
Rice University
Houston, Texas
{vijaypai|parthas|shafi|sarita}@rice.edu

Abstract: Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching.

Our results show that while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention.

Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.(1)

  (1) This paper combines results from two previous conference papers, using a common set of system parameters, a more aggressive MESI (versus MSI) coherence protocol, a more aggressive compiler (the better of SPARC SC and gcc for each application, rather than gcc), and full simulation of private memory references.

I. Introduction

Shared-memory multiprocessors built from commodity microprocessors are being increasingly used to provide high performance for a variety of scientific and commercial applications. Current commodity microprocessors improve performance by using aggressive techniques to exploit high levels of instruction-level parallelism (ILP). These techniques include multiple instruction issue, out-of-order (dynamic) scheduling, non-blocking loads, and speculative execution. We refer to these techniques collectively as ILP techniques, and to processors that exploit these techniques as ILP processors.

Most previous studies of shared-memory multiprocessors, however, have assumed a simple processor with single issue, in-order scheduling, blocking loads, and no speculation. A few multiprocessor architecture studies model state-of-the-art ILP processors, but do not analyze the impact of ILP techniques.

To fully exploit recent advances in uniprocessor technology for shared-memory multiprocessors, a detailed analysis of how ILP techniques affect the performance of such systems, and of how they interact with previous optimizations for such systems, is required. This paper evaluates the impact of exploiting ILP on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching.

For our evaluations, we study five applications using detailed simulation, described in Section II.

Section III analyzes the impact of ILP techniques on the performance of shared-memory multiprocessors without the use of software prefetching. All our applications see performance improvements from the use of current ILP techniques, but the improvements vary widely. In particular, ILP techniques successfully and consistently reduce the CPU component of execution time, but their impact on the memory stall time is lower and more application-dependent. Consequently, despite the inherent latency tolerance features integrated within ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. These deficiencies are caused by insufficient opportunities in the application to overlap multiple load misses, and by increased contention for system resources from more frequent memory accesses.

Software-controlled non-binding prefetching has been shown to be an effective technique for hiding memory latency in simple-processor-based shared-memory systems. Section IV analyzes the interaction between software prefetching and ILP techniques in shared-memory multiprocessors. We find that, compared to previous-generation systems, increased late prefetches and increased contention for resources cause software prefetching to be less effective in reducing memory stall time in ILP-based systems. Thus, even after adding software prefetching, most of our applications remain largely memory bound on the ILP-based system.

Overall, our results suggest that, compared to previous-generation shared-memory systems, ILP-based systems have a greater need for additional techniques to tolerate or reduce memory latency. Specific techniques motivated by our results include clustering of load misses in the applications, to increase opportunities for load misses to overlap with each other, and techniques such as producer-initiated communication that reduce latency to make prefetching more effective (Section V).

  This work is supported in part by an IBM Partnership award, Intel Corporation, the National Science Foundation under Grants CCR-…, CCR-…, CDA-…, and CDA-…, and the Texas Advanced Technology Program under Grant No. …. Sarita Adve is also supported by an Alfred P. Sloan Research Fellowship, Vijay S. Pai by a Fannie and John Hertz Foundation Fellowship, and Parthasarathy Ranganathan by a Lodieska Stockbridge Vaughan Fellowship.


II. Methodology

A. Simulated Architectures

To determine the impact of ILP techniques on multiprocessor performance, we compare two systems, ILP and Simple, equivalent in every respect except the processor used. The ILP system uses state-of-the-art ILP processors, while the Simple system uses simple processors (Section II-A.1). We compare the ILP and Simple systems not to suggest any architectural tradeoffs, but rather to understand how aggressive ILP techniques impact multiprocessor performance. Therefore, the two systems have identical clock rates and include identical aggressive memory and network configurations suitable for the ILP system (Section II-A.2). Figure 1 summarizes all the system parameters.

Fig. 1. System parameters ("…" marks values not recoverable from this extraction).

  Processor parameters
    Clock rate                    … MHz
    Fetch/decode/retire rate      … per cycle
    Instruction window
      (reorder buffer) size       …
    Memory queue size             …
    Outstanding branches          …
    Functional units              … ALUs, … FPUs, … address
                                  generation units, all 1-cycle latency

  Memory hierarchy and network parameters
    L1 cache                      … KB, direct-mapped, … ports,
                                  … MSHRs, …-byte line
    L2 cache                      … KB, …-way associative, 1 port,
                                  … MSHRs, …-byte line, pipelined
    Memory                        …-way interleaved, … ns access time,
                                  … bytes/cycle
    Bus                           … MHz, … bits, split transaction
    Network                       2D mesh, … MHz, … bits per hop,
                                  flit delay of … network cycles
    Nodes in multiprocessor       …

  Resulting contentionless latencies in processor cycles
    L1 hit                        1 cycle
    L2 hit                        … cycles
    Local memory                  … cycles
    Remote memory                 … cycles
    Cache-to-cache transfer       … cycles

A.1. Processor Models

The ILP system uses state-of-the-art processors that include multiple issue, out-of-order (dynamic) scheduling, non-blocking loads, and speculative execution. The Simple system uses previous-generation simple processors, with single issue, in-order (static) scheduling, and blocking loads, and represents commonly studied shared-memory systems. Since we did not have access to a compiler that schedules instructions for our in-order simple processor, we assume single-cycle functional unit latencies, as also assumed by most previous simple-processor-based shared-memory studies. Both processor models include support for software-controlled non-binding prefetching to the L1 cache.

A.2. Memory Hierarchy and Multiprocessor Configuration

We simulate a hardware cache-coherent, non-uniform memory access (CC-NUMA) shared-memory multiprocessor using an invalidation-based, four-state MESI directory coherence protocol. We model release consistency, because previous studies have shown that it achieves the best performance.

The processing nodes are connected using a two-dimensional mesh network. Each node includes a processor, two levels of caches, a portion of the global shared memory and directory, and a network interface. A split-transaction bus connects the network interface, directory controller, and the rest of the system node. Both caches use a write-allocate, write-back policy. The cache sizes are chosen commensurate with the input sizes of our applications, following the methodology described by Woo et al. The primary working sets for our applications fit in the L1 cache, while the secondary working sets do not fit in the L2 cache. Both caches are non-blocking and use miss status holding registers (MSHRs) to store information on outstanding misses and to coalesce multiple requests to the same cache line. All multiprocessor results reported in this paper use a configuration with … nodes.
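To make the MSHR mechanism concrete, the following is a minimal sketch of the allocate-or-coalesce decision a non-blocking cache makes on a miss. The table size, field names, and the commented-out memory-system hook are illustrative assumptions, not the configuration of Figure 1 or RSIM's actual implementation.

    #include <stdbool.h>

    #define NUM_MSHRS  8     /* illustrative; Figure 1's value was lost */
    #define LINE_SHIFT 6     /* assumes a 64-byte cache line */

    typedef struct {
        bool          valid;     /* entry tracks an outstanding miss */
        unsigned long line_addr; /* cache-line address of the miss   */
        int           pending;   /* requests coalesced onto the miss */
    } Mshr;

    static Mshr mshrs[NUM_MSHRS];

    /* Returns true if the miss is absorbed without stalling: either it
       coalesces onto an outstanding miss to the same line, or a free
       MSHR is allocated and a new memory request would be issued. */
    bool handle_miss(unsigned long addr)
    {
        unsigned long line = addr >> LINE_SHIFT;
        int free_slot = -1;

        for (int i = 0; i < NUM_MSHRS; i++) {
            if (mshrs[i].valid && mshrs[i].line_addr == line) {
                mshrs[i].pending++;  /* coalesce: no new memory request */
                return true;
            }
            if (!mshrs[i].valid && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return false;            /* all MSHRs busy: access must stall */

        mshrs[free_slot] = (Mshr){ .valid = true, .line_addr = line,
                                   .pending = 1 };
        /* issue_memory_request(line);  hypothetical memory-system hook */
        return true;
    }

When all MSHRs are occupied, further misses stall; this limit bounds the load miss overlap discussed in Section III.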

B. Simulation Environment

We use RSIM, the Rice Simulator for ILP Multiprocessors, to model the systems studied. RSIM is an execution-driven simulator that models the processor pipelines, memory system, and interconnection network in detail, including contention at all resources. It takes SPARC application executables as input. To speed up our simulations, we assume that all instructions hit in the instruction cache. This assumption is reasonable, since all our applications have very small instruction footprints.

C. Performance Metrics

In addition to comparing execution times, we also report the individual components of execution time (CPU, data memory stall, and synchronization stall times) to characterize the performance bottlenecks in our systems. With ILP processors, it is unclear how to assign stall time to specific instructions, since each instruction's execution may be overlapped with both preceding and following instructions. We use the following convention, similar to previous work, to account for stall cycles. At every cycle, we calculate the ratio of the instructions retired from the instruction window in that cycle to the maximum retire rate of the processor, and attribute this fraction of the cycle to the busy time. The remaining fraction of the cycle is attributed as stall time to the first instruction that could not be retired that cycle. We group the busy time and functional unit (non-memory) stall time together as CPU time. Henceforth, we use the term memory stall time to denote the data memory stall component of execution time.

In the first part of the study, the key metric used to evaluate the impact of ILP is the ratio of the execution time with the Simple system relative to that achieved by the ILP system, which we call the ILP speedup. For detailed analysis, we analogously define an ILP speedup for each component of execution time.
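To make the attribution convention concrete, here is a minimal sketch, assuming the simulator exposes the number of instructions retired in each cycle; the function and the retire width are illustrative, not RSIM's actual interface.

    #include <stdio.h>

    #define MAX_RETIRE_RATE 4   /* illustrative maximum retire width */

    /* Attribute each cycle to busy vs. stall time as described above:
       the fraction (retired / maximum retire rate) of the cycle is
       busy, and the remainder is charged as stall time (in the full
       convention, to the first instruction that could not retire). */
    void attribute_time(const int *retired_per_cycle, long num_cycles,
                        double *busy_time, double *stall_time)
    {
        *busy_time = *stall_time = 0.0;
        for (long c = 0; c < num_cycles; c++) {
            double busy_frac =
                (double)retired_per_cycle[c] / MAX_RETIRE_RATE;
            *busy_time  += busy_frac;
            *stall_time += 1.0 - busy_frac;
        }
    }

    int main(void)
    {
        int retired[] = { 4, 2, 0, 1, 4 };  /* toy five-cycle trace */
        double busy, stall;
        attribute_time(retired, 5, &busy, &stall);
        printf("busy = %.2f cycles, stall = %.2f cycles\n", busy, stall);
        return 0;
    }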


D. Applications

Figure 2 lists the applications and the input sets used in this study. Radix, LU, and FFT are from the SPLASH-2 suite, and Water and Mp3d are from the SPLASH suite. These five applications and their input sizes were chosen to ensure reasonable simulation times. (Since RSIM models aggressive ILP processors in detail, it is about … times slower than simple-processor-based shared-memory simulators.)

Fig. 2. Applications and input sizes ("…" marks values not recoverable from this extraction).

  Application     Input Size
  LU, LUopt       …x… matrix, block …
  FFT, FFTopt     … points
  Mp3d            … particles
  Water           … molecules
  Radix           radix …, …K keys, max …K

LUopt and FFTopt are versions of LU and FFT that include ILP-specific optimizations that can potentially be implemented in a compiler.(2) Specifically, we use function inlining and loop interchange to move load misses closer to each other, so that they can be overlapped in the ILP processor; the sketch below illustrates the idea. The impact of these optimizations is discussed in Sections III and V. Both versions of LU are also modified slightly to use flags instead of barriers for better load balance.

  (2) To the best of our knowledge, the key compiler optimization identified in this paper, clustering of load misses, is not implemented in any current superscalar compiler.
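The following fragment illustrates the kind of load miss clustering that these transformations aim for; it is a hypothetical example, not code from the benchmarks themselves.

    #define N      1024
    #define STRIDE 16    /* elements; spaces the loads so that each one
                            below touches a different cache line */
    double a[N * STRIDE];

    /* Before: each likely-missing load is followed by dependent work,
       so at most one miss occupies the instruction window at a time. */
    double reduce_serial(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i++) {
            double x = a[i * STRIDE];        /* likely cache miss */
            sum += x * x + 2.0 * x + 1.0;    /* work after each miss */
        }
        return sum;
    }

    /* After: unrolling clusters four independent likely-missing loads,
       so an out-of-order processor can keep several misses outstanding
       simultaneously. LUopt and FFTopt reach the same effect through
       function inlining and loop interchange. */
    double reduce_clustered(void)
    {
        double sum = 0.0;
        for (int i = 0; i < N; i += 4) {
            double x0 = a[(i + 0) * STRIDE];
            double x1 = a[(i + 1) * STRIDE];
            double x2 = a[(i + 2) * STRIDE];
            double x3 = a[(i + 3) * STRIDE];
            sum += x0 * x0 + 2.0 * x0 + 1.0;
            sum += x1 * x1 + 2.0 * x1 + 1.0;
            sum += x2 * x2 + 2.0 * x2 + 1.0;
            sum += x3 * x3 + 2.0 * x3 + 1.0;
        }
        return sum;
    }

On a blocking-load (Simple) processor the two versions behave almost identically, since every miss stalls the pipeline regardless of its neighbors; the benefit appears only when non-blocking loads and a sufficiently large instruction window allow the clustered misses to overlap.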

Since a SPARC compiler for our ILP system does not exist, we compiled our applications with the commercial Sun SC or the gcc compiler (based on better simulated ILP system performance), with full optimization turned on. The compiler's deficiencies in addressing the specific instruction grouping rules of our ILP system are partly hidden by the out-of-order scheduling in the ILP processor.

III. Impact of ILP Techniques on Performance

This section analyzes the impact of ILP techniques on multiprocessor performance by comparing the Simple and ILP systems, without software prefetching.

A. Overall Results

Figures 3 and 4 illustrate our key overall results. For each application, Figure 3 shows the total execution time and its three components for the Simple and ILP systems, normalized to the total time on the Simple system. Additionally, at the bottom, the figure also shows the ILP speedup for each application. Figure 4 shows the parallel efficiency(3) of the ILP and Simple systems, expressed as a percentage. These figures show three key trends:

• ILP techniques improve the execution time of all our applications. However, the ILP speedup shows a wide variation, from 1.29X in Mp3d to 3.54X in LUopt. The average ILP speedup for the original applications (i.e., not including LUopt and FFTopt) is only about 2.05.

• The memory stall component is generally a larger part of the overall execution time in the ILP system than in the Simple system.

• Parallel efficiency for the ILP system is less than that for the Simple system for all applications.

We next investigate the reasons for the above trends.

  (3) The parallel efficiency for an application on a system with N processors is defined as (Execution time on uniprocessor / Execution time on multiprocessor) x (1/N).

B. Factors Contributing to ILP Speedup

Figure 3 indicates that the most important components of execution time are CPU time and memory stalls. Thus, ILP speedup will be shaped primarily by CPU ILP speedup and memory ILP speedup. Figure 5 summarizes these speedups, along with the total ILP speedup. The figure shows that the low and variable ILP speedup for our applications can be attributed largely to insufficient and variable memory ILP speedup; the CPU ILP speedup is similar and significant among all applications (ranging from … to …). More detailed data shows that for most of our applications, memory stall time is dominated by stalls due to loads that miss in the L1 cache. We therefore focus on the impact of ILP on L1 load misses below.

The load miss ILP speedup is the ratio of the stall time due to load misses in the Simple and ILP systems, and is determined by three factors, described below. The first factor increases the speedup, the second decreases it, while the third may either increase or decrease it.

• Load miss overlap. Since the Simple system has blocking loads, the entire load miss latency is exposed as stall time. In ILP, load misses can be overlapped with other useful work, reducing stall time and increasing the ILP load miss speedup. The number of instructions behind which a load miss can overlap is, however, limited by the instruction window size; further, load misses have longer latencies than other instructions in the instruction window. Therefore, load miss latency can normally be completely hidden only behind other load misses. Thus, for significant load miss ILP speedup, applications should have multiple load misses clustered together within the instruction window, to enable these load misses to overlap with each other.

• Contention. Compared to the Simple system, the ILP system can see longer latencies from increased contention due to the higher frequency of misses, thereby negatively affecting load miss ILP speedup.

• Change in the number of misses. The ILP system may see fewer or more misses than the Simple system because of speculation or reordering of memory accesses, thereby positively or negatively affecting load miss ILP speedup.

All of our applications except LU see a similar number of cache misses in both the Simple and ILP case. LU sees …X fewer misses in ILP because of a reordering of accesses that otherwise conflict. When the number of misses does not change, the ILP system sees a load miss ILP speedup greater than 1 if the load miss overlap exploited by ILP outweighs any additional latency from contention. We illustrate the effects of load miss overlap and contention using the two applications that best characterize them: LUopt and Radix.


Fig. 3. Impact of ILP on multiprocessor performance. (Bar chart of normalized execution time, divided into CPU, memory, and synchronization components, for Simple and ILP on each application; ILP speedups: FFT 2.41X, FFTopt 2.56X, LU 2.94X, LUopt 3.54X, Mp3d 1.29X, Radix 1.33X, Water 2.30X.)

Fig. 4. Impact of ILP on parallel efficiency. (Bar chart of percent parallel efficiency for Simple and ILP on each application; per-application values not recoverable from this extraction.)

Fig. 5. ILP speedup for total execution time, CPU time, and memory stall time in the multiprocessor system.

Figure 6(a) provides the average load miss latencies for LUopt and Radix in the Simple and ILP systems, normalized to the Simple system latency. The latency shown is the total miss latency, measured from address generation to data arrival, including the overlapped part in ILP and the exposed part that contributes to stall time. The difference in the bar lengths of Simple and ILP indicates the additional latency added due to contention in ILP. Both of these applications see a significant latency increase from resource contention in ILP. However, LUopt can overlap all its additional latency, as well as a large portion of the base Simple latency, thus leading to a high memory ILP speedup. On the other hand, Radix cannot overlap its additional latency; thus, it sees a load miss slowdown in the ILP configuration.

We use the data in Figures 6(b) and 6(c) to further investigate the causes for the load miss overlap and contention-related latencies in these applications.

Causes for load miss overlap: Figure 6(b) shows the ILP system's L1 MSHR occupancy due to load misses for LUopt and Radix. Each curve shows the fraction of total time for which at least N MSHRs are occupied by load misses, for each possible N on the X axis. This figure shows that LUopt achieves significant overlap of load misses, with up to eight load miss requests outstanding simultaneously at various times. In contrast, Radix almost never has more than one outstanding load miss at any time. This difference arises because load misses are clustered together in the instruction window in LUopt, but are typically separated by too many instructions in Radix.

Causes for contention: Figure 6(c) extends the data of Figure 6(b) by displaying the total MSHR occupancy for both load and store misses. The figure indicates that Radix has a large amount of store miss overlap. This overlap does not contribute to an increase in memory ILP speedup, since store latencies are already hidden in both the Simple and ILP systems due to release consistency. The store miss overlap, however, increases contention in the memory hierarchy, resulting in the ILP memory slowdown in Radix. In LUopt, the contention-related latency comes primarily from load misses, but its effect is mitigated, since overlapped load misses contribute to reducing memory stall time.

C. Memory Stall Component and Parallel Efficiency

Using the above analysis, we can see why the ILP system generally sees a larger relative memory stall time component (Figure 3) and a generally poorer parallel efficiency (Figure 4) than the Simple system.

Since memory ILP speedup is generally less than CPU ILP speedup, the memory component becomes a greater fraction of total execution time in the ILP system than in the Simple system. To understand the reduced parallel efficiency, Figure 7 provides the ILP speedups for the uniprocessor configuration for reference. The uniprocessor also generally sees lower memory ILP speedups than CPU ILP speedups. However, the impact of the lower memory ILP speedup is higher in the multiprocessor, because the longer latencies of remote misses and increased contention result in a larger relative memory component in the execution time, relative to the uniprocessor. Additionally, the dichotomy between local and remote miss latencies in a multiprocessor often tends to decrease memory ILP speedup relative to the uniprocessor, because load misses must be overlapped not only with other load misses, but with load misses with similar latencies. Thus, overall, the multiprocessor system is less able to exploit ILP features than the corresponding uniprocessor system for most applications.(4) Consequently, the ILP multiprocessor generally sees lower parallel efficiency than the Simple multiprocessor.

  (4) FFT and FFTopt see better memory ILP speedups in the multiprocessor than in the uniprocessor, because they overlap multiple load misses with similar multiprocessor (remote) latencies. The section of the code that exhibits overlap has a greater impact in the multiprocessor because of the longer remote latencies incurred in this section.


Fig. 6. Load miss overlap and contention in the ILP system. (a) Effect of ILP on average L1 miss latency: normalized to Simple = 100.0, the total ILP latencies, divided into overlapped and stall portions, rise to 216.1 and 209.2 for LUopt and Radix. (b) L1 MSHR occupancy due to loads. (c) L1 MSHR occupancy due to loads and stores. (Panels (b) and (c) plot utilization against the number of L1 MSHRs, 0 through 8.)

Fig. 7. ILP speedup for total execution time, CPU time, and memory stall time in the uniprocessor system.

IV. Interaction of ILP Techniques with Software Prefetching

The previous section shows that the ILP system sees a greater bottleneck from memory latency than the Simple system. Software-controlled non-binding prefetching has been shown to effectively hide memory latency in shared-memory multiprocessors with simple processors. This section evaluates how software prefetching interacts with ILP techniques in shared-memory multiprocessors. We followed the software prefetch algorithm developed by Mowry et al. to insert prefetches in our applications by hand, with one exception: the algorithm assumes that locality is not maintained across synchronization, and so does not schedule prefetches across synchronization accesses; we removed this restriction when beneficial. For a consistent comparison, the experiments reported use prefetches scheduled identically for both Simple and ILP; the prefetches are scheduled at least … dynamic instructions before their corresponding demand accesses. The impact of this scheduling decision is discussed below, including the impact of varying this prefetch distance.
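As an illustration of what hand-inserted, Mowry-style prefetching looks like at the source level, the fragment below uses GCC's __builtin_prefetch intrinsic with a compile-time prefetch distance. This is a hypothetical sketch: the loop, names, and distance are invented, and the paper's experiments inserted prefetches by hand following Mowry's algorithm rather than in this exact form.

    /* Software-controlled non-binding prefetching with a fixed
       prefetch distance. PF_DIST is chosen so each prefetch is issued
       enough iterations (and hence dynamic instructions) ahead of its
       demand access: too small a distance yields late prefetches, too
       large a distance yields early ones. */
    #define N       65536
    #define PF_DIST 16        /* illustrative prefetch distance */

    double a[N], b[N];

    void scaled_copy(double scale)
    {
        for (int i = 0; i < N; i++) {
            if (i + PF_DIST < N) {
                /* Non-binding hints: they move the line toward the
                   cache without affecting correctness. Arguments are
                   GCC's (address, read(0)/write(1), locality 0-3). */
                __builtin_prefetch(&a[i + PF_DIST], 0, 3);
                __builtin_prefetch(&b[i + PF_DIST], 1, 3);
            }
            b[i] = scale * a[i];
        }
    }

Mowry's full algorithm additionally strip-mines loops and issues one prefetch per cache line rather than per element; those refinements are omitted here for brevity.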

A. Overall Results

Figure 8 graphically presents the key results from our experiments. (FFT and FFTopt have similar performance, so only FFTopt appears in the figure.) The figure shows the execution time and its components for each application on Simple and ILP, both without and with software prefetching; +PF indicates the addition of software prefetching. Execution times are normalized to the time for the application on Simple without prefetching. Figure 9 summarizes some key data.

Software prefetching achieves significant reductions in execution time on ILP (… to …%) for three cases: LU, Mp3d, and Water. These reductions are similar to or greater than those in Simple for these applications. However, software prefetching is less effective at reducing memory stalls on ILP than on Simple (an average reduction of …% in ILP, ranging from …% to …%, versus an average of …% and a range of …% to …% in Simple). The net effect is that even after prefetching is applied to ILP, the average memory stall time is …% on ILP (with a range of …% to …%), versus an average of …% and a range of …% to …% for Simple. For most applications, the ILP system remains largely memory-bound even with software prefetching.

B. Factors Contributing to the Effectiveness of Software Prefetching

We next identify three factors that make software prefetching less successful in reducing memory stall time in ILP than in Simple, two factors that allow ILP additional benefits in memory stall reduction not available in Simple, and one factor that can either help or hurt ILP. We focus on issues that are specific to ILP systems; previous work has discussed non-ILP-specific issues. Figure 10 summarizes the effects that were exhibited by the applications we studied. Of the negative effects, the first two are the most important for our applications.

Increased late prefetches: The last column of Figure 9 shows that the number of prefetches that are too late to completely hide the miss latency increases in all our applications when moving from Simple to ILP. One reason for this increase is that multiple issue and out-of-order scheduling speed up computation in ILP, decreasing the computation time with which each prefetch is overlapped. Simple also stalls on any load misses that are not prefetched or that incur a late prefetch, thereby allowing other outstanding prefetched data to arrive at the cache; ILP does not provide similar leeway.

Increased resource contention: As shown in Section III, ILP processors stress system resources more than Simple. Prefetches further increase demand for resources, resulting in more contention and greater memory latencies. The resources most stressed in our configuration were cache ports, MSHRs, ALUs, and address generation units.

Negative interaction with clustered misses: Optimizations to cluster load misses for the ILP system, as in LUopt, can potentially reduce the effectiveness of software prefetching. For example, the addition of prefetching reduces the execution time of LU by …% on the ILP system; in contrast, LUopt improves by only …%. On the Simple system, both LU and LUopt improve by about …% with prefetching. (LUopt with prefetching is nevertheless slightly better than LU with prefetching on ILP, by …%.) The clustering optimization used in LUopt reduces the computation between successive misses, contributing to a high number of late prefetches and increased contention with prefetching.


Fig. 8. Interaction between software prefetching and ILP. (Bar charts of normalized execution time, divided into CPU, memory, and synchronization components, for Simple, Simple+PF, ILP, and ILP+PF on LU, LUopt, FFTopt, Mp3d, Radix, and Water.)

Fig. 9. Detailed data on the effectiveness of software prefetching. (Table giving, for each application and for both Simple and ILP: the percentage reduction in execution time, the percentage reduction in memory stall time, the remaining memory stall time, and the percentage of prefetches that are late; numeric entries not recoverable from this extraction.) For the average, of LU and LUopt only LUopt is considered, since it provides better performance than LU with prefetching and ILP.

Fig. 10. Factors affecting the performance of prefetching for ILP. (Table marking which of the factors discussed in this section, namely late prefetches, resource contention, clustered load misses, overlapped accesses, early prefetches, and speculative prefetches, apply to each of LU, LUopt, FFTopt, Mp3d, Water, and Radix; late prefetches apply to all six applications, and the remaining per-application marks are not recoverable from this extraction.)

Overlapped accesses: In ILP, accesses that are difficult to prefetch may be overlapped, because of non-blocking loads and out-of-order scheduling. Prefetched lines in LU and LUopt often suffer from L1 cache conflicts, resulting in these lines being replaced to the L2 cache before being used by the demand accesses. This L2 cache latency results in stall time in Simple, but can be overlapped by the processor in ILP. Since prefetching in ILP only needs to target those accesses that are not already overlapped by ILP, it can appear more effective in ILP than in Simple.

Fewer early prefetches: Early prefetches are those where the prefetched lines are either invalidated or replaced before their corresponding demand accesses. Early prefetches can hinder demand accesses by invalidating or replacing needed data from the same or other caches, without providing any benefits in latency reduction. In many of our applications, the number of early prefetches drops in ILP, improving the effectiveness of prefetching for these applications. This reduction occurs because the ILP system allows less time between a prefetch and its subsequent demand access, decreasing the likelihood of an intervening invalidation or replacement.
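To make the late and early categories used in this section concrete, a simulator might classify each prefetch from its timeline roughly as follows. This is an illustrative sketch, not RSIM's actual accounting.

    /* A prefetch is LATE if its data arrives after the demand access
       issues (only part of the miss latency is hidden), and EARLY if
       the line is invalidated or replaced before the demand access
       uses it (no latency is hidden, and useful data may be displaced). */
    typedef struct {
        long issue_cycle;     /* prefetch issued                    */
        long arrival_cycle;   /* line filled into the cache         */
        long demand_cycle;    /* first demand access to the line    */
        long eviction_cycle;  /* invalidated/replaced; -1 if never  */
    } PrefetchRecord;

    typedef enum { PF_USEFUL, PF_LATE, PF_EARLY } PrefetchClass;

    PrefetchClass classify(const PrefetchRecord *p)
    {
        if (p->eviction_cycle >= 0 && p->eviction_cycle < p->demand_cycle)
            return PF_EARLY;  /* lost before it could help           */
        if (p->arrival_cycle > p->demand_cycle)
            return PF_LATE;   /* hides only part of the latency      */
        return PF_USEFUL;     /* arrived in time and stayed resident */
    }

Under this accounting, ILP's faster computation shifts prefetches from useful toward late, since demand accesses follow their prefetches sooner; the same shorter prefetch-to-demand window simultaneously makes early prefetches rarer, matching the trends described above.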

Speculative prefetches: In ILP, prefetch instructions can be speculatively issued past a mispredicted branch. Speculative prefetches can potentially hurt performance by bringing unnecessary lines into the cache, or by bringing needed lines into the cache too early. Speculative prefetches can also help performance, by initiating a prefetch for a needed line early enough to hide its latency. In our applications, most prefetches issued past mispredicted branches were to lines also accessed on the correct path.

C. Impact of Software Prefetching on Execution Time

Despite its reduced effectiveness in addressing memory stall time, software prefetching achieves significant execution time reductions with ILP in three cases (LU, Mp3d, and Water), for two main reasons. First, memory stall time contributes a larger portion of total execution time in ILP. Thus, even a reduction of a small fraction of memory stall time can imply a reduction in overall execution time similar to or greater than that seen in Simple. Second, ILP systems see less instruction overhead from prefetching compared to Simple systems, because ILP techniques allow the overlap of these instructions with other computation.


D. Alleviating Late Prefetches and Contention

Our results show that late prefetches and resource contention are the two key limitations to the effectiveness of prefetching on ILP. We tried several straightforward modifications to the prefetching algorithm and the system to address these limitations. Specifically, we doubled and quadrupled the prefetch distance (i.e., the distance between a prefetch and the corresponding demand access) and increased the number of MSHRs. However, these modifications traded off benefits among late prefetches, early prefetches, and contention, without improving the combination of these factors enough to improve overall performance. We also tried varying the prefetch distance for each access according to the expected latency of that access (versus a common distance for all accesses), and prefetching only to the L2 cache. These modifications achieved their purpose, but did not provide a significant performance benefit for our applications.

V. Discussion

Our results show that shared-memory systems are limited in their effectiveness in exploiting ILP processors, due to limited benefits of ILP techniques for the memory system. The analysis of Section III implies that the key reasons for the limited benefits are the lack of opportunity for overlapping load misses and/or increased contention in the system. Compiler optimizations akin to the loop interchanges used to generate LUopt and FFTopt may be able to expose more potential for load miss overlap in an application. The simple loop interchange used in LUopt provides a …% reduction in execution time compared to LU on an ILP multiprocessor. Hardware enhancements can also increase load miss overlap, e.g., through a larger instruction window. Targeting contention requires increased hardware resources or other latency reduction techniques.

The results of Section IV show that while software prefetching improves memory system performance with ILP processors, it does not change the memory-bound nature of these systems for most of the applications, because the latencies are too long to hide with prefetching and/or because of increased contention. Our results motivate prefetching algorithms that are sensitive to increases in resource usage. They also motivate latency-reducing (rather than latency-tolerating) techniques, such as producer-initiated communication, which can improve the effectiveness of prefetching.

VI. Conclusions

This paper evaluates the impact of ILP techniques supported by state-of-the-art processors on the performance of shared-memory multiprocessors. All our applications see performance improvements from current ILP techniques. However, while ILP techniques effectively address the CPU component of execution time, they are less successful in improving data memory stall time. These applications do not see the full benefit of the latency-tolerating features of ILP processors, because of insufficient opportunities to overlap multiple load misses and increased contention for system resources from more frequent memory accesses. Thus, ILP-based multiprocessors see a larger bottleneck from memory system performance, and generally poorer parallel efficiencies, than previous-generation multiprocessors.

Software-controlled non-binding prefetching is a latency-hiding technique widely recommended for previous-generation shared-memory multiprocessors. We find that while software prefetching results in substantial reductions in execution time for some cases on the ILP system, increased late prefetches and increased contention for resources cause software prefetching to be less effective in reducing memory stall time in ILP-based systems. Even after the addition of software prefetching, most of our applications remain largely memory bound.

Thus, despite the latency-tolerating techniques integrated within ILP processors, multiprocessors built from ILP processors have a greater need for additional techniques to hide or reduce memory latency than previous-generation multiprocessors. One ILP-specific technique discussed in this paper is the software clustering of load misses. Additionally, latency-reducing techniques, such as producer-initiated communication, that can improve the effectiveness of prefetching appear promising.

References

[1]  H. Abdel-Shafi et al. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. In Proc. 3rd Intl. Symp. on High-Performance Computer Architecture, 1997.
[2]  E. H. Gornish. Adaptive and Integrated Data Cache Prefetching for Shared-Memory Multiprocessors. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[3]  D. Kroft. Lockup-Free Instruction Fetch/Prefetch Cache Organization. In Proc. 8th Intl. Symp. on Computer Architecture, 1981.
[4]  J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proc. 24th Intl. Symp. on Computer Architecture, 1997.
[5]  C.-K. Luk and T. C. Mowry. Compiler-Based Prefetching for Recursive Data Structures. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[6]  T. Mowry. Tolerating Latency through Software-controlled Data Prefetching. PhD thesis, Stanford University, 1994.
[7]  B. A. Nayfeh et al. Evaluation of Design Alternatives for a Multiprocessor Microprocessor. In Proc. 23rd Intl. Symp. on Computer Architecture, 1996.
[8]  K. Olukotun et al. The Case for a Single-Chip Multiprocessor. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[9]  V. S. Pai et al. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, 1996.
[10] V. S. Pai et al. RSIM Reference Manual, Version …. ECE TR …, Rice University, 1997.
[11] V. S. Pai et al. The Impact of Instruction Level Parallelism on Multiprocessor Performance and Simulation Methodology. In Proc. 3rd Intl. Symp. on High-Performance Computer Architecture, 1997.
[12] P. Ranganathan et al. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. In Proc. 24th Intl. Symp. on Computer Architecture, 1997.
[13] J. P. Singh et al. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, March 1992.
[14] S. C. Woo et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proc. 22nd Intl. Symp. on Computer Architecture, 1995.