The NUMAchine Multiprocessor: Design and Analysis

Robin Grindley

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Electrical and Computer Engineering, Computer Engineering Group, University of Toronto

© 1999 Robin Grindley



The NUMAchine Multiprocessor: Design and Analysis
Doctor of Philosophy, 1999
Robin Grindley
Graduate Department of Electrical and Computer Engineering
University of Toronto

Abstract

This dissertation considers the design and analysis of NUMAchine: a distributed, shared-memory multiprocessor. The architecture and design leading to a working 48-processor prototype are described in detail. Analysis of the system is based on a cycle-accurate, execution-driven simulator developed as part of the thesis. An exploration of the design space is also undertaken to provide some intuition as to possible future enhancements to the architecture.

Shared-memory multiprocessors and parallel processing are becoming increasingly common not only in the scientific domain, but also as a replacement for mainframes in the field of large-scale enterprise computing. The shared-memory programming paradigm provides an intuitive view of memory as a globally shared resource among all processors. This is more familiar to programmers of uniprocessors than the alternative, message-passing. The distribution of memory across the system leads to Non-Uniform Memory Access times (NUMA), since processors have fast access to local memory and slower access to remote memories across the system network. The architecture contains features which attempt to hide or reduce the effects of this non-uniformity. NUMAchine provides cache coherence in hardware, making it an instance of the general class of multiprocessor architectures called CC-NUMA (for cache-coherent NUMA).

The system network in NUMAchine consists of a hierarchy of rings. We show how certain properties of rings allow for an efficient cache coherence scheme with reduced overheads in comparison to other CC-NUMA architectures. We use the simulator, which we developed as part of this project, to explore the NUMAchine design space in an attempt to discover how changes in various aspects of the architecture affect overall performance.

Acknowledgments

First and foremost, I would like to thank Zvonko Vranesic and Michael Stumm for their guidance and support, without which this thesis would certainly never have come to fruition. And of course I also have to thank Zvonko for teaching me some of the finer points of squash, even if he did take an inordinate amount of pleasure in thrashing me. To Michael, thanks for the proof that sleep is not actually a biological necessity. To the Punks, well, what can I say? It's been a long, strange trip, and you guys were along for the whole ride. But now I guess it is time to get off the roller coaster, stagger around a bit and fall down. Without a close group of friends to provide emotional support, I would have run out of steam long ago. To Dan, Gus, Kate, Andy and Stef, my thanks. And feel free to call the debt whenever you want. Penultimately, a toast to the NUMAchine team. Who ever thought that by simply banging your head against a piece of hardware for years on end you could get it to work? Don't tell anybody. And finally I would like to thank my parents. To my mother, whose patience with a son who seemed destined to be in school forever bordered on the beatific, my love and thanks. And to my father, who could not stick around for the end of the party, the best I can do is to promise that I will honour the memory.

CONTENTS

CHAPTER 1 Introduction

CHAPTER 2 Background
  An Overview of Parallelism
  Early History
  Low-level Parallelism
  Higher-level Parallelism
  A Parallel Taxonomy
  Limits to Parallelism
  Architectural Aspects of Parallel Systems
  The PRAM Model
  Message Passing vs. Shared Memory
  Cache Coherence
  Memory Consistency Models
  Memory Subsystems
  Multiprocessor Networks
  Full Crossbars
  Multistage Interconnection Networks
  Hypercubes
  k-ary n-cubes
  Fat Trees
  Busses and Rings
  Network Summary
  System Area Networks
  SCI (Scalable Coherent Interface)
  Myrinet
  Memory Channel
  Synfinity
  Sample Multiprocessors
  Stanford DASH and FLASH
  Illinois I-ACOMA
  Teracomputer
  SUN E10000
  SGI Origin
  Beowulf
  Conclusion

CHAPTER 3 NUMAchine Architecture, Implementation & Simulator

CHAPTER 4 Simulation Environment
  Station Bus and Queue Modelling
  Memory Card
  Processor Card
  Paging Policy
  Instruction Fetches and Sequential Code
  Prototype Analysis
  Comparison of the Simulator and the Prototype
  Parallel Performance
  Ring Performance
  Network Cache Performance
  Request Latency
  Flow Control
  Conclusion

CHAPTER 5
  Full Simulation Parameters
  Algorithmic Speedup of the Test Programs
  Baseline Performance and Page Placement
  Comparative Studies
  Coherence Overhead
  A Relaxed Consistency Model
  Central Ring Speed
  Network Cache Performance
  Network Cache Size
  Network Cache Associativity
  Conclusion

CHAPTER 6 Conclusion

APPENDIX A Notes on the Simulator
  MINT Modifications
  Notes on the NUMAchine Simulator (Mintsim)

References

LIST OF FIGURES

FIGURE 2.1: Von Neumann and PRAM memory models.
FIGURE 2.2: A hierarchical cache coherence directory.
FIGURE 2.3: Memory consistency models.
FIGURE 2.4: Relaxed consistency.
FIGURE 2.5: Various classes of memory subsystems.
FIGURE 2.6: Scalable interconnection networks.
FIGURE 2.7: DASH and FLASH architectures.
FIGURE 2.8: Teracomputer architecture.
FIGURE 2.9: SUN E10000 architecture.
FIGURE 2.10: Layout of the SGI Origin 2000.

FIGURE 3.1: A high-level view of the NUMAchine architecture.
FIGURE 3.2: Cards on the station bus.
FIGURE 3.3: The NUMAchine Network Interface Card (NIC).
FIGURE 3.4: The Inter-Ring Interface (IRI).
FIGURE 3.5: The NUMAchine filtermask.
FIGURE 3.6: Coherence actions at the home memory.
FIGURE 3.7: Sequential consistency in NUMAchine.
FIGURE 3.8: The NUMAchine simulator structure.

FIGURE 4.1: Modelling of sequential code and instruction fetches.
FIGURE 4.2: Simulated prototype speedups for the SPLASH programs.
FIGURE 4.3: Parallel versus algorithmic speedups.
FIGURE 4.4: Simulator versus hardware prototype speedups.
FIGURE 4.5: Overinvalidation rates.
FIGURE 4.6: Central Ring utilization.
FIGURE 4.7: Local Ring utilizations.
FIGURE 4.8: Central Ring queue utilizations.
FIGURE 4.9: Local Ring queue utilizations.
FIGURE 4.10: Use of the just-freed slot.
FIGURE 4.11: Network Cache hit rates.
FIGURE 4.12: Local Ring locks caused by the NC.
FIGURE 4.13: Average bus utilization.

FIGURE 5.1: Direct-mapped versus 4-way associative processor caches.
FIGURE 5.2: Algorithmic parallel speedups for the experimental system.
FIGURE 5.3: Parallel speedups of the baseline system with round-robin and first-hit page-placement policies.
FIGURE 5.4: Processor utilisation graphs corresponding to the first-hit curves in Figure 5.3.
FIGURE 5.5: Turning off cache coherence.
FIGURE 5.6: Bandwidth requirements of the Central Ring.
FIGURE 5.7: Effects of increasing Central Ring speed.
FIGURE 5.8: Effects of Network Cache size on performance.
FIGURE 5.9: Effects of adding associativity to the Network Cache.

FIGURE 6.1: NUMAchine with 24 processors.

LIST OF TABLES

TABLE 2.1:

TABLE 4.1: SPLASH program parameters for the prototype analysis.
TABLE 4.2: Uniprocessor simulated versus hardware execution times.
TABLE 4.3: Base contention-free latency for a local read.
TABLE 4.4: Congested latencies for a 64-processor system simulation.

TABLE 5.1: Problem sizes for SPLASH-2 kernels.
TABLE 5.2: Problem sizes for SPLASH-2 applications.

CHAPTER 1 Introduction

There will never be such a thing as too much computing power. As new levels of computing power become available, either applications grow in size and complexity, or new problems become feasible. The most cost-effective means of increasing computing power over the last 30 years has been to ride the technological wave. Increased density of integrated circuits, lower voltages and increased clock speeds have reduced the pressure on computer designers to apply novel architectural solutions to the performance problem. Keeping pace with Moore's Law [Moore 1975] has been an astounding technical feat for the semiconductor industry. But the basic architecture of a computer system has remained relatively unchanged: a single processor with one or more levels of cache, and a dedicated volatile memory.

The power behind Moore's Law is exponential growth. In the long term, this is also one of its shortcomings. The faster a finite system grows, the faster it will reach its natural limits. In the case of silicon technology, the hard upper limit is imposed by the speed of light. We can get a feel for this limit by supposing that with line sizes less than 0.1 micron, it will be possible to partition a large design into smaller clock domains each occupying at most 1 mm². Light propagation time over a distance of 1 mm is approximately 3 ps, so after accounting for noise and timing margins, a conservative upper bound is about 100 GHz. Although electronics still has a fair bit of headroom, unless some revolution in device physics comes along, even entirely new technologies such as all-optical switches will be unable to break through this barrier. Given that the current state-of-the-art for processors in 1999 is around 1 GHz, a doubling in speed every 1.5 years would mean that the single processor will have hit its speed limit in about 12 years, and so we can fairly conservatively predict the tailing off of Moore's Law by 2010 or 2020.

The Semiconductor Technology Association has a roadmap [Semiconductor 1997] which lays out major technological hurdles and indicates how the industry plans to overcome them. The end of the road for the 1997 version of the plan is at 2012, where they are predicting a 0.05 micron line size, 0.5 V power supplies and speeds of 3 GHz. However, they acknowledge that to achieve these goals a paradigm shift in lithography techniques will be necessary, which implies a large leap in production costs. In addition to this, system integration will become much more complex. The roadmap shows processors using 175 W, and at speeds over 1 GHz traces on a printed-circuit board start behaving like transmission lines, which requires a whole new level of design expertise. All of these factors point to a diminishing rate of return on performance from the lowest architectural levels. Not only will it become technologically infeasible to expect silicon (or whatever other materials win out) to provide future performance gains, but economically it will become more cost-effective to search for higher-level architectural solutions.

Taking advantage of parallelism available at various different levels offers a solution. At the lowest levels, microprocessors use parallelism to execute multiple instructions in a single clock cycle. At the system level, parallel processing allows large tasks to be sped up by splitting them into a number of smaller tasks. It is this system-level parallelism which will be the focus of this dissertation.
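The figures quoted above can be checked with a quick back-of-the-envelope calculation; the arithmetic below is only a restatement of the argument, not additional data from the roadmap:

\[
t_{\text{prop}} \approx \frac{1\ \text{mm}}{c} = \frac{10^{-3}\ \text{m}}{3 \times 10^{8}\ \text{m/s}} \approx 3.3\ \text{ps},
\qquad
t_{\text{limit}} \approx 1.5\ \text{years} \times \log_2\!\left(\frac{100\ \text{GHz}}{1\ \text{GHz}}\right) \approx 10\ \text{years},
\]

which, allowing for design margins, is consistent with the 12-year horizon and the 2010 to 2020 window cited above.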
Parallel processing systems have evolved significantly from the original mainframe, which offered a very rough form of parallelism through batch processing. The search for much higher levels of performance led to supercomputers, which although powerful were also extremely expensive. Parallel processing is just now entering its commoditization phase, following much the same trend as PC technology in the seventies; from an expensive technology available only to the few, it is now becoming a commonplace approach that will be not only available to, but usable by, the multitudes. In broadening its appeal, the field is also expanding into new areas. Until recently, parallel systems were almost exclusively the domain of large scientific applications. Nowadays parallel processing 'servers' are becoming increasingly popular for tasks such as database and on-line transaction processing. While commercial versions of these machines do exist, they tend to be much more expensive on a per-processor basis than comparable workstation technology and are more difficult to use. The challenge of making these systems cost-effective, efficient and usable still has to be met.

1.1 Goals of the NUMAchine Project

The primary goal of the NUMAchine project, at least from the hardware perspective, is to show that it is possible to develop a low-cost, scalable shared-memory multiprocessor that has good performance. Cost must be kept low because we feel that multiprocessors need to move into a commoditization phase. As PC technology has shown, commoditization leads to widespread adoption and also innovation, in a kind of feedback loop. When we speak of scalability, we do not consider the goal to be a machine that can scale to a very large number of processors. Multiprocessors with four processors are now becoming commodity parts. In the next decade we expect to see this number move up to a few tens of processors. We thus set our scalability goal at about 100 processors. We consider our goal achieved if the performance and cost of the machine scale well up to this limit, and leave the problem of higher levels of scalability to future architects.

The trade-off between cost and performance is also important. Traditionally, multiprocessor research has focussed on performance as the main goal: much effort has been expended in trying to squeeze out a few more percentage points of parallel efficiency. If there is a lesson to be learned from the last decade's ascendancy of the PC over the workstation, it is that computer users do not care whether a machine is efficient or not. All they care about is whether the computer can do the job they want it to do, and do so cheaply. Thus, our performance goal is really subsidiary to our cost goal. We want to reduce system cost as much as possible without seriously degrading performance. As long as increasing the number of processors produces some gain, the user can decide whether the gain is worth the cost.

NUMAchine is also designed to be used as a platform for research into parallel operating systems, compilers and applications. NUMAchine provides low-level hardware monitoring functionality that can be used by software to analyse system performance. While monitoring is available in commercial machines, it is limited and its functionality is fixed. NUMAchine's monitoring, in contrast, is dynamically configurable. Also, since we designed the hardware, we have a better idea of what to look for (and where to look) when performance does not meet expectations.

To support both software and hardware research it is important that NUMAchine remain flexible. We provide this flexibility in the hardware by the use of field-programmable devices (FPDs). The logic inside these devices is reconfigurable, making it possible to add or change functionality at the hardware level, while the chip is still soldered to the board. This also greatly simplifies the debugging process. These FPDs and the use of commercial off-the-shelf (COTS) parts also contribute to NUMAchine's ability to keep costs low.

1.2 Thesis Contributions

NUMAchine is a large, multi-year project. From initial high-level architectural discussions to the current working 48-processor prototype took about five years. The author was one of the principal designers on the project for the entire period. Most of the architectural choices were hashed out in lengthy team design sessions, and so benefited from many viewpoints. The author was solely responsible for creation of the architectural simulator, and the use thereof to validate these design choices. To varying degrees, the author was also involved with all other aspects of the design, including:

- Design specification and schematic capture.
- Design of five of the 36 FPD controllers.
- Design verification of all the individual boards, as well as the general system-level functionality. This also included design and validation of the hardware cache coherence scheme.
- Post-fabrication debugging (the author headed the debugging effort).
- C and Assembler hardware test development. This included the development of processor boot code and low-level device drivers.
- Forward-looking architectural studies, for which the author had sole responsibility. In this area, the set-replacement algorithm for adding associativity to the NC was contributed by the author.

Thus, in part or in whole, the author can lay claim to the following overall contributions from the NUMAchine project:

- A 48-processor working prototype of a distributed, shared-memory multiprocessor with integrated hardware cache coherence. We present evidence that this proof-of-concept machine displays good performance and scalability, while maintaining a very simple and low-cost architecture. The prototype is capable of running a parallel operating system and applications.
- Evidence to support the claim that a hierarchical ring network can be used to implement a multiprocessor with the features of scalability in both performance and cost.

- A remote-access cache, called a Network Cache, which improves the performance of programs in a machine with a NUMA memory system.
- Presentation of a hardware cache coherence scheme which makes use of the ring's ordering properties to achieve good performance. The coherence scheme is also tightly integrated with the Network Cache, and uses a novel lazy approach to directory maintenance which is shown to improve performance.
- Design and implementation of a cycle-accurate event-driven simulator, used for both architectural studies and prototype performance analysis.

1.3 Thesis Overview

This dissertation begins by giving a review of parallel processing architecture and systems in Chapter 2. As a particular instance of this class of machines, Chapter 3 describes the design of the NUMAchine multiprocessor. The chapter also contains a description of the NUMAchine architectural simulator, which is the main tool used for both the initial prototyping and exploratory studies. Chapter 4 makes use of the simulator to analyse NUMAchine's overall performance, in addition to analysing various aspects of the architecture described in Chapter 3. These results are compared to results obtained from the actual hardware. Chapter 5 undertakes an investigation of NUMAchine's design space, with the goal of determining the most effective areas of the architecture to tune in order to increase performance. Finally, Chapter 6 summarizes and presents the major conclusions, along with a description of possible future work and trends in the field.

CHAPTER 2 Background

There have been many approaches to parallel computing over the years. This chapter begins with a brief history and description of parallelism in general, then moves on to descriptions of certain key concepts in parallel systems architecture which are necessary to understand the chapters that follow. It concludes with brief reviews of a number of existing commercial and experimental parallel systems, in order to give a feel for the state of the art.

The field of parallel computing is extremely broad, covering many different aspects of computer architecture, from the lowest levels of a processor's internal architecture to networking and memory system design. The material presented here is only a brief synopsis. In particular, a firm knowledge of uniprocessor architectures will be assumed. (The standard reference in this area is [Patterson 1998].) A good compendium of current knowledge on parallel computer architecture is [Culler 1999], which goes into much greater detail. A list of appropriate references is given in each subsection. (If no reference is given for a term or concept, then it can be found in Culler.)

2.1 An Overview of Parallelism

2.1.1 Early History

The early days of computing actually used a simple version of parallelism by default, as embodied in the mainframe systems of the day. These were large batch-processing systems, where many users gained access to multiple processors by submitting their jobs to be processed by whichever processor first became available. The number of jobs was large and the number of processors small, so there was no point in splitting individual jobs into subtasks and moving away from very coarse-grained parallelism. In addition there was no experience with parallel data structures or algorithms, so there was no support available for the programmer even if such a feature were available in the hardware. The focus in mainframes was on throughput and reliability. Throughput in this context meant completing as many jobs per unit time as possible. Reliability meant providing enough fault tolerance that the machine would not crash, losing long-running jobs. Mainframes spared no expense to achieve these goals, making them affordable only by large organizations.

As individual processors became faster and cheaper, the advent of the PC became possible. Users of batch-processing systems found it frustrating to have so little deterministic control of their jobs, not knowing how many hours or days later they could expect completion, and nobody liked sharing computing resources with others, but the machines were too expensive to support any other model. The mainframes were also very weak at anything involving user interaction with the computer. In the era of the punchcard, the programmer would write and debug code by hand, making completely certain that the program would run as expected before going to the trouble of submitting the job to the actual computer. Processor cycles were too expensive to be wasted on such actions as debugging a program or editing a document. The PC made it economically feasible to have processing power dedicated to one user, not only for running jobs which would have a much more predictable execution time, but also for providing a dedicated single-user interface directly to the machine.

Since programming became an art no longer of the specialist but of the masses using PC technology, it is not surprising that the programming paradigm of one program for one processor became the 'normal' way to write code. In addition, since the power of the chips was growing exponentially, there was little need for the average programmer to look further. The advent of Fortran and its eager adoption by the scientific and engineering communities changed this scenario by creating a new class of users and a new class of applications. Scientists found that computers could be used to find approximate numerical solutions to large problems such as systems of partial differential equations which had no tractable analytical solution. The size of the problem was limited solely by the amount of computing available; problem size could be scaled up by increasing the numerical accuracy or the size of the system under investigation.

Parallel computing had by this point found a specialized niche. The machines were typically custom-designed using very exotic architectures and technologies, hence they were extremely expensive. This limited their use to large research labs and the military. These machines used parallelism at a low level through vector operations, where a single instruction could perform the same operation simultaneously on arrays of numbers, which is a common feature of scientific programs.
Dedicated maintenance staff and programmers were the norm, making the entire field very specialized. The next wave in parallel research, massively parallel processing (MPP), focussed on achieving performance with less exotic technologies by means of high-level parallelism. Researchers looked at designing machines with thousands of processors, hoping for performance through sheer quantity. It turned out to be a very difficult problem to design good architectures that would work for any arbitrary number of processors. At the same time, the PC market was driving single-processor performance ahead at an exponential rate, often through commoditization of supercomputer architectural techniques. The upshot for MPP research was that it became feasible to achieve supercomputer performance with only a few hundreds of processors. In addition, it was realized that there would always be a handful of users needing supercomputers, but that these users were a special group that could afford to pay a hefty premium. The push to move parallel processing into the mainstream has shifted attention to multiprocessor systems containing tens to hundreds of commodity processors.

2.1.2 Low-level Parallelism

Parallelism allows higher levels of performance for a given clock rate, so it is complementary to raw circuit speed. Modern processors internally take advantage of fine-grain parallelism in two ways. The first is pipelining, which works by breaking a task (in this case a single machine-level instruction) into sub-tasks (or stages), each of which can execute concurrently as long as they do not depend on each other. A typical processor pipeline would include instruction decode, operand fetch, instruction execute and result storage stages.¹ The stages can be completed more quickly since they are smaller and simpler, requiring less complex logic to implement. The second type, termed superscalar, utilizes multiple pipelines operating in parallel. Here the 'task' is viewed more as groups of individual instructions, and the sub-tasks are the individual instructions which can be launched into different pipelines independently. If there are dependencies, then hardware must provide interlocks, which is a form of synchronization allowing the stages or pipelines to stall until the conditions causing the interlock are clear. Note that VLIW (Very Long Instruction Word) processors are really just a variant on the superscalar design, where the choice of which instructions can be executed in parallel is made statically by the compiler instead of being done on-the-fly in hardware.

1. Intel's 486DX processor has 5 stages. Newer processors use deeper pipelines (called superpipelines). Intel's PentiumPro, for example, uses 14 stages. See [Burd 1999] for other examples.

2.1.3 Higher-level Parallelism

While the previously discussed uses of parallelism have been effective, they have a high cost in terms of hardware design time and complexity. In addition they rely on trying to exploit parallelism dynamically at the lowest level, which means that they have a very myopic view of the task at hand, and can only do a very simple local analysis of the available parallelism. (VLIW compilers can afford to broaden their view somewhat, but the available parallelism is still very fine grain.) Finally, the low level of these solutions means that they can only make use of generic types of parallelism which are common to all programs.

Most computational tasks contain parallelism at various different levels. At the lowest level we have instruction-level parallelism, which was discussed above. This is particularly suitable for a processor which uses a reduced instruction set (RISC). The instructions are small and simple with few, if any, side-effects, which makes it feasible for a scheduler to make decisions in hardware at the internal speed of the processor. At a higher level there is thread-level parallelism, in which a large computation is split into smaller computational threads, each of which can run concurrently, either by time-multiplexing on a single processor or using a number of independent processors.

Our work focuses on the use of this higher level of parallelism in a multiprocessing environment. The modern commodity processors used in NUMAchine take advantage of the forms of lower-level parallelism described in the previous section. In most cases the two levels are independent, but occasionally lower-level design choices can have an impact on the high-level design. Such issues will be touched on in Chapters 4 and 5.

2.1.4 A Parallel Taxonomy

One of the first attempts to present a taxonomy for the different classes of parallel architectures was provided by Flynn [Flynn 1972]. This describes the relation between the instruction stream and the data stream and whether or not each of these takes advantage of parallelism. With both instructions and data non-parallel, we have the traditional uniprocessor model, termed Single-Instruction/Single-Data or SISD.

When a single instruction stream works on multiple data streams it is called SIMD (Single-Instruction/Multiple-Data). An example of this type of machine could involve a single controller coordinating the activity of numerous data-processing engines, which all do the same operation in lock-step but on different data items. (This type of SIMD machine is also called a systolic array, drawing upon the analogy of the human heart which pumps blood through the arteries at each step.) The advantage to this approach is that the data engines can be extremely simple and cheap, making large levels of parallelism feasible. In addition, because of the lockstep operation there is no need for synchronization. Other SIMD architectures include vector processors such as the CRAY-1 [Russell 1978] and the recent trend to add 'multimedia' instructions to modern processors, an example being SUN's UltraSPARC VIS [SUN 1997], which stands for Visual Instruction Set. In image processing, each pixel of the image can generally be processed independently. Since pixels are usually stored as four individual bytes of colour and other information, a single 32-bit load can bring in an entire pixel and a single operation can be performed on all four bytes in parallel, as sketched in the example below.

The final category is Multiple-Instruction/Multiple-Data (MIMD), which applies to shared-memory processors. (A MISD architecture does not make physical sense.) In MIMD, processors each fetch their own instruction and data streams independently. In the case that all the instruction streams correspond to the same program, the term Single-Program/Multiple-Data (SPMD) is also used. MIMD machines naturally support thread-level parallelism. The standard approach is to associate one thread with each processor and divide the data set for the program amongst the threads. Note that MIMD and SIMD are not mutually exclusive. A processor in a MIMD system can still take advantage of opportunities for SIMD parallelism in its local portion of the data.
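The packed-pixel idea mentioned above can be illustrated with a minimal sketch in C. The shift-and-mask trick below is only a stand-in for a true SIMD instruction such as those in the VIS or MMX extensions, and the pixel layout is an assumption for illustration:

```c
/* Minimal sketch: one 32-bit load brings in four 8-bit channels, and a
 * single pass over the 32-bit word touches all four of them at once.   */
#include <stdint.h>
#include <stdio.h>

/* Halve each packed 8-bit channel: shift the whole word, then mask off
 * the bits that were shifted across channel boundaries. */
static uint32_t halve_pixel(uint32_t pixel)
{
    return (pixel >> 1) & 0x7F7F7F7Fu;
}

int main(void)
{
    uint32_t pixel = 0xFF804020u;   /* A=0xFF, R=0x80, G=0x40, B=0x20 */
    printf("%08X -> %08X\n", pixel, halve_pixel(pixel));  /* 7F402010 */
    return 0;
}
```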

2.1.5 Limits to Parallelism

No system is ever perfect, and parallel computers suffer from various different impediments to efficiency. The most fundamental notion in parallel systems work is that of concurrency. Two or more operations are concurrent if they execute simultaneously. If two operations depend on each other in such a way that one must execute before the other, then they cannot run concurrently, and the available concurrency is reduced. For example, when summing up a list of numbers, the list can be divided arbitrarily into sublists, and the summation of these sublists can be carried out at the same time. The available concurrency in this case is equal to the number of sublists. The initial division of the list into pieces may allow concurrency if care is taken. Each of the subtasks which calculates a partial sum would need access to the global list. If the same algorithm is used by all the subtasks, and it can be proven that the algorithm can never result in two subtasks choosing the same item, then these subtasks can also run concurrently. The alternative would be to have one designated 'master' subtask responsible for the division of the list, and only after the partitioning allow the other subtasks to start summation. In this case, all of the 'slave' subtasks must wait for the master to finish the partitioning, allowing no concurrency at all during the partitioning phase.

The first job in approaching the parallelization of a given problem is to determine what concurrency is theoretically available, which puts an upper limit on the amount of parallelism which can be achieved. For the case of summation given above, for a list containing N numbers the concurrency is ⌊N/2⌋ if we consider addition to be a binary operation that requires at least two operands. (The use of the greatest lower bound is due to the case where N is odd, and one subtask gets three operands instead of two.) If we have enough computing elements available, we can assign one partial sum to each, with each element requiring at most one (or two for N odd) operations to do the sum. In this case we have enough parallelism at our disposal to efficiently use all of the concurrency, and the running time of the partial summation is O(1).

While calculation of the partial sums has been sped up as much as possible, our algorithm must now accumulate all of these partial sums into the global sum. Having a single subtask add up all the partial sums would take O(N) time, and thus would be no better than the single-processor case. A more parallel approach is to form the partial sums into a binary tree, and designate a processor at each level of the tree to add the partial sums of the children, which would result in an O(log N) running time. At each level in the tree, processors must wait for their children's sums to be ready, requiring synchronization. Thus, not only is the parallel algorithm different from the sequential one, but there is overhead associated with the process of parallelization. This overhead can come either from synchronization or extra code needed in the parallel algorithm. Note that for this reason the best sequential algorithm and the best parallel algorithm running on one processor are not the same. When measuring the performance improvement when running on N processors, the standard metric used is the speedup, which is loosely defined as the total program execution time on a single processor divided by the time for N processors.
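The two-phase summation described above (independent partial sums followed by a binary-tree combination) can be sketched as follows. The thread count, data size and barrier-based synchronization are illustrative assumptions, not anything specific to NUMAchine:

```c
/* Sketch of tree-style parallel summation with POSIX threads.
 * Compile with: cc -pthread tree_sum.c                                  */
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4
#define N        1024

static double data[N];
static double partial[NWORKERS];
static pthread_barrier_t bar;

static void *worker(void *arg)
{
    long id = (long)arg;

    /* Phase 1: each worker sums its own contiguous sublist. */
    double sum = 0.0;
    for (int i = id * (N / NWORKERS); i < (id + 1) * (N / NWORKERS); i++)
        sum += data[i];
    partial[id] = sum;

    /* Phase 2: combine partial sums pairwise up a binary tree,
     * O(log NWORKERS) rounds; the barrier stands in for the
     * parent/child synchronization described in the text.     */
    for (int stride = 1; stride < NWORKERS; stride *= 2) {
        pthread_barrier_wait(&bar);
        if (id % (2 * stride) == 0 && id + stride < NWORKERS)
            partial[id] += partial[id + stride];
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];

    for (int i = 0; i < N; i++)
        data[i] = 1.0;                       /* known answer: N */

    pthread_barrier_init(&bar, NULL, NWORKERS);
    for (long id = 0; id < NWORKERS; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < NWORKERS; id++)
        pthread_join(t[id], NULL);

    printf("sum = %.0f (expected %d)\n", partial[0], N);
    return 0;
}
```

Phase 1 exploits the ⌊N/2⌋-way concurrency of the partial sums, while phase 2 performs the O(log N) tree combination; the barrier is the synchronization overhead referred to above.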
From the foregoing discussion, it is clear that we also have to specify whether the single-processor execution time is for the sequential or parallel version of the program. Both measures have their uses. Comparing to the best sequential algorithm indicates whether parallelization is useful at all. If a parallel algorithm adds substantial overhead and has poor speedup, then the parallel program may never be able to outdo the sequential version by a large enough margin to justify parallelization. Normally though, it is the case that the overhead involved is not significant, and it is usually clear from the outset whether or not there is enough concurrency to justify parallelization. Thus, given that the user knows she will be running the parallel algorithm, she is more interested in how much improvement she can get. The version of speedup using the parallel algorithm to measure both the single-processor and N-processor execution times is the more common, and is the one that will be used from here on.

The next question to ask is what fundamental limits exist on the speedup which can be achieved. The basic result here is Amdahl's Law [Amdahl 1967], which states that the total speedup achievable by any parallelization of an algorithm is limited by the parts of that algorithm which are inherently sequential. That this is true is obvious by considering a program with execution time T1 on a uniprocessor, which can be partitioned into two sections: code that is inherently sequential, taking time T1s, and the rest of the program (assumed to be fully concurrent) with execution time T1p. Thus T1 = T1s + T1p. In the best case, we assume that the parallelizable section can make perfect use of the concurrency on N processors, so the total execution time for a given N is:

\[
T(N) = T_{1s} + \frac{T_{1p}}{N}
\]

Thus the limit as N gets large is T1s. This means that no amount of parallelism can speed up a given application beyond a certain point, which would seem to indicate that there is a limited usefulness for parallelism, as every program will have some sequential portion. For a while, researchers regarded this conclusion to be true, but we can write the equation for the speedup (T(1)/T(N)) as:

\[
\text{Speedup} = \frac{T(1)}{T(N)} = \frac{T_{1s} + T_{1p}}{T_{1s} + T_{1p}/N}
             = \frac{1 + T_{1p}/T_{1s}}{1 + \frac{1}{N}\cdot\frac{T_{1p}}{T_{1s}}}
\]

What this equation shows is that the knee of the speedup curve is reached once the value of 1/N becomes comparable to T1s/T1p. Thus if we assume that the ratio of sequential to parallel code is some non-infinitesimal number, then we cannot reach large speedups. But for many realistic problems, this ratio can actually be made arbitrarily small by increasing the amount of work done in the parallel section. This will usually increase the work in the sequential section as well, but this is not a problem as long as the time complexity (as a function of the amount of work) for the sequential section is lower than that of the parallel section. (For example, if we define some arbitrary parameter, W say, as a measure of the amount of work, then if T1s is O(W) and T1p is O(W²) we can make the ratio arbitrarily small by choosing W large enough.) If these conditions hold true, then it is possible to achieve large speedups.
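As a concrete illustration of the knee (the numbers are chosen arbitrarily), suppose the parallel section is 100 times larger than the sequential one, so that T1p/T1s = 100. Then:

\[
\text{Speedup}(N{=}100) = \frac{1 + 100}{1 + 100/100} \approx 50,
\qquad
\text{Speedup}(N{=}1000) = \frac{101}{1.1} \approx 92,
\qquad
\lim_{N\to\infty}\text{Speedup} = 101.
\]

The knee occurs near N ≈ T1p/T1s = 100; beyond that point, additional processors yield rapidly diminishing returns.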

2.2 Architectural Aspects of Parallel Systems

2.2.1 The PRAM Model

When reasoning about computer architecture, it is important to have in mind a model to act as a framework on which to test out ideas. For uniprocessors, the basic model is from von Neumann, and imagines a processor attached to a memory module, from which it fetches both code and data. The parallel analog to this is called the PRAM (Parallel RAM) model [Cormen 1989]. As shown in Figure 2.1, the PRAM model assumes some number of processors which share access to the same memory, with each read or write taking one cycle. The basic model allows simultaneous accesses by multiple processors to the same memory location for either reading or writing. It is assumed that concurrent writes store the same value to a given location, or if they do not then at least some mechanism is in place to deterministically choose which value is ultimately visible to future reads. The advantage to such a simplistic model is that it is mathematically tractable. Enhancements to the model disallow concurrent accesses for reads, writes or both to model slightly more realistic systems, with the drawback that these models quickly lose their tractability. None of the PRAM models assign any cost to interprocessor communication; however, it is assumed that processors have some method of synchronizing their operations, albeit with zero overhead.

FIGURE 2.1: Von Neumann and PRAM memory models. Both the von Neumann and PRAM models assume zero latency to access memory for either reads or writes. Implicit in the PRAM model is synchronization, which is necessary for program correctness. Both models assume infinitely fast memory and communication, with no contention.
The typical methods of synchronization include mutual-exclusion locks, which guarantee that at most one processor can own the lock at any time, and barriers, which guarantee that all processors have reached a certain point in their computation before allowing any processor to proceed.

While simple, the PRAM model can still make predictions about parallel performance. The program executed on the processors will contain the overhead due to parallelization, as well as any overhead due to synchronization (e.g. when two processors need to acquire a lock at the same time, the second will be delayed until the first releases it). If the amount of work is not divided equally between the processors, then some processors may have to wait for others at barrier points, which is termed a load imbalance, and contributes to the loss of parallel efficiency. The speedup as measured under the PRAM model is thus the best that any real system could achieve, and is called the algorithmic speedup.

Other models which take into account communication costs such as latency, bandwidth and overhead have been proposed. Models such as bulk-synchronous parallel (BSP) [Valiant 1990] or LogP [Culler 1993] extend the PRAM model in an attempt to make it more realistic. While interesting theoretically, these models take no account of real network topologies and assume fixed constant delays. In real parallel systems, the details of the network have a significant influence on overall system performance.

A simple model of communication includes three parameters: latency, bandwidth and overhead. Latency is the total time between initiating a request and receiving a response. In a uniprocessor, for example, the latency to load a variable consists of the time between the processor executing the instruction, and the result becoming available in the processor's register. For a cache hit, typical latencies in a modern 500 MHz microprocessor are on the order of 10 ns, while a cache miss and the resulting read from a local memory might take 200-300 ns. Bandwidth refers to the maximum number of bytes per second that a communication channel can move between a source and a destination. (Note that the source and destination can consist of groups of senders and receivers, in which case the bandwidth is the sum of the bandwidths over all channels which can carry data simultaneously.) In a chain of point-to-point connections, bandwidth is usually taken to mean the bandwidth of the slowest link. Overhead represents any fixed delays inherent in communication, and contributes directly to latency. As an example, data may require compression before being transmitted over a channel, which increases overhead (and latency) in direct proportion to the compression time.

The relationships between these parameters are subtle, meaning that simple network models are not very accurate. Latency, for example, usually decreases as bandwidth is increased. If, however, the bandwidth is increased in one link of a communication chain such that some other link becomes the slowest in the chain, overall latency will not decrease as much as expected. Increasing bandwidth could also cause overhead to increase. Take a protocol processing engine which is fed by a communication channel. If channel bandwidth is increased too much, then the processor may reach a point where it cannot handle the incoming packets fast enough, at which point the channel will back up and the network will become congested, leading to an overall increase in latency. Bursty traffic patterns are another major source of congestion.²
Due to its highly nonlincar behaviour congestion is not only dificult to model, but can also have significant effects on performance even if the time-averaged level of conges- tion is low. In this thesis we avoid these communication modelling issues by using a detailed cycle- accurate model of the entirie system in Our simulation environment, including not only latency, bandwidth and overhead, but also network congestion. We will make use of the basic PRAM model to measure the algorïthmic speedup for the pwpose of comparison with expenments.

2. Note that a system could be designed to handle worst-case bursts, but this would be an overdesign. Bursts are by definition infrequent, so resources designed with bursts in mind would be unused most of the time.

2.2.2 Message Passing vs. Shared Memory

There are two fundamentally different approaches to providing inter-processor communication in a parallel environment. In message-passing, each processor is associated with a local memory which can be used by only that processor. The only way for processors to communicate is by sending messages back and forth, typically using calls such as Send() and Receive(). Under this paradigm, the programmer must code all communication explicitly, and is responsible for coordinating the senders and receivers. Send() and Receive() are usually implemented as blocking calls.³ This means that Send() will not return until it can be guaranteed that the message has been consumed by the intended recipient calling Receive(), and vice versa. If the programmer makes a mistake, then the program will hang. (This makes programming trickier, but simplifies debugging.)

While message-passing is a technically clean solution to writing parallel programs, it has certain drawbacks. Firstly, experience has shown that message-passing programs are fairly difficult to write. The problem is that the whole notion of communicating processes is unfamiliar to a programmer used to writing sequential code. The most obvious sequential algorithm may have to be changed substantially before it can be implemented efficiently using message-passing. Secondly, since the basic communication mechanism is provided to the programmer by means of the operating system, there is a large overhead for sending a message. This means that sending small messages is very expensive, and the programmer must try to amortize the cost by coalescing as much data as possible into a single message. This approach is not natural to a programmer new to message-passing, and the learning curve is steep.

Shared memory, in contrast, tries to make the programmer's view of the system look as much like a standard uniprocessor as possible. This paradigm provides the programmer with a single global memory space, which is shared and accessible by any processor. Communication in this case is implicit; in the simplest case one processor writes to a variable that is read by another processor, and the system is responsible for ensuring that modifications are propagated accordingly. In contrast to message-passing, synchronization in the shared-memory model must be explicit. An extra complication under shared memory arises if processors are allowed to cache shared variables. In this case a cache coherence scheme (described in the next section) is necessary.

3. Non-blocking, or asynchronous, calls can also be used to increase concurrency, although they make the programmer's job more difficult since they require separate synchronization.

Another programming paradigm should be mentioned here, although it operates at a higher level than message-passing or shared memory, and can be implemented on top of either. In data-parallel programming, code is written as if it were sequential, with annotations added to indicate where data can be processed in parallel. The standard example here is High Performance Fortran, which looks much like regular Fortran but with extra compiler directives, embedded in comments, which control data partitioning. The goal is to have the compiler and run-time system handle the details of the actual data distribution. Such an environment can run on top of a system that supports either message-passing or shared memory, since the programmer is not supposed to be aware of the lower levels.

There is currently no consensus on which of the message-passing or shared-memory paradigms is better, although the focus in the research and commercial communities has shifted to shared memory. The feeling seems to be that the extra complexity of providing cache coherence is sufficiently offset by the reduced cost of application development (though there is really little hard evidence to back this up). That being said, there are successful commercial message-passing systems, such as IBM's SP2. Indeed, as we will see in Section 2.5, there are new approaches to multiprocessor networking that support both schemes, allowing the system and programmer to independently choose whichever is more appropriate.
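For concreteness, the blocking Send()/Receive() style discussed earlier in this section can be written with MPI. This is purely illustrative of the message-passing paradigm and is not an interface used on NUMAchine; note also that MPI_Send is only guaranteed to block until the send buffer can be reused, whereas a strict rendezvous with the receiver would use MPI_Ssend.

```c
/* Two-process blocking send/receive sketch using MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Returns once 'value' may safely be reused by the sender. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocks until a matching message from rank 0 has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```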

2.2.3 Cache Coherence

In a shared-memory multiprocessor, it is possible for one or more processors to have local copies of a given cache line.⁴ (Remember that in a message-passing machine, all memory is private, so this cannot happen.) When all accesses to the cache line are read-only, then there is no problem allowing all processors to use the line concurrently. A processor cannot just simply write to such a shared line if it needs to modify the line contents, however, since this would cause its local in-cache copy to be inconsistent with other copies scattered throughout the system. The purpose of a cache coherence scheme is to provide a mechanism to allow sharing of cache lines in an orderly manner. This is critical from the point of view of both programmers and compilers to ensure the correctness of a program.

4. A cache line even in a uniprocessor cache is really a copy of some master block of data which resides at a fixed location in a memory module. Thus the novel feature here is that there can be many such copies.

There are numerous different approaches to providing cache coherence, with the most general classification being into hardware- and software-oriented schemes. The aim of hardware coherence is to push the responsibility for coherence down to the lowest architectural level, making it 'invisible' to all levels above. While this is convenient from the point of view of the program- and OS-writer, it greatly increases the complexity and cost of the hardware. Software schemes, on the other hand, leave responsibility for coherence in the programmer's hands. This makes the hardware simpler, but adds complexity and overhead to the software. Thus, there are advantages and drawbacks to both methods, and it is not clear that any one particular scheme is superior in all cases. One of the advantages of a hardware-coherent machine is that software coherence can always be implemented on top, allowing the use of the best scheme for a given situation, whereas if hardware support is not present from the outset, it cannot be designed in after the fact. Indeed, one of the goals of the NUMAchine project is to investigate trade-offs between hardware and software coherence.

The basic idea behind software coherence is that the programmer is responsible for managing writes to shared memory explicitly. This makes shareable memory qualitatively different from private memory and puts more of the burden of correct program execution on the programmer. In work such as Shared Regions [Sandhu 1993], the programmer associates an arbitrarily sized region of shared memory, which could have multiple writers, with a mutual-exclusion lock. The programmer is responsible for making sure that any writeable shared data is properly associated with a Shared Region, and if not the system makes no guarantees. This approach works well if the regions are large, because the overhead of providing the Shared Region is amortized over a large amount of data. This method can also benefit from block transfers, which can be more efficient than transferring a single cache line at a time.

Hardware coherence, while hiding the details of the protocol from the programmer, costs much more in terms of design time and hardware.⁵ In addition, certain design choices such as the size of the basic unit of coherence must be fixed, because supporting flexibility in hardware is too costly. (It should be noted that with modern programmable logic, this could change.) On the other hand, the time overhead of providing coherence in hardware is greatly reduced compared to software. This is not only beneficial but necessary, since hardware coherence usually uses cache lines as the unit of coherence, which typically have sizes of 64 or 128 bytes. This fine a granularity requires that the overhead be low for the scheme to remain efficient.

5. Synchronization such as locking must still be provided in a hardware cache-coherent system. While concurrent writes to a single cache line are not possible, coherence does not enforce any ordering on writes to multiple cache lines.

When a cache contains a line in the shared state and wants to modify it, there are two basic choices: update- or invalidation-based schemes. An attempt by a processor to write to a shared line in an invalidate protocol causes a request to be sent to the coherence engine to gain write ownership of the line. Such a request may or may not succeed depending on the cache line's current state. (Another processor may already have requested write access, for example.) When successful, the request causes invalidations to be broadcast to all processors sharing the copy. The invalidation command unconditionally kills a specific line if the tags indicate that it is present in the cache. Typically, the coherence engine must wait for acknowledgments (ACKs) to the invalidations to guarantee that all shared copies have been eliminated. At this point an exclusive ownership acknowledgment can be sent back to the original requester, which can proceed with the write. (In MIPS terminology [Heinrich 1994 and MIPS 1996], the original request for exclusive access is called an upgrade. They also use the invalidate to serve dual purpose as the upgrade acknowledgment.) Note that if a processor wishes to write to a cache line it does not have in its cache (i.e. the cache misses), then this line could still have shared copies elsewhere. In this case the processor sends a read exclusive request. The same invalidation process occurs, but instead of sending an ownership acknowledgment the coherence engine sends back a read exclusive response that includes data. A line that has been modified and written to is said to be in a dirty state, meaning that it contains a different value than main memory. If a dirty line must be ejected from the cache, the processor issues a writeback of the line to memory, and memory is then considered to be the line's owner. Another option is to have writes always propagate through to memory immediately, so memory is never out of date and is always the owner of data. This is called a writethrough scheme. While it simplifies the notion of ownership, it also generates unnecessary traffic for cache lines that are written multiple times by a single processor.

In an update protocol, the processor sends out a request to modify the given line, and includes the new data for the specific target word in the line. If this update request succeeds, the new data is broadcast and merged into all shared copies, and an ACK is sent to the requester, which can then proceed to change its copy. The state of the cache line is then called dirty shared, indicating that it is out-of-date with respect to memory, but that there may also be other copies in the system.

There are a number of trade-offs in choosing between an update and a writeback-invalidate scheme. If a cache line is used in a producer-consumer fashion, where one processor modifies the line and many processors read it, then an update protocol performs better. An invalidate protocol causes the producer to invalidate the line, then forces all of the consumers to re-read the line on a subsequent load miss. If a line is shared only because many processors needed it in the past, but they will not need it in the future, then an update protocol will update lines which are not needed any more, causing a large amount of unnecessary traffic. In an invalidate protocol, after the invalidation phase, processors that have a real need for the line are forced to fetch it, which keeps the sharing list current. An invalidation protocol also performs better if there are a number of writes to a line before access by another processor, because a write to a dirty line causes no traffic (the processor already has exclusive write permission). On average, studies have shown that most traffic in a shared-memory multiprocessor is better suited to an invalidate protocol [Weber 1989, Culler 1999 and Srbljic 1997].

The most common writeback-invalidate protocol used in bus-based SMP systems is the MESI protocol (the name comes from the possible cache line states Modified/Exclusive/Shared/Invalid), also known as the Illinois protocol from its originators [Papamarcos 1984]. Each possible copy of a line in the processor caches and memory has one of the four states associated with it. The Shared state indicates that one or more caches and memory contain a copy of the line. The Invalid state is functionally equivalent to the cache line not being present in the cache. (An invalidation to a Shared state usually changes the state to Invalid but does not overwrite the cache tags; the Invalid state indicates that the line is not available for use by the processor even if the tag happens to match.) The Modified state is the same as the dirty state above. The protocol ensures that only a single processor cache can have a line in the Modified state, with other caches and memory guaranteed to be invalid. The Exclusive state indicates that only one processor cache contains a copy, but that the line has not been modified and memory is still up-to-date. An ejected Exclusive line does not require a writeback, since it is not in the Modified state, but writes can proceed immediately since ownership has already been obtained. This state is useful for data that is used only by one processor, also known as private data. A multiprocessor can be used to run multiple sequential programs simultaneously, which is referred to as multiprogramming. Here every program's data is private. Without the Exclusive state such private data would generate a read then an invalidate on a write, but the invalidate in this case just represents wasted bandwidth if it is known a priori that no other processor will be accessing the line. (A state-transition sketch of the MESI protocol is given below.)
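The MESI transitions described above can be summarized in a small state machine. The sketch below is highly simplified: it assumes the upgrade or read-exclusive request always succeeds, and it ignores data transfer, writebacks and acknowledgments. The event names are illustrative, not MIPS or NUMAchine terminology.

```c
/* One cache line in one cache, reacting to local accesses and snooped
 * bus transactions. A real protocol engine tracks far more detail.     */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_READ_EXCL, BUS_UPGRADE } event_t;

static mesi_t next_state(mesi_t s, event_t e)
{
    switch (e) {
    case PR_READ:                     /* local read */
        return (s == INVALID) ? SHARED /* or EXCLUSIVE if no other sharers */
                              : s;
    case PR_WRITE:                    /* local write, ownership obtained */
        return MODIFIED;              /* via upgrade or read exclusive    */
    case BUS_READ:                    /* another processor reads the line */
        return (s == INVALID) ? INVALID : SHARED;  /* M/E supply data, drop to S */
    case BUS_READ_EXCL:               /* another processor wants to write */
    case BUS_UPGRADE:
        return INVALID;               /* our copy is invalidated          */
    }
    return s;
}

int main(void)
{
    mesi_t s = INVALID;
    s = next_state(s, PR_READ);       /* I -> S (or E)          */
    s = next_state(s, PR_WRITE);      /* S -> M after upgrade   */
    s = next_state(s, BUS_READ);      /* M -> S, data supplied  */
    s = next_state(s, BUS_UPGRADE);   /* S -> I                 */
    printf("final state: %d\n", s);
    return 0;
}
```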
If a line is shadonly because many processors needed it in the pst, but they will not nced it in the future, thcn an update protocol will update lines which are not needed any more, causing a large amount of unneccssary traffic. In an invalidate protocol, after the invalidation phase, processors that have a real need for the iine are forced to fetch it, which keeps the sharing List cumnt. An invalidation protocol aiso performs bettcr if there are a number of writes to a line before access by another processor, because a wrîte to a dirty Iine causes no Mc(the pmcessor aùrady has exclusive write permission). On average, studies have shown that most traffic in a shd-memory rnultiprocessor is better suited to an invalidate protoc01 [Weber 1989, Culler 1999 and Srbljic 19971. The most common writeback-invalidate protocol uscd in bus-based SMP systems is the MES1 protocol (the name comes hmthe possible cache line states Modified/Exclu- sivelSharedlInvalid), also known as the IUinois protocol from iîs originators [Papamarcos 19841. Each possible copy of a iine in the proctssor caches and memory has one of the four states associated with it. The Shad state indicates that one or more caches and memory con- tain a copy of the line. The invalid state is functionally quivalent to the cache lioe not king present in the cache. (An invalidation to a Shared state usually changes the state to invalid but does not overwrite the cache tags; the Invalid sbte indicates that the line is not available for use by the processor even if the tag happens to match.) The Modified state is the sarne as the duty state above. The protocol ensures that only a single processor cache cmhave a line in the Modified state, with other caches and memory guaraateed to be invalid. The Exclusive state indicates that only one processor cache contains a copy, but that the line has not been modified and memory is still up-to-date. An ejected Exclusive line does not require a writeback, since it is not in the Modified state, but writes cm proceed immediately since ownership has aheady been obtained. This state is useful for data that is used only by one processor, also known as private data. A multiprocessor can be used to run multiple sequential prograrns simultaneously, which is referred to as rnultipmgramming. Here every program's data is pnvate. Without the Exclu- sive state such pnvate data would generate a read then an invaiidate on a write, but the invali- date in this case just represents wasted bandwidth if it is hown a priori that no other processor wiil be accessing the lint. One of the first hardware cache coherencc protocols was calicd bus-snooping. Early mul- tiprocessors usually consisted of a small number of processors (8 or less) connccted by a bus. Busses were a na& choict as they were simple to design and in common usage. For multi- processing, they also had the advantage of providing a natwal atomic bll~adc4~rmechanism. The one-to-al1 broadcast is uscful for sending updatcs on cachc linc information to ail proces- sors simultaneously, and king atomic makes it easier to verify tbt correctness of the protocol since numerous race conditions are avoided. A processor trying to read a line that is dirty in some other processor's cache broadcasts an intervention rtquest to ali other processors. The bus is held until a data response is provided to the intervention. One of the problems with snoopy protocols is that the pmxssor caches involved in the snoop also have to handle processor requests. 
This can cause the snoop to have a high latency while waiting for the cache to finish dealing with its processor, and since the bus is held for the duration of the snoop, this may create or increase bus contention problems6. One solution to avoid contention for the critical caching resource is to provide a duplicate set of tags dedicated solely for snooping. This is effective, but the performance must be traded off against the extra cost of the SRAMs needed to provide the second set of tags.

Busses are not a viable solution for connecting more than about 16 processors due to bus saturation. Increasing the bandwidth of the bus by making it wider and faster is possible, but only up to a point. Driving a wide bus (e.g. up to 512 bits) at high speeds with numerous loads requires special (i.e. expensive) drivers. The alternative to busses is to use some other network with better scaling properties. (These scaling properties will be described later in this chapter.) The basic snooping mechanism does not work in the absence of atomic broadcasts, so other schemes are necessary.

A standard approach is to use directory-based coherence protocols, which keep the state and current location for each cache line in the system in some globally accessible table. The amount of information so stored and the handling of inexact information differentiates directory protocols. A simple directory could store just the fact that more than one processor has a

6. Most busses nowadays use a split-transaction protocol, where the bus is released between the request and response, allowing other transactions to use the bus. This helps to alleviate the problem, but makes protocol verification much more difficult because bus transactions are no longer atomic.

copy. A write by any processor would then require a broadcast of invalidation requests to all processors in the system. However, a broadcast invalidate costs not only network resources, but also time in each processor's cache to check if the line is present and invalidate it. A better scheme is to use some kind of list, specifying exactly which processors have a copy, called a full directory. But in this case, the cost of the directory does not scale well with system size. Typically, the total amount of memory in a system scales with the number of processors, N, because a fixed number of processors usually share a memory module. The number of cache lines is thus also O(N). A full directory scheme requires for each cache line a bit-mask containing one bit for each processor in the system, for a total of O(N^2) bits. Since the directory is usually implemented using SRAM, the cost of the system becomes unreasonable for large N.

Two approaches to improving directory scalability are limited and sparse directories. With limited directories, each cache line entry contains only enough storage for a fixed number of sharers. If this number overflows, then either a broadcast is used or previous sharers must be removed from the list. The efficiency of this approach depends on the amount of sharing being low in the common case, so that the overflow handling occurs rarely. A sparse directory makes use of the fact that at any given point in time only some fraction of all the cache lines in the system will be in use, so there is no reason to allocate permanent directory entries for unused lines. The directory is thus created and managed dynamically on an as-needed basis. While this is the most efficient scheme in terms of storage, it also has the largest overhead for processing directory entries. (For a hybrid approach which uses the best features of both models, see the description of the LimitLESS directory in [Chaiken 1991].)

One final consideration for directory-based protocols is their physical distribution. The simplest model is to have the directories collocated with the home memory. For access to a remote memory, this means that the request may travel the entire span of the network before finding that a specific line is contained in some other node which is fairly close to the requester. A solution to this problem is to replicate directory information throughout the system in a bid to reduce the latency for accessing the coherence state. This replication increases the storage requirements for the directory and introduces a new problem of keeping the replicated directory entries coherent, but the limited and sparse directory techniques in the previous paragraph may be used, and the increase in coherence performance may outweigh the cost. Figure 2.2 shows a tree-based network, where the leaves contain the processing elements and memory, with each node above the leaves maintaining directory information for all of the cache lines below, both local and remote.
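To make the storage trade-off concrete, the structs below contrast a full-directory entry, whose presence bit-mask grows with the number of processors, with a limited-directory entry that stores a fixed number of sharer pointers and falls back to broadcast on overflow. The field layouts and sizes are illustrative assumptions, not the format of any particular machine.

#include <stdio.h>
#include <stdint.h>

#define NPROCS 256           /* system size N, chosen only for illustration */
#define NPTRS  4             /* sharer pointers kept by the limited scheme  */

/* Full directory: one presence bit per processor, so the mask grows with N
 * and total directory storage across the machine grows as O(N^2).          */
typedef struct {
    uint8_t state;                        /* e.g. shared / dirty / invalid  */
    uint8_t presence[NPROCS / 8];         /* bit-mask of sharers            */
} full_dir_entry_t;

/* Limited directory: fixed storage per line regardless of N.  When more
 * than NPTRS caches share the line, fall back to broadcast (or evict one
 * of the recorded sharers).                                                 */
typedef struct {
    uint8_t  state;
    uint8_t  nsharers;                    /* valid entries in sharer[]      */
    uint8_t  overflow;                    /* set once NPTRS is exceeded     */
    uint16_t sharer[NPTRS];               /* processor IDs of known sharers */
} limited_dir_entry_t;

int main(void)
{
    printf("full entry: %zu bytes, limited entry: %zu bytes\n",
           sizeof(full_dir_entry_t), sizeof(limited_dir_entry_t));
    return 0;
}

With N = 256 the full entry already needs 32 bytes of mask per cache line, while the limited entry stays at a few bytes no matter how large N grows.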


FIGURE 2.2: A hierarchical cache coherence directory. In a tree-based hierarchical network with processing nodes (containing processors and memory) at the leaves, coherence directory information at a given node of the tree is a superset of the information contained in all the node's children. A miss from Node D to a line whose home memory is Node A, but for which a dirty copy is contained in Node C, can avoid traversing the entire tree.

If a cache line has already migrated to a local node, then a request can be satisfied in the local directory without having to go to the root of the tree (which contains the sum total of all directory information). To maintain the superset property, changes to cache states lower in the tree must be propagated upwards.

There is another approach to coherence called page-based shared virtual memory (SVM) [Li 1989], which is software-based but makes some use of hardware in an attempt to reduce the programming complexity. (Unfortunately it also goes by the name virtual shared memory, or VSM.) SVM supports shared memory by making use of the virtual memory page-mapping mechanism available in all modern microprocessors. In a virtual memory system, a page table is used to map the processor's internal virtual addresses to real physical addresses that correspond to a location in some memory. SVM reduces the cost of hardware by implementing the coherence scheme in the page-fault handlers. The basic unit of coherence is a page, which is typically on the order of 4 KB in size. Because the coherence is done in software, the protocol can be much more complex with little extra cost. On the other hand, being in software means that the overhead for providing coherence is large. For SVM to work well, it must amortize this cost by achieving a very high hit rate (i.e. a large amount of data re-use).

The first access to a shared page in SVM causes the handler to allocate a page in local memory. The handler must then send a query to the page's home location to ascertain whether the page is clean or dirty. The appropriate remote page is then copied into the local page, at which point all further accesses to this shared page will hit to the replica contained in the local memory. Coherence must still be maintained between the various copies of the page, which poses a problem because of the large block size. A writeback-invalidate scheme would cause the entire page to be invalidated and re-fetched on a write to any word on the page, which would generate a huge amount of traffic. If two writes by different processors occur to the same word on the page, then the word really is being shared and the associated coherence overhead is attributed to true sharing. On the other hand, it is possible that the two processors are writing to entirely different memory ranges which happen to reside on the same page. The coherence overhead in this case is not truly necessary, but is solely an artifact of the large coherence grain; such unnecessary overhead is referred to as false sharing. Great care must be taken in SVM by the programmer and compiler to lay out data to avoid false sharing. (True sharing is inherent in the algorithm, and is unavoidable.) However, heavy padding of data to adhere to page boundaries can cause memory usage to become very inefficient if the amount of padding is comparable to the amount of data. This will also have the effect of increasing the capacity miss rate in the caches. For these reasons vanilla SVM implementations achieve only mediocre performance.

One way of improving SVM performance would be to somehow allow writes to a page to avoid coherence traffic. One way of doing this is to associate shared data regions with locks (much like Shared Regions) and then observe that while a lock is held, writes to a shared page need not be made visible to other processors. Only when some other processor acquires the lock will it need the data. The idea of relaxing the memory model while making sharing more explicit to the programmer is called lazy release consistency [Cox 1992].
This is one instance of what are called relaxed memory consistency models, which is the topic of the next section.
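The false-sharing effect described above is easy to reproduce. In the sketch below (POSIX threads assumed; the loop counts and variable names are arbitrary), two threads update logically unrelated counters that happen to occupy the same coherence unit, so every write forces that unit to migrate between the two caches, or between pages under SVM.

#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* Two logically unrelated counters that happen to share a coherence unit.
 * Inserting a cache line (or page) of padding between them removes the
 * false sharing and lets the two loops run essentially independently.      */
static struct { volatile long a; volatile long b; } counters;

static void *bump_a(void *arg) { for (long i = 0; i < ITERS; i++) counters.a++; return arg; }
static void *bump_b(void *arg) { for (long i = 0; i < ITERS; i++) counters.b++; return arg; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Every increment bounces the shared unit between the two caches,
     * even though no value is truly shared between the threads.            */
    printf("%ld %ld\n", counters.a, counters.b);
    return 0;
}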

2.2.4 Memory Consistency Models

Another issue that is of concern in the parallel computing domain is memory consistency7. As we have seen in the preceding section, cache coherence guarantees that if two or more processors write to a specific address, then some ordering is enforced such that all processors agree



FIGURE 2.3: Memory consistency models. The observed interleaving of reads and writes by different processors defines a memory consistency model. In the figure two shared variables, X and Y, are read and written by different processors, P1 and P2 respectively. In the absence of any consistency model, the two processors can disagree on the order in which the writes occurred.

on the current value. What coherence does not specify is the order in which reads and writes to different locations by different processors are observed. In Figure 2.3 the possibility of different observed write orderings is shown. If the variables X and Y happen to be in the same cache line, then coherence will guarantee that the processors see the same order. If P1 gains exclusive access to X before P2, then P1 will change the value and the line will be dirty. For P2 to gain either read or write access it must first fetch the line from P1, which means it will see the modification to X before it changes Y, which is the same order that P1 sees.

When the variables are in different cache lines, ordering constraints must be imposed outside of the coherence protocol. One of the most intuitive consistency models from the point of view of a programmer is sequential consistency [Lamport 1979], which is defined as follows:

A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program.

7. The nomenclature here is not well standardized, and the terms 'coherence' and 'consistency' are often used interchangeably. We take 'coherence' to mean a mechanism for presenting a coherent view of a single cache line to the system. We use 'consistency' to refer to the relationship between different cache lines.

In this model, for each processor we make an ordered list of all reads and writes in program order. We then form a global list by arbitrarily selecting an entry from each processor's list. If this global ordering is the same seen by all processors then we have sequential consistency. (The particular choice of ordering does not matter, only the fact that all processors see the same ordering.) This is convenient for the programmer because it is the same ordering that would be obtained if the parallel program were actually running as multiple threads on a uniprocessor. Note also that sequential consistency does not obviate the need for synchronization such as mutual exclusion and barriers.

While good for the programmer, sequential consistency imposes constraints on the hardware. A naive implementation would enforce a global order by means of a single sequencing point for all memory accesses, much like a ticket-system in a bakery. This sequencer represents a bottleneck, and would destroy much of the performance advantage of using a parallel system. A more sophisticated implementation can reduce the impact of sequential consistency, although never to zero. (In section 3.1.7 we describe NUMAchine's implementation of sequential consistency, and show that the added overhead is minimal.)

To improve performance, various proposals have been made for weaker consistency models. In many cases sequential consistency is an overly strict regime to impose on the hardware, and is not always necessary from the programmer's point of view. Figure 2.4 shows how certain orderings may not be critical for a program to be semantically correct. The writes to A and B by P1, for example, can be re-ordered by the processor or the network, as long as they are both visible before the assignment to flag.

The weakest possible ordering is no ordering at all, except for the write ordering provided by the coherence mechanism; reads and writes even from the same processor can have their orders swapped. As described in [Goodman 1991], such a system is unusable without an operation such as a fence, which guarantees that all accesses before the fence are complete (visible) before any access after the fence8. Some other models that lie between the two extremes include processor consistency (PC) (also in [Goodman 1991]) which allows reads to bypass writes, partial-store ordering (PSO) [SUN 1997a] which allows writes to bypass each other, and relaxed memory ordering (RMO) [SUN 1997a] which allows reads and writes to bypass previous reads. Note that when we talk of bypassing, we really mean completion of the access

8. Note that a fence is a more primitive construct than a barrier. Fences enforce ordering on reads and writes from a single processor, whereas barriers provide synchronization between processors.

FIGURE 2.4: Relaxed consistency. Arrows indicate the relation 'must occur before'. Sequential consistency maintains the order of every access, but only certain orderings are crucial to the programmer. Accesses to the flag variable are the only ones that need to be ordered for synchronization correctness. (Figure taken from [Culler 1999].)

(i.e. the cache is updated and the processor can make the results visible in the register file). Modern processors use techniques such as non-binding prefetch and speculative execution to increase performance. The memory model has implications for the ability of the processor to retire loads and stores in these instances.

There are numerous other issues pertaining to memory consistency, but they are not pertinent to the work that follows. It turns out that, for certain architectural reasons discussed in Chapter 3, NUMAchine provides fairly natural support for the sequential consistency model. In addition, recent results have indicated that modern processors can make sequential consistency perform almost as well as the relaxed consistency models [Hill 1998 and Gniady 1999]. A brief look at relaxing the memory model will be given in Chapter 4. A very thorough treatment of consistency issues can be found in [Adve 1996].
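To make the flag-synchronization pattern of Figure 2.4 concrete, the sketch below shows the producer-consumer idiom with explicit release/acquire ordering. The C11 atomics interface is used here purely as a modern stand-in for whatever fence or ordering primitives a particular machine exposes; it is an illustration, not part of NUMAchine.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

int data;                          /* ordinary shared data                    */
atomic_int flag;                   /* synchronization variable                */

static void *producer(void *arg)
{
    data = 42;                                              /* may be re-ordered ...      */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* ... unless released after  */
    return arg;
}

static void *consumer(void *arg)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                   /* spin until flag is set     */
    printf("%d\n", data);                                   /* guaranteed to print 42     */
    return arg;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Under sequential consistency the release/acquire annotations would be unnecessary; under PSO or RMO they are what prevents the write to data from being bypassed by the write to flag.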


FIGURE 2.5: Various classes of memory subsystems. The three broad categories are UMA, NUMA and COMA. In uniform memory access (UMA) architectures such as the dancehall and SMP, all accesses have the same latency. Non-uniform memory access (NUMA) machines allow for both local and remote memories. In a cache-only memory architecture (COMA), the memories are replaced with large caches called attraction memories. (The processors can contain caches in any of these categories.)

2.2.5 Memory Subsystems

There are numerous different options for the layout of memory in a shared-memory multiprocessor. One of the most popular configurations among today's commercial machines is called the symmetric multiprocessor (SMP) (see Figure 2.5). This consists of some relatively small number of processors (e.g. 8 or less) along with a memory module sharing a bus. Such an architecture is classified as having uniform memory access (UMA), because each processor sees the same latency to memory. With a more general network, there is the notion of accesses being remote or local, depending on whether they must traverse the network to be serviced or not, respectively. This leads to non-uniform access times (NUMA). It is possible to make all memories remote, obviating the need for a processor-memory connection. This is called a dancehall architecture [Culler 1999], and is typically also UMA since all memory accesses incur the same remote access penalty.

Any of these systems can support message-passing, software or hardware cache coherence. If a NUMA system uses the shared-memory paradigm, it is called distributed shared memory (DSM), and if cache coherence is provided in hardware, then the term CC-NUMA is used.

An entirely different type of memory system is the cache-only memory architecture (COMA). To understand the rationale behind this architecture, consider a CC-NUMA system where each processor contains a large dedicated cache to enhance locality. These caches use copies of data from main memory; a cache line not contained in any cache has either never been used, or was used but had all copies ejected. In the best case, all data needed by the processors should be stored in the caches. However, real caches suffer misses. In a cache-coherent multiprocessor, caches can miss for one of four reasons:

- Cold or Compulsory Miss - The very first access to a given line will miss. This is clearly unavoidable, but can be alleviated by prefetching. Since large line sizes tend to have a prefetching effect, they can help to reduce this miss rate.

- Capacity Miss - This is caused by the cache being too small to contain all of the data that is needed. This can only be fixed by increasing the cache size. A cache of infinite size would show no capacity misses. A very small cache with a large line size could increase the capacity miss rate, because large lines have the potential to increase the pollution of the cache with unnecessary data.

- Conflict Miss - This occurs when two different cache lines map to the same entry in the cache, forcing the current occupant to be ejected. This effect can be reduced by increasing the set-associativity or the size of the cache. A fully associative cache would have no conflict misses. A very large line size can also exacerbate the conflict miss rate. (A small address-mapping sketch follows this list.)

- Coherence Miss - In a cache-coherent machine, lines may need to be invalidated to maintain coherence. This has nothing to do with the cache organization. These misses could possibly be reduced by changing the coherence scheme or reducing the line size so that less data is thrown out for each invalidation, and also by reducing the amount of false sharing.
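The sketch below illustrates how conflict misses arise in a direct-mapped cache: the line index is just a slice of the address, so two addresses that agree in those bits evict each other even when the cache is far from full. The cache geometry chosen here (1 MB, 64-byte lines) is an assumption for the example.

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE  64                    /* bytes per cache line              */
#define NUM_LINES  16384                 /* 1 MB direct-mapped cache          */

/* For a direct-mapped cache, the line index is a simple slice of the
 * address, so any two addresses that agree in those bits compete for the
 * same slot.                                                                */
static unsigned cache_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}

int main(void)
{
    uint64_t a = 0x100000;               /* two arrays exactly 1 MB apart ... */
    uint64_t b = 0x200000;
    printf("index(a)=%u index(b)=%u\n", cache_index(a), cache_index(b));
    /* Equal indices: alternating accesses to a and b evict each other
     * (conflict misses) even though most of the cache sits unused.          */
    return 0;
}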

COMA architectures aim at reducing capacity misses by greatly increasing the size of the caches (to the same size as main memory), and eliminating main memory entirely, since it acts only as secondary storage and a place for coherence to take place. These large memories, used as caches, are called attraction memories. They take over the central role in the coherence protocol, and are backed directly by paging to the disk. While COMA architectures do reduce the capacity miss rate, the absence of main memory means that address mapping is not fixed. Given an address that misses in the attraction memory, there is no way of figuring out just from the address itself where to go to fetch the line. (In a CC-NUMA machine the upper address bits typically select a particular memory module in the system.) Thus, the original COMA design requires a global search of the attraction memories' tags until a line is either found, or if not found then paged in. For this reason, COMA machines often use a hierarchical directory scheme to reduce the coherence latency. Also, because the attraction memories are caches, there is a problem when only one attraction memory in the system contains a cache line: if ejected, this line must be saved to disk or moved to another attraction memory, because it is the only existing copy. These are the main reasons why COMA has proven to be extremely expensive both in terms of latency and logic complexity. The trade-off between decreasing capacity misses and increasing coherence miss overhead favours COMA architectures only for those applications that work on very large data sets, and have correspondingly high capacity misses. For an analysis of the relative performance of CC-NUMA and COMA systems see [Stenstrom 1992].

In a flat COMA scheme (also described in [Stenstrom 1992]), a dedicated home location is provided for addresses. The home stores only coherence directory information, with data being maintained by the attraction memories as before. While it simplifies directory lookup, flat COMA has difficulties when dealing with the last-copy replacement problem. In a normal COMA system with a hierarchical directory, an ejected last copy can move up the tree until it finds a directory which contains a line which can be ejected (either because the line is invalid or because it is shared and there is at least one other copy elsewhere in the system) and can take that line's place. Flat COMA, in contrast, must keep extra information as to which lines are available for ejection, and if no local space is available must start searching through other attraction memories until a suitable destination is found.

While COMA and CC-NUMA schemes both have their pluses and minuses, CC-NUMA seems to be the architecture of choice for large-scale modern commercial multiprocessors. Not only does CC-NUMA's relative simplicity lead to lower overall system costs, but COMA only outperforms CC-NUMA on a small group of applications, and the performance improvement is not enormous. For those cases where COMA's data replication and migration are advantageous, it is possible to achieve a similar effect by performing the same operations at the page level by means of the operating system. (See [Laudon 1997] for a description of the CC-NUMA SGI Origin 2000, which uses this approach.)

2.3 Multiprocessor Networks

The most critical design choice in a multiprocessor is the network. It affects performance through its latency, bandwidth and contention-handling characteristics. It also has a major impact on the design complexity of the cache coherence protocol and memory consistency model since, as we have seen in preceding sections, both of these depend heavily on the ordering of transactions, which in turn is impacted by network topology.

One of the most fundamental requirements of a multiprocessor network is scalability. This means that certain critical network parameters should increase no worse than linearly as the system size (usually measured by the number of processors, N) is increased. Ideally, we would like a network to be scalable in the following three areas:

- Cost - This should scale linearly with N, so that the marginal cost of adding another processor is constant. It is also desirable for the initial cost (for just a few processors) to be low, so that small configurations are feasible;

- Latency - This should be constant, but since logic fan-in cannot be infinite the best that can be expected is O(log N);

- Bandwidth - This should scale with N.

For bandwidth, it is really the bisection bandwidth that should scale linearly. The bisection bandwidth is defined as follows: Consider an equipartition of a system of communicating nodes, and calculate the aggregate bandwidth between the two partitions. The bisection bandwidth is the lower bound of this aggregate bandwidth over all possible equipartitions.
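As a concrete check of the definition, let b denote the bandwidth of a single link; the following are the standard bisection values for the generic topologies discussed later in this chapter:

\[
B_{\mathrm{bisect}}^{\mathrm{ring\ of\ }N} = 2b, \qquad
B_{\mathrm{bisect}}^{\sqrt{N}\times\sqrt{N}\ \mathrm{torus}} = 2\sqrt{N}\,b, \qquad
B_{\mathrm{bisect}}^{\mathrm{hypercube},\,N=2^{n}} = \tfrac{N}{2}\,b .
\]

Only the hypercube meets the linear-scaling requirement; the ring's bisection bandwidth is constant regardless of N.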

The linear scaling requirement is necessary if we assume that on average all processors communicate equally with all other processors. (Note that this is not the worst case scenario, which would have N-1 processors all accessing the last processor.)

The scalability of a network is determined by its topology. The overall performance of a network is affected by other characteristics such as:

- Direct or indirect. Direct networks put a switch at each node, meaning that some nodes are closer (in a latency sense) than others. This allows for efficient nearest-neighbour communication. Indirect networks put all nodes on the periphery of the network, making access times uniform between nodes.

- Packet- or circuit-switched. Circuit-switched networks set up a fixed dedicated connection between two nodes. This connection is fast because it is not shared, but there is an overhead cost for the setup and breakdown of the circuit. Once the circuit is set up the routing is fixed, so message routing overhead is low. Packet-switched networks break the communication stream into packets, each of which is routed independently. There is no setup overhead, but routing overhead is increased, because typically more than one routing decision must be made. Packet-switching allows the network resources to be used in parallel for many different streams, which makes it the choice for multiprocessors where many nodes communicate simultaneously.

- Store-and-forward or cut-through routing. Store-and-forward waits for an entire packet to be received at a switching element before routing it to the next. This works well if all packets are of fixed size, but requires buffers large enough to contain some maximal number of full packets. Cut-through (also called wormhole) routing determines the next switch based on the packet header, then passes the rest of the packet through as it comes in. This adds complexity to the flow-control and congestion-handling logic, but uses less buffer space and reduces the latency. Virtual cut-through is a hybrid of these two: it works like wormhole routing, but stores the entire packet if the outgoing channel is blocked.

- Static or dynamic routing. Static routing uses fixed routing tables that do not change in time. Dynamic (or adaptive) routing can change routes to avoid congested areas. It also provides fault-tolerance because a broken channel can be bypassed. Dynamic routing is more complex, and allows for the possibility of different paths between two fixed nodes.

- Error checking, higher-level flow-control protocols, and other fault-tolerant features.

In the following sections we will look at some generic multiprocessor interconnects.

2.3.1 Full Crossbars

In a PxQ crossbar, P input ports are fully connected to Q output ports through a single switching layer (Figure 2.6 (a)). In most cases, P and Q are the same. Normally, we consider each input to be paired with an output, providing fully bidirectional links. (A unidirectional crossbar is usually a part of some larger network.) Crossbars provide linear bisection bandwidth scaling, and a constant latency since there is only a single switch. The cost, however, scales as N^2, so the number of ports is typically some small number, say less than 10. Note, however, that crossbars can be hooked together to form other network topologies.

2.3.2 Multistage Interconnection Networks

Multistage Interconnection Networks (MINs) represent a class of networks, some examples of which are the Omega (Figure 2.6 (b)), Banyan and Butterfly networks. Each switch within the MIN is a pxp crossbar, so the number of switch levels (and hence the latency) scales as logp N. The cost of the network scales as (N logp N). All nodes are equally 'remote' and suffer maximal latency, which does not allow for local-communication optimizations. The size of a MIN can be grown by adding more layers of switches.

2.3.3 Hypercubes

An n-dimensional binary hypercube connects 2^n nodes. A 4-D example is shown in Figure 2.6 (c). If we consider the nodes to reside at the corners of a unit cube in n-dimensional space, then each node connects to its n nearest neighbours along each dimension. If we label the node position by its n-space coordinates (e.g. 0010, 1110), then connections exist between any nodes that differ by one bit. As with MINs, hypercubes offer linear bisection bandwidth growth, and latency that grows as log2 N. The cost similarly grows as (N log2 N). One problem with hypercubes is that the degree (number of connections) at each node increases as the system size grows. In practice, this means that the maximum degree of a node (and hence the




FIGURE 2.6: Scalable interconnection networks: (a) crossbar, (b) 4x4 Omega network, (c) 4-D hypercube, (d) 4-ary 2-cube, (e) 16-node fat tree with fanout of 4. (From [Lenoski 1992].)

maximum system size) must be decided upon before implementation. Hypercubes allow for local communication.

2.3.4 k-ary n-cubes

The k-ary n-cube is a more general form of the hypercube, with the binary base 2 replaced by an arbitrary base k. A 4-ary 2-cube is shown in Figure 2.6 (d). Most systems use a dimensionality of 2 or 3, and grow the system size by increasing k. The 2-D and 3-D incarnations are also called meshes. Having k > 2 means that node symmetry is lost for a mesh; nodes on the boundary of the mesh have lower degrees than nodes in the interior. This is a problem from the hardware standpoint, since either switches with varying port numbers must be produced, or some of the ports must go unused. The symmetry can be regained by adding links between the peripheral nodes along each dimension. (In topological terms this is equivalent to wrapping the edges around and connecting them.) In this case, the network is referred to as a torus, and the average latency between nodes is reduced by a factor of two. Routing in a torus is more difficult, as is wiring. For a given dimension, the cost grows linearly. The drawback is that bisection bandwidth only grows as N^((n-1)/n), and latency increases as N^(1/n).
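The 'differ by one bit' adjacency rule for hypercubes translates directly into code. The small sketch below (node labels are the binary coordinates used in the text) tests whether two labels are neighbours by checking that their XOR has exactly one bit set.

#include <stdio.h>

/* Nodes u and v are hypercube neighbours iff their labels differ in one bit. */
static int is_neighbour(unsigned u, unsigned v)
{
    unsigned d = u ^ v;
    return d != 0 && (d & (d - 1)) == 0;   /* exactly one bit set             */
}

int main(void)
{
    printf("%d\n", is_neighbour(0x2 /*0010*/, 0xE /*1110*/)); /* 0: differ in two bits */
    printf("%d\n", is_neighbour(0x6 /*0110*/, 0xE /*1110*/)); /* 1: neighbours         */
    return 0;
}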

2.3.5 Fat trees

An N-node fat tree with fanout f is constructed by superposing N individual fanout-f trees in such a way that each level of the fat tree has a constant number of links between switches. This provides for linear scaling of bisection bandwidth and log N latency increase, while providing multiple root nodes to ensure the root is not a bottleneck. Fat trees are like MINs in that their cost scales as (N log N) and they can be grown by adding layers of switches. They are different from MINs in their ability to provide low-latency shortcut paths by routing only up to the lowest level in the tree necessary to reach the destination.

2.3.6 Busses and Rings

As mentioned before, the most common interconnect topology is the bus due to its simplicity. While the latency and cost are constant, the bandwidth is also constant, which means that a given bus is not scalable beyond a certain point. This saturation point can be moved up by increasing the width and speed of the bus. The drawback is that the cost and complexity increase, because more exotic signalling technologies are required.

Rings also have a fixed bandwidth. However, the latency increases linearly as nodes are added, as does the cost. One advantage of rings is that they use point-to-point connections, as opposed to busses which have multiple loads on a single wire. Electrically this makes the ring's signalling environment much cleaner, and allows rings to be run at much higher speeds than busses, which can be used to trade off against the latency and bandwidth restrictions.

Busses and rings can both be organized hierarchically to increase the upper limit on their scalability. The higher levels of the hierarchy can be made wider or faster to increase the bisection bandwidth. The hierarchical organization maintains the ordering and broadcast capabilities of busses and rings, which can be used to advantage for cache coherence protocols, as will be seen in Chapter 3.

Table 2.1 summarizes the important network characteristics. The latency is determined on an uncontended network with uniform loads. Switch and wire costs indicate the number of switching elements and interconnections needed, respectively. In a hierarchy, the number of switches or bus/ring interfaces is the same as the number of internal nodes in an N-leaf tree with fanout f. With n = ⌈log_f N⌉, we can approximate the sum as follows.
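A sketch of the approximation, assuming that level i of the N-leaf tree (counting up from the leaves) contains about ⌈N/f^i⌉ internal nodes:

\[
\mathrm{Switch}(N) \;=\; \sum_{i=1}^{n} \left\lceil \frac{N}{f^{\,i}} \right\rceil
\;\approx\; N \sum_{i=1}^{n} f^{-i}
\;=\; N\,\frac{1 - f^{-n}}{f - 1}
\;\approx\; \frac{N}{f - 1}.
\]

That is, the hierarchy adds only a constant factor of roughly 1/(f-1) on top of the N leaf nodes.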

None of the networks is ideal in all respects. Performance has generally been considered more important than cost, which has favoured low-latency networks with good bisection bandwidth, such as tori and meshes. (See, for example, the SGI/Cray T3E and Hewlett-Packard V-Class machines.) However, for small- to medium-sized networks, say less than a few hundred nodes, performance can be increased by the use of custom-designed routers, and scalability is arguably not as critical an issue as cost. This has affected the approach to multiprocessor interconnects, as will be seen in the next section.

2.4 System Area Networks

Up until a few years ago, multiprocessor networks used either LAN-class technology or completely custom-designed solutions. Research in recent years has indicated that multiprocessor networks have very specific requirements which differ from those in other networking environments.

TABLE 2.1: Summary of network characteristics. A system is assumed to contain N nodes. For the MIN and fat tree each switch is an fxf crossbar. For the hierarchical bus and ring there are f nodes at each lowest-level bus/ring. Indirect networks do not allow fast local communication; direct networks do, except for a unidirectional ring (a bidirectional ring does). (Table from

[Table 2.1 rows: Crossbar (indirect), MIN (indirect), Hypercube (direct, switch cost N log2 N), 2-D Torus (direct, switch cost N), 3-D Torus (direct, switch cost N), Fat tree, Ring (direct), Hierarchical Bus (indirect, switch cost Switch(N)), Hierarchical Ring (indirect, switch cost Switch(N)); the remaining per-network wire-cost, latency and bandwidth columns are summarized in the surrounding text.]

Latency and bandwidth are particularly critical in the tightly-coupled multiprocessor environment. However, just as important is a low error rate. In an unreliable network such as a LAN, higher-level protocols (e.g. TCP/IP) provide reliable communications at the cost of much higher protocol overheads. In a latency-sensitive domain, such as multiprocessors, such overheads are unacceptable. System Area Networks (SANs) usually contain fast low-level error-detecting and correcting protocols directly in hardware. In order to avoid contention, they also contain quick flow-control mechanisms. While all of these features would be beneficial in a LAN domain, the added cost is not justifiable, since software protocols provide adequate performance.

The driving force behind SANs was message-passing. Sending a message using standard network interfaces typically involves going through the following steps:

1. Put the message into a buffer in local program memory.
2. Trap into the operating system, which then copies the entire message into a buffer in kernel memory.
3. Send this buffer to the network interface, which normally involves copying it into a buffer local to the interface.
4. Transfer the data across the network, then go through all of these steps in reverse order at the other end.

Clearly the copying done in steps two and three is not absolutely necessary. The goal of SANs is to provide applications with more direct access to the network interface. In fact, it is even possible to do away with the copying of the local program buffer into the network interface by mapping the region of local memory as uncached and making the physical address select the network interface directly. By also providing a direct virtual memory map for the control registers in the network interface, it is possible to have the program initiate the message sending without any operating system involvement whatsoever. Such zero-copy schemes have allowed message-passing overheads to come down from tens of microseconds to less than one microsecond, which is particularly beneficial for supporting the small messages that are often necessary in a multiprocessing environment. The following sections will give brief descriptions of some commercial SANs.
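To illustrate the zero-copy idea, the sketch below maps a network interface directly into the sender's address space and writes the message and a doorbell register with ordinary stores, so that after the initial mmap there is no system call and no kernel copy on the send path. The device path, window size, register offset and message format are all invented for the example; no real driver interface is implied.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define NI_WINDOW 4096            /* size of the mapped NI region (assumed)   */
#define DOORBELL  0x0F80          /* offset of the 'send' register (assumed)  */

int main(void)
{
    int fd = open("/dev/ni0", O_RDWR);             /* hypothetical NI device  */
    if (fd < 0) return 1;

    volatile uint8_t *ni = mmap(NULL, NI_WINDOW, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
    if (ni == MAP_FAILED) { close(fd); return 1; }

    const char msg[] = "hello";
    memcpy((void *)ni, msg, sizeof msg);           /* message lands in NI memory */
    *(volatile uint32_t *)(ni + DOORBELL) = sizeof msg;   /* trigger the send  */

    munmap((void *)ni, NI_WINDOW);
    close(fd);
    return 0;                                      /* the send itself used only
                                                      ordinary user-level stores */
}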

2.4.1 SCI (Scalable Coherent Interface)

SCI [Scott 1992] is an attempt to combine standards for both a physical networking layer and a cache coherence scheme to provide an 'off-the-shelf' solution for implementers of CC-NUMA machines. The physical layer specification aims at high-speed bus-like performance, but does not enforce any particular topology. In small configurations, SCI typically uses single or dual rings, for which commercial chipsets are available.

The cache coherence protocol is invalidation-based, and relies on a distributed directory using linked lists. Each network interface stores its locally active sharing list entries, which distributes the directory storage across the machine, and also keeps it near the network to allow quick access. One of the drawbacks to the linked list structure is that invalidations to highly shared blocks must traverse the entire list, which may be scattered across the machine, potentially increasing coherence overhead substantially. For applications where the degree of sharing is low, SCI performs fairly well. SCI is used in Sequent's NUMA-Q, Data General's AViiON and HP's V-Class servers.

Unfortunately SCI suffers from early- and over-standardization. Both the physical layer and the linked-list coherence are fairly out-of-date already. SCI is a good example of trying to standardize too much at too low a level while technology is still rapidly changing.
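The distributed linked-list idea can be pictured with the simplified per-line structures below. These are loosely modelled on the scheme just described and are not the actual SCI state machine or field layout: the home memory keeps only a head pointer, and each network interface that caches the line stores pointers to the neighbouring sharers, so an invalidation must walk the list one sharer at a time.

#include <stdint.h>

#define NO_NODE 0xFFFF

/* Kept at the line's home memory. */
typedef struct {
    uint8_t  state;        /* e.g. home copy clean, or gone                   */
    uint16_t head;         /* node ID of the first sharer, or NO_NODE         */
} home_entry_t;

/* Kept in the network interface of every node that caches the line.
 * Invalidating a widely shared line means chasing 'next' pointers across
 * the machine, roughly one network round trip per sharer.                    */
typedef struct {
    uint16_t prev;         /* toward the head (or the home, if first)         */
    uint16_t next;         /* next sharer, or NO_NODE at the tail             */
    uint8_t  state;        /* e.g. head / middle / tail of the list           */
} sharer_entry_t;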

2.4.2 Myrinet

In contrast to SCI, which was targeted specifically at parallel processing networks, Myrinet [Boden 1995] was born from an attempt to increase LAN performance. One of its goals was to allow clusters of workstations to be connected to form a virtual multiprocessor. This attempt to leverage the preponderance of relatively cheap workstations was made popular by Berkeley's NOW (Network of Workstations) project [Anderson 1995].

Myrinet puts an embedded protocol processor, a large amount of SRAM and DMA engines onto its network interface cards to allow most of the network and protocol processing to take place on-board. The Myrinet network consists of an 8-port wormhole-routed full crossbar. While the performance is fairly good, it is not clear whether making a faster LAN interconnect is more fruitful than trying to approach the problem from the other direction: take a custom-designed dedicated multiprocessor network architecture and commoditize it.

2.4.3 Memory Channel II

The Memory Channel II architecture [Fillo 1997] was developed by Digital Equipment Corp. (now owned by Compaq) for use in its Alphaserver product line. (The original Memory Channel had the same basic architecture, but with lower performance.) It uses virtual memory-mapped pages to directly access the network interface, and provides a form of reflective memory. Writes to a page are made visible on all other readable pages in the system shared by other nodes. Only one node can map a page for writing, so two-way communication is achieved by pairs of pages.

In order to ease integration into SMP nodes, the network interface cards in Memory Channel are implemented as PCI cards, which can simply be plugged into the ubiquitous PCI I/O bus. Although this increases their latency (because they have to cross the memory-to-PCI bridge) it lowers the cost and implementation complexity considerably. The network is based on an 8x8 full crossbar. The basic communication mechanism provided is message-passing. Reads to a page only show data that has been written; updates to a page cannot be pulled across, only pushed by the originator. This one-way communication has associated cost penalties for inherently two-way communication patterns such as synchronization. This renders Memory Channel mediocre at best for very fine-grain communication patterns.

2.4.4 Synfinity

The Synfinity interconnect [Weber 1997]9 is specifically designed to support a tightly-coupled multiprocessor. Synfinity utilizes a basic crossbar switching element (in this case 6x6), many of which can be connected in any desired topology. The design aims to provide very high performance, but also very high reliability, which is critical for systems in the commercial world. To attain these goals they split the functionality into three layers. The lowest level is the fast frame mover (FFM), which is responsible for the basic data transport. This layer uses source-routed cut-through to provide very low latency (on the order of 40 ns for a single level of switching). The FFM focuses solely on speed, making the logic as simple and fast as possible. The next level up is the reliable packet mover (RPM), which uses error-checking and -correcting codes to provide reliable communication. If corrupted packets cannot be fixed, then the RPM uses the FFM to re-request the packet. The interconnect services manager (ISM) sits at the top of the chain. It is the only part of Synfinity that is dependent on the actual details of the node to which it connects. The two basic services provided by the ISM are a directory-based cache coherence protocol, and a message-passing protocol. Any different or extra functionality only requires a redesign of the ISM. Fujitsu System Technologies has versions of the card for connection either to a PCI bus or directly into the backplane of an Intel SMP. While this is the highest-performing and most integrated SAN of the four presented here, it is not yet shipping in any commercial systems, so it is difficult to judge its overall impact.

9. Note that the paper refers to the Mercury interconnect. The rights to the technology were bought by Fujitsu, and the name changed to Synfinity for trademark reasons. A modified version of the paper can be obtained from the Fujitsu System Technologies website, http://www.fjst.com.

2.5 Existing Multiprocessors

This section looks at some existing multiprocessors, both commercial and experimental. As evidenced by the varied architectures, there is no one best approach to building these machines. Clustered systems are a common approach, partly because they leverage commodity SMP nodes, and partly for their RAS characteristics. RAS stands for Reliability/Availability/Serviceability, and is turning out to be one of the more important features in commercial acceptance of multiprocessor systems. (Note that in the commercial world, the marketing term for this class of machines is 'enterprise server'.) A system is reliable if the user can count on jobs being completed in a deterministic and timely fashion, even in the face of heavy loads or system malfunctions. This includes the areas of fault-tolerance and efficient load management. Availability means avoiding downtime due to crashes or maintenance. The standard here is set by mainframes, which can achieve five-9s (99.999%) uptime, or less than 5 minutes of downtime per year. And finally, serviceability indicates a system's ability to gracefully handle the inevitable failures of processors, memory, disks and networking components. Since multiprocessors, by their nature, have a large number of such components, statistically speaking they will suffer high failure rates. Features for serviceability include redundant power supplies and hot-swappability.

2.5.1 Stanford DASH and FLASH

The DASH [Lenoski 1992a & 1992b] and follow-on FLASH [Kuskin 1994] projects at Stanford are both directory-based CC-NUMA machines using 2-D mesh networks. As shown in Figure 2.7, DASH uses an SMP node consisting of four MIPS R3000 processors sharing a bus with memory and a network controller, which also contains the coherence directory maintenance hardware. The network controller also contains a remote access cache (RAC) to reduce the latency of remote accesses for cache lines that are subsequently fetched by another processor in the node, or re-fetched by a processor which has ejected the line. This RAC can also be viewed as a small attraction memory for remote addresses, which offers some of the migration and replication benefits of COMA.

The FLASH design does away with the bus-based SMP and the RAC. Their conclusion from DASH was that the localization of remote references afforded by the SMP/RAC combination was not very effective and cost too much in terms of logic.


FIGURE 2.7: DASH and FLASH architectures.

The MAGIC chip performs the functions of a network and memory manager, as well as implementing a microcoded coherence protocol processing engine which enables fast and flexible shared-memory and message-passing. Some of the same benefits of the RAC are obtained by implementing page-based migration and replication policies in software.

2.5.2 Illinois I-ACOMA

Illinois's I-ACOMA [Torrellas 1996] uses both a flat COMA coherence protocol and a technique called simultaneous multithreading (SMT) [Eggers 1997]. The goal of the COMA research is to investigate techniques for reducing COMA's coherence overhead. An example is the invalidation cache, which keeps track of recently invalidated lines (that would miss in the attraction memory) and forwards them directly to the appropriate remote directory. The SMT research is independent and complementary to the COMA work. Realizing that there are limitations to the amount of instruction-level parallelism (ILP) available for superscalar processors, the SMT approach seeks to make use of multiple threads as well as ILP inside a single chip. The compiler is responsible for scheduling multiple threads, each of which feeds a separate superscalar engine inside the processor. The idea is to apply VLIW techniques at the level of threads rather than instructions. Internally the processor can hide latency by choosing other threads to execute when some thread is blocked due to a high-latency access. Since the threads can be associated with different programs, this approach also supports multiprogrammed workloads. Hardware does not yet exist for the I-ACOMA.

2.5.3 Teracomputer

The Teracomputer [Bokhari 1998] represents a radically different approach to multiprocessing. The memory system uses a basic dancehall architecture, as shown in Figure 2.8. However, instead of trying to reduce or avoid latency, the Tera approach is to look for other work to keep the processor busy while waiting for a high-latency operation to complete. Tera uses a custom-designed heavily multithreaded processor with a very low context-switch overhead to achieve this goal10. The current implementation of the Tera processor can support up to 128 instruction streams, each of which behaves as a virtual processor with its own registers and context. There are no caches in the Tera architecture, and all memory is equidistant from all processors. On every clock cycle the processor switches unconditionally to the next stream in a 21-deep stream pipeline. Thus each stream can utilize at most 1/21 of a processor's cycles. Since even with optimizations for speed Tera's memory takes about 50 cycles to return a 64-bit word, at least 50 streams must be active in the processor at any one time for it to be fully utilized. While certain applications do display this amount of concurrency, Tera has not presented convincing evidence that their approach is more generally applicable.

2.5.4 SUN E10000

Originally codenamed Starfire, this is SUN's biggest machine in the enterprise server market. This 64-processor SMP is unique in that it uses a globally snooped interconnect in a UMA configuration. As shown in Figure 2.9, nodes consist of up to four UltraSPARC processors, each with memory. A processor cannot access its 'local' memory directly, though. All accesses must go through one of the four globally shared and snooped address busses. (The low 2 bits of the addresses select which address bus is used. The address busses are replicated only for performance reasons.) Data responses use a separate 16x16 crossbar. SUN's approach of throwing hardware at an old design is expensive, and ultimately unscalable. By sticking to

10. This is similar to the SMT concept from the previous section; however, the Tera multithreading concept pre-dates SMT by a number of years. The SMT idea really descends from the Tera work.

FIGURE 2.8: Teracomputer architecture.


FIGURE 2.9: SUN E10000 architecture. Each node contains 4 UltraSPARC processors and memory. All accesses, local or remote, go across one of the global address busses. Data is transferred separately on the data crossbar.

incremental design improvements, they have managed to keep their costs low enough to compete with other vendors, and the E10000 is still one of the highest performance machines on the market.

FIGURE 2.10: Layout of the SGI Origin 2000. (Figure from [Laudon 1997].)

2.5.5 SGI Origin 2000

The SGI Origin 2000 is loosely based on Stanford's DASH and FLASH projects. As shown in Figure 2.10, each node contains up to two MIPS R10000 processors sharing a connection to a full crossbar (the Hub chip) and there can be up to 512 nodes (1024 processors). The two processors share a bus into the Hub to save pins, but do not snoop on each other. The node is thus not really an SMP, and there is no clustering. This makes the basic node cheaper than an SMP box, which also provides for better incremental upgrade costs. By eliminating the SMP bus and snooping, the Origin also reduces the latency to remote memory. Pairs of nodes share access to an I/O crossbar, which supports direct memory-to-memory DMA between any two memory modules, and also cache-coherent I/O.

The interconnect uses a 6x6 crossbar called the Spider chip (similar in performance and architecture to the Synfinity) to implement a fat hypercube topology. The Origin does not provide any kind of cache for remote accesses, instead relying on page replication and migration. The overall goal of the Origin is to be highly modular and cost efficient, while still providing good performance.

2.5.6 Beowulf

Beowulf [Becker 1993] is not really a parallel machine per se, but an attempt to allow anyone to build their own supercomputer (in their parlance, parallel workstation) using COTS (commercial off-the-shelf) parts. Following in the footsteps of projects like NOW and Princeton's SHRIMP [Blumrich 1998], its most common incarnation is as a group of PCs connected together by one or more 100 Mb/s Fast Ethernet networks, and running the Linux operating system with modified network drivers to reduce network overhead. On applications that do not require very fine-grain sharing the performance can be quite good, and the cost is at least an order of magnitude less than any other commercial system. Since it does not address RAS at all, it is unlikely to find favour outside of the research community.

2.6 Conclusion

This chapter has briefly covered some of the major issues involved in the design and analysis of parallel computing systems. Given this background and some examples of current multiprocessors, the next chapter will describe details of the NUMAchine architecture, as one particular instance of a CC-NUMA machine.

CHAPTER 3 NUMAchine Architecture, Implementation & Simulator

This chapter will describe the architecture and implementation of the NUMAchine prototype, as well as the architectural simulator, which was used as a design, validation and research tool. NUMAchine is a distributed shared-memory multiprocessor with hardware cache coherence, which puts it in the CC-NUMA class of machines. This has become a popular architectural choice for commercial multiprocessors (e.g. SGI's Origin 2000, Compaq's Alphaserver and HP's V-Class servers). The main reason for this is that it is comparatively simple to implement such a machine by connecting together a number of bus-based SMP nodes using either a LAN or SAN network. The ability to efficiently leverage the years of industry experience in SMPs is causing prices to drop and quality to rise, to the extent that CC-NUMA machines are becoming the de facto standard in 'enterprise computing'. (This term usually implies not only high performance, but RAS features as well. Multiprocessors are encroaching on both the supercomputer and mainframe domains.)

The goal of the NUMAchine project is to develop a simple, low-cost and scalable architecture for distributed shared-memory multiprocessors (DSMs) with up to a few hundred processors. The purpose of building the prototype was to verify the feasibility of the architecture in practical terms, and to provide a hardware platform that can serve as a base for operating system (OS) and compiler research. The simplicity is crucial since NUMAchine must be highly cost-efficient1. To achieve our cost goal we used COTS parts, and field-programmable devices (FPDs, in particular FPGAs and CPLDs) for control logic instead of ASICs2. Design simulations were used to ensure that the choice of parameters such as datapath widths and speeds for the prototype reached the desired level of scalability. Flexibility was crucial to

1. The funding for NUMAchine was provided by an NSERC grant of around CAN$1.3 million. This money paid for both hardware and designers.

providing support for research. The reprogrammability of FPDs facilitated rapid debugging of the prototype, allowing us to avoid rigorous formal verification. At the same time it permitted us to leave space for future enhancements as results from research are obtained. We also designed NUMAchine to provide plenty of low-level monitoring in hardware, allowing software designers to determine the ultimate cause of performance degradation in the system. (The monitoring functionality will not be described in this dissertation. See [Lemieux 1996] for more information.)

The first section of this chapter describes details of NUMAchine's architecture and implementation. The second section will describe the NUMAchine simulator, which is used for system analysis and validation.

NUMAchine consists of a number of stations, connected together by a two-level network of hierarchical rings, as shown in Figure 3.1. Each station is basically a bus-based SMP node, composed of four processors, a memory module and a network interface card (NIC). Architecturally speaking, there is nothing special about the choice of a 4-way SMP node. The two main reasons for choosing the number four came from both bus saturation and physical implementation considerations. These factors will be discussed in the next subsection.

A major factor in choosing rings for NUMAchine's interconnect has to do with their inherent ordering properties. As described in Chapter 2, hardware cache coherence and memory consistency models must be aware of possible re-orderings of requests and responses in the network. The normal method for guaranteeing ordering is to use a handshaking protocol where all requests require acknowledgments before they can be considered complete. NUMAchine's cache coherence protocol avoids acknowledgments by taking advantage of the ordering and natural broadcast capabilities of rings. This makes the coherence protocol both simpler to implement, and more efficient. The implementation will be described in section 3.1.5, and the performance will be analysed in Chapters 4 and 5.

2. FPGA and CPLD stand for Field-Programmable Gate Array and Complex Programmable Logic Device respectively. Roughly speaking, they both provide arrays of logic blocks, with each block configurable to provide simple logic functions. A complex circuit is decomposed into these smaller functions, and then mapped into the device. Application-Specific ICs (ASICs), in contrast, use customized logic specific to the particular circuit, and are not reprogrammable.



FIGURE 3.1: A high-level view of the NUMAchine architecture. Each Local Ring is shown with two stations, but can contain up to four. With four processors per station, this gives a maximum of 64 processors. (P = Processor, M = Memory, NIC = Network Interface Card, I/O = SCSI, Ethernet, etc. Inter-Ring Interfaces connect the Local Rings to the Central Ring.)

The lower level ring in the hierarchy is called a Local Ring, and multiple Local Rings can be connected together by a Central Ring. The reason for using a two-level hierarchy instead of a single level has to do with latency and bandwidth considerations. In a single ring, the latency is proportional to the number of nodes in the ring, which does not scale well. In a hierarchical ring, the latency scales as O(N^(1/L)) for N processors and L levels of hierarchy, with little increase in the system's cost or complexity since both levels of ring can make use of the same technology. From the bandwidth perspective, a single ring requires O(N) increases in bandwidth to accommodate more processors without degrading performance. Depending on traffic patterns, in the hierarchical case some fraction of requests remain on the Local Ring and do not need to make use of the Central Ring. In the best case, all traffic would use the Local Ring, resulting in a bandwidth scaling requirement of O(N^(1/2)) for the Local Rings (and no scaling requirement for the Central Ring since it carries no traffic). In the worst case of no local traffic, the Local Rings would have no scaling requirement and the Central Ring would require O(N) increases. The actual scaling requirements for real applications lie somewhere between these two limits, but are still an improvement over the single-level case.

The use of SMP nodes makes NUMAchine a cluster-based design, meaning that communication is faster if it can take place within a cluster, which also helps with scalability as discussed in the preceding paragraph. Actually, because of the two levels of rings there are three degrees of locality. Accesses are called Local if they can be satisfied by memory on the same station as the requester. The latency of a remote (off-station) access depends on whether the Central Ring must be traversed or not. Transactions that use the Central Ring we term Far Remote, and those that stay on the Local Ring we call Near Remote. The request/response paths on a ring always involve traversing the entire ring (the request uses some of the ring segments, and the response traverses the rest). Thus the Near Remote latency is the same no matter where the requester and responder are situated on the ring. The same is true for Far Remote latencies regardless of which two Local Rings are involved. The difference between the latencies represents the degree of non-uniformity of accesses, or in other words the level of NUMAness of the architecture. A high ratio indicates that the penalty for fetching remote data is high, which makes it worthwhile to spend more effort on trying to reduce the number and latency of remote references. Typical values of the remote-to-local ratio for other CC-NUMA systems are in the range 10:1 to 20:1. As we shall see in the next chapter, NUMAchine's ratio is about 2:1 for near remote, and 4:1 for far remote accesses.

There are numerous techniques for reducing remote access latencies, two such being clustering and caching. We will not consider clustering in this thesis, since it is typically provided at the operating-system level and is complementary to low-level hardware techniques3. NUMAchine's hardware supports remote latency reduction by means of the Network Cache (NC). The NC is an integral part of the NIC, and provides on-station storage for remote cache lines which have been fetched across the network.
Re-use of the line by other on-station processors incurs only a low local request latency, of the same magnitude as a local memory reference. The design of the NC is described in more detail in section 3.1.2. Its performance is explored in Chapters 4 and 5.

3. See [Gamsa 1999] for a description of clustering support in NUMAchine.

Another factor that affects remote latency is the placement of memory pages amongst the distributed memory modules. The page-placement policy directly determines whether memory references are seen as local or remote from a given processor. Techniques such as page migration and replication can help reduce remote latencies [Culler 1999], and are used by systems, such as the SGI Origin 2000, which do not make use of a remote-access cache such as the NC. Since these approaches fall more into the domain of the operating system, we will not consider them in this dissertation. However, we will look at different types of page placement in Chapters 4 and 5, since the placement policy directly impacts the performance of the NC. (It will also be shown that, for simulation purposes, some of the benefits of page migration and replication can be mimicked by an appropriate choice of the paging policy.)

At the architectural level, the final aspect of NUMAchine is the I/O subsystem. Multiprocessors are often used to run very large programs that require substantial I/O bandwidth. NUMAchine provides support for a parallel file system which takes advantage of large numbers of smaller disks to increase overall bandwidth and throughput. The implementation of the I/O card will be described in the next subsection, but the details of the parallel I/O support in the operating system will not be covered. The interested reader can refer to [Krieger 1997].

With the architecture more or less fixed, we then moved on to the implementation stage, which consisted of four basic steps:

1. Simulation studies to determine system parameters, such as queue depths and bus widths.
2. Partitioning of controller and datapath logic into C

3.1.1 Station: Bus, Processors, Memory and I/O

The prototype version of NUMAchine's processor card uses a MIPS R4400 processor operating at 150 MHz with a dedicated 1 MB secondary (L2) direct-mapped unified instruction/data cache implemented using off-the-shelf SRAMs. The MIPS family was chosen because it provided solid support for 64-bit and multiprocessor operations, and already had a number of years of commercial exposure. We also designed NUMAchine to support the next generation of the MIPS family, the R10000, which provides newer architectural features such as a superscalar execution unit and prefetching. However, we have not implemented this design, because the price of the R10000 was over US$5000 per chip.

The processors also have 16 KB separate primary (L1) instruction and data caches on-chip. The R4400 contains cache-control circuitry to manage both the L1 and L2 caches using a MESI protocol4. This control logic is responsible for maintaining both inclusion and coherence between the two cache levels5, as well as dealing with external coherence requests such as interventions. (An intervention is a request to read a dirty cache line from a processor's L2 cache, and there are two types, shared and exclusive. A shared intervention allows the line to stay in the owning processor's cache, but changes its state to shared. An exclusive intervention forces the owning processor to invalidate its copy after returning the data.) The MIPS R4400 can have at most one outstanding load/store at any given time.

4. See [Heinrich 1994] for a description of the R4400 protocol details, and [Culler 1999] for a general description of MESI and other protocols.
5. Inclusion means that a cache line in L1 must be contained in L2. In practice this means that a line ejected from L2 must also be ejected from L1 to maintain the subset property. Lines ejected from L1 need only be flushed back to L2 if they contain modified data.
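To make the inclusion property of footnote 5 concrete, the following minimal C++ sketch shows the rule a two-level hierarchy must obey on ejections. It is an illustration only; the class and method names are invented and do not come from the R4400's actual cache-control logic.

    #include <unordered_set>
    #include <cstdint>

    // Illustrative sketch of the L1/L2 inclusion rule (see footnote 5).
    // Tags stand for cache-line addresses; dirtyL1 tracks modified L1 lines.
    struct TwoLevelCache {
        std::unordered_set<uint64_t> l1, l2, dirtyL1;

        // Ejecting a line from L2 forces ejection from L1 (subset property).
        void ejectFromL2(uint64_t line) {
            if (l1.count(line)) {
                dirtyL1.erase(line);   // a modified L1 line is flushed down first
                l1.erase(line);
            }
            l2.erase(line);
        }

        // Ejecting from L1 alone needs a flush to L2 only if the line is dirty.
        void ejectFromL1(uint64_t line) {
            if (dirtyL1.count(line)) dirtyL1.erase(line);  // write data back to L2
            l1.erase(line);
        }
    };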

FIGURE 3.2: Cards on the station bus. The I/O card contains a bridge controller, which interfaces between a MIPS 4650 embedded processor, DRAM for I/O staging buffers and a PCI bus. The PCI bus can contain up to four SCSI-2 controllers, and one auxiliary connector for a commodity PCI card (e.g. Ethernet or video). The I/O card also contains the bus arbiter (not shown). The processor card has the MIPS R4400, plus a Local Bus Interface (LBI). The Local Bus provides a DUART, boot EPROM and an off-board debugging connection. On the memory card, the directory controller maintains cache coherence, and interfaces to the DRAM controller, which manages the flow of data into and out of the 256 MB of 4-way interleaved DRAM.

All datapaths in NUMAchine (busses and rings) are 64 bits wide and are clocked at 50 MHz, providing a peak bandwidth of 400 MB/s. This choice of width for the datapaths matches the width of the external memory interface of the MIPS processors, and makes the design of the datapaths simpler. During pre-prototype simulations we found that the bus came close to saturating for certain applications, so we looked at changing the width of the bus to 128 bits in the simulator. While this did help, the improvement was only on the order of 5% for overall execution time. The extra cost of bus drivers and complexity in the datapath was not justified given our low-cost target and the minimal performance enhancement.

Figure 3.2 shows the internal layout of the datapaths for the I/O, memory and processor cards. The External Agent on the processor card is a 3-CPLD controller which is responsible for converting the R4400's system interface bus requests into NUMAchine commands, and also doing some simple encoding/decoding of the NUMAchine address space. The LBI is a low-speed auxiliary bus on which resides a DUART, EPROM containing bootstrap code for the processor, and an off-board connection to allow initial debugging and network access in the absence of an I/O board.

As mentioned above, the station bus uses a Futurebus+ backplane as the physical bus medium. The Futurebus+ specification also includes a full bus protocol, but we decided that this was overly complicated for our needs, so we designed a custom split-transaction protocol. Splitting the transactions means that requests and their associated responses use separate bus transactions. While this allows for more concurrency, it also makes the coherence protocol more complicated.

Addresses and data on the bus are multiplexed onto the same lines. For each valid cycle on the address/data (A/D) bus, a parallel set of 16 lines provides command and control information. Another 8 lines provide data integrity and error detection on the A/D bus. The R4400 can be configured at boot time to use either parity checking or a more robust error-correcting code (ECC) scheme called single-error-correcting/double-error-detecting (SECDED). (See [Heinrich 1994] for details on the ECC.) Parity can detect single-bit errors but cannot fix them. While much weaker than ECC, parity has the advantage that it is much simpler, and most commodity components such as FIFOs and buffers come in versions that include parity-checking circuitry. This allows data integrity to be checked at various points along the datapath, aiding in diagnostics.

The memory card contains up to 256 MB of 4-way interleaved DRAM. The interleaving provides enough pipelining to allow the DRAM to feed the outgoing queue at its maximum rate of one doubleword (8 bytes) every clock cycle. The directory and DRAM controller are also pipelined to allow input request processing to begin while the DRAM is still processing the previous request.

The directory controller maintains the hardware cache coherence state tables. The controller is implemented in 3 CPLDs, which take care of predecoding commands, directory lookup and state transition generation. Placing the entire coherence engine in FPDs was invaluable during the debugging stage. A handful of bugs which managed to slip past the simulation validation runs required only a 30-second reprogramming of the controllers to fix. In fact, at one point we realized that the diagnostic unit on the memory card did not generate parity, which caused the processor to take a parity exception.
We added a new command bit instructing the processor to ignore parity for data only in the specific command. The entire fix to both memory and processor logic took us under half a day.

Not shown in the figure for the memory card is an independent controller called the Special Functions (SF) unit. The SF provides diagnostic information on the memory card, and also provides a block-transfer engine which is tightly integrated with the coherence controller. The SF is basically a coherence-aware DMA engine. The SF is capable of gathering up all cache lines for an arbitrarily large block of memory. Since these cache lines may have dirty data elsewhere in the system, the SF first fetches them and cleans up the directory state, then ships the block out to either an I/O card (for paging) or another memory card (for page replication and migration). At the time of writing, the SF functionality has not been fully integrated into the OS, and so no performance results for the SF will be presented.

And, finally, the I/O card provides disk access, and optionally the ability to plug in a standard PCI Ethernet or video card (for fast frame buffer graphics). The goal of NUMAchine's I/O subsystem is to provide large amounts of parallelism to the file system. The four SCSI-2 controllers can each support numerous disks, although realistically the maximum is around four per controller. As with the SF unit, no performance results will be presented for the I/O card.
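As a concrete illustration of the weaker of the two data-integrity options mentioned above, the following short C++ sketch computes even parity over a 64-bit doubleword and checks it on receipt. It illustrates parity checking in general, not the specific encoding used on the NUMAchine A/D bus.

    #include <cstdint>
    #include <cassert>

    // Even parity: the stored bit is chosen so that data + parity has an even
    // number of 1 bits.
    inline bool parityBit(uint64_t data) {
        bool p = false;
        for (int i = 0; i < 64; ++i) p ^= (data >> i) & 1u;
        return p;                               // 1 if data has an odd popcount
    }

    // A single-bit error flips the recomputed parity and is detected; a
    // double-bit error cancels out and goes unnoticed (unlike SECDED).
    inline bool parityError(uint64_t data, bool storedParity) {
        return parityBit(data) != storedParity;
    }

    int main() {
        uint64_t word = 0xDEADBEEFCAFEF00DULL;
        bool p = parityBit(word);
        assert(!parityError(word, p));          // clean transfer passes
        assert(parityError(word ^ 0x4, p));     // single-bit flip is caught
    }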

3.1.2 Station: Network Interface Card and Network Cache

The NIC provides the connection between the on-station bus and the Local Ring. As shown in Figure 3.3, the NIC consists of two datapaths, from the bus to the ring and vice versa. Both datapaths share access to the on-board Network Cache (NC) on a first-come/first-served basis. Since NUMAchine uses a slotted-ring instead of a token-ring protocol, cache lines (of size 64 or 128 bytes) are fragmented into packets for transmission and can be interleaved with incoming packets from other transactions. This necessitates re-assembly of multi-packet transfers back into blocks on the receiving end. The packet re-assembly area shown in the figure consists of 350 KB of SRAM and a CPLD controller. The memory-mapping scheme for the re-assembly area makes use of the fact that each processor can have at most one outstanding request and up to four writebacks active in the system at any one time. Using the sending station's ID in the re-assembly address guarantees that no two data packets can conflict.

The Sinkable and Nonsinkable FIFOs in the diagram are used to implement separate virtual request/reply networks, and will be described in more detail in section 3.1.4. The synchronizing FIFOs allow the ring to be run at a different clock rate than the stations. Firstly, this allows us to speed up the ring in the case that it proves to be a bottleneck. (It turns out that the Central Ring can saturate and cause problems for certain applications. The Central Ring is designed to handle higher clock speeds, as discussed in the next section.) Secondly, clocking in such a large system is difficult to do with low skew. Our clock distribution scheme uses an independent source on each station which generates all on-station clocks. Thus there is no guaranteed phase relationship between stations.

FIGURE 3.3: The NUMAchine Network Interface Card (NIC). This figure gives a simplified overview of the NIC datapaths (no control logic is shown). The synchronizing FIFOs allow the clocks for the ring and the rest of the card to run at different speeds. Cache lines are broken into slot-sized packets on the ring, and so require re-assembly at the receiving end. The Network Cache is time-shared between the incoming and outgoing data paths. The Sinkable and Nonsinkable FIFOs have to do with flow control, as explained in section 3.1.4.

The synchronizing FIFOs are designed to handle different input and output clocks, with internal circuitry to avoid metastability6.

The NC is implemented using four CPLDs for control, 512 KB of SRAM for directory information, and 8 MB of Synchronous DRAM for cache line storage. Our initial simulation results indicated that the knee of the performance curve for the NC lay at about 4 MB for the prototype. However, the only SDRAM chip geometries available at the time meant that our NC had to be either 2 MB or 8 MB, so we chose the latter.
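The re-assembly addressing idea can be illustrated with a small C++ sketch. The field widths and layout below are invented for illustration; they are not the actual NIC mapping, only a demonstration of why indexing by sender guarantees conflict-free buffers.

    #include <cstdint>

    // Illustrative only: give each in-flight transfer a unique re-assembly
    // buffer, indexed by (sending station, transfer slot).
    constexpr unsigned kLineBytes   = 128;  // largest cache-line size
    constexpr unsigned kSlotsPerSrc = 5;    // 1 outstanding request + 4 writebacks

    // station: system-wide ID of the sender
    // slot:    0 for the single outstanding request, 1..4 for writebacks
    inline uint32_t reassemblyOffset(uint32_t station, uint32_t slot) {
        return (station * kSlotsPerSrc + slot) * kLineBytes;
    }

    // Because no sender can ever have two transfers active in the same
    // (station, slot) pair, packets from different transfers never collide.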

The architectural reasons for choosing rings were discussed above. From an implementation standpoint rings are also convenient. Firstly, rings are simple to design and implement as they use only point-to-point connections. Secondly, routing on a ring is simple and fast. In NUMAchine, a structure called a filtermask (described in more detail below) is used for routing. At each hop in the ring a single bit in the filtermask indicates whether the packet in the current slot has reached its destination or not. The filtermask also provides support for multicasting at no extra cost. As we shall see in Chapter 5, multicasting is put to good use by the coherence protocol.

The rings are unidirectional and utilize a slotted-ring protocol. This means that for a four-segment ring there are four slots that move one hop forward in each 20 ns clock cycle. A given slot can be either full or empty. If empty, then a NIC with data waiting to be sent out can fill the slot as it goes by. If full, the slot's data is either consumed by the NIC or passed along7. On a given clock tick, if a packet is consumed by the NIC then the slot is freed up. NUMAchine has an option in the hardware (configurable by the OS) to use the newly freed slot for outgoing packets from the same NIC. In a lightly loaded ring this has the advantage of reducing the average latency for ring access by slightly more than one clock tick. (The alternative is to wait for the next slot to come by. Under light loading this next slot has some small probability of being occupied, thus the savings on average are slightly greater than one.)

6. Metastability can occur when the data input to a flip-flop does not meet the flip-flop's setup and hold-time requirements. The output of the flip-flop hovers at some indeterminate voltage, which is neither a valid logic '0' nor '1', for a random period of time.
7. For a multicast/broadcast packet, it is possible to do both. The effect is to split off a copy of the packet for the local NIC while also allowing the original to continue around the ring.

The use of the just-freed slot also allows insertion into the ring even in the case of a constant flow of upstream packets that are consumed by the NIC, which would otherwise lead to starvation due to the lack of free slots from upstream. On the other hand, the just-freed slot option can cause starvation problems for the downstream NIC if the network is heavily loaded, since in this case each NIC greedily uses any free slots that become available. Simulation results in the next chapter will show that use of the just-freed slot is on average slightly beneficial.

The Inter-Ring Interface (IRI), as the name implies, joins the Local to the Central Ring, and is shown in Figure 3.4. The architecture is simple, with an up and a down queue, and some control logic (not shown) to perform the ring functions and flow control. (Flow control is discussed in section 3.1.4.) As mentioned above, the Central Ring was designed to allow higher speed operation. To achieve this, the four IRIs necessary to implement the Central Ring for the 64-processor prototype were combined onto a single printed-circuit board (PCB). Thus in the figure, the Central Ring connections are actually traces on the PCB, while the Local Ring connection is made using cables (which connect to NIC cards on either side). Our CAD simulations indicate that the Central Ring should be able to run at around 75 MHz.
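A minimal C++ sketch of the per-tick slot decision described above, including the just-freed-slot option, is given below. It models only the insertion choice at one NIC; the structure names are illustrative and multicast splitting (footnote 7) is omitted.

    #include <deque>
    #include <optional>

    // Illustrative model of one NIC's view of a slotted ring.
    struct Packet { int dest; };

    struct NicModel {
        int myStation;
        std::deque<Packet> outgoing;      // packets waiting to be inserted
        bool reuseFreedSlot = true;       // OS-configurable just-freed-slot option

        // Called once per 20 ns tick with the slot currently passing this NIC;
        // returns the slot contents to forward one hop downstream.
        std::optional<Packet> onSlot(std::optional<Packet> slot) {
            if (slot && slot->dest == myStation) {
                slot.reset();             // consume: the slot is now free
                if (reuseFreedSlot && !outgoing.empty()) {
                    slot = outgoing.front();   // reuse the just-freed slot
                    outgoing.pop_front();
                }
            } else if (!slot && !outgoing.empty()) {
                slot = outgoing.front();       // fill an empty slot going by
                outgoing.pop_front();
            }
            return slot;                       // full slots for others pass along
        }
    };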

FIGURE 3.4: The Inter-Ring Interface (IRI). This interface does simple buffering and flow control on both the Local and Central Rings. The synchronizing FIFOs allow the upper and lower level rings to run on different clocks.

Much more detailed information on the NIC and ring implementations is available in [Loveless 1996].

3.1.4 Flow Control and Deadlock Avoidance

NUMAchine uses flow control as a means of buffer management to ensure that packets are not lost due to overflow. Deadlock can occur if there is a circular dependence amongst resources stalled by flow control. Consider the example of a memory's input queue, which is stalled due to a large number of incoming requests. Some of these requests require interventions to a local processor, each intervention requiring space in the output queue. The processor, however, is also stalled due to a large number of previous interventions. In the processor's case the stall occurs because the output queue is stuck waiting for memory to become free to accept the intervention response data. Neither the processor nor the memory can make any forward progress in this case. NUMAchine avoids deadlock through use of a throttling mechanism for flow control. In addition, the NIC's sinkable and nonsinkable queues present separate request and response paths, allowing responses to bypass requests in certain cases that would otherwise lead to deadlock. (In all other parts of the system requests and responses use the same paths. This is described in more detail below.)

Even if there is a theoretical upper bound on the number of requests and responses existent in a system at one time, it is not practical to provide enough buffering to accommodate the worst case8. There are thus two basic mechanisms for ensuring that data is not lost due to buffer overflows. In the first scheme, send buffers remain allocated (keeping copies of sent data) until an acknowledgment has been received. If one has not been received after some time-out period, the buffer is re-sent. The time-out must be large enough to accommodate the largest possible time for an acknowledgment, so this scheme pays a high performance penalty for buffer overruns, but works well if the probability of such overruns is small. It also allows the buffers to run near capacity. This approach also requires some extra complexity to ensure that livelock does not occur.

8. Each R4400 processor can have at most one outstanding load or store. It is possible, though, for each processor to flush back dirty cache regions, causing a large number of writebacks to flood the system.

A second approach is to use back-pressure or throttling. A receive buffer that is nearing capacity sends a signal to all possible transmitters to stop issuing packets. No timers are needed using this scheme because buffers are never allowed to reach the overflow point. The drawback is that the buffer must leave enough room to handle all possible in-flight packets that might be issued before the throttling can take effect. This can lead to very inefficient average use of buffer space unless the throttling time is fast. Another shortcoming is that all senders are stopped, even if they are not targeting the specific receiver.

NUMAchine utilizes the throttling method because the lack of acknowledgments and timers makes it simpler to implement, and also because commodity buffers are cheap, meaning that inefficient buffer utilization does not carry a heavy premium in dollar terms. Each level in NUMAchine's hierarchy uses flow control with its immediate connections. Before discussing the details of the flow control scheme, we must discuss deadlock avoidance.

A standard approach to handling deadlock in multiprocessor systems is to use separate request and reply networks. To actually run two physical networks in parallel is not practical, so what is usually done is to use separate virtual channels. The same physical data links are used by both, but they have separate buffering and flow control. This is the approach taken in NUMAchine. Instead of separating the request and response paths, NUMAchine uses nonsinkable and sinkable transactions, which correspond roughly to requests and responses respectively. A sinkable transaction is one which is known not to generate any kind of further traffic after having been received by a communications endpoint (i.e. a processor, memory or I/O). Writebacks and data responses fall into this category. The key characteristic of sinkable transactions is that their receipt in an input queue will not require any space in an output queue. Nonsinkable transactions such as reads and interventions will usually (unless they are negatively acknowledged) require some means of ensuring that enough space is available in the output queue to handle a cache line's worth of data before they can be accepted.

NUMAchine uses separate flow control mechanisms for sinkable and nonsinkable transactions. Sinkable transactions are stalled only when input queues are nearly full9. Since sinkable transactions are guaranteed to be consumed, their processing can continue independent of any output buffering stalls, which means that forward progress is guaranteed even if every other part of the system is stopped.

9. The commercial FIFOs used have programmable flags which indicate when more than some specified number of entries in the FIFO have been used.

In general there are two approaches, both of which are used at various points in the system. The first is to use separate queues for the sinkable and nonsinkable paths. Duplicating the queues is rather expensive, though, so it is only used on the NIC card (as shown in Figure 3.3). This allows sinkable transactions to bypass nonsinkable ones if the latter are stalled. This bypassing feature on the NIC is necessary because it is not a datapath endpoint. For the processor and memory cards, which are datapath endpoints, the method used is to have a single queue for both types, but to count the number of nonsinkable transactions that have arrived, and only allow in as many as could possibly cause the output queue to nearly fill up in the worst case.

The processor card uses 64-entry, 8-byte wide queues for both input and output in the prototype. The default system cache line size of 128 bytes uses 17 entries in a queue (16 data + 1 command). The nonsinkable busy counters for the processor are thus set to 2, leaving a margin of error of one nonsinkable request in case the bus arbiter gives a grant before the busy signal is recognized. On the processor, having a maximum of two pending interventions is fairly reasonable, since it is unlikely that a single processor will be the target of more than two requests during the time required for a processor to service an intervention. (Note however that the intervention latency is variable, from 8-28 processor cycles. This is due to the fact that the L2 cache controller needed to process the intervention may be busy with internal processor activity.) The memory, since it is likely to be the target of many more simultaneous requests, uses 256-entry queues on the output, and the same 64-entry queues as the processor on the input. The memory's nonsinkable counter is thus set to 8, which will be shown to be more than adequate.

On each ring (both Local and Central) there is a single Stop-Ring signal that is asserted should the queues fill past the 3/4 mark, disallowing insertion of new packets into the ring. No differentiation is made between sinkable and nonsinkable packets, because on the rings we are dealing with single packets, not entire cache blocks. Also, there is enough buffering in the nonsinkable queue on the NIC card, so the only likely cause of a Stop-Ring condition is a flood of writes coming through the system. The NIC's ring queues are 256 deep in both the incoming and outgoing directions, while the IRI uses 512-deep queues. This corresponds to 8 and 16 cache lines in the respective queues before ring flow control is triggered. Even under bursty conditions we will show in the next chapter that the ring flow control is only infrequently activated.

3.1.5 Hardware Cache Coherence

This section gives a brief synopsis of NUMAchine's hardware cache coherence scheme. A complete description can be found in [Grbic 1996]. The cache coherence scheme is descended from NUMAchine's predecessor project, Hector, as described in [Farkas 1992]. NUMAchine's cache coherence uses a writeback-invalidate protocol and a full directory which is broken into two levels: the home memories and the NCs. The home memory stores the coherence state (described later) and two bitmasks.
The first, called the processor mask or Pmask, indicates which of the four on-station processors potentially has a copy of a cache line. The Pmask may be conservatively inaccurate, because there is no notification of shared cache line ejections, so a processor marked as a sharer may not currently have a copy. This can lead to a case where processors receive unnecessary invalidations, but since the invalidation is sent out in a single atomic multicast on the bus there is no extra bus traffic generated10. In addition, invalidations are very low-overhead operations in the L2 caches, so sending too many does not incur much of a penalty as long as it does not happen too frequently.

The second bitmask is called the filtermask or Fmask, and it stores the same type of information as the Pmask but for remote copies. This is an inexact, coarse-grained indication of which rings and stations (not processors) in the system contain copies. From Chapter 2 we know that storing a full bit-vector causes the number of directory bits to scale as O(N) for an N-processor system. To reduce this cost, we split the Fmask into two pieces, which we refer to as the ring portion and station portion, corresponding to the two levels of hierarchy as indicated in Figure 3.5. To understand the working of the Fmask, first consider each of the Local Rings to be a single object, ignoring that it is actually composed of stations. If Local Ring R (R = 0,1,2,3) contains at least one copy of a given cache line, then we set bit R in the ring portion of the Fmask. For the station portion, we set bit S if station S on any of the Local Rings contains a copy. As shown in Figure 3.5, imprecision arises when two or more rings have different sets of sharing stations. Any pattern of sharing constrained to a single Local Ring is precise, as is any sharing on multiple Local Rings if the set of sharing stations is identical for each ring. All other patterns will include some imprecision. (Note that a dirty copy has a single owning station, so it is always precise.)
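The following short C++ sketch illustrates the Fmask encoding just described for the 4-ring, 4-station prototype. It captures only the set/lookup behaviour and the source of imprecision; it is not taken from the directory controller's actual CPLD logic.

    #include <cstdint>

    // Illustrative 8-bit filtermask: bits 0-3 are the ring portion,
    // bits 4-7 the station portion, for 4 Local Rings x 4 stations each.
    struct Fmask {
        uint8_t bits = 0;

        void addCopy(unsigned ring, unsigned station) {
            bits |= uint8_t(1u << ring);            // ring R holds >= 1 copy
            bits |= uint8_t(1u << (4 + station));   // station S holds a copy somewhere
        }

        // A station is targeted if both its ring bit and station bit are set.
        // This can include stations with no copy (imprecision), but it never
        // misses a true copyholder.
        bool targets(unsigned ring, unsigned station) const {
            return (bits & (1u << ring)) && (bits & (1u << (4 + station)));
        }
    };

    // Example of imprecision: copies on (ring 0, station 1) and (ring 1, station 2)
    // also cause (ring 0, station 2) and (ring 1, station 1) to be targeted.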

10. This is a multicast, not a broadcast, because only those processors with bits set in the Pmask are targeted; if no bits are set (all shared copies are remote), then no invalidation is sent out.

The advantage of this scheme is that it reduces the scaling of the directory storage to O(N^(1/L)), where L represents the number of levels of hierarchy. The drawback is that imprecise masks can cause unnecessary invalidations.


FIGURE 3.5: The NUMAchine filtermask. Colours are used to show the bit correspondences. A bit in the ring half of the Fmask is set if one (or more) stations on the Local Ring of the same colour contains a copy. The station half is similar, with a set bit corresponding to one (or more) of the same-coloured stations. The filtermask shown at the top of the figure is precise, since each set station bit corresponds to a true copy of the line. (True copies are shown shaded, false copies are hatched.) The filtermask at the bottom is imprecise, since it includes stations which do not actually have copies.

The worst possible case is where a single different station on each Local Ring contains a copy. This leads to a full Fmask with all bits set, selecting all 16 stations instead of the four that really have copies. There are two reasons why this is not as problematic as it might seem. Firstly, we have already mentioned that invalidations are not very costly. Secondly, typical sharing patterns involve one, two or all processors, and only rarely numbers in between [Culler 1999]. Thus, the frequency with which we send unneeded invalidations to a station is low. The main advantage to the filtermask from an implementation standpoint is that it allows very simple and fast routing and multicasting on the rings. In the following chapter we will look at just how often the filtermask overspecifies stations.

The NC forms the second level of the coherence directory. While the filtermask described above determines which stations in the system have copies, the NC on a station keeps track of all local copyholders using its own Pmask. Because the NC is organized as a cache, it is possible that a line, including the directory contents, may have to be ejected11. There are two general approaches to handling this situation: strict and lazy. In the strict approach we must correctly update the home memory directory to reflect the loss of NC information. For a line which shows up as shared in the NC, we would first invalidate all local copies, and then send a message to the home memory requesting that it remove the station from the sharing list. For a dirty copy on-station, the NC would have to make sure that the line was flushed back to memory. In contrast, the lazy scheme simply allows the directory information to be thrown out without informing the home memory12. In certain cases, extra states are added to the coherence protocol to handle the situation where directory information for a requested cache line has been discarded. For invalidations to shared lines the extra processing is minimal: the invalidation is simply broadcast to all processors on the station. For dirty lines, the NC must send out intervention requests to all on-station L2 caches and wait for the responses before it can take action, which is costly. However the overall cost may not be excessive if this situation occurs rarely, and the trade-off is that we avoid operations that, while less costly, take place on every ejection, and thus occur frequently. NUMAchine uses the lazy scheme, and simulation results in Chapter 4 will show that this choice is a good one.

The coherence state maintained for each line is different in the memory and the NC. The memory state requires two bits for the state, plus eight bits for the Fmask.

11. Note that lines in the NC are ejected only on capacity or conflict misses. Coherence misses only affect the state of the line.
12. If the NC (and not a local processor) is the owner of a dirty line, it must first issue a writeback to the home memory.

One bit of the state information indicates whether the line is locked or not. Locking of the line occurs at the beginning of a coherence action that requires multiple stages, and ensures that no other access to the line can take place until the previous transaction completes. The second bit indicates whether the current state of the line is shared or dirty, or in NUMAchine terms Valid or Invalid, respectively. (Valid or Invalid here refers to the state of the memory's or NC's data, not the processor's. Thus, Invalid means that the memory or NC does not have valid data, and that the modified data resides in some local processor's cache.) There is a third, implicit state bit derived from the Fmask. If the Fmask contains bits set for any stations/rings other than those of the home memory, then there are possibly remote copies, and the line is said to be Global; otherwise the only possible copies are on the station, and the line is called Local. Ignoring the Locked bit, which is set independently, there are then four states for a line in home memory (a minimal sketch of this encoding follows the list):

Global Valid (GV) - One or more remote stations requested a shared copy of this line at some point in the past. Write access to the line can only be gained by first invalidating all of these potential copies.

Global Invalid (GI) - A dirty copy of this line is owned by some remote station. Read or write accesses must be redirected to the remote owner.

Local Valid (LV) - The only potential shared copies of the line are in (home memory) local processors. Write access (local or remote) can be gained by first multicasting an invalidate on the station bus only, since there are no remote copies.

Local Invalid (LI) - A local processor (as recorded in the memory Pmask) has a dirty copy and ownership of the line. Read or write accesses can be satisfied by means of local bus intervention operations.
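The four states can be read directly off the Valid/Invalid bit plus the Fmask, as in this small C++ sketch. The bit layout is illustrative; the real encoding inside the directory-controller CPLDs may differ.

    #include <cstdint>

    // Illustrative decoding of a home-memory directory entry into the four
    // states above. 'valid' is the shared/dirty bit; 'fmask' is the 8-bit
    // filtermask; 'homeBits' holds the ring and station bits of the home station.
    enum class MemState { GV, GI, LV, LI };

    inline MemState decode(bool valid, uint8_t fmask, uint8_t homeBits) {
        // Global if any Fmask bit other than the home station's own bits is set.
        bool global = (fmask & ~homeBits) != 0;
        if (global) return valid ? MemState::GV : MemState::GI;
        return valid ? MemState::LV : MemState::LI;
    }
    // The independent Locked bit (discussed next) is kept outside this decoding.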

Any one of these states can also be locked, which we signify by prepending an 'L', so for example L-GV is Locked Global Valid. Figure 3.6 shows a simplified sequence of transactions taking place for one particular cache line.

In most coherence protocols, invalidates require acknowledgments from all the remote targets because of the possibility of requests bypassing each other in the network. This ensures that all remote nodes are guaranteed to have seen an invalidate before more coherence traffic can target a line. In NUMAchine, however, we can take advantage of the fact that in a hierarchical ring topology there exists only one single path between any source and destination.

FIGURE 3.6: Coherence transactions at the home memory. The memory starts off in state LV, with no copies in the system. (a) A local processor misses and does a shared read (Rd-S) which returns shared data (Dat-S). The state stays LV. (b) The processor writes to the shared line, issuing an upgrade. Memory responds immediately with an invalidate (Inv). The memory state switches to LI, and the processor's copy becomes dirty. (c) Some remote node does a shared read. Memory locks the line and sends out a shared intervention (Intv-S) to the processor, which responds with shared data and changes its state. Memory forwards the data to the remote node and changes its state to GV. (d) The remote node now writes to the line. The upgrade causes memory to lock the line, and send out one invalidate to the local processor, and another invalidate which traverses the rings, acknowledging the remote node and ultimately returning to memory and switching the line to GI.

This single-path property means that there is no way for two requests from a specific source to arrive in a different order, no matter what the destination. This, in turn, means that an invalidation which has gone around the ring (or rings) and returned to the home location guarantees the ordering, as seen by any processor, for future coherence actions to the line. These future accesses must wait for the returning invalidate to unlock the line, and it is not possible for such accesses to generate a request that overtakes a previous invalidate. This ordering property greatly simplifies the design and implementation of the coherence protocol.

State information contained in the NC complements that in the home memory. The set of states is similar to those for main memory, with the following additions and modifications:

Notin Tag (NT) - This is not a true state, but arises when the request's address does not match the address tag of the current occupant of the line in the NC (i.e. we have a cache miss). This miss can occur either because the request is the first reference to the line (cold miss) or because the information was previously ejected (capacity or conflict miss).

Notin State (NS) - This state indicates that the NC contains no useful information on the line, although the tag does match. This state is needed only in a rare corner case, where a miss in the NC results in a remote request which eventually gets a negative acknowledgment (NACK). While the tag matches, nothing else is known about the line.

Local Valid (LV) - This is the same as the main memory state, except that in this case the NC is the owner of the most up-to-date data (stored in the NC DRAM). Remote interventions can be satisfied by the NC directly, without querying the processors. Upon ejection, a line in the LV state must be written back to memory.

When the NC processes a response, there are two locations that require an acknowledgment: the home memory, so that the line can be unlocked, and the original requester. If the original requester happens to also be in the home memory's station, then a single response can be multicast on the home memory's bus to satisfy both requirements. Otherwise, there are two approaches. One is to send the response back to the home memory, and let it forward a copy to the requester. This is often referred to as a 3-hop scheme, because the network gets used once for the original request, and twice more for the two stages of the response. NUMAchine uses an optimized 2-hop protocol, whereby the NC is responsible for sending responses to both. It does this by inserting the response into the output queue to the ring twice13. In order for coherence to be maintained, it is important that the response to the home memory be the first into the queue. If this were not the case, it would be possible for the requester to see its response while the other response is stuck in the network. The requester could then send another request that could possibly arrive at the home memory before the first response.
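The ordering requirement in the 2-hop scheme amounts to a simple rule about queue insertion, as in the following C++ sketch. The types and names are illustrative, not the NIC's actual interface.

    #include <deque>

    // Illustrative 2-hop response handling: the NC enqueues two copies of the
    // response, and the unlock for the home memory must enter the ring first.
    struct RingPacket { int targetStation; bool unlocksHome; };

    void sendTwoHopResponse(std::deque<RingPacket>& ringOutQueue,
                            int homeStation, int requesterStation) {
        // 1) Response to the home memory, which unlocks the directory entry.
        ringOutQueue.push_back({homeStation, true});
        // 2) Response to the original requester. If this copy were enqueued
        //    first, the requester could see its data and issue a new request
        //    that reaches the still-locked home memory before the unlock.
        ringOutQueue.push_back({requesterStation, false});
    }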

13. Note that we cannot use a multicast because, depending on the home memory's and requester's Fmasks, ORing the two may result in an imprecise Fmask, which would erroneously target two extra stations.

Retry Mechanism and Negative Acknowledgments

Since cache lines can be locked by the coherence scheme, a mechanism is needed to handle requests that hit to a locked line in either the memory or NC. The best approach would be to queue up all such requests and service each one in order when the line eventually becomes unlocked. This would guarantee both fairness and correctness properties, but requires more resources and logic in order to implement the pending queues. Instead, NUMAchine uses a binary exponential backoff, similar to that used for media access control by Ethernet. Such a scheme is known to be stable for low levels of congestion [Goodman 1988].14 In the NUMAchine version, any request to a locked line is immediately sent a NACK. Upon receipt of the NACK, a processor waits for some period of time before retransmitting the request, giving the line a chance to become unlocked. With each successive NACK, the waiting period is doubled. After some maximum number of retries is reached (64 in the prototype), the request is deemed to have failed, and a bus error is signalled to the processor.

The backoff mechanism does not guarantee fairness, because there is no priority given to requests that have been forced to retry a number of times. (This would in any case require breaking the lock on a cache line, making the cache coherence protocol much more complicated.) In the extreme case, lack of fairness can lead to starvation, which in our case is fatal, because the bus error will cause the associated process to be killed even though no fault has occurred in the system.

Early simulation studies and debugging on the prototype showed that the binary exponential backoff approach did indeed suffer from starvation problems. Our initial attempt at a fix involved changing the backoff so that it reached a saturation point, or plateau. After some fixed percentage of the maximum number of retries had been issued, the backoff interval was held constant, giving all processors that had already retried a number of times an equal probability of accessing the line. This did not solve the problem, though, because new requesters could still starve out an old requester. Our solution was to change the plateau value from being the latest (longest) interval to being the initial (shortest) interval. After 32 retries using the binary backoff, the processor then issues the rest of the retries in quick succession to increase the probability of grabbing access to the line as soon as it becomes unlocked. This effectively gives requests with high retry counts a higher priority.

14. Stability in this context means that the number of retries is bounded, i.e. the request will eventually get through.

One other feature is the use of different values for the longest backoff interval for local and remote requests. Remote requests have longer latency for both data and coherence actions, and may require longer retry times. Under our original plateau scheme we allowed the maximum backoff interval to be two or four times as long for remote requests. With our modified scheme, we use the same maximum for both, in order to make the control logic simpler.

While this modified backoff mechanism still does not guarantee fairness or starvation-avoidance, in practice both simulations and programs running on the prototype indicate that the technique works, although the lack of fairness still causes performance degradation. The backoff performance will be investigated in the next chapter.

One possible enhancement to the backoff mechanism is to use a form of request combining, whereby multiple requests to the same address can be merged, and the responses delivered simultaneously. In NUMAchine, combining could take place in the memory and NC for certain types of locked lines. For example, a shared request to the NC that finds the line locked due to a previous, as-yet-unanswered shared request for the same line can be combined. The response (or NACK) can go to both. (Note that combining only works for shared requests.) This guarantees the promptest possible response for the second request, while saving bus bandwidth by reducing the number of retry NACKs. Only small changes are required to the directory maintenance logic to store any subsequent requesters in the Pmask. Combining on the home memory station is somewhat trickier. If the memory module generates the response, then the same approach as the NC will work. However, in the case where the response comes from a remote intervention, the NIC card sends the response directly to the requesting processor and memory. When the second request comes into the memory, there is no way for it to notify the NIC that when the response finally arrives, it must add another target. In this case we could modify the controller so that the intervention response both unlocks the line and also forwards the response to the second requester. The only problem with this approach is that a sinkable transaction (the intervention response) could generate a response, which would require modifications to the flow control scheme. We will measure the possible benefits of adding combining in the next chapter.
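A compact C++ sketch of the modified backoff policy described above is given below. The retry budget of 64 and the plateau after 32 binary-exponential steps follow the text; the base interval and units are illustrative assumptions.

    // Illustrative model of NUMAchine's modified retry backoff.
    // Returns the wait (arbitrary time units) before retry number 'attempt'
    // (1-based), or -1 once the 64-retry budget is exhausted (bus error).
    constexpr int       kMaxRetries     = 64;   // prototype limit
    constexpr int       kExponentialCap = 32;   // binary doubling for first 32 retries
    constexpr long long kBaseWait       = 1;    // illustrative base interval

    inline long long backoffWait(int attempt) {
        if (attempt > kMaxRetries) return -1;            // give up: bus error
        if (attempt <= kExponentialCap)
            return kBaseWait << (attempt - 1);           // 1, 2, 4, 8, ... (doubling)
        // After 32 retries, fall back to the shortest interval so that an old
        // requester retries in quick succession and effectively gains priority.
        return kBaseWait;
    }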

3.1.7 Memory Consistency

As mentioned in Chapter 2, NUMAchine supports the sequential consistency model, which is the most intuitive model for writing shared-memory programs. The main reason for choosing this model is that NUMAchine's architecture inherently provides simple and efficient support for sequential consistency, as explained below. Another reason is that the MIPS R4400 processor is not designed to support more aggressive consistency schemes such as release consistency. The R4400 is a non-superscalar microprocessor with blocking caches15. (As mentioned before, the original design targeted the MIPS R10000, which can support more aggressive schemes, but this newer chip was beyond our budget.) The R4400 does have a write buffer, but it is only used for uncached writes, and it is only one word deep. Thus all accesses from a given processor will issue in order. This means that the weakest form of consistency naturally supported by the hardware is processor consistency. (Weaker forms of consistency are possible if the coherence scheme and programming model are changed to allow writes to proceed optimistically, and then providing some means of merging conflicts. This would have led us too far astray from our basic principle of keeping the system as simple as possible.)

In analysing the impact of processor versus sequential consistency, we found only minimal differences. The natural sequencing and broadcast capability of the ring are the reasons that the two consistency schemes do not differ greatly. Figure 3.7 shows a case where two variables X and Y, which are in different cache lines, are being written by two different processors on different stations (only two of which are shown for clarity). Both stations start off with the same shared values of X and Y. At the same point in time each station writes to the variable whose home memory location is the other station. This causes upgrades to be sent to the home stations, and in (b) the invalidates are shown already having traversed half the ring and having acknowledged the writes, so the shaded variables contain modified (new) data. If the stations read the other (non-written) variable before the invalidates arrive in (c), it is possible for STN0 to see [old X, new Y] and STN1 to see [new X, old Y], which violates sequential consistency16.

In order to provide for sequential consistency, NUMAchine makes use of sequencing points on both the Local and Central Rings. (In the figure the sequencing point is shown in the middle of a ring hop for clarity, but in reality the sequencing point is either a pre-assigned station or an inter-ring interface if present. Also note that the figure shows only one ring level, but both the Global and Local Rings contain sequencing points.)

15. A blocking cache stalls on a miss, disallowing any further loads or stores until the blocking access is finished.
16. Note: we assume here that the invalidates kill the old shared copies at the end of their trip around the ring. If, on the other hand, we assume that the invalidates kill all local copies before being injected into the ring, the scenario shown in the figure does not violate sequential consistency. Even with local pre-invalidations it is possible to violate sequential consistency in much the same fashion, but the diagram requires four stations and so is not shown.


FIGURE 3.7: Sequential consistency in NUMAchine. Only two (of four) stations and one level of ring are shown for clarity. Initially, both stations have shared copies of two variables, X and Y, which reside in different cache lines. In (a), STN0 writes to Y, whose home location is on STN1. STN1 does the same but with X. In (b) and (c) shaded boxes indicate modified data, hatched indicate invalid data. In (d), (e) and (f) invalidates are inactive until they pass the sequencer.

The basic idea is that any broadcast packet is initially inactive, meaning that it is simply passed along without any destination checking. A broadcast becomes active only when it passes the sequencing point on the highest level of ring it must traverse to reach its broadcast targets. This imposes an ordering between any two broadcast packets on a given level of ring, as shown in parts (d), (e) and (f) of the figure. The sole negative impact of providing sequential consistency is to add on average half a ring traversal (two to three hops at 20 ns per hop, or about 50 ns) to the path of broadcast packets.
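The activation rule can be expressed in a few lines of C++. This sketch is illustrative only, modelling a single ring level with one sequencing point; the names are not taken from the hardware design.

    // Illustrative model of the sequencing rule for broadcasts on one ring level.
    // A broadcast is inserted inactive and ignores destination checks until it
    // passes the ring's sequencing point; ordering between broadcasts follows
    // because they all become active at the same point.
    struct Broadcast {
        bool active = false;          // inactive until the sequencer is passed
    };

    struct RingHop {
        bool isSequencingPoint;

        void onPass(Broadcast& b) const {
            if (isSequencingPoint) b.active = true;
            // A station only consumes/applies the broadcast once b.active is true.
        }
    };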

While the use of rings in a multiprocessor is not new, NUMAchine's use of the ordering properties inherent in rings to simplify the design and implementation of the cache coherence protocol, and also to provide sequential consistency, is novel. In the next two chapters we will show that these features of NUMAchine are efficient. The simplicity of the implementation and the use of COTS parts and FPDs allows NUMAchine to reach its goal of being low cost.

In addition, the marginal cost to increase the size of the system stays fairly linear. The cost of moving from an N-station to an (N+1)-station system is the cost of the station itself, plus one set of ring cables (which cost about $200). There is a jump in cost as the system size crosses 16 processors, since this requires the addition of the Central Ring. However, since the Central Ring is a single card with simple and regular datapaths and FIFOs, it is feasible that the entire card could be integrated into a small number of ASICs (possibly even one). This would allow the cost to scale nearly linearly right up to the maximum system size of 64 processors.

3.2 The NUMAchine Simulator

In modelling a system as complex as a multiprocessor, the trade-off between model accuracy and complexity is particularly significant. Simple models cannot possibly capture the details of such a nonlinear network of interacting processes. (By nonlinear we are referring to the fact that a small change in the reference stream could potentially cause a large change in the performance. Features such as caches and congested networks make such systems impossible to model analytically.) On the other hand, the most accurate model would involve describing the system at the gate level, using a Hardware Description Language (HDL) such as VHDL or Verilog. Not only would such a model take as much time to build as the machine itself, but such a low level of abstraction makes it extremely time-consuming to change architectural features and do forward-looking studies. The approach taken for the NUMAchine simulator is to model at a low enough level to accurately capture the salient details, but no lower. Thus, for example, all queues in the system are modelled accurately, since the average and largest depth of queued entries are good indicators of occupancy and congestion, respectively. For the memory modules, the detailed DRAM and coherence directory interactions are not modelled. Instead, we represent the coherence directory lookup time and DRAM access times as single numbers. For the directory lookup this is a fairly accurate approximation. For the DRAM, features such as refreshing, which can cause an extra delay, are ignored because they happen infrequently or do not have a large impact.

The NUMAchine simulator is execution-driven and based on MINT [Veenstra 1993]. Being execution-driven means that the simulator uses a real parallel executable binary as input and runs the program using an interpreter and a virtual model of the processor, in this case the MIPS R300017. MINT forms the front-end of the simulator, and we provide the back-end that models NUMAchine's memory system. The two halves are linked together into a single executable called Mintsim. As the front-end, MINT is responsible for creating as many virtual R3000s as there are parallel threads. (It creates a virtual processor each time a new thread is spawned.) MINT executes the instruction stream until it encounters a load, store, or synchronization operation, at which point the virtual processor blocks (stalls) and sends a request to the NUMAchine architecture back-end (see Figure 3.8). The back-end takes the request and passes it through caches, busses, rings, etc., generating appropriate delays at each step. Eventually a response is scheduled to go back to the appropriate processor, at which point the processor unblocks and continues executing as before until the next load, store or synchronization event occurs. In this way the stream of references maintains correct temporal ordering, due to the feedback path between the back- and front-ends. This temporal ordering is particularly important for modelling the caches and cache coherence. This technique yields more accurate results than trace-driven simulation, which uses a pre-generated static listing of event/time pairs.

17. The R3000 does not support the MIPS III Instruction Set Architecture (ISA); in particular it only supports 32-bit words, not 64-bit. This is definitely a drawback considering that all microprocessors are moving towards 64-bit operation, but it would have required too much effort to modify MINT.

FIGURE 3.8: The NUMAchine simulator structure. The diagram shows a single thread running on one virtual R3000 processor. With N threads, there are N independent virtual R3000s, each communicating with the back-end in parallel.

(In an execution-driven simulation, the delays are generated by the component models, which is more accurate and flexible than using static timings.) Another benefit of execution-driven simulations is that large trace files (typically many hundreds of megabytes in size) need not be either generated or stored.

When MINT encounters an OS call, it either runs the call natively on the host machine (for file operations) or mimics the behaviour of the function internally. In either case the call takes zero time as seen from the virtual processor. While modelling operating system activity would provide a more accurate picture of real-world performance, it is not possible to do given our use of MINT. Even if it were possible, there are good reasons to stick with the simpler model. Operating system code has very different behaviour patterns than regular programs. Modelling the two together, while more realistic, makes analysis more difficult since the two reference patterns are intertwined, and may even affect each other in ways that are difficult to predict, particularly when caching is included. A much more intricate toolset is required for this type of simulation. Newer simulation environments such as SimOS [Rosenblum 1997] do allow this level of complexity, but are not designed to support user-written back-ends, thus they do not allow the modelling of specific architectures.

Another important factor in system-level modelling is page management. We chose not to model page fault overhead in Mintsim. As with OS activity, modelling this behaviour would be more realistic but would also make the analysis much more difficult. Mintsim does have the ability to model round-robin, first-hit and fixed page mapping schemes, but in all cases the initial fault takes zero time. We also chose not to add features such as page migration and replication, again for complexity reasons. (We chose to use the first-hit policy. Chapters 4 and 5 will shed more light on this choice.)

Mintsim also has the ability to accurately model the full NUMAchine memory hierarchy behaviour for either the whole program, or for just the parallel section. We define the start of the parallel section by inserting a special dummy routine into the Splash2 source code just before thread creation. The end of the parallel section uses a similar dummy routine call after the main thread has successfully waited for all children to finish18. Mintsim's default behaviour is to model only the parallel section.

When skipping over the sequential code, Mintsim correctly executes all instructions with the correct opcode timings, but allows all loads and stores to succeed immediately, without checking the cache and, even more importantly, without doing any page mapping. This has two effects. The first is to underestimate the time spent in the sequential section. The second is to leave the cache in a cold state when the parallel section starts. Underestimating the sequential time makes the performance (as measured by speedup curves) look better, because the same amount is subtracted from both numerator and denominator in the speedup equation. Cold cache effects in the parallel section increase its execution time, which tends to reduce measured speedup, countering the first effect. The net effect is negligible, as will be shown in the next chapter.

We used the simulator as both a validation and design tool. On the validation side, we took great pains to ensure that the cache coherence and ring network models were accurate.
We then used the simulator to check for correct operation of the coherence scheme and rings. In one case the simulator found a very rare corner case that exposed a bug in the coherence protocol.

18. The dummy routine, generate_event(), is a null routine in the source code, but is recognized by MINT and converted into a call to the sim_user() routine in the back-end, which takes the appropriate action.

As an aid to validation we found it useful to include coverage tables for the protocol. These tables keep track of how many times each state transition in the protocol is activated. Some transitions never occurred in even the longest and most complicated simulation runs, necessitating either synthetic programs designed specifically to activate the transition, or as a last resort careful verification by hand. The coverage tool also proved useful in providing feedback on the frequency of state transition usage, allowing for possible future optimizations for the most commonly taken transitions. It should be noted that while the simulator proved extremely useful as a validation tool, it did not provide formal verification. Formal verification of the cache coherence protocol, for example, can be achieved by formulating the protocol as a (large) theorem in temporal logic which can be shown by the rules of logic to be true or false. While work in the area of formal verification has been under way for some time (e.g. see [Hailpern 1980]), tools such as Murphi [Park 1996] lacked the power to handle problems of this complexity.

As a design tool, the simulator allowed us to answer many 'what-if' questions quickly. For example, we originally designed the station bus to be 128 bits wide instead of 64 bits. As we started designing the control logic, we discovered that implementing a 64-to-128 multiplexer in the datapath of the processor card was more complicated and costly than we had anticipated. A week of simulation indicated the performance improvement was only on the order of 10%, and so was not worth the effort.
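A coverage table of the kind described above can be as simple as a counter per (state, event) pair, as in this illustrative C++ sketch; the state and event names are placeholders rather than entries from the actual protocol tables.

    #include <map>
    #include <string>
    #include <utility>
    #include <cstdint>
    #include <iostream>

    // Illustrative protocol-coverage table: count how often each state
    // transition fires during a run, then report the counts afterwards.
    using Transition = std::pair<std::string, std::string>;   // (state, event)

    struct CoverageTable {
        std::map<Transition, uint64_t> counts;

        void note(const std::string& state, const std::string& event) {
            ++counts[{state, event}];
        }

        void report() const {
            for (const auto& [t, n] : counts)
                std::cout << t.first << " / " << t.second
                          << " fired " << n << " times\n";
        }
    };

    // Usage: coverage.note("L_GV", "remote upgrade"); called from the directory
    // controller model on every transition. Pre-registered entries that remain
    // at zero flag transitions needing synthetic tests or hand verification.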

The simulator back-end consists of about 20,000 lines of C++ code, representing over one man-year of programming. The use of object-oriented design proved to have both benefits and drawbacks. Given the size and complexity of the simulator, abstracting away details into classes helped keep the code manageable. Class inheritance was useful in forcing us to first codify the common features of an entity such as a cache into a base class, then later differentiate into specific instantiations such as direct-mapped or set-associative. On the down side, linking the back-end into MINT proved problematic because the latter is written in C. The original version of the simulator had both the front- and back-ends using separate event-scheduling engines. This turned out to be quite slow, and it was decided to rework the back-end to use MINT's highly tuned scheduler. This required some advanced C++ techniques, which will be described shortly.

The basic architecture of the back-end is a set of independent concurrent processes communicating by passing messages through ports. The messages are of fixed format, and contain all information pertaining to a single event in the back-end. Examples of events are load or store references from a processor, a request for bus access, or a packet on the ring. Ports are meant to be generic connection points for simulator entities. For example a Processor object has a Memory port, which could be connected to a Cache, Bus, or Memory object, depending on the architecture. Ports are bidirectional, and contain two methods, Send(Message*) and Receive(Message*). Once a connection is made between two ports, calling a port's Send() with a message will activate the Receive() method at the other end of the connection. (Note that a given simulator object can have multiple ports. For example a processor could have ports to memory, a cache and a monitoring object.) The Send() method sends a message in zero time, which is useful for maintaining status information, but not for general modelling. Another method, SendAtTime(Time t, Message*), sends the message at a time t units in the future.

The basic operation of the simulator is thus to invoke Receive() methods at scheduled times. These Receive()'s process their messages and by means of Send()'s cause other Receive()'s to be scheduled at later times. Ultimately a processor's Receive() (from the memory port) will be activated, which will cause a call to MINT which causes the appropriate thread to unblock. As mentioned above, using the MINT scheduler causes some difficulties when interfacing to the C++ back-end. MINT is designed to schedule its own internal events, which basically contain pointers to function calls in the MINT code. To use this framework for the C++ methods, some means of encapsulating a method call to an object is required. One way of doing this is to use an object called a functor [Coplien 1993]. This is an object which behaves as a function. By suitably encapsulating these functors, it is possible to have the MINT scheduler call the appropriate object's method at the correct time.

We found the performance of the simulator to be quite good. For example a typical application19 run on an SGI Challenge machine with 150 MHz R4400 processors took 17 seconds to execute natively, and 1570 seconds in the simulator (running on a Sun Ultra 4 with 296

19. The application was the Splash2 kernel Cholesky, using the tk18.O input. The Splash2 benchmark suite will be described in the next chapter.

MHz UltraSPARC-II processors), giving a slowdown on the order of 100 times. This allows programs with running times of a few tens of minutes to be simulated in under a day, which allows for rapid feedback and experimentation. Additional information on the simulator can be found in Appendix A.
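To make the port mechanism described above more concrete, the following sketch shows a stripped-down version of the idea: bidirectional ports, a zero-time Send(), a timed SendAtTime(), and a single event queue standing in for MINT's scheduler, with a functor-like std::function used to defer each Receive() call. All names and details here are illustrative assumptions, not the actual Mintsim interfaces.

#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct Message {              // fixed-format event descriptor (illustrative fields)
    int      type;            // e.g. load, store, bus request, ring packet
    uint64_t address;
    int      source;
};

using Time = uint64_t;

struct Event {                // a deferred Receive() call, functor-style
    Time t;
    std::function<void()> action;
    bool operator>(const Event& o) const { return t > o.t; }
};

std::priority_queue<Event, std::vector<Event>, std::greater<Event>> scheduler;
Time now = 0;

class Port {
public:
    void Connect(Port* other) { peer_ = other; other->peer_ = this; }
    // Zero-time send: the peer receives immediately (status updates only).
    void Send(Message* m) { if (peer_) peer_->Receive(m); }
    // Timed send: schedule the peer's Receive() t time units in the future.
    void SendAtTime(Time t, Message* m) {
        scheduler.push({now + t, [p = peer_, m] { if (p) p->Receive(m); }});
    }
    std::function<void(Message*)> onReceive;   // set by the owning simulator object
    void Receive(Message* m) { if (onReceive) onReceive(m); }
private:
    Port* peer_ = nullptr;
};

int main() {
    Port cpuSide, memSide;
    cpuSide.Connect(&memSide);
    memSide.onReceive = [](Message*) { /* a memory model would act on it here */ };
    Message load{0, 0x1000, 0};
    cpuSide.SendAtTime(10, &load);             // deliver 10 time units from now
    while (!scheduler.empty()) {               // drain events in time order
        Event e = scheduler.top(); scheduler.pop();
        now = e.t;
        e.action();
    }
    return 0;
}

In the real back-end the scheduling loop belongs to MINT, and the encapsulated method calls are handed to it rather than to a local queue, but the control flow is the same: Receive() calls beget Send() calls, which schedule further Receive() calls.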

The most important consideration in using a simulator to model a complex system such as NUMAchine is the level of belief in the simulator's modelling accuracy, and whether the simulator is itself functionally correct. Our approach to validating the simulator consisted of two stages.

For initial validation, we used some small hand-designed synthetic benchmarks, for which we could predict the results. One example of such a benchmark is Single-Reader (SR). In SR, each processor allocates an array, then simply reads through it. (Note that compiler optimizations typically have to be turned off for these benchmarks, to keep the optimizers from throwing away all the code; the results of reads in SR are never used.) The number of iterations is specified as an input parameter. The array size can be chosen to fit into or overflow any given level of cache. In either case the number of cache hits and misses, and the overall latency, can be calculated and compared against simulator output. Another useful feature of SR is a parameterizable array stride. On each iteration the processor-to-array correspondence can be changed by using some fixed stride length to walk through the different arrays. An array stride of zero is the default, and a stride of four causes processors to use arrays from different stations on each iteration, thus testing the Network Cache. Other synthetic benchmarks included Single-Reader/Single-Writer (SRSW), used to verify write coherence actions, and Multiple-Reader/Single-Writer, which helped test broadcast invalidates.

The second stage of validation occurs after the hardware prototype has been built, and involves redoing measurements on the hardware and comparing the results against the simulator. Results of this validation will be presented in Chapter 4.
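As an illustration, the core of the Single-Reader benchmark described above might look like the following sketch. The array size, iteration count and stride are input parameters in the real benchmark; here they are constants, and the thread setup is reduced to a sequential stand-in, so this is a shape, not the actual source.

#include <cstdio>
#include <cstdlib>

static const int NPROCS   = 4;
static const int ARRAY_SZ = 64 * 1024;   // chosen to fit in or overflow a cache level
static const int ITERS    = 16;
static const int STRIDE   = 0;           // 0 = each processor reads only its own array

static volatile int sink;                // keeps the compiler from removing the reads

int* arrays[NPROCS];

void single_reader(int proc_id) {
    for (int it = 0; it < ITERS; ++it) {
        // With a non-zero stride, processors walk through other processors'
        // arrays on successive iterations, exercising remote accesses and the NC.
        int target = (proc_id + it * STRIDE) % NPROCS;
        int sum = 0;
        for (int i = 0; i < ARRAY_SZ; ++i)
            sum += arrays[target][i];
        sink = sum;                      // the result is otherwise unused, as noted above
    }
}

int main() {
    for (int p = 0; p < NPROCS; ++p)
        arrays[p] = static_cast<int*>(calloc(ARRAY_SZ, sizeof(int)));
    for (int p = 0; p < NPROCS; ++p)     // sequential stand-in for the parallel threads
        single_reader(p);
    printf("done\n");
    return 0;
}

Because every reference count in such a loop is known in advance, hit and miss totals and latencies can be computed by hand and checked against the simulator's output.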

3.3 Conclusion

This chapter presented the architecture and implementation of the NUMAchine multiprocessor and NUMAchine's architectural simulator, Mintsim. We showed how NUMAchine's use of a two-level ring hierarchy allows for a system that scales well up to 64 processors, both in terms of latency and cost. We considered ordering properties of rings which enabled simple and efficient implementations of a novel hardware cache coherence scheme and the provision of a sequentially consistent programming model. NUMAchine makes use of a cache in the Network Interface Card, with the goal of reducing the latency penalty for remote versus local accesses. This Network Cache (NC) helps to reduce the level of NUMAness of the machine. The coherence directory scheme, of which the NC is an integral part, uses a novel, lazy approach to the maintenance of coherence directory information. This lazy approach does not attempt to maintain inclusion between cache levels.

We outlined the general procedure used for implementation of the prototype, and stressed the importance of system-level simulations using high-powered CAD tools to verify design functionality. The design and implementation of the NUMAchine simulator was discussed, as was its use during the prototype design to choose appropriate system parameters and validate system functionality. Mintsim's flexibility also allows it to be used as a research tool, which will be its role in the next two chapters. These chapters will analyse the overall performance of the prototype, as well as providing justification for the design choices, such as the lazy directory protocol, sequential consistency, ring hierarchy and backoff mechanism.

CHAPTER 4 Prototype Performance

This chapter investigates the performance of the NUMAchine architecture using Mintsim. We first consider the overall performance, then look in more detail at the Network Cache, rings, backoff mechanism and coherence protocol.

4.1 Simulation Environment

To analyse the performance of NUMAchine we use Mintsim and a subset of the programs from the Splash2 benchmark suite [Woo 1995]. The Raytrace application from the Splash2 suite had a problem linking with libraries, so we could not get it to run. Volrend, Radiosity and FMM all had execution times greater than half a day for a single datapoint, so we decided not to use them. The other programs had execution times ranging from about 5 minutes for FFT, up to over 2 hours for Barnes. Throughout this chapter, Splash2 programs all use the parameters specified in the Splash2 characterization paper as the defaults for up-to-64-processor configurations. For completeness, the applications used and their parameters are shown in Table 4.1. The parameters for the simulated hardware are the same as the NUMAchine prototype described in Chapter 3. The rest of this section discusses details of the model of the prototype used in Mintsim.

4.1.1 Station Bus

The station bus is modelled using the hardware's default round-robin scheduling scheme. If the bus is idle, a request succeeds immediately. Transaction duration on the bus includes any cycles required for data, as well as one idle cycle at the end of a bus transaction, which is necessary for turnaround of the bus drivers.

TABLE 4.1: Splash2 applications and their parameters.
FFT: -m16 -l7 -n512 (64K complex doubles, 128-byte cache line size, 512 cache lines = 64KB cache)

Ocean: -n256 (258x258 ocean grid)
Radix: -r1024 -n1048576 -m2097152 (1M keys, 2M max key, radix 1K)
Water: 512 molecules

If any transaction targets are busy (due to the flow control scheme outlined in Chapter 3), then the transaction is skipped for the current round of arbitration. (Note that for multicasts there can be multiple bus targets. If any one of them is busy, the whole multicast must wait.)
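The skip-if-busy behaviour can be sketched in a few lines. This is an illustrative model only; the class and member names are invented and are not Mintsim's.

#include <vector>

struct BusRequest {
    std::vector<int> targets;   // one target, or several for a multicast
    bool pending = false;
};

class BusArbiter {
public:
    BusArbiter(int nRequesters, int nTargets)
        : requests_(nRequesters), targetBusy_(nTargets, false), next_(0) {}

    // Returns the index of the granted requester, or -1 if none can proceed.
    int Arbitrate() {
        int n = static_cast<int>(requests_.size());
        for (int i = 0; i < n; ++i) {
            int r = (next_ + i) % n;
            if (!requests_[r].pending) continue;
            if (AnyTargetBusy(requests_[r])) continue;   // skip this round of arbitration
            next_ = (r + 1) % n;                         // advance the round-robin pointer
            requests_[r].pending = false;
            return r;
        }
        return -1;
    }

    std::vector<BusRequest> requests_;
    std::vector<bool> targetBusy_;                       // set by the flow control scheme
private:
    bool AnyTargetBusy(const BusRequest& req) const {
        for (int t : req.targets)
            if (targetBusy_[t]) return true;             // one busy target stalls a multicast
        return false;
    }
    int next_;
};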

4.1.2 Queue Modelling

All queues in the simulator model depths correctly; that is, a cache line written into a queue uses up the full 17 entries (16 data doublewords1 + 1 command)2, in order to accurately gauge average and maximal queue usages. During the design stage, this allowed us to determine optimal queue sizes. For analysis purposes, these numbers provide a measure of burstiness and contention; queues that handle large bursts have maximum queue usage values that differ significantly from the average.

In most cases, the queues are modelled with zero pass-through latency. That is, a write into the queue is immediately available at the output. Typically the queue is one element of a chain in a datapath, and the queue latency is simply lumped in with other overheads to speed up the simulation3. One case where queue latency is modelled directly is for the FIFOs that

1. We use the MIPS definition of a 'word' as containing 32 bits, and a 'doubleword' 64 bits.
2. Logically the simulator treats a cache line as a single message (not 17 separate messages) for simulation efficiency.

inject packets onto the rings (both Local and Central). We use a 30 ns delay in this case, which is a typical number for an IDT72205-15 SyncFIFO running on a 20 ns clock [Integrated 1994].

4.1.3 Memory Card

The memory card uses the lump-sum model since it consists of a single datapath. The default is to have an 80 ns delay for the coherence directory lookup, which takes into account the time to look up the state information, as well as the time to generate a single (non-data) command packet. If a DRAM access is necessary, then a further 200 ns are added to model a cache line access. This is the time for the DRAM controller to get the first doubleword of data into the output queue and ready to go onto the bus. Note that since the DRAM operation is pipelined, the time is not 320 ns (16 doublewords x 20 ns clock cycle). This model is somewhat oversimplified, because in reality the directory lookup and DRAM accesses are partially overlapped. This is an oversight which could be fixed in future simulation studies. Since the net effect is to underestimate the performance, we decided to leave this as is. We will ignore the line size issue throughout the remainder of this thesis, and will stick with a fixed 128-byte line.
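As a rough illustration, the lump-sum delay described above amounts to the following; the function name and structure are assumptions, and only the 80 ns and 200 ns figures come from the text.

using Nanoseconds = unsigned long;

Nanoseconds MemoryCardDelay(bool needsDram) {
    const Nanoseconds kDirectoryLookup = 80;   // state lookup + one non-data command packet
    const Nanoseconds kDramLineAccess  = 200;  // pipelined access to a 128-byte line
    // In the real hardware the two steps are partially overlapped, so adding
    // them serially slightly underestimates performance, as noted above.
    return kDirectoryLookup + (needsDram ? kDramLineAccess : 0);
}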

4.1.4 Processor Card

Accurate modelling of the processor card is crucial for good simulation results, because this is where the critical L2 cache resource resides. With the MIPS R4400, the L2 cache cannot be simultaneously accessed for both internal processor activity and external coherence requests such as interventions. We model this by locking out the L2 cache on a first-come/first-served basis. All cache access latencies are modelled using numbers from [Heinrich 1994]4. The L1 instruction and data caches have zero latency, while L2 accesses require either 3 or 4 cycles for a read, and up to 32 cycles for an L2 cache line refill.
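The first-come/first-served lock-out lends itself to a very small model: a single counter recording when the L2 cache next becomes free. The sketch below uses invented names and is not the actual Mintsim code.

class L2Lockout {
public:
    // Returns the time at which the requester (processor or external coherence
    // request) may use the L2 cache, given the current simulated time and the
    // requested occupancy in cycles.
    unsigned long Acquire(unsigned long now, unsigned long cycles) {
        unsigned long start = (now > freeAt_) ? now : freeAt_;   // wait if the L2 is busy
        freeAt_ = start + cycles;
        return start;
    }
private:
    unsigned long freeAt_ = 0;   // time at which the L2 cache becomes free again
};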

3. Whenever a sequence of dependent events happens with deterministic timing, they are lumped together in this fashion. The speed of the simulator is directly related to the number of events that need to be scheduled.
4. The intervention response latency given in the manual is 8-28 cycles, with the variability coming from the non-deterministic time to gain access to the L2 cache. Since we model this feature independently, the number we use is the minimum, 8 cycles.

4.1.5 Page Placement

Mintsim supports three page placement policies: round-robin, first-hit, and a fixed scheme where a file containing page mappings is provided as input to the simulator. The fixed scheme requires a preprocessing pass on the program to determine page usage statistics, and is intended for future work. Of the two others, round-robin is the most common for this type of study. A problem with the round-robin policy is that a page which is used by only one processor can be placed on a remote node. This not only increases average latency, but also needlessly increases capacity pressure on the NC cache lines. A first-hit policy (also called first-touch) does not suffer from this problem, since private pages will always be located in local memory5. When more than one processor shares a page, first-hit will place the page so that it is local to at least one of the processors. One of the major drawbacks to using first-hit is that sequential startup code (e.g. initialization of all shared data structures) can touch pages that ultimately will be used exclusively or mostly by other processors. In the worst case, all pages could be located in the master thread's local memory. An OS that provides page replication and migration can achieve the best of both approaches. An initial round-robin placement can use page-sharing statistics along with page migration and replication to evolve over time to an allocation that is close to what would have resulted from first-hit. Although it is possible to do the same thing in the simulator, it would be quite complicated. Because we are not modelling page fault overhead, it would be difficult to justify modelling migration and replication overhead, although they clearly have a significant effect on performance. For simulation simplicity, we instead use a first-hit policy for memory references during the parallel section of the program. (As mentioned in section 3.2, generating references only for the parallel section is Mintsim's default mode of operation.)
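A first-hit policy is simple to express in code. The sketch below, with invented interfaces, records the first station to touch each page and uses that station as the page's home thereafter.

#include <unordered_map>

class FirstHitMapper {
public:
    explicit FirstHitMapper(int pageShift = 12) : pageShift_(pageShift) {}

    // Returns the home station for the page containing 'addr', assigning the
    // page to the first station that references it.
    int Home(unsigned long addr, int requestingStation) {
        unsigned long page = addr >> pageShift_;
        auto it = pageHome_.find(page);
        if (it == pageHome_.end()) {
            pageHome_[page] = requestingStation;   // the first touch decides placement
            return requestingStation;
        }
        return it->second;
    }
private:
    int pageShift_;
    std::unordered_map<unsigned long, int> pageHome_;
};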

4.1.6 Instruction Fetches and Sequential Code

Mintsim has the capability to model instruction fetches. With this feature turned on, the back-end instantiates an L1 instruction cache, and also an L2 instruction cache if necessary. (The

5. If a program has large phases, where the groups of processors sharing a page change between phases, then first-hit will not mimic an effective migration and copying scheme. The page allocations for the Splash2 programs used are static, and do not suffer from this problem.

FIGURE 4.1: Modelling of sequential code and instruction fetches. The first two bars show the increase in execution time for the Parallel Section (PS) and the Total Application (TA) when the sequential startup code is modelled accurately. The second two bars show the increase when instruction fetching is turned on (in both the parallel and sequential sections). All execution times are normalized to the respective times for a simulation run without sequential code or instruction fetching. All simulations (including the normalization runs) use round-robin page placement, since first-hit would tend to allocate all pages on the home memory of the processor responsible for the sequential initialization code. Numbers for 32 and 64 processors are not shown because the round-robin scheme caused programs to take bus errors due to NACK time-outs, due to some heavily contended pages being placed remotely from all sharing processors.

default is to use a unified L2 cache, in which case instructions compete with data for L2 cache space.) Simulation time is roughly doubled by turning on instruction fetches. Instruction streams for these types of scientific applications are highly regular and have small footprints, and even a small L1 instruction cache is sufficient to achieve very high hit rates. With 1 MB of L2 cache in our prototype, conflict problems between instructions and data are insignificant. We expect modelling of instruction fetches to have little impact, and present results below indicating that this is the case.

In Figure 4.1 we show the result of running the programs FFT and Cholesky in the default (parallel-only) mode, full-application mode, and full-application plus instruction fetches. Cholesky has the highest ratio of sequential-to-parallel code in our group of applications. All simulation runs used round-robin page placement, because as mentioned above using first-hit when modelling sequential initialization code can lead to all pages being allocated in one memory. There are two important conclusions to be drawn from the figure. The first is that modelling instruction fetches has almost no overall effect on either full-application or parallel-

FIGURE 4.2: Simulated prototype speedups for the Splash2 programs. The graph on the left is for the Splash2 kernels, that on the right for the applications.

section execution times, as can be seen by comparing each '+I' bar against its non-'+I' counterpart. This result matches our prediction from the preceding paragraph. The second conclusion is that while accurate modelling of the sequential code definitely has an impact on the application's total execution time (and thus the speedup), it has practically no effect on the execution time of the parallel section. Thus, cache warm-up and page-mapping effects of the sequential startup code have little impact on the performance of the parallel code section, which justifies our choice for Mintsim's default mode of operation. The key point is that we are not so much interested in the performance of the sequential code as in the efficiency of the parallel section.

4.2 Prototype Analysis

To begin our examination of the prototype performance, we show in Figure 4.2 the speedup curves for the Splash2 programs. (Note that all of these curves were generated using the default fast mode for the sequential section, meaning that the curves overestimate the performance.) We define NUMAchine's performance to be good if increasing the number of processors always results in some gain (i.e. there is never a slowdown compared to some smaller number of processors), and the maximal speedup at 64 processors is roughly over 20. (These criteria may not be acceptable for very high performance machines, but we consider them appropriate for our implementation, which is targeted towards low cost.) These simple speedup curves indicate that the performance of the NUMAchine prototype is good for most of the Splash2 applications except Water-Nsqd and Ocean-Noncontiguous. This is not problematic since both of these are older versions of the programs, and the newer versions do meet our criteria. For the Splash2 kernels the only program that exhibits good performance is Radix.

What we are really interested in measuring in this study, though, is not whether the Splash2 programs parallelize well (which some of them, such as FFT, do not, particularly for the small default problem sizes), but whether our architecture can support efficient parallelism. If we plot the same graphs but ignore the sequential execution time and focus on the ratio of execution times of the parallel sections of code, which we call the parallel speedup, then we get a better picture of NUMAchine's parallel efficiency. In Figure 4.3 we see a large improvement in the poorly performing kernels, with only LU and Cholesky still exhibiting poor performance. (The figure also shows the algorithmic speedups from the Splash2 paper for comparison. Note that Cholesky has a poor algorithmic speedup, indicating that it will never be able to run well in parallel.) The main reason for the difference in true versus parallel speedups lies in the choice of problem sizes. Programs that show a large difference between the two have small ratios of parallel to sequential code, and thus cannot achieve good overall speedups due to Amdahl's Law. While choosing larger problem sizes would alleviate this problem, it would also lead to longer simulation times. More importantly, it makes the results difficult to compare against other studies. In this and in the next chapter we will use parallel speedup as our metric for the efficiency of the architecture.

In the rest of this chapter we will explore various aspects of the design and see how they impact performance. In the next chapter, we will investigate ways of modifying the architecture and tuning system parameters to achieve better performance.

4.2.1 Comparison of the Simulator and the Prototype

As a check on the simulator, we ported the Splash2 programs to NUMAchine in order to compare results from the real hardware against those from the simulated hardware. The port was only partially completed as of this writing, so the only applications working well enough to generate results were Cholesky and Barnes.

FIGURE 4.3: Parallel versus algorithmic speedups. The significant improvement for certain programs indicates that the problem size is too small in these cases. There is not enough work in the parallel section to allow for large speedups. For comparison purposes, the bottom two graphs show the algorithmic speedups from Figure 1 in [Woo 1995].

The first comparison is between uniprocessor execution times in the simulator and the prototype. This gives an idea of whether the overall (i.e. absolute) timescale of the simulator is accurate. The hardware consisted of a single Local Ring with 16 processors, running at 40 MHz. We ran the two programs under Mintsim at the reduced speed using a single processor, with sequential code included. We expect the simulator numbers to be lower, given that the simulator does not model OS activity or page faults. The results are shown in Table 4.2. The numbers are low by a factor of two to three, which is quite good. (Comparisons of simulation versus real hardware measurements are not typically reported in the literature. Note that the values of the numbers presented in the table are not too significant, only the fact that the ratios between them are within an order of magnitude.)

TABLE 4.2: Uniprocessor simulated versus hardware execution times.

Next we compare the speedups (i.e. relative timing) in hardware versus those presented in Figure 4.2. For the hardware and simulator we measure speedups relative to their respective uniprocessor times. The result is shown in Figure 4.4. For a program such as Barnes, which has a fairly long runtime, and can thus amortize the overhead due to paging (which is not modelled), the agreement is very good. Cholesky's short runtime cannot amortize this overhead, and the agreement is not as good, although there is a correlation between the simulated performance and that of the hardware.

4.2.2 Fmask Performance

As the number of stations in the system increases, the probability of the Fmask overspecifying stations increases. We can measure this imprecision in the Fmask by keeping track of the exact number of sharers in the memory coherence directory. When an invalidation is sent out, we divide the actual number of stations targeted by the real number of sharers to end up with the overinvalidation rate. For the simple case of two sharers, the overinvalidation rate can be

FIGURE 4.4: Simulator versus hardware prototype speedups. Speedups in both cases are total speedups, not parallel speedups. Cholesky has a short run time, thus in the real hardware it cannot amortize the error due to paging overhead, which accounts for the discrepancy in the results.

as high as two if the sharers are on different rings and stations. (Note that if the sharers are on the same station or the same Local Ring then the overinvalidation rate is one, i.e. there is no overinvalidation and the Fmask is precise.) We expect the rate to be around two if the number of sharers on average is two, because we did not tune the Splash2 programs to take advantage of locality. Figure 4.5 shows the overinvalidation rate averaged over all invalidations. The rates reach roughly 2.5 for some of the programs at 64 processors, indicating that there are sharing patterns involving three or more processors. However, invalidations incur little overhead in the processors, so the important point is that multicast invalidations do not on average become broadcast invalidations to all stations. Avoiding heavy broadcast traffic is important in maintaining system scalability, as shown in [Farkas 1992].
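The bookkeeping behind this statistic is simple to accumulate. The following sketch shows one way the simulator might tally it; the structure and names are assumptions, only the ratio itself comes from the text.

struct OverinvalidationStats {
    unsigned long long targetedStations = 0;
    unsigned long long sharingStations  = 0;

    // Called once per invalidation sent out by the memory coherence directory.
    void RecordInvalidation(int stationsTargetedByFmask, int actualSharerStations) {
        targetedStations += stationsTargetedByFmask;
        sharingStations  += actualSharerStations;
    }
    // A rate of 1.0 means the Fmask was precise; larger values mean stations
    // holding no copies were also sent invalidations.
    double Rate() const {
        return sharingStations
                   ? double(targetedStations) / double(sharingStations)
                   : 1.0;
    }
};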

4.2.3 Ring Performance

The first aspect of the rings we will look at is average utilization. In a single ring clock cycle a ring slot can be used for one of three reasons:

• Send Packet - Inject a packet into an empty slot.

FIGURE 4.5: Overinvalidation rates. This is the ratio of stations that are actually targeted by an invalidate to the number that need to receive the invalidate. Rates greater than one indicate imprecision in the Fmask. Note that there can only be imprecision if there is more than one station, so the number of processors starts at 8.

• Forward Packet - Pass along an upstream packet that has only downstream targets. This includes broadcast packets that do not select the ring interface.
• Split Packet - Write a copy of the packet into the receive queue, and also forward it to the next ring interface. This is for broadcasts that do target the receiving ring interface.

Note that we do not consider the receipt of a terminal packet6 to have used the slot. This is because the slot becomes free on the current clock tick; whether we choose to use the just-freed slot or not is independent of the slot's availability for carrying new traffic. We can now define the ring utilization (from the point of view of a single ring interface) as:

Ring Utilization = (PacketsSent + PacketsForwarded + PacketsSplit) / TotalRingSlots    (EQ 4.1)

6. A terminal packet is one which is consumed by the ring interface, creating an empty slot. This could be a point-to-point packet, or the final receiver of a multicast.

Note that the total number of ring slots consists of the number of slots in a given amount of execution time. We use the parallel execution time to calculate this number. Using the total application time would unfairly deflate the utilization numbers, since by default we do not model any ring traffic outside of the parallel section. (In addition, we do not expect as much remote traffic during the sequential section, and the nature of the traffic patterns would be different in any case. By keeping statistics only for the parallel section we avoid adding this noise to our measurements.)

To arrive at the overall average utilization, we average over all the ring interfaces. For the Local Rings, this means the ring interfaces in the NICs, plus the Local Ring side of the inter-ring interfaces (IRIs). For the Central Ring utilization we just average over the Central Ring portions of the IRIs. We show the results for the Central Ring in Figure 4.6, and the Local Ring in

FIGURE 4.6: Central Ring utilizations. The Central Ring only exists in 32- and 64-processor configurations. Note that the top and bottom rows have different vertical scales.


FIGURE 4.7: Local Ring utilizations. No rings exist for configurations of four or fewer processors. Also note that each row has a different vertical scale.

Figure 4.7. The large utilizations for Ocean, FFT and Radix arise from the high communication-to-computation ratio for these programs, as described in the Splash2 paper [Woo 1995]. As expected, the Central Ring utilization is higher than that of the Local Ring, except for the case of 32 processors7. This high average utilization indicates that the Central Ring becomes

7. For 32 processors the Central Ring consists of only two hops, which makes it almost equivalent to a full-duplex point-to-point connection. With the exception of invalidations, all traffic produced by one node is consumed by the other; there is no bypass traffic.


FIGURE 4.8: Central Ring queue utilizations. The numbers shown are the maximum and average depths for the queues that inject into and extract from the Central Ring.

congested, which is the reason for designing the Central Ring to allow for higher clock speeds. Increasing the Central Ring speed will be considered in Chapter 5. As a measure of congestion, we consider the maximum and average depths of the queues in the network interfaces. A large difference between the maximum and average values indicates bursty traffic and long ring access latencies. Results are shown in Figures 4.8 and 4.9, showing clear evidence of heavy congestion. Programs such as Barnes show large maximum queue depths, but have low overall utilizations, indicating that the bursts are few and short-lived. FFT, on the other hand, has large average utilization but low maximal queue depths,



FIGURE 4.9: Local Ring queue utilizations.

leading to the conclusion that its traffic is more evenly distributed in time. In general, congestion is worst in the Central Ring, except for the two LU programs, which show about equal, and comparatively low, levels of congestion in the Local and Central Rings. Maximum depths in the Central Ring also seem to be fairly symmetric between the injecting and extracting queues. This is not true for the Local Rings, where the most congested programs such as Barnes show a marked asymmetry, with the injection side being worse. This effect has to do with the NC. To see the reason behind this, consider that a queue only fills up if average input and output rates differ. The ring-injection queue on the NIC card has only one output, onto the ring. On the input side, it is fed both from the bus and the NC. Thus we expect one or both of these sources to show unusually high levels of activity. Indeed, if we look in the simulation output files, it turns out that Barnes has one of the highest NC hit rates (53% for 64 processors), with hits generating data (i.e. cache line) responses. Since a good fraction of responses are expected to involve forwarding, the NC has the potential to significantly increase pressure on the ring-injection queue, although this does save bus bandwidth. One possible method for increasing the ring-injection rate is to make use of the just-freed slot, which we consider next.

The measurement involves turning on the switch to use the just-freed slot and re-running the simulations. We show the results for FFT in Figure 4.10. The graphs show the relative improvement from making use of the just-freed slot. For the utilization curves we have split out the separate components of the utilization, so the total increase in utilization is the sum of the three. We see improvements up to about 30%, mostly for the 64-processor configuration. For the queue depths, the ratios are inverted, so that a ratio greater than one represents a decrease in the depth. For the Local Ring, we see the most improvement in the maximum depth of the ring-injection queue, which is to be expected. The improvement on the Central Ring is more drastic, and also much more variable. The ring-injection queue's maximum is improved by a factor of 7 (from 70 down to 10), and the average usages also go down in the 32-processor case.

The large change in queue depths has to do with the fact that for 32 processors the Central Ring has only two hops. In this case all data packets injected onto the Central Ring are immediately consumed after one hop, since there is only one possible destination. Each consumed data packet generates a just-freed slot. If the system does not allow the use of these just-freed slots, a steady stream of such incoming packets will prevent the packet consumer from inject-



FIGURE 4.10: Use of the just-freed slot. The graphs show the effect of switching to the use of the just-freed slot. In the utilization graphs, a number greater than one represents the relative increase in utilization over the default. In the queue depth graphs, a number greater than one corresponds to a decrease (i.e. improvement) in the respective queue depth. (The large improvements in queue depth for the Central Ring at 32 processors are due to the fact that in this configuration data bursts can cause one of the nodes to stall. See the text for an explanation.)

ing any of its own packets onto the ring, effectively stalling it. Note that it is possible for such a scenario to occur with a ring of more than two hops, but due to the larger number of pairs of communicating nodes the probability of all packets being consumed by one node is lower. (And even if one node is stalled, this does not stop other downstream nodes from communicating, nor does it stop broadcast traffic.)

For 64 processors there is actually an increase in the value of the maximum for the ring-injection queue, from 196 to 275. The rest of the queue values show improvements. The net effect on performance was to reduce the parallel execution time by 4% at 32 processors, and 15% at 64 processors. The conclusion is that use of the just-freed slot does improve performance, and with only minimal changes to the ring control logic. There are cases where use of the just-freed slot can cause starvation of a downstream node. For example, if two stations stream data to each other, using up all slots, a third station will not be allowed to inject packets. However, a similar starvation scenario would also be possible in a system that did not make use of the just-freed slot.
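For concreteness, the per-cycle decision at a ring interface, including the optional reuse of the just-freed slot, might look roughly as follows. This is a simplified sketch with invented names; broadcast target selection and flow control are omitted.

#include <queue>

struct RingPacket { int dest; bool broadcast; };

class RingInterface {
public:
    RingInterface(int id, bool useJustFreedSlot)
        : id_(id), useJustFreed_(useJustFreedSlot) {}

    // 'incoming' is the slot arriving from upstream this cycle (nullptr = empty).
    // The return value is what goes out on the downstream slot this cycle.
    RingPacket* Clock(RingPacket* incoming) {
        bool slotFree = (incoming == nullptr);
        if (incoming) {
            if (incoming->broadcast) {
                receiveQ_.push(*incoming);      // split: keep a copy and forward it
                return incoming;
            }
            if (incoming->dest == id_) {
                receiveQ_.push(*incoming);      // terminal packet: the slot frees up
                slotFree = useJustFreed_;       // reuse only if the option is enabled
                incoming = nullptr;
            } else {
                return incoming;                // forward an upstream packet
            }
        }
        if (slotFree && !injectQ_.empty()) {    // inject our own traffic into a free slot
            outgoing_ = injectQ_.front();
            injectQ_.pop();
            return &outgoing_;
        }
        return incoming;                        // empty slot travels on
    }

    std::queue<RingPacket> injectQ_, receiveQ_;
private:
    int id_;
    bool useJustFreed_;
    RingPacket outgoing_;                       // storage for a packet injected this cycle
};

With useJustFreed_ set to false, a steady stream of terminal packets keeps slotFree false every cycle, which is exactly the stalling behaviour observed above for the two-hop Central Ring.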

4.2.4 Network Cache Performance

The most basic metric for Network Cache performance is the hit rate. We consider a request for a remote cache line to be a hit in the NC if it does not end up generating any network traffic. The simplest case is when data fetched for a shared read from one processor can be returned to a subsequent shared read from another processor. It is also possible that the second read could come from the same processor if the line was ejected from the processor's cache. We count this as a hit as well, since the NC is helping to alleviate the processor cache's capacity misses. A more complicated scenario consists of a processor doing an exclusive read for a line that is dirty in another processor's cache. The NC sends out an intervention, with the response going to both the requester and the NC. While more involved, remote accesses are still avoided. There are five possible types of NC hit (a short sketch after the list shows one way they might be tallied):

• SHR_LV: A shared read with the NC the owner of the line (Local Valid state). The NC responds with data.
• SHR_GV: A shared read with the line globally shared (Global Valid state). Again the NC can respond with data.

• SHR_LI: A shared read for which the NC mediates the intervention to obtain the dirty copy (Local Invalid state) in a local processor. The line ends up in the LV state.

• EXC_LV: An exclusive read (or upgrade) to an NC-owned line. The NC responds with data or an invalidate, and changes the line's state to LI.
• EXC_LI: An exclusive read (or upgrade), with the NC mediating the exclusive intervention.
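As a concrete illustration of this classification, a hit in the NC might be tallied as in the following sketch; the enumerations and counter names are assumptions rather than Mintsim's actual types.

#include <map>
#include <string>

enum class NCState { LocalValid, GlobalValid, LocalInvalid, GlobalInvalid };
enum class Request { SharedRead, ExclusiveRead };

std::map<std::string, unsigned long> hitCount;

// Classify a request that hit in the NC. Requests that miss, or that must go
// remote anyway, are handled elsewhere and are not counted here.
void ClassifyNCHit(Request req, NCState state) {
    if (req == Request::SharedRead) {
        if (state == NCState::LocalValid)        ++hitCount["SHR_LV"];  // NC owns the line
        else if (state == NCState::GlobalValid)  ++hitCount["SHR_GV"];  // globally shared copy
        else if (state == NCState::LocalInvalid) ++hitCount["SHR_LI"];  // local intervention
    } else {
        if (state == NCState::LocalValid)        ++hitCount["EXC_LV"];  // data or invalidate
        else if (state == NCState::LocalInvalid) ++hitCount["EXC_LI"];  // exclusive intervention
    }
}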

Figure 4.11 indicates that while the hit rates can be quite good, there is considerable variability in the behaviour. FFT rarely hits, because its access pattern consists largely of migratory data which has little spatial or temporal locality. Radix has an all-to-all communication phase which is heavily write-dependent. Since Radix's writing pattern is fairly random, the

probability that a line will be shared on the same station goes down as the nurnber of stations goes up, which accounts for the deçlining hit rate. (This trend is generally applicable, as can be seen from the figure.) The most prevalent source of hits is from accesses to globally shared read-only data. It is not possible to draw any spccific conclusions about the overall effect of the NC on performance from the hit-rate graphs. While the NC ceriainly does provide a caching effect (sometimes substantial), the tradc-off is inmascd latcncy due to the directory lookup for misses that eveatually have to go rcmotc. We defer Merdiscussion of net NC performance benefits until the next cbaptcr, whae we present rcsults on the effect of NC size, includhg a system without an NC. As mentioned in Chapter 3, WMAchine's NC has a number of architectural feahues which are novel amongst remote-amss caches. The featum with the most potential for delete- rious side-effects is the laissez-faire attitude that the NC takes towards directory information. Remember that the NC cm in most cases silcntly thmw out old information to make room for new. witbout notifying local processors or the home memory. The following list enurnerates the repercussions of ttirowing out cache lines witb the given NC States:

• LV - This line must be written back to the home memory, which thus gets notification of the event. All local shared copies remain in the processors, but any new accesses must go remote. To get into the LV state, this line had to first exist in the NC in the LI state, then become shared by another processor. This history of local sharing means the line has a high probability of accesses in the near future.
• GV - There is no notification of this ejection. Local shared copies remain. Future accesses have to go remote.
• GI - No information is lost. The meaning of this state is that the NC knows only that there are no copies of the line on the station.
• LI - There is no notification, and this ejection represents the greatest amount of information loss, because this state specifies exactly where the dirty copy is on the station. The dirty copy remains undisturbed. Any remote interventions looking for this line must broadcast the intervention to all processors, and wait for all responses, which is costly. Since there is no place to store this line in the NC, a writeback from the processor is forwarded on to the home memory.

Not all cases of information loss are detrimental. For example, if a line is no longer being actively used, then the loss of information is inconsequential. To gauge how much useful information is actually lost under this scheme, we look at two different statistics. In the first case, we would like to know how many times a remote request (from the home memory) comes into the NC expecting to find information, but does not. The most costly scenario is the one described above where an intervention must be broadcast. A second, less costly, case occurs when a broadcast invalidate comes in and needs to invalidate any local copies. In the absence of a GV hit (where the processor mask would specify the processors to invalidate), the NC must broadcast the invalidate to all processors. This does not involve much of a performance penalty because invalidates are relatively cheap.

For intervention broadcasts, the result is simple: there are almost none. The majority of the programs showed no broadcast interventions, shared or exclusive, while the number of specific interventions (i.e. those that found directory information) ranged from 20 up to 60,000. Only Cholesky, Barnes and Ocean-Non showed any significant numbers of broadcast interventions. The greatest number was for Ocean-Non, with 122 broadcast interventions versus 56,000 of the specific type (for a 32-processor configuration). The conclusion is that sharing accesses to dirty lines occur with high temporal locality. Information ejected for these lines is usually stale.

The case for invalidates turns out to be much the same. For 8- and 16-processor machines the numbers are identical to those for the interventions: almost all invalidates are specific. For 32 and 64 processors, the number of broadcast invalidates suddenly jumps up, almost approaching the number of specific invalidates. The reason for this becomes clear if we recall the fact that overinvalidation due to Fmask imprecision occurs only for 32- and 64-processor systems. The broadcast invalidations thus arise not from the fact that directory information was thrown out, but because there was never any information to begin with; stations without any copies are being incorrectly targeted with invalidations. The number of broadcast invalidations recorded in the NC matches the overinvalidation rates from Figure 4.5. Thus we can conclude that GV lines generally get invalidated before their directory information has time to be ejected.

The preceding analysis has only considered the effect of directory information loss on incoming remote requests. Local requests can suffer from information loss in two ways. Shared requests to lines that would have stayed in the NC if it had more capacity pay a remote access penalty. We will defer this question until the next chapter, where we use an infinite-sized NC cache to determine the effect of capacity problems. The second case is where dirty directory information is lost. A local request must go to the home memory, which still thinks that the NC has the most up-to-date information on the line. The memory sends the request back to the originating NC, at which point the NC realizes that the line must be locally dirty, but the directory information was lost. The NC must then broadcast the intervention and keep track of responses, even though it will not end up keeping the information.
(The space in the directory is already used by some other line which may be needed by local processors, thus we do not want to eject the current occupant because of a remote request.) In the simulator, these requests are flagged as false interventions. Analysis of the output files shows that the story here is the same. In the large majority of cases there are zero false interventions. Barnes at 16 processors shows around 50 false interventions, and Ocean-Non shows around 100 at 16 and 32 processors.

The final conclusion is that the lazy coherence directory scheme used by NUMAchine is a definite win. It avoids sending directory update information wherever possible, and pays almost no price for the lost information.

4.2.5 Request and Backoff Latency

We have seen in some detail in the previous sections how the rings and coherence scheme can affect memory accesses. In this section we will take a step back and look at request latencies. Whatever the ultimate cause, long latencies (for requests or synchronization) are what cause performance loss in a multiprocessor. Though we know that the system does suffer from congestion, the high-level effect on performance is primarily through increases in latency. In this section, we measure the contention-free latency and then compare it to latency measurements for a simulation run that had high levels of congestion.

TABLE 4.3: Base contention-free latency for a local read.

Transaction Step: Latency (in 20 ns clock cycles)
L1 cache miss / L2 cache miss: 1.33 (4 processor cycles @ 150 MHz)
External Agent: 1.5 (30 ns FIFO delay)
Memory: 14 (80 ns directory lookup + 200 ns DRAM access)
Bus: 21 (4-cycle arbitration + 17-cycle transfer)
External Agent (data return): 12 (30 ns FIFO delay + 16 data cycles @ 75 MHz EA speed)

The basic latency for read requests in an idle system can be calculated by adding up the latencies for each step along the transaction's path. We show the calculation for a read request to local memory in Table 4.3. For comparison, measurement using a logic analyzer connected to the processor card and the bus shows a measured latency to the first word of 1120 ns8. The same measurement for a near remote access gave the result 4100 ns. For a far remote access, the latency was 4900 ns. For remote accesses in the simulator, we can do the same type of calculation. A near remote access takes an additional 1160 ns compared to the on-station request. A far remote access adds only 300 ns more since the path is almost the same as for the near remote, with the addition of four IRI FIFO delays and one round-trip traversal of the Central Ring. We thus have the ratios 1100:2260:2560, or 1:2:2.3. The discrepancy between the simulator and hardware for remote latencies is due to overly optimistic assumptions for controller overheads when modelling the hardware. While field-programmable devices are very flexible, their speed is not very high. In order to achieve our 50 MHz clock rate, we had to add many synchronizing flip-flops on inputs (e.g. for FIFO empty flags) to maintain setup times, and break complex decoding logic into multiple stages. The prototype uses nearly a dozen controllers in the NIC's datapath, each contributing 3-4 cycles of latency. This overhead is incurred on both the local and remote stations, and accounts for roughly 2200 ns of extra delay. The balance of the difference, around 700 ns, was found to come from late changes during debugging to the operation of the ring controller, to force an isolated read response to use only every other ring slot. Unfortunately, we discovered these discrepancies too late to re-run the simulations. This causes our performance numbers to be optimistic, particularly for programs with low NC hit rates.

The contention-free numbers will increase in the face of network congestion, and also because of backoffs. In Table 4.4 we show latency measurements for one example of a highly congested system: Ocean with 64 processors. The local and remote memory latencies increase by about 25% and 60% respectively. The next two numbers show the effect of backoff on the latency. The simulator output does not show the average number of retries required, only the average latency. Although the latency increases are very large - 380% and 330% for local and

8. To perform this measurement, we looked at an R4400 signal pin called ValidOut*, which indicates that a request is ready to come out of the processor. We measured up until another R4400 signal, ValidIn*, was asserted, meaning that the first doubleword of data had been returned. By measuring to the first doubleword instead of the last, we are assuming a critical-word-first arrangement of the cache line.

TABLE 4.4 (fragment): Remote Mem: 5000 / 6420 ns; Local Memory - Retry: 48 / 4220

remote, respectively - the frequency with which they occur is low. The number of retries is generally low for all the programs. One exception is Cholesky, where over 25% of requests to the NC ended up having to retry, with average latencies of around 9000 ns instead of the typical time for an NC hit of around 1100 ns.

The heavy latency penalty and lack of fairness in the retry mechanism make it one of the weaker points of NUMAchine's architecture. How weak can only really be answered by modifying the simulator to model an ideal system with pending request queues. This is left for future work.
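For reference, a binary-backoff retry delay of the general kind discussed here can be expressed in a few lines. The constants below are purely illustrative; the prototype's actual backoff parameters are not given in this section.

using Nanoseconds = unsigned long;

// Delay before the Nth retry of a NACKed request under binary backoff.
Nanoseconds RetryDelay(int retryCount) {
    const Nanoseconds kBaseDelay = 100;          // hypothetical initial backoff
    const int kMaxShift = 6;                     // cap the exponential growth
    int shift = (retryCount < kMaxShift) ? retryCount : kMaxShift;
    return kBaseDelay << shift;                  // 100, 200, 400, ... ns
}

Because each NACK doubles the wait, a request unlucky enough to be NACKed several times quickly accumulates latencies of the magnitude seen for Cholesky above, while a freshly arriving request may be serviced first, which is the fairness problem noted in the text.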

4.2.6 Flow Control

The flow control mechanism can result in ring-stoppage occurring on either the upper or lower level rings, or busy-waiting on the bus. The simulation results show that for the applications we tested the Central Ring never locks up. (The Central Ring locks if the queue in the IRI going from the Central Ring down to the Local Ring fills up.) On the Local Ring, no ring locks are generated by the NIC cards. The only case where locking occurs on the Local Ring is from the IRI. (In this case it is the upward queue from the Local to Central Ring in the IRI that fills up.) Figure 4.12 shows that most programs do not cause the ring to lock, but for those that do, such as Radix, locking occurs frequently. For Radix the all-to-all communication pattern causes the Central Ring to become a bottleneck. The IRIs cannot inject packets onto the Central Ring fast enough, and the resulting buffer overruns cause the Local Ring side to lock up. This could be fixed with a faster Central Ring. Faster rings will be explored in Chapter 5.

On the bus, busy waiting happens if the sinkable- or nonsinkable-busy flags are set for the target(s) of a transaction. The arbiter does allow other non-blocked transactions to proceed,

FIGURE 4.12: Local Ring locks caused by the IRI.

FIGURE 4.13: Average bus utilization. Only the three heaviest bus users are shown. The rest of the programs have utilizations less than 20%. The busy-wait fractions are negligible in all cases (< 0.1%).

though, so the performance penalty is not as severe as for ring locks. For most programs, the bus utilization is less than 20%. From Figure 4.13 we see that the busy-wait fractions are minuscule. Only the flow control on the ring plays any role in NUMAchine's performance.

4.3 Conclusion

We began by defining 'good' performance in the context of a simple, low-cost multiprocessor. Consistent speedup growth with increasing system size, as well as a lower bound on the speedup of roughly 20 at 64 processors, represent our basic criteria. We showed that in terms of parallel speedup, which is speedup with sequential code time set to zero, NUMAchine exhibits good performance.

We then went on to analyse certain architectural aspects discussed in Chapter 3. We found that the rings perform fairly well, although they do suffer from congestion problems. Allowing a ring-injection buffer to use a ring slot that has just been freed up in general helps lessen the buffer's congestion, has a small overall positive effect on performance, and requires no extra logic.

NUMAchine's filtermask structure and lazy coherence directory maintenance were shown to be efficient. The overinvalidation rate from the use of the filtermask reached highs of around 2.5 (i.e. on average 2.5 times as many stations were sent invalidations as actually needed them). This was acceptable since superfluous invalidates to a processor's cache incur very little overhead. This result also indicates that the filtermask works to reduce invalidation traffic compared to a scheme where all invalidations cause broadcasts to all stations. This filtering of invalidation traffic was shown to be important for system performance in NUMAchine's predecessor, Hector [Farkas 1992].

Interventions, on the other hand, are much more expensive. The Network Cache, because of its lazy approach to maintaining directory information, can reach a state where it must broadcast interventions to all processors. We showed that this state almost never occurs, indicating that the NC has a high probability of holding onto coherence directory information that is needed in the future. Hit rates in the NC generally ranged from 50% down to 20%, with the lower rates occurring for higher numbers of processors. The overall magnitude of the performance enhancement from using the NC is investigated in the next chapter.

We took a brief look at request latencies and the effect thereon of the binary-backoff NACK-and-retry mechanism. For requests that did not require a backoff, average latencies for local requests increased by only 20% over the base local latency in a contention-free system. Remote requests showed larger latency increases of around 300% on average due to congestion in the network. For requests that did require backoffs, the increases in local and remote latencies were much higher: 400% and 600% respectively. This result shows that the backoff mechanism has the potential to be a serious source of performance degradation. An alternative scheme that queues requests and services them in FIFO order was proposed, but comparison of the two was left to future studies.

Finally, we showed that the flow control scheme rarely comes into play, except for injection onto the Central Ring, and even then only for some programs. The queueing provided in the system is sufficient to avoid flow control, even in the face of high average bus utilizations. In the next chapter we will tune and modify certain aspects of the system to try and gain some insight into ways of increasing NUMAchine's performance.

CHAPTER 5 Simulation Studies

In this chapter we use Mintsim to explore the parallel processor design space. The goal is to determine what changes could be made in the design to most effectively increase overall system performance. Changes will be considered to the Network Cache, rings, coherence protocol and consistency scheme. The general metrics used to measure system performance continue to be parallel execution time and the associated parallel speedup.

The design space we are considering is huge. A parallel processing system contains hundreds of different independent system parameters, making a systematic exploration of the entire space impossible. In selecting certain aspects of the system for modification, it is just as important to determine which others are to be held fixed, and why. Major architectural features that remain unchanged are described in the following section, with justifications for each.

5.1 Fixed Simulation Parameters

The most important ingredient in any multiprocessor system is the network. For the purposes of this work, the basic network topology will stay ring-based. One major reason for doing this is practical and has to do with the simulator. The design and verification of the simulator's ring components were very time-consuming. In addition, the single-path nature of the ring is an essential ingredient in the cache coherence protocol. Switching to a mesh or some other multiply-connected topology would necessitate not only new network component code, but also a completely redesigned (and verified) coherence protocol. Besides this practical limitation, there are other good design reasons for limiting the scope to rings. Previous work has indicated that rings compare favourably to meshes [Ravindran 1997]. Also, as mentioned previously, one of the goals of this work is to show that respectable parallel system performance can be achieved without resorting to complicated or expensive hardware. We will show how the ring performance from the last chapter can be improved upon with minimal added complexity.

Our basic approach is to use problem-constrained (PC) scaling [Culler 1999]. Basically, we carefully choose a fixed problem size as described in the rest of this section, and then run this problem on a varying number of processors. The alternatives are time-constrained (TC) or memory-constrained (MC) scaling. In TC scaling, the problem size is increased with the number of processors N, such that the total execution time (wall clock time) is kept fixed. In the real world, users are often limited by the amount of time that can be spent on a given problem. (Users typically would like a job that runs overnight, and will scale down the job to fit into this timeframe.) In this case it is the total amount of work done which scales with N. In TC scaling the dataset size for each cache generally does not decrease, or does so slowly, thus it avoids jumps in performance as data sets start fitting entirely into caches (which can happen in PC). However, because our simulator actually runs serially, TC scaling would cause our simulation times to scale with N as well, which would lead to week-long simulation runs for each datapoint for programs like Barnes. The last type of scaling is memory-constrained, where the total amount of memory used by each processor is kept fixed. This method is useful when memory is limited, because overflowing the memory can lead to severe thrashing. The drawback is that it is difficult to determine speedups for MC, because it is hard to determine the actual amount of work done for N processors for all but the simplest applications. We argue below that by taking care with the choice of problem size, the PC model can avoid any anomalies, and provide good results.

We will not change the basic unit of one processor with some fixed amount of dedicated cache. (We will, however, change the associativity of the cache, as discussed below.) Since the simulator uses MINT as a front-end, it is not feasible to model a different instruction set architecture (ISA) or a superscalar design1. Outside of the occasional custom-designed chip (e.g. the Tera [Bokhari 1998]), the most common processors used in parallel systems are RISC-based, thus we would in any case expect very little difference in the memory reference stream by switching to a different ISA.

1. We also avoid newer architectures such as simultaneous multithreading or single-chip multiprocessors.

Because processor speed is increasing faster than memory speed, we increase the speed of the processor in the simulator to model systems that will exist over the next few years. This is done by setting a parameter in the input file to indicate the speed of the processor, as well as loading in a different set of opcode timings. In the NUMAchine prototype, the processors run at 150 MHz, but for the purposes of these experimental studies we increase the speed to 1 GHz. We model system sizes of up to 64 processors, since, as we have argued previously, this is the upper limit on expected real systems for the next few years.

We want to estimate the performance of running 'large' problems on such a system. In this context 'large' means that we do not want an application's entire dataset to be able to fit into the cache, for any number of processors. The latter point is subtle but important. If we are not careful, then it is possible to choose a problem size which does not fit in the cache for small numbers of processors, but does start fitting into the cache before the system size reaches our maximum. In such cases the performance will show an abrupt jump, causing superlinear speedup. Clearly this is an artifactual result, due to the crossing of some architectural boundary, and should be avoided.

If we were to use the default NUMAchine 1-MB L2 cache, then on a 64-processor machine we would need problems with data sets much larger than 64 MB. Such large problems would take on the order of a few hours to run in hardware, which means weeks of simulation time for each datapoint. The standard solution in this case is to scale down both the system cache sizes and the problem size. Scaling down is complicated by the fact that complex systems such as a multiprocessor are highly nonlinear, meaning that it is not possible to just reduce all parameters by some constant factor.

In scaling down we have to pay careful attention to the data set and working set sizes for a specific application. The data set size is the amount of memory required when running on a uniprocessor. Note that this is not the same as the problem size, which depends not only on the data set size, but also on other program parameters (e.g. the number of iterations to converge on a solution). The working set is more nebulously defined as the 'current' data being processed at a given point in the program, and as such an application can have numerous working sets over time. It is common for working sets to fit entirely into caches, even small primary caches. (In fact, a well-coded program should try to fit inner loops into the cache, for example by splitting a large inner loop into multiple smaller loops, in order to take better advantage of spatial and temporal locality.) For this reason, working sets tend to scale fairly slowly with increased problem size. Thus, in scaling down, we would like to have caches that are large enough to accommodate typical working sets, because it is reasonable to expect that this situation would also hold for large problems running on a real machine. But we would like the caches small enough that the full per-processor data set does not fit into the cache. Besides cache size considerations, we also have to be careful when selecting the other cache organization parameters: line size and associativity. Both of these have a direct impact on miss rates.
Of particular importance in scaling down are capacity and conflict misses. Increasing the capacity misses is actually our goal, since we want to mimic data sets that are too big for our caches. Cold misses are not considered, other than to ensure that our simulation runs are long enough, statistically speaking, to make cold-cache effects negligible. Conflict misses represent a hazard. The default NUMAchine configuration uses direct-mapped caches for both primary and secondary caches, this functionality being fixed by the choice of the R4400 processor. However, with a 1 MB secondary cache, the probability of conflict misses is low. As the processor cache size is reduced, the conflict miss rate increases. We can compensate by increasing associativity, but the question is how much is reasonable? Modern RISC processors such as the MIPS R10000 and Alpha 21264 use 2-way associativity, while others such as the PowerPC 604 are already 4-way associative [mwd 1999]. Another consideration is that the shared memory model under which the Splash2 programs are compiled uses four distinct memory regions: shared, stack, heap and data/bss2. The natural choice is thus 4-way associativity. (Actually, initial simulation studies were performed with direct-mapped caches, and the results showed severe anomalies. Extensive checking using simulator debug traces showed that the anomalies were due to conflict misses.) Figure 5.1 shows the performance improvement for 4-way associativity over direct-mapping for one particular pathological case where the Barnes application with 64 processors had a severe conflict miss problem. (It turned out that in one specific inner loop, just one of the 64 processors happened to have a stack area that conflicted with the data/bss region, which was enough to kill the performance. We also tried using 2-way associativity, which showed some improvement but still suffered from excessive conflict misses.) Having settled on 4-way associativity, our solution for selecting an appropriate cache size was to first pick appropriate application sizes given simulation time constraints, and then set the size of the processor cache small enough to steer clear of problems.
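The conflict in question is easy to see from the way a set index is formed. The following sketch (the cache geometry and the two addresses are illustrative, not the exact simulator code) shows how a stack address and a data/bss address can land in the same set: with a direct-mapped cache the two lines evict each other on every alternating access, whereas a 4-way set of the same total capacity can hold both.

    #include <stdio.h>

    #define CACHE_BYTES   (64 * 1024)   /* scaled-down L2 used in these studies */
    #define LINE_BYTES    128
    #define WAYS          4             /* set to 1 for a direct-mapped cache   */

    /* Set index for an address: drop the line offset, then take the low-order
     * bits that select one of the (CACHE_BYTES / LINE_BYTES / WAYS) sets. */
    unsigned set_index(unsigned long addr)
    {
        unsigned sets = CACHE_BYTES / LINE_BYTES / WAYS;
        return (unsigned)((addr / LINE_BYTES) % sets);
    }

    int main(void)
    {
        /* Hypothetical stack and data/bss addresses whose low-order bits match. */
        unsigned long stack_addr = 0x7fff2080UL;
        unsigned long bss_addr   = 0x10012080UL;

        /* Both map to the same set; with WAYS=1 they evict each other,
         * with WAYS=4 the set can hold both lines at once. */
        printf("stack -> set %u, data/bss -> set %u (ways per set = %d)\n",
               set_index(stack_addr), set_index(bss_addr), WAYS);
        return 0;
    }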

2. See Appendix A for a description of these regions.

FIGURE 5.1: Direct-mapped versus 4-way associative processor caches. The performance increase shown is the ratio of execution times of the parallel sections of the program for the direct-mapped version over the 4-way associative version for a particular run of the Barnes application. Analysis of simulator debug traces showed that the large anomaly at 64 processors was due to cache conflicts.

Using the curves of cache usage in the Splash2 paper [Woo 1995], an L2 cache size of 64 KB was chosen. For N=64 this results in 4 MB of total cache in the system. To be conservative, application parameters were chosen such that the total memory allocated to shared regions was > 16 MB. (These shared regions are where MINT stores the globally shared data structures which are divided amongst the processors.) Since the shared area does not include local (private) data such as the stack and the heap, which also use up cache space, this choice should be quite safe. Numerous test runs of each application were conducted with varying problem sizes, using the MINT '-s' switch to specify the maximum allowable size of the shared memory region. The applications were run through the simulator with the switch set to 16 MB with ever increasing problem sizes until MINT died with an error message indicating that the amount of shared memory requested by the application was more than the maximum. This set of parameters was then used as the default for the studies that follow. A complete list of the Splash2 application and kernel problem sizes is given in Tables 5.1 and 5.2.

TABLE 5.1: Problem sizes for the Splash2 kernels. The entries recoverable from the source are: Cholesky - chol_tk29.O, a large sparse matrix blocked to a 64 KB cache (the Splash2 default uses the medium matrix, chol_tk18.O, with 16 KB blocking); Radix - 4M integer keys with maximum value 8M (the default is 1M keys with a 2M maximum); LU - a 512x512 matrix (the default is a 128x128 matrix); FFT - 1M complex doubles, 512 cache lines of size 128 B (the default is 1K points, 64K lines of size 16 B).

Another consideration when choosing the problem size is run length. Simulation runs should last long enough that they are statistically significant. The choices above all result in simulated run times on the order of a few seconds, which for the basic simulator timescale of 1 ns means over one billion cycles. Even with cache miss rates less than 1% this guarantees many millions of memory references into the hierarchy, and should render any cold-cache effects negligible. With two levels of cache in the processor, it is still possible to have a minor cross-over problem when local data structures fit into one level of cache but not the other. To avoid any such anomalies in the results, only one level of cache is used. (The dual-level cache is actually hardwired into the simulator code, so what was done was to make both the L1 and L2 caches have the same size and line size, thus effectively behaving like a single level of cache.)
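The sizing argument above is simple arithmetic; a short sketch of the check (the 64 KB, 64-processor and 16 MB figures are the ones chosen in this section) is:

    #include <stdio.h>

    int main(void)
    {
        const unsigned procs           = 64;
        const unsigned l2_bytes        = 64 * 1024;           /* scaled-down L2     */
        const unsigned long shared_min = 16UL * 1024 * 1024;  /* chosen shared size */

        unsigned long total_cache = (unsigned long)procs * l2_bytes;   /* 4 MB */

        printf("total cache = %lu KB, shared region >= %lu KB (ratio %.1fx)\n",
               total_cache / 1024, shared_min / 1024,
               (double)shared_min / total_cache);
        /* As long as the shared data alone is several times the total cache,
         * the per-processor data set cannot fit in cache at any system size. */
        return 0;
    }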

5.2 Algorithmic Speedup of the Test Programs

Before using these applications to probe our design space we need to verify that they are the correct tools. As we saw in Figure 4.3, certain combinations of programs and problem sizes do not parallelize efficiently even under the PRAM model. We want to ensure that we use only programs that do parallelize well, in order to avoid drawing false negative conclusions.

TABLE 5.2: Problem Sizes for Splash2 Applications

Data set descriptions (recoverable entries): a 258x258 grid (-r20000, the default); 8K particles, instead of the default 16K; interaction of 512 water molecules; interaction of 16K gravitating particles.

Mintsim can model a perfect memory system, and thus measure algorithmic speedups. This feature is turned on by setting a parameter in the input file, after which the processor module considers every access to the L2 cache to be a hit, no matter what the cache line state. The results for our subset of the Splash2 programs are shown in Figure 5.2. These differ from the curves in Figure 4.3 because we have changed the problem sizes. From the curves it is clear that LU, Cholesky, Water and FMM are not good choices for our test programs. The main reasons for the non-ideal algorithmic speedups are code overhead due to parallelization for Cholesky, and sub-optimal workload partitioning for the rest. (We determined this by examining the processor utilization statistics from the simulation output. Mintsim keeps track of what percentage of time is spent by processors running or waiting for memory accesses and barriers. A large fraction of time spent waiting for barriers in the PRAM model indicates a workload partitioning problem.) We thus choose FFT, Radix, Ocean and Barnes with which to do our testing. As described in the Splash2 paper [Woo 1995], FFT and Radix have high communication-to-computation ratios, and work well as stress tests. In addition, the nature of the communication for the two is different: FFT shuffles data using a butterfly pattern, thus the sharing is migratory, while Radix has an all-to-all data reshuffling phase which generates heavy coherence traffic. Ocean and Barnes are good examples of typical scientific applications, complementing the two kernels.

FIGURE 5.2: Algorithmic parallel speedups for the experimental system. These results differ from [Woo 1995] due to the choice of different problem sizes. The Radix and FFT curves lie on top of each other.
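A minimal sketch of the perfect-memory mode just described (the type and function names are illustrative, not Mintsim's actual code): every L2 access costs the hit latency regardless of line state, so the measured speedup reflects only the algorithm and its synchronization, not the memory system.

    /* Illustrative sketch of the perfect-memory option used to measure
     * algorithmic speedup: all L2 accesses hit, whatever the line state. */
    typedef struct {
        int  perfect_memory;    /* set from the input parameter file */
        long l2_hit_cycles;
    } cache_model_t;

    long l2_access_cost(const cache_model_t *c, int line_is_valid,
                        long real_miss_cost)
    {
        if (c->perfect_memory)
            return c->l2_hit_cycles;              /* always a hit, any state */
        return line_is_valid ? c->l2_hit_cycles : real_miss_cost;
    }

    double algorithmic_speedup(double t_one_proc, double t_n_procs)
    {
        return t_one_proc / t_n_procs;            /* parallel-section times  */
    }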

5.3 Baseline Performance and Page Placement

With our upper bound defined, we now move to the baseline case, which considers the model above to be running with the real NUMAchine memory and network. Most of the non-fixed simulation parameters are the same as for the simulations presented in Chapter 4, with one exception. Due to the faster processor and smaller cache there will be much more traffic in the system, which necessitates an increase in the queue sizes throughout the model. We make them large enough that they will never overflow, meaning that they are effectively of infinite depth. This is actually beneficial, because queue overflows would cause the NUMAchine flow control mechanism to trigger, which would cloud the results. For these design-space studies we do want to model queueing delays caused by contention in the interconnect, but we do not want to worry about the effect of finite queues, since this is really an implementation issue and only indirectly related to the architectural questions we wish to study3. A good architecture

3. Since the number of requests and writebacks is limited, there are theoretical maxima. With four writebacks from every station going to one memory, we would need space for 60 cache lines of 128 bytes, or 7680 bytes.


FIGURE 5.3: Parallel speedups of the baseline system with round-robin and first-hit page-placement policies. Parameters are the same as the NUMAchine defaults, except as noted in the text. The first-hit policy reduces the amount of 'false' remote traffic and coherence overhead.

should minimize the need for large queues, but ultimately one puts in the largest queues one can afford, to reduce the frequency of activation of flow control mechanisms.

In Figure 5.3 we see that the performance is poor for all four programs up until 32 processors using a round-robin paging policy, and then becomes a slowdown for Barnes and Ocean. With the faster processors the memory system is being pushed beyond its capacity. When we use a first-hit policy, pages that were unnecessarily placed on remote stations are eliminated, leading to reduced remote traffic and lower coherence overhead. Though there may be practical implementation problems with a first-hit policy, as mentioned in the previous chapter, they are all at the level of the OS, and do not relate directly to the hardware architecture. The use of a first-hit strategy allows us to more correctly attribute any blame for performance degradation to the hardware. A first-hit policy is assumed for the rest of the results in this chapter.

In Figure 5.4 we break down the performance of the applications by examining the processor utilisations, which show the time spent by the processors doing real work, waiting for


FIGURE 5.4: Processor utilisation graphs corresponding to the first-hit speedup curves in Figure 5.3. The utilisation categories are Busy, Lock, Barrier, Local and Remote. Since the algorithmic speedups for all of these programs are nearly ideal, the significant fraction of time spent in synchronization for Ocean and Barnes must be due to overhead created by our architecture.

local or remote memory references, or waiting for synchronization. The poor performance of Ocean (and to a lesser extent Barnes) in Figure 5.3 corresponds to the large fraction of time spent waiting for synchronization in its utilisation graph. Since we know that all four programs have nearly ideal algorithmic speedups, this overhead is being caused by our architecture. An examination of the simulation output files for Ocean reveals that the degradation is due to the backoff mechanism. In the other three programs, the number of accesses that required retries either stayed the same or decreased in moving from 32 to 64 processors, while the average latency for retried requests stayed the same. For Ocean, however, the number of requests requiring a retry increased by 50% in the 64- versus the 32-processor case, and the average retry latency for each different category of request doubled. Barnes did not show the same jump in retry overhead, although its total number of retries was considerably larger than Radix and FFT. In Section 5.4.1 we propose a more efficient backoff mechanism.

5.4 Comparative Studies

We now begin exploration of the design space, our goal being to improve performance. The areas that we consider are the rings, the consistency model, the coherence protocol, the network cache and the station bus. The NACK-based retry mechanism is closely related to the coherence scheme, both of which are discussed in the next section.

5.4.1 Coherence Overhead

The main question with regards to cache coherence is how much overhead it imposes on the system, and whether it is worth optimizing the protocol. NUMAchine's cache coherence scheme was designed to make use of the efficient ordering and multicasting properties of rings to achieve low overhead. An essential component of the overall coherence protocol is the NACK-and-retry backoff mechanism, which we introduced to handle access to locked coherence directory lines. If one assumes that contention is usually low, then this approach should work well, but the results of the previous section show that this assumption is false. Our NACK-and-retry scheme suffers from fairness and liveness problems. Because there is no notion of priorities between retries and regular requests, there is no guarantee that a given request will get through promptly or even at all. A much better procedure would be to enqueue and/or merge multiple requests at the memory and Network Cache. This would either mean that every directory entry would require space and logic allocated for a queue to hold the maximum theoretical number of requests (complex and non-scalable) or that a generic queueing pool would need to be allocated and managed to handle any overflows (scalable but still complex). The advantage to using the NACK scheme is that it is very simple and cheap to implement.

To begin our analysis, we run the simulations with coherence 'turned off' to find an upper limit on the amount of performance improvement achievable. In practice this means the following three changes are made to the memory model (a sketch of these changes in code form is given below):

• Stores to the processor cache that find the line present, no matter what the state (shared or otherwise), are treated as hits. No coherence information is passed on to the memory or NC. Of particular note, no upgrades are generated in this scheme, since stores to shared lines are considered hits.

• Read requests (shared or exclusive) to home memory return data unconditionally. The directory is not checked, nor is it updated.

• For the Network Cache, if the line is present in any state, it is considered a hit and data is returned.

Note that cache misses still have to go remotely to fetch the line, although the remote memory is guaranteed always to hit. Lines that are locked due to other remote accesses in progress still generate NACKs as before. Note that this is the only type of NACK that can still occur. By eliminating all NACKs, except for those to a line locked in the NC by some still-pending remote access, we can also measure the temporal locality of requests for the same cache line to the NC. If processors tend to all access shared lines at the same time, we should see a large number of NACKs. If the accesses are more spread out in time, these NACKs will always be converted to hits under the no-coherence model.

The results are shown in Figure 5.5, and make sense in light of the discussion at the end of the last section. For Ocean the improvement is drastic, indicating that the slowdown between 32 and 64 processors was indeed due to coherence overhead. Barnes shows a slight improvement, with Radix and FFT showing almost no improvement, corroborating the evidence in the utilisation graphs. The lack of improvement for FFT and Radix points to the conclusion that the coherence overhead is low. The rationale for this statement is that in turning off coherence we eliminate three types of overhead:

• All NACKs, except for local NACKs from the NC.

• Latency to obtain write ownership.

• Latency for remote interventions, because home memory always hits.

FFT and Radix do not have many NACKs, but their simulation output files do show large numbers of write accesses. The negative result for these programs shows that the second two types of overhead must be low. For Radix, the reason for the low overhead is that both with and without coherence the NC hit rate is around 85%, meaning that the coherence protocol requires mostly local transactions. For FFT, the NC hit rates are around 5%, but most of the writes occur to lines which are shared locally on-station, which require only on-station invalidations.

It is not clear how representative these types of reference behaviours are of 'real world' applications. This is a generic problem with benchmark suites, though, and is not specific to our analysis. The strongest conclusion we can draw is that, other than the effects of the backoff mechanism, our coherence protocol (including the effect of the NC) performs well for our test programs.
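The sketch referred to above covers the three 'coherence off' changes. The state names, function names and the simplified normal-protocol branches are illustrative assumptions, not the simulator's actual code.

    /* Illustrative sketch of the 'coherence off' memory model: stores hit if the
     * line is present in any state, home memory returns data without touching the
     * directory, and the NC hits on any resident line (locked lines still NACK). */
    enum line_state { INVALID, SHARED, DIRTY, LOCKED };

    int store_is_hit(enum line_state l2_state, int coherence_off)
    {
        if (coherence_off)
            return l2_state != INVALID;   /* shared lines count as hits; no upgrades */
        return l2_state == DIRTY;         /* simplified normal case: need ownership  */
    }

    int home_memory_read_ok(int coherence_off, int directory_says_ok)
    {
        if (coherence_off)
            return 1;                     /* return data unconditionally, skip directory */
        return directory_says_ok;
    }

    int nc_lookup_is_hit(enum line_state nc_state, int coherence_off)
    {
        if (nc_state == LOCKED)
            return 0;                     /* locked lines still generate NACKs */
        if (coherence_off)
            return nc_state != INVALID;   /* any resident state is a hit */
        return nc_state == SHARED || nc_state == DIRTY;
    }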

FIGURE 5.5: Turning off cache coherence. The relative performance with cache coherence turned off is shown. The ratios are with respect to the baseline case.

Commercial multiprocessors are frequently being used nowadays for non-scientific applications such as online transaction processing (OLTP) and decision-support systems (DSS), which make heavy use of parallel databases. This analysis needs to be extended to cover these new application domains, particularly since their access patterns are quite different from those of the Splash2 programs [Barroso 1998]. The Transaction Processing Council (TPC) provides benchmark suites for both OLTP and DSS, called TPC-C and TPC-D respectively [TPC 1999]. However, both of these suites require a parallel database engine. Porting a database to NUMAchine is a massive task, which is far beyond the scope of this dissertation, and is left to future work.

As mentioned in Chapter 3, NUMAchine uses a sequential consistency model. Sequential consistency is generally considered in the literature to impose severe restrictions on performance. While this is true when using traditional multiprocessor networks such as meshes and hypercubes, it is not clear that it is as significant for NUMAchine's rings. As mentioned before, the rings already provide a natural sequencing mechanism, so the actual overhead in our case is not as severe. (It could be argued that the choice of rings to begin with is a bad starting point, thus we should not expect much improvement. In our defense we point to other studies that indicate rings and meshes can compare favourably [Ravindran 1996, Hamacher 1997].) The goal of this section is to get a rough estimate of this overhead. The method presented here does not rigorously model true relaxed consistency. We make no modifications to the applications, and the modifications in the simulator are a minimal set, basically changing the ordering of certain coherence operations and loosening invalidation constraints.

As mentioned in Chapter 3, NUMAchine's rings have sequencing nodes to cause a global ordering of broadcast invalidates. In this section we turn that sequencing off, and consider invalidates to be immediately active on reaching the highest level of ring necessary to reach all targets. We also add an Upgrade Response command. Since NUMAchine uses the Invalidate command as both an ACK to the requester and a kill to other shared copies, the ACK can take longer to arrive than if it were a separate point-to-point command. In this section we separate the functionality of the two, and send the Upgrade Response before sending out the Invalidate. If the requester is on the home memory and there are globally shared copies that need invalidating, this can significantly decrease the latency for write permission. We also make changes to the memory and network cache modules in the case where an exclusive request needs a data response plus an invalidate. Normally in NUMAchine the invalidate must be sent out first, but here we switch the order. They are still sent out back-to-back, so this effect should be fairly minor. The result is that there is negligible change in performance. The changes are less than 1%, so no graphs are shown.
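A sketch of the reordering described above. The message names follow the text, but the function structure and the handling of broadcast destinations are illustrative assumptions rather than the simulator's actual code.

    #include <stdio.h>

    /* Illustrative sketch: in the relaxed variant the home memory sends the
     * point-to-point Upgrade Response to the requester before broadcasting the
     * Invalidate, instead of letting the Invalidate double as the requester's ACK. */
    enum msg { UPGRADE_RESPONSE, INVALIDATE };

    void send_msg(enum msg m, int dest)
    {
        printf("send %s to %d\n",
               m == INVALIDATE ? "Invalidate" : "Upgrade Response", dest);
    }

    void handle_upgrade(int requester, int relaxed, int have_remote_sharers)
    {
        if (relaxed) {
            send_msg(UPGRADE_RESPONSE, requester);   /* grant write permission first   */
            if (have_remote_sharers)
                send_msg(INVALIDATE, -1);            /* then broadcast the kill (-1)   */
        } else {
            send_msg(INVALIDATE, -1);                /* default: the broadcast Invalidate
                                                        also serves as the requester's ACK */
        }
    }

    int main(void)
    {
        handle_upgrade(3, 1, 1);   /* relaxed ordering               */
        handle_upgrade(3, 0, 1);   /* default sequentially-consistent ordering */
        return 0;
    }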
If we couple this result with recent work showing that modern out-of-order microprocessors can reap many of the performance benefits of relaxed models, while keeping the sequential model [Gniady 1999], there is not a strong case for pursuing relaxed consistency within the NUMAchine framework.

Central Ring Speed

Clearly the raw speed of the network has a tremendous bearing on performance in any multiprocessor system. In a shared memory system with hardware cache coherence it is particularly crucial, because the programmer has much less control over communication than in a message-passing paradigm. The general goal in designing the network is to reduce latency. Congestion in a real network is a highly nonlinear effect: a very small change in the speed of one particular link can cause sudden and drastic performance changes as the contended resource causes knock-on effects throughout the memory hierarchy. Bursty traffic is usually the worst offender in these situations, although multiple processors generating streams of writebacks could also saturate the network. The goal of this section is to investigate the effect of varying the speed of the upper ring to find the point after which increasing the speed will lead to diminishing returns. The default speed for both levels of ring is 50 MHz. We keep the system size fixed at 64 processors, because this is the worst-case traffic scenario for the Central Ring. From the results of Chapter 4 we know that the ring-injection queues into the top level ring can become very full. Increasing the Central Ring speed should help clear out the queues, but on the other hand will increase the rate of requests being injected into the lower levels of the hierarchy, and could possibly cause the problem to move elsewhere. The key point is that the two levels of hierarchy must be well-balanced. We would like to find this balance point.

If we consider a steady-state pattern of requests flowing throughout the system, with the probability of any two processors communicating being equal, then we can work out theoretically where we think the balance point should be. Let us assume for simplicity that each station has a total required bandwidth, B, for requests that go to another station (i.e. we ignore traffic on the station bus), whether that station is on a local or remote ring (see Figure 5.6). Assuming the access patterns are the same for the different stations, then some fraction of B is needed for traffic to stations on the same local ring, and the remainder is used for remote stations. Let us call these fractions f_LR and f_RR respectively, with the condition that f_LR + f_RR = 1. In the absence of contention we can simply add the required bandwidths to arrive at a total. A local ring carries the traffic for both remote and local requests, thus with four stations the necessary local ring bandwidth is 4B. A fraction B·f_RR goes up across the central ring, and since each central ring link sees the traffic from all 16 stations, we have a total requirement of 16·B·f_RR for the central ring. The ratio of central ring to local ring bandwidth is then 4·f_RR. If an application is equally likely to access a station on a remote or local ring, then f_RR is 4/5 (there are 12 remote stations versus 3 local ones, or a fraction of 12/15), and we would need just over three times as much bandwidth at the central ring level. We expect most applications to exhibit somewhat better locality, meaning that f_RR would be lower. Our estimate then is that central ring bandwidth should be two to three times that of the local ring.
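The balance-point estimate above reduces to a one-line formula; the following sketch simply reproduces that arithmetic (B is an arbitrary per-station bandwidth, and f_RR the remote-ring fraction as defined in the text):

    #include <stdio.h>

    int main(void)
    {
        const double B = 1.0;            /* per-station inter-station bandwidth (arbitrary units) */
        const int stations_per_ring = 4;
        const int stations_total    = 16;
        double f_rr;

        for (f_rr = 0.4; f_rr <= 0.81; f_rr += 0.2) {
            double local_ring   = stations_per_ring * B;        /* 4B            */
            double central_ring = stations_total * B * f_rr;    /* 16 * B * f_RR */
            printf("f_RR=%.1f: central/local bandwidth ratio = %.2f\n",
                   f_rr, central_ring / local_ring);            /* = 4 * f_RR    */
        }
        /* Uniform access gives f_RR = 12/15 = 0.8, i.e. a ratio just over 3. */
        return 0;
    }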
Figure 5.7 shows the results of increasing the central ring speeds in increments of 50 MHz, up to a maximum of 250 MHz. The performance increases by 11% at 100 MHz, and then flattens out. The second graph in the figure shows what is limiting the performance. At 50 MHz the injection queues have large maximum and average depths. Increasing the Central Ring speed reduces this problem, but puts more pressure on the extraction queues, where the average queue depths increase by over a factor of two. To truly balance the system, an investigation of the three-dimensional space formed by the speeds of the station bus, Central and Local Rings is required, which is left for future work.

FIGURE 5.6: Bandwidth requirements of the Central Ring. Each station generates traffic with bandwidth B. A fraction B·f_RR splits off and goes up to the top level ring. In a ring topology, any traffic that returns to the sender must always use all of the links on any ring it traverses: what goes up must come down. The asymmetric data sizes of requests and responses are taken into account by the assumption that all stations behave the same: a given station sends requests on its output link, and gets responses on its input link, but it also gets requests from other stations on its input link, and sends responses out.


FIGURE 5.7: Effects of increasing Central Ring speed. The left graph shows the relative performance (parallel execution time) compared to the default 50 MHz Central Ring. The figure on the right shows the maximum and average queue depths for the FIFOs that inject onto and extract from the Central Ring.

5.5 Network Cache Performance

As mentioned in Chapter 2, the benefits of network-level caches are not universally agreed upon. The goal of this section is to investigate the performance of NUMAchine's Network Cache. We look at the effect of changing NC sizes, then the NC's level of associativity.

5.5.1 Network Cache Size

The goal of the NC is to reduce remote miss latency. It does this by providing a backup repository for cache lines which are ejected due to conflict or capacity misses in the per-processor caches. In NUMAchine the NC serves a dual role as the local portion of the distributed coherence directory, thus it can also enhance locality for processor coherence misses. So, for example, if a processor obtains write permission for a remote cache line, any further reads or writes by other processors on the station generate only local traffic. This is only true as long as the line does not get ejected from the NC, and the line does not get accessed by some other processor from another station. Note that it is possible for the compiler and operating system to arrange data and threads to increase this locality. We have not modified the Splash2 programs in any way to take advantage of this, so the results presented in this section are conservative.

In order to do the size study, we would also like to have two boundary points for comparison: a zero-sized and an infinite-sized NC cache. The zero-sized NC cache is intended to represent a system that has only local bus-snooping4. The infinite size is used to study the effect of removing all capacity and conflict misses, which also indicates where increased associativity might be of help (with conflict misses). The implementation of the infinite NC is simple, since all that is necessary is for the cache tag lookup function in the NC to use all of the address bits instead of masking off the upper bits. The zero-sized NC is more difficult. To truly model a system with no network caching at all would involve modifying the coherence protocol, because the NC plays an integral role. To avoid this, we find a minimal set of changes which allows the NC to stay in the simulation, but lets it mimic the behaviour of a snooping-only system in most cases. We decided to keep the feature in the NC whereby a remote read is negatively acknowledged if a previous request for the same line is still waiting for a response (the combining case). This functionality would not normally be available in a snooping system, but the extra hardware required in the NC to perform the same functionality is negligible (one register for each of the four possible outstanding processor requests), and to change this in the simulator would be difficult.

The second change necessary to model a zero-sized cache is to use an infinite directory. This may seem counterintuitive, but we do this to keep a full history of all cache lines that have been brought onto the station, so we can make decisions on whether the line may be available for snooping or not. We must force all received writebacks to continue on to the home memory, since there is no NC cache in which to store them. Another change has to do with shared lines in a global valid (GV) state. Such lines may or may not be in any local processor, since they can be ejected without notification to the NC. In an aggressive bus-snooping scheme, shared read requests can be satisfied by local shared in-cache copies from some other processor5. Thus we allow the read to a shared line to succeed if the line is GV and the line is present in at least one processor's cache on the station.

4. A system with only local snooping would probably use page replication and migration to enhance locality. Our purpose in this section is only to do a self-comparison, not to compare our system against any others.

We determine the presence of the line by using a function that physically snoops into the on-station processor caches, but does so in zero time. If the line is not present, we send the request to the home memory and change the state in the NC directory to the not-in state (NS). Lines in the local valid (LV) state should ideally also be snoop hits under our aggressive policy, but this proves too complicated. A line can only be in the LV state if a dirty copy is written back to the NC, or another processor on the station does a shared intervention, so that in our normal coherence scheme both processors and the NC would have shared copies, with the NC being the owner of the line. But the writeback cannot be kept in the NC for the same reason as above. For the local intervention case, the difficulty arises when deciding on ownership of the line after the intervention has finished. The line must be owned by some object on the station, since it is out-of-date with respect to the home memory, but there is no clear candidate. It turns out that if we try naively to implement our cache-snoop test on the LV line to determine if there is really a copy in one of the processors, then remote interventions can cause a race condition that leads to a coherence error. Our solution is to allow the NC to hit on LV lines, regardless of whether this would be legal under the snooping scheme or not. This gives the no-NC model a slight advantage, but after checking through the results we concluded that this was a minor effect6. The final change we made for the no-NC case is to change the directory lookup times to one cycle, to model a zero-overhead bus-snoop. (The NC still sits across the bus from the processors, so bus access time is modelled.)

The NC test sizes are chosen based on the total amount of processor cache which the NC is backing up. In our case, four processors with 64 KB of L2 cache each means that 256 KB of cache is necessary just to back up L2. It is still possible for a cache smaller than this size to have some effect. Firstly, a significant proportion of the lines in the L2 caches are for local or private memory, so the total amount of remote memory cached is less than 256 KB. Secondly,

5. This is normally not done because multiple processors may have a copy, so the bus needs some arbitration method to find out which one gets to respond.

6. The greatest percentage of such hits to LV for the no-NC runs is for BARNES at 8 processors. The total number of hits to LV lines (including legal and illegal) is just over 1% of the total requests to the NC. The number of lines that managed to stay in the LV state was greatly reduced in the no-NC case by our policy of forwarding all writebacks to the home memory.


FIGURE 5.8: Effect of Network Cache size on performance. For a given number of processors, the execution time is normalized to that of our baseline case with an 8192 KB cache.

shared lines will have copies in multiple L2 caches, thus the NC cache storage used is less than the direct sum of the L2 set sizes. And finally, as long as the NC size is greater than 64 KB, a line may be kept in the NC which was ejected due to a capacity conflict at the L2 level. We choose as our smallest size 128 KB, and work up in multiples of four until we reach 2048 KB. These, along with the zero-sized and infinite NC cache results, are normalized to our baseline case of an NC with 8192 KB, which is the size of the NC in the prototype.

Results are shown in Figure 5.8. For FFT and Ocean, the size of the NC cache does not matter, but it does not make performance any better or worse than having no NC at all. For Radix and Barnes there is a marked improvement as the NC size is increased beyond the minimum. For Barnes a 512 KB cache is enough to provide nearly the same performance as if the cache were infinite. This is due to capacity requirements, since the output files show that the total footprint of data in the NC is around 700 KB. For Radix, the same footprint is 4.4 MB (per NC), which explains why it takes the full 8 MB default NC cache to obtain effective infinite-cache performance. The conclusion is that for certain programs the NC can definitely improve performance by helping with L2 capacity problems. In the next section we examine whether there is a conflict-miss problem by increasing the NC associativity.

5.5.2 Network Cache Associativity

We will use Radix for this comparison, because we have seen in the previous section that it exhibits the largest response to changes in the NC size, and with such a large per-NC footprint it also has the best chance of showing conflict problems. Adding associativity to the NC is not straightforward. The problem is that normally a least-recently used (LRU) way-selection algorithm is used for replacements in an associative cache. The reason LRU works is that good temporal locality normally means that old cache lines contain the least valuable information, so it is safe to eject them. In the case of the NC, the 'information' contained in a cache line is not only the data, but the coherence directory information as well. Since the NC is supposed to reduce both cache misses and coherence overhead, we must broaden our definition of information content. Our algorithm is to assign priorities to certain cache line states that contain the most 'expensive' information from a coherence standpoint. Our choices, from highest (we do not want to eject) to lowest (we can afford to toss out) are:

• Locked Any-State - Actually we cannot throw these out; if all ways are locked, the algorithm must NACK the request.

• Local Valid - As described in previous sections, this state is expensive to reach, and indicates recent local sharing, so we should keep it around.

• Local Invalid - Almost as useful as LV, but either there is no other processor that wants to share the line, or there has not been enough time for the sharing to occur.

• Global Valid - A globally shared copy, so it should be kept around if no other dirty states need the space.

• Notin State - A copy was requested in the near past, but NACKed. The only information here is that some processor may retry a request soon.

Global Invalid - This cootains almost no information at dl, ïndicating only that no copies exist locally or have been requested in the aear past. This is the prime candidate for ejec- tion. This algorithm is simple to realize in the simulator, but would be very expensive in hard- ware, so it must work very well to be justifiable. Suppocting too many ways of associativity is also costly in wrms of hardware, since each way needs comparison logic to be able to run in parailel. (The Secoding could be dont serially, but this destroys the performance benefit of associativity.) For our experiment we use a 5 12 KB NC size, because fiom the last section we know that this size does not tap al1 of the potential of the K.We mode1 2-, 4- and 8-way asso- ciative NCs. A 4-way associative NC provides mom for separate mquests hmeach of the four local processors. More than 8-way associativity is unlikely to gain us much, and would be overly expensive to irnplement. In Figure 5.9 we see the results for Radix on 8 processors. (We choose 8 because from Chapter 4 we know that an 8-processor system has the highest hit rates.) The improvement in performance goes no higher than 5%. and the hit rates irnprove slightly up to 4-way associa- tivity. This is not enough of a performance gain to justify adding the complexity of associativ- ity to the NC.
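A compact sketch of the way-selection policy described above. The state encoding and tie-breaking rule are illustrative assumptions; the priorities follow the ordering given in the list.

    /* Illustrative sketch of the NC set-replacement policy: evict the way whose
     * coherence state carries the least 'information', never evict a locked way. */
    enum nc_state { GLOBAL_INVALID, NOTIN, GLOBAL_VALID,
                    LOCAL_INVALID, LOCAL_VALID, LOCKED };

    #define NC_WAYS 4

    /* Returns the way to replace, or -1 if every way is locked
     * (in which case the incoming request must be NACKed). */
    int choose_victim(const enum nc_state way_state[NC_WAYS])
    {
        int victim = -1;
        int w;
        for (w = 0; w < NC_WAYS; w++) {
            if (way_state[w] == LOCKED)
                continue;                               /* cannot be thrown out          */
            if (victim < 0 || way_state[w] < way_state[victim])
                victim = w;                             /* lower enum value = cheaper to lose */
        }
        return victim;
    }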

In this chapter we have considered ways of increasing NUMAchine's performance. As a precursor to these investigations we showed that a first-hit page placement policy provides better performance than a round-robin policy. We then used this first-hit policy as the default for the rest of the results.

We first considered the coherence protocol overhead. We modelled a system without coherence overhead, where write accesses were allowed to succeed as long as the data was available, irrespective of its coherence state. We found that the difference in performance between this ideal version and our implemented protocol was minimal for all programs except Ocean, which had 180% and 400% improvements at 32 and 64 processors, respectively. We concluded from this that the coherence scheme is efficient and performs well in most cases, and that the coherence operations with the highest overheads occurred infrequently. More work needs to be done to determine whether Ocean's behaviour is intrinsic, or is due to some artifactual effects and could be fixed by tuning the code.

We considered briefly the relaxation of the sequential consistency model. We only looked at changes to the hardware, and did not consider changing the programming model. Easing the constraints placed upon the hardware implementation by sequential consistency had almost no effect on performance.

In Chapter 4 we saw that the Central Ring caused congestion in the system. We investigated the effect of speeding up the Central Ring. We found that the overall performance could be increased by 11% with an increase of the Central Ring frequency to 100 MHz, after which point the performance did not change. Investigation of the maximum and average queue depths indicated that the reason for the plateau in performance was that the bottleneck had moved from the Central Ring's injection queues to its extraction queues. We proposed that future studies are needed to investigate a full balancing of the station bus, Central and Local Rings to find the 'sweet-spot' that maximizes performance.

With regards to the Network Cache, we looked at the effects of its size and associativity on system performance. We concluded that while some programs are insensitive to NC size, others can show large performance improvements with an NC of even modest size. For an NC supporting P processors with caches of size C, an NC size of around 2PC displayed performance that approached the performance of a system with an NC of infinite size. We also considered adding associativity to the NC. To do this we introduced a novel set-replacement algorithm based on the notion of the 'importance of information' represented by various coherence directory states. By reducing the likelihood of a cache conflict, increasing the associativity reduced the probability of finding a line locked by some other access. Our results showed that performance increased by at most 5%, which is not enough to justify the extra cost of adding associativity.

CHAPTER 6 Conclusion

6.1 Summary

The goal of this work has been to describe the design, and analyse the performance, of NUMAchine, a distributed shared memory (DSM) parallel processor. NUMAchine is intended as a proof-of-concept machine, showing that a DSM can be implemented at low cost, while maintaining scalability up to a few hundred processors. It also serves a second role as a research platform for investigations into all aspects of multiprocessing; from the high end of the spectrum, encompassing compilers and operating systems, down to the lowest levels, such as hardware cache coherence protocols and system-area networks.

The hardware component of the NUMAchine project has taken over five years from initial high-level design studies to the completed, functional 48-processor prototype. During this time, the author was responsible for all aspects of the prototyping process, from architectural design and simulation studies, to board manufacture and debugging. The Mintsim architectural simulator, and the simulation studies which make use of it for pre- and post-prototype analysis, are both contributions solely of the author.

We gained a wealth of experience from designing and building our own multiprocessor. Perhaps the most important lesson learned is that to be a good architect one needs to have experience with implementing hardware. High level architectural studies are very useful, but they are also a long way from real hardware. Ideas that seem simple at the top level can become a nightmare when it comes to implementation. Our packet re-assembly scheme, for example, was predicated on our choice of a slotted-ring protocol. One reason for choosing slotted rings was to simplify injection of packets onto the ring, the trade-off being added complexity on the receiving (extracting) end. It turned out there were numerous details that made the packet re-assembly controllers the most complicated FPDs to design and hardest to debug. This added many cycles of latency which we did not take into account in our initial simulation studies, since we thought that this part of the system would be 'easy'.

Our overall experience with design using FPDs was very positive. They allowed us to reach our performance goals, but also greatly enhanced the ability to quickly debug the system. We designed each FPD with debug pins which were pulled out to headers suitable for connection to a logic analyzer. By reprogramming the devices in-system, while the machine's power was kept on, we could quickly re-run tests using different trace information for signals internal to the FPD. This enabled us to zero in on bugs in a matter of hours instead of days. Once the bug was found, we could have the fix installed in minutes. Without this facility, we would have had to spin many more revisions of the cards, which would have been impossible with our limited budget. (Our first card, the processor card, needed three spins, while all the other cards needed only two.)

On the other hand, FPDs do have drawbacks, two of the most significant being speed and logic density. Our targeted system-wide design speed of 50 MHz was extremely aggressive given FPD performance in the 1996-1997 time frame in which the chips were purchased. By necessity we developed a very high level of expertise in tuning FPGA and CPLD designs to achieve the desired clock rate. As for logic density, we were not able to fit our most complex controllers, such as the cache coherence directory controllers, into even the largest FPDs available at the time.
The extreme case was the controller for the Network Cache (NC) coherence directory, which we managed to fit into four of the largest CPLDs after partitioning the design. This added complexity to both the design and verification. Post-fabrication issues arose when we needed to modify logic on FPDs with internal resource usage over 60%. Because the chips were soldered to the boards, pin placements were fixed. We found that changing logic with these fixed pins made it impossible in many cases for the fitting algorithms in the FPD CAD tools to find a logic assignment that would meet our timing constraints. In these cases we had to either find another way of fixing the logic, or, in the worst case, had to resort to hand-placement of internal logic resources.

Up until a few years ago, a project of this scale with a team this size would have been impossible. Sophisticated CAD (computer-aided design) tools1 were crucial in allowing us to design and verify our boards all within a unified framework. The verification stage involved

1. We used the Cadence Logic Workbench tool suite made available to the University of Toronto by a special university licensing agreement between Cadence and the Canadian Microelectronics Corporation (CMC).

running full board-level simulations using hardware-description language (HDL) models. The measure of success for this approach is that the first boards (processor cards) we received back from fabrication were booting up and running code within a week.

6.1.1 Architectural Simulator

The main tool used for architectural validation and analysis, and a major contribution of the thesis, is the Mintsim simulator. Mintsim allows detailed cycle-accurate system-level simulation of NUMAchine's architecture. Most simulators used for comparable architectural studies model at a high level, or else they run slowly and can only use 'toy' applications or trace files, which have unrealistic behaviours or timing. We also have the advantage that our simulated architecture exists as real hardware, allowing us to verify the simulation model, lending credence to the results.

Mintsim is highly flexible, allowing almost any architectural parameter to be varied, and the effect on performance to be easily measured. It is execution-driven, meaning that it uses real parallel applications, presenting the abstraction of 'virtual hardware'. This allows for very accurate modelling of timing and ordering, which is particularly important in the analysis of coherence protocols and networks. Although the simulator is very detailed, it is also fast. Typical slowdowns on the order of 1000 times allow programs with hardware run times of up to one hour to be simulated in a day.

We also described a strategy for increasing confidence in the simulation results. During initial simulator development, synthetic benchmarks with predictable results are used to check simulator output. After the hardware is available, the same measurements performed with the simulator can be duplicated in the hardware.

6.1.2 Architectural Results

We presented results in Chapter 4 which indicate that NUMAchine's ring-based network provides good levels of performance in the role of a DSM system interconnect. The bisection bandwidth of rings is not scalable, and our claims are only valid up to our chosen 64-processor limit. However, we believe that it will be many years before there is much of a market for systems larger than 256 processors, which we consider to be the upper limit for the category of medium-scale multiprocessors. Indeed, the current 'sweet-spot' for commercial multiprocessing is dual- or quad-processor bus-based boxes, containing Intel or Alpha processors. These are becoming commodity items which are available in a typical user's desktop machine, or as a small-scale server. The ability to scale up to huge numbers of processors is not an important criterion for commercial multiprocessors. In this regard, NUMAchine's hierarchy of rings, with its simple point-to-point interconnection topology, allows for scalability with low incremental cost. In addition, the simplicity of rings allows them to be run at very high speeds, which both increases bandwidth and reduces latency.

One aspect of NUMAchine's rings which we did not consider is reliability and fault tolerance. These are crucial given the current market trend towards raising the level of reliability of multiprocessors to that of mainframes. While approaches such as dual rings can address these issues, their cost and performance impact were not considered here.

A significant reason for NUMAchine's good performance is a novel hardware cache coherence protocol that makes use of the ring topology and its ordering properties to achieve low levels of coherence overhead. This is critical in keeping performance levels high under the shared-memory paradigm. The coherence protocol uses the writeback/invalidate scheme as its basic structure, upon which it builds a dual-level coherence directory which matches the ring hierarchy. No explicit acknowledgments are necessary for invalidations in the protocol, since the rings maintain relative ordering of broadcasts. This ordering property also provides for a natural implementation of a sequentially consistent memory model. Sequential consistency provides the simplest and most intuitive model of memory from a programmer's perspective, which makes NUMAchine very programmer-friendly. In anticipation of the argument that sequential consistency has inherently lower performance than weaker consistency models, we point to recent results in the literature indicating that the combination of sequential consistency and modern microprocessor architectures can achieve the same levels of performance as weaker models [Gniady 1999].

In Chapter 5 results were presented which showed that a network-level cache can help alleviate the high latency for remote requests in a NUMA architecture. For certain types of programs the Network Cache was shown to contribute to a significant improvement in execution times, while for the remainder of the programs the NC was performance-neutral. The size of the NC plays a role in its performance, and we found that a size about a factor of two larger than the sum of the processors' caches worked well. Sizes beyond this did not do much to increase performance. In the same chapter we presented a novel set-replacement algorithm used to implement associativity in the Network Cache, based on a ranking of the information content of various coherence states.
We showed that increasing the associativity to 2-way or 4-way increased performance by no more than 5%, and concluded that the added complexity of supporting associativity was not justified.

We expect that the NUMAchine prototype's performance still has considerable room for improvement. As mentioned in Chapter 4, the backoff scheme can be improved upon. Other enhancements to the NC and coherence protocol, such as dynamic protocol selection or the use of multicasting to push data instead of our default pull model, are part of ongoing research. Finally, all of the results presented herein use unmodified Splash2 programs. NUMAchine is a clustered architecture due to its bus-based stations. Three obvious ways of tuning the programs to increase performance would be:

• Allocate data structures and threads to take advantage of low-latency intra-cluster (on-station) access times.

• Make the OS and compilers more NC-aware, in order to make more efficient and intelligent use of the NC cache resource.

• Add support in the operating system for page migration and replication to provide a complementary mechanism to reduce remote latency for access patterns which the NC cannot handle.

NUMAchine is a working prototype. Figure 6.1 shows a partially-assembled prototype with 24 processors. With the addition of the final ring, the full 48-processor system will be operational. We are able to boot up the operating system, and run parallel applications natively on NUMAchine. Now that the hardware is finished, the next phase of research will focus on the OS, support for parallel file systems, and architecture-aware, automatically parallelizing compilers.

6.2 Future Work

The field of parallel systems and programming is roughly 30 years old, but is still in a fairly immature state of development. In the following paragraphs we suggest some ways in which the results of this dissertation could be extended, and possibilities for other related work.

FIGURE 6.1: NUMAchine with 24 processors. The two tall vertical racks contain six stations with four processors in each. The bottom two stations in each rack are connected together by the shiny ring cables to form a 16-processor Local Ring. Another Local Ring sits at the top of the racks. (The two top stations have yet to be installed.) The third Local Ring sits behind the two shown, and faces the other direction. The single-station mini-NUMAchine shown on the right is used for debugging.

In Chapter 4 we found that the exponential backoff approach to handling cache lines locked by coherence actions suffered from liveness and fairness problems, and could lead to very large latencies. A better scheme was proposed, but not analysed. It would be interesting to find out whether the performance of the newer approach was enough to justify the extra cost and complexity.

Advances in FPD architecture present some intriguing possibilities for design of systems such as NUMAchine. In the next few years programmable devices will be available with mixed FPGA/CPLD structures on the same chip. These devices will contain hundreds of millions of gates, as well as megabits of embedded memory. With such resources it would be possible to build an entire coherence engine (including the SRAM directory) inside a single chip, allowing for significant reductions in latency.

As mentioned in Chapter 5, benchmarks are a particularly weak point. Benchmark suites such as Splash2 help to model the performance of scientific applications. However, the huge boom in commercial use of shared-memory multiprocessors means that a majority of compute cycles are now devoted to three major new classes of workloads: on-line transaction processing (OLTP), decision-support systems (DSS) and Web servers. OLTP typically involves many small read/write accesses to a large database. DSS workloads analyse databases in order to detect business trends, which involves long complex queries accessing the entire database. And Web servers are usually used either for presenting information or searching. All three have very different characteristics, which are quite different from scientific applications. Only recently have researchers started to analyse these workloads in multiprocessor environments [Barroso 1998]. The results indicate that communication-to-computation ratios are much higher for these applications, particularly for OLTP. Sharing patterns are dynamic and non-repetitive, with a comparatively high rate of true sharing, leading to much more time spent servicing coherence misses in the caches. Analyses such as the ones presented in this thesis will need to be done using more recent benchmark suites such as TPC (Transaction Processing Council) [TPC 1999].

Another area which needs more work is the incorporation of mainframe-class Reliability/Availability/Serviceability (RAS) features as a basic requirement in multiprocessor system modelling. RAS is necessary for such machines to have broad commercial appeal, but it affects both performance and cost. Computer architects must include these aspects in their analyses right from the start; they cannot be treated as independent issues to be designed in later or added on.

This work could benefit greatly from recent advances in the state of the art of simulation. When we started, there existed no commonly accepted framework for doing multiprocessor simulation. We estimate that just providing the infrastructure for our architectural simulator to hook up with MINT took roughly six man-months of work. Tools such as SimOS [Rosenblum 1997], while powerful, are not flexible enough to allow for the kind of detailed back-end architectural simulations discussed in this thesis. Were a simulation environment with the flexibility of MINT and the power of SimOS available, we could have investigated the performance of the architecture including the effects of the operating system and multiprogrammed workloads, which was not feasible with the tools we developed.
There are also performance issues regarding cycle-accurate simulation of such large systems. A fitting approach is to parallelize the simulator itself. Research is ongoing in this area, but results as yet are poor. (See for example [Carothers 1999].)

And finally, future work on multiprocessing systems will benefit from research into formal verification. Both the cache coherence protocol and the prototype hardware were verified by simulation. While these proved sufficient for our needs, they were only just barely so. With the increasing complexity of modern microprocessors and advanced optimized coherence protocols, formal verification will provide the only means of designing parallel systems in a reasonable amount of time.

APPENDIX A Notes on the Simulator

The simulator, as described in Chapter 3, consists of the MINT front-end [Veenstra 1993] and the NUMAchine simulator back-end. The version of MINT used was 2.6, with some minor modifications, which for the sake of completeness will be described below. A brief description of the NUMAchine simulator and its parameter files are also given.

The most fundamental change to MINT was switching the basic time variable from double-precision floating-point (double) to signed 64-bit integers (long long) for both the SPARC and SGI platforms. This allowed the same range of simulation times (since a double is stored internally as 64 bits on both machines) but made the simulator considerably faster. Note that compilers which could handle the long long type were not universally available when version 2.6 was first distributed. We used the GCC 2.7.2 compiler to compile both MINT and the NUMAchine simulator.

We also modified MINT to pass along to the back end more information as to the source of a given memory reference. The MINT memory space consists of four sections:

- Stack - the standard program stack, used for local variables and parameter passing,
- Heap - a private per-thread area used for memory obtained through malloc( ),
- Share - a global memory section, accessible by all threads and allocated using the us_malloc( ) call, containing the shared data structures (e.g. arrays and trees) that are the targets for most of the work done by a parallel application,
- Data/BSS - the initialized (Data) and uninitialized (BSS) static data sections that are allocated at program load time.

This allowed the back end to accumulate statistics for each category, which was useful since the reference patterns were different for each type.

In order to do future studies on prefetching and updating, we added the capability to handle both of these types of references at MINT's lowest level. To get a program to generate such references, the source code has to be modified by hand to insert a dummy 'system call' with a virtual address, which MINT recognizes and passes along to the newly added back-end routines: sim_update( ), sim_prefetch_shr( ) and sim_prefetch_exc( ).

A 'feature' of the original MINT 2.6 version was that the results obtained when running the Sun and SGI versions of the code were very slightly different (less than 1%). While not significant, the very fact that the same binary application code running through supposedly identical virtual machines inside MINT on the two architectures did not match was worrisome. After much debugging, it was found that the sections of memory containing the environment variables passed to the simulator were of different sizes on the two platforms. This caused the alignment of MINT's internal virtual memory space for the parallel application to differ. The resulting slight change in cache access and page fault patterns gave rise to the anomaly. The MINT code was modified so that the starting point for its virtual memory space was aligned to a large page size. After this change the results on the two platforms were identical.

There were also a few miscellaneous additions and fixes to the following files:

- exec.c - made one macro the exact inverse of its counterpart macro; the previous behaviour was probably a bug,
- icode.h - fixed a bug in the address space decoding logic which would incorrectly flag addresses as 'invalid',
- subst.c - added the routines mint_setenv( ) and mint_putenv( ), since as mentioned above the default environment variable handling caused subtle anomalies on different platforms. Also added were the previously unsupported operating system calls shmemalign( ) and getpagesize( ).

Some of these changes may also be present in later versions of MINT. However, later versions did not add any functionality we found necessary for our work, so we stuck with version 2.6.
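To give a flavour of the hand annotation for prefetching described above, the fragment below sketches how a shared-array kernel might request prefetches. The macro names and system-call numbers are hypothetical placeholders invented for this illustration; only the back-end entry points sim_prefetch_shr( ) and sim_prefetch_exc( ) come from the actual simulator.

    #include <unistd.h>     /* syscall() */

    /* Hypothetical markers: a dummy 'system call' carrying a virtual address,
     * which the modified MINT recognizes and forwards to the new back-end
     * routines sim_prefetch_shr() / sim_prefetch_exc().  The numbers below
     * are placeholders, not MINT's real syscall numbers. */
    #define SIM_PREFETCH_SHR_CALL 9001
    #define SIM_PREFETCH_EXC_CALL 9002

    #define PREFETCH_SHARED(p)    syscall(SIM_PREFETCH_SHR_CALL, (void *)(p))
    #define PREFETCH_EXCLUSIVE(p) syscall(SIM_PREFETCH_EXC_CALL, (void *)(p))

    /* Sum each row of a shared matrix, prefetching the next row (read-shared)
     * one iteration ahead so that the back end can model the prefetch. */
    void sum_rows(double **a, double *row_sum, int n, int m)
    {
        for (int i = 0; i < n; i++) {
            if (i + 1 < n)
                PREFETCH_SHARED(a[i + 1]);
            row_sum[i] = 0.0;
            for (int j = 0; j < m; j++)
                row_sum[i] += a[i][j];
        }
    }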
All of the modified code for MINT is available by contacting the author.

By default, MINT's time scale is arbitrary; there is no assumption of any kind of unit for MINT's internal clock. Moreover, unless told otherwise it assumes that every machine instruction takes one clock cycle to execute. While this is true for the majority of the RISC R4400 instructions, it is not true for all, and in particular arithmetic operations such as multiplication and division take on the order of tens of cycles for both fixed- and floating-point versions. To present a more realistic model, MINT allows the number of cycles per instruction (CPI) for all the assembler opcodes to be specified in a text file using its '-c' option. An example of such an opcodes file is shown, in condensed form, in Listing A.1. The number in the second column is the number of clock cycles for the instruction. Since the fundamental time unit in the simulator is 1 ns, each entry is interpreted as a time in nanoseconds; the raw cycle counts shown thus correspond to a 1000-MHz processor, and the file can be scaled to a processor running at F MHz by multiplying each value by ROUNDUP(1000/F). These timings were not supplied with MINT, but were taken from the R4400 reference manual [Heinrich 1994].

Most opcodes in the file are assigned a single cycle; the entries with larger counts are:

    mult 7        multu 7      div 32        divu 32
    add.s 4       sub.s 4      mul.s 7       div.s 23      sqrt.s 54
    abs.s 2       neg.s 2      round.w.s 4   trunc.w.s 4   ceil.w.s 4   floor.w.s 4
    cvt.d.s 2     cvt.w.s 4
    add.d 4       sub.d 4      sqrt.d 112    abs.d 2       neg.d 2
    trunc.w.d 4   ceil.w.d 4   cvt.s.d 4     cvt.w.d 4
    cvt.s.w 6     cvt.d.w 5
    mfc1 3        mtc1 3       cfc1 2        ctc1 3

LISTING A.1: Sample opcode timing file (condensed excerpt; the remaining opcodes are listed with a count of 1).
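As a worked example of the scaling rule (a small helper written for this discussion, not part of MINT or Mintsim), the following C fragment converts raw cycle counts into the 1-ns units used by the simulator:

    #include <stdio.h>

    /* Scale a raw cycle count to simulator time units (ns) for a processor
     * running at freq_mhz MHz, i.e. multiply by ROUNDUP(1000/F). */
    static int scale_cpi(int cycles, int freq_mhz)
    {
        int ns_per_cycle = (1000 + freq_mhz - 1) / freq_mhz;  /* ROUNDUP(1000/F) */
        return cycles * ns_per_cycle;
    }

    int main(void)
    {
        /* For the 150-MHz R4400s modelled in the prototype, ROUNDUP(1000/150) = 7,
         * so a 1-cycle add costs 7 ns and a 32-cycle div costs 224 ns. */
        printf("add: %d ns  div: %d ns\n", scale_cpi(1, 150), scale_cpi(32, 150));
        return 0;
    }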

A.2 Notes on the NUMAchine Simulator (Mintsim)

Mintsim is described in Chapter 3. Here we provide an example of the simulator command file, to give an idea as to Mintsim's flexibility. While the basic CPU model is fixed by the use of MINT, and features such as the ring-based network and its associated cache coherence protocol are fixed due to software development time constraints, just about any other parameter in the model is variable. A simulation run is controlled by a text parameter file, an example of which is shown in Listing A.2. The values shown are the defaults for a 32-processor implementation of the NUMAchine prototype hardware.

# NOTE: Where options are commented out, the value indicates the default

# Geometry is: Ring0s/Ring1 : Stations/Ring0 : CPUs/Station, so the product is
# the number of CPUs in the system
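# e.g. geometry 2:4:4 -> 2 Ring0s x 4 stations/Ring0 x 4 CPUs/station = 32 CPUs,
#      while 4:4:4 would describe a 64-CPU system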

set sim geometry 2:4:4             # 32-CPU system
# set sim barrier_type ideal       # Types are: 'ideal', 'simple', 'tree' (hardware),
#                                  #   'soft', 'softtree' (software)
# set sim lock_type spin           # Types are: 'spin', 'ideal'
# set sim coherence 1              # Use cache coherence or not
# set sim global_line_size 0       # If Mem and NC use a different line size to L2
# set sim use_nc numa              # Whether NC is used or not, and what type
#                                  #   Others: none, inni (part of Network Int)
# set sim relaxed_consistency 0    # Default is sequential consistency
# set sim page_type roundrobin     # Others: firsthit, fixed
# set sim pagemap_file ??          # For fixed page_type, gives page mappings
# set sim snoop 1                  # If no NC, then use snooping
# set sim perf_pref 1              # Turn on perfect prefetching

#
# Network Cache
# -------------
# set netcache enhanced 0      # Turn on NC enhancements
set netcache size 8192         # In KB
# set netcache assoc 1         # N-way associative
set netcache read_time 200     # in ns
set netcache write_time 200    # in ns
set netcache tag_time 80       # in ns
# set netcache fifo_delay 30   # in ns
set netcache fifo_width 8      # in bytes
set netcache inq_size 256
set netcache outq_size 256
# set netcache inq_ovfl 0.75
# set netcache outq_ovfl 0.75

#
# Ring Interface
# --------------
# set ringint freq 50e6           # In Hz
set ringint width 8               # In bytes
set ringint parallel_rings 1      # For counter-rotating rings choose 2
set ringint use_freed 0           # Use a slot just freed by removal?
set ringint inq_size 256
set ringint inq_ovfl 0.75
set ringint outq_size 256
set ringint outq_ovfl 0.75
set ringint sinkq_size 256
set ringint sinkq_ovfl 0.75
set ringint nsinkq_size 256
set ringint nsinkq_ovfl 0.75
set ringint max_wb_cnt 256
set ringint ttl_tickets 256       # How many outstanding requests in SRAM?
set ringint fifo_wdelay 30

#
# Ring-Ring Interface (see ringint for descriptions)
# --------------------------------------------------
set ringring freq 50e6
set ringring use_freed 0
set ringring upq_size 256
set ringring upq_ovfl 0.75
set ringring downq_size 256
set ringring downq_ovfl 0.75

#
# Memory
# ------
set memory read_time 200      # in ns
set memory write_time 200     # in ns
set memory tag_time 80        # in ns
# set memory fifo_delay 30    # in ns
set memory fifo_width 8       # in bytes
# set memory inq_size 64      # depth, width given above
# set memory outq_size 64
# set memory inq_ovfl 0.75
# set memory outq_ovfl 0.75

#
# Processor
# ---------
# set proc freq 150e6                # in Hz
# set proc splitL2 0                 # use split/unified L2 cache (1/0)
# set proc L1_Icache_size 16         # in Kbytes
# set proc L1_Icache_linesize 32     # in bytes
# set proc L1_Icache_assoc 1
# set proc L1_Dcache_size 16
# set proc L1_Dcache_linesize 32
# set proc L1_Dcache_assoc 1
# set proc L2_Icache_size 1024
# set proc L2_Icache_linesize 64
# set proc L2_Icache_assoc 1
set proc L2_Dcache_size 1024
set proc L2_Dcache_linesize 128
set proc L2_Dcache_assoc 1

#
# External Agent
# --------------
# set extagent si_freq 75e6
# set extagent max_retry 64     # How many retries per request before error?
set extagent fifo_width 8
# set extagent fifo_delay 30
# set extagent inq_size 64
# set extagent outq_size 64
# set extagent inq_ovfl 0.75
# set extagent outq_ovfl 0.75

#
# Bus
# ---
# set bus freq 50e6
set bus width 8               # in bytes
set bus arb_latency 4         # in bus clocks (default 0)

#
# Debugging
# ---------
# On=1/Off=0. Can set individually by class, or for class 'sim', which turns
# on all classes. (Beware, the latter is A LOT of trace info.)
set memory trace 1
set bus trace 1
set netcache trace 1
set ringint trace 1
set ringring trace 1
set cache trace 1
set extagent trace 1
set proc trace 1
set sim trace 1               # This is equivalent to the -t flag

# This is the most useful debugging feature. Trace all usage across the
# simulator to a particular cache block. Address will be rounded to the
# cache block size.

set sim trace_addr 0x12345678

#
# Running the simulator
# ---------------------
# Build instantiates objects and connects them. Run starts the simulation.
# Set the type of reporting before doing run.

# report          # Do 'fullreport' for a per-instance report.
#                 # The default 'report' is a summary, giving
#                 # averages and standard deviations
build
run

LISTING A.2: Default simulator parameter file for a 32-processor system.
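To illustrate how the same file drives other configurations in the design-space studies, the fragment below (a hypothetical variation, not one of the distributed defaults) overrides a few of the parameters from Listing A.2 to model a larger and faster system:

    set sim geometry 4:4:4           # 4 Ring0s x 4 stations x 4 CPUs = 64 CPUs
    set netcache size 16384          # double the network cache, in KB
    set proc freq 200e6              # 200 MHz processors
    set proc L2_Dcache_size 2048     # larger secondary data cache, in Kbytes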

The output from Mintsim is also a text file, containing reports from each of the simulator objects (e.g. busses, caches, etc.) indicating usage, latencies, statistics, etc. The output can be generated in either a detailed per-object form, or a summary form. The summary report shows averages over all of the particular objects in a class. The simulator output file is too large to be shown here.

References

Abandah, G. A., and E. S. Davidson. 1998. "Characterizing Distributed Shared Memory Performance: A Case Study of the Convex SPP1000". IEEE Trans. on Parallel and Distributed Systems Vol. 9, No. 2, February. Pages 206-216.

Adve, S. V., V. S. Adve, M. D. Hill and M. K. Vernon. 1991. "Comparison of Hardware and Software Coherence Schemes". Proc. 18th Annual Int'l Symp. on Computer Architecture (ISCA'91). Pages 298-308.

Adve, S. V., and K. Gharachorloo. 1996. "Shared Memory Consistency Models: A Tutorial". IEEE Computer Vol. 29, No. 12, December. Pages 66-76.

Agarwal, A., R. Simoni, J. Hennessy and M. Horowitz. 1988. "An Evaluation of Directory Schemes for Cache Coherence". Proc. 15th Annual Int'l Symp. on Computer Architecture (ISCA'88). Pages 353-362.

Agerwala, T., J. L. Martin, J. H. Mirza, D. C. Sadler, D. M. Dias and M. Snir. 1995. "SP2 System Architecture". IBM Systems Journal Vol. 34, No. 2. Pages 152-162.

Aichinger, B. P. 1992. "Futurebus+ as an I/O Bus: Profile B". Proc. 19th Annual Int'l Symp. on Computer Architecture (ISCA'92). Pages 300-307.

Amdahl, G. M. 1967. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities". AFIPS 1967 Spring Joint Computer Conference. Vol. 30. Pages 483-485.

Anderson, T. E., D. E. Culler and D. A. Patterson. 1995. "A Case for NOW (Network of Workstations)". IEEE Micro Vol. 15, No. 1, February. Pages 54-64.

Archibald, J., and J.-L. Baer. 1986. "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation". ACM Trans. Comput. Syst. Vol. 4, No. 4, November. Pages 273-298.

Archibald, J. 1988. "A Cache Coherence Approach for Large Multiprocessor Systems". Proc. of the Int'l Conference on Supercomputing. Pages 337-345.

Baldwin, R. 1993. "A Cache Coherence Scheme Suitable for Processors". Proc. of the Int'l Conference on Supercomputing '93. Pages 730-739.

Barroso, L. A., K. Gharachorloo and E. Bugnion. 1998. "Memory System Characterization of Commercial Workloads". Proc. 25th Int'l Symp. on Computer Architecture (ISCA'98).

Becker, D. J., T. Sterling, D. Savarese, J. E. Dorband, U. A. Ranawake and C. V. Packer. 1995. "Beowulf: A Parallel Workstation for Scientific Computation". Proc. Int'l Conference on Parallel Processing (ICPP'95).

Bilardi, G., K. T. Herley, A. Pietracaprina, G. Pucci and P. Spirakis. 1996. "BSP vs LogP". Proc. 8th Annual ACM Symp. on Parallel Algorithms and Architectures. Pages 25-32.

Blumrich, M. A., R. D. Alpert, Y. Chen, D. W. Clark, S. N. Damianakis, C. Dubnicki, E. W. Felten, L. Iftode, K. Li, M. Martonosi and R. A. Shillner. 1998. "Design Choices in the SHRIMP System: An Empirical Study". Proc. 25th Annual Int'l Symp. on Computer Architecture (ISCA'98).

Boden, N. J., D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic and W.-K. Su. 1995. "Myrinet: A Gigabit-per-Second Local Area Network". IEEE Micro Vol. 15, No. 1, February. Pages 29-36.

Bokhari, S. H., and D. J. Mavriplis. 1998. The Tera Multithreaded Architecture and Unstructured Meshes. NASA/CR-1998-208953, December.

Brown, S., N. Manjikian, Z. Vranesic, S. Caranci, R. Grindley, M. Gusat, G. Lemieux, K. Loveless, Z. Zilic and S. Srbljic. 1996. "Experience in Designing a Large-Scale Multiprocessor using Field-Programmable Devices and Advanced CAD Tools". Proc. of the 33rd IEEE Design Automation Conference (DAC'96). Pages 427-432.

Burd, T. General Microprocessor Info.

Carothers, C. D., K. S. Perumalla and R. M. Fujimoto. 1999. "Efficient Optimistic Parallel Simulations using Reverse Computation". Proc. 13th Workshop on Parallel and Distributed Simulation.

Cox, A. L., and W. Zwaenepoel. 1992. "Lazy Release Consistency for Software Distributed Shared Memory". Proc. 19th Annual Int'l Symp. on Computer Architecture (ISCA'92). Pages 13-21.

Chaiken, D., J. Kubiatowicz and A. Agarwal. 1991. "LimitLESS Directories: A Scalable Cache Coherence Scheme". Proc. 4th Int'l Conf. on Arch. Support for Prog. Languages and Operating Systems (ASPLOS-IV). Pages 224-234.

Charlesworth, A. 1998. "Starfire: Extending the SMP Envelope". IEEE Micro Vol. 18, No. 1, January/February.

Choi, L., and P.-C. Yew. 1996. "Compiler and Hardware Support for Cache Coherence in Large-Scale Multiprocessors: Design Considerations and Performance Study". Proc. 23rd Annual Int'l Symp. on Computer Architecture (ISCA'96). Pages 283-294.

Coplien, J. O. 1992. Advanced C++ Programming Styles and Idioms. Addison-Wesley, Reading, MA.

Cormen, T. H., C. E. Leiserson and R. L. Rivest. 1990. Introduction to Algorithms. McGraw-Hill, New York, NY.

Culler, D. E., R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian and T. von Eicken. 1993. "LogP: Towards a Realistic Model of Parallel Computation". Proc. of the 4th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming. Pages 1-12.

Culler, D. E., J. P. Singh and A. Gupta. 1999. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, San Francisco, CA.

Davis, H., S. R. Goldschmidt and J. Hennessy. 1990. Tango: A Multiprocessor Simulation and Tracing System. Tech. report CSL-TR-90-439, Stanford University.

Eggers, S., J. Emer, H. Levy, J. Lo, R. Stamm and D. Tullsen. 1997. "Simultaneous Multithreading: A Platform for Next-Generation Processors". IEEE Micro Vol. 17, No. 5, September/October. Pages 12-19.

Erlichson, A., B. A. Nayfeh, J. P. Singh and K. Olukotun. 1994. Benefits of Clustering in Shared Address Space Multiprocessors: An Applications-Driven Investigation. Tech. report CSL-TR-94-632, Stanford University.

Falsafi, B., and D. A. Wood. 1997. "Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA". Proc. 24th Int'l Symp. on Computer Architecture (ISCA'97). Pages 229-240.

Farkas, K., Z. Vranesic and M. Stumm. 1992. "Cache Consistency in Hierarchical Ring-Based Multiprocessors". Proc. Supercomputing '92, November. Pages 348-357.

Fillo, M., and R. B. Gillett. 1997. "Architecture and Implementation of MEMORY CHANNEL 2". Digital Technical Journal Vol. 9, No. 1. Pages 27-41.

Flynn, M. J. 1972. "Some Computer Organizations and Their Effectiveness". IEEE Transactions on Computers Vol. C-21, No. 9, September. Pages 948-960.

Gamsa, B., O. Krieger, J. Appavoo and M. Stumm. 1999. "Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System". Proc. 3rd Symp. on Operating Systems Design and Implementation (OSDI'99). Pages 87-100.

Geist, A., A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam. 1994. PVM: Parallel Virtual Machine, A User's Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, MA.

Gharachorloo, K. 1995. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. thesis, also available as tech. report CSL-TR-95-685, Stanford University.

Gniady, C., B. Falsafi and T. N. Vijaykumar. 1999. "Is SC+ILP=RC?". Proc. 26th Int'l Symp. on Computer Architecture (ISCA'99). Pages 162-171.

Goodman, J., A. G. Greenberg, N. Madras and P. March. 1988. "Stability of Binary Exponential Backoff". Journal of the ACM Vol. 35, No. 3. Pages 579-602.

Goodman, J. 1991. Cache Consistency and Sequential Consistency. Tech. report #CS-TR-91-1006, University of Wisconsin-Madison, Comp. Science Dept., February.

Grbic, A. 1996. Hierarchical Directory Controllers in the NUMAchine Multiprocessor. M.A.Sc. thesis, University of Toronto.

Grbic, A., S. Brown, S. Caranci, R. Grindley, M. Gusat, G. Lemieux, K. Loveless, N. Manjikian, S. Srbljic, M. Stumm, Z. Vranesic and Z. Zilic. 1998. "Design and Implementation of the NUMAchine Multiprocessor". Proc. of the 35th IEEE Design Automation Conference (DAC'98), June. Pages 66-69.

Gropp, W., E. Lusk and A. Skjellum. 1994. Using MPI. MIT Press, Cambridge, MA.

Gupta, A., and W.-D. Weber. 1992. "Cache Invalidation Patterns in Shared-Memory Multiprocessors". IEEE Transactions on Computers Vol. 41, No. 7. Pages 794-810.

Hailpern, B. T., and S. S. Owicki. 1980. Verifying Network Protocols Using Temporal Logic. Tech. report CSL-TR-80-192, Stanford University.

Hamacher, V. C., and H. Jiang. 1997. "Performance and Configuration of Hierarchical Ring Networks for Multiprocessors". Proc. Int'l Conference on Parallel Processing (ICPP'97). Pages 257-265.

Hamacher, V. C., Z. G. Vranesic and S. G. Zaky. 1996. Computer Organization. Fourth edition. McGraw-Hill, New York, NY.

Heinlein, J., K. Gharachorloo, S. Dresser and A. Gupta. 1994. "Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor". Proc. 6th Int'l Conf. on Arch. Support for Programming Languages and Operating Systems (ASPLOS'94). Pages 38-50.

Heinrich, J. 1994. MIPS R4000 Microprocessor User's Manual. Second edition. MIPS Technologies, Mountain View, CA.

Hill, M. D. 1998. "Multiprocessors Should Support Simple Memory Consistency Models". IEEE Computer Vol. 31, No. 8, August. Pages 28-34.

Integrated Device Technologies. Specialized Memories & Modules Data Book. IDT, Santa Clara, CA.

Krieger, O., and M. Stumm. 1997. "HFS: A Performance-Oriented Flexible File System Based on Building-Block Compositions". ACM Trans. on Computer Systems Vol. 15, No. 3, August. Pages 286-321.

Kuskin, J., D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum and J. Hennessy. 1994. "The Stanford FLASH Multiprocessor". Proc. of the 21st Int'l Symp. on Computer Architecture (ISCA'94). Pages 302-313.

Lamport, L. 1979. "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs". IEEE Trans. on Computers Vol. C-28, No. 9, September. Pages 690-691.

Laudon, J., and D. Lenoski. 1997. "The SGI Origin: A ccNUMA Highly Scalable Server". Proc. 24th Int'l Symp. on Computer Architecture (ISCA'97). Pages 241-251.

Lemieux, G. 1996. Hardware Performance Monitoring in Multiprocessors. M.A.Sc. thesis, University of Toronto.

Lenoski, D. E. 1992a. The Design and Analysis of DASH: A Scalable Directory-Based Multiprocessor. Ph.D. thesis, Stanford University.

Lenoski, D. E., J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta and J. Hennessy. 1992b. "The DASH Prototype: Implementation and Performance". Proc. 19th Int'l Symp. on Computer Architecture (ISCA'92). Pages 92-103.

Lenoski, D. E., and W.-D. Weber. 1995. Scalable Shared-Memory Multiprocessing. Morgan Kaufmann, San Mateo, CA.

Li, K., and P. Hudak. 1989. "Memory Coherence in Shared Virtual Memory Systems". ACM Trans. on Computer Systems Vol. 7, No. 4. Pages 321-359.

Loveless, K. The Implementation of Flexible Interconnect in the NUMAchine Multiprocessor. M.A.Sc. thesis, University of Toronto.

MIPS Technologies. MIPS R10000 Microprocessor User's Manual. Version 2.0. MIPS Technologies, Mountain View, CA.

Moga, A., and M. Dubois. 1998. "The Effectiveness of SRAM Network Caches in Clustered DSMs". Proc. 4th Int'l Symp. on High-Performance Computer Architecture (HPCA'98).

Moore, G. E. 1975. "Progress in Digital Integrated Electronics". Proc. IEEE Digital Integrated Electronic Device Meeting. Page 11.

Papamarcos, M., and J. Patel. 1984. "A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories". Proc. 11th Annual Int'l Symposium on Computer Architecture (ISCA'84). Pages 348-354.

Park, S. 1996. Computer Assisted Analysis of Multiprocessor Memory System. Tech. report CSL-TR-96-696, Stanford University.

Patterson, D. A., and J. L. Hennessy. 1994. Computer Organization and Design: The Hardware/Software Interface. Second edition. Morgan Kaufmann, San Mateo, CA.

Przybylski, S. A. 1990. Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan Kaufmann, San Mateo, CA.

Ranganathan, P., V. S. Pai and S. V. Adve. 1997. "Using Speculative Retirement and Larger Instruction Windows to Narrow the Performance Gap Between Memory Consistency Models". Proc. 9th Annual ACM Symp. on Parallel Algorithms and Architectures. Pages 199-210.

Ravindran, G., and M. Stumm. 1997. "A Performance Comparison of Ring- and Mesh-Connected Multiprocessor Networks". Proc. 3rd Int'l Symp. on High-Performance Computer Architecture (HPCA'97).

Rosenblum, M., E. Bugnion, S. Devine and S. A. Herrod. 1997. "Using the SimOS Machine Simulator to Study Complex Computer Systems". ACM Trans. on Modeling and Computer Simulation Vol. 7, No. 1, January. Pages 78-103.

Russell, R. M. 1978. "The CRAY-1 Computer System". Comm. of the ACM Vol. 21, No. 1. Pages 63-77.

Sandhu, H. S., B. Gamsa and S. Zhou. 1993. "The Shared Regions Approach to Software Cache Coherence on Multiprocessors". Proc. 4th ACM SIGPLAN Symp. on Principles & Practice of Parallel Programming. Pages 229-238.

Scott, S. L., J. R. Goodman and M. K. Vernon. 1992. "Performance of the SCI Ring". Proc. 19th Annual Int'l Symp. on Computer Architecture (ISCA'92). Pages 403-414.

Semiconductor Industry Association. 1997. The National Technology Roadmap for Semiconductors: Technology Needs. Third edition. SEMATECH.

Simoni, R., and M. Horowitz. 1991. "Modeling the Performance of Limited Pointers Directories for Cache Coherence". Proc. 18th Annual Int'l Symp. on Computer Architecture (ISCA'91). Pages 309-319.

Singh, J. P., W.-D. Weber, and A. Gupta. 1992. "SPLASH: The Stanford Parallel Applications for SHared Memory". Computer Architecture News Vol. 20, No. 1. Pages 5-44.

Srbljic, S., Z. G. Vranesic, M. Stumm and L. Budin. 1997. "Analytical Prediction of Performance for Cache Coherence Protocols". IEEE Trans. on Computers Vol. 46, No. 11, November. Pages 55-73.

Stenström, P., T. Joe and A. Gupta. 1992. "Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures". Proc. 19th Annual Int'l Symp. on Computer Architecture (ISCA'92). Pages 80-91.

Stroustrup, B. 1986. The C++ Programming Language. First edition. Addison-Wesley, Reading, MA.

Stunkel, C. B., D. G. Shea, B. Abali, M. G. Atkins, C. A. Bender, D. G. Grice, P. Hochschild, D. J. Joseph, B. J. Nathanson, R. A. Swetz, R. F. Stucke, M. Tsao and P. R. Varker. 1995. "The SP2 High-Performance Switch". IBM Systems Journal Vol. 34, No. 2. Pages 185-204.

SUN Microsystems. UltraSPARC-IIi User's Manual. Part #806-0087-01. SUN Microelectronics, Palo Alto, CA.

SUN Microsystems. 1997. VIS Instruction Set User's Manual. Part #805-1394-01, July. SUN Microsystems, Mountain View, CA.

Sundaram, C. R. M., and D. Eager. 1995. "Future Applicability of Bus-Based Shared Memory Multiprocessors". Proc. 7th Annual Symp. on Parallel Algorithms and Architectures (SPAA'95). Pages 203-212.

Szymanski, T., and H. S. Hinton. 1995. "Design of a Terabit Free-Space Photonic Backplane for Parallel Computing". Proc. 2nd Int'l Conf. on Massively Parallel Processing Using Optical Interconnections (MPPOI'95).

Torrellas, J., and D. Padua. 1996. "The Illinois Aggressive COMA Multiprocessor Project (I-ACOMA)". Proc. 6th Symp. on the Frontiers of Massively Parallel Computing.

TPC. 1999. Transaction Processing Performance Council.

Valiant, L. G. 1990. "A Bridging Model for Parallel Computation". Communications of the ACM Vol. 33, No. 8. Pages 103-111.

Veenstra, J. E., and R. Fowler. 1993. MINT Tutorial and User Manual. Tech. report 452, Comp. Science Dept., U. Rochester.

Vranesic, Z. G., S. Brown, M. Stumm, S. Caranci, A. Grbic, R. Grindley, M. Gusat, O. Krieger, G. Lemieux, K. Loveless, N. Manjikian, Z. Zilic, T. Abdelrahman, B. Gamsa, P. Pereira, K. Sevcik, A. Elkateeb and S. Srbljic. 1995. The NUMAchine Multiprocessor. Tech. report CSRI-324, Dept. of Comp. Science, University of Toronto.

Weber, W.-D., and A. Gupta. 1989. "Analysis of Cache Invalidation Patterns in Multiprocessors". Proc. 3rd Int'l Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'89). Pages 243-256.

Weber, W.-D., S. Gold, P. Helland, T. Shimizu, T. Wicki and W. Wilcke. 1997. "The Mercury Interconnect Architecture: A Cost-Effective Infrastructure for High-Performance Servers". Proc. 24th Int'l Symp. on Computer Architecture (ISCA'97). Pages 98-107.

Woo, S. C., J. P. Singh and J. L. Hennessy. 1993. The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors. Tech. report CSL-TR-93-593, Stanford University.

Woo, S. C., M. Ohara, E. Torrie, J. P. Singh and A. Gupta. 1995. "The SPLASH-2 Programs: Characterization and Methodological Considerations". Proc. 22nd Int'l Symp. on Computer Architecture (ISCA-22). Pages 24-36.

Yeager, K. C. 1996. "The MIPS R10000 Superscalar Microprocessor". IEEE Micro Vol. 16, No. 2, April.

Zhang, Z., and J. Torrellas. 1997. "Reducing Remote Conflict Misses: NUMA with Remote Cache versus COMA". Proc. 3rd Int'l Symp. on High-Performance Computer Architecture (HPCA'97).