DESIGN OPTIONS FOR SMALL SCALE SHARED MEMORY MULTIPROCESSORS

by

Luiz André Barroso

_____________________

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

December, 1996

Copyright 1996 Luiz André Barroso


to Jacqueline Chame


Acknowledgments

During my stay at USC I have had the privilege of interacting with a number of people who have, in many different and significant ways, helped me along the way. Out of this large group I would like to mention just a few names.

Koray Oner, Krishnan Ramamurthy, Weihua Mao, Barton Sano and Fong Pong have been friends and colleagues in the everyday grind. Koray, Jaeheon Jeong and I spent way too many sleepless nights together building and debugging the RPM multiprocessor. Thanks to their work ethic, talent and self-motivation we were able to get it done. I am also thankful for the support of my thesis committee throughout the years.

Although separated from them by thousands of miles, my family has been very much present all along, and I cannot thank them enough for their love and support. The Nobrega Chame family has been no less loving and supportive. My friends, PC and Beto, have also been in my heart and thoughts despite the distance.

I am indebted to the people at Digital Equipment Western Research Laboratory for offering me a job in a very special place. Thanks in particular to Joel Bartlett, Kourosh Gharachorloo and Marco Annaratone for reminding me that I had a thesis to finish when I was immersed in a lot of other fun stuff.

Jacqueline Chame is the main reason why I have survived it.


Table of Contents

CHAPTER 1: INTRODUCTION
  1.1 Motivations
  1.2 Summary of Research Contributions
  1.3 Prior Related Work and Background
    1.3.1 Multiprocessor Interconnect Architectures
      1.3.1.1 Uniform vs. Non-Uniform Memory Access Architectures
      1.3.1.2 Limits on Bus Performance
      1.3.1.3 Point-to-Point Links
      1.3.1.4 Ring Networks
      1.3.1.5 Crossbar Networks
      1.3.1.6 Other Networks
      1.3.1.7 Cluster-based Architectures
    1.3.2 Cache Coherence Protocols
      1.3.2.1 Snooping
      1.3.2.2 Centralized Directories
      1.3.2.3 Distributed Directories
    1.3.3 Reducing and Tolerating Memory Latencies
      1.3.3.1 Prefetching
      1.3.3.2 Relaxed Consistency Models
      1.3.3.3 Multithreading
      1.3.3.4 Hardware Support for Synchronization
    1.3.4 Performance Evaluation Methodologies

CHAPTER 2: CACHE COHERENCE IN RING BASED MULTIPROCESSORS
  2.1 Ring Architectures
    2.1.1 Token-Passing Ring
    2.1.2 Register Insertion Ring
    2.1.3 Slotted Ring
    2.1.4 Packaging and Electrical Considerations
  2.2 Dividing the Ring into Message Slots
  2.3 Cache Coherence Protocols for a Slotted Ring Multiprocessor
    2.3.1 Centralized Directory Protocols
    2.3.2 Distributed Directory Protocols
    2.3.3 Snooping Protocols
  2.4 Summary

CHAPTER 3: PERFORMANCE EVALUATION METHODOLOGY
  3.1 Trace-driven Simulations
  3.2 A Hybrid Analytical Methodology
    3.2.1 Analytic Models for Ring-based Protocols
  3.3 Program-driven Simulations
  3.4 Benchmarks

CHAPTER 4: PERFORMANCE OF UNIDIRECTIONAL RING MULTIPROCESSORS
  4.1 Snooping vs. Centralized Directory Protocols
  4.2 Distributed Directory Protocols
  4.3 Effect of Cache Block Size

CHAPTER 5: PERFORMANCE OF BIDIRECTIONAL RING MULTIPROCESSORS
  5.1 Bidirectional Rings and Evaluation Assumptions
  5.2 Simulation of Unidirectional and Bidirectional Rings
  5.3 Discussion
  5.4 Summary

CHAPTER 6: PERFORMANCE OF NUMA BUS MULTIPROCESSORS
  6.1 A High-Performance NUMA Bus Architecture
  6.2 A NUMA Bus Snooping Protocol
  6.3 Packet- vs. Circuit-Switched Buses
  6.4 Performance Evaluation of a Packet-Switched NUMA Bus
  6.5 Potential of Software Prefetching
  6.6 Summary

CHAPTER 7: PERFORMANCE OF CROSSBAR MULTIPROCESSORS
  7.1 A NUMA Crossbar-based Multiprocessor Architecture
    7.1.1 Cache Coherence Protocols for Crossbar-connected Multiprocessors
    7.1.2 Simulation Results for Ring, Bus and Crossbar-based Systems
  7.2 Summary

CHAPTER 8: HARDWARE SUPPORT FOR LOCKING OPERATIONS
  8.1 Atomic Operations
  8.2 Test&Set Primitives in Write-Invalidate Protocols
  8.3 Queue On Lock Bit (QOLB)
  8.4 Hardware Support for Locking on Snooping Slotted Rings
  8.5 Performance Impact of Hardware Locking Mechanisms
  8.6 Summary

CHAPTER 9: THE IMPACT OF RELAXED MEMORY CONSISTENCY MODELS
  9.1 Introduction