A Quantitative Performance Evaluation of SCI Memory Hierarchies

Roberto A Hexsel

University of Edinburgh

1994

Abstract

The Scalable Coherent Interface (SCI) is an IEEE standard that defines a hardware platform for scalable shared-memory multiprocessors. SCI consists of three parts. The first is a set of physical interfaces that defines board sizes, wiring and network clock rates. The second is a communication protocol based on unidirectional point-to-point links. The third defines a cache coherence protocol based on a full directory that is distributed amongst the cache and memory modules. The cache controllers keep track of the copies of a given datum by maintaining them in a doubly linked list. SCI can scale up to 65520 nodes.

This dissertation contains a quantitative performance evaluation of an SCI-connected multiprocessor that assesses both the communication and cache coherence subsystems. The simulator is driven by reference streams generated as a by-product of the execution of "real" programs. The workload consists of three programs from the SPLASH suite and three parallel loops.

The simplest topology supported by SCI is the ring. It was found that, for the hardware and software simulated, the largest efficient ring size is between eight and sixteen nodes and that the raw network bandwidth seen by processing elements is limited to about 80Mbytes/s. This is because the network saturates when link traffic reaches 600-700Mbytes/s. These levels of link traffic only occur for two poorly designed programs. The other four programs generate low traffic and their execution speed is limited by neither the interconnect nor the cache coherence protocol. An analytical model of the multiprocessor is used to assess the cost of some frequently occurring cache coherence protocol operations. In order to build large systems, networks more sophisticated than rings must be used. The performance of SCI meshes and cubes is evaluated for systems of up to 64 nodes. As with rings, processor throughput is also limited by link traffic for the same two poorly designed programs. Cubes are 10-15% faster than meshes for programs that generate high levels of network traffic. Otherwise, the differences are negligible. No significant relationship between cache size and network dimensionality was found.

Acknowledgements

This dissertation is one of the products of my living in Scotland. It was a period of much discovery, both technically and personally. On the personal side, I lived there long enough to become well acquainted with British culture and politics. Some of the good memories I will keep are from many hours spent with BBC Radio 4, BBC 1 and 2, Channel 4, The Edinburgh Filmhouse, The Cameo Cinema, The Queen's Hall, The Royal Shakespeare Company. Through these media, I met the Bard and Handel, Inspectors Taggart and Frost, Jeremy Paxman and John Pilger, Marina Warner and Glenys Kinnock, Noam Chomsky and Edward Said, Dennis Skinner and Tony Benn, Prime Minister Question Time and Spitting Image, Peter Greenaway and a wealth of European cinema. Along with many others, these people, their work and the institutions they work for became an important element of my thinking. Being in exile is not easy but it can be an extremely enriching experience. It was for me.

Many people took part, directly or indirectly, in the work that is reported here. Some were related to me in a professional capacity, some helped by just being there. I would like to express my gratitude to them all.

The Coordenadoria de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Ministério da Educação, Brazil, awarded the scholarship that made possible my coming to Britain.

I would like to thank the people at the Department of Computer Science for making easy my life and endeavours as a graduate student. In particular, I am grateful to Angela, Chris (cc), Eleanor, George, Jenny, John (jhb), Murray, Paul, Sam and Todd. I would also like to thank all the people who put up with my simulations hogging the compute servers and/or their workstations.

Dave Gustavson, the chairman of the IEEE-SCI working group, provided useful comments on a paper about SCI rings that evolved into Chapter 4. He kept insisting, thankfully, that SCI = rings is not true!

Nigel Topham was my supervisor and I am indebted to him for the opportunity to work on Computer Architecture. I am also indebted to Nigel for his guidance and support. Had I followed his advice more closely on a few occasions, much time and grief would have been saved. Stuart Anderson was my second supervisor and I thank him for being there during the mid-summer crises.

My parents offered much needed support and encouragement. Without their financial support in the later stages of this work, it would not have been completed.

Table of Contents

1. Introduction 1

2. Shared Memory Multiprocessors 5

2.1 Interconnection Networks ...... 6

2.2 Shared Memory Implementations ...... 10

2.2.1 Cache Memories ...... 10

2.2.2 Multiprocessor Cache Coherency ...... 12

2.2.3 Ring Based Shared-Memory Multiprocessors ...... 15

2.3 The Scalable Coherent Interface ...... 16

2.3.1 SCI Communication Protocol ...... 17

2.3.2 SCI Cache Coherence Protocol ...... 18

2.3.3 Related Work ...... 21

3. The Architecture Simulator 24

3.1 Simulation Methodology ...... 24

3.2 The Simulated Multiprocessor ...... 27

3.2.1 Processors and Memory Hierarchy ...... 27

3.2.2 The Simulation Model of SCI Rings ...... 30

3.3 The Workload ...... 32

3.3.1 SPLASH Programs ...... 33

3.3.2 Parallel Loops ...... 34

3.3.3 Data Set Sizes ...... 36

3.4 Accuracy of the Simulation Results ...... 38


4. The Performance of SCI Rings 40

4.1 Performance Metrics ...... 40

4.2 Node and Ring Design ...... 42

4.2.1 Design Space ...... 42

4.2.2 Characterising the Workload ...... 43

4.2.3 Cache Size and Cache Access Latency ...... 48

4.2.4 Processor Clock Speed ...... 53

4.3 Throughput and Latency ...... 54

4.4 Other Ring-based Systems ...... 59

4.4.1 Comparing DASH and SCI ...... 59

5. A Model of the SCI-connected Multiprocessor 67

5.1 The Analytical Model ...... 67

5.2 Costing Sharing-lists and Conflict Misses ...... 73

6. The Performance of Meshes and Cubes 78

6.1 The Simulated Multiprocessor ...... 78

6.1.1 Routing ...... 79

6.1.2 SCI Switches ...... 80

6.2 SCI Meshes ...... 81

6.2.1 Machine and Cache Size - SPLASH Programs ...... 81

6.2.2 Machine and Cache Size - Parallel Loops ...... 83

6.2.3 Throughput and Latency ...... 85

6.3 SCI Cubes ...... 89

6.3.1 Machine and Cache Size ...... 89

6.3.2 Throughput and Latency ...... 92

6.4 A Comparison of Rings, Meshes and Cubes ...... 95

6.4.1 Throughput and Latency ...... 97

6.4.2 Cache Size and Network Dimensionality ...... 99

7. Conclusion 101

A. Performance Data 114

A.1 SCI Rings ...... 115

A.1.1 chol() - DASH Parameters ...... 115

A.1.2 mp3d() - DASH Parameters ...... 116

A.1.3 water() - DASH Parameters ...... 117

A.1.4 chol() ...... 118

A.1.5 mp3d() ...... 121

A.1.6 water() ...... 124

A.1.7 ge() ...... 127

A.1.8 mmult() ...... 130

A.1.9 paths() ...... 133

A.2 SCI Meshes ...... 136

A.2.1 chol() ...... 136

A.2.2 mp3d() ...... 137

A.2.3 water() ...... 138

A.2.4 ge() ...... 139

A.2.5 mmult() ...... 140

A.2.6 paths() ...... 141

A.3 SCI Cubes ...... 143

A.3.1 chol() ...... 143

A.3.2 mp3d() ...... 143

A.3.3 water() ...... 144

A.3.4 ge() ...... 145

A.3.5 mmult() ...... 146

A.3.6 paths() ...... 147

List of Figures

2.1 Interconnection networks: network size versus cost..... 6

2.2 SCI link interface...... 18

2.3 Sharing-list setup...... 19

2.4 Sharing-list purge sequence...... 20

3.1 Simulation environment ...... 26

3.2 Architecture of the processing nodes ...... 27

3.3 SCI link interface ...... 30

4.1 Execution time breakdown for chol(), mp3d() and water() ...... 45

4.2 Execution time breakdown for ge(), mmult() and paths() ...... 46

4.3 Shared-data read hit ratios for 64, 128, 256 and 512Kbytes coherent caches...... 48

4.4 Execution time as a function of cache size, for chol(), mp3d() and water() ...... 51

4.5 Execution time as a function of cache size, for ge(), mmult() and paths() ...... 52

4.6 Speedup achieved by doubling processor clock frequency, with cache sizes of 64 and 256Kbytes ...... 53

4.7 Throughput per node, coherent cache sizes of 64K and 256Kbytes . 54

4.8 Average round-trip delay, with cache sizes of 64K and 256Kbytes . 55

4.9 Latency versus throughput on 2-, 4-, 8- and 16-node rings. ..... 56

4.10 Traffic per link, for cache sizes of 64K and 256Kbytes ...... 57

4.11 Distribution of packet sizes for 256Kbytes caches ...... 58


4.12 Average packet size for 64K, 128K, 256K and 512Kbytes caches. . . 58

4.13 Speedup plots for chol(), mp3d() and water() ...... 62

4.14 Execution time breakdown for chol(), mp3d() and water() ...... 63

4.15 Normalised execution time breakdown for chol(), mp3d() and water(), for shared-data references ...... 64

4.16 Execution time breakdown for chol(), mp3d() and water() - 100MHz clock ...... 65

4.17 Normalised execution time breakdown for chol(), mp3d() and water(), for shared-data references - 100MHz clock ...... 65

5.1 Distribution of error in the model predictions when compared to the simulation result ...... 73

5.2 Effect of long sharing lists on the performance of paths() with 256K and 512Kbytes caches ...... 74

5.3 Effect of conflict misses on the performance of mmult() with 128K and 512Kbytes caches ...... 75

5.4 Effect of conflict misses on the performance of paths() with 256K and 512Kbytes caches ...... 75

5.5 Effects on performance of conflict misses and sharing-list length for mp3d() on 256Kbytes coherent caches ...... 77

6.1 A four-by-four SCI mesh ...... 79

6.2 Data paths of a two-dimensional SCI switch ...... 80

6.3 Execution time plots for chol(), mp3d() and water() ...... 82

6.4 Coherent cache shared-data read hit ratio plots for chol(), mp3d() and water() ...... 82

6.5 Execution time breakdown for chol(), mp3d() and water() ...... 83

6.6 Execution time plots for ge(), mmult() and paths() ...... 84

6.7 Coherent cache shared-data read hit ratio plots for ge(), mmult() and paths() ...... 85

6.8 Execution time breakdown for ge(), mmult() and paths() ...... 85

6.9 Throughput per node for 256Kbytes caches ...... 86

6.10 Round-trip delays for 256Kbytes caches ...... 86

6.11 Link traffic per dimension, output buffer traffic per dimension and bypass buffer traffic per dimension, for 256Kbytes caches...... 88

6.12 A four-by-four-by-four SCI cube ...... 89

6.13 Execution time plots for chol(), mp3d(), water(), ge(), mmult() and paths() ...... 90

6.14 Coherent cache shared-data read hit ratio plots for chol(), mp3d(), water(), ge(), mmult() and paths() ...... 90

6.15 Execution time breakdown for chol(), mp3d() and water() ...... 91

6.16 Execution time breakdown for ge(), mmult() and paths() ...... 92

6.17 Throughput per node and round-trip delay, for 256Kbytes caches. . 93

6.18 Link traffic per dimension, output buffer traffic and bypass buffer traffic, for 256Kbytes caches ...... 94

6.19 Performance of 4-node and 8-node multiprocessors, with 256Kbytes caches...... 95

6.20 Performance of 16-node and 64-node multiprocessors, with 256Kbytes caches ...... 96

6.21 Processor throughput and round-trip delay for mp3d() and paths(), 256Kbytes caches ...... 97

6.22 Execution time and processor throughput for paths(), processor clock of 100 and 200MHz, 256Kbytes caches ...... 99

6.23 Performance of mp3d() and paths() with 128, 256 and 512Kbytes caches, on 4-, 8-, 16- and 64-node multiprocessors ...... 100

Chapter 1

Introduction

As soon as the newest, biggest and fastest computer is delivered to its users, they become emboldened by their new computational prowess and attempt to solve larger and more complex problems. This in turn creates the need for an even bigger and faster computer, since the problems they are trying to solve grow at a faster rate than computer architects can design new machines. In the last few years, the high-performance computing community has been turning its attention to massively parallel machines, that is, several hundred processors cooperating in the solution of ever larger and more complex problems.

One of the challenges facing multiprocessor designers is the machinery that allows a large number of processors to share data. One school of design advocates as little sharing as possible; all cooperation and synchronisation is achieved by the exchange of messages. The machines based on message passing are called multicomputers since each processing element in these machines contains processors and memory and is a computer in its own right. The other school of design advocates as much sharing as possible. In these multiprocessors, large sections of the address space are shared by all processors. The crucial difference between the two designs is the cost of processor cooperation. In a multicomputer, any communication involves the assembly and dispatch of a message over the communication channel. In a multiprocessor, processors simply write or read memory locations when they have to cooperate. Unfortunately, sharing memory is more complicated than "reading and writing to memory". With current technology, the cooperating processors are likely to be a couple of feet from each other and from the memory modules. The communication between processors and memory is concealed by hardware mechanisms which make it invisible to the programmer and are, in theory, faster than the software mechanisms involved in message-passing systems.


Communication entails transmission delays. While two cooperating processors are accessing a shared variable, a third processor might become interested in that variable as well. If the distances between the three processors are different, it is very likely that they will perceive changes in memory state at different times. Which version of the datum is "the good one" then? The problem is made more complicated by the use of cache memories. A cache is a block of very fast memory that sits near the processor in order to reduce delays in the path between processor and main memory. Each processor keeps in its cache copies of memory words it has referenced in the recent past, in the hope they will be referenced again in the near future. If one of the processors updates the contents of its cache, all other copies of the data must be updated or, at least, invalidated.

In a shared memory multiprocessor, keeping the shared portion of the addressing space in a consistent state is the task of the memory subsystem. This subsystem consists of the memory itself (caches and the slower main memory) and the mechanisms that allow processors to cooperate while keeping memory in a consistent state. These physically-distributed logically-shared memory systems have received much attention from the Computer Architecture community recently. Chapter 2 surveys some of the research in this area.

One of the proposals for implementing shared memory is the Scalable Coherent Interface (SCI), an IEEE standard for providing a coherent memory interface to as many as 65520 processing nodes [IEE92]. SCI consists of three parts. The first defines a set of physical interfaces such as connectors, cabling, board sizes and clock rates. The second part defines a communication subsystem that allows processors to exchange information efficiently and correctly. The communication subsystem is based on unidirectional point-to-point links and data are transferred between processing nodes by packets containing commands and (sometimes) data. The third part defines a scalable cache-coherence protocol. Memory is kept consistent by the invalidation of out-of-date copies. The protocol keeps track of the number and location of the copies of each shared data block by maintaining a linked list of the copies. SCI is described in more detail in Section 2.3. Other efforts at evaluating the performance of SCI as a communication medium are also surveyed in Section 2.3.

This dissertation contains a performance evaluation study of a complete SCI-based multiprocessor where the influence of both interconnect and memory hierarchy are investigated in detail. Such a study is important because it can reveal deficiencies and bottlenecks that might be overlooked when only parts of a complex system are exercised. Other researchers have investigated the performance of

some aspects of SCI and their work has focussed mainly on the communication subsystem. The performance of a shared-memory multiprocessor depends on the communication medium between processors and memory as well as on its implementation of shared-memory. For example, studies on SCI's communication subsystem rely on assumptions about traffic patterns that do not seem to occur in practice.

While every simulation environment is based on certain assumptions about the simulated system, the quality of the results it produces depends crucially on how well the simulator "implements" reality. Thus, a set of architectural parameters that is representative of current designs was selected and a multiprocessor based on these parameters was used to execute scientific programs that are also representative of current practice. The results produced by the simulated system are therefore a fair indication of the performance of an actual system.

The simulation experiments described here are, as far as the author is aware, the first to examine in depth the behaviour of a complete multiprocessor, consisting of processors, a memory hierarchy and an SCI interconnection. The relationship between cache coherency and interconnect is crucial to the performance of a multiprocessor since these two must operate in synergy. In particular, a network must have enough capacity to transmit the cache coherency commands without introducing unreasonable delays and the coherency protocol must generate low network traffic for the most common patterns of data sharing.

The simulator used in the research reported here is driven by reference streams produced by the execution of real programs, rather than from a synthetic workload. The workload consists of three programs from the SPLASH suite [SWG91] - Cholesky, MP3D and WATER - and three parallel loops - Gaussian elimination, matrix multiplication and all-to-all minimum cost paths. The simulation environment, the simulated multiprocessor and the programs used to drive the simulator are described in Chapter 3.

The simplest pattern of interconnecting processing nodes with SCI is in a ring. In this topology, the output interface of a node is connected to the input interface of the next "downstream" node. Chapter 4 contains the results of experiments designed to assess the performance of SCI rings and the relationship between performance, the number of processors in the ring and the design parameters of the memory subsystem. Rings were simulated with one, two, four, eight and sixteen 100MIPS processors. The memory system parameters investigated were cache size, cache access latency and processor clock speed. These experiments provide data on the behaviour of the communication system and cache coherence protocol and expose performance bottlenecks.

Simulation is an inherently slow, albeit accurate, method for evaluating the performance of computing systems. The simulation experiments reported here take a very long time to run; over 100 CPU hours in some cases. A much less time consuming alternative is to use an analytical model of the system under scrutiny. One such model of the behaviour of the SCI-connected multiprocessor is presented in Chapter 5.

Earlier on it was said that there is a need for machines with large numbers of processors. Chapter 4 presents evidence that the number of processors (of the type simulated) that can be efficiently interconnected in a single SCI ring is rather small. In order to increase the number of processors in a system, richer interconnection patterns must be used. Chapter 6 explores the performance of systems with up to 64 processors, interconnected with SCI links in two different topologies, namely the mesh and the cube. The behaviour of the memory subsystem and of the communication networks are investigated in detail. The extensions to the simulation environment presented in Chapter 3, such as routing and switches, are described in Section 6.1.

The findings from Chapters 4, 5 and 6 are summarised in Chapter 7. The Appendix contains tables with the numeric data produced in the experiments.

Chapter 2

Shared Memory Multiprocessors

From the programmer's point of view, an ideal shared memory multiprocessor can be programmed in the same way as multiprogrammed uniprocessors. This means that a shared memory multiprocessor must provide two abstractions. First, it has a large addressing space accessible by all processors. Second, the cost of references to any location in the address space must be roughly the same. Computer architects have to design machines that implement these abstractions. The processors and memory modules might be built on several printed circuit boards yet the whole system has to behave as if implemented on a small silicon monolith, in order to fulfil the second abstraction. With current technology, a possible implementation is to have a processor and a portion of the address space on each module, and to interconnect the modules so that processors have access to all the memory in the system. Section 2.1 presents some of the possibilities for interconnecting the processors and memory. Section 2.2 discusses the implementation of the shared-memory abstraction. Section 2.3 describes in some detail the Scalable Coherent Interface (SCI). SCI defines an interconnect and a shared-memory implementation. Logically-distributed memory systems, or message passing multicomputers, are not discussed here. Good surveys on these systems can be found in [IT89, Sto90, Hwa93].


2.1 Interconnection Networks

There are many ways of interconnecting the components of a shared-memory multiprocessor and this section contains a non-exhaustive survey of such interconnects. The size of the system that can be built with each of the networks depends on inherent characteristics of each network, such as electrical (capacitive loading) or topological (too large a distance between two nodes). The balance between cost and performance as well as the intended application determine the suitability of a given design. Figure 2.1 depicts the design space for interconnection networks. In the figure, network types are roughly ordered by hardware costs and number of nodes they can support.

Figure 2.1: Interconnection networks: network size versus cost.

Two very important characteristics of a network are its bandwidth and latency. Bandwidth is the rate of information transfer. The term can be used to refer to the bandwidth of a link, in which case it is determined by the "width" of the link (number of signal-carrying wires) and the signalling rate (network clock frequency). Network bandwidth also measures the amount of information transfer that a whole network can support per time unit. Network bandwidth depends on link bandwidth and topology. The topology of a network defines what nodes are connected to what nodes, and through how many intermediary nodes. Network latency is a measure of the time it takes for a unit of information to be moved from its source to its destination. Latency also depends on link width, network clock rate and topology. Processor throughput is the rate of data transfer achieved by processors. The throughput is a fraction of the available bandwidth and it is limited by network bandwidth and latency. Packets or messages carry data or command/control information. A packet consists of a number of flits or symbols. A symbol is the smallest unit of information transferred along a link in a single network clock cycle. The packet header normally contains the destination address, source address, command or control information, and sometimes data.
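As a concrete illustration of these definitions, the sketch below computes link bandwidth and a simple static latency estimate in C. The particular numbers are assumptions chosen for the example (a 16-bit link clocked at 500MHz matches the 1 Gbyte/s SCI links described in Section 2.3; the 2-cycle per-hop delay is hypothetical):

    #include <stdio.h>

    int main(void)
    {
        double link_width_bytes = 2.0;     /* 16 signal-carrying wires          */
        double clock_hz = 500e6;           /* network clock frequency (assumed) */
        double link_bw = link_width_bytes * clock_hz;   /* bytes per second     */

        int    hops = 5;                   /* intermediary nodes on the path    */
        double per_hop_s = 2.0 / clock_hz; /* assumed 2-cycle delay per node    */
        double latency = hops * per_hop_s;

        printf("link bandwidth: %.0f Mbytes/s\n", link_bw / 1e6);
        printf("static latency over %d hops: %.0f ns\n", hops, latency * 1e9);
        return 0;
    }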

Bus, tree-of-buses. A bus consists of a set of parallel signal and data lines used by devices to communicate. A bus is normally used in a master-slave fashion, where a master module temporarily assumes control of the bus and slave modules act on its commands. This network has some serious limitations. One is electrical loading, which limits the number of devices that can be connected to a bus. The other limitation arises from the competition by devices for access to the bus. The conflicts must be resolved by an arbiter and that increases the time it takes for an access cycle to complete. Buses are in widespread use as they are an inexpensive yet effective way of connecting small numbers of processors - up to about 32 devices. The VMEbus [VIT90] and the Futurebus+ [IEE91] are two examples of high performance buses. A tree, or hierarchy, of buses can be used to connect a larger number of devices. When an access request is destined to a device on a distant branch, the request is propagated up towards the root and then down towards the destination. This architecture allows for concurrent activities in disjoint branches of the tree. The buses near the root can become bottlenecks if traffic levels are high. Examples of such machines are the Wisconsin Multicube [GW88] and the Data Diffusion Machine [HALH91, Hag92].

Ring, ring-of-rings. The processing nodes can be connected by unidirectional point-to-point links into a ring where the output of one node is connected to input of the next. A ring is topologically and functionally equivalent to a bus but provides higher performance and can connect up to about 64 devices. Since links connect only two nodes, the transmission rates can be much higher than in buses because of the smaller and more controllable electrical loading. The low-level transmission control mechanism between two adjacent nodes can be synchronous or asynchronous. In the former, each datum transmitted is acknowledged by the receiver. In an asynchronous ring, it is assumed that all data transmitted is accepted by the receiver and synchronisation actions between output and input take place only at certain intervals.

In a bus, a centralised arbiter decides when a device can transmit onto the bus. In a ring, the access mechanism determines when a given device can start transmitting a message [Tan89]. There are three access mechanisms. In a 'token ring', a token circulates around the ring and when a node is in possession of the token it transmits its data and then passes the token along. In a 'slotted ring', the ring is logically divided in slots and when a device sees an empty slot, it can insert a message in that slot. Thus, in a slotted ring, there can be several messages in flight at any moment. The third access mechanism is 'register insertion'. The SCI ring uses this mechanism and it is described in detail in Section 2.3 (page 17). Besides the higher transmission rates, the other advantage of slotted and register insertion rings over buses is the possibility of several devices transmitting concurrently. The Express Ring [BD91] is an example of a slotted ring. The Cambridge Ring is a high-speed token-ring local area network [HN88]. As is the case with buses, rings can also be assembled in a hierarchy, in what is called a ring-of-rings. Hector [VSLW91] and the KSR1 [Bur92] are examples of multiprocessors built as a hierarchy of slotted rings.

Multistage interconnection network. A multistage interconnection network (MIN) consists of a sequence of permutation circuits followed by multi-way switches. Machines built with MINs normally have one such network between processors and memory and another between memory and processors in what is called a 'dance hall' architecture. Some interconnection patterns might require multiple round-trips over the network since, depending on the topology, not all source-destination pairs are reachable by just steering the switches appropriately. MINs are thus called indirect networks. Several topologies for the permutation circuits have been proposed and used in high performance vector processors. Examples of permutation circuits are the Delta, Butterfly and Shuffle networks [IT89, Sto90, JG91, Hwa93].

Crossbar switch. A crossbar switch is a grid of buses and at each crossing there is a switch that can connect the vertical and horizontal buses. Processors are attached to, say, the vertical buses and memory modules to the horizontal buses. These networks are expensive for large machines but have very high bandwidth between processors and memories and allow for high levels of concurrency [Hwa93].

k-ary n-cube. The nodes of a k-ary n-cube contain n links to neighbouring nodes, one for each of the n dimensions [Sei85, Dal90]. Along a given dimension, there are k nodes. For example, a binary 3-cube is a cube with 2 nodes on each of three dimensions. Networks with arity k = 2 are called hypercubes. The links can be uni- or bi-directional. With bi-directional links, buses can be used to connect the nodes along each dimension. With uni-directional links, the nodes along each dimension must be connected in rings. In a k-ary n-cube, dimensionality n, ring size k and number of nodes N are related by

N = k^n,  k = N^(1/n),  n = log_k N    (2.1)

If uni-directional links are used, the number of link interfaces, or fanout, is 2n. Network capacity is proportional to the fanout and thus to n. For a given system size and link width, higher dimensional networks are more expensive since they use more link interfaces. However, the maximum distance between two nodes is smaller [JG92]:

d_max = n(k - 1)    (2.2)

since, in the worst case, a packet must visit all nodes in a ring, passing through all links minus one, before switching on to the next dimension. This must be done on all dimensions. The number of "ring hops" is n. If the cost to switch rings is c, the maximum static latency is proportional to

l_max = nk + (c - 2)n.    (2.3)
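The fragment below evaluates equations (2.1)-(2.3) in C for a sample configuration; the values of k, n and the ring-switching cost c are illustrative assumptions:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int k = 4, n = 3;                /* e.g. a 4-ary 3-cube             */
        int c = 4;                       /* assumed cost of switching rings */

        int N     = (int)pow(k, n);      /* number of nodes, eq. (2.1)      */
        int d_max = n * (k - 1);         /* maximum distance, eq. (2.2)     */
        int l_max = n * k + (c - 2) * n; /* max static latency, eq. (2.3)   */

        printf("N=%d fanout=%d d_max=%d l_max=%d\n", N, 2 * n, d_max, l_max);
        return 0;
    }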

An important issue is the routing strategy used to deliver messages in k-ary n-cubes. Since there is more than one possible route from source to destination, routing must be deterministic and deadlock-free. In a store-and-forward network, complete incoming packets are buffered before a routing decision is made. If buffers become full, new packets can either be dropped or sent on alternative routes. In a wormhole routed network, routing decisions are made as soon as the first packet symbol arrives at a node. This means that the first symbol of a packet might reach its destination before the last symbol is transmitted by its source. In either mechanism, there is a possibility of paths being blocked by packets in flight and thus of deadlock. One routing mechanism that is deadlock-free is described in Section 6.1.1 (page 79).
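One deterministic, deadlock-free strategy of the kind described in Section 6.1.1 is dimension-ordered routing: a packet corrects its position one dimension at a time, in a fixed order, so no cycle of dependencies can form across dimensions. The sketch below is a minimal illustration under assumed coordinates, not the routing function of any particular machine:

    enum { K = 4, NDIM = 2 };   /* assumed 4-ary 2-cube */

    /* Return the dimension along which the packet travels next, or -1
       when it has arrived. Dimensions are corrected strictly in order. */
    int next_dimension(const int here[NDIM], const int dest[NDIM])
    {
        for (int d = 0; d < NDIM; d++)
            if (here[d] != dest[d])
                return d;   /* follow the unidirectional ring in dimension d */
        return -1;          /* packet has reached its destination            */
    }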

Fat-tree. In a fat-tree, leaves are normally processing nodes (processor + memory) and internal nodes are normally switches, with nodes connected via bi-directional links. The main characteristic of this network is that bandwidth increases towards the root. This compensates for the higher levels of traffic suffered at nodes far from the leaves. An example of a fat-tree network is that in the Connection Machine CM-5 [Lei85, LAD92]. A fat-tree can also be implemented

as a ring-of-rings, as in the KSR1 [Bur92], or as a hierarchy of buses, as in the Data Diffusion Machine [HALH91, Hag92].

2.2 Shared Memory Implementations

This section discusses some of the proposed or existing implementations of the shared-memory abstraction. The basic problem to be solved by any implementation is to provide all processors with a coherent view of memory. Because memory modules are physically distributed in space, two processors could perceive the update of a given memory word at different instants. This would be a most undesirable feature, since the majority of programs would not exhibit deterministic behaviour, or the programming of such machines would be an exceedingly difficult task. The solution is to ensure that all processors in a system perceive updates at the same instant and thus do not operate on stale data.

2.2.1 Cache Memories

Matters are further complicated by the use of cache memories [HP90, Sto90, Hwa93]. The rate at which DRAMs can service access requests is much lower than the rate at which processors issue requests - in the range of about 2 to 100 times. One solution is to add a small but very fast block of memory between processor and main memory DRAMs. This small and fast memory is called a cache memory since, in principle, it is hidden from the programmer. The success of cache memories depends on the Principles of Locality. Temporal locality means that if a memory word was referenced in the recent past, it is very likely that it will be referenced again in the near future. Spatial locality means that if a memory word was referenced, other words in its neighbourhood are very likely to be referenced as well.

When the processor issues a request for a word, if that word is in the cache (a hit), it is returned with low latency. If the word is not in the cache (a miss), it is fetched from main memory before being returned to the processor. Caches are normally designed to service a read request in one processor clock cycle, and a write request in two or more cycles. The hit ratio is the ratio between the number of hits and the number of processor requests. The closer the hit ratio is to unity, the shorter it takes to run a given program.
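The dependence of execution time on the hit ratio can be made explicit with the usual average-access-time formula; the latencies below are illustrative assumptions, not parameters of the simulated machine:

    /* Average memory access time, in cycles, given a hit ratio and the
       hit and miss latencies. With hit_ratio = 0.95, hit = 1 cycle and
       miss = 30 cycles, the average is 0.95*1 + 0.05*30 = 2.45 cycles:
       even a few misses dominate the average. */
    double avg_access_time(double hit_ratio, double hit_cycles,
                           double miss_cycles)
    {
        return hit_ratio * hit_cycles + (1.0 - hit_ratio) * miss_cycles;
    }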

A cache is divided into a number of frames, with a fixed number of words per frame. Memory is divided into lines that have the same size as cache frames. Before any lines are copied from memory to the cache, all cache frames are invalid. A frame may contain a copy of a memory line, in which case the contents of the frame are said to be valid. A valid frame might be dirty if it has been written to and its contents are different from those of the memory line. If a valid frame is not dirty, it is clean.

Since caches are smaller than physical memory, a mapping of lines to frames must be used. In a fully associative cache, any memory line can be loaded into any frame. The matching between an address issued by the processor and the contents of the cache frames is done by a fully associative search. When a newly referenced line needs to be loaded into a full cache, some policy must be used to decide which line is to be evicted to make space. The policies are usually some form of least recently used or random replacement. In an N-way set associative cache, each set consists of N frames and a mapping function on the address bits relates lines to sets. A 1-way set associative cache is called a direct mapped cache.
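For the common case of power-of-two geometries, the mapping function is just a slice of the address bits. The helper below is a sketch under that assumption; the names and parameters are hypothetical:

    #include <stdint.h>

    /* Split an address into set index and tag for a set associative
       cache with power-of-two sizes. E.g. a 64Kbyte direct-mapped
       cache with 64-byte lines has 1024 sets of one frame each. */
    typedef struct {
        uint64_t sets;        /* number of sets in the cache */
        uint64_t line_bytes;  /* line (and frame) size       */
    } cache_geom;

    uint64_t set_index(cache_geom g, uint64_t addr)
    {
        return (addr / g.line_bytes) % g.sets;  /* which set the line maps to */
    }

    uint64_t line_tag(cache_geom g, uint64_t addr)
    {
        return addr / (g.line_bytes * g.sets);  /* distinguishes lines that
                                                   share a set (conflicts)   */
    }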

There are four types of cache misses. Compulsory misses occur when a line is first referenced and must be brought into the cache (also called 'cold-start misses'). Capacity misses occur when the data set of a program is larger than the cache. Blocks that are flushed for lack of space must be retrieved later on. Conflict misses occur when two lines map to the same cache frame. Coherency misses are caused by actions of other processors, i.e. a shared line is updated by another processor.

A cache can be write-through, that is, writes are passed along directly to the memory. In a write-back cache, the copy of a line is only updated in the cache. When the updated frame must be flushed to make space for a newly referenced line, the contents of the frame are copied back to memory. Another issue related to writes is line allocation. Write-allocate caches allocate a frame on a write. No-write-allocate caches do not allocate frames on writes. Normally, write-through caches do not allocate frames on writes.

Some systems have a write-buffer between cache and memory. It consists of a queue of words to be written to memory. Writing to the write-buffer takes one or two processor clock cycles whereas writing to memory can take many cycles, e.g. flush the (possibly unused) dirty line then write the new line to the cache/memory. When the word at the head of the queue is written to memory, that word is retired from the buffer and a new write cycle is started. Write-buffers can greatly reduce the time a processor stalls while waiting for writes to complete.

2.2.2 Multiprocessor Cache Coherency

On a multiprocessor, the use of caches creates a serious problem. Since many copies of a given line may exist at any time, when one of the copies is updated, all other copies must be either updated or marked invalid. One solution is to rely on software techniques to separate out read-shared, write-shared and not-shared (local) data. Read-only variables can be read and copied by all processors. Writable variables can have no copies at all. The variables belonging to each of these classes are grouped and allocated to read-only/read-write pages and the operating system manages the updates of writable pages. Examples of software-controlled shared-memory systems are Ivy [LH89], Mach [FBYR89] and Clouds [RAK89]. The disadvantage of this technique is the low level of parallelism that it yields on some applications [CDK94]. A hybrid approach can be used where some of the most common coherency operations are implemented in hardware, as in the Galactica Net [WLI94] and Alewife [CA94].

Another solution is to implement in hardware a cache coherence protocol that keeps memory in a coherent state [Ste90,CFKA90,CKA91,Hwa93]. Such a protocol defines the sequence of actions needed on changes of memory state, e.g. when a cached line is updated. A cache coherence mechanism consists normally of the cache controllers and of a directory. The cache controllers contain state-machines that control the protocol actions. The directory holds the state, location and number of copies of each memory line.

The actions of a cache coherence protocol depend on the state of each memory line. Besides the three states needed for uniprocessor caches (invalid, clean, dirty), a cache coherence protocol has to determine whether a line is shared, since this case must be treated differently by the protocol actions. The directory entry for a line consists of its state and one or more pointers to the copies of the line. A full directory is normally implemented with a bit vector per cache or memory line where each bit points to one of the processors in the system. If a processor has a copy of the line, the corresponding bit is set. A partial directory only holds a few pointers to processors with copies. When the number of copies exceeds the number of pointers, either further copying is disallowed or all existing copies are marked invalid. Weber and Gupta, in [WG89], contend that for well-designed programs, 3 or 4 pointers are sufficient to keep invalidation of copies to a minimum. A chained directory keeps the list of copies as a linked-list or linked-tree, rooted at memory. The directory is normally split in two sections, part in memory and part in the caches. Full directories can be implemented with a presence flag vector

in memory. The caches only keep the state of each line they hold. Partial and chained directories keep the pointers and line state partly in memory and partly in the caches.
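As an illustration of the full-directory scheme, the sketch below keeps one presence bit per node in a 64-bit vector; the representation and names are assumptions made for the example (a real directory stores one such entry per memory line):

    #include <stdint.h>

    /* Full-map directory entry: one presence bit per node, for systems
       of up to 64 nodes. */
    typedef struct {
        uint64_t present;   /* bit i set => cache of node i holds a copy */
        int      dirty;     /* a single, modified copy exists            */
    } dir_entry;

    /* Record that 'node' has obtained a copy of the line. */
    void note_copy(dir_entry *e, int node)
    {
        e->present |= (uint64_t)1 << node;
    }

    /* On a write by 'writer', every other copy must be invalidated. */
    uint64_t copies_to_invalidate(const dir_entry *e, int writer)
    {
        return e->present & ~((uint64_t)1 << writer);
    }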

Snooping protocols. Most of the early cache coherence protocols proposed and implemented were snooping protocols. These are well suited for small and medium size machines built around a bus. In these machines, broadcast operations can be implemented efficiently since all devices attached to the bus can monitor bus activity. Whenever a write to a shared line occurs, all cache controllers attached to the bus update any copies of the line they might hold, that is, all controllers "snoop" on bus activity. Normally, the state of each cached line is kept at the cache itself and no pointers to copies are needed. Since rings are topologically equivalent to buses and allow for inexpensive broadcasts, ring-based multiprocessors tend to use snooping protocols [BD91, BD93, VSLW91, Bur92].

Invalidation protocols. One of the disadvantages of buses and rings is their poor scalability. A system is said to be scalable when more processors can be added without any major degradation in performance or cost [Hil90, Bel92, Sco92]. The main advantage of more general networks is their scalability to large numbers of processors. However, broadcasts are inefficient on such networks. More processors mean more network traffic when messages signalling changes in data have to be broadcast. The solution is to avoid broadcasts by using directory-based cache coherence schemes that send cache coherency commands to just those caches that hold copies of shared lines. When a line is written to, messages are sent to the caches that have copies of the line. Normally, the copies are invalidated since this uses less bandwidth than updating large numbers of copies [ASHH88, CFKA90, Ste90, Hwa93].

One example of an invalidation protocol is Alewife's [CKA91, CA94], which uses a partial directory with five pointers and software support when the number of copies exceeds five. Thapar and Delagi, in [TD91], propose a distributed linked-list invalidation protocol. SCI [IEE92] also uses a distributed linked-list directory. The Scalable Tree Protocol has its directory organised as a distributed linked-tree [NS92]. DASH is a full directory cache-coherent shared-memory multiprocessor [LLG+90, LLJ+92]. The memory coherence is maintained by a distributed invalidation directory-based protocol. Section 4.4.1 (page 59) compares the performance of an SCI-based multiprocessor to that of DASH.

DASH, like SCI-based machines, is called a Cache Coherent Non-Uniform Memory Access Machine (CC-NUMA) because of the difference in access times for local and remote references. The KSR1 is called a Cache Only Memory Hierarchy (COMA) because its memory hierarchy consists of only primary and secondary caches, with no main memory. The Data Diffusion Machine is also a COMA type multiprocessor [Hag92]. Its interconnect is a hierarchy of buses and coherency is maintained by a write-invalidate snooping protocol. The HORN DDM [MSW94] uses a Banyan MIN network [IT89] and cache coherency is maintained by an invalidation linked-list based protocol. Stenström et al. present a comparative performance study of COMA and CC-NUMA architectures in [SJG92]. Hagersten disagrees with those results and presents a more refined model in [Hag92]. This new model compares NUMA and COMA architectures with similar levels of complexity and design optimisation. The latency equations more closely reflect COMA's behaviour and are based on more realistic design parameters. For the architectures investigated, the new model yields results that give COMA a clear performance advantage over NUMA. The difference in performance stems from the better tolerance to network latency exhibited by the COMAs.

Memory consistency models. The use of caches in shared memory multiprocessors improves their performance by reducing the number of cycles during which processors are stalled waiting for memory requests to complete. A memory system is said to be sequentially consistent if the programming model it provides is the same as a multiprogrammed uniprocessor [Lam79]. To implement sequentially consistent memory, the processors must stall until every memory request completes. "Completion" is determined by the absolute ordering of actions as perceived by all processors in the system: a memory request completes when its effects have propagated throughout the whole system. Sequential consistency is a very restrictive model since it might disallow the use of write-buffers, for example [DS90].
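The classic two-flag fragment below (a sketch, not taken from the text) shows why write-buffers conflict with sequential consistency: if each processor's write of its flag waits in a write-buffer while its read of the other flag completes first, both threads can enter the critical section, an outcome no sequentially consistent interleaving allows:

    int flag0 = 0, flag1 = 0;

    void thread0(void)
    {
        flag0 = 1;          /* may sit in the write-buffer ...  */
        if (flag1 == 0) {   /* ... while this read overtakes it */
            /* critical section */
        }
    }

    void thread1(void)
    {
        flag1 = 1;
        if (flag0 == 0) {
            /* critical section */
        }
    }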

A more relaxed consistency model is processor consistency [Goo91]. In this model, the writes by each processor are always completed in the order they appear in the program text, i.e. in program order, but writes from different processors might complete in any order. The order of reads is not restricted as long as the reads do not involve other processors. Write-buffers are allowed in processor consistent memory systems and their performance is much better than that of sequentially consistent systems, mostly because expensive invalidations can be overlapped with other memory requests [GGH91].

An even less strict model is defined by a weak ordering of events [DS90]. Under this model, the synchronisation operations are sequentially consistent. The order of successive writes by the same processor must be respected. Buffering of ordinary references is allowed but not of references to synchronisation variables. A processor acquires locks and releases them in program order, by using the commands lock() and unlock(), respectively. All ordinary memory requests by that processor must be completed before each subsequent lock() and unlock() is issued. A further weakening is possible by only imposing an ordering on lock releases - this model is called release consistency [GLL+90]. The weakest memory-access order model is proposed by Bitar in [Bit92], where a semantics for asynchronous multiprocess computation is defined.

Weaker models yield better performance at increased hardware and programming costs. In weaker models, the cache controllers must keep track of each outstanding request. Also, the programmer must label the synchronisation points where consistency must be enforced. Of the five models, processor consistency seems to be the best cost versus performance compromise [GGH91].

2.2.3 Ring Based Shared-Memory Multiprocessors

Uni-directional rings have some very interesting topological properties. Ordering is easily enforced and broadcasts can be efficiently implemented. Snooping protocols can be designed to take advantage of these features. Also, rings allow for the pipelining of messages and thus can be designed as very low-latency, high-bandwidth networks. Rings of rings can be used for scaling up to larger numbers of processors. The problem with this topology is the latency of cross-ring transactions. These latencies are not prohibitive however because broadcasting and snooping can hide some of the delays.

To date, the KSR1 [Bur92] is the only commercially available ring-based shared-memory multiprocessor. It consists of a hierarchy of rings and cache coherence is maintained by a snooping write-invalidate protocol. An important feature of the KSR1 is its memory hierarchy, composed only of primary and secondary caches. The KSR1 can grow up to 1088 processors in a two-level hierarchy of rings. The ring:0 can accommodate 32 processors; the ring:1 supports up to 34 ring:0's. The remote access latency on a 32-node ring is 6µs and, to reduce its effects, the KSR1 supports the software mechanisms prefetch and poststore.

Farkas et al., in [FVS92], present an invalidation-based cache coherency scheme for a hierarchy of rings that takes advantage of the natural broadcasting and ordering properties of rings. In the Hector multiprocessor [VSLW91], processors are grouped in clusters that also contain memory modules and a communication controller. The coherency protocol is based on snooping both at the intra-cluster buses and on the rings. The amount of traffic caused by broadcasting consistency commands can be reduced by filtering out incoming and/or outgoing messages at inter-ring interfaces. The performance results presented are somewhat unreliable because of the simulation method (asynchronous event generation) and the traces employed (from a different architecture). The performance of the coherence scheme, when compared to that of no caching, yields speedups in the range of 30-195% on three applications from the SPLASH suite [SWG91] - SA-TSP, LocusRoute and PTHOR.

Barroso and Dubois describe the Express Ring in [BD91,BD93]. It is based on a slotted ring where each of the ring interfaces is a pipeline that can hold three symbols. The pipeline is divided in slots of different sizes for control and data messages. When a node sees a message directed to itself, the message is removed from the ring and the empty slot can be used by the next downstream node. Their simulation results indicate that remote access latency is rather small. They investigate two cache coherence protocols, one based on snooping and the other on a full-map directory. Their results indicate that the snooping protocol yields better performance. Also, the performance of the slotted ring is significantly better than that of a split transaction bus. The maximum number of nodes that can be assembled on an Express Ring is limited to between 32 and 64.

2.3 The Scalable Coherent Interface

The project that became SCI started as the design of a very high performance bus. Early on it became obvious that the bandwidth of a bus would always place a hard limit on performance. The solution was to employ point-to-point connections since these allow for much higher clock rates and richer interconnection patterns. The description that follows concentrates on those features of SCI that are of relevance here. For full details, please see [IEE92].

SCI defines three subsystems, namely the physical-level interfaces, the packet-based logical communication protocol, and the distributed cache coherence protocol. The physical interfaces are high-speed uni-directional point-to-point links. One of the versions prescribes links 16 data-bits wide which can transfer data at a peak speed of 1 Gbyte/s. The standard supports a general interconnect, providing

a coherent shared-memory model, scalable up to 64K nodes. An SCI node can be a memory module, a processor-cache pair, an I/O module or any combination of these. The number of nodes on a ring can range from two to a few tens. For most applications, a multiprocessor will consist of several rings, connected together by switches, i.e. nodes with more than one pair of link interfaces.

2.3.1 SCI Communication Protocol

The communication protocol comprises the specification of the sizes and types of packets and of the actions involved in the transference of information between nodes. A packet consists of an unbroken sequence of 16-bit symbols. It contains address, command/control and status information plus optional data and a check symbol. A command/control packet can be 8 or 16 symbols long, a data packet can be 40 or 48 symbols long and an echo packet is 4 symbols in length. A data packet carries 64 bytes of data. 16- and 48-symbol packets carry an additional pointer (node address) used in some cache coherence operations, e.g. sharing-list purges (see below).
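The packet sizes above can be summarised as constants; the enum below records the symbol counts given in the text (16-bit symbols), while the names themselves are illustrative:

    enum {
        SYMBOL_BITS  = 16,  /* one symbol per link per network clock cycle */
        ECHO_SYMBOLS = 4,
        CMD_SYMBOLS  = 8,   /* or 16, with the extra pointer               */
        DATA_SYMBOLS = 40,  /* or 48, with the extra pointer               */
        DATA_BYTES   = 64   /* payload carried by a data packet            */
    };

    /* A 40-symbol data packet moves 80 bytes on the wire to deliver
       64 bytes of data, so the header/check overhead is 16 bytes. */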

The protocol supports two types of action: requests and responses. A complete transaction, for example a remote memory data-read, starts with the requester sending a request-send packet to the responder. The acceptance of the packet by the responder is acknowledged with a request-echo. When the responder has executed the command, it generates a response-send packet containing status information and possibly data. Upon receiving the response-send packet, the requester completes the transaction by returning a response-echo packet. The communication protocol ensures forward progress and contains deadlock and livelock avoidance mechanisms.

The network access mechanism used by SCI is the register insertion ring. Figure 2.2 shows a block diagram of the link interface. A node retains packets addressed to itself and forwards the other packets to the output link. A request transaction starts with the sender node placing a request-send packet, addressed to the receiver node, in the output buffer. Transmission can start if there are no packets at the bypass buffer and no packet is being forwarded from the stripper to the multiplexor. At the receiver, the stripper parses the incoming packet and diverts it to the input buffer. On recognising a packet addressed to itself, the stripper generates an echo packet addressed to the sender and inserts it in place of the 'stripped' packet. If there is space at the input buffer, the echo carries an ack (positive acknowledge) status. Otherwise, the packet is dropped and a nack (negative acknowledge) is returned to the sender, who will then retransmit the request-send packet.

Figure 2.2: SCI link interface.

It is likely that during the transmission of a packet, the bypass buffer might fill up with packets not addressed to the node. Once transmission stops, the node enters the recovery phase during which no packets can be inserted by the node. Each packet stripped creates spaces in the symbol stream. These spaces, called idle symbols, eventually allow the bypass buffer to drain, when new transmissions are then possible. The protocol also ensures that the downstream nodes cannot insert new packets until the recovery phase is complete. This will cause a reduction in overall traffic and create enough idles to drain the bypass buffer.

When a packet is output, a copy of it is kept in an active buffer. If the status of a packet's echo is ack, the original packet is dropped from the active buffer and the node can transmit another packet. If the echo carries a nack, the packet is retransmitted. This allows for one or more packets to be active simultaneously, e.g. one transaction initiated by the processor and other(s) initiated by the cache or memory controller(s). The number of active buffers depends on the type of the "pass transmission protocol" implemented. The options are: only one outstanding packet; one request-send and one response-send outstanding; or several outstanding request- or response-send packets.

2.3.2 SCI Cache Coherence Protocol

The SCI cache coherence protocol is a write-invalidate chained directory scheme. Each cache line tag contains pointers to the next and previous nodes in the doubly-linked sharing-list. A line's address consists of a 16-bit node-id and a 48-bit address offset. In protocol actions, the cache and memory controllers can determine the node address of any memory line by the 16-bit node-id. The storage overhead for the memory directory and the cache tags is a fixed percentage of the total storage capacity. For a 64-byte cache block, the overhead at memory is 4% and at the cache tags 7%. Note that these are capacity overheads and do not translate directly into cost.
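A sketch of the per-line state implied by this layout is shown below; the field and function names are hypothetical, but the 16-bit node-id / 48-bit offset split follows the text:

    #include <stdint.h>

    typedef enum { INVALID, HEAD, MID, TAIL, EXCLUSIVE } line_state;

    /* Cache tag for one line: its state plus the doubly-linked
       sharing-list pointers, each a 16-bit node-id. */
    typedef struct {
        line_state state;
        uint16_t   fwd;   /* next entry, towards the tail     */
        uint16_t   bwd;   /* previous entry, towards the head */
    } cache_tag;

    /* A 64-bit line address is a 16-bit node-id plus a 48-bit offset,
       so any controller can locate the line's home node directly. */
    uint16_t home_node(uint64_t addr)
    {
        return (uint16_t)(addr >> 48);
    }

    uint64_t addr_offset(uint64_t addr)
    {
        return addr & (((uint64_t)1 << 48) - 1);
    }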


Figure 2.3: Sharing-list setup. Solid lines represent sharing-list links, dotted lines represent messages.

Consider processors A, B and C, read-sharing a memory line L that resides at node M - see Figure 2.3. Initially, the state of the memory line is home and the cache blocks are invalid. A read-cached transaction is directed from processor A to the memory controller M (1). The state of line L changes from home to gone and the requested line is returned (2). The requester's cache block state changes to the head state, i.e. head of the sharing-list. When processor B requests a copy of line L (3), it receives a pointer to A from M (4). A cache-to-cache transaction, called prepend, is directed from B to A (5). On receiving the request, A sets its backward pointer to B and returns the requested line (6). Node C then requests a copy of L from M (7) and receives a pointer to node B (8). Node C requests a copy from B (9). The state of the line at B changes from head to mid and B sends a copy of L to C (10), which becomes the new head. In SCI, rather than having several request transactions blocked at the memory controller, all requests are immediately prepended to the respective sharing-lists. When a block has to be replaced, the processor detaches itself from the sharing-list before flushing the line from the cache.
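The prepend can be sketched as two small steps, one at the memory directory and one at the requester; this is an illustrative reconstruction of the sequence above, with hypothetical names (NO_NODE is an assumed 'null pointer' sentinel), not the standard's specification:

    #include <stdint.h>

    #define NO_NODE 0xFFFF   /* assumed null-pointer value */

    typedef struct { uint16_t head; int gone; } mem_dir;   /* at home node M */
    typedef struct { uint16_t fwd, bwd; } list_ptrs;       /* in a cache tag */

    /* Home node's side of read-cached: record the requester as the new
       head and return the old head (or NO_NODE if the line was 'home'). */
    uint16_t read_cached(mem_dir *m, uint16_t requester)
    {
        uint16_t old_head = m->gone ? m->head : NO_NODE;
        m->head = requester;
        m->gone = 1;
        return old_head;   /* requester then prepends itself to old_head */
    }

    /* Requester's side: become the new head; the old head will set its
       backward pointer to us when it services the prepend transaction. */
    void become_head(list_ptrs *me, uint16_t old_head)
    {
        me->fwd = old_head;
        me->bwd = NO_NODE;   /* the head has no predecessor */
    }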


Figure 2.4: Sharing-list purge sequence. Solid lines represent sharing-list links, dotted lines represent messages.

Before writing to a shared line, the processor at the head of the sharing-list must purge the other entries in the list to obtain exclusive ownership of the line - see Figure 2.4. Node A, in the head state, sends an invalidate command to node B (1). Node B invalidates its copy of L and returns its forward pointer (pointing to C) to A (2). Node A sends an invalidate command to C (3) which responds with a null pointer, indicating it is the tail node of the sharing-list (4). The state of line L, at node A, changes to exclusive and the write completes. When a node other than the head needs to write to a shared line, that node has to interrogate

the memory directory for the head of the list, acquire head status and then purge the other entries. The node that holds the tag in main memory of a line (i.e. its home node) can be determined from the most significant 16 bits of the line's address. If the writer is at the middle or tail, it first has to detach itself from the sharing-list before attempting to become the new head.
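The purge is a serial walk down the sharing-list, which is why its cost grows linearly with list length. A condensed sketch, reusing the CacheEntry structure above, with a hypothetical send_invalidate transaction standing in for the protocol's invalidate/response message pair:

    /* Hypothetical transaction: invalidate the copy at 'node' and receive
     * its forward pointer in return (-1 signals the tail of the list). */
    extern int send_invalidate(int node);

    void purge_and_write(CacheEntry *line)
    {
        int next = line->fwd_node;
        while (next != -1)                  /* serial walk: one round-trip per copy */
            next = send_invalidate(next);   /* returned pointer is the next victim */
        line->fwd_node = -1;
        line->state    = EXCLUSIVE;         /* now the only copy; the write may proceed */
    }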

2.3.3 Related Work

Scalability up to 64K nodes comes at the price of added complexity in the communication and cache coherence protocols. For instance, a write to a shared datum needs a larger number of packets for its completion than that needed by DASH's protocol [LLG+90]. Johnson, in [Joh93], proposes extensions to SCI's cache coherence protocol to alleviate this problem on larger systems. Additional links can be added to the sharing-lists, thus transforming them into sharing-trees. The proposed schemes significantly improve the performance of invalidations even for low degrees of sharing.

Nilsson and Stenström, in [NS92], describe the Scalable Tree Protocol (STP), a cache coherence protocol based on sharing-trees. The advantage of trees over lists is that, to invalidate n copies of a line, only O(log n) messages are needed on a tree whereas O(n) messages are needed on a list. [NS93] compares the performance of three types of directories, namely a full-map, an SCI-like linear-list, and a tree-based directory, in an ideal architecture. The full-map has the best performance because it minimises invalidation traffic. The list-based directory is worse than STP if the degree of data sharing is high or if memory is sequentially consistent. If data is migratory [WG89], i.e. shared by at most two processors, then the linked-list performs better than STP because of the lower latency involved in invalidating just a few copies.

Aboulenein et al., in [AGGW94], examine SCI's hardware synchronisation primitive, Queue On Lock Bit (QOLB). Its potential efficiency comes from fitting neatly with the linked lists, since waiting processes are naturally enqueued when they join the sharing-list for the lock.

Bugge et al., in [BKB90], compare the performance of three uniprocessor memory architectures, two of which are based on a 32-bit and on a 64-bit wide Futurebus+. The third employs SCI links between secondary cache and memory. The emphasis is on memory hierarchy design and thus coherence-related issues are not investigated. Their trace-driven simulation results indicate that the SCI-based system outperforms the other two when the secondary cache hit ratio is higher than 90%. Also, with a time-shared multiprogramming workload, secondary cache size has the largest impact on the performance of the memory hierarchy whereas the influence of tag access latency is small. The simulated caches are much larger than the ones investigated here: the primary caches are 128Kbytes, and the secondary caches range from 1Mbyte to 8Mbytes. Their results do not agree with those presented here because their simulations ignore coherence traffic and the increase in latency with ring size, and the workloads are very different both in nature and size.

Bogaerts et al., in [BDMR92], present simulation results for 10-node rings and a multi-ring system with 1083 nodes for applications in particle physics. They concentrate on the bandwidth consumed by SCI moveXX transactions for DMA and ignore coherence-related events. Their experiments with 10-node rings show that certain traffic patterns can severely limit the bandwidth available to each node. For DMA move256 transactions (move 256 bytes with no acknowledgement), the effective bandwidth is about 175Mbytes/s per node. When fair bandwidth allocation is employed, effective bandwidth drops to 125Mbytes/s. For this type of transaction, the data portion of the packet is proportionally larger than for 64-byte transactions, thus incurring smaller transmission overheads. The experiments with 64-byte data packets investigate pathological cases and the results are not indicative of more normal conditions.

Scott et al., in [SGV92], present an analytical model of the SCI logical communication protocol. Scott's dissertation [Sco92] presents a more detailed discussion of the model and results. The model is based on M/G/1 queues and the ring is modelled as an open system. Both uniform and non-uniform workloads are investigated. The model is validated against simulation results. The flow control mechanism, used to enforce fairness in bandwidth sharing (simulations only), is effective in preventing starvation and in reducing the effects of a hot transmitter on the ring. This mechanism is not as effective for non-uniform routing distributions. Fairness comes at a cost, however: the maximum ring throughput is reduced by up to 30%, larger rings being more adversely affected. Read-request/read-response data-only aggregate ring throughput, for 64-byte data blocks, is around 600Mbytes/s, fairly distributed among the nodes. They show that an SCI ring compares favourably to a conventional bus.

The scalability of k-ary n-cubes has been investigated under different sets of constraints. In a synchronous network, each flit (symbol) is acknowledged by the receiving node, that is, the handshake is on a flit-by-flit basis. This makes the network clock frequency dependent on the distance between sender and receiver.

Dally found that, in the context of networks built in a single VLSI chip, low-dimensional networks (n = 2) yield the lowest latencies [Dal90]. This conclusion holds true under constant bisection width, which is the case in an area-limited two-dimensional VLSI circuit. Agarwal investigated these networks under different constraints, namely constant channel width and constant node size, and concluded that higher dimensionality is needed than in Dally's study (3 ≤ n ≤ 5) [Aga91]. SCI is an asynchronous network because clock synchronisation between adjacent nodes is not based on flit transmission between nodes; rather, the clocks of two adjacent nodes are assumed to be running at the same frequency (the link interface circuits provide a few cycles of elasticity to accommodate small differences in frequency). This makes the network clock cycle independent of wire length and allows for the pipelining of flits onto wires. Unlike synchronous networks, the clock frequency depends mainly on the technology employed in the interfaces and to a much smaller degree on the topology of the network. Scott and Goodman investigated the scalability of asynchronous k-ary n-cube networks [SG91]. Their results point to even higher optimal dimensionalities than in synchronous networks (4 ≤ n ≤ 12). Thus, asynchronous k-ary n-cubes should be grown by increasing dimensionality n while keeping ring size k constant.

Chapter 3

The Architecture Simulator

This chapter describes the simulation methodology and justifies the choices and compromises made in its implementation. Section 3.1 presents the simulation methodology and the simulator design. Section 3.2 describes the simulated architecture and how this architecture is "implemented" by the simulator. This section also contains the model for the behaviour of the SCI rings. The memory reference stream generator and the programs that comprise the workload are presented in Section 3.3. Finally, Section 3.4 discusses the accuracy of the simulation results.

3.1 Simulation Methodology

There are various methodologies for investigating the performance of computing systems, and a choice of method entails making tradeoffs between computational cost and accuracy of the predictions. Jain, in [Jai91], discusses the tradeoffs between analytical modelling, simulation and direct measurement. Przybylski, in [Prz90], discusses trace-driven simulation and analytical modelling of cache memories and memory hierarchies. Both authors argue that while analytical models provide answers at a much lower computational cost (i.e. in less time), their accuracy is limited by the level of detail and complexity of the models. On the other hand, simulation studies provide more accurate results but take longer to perform.

One approach to architecture simulation is to drive the simulator with traces of the execution of a number of programs. Trace-driven simulation has two inherent problems. The first is the dilation introduced by instrumenting the program whose trace will be used to drive the simulator. This dilation makes the instrumented program from two to 2000 times slower than the original program, depending on the technique used for tracing [BKW90,GNWZ91,KEL91]. Because of the dilation, the relative timing of events in the simulated system may, in the worst case, bear no relationship with event ordering on an actual system.

The second problem stems from changes in the run-time environment of a parallel program on consecutive program runs. Potentially, each time a program runs, the interleaving of memory references can be significantly different from previous runs because of differing allocation of threads to processors and/or other system-dependent factors. Fortunately, Koldinger et al., in [KEL91], conclude that dilation induced effects on miss ratio and bus utilisation measurements are negligible, and that multiple-run effects are insignificant unless one is interested in absolute values for a given metric.

Taking the above into consideration, as well as the availability of tools, the methodology chosen for the investigation described in this dissertation is on-the-fly reference stream generation. The simulator comprises about 3000 lines of C code, six application programs that were ported and adapted, plus several shell scripts for post-processing the simulation results. The simulation environment that was employed is described below.

The simulator consists of a memory reference stream generator and an architecture simulator. They both execute concurrently as Unix processes and communicate through Unix sockets. Figure 3.1 depicts the simulation environment. The stream of memory references is generated as a by-product of the execution of the simulated parallel program. The reference stream is piped to the architecture simulator, which computes the latency of each (simulated processor) reference to memory. This latency is used by the reference stream generator to choose the next simulated thread to run. In this way, the latency of each individual memory reference is accounted for, thus reproducing with good accuracy the interleaving of memory references on a real machine.
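The feedback loop between the two processes can be sketched as follows. This is a minimal illustration of the structure just described, not the dissertation's code; the record layouts and the helpers cost_of_reference and earliest_thread are assumptions made for the example.

    #include <unistd.h>

    typedef struct { int thread; unsigned addr; int is_write; } MemRef;  /* assumed layout */
    typedef struct { long latency; int next_thread; } Reply;

    extern long cost_of_reference(const MemRef *r);   /* caches + coherence + network */
    extern int  earliest_thread(void);                /* thread with smallest virtual time */

    /* Generator side: emit one reference, then resume the thread chosen
     * by the architecture simulator. */
    int generator_step(int sock, MemRef ref)
    {
        Reply reply;
        write(sock, &ref, sizeof ref);      /* reference flows to the simulator */
        read(sock, &reply, sizeof reply);   /* latency and scheduling choice flow back */
        return reply.next_thread;
    }

    /* Simulator side: cost each reference from the simulated system state. */
    void simulator_loop(int sock)
    {
        MemRef ref;
        while (read(sock, &ref, sizeof ref) == (ssize_t)sizeof ref) {
            Reply reply;
            reply.latency     = cost_of_reference(&ref);
            reply.next_thread = earliest_thread();
            write(sock, &reply, sizeof reply);
        }
    }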

The architecture simulator consists of an approximate model of the SCI link interfaces and of a detailed model of the distributed cache coherence protocol. The model of the ring interfaces is similar to those in [SG91,SGV92] but, rather than using statistical analysis, traffic-related values are measured and directly influence the behaviour of the simulated system. The model of the cache coherence protocol mimics the "typical set coherence protocol" as defined in [IEE92].

[Figure 3.1 shows two Unix processes - the reference stream generator and thread scheduler, which runs the simulated threads, and the memory system simulator, which models the simulated multiprocessor - exchanging memory references in one direction and the choice of next thread in the other.]

Figure 3.1: Simulation environment.

The address sequences used to drive the simulator are generated by instrumenting the parallel programs described in Section 3.3 with Symbolic Parallel Abstract Execution (SPAE) [GNWZ91]. SPAE is based on the GNU gcc compiler and allows for tracing parallel programs at any desired level of detail. The resolution of the reference stream generator is at the instruction/data reference level. The cost of each memory reference is computed from the state of the system - level of network traffic and coherence actions performed - and those values are used to schedule the execution of the simulated processors. The simulated parallel program is split into lightweight threads, one for each simulated processor. Each data reference by the simulated processors causes a context switch; the thread that will next run is chosen by the architecture simulator. Likewise for instructions, except that the context switches occur only at "basic block" borders, as defined by gcc. Thus, the global interleaving of memory references is simulated with better accuracy than is possible with straightforward trace simulation [BKB90,BKW90] or with the method proposed in [MB92]. However, the computational cost is much higher. Typically, a simulation run takes from 1 to 100 CPU hours on a lightly loaded Sparcstation 2, depending on the data-set size.

3.2 The Simulated Multiprocessor

The multiprocessor consists of a number of processing nodes interconnected in one or more rings by SCI links. Each node contains a processor, a split primary cache, a coherent secondary cache, memory and one or more SCI interfaces - see Figure 3.2. The individual components are described below.

Figure 3.2: Architecture of the processing nodes.

3.2.1 Processors and Memory Hierarchy

The CPU is a 32-bit scalar Harvard processor that performs an instruction fetch and possibly a data read/write access on every clock cycle. The processor clock frequency is a simulation parameter and the values investigated are 100 and 200MHz. The instruction set is that of a SPARCstation processor since that is the processor the simulated code is compiled for and executed on. The simulated processors always stall on memory references (both reads and writes), thus the memory model is sequential consistency.

The memory hierarchy comprises three levels: small split primary caches, large secondary caches and main memory. Primary caches are virtually addressed while secondary caches are indexed by physical (real) addresses. Note that in SCI systems, the SCI-coherent caches must be physically addressed. The size of the instruction cache (i-cache) and data cache (d-cache) is 8Kbytes each (one page), both being direct mapped. The data cache is write-through with no block allocation on write misses. The secondary cache is direct mapped and, for private data references, it is copy-back with no block allocation. The mapping of virtual to physical addresses is performed in parallel with primary cache tag-matching, as proposed in [OMB91,WHL93].

The secondary cache size is a simulation parameter. Sizes investigated are 64, 128, 256 and 512Kbytes. Local main memory is simulated as if implemented with DRAMs, with a degree of interleaving of 8. On all three levels of the memory hierarchy, cache frames and memory lines are 64 bytes wide, which is the size of the unit of coherency in the SCI cache coherence protocol. The memory hierarchy satisfies the multilevel inclusion property [BW88], and the SCI coherency protocol actions affect only the secondary caches, thus called coherent caches. Coherency between primary and secondary caches is maintained by the cache controllers.

The internal buses are 64 bits wide, except the processor-primary cache buses, which are 32 bits wide. The access latency for the secondary caches is 3 processor cycles. Loading a line from the secondary cache into the primary caches or SCI controller costs 3 processor cycles plus 2ns per 64-bit word (16ns). Loading a line from/to memory costs 120ns of access latency plus 10ns per 64-bit word (80ns). Thus, a cache-to-memory read-line transaction costs 246ns for a 100MHz processor. To that, the network latency must be added if one end of the transaction, cache or memory, is at another node. Table 3.1 gives the cost, in processor cycles, of the various types of cache operations.
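The 246ns figure follows directly from the latencies just quoted. As a worked check, for a 100MHz processor (10ns cycle) and a 64-byte line moved as eight 64-bit words:

\begin{align*}
  T_{\mathrm{memory}} &= 120\,\mathrm{ns} + 8 \times 10\,\mathrm{ns} = 200\,\mathrm{ns} \\
  T_{\mathrm{cache}}  &= 3 \times 10\,\mathrm{ns} + 8 \times 2\,\mathrm{ns} = 46\,\mathrm{ns} \\
  T_{\mathrm{read\mbox{-}line}} &= 200\,\mathrm{ns} + 46\,\mathrm{ns} = 246\,\mathrm{ns}
\end{align*}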

Cache operation                          latency (cycles)
Read from primary cache                         1
Fill from secondary cache                       5
Fill from local memory                         20
Fill from remote node                       37-54
Fill from dirty-remote, remote home         44-78
Write owned by secondary cache                  3
Write owned by local node                      12
Write owned by remote node                 47-436

Table 3.1: Cache and memory operation latencies in processor clock cycles.

The home of a given line is the node to which the memory page that contains that line is mapped. References to pages mapped to memory on other nodes are called remote references. In the early stages of this work, as reported in [HT94a], the mapping of virtual memory pages to nodes was done naively: the first node that references a given page becomes its home. This is inadequate since nodes which are home to large numbers of pages are very likely to become hot spots. Furthermore, this method also causes severe "distance imbalances" since most shared data will be at a short distance from only one processor, which will run relatively quickly, but far away from all the others.

The mapping policy employed here is still simplistic but is fairer. On the first eight page faults, the faulting node becomes the home of the page. Subsequent faults are mapped to the next "upstream" node. If the neighbour's quota has been exceeded, the page is mapped to the next upstream node with free page table entries. If no free entries exist, all nodes are awarded another eight-page quota. This quota policy spreads shared data more evenly than is possible with the naive policy, yielding better performance and more reliable results. With quotas, queue utilisation on the SCI interfaces never exceeds 20% whereas it can be higher than 50% with the naive policy. The lower the queue utilisation the more reliable the simulation results since queue overruns are much less likely to occur at low utilisations.
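A compact sketch of this quota policy follows. It is a reconstruction from the description above, not the simulator's source; the table size, the quota constant and the function name map_page are illustrative.

    #define QUOTA     8        /* pages per node per round, from the text */
    #define MAX_NODES 64

    static int pages_used[MAX_NODES];   /* pages currently mapped to each node */
    static int quota = QUOTA;           /* current per-node allowance */

    /* On a page fault at 'faulting_node', choose the page's home node. */
    int map_page(int faulting_node, int n_nodes)
    {
        for (;;) {
            /* try the faulting node first, then successive "upstream" nodes */
            for (int i = 0; i < n_nodes; i++) {
                int node = (faulting_node + i) % n_nodes;
                if (pages_used[node] < quota) {
                    pages_used[node]++;
                    return node;
                }
            }
            quota += QUOTA;    /* all quotas exhausted: award every node another eight */
        }
    }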

In order to simplify the simulator, it is assumed that on data accesses the concurrent instruction fetch hits in the primary cache and that accesses to local data and instructions do not cause any traffic on the ring. The simulator ignores intra-node contention, that is, the processor of a hot spot node does not see any contention for the internal buses and its local cache or memory. It is also assumed that page faults have zero cost.

Design Choices. There are many architectural features that could be incorporated in the design outlined above. The improvements in performance achieved by the addition of devices such as write buffers, prefetching and weaker memory models are well documented in [HP90,Prz90,Sto90,Hwa93] and references therein.

These devices were not incorporated in the design of the multiprocessor because the main focus of the research described here is the evaluation of SCI as a memory system backbone rather than optimising the performance of a given base architecture. While the author is well aware of the potential improvements in performance from designing a more sophisticated machine, there is a real danger of these performance gains masking out and obscuring the inherent behaviour of an SCI-based system. The design space was thus limited to a few parameters and values. To have done otherwise would have increased the complexity of the task, to a combinatorial explosion in the worst case, without necessarily extending the scope of the results.

3.2.2 The Simulation Model of SCI Rings

For the description of the model of an SCI ring that follows, please refer to Fig- ure 2.2, repeated here for convenience in Figure 3.3. A more detailed description of the SCI communication protocol can be found in Section 2.3.

Figure 3.3: SCI link interface.

In accordance with the SCI standard, the network clock cycle is 2ns (500MHz) and the physical links are 16 bits wide. The delay faced by a packet waiting to be transmitted (Twait) depends on the number and size of packets passing through the node. Likewise, the delay faced by packets at the bypass buffer (Tpass) depends on the frequency and size of packets inserted by the node. Wire propagation delay (Twire) is 2ns. The time to parse an incoming packet (Tstrip) and the time to gate an outgoing symbol onto the output link (Tout) are also 2ns each. Thus, the latency LAB, in network clock cycles, involved in sending a packet from NodeA to NodeB and waiting for its echo can be calculated as follows. To simplify the expressions, the modulus operations on summation indexes were omitted.

\begin{align}
L_{AB,\,type} = {} & T_{wait_A} + T_{out} + 2\,size(type) \tag{3.1}\\
  & + \sum_{i=A+1}^{B-1} \left( T_{wire} + T_{strip} + T_{pass_i} + T_{out} \right) \nonumber\\
  & + T_{wire} + T_{strip} + T_{pass_B} + T_{out} \nonumber\\
  & + \sum_{i=B+1}^{A-1} \left( T_{wire} + T_{strip} + T_{pass_i} + T_{out} \right) \nonumber\\
  & + T_{wire} + T_{strip} \nonumber
\end{align}

where type can be one of Pcmd8, Pcmd16, Pdata or PdataX; their sizes are 8, 16, 40 and 48 symbols, respectively (1 symbol = 2 bytes). An idle symbol must precede each packet, thus making the sizes 9, 17, 41 and 49 in the throughput calculations. The term 2 size(type) is the time, in nanoseconds, needed to insert a packet into the ring. The peak bandwidth of a link or buffer is the maximum number of symbols that can pass through it per time unit. In the absence of traffic, the peak bandwidth of the output or bypass buffer is 500Msymbols/s (1Gbyte/s).

The average packet size through a link or buffer is

\[
  P_{avg} = \sum_p f_p\, size(p) \Big/ \sum_p f_p \tag{3.2}
\]

where p ∈ {Pcmd8, Pcmd16, Pdata, PdataX, Pecho} and f_p is the frequency of packet type p. The throughput S of a buffer is the number of symbols that pass through it per time unit:

\[
  S_{buffer} = \sum_p f_p\, size(p). \tag{3.3}
\]

The utilisation of a link or buffer is given by the throughput divided by the bandwidth available, times the average packet size. Thus, the number of network clock cycles spent waiting for the transmission of a packet at the output buffer, Twait, can be written as

\[
  T_{wait} = P_{avg_{tx}}\, S_{tx} / (BW_{max} - S_{pass}) \tag{3.4}
\]

Similarly for the bypass buffer, Tpass is

\[
  T_{pass} = P_{avg_{pass}}\, S_{pass} / (BW_{max} - S_{tx}) \tag{3.5}
\]

where Spass and Stx are the throughputs of the bypass and output buffers respectively, (BWmax - Spass) is the bandwidth available at the output buffer, and (BWmax - Stx) is the bandwidth available at the bypass buffer.
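Equations 3.2-3.5 translate directly into the per-interval bookkeeping the simulator performs. The sketch below shows that computation; the structure, the unit choices (symbols per measurement interval) and the echo packet size are assumptions made for the example.

    #define N_TYPES 5   /* Pcmd8, Pcmd16, Pdata, PdataX, Pecho */

    /* sizes in symbols, including the preceding idle; the echo size is a guess */
    static const double pkt_size[N_TYPES] = { 9, 17, 41, 49, 5 };

    typedef struct { double freq[N_TYPES]; } Traffic;   /* packets per interval */

    double p_avg(const Traffic *t)                      /* Eq. 3.2 */
    {
        double sym = 0, pkts = 0;
        for (int p = 0; p < N_TYPES; p++) {
            sym  += t->freq[p] * pkt_size[p];
            pkts += t->freq[p];
        }
        return pkts > 0 ? sym / pkts : 0;
    }

    double throughput(const Traffic *t)                 /* Eq. 3.3 */
    {
        double sym = 0;
        for (int p = 0; p < N_TYPES; p++)
            sym += t->freq[p] * pkt_size[p];
        return sym;
    }

    /* Eqs. 3.4 and 3.5; bw_max in symbols per interval */
    double t_wait(const Traffic *tx, const Traffic *pass, double bw_max)
    {
        return p_avg(tx) * throughput(tx) / (bw_max - throughput(pass));
    }

    double t_pass(const Traffic *tx, const Traffic *pass, double bw_max)
    {
        return p_avg(pass) * throughput(pass) / (bw_max - throughput(tx));
    }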

In the equation for the latency LAB (Eq. 3.1), making Tpass and Twait zero yields the static latency of the ring, that is, the component that depends solely on propagation delays and is, in nanoseconds,

\[
  L_{static} = 6N + 2\,size(p) \tag{3.6}
\]

for N processors and packet p. Conversely, the dynamic component of the latency is obtained by considering only Tpass and Twait. In the simulations, dynamic latency is estimated from the measured traffic. Buffer utilisation and average packet size are measured at 10µs intervals (simulated time). Values from interval i are used to compute latencies during interval i + 1.
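As a worked instance of Eq. 3.6, with numbers chosen only for illustration: an 8-node ring carrying a Pdata packet of 40 symbols has a static round-trip latency of

\[
  L_{static} = 6 \times 8 + 2 \times 40 = 128\,\mathrm{ns},
\]

that is, 64 network clock cycles at 2ns per cycle.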

The ring interface model assumes infinite input queues and does not account for the retransmission of packets dropped at their destinations because of full queues. Since the memory is sequentially consistent, processors stall on remote references. However, cache or memory controllers may attempt to transmit response packets to complete outstanding transactions. The effect of more than one source of packets on a node is easily minimised by implementing at least two active buffers [SGV92].

3.3 The Workload

The workload used to study the behaviour of SCI consists of three parallel loops and three "real" programs. The parallel loops are small - tens of source code lines - and exhibit a well defined pattern of memory references, being based on doall loops [Sto90]. The real programs are much larger - two to three thousand lines of source code - and are part of Stanford's SPLASH suite [SWG91]. These programs are thread-based and the parallel programming constructs used in them are a subset of Mach C-threads [GNWZ91,SPG91]. The arrays and variables that hold shared data are allocated to a specific range of addresses. The architecture simulator treats references to these addresses as references to shared data.

The use of code optimisation has two major effects on the code produced by a compiler: it makes programs run faster by reordering some groups of instructions and by reducing the number of load and store instructions through better utilisation of processor registers [HP90]. The aim of the experiments reported here is the understanding of the behaviour of a certain type of machine. Hence, absolute performance figures for particular pieces of code are of no relevance in this context. The object code used in the simulations was produced by gcc v2.2.2 without any optimisation flags. The code produced by the

compiler is an input to the simulator and no assumptions are made about the density of loads and stores.

3.3.1 SPLASH Programs

The SPLASH suite [SWG91] consists of a set of parallel scientific applications that are representative of common usage and practice in the early 1990s. These programs have been used by several other researchers to test their ideas and designs. A brief description of the three programs chosen follows.

Cholesky factorisation. chol() performs parallel Cholesky factorisation of a sparse matrix using supernodal elimination. The scheduling of parallel work is done by a task queue and the granularity of work is large. The main data structure is the representation of the sparse matrix itself. Cache size is one of the parameters used by the scheduler to allocate work to processors. The input matrix used is bcsstk14, which contains 1806 equations and 30824 non-zeroes in the matrix and 110461 in the factor. The matrix bcsstk14 occupies 420Kbytes unfactored and 1.4Mbytes factored.

Brorsson and Stenström, in [BS92], examine the patterns of reference to shared data in Cholesky. Using a highly optimised compiler and an ideal unit-delay access latency architecture, they report a large fraction of references to exclusive and read-only blocks. With a sampling interval of 32000 processor cycles and 64-byte blocks, under 60% of the references to shared data are read-only, with under 40% being read-only exclusive. 17% of the references are read-write exclusive and the remainder are read-write shared.

MP3D. mp3d() is a rarefied fluid flow simulator based on Monte Carlo methods. The scheduling of tasks is static, synchronisation is based on barriers and the granularity of work is large. Molecules are attached to processors rather than to spatial coordinates. Thus, as the simulation evolves, the positions of molecules change significantly but their speeds and positions are always computed by the same processors. However, molecule data migrates from cache to cache as molecules collide during the simulation steps. The main data structures are the space array, describing the 3-D space and which molecules may undergo a collision, and the molecule array, which holds the state of each molecule, namely its position and velocities. The data set is scaled as 1.5 × nodes. The simulation lasts 50 time steps.

Weber and Gupta, in [WG89], analyse the cache invalidation patterns in multiprocessors. Using traces for the VAX-32000 with half a million references per processor (most of one time step), 16 processors and infinite caches, they find that there are 1.03 invalidations per shared-write. Since MP3D uses arrays heavily, a typical line of source code requires 15 i-fetches, 5 reads and 1 write. Most of the shared data falls into the "migratory" category - the datum can be shared by all processors but at any one period it is only referenced by one processor.

Gharachorloo et al., in [GGH91], evaluate the performance of memory consistency models. They simulate 10000 molecules for 5 time steps. On an ideal unit-delay access latency architecture, they find that there are 2.3 reads per write to shared data and 1.4 read misses per write miss. Also, for over 40% of the read misses in the application, there is a write miss within the 30 cycles before those read misses. [BS92] also reports that, for a 2000-cycle sampling interval and blocks of 64 bytes, over 50% of shared-data references are read-write exclusive, 10% read-write dominant (one processor has > 50% of references), 15% are read-write shared-by-few (1 to 4 processors out of 32) and the remainder are read-write shared by many (> 4). There are virtually no read-only shared-data references.

Water. water() is an n-body molecular dynamics program that evaluates forces and potentials in a system of water molecules in the liquid state. The scheduling of tasks is static, synchronisation is based on barriers and the granularity of work is large. The set of molecules is split evenly amongst the processors and molecules remain attached to processors as they move in tri-dimensional space. The main data structure is an array that holds the state of each molecule - position, velocities and accelerations. The computation describing molecular motion involves a large number of array and floating point operations. The data set is scaled as 1.45 × nodes. The system of molecules is simulated for 4 time steps.

Lenoski et al., in [LLJ92], present performance data for the DASH multiprocessor. They find that water achieves a good speedup (13.3 on 16 processors), has good memory locality and does not place a heavy burden on the memory system and interconnect.

3.3.2 Parallel Loops

Gaussian Elimination. ge() solves a system of linear equations by Gaussian elimination and backwards substitution. In this implementation, it is assumed that the system of equations has some property that makes Gaussian elimination without pivoting numerically stable (e.g. diagonal dominance). The algorithm consists of several elimination stages. Each stage consists of a vector scale operation of the form x_{k+1} = c x_k, followed by a rank-1 update of the matrix, A_{k+1} = A_k + d x y^T, where x and y are vectors and c and d are scalars. At the k-th stage, matrix A has dimension (n - k) × (n - k + 1). Input data set size grows as 1.26 × nodes.

The serial version spends over 97% of the time on the rank-1 update. Thus, the parallelisation effort was concentrated there. No attempt has been made to optimise the serial portion of the code. The rank-1 update is partitioned by columns and the vertical slices are made as large as possible to minimise overheads. Node 0 does all the serial processing. When the slices of the matrix are of different sizes, the nodes with the lowest indexes compute on the largest slices. Thus, load balancing deteriorates as the computation progresses, since the nodes near to Node 0 do more work than those far from it (Node_i for i large). The source code for ge() was provided by Graham Riley, from the Centre for Novel Computing, Manchester University. The parallel version was compiled and optimised for the KSR1 at CNC.

Matrix multiplication. mmult() computes C = A × B for square matrices A and B. The algorithm consists of three nested loops and each processor computes a slice of the result matrix. All of the shared data is read-only and the little write-sharing that occurs is caused by false sharing. A local variable accumulates the partial sums for each of the result matrix's elements. This algorithm is also O(n^3) and the input data set is scaled up as 1.26 × nodes.

All-to-all paths. paths() is a member of the class of transitive closure algorithms. For a graph with N nodes, paths() finds the lowest cost path from each node to every other node [DPL80]. The edges are labelled with the distance between the nodes they join and are stored in the matrix D. Thus, D[i,j] is the distance between nodes i and j, and the absence of an edge is represented by an infinite cost. The simulated graph is a random graph with outdegree 6. Input data set size is scaled as 1.26 × nodes. The code fragment below is the parallel loop where all of the work is done.

doall (t = 0; t < numProc; t++)              /* All-to-all paths */
    for (k = start(t); k < end(t); k++)
        for (j = 0; j < rows; j++)
            for (i = 0; i < rows; i++)
                if (D[i,j] > (D[i,k] + D[k,j]))
                    D[i,j] = D[i,k] + D[k,j];

3.3.3 Data Set Sizes

There are two choices for the sizes of the data sets. By fixing the data set sizes, one can measure the scalability of a (program + architecture) pair by looking at speedup and processor efficiency. The disadvantage is that individual processors do less and less work as the system is grown. Alternatively, by scaling up the data, the work per processor can be kept roughly constant. The performance metric is then execution time, which should remain constant as both machine and data size are grown. The difficulty here is finding the factor by which data is to be scaled up. Since the focus of this dissertation is on the performance of memory hierarchies with coherent caches, an adequate way of ensuring a uniform distribution of work across processors is by keeping the number of references to shared data (roughly) constant. By choosing a large enough number of references, the caches can be fully and equally exercised, thus minimising distortion caused by cold starts. Sizes were chosen so that there are at least 1.0 × 10^6 references to shared data. Either way, data-set size fixed or varying, shared-data miss ratios, and indeed sharing behaviour, do change as the machine size is grown. Miss ratios tend to increase with machine size because coherency misses are closely related to the level of data sharing. By scaling up the data sets, both compulsory and capacity misses tend to increase as well because of the larger cache footprints [HP90,Sto90].

Ring size             1      2      4      8      16     64     factor
chol()     (fixed-size input: bcsstk14)                          1.00
mp3d()     molecules  3000   4500   6750   10125  15187  34172   1.50
water()    molecules  54     78     113    163    237    512     1.45
ge()       rows       136    171    216    272    343    545     1.26
mmult()    rows       100    126    159    200    252    400     1.26
paths()    vertices   70     88     111    140    176    280     1.26

Table 3.2: Input data-set sizes and scaling factors.

Table 3.2 shows the data-set sizes and scaling factors for each of the six programs. No experiments were performed using chol() with more than 16 nodes since the amount of work per node would be too small on a 64-node multiprocessor. ge() is the program whose simulation runs take the longest - on 16 nodes, it takes about 10 CPU hours longer than any of the other programs. Thus, because of practical limitations, only a few experiments were performed with 64 nodes. The data set size for water() with 64 nodes should have been 497 rather than 512. The choice of 512 is due to the input data set provided with the source code, which contains spatial distributions for up to 343 molecules. The program has an option for internally generating a spatial distribution provided the number of molecules is a cube. Hence the choice of 512.

Ideally, a workload should consist of as many representative programs as possible or practical to ensure that the system under investigation is fully exercised. The author believes that the six programs chosen are representative of the range of behaviours displayed by scientific applications. ge() has good locality and the amount of computation on the data is high. mmult() also has good locality and the small amount of write-sharing that exists is caused by false sharing. paths() does not have good locality. On the contrary, each processor sweeps the entire array and, potentially, each array position can be written to by every processor.

Ring size            1          2          4          8          16         64

Cholesky - chol()
shared (% wr)        10.4 (18)  12.6 (23)  8.6 (23)   5.2 (23)   2.9 (19)   -
private (% wr)       31.0 (27)  8.5 (26)   2.7 (23)   1.0 (18)   0.9 (17)   -
instructions         71.7       37.0       20.3       11.6       8.1        -

MP3D - mp3d()
shared (% wr)        5.4 (39)   5.5 (29)   4.5 (27)   5.0 (18)   6.0 (11)   6.7 (6)
private (% wr)       12.2 (18)  9.0 (18)   6.8 (18)   5.0 (18)   3.7 (18)   2.1 (18)
instructions         32.8       27.0       21.1       19.0       18.6       17.0

Water - water()
shared (% wr)        1.4 (18)   1.5 (17)   2.2 (12)   2.9 (9)    2.9 (9)    5.0 (4)
private (% wr)       14.3 (19)  15.4 (19)  16.2 (19)  16.5 (19)  17.0 (19)  16.0 (19)
instructions         30.0       30.5       33.0       34.7       35.5       39.0

Gaussian elimination - ge()
shared (% wr)        2.6 (33)   2.6 (33)   2.6 (33)   2.5 (33)   2.5 (33)   2.5 (33)
private (% wr)       13.0 (7)   12.8 (7)   12.8 (7)   12.8 (7)   12.8 (7)   12.8 (7)
instructions         33.6       33.3       33.4       33.2       33.2       33.1

Matrix multiplication - mmult()
shared (% wr)        2.0 (0.5)  2.0 (0.4)  2.0 (0.3)  2.0 (0.2)  2.0 (0.2)  2.0 (0.2)
private (% wr)       14.2 (14)  14.1 (14)  14.2 (14)  14.1 (14)  14.1 (14)  14.1 (14)
instructions         33.2       33.2       33.3       33.1       33.1       33.2

All-to-all minimum cost paths - paths()
shared (% wr)        1.0 (0.8)  1.0 (0.6)  1.0 (0.4)  1.0 (0.3)  1.0 (0.2)  1.0 (0.1)
private (% wr)       5.6 (6)    5.6 (6)    5.5 (6)    5.5 (6)    5.5 (6)    5.5 (6)
instructions         15.0       14.9       14.9       14.9       14.8       14.8

Table 3.3: Per processor reference count for the workload, in millions. 256Kbytes secondary caches.

Table 3.3 shows the reference counts per class of reference for the six programs in the workload. See the Appendix for the variations in the reference counts and hit ratios for the five cache sizes simulated. chol() behaves differently for different cache sizes since the processing of supernodes takes cache size into account to increase data locality - see the tables with reference counts and hit ratios on page 118. Even so, there is a fair amount of data migration while a given supernode is being eliminated. mp3d() exhibits poor locality since the molecules' positions change considerably during the simulation interval. As molecules collide, the data describing their position and velocity is shared by the processors to which the molecules were initially assigned. water() has good locality, partly because of the fair amount of computation involved in evaluating positions, speed and energy levels, partly because of the nature of the physical system itself: the molecules are "heavy" and do not move much.

In summary, the workload contains two programs that are compute-intensive (ge() and water()), two with poor locality and a fair degree of data sharing (paths() and mp3d()), one with somewhat erratic behaviour (chol()) and one with little sharing (mmult()). These four types of behaviour encompass various levels of traffic on the interconnect and differing patterns of data sharing. Thus, both components of SCI, namely the interconnect and the cache coherence protocol, can be examined under a significant range of loading and stress.

3.4 Accuracy of the Simulation Results

The conclusions one can formulate from experimental data are only as good as the accuracy of the data on which they are based. The underlying assumptions and idealisations embedded in the architecture simulator described in this chapter are a compromise between accuracy and the computational cost of attaining such accuracy. While much effort was spent in ensuring the correctness of the "implementation" of the cache coherence protocol, it is nevertheless a model of the actual protocol, which has to handle all the complexities of a distributed implementation. The simulator does not model intra-node contention for access to buses and to cache and memory arrays. This is a reasonable assumption since the secondary caches have high hit ratios. If, however, a node becomes a hot spot, the extra traffic and delays are not accounted for.

The behavioural model of the interconnect is based on average traffic at the interconnect rather than instantaneous values. This is a good approximation since the bandwidth available is very high. Only where traffic is high may the predictions of the model lose accuracy, due to localised traffic fluctuations. The SCI link input-queues are modelled as infinite buffers and the possible retransmissions of busied packets are ignored. The simulations show that queue utilisation is low, below 20% in all cases, which indicates that under normal load only a very small number of busied packets would be produced. Thus, the accuracy of the network simulator lies between that of detailed simulation of the SCI communication protocol, where the simulator keeps track of every symbol travelling on the ring [BDMR92,SGV92,Sco92], and that of trace post-processing [MB92,BD93] or statistical analysis, where the network simulator is driven by random access patterns [BGY87,SGV92,Sco92].

Chapter 4

The Performance of SCI Rings

This chapter contains a detailed investigation of the performance of SCI-based multiprocessors interconnected in a ring. Section 4.1 begins by defining the metrics used to quantify performance. Section 4.2 explores the design space for high-performance processing nodes by assessing the performance of different cache hierarchies. Section 4.3 examines the bandwidth and latency characteristics of SCI rings. Finally, Section 4.4 discusses the performance of SCI rings in relation to that of some of the existing or proposed ring-based multiprocessors. Section 4.4.1 compares the performance of DASH and an SCI-based multiprocessor with similar architectural parameters. Parts of Sections 4.2 and 4.3 were previously published in [HT94a,HT94b].

4.1 Performance Metrics

Within a given price range, the most important characteristic of a computing system is its speed. The question most often asked is "how long does it take to run this or that program?" Since the focus here is on scientific computation, the speed metric is defined as the time the machine takes to execute a program. The choice of design parameters that will minimise execution time and cost is a tradeoff between the cost of each subsystem or component and the improvement in speed achievable by incorporating that component, subsystem or policy, into the architecture.

Other metrics are useful in the evaluation of an architecture. Cache hit ratios are very important in assessing memory hierarchies, along with the timing associated with the levels. The locality of reference of different types of memory objects

gives rise to three different hit ratios. Instruction references have good spatial and temporal locality and the instruction hit ratios are normally high. The locality of shared data depends on the type of data, e.g. arrays or barriers and locks. Thus, the shared-data hit ratio measures the hit ratio of shared-data references only. Lastly, the private-data hit ratio indirectly measures the locality of references for non-shared data, i.e. stack and heap areas.

In an SCI-based shared memory multiprocessor, data that is actively shared by processors is kept in linked lists, rooted at the data's home memory. When the data is to be updated, the list collapses, the data is updated and the sharing-list is eventually re-established. The collapsing of sharing-lists involves message exchanges between the processor at the head of the list and each of the other nodes in the list. Sharing-list length is defined as the number of copies that have to be purged when a line is updated. The sharing-list length reflects the level of interference between processors on each other's computation. Because of the serialisation imposed by the coherence protocol, the cost of purging grows linearly with the length of the sharing-list.

The transport mechanism of SCI is based on unidirectional point to point links. The simplest topology that can be implemented with these links is the asynchronous insertion ring. The transmission of a packet is completed when its echo is received by the transmitter. The time lapse between the insertion of a packet into the output buffer and the receipt of its echo is the round-trip delay of the ring. The number of packets a node can transmit per time unit depends on the traffic on the ring. The traffic seen by a node at its ring interface is defined as the number of symbols per time unit that is output by the ring interface. It consists of all the symbols passing through plus those inserted by the node itself. Throughput is the number of symbols per time unit inserted by the node and measures the amount of coherence-related traffic generated by the processor and cache/memory controllers.
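In the notation of Section 3.2.2, the two ring metrics just defined can be restated as follows (this is a restatement of the definitions above, not a new result):

\[
  \mathit{traffic} = S_{pass} + S_{tx}, \qquad \mathit{throughput} = S_{tx},
\]

where S_{pass} and S_{tx} are the throughputs of the bypass and output buffers of the node's ring interface.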

A program is said to be processor bound if the largest proportion of the execution time is spent performing instructions. Conversely, a program is memory bound when the largest fraction of the time is spent on data references. The proportion of references to shared data is only a small fraction of all memory accesses performed by the processor, yet they sometimes account for a large fraction of the execution time. The "boundedness" of a program is relevant because it indicates the level of demand placed by the program on the distributed memory system.

4.2 Node and Ring Design

In this section the design space for high-performance processing nodes is explored and the scalability of SCI rings is investigated. Section 4.2.1 presents the simulation parameters considered and discusses the influence these have on system performance. Section 4.2.2 describes the behaviour of the applications with respect to execution time breakdown and coherent cache hit ratios. The effects of cache size and latency are investigated in Section 4.2.3. The generation scalability of SCI rings is considered in Section 4.2.4.

4.2.1 Design Space

The design of a memory system consists of selecting a set of architectural parameters, within price and performance constraints, that will yield a low-latency high-bandwidth path between processor and memory. The combination of parameters has to be tested with what is considered to be a "typical" workload. The design parameters investigated here are secondary cache size and latency, memory access latency, ring size and processor clock speed. The experiments relate ring size to changes in one of the other parameters. The idea is to assess the effect of each individual simulation parameter on the overall performance while relating these changes to ring size.

The cache sizes investigated are 64, 128, 256 and 512Kbytes. The size of caches should be chosen to minimise the miss ratios, that is, as large as possible, and to reduce the number of cycles the processor stalls waiting for memory references to be satisfied. The cache latency depends to a large extent on the memory technology and on the sophistication of the cache policies such as replacement, write-buffers and write-through/back. Within this range of sizes (64Kbytes to 512Kbytes) the latency is independent of the size of the memory array because drivers for data and address lines can be designed to handle the slightly larger loads with relative ease. The latency of static RAMs used in cache design is of the order of a few processor clock cycles. Given the complexity of the cache controller, the latency of the coherent caches was estimated to be three processor clock cycles. The influence of tag access latency is investigated for latencies of two and four processor clock cycles.

The access latency of DRAMs is of the order of tens of nanoseconds - 60 to 180 (in 1994), depending on the size and organisation of the memory array. When considering the overhead imposed by the coherency protocol, the latency of the memory was set to 120ns. The influence of memory access latency is investigated for latencies of 80 and 160ns.

The number of processors in a ring imposes an inherent limit on the performance of the machine because each packet must travel, on average, one half of the ring while its echo must complete the round. Thus, in the absence of traffic, the static latency imposed by the ring interfaces already imposes a limit on performance. More processors on a ring imply more traffic and consequently longer round-trip latencies. The ring sizes investigated are 2, 4, 8, and 16. The results for uniprocessors are also included to provide a basis for comparison and to assess the effects of interference amongst cooperating processors.

The clock speed of processors doubles roughly every two years. The generation scalability of SCI-based systems is investigated for one clock frequency doubling from 100MHz to 200MHz. The higher rate of memory references, caused by the faster clock, places an extra burden on the interconnection network, potentially doubling the traffic through it.

There is a complex relationship between these design parameters and their effect on performance. Larger caches yield better hit ratios and a faster rate of execution because of the reduction in stalled cycles. The faster rate of execution also means a faster rate of memory requests which might, in turn, increase the number of compulsory and consistency misses. An increase in consistency misses implies higher network traffic and latency. Faster caches increase the execution rate without reducing capacity and conflict misses. In the following sections some of these trade-offs are explored.

4.2.2 Characterising the Workload

The execution time breakdown for each of the programs in the workload is presented next. The plots show the contributions by instruction fetch and execution and by private and shared-data references. The graphs show the breakdown by activity for systems with 64, 128, 256 and 512Kbytes secondary caches. The graphs also show the breakdown by activity for infinite caches since, with very large caches, capacity misses do not occur and consistency misses are kept to a minimum. The shared-data hit ratios for the four smaller cache sizes are also shown. These data combined provide a good picture of the applications' behaviours.

Execution Time Breakdown

When assessing the performance of a given architecture, the amount of time programs spend on each type of activity must be quantified. In order to do this for SCI systems, execution time was split into six activities. A program spends time (1) performing instructions, (2) on references to private data, (3) waiting at synchronisation points (barriers and locks), (4) on references to shared data that is local to the processor, (5) on references to shared data at another node and (6) in delays caused by network latency. The fraction corresponding to references to shared data at another node includes all the delays involved in remote protocol actions, such as cache/memory tag access latencies, as well as loading/storing data from/to caches and main memory. In the simulations performed, time spent performing instructions and references to private data is independent of network traffic. The fraction of time spent on remote references is given by the sum of the network delays and memory/cache latencies at the remote nodes.

Figure 4.1 shows that on all ring sizes, chol() spends over 50% of the time executing instructions and, for ring sizes 2-8, over 20% of the time accessing shared data at the local cache and memory. For the 16-node ring, that falls to about 10%. Thus, chol() is memory bound. In Figure 4.1, it can also be seen that water() spends over 50% of the time performing instructions and over 25% referencing private data. Although its shared-data hit ratios are not very high, less than 15% of the time is spent on shared-data references. Thus, water() is processor bound.

Figure 4.1 also shows, for mp3d(), that the uniprocessor spends 50% of the time on instructions, 30% on data that would be shared on a multiprocessor, and 23% of the time on private data. These values, on a 16-node ring, fall to 10%, 5% and 5%, respectively. The percentage of time spent on network latency climbs steadily from 0% to just over 45%. Thus, mp3d() is memory bound. A puzzling feature of this application is that performance worsens with larger caches. This is caused by cache pollution: larger caches hold more data from molecules that "moved away" to another node in previous time steps. For instance, on 8-node rings, the number of sharing-lists purged per shared-data miss at the coherent cache increases with cache size: 0.927, 0.962, 0.979, 0.991, 0.999, for 64, 128, 256, 512Kbytes and infinite caches, respectively. The average number of copies purged is 1.1 and is independent of cache size. It seems that the simulator models the multiprocessor system at an adequate level of detail since such a counter-intuitive and complex behaviour has been exposed and can be explained from the simulation data.


Figure 4.1: Execution time breakdown for chol() (top), mp3d() (center) and water() (bottom). ins stands for instructions, lcl for private data references, syn for synchronisation, shd for local shared-data references, sci for cache and memory latencies on remote references and ntw for network latency.


Figure 4.2: Execution time breakdown for ge() (top), mmult() (center) and paths() (bottom). ins stands for instructions, lcl for private data references, shd for local shared-data references, sci for cache and memory latencies on remote references and ntw for network latency.

ge() spends over 67% of the time executing instructions, and 15% on shared-data references - see Figure 4.2. Thus, ge() is processor bound. mmult(), in almost all cases, spends over 50% of the time performing instructions. Since this application has little write-sharing, the three larger cache sizes (256, 512Kbytes and infinite) spend little time on remote references. The 8-node rings suffer higher instruction miss ratios than the other system sizes because of conflict misses. paths(), on rings of up to 8 nodes, spends over 75% of execution time performing instructions, and about 10% on each of private and shared-data references. For the data sets used here and the associated shared-data hit ratios (see below), paths() is processor bound. However, if the shared-data hit ratio falls below 90%, paths() becomes memory bound. Except for mp3d() and paths(), the programs are all processor bound. paths() is a borderline case: a decrease in the shared-data hit ratio can make it memory bound.

Coherent Cache Hit Ratios

The shared-data read hit ratios of the six programs are shown in Figure 4.3, page 48. See the Appendix for the hit ratios of all five types of memory reference. For chol(), shared-data hit ratios are always above 90%. mp3d() has the worst hit ratios of the workload and the ratios deteriorate with the larger data sets but do not vary significantly with cache size. Even though water()'s hit ratios are not very high, it spends less than 15% of the time on shared-data references.

ge() has, for all ring (1-16) and cache (64-512Kbytes) sizes, secondary cache hit ratios above 97%, for both data and instructions. The shared-data hit ratio of mmult() improves with increasing cache size, from about 87% (64Kbytes) to over 97% (512Kbytes). paths(), on the 16-node ring and with 256Kbytes caches, has a shared-data hit ratio about 7 percentage points lower than on smaller rings and this in turn causes the time spent on network latency to jump from under 5% to 28%. For a 64Kbytes cache, this last value is 47%.


Figure 4.3: Shared-data read hit ratios for 64K (top left), 128K (top right), 256K (bottom left) and 512Kbytes coherent caches (bottom right). 'ch' stands for chol(), 'mp' for mp3d(), 'w' for water(), 'ge' for ge(), 'mm' for mmult(), 'p' for paths().

4.2.3 Cache Size and Cache Access Latency

Coherent cache size and tag access latency are two of the factors with the most impact on the performance of memory hierarchies. The effect of cache size is examined here. Figure 4.4 (page 51) displays the execution time as a function of ring and cache size for the three SPLASH programs. Recall that the data-set sizes are scaled up to keep the work each processor does constant - see Table 3.2 (page 36). For chol(), on a 4-node ring, the 128Kbytes cache is about 35% slower than the two larger sizes. The difference is not as pronounced for the other ring sizes. The 64Kbytes cache being faster than the 128Kbytes one is due to an optimisation in chol(), by which the supernodes are chosen to fit the coherent caches. For all cache sizes (64-512Kbytes) and ring sizes 2-16, mp3d() has shared-data hit ratios that are within one percentage point of one another. The same is true of the fraction of run time due to network latency, except that the interval is under 4%. On a 16-node ring, water()'s shared data-set does not fit in the 64Kbytes caches; hence the difference in execution time between the 64K and 128-512Kbytes coherent caches.

Figure 4.5 (page 52) shows the relationship between cache and ring size and speed for the three parallel loops. Recall that the data-set sizes are scaled up with machine size - see Table 3.2 (page 36). For ge(), the differences in run time are below 4%, which agrees with the rather small changes in shared-data hit ratio with cache size. The performance of the system, when executing mmult(), improves with larger cache sizes. The improvement comes from a reduction in conflict misses and network delays. The 8-node machines endure higher instruction miss rates because of conflict misses. As discussed above, paths() is a borderline program: if the caches cannot accommodate the working set, the program's speed is bound by the speed of the memory and, ultimately, by the network latency. For the 64Kbytes cache, the impact of the network latency increases dramatically with ring and data-set sizes because of the poorer hit ratios.

Sharing-list length. paths() has an average sharing-list length that grows roughly as P/2, for P processors. The other five programs have sharing-list lengths of one or less for ring sizes 2-8 and under 1.2 for 16-node rings. Sharing-list length is fairly independent of cache size. This is in agreement with [WG89] in that most of the shared data in chol(), mp3d() and water() is migratory in nature. The same can be said of ge(), given its algorithm and simulation statistics. See the Appendix for the sharing-list lengths of all six programs.

Cache Size, Cache and Memory Tag Access Latency

Table 4.1 shows the effects on performance of changing one of the major design parameters while keeping the other two constant. The basis for comparison is a system with 256Kbytes coherent caches with 3 processor cycles of tag access latency, and a memory access latency of 120ns. Systems with four and eight nodes were simulated with (1) 128 and 512Kbytes secondary caches, (2) 256Kbytes secondary caches with tag access of two (20ns) and four processor clock cycles (40ns), and (3) memory tag access of 80 and 160ns. The table shows that the factor with the most influence is the cache tag access latency (between -13% and +14%) while memory access latency has the least influence (between -6% and +6%).

For the workload studied here, caches with 2 processor cycles of access latency yield an average 8.5% speed improvement while, on average, the speed loss can be 8.4% (8.6%) on the 4-node (8-node) ring with a 4-cycle latency coherent cache. System designers have to weigh the cost increase against the speed gains when specifying memory technology. The plots in Figures 4.4 and 4.5 provide evidence against the use of 64Kbytes secondary caches. The more conservative cache latency of 3 cycles was adhered to for the experiments reported here. Note that the values in Table 4.1 follow the pattern of concave curves relating performance to changes in cache design parameters, as discussed by Przybylski in [Prz90].

                              4 nodes                               8 nodes
            c size      c latency     m latency     c size      c latency     m latency
            128   512   2 cy   4 cy   80    160     128   512   2 cy   4 cy   80    160
chol()     1.15  1.01   0.88   1.14  0.98  1.00    1.06  0.99   0.87   1.13  0.97  1.04
mp3d()     1.02  0.98   0.92   1.08  0.94  1.06    1.00  1.02   0.95   1.08  0.96  1.05
water()    1.00  1.00   0.91   1.10  0.99  1.01    1.03  1.00   0.90   1.09  0.99  1.01
ge()       1.01  1.00   0.93   1.07  1.00  1.00    1.01  1.00   0.93   1.07  1.00  1.00
mmult()    1.06  0.99   0.93   1.04  0.98  1.02    1.09  0.99   0.93   1.02  0.99  1.01
paths()    1.03  1.00   0.94   1.07  0.99  1.01    1.04  1.00   0.94   1.06  1.00  1.01
average    1.04  0.99   0.92   1.08  0.98  1.02    1.04  1.00   0.92   1.08  0.98  1.02

Table 4.1: Sensitivity of execution time to variations in cache size, cache latency and memory latency. The basis is a 256Kbytes cache with 3 processor cycles of tag access latency and 120ns memory access latency. See text for details.



Figure 4.4: Execution time as a function of cache size, for chol() (top), mp3d() (center) and water() (bottom). Time is broken down into network latency, references to shared-data and references to local data and instructions. Data sets are scaled up with machine size.



Figure 4.5: Execution time as a function of cache size, for ge() (top), mmult() (center) and paths() (bottom). Time is broken down into network latency, references to shared-data and references to local data and instructions. Data sets are scaled up with machine size.

4.2.4 Processor Clock Speed

Microprocessor technology is evolving at such a pace that the speed of processors, and indeed of workstations, doubles roughly every two or three years. What can be said about the performance of SCI, when the next generation of processors is introduced? Figure 4.6 shows the speedup attained by doubling the processor clock speed while keeping the other parameters unchanged. Note that coherent cache access latency is 3 processor clock cycles in both cases.


Figure 4.6: Speedup achieved by doubling processor clock frequency, with cache sizes of 64K (left) and 256Kbytes (right). 'ch' stands for chol(), 'mp' for mp3d(), 'w' for water(), 'ge' for ge(), 'mm' for mmult(), 'p' for paths().

Some of the loss in speedup can be attributed to the relatively slower memory hierarchy, the influence of which can be gauged from the values for the uniprocessor - between about 10 and 37% loss in speedup. As discussed earlier, for a 100MHz clock, an increase of 30% in memory latency slows execution down by up to 6%, chol() and mp3d() being the worst affected. Most of the loss in speedup for chol(), mp3d() and paths() is caused by network saturation. Plots of the ratio of link traffic for 100 and 200MHz processors are almost identical to those in Figure 4.6. Programs that generate low levels of network traffic can use a lot more bandwidth, whereas programs that nearly saturate the ring suffer even higher round-trip delays with a faster rate of network requests.

4.3 Throughput and Latency

The factors that most influence the performance of a network are throughput and latency. Throughput is the amount of data that each processor can inject into the network per time unit. Network latency is the time it takes for a packet to be delivered and acknowledged. The level of traffic on the network is also important because latency increases with traffic since bandwidth is limited.

Node throughput. Figure 4.7 shows the throughput per node, that is, the number of bytes inserted per time unit in the output buffer by the processor and cache/memory controller. Note that the measured throughput includes packet header overhead. Data throughput would be somewhat lower. The reason for including header overheads in the throughput measurement is that cache coherency commands are embedded in the packet headers and these comprise a large fraction of the information transferred by the cache coherence protocol.


Figure 4.7: Throughput per node, coherent cache sizes of 64K and 256Kbytes. 'ch' stands for chol(), 'mp' for mp3d(), 'w' for water(), 'ge' for ge(), 'mm' for mmult(), 'p' for paths().


Round-trip delay. Figure 4.8 shows the average round-trip delay as a function of ring size. This delay is the time elapsed from inserting a packet in the output buffer until its echo is stripped by the sender. Note that latencies experienced accessing memory and caches are not included. The static latency for a 16-node ring is 116ns, for an average packet size of 11 symbols. chol(), water(), ge() and mmult() generate low network traffic and enjoy low latencies. mp3d() and paths() endure much higher latencies because of their higher throughputs and the resulting network congestion.


Figure 4.8: Average round-trip delay, with cache sizes of 64K (left) and 256Kbytes (right). 'ch' stands for chol(), 'mp' for mp3d(), 'w' for water(), 'ge' for ge(), 'mm' for mmult(), 'p' for paths().

Throughput versus latency. It is normally easier to increase bandwidth than to reduce latency, given today's technological constraints. The relationship between node throughput and round-trip delay indicates how well a network design balances latency and bandwidth. The simulation data recorded in this dissertation provide enough points to plot throughput versus latency on SCI rings; this is shown in Figure 4.9. From the left, the data points are from water(), chol(), paths(), mp3d() and, again, paths() and mp3d() with a 200MHz processor clock frequency.

The plots show a linear relationship between latency l and throughput s for 2-, 4- and 8-node rings. For 16-node rings, that relationship is a parabola with a small quadratic coefficient. The equations that describe ring behaviour, obtained by the least-squares method, are given below. The lines defined by the equations are superimposed on the data points in Figure 4.9. Note that these equations are valid for throughputs in the interval [2, 95]. Equation 4.5 is the least-squares parabola computed for 16-node rings.

l_2  = 0.16 s + 39.32    (4.1)
l_4  = 0.14 s + 50.82    (4.2)
l_8  = 0.38 s + 73.53    (4.3)
l_16 = 1.40 s + 119.81    (4.4)
l_16 = 0.011 s^2 + 0.368 s + 133.95    (4.5)
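These fitted curves are reused as the function ring() in the analytical model of Chapter 5 (Equation 5.28). The following Python sketch restates Equations 4.1-4.5; the function name, the argument check and the example values are illustrative additions, not part of the original model.

# A minimal sketch of the fitted latency curves: throughput s in
# Mbytes/s, round-trip delay in ns; the fits hold for s in [2, 95].
def ring_latency(num_nodes: int, s: float, quadratic16: bool = False) -> float:
    """Round-trip delay predicted by Equations 4.1-4.5."""
    if not 2.0 <= s <= 95.0:
        raise ValueError("fits are only valid for throughputs in [2, 95]")
    if num_nodes == 2:
        return 0.16 * s + 39.32                       # Equation 4.1
    if num_nodes == 4:
        return 0.14 * s + 50.82                       # Equation 4.2
    if num_nodes == 8:
        return 0.38 * s + 73.53                       # Equation 4.3
    if num_nodes == 16:
        if quadratic16:
            return 0.011 * s**2 + 0.368 * s + 133.95  # Equation 4.5
        return 1.40 * s + 119.81                      # Equation 4.4
    raise ValueError("no fit for this ring size")

# Example: a 16-node ring at 40 Mbytes/s per node.
print(ring_latency(16, 40.0))        # linear fit:   175.81 ns
print(ring_latency(16, 40.0, True))  # parabola fit: 166.27 ns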


Figure 4.9: Latency versus throughput on 2-, 4-, 8- and 16-node rings.

The plots in Figure 4.9 do not show the same behaviour as those obtained with an analytical model by Scott et al. in [SGV92]. There are some differences in the underlying models and assumptions between their model and the simulations discussed in this dissertation. Their model is based on M/G/1 queues and the ring is modelled as an open system: increasing network delays do not decrease the rate of processor requests. Here, the simulator uses 5 packet sizes rather than 3, resulting in average packet sizes of 9-11.5 symbols rather than 12.4. Furthermore, the frequencies of each type of packet are also different: here, data-carrying packets account for less than 14% of all injected packets, whereas they assume that 40% of all injected packets are 40-symbol packets. Cache coherence related traffic is also much higher in the simulations here. The ring, as simulated here, does not behave like an open system since processors stall on remote references, and that decreases their rate of network requests. This behaviour is akin to negative feedback. That


is why Figure 4.9 does not show the pronounced saturation produced by Scott's model. However, network saturation does occur and its effects can be clearly seen in Figures 4.7, 4.8 and 4.9, as well as in the coefficients of Equations 4.1-4.5. If the rate of network requests were increased by the use of multithreading or weaker memory consistency, saturation effects would tend to be more pronounced because of the potential increases in traffic.

Link traffic. Figure 4.10 shows the traffic per link as a function of ring size. The traffic consists of the packets inserted by a node plus the packets passing through that node towards downstream nodes. mp3d() and paths() produce high levels of traffic and suffer higher latencies. Traffic levels of 600 to 700Mbytes/s are a limiting factor in the performance of SCI-connected systems since, at these levels, round-trip delays hold down the rate of network requests by processors. Bypass buffers endure utilisations of over 50%, which leaves few opportunities for injecting packets into the ring. Figure 4.10 and the discussion in Section 4.2.4 (page 53) are clear evidence of the effects of network saturation: doubling the processor clock rate does little to improve the performance of programs that are already driving the network into saturation.


Figure 4.10: Traffic per link, for cache sizes of 64K (left) and 256Kbytes (right). 'ch' stands for chol(), 'mp' for mp3d(), 'w' for water(), 'ge' for ge(), 'mm' for mmult(), 'p' for paths().

Average packet size and distribution. The average packet size varies from 18.0 to 22.4 bytes, with smaller rings carrying larger packets. Also, smaller caches generate more of the smaller packets that carry the cache coherency commands. Figure 4.11 shows the distribution of packet sizes for chol(), mp3d() and water() with 256Kbytes coherent caches. Data-carrying packets account for less than 7% of all packets in 8- and 16-node rings. Thus, data throughput accounts for 20 to


30% of raw throughput. With the exception of echoes, all packets carry cache coherency information, and any network delays faced by these packets slow down all operations that involve shared data. Figure 4.12 plots average packet size as a function of cache size. Notice that pkts48 are 'cache-write' packets that carry the current copy of lines between sharing caches.


Figure 4.11: Distribution of packet sizes for 256Kbytes caches.


Figure 4.12: Average packet size for 64K (left), 128K, 256K and 512Kbytes (right) caches.

4.4 Other Ring-based Systems

In order to compute the cost of a remote transaction, memory and cache tag access latencies must be added to the round-trip delay. For the simulations reported here, the worst case is a cache-to-memory transaction: ring latency + 246ns (30ns + 16ns plus 120ns + 80ns). The best case is a cache-to-cache transaction, such as an invalidate transaction, costing ring latency + 60ns (2 x 30ns). Barroso and Dubois, in [BD93], present simulation results for the Express Ring. That multiprocessor's interconnect is based on a slotted ring and cache coherence is maintained by a snooping protocol [BD91]. On a ring with 8 nodes, the shared-data miss latency for chol(), mp3d() and water() is between 280 and 320ns; on a 16-node ring, between 320 and 380ns; and, on a 32-node ring, between 390 and 440ns. On 8-node rings, the shared-data miss latencies of an SCI ring are comparable to those of a slotted ring. On 16- and 32-node rings, the SCI ring would have higher latencies.

A comparison with the KSR1 is difficult to make for lack of performance data on the applications employed here. It is likely the results would show the same broad tendencies as those of DASH since the two machines are built from similar technologies - SCI's faster network would provide a performance advantage. The Hector multiprocessor [VSLW91], using a hierarchical snooping protocol [FVS92, HS94], should have performance comparable to that of the Express Ring. Holliday and Stumm report in [HS94] that Hector's hierarchical protocol scales well to a large number of processors (1024) if the applications possess good locality characteristics.

4.4.1 Comparing DASH and SCI

The Directory Architecture for SHared memory (DASH) multiprocessor was conceived at Stanford University as a workbench for exploring the design of logically-shared physically-distributed memory multiprocessors [LLG90, GGH91, GHG91]. A DASH prototype was built and its implementation and performance are discussed by Lenoski et al. in [LLJ92]. The availability of performance data makes possible a comparison between DASH and an SCI-based parallel machine with similar architectural parameters. However, because of intrinsic differences in architecture and run-time environments, i.e. simulation compared to an actual machine, strict quantitative comparisons would be misleading. A qualitative comparison can nevertheless be informative. The two architectures are described below, followed by the performance comparison.

The DASH architecture. DASH consists of clusters of processing nodes in- terconnected by twin meshes. The clusters are bus-based multiprocessors within which cache consistency is maintained by a snooping protocol. Inter-cluster con- sistency is maintained by a full-directory invalidation protocol [LLG90]. Each cluster contains four processors; each processor is connected to a 128Kbytes split primary cache (64Kbytes for instructions and 64Kbytes for data, write-through) and a 256Kbytes write-back secondary cache. Both caches are direct mapped and support 16-byte lines. The interface between primary and secondary caches con- sists of a 4-word deep write-buffer and a one-word read-buffer. The intra-cluster cache coherence protocol allows for cache-to-cache transfers. This makes the four secondary caches appear as a single (cluster) cache to remote nodes. Processor clock speed is 33MHz. The clusters also contain memory, the memory directory and inter-cluster communication interfaces. The interconnection network consists of two wormhole routed networks, one for requests and one for responses. The latency through each node (network hop) is about 50ns. The peak bandwidth is 120Mbytes/s in and out of each cluster.

The SCI ring. The SCI ring was simulated with only one processor per node, with the same clock speed and cache hierarchy as DASH. The primary cache is split (64K + 64K), direct mapped, write-through. Secondary caches are 256Kbytes, direct mapped, write-back. Lines are 64 bytes wide. There is no write-buffer between the caches. Cache and memory latencies per word are the same as those in DASH - see Table 4.2. Notice that on the SCI ring, the latencies vary with ring size and, in the case of writes, the number of copies of the line. The processing nodes are interconnected by an SCI ring.

Cache operation                          DASH    SCI ring
Read from primary cache                     1       1
Fill from secondary cache                  14      14
Fill from local memory                     26      26
Fill from remote node                      72      31-37
Fill from dirty-remote, remote home        90      49-60
Write owned by secondary cache              2      14
Write owned by local node                  18      14
Write owned by remote node                 64      63-207
Write to dirty-remote, remote home         82      63-207

Table 4.2: Cache and memory operation latencies, in processor clock cycles.

Some of the architectural differences can make a direct comparison difficult. The clustering of processors in DASH has the potential to significantly reduce the latencies of operations that involve only local processors. DASH's write-buffers, although memory is sequentially consistent, do hide some of the write latencies. The SCI ring, on the other hand, provides about four times the bandwidth of DASH's twin meshes, and its static network hop latency is much smaller as well. These differences might well cancel each other out, depending on workload characteristics. Recall that page faults are assumed to have a cost of zero for the SCI ring and that the simulated processors never stall because of internal data dependencies or pipeline bubbles, for example. These assumptions tend to give the SCI multiprocessor an unfair advantage.

The workload for the comparison consists of chol(), mp3d() and water(). Data sets are: for chol(), bcsstk14; for mp3d(), 40000 molecules simulated over 5 time steps; and for water(), 343 molecules simulated for two steps. In [LLJ92], Water is run with 512 molecules; the simulation time for that many molecules is prohibitively long, so water() was simulated with a smaller data set. The comparison should still be valid since, in the 16-node case, water() makes 2.6 million references per processor to shared data and that is enough to fill up the secondary caches at least a few times.

Speedup. Figure 4.13 shows the speedup plots for both DASH and the SCI ring. Speedup data for DASH was taken from Figure 6 and Table 5 in [LLJ92]. The plot shows that chol() has a very similar speedup on both architectures. The differences for 2 and 4 nodes are most likely related to better data mapping in the SCI ring. The differences on mp3d() are more pronounced. Network traffic is much higher than with the other two programs and, given SCI's higher bandwidth and lower latencies, it is not surprising that mp3d() scales up better on the SCI ring. For water(), SCI's advantage comes partly from the better network, partly from the smaller number of molecules and the resulting improvement in hit ratios. In DASH with 16 processors, water() uses 4.6% and 5.3% of available bandwidth for the request and response networks respectively. In the SCI ring, it uses less than 1% of the bandwidth. Also, the smaller latencies, coupled with the lower hit ratios on larger machines, give SCI a definite advantage. In these programs, data is mostly migratory. SCI's linear latency when purging long sharing-lists is only felt in synchronisation actions. Those, however, are infrequent when compared to data references.



Figure 4.13: Speedup plots for chol() (left), mp3d() (mid) and water() (right).

Ring size                        1       2       4       8      16

Cholesky - chol()
read refs. (10^3)             8444    9883    6875    4638    2904
write refs. (10^3)            1862    2939    2075    1191     556
RD hit ratio                  0.97    0.98    0.98    0.98    0.97
WR hit ratio                  0.98    0.98    0.98    0.99    0.98
run time (s)                 10.26    5.32    2.80    1.58    0.95
throughput (Mbytes/s)            0    1.18    2.06    2.54    2.69

MP3D - mp3d()
read refs. (10^3)             3619    3755    1594    1041    1098
write refs. (10^3)            2343    1171     585     293     147
RD hit ratio                  0.95    0.89    0.85    0.82    0.77
WR hit ratio                  0.95    0.89    0.85    0.81    0.74
run time (s)                  4.83    2.96    1.61    0.90    0.58
throughput (Mbytes/s)            0     9.5    15.8    19.9    20.6

Water - water()
read refs. (10^3)            20172   12619    8163    4093    2414
write refs. (10^3)            3071    1536     768     384     192
RD hit ratio                  0.96    0.96    0.88    0.84    0.83
WR hit ratio                  0.99    0.99    0.95    0.93    0.92
run time (s)                 41.96   21.20   10.90    5.50    2.81
throughput (Mbytes/s)            0    0.13    1.02    1.63    1.80

Table 4.3: Per-node shared-data reference counts, secondary cache hit ratios, execution time and node throughput on the SCI ring.

In order to evaluate the effect of network latencies on the performance of the 16-node SCI rings, the three programs were simulated on a 4x4 SCI mesh. This resulted in an improvement of only 2% for mp3d() and virtually no change for the other two. Were the traffic higher, the effects on performance would have been much more pronounced. The performance of SCI meshes is further investigated in Section 6.2.

Execution time breakdown. Figure 4.14 shows the execution time breakdown for the workload on the SCI ring. Time is split into: (1) busy time, when the processor is performing instructions and operations on data; (2) the time wasted on read misses (RDmiss); (3) the time wasted on write misses (WRmiss); (4) the time spent on synchronisation actions (synchr). The plot also shows the total time spent on references to shared data (shared data) and the time lost because of network latencies (network). The plots indicate that most of the stall cycles come from writes. This is partly because of the high latency of a write (14 cycles) and partly a "normal" feature of cache-coherent shared-memory multiprocessors [GGH91, GHG91]. The fraction of the time used up by chol() in references to shared data increases with ring size because of the relatively higher costs of remote references, while the fraction due to instruction fetches becomes proportionally more important as the work per processor decreases. mp3d() spends 57-43% of the time on private references, a large fraction of which is on write misses. The fraction of time water() spends on shared-data references is very small and most of the write-stalls are caused by private data misses.


Figure 4.14: Execution time breakdown for chol() (left), mp3d() (center) and water() (right). The lines show the fraction of shared-data references and network latency.

A more detailed view of the execution time breakdown is given in Figure 4.15, which shows the normalised breakdown of time spent on references to shared data only. Over 40% of the time is spent waiting for writes to complete. Synchronisation is a significant activity for mp3d() on a 16-node ring. For water(), synchronisation takes up about 30% of the time spent on shared-data references, but the overall impact is negligible because of the relative cost of each class of reference. The contribution of network latency to mp3d()'s execution time is 1, 3, 5 and 10% for 2, 4, 8 and 16 nodes respectively. For both chol() and water(), network latency takes up less than 1% of the time.


Figure 4.15: Normalised execution time breakdown for chol() (left), mp3d() (mid) and water() (right), for shared-data references.

Figures 4.16 and 4.17 show the execution time broken down by type of memory operation for chol(), mp3d() and water() with the architectural parameters used in the rest of this chapter. Compare these to Figures 4.14 and 4.15. Because of the smaller and varying data-set sizes, the instruction-related fraction is relatively larger. Note that both cache and memory access latencies are shorter in Figures 4.16 and 4.17. Also, the primary caches are 8 times smaller.

The most striking differences are the increases in the rate of memory requests and network latency and the consequent increase in the cost of shared-data references. Figure 4.16 highlights, for mp3d() in particular, the relationship between network traffic and overall performance. The line for the relative cost of shared-data references is roughly parallel to the line for network latency. That is, the relative increase in the cost of shared references, as ring size grows, is caused mostly by higher network traffic and longer latencies. This is not the case, however, for chol(), because of its fixed-size data set.



Figure 4.16: Execution time breakdown for chol() (left), mp3d() (mid) and water() (right) - 100MHz clock. The lines show the fraction of shared data references and network latency.


Figure 4.17: Normalised execution time breakdown for chol() (left), mp3d() (mid) and water() (right), for shared-data references - 100MHz clock.

Within the limits imposed by the workload employed, the comparison of the two architectures - coherence protocols and interconnects - indicates that SCI's higher network bandwidth and lower latencies compensate for any advantage that DASH's coherence protocol may offer. It is likely that this conclusion would hold for the less restrictive memory consistency models as well. However, the small sharing sets in the applications do not fully expose SCI's potential bottleneck of purging long sharing-lists serially.

Coda

The detailed investigation of the behaviour of SCI rings in this chapter studied the effects of the inherent limitations of the network, namely the linear static network latencies and the decreasing throughput under high levels of traffic. These, combined with the cache coherence protocol, place a hard limit on ring size, for a given processor and memory hierarchy. In order to scale up machine size, higher-dimensional networks must be employed. The performance of SCI meshes and SCI cubes is the subject of Chapter 6. Performance evaluation studies based on simulation provide detailed and accurate results, but at high computational cost. When extreme precision is not the main constraint, analytical models can provide reasonably accurate predictions very quickly. An analytical model of the simulated machine is presented next.

Chapter 5

A Model of the SCI-connected Multiprocessor

This chapter presents an analytical model for the performance of the SCI-based multiprocessor. The model is based on the iterative method proposed by Menascé and Barroso in [MB92]. Inputs to the model are the number of processors, reference counts, primary and secondary cache miss ratios, line flush ratio, sharing-list purge ratio and sharing-list length. The model yields the execution time of the program described by the model inputs. The architectural parameters embedded in the model are the same as in the simulation model employed in Chapter 4. The cost of the cache coherence operations is computed from the model inputs and a simplified cost model that estimates the number and size of packets injected into the ring. The throughput versus delay model of the SCI ring is described by Equations 4.1 to 4.4.

5.1 The Analytical Model

This section defines the model equations and relates them to the simulation model described in Chapter 3. The execution time is computed as follows. For each type of memory reference (instruction fetch if, local data read lrd, local data write lwr, shared data read srd and shared data write swr), the number of references xcnt to each level of the memory hierarchy is computed from the individual reference counts and the miss ratios up to that level. Thus, ignoring for the moment references to remote locations, the time taken by each type of reference, for primary cache miss ratio pcm and coherent cache miss ratio ccm, is


ifCost  = ifcnt · (pClk + pcm_if · fillpC + pcm_if · ccm_if · fillcC)    (5.1)
lrdCost = lrdcnt · (pcm_lrd · fillpC + pcm_lrd · ccm_lrd · fillcC)    (5.2)
lwrCost = lwrcnt · (pcm_lwr · wrTrupC + pcm_lwr · ccm_lwr · wrTrucC)    (5.3)
srdCost = srdcnt · (pcm_srd · fillpC + pcm_srd · ccm_srd · fillcC)    (5.4)
swrCost = swrcnt · (pcm_swr · wrTrupC + pcm_swr · ccm_swr · wrTrucC)    (5.5)

where pClk is the processor clock period, fillpC is the cost of bringing a line from the coherent cache into the primary cache, fillcC is the cost of bringing a line from local memory to the coherent cache and, wrTrupC and wrTrucC are the costs of writing through the primary and coherent caches respectively.
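A minimal Python sketch of Equations 5.1-5.5 follows; the dataclass and function names are illustrative additions, and the latencies are plain numbers in whatever time unit is convenient.

# Per-reference-type cost (Equations 5.1-5.5). Reference counts (cnt)
# and miss ratios (pcm, ccm) are per reference type.
from dataclasses import dataclass

@dataclass
class Latencies:
    pClk: float     # processor clock period
    fillpC: float   # fill the primary cache from the coherent cache
    fillcC: float   # fill the coherent cache from local memory
    wrTrupC: float  # write-through cost at the primary cache
    wrTrucC: float  # write-through cost at the coherent cache

def read_cost(cnt, pcm, ccm, lat, is_ifetch=False):
    """Equations 5.1, 5.2 and 5.4: only instruction fetches pay pClk."""
    base = lat.pClk if is_ifetch else 0.0
    return cnt * (base + pcm * lat.fillpC + pcm * ccm * lat.fillcC)

def write_cost(cnt, pcm, ccm, lat):
    """Equations 5.3 and 5.5: writes go through both cache levels."""
    return cnt * (pcm * lat.wrTrupC + pcm * ccm * lat.wrTrucC)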

To each of the above, the time to perform remote memory references must be added. For local data accesses, remote references occur on conflict misses when a line with shared data must be flushed from the coherent cache. For shared data, remote references occur on conflict misses when other shared data must be flushed, on capacity and compulsory misses when new data must be brought into the cache, and on shared data updates when copies must be invalidated before the write to memory can take place. Since instruction references attain very high hit ratios, the effects of remote references caused by instruction fetch misses can be safely ignored.

The cost of remote references can be split into two components. The first is the time taken by cache and memory tag-access latencies. The second component is the network latency which depends on the level of traffic and on the number of packets exchanged in remote cache/memory transactions.

In order to simplify the model, it is assumed that shared data is uniformly distributed amongst the nodes and that the location of the head of a sharing-list is independent of the location of its home memory. The probability that the head of a sharing-list is at a given node (hdHere) is assumed to be inversely proportional to the length of the sharing-list. For N nodes, the probability that the home memory of a given line is the node from whose cache the line is being flushed/purged is homeHere. However, on some cache operations (e.g. a read of a remote line), the head of the sharing-list cannot be at the node where the operation is taking place (normally a requester). In these cases, the probability that the line is at its home node is 1/(N - 1): since the line is not in the requester's cache, there are N - 1 other places where it could be.

hdHere = 1 / (shLstL + 1)    (5.6)
hdAway = 1 - hdHere    (5.7)
homeHere = 1 / N    (5.8)
homeElsw = 1 - homeHere    (5.9)
hdAtHome = 1 / (N - 1)    (5.10)
hdNotAtHome = 1 - hdAtHome    (5.11)
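Stated as code, these probabilities are (a minimal sketch; the dict-returning helper is an illustrative addition):

# The probabilities of Equations 5.6-5.11, for N nodes and an average
# sharing-list length shLstL.
def probabilities(N: int, shLstL: float) -> dict:
    hdHere = 1.0 / (shLstL + 1.0)        # Equation 5.6
    homeHere = 1.0 / N                   # Equation 5.8
    hdAtHome = 1.0 / (N - 1)             # Equation 5.10
    return {
        "hdHere": hdHere,
        "hdAway": 1.0 - hdHere,          # Equation 5.7
        "homeHere": homeHere,
        "homeElsw": 1.0 - homeHere,      # Equation 5.9
        "hdAtHome": hdAtHome,
        "hdNotAtHome": 1.0 - hdAtHome,   # Equation 5.11
    }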

On a conflict miss caused by local data references, the shared line has to be flushed from the cache. This is accomplished by sending control messages to the nodes towards the head and tail of the sharing-list. It is assumed that one message is always sent, thus ignoring the cases where the line is not being shared. If the line is the head of a sharing-list (hdHere), the line's contents have to be sent to the next node in the list, or written back if the line is dirty. A control packet is sent out only if the home of the line is some other node.

xTagAccs = flushRt · hdHere · fillcC    (5.12)
xTrips = flushRt · (1 + homeElsw · (hdHere + hdAway))    (5.13)
xSymb = flushRt · (16 + homeElsw · (hdHere · 48 + hdAway · 16))    (5.14)
xRem = rtDelay · xTrips    (5.15)

x ∈ {lrd, lwr}

where flushRt is the number of lines flushed per coherent cache miss, rtDelay is the round-trip delay, xTrips is the number of round trips needed for the transaction and xSymb is the number of symbols injected into the ring. xSymb will be used later to estimate the level of traffic on the ring. xRem is the cost of the remote operations for reference type x and xTagAccs is the latency of cache and memory tag/data accesses.

The cost of a shared-data read-miss is computed as above, but the cache coherence protocol actions are different. For a more detailed description of the protocol actions, see Section 2.3.2 (page 18). If there is a conflict miss, the victim line is flushed as above; otherwise the missing line is fetched: (1) a request is sent to the home memory of the line, with probability homeHere of the home being the same node. (2) If the line is at its home (hdAtHome), the remote SCI controller reads the line from memory and (3a) sends it over the network; (4a) the local SCI controller writes the data to the coherent cache. If the line is not at the home node, the home SCI controller sends back the address of the head of the sharing-list (3b); the requester then asks the owner for a copy of the line (4b); the owner's SCI controller reads the line from its coherent cache (5b) and sends it over to the requester (6b); the local SCI controller writes the line to its cache (7b). The cost of looking up a line at the memory directory is mTagAccs. The cost of a remote shared-data read is given by srdRem.

srdTagAccs = homeElsw · hdNotAtHome · mTagAccs + flushRt · hdHere · fillcC    (5.16)
srdTrips = homeElsw · 2 + homeElsw · hdNotAtHome · 2 + flushRt · (1 + homeElsw · (hdHere + hdAway))    (5.17)
srdSymb = homeElsw · (8 + 40) + homeElsw · hdNotAtHome · (16 + 16) + flushRt · (16 + homeElsw · (hdHere · 48 + hdAway · 16))    (5.18)
srdRem = rtDelay · srdTrips    (5.19)

In SCI, a write to a shared datum entails the invalidation of all its copies. The SCI controller at the writing processor sends an invalidation message to each owner of a copy, thus purging the copies from the other caches. After the sharing-list has been purged, the write can proceed. For a detailed description of the protocol actions, see Section 2.3.2 (page 18). The sequence of actions is: (1) the writer requests exclusive ownership of the line from its home memory; (2a) if the line is clean and not shared, the home controller acknowledges the ownership request with a control packet (3a). If there are copies of the line, the home memory responds with the address of the current owner (2b). The writer sends a write-exclusive request to the owner (3b), who reads the line from its cache (4b) and (5b) returns both the valid copy and the address of the next node in the sharing-list (if any). The writer sends an invalidation packet (6b) to the next node in the sharing-list, who (7b) reads from its cache the address of the next node in the list and (8b) returns that address to the writer. Actions (6b), (7b) and (8b) are repeated until all copies have been purged. The writer then updates its exclusive copy (9b). The number of sharing-lists purged per shared write miss is purgeRt. The average number of copies purged is shLstL. The cost of a remote write is then

swrTagAccs = mTagAccs + hdNotAtHome · fillcC + purgeRt · shLstL · cTagAccs + flushRt · hdHere · fillcC    (5.20)

swrTrips = homeElsw · (2 + hdNotAtHome · 2) + purgeRt · shLstL · 2 + flushRt · (1 + homeElsw · (hdHere + hdAway))    (5.21)
swrSymb = homeElsw · ((8 + 16) + hdNotAtHome · (16 + 48)) + purgeRt · shLstL · (16 + 16) + flushRt · (16 + homeElsw · (hdHere · 48 + hdAway · 16))    (5.22)
swrRem = rtDelay · swrTrips    (5.23)
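The write-cost equations are where SCI's serial purge shows up: the purgeRt · shLstL terms grow linearly with the number of copies. A minimal Python sketch of Equations 5.20-5.23, reusing the probabilities() helper sketched earlier; the function name is an illustrative addition, all other names follow the model.

# Remote shared-write cost (Equations 5.20-5.23). Returns the tag-access
# plus network cost, and the number of symbols injected into the ring.
def swr_remote(p, purgeRt, shLstL, flushRt, rtDelay,
               mTagAccs, cTagAccs, fillcC):
    tag = (mTagAccs + p["hdNotAtHome"] * fillcC
           + purgeRt * shLstL * cTagAccs               # one tag access per copy
           + flushRt * p["hdHere"] * fillcC)           # Equation 5.20
    trips = (p["homeElsw"] * (2 + p["hdNotAtHome"] * 2)
             + purgeRt * shLstL * 2                    # two trips per copy purged
             + flushRt * (1 + p["homeElsw"]
                          * (p["hdHere"] + p["hdAway"])))           # Equation 5.21
    symbols = (p["homeElsw"] * ((8 + 16) + p["hdNotAtHome"] * (16 + 48))
               + purgeRt * shLstL * (16 + 16)
               + flushRt * (16 + p["homeElsw"]
                            * (p["hdHere"] * 48 + p["hdAway"] * 16)))  # Eq. 5.22
    return tag + rtDelay * trips, symbols              # swrRem = rtDelay · trips (5.23)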

The execution time is obtained from the above equations by adding up the time taken by each of the reference types. The round-trip delay is initially estimated from the network traffic, that is, from the number of symbols injected into the ring. Its final value is obtained by an iterative method in which the cost equations are re-computed each time a better estimate for the round-trip delay has been obtained. As a first approximation, the throughput S_0 is set to the number of bytes injected divided by the cost of the references. That underestimates the execution time since network latency is ignored. The round-trip latency rtDelay_1 is then estimated with Equations 4.1 to 4.4 (function ring() in Equation 5.28). Time t_1 is obtained from Equations 5.15, 5.19 and 5.23, but using rtDelay_1 for rtDelay. This process is repeated until the difference between two iterations is less than 0.1%. Convergence is achieved after 5 to 20 iterations. xRaccs is the number of remote references for operation x.

xRaccs = xcnt · pcm_x · ccm_x    (5.24)

x ∈ {lrd, lwr, srd, swr}

allSymb = lrdSymb + lwrSymb + srdSymb + swrSymb    (5.25)
allCost = ifCost + lrdCost + lwrCost + srdCost + swrCost    (5.26)
S_0 = 2 · allSymb / allCost    (5.27)

rtDelay_i = ring(numNodes, S_{i-1})    (5.28)

localT_i = lrdCost + lrdRaccs · rtDelay_i · lrdTrips + lwrCost + lwrRaccs · rtDelay_i · lwrTrips    (5.29)

sharedT_i = srdCost + srdRaccs · rtDelay_i · srdTrips + swrCost + swrRaccs · rtDelay_i · swrTrips    (5.30)
t_i = ifCost + localT_i + sharedT_i    (5.31)

S_i = 2 · allSymb / t_i    (5.32)
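The fixed point defined by Equations 5.24-5.32 can be sketched in a few lines of Python. The sketch assumes the per-type costs, trip counts and remote-reference counts have already been computed, and reuses the ring_latency() sketch from Chapter 4 as the function ring() of Equation 5.28; the function and parameter names are illustrative.

# A minimal sketch of the iterative solution (Equations 5.24-5.32).
# costs, trips and raccs map each reference type to its local-hierarchy
# cost, round trips per remote reference, and remote-reference count.
TYPES = ("lrd", "lwr", "srd", "swr")

def solve(num_nodes, if_cost, costs, trips, raccs, all_symb, tol=1e-3):
    t = if_cost + sum(costs[x] for x in TYPES)  # first estimate ignores latency
    for _ in range(100):                        # 5-20 iterations suffice in practice
        s = 2.0 * all_symb / t                  # Equations 5.27 and 5.32
        rt_delay = ring_latency(num_nodes, s)   # ring() of Equation 5.28
        t_new = if_cost + sum(costs[x] + raccs[x] * rt_delay * trips[x]
                              for x in TYPES)   # Equations 5.29-5.31
        if abs(t_new - t) / t < tol:            # stop below 0.1% change
            return t_new
        t = t_new
    return t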

Model accuracy. The main advantage of analytical models is that results can be obtained much faster than is possible with simulation. However, the accuracy of the results is only as good as the description of the system under study that is embedded in the model. The accuracy of the model predictions can be assessed by comparing them to the values produced by the simulator. Feeding the reference counts and ratios from the workload to the model, the results obtained are in good agreement with the simulation results. Table 5.1 shows the range of variation for each of the programs and Figure 5.1 plots the percentage error versus frequency of occurrence for 120 simulation runs - six programs on five cache sizes on four ring sizes. The worst errors are for mp3d() (+11%) with the two largest cache sizes and for paths() with a 64Kbytes cache (-13%). In both Figure 5.1 and Table 5.1, a negative error means that the model underestimates the execution time.

Ring size        2          4          8          16
chol()        -1, 0      -4, 0      0, +4      0, +4
mp3d()       -10, -8     -7, -3     -3, +3     +2, +11
water()          0       -1, 0      -3, -1     -6, -1
ge()             0          0          0          0
mmult()          0       0, +1      -4, 0      -1, +6
paths()       -1, 0      -3, 0      -7, 0      -13, 0

Table 5.1: Range of percentage error in the model prediction when compared to the simulation results. A negative error means that the model underestimates the execution time.

The model output for mp3d() overestimates network latency by 30% for the two largest cache sizes, and by 12-14% for the three smaller cache sizes. The time spent on shared-data references is also overestimated by about 10-15%. The model does not explicitly account for synchronisation operations but assumes the synchronisation-related memory references to be normal shared-data references. The model underestimates network latencies for paths() on the 64Kbytes caches (19%) and shared-data references (33%). The equations that describe the protocol actions do not reproduce faithfully the behaviour of the system under very high levels of coherency activity such as that produced by the high miss ratios caused by paths() on small coherent caches. For more "normal" behaviour, model outputs are within 5% of the simulation results.



Figure 5.1: Distribution of error in the model predictions when compared to the simulation results. A negative error means that the model underestimates the execution time.

5.2 Costing Sharing-lists and Conflict Misses

Impact of long sharing-lists on performance. One of the bottlenecks in SCI is the potentially high cost of purging long sharing-lists, since that cost grows linearly with the number of copies that have to be purged. This can make synchronisation operations expensive unless measures are taken to avoid a large degree of sharing of barriers and locks - see, for example, [MS91, AGGW94]. Some algorithms have large degrees of sharing for some or all of their shared variables, of which paths() is a good example. The length of its sharing-lists grows as 2, 3.4, 5.4 and 7.1 for 2, 4, 8 and 16 nodes, respectively (256Kbytes cache). What would be the impact on performance if the degree of sharing were even higher? Using the parameters from the 256K and 512Kbytes caches, the model produced the results shown in Figure 5.2.

The performance degradation is more easily seen on the 512Kbytes cache. The case that is closest to the simulation is the 5-copies list. Taking that as the basis, the longest list (15 copies, 16-node ring) is 4% slower on both cache sizes. The difference between the 1-copy list and the 16-copies list is just over 6%. The model underestimates network latency by 7% for the 256K cache and 3% for the 512Kbytes cache. Taking that into account, and assuming that network latency would be underestimated further for the longer lists, the effect of long sharing-lists, for paths(), is still acceptably small for systems with up to 16 nodes. Notice, however, that the number of writes to shared data is relatively small when compared to the number of shared-data reads - see Table 3.3 (page 37).


Figure 5.2: Effect of long sharing lists on the performance of paths() with 256K and 512Kbytes caches. The number of copies purged varies from 1 to 15.

The cost of conflict misses. Another interesting question concerns the effects of flushing lines from the coherent caches because of conflict misses. mmult() has no write-shared data and therefore no sharing-lists are purged, except in a handful of false-sharing cases. Thus, besides compulsory and capacity misses, all other coherency activity is caused by conflict misses. These are responsible for the poor performance of the 8-node ring when compared to the other ring sizes. Figure 5.3 (page 75) shows the effects on performance when the number of lines flushed per coherent cache miss varies from 0 to 1.

On the 8-node ring with a 128Kbytes (512K) cache, the difference in speed between never flushing and always flushing is 21% (17%). The level of traffic is higher than on the other ring sizes because of the conflict misses between instructions/local-data and shared-data. On the 16-node ring, the difference is 15% (6%). These findings were somewhat surprising since purging sharing-lists is normally identified as the major bottleneck in SCI. Thus, the experiment above was repeated using data from paths(), keeping the sharing-list related values as per the simulations but changing the flush ratios. The results are shown in Figure 5.4. On a 16-node ring with a 256Kbytes cache (512K), always flushing is 10% (4%) slower than never flushing. Keeping in mind the caveats about model accuracy, for

paths(), the impact on performance of conflict misses is more pronounced than that of long sharing-lists.


Figure 5.3: Effect of conflict misses on the performance of mmult() with 128K and 512Kbytes caches. The flush ratio varies from 0 to 1.


Figure 5.4: Effect of conflict misses on the performance of paths() with 256K (left) and 512Kbytes caches (right). The flush ratio varies from 0 to 1.

The effects of long sharing-lists and conflict misses on the workload are shown in Table 5.2. The parameters are those of the 256Kbytes coherent caches. First, the length of the sharing-lists was varied from 1 copy to 15 copies; the table shows the slowdown for the 15-copies (7-copies) lists when compared to the 1-copy lists on 16-node (8-node) rings. Notice that, apart from paths(), all of the other programs have sharing-lists with an average length close to 1. The table also shows the slowdown of a very high level of conflict misses (always flushes: flush ratio = 1.0) in relation to no conflict misses (never flushes: flush ratio = 0.0).

Table 5.2 shows that there is a marked difference in behaviour between mmult() and paths() and the other programs. The former suffer much more from high levels of conflict misses than from long sharing-lists, whereas the opposite is true of the latter programs. The difference is caused by the higher ratio of writes to reads in the latter, and by the number of writes itself: mmult() and paths() perform only a few thousand writes while the others do hundreds of thousands of writes. Note that five of the six programs have sharing-list lengths much closer to the basis of the comparison (one copy), whereas they all exhibit levels of conflict misses that do not fall at either extreme of the comparison range.

                  sharing-lists       conflict misses
Relative change    7:1     15:1        1:0      1:0
Ring size           8       16          8        16
chol()              5       25          3         5
mp3d()             43      153         20        28
water()             4       18          2         3
ge()                0        4          1         2
mmult()             0        0         17         9
paths()             2        4          2        10

Table 5.2: Change in speed (%) for long sharing-lists and high levels of conflict misses. Values are for 256Kbytes caches. See text for details.

mp3d() has the most extreme behaviour of the workload as far as network latency is concerned. Thus, the experiments with varying sharing-list length and flush rates were repeated for that program. Figure 5.5 (page 77) shows the results for mp3d(). When comparing 3-copies lists to 1-copy lists, the difference is over 20% for mp3d() on a 16-node ring. Small increases in the levels of write-sharing would cause serious performance degradation.

Coda

The analytical model presented in this chapter estimates the performance of programs executing on an SCI ring-based multiprocessor with reasonable accuracy, at a small fraction of the corresponding simulation time. The model predictions are quantitatively good for programs that exhibit "normal" behaviour and qualitatively good for more extreme behaviours. The next chapter investigates medium-sized multiprocessors in which the interconnects are higher-dimensional networks composed of several SCI rings.


Figure 5.5: Effects on performance of conflict misses (left) and sharing-list length (right) for mp3d() on 256Kbytes coherent caches.

Chapter 6

The Performance of Meshes and Cubes

This chapter investigates the performance of SCI-based multiprocessors with up to 64 nodes. Chapter 4 provides evidence against the use of one-dimensional networks for systems with more than 8 to 16 nodes, given the type of processor and memory hierarchy simulated. Here, two- and three-dimensional networks are examined: two members of the family of k-ary n-cubes [Sei85, Dal90, SG91], namely the mesh (n = 2) and the cube (n = 3). These networks are implemented by having two or more pairs of SCI links on each node, each pair belonging to a different ring. Thus, an SCI mesh is actually a mesh-of-rings (a torus) where each node belongs to a "vertical" and a "horizontal" ring.

The chapter is organised as follows. Section 6.1 discusses the simulation en- vironment used to conduct the experiments. Sections 6.2 and 6.3 contain the simulation results for SCI meshes and cubes. Finally, Section 6.4 compares the performance of the workload on SCI rings, meshes and cubes.

6.1 The Simulated Multiprocessor

The simulation environment is basically the same as described in Chapter 3. The network simulator was modified to incorporate the packet router described in Section 6.1.1, and the statistics module was changed to record the extra traffic-related information. Each ring in the network is modelled independently by Equation 3.1. The cost of switching dimension is five extra network cycles (10ns). The cost of a transaction is computed by adding up the memory and network delays on all rings in the path from requester to responder.
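As a concrete restatement of this costing rule, a minimal sketch in Python, assuming delays in nanoseconds; the function and constant names are illustrative additions.

# Cost of a transaction whose route crosses several rings: the network
# delay on each ring, the tag/memory latencies at the endpoints, and
# five network cycles (10ns) per dimension switch.
DIM_SWITCH_NS = 10.0

def transaction_cost(ring_delays_ns, tag_latencies_ns):
    switches = max(len(ring_delays_ns) - 1, 0)
    return (sum(ring_delays_ns) + sum(tag_latencies_ns)
            + switches * DIM_SWITCH_NS)

# Example: a request crossing a Y ring and then an X ring, paying a
# cache tag access (30ns) and a memory access (120ns) at the endpoints.
print(transaction_cost([120.0, 95.0], [30.0, 120.0]))  # 375.0 ns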

6.1.1 Routing

An important component of a k-ary n-cube network is the routing function. An SCI-based network has characteristics of both wormhole routing and store-and-forward routing [Tan89]. If the network is idle and a given node is not inserting a packet, then as soon as the destination address of an incoming packet is decoded by the parser, the packet can be sent down the link leading to its destination, as in wormhole routing. If, however, a given node is inserting a packet, incoming traffic is buffered until that packet is fully transmitted. Because of the buffering and the possibility of blocked paths, the routing algorithm must be deadlock-free.

The router "implemented" by the simulator is based on the e-router [DS87]. The e-router is shown to be deadlock-free on SCI-based k-ary n-cubes by Johnson and Goodman in [JG91]. The path from source to destination is always chosen by inserting the packet at the highest dimension, where it travels as far as possible before being switched onto the next lower dimension. Deadlock avoidance is ensured by the partitioning of network queues into a set of ordered classes, with the queues in each dimension comprising one of the classes. Suppose, for example, that in the torus depicted in Figure 6.1, node 3 sends a packet to node 8. Node 3 injects the packet into its Y (vertical) link. The packet travels on that ring until it reaches node 11, where it is re-routed onto the X link and reaches node 8, its destination. Notice that two echo packets are created: one from node 11 to node 3, on the Y ring, and one from node 8 to node 11, on the X ring. This algorithm can be easily extended for routing on higher-dimensionality networks.

Figure 6.1: A four-by-four SCI mesh.
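A minimal sketch of this dimension-order routing on a k-by-k torus with nodes numbered row-major, as in Figure 6.1; the function is an illustrative addition and only lists the nodes at which the packet changes rings.

# Deadlock-free e-router order: ride the Y (column) ring first to reach
# the destination's row, then the X (row) ring to the destination.
def route(src: int, dst: int, k: int = 4):
    src_row, src_col = divmod(src, k)
    dst_row, dst_col = divmod(dst, k)
    path = [src]
    if src_row != dst_row:
        path.append(dst_row * k + src_col)  # switch point: Y ring -> X ring
    if src_col != dst_col:
        path.append(dst_row * k + dst_col)  # delivered on the X ring
    return path

# The example from the text: node 3 sends to node 8 via node 11.
print(route(3, 8))  # [3, 11, 8]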

SCI's communication protocol ensures delivery within a single ring. Point-to- point delivery must be ensured by higher level protocols implemented by inter-ring switches. SCI allows for pipelined packet transmission along the route between source and destination. The switch strips a packet from a higher dimension ring and inserts that packet into a lower dimension ring, while at the same time the packet's echo is being returned on the higher dimension ring. This frees up buffer space at the transmitter stage of the requester (the 'active buffers' in Figure 3.3) and intermediary switches.

6.1.2 SCI Switches

An SCI switch contains a pair of links for each dimension, in the case of k-ary n-cube networks. Configurations for other topologies are described in [JG91]. Figure 6.2 depicts the data path of a two-dimensional switch. The incoming packets at the X- and Y-dimension link inputs are (1) passed along if the node is not the destination of the packet and its destination is in the same dimension; (2) steered to the node interface if the packet is addressed to the node; (3) placed in the other dimension's output buffer if the destination node is in that dimension's ring.


Figure 6.2: Data paths of a two-dimensional SCI switch.

In SCI, memory is physically addressed. Thus, the node address is an integral part of the address of a coherent line. The mechanism that steers packets within a switch can be implemented by simple and fast combinatorial circuits. The decision to accept or re-route a packet can be taken by masking certain address bits [Sei85, Hwa93]. The extension to higher dimensional networks can be achieved in two ways. For low dimensionality networks (3-D or 4-D), a simple extension of the circuit depicted in Figure 6.2 would be feasible. For higher dimensionality, the link interfaces within the switch could themselves be interconnected in a ring. Johnson and Goodman propose such a scheme in [JG91]. Contention for the internal ring would be small since all the changes in dimension occur between adjacent dimensions, that is, dimensions i and ((i + 1) mod n). Notice that this assumes a deadlock-free routing similar to that discussed in Section 6.1.1.
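As an illustration of the masking scheme, the fragment below expresses the steering decision of Figure 6.2 in software terms. It assumes, purely for the sake of the example, that a node address packs the Y coordinate in the high bits and the X coordinate in the low bits; none of the names correspond to actual SCI hardware.

    /* Steering decision in a two-dimensional switch by address-bit
       masking (sketch). The outcomes correspond to cases (1)-(3) in
       Section 6.1.2. xmask selects the X-coordinate bits; the bits
       above it hold the Y coordinate (an assumed packing). */
    enum steer { PASS_ALONG, TO_NODE, SWITCH_DIM };

    enum steer steer_packet(unsigned dest, unsigned self,
                            unsigned xmask, int arrived_on_y)
    {
        if (dest == self)
            return TO_NODE;                    /* case (2): for this node */
        if (arrived_on_y && (dest & ~xmask) == (self & ~xmask))
            return SWITCH_DIM;                 /* case (3): Y done, go to X */
        return PASS_ALONG;                     /* case (1): keep travelling */
    }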

6.2 SCI Meshes

The performance of SCI-connected meshes is investigated in this section. Figure 6.1 depicts a four-by-four mesh. Note that the rows and columns can be staggered so that all links have the same length [Dal90]. Because of high computational costs, the systems simulated were restricted to three sizes, namely four- (two-by-two), sixteen- (four-by-four) and 64-node tori (eight-by-eight). The effect of secondary cache size on performance is assessed. Three cache sizes were simulated, namely 128, 256 and 512Kbytes. Processor throughput, packet delivery delays, queue throughput and link traffic are investigated as well. Queue throughput and link traffic are useful metrics because they can expose bottlenecks caused by uneven traffic patterns.

6.2.1 Machine and Cache Size - SPLASH Programs

Figure 6.3 shows comparative simulation results of the SPLASH applications on the machine and cache sizes simulated. Recall that the data-set sizes are scaled up with machine size - see Table 3.2 (page 36). chol() is 3.2 times faster on a 16-node, 128Kbytes system than on a 4-node machine. The speedup is 2.8 with the two other cache sizes - the performance is better but the speedup is worse. mp3d() behaves the same way for the three cache sizes. water() shows a 12-14% decrease in speed for a quadrupling in system size. This is partly caused by the scaling factor used, where processors in 64-node systems do proportionally more work. mp3d() and water() show negligible performance gains with increases in cache size.



Figure 6.3: Execution time plots for chol() (left), mp3d() (center) and water() (right). The segments show the time spent on shared-data references and network latency. Data sets are scaled up with machine size.

Coherent cache hit ratios. The shared-data read hit ratios of the SPLASH applications are shown in Figure 6.4. chol() has a slightly lower hit ratio on the 128Kbytes caches when compared to the two larger cache sizes. The hit ratios of mp3d() show a strong dependence on machine size but negligible changes with cache size. water() shows a small decrease in hit ratios when going from a 4- to a 16-node machine. The low hit ratios on 64-node machines are caused by the data set not fitting in the caches - the rates improve about 5 percentage points for each cache size doubling.


Figure 6.4: Coherent cache shared-data read hit ratio plots for chol() (left), mp3d() (center) and water() (right).


Execution time breakdown. Figure 6.5 shows the execution time breakdown for the three SPLASH applications, for the machine and coherent cache sizes simulated. Time was divided into (1) fetching and execution of instructions, (2) references to private data, (3) references to shared data local to the processor, (4) references to shared data in another node, (5) network delays and (6) synchronisation. The fraction of time spent on references to shared data at another node includes all delays caused by protocol actions such as cache/memory tag accesses as well as loading and storing data in caches and main memory. Both chol() and water() spend about half the time executing instructions and under 20% of the time referencing remote shared-data. mp3d() spends most of the time in references to remote shared data and synchronisation. As is the case with the shared-data hit ratios, there is little change with cache and machine sizes.


Figure 6.5: Execution time breakdown for chol() (left), mp3d() (center) and water() (right).

6.2.2 Machine and Cache Size - Parallel Loops

Figure 6.6 compares the performance of the parallel loops for the machine and cache sizes simulated. Recall that the data-set sizes are scaled up with machine size - see Table 3.2 (page 36). ge() scales up well, with just a small performance loss for a quadrupling of system size. On the 64-node machine, mmult() displays an improvement in speed of about 10% for each cache size doubling, because of the improvement in the shared-data hit ratios. The same applies to the smaller systems. paths()'s performance depends on cache size. When the shared-data hit ratios are low, performance degrades badly because of the high cost of fetching needed data and purging sharing-lists. The plots indicate that the network is saturated, since a very substantial portion of the execution time is spent on network delays - see Section 6.4.1, page 97.


Figure 6.6: Execution time plots for ge() (left), mmult() (center) and paths() (right). The segments show the time spent on shared-data references and network latency. Data sets are scaled up with machine size.

Coherent cache hit ratios. Figure 6.7 (page 85) shows the shared-data read hit ratios for the parallel loops. ge() has high hit ratios and these vary little with cache and machine size. mmult() has slightly lower hit ratios and these depend on both cache and machine size. paths(), on the other hand, shows large changes in hit ratios with cache size and, to a much lesser extent, with machine size. The 512Kbytes cache is large enough to contain most of the data set. In the three programs, the decrease in the hit ratio with system size is caused by an increase in interference amongst the processors, that is, larger systems have higher degrees of write-sharing and longer sharing-lists.

Execution time breakdown. Figure 6.8 shows the execution time breakdown for the parallel loops for all the machine and coherent cache sizes simulated. Time was divided into (1) fetching and execution of instructions, (2) references to private data, (3) references to shared data local to the processor, (4) references to shared data in another node and (5) network delays. ge() spends a large fraction of its time executing instructions and just a little time on references to remote shared-data. The fraction of execution time in which mmult() is stalled waiting on network delays increases with machine size and decreases with cache size. paths(), when the caches are large enough to contain the data sets, spends most of the time performing instructions. If the caches are too small, network traffic then takes a very substantial fraction of the execution time.


Figure 6.7: Coherent cache shared-data read hit ratio plots for ge() (left), mmult() (center) and paths() (right).


Figure 6.8: Execution time breakdown for ge() (left), mmult() (center) and paths() (right).

6.2.3 Throughput and Latency

The network characteristics of the SCI meshes are examined next. Unlike the ring, a single processor request can generate up to two packets: one for each network dimension. The first packet is injected by the processor and the second by the SCI interface of the node where the change of dimension occurs. Thus, processor throughput is here computed by taking the traffic generated by the onboard processor and cache/memory controller only and dividing it by the execution time.


Figure 6.9 shows the throughput per node for systems with 256Kbytes coherent caches. The applications with the lower shared-data hit ratios are the ones that cause the most traffic on the network, namely mp3d() and paths().


Figure 6.9: Throughput per node for 256Kbytes caches. From left to right: chol(), mp3d(), water(), ge(), mmult() and paths().


Figure 6.10: Round-trip delays for 256Kbytes caches. From left to right: chol(), mp3d(), water(), ge(), mmult() and paths().

Figure 6.10 shows the round-trip delays incurred by the applications. Those which generate higher traffic endure longer delays, as is the case with single rings. The round-trip delay is computed in a similar fashion to the throughput. The "round-trip" considered is that of packets inserted by the processors. The delay is computed by dividing the time spent on network latencies by the number of packets inserted by each processor. On a 2x2 torus, the average round-trip delay is 75ns; on a 4x4 torus it is 108ns and on an 8x8 torus, 174ns. These values are in broad agreement with Equation 2.3 (page 9), with c set to 5 (five extra network cycles on a change of dimension). Note however that the delays estimated by Equation 2.3 are maximum delays whereas those measured are closer to average or best-case on the smaller systems.
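For concreteness, the two metrics just described could be derived from per-node counters as in the sketch below; the structure and field names are invented for illustration and do not come from the simulator.

    /* Per-node throughput and round-trip delay as defined in the
       text: only traffic inserted by the processor and its
       cache/memory controller is counted, and the round-trip delay
       is the total time spent on network latency divided by the
       number of packets inserted. Hypothetical names throughout. */
    struct node_stats {
        double bytes_inserted;    /* processor + cache/memory controller */
        double network_wait_ns;   /* accumulated network latency */
        long   packets_inserted;  /* packets injected by this node */
        double exec_time_s;       /* execution time of the program */
    };

    double throughput_Mbytes_per_s(const struct node_stats *s)
    {
        return s->bytes_inserted / s->exec_time_s / 1.0e6;
    }

    double round_trip_ns(const struct node_stats *s)
    {
        return s->network_wait_ns / (double)s->packets_inserted;
    }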

Higher dimensionality networks offer high bandwidth and, in theory, suffer less from network congestion. However, the interaction between allocation of processes to processors and data to nodes can cause non-uniform traffic patterns and hot spots. The quality of a network design depends on how these are tolerated. The following paragraphs discuss the measurements of traffic through links and queues in the network.

In meshes, unlike rings, the routing algorithm and the allocation of data to nodes can produce different traffic patterns on individual rings or on the rings belonging to a given dimension. The average value of link traffic tends to hide irregular behaviours. Figure 6.11 (page 88) shows both the average and the peak traffic per link for the two dimensions. Peak traffic varies considerably between the two dimensions whereas average traffic is roughly the same. As is the case with throughput, mp3d() and paths() cause the highest levels of traffic.

The output buffer holds packets inserted by the processor and local cache as well as packets that changed dimension and are entering the second leg of their trip. The routing has a more noticeable effect here, with the Y-dimension buffers being busier than their X-dimension counterparts - see Figure 6.11. As before, mp3d() and paths() have the busiest output queues but, except for paths(), average traffic is much closer to peak. paths() has a hot spot node that handles nearly twice the average traffic.

The plots for the traffic through the bypass buffers are similar to those of link traffic, with X-dimension queues being busier - see Figure 6.11. Because of the routing algorithm, the majority of the packets are injected into Y-rings whereas most of the non-local traffic occurs on the X-rings. This is a consequence of the mapping of data to nodes: when the quota of pages is exceeded, overspill pages are allocated to the neighbour on the X-dimension.
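A sketch of the overspill policy just described is given below. The quota mechanism and the row-major node numbering are assumptions made for illustration; only the spilling direction (to the X-dimension neighbour) is taken from the text.

    /* Place a page at its home node on a k-by-k torus, spilling
       overspill pages to successive X-dimension neighbours once the
       home node's quota is exhausted (illustrative sketch). Returns
       the chosen node, or -1 if every node in the row is full. */
    int place_page(int home, int pages_used[], int quota, int k)
    {
        int row = home / k, col = home % k;
        for (int tries = 0; tries < k; tries++) {
            int node = row * k + col;
            if (pages_used[node] < quota) {
                pages_used[node]++;
                return node;
            }
            col = (col + 1) % k;   /* X-dimension neighbour */
        }
        return -1;                 /* row exhausted */
    }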



Figure 6.11: Link traffic per dimension (top), output buffer traffic per dimension (mid) and bypass buffer traffic per dimension (bottom), for 256Kbytes caches. From left to right: chol(), mp3d(), water(), ge(), mmult() and paths().

6.3 SCI Cubes

This section discusses the performance of SCI-connected cubes. Figure 6.12 depicts a four-by-four-by-four cube. Because of high computational costs, the systems simulated were restricted to two sizes, namely eight- (2x2x2) and 64-node cubes (4x4x4). The effect of secondary cache size on performance is assessed. The cache sizes simulated were 128, 256 and 512Kbytes. Processor throughput, packet delivery delays, queue throughput and link traffic are investigated as well.

Figure 6.12: A four-by-four-by-four SCI cube. Wrap-around connections in the Z dimension not shown.

6.3.1 Machine and Cache Size

Figure 6.13 shows the execution time for the SPLASH applications on machines with 8 and 64 nodes and 128, 256 and 512Kbytes caches. Recall that the data-set sizes are scaled up with machine size - see Table 3.2 (page 36). chol() performs better with the two larger cache sizes and the poorer performance with 128Kbytes stems from lower hit ratios. mp3d()'s performance is roughly independent of cache size. Of the 8-node systems, the 128Kbytes one is faster because of less cache pollution, as was the case on the single ring (Section 4.2.2). water() is 38-40% slower on the larger machine because of the scaling factor used. For ge(), the slight loss in speed on the 64-node machine, when compared to the 8-node one, is caused by higher network latency. mmult()'s performance depends on both cache and machine sizes. Although the differences are small, they are a consequence of better hit ratios (see below). paths() has a very poor performance with the two smaller cache sizes on the 64-node machines. This is a consequence of the low hit ratios it endures with the smaller caches.


Figure 6.13: Execution time plots for chol() (left), mp3d(), water(), ge(), mmult() and paths() (right). The bottom section of the columns corresponds to time spent fetching instructions and referencing local data, the middle section to references to shared-data and synchronisation, and the top, to network latency. Data sets are scaled up with machine size.


Figure 6.14: Coherent cache shared-data read hit ratio plots for chol() (left), mp3d(), water(), ge(), mmult() and paths() (right).

Coherent cache hit ratios. Figure 6.14 plots coherent cache shared-data read hit ratios for the cache and machine sizes investigated. These hit ratios explain the results presented in Figure 6.13. The situation here is similar to Figures 6.4 and 6.7. chol() shows a small improvement with the larger caches. On the 8-node system, mp3d() has worse hit ratios on the larger caches. This is caused by higher levels of cache pollution on the larger caches. water(), on the 64-node system, shows an improvement of five percentage points for each cache size doubling. ge() has a slightly lower hit ratio on the 64-node machine, as compared to the smaller system. mmult(), on 64-node machines, shows a 5% improvement in the hit ratios of the 512Kbytes cache. Finally, paths() needs the 512Kbytes of cache to accommodate its data set and that is reflected in its hit ratios and performance.

Execution time breakdown. Figure 6.15 shows the execution time breakdown for the three SPLASH applications, for all the machine and coherent cache sizes simulated. Time is split into (1) fetching and execution of instructions, (2) references to private data, (3) references to shared data local to the processor, (4) references to shared data in another node, (5) network delays and (6) synchronisation. Recall that the fraction of time spent on references to shared data at another node includes all delays caused by protocol actions as well as loading and storing data in caches and main memory. Both chol() and water() spend about half the time executing instructions while they spend little time referencing remote data. mp3d() spends most of the time in references to remote shared data and synchronisation operations.


Figure 6.15: Execution time breakdown for chol() (left), mp3d() (center) and water() (right).

Figure 6.16 shows the execution time breakdown for the parallel loops. Time is split into (1) fetching and execution of instructions, (2) references to private data, (3) references to shared data local to the processor, (4) references to shared data in another node and (5) network delays. As with the other two topologies, ge() spends a large fraction of its time executing instructions and just a little time on references to remote shared data. The time mmult() spends on network latency is inversely proportional to the hit ratios, with the smaller caches spending over 12% of the time on network delays (64-node systems). paths(), when the caches are large enough to contain the data sets, spends over half the time performing instructions. If the caches are not big enough, network traffic then accounts for a very large fraction of the time.


Figure 6.16: Execution time breakdown for ge() (left), mmult() (center) and paths() (right).

6.3.2 Throughput and Latency

Figure 6.17 shows processor throughput and round-trip delay for the six programs simulated with 256Kbytes coherent caches. As was the case with meshes, throughput and latency are computed only from processor-generated traffic. The round-trip values are in agreement with Equation 2.3, with c set to 5 (five extra network cycles on a change of dimension). Again, note that the delays estimated by Equation 2.3 are maximum delays whereas those in Figure 6.17 are closer to average or best case. Since the rings are smaller and there are, relatively speaking, more links between nodes, the average distance between nodes is smaller and so are the network latencies, when compared to 64-node meshes.



Figure 6.17: Throughput per node (top) and round-trip delay (bottom), for 256Kbytes caches. From left to right: chol(), mp3d(), water(), ge(), mmult() and paths().

The plots for link, output- and bypass-buffer traffic are shown in Figure 6.18. When compared to Figure 6.11, the levels of traffic on individual links and queues are lower. This is a direct consequence of increased network capacity. There is another contributing factor, namely the proportionally smaller number of nodes on each individual ring, for instance, 4-node rings instead of 8-node rings on the 64-node machines. This decreases the rate of network requests on each ring while the traffic created by packet delivery to nodes in remote rings is divided amongst the links in each dimension.



Figure 6.18: Link traffic per dimension (top), output buffer traffic (center) and bypass buffer traffic (bottom), all for 256Kbytes caches. Notice that the vertical scales are all different. From left to right: chol(), mp3d(), water(), ge(), mmult() and paths().

6.4 A Comparison of Rings, Meshes and Cubes

This section compares the performance of the three topologies investigated here, namely rings, meshes and cubes. Since the simulated machine sizes are not all the same on the three topologies, comparisons are drawn for same-sized multiprocessors. The significant metric is execution time. Processor throughput and round-trip latency are examined since these help to explain the relative advantages of one topology over another. Figure 6.19 plots the execution time of 4-node rings and 2x2 meshes, and of 8-node rings and 2x2x2 cubes. The execution time of 16-node rings and 4x4 meshes is shown in Figure 6.20, as well as that of 8x8 meshes and 4x4x4 cubes.


Figure 6.19: Performance of 4-node (top) and 8-node (bottom) multiprocessors, with 256Kbytes caches. The suffixes -r, -m and -c stand for ring, mesh and cube, respectively. From top to bottom: chol(), mp3d(), water(), ge(), mmult() and paths().

The performance of the higher dimensionality networks with 4 and 8 nodes is worse than that of their lower-dimensional counterparts - see Figure 6.19. This is caused by the higher cost of changing dimensions, that is, five extra network cycles on each packet delivered to nodes in remote rings. For example, on 4-node systems, on average, two thirds of the packets need to pass through a network switch, thus incurring the extra delays. For small systems, the ring is clearly the best choice, both in terms of performance and cost.


Figure 6.20: Performance of 16-node (top) and 64-node (bottom) multiprocessors, with 256Kbytes caches. The suffixes -r, -m and -c stand for ring, mesh and cube, respectively. From top to bottom: chol() (16-node only), mp3d(), water(), ge(), mmult() and paths().

On the 16- and 64-node multiprocessors, the higher dimensionality networks are clearly better for programs that cause high levels of network traffic, namely mp3d() and paths(). For these programs, the performance gains are 15 and 10% respectively. For the rest of the workload, the performance gains are smaller. Figure 6.20 shows that the speed improvements stem mostly from a decrease in network latency. Variations in hit ratios and in the mapping of pages to nodes are the cause of the small variations in the time spent on instruction fetching and execution and shared-data references.

6.4.1 Throughput and Latency

Figure 6.21 plots throughput for mp3d() and paths(). The speed of these two programs is clearly limited by network latency, as can be seen in Figure 6.20. On a ring, throughput can be pushed to 78Mbytes/s by doubling the processor clock speed, as discussed in Section 4.2.4 (page 53). mp3d() attains 93Mbytes/s on the 16-node mesh and 73Mbytes/s (84Mbytes/s) on the 64-node mesh (cube). paths() achieves 89Mbytes/s (98Mbytes/s) on the 64-node mesh (cube).


Figure 6.21: Processor throughput (top) and round-trip delay (bottom) for mp3d() (mp) and paths() (p), 256Kbytes caches. The suffixes -r, -m and -c stand for ring, mesh and cube, respectively.

The question needing an answer concerns the apparent limit of 100Mbytes/s on throughput. Given the evidence provided in Section 4.2.4 and Figure 6.20, it is fair to say that the network places a limit on performance, namely node throughput of about 100Mbytes/s. Figure 6.21 also shows the round-trip delay for mp3d() and paths(), and explains the performance improvement achieved by higher dimensional networks. Two effects cooperate towards better performance: a decrease in latency and an increase in network capacity together produce an increase in throughput and hence better performance.

The above conclusion indicates that processor throughput is limited at about 100Mbytes/s. To confirm this, more simulations were run with a 200MHz processor clock to increase the rate of network requests. paths() was run on a 16-node ring and mesh, and on a 64-node mesh and cube. The results are shown in Figure 6.22. Throughput does go beyond 100Mbytes/s on the 64-node cube. Table 6.1 shows traffic levels, throughput and round-trip delays for paths() with 100 and 200MHz processor clock frequencies. Section 4.3 (page 57) discusses the effects of network saturation in SCI rings. Saturation occurs for link traffic levels at about 600-700Mbytes/s and that places a limit on system performance. The traffic levels recorded in Table 6.1 for the 200MHz mesh are near those where network saturation occurs. Traffic levels for the 200MHz cube are lower than saturation and throughput reaches 121Mbytes/s. Thus, throughput can be higher than 100Mbytes/s provided network traffic is kept below saturation levels.

              tput   dly   X peak  X avg  Y peak  Y avg  Z peak  Z avg
 mesh 100MHz    89   188     594    434     430    362      -      -
      200MHz   106   194     687    512     483    429      -      -
 cube 100MHz    98   152     406    233     324    217     233    182
      200MHz   121   153     466    283     370    265     289    224

Table 6.1: Processor throughput, round-trip delay and link traffic per dimension (X, Y, Z), for paths(), with 100 and 200MHz processor clocks.


Figure 6.22: Execution time and processor throughput for paths(), processor clocks of 100 and 200MHz, 256Kbytes caches. The suffixes -r, -m and -c stand for ring, mesh and cube, respectively.

6.4.2 Cache Size and Network Dimensionality

An interesting question concerns the relationship between cache size and network dimensionality. Larger caches cause less network traffic because of their higher hit ratios. Lower traffic means decreases in network latency and that, in turn, tends to increase the rate of memory and network requests. Higher dimensionality networks have inherently lower latencies and higher bandwidth. These effects also tend to increase the rate of network requests. However, the higher traffic levels are better supported because of the network's higher capacity.

Given a limited budget, the architect has to weigh two options: either use the largest possible cache size or increase the network dimensionality. Both options have an associated cost and both improve performance. In order to gauge the effects of both cache size and network dimensionality, the execution times of mp3d() and paths() are shown in Figure 6.23 for 128, 256 and 512Kbytes caches and 4-, 8-, 16- and 64-node multiprocessors. These two programs generate the highest traffic levels in the workload. mp3d()'s performance benefits more from a higher-dimensional network than from bigger caches. For instance, the 64-node cube is 14 to 16% faster than the mesh while the differences in performance for the three cache sizes are less than 1%. The hit ratios are roughly the same for the three cache sizes on 64-node rings. The behaviour of the 16-node machines is similar. The two smaller multiprocessors suffer from the higher costs of switching dimension. paths() behaves differently. The systems with 512Kbytes caches show little performance difference (within 2%) in all sizes and topologies considered. The performance of the two larger machines with 128 and 256Kbytes caches improves by about 10% on the higher dimensionality networks.


Figure 6.23: Performance of mp3d() (top) and paths() (bottom) with 128K, 256K and 512Kbytes caches, on 4-, 8-, 16- and 64-node multiprocessors.

Coda

Higher hardware costs should be balanced against the potential performance gains. The performance improvements due to higher dimensionality, for the experiments reported here, are in the range of 10% to 15% for applications that generate high levels of network traffic. For programs that make better use of shared-data, the performance gains are negligible.

Chapter 7

Conclusion

This dissertation contains a detailed performance evaluation study of an SCI-based shared-memory multiprocessor. Previous studies of SCI-based systems have concentrated on network performance and to a large extent ignored the influence of the cache coherence protocol. Here, the interactions between interconnection network and cache coherence protocol were investigated. The results of the detailed simulations are summarised below.

A multiprocessor system was "implemented" in the simulator with components compatible with current levels of performance. Several architectural parameters were investigated, namely machine size, secondary cache size, processor clock speed and interconnection topology. Machines were simulated with one, two, four, eight, sixteen and sixty-four 100MIPS processors. In order to reproduce accurately the interleaving of the memory references in a NUMA architecture, the architecture simulator is driven by reference streams generated as a by-product of the execution of real programs. The simulated threads are scheduled for execution according to the state of the simulated multiprocessor and the actual delays incurred by references to remote memory and cache coherency actions.

Summary of Results

The workload used in the experiments consists of three programs from the SPLASH suite (chol(), mp3d() and water()) and three parallel loops, namely Gaussian elimination (ge()), matrix multiplication (mmult()) and all-to-all minimum cost paths (paths()). Two of the programs are ill suited for execution on physically distributed memory. mp3d() has low hit ratios and its data is highly migratory, causing high levels of cache coherence activity and network traffic. This program exhibits poor performance in every published experiment seen by the author. It is however very useful to expose architectural bottlenecks, as is the case with network saturation (see below). The data used by paths() has a high degree of read-sharing and writes to shared data often cause the purging of long sharing-lists. This also causes high levels of network traffic and, for the smaller cache sizes, high levels of coherence activity. These two programs do drive the network into saturation and their performance is, in most of the experiments, limited by network bandwidth and delays.

The other four programs have more regular behaviour and better coherent cache hit ratios. The performance penalties imposed by the cache coherence protocol and interconnect are rather small. The results of the simulation with 64Kbytes caches are a little worse than those with the larger cache sizes (128K- 512Kbytes). With the larger caches, the overheads imposed by the cache coherence protocol are always smaller than 5% of the execution time. The losses caused by network latencies are under 10% of the execution time, with higher losses occurring on 16-node rings.

An important figure of merit of an interconnect is node throughput, defined as the network bandwidth available to processing elements. For rings with processors and memory hierarchy as simulated, the experiments revealed that raw processor throughput is limited at about 80Mbytes/s because of network saturation. Data-only throughput is about 20 to 30% of raw throughput. Given that under 14% of all packets injected into the ring carry 64 bytes of data while all except echo packets carry cache coherency information, raw throughput is a better measure of overall system performance.

High levels of network traffic cause queue backlogs in the link interfaces, with round-trip delays increasing by as much as 25% as a consequence. For mp3d() and paths(), network saturation occurs for link traffic at 600 to 700Mbytes/s, for 8- and 16-node rings, and this in turn limits node throughput at 80Mbytes/s. Unlike analytical results produced by others [SGV92], the relationship between throughput and latency was found to be linear for 2-, 4- and 8-node rings. For 16-node rings, the relationship is a parabolic curve with a small quadratic coefficient. The difference between the two sets of results stems from the feedback effects of memory latencies increasing with traffic levels and holding down the rate of network requests by processors. The simulation results for SCI rings indicate that, for hardware and software with characteristics similar to those investigated here, the maximum efficient ring size is between eight and sixteen. The scalability in these small systems (1, 2, 4, 8, and 16 processors) was found to be fairly good.

A comparison between DASH [LLJ92] and an SCI-based multiprocessor with similar architectural parameters reveals that SCI's higher bandwidth and lower latencies yield better performance on programs that cause higher network traffic. On programs that generate little traffic the systems exhibit similar behaviour. The programs in the workload simulated do not have high degrees of write-sharing and thus the cache coherence protocols show similar performance. A comparison of medium size machines built with an SCI-based interconnect to contemporary machines, such as DASH, the Express Ring and the KSR1, suggests that SCI's low latency and high bandwidth make it a suitable and efficient interconnect.

Systems of 64 nodes were also investigated. The topologies simulated were meshes (2x2, 4x4, 8x8) and cubes (2x2x2, 4x4x4). The programs in the workload exhibit similar patterns of behaviour on SCI meshes and SCI cubes as on SCI rings. mp3d() and paths() tend to generate very high levels of cache coherence activity and network traffic. The other four programs show good scalability and suffer small losses in performance because of the cache coherence protocol and interconnection network. Because of increases in the data-set sizes, larger cache sizes produce better performance.

No significant relationship between cache size and network dimensionality was found. Meshes and cubes can sustain somewhat higher processor throughput than is the case for rings, mostly because of the increases in network capacity. The programs that produced node throughputs of 60-80Mbytes/s on rings produce 90-100Mbytes/s on meshes and cubes, with cubes supporting 11-16% higher throughputs than meshes. This is because of the inherently higher capacities and lower latencies of cubes. In terms of overall performance, cubes are 10-15% faster than meshes with programs that generate high levels of network traffic, that is, drive the network near to saturation. For programs that produce low levels of traffic, the differences between meshes and cubes are negligible.

An analytical model of the ring-based multiprocessor was described and used to assess the cost of flushing shared-data lines from the caches, and of purging sharing-lists. The performance of the multiprocessor degrades but is acceptable for high levels of conflict misses. The performance degrades badly for high levels of write-sharing - over 300% in one case. The analytical model provides reasonably accurate performance predictions for "well-behaved" programs and qualitatively good predictions for programs with more extreme behaviour.

The results presented above indicate that the Scalable Coherent Interface is a good implementation of the shared-memory abstraction on a machine with physically distributed memory. For the architectural parameters and workload investigated, the cache coherence protocol proved to be efficient and the interconnect provided a high-bandwidth low-latency path between processors and memory.

Further Work

There is still work to be done in the performance evaluation of SCI-based multiprocessors. That work can be pursued along two directions. First, further investigation of small systems is needed. Second, the evaluation of systems with hundreds of processors is necessary in order to assess the scalability of SCI-based multiprocessors. Some of the issues that deserve investigation concern both small and large systems. For simulations of large systems, the programs in the workload will need to be adapted, rewritten or replaced because they were designed and coded for medium size machines (32-64 processors). These codes are unlikely to scale up well to hundreds of processors without extensive rewriting.

Some architectural devices can be added to the simulator in order to improve the performance of both SCI meshes and cubes. A better mapping of data to nodes is an important optimisation because it can further reduce the distance between requesters and responders. This entails changes to the programs to reflect different mapping strategies. The efficiency of the synchronisation mechanisms employed here can be improved as well, either by software methods or by adding SCI's QOLB primitive (Queue On Locked Bit) to the simulator [AGGW94].

The simulation of SCI cubes has shown that higher dimensionality networks can sustain higher processor throughput because of lower network latency and higher capacity. The simulated multiprocessor can be modified to take advantage of the unused capacity by the addition of write-buffers between primary and secondary caches. Another alternative is to use multi-threading to hide the latency of remote memory requests. Improvements in processor performance by the addition of these devices would cause increases in network congestion. The question then is how much room for improvement there is before the network becomes the bottleneck.

One of SCI's major advantages is the scalability that is built into the communication and cache coherence protocols. Because of the high computational cost of the simulation runs, the simulation of very large systems will need a different approach. The simulation technique used here produces accurate results but its computational cost is very high. Two alternatives seem attractive. One is the direct simulation of the processors, thus avoiding interactions with the operating system [BDNS93]. The other is to use a customisable synthetic workload. While not strictly realistic, proper tuning of parameters can produce insightful results [HS94].

With 512 or 1024 nodes, a more thorough study of the relationship between cache size and dimensionality is possible since there will be more data points on which to draw comparisons. Another problem relates to synchronisation actions on large machines. Different mechanisms can be simulated and compared. To perform these experiments, either a synthetic workload will be used, or new programs suitable for large scale shared-memory will have to be employed.

Bibliography

[Aga91] Anant Agarwal. Limits on interconnection network performance. IEEE Trans. on Parallel and Distributed Systems, 2(4):398-412, October 1991.

[AGGW94] N M Aboulenein, J R Goodman, S Gjessing, and P J Woest. Hardware support for synchronisation in the Scalable Coherent Interface (SCI). In Proc of the 8th Int. Parallel Processing Symposium, pages 141-150, Cancún, Mexico, 1994. IEEE Comp Soc Press.

[ASHH88] A Agarwal, R Simoni, J Hennessy, and M Horowitz. An evaluation of directory schemes for cache coherence. In Proc. 15th Int. Symp. on Computer Architecture, pages 280-289, May 1988.

[BD91] Luiz A Barroso and Michel Dubois. Cache coherence on a slotted ring. In Proc 1991 Int. Conf. Parallel Processing, volume 1, pages 230-237, St. Charles, IL, USA, August 1991.

[BD93] Luiz A Barroso and Michel Dubois. The performance of cache-coherent ring-based multiprocessors. In Proc. 20th Int. Symp. on Computer Architecture, pages 268-277. ACM SIGARCH Comp Arch News 21(2), May 1993.

[BDMR92] J A C Bogaerts, R Divià, H Muller, and J F Renardy. SCI based data acquisition architectures. IEEE Trans. on Nuclear Sciences, 39(2), April 1992.

[BDNS93] M Brorsson, F Dahlgren, H Nilsson, and P Stenström. The CacheMire test bench - a flexible and effective approach for simulation of multiprocessors. Tech Report Dt-159, Dept of Computer Engineering, Lund Univ, March 1993. In Proc. of the 26th Annual Simulation Symposium, Arlington, USA, March 1993.

[Bel92] Gordon Bell. Ultracomputers: A teraflop before its time. Comm. of the ACM, 35(8):27-47, August 1992.

[BGY87] L N Bhuyan, D Ghosal, and Q Yang. Approximate analysis of single and multiple ring networks. IEEE Trans. on Computers, C-38(7):1027-1040, July 1987.


[Bit92] Philip Bitar. The weakest memory-access order. Journal of Parallel and Distributed Computing, 2(15):305-331, March 1992.

[BKB90] H O Bugge, E H Kristiansen, and B O Bakka. Trace driven simulations for a two-level cache design in open bus systems. In Proc. 17th Int. Symp. on Computer Architecture, pages 250-259. ACM SIGARCH Comp Arch News 18(2), May 1990.

[BKW90] A Borg, R E Kessler, and D W Wall. Generation and analysis of very long address traces. In Proc. 17th Int. Symp. on Computer Architecture, pages 270-279. ACM SIGARCH Comp Arch News 18(2), May 1990.

[BS92] Mats Brorsson and Per Stenström. Visualising sharing behaviour in relation to shared memory management. Tech Report Dt-150, Dept of Computer Engineering, Lund Univ, December 1992. In Proc. of 1992 Int. Conf. on Parallel and Distributed Systems, Hsinchu, Taiwan, December 1992, pages 528-536.

[Bur92] H Burkhardt. Overview of the KSR1 computer system. Tech Report KSR-TR-9202001, Kendall Square Research, Boston, 1992.

[BW88] Jean-Loup Baer and Wen-Hann Wang. On the inclusion properties for multi-level cache hierarchies. In Proc. 15th Int. Symp. on Computer Architecture, pages 73-80, May 1988.

[CA94] David Chaiken and Anant Agarwal. Software-extended coherent shared memory: performance and cost. In Proc. 21st Int. Symp. on Computer Architecture, pages 314-324. ACM SIGARCH Comp Arch News 22(2), April 1994.

[CDK94] A L Cox, S Dwarkadas, P Keleher, H Lu, R Rajamony, and W Zwaenepoel. Software versus hardware shared-memory implementation: A case study. In Proc. 21st Int. Symp. on Computer Architecture, pages 106-117. ACM SIGARCH Comp Arch News 22(2), April 1994.

[CFKA90] D Chaiken, C Fields, K Kurihara, and A Agarwal. Directory-based cache coherence in large-scale multiprocessors. IEEE Computer, 23(6):49-59, June 1990.

[CKA91] D Chaiken, J Kubiatowicz, and A Agarwal. LimitLESS directories: A scalable cache coherence scheme. In Fourth Int. Conf. on Architectural Support for Progr. Lang. and Oper. Sys., pages 224-234. ACM SIGARCH Comp Arch News 19(2), April 1991.

[Dal90] William J Dally. Performance analysis of k-ary n-cube interconnection networks. IEEE Trans. on Computers, C-39(6):775-785, June 1990.

[DPL80] N Deo, C Y Pang, and R E Lord. Two parallel algorithms for shortest path problems. Tech Report CS-80-059, Washington State Univ, March 1980.

[DS87] William J Dally and Charles L Seitz. Deadlock-free message routing in multiprocessor interconnection networks. IEEE Trans. on Computers, C-36(5):547-553, May 1987.

[DS90] Michel Dubois and Christoph Scheurich. Memory access dependencies in shared-memory multiprocessors. IEEE Trans. on Software Engineering, 16(6):660-673, June 1990.

[FBYR89] A Forin, J Barrera, M Young, and R Rashid. Design, implementation, and performance evaluation of a distributed shared memory server for MACH. In Proc of the Winter USENIX Conf, January 1989.

[FVS92] K Farkas, Z Vranesic, and M Stumm. Cache consistency in hierarchical ring-based multiprocessors. Tech Report EECG TR-92-09-01, Univ. of Toronto, 1992. Also in Proc. of Supercomputing '92.

[GGH91] K Gharachorloo, A Gupta, and J Hennessy. Performance evaluation of memory consistency models for shared-memory multiprocessors. In Fourth Int. Conf. on Architectural Support for Progr. Lang. and Oper. Sys., pages 245-257. ACM SIGARCH Comp Arch News 19(2), April 1991.

[GHG91] A Gupta, J L Hennessy, K Gharachorloo, T Mowry, and W-D Weber. Comparative evaluation of latency reducing and tolerating techniques. In Proc. 18th Int. Symp. on Computer Architecture, pages 254-263. ACM SIGARCH Comp Arch News 19(3), May 1991.

[GLL90] K Gharachorloo, D Lenoski, J Laudon, P Gibbons, and J L Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proc. 17th Int. Symp. on Computer Architecture, pages 15-26. ACM SIGARCH Comp Arch News 18(2), May 1990.

[GNWZ91] D Grunwald, G J Nutt, D Wagner, and B Zorn. A parallel execution evaluation testbed. Tech Report CU-CS-560-91, Dept of Computer Science, Univ of Colorado, November 1991.

[Goo91] James R Goodman. Cache consistency and sequential consistency. Tech Report 1006, Computer Sciences Dept, Univ of Wisconsin-Madison, February 1991. Also published as SCI Committee Report 61, March 1989.

[GW88] J R Goodman and P Woest. The Wisconsin multicube: A new large-scale cache coherent multiprocessor. In Proc. 15th Int. Symp. on Computer Architecture, pages 422-431, May 1988.

[Hag92] Erik Hagersten. Toward Scalable Cache Only Memory Architectures. PhD dissertation, Swedish Institute of Computer Science, October 1992. ISBN 91-7170-103-6.

[HALH91] E Hagersten, P Andersson, A Landin, and S Haridi. A performance study of the DDM - a cache-only architecture. Tech Report R91:17, Swedish Inst of Computer Science, November 1991.

[Hil90] Mark D Hill. What is scalability? ACM SIGARCH Computer Architecture News, 18(4):18-21, December 1990.

[HN88] A Hooper and R Needham. The Cambridge Fast Ring networking system. IEEE Trans. on Computers, C-37(10):1214-1224, October 1988.

[HP90] John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990. ISBN 1-55860-069-8.

[HS94] Mark Holliday and Michael Stumm. Performance evaluation of hierarchical ring-based shared memory multiprocessors. IEEE Trans. on Computers, C-43(1):52-67, January 1994.

[HT94a] Roberto A Hexsel and Nigel P Topham. The performance of SCI memory hierarchies. In Proc of the Int. Workshop on Support for Large Scale Shared Memory Architectures, pages 1-17, Cancún, Mexico, April 1994. In conjunction with 8th IPPS.

[HT94b] Roberto A Hexsel and Nigel P Topham. The performance of SCI memory hierarchies. Tech Report CSR-30-94, Dept of Computer Science, Univ of Edinburgh, February 1994.

[Hwa93] Kai Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993. ISBN 0-07-031622-8.

[IEE91] IEEE. Futurebus: Logical Layer Specification, 896.1-1991. IEEE Microprocessor Standards Subcommittee, 1991.

[IEE92] IEEE. ANSI/IEEE Std 1596-1992 - Standard for Scalable Coherent Interface. IEEE, 1992. IEEE publications are available from the Institute of Electrical and Electronics Engineers, Inc., Service Center, 445 Hoes Lane, P.O. Box 1331, Piscataway, NJ 08855-1331.

[IT89] Roland N Ibbett and Nigel P Topham. Architecture of High Performance Computers, volume 2. Macmillan, 1989. ISBN 0-333-48988-8.

[Jai91] Raj Jain. The Art of Computer Systems Performance Analysis. John Wiley, 1991. ISBN 0-471-50336-3.

[JG91] Ross E Johnson and James R Goodman. Interconnect topologies with point-to-point rings. Tech Report 1058, Computer Sciences Dept, Univ of Wisconsin-Madison, December 1991.

[JG92] Ross E Johnson and James R Goodman. Synthesising general topologies from rings. In Proc of the Intl Conf on Parallel Processing (ICPP92), volume I - Architecture, pages 86-95, 1992.

[Joh93] Ross E Johnson. Extending the Scalable Coherent Interface for Large-Scale Shared-Memory Multiprocessors. PhD dissertation, Computer Sciences Dept, Univ of Wisconsin-Madison, February 1993. Also Tech Report 1136.

[KEL91] E J Koldinger, S J Eggers, and H M Levy. On the validity of trace driven simulations for multiprocessors. In Proc. 18th Int. Symp. on Computer Architecture, pages 244-253. ACM SIGARCH Comp Arch News 19(3), May 1991.

[LAD92] C E Leiserson, Z S Abuhamdeh, D C Douglas, C R Feynman, M N Ganmukhi, J V Hill, W D Hillis, B C Kuszmaul, M A St Pierre, D S Wells, M C Wong, S Yang, and R Zak. The network architecture of the Connection Machine CM-5. In Proc of the 4th Annual ACM Symp on Parallel Algorithms and Architecture, San Diego, USA, 1992. ACM SIGACT, SIGARCH.

[Lam79] Leslie Lamport. How to make a multiprocessor that correctly executes multiprocess programs. IEEE Trans. on Computers, C-28(9):690-691, September 1979.

[Lei85] Charles E Leiserson. Fat-trees: universal networks for hardware-efficient supercomputing. IEEE Trans. on Computers, C-34(10):892-901, October 1985.

[LH89] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Trans on Computer Systems, 7(4):229-359, November 1989.

[LLG90] D Lenoski, J Laudon, K Gharachorloo, A Gupta, and J L Hennessy. The directory-based cache coherence protocol for the DASH multiprocessor. In Proc. 17th Int. Symp. on Computer Architecture, pages 148-159. ACM SIGARCH Comp Arch News 18(2), May 1990.

[LLJ92] D Lenoski, J Laudon, T Joe, D Nakahira, L Stevens, A Gupta, and J Hennessy. The DASH prototype: Implementation and performance. In Proc. 19th Int. Symp. on Computer Architecture, pages 92-103. ACM SIGARCH Comp Arch News 20(2), May 1992.

[MB92] Daniel Menascé and Luiz A Barroso. A methodology for performance eval- uation of parallel applications on multiprocessors. Journal of Parallel and Distributed Computing, 2(14):1-14, January 1992.

[MS91] J M Mellor-Crummey and M L Scott. Synchronisation without contention. In Fourth Int. Conf. on Architectural Support for Progr. Lang. and Oper. Sys., pages 269-278. ACM SIGARCH Comp Arch News 19(2), April 1991.

[MSW94] H L Muller, P W A Stallard, and D H D Warren. An evaluation study of a link-based data diffusion machine. In Proc of the Int. Workshop on Support for Large Scale Shared Memory Architectures, pages 115-128, Cancún, Mexico, April 1994. In conjunction with 8th IPPS.

[NS92] Håkan Nilsson and Per Stenström. The scalable tree protocol - a cache coherence approach for large-scale multiprocessors. Tech Report Dt-149, Dept of Computer Engineering, Lund Univ, December 1992. In Proc. of 4th IEEE Symp. on Parallel and Distributed Processing, December 1992, pages 498-506.

[NS93] Håkan Nilsson and Per Stenström. Performance evaluation of link-based cache coherence schemes. Tech Report Dt-151, Dept of Computer Engineering, Lund Univ, January 1993. In Proc. of the 26th Hawaii Int. Conf. on System Sciences, January 1993, pages I-486-495.

[OMB91] O A Olukotun, T N Mudge, and R B Brown. Implementing a cache for a high-performance GaAs microprocessor. In Proc. 18th Int. Symp. on Computer Architecture, pages 138-147. ACM SIGARCH Comp Arch News 19(3), May 1991.

[Prz90] Steven A Przybylski. Cache and Memory Hierarchy Design: a Performance-Directed Approach. Morgan Kaufmann, 1990. ISBN 1-55860-136-8.

[RAK89] U Ramachandran, M Ahamad, and M Y A Khalidi. Coherence of distributed shared memory: Unifying synchronisation and data transfer. In Proc of the Intl Conf on Parallel Processing (ICPP89), volume II, pages 160-169, 1989.

[Sco92] Steven L Scott. Toward the Design of Large-Scale, Shared-Memory Multiprocessors. PhD dissertation, Computer Sciences Dept, Univ of Wisconsin-Madison, July 1992. Also Tech Report 1100.

[Sei85] Charles L Seitz. The cosmic cube. Comm. of the ACM, 28(1), January 1985.

[SG91] Steven L Scott and James R Goodman. Performance of pipelined K-ary N-cube networks. Tech Report 1010, Computer Sciences Dept, Univ of Wisconsin-Madison, February 1991.

[SGV92] S L Scott, J R Goodman, and M K Vernon. Performance of the SCI ring. In Proc. 19th Int. Symp. on Computer Architecture, pages 403-414. ACM SIGARCH Comp Arch News 20(2), May 1992.

[SJG92] P Stenström, T Joe, and A Gupta. Comparative performance evaluation of cache-coherent NUMA and COMA architectures. In Proc. 19th Int. Symp. on Computer Architecture, pages 80-91. ACM SIGARCH Comp Arch News 20(2), May 1992.

[SPG91] A Silberschatz, J L Peterson, and P B Galvin. Operating Systems Concepts. Addison-Wesley, third edition, 1991. ISBN 0-201-54873-9.

[Ste90] Per Stenström. A survey of cache coherence schemes for multiprocessors. IEEE Computer, 23(6):12-24, June 1990.

[Sto90] Harold S Stone. High-Performance Computer Architecture. Addison- Wesley, second edition, 1990. ISBN 0-201-51377-3.

[SWG91] J P Singh, W-D Weber, and A Gupta. SPLASH: Stanford ParalleL Applications for SHared memory. Technical Report CSL-TR-91-469, Computer Science Dept, Stanford Univ, April 1991. Also in ACM SIGARCH Comp Arch News 20(1).

[Tan89] Andrew S Tanenbaum. Computer Networks. Prentice-Hall, second edition, 1989. ISBN 0-13-166836-6.

[TD91] Manu Thapar and Bruce Delagi. Cache coherence for large scale shared memory multiprocessors. ACM SIGARCH Computer Architecture News, 19(1):114-191, March 1991. Reprinted from Proc of SPAA 1990.

[VIT90] VITA. 64-Bit VMEbus Specification - Edition D. VME Bus International Trade Association and IEEE P1014 Working Group, January 1990.

[VSLW91] Z Vranesic, M Stumm, D Lewis, and R White. Hector: a hierarchically structured shared memory multiprocessor. IEEE Computer, 24(1):72-78, January 1991.

[WG89] Wolf-Dietrich Weber and Anoop Gupta. Analysis of cache invalidation patterns in multiprocessors. In Third Int. Conf. on Architectural Support for Progr. Lang. and Oper. Sys., pages 243-256. ACM SIGARCH Comp Arch News 17(2), April 1989.

[WHL93] C E Wu, Y Hsu, and Y-H Liu. A quantitative evaluation of cache types for high-performance computer systems. IEEE Trans. on Computers, C-42(10):1154-1162, October 1993.

[WLI94] A W Wilson Jr, R P LaRowe Jr, R J Ionta, R P Valentino, B Hu, P R Breton, and P Lau. Update propagation in the Galactica Net distributed shared memory architecture. In Proc of the Int. Workshop on Support for Large Scale Shared Memory Architectures, pages 18-31, Cancún, Mexico, April 1994. In conjunction with 8th IPPS.

Appendix A

Performance Data

This appendix contains the statistics for the experiments described in Chapters 4 and 6. The tables are grouped by topology and, within each topology, by program. The statistics collected in the tables are defined in Sections 4.1 and 6.2.3. Table A.1 defines the meanings of the columns of the reference count tables. Table A.2 defines the meanings for the hit ratio tables. Table A.3 defines the meanings for the traffic and timing tables.

tag      meaning
cSz      cache size
N        machine/ring size
shdRD    shared-data read
shdWR    shared-data write
lclRD    local-data read
lclWR    local-data write
i-fetch  instruction fetch

Table A.1: Per node reference count tables.

cSz    cache size
N      machine/ring size
lrpch  local read in primary cache
lrcch  local read in coherent cache
lwcch  local write in coherent cache
srpch  shared-data read in primary cache
srcch  shared-data read in coherent cache
swcch  shared-data write in coherent cache
ifpch  i-fetch in primary cache
ifcch  i-fetch in coherent cache
flush  lines flushed per coherent cache reference
purge  sharing-lists purged per coherent cache write
shl    sharing-list length

Table A.2: Hit ratio tables.
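If, as the tag definitions above suggest, the coherent cache backs the primary cache, the per-class ratios in the hit ratio tables can be combined into an overall hit ratio. The following is a minimal sketch under that assumption (it is not a statistic reported by the simulator), for shared-data reads:

\[
h_{\mathrm{srd}} = \mathit{srpch} + (1 - \mathit{srpch}) \cdot \mathit{srcch}
\]

For example, the N=2 row of Table A.5 gives $0.711 + 0.289 \times 0.987 \approx 0.996$.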


cSz    cache size
N      machine/ring size
n      dimension (meshes and cubes)
lkav   average link traffic
lkmx   maximum link traffic
txav   average output buffer traffic
txmx   maximum output buffer traffic
psav   average bypass buffer traffic
psmx   maximum bypass buffer traffic
rtdly  round-trip delay
rtime  execution time
ntwF   fraction of the time due to network latency
shdF   fraction of the time due to shared-data references (RD+WR+synch)
lclF   fraction of the time due to local references (RD+WR+ifetch)

Table A.3: Network traffic and timing tables.
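The three timing fractions in Table A.3 partition each run's execution time, so every row of the traffic and timing tables should satisfy, up to rounding,

\[
\mathit{ntwF} + \mathit{shdF} + \mathit{lclF} \approx 1 ,
\]

e.g. the N=2 row of Table A.6 gives $0.001 + 0.456 + 0.541 = 0.998$.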

A.1 SCI Rings

A.1.1 chol() - DASH Parameters

cSz N shdRD shdWR lclRD lclWR i-fetch
256K 1 8444 1863 22763 8303 71750
2 9883 2939 6641 2168 38930
4 6875 2075 2049 543 21027
8 4639 1191 1111 166 14313
16 2904 556 1172 157 11106

Table A.4: Per node reference counts for chol() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
256K 1 0.654 0.999 1.000 0.775 0.973 0.983 1.000 0.000 0.797 0.000 0.0
2 0.692 0.999 1.000 0.711 0.987 0.990 1.000 0.000 0.662 0.553 1.0
4 0.754 0.999 0.999 0.708 0.987 0.990 1.000 0.000 0.621 0.694 1.0
8 0.863 0.999 1.000 0.753 0.987 0.989 1.000 0.000 0.568 0.810 1.0
16 0.875 0.999 0.998 0.816 0.983 0.986 1.000 0.000 0.516 0.875 1.0

Table A.5: Hit ratios for chol().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.260 0.000 0.152 0.848
2 1.5 1.5 1.2 1.2 0.4 0.4 42.0 5.315 0.001 0.456 0.541
4 5.6 5.7 2.1 2.3 3.5 3.7 52.7 2.801 0.003 0.625 0.371
8 14.1 14.1 2.5 2.7 11.5 11.8 79.5 1.581 0.007 0.658 0.336
16 30.2 30.3 2.7 4.0 27.5 28.2 130.2 0.953 0.012 0.528 0.460

Table A.6: Traffic and timing for chol().

A.1.2 mp3d() - DASH Parameters

cSz N shdRD shdWR lclRD lclWR i-fetch
256K 1 3620 2343 10754 2339 35059
2 3755 1171 5382 1172 21439
4 1594 585 2694 587 10164
8 1041 293 1352 295 5585
16 1098 147 677 148 3951

Table A.7: Per node reference counts for mp3d() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
256K 1 0.818 0.997 0.999 0.338 0.951 0.950 1.000 0.000 0.911 0.000 0.0
2 0.826 0.995 0.998 0.681 0.892 0.890 1.000 0.000 0.608 0.710 1.0
4 0.830 0.993 0.997 0.624 0.853 0.848 1.000 0.000 0.513 0.910 1.0
8 0.832 0.992 0.997 0.709 0.824 0.812 1.000 0.000 0.472 0.963 1.0
16 0.833 0.992 0.997 0.860 0.770 0.744 1.000 0.000 0.430 0.985 1.1

Table A.8: Hit ratios for mp3d().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.833 0.000 0.421 0.580
2 12.5 12.6 9.5 9.7 2.9 3.0 44.2 2.957 0.013 0.519 0.469
4 43.6 44.0 15.8 16.9 27.7 29.2 54.0 1.608 0.028 0.541 0.429
8 105.2 105.7 18.9 20.9 86.3 89.8 82.2 0.900 0.052 0.563 0.384
16 228.5 229.1 20.6 26.2 207.9 211.3 152.9 0.580 0.104 0.597 0.299

Table A.9: Traffic and timing for mp3d().


A.1.3 water() - DASH Parameters

cSz N shdRD shdWR lclRD lclWR i-fetch
256K 1 20173 3071 200076 46251 475688
2 12619 1536 100038 23125 242909
4 8163 768 50019 11563 125162
8 4093 384 25010 5781 62604
16 2414 192 12505 2891 32038

Table A.10: Per node reference counts for water() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
256K 1 0.912 0.999 1.000 0.917 0.955 0.988 1.000 0.000 0.640 0.000 0.0
2 0.913 1.000 1.000 0.934 0.956 0.988 1.000 0.000 0.637 0.119 1.0
4 0.913 0.999 1.000 0.949 0.884 0.948 1.000 0.000 0.539 0.765 1.0
8 0.913 1.000 1.000 0.949 0.839 0.926 1.000 0.000 0.507 0.907 1.0
16 0.913 0.997 0.999 0.957 0.826 0.922 1.000 0.000 0.369 0.925 1.0

Table A.11: Hit ratios for water().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 41.960 0.000 0.047 0.953
2 0.2 0.2 0.1 0.1 0.1 0.1 44.3 21.200 0.000 0.058 0.941
4 2.8 2.8 1.0 1.2 1.8 1.9 52.9 10.900 0.002 0.083 0.915
8 9.2 9.3 1.6 2.2 7.6 7.8 76.5 5.504 0.004 0.090 0.905
16 20.9 21.1 1.8 3.3 19.1 19.4 127.7 2.812 0.008 0.102 0.890

Table A.12: Traffic and timing for water().

A.1.4 chol()

cSz N shdRD shdWR lclRD lclWR i-fetch
64K 1 8444 1863 22763 8303 71750
2 9596 2933 6451 2175 37254
4 6963 1990 2343 628 22257
8 4805 1205 1018 152 14287
16 1931 315 1554 398 9160
128K 1 8444 1863 22763 8303 71750
2 9818 2984 6230 2124 37265
4 9348 1954 2813 664 29136
8 4671 1174 1100 184 14209
16 1904 308 1571 405 9216
256K 1 8444 1863 22763 8303 71750
2 9609 2942 6333 2165 36918
4 6561 2002 2071 616 20253
8 3988 1177 845 180 11576
16 2387 558 774 155 8132
512K 1 8444 1863 22763 8303 71750
2 9738 2962 6215 2145 36952
4 6689 2052 1989 566 20327
8 3999 1186 851 172 11713
16 1785 324 1426 389 8336
8M 1 8444 1863 22763 8303 71750
2 9743 2988 6230 2119 37033
4 6645 2041 1991 577 20258
8 3919 1186 831 172 11394
16 2344 555 771 157 8028

Table A.13: Per node reference counts for chol() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.651 0.998 0.999 0.732 0.926 0.962 0.998 0.450 0.660 0.000 0.0
2 0.678 0.999 0.999 0.682 0.970 0.985 0.998 0.441 0.645 0.252 1.0
4 0.743 0.998 0.998 0.707 0.950 0.974 0.995 0.181 0.571 0.203 1.0
8 0.860 0.994 0.996 0.746 0.962 0.979 0.996 0.219 0.558 0.343 1.0
16 0.752 0.978 0.990 0.829 0.938 0.965 0.997 0.183 0.365 0.521 1.0
128K 1 0.651 0.999 0.999 0.733 0.954 0.973 0.998 0.606 0.694 0.000 0.0
2 0.675 0.999 0.999 0.686 0.980 0.988 0.998 0.618 0.646 0.387 1.0
4 0.773 0.999 0.999 0.784 0.943 0.970 0.995 0.145 0.533 0.221 1.0
8 0.843 0.999 0.999 0.745 0.964 0.980 0.997 0.231 0.538 0.415 1.0
16 0.753 0.998 0.998 0.831 0.952 0.968 0.998 0.403 0.523 0.635 1.0
256K 1 0.650 0.999 1.000 0.735 0.977 0.983 0.999 0.912 0.797 0.000 0.0
2 0.673 0.999 0.999 0.684 0.988 0.990 0.999 0.913 0.655 0.553 1.0
4 0.713 0.999 0.999 0.693 0.988 0.990 0.999 0.916 0.600 0.729 1.0
8 0.797 0.999 1.000 0.706 0.987 0.989 0.999 0.920 0.557 0.816 1.0
16 0.809 0.999 0.998 0.766 0.983 0.986 0.999 0.894 0.527 0.860 1.0
512K 1 0.651 1.000 1.000 0.736 0.989 0.992 0.999 0.954 0.645 0.000 0.0
2 0.671 1.000 1.000 0.687 0.990 0.991 0.999 0.953 0.556 0.740 1.0
4 0.728 0.999 0.999 0.692 0.989 0.990 0.999 0.953 0.552 0.812 1.0
8 0.810 1.000 1.000 0.704 0.989 0.989 0.999 0.957 0.512 0.879 1.0
16 0.740 1.000 0.999 0.814 0.975 0.977 0.999 0.925 0.486 0.913 1.0
8M 1 0.650 1.000 1.000 0.735 0.996 0.996 0.999 0.998 0.436 0.000 0.0
2 0.675 1.000 1.000 0.684 0.992 0.992 0.999 0.996 0.478 0.836 1.0
4 0.724 1.000 1.000 0.689 0.990 0.991 0.999 0.993 0.473 0.895 1.0
8 0.804 1.000 1.000 0.698 0.989 0.990 0.999 0.988 0.466 0.913 1.0
16 0.804 1.000 0.999 0.764 0.986 0.987 0.999 0.978 0.454 0.936 1.0

Table A.14: Hit ratios for chol().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.607 0.000 0.144 0.856
2 13.4 13.6 10.3 10.6 3.0 3.1 46.2 0.842 0.014 0.335 0.652
4 76.0 82.4 28.5 51.4 47.4 68.1 59.0 0.537 0.050 0.427 0.523
8 108.5 111.4 19.8 28.3 88.7 93.5 86.0 0.307 0.053 0.464 0.483
16 195.7 200.7 17.8 23.4 177.9 188.3 152.0 0.187 0.087 0.244 0.669
128K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.571 0.000 0.131 0.869
2 13.1 13.2 10.1 10.2 3.0 3.0 46.5 0.826 0.014 0.332 0.655
4 79.6 89.0 30.0 58.0 49.6 78.8 58.8 0.651 0.052 0.461 0.487
8 88.4 89.7 16.1 19.0 72.2 75.6 83.9 0.299 0.044 0.453 0.503
16 178.9 182.0 16.2 24.0 162.7 167.3 150.5 0.176 0.081 0.232 0.688
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.541 0.000 0.120 0.881
2 8.1 8.2 5.7 5.9 2.3 2.3 36.5 0.801 0.008 0.322 0.670
4 29.6 29.9 9.8 10.1 19.8 20.3 45.8 0.437 0.020 0.414 0.566
8 71.8 72.3 11.6 13.1 60.2 61.6 71.5 0.245 0.040 0.446 0.513
16 144.2 146.7 11.5 16.3 132.7 136.6 130.1 0.158 0.072 0.360 0.568
512K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.527 0.000 0.113 0.887
2 9.7 9.7 7.4 7.5 2.3 2.3 44.1 0.798 0.010 0.320 0.670
4 34.5 35.1 12.7 14.2 21.8 23.0 55.7 0.437 0.023 0.422 0.555
8 88.4 88.9 15.9 17.4 72.5 74.6 83.1 0.246 0.044 0.439 0.517
16 172.8 173.9 15.4 24.3 157.4 162.6 148.3 0.160 0.078 0.236 0.686
8M 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.517 0.000 0.109 0.891
2 7.2 7.2 5.1 5.1 2.1 2.1 35.4 0.793 0.007 0.320 0.671
4 26.7 27.4 8.8 9.2 17.9 18.5 45.3 0.433 0.019 0.419 0.562
8 66.9 67.1 10.7 12.4 56.1 57.0 71.3 0.240 0.037 0.444 0.519
16 134.6 136.0 10.6 15.4 124.0 127.1 129.2 0.154 0.068 0.353 0.579

Table A.15: Traffic and timing for chol().

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.651 0.998 0.999 0.732 0.926 0.962 0.998 0.450 0.660 0.000 0.0
2 0.679 0.997 0.998 0.701 0.968 0.984 0.998 0.441 0.638 0.253 1.0
4 0.767 0.995 0.997 0.739 0.945 0.972 0.995 0.167 0.561 0.185 1.0
8 0.887 0.996 0.998 0.815 0.959 0.978 0.997 0.204 0.555 0.327 1.0
16 0.860 0.977 0.988 0.904 0.936 0.965 0.998 0.184 0.361 0.523 1.0
256K 1 0.651 0.999 1.000 0.736 0.977 0.983 0.999 0.912 0.798 0.000 0.0
2 0.677 0.999 1.000 0.687 0.987 0.989 0.999 0.913 0.668 0.552 1.0
4 0.728 1.000 1.000 0.695 0.987 0.989 0.999 0.915 0.630 0.686 1.0
8 0.825 0.999 1.000 0.712 0.987 0.989 0.999 0.921 0.561 0.817 1.0
16 0.754 0.999 0.999 0.820 0.974 0.977 0.999 0.894 0.509 0.881 1.0

Table A.16: Hit ratios for chol(), 200MHz CPU clock.

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.941 0.000 0.180 0.820
2 23.6 25.8 18.2 20.5 5.4 5.4 45.6 0.512 0.024 0.376 0.599
4 113.8 123.4 43.0 73.1 70.8 102.0 59.8 0.367 0.076 0.447 0.478
8 144.3 146.1 26.8 37.3 117.5 125.4 89.2 0.226 0.075 0.490 0.435
16 226.6 233.9 20.6 33.2 206.1 219.3 161.2 0.166 0.107 0.208 0.686
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.875 0.000 0.140 0.860
2 18.1 18.5 13.9 14.3 4.2 4.2 45.3 0.466 0.019 0.357 0.624
4 63.6 64.1 23.4 24.9 40.2 41.4 56.3 0.259 0.042 0.460 0.497
8 151.2 151.8 27.3 30.6 123.9 128.4 86.4 0.150 0.079 0.475 0.446
16 284.6 286.7 25.5 38.4 259.2 266.0 161.1 0.100 0.139 0.248 0.614

Table A.17: Traffic and timing for chol(), 200MHz CPU clock.

A.1.5 mp3d()

cSz N shdRD shdWR lclRD lclWR i-fetch
64K 1 3301 2116 10031 2123 32770
2 4456 1577 7427 1591 28256
4 3464 1198 5580 1201 21429
8 3432 898 4166 900 17675
16 4720 667 3093 672 17465
128K 1 3301 2116 10031 2123 32770
2 4016 1578 7426 1590 27368
4 3461 1196 5579 1201 21423
8 3768 901 4179 903 18382
16 5030 665 3083 669 18057
256K 1 3301 2116 10031 2123 32770
2 3922 1577 7426 1590 27183
4 3318 1193 5566 1199 21103
8 4102 895 4151 897 18975
16 5341 662 3071 667 18647
512K 1 3301 2116 10031 2123 32770
2 3661 1576 7422 1590 26653
4 3246 1195 5570 1199 20971
8 4120 893 4145 896 19000
16 4972 664 3079 668 17932
8M 1 3301 2116 10031 2123 32770
2 3836 1578 7427 1590 27014
4 3446 1193 5563 1198 21351
8 4281 895 4152 897 19339
16 5321 662 3071 667 18607

Table A.18: Per node reference counts for mp3d() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.814 0.991 0.996 0.331 0.934 0.933 0.998 0.439 0.833 0.000 0.0
2 0.821 0.992 0.997 0.630 0.872 0.868 0.998 0.404 0.607 0.653 1.0
4 0.825 0.988 0.994 0.639 0.839 0.831 0.998 0.358 0.511 0.850 1.0
8 0.827 0.990 0.995 0.726 0.809 0.797 0.998 0.421 0.463 0.927 1.1
16 0.827 0.986 0.994 0.850 0.766 0.748 0.999 0.439 0.411 0.959 1.2
128K 1 0.814 0.996 0.998 0.332 0.970 0.970 0.999 0.596 0.804 0.000 0.0
2 0.821 0.995 0.998 0.590 0.876 0.871 0.999 0.616 0.560 0.815 1.0
4 0.825 0.992 0.996 0.639 0.841 0.832 0.998 0.446 0.500 0.899 1.0
8 0.827 0.994 0.997 0.749 0.805 0.790 0.999 0.576 0.453 0.962 1.1
16 0.828 0.992 0.997 0.860 0.763 0.741 0.999 0.705 0.411 0.980 1.2
256K 1 0.814 0.996 0.999 0.332 0.974 0.974 0.999 0.596 0.788 0.000 0.0
2 0.821 0.996 0.998 0.580 0.888 0.883 0.999 0.614 0.528 0.874 1.0
4 0.825 0.996 0.998 0.624 0.837 0.825 0.999 0.600 0.485 0.961 1.0
8 0.827 0.995 0.998 0.771 0.804 0.787 0.999 0.810 0.453 0.979 1.1
16 0.828 0.995 0.998 0.868 0.761 0.737 0.999 0.847 0.408 0.991 1.2
512K 1 0.814 0.999 0.999 0.332 0.994 0.994 0.999 0.646 0.520 0.000 0.0
2 0.821 0.999 0.999 0.550 0.886 0.880 0.999 0.691 0.492 0.970 1.0
4 0.825 0.998 0.999 0.615 0.838 0.825 0.999 0.699 0.480 0.984 1.0
8 0.827 0.998 0.999 0.772 0.803 0.785 0.999 0.978 0.450 0.991 1.1
16 0.828 0.997 0.999 0.858 0.761 0.738 0.999 0.973 0.411 0.996 1.2
8M 1 0.814 1.000 1.000 0.332 0.999 0.999 0.999 0.992 0.433 0.000 0.0
2 0.821 1.000 1.000 0.571 0.894 0.887 0.999 0.991 0.495 0.994 1.0
4 0.825 1.000 1.000 0.638 0.837 0.822 0.999 0.988 0.482 0.998 1.0
8 0.827 1.000 1.000 0.780 0.804 0.786 0.999 0.985 0.450 0.999 1.1
16 0.828 1.000 1.000 0.868 0.756 0.731 0.999 0.981 0.410 0.999 1.2

Table A.19: Hit ratios for mp3d().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.738 0.000 0.331 0.669
2 60.7 61.5 43.1 43.8 17.7 17.7 36.9 0.750 0.065 0.450 0.485
4 182.6 186.3 59.5 62.4 123.1 128.9 47.7 0.672 0.131 0.459 0.410
8 379.9 384.9 61.7 67.4 318.3 326.8 82.3 0.644 0.239 0.445 0.316
16 622.0 626.7 50.4 53.4 571.5 578.4 172.0 0.764 0.400 0.400 0.200
128K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.691 0.000 0.294 0.706
2 67.2 67.6 47.5 48.0 19.6 19.7 36.6 0.737 0.071 0.440 0.489
4 186.1 189.8 60.5 63.3 125.6 131.4 47.7 0.671 0.134 0.459 0.407
8 385.0 389.4 62.6 67.6 322.4 331.5 82.7 0.666 0.243 0.453 0.304
16 624.3 628.9 50.6 54.2 573.7 579.5 172.4 0.781 0.402 0.406 0.191
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.685 0.000 0.289 0.711
2 64.9 65.0 45.9 46.0 19.0 19.0 36.8 0.718 0.069 0.430 0.502
4 198.0 201.5 63.8 66.1 134.2 136.6 47.7 0.676 0.142 0.460 0.398
8 383.0 387.9 62.2 65.2 320.8 325.2 82.7 0.676 0.242 0.462 0.296
16 621.4 622.6 50.4 54.7 571.0 574.3 172.7 0.795 0.401 0.413 0.186
512K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.661 0.000 0.266 0.734
2 70.5 70.5 49.9 49.9 20.7 20.7 36.5 0.717 0.075 0.426 0.499
4 200.7 203.1 64.6 66.2 136.1 139.9 47.6 0.674 0.144 0.458 0.397
8 387.5 393.1 62.8 67.8 324.7 329.0 83.0 0.680 0.245 0.463 0.292
16 631.4 633.0 51.1 54.0 580.3 582.8 172.9 0.786 0.408 0.404 0.188
8M 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.653 0.000 0.260 0.740
2 67.7 67.8 47.9 47.9 19.9 19.9 36.7 0.709 0.072 0.427 0.502
4 202.9 206.3 65.1 68.8 137.8 140.7 47.7 0.681 0.145 0.465 0.390
8 384.7 391.2 62.2 66.1 322.5 333.2 82.7 0.684 0.243 0.467 0.289
16 629.3 632.0 51.0 54.0 578.3 584.0 173.1 0.805 0.407 0.411 0.182

Table A.20: Traffic and timing for mp3d().

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.813 0.990 0.995 0.331 0.932 0.931 0.998 0.374 0.809 0.000 0.0
2 0.820 0.991 0.996 0.741 0.877 0.873 0.998 0.367 0.616 0.613 1.0
4 0.824 0.987 0.994 0.758 0.843 0.836 0.998 0.318 0.517 0.822 1.0
8 0.829 0.993 0.997 0.839 0.805 0.794 0.999 0.389 0.455 0.922 1.1
16 0.827 0.986 0.994 0.917 0.772 0.755 0.999 0.383 0.413 0.951 1.2
256K 1 0.813 0.997 0.999 0.331 0.988 0.988 0.999 0.630 0.662 0.000 0.0
2 0.820 0.997 0.999 0.682 0.892 0.887 0.999 0.664 0.519 0.903 1.0
4 0.825 0.998 0.999 0.770 0.840 0.828 0.999 0.654 0.491 0.958 1.0
8 0.829 0.998 1.000 0.880 0.800 0.784 0.999 0.867 0.443 0.983 1.1
16 0.828 0.996 0.998 0.928 0.764 0.743 1.000 0.869 0.414 0.990 1.2

Table A.21: Hit ratios for mp3d(), 200MHz CPU clock.

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.454 0.000 0.401 0.599
2 101.4 102.4 77.7 78.7 23.6 23.7 45.5 0.525 0.107 0.513 0.380
4 295.8 298.2 106.3 112.6 189.5 199.3 59.4 0.512 0.206 0.496 0.298
8 568.0 569.5 102.9 119.9 465.2 476.7 107.0 0.571 0.361 0.447 0.192
16 786.1 788.4 71.0 87.7 715.1 723.9 226.6 0.747 0.526 0.360 0.114
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.380 0.000 0.309 0.691
2 114.6 114.8 87.4 87.6 27.2 27.2 44.8 0.492 0.121 0.485 0.393
4 318.6 320.3 114.0 121.3 204.7 213.6 59.4 0.524 0.224 0.501 0.274
8 560.7 562.3 101.0 115.7 459.8 468.3 107.2 0.616 0.357 0.471 0.172
16 783.6 786.8 70.7 86.6 712.9 719.7 227.7 0.789 0.528 0.370 0.102

Table A.22: Traffic and timing for mp3d(), 200MHz CPU clock.

A.1.6 water()

cSz N shdRD shdWR lclRD lclWR i-fetch
64K 1 1150 223 11709 2639 28466
2 1389 221 11593 2665 28617
4 1705 236 12384 2849 30772
8 2635 224 12436 2863 32575
16 5400 202 12311 2841 37704
128K 1 1150 223 11709 2639 28466
2 1395 221 11593 2665 28628
4 1704 236 12384 2849 30769
8 2892 224 12436 2863 33087
16 3756 202 12311 2841 34416
256K 1 1177 251 12339 2800 29972
2 1269 258 12540 2887 30476
4 1969 265 13141 3027 32974
8 2594 267 13397 3102 34684
16 2684 264 13770 3187 35547
512K 1 1150 223 11709 2639 28466
2 1392 221 11593 2665 28621
4 1709 236 12384 2849 30779
8 2603 224 12436 2863 32509
16 3736 202 12311 2841 34375
8M 1 1177 251 12339 2800 29972
2 1247 258 12540 2887 30433
4 2074 265 13141 3027 33183
8 2653 267 13397 3102 34800
16 2662 264 13770 3187 35502

Table A.23: Per node reference counts for water() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.879 0.995 0.999 0.874 0.880 0.964 0.993 0.922 0.478 0.000 0.0
2 0.897 0.994 0.999 0.896 0.890 0.956 0.992 0.920 0.463 0.116 1.0
4 0.907 0.993 1.000 0.912 0.820 0.907 0.992 0.896 0.469 0.423 1.0
8 0.909 0.981 0.993 0.945 0.812 0.905 0.992 0.812 0.253 0.564 1.0
16 0.910 0.985 0.995 0.975 0.780 0.903 0.993 0.849 0.307 0.616 1.1
128K 1 0.880 0.999 1.000 0.877 0.952 0.986 0.993 0.977 0.617 0.000 0.0
2 0.897 0.999 1.000 0.898 0.953 0.983 0.993 0.984 0.609 0.338 1.0
4 0.908 0.999 1.000 0.912 0.860 0.932 0.992 0.963 0.510 0.764 1.0
8 0.910 0.994 0.997 0.950 0.847 0.922 0.993 0.917 0.320 0.827 1.0
16 0.911 0.996 0.998 0.964 0.818 0.918 0.993 0.952 0.425 0.848 1.1
256K 1 0.882 1.000 1.000 0.867 0.960 0.988 0.993 0.979 0.630 0.000 0.0
2 0.899 1.000 1.000 0.873 0.960 0.985 0.992 0.986 0.623 0.358 1.0
4 0.908 1.000 1.000 0.916 0.855 0.928 0.992 0.968 0.526 0.806 1.0
8 0.909 1.000 1.000 0.936 0.847 0.922 0.992 0.976 0.508 0.878 1.0
16 0.910 1.000 1.000 0.938 0.848 0.920 0.992 0.982 0.497 0.918 1.0
512K 1 0.880 1.000 1.000 0.877 0.958 0.987 0.993 0.977 0.623 0.000 0.0
2 0.897 1.000 1.000 0.898 0.957 0.984 0.993 0.984 0.614 0.359 1.0
4 0.908 1.000 1.000 0.913 0.864 0.933 0.992 0.966 0.525 0.791 1.0
8 0.910 1.000 1.000 0.945 0.852 0.924 0.993 0.974 0.506 0.871 1.0
16 0.911 1.000 1.000 0.964 0.856 0.921 0.994 0.980 0.491 0.917 1.1
8M 1 0.882 1.000 1.000 0.869 0.997 0.997 0.993 0.998 0.323 0.000 0.0
2 0.899 1.000 1.000 0.872 0.988 0.992 0.992 0.999 0.439 0.778 1.0
4 0.908 1.000 1.000 0.921 0.893 0.935 0.992 0.999 0.488 0.981 1.0
8 0.909 1.000 1.000 0.938 0.877 0.926 0.993 0.999 0.485 0.988 1.0
16 0.910 1.000 1.000 0.938 0.871 0.923 0.993 0.999 0.480 0.992 1.0

Table A.24: Hit ratios for water().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.468 0.000 0.045 0.956
2 5.3 5.6 4.1 4.4 1.2 1.2 47.3 0.467 0.005 0.063 0.931
4 28.2 28.9 10.4 12.7 17.9 19.2 57.1 0.514 0.018 0.094 0.888
8 61.8 63.2 11.0 16.7 50.8 53.3 81.5 0.565 0.029 0.135 0.836
16 123.4 125.7 10.9 17.5 112.5 116.5 141.7 0.652 0.050 0.243 0.707
128K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.458 0.000 0.036 0.964
2 2.3 2.3 1.8 1.8 0.5 0.5 47.1 0.455 0.003 0.053 0.945
4 25.4 25.6 9.2 9.8 16.2 17.0 54.5 0.504 0.017 0.088 0.895
8 57.2 57.6 10.1 13.6 47.1 48.9 80.0 0.554 0.027 0.146 0.826
16 127.2 127.9 11.2 15.2 116.0 118.5 139.6 0.585 0.052 0.182 0.766
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.483 0.000 0.037 0.963
2 1.9 1.9 1.3 1.4 0.5 0.5 27.9 0.486 0.002 0.045 0.953
4 22.4 22.8 7.2 7.4 15.2 15.5 45.0 0.542 0.015 0.103 0.881
8 52.2 52.9 8.0 9.8 44.3 45.2 69.1 0.578 0.027 0.132 0.841
16 109.5 111.7 8.1 10.7 101.4 103.9 123.3 0.606 0.050 0.129 0.821
512K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.458 0.000 0.035 0.965
2 2.3 2.3 1.7 1.8 0.5 0.5 48.7 0.455 0.003 0.053 0.945
4 25.2 25.4 9.1 9.6 16.1 16.8 54.7 0.503 0.017 0.088 0.895
8 58.3 58.6 10.3 13.1 48.1 49.7 79.8 0.536 0.028 0.134 0.838
16 110.3 111.0 9.6 13.4 100.7 103.2 136.2 0.573 0.045 0.180 0.775
8M 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.480 0.000 0.032 0.968
2 0.9 0.9 0.6 0.7 0.3 0.3 37.2 0.481 0.001 0.038 0.961
4 20.9 21.3 6.7 6.9 14.2 14.6 44.9 0.541 0.015 0.104 0.881
8 49.2 49.9 7.5 9.4 41.7 42.1 68.9 0.575 0.026 0.132 0.842
16 103.7 106.0 7.7 10.5 96.0 98.4 122.6 0.601 0.048 0.126 0.826

Table A.25: Traffic and timing for water().

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.879 0.995 0.999 0.874 0.880 0.964 0.993 0.922 0.478 0.000 0.0
2 0.897 0.994 0.999 0.896 0.890 0.956 0.992 0.920 0.463 0.116 1.0
4 0.907 0.993 1.000 0.912 0.820 0.908 0.992 0.896 0.468 0.423 1.0
8 0.909 0.981 0.993 0.966 0.813 0.905 0.992 0.812 0.252 0.563 1.0
16 0.910 0.985 0.995 0.984 0.780 0.903 0.994 0.849 0.307 0.617 1.1
256K 1 0.880 1.000 1.000 0.877 0.958 0.987 0.993 0.977 0.622 0.000 0.0
2 0.897 1.000 1.000 0.898 0.957 0.984 0.993 0.984 0.612 0.361 1.0
4 0.908 1.000 1.000 0.914 0.863 0.933 0.992 0.966 0.525 0.787 1.0
8 0.910 1.000 1.000 0.948 0.851 0.924 0.993 0.974 0.506 0.866 1.0
16 0.911 1.000 1.000 0.969 0.853 0.921 0.994 0.980 0.491 0.913 1.1

Table A.26: Hit ratios for water(), 200MHz CPU clock.

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.262 0.000 0.181 0.820
2 5.2 5.3 4.1 4.1 1.1 1.1 48.8 0.262 0.006 0.183 0.812
4 20.9 21.3 7.8 9.0 13.0 14.6 60.0 0.264 0.014 0.184 0.803
8 66.9 67.4 12.3 14.7 54.6 59.1 89.3 0.271 0.033 0.187 0.781
16 177.7 178.5 16.0 21.7 161.8 169.7 155.3 0.286 0.076 0.186 0.738
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.259 0.000 0.174 0.826
2 1.7 1.7 1.3 1.3 0.4 0.4 41.1 0.254 0.002 0.172 0.826
4 11.4 11.4 4.2 4.7 7.1 8.1 58.4 0.257 0.007 0.174 0.818
8 36.3 36.8 6.6 7.7 29.7 31.5 87.9 0.259 0.018 0.177 0.806
16 115.2 116.1 10.1 12.7 105.0 109.8 152.9 0.270 0.049 0.178 0.773

Table A.27: Traffic and timing for water(), 200MHz CPU clock.

A.1.7 ge()

cSz N shdRD shdWR lclRD lclWR i-fetch
64K 1 1724 857 12080 901 33660
2 1704 848 11935 882 33254
4 1709 852 11971 878 33351
8 1700 848 11908 869 33177
16 1700 848 11906 865 33169
128K 1 1724 857 12080 901 33660
2 1704 848 11935 882 33254
4 1709 852 11971 878 33351
8 1700 848 11908 869 33177
16 1700 848 11906 865 33169
256K 1 1724 857 12080 901 33660
2 1704 848 11935 882 33254
4 1709 852 11971 878 33351
8 1700 848 11908 869 33177
16 1700 848 11906 865 33169
512K 1 1724 857 12080 901 33660
2 1704 848 11935 882 33254
4 1709 852 11971 878 33351
8 1700 848 11908 869 33177
16 1700 848 11906 865 33169
4M 1 1724 857 12080 901 33660
2 1704 848 11935 882 33254
4 1709 852 11971 878 33351
8 1700 848 11908 869 33177
16 1700 848 11906 865 33169

Table A.28: Per node reference counts for ge() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.925 0.998 0.998 0.493 0.986 0.990 1.000 0.092 0.593 0.000 0.0
2 0.924 0.995 0.998 0.482 0.986 0.990 1.000 0.083 0.581 0.032 1.0
4 0.925 0.996 0.998 0.487 0.985 0.992 1.000 0.062 0.575 0.171 1.0
8 0.925 0.997 0.998 0.484 0.981 0.990 1.000 0.034 0.580 0.309 1.0
16 0.925 0.996 0.998 0.483 0.977 0.987 1.000 0.024 0.591 0.418 1.0
128K 1 0.925 1.000 1.000 0.493 0.991 0.993 1.000 0.092 0.539 0.000 0.0
2 0.924 0.997 0.999 0.482 0.993 0.996 1.000 0.227 0.492 0.078 1.0
4 0.925 0.998 0.999 0.487 0.989 0.994 1.000 0.093 0.533 0.253 1.0
8 0.925 0.998 0.999 0.485 0.987 0.993 1.000 0.062 0.538 0.472 1.0
16 0.925 0.998 0.999 0.483 0.983 0.990 1.000 0.050 0.562 0.588 1.0
256K 1 0.925 1.000 1.000 0.493 0.991 0.993 1.000 0.092 0.538 0.000 0.0
2 0.924 0.999 1.000 0.482 0.995 0.996 1.000 0.227 0.504 0.087 1.0
4 0.925 0.999 1.000 0.487 0.992 0.996 1.000 0.121 0.460 0.413 1.0
8 0.925 0.999 1.000 0.485 0.990 0.995 1.000 0.133 0.454 0.664 1.0
16 0.925 0.999 1.000 0.484 0.987 0.993 1.000 0.148 0.503 0.767 1.0
512K 1 0.925 1.000 1.000 0.493 0.991 0.993 1.000 0.092 0.538 0.000 0.0
2 0.924 0.999 1.000 0.482 0.995 0.996 1.000 0.227 0.501 0.089 1.0
4 0.925 1.000 1.000 0.487 0.993 0.997 1.000 0.121 0.455 0.451 1.0
8 0.925 1.000 1.000 0.485 0.991 0.996 1.000 0.134 0.419 0.729 1.0
16 0.925 0.999 1.000 0.484 0.988 0.993 1.000 0.202 0.392 0.863 1.0
4M 1 0.925 1.000 1.000 0.495 0.999 0.999 1.000 0.942 0.477 0.000 0.0
2 0.924 0.999 1.000 0.482 0.997 0.998 1.000 0.949 0.470 0.141 1.0
4 0.925 1.000 1.000 0.488 0.996 0.998 1.000 0.952 0.369 0.618 1.0
8 0.925 1.000 1.000 0.486 0.993 0.996 1.000 0.949 0.375 0.799 1.0
16 0.925 1.000 1.000 0.484 0.989 0.994 1.000 0.940 0.372 0.898 1.0

Table A.29: Hit ratios for ge().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.483 0.000 0.152 0.848
2 2.8 2.9 2.2 2.3 0.6 0.6 48.8 0.481 0.003 0.153 0.843
4 11.3 11.5 4.3 4.9 7.1 7.9 61.2 0.484 0.008 0.155 0.838
8 36.8 37.2 6.7 8.0 30.1 32.4 87.4 0.489 0.018 0.158 0.825
16 100.5 100.8 9.0 12.2 91.5 95.8 147.5 0.504 0.041 0.159 0.800
128K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.480 0.000 0.148 0.852
2 1.7 1.8 1.3 1.4 0.4 0.4 39.2 0.475 0.001 0.149 0.849
4 8.2 8.4 3.1 3.4 5.1 5.3 57.3 0.480 0.005 0.151 0.843
8 26.5 26.8 4.8 6.5 21.7 23.0 86.3 0.482 0.013 0.153 0.834
16 77.7 78.1 6.9 9.3 70.8 73.9 146.4 0.494 0.032 0.155 0.813
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.480 0.000 0.148 0.852
2 0.7 0.7 0.5 0.5 0.2 0.2 50.5 0.473 0.001 0.148 0.851
4 5.0 5.1 1.7 1.9 3.3 3.6 48.8 0.477 0.003 0.149 0.848
8 16.2 16.5 2.6 3.0 13.5 14.3 75.1 0.477 0.008 0.151 0.841
16 51.8 52.4 4.0 5.0 47.8 49.9 130.3 0.487 0.023 0.153 0.823
512K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.480 0.000 0.148 0.852
2 0.8 0.9 0.7 0.7 0.2 0.2 53.8 0.473 0.001 0.147 0.852
4 5.3 5.5 2.0 2.3 3.4 3.6 59.7 0.476 0.004 0.149 0.848
8 18.8 19.0 3.4 4.1 15.4 16.2 86.6 0.477 0.009 0.150 0.841
16 56.2 56.9 5.0 6.1 51.3 53.7 147.9 0.486 0.023 0.152 0.825
4M 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.475 0.000 0.143 0.857
2 0.6 0.6 0.4 0.4 0.2 0.2 31.7 0.472 0.001 0.146 0.853
4 3.0 3.1 1.0 1.3 2.0 2.3 50.1 0.473 0.002 0.146 0.851
8 12.2 12.6 2.0 2.4 10.2 11.2 74.2 0.474 0.006 0.149 0.845
16 42.0 43.1 3.3 4.0 38.8 40.9 129.2 0.483 0.019 0.152 0.829

Table A.30: Traffic and timing for ge().

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.925 0.998 0.998 0.493 0.986 0.990 1.000 0.092 0.593 0.000 0.0
2 0.924 0.995 0.998 0.482 0.986 0.990 1.000 0.083 0.581 0.032 1.0
4 0.925 0.996 0.998 0.487 0.985 0.992 1.000 0.062 0.575 0.171 1.0
8 0.925 0.997 0.998 0.484 0.981 0.990 1.000 0.034 0.580 0.309 1.0
16 0.925 0.996 0.998 0.483 0.977 0.987 1.000 0.024 0.591 0.418 1.0
256K 1 0.925 1.000 1.000 0.493 0.991 0.993 1.000 0.092 0.538 0.000 0.0
2 0.924 0.999 1.000 0.482 0.995 0.996 1.000 0.227 0.504 0.087 1.0
4 0.925 0.999 1.000 0.487 0.992 0.996 1.000 0.121 0.460 0.413 1.0
8 0.925 0.999 1.000 0.485 0.990 0.995 1.000 0.133 0.454 0.664 1.0
16 0.925 0.999 1.000 0.484 0.987 0.993 1.000 0.148 0.503 0.767 1.0

Table A.31: Hit ratios for ge(), 200MHz CPU clock.

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.313 0.000 0.074 0.926
2 20.7 24.1 15.3 18.7 5.4 5.4 40.5 0.327 0.022 0.137 0.841
4 76.9 91.0 26.5 51.4 50.4 63.8 50.7 0.361 0.053 0.184 0.764
8 168.9 195.7 28.3 65.8 140.6 172.2 78.2 0.476 0.091 0.171 0.738
16 366.8 406.5 31.9 85.3 334.9 378.8 151.9 0.468 0.195 0.177 0.628
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.293 0.000 0.037 0.963
2 2.5 2.5 1.9 2.0 0.5 0.5 45.7 0.291 0.003 0.077 0.921
4 10.2 11.5 3.6 5.3 6.6 7.4 51.9 0.303 0.007 0.107 0.886
8 32.7 35.6 5.6 9.2 27.1 30.3 75.3 0.391 0.016 0.105 0.878
16 187.2 195.9 16.0 23.0 171.2 179.2 139.4 0.369 0.087 0.135 0.778

Table A.32: Traffic and timing for ge(), 200MHz CPU clock.

A.1.8 mmult()

cSz N shdRD shdWR lclRD lclWR i-fetch
64K 1 2000 10 12112 2030 33295
2 2000 8 12091 2024 33240
4 2010 6 12129 2029 33348
8 2000 5 12055 2015 33146
16 2000 4 12046 2012 33122
128K 1 2000 10 12112 2030 33295
2 2000 8 12091 2024 33240
4 2010 6 12129 2029 33348
8 2000 5 12055 2015 33146
16 2000 4 12046 2012 33122
256K 1 2000 10 12112 2030 33295
2 2000 8 12091 2024 33240
4 2010 6 12129 2029 33348
8 2000 5 12055 2015 33146
16 2000 4 12046 2012 33122
512K 1 2000 10 12112 2030 33295
2 2000 8 12091 2024 33240
4 2010 6 12129 2029 33348
8 2000 5 12055 2015 33146
16 2000 4 12046 2012 33122
4M 1 2000 10 12112 2030 33295
2 2000 8 12091 2024 33240
4 2010 6 12129 2029 33348
8 2000 5 12055 2015 33146
16 2000 4 12046 2012 33122

Table A.33: Per node reference counts for mmult() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.783 0.993 0.999 0.845 0.864 0.915 1.000 0.000 0.556 0.000 0.0
2 0.821 0.991 0.998 0.685 0.886 0.909 1.000 0.000 0.677 0.000 0.0
4 0.823 0.989 0.998 0.555 0.883 0.909 1.000 0.000 0.744 0.000 0.0
8 0.809 0.914 0.955 0.461 0.884 0.911 1.000 0.000 0.294 0.000 0.0
16 0.807 0.974 0.990 0.449 0.883 0.907 1.000 0.000 0.593 0.000 0.0
128K 1 0.783 1.000 1.000 0.849 0.955 0.919 1.000 0.000 0.887 0.000 0.0
2 0.821 0.998 1.000 0.687 0.932 0.914 1.000 0.000 0.878 0.000 0.0
4 0.823 0.998 0.999 0.556 0.925 0.914 1.000 0.000 0.879 0.000 0.0
8 0.809 0.922 0.957 0.462 0.919 0.919 1.000 0.000 0.241 0.000 0.0
16 0.807 0.982 0.991 0.450 0.917 0.918 1.000 0.000 0.576 0.000 0.0
256K 1 0.783 1.000 1.000 0.849 0.985 0.929 1.000 0.000 0.633 0.000 0.0
2 0.821 0.999 1.000 0.687 0.986 0.926 1.000 0.000 0.616 0.000 0.0
4 0.823 0.998 1.000 0.556 0.986 0.922 1.000 0.000 0.624 0.000 0.0
8 0.809 0.924 0.957 0.462 0.982 0.925 1.000 0.000 0.062 0.000 0.0
16 0.807 0.984 0.992 0.450 0.960 0.928 1.000 0.000 0.407 0.000 0.0
512K 1 0.783 1.000 1.000 0.849 0.985 0.929 1.000 0.000 0.633 0.000 0.0
2 0.821 1.000 1.000 0.687 0.989 0.926 1.000 0.000 0.660 0.000 0.0
4 0.823 1.000 1.000 0.556 0.990 0.929 1.000 0.000 0.649 0.000 0.0
8 0.809 0.926 0.958 0.462 0.990 0.931 1.000 0.000 0.028 0.000 0.0
16 0.807 0.986 0.992 0.451 0.989 0.933 1.000 0.000 0.123 0.000 0.0
4M 1 0.783 1.000 1.000 0.849 0.995 0.929 1.000 0.000 0.000 0.000 0.0
2 0.821 1.000 1.000 0.687 0.997 0.930 1.000 0.000 0.000 0.000 0.0
4 0.823 1.000 1.000 0.557 0.998 0.932 1.000 0.000 0.000 0.000 0.0
8 0.809 1.000 1.000 0.462 0.997 0.933 1.000 0.000 0.036 0.000 0.0
16 0.807 1.000 1.000 0.451 0.996 0.934 1.000 0.000 0.015 0.000 0.0

Table A.34: Hit ratios for mmult().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.554 0.000 0.051 0.950
2 11.9 14.0 8.7 10.8 3.1 3.1 40.1 0.567 0.013 0.099 0.889
4 47.1 51.8 16.7 21.0 30.4 34.2 50.4 0.608 0.033 0.137 0.831
8 115.8 128.7 19.6 34.3 96.2 111.4 76.1 0.731 0.061 0.141 0.798
16 243.5 267.1 20.8 48.8 222.7 249.1 141.8 0.714 0.119 0.146 0.735
128K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.537 0.000 0.034 0.965
2 9.5 11.0 7.0 8.5 2.5 2.5 38.2 0.548 0.010 0.083 0.908
4 35.7 40.0 12.5 17.6 23.2 27.3 50.6 0.581 0.024 0.117 0.859
8 97.7 106.2 16.3 28.4 81.4 92.9 76.7 0.699 0.050 0.124 0.826
16 204.2 219.2 18.0 28.9 186.2 203.6 139.4 0.674 0.099 0.131 0.771
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.534 0.000 0.029 0.971
2 1.4 1.4 1.1 1.1 0.3 0.3 33.0 0.530 0.001 0.060 0.938
4 6.0 6.5 2.1 3.0 3.8 4.2 51.8 0.547 0.004 0.085 0.911
8 19.9 21.5 3.5 5.6 16.4 18.2 73.8 0.640 0.010 0.091 0.899
16 117.6 120.2 10.0 12.6 107.6 110.3 132.3 0.618 0.051 0.111 0.838
512K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.534 0.000 0.029 0.971
2 1.1 1.2 0.9 0.9 0.2 0.2 42.2 0.529 0.001 0.060 0.939
4 4.2 4.4 1.5 1.9 2.6 3.3 50.2 0.544 0.003 0.082 0.915
8 10.1 10.9 1.7 3.2 8.4 8.9 77.1 0.631 0.005 0.085 0.910
16 25.3 27.2 2.2 5.7 23.2 25.1 132.2 0.578 0.011 0.096 0.893
4M 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.533 0.000 0.027 0.973
2 0.2 0.2 0.1 0.2 0.0 0.0 0.0 0.526 0.000 0.056 0.944
4 0.8 0.9 0.3 0.4 0.5 0.6 77.1 0.539 0.001 0.077 0.922
8 3.3 3.5 0.6 0.7 2.7 2.9 90.8 0.553 0.002 0.092 0.907
16 10.3 10.6 0.9 1.2 9.3 9.8 145.9 0.557 0.004 0.093 0.902

Table A.35: Traffic and timing for mmult().

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.783 0.993 0.999 0.845 0.864 0.915 1.000 0.000 0.556 0.000 0.0
2 0.821 0.991 0.998 0.685 0.886 0.909 1.000 0.000 0.677 0.000 0.0
4 0.823 0.989 0.998 0.555 0.883 0.909 1.000 0.000 0.744 0.000 0.0
8 0.809 0.914 0.955 0.461 0.884 0.911 1.000 0.000 0.294 0.000 0.0
16 0.807 0.974 0.990 0.449 0.883 0.907 1.000 0.000 0.593 0.000 0.0
256K 1 0.783 1.000 1.000 0.849 0.985 0.929 1.000 0.000 0.633 0.000 0.0
2 0.821 0.999 1.000 0.687 0.986 0.926 1.000 0.000 0.616 0.000 0.0
4 0.823 0.998 1.000 0.556 0.986 0.922 1.000 0.000 0.624 0.000 0.0
8 0.809 0.924 0.957 0.462 0.982 0.925 1.000 0.000 0.062 0.000 0.0
16 0.807 0.984 0.992 0.450 0.960 0.928 1.000 0.000 0.407 0.000 0.0

Table A.36: Hit ratios for mmult(), 200MHz CPU clock.

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.313 0.000 0.074 0.926
2 20.7 24.1 15.3 18.7 5.4 5.4 40.5 0.327 0.022 0.137 0.841
4 76.9 91.0 26.5 51.4 50.4 63.8 50.7 0.361 0.053 0.184 0.764
8 168.9 195.7 28.3 65.8 140.6 172.2 78.2 0.476 0.091 0.171 0.738
16 366.8 406.5 31.9 85.3 334.9 378.8 151.9 0.468 0.195 0.177 0.628
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.293 0.000 0.037 0.963
2 2.5 2.5 1.9 2.0 0.5 0.5 45.7 0.291 0.003 0.077 0.921
4 10.2 11.5 3.6 5.3 6.6 7.4 51.9 0.303 0.007 0.107 0.886
8 32.7 35.6 5.6 9.2 27.1 30.3 75.3 0.391 0.016 0.105 0.878
16 187.2 195.9 16.0 23.0 171.2 179.2 139.4 0.369 0.087 0.135 0.778

Table A.37: Traffic and timing for mmult(), 200MHz CPU clock.

A.1.9 paths()

cSz N shdRD shdWR lclRD lclWR i-fetch
64K 1 1045 8 5266 353 15072
2 1035 6 5206 349 14906
4 1035 5 5201 348 14894
8 1036 4 5201 348 14898
16 1028 3 5154 345 14767
128K 1 1045 8 5266 353 15072
2 1035 6 5206 349 14906
4 1035 5 5201 348 14894
8 1036 4 5199 348 14895
16 1027 3 5152 345 14761
256K 1 1045 8 5266 353 15072
2 1035 6 5206 349 14906
4 1035 5 5199 348 14892
8 1036 4 5200 348 14896
16 1027 3 5151 345 14759
512K 1 1045 8 5266 353 15072
2 1035 6 5206 349 14906
4 1035 5 5199 348 14892
8 1036 4 5200 348 14896
16 1028 3 5153 345 14763
4M 1 1045 8 5266 353 15072
2 1035 6 5206 349 14906
4 1035 5 5199 348 14892
8 1036 4 5200 348 14896
16 1028 3 5153 345 14763

Table A.38: Per node reference counts for paths() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.925 1.000 1.000 0.701 0.999 0.953 1.000 0.000 0.496 0.000 0.0
2 0.924 0.984 1.000 0.597 0.969 0.203 1.000 0.000 0.514 0.982 1.0
4 0.924 0.983 1.000 0.494 0.919 0.091 1.000 0.000 0.612 0.945 2.3
8 0.923 0.975 0.998 0.412 0.878 0.022 1.000 0.000 0.685 0.971 3.5
16 0.923 0.976 0.998 0.362 0.803 0.008 1.000 0.000 0.805 0.970 4.1
128K 1 0.925 1.000 1.000 0.701 0.999 0.953 1.000 0.000 0.496 0.000 0.0
2 0.924 1.000 1.000 0.597 0.986 0.167 1.000 0.000 0.476 1.000 1.0
4 0.924 1.000 1.000 0.494 0.965 0.058 1.000 0.000 0.372 0.994 2.5
8 0.923 0.990 1.000 0.413 0.961 0.026 1.000 0.000 0.325 0.992 4.3
16 0.923 0.988 1.000 0.363 0.902 0.006 1.000 0.000 0.662 0.989 6.2
256K 1 0.925 1.000 1.000 0.701 0.999 0.953 1.000 0.000 0.496 0.000 0.0
2 0.924 1.000 1.000 0.597 0.986 0.140 1.000 0.000 0.476 1.000 1.0
4 0.924 1.000 1.000 0.497 0.979 0.088 1.000 0.000 0.280 1.000 2.4
8 0.923 0.998 1.000 0.415 0.972 0.023 1.000 0.000 0.202 0.999 4.4
16 0.923 0.996 1.000 0.364 0.911 0.006 1.000 0.000 0.695 0.993 6.1
512K 1 0.925 1.000 1.000 0.701 0.999 0.953 1.000 0.000 0.496 0.000 0.0
2 0.924 1.000 1.000 0.597 0.986 0.167 1.000 0.000 0.476 1.000 1.0
4 0.924 1.000 1.000 0.497 0.979 0.088 1.000 0.000 0.280 1.000 2.4
8 0.923 1.000 1.000 0.415 0.974 0.024 1.000 0.000 0.179 1.000 4.4
16 0.923 1.000 1.000 0.364 0.972 0.006 1.000 0.000 0.125 1.000 6.3
4M 1 0.925 1.000 1.000 0.701 0.999 0.954 1.000 0.000 0.489 0.000 0.0
2 0.924 1.000 1.000 0.597 0.986 0.139 1.000 0.000 0.476 1.000 1.0
4 0.924 1.000 1.000 0.497 0.979 0.087 1.000 0.000 0.279 1.000 2.4
8 0.923 1.000 1.000 0.415 0.974 0.023 1.000 0.000 0.178 1.000 4.4
16 0.923 1.000 1.000 0.364 0.972 0.006 1.000 0.000 0.124 1.000 6.3

Table A.39: Hit ratios for paths().

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.195 0.000 0.076 0.924
2 11.6 12.2 8.9 9.5 2.7 2.7 44.5 0.208 0.012 0.124 0.865
4 76.2 81.7 29.3 37.1 46.9 52.6 59.8 0.237 0.051 0.183 0.766
8 281.9 295.3 52.5 59.8 229.3 244.5 95.1 0.291 0.151 0.222 0.627
16 695.3 701.9 63.8 78.0 631.5 639.4 205.6 0.461 0.401 0.208 0.391
128K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.195 0.000 0.076 0.924
2 9.0 9.1 6.9 6.9 2.1 2.1 42.3 0.203 0.009 0.114 0.877
4 38.7 41.9 14.7 20.5 24.0 25.1 58.7 0.218 0.026 0.149 0.825
8 103.1 108.4 19.1 30.1 84.0 88.0 88.3 0.230 0.050 0.167 0.783
16 484.4 489.8 44.6 61.0 439.8 452.0 184.5 0.320 0.250 0.192 0.558
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.195 0.000 0.076 0.924
2 7.5 7.6 5.3 5.4 2.2 2.2 36.4 0.203 0.008 0.115 0.878
4 24.0 24.1 8.5 8.8 15.6 15.8 50.2 0.211 0.017 0.138 0.845
8 69.6 70.1 12.1 13.5 57.6 60.4 78.1 0.222 0.036 0.159 0.805
16 395.1 402.2 33.8 42.2 361.3 370.8 156.1 0.299 0.212 0.196 0.592
512K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.195 0.000 0.076 0.924
2 9.0 9.1 6.9 6.9 2.1 2.1 42.3 0.203 0.009 0.114 0.877
4 29.3 29.4 11.1 11.5 18.3 19.1 58.7 0.212 0.019 0.138 0.843
8 81.2 81.6 15.2 17.5 66.0 69.4 88.4 0.222 0.040 0.157 0.804
16 179.3 180.5 16.8 21.0 162.6 168.3 156.8 0.233 0.077 0.162 0.761
4M 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.195 0.000 0.076 0.924
2 7.5 7.6 5.3 5.4 2.2 2.2 36.5 0.203 0.008 0.115 0.878
4 24.0 24.1 8.4 8.8 15.6 15.8 50.3 0.211 0.017 0.138 0.845
8 67.2 67.7 11.7 13.2 55.5 58.1 77.7 0.221 0.035 0.158 0.807
16 149.4 151.4 12.9 15.3 136.5 141.0 139.8 0.231 0.069 0.164 0.767

Table A.40: Traffic and timing for paths().

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
64K 1 0.925 1.000 1.000 0.701 0.999 0.953 1.000 0.000 0.496 0.000 0.0
2 0.924 0.984 1.000 0.597 0.970 0.242 1.000 0.000 0.515 0.980 1.0
4 0.924 0.983 1.000 0.494 0.920 0.081 1.000 0.000 0.616 0.941 2.2
8 0.923 0.975 0.998 0.412 0.879 0.031 1.000 0.000 0.689 0.965 3.5
16 0.923 0.976 0.998 0.362 0.805 0.017 1.000 0.000 0.811 0.969 4.1
256K 1 0.925 1.000 1.000 0.701 0.999 0.953 1.000 0.000 0.496 0.000 0.0
2 0.924 1.000 1.000 0.597 0.987 0.217 1.000 0.000 0.474 0.999 1.0
4 0.924 1.000 1.000 0.497 0.979 0.093 1.000 0.000 0.280 1.000 2.3
8 0.923 0.998 1.000 0.415 0.973 0.030 1.000 0.000 0.205 1.000 4.3
16 0.923 0.996 1.000 0.364 0.910 0.006 1.000 0.000 0.693 0.993 6.1

Table A.41: Hit ratios for paths(), 200MHz CPU clock.

cSz N lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
64K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.104 0.000 0.096 0.903
2 20.1 21.2 15.4 16.6 4.7 4.7 45.9 0.115 0.021 0.162 0.817
4 131.3 139.4 49.7 57.8 81.6 92.3 60.7 0.142 0.089 0.238 0.673
8 423.1 443.0 78.4 91.4 344.7 368.3 100.4 0.194 0.237 0.264 0.498
16 854.8 863.3 78.4 95.8 776.4 787.3 223.7 0.373 0.537 0.208 0.255
256K 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.104 0.000 0.096 0.903
2 15.6 15.7 11.9 12.0 3.7 3.7 44.8 0.111 0.017 0.145 0.838
4 52.7 53.0 19.9 21.1 32.8 34.2 59.6 0.117 0.035 0.175 0.789
8 142.6 143.8 26.9 29.8 115.7 120.1 91.6 0.127 0.072 0.196 0.732
16 656.6 661.9 60.9 87.6 595.7 610.0 203.8 0.221 0.377 0.205 0.418

Table A.42: Traffic and timing for paths(), 200MHz CPU clock.

A.2 SCI Meshes

A.2.1 chol()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 4 9113 2007 2814 611 29007
16 2732 553 1135 160 10603
256K 4 6720 2044 2010 574 20499
16 2383 558 804 155 8290
512K 4 6620 2042 2064 576 20428
16 2343 553 766 160 7972

Table A.43: Per node reference counts for chol() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 4 0.791 0.998 0.999 0.773 0.952 0.973 0.996 0.171 0.539 0.239 1.0
16 0.865 0.996 0.995 0.796 0.967 0.980 0.998 0.312 0.519 0.559 1.0
256K 4 0.726 0.999 1.000 0.694 0.988 0.990 0.999 0.916 0.601 0.721 1.0
16 0.815 0.998 0.998 0.767 0.984 0.986 0.999 0.894 0.519 0.864 1.0
512K 4 0.732 0.999 1.000 0.690 0.988 0.990 0.999 0.953 0.557 0.814 1.0
16 0.800 1.000 0.999 0.764 0.985 0.986 0.999 0.925 0.488 0.913 1.0

Table A.44: Hit ratios for chol().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 4 X 13.4 23.1 11.1 20.1 2.3 2.9 63.1 0.628 0.050 0.446 0.504
Y 23.8 41.4 17.1 30.6 6.7 10.9
16 X 31.6 41.9 7.9 10.1 23.7 35.8 112.1 0.189 0.057 0.321 0.622
Y 33.8 37.8 12.7 16.0 21.1 24.2
256K 4 X 9.0 9.7 6.3 6.7 2.7 3.1 76.3 0.445 0.032 0.417 0.551
Y 12.9 13.3 9.9 10.5 3.0 3.1
16 X 33.9 41.9 8.7 11.1 25.2 33.5 109.9 0.157 0.061 0.351 0.587
Y 37.2 40.7 14.0 17.3 23.2 25.3
512K 4 X 9.1 10.2 6.2 6.7 2.9 3.5 73.1 0.442 0.031 0.414 0.555
Y 12.7 13.4 9.8 10.5 2.8 2.9
16 X 33.4 45.8 8.6 12.6 24.7 36.9 108.8 0.153 0.060 0.361 0.579
Y 36.3 40.9 13.8 19.8 22.5 24.6

Table A.45: Traffic and timing for chol().

A.2.2 mp3d()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 4 8051 2013 2064 371 23556
16 5740 897 2091 432 17488
64 6375 367 1707 373 17178
256K 4 7506 2000 2054 369 22426
16 5827 891 2077 429 17622
64 6306 367 1707 373 17043
512K 4 7380 1997 2051 369 22163
16 5509 891 2078 429 16989
64 6132 367 1707 373 16694

Table A.46: Per node reference counts for mp3d() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 4 0.828 0.994 0.999 0.768 0.880 0.884 0.998 0.354 0.501 0.892 1.0
16 0.831 0.993 0.998 0.845 0.781 0.772 0.999 0.501 0.398 0.979 1.2
64 0.830 0.993 0.998 0.936 0.578 0.533 0.999 0.563 0.302 0.994 1.4
256K 4 0.828 0.998 1.000 0.752 0.875 0.877 0.999 0.657 0.475 0.967 1.0
16 0.831 0.997 0.999 0.848 0.777 0.767 0.999 0.869 0.397 0.992 1.2
64 0.830 0.995 0.998 0.935 0.576 0.529 0.999 0.749 0.302 0.997 1.4
512K 4 0.828 1.000 1.000 0.748 0.872 0.873 0.999 0.696 0.466 0.987 1.0
16 0.831 0.998 0.999 0.839 0.772 0.761 0.999 0.973 0.396 0.996 1.2
64 0.830 0.996 0.998 0.933 0.573 0.527 1.000 0.910 0.302 0.998 1.4

Table A.47: Hit ratios for mp3d().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 4 X 56.8 59.6 36.9 39.3 20.0 22.0 66.7 0.774 0.183 0.540 0.277
Y 71.4 74.5 57.1 60.2 14.3 14.4
16 X 173.5 205.0 45.8 50.1 127.7 162.8 107.4 0.708 0.320 0.492 0.188
Y 189.6 203.6 73.2 79.9 116.5 130.6
64 X 345.2 603.3 40.1 52.4 305.1 566.7 206.9 0.794 0.463 0.432 0.104
Y 345.1 387.3 67.2 82.8 277.9 312.2
256K 4 X 62.4 70.7 40.8 44.7 21.5 26.0 68.5 0.780 0.202 0.528 0.270
Y 79.8 87.4 63.6 69.8 16.2 17.6
16 X 175.0 209.7 46.4 49.7 128.6 173.3 107.7 0.716 0.323 0.494 0.182
Y 192.2 207.2 74.1 78.9 118.1 132.0
64 X 348.6 608.2 40.4 54.2 308.1 571.1 207.2 0.796 0.467 0.430 0.103
Y 346.8 381.5 67.8 85.9 279.0 308.8
512K 4 X 64.2 72.8 41.9 45.9 22.3 26.8 67.9 0.786 0.207 0.527 0.267
Y 81.4 89.7 65.1 71.8 16.3 17.9
16 X 182.1 220.4 47.6 52.5 134.5 182.7 107.8 0.716 0.332 0.486 0.182
Y 194.4 211.2 76.1 83.0 118.3 135.2
64 X 352.9 517.4 40.9 61.5 311.9 489.5 205.3 0.790 0.469 0.428 0.104
Y 350.5 395.2 68.6 98.4 281.9 318.7

Table A.48: Traffic and timing for mp3d().

A.2.3 water()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 4 9974 2280 4165 805 30871
16 7941 1274 7997 1770 34156
64 4736 197 13510 3129 38800
256K 4 9964 2280 4165 805 30851
16 8000 1274 7997 1770 34275
64 4829 197 13510 3129 38987
512K 4 9977 2280 4165 805 30877
16 8021 1274 7997 1770 34317
64 4967 197 13510 3129 39263

Table A.49: Per node reference counts for water() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 4 0.913 0.997 1.000 0.905 0.977 0.993 0.992 0.963 0.528 0.742 1.0
16 0.910 0.998 1.000 0.930 0.948 0.982 0.993 0.952 0.538 0.696 1.1
64 0.909 0.994 0.998 0.971 0.755 0.907 0.994 0.936 0.417 0.863 1.1
256K 4 0.913 1.000 1.000 0.904 0.978 0.993 0.992 0.966 0.529 0.780 1.0
16 0.910 0.999 1.000 0.931 0.962 0.986 0.994 0.980 0.485 0.918 1.1
64 0.909 0.996 0.999 0.972 0.804 0.909 0.994 0.962 0.421 0.901 1.1
512K 4 0.913 1.000 1.000 0.905 0.979 0.993 0.992 0.966 0.524 0.791 1.0
16 0.910 1.000 1.000 0.931 0.963 0.986 0.994 0.980 0.482 0.925 1.1
64 0.909 0.998 0.999 0.973 0.845 0.912 0.994 0.977 0.397 0.952 1.1

Table A.50: Hit ratios for water().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 4 X 6.9 7.2 4.5 4.7 2.4 2.5 75.8 0.510 0.023 0.292 0.685
Y 9.0 9.5 7.2 7.5 1.8 2.1
16 X 25.6 33.6 6.2 9.1 19.4 28.3 102.8 0.577 0.042 0.276 0.683
Y 23.9 29.6 9.7 13.9 14.2 17.3
64 X 68.1 109.6 7.4 12.9 60.7 101.5 156.1 0.674 0.067 0.201 0.732
Y 53.2 61.8 11.5 19.7 41.7 47.8
256K 4 X 6.9 7.1 4.5 4.7 2.4 2.5 75.0 0.509 0.023 0.291 0.686
Y 8.9 9.6 7.2 7.5 1.8 2.0
16 X 22.4 26.7 5.3 8.1 17.1 22.8 103.0 0.570 0.037 0.275 0.688
Y 20.8 25.6 8.4 12.6 12.4 15.0
64 X 58.0 104.0 6.2 13.0 51.9 97.1 154.8 0.664 0.057 0.204 0.739
Y 45.7 54.2 9.7 20.1 36.0 44.1
512K 4 X 6.9 7.1 4.5 4.7 2.4 2.4 75.4 0.509 0.023 0.291 0.686
Y 8.9 9.5 7.1 7.4 1.7 2.0
16 X 22.3 26.4 5.3 8.1 17.0 22.6 102.3 0.570 0.036 0.276 0.688
Y 20.6 25.5 8.3 12.6 12.2 14.9
64 X 50.7 90.7 5.3 11.6 45.4 84.7 155.1 0.658 0.050 0.208 0.742
Y 40.0 50.3 8.5 18.1 31.6 36.9

Table A.51: Traffic and timing for water().

A.2.4 ge()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 4 1709 852 11971 878 33351
16 1700 848 11906 865 33169
256K 4 1709 852 11971 878 33351
16 1700 848 11906 865 33169
64 1698 848 11893 860 33134
512K 4 1709 852 11971 878 33351
16 1700 848 11906 865 33169

Table A.52: Per node reference counts for ge() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 4 0.925 0.998 0.999 0.487 0.989 0.994 1.000 0.093 0.533 0.253 1.0
16 0.925 0.998 0.999 0.483 0.983 0.990 1.000 0.050 0.562 0.588 1.0
256K 4 0.925 0.999 1.000 0.487 0.992 0.996 1.000 0.121 0.460 0.413 1.0
16 0.925 0.999 1.000 0.484 0.987 0.993 1.000 0.148 0.503 0.767 1.0
64 0.925 0.999 0.999 0.478 0.975 0.987 1.000 0.043 0.583 0.800 1.0
512K 4 0.925 1.000 1.000 0.487 0.993 0.997 1.000 0.121 0.455 0.451 1.0
16 0.925 0.999 1.000 0.484 0.988 0.993 1.000 0.202 0.392 0.863 1.0

Table A.53: Hit ratios for ge().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 4 X 2.0 2.1 1.3 1.4 0.7 0.8 82.0 0.481 0.007 0.152 0.842
Y 2.5 2.9 2.0 2.3 0.5 0.7
16 X 14.7 17.8 3.4 4.4 11.3 14.9 99.3 0.489 0.022 0.157 0.821
Y 11.9 14.4 5.4 6.6 6.6 10.2
256K 4 X 1.6 1.7 1.1 1.2 0.5 0.6 74.9 0.478 0.005 0.149 0.846
Y 2.0 2.1 1.7 1.8 0.3 0.4
16 X 12.1 14.1 2.7 3.7 9.3 11.4 97.3 0.484 0.018 0.154 0.828
Y 9.4 12.4 4.3 6.0 5.1 9.0
64 X 59.7 72.8 5.9 9.9 53.7 66.2 146.8 0.511 0.055 0.161 0.784
Y 39.2 58.2 9.1 15.2 30.1 50.1
512K 4 X 1.3 1.5 0.8 1.1 0.5 0.5 76.9 0.477 0.004 0.148 0.847
Y 1.6 2.1 1.3 1.7 0.3 0.4
16 X 10.7 12.5 2.4 3.2 8.3 10.1 96.6 0.482 0.015 0.153 0.832
Y 8.1 10.7 3.8 5.2 4.3 7.7

Table A.54: Traffic and timing for ge().

A.2.5 mmult()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 4 2010 6 12129 2029 33348
16 2097 4 12628 2109 34723
64 2000 2 12028 2008 33073
256K 4 2010 6 12129 2029 33348
16 2097 4 12628 2109 34723
64 2000 2 12028 2008 33073
512K 4 2010 6 12129 2029 33348
16 2097 4 12628 2109 34723
64 2000 2 12028 2008 33073

Table A.55: Per node reference counts for mmult() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 4 0.824 0.995 0.999 0.471 0.967 0.920 1.000 0.000 0.634 0.000 0.0
16 0.808 0.979 0.991 0.465 0.909 0.922 1.000 0.000 0.576 0.000 0.0
64 0.821 0.991 0.999 0.458 0.899 0.919 1.000 0.000 0.803 0.000 0.0
256K 4 0.824 0.997 1.000 0.472 0.988 0.927 1.000 0.000 0.463 0.000 0.0
16 0.808 0.982 0.992 0.465 0.951 0.930 1.000 0.000 0.446 0.000 0.0
64 0.821 0.996 0.999 0.458 0.920 0.929 1.000 0.000 0.841 0.000 0.0
512K 4 0.824 0.997 1.000 0.472 0.992 0.930 1.000 0.000 0.402 0.000 0.0
16 0.808 0.985 0.992 0.465 0.989 0.932 1.000 0.000 0.122 0.000 0.0
64 0.821 0.998 1.000 0.458 0.959 0.932 1.000 0.000 0.726 0.000 0.0

Table A.56: Hit ratios for mmult().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 4 X 4.4 6.7 2.9 4.5 1.6 2.2 77.6 0.575 0.017 0.112 0.871
Y 6.0 8.5 4.5 7.0 1.4 1.5
16 X 54.4 86.4 14.1 21.4 40.3 70.5 111.1 0.701 0.089 0.134 0.777
Y 52.9 66.5 21.9 34.4 30.9 37.3
64 X 150.2 190.9 18.6 26.7 131.6 172.4 175.0 0.709 0.160 0.136 0.705
Y 135.7 150.1 28.4 42.7 107.3 121.3
256K 4 X 1.6 2.5 1.1 1.7 0.5 0.8 83.7 0.557 0.007 0.097 0.896
Y 2.3 3.2 1.7 2.6 0.6 0.6
16 X 30.9 42.1 8.3 11.0 22.6 33.3 112.6 0.651 0.053 0.113 0.833
Y 32.9 37.9 12.9 17.6 20.1 25.1
64 X 130.3 166.0 16.3 23.1 114.0 150.8 172.5 0.674 0.136 0.127 0.737
Y 115.9 130.4 24.6 35.8 91.3 104.0
512K 4 X 1.0 1.6 0.6 1.1 0.3 0.5 79.3 0.553 0.004 0.094 0.902
Y 1.4 2.0 1.0 1.7 0.4 0.4
16 X 6.4 11.6 1.6 3.5 4.8 9.5 113.2 0.604 0.010 0.094 0.896
Y 6.3 8.7 2.6 5.4 3.6 5.0
64 X 72.4 102.0 8.8 11.9 63.6 94.8 166.2 0.608 0.073 0.112 0.815
Y 62.5 71.4 13.5 19.4 49.0 55.1

Table A.57: Traffic and timing for mmult().

A.2.6 paths()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 4 1035 5 5200 348 14894
16 1027 3 5152 345 14761
64 1031 1 5163 345 14799
256K 4 1035 5 5199 348 14892
16 1027 3 5152 345 14761
64 1031 1 5164 345 14801
200MHz 64 1031 1 5164 345 14801
512K 4 1035 5 5199 348 14892
16 1027 3 5152 345 14761
64 1031 1 5164 345 14799

Table A.58: Per node reference counts for paths() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 4 0.924 1.000 1.000 0.494 0.965 0.054 1.000 0.000 0.370 0.994 2.5
16 0.923 0.988 1.000 0.363 0.901 0.004 1.000 0.000 0.657 0.990 6.2
64 0.923 0.984 0.999 0.339 0.610 0.001 1.000 0.000 0.900 0.984 8.3
256K 4 0.924 1.000 1.000 0.497 0.979 0.098 1.000 0.000 0.278 1.000 2.4
16 0.923 0.996 1.000 0.364 0.910 0.005 1.000 0.000 0.689 0.995 6.4
64 0.923 0.994 1.000 0.340 0.647 0.000 1.000 0.000 0.900 0.995 9.3
200MHz 64 0.923 0.994 1.000 0.340 0.648 0.000 1.000 0.000 0.902 0.993 8.9
512K 4 0.924 1.000 1.000 0.497 0.979 0.098 1.000 0.000 0.278 1.000 2.4
16 0.923 1.000 1.000 0.364 0.971 0.004 1.000 0.000 0.120 1.000 6.6
64 0.923 0.996 1.000 0.340 0.964 0.002 1.000 0.000 0.117 0.998 11.1

Table A.59: Hit ratios for paths().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 4 X 9.7 12.8 6.9 9.1 2.9 3.8 76.5 0.220 0.034 0.148 0.819
Y 13.9 16.9 11.1 14.0 2.8 2.9
16 X 102.0 114.7 27.0 35.4 75.0 89.3 115.2 0.290 0.173 0.212 0.615
Y 104.1 114.6 42.9 55.6 61.1 67.2
64 X 446.5 523.3 53.8 75.4 392.7 468.1 188.3 0.719 0.530 0.220 0.250
Y 369.9 419.8 79.9 112.0 289.9 343.2
256K 4 X 7.2 8.0 5.2 5.7 1.9 2.3 76.2 0.213 0.025 0.137 0.838
Y 10.7 12.0 8.4 9.5 2.3 2.5
16 X 97.9 117.5 25.8 31.8 72.0 90.0 115.7 0.283 0.167 0.208 0.625
Y 100.6 105.7 41.0 49.4 59.5 63.3
64 X 434.5 594.2 52.6 100.8 381.9 533.3 187.6 0.670 0.515 0.219 0.266
Y 362.2 429.8 78.3 150.3 283.9 332.5
200MHz 64 X 512.2 687.0 62.5 110.3 449.7 615.4 0.0 0.568 0.629 0.207 0.164
Y 429.2 483.2 93.0 163.1 336.2 390.2
512K 4 X 7.2 8.0 5.2 5.7 1.9 2.3 76.2 0.213 0.025 0.137 0.838
Y 10.7 12.0 8.4 9.5 2.3 2.5
16 X 35.9 41.8 9.3 11.1 26.6 32.3 119.4 0.229 0.060 0.165 0.774
Y 37.8 40.4 16.2 19.0 21.6 26.8
64 X 105.1 129.9 11.3 17.6 93.7 118.9 180.8 0.245 0.107 0.167 0.726
Y 89.9 113.3 20.1 29.1 69.8 93.8

Table A.60: Traffic and timing for paths().

A.3 SCI Cubes

A.3.1 chol()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 8 4616 1199 1112 158 14301
256K 8 4017 1208 790 149 11516
512K 8 4089 1198 841 160 11885

Table A.61: Per node reference counts for chol() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 8 0.866 0.998 0.998 0.738 0.971 0.983 0.997 0.298 0.544 0.476 1.0
256K 8 0.822 0.999 1.000 0.701 0.988 0.989 0.999 0.921 0.552 0.827 1.0
512K 8 0.820 0.999 0.999 0.709 0.988 0.989 0.999 0.956 0.523 0.871 1.0

Table A.62: Hit ratios for chol().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 8 X 12.4 15.0 9.6 11.3 2.8 3.7 90.7 0.293 0.046 0.439 0.515
Y 13.2 14.1 10.4 11.2 2.7 2.9
Z 14.5 16.3 11.6 13.0 3.0 3.3
256K 8 X 13.7 16.3 10.7 12.4 3.1 4.0 92.9 0.247 0.051 0.450 0.499
Y 16.0 16.8 12.8 13.4 3.2 3.5
Z 15.3 16.4 12.2 13.3 3.1 3.1
512K 8 X 13.4 15.7 10.4 11.9 3.1 3.9 92.2 0.251 0.050 0.444 0.506
Y 15.5 16.9 12.4 13.7 3.1 3.4
Z 15.2 16.7 12.1 13.4 3.1 3.3

Table A.63: Traffic and timing for chol().

A.3.2 mp3d()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 8 3959 885 4110 889 18587
64 5424 367 1705 373 15273
256K 8 5013 881 4095 885 20655
64 5597 367 1706 373 15620
512K 8 3950 868 4035 872 18374
64 5588 367 1706 373 15603

Table A.64: Per node reference counts for mp3d() (x1000).

cSz N lrpch lrcch lwcch srpch srcch swcch ifpch ifcch flush purge shl
128K 8 0.827 0.994 0.997 0.764 0.785 0.759 0.999 0.570 0.454 0.969 1.1
64 0.830 0.993 0.998 0.925 0.574 0.528 0.999 0.561 0.305 0.994 1.4
256K 8 0.827 0.995 0.998 0.814 0.775 0.742 0.999 0.809 0.449 0.983 1.1
64 0.830 0.995 0.998 0.927 0.575 0.527 0.999 0.753 0.309 0.997 1.4
512K 8 0.827 0.998 0.999 0.766 0.758 0.718 0.999 0.977 0.443 0.994 1.1
64 0.830 0.997 0.999 0.927 0.574 0.529 1.000 0.915 0.306 0.999 1.4

Table A.65: Hit ratios for mp3d().

cSz N n lkav lkmx txav txmx psav psmx rtdly rtime ntwF shdF lclF
128K 8 X 68.0 76.2 52.0 56.7 16.0 19.7 83.4 0.703 0.256 0.461 0.283
Y 77.0 89.4 60.3 68.8 16.8 21.2
Z 71.1 85.0 55.6 67.7 15.5 17.3
64 X 178.6 311.1 74.0 131.2 104.6 209.9 159.5 0.686 0.418 0.461 0.121
Y 193.6 248.3 90.1 107.4 103.5 148.8
Z 186.5 212.4 84.8 102.4 101.7 114.3
256K 8 X 68.5 84.5 52.3 62.6 16.2 22.1 81.8 0.750 0.251 0.487 0.263
Y 81.1 95.2 63.5 71.8 17.6 23.5
Z 58.6 63.7 44.9 49.6 13.6 14.1
64 X 176.6 317.5 73.1 133.6 103.4 215.7 159.5 0.691 0.416 0.465 0.119
Y 193.7 247.6 90.1 106.4 103.6 147.5
Z 184.8 208.6 83.9 99.0 100.9 114.4
512K 8 X 85.9 122.5 66.1 91.3 19.8 31.5 81.2 0.737 0.270 0.468 0.262
Y 76.3 83.6 59.5 63.4 16.8 20.2
Z 62.9 67.8 48.5 52.5 14.4 15.3
64 X 176.6 318.5 73.2 135.0 103.5 214.8 159.7 0.691 0.417 0.465 0.118
Y 193.6 257.2 90.1 112.1 103.6 154.1
Z 185.9 217.5 84.6 104.7 101.4 117.7

Table A.66: Traffic and timing for mp3d().

A.3.3 water()

cSz N shdRD shdWR lclRD lclWR i-fetch
128K 8 9594 2305 4048 782 29718
64 4782 197 13510 3129 38892
256K 8 9594 2305 4048 782 29718
64 4839 197 13510 3129 39006
512K 8 9594 2305 4048 782 29718
64 4999 197 13510 3129 39326

Table A.67: Per node reference counts for water() (x 1000).

cSz   N   lrpch  lrcch  lwcch  srpch  srcch  swcch  ifpch  ifcch  flush  purge  shl
128K  8   0.918  0.998  1.000  0.899  0.984  0.994  0.992  0.917  0.543  0.097  1.4
      64  0.909  0.994  0.998  0.972  0.755  0.908  0.994  0.936  0.417  0.862  1.1
256K  8   0.918  1.000  1.000  0.900  0.990  0.998  0.992  0.974  0.592  0.283  1.4
      64  0.909  0.996  0.999  0.972  0.804  0.909  0.994  0.962  0.421  0.901  1.1
512K  8   0.918  1.000  1.000  0.900  0.991  0.998  0.992  0.974  0.582  0.297  1.4
      64  0.909  0.998  0.999  0.973  0.845  0.912  0.994  0.977  0.397  0.952  1.1

Table A.68: Hit ratios for water().

Table A.69: Traffic and timing for water().

A.3.4 ge()

cSz   N   shdRD  shdWR  lclRD  lclWR  i-fetch
128K  8    1700    848  11908    869    33177
256K  8    1700    848  11908    869    33177
      64   1698    848  11893    860    33134
512K  8    1700    848  11908    869    33177

Table A.70: Per node reference counts for ge() (x 1000).

cSz   N   lrpch  lrcch  lwcch  srpch  srcch  swcch  ifpch  ifcch  flush  purge  shl
128K  8   0.925  0.998  0.999  0.485  0.987  0.993  1.000  0.062  0.538  0.472  1.0
256K  8   0.925  0.999  1.000  0.485  0.990  0.995  1.000  0.133  0.454  0.664  1.0
      64  0.925  0.999  0.999  0.478  0.975  0.987  1.000  0.043  0.583  0.800  1.0
512K  8   0.925  1.000  1.000  0.485  0.991  0.996  1.000  0.134  0.419  0.729  1.0

Table A.71: Hit ratios for ge().

cSz   N   n   lkav  lkmx  txav  txmx  psav  psmx  rtdly  rtime  ntwF   shdF   iciF
128K  8   X    4.9   7.0   3.8   5.5   1.0   1.5   97.4  0.483  0.014  0.153  0.833
          Y    4.1   6.7   3.4   5.6   0.8   1.1
          Z    3.3   6.1   2.7   5.1   0.6   0.9
256K  8   X    3.7   4.7   2.9   3.7   0.8   0.9   90.6  0.478  0.010  0.151  0.839
          Y    3.0   4.8   2.5   4.0   0.5   0.8
          Z    2.1   3.9   1.7   3.3   0.4   0.6
      64  X   28.6  38.8  11.1  17.3  17.6  24.7  127.3  0.507  0.048  0.162  0.790
          Y   21.3  35.8   9.4  16.5  11.9  19.3
          Z   15.4  27.5   6.6  14.0   8.8  17.5
512K  8   X    3.6   4.3   2.8   3.4   0.8   0.9   88.3  0.477  0.009  0.150  0.841
          Y    2.8   4.3   2.3   3.6   0.5   0.7
          Z    1.9   3.0   1.5   2.5   0.4   0.5

Table A.72: Traffic and timing for ge().

A.3.5 mmult()

cSz   N   shdRD  shdWR  lclRD  lclWR  i-fetch
128K  8    2000      5  12055   2015    33146
      64   2000      2  12028   2008    33073
256K  8    2000      5  12055   2015    33146
      64   2000      2  12028   2008    33073
512K  8    2000      5  12055   2015    33146
      64   2000      2  12028   2008    33073

Table A.73: Per node reference counts for mmult() (x 1000).

cSz   N   lrpch  lrcch  lwcch  srpch  srcch  swcch  ifpch  ifcch  flush  purge  shl
128K  8   0.810  0.921  0.957  0.468  0.939  0.923  1.000  0.000  0.188  0.000  0.0
      64  0.821  0.991  0.999  0.458  0.899  0.919  1.000  0.000  0.803  0.000  0.0
256K  8   0.810  0.923  0.958  0.469  0.972  0.930  1.000  0.000  0.093  0.000  0.0
      64  0.821  0.996  0.999  0.458  0.920  0.929  1.000  0.000  0.841  0.000  0.0
512K  8   0.810  0.923  0.958  0.469  0.991  0.931  1.000  0.000  0.023  0.000  0.0
      64  0.821  0.998  1.000  0.458  0.959  0.932  1.000  0.000  0.726  0.000  0.0

Table A.74: Hit ratios for mmult().

cSz   N   n   lkav  lkmx  txav  txmx  psav  psmx  rtdly  rtime  ntwF   shdF   iciF
128K  8   X   13.6  18.9  10.8  15.7   2.8   3.9   96.5  0.687  0.049  0.112  0.839
          Y   15.8  21.4  12.9  17.9   2.8   3.5
          Z   16.1  21.0  13.2  17.5   2.9   3.4
      64  X   68.5  92.0  29.2  42.3  39.3  59.7  153.4  0.695  0.143  0.138  0.719
          Y   72.3  96.9  34.6  56.5  37.7  59.2
          Z   62.7  86.6  28.9  45.0  33.8  45.9
256K  8   X    6.3   8.6   5.0   7.3   1.3   2.0  101.6  0.652  0.025  0.094  0.881
          Y    8.7  10.0   7.2   8.5   1.5   1.7
          Z    8.1   9.4   6.7   7.9   1.4   1.6
      64  X   60.2  90.7  25.8  40.8  34.3  53.0  156.2  0.666  0.125  0.129  0.746
          Y   63.2  87.8  30.4  44.2  32.8  51.3
          Z   57.6  72.3  26.7  44.7  30.9  39.0
512K  8   X    1.8   3.5   1.4   2.8   0.4   0.7   99.2  0.631  0.006  0.083  0.910
          Y    1.9   2.6   1.6   2.0   0.4   0.5
          Z    1.9   2.3   1.5   2.0   0.3   0.3
      64  X   32.2  48.0  13.8  22.9  18.4  34.3  151.1  0.604  0.066  0.113  0.821
          Y   34.3  52.0  16.4  25.0  17.9  29.7
          Z   28.8  37.5  13.3  22.2  15.5  20.8

Table A.75: Traffic and timing for mmult().

A.3.6 paths()

cSz   N   shdRD  shdWR  lclRD  lclWR  i-fetch
128K  8    1036      3   5199    348    14895
      64   1031      1   5163    345    14799
256K  8    1036      3   5199    348    14894
      64   1031      1   5164    345    14800
200M  64   1031      1   5164    345    14800
512K  8    1036      3   5199    348    14894
      64   1031      1   5164    345    14800

Table A.76: Per node reference counts for paths() (x 1000).

cSz   N   lrpch  lrcch  lwcch  srpch  srcch  swcch  ifpch  ifcch  flush  purge   shl
128K  8   0.923  0.990  1.000  0.413  0.961  0.026  1.000  0.000  0.324  0.992   4.4
      64  0.923  0.984  0.999  0.339  0.609  0.000  1.000  0.000  0.897  0.989   8.9
256K  8   0.923  0.998  1.000  0.415  0.972  0.021  1.000  0.000  0.199  0.999   4.5
      64  0.923  0.994  1.000  0.340  0.647  0.000  1.000  0.000  0.899  0.994  10.4
200M  64  0.923  0.994  1.000  0.340  0.647  0.000  1.000  0.000  0.900  0.994   9.7
512K  8   0.923  1.000  1.000  0.415  0.974  0.019  1.000  0.000  0.176  1.000   4.4
      64  0.923  0.996  1.000  0.340  0.963  0.002  1.000  0.000  0.116  0.998  11.1

Table A.77: Hit ratios for paths().

cSz   N   n   lkav   lkmx   txav   txmx   psav   psmx  rtdly  rtime  ntwF   shdF   iciF
128K  8   X   17.8   25.1   14.2   20.3    3.6    4.7   99.3  0.232  0.056  0.166  0.778
          Y   17.1   21.1   14.1   17.5    3.0    3.6
          Z   17.7   21.2   14.6   17.9    3.1    3.3
      64  X  238.1  327.8  101.5  161.3  136.6  198.3  151.8  0.646  0.476  0.246  0.278
          Y  224.2  297.0  107.8  153.9  116.4  161.8
          Z  186.9  236.3   84.2  119.6  102.7  126.2
256K  8   X   13.0   14.3   10.4   11.7    2.6    2.8  100.2  0.224  0.046  0.157  0.797
          Y   16.2   18.5   13.4   15.5    2.7    3.0
          Z   14.9   18.0   12.3   15.0    2.6    3.0
      64  X  232.8  405.7   99.4  212.3  133.4  252.3  151.9  0.605  0.463  0.243  0.295
          Y  217.0  324.3  104.4  198.4  112.5  174.5
          Z  181.9  232.5   82.3  127.7   99.6  138.5
200M  64  X  283.0  465.7  120.8  243.8  162.2  286.3    0.0  0.493  0.571  0.240  0.189
          Y  265.3  370.2  127.4  212.8  137.9  196.9
          Z  224.0  288.6  101.2  153.2  122.7  170.3
512K  8   X   12.5   14.4   10.0   11.7    2.5    2.7   99.4  0.223  0.045  0.156  0.799
          Y   15.6   19.3   12.9   16.2    2.6    3.1
          Z   14.5   17.3   11.9   14.5    2.5    2.8
      64  X   45.8   60.5   19.3   31.3   26.5   40.5  157.4  0.242  0.096  0.170  0.735
          Y   48.2   60.1   23.5   34.1   24.6   33.3
          Z   43.9   71.7   20.9   34.7   23.0   38.2

Table A.78: Traffic and timing for paths().