
Synchronized MIMD Computing by

Bradley C. Kuszmaul

S.B., Mathematics, Massachusetts Institute of Technology

S.B., Computer Science and Engineering, Massachusetts Institute of Technology

S.M., Electrical Engineering and Computer Science, Massachusetts Institute of Technology

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, May 1994

© Massachusetts Institute of Technology 1994. All rights reserved.

Author: Department of Electrical Engineering and Computer Science

May 22, 1994

Certified by: Charles E. Leiserson, Professor of Computer Science and Engineering

Thesis Supervisor

Accepted by: F. R. Morgenthaler, Chair, Department Committee on Graduate Students

Synchronized MIMD Computing by Bradley C. Kuszmaul

Submitted to the Department of Electrical Engineering and Computer Science on May 22, 1994, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

Fast global synchronization provides simple, efficient solutions to many of the system problems of parallel computing. It achieves this by providing composition of both performance and correctness. If you understand the performance and meaning of parallel computations A and B, then you understand the performance and meaning of "A; barrier; B".

To demonstrate this thesis, this dissertation first describes the architecture of the Connection Machine CM-5, a synchronized MIMD (multiple instruction stream, multiple data stream) computer for which I was a principal architect. The CM-5 was designed to run programs written in the data-parallel style by providing fast global synchronization. Fast global synchronization also helps solve many of the system problems for the CM-5, including clock distribution, diagnostics, and timesharing.

Global barrier synchronization is used frequently in data-parallel programs to guarantee correctness, but the barriers are often viewed as a performance overhead that should be removed if possible. By studying specific mechanisms for using the CM-5 data network efficiently, the second part of this dissertation shows that this view is incorrect. Interspersing barriers during message sending can dramatically improve the performance of many important message patterns. Barriers are compared to two other mechanisms, bandwidth matching and managing injection order, for improving the performance of sending messages.

The last part of this dissertation explores the benefits of global synchronization for MIMD-style programs, which are less well understood than data-parallel programs. To understand the programming issues, I engineered the StarTech parallel chess program. Computer chess is a resource-intensive, irregular, MIMD-style computation, providing a challenging scheduling problem. Global synchronization allows us to write a scheduler which runs unstructured computations efficiently and predictably. Given such a scheduler, the run time of a dynamic MIMD-style program on a particular machine becomes simply a function of the critical path length C and the total work W. I empirically found that the StarTech program executes in time T ≈ 1.02 W/P + 1.5 C + 4.3 seconds on P processors, which, except for the constant term of 4.3 seconds, is within a factor of 2.52 of optimal.

Thesis Supervisor: Charles E. Leiserson Title: Professor of Computer Science and Engineering

Contents

Contents 5

List of Figures 7

1 Introduction 9

2 The Network Architecture of the Connection Machine CM-5 23

2.1 The CM-5 Network Interface 25

2.2 The CM-5 Data Network 26

2.3 The CM-5 Control Network 32

2.4 The CM-5 Diagnostic Network 37

2.5 Synchronized MIMD Goals 41

2.6 CM-5 History 43

3 Mechanisms for Data Network Performance 45

3.1 Introduction 45

3.2 CM-5 Background 47

3.3 Timing on the CM-5 49

3.4 Using Barriers Can Improve Performance 50

3.5 Packets Should Be Reordered 53

3.6 Bandwidth Matching 55

3.7 Programming Rules of Thumb 57

4 The StarTech Massively Parallel Chess Program 59

4.1 Introduction 59

4.2 Negamax Search Without Pruning 62

4.3 Alpha-Beta Pruning 63

4.4 Scout Search 65

4.5 Jamboree Search 66

4.6 Multithreading with Active Messages 67

4.7 The StarTech Scheduler 70

5 The Performance of the StarTech Program 73

5.1 Introduction 73

5.2 Analysis of Best-Ordered Trees 76

5.3 Analysis of Worst-Ordered Game Trees 81

5.4 Jamboree Search on Real Chess Positions 84

5.5 Scheduling Parallel Computer Chess is Demanding 93


5.6 Performance of the StarTech Scheduler 96

5.7 Swamping 101

5.8 A Space-Time Tradeoff 109

6 Tuning the StarTech Program 113

6.1 Introduction 113

6.2 What is the Right Way to Measure a Chess Program? 113

6.3 The Global Transposition Table 116

6.4 Improving the Transposition Table Effectiveness 120

6.5 Efficiency Heuristics for Jamboree Search 127

6.6 How Time is Spent in StarTech 128

7 Conclusions 133

Acknowledgments 145

Bibliography 149

Biographical Note 161


List of Figures

1-1 The organization of a SIMD computer. 11

1-2 The organization of a synchronized MIMD computer. 11

1-3 A fragment of data-parallel code. 14

1-4 Naive synchronized-MIMD code. 14

1-5 Optimized synchronized-MIMD code. 14

1-6 Exploiting the split-phase barrier via code transformations. 15

2-1 The organization of the Connection Machine CM-5. 24

2-2 A binary fat-tree. 26

2-3 The interconnection pattern of the CM-5 data network. 27

2-4 The format of messages in the data network. 29

2-5 The format of messages in the control network. 35

2-6 Steering a token down the diagnostic network. 40

3-1 A 64-node CM-5 data-network fat-tree showing all of the major arms and their

bandwidths (in each direction). 47

3-2 The CM-5 operating system inflates time when the network is full. 49

3-3 The effect of barriers between block transfers on the cyclic-shift pattern. 51

3-4 The total number of packets in the network headed to any given processor at any given time. 52

3-5 Big blocks also suffer from target collisions. 53

3-6 The effect on performance of interleaving messages. 54

3-7 The effect of bandwidth matching on permutations separated by barriers. 56

4-1 negamax. 63

4-2 Practical pruning: White to move and win. 64

4-3 Algorithm absearch. 64

4-4 Algorithm scout. 65

4-5 Algorithm jamboree. 66

4-6 A search tree and the sequence of stack configurations for a serial implementation. 67

4-7 A sequence of activation trees for a parallel tree search. 68

4-8 Three possible ways for a frame to terminate. 69

4-9 Busy processors keep working while the split-phase barrier completes. 72

5-1 The critical tree of a best-ordered uniform game-tree. 77

5-2 Numerical values for average available parallelism. 78

5-3 Partitioning the critical tree to solve for the number of vertices. 79

5-4 Performance metrics for worst-ordered game trees. 82

5-5 Jamboree search sometimes does more work than a serial algorithm. 85


5-6 Jamboree search sometimes does less work than a serial algorithm. 86

5-7 The dataflow graph for Jamboree search. 87

5-8 Computing the estimated critical path length using timestamping. 88

5-9 The critical path of a computation that includes parallel-or. 88

5-10 The total work of each of 25 chess positions. 90

5-11 The critical path of each of 25 chess positions. 91

5-12 Serial run time versus work efficiency. 92

5-13 The ideal-parallelism profile of a typical chess position. 94

5-14 The sorted ideal-parallelism profile. 94

5-15 The processor utilization profile of a typical chess position. 95

5-16 Residual plots of the performance model as a function of various quantities. 98

5-17 Residual plot for a model that is too simple. 100

5-18 Residual plot for another model that is too simple. 101

5-19 The relationship of t_b, t_i, and t_s in the SWAMP program. 103

5-20 Results for the isolated swamping experiment. 104

5-21 The analytic queueing model for the swamping experiment. 104

5-22 A Jackson network that models the swamping problem. 105

5-23 Examples of p(0) expanded for a few machine sizes. 108

5-24 Waiting for children to complete causes most of the inefficiency during processor saturation. 111

6-1 Serial performance versus transposition table size. 115

6-2 Parallel performance versus number of processors. 116

6-3 The StarTech transposition table is globally distributed. 117

6-4 Transforming a protocol that uses locks into an active-message protocol. 119

6-5 The effect of recursive iterative deepening (RID) on the serial program. 122

6-6 The effect of deferred reads and recursive iterative deepening using 512 processors. 124

6-7 The effect of deferred reads and recursive iterative deepening using 128 processors. 125

6-8 The effect of deferred reads and recursive iterative deepening using 256 processors. 126

6-9 How processor cycles are spent. 129

6-10 A breakdown of the `chess work'. 129

Chapter 1

Introduction

The Synchronized MIMD Thesis

Fast global synchronization can solve many of the system problems of parallel computing.

To demonstrate that fast global synchronization solves many of the system problems of parallel computing, this dissertation first describes hardware and then software, both of which are organized around fast global synchronization. Fast global synchronization is not the only way to solve the system problems of parallel computing, but it provides simple solutions to them. This thesis describes in detail the Connection Machine CM-5 supercomputer and the StarTech chess program, treating these systems as example points in the design space of parallel systems. I explain the systems, and, rather than systematically comparing various ways to solve the problems, I show simple solutions to problems that have been difficult to solve previously.

The Connection Machine CM-5: Architectural Support for Data-Parallel Programming.

The idea of using fast global synchronization to solve system problems grew out of the CM-5 project, which started in 1987 at Thinking Machines Corporation. The primary design goal of the CM-5 was to support the data-parallel programming model [HS86, Ble90]. Data-parallel programs run efficiently on the Connection Machine CM-2, which is a SIMD (single-instruction stream, multiple-data stream) machine. It was of prime importance for any new machine to continue to run such programs well. As one of the principal architects of the CM-5,1 I helped design a MIMD (multiple-instruction stream, multiple-data stream) machine to execute data-parallel programs.2 The data-parallel style of programming is successful because it is simple and programmers can

1 I was the first member of the CM-5 design team, which eventually grew to include Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. All told, about 500 people participated in the implementation of the CM-5.
2 The SIMD/MIMD terminology was developed by Flynn [Fly66].

reason about it. A data-parallel program is an ordinary serial program with vector primitives.3 Vector primitives include

elementwise operations,

bulk data communications operations, and

global reductions and scans.

In the following, we denote vectors by upper-case letters and scalars by lower-case letters. Elementwise operations are exemplified by vector addition, which can be expressed as A ← B + C (which means that for each j, B[j] + C[j] is stored into A[j]). Bulk data communications can be expressed, for example, as A[I] ← B (which means that each B[j] is stored into A[I[j]]). Global reductions and scans can be expressed, for example, as GLOBAL_OR(A) (which returns the logical `or' of all the elements of A). In addition, data-parallel programming languages typically provide a conditional execution construct, the WHERE statement, that looks like this:

    WHERE (expression) body.

In the body of the WHERE, all vector operations are conditionally executed depending on the value of the expression. For example

    WHERE (B < 0)
        B ← B + C;

has the effect of incrementing B[j] by C[j] for each j such that B[j] < 0.

Data-parallel programs have traditionally been run on SIMD machines. In fact, the SIMD machines engendered the data-parallel style of programming [Chr83, Las85, HS86, Ble90]. Examples of SIMD machines include the Illiac-IV [BBK*68], the Goodyear MPP [Bat80], and the Connection Machine CM-1 and CM-2 [Hil85]. Such machines provide a collection of processors and their memories that are controlled by a front-end computer (see Figure 1-1). The front-end broadcasts each instruction to all of the processors, which execute the instruction synchronously. The broadcast network is embellished with an `or' network that can take a bit from every processor, combine them using logical or, and deliver the bit to the front end (and optionally to all the processors). The data network allows data to be moved, in parallel, between pairs of processors by sending messages from one processor to another. The data network is also synchronously controlled by the front-end, which can determine when all messages in the data network have been delivered. The SIMD architecture is synchronous down to the clock cycle.

To execute a data-parallel program on a SIMD machine is straightforward. The SIMD machine has a single instruction stream, which matches the program's single thread of control. The vectors are distributed across the machine. The program is executed on the front-end computer, with each standard serial statement executed as for a serial program. To execute a vector elementwise operation, the instructions encoding the operation are broadcast, and each processor manipulates its part of the vectors. To execute bulk communications operations, instructions are broadcast to move the data through the data network of the machine. To execute a global scan, instructions are broadcast to implement a parallel-prefix reduction.
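To make the WHERE semantics concrete, here is a minimal sketch of how one processor might apply a context mask to its locally held elements of B and C (using the B < 0 condition reconstructed above). The names, the use of C as the language, and the explicit mask array are my own illustration, not something prescribed by the thesis.

    #include <stdlib.h>

    /* One processor's share of   WHERE (B < 0)  B <- B + C;
     * b and c hold this processor's local elements; nlocal is how many it owns. */
    void where_example(double *b, const double *c, size_t nlocal)
    {
        unsigned char *mask = malloc(nlocal);   /* the context mask */
        if (mask == NULL) return;
        for (size_t i = 0; i < nlocal; i++)
            mask[i] = (b[i] < 0.0);             /* WHERE sets the mask ...          */
        for (size_t i = 0; i < nlocal; i++)
            if (mask[i])                        /* ... and the body runs only where */
                b[i] = b[i] + c[i];             /*     the mask is set              */
        free(mask);
    }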

3 Examples of SIMD vector primitives include PARIS, the Parallel Instruction Set developed for the CM-2 [Thi86b], and CVL, a C vector library intended to be portable across a wide variety of parallel machines [BCH*93].

Figure 1-1: The organization of a SIMD computer. The front-end computer controls the processor-memory pairs, through a broadcast network, on a clock-cycle-by-clock-cycle basis. The `or' network accepts a bit from each processor and delivers the logical `or' of all the bits to the front-end. The front-end also controls the data network.

Figure 1-2: The organization of a synchronized MIMD computer. The processors are interconnected by two networks: a data network and a control network.

The execution of a reduction is similar to a scan, except that the reduced value is communicated to the host through the global `OR' network. To execute conditional operations, a context mask is maintained in each processor. Each processor conditionally executes the broadcast instructions based on this mask. The WHERE statement simply manipulates the context mask.

In the CM-5, we departed from the SIMD approach and built a synchronized MIMD machine. A synchronized MIMD machine consists of a collection of processors, a data network, and a control network (see Figure 1-2). The processors' job is to perform traditional operations on local data, e.g., floating-point operations. The data network's job is to move data from one processor to another via messages. The control network's job is to synchronize an entire set of processors quickly, and to implement certain multiparty communication primitives such as broadcast, reduction, and parallel prefix.

To execute a data-parallel program on a synchronized MIMD machine, we simulate the SIMD machine. If a synchronized MIMD machine can simulate a SIMD machine efficiently enough, then we can use the synchronized MIMD machine to execute both MIMD-style and data-parallel

computations, instead of using different machines for different styles of computation. A SIMD computation can be simulated on a synchronized MIMD machine by transforming the serial SIMD program that runs on the front-end of the SIMD machine into a program that runs in parallel on every processor of the synchronized MIMD machine. The vectors are laid out across the machine as for a SIMD machine. Each vector primitive is implemented as a subroutine that performs the `local' part of the primitive for a processor. The `serial' part of the data-parallel code executes redundantly on every processor, calling the subroutines to perform the local part of each vector primitive.4 Bulk data transfers are accomplished using the data network. Global scans and reductions use the control network. For conditional operations a context mask is maintained, just as for SIMD machines. To keep the processors in step, the machine is globally synchronized between nearly every vector primitive, which justifies the hardware for a control network.5

Without frequent global synchronization the program would execute incorrectly. Consider the

following code fragment:

    (R1) A[I] ← B;
    (R2) A ← A + C;

Line R1 calls for interprocessor communication, and then Line R2 uses the result of the communication, A, as an operand to a vector addition, and then stores the result back into A. Consider what happens if we do not insert a synchronization between Lines R1 and R2. Processor 0 might finish sending its local parts of B to the appropriate places in A, and then Processor 0 could race ahead to start executing Line R2. There would be no guarantee that the local copy of A had been completely updated, however, since some other processor might still be sending data to Processor 0, so the values provided to the vector addition could be wrong. To add insult to injury, after doing the vector addition, the data being sent to Processor 0 could then arrive and modify the local part

4 The idea of distributing a single program to multiple processors has been dubbed "SPMD," for single-program, multiple data [DGN*86]. H. Jordan's language, The Force, was an early SPMD programming language dating from about 1981 [JS94] and appearing a few years later in the literature [Jor85, Jor87, AJ94]. S. Lundstrom and G. Barnes describe the idea of copying a program and executing it on every processor of the Burroughs Flow Model Processor (FMP) [LB80]. Here, however, we focus on how to execute a data-parallel program rather than on how to program generally in an SPMD style. In Chapter 4 we will consider the problem of running more general MIMD programs.
5 The idea of building a MIMD machine with a synchronization network is not original with the CM-5. The Burroughs Flow Model Processor (FMP), proposed in 1979, included a control network and a data network that connected processors to memories [LB80]. The barrier network of the FMP was a binary tree that could synchronize subtrees using split-phase barriers. The proposed method of programming the FMP was to broadcast a single program to all the processors, which would then execute it, using global and barrier synchronization to communicate. The FMP was not built, however. The DADO machine [SS82] of S. Stolfo and D. Shaw provides a control network for a SIMD/MIMD machine. In SIMD mode, the control network broadcasts instructions, while in MIMD mode a processor is disconnected from `above' in the control network so that it can act as the front-end computer for a smaller SIMD machine. The DADO machine performs all communication in its binary-tree control network. The DATIS-P machine [PS91] of W. Paul and D. Scheerer provides a permutation network and a synchronization network. The DATIS-P permutation network provides no way of recovering from collisions of messages; patterns must be precompiled, and synchronization between patterns is required to ensure the correct operation of the permutation network. The CM-5 design team developed the idea of using split-phase global synchronization hardware [TMC88], and carried it to a working implementation. In independent work, C. Polychronopoulos proposed hardware that would support a small constant number of split-phase barriers that each processor could enter in any order it chose [Pol88]. R. Gupta independently described global split-phase barriers in which arbitrary subsets of processors could synchronize, using the term fuzzy barrier [Gup89]. Gupta's proposed implementation is much more expensive than a binary tree. The term split-phase describes the situation more accurately than does the term fuzzy. M. O'Keefe and H. Dietz [OD90] discuss using hardware barriers that have the additional property that all processors exit the barrier simultaneously. Such barriers allow the next several instructions on each processor to be globally scheduled using, for example, VLIW techniques [Ell85].

of A, overwriting the result of the vector addition. The easiest way to solve this problem is to place a barrier synchronization between Lines R1 and R2.

In barrier synchronization, a point in the code is designated as a barrier. No processor is allowed to cross the barrier until all processors have reached the barrier.6 One extension of this idea is the split-phase barrier, in which the barrier is separated into two parts. A point in the code is designated as the entry point to the barrier, and another point is designated as the completion point of the barrier. No processor is allowed to cross the completion point of the barrier until all processors have reached the entry point. Split-phase barriers allow processors to perform useful work while waiting for the barrier to complete.7

Barriers that synchronize only the processors are inadequate for some kinds of bulk communication. During the execution of Line R1 above, each processor may receive zero, one, or more messages. Every processor may have sent all its messages, but no processor can proceed to execute Line R2 until all its incoming messages, some of which may still be in the network, have arrived. To address this problem, the CM-5 provides a router-done barrier synchronization that informs all processors of the termination of message routing in the data network. The router-done barrier is implemented using Kirchhoff counting at the boundary of the data network. Each network interface keeps track of the difference between the number of messages that have arrived and the number that have departed. Each processor notifies the network interface when it has finished sending messages, and then the control network continuously sums up the differences from each network interface. When the sum reaches zero, the "router-done" barrier completes.8

Kirchhoff counting has several advantages over the other approaches to computing router-done.9 Irrelevant messages (such as operating-system messages) can be ignored, allowing router-done to be computed on user messages only. Kirchhoff counting can detect lost or created messages, because in such cases the sum never converges to zero. It is independent of the topology of the data network. Finally, Kirchhoff counting is fast, completing in the same time as it takes for a barrier, independently of the congestion in the data network.

Even with hardware support to execute global synchronization quickly, we would like to avoid doing more synchronization than we need. Each synchronization costs processor cycles to manipulate the control network, and also costs the time it takes for all the processors to synchronize. If one processor has more work to do than the others, that processor makes the others wait. This inefficiency is related to the inefficiency of executing conditional instructions on a SIMD machine, since in both cases processors sit idle so that other processors can get work done. There are several ways to further reduce the cost of synchronization, including removing synchronization and weakening the synchronization. To remove synchronization, we can observe that not every vector primitive requires a synchronization at the end of the operation. Synchronization is only required when processors communicate. To weaken synchronization, we can observe that processors could do something useful while waiting for synchronization to complete.

6 B. Smith credits H. Jordan with the invention of the term "barrier synchronization" [Smi94]. According to Smith, Jordan says that the name comes from the barrier used to start horse races. Jordan used barriers to synchronize programs for the Finite Element Machine described in [Jor78]. Smith states that Jordan later used the idea in The Force, an early SPMD programming language. Other descriptions of barriers can be found in [TY86, DGN*86].
7 Hardware for split-phase barriers was designed for the proposed FMP [LB80]. R. Gupta proposed split-phase barriers in which arbitrary subsets of processors could synchronize, using the term fuzzy barrier [Gup89].
8 The name "Kirchhoff counting" is related to Kirchhoff's current law (see, for example, [SW75]).
9 Other approaches to computing router-done include software and hardware. In software one can acknowledge all messages, and then use a normal processor-only barrier. Such an approach can double the message traffic, or requires modifications to the data network [PC90]. The Connection Machine CM-1 uses a hardware global-or network to determine when the router is empty [Hil85].
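The Kirchhoff-counting idea described above can be illustrated with a toy program (my own sketch, not CM-5 code): each network interface keeps a signed count of packets injected minus packets delivered, and router-done is the moment the global sum of those counts reaches zero after every processor has finished sending.

    #include <stdio.h>

    #define P 8                        /* network interfaces in this toy example */

    static int diff[P];                /* per-interface count: injected - delivered */

    static void inject(int src)  { diff[src]++; }
    static void deliver(int dst) { diff[dst]--; }

    /* The control network repeatedly sums the per-interface differences;
     * the router-done barrier completes when the sum is zero. */
    static int router_done(void)
    {
        int sum = 0;
        for (int i = 0; i < P; i++)
            sum += diff[i];
        return sum == 0;
    }

    int main(void)
    {
        inject(0); inject(0); inject(3);       /* three packets enter the network  */
        printf("done? %d\n", router_done());   /* 0: packets still in flight       */
        deliver(5); deliver(2); deliver(0);    /* the same packets leave, possibly */
                                               /* at different interfaces          */
        printf("done? %d\n", router_done());   /* 1: the Kirchhoff sum is zero     */
        return 0;
    }

The point of the toy is that the sum does not depend on where packets are delivered or on the network topology, which is what lets the control network compute router-done without any knowledge of routing.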


    (D1) A ← B × C;
    (D2) D ← B + E;

Figure 1-3: A fragment of data-parallel code.

    (N1) for i ∈ localindices()
    (N2)     A_l[i] ← B_l[i] × C_l[i];
    (N3) barrier();
    (N4) for i ∈ localindices()
    (N5)     D_l[i] ← B_l[i] + E_l[i];
    (N6) barrier();

Figure 1-4: The naive translation of the data-parallel code fragment into the local code to run on a synchronized MIMD processor. We use the notation A_l to denote the part of array A that is kept locally on the processor. The value of localindices() is the set of indices of the arrays that are kept on a single processor (we assume here that the arrays are all aligned). The code updates the local part of A, performs a barrier, updates the local part of D, and performs a barrier.

    (O1) for i ∈ localindices()
    (O2)     t ← B_l[i];
    (O3)     A_l[i] ← t × C_l[i];
    (O4)     D_l[i] ← t + E_l[i];
    (O5) barrier();

Figure 1-5: The optimized synchronized MIMD code. We removed the barrier on Line (N3), collapsed the loops, and performed common-subexpression analysis to avoid loading B_l[i] twice from memory.
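For concreteness, the per-node code of Figures 1-4 and 1-5 might look like the following C, assuming the multiply reconstructed in Line (D1) above; the barrier() stub, the array arguments, and nlocal are placeholders for whatever the runtime actually provides, not StarTech or CM-5 library code.

    #include <stddef.h>

    static void barrier(void) { /* stub; on the CM-5 this would use the control network */ }

    /* Naive translation (Figure 1-4): one loop and one barrier per vector statement. */
    void naive_step(double *A, const double *B, const double *C,
                    double *D, const double *E, size_t nlocal)
    {
        for (size_t i = 0; i < nlocal; i++) A[i] = B[i] * C[i];
        barrier();
        for (size_t i = 0; i < nlocal; i++) D[i] = B[i] + E[i];
        barrier();
    }

    /* Optimized translation (Figure 1-5): the first barrier is removed,
     * the loops are fused, and B[i] is loaded only once. */
    void optimized_step(double *A, const double *B, const double *C,
                        double *D, const double *E, size_t nlocal)
    {
        for (size_t i = 0; i < nlocal; i++) {
            double t = B[i];
            A[i] = t * C[i];
            D[i] = t + E[i];
        }
        barrier();
    }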

One can remove barriers between statements that have only local effects. For example, if we

have the data parallel code shown in Figure 1-3, and if we assume that the arrays are all the same

size and are aligned so that A[i], B[i], C[i], D[i], and E[i] are all on the same processor, then the naive per-processor code would look like Figure 1-4. We observe that the barrier on Line (N3) is not needed because there are no dependencies between Lines (D1) and (D2) in the original

code. We can also collapse the loops so that only one pass is needed, and we can avoid loading

the value of B[i] twice, resulting in the code of Figure 1-5. By transforming the code containing synchronizations, we have not only reduced the amount of synchronization, but we have exposed additional opportunities for code optimization.

Removing barriers from code that has only local effects is straightforward, but the situation

is more complex when there is interprocessor communication. For example, when executing a

send, expressed as A[I] ← B, a synchronization is required after the operation to make sure all the updates have taken place. When performing a get, expressed as B ← A[I], a synchronization is required before the operation to make sure that the value of A is globally up-to-date before fetching the data. One simple rule is to include a barrier before and after any code that communicates.

Often, simply expressing a barrier as a split-phase barrier makes an optimization obvious. Figure 1-6 shows how a while loop controlled by a global `sum' operation might be translated into split-phase operations.10


    (M1) while (global_sum(v))        (S1) start_global_sum(v);
    (M2)   ...                        (S2) while (complete_global_sum())
                                      (S3)   ...
                                      (S4)   start_global_sum(v);

Figure 1-6: By transforming the monolithic global `sum' operation on the left into the split-phase global `sum' operations on the right, we can expose opportunities to hide the latency of the global `sum' operation.

First, the global sum operation is initialized, providing the value of v to the control network. Then, the global sum is completed and the result is used to control the while loop. Before the next iteration of the while loop, the next global sum is started. Now, we have the opportunity to start the global `sum' on Line (S4) sooner. We could move Line (S4) to just after the

last modification of v in the elided code of Line (S3). By transforming the code, we can hide the latency of the global synchronization.

Sometimes barriers should not be removed or weakened. A split-phase barrier does not provide the performance composition property provided by a monolithic barrier. To understand the

performance of

    A; enter-barrier; B; complete-barrier; C;11

requires that you understand how A and B can interact and how B and C can interact. This is

more difficult to understand than the case "A; barrier; B;" where the two parts of the program are completely separated by a monolithic barrier. We will see several examples in this dissertation where barriers are not needed for the correct execution of a program, but are needed to achieve high performance. In Chapter 3 we show that barriers can improve the performance of interprocessor communication, and in Chapter 5 we show that barriers are useful in a dynamic-MIMD program to keep idle processors from swamping busy processors.

Thus, fast global synchronization hardware allows straightforward and efficient execution of data-parallel programs on MIMD machines. It also turns out that by using global synchronization as the underlying organizing strategy of the CM-5, we were able to solve many other system problems of parallel computing such as time-sharing, diagnostics, and clock distribution.

We also used the global synchronization mechanism to support the operating system in the CM-5. The timesharing system globally schedules the processors so that they are all running the same user program at the same time.12 We treat the messages in the network as part of the process state, and at the time of a process switch, the entire network state is saved in the process descriptor. Because the network state is swapped out, the performance of each process is independent of the

10One can perform a split-phase scan or reduction using the CM-5 control network. Such an operation implicitly includes a split-phase barrier, since the reduction does not complete until all processors have provided a value.

11 Before the split-phase barrier completes, some processors may be executing A, and some may be executing B. After every processor has entered the barrier and the barrier has completed, some of the processors may still be executing B while others have started to execute C.
12 The CM-5 is also divided into partitions, with each partition operating independently. Here we focus on the behavior of a single partition.

behavior of other processes. In order to be able to quickly empty the data network and to evenly distribute the network state among the processors, we developed the all-fall-down mechanism, in which all the messages in the network are immediately delivered to a nearby processor.

This context-switching decision also solves a problem with user-level messages. To provide high-performance protected network access to the user, the processor-network interface includes address-translation hardware, which allows user-mode code to be given direct access to the interface. Thus messages can be sent directly from user code in one processor to user code in another, without giving the user access to any other process. No operating-system call is required. Since the user has direct access to the network, the user also must take care not to deadlock the network. It does not hurt the operating system if the user deadlocks the network, since the operating system empties the network on every context switch. Deadlocking the network has the same effect on the user as if the user had written an infinite loop.13
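As an illustration of the user-level network access just described, the following sketch shows the flavor of sending a message by storing into a memory-mapped interface. The register names, layout, and return convention are invented for this sketch; they are not the actual CM-5 network interface.

    #include <stdint.h>

    /* Hypothetical memory-mapped send registers (invented for illustration). */
    typedef struct {
        volatile uint32_t send_dest;   /* destination processor number        */
        volatile uint32_t send_data;   /* successive words of the packet body */
        volatile uint32_t send_ok;     /* nonzero if the packet was accepted  */
    } net_interface;

    /* Because the interface is mapped into the user's address space (with
     * address translation confining it to this process), no system call is
     * needed: user code simply stores into the registers. */
    int user_send(net_interface *ni, uint32_t dest, const uint32_t *words, int nwords)
    {
        ni->send_dest = dest;
        for (int i = 0; i < nwords; i++)
            ni->send_data = words[i];
        return ni->send_ok != 0;       /* caller retries if the network refused the packet */
    }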

How to Get Good Performance from the CM-5 Data Network

Compiling data-parallel programs for the CM-5 introduces many opportunities for optimizing the performance of the code. Optimizing the code sometimes turns out to be tricky, however. Steve Heller of Thinking Machines Corporation had noticed that a sequence of cyclic shifts separated by barriers actually runs faster than if the barriers are removed [Hel92]. I had expected such a communications operation to run faster without the barriers, since up to that point I generally thought of barriers as overhead that should be removed if possible. Thus, Heller's observation was surprising. A study was embarked on at MIT to understand how best to program the data network of a machine such as the CM-5.14

We concluded that programmers of the Connection Machine CM-5 data network can improve the performance of their data movement code by more than a factor of three by selectively using global barriers, by limiting the rate at which messages are injected into the network, and by managing the order in which they are injected. Barriers eliminate target-processor congestion, and provide a kind of bulk end-to-end flow control that allows the programmer to schedule communications globally. Injection reordering improves the statistical independence of the various packets in the network at any given time. Barriers and tuned injection rates provide forms of flow control. Although we only experimented with the CM-5, we expect these techniques to apply to other parallel machines.

Why do barriers speed up a sequence of cyclic shifts? We demonstrated that, without any barriers, some processors fall behind the others. The work from later phases of the computation systematically interferes with uncompleted work from earlier phases. We were able to see what was going on because the CM-5 processors have access to a clock that is globally consistent to within a single 33-megahertz clock cycle. We recorded, for each message, the time it was injected into the network and the time it was delivered. For cyclic shift, the receivers are the bandwidth bottleneck, so it is important to keep all the receivers busy. We saw with our timestamping experiment, however, that for hundreds of thousands of cycles at a stretch, some receivers had no messages waiting to be delivered, and some receivers had over ten messages waiting to be delivered. The imbalance among receivers systematically gets worse over time. By periodically synchronizing all the processors,

13 The CM-5 makes it easy, however, to implement protocols such as remote-fetch without deadlocking and without incurring a large bookkeeping overhead. The CM-5 accomplishes this by providing two independent networks that comprise the data network.
14 Eric A. Brewer and I performed the performance study. Brewer, with R. Blumofe, was developing a CM-5 communications library called Strata [BB94b]. Strata's goal is to improve the programming interface and performance of message passing as compared to Thinking Machines' CMMD [Thi93] and Berkeley's CMAM [vCG*92].

the imbalance among receivers is removed, allowing the machine to operate at peak performance. Barriers improve the performance of a sequence of cyclic shifts by a factor of 2 to 3. Thus, not only is global synchronization useful for writing correct programs, but it can also help us to improve and to understand the performance of a program.

We knew that the order in which messages are injected could affect the performance of the data network. For example, a bad way to implement all-to-all communication is for every processor to send a block of messages to Processor 0, then to have every processor send a block to Processor 1, and then to Processor 2, and so on. That order dramatically reduces the performance of the

program, because there are always only a few processors receiving the data that all the processors are trying to send. We had hypothesized that if on step i, processor j sent a block of messages to processor (i + j) mod P, then we would achieve near-optimal performance. We found instead that if the messages that form the blocks are sent in an interleaved or random order, the performance is better, sometimes by more than a factor of two as compared to the unsynchronized, uninterleaved cyclic shifts that had seemed so reasonable. For a synchronized sequence of cyclic shifts, interleaving only adds overhead without improving performance, but for message patterns about which less is known, it may often pay to interleave the order in which messages are sent.

We also found that by artificially slowing down the rate at which messages are injected into the network to exactly match the rate at which messages can be removed from the network, we were able to achieve an additional 25% performance improvement because the network remains busy, but uncongested.

In summary, global synchronization is not only useful for correct execution of programs, it also helps improve the performance. A programmer must be careful about removing barriers because of the performance implications. Also, global synchronization in the form of a globally consistent clock makes it possible to study the performance of the system.
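The barrier-separated cyclic-shift schedule discussed above can be sketched as follows. This is a hedged illustration, not CM-5 code: send_word(), barrier(), self(), and nprocs() stand in for whatever message-passing library is in use, and reception, interleaving, and bandwidth matching are left out.

    void send_word(int dest, unsigned word);   /* assumed library routines */
    void barrier(void);
    int  self(void);
    int  nprocs(void);

    /* All-to-all communication as a sequence of cyclic shifts: on step s,
     * processor j sends its block for processor (j + s) mod P, then everyone
     * synchronizes so that one shift cannot fall behind and collide with
     * the next. */
    void all_to_all(const unsigned *blocks, int words_per_dest)
    {
        int P = nprocs(), me = self();
        for (int s = 1; s < P; s++) {
            int dest = (me + s) % P;
            for (int w = 0; w < words_per_dest; w++)
                send_word(dest, blocks[dest * words_per_dest + w]);
            barrier();   /* the barrier whose benefit Chapter 3 measures */
        }
    }

Dropping the barrier() call gives the unsynchronized version that, as described above, lets some receivers fall behind.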

Massively Parallel Chess

Having designed the CM-5 to support data-parallel programming, I wanted to explore how to exploit the MIMD characteristics of the machine fully. I looked for an application that did not fit well into the data-parallel approach, and hit upon computer chess. Surprisingly, global synchronization solves problems of dynamic MIMD-style programming in addition to problems of data-parallel programming.

Computer chess provides a good testbed for understanding dynamic MIMD-style computations. The parallelism of the application derives from a dynamic expansion of a highly irregular game-tree. Thus computer chess is difficult to express as a data-parallel program. The trees being searched are orders of magnitude too large to fit into the memory of our machines, and yet serial programs can run game-tree searches depth-first with very little memory, since the search tree is at most 20 to 30 ply deep. Computer chess requires interesting global and local data structures. Computer chess is demanding enough to present engineering challenges to be solved and to provide for some interesting results, yet it is not so difficult that one cannot hope to make any progress at all. Since there is an absolute measure of performance (`How well does the program play chess?'), there is no percentage in cheating, e.g., by reporting parallel speedups as compared to a really bad serial algorithm. In addition to those technical advantages, computer chess is also fun.

To investigate the programming issues, I engineered a parallel chess program, StarTech. StarTech is based on H. Berliner's serial Hitech program [BE89] and runs on a Connection Machine CM-5 supercomputer. The program, running on the 512-node CM-5 at the National Center for

Supercomputing Applications at the University of Illinois, tied for third place at the 1993 ACM International Computer Chess Championship, and has an informally estimated rating of 2450 USCF.15

The StarTech chess program is conceptually divided into two parts: the parallel game tree algorithm, which specifies what can be done in parallel; and the scheduler, which specifies when and where the work will actually be performed. I found that chess places great demands on a scheduler. I found, by measuring the ideal parallelism histogram, that sometimes there is plenty of parallel work to do, and sometimes there is very little. I typically saw average available parallelism of at least several hundred, but for about a quarter of the run-time on an infinite processor machine, the available parallelism was less than 4. It is crucial that the scheduler do a good job when there is very little to do, so that the program can get back to the highly parallel parts.

To distribute work among CM-5 processors, StarTech uses a work-stealing approach, in which idle processors request work. I noticed that sometimes during a run, when the available parallelism is low, the idle processors swamp the busy processors with requests for work. This is a serious problem, since it can arbitrarily stretch out the time it takes to execute the portions of the program that have low parallelism. The swamping problem has been previously reported for work-stealing schedulers [FM87, FM93]. I provide a queueing theory explanation of how the swamping problem arises and offer a global-throttle mechanism to avoid the swamping problem. The global throttle organizes the computation into a series of globally synchronous phases separated by split-phase barriers. During each phase, each idle processor is allowed to make only one request to steal work. Thus the expected number of incoming requests to a busy processor is less than one. Because it uses the CM-5 control network's split-phase barriers, the busy processors continue to do useful work regardless of how long it takes for the global throttle to complete each phase.16 Thus, the CM-5's global synchronization network is useful for dynamic MIMD programs too.

Given my scheduler, I found that two numbers, the critical path length and the total parallel

work, can be used to predict the performance of StarTech. The critical path length C is the time it would take for the program to run on an infinite processor machine with no scheduling overheads.

It is a lower bound to the runtime of the program. The total work W is the number of processor

cycles spent doing useful work. W does not include cycles spent idle when there is not enough work to keep all the processors busy. On P processors, I define W/P to be the linear speedup term.

The linear speedup term is also a lower bound to the runtime on P processors. Another way to

think about it is to consider the program to be a dataflow graph. The critical path length is the depth of the graph, and the total work is the size of the graph. The values for C and W can be derived analytically or measured empirically. I measure the effectiveness of our scheduler by comparing it

to these lower bounds. I found that the run-time on P processors of our chess program is accurately

modeled as

    T_P ≈ 1.02 W/P + 1.5 C + 4.3 seconds.    (1.1)

Except for the constant term of 4.3 seconds, this is within a factor of 2.52 of the lower bound given by the maximum of C and W/P.17
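Equation 1.1 and the corresponding lower bound are simple enough to restate as code; the coefficients are the ones reported above, and the helper names are mine.

    /* Equation 1.1 restated as a function.  W and C are the total work and the
     * critical path length in seconds, and P is the number of processors. */
    double startech_runtime(double W, double C, double P)
    {
        return 1.02 * (W / P) + 1.5 * C + 4.3;     /* predicted runtime, in seconds */
    }

    /* The lower bound the model is compared against: max(W/P, C). */
    double startech_lower_bound(double W, double C, double P)
    {
        double linear = W / P;
        return (linear > C) ? linear : C;
    }

Because W and C can be measured on a small machine, the same formula predicts the runtime on a larger machine, which is how the factor of 2.52 above is assessed.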

15 The StarTech team has included, at various times, Hans Berliner, Mark Bromley, Roger Frye, Charles E. Leiserson, Chris McConnell, Ryan Rifkin, James Schuyler, Kurt Thearling, Richard Title, and David Waltz.
16 Another way to solve the swamping problem is to provide separate hardware to deal with incoming requests. Such an approach could be used on the Alewife processor [ACD*91], for example. I could not add special hardware to the machine, so I had to find a software solution. Alternatively, one can use an adaptive backoff strategy, such as is found in Ethernet [MB76], DIB [FM87], and PCM [HZJ94, Hal94]. The global throttle requires less tuning than does adaptive backoff, and is easy to analyze.
17 For comparison, Brent's theorem [Bre74, Lemma 2] states that with no scheduling overhead, the runtime can be

The StarTech program uses Jamboree search, a parallelization of Scout search, in which at every node of the search tree, the program searches the first child to determine its value, and then tries to prove, in parallel, that all of the other children of the node are worse alternatives than the first child. This approach to parallelizing game tree search is quite natural and has been used by several other parallel chess programs [HSN89, FMM91]. While no other parallel game program uses an algorithm that is identical to StarTech's Jamboree search, I do not claim that the search algorithm is a new contribution. Instead, I view the algorithm as a testbed for testing mechanisms needed for the design of scalable, predictable, asynchronous parallel programs.

I analyzed the computational complexity of the Jamboree search algorithm, using critical path length and total work. For two special cases, best-ordered uniform trees and worst-ordered uniform trees, I used analytic methods. For best-ordered trees the critical path length is short and the amount of work performed is the same as for a good serial game-tree search. For worst-ordered trees the critical path is long, so that even on an infinite number of processors the Jamboree search achieves no speedup compared with a good serial implementation. For real chess trees I measured the critical path length and work of the program as it actually ran. I found that the quality of the move-ordering heuristics strongly affects the critical path length and the total work. I found that for tournament time controls on large machines searching real chess trees, the critical path is important but that it does not dominate the runtime. For small machines the critical path is not an issue at all. A naive application of Jamboree search achieves work efficiencies of between 33% and 50%. The work efficiency of a parallel program on a problem is the ratio of the time for one processor to solve the problem, using the best serial code, to the total work generated by the parallel program.

I found three strategies to improve the performance of StarTech, two of which exploit StarTech's global transposition table. StarTech uses a global transposition table, which memoizes results from earlier searches in order to improve move ordering for, and hence the efficiency of, later searches.18 The first strategy for improving performance is to perform recursive iterative deepening.19 When

searching a chess position to depth k, the first thing StarTech does is to look up the position in the

global transposition table to determine if anything from a previous search has been saved. If a move

for a search of depth k − 1 or deeper is found, then StarTech uses that move as its guess for the first child. If no such move is found, then StarTech recursively searches the position to depth k − 1 in order to find the move. By so doing, StarTech greatly improves the probability that the best move is searched first.

The second strategy for improving performance is to perform deferred reads on the transposition table in order to prevent more than one processor from searching the same position redundantly. When a processing node starts searching a chess position, StarTech records in the global transposition table that the position is being searched. If another processor starts searching the same position, the processor waits until the first processor finishes. It is much better for the second processor to sit idle than to work on the tree, since this prevents the second processor from generating work which may then be stolen by other processors, causing an explosion of redundant work.

The third strategy is to serialize Jamboree search slightly. Instead of searching one child serially and then the rest in parallel, as basic Jamboree search does, our variation sometimes searches

two children serially. The precise conditions for searching two children serially are that the node be

brought down to no more than W/P + C. The StarTech scheduler guarantees that the memory required per processor is within not much more than the memory required by the serial depth-first search. R. Blumofe and C. Leiserson studied several scheduling strategies to simultaneously achieve good time and space bounds. Some of the scheduling tradeoffs made in StarTech were strongly influenced by discussions with Blumofe and Leiserson. (See Section 5.8.)
18 Most other programs use additional move-ordering mechanisms such as the killer table [GEC67] and the history table [MOS86]. StarTech does not currently use these additional move-ordering heuristics.
19 Recursive iterative deepening was used in T. Truscott's unpublished checkers program in the early 1980's [Tru92], and was briefly explored for the Hitech program by H. Berliner and his students in the late 1980's [Ber93].

of Knuth-Moore type-2 [KM75], that recursive iterative search of the node had a value greater than the α parameter of the subtree, and that the search of the first child yielded a score that is less than or equal to the α parameter. This serialization improves the work efficiency of StarTech without substantially increasing the critical path length.20

By separating the search algorithm and the scheduler, the problems of each could be solved separately. By exploiting fast global synchronization, the problems of scheduling dynamic MIMD-style computations were simplified. I neither needed to perform arcane tuning of the scheduler nor did I need to worry about pathological search trees. Thus, I was able to focus my attention on the application, analyzing and improving the performance of the underlying search algorithm. Without the mechanisms provided by the CM-5, including global synchronization and fast user-level messages, it would have been much more difficult to implement a competitive chess program such as StarTech.
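For readers who want the control structure of Jamboree search spelled out, here is a serial C sketch of the algorithm as described above: value the first child with the full window, then test each remaining child against that value, re-searching a child whose test fails. In StarTech the loop over the remaining children is spawned in parallel and a cutoff aborts the outstanding children; everything here is serial, and the position type and helper routines are invented for the sketch rather than taken from StarTech.

    /* Assumed game-tree interface (invented for this sketch). */
    typedef struct position position;
    int       evaluate(const position *p);        /* static evaluation of a leaf */
    int       num_children(const position *p);
    position *child(const position *p, int i);

    static int max_int(int a, int b) { return a > b ? a : b; }

    int jamboree(const position *p, int alpha, int beta)
    {
        int n = num_children(p);
        if (n == 0)
            return evaluate(p);

        /* Value the first child with the full window. */
        int b = -jamboree(child(p, 0), -beta, -alpha);
        if (b >= beta)
            return b;                             /* cutoff */
        alpha = max_int(alpha, b);

        /* Try to prove the remaining children worse than the first.
         * StarTech runs these tests in parallel; this sketch runs them serially. */
        for (int i = 1; i < n; i++) {
            int s = -jamboree(child(p, i), -alpha - 1, -alpha);   /* null-window test */
            if (s >= beta)
                return s;                         /* cutoff; StarTech aborts the rest */
            if (s > alpha) {
                /* The test failed, so this child might be best: search it for real. */
                s = -jamboree(child(p, i), -beta, -alpha);
                if (s >= beta)
                    return s;
                alpha = max_int(alpha, s);
            }
            b = max_int(b, s);
        }
        return b;
    }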

Contributions

Global synchronization mechanisms include barriers, split-phase barriers, router-done barriers, gang scheduling, synchronous instruction broadcast, and globally consistent clocks. What do

all these have in common? Global synchronization asserts that something is true of every processor p. For example, it might assert that "every processor has finished the kth step of the algorithm", or "at

this moment, the local clock at every processor says that the time is t". Global synchronization provides powerful enough invariants to provide simple solutions to system problems ranging from clock distribution, diagnostics, timesharing, and spacesharing, to the correct execution of data-parallel programs, high-performance execution of dynamic MIMD-style programs, and fast bulk data transfers. My contributions include

Fast global synchronization is a simple, effective, efficient solution to many system problems of parallel computing.

The network architecture of the Connection Machine CM-5, a synchronized MIMD computer. The CM-5 provides three networks: a control network, a data network, and a diagnostic network. The system includes user-level network access, split-phase barriers, a router-done primitive implemented with "Kirchhoff counting", a split data network to allow commonly used protocols to be implemented without deadlock or bookkeeping, an "all-fall-down" mechanism that quickly empties the routing network for timesharing, and a parallel diagnostics strategy. The system efficiently supports data-parallel programming and dynamic MIMD-style programming.

Three strategies for obtaining good performance from a data network: periodic barriers, injection reordering, and bandwidth matching.

StarTech, a competitive parallel chess program that runs on the CM-5. StarTech employs a work-stealing scheduler, which uses the CM-5 control network to throttle processors that are looking for work. The StarTech scheduler achieves good performance relative to optimal scheduling. StarTech uses Jamboree search, a parallel game tree search algorithm. I analyze the performance of Jamboree search on special case game trees, and measure the performance

20Charles E. Leiserson and I together designed the serialization heuristic for Jamboree search.

on real chess trees. I systematically employ the critical path length and total work to understand the performance of StarTech, which is a dynamic MIMD-style program. I demonstrate a heuristic that improves the work efficiency of the Jamboree algorithm on real chess trees. StarTech uses a global transposition table, the performance of which is improved by using recursive iterative deepening and deferred lookups. StarTech uses several active message protocols that may be useful in a wider context to avoid dangling references and reduce message traffic for atomic access to global data-structures.

A Road Map

This dissertation is organized into three, mostly independent, parts.

The first part (Chapter 2) describes the network architecture of the CM-5.

The second part (Chapter 3) describes how to get good performance from the CM-5 data network, especially for the kinds of message traffic that show up in data-parallel programs.

The third part (Chapters 4–6) describes the StarTech massively parallel chess program.

While the three parts of the thesis together provide evidence that fast global synchronization is useful for MIMD computation, the ideas described in each part stand alone, and the parts can be read independently. Many of the mechanisms of Part I can be employed to solve system-level problems in other kinds of parallel machines. Such problems include deadlock avoidance, timesharing, and detecting the termination of a computation. The mechanisms of Part II can be used to improve the performance of data transfers on almost any machine, although machines without fast global synchronization are at a disadvantage compared to synchronized MIMD machines. The results of Part III include an algorithm, with analysis, for parallel game tree search, and mechanisms for scheduling work on parallel machines. Chapter by chapter, this dissertation is organized as follows:

Chapter 2 describes the CM-5, a specific instance of the synchronized MIMD architecture. Beginning with an explanation of how data-parallel code can be compiled for and executed on a synchronized MIMD machine, the chapter proceeds to explain the network architecture of the CM-5. The CM-5 includes three networks: a data network, a control network, and a diagnostics network. The CM-5 includes mechanisms to support time-sharing and space-sharing; to detect the completion of computations that use the data network; and to help the user program the data network without deadlocking. The diagnostics network can quickly test a large CM-5 in parallel.

Chapter 3 continues the study of the CM-5 by examining how to get good performance from the CM-5 data network. The chapter starts with an explanation of the problem, that removing barriers can sometimes slow down a computation, and reviews some of the issues that arise when measuring performance on the CM-5. The chapter then shows how and why programmers can improve the performance of their data movement code by more than a factor of three by selectively using global barriers; by limiting the rate at which messages are injected into the network; and by managing the order in which they are injected.

Chapter 4 begins the third part of this dissertation by explaining how the StarTech chess

program works. The chapter describes the Jamboree search algorithm, starting with a review of game tree search, and explains how Jamboree search is related to the serial α-β and Scout

search algorithms. The chapter then describes the StarTech work-stealing scheduler, which employs global synchronization to help guarantee good performance. The Jamboree algorithm employs speculative expansion of the tree, and sometimes the algorithm discovers that a partially expanded subtree is no longer needed. StarTech employs a simple active-message protocol that avoids dangling references to frames that are aborted in mid-computation.

Chapter 5 studies the performance of the StarTech program, beginning with a discussion of Brent's theorem. The chapter presents a study of the Jamboree algorithm, analyzing two special cases: best-ordered uniform game trees and worst-ordered uniform game trees. Expressions are produced for the critical path length, the parallel work, and the work efficiency of Jamboree search on such trees. An empirical study of the Jamboree algorithm is presented, focusing on the critical path length and parallel work of the algorithm when searching real chess trees. Having examined the Jamboree algorithm, the chapter moves on to study the StarTech scheduler, starting with a demonstration that our parallel chess program places great demands on the scheduler. The scheduler's performance is then analyzed in relation to critical path length and linear-speedup lower bounds to performance, showing that Equation 1.1 accurately models the performance of StarTech. In particular, one can measure the performance of the program on a small machine, and predict its performance on a large machine. A study of the swamping problem yields justification for the global throttling strategy of the StarTech scheduler. Using a simple experimental setup, the swamping problem is demonstrated, and an analytic queueing model is developed that provides a good match to the empirical measurements. The chapter concludes with an analysis of the space-time tradeoffs associated with fixing each position of the chess tree on a particular processor rather than allowing the work to migrate from one processor to another.

Chapter 6 explains how I used the critical path length and parallel work measurements to improve the performance of StarTech. Chapter 6 describes our global transposition table, and explains how recursive iterative deepening and deferred reads are implemented and what the performance implications of those mechanisms are. I analyze where the extra work comes from that leads to StarTech's modest work inefficiency. Using that analysis, I construct a modification to the basic Jamboree search that sometimes serializes the search. That modification, when used, has the effect of significantly decreasing the total parallel work while increasing the critical path length only slightly. The chapter concludes with a study of how the processor cycles are spent by StarTech, with an eye to understanding how to improve the performance of the program in the future.

Chapter 7 concludes by reviewing the relationship of this work to previous work, and discussing the merits of the mechanisms that I describe, both with respect to today's technology and to that of the future.

Chapter 2

The Network Architecture of the Connection Machine CM-5

In the design of a parallel computer, the engineering principle of economy of mechanism suggests that the machine should employ only a single communication network to convey information among the processors in the system. Indeed, many parallel computers contain only a single network: typically, a hypercube or a mesh. The Connection Machine Model CM-5 Supercomputer has three networks, however, and none is a hypercube or a mesh. This chapter describes the architecture of each of these three networks and the rationale behind them. The CM-5 is a synchronized MIMD machine, which combines the best aspects of SIMD (single instruction stream, multiple data stream) and MIMD (multiple instruction stream, multiple data stream) machines [Fly66]. Each processor in the CM-5 executes its own instructions, providing the flexibility of a typical MIMD machine. And, like many MIMD machines, the CM-5 is a message-passing machine (as opposed to a shared-memory machine [DT90, GGK*83]) in which processors communicate among themselves by sending messages [Sei85, SAD*86] through the data network of the machine. A deficiency of typical MIMD machines, especially as compared with their SIMD cousins, however, is that they provide little or no support for coordinating and synchronizing sets of processors. To offset this deficiency, the CM-5 also contains a control network, which makes synchronization and multiparty communication primitives competitive with comparable functions on SIMD machines. These primitives include the fast broadcasting of data, barrier synchronization [Jor78, TY86, DGN*86], and parallel prefix (scan) operations [Ble90].

Figure 2-1 shows a diagram of the CM-5 organization. The machine contains between 32 and 16,384 processing nodes, each of which contains a 32-megahertz SPARC processor, 32 megabytes of memory, and a 128-megaflops vector-processing unit capable of processing 64-bit floating-point and integer numbers. System administration tasks and serial user tasks are executed by a collection of control processors, which are Sun Microsystems workstation computers. There are from 1 to several tens of control processors in a CM-5, each configured with memory and disk according to the customer's preference. Input and output is provided via high-bandwidth I/O interfaces to graphics devices, mass secondary storage, and high-performance networks. Additional low-speed I/O is provided by Ethernet connections to the control processors. The largest machine, configured with up to 16,384 processing nodes, occupies a space of approximately 30 meters by 30 meters, and is capable of over a teraflops (10^12 floating-point operations per second).

1 The research in this chapter represents joint work with Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. Much of the work in this chapter was originally reported in [LAD*92].
Figure 2-1: The organization of the Connection Machine CM-5. The machine has three networks: a data network, a control network, and a diagnostic network. The data and control networks are connected to processing nodes, control processors, and I/O channels via a network interface.

The processing nodes, control processors, and I/O interfaces are interconnected by three networks: the data network, the control network, and a diagnostic network. The data network provides high-performance point-to-point data communications between system components. The control network provides cooperative operations, including broadcast, synchronization, and scans (parallel prefix and suffix). It also provides system management operations, such as error reporting. The diagnostic network allows "back-door" access to all system hardware to test system integrity and to detect and isolate errors.

The system operates as one or more user partitions. Each partition consists of a control processor, a collection of processing nodes, and dedicated portions of the data and control networks. Access to system functions is classified as either privileged or nonprivileged. All nonprivileged system functions, including access to the data and control networks, can be executed directly by user code without system calls. Consequently, network communication within a user task occurs without operating system overhead. Access to the diagnostics network, to shared system resources (such as I/O), and to other partitions is privileged and must be accomplished via system calls. Protection and addressing mechanisms ensure that no user can interfere with the function or performance of another user in another partition. If the system administrator so desires, a single partition can be timeshared among a group of users, where each user gets a fair portion of the available time and cannot otherwise be interfered with by any other user.

This chapter describes the CM-5 synchronized MIMD hardware and how to use it to run data-parallel programs. The rest of this chapter then focuses on the details of the network architecture of the CM-5, and is organized as follows. Section 2.1 describes the network interface, which provides the user's view of the data and control networks. Section 2.2 then describes the data network, Section 2.3 describes the control network, and the diagnostic network is described in Section 2.4. Section 2.5 discusses how global synchronization helped solve many system problems in the CM-5. The chapter closes with Section 2.6, which gives a short history of our development project. Further details about the CM-5 system can be found in the CM-5 Technical Summary [TMC91]. The reader should be aware that the performance specifications quoted in this chapter apply only to the initial release of the CM-5 system. Because of our ability to reengineer pieces of the system easily, these numbers represent only a snapshot of an evolving implementation of the architecture. The machine has recently been revised to include faster processors and data networks.


2.1 The CM-5 Network Interface

Early on in the design process, we, the CM-5 development team at Thinking Machines Corporation, decided to specify an interface between the processing nodes and the networks that isolates each from the details of the other. This interface provides three features. First, the interface gives the processors a simple and uniform view of the networks (and the networks get a simple and uniform view of the processors). Second, the interface provides support for time-sharing, space-sharing, and mapping out of failed components. Third, the interface provides a contract for the implementors which decouples the design decisions made for the networks from those of the processors.

The processor's view of the interface is as a collection of memory-mapped registers. By writing to or reading from fixed physical memory addresses, data is transferred to or from the networks, and the interface interprets the particular address as a command. A memory-mapped interface allows us to use many of the memory-oriented mechanisms found in off-the-shelf processors to deal with network interface issues. To access the network, a user or compiler reads from or writes to locations in memory. We regarded the prospect of executing a system supervisor call for every communication as unacceptable, in part because we wished to support the fine-grain communication needs of data-parallel computation. A memory-mapped interface allows the operating system to deny users access to certain network operations by placing the corresponding memory-mapped registers on protected pages of the processor's address space. The processor's memory management unit enforces protection without any additional hardware.

The interface is broadly organized as a collection of memory-mapped FIFO's. Each FIFO is either an outgoing FIFO to provide data to a network, or an incoming FIFO to retrieve data from a network. Status information can be accessed through memory-mapped registers. For example, to send a message over a network, a processor pushes the data into an outgoing FIFO by writing to a suitable memory address. When a message arrives at a processor, the event is signaled by interrupting the processor, or alternatively, the processor can poll a memory-mapped status bit. The data in the message can then be retrieved by reading from the appropriate incoming FIFO. This paradigm is identical for both the data and control networks.

The network interface provides the mechanisms needed to allow context switching of user tasks. Each user partition in the CM-5 system can run either batch jobs or a timesharing system. When a user is swapped out during timesharing, the processors must save the computation state. Some of this state information is retrieved from the network interface, and the rest is garnered from the networks. The context-switching mechanism also supports automatic checkpointing of user tasks.

The interface provides processor-address mapping so that the user sees a 0-based contiguous address space for the processor numbers within a partition. Each processor can be named by its physical address or by its relative address within the partition. A physical address is the actual network address as interpreted by the hardware inside the networks. A relative address gives the index of a processor relative to the start of a user partition, where failed processors are mapped out. All processor addresses in user code are relative addresses. To specify physical addresses requires supervisor privileges. Relative addresses are bounds checked, so that user code cannot specify addresses outside its partition.

The user's view of the networks is independent of a network's topology. Users cannot directly program the wires of the networks, as they could on our previous machine, the CM-2. The reason is simple: the wires might not be there! Because the CM-5 is designed to be resilient in the presence of faults, we cannot allow the user to rely on a specific network topology.

Figure 2-2: A binary fat-tree. Processors are located at the leaves, and the internal nodes are switches. Unlike an ordinary binary tree, the channel capacities of a fat-tree increase as we ascend from leaves to root. The hierarchical nature of a fat-tree can be exploited to give each user partition a dedicated subnetwork which cannot be interfered with by any other partition's message traffic. The CM-5 data network uses a 4-ary tree instead of a binary tree.

One might think topology independence would hurt network performance, but we found this presumption to be less true than we initially imagined. Because we did not provide the user with access to the wires of the network, we were able to apply more resources to generic network capabilities. A further advantage of topology independence is that the network technology becomes decoupled from processor technology. Any future network enhancements are independent of user code and processor organization.

An important ramification of the decoupling of the processors from the networks is that the networks must assume full responsibility for performing their functions. The data network, for example, does not rely on the processors to guarantee end-to-end delivery. The processors assume that delivery is reliable. Nondelivery implies a broken system, since there is no protocol for retransmission. To guarantee delivery, additional error-detection circuitry must be incorporated into the network design, which slightly reduces its performance, but since the processor does not need to deal with possible network failures, the overall performance as seen by a user is much better.

The CM-5 network interface is implemented in large measure by a single 1-micron standard-cell CMOS chip, with custom macro cells to provide high-performance circuits where needed. The interface chip is clocked by both the 32-megahertz processor clock and the 40-megahertz networks clock. Asynchronous arbiters synchronize the processor side of the interface with the network side. Choosing to build a separate network interface allowed the processor designers to do their jobs and the network designers to do theirs with a minimum of interference. As a measure of its success in decoupling the networks from the processor organization, the same interface chip is used to interface the network to I/O channels, of which there are many types, including CMIO, VME, FDDI, and HIPPI.
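To make the memory-mapped FIFO paradigm of Section 2.1 concrete, the following is a minimal sketch of how user code might push a message into an outgoing FIFO and poll for arrivals. The register addresses, field layout, and function names here are illustrative assumptions for exposition, not the actual CM-5 network-interface definitions.

```c
#include <stdint.h>

/* Hypothetical register addresses; the real CM-5 interface defines its own map. */
#define NI_SEND_FIFO   ((volatile uint32_t *) 0x20000000)  /* outgoing data FIFO      */
#define NI_SEND_OK     ((volatile uint32_t *) 0x20000004)  /* 1 if last send accepted */
#define NI_RECV_STATUS ((volatile uint32_t *) 0x20000008)  /* 1 if a message arrived  */
#define NI_RECV_FIFO   ((volatile uint32_t *) 0x2000000c)  /* incoming data FIFO      */

/* Try to send a one-word message; returns 0 if flow control refused it. */
static int try_send(uint32_t dest, uint32_t word)
{
    *NI_SEND_FIFO = dest;          /* destination (relative) processor address */
    *NI_SEND_FIFO = word;          /* payload word                             */
    return *NI_SEND_OK;            /* did the network accept the message?      */
}

/* Poll for an arrived message; returns 1 and fills *word if one is present. */
static int try_receive(uint32_t *word)
{
    if (!*NI_RECV_STATUS)
        return 0;
    *word = *NI_RECV_FIFO;
    return 1;
}
```

Because the registers live on ordinary, protectable pages, the operating system can grant or deny a user access to a particular FIFO simply by mapping or unmapping the corresponding page.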

2.2 The CM-5 Data Network

The basic architecture of the CM-5 data network is a fat-tree [GL89, Lei85]. Figure 2-2 shows a binary fat-tree. Unlike a computer scientist's traditional notion of a tree, a fat-tree is more like a real tree in that it gets thicker further from the leaves. Processing nodes, control processors, and I/O channels are located at the leaves of the fat-tree. (For convenience, we shall refer to all of these network addresses simply as processors.)

A user partition corresponds to a subtree in the network. Messages local to a given partition are routed within the partition's subtree, thereby requiring no bandwidth higher in the tree. Access to shared system resources, such as I/O, is accomplished through the part of the fat-tree not devoted to any partition. Thus, message traffic within a partition, between a partition and an I/O device, or between I/O devices does not affect traffic within any other partition. Moreover, since I/O channels can be addressed just like processing nodes, the data network becomes a true "system bus" in which all system components have a unique physical address in a single, uniform name-space.

Figure 2-3: The interconnection pattern of the CM-5 data network. The network is a 4-ary fat-tree in which each internal node is made up of several router chips. Each router chip is connected to 4 child chips and either 2 or 4 parent chips.

Of critical importance to the performance of a fat-tree routing network is the communication bandwidth between nodes of the fat-tree. Most networks that have been proposed for parallel processing, such as meshes and hypercubes, are inflexible when it comes to adapting their topologies to the arbitrary bandwidths provided by packaging technology. The bandwidths between nodes in a fat-tree, however, are not constrained to follow a prescribed mathematical formula. A fat-tree can be adapted to effectively utilize whatever bandwidths make engineering sense in terms of cost and performance. No matter how the bandwidths of the fat-tree are chosen, provably effective routing algorithms exist [GL89, LMR88] to route messages near-optimally. The underlying architecture and mechanism for addressing is not affected by communication bandwidths: to route a message from one processor to another, the message is sent up the tree to the least common ancestor of the two processors, and then down to the destination.

Because of various implementation trade-offs (including the number of pins per chip, the number of wires per cable, and the maximum cable length), we designed the CM-5 data network using a 4-ary fat-tree, rather than a binary fat-tree. Figure 2-3 shows the interconnection pattern. The network is composed of router chips, each with 4 child connections and either 2 or 4 parent connections. Each connection provides a link to another chip with a raw bandwidth of 20 megabytes/second in each direction. (Some of this bandwidth is devoted to addressing, tags, error checking, etc.) By selecting at each level of the tree whether 2 or 4 parent links are used, the bandwidths between nodes in the fat-tree can be adjusted. Flow control is provided on every link.

Based on technology, packaging, and cost considerations, the CM-5 bandwidths were chosen as follows. Each processor has 2 connections to the data network, corresponding to a raw bandwidth of 40 megabytes/second in and out of each processing node. In the first two levels, each router chip uses only 2 parent connections to the next higher level, yielding an aggregate bandwidth of 160 megabytes/second out of a subtree with 16 processing nodes. All router chips higher than the second level use all 4 parent connections, which, for example, yields an aggregate bandwidth of 10 gigabytes/second, in each direction, from one half of a 2K-node system to the other. The bandwidth continues to scale linearly up to 16,384 nodes, the largest machine that Thinking Machines can currently build. (The architecture itself scales to over one million nodes.) In larger machines, transmission-line techniques are used to pipeline bits across long wires, thereby overcoming the bandwidth limitation that would otherwise be imposed by wire latency.

The machine is designed so that network bandwidth can be enhanced in future product revisions without affecting the architecture.

The network design provides many comparable paths for a message to take from a source processor to a destination processor. As it goes up the tree, a message may have several choices as to which parent connection to take. This decision is resolved by pseudorandomly selecting from among those links that are unobstructed by other messages. After the message has attained the height of the least common ancestor of the source and destination processors, it takes the single available path of links from that chip down to its destination. The pseudorandom choice at each level balances the load on the network and avoids undue congestion caused by pathological message sets. (Many naive algorithms for routing on mesh and hypercubic networks suffer from having specific message patterns that do not perform well, and the user is left to program around them.) The CM-5 data network routes all message sets nearly as well as the chosen bandwidths allow.

A consequence of the automatic load balancing within the data network is that users can program the network in a straightforward manner and obtain high performance. Moreover, an accurate estimate of the performance of routing a set of messages through the network can be predicted by using a relatively simple model [LM88]. One determines the load of messages passing through each arm of the fat-tree and divides this value by the available bandwidth. The worst-case such ratio, over all arms of the fat-tree, provides the estimate. (A sketch of this estimate appears below.) On random permutations, each processor can provide data into, and out of, the network at a rate in excess of 4 megabytes/second. When the communication pattern is more local, such as nearest neighbor within a regular or irregular two- or three-dimensional grid, bandwidths of 15 megabytes/second per processor are achievable. The network latency ranges between 3 and 7 microseconds, depending on the size of the machine. All of these empirical values include the time required for processors to execute the instructions needed to put messages into and take messages out of the network.

The data network is currently implemented from 1-micron standard-cell CMOS chips, with custom macro cells to provide high-performance circuits where needed. Each chip has an 8-bit-wide bidirectional link (4 bits of data in each direction) to each of its 4 child chips lower in the fat-tree, and 4 8-bit-wide bidirectional links to its parent chips higher in the fat-tree. The data-network chip can be viewed as a crossbar connecting the 8 input ports to the 8 output ports, but certain input/output connections are impossible due to the nature of the routing algorithm. For example, we never route a message from one parent port to another. When a message is blocked from its desired output port, it is buffered. Flow control information is passed in the reverse direction of message traffic to prevent buffer overflow. When multiple messages compete for the same output port, the arbitration is fair and prevents any link from being starved. We designed only one chip to do message routing, and we use the same chip for communication between chips on the same circuit board as between chips that are in different cabinets.
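The following is a minimal sketch of the load-based estimate described above: for each arm of the fat-tree one tallies the traffic crossing it, divides by that arm's bandwidth, and takes the worst-case ratio as the predicted routing time. The traffic and bandwidth arrays, and how loads are attributed to arms, are assumptions made for illustration; they are not taken from the CM-5 system software.

```c
#include <stddef.h>

/*
 * Estimate the time to route a message set through a fat-tree.
 * bytes_through_arm[i] is the total message traffic (in bytes) that must
 * cross arm i, and bandwidth_of_arm[i] is that arm's capacity in bytes
 * per second.  The estimate is the worst-case load/bandwidth ratio.
 */
double estimate_routing_time(const double *bytes_through_arm,
                             const double *bandwidth_of_arm,
                             size_t num_arms)
{
    double worst = 0.0;
    for (size_t i = 0; i < num_arms; i++) {
        double ratio = bytes_through_arm[i] / bandwidth_of_arm[i];
        if (ratio > worst)
            worst = ratio;
    }
    return worst;   /* seconds, ignoring per-message overhead and latency */
}
```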
Interchip data is sent on differential pairs of wires, which increases the pin count of the chips, but which provides outstanding noise immunity and reduces overall power requirements.2 We rejected using separate transceivers at the packaging boundaries, because it would have increased power consumption, board real estate, and the number of different chips we would have needed to design, debug, test, stock, etc. The diagnostics can independently test each conductor of each differential signal, because differential signals are so immune to noise that they sometimes work even with broken wires.

2 Our differential drivers and receivers are relatively straightforward. A more complex design that includes self-terminating transceivers is described by T. Knight and A. Krymm in [KK88].

Figure 2-4: The format of messages in the data network. Each message contains routing instructions, a length field that indicates how many data words are in the message, a tag field that indexes an interrupt vector in the processor, data words, and a cyclic redundancy check.

The first 2 levels of the data network are routed through backplanes. The wires on higher levels are run through cables, which can be either 9 or 26 feet in length. The longer cables maintain multiple bits in transit. The wires in cables are coated with expanded Teflon, which has a very low dielectric constant. The cables reliably carry signals in excess of 90 percent of the speed of light.

The data network chips are clocked synchronously by a 40-megahertz clock. The clock is distributed with very low skew, even for the biggest machines, by locally generating individual clocks and adjusting their phases to be synchronous with a centrally broadcast clocking signal.3 The local generation of clocks also protects the machine from a catastrophic single point of failure; without the redundancy, a single central clock could fail, causing the entire machine's power consumption to drop to nearly zero in under one microsecond, possibly damaging the power distribution system.

Messages routed by the data network are formatted as shown in Figure 2-4. The beginning of the message contains routing instructions that tell how high the message is to go in the tree and then the path it is to follow downward after it reaches its zenith. The routing instructions are chip-relative instructions that allow each chip to make a simple, local decision on how to route the message. Following the routing instructions is a field that indicates the length of the data in 32-bit words. Currently, the CM-5 network interface allows between 1 and 5 words. Longer messages must be broken into smaller pieces. Following the length field is a 4-bit tag field that can be used to distinguish among various kinds of messages in the system. The network interface interprets some of these tags as system messages, and the rest are available to the user. When a message arrives at a processor, the tag indexes a 16-bit mask register in the network interface, and if the corresponding mask bit is 1, the processor is interrupted. After the tag comes the data itself, and then a field that provides an integrity check of the message using a cyclic redundancy code (CRC).

Because we desired to build very large machines, we deemed it essential to monitor and verify the data network dynamically, because the chances of a component failure increase with the size of the system. Message integrity is checked on every link and through every switch. If a message is found to be corrupted, an error is signaled. Messages snake their way through the switches in a manner similar to cut-through [KK79] or worm-hole [Dal87, DS87] routing, and so by the time that a data-network chip has detected an error, the head of the message may have traveled far away. To avoid an avalanche of errors, the complement of a proper CRC is appended to the message. Any chip that discovers the complement of a proper CRC signals a secondary error. Thus, a typical error causes one chip to signal a primary error with a trail of chips reporting secondary errors, although there is some positive probability that a primary error is reported as a secondary error. Diagnostic programs can easily isolate the faulty chip or link based on this information, which is accessible through the diagnostic network. Lost and replicated messages can be detected by counters on each chip and in the network interfaces that maintain the number of messages that pass on each link.
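As a rough illustration of the message format in Figure 2-4, the structure below models the fields a sender supplies before the interface serializes them. The field widths follow the text (1 to 5 data words, a 4-bit tag); the struct layout, type names, and the interrupt-mask helper are assumptions made for exposition, not the hardware encoding.

```c
#include <stdint.h>

#define DN_MAX_WORDS 5          /* the interface currently allows 1 to 5 data words */

/* Host-side picture of one data-network message (not the wire encoding). */
struct dn_message {
    uint32_t routing;            /* chip-relative up/down routing instructions */
    uint8_t  length;             /* number of 32-bit data words, 1..5          */
    uint8_t  tag;                /* 4-bit tag; indexes the interrupt mask      */
    uint32_t data[DN_MAX_WORDS]; /* payload words                              */
    uint32_t crc;                /* cyclic redundancy check over the message   */
};

/* Hypothetical check: should this arriving message interrupt the processor? */
static int tag_interrupts(const struct dn_message *m, uint16_t interrupt_mask)
{
    return (interrupt_mask >> (m->tag & 0xf)) & 1;
}
```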

3 More information about our clocking technique can be found in [HAK*]. A related clocking strategy, in which skew is compensated for by adjusting the phase of each data signal, is described by P. Bassett et al. in [BGR86].

Using a variation on Kirchhoff's current law, the number of messages entering any region of the network, including the entire network or a single chip, must eventually equal the number of messages leaving the region. This condition is checked for the entire data network by the control network (see Section 2.3).

Once a faulty processor node, network chip, or interconnection link has been identified, the fault is mapped out of the system and quarantined. The network interface allows for mapping faulty processing nodes out of the network address space. The rest of the system ignores all signals from the mapped-out portion, thereby allowing the system to remain functional while servicing and testing, or even powering down, the mapped-out portion. When a chip or link in the data network fails, there are two mechanisms to map around the fault. Either the network can be configured to route messages away from the failure, or processing nodes that might use the chip or link can be mapped out. By picking the better of the two alternatives, the system can guarantee either that at most 6 percent of the network is lost or that at most 1/64 of the processing nodes are mapped out.

The network has a contract with the processors that guarantees all messages are delivered:

Contract: The data network promises to eventually accept and deliver all messages injected into the network by the processors, as long as the processors promise to eventually eject all messages from the network when they are delivered to the processors.

The data network is acyclic from inputs to outputs, which precludes deadlock from occurring if this contract is obeyed. To send a message, a processor writes the destination processor address and the data to be sent to a memory-mapped outgoing FIFO in its network interface. The processor then checks whether the message was accepted by the network. If not, which may occur because flow control information indicates that the network has not removed enough of a previous message from the outgoing FIFO, the processor can try again later. The processor may not block or spin when attempting to put a message into the network, however, because that would violate the contract. Instead, the processor must attempt to receive any messages that have arrived. In the current implementation, the processor is involved in all transactions with the network.

Although the simple contract above can implement the sending of data through the network in a deadlock-free manner, it is not strong enough to allow some communication protocols to be implemented straightforwardly. Consider, for example, the fetch-deadlock problem: each processor wishes to fetch a value from another processor, and the processors have finite buffer space. The message traffic for a protocol that solves this problem corresponds to a round trip in the network: a request from one processor to another, followed by a response from the other to the one. In this scenario, one processor may receive requests for data from many processors, but unfortunately, be unable to send responses because its outgoing FIFO to the data network is busy. The outgoing FIFO will eventually free, according to the contract, but only if the processor continues to accept delivery of messages from the network. With finite buffer space, however, there is a limit to how many requests it can handle. When it runs out of buffer space, the processor will be forced to refuse delivery, thereby breaking the contract, and deadlock may result.

With buffer space proportional to the number of processors in the system, it is possible to construct a "round-trip" protocol that solves the fetch-deadlock problem. The key idea is to program a reservation mechanism [Kle78] that ensures that at most a bounded number of messages are outstanding between any two processors at any time. A processor X does not attempt to send a message to another processor Y until Y informs X that it has room to handle the message.

This protocol, which has been implemented on some parallel computing systems, including the CM-5, requires a substantial software overhead for bookkeeping. The CM-5, however, provides another way to solve the fetch-deadlock problem in a simple fashion requiring no bookkeeping and only constant buffer space. Each processor has 2 outgoing and 2 incoming FIFO's in its interface to the data network: a left port and a right port. The topology of the network is such that all links reachable from the left port are unreachable from the right port and vice versa. Thus, the data network is really two independent, interleaved networks. To implement the round-trip protocol, requests can be sent on the left side of the network, and responses returned on the right side. If a processor cannot send a response on the right side and his constant-size buffer is full, he stops receiving on the left side. Since any processor requesting data has a place to put it, however, the processors can satisfy the contract on the right side, and the responses will eventually clear out. Because the responses on the right side will eventually clear out, a processor can always eventually accept every request that arrives on the left side, and thus the processors satisfy the contract on the left side. Consequently, deadlock cannot occur. In fact, deadlock cannot occur even if responses are sent on both sides of the data network, as long as requests are sent on one side only. The data network requires no more than two sides, even when there are many intermediate destinations, because such a communication pattern can be broken into a collection of round trips.

The CM-5 programming systems (CM-Fortran, C*, and *Lisp) never allow a user to deadlock, because they implement deadlock-free protocols for communication. Deadlock can occur, however, if a programmer chooses to program the individual processing nodes directly. All he need do is break the contract that the processing nodes have with the data network: he writes code that sends messages but never attempts to receive them. This danger may seem quite alarming, but it is no more alarming than the danger that a user writes an infinite loop. On the CM-5, the user can send and receive messages without executing a system call, as is required on many other systems. By giving the user direct access to the network, the user can in some circumstances obtain greater efficiency than he could obtain with the communication routines available in the standard system libraries. If he does deadlock himself, or write an infinite loop, he does not affect any other user.

Each user partition in the CM-5 system is capable of being run in either a batch or a timesharing mode. The requirement for timesharing raises the issue of what should be done with messages that are in transit in the routing network when a user's timeslice has expired and another user must be given access to the partition. The system cannot afford to wait until the user completes his communication, since the communication may not terminate for a very long time, and, in fact, it may never complete if the user has deadlocked himself. We considered several solutions to the problem of swapping users. For example, we considered entering a special routine that would pull messages out of the router and discard them. This solution was considered too expensive, because the user would be constantly forced to checkpoint the computation so that the discarded messages could be reconstructed.
Moreover, if the user fills the network with messages that are all addressed to the same processing node, then the time to empty the router would be proportional to the machine size, which was deemed unacceptably long. This problem of swapping users is solved in the CM-5 by putting the data network into all-fall-down mode. Instead of trying to route messages to their destinations, when a data-network chip is in all-fall-down mode, each message is routed downward according to a fixed permutation that has been preprogrammed by the system and which ensures that all-fall-down messages are distributed evenly among the processing nodes. In the worst case, each node receives only a small number of misdirected messages, even if all messages were headed for the same destination processor. The all-fall-down messages can then be saved in memory with the user's state. When the user's task is resumed, the system resends them to their true destinations. Even if a timeshared user deadlocks, this context-switching mechanism precludes him from unduly affecting the other users who are sharing his partition.

During our design of the all-fall-down mechanism, the problem arose of how to set all chips of a partition into this mode. We considered engineering a mechanism in which all chips were put into all-fall-down mode simultaneously, but even so, we considered it a lot of detailed engineering work to guarantee that messages between chips when all-fall-down was initiated would be handled properly. Instead, we adopted a simple protocol to allow chips in the data network to be put into all-fall-down mode in any order. The basic idea is that all-fall-down messages are marked as such. When a chip sees an all-fall-down message, it routes it downward according to the preprogrammed permutation, even if the chip is not in all-fall-down mode and is routing other messages in a normal fashion. Thus, once a message starts falling, it keeps falling and is never interpreted by any chip as anything but an all-fall-down message.

In summary, the CM-5 data network provides fast point-to-point communication of data, but just as importantly, it provides flexible solutions to system problems.
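To illustrate the contract and the left/right round-trip discipline described in this section, here is a minimal sketch of a node's request/response loop. It assumes hypothetical helper functions (try_send_left, try_send_right, poll_left, poll_right, lookup_value) standing in for memory-mapped FIFO accesses; none of these names come from the CM-5 libraries, and the sketch further assumes the node has at most one of its own fetches outstanding, so anything arriving on the right side is its response. The essential points are that the sender never spins without draining its incoming FIFO's, and that requests travel on one side of the network while responses travel on the other.

```c
#include <stdint.h>

#define PENDING_MAX 8   /* constant-size response buffer, per the text */

/* Hypothetical non-blocking FIFO primitives; none are CM-5 library calls. */
extern int try_send_left (uint32_t dest, uint32_t word);   /* requests            */
extern int try_send_right(uint32_t dest, uint32_t word);   /* responses           */
extern int poll_left (uint32_t *src, uint32_t *word);      /* incoming requests   */
extern int poll_right(uint32_t *src, uint32_t *word);      /* incoming responses  */
extern uint32_t lookup_value(uint32_t key);                 /* local table lookup  */

static struct { uint32_t dest, value; } pending[PENDING_MAX];
static int npending = 0;

/* Make forward progress without ever blocking: first push out any queued
 * responses on the right side; then, if the queue has room, accept one new
 * request from the left side and queue its response. */
static void service_network(void)
{
    uint32_t src, word;

    while (npending > 0 &&
           try_send_right(pending[npending - 1].dest, pending[npending - 1].value))
        npending--;

    if (npending < PENDING_MAX && poll_left(&src, &word)) {
        pending[npending].dest  = src;
        pending[npending].value = lookup_value(word);
        npending++;
    }
}

/* Fetch one remote value while honoring the network contract. */
uint32_t remote_fetch(uint32_t target, uint32_t key)
{
    uint32_t src, value;

    while (!try_send_left(target, key))   /* request goes out on the left side   */
        service_network();                /* never spin: keep draining arrivals  */

    while (!poll_right(&src, &value))     /* response comes back on the right    */
        service_network();

    return value;
}
```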

2.3 The CM-5 Control Network

There are two general classes of operations on the control network: broadcasting and combining. Separate FIFO's in the network interface correspond to each type of control-network function. A processor pushes a message into one of the outgoing FIFO's, and shortly after all processors have pushed messages, the result becomes available to all processors as messages in their respective incoming FIFO's. Every operation on the control network potentially involves every processing node. Broadcast messages from the control processor are replicated at nodes in the tree and distributed to the subtrees. Other operations, such as scans (parallel prefix), require input from all processors and provide output to all processors. The control network is pipelined, so that several messages can be sent before any are received. To provide further flexibility, each processing node can set up the network interface to abstain from certain control-network operations. These operations complete as if the abstaining processors had provided "identity" data, but without making them waste processing cycles. Overall, the control network is designed to support cooperative functions that require little bisection bandwidth, and which can therefore be implemented efficiently on a simple tree.

Broadcasting

A processor may broadcast a message through the control network to all other processors in its partition. The control network supports four kinds of broadcasting: user broadcast, supervisor broadcast, interrupt broadcast, and utility broadcast. User and supervisor broadcasts are essentially identical, except that supervisor broadcasts are privileged operations. These broadcast operations can be used to download code and to distribute data. An interrupt broadcast is a privileged operation that causes every processor to receive an interrupt. Interrupt broadcasts provide the ability to "grab the attention" of all processors in the user partition, which is especially useful for implementing operating system functions, such as swapping timeshared users. The utility broadcast is used by the operating system to configure partitions and to perform other sorts of system operations.

Only one processor may broadcast at a time, but broadcasts are pipelined so that the broadcasting processor can fully utilize the broadcast bandwidth of the network. If, while one processor is broadcasting, another processor sends a broadcast message, the control network signals an error when the competing messages collide. The number of simultaneous pipelined broadcasts supported by the control network depends upon the height of the network partition. The current implementation of the CM-5 provides the user with up to 8 words in a broadcast and the supervisor with up to 4 words.

Combining

The control network supports four different types of combining operations: reduction, forward scan (parallel prefix), backward scan (parallel suffix), and router done. Moreover, the network interface chip is capable of masking out processors that do not wish to participate in a control-network operation, so that operations can be performed only on a subset of the processors in a partition. Only one combining operation can be initiated at a time, but the network is pipelined, which allows several operations to be initiated rapidly in sequence.

A reduction operation combines values provided by all (participating) processors according to a user-supplied operator and delivers a copy of the result to all processors. Messages are combined with one of five operators on 32-bit words: bitwise logical OR, bitwise logical XOR, signed maximum (which also works for IEEE floating-point numbers), signed addition, and unsigned addition. (The two addition operators differ in how overflow is reported.) Reductions over other commonly occurring operators (such as bitwise logical AND) can be easily synthesized from these and local processor operations. The control network also supports reductions on values larger than 32 bits by a sequence of 32-bit reductions, each of which saves residual data, such as a carry in the case of addition, which is input to the next reduction in the sequence. Since the control network is pipelined, the latency for a multiple-word reduction operation is not unduly affected.
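As a small illustration of synthesizing a missing operator from the five provided ones, the sketch below builds an AND reduction from an OR reduction using DeMorgan's law. The reduce_or wrapper is a hypothetical stand-in for the hardware reduction, not a CM-5 library call; only the local complement operations are real C.

```c
#include <stdint.h>

/* Hypothetical wrapper for the hardware OR reduction: every participating
 * processor contributes one word, and all receive the bitwise OR of them. */
extern uint32_t reduce_or(uint32_t contribution);

/* AND reduction synthesized from OR by DeMorgan's law: AND(x_i) = ~OR(~x_i).
 * Each processor complements its contribution locally, the network ORs the
 * complements, and each processor complements the result it receives. */
uint32_t reduce_and(uint32_t contribution)
{
    return ~reduce_or(~contribution);
}
```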

A forward scan operation delivers to the ith processor the result of applying one of the five reduction operators to the values in the preceding i-1 processors (in the linear order given by data network address). For example, a forward scan of the vector ⟨3, 2, 0, 4, 2, 6, 5, 8⟩ with the operator + yields the vector ⟨0, 3, 5, 5, 9, 11, 17, 22⟩. A backward scan provides similar functionality in the reverse direction. Scans can be segmented: if a "segment start" bit in the network interface is set, the scan starts over at that point. Backward scans are also supported. All basic scan operations use 1-word (32-bit) inputs, but multiple-word scans are supported by a sequence of 1-word scans in a manner similar to multiple-word reductions. An excellent discussion of scans can be found in [Ble90].

Early on in the design of the CM-5, we decided to support scans in hardware. Our experience with the CM-2 showed that many high-performance data-parallel algorithms, including both combinatorial and numerical algorithms, make extensive use of scans. The operations that were selected (OR, XOR, etc.) reflect a compromise between making the hardware fast and simple and providing sufficient building blocks out of which other operations could be constructed. For instance, OR can be used to implement AND (DeMorgan's law), so there is no need to implement both. As a more sophisticated example, segmented reductions, which are not provided directly by the hardware, can be implemented by using two segmented scans, one forward and one backward. Since the control network is pipelined, the overhead of doing both is minimal.

The router-done operation is a specialized reduction that lets the processors know when communications involving the data network are complete. In the data-parallel programming model, this operation is often required so that processors know when it is safe to proceed to the next data-parallel operation. The basic idea behind the implementation of router-done is "Kirchhoff's current law." When all processors have completed sending their messages and the number of messages that entered the data network equals the number that have left, the routing cycle is complete. The network interfaces keep track of the number of messages that enter and leave the data network. After a processor has completed sending all its messages, it pushes a message into the outgoing router-done FIFO. When all processors have sent messages into their outgoing FIFO's, the control network continually monitors the difference between the total number of messages put into the data network and the number removed from the data network. When this number becomes zero, each processor receives a message in its incoming router-done FIFO informing it that the data network is done routing messages.

There are several other software and hardware approaches to computing a router-done barrier. A software solution involves sending an acknowledgment for every delivered message. Each processor enters a normal, processor-only, barrier only when it has received acknowledgments for all the messages it sent. Such a scheme is expensive, since it doubles the network traffic. The Monsoon dataflow processor employs a network that provides a hardware acknowledgment for every message [PC90].4 The Connection Machine CM-1 [Hil85] uses a global-or circuit wired into all the network chips to determine when the network is empty. For the CM-5, we considered a strategy that uses special tokens to "sweep" the network clean. Using this "Kirchhoff" method has the additional benefit that if a hardware error causes messages to be lost or created, the error can be detected and signaled, either by a failure of the router-done operation to complete on the one hand or by the unexpected arrival of a message after the router-done operation has completed on the other.

The CM-5 control network also supports one synchronous OR operation and two identical asynchronous OR operations that can operate in parallel with other network operations, and have separate FIFO's in the network interface. The synchronous OR is similar to an OR reduction, except that a processor's input and output each consist of only a single bit. Each asynchronous OR operates continuously without waiting for all processors to participate. Processors are free to change their inputs at any time and sample the output. The asynchronous OR can be used for signaling conditions and exceptions. The transition of an asynchronous OR from 0 to 1 can be used to signal an interrupt. One of the two asynchronous OR's is privileged, and the other is nonprivileged.

The synchronous OR or any of the various combining operations can be used to implement split-phase barrier synchronization [TMC88]. (In independent work [Gup89], this type of synchronization has been called a fuzzy barrier.) In a split-phase barrier, the barrier is a region of code with an entry and an exit. (If the region is empty, an ordinary barrier results.) When a processor enters the split-phase barrier, it pushes an input message into an appropriate outgoing FIFO. Shortly after all other processors have pushed their messages, they all receive messages from the corresponding incoming FIFO, and each can infer that all have entered the barrier. The advantage of a split-phase barrier over an ordinary barrier is that the processor can execute code while waiting for the barrier to complete. Thus, just as the instruction following a delayed branch in a RISC architecture can compensate for the latency of the branch, the code between barrier entry and exit can compensate for the latency of synchronization. The router-done operation couples barrier synchronization with the test of whether routing on the data network has completed, so that no processor abandons its effort to receive messages until all processors are done sending them.
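The following sketch shows, in ordinary C, the value a forward (exclusive) scan with + delivers to each position; it reproduces the eight-element example above. It models only the semantics of the operation, not the pipelined tree computation the control network actually performs.

```c
#include <stdio.h>

/* Forward scan with +: out[i] receives the sum of in[0..i-1]; out[0] gets
 * the identity (0 for addition), matching the control network's semantics. */
static void forward_scan_add(const int *in, int *out, int n)
{
    int running = 0;
    for (int i = 0; i < n; i++) {
        out[i] = running;
        running += in[i];
    }
}

int main(void)
{
    int in[8] = {3, 2, 0, 4, 2, 6, 5, 8};
    int out[8];

    forward_scan_add(in, out, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);      /* prints: 0 3 5 5 9 11 17 22 */
    printf("\n");
    return 0;
}
```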
The control network also detects certain kinds of communication errors and distributes them throughout the system. For example, if two processors attempt to perform different combining operations, an error is signaled. More importantly, hard errors detected by the data network and the network interfaces are collected by the control network. These error signals are combined using a logical OR and are redistributed to all the processors so that the operating system can isolate them and recover if possible.

4 In dataflow machines, barriers must be programmed in software. A. Shaw et al. found that on the EM-4 dataflow machine, barriers could be effectively programmed in software [SKS*92]. Software barriers suffer from unpredictable performance, however. Depending on the other traffic in the data network, software barriers take different amounts of time. It might be difficult to use software barriers for the kind of flow control we study in Chapter 3.

Figure 2-5: The format of messages in the control network. Each message contains a field that indicates the type of message, a 32-bit word of data, some synchronization bits, and various other flags. The message is checked using a cyclic redundancy code.

Organization of the control network

The architecture of the control network is that of a complete binary tree with processing nodes, control processors, and I/O channels at the leaves. When a CM-5 system is configured, each user partition is assigned to a subtree of the network. Processing nodes are located at the leaves of the subtree, and a control processor is mapped into the partition as an additional leaf.

The control network is implemented using a 1-micron CMOS standard-cell chip that contains custom macro cells to implement high-performance circuitry. Like the data network chip, it uses a 40-megahertz clock. Three binary-tree nodes are packaged on each chip. There are 4 11-bit-wide bidirectional links (6 bits in the up direction and 5 bits in the down direction) to 4 child chips lower in the tree and 1 11-bit-wide bidirectional link to a parent. As in the data network, interchip signals are sent on differential pairs of wires.

Unlike data network packets, control network packets have a fixed length of 65 bits. (There is actually, in addition, a 5-bit packet used during system initialization to align the 65-bit packet boundaries so that a node can process the same fields in arriving messages at the same time.) The general format is illustrated in Figure 2-5. It is broken into two parallel streams, a major stream and a minor stream. The minor stream contains a variety of control bits, including various error and status flags, several flow-control bits, and a bit to implement segmented scans. The major stream begins with a packet description field, which defines the packet type (single-source, multiple-source, idle, or abstain) as well as the specific operation (user broadcast, supervisor broadcast, interrupt, scan (including the combiner), reduce, etc.). Then comes a 32-bit word of data. The major stream ends with a field containing the global synchronization bits. The entire packet is checked using a cyclic redundancy code (CRC), which is the last information in the packet to be transmitted.

The four packet types are processed differently by the control network. Whereas single-source packets are used to implement broadcasting, scans and reductions employ multiple-source packets. Idle packets are used as "filler" and are sent when a control network node has nothing better to ship. The abstain packet allows a control network node to proceed when it would otherwise wait for a multiple-source packet.

When a processor initiates a broadcast or interrupt through the control network, its network interface inserts a single-source message into the tree at a leaf. This message proceeds up to the root node of the user's tree, where it is turned around and distributed to all the processors in the partition. An error is signaled if two single-source packets from different sources meet at a control network node. If it meets with other kinds of packets, a single-source packet has priority. There is no buffering for single-source packets. Flow control for single-source packets is implemented by the network interface on an end-to-end basis.

Processing multiple-source packets is more involved. When a processor initiates a cooperative operation such as a scan, the network interface inserts a multiple-source message into the tree. At each internal node, a multiple-source message waits until its sibling's message has arrived. While a message is waiting, the node sends idle messages up the tree. When the sibling's message arrives, arithmetic or logical operations combine the two messages into one, which is sent up the tree. To implement scans, the message or its sibling may be put aside in another buffer to combine later with a value coming from the node's parent. When a multiple-source message finally reaches the root, it is sent downward. As it reencounters the internal nodes of the tree, it is replicated or further combined with waiting messages. (A good overview of the implementation of scans can be found in [Ble90].) While a multiple-source packet is waiting for a sibling or a parent, other packets arriving on the same input can be processed. If the newly arriving packet is a single-source packet, it proceeds ahead of the waiting packet, thereby giving priority, for example, to supervisor broadcasts and interrupts. If the new packet is another multiple-source packet, it is queued in the buffer behind the packets already waiting. Multiple-source packets thus maintain a consistent order, which allows two or more combining operations on the control network to be pipelined properly. Flow control in the network precludes buffers from overflowing.

An important requirement of the control network was that it be able to connect a control processor to each user partition. The control processor executes the scalar part of the data-parallel code, while the processing nodes execute the parallel part. We considered having scalar code executed by one or all of the processing nodes, but eventually decided that having a control processor associated with each partition would simplify matters. First, since the system cost of the control processor is very low compared with the multitude of processing nodes, we can afford to run it with large amounts of memory and with additional architectural features to enhance its performance. Consequently, the control processor is able to execute scalar code more efficiently than can a processing node. Second, the data-parallel code that runs on the earlier CM-2 machine is already split into scalar and parallel parts. Porting this code to the CM-5 was easier, since we could maintain the same split. Finally, since the control processor has a connection to an Ethernet, the user partition can run a standard Unix which communicates across the attached Ethernet.

At the end of a user's timeslice during timesharing, the control network can be flushed in a manner similar to a broadcast operation, aborting any user-level control-network operations in progress. The network interfaces retain the values that the user has pushed into the control network until the corresponding operation has completed, however. These values are saved as part of the user's state. When the user's task is resumed, the saved values can be used to reinitiate the control-network operations.

In case of a fault in a CM-5 processing node, network chip, or interconnection link, the control network, like the data network, can be configured to map the fault out of the system. The diagnostic network (see Section 2.4) can set internal switches within the control network to map out parts of the control network. Since the computations performed by the control network depend only on the control network being a binary tree, and not on its being a complete binary tree, computations within the control network can safely ignore the mapped-out portions of the system. In addition, the control network has some additional switching capability to map around faults in the control network itself and to be able to connect any of the control processors to any partition. This additional switching capability is implemented as follows. Conceptually, each switch of the control network has 2 parents and 4 children and contains two binary-tree nodes which can be statically configured so that either can connect to any pair of children. By connecting these chips in a manner similar to the data network fat-tree, any control processor can be connected to any partition, subject to the availability of bandwidth. For example, if there are only 4 control-network channels into a subtree, one cannot connect 5 control processors to 5 partitions in the subtree.

Short of this bandwidth restriction, however, any connection of control processors to legal partitions can be implemented using an off-line routing algorithm similar to that in [Lei85, Theorem 1].

As an extension to the control-network functionality, one may wish to allow the synchronization of arbitrary subsets of processors rather than the entire set of processors. A control network that employs Ranade's algorithm [RBJ86] could support such an operation. According to standard VLSI complexity arguments (see, for example, Thompson [Tho83]), a network that can synchronize, in parallel, arbitrary subsets of processors must either run slowly or have a large amount of wiring. It is possible to design an inexpensive network that can quickly synchronize any contiguous set of processors in the linear ordering of the processors, however. Such a feature could be used to extend the data-parallel programming model, by allowing one to create an array of data-parallel engines, each of which works on a subproblem, and then to reform the array into one big data-parallel engine to proceed with the top-level problem.5 There is no clear evidence that hardware support for such a programming model is important, and the CM-5 control network does not support the synchronization of subsets of processors.

In the CM-5, each partition has a distinguished front-end processor. Conceivably, the partition could do without the front-end processor. The distinguishing characteristic of the front-end processor is that it has more memory, a better serial I/O system, and a faster processor than the rest of the processors. It is often convenient to have this more-powerful processor take on a special role in controlling the computation by, for example, using the control network to broadcast the state of the computation or using the large memory of the front-end to hold large, infrequently accessed data. A front-end on the CM-5 is especially useful for running software that was written for other machines that had front-ends. For example, the StarTech chess program (described in Chapter 4), which is based on H. Berliner's Hitech program, uses the CM-5 front-end to run the code that runs on Hitech's front-end processor.

In summary, the CM-5 control network provides the mechanisms to allow data-parallel code to be executed efficiently, as well as allowing more general kinds of parallel models to be implemented. Its structure as a binary tree provides an inexpensive way to combine the advantages of both traditional SIMD and traditional MIMD architectures.

2.4 The CM-5 Diagnostic Network

During the design of the CM-5, great emphasis was placed on system availability. Despite conservative design techniques and the use of proven circuit and interconnect technologies, the sheer size of the largest CM-5 systems forced us to abandon any attempt to achieve high availability by depending solely on inherent component reliability. Instead, our strategy relies on two architectural features of the machine: diagnosability, which allows missing or broken hardware to be detected and isolated; and configurability, which allows most of the machine to operate when portions are broken or being serviced. This section shows how this strategy is implemented on the CM-5 through the use of a diagnostic network, the one network in the system that the user never sees.

One strategy to diagnose a parallel computer is to create diagnostic programs running on the processor nodes that exercise the processor nodes and various communications networks. When some part of the system fails to function correctly (for example, the data router fails to deliver a message or the control network produces the wrong answer for a combine operation), the diagnostic

5Sabot's paralation programming model [Sab88] may provide a way to program virtual data-parallel machines.

program itself may fail, because its correctness depends on the correct functioning of the system. We call such diagnostic programs functionality dependent. Our experience with the CM-1 and CM-2 exposed many of the limitations of functionality-dependent diagnostics. They are exceedingly difficult to write, they have nebulous coverage, and they lack precision in reporting the root cause of error conditions.

In contrast, diagnostics that are functionality independent rely on specific test structures, rather than the failure of normal system operation, to detect faults in the system. Using this kind of design-for-testability strategy, it becomes possible to view the CM-5 (or any sequential machine, for that matter) in terms of registers connected by logic and wires. This change in perspective permits commercially available software tools to be used to generate high-coverage tests automatically for chips, boards, and the wiring that connects them. Moreover, when these tests fail, they provide specific information on the location and extent of the failure.

In the CM-5, design for testability starts at the chip level. All CM-5 VLSI components support the IEEE 1149.1 testability architecture standard [IEEE90], also known as JTAG, for the Joint Test Action Group which originated the standard.6 At the system level, the CM-5 diagnostic network provides parallel access to all system components from a diagnostic processor. The JTAG standard and the diagnostic network combine to form a diagnostic system which can quickly perform an in-system check of the integrity (over 99 percent single stuck-at fault coverage) of all CM-5 chips that support the JTAG standard and all networks.

Let us briefly review the JTAG interface standard. The JTAG standard provides a 4-pin interface for each chip in a system. On each chip, two pins provide input and output, respectively, for a selectable scan chain within the chip.7 The standard specifies the boundary scan register (BSR), which connects all I/O pads in the chip into a bit-serial shift register. Two other pins serve as clock and control inputs. By scanning data in and out of chips, the BSR can be used to apply stimulus to the chip core for chip tests, or to monitor inputs and control outputs of the chip for connectivity tests.

In the CM-5, we extended the JTAG standard to include full internal scan in all proprietary chips. Details of this design are described in [ZH92]. The use of a full internal scan allows software for automatically generating test patterns to generate a set of scan vectors with very high fault coverage. The vectors can be applied through the JTAG interface to test individual chips when they are manufactured and packaged. Later, when the chips are assembled into a system, the same tests can be applied through the diagnostic network.

The JTAG interface is designed to extend to multichip systems. When more than one chip is incorporated in a system, the scan paths are linked together in series by connecting the output from one scan path to the input of the next in a daisy-chain fashion. The clock and control pins are connected in parallel so that these signals can be broadcast to all chips in the chain. Previous designs have focused on reducing the length of very long scan chains by placing scan-controllable bypass elements in the scan chain [TI90]. Unfortunately, testing all the chips in the system still requires serial access to each one.
Even with ideally short test times on the order of seconds per device, this method would be unacceptably slow for an entire 16,384-node CM-5 comprising many tens of thousands of devices. Moreover, this method fails to take advantage of the

6In the original implementation of the CM-5 architecture, neither the SPARC processor nodes nor the DRAM chips supported the JTAG interface. Given the growing acceptance of the JTAG standard, however, it is likely that off-the-shelf processors and memory will support the standard in the near future. The CM-5 architecture is designed to incorporate these JTAG-supporting chips when they become available. For example, in the CM-5E, announced in February 1994, the SuperSPARC processors do support the JTAG interface. 7This use of the term "scan" has nothing whatsoever to do with parallel prefix and suffix computations, as discussed in Section 2.3.

inherent parallelism that can be achieved by testing large numbers of identical system components. For these reasons, it was evident early on in the design of the CM-5 that we needed a parallel strategy for supporting scan-based diagnostics.

The CM-5 diagnostic network provides simple and reliable access to the system components of the CM-5. It provides scan access to all chips supporting the JTAG standard, and programmable ad hoc access to non-JTAG chips. The diagnostic network itself is completely testable and diagnosable. The diagnostic network is able to map out and ignore parts of the machine that are faulty or powered down. It can be partitioned consistently with user partitions. The network is able to select and access groups of system chips in parallel, including:

a single chip;

a single type of chip;

the chips within a user partition;

the chips associated with a geographical portion of the system, e.g., a given board, backplane, cabinet, etc.; and

unions and intersections of previously speci®ed sets of chips.

The diagnostic network is organized as a (not necessarily complete) binary tree, at the root of which sit one or more diagnostic processors, and at the leaves of which are pods. Each pod is a physical subsystem, such as a board, which directly supports the JTAG interface. At any given time, a single diagnostic processor controls the diagnostic network. From the root of the tree, an individual pod can be addressed by giving a binary number, each bit of which corresponds to a level

in the tree and specifies a path from the root to the leaf: bit i of the address specifies whether the addressed leaf is in the left or right subtree of the node at level i. If the height of the tree is h, then h bits are sufficient to specify any leaf.

The diagnostic network allows groups of pods to be addressed according to a "hypercube address" scheme. For a tree of height h, a diagnostic virtual address is an h-digit number in which each digit is a 0, 1, or B. The B ("both") digit is a "wild card" that matches both 0 and 1. For example, in a height-6 tree, the address 00B10B addresses the set {000100, 000101, 001100, 001101}, or {4, 5, 12, 13}. The addressing scheme can also be used to address the internal nodes of the diagnostic network by specifying addresses with fewer than h digits.

The decoding logic to implement the diagnostic virtual addressing scheme is based on the notion of steering "tokens" down the tree, as is illustrated in Figure 2-6. The mechanism works as follows. A token is inserted at the root of the tree together with a diagnostic virtual address, which is piped digit-serially into the root of the tree, high-order digit first. The root selects its right, its left, or both of its subtrees based on the high-order digit. If both subtrees are selected, the token splits into two tokens. Subsequent digits then steer the tokens, and the remaining digits of the address, down the selected paths. When the end of the address is encountered, the nodes holding tokens are considered to be selected, and nodes on paths from them to the root provide the conduit for control. Tokens and their paths from the root stay in place until a subsequent address erases them or until they are explicitly erased. This feature can be employed to combine two sets of selected nodes. For conceptual simplicity, suppose each of the two sets of nodes is in a separate subtree of the root. First, the left set is selected using a 0 as the high-order digit and pushing a token down the appropriate paths. Next, the right set is selected using a 1 as the high-order digit and pushing a token down the appropriate paths. The left set remains intact, but is temporarily inaccessible from the root because the right set is being selected. Finally, we push another token with an address of B to select the root itself and cause it to enable both its children, thereby merging the two sets. More complicated set unions are possible using this basic mechanism.

Figure 2-6: Steering a token down the diagnostic network. The address is decoded digit-serially, where each digit is 0, 1, or B, representing a selection of the left subtree, right subtree, or both subtrees, respectively. The example shows the selection made by the address 00B10B.
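To make the wild-card addressing concrete, the following sketch enumerates the leaf (pod) numbers selected by a diagnostic virtual address. The function name and the string representation of addresses are hypothetical illustrations, not part of the CM-5 diagnostic software; the sketch simply mirrors the digit-serial token-steering rule described above.

    #include <stdio.h>
    #include <string.h>

    /* Recursively steer a "token" down a binary tree of the given height.
     * addr is a string of '0', '1', and 'B' digits, high-order digit first.
     * prefix holds the leaf number accumulated so far.  At each level the
     * digit selects the left subtree (0), the right subtree (1), or both (B),
     * exactly as in the token-steering scheme of Figure 2-6. */
    static void enumerate(const char *addr, int height, unsigned prefix)
    {
        if (height == 0) {               /* reached a leaf: print its number */
            printf("%u ", prefix);
            return;
        }
        char d = *addr;
        if (d == '0' || d == 'B')
            enumerate(addr + 1, height - 1, prefix << 1);        /* left  */
        if (d == '1' || d == 'B')
            enumerate(addr + 1, height - 1, (prefix << 1) | 1);  /* right */
    }

    int main(void)
    {
        /* The example from the text: in a height-6 tree, 00B10B selects
         * leaves {000100, 000101, 001100, 001101} = {4, 5, 12, 13}. */
        const char *addr = "00B10B";
        enumerate(addr, (int)strlen(addr), 0);
        printf("\n");                    /* prints: 4 5 12 13 */
        return 0;
    }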

Most of this mechanism is hidden from the diagnostic engineer. Software extends the diagnostic virtual address within pods to address individual chips. Software also converts between the diagnostic network addresses and two other kinds of addresses: geographical addresses, which specify cabinets, backplane, slot type, slots, etc.; and network addresses, which give the locations of components according to the data and control networks' view of the machine. In general, important subsets of geographical addresses can be specified with one diagnostic virtual address. Important subsets of network addresses (for example, all data network chips at a given height in the machine, or all boards containing processing nodes in some contiguous range) typically take a combination of at most h diagnostic virtual addresses, where h is the number of digits in the address. The most important aspect of the addressing scheme, however, is that the time to access the various subsets does not grow by more than a small additive amount when the size of the machine doubles.

Having addressed a subset of the pods in the system, scan vectors can be applied in parallel to detect errors. JTAG serial data and control inputs are broadcast to all selected pods. Each pod provides a scan output signal that can be OR'ed or AND'ed with the corresponding signals from the other selected pods. The choice of an OR or AND combiner depends on what the diagnostic processor is expecting for a scan result. If the expected bit is a 1, the AND combiner is chosen: the result of the combining is a 1 if and only if all selected pods assert a 1. Similarly, if the expected output is a 0, the OR combiner is chosen: the result of the combining is a 0 if and only if all selected pods assert a 0. If an error is detected in a group of selected pods, the offending pod can be isolated either by addressing each pod in the group individually, one at a time, or by a divide-and-conquer methodology. Within a pod, standard techniques for finding errors within a serial chain of JTAG interfaces are used to isolate the error to the chip level.

Since the diagnostic network is a tree, it is relatively easy to make it self-diagnosing. Each level beneath the root can be tested by the levels above. Moreover, since there is not much logic in the diagnostic network, the probability that the network itself fails is much less than the probability that other parts of the system fail. Furthermore, since the network is a tree, most of its logic is near the leaves, so that when a part of the diagnostic network does fail, only a small part of the tree is likely to be isolated. We did not mind relying on relatively few components near the root, since any small set of components is quite reliable; it is only large aggregates that have a high probability of failing.

The current implementation of the diagnostic network uses essentially two off-the-shelf chips. The address decoding of a binary node is implemented with a P22V10 24-pin PAL, and the finite-state control of a node is implemented with a P18V8 20-pin PAL. The chips can be clocked at any speed up to about 1 megahertz. In some places in the system, to save chips, address decoding of a 4-ary or 8-ary node is implemented directly as a single-chip PAL, rather than by using several separate binary-node PALs.
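The expected-value-driven choice between the AND and OR combiners can be summarized in a few lines of code. This is a minimal sketch, not the actual diagnostic software; the function and variable names are invented for illustration.

    #include <stdbool.h>

    /* Combine one bit of scan output from all selected pods, choosing the
     * combiner according to the bit the diagnostic processor expects.
     * If a 1 is expected, AND the pod outputs: the result is 1 iff every pod
     * produced a 1.  If a 0 is expected, OR the pod outputs: the result is 0
     * iff every pod produced a 0.  Either way, a mismatch with the expected
     * bit means at least one selected pod is faulty. */
    bool scan_bit_ok(bool expected, const bool *pod_out, int npods)
    {
        bool combined = expected;           /* identity value for the chosen combiner */
        for (int i = 0; i < npods; i++)
            combined = expected ? (combined && pod_out[i])   /* AND combiner */
                                : (combined || pod_out[i]);  /* OR combiner  */
        return combined == expected;
    }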

2.5 Synchronized MIMD Goals

When we first set about designing the CM-5, we established engineering goals that went beyond mere performance specifications. We thought hard about issues of scalability: making a machine whose size would be limited only by the dollars a customer could spend, not by any architectural or engineering constraint. We thought hard about system issues, including timesharing, I/O, and user protection. We thought hard about reliability, since we were designing a machine which, in its largest configuration, would have well over 10 times the electronics of our previous supercomputer, the Connection Machine Model CM-2 Supercomputer. This section discusses the goals and reflects on the success of the CM-5 at meeting those goals. The following goals drove our network designs:

The networks must deliver high performance to the users. We wanted the users to be easily able to program the networks to get good performance. We did not want to force the users to worry constantly about pathological worst cases, and we wanted the best cases to run well without the user needing to do anything special.

The networks must scale up to a very large size. We wanted the logical design of the networks to scale up to a million processing nodes. We wanted to build very large machines.

The networks should efficiently support the data-parallel programming model (see Section 2.3), but should be flexible enough to allow us to support other parallel programming models as well. The data-parallel programming model was used extensively on the CM-2, and we wanted to be able to transport our existing high-level programming environments (Fortran, *Lisp, and C*) to the CM-5. We also wanted to be able to run codes written for other machines competitively.

The networks must be highly reliable and highly available. The system must notice whenever part of a network fails, be able to isolate the failure quickly, and be able to quickly reconfigure the networks around the failure. It was desired that even if part of a network has failed, the rest of the network should be able to function correctly with only a small degradation in performance.

The networks must work in a space-shared environment. We wanted a user's network traffic to be insulated from other users and I/O in other partitions.

The networks must work in a timeshared environment. A timeshared user must get a fair share of network bandwidth. Users must be able to be context-swapped quickly. Privileged system software must be able to seize control of a user's task.

The networks must be operational as soon as possible. Time to market was of the essence. Chips and systems needed to work the first time. We wanted the networks to be simple enough to engineer quickly, robust enough to respond to last-minute design changes, and easily verifiable. Consequently, we opted for conservative technology, for example, copper wires rather than optical fibers. We chose to use CMOS in order to minimize the risk associated with new technology.

We chose standard-cell technology in order to be able to make extensive use of the wide variety of available design tools (such as timing verifiers and automatic test generators). To achieve high performance with this conservative technology, we incorporated custom macro cells for circuits on the critical path. Our attitude was that there was more performance to be gained by architectural improvements than by eking out extra nanoseconds in technology. Conservative technologies, with their well-developed computer-aided design tools, would allow us to make many more architectural improvements during the design.

The chips used to build the networks must be organized in a way to allow technological or architectural improvements to be easily incorporated in subsequent revisions of the CM-5 system. On the CM-2, both processors and communication were implemented on the same chip, which made it difficult to incorporate advanced technology in one area without impacting the other. We wanted to be able to incorporate any advances without having to reengineer a major piece of the system.

The networks should embody both economy of mechanism and single-minded functionality. We wanted the networks to be lean and mean. Whenever anyone suggested anything complicated, we viewed it with suspicion. For example, the job of the data network is to deliver messages, nothing else. But it delivers both user messages and messages to I/O devices using the same mechanisms. The data network does not combine messages, duplicate messages, or acknowledge delivery of messages. It just focuses on moving data as fast as possible.

In hindsight, those goals missed an important part of the story: the operating system. The CM-5 operating system must not only provide the functionality of UNIX, but it must interact with the global scheduling strategy, all-fall-down, and the approach to shared I/O resources. It has been a long and difficult passage to obtain correct functionality and performance. The other side of the coin is that today the CM-5 operating system is the only parallel operating system that even attempts to meet these goals.

The CM-5 not only uses global synchronization for running user code, but it also uses global synchronization to run the operating system. All of the processes that belong to a single user program running on a single partition are coscheduled so that the network interface does not need to know about process identifiers. This synchronous coscheduling fits together with the all-fall-down mechanism. The all-fall-down mechanism has been difficult for the operating-system programmers to manage, however, chiefly because there is no corresponding `all-throw-up' mechanism to put the messages back into the router. The operating system needs a strategy to deal with the situation where not all of the messages are successfully reinjected into the data network at the beginning of a process's time slice. Most of the difficulties with programming the all-fall-down mechanism have been dealt with, but we shall see evidence in Section 3.3 that the all-fall-down mechanism is still causing some performance trouble. Perhaps there is a better way of achieving our goals than to start with the assumptions that the network state is part of the process state and that messages are reliably delivered.

The operating system also has difficulty understanding how to use the diagnostic network of the CM-5. Large machines still take too long to boot. The operating system has difficulty accessing the diagnostic network to get the machine configured in parallel. These problems with the operating system are not fundamental to the design of the system, but they are difficult.

The cost of writing large parts of an operating system has been fairly high. The other operating systems that one might use, however, either exist only in the future or do not address the system-wide problems of parallel supercomputing. On most other machines, one does not even think of measuring the system while timesharing is running. For example, at Mannheim, one application on

the Kendall Square KSR-1 shared-memory machine sees a factor-of-40 slowdown when timesharing is being used as opposed to single-user mode [Sch93, Page 44].

When designing the CM-5, we tried to provide some performance guarantees, and the synchronization facilities helped us do that. The performance a user gets from the data network should depend only on the user's traffic in the data network. If the machine is being timeshared, then the behavior of one user should not affect the performance of another. Each user should receive not only a fair fraction of the processing power, but also a corresponding fraction of the data network and control network performance. If the machine is space-shared, so that different users are running on different processors, the user should attain predictable performance independently of what is going on in the other processors. The data network and the control network should not interfere with each other. We specify this architecturally by saying that the synchronization is provided by a separate control network. Operationally, this means that the performance of a global synchronization must be independent of the traffic in the data network. In the CM-5 we actually provide a separate control network, but one can imagine implementations where the logical control network is implemented using the same wires used by the logical data networks.

We might have made some implementation decisions differently if time-to-market had not been quite as important a goal. For example, one should be able to achieve better performance by implementing the two logical data networks as a single physical network with two virtual subnetworks. Similarly, the control network is not as general as it might have been. The control network cannot be segmented to allow, for example, the user to create in a single application a collection of small synchronized-MIMD machines that operate in parallel, as was proposed for the FMP [LB80].

2.6 CM-5 History

We conclude this chapter with a brief history of our implementation effort. Work on the CM-5 architecture was begun in the latter part of 1987. We performed network simulations that led us, by January 1988, to choose a fat-tree architecture for the data network. By May 1988, most of the data network logic had been designed and verified, although several changes were implemented during the summer of 1988. A register-transfer-level (RTL) description of the data network chip was completed in early 1989, and the data network architecture was frozen. A gate-level description of the data network chip was completed by the early summer of 1989. The JTAG diagnostic interface was debugged using the data network chip design as a framework. The data network chip also served as the guinea pig for system and chip timing software. The chip was submitted for fabrication in May 1990.

The MIMD-plus-control-network design was proposed in early 1988, but we did not officially decide to use it until May 1989. Until then, we maintained other potential design alternatives. Work on the control network chip and the network interface proceeded concurrently. By the end of summer 1989, RTL models of both were simulating successfully. Gate-level models were implemented by the end of December 1989, and the control network architecture and network interface were frozen shortly thereafter. In May 1990, both the control network chip and the interface chip were submitted for fabrication.

The strategy of the diagnostic network was laid out in 1988, but work did not begin on it in earnest until the fall of 1989. Most of the work involved implementing the JTAG interface on the various chips. The design of the diagnostic network itself took only a few months, but considerable effort in 1990 and 1991 went into diagnostic software.

In the latter part of 1990, our attention turned to system integration. We received and tested the data network chips in July 1990, the control network chips in August, and the interface chips in September. Within two days after the interface chips arrived, we had assembled the networks for a 2-node machine and powered it up, a feat due in large measure to our functional verification methodology [SYC92]. That same day, the operating system (which had been developed on a simulator) functioned correctly on the machine. By year's end, we had successfully constructed several small machines, including a 64-node machine, some of which were dedicated to software development.

The year 1991 began with an effort to build a 256-node machine using a completely new mechanical design. Initially, it had been more important to make machines available to our software engineers than to construct a large machine. To test the limits of our physical design, however, we needed to build large machines. The 256-node machine was begun in February, and finished in March. The time frame was dominated by the build time in manufacturing. In May, we built a 544-node machine, which was shipped in August to the Minnesota Supercomputer Center on behalf of the Army High Performance Computer Research Center. In October 1991, the Connection Machine Model CM-5 Supercomputer was publicly announced. During 1992 a 1024-node CM-5 was constructed and installed at Los Alamos National Laboratory, and the machine passed acceptance testing during February 1993. March 1994 saw the announcement of the Connection Machine Model CM-5E, which provides higher data network performance using larger, 68-byte messages; an improved processor-network interface; and faster processing nodes that employ SuperSPARC processors and faster vector units, both running on a 40-megahertz clock.

Since supercomputer manufacturers tend to keep their sales figures secret, it is difficult to gauge the commercial success of the CM-5. J. Dongarra, H. Meuer, and E. Strohmaier [DMS93] estimate that Connection Machine computers (including both the CM-2 and the CM-5) collectively account for about 30% of the peak LINPACK capacity and about 40% of the peak floating-point capacity of the fastest 500 supercomputers installed in the world in early 1993. G. Ahrendt's list of the world's most powerful computing sites [Ahr94] states that there probably exist at least the following CM-5's: one with 1056 processing nodes, one with 896 nodes, three with 512 nodes, one with 256 nodes, one with 192 nodes, and six with 128 nodes.

Chapter 3

Mechanisms for Data Network Performance1

3.1 Introduction

We have now studied hardware (the Connection Machine CM-5) that provides global synchronization, and have seen how global synchronization solves many of the system problems in the CM-5. In this chapter we study how to get good performance from the data network of the CM-5. Once again, global synchronization provides a simple and effective strategy for attacking the problem.

Suppose you need to perform a sequence of parallel cyclic shifts. In parallel FORTRAN you might see code that looks like this, where the columns of A and the elements of B are distributed among the P processors:

      DIMENSION A(P,P), B(P)
      A(1,:) = CSHIFT(B,1)
      A(2,:) = CSHIFT(B,2)
      ...
      A(P,:) = CSHIFT(B,P)

The notation "A(I,:) = CSHIFT(B,I)" means cyclic-shift B by I and store it into row I of A. One natural way to compile data-parallel code is to compile each statement separately with global barriers between the statements. In this case, both the source language and the compiled code are a serial sequence of parallel operations. A simple subscript analysis reveals that the barriers are not required to ensure the semantic correctness of this program, since the target rows are all independent.
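To make the compilation strategy concrete, here is a rough sketch of the per-node SPMD loop that such a compiler might emit, one barrier per statement. The names send_block and barrier are placeholders for whatever block-transfer and global-barrier primitives the runtime provides, not actual CM Fortran runtime calls; reception is assumed to happen inside the message layer and is omitted.

    #include <stddef.h>

    /* Placeholder primitives standing in for the runtime's block transfer
     * and global barrier; declared here only so the sketch is complete. */
    void send_block(int dest, const double *buf, size_t nbytes);
    void barrier(void);

    /* Per-node code for the sequence of cyclic shifts, with a global barrier
     * after each statement.  P is the number of processors, self is this
     * node's number, and B_local points at this node's piece of the vector B. */
    void cyclic_shifts(int P, int self, const double *B_local, size_t nbytes)
    {
        for (int i = 1; i <= P; i++) {
            int dest = (self + i) % P;          /* target of CSHIFT(B, i)           */
            send_block(dest, B_local, nbytes);  /* deliver this node's data for row i */
            barrier();                          /* compiled-in barrier between statements */
        }
    }

Removing the barrier() call produces the "barrier-free" version whose surprising behavior is discussed next.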

Question: What happens to the performance when the barriers are removed? Answer: Surprisingly, the computation gets slower, often by a factor of three.

To understand this problem, a few patterns were studied in a narrow environment:

We studied the cyclic-shift pattern described above, as well as the resulting all-pairs communication pattern in which every processor sends a value to every other processor. The all-pairs

1The research in this chapter represents joint work with Eric A. Brewer. Much of the work of this chapter was originally reported in [BK94].

pattern appears in sorting and in some scientific codes (see, for example, [Ede91]). We also studied random communication patterns.

We studied the communication patterns in a data-parallel or SPMD environment, in which the real operation being performed by a collection of messages is a bulk data movement. Given this assumption, we examined the problem of sending a large collection of messages as quickly as possible, rather than focusing on the performance of any particular message.

We used block transfers built on top of 20-byte active messages [vCG*92] on the CM-5 data network. (The CM-5 data network is described in Section 2.2 of this dissertation.)

Although the scope of this study has been narrowed from the wider problem of obtaining good performance on any interprocessor communications system, we believe that our conclusions apply to a fairly wide range of situations.

An important limitation in any network is its bisection bandwidth. In general, given a data-parallel communication operation, if you divide the processors of a machine into two sets, measure the bandwidth B in bytes per second that the network could possibly provide across the corresponding cut, and measure the amount of data D in bytes that must be transferred between the sets, then it must take at least D/B seconds to move the data. For any given cut, B is a function only of the network, and D is a function only of the communication pattern.

There are 2^(P-1) ways to cut P processors into two sets, which makes finding the worst cut a potentially formidable operation. Leiserson [Lei85] showed that for fat-trees the problem is much easier: one need only consider the cuts across a single major arm of the fat-tree in order to find the tightest D/B bound (see Figure 3-1). In a P-node CM-5, which is a uniform 4-ary fat-tree, there are fewer than 4P/3 major arms, and the bandwidth of an arm depends only on the height of the arm. The bandwidths for a 64-node CM-5 are shown in Figure 3-1. We found that the most important cuts for the CM-5 are the links that connect the processors to the network: the bandwidth of these links is determined by software overhead and forms the limiting factor for most message patterns. We found that programmers of the Connection Machine CM-5 data network can improve the performance of their data-movement code by more than a factor of three by using a few relatively simple mechanisms.

Selectively using global barriers eliminates pathological interactions between different parts of the computation and provides a form of flow control. Barriers improve the performance of cyclic shifts by a factor of 2 to 3.

Managing the order in which messages are injected into the network improves the statistical independence of the various packets that are in the network at any given time, which avoids worst-case performance scenarios. If a processor has many packets to send to each of several processors, it is better to interleave the packets to several destinations rather than send large batches of packets to one target. For cyclic shifts, such a strategy is worth a factor of 2.1 (but that factor is not independent of the factor gained by using barriers). More generally, injecting the messages into the network in a random order is a good idea.

Limiting the rate at which messages are injected into the network provides another form of flow control, which helps to reduce network overloading and improve performance. For cyclic shifts, this optimization is worth an additional 25% in performance, and it greatly reduces the variance in bandwidth for large transfers.

[Figure 3-1 shows the fat-tree of switches above the processors, with arm bandwidths of 160, 80, and 40 megabytes per second in each direction at the three levels of the tree.]

Figure 3-1: A 64-node CM-5 data-network fat-tree showing all of the major arms and their bandwidths (in each direction). One needs to cut only a single major arm to find the worst bisection for a given message pattern.

This chapter explores how to get good performance from the CM-5 data network. Section 3.2 reviews the CM-5 networks with an eye to performance. Section 3.3 discusses the difficulties of measuring performance on the CM-5. Section 3.4 examines how barriers can improve communication performance, and Section 3.5 explores the importance of interleaving packets to multiple destinations. Section 3.6 examines the effect of matching the injection and reception rates. Section 3.7 concludes by outlining some programming rules of thumb for using these mechanisms.

3.2 CM-5 Background

This section provides some background on the CM-5 data network and examines the fundamental limitations of the machine, including network-processor bandwidth and network capacity. Recall from Chapter 2 that the CM-5 data network is a 4-ary fat-tree, as shown in Figure 3-1. Each edge is actually two independent links, left and right, but for bulk data movement we always use both simultaneously. Of the various network cuts, at least two matter in practice: the links connecting the processors and the cuts through the root.

Processor overhead limits the bandwidth of the processor-network links, not the network hardware. For these links, the hardware can support up to 40 megabytes per second in each direction. Assuming the 33-megahertz clock found in most CM-5 implementations and 20-byte packets with 16 bytes of payload (also standard for the CM-5), the sending overhead out of the processor is at least 37 cycles, for a maximum payload bandwidth of (16 × 33 × 10^6)/37 ≈ 14.3 × 10^6 bytes per second, or about 14.3 megabytes per second. The real limit is the cost of receiving packets, which currently requires about 60 cycles for realistic packets with polling and requires hundreds of cycles with interrupts. Because of the prohibitive cost of interrupts, all of our experiments use polling. At 60 cycles per 16-byte packet, the payload bandwidth is limited to 8.8 megabytes per second. Kwan, Totty, and Reed [KTR93] measured the actual one-way bandwidth at 8.3 megabytes per second using Thinking Machines' message-passing library.

These numbers only cover the case in which a processor is sending or receiving, however. When a processor is both sending and receiving, the bidirectional bandwidth is somewhere between the two cases. Although the network handles both directions in parallel, the processor cannot. The overhead to send and receive a packet is about 90 cycles, saving 7 cycles due to shared code. This translates to an upper bound of 5.9 megabytes per second in each direction for a total of 11.8 megabytes per second. The largest value measured by Kwan et al. was 10.4 megabytes per second.

The capacity of the network, which is the number of packets that can be injected without the receiver removing any, limits the ability of the processors to work independently. For example, if the network can hold ten packets, then a processor can inject only ten packets before the network

backs up, and then it must wait for the receiver to accept packets. The network capacity may be a function of the message pattern. For the CM-5, we measured the network capacity for a variety of partition sizes using cyclic shift by half the machine size:

    Nodes   Total Packets   Packets/Node
      8          79.0           9.88
     16         158             9.89
     32         342            10.7
     64         691            10.8
    128        1441            11.3

Thus, for any substantial data movement, the senders and receivers must be coordinated. Furthermore, since only the wire time can be hidden and the network capacity is only about eleven packets, there is little profit in trying to overlap computation and communication on the CM-5.

Nearly all communication on the CM-5 is implemented with active messages. Active messages were developed by von Eicken et al. [vCG*92], whose Berkeley CMAM package provided substantially better performance than contemporary versions of CMMD, Thinking Machines Corporation's communication library. CMMD 3.0 incorporated the active-message ideas, and in fact, most of CMMD is now implemented via active messages. Like CMAM, CMMD provides support for barriers and block transfers.2

Before active messages were developed, most MIMD parallel computer systems provided libraries based on the crystalline message-passing model described by G. Fox [Fox89]. In such a model, for two processors to communicate, one must perform a send and one must perform a receive. In the synchronous form, the send and the receive are blocking: the send blocks until the corresponding receive is executed, and only then is data transferred. Since data is only transferred after both its source and destination are known, no extra buffering is needed at either the source or the destination processor. To implement the matching of the send and the receive requires a three-phase protocol, in which the sender sends a packet requesting permission to send, the receiver replies with a packet containing the permission, and then the sender finally starts sending the data. To hide this latency, most systems provided a non-blocking send operation, but such schemes usually require quite a bit of buffering. For a more detailed comparison of active messages to other message-passing paradigms, see [vCG*92]. Even the early versions of CMMD on the CM-5 were based on the crystalline model.

As part of the CM-5 design team, I helped implement all kinds of hardware mechanisms that can be used to obtain good performance. What we designers sometimes did not adequately appreciate was that each of those mechanisms had to be programmed by some human. Furthermore, even though we designers tried to think of everything that would be needed, some issues that we never imagined have turned out to be important, such as the difficulty of implementing a parallel operating system (see the discussion of this in Section 2.5). The fact that CMMD still does not provide split-phase barriers (as of Spring 1994) is an indication that these kinds of programming problems are still present.

The Strata communications library [BB94b], developed at MIT, is an alternative to CMAM and CMMD that provides improved performance, improved support for timing and debugging, precise control over polling, and split-phase control-network operations.3 Strata incorporates the techniques described in this chapter.
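The bandwidth figures quoted earlier in this section follow directly from the per-packet overheads and the clock rate. The following sketch simply redoes that arithmetic; the overhead and clock numbers are the ones given in the text, assumed to describe the standard CM-5 configuration.

    #include <stdio.h>

    int main(void)
    {
        const double clock_hz = 33e6;   /* 33-megahertz node clock              */
        const double payload  = 16.0;   /* payload bytes per 20-byte packet     */

        /* bandwidth = payload bytes per packet * (clock / cycles per packet) */
        double send_only = payload * clock_hz / 37.0;  /* 37-cycle send overhead       */
        double recv_only = payload * clock_hz / 60.0;  /* 60-cycle receive with polling */
        double both_ways = payload * clock_hz / 90.0;  /* 90 cycles to send + receive   */

        printf("send-only:  %.1f MB/s\n", send_only / 1e6);   /* about 14.3 */
        printf("recv-only:  %.1f MB/s\n", recv_only / 1e6);   /* about  8.8 */
        printf("each way when doing both: %.1f MB/s\n",
               both_ways / 1e6);                              /* about  5.9 */
        return 0;
    }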

2The CMMD performance numbers presented in this chapter were measured under CMMD 3.1-Final with CMOST 7.2-Final. 3Strata is available from ftp.lcs.mit.edu via anonymous ftp, directory /pub/supertech/strata.


    Network Status   Average        95% CI
    Empty            4.56 seconds   ± 0.0020
    Full             5.52 seconds   ± 0.24

Figure 3-2: The effect of a full network on timings made with the operating-system timers: the timers inflate timings 21% when the network is full.

3.3 Timing on the CM-5

We use two forms of timing depending on the expected length of the event. For short events, less than a millisecond or so, we use the 32-bit cycle counter. For longer events, we use the 64-bit timers provided by the operating system. The cycle counter is extremely accurate. By using inline procedures the overhead can be subtracted out to yield timings that are accurate to the cycle. The cycle counter counts everything including interrupts and other processes, however. For example, if our process is time-sliced during an event, then we count all of the cycles that elapse until we are switched back in and the event completes. The probability of getting switched out during a one-millisecond event is about 1 in 100, however, since the time-slice interval is one-tenth of a second. A more common problem is the timer interrupts, which occur every 60th of a second.4 Thus, to get reliable measurements, we usually perform a timing at least three consecutive times and take the median. Using the median effectively eliminates errors due to time slicing and timer interrupts. The operating-system timers have their own advantages and disadvantages. The operating- system timers stop running when your process stops running, so they can be used even across time-slice interrupts. The operating-system timers are accessed using system calls, however, which cost hundreds of cycles and thus limit the accuracy. The operating-system timers also experience a time dilation when the data network is full of messages. To demonstrate the time dilation of the operating-system timers, we ran the following experi- ment. We timed a ¯oating-point loop with the network empty, ®lled up the network, and timed the same loop with the network full. The only difference between the ªemptyº time and the ªfullº time is the presence of undelivered messages sitting in the network. The ¯oating-point code is identical, no messages are sent or received during the timing, there are no loads or stores, and the code is a tight loop to minimize cache effects. Figure 3-2 shows the results for 18 samples taken across a wide range of overall systemloads. Not only does ®lling up the network increase the measured time to execute the ¯oating-point loop by an average of 21%, but it substantially increases the variation in measured time as well, as shown by the wider 95% con®dence intervals. This study implies that timings that occur while the network is full are dilated an average of 21%. The dilation is load dependent, but we were unable to get a reliable correlation between the dilation and the average system load. Fortunately, the timings appear to be consistent given a particular mix of CM-5 jobs, and the in¯ation appears to change slowly with time. To obtain reliable data, we ran the set of all experiments twelve times, measuring all of the experimental con®gurations once, and then measuring them all again, and so on, so that slow changes to the environment tend to affect all of the experiments equally. Algorithms that keep the network full appear to achieve lower performance than algorithms that keep the network empty. We address the time-dilation issue in the context of each experiment.
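A minimal sketch of the median-of-three timing discipline described above. The counter accessor read_cycle_counter() is a placeholder name for whatever inlined cycle-counter read the system provides; the point is simply that taking the median of three back-to-back measurements discards the occasional sample inflated by a time slice or timer interrupt.

    #include <stdint.h>

    /* Placeholder for the inlined 32-bit cycle-counter read. */
    extern uint32_t read_cycle_counter(void);

    /* Time one run of f(), in cycles, using the raw cycle counter.
     * (Subtraction of the counter's own overhead is omitted here.) */
    static uint32_t time_once(void (*f)(void))
    {
        uint32_t start = read_cycle_counter();
        f();
        return read_cycle_counter() - start;
    }

    /* Median of three consecutive timings: a single sample hit by a timer
     * interrupt or a time slice is an outlier and is discarded by the median. */
    uint32_t time_median3(void (*f)(void))
    {
        uint32_t a = time_once(f), b = time_once(f), c = time_once(f);
        uint32_t lo = a < b ? a : b, hi = a < b ? b : a;
        if (c < lo) return lo;
        if (c > hi) return hi;
        return c;
    }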

4The timer interrupts take about 250 microseconds to complete, which means that a CM-5 (of any size) spends about 250/16666 ≈ 1.5% of its cycles handling timer interrupts.

We believe that the dilation is caused by context switching. At each time slice, the operating system empties the network, using the all-fall-down mechanism, so that messages in the network that belong to one process do not affect the next process. When the process is switched back in, its messages are reinjected before the process continues. The cost of context switching appears to depend on the number of packets in the network, and some or all of this cost is charged to the user and thus affects the timings. Our measurements have not been adjusted for time dilation, since it appears that algorithms that keep the network full really do run slower. Our experiments with bulk data movement indicate that algorithms that keep the network full suffer more from network congestion than from time dilation.5

Even though the CM-5 operating system suffers from time dilation of 21%, we still consider the situation to be a victory for the cause of predictable performance: the fact that timesharing costs only a measurable 21% is itself notable. Some machines, such as the KSR1, see a factor of 40 slowdown when using timesharing [Sch93, Page 44]. The CM-5 provides relatively predictable performance under timesharing and is quite close to the goal of totally predictable performance under timesharing.

3.4 Using Barriers Can Improve Performance

This section shows that adding barriers to a communications operation can actually increase performance, and presents some evidence to explain the benefit. To our knowledge this effect was first noticed by Steve Heller of Thinking Machines Corporation [Hel92]. Culler et al. mention the effect in a later paper [CKP*93].

We ran the cyclic-shift experiment as follows. On a 64-node CM-5, each processor sends a total of 1.28 megabytes using block-transfer primitives. Each processor p sends a block of data to processor (p + 1) mod 64, then sends a block to (p + 2) mod 64, and so forth. We vary the block size B. For example, when B = 0.02 megabytes, each processor sends exactly one block to each processor; and when B = 0.01 megabytes, each processor cycles around twice, on each round sending one block to each processor. Figure 3-3 shows the performance of this cyclic-shift pattern as we vary B for both Strata and CMMD 3.1, with and without barriers.
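The structure of the benchmark is easy to state in code. This is a schematic version only: send_block() and barrier() are placeholder names for the block-transfer and global-barrier primitives (Strata and CMMD each provide their own), and reception is assumed to happen inside the message layer while sending.

    #include <stddef.h>

    #define P      64                /* number of processors                 */
    #define TOTAL  (1280u * 1000u)   /* 1.28 megabytes sent by each node     */

    /* Placeholder primitives standing in for the library's block transfer
     * and global barrier; they are not actual Strata or CMMD entry points. */
    void send_block(int dest, const char *buf, size_t nbytes);
    void barrier(void);

    /* Cyclic-shift benchmark: each node sends TOTAL bytes in blocks of size B,
     * stepping through the destinations (self+1) mod P, (self+2) mod P, ...
     * With use_barriers set, a barrier separates consecutive shifts, so every
     * round is a clean permutation; without it, rounds are free to overlap. */
    void cyclic_shift_benchmark(int self, const char *buf, size_t B, int use_barriers)
    {
        size_t nblocks = TOTAL / B;
        for (size_t i = 0; i < nblocks; i++) {
            int dest = (int)((self + 1 + i) % P);
            send_block(dest, buf, B);
            if (use_barriers)
                barrier();
        }
    }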

The versions with barriers use a barrier between cyclic shifts; i.e., each processor sends B bytes, waits for the barrier, and switches to the next destination. The CMMD version with barriers must use Strata's barrier procedure. Unlike Strata's barrier, the CMMD barrier does not poll, and hence combining it with block transfer leads to deadlock. You could use CMMD's barrier if the system used interrupts instead of polling, but the loss due to interrupt overhead is prohibitively expensive. Except for very small blocks, the versions with barriers perform much better. At 64-byte blocks (only 4 packets) the difference is small, but by 128 bytes per block, the difference is roughly a factor of two. For larger blocks, which are the common case, the difference is about a factor of 2.5. The substantial drop in bandwidth without the barriers is counterintuitive. Removing the barriers reduces the overhead and provides the data network with more opportunities to route packets. The increased opportunity, however, translates to decreased performance. Some sort of interference occurs when packets from different batches interact. We were able to measure an interaction that we call target collisions. A target collision occurs when two packets arrive at the same processor at nearly the same time. Since packet reception is the bottleneck, target collisions can quickly back up the network. For large batches, target collisions can conceivably

5The use of dedicated mode does not eliminate the dilation, although it does provide a more stable measurement environment.

[Figure 3-3 plots effective bandwidth (MB/sec/node) against block size (bytes) for five curves: Strata with optimized barriers, Strata with barriers, CMMD with (Strata) barriers, Strata without barriers, and CMMD without barriers.]

Figure 3-3: The effect of barriers between block transfers on the cyclic-shift pattern. The lines mark the average bandwidth; the error bars indicate 95% con®dence intervals. slow things down quite a bit. For example, if for some reason, two processors each started sending a batch to the same processor at the same time, then the destination processor would be overloaded, the network would back up, and the performance would drop substantially. To observe target collisions, we measured, at each instant in time, the number of packets in the network that are destined for a given processor. We performed this measurement by recording the time that each packet was injected into the network and the time that the target received the packet. We were able to use the globally synchronous cycle counter to obtain consistent times. Figure 3-4 shows evidence of target collisions for one typical case: cyclic shifts with no barriers and a block size of 100 packets (1600 bytes). The plot shows for each processor, at each point in time, the number of messages destined for that processor. There are several interesting patterns in this data. At the far left there are some patterns that appear as a ªwarped checkerboardº pattern around 200,000 cycles. These patterns re¯ect the propagation of delays. A single packet was delayed, which caused the destination processor to receive packets from two different senders,

[Figure 3-4 is a gray-scale plot of outstanding messages (0 to 17) destined for each processor, with processor number on the vertical axis and time (in units of 100,000 cycles) on the horizontal axis.]

Figure 3-4: The total number of packets in the network headed to any given processor at any given time. Time is measured in 33-megahertz clock cycles. The diagonal line superimposed on the drawing follows the pattern of systematic interference caused by target collisions. which delays the injection of packets and thus exacerbates the problem. In short order, processors alternate between being overloaded (gray) and idle (white). By 400,000 cycles, some processors have as many as 17 packets queued up. The processors above the heavily loaded processors are nearly always idle and thus appear white. These processors are idle because several senders are blocked sending to their predecessor in the shift. The white regions thus change to black in about 100,000 cycles as the group of senders transition together. These black and white regions thus form ªlinesº that rise at an angle of about 20 degrees from horizontal. We have explicitly marked one of these lines. Consecutive transfers incur collisions that do not occur when the transfers are isolated with barriers. Hot spots start due to random variation and then persist systematically, getting worse and worse. The barriers increase the performance by eliminating target collisions. We have explained, at least partly, the large differences between the experiments with barriers and the experiments without barriers. Now let us examine the other interesting features of Figure 3-3. For the barrier-free codes, we expect the performance to drop monotonically as the block size increases, but for the largest block sizes, the performance increases unexpectedly. This is because

for large block sizes, we end up doing very few cyclic shifts. For example, for a block size of 40,000 bytes, there are only 1280K/40K = 32 different cyclic shifts. With so few transitions from one round to the next, the system never gets a chance to get as far out of sync, and the number of target collisions remains low. To demonstrate that large blocks suffer as much as medium-sized blocks,

we tried running the Strata version without barriers for 100 shifts instead of 32, transferring about 3 times as much data (see Figure 3-5). The new data point, 1.67 megabytes per second, fits right on the asymptote implied by the first half of the "Strata without Barriers" curve. Thus, without barriers, as the block size increases, the performance approaches 1.67 megabytes per second for Strata. The asymptote for CMMD is 1.43 megabytes per second.

    Transfers   MB/sec   95% CI
    32          2.22     ± 0.122
    100         1.67     ± 0.0404

Figure 3-5: Big blocks also suffer from target collisions. After enough transfers without a barrier, the achieved bandwidth drops to match the asymptote of the "Strata without Barriers" curve.

The performance of Strata with barriers drops slightly for very large transfers, which is due to cache effects. In our experiment, each sender sends the same block over and over. For all but the largest blocks, the entire block fits in the cache. The performance of CMMD with barriers does not appear to drop with the large blocks. In actuality, it does drop, but the effects are masked by the high variance of CMAML_scopy. The differences in performance and variance between Strata and CMMD are quite substantial. They are due to bandwidth matching and are discussed in Section 3.6.

The versions with barriers perform worse for small blocks simply because of the overhead of the barrier, which is significant for very small transfers. The "Optimized Barriers" curve shows an optimized version that uses fewer barriers for small transfers. The idea is to use a barrier every n transfers for small blocks, where n times the block size is relatively large, so that the barrier overhead is insignificant. The actual n used for blocks of size B is n = ceil(512/B). The choice of 512, which is relatively unimportant, limits barriers to about one for every 512 bytes transferred. Small limits add unneeded overhead, while large limits allow too many transitions between barriers and risk congestion. (At n = infinity, there are no barriers at all.)

Another important feature of Figure 3-3 is that the barrier-free versions have a hump for 16-byte to 400-byte block sizes. The increase from 16 to 32 bytes is due to better amortization of the fixed startup overhead of a block transfer. The real issue is why medium-sized blocks perform better than large blocks. Medium-sized blocks perform better because they do not transfer many packets to the same destination. The packet count ranges from 2 to 8 for blocks of 32 to 128 bytes. With such low packet counts, the processor switches targets before the network can back up. (Recall that the network can hold about eleven packets per node.) Between receiving and switching targets, the time for a transition allows the network to catch up, thus largely preventing the "sender groups" that form with longer block transfers. The next section examines this effect in more detail.

Finally, we have seen some cases in which using barriers more frequently than just between rounds can improve performance. This phenomenon occurs because barriers act as a form of global flow control, limiting the injection rate of processors that get ahead. Related to this, Section 3.6 shows that artificially limiting the injection rate can improve performance. We do not yet understand exactly what the additional benefit of extra barriers is over that of limiting the injection rate directly.
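Returning to the optimized-barrier rule above, the choice of barrier frequency is small enough to state in code. This is a sketch under the stated assumption that roughly one barrier per 512 bytes transferred is the target; send_block() and barrier() are placeholder primitives as before.

    #include <stddef.h>

    void send_block(int dest, const char *buf, size_t nbytes);  /* placeholder */
    void barrier(void);                                         /* placeholder */

    /* Barrier every n transfers, where n = ceil(512 / B), so that barriers
     * cost roughly one per 512 bytes regardless of the block size B. */
    void shifts_with_optimized_barriers(int self, int P, const char *buf,
                                        size_t B, size_t nblocks)
    {
        size_t n = (512 + B - 1) / B;          /* ceil(512 / B); equals 1 for B >= 512 */
        for (size_t i = 0; i < nblocks; i++) {
            send_block((int)((self + 1 + i) % P), buf, B);
            if ((i + 1) % n == 0)
                barrier();
        }
    }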

3.5 Packets Should Be Reordered

After the selective use of synchronization, the most important technique for maximizing bandwidth is to randomize or interleave the packets. The previous section showed that large increases in

[Figure 3-6 plots effective bandwidth (MB/sec/node) against block size (bytes) for five curves: optimized barriers, CSHIFT asynchronous, random asynchronous, no barriers, and random targets.]

Figure 3-6: The effect of interleaving. The two asynchronous block transfer versions use packet interleaving to achieve about twice as much bandwidth as the corresponding normal block transfers. The version with barriers still performs much better, but it applies only when the communication can be structured as a sequenceof permutations. The asynchronous block-transfer interface avoids this requirement. bandwidth could be achieved by using barriers to prevent interference between adjacent rounds. When a round is not a permutation, however, collisions occur within the round and the bene®t of synchronization is minimal. In this section, we show that in the cases where collisions may occur within a round but the distribution of targets is still uniform, reordering packets is the key to performance. Figure 3-6 shows the effective bandwidth versus the size of the block sent to each target. The ªStrata Block Transfer with Random Targetsº version picks each new target randomly from a uniform distribution. Thus, within a round we expect target collisions and barriers not to help. The key is that for small blocks the collisions do not matter, because they are short lived. For large blocks the hot spots persist for a long time and thus back up the network, reaching the same asymptote as the cyclic-shift pattern without barriers, which also has a uniform distribution. The key conclusion from this is that when the distribution is unknown, small batch sizes avoid prolonged hot spots. For example, if a node has ten buffers that require 100 packets each, it is much better to switch buffers on every injection (batch size of one) than to send the buffers in order (batch size of 100). To explore this hypothesis, we built an asynchronous block-transfer interface. Each call to the asynchronous block-transfer procedure sends a small part of the transfer and queues up the rest for later. After the application has initiated several transfers, it calls a second procedure that sends

all of the queued messages. The second procedure sends two packets from each queued transfer, and continues round-robin until all of the transfers are complete. To avoid systematic congestion, the order of the interleaving is a pseudorandom permutation of the pending transfers. Thus, each processor sends in a fixed order, but different processors use different orders. The performance of this technique appears in Figure 3-6 as the two "Strata Asynchronous Block Transfer" lines. For random targets, interleaving the packets increases the bandwidth by a factor of about 1.8 for large blocks (more than 400 bytes) and performs about the same for small transfers. When used for the all-pairs cyclic-shift pattern, interleaving the packets increases the performance by a factor of 2.1 for large transfers and by about 15% for small blocks. The cyclic-shift pattern performs better, because the distribution is more uniform than random targets. The version with barriers still performs substantially better, but global scheduling is required to ensure that each round is a permutation. Thus, packet interleaving should be used when the exact distribution of targets is unknown. The difference in performance between the asynchronous transfers and the version with barriers is due to the overhead for interleaving. The dip at 128-byte transfers occurs because there is no congestion for smaller messages and because the substantial overhead of the interface is amortized for larger messages.

Packet interleaving allows the system to avoid "head-of-line" blocking, which occurs when packets are unnecessarily blocked due to the packet at the head of the queue waiting for resources that those behind it do not need. Karol et al. showed that head-of-line blocking can limit throughput severely in ATM networks [KHM87]. Although Strata tries to send two packets from each message at a time, if it fails, it simply moves on to the next message. This injection strategy has no effect on the CM-5, however, because the network interface contains a FIFO queue internally, which allows head-of-line blocking to occur regardless of the injection order. In general, all levels of the system should avoid head-of-line blocking.

The benefit of interleaving has important consequences for message-passing libraries. In particular, any interface in which the library sends one large buffer at a time is fundamentally broken. Such an interface prevents interleaving. Unfortunately, the one-buffer-at-a-time interface is standard for message-passing systems. To maximize performance, a library should allow the application to provide many buffers simultaneously. The Strata interface seems quite robust, and it works best with at least four transfers at a time.
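A minimal sketch of an asynchronous, interleaved block-transfer interface of the kind described above. The names (async_queue_transfer, async_flush, send_packet) are illustrative only, not the actual Strata entry points; the essential ideas are queuing several transfers before sending anything, and then draining them round-robin, two packets per transfer, with different processors visiting the transfers in different orders.

    #include <stddef.h>

    #define MAX_XFERS    32
    #define PACKET_BYTES 16        /* payload bytes per CM-5 data-network packet */

    /* Placeholder for the message layer's single-packet send. */
    void send_packet(int dest, const char *buf, size_t nbytes);

    struct xfer { int dest; const char *buf; size_t left; };

    static struct xfer pending[MAX_XFERS];
    static int npending;

    /* Queue a transfer; nothing is sent yet. */
    void async_queue_transfer(int dest, const char *buf, size_t nbytes)
    {
        pending[npending++] = (struct xfer){ dest, buf, nbytes };
    }

    /* Drain all queued transfers round-robin, two packets per transfer per
     * pass.  The visiting order is rotated by this node's number, a simple
     * stand-in for the pseudorandom permutation that Strata uses, so that
     * different processors inject in different orders. */
    void async_flush(int self)
    {
        int remaining = npending;
        while (remaining > 0) {
            for (int k = 0; k < npending; k++) {
                struct xfer *x = &pending[(k + self) % npending];
                for (int p = 0; p < 2 && x->left > 0; p++) {
                    size_t chunk = x->left < PACKET_BYTES ? x->left : PACKET_BYTES;
                    send_packet(x->dest, x->buf, chunk);
                    x->buf  += chunk;
                    x->left -= chunk;
                    if (x->left == 0)
                        remaining--;
                }
            }
        }
        npending = 0;
    }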

3.6 Bandwidth Matching

Given that the overhead of receiving messages limits the effective bandwidth of the network, there is no point in injecting packets any faster than the receive rate. In this section, we show that artificially limiting the injection rate improves throughput and reduces the variance in effective bandwidth of bulk data movement.

Suppose we have a message pattern in which every node both sends and receives. The ideal situation for maximizing throughput occurs when every node alternates between injection and reception. Furthermore, we would like the network to contain as few packets as possible, yet still ensure that each node always has a packet ready to be received. Because we use polling, we can limit reception to at most one per send, unless the send fails, in which case we must poll to prevent deadlock. Unfortunately, this strategy achieves only about 2 megabytes per second, because the network becomes very congested. Thus, CMMD and Strata always choose reception over injection: they poll until the network is empty. Although this strategy performs much better than limited polling, it is fundamentally unfair. Nodes that get a little behind may never catch up until others finish sending. The key problem is

that nodes that are sending and have no pending arrivals inject packets faster than the receiver can pull them out. Furthermore, because overloaded targets are not sending, other nodes are likely to have no pending arrivals, which exacerbates the problem. Our solution is to delay injection artificially in the case that there are no pending arrivals. This ensures that receivers pull out packets at least as fast as they arrive, and eventually the network empties. Thus, an overloaded receiver quickly catches up and resumes sending. Because we are artificially limiting the injection rate based on the expected throughput, we call this technique bandwidth matching.6

[Figure 3-7 plots effective bandwidth (MB/sec/node) against block size (bytes) for four versions: Strata with optimized barriers and a 28-cycle delay, Strata with optimized barriers and no delay, Strata with optimized barriers, a 35-cycle delay, and poll-once, and CMMD with (Strata) barriers.]

Figure 3-7: The effect of bandwidth matching on permutations separated by barriers. The error bars show 95% confidence intervals. All of the versions except "Poll Once" poll the network until it is empty. The delay value is how long the sender waits in the case that no packet arrived.

Figure 3-7 shows the impact of bandwidth matching on the cyclic-shift pattern with barriers. Without any delay, Strata actually performs worse than CMMD, because Strata has lower overhead and thus a correspondingly higher injection rate than CMAML_scopy. Increasing the delay to 28

cycles ensures that the injection rate is slightly slower than the reception rate: the sending overhead becomes 37 + 28 = 65 cycles, while the receiving overhead remains at 62 cycles. The added delay not only increases the performance by about 25%, it also reduces the standard deviation by about a factor of 50. The drop in variance occurs because the system is self-synchronizing: any node that gets behind quickly catches up and resumes sending.

6Bandwidth matching was discovered by Robert D. Blumofe and Eric A. Brewer when they were trying to understand why an early version of Strata's bulk transfer routines ran slower than those of CMMD. Blumofe had disassembled the CMMD code to verify that Strata was programmed to perform the identical sequence of operations on the network interface as CMMD. But the Strata code was tighter, and executed fewer instructions. Brewer made the key observation, stating "It is faster because it is slower." [BB94a].

The poll-once version takes this strategy a step further. Given that everyone is sending at nearly the same rate, it is now sufficient to pull out only one packet, since it is unlikely that there will be two packets pending. Polling when there is no arrival wastes 7 cycles. This improvement accounts for about a 7% gain in throughput. The actual reduction in message routing time is closer to 10%, however, because all nodes run in step, which ensures that all nodes finish at nearly the same time. Unlike the case without bandwidth matching, the network remains uncongested even though less polling occurs. Optimum performance requires both bandwidth matching and limited polling. Note that Strata sustains more bandwidth for all-pairs than Kwan et al. saw for individual messages, 10.66 versus 10.4 megabytes per second [KTR93]. The net improvement over CMMD without barriers is about 390%.

Although limited polling can improve performance, it is not very robust. When other cuts of the network, such as the bisection, become bottlenecks, limited polling causes congestion. We expect that limited polling is appropriate exactly when the variance of the arrival rate is low. If the arrival rate is bursty (due to congestion), the receiver should expect to pull out more than one packet between sends. In practice, patterns such as 2D-stencil [MS91] that do not stress the bisection bandwidth can exploit limited polling, while random or unknown patterns should poll until there are no pending packets.

Introducing delay for short transfers actually hurts performance, as shown by the superior performance of the "No Delay" version for small transfers. In this case, the startup overhead introduces plenty of delay by itself, and any additional delay simply reduces the bandwidth. Thus, future versions of Strata will adjust the delay depending on the length of the transfer. This form of adaptive delay should also remove the performance dip that appeared in the asynchronous block-transfer curves. For the limited-polling case, it was beneficial to increase the delay slightly to 35 cycles to ensure that the bisection bandwidth did not affect the arrival rate.

The technique of adding delay is essentially a static form of flow control. Traditional flow control via end-to-end acknowledgments would be more robust, but very expensive, since each acknowledgment requires overhead at both ends. A relatively cheap solution for many situations is to use barriers as all-pairs end-to-end flow control. In some early experiments we found that frequent barriers improved performance. Some of this effect occurred because barriers limit the injection rate, for which bandwidth matching is more effective. We expect that frequent barriers are a more robust form of flow control, because they form a closed-loop system. Despite its lack of feedback, however, bandwidth matching is quite stable due to its self-synchronizing behavior. Finally, there has also been some theoretical evidence that introducing delays might improve performance [RU92, GL89].
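To make the bandwidth-matching rule concrete, here is a minimal C sketch of the send loop. The routines poll_one_packet, try_inject, and delay_cycles are assumed stand-ins for the real network-interface operations (stubbed out here so the fragment compiles), and the 28-cycle constant is the delay discussed above.

    #include <stdbool.h>

    #define MATCH_DELAY_CYCLES 28      /* makes the send overhead 37 + 28 = 65
                                          cycles, just above the 62-cycle
                                          receive overhead */

    /* Stand-ins for the real network-interface operations. */
    static bool poll_one_packet(void)       { return false; }  /* no arrival   */
    static bool try_inject(const void *pkt) { (void)pkt; return true; }
    static void delay_cycles(int n)         { while (n-- > 0) { /* spin */ } }

    /* Send npkts packets, matching the injection rate to the reception rate. */
    void matched_send(const void *pkts[], int npkts) {
        for (int i = 0; i < npkts; i++) {
            bool had_arrival = false;
            while (poll_one_packet())       /* reception has priority: drain    */
                had_arrival = true;         /* the network before injecting     */

            if (!had_arrival)               /* nothing pending: we are ahead of */
                delay_cycles(MATCH_DELAY_CYCLES);  /* our receivers, so wait    */

            while (!try_inject(pkts[i]))    /* if the injection FIFO is full,   */
                (void)poll_one_packet();    /* poll to avoid deadlock           */
        }
    }

The delay is applied only when no packet was pending, so a node that is behind on reception is never slowed down further; this is the self-synchronizing behavior that collapses the variance.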

3.7 Programming Rules of Thumb

The following rules of thumb can help programmers decide when and how to use each of these mechanisms.

If possible, recast the communication operation into a series of permutations. Separate the permutations by barriers, and use a bandwidth-matched transfer routine, such as is provided by Strata, to implement each permutation (a sketch of this structure appears after this list). We found that this strategy can improve performance by up to 390%.

If bandwidth matching is impractical, because, for example, the real bottleneck is some internal cut of the network, then using periodic barriers inside each permutation may help. We have seen cases where barriers within a permutation improve performance.

If you know nothing about the communication pattern, you should try to arrange the communication into a bulk data transfer, and then use an interleaved or randomized injection order, as provided by Strata's asynchronous block-transfer mechanism. Even in this case, periodic barriers within the transfers may improve performance.

It is important to keep the network empty. It is almost always better to make progress on receiving than on sending. The one exception occurs when the variance of the arrival rate is near zero (due to bandwidth matching), in which case any additional polling wastes cycles.

If your computation consists of two operations, each of which has good performance separately, then keep them separate with a barrier. It is difficult to overlap communication and computation on the CM-5, because the processor must manipulate every packet, and the low capacity and message latency of the CM-5 network reduce the potential gain from such overlap. Large block transfers interact poorly with the cache, however, and we have seen cases where limited interleaving of the communication and computation can improve communications performance by about 5%.
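The first rule, recasting an operation as barrier-separated permutations, has a simple shape. The sketch below expresses an all-pairs exchange as P-1 cyclic-shift rounds; barrier_sync and matched_block_send are placeholder names for whatever global barrier and bandwidth-matched transfer routine the communication library provides (they are not the actual Strata entry points, and are stubbed here only so the fragment is self-contained).

    /* All-pairs exchange on P processors, recast as P-1 cyclic-shift
     * permutations.  In round r every processor sends one block to
     * processor (me + r) mod P, so each round is a permutation and packets
     * from adjacent rounds never interfere. */

    /* Stubs standing in for the library's global barrier and a
     * bandwidth-matched block-transfer routine. */
    static void barrier_sync(void) { }
    static void matched_block_send(int target, const void *buf, unsigned nbytes) {
        (void)target; (void)buf; (void)nbytes;
    }

    void all_pairs_exchange(int me, int P, const void *bufs[], unsigned nbytes) {
        for (int r = 1; r < P; r++) {
            int target = (me + r) % P;              /* cyclic shift by r          */
            matched_block_send(target, bufs[target], nbytes);
            barrier_sync();                         /* let the round drain before */
        }                                           /* the next one begins        */
    }

The barrier after each round is what keeps rounds from interfering in the network; without it, the cyclic-shift pattern degrades toward the same asymptote as the random-targets curve of Figure 3-6.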

Chapter 4

The StarTech Massively Parallel Chess Program

4.1 Introduction

After helping to design the Connection Machine CM-5, which provides hardware support for global synchronization, I wanted to explore how to exploit the MIMD characteristics of the machine fully. I looked for an application that did not fit well into the data-parallel approach, and I hit upon computer chess. The parallelism in computer chess derives from a dynamic expansion of a highly irregular game-tree, making computer chess difficult to express as a data-parallel program. To investigate how to program this sort of dynamic MIMD-style application, I engineered a parallel chess program, StarTech. StarTech is based on H. Berliner's serial Hitech program [BE89], and it runs on a CM-5. This chapter explains how the StarTech program works.

A chess program searches a game tree to determine its best move. Evaluating a chess position consists of starting at the position and following the tree of possible moves. At each vertex of the tree, the program considers the set of possible moves of one of the players. The program must decide when to stop searching and, once stopped, it must evaluate the position at the leaf of the tree. Both of these decisions require knowledge of chess, while the mechanism to unfold and prune the tree in a serial or parallel implementation is largely independent of the detailed chess knowledge.

The strength of a chess program depends on many factors, which we can roughly lump into two categories: chess knowledge and brute force. The chess knowledge of StarTech, which includes the opening book of precomputed moves at the beginning of the game, the endgame databases, the static position-evaluation function, and the time-control strategy, is based on H. Berliner's Hitech [BE89] program. The Hitech program runs on special-purpose hardware built in the mid 1980's and searches in the range of 100,000 to 200,000 positions per second. Berliner provided us with most of an implementation of Hitech written in C that runs at 2,000 to 5,000 positions per second.1

The brute-force part of StarTech's chess strength derives from my parallel implementation of game-tree search. I started with a relatively crude serial alpha-beta search routine that does not perform many of the sophisticated search extensions (which change the shape of the tree being

1The variation in the nodes-per-second rate of our serial Hitech implementation is due to the fact that move generation is more expensive in some situations than others, and the fact that our static evaluator takes different amounts of time depending on how close the value of the chess position is to the α-β search window as used in Section 4.3.

searched)2 found in Hitech and other modern chess programs. The parallel version of StarTech searches exactly the same tree, using exactly the same search extensions and producing exactly the same evaluations as my serial version produces when searching the same position to the same nominal search depth. Given that the shape of the entire unpruned chess tree is fixed by the chess knowledge, the problem addressed in this chapter is to search the tree quickly and efficiently.

My approach is to break the problem into two parts, an algorithm and a scheduler. The algorithm specifies what can be done in parallel. The scheduler specifies when and on which processor work is actually performed. I developed a game-tree search algorithm called Jamboree search. The basic idea behind

Jamboree search is to do the following operations on a position in the game tree that has k children:

The value of the first child of the position is determined (by a recursive call to the search algorithm).

Then, in parallel, all of the remaining k − 1 children are tested to verify that they are not better alternatives than the first child.

Any children that turn out to be better than the first child are sequentially searched to determine which is the best.

If the move ordering is best-first, i.e., the first move considered is always better than the other moves, then all of the tests succeed, and the position is evaluated quickly and efficiently. We expect that the tests will usually succeed, because the move ordering is often best-first due to the application of several chess-specific move-ordering heuristics. This approach to parallel search is quite natural, and variants of it have been used by several other parallel chess programs, such as Cray Blitz [HSN89] and Zugzwang [FMM91]. Still others have proposed or analyzed variations of this style of game tree search [ABD82, MC82, Fis84, Hsu90]. I do not claim that Jamboree search is a new search algorithm, although some of the details of my algorithm are a bit different from the details of the related algorithms that have been previously described in the literature. Instead, I view the algorithm as a good testbed for understanding how to design scalable, predictable, dynamic MIMD-style programs.

Other parallel algorithms based on Scout search include minimal tree search, mandatory work

first, and principal variation splitting. S. Akl, D. Barnard and R. Doran [ABD82] proposed the minimal tree search, which performs the weak α-β search by searching the minimal tree (i.e., the Knuth-Moore critical tree [KM75], which is further described in Section 6.5). Each position is kept in an expanded form for potentially a long time, resulting in unrealistic storage requirements. The Deep Thought parallel algorithm as described in Hsu's thesis [Hsu90] is a variant of the high-storage-requirement minimal tree search.

J. Fishburn [Fis84] proposed the mandatory work first (MWF) algorithm. Algorithm MWF is based on the weak version of α-β search. It explicitly computes the number of critical children

of the position being searched. A child of a position is critical if the child is in the Knuth-Moore critical tree, which means that the child would definitely be searched by the α-β algorithm. If the position being searched has more than one critical child, then MWF searches the first child and then searches the other children in parallel. If the first child turns out to be worse than some other child, MWF then researches the children that might be the best, all in parallel; this is in contrast to

2StarTech's search extensions include only quiescence search with check-extension. StarTech does not use the null-move search, a decision, I have been told by computer chess experts, that costs us about a factor of two in performance. I use the same search extensions in my serial and parallel implementations.

Jamboree, which does the researches sequentially. For nodes with exactly one critical child, MWF just searches the first child. Fishburn analyzed MWF for best-ordered and worst-ordered trees, but not for realistic game trees. One can construct game trees that are mostly best-ordered, in which the MWF algorithm does almost as badly as the naive parallel α-β search's O(√P) speedup.

Fishburn's MWF algorithm can be viewed as being separate from the scheduler, but his analysis depends on the scheduler. For example, Fishburn proves that worst-ordered game-trees achieve speedup using mandatory-work-first on a tree-of-processors scheduler, in which the depth of the game-tree is much greater than the depth of the processor tree. My result, in contrast, states that the average available parallelism is less than 3, and that the speedup is less than one, for an infinite-processor perfect scheduler. Even though the MWF algorithm is tangled up with the tree-of-processors scheduler, one can interpret Fishburn's results somewhat independently of the scheduler. Fishburn's results indicate, for example, that if one has a tree of processors that is half as deep as the game tree, and the degree of the processor tree is greater than the degree of the game tree, then the critical path is short and the work efficiency is good. Such a tree is as good as "infinite processors"

for an algorithm in which the shallowest h/2 plies of the game tree are searched in parallel and the deepest h/2 plies of the game tree are searched serially. It turns out that the half-the-depth-serially strategy, when applied to Jamboree search, reduces the average available parallelism even further, down to about 2 for worst-ordered trees. Fishburn did not analyze what happens if the tree of processors is as deep as the game tree. The reason that MWF achieves speedup on worst-ordered trees is that MWF researches the children that failed their tests in parallel, while the Jamboree algorithm sequentially researches all the failed children. Hence, for worst-ordered trees, Jamboree search finds little parallelism, while MWF finds much parallelism. Any chess program that is searching worst-ordered trees is not competitive, however.

Several programs use principal variation splitting (PV-splitting) [MC82], which is another variation on MWF, but the ideas behind PV-splitting are, like MWF, somewhat obscured by the fact that a tree-of-processors scheduler is entangled into the search algorithm. Later work has separated the scheduler from the algorithm; for example, Cray Blitz [HSN89] apparently uses PV-splitting with something like a work-stealing scheduler. No critical path analysis or measurement has been performed for these programs.

The Zugzwang program, developed by R. Feldmann, P. Mysliwietz, and B. Monien [FMM91], uses a parallel search algorithm that is very similar to Jamboree search. Zugzwang achieves high work-efficiency, searching to within a few percent of the same number of nodes in a parallel search as in a sequential search. The efficiency of StarTech appears to be somewhat lower, probably because the Zugzwang team has gone to substantial effort to try to ensure that they search the tree in a mostly

best-first order.

The parallel aspiration search algorithm [Bau78] divides the α-β window into segments, and gives each processor a different segment of the window to search. Aspiration search achieves only small parallel speedups. Surprisingly, the serial version of aspiration search often runs faster than an infinite-window search. Today most state-of-the-art chess programs, including StarTech, use a serial aspiration search in which the game tree is searched with a small α-β window, and if the score is outside of the window, the tree is researched.

R. Karp and Y. Zhang [KZ89] show how to search an AND/OR tree in parallel by carefully

allocating the right number of processors to each subtree. C. Stein [Ste92] employs Karp and Zhang's algorithm as a subroutine to do a parallel α-β search. Stein performs a binary search for the value of the game tree, at each stage converting the game tree to an AND/OR tree with the question

"Is the value of the root greater than s?".

There are several other approaches to game tree search that are not based on α-β search. H. Berliner's B* search algorithm [Ber79] tries to prove that one of the moves is better according to a pessimistic evaluation than any of the other moves are according to an optimistic evaluation. D. McAllester's Conspiracy search [McA88] expands the tree in such a way that to change the value of the root will require changing the values of many of the leaves of the tree. The SSS* algorithm [Sto79] applies branch-and-bound techniques to game tree search. These algorithms all require space which is nearly proportional to the run time of the algorithm, but the constant of proportionality may be small enough to be feasible. While these algorithms all appear to be parallelizable, they have not yet been successfully demonstrated as practical serial algorithms. I wanted to be able to compare my work to the best serial algorithms.

To distribute work among the CM-5 processors, StarTech uses a work-stealing approach, in which idle processors request work. Processors run code that is nearly serial. When a processor discovers some work that could be done in parallel, it posts the work into a local data structure. When a processor needs work, it sends a message to another processor, selected at random, and removes work from that processor's collection of posted work. StarTech uses a globally synchronous throttling mechanism to prevent idle processors from swamping busy processors with requests for work.

Many researchers have tried to build parallel chess programs, with mixed success. The StarTech program owes its success both to good hardware and good software. On the hardware side, the CM-5's fast user-level message passing capability makes it possible to use a global transposition table, and to distribute fine-grained work efficiently. The CM-5 control network makes it easy to avoid swamping problems. Fast timing facilities allow fine-scale performance measurement. On the software side, StarTech uses a good search algorithm, and systematically measures critical path length and total work to understand the performance of the program.

This chapter explains how StarTech works. First, Section 4.2 briefly reviews how unpruned game tree search can be used to evaluate a chess position, while Section 4.3 reviews α-β pruning, and Section 4.4 reviews Pearl's Scout search algorithm. Section 4.5 describes the Jamboree search algorithm. Then Section 4.6 describes some of the general problems of scheduling dynamic tree searches on MIMD machines. Finally, Section 4.7 explains how the StarTech scheduler works.

4.2 Negamax Search Without Pruning

Before delving into the details of the Jamboree algorithm, let us review the basic search algorithms that are applicable to computer chess. Most chess programs use some variant of negamax tree

search to evaluate a chess position. The goal of the negamax tree search is to compute the value v_p of position p in a tree T_p rooted at position p. The value of p is defined according to the negamax formula:

    v_p = static_eval(p)                          if p is a leaf in T_p, and
    v_p = max{ -v_c : c a child of p in T_p }     if p is not a leaf.

The negamax formula states that the best move for player A is the move that gives player B, who plays the best move from B's point of view, the worst option. If there are no moves, then we use a static evaluation function. Of course, no chess program searches the entire game tree. Instead some limited game tree is searched using an imperfect static evaluation function. Thus, we have formalized the chess knowledge as T_p, which tells us what tree to search, and static_eval, which tells us how to evaluate a leaf position.

The naive Algorithm negamax shown in Figure 4-1 computes the negamax value v_p of position p by searching the entire tree rooted at p. It is easy to make Algorithm negamax into a parallel algorithm, because there are no dependencies between iterations of the for loop of Line (N5).

One simply changes the for loop into a parallel loop. But negamax is not an efficient serial search algorithm, and thus, it makes little sense to parallelize it.

(N1)  Define negamax(n) as
(N2)      If n is a leaf then return static_eval(n).
(N3)      Let c = the children of n, and
(N4)          b = -∞.
(N5)      For i from 0 below |c| do:
(N6)          Let s = -negamax(c_i)          ;; Recursive Search
(N7)          If s > b then set b = s.       ;; New best score
(N8)      enddo
(N9)      return b.

Figure 4-1: Algorithm negamax.
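For concreteness, here is a direct C transcription of Algorithm negamax. The position interface (Position, is_leaf, num_children, child, static_eval) is an assumed placeholder for the chess-specific code, not the Hitech/StarTech data structures.

    #include <limits.h>

    typedef struct Position Position;       /* opaque chess position               */

    /* Chess-specific interface, assumed to be supplied elsewhere. */
    int       is_leaf(const Position *p);
    int       num_children(const Position *p);
    Position *child(const Position *p, int i);
    int       static_eval(const Position *p);

    /* Algorithm negamax (Figure 4-1): search the entire tree below n. */
    int negamax(const Position *n) {
        if (is_leaf(n))
            return static_eval(n);           /* (N2) */
        int b = INT_MIN + 1;                 /* (N4) best score so far              */
        for (int i = 0; i < num_children(n); i++) {    /* (N5) */
            int s = -negamax(child(n, i));   /* (N6) recursive search               */
            if (s > b)
                b = s;                       /* (N7) new best score                 */
        }
        return b;                            /* (N9) */
    }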

4.3 Alpha-Beta Pruning

The most efficient serial algorithms for game-tree search all avoid searching the entire tree by proving that certain subtrees need not be examined. In this section we review the serial α-β search algorithm in preparation for the explanation of how the Jamboree parallel search algorithm works. An example of how pruning can reduce the size of a game tree that is searched can be seen in

the chess position of Figure 4-2. Suppose White has determined that it can win Black's queen with 40. Kxh2. White's other legal move, 40. Kf1, fails to capture the queen. White does not need to consider every possible way for Black's queen to escape. Any one of a number of possibilities suffices. Thus, White can stop thinking about the move without having exhaustively searched all of

Black's options. The idea of pruning subtrees that do not need to be searched is embodied in the serial α-β search algorithm [KM75], which computes the negamax score for a node without actually looking at the entire search tree. The algorithm is expressed as a recursive subroutine with two new parameters α and β. If the value of any child, when negated, is as great as β, then the value of the parent is no less than β, and we say that the parent fails high. If the values of all of the children, when negated, are less than or equal to α, then the value of the parent is no greater than α, and we say that the parent fails low.

Procedure absearch3 is shown in Figure 4-3. When Procedure absearch is called, the parameters α and β are chosen so that if the value of a node is not greater than α and less than β, then we know that the value of the node cannot affect the negamax value of the root of the entire search tree. After the score is returned from the subsearch on Line (A6), the algorithm, on Line (A7), checks to see if the negated score is as great as β. If so, we know that the value of the node is at least as great as β and we can skip searching the remaining children: the procedure has failed high in this situation. Just because one of the children has a negated score less than β, however, does not mean that some other child might not be within the α-β window. The algorithm can only fail low after considering all of the children.

3This variant on the standard α-β algorithm is due to Fishburn [Fis83], who called it fail-soft α-β search. Fail-soft α-β search can return a value that is less than α, in which case the value returned is an upper bound to the true value of the node, or the search can return a value that is greater than β, in which case the value returned is a lower bound to the true value.


[Chess position diagram for Figure 4-2.]

Figure 4-2: White to move and win. In this position, White need not consider all of Black's alternatives to 40. Kf1, since almost any move Black makes will keep the queen, a worse outcome than just taking the queen with 40. Kxh2.

(A1)  Define absearch(n, α, β) as
(A2)      If n is a leaf then return static_eval(n).
(A3)      Let c = the children of n, and
(A4)          b = -∞.
(A5)      For i from 0 below |c| do:
(A6)          Let s = -absearch(c_i, -β, -α)
(A7)          If s ≥ β then return s.        ;; Fail High
(A8)          If s > α then set α = s.       ;; Raise α
(A9)          If s > b then set b = s.
(A10)     enddo
(A11)     return b.

Figure 4-3: Algorithm absearch.
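Using the same assumed position interface as the negamax sketch above (including limits.h and the Position declarations), Algorithm absearch translates into C as follows; this is a sketch of fail-soft α-β, not the StarTech source.

    /* Algorithm absearch (Figure 4-3): fail-soft alpha-beta search.  Uses the
     * assumed position interface declared with the negamax sketch above. */
    int absearch(const Position *n, int alpha, int beta) {
        if (is_leaf(n))
            return static_eval(n);                         /* (A2) */
        int b = INT_MIN + 1;                               /* (A4) */
        for (int i = 0; i < num_children(n); i++) {        /* (A5) */
            int s = -absearch(child(n, i), -beta, -alpha); /* (A6) */
            if (s >= beta)
                return s;                                  /* (A7) fail high   */
            if (s > alpha)
                alpha = s;                                 /* (A8) raise alpha */
            if (s > b)
                b = s;                                     /* (A9) */
        }
        return b;                 /* (A11) a fail-low result is an upper bound */
    }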

The α-β algorithm can substantially reduce the size of the tree searched. The α-β algorithm works best if the best moves are considered first, because if any move can make the position fail high, then certainly the best move can make the position fail high. Knuth and Moore [KM75] show that for searches of a uniform best-ordered tree of height H and degree D, the α-β algorithm searches only O(D^(H/2)) leaves instead of D^H leaves.

For any k ≥ 0, before searching the (k+1)st child, the α-β algorithm obtains the value of the kth child and possibly uses that value to adjust α or to return immediately. This dependency between finishing the kth child and starting the (k+1)st child completely serializes the α-β search algorithm.4

4R. Finkel and J. Fishburn showed that if the serialization implied by α-β pruning is ignored by a parallel program, then it will achieve only √P speedup on P processors [FF82].

4.4 Scout Search

For a parallel chess program, we need an algorithm that both effectively prunes the tree and can be parallelized. We started with a variant on serial α-β search, called Scout search, and modified it to be a parallel algorithm. This section explains the Scout search algorithm.

The serial Scout search algorithm, due to J. Pearl [Pea80], is shown in Figure 4-4. Procedure scout is similar to Procedure absearch, except that when considering any child that is not the first child, a test is first performed to determine if the child is no better a move than the best move seen so far. If the child is no better, the test is said to succeed. If the child is determined to be better than the best move so far, the test is said to fail, and the child is searched again (valued) to determine its true value.

The Scout algorithm performs tests on positions to see if they are greater than or less than a

given value. A test is performed by using an empty-window search on a position. For integer scores one uses the values −(α + 1) and −α as the parameters of the recursive search, as shown on Line (S9). A child is tested to see if it is worse than the best move so far, and if the test fails on Line (S12) (i.e., the move looks like it might be better than the best move seen so far), then the child is valued, on Line (S13), using a non-empty window to determine its true value.

(S1)  Define scout(n, α, β) as
(S2)      If n is a leaf then return static_eval(n).
(S3)      Let c = the children of n, and
(S4)          b = -scout(c_0, -β, -α).
(S5)      ;; The first child's valuation may cause this node to fail high.
(S6)      If b ≥ β then return b.
(S7)      If b > α then set α = b.
(S8)      For i from 1 below |c| do:                 ;; the rest of the children
(S9)          Let s = -scout(c_i, -α - 1, -α)        ;; Test
(S10)         If s > b then set b = s.
(S11)         If s ≥ β then return s.                ;; Fail High
(S12)         If s > α then                          ;; Test failed
(S13)             Set s = -scout(c_i, -β, -α).       ;; Research for value
(S14)             If s ≥ β then return s.            ;; Fail High
(S15)             If s > α then set α = s.
(S16)             If s > b then set b = s.
(S17)     enddo
(S18)     return b.

Figure 4-4: Algorithm scout.

If it happens to be the case that α + 1 = β, then Line (S13) never executes, because s > α implies s ≥ β, which causes the return on Line (S11) to execute. Consequently, the same code for Algorithm scout can be used for the testing and for the valuing of a position. Line S10, which raises the best score seen so far according to the value returned by a test, is necessary to ensure that if the test fails low (i.e., if the test succeeds), then the value returned is an upper bound to the score. If a test were to return a score that is not a proper bound to its parent, then the parent might return immediately with the wrong answer when the parent performs the check of the returned score against β on Line S11.

A test is typically cheaper to execute than a valuation because the α-β window is smaller, which means that more of the tree is likely to be pruned. If the test succeeds, then algorithm scout has saved some work, because testing a node is cheaper than finding its exact value. If the test fails, then scout searches the node twice and has squandered some work. Algorithm scout bets that the tests will succeed often enough to outweigh the extra cost of any nodes that must be searched twice, and empirical evidence [Pea80] justifies its dominance as the search algorithm of choice in modern serial chess-playing programs.
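A C rendering of Algorithm scout, again over the assumed position interface from the negamax sketch, follows; the parallel Jamboree algorithm of the next section replaces the loop of empty-window tests with a parallel loop.

    /* Algorithm scout (Figure 4-4): test each later child with an empty window
     * and re-search (value) it only if the test fails.  Uses the assumed
     * position interface from the negamax sketch. */
    int scout(const Position *n, int alpha, int beta) {
        if (is_leaf(n))
            return static_eval(n);                              /* (S2)  */
        int b = -scout(child(n, 0), -beta, -alpha);             /* (S4)  value the first child */
        if (b >= beta)  return b;                               /* (S6)  fail high             */
        if (b > alpha)  alpha = b;                              /* (S7)  */
        for (int i = 1; i < num_children(n); i++) {             /* (S8)  */
            int s = -scout(child(n, i), -alpha - 1, -alpha);    /* (S9)  empty-window test     */
            if (s > b)      b = s;                              /* (S10) */
            if (s >= beta)  return s;                           /* (S11) fail high             */
            if (s > alpha) {                                    /* (S12) test failed           */
                s = -scout(child(n, i), -beta, -alpha);         /* (S13) re-search for value   */
                if (s >= beta)  return s;                       /* (S14) fail high             */
                if (s > alpha)  alpha = s;                      /* (S15) */
                if (s > b)      b = s;                          /* (S16) */
            }
        }
        return b;                                               /* (S18) */
    }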

4.5 Jamboree Search

The Jamboree algorithm, shown in Figure 4-5, is a parallelized version of the Scout search algorithm. The idea is that all of the testing of the children is done in parallel, and any tests that fail are sequentially valued. A parallel loop construct, in which all of the iterations of a loop run concurrently, appears on Line (J7). Some synchronization between various iterations of the loop appears on Lines J12 and J18. We sequentialize the full-window searches for values because, while we are willing to take a chance that an empty-window search will be squandered work, we are not willing to take the chance that a full-window search (which does not prune very much) will be squandered work. Such a squandered full-window search could lead us to search the entire tree,

which is much larger than the pruned tree we want to search.

(J1)  Define jamboree(n, α, β) as
(J2)      If n is a leaf then return static_eval(n).
(J3)      Let c = the children of n, and
(J4)          b = -jamboree(c_0, -β, -α).
(J5)      If b ≥ β then return b.
(J6)      If b > α then set α = b.
(J7)      In Parallel: For i from 1 below |c| do:
(J8)          Let s = -jamboree(c_i, -α - 1, -α)
(J9)          If s > b then set b = s.
(J10)         If s ≥ β then abort-and-return s.
(J11)         If s > α then
(J12)             Wait for the completion of all previous iterations
(J13)                 of the parallel loop.
(J14)             Set s = -jamboree(c_i, -β, -α).     ;; Research for value
(J15)             If s ≥ β then abort-and-return s.
(J16)             If s > α then set α = s.
(J17)             If s > b then set b = s.
(J18)         Note the completion of the ith iteration of the parallel loop.
(J19)     enddo
(J20)     return b.

Figure 4-5: Algorithm jamboree.

The abort-and-return statements that appear on Lines J10 and J15 return a value from Procedure jamboree and abort any of the children that are still running. Such an abort is needed when the procedure has found a value that can be returned, in which case there is no advantage to allowing the procedure and its children to continue to run, using up processor and memory resources. The abort causes any children that are running in parallel to abort their children recursively, which has the effect of deallocating the entire subtree.

I have considered several variants on Jamboree search, such as valuing the first two children before starting the parallel tests of all the remaining children. By valuing the first two children, the algorithm is guaranteed to get a score that prunes at least one of the children. Such a variant might have less parallelism but search fewer positions. If there is still enough parallelism, then we would prefer the more efficient algorithm.

Parallel search of game-trees is difficult because the most efficient algorithms for game-tree search are inherently serial. Good game-tree search algorithms use pruning to reduce the size of the tree that needs to be searched. As an extreme example consider a position which is mate-in-one, i.e., the player can immediately win by making the correct move. As soon as the program has found a way to obtain mate-in-one, it does not need to consider any of the alternative moves. It does not care that the third move leaves it three pawns ahead or even that there is more than one way to achieve mate-in-one. Thus, as soon as a good move is found, that knowledge can be used to prune the search tree without affecting the final answer. If the serial program examines the mate-in-one as its first alternative, then it will only search one node. In this example, there is no available parallelism. Trying to exploit parallelism by searching additional nodes of the tree is fruitless, since those additional nodes would never have been searched by a serial algorithm. In order to search a game-tree in parallel, we must take the risk that we may perform extra work that a good serial program would avoid.

4.6 Multithreading with Active Messages

Now that we have studied the parallel search algorithm, let us direct our attention to how one can implement a tree-searching algorithm, such as Jamboree search, on a distributed-memory MIMD machine, such as the CM-5, in which all communication between processors is done with messages.

Consider the general problem of implementing recursive tree-searching programs. In a serial implementation of a recursive tree-searching program, a stack of activation frames is used to maintain the state of the program. Each frame holds the arguments, local variables, and the program counter for the local computation. Consider the tree shown at the left in Figure 4-6, with vertices labeled a, b, c, d, and e. The stack of frames goes through the configurations at the right in Figure 4-6 (our stacks and trees both grow downward), and at any instant only the bottommost frame is active.

[Diagram: the search tree with vertices a through e, and the corresponding sequence of stack configurations.]

Figure 4-6: A search tree and the sequence of stack configurations for a serial implementation.



Figure 4-7: A sequence of activation trees for a parallel tree search.

In a parallel implementation, however, calls to b and c may run concurrently, and the last-in-first-out discipline for allocating activation frames does not work. A parallel implementation must use a tree of activation frames rather than a stack of activation frames.5 At any instant, any of the frames can be active, not just frames at the leaves of the tree. Figure 4-7 shows one way to expand the tree in parallel. In this example, child d could have been allocated in parallel with child e, but the scheduler chose to run child d and child e serially. In a chess program, the game tree being searched is partially mirrored by the instantaneous state of the activation tree.

The system needs to deallocate frames. If frames were never deallocated, our space bounds would rise to meet our time bounds, and we would not be able to search very large game trees. The system cannot deallocate a frame until all of the messages destined for the frame have been received. Otherwise, a message could arrive and be interpreted by a frame that has been reallocated, and the system would become corrupted. One must be careful that there are neither any other processors about to send a message to nor any messages in the network destined for a frame that is about to be

deallocated.

One way to determine when all the messages have been delivered from Frame A to Frame B is, when A has decided to stop communicating with B, to send a completion message from A to B. If the interprocessor communications network is first-in-first-out, then this strategy suffices. When Frame B receives completion messages from every frame with which B is communicating, then B can safely deallocate itself. The CM-5 communications network, however, does not guarantee that messages are delivered in the same order that they were injected into the network.

To solve the deallocation problem, the completion message needs to contain a count of the number of messages that have been sent from A to B. When B has received all its completions and has accounted for every message described in the completions, then B can deallocate itself.

For a tree search algorithm we can optimize this communication by piggybacking the completion messages with the messages that would otherwise be sent anyway. In the Jamboree parallel search algorithm, there are three kinds of communications between frames:

When creating a new frame, messages from parent to child pass the arguments for the subsearch.

When the value of a frame has been computed, the child sends a result to its parent.

The parent can abort one of its children by sending an `abort' message to the child. When an abort message arrives at a frame, the frame aborts all of its children recursively.6

5In StarTech, we use a tree of activation frames, but each processor has a single stack on which the activation frames are kept. This implementation decision affects the space bounds of the program, which we will consider in Chapter 5, but is independent of the issues presented in this section. R. Nikhil and Arvind expressed clearly the need for a tree of activation frames [NA89]. Many systems have been designed to support a tree of activation frames, including dataflow machines such as EM-4 [SYH*89] and Monsoon [PC90], combinator reduction interpreters [Tur79, DR81], multiprocessor LISP systems [Hal84], and threaded interpreters [Cla87, CSS*91, NPA92, Hal94].

[Figure 4-8 diagrams: parent-processor/child-processor message exchanges for each termination scenario, naive (top) and optimized (bottom).]

Figure 4-8: Three possible ways for a frame to terminate (left to right). In the naive implementation (top), either a `score' message or an `abort' message can trigger the `done' messages, which contain a message-count. In the optimized implementation (bottom), the `done' message is piggybacked with the `score' or the `abort' message.

Figure 4-8 shows three possible ways for a frame to terminate. Figure 4-8(a) shows a `normal' return, where the child returns the score to the parent, and the two frames exchange completion messages. Figure 4-8(b) shows an `abort' scenario, where the parent tells the child to abort, then the completions are exchanged. And Figure 4-8(c) shows a situation where the parent decides to abort the child and the child decides to return a result at the same time, and then completions are exchanged.

We can optimize some of these messages away. Observe that the result is always immediately followed by a completion to the parent (but completions are not always preceded by a result), and the abort is always immediately followed by a completion to the child (but completions are not always preceded by an abort). The StarTech implementation merges the completion and result messages,

6Recall that the Jamboree search algorithm aborts its children when another child returns a score causing the node to fail-high. Thus, aborts are caused by pruning that occurs after a child has started.

and merges the completion and abort messages, as shown in Figure 4-8(d)-(f). As a consequence of using fewer messages, the protocol is simplified, and the performance is improved. A variation of my method of removing acknowledgments from the abort/result protocol appears in distributed systems: A. Tanenbaum and R. van Renesse show how to remove acknowledgments from remote procedure calls [Tv85]. They are not concerned about the dangling-reference problem, but rather simply with reducing the number of messages.
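The heart of the deallocation protocol is a per-peer count of messages, carried on the final `score' or `abort' message. The following C sketch shows only the bookkeeping; the type and function names are illustrative, not StarTech's, and the actual sending and receiving would be done by CMAM active-message handlers.

    /* Per-peer bookkeeping for the frame-deallocation protocol.  The counts
     * travel on the final `score' or `abort' message (Figure 4-8(d)-(f)). */
    typedef struct {
        int sent_to_peer;       /* messages this frame has sent to the peer     */
        int recvd_from_peer;    /* messages received from the peer so far       */
        int peer_done_count;    /* count carried by the peer's done message,    */
                                /* or -1 until that message arrives             */
        int sent_done;          /* nonzero once our own done message is sent    */
    } frame_link;

    void link_init(frame_link *l) {
        l->sent_to_peer = l->recvd_from_peer = 0;
        l->peer_done_count = -1;
        l->sent_done = 0;
    }
    void note_send(frame_link *l)    { l->sent_to_peer++; }
    void note_receive(frame_link *l) { l->recvd_from_peer++; }

    /* Called when the piggybacked done message (with its count) arrives. */
    void note_peer_done(frame_link *l, int count) { l->peer_done_count = count; }

    /* A frame may deallocate itself only when both sides are done and every
     * message the peer counted has actually arrived, so no message in the
     * (unordered) CM-5 network can still be destined for this frame. */
    int safe_to_deallocate(const frame_link *l) {
        return l->sent_done &&
               l->peer_done_count >= 0 &&
               l->recvd_from_peer == l->peer_done_count;
    }

Because the done count rides on a message that must be sent anyway, the optimized protocol needs no extra messages, yet a frame never deallocates itself while a message for it can still be in flight.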

4.7 The StarTech Scheduler

Now that we have a search algorithm and have reviewed how to run tree searches on a MIMD machine, we explain the details of the StarTech scheduler. The job of a scheduler is to decide when to execute work and on which processor to execute it. The scheduler tries to run the program quickly and without using too much memory. This section describes how work is scheduled by StarTech.

StarTech uses a work-stealing strategy, in which idle processors request work from busy processors, to schedule work. Each busy processor maintains a collection of jobs that can be stolen and worked on in parallel. These jobs can be thought of as `frame stubs' which, in this application, contain the information needed to perform a search of some subtree in the Jamboree algorithm.

In StarTech, all interprocessor communication is performed using active messages. I used the Berkeley CMAM active message library [vCG*92], since when I started this project, the CMMD library did not support active messages. Active messages allowed me to implement many of our protocols directly and efficiently. Each processor executes a program which is as close to the standard serial program as possible. StarTech uses the standard C-language call stack to implement all the activation frames that are scheduled onto a given processor at the same time. Thus, within one processor, the program runs nearly as efficiently as the serial version of the program, and the overhead for parallelism is paid only when work is stolen. Communication among frames on the same processor is done directly through memory rather than through active messages.7

A processor that needs work sends a "request-to-steal" active message to another, randomly selected processor. If the other processor has work available to steal, the requester is sent a "request-granted" message which includes a description of the work. If the other processor has nothing available to steal, then the requester is sent a "request-denied" message, and the requester tries again by sending another "request-to-steal" message to another random processor. This work-stealing protocol is an example of a request-reply protocol. Section 2.2 showed how to implement deadlock-free request-reply protocols on the CM-5.

If, when a request-to-steal message arrives at a processor, there are many jobs available to be stolen, the StarTech scheduler prefers to steal jobs that are nearer the root of the tree. Since the jobs nearer the root correspond to deeper subtrees, StarTech's policy tends to steal the largest

7Both DIB [FM87] and Mul-T [MKH91] use the approach of optimizing accesses within a single processor. DIB uses two representations for its frames: an on-processor stack and an inter-processor tree. DIB dynamically converts between the two representations. DIB was designed to run on a collection of computers operating over a local area network, and so it includes many mechanisms to cope with failing processors. The high cost of interprocess communication on such a collection discourages message-sending, and DIB's approach to global data structures is correspondingly awkward. The Mul-T system optimizes access within a single processor by delaying the creation of closures and other run-time structures until work is actually stolen. The single-processor performance of a Mul-T program remains substantially lower than a program written in C, however.

jobs possible, which improves the chances of amortizing the overhead of stealing the work. This scheduling policy is written directly into the active message handler.8

One problem that can arise with a simple work-stealing scheduler is that all the idle processors (the thieves) can gang up on a busy processor, swamping it with requests for work, thereby preventing it from getting any work done. We now explain how StarTech handles the swamping problem. In the next chapter (in Section 5.7) we study the underlying causes of the swamping problem.

StarTech uses the CM-5 global synchronization network to help solve the swamping problem. The synchronization is used as a global throttle on the activity of the thieves. Using the split-phase global synchronization operation, each processor can "raise its hand", then do some more work, and then query to find out if everyone has raised its hand. The StarTech throttling rule is simple:

Each busy processor must do some work before raising its hand.

Each idle processor raises its hand immediately. The idle processor can send out only one request and, if the request is denied, must wait until everyone raises its hand before it can send another request.

After everyone has raised its hand, the global synchronization network resets, and everyone can raise its hand again. Meanwhile, busy processors continue to execute their work regardless of the state of the global synchronization (see Figure 4-9). The global throttle does not stop any busy processor from getting more work done, because the global throttle uses split-phase barriers. When a busy processor enters a barrier, it keeps working until the barrier completes. Thus, StarTech's allocation of new activation frames proceeds in globally synchronous phases. During each phase, the expected number of request-to-steal messages arriving at any processor is less than one. The StarTech global throttle requires very little tuning, since the slowest processor to "get some work done" automatically controls the allocation rate. The only processors that are slowed by the barrier are the idle processors. The busy processors, as shown above, are not slowed down. Thus, by tuning the frequency at which the global throttle is completed, we leave the performance of the busy processors unchanged, and only the time it takes for an idle processor to find work is affected. This tuning is only important if there is not enough parallelism to make the granularity of the work large relative to the time between globally synchronous phases. Thus, we see that fast global barriers are useful for implementing dynamic MIMD-style computation, as well as for implementing data-parallel programming.
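A sketch of the throttled loop run by an idle processor follows. The wrappers for the CM-5 split-phase barrier and for the steal-request message (enter_split_barrier, barrier_done, send_steal_request, poll_network) are hypothetical names, not the actual StarTech or CMAM entry points.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Hypothetical wrappers for the CM-5 control-network split-phase barrier
     * and the steal-request active message. */
    void enter_split_barrier(void);      /* "raise my hand" (does not block)     */
    bool barrier_done(void);             /* has everyone raised its hand?        */
    bool send_steal_request(int victim); /* true if the victim granted work      */
    void poll_network(void);             /* run pending active-message handlers  */

    /* Idle loop: at most one steal request per globally synchronous phase. */
    void idle_loop(int self, int nprocs) {
        for (;;) {
            int victim;
            do { victim = rand() % nprocs; } while (victim == self);

            enter_split_barrier();               /* idle: raise hand immediately */
            if (send_steal_request(victim))
                return;                          /* got work; go run it          */

            while (!barrier_done())              /* request denied: wait for the */
                poll_network();                  /* phase to end before retrying */
        }
    }

Busy processors, by contrast, enter the barrier only after making progress on their own work, so the slowest busy processor sets the pace of the phases without ever being delayed by them.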

8In StarTech, we use polling rather than interrupt-driven messages. With this approach atomicity can be guaranteed for a critical section of code simply by not polling the network during the critical section.

[Figure 4-9 timeline: Processor 0 (busy), Processor 1 (busy), and Processor 2 (idle). Request/reply pairs alternate with `Enter Barrier' events; the idle processor waits for `Barrier Completes' before issuing its next request.]

Figure 4-9: Busy processors keep working while the split-phase barrier completes. The split-phase barrier does not complete until all processors have entered the barrier. The idle processors are not allowed to send another request until the barrier completes.

Chapter 5

The Performance of the StarTech Program

5.1 Introduction

In Chapter 4 we explained how the StarTech parallel chess program works. In this chapter, we study the performance of the StarTech program, relating the performance of the program to three fundamental performance parameters: critical path length, work, and work efficiency. We shall study how to use those performance parameters to tune the StarTech program in Chapter 6.

Recall that my strategy for parallel computer chess is to divide the program into two parts: the search algorithm and the scheduler. The algorithm specifies what can be done in parallel, while the scheduler decides when and where to do the work. Our performance study is organized along similar lines. We first study the critical path, total work, and work efficiency of the Jamboree game-tree search algorithm. Then, we study the StarTech scheduler, using the critical path and total work as a benchmark with which we determine how well the StarTech scheduler does compared to optimal scheduling.

My strategy for understanding the performance of StarTech is quite different from the strategy used by other researchers, such as T. Marsland, M. Olafsson and J. Schaeffer [MOS86]. Marsland

et al. measure overheads to try to understand the performance of their parallel chess programs. If the serial program runs in time T_s, then ideally the performance on P processors would be T_s/P. They define the total overhead of the program to be any additional time taken to actually run the program. Thus, if it actually takes time T_P to run on P processors, the total overhead is T_P − T_s/P. To try to understand where the total overhead is coming from, Marsland et al. break the overhead into three components:

Communication overhead: The time spent sending messages.

Search overhead: The cost attributable to extra positions searched by the parallel program as compared to the number of positions searched by the serial program.

Synchronization overhead: The time processors are idle. (Referred to as synchronization overhead because any idle processor must be waiting for some other processor to do something.)

There are problems with overhead measurement, however. To measure communication overhead, Marsland et al. count the number of messages and multiply by the cost of sending a single message. To measure the search overhead, they count the number of positions searched. To measure the synchronization overhead, they subtract the communication overhead and the search overhead

from the total overhead. Each of these measurements is wrong. For example, communications overhead depends on how congested the network is.1 The number of positions searched does not directly relate to the amount of work performed, since move generation and static evaluation take different amounts of time for different positions. The synchronization overhead is not measured independently, so there is no way to quantify the errors incurred by approximating the other two overheads. Furthermore, overhead measurement does not provide much insight into the performance of the program. There is no way to distinguish how much of the synchronization overhead is incurred because the program has insufficient parallelism, and how much is incurred because the scheduler does a poor job of distributing the parallel work among the processors. By instead measuring the critical path length, the work, and the work efficiency, we can obtain precise explanations of how a program achieves its performance.

To define our three measures of parallel performance, we think of a parallel algorithm as an

acyclic directed graph: a dataflow graph. We define C, the critical path length of the graph, as the depth of the graph, measured in units of time; and W, the work of the graph, as the sum of the times to execute all of the instructions in the graph. These two values, which can be derived analytically or measured empirically, can provide useful

at least time C to execute the graph, because all of the instructions on any path must be executed P

sequentially. The work W provides another lower-bound to the run time. On processors, one

W P P

cannot hope to perform W units of work in time less than . Thus, on processors, the time T

P to execute a graph must satisfy

T W P

P (the`linearspeedup'lowerbound), (5.1)

T C

P (the `critical path' lower bound.) (5.2)

Brent showed [Bre74, Lemma 2] that for every data¯ow graph there is a schedule that executes the graph in no more than the sum of the linear speedup term and the critical path term. That is, there

are schedules such that

W

+ C T

P (5.3) P Blumofe and Leiserson [BL93] show that any greedy schedule that has no overheads achieves

Brent's bounds. C

The work W and the critical path length can be combined to give the average available C

parallelism of a program. If you know W and , you can produce lower bounds on the time to run

C W on P processors. If is as large as , then even with an unbounded number of processors and a

perfect scheduler you won't achieve much speedup. On the other hand if C is very small compared

to W , then you can conceivablymake use of many processors to achieve speedup. This relationship W

between C and can be captured by the average available parallelism

A = W C

1The number of messages may be a good measure of the communications overhead in a machine where sending a message requires an operating system call taking tens of thousands of instructions to execute. On the CM-5, however, where sending a message consumes only a few processor cycles, the actual performance of the network can dominate the communications overhead.

74

The average available parallelism tells us how many processors we can fruitfully apply to running a

A C W P

program. From Brent's theorem we can conclude that if P then and therefore that

T W P

P 2 . That is if we have more average available parallelism than processors, we can, with the right scheduler, get within a factor of two of linear speedup. Often, more work is done by a parallel algorithm than by the best serial algorithm. We de®ne the work ef®ciency of a parallel computation to be the ratio of the parallel work of the computation to the serial work done by the best serial implementation to solve the same problem. Brent's theorem uses the amount of parallel work rather than the amount of serial work, and so even with an ideal scheduler, if the work ef®ciency of the algorithm is too low, then the computation will be slow. This chapter examines the performance of the Jamboree algorithm using both analytical and empirical techniques. We analyze two special cases: best-ordered uniform game trees and worst- ordered uniform game trees. A best-ordered game tree is a tree in which at every position in the

tree, we consider the best move ®rst. That is, we ®rst search the subtree that corresponds to the best move. In Chapter 4 we observed that - search and Scout search both behave optimally on

best-ordered trees. A worst-ordered tree is one in which we consider the worst move ®rst, and then the second worst move, and so on, ®nally considering the best move last. In such trees, - does not perform any pruning at all, and Scout search searches the entire tree, even visiting some positions in the tree more than once. It turns out that Jamboree search also does a very good job on best-ordered trees, but that Jamboree search offers no opportunity for speedup on worst-ordered trees. After studying the performance of the Jamboree algorithm, we direct our attention to the StarTech scheduler. Given the critical path and total work performed by the Jamboree search algorithm, we construct a performance model for the program. I found that, on real chess positions, the StarTech

scheduler achieves

W

+ T C +

P 1 02 1 5 4 3seconds. (5.4) P Except for the 4.3 second startup cost, the performance of the StarTech scheduler is within a factor of 2.52 of optimal. In addition to modeling the system as a whole, we study one of the problems that commonly occur in work-stealing schedulers. In a naive work-stealing scheduler the idle processors can sometimes gang up on the busy processors, swamping them with requests for work, and preventing them from getting any work done. I solved this problem in the StarTechschedulerby using a globally synchronous throttling strategy (see Section 4.7.) The swamping problem can be understood using queueing theory. We not only wish to achieve good time performance, but we must not use too much memory. For example, a breadth-®rst greedy search of the game tree is one of the schedules that achieves Brent's bound, but the amount of memory needed is nearly proportional to the amount of total work. Blumofe and Leiserson [BL93], inspired by memory exhaustion problems I had in early versions of StarTech, showed that for certain kinds of data¯ow graphs, such as the data¯ow graph for Jamboree search, there are easy-to-®nd schedules that not only achieve Brent's time bounds, but also use no more memory per processor than a serial schedule uses. A serial schedule of Jamboree search uses memory which is linear in the depth of the game-tree. All known serial game-tree search algorithms use at least that much memory. This chapter includes a study of the performance impact of the decisions made to guarantee that StarTech's space bounds are kept reasonable. In summary, this chapter presents a performance study of the StarTech chess program.

We start by studying the performance of the chess algorithms. Section 5.2 analyses the performance of Jamboree search on best-ordered uniform game trees. Section 5.3 analyses the performance Jamboree search on worst-ordered uniform game trees. Section 5.4 studies

75 the performance of Jamboree search on real chess positions and discusses discusses some of the interactions between the algorithm and the scheduler that can make it dif®cult to understand the performance of Jamboree search.

We then move on to study the StarTech scheduler. Section 5.5 shows why scheduling chess programs is demanding. Section 5.6 studies the performance of the scheduler relative to critical path and total work. Section 5.7 studies the swamping problem using queueing theory. Section 5.8 concludes the scheduler study by discussing the space-time tradeoffs implied by the decision I made to ®x the processing of each position in the chess tree onto a single processor, rather than allowing a position to migrate after its creation from one processor to another.

In summary, to quantify the behavior of a parallel algorithm, we use three measures of parallel computation: the critical path length, the work, and the work ef®ciency. Critical path length and total work provide us with bounds on the performance of the computation, while work-ef®ciency providesus with a measure of how many processorswe need to employ to overcome the overhead of running a parallel algorithm. By curve-®tting the performance of the system to the measured critical path and work, we can determine the overheads induced by communications, load-balancing, and scheduling. These measures are fundamental to a parallel computation and provide genuine insight into how a program behaves.

5.2 Analysis of Best-Ordered Trees

To analyze best-ordered trees, we borrow some notation from Knuth and Moore [KM75]. Knuth and Moore proved that for a best-ordered tree, - search searches exactly the critical tree. To de®ne the critical tree, we ®rst de®ne three three position types: Type 1, Type 2, and Type 3.

The root of the game tree tree is a Type 1 position.

The ®rst child of a Type 1 position is a Type 1 position. The other children of a Type 1 position are Type 2 positions.

The ®rst child of a Type 2 position is a Type 3 position. The other children of a Type 2 position are not given types.

All the children of a Type 3 position are Type 2 positions.

The critical tree is the set of all positions of Type 1, Type 2, or Type 3. Figure 5-1 shows a uniform game tree of depth 3 in which each position has 3 children, with the critical tree highlighted. The types of the positions in the critical tree are shown. It turns out that Jamboree also searches exactly the critical tree in a best-ordered game tree, and it searches each position exactly once. Thus, if we

analyze the critical tree, we have analyzed the entire tree. d De®nition 5.1 We denote the best-ordered uniform game-tree of height h with children at each

2

B B dh

internal position as dh . The value of must be ®nite.

B C dh De®nition 5.2 We denote the critical tree of dh as .

2

Knuth and Moore note that if a game tree has a score that is + or , the combinatorics work out slightly differently.

76 1

1 2 2

1 2 2 3 3

1 2 2 3 3 2 2 2 2 2 2

Figure 5-1: The critical tree of a best-ordered uniform game-tree. The critical tree is shown with solid lines, while the rest of the tree is shown with dotted lines. Also shown are the Knuth-Moore types of the positions that are in the critical tree.

For our analysis of uniform trees, we assume that each position costs one unit of time to visit. We also assume that all of the time is spent when the position is ®rst visited. This lumped-cost assumption is close to the reality of StarTech, where move generation and static evaluation take much more time than the time taken between children. Almost the only thing to do between children is to take the maximum of the best known score with the score of a child. Also, we can charge any between-children costs to the children. Knuth and Moore analyzed the number of leaves in the

critical tree, which is

e bh c h

d 2 2

+ d C = d

Leaves of dh 1

We are interested in the number of internal positions of the tree.

h

Theorem 5.3 For any d 2 and any 1, we have

d + +

3d 1 3

h2

d h h

1 if is even,

d

d 1 1

jC j =

dh (5.5)

2

d + d +

3 d 3

(h

1) 2

h h

d 1 if is odd.

d d 1 1 The proof follows in a few paragraphs,but ®rst let us investigatetherami®cationsofTheorem5.3. We can use Equation 5.5 to determine the work done by Jamboree search on a best-ordered uniform

tree, as well as the critical path of the computation and the average available parallelism.

d B

Theorem 5.4 If 1, Jamboree search of dh

jC j

performs work dh ,

jC j

has critical path length 2h ,

jC jjC j h

has average available parallelism dh 2 .

C

Proof of Theorem 5.4: Knuth and Moore showed that - search searches exactly dh . For best ordered trees, Jamboree searches the tree that - would search, since every test succeeds. The

77

h jC jjC j jC jjC j

h h h 36h 2 36 3 0 1.0 1.0 1 12.3 9.2 2 18.0 12.0 3 130.8 72.0 4 223.9 108.9 5 1792.4 722.0 6 3302.1 1162.3 7 27933.8 8067.1 8 53375.4 13309.9 9 464666.1 94101.4 10 905331.4 156793.5

Figure 5-2: Numerical values for the average available parallelism for best-ordered uniform game tree of degree 36 for various heights. The ®rst column is the height. The second column is the average available parallelism for the Jamboree algorithm valuing one child before testing the rest in parallel. The third column is the average available parallelism for the Jamboree algorithm serially

valuing two children and then testing the rest in parallel. C critical path is the size of 2h , because the ®rst child is searched, then all the other children are searchedin parallel. Thus, we canignoreall but the ®rst and second child, and they are done serially. The average available parallelism is computed by de®nition.

For chess mid-game positions, it is estimated [Sha50] that the average number of children of a = position is about 36. Plugging d 36 into Equation 5.5 yields the values shown in Figure 5-2. We hope to search to depth 10 or deeper, and there is plenty of parallelism if we can make the search best-ordered. The average available parallelism for a version of Jamboree search that values the ®rst two children serially before testing the rest of the children in parallel is also shown in Figure 5-2. Doing two children serially before testing the rest should reduce the amount of work done (since we are less like to do a lot of work in parallel that turns out to be unnecessary) and reduces the parallelism. If we can achieve good enough move ordering, there appears to be plenty of parallelism even for this more serial version of Jamboree search. The number of children valued serially can be viewed as a tuning parameter. Charles E. Leiserson suggested a strategy in which the number of children valued serially is a function of the depth of the search. The idea is to ®nd more parallelism for shallow searches while improving the work ef®ciency for deep searches. My experiments with this strategy were not conclusive, although

we shall see, in Section 6.5, that a related idea works quite well.

C V V V h

ProofofTheorem5.3: We partition dh into disjoint subtrees 0 1 as shown in Figure 5-3

C V = jV j V

i i i

for the example 35. We let , and the recurrence relation for is

i =

1 if 0,

V = d i =

i if 1, (5.6)

d V + d i

i 2 1 if 1.

i V d

This recurrence is derived by observing that for 1, i can be assembled from copies of

V d V d(d )

i

i 2 plus 1 vertices. To perform this assembly, observe that the root of has 1

(d ) V

great-grandchildren that are each identical to each of the 1 children of the root of of i 2.

d V V d

i Thus copies of i 2 accounts for all of the great grandchildren of , with nodes left over. We

78

V V V V V 0 1 2 3 4 V5

Figure 5-3: To solve for the number of vertices in the critical tree of a best-ordered uniform game

P

h

jV j

tree, we partition the tree as shown, and then compute i . =

i 0

V d d

have not accounted for the root of i , its 1 children, or its 1 grandchildren. Thus we need

d d d V d

2 1 more nodes, and we already have left over from the copies of i 2. Weaddin the 1 hi children needed to account for all of V .

Linear recurrence 5.6 can be solved using generating functions. We are not really interested

V V jC j

i i

in solving for , but we need the generating function for to solve for dh . Recall that the

P

i

v (x) = V x

generating function generating function i is an in®nite symbolic polynomial in =

i 0

i

x V

which the coef®cient of is the value of i . Recall also that

X

d 1

i

(d )x

= 1 x

1 = i 0

and that

X +

2 i 2

dx v (x) = dV x

i =

i 0

(x)

so that with the corrections for the boundary cases thrown in, we ®nd that v satis®es

d 1 2

+ dx (x) = v (x) + x + d

v 2 (5.7) x

1

(x)

Solving for v gives

d + x

d 1 2

v (x) =

2 + 2 (5.8)

x)( dx ) dx

(1 1 1

(x)

Putting v into the canonical form,

bx + c a

+ (x) =

v 2 (5.9)

x dx 1 1

yields the generating function

)x (d +

1 1 2

+ (x) =

v 2 (5.10)

x dx 1 1

79

jC j ( x)

Now to solve for dh we multiply by 1 1 , which Liu [Liu68, Example 2-7] calls `the

jC j

summing operator'. The generating function for dh is

i

X X

i

A

x s(x) = V

j (5.11)

i= =

0 j 0

v (x)

= (5.12) x

1

d + x

d 1 2 +

= 2 2 2 (5.13)

x) ( dx ) ( x)( dx )

(1 1 1 1

+ d)( d)

(3 1 1 +

= 2

x ( x) 1 1

2

( d )( d)x + ( d )( d)

3d 1 3 1 1

+ 2 (5.14) dx

1

C jV j + + jV j h

i.e., the size of dh is the sum of the sizes of the partition, 0 .

h

x jC j

We now need to calculate the coef®cient of in Equation 5.14 to determine dh . For the

+ d)( d)

®rst term the coef®cient is just (3 1 . The second term applies the summing operator to

P

i

x )

( yielding =

i 0

X

i

(x )

i

X X

i=

0 i

A

= x )

( 1 x

1

i= =

0 j 0

X

i

(i )x

= 1 =

i 0

h

(h ) and the coef®cient of x in the summed polynomial is 1 . The third term can be expanded

2

= dx

by substituting y into

1 X

i

y = y

1 = i 0 and multiplying by the top yielding

2

d d )( d)x + ( d )( d) ( 3 1 3 1 1

2 = dx

1

X 2

d d

3d 1 3

i i i+

i 2 2 1

+ d x d

x (5.15)

d

d 1 1 =

i 0 h

The coef®cient of x in Equation 5.15 is computed as follows:

h2

h h = i ( d )d (d )

for even , the only possible match is 2 giving 3 1 1 as the coef®cient h

of x .

(h

2 1) 2

h h = i + ( d d )d ( d) for odd , the only possible match is 2 1 giving 3 1 .

Summing up the coef®cients contributed by each term of Equation 5.14 yields Equation 5.5, proving Theorem 5.3.

80 5.3 Analysis of Worst-Ordered Game Trees

Surprisingly, for worst-ordered uniform game trees, the speedup of Jamboree search over serial

- search turns out to be just under 1. That is, Jamboree search is worse than serial - search. For comparison, parallelized negamax search achieves linear speedup on worst-ordered trees, and

Fishburn's MWF algorithm achieves not-quite linear speedup on worst-ordered trees [Fis84]. d

De®nition 5.5 We denote the worst-ordered uniform game-tree of height h with children at each

W W dh

internal position as dh . The value of must be ®nite. W

Theorem 5.6 For Jamboree search on dh :

The critical path length is

2

d ( + d) + d

1 6 4 2

h h

2

h d +

2 if is even,

2 2

( d)( d ) d d

1 2 2 1

C (W ) =

dh (5.16)

2

d ( + d) + d

1 8 6 2

(h )

h 1 2

d + h

2 2 2 if is odd.

d)( d ) d d (1 2 2 1

The work performed is

d + d d

3 1 3 h

h 2

d d + h

if is even,

d d

d 1 1 1

W (W ) =

dh (5.17)

d + d d

3 3

h (h+ )

1 2

d + h

d if is odd.

d d

d 1 1 1

d h W

Corollary 5.7 Asymptotically, as and go to in®nity, for Jamboree search on dh ,

the work ef®ciency as compared to serial - search is 1 3,

the average available parallelism is 3, and

the speedup is bounded above by 1. In fact, as Figure 5-4 shows, the asymptotes are approached quite quickly for game trees of

degree 36.

Proof of Corollary 5.7: The work performed by serial - search on worst-ordered trees is exactly

h

W = d h

s . For any ®xed , we have W

s 1

lim =

d W

dh 3

+ h h+

h 1 2 2

W = d (d ) Ω(d ) C = d ( + d)((

The work is 3 1 , and the critical path is 1 1

2 h2

d)( d )) + O ( )

2 2 . Thus, the average available parallelism is

+ h

h 1 2

(d ) Ω(d )

3d 1

W C =

+ h

h 2 2 2

( + d)((d )(d )) + O (d ) d 1 1 2 3

2

( + d)(d ) d 1 2

2

d ) 3( 2

= 2

d + d 3

81 h Work Average Speedup ef®ciency available over Parallelism Serial 0 1.000000 1.000000 1.000000 1 0.513889 1.894737 0.973684 2 0.342850 2.833819 0.971574 3 0.336708 2.885218 0.971477 4 0.333593 2.912146 0.971472 5 0.333426 2.913602 0.971471 6 0.333341 2.914351 0.971471 7 0.333336 2.914392 0.971471 8 0.333334 2.914413 0.971471 9 0.333333 2.914414 0.971471 10 0.333333 2.914414 0.971471

Figure 5-4: The work-ef®ciency, average available parallelism, and speedup over the serial code W of Jamboree search on 36h , the worst-ordered uniform game-tree in which each internal position

has 36 children.

h h

= d + O (d )

The time to execute the parallel code is bounded below by C 1 , so the speedup,

W C

which is limited by s , approaches 1 from below.

C W dh

Proof of Theorem 5.6: The critical path, h for satis®es

=

1 if h 0,

C =

h (5.18)

+ dC + O h

h 1 h 1 1 if 0;

where

h =

1 if 0,

O = h =

h 3 if 1, (5.19)

+ O h

3 2 h 2 if 1.

For the case of h 0, this Equation 5.18 can be derived as follows:

W

The 1 comes from searching the root of dh .

C Jamboree search, on a worst-ordered tree, values the ®rst child, contributing h 1.

Then, Jamboree search tests all the other children in parallel. Since the tree is uniform, the critical path of all the tests is the same, so just consider a single one of the tests. The critical path of a test that fails is givenin Equation 5.19. For a test that fails, the ®rst child is searched with a test that succeeds, then all the other children are searched with a test that succeeds.

Finally after all the tests fail, the rest of the children are serially researched contributing

(d )C

another 1 h 1. O

To solve Equation 5.18 we again use generating functions. The generating function for h is

3 2

+ (x) = x O (x)

O 2 2 (5.20) x 1

82

which yields x

1 + 2

O (x) =

2 (5.21)

x)( x )

(1 1 2 C

The generating function for h is

1

C (x) = dxC (x) + xO (x)

+ (5.22) x 1

which yields x

1

(x) =

C 2 (5.23)

dx)( x)( x ) (1 1 1 2

2 2

( + d)(( d)( d )) ( d)

d 1 1 2 2 1

+ =

dx x 1 1

2 2

( d)( d ) + ( + d)x( d )

6 + 4 2 8 6 2

+ 2 (5.24) x 1 2 Reading the coef®cients from Equation 5.24 as we did on Page 80, gives

2

d ( + d) + d

2 1 6 4

h h

2

+ d + h

2 if is even,

2 2

d ( d)( d ) d

1 1 2 2

C =

h (5.25)

2

d ( + d) + d

2 1 8 6

(h )

h 1 2

+ d + h

2 2 2 if is odd.

d ( d)( d ) d 1 1 2 2

To get the formula for the work done on worst-ordered trees, we use this linear recurrence for

W h

h , the work done on a tree of depth :

=

1 if h 0,

W =

h (5.26)

+ dW + (d )A h

h 1 h 1 1 1 if 0;

where

h =

1 if 0,

A = d + h =

h 1 if 1, (5.27)

d + + dA h 1 h 2 if 1.

We obtain these recurrence relations by observing that the Jamboree algorithm ®rst searches the root

W dA

h

of the tree (1), then searches the ®rst child ( h 1), then tests each of the other children ( 1),

(d )W

then researches each of the tested children 1 h 1. The amount of work done when testing a

d

tree of depth h 1, when the test fails, is 1 for the root, for each of the children each of has the

dA

test succeed, so they only test one grandchild each, and each grandchild's test fails ( h 2.) A

The generating function for h is +

d 1 2

A(x) = A(x) d dx

+ (5.28) x

1 dx 1 +

= 2 (5.29)

x)( dx ) (1 1

83 W

The generating function for h is

1

W (x) = )xA(x) dxW (x) + (d

+ 1 (5.30) x 1

2 2 2

x + dx + dx + d x 1 2

= 2 (5.31)

x)( dx)( dx )

(1 1 1

d + d)x + ( + d) d (

1 3 1 d 3 1 3 1

= +

+ 2 (5.32)

x d dx d dx

d 1 1 1 1 1 1

(x)

Reading the coef®cients for W (as we did on Page 80) gives

d + d d

3 1 3 i

i2

+ d + h d

if is even,

d d

d 1 1 1

W =

h (5.33)

d d + d) d(

3 3

(i ) i

1 2

d d + h

+ if is odd;

d d d 1 1 1 thus proving the theorem.

5.4 Jamboree Search on Real Chess Positions

We now understand the Jamboree search for two special cases of search trees: best-ordered and worst-ordered uniform trees. Real chess trees are more complex, however. They have nonuniform depth anddegree, andthey are `mostly' bestordered. In this section we study the critical path length, total work, and work ef®ciency for real chess trees. We start with a discussion of the dif®culties with measuring Jamboree search on real chess trees. We then show that for real chess trees the critical path is short compared to the total work, which means that there is plenty of parallelism and that the total work produced by the algorithm compares favorably to a serial implementation. It is dif®cult to analyze Jamboree search for arbitrary game trees, because it is dif®cult to characterize the tree itself, and the tree that is actually searched can depend on how the work is scheduled. Unlike many other applications, the shape of the tree traversed by Jamboree search can be affected by the order of the executionof the work, sometimes increasing the work and sometimes decreasing work. Thus, measurements of ªcritical path lengthº and ªworkº on a particular run may be different than the measurements taken on another run, because the trees themselves are different. It is not clear what ªcritical pathº and ªworkº mean for Jamboree search on arbitrary trees. The work, as measured on one run, might be more or less than the work on another run. Sometimes more work is done by a fully parallel Jamboree search than by serial scheduling of Jamboree search. For example, as shown in Figure 5-5, if the second subtree of a position fails high, fully-parallel Jamboree search may still search the third and subsequent subtrees, whereas a serial scheduling would allow Jamboree to stop after examining the second child. More generally, Jamboree search might start searching the later subtrees and then abort those searches partway through. A serial schedule would never have started searching those later subtrees. Sometimes less work is done by fully parallel Jamboree searchthan wouldbe doneundera serial

schedule. Figure 5-6 shows such an example. In this case, the serial schedule ®nishes searching F

the large subtree E before quickly ®nding out that subtree would fail high. Under a fully parallel F

schedule, on the other hand, the scheduler might expand the search of D and in parallel, obtain E the result for F , and abort the subsearch of . If the tree had been best-ordered, both the serial and the parallel algorithm would be guaranteed to do the same amount of work. In this case,

84 =

50 =

score 40

score = =

20 score 100

C

A B

Figure 5-5: Jamboree search sometimes does more work than a serial algorithm. In this case the value of is 50, so as soon as a child is found with score 50, a serial algorithm would stop

examining children. Jamboree search, like a serial algorithm, searches the ®rst child, A, obtaining

C C

a value of 20. Then Jamboree searches B and in parallel, aborting the search of only when C the value of B is determined. A serial algorithm would never search . the breadth-®rst search order used by a fully parallel schedule is buying performance. The work done by one scheduling of Jamboree search is generally incomparable to the work done by another scheduling. The situation for the critical path length is a little better than for the estimate of work. Measuring the ªcritical path lengthº using any schedule we want yields an upper bound to the true in®nite- processor critical path length. The critical path length of a computation is generally accepted to be the time it takes for an in®nite number of processors with a perfect scheduler to execute the program. I obtain an estimate to the in®nite-processor critical-path length while running the StarTech program by using a time- stamping scheme. We can view Jamboree search as a data¯ow graph, as shown in Figure 5-7. The arcs of the data¯ow graph carry control information, so that when a token arrives at a procedure call, the procedure may start, and when the procedure ®nishes, a token is delivered at the output of the call. In general, one can determine the time that any particular instruction would execute on perfect in®nite-processor schedule by time-stamping each token. As shown in Figure 5-8, the time-stamp of a token on the output of an instruction can be computed by taking the maximum of the time-stamps of the tokens on the inputs of the instruction and adding the time it takes to execute the instruction. My measurements indicate that 85% to 95% of the positions encountered in the tree by StarTech are best-ordered. That is, the move-ordering heuristics of StarTech guess which move is the best move about 85% to 95% of the time. The precise rate depends on which position being searched. For Jamboree search, the measured critical path length is an overestimate of the critical path length for an in®nite processor machine with a perfect greedy scheduler. To see why this is so, consider the data¯ow graph shown in Figure 5-9, which includes a first operator that passes the ®rst result that arrives to its output. When one subgraph completes, the other subgraph can be

85 =

50 =

score 100

score = =

20 score 40

F

D E

Figure 5-6: Jamboree search sometimes does less work than a serial algorithm. The value of is E 50. The ®rst child, D is searched and the score is determined to be 20. The second child, , is

a large subtree with a score of 40. Jamboree search might search the third child, F , which has a

score of at least 100, before E has been ®nished, and Jamboree might abort the expensive search

of E .

86

V V

Value child i

0 i

T

Test child i

i

T

T T T

1 2 3 k 1

Merge

Test

Fork

Join

V

V V V 1 2 3 k 1

Figure 5-7: The data¯ow graph for Jamboree search. First Child 0 is searched to determine its value, then the rest of the children are tested in parallel to try to prove that they are worse choices than Child 0, and then each of the children that fail their respective tests are serially researched. This data¯ow graph can be used to measure the critical path length of the computation by using time-stamping. Compare this description of the Jamboree algorithm to the textual description in Figure 4-5. aborted, since the first operator ignores its result. The in®nite processor critical path of this data¯ow graph is the minimum of the critical paths of the two subgraphs. An implementation that aborts the second subcomputation may be throwing away the result with the shorter critical path, and then the only critical-path length information it has will be an overestimate. Such a situation arises in Jamboreesearch after the searchof the ®rst child is complete and the rest of the children are searched in parallel. Some of those children may fail high, aborting all of the rest of the children, and the only critical path information available is an overestimate of the critical path. Thus, the empirical critical path length measurements overestimate the run time of the program on an in®nite processor machine with a perfect greedy scheduler. Fortunately, my empirical measurements turn out to be good enough to allow us to tune the program. I measured the performance3 of StarTech running on a problem set of 25 chess positions designed by International Master L. Kaufman. Several other chess problem sets have appeared in the literature to test the skill of a chess program. The Bratko-Kopec set [KB82], one of the earliest test sets published for computers, was designed to show the de®ciencies of a program rather than to estimate the program's strength. Feldmann et al. [FMM90] found that the Bratko- Kopec test set could not differentiate between master-level chess programs, and so they picked a collection of positions from actual games they had played to measure the performance of their program. Kaufman's problem set [Kau92, Kau93] was speci®cally designed to estimate the rating of

3I measured the run-time, the total work and, the critical path length, using the fast user-level timers described in Section 3.3. To avoid the problems described in Section 3.3 I performed the experiments on the CM-5 in dedicated (single-user) mode. To measure the critical path length, I used the timestamping technique described above.

87

(d t ) d t )

1 1 ( 2 2

(d d + (t t ))

1 2 max 1 2

Figure 5-8: The time at which an instruction in a data¯ow graph is executed in a perfect in®nite-

processor schedule can be computed by timestamping the tokens. In addition to the normal data-

d d d

value of a token (d1, 2, and 1 2 respectively in the ®gure), the token includes a timestamp

t t + (t t )

( 1, 2, and max 1 2 respectively.) The timestamp on the outgoing token is computed as a function of the timestamps of the incoming tokens and the time to execute the instruction.

Subgraph A

FIRST

Subgraph B

Figure 5-9: Two subgraphs of a data¯ow graph compute in parallel. One keeps only the ®rst result, aborting the computation in progress on the second subgraph. In this situation a reasonable estimate of the critical path is the critical path of the completed subgraph, even though that may be an overestimate. For example it is possible that Subgraph A ®nishes its computation ®rst even though the critical path of Subgraph B is shorter than of Subgraph A. a program that plays master-level chess. Kaufman uses 25 chess positions (20 tactical, 5 positional), each of which has a `correct' answer. The program is timed on each position to determine how long it takes to decide to play correct move.4 The measured work increases with the size of the CM-5 on which we the program is running. Figure 5-10 shows how the measured work varies with the machine size for each of several executions of each of the 25 chess positions. The work ef®ciency of the program running on large machines does better on longer runs than on shorter runs. Figure 5-12 shows a scatter plot of the work

4For a precise explanation of Kaufman's test, see [Kau92, Kau93].

88 ef®ciency against the serial run time, for each of the machine sizes. For the larger machines, the work ef®ciency never is as good as for the smaller machines, but the program does always run faster

on a big machines. The sample correlation coef®cient (see for example [HL93, page 51])

P P P

n x y x y

i i i

i ( ) ( )

r

r =

i h i h

P P P

P 2 2 2 2

y n n x x y i

( i ) ( )

i i of the work ef®ciency to the run time is shown for each machine size. Thus, while the ef®ciency drops as the machine gets larger, if we ®x the machine size, and let problems run longer, the ef®ciency improves. Surprisingly, the measured critical path length increases as the number of processors grows. Figure 5-11 shows how the measured critical path length varies with the machine size for each of 25 different chess positions. We argued above that the critical path length is a lower bound to the in®nite processor critical path length. I believe that the increase in measured work and measured critical path length can be largely explained by transposition table effects. StarTech uses a global transposition table, the implemen- tation of which is outlined in Section 6.3, to cache the results of recent searches. The transposition table improves ef®ciency in two ways. The ®rst ef®ciency improvement is due to the reconvergence of the search. The same position may be encountered along two different variations, in which case the transposition table can immediately return the result for the second search without performing the search. The second ef®ciency improvement is due to the use of the transposition table as a move ordering heuristic. A previous search of the same position to a shallower depth is not good enough to shortcut the search, but StarTech can use the information about which move was best for the shallower search to improve the probability that StarTech considers the best move on the deeper search. In a larger machine, the transposition table is less effective than in a smaller, since a search that is performed serially on a smaller machine may be performed in parallel on a larger machine. On the other hand, the larger machine has a larger transposition table, which lowers the probability of some useful value being removed due to hash-key collisions.5 These effects, plus the fact that my critical path measurements do not accurately re¯ect de- pendencies through the transposition table, mean that for larger machines, the measured critical path tends to increase. In contrast, we argued earlier that the measured critical path length is an upper bound to the in®nite processor critical path. The global transposition table is essential to the performance of the program, but the nondeterminacy introduced by the table can make it dif®cult to understand the program. The scheduler and the Jamboree algorithm have positive interactions. Abstractly, the scheduler and the algorithm are separated. But in practice, there are interactions between them. If there is more parallelism than there are processors, then processors tend to do their work locally, effectively creating a larger grain size, and the ef®ciency of the underlying serial algorithm becomes the determining performance factor. The StarTech scheduler attempts to steals work that is near the root of the game tree, rather than work that is near the leaves. By stealing work near the root of the game tree, the size of stolen work is increased. On the other hand, if work is stolen that later is determined not to have been useful, more processor cycles are wasted. I found that for a given tree search, the average size of stolen work is larger for smaller machines.

In summary, the the critical path length and total work are stableenoughtobeusefulforalgorithm

5To make the transposition table equally effective for large and small machines, we shall explore, in Chapter 6, a transposition table modi®cation called deferred-reads.

89 k01 k04 k07 k02 k24 32482.6 25874.1 31099.0 22133.7 15045.7

S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512

k15 k25 k03 k06 k14 7136.0 5501.4 9536.3 7486.2 3136.1

S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512

k17 k08 k10 k05 k12 1659.0 2073.1 1460.0 1433.8 1341.7

S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512

k18 k20 k09 k19 k11 947.7 520.3 1888.6 560.7 269.3

S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512

k13 k16 k23 k21 k22 455.6 34.9 7.9 17.1 0.6

S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512 S 1 2 4 8 16 32 128 512

Figure 5-10: Thetotal work of eachof Kaufman's 25test positions, as measured on various machine sizes. Each box represents one test position. The positions are named k01 through k25. The horizontal axis on each graph is the machine size (`S' denotes my best serial implementation). The vertical axis is the total work executed, in processor-seconds. The range of the total work for each position is shown at the left, just above the graph for that position. The vertical axis is scaled to that range. Each plotted point corresponds to a single measured execution. The positions are plotted in descending order according to the time taken by the serial implementation.

90 k01 k04 k07 k02 k24 22.5 15.1 21.7 14.6 20.2

1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512

k15 k25 k03 k06 k14 5.7 20.8 15.7 17.5 6.4

1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512

k17 k08 k10 k05 k12 4.5 5.8 4.5 4.6 4.8

1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512

k18 k20 k09 k19 k11 4.4 4.5 3.4 5.9 4.1

1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512

k13 k16 k23 k21 k22 2.9 3.8 1.0 4.7 0.4

1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512 1 2 4 8 16 32 128 512

Figure 5-11: The critical path of each of Kaufman's 25 test positions, as measured on various machine sizes. Each box represents one test position. The positions are named k01 through k25. The horizontal axis on each graph is the machine size. The vertical axis is the critical path length, in seconds. The range of the critical path length for each position is shown at the left, just above the graph for that position. The vertical axis is scaled to that range. Each plotted point corresponds to a single measured execution. The positions are plotted in descending order according to the time taken by the serial implementation.

91 1 processors 2 processors 4 processors 8 processors

r=0.433 r=0.381 r=0.513 r=0.308

1 1 1 1

0 0 0 0 1 100 10000 1 100 10000 1 100 10000 1 100 10000 16 processors 32 processors 64 processors 128 processors

r=0.439 r=0.522 r=0.606 r=0.664

1 1 1 1

0 0 0 0 1 100 10000 1 100 10000 1 100 10000 1 100 10000 256 processors 512 processors

r=0.513 r=0.726

1 1

0 0 1 100 10000 1 100 10000

Figure 5-12: A scatter plot of the serial run time against the work ef®ciency. The horizontal axis of each scatter plot is the serial run time in seconds. The vertical axis is the work ef®ciency (the ratio of the serial run time to the parallel work.) The scales of both axes are the same for each scatter plot.

One scatter plot is shown for each machine size. The sample correlation coef®cient (r ) is shown in the upper left hand corner of each scatter plot.

92 design, even though they do vary with the machine size. The average available parallelism and work ef®ciency of StarTech are both good enough to achieve signi®cant speedup on chess problems.

5.5 Scheduling Parallel Computer Chess is Demanding

At this point, we have found that the performance of StarTech, as measured by critical path and total work, is quite promising. Before studying the performance of the scheduler, this section presents evidence that scheduling parallel computer chess is potentially very dif®cult. Other researchers have studied the performance of schedulers on easy problems. For example, using a highly parallel, uniform, program such as nfib6 to measure the success of a scheduler does not really present compelling evidence that the scheduler actually addresses any dif®cult scheduling problems. Our evidence that computer chess is dif®cult to schedule comes from an examination of the ideal- parallelism pro®le of StarTech. Consider the ideal-parallelism pro®le, shown in Figure 5-13. On a parallel machine with an unbounded number of processors and no scheduling overhead, the program would take 39 seconds to run. The total work (which is the area under the curve) is about 10000 processor-seconds. The total work is comparable to the time taken by the serial program, which is about 9000 seconds

using my best serial implementation running on a single processor of the CM-5. While the average = available parallelism is 1000039 256, for most of those 39 seconds, there is very little available parallelism. Figure 5-14 shows the data-points of the parallelism pro®le sorted from least-parallel to most-parallel, and plotted on a logarithmic scale, to make it easier to see how much of the 39 seconds is spent with how much parallelism. For a total of 18.4 seconds, the available parallelism is less than 16, and for 8.5 seconds the parallelism is less than 4. It is important that the program do a good job during those serial sections so that not much real wall-time is used up on the serial parts of the computation. In general, an ideal-parallelism pro®le shows how a program would run on an in®nite number of processors with perfect load balancing. The only constraint on the time-to-completion is the depth of the data-dependencygraph. The horizontal axis is the time on the ideal machine. The parallelism curve shows how much parallelism is available at any given time, i.e., how many processors can be used at any given time. The length of the horizontal axis (the critical path length) shows how long the program runs (39 seconds in this case). The area under the curve (about 10000 processor- seconds) is the total number of processor-seconds that processors actually do useful work. The average available parallelism is the total work divided by the critical path length (in this case, 256

processors). As we shall see, for this example one may reasonably hope to use up to 256 processors = at 50% ef®ciency, i.e., to ®nish the actual execution in 39 2 78 seconds. The concept of an ªideal-parallelism pro®leº does not quite apply directly to Startech, since the work that is actually done depends on the scheduler. The program produces the same answer every time, but it may get that answer by expanding different trees, the details of which depend on low-level timing effects, on the number of processors, and on randomization in the search algorithm itself. One can really only compute an approximation to the ideal-parallelism pro®le, but I have found that this approximation tends to be reasonably accurate as the machine size is varied and over multiple runs of the same program. The scheduling of the ideal work onto processors is shown by the processor utilization pro®le in Figure 5-15. The sharp spikes in the parallelism pro®le translate into wide horizontal regions where the machine is working ef®ciently (this is the saturated-processor region of the processor

6The nfib program is a doubly recursive routine to compute Fibonacci numbers.

93 25923

20000

15000

10000 Ideal parallelism

5000

0 0 5 sec 10 sec 15 sec 20 sec 25 sec 30 sec 35 sec 39 sec Critical-path depth

Figure 5-13: The ideal-parallelism pro®le of a typical chess position has many nearly serial sections. The ideal-parallelism pro®le is the number of threads that can be executed at any given moment in an ideal parallel machine with an in®nite number of processors. The critical path length for this example is about 39 seconds, and the total work performed (the area under the curve) is about 10000 seconds.

25923

10000 5000

2000 1000 500

200 100 50

Ideal parallelism 20 10 5

2 example 1

0 5 sec 10 sec 15 sec 20 sec 25 sec 30 sec 35 sec 39 sec Number of seconds of critical-path time

Figure 5-14: The sorted ideal-parallelism pro®le shows how the ideal parallelism is distributed. For example, (shown with arrows) for 10 seconds there is less than ®ve-fold parallelism.

94 30

25

20

15

10 Number of processors busy

5

0 0 100 sec 200 sec 300 sec 434 Wall time

Figure 5-15: The processor utilization pro®le of StarTech running on a typical chess position. The horizontal axis is the wall time, and the height of the curve is the number of busy processors. The shaded area under the curve represents processor-cycles spent on useful work. utilization pro®le). Note that even during the saturated processor region, there is a high frequency jitter resulting in something under a 10% inef®ciency. In Section 5.8 we shall explore where this inef®ciency arises. Note also that that the unsaturated regions of the processor utilization pro®le (which correspond to the regions of low parallelism in the ideal-parallelism pro®le) comprise a signi®cant fraction of the work. The scheduler most do a good job during the unsaturated region. The most natural way to express a parallel search is as an asynchronous program in which one logical process is associated with each position in the game tree and processes communicate with each other. The tree is too large to keep in memory, and so the tree must be searched with a constraint on how many positions can be expanded at any given time. While processing a position, operations must be performed on global data structures. Thus, the issues for implementing a parallel computer chess program include the following. The game tree to be searched is nonuniform and too large to ®t in memory. The best serial algorithms for game tree search are dif®cult to parallelize without losing their ef®ciency. The most effective parallel game-tree search algorithms are nondeterministic in that the precise tree searched depends on the number of processors, the message passing time, and many other factors. The parallel algorithms that seem to work well are sequential for large fractions of their critical paths. There are many factors in such a mishmash of calculation and communication that contribute or detract from the performance of the program. The challenge is to take such a program and make it have predictable, high performance with small memory requirements. If the program runs slowly, we want to be able to understand why, so that we can improve it. My approach is to instrument the program to measure the total work and to estimate the critical path length. The critical path length and the total work give us bounds on

95 the time to run the program on a given size machine. Although those measures are not precisely repeatable (because the tree itself may vary from run to run), they are consistent enough to give us a handle on the behavior of the system. Having characterized the algorithm, I address the scheduling issue by providing a scheduler, separate from the algorithm, which is guaranteed to use no more memory per processor than a serial implementation uses. In other words, we can make the chess program predictable by providing some measurement tools and some scheduling guarantees.

5.6 Performance of the StarTech Scheduler

Now that we have seen that computer chess has a demanding parallelism pro®le, we return to the

simple summarization of the parallelism pro®le provided by the total work W and the critical path

W C C . In this section we demonstrate that and can be used to predict the runtime of StarTech. As a consequence, for example, we can measure the program on a small machine and predict the performance on a big machine.

I measured the performance of StarTech running on P processors running on each of Kaufman's 25 test positions on each power-of-2 machine size from 1 to 512. I then performed a curve ®tting using a weighted linear regression program7 which minimized the least square difference

X 1 Ã 2

( ( T T ))

i i

i i

Ã

T T i

i i where i is the value of predicted by the model for the th trial and 1 is a weight that allows me to minimize the relative errors. For example, I consider a prediction of 1100 seconds on a run

of 1000 seconds to be as good as a prediction of 11 seconds on an actual run of 10 seconds. In both

= T i cases, the relative error is 10%. I set i to achieve this relative ®t. I found that the performance of StarTech is modeled by

à W

T = ( ) + ( ) )C (

4 303 0 167 1 024 0 019 + 1 521 0 106 (5.34) P

with a 95% con®dence level. The model has the following statistical properties on the training data:

=

R 0 998717

MRE = 0 03855

MAXE = 0 6880

where R is the sample correlation coef®cient, MRE is the mean relative error (the geometric mean of the relative errors), and MAXE is the maximum relative error. The maximum relative error tells us how bad the model is for the worst data point. To test the performance model, I ran the StarTech program on a different collection of chess problems8 on various machine sizes. The model provides an excellent match to the test data. The

statistical properties for the model on the test data are

= R 0 999195

7The weighted linear regression program was adapted by Eric A. Brewer from the singular value decomposition curve ®tting code in Numerical Recipes [PTV*92]. 8In this case, a collection of 20 problems was provided by H. Berliner.

96

MRE = 0 061

MAXE = 0 880

I also tried improving the model by including more basis functions. I tried ®tting the curve to the form

Ã

= A + A W P + A C + A P + A C P

T 0 1 2 3 4 lg

fC W P g + A fC W P g + A D + A W A + 5 max 6 min 7 8

25

X

+ k A

i

+i)

(8 (5.35) = i 1

where

A

The i 's are coef®cients to be found via curve-®tting,

à T

is the run time predicted by the model. W

is the measured total work, C

is the measured critical path length, P

is the number of processors, D

is the search-depth for a position, and

k i k = i k =

i i

i is a indicator variable for position , i.e., 1 for tests on position , and 0 for tests

= i on position j .

For eachof thoseterms, we canmakeanargumentthat theymight matter. It might be that depending on the chess position, there is a different constant startup overhead. R. Blumofe [Blu94] made a

theoretical argument that the critical path might need to be multiplied by lg P . The lower bound W P arguments suggest that the minimum or the maximum of C and might be important. The

addition of all those produced a model with a sample correlation coef®cient of 0999168, a mean

relative error of 002493, and a maximum relative error of 0.6749, all of which are only marginally better than the simpler model of Equation 5.34. In order to demonstrate that none of those variables is really important, we can look at the residuals as function of each of those variables. The residual is the difference between the actual running time and the running time predicted by the model. The residuals should look like a random function of any particular variable. Figure 5-16 shows the relative residualsÐthe residual divided

by the actual running timeÐplotted against all of the variables considered in the grandiose model k

of Equation 5.35. All of the position indicator variables i are shown on one graph as `position', which is sorted by the serial running time with recursive iterative deepening. A positive residual indicates that the model understates the running time, while a negative residual indicates that the model overstates the running time.

The residual plots show the effect of failing to include various terms. k Every position indicator variable, except for 3, (which is shown as position 3 in the `resid- uals(Position)' plot), has some datapoints with positive residuals and some with negative

97 0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4

0.2 0.2 0.2

0 0 0

-0.2 -0.2 -0.2

-0.4 -0.4 -0.4

-0.6 -0.6 -0.6

0 2000 4000 6000 8000 10000 12000 0 5 10 15 20 25 0 2000 4000 6000 8000 10000 12000

T ) (C ) (W P ) residuals( residuals residuals

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4

0.2 0.2 0.2

0 0 0

-0.2 -0.2 -0.2

-0.4 -0.4 -0.4

-0.6 -0.6 -0.6

1 2 4 8 16 32 128 512 0 50 100 150 200 250 0 5000 10000 15000 20000 25000 30000 35000

P ) (C P ) (W ) residuals( residuals lg residuals

0.8 0.8 0.8

0.6 0.6 0.6

0.4 0.4 0.4

0.2 0.2 0.2

0 0 0

-0.2 -0.2 -0.2

-0.4 -0.4 -0.4

-0.6 -0.6 -0.6

4 1 7 2 24 15 3 6 25 14 8 17 10 5 18 12 9 20 11 13 19 16 23 21 22 0 2000 4000 6000 8000 10000 12000 0 5 10 15 20 25

) ( fC W P g) ( fC W P g) residuals(Position residuals max residuals min

0.8

0.6

0.4

0.2

0

-0.2

-0.4

-0.6

2 3 4 5 6 7 8 9 10 11 ) residuals(depth-of-search

Figure 5-16: Plots of the relative residuals of Equation 5.4 as a function of various quantities: time

C W P T , measured critical path length , total work , number of processors , position (sorted by serial running time), and depth-of-search.

98 k residuals. Position k3 has negative residuals only, but even for 3 the maximum relative error

is only about 22%.

C P

There is some evidence that the lg term actually may be important. For very small values

P C P of C lg , the residuals tend to be positive, while for the largest values of lg (of which there are not very many data points) tend to have negative residuals. The skew is small, but

statistically signi®cant. W The decision to remove the term has a similar skew. For small amounts of work, the residuals are positive and for large amounts of measured work, the residuals are negative.

The skew as a function of W is not statistically signi®cant, however.

The residuals as a function of the remaining terms look pretty random.

C P I tried ®tting to a model that included k3 and lg , and obtained

Ã

= ( ) + ( )W P + ( )C

T 3 986 0 146 0 9703 0 0179 3 639 0 387

( )C P + ( )k

0264 0 0465 lg 15 930 5 791 3

CORR = 0 999052

MRE = 0 02999

MAXE = 0 6621

W P C P

The values of the C and coef®cients change quite a bit, which is due to the fact that lg

W P P =

is strongly correlated to both C and . For the largest machine we measured ( 512) the

P C

C lg term is still dominated by the term. Further experiments on even larger machines might P help to answer the question of whether C lg is signi®cant. There is also fairly strong evidence that one of the positions (Position 3) has a 10 to 20 second startup cost. If we try to simplify the model even more, it no longer accurately models the performance.

Getting rid of the W P term yields

Ã

= ( ) + ( )C

T 3 682 4 465 3 978 2 647

CORR = 0 960681

MRE = 0 3570

MAXE = 1 4696

While the sample correlation coef®cient is not terrible, the mean relative error has jumped up by an order of magnitude, the con®dence intervals for the coef®cients are very wide, and the residuals as

a function of W P are very skewed (see Figure 5-17.) W P Removing the C does not look as bad as removing .

Ã

= ( ) + ( )W P

T 5 566 0 539 1 140 0 0678

CORR = 0 996043

MRE = 0 08977

MAXE = 0 8462

The residuals as a function of P are increasing, however. (See Figure 5-18.) Recall that a positive

residual means that the model understates the runtime. Thus,ifwedonotincludethe W P term, for large numbers of processors (where the critical path length is important) the runtime is understated

99 1

0.5

0

-0.5

-1

-1.5

-2

0 2000 4000 6000 8000 10000 12000

+ C W P Figure 5-17: Plot of the relative residuals of 3682 3 978 as a function of . The residuals

are clearly an increasing function of W P .

and for small numbers of processors the runtime is overstated. To get a good model, we really do W P need both the C and terms. Those two terms, as shown in Equation 5.34, provide an excellent model of the performance of the system. Except for the 4.3-second constant term, the StarTech scheduler is within a small factor of

Ã

= C + W P optimal. If we take T 1 5 1 02 as the model, then can conclude that the scheduler is

with a factor of 2.52 of optimal, following a two case analysis.

W P C

If , then

W W W

C + +

1 5 1 02 1 5 1 02

P P P

W

=

2 52 P

Since W P is a lower bound to the run time, we are within 2.52 of optimal in this case.

W P C

If then, similarly

W

C + C + C

1 5 1 02 1 5 1 02

P

C = 2 52

Since C is a lower bound to the run time, this case is also within a factor of 2.52 of optimal. The 4.3-second constant term is probably mostly an artifact of the measurement strategy that I used. StarTech, like Hitech, performs several seconds of precomputation on the front-end before starting up the game tree search. For example, during the the program ®lls in tables that are used for the static evaluation function. My critical path length measurements do not include this

100 1

0.8

0.6

0.4

0.2

0

-0.2

-0.4

-0.6

-0.8

1 2 4 8 16 32 128 512

+ W P P Figure 5-18: Plot of the relative residuals of 5566 1 14 as a function of . precomputation. My run-time measurements, however, do include the precomputation. Since the precomputation is performed serially before the game tree search is started, the precomputation cost should be added to the critical path length. If I had correctly measured the critical path length, the constant term would probably be even smaller.

5.7 Swamping

We have studied the performance of the StarTech scheduler, which includes for example the global throttling mechanism. In this section, we analyze the swamping problem, which leads us to the conclusion that global synchronization can help schedule any dynamic MIMD-style program. A naive work-stealing scheduler may not be able to guarantee that the computation makes progress. In such a work-stealing scheduler, each of the idle processors sends a request for work to another processor, and if the request is denied, then the idle processor repeats, sending a request to yet another processor. If there is a single busy processor that has no work to spawn, that processor can end up spending all its time dealing with incoming requests. By the time it has dealt with a request, another request may have arrive. In this case, the processor is swamped by all the requests for work. The swamping problem has been reported or alluded to several times in the literature. R. Finkel and U. Manber observed ªnetwork ¯oodingº in DIB [FM87, p.243] when the program is near termination. The DIB package solves the problem by introducing constant delays between requests for work. The implementors of Zugzwang chess program observed swamping throughout the search [FM93], resulting in unstable performance. To solve the problem, they tested two strategies that force the stealing processor to wait before sending out another request. They found that for Zugzwang, waiting a constant time between every request works, but it is tricky to pick the right

101 constant. Too small a constant results in swamping, and too large a constant results in poor load balancing. They tried an adaptive strategy, where, after receiving a certain number of ªdenialsº, the stealing processor waits for a constant delay. For Zugzwang, the adaptive strategy with constant backoff stabilizes the performance of the program without too much tuning. To understand under what conditions a processor can be guaranteed to make progress on its local work even in the face of incoming messages, I explored the swamping problem in isolation. I implemented a naive work-stealing tree search program called SWAMP. In the SWAMP program, Processor 0 has about 0.05 seconds of work to do, but there is nothing to steal. Meanwhile all the other processors try to steal work from random processors, but all of their requests to steal work are denied.

The SWAMP program is parameterized by three values:

t

b is the time for a busy processor to service a request to steal work.

t

i is the time for an idle processor to service a request to steal work.

t

s is the time from when a denial message is sent back to a requester to when the requestor's next request message arrives at a processor.

It turns out that the leverage

t

b =

z (5.36)

t + t

s i

is the critical parameter of the swamping program. Figure 5-19 shows the precise relationship between t_b, t_i, and t_s. When a stealing processor (shown in the middle) sends a request to steal, the message eventually arrives at a busy or an idle processor (shown at left). The amount of time that the other processor spends handling the message is called t_b or t_i, depending on whether it is a busy or an idle processor. The response is sent back, the stealing processor receives the denial, and then the stealing processor sends another request to some other processor (at right). We define t_s to be the time from when the first denial was sent to when the second request was received. All of an idle processor's time can be accounted for as being "between requests" or "waiting for a busy or idle processor".

I measured the time it takes for the busy processor to get 0.05 seconds of work done while being interrupted by requests for work. Figure 5-20 shows the situation for a typical set of parameters. In this experiment, I set t_i = 5 microseconds and t_s = 10 microseconds, and I varied t_b and the number of processors. As t_b grows, the time to run the problem rises dramatically. I set an artificial timeout of 30 seconds (which shows up on the graph at about 29 seconds due to startup overheads). As the number of processors grows, the rise in the curve comes sooner and faster. For a run on two processors, the curve grows very slowly. This system can be modeled as a closed Jackson network [Kle76, p. 151], as shown in Figure 5-22.

In this model, we have N processors, each modeled by an exponential server with parameter µ_i. In this throttle experiment, the servers are constant-time servers, but it turns out that a model using exponential servers accurately predicts the behavior of the system. Recall that an exponential server with parameter µ has the property that

    P[service time > x] = e^{-µx},

which means that the average service time is 1/µ. In this model, Processor 0, the busy processor, has server parameter

    µ_0 = 1/t_b,

and all the other processors have µ_i = 1/(t_s + t_i). When a customer leaves a server, the customer goes to a randomly chosen server. The state of this system can be represented as a vector (k_0, ..., k_{N-1}) of the number of customers at each server. Given this setup, the question is, "How long does it take until Processor 0 has not been servicing requests for a total of 0.05 seconds?" This question tells us how long it will take for the processor to get 0.05 seconds worth of useful work done, which relates to understanding what fraction of the time a processor is busy servicing requests instead of getting useful work done. For a closed system, the probability of being in any particular state can be solved by first solving the following set of linear equations for the use x_i at each server:

    x_j = Σ_i x_i r_ij,

where r_ij is the probability that a customer leaving server i proceeds to server j. There are only N - 1 independent equations, and the values of the x_j's are only determined to within a constant. Thus, all that really matters is the ratios.

Figure 5-19: The relationship of t_b, t_i, and t_s in the SWAMP program. A busy processor requires time t_b to service a request to steal, and an idle processor requires time t_i. The time t_s is the round-trip time from when a denial is sent back to an idle requesting processor to when that requester's next request arrives at a processor.

[Figure 5-20 appears here: measured time to solve (with a 30-second timeout) versus the time in microseconds for a busy processor to service a request-to-steal, for P = 2, 4, 8, 16, 32, and 64.]

Figure 5-20: Results for the isolated swamping experiment varying the number of processors and t_b, with t_i = 1 microsecond and t_s = 10 microseconds.

[Figure 5-21 appears here: predicted time to solve (seconds) versus the time to service a request to steal (microseconds), for P = 2, 4, 8, 16, 32, and 64.]

Figure 5-21: The analytic queueing model for the swamping experiment.


Figure 5-22: This Jackson network models the swamping problem. Processor 0 is an exponential server with parameter µ_0 = 1/t_b, and each other processor i (for 1 ≤ i ≤ N - 1) is an exponential server with parameter µ_i = 1/(t_s + t_i). The critical parameter is the leverage z = t_b / (t_s + t_i).

Jackson showed that for such a network of queues, each with a single exponential server, the probability of the system being in a given state is

    p_{(k_0, ..., k_{N-1})} = (1/G(K)) ∏_{i=0}^{N-1} x_i^{k_i},

where the normalization constant is

    G(K) = Σ_{(k_0, ..., k_{N-1}) ∈ A} ∏_{i=0}^{N-1} x_i^{k_i}

and the set of all possible state vectors is

    A = { (k_0, ..., k_{N-1}) : Σ_{i=0}^{N-1} k_i = N - 1 }.

For the system of Figure 5-22, we have r_ij = 1/N. Without loss of generality we set x_i = 1 for i ≠ 0, since x_i = x_j for i, j ≠ 0 and only the ratios of the x_i's matter. Thus, the remaining variable to be solved is x_0. The probability of being in a particular state is

    p_{(k_0, ..., k_{N-1})} = z^{k_0} / G(K),

where

    G(K) = Σ_{(k_0, ..., k_{N-1}) ∈ A} z^{k_0}.


If we define

    Ω(i) = { (k_0, ..., k_{N-1}) ∈ A : k_0 = i },                     (5.37)

then we can rephrase the question as "What is the probability that we are in Ω(0)?" The states in Ω(0) are precisely the states in which Server 0 is not servicing any customer. Observing that for any element k ∈ Ω(0), we have

    p_k = z^{k_0} / G(K) = 1 / G(K),

and thus,

    p_{Ω(0)} = Σ_{k ∈ Ω(0)} p_k = Σ_{k ∈ Ω(0)} 1/G(K) = |Ω(0)| / G(K).

The cardinality of Ω(0) is the number of ways to add up N - 1 nonnegative integers to get N - 1. In general the number of ways to add up M nonnegative integers to get N is the binomial coefficient

    C(N + M - 1, M - 1),                                              (5.38)

and so

    |Ω(0)| = C(2N - 3, N - 2),

and

    G(K) = Σ_{(k_0, ..., k_{N-1}) ∈ A} z^{k_0}
         = Σ_{k_0 = 0}^{N-1} z^{k_0} Σ_{(k_0, k_1, ..., k_{N-1}) ∈ A} 1
         = Σ_{k_0 = 0}^{N-1} z^{k_0} |Ω(k_0)|.

The cardinality of Ω(k_0) is the number of ways to add up N - 1 nonnegative integers to get N - 1 - k_0, so using Equation 5.38 again,

    G(K) = Σ_{i=0}^{N-1} z^i C(2N - 3 - i, N - 2),

yielding

    p_{Ω(0)} = C(2N - 3, N - 2) / Σ_{i=0}^{N-1} z^i C(2N - 3 - i, N - 2).        (5.39)

The answer to our question is that the time it takes for the processor to get w seconds worth of work done is

    t_w = w / p_{Ω(0)}.                                               (5.40)
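To make Equations 5.39 and 5.40 concrete, the following small C program, my own illustrative sketch rather than anything from StarTech, evaluates p_{Ω(0)} and the predicted completion time t_w for a range of machine sizes; the leverage value z = 10 is an arbitrary example.

    /* Numerical check of Equations 5.39 and 5.40 (illustrative sketch). */
    #include <stdio.h>
    #include <math.h>

    /* Binomial coefficient, computed in floating point to avoid overflow. */
    static double choose(int n, int k)
    {
        double r = 1.0;
        for (int i = 1; i <= k; i++)
            r *= (double)(n - k + i) / (double)i;
        return r;
    }

    /* Equation 5.39: probability that Processor 0 is doing its own work. */
    static double p_omega0(int N, double z)
    {
        double G = 0.0;                    /* normalization constant G(K) */
        for (int i = 0; i <= N - 1; i++)
            G += pow(z, i) * choose(2 * N - 3 - i, N - 2);
        return choose(2 * N - 3, N - 2) / G;
    }

    int main(void)
    {
        double w = 0.05;                   /* seconds of local work, as in SWAMP */
        double z = 10.0;                   /* leverage t_b / (t_s + t_i), example value */
        for (int N = 2; N <= 64; N *= 2) {
            double p = p_omega0(N, z);
            printf("N = %2d   p = %.3e   t_w = w/p = %.3f seconds\n", N, p, w / p);
        }
        return 0;
    }

For N = 2 the program reproduces the 1/(z + 1) entry of Figure 5-23, and the larger machine sizes show the rapid collapse of p_{Ω(0)} discussed next.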

Figure 5-23 shows the values of p_{Ω(0)} expanded for a few small values of N. Plugging the values from the experimental setup into the model, we get the graph shown in Figure 5-21, which qualitatively matches the observed behavior of the experimental setup. As the number of processors N increases, the likelihood of Processor 0 getting any work done drops dramatically as a function of the leverage z. Considering just the z^{N-1} term of Equation 5.40, we have

    t_w = w (... + z^{N-1}) / C(2N - 3, N - 2)
        ≥ w z^{N-1} / C(2N - 3, N - 2)
        = w z^{N-1} (N - 2)! (N - 1)! / (2N - 3)!.

Using Stirling's approximation

    n! ≈ √(2πn) (n/e)^n                                               (5.41)

and simplifying,

    t_w ≳ w z^{N-1} √( 2π (N - 2)(N - 1) / (2N - 3) ) · (N - 2)^{N-2} (N - 1)^{N-1} / (2N - 3)^{2N-3}
        ≈ w z^{N-1} √(πN) / 2^{2N-3}
        = 2 w z^{N-1} √(πN) / 4^{N-1}.

Thus, for N processors the rate at which work gets done drops off at least exponentially with N. This phenomenon has not been systematically studied before, probably because most other


    N    p_{Ω(0)}
    2    1/(z + 1)
    3    3/(z^2 + 2z + 3)
    4    10/(z^3 + 3z^2 + 6z + 10)
    8    1716/(z^7 + 7z^6 + 28z^5 + 84z^4 + 210z^3 + 462z^2 + 924z + 1716)

Figure 5-23: The value of p_{Ω(0)}, the probability that at a given time a given processor is making progress on its own work, expanded for a few different machine sizes. For large numbers of processors, the probability is exponentially decreasing with the leverage.

researchers have used relatively small machines. It takes a relatively large leverage for swamping to become significant on small machines, and as soon as the application finds enough parallelism to keep all the processors busy, the swamping issue goes away. Looking at only a small machine, it

would not be evident that a naive work-stealing protocol can have a severe scaling problem. Why might the leverage be z ≫ 1? The leverage is essentially the ratio of the time taken by a busy processor to respond to a request to the time taken by an idle processor to respond to a request. The idle processor can clearly respond quickly, since it has nothing else to do. The busy processor takes longer due to the time to context switch from the busy work to the request handler.9 The model of Equation 5.40 does not exactly match the experimental results of Figure 5-20. There are many possible reasons that the model is a little bit inaccurate, besides the exponential-server assumption and the fact that congestion in the CM-5 data network introduces additional queueing delays. The easiest way to further validate this model of swamping would be to perform a simulation study that more closely matches the model. For example, it might be worthwhile to examine the swamping problem on an Ethernet, which is much easier to model accurately with queueing theory than is the CM-5 data network, and to use exponential servers instead of constant-time servers. If the model presented here does completely explain the swamping problem on the CM-5, one should be able to get an arbitrarily close match to the actual performance of such a simplified system. Even if there are other effects in the actual swamping problem, however, this model serves as a convincing explanation of why swamping gets so bad so quickly. We have learned that to avoid the swamping problem for work-stealing schedulers we can either keep the leverage small, run on only a few processors, or use some sort of throttle to change the underlying behavior. To solve the swamping problem for StarTech, I implemented a global throttle that uses the global synchronization of the CM-5 (see Chapter 4). Using a global throttle is not the only way to solve the swamping problem. One approach is to schedule the processor fairly between servicing the requests to steal and the local work to be done. The disadvantage of such an approach is that the data network might start clogging up with unhandled request messages. Even in a circuit-switched network, which forces the processor to retry if a message collides, the performance of the system might degrade as blocked processors that are trying to send requests themselves fail to respond to other processors' requests, causing all of the processors to slow down. In Chapter 3 we saw that barrier synchronization is an important global end-to-end flow-control mechanism for bulk data transfers as well.

9In a polled message-passing system, the request message may have to wait for about half a thread length before it is handled. This waiting time does not directly contribute to swamping on the busy processor, since during that wait time the processor actually gets some work done. This delay may slow down the system in other ways, however, such as by congesting the data network.

One way to keep the swamping problem small is to keep the leverage z small. For example, the Alewife processor [ACD*91] can fetch data out of a single word of the processor's memory without disturbing the processor. A busy processor can set up a single word to hold the amount of work that is available to steal. If there is work to steal, the processor does not mind spending some time distributing it. If there is no work to steal, the processor is allowed to do its work undisturbed, while the potential thieves quickly determine that there is nothing to steal. This approach argues for providing a class of memory-access messages that are handled quickly and efficiently by the interface between the processor and the network. The leverage can be kept small by careful hardware or software design, but such strategies require careful tuning. My global throttling strategy, on the other hand, is robust, needs very little tuning (see Section 5.7), and costs very little (see Section 6.6). Another approach is to use an adaptive randomized backoff strategy, such as is found in Ethernet [MB76]. Such backoff strategies are easier to tune than trying to adjust the leverage, but they require more tuning than does my global throttling strategy. DIB uses an adaptive backoff strategy (as opposed to a random backoff strategy) [FM87]. PCM uses a randomized backoff strategy [HZJ94, Hal94]. All of these approaches take advantage of globally consistent time to solve the swamping problem. The fair schedulers guarantee that a certain fraction of the real time spent at a processor is doing work. The adaptive backoff strategy takes advantage of a real clock. StarTech's global throttle strategy puts a thin veneer of global real time on top of the otherwise asynchronous computation.
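For concreteness, here is a sketch of the kind of adaptive randomized backoff loop that Ethernet-style schemes insert between steal requests. This is not StarTech's mechanism (StarTech uses the global throttle), and the routines send_steal_request, poll_for_reply, and my_random_victim are hypothetical stand-ins for whatever a particular runtime provides.

    /* Sketch of adaptive randomized backoff between steal requests (assumed API). */
    #include <stdlib.h>
    #include <unistd.h>

    #define MAX_BACKOFF_US 10000   /* cap the delay at 10 milliseconds (arbitrary) */

    extern void send_steal_request(int victim);     /* hypothetical runtime call            */
    extern int  poll_for_reply(void);               /* hypothetical: 1 = got work, 0 = denied */
    extern int  my_random_victim(int nprocs, int self);

    void steal_loop(int nprocs, int self)
    {
        unsigned backoff_us = 1;                    /* start with essentially no delay */
        for (;;) {
            send_steal_request(my_random_victim(nprocs, self));
            if (poll_for_reply())
                return;                             /* got work; go run it */
            /* Denied: wait a random time up to the current bound, then double the
               bound.  The randomization keeps thieves from retrying in lockstep, and
               the growing bound keeps a busy victim from being swamped. */
            usleep(rand() % (backoff_us + 1));
            if (backoff_us < MAX_BACKOFF_US)
                backoff_us *= 2;
        }
    }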

5.8 A Space-Time Tradeoff

Now that we understand the performance of the StarTech scheduler, we consider the ramifications of a decision made to get a good space bound. StarTech maintains an invariant that guarantees that it will not run out of memory. Each activation frame has a height, measured as the distance from the frame to the root of the computation. StarTech guarantees that on any given processor there is at most one frame of any depth. The memory requirements per processor are thus similar to the memory requirements for a serial game-tree searching program. StarTech's activation frames are a little bit larger than the activation frames used for a serial computation, since the parallel frame keeps information for each of the children, while the serial frame only needs to keep information about one child at a time. The frames are only a little bit larger, however. Thus, the memory requirements per processor of StarTech are about the same as the memory requirements for a serial computation, which also needs only one activation frame of any depth.10 To maintain StarTech's invariant requires some care, since StarTech does not move activation frames from one processor to another. Consider the situation in which a processor has had a single child stolen, but otherwise has no work to do. It is waiting for the child to return its result. If the processor sits idle, then processor cycles are wasted. If the processor steals work, then it might end up with more than one frame at the same depth, and eventually run out of frame memory. To maintain the invariant, StarTech uses a simple rule: a processor may not steal work until all of its local frames have been completed. As a consequence, in our example, the processor sits idle

waiting for the result from its child. The price paid for StarTech's simple rule is at worst a factor of H slowdown, where H is the height of the tree. At any point during the dynamic unfolding of the tree, one always can do work at any leaf. In any tree of height H with l leaves, there are no more than Hl internal positions

10Robert D. Blumofe showed me this explanation of StarTech's space bounds.


(and that worst case happens only on very tall skinny trees). Therefore, we lose at most a factor of H from this decision. The ratio of internal frames to leaves is less than H for the shallow bushy trees typically searched in a chess program. There are ways to avoid this worst-case slowdown. The Zugzwang chess program [FMM92] tries to avoid even this small performance penalty by allowing a stalling processor to steal work from one of its children. Feldmann refers to this as the helpful master strategy, and while such a strategy should not hurt the performance, it does not reduce the worst-case bound. Blumofe and Leiserson [BL93] observe that by moving frames from one processor to another, the worst case can be substantially improved. Another approach would be to use on-processor multithreading to allow

up to K frames per processor per level in the call tree. The worst-case bound does not get much better, but the typical case might show some improvement. Several systems to support dynamic MIMD-style programming have been proposed and implemented [DR81, Hal84, Bir89, SYH*89, PC90, NPA92]. Those systems do not provide predictable high performance, nor do they provide any space bounds on running programs. D. Culler provides some ad hoc techniques for managing the space bounds of a dataflow program [Cul89]. By limiting the number of iterations of a parallel loop that can run at the same time, Culler manages to control the space bounds of data-parallel programs written in a dataflow language. Culler's techniques do not generalize to arbitrary dataflow or dynamic-MIMD programs. How much performance could we hope to actually get back by changing StarTech's simple mechanism to something more complex? My measurements indicate that between 1% and 3% of the processor cycles are used up waiting for children. Furthermore, in the performance model

    T ≈ 1.02 W/P + 1.5 C + 4.3 seconds,

there is only a 2% overhead added to the linear speedup term W/P. Since, in my performance measurements, I did not count the time waiting on children as work, it is possible that all of the 2% overhead is due to waiting on children. Another way to see how much time is spent waiting on children is to consider the processor utilization profile of Section 5.5. The profile is reproduced here as Figure 5-24(a) without shading in the region under the curve, to make it easier to see what is going on when there is plenty of parallelism. For most of the second half of the profile, in the processor-saturation region, the processors stay almost completely busy. But the utilization jumps around between 30, 31, and 32 processors. For example, whenever the utilization is only 30, there is a 6% inefficiency. If instead of counting the time spent waiting on a child as idle time, we were to count it as work-time, we should see the work increase. Figure 5-24(b) shows the utilization profile when the time waiting on a child is accounted for as work. In this case the utilization is nearly 100% during the processor-saturation region. The difference between the two profiles shows the distribution of the time spent waiting on children. There are at least two reasons that during the highly parallel parts of the program, the processor utilization is not pegged at 100%.

StarTech uses a randomized work stealing scheduler. When a processor is idle, it requests work from another, random, processor. Cycles can be lost just looking for work that, in principle, could be worked on immediately.

The StarTech scheduler guarantees a space bound by waiting on a child. In principle the work of the position that is waiting on a child could be moved to the child, and then the processor could do something useful. It turns out that this is causing the inefficiency that we can see in the utilization profile.

[Figure 5-24 appears here: two processor-utilization profiles, plotting the number of processors busy against wall time, (a) with child-wait time correctly accounted as idle time (run ends at 434 seconds), and (b) with child-wait time treated as busy time (run ends at 441 seconds).]

(a) Correct accounting of child-wait time.  (b) Treat child-wait time as busy-time.

Figure 5-24: The inefficiency during the saturation region is mostly accounted for by the time waiting on children. Graph (a), a reproduction of Figure 5-15 without the region under the curve shaded, shows the processor utilization when child-wait time is accounted for correctly as idle time. Graph (b) shows how the apparent efficiency improves if we account for the time waiting for children as time that the processors are doing useful work. Most of the inefficiency in the saturation region goes away when we ignore the child-wait time.

It appears that most of the inefficiency is actually due to the layout decision rather than to StarTech's randomized scheduling strategy. Potentially that 2% overhead could be redeemed by adopting a different strategy for achieving the space bound.

Chapter 6

Tuning the StarTech Program

6.1 Introduction

As we have seen, the critical path length and total work give us insight into how long our program will run on a given size machine. This chapter explores how to use this insight to help speed up the program. I tried several techniques for improving the program, including using recursive iterative deepening to improve the move ordering, using I-structure-like data structures to improve the performance of the search for reconverging subtrees, and modifying the basic Jamboree algorithm to reduce the work. This chapter starts with a discussion in Section 6.2 of how to measure the performance of parallel chess programs. Section 6.3 explains the implementation of StarTech's global transposition table. Section 6.4 studies how the transposition table can be made more effective, especially for a parallel chess program. Section 6.5 studies modifications to the Jamboree algorithm that improve the work efficiency, hopefully without increasing the critical path length too much. Finally, Section 6.6 shows where the processor cycles are spent when running StarTech.

6.2 What is the Right Way to Measure Speedup of a Chess Program?

The performance of any chess player can be determined by a system due to Elo [Elo78], which rates players according to their performance in tournament play.1 Such a scheme is time-consuming and impractical for measuring the improvement due to changes in a chess program. Instead I use a set of benchmark problems designed by International Master L. Kaufman [Kau92, Kau93] to determine how a change in my program affects the quality of play. To obtain an estimated Elo rating for a program, Kaufman uses 25 chess positions (20 tactical, 5 positional), each of which has a correct answer. To obtain an estimated rating, one measures the time it takes for the program to find Kaufman's correct answer for each position. Then, one

throws away the worst 5 times and sums up the remaining times. Let t_20 be the sum of the times, in

1In the Elo system, if two players with a 200 point ratings difference play a game of chess, the game has an expected value of 0.75 points for the higher ranked player, where a win is worth 1 point, a draw is worth 0.5 points, and a loss is worth 0 points. Thus, it is the difference between the ratings of two players that is important. The absolute value of the Elo ratings of a group of players varies according to an arbitrary decision of what some particular player's rating is. Here we use the USCF (United States Chess Federation) rating scale, which sets the rating of the `average rated member' of the USCF at 1500 USCF.


seconds, to solve the fastest 20 positions. Given t_20, Kaufman estimates the USCF rating as

    USCF Rating ≈ 2930 - 200 log_10 t_20.                             (6.1)
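As a quick illustration of Equation 6.1, the following C function, my own sketch rather than any code from StarTech, converts a measured t_20 into an estimated rating.

    /* Estimated USCF rating from t20, the time in seconds to solve the
       fastest 20 of Kaufman's positions (Equation 6.1). */
    #include <math.h>

    double kaufman_rating(double t20_seconds)
    {
        return 2930.0 - 200.0 * log10(t20_seconds);
    }

For example, t_20 = 319.38 seconds (the 512-processor row of Figure 6-2) gives an estimated rating of about 2429 USCF.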

I.e., a factor of 10 in performance is estimated to be worth 200 ratings points, or about one standard deviation in the population of rated chess players. Kaufman has chosen a particular way of weighting the different positions, and there are other ways to weight positions, such as measuring the depth of search for a given amount of time. Kaufman's estimator is simple and provides a well-defined way to compare different chess programs. Kaufman cautions against misuse of his ratings estimator, which is designed only to predict accurately the ratings of programs that are near the range of a chess master or Senior Master (2200-2600 USCF). I measured my parallel implementation against my best serial implementation of StarTech. Our serial implementation of StarTech uses standard α-β search,2 and some significant effort was put into making the serial program run as fast as possible (partly because much of the code is shared, so such investment pays off for the parallel program too, and partly because I wanted to have the best serial program I could get). I ran the serial program on a Sun workstation,3 which has a larger cache, a better cache controller, more main memory, and a faster processor than each of the CM-5 node processors. The main advantage of the Sun workstation over the CM-5 nodes that I used elsewhere is that it can have a larger transposition table. I wanted to measure the improved rating of StarTech as a function of the number of processors, but first I had to isolate other factors. The biggest other factor is the effect of the transposition-table size, which varies with the number of processors. Program StarTech, like most chess programs, uses a transposition table to cache results of recent searches. For a search from a given position to a certain depth, the transposition table indicates the value of the position and the best move for that position. The transposition table is indexed by a hash key derived from the position. Whenever the search routine finishes with a position, it modifies the transposition table by writing the value back (it may decide that the old value stored was better to keep than the new value). Whenever the search routine examines a position, the routine first checks to see if the position's value has been stored in the transposition table. If the value is present, then the routine can simply return the value. Sometimes, the value for the position is not present, but a best move is present for a search to a shallower depth. In this case the best move for the shallow depth can be used to improve the move ordering. Since the Jamboree search algorithm depends on good move ordering, the transposition table is very important to the performance of StarTech. Hsu argues [Hsu90] that if one increases the size of the transposition table along with the number of processors, then the results are suspect. Hsu states that increasing the transposition table size by a factor of 256 can easily improve the performance by a factor of 2 to 5. Our strategy is to choose a transposition table size that is sufficiently large that increasing it further doesn't help the performance. Figure 6-1 shows the estimated rating of the serial program as a function of the transposition table size, and it also shows the number of positions visited by the program under each configuration. In our serial implementation, larger transposition tables take longer to initialize, but I did not count the cost of initializing the larger transposition table against the serial program. Note that the number of positions visited by the search tree monotonically decreases as the table gets larger, but that after 2^23 entries, the number of positions becomes constant. We can conclude

2I really ought to have used a serial Scout search, which is typically worth an additional 5-10% performance improvement. 3The workstation is a four-processor Sun model S690-140-128-P56. The machine was unloaded except for my job and whatever time was spent servicing the Ethernet and handling operating system overheads.

    Transposition      Positions      Time (seconds),   Estimated
    Table Entries      Visited        top 20            Rating
    0                  161,625,376    23337.14          2056
    2^16                94,409,196    13506.36          2104
    2^17                85,753,262    12670.01          2109
    2^18                76,498,040    10925.46          2122
    2^19                65,568,814     9605.36          2133
    2^20                55,910,651     8040.08          2149
    2^21                48,256,980     7138.08          2159
    2^22                42,627,585     5799.31          2177
    2^23                40,805,974     6120.18          2173
    2^24                40,805,974     6364.99          2169

Figure 6-1: Performance of my best serial implementation of StarTech as a function of transposition table size. The number of chess positions in the search tree is shown, along with the time in seconds and the estimated rating using Kaufman's ratings estimation function, given by Equation 6.1.

that for Kaufman's ratings test any transposition table size of more than 2^23 entries is quite sufficient, and a larger transposition table will not, by itself, raise the estimated rating of the program. I believe that the slight decrease in estimated rating (i.e., the increase in time to solve the problems) beyond 2^23 entries is due to paging and cache effects, because the machine I ran these tests on could not reliably hold the working set in main memory when the transposition table is larger than 2^23 entries. Any transposition table of size 2^22 entries or smaller easily fit within the main memory of the serial computer I used. I ran Kaufman's test on a variety of different CM-5 configurations. The transposition table size was set at 2^21 entries per processor, which is the largest size that fits in the 32-Megabyte memory of the CM-5 processors. For runs on fewer than 32 processors, we actually used a 32-processor machine with some processors `disabled'. In this case, I used the entire distributed memory of the 32-processor machine to implement the global transposition table. As a result, in every parallel run, the transposition table contains a total of at least 2^26 entries. Figure 6-2 shows the estimated rating of StarTech as a function of the number of processors. According to Kaufman's test, there is a diminishing return as the number of processors increases when only the fastest 20 problems are considered. If we consider the time to solve all 25 problems, however, there are still significant performance gains being made even when moving from a 256-node CM-5 to a 512-node CM-5. I wondered if Kaufman's test has enough parallelism to show StarTech's strengths to full advantage. On the 512-processor run, for many of the situations, the program spent only a few seconds on a position, and on average the time spent on the fastest 20 moves is only 16 seconds, less than a tenth of the time allowed under tournament time conditions (roughly 180 seconds per move). Kaufman's problem is a fixed-size problem with a fixed amount of parallelism. I had noticed in the tournament that StarTech's performance, measured in positions per second, is generally much better in the second 90 seconds of a search than during the first 90 seconds of search, and I observed that the positions in Kaufman's test that achieved the best speedup were often discarded by Kaufman's evaluation scheme because they were among the slowest positions. So I decided to try letting StarTech run under tournament time-conditions with 512 processing

    Processors   Time for Top 20   Estimated Elo Rating   Time for all
                 (seconds)         (USCF)                 (seconds)
      1          8936.95           2139                   38261.91
      2          5376.45           2183                   22007.46
      4          3152.54           2230                   11614.43
      8          1932.27           2272                    7411.54
     16          1240.72           2311                    4398.32
     32           844.00           2344                    2803.33
     64           573.19           2378                    1670.29
    128           444.78           2400                    1129.68
    256           378.72           2414                     907.24
    512           319.38           2429                     677.11

Figure 6-2: The estimated rating of our parallel implementation of StarTech as a function of the number of processors. The time to solve the fastest 20 of Kaufman's test problems is shown, along with the estimated rating (computed with Equation 6.1) and the time to solve all 25 positions.

nodes, allocating about 3 minutes per position. I then determined the depth of search reached by the program. With this information I ran the serial program to the same depth to determine how much "serial work" StarTech managed to do. (That is, the quality of the choice made by StarTech is the same as the quality of the longer serial search.) Thus, for this experiment, I am not measuring speedup, but slowdown. The biggest slowdown was 480-fold, and most slowdowns were greater than 200-fold on the 512-node machine. The slowdown experiment is not as well controlled as the speedup experiment. I cannot run the serial program with as big a hash table as the parallel program uses, and my budget for machine time would not allow me to run a serial implementation using the entire memory of a 512-node CM-5. If we accept Hsu's estimate, we should derate our slowdown numbers by up to a factor of 5, especially on the longer runs. Even with this derating, the slowdown of the serial program compared to 512-node StarTech is usually more than 50. I found that by letting the program search deeper, the linear speedup term grew more quickly than did the critical path, and the overall speedup improved. In summary, my experiments do not provide a clear measure of the performance of StarTech on 512 nodes under tournament time controls. The authors of the Zugzwang chess program [FMM93] found that when searching `easy' positions to a very deep depth, more speedup is achieved than can realistically be expected under tournament conditions. On the other hand, searching the easy problems to a shallow depth does not give the program an opportunity to find parallelism. An effort needs to be made to find harder problems to test parallel programs.

6.3 The Global Transposition Table

Program StarTech uses a global transposition table, distributed across all the nodes of the machine, as shown in Figure 6-3. To access the transposition table, which requires communicating from one node to another, an active message protocol is used. The hash key used to index the table is divided into two parts: a processor number and a memory offset. When a frame needs to look up a position in the table, its processor sends an active message to the processor named by the hash key, and that processor responds with a message containing the contents of the table entry.

[Figure 6-3 appears here, showing a hash key split into a processor number, a memory offset, and collision resolution bits; the globally distributed transposition table, with memory offsets M0 through M2097152 spread across processors P0 through P512; and a table element containing collision resolution bits, a search depth, a best move, a lower bound, and an upper bound.]

Figure 6-3: The StarTech transposition table is globally distributed. The hash key for a position is divided into a processor number, memory offset, and collision resolution bits. The processor number and memory offset identify a unique table element in the machine. The collision resolution bits discriminate between positions that map to the same table element. The table element contains the depth to which the tree was searched, the best move, a lower bound to the value of the position, and an upper bound to the value of the position.
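The following C declarations sketch the layout that Figure 6-3 describes. The field widths and the bit positions used to split the key are my own assumptions for illustration, not StarTech's actual definitions.

    /* Sketch of a table element and the hash-key split of Figure 6-3 (assumed layout). */
    #include <stdint.h>

    #define LOG_PROCESSORS   9            /* up to 512 processors              */
    #define LOG_TABLE_SIZE  21            /* 2^21 entries per processor        */

    typedef struct {
        uint32_t collision_bits;          /* discriminates positions sharing an entry */
        int8_t   search_depth;            /* depth to which the position was searched */
        uint16_t best_move;               /* best move found by that search           */
        int16_t  lower_bound;             /* lower bound on the position's value      */
        int16_t  upper_bound;             /* upper bound on the position's value      */
    } table_element;

    /* Split a 64-bit hash key into processor number, memory offset, and
       collision-resolution bits. */
    static void split_key(uint64_t key, int *proc, uint32_t *offset, uint32_t *collision)
    {
        *proc      = (int)(key & ((1u << LOG_PROCESSORS) - 1));
        *offset    = (uint32_t)((key >> LOG_PROCESSORS) & ((1u << LOG_TABLE_SIZE) - 1));
        *collision = (uint32_t)(key >> (LOG_PROCESSORS + LOG_TABLE_SIZE));
    }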

The transposition table is a cache, not a hash table, although the computer-chess literature usually refers to the "transposition hash table".4 Several different positions can be mapped to the same table entry, and some replacement policy (such as least-recently-used) is used to decide which values to keep and which values to kick out of the table. Most programs, including StarTech, use a direct-mapped strategy, in which each position can be mapped to exactly one table entry. Some other programs, such as Cray-Blitz [HSN89], use a set-associative cache for their transposition table. We must be careful about providing atomic access to the transposition table. If one processor were to read the first half of a transposition table entry with one message and then read the second half of the transposition table entry with another message, and if a second processor between the two halves of the read issues a write to the transposition table, the first processor receives corrupted

4Hash tables and caches both are used to store a collection of objects so that they can be later found. Both use a hash key to determine where any particular object resides in the table. They differ on their replacement policies. When two objects have the same hash key (that is, when they collide) in a hash table, some other place is found to put one of the objects. In a cache, when two objects collide, one of the objects is simply thrown away.

data. Similarly, nonatomic write operations can corrupt the data stored in the table. The StarTech transposition-table update operation is, in fact, a read-modify-write. When we have a new value to write, we perform a read and decide which value to keep (the old or new value), and then write back the correct value. We must arrange to perform read and update operations atomically.

There are several ways to implement atomic access to a global data structure. The traditional way is to provide a lock to prevent other processors from accessing the table element. Thus, a processor accessing the element acquires the lock, performs the operations on the element, and then releases the lock. This protocol is inefficient because it requires several round trips of communication through the network. StarTech uses an optimization that works well for transposition tables, and which gets rid of all the messages except for those messages actually needed to communicate data. StarTech uses active messages [vCG*92] for interprocessor communication. When a message arrives at a processor, a message handler specified in the message is invoked. The handler is run atomically. Thus, to perform the read, we send a "read" message, and the handler reads the value from the hash table atomically and sends it back to the requester. To perform an update we package up all of the information needed to do the update, and send it to the processor that owns the table entry. The message handler does the read, decides what the data should be, and writes the new data, all in an atomic operation. The transformation from a complex locking protocol to the simpler protocol using atomic message handlers can be viewed as a code transformation in which the processor that does the atomic operation is chosen to minimize the amount of locking traffic, as shown in Figure 6-4. This type of transformation may be useful for the simplification of other protocols as well.

We must also be careful that the frame does not get deallocated between the time that the request to the transposition table is sent and the time that the reply is received back at the frame. Recall that the frame could be deallocated if, for instance, its parent sends it an "abort" message for some reason. If the reply is received and the frame has been deallocated in the meantime, the entire computation could become corrupted. This dangling reference problem is handled in StarTech by counting messages. The frame records that a message is expected from the global transposition table, and the frame's deallocation becomes dependent on the arrival of that message. This mechanism is essentially the same as the one described in Section 4.6 that counts the number of messages that must arrive before the frame is deallocated.

A similar transposition table scheme is used by the Zugzwang parallel chess program [FMM93], but the issue is not approached from the point of view of obtaining atomic transactions, and the literature on Zugzwang does not make it clear how dangling references are avoided. F. Popowich and T. Marsland concluded that local transposition tables are better than global transposition tables [PM83]. Local transposition tables do not incur any message passing overhead, but local transposition tables have a much lower hit rate than global transposition tables.
With message passing overheads that measure in the tens of milliseconds, Popowich and Marsland were forced to choose between bad performance due to message-passing costs, or bad performance due to poor transposition-table effectiveness. The decision is much easier for StarTech, which uses low-overhead active messages on the CM-5. The Orca system [BKT92] encourages programmers to use a related strategy to obtain atomic access to a data structure. In Orca, the handlers are not run atomically, and there is no constraint on what kind of code can run in a handler. Thus, the natural way to handle atomic access to a data structure is to write a handler that obtains a lock, then accesses the data, and then releases the lock.


Figure 6-4: How to transform an atomic read-modify-write written with locks into an atomic read-modify-write written using atomic message handlers, as provided by the active message abstraction. To perform a read-modify-write with locks requires many messages: a round trip to obtain the lock, a round trip to read the data, a round trip to write the data, and then a message to release the lock. The active message implementation requires only one message: the request, which carries any necessary data, to perform the update.

Traditionally, the overhead of a message passing system that can handle this sort of general remote procedure call (RPC)5 would be too high to use for transposition tables.6 Recent work on optimistic active messages promises to obtain the benefits of general RPCs at a cost comparable to an active message [HJK*94]. Optimistic active messages require that a message handler be abortable, which introduces additional complexity in the design and implementation of message handlers.7

5For an example of remote procedure call (RPC) see [BN84]. 6For example, on a DECstation using an optimized RPC with an ATM network, the message-passing overhead is about 170 microseconds [TL93]. 7Recall also the discussion in Section 4.6 on how aborting entire subcomputations is accomplished in StarTech.

The StarTech scheme is simpler and faster, because it does not require abortable handlers or locks.
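The following sketch shows the shape of the one-message update of Figure 6-4. The am_send routine, the tt_entry layout, and the keep-the-deeper-search policy are illustrative assumptions rather than StarTech's actual code; the point is only that the read, the decision, and the write all happen inside a single atomically executed handler on the processor that owns the entry.

    /* Sketch of a read-modify-write update via an atomic active-message handler. */
    #include <stdint.h>

    typedef struct { int8_t depth; uint16_t best_move; int16_t lo, hi; } tt_entry;

    extern tt_entry *local_table;          /* this processor's slice of the global table */
    extern void am_send(int proc, void (*handler)(void *), void *args, int nbytes);  /* assumed */

    struct update_args {
        uint32_t offset;                   /* which entry in the owner's local table */
        tt_entry candidate;                /* new value proposed by the requester    */
    };

    /* Runs atomically on the processor that owns the entry. */
    void update_handler(void *p)
    {
        struct update_args *a = p;
        tt_entry *e = &local_table[a->offset];
        /* Read-modify-write: keep whichever value looks more valuable.  Here the
           decision rule is simply "prefer the deeper search". */
        if (a->candidate.depth >= e->depth)
            *e = a->candidate;
    }

    /* The requester's side: one message, no locks, no round trips. */
    void post_update(int owner, struct update_args *a)
    {
        am_send(owner, update_handler, a, sizeof *a);
    }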

6.4 Improving the Transposition Table Effectiveness

The transposition table improves performance by storing `exact' results that can be used to avoid a search of an entire subtree and by storing `partial' results that can be used to improve move-ordering. The `exact' results are encountered when, due to transpositions, the exact same position is searched more than once during a tree search. The result from the first search is stored in the transposition table. Then, at the start of the second search, the transposition table is interrogated, the result is found, and the entire second search is avoided. This method of avoiding reconvergent searches is not the most important use of the transposition table, however. The most important use of the transposition table is to store `partial' results which can be used to improve move-ordering. As an example of using partial information to improve move-ordering, consider a search of some position p to depth k. The best move for the depth-k search is very likely to be the same move as would be gotten from a depth-(k-1) search. If we have already searched position p to depth k-1 and stored the move in the transposition table, then we can use the result from the shallower search to improve the move-ordering for the deeper search. In this case, the transposition table contains an `exact' result for depth k-1, but only contains a `partial' result for a depth-k search: a hint as to the best move. To improve the effectiveness of the transposition table, we tried to improve the effectiveness of both `exact' results and `partial' results.

StarTech uses the standard transposition-table hash mechanism developed by Zobrist [Zob70] and used by most modern chess programs. Using Zobrist's method, we think of the entire state of a chess position p as being a collection of properties, such as whether there is a particular piece on a particular square. Chess positions include properties such as "there is a white pawn on a1" and "Black has queenside castling rights". We define the hash key of position p as

    h(p) = ⊕_{P ∈ P(p)} h_P,

where h_P is a wide integer (64 bits in StarTech) associated with property P (the hash code of P), P(p) is the set of properties that hold in position p, and ⊕ is the bitwise exclusive-or operator. That is, h(p) is the bitwise-xor of the hash codes of all the properties that are true of chess position p. Part of the hash key is used to index the array of transposition table entries, and part of the hash key is stored in the hash table to help detect false hits.

The choice of values for the hash codes h_P affects the performance of the chess program. If the array-index parts of two hash keys are the same for two different positions, then the two positions collide in the transposition table. We want positions that are searched near each other in time to have different hash keys. The standard method is to choose random values for the hash codes. The Hitech program, from which StarTech was developed, uses better values for the hash codes. The hash keys are chosen so that the Hamming distance between any two keys is large. One can think of choosing points that are mutually far apart in a 64-dimensional hypercube. The idea is that it takes quite a few differences between two positions in order to get the hash keys to be the same. According to Berliner, these better values produce a measurable performance improvement [Ber91] for the Hitech program. The StarTech program uses the hash keys from Hitech.
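A minimal sketch of Zobrist hashing as just described follows. The property count and the random initialization are placeholders of my own; Hitech and StarTech instead use hash codes chosen to be far apart in Hamming distance.

    /* Sketch of Zobrist hashing: h(p) is the xor of the codes of all true properties. */
    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_PROPERTIES (64 * 12 + 16)   /* piece-on-square properties plus a few
                                               others (castling rights, etc.); the
                                               exact count is an assumption */
    static uint64_t hash_code[NUM_PROPERTIES];

    void init_hash_codes(void)
    {
        for (int i = 0; i < NUM_PROPERTIES; i++)
            hash_code[i] = ((uint64_t)rand() << 32) ^ (uint64_t)rand();  /* placeholder values */
    }

    uint64_t zobrist_hash(const int *true_properties, int n)
    {
        uint64_t h = 0;
        for (int i = 0; i < n; i++)
            h ^= hash_code[true_properties[i]];
        return h;
    }

    /* Because xor is its own inverse, the key can be maintained incrementally:
       toggling a property (a piece leaving or arriving on a square) is one xor. */
    uint64_t toggle_property(uint64_t h, int property)
    {
        return h ^ hash_code[property];
    }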

In any cache, the replacement policy affects the cache hit rate. Given two values that both want to be stored in the same entry of a direct-mapped cache, such as the StarTech transposition table, one must choose which value to keep. One may prefer to keep a value that corresponds to a deeper search, since that value contains knowledge that would be expensive to recompute. One may also prefer to keep a value that is more recently used, because that value is more likely to be useful in the future than a less recently used value. I tuned the transposition table replacement policy to try to ensure that more important data is kept rather than less important data. I performed only crude tuning. StarTech's replacement policy is to keep deep positions rather than shallow positions, and if two positions are of the same depth, StarTech keeps the more recently used position. StarTech's policy is less sophisticated than Hitech's policy, which, roughly, is to keep the most recently used value unless it is the result of a very deep search. In the future, I need to spend more time to carefully tune the replacement policy. For my experiments, the transposition table is large enough that the replacement policy is not very important. StarTech searches about 1200 positions per second per processor and has 2^21 transposition table entries per processor, and so it takes about 800 seconds to search enough positions to fill half the transposition table. The longest search in my set of experiments is less than 400 seconds, on a 128-processor machine.

Recursive iterative deepening is related to the iterative deepening used by most chess programs (see for example [SA77, Gil78]). Normal iterative deepening involves searching from the root to

depth 1, then to depth 2, then to depth 3, and so forth. Normal iterative deepening has among its advantages the fact that when a search of position p to depth k is started, the transposition table typically has the result for a search of position p to depth k-1. Since part of the result stored in the transposition table is the best move, the program gets a good hint about what the best move is. Sometimes, the table does not have the result of the search to depth k-1 for position p, in which case the program proceeds to do the search of depth k with almost no knowledge about the tree. Recursive iterative deepening rectifies this situation. If the value of the search to depth k-1 is not in the table, then we search (recursively) to depth k-1 before searching to depth k.

Iterative deepening was first described by D. Slate and L. Atkin [SA77]. T. Truscott states that he implemented recursive iterative deepening in a checkers program [Tru92] that had no move-ordering heuristics except for the transposition table. Truscott says that he never attempted to use recursive iterative deepening in his Duchess [Tru81] chess program. H. Berliner states [Ber93] that in the middle to late 1980's the Hitech team identified something they called a search catastrophe. A search catastrophe usually happened on the last iteration of a search, when the move that had been preferred up to that point suddenly collapsed. In this situation, the transposition table only contained enough information about the other moves at the root to find some refutation. Such information could be quite far away from anything resembling the best line of play. So poor Hitech would spend much time on deep and uninformed searches of bad moves. To combat the search catastrophe, Murray Campbell and Gordon Goetsch wrote a recursive iterative deepening program. Berliner could not recall whether recursive iterative deepening was made a permanent part of Hitech.

Figure 6-5 shows the effect of recursive iterative deepening on our serial program. In the serial program, recursive iterative deepening slows things down for positions which are solved quickly. The value of improving the move ordering for such a short search is small, and the transposition table is likely to hold the right answer, so little can be gained. For the slower positions, recursive iterative deepening helps more, providing more than a factor of three speedup for the slowest position. Recursive iterative deepening helps the serial program precisely when the transposition table suffers from collisions.

Sometimes a parallel program can search the same tree multiple times in parallel. A parallel chess program must cope with reconvergent game trees differently than a serial chess program does. The transposition table helps improve the efficiency of the search, because when a position has been searched before, the transposition table remembers the value. Thus, in a serial program, every transposed position can be caught by the transposition table, and transposed positions are only

    Position   Without RID        With RID           Speedup
               time (seconds)     time (seconds)
    22               0.04               0.05          0.79
    21               0.82               0.85          0.97
    23               1.07               1.08          0.99
    16               1.97               2.11          0.93
    19              32.63              55.34          0.59
    13              37.68              36.27          1.04
    11              44.35              39.04          1.14
    20              74.45              73.50          1.01
     9              76.35              69.03          1.11
    12              88.00              83.90          1.05
    18             104.15              80.87          1.29
     5             128.02             111.80          1.15
    10             226.30             189.75          1.19
    17             247.26             237.41          1.04
     8             341.99             243.97          1.40
    14             681.66             512.67          1.33
    25             774.18             872.94          0.89
     6            1476.79             895.57          1.65
     3            1600.24             869.57          1.84
    15            2088.25            1012.29          2.06
    24            2220.37            2037.59          1.09
     2            6207.41            2850.86          2.18
     7           14224.06            3616.37          3.93
     1           14553.25            7370.59          1.97
     4           16059.45            5182.52          3.10
    Top 20        8026.20            5388.02          1.49
    All          61290.74           26445.96          2.32

Figure 6-5: The effect of recursive iterative deepening (RID) on the serial program. Each of Kaufman's problems was run to solution on our serial implementation. The time, in seconds, to solution is shown with and without RID for each problem. The problems are sorted according to the time taken without recursive iterative deepening.

missed due to the transposition table being too full. In a parallel implementation, it is possible for two different processors to reach the same position at about the same time, for both processors to miss on the transposition table, and then for both processors to perform the same work redundantly. Then other processors steal work from those two processors, and many processor-seconds can be wasted. I designed a way to catch these parallel transpositions. The idea is that when one of the processors does a transposition-table lookup on a position, it records in the transposition table the fact that the position is being searched, and then when a second processor does a transposition lookup, it notices that the search has already started, so it just waits until the search completes and then uses the value. In such a scheme, the second processor sits idle while waiting for the transposition lookup. It is better to sit idle, however, than to start redundant work, some of which

may be posted and subsequently stolen by other processors, causing them to do redundant work too. To implement the waiting, I used a technique developed for I-structures [ANP89]. The transposition table lookup is performed as a pair of messages: a "lookup" message is sent from the frame to the transposition table, and a "data" message is used to send the reply back to the processor from the transposition table. We want to have the second processor wait until the first processor finishes, indicated by sending an "update" message. All we do to implement the wait is hold onto the "data" message until the "update" arrives. The second processor just sits in a polling loop waiting for its data to arrive, thus implementing the wait.

Such a scheme must be careful to avoid deadlock. If a lookup on some position p to depth 3 were to be held up by a lookup on position p of depth 7, for example, we might deadlock, because the depth-7 search might need the value from the depth-3 search to complete. We adopt the following rule:

    If there is a search of depth k in progress, then defer the lookup of any search to depth j if and only if j ≥ k.
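In code, the deferral test is a one-line predicate; the function name below is a hypothetical helper, not StarTech's actual interface. A lookup for a depth-j search of a position is held back while a depth-k search of the same position is in progress only if j ≥ k, so a deep search can never block the shallower search whose result it may need.

    /* Should a lookup for a depth-j search be deferred while a depth-k search
       of the same position is already in progress? */
    int defer_lookup(int j_requested_depth, int k_depth_in_progress)
    {
        return j_requested_depth >= k_depth_in_progress;
    }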

Thus there could still be multiple searches of the same position at the same time, but once a search of a position to a given depth starts, only shallower searches of that position may be started concurrently. When combined with recursive iterative deepening, this rule means that only one particular search of a particular position actively spawns work at any time. The deeper search ends

up waiting for the shallower search to complete before continuing with the deeper search. I found that if the program defers a search of the same position with an unrelated α-β window, then the performance becomes much worse, because the unrelated windows do not actually depend on each other. I measured the effect of recursive iterative deepening and deferred reads on the performance of StarTech. Figures 6-6 through 6-8 show the performance of StarTech on each of Kaufman's test positions with and without deferred reads and with and without recursive iterative deepening, for various machine sizes. As for the serial program, recursive iterative deepening dramatically improves the performance of the program. The time to solve the top 20 problems is reduced

to 2469 seconds using both mechanisms, which according to Kaufman's formula (Equation 6.1) places StarTech's estimated rating at 2451 USCF. Surprisingly, the deferred-read mechanism increases the variability of the performance by

20% from one execution to the next. The average performance with the deferred-read mechanism is slightly better than without, but the range of measured performance for the same size machine is much wider. I determined that the variation is actually in the amount of work, rather than, for example, because of increased waiting time due to the deferred read mechanism. I do not yet understand why the variation is so much higher with deferred reads. It may pay to study how the deferred read function is affected by the type of chess position being searched or by the use of a

k-way set-associative transposition table instead of a direct-mapped transposition table. The deferred-read mechanism increases the total amount of time waiting for transposition table lookups from 5% of all the processor cycles to 15% of all the processor cycles. This waiting-time increase illustrates a tradeoff in my scheduling strategy. It would be nice to be able to execute work from some other part of the tree while waiting for the work to complete. But I wanted to get all of the advantage of the normal C-language call stack for the case where work is not being stolen, so I did not use a general heap-of-activation-frames representation to represent the local computation. If I used a general heap of activation frames [NA89, Hal94], or perhaps a threads package [Bir89], the time wasted waiting for hash lookups to complete could be used for something else. Perhaps


Figure 6-6: The effect of deferred reads and recursive iterative deepening on the parallel program, running on 512 processors. Each of Kaufman's problems was run to solution. The time is shown in seconds for the combinations consisting of with and without deferred-read and with and without recursive iterative deepening. The problems are sorted according to the serial time without recursive iterative deepening.

124

RID no no yes yes no no yes yes

DR no yes no yes no yes no yes

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Totals

W

T

Figure 6-7: The effect of deferred reads and recursive iterative deepening on the parallel program, running on 128 processors. Each of Kaufman's problems was run to solution. The time is shown in seconds for the combinations consisting of with and without deferred-read and with and without recursive iterative deepening. The problems are sorted according to the serial time without recursive iterative deepening.

125

RID no no yes yes no no yes yes

DR no yes no yes no yes no yes

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Pos

W

T

C Pos Totals

W

T

Figure 6-8: The effect of deferred reads and recursive iterative deepening on the parallel program, running on 32 processors. Each of Kaufman's problems was run to solution. The time is shown in seconds for the combinations consisting of with and without deferred-read and with and without recursive iterative deepening. The problems are sorted according to the serial time without recursive iterative deepening.

Perhaps even having as few as two main threads of control would do the job. This problem is related to the strategy of waiting for children (discussed in Section 5.8). In both cases, the processor waits instead of doing useful work. It makes sense to investigate some of these alternative scheduling strategies in the future.

6.5 Efficiency Heuristics for Jamboree Search

Another approach to improving the performance of StarTech is to reduce the total work by serializing the search. I found a few heuristics that can improve the work efficiency of the Jamboree chess algorithm on real chess positions. This improvement in efficiency often comes at the expense of an increased critical path length. I did find one heuristic, however, that improves the performance without increasing the critical path significantly.

I first set out to identify what work is wasted. There are two cases where the Jamboree algorithm does work that is not necessary. Failed work is work done to test a position when the test fails and the position must then be searched for value. Some of the failed work is a cost introduced by the serial Scout algorithm, since serial Scout also performs a re-search; some additional failed work is incurred because in the serial search the test is possibly performed with a tighter bound than is available during the parallel search. Cutoff work is work done on a child of a position that would not have been expanded in a serial execution, because an earlier child would have failed high. In addition, for a subsearch that is neither a failed test nor cutoff work, we define the failed work of the subsearch to be the sum of the failed work of its children, and the cutoff work of the subsearch to be the sum of the cutoff work of its children.

I arranged for StarTech to compute the amount of failed work and cutoff work. I found that most of the inefficiency of Jamboree search is cutoff work. Depending on the position, 10% to 30% of all the work is cutoff work, while less than 2% of the work is failed work.

I furthermore examined the conditions under which work is stolen. I categorized the conditions by the Knuth-Moore position type [KM75]. Rephrasing the Knuth-Moore position types in terms of Jamboree search:

Type 1 positions are the positions searched by Jamboree search with a non-empty window.

Type 2 positions are the positions searched with an empty window that have an even number of ancestors between them and the nearest Type 1 position. Thus, the tested children of a Type 1 position are Type 2 positions, as are the grandchildren of any Type 2 positions.

Type 3 positions are the positions searched with an empty window that have an odd number of ancestors between them and the nearest Type 1 position. Thus the children of a Type 2 position are Type 3 positions, and the children of Type 3 positions are Type 2 positions.
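As a purely illustrative sketch (the names here are hypothetical, not taken from StarTech), the position type can be tracked during the recursion roughly as follows:

    /* Hypothetical sketch: tracking Knuth-Moore position types during Jamboree search.
     * A position searched with a non-empty window is Type 1.  Positions searched with
     * an empty (zero-width) window alternate between Type 2 and Type 3 on each ply
     * below the nearest Type 1 ancestor. */
    typedef enum { TYPE1 = 1, TYPE2 = 2, TYPE3 = 3 } pos_type;

    pos_type child_type(pos_type parent, int child_window_is_empty)
    {
        if (!child_window_is_empty)
            return TYPE1;             /* value searches restart the classification */
        switch (parent) {
        case TYPE1: return TYPE2;     /* tested children of a Type 1 position */
        case TYPE2: return TYPE3;     /* children of Type 2 positions are Type 3 */
        default:    return TYPE2;     /* and children of Type 3 positions are Type 2 */
        }
    }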

I also categorized the extra work by one of three conditions when a position is searched to depth k. This categorization uses two values:

s_{k-1}: The value of the position, searched to depth k-1. The value s_{k-1} is either obtained from the transposition table or by recursive iterative deepening.

c_{k-1}: The value of the first child of the position, searched to depth k-1. The first child is identified using the move-ordering heuristics. In particular, the first child is the best move for the search of the position to depth k-1.

The categorization of the extra work is as follows:

Better than α: The case when c_{k-1} ≥ α. That is, the first child is a reasonable move compared to α.

Dropped below α: The case when s_{k-1} ≥ α and c_{k-1} < α. That is, on the previous ply of search, we thought that the first move was good, but now that we have searched deeper, we do not like the first child.

Stayed below α: The case when s_{k-1} < α and c_{k-1} < α. That is, we never did think that this was a good move, and we still don't think so.

I found that, depending on the position, 50% to 90% of the failed work is on Type 2 positions that dropped below α, while 10% to 40% of the failed work is on Type 3 positions that dropped below α. Using that data, I decided to try serializing the search for Type 2 and Type 3 positions that drop below α. This approach reduced the work by as much as 50%, which was even more than my measurements indicated it might. The critical path increased, however, so that the average available parallelism dropped below 100. On small machines the critical path was not a problem, but on big machines the serialization hurt the performance of the program.

I then tried a finer strategy for serializing the search. My idea was not to completely serialize the positions that were causing failed work to appear, but simply to serialize them a little bit. I tried a strategy of searching exactly one additional child serially, for positions of Type 2 that drop below α, before searching the rest of the children in parallel. This strategy worked out well, decreasing the total work by 10% to 15% while increasing the critical path only slightly, so that the average available parallelism was still over 500.

One recent enhancement to the Zugzwang program [FMM91] is to explicitly compute the number of critical children of a position; when searching a position with exactly one critical child and several promising moves, Zugzwang searches all the promising moves sequentially before starting the parallel search of the other children. Since the Zugzwang literature does not analyze critical path lengths, however, it is difficult to determine how Zugzwang's serialization scales with the machine size.

In summary, by measuring the critical path and total work, I was able to improve the performance of the StarTech program over a wide variety of machine sizes. If I had studied only the runtime on small machines, I would have been misled into overserializing the program. By measuring the critical path length, I was able to predict the performance on a big machine. I then verified that the performance of the tuned code matched the prediction when run on a big machine.
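The following C sketch shows the shape of Jamboree search with this serialization heuristic. It is only my reconstruction of the description above, not StarTech's actual code; the Position type and all helper functions are hypothetical, and the parallel loop over the tested children is shown serially.

    /* Illustrative sketch of Jamboree search with the Section 6.5 serialization
     * heuristic: a Type 2 position whose first child dropped below alpha searches
     * one extra child serially before the remaining children are tested.  Assumes
     * at least one legal move.  In StarTech the loop iterations are spawned in
     * parallel and may be stolen; failed tests are re-searched for value. */
    typedef struct Position Position;
    extern int  static_eval(Position *n);
    extern int  num_children(Position *n);                   /* also orders the moves */
    extern Position *child(Position *n, int i);
    extern int  is_type2(Position *n);                       /* Knuth-Moore Type 2?   */
    extern int  dropped_below_alpha(Position *n, int alpha); /* s >= alpha, c < alpha */

    int jamboree(Position *n, int alpha, int beta, int depth)
    {
        if (depth == 0)
            return static_eval(n);

        int nkids = num_children(n);
        int best  = -jamboree(child(n, 0), -beta, -alpha, depth - 1);
        if (best >= beta) return best;                 /* fail high: prune the rest */
        if (best > alpha) alpha = best;

        int start = 1;
        if (is_type2(n) && dropped_below_alpha(n, alpha) && nkids > 2) {
            /* Serialization heuristic: search exactly one more child serially. */
            int v = -jamboree(child(n, 1), -beta, -alpha, depth - 1);
            if (v >= beta) return v;
            if (v > best) best = v;
            if (v > alpha) alpha = v;
            start = 2;
        }

        for (int i = start; i < nkids; i++) {          /* parallel in StarTech */
            int v = -jamboree(child(n, i), -alpha - 1, -alpha, depth - 1);   /* test  */
            if (v > alpha && v < beta)
                v = -jamboree(child(n, i), -beta, -alpha, depth - 1);        /* value */
            if (v > best) best = v;
            if (v >= beta) return v;                   /* would abort the siblings */
            if (v > alpha) alpha = v;
        }
        return best;
    }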

6.6 How Time is Spent in StarTech

Now that we have a good understanding of the Jamboree search algorithm and the StarTech scheduler, it is worthwhile to examine how time is spent by the StarTech program. Examining a timing profile can provide important clues for how to improve the program in the future. Figure 6-9 shows how the processor cycles are spent by StarTech on a typical chess position that ran for about 100 seconds on a 512-processor machine. The program used the deferred-read mechanism and recursive iterative deepening, although it did not use the Jamboree serialization heuristic of Section 6.5.⁸ The biggest chunk of time is devoted to the chess work, which is broken down further in Figure 6-10.

⁸The Jamboree serialization heuristics do not substantially change the ratios of how processor cycles are spent.

68.8% of the cycles are `chess work' done by the parallel algorithm. Of those cycles, 21.4% can be accounted for by the time that our best serial implementation consumes.
14.4% of the cycles are spent by processors waiting for global transposition-table reads to complete.
6.6% of the cycles are spent by idle processors waiting for the global throttle to give them permission to steal work again.
3.6% of the cycles are spent by idle processors looking for work to do.
3.2% of the cycles are spent waiting for a child to complete, to determine if more work needs to be done at a position.
2.2% of the cycles are spent by busy processors servicing a transposition-table lookup.
0.6% of the cycles are spent by processors that have work to do responding to a request for work.
0.5% of the cycles are spent by a child waiting for an `abort' message from its parent, after sending the result to the parent.

Figure 6-9: How processor cycles are spent by StarTech on 512 processors running a typical problem from Kaufman's problem set, using the deferred-read strategy and recursive iterative deepening.

37.7% of all the cycles are spent on control flow for the Jamboree algorithm.
15.8% of all the cycles are spent moving the pieces on the board.
8.3% of all the cycles are spent on static evaluation.
3.3% of all the cycles are spent on move generation.
2.0% of all the cycles are spent sorting the moves.
1.6% of all the cycles are spent checking for repeated positions.
0.2% of all the cycles are spent checking for illegal moves.
68.8% of all the cycles are spent on `chess work'.

Figure 6-10: How the time is spent on `chess work' for StarTech running on 512 processors on a typical problem.

More than a third of all the processor cycles, and more than half of the cycles spent on `chess work', are spent by the code that implements the control flow of the Jamboree algorithm. In the serial program, the control flow of the α-β search algorithm consumes about a quarter of all the processor cycles. The biggest potential improvement is to improve the code that executes the Jamboree search, although I have not been able to find any obvious improvements to the code. Using true multithreading could potentially get back 14.4% of the cycles from waiting on transposition-table reads, 3.2% of the cycles from the time waiting on children, and 0.5% of the cycles spent waiting on parents. To save those 18.1% of the cycles would require implementing more code in the scheduler to handle context switching and frame allocation. These improvements are worth investigating.

It may be possible to tune the global throttle to get back a few percent of the runtime. In this example, the StarTech program spends 6.6% of the cycles with processors being stalled by the global throttle. Since the processors then spend 3.6% of the time actually sending requests for work and waiting for the reply, the throttle is not causing too much trouble. Only 0.6% of the cycles are spent by processors actually responding to requests for work. If we open the global throttle a little bit, that 0.6% will increase, which impacts the efficiency with which work actually gets executed; on the other hand, we may be able to get back some of the time spent waiting on the throttle. In addition, if we were using true multithreading, we could potentially hide some of the time spent waiting on the global throttle and waiting on the requests for work. This improvement is probably also worth investigating.

It has been widely argued that using the Hitech static evaluator is a bad match for an all-software implementation. Since Hitech uses special-purpose hardware, the Hitech static evaluator expects to run in constant time regardless of how sophisticated the static evaluation function becomes, and so the Hitech static evaluation function is designed to be as sophisticated as possible given the constraints of the Hitech hardware. In StarTech only the 15.8% of the cycles spent moving pieces on the board and the 8.3% of the cycles spent on static evaluation are attributable to the Hitech emulation. Perhaps a static evaluator designed for a software-only system would be better than Hitech's static evaluator, but it probably can't be much faster.

The code for move generation and checking illegal moves, which takes a total of 3.5% of the cycles, was optimized in assembly language by Ryan Rifkin under the direction of Mark Bromley of Thinking Machines Corporation. Before Ryan worked on that code, the move generation and illegal-move checking accounted for about 9% of all the cycles.

To check for repeated positions correctly and efficiently required some careful design. Recall that in chess, if a position is repeated three times, then either side can call the position a draw. Most computer chess programs actually do claim a draw as soon as a position is repeated thrice. Most chess programs, however, treat repeated positions improperly while performing the search. There are three cases when looking at a position in the search tree:

If the position has appeared twice in the game (i.e., before the root of the game-tree search), then the position should be evaluated as a draw. Most programs do this correctly.

If the position has not appeared at all in the game, then the position should be evaluated as a draw if and only if the position appears as an ancestor in the search tree, because the only rational reason to return to the same position is to try to force a draw. Most programs do this correctly as well.

If the position has appeared once in the game (before the root), then the rule should be the same as if the position has appeared zero times in the game, because the program should not assume that the other player is playing rationally. Most programs treat this case as a draw,

and it is possible for a clever opponent to trick the computer into turning a draw into a loss for the computer.

The summarized rule is thus:

StarTech Repeated Position Rule:

Evaluate a position as a draw if it appears once as an ancestor in the search tree, or if it appears twice in the actual game.

As far as I know, this rule is unique to StarTech. Most programs use a simpler rule:

(Slightly) Wrong Repeated Position Rule:

Evaluate a position as a draw if it appears once as an ancestor in the search tree, or if it appears once in the actual game.
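As a minimal sketch (with hypothetical types and arrays, not StarTech's actual data structures), the StarTech rule can be checked as follows; the actual implementation, described next, compares transposition-table hash keys and truncates the history at the most recent irreversible move.

    /* Illustrative sketch of the StarTech repeated-position rule, not the real code.
     * A position is scored as a draw if it appears once among its search-tree
     * ancestors, or twice in the actual game played so far. */
    #include <stddef.h>

    typedef unsigned long long hash_t;

    int is_draw_by_repetition(hash_t pos,
                              const hash_t *tree_ancestors, size_t n_tree,
                              const hash_t *game_history,   size_t n_game)
    {
        size_t i, tree_hits = 0, game_hits = 0;

        for (i = 0; i < n_tree; i++)          /* positions between the root and here */
            if (tree_ancestors[i] == pos)
                tree_hits++;
        for (i = 0; i < n_game; i++)          /* positions from the actual game */
            if (game_history[i] == pos)
                game_hits++;

        /* StarTech rule: once as a search-tree ancestor, or twice in the game. */
        return tree_hits >= 1 || game_hits >= 2;
    }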

Here is how StarTech implements its repeated-position tests. The entire game is broadcast to all the processors, so that the processors are able to examine the repeated-move list. StarTech uses the hash keys from the transposition table to test quickly whether two positions are the same. When a position is stolen, the sequence of positions between the root and the stolen position is sent through the data network to the stealing processor; this sequence of positions can be thought of as part of the board state. We can improve the performance of the repeated-move check by looking towards the root from a position only until an irreversible move is detected. An irreversible move is a move (such as a capture, a pawn move, or a move that results in the loss of castling rights) that guarantees that no previous position can be repeated. In fact, when sending the sequence of previous positions to the stealing processor, StarTech only sends the positions as far back as the most recent irreversible move. Since in quiescence search, which accounts for more than half of all the positions searched, almost all moves are irreversible, the StarTech program spends very little time manipulating repeated-position sequences.

One of the biggest open questions for tuning parallel chess programs is the impact of additional search heuristics on the critical path and total work. In StarTech we only did a simple search to a given depth and then performed quiescence search, trying out all the captures. Most state-of-the-art chess programs employ search extensions and forward pruning to improve the quality of their tree search. We have yet to see whether those sophisticated serial search strategies can be effectively parallelized.

In this chapter, we have studied ways to improve the performance of StarTech. These, along with the strategies studied in Chapter 3, should be useful for improving the performance of any parallel program.

Chapter 7

Conclusions

This dissertation argues that fast, predictable, global synchronization can solve many of the system problems of parallel computing. The dissertation first described hardware that uses global synchronization, then described a dynamic MIMD-style chess program which profitably exploited global synchronization, and finished with a study of how to use global synchronization to obtain good performance on bulk data transfers. Here I review the contributions of my work, discuss related work, and speculate about future application of the mechanisms that I developed.

Architectural Support for Global Synchronization

The work of my thesis consists of the design of the Connection Machine CM-5 supercomputer, which was originally intended for running data-parallel programs. Global synchronization is used throughout the CM-5 to help solve low-level problems, such as clock distribution and diagnostics support, and higher-level problems of operating-system support and running data-parallel programs. The three networks of the CM-5 (the control network, the data network, and the diagnostics network) fit into a framework called the synchronized MIMD architecture. A method of executing data-parallel programs on a synchronized MIMD computer was presented, along with a sketch of how to optimize such programs. The details of a specific synchronized MIMD computer, the CM-5, were presented to show what kinds of problems and solutions can be addressed by exploiting global synchronization.

The CM-5 and its networks can be timeshared, and can be spaceshared by dividing the machine into independent partitions. The networks are clocked using a robust, globally synchronous clocking system and communicate using low-swing electrical signals.

The processor-network interface allows fast user-level access to the networks without compromising interprocess security. The network interface uses address translation to guarantee that the user cannot send a message outside his own partition. The system uses a globally synchronous operating system to guarantee that when a message arrives at a processor, the message belongs to the process that is currently running.

The data network provides fast pairwise communication among the processors of the machine. The data network is implemented as a 4-ary fat-tree that uses distributed switches. The data network is actually split into two independent networks to help the user implement commonly used protocols such as remote-fetch. The data network provides an all-fall-down primitive to empty the network for context-switching.

The control network provides global barriers, broadcast, segmented scans, and segmented reduction operations. The data network and the control network also cooperate using ªKirchhoff countingº to compute when the data network has delivered all its messages.

The control network provides split-phase global synchronization.

The CM-5 diagnostics network extends the JTAG standard to test a system in parallel. The diagnostics network uses bit-steering to address subsets of the machine, so that they can be manipulated as a group. In order to make the diagnostics network robust and easy to fix, it is implemented with PAL chips that run synchronously at low speed.

The CM-5 design team tried to think of all the problems that show up in parallel programming and to provide mechanisms to solve those problems. It is a measure of the success of our effort that a whole collection of new problems have arisen that we never imagined. If we had not solved the problems we knew about, we would never see the new problems. The fact that, for the most part, those problems appear to be solvable for the CM-5 indicates that the mechanisms we designed are not overly specialized for data-parallel computing. The biggest problems for the CM-5 are faced by the operating-system implementors. Today's operating systems were not designed for parallel computing, and they are large and difficult to modify.

The history of SIMD machines goes back to the Illiac-IV [BBK*68], which had processors with wide words. The terms `SIMD' and `MIMD' were coined by M. Flynn [Fly66]. The Goodyear MPP [Bat80], implemented with VLSI chips, started the trend towards using narrow processors: the MPP used 1-bit processors. The MPP and the Illiac-IV both provided a two-dimensional mesh for communications, and the MPP extended the idea to support a limited set of additional permutations. The TRAC [LT77] allowed the word width of the machine to be varied to match the problem, and it also provided a statically configurable Banyan network. Both TRAC and PASM [SKS81] were multiple-SIMD machines that were designed to allow a small collection of SIMD programs to run concurrently and communicate with each other. Thinking Machines Corporation's Connection Machine CM-1 [Hil85] was the first SIMD machine to provide a routing network, to let users send messages to arbitrary processors without worrying about how the messages get to their destinations. The CM-1 router used an adaptive routing scheme based on a 12-dimensional hypercube. The Connection Machine CM-2, also developed by Thinking Machines, started the swing back to wide processors by including, at first, 32-bit floating-point units, and then, later, 64-bit floating-point hardware. The CM-2 also provided indirect addressing to its local memory. The CM-2's router implemented combining both for send and get operations. IBM developed the GF11 [BDW85], a SIMD machine with wide processors and a permutation network capable of supporting up to 1024 user-defined permutations. MasPar developed the MP-1 and MP-2, both of which use 4-bit-wide processors and an oblivious router based on a butterfly topology.

One of the earliest MIMD parallel computers was the Cosmic Cube [Sei85], which provided an oblivious routing network based on the hypercube. W. Dally [Dal87] argued that low-dimensional meshes provide better performance than high-dimensional hypercubes for oblivious routers. The architecture of the Cosmic Cube was commercialized by Intel, which, persuaded by Dally's arguments, eventually switched to an oblivious two-dimensional mesh network in the iPSC/2 [Int88]. The CM-5 design team was also influenced by Dally's argument that a data network should be organized into communications channels that are neither too wide nor too narrow: wider channels increase the diameter of the machine, increasing latency, while narrower channels increase the time it takes for a message to cross a single channel. When designing the fat-tree-based routing network [Lei85], we chose to use 4-bit-wide communications channels.

Like the Cosmic Cube and its descendants, the CM-5 uses a distributed-memory organization, and the processors communicate among themselves using message-passing techniques [Sei85, SAD*86]. Another way to organize the memory of a MIMD computer is as a shared memory [LB80, GGK*83, DT90]. Some recent machines, such as Alewife [ACD*91], have tried to merge the shared-memory and distributed-memory architectures.

Other machines with global synchronization include the SIMD machines (discussed above), the proposed Burroughs Flow Model Processor (FMP) [LB80], the DADO machine [SS82], and the DATIS-P machine [PS91]. Of particular note is the FMP which, although it was never built, proposed hardware for split-phase global barriers and described a scheme to broadcast the same program to every processor, which would then run it locally. Split-phase barriers were described for the FMP, and were also described by C. Polychronopoulos [Pol88] and R. Gupta [Gup89]. Arvind and R. Iannucci [AI87] present general justification for split-phase operations. B.-H. Lim [Lim91] performs a study to determine when it is better to simply wait and when it is better to try to use split-phase transactions.

The CM-5 does not include any message-combining mechanisms in the data network. G. Pfister and V. Norton showed that combining can help avoid hot spots in a data network [PN85]. A. Chien found that combining does not work very well for many real programs, but that it is possible to dynamically detect hot spots [Chi86]. P.-C. Yew et al. explored how to use software combining to avoid hot spots [YTL87]. W. Dally analyzed the hot-spot behavior of meshes [Dal87].

The data-parallel style of programming developed incrementally over time. The programmers of the earliest SIMD machines were using a form of data-parallel programming embodied in FORTRAN. (For example, see [Per87] for a discussion of Illiac IV CFD FORTRAN and other subsequent parallel FORTRAN dialects.) D. Christman [Chr83] compiled a collection of data-parallel algorithms and programming strategies. C. Lasser developed a programming language SIMPL (SIMd Programming Language) [Las85] based on those strategies. The SIMPL language eventually became StarLISP [Thi86a]. W. Hillis and G. Steele [HS86] identified the `data parallel' style of programming and argued that it was more useful than had previously been thought. G. Steele [Ste90] provided a sophisticated rule to run asynchronous programs deterministically, in contrast to my simple rule of placing a barrier before and after every communication operation of a data-parallel program. G. Blelloch [Ble90] showed that the scan primitive is an essential ingredient of the data-parallel paradigm.

Meanwhile much research on how to execute data-parallel programs on MIMD machines was being performed. L. Valiant's bulk-synchronous parallel model (BSP) [Val90] and Sabot's paralation model [Sab88] both attack the problem of how to express general parallel programs using an underlying serial semantics. F. Darema-Rogers et al. developed the Single Program Multiple Data style of programming [DGN*86]. D. Callahan and K. Kennedy focused [CK88] on how to compile bulk data movements written in serial FORTRAN. Similarly, A. Rogers and K. Pingali [RP89] and Zima et al. [ZBG88] follow the strategy of generating a parallel program from a serial program augmented with data-distribution primitives. Other data-parallel languages include CM-FORTRAN [AKL*88], FORTRAN 8X (now FORTRAN 90) [MR90], High Performance FORTRAN (HPF) [HPF93], C* [RS87], and Dataparallel C [HQL*91].

Massively Parallel Chess

The chess work consists of a parallel game-tree search and a scheduling mechanism, as well as the measurement of the performance of the program.

Chess-playing ability, whether in a computer or in a human, is a multidimensional skill. Computers need a good opening book, good endgame knowledge, a good search strategy, a good evaluation function, and enough speed. The Elo chess rating system [Elo78] maps this multidimensional skill into a linear rating, and it is difficult to understand exactly how an improvement in one dimension affects the overall rating. Another way to view chess-playing ability is that there is a space of chess positions, and that the box spanned by the various dimensional qualities contains those positions that the program understands how to play. My strategy was to take an existing chess program,

Software-Hitech (which has a less rich collection of search extensions than the standard Hitech program, but is otherwise the same as Hitech of September 1991), and push as hard as I could in the brute-speed direction. The promise of my approach is that as one incorporates improvements in other dimensions, the ªspeedº dimension will not be a limiting factor. Improvements might include an improved static evaluation function (or perhaps one better suited for software than Hitech's hardware-oriented static evaluator), improved search extensions, and possibly even completely different search strategies such as B* search [Ber79] or conspiracy number search [McA88]. By ªimprovingº the other dimensions of the chess skills, we hope to encompass many more chess positions into our ªplayableº space. In the future I hope to investigate how well such techniques can be parallelized.

StarTech uses Jamboree search, a parallelized Scout algorithm. The Scout algorithm is a variant of the short, but tricky, α-β algorithm. D. Knuth and R. Moore [KM75] provided the first published analysis of the α-β algorithm. The main advantage of the α-β algorithm is that, if the tree is ordered so that a good move is considered before a bad move is considered, the algorithm prunes away useless work and searches a tree which is much, much smaller than the entire tree. Thus, the α-β algorithm rewards programs that do a good job of ordering moves in the search tree. Knuth and Moore contrasted α-β search with a weak version of α-β search that makes do without the α argument and achieves somewhat fewer cutoffs. They identified the critical tree of a game tree as the part of the tree that must be searched by any algorithm, including α-β search, in order to identify the best move.

The simplest approach to searching a game tree in parallel is to spawn all the children of any given position and search them in parallel. This works wonderfully for game trees that are worst-ordered, because serial α-β searches the entire tree in that case, and the parallel version searches the entire tree in parallel. The problem arises when the game tree is ordered reasonably well, in which case α-β search prunes away most of the tree, but the naive parallel algorithm still searches the entire tree. The speedup achieved by this naive approach is generally limited to O(√P) on P processors [Fis84].

Thus, the challenge is to perform a parallel tree search while obtaining most of the pruning achieved by the serial α-β search. Most of the approaches to effective parallel α-β search are based on J. Pearl's Scout algorithm [Pea80]. The serial Scout search algorithm, on which Jamboree is based, distinguishes between a search to determine the value of a position and a search to test whether the value of a position is greater than a particular bound. Algorithm Scout obtains the value of the first child, and then it tries to prove that the other children are worse choices than the first child. If the proof fails for one of the children, Scout search re-searches the child to determine its correct value. Since testing is cheaper than valuing, Scout search is gambling that the tests succeed frequently enough to offset the cost of the re-searches.

Related to Scout search is aspiration search. In aspiration search, the search window is broken

into several pieces which are searched independently. Because the windows are small, they often produce a result quickly. In fact, serial aspiration search runs faster than infinite-window α-β search. The Greenblatt chess program used aspiration search [GEC67], and today most state-of-the-art chess programs, including StarTech, use some form of aspiration search. G. Baudet attempted to parallelize aspiration search by farming each of the search windows out to a different processor [Bau78]. Baudet found that parallel aspiration search achieves a maximum speedup of only about three to five.

My Jamboree search algorithm is a parallelization of Scout search. Jamboree search operates by searching the first child of a position, and then it tests the remaining children in parallel. Any tests that fail are sequentially re-searched. This approach is natural for parallel game-tree search, and variants of it have been used by several other parallel chess-playing programs, including Cray Blitz [HSN89] and Zugzwang [FMM91]. Other parallel algorithms based on Scout search include minimal tree search [ABD82], mandatory work first (MWF) [Fis84], principal variation splitting (PV-splitting) [MC82], and the Zugzwang search algorithm [FMM91].

Parallel aspiration search [Bau78] and the Karp-Zhang AND/OR tree search algorithm [KZ89, Ste92] use other strategies to parallelize α-β search.

Other approaches to game-tree search, not based on α-β search, include B* search [Ber79], Conspiracy search [McA88], and SSS* [Sto79]. These other approaches require much more memory space than the algorithms based on α-β search. Programs based on these algorithms have only recently begun to play competitive chess.

Several chess programs find parallelism in the static evaluation function. The BEBE program by T. Scherzer¹ uses parallel hardware to perform static evaluation. Both Hitech [BE89] and Deep Thought [Hsu90] use parallel hardware for the static evaluation and move generation functions.

T. Marsland and M. Campbell's survey article on parallel search provides a more detailed explanation of the various parallel search algorithms [MC82]. M. Newborn provides a comprehensive history of computer chess through 1975, starting with the Turk [New75].

I have shown that improving the move ordering decreases the critical path length and the amount of parallel work done. Thus, it makes sense to spend effort to improve the move ordering. StarTech's strategy is to

sort the moves by the incremental improvement in the static evaluation function, and

pull the move stored in the transposition table to the front of the list.
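A rough sketch of that ordering policy (with hypothetical types and helper names, not StarTech's actual code):

    /* Illustrative sketch of StarTech's move-ordering policy: sort moves by the
     * incremental change in the static evaluation, then pull the transposition-table
     * move (if any) to the front of the list. */
    #include <stdlib.h>

    typedef struct { int move; int delta_eval; } MoveInfo;

    static int by_delta_desc(const void *a, const void *b)
    {
        const MoveInfo *ma = a, *mb = b;
        return mb->delta_eval - ma->delta_eval;      /* larger improvement first */
    }

    void order_moves(MoveInfo *moves, int n, int tt_move /* 0 if none */)
    {
        qsort(moves, n, sizeof *moves, by_delta_desc);
        for (int i = 0; i < n; i++) {
            if (moves[i].move == tt_move && i > 0) {
                MoveInfo m = moves[i];               /* pull the TT move to the front */
                for (int j = i; j > 0; j--)
                    moves[j] = moves[j - 1];
                moves[0] = m;
                break;
            }
        }
    }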

Several other techniques for improving move ordering appear in the literature, including the following.

Recursive Iterative Deepening: In the late 1980's, H. Berliner and his students at Carnegie-Mellon University noticed that their chess program would sometimes encounter a `search catastrophe', in which some subsearch fails low, and then the program is making deep searches without the transposition table having any useful data. The program would spend a lot of time looking at moves that a shallow search would have discarded. They found that recursive iterative deepening was helpful for avoiding the search catastrophe, but did not come to a definitive conclusion on the value of recursive iterative deepening [Ber93]. I found that recursive iterative deepening is a definite advantage for StarTech, because move ordering is so important for a parallel chess program.

Deferred-Read of Transposition Table: I borrowed the techniques used in Arvind's I-structures [ANP89] and P. Barth's M-structures [BNA91] to implement deferred reads of transposition-table lookups. I used deferred reads to keep different processors from working on the same thing, thus making the dynamic-programming approach implicit in transposition tables more successful.

Thompson Move Ordering: K. Thompson argues [CT82] that the moves should be initially sorted so that moves that capture the higher-valued pieces (such as queens) are earlier in the list. The program should prefer to capture a big piece with a small piece when it can. The current StarTech move-ordering strategy tends to prefer to capture a big piece with a big piece, since the static evaluator often prefers to get the big pieces into the action.

¹The BEBE program is mentioned in the parallel search survey article by T. Marsland and M. Campbell [MC82].

Killer and History Tables: R. Greenblatt introduced [GEC67] the killer table, which, for each move at the root of the search tree, stores a refuting move that demonstrates that the root move is bad. The history table [MOS86] is a small set of moves that have recently caused a cutoff. If any ªkiller moveº or ªhistory moveº appears in the move list, it is moved to the front of the move list. The idea of the killer and history tables is that moves that are good in one position are likely to be good in other positions. StarTech does not currently use these mechanisms.

Further Serialization: I showed that even if two children are searched serially before searching the rest in parallel, for best-ordered trees there is plenty of average available parallelism. For real chess trees, that strategy did not work very well, because the critical path length grew too much. A more limited strategy that searches two children serially only under certain conditions was successful, however. Zugzwang [FMM90] combines the killer and history tables with serialization: on Type 2 positions of the search tree, Zugzwang serially searches the moves mentioned by the killer and history tables before starting the parallel search of the remaining children.

T. Marsland and P. Rushton characterized strongly ordered game trees [MR74], and J. Gillogly measured the extent to which trees are best-ordered [Gil78]. Transposition tables are a memoization technique; for further information about how to use and implement memoization, see R. Bird's survey article [Bir80].

I found that I need a better collection of chess benchmarks than what I have. It is possible that some of my results have been biased by the fact that my chess problems run so quickly. With a more difficult set of benchmarks, I might get different results, although I would expect the overall results to be similar.

The StarTech program uses a work-stealing scheduler to distribute and schedule the work uncovered by the Jamboree search algorithm. In the StarTech scheduler, each busy processor posts any work that can be spawned onto another processor. Each idle processor periodically asks another, random processor for posted work. The work-stealing scheduler uses a global throttle, implemented with the CM-5 control network, to keep the idle processors from swamping the busy processors with requests for work. The requests for work are organized into globally synchronous phases. During a single phase, all of the idle processors are allowed to send out one request, and they get their replies (which consist either of a ªno workº message or a message containing some work). After all the idle processors get their replies, and all of the busy processors have had a chance to get some work done, they indicate, using the control network, that the next phase may begin. Thus, the work done by the computation is divided into ªspawningº and ªnonspawningº work. The spawning work is regulated by the global throttle, and the nonspawning work proceeds asynchronously. I found that the swamping problem gets worse as the machine size gets larger. Researchers who study schedulers on only small machines may find that their strategies do not scale to large machines.

Program Zugzwang uses a work-stealing scheduler with various deterministic strategies for determining whom to ask for work. Program Zugzwang deals with the swamping problem in an ad hoc fashion, by disabling the requests for work at the beginning of the search. I found that sometimes in the middle of the search the requests need to be throttled. The DIB package [FM87] uses a work-stealing approach for distributed backtrack search. Package DIB provides several deterministic strategies for requesting work, some of which seem to suffer from the swamping problem. Apparently, one way DIB copes with the swamping problem is by forcing any processor that is denied work to wait for some fixed time before asking for work again.
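A minimal sketch of one globally throttled work-request phase, as described above, with hypothetical primitives (global_barrier, send_steal_request, and so on) standing in for the CM-5 control-network and active-message calls that StarTech actually uses:

    /* Illustrative sketch, not StarTech's code: the globally synchronous phases of the
     * work-stealing throttle.  In each phase every idle processor sends exactly one
     * steal request, busy processors make progress on their own work, and a global
     * barrier separates one phase from the next. */
    extern int  search_is_done(void);
    extern int  have_local_work(void);
    extern void do_some_local_work(void);            /* nonspawning work */
    extern int  random_victim(void);
    extern void send_steal_request(int victim);      /* one request per idle processor */
    extern void serve_requests_and_replies(void);    /* answer "no work" or ship a task */
    extern void global_barrier(void);                /* CM-5 control-network barrier */

    void scheduler_loop(void)
    {
        while (!search_is_done()) {
            if (have_local_work())
                do_some_local_work();                /* busy: keep searching */
            else
                send_steal_request(random_victim()); /* idle: ask one random processor */
            serve_requests_and_replies();            /* process incoming messages */
            global_barrier();                        /* the next phase may now begin */
        }
    }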

The DIB package is designed for robust operation in distributed systems with failures, while the StarTech scheduler is designed for fairly tightly coupled systems in which the entire machine is assumed to be robust.

The lazy task creation technique used in the Mul-T compiler of E. Mohr, D. Kranz, and R. Halstead [MKH91] attempts to reduce the overhead of a work-stealing scheduler by avoiding the creation of closures and other data structures until work is actually being stolen. StarTech also reduces the overhead of stealing work by using this technique. When StarTech posts work to be stolen, it posts only a pointer to an existing board position, and a move to get to a new board position to be searched. Thus, StarTech does not incur the overhead of copying the entire state of a position until work is actually being stolen. The Mul-T compiler dramatically reduces the amount of overhead for managing parallelism, but the study [MKH91] does not compare the performance of their system to a best serial implementation (rather, it compares the performance to a serial implementation of Mul-T). Thus, it is difficult to judge the effectiveness of lazy task creation in Mul-T.

D. Nussbaum [Nus93] describes a work-stealing scheduler that tries to take advantage of locality in the parallel machine. Nussbaum reviews other locality-preserving schedulers. For the CM-5, the difference between communicating with a nearby processor and communicating with a far processor is negligible, so I did not worry about trying to steal work from nearby processors.

The Parallel Continuation Machine (PCM) of M. Halbherr, Y. Zhou, and C. Joerg [HZJ94, Hal94] uses a more general work-migration strategy than I use in StarTech. By moving work from one processor to another, Halbherr avoids the space-time tradeoffs discussed in Section 5.8. He has not attempted to prove any time or space bounds for PCM with respect to critical path and parallel work, however. All the PCM programs that Halbherr has studied have much `nicer' parallelism profiles than does StarTech. It is not yet clear how well PCM's design would do on a truly demanding application, such as parallel computer chess.

Several researchers have used adaptive backoff strategies to avoid swamping. The Ethernet [MB76] uses a randomized adaptive backoff strategy to avoid the swamping problem in the domain of local area networks. The adaptive backoff strategy has not been analyzed for work schedulers, however. A. Agarwal and M. Cherian study how to use adaptive backoff to improve system performance in the face of hot spots. DIB uses an adaptive nonrandomized backoff strategy, and is tricky to tune for performance [FM87]. Halbherr's PCM successfully uses a randomized backoff strategy to avoid the swamping problem, but Halbherr has only experimented with undemanding parallel programs [HZJ94, Hal94]. No analysis of adaptive backoff has been performed. The global throttle requires very little performance tuning, whereas adaptive backoff must be tuned to obtain good performance.

A variation of my method of removing acknowledgments from the abort/result protocol appears in distributed systems [Tv85].

The StarTech system uses active messages [vCG*92] for interprocessor communication. Although active messages restrict the handler to run quickly and without using too many resources, I was able to implement efficiently and simply the protocols I needed.
Traditionally, programmers have used remote procedure calls (RPC) to implement distributed protocols [BN84], but RPCs are quite expensive [TL93]. Optimistic active messages promise to provide the efficiency of active messages with the full expressive power of RPC [HJK*94]. Optimistic active messages are slower and more complex than what is required of the StarTech active-message protocols.

I developed a scheduler for the CM-5 with the dual goals of achieving linear speedup (subject to the critical path length of the computation) and staying within the memory bounds of the processing nodes of the CM-5. The StarTech scheduler achieves the memory bounds, but makes fairly weak guarantees about the linear speedup. R. Blumofe and C. Leiserson [BL93], inspired by my early problems meeting these goals, identified a number of scheduling policies that achieve linear speedup and good memory bounds. Graham [Gra66, Gra69] showed that any directed acyclic dataflow graph can be scheduled to run within a factor of two of optimal on a machine with no overhead. R. Brent [Bre74] produced the formulation that states that the graphs can be scheduled to run in time no more than W/P + C, where C is the critical path, W is the work, and P is the number of processors.

The difficulty of measuring the critical path length for the real chess program derives from the fact that the program can abort subcomputations that have started. I found that aborting subcomputations was valuable in practice, but one might ask if there is some way to get around the necessity of aborting subcomputations. There is some theoretical evidence that we cannot escape the necessity of aborting subcomputations. The ability to abort computations is similar to the parallel `or' construct. The parallel `or' construct, por(a, b), returns true if either a or b returns true, even if one of a or b is a divergent computation. G. Plotkin showed [Plo77] that adding a parallel `or' construct to a serial language not only changes the semantics of the language, but it changes the calculus of the language, leading to a fully abstract language, which means, roughly, that one can reason about the language exactly as mathematicians would like to. To implement a parallel `or' construct efficiently requires the ability to abort the second subcomputation whenever the first subcomputation returns a `true' value.

The GITA dataflow interpreter computes the critical path and work when executing an infinite-processor simulation [NFH88]. The GITA interpreter produced critical path lengths, but no guarantees were made for speedups or space bounds, partly because the dataflow programs being simulated were nonstrict. Traub showed that nonstrict programs can be quite difficult to compile [Tra88]. Blumofe and Leiserson [BL93] showed that some nonstrict programs have no schedule with good time and space bounds.

Other work in the chess literature has focused mostly on the work efficiency, rather than the critical path, of the search. One exception is Fishburn [Fis84], who analyzes the critical path for MWF by writing a recurrence relation that expresses the critical path length, starting at the end of the program and proceeding backwards to the beginning. Feldmann et al. [FMM91] take steps to reduce the critical path length, without having analyzed or measured it. It is difficult to determine from their papers whether their efforts are useful. Marsland et al. [MOS86] use overhead measurement to characterize the performance of their parallel programs. Overhead measurement is inaccurate and does not provide the performance insight that critical path and work provide. I have systematically used critical path length to estimate the average available parallelism for a chess program. Program StarTech measures critical path lengths as an aid to understanding the performance of the program. Instead of doing an infinite-processor simulation, my approach is to compute estimated critical path lengths during an actual production run on a parallel machine. In spite of the fact that the work and critical path length are not well defined for my algorithm, I found that I could tune the performance of the program on small machines and accurately predict the performance on big machines. By combining the measurement of the critical path with the measurements on large machines, I was able to validate my performance model and do most of my code development on more readily available small machines.

While many other researchers have designed multiprocessor systems to support dynamic MIMD-style programming [DR81, Hal84, Bir89, SYH*89, Cla87, PC90, CSS*91, NPA92, Hal94], those systems provide no time or space bounds. D. Culler describes some ad hoc techniques for managing space bounds of dataflow programs [Cul89]. The StarTech scheduling mechanisms may be useful for supporting a wide range of dynamic MIMD-style computations, including backtrack search and programs written in multithreaded programming languages.

Data Network Performance Mechanisms

With naive message-passing software and processor-network interfaces, a parallel computer may achieve only a fraction of the communications performance of the underlying network. Empirical results indicate that there are several mechanisms and techniques that can be used at various levels in the system to ensure high performance.

We found three underlying mechanisms that can improve performance. Barriers can be used to determine quickly when all processors are finished sending or when all are finished receiving. The order in which packets are injected into the network can be managed. The rate at which packets are injected into the network by the sender can be tuned to match the rate at which the target can receive messages.

There are several reasons why these mechanisms work. They can help avoid target collisions, in which several processors send to one receiver at the same time. Over time, they can help to smooth out the bandwidth demands across various bisecting cuts of the network. The mechanisms can help the programmer ensure that the packets in the network at any given time have independent, evenly distributed destinations. These mechanisms also provide various forms of flow control, which improves the efficiency of the network. For example, barriers act as global all-pairs flow control, guaranteeing that no processor gets too far ahead at the expense of another processor.

Our results indicate that it may be a good idea to place some of our mechanisms into the network and the processor-network interface. A parallel computer should provide a rich collection of global flow-control mechanisms. Almost any form of flow control is better than none. The best way to apply each of these flow-control mechanisms is not fully understood. It may be helpful to have hardware support to determine more about dynamic network congestion, in addition to the CM-5's mechanism that indicates the presence of an arrival.

A parallel computer should support fast, predictable barriers. The cost of a barrier should be competitive with the cost of sending a message. The behavior of barriers should be independent of the primary traffic injected into the data network. The CM-5 provides such barriers by using a hardware global-synchronization network. Multiple priorities or logical networks could also be used. It may be beneficial for the system to perform a periodic barrier automatically to keep processors synchronized during communications operations.

The receiver must be at least as fast as the sender. Allowing user-level access to the network interface is the most important step in this direction. Hardware support to speed up the receiving of messages even by a few cycles would help improve the programmability of the CM-5, however.

The network, the processor-network interface, and its software should provide mechanisms to manage the order in which packets are injected into the network. A direct-memory-access (DMA) engine for sending and receiving packets, such as those proposed for MIT's *T [PBG*93] and Stanford's FLASH [KOH*94] machines, can make it easier to overlap communication and computation, but our experiments indicate that such engines may require fairly sophisticated packet-ordering and flow-control policies to achieve good performance. Similarly, very large packets are probably a bad choice because they have the effect of preventing packet reordering. The entire system must avoid head-of-line blocking.
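For example, a bulk all-to-all exchange that intersperses barriers between rounds might look roughly like the following sketch. The communication primitives named here are hypothetical stand-ins, not the CM-5 library calls.

    /* Illustrative sketch: interspersing barriers in an all-to-all exchange so that
     * no processor races ahead of the others.  Each round targets a different
     * destination (a cyclic shift), which also spreads load across the network. */
    extern int  my_proc(void);
    extern int  num_procs(void);
    extern void send_block(int dest, const void *buf, int nbytes);
    extern void poll_receive(void);          /* drain any packets that have arrived */
    extern void global_barrier(void);

    void all_to_all(const char *sendbuf, int block_bytes)
    {
        int p = num_procs(), me = my_proc();
        for (int round = 1; round < p; round++) {
            int dest = (me + round) % p;     /* every processor targets a distinct dest */
            send_block(dest, sendbuf + dest * block_bytes, block_bytes);
            poll_receive();
            global_barrier();                /* keep the rounds of the pattern in step */
        }
    }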
I believe that these results apply to a wide range of parallel computers, because the observed effects are fundamental. Bandwidth considerations, scheduling issues, flow control, and composition properties will appear in any high-performance communications network. In particular, the rate at which a receiver can remove messages from the network may be the fundamental limiting issue in any network that has sufficient bandwidth to ensure that internal congestion is not the dominant issue.

Our experiments provide empirical evidence that some of the strategies used by theorists to prove theorems also make sense in real parallel systems. Theorists have argued that slowing things down can speed things up or provide predictable behavior [GL89]. Both barriers and bandwidth matching, which at some level slow down the system, actually speed things up. The use of barriers prevents processors that have gotten a little bit ahead from widening their lead. Parallel computing is not a marathon in which the first processor that finishes wins. It is a race against the clock in which we care about the finishing time of the slowest processor. Bandwidth matching is analogous to freeway on-ramp meters, which reduce the variance of the arrival rate to keep traffic flowing smoothly. Other examples of slowing things down to keep them working well include Ethernet's adaptive backoff [MB76], and the telephone system's approach to dealing with busy calls by forcing the caller to redial rather than wait in a queue [Kle76, p. 103].

Not only are there sound theoretical arguments for randomized injection order [GL89, LAB93], but there is significant practical application of reordering, even when starting with what seems like a reasonable injection order. Theorists have argued that measuring the load across cuts of a network is a good way to model performance and to design algorithms [LM88]. We have seen a few situations where we could apply such reasoning to make common patterns such as all-pairs run faster.

The MIT Strata library already incorporates these techniques for the CM-5. In the future, this research may lead to the development of mechanisms that routinely provide predictable, high-performance communication.

Synchronized MIMD Computing

The evidence in this dissertation is that synchronized MIMD hardware is a good idea. The control network, the new feature added to MIMD hardware to get synchronized MIMD hardware, allows the programmer to take advantage of frequent barriers. The control network functionality could potentially be implemented without special-purpose hardware, but one requirement is that the high performance be unaffected by other users or even by the programmer's own data messages. The processors of today and of the near future will probably not be able to handle messages quickly enough to implement a low-latency synchronization using only data messages. On machines in which the processor-network interface requires only one instruction to handle a message, perhaps data networks will be a viable alternative to a hardware control network. Currently, the only machines that have fast enough network interfaces to support effective global synchronization directly in the data network are dataflow machines such as Monsoon [PC90] and the EM-4 [SYH*89, SKS*92], but those dataflow machines do not offer performance that is independent of the load in the data network. One of the major contributors to the latency of the control network on the CM-5 is the fact that the processor takes several microseconds to interact with the control network. The mechanisms used to make data network interfaces go faster may also be applicable to making control network interfaces go faster. It is difficult to predict what the implementation will look like, but programmers will need access to fast global synchronization.

Barriers provide a composition property. If you understand the performance and meaning of A and of B, then you can easily understand the performance and meaning of ªA; barrier; Bº. We saw that when programming with the data-parallel model, some barriers that are not needed for correctness are still desirable for performance. Barriers are also useful for implementing schedulers

that have a global-correctness property, such as not running out of memory, or not swamping the data network with requests for work.

The synchronized MIMD architecture effectively supports the data-parallel programming model, and it can also be used to support a more general dynamic MIMD style of programming. In order to make it easier for programmers to use a dynamic MIMD style of programming, linguistic support is needed. Dataflow languages such as ID [Nik87] make it easy to write a dataflow program, but do not provide any time or space bounds for the program. In fact, Blumofe and Leiserson [BL93] showed that there are ID programs that cannot be scheduled to run quickly without using an unreasonable amount of memory. To solve these problems calls for the development of an algorithmic dataflow language that can guarantee time and space bounds.

Synchronized MIMD computers have the potential to relegate workstations to the status of mere terminals. Fundamentally, there is no reason to put a computer on every desk if the user could get a fair share of a much larger computing power. Eventually, the additional cost of having a parallel computer instead of a collection of workstations will become negligible. Users will want operating systems that guarantee that they can always access their fair share of computing power. Fair scheduling is not a major goal of today's operating-system development. Perhaps some of the techniques described here will be applicable to the design of fair operating systems.

My scheduling mechanisms work well for medium-to-coarse-grained dynamic MIMD programs. To obtain good performance from fine-grained dynamic MIMD programs may require additional work. It is not clear just how much fine-grained parallelism is going to be able to buy us, but M. Lam and R. Wilson provide one hint in their work, which indicates that there is potentially more than 50-fold parallelism in ordinary C programs [LW92]. One direction to pursue this research is to try to apply my scheduling strategies to standard C programs.

It has been widely argued that one cannot afford to put any new `academic' mechanisms into state-of-the-art RISC microprocessors because of the billion-dollar investment that is put into such microprocessors. That billion-dollar investment, however, is indicative of the fact that the `Reduced Instruction Set Complexity' designs have become very complex indeed. Any approach that simplifies the design of a microprocessor may have an overall advantage. Perhaps there is, once again, more opportunity to make gains based on architectural improvements rather than from shaving off nanoseconds.

In the future, not only parallel programs, but ordinary programs such as editors and compilers, may be able to exploit the power of large shared parallel computers. Such a goal is far from being realized, but the mechanisms described in this work may help pave the way towards that kind of truly practical parallel computing.

Acknowledgments

I would like to thank Dana Sue Henry who, as a graduate student, provided clear thinking and criticism of my work; and, as my wife, provided support and encouragement, and took on most of the work of caring for our daughter during the first few months of her life and the last few months of my thesis work. Anne Henry Kuszmaul, our daughter, gave that last bit of incentive to finish my thesis and get a job.

My thesis advisor and good friend, Professor Charles E. Leiserson, always challenged me to excel and to play the game at the next level of sophistication. Charles always has good ideas for improving my work and its presentation. Professor Arvind directed my ID compiler research (which did not turn out to be part of this thesis), and patiently encouraged me to do the research that I really wanted to do. Professor Frans Kaashoek and Professor Gregory Papadopoulos served as readers on my thesis committee.

Chapter 2 of this dissertation owes much to the coauthors of "The Network Architecture of the Connection Machine CM-5" [LAD*92]: Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. On behalf of my coauthors, I would like to thank the many people at Thinking Machines who contributed to the success of the CM-5 project. Space does not permit me to thank them all individually. Of special mention, however, are Pam Chulada-Smith, for keeping our systems running; Ken Crouch, for helping to design and test the data network protocols; Rick Epstein, who managed the hardware design of boards; Rolf-Dieter Fiebrich, who managed the CM-5 project; Don Moodie, who engineered the mechanical design; Sami Nuwayser, for his system administration and chip-design talents; Dave Patterson, for encouraging us to resurrect our MIMD-plus-control-network design; David Potter, who led the electrical and packaging effort; Linda Simko, whose administrative talents saved us from needing to have any; Cindy Spiller, for writing the initial system code and helping to verify the system; Guy Steele, for insights into the linguistic implications of the network architecture; and Jon Wade, who contributed to the design of the differential pads and system clocking.

In addition, I would like to thank Keira Bromberg, Dan Cassiday, Chung Der, John Devine, John Earls, Jamie Frankel, Steve Goldhaber, Harold Hubschman, Jim Lalone, Don Newman, Ralph Palmer, Gary Sabot, Soroush Shakibnia, Craig Stanfill, John Sullivan, Lew Tucker, Bob Vecchioni, Charlie Vogt, and Bruce Walker for their contributions to the CM-5. We also thank the representatives from Thinking Machines' chip vendors (Mark O'Brien, Curt Dicke, Laura Le Blanc, Jerome Li, and Shabnam Zarrinkhameh of LSI Logic; and Jerry Frenkil, Frank Kelly, Marc Corbacho, Paul Pinelle, and George Wall of VLSI Technology, Inc.) for their help in producing the network chips.

Professor Steve Ward supported my early research on the Synchronized MIMD architecture, which was presented at my oral qualifying examination.

Dick Clayton at Thinking Machines made the decision to give me a chance to be a computer architect.

The massively parallel chess work reported in this dissertation spans three universities and one corporation, with contributions from people at Carnegie-Mellon University, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign (NCSA), MIT, and Thinking Machines Corporation.

My research into massively parallel chess would not have been possible without the help provided by Hans Berliner and Chris McConnell of Carnegie-Mellon University. Hans and Chris provided me with a C-language version of Hitech, and provided much helpful advice over a long period of time. I depended heavily on their judgment and chess knowledge.

At the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign (NCSA): Larry Smarr and Michael Welge decided, with only a few hours' notice, to dedicate the entire 512-node CM-5 for almost a week to running chess at the 1993 ACM International Computer Chess Championship. The CM-5 support staff at NCSA was extremely helpful. Curt Canada worked heroically during the three days before the tournament to install a new version of the operating system, get everything working, and fix an intermittent hardware problem. One of the operators at NCSA spent over an hour working with me to resolve a modem problem. The NCSA runs a first-rate facility.

At MIT: Prof. Charles E. Leiserson continually thought of ways to make the chess program better. Ryan Rifkin optimized many parts of StarTech and cleaned up the parallel work-stealing code. James Schuyler operated StarTech at the 1993 ACM Computer Chess Championship in Indianapolis. James also performed extensive testing of StarTech during the last two weeks prior to the 1993 ACM tournament. It was a real pleasure for me not to be responsible for moving the pieces and hitting the clock. James's chess skill was especially helpful during the tournament. Hans had decided that the StarTech team should not consult with him during the tournament, because it would be unfair to the other participants for him to have two horses in the race. Consequently, when StarTech got into trouble with its opening book, which had been partially garbled in transit from CMU to MIT, James ungarbled the opening. Robert D. Blumofe kept coming up with new scheduling strategies to solve the problem of obtaining linear speedup without running out of memory; Daniel Nussbaum explained to me how Alewife could handle the swamping problem. Al Vezza knew whom to call at NCSA when I desperately needed machine time for the 1993 chess championship. Eric A. Brewer implemented the weighted linear-regression curve-fitting program that I used.

At Thinking Machines Corporation: Mark Bromley, Roger Frye, and Kurt Thearling provided important design and programming help in getting StarTech running on the CM-5. Roger almost singlehandedly built the interface between the Hitech code and the parallel computer, while Kurt wrote the first message-passing software (back in the days before active messages had been invented by Thorsten von Eicken), and Mark continually amazes me with tricks to get the last bit of performance out of the CM-5 processing node. John Mucci and David Waltz provided the managerial backing to allow Mark, Roger, and Kurt to spend some time on the project. Richard Title lent me his chess equipment (some of which I broke!) and helped to test StarTech.
The staff at Thinking Machines kept the machines up and `found' for me much machine time on the factory floor that I could use to test StarTech. The support staff was also good-natured whenever I overstressed the file servers, the CM-5 operating system, or the in-house local area network. They never deleted my files. Hans Berliner, Richard Karp, David Slate, and Lewis Stiller all contributed to a mini-seminar on chess held at Thinking Machines Corporation on August 12, 1991. In particular, Richard Karp suggested that I base my program on Hans Berliner's Hitech rather than GNU Chess [Cra90].

Chapter 3 of this dissertation owes much to Eric A. Brewer, who coauthored with me "How to Get Good Performance from the CM-5 Data Network" [BK94]. Eric was supported by the National Science Foundation, Grant CCR-8716884; by ARPA, Contract N00014-91-J-1698; by an equipment grant from Digital Equipment Corporation; and by grants from AT&T and IBM. Robert Blumofe implemented most of Strata's active-message layer and block-transfer engine, and is involved in the ongoing work on bandwidth matching. Charles Leiserson and Bill Weihl suggested directions to pursue, and Charles provided the marathon analogy. Harry F. Jordan and Burton Smith helped me find references to The Force.

Maria Sensale, the LCS/AI reading-room librarian, more than once obtained a copy of a paper originally published a long time ago in a faraway place. Paula Mickevich and Newton Loui cheerfully, by email, renewed (again and again and again) the due date of the dozens of books and technical reports that I borrowed from the reading room.

This research was sponsored by the Advanced Research Projects Agency (DoD) through the Office of Naval Research under Grant Numbers N00014-83-K-0125, N00014-84-K-0099, N00014-91-J-1698, and N00014-92-J-1310. This work was partially supported through the use of the Connection Machine CM-5 supercomputer at the National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign, under NCSA grant number TRA930289N; and through the use of the CM-5 at the MIT Laboratory for Computer Science through project SCOUT (ARPA contract MDA972-92-J-1032).

Connection Machine, CM-1, CM-2, CM-5, C*, and *Lisp are trademarks of Thinking Machines Corporation. SPARC is a trademark of Sun Microsystems, Incorporated. Unix is a trademark of AT&T. Ethernet is a trademark of Xerox Corporation.

The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. government.

Bibliography

[ACD*91] Anant Agarwal, David Chaiken, Godfrey D'Souza, Kirk Johnson, David Kranz, John Kubiatowicz, Kiyoshi Kurihara, Beng-Hong Lim, Gino Maa, Dan Nussbaum, Mike Parkin, and Donald Yeung. The MIT Alewife Machine: A Large-Scale Distributed- Memory Multiprocessor. Technical Report MIT/LCS/TM-454.b, Massachusetts Insti- tute of Technology, Laboratory for Computer Science, November 1991. [Ahr94] Gunter Ahrendt. List of the world's most powerful computing sites. May 8, 1994. Weekly posting on newsgroup comp.sys.super. [ABD82] Selim G. Akl, David T. Barnard, and Ralph J. Doran. Design and implementation of a parallel tree search algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-4 (2), pages 192±203, March 1982. [AJ94] G. Alaghband and H. F. Jordan. Overview of the Force scienti®c parallel language. Scientifc Programming, 3 (1), pages 33±47, Spring 1994. [AKL*88] Eugene Albert, Kathleen Knobe, John D. Lukas, and Guy L. Steele Jr. Compiling fortran 8X array features for the Connection Machine computer system. In Parallel Programming: Experience with Applications, Languages and Systems (ACM/SIGPLAN PPEALS), pages 42±56, ACM, New Haven, CT, July 1988. (Published as SIGPLAN Notices 29(9), September, 1988.) [AI87] Arvind and Robert A. Iannucci. Two fundamental issues in . In Proceed- ings of DFVLR - Conference 1987 on Parallel Processing in Science and Engineering, Bonn-Bad Godesberg, W. Germany, Springer-Verlag LNCS 295, June 25-29 1987. [ANP89] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: data structures for parallel computing. ACM Transactions on Programming Languages and Systems, 11 (4), pages 598±632, October 1989. [BKT92] H. E. Bal, M. F. Kaashoek, and A. S. Tanenbaum. Orca: a language for parallel programming of distributed systems. IEEE Transactions on Software Engineering, 18 (3), 1992. [BBK*68] George H. Barnes, Richard M. Brown, Maso Kato, David J. Kuck, Daniel L. Slotnick, and Richard A. Stokes. The ILLIAC-IV computer. IEEE Transactions on Computers, C-17, pages 746±757, August 1968. [BNA91] P. S. Barth, R. S. Nikhil, and Arvind. M-structures: extending a parallel, non-strict functional language with state. In Functional Programming Languages and Computer Architecture (FPCA '91), pages 538±568, Springer-Verlag, August 1991. [BGR86] Paul D. Bassett, Lance A. Glasser, and Randy D. Rettberg. Dynamic delay adjustment: a technique for high speed asynchronous communication. In Charles E. Leiserson, ed., Proceedings of the Fourth MIT Conference on Advanced Research in VLSI, pages 219± 232, April 1986.

149 [Bat80] Kenneth E. Batcher. Design of a massively parallel processor. IEEE Transactions on Computers, C-29 (9), pages 836±840, September 1980. [Bau78] G. M. Baudet. The Design and Analysis of Algorithms for Asynchronous Multipro- cessors. Technical Report CMU-CS-78-116, CMUCS, April 1978, 182 pp. (Ph.D. thesis.) [BDW85] John Beetem, Monty Denneau, and Don Weingarten. The GF11 supercomputer. In Proceedings of the 12th Annual Symposium on Computer Architecture, pages 108±115, IEEE Computer Society Press, Boston, Massachusetts, June 1985. [Ber79] Hans Berliner. The B* tree search algorithm: a best-®rst proof procedure. Arti®cial Intelligence, 12, pages 23±40, 1979. [BE89] Hans Berliner and Carl Ebeling. Pattern knowledge and search: the SUPREM architec- ture. Arti®cial Intelligence, 38 (2), pages 161±198, March 1989. [Ber91] Hans Berliner. Personal communication. 1991. [Ber93] Hans Berliner. Personal communication. October 1993. [Bir80] R. S. Bird. Tabulation techniques for recursive programs. ACM Computing Surveys, 12 (4), pages 403±417, December 1980. [BN84] A. D. Birrell and B. J. Nelson. Implementing remote procedure calls. ACM Transactions on Computer Systems, 2 (1), pages 39±59, February 1984. [Bir89] Andrew D. Birrell. An Introduction to Progamming with Threads. Research Report 35, DEC Systems Research Center, January 1989, 35 pp. [Ble90] Guy E. Blelloch. Vector Models for Data-Parallel Computing. The MIT Press, 1990. [BCH*93] Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Margaret Reid-Miller, Jay Sipelstein, and Marco Zhaga. CVL: A C Vector Library. Technical Report CMU- CS-93-114, Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213, February 1993, 15 pp. [BL93] Robert D. Blumofe and Charles E. Leiserson. Space-ef®cient scheduling of multi- threaded computations. In Proceedings of the Twenty-Fifth Annual ACM Symposium on the Theory of Computing, pages 362±371, San Diego, California, May 1993. [Blu94] Robert D. Blumofe. Personal communication. 1994. [BB94a] Robert D. Blumofe and Eric A. Brewer. Personal communication. May 16, 1994. [Bre74] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21 (2), pages 201±206, April 1974. [BB94b] Eric A. Brewer and Robert D. Blumofe. Strata: A Multi-Layer Communications Library. Technical Report, MIT Laboratory for Computer Science, January 1994. (To appear. Available from ftp.lcs.mit.edu via anonymous ftp, in directory /pub/supertech/strata.) [BK94] Eric A. Brewer and Bradley C. Kuszmaul. How to get good performance from the cm-5 data network. In Proceedings of the 8th International Parallel Processing Symposium (IPPS '94), pages 858±867, April 1994. [CK88] David Callahan and Ken Kennedy. Compiling programs for distributed-memory multi- processors. The Journal of Supercomputing, 2, pages 151±169, 1988. [Chi86] Andrew A. Chien. Hot Spots in Routing Networks Ð A Collection of Studies. Compu- tation Structures Group Memo 267, Massachusetts Institute of Technology, Laboratory for Computer Science, October 1986. [Chr83] David Park Christman. Programming the Connection Machine. Master's thesis, Mas- sachusetts Institute of Technology, Department of Electrical Engineering and Computer

150 Science, January 1983. (Also available as Xerox PARC Technical Report ISL-84-3, April, 1984.) [Cla87] T. J. W. Clarke. The D-RISC Ð an architecture for use in multiprocessors. In Giles Kahn, ed., Functional Programming Languages and Computer Architectures, pages 16± 23, Springer-Verlag, Portland, Oregon, September 1987. [CT82] J. H. Condon and K. Thompson. Belle chess hardware. In M. R. B. Clarke, ed., Advances in Computer Chess 3, pages 45±54, Pergamon Press, 1982. (Meeting held at the Imperial College of Science and Technology, University of London, April, 1981.) [Cra90] Stuart Cracraft. Gnu chess. 1990. Source code available from the Free Software Foundation, 675 Massachusetts Avenue, Cambridge, MA 02139. [CSS*91] David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: a compiler- controlled threaded . In Fourth International Conference on Archi- tectural Support for Programming Languages and Operating Systems (ASPLOS-IV) , pages 164±175, Santa Clara, California, April 1991. (SIGARCH Computer Architecture News, Volume 19, Number 2, April 1991. SIGOPS Operating Systems Review, Vol- ume 25, Special Issue, April 1991. SIGPLAN Sigplan Notices, Volume 26, Number 4, April 1991.) [Cul89] David E. Culler. Managing Parallelism and Resources in Scienti®c Data¯ow Programs. Ph.D. thesis, Massachusetts Institute of Technology, 1989. [CKP*93] David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Eric Schauser, Eunice Santos, Ramesh Subramonian, and Thorsten von Eicken. LogP: towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), pages 1±12, San Diego, California, May 1993. (ACM SIGPLAN NOTICES Volume 28, Number 7, July 1993.) [DS87] William J. Dally and Charles L. Seitz. Deadlock-free message routing in multiprocesor interconnection networks. IEEE Transactions on Computers, C-36 (5), pages 547±553, May 1987. [Dal87] William J. Dally. Wire-ef®cient VLSI multiprocessor communication networks. In Paul Losleben, ed., Proceedings of the 1987 Stanford Conference on Advanced Research in VLSI, pages 391±415, The MIT Press, Cambridge, MA, Palo Alto, CA, 1987. [DGN*86] F. Darema-Rogers, D. A. George, V. A. Norton, and G. F. P®ster. A Single-Program- Multiple-Data Computational Model for EPEX/FORTRAN. ResearchReport RC 11552, Computer Sciences Department, IBM T. J. Watson Research Center, November 1986. [DR81] J. Darlington and M. Reeve. ALICE Ð a multi-processor reduction machine for the parallel evaluation of applicative languages. In Functional Programming Languages and Computer Architectures, pages 65±76, 1981. [DMS93] Jack J. Dongarra, Hans W. Meuer, and Erich Strohmaier. TOP 500 Supercomputers. Technical Report RUM 33/93, Computing Center, University of Mannheim, July 1993, i+31 pp. (Email contact: [email protected].) [DT90] Michel Dubois and Skreekant Thakkar. Cache Architectures in Tightly Coupled Mul- tiprocessors. IEEE Computer Society, 1990. (Special Issue of Computer, Volume 23, Number 6.) [Ede91] Alan Edelman. Optimal matrix transposition and bit reversal on hypercubes: all-to-all personalized communication. J. Parallel Dist. Comp., 11, pages 328±331, 1991.

151 [Ell85] John R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, 1985. (ACM Doctorial Dissertation Award 1985.) [Elo78] Arpad E. Elo. The Rating of Chessplayers Ð Past and Present. Arco Publishers, New York, 1978. [FMM90] R. Feldmann, B. Monien, and P. Mysliwietz. Distributed game tree search. In Parallel Algorithms for Machine Intelligence and Vision, pages 66±101, Springer-Verlag, New York, 1990. [FMM91] R. Feldmann, P. Mysliwietz, and B. Monien. A fully distributed chess program. In D. F. Beal, ed., Advances in Computer Chess 6, pages 1±27, Ellis Horwood, Chichester, West Sussex, England, London, 1991. (Conference held in August 1990.) [FMM92] R. Feldmann, P. Mysliwietz, and B. Monien. Experiments with a fully-distributed chess program. In Heuristic Programming in Arti®cial Intelligence 3, The Third Computer Olympiad, pages 72±87, Ellis Horwood, New York, 1992. [FMM93] R. Feldmann, P. Mysliwietz, and B. Monien. Game tree search on a massively parallel system. In Advances in Computer Chess 7, 1993. (The conference was held in June 1993, but the proceedings have not published as of August 1993.) [FM93] Rainer Feldmann and Peter Mysliwietz. Personalcommunication. September 22, 1993. [RU92] S. Felperin P. Raghavan and E. Upfal. A theory of wormhole routing in parallel computers. In Proceedings of the 33rd Symposium on the Foundations of Computer Science, pages 563±572, October 1992. [FM87] Raphael Finkel and Udi Manber. DIB Ð a distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 9 (2), pages 235±256, April 1987. [FF82] Raphael A. Finkel and John P. Fishburn. Parallelism in alpha-beta search. Arti®cial Intellgence, 19 (1), pages 89±106, September 1982. [Fis83] John P. Fishburn. Another optimization of alpha-beta search. SIGART Newsletter, Number 84, pages 37±38, April 1983. [Fis84] J. P. Fishburn. Analysis of Speedup in Distributed Algorithms. UMI Research Press, Ann Arbor, MI, 1984. [Fly66] M. J. Flynn. Very high speed computing systems. Proceedings of the IEEE, 54 (12), pages 1901±1909, December 1966. [Fox89] G.Fox. Programming Concurrent Processors. Addison Wesley, 1989. [Gil78] James J. Gillogly. Performance Analysis of the Technology Chess Program. Technical Report CMU-CS-78-109, CMUCS, March 1978. [GGK*83] Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph, and Marc Snir. The NYU ultracomputer Ð designing a MIMD, shared-memory parallel machine. IEEE Transactions on Computers, C-32 (2), pages 175±159, February 1983. [Gra66] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, 45, pages 1563±1581, November 1966. [Gra69] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17 (2), pages 416±429, March 1969. [GL89] R. I. Greenberg and C. E. Leiserson. Randomized routing on fat-trees. Advances in Computing Research, 5, pages 345±374, 1989. [GEC67] Richard D. Greenblatt, Donald E. Eastlake, III, and Stephen D. Crocker. The Greenblatt chess pgoram. In Fall Joint Computer Conference, pages 801±810, 1967. (reprinted as [Lev88, 56±66].)

152 [Gup89] Rajiv Gupta. The fuzzy barrier: a mechanism for high speed synchronization of pro- cessors. In Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), pages 54±63, Boston, Massachusetts, APRIL 1989. (SIGARCH Computer Architecture News, Volume 17, Number 2, April 1989. SIGOPS Operating Systems Review, Volume 23, Special Issue, April 1989. SIG- PLAN Sigplan Notices, Volume 24, Special Issue, May 1989.) [HZJ94] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-Style Parallel Programming Based on Continuation-Passing Threads. Computation Structures Group Memo 355, Massachusetts Institute of Technology, Laboratory for Computer Science, March 7, 1994, 22 pp. [Hal94] Michael Roland S. Halbherr. MIMD-Style Parallel Programming Based on Continuation Passing Threads. Ph.D. thesis, Swiss Federal Institute of Technology (ETH), Department of Electrical Engineering, Zurich, May 1994. (ETH thesis Number 10699. To appear as a MIT LCS Technical Report.) [Hal84] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Proc. 1984 ACM Symposium on Lisp and Functional Programming, pages 9±17, Austin, Texas, August 1984. [HQL*91] Philip J. Hatcher, Michael J. Quinn, Anthony J. Lapadula, Bradley K. Seevers, Ray J. Anderson, and Robert R. Jones. Data-parallel programming on MIMD computers. IEEE Transactions on Parallel and Distributed Systems, 3 (2), pages 377±383, July 1991. [Hel92] Steven K. Heller. Personal communication. 1992. [HPF93] High Performance Fortran Language Speci®cation Version 1.0. High Performance Fortran Forum, May 1993. [Hil85] W. D. Hillis. The Connection Machine. The MIT Press, Cambridge, MA, 1985. [HS86] W. Daniel Hillis and Guy L. Steele Jr. Data parallel algorithms. Communications of the ACM, 29 (12), pages 1170±1183, December 1986. [HAK*] W. Daniel Hillis, Zahi S. Abuhamdeh, Bradley C. Kuszmaul, Shaw-Wen Yang, and Jon P. Wade. Digital clock buffer circuit providing controllable delay. U. S. Patent 5,118,975, issued June 2, 1992. [HL93] Robert V. Hogg and Johanenes Ledolter. Applied Statistics for Engineers and Physical Scientists. Macmillan Publishing Company, New York, 1993. [HJK*94] Wilson C. Hsieh, Kirk L. Johnson, M. Frans Kaashoek, Deborah A. Wallach, and William E. Weihl. Personal communication. May 16, 1994. . [Hsu90] Feng-hsiung Hsu. Large Scale Parallelization of Alpha-Beta Search: An Algorithmic and Architectural Study with Computer Chess. Technical report CMU-CS-90-108, Computer Science Department, Carnegie-Mellon University, Pittsburgh, PA 15213, February 1990. [HSN89] Robert M. Hyatt, Bruce W. Suter, and Harry L. Nelson. A parallel alpha/beta tree searching algorithm. Parallel Computing, 10 (3), pages 299±308, May 1989. [Int88] A Technical Summary of the iPSC/2 Concurrent Supercomputer. Intel Scienti®c Com- puters, 1988. [IEEE90] IEEE standard test access port and boundary-scan architecture. 1990. (IEEE Std 1149.1-1990.) [Jor85] Harry F. Jordan. HEP architecture, programming and performance. In Parallel MIMD Computation: HEP Supercomputer and Its Applications, pages 11±40, The MIT Press, Cambridge, MA, 1985.

153 [Jor87] H.F.Jordan. Theforce. In The Characteristics of Parallel Algorithms, pages 395±436, The MIT Press, 1987. [JS94] Harry F. Jordan and Burton Smith. References on the Force. June 1994. Personal Communication. [Jor78] Harry F. Jordan. A multi-microprocessor system for ®nite element structural analysis. In A. K. Noor andH. G. McComb, Jr., eds., Trends in Computerized Structural Analysis and Synthesis, pages 21±29, Pergamon Press Ltd, 1978. (Published as a special issue of Computers & Structures, Volume 10, Numbers 1±2.) [KHM87] M. Karol, M. Hluchyj, and S. Morgan. Input versus output queueing on a space-division packet switch. IEEE Transactions on Communication, 35 (12), December 1987. [KZ89] Richard M. Karp and Yanjun Zhang.On parallel evaluationofgametrees.In Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 409±420, Santa Fe, New Mexico, June 1989. [Kau92] Larry Kaufman. Rate your own computer. Computer Chess Reports, 3 (1), pages 17± 19, 1992. (Published by ICD, 21 Walt Whitman Rd., Huntington Station, NY 11746, 1-800-645-4710.) [Kau93] Larry Kaufman. Rate your own computer Ð part II. Computer Chess Reports, 3 (2), pages 13±15, 1992-93. [KK79] Parviz Kermani and Leonard Kleinrock. Virtual cut-through: a new computer commu- nication switching technique. Computer Networks, 3, pages 267±286, 1979. [Kle76] Leonard Kleinrock. Queueing Systems Ð Volume I: Theory. John Wiley & Sons, New York, 1976. [Kle78] Leonard Kleinrock. Principles and lessons in packet communications. Proceedings of the IEEE, 66 (11), pages 1320±1329, November 1978. [KK88] Thomas F. Knight, Jr. and Alexander Krymm. A self-terminating low-voltage swing CMOS output driver. IEEE Journal of Solid-State Circuits, 23 (2), pages 457±464, April 1988. [KM75] Donald E. Knuth and Ronald W. Moore. An analysis of alpha-beta pruning. Arti®cial Intelligence, 6 (4), pages 293±326, Winter 1975. [KB82] D. Kopec and I. Bratko. The Bratko-Kopec experiment: a comparison of human and computer performance in chess. In M. R. B. Clarke, ed., Advancesin Computer Chess 3, pages 57±72, Pergamon Press, 1982. (Meeting held at the Imperial College of Science and Technology, University of London, April, 1981.) [KOH*94] Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Koroush Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The stanford FLASH multiprocessor. In The 21st Annual International Symposium on Computer Architecture, pages 302±313, Chicago, Illinois, April 18±21, 1994. [KTR93] T. T. Kwan, B. K. Totty, and D. A. Reed. Communication and computation performance of the CM-5. In Proceedings of Supercomputing '93, pages 192±201, November 1993. [LW92] Monica S. Lam and Robert P. Wilson. Limits of control ¯ow on parallelism. In The 19th Annual International Symposium on Computer Architecture, pages 46±57, Gold Coast, Australia, May 1992. (ACM SIGARCH Computer Architecture News, Volume 20, Number 2.)

154 [Las85] Clifford Adam Lasser. SIMPL, A Programming Language for the Connection Machine. Bachelor's thesis, MIT Department of Electrical Engineering and Computer Science, 1985. [LMR88] F. T. Leighton, B. Maggs, and S. Rao. Universal packet routing algorithms. In 29th Annual IEEE Symposium on Foundations of Computer Science, pages 256±271, 1988. [Lei85] Charles E. Leiserson. Fat-trees: Universal networks for hardware-ef®cient supercom- puting. IEEE Transactions on Computers, C-34 (10), pages 892±901, October 1985. [LM88] C. E. Leiserson and B. M. Maggs. Communication-ef®cient parallel algorithms for distributed random-access machines. Algorithmica, 3, pages 53±77, 1988. [LAD*92] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Ma- hesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. The network architecture of the Connection Machine CM-5. In Symposium on Parallel and Distributed Algorithms '92, pages 272±285, San Diego, California, June 1992. [Lev88] David Levy. Computer Chess Compendium. W. Turner & Son, Ltd (Distributed in the USA through Springer-Verlag, New York), 1988. [Lim91] Beng-Hong Lim. Waiting Algorithms for Synchronization in Large-Scale Multipro- cessors. Technical Report MIT/LCS/TR-498, Massachusetts Institute of Technology, Laboratory for Computer Science, February 1991, 103 pp. (Master's thesis.) [LT77] G. J. Lipovski and A. Tripathi. A recon®gurable varistructure array processor. In Jean-Loup Baer, ed., Proceedings of the 1977 International Conference on Parallel Processing, pages 165±174, IEEE Computer Society, August 23±26, 1977. [Liu68] C. L. Liu. Introduction to Combinatorial Mathematics. McGraw-Hill Book Company, New York, 1968. [LAB93] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message- passing. December 1993. given to cel by the authors. [LB80] Stephen F. Lundstrom and George H. Barnes. A controllable MIMD architecture. In Proceedings of the 1980 International Conference on Parallel Processing, pages 19±27, August 1980. [MR74] T. A. Marsland and P. G. Rushton. A study of techniques for game-playing programs. In Advances in Cybernetics and Systems, pages 363±371, Gordon and Breach, London, 1974. [MC82] T. A. Marsland and M. S. Campbell. Parallel search of strongly ordered game trees. ACM Computing Surveys, 14 (4), pages 533±552, December 1982. [MOS86] T. A. Marsland, M. Olafsson, and J. Schaeffer. Multiprocessor tree-search experiments. In D. F. Beal, ed., Advances in Computer Chess 4, pages 37±51, Pergamon Press, Meeting held at Brunel University, London, April, 1984, 1986. [McA88] David Allen McAllester. Conspiracy numbers for min-max search. Arti®cial Intelli- gence, 35, pages 287±310, 1988. [MR90] MichaelMetcalfandJohnKerReid. FORTRAN 90 Explained. Oxford University Press, New York, 1990. [MB76] Robert M. Metcalfe and David R. Boggs. Ethernet: distributed for local computer networks. IEEE Transactions on Computers, 19 (7), pages 395±404, July 1976. [MKH91] E. Mohr, D. A. Kranz, and R. H. Halstead Jr. Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Trans. on Parallel and Distributed

155 Systems, 2 (3), pages 264±280, July 1991. (A slightly more complete version appears as technical report MIT/LCS/TM-449, June 1991.) [MS91] JacekMyczkowskiandGuySteele. Seismicmodelingat 14 giga¯ops on the connection machine. In Supercomputing '91, pages 316±326, Albuquerque, NM, November 1991. [New75] Monroe Newborn. Computer Chess. Academic Press, New York, 1975. [Nik87] Id Nouveau Reference Manual. Massachusetts Institute of Technology, Laboratory for Computer Science, Computations Structures Group, April 24, 1987. [NFH88] Rishiyur S. Nikhil, P. R. Fenstermacher, and J. E. Hicks. Id World Reference Manual (for LISP Machines). Unnumbered Technical Report,Massachusetts Institute of Technology, Laboratory for Computer Science, Computations Structures Group, August 23, 1988. (Supersedes Dinarte R. Morais, Id World: User's Manual, Computations Structures Group Memo 266, June 1986.) [NA89] Rishiyur S. Nikhil and Arvind. Can data¯ow subsume von Neumann computing? In The 16th Annual International Symposium on Computer Architecture, pages 262±272, Jerusalem, Israel, May 28±June 1, 1989. (ACM SIGARCH Computer Architecture News, Volume 17, Number 3, June 1989.) [NPA92] R. S. Nikhil, G. M. Papadopoulos, and Arvind. *t: a multithreaded massively parallel architecture. In The 19th Annual International Symposium on Computer Architecture, pages 156±167, Gold Coast, Australia, May 1992. (ACM SIGARCH Computer Archi- tecture News, Volume 20, Number 2.) [Nus93] DanielNussbaum. Run-Time Thread Management for Large-Scale Distributed-Memory Multiprocessors. Ph.D. thesis, Massachusetts Institute of Technology, September 1993. [OD90] Mathew T. O'Keefe and Henry G. Dietz. Hardware Barrier Synchronization: Dynamic Barrier MIMD (DBM). TechnicalReport TR-EE 90-9, Schoolof Electrical Engineering, Purdue University, January 1990. [PC90] Gregory M. Papadopoulos and David E. Culler. Monsoon: an explicit token store architecture. In Proc. 17th. Intl. Symp. on Computer Architecture, Seattle, WA, May 1990. [PBG*93] G. M. Papadopoulos,G. A. Boughton, R. Greiner, and M. J. Beckerle. Integrated build- ing blocks for parallel computers. In Proceedings of Supercomputing '93, pages 624± 635, November 1993. [PS91] W. Paul and D. Scheerer. The DATIS-P fault tolerant machine. In Proc. 24th HICSS (Hawaii Int. Conf. on System Sciences), pages 560±571, 1991. [Pea80] Judea Pearl. Asymptotic properties of minimax trees and game-searching procedures. Arti®cial Intelligence, 14 (2), pages 113±138, September 1980. [Per87] R.H. Perrott. Parallel Programming. Addison Wesley, Reading, MA, USA, 1987. [PN85] G. F. P®ster and V. A. Norton. ªhot spotº contention and combining in multistage interconnection networks. IEEE Transactions on Computers, C-34, pages 943±948, October 1985. [Plo77] Gordon D. Plotkin. LCF considered as a programming language. Theoretical Computer Science, 5, pages 223±255, 1977. [Pol88] Constantine D. Polychronopoulos. Compiler optimizations for enhancing parallelism and their impact on architecture design. IEEE Transactions on Computers, 37 (8), pages 991±1004, August 1988.

156 [PM83] F. Popowich and T. A. Marsland. Parabelle: Experience with a Parallel Chess Pro- gram. Technical Report 83-7, Computing Science Department, University of Alberta, Edmonton, Canada, 1983. [PTV*92] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C. Cambridge Univeristy Press, New York, 1992. [RBJ86] Abihram G. Ranade, Sandeep N. Bhatt, and S. Lennart Johnsson. The ¯uent abstract machine. In Charles E. Leiserson, ed., Proceedings of the Fourth MIT Conference on Advanced Research in VLSI, pages 71±93, April 1986. (Also published as Yale University Tech. Report YALEU/DCS/TR-573, January 1988.) [RP89] A. Rogers and K. Pingali. Process decomposition through locality of reference. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 69±80, Portland, Oregon, June 21±23, 1989. [RS87] John R. Rose and Guy L. Steele Jr. C*: An Extended C Language for Data Parallel Programming. Technical Report PL87-5, Thinking Machines, Inc., April 1987. [Sab88] Gary W. Sabot. The Paralation Model: Architecture-Independent Parallel Program- ming. The MIT Press, Cambridge, MA, 1988. [SYH*89] Shuichi Sakai, Yoshinori Yamaguchi, Kei Hiraki, Yuetsu Kodama, and ToshitsuguYuba. An Architecture of a Data¯ow Single Chip Processor. In Proc. 16th Annual International Symposium on Computer Architecture, Jerusalem, Israel, pages 46±53, May 28-June 1, 1989. [Sch93] Robert Schumacher. One Year KSR1 at the University of Mannheim Results and Expe- riences. Technical Report RUM 35/93, University of Mannheim, December 1993, 67 pp. (Available from ftp.uni-mannheim.de via anonymous ftp, in direc- tory info/rumdoc. The ®les rum3593.ps and rum3593s.ps, are 300dpi and 400dpi respectively.) [Sei85] C. L. Seitz. The cosmic cube. Communications of the ACM, 28 (1), pages 22±23, January 1985. [SAD*86] Charles L. Seitz, William C. Athas, William J. Dally, Reese Faucette, Alain J. Martin, Sven Mattisson, Craig S. Steele, and Wen-King Su. Message-Passing Concurrent Computers: Their Architecture and Programming. Addison-Wesley, Reading, MA, 1986. [SW75] Stephen D. Senturia and Bruce D. Wedlock. Electronic Circuits and Applications. John Wiley & Sons, New York, 1975. [Sha50] Claude Shannon. Programming a digital computer to play chess. The Philosphical Magazine, 41 - 7th Series (314), pages 256±275, March 1950. (First presented at the National IRE Convention, March 9, 1949, New York, U.S.A. Reprinted in The World of Mathematics, Newman (ed.), Volume 4, Simon and Schuster, 1954, New York, and reprinted in [Lev88, pp. 2±13].) [SKS*92] A. Shaw,Y.Kodama,M.Sato, S.Sakai,andY.Yamaguchi. Performance of data-parallel primitives on the EM-4 data¯ow parallel supercomputer. In The Fourth Symposium on the Frontiers of Massively Parallel Computation, pages 302±309, October 1992. [SKS81] Howard Jay Siegel, Frederick C. Kemmerer, and Harold E. Smalley, Jr. PASM: a partitionable SIMD/MIMD system for image processing and pattern recognition. IEEE Transactions on Computers, C-30 (12), pages 934±947, December 1981. [SA77] David J. Slate and Lawrence R. Atkin. CHESS 4.5 Ð the Northwester University chess program. In Chess Skill in Man and Machine, pages 82±118, Springer-Verlag, New York, 1977. (Reprinted as [SA83] and in [Lev88, pp. 80±103].)

157 [SA83] David J. Slate and Lawrence R. Atkin. CHESS 4.5 Ð the Northwestern University chess program. In Chess Skill in Man and Machine, pages 82±118, Springer-Verlag, 1983. [Smi94] Burton Smith. Personal communication. May 15, 1994. [SYC92] M. St. Pierre, S.-W. Yang, and D. Cassiday. Functional VLSI design veri®cation method- ology for the CM-5 massively parallel supercomputer. In International Conference on Computer Design, October 1992. [Ste90] Guy L. Steele Jr. Making asynchronous parallelism safe for the world. In Conference Record of the Seventeenth Annual ACM Symposium on Principles of Programming Languages, pages 218±231, Association for Computing Machinery, San Francisco, CA, January 1990. [Ste92] Clifford Stein. Evaluating game trees in parallel. In Charles E. Leiserson, ed., Proceed- ings of the 1992 MIT Student Workshop on VLSI and Parallel Systems, pages (47-1)± (47-2), MIT Endicott House, July 1992. [Sto79] G. C. Stockman. A minimax algorithm better than alpha-beta? Arti®cial Intelligence, 12 (2), pages 179±196, August 1979. [SS82] Salvatore J. Stolfo and David Elliot Shaw. DADO: a tree-structured machine architec- ture for production systems. In Proceedings of the National Conference on Arti®cial Intelligence (AAAI-82), pages 242±245, Pittsburgh, PA, 18±20, 1982. (.) [Tv85] AndrewS.TanenbaumandRobbertvanRennesse. Distributed operating systems. ACM Computing Surveys, 17 (4), pages 419±470, December 1985. [TY86] Peiyi Tang and Pen-Chung Yew. Processor self-scheduling for multiple-nested parallel loops. In Kai Hwang, Steven M. Jacobs, and Earl E. Swartzlander, eds., Proceedings of the 1986 International Conference on Parallel Processing, pages 528±535, August 1986. [TI90] SN54ACT8997, SN74ACT8997 Scan Path Linker With 4-bit Identi®cation . Texas Instruments, April 1990. [TL93] C. A. Thekkath and H. M. Levy. Limits to low-latency communication on high-speed networks. ACM Transactions on Computer Systems, 11 (2), pages 179±203, May 1993. [Thi86a] The Essential STARLISP Manual. Thinking Machines Corporation, July 1986. [Thi86b] Connection Machine Parallel Instruction Set (PARIS), Release 2, Revision 7. Thinking Machines Corporation, July 1986. [TMC88] Method and apparatus for aligning the operation of a plurality of processors. Thinking Machines Corporation, applicant. W. Daniel Hillis, inventor. European Patent Applica- tion Serial Number 89 902 461.6, priority date of February 2, 1988. Also International Application Number WO 89/07299 (Published under the Patent Cooperation Treaty), Publication Date 10 August 1989. [TMC91] The Connection Machine CM-5 Technical Summary. Thinking Machines Corporation, October 1991. [Thi93] CMMD User's Guide. Thinking Machines Corporation, 1993. [Tho83] Clark D. Thompson. The VLSI complexity of sorting. IEEE Transactions on Computers, C-32, pages 1171±1184, December 1983. [Tra88] Kenneth R. Traub. Sequential Implementation of Lenient Programming Languages. Technical report MIT/LCS/TR-417, Massachusetts Institute of Technology, Laboratory for Computer Science, September 1988. (Ph.D. thesis.)

158 [Tru81] T. R. Truscott. Techniques Used in Minimax Game-Playing Programs. Master's thesis, Computer Science Department, Duke University, Durham, NC, April 1981. [Tru92] Tom Truscott. Personal communication. September 23, 1992. [Tur79] D. A. Turner. A new implementation technique for applicative languages. Software- Practice and Experience, 9, pages 32±49, 1979. [Val90] Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33 (8), pages 103±111, August 1990. [vCG*92] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. In The 19th Annual International Symposium on Computer Architecture, pages 256±266, Gold Coast, Australia, May 1992. (ACM SIGARCH Computer Architecture News, Volume 20, Number 2.) [YTL87] Pen-Chung Yew, Nian-Feng Tzeng, and Duncan H. Lawrie. Distributing hot-stop addressing in large scale multiprocessors. IEEE Transactions on Computers, C-36 (4), pages 388±395, April 1987. [ZH92] Robert Zak and Jeffrey Hill. An IEEE 1149.1 compliant testability architecture with internal scan. In 1992 IEEE International Conference on Computer Design: VLSI in Computers and Processors, October 1992. [ZBG88] H. P. Zima, H.-J. Bast, and M. Gerndt. SUPERBÐa tool for semiautomatic MIMD/SIMD parallelization . Parallel Computing, 6, pages 1±18, 1988. [Zob70] A. L. Zobrist. A Hashing Method with Applications for Game Playing. Technical Report 88, Computer Science Department, University of Wisconsin, Madison, April 1970.

Biographical Note

Bradley C. Kuszmaul was born in Iowa City, Iowa, and raised in Merriam, Kansas. Bradley is married to Dana Sue Henry, who is also a graduate student studying computer architecture at MIT. Bradley and Dana have a daughter, Anne Henry Kuszmaul.

Bradley earned two Bachelor of Science degrees, one in mathematics and one in computer science and engineering, from the Massachusetts Institute of Technology in 1984. His bachelor's thesis in computer science, Type Checking in VIM-VAL, was runner-up for the William A. Martin Prize for the best undergraduate thesis in computer science, May 1984. Bradley entered the doctoral program at MIT, and in 1986 obtained a Master of Science degree in electrical engineering and computer science with a master's thesis titled Simulating Applicative Architectures on the Connection Machine.

Bradley's parallel computer chess program, StarTech, running on a 512-processor CM-5, tied for Third Prize in the 1993 ACM International Computer Chess Championship, held in Indianapolis, Indiana, February 14-17, 1993.

Bradley has been a consultant for Thinking Machines Corporation since January 1984, when he wrote the test vectors for the CM-1 chip. Bradley's experience writing diagnostics for the CM-1 and CM-2, as well as helping with the design and implementation of the first *LISP interpreter, served him well during 1987-88, when he spent 15 months working full-time at Thinking Machines Corporation as one of the principal architects of the Connection Machine CM-5 supercomputer. Before 1984 Bradley worked summers as a freelance software engineer.

Bradley has served as a teaching assistant for several classes at MIT, including 6.033, Computer System Engineering; 6.823, the graduate core course Computer System Architecture; 6.004, Computation Structures; and 6.170, Software Engineering; and has tutored undergraduate mathematics and physics in the MIT Experimental Study Group.

Bradley has been an inventor or coinventor on several patents, including

David C. Douglas, W. Daniel Hillis, Bradley C. Kuszmaul, Charles E. Leiserson, Shaw-Wen Yang, and Robert C. Zak. Parallel computer system including arrangement for transferring messages from a source processor to selected ones of a plurality of destination processors and combining responses. U. S. Patent, issued Summer 1993.

Bradley C. Kuszmaul. Method of routing a plurality of messages in a multi-node . U. S. Patent 5,111,198, issued May 5, 1992.

W. Daniel Hillis, Zahi S. Abuhamdeh, Bradley C. Kuszmaul, Shaw-Wen Yang, and Jon P. Wade. Digital clock buffer circuit providing controllable delay. U. S. Patent 5,118,975, issued June 2, 1992.

David C. Douglas, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Charles E. Leiserson, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert C. Zak. Parallel Computer System. International patent application published under the Patent Cooperation Treaty. International Publication Number WO 92/06436. International Publication Date 16 April 1992.

Publications in refereed journals include

Bradley C. Kuszmaul. Fast, deterministic routing, on hypercubes, using small buffers. IEEE Transactions on Computers, 39 (11), pages 1390-1393, November 1990.

Bradley C. Kuszmaul and Jeff Fried. NAP (No ALU Processor): The great communicator. Journal of Parallel and Distributed Computing, 8 (2), pages 169-179, February 1990. (An early version of this paper appeared as Bradley C. Kuszmaul and Jeff Fried. NAP (No ALU Processor): The great communicator. In Frontiers of Massively Parallel Computation. Proceedings of the 2nd Symposium, Fairfax, VA, October 10-12, 1988.)

Publications (not mentioned above) in refereed conferences include

Bradley C. Kuszmaul. A glitch in the theory of delay-insensitive circuits. In 1990 ACM International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (Tau '90), The University of British Columbia, Vancouver, British Columbia, Canada, August 1990.

Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, W. Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, and Robert Zak. The network architecture of the Connection Machine CM-5. In Proceedings of the 1992 ACM Symposium on Parallel Algorithms and Architectures (SPAA '92), pages 272-285, June 1992.

Eric A. Brewer and Bradley C. Kuszmaul. How to get good performance from the CM-5 data network. In International Parallel Processing Symposium (IPPS '94), pages 858-867, April 1994.
