The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
COMMUNICATION AND SCHEDULING IN
CLUSTERS:
A USER-LEVEL PERSPECTIVE
A Thesis in
Computer Science and Engineering
by
Shailabh Nagar
© 2001 Shailabh Nagar
Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
August 2001

We approve the thesis of Shailabh Nagar.

Anand Sivasubramaniam, Associate Professor of Computer Science and Engineering (Thesis Adviser, Chair of Committee)
Chita R. Das, Professor of Computer Science and Engineering
Mary J. Irwin, Professor of Computer Science and Engineering
Matthias A. Blumrich, Research Staff Member, IBM T.J. Watson Research Center (Special Member)
Natarajan Gautam, Assistant Professor of Industrial Engineering
Dale A. Miller, Professor of Computer Science and Engineering (Head of the Department of Computer Science and Engineering)

Abstract
Cluster computing has become the preferred model for building high performance servers today. The rapid advances and commoditization of microprocessors, network hardware and system software have made it possible for clusters to be built from off-the-shelf components. However, the performance of such clusters is hampered by the software overheads of communication and scheduling at each node. Improving cluster communication and scheduling performance is therefore an important area of research for modern computer systems.
This thesis explores the use of user-level networking for improving cluster communication. By using intelligent network interface cards and optimized implementations of well-designed protocols, user-level networking (ULN) can significantly reduce communication overheads. The thesis examines the design and implementation of user-level network interfaces from a software and hardware perspective. It studies the benefits of ULN in clusters and ways to improve the scalability of ULN implementations. It also investigates how quality of service can be efficiently provided in a ULN. User-level communication has been used to improve the scheduling of tasks across cluster nodes.
The thesis examines communication-driven coscheduling mechanisms and their impact on parallel application performance. It is shown that communication and scheduling are closely related and that overall cluster performance depends on optimizing the two subsystems together.

Table of Contents
List of Tables
List of Figures
Chapter 1. Introduction
   1.1 Efficient, scalable user-level communication
   1.2 Communication-driven scheduling of parallel processes
Chapter 2. pSNOW
   2.1 pSNOW
   2.2 Communication Mechanisms for NOW
      2.2.1 Network Interfaces
      2.2.2 Communication Substrates
   2.3 Implementation and Performance of Communication Substrates
      2.3.1 (C1,NI1)
      2.3.2 (C1,NI2)
      2.3.3 (C2,NI3)
      2.3.4 (C3,NI3)
      2.3.5 Validation
   2.4 Microbenchmark and Application Performance
      2.4.1 Ping Pong and Ping Bulk
      2.4.2 FFT
      2.4.3 Matmul
      2.4.4 IS
   2.5 Impact of Network Processor Speed
   2.6 Summary
Chapter 3. MU-Net
   3.1 Myrinet
   3.2 MU-Net
      3.2.1 User-level Library
      3.2.2 MU-Net Kernel Drivers
      3.2.3 MU-Net MCP
      3.2.4 Details of MU-Net Operations
         3.2.4.1 Creating an Endpoint
         3.2.4.2 Sending a Message
         3.2.4.3 Receiving a Message
         3.2.4.4 Destroying an Endpoint
   3.3 Performance Results
   3.4 Summary
Chapter 4. Scalability of VIA
   4.1 The VI Architecture
      4.1.1 Objectives and overview of VIA
      4.1.2 VIA Operations
   4.2 Scalability Considerations for VIA
      4.2.1 Memory management
      4.2.2 Work and Completion Queues
      4.2.3 Firmware design
      4.2.4 NIC hardware
      4.2.5 Application/Library
      4.2.6 Host hardware
   4.3 Performance Evaluation
      4.3.1 Simulator and Workload
      4.3.2 Messaging Sequence
      4.3.3 Base Design: Need for a scalable solution
      4.3.4 Size of Descriptor Payload
      4.3.5 Effect of Completion Queues (CQ)
      4.3.6 Pipelined/Overlapped firmware design
      4.3.7 Hardware DMA Queues
      4.3.8 Hardware Doorbell support
      4.3.9 Tailgating Descriptors
      4.3.10 Separate Send and Receive Processing
      4.3.11 Shadow Queues
      4.3.12 Putting it all together
   4.4 Summary
Chapter 5. Quality of Service for VIA
   5.1 Issues in providing QoS aware communication
   5.2 Overview of ULN and NIC Operations
      5.2.1 Doorbells
      5.2.2 HDMA Operation
      5.2.3 NSDMA and NRDMA operation
   5.3 Firmware Design
      5.3.1 Sequential VIA (SVIA)
      5.3.2 Parallel VIA (PVIA)
      5.3.3 QoS-aware VIA (QoSVIA)
   5.4 Performance Results
      5.4.1 Results from Experimental Platform
         5.4.1.1 Raw Performance of VIA Implementations
         5.4.1.2 Performance with CBR Channels (QoS)
      5.4.2 Simulation Results
         5.4.2.1 Results for Higher Loads
         5.4.2.2 VBR QoS Channels
         5.4.2.3 Hardware NSDMA Queue
   5.5 Summary
Chapter 6. Communication-driven scheduling
   6.1 Scheduling Strategies
      6.1.1 Local Scheduling
      6.1.2 Spin Block (SB)
      6.1.3 Dynamic Coscheduling (DCS)
      6.1.4 Dynamic Coscheduling with Spin Block (DCS-SB)
      6.1.5 Periodic Boost (PB and PBT)
      6.1.6 Periodic Boost with Spin Block (PB-SB)
      6.1.7 Spin Yield (SY)
      6.1.8 Dynamic Coscheduling with Spin Yield (DCS-SY)
      6.1.9 Periodic Boost with Spin Yield (PB-SY)
   6.2 Performance Results
      6.2.1 Experimental Setup and Workloads
      6.2.2 Comparison of Scheduling Schemes
      6.2.3 Discussion
   6.3 Summary
Chapter 7. Conclusions
   7.1 Future Work
   7.2 The Ideal Network Interface
References

List of Tables
2.1 Simulation Parameters
3.1 Comparison of Roundtrip Latencies (in µs)
3.2 Anatomy of a Message in µs (Effect of message size and multiple endpoints)
5.1 Break-up of time expended in different operations during message transfer for different message sizes. Times are in microsecs, and indicate the time when that operation is completed since the send doorbell is rung for the send side, and receive doorbell is rung for the receive side. Receive descriptors are pre-posted before the data actually comes in for PVIA and QoSVIA, and their DMA transfers do not figure in the critical path.
5.2 Average 1-way Message Latency in microsecs using Ping-Bulk experiment
6.1 Design space of scheduling strategies
6.2 Four Process Mixed Workloads (% of time spent in communication is given next to each application)
6.3 Two Process Mixed Workloads
6.4 Completion Time in Seconds (Workloads 1 to 5)
6.5 Completion Time in Seconds (Workloads 6 to 9)
6.6 Completion Time in Seconds (Workloads 10 to 12)
6.7 Individual Completion Times for Workloads 3 and 5 (in secs)

List of Figures
2.1 1-way Latency for (C1,NI1)
2.2 1-way Latency for (C1,NI2)
2.3 Comparison between (C1,NI1) and (C1,NI2)
2.4 1-way Latency for (C2,NI3)
2.5 1-way Latency for (C3,NI3) on a TLB hit
2.6 Performance of Microbenchmarks
2.7 Performance of FFT
2.8 Performance of Matmul
2.9 Performance of IS
2.10 Comparison of (C2,NI3) and (C3,NI3)
2.11 Impact of NP Speed
3.1 Mapping of Endpoint into User Address Space
3.2 Anatomy of a MU-Net Operation
3.3 Structure of a MU-Net Packet
4.1 Components of the VI Architecture
4.2 Baseline NIC
4.3 Sequence of Actions in Sending a Message
4.4 Message latency and wait times: Base (u=8, 128 bytes)
4.5 Message latency and wait times: Base (u=minimum(60,d), 4KB)
4.6 Descriptor Payload Size (Base, u=d)
4.7 CQ benefits (Base, u=1, 128 bytes)
4.8 Base+O vs. Base (4KB)
4.9 Base+O vs. Base (128 bytes)
4.10 Base+O+DMAQ vs. Base+O (4KB)
4.11 Base+O+DMAQ vs. Base+O (128 bytes)
4.12 Door+O+DMAQ vs. Base+O+DMAQ - Latency (128 bytes)
4.13 Door+O+DMAQ vs. Base+O+DMAQ - Wait time (128 bytes)
4.14 Tailgating (Door+O+DMAQ, 128 bytes)
4.15 Door+O+DMAQ+2CPU vs. Door+O+DMAQ (128 bytes)
4.16 DMA vs. MMIO using Shadow Queues (Door+O+DMAQ+2CPU, 128 bytes)
4.17 Door+O+DMAQ+2CPU vs. Base (128 bytes)
4.18 Door+O+DMAQ+2CPU vs. Base (1KB)
4.19 Door+O+DMAQ+2CPU vs. Base (4KB)
5.1 Myrinet NIC
5.2 Experimental Results Using 1 sender and 3 receivers with three classes (1, 2 and 3 MBytes/s) of QoS channels, with each receiver receiving one class. The θ shown is for the 2 MBytes/s class.
5.3 For the same experiment as Figure 5.2, this graph shows the percentage of packets experiencing jitters for PVIA and QoSVIA.
5.4 Simulation results for a 1-sender 4-receivers configuration with 3 QoS classes (2, 8 and 32 MBytes/s) and 1 BE channel (33 MBytes/s). The load on the BE channel, and on any QoS channel, remains constant. The total injected load is varied by increasing the number of QoS channels. All the graphs are plotted as a function of total injected load and not as a function of load on that channel class. The line for the 32 MBytes/s channel with SVIA is not shown because the system has already saturated at these loads.
5.5 Simulation results for a 4-senders 1-receiver configuration with 3 QoS classes (2, 8 and 32 MBytes/s) and 1 BE channel (33 MBytes/s).
5.6 Simulation results for 2 machines sending messages back and forth on 3 QoS channel classes (2, 8 and 32 MBytes/s) and 1 BE channel (33 MBytes/s). SVIA for the 32 MBytes/s channels has already saturated at these loads.
5.7 Simulation results with MPEG-1 traces for the 1 sender 4 receivers configuration with 3 QoS classes (0.95, 1.9 and 3.8 MBytes/s) and 1 BE channel (16.7 MBytes/s)
5.8 Simulation results with Hardware NSDMA for the 1 sender 4 receivers configuration with 3 QoS classes (2, 8 and 32 MBytes/s)
6.1 Software components used in the implementations
6.2 Effect of scheduling policy on one-way latency
6.3 Monitoring CPU Utilization for Workload 5
6.4 Monitoring CPU Utilization for Workload 3
Chapter 1
Introduction
Cluster computing has come of age. Initially proposed as a cost-effective form of parallel computing, clusters of workstations are becoming the preferred model for building high-performance, high-availability servers which are the cornerstones of the
Internet today. The rapid advances in microprocessors, high-speed networks and system and application software are enabling high-performance clusters to be assembled from commodity-off-the-shelf (COTS) components. In this thesis, the term cluster is used to represent a collection of commodity workstations connected by a high-speed network.
While being cost-effective, loosely coupled systems such as clusters do suffer in terms of performance, especially when compared to traditional supercomputers. The software which ties the cluster together, both at the system level (operating systems and networking libraries) and at the application level (APIs), has not been able to sufficiently exploit the offered hardware performance. In particular, the communication and scheduling subsystems of clusters provide challenging problems for system architects interested in building commercially viable platforms. The often conflicting requirements of high performance, and the need to use unmodified system software as far as possible, provide a rich source of research issues. For instance, traditional networking stacks rely on the operating system for providing protection and buffering for messages. The overhead of doing so limits the extent to which they can use high-speed networking hardware for fine-grain communication with other cluster nodes. COTS operating systems make scheduling decisions based on local criteria alone and do not take into account the cluster-wide context for some of the processes under their control (even if such information is available from these processes). Introducing these cluster-wide criteria for process scheduling is often achievable only at a considerable overhead. There are several other areas where the lack of efficient interfaces between the local operating system and cluster-wide resource managers makes the cluster perform more poorly than the node hardware would warrant.
This thesis targets two major areas of clusters: (a) the communication subsystem linking the cluster nodes and (b) the scheduling subsystem controlling the processes running on each node. Both these subsystems are interrelated and need to be examined together to deliver the best performance to applications running on the cluster. The focus of the thesis is mainly on the systems software that manages the resources of the cluster - in particular, the ones which control the network interface and process scheduling. Naturally, the design choices of this low-level software depend on the hardware that it manages. Consequently, some hardware options are examined and proposed as well.
There is a fundamental tradeoff between single-program performance and the utilization of cluster resources. Single-program performance is greatest when every cluster resource is immediately available for use any time the program needs it. However, from a utilization point of view, it is more important that service requests for a resource be evened out to match the service rate of the resource manager. It is the aim of this thesis to study some of these tradeoffs in greater detail.
The next two sections describe the five major contributions of this thesis: four of these pertain to the communication subsystem and one to the scheduling of processes on the cluster.
1.1 Efficient, scalable user-level communication
The physical network connecting the workstations in a cluster is capable of delivering bandwidths exceeding 1 Gigabit per second, with per-hop latencies less than 1 microsecond and very low error rates. However, the delivered performance of the communication subsystem, using traditional networking protocol stacks such as TCP/IP, is much worse, with latencies ranging into hundreds of microseconds. The discrepancy is caused by the high overheads of the software layers involved in messaging, both at the application level (in the form of bulky messaging libraries) and at the operating system level (due to context switches and copying).
Researchers have proposed user-level networking as one way of reducing this overhead. In this approach, the operating system kernel is eliminated from the critical path of message sends and receives. User-level processes are allowed direct access to the network interface without compromising on protection. User-level network interfaces (ULNs) such as Hamlyn [20], U-Net [93], FM [72], AM [94], Shrimp [15], PM [89], BIP [63], Trapeze [100], LFC [13] and BDM [51] have demonstrated the benefits of this approach. The computer industry has also realized the potential of ULNs and is standardizing it in the form of the Virtual Interface Architecture (VIA) specification [25]. The standard is still evolving, and efficient implementations capable of meeting the needs of legacy applications pose challenging problems.
This thesis begins, in Chapter 2, by investigating the issue of user-level communication in the context of the cluster as a whole. It looks at different kinds of network interface architectures that can be used for cluster communication. A simulation tool, called pSNOW, is used to evaluate and compare some of these options. The evaluation of some of the major network interface designs shows that the proposed benefits of user-level communication are indeed seen in a wide range of workloads. Besides, the simulation tool is a contribution in itself, allowing a uniform comparison of newly proposed designs.
Having established the merits of user-level networking through simulations, the second part of the thesis (Chapter 3) addresses the design and implementation issues on a real network interface card (NIC). Several design tradeoffs for efficient, low-latency, high-bandwidth user-level networking are examined in detail. The Myrinet NIC used in the exercise is quite representative of the networking hardware currently available, both in terms of features and performance. The work done in this part of the thesis has resulted in a ULN implementation that performs comparably to its counterparts in other projects.
One of the first issues that any viable cluster solution has to address is that of scalability. Since the performance improvements of clustering are predicated on increasing the number of nodes, it is vital that all components of the cluster scale well with this increase. For the network architecture, scalability translates to being able to deliver the same performance to an increasing amount of network traffic (measured either in terms of messages or connections). The third part of the thesis (Chapter 4) addresses this issue for ULNs in the context of the Virtual Interface Architecture (VIA). The simulation-based study examines the impact of several NIC design issues from the standpoint of scalability. Based on these, it proposes and examines new hardware and software mechanisms for improving the scalability of VIA.
Another aspect of networking architectures which has not received adequate attention in the context of clusters is quality of service (QoS). In networking, the term refers to the fulfillment of message latency and bandwidth guarantees negotiated by an application. This is important for two reasons. First, for some applications, predictability of communication is more important than its raw performance. Such applications (notably multimedia) are becoming increasingly important for clusters. Second, it allows well-behaved applications, which consume a fair share of networking resources, to be protected from their ill-behaved counterparts. This enables better network resource management, which is particularly important for the growing cluster sizes seen today.
QoS issues have been extensively studied for wide-area networks (WANs) and to some extent for LANs. But for clusters and ULNs, it is a new and challenging area. The challenge comes from the absence of a central, trusted kernel, which is in a much better position to regulate an application's usage of networking hardware. Furthermore, the performance compromise has to be kept to a minimum so that user-level networking is still attractive vis-a-vis traditional networking. The fourth part of the thesis (Chapter 5) looks at the provision of QoS guarantees in VIA. Using a combination of simulation and measurement, it shows how communication guarantees can be met by restructuring the firmware that runs on the network interface cards.
Overall, these four parts of the thesis deal with various aspects of deploying user-level networking in clusters. The final part of the thesis addresses a closely related issue, parallel process scheduling.
1.2 Communication-driven scheduling of parallel processes
The move from closely-coupled multiprocessor machines such as the Cray T3E and
Intel Paragon to cluster supercomputers built with off-the-shelf hardware and operating systems not only results in increased communication overheads for parallel programs but also requires coordinated scheduling of the parallel processes on individual nodes. A poor scheduling strategy can easily nullify any benefits gained by the use of user-level networking. Consider a request-response scenario where the sender of a message awaits a reply from the receiver. If the receiver process on the remote node is not scheduled when a message for it arrives, the sender has to wait until the receiver gets scheduled and is able to send a reply (even if we disregard application-specific skew between the send and reply parts of the communicating processes). A receiving mechanism which does not take into account remote node scheduling is likely to be inefficient in terms of processor utilization and application progress.
The problem of efficient scheduling of the processes of a parallel job is not new, but it resurfaces under a different set of constraints in a cluster environment with user-level networking. Coordinating the scheduling of parallel processes, without such support inbuilt into the commodity operating systems running on each node, is an interesting problem. One approach is to use information from the communication layer to alter the scheduling of processes. Dynamic coscheduling and implicit coscheduling are two forms of this approach that have been proposed in the past.
In addition to implementing and evaluating the benefits of communication-based scheduling strategies, the final part of the thesis (Chapter 6) proposes new strategies to enhance the performance seen by parallel programs. A comprehensive testbed implementing the Message Passing Interface (MPI) over U-Net (a user-level networking layer) has been developed on Solaris over a cluster of Sun workstations. Five new scheduling schemes have been proposed and the results show that some of these schemes offer significant performance benefits over the hitherto proposed strategies [68].
The rest of the thesis is organized as follows. Chapters 2 through 6 each describe one major component of the thesis, as outlined above. Chapter 7 summarizes the contributions of the thesis and outlines directions for future research.
Chapter 2
pSNOW
There have been numerous studies addressing efficient systems software and architectural support for parallel applications on clusters. However, there is a dearth of general-purpose tools to uniformly evaluate the performance of these environments and to guide future efforts in efficient architectural design. This chapter addresses that important void in tools for evaluating clusters.
The large overhead in communicating between the processing nodes of parallel machines is often the significant limiting factor in the performance of many applications.
This problem is exacerbated when we move from tightly coupled multiprocessors to more loosely coupled cluster environments. There are three main issues to be addressed for lowering the communication overheads on such environments. First, the network connecting the workstations should be fast. Second, the interface to the network should provide an efficient way of transferring data from the memory on the workstation onto the network. Finally, the software messaging layers should add minimal overhead to the cost of moving data between two application processes running on two different workstations.
Recent research has been addressing these issues. High speed networks such as ATM [32] and Myrinet [16] can potentially deliver point-to-point hardware bandwidths that are comparable to the link bandwidths of the interconnection networks in multiprocessors.
To address the second issue, newer network interface designs have been proposed with hardware enhancements to overlap communication with computation and to facilitate direct user-level interaction with the network interface [93]. Finally, low latency software messaging layers have been proposed [94, 72] making use of these network and network interface innovations.
Today, we have a wide spectrum of choices in terms of hardware for the computing engines, the networks, and the network interfaces to put together a cluster. There have been isolated studies that have focused on developing efficient software messaging mechanisms for a particular hardware, but it is difficult to generalize the results from such studies. For example, von Eicken et al. [93] develop a mechanism called Unet wherein the processor on the Fore Systems ATM card examines the message coming in from the network and transfers it directly to a user-level buffer without main CPU intervention. However, for the Myrinet interface, Pakin et al. [72] show that the LANai processor is not powerful enough to examine incoming packets. As a result, the delivered bandwidth goes down whenever the LANai processor is employed for additional tasks.
Mukherjee et al. [67] suggest using coherent network interfaces to lower the traffic on the system memory and I/O buses, and present a taxonomy and comparison of four such interfaces. However, their study does not look at the software overheads of the messaging layers. It would be extremely useful to evaluate different network interface designs and to examine different software solutions for these interface designs. To our knowledge, the work by Krishnamurthy et al. [59] is the only study that has attempted to compare and evaluate the different network interface designs for global address-based communication.
Though this study [59] provides valuable insight on the systems studied, the problem is that the comparison may not be uniform. Since the comparison is between systems with different (in terms of the vendor, the technology, and the capabilities) computing engines, network interfaces, and networks, it is difficult to generalize the results.
The primary motivation for this part of the thesis stems from a need to have a unified framework to compare and evaluate different network interface designs and different communication software solutions for a cluster. More specifically, we would like to answer a range of questions for the system architect to design better communication mechanisms and to adapt to emerging trends/technologies:
• What fraction of the total communication overhead is spent in the hardware and what fraction is spent in software? Further, what is the breakdown of the overhead in each hardware and software component? How much of this is unavoidable and how much can be overlapped with useful computation in the application? With this information, where should future resources (in terms of effort and money) be directed to reap the maximum rewards?
• What should be done by the host CPU and what should be done by the network interface? Can we quantify this relegation of tasks based on the relative speeds of the network interface processor and the main CPU?
• If the processor on the network interface card were to become much faster, would the Unet [93] approach perform better or should we adhere to the Fast Messages [72] approach? At what network processor speed does one become better than the other?
• Recently, it has been suggested that we use a multiprocessor workstation (Symmetric Multiprocessor or SMP) and dedicate one of its CPUs for communication purposes. What is the performance benefit with this strategy, and how efficiently are workstation resources being utilized [40]?
• How many CPUs on an SMP can a network interface handle without significant degradation in performance? Given a finite number of CPUs, should we have a fewer number of SMPs each with a larger number of CPUs, or a larger number of SMPs each with a fewer number of CPUs [99]?
In this part of the work, we present an execution-driven simulator, pSNOW (pronounced "snow"), that can be used to guide the design and evaluation of architecture and systems software support for a cluster. pSNOW helps us simulate the execution of applications on a cluster for a spectrum of network interface hardware and communication software alternatives, using the Active Messages (AM) layer [94] as the message passing interface.
2.1 pSNOW
pSNOW is an execution-driven discrete event simulator running on a SPARCstation that we have developed to: uniformly compare the different software and hardware communication capabilities for clusters, get detailed performance profiles on the execution, and vary architectural parameters to answer some of the questions raised in the beginning of this chapter. The inputs to pSNOW are applications written in C++ that use the Active Messages [94] interface for communicating between the parallel activities.
As with several recent parallel architecture simulators [81, 17, 31, 92, 75], the bulk of the application instructions are directly executed to avoid high simulation costs.
Only the explicit calls that an application makes to the Active Messages interface are simulated. The application codes are compiled to assembly, augmented with cycle counting instructions, assembled to binaries, and then linked with the rest of the simulator modules. When the application makes an Active Messages call, the simulation time is reconciled with the cycles taken for executing the application code since the previous call.
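To make the direct-execution approach concrete, the fragment below sketches one way such cycle reconciliation could look. It is a minimal illustration under assumed names (cycle_counter, sim_clock, reconcile_simulation_time, am_request_simulated are invented here), not the actual pSNOW code.

```cpp
#include <cstdint>

// Hypothetical counter maintained by the instrumentation that the assembly-level
// pass inserts: it is incremented by the estimated cycle cost of each basic block
// that executes natively on the host.
static uint64_t cycle_counter = 0;

// Hypothetical simulated clock owned by the discrete-event engine.
static uint64_t sim_clock = 0;

// Called on entry to every Active Messages call: the cycles spent in directly
// executed application code since the previous call are folded into the simulated
// time before the messaging event itself is simulated.
static void reconcile_simulation_time() {
    sim_clock += cycle_counter;   // charge the native work to simulated time
    cycle_counter = 0;            // start counting afresh for the next stretch
}

void am_request_simulated(/* ... message arguments ... */) {
    reconcile_simulation_time();
    // ... enqueue a simulation event that models the send path ...
}
```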
pSNOW provides a process-oriented view to implement the details of a node architecture and has been implemented on top of the CSIM [66] simulation package.
It has a library of generic classes that model a CPU, memory, buses, and Network
Interfaces, and default member function implementations. These generic classes can be instantiated for specific hardware artifacts such as the main CPU, the network interface processor (NP), memory on workstation, memory on the network interface (NI) card, memory bus, I/O bus etc., and the member functions overloaded/customized to model specific actions for the instance. This modularization has made it extremely easy for us to model different network interfaces and to modify hardware features (such as making the node architecture an SMP). We feel that this tool will be valuable as newer network interfaces and newer node architectures continue to emerge.
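The modular structure described above might be organized along the following lines; this is an illustrative sketch only (the class and member names are invented here, and the real pSNOW classes sit on top of CSIM's process-oriented primitives).

```cpp
#include <cstdint>
#include <string>

// A generic hardware component: anything that consumes simulated time.
class Component {
public:
    explicit Component(std::string name) : name_(std::move(name)) {}
    virtual ~Component() = default;
    // Default behaviour: simply advance simulated time (a hold() in CSIM terms).
    virtual void consume(uint64_t /*cycles*/) {}
protected:
    std::string name_;
};

// Instantiations override/customize the defaults for a specific artifact.
class Bus : public Component {
public:
    Bus(std::string name, double mbytes_per_sec)
        : Component(std::move(name)), bandwidth_(mbytes_per_sec) {}
    // Model a transfer as occupancy of the bus for size/bandwidth time.
    void transfer(uint64_t /*bytes*/) { /* hold(bytes / bandwidth_) */ }
private:
    double bandwidth_;
};

class Processor : public Component {
public:
    Processor(std::string name, unsigned cycle_time)
        : Component(std::move(name)), cycle_time_(cycle_time) {}
    // The host CPU would be built with cycle_time = 1 and the slower network
    // processor (NP) with cycle_time = 2 (cf. Table 2.1).
    void compute(uint64_t cycles) { consume(cycles * cycle_time_); }
private:
    unsigned cycle_time_;
};
```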
We have simulated the network interfaces (discussed in detail in the next section) on pSNOW and developed the communication software substrates for these interfaces.
The Active Messages layer has been implemented on top of these substrates. Both the communication substrates and the Active Messages layer have been cycle counted to account for the software overheads. For the discussions in this chapter, we model an ATM network between the nodes of the cluster. Since our focus at this point is on understanding network interfaces, their interaction with the node architecture, and the communication software, we have not simulated ATM switches for the performance results in this chapter, assuming that these switches would not add significant overheads to the network latency. However, it is easy to model switches and connectivities without altering the implementation of the current node architecture.
Table 2.1 summarizes some of the simulation parameters in host CPU cycles for the performance results in this chapter, unless they are explicitly varied for a particular experiment.
Parameter            Value
Cache Access         1 cycle
Memory Access        6 cycles
NI Access            12 cycles
NP clock             2 cycles
Network Bandwidth    155 Mbits/sec
Memory Bus           80 MBytes/sec
I/O Bus              50 MBytes/sec
Page Size            4096 bytes

Table 2.1. Simulation Parameters
2.2 Communication Mechanisms for NOW
In this section, we review network interface designs, both past and present, and communication software designs for these interfaces. Krishnamurthy et al. [59] give an overview of the network interfaces found on five hardware platforms, namely the CM-5
[90], the Cray T3D [30], the Meiko CS2 [10], the Intel Paragon [55], and the Berkeley
NOW which is a cluster of UltraSPARCs connected by Myrinet. As mentioned earlier, we would like to refrain from discussing the details of any specific machine/vendor, since we are interested in a uniform comparison of different designs. Instead, we present three different categories of network interface cards (the interfaces found on the above machines may fall in one of these categories) plugged into the I/O bus of a workstation, and discuss possible communication software designs for these three categories.
2.2.1 Network Interfaces
The three categories of network interfaces (NIs) that we consider are: a network card with no intelligence (no NP) and no DMA support (NI1); a network card with no intelligence (no NP) but with DMA capabilities (NI2); and a network card with an NP and with DMA capabilities (NI3). The capabilities of the network interface card thus progressively improve as we move from NI1 to NI3. There are newer designs which have proposed special hardware-based virtual memory mapped network interfaces [14] and coherent network interfaces [67], but we are not considering such interfaces in this study. NI1 is representative of older interfaces, such as the one found on the CM-5 and the older Fore Systems TCA-100 ATM adapter [43]. On these interfaces, the main CPU has to explicitly read/write data from/to the receive/send FIFOs. This overhead will particularly dominate for large data transfers. NI2 is representative of the interfaces on a slew of parallel machines such as the nCUBE/2 [70] as well as on many LAN interface cards such as Ethernet. The advantage over NI1 is that the CPU can use the additional hardware (the DMA) to pick/deposit the data to send/receive from/to the memory while it can do useful work. However, DMA controllers normally work only with physical addresses. To prevent the DMA send/receive buffers from being paged out, usually these buffers are created in kernel space and pinned in physical memory.
Additional copying needs to be done to/from these buffers. Finally, NI3 is representative of the newer network interfaces such as the Fore Systems SBA-200 ATM interface [44], the Myrinet interface [16], and the interfaces found on some parallel machines such as the Intel Paragon [55]. On these interfaces, apart from DMA capability, there is also a network processor (NP) that can be programmed. This helps move some of the processing to the NP while the main CPU can indulge in useful work. As we shall soon observe, this NP also provides the ability to perform data transfer to/from the user address space directly without involving the operating system on the main CPU, thus lowering context switches, interrupts and copying costs.
2.2.2 Communication Substrates
On these three NIs, we consider three low-level communication substrates (C1, C2,
C3). These communication substrates support a reliable basic send/receive mechanism, incorporate packetization for long messages, and provide protection between multiple user processes. Higher-level programming interfaces such as Active Messages [94] will be provided on top of these substrates for the applications.
The first substrate (C1) is a kernel-level messaging layer, i.e., the user program has to invoke the send/receive mechanism of the kernel for all data transfers to/from the network. Consequently, the communication will involve context switches to the kernel, and additional copying of data between the kernel and user address spaces. The advantage with C1 is that protection between user processes is fairly straightforward to implement. C1 is thus similar to the normal socket mechanism found on UNIX systems in LAN environments. We consider C1 for NI1 (where the kernel will directly operate the send/receive FIFOs) and NI2 (where the kernel will program the DMA).
The second substrate (C2) that we consider is a user-level messaging layer, where a user-level library implements the basic send/receive mechanism. Note that we cannot implement C2 on NI1 and NI2 since that would compromise on protection. Potentially, a user process could examine an incoming message of another process if the network interface were mapped into the user space. Hence we do not consider this option. However, as the Unet [93] approach showed, we can implement a user-level layer for NI3. For each user process, we can reserve a communication segment in its own address space.
This segment is pinned in physical memory. For an outgoing packet, the NP (running kernel mode code) can examine communication segments of user processes and send them out into the network. Similarly, for an incoming message, the NP can examine the header, figure out which user process needs to receive it, and can DMA it into the communication segment of this user process.
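One plausible layout for such a per-process endpoint is sketched below; the structure and field names are hypothetical and only meant to make the preceding description concrete (the pinned communication segment holds the buffers and queues that both the user library and the NP touch).

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kQueueEntries = 64;    // illustrative sizes only
constexpr std::size_t kBufferBytes  = 4096;

// One descriptor in a send or receive queue, shared with the NP.
struct MessageDescriptor {
    uint32_t dest_node;              // or source node on the receive side
    uint32_t length;                 // payload length in bytes
    uint32_t buffer_index;           // which slot in the buffer area holds the data
    volatile uint32_t owned_by_nic;  // simple ownership flag for handshaking
};

// The per-process communication segment: allocated in the user's address space
// but pinned in physical memory so that the NP can DMA into it.
struct CommunicationSegment {
    MessageDescriptor send_queue[kQueueEntries];
    MessageDescriptor recv_queue[kQueueEntries];
    char buffers[kQueueEntries][kBufferBytes];
};

// An endpoint is the application's handle into the network: the segment plus the
// library-side cursors used to produce and consume descriptors.
struct Endpoint {
    CommunicationSegment* segment;   // pinned, also known to the NP
    std::size_t send_head = 0;
    std::size_t recv_tail = 0;
};
```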
One of the problems with C2 is that communication segments of user processes need to be pinned in physical memory which can lead to inefficient use of the physical memory. Further, the data to be sent/received needs to be copied to/from the communication segment from/to the eventual destination address in the user space. It would be extremely useful if the network interface could work with virtual addresses instead of the physical addresses (as is done with C2) to avoid these problems. The network interface processor (NP) could look up the operating system page table to perform the virtual to physical mapping before performing the data transfer. In fact, recently looked up page table entries can be cached (similar to how the TLB caches page table entries for the main CPU) on the NP's local memory that is provided on most of the recent NI3 type cards [44, 16]. These entries can be updated/invalidated when the operating system on the main CPU evicts pages from physical memory. We refer to this communication substrate as C3, and this can be implemented on only the NI3 interfaces.
To summarize, in this exercise, we focus on three kinds of network interfaces (NI1,
NI2, NI3) and three low-level communication substrates (C1, C2, C3). The implementation details and performance comparisons for the four combinations that we consider
(C1,NI1), (C1,NI2), (C2,NI3), and (C3,NI3) are given in Section 2.3.
2.3 Implementation and Performance of Communication Substrates
In this section, we describe the implementation of the different communication substrates (C1, C2, C3) on the three network interfaces (NI1, NI2, NI3) discussed in
Section 2.2, and the performance of these substrates. As we mentioned, we present results from four combinations (C1,NI1), (C1,NI2), (C2,NI3), and (C3,NI3). These combinations are compared using a simple experiment to ping-pong messages between
2 nodes using the send/receive mechanisms provided by the substrates, assuming a 100
MHz host CPU.
2.3.1 (C1,NI1)
This combination is studied as a worst case example since the network interface
(NI1) is the most primitive of the chosen three, and the communication software is also the most inefficient. The only two capabilities provided by the network interface are pushing data out from a send FIFO onto the network, and assembling incoming messages into a receive FIFO and raising an interrupt when the message is received.
The main CPU on the workstation has to copy data from/to the host memory to/from the send/receive FIFO of the network interface card. Since the CPU directly manipulates the network FIFOs which are potentially shared between multiple application processes at a node, the send/receive mechanisms for (C1,NI1) necessitate execution in the kernel mode.
Figure 2.1 shows a detailed profile of the execution path of a message from the time an application process invokes the send call until the time the message is received by the application process at the other end for the (C1,NI1) combination. On a send system call, the CPU pushes the data to be sent from the host memory onto the send FIFO if the
NI is free. If not, it queues this request and will be subsequently informed (interrupted) by the NI when the device becomes free. On the arrival of a message, the NI interrupts the main CPU, which would incur the cost of switching to an interrupt service routine
(ISR) that runs in kernel mode. The ISR copies the message from the receive FIFO on the card to the system’s message queue. On a receive call by an application process, the corresponding message is dequeued from the system’s message queue and returned to the requesting process. Figure 2.1 shows that the one-way latency (from the time the application process invokes the send system call until the time the receiver application process returns with the data from the receive system call) for this combination is around
450 microseconds. Of this, only a small fraction is due to the hardware transmission time on the wire. Most of the cost, which is incurred by the main CPU, is due to crossing protection boundaries, apart from software costs of checksumming, buffer and queue management, and driver and ATM processing overheads.

Fig. 2.1. 1-way Latency for (C1,NI1)
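In outline, the kernel-mediated path profiled in Figure 2.1 amounts to something like the sketch below. The types and function names are invented for illustration, and the real substrate additionally performs checksumming, AAL/SAR processing, and queue management, which dominate the measured cost.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Stub model of NI1: send/receive FIFOs only, no DMA engine, no NP.
struct Nic1 {
    std::deque<unsigned> send_fifo, recv_fifo;
    bool send_fifo_free() const { return send_fifo.size() < 64; }
    void push_word(unsigned w) { send_fifo.push_back(w); }       // programmed I/O
    unsigned pop_word() { unsigned w = recv_fifo.front(); recv_fifo.pop_front(); return w; }
};

// Kernel-side queue that the ISR fills and the receive system call drains.
struct SystemMessageQueue {
    std::deque<std::vector<unsigned>> messages;
};

// send() system call: the host CPU itself copies the payload onto the card.
bool sys_send(Nic1& nic, const std::vector<unsigned>& words) {
    if (!nic.send_fifo_free()) return false;  // request queued; NI interrupts when free
    for (unsigned w : words) nic.push_word(w);
    return true;
}

// Interrupt service routine: runs in kernel mode on message arrival and copies
// the message from the receive FIFO into the system message queue.
void nic_isr(Nic1& nic, SystemMessageQueue& q, std::size_t words) {
    std::vector<unsigned> msg;
    for (std::size_t i = 0; i < words; ++i) msg.push_back(nic.pop_word());
    q.messages.push_back(std::move(msg));
}
```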
2.3.2 (C1,NI2)
The improvement in NI2 over NI1 is the additional DMA engine. As a result, this relieves the CPU from the burden of transferring data to/from the network interface from/to the host memory. However, the CPU still has to program the DMA to perform these operations, and we cannot allow user processes to directly program the DMA since protection will be compromised. Hence, the send/receive mechanisms have to be implemented in the kernel. Also, DMA controllers normally work with physical addresses
(not virtual addresses). Once the DMA controller is programmed with the physical address of the buffer in the host memory, this buffer has to be pinned (not paged out) till the data transfer is completed.

Fig. 2.2. 1-way Latency for (C1,NI2)
Figure 2.2 shows the one-way latency for a message with the (C1,NI2) combination. Comparing Figures 2.2 and 2.1, we can see that the one-way latencies for (C1,NI1) and (C1,NI2) are very similar. This is because the main CPU still does most of the other work involved, which constitutes the bulk of the one-way latency. Further, for the short messages, such as the one used for this experiment, the cost of programming the
DMA is comparable to the cost of directly transferring data to/from the network FIFOs.
We expect (C1,NI2) to outperform (C1,NI1) only for longer messages. To illustrate this, we have varied the size of the messages exchanged in our ping-pong experiment, and we show the performance results (round trip time) in Figure 2.3. As we increase the message size, we find the latencies growing linearly with the message size, and the gap between (C1,NI2) and (C1,NI1) increases. However, in these experiments, each CPU is sending the data and waiting for the other CPU to respond. Consequently, the true rewards with a DMA are not really reaped with (C1,NI2) since the CPU is not performing any other work while the DMA is moving data between the host memory and network
FIFOs. This time could potentially be overlapped with useful work in an application and we will see this effect with the application workloads in Section 2.4.
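The break-even point between the two interfaces can be reasoned about with a simple cost model. The sketch below is purely illustrative; the constants are placeholders, not measured values from the simulations.

```cpp
#include <cstddef>

// Illustrative per-message costs, in host CPU cycles (placeholder values).
constexpr double kPioCyclesPerByte = 2.0;   // CPU copies every byte to the FIFO
constexpr double kDmaSetupCycles   = 600.0; // programming the DMA engine
constexpr double kDmaCyclesPerByte = 0.5;   // I/O-bus transfer, CPU not involved

double pio_send_cycles(std::size_t bytes) {
    return kPioCyclesPerByte * static_cast<double>(bytes);
}

double dma_send_cycles(std::size_t bytes) {
    return kDmaSetupCycles + kDmaCyclesPerByte * static_cast<double>(bytes);
}

// With these placeholder numbers, DMA only wins past setup/(pio - dma) = 400 bytes,
// which is why an (C1,NI2)-style design shows no advantage for short ping-pong
// messages but pulls ahead as the message size grows.
```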
2.3.3 (C2,NI3)
The NI3 card has been modeled after the Fore SBA 200 [44] card, which has an i960 processor (NP), on-card memory, DMA engine and hardware checksumming support. The i960 can access the host memory only through the DMA and the host
CPU can access the card memory directly. C2 implements a software messaging layer similar to the Unet mechanism [93].

Fig. 2.3. Comparison between (C1,NI1) and (C1,NI2)

The queue structures for outgoing messages are located in the card memory, and the queues for incoming messages are located in the host memory. Each user process has an endpoint which serves as the application's handle into the network. The endpoint contains a communication segment storing the messages and the incoming (receive) queue structure, which is pinned in the host memory. The NP polls the network, examines incoming messages to identify the destination, and deposits them into the corresponding endpoint while updating the receive queue structure. The
NP also polls all the endpoints in the host to check if there are any messages to be sent, and pushes them out into the network from the send queue. Since the send and receive calls work only with the data structures in the user’s own address space and need not access the network hardware directly, these calls can be implemented as a user level library.
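The NP-side logic just described can be pictured as a simple polling loop. The sketch below uses invented names in the spirit of the endpoint layout from Section 2.2.2, and omits flow control and error handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Packet {                    // what the NP sees coming off the wire
    uint32_t dest_endpoint;        // identifies the receiving user process
    uint32_t length;
    const char* payload;
};

struct EndpointOnCard {            // the NP's view of one process's endpoint
    std::vector<Packet> send_queue;                 // lives in card memory
    bool has_outgoing() const { return !send_queue.empty(); }
    Packet next_outgoing() {
        Packet p = send_queue.front();
        send_queue.erase(send_queue.begin());
        return p;
    }
    void dma_to_user(const Packet&) { /* DMA into the pinned communication
                                         segment and update its receive queue */ }
};

// Main loop of the (kernel-installed) NP firmware: alternately poll the network
// for incoming packets and the endpoints for outgoing ones, so the host CPU
// never has to enter the kernel on the critical path.
void np_firmware_loop(std::vector<EndpointOnCard>& endpoints,
                      bool (*network_poll)(Packet&),
                      void (*network_send)(const Packet&)) {
    for (;;) {
        Packet p;
        if (network_poll(p)) {                          // incoming message
            endpoints[p.dest_endpoint].dma_to_user(p);  // demultiplex by header
        }
        for (auto& ep : endpoints) {                    // outgoing messages
            if (ep.has_outgoing()) network_send(ep.next_outgoing());
        }
    }
}
```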
As seen in Figure 2.4, the 1-way latency of 43 microseconds for (C2,NI3) is much lower than the 450 microseconds obtained for the earlier two cases (Figures 2.1 and 2.2).
The costs of making kernel calls and of switching contexts to process interrupts (which are the significant components of the message latencies in the earlier combinations) are altogether avoided. Further, the NP takes over some of the queue/buffer management duties in sending/receiving messages. Finally, though not significant, the card provides a hardware checksumming capability to lower the software costs.

Fig. 2.4. 1-way Latency for (C2,NI3)
2.3.4 (C3,NI3)
A drawback with the previous (C2,NI3) scheme is that the communication segments of application processes have to be pinned in physical memory. This can lead to an inefficient usage of the physical memory in the system. Also, to send/receive, data has to be copied from/to the data structures in the application to/from its communication segment. These drawbacks can be addressed in C3 by making the network interface work with virtual addresses instead of physical addresses. There are different ways to realize this, and we discuss here one such implementation for C3.
In addition to the queues that are maintained with C2, C3 maintains a software
TLB on the card memory. This TLB caches the virtual to physical address translations of the applications’ communication segments from the entries in the page table of the host operating system. Initially, this TLB is empty, and as messages arrive, the NP
finds that it does not have an entry in the TLB (lookup is implemented using a hashing function) and interrupts the main CPU. The host CPU looks up the corresponding page table entry and records this in the card TLB. The NP then arranges for the DMA to transfer the data using this physical address. However, in the more common case (with good locality behavior), we can expect the NP to find the mapping in the software TLB
(that is obtained via a hashing function) and it can directly program the DMA with the physical address without interrupting the host CPU. The host operating system has to keep track of the page table entries cached on the card whenever it is interrupted by the
NP for an entry. When a page in physical memory has to be replaced, the host operating system has to check if this entry is cached on the card, and if so, it has to invalidate this entry after making sure there is no pending DMA transfer to/from this page.
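A minimal sketch of the NP-side lookup described above is given below. The hash-table details, names, and host interaction are invented for illustration; the real design must also serialize invalidations against in-flight DMA transfers.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

using VirtualPage  = uint64_t;
using PhysicalPage = uint64_t;

// Software TLB kept in the NP's local card memory, filled on demand from the
// host operating system's page table.
class CardTlb {
public:
    std::optional<PhysicalPage> lookup(VirtualPage vp) const {
        auto it = map_.find(vp);
        if (it == map_.end()) return std::nullopt;
        return it->second;
    }
    void insert(VirtualPage vp, PhysicalPage pp) { map_[vp] = pp; }
    // Invoked on behalf of the host OS when it evicts a page, after making sure
    // no DMA transfer to/from that page is still pending.
    void invalidate(VirtualPage vp) { map_.erase(vp); }
private:
    std::unordered_map<VirtualPage, PhysicalPage> map_;  // hashed lookup
};

// What the NP does before programming a DMA transfer for a message.
PhysicalPage translate_for_dma(CardTlb& tlb, VirtualPage vp,
                               PhysicalPage (*interrupt_host_for_fill)(VirtualPage)) {
    if (auto pp = tlb.lookup(vp)) {
        return *pp;                                     // common case: hit, no host involvement
    }
    PhysicalPage pp = interrupt_host_for_fill(vp);      // rare: host walks its page table
    tlb.insert(vp, pp);
    return pp;
}
```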
The reader should note that C3 provides the capability of transferring data from/to the actual data structures of the application provided the NI is supplied with the virtual address of the data structures. It is up to the higher layers to exploit this capability. For instance, in a shared virtual memory system, the messages carry the virtual addresses which can be looked up by the NI. We do not exploit this feature at the receiver end in our Active Messages implementation since the handlers associated with the messages are executed on the host CPU and not on the network CPU. At the sender end, the NP picks up the data to send from the user data structures directly without requiring the host CPU to copy it to the communication segment.
There are 3 different cases under which the performance of (C3,NI3) should be considered. The first scenario is where the NI finds the virtual address to physical address mapping in the card TLB. In this case, we find the performance to be slightly higher than the (C2,NI3) combination with the additional cost of TLB lookups (see Figure 2.5).
The second scenario is where the NI does not find the mapping in its TLB and has to interrupt the main CPU instead. In this case, the cost of the interrupt will dominate the software costs, which will result in performance as in the C1 substrates. The final scenario is where the corresponding page is not present in the host memory and has to be brought in from the disk by the host operating system. In this case, we find that the cost of the disk I/O overshadows the rest of the overheads.

Fig. 2.5. 1-way Latency for (C3,NI3) on a TLB hit
2.3.5 Validation
Before we proceed with the rest of the discussion, we would like to point out that the communication performance presented here agrees with results on actual experimental platforms. As we mentioned in Section 2.1, for the (C1,NI1) and (C1,NI2) combinations, we have not written the software and we have instead used our earlier measurements [80] to model the costs. But we have written the software layers for (C2,NI3) and (C3,NI3) ourselves (which are cycle counted by pSNOW), and the resulting performance is within
10-15% of the results presented in related studies [93].
2.4 Microbenchmark and Application Performance
The communication substrates (C1,C2,C3) discussed until now provide a basic send/receive interface. Usually a higher level programming interface is implemented for applications on top of these substrates. For instance, we can implement shared virtual memory on top to provide a shared memory programming environment, or we can provide messaging environments such as PVM [87], MPI [65], or Active Messages
[94]. For this study, we confine ourselves to message passing environments, particularly the Active Messages (AM) programming paradigm which has been proven as an efficient way for integrating computation and communication in applications [94].
The AM environment provides a single resource view of the cluster and provides globally unique processes. Our AM layer conforms to the Generic Active Messages
Specifications (GAM Version 1.1) providing the basic operations: am_request, am_reply, am_store, am_get, and am_poll. am_request is used to send short messages (that can fit in the message descriptor) to a remote node, and upon receipt a handler is invoked at the remote node. am_reply is used within the request handler to send a reply message which also fits within the message descriptor. am_store and am_get are used to send and receive data that may not fit within the message descriptor. am_poll polls the network for all arriving messages. This function may either be called by the application to check for message arrival or within any of the other AM functions. In general, we expect applications to spend more time on am_poll compared to the other functions.
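To make the programming model concrete, the fragment below shows the typical request/reply idiom. The exact GAM argument lists are not reproduced here, so the declarations are simplified stand-ins for the real library rather than its actual signatures.

```cpp
#include <cstdio>

// Simplified stand-ins for the GAM calls (argument lists abbreviated).
using Handler = void (*)(int src_node, int arg);
void am_request(int dest_node, Handler h, int arg);   // short message + handler
void am_reply(int src_node, Handler h, int arg);      // only legal inside a request handler
void am_poll();                                       // drain the network, run handlers

// Reply handler: runs on the original requester when the reply arrives.
void done_handler(int /*src*/, int value) { std::printf("got %d\n", value); }

// Request handler: runs on the remote node and immediately sends a reply.
void incr_handler(int src_node, int value) { am_reply(src_node, done_handler, value + 1); }

void example(int remote) {
    am_request(remote, incr_handler, 41);  // ask 'remote' to increment 41
    am_poll();                             // poll so the reply handler gets to run
}
```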
For the C1 substrate, these operations invoke the kernel level send/receive mechanisms while they are user-level libraries for C2 and C3. Note that the handlers are always executed at the user-level in the host CPU, regardless of whether we have an NP or not.
Consequently, for the (C3,NI3) implementation, even though an NP can potentially be given a virtual address (given in the incoming message) to deposit the message directly in the user process data structures, we cannot exploit this feature with the current AM implementation for this substrate. The NP has to DMA the message into a designated user communication segment and invoke the handler which may in turn have to copy into user data structures. However, at the sender end, the NP can instruct the DMA to pick up the data directly from the user data structures instead of requiring the host
CPU to copy it to a communication segment.
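Since handlers always run on the host CPU, am_poll is essentially a dispatch loop over the endpoint's receive queue. The sketch below is an idealized rendering with invented structure names, not the implementation used in the experiments.

```cpp
#include <cstddef>
#include <cstdint>

using Handler = void (*)(uint32_t src_node, void* payload, std::size_t len);

struct ReceiveEntry {
    volatile uint32_t ready;   // set by the NP after it DMAs the message in
    uint32_t src_node;
    uint32_t length;
    Handler  handler;          // which handler the sender asked to run
    char     payload[4096];
};

struct ReceiveQueue {
    static constexpr std::size_t kEntries = 64;
    ReceiveEntry entry[kEntries];
    std::size_t  next = 0;
};

// am_poll, in outline: consume every message the NP has deposited in the
// endpoint's (pinned) receive queue and run its handler at user level.
void am_poll_sketch(ReceiveQueue& q) {
    while (q.entry[q.next].ready) {
        ReceiveEntry& e = q.entry[q.next];
        e.handler(e.src_node, e.payload, e.length);   // handler runs on the host CPU
        e.ready = 0;                                  // hand the slot back to the NP
        q.next = (q.next + 1) % ReceiveQueue::kEntries;
    }
}
```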
In the following discussion, we present performance results for 2 microbenchmarks
(Ping Pong and Ping Bulk) and 2 applications (FFT and Matmul) written using the AM interface for the different communication substrates.
2.4.1 Ping Pong and Ping Bulk
These are microbenchmarks for comparing communication software performance between two nodes and are part of the Active Messages distribution codes. One node sends a message to another and waits for a reply. The handler at the destination sends back an am_reply. In Ping Pong, the message is short and can fit in the descriptor itself, and uses the am_request call. In Ping Bulk, the message is longer (1024 bytes) and the am_store call is used. These actions are performed over a number of iterations (128 in this study) and the roundtrip time (RTT) per iteration is presented in Figures 2.6 (a) and
(b). Note that the performance results include the costs of the communication substrates, the costs of the AM implementation, and the frequency/dynamics of the invocation of the poll routine. As expected, the C1 mechanism (on NI1 and NI2) performs poorly for both microbenchmarks because of crossing protection boundaries. In the Ping Pong case where the messages are small, the cost of programming the DMA offsets any savings in the data transfer time, resulting in poorer performance of NI2 over NI1. However, for the
1024 byte messages in Ping Bulk, NI2 shows better performance. Since the same data items are being exercised across iterations for both benchmarks, the cost of interrupting the host CPU to fill the software TLB on the card gets amortized for (C3,NI3), giving performance that is close to (C2,NI3).
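For reference, the core of such a ping-pong loop might look as follows, again using simplified stand-in declarations for the AM calls and an invented cycle counter; the distributed benchmark in the AM suite is more involved.

```cpp
#include <cstdint>

// Simplified stand-ins (see the request/reply sketch in Section 2.4).
using Handler = void (*)(int src_node, int arg);
void am_request(int dest_node, Handler h, int arg);
void am_reply(int src_node, Handler h, int arg);
void am_poll();
uint64_t read_cycle_counter();             // hypothetical timing source

static volatile int replies_seen = 0;
void pong_reply(int, int) { ++replies_seen; }                             // runs on node 0
void ping_handler(int src, int arg) { am_reply(src, pong_reply, arg); }   // runs on node 1

// Node 0 side: 128 request/reply round trips, averaged into an RTT figure.
double ping_pong_rtt(int peer, int iterations = 128) {
    uint64_t start = read_cycle_counter();
    for (int i = 0; i < iterations; ++i) {
        int before = replies_seen;
        am_request(peer, ping_handler, i);
        while (replies_seen == before) am_poll();   // spin-poll for the reply
    }
    return double(read_cycle_counter() - start) / iterations;
}
```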
2.4.2 FFT
Fig. 2.6. Performance of Microbenchmarks: (a) Ping Pong, (b) Ping Bulk. The bars show the RTT (in 10^2 cycles) for the (C1,NI1), (C1,NI2), (C2,NI3), and (C3,NI3) combinations.

FFT is a 1 dimensional Fast Fourier Transform application which we have translated to an AM implementation from our earlier scalability studies [81, 82]. The application goes through three phases. In the first and third phases, the processes perform the radix-2 Butterfly computation, which is an entirely local operation. The middle phase is the only phase involving communication in which the cyclic layout of data is changed to a blocked layout using an all-to-all communication step. This communication is implemented using am_get and am_poll functions. Each node runs only one application process, and thus when we vary the application processes, we are also increasing the number of workstations accordingly.
Fig. 2.7. Performance of FFT: (a) varying processes for FFT (64K problem); (b) varying problem size for FFT (16 procs). The stacked bars show time (in 10^6 cycles) split into Compute, am_poll, am_get, am_store, and am_request for each substrate/NI combination.

Figure 2.7 (a) shows the performance for a 64K problem with 4, 8, and 16 processes. The times are presented for the local computation performed by a host CPU as well as the time spent by the CPU in each of the AM functions. We find the overall execution time going down as we increase the number of processes. But, while the computation time (useful work) done by a process goes down, the time in the AM calls increases, suggesting a point of diminishing returns. Of all the AM calls, we find the maximum time spent in am_poll, with the other calls being overshadowed (and not showing up in the graphs). In Figure 2.7 (a), we do not find significant differences between (C1,NI1) and (C1,NI2). However, the differences between the two (i.e., the advantage of the DMA) become more apparent when the problem size is increased (see Figure 2.7 (b)). In general, (C2,NI3) and (C3,NI3) do better than the former two. Between these, (C2,NI3) does marginally better than (C3,NI3), with the difference becoming more apparent for larger problem sizes (where the NP is likely to perform more TLB lookups and interrupt the host more often) and larger number of nodes.
In the following two application execution time graphs, we are not showing the performance of (C1,NI1) and (C1,NI2) since these perform much worse than the other two (to help us focus in on the benefits of (C2,NI3) and (C3,NI3)).
2.4.3 Matmul
This is a Matrix Multiplication application that uses a systolic algorithm given in [45] to multiply two dense matrices of size N × N (480 × 480 in this exercise). The input matrices are decomposed in two dimensions, and the submatrices are allotted to each process. Each step of the algorithm involves three stages: a submatrix is broadcast to the other processes of the same row; local computation is performed; and the submatrices are rotated upwards within each column. These communications use the am_store and am_poll functions; a schematic sketch of one step follows.
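As a rough illustration of the three stages described above, the sketch below shows one step of such a systolic loop. The am_store/am_poll prototypes, the process-grid helpers (proc_id, remote_a_buf, remote_b_buf) and the submatrix layout are hypothetical names introduced for this sketch, not part of the actual implementation.

```c
/* Hypothetical sketch of one step of the systolic matrix multiply:
 * (1) broadcast an A submatrix along the row, (2) multiply locally,
 * (3) rotate B submatrices upwards within each column. */
#include <stddef.h>

extern void  am_store(int dest, const void *src, void *remote, size_t nbytes);
extern void  am_poll(void);
extern int   proc_id(int row, int col);       /* grid coordinates -> process id  */
extern void *remote_a_buf(int row, int col);  /* pre-agreed remote receive areas */
extern void *remote_b_buf(int row, int col);

void systolic_step(double *a_sub, double *b_sub, double *c_sub,
                   int bs, int step, int my_row, int my_col, int p)
{
    size_t sub_bytes = (size_t)bs * bs * sizeof(double);

    /* stage 1: the selected process broadcasts its A submatrix along its row */
    double *a_cur = a_sub;
    if (my_col == (my_row + step) % p) {
        for (int c = 0; c < p; c++)
            if (c != my_col)
                am_store(proc_id(my_row, c), a_sub,
                         remote_a_buf(my_row, c), sub_bytes);
    } else {
        am_poll();                                     /* receive the broadcast */
        a_cur = (double *)remote_a_buf(my_row, my_col);
    }

    /* stage 2: local computation, C += A * B on the bs x bs submatrices */
    for (int i = 0; i < bs; i++)
        for (int k = 0; k < bs; k++)
            for (int j = 0; j < bs; j++)
                c_sub[i * bs + j] += a_cur[i * bs + k] * b_sub[k * bs + j];

    /* stage 3: rotate my B submatrix to the process one row above;
     * picking up the B received from below (for the next step) is elided */
    am_store(proc_id((my_row + p - 1) % p, my_col), b_sub,
             remote_b_buf((my_row + p - 1) % p, my_col), sub_bytes);
    am_poll();
}
```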
Here, we have used a different scaling strategy from FFT in varying the number of processes, to show how multiple processes executing on a single node can affect performance. The performance results shown in Figure 2.8 are for 3 workstations each running 3 processes (3 × 3), and 4 workstations each running 4 processes (4 × 4), multiplying two matrices of size 480 × 480. The time reflects the virtual time that the host CPU spends on executing each application process. Thus, as we move from 3 × 3 to 4 × 4, while the local compute time for a process would go down, the overheads would increase because of higher communication as well as higher contention for the shared resources on a workstation.
(C1,NI1) and (C1,NI2), as expected, perform much worse than the other two, with (C1,NI2) doing slightly better because of the longer messages and the multiple application processes running on the host CPU, which benefit from the DMA transfer. Of (C2,NI3) and (C3,NI3), the latter does better than the former (see Figure 2.8). This application uses large data transfers, and there is a saving in memory-copying costs for (C3,NI3) since, with (C2,NI3), the host CPU has to copy the data to be sent into pinned DMA buffers. Further, the high cost of TLB fills by the NP is amortized over numerous message exchanges.
[Figure 2.8 — Performance of Matmul: execution time (in 10^6 cycles) broken down into Compute, am_poll, am_get, am_store and am_request for (C2,NI3) and (C3,NI3) on the 3x3 and 4x4 configurations.]
2.4.4 IS
IS is an Integer Sort application drawn from the NAS suite [7] that uses bucket sort to rank a list of integers. We have coded an AM version of this application based on the implementation suggested in [74]. The input data (the list to be ranked) is equally partitioned among the processes. Each process maintains local buckets for the chunk of the input list allocated to it; updates to the local buckets are entirely local operations. Next, these local buckets are merged into a global version of the buckets using a ring type of communication between the processes (sketched below). For each bucket, each process is then assigned a set of ranks by atomically subtracting that process's local bucket information from the global bucket information. This involves an all-to-all communication between the processes, but the communication is skewed to minimize hotspots. Finally, with all this local information, each process ranks the elements of the list assigned to it.
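A minimal sketch of the ring-based bucket merge is shown below. The bucket count, the am_store/am_poll prototypes, and the symmetric ring_recv_buf() staging buffer are assumptions for illustration; buffer reuse and completion detection between steps are glossed over.

```c
/* Hypothetical sketch of the ring merge: each process forwards the counts it
 * received in the previous step (initially its own) to its downstream
 * neighbour and folds the incoming counts into its global buckets. */
#include <stddef.h>

#define NBUCKETS 1024                    /* illustrative bucket count */

extern void am_store(int dest, const void *src, void *remote, size_t nbytes);
extern void am_poll(void);
extern int *ring_recv_buf(void);         /* buffer the upstream neighbour writes */

void merge_buckets(const int *local_bkt, int *global_bkt, int nprocs, int me)
{
    int next = (me + 1) % nprocs;
    int send_buf[NBUCKETS];

    for (int b = 0; b < NBUCKETS; b++) {
        global_bkt[b] = local_bkt[b];
        send_buf[b]   = local_bkt[b];
    }

    for (int step = 1; step < nprocs; step++) {
        /* forward the previously received counts to the downstream neighbour */
        am_store(next, send_buf, ring_recv_buf(), NBUCKETS * sizeof(int));
        am_poll();                        /* deliver and collect ring traffic */

        const int *in = ring_recv_buf();  /* counts from the upstream neighbour */
        for (int b = 0; b < NBUCKETS; b++) {
            global_bkt[b] += in[b];       /* accumulate into the global buckets */
            send_buf[b]    = in[b];       /* pass these along in the next step  */
        }
    }
}
```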
Figure 2.9 shows the performance results for IS with 4 and 8 workstations, each running one application process. Compared to Matmul, the performance results for IS are reversed. For the chosen problem size in IS, the cost of TLB fills does not get amortized enough to be offset by the savings in host CPU memory copies.
[Figure 2.9 — Performance of IS: execution time (in 10^5 cycles) broken down into Compute, am_poll, am_get, am_store and am_request for (C2,NI3) and (C3,NI3) with 4 and 8 workstations.]
Consequently, (C2,NI3) does better than (C3,NI3) for this application. We would like to point out that the "compute" time lumps together all the time spent in the application code (outside of any AM calls). This time can itself depend indirectly on communication costs because of synchronization and other dynamics in the program execution. Since the communication costs of (C2,NI3) are lower, an indirect consequence is a lower compute time as well.
2.5 Impact of Network Processor Speed
A simulation tool such as pSNOW not only allows us to study and understand the performance and scalability of specific application-architecture pairs (as shown in the previous section), but also provides the flexibility of varying architectural parameters in ways that would not be possible on an actual hardware platform. To illustrate the benefit of such a tool, we study the impact of the network processor (NP) speed on communication performance and use the results to project the requirements for the network processor. Given infinite resources (in terms of technology and dollars), we would like to have the fastest host CPU and the fastest NP for our system. However, with physical/economic constraints, it is useful to find out the rate at which the NP should be sped up as the main CPU on the workstation keeps getting faster. This information can help pick a network interface card for a particular workstation and, more importantly, guide future network interface designs.
Note that these studies are relevant only for the (C2,NI3) and (C3,NI3) combinations. For both combinations, the code executing on the NP is responsible for moving data from/to the user address space to/from the network. In addition, for (C3,NI3), the NP needs to look up the software TLB to find the virtual-to-physical address mapping for the user buffer, and may need to interrupt the host operating system if it does not find the information in the software TLB. An improvement in the NP clock would speed up this code, helping to lower the communication latency.
The (C3,NI3) combination exercises the NP more than the (C2,NI3) combination. Our first set of experiments brings out this effect and studies the impact of the NP speed on the difference between these two combinations. Figure 2.10 shows the performance results from these experiments for the Ping Pong benchmark. We show the performance of the two combinations for two instances of the benchmark (one with 16 iterations and the other with 256 iterations), over a range of NP clock speeds, with the host CPU set to 200 MHz. As expected, (C3,NI3) has a higher roundtrip latency than (C2,NI3) because it requires the NP to look up the software TLB before initiating the data transfer. Further, for the very first message, the lookup in the TLB fails, resulting in an interrupt of the main CPU. This cost is significant particularly when the number of iterations is low (see the graph for 16 iterations). However, once the TLB is filled, subsequent messages do not incur this penalty, and the cost of the interrupt can be amortized over a large number of iterations (see the graph for 256 iterations). For the larger number of iterations, once the NP is at least one-third as fast as the host CPU, the performance of (C3,NI3) and (C2,NI3) is comparable. But for the lower number of iterations, however fast the NP executes, the cost of the interrupt (which is executed on the host CPU and does not decrease with a faster NP) leaves a persistent gap between (C3,NI3) and (C2,NI3).
[Figure 2.10 — Comparison of (C2,NI3) and (C3,NI3): average Ping Pong round-trip time (µs) versus NP speed (MHz) for 16 and 256 iterations, with the main CPU fixed at 200 MHz.]
These results suggest that the software TLB lookup is by itself not a significant problem with a fast NP. It is more important to lower the cost of the TLB misses, which result in host interrupts. One possible way of lowering this cost is to expose the host operating system's page table data structures to the NP, so that the NP can directly look up the page table instead of interrupting the host (see the sketch below). On a multiple-CPU workstation where one of the CPUs is used as the NP, this would be fairly straightforward to implement, since the same OS runs on both CPUs and the page tables reside in shared memory.
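To make the intent concrete, here is a minimal sketch of what the NP-side translation path might look like if the host page tables were visible to the NP. All names, sizes, and the 32-bit MMU format are hypothetical; the real path would depend on the host architecture and operating system.

```c
/* Minimal sketch of the address-translation path on the NP, assuming the
 * host page tables are made readable by the NP.  All types and names here
 * are hypothetical. */
#include <stdint.h>

#define SOFT_TLB_ENTRIES 64
#define PAGE_SHIFT       12
#define PAGE_MASK        (~((uint32_t)(1 << PAGE_SHIFT) - 1))

typedef struct { uint32_t vpage; uint32_t pframe; int valid; } soft_tlb_entry;

static soft_tlb_entry soft_tlb[SOFT_TLB_ENTRIES];

extern uint32_t walk_host_page_table(uint32_t vaddr);       /* shared page tables */
extern void     interrupt_host_for_mapping(uint32_t vaddr); /* slow fallback      */

uint32_t np_translate(uint32_t vaddr)
{
    uint32_t vpage = vaddr & PAGE_MASK;
    soft_tlb_entry *e = &soft_tlb[(vpage >> PAGE_SHIFT) % SOFT_TLB_ENTRIES];

    if (e->valid && e->vpage == vpage)              /* soft-TLB hit: cheap path */
        return e->pframe | (vaddr & ~PAGE_MASK);

    /* miss: walk the host page tables directly instead of interrupting */
    uint32_t pframe = walk_host_page_table(vaddr);
    if (pframe == 0) {
        interrupt_host_for_mapping(vaddr);          /* page not resident */
        pframe = walk_host_page_table(vaddr);
    }
    e->vpage = vpage; e->pframe = pframe; e->valid = 1;
    return pframe | (vaddr & ~PAGE_MASK);
}
```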
In the next set of experiments, we vary the host CPU clock as well as the NP clock, and study the performance of the (C2,NI3) and (C3,NI3) combinations for the
Ping Bulk benchmark. The number of iterations used for the benchmark is 16 (where the interrupt cost for (C3,NI3) is not amortized) and the message size is fixed at 32 bytes. Figure 2.11 shows the performance results for these experiments, where we plot the average round-trip time for the benchmark over a range of NP clock speeds for each chosen host CPU clock. From these results, we can make the following observations:
• The average roundtrip latency uniformly improves as the NP becomes faster. However, for a chosen host CPU speed, there is a point of diminishing returns (the knee) beyond which it does not help to make the NP any faster.
• The knee depends on the host CPU speed. For the (C2,NI3) combination, the knee occurs at an NP speed around one-third that of the main CPU, and for the (C3,NI3) combination, it occurs at an NP speed around half that of the main CPU.
Though the results are not fully conclusive, these observations suggest that we should have NPs that are at least half as fast as the host CPU. The communication-to-computation ratio in Ping Pong/Ping Bulk is much higher than what we find in typical applications with good scalability behaviors. We qualify applications as scalable (in an intuitive sense) if there is a good mix of communication and local computation and the generated communication traffic is fairly uniform (there are no hotspots).
[Figure 2.11 — Impact of NP Speed: average Ping Bulk round-trip time (µs) versus NP speed (MHz) for (C2,NI3) and (C3,NI3), with 16 iterations, 32-byte messages, and host CPU speeds of 75, 100, 125, 175 and 200 MHz.]
We have conducted a similar set of experiments for a scalable execution of FFT, and the results (not shown here) do tend to confirm that an NP should be at least half as fast as the host CPU. However, a more comprehensive study with a wider range of applications, and a detailed sensitivity analysis of the relationship with other parameters (such as bus bandwidth, network bandwidth, etc.), is needed before we can make a strong pronouncement. Still, the NP speed results from this study can represent a good operating point for many scalable applications.
The result presented here has other important consequences. Given a host CPU and an NP which exceeds the requirements specified by our result, it would make sense for the host CPU to hand over extra work to the NP. Thus, in an Active Messages environment where a handler is invoked upon message arrival, it would probably be more efficient to execute the handler on the NP (instead of the host CPU) in such scenarios. However, if the NP were slower, then it would be better to go for a (C2,NI3) communication substrate instead of (C3,NI3). The requirement on the NP speed can also be used to calculate the number of host CPUs that a single network interface card can efficiently support on an SMP (symmetric multiprocessor) workstation. Also, there has been recent interest [40] in dedicating one of the CPUs on an SMP workstation for communication purposes (performing the tasks of an NP). Our results confirm the observations in [40] that the CPU dedicated to communication would be under-utilized if there were only one other CPU on the SMP.
2.6 Summary
Performance evaluation is an integral part of any system design process. Evaluation tools should clearly identify, isolate and quantify the bottlenecks in the execution, both to help restructure the application for better performance and to suggest enhancements to the existing design. While there has been significant recent progress in novel network interface designs and system software solutions to lower the communication overheads on emerging clusters, performance evaluation tools for these environments have not kept pace with this progress. Previous performance evaluation studies have focused on specific hardware/software designs, or have compared designs across different (in terms of vendor, capabilities and technology) hardware platforms, making it difficult to generalize the results. Recognizing the need for a tool that can help us uniformly compare existing solutions and evaluate newer designs, we have presented an execution-driven simulator called pSNOW in this part of the work. This tool gives us a unified framework to model different designs, and to evaluate these designs using real applications.
While pSNOW can be used to study a wide variety of cluster architecture issues, we have used it to evaluate communication support for clusters. As a result, pSNOW provides an object-oriented view for modeling a range of network interface designs and different software solutions. Using pSNOW, we have modeled three network interface designs and three different communication software substrates. On top of these substrates, we have implemented the Active Messages interface for running parallel applications. Using two microbenchmarks and two applications, we have examined the scalability of the implementations, shown detailed profiles to isolate hardware and software bottlenecks, and illustrated the relative merits/demerits of the different messaging layers.
A simulation platform like pSNOW also helps us vary hardware parameters and study their impact on performance. We have illustrated this by studying the impact of the network processor (NP) speed on communication performance. Our results show that an NP that is at least half as fast as the host CPU can meet the requirements for implementing efficient user-level messaging (without going through the operating system). These results hold as long as the applications exhibit good scalability behaviors (see Section 2.5), the NP is not asked to perform any additional work (such as running message handlers) beyond user-level data transfer, and we are confined to single-CPU workstations (not SMPs).
Chapter 3
MU-Net
As was shown in the previous chapter, the software costs incurred by involving the kernel in the critical send/receive path are a significant detriment to performance. At the same time, there is a great need for multiple applications to execute at each node of the cluster and share the cluster concurrently. Addressing these concerns, this part of the thesis presents the design, implementation and performance of MU-Net, a user-level messaging platform for Myrinet that allows protected multi-user access to the network.
Our design of MU-Net draws on ideas from other user-level platforms [93, 72], but there are some key differences. Unlike the original Fast Messages implementation [72], MU-Net allows multiple applications on a node to concurrently use the network in a protected manner. Further, MU-Net provides additional mechanisms for transferring longer messages in cases where the host CPU has other useful work to do, while Fast Messages always employs the host CPU to packetize/reassemble longer messages. Our performance results on a SUN Ultra Enterprise 1 platform running Solaris 2.5 show that, despite being able to support multiple application processes, MU-Net performs comparably to Fast Messages 2.0 (the version which supports only one application process per node).
3.1 Myrinet
Before we discuss the MU-Net implementation, it is important to understand the
Myrinet hardware on which it is implemented.
Myrinet [16] is a high-speed switch-based network which allows variable-length packets. A typical Myrinet network consists of point-to-point links that connect hosts and switches. A network link can deliver 1.28 Gbits/sec full-duplex bandwidth. Packets on the Myrinet network are source routed: the sender prepends a series of routing bytes onto the head of each packet, and when a packet arrives at a switch, the leading byte is stripped and used to determine the outgoing port (a sketch of this framing appears below). Myrinet does not impose any restrictions on packet sizes and leaves the choice to the software. Myrinet does not guarantee reliable delivery; however, cyclic-redundancy-check (CRC) hardware is provided in both the network interfaces and the switches. This allows for error detection, but reliable delivery is left for higher-level software layers to implement. However, the network itself guarantees in-order delivery of messages.
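As a rough illustration of the source-routing scheme just described, the sketch below shows how a sender might lay out a packet. The function and field names, and the absence of any header beyond the route bytes, are simplifying assumptions, not the actual Myrinet packet format.

```c
/* Hypothetical sketch of source-routed packet construction: one routing byte
 * per switch hop is prepended to the payload; each switch strips the leading
 * byte and uses it to pick the outgoing port.  CRC generation/checking is
 * handled by the interface and switch hardware. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

size_t build_packet(uint8_t *pkt, const uint8_t *route, size_t hops,
                    const void *payload, size_t len)
{
    memcpy(pkt, route, hops);            /* routing bytes, consumed hop by hop */
    memcpy(pkt + hops, payload, len);    /* variable-length payload            */
    return hops + len;                   /* total bytes handed to the NIC      */
}
```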
The Myrinet network interface card used to implement MU-Net sits on the workstation's I/O bus (SBUS on the SPARCstations) and consists of a 256 KB SRAM and a 37.5 MHz 32-bit custom-built processor called the LANai 4. In addition, there are three DMA engines on the card. Two of these engines are responsible for sending (receiving) packets to (from) the external network from (to) the SRAM. The third is responsible for moving data between the SRAM and host (workstation) memory. All three DMA engines can operate independently. Note that the LANai processor cannot access host memory directly, and therefore must rely on DMA operations for reading and writing host memory. The LANai processor executes a MU-Net Control Program (MCP) that manages and coordinates activities on the interface card and interfaces with programs running on the host CPU. Though it shares its initials with the Myrinet Control Program supplied by Myricom, it differs significantly in functionality.
3.2 MU-Net
In developing MU-Net, our design draws ideas from the implementation of U-Net
[93] on ATM, and Fast Messages [72] on Myrinet; hence the name MU-Net, standing for Myrinet U-Net. The design goals of MU-Net are summarized below:
• provide a low-latency, high-bandwidth user-level messaging layer (to avoid the costs of crossing protection boundaries);
• support multiple processes running on a workstation concurrently using the network, without compromising protection and without significantly degrading communication performance;
• implement optimizations for shorter messages, while lowering the cost of packetization/reassembly for larger messages;
• allow alternate send mechanisms that can be used to overlap useful CPU computation with communication wherever needed.
The inclusion of these features in the MU-Net design requires detailed development of user-level libraries (the API), kernel-level drivers, and software for the LANai. We have implemented these components for Solaris 2.5 running on a range of SPARCstations. The following subsections describe the design and responsibilities of each of these components.
3.2.1 User-level Library
As in [93], MU-Net supports the notion of an endpoint that is intended to give user processes a handle into the network. The endpoint is visible to the user as a structure which contains state information about pending messages (to be sent or received) on that endpoint. A user process can only access those endpoints that it creates. In essence, an endpoint virtualizes the network for each user process, and uses the traditional virtual memory system to enforce protection between these processes.
The MU-Net API provides user applications with an interface for creating an endpoint in the user's address space, destroying an endpoint, and sending messages to and receiving messages from a destination endpoint. Creation and destruction of endpoints involve the kernel driver and are expensive; they are done only in the initialization and termination phases, and hence do not impact the latency of the critical path.
There are two kinds of send operations, both of which are nonblocking. The first is meant for processes which want to minimize latency at the cost of host CPU cycles. In this kind of send, the host CPU is used to transfer data to the network interface card instead of the DMA engine, similar to the send mechanism in Fast Messages [72]. However, for long messages, the time spent in the send call (proportional to the size of the message) may become significant, especially if the CPU has other work to do. Hence, our API provides a second mechanism, called send_DMA(), which uses DMA for large data transfers to the card. If the length of the data to be transferred is smaller than 128 bytes, send_DMA() defaults to the normal send. We expect applications to use this send when they have work that can be overlapped with the operation and can tolerate a slightly higher latency. If the application cannot proceed with useful computation and the message to be sent is larger than 128 bytes, it can still use the normal send() mechanism, which uses the host CPU to packetize the data and transfer it to the network interface.
There is only one receive call, which is used to receive both kinds of messages. A sketch of how an application might drive this interface is shown below.
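The following is a hypothetical usage sketch of the interface just described. The function names, argument lists and the endpoint type are assumptions made for illustration; only the two send variants, the single receive call and the 128-byte send_DMA threshold come from the text.

```c
/* Hypothetical sketch of an application driving the MU-Net API. */
#include <stddef.h>

typedef struct munet_endpoint munet_endpoint;

extern munet_endpoint *munet_create_endpoint(size_t send_buf_sz, size_t recv_buf_sz);
extern void munet_destroy_endpoint(munet_endpoint *ep);
extern int  munet_send(munet_endpoint *ep, int dest, const void *buf, size_t len);
extern int  munet_send_dma(munet_endpoint *ep, int dest, const void *buf, size_t len);
extern int  munet_recv(munet_endpoint *ep, void *buf, size_t maxlen);

/* Pick a send mechanism: use the DMA variant when the message is long and the
 * caller has other work to overlap; otherwise spend host-CPU cycles for the
 * lowest latency.  (send_dma itself falls back to the normal send for
 * messages shorter than 128 bytes.) */
int send_message(munet_endpoint *ep, int dest, const void *buf, size_t len,
                 int can_overlap)
{
    if (len >= 128 && can_overlap)
        return munet_send_dma(ep, dest, buf, len);  /* nonblocking, card-initiated DMA */
    return munet_send(ep, dest, buf, len);          /* nonblocking, CPU-driven transfer */
}
```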
3.2.2 MU-Net Kernel Drivers
MU-Net uses a two-driver design in the operating system kernel, similar to [16]. The MU-Net myri driver is used to attach to the Myrinet device and to map the LANai registers and SRAM into kernel memory. The second MU-Net driver, called mlanai, is a pseudo driver and therefore does not actually drive any physical device. Rather, the mlanai driver provides ioctls for creating and destroying endpoints, and it maps a communication segment into user space (a sketch of the user-visible side of this path appears below). The details of the communication segment are described later.
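As an illustration of how a user-level library might reach the mlanai pseudo driver, consider the sketch below. The device path, ioctl command number, request structure and mmap offset are all hypothetical; the actual MU-Net driver interface may differ.

```c
/* Hypothetical sketch of the user-library side of endpoint creation: an ioctl
 * on the mlanai pseudo driver asks the kernel to allocate and pin the
 * communication segment, which is then mapped into the process. */
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MLANAI_CREATE_EP  1               /* illustrative ioctl command number */

struct ep_request {
    size_t send_buf_size;                 /* desired Host Send Buffer size    */
    size_t recv_buf_size;                 /* desired Host Receive Buffer size */
    size_t segment_size;                  /* filled in by the driver          */
};

void *create_endpoint_segment(size_t send_sz, size_t recv_sz)
{
    int fd = open("/dev/mlanai", O_RDWR); /* hypothetical device path */
    if (fd < 0)
        return NULL;

    struct ep_request req = { send_sz, recv_sz, 0 };
    if (ioctl(fd, MLANAI_CREATE_EP, &req) < 0) {
        close(fd);
        return NULL;
    }

    /* map the pinned communication segment (send + receive buffers); the fd is
     * kept open so the driver can tear the endpoint down on process exit */
    return mmap(NULL, req.segment_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}
```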
3.2.3 MU-Net MCP
The MU-Net Control Program (MCP) is the program that runs on the LANai processor; it is downloaded into the card SRAM by the host during initialization.
The MCP does not directly interact with the host. It detects send requests by polling the SRAM, multiplexes these requests, and sends them out into the network. Incoming messages are first buffered on the SRAM, demultiplexed, and then delivered to the destination endpoint. An examination of an incoming message is required to determine the destination endpoint, and a DMA operation is used to perform the delivery.
The maximum number of endpoints supported by the card is statically determined. However, the number of active endpoints is dynamic and is communicated to the MCP by the host driver. This ensures that the MCP does not spend time polling for send requests from inactive endpoints. The overall structure of this loop is sketched below.
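A schematic outline of such a loop is shown below; every data structure and helper name is hypothetical, and the actual MCP interleaves these activities with the three DMA engines in a more fine-grained way.

```c
/* Schematic outline of the MCP main loop on the LANai: poll the Send
 * Descriptor areas of active endpoints for new requests, and demultiplex
 * arriving packets to the right endpoint's Host Receive Buffer via DMA. */

extern int  num_active_endpoints;                /* maintained by the host driver */

extern int  send_descriptor_pending(int ep);     /* new request written to SRAM?  */
extern void transmit_from_endpoint(int ep);      /* push the packet to the wire   */
extern int  packet_arrived(void);                /* receive buffered on SRAM?     */
extern int  destination_endpoint(void);          /* examine the header on SRAM    */
extern void dma_to_host_receive_buffer(int ep);  /* deliver into pinned host mem  */

void mcp_main_loop(void)
{
    for (;;) {
        /* multiplex outgoing traffic: only active endpoints are polled */
        for (int ep = 0; ep < num_active_endpoints; ep++)
            if (send_descriptor_pending(ep))
                transmit_from_endpoint(ep);

        /* demultiplex incoming traffic buffered on the card SRAM */
        if (packet_arrived())
            dma_to_host_receive_buffer(destination_endpoint());
    }
}
```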
3.2.4 Details of MU-Net Operations
The MU-Net operations are described below in the logical order in which they would most likely occur in a typical user application. The implementation of these operations is also discussed, along with alternative design choices.
3.2.4.1 Creating an Endpoint
Messages are exchanged between endpoints, not processes. One process could create multiple endpoints, though this is not necessary since endpoint communication is connectionless and the same endpoint can be used for sending to and receiving from multiple remote nodes. An endpoint consists of three regions, two of which reside in host memory and the third on the card SRAM. All three areas are mapped into the user process's address space. The first two regions are the Host Send Buffer and the Host Receive Buffer. The Host Send Buffer is used to store long messages in host memory for a subsequent LANai-initiated DMA to SRAM (the send_DMA() operation). The Host Receive Buffer is the target of LANai-initiated DMAs for incoming short and long messages. The third region is the Send Descriptor area, located on the SRAM; it is written by the host CPU during a send call.
[Figure 3.1 — Mapping of Endpoint into User Address Space: each endpoint's Host Send Buffer and Host Receive Buffer (its communication segment) reside in host physical memory, its Send Descriptor area and associated state variables reside on the card's SRAM, and all three regions are mapped into the owning user process's virtual address space.]
Figure 3.1 shows two user processes' endpoints and their mappings. The two host buffers together constitute the communication segment of the endpoint.
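For concreteness, the three regions might be viewed from the user process roughly as follows; the struct layout and field names are hypothetical, not the actual MU-Net endpoint definition.

```c
/* Hypothetical view of an endpoint after all three regions have been mapped
 * into the user process's address space. */
#include <stddef.h>

struct munet_endpoint {
    /* communication segment: pinned host memory, DMA-addressable by the card */
    void  *host_send_buffer;          /* staging area for send_DMA() messages    */
    size_t host_send_size;
    void  *host_recv_buffer;          /* target of LANai-initiated receive DMAs  */
    size_t host_recv_size;

    /* on-card region: the mapped portion of the SRAM */
    volatile void *send_descriptors;  /* written by the host CPU during send()   */
};
```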
MU-Net allows a user program to dynamically create endpoints. To create an endpoint, the user program calls the user-level MU-Net API and provides the desired sizes of the Host Send Buffer and Host Receive Buffer. The size of the Send Descriptor area cannot be specified by individual user processes. The communication segment needs to be allocated and pinned in physical memory and mapped into the DMA space of the
DMA engine located on the card. Since this memory mapping can only be done in kernel mode, the MU-Net API makes an ioctl call to the MU-Net mlanai driver to perform these operations.
The communication segment memory is provided by the kernel and is not shared amongst endpoints. The card SRAM, which is a shared resource, is also protected by the selective mapping of only one Send Descriptor area into user space. Moreover, since the
MU-Net API library can only use virtual addresses, it can only access those portions of the SRAM mapped in by the driver. Thus, MU-Net uses the operating system's virtual memory system to implement protected user-level communication, similar to [93]. The driver, being part of the kernel, is assumed to be secure, while the user can potentially use a different library with the same API.
The major design issue relevant to endpoint creation is the partitioning of the communication segment. In MU-Net, the partitioning of the communication segment into Send and Receive areas is done before communication begins and cannot be changed thereafter. If the number/size of one type of message (send or receive) is much smaller than the other, the corresponding area is underutilized and may even affect the latency/bandwidth of the other type (when the latter operates at peak buffer capacity).
An alternative would be to divide the entire communication segment into fixed-size buffers, which could be dynamically assigned to a send or a receive (by maintaining pointers to these buffers on the host and on the card). This approach, used in [93], has the disadvantage of limiting the size of a single data transfer between the host and the card (in either direction). The fixed size of the buffers would have to be large to avoid multiple data transfers in the average case, but larger buffers would also lead to greater internal fragmentation. Hence the rigid partitioning approach was chosen. If the user process has a priori knowledge of traffic patterns, it can choose to make one area larger than the other. Since the communication segment is pinned in physical memory, a process needs to exercise restraint in the sizes it requests, as they can impact overall host memory performance. This restraint can also be enforced by the driver at endpoint creation.
3.2.4.2 Sending a Message
[Figure — Sending a message: data moves from the Host Send Buffer in the communication segment to the Send Queue Descriptors and the Common Send Buffer Area on the card's SRAM, and from there out to the network.]