Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science and Applications

Committee Members: Dimitrios S. Nikolopoulos (Chair), Kirk W. Cameron, Wu-chun Feng, David K. Lowenthal, Calvin J. Ribbens

May 30, 2008 Blacksburg, Virginia

Keywords: Multicore processors, Cell BE, scheduling, high-performance computing, performance prediction, runtime adaptation

© Copyright 2008, Filip Blagojevic

Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

(ABSTRACT)

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. We evaluate our runtime policies, as well as the performance model we developed, on an IBM Cell BladeCenter and on a cluster composed of Playstation3 nodes, using two realistic bioinformatics applications.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Dimitrios S. Nikolopoulos for his guidance during my graduate studies. I would also like to thank Dr. Alexandros Stamatakis, Dr. Xizhou Feng, and Dr. Kirk Cameron for providing us with the original MPI implementations of PBPI and RAxML and for discussions on scheduling and modeling the Cell/BE. I would like to thank the members of the PEARL group, Dr. Christos Antonopoulos, Dr. Matthew Curtis-Maury, Scott Schneider, Jae-Sung Yeom, and Benjamin Rose, for their involvement in the projects presented in this dissertation. I would also like to thank my Ph.D. committee for their discussion and suggestions for this work: Dr. Kirk W. Cameron, Dr. David Lowenthal, Dr. Wu-chun Feng, and Dr. Calvin J. Ribbens. Also, I thank Georgia Tech, its Sony-Toshiba-IBM Center of Competence, and NSF, for the Cell/BE resources that have contributed to this research. Finally, I would like to thank the institutions that have funded this research: the National Science Foundation and the U.S. Department of Energy.


Contents

1 Problem Statement ...... 1
1.1 Mapping Parallelism to Asymmetric Parallel Architectures ...... 2

2 Statement of Objectives ...... 5
2.1 Dynamic Multigrain Parallelism ...... 5
2.2 Rightsizing Multigrain Parallelism ...... 8
2.3 MMGP Model ...... 9

3 Experimental Testbed ...... 11
3.1 RAxML ...... 12
3.2 PBPI ...... 13
3.3 Hardware Platform ...... 14

4 Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories ...... 17
4.1 Porting and Optimizing RAxML on Cell ...... 18
4.2 Function Off-loading ...... 18
4.2.1 Optimizing Off-Loaded Functions ...... 19
4.2.2 Vectorizing Conditional Statements ...... 20
4.2.3 Double Buffering and Memory Management ...... 23
4.2.4 Vectorization ...... 24
4.2.5 PPE-SPE Communication ...... 27
4.2.6 Increasing the Coverage of Offloading ...... 28
4.3 Parallel Execution ...... 29
4.4 Chapter Summary ...... 30

5 Scheduling Multigrain Parallelism on Asymmetric Systems ...... 33
5.1 Introduction ...... 33
5.2 Scheduling Multi-Grain Parallelism on Cell ...... 33
5.2.1 Event-Driven Task Scheduling ...... 34
5.2.2 Scheduling Loop-Level Parallelism ...... 36
5.2.3 Implementing Loop-Level Parallelism ...... 42
5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism ...... 43
5.3.1 Application-Specific Hybrid Parallelization on Cell ...... 44
5.3.2 MGPS ...... 47
5.4 S-MGPS ...... 49
5.4.1 Motivating Example ...... 50
5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism ...... 51
5.5 Chapter Summary ...... 57

6 Model of Multi-Grain Parallelism ...... 61
6.1 Introduction ...... 61
6.2 Modeling Abstractions ...... 62
6.2.1 Hardware Abstraction ...... 62
6.2.2 Application Abstraction ...... 63
6.3 Model of Multi-grain Parallelism ...... 65
6.3.1 Modeling sequential execution ...... 66
6.3.2 Modeling parallel execution on APUs ...... 67
6.3.3 Modeling parallel execution on HPUs ...... 69
6.3.4 Using MMGP ...... 71
6.3.5 MMGP Extensions ...... 72
6.4 Experimental Validation and Results ...... 72
6.4.1 MMGP Parameter approximation ...... 73
6.4.2 Case Study I: Using MMGP to parallelize PBPI ...... 74
6.4.3 Case Study II: Using MMGP to Parallelize RAxML ...... 77
6.4.4 MMGP Usability Study ...... 81
6.5 Chapter Summary ...... 83

7 Scheduling Asymmetric Parallelism on a PS3 Cluster ...... 85
7.1 Introduction ...... 85
7.2 Experimental Platform ...... 87
7.3 PS3 Cluster Scalability Study ...... 88
7.3.1 MPI Communication Performance ...... 88
7.3.2 Application Benchmarks ...... 88
7.4 Modeling Hybrid Parallelism ...... 93
7.4.1 Modeling PPE Execution Time ...... 94
7.4.2 Modeling the off-loaded Computation ...... 96
7.4.3 DMA Modeling ...... 97
7.4.4 Cluster Execution Modeling ...... 98
7.4.5 Verification ...... 99
7.5 Co-Scheduling on Asymmetric Clusters ...... 99
7.6 PS3 versus IBM QS20 Blades ...... 102
7.7 Chapter Summary ...... 104

8 Kernel-Level Scheduling ...... 107
8.1 Introduction ...... 107
8.2 SLED Scheduler Overview ...... 108
8.3 ready to run List ...... 110
8.3.1 ready to run List Organization ...... 110
8.3.2 Splitting ready to run List ...... 111
8.4 SLED Scheduler - Kernel Level ...... 113
8.5 SLED Scheduler - User Level ...... 116
8.6 Experimental Setup ...... 117
8.6.1 Benchmarks ...... 118
8.6.2 Microbenchmarks ...... 118
8.6.3 PBPI ...... 122
8.6.4 RAxML ...... 123
8.7 Chapter Summary ...... 125

9 Future Work ...... 127
9.1 Integrating ready-to-run list in the Kernel ...... 128
9.2 Load Balancing and Task Priorities ...... 130
9.3 Increasing Processor Utilization ...... 131
9.4 Novel Applications and Programming Models ...... 132
9.5 Conventional Architectures ...... 132
9.6 MMGP extensions ...... 133

10 Overview of Related Research ...... 135
10.1 Cell – Related Research ...... 135
10.2 Process Scheduling – Related Research ...... 138
10.3 Modeling – Related Research ...... 141
10.3.1 PRAM Model ...... 141
10.3.2 BSP model ...... 142
10.3.3 LogP model ...... 143
10.3.4 Models Describing Nested Parallelism ...... 144

Bibliography ...... 147

List of Figures

2.1 A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Ac- celerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism...... 6

3.1 Organization of Cell...... 14

4.1 The likelihood vector structure is used in almost all memory traffic be- tween main memory and the local storage of the SPEs. The structure is 128-bit aligned, as required by the Cell architecture...... 23 4.2 The body of the first loop in newview(): a) Non–vectorized code, b) Vector- ized code...... 25

4.3 The second loop in newview(). Non–vectorized code shown on the left, vector- ized code shown on the right. spu madd() multiplies the first two arguments and adds the result to the third argument. spu splats() creates a vector by replicating a scalar element...... 26 4.4 Performance of (a) RAxML and (b) PBPI with different number of MPI pro- cesses...... 29

5.1 Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the be- havior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs...... 36 5.2 Parallelizing a loop across SPEs using a work-sharing model with an SPE des- ignated as the master...... 39

5.3 The data structure Pass is used for communication among SPEs. The vi ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism...... 42

5.4 Parallelization of the loop from function evaluate() in RAxML. The left side depicts the code executed by the master SPE, while the right side depicts the code executed by a worker SPE. Num SPE represents the number of SPE worker threads...... 44

5.5 Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128...... 45

5.6 MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128...... 49

5.7 Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC...... 51

5.8 Execution times of RAxML, with various static multi-grain scheduling strate- gies. The input dataset is 25 SC...... 51

5.9 The sampling phase of S-MGPS. Samples are taken from four execution inter- vals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops...... 53

5.10 PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1- 4, deg(LLP)=1–16 ...... 56

6.1 A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) relatively supply coarse-grain paral- lel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain paral- lelism. Both HPUs and APUs have local memories and communicate through shared-memory or message-passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion...... 62

6.2 Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this paper, we address the problem of mapping tasks and subtasks to accelerator-based systems...... 64
6.3 The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via ...... 66
6.4 Parallel APU execution. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization...... 68
6.5 Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first HPU thread offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4...... 69
6.6 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism...... 75
6.7 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation...... 76
6.8 MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors...... 78
6.9 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2...... 79
6.10 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2...... 80
6.11 MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism...... 82

6.12 Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000)...... 83

7.1 MPI Allreduce() performance on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node...... 89
7.2 MPI Send/Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node...... 90
7.3 Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process...... 92
7.4 Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling...... 95
7.5 SPE execution ...... 96
7.6 Double buffering template for tiled parallel loops...... 97
7.7 Performance of yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process...... 101
7.8 Performance of different scheduling strategies in PBPI and RAxML...... 103
7.9 Comparison between the PS3 cluster and an IBM QS20 cluster...... 104

8.1 Upon completing the assigned tasks, the SPEs send signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in return can schedule the appropriate process on the PPE...... 108

8.2 Vertical overview of the SLED scheduler. The user level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel...... 109

8.3 Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to perform migration of the process P1 from CPU1 to CPU2...... 112
8.4 System call for migrating the processes across the execution contexts. Function sched_migrate_task() performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU...... 113
8.5 The ready-to-run list is split in two parts. Each of the two sublists contains processes that share the execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts...... 114
8.6 Execution flow of the SLEDS_yield() function: (a) the appropriate process is found in the running list (tree), (b) the process is pulled out from the list, and its priority is increased, (c) the process is returned to the list, and since its priority is increased it will be stored at the leftmost position...... 115
8.7 Outline of the SLEDS scheduler: upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks if the off-loaded task has finished (Line 14), and if not, calls the yield() function. yield() scans the ready-to-run list, and yields to the next process by executing the SLEDS_yield() system call...... 117
8.8 Execution times of RAxML when the ready-to-run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready-to-run list; the y-axis represents the execution time. Note that the lowest value for the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides...... 118
8.9 Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased...... 119
8.10 Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased – task size is limited to 2.1µs...... 120
8.11 EDTLP outperforms SLED for small task sizes due to higher complexity of the SLED scheme...... 121

8.12 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs...... 121
8.13 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs – task size is limited to 2.1µs...... 122
8.14 Comparison of EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 123
8.15 Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 124
8.16 Comparison of EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 124
8.17 Comparison of EDTLP and the combination of SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 125

9.1 Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in return can schedule the appropriate process on the PPE...... 129

List of Tables

4.1 Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE. . . . 20 4.2 Execution time of RAxML after the floating-point conditional statement is trans- formed to an integer conditional statement and vectorized. The input file is 42 SC. 22 4.3 Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC...... 24 4.4 Execution time of RAxML following vectorization. The input file is 42 SC. . . 27 4.5 Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC...... 28 4.6 Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC...... 29

5.1 Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The work- load for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms...... 37 5.2 Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides...... 40 5.3 Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides...... 41

5.4 Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() corresponds to the degree of a given dimension of parallelism (LLP or TLP)...... 54

5.5 RAxML – Comparison between S-MGPS and static scheduling schemes, illus- trating the convergence overhead of S-MGPS...... 55

5.6 PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1– 4; (d) deg(TLP)=8, deg(LLP)=1–2...... 58

Chapter 1

Problem Statement

In the quest for delivering higher performance to scientific applications, hardware designers began to move away from conventional single-core designs and embraced architectures with multiple processing cores. Although all commodity microprocessor vendors are marketing multicore processors, these processors are largely based on replication of superscalar cores. Unfortunately, superscalar designs exhibit well-known performance and power limitations. These limitations, in conjunction with a sustained requirement for higher performance, stimulated interest in unconventional processor designs that combine parallelism with acceleration. These designs leverage multiple cores, some of which are customized accelerators for data-intensive computation. Examples of these heterogeneous, accelerator-based parallel architectures are the Cell BE [3], GPGPUs [4], the Rapport KiloCore [2], EXOCHI [96], etc.

As a case study and a representative of accelerator-based asymmetric architectures, in this dissertation we investigate the Cell Broadband Engine (CBE). Cell has recently drawn considerable attention from industry and academia. Since it was originally designed for the game box market, Cell has low cost and a modest power budget. Nevertheless, the processor is able to achieve unprecedented peak performance for some real-world applications. IBM recently announced the use of Cell chips in a new Petaflop system with 16,000 Cells, named RoadRunner, due for delivery in 2008.

The potential of the Cell BE has been demonstrated convincingly in a number of studies [33,39,69,74,91]. Thanks to eight high-frequency execution cores with pipelined SIMD capabilities, and an aggressive data transfer architecture, Cell has a theoretical peak performance of over 200 Gflops for single-precision FP calculations and a peak memory bandwidth of over 25 Gigabytes/s. These performance figures position Cell ahead of the competition against the most powerful commodity microprocessors. Cell has already demonstrated impressive performance ratings in applications and computational kernels with highly vectorizable data parallelism, such as signal processing, compression, encryption, dense and sparse numerical kernels [12, 13, 15, 39, 48, 49, 66, 75, 78, 79, 99].

1.1 Mapping Parallelism to Asymmetric Parallel Architectures

Arguably, one of the most difficult problems that programmers face while migrating to a new parallel architecture is the mapping of algorithms and data to the architecture. Accelerator- based multi-core processors complicate this problem in two ways. Firstly, by introducing het- erogeneous execution cores, the user needs to be concerned with mapping each component of the application to the type of core that best matches the computational and memory bandwidth demand of the component. Secondly, by providing multiple cores with embedded SIMD or multi-threading capabilities, the user needs to be concerned with extracting multiple dimen- sions of parallelism from the application and mapping each dimension to parallel execution units, so as to maximize performance.

Cell provides a motivating and timely example for the problem of mapping algorithmic parallelism to modern multi-core architectures. The processor can exploit task and data par- allelism, both across and within its cores. On accelerator-based multi-core architectures the programmer must be aware of core heterogeneity, and carefully balance execution between the

host and accelerator cores. Furthermore, the programmer faces a seemingly vast number of options for parallelizing code on these architectures. Functional and data decompositions of the program can be implemented on both the host and the accelerator cores. Functional decompositions can be achieved by dividing functions between the hosts and the accelerators and by off-loading functions from the hosts to accelerators at runtime. Data decompositions are also possible, by using SIMDization on the vector units of the accelerator cores, or loop-level parallelization across accelerators, or a combination of loop-level parallelization across accelerators and SIMDization within accelerators. In this thesis we explore different approaches used to automate the mapping of applications to asymmetric parallel architectures. We explore both runtime and static approaches for combining and managing functional and data decomposition. We combine and orchestrate multiple levels of parallelism inside an application in order to achieve both harmonious utilization of all host and accelerator cores and the high memory bandwidth available on asymmetric multi-core processors. Although we chose Cell as our case study, our scheduling algorithms and decisions are general and can be applied to any asymmetric parallel architecture.

Chapter 2

Statement of Objectives

2.1 Dynamic Multigrain Parallelism

While many studies have been focused on performance evaluation and optimizations for heterogeneous multi-core architectures [23, 31, 54, 63, 65, 74, 98], the optimal mapping of parallel applications to these architectures has not been investigated. In this thesis we explore heterogeneous multi-core architectures from a different perspective, namely that of multigrain parallelization. Asymmetric parallel architectures have a specific design: they can exploit orthogonal dimensions of task and data parallelism on a single chip. The processor is controlled by one or more host processing elements, which usually schedule the computation off-loaded to accelerator processing units. The accelerators are usually SIMD processors and provide the bulk of the processor's computational power. A general design of heterogeneous, accelerator-based architectures is represented in Figure 2.1.

To simplify programming and improve efficiency on asymmetric parallel architectures, we present a set of dynamic scheduling policies and the associated mechanisms. We introduce an event-driven scheduler, EDTLP, which oversubscribes the host processing cores and exposes dynamic parallelism across accelerators. We also propose MGPS, a scheduling module which controls multi-grain parallelism on the fly to monotonically increase accelerator utilization.


Figure 2.1: A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain paral- lelism.

MGPS monitors the number of active accelerators used by off-loaded tasks over discrete inter- vals of execution and makes a prediction on the best combination of dimensions and granularity of parallelism to expose to the hardware. The purpose of these policies is to exploit the proper layers and degrees of parallelism from the application, in order to maximize efficiency of the processor’s computational cores. We explore the design and implementation of our schedul- ing policies using two real-world scientific applications, RAxML [87] and PBPI [45]. RAxML and PBPI are bioinformatics applications used for generating the phylogenetic trees, and we describe them in more detail in Chapter 3.

One of the most efficient execution models on asymmetric parallel architectures, which reduces the idle time on the host processors as well as on the accelerators, is to oversubscribe the host processing unit with multiple processes. In this approach, one or more accelerators are assigned to each process for off-loading the expensive computation. Although the off-loading approach enables high utilization of the architecture, it also increases contention and the number of context switches on the host processing unit, as well as the time necessary for a single context switch to complete. To reduce the contention caused by context switching, and the idle time that occurs on the accelerator cores as a consequence, we designed and implemented the slack-minimizer scheduler (SLED).

In our case study, the SLED scheduler improves performance on the Cell processor by up to 17%.

The study related to dynamic scheduling strategies makes the following contributions:

• We present a runtime system and scheduling policies that exploit polymorphic (task and loop-level) parallelism on asymmetric parallel processors. Our runtime system is adap- tive, in the sense that it chooses the form and degree of parallelism to expose to the hardware, in response to workload characteristics. Since the right choice of form(s) and degree(s) of parallelism depends non-trivially on workload characteristics and user input, our runtime system unloads an important burden from the programmer.

• We show that dynamic multigrain parallelization is a necessary optimization for sustain- ing maximum performance on asymmetric parallel architectures, since no static paral- lelization scheme is able to achieve high accelerator efficiency in all cases.

• We present an event-driven multithreading execution engine, which achieves higher effi- ciency on accelerators by oversubscribing the host core.

• We present a feedback-guided scheduling policy for dynamically triggering and throttling loop-level parallelism across accelerators. We show that work-sharing of divisible tasks across accelerators should be used when the event-driven multithreading engine leaves more than half of the accelerators idle. We observe benefits from loop-level paralleliza- tion of off-loaded tasks across accelerators. However, we also observe that loop-level parallelism should be exposed only in conjunction with low-degree task-level parallelism.

• We present the kernel-level extensions to our runtime system, which enable efficient process scheduling when the host core is oversubscribed with multiple processes.

2.2 Rightsizing Multigrain Parallelism

When executing a multi-level parallel application on asymmetric parallel processors, performance can be strongly affected by the execution configuration. In the case of RAxML execution on the Cell processor, depending on the runtime degree of each level of parallelism in the application, the performance variation can be as high as 40%. To address the issue of determining the optimal parallel configuration, we introduce a new runtime scheduler, S-MGPS, which performs sampling and timing of the dominant phases in the application in order to determine the most efficient mapping of different levels of parallelism to the architecture. There are several essential differences between S-MGPS and our previously introduced runtime scheduler, MGPS. MGPS is a utilization-driven scheduler, which seeks the highest possible accelerator utilization by exploiting additional layers of parallelism when some accelerator cores appear underutilized. MGPS attempts to increase utilization by creating more accelerator tasks from innermost layers of parallelism, more specifically, as many tasks as the number of idle accelerators recorded during intervals of execution. S-MGPS is a scheduler which seeks the optimal application-system configuration, in terms of layers of parallelism exposed to the hardware and degree of granularity per layer of parallelism, based on the runtime task throughput of the application and regardless of system utilization. S-MGPS takes into account the cumulative effects of contention and other system bottlenecks on software parallelism and can converge to the best multi-grain parallel execution algorithm. MGPS, on the other hand, uses only information on SPE utilization and may often converge to a suboptimal multi-grain parallel execution algorithm. A further contribution of S-MGPS is that the scheduler is immune to the initial configuration of parallelism in the application and uses a sampling method which is independent of application-specific parameters, or input. On the contrary, the performance of MGPS is sensitive to both the initial structure of parallelism in the application and input.
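To make the sampling idea concrete, the fragment below sketches the selection loop at the core of such a scheduler. It is an illustration rather than the S-MGPS implementation: run_phase_with_config() is a hypothetical hook that executes one sampling interval under a given degree of task-level (TLP) and loop-level (LLP) parallelism, and throughput is simply completed tasks divided by wall-clock time.

#include <stdio.h>
#include <time.h>

/* Hypothetical hook: execute one sampling interval of the application with
 * 'tlp' MPI processes driving off-loads and 'llp' SPEs per off-loaded loop,
 * and return the number of tasks (e.g., bootstraps) completed. */
extern int run_phase_with_config(int tlp, int llp);

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Sampling phase: try each feasible (TLP, LLP) split of the available SPEs
 * once and keep the configuration with the highest observed task throughput. */
void sample_configurations(int num_spes, int *best_tlp, int *best_llp)
{
    double best_rate = 0.0;
    for (int tlp = 1; tlp <= num_spes; tlp++) {
        for (int llp = 1; tlp * llp <= num_spes; llp++) {
            double t0 = now_sec();
            int tasks = run_phase_with_config(tlp, llp);
            double rate = tasks / (now_sec() - t0);
            if (rate > best_rate) {
                best_rate = rate;
                *best_tlp = tlp;
                *best_llp = llp;
            }
        }
    }
}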

Although the scientific codes we use in this thesis implement similar functionality, they differ in their structure and parallelization strategies and raise different challenges for user-level

schedulers. We show that S-MGPS performs within 2% of the optimal scheduling algorithm in PBPI and within 2%–10% of the optimal scheduling algorithm in RAxML. We also show that S-MGPS adapts well to variation of the input size and granularity of parallelism, whereas the performance of MGPS is sensitive to both these factors.

2.3 MMGP Model

The technique used by the S-MGPS scheduler might not be scalable to large, complex systems, large applications, or applications with behavior that varies significantly with the input. The execution time of a complex application is a function of many parameters. A given parallel application may consist of N phases, where each phase is affected differently by accelerators. Each phase can exploit d dimensions of parallelism, or any combination thereof, such as ILP, TLP, or both. Each phase or dimension of parallelism can use any of m different programming and execution models, such as message passing, shared memory, SIMD, or any combination thereof. Accelerator availability or use may consist of c possible configurations, involving different numbers of accelerators. Exhaustive analysis of the execution time for all combinations requires at least N × d × m × c trials with any given input.

Models of parallel computation have been instrumental in the adoption and use of parallel systems. Unfortunately, commonly used models [24, 35] are not directly portable to accelerator-based systems. First, the heterogeneous processing common to these systems is not reflected in most models of parallel computation. Second, current models do not capture the effects of multi-grain parallelism. Third, few models account for the effects of using multiple programming models in the same program. Parallel programming at multiple dimensions and with a synthesis of models consumes both enormous amounts of programming effort and significant amounts of execution time, if not handled with care. To overcome these deficits, we present a model for multi-dimensional parallel computation on asymmetric multi-core processors. Considering that each dimension of parallelism reflects a different degree of computation granularity,

we name the model MMGP, for Model of Multi-Grain Parallelism. MMGP is an analytical model which formalizes the process of programming accelerator-based systems and reduces the need for exhaustive measurements. This dissertation presents a generalized MMGP model for accelerator-based architectures with one layer of host processor parallelism and one layer of accelerator parallelism, followed by the specialization of this model for the Cell Broadband Engine. The input to MMGP is an explicitly parallel program, with parallelism expressed with machine-independent abstractions, using common programming libraries and constructs. Upon identification of a few key parameters of the application, derived from micro-benchmarking and profiling of a sequential run, MMGP predicts with reasonable accuracy the execution time of all feasible mappings of the application to host processors and accelerators. MMGP is fast and reasonably accurate, therefore it can be used to quickly identify optimal operating points, in terms of the exposed layers of parallelism and the degree of parallelism in each layer, on accelerator-based systems. Experiments with two complete applications from the field of computational phylogenetics, on a shared-memory multiprocessor with single and multiple nodes that contain the Cell BE, show that MMGP models the parallel execution time of complex parallel codes with multiple layers of task and data parallelism with a mean error in the range of 1%–6%, across all feasible program configurations on the target system. Due to the narrow margin of error, MMGP accurately predicts the optimal mapping of programs to cores for the cases we have studied so far.
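Schematically, once such a predictive model is available, selecting an operating point reduces to a small discrete optimization over feasible configurations. The expression below is only an illustration of how the prediction is used, not the MMGP formulation itself (which is developed in Chapter 6); p denotes the degree of host (task-level) parallelism and q the degree of accelerator (data-level) parallelism per task:

\[
(p^{*}, q^{*}) \;=\; \operatorname*{arg\,min}_{\,1 \le p \le N_{\mathrm{HPU}},\ 1 \le q,\ p\,q \le N_{\mathrm{APU}}} \; T_{\mathrm{MMGP}}(p, q)
\]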

Chapter 3

Experimental Testbed

This chapter provides details on our experimental testbed, including the two applications that we used to study user-level schedulers on the Cell BE (RAxML and PBPI) and the hardware platform on which we conducted this research.

RAxML and PBPI are computational biology applications designed to determine phylogenetic trees. Phylogenetic trees are used to represent the evolutionary history of a set of n organisms. An alignment with the DNA or AA sequences representing those n organisms (also called taxa) can be used as input for the computation of phylogenetic trees. In a phylogeny the organisms of the input data set are located at the tips (leaves) of the tree, whereas the inner nodes represent extinct common ancestors. The branches of the tree represent the time which was required for the mutation of one species into another, new one. The generation of phylogenies with computational methods has many important applications in medical and biological research (see [14] for a summary).

The fundamental algorithmic problem computational phylogeny faces is the immense number of alternative tree topologies, which grows exponentially with the number of organisms n; e.g., for n = 50 organisms there exist 2.84 × 10^76 alternative trees (the number of atoms in the universe is ≈ 10^80). In fact, it has only recently been shown that the phylogeny problem is NP-hard [34]. In addition, generating phylogenies is a very memory- and floating point-intensive

process, such that the application of high performance computing techniques as well as the assessment of new CPU architectures can contribute significantly to the reconstruction of larger and more accurate trees. The computation of the phylogenetic tree containing representatives of all living beings on earth is still one of the grand challenges in Bioinformatics.
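As a side note on the growth rate quoted above, the standard count of distinct unrooted binary tree topologies on n taxa follows from a simple argument: the n-th taxon can be attached to any branch of a tree over the first n - 1 taxa, and an unrooted binary tree with n - 1 leaves has 2(n-1) - 3 = 2n - 5 branches. This gives the double-factorial expression below (stated here only for orientation; the figure of 2.84 × 10^76 above is taken from the text):

\[
T(n) \;=\; \prod_{i=3}^{n} (2i - 5) \;=\; (2n-5)!!\,, \qquad T(n) = (2n-5)\,T(n-1), \quad T(3) = 1.
\]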

3.1 RAxML

RAxML-VI-HPC (v2.1.3) (Randomized Axelerated Maximum Likelihood version VI for High Performance Computing) [87] is a program for large-scale ML-based (Maximum Likelihood [43]) inference of phylogenetic (evolutionary) trees using multiple alignments of DNA or AA (Amino Acid) sequences. The program is freely available as open source code at icwww.epfl.ch/˜stamatak.

The current version of RAxML incorporates a rapid hill climbing search algorithm. A re- cent performance study [87] on real world datasets with ≥ 1,000 sequences reveals that it is able to find better trees in less time and with lower memory consumption than other current ML programs (IQPNNI, PHYML, GARLI). Moreover, RAxML-VI-HPC has been parallelized with MPI (Message Passing Interface), to enable non-parametric bootstrap- ping and multiple inferences on distinct starting trees in order to search for the best-known ML tree. Like every ML-based program, RAxML exhibits a source of fine-grained loop-level par- allelism in the likelihood functions which consume over 90% of the overall computation time. This source of parallelism scales well on large memory-intensive multi-gene alignments due to increased cache efficiency.

The MPI version of RAxML is the basis of our Cell version of the code [20]. In RAxML multiple inferences on the original alignment are required in order to determine the best-known (best-scoring) ML tree (we use the term best-known because the problem is NP-hard). Fur- thermore, bootstrap analyses are required to assign confidence values ranging between 0.0 and 1.0 to the internal branches of the best-known ML tree. This allows determining how well- supported certain parts of the tree are and is important for the biological conclusions drawn

from it. All those individual tree searches, be they bootstraps or multiple inferences, are completely independent of each other and can thus be exploited by a simple master-worker MPI scheme. Each search can further exploit data parallelism via thread-level parallelization of loops and/or SIMDization.
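As an illustration of such a master-worker scheme (not RAxML's actual code), the sketch below distributes independent inferences to MPI worker processes and collects their likelihood scores; run_one_inference() is a hypothetical stand-in for a complete tree search.

#include <mpi.h>

#define TAG_WORK 1
#define TAG_DONE 2

/* Hypothetical stand-in for one complete, independent tree search
 * (a bootstrap replicate or an inference from a distinct starting tree). */
extern double run_one_inference(int replicate_id);

/* Minimal master-worker distribution of independent inferences.
 * Assumes at least two MPI processes (one master, one or more workers). */
void distribute_inferences(int num_replicates)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int next = 0, done = 0;
        /* Prime every worker with one replicate id. */
        for (int w = 1; w < size && next < num_replicates; w++) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        /* Collect results and hand out the remaining replicates. */
        while (done < num_replicates) {
            double score;
            MPI_Status st;
            MPI_Recv(&score, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            done++;
            if (next < num_replicates) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            }
        }
        /* Tell all workers to exit. */
        int stop = -1;
        for (int w = 1; w < size; w++)
            MPI_Send(&stop, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
    } else {
        for (;;) {
            int id;
            MPI_Recv(&id, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (id < 0)
                break;
            double score = run_one_inference(id);
            MPI_Send(&score, 1, MPI_DOUBLE, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
}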

3.2 PBPI

PBPI is based on Bayesian phylogenetic inference, which constructs phylogenetic trees from DNA or AA sequences using the Markov Chain Monte Carlo (MCMC) sampling method. The program is freely available as open source code at www.pbpi.org. The MCMC method is inherently sequential, and the state of each time step depends on previous time steps. Therefore, the PBPI application uses the algorithmic improvements described below to achieve highly efficient parallel inference of phylogenetic trees. PBPI exploits multi-grain parallelism to achieve scalability on large-scale systems, such as the IBM BlueGene/L [45]. The algorithm of PBPI can be summarized as follows (an illustrative sketch of the corresponding MPI process-grid setup is given after the list):

1. Partition the Markov chains into chain groups, and split the data set into segments along the sequences.

2. Organize the virtual processors that execute the code into a two-dimensional grid; map each chain group to a row on the grid and map each segment to a column on the grid.

3. During each generation, compute the partial likelihood across all columns and use all-to- all communication to collect the complete likelihood values to all virtual processors on the same row.

4. When there are multiple chains, randomly choose two chains for swapping using point- to-point communication.
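The communicator layout implied by steps 1–3 can be sketched as follows. This is an illustrative fragment under assumed names (build_grid(), combine_partial_likelihood()), not PBPI's actual implementation: processes in the same row (chain group) share a communicator over which the per-segment partial likelihoods are combined.

#include <mpi.h>

/* Illustrative 2-D decomposition: rows = Markov chain groups,
 * columns = alignment segments. Assumes the number of MPI processes is
 * divisible by the number of chain groups. */
void build_grid(int num_chain_groups, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int num_segments = size / num_chain_groups;   /* grid is rows x columns */
    int row = rank / num_segments;                /* chain group index      */
    int col = rank % num_segments;                /* segment index          */

    /* Processes with the same 'row' share a communicator: the collective
     * that combines partial likelihoods (step 3) runs inside it. */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, row_comm);
    /* Column communicator, e.g. for chain-swap coordination (step 4). */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, col_comm);
}

/* Step 3: combine per-segment partial log-likelihoods within one row. */
double combine_partial_likelihood(double partial_loglik, MPI_Comm row_comm)
{
    double total = 0.0;
    MPI_Allreduce(&partial_loglik, &total, 1, MPI_DOUBLE, MPI_SUM, row_comm);
    return total;
}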

Figure 3.1: Organization of Cell: the PPE and eight SPEs, each with its local storage (LS), connected through the Element Interconnect Bus (EIB) to the memory and I/O controllers.

From a computational perspective, PBPI differs substantially from RAxML. While RAxML is embarrassingly parallel, PBPI uses a predetermined virtual processor topology and a corresponding data decomposition method. While the degree of task parallelism in RAxML may vary considerably at runtime, PBPI exposes, from the beginning of execution, a high degree of two-dimensional data parallelism to the runtime system. On the other hand, while the degree of task parallelism can be controlled dynamically in RAxML without a performance penalty, in PBPI changing the degree of outermost data parallelism requires data redistribution and incurs a high performance penalty.

3.3 Hardware Platform

The Cell BE is a heterogeneous multi-core processor which integrates a simultaneous multi-threading PowerPC core (the Power Processing Element or PPE) and eight specialized accelerator cores (the Synergistic Processing Elements or SPEs) [40]. These elements are connected in a ring topology on an on-chip network called the Element Interconnect Bus (EIB). The organization of Cell is illustrated in Figure 3.1.

The PPE is a 64-bit SMT processor running the PowerPC ISA, with vector/SIMD multimedia extensions [71]. The PPE has two levels of on-chip cache. The L1-I and L1-D caches of the PPE have a capacity of 32 KB. The L2 cache of the PPE has a capacity of 512 KB.

Each SPE is a 128-bit vector processor with two major components: a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). All instructions are executed on the SPU. The SPU includes 128 registers, each 128 bits wide, and 256 KB of software-controlled local storage. The SPU can fetch instructions and data only from its local storage and can write data only to its local storage. The SPU implements a Cell-specific set of SIMD intrinsics. All single-precision floating point operations on the SPU are fully pipelined, and the SPU can issue one single-precision floating point operation per cycle. Double-precision floating point operations are partially pipelined, and two double-precision floating point operations can be issued every six cycles. Double-precision FP performance is therefore significantly lower than single-precision FP performance. With all eight SPUs active and double-precision FP operations fully pipelined, the Cell BE is capable of a peak performance of 21.03 Gflops. In single-precision FP operation, the Cell BE is capable of a peak performance of 230.4 Gflops [33]. The SPEs can access RAM through direct memory access (DMA) requests. DMA transfers are handled by the MFC. All programs running on an SPE use the MFC to move data and instructions between local storage and main memory. Data transferred between local storage and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16 KB. DMA lists can be used for transferring more than 16 KB of data. A list can have up to 2,048 DMA requests, each for up to 16 KB. The MFC supports only DMA transfer sizes that are 1, 2, 4, 8, or multiples of 16 bytes long. The EIB is an on-chip coherent bus that handles communication between the PPE, SPEs, main memory, and I/O devices. Physically, the EIB is a 4-ring structure which can transmit 96 bytes per cycle, for a maximum theoretical memory bandwidth of 204.8 Gigabytes/second. The EIB can support more than 100 outstanding DMA requests. In this work we are using a Cell blade (IBM BladeCenter QS20) with two Cell BEs running at 3.2 GHz and 1 GB of XDR RAM (512 MB per processor). The PPEs run Linux Fedora Core 6. We use IBM SDK 2.1 and LAM/MPI 7.1.3.
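A minimal SPE-side fragment illustrating these DMA constraints is shown below. It is a sketch under our own names (fetch_block() and the buffer are assumptions, not SDK or application code), using the mfc_get() and tag-status primitives from spu_mfcio.h.

#include <spu_mfcio.h>

/* Local-storage buffer: 16 KB is the per-request DMA limit; 128-byte
 * alignment satisfies (and exceeds) the basic 16-byte alignment rule. */
static char buffer[16384] __attribute__((aligned(128)));

/* Fetch 'size' bytes (<= 16 KB, a multiple of 16) from effective address
 * 'ea' in main memory into local storage, then block until the transfer
 * completes. Illustrative fragment only. */
void fetch_block(unsigned long long ea, unsigned int size)
{
    const unsigned int tag = 0;               /* DMA tag group 0..31  */

    mfc_get(buffer, ea, size, tag, 0, 0);     /* enqueue the transfer */
    mfc_write_tag_mask(1 << tag);             /* select the tag group */
    mfc_read_tag_status_all();                /* wait for completion  */
}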

15 16 Chapter 4

Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories

Accelerator-based architectures with explicitly managed memories have the advantage of achieving a high degree of communication-computation overlap. While this is a highly desirable goal in high-performance computing, it is also a significant drawback from the programmability perspective. Managing all memory accesses from the application level significantly increases the complexity of the written code. In our work, we investigate execution models that reduce the complexity of the code written for asymmetric architectures, but still achieve desirable performance and high utilization of the available architectural resources. We investigate a set of optimizations that have the most significant impact on the performance of scientific applications executed on asymmetric architectures. In our case study, we investigate the optimization process which enables efficient execution of RAxML and PBPI on the Cell architecture.

The results presented in this chapter indicate that RAxML and PBPI are highly optimized for Cell, and also motivate the discussion presented in the rest of the thesis. The Cell-specific optimizations applied to the two bioinformatics applications resulted in a more than two-fold speedup. At the same time, we show that, regardless of being extensively optimized for sequential execution, parallel applications demand sophisticated scheduling support for efficient parallel execution on heterogeneous multi-core platforms.

4.1 Porting and Optimizing RAxML on Cell

We ported RAxML to Cell in four steps:

1. We ported the MPI code on the PPE;

2. We offloaded the most time-consuming parts of each MPI process on the SPEs;

3. We optimized the SPE code using vectorization of floating point computation, vectoriza- tion of control statements coupled with a specialized casting transformation, overlapping of computation and communication (double buffering) and other communication opti- mizations;

4. Lastly, we implemented multi-level parallelization schemes across and within SPEs in selected cases, as well as a scheduler for effective simultaneous exploitation of task, loop, and SIMD parallelism.

We outline optimizations 1-3 in the rest of the chapter. We focus on multi-level paralleliza- tion, as well as different scheduling policies in Chapter 5.

4.2 Function Off-loading

We profiled the application using gprof to identify the computationally intensive functions that could be candidates for offloading and optimization on SPEs. We used an IBM Power5 processor for profiling RAxML. For the profiling and benchmarking runs of RAxML presented in this chapter, we used the input file 42 SC, which contains 42 organisms, each represented by a DNA sequence of 1167 nucleotides. The number of distinct data patterns in a DNA alignment is on the order of 250. On the IBM Power5, 98.77% of the total execution time is spent in three functions:

• 77.24% in newview() - which computes the partial likelihood vector [44] at an inner node of the phylogenetic tree,

18 • 19.16% in makenewz() - which optimizes the length of a given branch with respect to the tree likelihood using the Newton–Raphson method,

• 2.37% in evaluate() - which calculates the log likelihood score of the tree at a given branch by summing over the partial likelihood vector entries.

These functions are the best candidates for offloading on SPEs.

The prerequisite for computing evaluate() and makenewz() is that the likelihood vectors at the nodes of the phylogenetic tree that are right and left of the current branch have been computed. Thus, makenewz() and evaluate() initially make calls to newview(), before they can execute their own computation. The newview() function at an inner node p of a tree, calls itself recursively when the two children r and q are not tips (leaves) and the likelihood array for r and q has not already been computed. Consequently, the first candidate for offloading is newview().

Although makenewz() and evaluate() are both taking a smaller portion of the execution time than newview(), offloading these two functions results in significant speedup (see Section 4.2.6). Besides the fact that each function can be executed faster on an SPE, having all three functions offloaded to an SPE reduces significantly the amount of PPE-SPE communication. In order to have a function executed on an SPE, we spawn an SPE thread at the beginning of each MPI process. The thread executes the offloaded function upon receiving a signal from the PPE and returns the result back to the PPE upon completion. To avoid excessive overhead from repeated thread spawning and joining, threads remain bound on SPEs and busy-wait for the PPE signal, before starting to execute a function.
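The SPE side of this scheme can be sketched as the dispatch loop below. It is illustrative rather than RAxML's actual code: the command encoding and the spe_newview()/spe_makenewz()/spe_evaluate() stubs are assumptions, and a blocking mailbox read stands in for the busy-wait on the PPE signal.

#include <spu_mfcio.h>

enum { CMD_EXIT = 0, CMD_NEWVIEW = 1, CMD_MAKENEWZ = 2, CMD_EVALUATE = 3 };

/* Hypothetical stand-ins for the off-loaded kernels. */
extern void spe_newview(void);
extern void spe_makenewz(void);
extern void spe_evaluate(void);

/* SPE-side dispatch loop: the thread stays bound to the SPE and waits for a
 * command from the PPE instead of being re-spawned for every off-load.
 * spu_read_in_mbox() stalls until the PPE writes to the inbound mailbox. */
int main(unsigned long long spe_id, unsigned long long argp,
         unsigned long long envp)
{
    (void)spe_id; (void)argp; (void)envp;
    for (;;) {
        unsigned int cmd = spu_read_in_mbox();   /* wait for the PPE signal */
        if (cmd == CMD_EXIT)
            break;
        switch (cmd) {
        case CMD_NEWVIEW:  spe_newview();  break;
        case CMD_MAKENEWZ: spe_makenewz(); break;
        case CMD_EVALUATE: spe_evaluate(); break;
        }
        spu_write_out_mbox(cmd);                 /* report completion       */
    }
    return 0;
}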

4.2.1 Optimizing Off-Loaded Functions

The discussion in this section refers to function newview(), which is the most computationally expensive in the code. Table 4.1 summarizes the execution times of RAxML before and after newview() is offloaded. The first column shows the number of workers (MPI processes) used in the experiment and the amount of work (bootstraps) performed. The maximum number

of workers we use is 2, since more workers would conflict on the PPE, which is a 2-way SMT processor. Executing a small number of workers results in low SPE utilization (each worker uses one SPE). In Section 4.3, we present results when the PPE is oversubscribed with up to 8 worker processes.

(a) Whole application on the PPE          (b) newview() offloaded to one SPE
1 worker, 1 bootstrap        24.4s        1 worker, 1 bootstrap        45s
2 workers, 8 bootstraps     134.1s        2 workers, 8 bootstraps     201.9s
2 workers, 16 bootstraps    267.7s        2 workers, 16 bootstraps    401.7s
2 workers, 32 bootstraps    539s          2 workers, 32 bootstraps    805s

Table 4.1: Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE.

As shown in Table 4.1, merely offloading newview() causes performance degradation. We profiled the new version of the code in order to get a better understanding of the major bot- tlenecks. Inside newview(), we identified 3 parts where the function spends almost its entire lifetime: the first part includes a large if(...) statement with a conjunction of four arithmetic comparisons used to check if small likelihood vector entries need to be scaled to avoid numerical underflow (similar checks are used in every ML implementation); the second time-consuming part involves DMA transfers; the third includes the loops that perform the actual likelihood vector calculation. In the next few sections we describe the techniques used to optimize the aforementioned parts in newview(). The same techniques were applied to the other offloaded functions.

4.2.2 Vectorizing Conditional Statements

RAxML always invokes newview() at an inner node of the tree (p) which is at the root of a sub- tree. The main computational kernel in newview() has a switch statement which selects one out of four paths of execution. If one or both descendants (r and q) of p are tips (leaves), the com- putations of the main loop in newview() can be simplified. This optimization leads to significant

performance improvements [87]. To activate the optimization, we use four implementations of the main computational part of newview() for the case that r and q are tips, r is a tip, q is a tip, or r and q are both inner nodes.

Each of the four execution paths in newview() leads to a distinct—highly optimized— version of the loop which performs the actual likelihood vector calculations. Each iteration of this loop executes the previously mentioned if() statement (Section 4.2.1), to check for like- lihood scaling. Mis-predicted branches in the compiled code for this statement incur a penalty of approximately 20 cycles [92]. We profiled newview() and found that 45% of the execution time is spent in this particular conditional statement. Furthermore, almost all the time is spent in checking the condition, while negligible time is spent in the body of code in the fall-through part of the conditional statement. The problematic conditional statement is shown below. The symbol ml is a constant and all operands are double precision floating point numbers.

if (ABS(x3->a) < ml && ABS(x3->g) < ml && ABS(x3->c) < ml && ABS(x3->t) < ml) {

...

}

This statement is a challenge for a branch predictor, since it implies 8 conditions, one for each of the four ABS() macros and the four comparisons against the minimum likelihood value constant (ml). On an SPE, comparing integers can be significantly faster than comparing doubles, since integer values can be compared using the SPE intrinsics. Although the current SPE intrinsics support only comparison of 32-bit integer values, the comparison of 64-bit integers is also possible by combining different intrinsics that operate on 32-bit integers. The current spu-gcc compiler automatically optimizes an integer branch using the SPE intrinsics. To optimize the problematic branches, we made the observation that integer comparison is faster than floating-point comparison on an SPE.

1 worker, 1 bootstrap        32.5s
2 workers, 8 bootstraps     151.7s
2 workers, 16 bootstraps    302.7s
2 workers, 32 bootstraps    604s

Table 4.2: Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42 SC.

According to the IEEE standard, numbers represented in float and double formats are "lexicographically ordered" [61]: if two floating point numbers in the same format are ordered, then they are ordered the same way when their bits are reinterpreted as sign-magnitude integers [61]. In other words, instead of comparing two floating point numbers, we can interpret their bit patterns as integers and do an integer comparison. When the comparison is performed with unsigned integer instructions, the outcome matches the floating point comparison as long as both numbers are non-negative. In our case, all operands are made positive, so we can replace the floating point comparison with an integer comparison.

To obtain the absolute value of a floating point number, we use the spu_and() intrinsic, which performs a bit-wise AND operation. With spu_and() we clear the most significant bit (the sign bit) of the floating point number. If the number is already positive, nothing changes, since the sign bit is already zero. In this way, we avoid using ABS(), which uses a conditional statement to check whether the operand is greater than or less than 0. After obtaining the absolute values of all the operands involved in the problematic if() statement, we reinterpret each operand as an unsigned long long value and perform the comparison. The optimized conditional statement is shown below. Following the optimization of the offending conditional statement, its contribution to the execution time of newview() drops to 6%, as opposed to 45% before the optimization. The total execution time (Table 4.2) improves by 25%–27%.

unsigned long long a[4];

a[0] = *(unsigned long long*)&x3->a & 0x7fffffffffffffffULL;
a[1] = *(unsigned long long*)&x3->c & 0x7fffffffffffffffULL;
a[2] = *(unsigned long long*)&x3->g & 0x7fffffffffffffffULL;
a[3] = *(unsigned long long*)&x3->t & 0x7fffffffffffffffULL;

if (a[0] < minli && a[1] < minli &&
    a[2] < minli && a[3] < minli) {
    ...
}

4.2.3 Double Buffering and Memory Management

Depending on the size of the input alignment, the major calculation loop (the loop that computes the likelihood vector) in newview() can execute up to 50,000 iterations. The number of iterations is directly related to the alignment length. The loop operates on large arrays, and each member of the arrays is an instance of the likelihood vector structure shown in Figure 4.1. The arrays are allocated dynamically at runtime. Since there is no limit on the size of these arrays, we are unable to keep all of their elements in the local storage of the SPEs.

typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__((aligned(128)));

Figure 4.1: The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-byte aligned, as required by the Cell architecture.

1 worker, 1 bootstrap       31.1s
2 workers, 8 bootstraps    145.4s
2 workers, 16 bootstraps   290s
2 workers, 32 bootstraps   582.6s

Table 4.3: Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC.

Instead, we strip-mine the arrays by fetching a few array elements to local storage at a time and executing the corresponding loop iterations on each batch of elements. We use a 2 KByte buffer for caching likelihood vectors, which is enough to store the data needed for 16 loop iterations. It should be noted that the space used for buffers is much smaller than the size of the local storage.

In the original code, where SPEs wait for all DMA transfers to complete, the idle time accounts for 11.4% of the execution time of newview(). We eliminated this waiting time by using double buffering to overlap DMA transfers with computation. The total execution time of the application after applying double buffering and tuning the data transfer size (set to 2 KBytes) is shown in Table 4.3.
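To make the strip-mining and double-buffering step concrete, the sketch below shows the general pattern on an SPE, using two 2 KByte buffers and the MFC tag-group primitives from spu_mfcio.h. The helper compute_batch(), the buffer layout and the effective-address argument are illustrative assumptions, not the actual RAxML code.

#include <spu_mfcio.h>

/* Likelihood vector structure from Figure 4.1. */
typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__((aligned(128)));

extern void compute_batch(likelivector *batch);           /* hypothetical: runs 16 loop iterations */

#define BATCH_ITERS 16
#define BATCH_BYTES (BATCH_ITERS * sizeof(likelivector))  /* 2 KBytes per buffer */

static likelivector buf[2][BATCH_ITERS] __attribute__((aligned(128)));

void process_strip(unsigned long long ea, int n_batches)
{
    int cur = 0;

    /* Prefetch the first batch into buffer 0, using DMA tag 0. */
    mfc_get(buf[0], ea, BATCH_BYTES, 0, 0, 0);

    for (int i = 0; i < n_batches; i++) {
        int nxt = cur ^ 1;

        /* Start fetching the next batch into the other buffer (tag nxt). */
        if (i + 1 < n_batches)
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * BATCH_BYTES,
                    BATCH_BYTES, nxt, 0, 0);

        /* Wait only for the transfer that fills the buffer we are about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        /* Compute on the current batch while the next one streams in. */
        compute_batch(buf[cur]);

        cur = nxt;
    }
}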

4.2.4 Vectorization

All calculations in newview() are enclosed in two loops. The first loop has a small trip count (typically 4–25 iterations) and computes the individual transition probability matrices (see Section 4.2.1) for each distinct rate category of the CAT or Γ models of rate heterogeneity [86]. Each iteration executes 36 double precision floating point operations. The second loop computes the likelihood vector. Typically, the second loop has a large trip count, which depends on the number of distinct data patterns in the alignment. For the 42 SC input file, the second loop has 228 iterations and executes 44 double precision floating point operations per iteration. Each SPE on the Cell can exploit data parallelism via vectorization. The SPE vector registers can store two double precision floating point elements. We vectorized the two loops in newview() using these registers.

The kernel of the first loop in newview() is shown in Figure 4.2(a), and Figure 4.2(b) shows the same code vectorized for the SPE.

(a) Non-vectorized:

for( ... )
{
    ki = *rptr++;

    d1c = exp (ki * lz10);
    d1g = exp (ki * lz11);
    d1t = exp (ki * lz12);

    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    ...
}

(b) Vectorized:

1:  vector double *left_v = (vector double*)left;
2:  vector double lz1011  = (vector double)(lz10,lz11);
    ...
for( ... )
{
3:  ki_v = spu_splats(*rptr++);

4:  d1cg = _exp_v ( spu_mul(ki_v,lz1011) );
    d1tc = _exp_v ( spu_mul(ki_v,lz1210) );
    d1gt = _exp_v ( spu_mul(ki_v,lz1112) );

    left_v[0] = spu_mul(d1cg,EV_v[0]);
    left_v[1] = spu_mul(d1tc,EV_v[1]);
    left_v[2] = spu_mul(d1gt,EV_v[2]);
    ...
}

Figure 4.2: The body of the first loop in newview(): (a) non-vectorized code, (b) vectorized code.

For a better understanding of the vectorized code, we briefly describe the SPE vector instructions we used:

• Instruction labeled 1 creates a vector pointer to an array consisting of double elements.

• Instruction labeled 2 joins two double elements, lz10 and lz11, into a single vector element.

• Instruction labeled 3 creates a vector from a single double element.

• Instruction labeled 4 is a composition of two different vector instructions:

  1. spu_mul() multiplies two vectors (in this case the arguments are vectors of doubles).

  2. _exp_v() is the vector version of the exponential function.

Non-vectorized:

for( ... )
{
    ump_x1_0  = x1->a;
    ump_x1_0 += x1->c * *left++;
    ump_x1_0 += x1->g * *left++;
    ump_x1_0 += x1->t * *left++;

    ump_x1_1  = x1->a;
    ump_x1_1 += x1->c * *left++;
    ump_x1_1 += x1->g * *left++;
    ump_x1_1 += x1->t * *left++;
    ...
}

Vectorized:

for( ... )
{
    a_v = spu_splats(x1->a);
    c_v = spu_splats(x1->c);
    g_v = spu_splats(x1->g);
    t_v = spu_splats(x1->t);

    l1 = (vector double)(left[0],left[3]);
    l2 = (vector double)(left[1],left[4]);
    l3 = (vector double)(left[2],left[5]);

    ump_v1[0] = spu_madd(c_v,l1,a_v);
    ump_v1[0] = spu_madd(g_v,l2,ump_v1[0]);
    ump_v1[0] = spu_madd(t_v,l3,ump_v1[0]);
    ...
}

Figure 4.3: The second loop in newview(): non-vectorized code (top) and vectorized code (bottom). spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element.


After vectorization, the body of the first loop executes 24 floating point instructions, plus one additional instruction for creating a vector from a scalar element. Note that due to the involved pointer arithmetic on dynamically allocated data structures, automatic vectorization of this code would be particularly challenging for a compiler.

Figure 4.3 illustrates the second loop, showing a few selected instructions which dominate execution time in the loop. The variables x1->a, x1->c, x1->g, and x1->t belong to the same C structure (the likelihood vector) and occupy contiguous memory locations. Only three of these variables are multiplied by the elements of the array left[]. This makes vectorization more difficult, since the code requires vector construction instructions such as spu_splats(). There are many different possibilities for vectorizing this code; the scheme shown in Figure 4.3 is the one that achieved the best performance in our tests.

1 worker, 1 bootstrap       27.8s
2 workers, 8 bootstraps    132.3s
2 workers, 16 bootstraps   265.2s
2 workers, 32 bootstraps   527s

Table 4.4: Execution time of RAxML following vectorization. The input file is 42 SC.

After vectorization, the number of floating point instructions in the body of the loops drops from 36 to 24 for the first loop, and from 44 to 22 for the second loop. Vectorization adds 25 instructions for creating vectors.

Without vectorization, newview() spends 69.4% of its execution time in the two loops. Following vectorization, the time spent in the loops drops to 57% of the execution time of newview(). Table 4.4 shows execution times following vectorization.

4.2.5 PPE-SPE Communication

Although newview() accounts for most of the execution time, its granularity is fine; its contribution to execution time stems from the large number of invocations. For the 42 SC input, newview() is invoked 230,500 times and the average execution time per invocation is 71µs. In order to invoke an offloaded function, the PPE needs to send a signal to an SPE. Also, after an offloaded function completes, it sends the result back to the PPE.

In an early implementation of RAxML, we used mailboxes to implement the communication between the PPE and the SPEs. We observed that PPE-SPE communication can be significantly improved if it is performed through main memory and SPE local storage instead of mailboxes. Using memory-to-memory communication improves execution time by 5%–6.4%. Table 4.5 shows RAxML execution times for the 42 SC input, including all optimizations discussed so far and direct memory-to-memory communication. It is interesting to note that direct memory-to-memory communication is an optimization which scales with parallelism on Cell, i.e., its performance impact grows as the code uses more SPEs. As the number of workers and bootstraps executed on the SPEs increases, the code becomes more communication-intensive, due to the fine granularity of the offloaded functions.

1 worker, 1 bootstrap       26.4s
2 workers, 8 bootstraps    123.3s
2 workers, 16 bootstraps   246.8s
2 workers, 32 bootstraps   493.3s

Table 4.5: Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC.
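As a rough illustration of the memory-based signaling (a sketch under assumed names, not the exact protocol used in RAxML), the SPE side can simply spin on a cache-line-aligned flag in its own local storage, which the PPE updates directly through the memory-mapped local store, instead of blocking on a mailbox read:

/* SPE-side control block; the PPE writes start_flag (and the input arguments)
   directly into this 128-byte-aligned block in the SPE local storage. */
volatile struct {
    int start_flag;          /* set by the PPE when an offloaded call is ready       */
    int done_flag;           /* set by the SPE when the result has been written back */
    double result;
} ctrl __attribute__((aligned(128)));

void spe_wait_for_invocation(void)
{
    while (ctrl.start_flag == 0)
        ;                    /* spin locally; no PPE-SPE mailbox round trip */
    ctrl.start_flag = 0;     /* re-arm the flag for the next invocation     */
}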

4.2.6 Increasing the Coverage of Offloading

In addition to newview(), we offloaded makenewz() and evaluate(). All three offloaded functions were packaged in a single code module loaded on the SPEs. The advantage of using a single module is that it can be loaded to local storage once, when an SPE thread is created, and remain pinned in local storage for the rest of the execution. Therefore, the cost of loading the code on the SPEs is amortized and communication between the PPE and SPEs is reduced. For example, when newview() is called by makenewz() or evaluate(), there is no need for any PPE-SPE communication, since all functions already reside in SPE local storage.
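The single-module organization can be pictured as a small dispatch loop resident on each SPE. The sketch below is only illustrative: the command encoding and the helpers wait_for_command(), signal_completion() and the spe_* kernels are hypothetical names, not the actual RAxML module.

enum cmd { CMD_NEWVIEW, CMD_MAKENEWZ, CMD_EVALUATE, CMD_EXIT };

extern int  wait_for_command(void);     /* blocks until the PPE posts a request        */
extern void signal_completion(void);    /* writes the result back and notifies the PPE */
extern void spe_newview(void), spe_makenewz(void), spe_evaluate(void);

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    (void)speid; (void)argp; (void)envp;

    for (;;) {
        int c = wait_for_command();
        if (c == CMD_EXIT)
            break;
        switch (c) {
        case CMD_NEWVIEW:  spe_newview();  break;
        case CMD_MAKENEWZ: spe_makenewz(); break;  /* may call spe_newview() locally, no PPE round trip */
        case CMD_EVALUATE: spe_evaluate(); break;
        }
        signal_completion();
    }
    return 0;
}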

Offloading all three critical functions improves performance by a further 25%–31%. A more important implication is that, after offloading and optimizing all three functions, the RAxML code split between the PPE and one SPE becomes faster than the sequential code executed exclusively on the PPE, by as much as 19%. Function offloading is another optimization which scales with parallelism. When more than one MPI process is used and more than one bootstrap is offloaded to the SPEs by each process, the gains from offloading rise to 36%. Table 4.6 illustrates execution times after full function offloading.

1 worker, 1 bootstrap       19.8s
2 workers, 8 bootstraps     86.8s
2 workers, 16 bootstraps   173s
2 workers, 32 bootstraps   344.4s

Table 4.6: Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC.

Figure 4.4: Performance of (a) RAxML and (b) PBPI with a varying number of MPI processes (1–8). Both plots show execution time in seconds.

4.3 Parallel Execution

After improving the performance of RAxML and PBPI using the presented optimization techniques, we investigated parallel execution of both applications on the Cell processor. To achieve higher utilization of the Cell chip, we oversubscribed the PPE with a varying number of MPI processes (2–8) and assigned a single SPE to each MPI process. The execution times of the different parallel configurations are presented in Figure 4.4. In the presented experiments we use weak scaling, i.e., the amount of computation increases with the number of processes.

In Figure 4.4(a) we observe that for any number of processes larger than two, the execution time of RAxML remains constant. Two factors are responsible for this behavior:

1. On-chip contention, as well as bus and memory contention, which occurs on the PPE side when the PPE is oversubscribed by multiple processes;

2. The Linux kernel is oblivious to the process of off-loading, which results in poor scheduling decisions. Each process following the off-loading execution model constantly alternates its execution between the PPE and an SPE. Unaware of this alternation, the OS allows processes to keep control over resources which are not actually used; in other words, the PPE might be assigned to a process whose execution is currently switched to an SPE.

In the case of PBPI (Figure 4.4(b)), we observe similar performance trends as with RAxML. From the presented experiments it is clear that naive parallelization of the applications, where the PPE is simply oversubscribed with multiple processes, does not provide satisfactory performance. The poor scaling of the applications is a strong motivation for a detailed exploration of different parallel programming models, as well as scheduling policies, for asymmetric processors. We continue the discussion of parallel execution on heterogeneous architectures in Chapter 5.

4.4 Chapter Summary

In this chapter we presented a set of optimizations which enable efficient sequential execution of scientific applications on asymmetric platforms. We exploited the fact that our test applications contain large computational functions (loops) which consume the majority of the execution time. This assumption does not reduce the generality of the presented techniques, since large, time-consuming computational loops are common in most scientific codes. We explored a total of five optimizations and their performance implications: I) offloading the bulk of the maximum likelihood tree calculation to the accelerators; II) casting and vectorization of expensive conditional statements involving multiple, hard-to-predict conditions; III) double buffering for overlapping memory communication with computation; IV) vectorization of the core of the floating point computation; V) optimization of communication between the host core and accelerators using direct memory-to-memory transfers.

In our case study, starting from a version of RAxML and PBPI already optimized for conventional uniprocessors and multiprocessors, we were able to improve performance on the Cell processor by more than a factor of two.

Chapter 5

Scheduling Multigrain Parallelism on Asymmetric Systems

5.1 Introduction

In this chapter, we investigate runtime scheduling policies for mapping different layers of parallelism, exposed by an application, to the Cell processor. We assume that applications describe all available algorithmic parallelism to the runtime system explicitly, while the runtime system dynamically selects the degree of granularity and the dimensions of parallelism to expose to the hardware at runtime, using dynamic scheduling mechanisms and policies. In other words, the runtime system is responsible for partitioning algorithmic parallelism in layers that best match the diverse capabilities of the processor cores, while at the same time rightsizing the granularity of parallelism in each layer.

5.2 Scheduling Multi-Grain Parallelism on Cell

In this section we explore the possibilities for exploiting multi-grain parallelism on Cell. The Cell PPE can execute two threads or processes simultaneously, from which parts of the code can be off-loaded and executed on the SPEs. To increase the sources of parallelism for the SPEs, the user may consider two approaches:

• The user may oversubscribe the PPE with more processes or threads than the number of processes/threads that the PPE can execute simultaneously. In other words, the programmer attempts to find more parallelism to off-load to the accelerators by attempting a finer-grain task decomposition of the code. In this case, the runtime system needs to schedule the host processes/threads so as to minimize the idle time on the host core while the computation is off-loaded to accelerators. We present an event-driven task-level scheduler (EDTLP) which achieves this goal in Section 5.2.1.

• The user can introduce a new dimension of parallelism to the application by distributing loops from within the off-loaded functions across multiple SPEs. In other words, the user can exploit data parallelism both within and across accelerators. Each SPE can work on a part of a distributed loop, which can be further accelerated with SIMDization. We present case studies that motivate the dynamic extraction of multi-grain parallelism via loop distribution in Section 5.2.2.

5.2.1 Event-Driven Task Scheduling

EDTLP is a runtime scheduling module which can be embedded transparently in MPI codes. The EDTLP scheduler operates under the assumption that the code to off-load to accelerators is specified by the user at the level of functions. In the case of Cell, this means that the user has either constructed SPE threads in a separate code module, or annotated the host PPE code with directives to extract SPE threads via a compiler [17]. The EDTLP scheduler avoids underutilization of SPEs by oversubscribing the PPE and preventing a single MPI process from monopolizing it.

Informally, the EDTLP scheduler off-loads tasks from MPI processes. A task ready for off-loading serves as an event trigger for the scheduler. Upon the occurrence of such an event, the scheduler immediately attempts to serve the MPI process that carries the task to off-load and sends the task to an available SPE, if any. While off-loading a task, the scheduler suspends the MPI process that spawned the task and switches to another MPI process, anticipating that more tasks will be available for off-loading from ready-to-run MPI processes. Switching upon off-loading prevents MPI processes from blocking the PPE while waiting for their tasks to return. The scheduler attempts to sustain a high supply of tasks for off-loading to SPEs by serving MPI processes round-robin.
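The per-event logic can be pictured roughly as follows. The task_t type and the helper functions are illustrative placeholders, and sched_yield() is only one possible way to realize the hand-off of the PPE; this is a sketch, not the actual runtime code.

#include <sched.h>

typedef struct task task_t;                     /* opaque task descriptor (illustrative) */
extern int  find_idle_spe(void);                /* returns an idle SPE id, or -1         */
extern void offload_to_spe(int spe, task_t *t);
extern void wait_for_completion(int spe, task_t *t);

/* Invoked in the context of an MPI process when one of its tasks becomes ready. */
void edtlp_offload(task_t *task)
{
    int spe = find_idle_spe();
    if (spe >= 0) {
        offload_to_spe(spe, task);              /* the arrival event: hand the task over    */
        sched_yield();                          /* release the PPE to another ready process */
        wait_for_completion(spe, task);         /* the departure event resumes this process */
    }
    /* If no SPE is idle, the runtime would queue or retry; that path is omitted here. */
}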

The downside of a scheduler based on oversubscribing a processor is context-switching overhead. Cell in particular also suffers from the problem of interference between processes or threads sharing the SMT PPE core. The granularity of the off-loaded code determines if the overhead introduced by oversubscribing the PPE can be tolerated. The code off-loaded to an SPE should be coarse enough to marginalize the overhead of context switching performed on the PPE. The EDTLP scheduler addresses this issue by performing granularity control of the off-loaded tasks and preventing off-loading of code that does not meet a minimum granularity threshold.

Figure 5.1 illustrates an example of the difference between scheduling MPI processes with the EDTLP scheduler and with the native Linux scheduler. In this example, each MPI process has one task to off-load to the SPEs. For illustrative purposes only, we assume that there are only 4 SPEs on the chip. In Figure 5.1(a), once a task is sent to an SPE, the scheduler forces a context switch on the PPE. Since the PPE is a two-way SMT, two MPI processes can simultaneously off-load tasks to two SPEs. The EDTLP scheduler enables the use of all four SPEs via function off-loading. On the contrary, if the scheduler waits for the completion of a task before giving another MPI process an opportunity to off-load (Figure 5.1(b)), the application can only utilize two SPEs. Realistic application tasks often have significantly shorter lengths than the time quanta used by the Linux scheduler; for example, in RAxML, task lengths are on the order of tens of microseconds, whereas Linux time quanta are on the order of tens of milliseconds.

Table 5.1(a) compares the performance of the EDTLP scheduler to that of the native Linux scheduler, using RAxML and running a workload comprising 42 organisms. In this experiment, the number of performed bootstraps is not constant and is equal to the number of MPI processes. The EDTLP scheduler outperforms the Linux scheduler by up to a factor of 2.7. In the experiment with PBPI (Table 5.1(b)), we execute the code with one Markov chain for 20,000 generations and we vary the number of MPI processes across runs. PBPI is also executed with weak scaling, i.e., we increase the size of the DNA alignment with the number of processes. The workload for PBPI includes 107 organisms. EDTLP outperforms the Linux scheduler policy in PBPI by up to a factor of 2.7.


Figure 5.1: Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs.

5.2.2 Scheduling Loop-Level Parallelism

The EDTLP model described in Section 5.2 is effective if the PPE has enough coarse-grained functions to off-load to SPEs. In cases where the degree of available task parallelism is less than the number of SPEs, the runtime system can activate a second layer of parallelism by splitting an already off-loaded task across multiple SPEs. We implemented runtime support for the parallelization of for-loops enclosed within off-loaded SPE functions. We parallelize loops in off-loaded functions using work-sharing constructs similar to those found in OpenMP. In RAxML, all for-loops in the three off-loaded functions have no loop-carried dependencies and obtain speedup from parallelization, assuming that there are enough idle SPEs dedicated to their execution. The number of SPEs activated for work-sharing is user- or system-controlled, as in OpenMP. We discuss dynamic system-level control of loop parallelism further in Section 5.3.

                           EDTLP    Linux
1 worker, 1 bootstrap      19.7s    19.7s
2 workers, 2 bootstraps    22.2s    30s
3 workers, 3 bootstraps    26s      40.7s
4 workers, 4 bootstraps    28.1s    43.3s
5 workers, 5 bootstraps    33s      60.7s
6 workers, 6 bootstraps    34s      61.8s
7 workers, 7 bootstraps    38.8s    81.2s
8 workers, 8 bootstraps    39.8s    81.7s

(a)

                            EDTLP     Linux
1 worker, 20,000 gen.       27.77s    27.54s
2 workers, 20,000 gen.      30.2s     30s
3 workers, 20,000 gen.      31.92s    56.16s
4 workers, 20,000 gen.      36.4s     63.7s
5 workers, 20,000 gen.      40.12s    93.71s
6 workers, 20,000 gen.      41.48s    93s
7 workers, 20,000 gen.      53.93s    144.81s
8 workers, 20,000 gen.      52.64s    135.92s

(b)

Table 5.1: Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms.


The parallelization scheme is outlined in Figure 5.2. The program is executed on the PPE until execution reaches the parallel loop to be off-loaded. At that point the PPE sends a signal to a single SPE which is designated as the master. The signal is processed by the master and further broadcast to all workers involved in the parallelization. Upon receiving the signal, each SPE worker fetches the data necessary for loop execution. We ensure that the SPEs work on different parts of the loop and do not overlap, by assigning a unique identifier to each SPE thread involved in the parallelization of the loop. Global data changed by any of the SPEs during loop execution is committed to main memory at the end of each iteration. After processing the assigned parts of the loop, the SPE workers send a notification back to the master. If the loop includes a reduction, the master also collects partial results from the SPEs and accumulates them locally. All communication between SPEs is performed on chip, in order to avoid the long latency of communicating through shared memory.

Note that in our loop parallelization scheme on Cell, all work performed by the master SPE could also be performed by the PPE. In that case, the PPE would broadcast a signal to all SPE threads involved in loop parallelization and the partial results calculated by the SPEs would be accumulated back at the PPE. Such collective operations increase the frequency of SPE-PPE communication, especially when the distributed loop is a nested loop. In the case of RAxML, in order to reduce SPE-PPE communication and avoid unnecessary invocation of the MPI process that spawned the parallelized loop, we opted to use an SPE to distribute loops to the other SPEs and to collect their results. In PBPI, we let the PPE execute the master thread during loop parallelization, since the loops are coarse enough to overshadow the loop execution overhead. Optimizing and selecting between these loop execution schemes is a subject of ongoing research.

SPE threads participating in loop parallelization are created once upon off-loading the code for the first parallel loop to SPEs. The threads remain active and pinned to the same SPEs during the entire program execution, unless the scheduler decides to change the parallelization strategy and redistribute the SPEs between one or more concurrently executing parallel loops. Pinned SPE threads can run multiple off-loaded loop bodies, as long as the code of these loop bodies fits on the local storage of the SPEs. If the loop parallelization strategy is changed on the fly by the runtime system, a new code module with loop bodies that implement the new parallelization strategy is loaded on the local storage of the SPEs.

Table 5.2 illustrates the performance of the basic loop-level parallelization scheme of our runtime system in RAxML. Table 5.2(a) shows the execution time of RAxML using one MPI process and performing one bootstrap, on a data set which comprises 42 organisms. This experiment isolates the impact of our loop-level parallelization mechanisms on Cell. The number of iterations in parallelized loops depends on the size of the input alignment in RAxML; for the given data set, each parallel loop executes 228 iterations.

Figure 5.2: Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master. The master sends a start signal to each worker, the x loop iterations are divided evenly (the master executes iterations 1 to x/8, worker 1 executes iterations x/8 to x/4, and so on up to worker 7), and each worker sends a stop signal back to the master when its portion is complete.

The results shown in Table 5.2(a) suggest that RAxML sees a reasonable yet limited performance improvement from loop-level parallelism. The highest speedup (1.72) is achieved with 7 SPEs. The reasons for the modest speedup are the non-optimal coverage of loop-level parallelism (less than 90% of the original sequential code is covered by parallelized loops), the fine granularity of the loops, and the fact that most loops have reductions, which create bottlenecks on the Cell DMA engine. The performance degradation that occurs when 5 or 6 SPEs are used happens because of specific memory alignment constraints that have to be met on the SPEs. Due to these alignment constraints, on certain occasions it is not possible to evenly distribute the data used in the loop body, and therefore the workload of iterations, between SPEs. More specifically, the use of character arrays for the main data set in RAxML forces array transfers in multiples of 16 array elements. Consequently, loop distribution across processors is done with a minimum chunk size of 16 iterations.
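As an illustration of this constraint, the sketch below shows one way the iteration space could be divided among the SPEs in whole 16-iteration chunks; the function and variable names are assumptions made for the example, not the actual runtime code.

#define CHUNK 16    /* minimum chunk imposed by the 16-element transfer granularity */

/* Compute the half-open iteration range [lo, hi) assigned to worker 'id'. */
void loop_range(int id, int num_spes, int n_iters, int *lo, int *hi)
{
    int chunks   = (n_iters + CHUNK - 1) / CHUNK;   /* total number of whole chunks       */
    int per_spe  = chunks / num_spes;
    int leftover = chunks % num_spes;               /* first 'leftover' SPEs get one more */
    int first    = id * per_spe + (id < leftover ? id : leftover);
    int count    = per_spe + (id < leftover ? 1 : 0);

    *lo = first * CHUNK;
    *hi = (*lo + count * CHUNK < n_iters) ? (*lo + count * CHUNK) : n_iters;
}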

1 worker, 1 boot., no LLP                 19.7s
1 worker, 1 boot., 2 SPEs used for LLP    14s
1 worker, 1 boot., 3 SPEs used for LLP    13.36s
1 worker, 1 boot., 4 SPEs used for LLP    12.8s
1 worker, 1 boot., 5 SPEs used for LLP    13.8s
1 worker, 1 boot., 6 SPEs used for LLP    12.47s
1 worker, 1 boot., 7 SPEs used for LLP    11.4s
1 worker, 1 boot., 8 SPEs used for LLP    11.44s

(a)

1 worker, 1 boot., no LLP                 47.9s
1 worker, 1 boot., 2 SPEs used for LLP    29.5s
1 worker, 1 boot., 3 SPEs used for LLP    23.3s
1 worker, 1 boot., 4 SPEs used for LLP    20.5s
1 worker, 1 boot., 5 SPEs used for LLP    18.7s
1 worker, 1 boot., 6 SPEs used for LLP    18.1s
1 worker, 1 boot., 7 SPEs used for LLP    17.1s
1 worker, 1 boot., 8 SPEs used for LLP    16.8s

(b)

Table 5.2: Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides.

Loop-level parallelization in RAxML can achieve higher speedup in a single bootstrap with larger input data sets. Alignments that have a larger number of nucleotides per organism have more loop iterations to distribute across SPEs. To illustrate the behavior of loop-level parallelization with coarser loops, we repeated the previous experiment using a data set where the DNA sequences are represented with 20,000 nucleotides. The results are shown in Table 5.2(b). In this experiment, the performance of the loop-level parallelization scheme always increases with the number of SPEs.

PBPI exhibits clearly better scalability than RAxML with LLP, since the granularity of the loops is coarser in PBPI than in RAxML. Table 5.3 illustrates the execution times when PBPI is executed with a variable number of SPEs used for LLP. Again, we control the granularity of the off-loaded code by using different data sets: Table 5.3(a) shows execution times for a data set that contains 107 organisms, each represented by a DNA sequence of 3,000 nucleotides. Table 5.3(b) shows execution times for a data set that contains 107 organisms, each represented by a DNA sequence of 10,000 nucleotides. We run PBPI with one Markov chain for 20,000 generations. For the two data sets, PBPI achieves a maximum speedup of 4.6 and 6.1 respectively, after loop-level parallelization.

1 worker, 1,000 gen., no LLP                 27.2s
1 worker, 1,000 gen., 2 SPEs used for LLP    14.9s
1 worker, 1,000 gen., 3 SPEs used for LLP    11.3s
1 worker, 1,000 gen., 4 SPEs used for LLP    8.4s
1 worker, 1,000 gen., 5 SPEs used for LLP    7.3s
1 worker, 1,000 gen., 6 SPEs used for LLP    6.8s
1 worker, 1,000 gen., 7 SPEs used for LLP    6.2s
1 worker, 1,000 gen., 8 SPEs used for LLP    5.9s

(a)

1 worker, 20,000 gen., no LLP          262s
1 worker, 20,000 gen., 2 SPEs used     131.3s
1 worker, 20,000 gen., 3 SPEs used     92.3s
1 worker, 20,000 gen., 4 SPEs used     70.1s
1 worker, 20,000 gen., 5 SPEs used     58.1s
1 worker, 20,000 gen., 6 SPEs used     49s
1 worker, 20,000 gen., 7 SPEs used     43s
1 worker, 20,000 gen., 8 SPEs used     39.7s

(b)

Table 5.3: Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides.

struct Pass {
    volatile unsigned int v1_ad;
    volatile unsigned int v2_ad;
    // ... arguments for loop body
    volatile unsigned int vn_ad;
    volatile double res;
    volatile int sig[2];
} __attribute__((aligned(128)));

Figure 5.3: The data structure Pass is used for communication among SPEs. The vi_ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism.

5.2.3 Implementing Loop-Level Parallelism

The SPE threads participating in loop work-sharing constructs are created once, upon function off-loading. Communication among SPEs participating in work-sharing constructs is implemented using DMA transfers and the communication structure Pass, depicted in Figure 5.3.

The Pass structure is private to each thread. The master SPE thread allocates an array of Pass structures, and each member of this array is used for communication with one SPE worker thread. Once the SPE threads are created, they exchange the local addresses of their Pass structures. This address exchange is performed through the PPE. Whenever one thread needs to send a signal to a thread on another SPE, it issues an mfc_put() request and sets the destination address to be the address of the Pass structure of the recipient.
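A minimal sketch of how such a transfer could be implemented is shown below; it assumes that worker_pass_ea[] holds the exchanged effective addresses of the workers' Pass structures and that struct Pass is the one from Figure 5.3 (the array name, the tag choice and the blocking wait are illustrative).

#include <spu_mfcio.h>

/* struct Pass as defined in Figure 5.3. */
extern unsigned long long worker_pass_ea[];   /* remote addresses of the workers' Pass structures */

void send_to_spe(int worker, struct Pass *local_copy)
{
    unsigned int tag = worker & 31;           /* pick one of the 32 MFC tag groups */

    /* DMA the local copy of Pass into the recipient SPE's local storage. */
    mfc_put(local_copy, worker_pass_ea[worker], sizeof(struct Pass), tag, 0, 0);

    /* Block until the transfer completes before local_copy is reused. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}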

In Figure 5.4, we illustrate a RAxML loop parallelized with work-sharing among SPE threads. Before executing the loop, the master thread sets the parameters of the Pass structure for each worker SPE and issues one mfc_put() request per worker; this is done in send_to_spe(). Worker i uses the parameters of the received Pass structure and fetches the data needed for the loop execution to its local storage (function fetch_data()). After finishing the execution of its portion of the loop, a worker sets the res parameter in its local copy of the Pass structure and sends it to the master, using send_to_master(). The master accumulates the results from all workers and commits the sum to main memory.

Immediately after calling send_to_spe(), the master participates in the execution of the loop. The master tends to have a slight head start over the workers, since the workers need to complete several DMA requests, fetching the required data from the master's local storage or from shared memory, before they can start executing the loop. In fine-grained off-loaded functions such as those encountered in RAxML, the resulting load imbalance between the master and the workers is noticeable. To achieve better load balancing, we set the master to execute a slightly larger portion of the loop. A fully automated and adaptive implementation of this purposeful load unbalancing is obtained by timing idle periods on the SPEs across multiple invocations of the same loop. The collected times are used to tune the iteration distribution in each invocation, in order to reduce idle time on the SPEs.
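One way such idle-time-driven tuning could be realized is sketched below; the proportional adjustment and all names are assumptions made for the example, since the text does not specify the exact heuristic used by our runtime.

/* Fraction of the loop iterations taken by the master SPE; it starts from an even
   split and grows when the workers are observed idling at the end of an invocation. */
static double master_share;

void init_master_share(int num_spes)
{
    master_share = 1.0 / num_spes;
}

void tune_master_share(double avg_worker_idle, double loop_time, int num_spes)
{
    /* Shift work to the master in proportion to the observed idle fraction. */
    master_share += (avg_worker_idle / loop_time) / num_spes;

    /* Clamp the share so the master never takes more than twice an even split. */
    if (master_share > 2.0 / num_spes)
        master_share = 2.0 / num_spes;
}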

5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism

Merging task-level and loop-level parallelism on Cell can improve the utilization of accelerators. A non-trivial problem with such a hybrid parallelization scheme is the assignment of accelerators to tasks. The optimal assignment is largely application-specific, task-specific and input-specific. We support this argument using RAxML as an example. The discussion in this section is limited to RAxML, where the degree of outermost parallelism can be changed arbitrarily by varying the number of MPI processes executing bootstraps, with a small impact on performance. PBPI uses a data decomposition approach which depends on the number of processors; therefore, dynamically varying the number of MPI processes executing the code at runtime cannot be accomplished without data redistribution.

Master SPE:

struct Pass pass[Num_SPE];

for(i=0; i < Num_SPE; i++){
    pass[i].sig[0] = 1;
    ...
    send_to_spe(i, &pass[i]);
}

/* Parallelized loop */
for ( ... )
{
    ...
}

tr->likeli = sum;

for(i=0; i < Num_SPE; i++){
    while(pass[i].sig[1] == 0);
    pass[i].sig[1] = 0;
    tr->likeli += pass[i].res;
}

commit(tr->likeli);

Worker SPE:

struct Pass pass;

while(pass.sig[0] == 0);
fetch_data();

/* Parallelized loop */
for ( ... )
{
    ...
}

pass.res = sum;
pass.sig[1] = 1;
send_to_master(&pass);

Figure 5.4: Parallelization of the loop from function evaluate() in RAxML. The first listing depicts the code executed by the master SPE; the second listing depicts the code executed by a worker SPE. Num_SPE represents the number of SPE worker threads.

5.3.1 Application-Specific Hybrid Parallelization on Cell

We present a set of experiments with RAxML performing a number of bootstraps ranging between 1 and 128. In these experiments we use three versions of RAxML. Two of the three versions use hybrid parallelization models combining task- and loop-level parallelism, while the third version exploits only task-level parallelism and uses the EDTLP scheduler. More specifically, in the first version, each off-loaded task is parallelized across 2 SPEs, and 4 MPI processes are multiplexed on the PPE, executing 4 concurrent bootstraps. In the second version, each off-loaded task is parallelized across 4 SPEs and 2 MPI processes are multiplexed on the PPE, executing 2 concurrent bootstraps. In the third version, the code concurrently executes 8 MPI processes, the off-loaded tasks are not parallelized, and the tasks are scheduled with the EDTLP scheduler. Figure 5.5 illustrates the results of the experiments, with a data set representing 42 organisms. The x-axis shows the number of bootstraps, while the y-axis shows execution time in seconds.

Figure 5.5: Comparison of the task-level (EDTLP) and hybrid (EDTLP+LLP with 2 or 4 SPEs per parallel loop) parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128.

As expected, the hybrid model outperforms EDTLP when up to 4 bootstraps are executed, since only a combination of EDTLP and LLP can off-load code to more than 4 SPEs simultaneously. With 5 to 8 bootstraps, the hybrid models execute bootstraps in batches of 2 and 4 respectively, while the EDTLP model executes all bootstraps in parallel. EDTLP activates 5 to 8 SPEs solely for task-level parallelism, leaving room for loop-level parallelism on at most 3 SPEs. This proves to be unnecessary, since the parallel execution time is determined by the length of the non-parallelized off-loaded tasks that remain on at least one SPE. In the range between 9 and 12 bootstraps, combining EDTLP and LLP selectively, so that the first 8 bootstraps execute with EDTLP and the last 4 bootstraps execute with the hybrid scheme, is the best option. For the input data set with 42 organisms, the performance of the EDTLP and hybrid EDTLP-LLP schemes is almost identical when the number of bootstraps is between 13 and 16. When the number of bootstraps is higher than 16, EDTLP clearly outperforms any hybrid scheme (Figure 5.5(b)).

The reader may notice that the problem of hybrid parallelization is trivialized when the problem size is scaled beyond a certain point, which is 28 bootstraps in the case of RAxML (see Section 5.3.2). A production run of RAxML for real-world phylogenetic analysis would require up to 1,000 bootstraps, thus rendering hybrid parallelization seemingly unnecessary. However, if a production RAxML run with 1,000 bootstraps were to be executed across multiple Cell BEs, and assuming equal division of bootstraps between the processors, the cut-off point for EDTLP outperforming the hybrid EDTLP-LLP scheme would be reached at 36 Cell processors. Beyond this scale, performance per processor would be maximized only if LLP were employed in conjunction with EDTLP on each Cell. Although this observation is empirical and somewhat simplifying, it is further supported by the argument that scaling across multiple processors will in all likelihood increase communication overhead and therefore favor a parallelization scheme with fewer MPI processes. The hybrid scheme reduces the number of MPI processes compared to the pure EDTLP scheme, when the granularity of work per Cell becomes fine.

5.3.2 MGPS

The purpose of MGPS is to dynamically adapt the parallel execution by either exposing only one layer of task parallelism to the SPEs via event-driven scheduling, or expanding to the second layer of data parallelism and merging it with task parallelism when SPEs are underutilized at runtime.

MGPS extends the EDTLP scheduler with an adaptive processor-saving policy. The scheduler runs locally in each process and is driven by two events:

• arrivals, which correspond to off-loading functions from PPE processes to SPE threads;

• departures, which correspond to completion of SPE functions.

MGPS is invoked upon arrivals and departures of tasks. Initially, upon arrivals, the scheduler conservatively assigns one SPE to each off-loaded task. Upon a departure, the scheduler monitors the degree of task-level parallelism exposed by each MPI process, i.e., how many discrete tasks were off-loaded to SPEs while the departing task was executing. This number reflects the history of SPE utilization from task-level parallelism and is used to switch from the EDTLP scheduling policy to a hybrid EDTLP-LLP scheduling policy. The scheduler monitors the number of SPEs that execute tasks over epochs of 100 off-loads. If the observed SPE utilization is over 50%, the scheduler maintains the most recently selected scheduling policy (EDTLP or EDTLP-LLP). If the observed SPE utilization falls under 50% and the scheduler currently uses EDTLP, it switches to EDTLP-LLP by loading parallelized versions of the loops in the local storage of the SPEs and performing loop distribution.
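The adaptation rule can be summarized by the sketch below; EPOCH, NUM_SPES and switch_to_llp() are illustrative names, and the bookkeeping in the real scheduler is more involved.

#define EPOCH    100                 /* off-loads per monitoring epoch */
#define NUM_SPES 8

enum policy { POLICY_EDTLP, POLICY_EDTLP_LLP };

static enum policy current_policy = POLICY_EDTLP;
static int offloads_in_epoch = 0;
static int busy_spe_samples  = 0;

extern void switch_to_llp(void);     /* loads parallelized loop versions on the SPEs */

/* Called on every task departure with the number of SPEs currently running tasks. */
void mgps_on_departure(int spes_busy_now)
{
    busy_spe_samples += spes_busy_now;

    if (++offloads_in_epoch == EPOCH) {
        double utilization = (double)busy_spe_samples / (EPOCH * NUM_SPES);

        /* Below 50% utilization under EDTLP: merge in loop-level parallelism. */
        if (utilization < 0.5 && current_policy == POLICY_EDTLP) {
            current_policy = POLICY_EDTLP_LLP;
            switch_to_llp();
        }
        /* At or above 50%: keep the most recently selected policy. */

        offloads_in_epoch = 0;
        busy_spe_samples  = 0;
    }
}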

To switch between different parallel execution models at runtime, the runtime system uses code versioning. It maintains three versions of the code of each task. One version is used for execution on the PPE. The second version is used for execution on an SPE from start to finish, using SIMDization to exploit the vector execution units of the SPE. The third version is used for distribution of the loop enclosed by the task between more than one SPE. The use of code versioning increases code management overhead, as SPEs may need to load different versions of the code of each off-loaded task at runtime. On the other hand, code versioning obviates the need for conditionals that would be used in a monolithic version of the code. These conditionals are expensive on SPEs, which lack branch prediction capabilities. Our experimental analysis indicates that overlaying code versions on the SPEs via code transfers ends up being slightly more efficient than using monolithic code with conditionals. This happens because of the overhead and frequency of the conditionals in the monolithic version of the SPE code, but also because the code overlays leave more space available in the local storage of the SPEs for data caching and buffering to overlap computation and communication [20].

We compare MGPS to EDTLP and two static hybrid (EDTLP-LLP) schedulers, using 2 SPEs per loop and 4 SPEs per loop respectively. Figure 5.6 shows the execution times of MGPS, EDTLP-LLP and EDTLP with various RAxML workloads. The x-axis shows the number of bootstraps, while the y-axis shows execution time. We observe benefits from using MGPS for up to 28 bootstraps. Beyond 28 bootstraps, MGPS converges to EDTLP and both are increasingly faster than static EDTLP-LLP execution, as the number of bootstraps increases.

A clear disadvantage of MGPS is that the time needed for any adaptation decision depends on the total number of off-loading requests, which in turn is inherently application- and input-dependent. If the off-loading requests from different processes are spaced apart, there may be extended idle periods on the SPEs before adaptation takes place. Another disadvantage of MGPS is the dependence of its dynamic scheduling policy on the initial configuration used to execute the application. In RAxML, MGPS converges to the best execution strategy only if the application begins by oversubscribing the PPE and exposing the maximum degree of task-level parallelism to the runtime system. This strategy is unlikely to converge to the best scheduling policy in other applications, where task-level parallelism is limited and data parallelism is more dominant. In such cases, MGPS would have to commence its optimization process from a different program configuration, favoring data-level rather than task-level parallelism. We address the aforementioned shortcomings via a sampling-based MGPS algorithm (S-MGPS), which we introduce in the next section.

Figure 5.6: MGPS, EDTLP and static EDTLP-LLP (with 2 or 4 SPEs per parallel loop). The x-axis shows the number of bootstraps and the y-axis shows execution time in seconds. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128.

5.4 S-MGPS

We begin this section by presenting a motivating example that shows why controlling concurrency on the Cell is useful, even if the SPEs are seemingly fully utilized. This example motivates the introduction of a sampling-based algorithm that explores the space of program and system configurations that utilize all SPEs, under different distributions of SPEs between concurrently executing tasks and parallel loops. We then present S-MGPS and evaluate it using RAxML and PBPI.

5.4.1 Motivating Example

Increasing the degree of task parallelism on Cell comes at a cost, namely increased contention between the MPI processes that time-share the PPE. Pairs of processes that execute in parallel on the PPE suffer from contention for shared resources, a well-known problem of simultaneous multithreaded processors. Furthermore, with more processes, context switching overhead and the lack of co-scheduling of SPE threads with the PPE threads from which they originate may harm performance. On the other hand, while loop-level parallelization can ameliorate PPE contention, its performance benefit depends on the granularity and locality properties of the parallel loops.

Figure 5.7 shows the efficiency of loop-level parallelism in RAxML when the input data set is relatively small. The input data set in this example (25 SC) has 25 organisms, each of them represented by a DNA sequence of 500 nucleotides. In this experiment, RAxML is executed multiple times with a single worker process and a variable number of SPEs used for LLP. The best execution time is achieved with 5 SPEs. The behavior illustrated in Figure 5.7 is caused by several factors, including the granularity of the loops relative to the overhead of PPE-SPE communication, and load imbalance (discussed in Section 5.2.2).

By using two dimensions of parallelism to execute an application, the runtime system can control both PPE contention and loop-level parallelization overhead. Figure 5.8 illustrates an example in which multi-grain parallel executions outperform one-dimensional parallel executions in RAxML, for any number of bootstraps. In this example, RAxML is executed with three static parallelization schemes, using 8 MPI processes and 1 SPE per process, 4 MPI processes and 2 SPEs per process, or 2 MPI processes and 4 SPEs per process, respectively. The input data set is 25 SC. Using this data set, RAxML performs best with a multi-level parallelization model, when 4 MPI processes are executed simultaneously on the PPE and each of them uses 2 SPEs for loop-level parallelization.


Figure 5.7: Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC.

Figure 5.8: Execution times of RAxML with various static multi-grain scheduling strategies: 8 worker processes with 1 SPE per off-loaded task, 4 worker processes with 2 SPEs per off-loaded task, and 2 worker processes with 4 SPEs per off-loaded task. The x-axis shows the number of bootstraps and the y-axis shows execution time in seconds. The input dataset is 25 SC.

5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism

The S-MGPS scheduler automatically determines the best parallelization scheme for a specific workload by using a sampling period. During the sampling period, S-MGPS performs a search of program configurations along the available dimensions of parallelism. The search starts with a single MPI process, and during the first step S-MGPS determines the optimal number of SPEs that should be used by a single MPI process. The search is implemented by sampling execution phases of the MPI process with different degrees of loop-level parallelism. Phases represent code that is executed repeatedly in an application and dominates execution time; in the case of RAxML and PBPI, the phases are the off-loaded tasks. Although we identify phases manually in our execution environment, the selection process for phases is trivial and can be automated in a compiler. Furthermore, parallel applications almost always exhibit a very strong runtime periodicity in their execution patterns, which makes the process of isolating the dominant execution phases straightforward.

Once the first sampling step of S-MGPS is completed, the search continues by sampling execution intervals with every feasible combination of task-level and loop-level parallelism. In the second phase of the search, the degree of loop-level parallelism never exceeds the optimal value determined by the first sampling step. For each execution interval, the scheduler uses the execution time of phases as the criterion for selecting the optimal dimension(s) and granularity of parallelism per dimension. S-MGPS uses a performance-driven mechanism to rightsize parallelism on Cell, as opposed to the utilization-driven mechanism used in MGPS.

Figure 5.9 illustrates the steps of the sampling phase when 2 MPI processes are executed on the PPE. This process can be performed for any number of MPI processes that can be executed on a single Cell node. For each MPI process, the runtime system uses a variable number of SPEs, ranging from 1 up to the optimal number of SPEs determined by the first phase of sampling.

The purpose of the sampling period is to determine the configuration of parallelism that maximizes efficiency. We define a throughput metric W as

    W = C / T                                    (5.1)

where C is the number of completed tasks and T is execution time.

Figure 5.9: The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops.

Note that a task is defined as a function off-loaded on the SPEs; therefore, C captures application- and input-dependent behavior. S-MGPS computes C by counting the number of task off-loads. This metric works reasonably well, assuming that tasks of the same type (i.e., the same function or chunk of an expensive computational loop, off-loaded multiple times on an SPE) have approximately the same execution time. This is indeed the case in the applications that we studied. The metric can easily be extended so that each task is weighted with its execution time relative to the execution time of other tasks, to account for unbalanced task execution times. We do not explore this option further in this thesis.
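The selection step at the end of the sampling period then reduces to computing W for every sampled configuration and keeping the maximum, as in the sketch below (the sample record and function names are illustrative):

struct sample {
    int    deg_tlp;           /* degree of task-level parallelism sampled */
    int    deg_llp;           /* degree of loop-level parallelism sampled */
    long   completed_tasks;   /* C: off-loads counted during the interval */
    double seconds;           /* T: duration of the sampled interval      */
};

/* Return the index of the configuration with the highest throughput W = C / T. */
int pick_best_configuration(const struct sample *s, int n)
{
    int best = 0;
    double best_w = (double)s[0].completed_tasks / s[0].seconds;

    for (int i = 1; i < n; i++) {
        double w = (double)s[i].completed_tasks / s[i].seconds;
        if (w > best_w) {
            best_w = w;
            best   = i;
        }
    }
    return best;
}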

S-MGPS calculates the efficiency of every sampled configuration and selects the configuration with the maximum efficiency for the rest of the execution. In Table 5.4 we present partial results of the sampling phase in RAxML for different input datasets. In this example, the degree of task-level parallelism sampled is 8, 4 and 2, while the degree of loop-level parallelism sampled is 1, 2 and 4. In the case of RAxML we define a single sampling phase as the time necessary for all active worker processes to finish a single bootstrap. Therefore, in the case of RAxML in Table 5.4, the number of bootstraps and the execution time differ across sampling phases: when the number of active workers is 8, the sampling phase contains 8 bootstraps; when the number of active workers is 4, the sampling phase contains 4 bootstraps; and so on. Nevertheless, the throughput (W) remains invariant across different sampling phases and always represents the efficiency of a certain configuration, i.e., the amount of work done per second. The results presented in Table 5.4 confirm that S-MGPS converges to the optimal configurations (4x2 and 8x1) for the input files 25 SC and 42 SC, respectively.

Dataset   deg(TLP) x deg(LLP)   # bootstraps per    # off-loaded   phase       W
                                sampling phase      tasks          duration
42 SC     8x1                   8                   2,526,126      41.73s      60,535
42 SC     4x2                   4                   1,263,444      21.05s      60,021
42 SC     2x4                   2                   624,308        14.42s      43,294
25 SC     8x1                   8                   1,261,232      16.53s      76,299
25 SC     4x2                   4                   612,155        8.01s       76,423
25 SC     2x4                   2                   302,394        5.6s        53,998

Table 5.4: Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() denotes the degree of a given dimension of parallelism (LLP or TLP).

Since the scheduler performs an exhaustive search, for the 25 SC input the total number of bootstraps required for the sampling period on Cell is 17, for up to 8 MPI processes and 1 to 5 SPEs used per MPI process for loop-level parallelization. The upper bound of 5 SPEs per loop is determined by the first step of the sampling period. Assuming that performance is optimized when the maximum number of SPEs of the processor is involved in parallelization, the feasible configurations to sample are constrained by deg(TLP)×deg(LLP)=8, for a single Cell with 8 SPEs. Under this constraint, the number of samples needed by S-MGPS on Cell drops to 3. Unfortunately, when considering only configurations that use all SPEs, the scheduler may omit a configuration that does not use all SPEs but still performs better than the best scheme that uses all processor cores. In principle, this situation may occur in certain non-scalable codes or code phases. To address such cases, we recommend the use of exhaustive search in S-MGPS, given that the total number of feasible configurations of SPEs on a Cell is manageable and small compared to the number of tasks and the number of instances of each task executed in real applications. This assumption may need to be revisited in the future for large-scale systems with many cores, where exhaustive search may need to be replaced by heuristics such as hill climbing or simulated annealing.

In Table 5.5 we compare the performance of S-MGPS to static scheduling policies with both one-dimensional (TLP) and multi-grain (TLP-LLP) parallelism on Cell, using RAxML. For a small number of bootstraps, S-MGPS underperforms the best static scheduling scheme by 10%. The reason is that S-MGPS expends a significant percentage of the execution time in the sampling period, while executing the program in mostly suboptimal configurations. As the number of bootstraps increases, S-MGPS comes closer to the performance of the best static scheduling scheme (within 3%–5%).

              deg(TLP)=8,   deg(TLP)=4,   deg(TLP)=2,   S-MGPS
              deg(LLP)=1    deg(LLP)=2    deg(LLP)=4
32 boots.     60s           57s           80s           63s
64 boots.     117s          112s          161s          118s
128 boots.    231s          221s          323s          227s

Table 5.5: RAxML – Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS.

To map PBPI to Cell, we used a hybrid parallelization approach where a fixed number of MPI processes is multiplexed on the PPE and multiple SPEs are used for loop-level parallelization. The performance of the parallelized off-loaded code in PBPI is influenced by the same factors as in RAxML: the granularity of the off-loaded code, PPE-SPE communication, and load imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs is used to execute the parallelized off-loaded code. The input file used in this experiment is 107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides. We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four executions of PBPI, with 1, 2, 4 and 8 MPI processes and 1–16, 1–8, 1–4 and 1–2 SPEs used per MPI process, respectively. In all experiments we use a single BladeCenter with two Cell BE processors (16 SPEs in total).

Figure 5.10: PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1-4, deg(LLP)=1–16 factors as in RAxML: granularity of the off-loaded code, PPE-SPE communication, and load imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs is used to execute the parallelized off-loaded code. The input file we used in this experiment is 107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides. We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four exe- cutions of PBPI with 1, 2, 4 and 8 MPI processes with 1–16, 1–8, 1–4 and 1–2 SPEs used per MPI process respectively. In all experiments we use a single BladeCenter with two Cell BE processors (total of 16 SPEs).

In the experiments with 1 and 2 MPI processes, the off-loaded code scales only up to a certain number of SPEs, which is smaller than the total number of available SPEs; in both cases, the best performance is reached with fewer SPEs than are available. The optimal number of SPEs in general depends on the input data set and on the outermost parallelization and data decomposition scheme of PBPI. The best performance for this dataset is reached by using 4 MPI processes, spread across 2 Cell BEs, with each process using 4 SPEs on one Cell BE. This optimal operating point shifts with different data set sizes.

The fixed virtual processor topology and data decomposition method used in PBPI prevent dynamic scheduling of MPI processes at runtime without excessive overhead. We have experimented with the option of dynamically changing the number of active MPI processes via a gang scheduling scheme, which keeps the total number of active MPI processes constant, but co-schedules MPI processes in gangs of size 1, 2, 4, or 8 on the PPE and uses 8, 4, 2, or 1 SPE(s) per MPI process per gang respectively, for the execution of parallel loops. This scheme also suffered from system overhead, due to process control and context switching on the SPEs. Pending better solutions for adaptively controlling the number of processes in MPI, we evaluated S-MGPS in several scenarios where the number of MPI processes remains fixed. Using S-MGPS we were able to determine the optimal degree of loop-level parallelism for any given degree of task-level parallelism (i.e. initial number of MPI processes) in PBPI. Being able to pinpoint the optimal SPE configuration for LLP is still important, since different loop parallelization strategies can result in a significant difference in execution time. For example, the naïve parallelization strategy, where all available SPEs are used for parallelization of off-loaded loops, can result in up to 21% performance degradation (see Figure 5.10).

Table 5.6 compares execution times when S-MGPS is used and when different static parallelization schemes are used. S-MGPS performs within 2% of the optimal static parallelization scheme, and up to 20% better than the naïve parallelization scheme where all available SPEs are used for LLP (see Table 5.6(b)).

5.5 Chapter Summary

In this chapter we investigated policies and mechanisms pertaining to scheduling multigrain parallelism on the Cell Broadband Engine. We proposed an event-driven task scheduler, striving for higher utilization of SPEs by oversubscribing the PPE. We explored the conditions under which loop-level parallelism within off-loaded code can be used. We also proposed a comprehensive scheduling policy for combining task-level and loop-level parallelism autonomically within MPI code, in response to workload fluctuation. Using a bioinformatics code with

inherent multigrain parallelism as a case study, we have shown that our user-level scheduling policies outperform the native OS scheduler by up to a factor of 2.7.

(a) deg(TLP)=1
deg(LLP)    1      2      3      4      5      6      7      8
Time (s)    502    267.8  222.8  175.8  142.1  118.6  108.1  134.3
deg(LLP)    9      10     11     12     13     14     15     16
Time (s)    122    111.9  138.3  109.2  122.3  133.2  115.3  116.5
S-MGPS Time (s): 110.3

(b) deg(TLP)=2
deg(LLP)    1      2      3      4      5      6      7       8
Time (s)    275.9  180.8  139.4  113.5  91.3   97.3   102.55  115
S-MGPS Time (s): 93

(c) deg(TLP)=4
deg(LLP)    1      2       3      4
Time (s)    180.6  118.67  94.63  83.61
S-MGPS Time (s): 85.9

(d) deg(TLP)=8
deg(LLP)    1      2
Time (s)    355.5  265
S-MGPS Time (s): 267

Table 5.6: PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1–4; (d) deg(TLP)=8, deg(LLP)=1–2.

Our MGPS scheduler proves to be responsive to small and large degrees of task-level and data-level parallelism, at both fine and coarse levels of granularity. This kind of parallelism is commonly found in optimization problems where many workers are spawned to search a very large space of solutions, using a heuristic. RAxML is representative of these applications. MGPS is also appropriate for adaptive and irregular applications such as adaptive mesh refinement, where the application has task-level parallelism with variable granularity (because of load imbalance incurred while meshing subdomains with different structural properties) and, in some implementations, a statically unpredictable degree of task-level parallelism (because of non-deterministic dynamic load balancing which may be employed to improve execution time). N-body simulations and ray-tracing are applications that exhibit similar properties and can also benefit from our scheduler. As a final note, we observe that MGPS reverts to the best static scheduling scheme for regular codes with a fixed degree of task-level parallelism, such as blocked linear algebra kernels.

We also investigated the problem of mapping multi-dimensional parallelism on heterogeneous parallel architectures with both conventional and accelerator cores. We proposed a feedback-guided dynamic scheduling scheme, S-MGPS, which rightsizes parallelism on the fly, without a priori knowledge of application-specific information and regardless of the input data set.

Chapter 6

Model of Multi-Grain Parallelism

6.1 Introduction

The migration of parallel programming models to accelerator-based architectures raises many challenges. Accelerators require platform-specific programming interfaces and re-formulation of parallel algorithms to fully exploit the additional hardware. Furthermore, scheduling code on accelerators and orchestrating parallel execution and data transfers between host processors and accelerators is a non-trivial exercise, as discussed in Chapter 5.

Although the S-MGPS scheduler (Section 5.4) can accurately determine the most efficient execution configuration of a multi-level parallel application, it requires sampling many different configurations at runtime. The sampling time grows with the number of accelerators on the chip and with the number of different levels of parallelism available in the application. To pinpoint the most efficient execution configuration without a sampling phase, we develop a model of multi-dimensional parallel computation on heterogeneous multi-core processors. We name the model the Model of Multi-Grain Parallelism (MMGP). The model is applicable to any type of accelerator-based architecture, and in Section 6.4 we test the accuracy and usability of MMGP on the multicore Cell architecture.


Figure 6.1: A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply relatively coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may internally support finer-grain parallelism. Both HPUs and APUs have local memories and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion.

6.2 Modeling Abstractions

Performance can be dramatically affected by the assignment of tasks to resources on a complex parallel architecture with multiple types of parallel execution vehicles. We intend to create a model of performance that captures the important costs of parallel task assignment at multiple levels of granularity, while maintaining simplicity. Additionally, we want our techniques to be independent of both programming models and the underlying hardware. Thus, in this section we identify abstractions necessary to allow us to define a simple, accurate model of parallel computation for accelerator-based architectures.

6.2.1 Hardware Abstraction

Figure 6.1 shows our abstraction for accelerator-based architectures. In this abstraction, each node consists of multiple host processing units (HPUs) and multiple accelerator processing units (APUs). Both the HPUs and APUs have local and shared memory. Multiple HPU-APU nodes form a cluster. We model the communication cost between components i and j, where i and j are HPUs, APUs, and/or HPU-APU nodes, using a variant of the LogP model [35] of point-to-point communication:

C_{i,j} = O_i + L + O_j     (6.1)

where C_{i,j} is the communication cost, O_i and O_j are the overheads of the sender and receiver respectively, and L is the communication latency.
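Read as code, Equation 6.1 is a simple additive cost. The sketch below is only an illustration of that reading; the numeric overhead and latency values in it are placeholders rather than measured constants.

```c
/* Point-to-point communication cost of Equation 6.1: sender overhead,
 * wire latency, and receiver overhead add up.  The values used below
 * are placeholders for illustration only. */
#include <stdio.h>

static double comm_cost(double o_sender, double latency, double o_receiver)
{
    return o_sender + latency + o_receiver;   /* C_{i,j} = O_i + L + O_j */
}

int main(void)
{
    /* hypothetical overheads and latency, in seconds */
    double o_hpu = 0.5e-6, l = 0.2e-6, o_apu = 0.3e-6;
    printf("C_{HPU,APU} = %.2g s\n", comm_cost(o_hpu, l, o_apu));
    return 0;
}
```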

In this hardware abstraction, we model an HPU, APU, or HPU-APU node as a sequential device with streaming memory accesses. For simplicity, we assume that additional levels of parallelism in HPUs or APUs, such as ILP and SIMD, can be reflected with a parameter that represents computing capacity. We could alternatively express multi-grain parallelism hierarchically, but this complicates the model description without much added value. The assumption of streaming memory accesses allows us to include the effects of overlapping communication with computation.

6.2.2 Application Abstraction

Figure 6.2 provides an illustrative view of the succeeding discussion. We model the workload of a parallel application using a version of the Hierarchical Task Graph (HTG [52]). An HTG represents multiple levels of concurrency with progressively finer granularity when moving from outermost to innermost layers. We use a phased HTG, in which we partition the application into multiple phases of execution and split each phase into nested sub-phases, each modeled as a single, potentially parallel task. Each subtask may incorporate one or more layers of data or sub-task parallelism. The degree of concurrency may vary between tasks and within tasks.

Mapping a workload with nested parallelism as shown in Figure 6.2 to an accelerator-based multi-core architecture can be challenging. In the general case, any application task of any granularity could map to any combination of HPUs and APUs. The solution space under these conditions can be unmanageable.

Figure 6.2: Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this chapter, we address the problem of mapping tasks and subtasks to accelerator-based systems.

In this work, we confine the solution space by making some assumptions about the application and hardware. First, we assume that the amount and type of parallelism is known a priori for all phases in the application. In other words, we assume that the application is explicitly parallelized, in a machine-independent fashion. More specifically, we assume that the application exposes all available layers of inherent parallelism to the runtime environment, without, however, specifying how to map this parallelism to parallel execution vehicles in hardware. In other words, the application's parallelism is expressed independently of the number and the layout of processors in the architecture and is represented by a phased HTG. The intent of our work is to improve and formalize programming of accelerator-based multicore architectures. We believe it is not unreasonable to assume that those interested in porting code and algorithms to such systems would have detailed knowledge of the inherent parallelism of their application. Furthermore, explicit, processor-independent parallel programming is considered by many as a means to simplify parallel programming models [10].

Second, we prune the number and type of hardware configurations. We assume hardware

configurations consist of a hierarchy of nested resources, even though the actual resources may not be physically nested in the architecture. Each resource is assigned to an arbitrary level of parallelism in the application, and resources are grouped by level of parallelism in the application. For instance, the Cell Broadband Engine can be considered as 2 HPUs and 8 APUs, where the two HPUs correspond to the PowerPC dual-thread SMT core and the APUs to the synergistic (SPE) accelerator cores. HPUs support parallelism of any granularity; APUs, however, support the same or finer, not coarser, granularity. This assumption is reasonable since it faithfully represents all current accelerator architectures, where front-end processors offload computation and data to accelerators. It also simplifies modeling of both communication and computation.

6.3 Model of Multi-grain Parallelism

This section provides theoretical rigor to our approach. We present MMGP, a model which predicts execution time on accelerator-based system configurations and applications under the assumptions described in the previous section. Readers familiar with point-to-point models of parallel computation may want to skim this section and continue directly to the results of our execution time prediction techniques discussed in Section 6.4.

We follow a bottom-up approach. We begin by modeling sequential execution on the HPU, with part of the computation off-loaded to a single APU. Next, we incorporate multiple APUs in the model, followed by multiple HPUs. We end up with a general model of execution time, which is not particularly practical. Hence, we reduce the general model to reflect different uses of HPUs and APUs on real systems. More specifically, we specialize the model to capture the scheduling policy of threads on the HPUs and to estimate execution times under different mappings of multi-grain parallelism across HPUs and APUs. Lastly, we describe the methodology we use to apply MMGP to real systems.

Figure 6.3: The sub-phases of a sequential application are readily mapped to HPUs and APUs: (a) an architecture with one HPU and one APU; (b) an application with three phases. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory.

6.3.1 Modeling sequential execution

As the starting point, we consider the mapping of the program to an accelerator-based architecture that consists of one HPU and one APU, and an application with one phase decomposed into three sub-phases: a prologue and an epilogue running on the HPU, and a main accelerated sub-phase running on the APU, as illustrated in Figure 6.3.

Offloading computation incurs additional communication cost for loading code and data on the APU and for saving results computed on the APU. We model each of these communication costs with a latency and an overhead at the end-points, as in Equation 6.1. We assume that the APU's accesses to data during the execution of a procedure are streamed and overlapped with APU computation. This assumption reflects the capability of current streaming architectures, such as the Cell and Merrimac [37], to aggressively overlap memory latency with computation, using multiple buffers. Due to overlapped memory latency, communication overhead is assumed to be visible only while loading the code and arguments of a procedure on the APU and while returning the result of a procedure from the APU to the HPU. We combine the communication overhead for offloading the code and arguments of a procedure and signaling the execution of that procedure on the APU in one term (O_s), and the overhead for returning the result of a procedure from the APU to the HPU in another term (O_r).

We can model the execution time of the offloaded sequential execution for sub-phase 2 in Figure 6.3 as:

T_{offload}(w_2) = T_{APU}(w_2) + O_r + O_s     (6.2)

where T_{APU}(w_2) is the time needed to complete sub-phase 2 without additional overhead. Further, we can write the total execution time of all three sub-phases as:

T = T_{HPU}(w_1) + T_{APU}(w_2) + O_r + O_s + T_{HPU}(w_3)     (6.3)

To reduce complexity, we replace T_{HPU}(w_1) + T_{HPU}(w_3) with T_{HPU}, T_{APU}(w_2) with T_{APU}, and O_s + O_r with O_{offload}. Therefore, we can rewrite Equation 6.3 as:

T = T_{HPU} + T_{APU} + O_{offload}     (6.4)

The application model in Figure 6.3 is representative of one of potentially many phases in an application. We further modify Equation 6.4 for a generic application with N phases, where each phase i offloads a part of its computation on one APU:

T = \sum_{i=1}^{N} (T_{HPU,i} + T_{APU,i} + O_{offload})     (6.5)
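As a concrete reading of Equation 6.5, the sketch below sums per-phase host time, accelerator time, and off-loading overhead; the per-phase numbers are made up for illustration and do not correspond to any measured application.

```c
/* Sketch of Equation 6.5: total time of a program with N phases, where each
 * phase runs T_HPU,i on the host, off-loads work that takes T_APU,i on one
 * accelerator, and pays a fixed off-loading overhead per phase.
 * All per-phase numbers below are made up for illustration. */
#include <stdio.h>

int main(void)
{
    double t_hpu[] = { 0.4, 0.2, 0.1 };  /* seconds on the HPU per phase  */
    double t_apu[] = { 3.0, 5.5, 1.5 };  /* seconds on the APU per phase  */
    double o_offload = 0.001;            /* per-phase off-loading overhead */
    int n = sizeof(t_hpu) / sizeof(t_hpu[0]);

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += t_hpu[i] + t_apu[i] + o_offload;   /* Equation 6.5 */
    printf("predicted total time: %.3f s\n", total);
    return 0;
}
```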

6.3.2 Modeling parallel execution on APUs

Each offloaded part of a phase may contain fine-grain parallelism, such as task-level parallelism at the sub-procedural level or data-level parallelism in loops. This parallelism can be exploited by using multiple APUs for the offloaded workload. Figure 6.4 shows the execution time decomposition for execution with one APU and with two APUs. We assume that the code off-loaded to an APU during phase i has a part which can be further parallelized across APUs and a part executed sequentially on the APU. We denote by T_{APU,i}(1, 1) the execution time of the further parallelized part of the APU code during the i-th phase. The first index (1) refers to the use of one HPU thread in the execution. We denote by T_{APU,i}(1, p) the execution time of the same part when p APUs are used to execute it during the i-th phase. We denote by C_{APU,i} the non-parallelized part of the APU code in phase i. Therefore, we obtain:

T_{APU,i}(1, p) = T_{APU,i}(1, 1)/p + C_{APU,i}     (6.6)

Figure 6.4: Parallel APU execution: (a) offloading to one APU; (b) offloading to two APUs. The HPU (leftmost bar in each part) offloads computations to one APU in (a) and to two APUs in (b). The single point-to-point transfer of (a) is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefit due to parallelization.

Given that the HPU offloads to APUs sequentially, there exists a latency gap between consecutive offloads on APUs. Similarly, there exists a gap between receiving return values from two consecutive offloaded procedures on the HPU. We denote by g the larger of the two gaps. On a system with p APUs, parallel APU execution will incur an additional overhead as large as p · g. Thus, we can model the execution time of phase i as:

T_i(1, p) = T_{HPU,i} + T_{APU,i}(1, 1)/p + C_{APU,i} + O_{offload} + p · g     (6.7)
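Equation 6.7 makes the trade-off explicit: dividing the parallelizable APU time by p competes with the growing off-load gap p·g. The sketch below sweeps p for one phase and reports the minimizer; all parameter values are hypothetical, chosen only so that the minimum falls at an intermediate p, mirroring the earlier observation that using every SPE is not always optimal.

```c
/* Sketch of Equation 6.7: for a single phase, predict execution time as the
 * degree of APU (data-level) parallelism p grows, and report the p that
 * minimizes it.  The trade-off is between dividing T_APU(1,1) by p and
 * paying the offload gap p*g.  All parameter values are hypothetical. */
#include <stdio.h>

int main(void)
{
    double t_hpu = 1.0;      /* non-offloaded host time for the phase    */
    double t_apu_11 = 10.0;  /* parallelizable APU time with one APU     */
    double c_apu = 2.0;      /* non-parallelizable APU time              */
    double o_offload = 0.05; /* send + receive overhead of one off-load  */
    double g = 0.8;          /* gap between consecutive off-loads        */
    int best_p = 1;
    double best_t = 0.0;

    for (int p = 1; p <= 8; p++) {
        double t = t_hpu + t_apu_11 / p + c_apu + o_offload + p * g; /* Eq. 6.7 */
        printf("p = %d  predicted time = %.2f\n", p, t);
        if (p == 1 || t < best_t) { best_t = t; best_p = p; }
    }
    printf("predicted best degree of APU parallelism: p = %d\n", best_p);
    return 0;
}
```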

6.3.3 Modeling parallel execution on HPUs

An accelerator-based architecture can support parallel HPU execution in several ways, by providing a multi-core HPU, an SMT HPU, or combinations thereof. As a point of reference, we consider an architecture with one SMT HPU, which is representative of the Cell BE.

Since the compute-intensive parts of an application are typically offloaded to APUs, the HPUs are expected to be idle for extended intervals. Therefore, multiple threads can be used to reduce idle time on the HPU and provide more sources of work for APUs, so that APUs are better utilized. It is also possible to oversubscribe the HPU with more threads than the number of available hardware contexts, in order to expose more parallelism via offloading to APUs.

Figure 6.5 illustrates the execution timeline when two threads share the same HPU, and each thread offloads parallelized code on two APUs. We use different shade patterns to represent the workload of different threads.


Figure 6.5: Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2 then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data followed by APU3 and APU4.

For m concurrent HPU threads, where each thread uses p APUs for distributing a single APU task, the execution time of a single off-loading phase can be represented as:

T_i^k(m, p) = T_{HPU,i}^k(m, p) + T_{APU,i}^k(m, p) + O_{offload} + p · g     (6.8)

where T_i^k(m, p) is the completion time of the k-th HPU thread during the i-th phase.

Modeling the APU time

Similarly to Equation 6.6, we can write the APU time of the k-th thread in phase i of Equation 6.8 as:

T_{APU,i}^k(m, p) = T_{APU,i}(m, 1)/p + C_{APU,i}     (6.9)

Different parallel implementations may result in different T_{APU,i}(m, 1) terms and a different number of offloading phases. For example, the implementation could parallelize each phase among m HPU threads and then offload the work of each HPU thread to p APUs, resulting in the same number of offloading phases and a reduced APU time during each phase, i.e., T_{APU,i}(m, 1) = T_{APU,i}(1, 1)/m. As another example, the HPU threads can be used to execute multiple identical tasks, resulting in a reduced number of offloading phases (i.e., N/m, where N is the number of offloading phases when there is only one HPU thread) and the same APU time in each phase, i.e., T_{APU,i}(m, 1) = T_{APU,i}(1, 1).

Modeling the HPU time

The execution time of each HPU thread is affected by three factors:

1. Contention between HPU threads for shared resources.

2. Context switch overhead related to resource scheduling.

3. Global synchronization between dependent HPU threads.

Considering all three factors, we can model the execution time of an HPU thread in phase i as:

T_{HPU,i}^k(m, p) = α_m · T_{HPU,i}(1, p) + T_{CSW} + O_{COL}     (6.10)

In this equation, T_{CSW} is the context switching time on the HPU and O_{COL} is the time needed for collective communication. The parameter α_m is introduced to account for contention between threads that share resources on the HPU. On SMT and CMP HPUs, such resources typically include one or more levels of the on-chip cache memory. On SMT HPUs in particular, shared resources also include TLBs, branch predictors and instruction slots in the pipeline. Contention between threads often introduces artificial load imbalance due to occasionally unfair hardware policies for allocating resources between threads.

Synthesis

Combining Equations (6.8)–(6.10) and summing over all phases, we can write the MMGP execution time as:

T(m, p) = α_m · T_{HPU}(1, 1) + T_{APU}(1, 1)/(m · p) + C_{APU} + N · (O_{offload} + T_{CSW} + O_{COL} + p · g)     (6.11)

Due to limited hardware resources (i.e. the number of HPUs and APUs), we further constrain this equation to m × p ≤ N_{APU}, where N_{APU} is the number of available APUs. As described later in this chapter, we can either measure or approximate all parameters in Equation 6.11 from microbenchmarks and profiles of sequential runs of the program.

6.3.4 Using MMGP

Given a parallel application, MMGP can be applied using the following process:

1. Calculate parameters including O_{offload}, α_m, T_{CSW} and O_{COL} using micro-benchmarks for the target platform.

2. Profile a short run of the sequential execution with off-loading to a single APU, to estimate T_{HPU}(1), g, T_{APU}(1, 1) and C_{APU}.

3. Solve a special case of Equation 6.11 (e.g. 6.7) to find the optimal mapping between application concurrency and HPUs and APUs available on the target platform.
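The sketch below walks through this process end to end under stated assumptions: it evaluates Equation 6.11 for every configuration with m·p ≤ N_APU and reports the predicted best mapping. The structure and names are ours, the parameter values are placeholders rather than measured numbers, and a single mean value stands in for α_m even though contention generally depends on m.

```c
/* Sketch of the three-step MMGP process: given parameters measured with
 * micro-benchmarks (alpha_m, T_CSW, O_COL, O_offload) and profiled from a
 * sequential run (T_HPU, T_APU(1,1), C_APU, g, N), evaluate Equation 6.11
 * for every configuration with m*p <= N_APU and report the one with the
 * smallest predicted time.  All numbers below are placeholders. */
#include <stdio.h>

#define N_APU 8   /* available accelerator cores (e.g. 8 SPEs on one Cell) */

struct mmgp_params {
    double alpha_m;   /* HPU contention factor (mean value used here;     */
                      /* in general it depends on m)                      */
    double t_hpu, t_apu_11, c_apu;
    double o_offload, t_csw, o_col, g;
    int    n_phases;
};

/* Equation 6.11 */
static double mmgp_time(const struct mmgp_params *pr, int m, int p)
{
    return pr->alpha_m * pr->t_hpu
         + pr->t_apu_11 / (m * p)
         + pr->c_apu
         + pr->n_phases * (pr->o_offload + pr->t_csw + pr->o_col + p * pr->g);
}

int main(void)
{
    struct mmgp_params pr = {
        .alpha_m = 1.28, .t_hpu = 2.0, .t_apu_11 = 300.0, .c_apu = 10.0,
        .o_offload = 70e-9, .t_csw = 2e-6, .o_col = 0.0, .g = 1e-4,
        .n_phases = 10000
    };
    int best_m = 1, best_p = 1;
    double best_t = mmgp_time(&pr, 1, 1);

    for (int m = 1; m <= N_APU; m++)
        for (int p = 1; m * p <= N_APU; p++) {
            double t = mmgp_time(&pr, m, p);
            if (t < best_t) { best_t = t; best_m = m; best_p = p; }
        }
    printf("predicted best mapping: m=%d HPU threads, p=%d APUs each (%.1f s)\n",
           best_m, best_p, best_t);
    return 0;
}
```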

6.3.5 MMGP Extensions

We note that the concepts and assumptions mentioned in this section do not preclude further specialization of MMGP for higher accuracy. For example, in Section 6.3.1 we assume that computation and data communication overlap. This assumption reflects the fact that streaming processors can typically overlap memory access latency completely with computation. For non-overlapped memory accesses, we can employ a DMA model as a specialization of the overhead factors in MMGP. Also, in Sections 6.3.2 and 6.3.3 we assume only two levels of parallelism. MMGP is easily extensible to additional levels, but the terms of the equations grow quickly without conceptual additions. Furthermore, MMGP can be easily extended to reflect specific scheduling policies for threads on HPUs and APUs, as well as load imbalance in the distribution of tasks between HPUs and APUs. To illustrate the usefulness of our techniques we apply them to a real system. We next present results from applying MMGP to Cell.

6.4 Experimental Validation and Results

We use MMGP to derive multi-grain parallelization schemes for two bioinformatics applications, RAxML and PBPI, described in Chapter 3, on a shared-memory dual Cell blade, the IBM QS20. Although we use only two applications in our experimental evaluation, we should point out that these are complete applications used for real-world biological data analyses, and that they are fully optimized for the Cell BE using an arsenal of optimizations, including vectorization, loop unrolling, double buffering, if-conversion and dynamic scheduling. Furthermore, these applications have inherent multi-grain concurrency and non-trivial scaling properties in their phases, therefore scheduling them optimally on Cell is a challenging exercise for MMGP. Lastly, in the absence of comprehensive suites of benchmarks (such as NAS or SPEC HPC) ported to Cell, optimized, and made available to the community by experts, we opted to use PBPI and RAxML, codes for which we could verify that enough effort has been invested in Cell-specific parallelization and optimization.

6.4.1 MMGP Parameter Approximation

MMGP has eight free parameters: T_{HPU}, T_{APU}, C_{APU}, O_{offload}, g, T_{CSW}, O_{COL} and α_m. We estimate four of these parameters using micro-benchmarks.

α_m captures contention between processes or threads running on the PPE. This contention depends on the scheduling algorithm on the PPE. We estimate α_m under an event-driven scheduling model which oversubscribes the PPE with more processes than the number of hardware threads supported for simultaneous execution on the PPE, and switches between processes upon each off-loading event on the PPE [19].

To estimate α_m, we use a parallel micro-benchmark that computes the product of two M × M square matrices of double-precision floating point elements. Matrix-matrix multiplication involves O(n^3) computation and O(n^2) data transfers, thus stressing the impact of sharing execution resources and the L1 and L2 caches between processes on the PPE. We used several different matrix sizes, ranging from 100 × 100 to 500 × 500, to exercise different levels of pressure on the thread-shared caches of the PPE. In the MMGP model, we use the mean of α_m obtained from these experiments, which is 1.28.
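A contention factor of this kind can be estimated by timing the same kernel alone and under competition. The sketch below is our illustration of that idea, not the exact micro-benchmark used here: it times a matrix-matrix product solo, forks a second identical process, times the product again, and takes the ratio.

```c
/* Sketch of how a contention factor like alpha_m can be estimated: time a
 * matrix-matrix multiplication running alone, then time it again while a
 * second, identical process competes for shared execution resources and
 * caches, and take the ratio.  This is an illustration of the idea, not the
 * exact micro-benchmark used in the dissertation. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define M 300   /* matrix dimension; the text sweeps sizes from 100 to 500 */

static double a[M][M], b[M][M], c[M][M];

static double matmul_seconds(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            double s = 0.0;
            for (int k = 0; k < M; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    double solo = matmul_seconds();        /* baseline: no competing process */

    pid_t pid = fork();                    /* spawn a competing, identical process */
    if (pid == 0) { matmul_seconds(); _exit(0); }
    double shared = matmul_seconds();      /* measured while the child also runs */
    waitpid(pid, NULL, 0);

    printf("solo: %.3f s, with contention: %.3f s, alpha ~ %.2f\n",
           solo, shared, shared / solo);
    return 0;
}
```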

PPE-SPE communication is optimally implemented through DMAs on Cell. We devised a ping-pong micro-benchmark using DMAs to send a single word from the PPE to one SPE and back. We measured the PPE→SPE→PPE round-trip communication overhead (O_{offload}) to be 70 ns. To measure the overhead caused by various collective communications we used

mpptest [55] on the PPE. Using a micro-benchmark that repeatedly executes the sched_yield() system call, we estimate the overhead caused by context switching (T_{CSW}) on the PPE to be 2 µs. This is a conservative upper bound on the context switching overhead, since it includes some user-level library overhead.
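A minimal version of such a measurement is sketched below, assuming only the standard POSIX sched_yield() interface; with a single runnable process it mostly measures the cost of entering and leaving the scheduler, which is consistent with treating the result as an upper bound.

```c
/* Sketch of the context-switch micro-benchmark: repeatedly call
 * sched_yield() and divide the elapsed time by the number of calls.
 * With a single runnable process this mostly measures the cost of entering
 * and leaving the scheduler, so it is a rough upper bound rather than an
 * exact context-switch time. */
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long iters = 1000000;
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (long i = 0; i < iters; i++)
        sched_yield();
    gettimeofday(&t1, NULL);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("sched_yield: %.2f us per call\n", 1e6 * elapsed / iters);
    return 0;
}
```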

T_{HPU}, T_{APU}, C_{APU} and the gap g between consecutive DMAs on the PPE are application-dependent and cannot be approximated easily with a micro-benchmark. To estimate these parameters, we use a profile of a sequential run of the code, with tasks off-loaded to one SPE. We use timing instructions inserted into the applications at specific locations. To estimate T_{HPU} we measure the time that applications spend on the HPU. To estimate T_{APU} and C_{APU} we measure the time that applications spend on the accelerators, in large computational loops which can be parallelized (T_{APU}), and in the sequential accelerator code outside of the large loops (C_{APU}). To estimate g, we measure the time intervals between consecutive task off-loads and task completions.

6.4.2 Case Study I: Using MMGP to Parallelize PBPI

PBPI with One Dimension of Parallelism

We compare the PBPI execution times predicted by MMGP to the actual execution times obtained on real hardware, using various degrees of PPE and SPE parallelism, i.e. the equivalents of HPU and APU parallelism on Cell. These experiments illustrate the accuracy of MMGP on a sample of the feasible program configurations. The sample includes one-dimensional decompositions of the program between PPE threads, with simultaneous off-loading of code to one SPE from each PPE thread; one-dimensional decompositions of the program between SPE threads, where the execution of tasks on the PPE is sequential and each task off-loads code which is data-parallel across SPEs; and two-dimensional decompositions of the program, where multiple tasks run on the PPE threads concurrently and each task off-loads code which is data-parallel across SPEs. In all cases, the SPE code is SIMDized in the innermost loops, to exploit the vector units of the SPEs. We believe that this sample of program configurations is representative of what a user would reasonably experiment with while trying to optimize the codes on the Cell.

Figure 6.6: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism.

For these experiments, we used the arch107 L10000 input data set. This data set consists of 107 sequences, each with 10,000 characters. We run PBPI with one Markov chain for 20,000 generations. Using the time base register on the PPE and the decrementer register on one SPE, we obtained the following model parameters for PBPI: T_{HPU} = 1.3s, T_{APU} = 370s, g = 0.8s and O = 1.72s.

Figure 6.6 compares MMGP and actual execution times for PBPI, when PBPI exploits only one-dimensional PPE (HPU) parallelism, in which each PPE thread uses one SPE for off-loading. We execute the code with up to 16 MPI processes, which off-load code to up to 16 SPEs on two Cell BEs. Referring to Equation 6.11, we set p = 1 and vary the value of m between 1 and 8. The X-axis shows the number of processes running on the PPE (i.e. HPU parallelism), and the Y-axis shows the predicted and measured execution times. The maximum prediction error of MMGP is 5%. The arithmetic mean of the error is 2.3% and the standard deviation is 1.4.

Figure 6.7 illustrates predicted and actual execution times when PBPI uses one dimension of SPE (APU) parallelism. Referring to Equation 6.11, we set m = 1 and vary p. MMGP remains accurate: the mean prediction error is 4.1% and the standard deviation is 3.2. The maximum prediction error in this case is higher (approaching 10%) when the APU parallelism increases and the code uses SPEs on both Cell processors. A closer inspection of this result reveals that the data-parallel implementation of tasks in PBPI stops scaling beyond the 8 SPEs confined in one Cell processor, because of DMA bottlenecks and non-uniformity in the latency of memory accesses by the two Cell processors on the blade. Capturing the DMA bottlenecks requires the introduction of a model of DMA contention in MMGP, while capturing the NUMA bottleneck would require an accurate memory hierarchy model integrated with MMGP. The NUMA bottleneck can be resolved by a better page placement policy implemented in the operating system. We intend to examine these issues in our future work. For the purposes of this chapter, it suffices to observe that MMGP is accurate enough despite its generality. As we show later, MMGP accurately predicts the optimal mapping of the program to the Cell multiprocessor, regardless of inaccuracies in execution time prediction in certain edge cases.

Figure 6.7: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation.

PBPI with Two Dimensions of Parallelism

Multi-grain parallelization aims at exploiting task-level and data-level parallelism in PBPI simultaneously. We only consider multi-grain parallelization schemes in which deg(HPU) · deg(APU) ≤ 16, i.e. the total number of SPEs (APUs) on the dual-processor Cell blade used in this study. deg() denotes the degree of a layer of parallelism, which corresponds to the number of SPE or PPE threads used to run the code. Figure 6.8 shows the predicted and actual execution times of PBPI for all feasible combinations of multi-grain parallelism under the aforementioned constraint. MMGP's mean prediction error is 3.2%, the standard deviation of the error is 2.6 and the maximum prediction error is 10%. The important observation in these results is that MMGP agrees with the experimental outcome in terms of the mix of PPE and SPE parallelism to use in PBPI for maximum performance. In a real program development scenario, MMGP would point the programmer in the direction of using both task-level and data-level parallelism, with a balanced allocation of PPE contexts and SPEs between the two layers.

6.4.3 Case Study II: Using MMGP to Parallelize RAxML

RAxML with a Single Layer of Parallelism

The units of work (bootstraps) in RAxML are distributed evenly between MPI processes, therefore the degree of PPE (HPU) concurrency is bound by the number of MPI processes. As discussed in Section 6.3.3, the degree of HPU concurrency may exceed the number of HPUs, so that on an architecture with more APUs than HPUs the program can expose more concurrency to APUs. The degree of SPE (APU) concurrency may vary per MPI process. In practice, the degree of PPE concurrency cannot meaningfully exceed the total number of SPEs available on the system, since that many MPI processes can already utilize all available SPEs via simultaneous off-loading. Similarly to PBPI, each MPI process in RAxML can exploit multiple SPEs via data-level parallel execution of off-loaded tasks across SPEs. To enable maximal PPE and SPE concurrency in RAxML, we use a version of the code scheduled by a Cell BE event-driven scheduler [19], in which context switches on the PPE are forced upon task off-loading, and PPE processes are served with a fair-share scheduler, so as to have even chances for off-loading on SPEs.

Figure 6.8: MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors.

We evaluate the performance of RAxML when each process performs the same amount of work, i.e. when the number of distributed bootstraps is divisible by the number of processes. The case of an unbalanced distribution of bootstraps between MPI processes can be handled with a minor modification to Equation 6.11, which scales the MMGP parameters by a factor of (⌈B/M⌉ · M)/B, where B is the number of bootstraps (tasks) and M is the number of MPI processes used to execute the code.
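A quick worked instance of this correction is sketched below; B = 16 and M = 6 are example values of our own choosing, not a configuration reported in the text.

```c
/* Worked example of the load-imbalance correction: with B bootstraps spread
 * over M MPI processes, the busiest process executes ceil(B/M) of them, so
 * the balanced-case MMGP prediction is scaled by (ceil(B/M) * M) / B.
 * B = 16 and M = 6 are example values. */
#include <stdio.h>

int main(void)
{
    int B = 16, M = 6;
    int per_proc = (B + M - 1) / M;                 /* ceil(B/M) = 3   */
    double factor = (double)(per_proc * M) / B;     /* 18/16 = 1.125   */
    printf("scaling factor = %.3f\n", factor);
    return 0;
}
```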

We compare the execution time of RAxML to the time predicted by MMGP, using two input data sets. The first data set contains 10 organisms, each represented by a DNA sequence of 20,000 nucleotides. We refer to this data set as DS1. The second data set (DS2) contains 10 organisms, each represented by a DNA sequence of 50,000 nucleotides. For both data sets, we set RAxML to perform a total of 16 bootstraps using different parallel configurations.

Figure 6.9: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2.

The MMGP parameters for RAxML, obtained from profiling a sequential run of the code, are T_{HPU} = 3.3s, T_{APU} = 63s, C_{APU} = 104s for DS1, and T_{HPU} = 8.8s, T_{APU} = 118s, C_{APU} = 157s for DS2. The values of the other MMGP parameters are negligible compared to T_{APU}, T_{HPU}, and C_{APU}, therefore we disregard them for RAxML. Note that the off-loaded code that cannot be parallelized (C_{APU}) takes 57–62% of the execution time of a task on the SPE. Figure 6.9 illustrates the estimated and actual execution times of RAxML with up to 16 bootstraps, using one dimension of PPE (HPU) parallelism. In this case, each MPI process offloads tasks to one SPE, and SPEs are utilized by oversubscribing the PPE with more processes than the number of hardware threads available on the PPE. For DS1, the mean MMGP prediction error is 7.1%, the standard deviation is 6.4, and the maximum error is 18%. For DS2, the mean MMGP prediction error is 3.4%, the standard deviation is 1.9 and the maximum error is 5%.


Figure 6.10: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2.

Figure 6.10 illustrates estimated and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation functions across SPEs. We should point out that although both RAxML and PBPI perform maximum likelihood calculations in their computational cores, RAxML's loops have loop-carried dependencies that prevent scalability and parallelization in many cases [20], whereas PBPI's core computation loops are fully parallel and coarse enough to achieve scalability. The limited scalability of data-level parallelization of RAxML is the reason why we confine the executions with data-level parallelism to at most 8 SPEs. As shown in Figure 6.10, the data-level parallel implementation of RAxML does not scale substantially beyond 4 SPEs. When only APU parallelism is extracted from RAxML, for DS1 the mean MMGP prediction error is 0.9%, the standard deviation is 0.8, and the maximum error is 2%. For DS2, the mean MMGP prediction error is 2%, the standard deviation is 1.3 and the maximum error is 4%.

RAxML with Two Dimensions of Parallelism

Figure 6.11 shows the actual and predicted execution times in RAxML, when the code exposes two dimensions of parallelism to the system. Once again, regardless of execution time prediction accuracy, MMGP is able to pinpoint the optimal parallelization model, which in the case of RAxML is task-level parallelization with no further data-parallel decompositions of tasks between SPEs, as the opportunity for scalable data-level parallelization in the code is limited. Innermost loops in tasks are still SIMDized within each SPE. MMGP remains accurate, with mean execution time prediction error of 4.3%, standard deviation of 4, and maximum prediction error of 18% for DS1, and mean execution time prediction error of 2.8%, standard deviation of 1.9, and maximum prediction error of 7% for DS2. It is worth noting that although the two codes tested are fundamentally similar in their computational core, their optimal parallelization model is radically different. MMGP accurately reflects this disparity, using a small number of parameters and rapid prediction of execution times across a large number of feasible program configurations.

6.4.4 MMGP Usability Study

We demonstrate a practical use of MMGP through a simple usability study. We modified PBPI to execute an MMGP sampling phase at the beginning of the execution. During the sampling phase, the application is profiled and all MMGP parameters are determined. After finishing the sampling phase, MMGP estimates the optimal configuration and the application is executed with the MMGP-recommended configuration. The profiling, sampling and MMGP actuation phases are performed automatically, without any user intervention. We set PBPI to execute 10^6 generations, since this is the number of generations typically required by biologists. We set the sampling phase to be 10,000 generations. Even with the overhead introduced by the sampling phase included in the measurements, the configuration provided by MMGP outperforms all other configurations by margins ranging from 1.1% (compared to the next best configuration identified via an exhaustive search) to a factor of 4 (compared to the worst configuration identified via an exhaustive search). The sampling phase takes in the worst case 2.2% of the total execution time, but completely eliminates the exhaustive search that would otherwise be necessary to find the best mapping of the application to the Cell architecture. Figure 6.12 illustrates the overhead of the sampling phase with the PBPI application.

Figure 6.11: MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism.

Figure 6.12: Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000).

6.5 Chapter Summary

The introduction of accelerator-based parallel architectures complicates the problem of mapping algorithms to systems, since parallelism can no longer be considered as a one-dimensional abstraction of processors and memory. We presented a new model of multi-dimensional parallel computation, MMGP, which we introduced to relieve users from the arduous task of mapping parallelism to accelerator-based architectures. We have demonstrated that the model is fairly accurate, albeit simple, and that it is extensible and easy to specialize for a given architecture. We envision three uses of MMGP: i) As a rapid prototyping tool for porting algorithms to accelerator-based architectures. More specifically, MMGP can help users derive not only a decomposition strategy, but also an actual mix of programming models to use in the application in order to best utilize the architecture, while using architecture-independent programming techniques. ii) As a compiler tool for assisting compilers in deriving efficient mappings of programs to accelerator-based architectures automatically. iii) As a runtime tool for dynamic control of parallelism in applications, whereby the runtime system searches for optimal program configurations in the neighborhood of optimal configurations derived by MMGP, using execution time sampling or prediction-based techniques.

Chapter 7

Scheduling Asymmetric Parallelism on a PS3 Cluster

7.1 Introduction

Cluster computing is already feeling the impact of multi-core processors [30]. Several highly ranked entries of the latest Top-500 list include clusters of commodity dual-core processors (see http://www.top500.org). The availability of abundant chip-level and board-level parallelism changes fundamental assumptions that developers currently make while writing software for HPC clusters. While recent work has improved our understanding of the implications of small-scale symmetric multi-core processors on cluster computing [7], emerging asymmetric multi-core processors such as the Cell/BE, and boards with conventional processors and hardware accelerators such as GPUs, are rapidly making their way into HPC clusters [94]. There are strong incentives that support this trend, not the least of which is higher performance with higher energy-efficiency made possible through asymmetric, rather than symmetric, multi-core processor organizations [57].

Understanding the implications of asymmetric multi-core processors on cluster computing and providing models and software support to ease the migration of parallel programs to these


platforms is a challenging and relevant problem. This study makes four contributions:

i) We conduct a performance analysis of a Linux cluster of Sony PlayStation3 (PS3) nodes. To the best of our knowledge, this is the first study to evaluate this cost-effective and unconventional HPC platform with microbenchmarks and realistic applications from the area of bioinformatics. The cluster we used has 22 PS3 nodes connected with a GigE switch and was built at Virginia Tech for less than $15,000. We first evaluate the performance of MPI collective and point-to-point communication on the PS3 cluster, and explore the scalability of MPI communication operations under contention for bandwidth within and across PS3 nodes. We then evaluate the performance and scalability of the PS3 cluster with bioinformatics applications. Our analysis reveals the sensitivity of computation and communication to the mapping of asymmetric parallelism to the cluster and the importance of coordinated scheduling across multiple layers of parallelism. Optimal scheduling of MPI codes on the PS3 cluster requires coordinated scheduling and mapping of at least three layers of parallelism (two layers within each Cell processor and an additional layer across Cell processors), and the optimal mapping and schedule change with the application, the input data set, and the number of nodes used for execution.
ii) We adapt and validate MMGP on the PS3 cluster. We model a generic heterogeneous cluster built from compute nodes with front-end host cores and back-end accelerator cores. The extended model combines analytical components with empirical measurements to navigate the optimization space for mapping MPI programs with nested parallelism on the PS3 cluster. Our evaluation of the extended MMGP model shows that it estimates execution time with an average error rate of 5.2% on a cluster composed of PlayStation3 nodes. The model captures the effects of application characteristics, input data sets, and cluster scale on performance. Furthermore, the model pinpoints optimal mappings of MPI applications to the PS3 cluster with remarkable accuracy.
iii) Using the cluster of PlayStation3 nodes, we analyze previously proposed user-level scheduling heuristics for co-scheduling threads (Chapter 5). We show that the co-scheduling algorithms yield significant performance improvements (1.7–2.7×) over the native OS scheduler in MPI applications. We also explore the trade-off between different co-scheduling policies that selectively spin or yield the host cores, based on runtime prediction of task execution lengths on the accelerator cores.
iv) We present a comparison between our PS3 cluster and an IBM QS20 blade cluster (based on the Cell/BE), illustrating that despite important limitations in computational ability and the communication substrate, the PS3 cluster is a viable platform for HPC research and development.

The rest of this chapter is organized as follows: Section 7.2 presents our experimental platform. Section 7.3 presents our performance analysis of the PS3 cluster. Section 7.4 presents the extended model of hybrid parallelism and its validation. Section 7.5 presents co-scheduling policies for clusters of asymmetric multi-core processors and evaluates these policies. Section 7.6 compares the PS3 cluster against an IBM QS20 Cell-blade cluster. Section 7.7 concludes the chapter.

7.2 Experimental Platform

Our experimental platform for this thesis is a cluster of 22 PS3 nodes, 8 of which were available to us in dedicated mode for the purposes of this work. The PS3 nodes are connected to a 1000BASE-T Gigabit Ethernet switch, which supports 96 Gbps switching capacity. Each PS3 runs Linux FC5 with kernel version 2.6.16, compiled for the 64-bit PowerPC architecture with platform-specific kernel patches for managing the heterogeneous cores of the Cell/BE. The nodes communicate with LAM/MPI 7.1.1. We used the IBM Cell SDK 2.1 for intra-Cell/BE parallelization of the MPI codes. The Linux kernel on the PS3 runs on top of a proprietary hypervisor. Though some devices are accessed directly, the built-in Gigabit Ethernet controller in the PS3 is accessed via hypervisor calls, therefore communication performance is not optimized.

7.3 PS3 Cluster Scalability Study

7.3.1 MPI Communication Performance

As we extend our MMGP model to the cluster of PlayStation3 machines, the cost of MPI calls becomes a more significant parameter of the prediction model than it is for a single machine. To study how the MMGP model scales in the new environment, we experimented with the real-world parallel computing applications PBPI and RAxML.

To measure communication performance on the PS3 cluster, we use mpptest [56]. We present mpptest results only for the two MPI communication primitives which dominate communication time in our application benchmarks. Figure 7.1 shows the overhead of MPI_Allreduce() with various message sizes. Each data point represents a number and a distribution of MPI processes between PS3 nodes. For any given number of PS3 nodes, we use 1 to 6 MPI processes, using shared memory for communication within the PS3. Our evaluation methodology stresses the impact of contention for communication bandwidth both within and across PS3 nodes. There is a benefit in exploiting shared memory for communication between MPI processes on each PS3. For example, collective operations between 8 processes running on 2 PS3 nodes are up to 30% faster than collective operations between 8 processes running across 8 PS3 nodes. However, there is also a noticeable penalty for oversubscribing each PPE with more than two processes, due to OS overhead, despite our use of blocking shared memory communication within each PS3. Similar observations can be made for point-to-point communication (Figure 7.2), although the effect of using shared memory within each PS3 is less pronounced and the effect of oversubscribing the PPE is more pronounced.
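For readers who want to reproduce this kind of number, the sketch below shows the shape of such a measurement; it is our minimal illustration, not the mpptest benchmark actually used, and the message size and iteration count are arbitrary.

```c
/* Minimal sketch of the kind of measurement reported in Figures 7.1 and 7.2:
 * time many MPI_Allreduce() calls on a buffer of doubles and report the mean
 * latency per call.  This is an illustration, not the mpptest benchmark
 * actually used for the measurements. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 256;     /* number of doubles per reduction (arbitrary) */
    const int iters = 1000;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in = malloc(count * sizeof(double));
    double *out = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) in[i] = 1.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Allreduce(%d doubles): %.1f usec per call\n",
               count, 1e6 * (t1 - t0) / iters);

    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```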

7.3.2 Application Benchmarks

We evaluate the performance of two state-of-the-art phylogenetic tree construction codes, RAxML and PBPI, described in Chapter 3, on our PS3 cluster. The applications have been painstakingly optimized for the Cell BE, using vectorization, loop unrolling and tiling, branch optimizations, double buffering, and optimized numerical implementations of kernels utilizing fast single-precision arithmetic to implement double-precision operations. The optimization process is described in Chapter 4. Both RAxML and PBPI are capable of exploiting multiple levels of PPE and SPE parallelism. We used a task off-loading execution model in the codes. The execution commences on the PPE, and SPEs are used for accelerating computation-intensive loops. The off-loaded loops are parallelized across SPEs and vectorized within SPEs. The number of PPE processes and the number of SPEs per PPE process are user-specified.

Figure 7.1: MPI_Allreduce() performance on the PS3 cluster: (a) latency for a single double; (b) latency for arrays of doubles. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node.

Figure 7.2: MPI_Send()/MPI_Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node.

When the PPE is oversubscribed with more than two processes, the processes are scheduled using the event-driven task-level parallelism (EDTLP) scheduler described in Chapter 5. Each process executes until the point at which it off-loads an SPE task and then releases the PPE while waiting for the off-loaded task to complete. The same process resumes execution on the PPE only after all other processes have off-loaded SPE tasks at least once. The RAxML and PBPI ports on the PS3 cluster are adaptations of the original MPI codes and are capable of executing in a distributed environment. No algorithmic modifications have been applied to the applications.
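The off-load-and-yield control flow can be sketched as below. The task type and the two helpers are stand-ins of our own (the real scheduler posts work to an SPE and receives a completion signal, for example through a mailbox); only the pattern of off-loading and then yielding the PPE, rather than spinning, is taken from the description above, and the full fair-share policy is not shown.

```c
/* Sketch of the off-load-and-yield behavior of the EDTLP scheduler.
 * The task structure and the two helpers are stand-ins: in the real
 * scheduler, posting a task means sending its code and arguments to an SPE,
 * and completion is signalled back by the SPE.  Here the stub completes the
 * task immediately so the sketch is runnable; only the control flow
 * (off-load, then release the PPE instead of spinning) is the point. */
#include <sched.h>
#include <stdio.h>

struct spe_task { volatile int done; };

static void offload_task_to_spe(struct spe_task *t)
{
    /* Stand-in: the real code ships the task to an SPE and returns at once. */
    t->done = 1;
}

static int spe_task_done(const struct spe_task *t)
{
    return t->done;   /* real code: poll a completion mailbox/flag */
}

static void run_offloaded(struct spe_task *t)
{
    offload_task_to_spe(t);
    while (!spe_task_done(t))
        sched_yield();      /* release the PPE while the SPE computes */
}

int main(void)
{
    struct spe_task t = { 0 };
    run_offloaded(&t);
    printf("task completed\n");
    return 0;
}
```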

For each application we use three data sets, briefly termed small, medium, and large. The large data set occupies the entire memory of a PS3, minus memory used by the operating system and the hypervisor. The medium and small data sets occupy 40% and 15% of the free memory of the PS3 respectively. In PBPI the small, medium, and large data sets represent 218 species with 1250, 3000, and 5000 nucleotides respectively. In RAxML, the small, medium, and large datasets represent 42, 50, and 107 species respectively. We execute PBPI using weak scaling, i.e. we scale the data set as we add more PS3 nodes, which is the recommended execution mode. For RAxML we use strong scaling, since the application uses a master-worker paradigm, where each worker performs independent, parameterized phylogenetic tree bootstrapping and processes the entire tree independently. Workers are distributed between nodes to maximize throughput. We perform 192 bootstraps, which is a realistic workload for real-world phylogenetic analysis.

Figure 7.3 illustrates the measured execution times of RAxML and PBPI on the PS3 cluster. The predicted execution times on the same charts are derived from the extended MMGP model, which is discussed in Section 7.4. We make three observations regarding measured performance: i) The PS3 cluster scales well under strong scaling (RAxML) and relatively well under weak scaling (PBPI), for the problem sizes considered. PBPI is more communication-bound than RAxML, as it involves several collective operations between executions of its Markov-chain Monte Carlo kernel. We note that due to the hypervisor of the PS3 and the lack of Cell/BE-specific optimization of the MPI library we used, the performance measurements on the PS3 cluster are conservative. ii) The optimal layered decomposition of the applications is at the opposite ends of the optimization space. RAxML executes optimally if the PPE on each PS3 is oversubscribed by 6 MPI processes, each off-loading simultaneously on 1 SPE. PBPI generally executes optimally

91 500 Measured−L PBPI 450 Predicted−L 400 Measured−M Predicted−M 350 Measured−S Predicted−S 300 250 200

Execution Time (sec) 150 100 50

(1,1,6) (1,2,3) (1,3,2) (1,6,1) (2,1,6) (2,2,3) (2,3,2) (2,6,1) Configuration (3,1,6) (3,2,3) (3,3,2) (3,6,1) (4,1,6) (4,2,3) (N (4,3,2) (4,6,1) (5,1,6) , N (5,2,3) (5,3,2) (5,6,1) (6,1,6) , N (6,2,3) (6,3,2) ) (6,6,1) (7,1,6) (7,2,3) (7,3,2) (7,6,1) (8,1,6) (8,2,3) (8,3,2) (8,6,1) node process SPE 4 x 10 2.5 Measured−L RAxML Predicted−L 2 Measured−M Predicted−M Measured−S 1.5 Predicted−S

1 Execution Time (sec) 0.5

0

(1,1,6) (1,2,3) (1,3,2) (1,6,1) (2,1,6) (2,2,3) (2,3,2) (2,6,1) Configuration (3,1,6) (3,2,3) (3,3,2) (3,6,1) (4,1,6) (4,2,3) (N (4,3,2) (4,6,1) (5,1,6) , N (5,2,3) (5,3,2) (5,6,1) (6,1,6) , N (6,2,3) (6,3,2) ) (6,6,1) (7,1,6) (7,2,3) (7,3,2) (7,6,1) (8,1,6) (8,2,3) (8,3,2) (8,6,1) node process SPE

Figure 7.3: Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.

92 with 1 MPI process per PS3 using all 6 SPEs for data-parallel computation, as this configuration reduces inter-PS3 communication volume and avoids PPE contention. iii) Although the optimal layered decomposition does not change with the problem size for the three data sets used in this experiment2, it changes with the scale of the cluster. When PBPI is executed with 8 SPEs, the optimal operating point of the code shifts from 1 to 2 MPI processes per node, each off-loading simultaneously on 3 SPEs. We have verified with an out-of-band experiment that this shift is permanent beyond 8 SPEs. This shift happens because of the large drop in the per process overhead of MPI Allreduce() (Figure 7.1), when 2 MPI processes are packed per node, on 3 or more PS3 nodes. This drop is large enough to outweigh the over- head due to contention between MPI processes on the PPE. The difficulty in experimentally discovering the hardware and software implications on the optimal mapping of applications to asymmetric multi-core clusters motivates the introduction of an analytical model presented in the next section.

7.4 Modeling Hybrid Parallelism

We present an analytical model of layered parallelism on clusters of asymmetric multi-core nodes, which is a generalization of the model of parallelism on stand-alone asymmetric multi-core processors (MMGP) presented in Chapter 6. Our generalized model captures computation and communication across nodes with host cores and acceleration cores. We specialize the model for the PS3 cluster to capture the overhead of non-overlapped DMA operations, wait times during communication operations in the presence of contention for bandwidth both within and across nodes, and non-overlapped scheduling overhead on the PPEs. In the rest of the section we present an overview of the MMGP model and discuss the extensions related to context-switch overhead and to on-chip and inter-node communication.

²The optimal decomposition does not change with the data set; however, as we show in Section 7.5, the optimal scheduling of an application may change with the data set.

We model the non-overlapped components of execution time on the Cell/BE's PPE and SPE, for single-threaded PPE code which off-loads to one SPE, as:

T = (Tppe + Oppe) + (Tspe + Ospe) (7.1)

where Tppe and Tspe represent non-overlapped computation, while Oppe and Ospe represent non-overlapped overhead on the PPE and SPE, respectively. We apply this model to each phase of parallel computation individually; phases are separated by collective communication operations.

7.4.1 Modeling PPE Execution Time

The overhead on the PPE includes instructions and DMA operations to off-load data and code to SPEs, and wait time for receiving synchronization signals from SPEs on the PPE.

Assuming that multiple PPE threads can simultaneously off-load computation, we introduce an additional factor for context switching overhead on the PPE. This factor depends on the thread scheduling algorithm on the PPE. In the general case, Oppe for code off-loaded from a single PPE thread to l SPEs is modeled as:

Oppe = l · Ooff−load + Tcsw(p) (7.2)

We assume that a single PPE thread off-loads to multiple SPEs sequentially and that the context switching overhead is a function of the number of threads co-executed on the PPE, which is denoted by p. Ooff−load is application-dependent and includes DMA setup overhead, which we measure with microbenchmarks. Tcsw depends on system software and includes the context switching overhead for p/C context switches, where C is the number of hardware contexts on the PPE. The overhead per context switch is also measured with microbenchmarks.

Figure 7.4: Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling.

If a hardware thread on the PPE is oversubscribed with multiple application threads, the computation time of each thread may increase due to on-chip resource contention. To accurately model this case, we introduce a scaling parameter α(p) for the PPE computation component, which depends on the number of threads co-executed on the PPE. The PPE component of the model therefore becomes α(p) · Tppe + Oppe. The factor α(p) is estimated using linear regression with one free parameter, the number of threads sharing a PPE hardware thread, and coefficients derived from training samples of Tppe taken during executions of a microbenchmark that oversubscribes the PPE with 3-6 threads and executes a parameterized ratio of computation to memory instructions.
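To make the estimation procedure concrete, the following is a minimal sketch of an ordinary least-squares fit of the PPE interference factor against the number of co-running threads. The sample values are placeholders rather than measured data, and the linear form alpha(p) = a + b*p is only one plausible choice for the regression.

#include <stdio.h>

/* Fit slowdown = a + b * p by ordinary least squares. */
static void fit_alpha(const double *p, const double *slowdown, int n,
                      double *a, double *b)
{
    double sp = 0, ss = 0, spp = 0, sps = 0;
    for (int i = 0; i < n; i++) {
        sp  += p[i];
        ss  += slowdown[i];
        spp += p[i] * p[i];
        sps += p[i] * slowdown[i];
    }
    *b = (n * sps - sp * ss) / (n * spp - sp * sp);
    *a = (ss - *b * sp) / n;
}

int main(void)
{
    /* Placeholder training samples: Tppe(p threads) / Tppe(1 thread),
       as measured with the oversubscription microbenchmark. */
    double p[]        = { 3.0, 4.0, 5.0, 6.0 };
    double slowdown[] = { 1.4, 1.7, 2.1, 2.4 };
    double a, b;

    fit_alpha(p, slowdown, 4, &a, &b);
    printf("alpha(p) ~ %.2f + %.2f * p\n", a, b);
    return 0;
}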

Figure 7.5: SPE execution. (Trec, Tsen - PPE-SPE communication stages; Tc - communication overhead; Tp - parallel computation; Ts - sequential computation.)

The formulation of Tppe derived thus far ignores additional wait time of threads on the PPE due to lack of co-scheduling between a PPE thread and an SPE thread off-loaded from it. This scenario arises when the PPE hardware threads are time-shared between application threads, as shown in Figure 7.4(a). Ideal co-scheduling requires accurate knowledge of the execution time of tasks on SPEs by both the operating system and the runtime system. This knowledge is not generally available. Our model assumes an idealized co-scheduling scenario: SPE tasks for a given phase of computation are assumed to be of the same execution length and are off-loaded in bundles with as many tasks per bundle as the number of SPEs on a Cell/BE. We also assume that the SPE execution time of the first task is long enough to allow for idealized co-scheduling, i.e., each PPE thread that off-loads a task is rescheduled on the PPE in time to immediately receive the signal from the corresponding finishing SPE task. We explore this scheduling problem under more realistic assumptions in Section 7.5 and propose solutions.

7.4.2 Modeling the Off-loaded Computation

Execution on SPEs is divided into stages, as shown in Figure 7.5. Tspe is modeled as:

Tspe = Tp + Ts (7.3)

Tp denotes the computation executed in parallel by more than one SPE; an example is a parallel loop distributed across SPEs. Ts denotes the part of the off-loaded computation that is inherently sequential and cannot be parallelized across SPEs.

When l SPEs are used for parallelization of off-loaded code, the Tspe term becomes:

Tspe = Tp / l + Ts    (7.4)

The accelerated execution on SPEs includes three more stages, shown in Figure 7.5. Trec and Tsen account for PPE-SPE communication latency, while Tc captures the SPE overhead that occurs when an SPE sends a message to or receives a message from the PPE. The per-byte latencies for Trec, Tsen, and Tc are application-independent and are obtained from microbenchmarks designed to stress PPE-SPE communication. Tp and Ts are application-dependent and are obtained from a profile of a sequential run of the application, annotated with directives that delimit the code off-loaded to SPEs.

7.4.3 DMA Modeling

Each SPE on the Cell/BE is capable of moving data between main memory and local storage while at the same time executing computation. To overlap computation and communication, applications use loop tiling and double buffering, which are illustrated in pseudocode in Figure 7.6.

1: DMA(Fetch Iteration 1, TAG1);
2: DMA_Wait(TAG1);

3: for( ... ){
4:    DMA(Fetch Iteration i+1, TAG1);
5:    compute(Iteration i);
6:    DMA_Wait(TAG1);
7:    DMA(Commit Iteration i, TAG2);
   }

8: DMA_Wait(TAG2);

Figure 7.6: Double buffering template for tiled parallel loops.
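For concreteness, the fragment below is a minimal SPE-side sketch of the same template, written against the MFC intrinsics from spu_mfcio.h in the IBM SDK (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The tile size, the kernel (a simple reduction, so no commit DMA is needed), and the address layout are illustrative and are not taken from RAxML or PBPI.

#include <spu_mfcio.h>
#include <stdint.h>

#define TILE  2048                      /* floats per tile (illustrative) */
#define TSIZE (TILE * sizeof(float))

static volatile float buf[2][TILE] __attribute__((aligned(128)));

/* ea: effective address of the input array in main memory (passed by the
   PPE); n_tiles: number of tiles assigned to this SPE. */
float sum_tiles(uint64_t ea, int n_tiles)
{
    float acc = 0.0f;

    /* Fetch tile 0; this DMA cannot overlap with any computation
       (lines 1-2 of the template in Figure 7.6). */
    mfc_get(buf[0], ea, TSIZE, 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();

    for (int i = 0; i < n_tiles; i++) {
        int cur = i & 1, nxt = cur ^ 1;

        /* Prefetch tile i+1 under tag `nxt` while computing tile i (line 4). */
        if (i + 1 < n_tiles)
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * TSIZE, TSIZE, nxt, 0, 0);

        for (int j = 0; j < TILE; j++)   /* compute on tile i (line 5) */
            acc += buf[cur][j];

        /* Block until the prefetch of tile i+1 completes (line 6); on the
           last iteration the tag group is empty and this returns at once. */
        mfc_write_tag_mask(1 << nxt);
        mfc_read_tag_status_all();
    }
    return acc;
}

In MMGP terms, the initial fetch before the loop (and, in codes that write results back, the final commit after it) is exactly the non-overlapped DMA charged to TDMA in Equation 7.5.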

When double buffering is used, the off-loaded loop can be either computation bound or communication bound: if the amount of computation in a single iteration of the loop is sufficient to completely mask the latency of fetching the data needed for the next iteration, the loop is computation bound; otherwise, the loop is communication bound.

Note that a parallel off-loaded loop can be described using Equation 7.4, independently of whether the parallel part of the loop is computation or communication bound. In both cases, the loop iterations are assumed to be distributed evenly across SPEs, and blocking DMA accesses can be interspersed with computation in the loop. With double buffering, the DMA request used to fetch data for the first iteration, as well as the DMA request necessary to commit data to main memory after the last iteration, can neither be overlapped with computation nor distributed (lines 2 and 8 in Figure 7.6). We capture the effect of blocking and non-overlapped DMA in the model as:

Ospe = Trec + Tsen + Tc + TDMA (7.5)

The last term in Equation 7.5 is itemized into the blocking DMAs performed within loop iterations and the non-overlapped DMAs exposed when the loop is unrolled, tiled, and executed with double buffering. We use static analysis of the code to capture the DMA sizes.

7.4.4 Cluster Execution Modeling

We generalize our model of a single asymmetric multi-core processor to a cluster by introducing an inter-processor communication component as:

T = (Tppe + Oppe) + (Tspe + Ospe) + C (7.6)

We further decompose the communication term C into communication latency due to each distinct type of communication pattern in the program, including point-to-point and all-to-all communication. Assuming MPI as the programming model used to communicate across nodes or between address spaces within nodes, we use mpptest to estimate the MPI communication overhead for variable message sizes and communication primitives. The message sizes are captured by static analysis of the application code.
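As an illustration of how the extended model can be evaluated for a candidate decomposition, the sketch below combines Equations 7.2, 7.4, 7.5, and 7.6 for a single phase. The parameter values, the struct and function names, and the simple p/C form of the context-switch term are assumptions standing in for the microbenchmark, mpptest, and profiling inputs described above; the contention factor is also held fixed here, although α(p) varies with p.

#include <stdio.h>

/* Per-phase model inputs; in practice these come from microbenchmarks,
   mpptest, and a profile of a sequential run (values below are placeholders). */
typedef struct {
    double T_ppe, T_p, T_s;            /* application computation terms       */
    double O_offload, t_switch;        /* per-offload and per-switch overhead */
    double T_rec, T_sen, T_c, T_dma;   /* PPE-SPE and DMA overheads           */
    double C_comm;                     /* inter-process communication term    */
    double alpha;                      /* PPE contention factor alpha(p)      */
    int    hw_contexts;                /* C: hardware contexts on the PPE     */
} mmgp_params;

/* Predicted phase time for p PPE processes per node, each using l SPEs. */
static double mmgp_predict(const mmgp_params *m, int p, int l)
{
    double O_ppe = l * m->O_offload +
                   m->t_switch * ((double)p / m->hw_contexts);        /* Eq. 7.2 */
    double T_spe = m->T_p / l + m->T_s;                               /* Eq. 7.4 */
    double O_spe = m->T_rec + m->T_sen + m->T_c + m->T_dma;           /* Eq. 7.5 */
    return (m->alpha * m->T_ppe + O_ppe) + (T_spe + O_spe) + m->C_comm; /* Eq. 7.6 */
}

int main(void)
{
    mmgp_params m = { 0.8, 6.0, 0.3, 0.002, 0.004,
                      0.01, 0.01, 0.02, 0.05, 0.5, 1.2, 2 };
    /* Enumerate layered decompositions that use all 6 SPEs of a PS3. */
    int layouts[][2] = { {1, 6}, {2, 3}, {3, 2}, {6, 1} };
    for (int i = 0; i < 4; i++)
        printf("(%d procs, %d SPEs each): %.3f s\n",
               layouts[i][0], layouts[i][1],
               mmgp_predict(&m, layouts[i][0], layouts[i][1]));
    return 0;
}

In the full model, such per-phase predictions are accumulated across the application's phases and across the communication patterns that make up C, yielding the per-configuration estimates plotted in Figure 7.3.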

7.4.5 Verification

We verify our model by exhaustively executing PBPI and RAxML on all feasible layered decompositions that use 1 to 6 PPE threads, 1 to 6 SPEs per PPE, and up to 8 PS3 nodes. Figure 7.3(a),(b) illustrates that the model is accurate both in terms of predicting execution time and in terms of discovering optimal application decompositions and mappings for different cluster scales and data sets. The optimal decomposition may vary across multiple dimensions, including application characteristics, such as the granularity of off-loaded tasks and the frequency and size of communication and DMA operations, the size and structure of the data set used in the application, and the number of nodes available to the application for execution. Accurate modeling of the application under each scenario is valuable to tame the complexity of discovering the optimal decomposition and mapping experimentally. In our test cases, the model achieves error rates consistently under 15%; the mean error rate is 5.2%. The errors tend to be higher when the PPE is oversubscribed with a large number of processes, due to error in estimating the thread interference factor. For any given application, data set, and number of PS3 nodes, the model accurately predicts the optimal configuration and mapping in all 48 test cases.

7.5 Co-Scheduling on Asymmetric Clusters

Although our model projects optimal mappings of MPI applications on the PS3 cluster with high accuracy, it is oblivious to the implications of user-level and kernel-level scheduling on oversubscribed cores. More specifically, the model ignores cases in which PPE threads and SPE threads are not co-scheduled when they need to synchronize through shared memory. We explore user-level co-scheduling solutions to this problem.

The main objective of co-scheduling is to minimize slack time on SPEs, since SPEs bear the brunt of the computation in practical cases. This slack is minimized when, whenever a thread off-loaded to an SPE needs to communicate or synchronize with its originating thread on the PPE, the originating thread is running on a PPE hardware context.

As illustrated in Figure 7.4, different scheduling policies can have a significant impact on co-scheduling, slack, SPE utilization, and ultimately performance. In Figure 7.4(a), PPE threads spin while waiting for the corresponding off-loaded threads to return results from the SPEs. The time quantum allocated to each PPE thread by the OS can cause continuous mis-scheduling of PPE threads with respect to SPE threads.

In Figure 7.4(b), the user-level scheduler uses a yield-if-not-ready policy, which forces each PPE thread to yield the processor whenever a corresponding off-loaded SPE thread is pending completion. This policy can be implemented at user level by having PPE threads poll shared-memory flags that matching SPE threads set upon completion. Figure 7.7 illustrates the performance of this policy in PBPI and RAxML on a PS3 cluster, when the PPE on each node is oversubscribed with 6 MPI processes, each off-loading on 1 SPE (recall that the PPE is a two-way SMT processor). The results show that, compared to a scheduling policy which is oblivious to PPE-SPE co-scheduling (the native Linux scheduling policy), yield-if-not-ready achieves a performance improvement of 1.7-2.7× on a cluster composed of PS3 nodes.

Figure 7.7: Performance of the yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.

Yield-if-not-ready bounds the slack by the time needed to context switch across p - 1 PPE threads, where p is the total number of active PPE threads, but it can still cause temporary mis-scheduling and slack, as shown in Figure 7.4(c). Figure 7.4(d) illustrates an adaptive spinning policy, in which a thread either spins or yields the processor based on which thread is anticipated to off-load the soonest on an SPE. This policy uses a prediction which can be derived with various algorithms, the simplest of which uses the execution length of the most recently off-loaded task from any given thread as a predictor of the earliest time that the same thread will be ready to off-load in the future. The thread spins if it anticipates that it will be the first to off-load; otherwise it yields the processor.
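A minimal user-level sketch of the yield-if-not-ready policy is shown below. The completion flag is assumed to be a shared-memory location that the matching SPE thread sets when its task finishes; the names are illustrative, and the adaptive variant would replace the unconditional sched_yield() with a spin-or-yield decision based on the predicted next off-load times.

#include <sched.h>

/* Completion flag in shared memory, set to 1 by the matching SPE thread
   when the off-loaded task finishes (name and layout are illustrative). */
extern volatile unsigned int *spe_done;

/* Called on the PPE right after a task has been off-loaded. */
void wait_for_offload(void)
{
    while (*spe_done == 0)
        sched_yield();      /* not ready: give the PPE to another process */
    *spe_done = 0;          /* re-arm the flag for the next off-load      */
}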

Although the aforementioned adaptive policy can reduce accelerator slack compared to the yield-if-not-ready policy, it is still suboptimal, as it may mis-schedule threads due to variations in the execution lengths of consecutive tasks off-loaded by the same thread, or variations in the run lengths between any two consecutive off-loads on a PPE thread. We should also note that better policies, with tighter bounds on the maximum slack, can be obtained if the user-level scheduler is not oblivious to the kernel-level scheduler and vice versa. We devise and implement such policies in Chapter 8.

Figure 7.8 illustrates results when RAxML and PBPI are executed with various co-scheduling policies. Both applications are executed with variable sequence lengths (x-axis), hence variable SPE task sizes. In RAxML (Figure 7.8(a)), adaptive spinning performs better for small data sets, while yield-if-not-ready performs better for large data sets. In PBPI (Figure 7.8(b)), adaptive spinning outperforms yield-if-not-ready in all cases. In RAxML, the variance in the length of the off-loaded tasks increases with the size of the input sequence, which causes more mis-scheduling when the adaptive policy is used. In PBPI, the task length does not vary, which enables nearly optimal co-scheduling by the adaptive spinning policy. In general, the best co-scheduling algorithm can improve performance by more than 10%. We emphasize that the optimal co-scheduling policy changes with the data set; therefore, support for flexible co-scheduling algorithms in system software is essential on the PS3 cluster.

Figure 7.8: Performance of different scheduling strategies in PBPI and RAxML. x-axis: length of the input DNA sequence.

7.6 PS3 versus IBM QS20 Blades

We compare the performance of the PS3 cluster to a cluster of IBM QS20 dual-Cell/BE blades located at Georgia Tech. The Cell/BE processors on the QS20 have 8 active SPEs, and there may be other undisclosed microarchitectural differences. Furthermore, although both the QS20 cluster and the PS3 cluster use GigE, communication latencies tend to be markedly lower on the QS20 cluster, first due to the absence of a hypervisor, which is a communication bottleneck on the PS3 cluster, and second due to the exploitation of shared-memory communication between the two Cell/BE processors on each QS20, as opposed to the single Cell/BE processor on each PS3.

We present selected experimental data points where the two platforms use the same number of Cell processors. On the QS20 cluster, we use both Cell processors per node. Figure 7.9 illustrates the execution times of PBPI and RAxML on the two platforms. We report the execution time of the most efficient pair of application configuration and co-scheduling policy on any given number of Cell processors.

We observe that the performance of the PS3 cluster is reasonably close (within 14% to 27% for PBPI and 11% to 13% for RAxML) to the performance of the QS20 cluster. The difference is attributed to the reduced number of active SPEs per processor on the PS3 cluster (6 versus 8 on the QS20 cluster) and to the faster communication on the QS20 cluster. The difference between the two platforms is smaller for RAxML than for PBPI, as RAxML is not as communication-intensive.

Interestingly, if we compare data points with the same total number of SPEs (48 SPEs on 8 PS3s versus 48 SPEs on 6 QS20s), in RAxML the PS3 cluster outperforms the QS20 blade cluster.

Figure 7.9: Comparison between the PS3 cluster and an IBM QS20 cluster. x-axis: number of Cell processors; y-axis: execution time (sec).

This result does not indicate superiority of the PS3 hardware or system software, as we apply experimentally defined optimal decompositions and scheduling policies on both platforms. It rather indicates the implications of layered parallelization. Oversubscribing the QS20 with 8 MPI processes (versus 6 on the PS3) introduces significantly higher scheduling overhead and brings performance below that of the PS3. This result stresses our earlier observations on the necessity of models and better schedulers for asymmetric multi-core clusters.

7.7 Chapter Summary

We evaluated a very low-cost HPC cluster based on PS3 consoles and proposed a model of asymmetric parallelism and software support for orchestrating asymmetric parallelism extracted from MPI programs on the PS3 cluster. While the Sony PlayStation 3 has several limitations as an HPC platform, including limited storage and limited support for advanced networking, it has sufficient computational power, compared to vastly more expensive multi-processor blades, to form a solid experimental testbed for research on programming and runtime support for asymmetric multi-core clusters, before migrating software and applications to production-level asymmetric machines, such as the LANL RoadRunner. The model presented in this chapter accurately captures heterogeneity in the computation and communication substrates and helps the user or the runtime environment map layered parallelism effectively to the target architecture. The co-scheduling heuristics presented in this thesis increase parallelism and minimize slack on computational accelerators.

Chapter 8

Kernel-Level Scheduling

8.1 Introduction

The ideal scheduling policy, which minimizes the context-switching overhead, assumes that whenever an SPE communicates with the PPE, the corresponding PPE thread is scheduled and running on the PPE. In Chapter 7 we discussed the possibility of predicting the next-to-run thread on the PPE. We implemented a prototype of a scheduling strategy capable of predicting which process will be the next to run, and the results imply that predicting the next thread to run may be difficult, especially if the off-loaded tasks exhibit high variance in execution time.

As another approach to minimizing the context-switching overhead on the PPE, we investigate a user-level scheduler which is capable of influencing kernel scheduling decisions. We explore scheduling strategies in which the scheduler can decide not only when a process should release the PPE, but also which process will run next on the PPE. By reducing the response time related to scheduling on the PPE, our new approach also reduces the idle time that occurs on the SPE side while waiting for a new off-loaded task. We call our new scheduling strategy the Slack-minimizer Event-Driven Scheduler (SLEDS).

Besides improving the overall performance, the new scheduling strategy enables more accurate performance modeling. Although the MMGP model projects the most efficient mappings of MPI applications on the Cell processor with high accuracy, it is oblivious to the implications of user-level and kernel-level scheduling on oversubscribed cores. More specifically, the model ignores cases in which PPE threads and SPE threads are not appropriately co-scheduled. A scheduling policy in which the PPE threads are not arbitrarily scheduled by the OS scheduler introduces more regularity in the application execution and consequently improves the MMGP predictions.

Figure 8.1: Upon completing their assigned tasks, the SPEs send a signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE.

8.2 SLED Scheduler Overview

The SLED scheduler is invoked through user-level library calls which can easily be integrated into existing Cell applications. An overview of the SLED scheduler is illustrated in Figure 8.1. Each SPE thread, upon completing its assigned task, sends its own pid to the shared ready to run list, from where this information is further passed to the kernel. Using the knowledge of which SPE threads have finished processing their assigned tasks, the kernel can decide which process will run next on the PPE.
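One way the SPE side of this signal could look is sketched below, assuming each SPE thread knows the effective address of its exclusive entry in the shared list (passed in by the PPE at start-up) and that both the local copy and the entry satisfy the alignment rule for small DMAs; this is an illustration, not the dissertation's actual code.

#include <spu_mfcio.h>
#include <stdint.h>

/* Local-storage copy of the owning process's pid. Small DMA transfers
   require the local and effective addresses to share their low 4 bits,
   so both sides are kept 16-byte aligned. */
static volatile uint32_t my_pid __attribute__((aligned(16)));

/* slot_ea: effective address of this SPE's entry in the ready_to_run list. */
void signal_ready_to_run(uint64_t slot_ea, uint32_t pid)
{
    my_pid = pid;
    mfc_put(&my_pid, slot_ea, sizeof(my_pid), 0, 0, 0);  /* 4-byte DMA put  */
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();    /* ensure the pid is visible in memory   */
}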

Although it is invoked through a user-level library, part of the scheduler resides in the kernel. Therefore, the implementation of the SLED scheduler can be vertically divided into two distinguishable parts:

1. The user-level library, and

2. The kernel code that accepts and processes the user-level information, which is then used in making kernel-level scheduling decisions.

Figure 8.2: Vertical overview of the SLED scheduler. The user-level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel.

Passing the information from the ready to run list to the kernel can be achieved in two ways:

• The information from the list can be read by the processes running on the PPE, and the information can be passed to the kernel through a system call, or

• The ready to run list can be visible to the kernel and the kernel can directly read the information from the list.

In the current study we follow the first approach, where the information is passed to the kernel through a system call (see Figure 8.2). Placing the ready to run list inside the kernel will be the subject of our future research. In the current implementation of the SLED scheduler, the size of the list is constant and equal to the total number of SPEs available on the system. Each SPE is assigned an entry in the list. We describe the organization of the ready to run list further in the following section.

8.3 ready to run List

In the current implementation of the IBM SDK for the Cell BE, the local storage of every SPE is memory-mapped into the address space of the process which has spawned the SPE thread. Using DMA requests, a running SPE thread is capable of accessing the global memory of the system. However, these accesses are restricted to the areas of main memory that belong to the address space of the corresponding PPE thread. Therefore, if SPE threads do not belong to the same process, the only possibility of sharing a data structure among them is through globally shared memory segments residing in main memory.

The ready to run list needs to be a shared structure accessible by all SPE threads (even if they belong to different processes). Therefore, it is implemented as a part of a global shared memory region. The shared memory region is attached to each process at the beginning of execution.
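A possible way to set up such a region, assuming POSIX shared memory (the dissertation does not specify the mechanism) and one slot per SPE so that no locking is needed, is sketched below; the object name and structure layout are illustrative.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_SPES 6                     /* PS3: six usable SPEs */

/* One exclusive slot per SPE thread; 0 means "nothing ready". */
typedef struct {
    volatile pid_t slot[NUM_SPES];
} ready_to_run_t;

/* Every MPI process attaches the same named region at start-up. */
ready_to_run_t *attach_ready_to_run(void)
{
    int fd = shm_open("/sleds_ready_to_run", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    ftruncate(fd, sizeof(ready_to_run_t));
    return mmap(NULL, sizeof(ready_to_run_t), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

The PPE process can then hand each of its SPE threads the effective address of that SPE's own slot, so SPE threads belonging to different processes all write into the same structure without sharing entries.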

8.3.1 ready to run List Organization

An initial observation suggests that the ready to run list should be organized in FIFO order, i.e., the process corresponding to the SPE thread which was the first to finish processing a task should also be the first to run on the PPE. Nevertheless, a strict FIFO organization of the scheduler might cause certain problems. Consider the situation where PPE process A has off-loaded, but the granularity of the off-loaded task is relatively small, and the SPE execution finishes before process A has had a chance to yield on the PPE side. If process B is in the ready to run list waiting to be scheduled, then, if the FIFO order is strictly followed, process A will yield the PPE and process B will be scheduled to run on the PPE. In the described scenario, strict FIFO scheduling causes an extra context switch to occur (there is no need for process A to yield the PPE to process B).

Therefore, the SLED scheduler is not designed as a strictly FIFO scheduler. Instead, after off-loading and before yielding, the process checks whether its off-loaded task is still executing. If the SPE task has finished executing, instead of yielding, the PPE process simply continues executing.

Under the described soft-FIFO policy, it is possible that a process (call it A) does not yield the PPE upon off-loading, while at the same time its off-loaded task, upon completion, writes the pid of process A to the ready to run list. Because the pid is written to the list, at some point process A will be scheduled by the SLED scheduler. However, when scheduled by the SLED scheduler, process A might not have anything useful to process, since it did not yield upon off-loading. To avoid this situation, a process which decides not to yield upon off-loading also needs to clear the field in the ready to run list that has been filled with its own pid. Since multiple processes require simultaneous read/write access to the list, maintaining the list in a consistent state would require locks, which can introduce significant overhead.

Instead of allowing processes to access any field in the ready to run list, we found it more efficient to assign each process an exclusive entry in the list. By not allowing processes to share entries in the ready to run list, we avoid any locking, which significantly reduces the overhead of maintaining the list in a consistent state.
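Put together, the soft-FIFO check a process performs after off-loading might look like the following sketch. The completion flag, the list pointer, and my_slot are illustrative names, and _yield() stands for the list-scanning yield routine shown later in Figure 8.7; this is not the dissertation's actual code.

#include <sys/types.h>

/* done: completion flag set by this process's SPE thread (cf. the stop
   field in Figure 8.7); ready_to_run: this CPU's list; my_slot: this
   process's exclusive entry. */
extern volatile unsigned int *done;
extern volatile pid_t *ready_to_run;
extern int my_slot;
extern void _yield();                /* list-scanning yield, Figure 8.7 */

void after_offload(void)
{
    if (*done) {
        /* Task already finished: keep running on the PPE, and clear the
           entry our SPE wrote so we are not scheduled needlessly later. */
        ready_to_run[my_slot] = 0;
    } else {
        _yield();
    }
}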

8.3.2 Splitting ready to run List

The ready to run list serves as a buffer through which the necessary off-loading-related information is passed to the kernel. Initially, the SLED scheduler was designed to use a single ready to run list. However, in certain cases the single-list design forced the SLED scheduler to perform process migration across the PPE execution contexts.

Consider the situation described in Figure 8.3. The off-loaded task which belongs to process P1 has finished processing on the SPE side (the pid of process P1 has been written to the ready to run list). Process P1 is bound to CPU1, but process P2, which is running on CPU2, off-loads and initiates the context switch by passing the pid of process P1 to the kernel. Since the context switch occurred on CPU2 and P1 is bound to run on CPU1, the kernel needs to migrate process P1 to CPU2. Initially, we implemented a system call which performs the process migration; the design of this system call is outlined in Figure 8.4. The essential step in this system call is Line 9, where the sched_migrate_task() function is invoked. This is a kernel function which accepts two parameters: the task to be migrated and the destination cpu to which the task should migrate.

Figure 8.3: Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to migrate process P1 from CPU1 to CPU2.

The described process migration has some drawbacks. Specifically, the sched_migrate_task() function might be expensive due to the required locking of the run queues, and it can also create an uneven distribution of processes across the available cpus. To avoid the drawbacks caused by the sched_migrate_task() function, we redesigned the ready to run list. Instead of having a single ready to run list shared among all processes in the system, we assign one ready to run list to each execution context on the PPE. In this way, only processes sharing an execution context access the same ready to run list. This mechanism is presented in Figure 8.5. With the separate ready to run lists there is no longer a need for expensive task migration, and we also avoid possible load imbalance on the PPE processor.

1: void migrate(pid_t next){
2:    struct task_struct *p; struct rq *rq_p;
      int this_cpu, p_cpu;          /* next: pid read from the ready_to_run list */

3:    p = find_process_by_pid(next);
4:    if (p){
5:       rq_p = task_rq(p);
6:       this_cpu = smp_processor_id();
7:       p_cpu = task_cpu(p);
8:       if (p_cpu != this_cpu && p != rq_p->curr){
9:          sched_migrate_task(p, this_cpu);
10:      }

11:      SLEDS_yield(next); ...

12:   }
   }

Figure 8.4: System call for migrating processes across execution contexts. The sched_migrate_task() function performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU.

8.4 SLED Scheduler - Kernel Level

The standard scheduler used in the Linux kernel, starting from version 2.6.23, is the Completely Fair Scheduler (CFS). CFS implements a simple algorithm based on the idea that at any given moment in time, the CPU should be evenly divided across all active processes in the system. While this is a desirable theoretical goal, in practice it cannot be achieved, since at any moment in time the CPU can serve only one process. For each process in the system, CFS records the amount of time that the process has been waiting to be scheduled on the CPU. Based on the amount of time spent waiting to be scheduled and the number of processes in the system, as well as the static priority of the process, each process is assigned a dynamic priority. The dynamic priority of a process is used to determine when and for how long the process will be scheduled to run.

The structure used by CFS for storing the active processes is a red-black tree. The processes are stored in the nodes of the tree, and the process with the highest dynamic priority (which will be the first to run on the CPU) is stored in the leftmost node of the tree.

Figure 8.5: The ready to run list is split in two parts. Each of the two sublists contains the processes that share an execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts.

The SLED scheduler passes the information from the ready-to-run list to the kernel through the SLEDS_yield() system call. SLEDS_yield() extends the standard sched_yield() system call by accepting an integer parameter, pid, which identifies the process that should run next. A high-level overview of the SLEDS_yield() function is given in Figure 8.6(a)-(c) (assuming that the passed pid parameter is different from zero). First, the process which should run next is pulled out of the running tree, and its static priority is increased to the maximum value. The process is then returned to the running tree, where it is stored in the leftmost node (since it has the highest priority). After being returned to the tree, the static priority of the process is decreased back to its normal value. Besides increasing the static priority of the process, we also increase the time that the process is allowed to run on the CPU. Increasing the CPU time is important, because if a process is artificially scheduled to run many times, it might exhaust all the CPU time that it was assigned by the Linux scheduler.
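From user space, the new system call might be reached through a thin wrapper such as the one below; the syscall number is purely illustrative and depends on how the patched kernel assigns it.

#define _GNU_SOURCE            /* for syscall() on glibc */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical number assigned to the new call in the patched 2.6.23 kernel. */
#define __NR_SLEDS_yield 324

/* Ask the kernel to run `next` as soon as we yield; next == 0 degenerates
   to an ordinary yield. */
static inline long SLEDS_yield(pid_t next)
{
    return syscall(__NR_SLEDS_yield, next);
}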

Figure 8.6: Execution flow of the SLEDS_yield() function: (a) the appropriate process is found in the running list (tree), (b) the process is pulled out of the list and its priority is increased, (c) the process is returned to the list, and since its priority is increased it is stored at the leftmost position.

In that case, although we are capable of scheduling the process to run on the CPU using the SLEDS_yield() function, the process will almost immediately be switched out by the kernel. Before it exits, the SLEDS_yield() function calls the kernel-level schedule() function, which initiates the context switch.

We measured the overhead in the SLEDS_yield() system call caused by the operations performed on the running tree. We found that SLEDS_yield() incurs an overhead of approximately 8% compared to the standard sched_yield() system call.

8.5 SLED Scheduler - User Level

Figure 8.7 outlines the part of the SLED scheduler which resides in user space. Upon off-loading, the process is required to call the SLEDS_Offload() function (Figure 8.7, Line 13). This function polls a member of the structure signal in order to check whether the SPE has finished processing the off-loaded task. The structure signal resides in the local storage of an SPE, and the process executing on the PPE knows the address of this structure and uses it to access its members. While the SPE task is running, the stop field of the structure signal is equal to zero; upon completion of the off-loaded task, the value of this field is set to one.

If the SPE has not finished processing the off-loaded task, the SLEDS_Offload() function calls the _yield() function (Figure 8.7, Line 15). The _yield() function scans the ready to run list searching for an SPE that has finished processing its assigned task (Figure 8.7, Lines 3-10). Two interesting things can be noticed in the function _yield(). First, the function scans only three entries in the ready to run list. The reason for this is that the list is divided among the execution contexts on the PPE, as described in Section 8.3.2. Since the presented version of the scheduler is adapted to the PlayStation3 (which contains a Cell processor with only 6 SPEs), each ready to run list contains only 3 entries. Second, the list is scanned at most N times (see Figure 8.7, Line 3), after which the process is forced to yield. If the N parameter is relatively large, repeated scanning of the ready to run list becomes harmful to the process executing on the adjacent PPE execution context. However, if the parameter N is not large enough, the process might yield before having a chance to find the next-to-run process. Although the results presented in Figure 8.8 show that the execution time of RAxML depends on N, a theoretical model capable of describing this dependence will be the subject of our future work. Currently, for RAxML we chose N to be 300, as this is the value which achieves the most efficient execution in our test cases (Figure 8.8). For PBPI we did not see any variance in execution times for values of N smaller than 1000; when N is larger than 1000, the performance of PBPI decreases due to contention caused by scanning the ready to run list.

1: void _yield(){
2:    int next = 0, i, j = 0;

3:    while(next == 0 && j < N){
4:       i = 0;
5:       j++;
6:       while(next == 0 && i < 3){
7:          next = ready_to_run[i];
8:          i++;
9:       }
10:   }

11:   SLEDS_yield(next);

12: }

13: void SLEDS_Offload(){

14:   while (((struct signal *)signal)->stop == 0){
15:      _yield();
16:   }
17: }

Figure 8.7: Outline of the SLED scheduler: upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks whether the off-loaded task has finished (Line 14) and, if not, calls the _yield() function. _yield() scans the ready to run list and yields to the next process by executing the SLEDS_yield() system call.

8.6 Experimental Setup

To test the SLED scheduler we used the Cell processor built into the PlayStation3 console. As the operating system, we used a variant of Linux kernel version 2.6.23, specially adapted to run on the PlayStation3. We also modified the kernel by introducing the system calls necessary for the SLED scheduler. We used SDK 2.1 to execute our applications on the Cell.

Figure 8.8: Execution times of RAxML when the ready to run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready to run list; the y-axis represents the execution time. Note that the lowest value on the y-axis is 12.5 seconds, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides.

8.6.1 Benchmarks

In this section we describe the benchmarks used to test the performance of the SLED scheduler. We compared the SLED scheduler to the EDTLP scheduler using microbenchmarks and the real-world bioinformatics applications RAxML and PBPI.

The microbenchmarks we used are designed to imitate the behavior of real applications utilizing the off-loading execution model. Using the microbenchmarks, we aimed to determine the dependence of the context-switch overhead on the size of the off-loaded tasks.

8.6.2 Microbenchmarks

The microbenchmarks we designed are composed of multiple MPI processes, and each process uses an SPE for task off-loading. The tasks in each process are repeatedly off-loaded inside a loop which iterates 1,000,000 times. The part of the process executed on the PPE only initiates task off-loading and waits for the off-loaded task to complete. The off-loaded task executes a loop which may vary in length. In our experiments we oversubscribe the PPE with 6 MPI processes.

Figure 8.9: Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased.
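The SPE side of such a microbenchmark can be as simple as the sketch below: a busy loop whose trip count sets the task length, followed by raising the completion flag that the PPE polls. The structure mirrors the signal/stop convention of Figure 8.7, but the names and counts are illustrative rather than taken from our actual microbenchmark code.

#include <stdint.h>

/* Completion flag polled by the PPE through the memory-mapped local store. */
volatile struct signal { uint32_t stop; } sig __attribute__((aligned(16)));

/* One off-loaded task: `iters` controls the task length reported on the
   x-axis of Figures 8.9 and 8.10. */
void spe_task(uint64_t iters)
{
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < iters; i++)
        sink += i;                     /* synthetic work                    */
    sig.stop = 1;                      /* tell the PPE the task is finished */
}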

We compare the performance of the microbenchmarks using the SLED and EDTLP schedulers. Figure 8.9 represents the total execution time of the microbenchmarks when they are executed with different lengths of the off-loaded tasks. For large task sizes the SLED scheduler outperforms EDTLP by up to 17%. However, when the size of the off-loaded task is relatively small, the EDTLP scheme outperforms the SLED scheme by up to 29%, as represented in Figure 8.10.

We use the example presented in Figure 8.11 to explain the behavior of the EDTLP and SLED schemes for small task sizes. Assume that 3 processes, P1, P2, and P3, are oversubscribed on the PPE. In the EDTLP scheme (Figure 8.11, EDTLP), upon off-loading, P1 yields and the operating system decides which process should run next on the PPE. Since process P1 was the first to off-load and yield, it is unlikely that the same process will be scheduled again until all other processes have off-loaded and been switched out from the PPE. If the size of the off-loaded task is relatively small, by the time process P1 gets scheduled to run again on the PPE, the off-loaded task will already have completed and process P1 can immediately continue running on the PPE.

Figure 8.10: Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time as the length of the off-loaded tasks is increased, zooming in on small task sizes (up to 21 µs on the x-axis).

Consider now the situation represented in Figure 8.11 (SLED), when the SLED scheduler is used for scheduling processes with small off-loaded tasks. Due to the complexity introduced by the SLED scheduler, the time necessary for the context switch to complete is increased. Consequently, the time interval for process P1 between off-loading and the next opportunity to run on the PPE increases. Based on this analysis, we can conclude that for scheduling processes with relatively fine-grain off-loaded tasks (the execution time of a task is shorter than 15 µs), it is more efficient to use the EDTLP scheme than the SLED scheme.

For coarser task sizes (the execution time of a task is longer than 15 µs), the SLED scheme almost always outperforms the EDTLP scheme. The exceptions are certain task sizes which are exact multiples of the scheduling interval, as can be seen in Figure 8.9. The scheduling interval is the time after which a process, having off-loaded, is scheduled to run again on the PPE. For these specific task sizes, the processes are ready to run at the exact moment when they get scheduled on the PPE using only the EDTLP scheme. We point out that these situations are rare, and in the real applications described in Section 8.6.3 and Section 8.6.4 we did not observe this behavior.

Figure 8.11: EDTLP outperforms SLED for small task sizes due to the higher complexity of the SLED scheme.

Figure 8.12: Comparison of the EDTLP scheme and the combination of the SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15 µs.

121 30

SLEDS+EDTLP 25 EDTLP SLEDS

20

15

10 Execution Time (s)

5

0 6 7 8 9 10 11 12 13 13 14 15 16 17 18 19 20 20 21 Task Length (us)

Figure 8.13: Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for the task sizes smaller than 15µs – task size is limited to 2.µs.

To address the issues related to small task sizes (when EDTLP outperforms SLED), we combined the two schemes into a single scheduling policy. The EDTLP scheme is used when the size of the off-loaded tasks is smaller than 15 µs. The results of the combined scheme are presented in Figure 8.12 and Figure 8.13.

8.6.3 PBPI

We also compared the performance of the two schemes, EDTLP and SLED, using the PBPI application. As an input file for PBPI we used a data set that contains 107 species, and we varied the length of the DNA sequence that represents the species. In the PBPI application, the length of the input DNA sequence is directly related to the size of the off-loaded tasks. We varied the length of the DNA sequence from 200 to 5,000 nucleotides. Figure 8.14 shows the execution time of PBPI when the EDTLP and SLED scheduling schemes are used. In all experiments the configuration for PBPI was 6 MPI processes, and each process was assigned an SPE for off-loading the expensive computation. As in the previous example, EDTLP outperforms the SLED scheme for small task sizes. Again we combined the two schemes, EDTLP for task sizes smaller than 15 µs and SLED for larger task sizes, and we present the obtained performance in Figure 8.15. The combined scheme consistently outperforms the EDTLP scheduler, and the highest difference we recorded is 13%.

Figure 8.14: Comparison of the EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

8.6.4 RAxML

We executed RAxML with an input file that contained 10 species in order to compare the EDTLP and SLED schedulers. As in the PBPI case, we varied the length of the input DNA sequence, since the size of the input sequence is directly related to the size of the off-loaded tasks. The length of the sequence in our experiments was between 100 and 5,000 nucleotides. In the case of RAxML, SLED outperforms EDTLP by up to 7%. As in the previous experiments, for relatively small task sizes the EDTLP scheme outperforms the SLED scheme, as represented in Figure 8.16 and Figure 8.17. For larger task sizes the SLED scheme outperforms EDTLP. Again, by combining the two schemes we can achieve the best performance.

123 40

SLEDS+EDTLP 35 EDTLP

30

25

20

15 Execution Time (s) 10

5

0 200 600 1000 1400 1800 2200 2600 3000 3400 3800 4200 4600 5000 Sequence Size

Figure 8.15: Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiples time with varying length of the input sequence (represented on the x-axis).

Figure 8.16: Comparison of the EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

Figure 8.17: Comparison of EDTLP and the combination of the SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

8.7 Chapter Summary

In this chapter we investigated strategies that target reducing the scheduling overhead which occurs on the PPE side of the Cell BE. We designed and tested the SLED scheduler, which uses user-level off-loading-related information in order to influence kernel-level scheduling decisions. On a PlayStation3, which contains 6 SPEs, we conducted a set of experiments comparing the SLED and EDTLP scheduling schemes. For the comparison we used the real scientific applications RAxML and PBPI, as well as a set of microbenchmarks developed to simulate the behavior of larger applications. Using the microbenchmarks, we found that the SLED scheme is capable of outperforming the EDTLP scheme by up to 17%. SLEDS performs better by up to 13% with PBPI and up to 7% with RAxML. Note that a higher advantage of the SLED scheme is likely on a Cell BE with all 8 SPEs available (the Cell BE used in the PS3 has only 6 SPEs available), due to higher PPE contention and consequently higher context-switch overhead.

Chapter 9

Future Work

This chapter discusses directions of future work. The proposed extensions are summarized as follows:

• We plan to extend the presented kernel-level scheduling policies by implementing the ready to run list inside the kernel and by considering additional scheduling parameters, such as load balancing and job priorities, when making scheduling decisions.

• We plan to increase the utilization of the host and accelerator cores by sharing the accelerators among multiple tasks and by extending loop-level parallelism to also include the host core in addition to the accelerator cores already considered.

• We plan to port more applications to the Cell. Specifically, we will focus on streaming, memory-intensive applications and evaluate the capability of the Cell to execute these applications. By using memory-intensive applications, we hope to gain better insight into scheduling strategies which would enable efficient execution of communication-bound applications on asymmetric processors. We consider this to be an important problem, since memory and bus contention will grow rapidly as the number of cores in asymmetric multi-core architectures increases.

• Most of the techniques presented in this thesis are not specifically designed for the Cell and heterogeneous accelerator-based architectures, and in our future work we plan to extend them to homogeneous parallel architectures.

• Finally, we plan to extend the MMGP model by capturing the overhead caused by Element Interconnect Bus congestion, which can significantly limit the ability of the Cell to overlap computation and communication.

We expand on our plans for future work in the following sections.

9.1 Integrating ready-to-run list in the Kernel

As described in Chapter 8, the SLED scheduler spans both the kernel and the user space. The ready-to-run list resides in user space and is shared among all active processes. The information from the ready-to-run list is passed to the kernel-level part of the SLED scheduler through a system call. Based on the received information, the kernel part of the SLED scheduler biases kernel scheduling decisions. In the rest of this section we explain the possible drawbacks of having the ready-to-run list reside in user space.

The timeline diagram of the SLED scheduler is presented in Figure 9.1 (upper figure). Each process, upon off-loading, issues a call to the SLED scheduler. The scheduler iterates through the ready-to-run list in order to determine the pid of the next process. As presented in Figure 9.1 (upper figure), it is possible that all processes have already off-loaded their tasks to the SPEs, and the scheduler will iterate through the list until one of the SPEs sends a signal to the ready-to-run list. Therefore, it is likely that some idle time will occur (when no useful work is performed) after off-loading and before the next-to-run process is found. Once it finds the next-to-run process, the scheduler switches to kernel mode and influences the kernel scheduler to run the appropriate process.

The possible drawback of this scheme is that, upon determining which process should be the next to run, the system still needs to perform two context switches: between the user process and the kernel, and between the kernel and the user process. In our future work we plan to allow the kernel to directly access the list, which would eliminate one context switch. In other words, by allowing the kernel to see the ready-to-run list, we overlap the first context switch with the idle time which occurs before one of the active processes is ready to be rescheduled on the PPE. When the next-to-run process is determined, the scheduler would already be in kernel space, and there would be only one context switch left to return execution to a specific process in user space; see Figure 9.1 (bottom figure).

Figure 9.1: Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE.

9.2 Load Balancing and Task Priorities

So far we have considered applying the MGPS and SLED schedulers only to a single application. In our future work we plan to investigate the described scheduling strategies in the context of multi-program workloads. Since the schedulers are already designed to work in a distributed environment, using them with entirely separate applications should be relatively simple. However, we envision several challenges with multi-program workloads that could potentially influence system performance.

First, using the SLED scheduler with a multi-program workload can cause load imbalance. The SLED scheduler contains two ready-to-run lists, and each list is shared among the processes running on a single cpu. Therefore, the scheduler needs to be capable of deciding how to group the active processes across cpus in order to minimize load imbalance. The grouping of processes will depend on parameters such as the granularities of the PPE and SPE tasks, PPE-SPE communication, and inter-node and on-chip communication. Furthermore, the scheduler needs to be able to recognize when the load of the system has changed (for example, when one of the processes has finished executing), and to appropriately reschedule the remaining tasks across the available cpus.

Besides being able to handle load balancing issues, our future work will focus on including support for real-time tasks in our scheduling policies. So far, all processes in our experiments were assumed to have the same priority. This is not the case in all situations; one example would be streaming video applications. While trying to increase system throughput with different process grouping and load-balancing policies, we might actually hurt the performance of the real-time jobs in the system. A simple example would be a real-time task grouped with processes that require a lot of resources. Although this might be the best grouping decision for overall system performance, that particular real-time task might suffer performance degradation. To address these issues, we plan to include multiple applications in our experiments and focus more on load-balancing problems as well as real-time task priorities.

9.3 Increasing Processor Utilization

Our initial scheduling scheme, Event-Driven Task-Level Parallelization (EDTLP), reduces the idle time on the PPE by forcing each process to yield upon off-loading and assigning the PPE to a process that is ready to do work on the PPE side. To further reduce the idle time on both the PPE and the SPEs, we developed the Slack-Minimizing Event-Driven Scheduler (SLED).

In our future work, as another approach to increasing the utilization of the SPEs, we plan to introduce sharing of SPEs among multiple PPE threads. The processes in an MPI application are almost identical, and the off-loaded parts of each process are exactly the same. Therefore, a single SPE thread could potentially execute the off-loaded computation from multiple processes. However, different processes cannot share SPE threads, since SPE threads belong exclusively to the process that created them. Therefore, we plan to investigate another level of parallelism on the Cell processor, namely thread-level parallelism. Within a single node, instead of running multiple MPI processes, a parallel application would operate with multiple threads, which could share the SPEs among themselves; separate processes would still be used across nodes. To further increase the utilization of the PPE, we will consider extended loop-level scheduling policies that also involve the PPE in the computation, in addition to the accelerator cores already used.
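The sketch below illustrates the intended structure: threads of a single process draw SPE contexts from a shared pool, which is possible precisely because the contexts belong to that one process. The pool layout and the offload_to_spe() helper are assumptions made for the example, not an existing interface.

    #include <pthread.h>

    #define NUM_SPES 8

    /* Hypothetical pool of SPE contexts owned by one process and shared
     * by all of its threads. */
    static void *spe_ctx[NUM_SPES];
    static int   spe_busy[NUM_SPES];
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  pool_cond = PTHREAD_COND_INITIALIZER;

    extern void offload_to_spe(void *ctx, void *work);  /* assumed helper */

    void run_on_any_spe(void *work)
    {
        int s = -1;

        pthread_mutex_lock(&pool_lock);
        while (s < 0) {                          /* wait for a free SPE */
            for (int i = 0; i < NUM_SPES; i++) {
                if (!spe_busy[i]) { s = i; break; }
            }
            if (s < 0)
                pthread_cond_wait(&pool_cond, &pool_lock);
        }
        spe_busy[s] = 1;
        pthread_mutex_unlock(&pool_lock);

        offload_to_spe(spe_ctx[s], work);        /* off-loaded computation */

        pthread_mutex_lock(&pool_lock);
        spe_busy[s] = 0;
        pthread_cond_signal(&pool_cond);
        pthread_mutex_unlock(&pool_lock);
    }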

9.4 Novel Applications and Programming Models

In this thesis we used a limited number of applications that were able to benefit from the off-loading execution approach. While it is clear that many scientific (computationally expensive) applications will benefit from the proposed execution models and scheduling strategies, in our future work we plan to focus on applications with high-bandwidth requirements. Specifically, we plan to investigate the capability of accelerator-based architectures to execute applications such as database servers and network packet processing.

The mentioned applications are computationally intensive, but they also usually require high memory bandwidth because they stream large amounts of data. Besides being extremely computationally powerful, Cell has a high-bandwidth bus which connects the on-chip cores to each other and to main memory. While this high-bandwidth bus can improve the performance of streaming applications, in the near future it might become a bottleneck as the number of on-chip cores increases. Therefore, in our future work we will focus on runtime systems that improve the execution of data-intensive applications on asymmetric processors.

9.5 Conventional Architectures

The main focus of this thesis has been heterogeneous, accelerator-based architectures. However, parallel architectures comprising homogeneous cores represent the majority of processors in use today. When working with conventional, highly parallel architectures, it is likely that problems similar to those we faced on heterogeneous architectures will occur.

As with asymmetric architectures, applications designed for homogeneous parallel architectures need to be parallelized at multiple levels in order to achieve efficient execution. Applications with multiple levels of parallelism are likely to experience load imbalance, which might result in poor utilization of chip resources. Therefore, we need techniques that are capable of detecting and correcting these anomalies.

Most of the techniques we presented in this thesis are not bound to heterogeneous architectures. In our future work we plan to extend and evaluate our scheduling and modeling work on homogeneous parallel architectures. While scheduling approaches such as MGPS and S-MGPS might be relatively simple to apply to any kind of architecture, the MMGP modeling approach will require more detailed communication modeling. On the Cell architecture, because of the specifics of the SPE design, we were able to assume significant computation-communication overlap. This will obviously not be the case on architectures with conventional caches; therefore, we will focus more on modeling communication patterns.

9.6 MMGP Extensions

Another direction of our future work on the MMGP model is more accurate modeling of the off-loaded tasks, specifically of the DMA communication they perform. Each SPE on the Cell/BE is capable of moving data between main memory and its local storage while at the same time executing computation. To overlap computation and communication, applications use loop tiling and double buffering.
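The sketch below shows the usual double-buffering pattern on an SPE, using the MFC DMA intrinsics from the Cell SDK (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The chunk size, the compute() routine, and the overall structure are illustrative assumptions rather than code taken from our applications.

    #include <spu_mfcio.h>

    #define CHUNK 4096                  /* bytes per DMA transfer (example) */

    /* Two local-store buffers: while buf[cur] is processed, the MFC fills
     * buf[1 - cur] in the background. */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(char *data, int n); /* assumed per-chunk computation */

    void process_stream(unsigned long long ea, int nchunks)
    {
        int cur = 0;

        /* Prime the pipeline: fetch the first chunk under tag 0. */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

        for (int i = 0; i < nchunks; i++) {
            /* Start fetching the next chunk into the other buffer (tag 1 - cur). */
            if (i + 1 < nchunks)
                mfc_get(buf[1 - cur], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, 1 - cur, 0, 0);

            /* Block only on the DMA that fills the current buffer. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();

            compute(buf[cur], CHUNK);   /* overlaps with the in-flight DMA */
            cur = 1 - cur;
        }
    }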

In the MMGP model presented in this thesis we have included all blocking DMA requests that cannot be overlapped with computation. However, loop unrolling and the resulting increase in DMA communication can influence performance at a completely different architectural level. Although the Element Interconnect Bus (the structure that connects the cores on Cell) can achieve a bandwidth of over 200 GB/s, the processor-memory bandwidth is limited to 25 GB/s. When many SPEs work simultaneously, the available bandwidth might not be sufficient. Consider a case where each SPE executes exactly the same loop, a realistic scenario when an off-loaded loop is parallelized across multiple accelerators. If the off-loaded execution is synchronized, all SPEs will issue a DMA request at the same time. Although the average bandwidth requirements might be less than 25 GB/s, when all SPEs simultaneously and synchronously perform memory communication the instantaneous requirements might exceed the available bandwidth. This scenario is likely to occur when significant loop unrolling is performed, due to the heavily increased DMA communication needed to bring in data for the enlarged loop bodies. In our future work we plan to extend the MMGP model by capturing the on-chip contention caused by the high bandwidth requirements of the off-loaded code.
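As an illustrative calculation (the per-SPE numbers are hypothetical, chosen only to make the effect concrete): suppose each of the 8 SPEs issues a 16 KB DMA request once every 10 microseconds of computation. The average demand is

    8 x 16 KB / 10 us  ~ 13 GB/s    (below the 25 GB/s processor-memory limit)
    128 KB / 25 GB/s   ~ 5 us       (time to drain one synchronized burst)

so if the SPEs run in lockstep and all eight requests arrive in the same instant, the last SPE served stalls for several microseconds even though the average demand never exceeds the limit. A contention term of this form is what we intend to add to MMGP.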

Chapter 10

Overview of Related Research

10.1 Cell – Related Research

Cell has recently attracted considerable attention as a high-end computing platform. Recent work on Cell covers modeling, performance analysis, programming and compilation environments, and application studies.

Kistler et al. [63] analyze the performance of Cell's on-chip interconnection network and provide insights into its communication and synchronization protocols. They present experiments that estimate the DMA latencies and bandwidth of Cell, using microbenchmarks. They also investigate the system behavior under different patterns of communication between local storage and main memory. Based on the presented results, the Cell communication network provides the speed and bandwidth that applications need to exploit the processor's computational power. Williams et al. [98] present an analytical framework to predict performance on Cell. In order to test their model, they use several computational kernels, including dense matrix multiplication, sparse matrix-vector multiplication, stencil computations, and 1D/2D FFTs. In addition, they propose micro-architectural modifications that can increase the performance of Cell when operating on double-precision floating point elements. Chen et al. [33] investigate communication (DMA) performance on the SPEs. They found a strong relation between the size of the prefetching buffers allocated in local storage and application performance. To determine the optimal buffer size, they present a detailed analytical model of DMA accesses on the Cell and use the model to optimize the buffer size for DMAs. To evaluate the performance of their model, they use a set of micro-kernels. Our work differs in that it considers the overall performance implications of multigrain parallelization strategies on Cell.

Balart et al. [16] present a runtime library for asynchronous communication on the Cell BE processor. The library is organized as a software cache and provides opportunities for overlapping communication and computation. They found that a fully associative scheme offers better chances for communication-computation overlap. To evaluate their system they used benchmarks from the HPCC suite. While their concern was the design and implementation of the off-loaded code, in our work we assume that the application is already Cell-optimized, and we focus on scheduling the parallelism that the application has already exposed.

Eichenberger et al. [39] present several compiler techniques targeting automatic generation of highly optimized code for Cell. These techniques attempt to exploit two levels of parallelism, thread-level and SIMD-level, on the SPEs. The techniques include compiler-assisted memory alignment, branch prediction, SIMD parallelization, OpenMP thread-level parallelization, and compiler-controlled software caching. The study of Eichenberger et al. does not present details on how multiple dimensions of parallelism are exploited and scheduled simultaneously by the compiler. Our contribution addresses this issue. The compiler techniques presented in [39] are complementary to the work presented in this thesis: they focus primarily on extracting high performance out of each individual SPE, whereas our work focuses on scheduling and orchestrating computation across SPEs. Zhao and Kennedy [102] present a dependence-driven compilation framework for simultaneous automatic loop-level parallelization and SIMDization on Cell. They also implement strategies to boost performance by managing DMA data movement, improving data alignment, and exploiting memory reuse in the innermost loop. To evaluate the performance of their techniques, Zhao and Kennedy use microbenchmarks. Similar to the results presented in our study, they do not see linear speedup when parallelizing tasks across multiple SPEs. The framework of Zhao and Kennedy does not consider task-level functional parallelism and its coordinated scheduling with data parallelism, two central issues explored in this thesis.

Although Cell has been a focal point of numerous articles in the popular press, published research using Cell for real-world applications beyond games was scarce until recently. Hjelte [58] presents an implementation of a smoothed particle hydrodynamics simulation on Cell. This simulation requires good interactive performance, since it lies on the critical path of real-time applications such as interactive simulation of human organ tissue, body fluids, and vehicular traffic. Benthin et al. [18] present an implementation of ray-tracing algorithms on Cell, also targeting high interactive performance. They show how to efficiently map the ray-tracing algorithm to Cell, with performance improvements of nearly an order of magnitude over conventional processors. However, they found that for certain algorithms Cell does not perform well due to frequent memory accesses. Petrini et al. [73] recently reported experiences from porting and optimizing Sweep3D on Cell, in which they consider multi-level data parallelization on the SPEs. They heavily optimized Sweep3D for Cell and achieved impressive performance of 9.3 Gflops for double-precision and 50 Gflops for single-precision floating point computation. Contrary to their conclusion that memory performance and data communication patterns play a central role in Sweep3D, we were able to achieve complete communication-computation overlap in the bioinformatics codes we ported to Cell. The same authors presented a study of graph exploration algorithms on Cell [75], investigating the suitability of the breadth-first search (BFS) algorithm for the Cell BE; the achieved performance is an order of magnitude better than on conventional architectures. Bader et al. [13] examine the implementation of list ranking algorithms on Cell. List ranking is a challenging algorithm for Cell due to its highly irregular access patterns. When utilizing the entire Cell chip, they reported an overall speedup of 8.34 over a PPE-only implementation of the same algorithm. Recently, several Cell studies have been conducted as a result of the 2007 IBM Cell Challenge. Moorkanikara-Nageswaran et al. [1] developed a brain circuit simulation on a PS3 node. As part of the same contest, De Kruijf ported the MapReduce [38] algorithm to Cell. The main goal of our work is both to develop and optimize applications for Cell and to develop system software tools and methodologies for improving performance on the Cell architecture across application domains. We use a case study from bioinformatics to understand the implications of static and dynamic multi-grain parallelization on Cell.

10.2 Process Scheduling – Related Research

Dynamic and off-line process scheduling that improves the performance and overall throughput of the system has been a very active research area. With the introduction of multi-core systems, many scheduling-related studies have been conducted targeting performance improvement on these novel systems. We list several contributions in this area.

Anderson et al. [9] argue that the performance of kernel-level threads is inherently worse than that of user-level threads. While user-level threads are essential for high-performance computation, kernel-level threads, which support user-level threads, are a poor kernel-level abstraction due to their inherently bad performance. The authors propose a new kernel interface and a user-level thread package that together provide the same functionality as kernel threads, while the performance of their thread library remains comparable to that of any other user-level thread library.

Siddha et al. [82] conducted a thorough study of possible scheduling strategies on emerging multi-core architectures. They consider different multi-core topologies and the associated power management technologies, and point out the tradeoffs involved in scheduling on these novel architectures. They focus on symmetric processors and do not consider any asymmetric architectures. Somewhat similarly to the results obtained from our study (which uses asymmetric cores), they conclude that the most efficient performance can be achieved by making the process scheduler aware of multi-core topologies and task characteristics.

Fedorova et al. [42] designed a kernel-level scheduling algorithm that targets improving the performance of multi-core architectures with shared levels of cache. The motivation for their work comes from the fact that the performance of an application on a multi-core system depends on the behavior of its co-runners. This dependency is a consequence of shared on-chip resources, such as the cache. Their algorithm ensures that processes always run as quickly as they would if the cache were fairly shared among all co-running processes. To achieve this behavior, they adjust the CPU timeslices assigned to the running processes by the kernel scheduler.

Calandrino et al. [26] developed an approach for scheduling soft real-time periodic tasks in Linux on asymmetric multi-core processors. Their approach performs dynamic scheduling of real-time tasks while at the same time attempting to provide good performance for non-real-time processes. To evaluate their approach, they used a Linux scheduler simulator as well as the real Linux operating system running on a dual-core Intel Xeon processor.

Settle et al. [80] proposed a memory monitoring framework, architectural support that provides cache resource information to the operating system. The authors introduce the concept of an activity vector, which represents a collection of event counters for a set of contiguous cache blocks. Using this runtime information, the operating system can improve process scheduling: their scheme schedules threads based on the run-time cache use and miss pattern of each active hardware thread. Their techniques improve system performance by 5%, with the improvement coming from an increased cache hit rate.

Thekkath and Eggers [93] tested the hypothesis that scheduling threads that share data on the same processor decreases compulsory and invalidation misses. They evaluated a variety of thread placement algorithms on a workload composed of fourteen parallel programs representative of real-world scientific applications. They found that placing threads that share data on the same processor does not have any impact on performance; instead, performance was mostly affected by thread load balancing.

Rajagopalan et al. [76] introduce a scheduling framework for multi-core processors that targets a balance between control over the system and the level of abstraction. Their framework uses high-level information supplied by the user to guide thread scheduling and also, where necessary, gives the programmer fine control over thread placement.

Snavely and Tullsen [83] designed the SOS (Sample, Optimize, Symbios) scheduler, an OS-level scheduler that dynamically chooses the best scheduling strategy in order to increase the throughput of the system. The SOS scheduler samples the space of possible process combinations and collects hardware counter values for different scheduling combinations. The scheduler then applies heuristics to the collected counters in order to determine the most efficient scheduling strategy. The scheduler is designed for SMT architectures and is capable of improving system performance by up to 17%. The same authors extend their initial work by introducing job priorities [84]: while different jobs might have different priorities from the user's perspective, the SOS scheduler might be unaware of them, and while trying to improve system throughput it might increase the response time of high-priority jobs.

Sudarsan et al. [90] developed ReSHAPE, a runtime scheduler for dynamic resizing of parallel applications executed in a distributed environment. MPI-based applications using the ReSHAPE framework can expand or shrink depending on the availability of the underlying hardware. Using ReSHAPE, they demonstrated improvements in job turn-around time and overall system throughput. McCann et al. [67] propose a dynamic processor-allocation policy for multiprogrammed shared-memory multiprocessors. Their scheduling policy also assumes multiple independent processes and is capable of reallocating processors from one parallel job to another based on the requirements of the parallel jobs. The authors show that it is possible to beneficially run low-priority jobs on the same CPU as high-priority jobs without hurting the high-priority jobs. Their scheduling scheme can improve system performance by up to 40%.

Curtis-Maury et al. [36] present a prediction model for identifying energy-efficient operating points of concurrency in multithreaded scientific applications. Their runtime system optimizes applications at runtime using live analysis of hardware event rates. Zhang et al. [100] developed an OpenMP-based loop scheduler that selects the number of threads to use per processor based on sample executions of each possibility; the authors extend that work to incorporate decision-tree-based prediction of the optimal number of threads to use [101]. Springer et al. [85] developed a scheduler that satisfies two conditions: the scheduling strategy respects an external upper limit on energy consumption and minimizes execution time. The execution configuration chosen by their scheduler is usually within 2% of optimal.

10.3 Modeling – Related Research

In this section we review related work in programming environments and models for parallel computation on conventional homogeneous parallel systems and programming support for nested parallelization. The list of related work in models for parallel computation is by no means complete, but we believe it provides adequate context for the model presented in this thesis.

10.3.1 PRAM Model

Fortune and Wyllie presented a model based on random access machines operating in parallel and sharing a common memory [46]. They model the execution of a finite program on a PRAM (parallel random access machine), which consists of an unbounded set of processors connected through an unbounded global shared memory. The model is rather simple but not realistic for modern multicore processors, since it assumes that all processors work synchronously and that interprocessor communication is free. PRAM also does not consider network congestion. There are several variants of the PRAM model: (i) EREW (exclusive read, exclusive write) does not allow simultaneous read or write operations on the same memory location; (ii) CREW (concurrent read, exclusive write) allows simultaneous reading but prevents simultaneous writing; (iii) CRCW (concurrent read, concurrent write) allows both simultaneous read and simultaneous write operations. Cameron et al. [28] describe two different implementations of the CRCW PRAM: priority and arbitrary.
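As a small worked example of the model's cost accounting (a standard textbook illustration, not taken from this thesis): summing n numbers on an EREW PRAM with n/2 processors takes

    ceil(log2 n) synchronous steps,

since in every step each active processor adds two partial sums and the number of partial sums halves, so under the unit-cost PRAM assumptions the running time is O(log n) regardless of how the data is placed in shared memory. The CRCW variant can be strictly more powerful: computing the logical OR of n bits takes constant time on a common-write CRCW PRAM, because every processor holding a 1 may write to the same cell simultaneously.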

Several extensions of the PRAM model have been developed in order to make it more practical while preserving its simplicity [5, 6, 51, 62, 68, 72]. Aggarwal et al. [5] add communication latency to the PRAM model, and the same authors account for reduced communication costs when blocks of data are transferred [6].

The original PRAM model assumes fully synchronous execution. The Asynchronous PRAM (APRAM) model includes synchronization costs [51, 68]. APRAM contains four different types of instructions: global reads, global writes, local operations, and synchronization steps. A synchronization step represents a global synchronization among processors.

10.3.2 BSP Model

Valiant introduced the bulk-synchronous parallel (BSP) model [95], which is a bridging model between parallel software and hardware. The BSP model is intended neither as a hardware model nor as a programming model, but as something in between. The model is defined as a combination of three attributes: 1) a number of components, each performing processing and/or memory functions; 2) a router that delivers point-to-point messages between the components; and 3) facilities for synchronizing all or a subset of the components at regular intervals. The computation is performed in supersteps: in each superstep every component is allocated a task, and all components are synchronized at the end of the superstep. BSP, like the other models mentioned, does not capture the overhead of context switching, which is a significant part of accelerator-based execution and of the MMGP model. BSP allows processors to work asynchronously within a superstep and models latency and limited bandwidth.
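In its standard formulation (summarized here only for reference, not specific to this thesis), the cost of a single superstep is

    T_superstep = w + g * h + l,

where w is the maximum amount of local computation performed by any component, h is the maximum number of words sent or received by any component (an h-relation), g is the router's per-word delivery cost, and l is the cost of the barrier synchronization that closes the superstep; the running time of a BSP program is the sum of its superstep costs.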

Bäumker and Dittrich [25] extend the BSP model with blockwise communication in parallel algorithms. A good parallel algorithm should communicate using a small number of large messages rather than a large number of small messages; they therefore introduce a new parameter B, which represents the minimum message size needed to fully exploit the bandwidth of the router. Fantozzi et al. [41] introduce D-BSP, a model in which a machine can be divided into submachines capable of exploiting locality; furthermore, each submachine can execute a different algorithm independently. Juurlink et al. [60] extend the BSP model by providing a way to deal with unbalanced communication patterns and by adding a notion of general locality, where the delay of a remote memory access depends on the relative location of the processors in the interconnection network.

10.3.3 LogP Model

LogP [35] is another widely used machine-independent model for parallel computation. The LogP model captures the performance of parallel applications using four parameters: the communication latency (L), the overhead (o), the gap (g) between consecutive messages, which reflects the per-processor communication bandwidth, and the number of processors (P).

The drawback of LogP is that it can accurately predict performance only when short messages are used for communication. Alexandrov et al. [8] propose the LogGP model, an extension of LogP that supports large communication messages and high bandwidth; they introduce an extra parameter G, which captures the bandwidth obtained for large messages. Ino et al. [59] introduce an extension of LogGP, named LogGPS. LogGPS improves the accuracy of the LogGP model by capturing the synchronization that high-level communication libraries perform before sending a long message; they introduce a new parameter S, defined as the message-length threshold above which messages are sent synchronously. Frank et al. [47] extend the LogP model by capturing the impact of contention for message-processing resources. Cameron et al. [27] extend the LogP model by modeling the point-to-point memory latencies of inter-node communication in a shared-memory cluster.
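To make the parameter sets concrete (these are the standard formulations, restated here only as a reminder): under LogP, a short point-to-point message costs

    T = o + L + o = L + 2o,

with consecutive messages from the same processor separated by at least the gap g, while under LogGP a k-byte message costs

    T(k) = o + (k - 1) * G + L + o,

so the additional parameter G is the per-byte cost (the reciprocal of the bandwidth) for long messages.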

Traditional parallel programming models, such as BSP [95], LogP [35], PRAM [51], and derived models [8, 27, 59, 70], developed to respond to changes in the relative impact of architectural components on the performance of parallel systems, are based on a minimal set of parameters that capture the impact of communication overhead on computation running across a homogeneous collection of interconnected processors. MMGP borrows elements from LogP and its derivatives to estimate the performance of parallel computations on heterogeneous parallel systems with multiple dimensions of parallelism implemented in hardware. A variation of LogP, HLogP [24], considers heterogeneous clusters with variability in the computational power and in the interconnection network latencies and bandwidths between the nodes. Although HLogP is applicable to heterogeneous multi-core architectures, it does not consider nested parallelism. It should be noted that although MMGP has been evaluated on architectures with heterogeneous processors, it can readily support architectures with heterogeneity in their communication substrates as well (e.g., architectures providing both shared-memory and message-passing communication).

10.3.4 Models Describing Nested Parallelism

Several parallel programming models have been developed to support nested parallelism, including nested parallel languages such as NESL [21], task-level parallelism extensions to data-parallel languages such as HPF [89], extensions of common parallel programming libraries such as MPI and OpenMP to support nested parallel constructs [29, 64], and techniques for combining constructs from parallel programming libraries, typically MPI and OpenMP, to better exploit nested parallelism [11, 50, 77]. Prior work on languages and libraries for nested parallelism based on MPI and OpenMP is largely based on empirical observations of the relative speed of data communication via cache-coherent shared memory versus communication with message passing through switching networks. Our work attempts to formalize these observations into a model which seeks the optimal work allocation between layers of parallelism in the application and the optimal mapping of these layers to heterogeneous parallel execution hardware. NESL [21] and Cilk [22] are languages based on formal algorithmic models of performance that guarantee tight bounds on estimating the performance of multithreaded computations and enable nested parallelization. Both NESL and Cilk assume homogeneous machines.

Subhlok and Vondran [88] present a model for estimating the optimal number of homogeneous processors to assign to each parallel task in a chain of tasks that forms a pipeline. MMGP has the similar goal of assigning co-processors to simultaneously active tasks originating from the host processors; however, it also searches for the optimal number of tasks to activate on the host processors, in order to achieve a balance between supply from host processors and demand from co-processors. Sharapov et al. [81] use a combination of queuing theory and cycle-accurate simulation of processors and interconnection networks to predict the performance of hybrid parallel codes written in MPI/OpenMP on ccNUMA architectures. MMGP uses a simpler model, designed to estimate scalability along more than one dimension of parallelism on heterogeneous parallel architectures.

Research on optimizing compilers for novel microprocessors, such as tiled and streaming processors, has contributed methods for multi-grain parallelization of scientific and media computations. Gordon et al. [53] present a compilation framework for exploiting three layers of parallelism (data, task, and pipelined) on streaming microprocessors running DSP applications. The framework uses a combination of fusion and fission transformations on data-parallel computations to "right-size" the degree of task and data parallelism in a program running on a homogeneous multi-core microprocessor. MMGP is a complementary tool which can assist both compile-time and runtime optimization on heterogeneous multi-core platforms. The development of MMGP coincides with several related efforts on measuring, modeling, and optimizing performance on the Cell Broadband Engine [32, 75]. An analytical model of Cell presented by Williams et al. [97] considers the execution of floating point code and DMA accesses on the Cell SPEs for scientific kernels parallelized at one level across SPEs and vectorized further within SPEs. MMGP models the use of both the PPE and the SPEs and has been demonstrated to work effectively with complete application codes. In particular, MMGP factors the effects of PPE thread scheduling, PPE-SPE communication, and SPE-SPE communication into the Cell performance model.


Bibliography

[1] http://www-304.ibm.com/jct09002c/university/students/contests/cell/index.html.

[2] http://www.rapportincorporated.com.

[3] The Cell project at IBM Research; http://www.research.ibm.com/cell.

[4] www.gpgpu.org.

[5] A. Aggarwal, A. K. Chandra, and M. Snir. On communication latency in PRAM computations. In SPAA ’89: Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, pages 11–21, New York, NY, USA, 1989. ACM.

[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theor. Comput. Sci., 71(1):3–28, 1990.

[7] S. Alam, R. Barrett, J. Kuehn, P. Roth, and J. Vetter. Characterization of scientific workloads on systems with multi-core processors. In Proc. of IEEE International Symposium on Workload Characterization (IISWC), 2006.

[8] A. Alexandrov, M. Ionescu, C. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model: One Step Closer towards a Realistic Model for Parallel Computation. In Proc. of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95–105, Santa Barbara, CA, June 1995.

[9] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Trans. Comput. Syst., 10(1):53–79, 1992.

[10] K. Asanovic, R. Bodik, C. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California–Berkeley, December 2006.

[11] E. Ayguadé, X. Martorell, J. Labarta, M. González, and N. Navarro. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. In Proc. of the 1999 International Conference on Parallel Processing (ICPP’99), pages 172–180, Aizu, Japan, August 1999.

[12] A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Alvarez, and A. Ramirez. Analysis of video filtering on the cell processor. In Proceedings of the ProRISC Conference, pages 116–121, November 2007.

[13] D. Bader, V. Agarwal, and K. Madduri. On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[14] D.A. Bader, B.M.E. Moret, and L. Vawter. Industrial applications of high-performance computing for phylogeny reconstruction. In Proc. of SPIE ITCom, volume 4528, pages 159–168, 2001.

[15] David A. Bader, Virat Agarwal, Kamesh Madduri, and Seunghwa Kang. High performance combinatorial algorithm design on the cell broadband engine processor. Parallel Comput., 33(10-11):720–740, 2007.

[16] Jairo Balart, Marc Gonzalez, Xavier Martorell, Eduard Ayguade, Zehra Sura, Tong Chen, Tao Zhang, Kevin O’Brien, and Kathryn O’Brien. A novel asynchronous software cache implementation for the cell/be processor. In The 20th International Workshop on Languages and Compilers for Parallel Computing, 2007.

[17] P. Bellens, J. Perez, R. Badia, and J. Labarta. CellSs: A Programming Model for the Cell BE Architecture. In Proc. of Supercomputing’2006, Tampa, FL, November 2006.

[18] Carsten Benthin, Ingo Wald, Michael Scherbaum, and Heiko Friedrich. Ray Tracing on the CELL Processor. Technical Report, inTrace Realtime Ray Tracing GmbH, No inTrace-2006-001 (submitted for publication), 2006.

[19] F. Blagojevic, D. Nikolopoulos, A. Stamatakis, and C. Antonopoulos. Dynamic Multigrain Parallelization on the Cell Broadband Engine. In Proc. of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 90–100, March 2007.

[20] F. Blagojevic, A. Stamatakis, C. Antonopoulos, and D. Nikolopoulos. RAxML-CELL: Parallel Phylogenetic Tree Construction on the Cell Broadband Engine. In Proc. of the 21st International Parallel and Distributed Processing Symposium, March 2007.

[21] G. Blelloch, S. Chatterjee, J. Harwick, J. Sipelstein, and M. Zagha. Implementation of a Portable Nested Data Parallel Language. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’93), pages 102–112, San Diego, CA, June 1993.

[22] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: an Efficient Multithreaded Runtime System. In Proc. of the 5th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’95), pages 207–216, Santa Barbara, California, August 1995.

[23] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the gpu: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.

[24] J. Bosque and L. Pastor. A Parallel Computational Model for Heterogeneous Clusters. IEEE Transactions on Parallel and Distributed Systems, 17(12):1390–1400, December 2006.

[25] Armin Bäumker and Wolfgang Dittrich. Fully dynamic search trees for an extension of the BSP model.

[26] John M. Calandrino, Dan Baumberger, Tong Li, Scott Hahn, and James H. Anderson. Soft real-time scheduling on performance asymmetric multicore platforms. In RTAS ’07: Proceedings of the 13th IEEE Real Time and Embedded Technology and Applications Symposium, pages 101–112, Washington, DC, USA, 2007. IEEE Computer Society.

[27] K. Cameron and X. Sun. Quantifying Locality Effect in Data Access Delay: Memory LogP. In Proc. of the 17th International Parallel and Distributed Processing Symposium, Nice, France, April 2003.

[28] Kirk W. Cameron and Rong Ge. Predicting and evaluating distributed communication performance. In SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 43, Washington, DC, USA, 2004. IEEE Computer Society.

[29] F. Cappello and D. Etiemble. MPI vs. MPI+OpenMP on the IBM SP for the NAS Benchmarks. In Proc. of the IEEE/ACM Supercomputing’2000: High Performance Networking and Computing Conference (SC’2000), Dallas, Texas, November 2000.

[30] L. Chai, Q. Gao, and D. K. Panda. Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In Proc. of CCGrid2007, May 2007.

[31] Maria Charalambous, Pedro Trancoso, and Alexandros Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor. In Panhellenic Conference on Informatics, pages 415–425, 2005.

[32] T. Chen, Z. Sura, K. O’Brien, and K. O’Brien. Optimizing the Use of Static Buffers for DMA on a Cell Chip. In Proc. of the 19th International Workshop on Languages and Compilers for Parallel Computing, New Orleans, LA, November 2006.

[33] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata. Cell broadband engine architecture and its first implementation. IBM developerWorks, Nov 2005.

[34] Benny Chor and Tamir Tuller. Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics, 21(1):97–106, 2005.

[35] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. Von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’93), pages 1–12, San Diego, California, May 1993.

[36] Matthew Curtis-Maury, Filip Blagojevic, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems.

[37] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

[38] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 137–150, 2004.

[39] A. Eichenberger, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. Gschwind, K. O’Brien, K. O’Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, and B. So. Optimizing Compiler for the CELL Processor. In Proc. of the 14th International Conference on

Parallel Architectures and Compilation Techniques, pages 161–172, Saint Louis, MO, September 2005.

[40] B. Flachs et al. The Microarchitecture of the Streaming Processor for a CELL Processor. Proceedings of the IEEE International Solid-State Circuits Symposium, pages 184–185, February 2005.

[41] Carlo Fantozzi, Andrea Pietracaprina, and Geppino Pucci. Translating submachine locality into locality of reference. J. Parallel Distrib. Comput., 66(5):633–646, 2006.

[42] Alexandra Fedorova, Margo Seltzer, and Michael D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT ’07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 25–38, Washington, DC, USA, 2007. IEEE Computer Society.

[43] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17:368–376, 1981.

[44] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.

[45] X. Feng, K. Cameron, B. Smith, and C. Sosa. Building the Tree of Life on Terascale Systems. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[46] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC ’78: Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114–118, New York, NY, USA, 1978. ACM.

[47] Matthew Frank, Anant Agarwal, and Mary K. Vernon. LoPC: Modeling contention in parallel algorithms. In Principles and Practice of Parallel Programming, pages 276–287, 1997.

[48] Bugra Gedik, Rajesh Bordawekar, and Philip S. Yu. Cellsort: High performance sorting on the cell processor. In Proc. of the 33rd Very Large Databases Conference, pages 1286–1207, 2007.

[49] Buğra Gedik, Philip S. Yu, and Rajesh R. Bordawekar. Executing stream joins on the cell processor. In VLDB ’07: Proceedings of the 33rd international conference on Very large data bases, pages 363–374. VLDB Endowment, 2007.

[50] A. Gerndt, S. Sarholz, M. Wolter, D. An Mey, C. Bischof, and T. Kuhlen. Particles and Continuum – Nested OpenMP for Efficient Computation of 3D Critical Points in Multiblock Data Sets. In Proc. of Supercomputing’2006, Tampa, FL, November 2006.

[51] P. Gibbons. A More Practical PRAM Model. In Proc. of the First Annual ACM Symposium on Parallel Algorithms and Architectures, pages 158–168, Santa Fe, NM, June 1989.

[52] M. Girkar and C. Polychronopoulos. The Hierarchical Task Graph as a Universal Intermediate Representation. International Journal of Parallel Programming, 22(5):519–551, October 1994.

[53] M. Gordon, W. Thies, and S. Amarasinghe. Exploiting Coarse-Grained Task, Data and Pipelined Parallelism in Stream Programs. In Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 151–162, San Jose, CA, October 2006.

[54] Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, and Dinesh Manocha. Fast computation of database operations using graphics processors. In SIGMOD ’04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 215–226, New York, NY, USA, 2004. ACM.

[55] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI User’s Group Meeting, pages 11–18, Barcelona, Spain, September 1999.

[56] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI Users Group Meeting, pages 11–18, September 1999.

[57] M. Hill and M. Marty. Amdahl's Law in the Multi-core Era. Technical Report 1593, Department of Computer Sciences, University of Wisconsin-Madison, March 2007.

[58] Nils Hjelte. Smoothed particle hydrodynamics on the cell broadband engine. Master's thesis, Umeå University, Department of Computer Science, Jun 2006.

[59] F. Ino, N. Fujimoto, and K. Hagihara. LogGPS: A Parallel Computational Model for Synchronization Analysis. In Proc. of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 133–142, Snowbird, UT, June 2001.

[60] Ben H. H. Juurlink and Harry A. G. Wijshoff. The e-BSP model: Incorporating general locality and unbalanced communication into the BSP model. In Euro-Par, Vol. II, pages 339–347, 1996.

[61] W. Kahan. Lecture notes on the status of IEEE Standard 754 for binary floating-point arithmetic. 1997.

[62] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide. Efficient PRAM simulation on a distributed memory machine. In STOC ’92: Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pages 318–326, New York, NY, USA, 1992. ACM.

[63] Mike Kistler, Michael Perrone, and Fabrizio Petrini. Cell Multiprocessor Interconnection Network: Built for Speed. IEEE Micro, 26(3), May-June 2006. Available from http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf.

[64] G. Krawezik. Performance Comparison of MPI and three OpenMP Programming Styles on Shared Memory Multiprocessors. In Proc. of the 15th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 118–127, San Diego, CA, June 2003.

[65] E. Scott Larsen and David McAllister. Fast matrix multiplies using graphics hardware. In Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), pages 55–55, New York, NY, USA, 2001. ACM.

[66] L-K. Liu, Q. Li, A. Natsev, K.A. Ross, J.R. Smith, and A.L. Varbanescu. Digital media indexing on the cell processor. In ICME 2007, pages 1866–1869. IEEE Signal Processing Society, July 2007.

[67] Cathy McCann, Raj Vaswani, and John Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst., 11(2):146–178, 1993.

[68] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inf., 21(4):339–374, 1984.

[69] Barry Minor, Gordon Fossum, and Van To. Terrain rendering engine (TRE), http://www.research.ibm.com/cell/whitepapers/tre.pdf. May 2005.

[70] C. Moritz and M. Frank. LoGPC: Modeling Network Contention in Message Passing Programs. In Proc. of the 1998 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 254–263, Madison, WI, June 1998.

[71] PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual. http://www-306.ibm.com/chips/techlib.

[72] Christos Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. In STOC ’88: Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 510–513, New York, NY, USA, 1988. ACM.

[73] F. Petrini, G. Fossum, A. Varbanescu, M. Perrone, M. Kistler, and J. Fernandez Periador. Multi-core Surprises: Lessons Learned from Optimized Sweep3D on the Cell Broadband Engine. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[74] Fabrizio Petrini, Gordon Fossum, Mike Kistler, and Michael Perrone. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine.

[75] Fabrizio Petrini, Daniel Scarpazza, Oreste Villa, and Juan Fernandez. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[76] Mohan Rajagopalan, Brian T. Lewis, and Todd A. Anderson. Thread scheduling for multi-core platforms. In HotOS 2007: Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007.

[77] T. Rauber and G. Ruenger. Library Support for Hierarchical Multiprocessor Tasks. In Proc. of Supercomputing’2002, Baltimore, MD, November 2002.

[78] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Peak-performance DFA-based string matching on the cell processor. In IPDPS, pages 1–8. IEEE, 2007.

[79] Harald Servat, Cecilia Gonzalez, Xavier Aguilar, Daniel Cabrera, and Daniel Jimenez. Drug design on the cell broadband engine. In PACT ’07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, page 425, Washington, DC, USA, 2007. IEEE Computer Society.

[80] Alex Settle, Joshua Kihm, Andrew Janiszewski, and Dan Connors. Architectural support for enhanced SMT job scheduling. In PACT ’04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 63–73, Washington, DC, USA, 2004. IEEE Computer Society.

[81] I. Sharapov, R. Kroeger, G. Delamater, R. Cheveresan, and M. Ramsay. A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application. In Proc. of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 81–89, New York, NY, March 2006.

[82] Suresh Siddha, Venkatesh Pallipadi, and Asit Mallick. Process Scheduling Challenges in the Era of Multi-core Processors. Intel Technology Journal, 11(4), 2007.

[83] Allan Snavely and Dean M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS-IX: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, pages 234–244, New York, NY, USA, 2000. ACM.

[84] Allan Snavely, Dean M. Tullsen, and Geoff Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In SIGMETRICS ’02: Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 66–76, New York, NY, USA, 2002. ACM.

[85] Robert Springer, David K. Lowenthal, Barry Rountree, and Vincent W. Freeh. Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. In PPoPP ’06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 230–238, New York, NY, USA, 2006. ACM.

[86] A. Stamatakis. Phylogenetic models of rate heterogeneity: A high performance computing perspective. In Proceedings of 20th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS2006), High Performance Computational Biology Workshop, Proceedings on CD, Rhodos, Greece, April 2006.

[87] Alexandros Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, page btl446, 2006.

[88] J. Subhlok and G. Vondran. Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations. Journal of Parallel and Distributed Computing, 60(3):297–319, March 2000.

[89] J. Subhlok and B. Yang. A New Model for Integrated Nested Task and Data Parallelism. In Proc. of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–12, Las Vegas, NV, June 1997.

[90] Rajesh Sudarsan and Calvin J. Ribbens. ReSHAPE: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment, 2007.

[91] Alias Systems. Alias cloth technology demonstration for the cell processor, http://www.research.ibm.com/cell/whitepapers/alias_cloth.pdf. 2005.

[92] Cell broadband engine programming tutorial version 1.0; http://www-106.ibm.com/developerworks/eserver/library/es-archguide-v2.html.

[93] R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. SIGARCH Comput. Archit. News, 22(2):176–186, 1994.

[94] John A. Turner. Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation. Technical Report LANL-UR-07-1037, Los Alamos National Lab, Las Vegas, NV, February 2007. ASC Principal Investigator Meeting.

[95] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.

[96] Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick Y. Yang, Guei-Yuan Lueh, and Hong Wang. Exochi: architecture and programming environment for a heterogeneous multi-core multithreaded system. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 156–166, New York, NY, USA, 2007. ACM.

[97] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The Potential of the Cell Processor for Scientific Computing. In Proc. of the 3rd Conference on Computing Frontiers, pages 9–20, Ischia, Italy, June 2006.

[98] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. ACM International Conference on Computing Frontiers, May 3-6 2006.

[99] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Kather- ine Yelick. Scientific computing kernels on the cell processor. Int. J. Parallel Program., 35(3):263–298, 2007.

[100] Yun Zhang, Mihai Burcea, Victor Cheng, Ron Ho, and Michael Voss. An adaptive loop scheduler for hyperthreaded SMPs. In David A. Bader and Ashfaq A. Khokhar, editors, ISCA PDCS, pages 256–263. ISCA, 2004.

[101] Yun Zhang and Michael Voss. Runtime empirical selection of loop schedulers on hyperthreaded SMPs. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers, page 44.2, Washington, DC, USA, 2005. IEEE Computer Society.

[102] Y. Zhao and K. Kennedy. Dependence-based Code Generation for a Cell Processor. In Proc. of the 19th International Workshop on Languages and Compilers for Parallel Computing, New Orleans, LA, November 2006.
