Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in Computer Science and Applications

Committee Members: Dimitrios S. Nikolopoulos (Chair), Kirk W. Cameron, Wu-chun Feng, David K. Lowenthal, Calvin J. Ribbens

May 30, 2008 Blacksburg, Virginia

Keywords: Multicore processors, Cell BE, scheduling, high-performance computing, performance prediction, runtime adaptation

© Copyright 2008, Filip Blagojevic

Scheduling on Asymmetric Parallel Architectures

Filip Blagojevic

(ABSTRACT)

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. To maximize performance on heterogeneous multi-core processors, programs need to expose multiple dimensions of parallelism simultaneously. Unfortunately, programming with multiple dimensions of parallelism is to date an ad hoc process, relying heavily on the intuition and skill of programmers. Formal techniques are needed to optimize multi-dimensional parallel program designs. We investigate user- and kernel-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on asymmetric parallel platforms. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. Our runtime environment outperforms the native Linux and MPI scheduling environment by up to a factor of 2.7. We also present a model of multi-dimensional parallel computation for steering the parallelization process on heterogeneous multi-core processors. The model predicts with high accuracy the execution time and scalability of a program using conventional processors and accelerators simultaneously. More specifically, the model reveals optimal degrees of multi-dimensional, task-level and data-level concurrency, to maximize performance across cores. We evaluate our runtime policies, as well as the performance model we developed, on an IBM Cell BladeCenter and on a cluster composed of Playstation3 nodes, using two realistic bioinformatics applications.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Dimitrios S. Nikolopoulos for his guidance during my graduate studies. I would also like to thank Dr. Alexandros Stamatakis, Dr. Xizhou Feng, and Dr. Kirk Cameron for providing us with the original MPI implementations of PBPI and RAxML and for discussions on scheduling and modeling the Cell/BE. I would like to thank the members of the PEARL group, Dr. Christos Antonopoulos, Dr. Matthew Curtis-Maury, Scott Schneider, Jae-Sung Yeom, and Benjamin Rose, for their involvement in the projects presented in this dissertation. I would also like to thank my Ph.D. committee for their discussion and suggestions for this work: Dr. Kirk W. Cameron, Dr. David Lowenthal, Dr. Wu-chun Feng, and Dr. Calvin J. Ribbens. Also, I thank Georgia Tech, its Sony-Toshiba-IBM Center of Competence, and NSF, for the Cell/BE resources that have contributed to this research. Finally, I would like to thank the institutions that have funded this research: the National Science Foundation and the U.S. Department of Energy.


Contents

1 Problem Statement ...... 1
1.1 Mapping Parallelism to Asymmetric Parallel Architectures ...... 2

2 Statement of Objectives ...... 5
2.1 Dynamic Multigrain Parallelism ...... 5
2.2 Rightsizing Multigrain Parallelism ...... 8
2.3 MMGP Model ...... 9

3 Experimental Testbed ...... 11
3.1 RAxML ...... 12
3.2 PBPI ...... 13
3.3 Hardware Platform ...... 14

4 Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories ...... 17
4.1 Porting and Optimizing RAxML on Cell ...... 18
4.2 Function Off-loading ...... 18
4.2.1 Optimizing Off-Loaded Functions ...... 19
4.2.2 Vectorizing Conditional Statements ...... 20
4.2.3 Double Buffering and Memory Management ...... 23
4.2.4 Vectorization ...... 24
4.2.5 PPE-SPE Communication ...... 27
4.2.6 Increasing the Coverage of Offloading ...... 28
4.3 Parallel Execution ...... 29
4.4 Chapter Summary ...... 30

5 Scheduling Multigrain Parallelism on Asymmetric Systems ...... 33
5.1 Introduction ...... 33
5.2 Scheduling Multi-Grain Parallelism on Cell ...... 33
5.2.1 Event-Driven Task Scheduling ...... 34
5.2.2 Scheduling Loop-Level Parallelism ...... 36
5.2.3 Implementing Loop-Level Parallelism ...... 42
5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism ...... 43
5.3.1 Application-Specific Hybrid Parallelization on Cell ...... 44
5.3.2 MGPS ...... 47
5.4 S-MGPS ...... 49
5.4.1 Motivating Example ...... 50
5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism ...... 51
5.5 Chapter Summary ...... 57

6 Model of Multi-Grain Parallelism ...... 61
6.1 Introduction ...... 61
6.2 Modeling Abstractions ...... 62
6.2.1 Hardware Abstraction ...... 62
6.2.2 Application Abstraction ...... 63
6.3 Model of Multi-grain Parallelism ...... 65
6.3.1 Modeling sequential execution ...... 66
6.3.2 Modeling parallel execution on APUs ...... 67
6.3.3 Modeling parallel execution on HPUs ...... 69
6.3.4 Using MMGP ...... 71
6.3.5 MMGP Extensions ...... 72
6.4 Experimental Validation and Results ...... 72
6.4.1 MMGP Parameter approximation ...... 73
6.4.2 Case Study I: Using MMGP to parallelize PBPI ...... 74
6.4.3 Case Study II: Using MMGP to Parallelize RAxML ...... 77
6.4.4 MMGP Usability Study ...... 81
6.5 Chapter Summary ...... 83

7 Scheduling Asymmetric Parallelism on a PS3 Cluster ...... 85
7.1 Introduction ...... 85
7.2 Experimental Platform ...... 87
7.3 PS3 Cluster Scalability Study ...... 88
7.3.1 MPI Communication Performance ...... 88
7.3.2 Application Benchmarks ...... 88
7.4 Modeling Hybrid Parallelism ...... 93
7.4.1 Modeling PPE Execution Time ...... 94
7.4.2 Modeling the off-loaded Computation ...... 96
7.4.3 DMA Modeling ...... 97
7.4.4 Cluster Execution Modeling ...... 98
7.4.5 Verification ...... 99
7.5 Co-Scheduling on Asymmetric Clusters ...... 99
7.6 PS3 versus IBM QS20 Blades ...... 102
7.7 Chapter Summary ...... 104

8 Kernel-Level Scheduling ...... 107
8.1 Introduction ...... 107
8.2 SLED Scheduler Overview ...... 108
8.3 ready to run List ...... 110
8.3.1 ready to run List Organization ...... 110
8.3.2 Splitting ready to run List ...... 111
8.4 SLED Scheduler - Kernel Level ...... 113
8.5 SLED Scheduler - User Level ...... 116
8.6 Experimental Setup ...... 117
8.6.1 Benchmarks ...... 118
8.6.2 Microbenchmarks ...... 118
8.6.3 PBPI ...... 122
8.6.4 RAxML ...... 123
8.7 Chapter Summary ...... 125

9 Future Work ...... 127
9.1 Integrating ready-to-run list in the Kernel ...... 128
9.2 Load Balancing and Task Priorities ...... 130
9.3 Increasing Processor Utilization ...... 131
9.4 Novel Applications and Programming Models ...... 132
9.5 Conventional Architectures ...... 132
9.6 MMGP extensions ...... 133

10 Overview of Related Research ...... 135
10.1 Cell – Related Research ...... 135
10.2 Process Scheduling – Related Research ...... 138
10.3 Modeling – Related Research ...... 141
10.3.1 PRAM Model ...... 141
10.3.2 BSP model ...... 142
10.3.3 LogP model ...... 143
10.3.4 Models Describing Nested Parallelism ...... 144

Bibliography ...... 147

List of Figures

2.1 A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Ac- celerator processing units (APUs) are the main computation engines and may support internally finer grain parallelism...... 6

3.1 Organization of Cell...... 14

4.1 The likelihood vector structure is used in almost all memory traffic be- tween main memory and the local storage of the SPEs. The structure is 128-bit aligned, as required by the Cell architecture...... 23 4.2 The body of the first loop in newview(): a) Non–vectorized code, b) Vector- ized code...... 25

4.3 The second loop in newview(). Non–vectorized code shown on the left, vector- ized code shown on the right. spu madd() multiplies the first two arguments and adds the result to the third argument. spu splats() creates a vector by replicating a scalar element...... 26 4.4 Performance of (a) RAxML and (b) PBPI with different number of MPI pro- cesses...... 29

5.1 Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the be- havior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs...... 36 5.2 Parallelizing a loop across SPEs using a work-sharing model with an SPE des- ignated as the master...... 39

5.3 The data structure Pass is used for communication among SPEs. The vi ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism...... 42

5.4 Parallelization of the loop from function evaluate() in RAxML. The left side depicts the code executed by the master SPE, while the right side depicts the code executed by a worker SPE. Num SPE represents the number of SPE worker threads...... 44

5.5 Comparison of task-level and hybrid parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128...... 45

5.6 MGPS, EDTLP and static EDTLP-LLP. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128...... 49

5.7 Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC...... 51

5.8 Execution times of RAxML, with various static multi-grain scheduling strate- gies. The input dataset is 25 SC...... 51

5.9 The sampling phase of S-MGPS. Samples are taken from four execution inter- vals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops...... 53

5.10 PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1- 4, deg(LLP)=1–16 ...... 56

6.1 A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) relatively supply coarse-grain paral- lel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain paral- lelism. Both HPUs and APUs have local memories and communicate through shared-memory or message-passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion...... 62

6.2 Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this paper, we address the problem of mapping tasks and subtasks to accelerator-based systems...... 64
6.3 The sub-phases of a sequential application are readily mapped to HPUs and APUs. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via ...... 66
6.4 Parallel APU execution. The HPU (leftmost bar in parts a and b) offloads computations to one APU (part a) and two APUs (part b). The single point-to-point transfer of part a is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefits due to parallelization...... 68
6.5 Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first HPU thread offloads computation to APU1 and APU2, then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data, followed by APU3 and APU4...... 69
6.6 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism...... 75
6.7 MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation...... 76
6.8 MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors...... 78
6.9 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2...... 79
6.10 MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2...... 80
6.11 MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism...... 82

6.12 Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000)...... 83

7.1 MPI Allreduce() performance on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node...... 89
7.2 MPI Send/Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node...... 90
7.3 Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process...... 92
7.4 Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling...... 95
7.5 SPE execution ...... 96
7.6 Double buffering template for tiled parallel loops...... 97
7.7 Performance of yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process...... 101
7.8 Performance of different scheduling strategies in PBPI and RAxML...... 103
7.9 Comparison between the PS3 cluster and an IBM QS20 cluster...... 104

8.1 Upon completing the assigned tasks, the SPEs send signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in return can schedule the appropriate process on the PPE...... 108

8.2 Vertical overview of the SLED scheduler. The user level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel...... 109

8.3 Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to perform migration of the process P1 from CPU1 to CPU2...... 112
8.4 System call for migrating the processes across the execution contexts. Function sched_migrate_task() performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU...... 113
8.5 The ready-to-run list is split in two parts. Each of the two sublists contains processes that share the execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts...... 114
8.6 Execution flow of the SLEDS_yield() function: (a) the appropriate process is found in the running list (tree), (b) the process is pulled out from the list, and its priority is increased, (c) the process is returned to the list, and since its priority is increased it will be stored at the leftmost position...... 115
8.7 Outline of the SLEDS scheduler: upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks if the off-loaded task has finished (Line 14), and if not, calls the yield() function. yield() scans the ready-to-run list, and yields to the next process by executing the SLEDS_yield() system call...... 117
8.8 Execution times of RAxML when the ready-to-run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready-to-run list; the y-axis represents the execution time. Note that the lowest value for the y-axis is 12.5, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides...... 118
8.9 Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased...... 119
8.10 Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased – task size is limited to 2.1µs...... 120
8.11 EDTLP outperforms SLED for small task sizes due to higher complexity of the SLED scheme...... 121

8.12 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs...... 121
8.13 Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15µs – task size is limited to 2.1µs...... 122
8.14 Comparison of EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 123
8.15 Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 124
8.16 Comparison of EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 124
8.17 Comparison of EDTLP and the combination of SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis)...... 125

9.1 Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run queue to the kernel, which in return can schedule the appropriate process on the PPE...... 129

List of Tables

4.1 Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE. . . . 20 4.2 Execution time of RAxML after the floating-point conditional statement is trans- formed to an integer conditional statement and vectorized. The input file is 42 SC. 22 4.3 Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC...... 24 4.4 Execution time of RAxML following vectorization. The input file is 42 SC. . . 27 4.5 Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC...... 28 4.6 Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC...... 29

5.1 Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The work- load for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms...... 37 5.2 Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides...... 40 5.3 Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides...... 41

5.4 Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() corresponds to the degree of a given dimension of parallelism (LLP or TLP)...... 54

5.5 RAxML – Comparison between S-MGPS and static scheduling schemes, illus- trating the convergence overhead of S-MGPS...... 55

5.6 PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1– 4; (d) deg(TLP)=8, deg(LLP)=1–2...... 58

Chapter 1

Problem Statement

In the quest for delivering higher performance to scientific applications, hardware designers began to move away from conventional single-core designs and embraced architectures with multiple processing cores. Although all commodity microprocessor vendors are marketing multicore processors, these processors are largely based on replication of superscalar cores. Unfortunately, superscalar designs exhibit well-known performance and power limitations. These limitations, in conjunction with a sustained requirement for higher performance, stimulated interest in unconventional processor designs that combine parallelism with acceleration. These designs leverage multiple cores, some of which are customized accelerators for data-intensive computation. Examples of these heterogeneous, accelerator-based parallel architectures are the Cell BE [3], GPGPUs [4], the Rapport KiloCore [2], EXOCHI [96], etc.

As a case study and a representative of accelerator-based asymmetric architectures, in this dissertation we investigate the Cell Broadband Engine (CBE). Cell has recently drawn considerable attention from industry and academia. Since it was originally designed for the game box market, Cell has low cost and a modest power budget. Nevertheless, the processor is able to achieve unprecedented peak performance for some real-world applications. IBM recently announced the use of Cell chips in a new Petaflop system with 16,000 Cells, named RoadRunner, due for delivery in 2008.

The potential of the Cell BE has been demonstrated convincingly in a number of studies [33,39,69,74,91]. Thanks to eight high-frequency execution cores with pipelined SIMD capabilities, and an aggressive data transfer architecture, Cell has a theoretical peak performance of over 200 Gflops for single-precision FP calculations and a peak memory bandwidth of over 25 Gigabytes/s. These performance figures position Cell ahead of the competition against the most powerful commodity microprocessors. Cell has already demonstrated impressive performance ratings in applications and computational kernels with highly vectorizable data parallelism, such as signal processing, compression, encryption, dense and sparse numerical kernels [12, 13, 15, 39, 48, 49, 66, 75, 78, 79, 99].

1.1 Mapping Parallelism to Asymmetric Parallel Architectures

Arguably, one of the most difficult problems that programmers face while migrating to a new parallel architecture is the mapping of algorithms and data to the architecture. Accelerator- based multi-core processors complicate this problem in two ways. Firstly, by introducing het- erogeneous execution cores, the user needs to be concerned with mapping each component of the application to the type of core that best matches the computational and memory bandwidth demand of the component. Secondly, by providing multiple cores with embedded SIMD or multi-threading capabilities, the user needs to be concerned with extracting multiple dimen- sions of parallelism from the application and mapping each dimension to parallel execution units, so as to maximize performance.

Cell provides a motivating and timely example for the problem of mapping algorithmic parallelism to modern multi-core architectures. The processor can exploit task and data par- allelism, both across and within its cores. On accelerator-based multi-core architectures the programmer must be aware of core heterogeneity, and carefully balance execution between the

host and accelerator cores. Furthermore, the programmer faces a seemingly vast number of options for parallelizing code on these architectures. Functional and data decompositions of the program can be implemented on both the host and the accelerator cores. Functional decompositions can be achieved by dividing functions between the hosts and the accelerators and by off-loading functions from the hosts to accelerators at runtime. Data decompositions are also possible, by using SIMDization on the vector units of the accelerator cores, or loop-level parallelization across accelerators, or a combination of loop-level parallelization across accelerators and SIMDization within accelerators. In this thesis we explore different approaches used to automate the mapping of applications to asymmetric parallel architectures. We explore both runtime and static approaches for combining and managing functional and data decomposition. We combine and orchestrate multiple levels of parallelism inside an application in order to achieve both harmonious utilization of all host and accelerator cores and the high memory bandwidth available on asymmetric multi-core processors. Although we chose Cell as our case study, our scheduling algorithms and decisions are general and can be applied to any asymmetric parallel architecture.

Chapter 2

Statement of Objectives

2.1 Dynamic Multigrain Parallelism

While many studies have been focused on performance evaluation and optimizations for heterogeneous multi-core architectures [23, 31, 54, 63, 65, 74, 98], the optimal mapping of parallel applications to these architectures has not been investigated. In this thesis we explore heterogeneous multi-core architectures from a different perspective, namely that of multigrain parallelization. Asymmetric parallel architectures have a specific design: they can exploit orthogonal dimensions of task and data parallelism on a single chip. The processor is controlled by one or more host processing elements, which usually schedule the computation off-loaded to accelerator processing units. The accelerators are usually SIMD processors and provide the bulk of the processor's computational power. A general design of heterogeneous, accelerator-based architectures is represented in Figure 2.1.

To simplify programming and improve efficiency on asymmetric parallel architectures, we present a set of dynamic scheduling policies and the associated mechanisms. We introduce an event-driven scheduler, EDTLP, which oversubscribes the host processing cores and exposes dynamic parallelism across accelerators. We also propose MGPS, a scheduling module which controls multi-grain parallelism on the fly to monotonically increase accelerator utilization.


Figure 2.1: A hardware abstraction of an accelerator-based architecture. Host processing units (HPUs) supply coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may support internally finer grain paral- lelism.

MGPS monitors the number of active accelerators used by off-loaded tasks over discrete inter- vals of execution and makes a prediction on the best combination of dimensions and granularity of parallelism to expose to the hardware. The purpose of these policies is to exploit the proper layers and degrees of parallelism from the application, in order to maximize efficiency of the processor’s computational cores. We explore the design and implementation of our schedul- ing policies using two real-world scientific applications, RAxML [87] and PBPI [45]. RAxML and PBPI are bioinformatics applications used for generating the phylogenetic trees, and we describe them in more detail in Chapter 3.

One of the most efficient execution models on asymmetric parallel architectures, which reduces the idle time on the host processors as well as on the accelerators, is to oversubscribe the host processing unit with multiple processes. In this approach, one or more accelerators are assigned to each process for off-loading the expensive computation. Although the off-loading approach enables high utilization of the architecture, it also increases contention and the number of context switches on the host processing unit, as well as the time necessary for a single context switch to complete. To reduce the contention caused by context switching, and the idle time that occurs on the accelerator cores as a consequence, we designed and implemented the slack-minimizer scheduler (SLED).

In our case study, the SLED scheduler improves performance on the Cell processor by up to 17%.

The study related to dynamic scheduling strategies makes the following contributions:

• We present a runtime system and scheduling policies that exploit polymorphic (task and loop-level) parallelism on asymmetric parallel processors. Our runtime system is adap- tive, in the sense that it chooses the form and degree of parallelism to expose to the hardware, in response to workload characteristics. Since the right choice of form(s) and degree(s) of parallelism depends non-trivially on workload characteristics and user input, our runtime system unloads an important burden from the programmer.

• We show that dynamic multigrain parallelization is a necessary optimization for sustain- ing maximum performance on asymmetric parallel architectures, since no static paral- lelization scheme is able to achieve high accelerator efficiency in all cases.

• We present an event-driven multithreading execution engine, which achieves higher effi- ciency on accelerators by oversubscribing the host core.

• We present a feedback-guided scheduling policy for dynamically triggering and throttling loop-level parallelism across accelerators. We show that work-sharing of divisible tasks across accelerators should be used when the event-driven multithreading engine leaves more than half of the accelerators idle. We observe benefits from loop-level paralleliza- tion of off-loaded tasks across accelerators. However, we also observe that loop-level parallelism should be exposed only in conjunction with low-degree task-level parallelism.

• We present the kernel-level extensions to our runtime system, which enable efficient process scheduling when the host core is oversubscribed with multiple processes.

2.2 Rightsizing Multigrain Parallelism

When executing a multi-level parallel application on asymmetric parallel processors, performance can be strongly affected by the execution configuration. In the case of RAxML execution on the Cell processor, depending on the runtime degree of each level of parallelism in the application, the performance variation can be as high as 40%. To address the issue of determining the optimal parallel configuration, we introduce a new runtime scheduler, S-MGPS, which performs sampling and timing of the dominant phases in the application in order to determine the most efficient mapping of different levels of parallelism to the architecture. There are several essential differences between S-MGPS and our previously introduced runtime scheduler, MGPS. MGPS is a utilization-driven scheduler, which seeks the highest possible accelerator utilization by exploiting additional layers of parallelism when some accelerator cores appear underutilized. MGPS attempts to increase utilization by creating more accelerator tasks from innermost layers of parallelism, more specifically, as many tasks as the number of idle accelerators recorded during intervals of execution. S-MGPS is a scheduler which seeks the optimal application-system configuration, in terms of layers of parallelism exposed to the hardware and degree of granularity per layer of parallelism, based on the runtime task throughput of the application and regardless of system utilization. S-MGPS takes into account the cumulative effects of contention and other system bottlenecks on software parallelism and can converge to the best multi-grain parallel execution algorithm. MGPS, on the other hand, uses only information on SPE utilization and may often converge to a suboptimal multi-grain parallel execution algorithm. A further contribution of S-MGPS is that the scheduler is immune to the initial configuration of parallelism in the application and uses a sampling method which is independent of application-specific parameters, or input. On the contrary, the performance of MGPS is sensitive to both the initial structure of parallelism in the application and input.
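To make the sampling idea concrete, the fragment below sketches the selection loop at the core of such a scheduler. It is an illustration rather than the S-MGPS implementation: run_phase_with_config() is a hypothetical hook that executes one sampling interval under a given degree of task-level (TLP) and loop-level (LLP) parallelism, and throughput is simply completed tasks divided by wall-clock time.

#include <stdio.h>
#include <time.h>

/* Hypothetical hook: execute one sampling interval of the application with
 * 'tlp' MPI processes driving off-loads and 'llp' SPEs per off-loaded loop,
 * and return the number of tasks (e.g., bootstraps) completed. */
extern int run_phase_with_config(int tlp, int llp);

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Sampling phase: try each feasible (TLP, LLP) split of the available SPEs
 * once and keep the configuration with the highest observed task throughput. */
void sample_configurations(int num_spes, int *best_tlp, int *best_llp)
{
    double best_rate = 0.0;
    for (int tlp = 1; tlp <= num_spes; tlp++) {
        for (int llp = 1; tlp * llp <= num_spes; llp++) {
            double t0 = now_sec();
            int tasks = run_phase_with_config(tlp, llp);
            double rate = tasks / (now_sec() - t0);
            if (rate > best_rate) {
                best_rate = rate;
                *best_tlp = tlp;
                *best_llp = llp;
            }
        }
    }
}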

Although the scientific codes we use in this thesis implement similar functionality, they differ in their structure and parallelization strategies and raise different challenges for user-level

schedulers. We show that S-MGPS performs within 2% of the optimal scheduling algorithm in PBPI and within 2%–10% of the optimal scheduling algorithm in RAxML. We also show that S-MGPS adapts well to variation of the input size and granularity of parallelism, whereas the performance of MGPS is sensitive to both these factors.

2.3 MMGP Model

The technique used by the S-MGPS scheduler might not be scalable to large, complex systems, large applications, or applications with behavior that varies significantly with the input. The execution time of a complex application is a function of many parameters. A given parallel application may consist of N phases, where each phase is affected differently by accelerators. Each phase can exploit d dimensions of parallelism, or any combination thereof, such as ILP, TLP, or both. Each phase or dimension of parallelism can use any of m different programming and execution models, such as message passing, shared memory, SIMD, or any combination thereof. Accelerator availability or use may consist of c possible configurations, involving different numbers of accelerators. Exhaustive analysis of the execution time for all combinations requires at least N × d × m × c trials with any given input.

Models of parallel computation have been instrumental in the adoption and use of parallel systems. Unfortunately, commonly used models [24, 35] are not directly portable to accelerator-based systems. First, the heterogeneous processing common to these systems is not reflected in most models of parallel computation. Second, current models do not capture the effects of multi-grain parallelism. Third, few models account for the effects of using multiple programming models in the same program. Parallel programming at multiple dimensions and with a synthesis of models consumes both enormous amounts of programming effort and significant amounts of execution time, if not handled with care. To overcome these deficits, we present a model for multi-dimensional parallel computation on asymmetric multi-core processors. Considering that each dimension of parallelism reflects a different degree of computation granularity,

we name the model MMGP, for Model of Multi-Grain Parallelism. MMGP is an analytical model which formalizes the process of programming accelerator-based systems and reduces the need for exhaustive measurements. This dissertation presents a generalized MMGP model for accelerator-based architectures with one layer of host processor parallelism and one layer of accelerator parallelism, followed by the specialization of this model for the Cell Broadband Engine. The input to MMGP is an explicitly parallel program, with parallelism expressed with machine-independent abstractions, using common programming libraries and constructs. Upon identification of a few key parameters of the application, derived from micro-benchmarking and profiling of a sequential run, MMGP predicts with reasonable accuracy the execution time of all feasible mappings of the application to host processors and accelerators. MMGP is fast and reasonably accurate, therefore it can be used to quickly identify optimal operating points, in terms of the exposed layers of parallelism and the degree of parallelism in each layer, on accelerator-based systems. Experiments with two complete applications from the field of computational phylogenetics, on a shared-memory multiprocessor with single and multiple nodes that contain the Cell BE, show that MMGP models the parallel execution time of complex parallel codes with multiple layers of task and data parallelism with a mean error in the range of 1%–6%, across all feasible program configurations on the target system. Due to the narrow margin of error, MMGP accurately predicts the optimal mapping of programs to cores for the cases we have studied so far.
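Schematically, once such a predictive model is available, selecting an operating point reduces to a small discrete optimization over feasible configurations. The expression below is only an illustration of how the prediction is used, not the MMGP formulation itself (which is developed in Chapter 6); p denotes the degree of host (task-level) parallelism and q the degree of accelerator (data-level) parallelism per task:

\[
(p^{*}, q^{*}) \;=\; \operatorname*{arg\,min}_{\,1 \le p \le N_{\mathrm{HPU}},\ 1 \le q,\ p\,q \le N_{\mathrm{APU}}} \; T_{\mathrm{MMGP}}(p, q)
\]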

Chapter 3

Experimental Testbed

This chapter provides details on our experimental testbed, including the two applications that we used to study user-level schedulers on the Cell BE (RAxML and PBPI) and the hardware platform on which we conducted this research.

RAxML and PBPI are computational biology applications designed to determine phylogenetic trees. Phylogenetic trees are used to represent the evolutionary history of a set of n organisms. An alignment with the DNA or AA sequences representing those n organisms (also called taxa) can be used as input for the computation of phylogenetic trees. In a phylogeny the organisms of the input data set are located at the tips (leaves) of the tree, whereas the inner nodes represent extinct common ancestors. The branches of the tree represent the time which was required for the mutation of one species into another, new one. The generation of phylogenies with computational methods has many important applications in medical and biological research (see [14] for a summary).

The fundamental algorithmic problem computational phylogeny faces is the immense number of alternative tree topologies, which grows exponentially with the number of organisms n; e.g., for n = 50 organisms there exist 2.84 × 10^76 alternative trees (the number of atoms in the universe is ≈ 10^80). In fact, it has only recently been shown that the phylogeny problem is NP-hard [34]. In addition, generating phylogenies is a very memory- and floating point-intensive

process, such that the application of high performance computing techniques as well as the assessment of new CPU architectures can contribute significantly to the reconstruction of larger and more accurate trees. The computation of the phylogenetic tree containing representatives of all living beings on earth is still one of the grand challenges in Bioinformatics.
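As a side note on the growth rate quoted above, the standard count of distinct unrooted binary tree topologies on n taxa follows from a simple argument: the n-th taxon can be attached to any branch of a tree over the first n - 1 taxa, and an unrooted binary tree with n - 1 leaves has 2(n-1) - 3 = 2n - 5 branches. This gives the double-factorial expression below (stated here only for orientation; the figure of 2.84 × 10^76 above is taken from the text):

\[
T(n) \;=\; \prod_{i=3}^{n} (2i - 5) \;=\; (2n-5)!!\,, \qquad T(n) = (2n-5)\,T(n-1), \quad T(3) = 1.
\]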

3.1 RAxML

RAxML-VI-HPC (v2.1.3) (Randomized Axelerated Maximum Likelihood version VI for High Performance Computing) [87] is a program for large-scale ML-based (Maximum Likelihood [43]) inference of phylogenetic (evolutionary) trees using multiple alignments of DNA or AA (Amino Acid) sequences. The program is freely available as open source code at icwww.epfl.ch/˜stamatak.

The current version of RAxML incorporates a rapid hill climbing search algorithm. A re- cent performance study [87] on real world datasets with ≥ 1,000 sequences reveals that it is able to find better trees in less time and with lower memory consumption than other current ML programs (IQPNNI, PHYML, GARLI). Moreover, RAxML-VI-HPC has been parallelized with MPI (Message Passing Interface), to enable non-parametric bootstrap- ping and multiple inferences on distinct starting trees in order to search for the best-known ML tree. Like every ML-based program, RAxML exhibits a source of fine-grained loop-level par- allelism in the likelihood functions which consume over 90% of the overall computation time. This source of parallelism scales well on large memory-intensive multi-gene alignments due to increased cache efficiency.

The MPI version of RAxML is the basis of our Cell version of the code [20]. In RAxML multiple inferences on the original alignment are required in order to determine the best-known (best-scoring) ML tree (we use the term best-known because the problem is NP-hard). Fur- thermore, bootstrap analyses are required to assign confidence values ranging between 0.0 and 1.0 to the internal branches of the best-known ML tree. This allows determining how well- supported certain parts of the tree are and is important for the biological conclusions drawn

from it. All those individual tree searches, be they bootstraps or multiple inferences, are completely independent of each other and can thus be exploited by a simple master-worker MPI scheme. Each search can further exploit data parallelism via thread-level parallelization of loops and/or SIMDization.
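As an illustration of such a master-worker scheme (not RAxML's actual code), the sketch below distributes independent inferences to MPI worker processes and collects their likelihood scores; run_one_inference() is a hypothetical stand-in for a complete tree search.

#include <mpi.h>

#define TAG_WORK 1
#define TAG_DONE 2

/* Hypothetical stand-in for one complete, independent tree search
 * (a bootstrap replicate or an inference from a distinct starting tree). */
extern double run_one_inference(int replicate_id);

/* Minimal master-worker distribution of independent inferences.
 * Assumes at least two MPI processes (one master, one or more workers). */
void distribute_inferences(int num_replicates)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        int next = 0, done = 0;
        /* Prime every worker with one replicate id. */
        for (int w = 1; w < size && next < num_replicates; w++) {
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
            next++;
        }
        /* Collect results and hand out the remaining replicates. */
        while (done < num_replicates) {
            double score;
            MPI_Status st;
            MPI_Recv(&score, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_DONE,
                     MPI_COMM_WORLD, &st);
            done++;
            if (next < num_replicates) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            }
        }
        /* Tell all workers to exit. */
        int stop = -1;
        for (int w = 1; w < size; w++)
            MPI_Send(&stop, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
    } else {
        for (;;) {
            int id;
            MPI_Recv(&id, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (id < 0)
                break;
            double score = run_one_inference(id);
            MPI_Send(&score, 1, MPI_DOUBLE, 0, TAG_DONE, MPI_COMM_WORLD);
        }
    }
}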

3.2 PBPI

PBPI is based on Bayesian phylogenetic inference, which constructs phylogenetic trees from DNA or AA sequences using the Markov Chain Monte Carlo (MCMC) sampling method. The program is freely available as open source code at www.pbpi.org. The MCMC method is inherently sequential, and the state of each time step depends on previous time steps. Therefore, the PBPI application uses the algorithmic improvements described below to achieve highly efficient parallel inference of phylogenetic trees. PBPI exploits multi-grain parallelism to achieve scalability on large-scale systems, such as the IBM BlueGene/L [45]. The algorithm of PBPI can be summarized as follows (an illustrative sketch of the corresponding MPI process-grid setup is given after the list):

1. Partition the Markov chains into chain groups, and split the data set into segments along the sequences.

2. Organize the virtual processors that execute the code into a two-dimensional grid; map each chain group to a row on the grid and map each segment to a column on the grid.

3. During each generation, compute the partial likelihood across all columns and use all-to- all communication to collect the complete likelihood values to all virtual processors on the same row.

4. When there are multiple chains, randomly choose two chains for swapping using point- to-point communication.
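The communicator layout implied by steps 1–3 can be sketched as follows. This is an illustrative fragment under assumed names (build_grid(), combine_partial_likelihood()), not PBPI's actual implementation: processes in the same row (chain group) share a communicator over which the per-segment partial likelihoods are combined.

#include <mpi.h>

/* Illustrative 2-D decomposition: rows = Markov chain groups,
 * columns = alignment segments. Assumes the number of MPI processes is
 * divisible by the number of chain groups. */
void build_grid(int num_chain_groups, MPI_Comm *row_comm, MPI_Comm *col_comm)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int num_segments = size / num_chain_groups;   /* grid is rows x columns */
    int row = rank / num_segments;                /* chain group index      */
    int col = rank % num_segments;                /* segment index          */

    /* Processes with the same 'row' share a communicator: the collective
     * that combines partial likelihoods (step 3) runs inside it. */
    MPI_Comm_split(MPI_COMM_WORLD, row, col, row_comm);
    /* Column communicator, e.g. for chain-swap coordination (step 4). */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, col_comm);
}

/* Step 3: combine per-segment partial log-likelihoods within one row. */
double combine_partial_likelihood(double partial_loglik, MPI_Comm row_comm)
{
    double total = 0.0;
    MPI_Allreduce(&partial_loglik, &total, 1, MPI_DOUBLE, MPI_SUM, row_comm);
    return total;
}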

Figure 3.1: Organization of Cell: the PPE and eight SPEs, each with its local storage (LS), connected through the Element Interconnect Bus (EIB) to the memory and I/O controllers.

From a computational perspective, PBPI differs substantially from RAxML. While RAxML is embarrassingly parallel, PBPI uses a predetermined virtual processor topology and a corresponding data decomposition method. While the degree of task parallelism in RAxML may vary considerably at runtime, PBPI exposes, from the beginning of execution, a high degree of two-dimensional data parallelism to the runtime system. On the other hand, while the degree of task parallelism can be controlled dynamically in RAxML without a performance penalty, in PBPI changing the degree of outermost data parallelism requires data redistribution and incurs a high performance penalty.

3.3 Hardware Platform

The Cell BE is a heterogeneous multi-core processor which integrates a simultaneous multi-threading PowerPC core (the Power Processing Element or PPE) and eight specialized accelerator cores (the Synergistic Processing Elements or SPEs) [40]. These elements are connected in a ring topology on an on-chip network called the Element Interconnect Bus (EIB). The organization of Cell is illustrated in Figure 3.1.

The PPE is a 64-bit SMT processor running the PowerPC ISA, with vector/SIMD multimedia extensions [71]. The PPE has two levels of on-chip cache. The L1-I and L1-D caches of the PPE have a capacity of 32 KB. The L2 cache of the PPE has a capacity of 512 KB.

Each SPE is a 128-bit vector processor with two major components: a Synergistic Processor Unit (SPU) and a Memory Flow Controller (MFC). All instructions are executed on the SPU. The SPU includes 128 registers, each 128 bits wide, and 256 KB of software-controlled local storage. The SPU can fetch instructions and data only from its local storage and can write data only to its local storage. The SPU implements a Cell-specific set of SIMD intrinsics. All single-precision floating point operations on the SPU are fully pipelined, and the SPU can issue one single-precision floating point operation per cycle. Double-precision floating point operations are partially pipelined, and two double-precision floating point operations can be issued every six cycles. Double-precision FP performance is therefore significantly lower than single-precision FP performance. With all eight SPUs active and double-precision FP operations fully pipelined, the Cell BE is capable of a peak performance of 21.03 Gflops. In single-precision FP operation, the Cell BE is capable of a peak performance of 230.4 Gflops [33]. The SPEs can access RAM through direct memory access (DMA) requests. DMA transfers are handled by the MFC. All programs running on an SPE use the MFC to move data and instructions between local storage and main memory. Data transferred between local storage and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16 KB. DMA lists can be used for transferring more than 16 KB of data. A list can have up to 2,048 DMA requests, each for up to 16 KB. The MFC supports only DMA transfer sizes that are 1, 2, 4, 8, or multiples of 16 bytes long. The EIB is an on-chip coherent bus that handles communication between the PPE, SPEs, main memory, and I/O devices. Physically, the EIB is a 4-ring structure which can transmit 96 bytes per cycle, for a maximum theoretical memory bandwidth of 204.8 Gigabytes/second. The EIB can support more than 100 outstanding DMA requests. In this work we are using a Cell blade (IBM BladeCenter QS20) with two Cell BEs running at 3.2 GHz and 1 GB of XDR RAM (512 MB per processor). The PPEs run Linux Fedora Core 6. We use IBM SDK 2.1 and LAM/MPI 7.1.3.
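A minimal SPE-side fragment illustrating these DMA constraints is shown below. It is a sketch under our own names (fetch_block() and the buffer are assumptions, not SDK or application code), using the mfc_get() and tag-status primitives from spu_mfcio.h.

#include <spu_mfcio.h>

/* Local-storage buffer: 16 KB is the per-request DMA limit; 128-byte
 * alignment satisfies (and exceeds) the basic 16-byte alignment rule. */
static char buffer[16384] __attribute__((aligned(128)));

/* Fetch 'size' bytes (<= 16 KB, a multiple of 16) from effective address
 * 'ea' in main memory into local storage, then block until the transfer
 * completes. Illustrative fragment only. */
void fetch_block(unsigned long long ea, unsigned int size)
{
    const unsigned int tag = 0;               /* DMA tag group 0..31  */

    mfc_get(buffer, ea, size, tag, 0, 0);     /* enqueue the transfer */
    mfc_write_tag_mask(1 << tag);             /* select the tag group */
    mfc_read_tag_status_all();                /* wait for completion  */
}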

15 16 Chapter 4

Code Optimization Methodologies for Asymmetric Multi-core Systems with Explicitly Managed Memories

Accelerator-based architectures with explicitly managed memories have the advantage of achieving a high degree of communication-computation overlap. While this is a highly desirable goal in high-performance computing, it is also a significant drawback from the programmability perspective. Managing all memory accesses from the application level significantly increases the complexity of the written code. In our work, we investigate execution models that reduce the complexity of the code written for asymmetric architectures, but still achieve desirable performance and high utilization of the available architectural resources. We investigate a set of optimizations that have the most significant impact on the performance of scientific applications executed on asymmetric architectures. In our case study, we investigate the optimization process which enables efficient execution of RAxML and PBPI on the Cell architecture.

The results presented in this chapter indicate that RAxML and PBPI are highly optimized for Cell, and also motivate the discussion presented in the rest of the thesis. The Cell-specific optimizations applied to the two bioinformatics applications resulted in a more than two-fold speedup. At the same time, we show that, regardless of being extensively optimized for sequential execution, parallel applications demand sophisticated scheduling support for efficient parallel execution on heterogeneous multi-core platforms.

4.1 Porting and Optimizing RAxML on Cell

We ported RAxML to Cell in four steps:

1. We ported the MPI code on the PPE;

2. We offloaded the most time-consuming parts of each MPI process on the SPEs;

3. We optimized the SPE code using vectorization of floating point computation, vectoriza- tion of control statements coupled with a specialized casting transformation, overlapping of computation and communication (double buffering) and other communication opti- mizations;

4. Lastly, we implemented multi-level parallelization schemes across and within SPEs in selected cases, as well as a scheduler for effective simultaneous exploitation of task, loop, and SIMD parallelism.

We outline optimizations 1-3 in the rest of the chapter. We focus on multi-level paralleliza- tion, as well as different scheduling policies in Chapter 5.

4.2 Function Off-loading

We profiled the application using gprof to identify the computationally intensive functions that could be candidates for offloading and optimization on SPEs. We used an IBM Power5 processor for profiling RAxML. For the profiling and benchmarking runs of RAxML presented in this chapter, we used the input file 42 SC, which contains 42 organisms, each represented by a DNA sequence of 1167 nucleotides. The number of distinct data patterns in a DNA alignment is on the order of 250. On the IBM Power5, 98.77% of the total execution time is spent in three functions:

• 77.24% in newview() - which computes the partial likelihood vector [44] at an inner node of the phylogenetic tree,

18 • 19.16% in makenewz() - which optimizes the length of a given branch with respect to the tree likelihood using the Newton–Raphson method,

• 2.37% in evaluate() - which calculates the log likelihood score of the tree at a given branch by summing over the partial likelihood vector entries.

These functions are the best candidates for offloading on SPEs.

The prerequisite for computing evaluate() and makenewz() is that the likelihood vectors at the nodes of the phylogenetic tree that are right and left of the current branch have been computed. Thus, makenewz() and evaluate() initially make calls to newview(), before they can execute their own computation. The newview() function at an inner node p of a tree, calls itself recursively when the two children r and q are not tips (leaves) and the likelihood array for r and q has not already been computed. Consequently, the first candidate for offloading is newview().

Although makenewz() and evaluate() are both taking a smaller portion of the execution time than newview(), offloading these two functions results in significant speedup (see Section 4.2.6). Besides the fact that each function can be executed faster on an SPE, having all three functions offloaded to an SPE reduces significantly the amount of PPE-SPE communication. In order to have a function executed on an SPE, we spawn an SPE thread at the beginning of each MPI process. The thread executes the offloaded function upon receiving a signal from the PPE and returns the result back to the PPE upon completion. To avoid excessive overhead from repeated thread spawning and joining, threads remain bound on SPEs and busy-wait for the PPE signal, before starting to execute a function.
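The SPE side of this scheme can be sketched as the dispatch loop below. It is illustrative rather than RAxML's actual code: the command encoding and the spe_newview()/spe_makenewz()/spe_evaluate() stubs are assumptions, and a blocking mailbox read stands in for the busy-wait on the PPE signal.

#include <spu_mfcio.h>

enum { CMD_EXIT = 0, CMD_NEWVIEW = 1, CMD_MAKENEWZ = 2, CMD_EVALUATE = 3 };

/* Hypothetical stand-ins for the off-loaded kernels. */
extern void spe_newview(void);
extern void spe_makenewz(void);
extern void spe_evaluate(void);

/* SPE-side dispatch loop: the thread stays bound to the SPE and waits for a
 * command from the PPE instead of being re-spawned for every off-load.
 * spu_read_in_mbox() stalls until the PPE writes to the inbound mailbox. */
int main(unsigned long long spe_id, unsigned long long argp,
         unsigned long long envp)
{
    (void)spe_id; (void)argp; (void)envp;
    for (;;) {
        unsigned int cmd = spu_read_in_mbox();   /* wait for the PPE signal */
        if (cmd == CMD_EXIT)
            break;
        switch (cmd) {
        case CMD_NEWVIEW:  spe_newview();  break;
        case CMD_MAKENEWZ: spe_makenewz(); break;
        case CMD_EVALUATE: spe_evaluate(); break;
        }
        spu_write_out_mbox(cmd);                 /* report completion       */
    }
    return 0;
}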

4.2.1 Optimizing Off-Loaded Functions

The discussion in this section refers to function newview(), which is the most computationally expensive in the code. Table 4.1 summarizes the execution times of RAxML before and after newview() is offloaded. The first column shows the number of workers (MPI processes) used in the experiment and the amount of work (bootstraps) performed. The maximum number

of workers we use is 2, since more workers would conflict on the PPE, which is a 2-way SMT processor. Executing a small number of workers results in low SPE utilization (each worker uses one SPE). In Section 4.3, we present results when the PPE is oversubscribed with up to 8 worker processes.

(a) Whole application on the PPE          (b) newview() offloaded to one SPE
1 worker, 1 bootstrap        24.4s        1 worker, 1 bootstrap        45s
2 workers, 8 bootstraps     134.1s        2 workers, 8 bootstraps     201.9s
2 workers, 16 bootstraps    267.7s        2 workers, 16 bootstraps    401.7s
2 workers, 32 bootstraps    539s          2 workers, 32 bootstraps    805s

Table 4.1: Execution time of RAxML (in seconds). The input file is 42 SC. (a) The whole application is executed on the PPE, (b) newview() is offloaded on one SPE.

As shown in Table 4.1, merely offloading newview() causes performance degradation. We profiled the new version of the code in order to get a better understanding of the major bot- tlenecks. Inside newview(), we identified 3 parts where the function spends almost its entire lifetime: the first part includes a large if(...) statement with a conjunction of four arithmetic comparisons used to check if small likelihood vector entries need to be scaled to avoid numerical underflow (similar checks are used in every ML implementation); the second time-consuming part involves DMA transfers; the third includes the loops that perform the actual likelihood vector calculation. In the next few sections we describe the techniques used to optimize the aforementioned parts in newview(). The same techniques were applied to the other offloaded functions.

4.2.2 Vectorizing Conditional Statements

RAxML always invokes newview() at an inner node of the tree (p) which is at the root of a sub- tree. The main computational kernel in newview() has a switch statement which selects one out of four paths of execution. If one or both descendants (r and q) of p are tips (leaves), the com- putations of the main loop in newview() can be simplified. This optimization leads to significant

performance improvements [87]. To activate the optimization, we use four implementations of the main computational part of newview() for the case that r and q are tips, r is a tip, q is a tip, or r and q are both inner nodes.

Each of the four execution paths in newview() leads to a distinct—highly optimized— version of the loop which performs the actual likelihood vector calculations. Each iteration of this loop executes the previously mentioned if() statement (Section 4.2.1), to check for like- lihood scaling. Mis-predicted branches in the compiled code for this statement incur a penalty of approximately 20 cycles [92]. We profiled newview() and found that 45% of the execution time is spent in this particular conditional statement. Furthermore, almost all the time is spent in checking the condition, while negligible time is spent in the body of code in the fall-through part of the conditional statement. The problematic conditional statement is shown below. The symbol ml is a constant and all operands are double precision floating point numbers.

if (ABS(x3->a) < ml && ABS(x3->g) < ml && ABS(x3->c) < ml && ABS(x3->t) < ml) {

...

}

This statement is a challenge for a branch predictor, since it implies 8 conditions, one for each of the four ABS() macros and the four comparisons against the minimum likelihood value constant (ml). On an SPE, comparing integers can be significantly faster than comparing doubles, since integer values can be compared using the SPE intrinsics. Although the current SPE intrinsics support only comparison of 32-bit integer values, the comparison of 64-bit integers is also possible by combining different intrinsics that operate on 32-bit integers. The current spu-gcc compiler automatically optimizes an integer branch using the SPE intrinsics. To optimize the problematic branches, we made the observation that integer comparison is faster than floating-point comparison on an SPE.

1 worker, 1 bootstrap        32.5s
2 workers, 8 bootstraps     151.7s
2 workers, 16 bootstraps    302.7s
2 workers, 32 bootstraps    604s

Table 4.2: Execution time of RAxML after the floating-point conditional statement is transformed to an integer conditional statement and vectorized. The input file is 42 SC.

According to the IEEE standard, numbers represented in float and double formats are "lexicographically ordered" [61]: if two floating point numbers in the same format are ordered, then they are ordered the same way when their bits are reinterpreted as sign-magnitude integers [61]. In other words, instead of comparing two floating point numbers, we can interpret their bit patterns as integers and do an integer comparison. When the comparison is performed with unsigned integer instructions, the outcome matches the floating point comparison as long as both numbers are non-negative. In our case, all operands are made positive, so we can replace the floating point comparison with an integer comparison.

To obtain the absolute value of a floating point number, we use the spu_and() intrinsic, which performs a bit-wise AND operation. With spu_and() we clear the most significant bit (the sign bit) of the floating point number. If the number is already positive, nothing changes, since the sign bit is already zero. In this way, we avoid using ABS(), which uses a conditional statement to check whether the operand is greater than or less than 0. After obtaining the absolute values of all the operands involved in the problematic if() statement, we reinterpret each operand as an unsigned long long value and perform the comparison. The optimized conditional statement is shown below. Following the optimization of the offending conditional statement, its contribution to the execution time of newview() drops to 6%, as opposed to 45% before the optimization. The total execution time (Table 4.2) improves by 25%–27%.

unsigned long long a[4];

a[0] = *(unsigned long long*)&x3->a & 0x7fffffffffffffffULL;
a[1] = *(unsigned long long*)&x3->c & 0x7fffffffffffffffULL;
a[2] = *(unsigned long long*)&x3->g & 0x7fffffffffffffffULL;
a[3] = *(unsigned long long*)&x3->t & 0x7fffffffffffffffULL;

if (a[0] < minli && a[1] < minli &&
    a[2] < minli && a[3] < minli) {
    ...
}

4.2.3 Double Buffering and Memory Management

Depending on the size of the input alignment, the major calculation loop (the loop that computes the likelihood vector) in newview() can execute up to 50,000 iterations. The number of iterations is directly related to the alignment length. The loop operates on large arrays, and each member of the arrays is an instance of the likelihood vector structure shown in Figure 4.1. The arrays are allocated dynamically at runtime. Since there is no limit on the size of these arrays, we are unable to keep all of their elements in the local storage of the SPEs.

typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__((aligned(128)));

Figure 4.1: The likelihood vector structure is used in almost all memory traffic between main memory and the local storage of the SPEs. The structure is 128-byte aligned, as required by the Cell architecture.

1 worker, 1 bootstrap       31.1s
2 workers, 8 bootstraps    145.4s
2 workers, 16 bootstraps   290s
2 workers, 32 bootstraps   582.6s

Table 4.3: Execution time of RAxML with double buffering applied to overlap DMA transfers with computation. The input file is 42 SC.

Instead, we strip-mine the arrays by fetching a few array elements to local storage at a time and executing the corresponding loop iterations on each batch of elements. We use a 2 KByte buffer for caching likelihood vectors, which is enough to store the data needed for 16 loop iterations. It should be noted that the space used for buffers is much smaller than the size of the local storage.

In the original code, where SPEs wait for all DMA transfers to complete, the idle time accounts for 11.4% of the execution time of newview(). We eliminated this waiting time by using double buffering to overlap DMA transfers with computation. The total execution time of the application after applying double buffering and tuning the data transfer size (set to 2 KBytes) is shown in Table 4.3.
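To make the strip-mining and double-buffering step concrete, the sketch below shows the general pattern on an SPE, using two 2 KByte buffers and the MFC tag-group primitives from spu_mfcio.h. The helper compute_batch(), the buffer layout and the effective-address argument are illustrative assumptions, not the actual RAxML code.

#include <spu_mfcio.h>

/* Likelihood vector structure from Figure 4.1. */
typedef struct likelihood_vector {
    double a, c, g, t;
    int exp;
} likelivector __attribute__((aligned(128)));

extern void compute_batch(likelivector *batch);           /* hypothetical: runs 16 loop iterations */

#define BATCH_ITERS 16
#define BATCH_BYTES (BATCH_ITERS * sizeof(likelivector))  /* 2 KBytes per buffer */

static likelivector buf[2][BATCH_ITERS] __attribute__((aligned(128)));

void process_strip(unsigned long long ea, int n_batches)
{
    int cur = 0;

    /* Prefetch the first batch into buffer 0, using DMA tag 0. */
    mfc_get(buf[0], ea, BATCH_BYTES, 0, 0, 0);

    for (int i = 0; i < n_batches; i++) {
        int nxt = cur ^ 1;

        /* Start fetching the next batch into the other buffer (tag nxt). */
        if (i + 1 < n_batches)
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * BATCH_BYTES,
                    BATCH_BYTES, nxt, 0, 0);

        /* Wait only for the transfer that fills the buffer we are about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        /* Compute on the current batch while the next one streams in. */
        compute_batch(buf[cur]);

        cur = nxt;
    }
}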

4.2.4 Vectorization

All calculations in newview() are enclosed in two loops. The first loop has a small trip count (typically 4–25 iterations) and computes the individual transition probability matrices (see Section 4.2.1) for each distinct rate category of the CAT or Γ models of rate heterogeneity [86]. Each iteration executes 36 double precision floating point operations. The second loop computes the likelihood vector. Typically, the second loop has a large trip count, which depends on the number of distinct data patterns in the alignment. For the 42 SC input file, the second loop has 228 iterations and executes 44 double precision floating point operations per iteration. Each SPE on the Cell can exploit data parallelism via vectorization. The SPE vector registers can store two double precision floating point elements. We vectorized the two loops in newview() using these registers.

The kernel of the first loop in newview() is shown in Figure 4.2(a), and Figure 4.2(b) shows the same code vectorized for the SPE.

(a) Non-vectorized:

for( ... )
{
    ki = *rptr++;

    d1c = exp (ki * lz10);
    d1g = exp (ki * lz11);
    d1t = exp (ki * lz12);

    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    *left++ = d1c * *EV++;
    *left++ = d1g * *EV++;
    *left++ = d1t * *EV++;
    ...
}

(b) Vectorized:

1:  vector double *left_v = (vector double*)left;
2:  vector double lz1011  = (vector double)(lz10,lz11);
    ...
for( ... )
{
3:  ki_v = spu_splats(*rptr++);

4:  d1cg = _exp_v ( spu_mul(ki_v,lz1011) );
    d1tc = _exp_v ( spu_mul(ki_v,lz1210) );
    d1gt = _exp_v ( spu_mul(ki_v,lz1112) );

    left_v[0] = spu_mul(d1cg,EV_v[0]);
    left_v[1] = spu_mul(d1tc,EV_v[1]);
    left_v[2] = spu_mul(d1gt,EV_v[2]);
    ...
}

Figure 4.2: The body of the first loop in newview(): (a) non-vectorized code, (b) vectorized code.

For a better understanding of the vectorized code, we briefly describe the SPE vector instructions we used:

• Instruction labeled 1 creates a vector pointer to an array consisting of double elements.

• Instruction labeled 2 joins two double elements, lz10 and lz11, into a single vector element.

• Instruction labeled 3 creates a vector from a single double element.

• Instruction labeled 4 is a composition of two different vector instructions:

  1. spu_mul() multiplies two vectors (in this case the arguments are vectors of doubles).

  2. _exp_v() is the vector version of the exponential function.

Non-vectorized:

for( ... )
{
    ump_x1_0  = x1->a;
    ump_x1_0 += x1->c * *left++;
    ump_x1_0 += x1->g * *left++;
    ump_x1_0 += x1->t * *left++;

    ump_x1_1  = x1->a;
    ump_x1_1 += x1->c * *left++;
    ump_x1_1 += x1->g * *left++;
    ump_x1_1 += x1->t * *left++;
    ...
}

Vectorized:

for( ... )
{
    a_v = spu_splats(x1->a);
    c_v = spu_splats(x1->c);
    g_v = spu_splats(x1->g);
    t_v = spu_splats(x1->t);

    l1 = (vector double)(left[0],left[3]);
    l2 = (vector double)(left[1],left[4]);
    l3 = (vector double)(left[2],left[5]);

    ump_v1[0] = spu_madd(c_v,l1,a_v);
    ump_v1[0] = spu_madd(g_v,l2,ump_v1[0]);
    ump_v1[0] = spu_madd(t_v,l3,ump_v1[0]);
    ...
}

Figure 4.3: The second loop in newview(): non-vectorized code (top) and vectorized code (bottom). spu_madd() multiplies the first two arguments and adds the result to the third argument. spu_splats() creates a vector by replicating a scalar element.


After vectorization, the body of the first loop executes 24 floating point instructions, plus one additional instruction for creating a vector from a scalar element. Note that due to the involved pointer arithmetic on dynamically allocated data structures, automatic vectorization of this code would be particularly challenging for a compiler.

Figure 4.3 illustrates the second loop, showing a few selected instructions which dominate execution time in the loop. The variables x1->a, x1->c, x1->g, and x1->t belong to the same C structure (the likelihood vector) and occupy contiguous memory locations. Only three of these variables are multiplied by the elements of the array left[]. This makes vectorization more difficult, since the code requires vector construction instructions such as spu_splats(). There are many different possibilities for vectorizing this code; the scheme shown in Figure 4.3 is the one that achieved the best performance in our tests.

1 worker, 1 bootstrap       27.8s
2 workers, 8 bootstraps    132.3s
2 workers, 16 bootstraps   265.2s
2 workers, 32 bootstraps   527s

Table 4.4: Execution time of RAxML following vectorization. The input file is 42 SC.

After vectorization, the number of floating point instructions in the body of the loops drops from 36 to 24 for the first loop, and from 44 to 22 for the second loop. Vectorization adds 25 instructions for creating vectors.

Without vectorization, newview() spends 69.4% of its execution time in the two loops. Following vectorization, the time spent in the loops drops to 57% of the execution time of newview(). Table 4.4 shows execution times following vectorization.

4.2.5 PPE-SPE Communication

Although newview() accounts for most of the execution time, its granularity is fine; its contribution to execution time stems from the large number of invocations. For the 42 SC input, newview() is invoked 230,500 times and the average execution time per invocation is 71µs. In order to invoke an offloaded function, the PPE needs to send a signal to an SPE. Also, after an offloaded function completes, it sends the result back to the PPE.

In an early implementation of RAxML, we used mailboxes to implement the communication between the PPE and the SPEs. We observed that PPE-SPE communication can be significantly improved if it is performed through main memory and SPE local storage instead of mailboxes. Using memory-to-memory communication improves execution time by 5%–6.4%. Table 4.5 shows RAxML execution times for the 42 SC input, including all optimizations discussed so far and direct memory-to-memory communication. It is interesting to note that direct memory-to-memory communication is an optimization which scales with parallelism on Cell, i.e., its performance impact grows as the code uses more SPEs. As the number of workers and bootstraps executed on the SPEs increases, the code becomes more communication-intensive, due to the fine granularity of the offloaded functions.

1 worker, 1 bootstrap       26.4s
2 workers, 8 bootstraps    123.3s
2 workers, 16 bootstraps   246.8s
2 workers, 32 bootstraps   493.3s

Table 4.5: Execution time of RAxML following the optimization of communication to use direct memory-to-memory transfers. The input file is 42 SC.
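As a rough illustration of the memory-based signaling (a sketch under assumed names, not the exact protocol used in RAxML), the SPE side can simply spin on a cache-line-aligned flag in its own local storage, which the PPE updates directly through the memory-mapped local store, instead of blocking on a mailbox read:

/* SPE-side control block; the PPE writes start_flag (and the input arguments)
   directly into this 128-byte-aligned block in the SPE local storage. */
volatile struct {
    int start_flag;          /* set by the PPE when an offloaded call is ready       */
    int done_flag;           /* set by the SPE when the result has been written back */
    double result;
} ctrl __attribute__((aligned(128)));

void spe_wait_for_invocation(void)
{
    while (ctrl.start_flag == 0)
        ;                    /* spin locally; no PPE-SPE mailbox round trip */
    ctrl.start_flag = 0;     /* re-arm the flag for the next invocation     */
}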

4.2.6 Increasing the Coverage of Offloading

In addition to newview(), we offloaded makenewz() and evaluate(). All three offloaded functions were packaged in a single code module loaded on the SPEs. The advantage of using a single module is that it can be loaded to local storage once, when an SPE thread is created, and remain pinned in local storage for the rest of the execution. Therefore, the cost of loading the code on the SPEs is amortized and communication between the PPE and SPEs is reduced. For example, when newview() is called by makenewz() or evaluate(), there is no need for any PPE-SPE communication, since all functions already reside in SPE local storage.
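The single-module organization can be pictured as a small dispatch loop resident on each SPE. The sketch below is only illustrative: the command encoding and the helpers wait_for_command(), signal_completion() and the spe_* kernels are hypothetical names, not the actual RAxML module.

enum cmd { CMD_NEWVIEW, CMD_MAKENEWZ, CMD_EVALUATE, CMD_EXIT };

extern int  wait_for_command(void);     /* blocks until the PPE posts a request        */
extern void signal_completion(void);    /* writes the result back and notifies the PPE */
extern void spe_newview(void), spe_makenewz(void), spe_evaluate(void);

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    (void)speid; (void)argp; (void)envp;

    for (;;) {
        int c = wait_for_command();
        if (c == CMD_EXIT)
            break;
        switch (c) {
        case CMD_NEWVIEW:  spe_newview();  break;
        case CMD_MAKENEWZ: spe_makenewz(); break;  /* may call spe_newview() locally, no PPE round trip */
        case CMD_EVALUATE: spe_evaluate(); break;
        }
        signal_completion();
    }
    return 0;
}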

Offloading all three critical functions improves performance by a further 25%–31%. A more important implication is that, after offloading and optimizing all three functions, the RAxML code split between the PPE and one SPE becomes faster than the sequential code executed exclusively on the PPE, by as much as 19%. Function offloading is another optimization which scales with parallelism. When more than one MPI process is used and more than one bootstrap is offloaded to the SPEs by each process, the gains from offloading rise to 36%. Table 4.6 illustrates execution times after full function offloading.

1 worker, 1 bootstrap       19.8s
2 workers, 8 bootstraps     86.8s
2 workers, 16 bootstraps   173s
2 workers, 32 bootstraps   344.4s

Table 4.6: Execution time of RAxML after offloading and optimizing three functions: newview(), makenewz() and evaluate(). The input file is 42 SC.

Figure 4.4: Performance of (a) RAxML and (b) PBPI with a varying number of MPI processes (1–8). Both plots show execution time in seconds.

4.3 Parallel Execution

After improving the performance of RAxML and PBPI using the presented optimization techniques, we investigated parallel execution of both applications on the Cell processor. To achieve higher utilization of the Cell chip, we oversubscribed the PPE with a varying number of MPI processes (2–8) and assigned a single SPE to each MPI process. The execution times of the different parallel configurations are presented in Figure 4.4. In the presented experiments we use weak scaling, i.e., the amount of computation increases with the number of processes.

In Figure 4.4(a) we observe that for any number of processes larger than two, the execution time of RAxML remains constant. Two factors are responsible for this behavior:

1. On-chip contention, as well as bus and memory contention, which occurs on the PPE side when the PPE is oversubscribed by multiple processes;

2. The Linux kernel is oblivious to the process of off-loading, which results in poor scheduling decisions. Each process following the off-loading execution model constantly alternates its execution between the PPE and an SPE. Unaware of this alternation, the OS allows processes to keep control over resources which are not actually used; in other words, the PPE might be assigned to a process whose execution is currently switched to an SPE.

In the case of PBPI (Figure 4.4(b)), we observe similar performance trends as with RAxML. From the presented experiments it is clear that naive parallelization of the applications, where the PPE is simply oversubscribed with multiple processes, does not provide satisfactory performance. The poor scaling of the applications is a strong motivation for a detailed exploration of different parallel programming models, as well as scheduling policies, for asymmetric processors. We continue the discussion of parallel execution on heterogeneous architectures in Chapter 5.

4.4 Chapter Summary

In this chapter we presented a set of optimizations which enable efficient sequential execution of scientific applications on asymmetric platforms. We exploited the fact that our test applications contain large computational functions (loops) which consume the majority of the execution time. This assumption does not reduce the generality of the presented techniques, since large, time-consuming computational loops are common in most scientific codes. We explored a total of five optimizations and their performance implications: I) offloading the bulk of the maximum likelihood tree calculation to the accelerators; II) casting and vectorization of expensive conditional statements involving multiple, hard-to-predict conditions; III) double buffering for overlapping memory communication with computation; IV) vectorization of the core of the floating point computation; V) optimization of communication between the host core and accelerators using direct memory-to-memory transfers.

In our case study, starting from a version of RAxML and PBPI already optimized for conventional uniprocessors and multiprocessors, we were able to improve performance on the Cell processor by more than a factor of two.

Chapter 5

Scheduling Multigrain Parallelism on Asymmetric Systems

5.1 Introduction

In this chapter, we investigate runtime scheduling policies for mapping different layers of parallelism, exposed by an application, to the Cell processor. We assume that applications describe all available algorithmic parallelism to the runtime system explicitly, while the runtime system dynamically selects the degree of granularity and the dimensions of parallelism to expose to the hardware at runtime, using dynamic scheduling mechanisms and policies. In other words, the runtime system is responsible for partitioning algorithmic parallelism in layers that best match the diverse capabilities of the processor cores, while at the same time rightsizing the granularity of parallelism in each layer.

5.2 Scheduling Multi-Grain Parallelism on Cell

In this section we explore the possibilities for exploiting multi-grain parallelism on Cell. The Cell PPE can execute two threads or processes simultaneously, from which parts of the code can be off-loaded and executed on the SPEs. To increase the sources of parallelism for the SPEs, the user may consider two approaches:

• The user may oversubscribe the PPE with more processes or threads than the number of processes/threads that the PPE can execute simultaneously. In other words, the programmer attempts to find more parallelism to off-load to the accelerators by attempting a finer-grain task decomposition of the code. In this case, the runtime system needs to schedule the host processes/threads so as to minimize the idle time on the host core while the computation is off-loaded to accelerators. We present an event-driven task-level scheduler (EDTLP) which achieves this goal in Section 5.2.1.

• The user can introduce a new dimension of parallelism to the application by distributing loops from within the off-loaded functions across multiple SPEs. In other words, the user can exploit data parallelism both within and across accelerators. Each SPE can work on a part of a distributed loop, which can be further accelerated with SIMDization. We present case studies that motivate the dynamic extraction of multi-grain parallelism via loop distribution in Section 5.2.2.

5.2.1 Event-Driven Task Scheduling

EDTLP is a runtime scheduling module which can be embedded transparently in MPI codes. The EDTLP scheduler operates under the assumption that the code to off-load to accelerators is specified by the user at the level of functions. In the case of Cell, this means that the user has either constructed SPE threads in a separate code module, or annotated the host PPE code with directives to extract SPE threads via a compiler [17]. The EDTLP scheduler avoids underutilization of SPEs by oversubscribing the PPE and preventing a single MPI process from monopolizing it.

Informally, the EDTLP scheduler off-loads tasks from MPI processes. A task ready for off-loading serves as an event trigger for the scheduler. Upon the occurrence of such an event, the scheduler immediately attempts to serve the MPI process that carries the task to off-load and sends the task to an available SPE, if any. While off-loading a task, the scheduler suspends the MPI process that spawned the task and switches to another MPI process, anticipating that more tasks will be available for off-loading from ready-to-run MPI processes. Switching upon off-loading prevents MPI processes from blocking the PPE while waiting for their tasks to return. The scheduler attempts to sustain a high supply of tasks for off-loading to SPEs by serving MPI processes round-robin.
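The per-event logic can be pictured roughly as follows. The task_t type and the helper functions are illustrative placeholders, and sched_yield() is only one possible way to realize the hand-off of the PPE; this is a sketch, not the actual runtime code.

#include <sched.h>

typedef struct task task_t;                     /* opaque task descriptor (illustrative) */
extern int  find_idle_spe(void);                /* returns an idle SPE id, or -1         */
extern void offload_to_spe(int spe, task_t *t);
extern void wait_for_completion(int spe, task_t *t);

/* Invoked in the context of an MPI process when one of its tasks becomes ready. */
void edtlp_offload(task_t *task)
{
    int spe = find_idle_spe();
    if (spe >= 0) {
        offload_to_spe(spe, task);              /* the arrival event: hand the task over    */
        sched_yield();                          /* release the PPE to another ready process */
        wait_for_completion(spe, task);         /* the departure event resumes this process */
    }
    /* If no SPE is idle, the runtime would queue or retry; that path is omitted here. */
}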

The downside of a scheduler based on oversubscribing a processor is context-switching overhead. Cell in particular also suffers from the problem of interference between processes or threads sharing the SMT PPE core. The granularity of the off-loaded code determines if the overhead introduced by oversubscribing the PPE can be tolerated. The code off-loaded to an SPE should be coarse enough to marginalize the overhead of context switching performed on the PPE. The EDTLP scheduler addresses this issue by performing granularity control of the off-loaded tasks and preventing off-loading of code that does not meet a minimum granularity threshold.

Figure 5.1 illustrates an example of the difference between scheduling MPI processes with the EDTLP scheduler and with the native Linux scheduler. In this example, each MPI process has one task to off-load to the SPEs. For illustrative purposes only, we assume that there are only 4 SPEs on the chip. In Figure 5.1(a), once a task is sent to an SPE, the scheduler forces a context switch on the PPE. Since the PPE is a two-way SMT, two MPI processes can simultaneously off-load tasks to two SPEs. The EDTLP scheduler enables the use of all four SPEs via function off-loading. On the contrary, if the scheduler waits for the completion of a task before giving another MPI process an opportunity to off-load (Figure 5.1(b)), the application can only utilize two SPEs. Realistic application tasks often have significantly shorter lengths than the time quanta used by the Linux scheduler; for example, in RAxML, task lengths are on the order of tens of microseconds, whereas Linux time quanta are on the order of tens of milliseconds.

Table 5.1(a) compares the performance of the EDTLP scheduler to that of the native Linux scheduler, using RAxML and running a workload comprising 42 organisms. In this experiment, the number of performed bootstraps is not constant and is equal to the number of MPI processes. The EDTLP scheduler outperforms the Linux scheduler by up to a factor of 2.7. In the experiment with PBPI (Table 5.1(b)), we execute the code with one Markov chain for 20,000 generations and we vary the number of MPI processes across runs. PBPI is also executed with weak scaling, i.e., we increase the size of the DNA alignment with the number of processes. The workload for PBPI includes 107 organisms. EDTLP outperforms the Linux scheduler policy in PBPI by up to a factor of 2.7.


Figure 5.1: Scheduler behavior for two off-loaded tasks, representative of RAxML. Case (a) illustrates the behavior of the EDTLP scheduler. Case (b) illustrates the behavior of the Linux scheduler with the same workload. The numbers correspond to MPI processes. The shaded slots indicate context switching. The example assumes a Cell-like system with four SPEs.

5.2.2 Scheduling Loop-Level Parallelism

The EDTLP model described in Section 5.2 is effective if the PPE has enough coarse-grained functions to off-load to SPEs. In cases where the degree of available task parallelism is less than the number of SPEs, the runtime system can activate a second layer of parallelism by splitting an already off-loaded task across multiple SPEs. We implemented runtime support for the parallelization of for-loops enclosed within off-loaded SPE functions. We parallelize loops in off-loaded functions using work-sharing constructs similar to those found in OpenMP. In RAxML, all for-loops in the three off-loaded functions have no loop-carried dependencies and obtain speedup from parallelization, assuming that there are enough idle SPEs dedicated to their execution. The number of SPEs activated for work-sharing is user- or system-controlled, as in OpenMP. We discuss dynamic system-level control of loop parallelism further in Section 5.3.

                           EDTLP    Linux
1 worker, 1 bootstrap      19.7s    19.7s
2 workers, 2 bootstraps    22.2s    30s
3 workers, 3 bootstraps    26s      40.7s
4 workers, 4 bootstraps    28.1s    43.3s
5 workers, 5 bootstraps    33s      60.7s
6 workers, 6 bootstraps    34s      61.8s
7 workers, 7 bootstraps    38.8s    81.2s
8 workers, 8 bootstraps    39.8s    81.7s

(a)

                            EDTLP     Linux
1 worker, 20,000 gen.       27.77s    27.54s
2 workers, 20,000 gen.      30.2s     30s
3 workers, 20,000 gen.      31.92s    56.16s
4 workers, 20,000 gen.      36.4s     63.7s
5 workers, 20,000 gen.      40.12s    93.71s
6 workers, 20,000 gen.      41.48s    93s
7 workers, 20,000 gen.      53.93s    144.81s
8 workers, 20,000 gen.      52.64s    135.92s

(b)

Table 5.1: Performance comparison for (a) RAxML and (b) PBPI with two schedulers. The second column shows execution time with the EDTLP scheduler. The third column shows execution time with the native Linux kernel scheduler. The workload for RAxML contains 42 organisms. The workload for PBPI contains 107 organisms.


The parallelization scheme is outlined in Figure 5.2. The program is executed on the PPE until execution reaches the parallel loop to be off-loaded. At that point the PPE sends a signal to a single SPE which is designated as the master. The signal is processed by the master and further broadcast to all workers involved in the parallelization. Upon receiving the signal, each SPE worker fetches the data necessary for loop execution. We ensure that the SPEs work on different parts of the loop and do not overlap, by assigning a unique identifier to each SPE thread involved in the parallelization of the loop. Global data changed by any of the SPEs during loop execution is committed to main memory at the end of each iteration. After processing the assigned parts of the loop, the SPE workers send a notification back to the master. If the loop includes a reduction, the master also collects partial results from the SPEs and accumulates them locally. All communication between SPEs is performed on chip, in order to avoid the long latency of communicating through shared memory.

Note that in our loop parallelization scheme on Cell, all work performed by the master SPE could also be performed by the PPE. In that case, the PPE would broadcast a signal to all SPE threads involved in loop parallelization and the partial results calculated by the SPEs would be accumulated back at the PPE. Such collective operations increase the frequency of SPE-PPE communication, especially when the distributed loop is a nested loop. In the case of RAxML, in order to reduce SPE-PPE communication and avoid unnecessary invocation of the MPI process that spawned the parallelized loop, we opted to use an SPE to distribute loops to the other SPEs and to collect their results. In PBPI, we let the PPE execute the master thread during loop parallelization, since the loops are coarse enough to overshadow the loop execution overhead. Optimizing and selecting between these loop execution schemes is a subject of ongoing research.

SPE threads participating in loop parallelization are created once upon off-loading the code for the first parallel loop to SPEs. The threads remain active and pinned to the same SPEs during the entire program execution, unless the scheduler decides to change the parallelization strategy and redistribute the SPEs between one or more concurrently executing parallel loops. Pinned SPE threads can run multiple off-loaded loop bodies, as long as the code of these loop bodies fits on the local storage of the SPEs. If the loop parallelization strategy is changed on the fly by the runtime system, a new code module with loop bodies that implement the new parallelization strategy is loaded on the local storage of the SPEs.

Table 5.2 illustrates the performance of the basic loop-level parallelization scheme of our runtime system in RAxML. Table 5.2(a) shows the execution time of RAxML using one MPI process and performing one bootstrap, on a data set which comprises 42 organisms. This experiment isolates the impact of our loop-level parallelization mechanisms on Cell. The number of iterations in parallelized loops depends on the size of the input alignment in RAxML; for the given data set, each parallel loop executes 228 iterations.

Figure 5.2: Parallelizing a loop across SPEs using a work-sharing model with an SPE designated as the master. The master sends a start signal to each worker, the x loop iterations are divided evenly (the master executes iterations 1 to x/8, worker 1 executes iterations x/8 to x/4, and so on up to worker 7), and each worker sends a stop signal back to the master when its portion is complete.

The results shown in Table 5.2(a) suggest that RAxML sees a reasonable yet limited performance improvement from loop-level parallelism. The highest speedup (1.72) is achieved with 7 SPEs. The reasons for the modest speedup are the non-optimal coverage of loop-level parallelism (less than 90% of the original sequential code is covered by parallelized loops), the fine granularity of the loops, and the fact that most loops have reductions, which create bottlenecks on the Cell DMA engine. The performance degradation that occurs when 5 or 6 SPEs are used happens because of specific memory alignment constraints that have to be met on the SPEs. Due to these alignment constraints, on certain occasions it is not possible to evenly distribute the data used in the loop body, and therefore the workload of iterations, between SPEs. More specifically, the use of character arrays for the main data set in RAxML forces array transfers in multiples of 16 array elements. Consequently, loop distribution across processors is done with a minimum chunk size of 16 iterations.
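As an illustration of this constraint, the sketch below shows one way the iteration space could be divided among the SPEs in whole 16-iteration chunks; the function and variable names are assumptions made for the example, not the actual runtime code.

#define CHUNK 16    /* minimum chunk imposed by the 16-element transfer granularity */

/* Compute the half-open iteration range [lo, hi) assigned to worker 'id'. */
void loop_range(int id, int num_spes, int n_iters, int *lo, int *hi)
{
    int chunks   = (n_iters + CHUNK - 1) / CHUNK;   /* total number of whole chunks       */
    int per_spe  = chunks / num_spes;
    int leftover = chunks % num_spes;               /* first 'leftover' SPEs get one more */
    int first    = id * per_spe + (id < leftover ? id : leftover);
    int count    = per_spe + (id < leftover ? 1 : 0);

    *lo = first * CHUNK;
    *hi = (*lo + count * CHUNK < n_iters) ? (*lo + count * CHUNK) : n_iters;
}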

1 worker, 1 boot., no LLP                 19.7s
1 worker, 1 boot., 2 SPEs used for LLP    14s
1 worker, 1 boot., 3 SPEs used for LLP    13.36s
1 worker, 1 boot., 4 SPEs used for LLP    12.8s
1 worker, 1 boot., 5 SPEs used for LLP    13.8s
1 worker, 1 boot., 6 SPEs used for LLP    12.47s
1 worker, 1 boot., 7 SPEs used for LLP    11.4s
1 worker, 1 boot., 8 SPEs used for LLP    11.44s

(a)

1 worker, 1 boot., no LLP                 47.9s
1 worker, 1 boot., 2 SPEs used for LLP    29.5s
1 worker, 1 boot., 3 SPEs used for LLP    23.3s
1 worker, 1 boot., 4 SPEs used for LLP    20.5s
1 worker, 1 boot., 5 SPEs used for LLP    18.7s
1 worker, 1 boot., 6 SPEs used for LLP    18.1s
1 worker, 1 boot., 7 SPEs used for LLP    17.1s
1 worker, 1 boot., 8 SPEs used for LLP    16.8s

(b)

Table 5.2: Execution time of RAxML when loop-level parallelism (LLP) is exploited in one bootstrap, via work distribution between SPEs. The input file is 42 SC: (a) DNA sequences are represented with 10,000 nucleotides, (b) DNA sequences are represented with 20,000 nucleotides.

Loop-level parallelization in RAxML can achieve higher speedup in a single bootstrap with larger input data sets. Alignments that have a larger number of nucleotides per organism have more loop iterations to distribute across SPEs. To illustrate the behavior of loop-level parallelization with coarser loops, we repeated the previous experiment using a data set where the DNA sequences are represented with 20,000 nucleotides. The results are shown in Table 5.2(b). In this experiment, the performance of the loop-level parallelization scheme always increases with the number of SPEs.

PBPI exhibits clearly better scalability than RAxML with LLP, since the granularity of the loops is coarser in PBPI than in RAxML. Table 5.3 illustrates the execution times when PBPI is executed with a variable number of SPEs used for LLP. Again, we control the granularity of the off-loaded code by using different data sets: Table 5.3(a) shows execution times for a data set that contains 107 organisms, each represented by a DNA sequence of 3,000 nucleotides. Table 5.3(b) shows execution times for a data set that contains 107 organisms, each represented by a DNA sequence of 10,000 nucleotides. We run PBPI with one Markov chain for 20,000 generations. For the two data sets, PBPI achieves a maximum speedup of 4.6 and 6.1 respectively, after loop-level parallelization.

1 worker, 1,000 gen., no LLP                 27.2s
1 worker, 1,000 gen., 2 SPEs used for LLP    14.9s
1 worker, 1,000 gen., 3 SPEs used for LLP    11.3s
1 worker, 1,000 gen., 4 SPEs used for LLP    8.4s
1 worker, 1,000 gen., 5 SPEs used for LLP    7.3s
1 worker, 1,000 gen., 6 SPEs used for LLP    6.8s
1 worker, 1,000 gen., 7 SPEs used for LLP    6.2s
1 worker, 1,000 gen., 8 SPEs used for LLP    5.9s

(a)

1 worker, 20,000 gen., no LLP          262s
1 worker, 20,000 gen., 2 SPEs used     131.3s
1 worker, 20,000 gen., 3 SPEs used     92.3s
1 worker, 20,000 gen., 4 SPEs used     70.1s
1 worker, 20,000 gen., 5 SPEs used     58.1s
1 worker, 20,000 gen., 6 SPEs used     49s
1 worker, 20,000 gen., 7 SPEs used     43s
1 worker, 20,000 gen., 8 SPEs used     39.7s

(b)

Table 5.3: Execution time of PBPI when loop-level parallelism (LLP) is exploited via work distribution between SPEs. The input file is 107 SC: (a) DNA sequences are represented with 1,000 nucleotides, (b) DNA sequences are represented with 10,000 nucleotides.

struct Pass {
    volatile unsigned int v1_ad;
    volatile unsigned int v2_ad;
    // ... arguments for loop body
    volatile unsigned int vn_ad;
    volatile double res;
    volatile int sig[2];
} __attribute__((aligned(128)));

Figure 5.3: The data structure Pass is used for communication among SPEs. The vi_ad variables are used to pass input arguments for the loop body from one local storage to another. The variable sig is used as a notification signal that the memory transfer for the shared data updated during the loop is completed. The variable res is used to send results back to the master SPE, and as a dependence resolution mechanism.

5.2.3 Implementing Loop-Level Parallelism

The SPE threads participating in loop work-sharing constructs are created once, upon function off-loading. Communication among SPEs participating in work-sharing constructs is implemented using DMA transfers and the communication structure Pass, depicted in Figure 5.3.

The Pass structure is private to each thread. The master SPE thread allocates an array of Pass structures, and each member of this array is used for communication with one SPE worker thread. Once the SPE threads are created, they exchange the local addresses of their Pass structures. This address exchange is performed through the PPE. Whenever one thread needs to send a signal to a thread on another SPE, it issues an mfc_put() request and sets the destination address to be the address of the Pass structure of the recipient.
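A minimal sketch of how such a transfer could be implemented is shown below; it assumes that worker_pass_ea[] holds the exchanged effective addresses of the workers' Pass structures and that struct Pass is the one from Figure 5.3 (the array name, the tag choice and the blocking wait are illustrative).

#include <spu_mfcio.h>

/* struct Pass as defined in Figure 5.3. */
extern unsigned long long worker_pass_ea[];   /* remote addresses of the workers' Pass structures */

void send_to_spe(int worker, struct Pass *local_copy)
{
    unsigned int tag = worker & 31;           /* pick one of the 32 MFC tag groups */

    /* DMA the local copy of Pass into the recipient SPE's local storage. */
    mfc_put(local_copy, worker_pass_ea[worker], sizeof(struct Pass), tag, 0, 0);

    /* Block until the transfer completes before local_copy is reused. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}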

In Figure 5.4, we illustrate a RAxML loop parallelized with work-sharing among SPE threads. Before executing the loop, the master thread sets the parameters of the Pass structure for each worker SPE and issues one mfc_put() request per worker; this is done in send_to_spe(). Worker i uses the parameters of the received Pass structure and fetches the data needed for the loop execution to its local storage (function fetch_data()). After finishing the execution of its portion of the loop, a worker sets the res parameter in its local copy of the Pass structure and sends it to the master, using send_to_master(). The master accumulates the results from all workers and commits the sum to main memory.

Immediately after calling send_to_spe(), the master participates in the execution of the loop. The master tends to have a slight head start over the workers, since the workers need to complete several DMA requests, fetching the required data from the master's local storage or from shared memory, before they can start executing the loop. In fine-grained off-loaded functions such as those encountered in RAxML, the resulting load imbalance between the master and the workers is noticeable. To achieve better load balancing, we set the master to execute a slightly larger portion of the loop. A fully automated and adaptive implementation of this purposeful load unbalancing is obtained by timing idle periods on the SPEs across multiple invocations of the same loop. The collected times are used to tune the iteration distribution in each invocation, in order to reduce idle time on the SPEs.
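One way such idle-time-driven tuning could be realized is sketched below; the proportional adjustment and all names are assumptions made for the example, since the text does not specify the exact heuristic used by our runtime.

/* Fraction of the loop iterations taken by the master SPE; it starts from an even
   split and grows when the workers are observed idling at the end of an invocation. */
static double master_share;

void init_master_share(int num_spes)
{
    master_share = 1.0 / num_spes;
}

void tune_master_share(double avg_worker_idle, double loop_time, int num_spes)
{
    /* Shift work to the master in proportion to the observed idle fraction. */
    master_share += (avg_worker_idle / loop_time) / num_spes;

    /* Clamp the share so the master never takes more than twice an even split. */
    if (master_share > 2.0 / num_spes)
        master_share = 2.0 / num_spes;
}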

5.3 Dynamic Scheduling of Task- and Loop-Level Parallelism

Merging task-level and loop-level parallelism on Cell can improve the utilization of accelerators. A non-trivial problem with such a hybrid parallelization scheme is the assignment of accelerators to tasks. The optimal assignment is largely application-specific, task-specific and input-specific. We support this argument using RAxML as an example. The discussion in this section is limited to RAxML, where the degree of outermost parallelism can be changed arbitrarily by varying the number of MPI processes executing bootstraps, with a small impact on performance. PBPI uses a data decomposition approach which depends on the number of processors; therefore, dynamically varying the number of MPI processes executing the code at runtime cannot be accomplished without data redistribution.

Master SPE:

struct Pass pass[Num_SPE];

for(i=0; i < Num_SPE; i++){
    pass[i].sig[0] = 1;
    ...
    send_to_spe(i, &pass[i]);
}

/* Parallelized loop */
for ( ... )
{
    ...
}

tr->likeli = sum;

for(i=0; i < Num_SPE; i++){
    while(pass[i].sig[1] == 0);
    pass[i].sig[1] = 0;
    tr->likeli += pass[i].res;
}

commit(tr->likeli);

Worker SPE:

struct Pass pass;

while(pass.sig[0] == 0);
fetch_data();

/* Parallelized loop */
for ( ... )
{
    ...
}

pass.res = sum;
pass.sig[1] = 1;
send_to_master(&pass);

Figure 5.4: Parallelization of the loop from function evaluate() in RAxML. The first listing depicts the code executed by the master SPE; the second listing depicts the code executed by a worker SPE. Num_SPE represents the number of SPE worker threads.

5.3.1 Application-Specific Hybrid Parallelization on Cell

We present a set of experiments with RAxML performing a number of bootstraps ranging between 1 and 128. In these experiments we use three versions of RAxML. Two of the three versions use hybrid parallelization models combining task- and loop-level parallelism, while the third version exploits only task-level parallelism and uses the EDTLP scheduler. More specifically, in the first version, each off-loaded task is parallelized across 2 SPEs, and 4 MPI processes are multiplexed on the PPE, executing 4 concurrent bootstraps. In the second version, each off-loaded task is parallelized across 4 SPEs and 2 MPI processes are multiplexed on the PPE, executing 2 concurrent bootstraps. In the third version, the code concurrently executes 8 MPI processes, the off-loaded tasks are not parallelized, and the tasks are scheduled with the EDTLP scheduler. Figure 5.5 illustrates the results of the experiments, with a data set representing 42 organisms. The x-axis shows the number of bootstraps, while the y-axis shows execution time in seconds.

Figure 5.5: Comparison of the task-level (EDTLP) and hybrid (EDTLP+LLP with 2 or 4 SPEs per parallel loop) parallelization schemes in RAxML, on the Cell BE. The input file is 42 SC. The number of ML trees created is (a) 1–16, (b) 1–128.

As expected, the hybrid model outperforms EDTLP when up to 4 bootstraps are executed, since only a combination of EDTLP and LLP can off-load code to more than 4 SPEs simultaneously. With 5 to 8 bootstraps, the hybrid models execute bootstraps in batches of 2 and 4 respectively, while the EDTLP model executes all bootstraps in parallel. EDTLP activates 5 to 8 SPEs solely for task-level parallelism, leaving room for loop-level parallelism on at most 3 SPEs. This proves to be unnecessary, since the parallel execution time is determined by the length of the non-parallelized off-loaded tasks that remain on at least one SPE. In the range between 9 and 12 bootstraps, combining EDTLP and LLP selectively, so that the first 8 bootstraps execute with EDTLP and the last 4 bootstraps execute with the hybrid scheme, is the best option. For the input data set with 42 organisms, the performance of the EDTLP and hybrid EDTLP-LLP schemes is almost identical when the number of bootstraps is between 13 and 16. When the number of bootstraps is higher than 16, EDTLP clearly outperforms any hybrid scheme (Figure 5.5(b)).

The reader may notice that the problem of hybrid parallelization is trivialized when the problem size is scaled beyond a certain point, which is 28 bootstraps in the case of RAxML (see Section 5.3.2). A production run of RAxML for real-world phylogenetic analysis would require up to 1,000 bootstraps, thus rendering hybrid parallelization seemingly unnecessary. However, if a production RAxML run with 1,000 bootstraps were to be executed across multiple Cell BEs, and assuming equal division of bootstraps between the processors, the cut-off point for EDTLP outperforming the hybrid EDTLP-LLP scheme would be reached at 36 Cell processors. Beyond this scale, performance per processor would be maximized only if LLP were employed in conjunction with EDTLP on each Cell. Although this observation is empirical and somewhat simplifying, it is further supported by the argument that scaling across multiple processors will in all likelihood increase communication overhead and therefore favor a parallelization scheme with fewer MPI processes. The hybrid scheme reduces the number of MPI processes compared to the pure EDTLP scheme, when the granularity of work per Cell becomes fine.

5.3.2 MGPS

The purpose of MGPS is to dynamically adapt the parallel execution by either exposing only one layer of task parallelism to the SPEs via event-driven scheduling, or expanding to the second layer of data parallelism and merging it with task parallelism when SPEs are underutilized at runtime.

MGPS extends the EDTLP scheduler with an adaptive processor-saving policy. The scheduler runs locally in each process and is driven by two events:

• arrivals, which correspond to off-loading functions from PPE processes to SPE threads;

• departures, which correspond to completion of SPE functions.

MGPS is invoked upon arrivals and departures of tasks. Initially, upon arrivals, the scheduler conservatively assigns one SPE to each off-loaded task. Upon a departure, the scheduler monitors the degree of task-level parallelism exposed by each MPI process, i.e., how many discrete tasks were off-loaded to SPEs while the departing task was executing. This number reflects the history of SPE utilization from task-level parallelism and is used to switch from the EDTLP scheduling policy to a hybrid EDTLP-LLP scheduling policy. The scheduler monitors the number of SPEs that execute tasks over epochs of 100 off-loads. If the observed SPE utilization is over 50%, the scheduler maintains the most recently selected scheduling policy (EDTLP or EDTLP-LLP). If the observed SPE utilization falls under 50% and the scheduler currently uses EDTLP, it switches to EDTLP-LLP by loading parallelized versions of the loops in the local storage of the SPEs and performing loop distribution.
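The adaptation rule can be summarized by the sketch below; EPOCH, NUM_SPES and switch_to_llp() are illustrative names, and the bookkeeping in the real scheduler is more involved.

#define EPOCH    100                 /* off-loads per monitoring epoch */
#define NUM_SPES 8

enum policy { POLICY_EDTLP, POLICY_EDTLP_LLP };

static enum policy current_policy = POLICY_EDTLP;
static int offloads_in_epoch = 0;
static int busy_spe_samples  = 0;

extern void switch_to_llp(void);     /* loads parallelized loop versions on the SPEs */

/* Called on every task departure with the number of SPEs currently running tasks. */
void mgps_on_departure(int spes_busy_now)
{
    busy_spe_samples += spes_busy_now;

    if (++offloads_in_epoch == EPOCH) {
        double utilization = (double)busy_spe_samples / (EPOCH * NUM_SPES);

        /* Below 50% utilization under EDTLP: merge in loop-level parallelism. */
        if (utilization < 0.5 && current_policy == POLICY_EDTLP) {
            current_policy = POLICY_EDTLP_LLP;
            switch_to_llp();
        }
        /* At or above 50%: keep the most recently selected policy. */

        offloads_in_epoch = 0;
        busy_spe_samples  = 0;
    }
}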

To switch between different parallel execution models at runtime, the runtime system uses code versioning. It maintains three versions of the code of each task. One version is used for execution on the PPE. The second version is used for execution on an SPE from start to finish, using SIMDization to exploit the vector execution units of the SPE. The third version is used for distribution of the loop enclosed by the task between more than one SPE. The use of code versioning increases code management overhead, as SPEs may need to load different versions of the code of each off-loaded task at runtime. On the other hand, code versioning obviates the need for conditionals that would be used in a monolithic version of the code. These conditionals are expensive on SPEs, which lack branch prediction capabilities. Our experimental analysis indicates that overlaying code versions on the SPEs via code transfers ends up being slightly more efficient than using monolithic code with conditionals. This happens because of the overhead and frequency of the conditionals in the monolithic version of the SPE code, but also because the code overlays leave more space available in the local storage of the SPEs for data caching and buffering to overlap computation and communication [20].

We compare MGPS to EDTLP and two static hybrid (EDTLP-LLP) schedulers, using 2 SPEs per loop and 4 SPEs per loop respectively. Figure 5.6 shows the execution times of MGPS, EDTLP-LLP and EDTLP with various RAxML workloads. The x-axis shows the number of bootstraps, while the y-axis shows execution time. We observe benefits from using MGPS for up to 28 bootstraps. Beyond 28 bootstraps, MGPS converges to EDTLP and both are increasingly faster than static EDTLP-LLP execution, as the number of bootstraps increases.

A clear disadvantage of MGPS is that the time needed for any adaptation decision depends on the total number of off-loading requests, which in turn is inherently application- and input-dependent. If the off-loading requests from different processes are spaced apart, there may be extended idle periods on the SPEs before adaptation takes place. Another disadvantage of MGPS is the dependence of its dynamic scheduling policy on the initial configuration used to execute the application. In RAxML, MGPS converges to the best execution strategy only if the application begins by oversubscribing the PPE and exposing the maximum degree of task-level parallelism to the runtime system. This strategy is unlikely to converge to the best scheduling policy in other applications, where task-level parallelism is limited and data parallelism is more dominant. In such cases, MGPS would have to commence its optimization process from a different program configuration, favoring data-level rather than task-level parallelism. We address the aforementioned shortcomings via a sampling-based MGPS algorithm (S-MGPS), which we introduce in the next section.

Figure 5.6: MGPS, EDTLP and static EDTLP-LLP (with 2 or 4 SPEs per parallel loop). The x-axis shows the number of bootstraps and the y-axis shows execution time in seconds. Input file: 42 SC. Number of ML trees created: (a) 1–16, (b) 1–128.

5.4 S-MGPS

We begin this section by presenting a motivating example that shows why controlling concurrency on the Cell is useful, even if the SPEs are seemingly fully utilized. This example motivates the introduction of a sampling-based algorithm that explores the space of program and system configurations that utilize all SPEs, under different distributions of SPEs between concurrently executing tasks and parallel loops. We then present S-MGPS and evaluate it using RAxML and PBPI.

5.4.1 Motivating Example

Increasing the degree of task parallelism on Cell comes at a cost, namely increased contention between the MPI processes that time-share the PPE. Pairs of processes that execute in parallel on the PPE suffer from contention for shared resources, a well-known problem of simultaneous multithreaded processors. Furthermore, with more processes, context switching overhead and the lack of co-scheduling of SPE threads with the PPE threads from which they originate may harm performance. On the other hand, while loop-level parallelization can ameliorate PPE contention, its performance benefit depends on the granularity and locality properties of the parallel loops.

Figure 5.7 shows the efficiency of loop-level parallelism in RAxML when the input data set is relatively small. The input data set in this example (25 SC) has 25 organisms, each of them represented by a DNA sequence of 500 nucleotides. In this experiment, RAxML is executed multiple times with a single worker process and a variable number of SPEs used for LLP. The best execution time is achieved with 5 SPEs. The behavior illustrated in Figure 5.7 is caused by several factors, including the granularity of the loops relative to the overhead of PPE-SPE communication, and load imbalance (discussed in Section 5.2.2).

By using two dimensions of parallelism to execute an application, the runtime system can control both PPE contention and loop-level parallelization overhead. Figure 5.8 illustrates an example in which multi-grain parallel executions outperform one-dimensional parallel executions in RAxML, for any number of bootstraps. In this example, RAxML is executed with three static parallelization schemes, using 8 MPI processes and 1 SPE per process, 4 MPI processes and 2 SPEs per process, or 2 MPI processes and 4 SPEs per process, respectively. The input data set is 25 SC. Using this data set, RAxML performs best with a multi-level parallelization model, when 4 MPI processes are executed simultaneously on the PPE and each of them uses 2 SPEs for loop-level parallelization.


Figure 5.7: Execution time of RAxML with a variable number of SPE threads. The input dataset is 25 SC.

Figure 5.8: Execution times of RAxML with various static multi-grain scheduling strategies: 8 worker processes with 1 SPE per off-loaded task, 4 worker processes with 2 SPEs per off-loaded task, and 2 worker processes with 4 SPEs per off-loaded task. The x-axis shows the number of bootstraps and the y-axis shows execution time in seconds. The input dataset is 25 SC.

5.4.2 Sampling-Based Scheduler for Multi-grain Parallelism

The S-MGPS scheduler automatically determines the best parallelization scheme for a specific workload by using a sampling period. During the sampling period, S-MGPS performs a search of program configurations along the available dimensions of parallelism. The search starts with a single MPI process, and during the first step S-MGPS determines the optimal number of SPEs that should be used by a single MPI process. The search is implemented by sampling execution phases of the MPI process with different degrees of loop-level parallelism. Phases represent code that is executed repeatedly in an application and dominates execution time; in the case of RAxML and PBPI, the phases are the off-loaded tasks. Although we identify phases manually in our execution environment, the selection process for phases is trivial and can be automated in a compiler. Furthermore, parallel applications almost always exhibit a very strong runtime periodicity in their execution patterns, which makes the process of isolating the dominant execution phases straightforward.

Once the first sampling step of S-MGPS is completed, the search continues by sampling execution intervals with every feasible combination of task-level and loop-level parallelism. In the second phase of the search, the degree of loop-level parallelism never exceeds the optimal value determined by the first sampling step. For each execution interval, the scheduler uses the execution time of phases as the criterion for selecting the optimal dimension(s) and granularity of parallelism per dimension. S-MGPS uses a performance-driven mechanism to rightsize parallelism on Cell, as opposed to the utilization-driven mechanism used in MGPS.

Figure 5.9 illustrates the steps of the sampling phase when 2 MPI processes are executed on the PPE. This process can be performed for any number of MPI processes that can be executed on a single Cell node. For each MPI process, the runtime system uses a variable number of SPEs, ranging from 1 up to the optimal number of SPEs determined by the first phase of sampling.

The purpose of the sampling period is to determine the configuration of parallelism that maximizes efficiency. We define a throughput metric W as

    W = C / T                                    (5.1)

where C is the number of completed tasks and T is execution time.

Figure 5.9: The sampling phase of S-MGPS. Samples are taken from four execution intervals, during which the code performs identical operations. For each sample, each MPI process uses a variable number of SPEs to parallelize its enclosed loops.

Note that a task is defined as a function off-loaded on the SPEs; therefore, C captures application- and input-dependent behavior. S-MGPS computes C by counting the number of task off-loads. This metric works reasonably well, assuming that tasks of the same type (i.e., the same function or chunk of an expensive computational loop, off-loaded multiple times on an SPE) have approximately the same execution time. This is indeed the case in the applications that we studied. The metric can easily be extended so that each task is weighted with its execution time relative to the execution time of other tasks, to account for unbalanced task execution times. We do not explore this option further in this thesis.
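The selection step at the end of the sampling period then reduces to computing W for every sampled configuration and keeping the maximum, as in the sketch below (the sample record and function names are illustrative):

struct sample {
    int    deg_tlp;           /* degree of task-level parallelism sampled */
    int    deg_llp;           /* degree of loop-level parallelism sampled */
    long   completed_tasks;   /* C: off-loads counted during the interval */
    double seconds;           /* T: duration of the sampled interval      */
};

/* Return the index of the configuration with the highest throughput W = C / T. */
int pick_best_configuration(const struct sample *s, int n)
{
    int best = 0;
    double best_w = (double)s[0].completed_tasks / s[0].seconds;

    for (int i = 1; i < n; i++) {
        double w = (double)s[i].completed_tasks / s[i].seconds;
        if (w > best_w) {
            best_w = w;
            best   = i;
        }
    }
    return best;
}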

S-MGPS calculates the efficiency of every sampled configuration and selects the configuration with the maximum efficiency for the rest of the execution. In Table 5.4 we present partial results of the sampling phase in RAxML for different input datasets. In this example, the degree of task-level parallelism sampled is 8, 4 and 2, while the degree of loop-level parallelism sampled is 1, 2 and 4. In the case of RAxML we define a single sampling phase as the time necessary for all active worker processes to finish a single bootstrap. Therefore, in the case of RAxML in Table 5.4, the number of bootstraps and the execution time differ across sampling phases: when the number of active workers is 8, the sampling phase contains 8 bootstraps; when the number of active workers is 4, the sampling phase contains 4 bootstraps; and so on. Nevertheless, the throughput (W) remains invariant across different sampling phases and always represents the efficiency of a certain configuration, i.e., the amount of work done per second. The results presented in Table 5.4 confirm that S-MGPS converges to the optimal configurations (4x2 and 8x1) for the input files 25 SC and 42 SC, respectively.

Dataset   deg(TLP) x deg(LLP)   # bootstraps per    # off-loaded   phase       W
                                sampling phase      tasks          duration
42 SC     8x1                   8                   2,526,126      41.73s      60,535
42 SC     4x2                   4                   1,263,444      21.05s      60,021
42 SC     2x4                   2                   624,308        14.42s      43,294
25 SC     8x1                   8                   1,261,232      16.53s      76,299
25 SC     4x2                   4                   612,155        8.01s       76,423
25 SC     2x4                   2                   302,394        5.6s        53,998

Table 5.4: Efficiency of different program configurations with two data sets in RAxML. The best configuration for the 42 SC input is deg(TLP)=8, deg(LLP)=1. The best configuration for 25 SC is deg(TLP)=4, deg(LLP)=2. deg() denotes the degree of a given dimension of parallelism (LLP or TLP).

Since the scheduler performs an exhaustive search, for the 25 SC input the total number of bootstraps required for the sampling period on Cell is 17, for up to 8 MPI processes and 1 to 5 SPEs used per MPI process for loop-level parallelization. The upper bound of 5 SPEs per loop is determined by the first step of the sampling period. Assuming that performance is optimized when the maximum number of SPEs of the processor is involved in parallelization, the feasible configurations to sample are constrained by deg(TLP)×deg(LLP)=8, for a single Cell with 8 SPEs. Under this constraint, the number of samples needed by S-MGPS on Cell drops to 3. Unfortunately, when considering only configurations that use all SPEs, the scheduler may omit a configuration that does not use all SPEs but still performs better than the best scheme that uses all processor cores. In principle, this situation may occur in certain non-scalable codes or code phases. To address such cases, we recommend the use of exhaustive search in S-MGPS, given that the total number of feasible configurations of SPEs on a Cell is manageable and small compared to the number of tasks and the number of instances of each task executed in real applications. This assumption may need to be revisited in the future for large-scale systems with many cores, where exhaustive search may need to be replaced by heuristics such as hill climbing or simulated annealing.

In Table 5.5 we compare the performance of S-MGPS to static scheduling policies with both one-dimensional (TLP) and multi-grain (TLP-LLP) parallelism on Cell, using RAxML. For a small number of bootstraps, S-MGPS underperforms the best static scheduling scheme by 10%. The reason is that S-MGPS expends a significant percentage of the execution time in the sampling period, while executing the program in mostly suboptimal configurations. As the number of bootstraps increases, S-MGPS comes closer to the performance of the best static scheduling scheme (within 3%–5%).

              deg(TLP)=8,   deg(TLP)=4,   deg(TLP)=2,   S-MGPS
              deg(LLP)=1    deg(LLP)=2    deg(LLP)=4
32 boots.     60s           57s           80s           63s
64 boots.     117s          112s          161s          118s
128 boots.    231s          221s          323s          227s

Table 5.5: RAxML – Comparison between S-MGPS and static scheduling schemes, illustrating the convergence overhead of S-MGPS.

To map PBPI to Cell, we used a hybrid parallelization approach where a fixed number of MPI processes is multiplexed on the PPE and multiple SPEs are used for loop-level parallelization. The performance of the parallelized off-loaded code in PBPI is influenced by the same factors as in RAxML: the granularity of the off-loaded code, PPE-SPE communication, and load imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs is used to execute the parallelized off-loaded code. The input file used in this experiment is 107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides. We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four executions of PBPI, with 1, 2, 4 and 8 MPI processes and 1–16, 1–8, 1–4 and 1–2 SPEs used per MPI process, respectively. In all experiments we use a single BladeCenter with two Cell BE processors (16 SPEs in total).

Figure 5.10: PBPI executed with different levels of TLP and LLP parallelism: deg(TLP)=1-4, deg(LLP)=1–16 factors as in RAxML: granularity of the off-loaded code, PPE-SPE communication, and load imbalance. In Figure 5.10 we present the performance of PBPI when a variable number of SPEs is used to execute the parallelized off-loaded code. The input file we used in this experiment is 107 SC, including 107 organisms, each represented by a DNA sequence of 1,000 nucleotides. We run PBPI with one Markov chain for 200,000 generations. Figure 5.10 contains four exe- cutions of PBPI with 1, 2, 4 and 8 MPI processes with 1–16, 1–8, 1–4 and 1–2 SPEs used per MPI process respectively. In all experiments we use a single BladeCenter with two Cell BE processors (total of 16 SPEs).

In the experiments with 1 and 2 MPI processes, the off-loaded code scales only up to a certain number of SPEs, which is smaller than the total number of available SPEs; in both cases, the best performance is reached with fewer SPEs than are available. The optimal number of SPEs in general depends on the input data set and on the outermost parallelization and data decomposition scheme of PBPI. The best performance for this dataset is reached by using 4 MPI processes, spread across 2 Cell BEs, with each process using 4 SPEs on one Cell BE. This optimal operating point shifts with different data set sizes.

The fixed virtual processor topology and data decomposition method used in PBPI prevent dynamic scheduling of MPI processes at runtime without excessive overhead. We have experimented with the option of dynamically changing the number of active MPI processes via a gang scheduling scheme, which keeps the total number of active MPI processes constant, but co-schedules MPI processes in gangs of size 1, 2, 4, or 8 on the PPE and uses 8, 4, 2, or 1 SPE(s) per MPI process per gang respectively, for the execution of parallel loops. This scheme also suffered from system overhead, due to process control and context switching on the SPEs. Pending better solutions for adaptively controlling the number of processes in MPI, we evaluated S-MGPS in several scenarios where the number of MPI processes remains fixed. Using S-MGPS we were able to determine the optimal degree of loop-level parallelism for any given degree of task-level parallelism (i.e. initial number of MPI processes) in PBPI. Being able to pinpoint the optimal SPE configuration for LLP is still important, since different loop parallelization strategies can result in a significant difference in execution time. For example, the naïve parallelization strategy, where all available SPEs are used for parallelization of off-loaded loops, can result in up to 21% performance degradation (see Figure 5.10).

Table 5.6 compares execution times when S-MGPS is used and when different static parallelization schemes are used. S-MGPS performs within 2% of the optimal static parallelization scheme, and up to 20% better than the naïve parallelization scheme where all available SPEs are used for LLP (see Table 5.6(b)).

5.5 Chapter Summary

In this chapter we investigated policies and mechanisms pertaining to scheduling multigrain parallelism on the Cell Broadband Engine. We proposed an event-driven task scheduler, striving for higher utilization of SPEs by oversubscribing the PPE. We explored the conditions under which loop-level parallelism within off-loaded code can be used. We also proposed a comprehensive scheduling policy for combining task-level and loop-level parallelism autonomically within MPI code, in response to workload fluctuation. Using a bioinformatics code with

inherent multigrain parallelism as a case study, we have shown that our user-level scheduling policies outperform the native OS scheduler by up to a factor of 2.7.

(a) deg(TLP)=1
deg(LLP)    1      2      3      4      5      6      7      8
Time (s)    502    267.8  222.8  175.8  142.1  118.6  108.1  134.3
deg(LLP)    9      10     11     12     13     14     15     16
Time (s)    122    111.9  138.3  109.2  122.3  133.2  115.3  116.5
S-MGPS Time (s): 110.3

(b) deg(TLP)=2
deg(LLP)    1      2      3      4      5      6      7       8
Time (s)    275.9  180.8  139.4  113.5  91.3   97.3   102.55  115
S-MGPS Time (s): 93

(c) deg(TLP)=4
deg(LLP)    1      2       3      4
Time (s)    180.6  118.67  94.63  83.61
S-MGPS Time (s): 85.9

(d) deg(TLP)=8
deg(LLP)    1      2
Time (s)    355.5  265
S-MGPS Time (s): 267

Table 5.6: PBPI – comparison between S-MGPS and static scheduling schemes: (a) deg(TLP)=1, deg(LLP)=1–16; (b) deg(TLP)=2, deg(LLP)=1–8; (c) deg(TLP)=4, deg(LLP)=1–4; (d) deg(TLP)=8, deg(LLP)=1–2.

Our MGPS scheduler proves to be responsive to small and large degrees of task-level and data-level parallelism, at both fine and coarse levels of granularity. This kind of parallelism is commonly found in optimization problems where many workers are spawned to search a very large space of solutions, using a heuristic. RAxML is representative of these applications. MGPS is also appropriate for adaptive and irregular applications such as adaptive mesh refinement, where the application has task-level parallelism with variable granularity (because of load imbalance incurred while meshing subdomains with different structural properties) and, in some implementations, a statically unpredictable degree of task-level parallelism (because of non-deterministic dynamic load balancing which may be employed to improve execution time). N-body simulations and ray-tracing are applications that exhibit similar properties and can also benefit from our scheduler. As a final note, we observe that MGPS reverts to the best static scheduling scheme for regular codes with a fixed degree of task-level parallelism, such as blocked linear algebra kernels.

We also investigated the problem of mapping multi-dimensional parallelism on heterogeneous parallel architectures with both conventional and accelerator cores. We proposed a feedback-guided dynamic scheduling scheme, S-MGPS, which rightsizes parallelism on the fly, without a priori knowledge of application-specific information and regardless of the input data set.

Chapter 6

Model of Multi-Grain Parallelism

6.1 Introduction

The migration of parallel programming models to accelerator-based architectures raises many challenges. Accelerators require platform-specific programming interfaces and re-formulation of parallel algorithms to fully exploit the additional hardware. Furthermore, scheduling code on accelerators and orchestrating parallel execution and data transfers between host processors and accelerators is a non-trivial exercise, as discussed in Chapter 5.

Although the S-MGPS scheduler (Section 5.4) can accurately determine the most efficient execution configuration of a multi-level parallel application, it requires sampling many different configurations at runtime. The sampling time grows with the number of accelerators on the chip and with the number of different levels of parallelism available in the application. To pinpoint the most efficient execution configuration without a sampling phase, we develop a model of multi-dimensional parallel computation on heterogeneous multi-core processors. We name the model the Model of Multi-Grain Parallelism (MMGP). The model is applicable to any type of accelerator-based architecture, and in Section 6.4 we test the accuracy and usability of MMGP on the multicore Cell architecture.


Figure 6.1: A hardware abstraction of an accelerator-based architecture with two layers of parallelism. Host processing units (HPUs) supply relatively coarse-grain parallel computation across accelerators. Accelerator processing units (APUs) are the main computation engines and may internally support finer-grain parallelism. Both HPUs and APUs have local memories and communicate through shared memory or message passing. Additional layers of parallelism can be expressed hierarchically in a similar fashion.

6.2 Modeling Abstractions

Performance can be dramatically affected by the assignment of tasks to resources on a complex parallel architecture with multiple types of parallel execution vehicles. We intend to create a model of performance that captures the important costs of parallel task assignment at multiple levels of granularity, while maintaining simplicity. Additionally, we want our techniques to be independent of both programming models and the underlying hardware. Thus, in this section we identify abstractions necessary to allow us to define a simple, accurate model of parallel computation for accelerator-based architectures.

6.2.1 Hardware Abstraction

Figure 6.1 shows our abstraction for accelerator-based architectures. In this abstraction, each node consists of multiple host processing units (HPUs) and multiple accelerator processing units (APUs). Both the HPUs and APUs have local and shared memory. Multiple HPU-APU nodes form a cluster. We model the communication cost between components i and j, where i and j are HPUs, APUs, and/or HPU-APU nodes, using a variant of the LogP model [35] of point-to-point communication:

C_{i,j} = O_i + L + O_j     (6.1)

where C_{i,j} is the communication cost, O_i and O_j are the overheads of the sender and receiver respectively, and L is the communication latency.
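Read as code, Equation 6.1 is a simple additive cost. The sketch below is only an illustration of that reading; the numeric overhead and latency values in it are placeholders rather than measured constants.

```c
/* Point-to-point communication cost of Equation 6.1: sender overhead,
 * wire latency, and receiver overhead add up.  The values used below
 * are placeholders for illustration only. */
#include <stdio.h>

static double comm_cost(double o_sender, double latency, double o_receiver)
{
    return o_sender + latency + o_receiver;   /* C_{i,j} = O_i + L + O_j */
}

int main(void)
{
    /* hypothetical overheads and latency, in seconds */
    double o_hpu = 0.5e-6, l = 0.2e-6, o_apu = 0.3e-6;
    printf("C_{HPU,APU} = %.2g s\n", comm_cost(o_hpu, l, o_apu));
    return 0;
}
```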

In this hardware abstraction, we model an HPU, APU, or HPU-APU node as a sequential device with streaming memory accesses. For simplicity, we assume that additional levels of parallelism in HPUs or APUs, such as ILP and SIMD, can be reflected with a parameter that represents computing capacity. We could alternatively express multi-grain parallelism hierarchically, but this complicates the model description without much added value. The assumption of streaming memory accesses allows us to include the effects of overlapping communication with computation.

6.2.2 Application Abstraction

Figure 6.2 provides an illustrative view of the succeeding discussion. We model the workload of a parallel application using a version of the Hierarchical Task Graph (HTG [52]). An HTG represents multiple levels of concurrency with progressively finer granularity when moving from outermost to innermost layers. We use a phased HTG, in which we partition the application into multiple phases of execution and split each phase into nested sub-phases, each modeled as a single, potentially parallel task. Each subtask may incorporate one or more layers of data or sub-task parallelism. The degree of concurrency may vary between tasks and within tasks.

Mapping a workload with nested parallelism as shown in Figure 6.2 to an accelerator-based multi-core architecture can be challenging. In the general case, any application task of any granularity could map to any combination of HPUs and APUs. The solution space under these conditions can be unmanageable.

Figure 6.2: Our application abstraction of two parallel tasks. Two tasks are spawned by the main process. Each task exhibits phased, multi-level parallelism of varying granularity. In this chapter, we address the problem of mapping tasks and subtasks to accelerator-based systems.

In this work, we confine the solution space by making some assumptions about the application and hardware. First, we assume that the amount and type of parallelism is known a priori for all phases in the application. In other words, we assume that the application is explicitly parallelized, in a machine-independent fashion. More specifically, we assume that the application exposes all available layers of inherent parallelism to the runtime environment, without, however, specifying how to map this parallelism to parallel execution vehicles in hardware. In other words, the application's parallelism is expressed independently of the number and the layout of processors in the architecture and is represented by a phased HTG. The intent of our work is to improve and formalize programming of accelerator-based multicore architectures. We believe it is not unreasonable to assume that those interested in porting code and algorithms to such systems would have detailed knowledge of the inherent parallelism of their application. Furthermore, explicit, processor-independent parallel programming is considered by many as a means to simplify parallel programming models [10].

Second, we prune the number and type of hardware configurations. We assume hardware

configurations consist of a hierarchy of nested resources, even though the actual resources may not be physically nested in the architecture. Each resource is assigned to an arbitrary level of parallelism in the application, and resources are grouped by level of parallelism in the application. For instance, the Cell Broadband Engine can be considered as 2 HPUs and 8 APUs, where the two HPUs correspond to the PowerPC dual-thread SMT core and the APUs to the synergistic (SPE) accelerator cores. HPUs support parallelism of any granularity; APUs, however, support the same or finer, not coarser, granularity. This assumption is reasonable since it faithfully represents all current accelerator architectures, where front-end processors offload computation and data to accelerators. It also simplifies modeling of both communication and computation.

6.3 Model of Multi-grain Parallelism

This section provides theoretical rigor to our approach. We present MMGP, a model which predicts execution time on accelerator-based system configurations and applications under the assumptions described in the previous section. Readers familiar with point-to-point models of parallel computation may want to skim this section and continue directly to the results of our execution time prediction techniques discussed in Section 6.4.

We follow a bottom-up approach. We begin by modeling sequential execution on the HPU, with part of the computation off-loaded to a single APU. Next, we incorporate multiple APUs in the model, followed by multiple HPUs. We end up with a general model of execution time, which is not particularly practical. Hence, we reduce the general model to reflect different uses of HPUs and APUs on real systems. More specifically, we specialize the model to capture the scheduling policy of threads on the HPUs and to estimate execution times under different mappings of multi-grain parallelism across HPUs and APUs. Lastly, we describe the methodology we use to apply MMGP to real systems.

Figure 6.3: The sub-phases of a sequential application are readily mapped to HPUs and APUs: (a) an architecture with one HPU and one APU; (b) an application with three phases. In this example, sub-phases 1 and 3 execute on the HPU and sub-phase 2 executes on the APU. HPUs and APUs are assumed to communicate via shared memory.

6.3.1 Modeling sequential execution

As the starting point, we consider the mapping of the program to an accelerator-based architecture that consists of one HPU and one APU, and an application with one phase decomposed into three sub-phases: a prologue and an epilogue running on the HPU, and a main accelerated sub-phase running on the APU, as illustrated in Figure 6.3.

Offloading computation incurs additional communication cost for loading code and data on the APU and for saving results computed on the APU. We model each of these communication costs with a latency and an overhead at the end-points, as in Equation 6.1. We assume that the APU's accesses to data during the execution of a procedure are streamed and overlapped with APU computation. This assumption reflects the capability of current streaming architectures, such as the Cell and Merrimac [37], to aggressively overlap memory latency with computation, using multiple buffers. Due to overlapped memory latency, communication overhead is assumed to be visible only while loading the code and arguments of a procedure on the APU and while returning the result of a procedure from the APU to the HPU. We combine the communication overhead for offloading the code and arguments of a procedure and signaling the execution of that procedure on the APU in one term (O_s), and the overhead for returning the result of a procedure from the APU to the HPU in another term (O_r).

We can model the execution time of the offloaded sequential execution for sub-phase 2 in Figure 6.3 as:

T_{offload}(w_2) = T_{APU}(w_2) + O_r + O_s     (6.2)

where T_{APU}(w_2) is the time needed to complete sub-phase 2 without additional overhead. Further, we can write the total execution time of all three sub-phases as:

T = T_{HPU}(w_1) + T_{APU}(w_2) + O_r + O_s + T_{HPU}(w_3)     (6.3)

To reduce complexity, we replace T_{HPU}(w_1) + T_{HPU}(w_3) with T_{HPU}, T_{APU}(w_2) with T_{APU}, and O_s + O_r with O_{offload}. Therefore, we can rewrite Equation 6.3 as:

T = T_{HPU} + T_{APU} + O_{offload}     (6.4)

The application model in Figure 6.3 is representative of one of potentially many phases in an application. We further modify Equation 6.4 for a generic application with N phases, where each phase i offloads a part of its computation on one APU:

T = \sum_{i=1}^{N} (T_{HPU,i} + T_{APU,i} + O_{offload})     (6.5)
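As a concrete reading of Equation 6.5, the sketch below sums per-phase host time, accelerator time, and off-loading overhead; the per-phase numbers are made up for illustration and do not correspond to any measured application.

```c
/* Sketch of Equation 6.5: total time of a program with N phases, where each
 * phase runs T_HPU,i on the host, off-loads work that takes T_APU,i on one
 * accelerator, and pays a fixed off-loading overhead per phase.
 * All per-phase numbers below are made up for illustration. */
#include <stdio.h>

int main(void)
{
    double t_hpu[] = { 0.4, 0.2, 0.1 };  /* seconds on the HPU per phase  */
    double t_apu[] = { 3.0, 5.5, 1.5 };  /* seconds on the APU per phase  */
    double o_offload = 0.001;            /* per-phase off-loading overhead */
    int n = sizeof(t_hpu) / sizeof(t_hpu[0]);

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += t_hpu[i] + t_apu[i] + o_offload;   /* Equation 6.5 */
    printf("predicted total time: %.3f s\n", total);
    return 0;
}
```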

6.3.2 Modeling parallel execution on APUs

Each offloaded part of a phase may contain fine-grain parallelism, such as task-level parallelism at the sub-procedural level or data-level parallelism in loops. This parallelism can be exploited by using multiple APUs for the offloaded workload. Figure 6.4 shows the execution time decomposition for execution with one APU and with two APUs. We assume that the code off-loaded to an APU during phase i has a part which can be further parallelized across APUs and a part executed sequentially on the APU. We denote by T_{APU,i}(1, 1) the execution time of the further parallelized part of the APU code during the i-th phase. The first index (1) refers to the use of one HPU thread in the execution. We denote by T_{APU,i}(1, p) the execution time of the same part when p APUs are used to execute it during the i-th phase. We denote by C_{APU,i} the non-parallelized part of the APU code in phase i. Therefore, we obtain:

T_{APU,i}(1, p) = T_{APU,i}(1, 1)/p + C_{APU,i}     (6.6)

Figure 6.4: Parallel APU execution: (a) offloading to one APU; (b) offloading to two APUs. The HPU (leftmost bar in each part) offloads computations to one APU in (a) and to two APUs in (b). The single point-to-point transfer of (a) is modeled as overhead plus computation time on the APU. For multiple transfers, there is additional overhead (g), but also benefit due to parallelization.

Given that the HPU offloads to APUs sequentially, there exists a latency gap between consecutive offloads on APUs. Similarly, there exists a gap between receiving return values from two consecutive offloaded procedures on the HPU. We denote by g the larger of the two gaps. On a system with p APUs, parallel APU execution will incur an additional overhead as large as p · g. Thus, we can model the execution time of phase i as:

T_i(1, p) = T_{HPU,i} + T_{APU,i}(1, 1)/p + C_{APU,i} + O_{offload} + p · g     (6.7)
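Equation 6.7 makes the trade-off explicit: dividing the parallelizable APU time by p competes with the growing off-load gap p·g. The sketch below sweeps p for one phase and reports the minimizer; all parameter values are hypothetical, chosen only so that the minimum falls at an intermediate p, mirroring the earlier observation that using every SPE is not always optimal.

```c
/* Sketch of Equation 6.7: for a single phase, predict execution time as the
 * degree of APU (data-level) parallelism p grows, and report the p that
 * minimizes it.  The trade-off is between dividing T_APU(1,1) by p and
 * paying the offload gap p*g.  All parameter values are hypothetical. */
#include <stdio.h>

int main(void)
{
    double t_hpu = 1.0;      /* non-offloaded host time for the phase    */
    double t_apu_11 = 10.0;  /* parallelizable APU time with one APU     */
    double c_apu = 2.0;      /* non-parallelizable APU time              */
    double o_offload = 0.05; /* send + receive overhead of one off-load  */
    double g = 0.8;          /* gap between consecutive off-loads        */
    int best_p = 1;
    double best_t = 0.0;

    for (int p = 1; p <= 8; p++) {
        double t = t_hpu + t_apu_11 / p + c_apu + o_offload + p * g; /* Eq. 6.7 */
        printf("p = %d  predicted time = %.2f\n", p, t);
        if (p == 1 || t < best_t) { best_t = t; best_p = p; }
    }
    printf("predicted best degree of APU parallelism: p = %d\n", best_p);
    return 0;
}
```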

6.3.3 Modeling parallel execution on HPUs

An accelerator-based architecture can support parallel HPU execution in several ways, by providing a multi-core HPU, an SMT HPU, or combinations thereof. As a point of reference, we consider an architecture with one SMT HPU, which is representative of the Cell BE.

Since the compute-intensive parts of an application are typically offloaded to APUs, the HPUs are expected to be idle for extended intervals. Therefore, multiple threads can be used to reduce idle time on the HPU and provide more sources of work for APUs, so that APUs are better utilized. It is also possible to oversubscribe the HPU with more threads than the number of available hardware contexts, in order to expose more parallelism via offloading to APUs.

Figure 6.5 illustrates the execution timeline when two threads share the same HPU, and each thread offloads parallelized code on two APUs. We use different shade patterns to represent the workload of different threads.


Figure 6.5: Parallel HPU execution. The HPU (center bar) offloads computations to 4 APUs (2 on the right and 2 on the left). The first thread on the HPU offloads computation to APU1 and APU2 then idles. The second HPU thread is switched in, offloads code to APU3 and APU4, and then idles. APU1 and APU2 complete and return data followed by APU3 and APU4.

For m concurrent HPU threads, where each thread uses p APUs for distributing a single APU task, the execution time of a single off-loading phase can be represented as:

T_i^k(m, p) = T_{HPU,i}^k(m, p) + T_{APU,i}^k(m, p) + O_{offload} + p · g     (6.8)

where T_i^k(m, p) is the completion time of the k-th HPU thread during the i-th phase.

Modeling the APU time

Similarly to Equation 6.6, we can write the APU time of the k-th thread in phase i of Equation 6.8 as:

T_{APU,i}^k(m, p) = T_{APU,i}(m, 1)/p + C_{APU,i}     (6.9)

Different parallel implementations may result in different T_{APU,i}(m, 1) terms and a different number of offloading phases. For example, the implementation could parallelize each phase among m HPU threads and then offload the work of each HPU thread to p APUs, resulting in the same number of offloading phases and a reduced APU time during each phase, i.e., T_{APU,i}(m, 1) = T_{APU,i}(1, 1)/m. As another example, the HPU threads can be used to execute multiple identical tasks, resulting in a reduced number of offloading phases (i.e., N/m, where N is the number of offloading phases when there is only one HPU thread) and the same APU time in each phase, i.e., T_{APU,i}(m, 1) = T_{APU,i}(1, 1).

Modeling the HPU time

The execution time of each HPU thread is affected by three factors:

1. Contention between HPU threads for shared resources.

2. Context switch overhead related to resource scheduling.

3. Global synchronization between dependent HPU threads.

Considering all three factors, we can model the execution time of an HPU thread in phase i as:

T_{HPU,i}^k(m, p) = α_m · T_{HPU,i}(1, p) + T_{CSW} + O_{COL}     (6.10)

In this equation, T_{CSW} is the context switching time on the HPU and O_{COL} is the time needed for collective communication. The parameter α_m is introduced to account for contention between threads that share resources on the HPU. On SMT and CMP HPUs, such resources typically include one or more levels of the on-chip cache memory. On SMT HPUs in particular, shared resources also include TLBs, branch predictors and instruction slots in the pipeline. Contention between threads often introduces artificial load imbalance due to occasionally unfair hardware policies for allocating resources between threads.

Synthesis

Combining Equations (6.8)–(6.10) and summing over all phases, we can write the MMGP execution time as:

T(m, p) = α_m · T_{HPU}(1, 1) + T_{APU}(1, 1)/(m · p) + C_{APU} + N · (O_{offload} + T_{CSW} + O_{COL} + p · g)     (6.11)

Due to limited hardware resources (i.e. the number of HPUs and APUs), we further constrain this equation to m × p ≤ N_{APU}, where N_{APU} is the number of available APUs. As described later in this chapter, we can either measure or approximate all parameters in Equation 6.11 from microbenchmarks and profiles of sequential runs of the program.

6.3.4 Using MMGP

Given a parallel application, MMGP can be applied using the following process:

1. Calculate parameters including O_{offload}, α_m, T_{CSW} and O_{COL} using micro-benchmarks for the target platform.

2. Profile a short run of the sequential execution with off-loading to a single APU, to estimate T_{HPU}(1), g, T_{APU}(1, 1) and C_{APU}.

3. Solve a special case of Equation 6.11 (e.g. 6.7) to find the optimal mapping between application concurrency and HPUs and APUs available on the target platform.
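The sketch below walks through this process end to end under stated assumptions: it evaluates Equation 6.11 for every configuration with m·p ≤ N_APU and reports the predicted best mapping. The structure and names are ours, the parameter values are placeholders rather than measured numbers, and a single mean value stands in for α_m even though contention generally depends on m.

```c
/* Sketch of the three-step MMGP process: given parameters measured with
 * micro-benchmarks (alpha_m, T_CSW, O_COL, O_offload) and profiled from a
 * sequential run (T_HPU, T_APU(1,1), C_APU, g, N), evaluate Equation 6.11
 * for every configuration with m*p <= N_APU and report the one with the
 * smallest predicted time.  All numbers below are placeholders. */
#include <stdio.h>

#define N_APU 8   /* available accelerator cores (e.g. 8 SPEs on one Cell) */

struct mmgp_params {
    double alpha_m;   /* HPU contention factor (mean value used here;     */
                      /* in general it depends on m)                      */
    double t_hpu, t_apu_11, c_apu;
    double o_offload, t_csw, o_col, g;
    int    n_phases;
};

/* Equation 6.11 */
static double mmgp_time(const struct mmgp_params *pr, int m, int p)
{
    return pr->alpha_m * pr->t_hpu
         + pr->t_apu_11 / (m * p)
         + pr->c_apu
         + pr->n_phases * (pr->o_offload + pr->t_csw + pr->o_col + p * pr->g);
}

int main(void)
{
    struct mmgp_params pr = {
        .alpha_m = 1.28, .t_hpu = 2.0, .t_apu_11 = 300.0, .c_apu = 10.0,
        .o_offload = 70e-9, .t_csw = 2e-6, .o_col = 0.0, .g = 1e-4,
        .n_phases = 10000
    };
    int best_m = 1, best_p = 1;
    double best_t = mmgp_time(&pr, 1, 1);

    for (int m = 1; m <= N_APU; m++)
        for (int p = 1; m * p <= N_APU; p++) {
            double t = mmgp_time(&pr, m, p);
            if (t < best_t) { best_t = t; best_m = m; best_p = p; }
        }
    printf("predicted best mapping: m=%d HPU threads, p=%d APUs each (%.1f s)\n",
           best_m, best_p, best_t);
    return 0;
}
```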

6.3.5 MMGP Extensions

We note that the concepts and assumptions mentioned in this section do not preclude further specialization of MMGP for higher accuracy. For example, in Section 6.3.1 we assume that computation and data communication overlap. This assumption reflects the fact that streaming processors can typically overlap memory access latency completely with computation. For non-overlapped memory accesses, we can employ a DMA model as a specialization of the overhead factors in MMGP. Also, in Sections 6.3.2 and 6.3.3 we assume only two levels of parallelism. MMGP is easily extensible to additional levels, but the terms of the equations grow quickly without conceptual additions. Furthermore, MMGP can be easily extended to reflect specific scheduling policies for threads on HPUs and APUs, as well as load imbalance in the distribution of tasks between HPUs and APUs. To illustrate the usefulness of our techniques we apply them to a real system. We next present results from applying MMGP to Cell.

6.4 Experimental Validation and Results

We use MMGP to derive multi-grain parallelization schemes for two bioinformatics applications, RAxML and PBPI, described in Chapter 3, on a shared-memory dual Cell blade, the IBM QS20. Although we use only two applications in our experimental evaluation, we should point out that these are complete applications used for real-world biological data analyses, and that they are fully optimized for the Cell BE using an arsenal of optimizations, including vectorization, loop unrolling, double buffering, if-conversion and dynamic scheduling. Furthermore, these applications have inherent multi-grain concurrency and non-trivial scaling properties in their phases, therefore scheduling them optimally on Cell is a challenging exercise for MMGP. Lastly, in the absence of comprehensive suites of benchmarks (such as NAS or SPEC HPC) ported to Cell, optimized, and made available to the community by experts, we opted to use PBPI and RAxML, codes for which we could verify that enough effort has been invested in Cell-specific parallelization and optimization.

6.4.1 MMGP Parameter Approximation

MMGP has eight free parameters: T_{HPU}, T_{APU}, C_{APU}, O_{offload}, g, T_{CSW}, O_{COL} and α_m. We estimate four of these parameters using micro-benchmarks.

α_m captures contention between processes or threads running on the PPE. This contention depends on the scheduling algorithm on the PPE. We estimate α_m under an event-driven scheduling model which oversubscribes the PPE with more processes than the number of hardware threads supported for simultaneous execution on the PPE, and switches between processes upon each off-loading event on the PPE [19].

To estimate α_m, we use a parallel micro-benchmark that computes the product of two M × M square matrices of double-precision floating point elements. Matrix-matrix multiplication involves O(n^3) computation and O(n^2) data transfers, thus stressing the impact of sharing execution resources and the L1 and L2 caches between processes on the PPE. We used several different matrix sizes, ranging from 100 × 100 to 500 × 500, to exercise different levels of pressure on the thread-shared caches of the PPE. In the MMGP model, we use the mean of α_m obtained from these experiments, which is 1.28.
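A contention factor of this kind can be estimated by timing the same kernel alone and under competition. The sketch below is our illustration of that idea, not the exact micro-benchmark used here: it times a matrix-matrix product solo, forks a second identical process, times the product again, and takes the ratio.

```c
/* Sketch of how a contention factor like alpha_m can be estimated: time a
 * matrix-matrix multiplication running alone, then time it again while a
 * second, identical process competes for shared execution resources and
 * caches, and take the ratio.  This is an illustration of the idea, not the
 * exact micro-benchmark used in the dissertation. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define M 300   /* matrix dimension; the text sweeps sizes from 100 to 500 */

static double a[M][M], b[M][M], c[M][M];

static double matmul_seconds(void)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++) {
            double s = 0.0;
            for (int k = 0; k < M; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    double solo = matmul_seconds();        /* baseline: no competing process */

    pid_t pid = fork();                    /* spawn a competing, identical process */
    if (pid == 0) { matmul_seconds(); _exit(0); }
    double shared = matmul_seconds();      /* measured while the child also runs */
    waitpid(pid, NULL, 0);

    printf("solo: %.3f s, with contention: %.3f s, alpha ~ %.2f\n",
           solo, shared, shared / solo);
    return 0;
}
```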

PPE-SPE communication is optimally implemented through DMAs on Cell. We devised a ping-pong micro-benchmark using DMAs to send a single word from the PPE to one SPE and back. We measured the PPE→SPE→PPE round-trip communication overhead (O_{offload}) to be 70 ns. To measure the overhead caused by various collective communications we used

mpptest [55] on the PPE. Using a micro-benchmark that repeatedly executes the sched_yield() system call, we estimate the overhead caused by context switching (T_{CSW}) on the PPE to be 2 µs. This is a conservative upper bound on the context switching overhead, since it includes some user-level library overhead.
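A minimal version of such a measurement is sketched below, assuming only the standard POSIX sched_yield() interface; with a single runnable process it mostly measures the cost of entering and leaving the scheduler, which is consistent with treating the result as an upper bound.

```c
/* Sketch of the context-switch micro-benchmark: repeatedly call
 * sched_yield() and divide the elapsed time by the number of calls.
 * With a single runnable process this mostly measures the cost of entering
 * and leaving the scheduler, so it is a rough upper bound rather than an
 * exact context-switch time. */
#include <sched.h>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long iters = 1000000;
    struct timeval t0, t1;

    gettimeofday(&t0, NULL);
    for (long i = 0; i < iters; i++)
        sched_yield();
    gettimeofday(&t1, NULL);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("sched_yield: %.2f us per call\n", 1e6 * elapsed / iters);
    return 0;
}
```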

T_{HPU}, T_{APU}, C_{APU} and the gap g between consecutive DMAs on the PPE are application-dependent and cannot be approximated easily with a micro-benchmark. To estimate these parameters, we use a profile of a sequential run of the code, with tasks off-loaded to one SPE. We use timing instructions inserted into the applications at specific locations. To estimate T_{HPU} we measure the time that applications spend on the HPU. To estimate T_{APU} and C_{APU} we measure the time that applications spend on the accelerators, in large computational loops which can be parallelized (T_{APU}), and in the sequential accelerator code outside of the large loops (C_{APU}). To estimate g, we measure the time intervals between consecutive task off-loads and task completions.

6.4.2 Case Study I: Using MMGP to Parallelize PBPI

PBPI with One Dimension of Parallelism

We compare the PBPI execution times predicted by MMGP to the actual execution times obtained on real hardware, using various degrees of PPE and SPE parallelism, i.e. the equivalents of HPU and APU parallelism on Cell. These experiments illustrate the accuracy of MMGP on a sample of the feasible program configurations. The sample includes one-dimensional decompositions of the program between PPE threads, with simultaneous off-loading of code to one SPE from each PPE thread; one-dimensional decompositions of the program between SPE threads, where the execution of tasks on the PPE is sequential and each task off-loads code which is data-parallel across SPEs; and two-dimensional decompositions of the program, where multiple tasks run on the PPE threads concurrently and each task off-loads code which is data-parallel across SPEs. In all cases, the SPE code is SIMDized in the innermost loops, to exploit the vector units of the SPEs. We believe that this sample of program configurations is representative of what a user would reasonably experiment with while trying to optimize the codes on the Cell.

Figure 6.6: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of PPE (HPU) parallelism.

For these experiments, we used the arch107 L10000 input data set. This data set consists of 107 sequences, each with 10,000 characters. We run PBPI with one Markov chain for 20,000 generations. Using the time base register on the PPE and the decrementer register on one SPE, we obtained the following model parameters for PBPI: T_{HPU} = 1.3s, T_{APU} = 370s, g = 0.8s and O = 1.72s.

Figure 6.6 compares MMGP and actual execution times for PBPI, when PBPI exploits only one-dimensional PPE (HPU) parallelism, in which each PPE thread uses one SPE for off-loading. We execute the code with up to 16 MPI processes, which off-load code to up to 16 SPEs on two Cell BEs. Referring to Equation 6.11, we set p = 1 and vary the value of m between 1 and 8. The X-axis shows the number of processes running on the PPE (i.e. HPU parallelism), and the Y-axis shows the predicted and measured execution times. The maximum prediction error of MMGP is 5%. The arithmetic mean of the error is 2.3% and the standard deviation is 1.4.

Figure 6.7 illustrates predicted and actual execution times when PBPI uses one dimension of SPE (APU) parallelism. Referring to Equation 6.11, we set m = 1 and vary p. MMGP remains accurate: the mean prediction error is 4.1% and the standard deviation is 3.2. The maximum prediction error in this case is higher (approaching 10%) when the APU parallelism increases and the code uses SPEs on both Cell processors. A closer inspection of this result reveals that the data-parallel implementation of tasks in PBPI stops scaling beyond the 8 SPEs confined in one Cell processor, because of DMA bottlenecks and non-uniformity in the latency of memory accesses by the two Cell processors on the blade. Capturing the DMA bottlenecks requires the introduction of a model of DMA contention in MMGP, while capturing the NUMA bottleneck would require an accurate memory hierarchy model integrated with MMGP. The NUMA bottleneck can be resolved by a better page placement policy implemented in the operating system. We intend to examine these issues in our future work. For the purposes of this chapter, it suffices to observe that MMGP is accurate enough despite its generality. As we show later, MMGP accurately predicts the optimal mapping of the program to the Cell multiprocessor, regardless of inaccuracies in execution time prediction in certain edge cases.

Figure 6.7: MMGP predictions and actual execution times of PBPI, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation.

PBPI with Two Dimensions of Parallelism

Multi-grain parallelization aims at exploiting task-level and data-level parallelism in PBPI simultaneously. We only consider multi-grain parallelization schemes in which deg(HPU) · deg(APU) ≤ 16, i.e. the total number of SPEs (APUs) on the dual-processor Cell blade used in this study. deg() denotes the degree of a layer of parallelism, which corresponds to the number of SPE or PPE threads used to run the code. Figure 6.8 shows the predicted and actual execution times of PBPI for all feasible combinations of multi-grain parallelism under the aforementioned constraint. MMGP's mean prediction error is 3.2%, the standard deviation of the error is 2.6 and the maximum prediction error is 10%. The important observation in these results is that MMGP agrees with the experimental outcome in terms of the mix of PPE and SPE parallelism to use in PBPI for maximum performance. In a real program development scenario, MMGP would point the programmer in the direction of using both task-level and data-level parallelism, with a balanced allocation of PPE contexts and SPEs between the two layers.

6.4.3 Case Study II: Using MMGP to Parallelize RAxML

RAxML with a Single Layer of Parallelism

The units of work (bootstraps) in RAxML are distributed evenly between MPI processes, therefore the degree of PPE (HPU) concurrency is bound by the number of MPI processes. As discussed in Section 6.3.3, the degree of HPU concurrency may exceed the number of HPUs, so that on an architecture with more APUs than HPUs the program can expose more concurrency to APUs. The degree of SPE (APU) concurrency may vary per MPI process. In practice, the degree of PPE concurrency cannot meaningfully exceed the total number of SPEs available on the system, since that many MPI processes can already utilize all available SPEs via simultaneous off-loading. Similarly to PBPI, each MPI process in RAxML can exploit multiple SPEs via data-level parallel execution of off-loaded tasks across SPEs. To enable maximal PPE and SPE concurrency in RAxML, we use a version of the code scheduled by a Cell BE event-driven scheduler [19], in which context switches on the PPE are forced upon task off-loading, and PPE processes are served with a fair-share scheduler, so as to have even chances for off-loading on SPEs.

Figure 6.8: MMGP predictions and actual execution times of PBPI, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. The mix of degrees of parallelism which optimizes performance is 4-way PPE parallelism combined with 4-way SPE parallelism. The chart illustrates the results when both SPE parallelism and PPE parallelism are scaled to two Cell processors.

We evaluate the performance of RAxML when each process performs the same amount of work, i.e. when the number of distributed bootstraps is divisible by the number of processes. The case of an unbalanced distribution of bootstraps between MPI processes can be handled with a minor modification to Equation 6.11, which scales the MMGP parameters by a factor of (⌈B/M⌉ · M)/B, where B is the number of bootstraps (tasks) and M is the number of MPI processes used to execute the code.
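A quick worked instance of this correction is sketched below; B = 16 and M = 6 are example values of our own choosing, not a configuration reported in the text.

```c
/* Worked example of the load-imbalance correction: with B bootstraps spread
 * over M MPI processes, the busiest process executes ceil(B/M) of them, so
 * the balanced-case MMGP prediction is scaled by (ceil(B/M) * M) / B.
 * B = 16 and M = 6 are example values. */
#include <stdio.h>

int main(void)
{
    int B = 16, M = 6;
    int per_proc = (B + M - 1) / M;                 /* ceil(B/M) = 3   */
    double factor = (double)(per_proc * M) / B;     /* 18/16 = 1.125   */
    printf("scaling factor = %.3f\n", factor);
    return 0;
}
```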

We compare the execution time of RAxML to the time predicted by MMGP, using two input data sets. The first data set contains 10 organisms, each represented by a DNA sequence of 20,000 nucleotides. We refer to this data set as DS1. The second data set (DS2) contains 10 organisms, each represented by a DNA sequence of 50,000 nucleotides. For both data sets, we set RAxML to perform a total of 16 bootstraps using different parallel configurations.

Figure 6.9: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of PPE (HPU) parallelism: (a) with DS1, (b) with DS2.

The MMGP parameters for RAxML, obtained from profiling a sequential run of the code, are T_{HPU} = 3.3s, T_{APU} = 63s, C_{APU} = 104s for DS1, and T_{HPU} = 8.8s, T_{APU} = 118s, C_{APU} = 157s for DS2. The values of the other MMGP parameters are negligible compared to T_{APU}, T_{HPU}, and C_{APU}, therefore we disregard them for RAxML. Note that the off-loaded code that cannot be parallelized (C_{APU}) takes 57–62% of the execution time of a task on the SPE. Figure 6.9 illustrates the estimated and actual execution times of RAxML with up to 16 bootstraps, using one dimension of PPE (HPU) parallelism. In this case, each MPI process offloads tasks to one SPE, and SPEs are utilized by oversubscribing the PPE with more processes than the number of hardware threads available on the PPE. For DS1, the mean MMGP prediction error is 7.1%, the standard deviation is 6.4, and the maximum error is 18%. For DS2, the mean MMGP prediction error is 3.4%, the standard deviation is 1.9 and the maximum error is 5%.


Figure 6.10: MMGP predictions and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism: (a) with DS1, (b) with DS2.

Figure 6.10 illustrates estimated and actual execution times of RAxML, when the code uses one dimension of SPE (APU) parallelism, with a data-parallel implementation of the maximum likelihood calculation functions across SPEs. We should point out that although both RAxML and PBPI perform maximum likelihood calculations in their computational cores, RAxML's loops have loop-carried dependencies that prevent scalability and parallelization in many cases [20], whereas PBPI's core computation loops are fully parallel and coarse enough to achieve scalability. The limited scalability of data-level parallelization of RAxML is the reason why we confine the executions with data-level parallelism to at most 8 SPEs. As shown in Figure 6.10, the data-level parallel implementation of RAxML does not scale substantially beyond 4 SPEs. When only APU parallelism is extracted from RAxML, for DS1 the mean MMGP prediction error is 0.9%, the standard deviation is 0.8, and the maximum error is 2%. For DS2, the mean MMGP prediction error is 2%, the standard deviation is 1.3 and the maximum error is 4%.

RAxML with Two Dimensions of Parallelism

Figure 6.11 shows the actual and predicted execution times in RAxML, when the code exposes two dimensions of parallelism to the system. Once again, regardless of execution time prediction accuracy, MMGP is able to pinpoint the optimal parallelization model, which in the case of RAxML is task-level parallelization with no further data-parallel decompositions of tasks between SPEs, as the opportunity for scalable data-level parallelization in the code is limited. Innermost loops in tasks are still SIMDized within each SPE. MMGP remains accurate, with mean execution time prediction error of 4.3%, standard deviation of 4, and maximum prediction error of 18% for DS1, and mean execution time prediction error of 2.8%, standard deviation of 1.9, and maximum prediction error of 7% for DS2. It is worth noting that although the two codes tested are fundamentally similar in their computational core, their optimal parallelization model is radically different. MMGP accurately reflects this disparity, using a small number of parameters and rapid prediction of execution times across a large number of feasible program configurations.

6.4.4 MMGP Usability Study

We demonstrate a practical use of MMGP through a simple usability study. We modified PBPI to execute an MMGP sampling phase at the beginning of the execution. During the sampling phase, the application is profiled and all MMGP parameters are determined. After finishing the sampling phase, MMGP estimates the optimal configuration and the application is executed with the MMGP-recommended configuration. The profiling, sampling and MMGP actuation phases are performed automatically, without any user intervention. We set PBPI to execute 10^6 generations, since this is the number of generations typically required by biologists. We set the sampling phase to be 10,000 generations. Even with the overhead introduced by the sampling phase included in the measurements, the configuration provided by MMGP outperforms all other configurations by margins ranging from 1.1% (compared to the next best configuration identified via an exhaustive search) to a factor of 4 (compared to the worst configuration identified via an exhaustive search). The sampling phase takes in the worst case 2.2% of the total execution time, but completely eliminates the exhaustive search that would otherwise be necessary to find the best mapping of the application to the Cell architecture. Figure 6.12 illustrates the overhead of the sampling phase with the PBPI application.

Figure 6.11: MMGP predictions and actual execution times of RAxML, when the code uses two dimensions of SPE (APU) and PPE (HPU) parallelism. Performance is optimized by oversubscribing the PPE and maximizing task-level parallelism.

Figure 6.12: Overhead of the sampling phase when the MMGP scheduler is used with the PBPI application. PBPI is executed multiple times with 107 input species. The sequence size of the input file is varied from 1,000 to 10,000. In the worst case, the overhead of the sampling phase is 2.2% (sequence size 7,000).

6.5 Chapter Summary

The introduction of accelerator-based parallel architectures complicates the problem of mapping algorithms to systems, since parallelism can no longer be considered as a one-dimensional abstraction of processors and memory. We presented a new model of multi-dimensional parallel computation, MMGP, which we introduced to relieve users from the arduous task of mapping parallelism to accelerator-based architectures. We have demonstrated that the model is fairly accurate, albeit simple, and that it is extensible and easy to specialize for a given architecture. We envision three uses of MMGP: i) As a rapid prototyping tool for porting algorithms to accelerator-based architectures. More specifically, MMGP can help users derive not only a decomposition strategy, but also an actual mix of programming models to use in the application in order to best utilize the architecture, while using architecture-independent programming techniques. ii) As a compiler tool for assisting compilers in deriving efficient mappings of programs to accelerator-based architectures automatically. iii) As a runtime tool for dynamic control of parallelism in applications, whereby the runtime system searches for optimal program configurations in the neighborhood of optimal configurations derived by MMGP, using execution time sampling or prediction-based techniques.

Chapter 7

Scheduling Asymmetric Parallelism on a PS3 Cluster

7.1 Introduction

Cluster computing is already feeling the impact of multi-core processors [30]. Several highly ranked entries of the latest Top-500 list include clusters of commodity dual-core processors (see http://www.top500.org). The availability of abundant chip-level and board-level parallelism changes fundamental assumptions that developers currently make while writing software for HPC clusters. While recent work has improved our understanding of the implications of small-scale symmetric multi-core processors on cluster computing [7], emerging asymmetric multi-core processors such as the Cell/BE, and boards with conventional processors and hardware accelerators such as GPUs, are rapidly making their way into HPC clusters [94]. There are strong incentives that support this trend, not the least of which is higher performance with higher energy-efficiency made possible through asymmetric, rather than symmetric, multi-core processor organizations [57].

Understanding the implications of asymmetric multi-core processors on cluster computing and providing models and software support to ease the migration of parallel programs to these


platforms is a challenging and relevant problem. This study makes four contributions:

i) We conduct a performance analysis of a Linux cluster of Sony PlayStation3 (PS3) nodes. To the best of our knowledge, this is the first study to evaluate this cost-effective and unconventional HPC platform with microbenchmarks and realistic applications from the area of bioinformatics. The cluster we used has 22 PS3 nodes connected with a GigE switch and was built at Virginia Tech for less than $15,000. We first evaluate the performance of MPI collective and point-to-point communication on the PS3 cluster, and explore the scalability of MPI communication operations under contention for bandwidth within and across PS3 nodes. We then evaluate the performance and scalability of the PS3 cluster with bioinformatics applications. Our analysis reveals the sensitivity of computation and communication to the mapping of asymmetric parallelism to the cluster and the importance of coordinated scheduling across multiple layers of parallelism. Optimal scheduling of MPI codes on the PS3 cluster requires coordinated scheduling and mapping of at least three layers of parallelism (two layers within each Cell processor and an additional layer across Cell processors), and the optimal mapping and schedule change with the application, the input data set, and the number of nodes used for execution.
ii) We adapt and validate MMGP on the PS3 cluster. We model a generic heterogeneous cluster built from compute nodes with front-end host cores and back-end accelerator cores. The extended model combines analytical components with empirical measurements to navigate the optimization space for mapping MPI programs with nested parallelism on the PS3 cluster. Our evaluation of the extended MMGP model shows that it estimates execution time with an average error rate of 5.2% on a cluster composed of PlayStation3 nodes. The model captures the effects of application characteristics, input data sets, and cluster scale on performance. Furthermore, the model pinpoints optimal mappings of MPI applications to the PS3 cluster with remarkable accuracy.
iii) Using the cluster of PlayStation3 nodes, we analyze previously proposed user-level scheduling heuristics for co-scheduling threads (Chapter 5). We show that the co-scheduling algorithms yield significant performance improvements (1.7–2.7×) over the native OS scheduler in MPI applications. We also explore the trade-off between different co-scheduling policies that selectively spin or yield the host cores, based on runtime prediction of task execution lengths on the accelerator cores.
iv) We present a comparison between our PS3 cluster and an IBM QS20 blade cluster (based on the Cell/BE), illustrating that despite important limitations in computational ability and the communication substrate, the PS3 cluster is a viable platform for HPC research and development.

The rest of this chapter is organized as follows: Section 7.2 presents our experimental platform. Section 7.3 presents our performance analysis of the PS3 cluster. Section 7.4 presents the extended model of hybrid parallelism and its validation. Section 7.5 presents co-scheduling policies for clusters of asymmetric multi-core processors and evaluates these policies. Section 7.6 compares the PS3 cluster against an IBM QS20 Cell-blade cluster. Section 7.7 concludes the chapter.

7.2 Experimental Platform

Our experimental platform for this thesis is a cluster of 22 PS3 nodes, 8 of which were available to us in dedicated mode for the purposes of this work. The PS3 nodes are connected to a 1000BASE-T Gigabit Ethernet switch, which supports 96 Gbps switching capacity. Each PS3 runs Linux FC5 with kernel version 2.6.16, compiled for the 64-bit PowerPC architecture with platform-specific kernel patches for managing the heterogeneous cores of the Cell/BE. The nodes communicate with LAM/MPI 7.1.1. We used the IBM Cell SDK 2.1 for intra-Cell/BE parallelization of the MPI codes. The Linux kernel on the PS3 runs on top of a proprietary hypervisor. Though some devices are accessed directly, the built-in Gigabit Ethernet controller in the PS3 is accessed via hypervisor calls, therefore communication performance is not optimized.

7.3 PS3 Cluster Scalability Study

7.3.1 MPI Communication Performance

As we extend our MMGP model to the cluster of PlayStation3 machines, the cost of MPI calls becomes a more significant parameter of the prediction model than it is for a single machine. To study how the MMGP model scales in the new environment, we experimented with the real-world parallel computing applications PBPI and RAxML.

To measure communication performance on the PS3 cluster, we use mpptest [56]. We present mpptest results only for the two MPI communication primitives which dominate communication time in our application benchmarks. Figure 7.1 shows the overhead of MPI_Allreduce() with various message sizes. Each data point represents a number and a distribution of MPI processes between PS3 nodes. For any given number of PS3 nodes, we use 1 to 6 MPI processes, using shared memory for communication within the PS3. Our evaluation methodology stresses the impact of contention for communication bandwidth both within and across PS3 nodes. There is a benefit in exploiting shared memory for communication between MPI processes on each PS3. For example, collective operations between 8 processes running on 2 PS3 nodes are up to 30% faster than collective operations between 8 processes running across 8 PS3 nodes. However, there is also a noticeable penalty for oversubscribing each PPE with more than two processes, due to OS overhead, despite our use of blocking shared memory communication within each PS3. Similar observations can be made for point-to-point communication (Figure 7.2), although the effect of using shared memory within each PS3 is less pronounced and the effect of oversubscribing the PPE is more pronounced.
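For readers who want to reproduce this kind of number, the sketch below shows the shape of such a measurement; it is our minimal illustration, not the mpptest benchmark actually used, and the message size and iteration count are arbitrary.

```c
/* Minimal sketch of the kind of measurement reported in Figures 7.1 and 7.2:
 * time many MPI_Allreduce() calls on a buffer of doubles and report the mean
 * latency per call.  This is an illustration, not the mpptest benchmark
 * actually used for the measurements. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int count = 256;     /* number of doubles per reduction (arbitrary) */
    const int iters = 1000;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *in = malloc(count * sizeof(double));
    double *out = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) in[i] = 1.0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("MPI_Allreduce(%d doubles): %.1f usec per call\n",
               count, 1e6 * (t1 - t0) / iters);

    free(in); free(out);
    MPI_Finalize();
    return 0;
}
```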

7.3.2 Application Benchmarks

We evaluate the performance of two state-of-the-art phylogenetic tree construction codes, RAxML and PBPI, described in Chapter 3, on our PS3 cluster. The applications have been painstakingly optimized for the Cell BE, using vectorization, loop unrolling and tiling, branch optimizations, double buffering, and optimized numerical implementations of kernels utilizing fast single-precision arithmetic to implement double-precision operations. The optimization process is described in Chapter 4. Both RAxML and PBPI are capable of exploiting multiple levels of PPE and SPE parallelism. We used a task off-loading execution model in the codes. The execution commences on the PPE, and SPEs are used for accelerating computation-intensive loops. The off-loaded loops are parallelized across SPEs and vectorized within SPEs. The number of PPE processes and the number of SPEs per PPE process are user-specified.

Figure 7.1: MPI_Allreduce() performance on the PS3 cluster: (a) latency for a single double; (b) latency for arrays of doubles. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node.

Figure 7.2: MPI_Send()/MPI_Recv() latency on the PS3 cluster. Processes are distributed evenly between nodes. Each node runs up to 6 processes, using shared memory for communication within the node.

When the PPE is oversubscribed with more than two processes, the processes are scheduled using the event-driven task-level parallelism (EDTLP) scheduler described in Chapter 5. Each process executes until the point at which it off-loads an SPE task and then releases the PPE while waiting for the off-loaded task to complete. The same process resumes execution on the PPE only after all other processes have off-loaded SPE tasks at least once. The RAxML and PBPI ports on the PS3 cluster are adaptations of the original MPI codes and are capable of executing in a distributed environment. No algorithmic modifications have been applied to the applications.
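The off-load-and-yield control flow can be sketched as below. The task type and the two helpers are stand-ins of our own (the real scheduler posts work to an SPE and receives a completion signal, for example through a mailbox); only the pattern of off-loading and then yielding the PPE, rather than spinning, is taken from the description above, and the full fair-share policy is not shown.

```c
/* Sketch of the off-load-and-yield behavior of the EDTLP scheduler.
 * The task structure and the two helpers are stand-ins: in the real
 * scheduler, posting a task means sending its code and arguments to an SPE,
 * and completion is signalled back by the SPE.  Here the stub completes the
 * task immediately so the sketch is runnable; only the control flow
 * (off-load, then release the PPE instead of spinning) is the point. */
#include <sched.h>
#include <stdio.h>

struct spe_task { volatile int done; };

static void offload_task_to_spe(struct spe_task *t)
{
    /* Stand-in: the real code ships the task to an SPE and returns at once. */
    t->done = 1;
}

static int spe_task_done(const struct spe_task *t)
{
    return t->done;   /* real code: poll a completion mailbox/flag */
}

static void run_offloaded(struct spe_task *t)
{
    offload_task_to_spe(t);
    while (!spe_task_done(t))
        sched_yield();      /* release the PPE while the SPE computes */
}

int main(void)
{
    struct spe_task t = { 0 };
    run_offloaded(&t);
    printf("task completed\n");
    return 0;
}
```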

For each application we use three data sets, briefly termed small, medium, and large. The large data set occupies the entire memory of a PS3, minus memory used by the operating system and the hypervisor. The medium and small data sets occupy 40% and 15% of the free memory of the PS3 respectively. In PBPI the small, medium, and large data sets represent 218 species with 1250, 3000, and 5000 nucleotides respectively. In RAxML, the small, medium, and large datasets represent 42, 50, and 107 species respectively. We execute PBPI using weak scaling, i.e. we scale the data set as we add more PS3 nodes, which is the recommended execution mode. For RAxML we use strong scaling, since the application uses a master-worker paradigm, where each worker performs independent, parameterized phylogenetic tree bootstrapping and processes the entire tree independently. Workers are distributed between nodes to maximize throughput. We perform 192 bootstraps, which is a realistic workload for real-world phylogenetic analysis.

Figure 7.3 illustrates the measured execution times of RAxML and PBPI on the PS3 cluster. The predicted execution times on the same charts are derived from the extended MMGP model, which is discussed in Section 7.4. We make three observations regarding measured performance: i) The PS3 cluster scales well under strong scaling (RAxML) and relatively well under weak scaling (PBPI), for the problem sizes considered. PBPI is more communication-bound than RAxML, as it involves several collective operations between executions of its Markov-chain Monte Carlo kernel. We note that due to the hypervisor of the PS3 and the lack of Cell/BE-specific optimization of the MPI library we used, the performance measurements on the PS3 cluster are conservative. ii) The optimal layered decomposition of the applications is at the opposite ends of the optimization space. RAxML executes optimally if the PPE on each PS3 is oversubscribed by 6 MPI processes, each off-loading simultaneously on 1 SPE. PBPI generally executes optimally

91 500 Measured−L PBPI 450 Predicted−L 400 Measured−M Predicted−M 350 Measured−S Predicted−S 300 250 200

Execution Time (sec) 150 100 50

(1,1,6) (1,2,3) (1,3,2) (1,6,1) (2,1,6) (2,2,3) (2,3,2) (2,6,1) Configuration (3,1,6) (3,2,3) (3,3,2) (3,6,1) (4,1,6) (4,2,3) (N (4,3,2) (4,6,1) (5,1,6) , N (5,2,3) (5,3,2) (5,6,1) (6,1,6) , N (6,2,3) (6,3,2) ) (6,6,1) (7,1,6) (7,2,3) (7,3,2) (7,6,1) (8,1,6) (8,2,3) (8,3,2) (8,6,1) node process SPE 4 x 10 2.5 Measured−L RAxML Predicted−L 2 Measured−M Predicted−M Measured−S 1.5 Predicted−S

1 Execution Time (sec) 0.5

0

(1,1,6) (1,2,3) (1,3,2) (1,6,1) (2,1,6) (2,2,3) (2,3,2) (2,6,1) Configuration (3,1,6) (3,2,3) (3,3,2) (3,6,1) (4,1,6) (4,2,3) (N (4,3,2) (4,6,1) (5,1,6) , N (5,2,3) (5,3,2) (5,6,1) (6,1,6) , N (6,2,3) (6,3,2) ) (6,6,1) (7,1,6) (7,2,3) (7,3,2) (7,6,1) (8,1,6) (8,2,3) (8,3,2) (8,6,1) node process SPE

Figure 7.3: Measured and predicted performance of applications on the PS3 cluster. PBPI is executed with weak scaling. RAxML is executed with strong scaling. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.

92 with 1 MPI process per PS3 using all 6 SPEs for data-parallel computation, as this configuration reduces inter-PS3 communication volume and avoids PPE contention. iii) Although the optimal layered decomposition does not change with the problem size for the three data sets used in this experiment2, it changes with the scale of the cluster. When PBPI is executed with 8 SPEs, the optimal operating point of the code shifts from 1 to 2 MPI processes per node, each off-loading simultaneously on 3 SPEs. We have verified with an out-of-band experiment that this shift is permanent beyond 8 SPEs. This shift happens because of the large drop in the per process overhead of MPI Allreduce() (Figure 7.1), when 2 MPI processes are packed per node, on 3 or more PS3 nodes. This drop is large enough to outweigh the over- head due to contention between MPI processes on the PPE. The difficulty in experimentally discovering the hardware and software implications on the optimal mapping of applications to asymmetric multi-core clusters motivates the introduction of an analytical model presented in the next section.

7.4 Modeling Hybrid Parallelism

We present an analytical model of layered parallelism on clusters of asymmetric multi-core nodes, which is a generalization of the model of parallelism on stand-alone asymmetric multi-core processors (MMGP) presented in Chapter 6. Our generalized model captures computation and communication across nodes with host cores and acceleration cores. We specialize the model for the PS3 cluster to capture the overhead of non-overlapped DMA operations, wait times during communication operations in the presence of contention for bandwidth both within and across nodes, and non-overlapped scheduling overhead on the PPEs. In the rest of the section we present an overview of the MMGP model and discuss the extensions related to context-switch overhead and to on-chip and inter-node communication.

²The optimal decomposition does not change with the data set; however, as we show in Section 7.5, the optimal scheduling of an application may change with the data set.

We model the non-overlapped components of execution time on the Cell/BE's PPE and SPE, for single-threaded PPE code which off-loads to one SPE, as:

T = (Tppe + Oppe) + (Tspe + Ospe) (7.1)

where Tppe and Tspe represent non-overlapped computation, while Oppe and Ospe represent non-overlapped overhead on the PPE and SPE, respectively. We apply this model to each phase of parallel computation individually; phases are separated by collective communication operations.

7.4.1 Modeling PPE Execution Time

The overhead on the PPE includes instructions and DMA operations to off-load data and code to SPEs, and wait time for receiving synchronization signals from SPEs on the PPE.

Assuming that multiple PPE threads can simultaneously off-load computation, we introduce an additional factor for context switching overhead on the PPE. This factor depends on the thread scheduling algorithm on the PPE. In the general case, Oppe for code off-loaded from a single PPE thread to l SPEs is modeled as:

Oppe = l · Ooff−load + Tcsw(p) (7.2)

We assume that a single PPE thread off-loads to multiple SPEs sequentially and that the context switching overhead is a function of the number of threads co-executed on the PPE, which is denoted by p. Ooff−load is application-dependent and includes DMA setup overhead, which we measure with microbenchmarks. Tcsw depends on system software and includes the context switching overhead for p/C context switches, where C is the number of hardware contexts on the PPE. The overhead per context switch is also measured with microbenchmarks.

Figure 7.4: Four cases illustrating the importance of co-scheduling PPE threads and SPE threads. Threads labeled "P" are PPE threads, while threads labeled "S" are SPE threads. We assume that P-threads and S-threads communicate through shared memory. P-threads poll shared memory locations directly to detect if a previously off-loaded S-thread has completed. Striped intervals indicate yielding of the PPE, dark intervals indicate computation leading to a thread off-load on an SPE, light intervals indicate computation yielding the PPE without off-loading on an SPE. Stars mark cases of mis-scheduling.

If a hardware thread on the PPE is oversubscribed with multiple application threads, the computation time of each thread may increase due to on-chip resource contention. To accurately model this case, we introduce a scaling parameter α(p) for the PPE computation component, which depends on the number of threads co-executed on the PPE. The PPE component of the model therefore becomes α(p) · Tppe + Oppe. The factor α(p) is estimated using linear regression with one free parameter, the number of threads sharing a PPE hardware thread, and coefficients derived from training samples of Tppe taken during executions of a microbenchmark that oversubscribes the PPE with 3-6 threads and executes a parameterized ratio of computation to memory instructions.
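To make the estimation procedure concrete, the following is a minimal sketch of an ordinary least-squares fit of the PPE interference factor against the number of co-running threads. The sample values are placeholders rather than measured data, and the linear form alpha(p) = a + b*p is only one plausible choice for the regression.

#include <stdio.h>

/* Fit slowdown = a + b * p by ordinary least squares. */
static void fit_alpha(const double *p, const double *slowdown, int n,
                      double *a, double *b)
{
    double sp = 0, ss = 0, spp = 0, sps = 0;
    for (int i = 0; i < n; i++) {
        sp  += p[i];
        ss  += slowdown[i];
        spp += p[i] * p[i];
        sps += p[i] * slowdown[i];
    }
    *b = (n * sps - sp * ss) / (n * spp - sp * sp);
    *a = (ss - *b * sp) / n;
}

int main(void)
{
    /* Placeholder training samples: Tppe(p threads) / Tppe(1 thread),
       as measured with the oversubscription microbenchmark. */
    double p[]        = { 3.0, 4.0, 5.0, 6.0 };
    double slowdown[] = { 1.4, 1.7, 2.1, 2.4 };
    double a, b;

    fit_alpha(p, slowdown, 4, &a, &b);
    printf("alpha(p) ~ %.2f + %.2f * p\n", a, b);
    return 0;
}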

Figure 7.5: SPE execution. (Trec, Tsen - PPE-SPE communication stages; Tc - communication overhead; Tp - parallel computation; Ts - sequential computation.)

The formulation of Tppe derived thus far ignores additional wait time of threads on the PPE due to lack of co-scheduling between a PPE thread and an SPE thread off-loaded from it. This scenario arises when the PPE hardware threads are time-shared between application threads, as shown in Figure 7.4(a). Ideal co-scheduling requires accurate knowledge of the execution time of tasks on SPEs by both the operating system and the runtime system. This knowledge is not generally available. Our model assumes an idealized co-scheduling scenario: SPE tasks for a given phase of computation are assumed to be of the same execution length and are off-loaded in bundles with as many tasks per bundle as the number of SPEs on a Cell/BE. We also assume that the SPE execution time of the first task is long enough to allow for idealized co-scheduling, i.e., each PPE thread that off-loads a task is rescheduled on the PPE in time to immediately receive the signal from the corresponding finishing SPE task. We explore this scheduling problem under more realistic assumptions in Section 7.5 and propose solutions.

7.4.2 Modeling the Off-loaded Computation

Execution on SPEs is divided into stages, as shown in Figure 7.5. Tspe is modeled as:

Tspe = Tp + Ts (7.3)

Tp denotes the computation executed in parallel by more than one SPE; an example is a parallel loop distributed across SPEs. Ts denotes the part of the off-loaded computation that is inherently sequential and cannot be parallelized across SPEs.

When l SPEs are used for parallelization of off-loaded code, the Tspe term becomes:

Tspe = Tp / l + Ts    (7.4)

The accelerated execution on SPEs includes three more stages, shown in Figure 7.5. Trec and Tsen account for PPE-SPE communication latency, while Tc captures the SPE overhead that occurs when an SPE sends a message to or receives a message from the PPE. The per-byte latencies for Trec, Tsen, and Tc are application-independent and are obtained from microbenchmarks designed to stress PPE-SPE communication. Tp and Ts are application-dependent and are obtained from a profile of a sequential run of the application, annotated with directives that delimit the code off-loaded to SPEs.

7.4.3 DMA Modeling

Each SPE on the Cell/BE is capable of moving data between main memory and local storage while at the same time executing computation. To overlap computation and communication, applications use loop tiling and double buffering, which are illustrated in pseudocode in Figure 7.6.

1: DMA(Fetch Iteration 1, TAG1);
2: DMA_Wait(TAG1);

3: for( ... ){
4:    DMA(Fetch Iteration i+1, TAG1);
5:    compute(Iteration i);
6:    DMA_Wait(TAG1);
7:    DMA(Commit Iteration i, TAG2);
   }

8: DMA_Wait(TAG2);

Figure 7.6: Double buffering template for tiled parallel loops.
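For concreteness, the fragment below is a minimal SPE-side sketch of the same template, written against the MFC intrinsics from spu_mfcio.h in the IBM SDK (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The tile size, the kernel (a simple reduction, so no commit DMA is needed), and the address layout are illustrative and are not taken from RAxML or PBPI.

#include <spu_mfcio.h>
#include <stdint.h>

#define TILE  2048                      /* floats per tile (illustrative) */
#define TSIZE (TILE * sizeof(float))

static volatile float buf[2][TILE] __attribute__((aligned(128)));

/* ea: effective address of the input array in main memory (passed by the
   PPE); n_tiles: number of tiles assigned to this SPE. */
float sum_tiles(uint64_t ea, int n_tiles)
{
    float acc = 0.0f;

    /* Fetch tile 0; this DMA cannot overlap with any computation
       (lines 1-2 of the template in Figure 7.6). */
    mfc_get(buf[0], ea, TSIZE, 0, 0, 0);
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();

    for (int i = 0; i < n_tiles; i++) {
        int cur = i & 1, nxt = cur ^ 1;

        /* Prefetch tile i+1 under tag `nxt` while computing tile i (line 4). */
        if (i + 1 < n_tiles)
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * TSIZE, TSIZE, nxt, 0, 0);

        for (int j = 0; j < TILE; j++)   /* compute on tile i (line 5) */
            acc += buf[cur][j];

        /* Block until the prefetch of tile i+1 completes (line 6); on the
           last iteration the tag group is empty and this returns at once. */
        mfc_write_tag_mask(1 << nxt);
        mfc_read_tag_status_all();
    }
    return acc;
}

In MMGP terms, the initial fetch before the loop (and, in codes that write results back, the final commit after it) is exactly the non-overlapped DMA charged to TDMA in Equation 7.5.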

When double buffering is used, the off-loaded loop can be either computation bound or communication bound: if the amount of computation in a single iteration of the loop is sufficient to completely mask the latency of fetching the data needed for the next iteration, the loop is computation bound; otherwise, the loop is communication bound.

Note that a parallel off-loaded loop can be described using Equation 7.4, independently of whether the parallel part of the loop is computation or communication bound. In both cases, the loop iterations are assumed to be distributed evenly across SPEs, and blocking DMA accesses can be interspersed with computation in the loop. With double buffering, the DMA request used to fetch data for the first iteration, as well as the DMA request necessary to commit data to main memory after the last iteration, can neither be overlapped with computation nor distributed (lines 2 and 8 in Figure 7.6). We capture the effect of blocking and non-overlapped DMA in the model as:

Ospe = Trec + Tsen + Tc + TDMA (7.5)

The last term in Equation 7.5 is itemized into the blocking DMAs performed within loop iterations and the non-overlapped DMAs exposed when the loop is unrolled, tiled, and executed with double buffering. We use static analysis of the code to capture the DMA sizes.

7.4.4 Cluster Execution Modeling

We generalize our model of a single asymmetric multi-core processor to a cluster by introducing an inter-processor communication component as:

T = (Tppe + Oppe) + (Tspe + Ospe) + C (7.6)

We further decompose the communication term C into communication latency due to each distinct type of communication pattern in the program, including point-to-point and all-to-all communication. Assuming MPI as the programming model used to communicate across nodes or between address spaces within nodes, we use mpptest to estimate the MPI communication overhead for variable message sizes and communication primitives. The message sizes are captured by static analysis of the application code.
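As an illustration of how the extended model can be evaluated for a candidate decomposition, the sketch below combines Equations 7.2, 7.4, 7.5, and 7.6 for a single phase. The parameter values, the struct and function names, and the simple p/C form of the context-switch term are assumptions standing in for the microbenchmark, mpptest, and profiling inputs described above; the contention factor is also held fixed here, although α(p) varies with p.

#include <stdio.h>

/* Per-phase model inputs; in practice these come from microbenchmarks,
   mpptest, and a profile of a sequential run (values below are placeholders). */
typedef struct {
    double T_ppe, T_p, T_s;            /* application computation terms       */
    double O_offload, t_switch;        /* per-offload and per-switch overhead */
    double T_rec, T_sen, T_c, T_dma;   /* PPE-SPE and DMA overheads           */
    double C_comm;                     /* inter-process communication term    */
    double alpha;                      /* PPE contention factor alpha(p)      */
    int    hw_contexts;                /* C: hardware contexts on the PPE     */
} mmgp_params;

/* Predicted phase time for p PPE processes per node, each using l SPEs. */
static double mmgp_predict(const mmgp_params *m, int p, int l)
{
    double O_ppe = l * m->O_offload +
                   m->t_switch * ((double)p / m->hw_contexts);        /* Eq. 7.2 */
    double T_spe = m->T_p / l + m->T_s;                               /* Eq. 7.4 */
    double O_spe = m->T_rec + m->T_sen + m->T_c + m->T_dma;           /* Eq. 7.5 */
    return (m->alpha * m->T_ppe + O_ppe) + (T_spe + O_spe) + m->C_comm; /* Eq. 7.6 */
}

int main(void)
{
    mmgp_params m = { 0.8, 6.0, 0.3, 0.002, 0.004,
                      0.01, 0.01, 0.02, 0.05, 0.5, 1.2, 2 };
    /* Enumerate layered decompositions that use all 6 SPEs of a PS3. */
    int layouts[][2] = { {1, 6}, {2, 3}, {3, 2}, {6, 1} };
    for (int i = 0; i < 4; i++)
        printf("(%d procs, %d SPEs each): %.3f s\n",
               layouts[i][0], layouts[i][1],
               mmgp_predict(&m, layouts[i][0], layouts[i][1]));
    return 0;
}

In the full model, such per-phase predictions are accumulated across the application's phases and across the communication patterns that make up C, yielding the per-configuration estimates plotted in Figure 7.3.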

7.4.5 Verification

We verify our model by exhaustively executing PBPI and RAxML on all feasible layered decompositions that use 1 to 6 PPE threads, 1 to 6 SPEs per PPE, and up to 8 PS3 nodes. Figure 7.3(a),(b) illustrates that the model is accurate both in terms of predicting execution time and in terms of discovering optimal application decompositions and mappings for different cluster scales and data sets. The optimal decomposition may vary across multiple dimensions, including application characteristics, such as the granularity of off-loaded tasks and the frequency and size of communication and DMA operations, the size and structure of the data set used in the application, and the number of nodes available to the application for execution. Accurate modeling of the application under each scenario is valuable to tame the complexity of discovering the optimal decomposition and mapping experimentally. In our test cases, the model achieves error rates consistently under 15%; the mean error rate is 5.2%. The errors tend to be higher when the PPE is oversubscribed with a large number of processes, due to error in estimating the thread interference factor. For any given application, data set, and number of PS3 nodes, the model accurately predicts the optimal configuration and mapping in all 48 test cases.

7.5 Co-Scheduling on Asymmetric Clusters

Although our model projects optimal mappings of MPI applications on the PS3 cluster with high accuracy, it is oblivious to the implications of user-level and kernel-level scheduling on oversubscribed cores. More specifically, the model ignores cases in which PPE threads and SPE threads are not co-scheduled when they need to synchronize through shared memory. We explore user-level co-scheduling solutions to this problem.

The main objective of co-scheduling is to minimize slack time on SPEs, since SPEs bear the brunt of the computation in practical cases. This slack is minimized when, whenever a thread off-loaded to an SPE needs to communicate or synchronize with its originating thread on the PPE, the originating thread is running on a PPE hardware context.

As illustrated in Figure 7.4, different scheduling policies can have a significant impact on co-scheduling, slack, SPE utilization, and ultimately performance. In Figure 7.4(a), PPE threads spin while waiting for the corresponding off-loaded threads to return results from the SPEs. The time quantum allocated to each PPE thread by the OS can cause continuous mis-scheduling of PPE threads with respect to SPE threads.

In Figure 7.4(b), the user-level scheduler uses a yield-if-not-ready policy, which forces each PPE thread to yield the processor whenever a corresponding off-loaded SPE thread is pending completion. This policy can be implemented at user level by having PPE threads poll shared-memory flags that matching SPE threads set upon completion. Figure 7.7 illustrates the performance of this policy in PBPI and RAxML on a PS3 cluster, when the PPE on each node is oversubscribed with 6 MPI processes, each off-loading on 1 SPE (recall that the PPE is a two-way SMT processor). The results show that, compared to a scheduling policy which is oblivious to PPE-SPE co-scheduling (the native Linux scheduling policy), yield-if-not-ready achieves a performance improvement of 1.7-2.7× on a cluster composed of PS3 nodes.

Figure 7.7: Performance of the yield-if-not-ready policy and the native Linux scheduler in PBPI and RAxML. x-axis notation: Nnode - number of nodes, Nprocess - number of processes per node, NSPE - number of SPEs per process.

Yield-if-not-ready bounds the slack by the time needed to context switch across p - 1 PPE threads, where p is the total number of active PPE threads, but it can still cause temporary mis-scheduling and slack, as shown in Figure 7.4(c). Figure 7.4(d) illustrates an adaptive spinning policy, in which a thread either spins or yields the processor based on which thread is anticipated to off-load the soonest on an SPE. This policy uses a prediction which can be derived with various algorithms, the simplest of which uses the execution length of the most recently off-loaded task from any given thread as a predictor of the earliest time that the same thread will be ready to off-load in the future. The thread spins if it anticipates that it will be the first to off-load; otherwise it yields the processor.
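A minimal user-level sketch of the yield-if-not-ready policy is shown below. The completion flag is assumed to be a shared-memory location that the matching SPE thread sets when its task finishes; the names are illustrative, and the adaptive variant would replace the unconditional sched_yield() with a spin-or-yield decision based on the predicted next off-load times.

#include <sched.h>

/* Completion flag in shared memory, set to 1 by the matching SPE thread
   when the off-loaded task finishes (name and layout are illustrative). */
extern volatile unsigned int *spe_done;

/* Called on the PPE right after a task has been off-loaded. */
void wait_for_offload(void)
{
    while (*spe_done == 0)
        sched_yield();      /* not ready: give the PPE to another process */
    *spe_done = 0;          /* re-arm the flag for the next off-load      */
}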

Although the aforementioned adaptive policy can reduce accelerator slack compared to the yield-if-not-ready policy, it is still suboptimal, as it may mis-schedule threads due to variations in the execution lengths of consecutive tasks off-loaded by the same thread, or variations in the run lengths between any two consecutive off-loads on a PPE thread. We should also note that better policies, with tighter bounds on the maximum slack, can be obtained if the user-level scheduler is not oblivious to the kernel-level scheduler and vice versa. We devise and implement such policies in Chapter 8.

Figure 7.8 illustrates results when RAxML and PBPI are executed with various co-scheduling policies. Both applications are executed with variable sequence lengths (x-axis), hence variable SPE task sizes. In RAxML (Figure 7.8(a)), adaptive spinning performs better for small data sets, while yield-if-not-ready performs better for large data sets. In PBPI (Figure 7.8(b)), adaptive spinning outperforms yield-if-not-ready in all cases. In RAxML, the variance in the length of the off-loaded tasks increases with the size of the input sequence, which causes more mis-scheduling when the adaptive policy is used. In PBPI, the task length does not vary, which enables nearly optimal co-scheduling by the adaptive spinning policy. In general, the best co-scheduling algorithm can improve performance by more than 10%. We emphasize that the optimal co-scheduling policy changes with the data set; therefore, support for flexible co-scheduling algorithms in system software is essential on the PS3 cluster.

Figure 7.8: Performance of different scheduling strategies in PBPI and RAxML. x-axis: length of the input DNA sequence.

7.6 PS3 versus IBM QS20 Blades

We compare the performance of the PS3 cluster to a cluster of IBM QS20 dual-Cell/BE blades located at Georgia Tech. The Cell/BE processors on the QS20 have 8 active SPEs, and there may be other undisclosed microarchitectural differences. Furthermore, although both the QS20 cluster and the PS3 cluster use GigE, communication latencies tend to be markedly lower on the QS20 cluster, first due to the absence of a hypervisor, which is a communication bottleneck on the PS3 cluster, and second due to the exploitation of shared-memory communication between the two Cell/BE processors on each QS20, as opposed to the single Cell/BE processor on each PS3.

We present selected experimental data points where the two platforms use the same number of Cell processors. On the QS20 cluster, we use both Cell processors per node. Figure 7.9 illustrates the execution times of PBPI and RAxML on the two platforms. We report the execution time of the most efficient pair of application configuration and co-scheduling policy on any given number of Cell processors.

We observe that the performance of the PS3 cluster is reasonably close (within 14% to 27% for PBPI and 11% to 13% for RAxML) to the performance of the QS20 cluster. The difference is attributed to the reduced number of active SPEs per processor on the PS3 cluster (6 versus 8 on the QS20 cluster) and to the faster communication on the QS20 cluster. The difference between the two platforms is smaller for RAxML than for PBPI, as RAxML is not as communication-intensive.

Interestingly, if we compare data points with the same total number of SPEs (48 SPEs on 8 PS3s versus 48 SPEs on 6 QS20s), in RAxML the PS3 cluster outperforms the QS20 blade cluster.

Figure 7.9: Comparison between the PS3 cluster and an IBM QS20 cluster. x-axis: number of Cell processors; y-axis: execution time (sec).

This result does not indicate superiority of the PS3 hardware or system software, as we apply experimentally defined optimal decompositions and scheduling policies on both platforms. It rather indicates the implications of layered parallelization. Oversubscribing the QS20 with 8 MPI processes (versus 6 on the PS3) introduces significantly higher scheduling overhead and brings performance below that of the PS3. This result stresses our earlier observations on the necessity of models and better schedulers for asymmetric multi-core clusters.

7.7 Chapter Summary

We evaluated a very low-cost HPC cluster based on PS3 consoles and proposed a model of asymmetric parallelism and software support for orchestrating asymmetric parallelism extracted from MPI programs on the PS3 cluster. While the Sony PlayStation 3 has several limitations as an HPC platform, including limited storage and limited support for advanced networking, it has sufficient computational power, compared to vastly more expensive multi-processor blades, to form a solid experimental testbed for research on programming and runtime support for asymmetric multi-core clusters, before migrating software and applications to production-level asymmetric machines, such as the LANL RoadRunner. The model presented in this chapter accurately captures heterogeneity in the computation and communication substrates and helps the user or the runtime environment map layered parallelism effectively to the target architecture. The co-scheduling heuristics presented in this thesis increase parallelism and minimize slack on computational accelerators.

Chapter 8

Kernel-Level Scheduling

8.1 Introduction

The ideal scheduling policy, which minimizes the context-switching overhead, assumes that whenever an SPE communicates with the PPE, the corresponding PPE thread is scheduled and running on the PPE. In Chapter 7 we discussed the possibility of predicting the next-to-run thread on the PPE. We implemented a prototype of a scheduling strategy capable of predicting which process will be the next to run, and the results imply that predicting the next thread to run may be difficult, especially if the off-loaded tasks exhibit high variance in execution time.

As another approach to minimizing the context-switching overhead on the PPE, we investigate a user-level scheduler which is capable of influencing kernel scheduling decisions. We explore scheduling strategies in which the scheduler can decide not only when a process should release the PPE, but also which process will run next on the PPE. By reducing the response time related to scheduling on the PPE, our new approach also reduces the idle time that occurs on the SPE side while waiting for a new off-loaded task. We call our new scheduling strategy the Slack-minimizer Event-Driven Scheduler (SLEDS).

Besides improving the overall performance, the new scheduling strategy enables more accurate performance modeling. Although the MMGP model projects the most efficient mappings of MPI applications on the Cell processor with high accuracy, it is oblivious to the implications of user-level and kernel-level scheduling on oversubscribed cores. More specifically, the model ignores cases in which PPE threads and SPE threads are not appropriately co-scheduled. A scheduling policy in which the PPE threads are not arbitrarily scheduled by the OS scheduler introduces more regularity in the application execution and consequently improves the MMGP predictions.

Figure 8.1: Upon completing their assigned tasks, the SPEs send a signal to the PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE.

8.2 SLED Scheduler Overview

The SLED scheduler is invoked through user-level library calls which can easily be integrated into existing Cell applications. An overview of the SLED scheduler is illustrated in Figure 8.1. Each SPE thread, upon completing its assigned task, sends its own pid to the shared ready to run list, from where this information is further passed to the kernel. Using the knowledge of which SPE threads have finished processing their assigned tasks, the kernel can decide which process will run next on the PPE.
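One way the SPE side of this signal could look is sketched below, assuming each SPE thread knows the effective address of its exclusive entry in the shared list (passed in by the PPE at start-up) and that both the local copy and the entry satisfy the alignment rule for small DMAs; this is an illustration, not the dissertation's actual code.

#include <spu_mfcio.h>
#include <stdint.h>

/* Local-storage copy of the owning process's pid. Small DMA transfers
   require the local and effective addresses to share their low 4 bits,
   so both sides are kept 16-byte aligned. */
static volatile uint32_t my_pid __attribute__((aligned(16)));

/* slot_ea: effective address of this SPE's entry in the ready_to_run list. */
void signal_ready_to_run(uint64_t slot_ea, uint32_t pid)
{
    my_pid = pid;
    mfc_put(&my_pid, slot_ea, sizeof(my_pid), 0, 0, 0);  /* 4-byte DMA put  */
    mfc_write_tag_mask(1 << 0);
    mfc_read_tag_status_all();    /* ensure the pid is visible in memory   */
}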

Although it is invoked through a user-level library, part of the scheduler resides in the kernel. Therefore, the implementation of the SLED scheduler can be vertically divided into two distinguishable parts:

1. The user-level library, and

2. The kernel code that accepts and processes the user-level information, which is then used in making kernel-level scheduling decisions.

Figure 8.2: Vertical overview of the SLED scheduler. The user-level part contains the ready-to-run list, shared among the processes, while the kernel part contains the system call through which the information from the ready-to-run list is passed to the kernel.

Passing the information from the ready to run list to the kernel can be achieved in two ways:

• The information from the list can be read by the processes running on the PPE, and the information can be passed to the kernel through a system call, or

• The ready to run list can be visible to the kernel and the kernel can directly read the information from the list.

In the current study we follow the first approach, where the information is passed to the kernel through a system call (see Figure 8.2). Placing the ready to run list inside the kernel will be the subject of our future research. In the current implementation of the SLED scheduler, the size of the list is constant and equal to the total number of SPEs available on the system. Each SPE is assigned an entry in the list. We describe the organization of the ready to run list further in the following section.

8.3 ready to run List

In the current implementation of the IBM SDK for the Cell BE, the local storage of every SPE is memory-mapped into the address space of the process which has spawned the SPE thread. Using DMA requests, a running SPE thread is capable of accessing the global memory of the system. However, these accesses are restricted to the areas of main memory that belong to the address space of the corresponding PPE thread. Therefore, if SPE threads do not belong to the same process, the only possibility of sharing a data structure among them is through globally shared memory segments residing in main memory.

The ready to run list needs to be a shared structure accessible by all SPE threads (even if they belong to different processes). Therefore, it is implemented as a part of a global shared memory region. The shared memory region is attached to each process at the beginning of execution.
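A possible way to set up such a region, assuming POSIX shared memory (the dissertation does not specify the mechanism) and one slot per SPE so that no locking is needed, is sketched below; the object name and structure layout are illustrative.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_SPES 6                     /* PS3: six usable SPEs */

/* One exclusive slot per SPE thread; 0 means "nothing ready". */
typedef struct {
    volatile pid_t slot[NUM_SPES];
} ready_to_run_t;

/* Every MPI process attaches the same named region at start-up. */
ready_to_run_t *attach_ready_to_run(void)
{
    int fd = shm_open("/sleds_ready_to_run", O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    ftruncate(fd, sizeof(ready_to_run_t));
    return mmap(NULL, sizeof(ready_to_run_t), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

The PPE process can then hand each of its SPE threads the effective address of that SPE's own slot, so SPE threads belonging to different processes all write into the same structure without sharing entries.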

8.3.1 ready to run List Organization

An initial observation suggests that the ready to run list should be organized in FIFO order, i.e., the process corresponding to the SPE thread which was the first to finish processing a task should also be the first to run on the PPE. Nevertheless, a strict FIFO organization of the scheduler might cause certain problems. Consider the situation where PPE process A has off-loaded, but the granularity of the off-loaded task is relatively small, and the SPE execution finishes before process A has had a chance to yield on the PPE side. If process B is in the ready to run list waiting to be scheduled, then, if the FIFO order is strictly followed, process A will yield the PPE and process B will be scheduled to run on the PPE. In the described scenario, strict FIFO scheduling causes an extra context switch to occur (there is no need for process A to yield the PPE to process B).

Therefore, the SLED scheduler is not designed as a strictly FIFO scheduler. Instead, after off-loading and before yielding, the process checks whether its off-loaded task is still executing. If the SPE task has finished executing, instead of yielding, the PPE process simply continues executing.

Under the described soft-FIFO policy, it is possible that a process (call it A) does not yield the PPE upon off-loading, while at the same time its off-loaded task, upon completion, writes the pid of process A to the ready to run list. Because the pid is written to the list, at some point process A will be scheduled by the SLED scheduler. However, when scheduled by the SLED scheduler, process A might not have anything useful to process, since it did not yield upon off-loading. To avoid this situation, a process which decides not to yield upon off-loading also needs to clear the field in the ready to run list that has been filled with its own pid. Since multiple processes require simultaneous read/write access to the list, maintaining the list in a consistent state would require locks, which can introduce significant overhead.

Instead of allowing processes to access any field in the ready to run list, we found it more efficient to assign each process an exclusive entry in the list. By not allowing processes to share entries in the ready to run list, we avoid any locking, which significantly reduces the overhead of maintaining the list in a consistent state.
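Put together, the soft-FIFO check a process performs after off-loading might look like the following sketch. The completion flag, the list pointer, and my_slot are illustrative names, and _yield() stands for the list-scanning yield routine shown later in Figure 8.7; this is not the dissertation's actual code.

#include <sys/types.h>

/* done: completion flag set by this process's SPE thread (cf. the stop
   field in Figure 8.7); ready_to_run: this CPU's list; my_slot: this
   process's exclusive entry. */
extern volatile unsigned int *done;
extern volatile pid_t *ready_to_run;
extern int my_slot;
extern void _yield();                /* list-scanning yield, Figure 8.7 */

void after_offload(void)
{
    if (*done) {
        /* Task already finished: keep running on the PPE, and clear the
           entry our SPE wrote so we are not scheduled needlessly later. */
        ready_to_run[my_slot] = 0;
    } else {
        _yield();
    }
}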

8.3.2 Splitting ready to run List

The ready to run list serves as a buffer through which the necessary off-loading-related information is passed to the kernel. Initially, the SLED scheduler was designed to use a single ready to run list. However, in certain cases the single-list design forced the SLED scheduler to perform process migration across the PPE execution contexts.

Consider the situation described in Figure 8.3. The off-loaded task which belongs to process P1 has finished processing on the SPE side (the pid of process P1 has been written to the ready to run list). Process P1 is bound to CPU1, but process P2, which is running on CPU2, off-loads and initiates the context switch by passing the pid of process P1 to the kernel. Since the context switch occurred on CPU2 and P1 is bound to run on CPU1, the kernel needs to migrate process P1 to CPU2. Initially, we implemented a system call which performs the process migration; the design of this system call is outlined in Figure 8.4. The essential step in this system call is Line 9, where the sched_migrate_task() function is invoked. This is a kernel function which accepts two parameters: the task to be migrated and the destination cpu to which the task should migrate.

Figure 8.3: Process P1, which is bound to CPU1, needs to be scheduled to run by the scheduler that was invoked on CPU2. Consequently, the kernel needs to migrate process P1 from CPU1 to CPU2.

The described process migration has some drawbacks. Specifically, the sched_migrate_task() function might be expensive due to the required locking of the run queues, and it can also create an uneven distribution of processes across the available cpus. To avoid the drawbacks caused by the sched_migrate_task() function, we redesigned the ready to run list. Instead of having a single ready to run list shared among all processes in the system, we assign one ready to run list to each execution context on the PPE. In this way, only processes sharing an execution context access the same ready to run list. This mechanism is presented in Figure 8.5. With the separate ready to run lists there is no longer a need for expensive task migration, and we also avoid possible load imbalance on the PPE processor.

1: void migrate(pid_t next){
2:    struct task_struct *p; struct rq *rq_p;
      int this_cpu, p_cpu;          /* next: pid read from the ready_to_run list */

3:    p = find_process_by_pid(next);
4:    if (p){
5:       rq_p = task_rq(p);
6:       this_cpu = smp_processor_id();
7:       p_cpu = task_cpu(p);
8:       if (p_cpu != this_cpu && p != rq_p->curr){
9:          sched_migrate_task(p, this_cpu);
10:      }

11:      SLEDS_yield(next); ...

12:   }
   }

Figure 8.4: System call for migrating processes across execution contexts. The sched_migrate_task() function performs the actual migration. The SLEDS_yield() function schedules the process to be the next to run on the CPU.

8.4 SLED Scheduler - Kernel Level

The standard scheduler used in the Linux kernel, starting from version 2.6.23, is the Completely Fair Scheduler (CFS). CFS implements a simple algorithm based on the idea that at any given moment in time, the CPU should be evenly divided across all active processes in the system. While this is a desirable theoretical goal, in practice it cannot be achieved, since at any moment in time the CPU can serve only one process. For each process in the system, CFS records the amount of time that the process has been waiting to be scheduled on the CPU. Based on the amount of time spent waiting to be scheduled and the number of processes in the system, as well as the static priority of the process, each process is assigned a dynamic priority. The dynamic priority of a process is used to determine when and for how long the process will be scheduled to run.

The structure used by CFS for storing the active processes is a red-black tree. The processes are stored in the nodes of the tree, and the process with the highest dynamic priority (which will be the first to run on the CPU) is stored in the leftmost node of the tree.

Figure 8.5: The ready to run list is split in two parts. Each of the two sublists contains the processes that share an execution context (CPU1 or CPU2). This approach avoids any possibility of expensive process migration across the execution contexts.

The SLED scheduler passes the information from the ready-to-run list to the kernel through the SLEDS_yield() system call. SLEDS_yield() extends the standard sched_yield() system call by accepting an integer parameter, pid, which identifies the process that should run next. A high-level overview of the SLEDS_yield() function is given in Figure 8.6(a)-(c) (assuming that the passed pid parameter is different from zero). First, the process which should run next is pulled out of the running tree, and its static priority is increased to the maximum value. The process is then returned to the running tree, where it is stored in the leftmost node (since it has the highest priority). After being returned to the tree, the static priority of the process is decreased back to its normal value. Besides increasing the static priority of the process, we also increase the time that the process is allowed to run on the CPU. Increasing the CPU time is important, because if a process is artificially scheduled to run many times, it might exhaust all the CPU time that it was assigned by the Linux scheduler.
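From user space, the new system call might be reached through a thin wrapper such as the one below; the syscall number is purely illustrative and depends on how the patched kernel assigns it.

#define _GNU_SOURCE            /* for syscall() on glibc */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical number assigned to the new call in the patched 2.6.23 kernel. */
#define __NR_SLEDS_yield 324

/* Ask the kernel to run `next` as soon as we yield; next == 0 degenerates
   to an ordinary yield. */
static inline long SLEDS_yield(pid_t next)
{
    return syscall(__NR_SLEDS_yield, next);
}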

Figure 8.6: Execution flow of the SLEDS_yield() function: (a) the appropriate process is found in the running list (tree), (b) the process is pulled out of the list and its priority is increased, (c) the process is returned to the list, and since its priority is increased it is stored at the leftmost position.

In that case, although we are capable of scheduling the process to run on the CPU using the SLEDS_yield() function, the process will almost immediately be switched out by the kernel. Before it exits, the SLEDS_yield() function calls the kernel-level schedule() function, which initiates the context switch.

We measured the overhead in the SLEDS_yield() system call caused by the operations performed on the running tree. We found that SLEDS_yield() incurs an overhead of approximately 8% compared to the standard sched_yield() system call.

8.5 SLED Scheduler - User Level

Figure 8.7 outlines the part of the SLED scheduler which resides in user space. Upon off-loading, the process is required to call the SLEDS_Offload() function (Figure 8.7, Line 13). This function polls a member of the structure signal in order to check whether the SPE has finished processing the off-loaded task. The structure signal resides in the local storage of an SPE, and the process executing on the PPE knows the address of this structure and uses it to access its members. While the SPE task is running, the stop field of the structure signal is equal to zero; upon completion of the off-loaded task, the value of this field is set to one.

If the SPE has not finished processing the off-loaded task, the SLEDS_Offload() function calls the _yield() function (Figure 8.7, Line 15). The _yield() function scans the ready to run list searching for an SPE that has finished processing its assigned task (Figure 8.7, Lines 3-10). Two interesting things can be noticed in the function _yield(). First, the function scans only three entries in the ready to run list. The reason for this is that the list is divided among the execution contexts on the PPE, as described in Section 8.3.2. Since the presented version of the scheduler is adapted to the PlayStation3 (which contains a Cell processor with only 6 SPEs), each ready to run list contains only 3 entries. Second, the list is scanned at most N times (see Figure 8.7, Line 3), after which the process is forced to yield. If the N parameter is relatively large, repeated scanning of the ready to run list becomes harmful to the process executing on the adjacent PPE execution context. However, if the parameter N is not large enough, the process might yield before having a chance to find the next-to-run process. Although the results presented in Figure 8.8 show that the execution time of RAxML depends on N, a theoretical model capable of describing this dependence will be the subject of our future work. Currently, for RAxML we chose N to be 300, as this is the value which achieves the most efficient execution in our test cases (Figure 8.8). For PBPI we did not see any variance in execution times for values of N smaller than 1000; when N is larger than 1000, the performance of PBPI decreases due to contention caused by scanning the ready to run list.

1: void _yield(){
2:    int next = 0, i, j = 0;

3:    while(next == 0 && j < N){
4:       i = 0;
5:       j++;
6:       while(next == 0 && i < 3){
7:          next = ready_to_run[i];
8:          i++;
9:       }
10:   }

11:   SLEDS_yield(next);

12: }

13: void SLEDS_Offload(){

14:   while (((struct signal *)signal)->stop == 0){
15:      _yield();
16:   }
17: }

Figure 8.7: Outline of the SLED scheduler: upon off-loading, a process is required to call the SLEDS_Offload() function. SLEDS_Offload() checks whether the off-loaded task has finished (Line 14) and, if not, calls the _yield() function. _yield() scans the ready to run list and yields to the next process by executing the SLEDS_yield() system call.

8.6 Experimental Setup

To test the SLED scheduler we used the Cell processor built into the PlayStation3 console. As the operating system, we used a variant of Linux kernel version 2.6.23, specially adapted to run on the PlayStation3. We also modified the kernel by introducing the system calls necessary for the SLED scheduler. We used SDK 2.1 to execute our applications on the Cell.

Figure 8.8: Execution times of RAxML when the ready to run list is scanned between 50 and 1000 times. The x-axis represents the number of scans of the ready to run list; the y-axis represents the execution time. Note that the lowest value on the y-axis is 12.5 seconds, and the difference between the lowest and the highest execution time is 4.2%. The input file contains 10 species, each represented by 1800 nucleotides.

8.6.1 Benchmarks

In this section we describe the benchmarks used to test the performance of the SLED scheduler. We compared the SLED scheduler to the EDTLP scheduler using microbenchmarks and the real-world bioinformatics applications RAxML and PBPI.

The microbenchmarks we used are designed to imitate the behavior of real applications utilizing the off-loading execution model. Using the microbenchmarks, we aimed to determine the dependence of the context-switch overhead on the size of the off-loaded tasks.

8.6.2 Microbenchmarks

The microbenchmarks we designed are composed of multiple MPI processes, and each process uses an SPE for task off-loading. The tasks in each process are repeatedly off-loaded inside a loop which iterates 1,000,000 times. The part of the process executed on the PPE only initiates task off-loading and waits for the off-loaded task to complete. The off-loaded task executes a loop which may vary in length. In our experiments we oversubscribe the PPE with 6 MPI processes.

Figure 8.9: Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time is measured as the length of the off-loaded tasks is increased.
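The SPE side of such a microbenchmark can be as simple as the sketch below: a busy loop whose trip count sets the task length, followed by raising the completion flag that the PPE polls. The structure mirrors the signal/stop convention of Figure 8.7, but the names and counts are illustrative rather than taken from our actual microbenchmark code.

#include <stdint.h>

/* Completion flag polled by the PPE through the memory-mapped local store. */
volatile struct signal { uint32_t stop; } sig __attribute__((aligned(16)));

/* One off-loaded task: `iters` controls the task length reported on the
   x-axis of Figures 8.9 and 8.10. */
void spe_task(uint64_t iters)
{
    volatile uint64_t sink = 0;
    for (uint64_t i = 0; i < iters; i++)
        sink += i;                     /* synthetic work                    */
    sig.stop = 1;                      /* tell the PPE the task is finished */
}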

We compare the performance of the microbenchmarks using the SLED and EDTLP schedulers. Figure 8.9 represents the total execution time of the microbenchmarks when they are executed with different lengths of the off-loaded tasks. For large task sizes the SLED scheduler outperforms EDTLP by up to 17%. However, when the size of the off-loaded task is relatively small, the EDTLP scheme outperforms the SLED scheme by up to 29%, as represented in Figure 8.10.

We use the example presented in Figure 8.11 to explain the behavior of the EDTLP and SLED schemes for small task sizes. Assume that 3 processes, P1, P2, and P3, are oversubscribed on the PPE. In the EDTLP scheme (Figure 8.11, EDTLP), upon off-loading, P1 yields and the operating system decides which process should run next on the PPE. Since process P1 was the first to off-load and yield, it is unlikely that the same process will be scheduled again until all other processes have off-loaded and been switched out from the PPE. If the size of the off-loaded task is relatively small, by the time process P1 gets scheduled to run again on the PPE, the off-loaded task will already have completed and process P1 can immediately continue running on the PPE.

Figure 8.10: Comparison of the EDTLP and SLED schemes using microbenchmarks: total execution time as the length of the off-loaded tasks is increased, zooming in on small task sizes (up to 21 µs on the x-axis).

Consider now the situation represented in Figure 8.11 (SLED), when the SLED scheduler is used for scheduling processes with small off-loaded tasks. Due to the complexity introduced by the SLED scheduler, the time necessary for the context switch to complete is increased. Consequently, the time interval for process P1 between off-loading and the next opportunity to run on the PPE increases. Based on this analysis, we can conclude that for scheduling processes with relatively fine-grain off-loaded tasks (the execution time of a task is shorter than 15 µs), it is more efficient to use the EDTLP scheme than the SLED scheme.

For coarser task sizes (the execution time of a task is longer than 15 µs), the SLED scheme almost always outperforms the EDTLP scheme. The exceptions are certain task sizes which are exact multiples of the scheduling interval, as can be seen in Figure 8.9. The scheduling interval is the time after which a process, having off-loaded, is scheduled to run again on the PPE. For these specific task sizes, the processes are ready to run at the exact moment when they get scheduled on the PPE using only the EDTLP scheme. We point out that these situations are rare, and in the real applications described in Section 8.6.3 and Section 8.6.4 we did not observe this behavior.

Figure 8.11: EDTLP outperforms SLED for small task sizes due to the higher complexity of the SLED scheme.

Figure 8.12: Comparison of the EDTLP scheme and the combination of the SLED and EDTLP schemes using microbenchmarks. EDTLP is used for task sizes smaller than 15 µs.

121 30

SLEDS+EDTLP 25 EDTLP SLEDS

20

15

10 Execution Time (s)

5

0 6 7 8 9 10 11 12 13 13 14 15 16 17 18 19 20 20 21 Task Length (us)

Figure 8.13: Comparison of the EDTLP scheme and the combination of SLED and EDTLP schemes using microbenchmarks. EDTLP is used for the task sizes smaller than 15µs – task size is limited to 2.µs.

To address the issues related to small task sizes (when EDTLP outperforms SLED), we combined the two schemes into a single scheduling policy. The EDTLP scheme is used when the size of the off-loaded tasks is smaller than 15 µs. The results of the combined scheme are presented in Figure 8.12 and Figure 8.13.

8.6.3 PBPI

We also compared the performance of the two schemes, EDTLP and SLED, using the PBPI application. As an input file for PBPI we used a data set that contains 107 species, and we varied the length of the DNA sequence that represents the species. In the PBPI application, the length of the input DNA sequence is directly related to the size of the off-loaded tasks. We varied the length of the DNA sequence from 200 to 5,000 nucleotides. Figure 8.14 shows the execution time of PBPI when the EDTLP and SLED scheduling schemes are used. In all experiments the configuration for PBPI was 6 MPI processes, and each process was assigned an SPE for off-loading the expensive computation. As in the previous example, EDTLP outperforms the SLED scheme for small task sizes. Again we combined the two schemes, EDTLP for task sizes smaller than 15 µs and SLED for larger task sizes, and we present the obtained performance in Figure 8.15. The combined scheme consistently outperforms the EDTLP scheduler, and the highest difference we recorded is 13%.

Figure 8.14: Comparison of the EDTLP and SLED schemes using the PBPI application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

8.6.4 RAxML

We executed RAxML with an input file that contained 10 species in order to compare the EDTLP and SLED schedulers. As in the PBPI case, we varied the length of the input DNA sequence, since the size of the input sequence is directly related to the size of the off-loaded tasks. The length of the sequence in our experiments was between 100 and 5,000 nucleotides. In the case of RAxML, SLED outperforms EDTLP by up to 7%. As in the previous experiments, for relatively small task sizes the EDTLP scheme outperforms the SLED scheme, as represented in Figure 8.16 and Figure 8.17. For larger task sizes the SLED scheme outperforms EDTLP. Again, by combining the two schemes we can achieve the best performance.

123 40

SLEDS+EDTLP 35 EDTLP

30

25

20

15 Execution Time (s) 10

5

0 200 600 1000 1400 1800 2200 2600 3000 3400 3800 4200 4600 5000 Sequence Size

Figure 8.15: Comparison of EDTLP and the combination of SLED and EDTLP schemes using the PBPI application. The application is executed multiples time with varying length of the input sequence (represented on the x-axis).

Figure 8.16: Comparison of the EDTLP and SLED schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

Figure 8.17: Comparison of EDTLP and the combination of the SLED and EDTLP schemes using the RAxML application. The application is executed multiple times with varying length of the input sequence (represented on the x-axis).

8.7 Chapter Summary

In this chapter we investigated strategies that target reducing the scheduling overhead which occurs on the PPE side of the Cell BE. We designed and tested the SLED scheduler, which uses user-level off-loading-related information in order to influence kernel-level scheduling decisions. On a PlayStation3, which contains 6 SPEs, we conducted a set of experiments comparing the SLED and EDTLP scheduling schemes. For the comparison we used the real scientific applications RAxML and PBPI, as well as a set of microbenchmarks developed to simulate the behavior of larger applications. Using the microbenchmarks, we found that the SLED scheme is capable of outperforming the EDTLP scheme by up to 17%. SLEDS performs better by up to 13% with PBPI and up to 7% with RAxML. Note that a higher advantage of the SLED scheme is likely on a Cell BE with all 8 SPEs available (the Cell BE used in the PS3 has only 6 SPEs available), due to higher PPE contention and consequently higher context-switch overhead.

Chapter 9

Future Work

This chapter discusses directions of future work. The proposed extensions are summarized as follows:

• We plan to extend the presented kernel-level scheduling policies by implementing the ready to run list inside the kernel and by considering additional scheduling parameters, such as load balancing and job priorities, when making scheduling decisions.

• We plan to increase the utilization of the host and accelerator cores by sharing the accelerators among multiple tasks and by extending loop-level parallelism to also include the host core in addition to the accelerator cores already considered.

• We plan to port more applications to the Cell. Specifically, we will focus on streaming, memory-intensive applications and evaluate the capability of the Cell to execute these applications. By using memory-intensive applications, we hope to gain better insight into scheduling strategies which would enable efficient execution of communication-bound applications on asymmetric processors. We consider this to be an important problem, since memory and bus contention will grow rapidly as the number of cores in asymmetric multi-core architectures increases.

• Most of the techniques presented in this thesis are not specifically designed for the Cell and heterogeneous accelerator-based architectures, and in our future work we plan to extend them to homogeneous parallel architectures.

• Finally, we plan to extend the MMGP model by capturing the overhead caused by Element Interconnect Bus congestion, which can significantly limit the ability of the Cell to overlap computation and communication.

We expand on our plans for future work in the following sections.

9.1 Integrating ready-to-run list in the Kernel

As described in Chapter 8, the SLED scheduler spans both the kernel and the user space. The ready-to-run list resides in user space and is shared among all active processes. The information from the ready-to-run list is passed to the kernel-level part of the SLED scheduler through a system call. Based on the received information, the kernel part of the SLED scheduler biases kernel scheduling decisions. In the rest of this section we explain the possible drawbacks of having the ready-to-run list reside in user space.

The timeline diagram of the SLED scheduler is presented in Figure 9.1 (upper figure). Each process, upon off-loading, issues a call to the SLED scheduler. The scheduler iterates through the ready-to-run list in order to determine the pid of the next process. As presented in Figure 9.1 (upper figure), it is possible that all processes have already off-loaded their tasks to the SPEs, and the scheduler will iterate through the list until one of the SPEs sends a signal to the ready-to-run list. Therefore, it is likely that some idle time will occur (when no useful work is performed) after off-loading and before the next-to-run process is found. Once it finds the next-to-run process, the scheduler switches to kernel mode and influences the kernel scheduler to run the appropriate process.

The possible drawback of this scheme is that, upon determining which process should be the next to run, the system still needs to perform two context switches: between the user process and the kernel, and between the kernel and the user process. In our future work we plan to allow the kernel to directly access the list, which would eliminate one context switch. In other words, by allowing the kernel to see the ready-to-run list, we overlap the first context switch with the idle time which occurs before one of the active processes is ready to be rescheduled on the PPE. When the next-to-run process is determined, the scheduler would already be in kernel space, and there would be only one context switch left to return execution to a specific process in user space; see Figure 9.1 (bottom figure).

Figure 9.1: Upon completing the assigned tasks, SPEs send signals to PPE processes through the ready-to-run list. The PPE process which decides to yield passes the data from the ready-to-run list to the kernel, which in turn can schedule the appropriate process on the PPE.

9.2 Load Balancing and Task Priorities

So far we have considered applying the MGPS and SLED schedulers only to a single application. In our future work we plan to investigate the described scheduling strategies in the context of multi-program workloads. Since the schedulers are already designed to work in a distributed environment, using them with entirely separate applications should be relatively simple. However, we envision several challenges with multi-program workloads that could potentially influence system performance.

First, using the SLED scheduler with a multi-program workload can cause load imbalance. The SLED scheduler contains two ready-to-run lists, and each list is shared among the processes running on a single cpu. Therefore, the scheduler needs to be capable of deciding how to group the active processes across cpus in order to minimize load imbalance. The grouping of processes will depend on parameters such as the granularities of the PPE and SPE tasks, PPE-SPE communication, and inter-node and on-chip communication. Furthermore, the scheduler needs to be able to recognize when the load of the system has changed (for example, when one of the processes has finished executing), and to appropriately reschedule the remaining tasks across the available cpus.

Besides being able to handle load balancing issues, our future work will focus on including support for real-time tasks in our scheduling policies. So far, all processes in our experiments were assumed to have the same priority. This is not the case in all situations; one example would be streaming video applications. While trying to increase system throughput with different process grouping and load-balancing policies, we might actually hurt the performance of the real-time jobs in the system. A simple example would be a real-time task grouped with processes that require a lot of resources. Although this might be the best grouping decision for overall system performance, that particular real-time task might suffer performance degradation. To address these issues, we plan to include multiple applications in our experiments and focus more on load-balancing problems as well as real-time task priorities.

9.3 Increasing Processor Utilization

Our initial scheduling scheme, Event-Driven Task-Level Parallelization (EDTLP), reduces the idle time on the PPE by forcing each process to yield upon off-loading and assigning the PPE to a process that is ready to do work on the PPE side. To further reduce the idle time on both the PPE and the SPEs, we developed the Slack-Minimizing Event-Driven Scheduler (SLED).

In our future work, as another approach to increasing the utilization of the SPEs, we plan to introduce sharing of SPEs among multiple PPE threads. The processes in an MPI application are almost identical, and the off-loaded parts of each process are exactly the same. Therefore, a single SPE thread could potentially execute the off-loaded computation from multiple processes. However, different processes cannot share SPE threads, since SPE threads belong exclusively to the process that created them. Therefore, we plan to investigate another level of parallelism on the Cell processor, namely thread-level parallelism. Within a single node, instead of running multiple MPI processes, a parallel application would operate with multiple threads, which could share the SPEs among themselves; separate processes would still be used across nodes. To further increase the utilization of the PPE, we will consider extended loop-level scheduling policies that also involve the PPE in the computation, in addition to the accelerator cores already used.
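The sketch below illustrates the intended structure: threads of a single process draw SPE contexts from a shared pool, which is possible precisely because the contexts belong to that one process. The pool layout and the offload_to_spe() helper are assumptions made for the example, not an existing interface.

    #include <pthread.h>

    #define NUM_SPES 8

    /* Hypothetical pool of SPE contexts owned by one process and shared
     * by all of its threads. */
    static void *spe_ctx[NUM_SPES];
    static int   spe_busy[NUM_SPES];
    static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  pool_cond = PTHREAD_COND_INITIALIZER;

    extern void offload_to_spe(void *ctx, void *work);  /* assumed helper */

    void run_on_any_spe(void *work)
    {
        int s = -1;

        pthread_mutex_lock(&pool_lock);
        while (s < 0) {                          /* wait for a free SPE */
            for (int i = 0; i < NUM_SPES; i++) {
                if (!spe_busy[i]) { s = i; break; }
            }
            if (s < 0)
                pthread_cond_wait(&pool_cond, &pool_lock);
        }
        spe_busy[s] = 1;
        pthread_mutex_unlock(&pool_lock);

        offload_to_spe(spe_ctx[s], work);        /* off-loaded computation */

        pthread_mutex_lock(&pool_lock);
        spe_busy[s] = 0;
        pthread_cond_signal(&pool_cond);
        pthread_mutex_unlock(&pool_lock);
    }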

9.4 Novel Applications and Programming Models

In this thesis we used a limited number of applications that were able to benefit from the off-loading execution approach. While it is clear that many scientific (computationally expensive) applications will benefit from the proposed execution models and scheduling strategies, in our future work we plan to focus on applications with high-bandwidth requirements. Specifically, we plan to investigate the capability of accelerator-based architectures to execute applications such as database servers and network packet processing.

The mentioned applications are computationally intensive, but they also usually require high memory bandwidth because they stream large amounts of data. Besides being extremely computationally powerful, Cell has a high-bandwidth bus which connects the on-chip cores to each other and to main memory. While this high-bandwidth bus can improve the performance of streaming applications, in the near future it might become a bottleneck as the number of on-chip cores increases. Therefore, in our future work we will focus on runtime systems that improve the execution of data-intensive applications on asymmetric processors.

9.5 Conventional Architectures

The main focus of this thesis has been heterogeneous, accelerator-based architectures. However, parallel architectures comprising homogeneous cores represent the majority of processors in use today. When working with conventional, highly parallel architectures, it is likely that problems similar to those we faced on heterogeneous architectures will occur.

As with asymmetric architectures, applications designed for homogeneous parallel architectures need to be parallelized at multiple levels in order to achieve efficient execution. Applications with multiple levels of parallelism are likely to experience load imbalance, which might result in poor utilization of chip resources. Therefore, we need techniques that are capable of detecting and correcting these anomalies.

Most of the techniques we presented in this thesis are not bound to heterogeneous architectures. In our future work we plan to extend and evaluate our scheduling and modeling work on homogeneous parallel architectures. While scheduling approaches such as MGPS and S-MGPS might be relatively simple to apply to any kind of architecture, the MMGP modeling approach will require more detailed communication modeling. On the Cell architecture, because of the specifics of the SPE design, we were able to assume significant computation-communication overlap. This will obviously not be the case on architectures with conventional caches; therefore, we will focus more on modeling communication patterns.

9.6 MMGP Extensions

Another direction of our future work on the MMGP model is more accurate modeling of the off-loaded tasks, specifically of the DMA communication they perform. Each SPE on the Cell/BE is capable of moving data between main memory and its local storage while at the same time executing computation. To overlap computation and communication, applications use loop tiling and double buffering.
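The sketch below shows the usual double-buffering pattern on an SPE, using the MFC DMA intrinsics from the Cell SDK (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The chunk size, the compute() routine, and the overall structure are illustrative assumptions rather than code taken from our applications.

    #include <spu_mfcio.h>

    #define CHUNK 4096                  /* bytes per DMA transfer (example) */

    /* Two local-store buffers: while buf[cur] is processed, the MFC fills
     * buf[1 - cur] in the background. */
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(char *data, int n); /* assumed per-chunk computation */

    void process_stream(unsigned long long ea, int nchunks)
    {
        int cur = 0;

        /* Prime the pipeline: fetch the first chunk under tag 0. */
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

        for (int i = 0; i < nchunks; i++) {
            /* Start fetching the next chunk into the other buffer (tag 1 - cur). */
            if (i + 1 < nchunks)
                mfc_get(buf[1 - cur], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, 1 - cur, 0, 0);

            /* Block only on the DMA that fills the current buffer. */
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();

            compute(buf[cur], CHUNK);   /* overlaps with the in-flight DMA */
            cur = 1 - cur;
        }
    }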

In the MMGP model presented in this thesis we have included all blocking DMA requests that cannot be overlapped with computation. However, loop unrolling and the resulting increase in DMA communication can influence performance at a completely different architectural level. Although the Element Interconnect Bus (the structure that connects the cores on Cell) can achieve a bandwidth of over 200 GB/s, the processor-memory bandwidth is limited to 25 GB/s. When many SPEs work simultaneously, the available bandwidth might not be sufficient. Consider a case where each SPE executes exactly the same loop, a realistic scenario when an off-loaded loop is parallelized across multiple accelerators. If the off-loaded execution is synchronized, all SPEs will issue a DMA request at the same time. Although the average bandwidth requirements might be less than 25 GB/s, when all SPEs simultaneously and synchronously perform memory communication the instantaneous requirements might exceed the available bandwidth. This scenario is likely to occur when significant loop unrolling is performed, due to the heavily increased DMA communication needed to bring in data for the enlarged loop bodies. In our future work we plan to extend the MMGP model by capturing the on-chip contention caused by the high bandwidth requirements of the off-loaded code.
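As an illustrative calculation (the per-SPE numbers are hypothetical, chosen only to make the effect concrete): suppose each of the 8 SPEs issues a 16 KB DMA request once every 10 microseconds of computation. The average demand is

    8 x 16 KB / 10 us  ~ 13 GB/s    (below the 25 GB/s processor-memory limit)
    128 KB / 25 GB/s   ~ 5 us       (time to drain one synchronized burst)

so if the SPEs run in lockstep and all eight requests arrive in the same instant, the last SPE served stalls for several microseconds even though the average demand never exceeds the limit. A contention term of this form is what we intend to add to MMGP.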

Chapter 10

Overview of Related Research

10.1 Cell – Related Research

Cell has recently attracted considerable attention as a high-end computing platform. Recent work on Cell covers modeling, performance analysis, programming and compilation environments, and application studies.

Kistler et al. [63] analyze the performance of Cell's on-chip interconnection network and provide insights into its communication and synchronization protocols. They present experiments that estimate the DMA latencies and bandwidth of Cell, using microbenchmarks. They also investigate the system behavior under different patterns of communication between local storage and main memory. Based on the presented results, the Cell communication network provides the speed and bandwidth that applications need to exploit the processor's computational power. Williams et al. [98] present an analytical framework to predict performance on Cell. In order to test their model, they use several computational kernels, including dense matrix multiplication, sparse matrix-vector multiplication, stencil computations, and 1D/2D FFTs. In addition, they propose micro-architectural modifications that can increase the performance of Cell when operating on double-precision floating point elements. Chen et al. [33] investigate communication (DMA) performance on the SPEs. They found a strong relation between the size of the prefetching buffers allocated in local storage and application performance. To determine the optimal buffer size, they present a detailed analytical model of DMA accesses on the Cell and use the model to optimize the buffer size for DMAs. To evaluate the performance of their model, they use a set of micro-kernels. Our work differs in that it considers the overall performance implications of multigrain parallelization strategies on Cell.

Balart et al. [16] present a runtime library for asynchronous communication on the Cell BE processor. The library is organized as a software cache and provides opportunities for overlapping communication and computation. They found that a fully associative scheme offers better chances for communication-computation overlap. To evaluate their system they used benchmarks from the HPCC suite. While their concern was the design and implementation of the off-loaded code, in our work we assume that the application is already Cell-optimized, and we focus on scheduling the parallelism that the application has already exposed.

Eichenberger et al. [39] present several compiler techniques targeting automatic generation of highly optimized code for Cell. These techniques attempt to exploit two levels of parallelism, thread-level and SIMD-level, on the SPEs. The techniques include compiler-assisted memory alignment, branch prediction, SIMD parallelization, OpenMP thread-level parallelization, and compiler-controlled software caching. The study of Eichenberger et al. does not present details on how multiple dimensions of parallelism are exploited and scheduled simultaneously by the compiler. Our contribution addresses this issue. The compiler techniques presented in [39] are complementary to the work presented in this thesis: they focus primarily on extracting high performance out of each individual SPE, whereas our work focuses on scheduling and orchestrating computation across SPEs. Zhao and Kennedy [102] present a dependence-driven compilation framework for simultaneous automatic loop-level parallelization and SIMDization on Cell. They also implement strategies to boost performance by managing DMA data movement, improving data alignment, and exploiting memory reuse in the innermost loop. To evaluate the performance of their techniques, Zhao and Kennedy use microbenchmarks. Similar to the results presented in our study, they do not see linear speedup when parallelizing tasks across multiple SPEs. The framework of Zhao and Kennedy does not consider task-level functional parallelism and its coordinated scheduling with data parallelism, two central issues explored in this thesis.

Although Cell has been a focal point of numerous articles in the popular press, published research using Cell for real-world applications beyond games was scarce until recently. Hjelte [58] presents an implementation of a smoothed particle hydrodynamics simulation on Cell. This simulation requires good interactive performance, since it lies on the critical path of real-time applications such as interactive simulation of human organ tissue, body fluids, and vehicular traffic. Benthin et al. [18] present an implementation of ray-tracing algorithms on Cell, also targeting high interactive performance. They show how to efficiently map the ray-tracing algorithm to Cell, with performance improvements of nearly an order of magnitude over conventional processors. However, they found that for certain algorithms Cell does not perform well due to frequent memory accesses. Petrini et al. [73] recently reported experiences from porting and optimizing Sweep3D on Cell, in which they consider multi-level data parallelization on the SPEs. They heavily optimized Sweep3D for Cell and achieved impressive performance of 9.3 Gflops for double-precision and 50 Gflops for single-precision floating point computation. Contrary to their conclusion that memory performance and data communication patterns play a central role in Sweep3D, we were able to achieve complete communication-computation overlap in the bioinformatics codes we ported to Cell. The same authors presented a study of graph exploration algorithms on Cell [75], investigating the suitability of the breadth-first search (BFS) algorithm for the Cell BE; the achieved performance is an order of magnitude better than on conventional architectures. Bader et al. [13] examine the implementation of list ranking algorithms on Cell. List ranking is a challenging algorithm for Cell due to its highly irregular access patterns. When utilizing the entire Cell chip, they reported an overall speedup of 8.34 over a PPE-only implementation of the same algorithm. Recently, several Cell studies have been conducted as a result of the 2007 IBM Cell Challenge. Moorkanikara-Nageswaran et al. [1] developed a brain circuit simulation on a PS3 node. As part of the same contest, De Kruijf ported the MapReduce [38] algorithm to Cell. The main goal of our work is both to develop and optimize applications for Cell and to develop system software tools and methodologies for improving performance on the Cell architecture across application domains. We use a case study from bioinformatics to understand the implications of static and dynamic multi-grain parallelization on Cell.

10.2 Process Scheduling – Related Research

Dynamic and off-line process scheduling that improves the performance and overall throughput of the system has been a very active research area. With the introduction of multi-core systems, many scheduling-related studies have been conducted targeting performance improvement on these novel systems. We list several contributions in this area.

Anderson et al. [9] argue that the performance of kernel-level threads is inherently worse than that of user-level threads. While user-level threads are essential for high-performance computation, kernel-level threads, which support user-level threads, are a poor kernel-level abstraction due to their inherently bad performance. The authors propose a new kernel interface and a user-level thread package that together provide the same functionality as kernel threads, while the performance of their thread library remains comparable to that of any other user-level thread library.

Siddha et al. [82] conducted a thorough study of possible scheduling strategies on emerging multi-core architectures. They consider different multi-core topologies and the associated power management technologies, and point out the tradeoffs involved in scheduling on these novel architectures. They focus on symmetric processors and do not consider any asymmetric architectures. Somewhat similarly to the results obtained from our study (which uses asymmetric cores), they conclude that the most efficient performance can be achieved by making the process scheduler aware of multi-core topologies and task characteristics.

Fedorova et al. [42] designed a kernel-level scheduling algorithm that targets improving the performance of multi-core architectures with shared levels of cache. The motivation for their work comes from the fact that the performance of an application on a multi-core system depends on the behavior of its co-runners. This dependency is a consequence of shared on-chip resources, such as the cache. Their algorithm ensures that processes always run as quickly as they would if the cache were fairly shared among all co-running processes. To achieve this behavior, they adjust the CPU timeslices assigned to the running processes by the kernel scheduler.

Calandrino et al. [26] developed an approach for scheduling soft real-time periodic tasks in Linux on asymmetric multi-core processors. Their approach performs dynamic scheduling of real-time tasks while at the same time attempting to provide good performance for non-real-time processes. To evaluate their approach, they used a Linux scheduler simulator as well as the real Linux operating system running on a dual-core Intel Xeon processor.

Settle et al. [80] proposed a memory monitoring framework, architectural support that provides cache resource information to the operating system. The authors introduce the concept of an activity vector, which represents a collection of event counters for a set of contiguous cache blocks. Using this runtime information, the operating system can improve process scheduling: their scheme schedules threads based on the run-time cache use and miss pattern of each active hardware thread. Their techniques improve system performance by 5%, with the improvement coming from an increased cache hit rate.

Thekkath and Eggers [93] tested the hypothesis that scheduling threads that share data on the same processor decreases compulsory and invalidation misses. They evaluated a variety of thread placement algorithms on a workload composed of fourteen parallel programs representative of real-world scientific applications. They found that placing threads that share data on the same processor does not have any impact on performance; instead, performance was mostly affected by thread load balancing.

Rajagopalan et al. [76] introduce a scheduling framework for multi-core processors that targets a balance between control over the system and the level of abstraction. Their framework uses high-level information supplied by the user to guide thread scheduling and also, where necessary, gives the programmer fine control over thread placement.

Snavely and Tullsen [83] designed the SOS (Sample, Optimize, Symbios) scheduler, an OS-level scheduler that dynamically chooses the best scheduling strategy in order to increase the throughput of the system. The SOS scheduler samples the space of possible process combinations and collects hardware counter values for different scheduling combinations. The scheduler then applies heuristics to the collected counters in order to determine the most efficient scheduling strategy. The scheduler is designed for SMT architectures and is capable of improving system performance by up to 17%. The same authors extend their initial work by introducing job priorities [84]: while different jobs might have different priorities from the user's perspective, the SOS scheduler might be unaware of them, and while trying to improve system throughput it might increase the response time of high-priority jobs.

Sudarsan et al. [90] developed ReSHAPE, a runtime scheduler for dynamic resizing of parallel applications executed in a distributed environment. MPI-based applications using the ReSHAPE framework can expand or shrink depending on the availability of the underlying hardware. Using ReSHAPE, they demonstrated improvements in job turn-around time and overall system throughput. McCann et al. [67] propose a dynamic processor-allocation policy for multiprogrammed shared-memory multiprocessors. Their scheduling policy also assumes multiple independent processes and is capable of reallocating processors from one parallel job to another based on the requirements of the parallel jobs. The authors show that it is possible to beneficially run low-priority jobs on the same CPU as high-priority jobs without hurting the high-priority jobs. Their scheduling scheme can improve system performance by up to 40%.

Curtis-Maury et al. [36] present a prediction model for identifying energy-efficient operating points of concurrency in multithreaded scientific applications. Their runtime system optimizes applications at runtime using live analysis of hardware event rates. Zhang et al. [100] developed an OpenMP-based loop scheduler that selects the number of threads to use per processor based on sample executions of each possibility; the authors extend that work to incorporate decision-tree-based prediction of the optimal number of threads to use [101]. Springer et al. [85] developed a scheduler that satisfies two conditions: the scheduling strategy respects an external upper limit on energy consumption and minimizes execution time. The execution configuration chosen by their scheduler is usually within 2% of optimal.

10.3 Modeling – Related Research

In this section we review related work in programming environments and models for parallel computation on conventional homogeneous parallel systems and programming support for nested parallelization. The list of related work in models for parallel computation is by no means complete, but we believe it provides adequate context for the model presented in this thesis.

10.3.1 PRAM Model

Fortune and Wyllie presented a model based on random access machines operating in parallel and sharing a common memory [46]. They model the execution of a finite program on a PRAM (parallel random access machine), which consists of an unbounded set of processors connected through an unbounded global shared memory. The model is rather simple but not realistic for modern multicore processors, since it assumes that all processors work synchronously and that interprocessor communication is free. PRAM also does not consider network congestion. There are several variants of the PRAM model: (i) EREW (exclusive read, exclusive write) does not allow simultaneous read or write operations on the same memory location; (ii) CREW (concurrent read, exclusive write) allows simultaneous reading but prevents simultaneous writing; (iii) CRCW (concurrent read, concurrent write) allows both simultaneous read and simultaneous write operations. Cameron et al. [28] describe two different implementations of the CRCW PRAM: priority and arbitrary.
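As a small worked example of the model's cost accounting (a standard textbook illustration, not taken from this thesis): summing n numbers on an EREW PRAM with n/2 processors takes

    ceil(log2 n) synchronous steps,

since in every step each active processor adds two partial sums and the number of partial sums halves, so under the unit-cost PRAM assumptions the running time is O(log n) regardless of how the data is placed in shared memory. The CRCW variant can be strictly more powerful: computing the logical OR of n bits takes constant time on a common-write CRCW PRAM, because every processor holding a 1 may write to the same cell simultaneously.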

Several extensions of the PRAM model have been developed in order to make it more practical while preserving its simplicity [5, 6, 51, 62, 68, 72]. Aggarwal et al. [5] add communication latency to the PRAM model, and the same authors account for reduced communication costs when blocks of data are transferred [6].

The original PRAM model assumes fully synchronous execution. The Asynchronous PRAM (APRAM) model includes synchronization costs [51, 68]. APRAM contains four different types of instructions: global reads, global writes, local operations, and synchronization steps. A synchronization step represents a global synchronization among processors.

10.3.2 BSP Model

Valiant introduced the bulk-synchronous parallel (BSP) model [95], which is a bridging model between parallel software and hardware. The BSP model is intended neither as a hardware model nor as a programming model, but as something in between. The model is defined as a combination of three attributes: 1) a number of components, each performing processing and/or memory functions; 2) a router that delivers point-to-point messages between the components; and 3) facilities for synchronizing all or a subset of the components at regular intervals. The computation is performed in supersteps: in each superstep every component is allocated a task, and all components are synchronized at the end of the superstep. BSP, like the other models mentioned, does not capture the overhead of context switching, which is a significant part of accelerator-based execution and of the MMGP model. BSP allows processors to work asynchronously within a superstep and models latency and limited bandwidth.
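In its standard formulation (summarized here only for reference, not specific to this thesis), the cost of a single superstep is

    T_superstep = w + g * h + l,

where w is the maximum amount of local computation performed by any component, h is the maximum number of words sent or received by any component (an h-relation), g is the router's per-word delivery cost, and l is the cost of the barrier synchronization that closes the superstep; the running time of a BSP program is the sum of its superstep costs.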

Bäumker and Dittrich [25] extend the BSP model with blockwise communication in parallel algorithms. A good parallel algorithm should communicate using a small number of large messages rather than a large number of small messages; they therefore introduce a new parameter B, which represents the minimum message size needed to fully exploit the bandwidth of the router. Fantozzi et al. [41] introduce D-BSP, a model in which a machine can be divided into submachines capable of exploiting locality; furthermore, each submachine can execute a different algorithm independently. Juurlink et al. [60] extend the BSP model by providing a way to deal with unbalanced communication patterns and by adding a notion of general locality, where the delay of a remote memory access depends on the relative location of the processors in the interconnection network.

10.3.3 LogP Model

LogP [35] is another widely used machine-independent model for parallel computation. The LogP model captures the performance of parallel applications using four parameters: the communication latency (L), the overhead (o), the gap (g) between consecutive messages, which reflects the per-processor communication bandwidth, and the number of processors (P).

The drawback of LogP is that it can accurately predict performance only when short messages are used for communication. Alexandrov et al. [8] propose the LogGP model, an extension of LogP that supports large communication messages and high bandwidth; they introduce an extra parameter G, which captures the bandwidth obtained for large messages. Ino et al. [59] introduce an extension of LogGP, named LogGPS. LogGPS improves the accuracy of the LogGP model by capturing the synchronization that high-level communication libraries perform before sending a long message; they introduce a new parameter S, defined as the message-length threshold above which messages are sent synchronously. Frank et al. [47] extend the LogP model by capturing the impact of contention for message-processing resources. Cameron et al. [27] extend the LogP model by modeling the point-to-point memory latencies of inter-node communication in a shared-memory cluster.
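To make the parameter sets concrete (these are the standard formulations, restated here only as a reminder): under LogP, a short point-to-point message costs

    T = o + L + o = L + 2o,

with consecutive messages from the same processor separated by at least the gap g, while under LogGP a k-byte message costs

    T(k) = o + (k - 1) * G + L + o,

so the additional parameter G is the per-byte cost (the reciprocal of the bandwidth) for long messages.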

Traditional parallel programming models, such as BSP [95], LogP [35], PRAM [51], and derived models [8, 27, 59, 70], developed to respond to changes in the relative impact of architectural components on the performance of parallel systems, are based on a minimal set of parameters that capture the impact of communication overhead on computation running across a homogeneous collection of interconnected processors. MMGP borrows elements from LogP and its derivatives to estimate the performance of parallel computations on heterogeneous parallel systems with multiple dimensions of parallelism implemented in hardware. A variation of LogP, HLogP [24], considers heterogeneous clusters with variability in the computational power and in the interconnection network latencies and bandwidths between the nodes. Although HLogP is applicable to heterogeneous multi-core architectures, it does not consider nested parallelism. It should be noted that although MMGP has been evaluated on architectures with heterogeneous processors, it can readily support architectures with heterogeneity in their communication substrates as well (e.g., architectures providing both shared-memory and message-passing communication).

10.3.4 Models Describing Nested Parallelism

Several parallel programming models have been developed to support nested parallelism, including nested parallel languages such as NESL [21], task-level parallelism extensions to data-parallel languages such as HPF [89], extensions of common parallel programming libraries such as MPI and OpenMP to support nested parallel constructs [29, 64], and techniques for combining constructs from parallel programming libraries, typically MPI and OpenMP, to better exploit nested parallelism [11, 50, 77]. Prior work on languages and libraries for nested parallelism based on MPI and OpenMP is largely based on empirical observations of the relative speed of data communication via cache-coherent shared memory versus communication with message passing through switching networks. Our work attempts to formalize these observations into a model which seeks the optimal work allocation between layers of parallelism in the application and the optimal mapping of these layers to heterogeneous parallel execution hardware. NESL [21] and Cilk [22] are languages based on formal algorithmic models of performance that guarantee tight bounds on estimating the performance of multithreaded computations and enable nested parallelization. Both NESL and Cilk assume homogeneous machines.

Subhlok and Vondran [88] present a model for estimating the optimal number of homogeneous processors to assign to each parallel task in a chain of tasks that forms a pipeline. MMGP has the similar goal of assigning co-processors to simultaneously active tasks originating from the host processors; however, it also searches for the optimal number of tasks to activate on the host processors, in order to achieve a balance between supply from host processors and demand from co-processors. Sharapov et al. [81] use a combination of queuing theory and cycle-accurate simulation of processors and interconnection networks to predict the performance of hybrid parallel codes written in MPI/OpenMP on ccNUMA architectures. MMGP uses a simpler model, designed to estimate scalability along more than one dimension of parallelism on heterogeneous parallel architectures.

Research on optimizing compilers for novel microprocessors, such as tiled and streaming processors, has contributed methods for multi-grain parallelization of scientific and media computations. Gordon et al. [53] present a compilation framework for exploiting three layers of parallelism (data, task, and pipelined) on streaming microprocessors running DSP applications. The framework uses a combination of fusion and fission transformations on data-parallel computations to "right-size" the degree of task and data parallelism in a program running on a homogeneous multi-core microprocessor. MMGP is a complementary tool which can assist both compile-time and runtime optimization on heterogeneous multi-core platforms. The development of MMGP coincides with several related efforts on measuring, modeling, and optimizing performance on the Cell Broadband Engine [32, 75]. An analytical model of Cell presented by Williams et al. [97] considers the execution of floating point code and DMA accesses on the Cell SPEs for scientific kernels parallelized at one level across SPEs and vectorized further within SPEs. MMGP models the use of both the PPE and the SPEs and has been demonstrated to work effectively with complete application codes. In particular, MMGP factors the effects of PPE thread scheduling, PPE-SPE communication, and SPE-SPE communication into the Cell performance model.


Bibliography

[1] http://www-304.ibm.com/jct09002c/university/students/contests/cell/index.html.

[2] http://www.rapportincorporated.com.

[3] The Cell project at IBM Research; http://www.research.ibm.com/cell.

[4] www.gpgpu.org.

[5] A. Aggarwal, A. K. Chandra, and M. Snir. On communication latency in PRAM computations. In SPAA ’89: Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, pages 11–21, New York, NY, USA, 1989. ACM.

[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theor. Comput. Sci., 71(1):3–28, 1990.

[7] S. Alam, R. Barrett, J. Kuehn, P. Roth, and J. Vetter. Characterization of scientific workloads on systems with multi-core processors. In Proc. of IEEE International Symposium on Workload Characterization (IISWC), 2006.

[8] A. Alexandrov, M. Ionescu, C. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model: One Step Closer towards a Realistic Model for Parallel Computation. In Proc. of the 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 95–105, Santa Barbara, CA, June 1995.

[9] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler activations: effective kernel support for the user-level management of parallelism. ACM Trans. Comput. Syst., 10(1):53–79, 1992.

[10] K. Asanovic, R. Bodik, C. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California–Berkeley, December 2006.

[11] E. Ayguadé, X. Martorell, J. Labarta, M. González, and N. Navarro. Exploiting Multiple Levels of Parallelism in OpenMP: A Case Study. In Proc. of the 1999 International Conference on Parallel Processing (ICPP’99), pages 172–180, Aizu, Japan, August 1999.

[12] A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, M. Alvarez, and A. Ramirez. Analysis of video filtering on the cell processor. In Proceedings of the ProRISC Conference, pages 116–121, November 2007.

[13] D. Bader, V. Agarwal, and K. Madduri. On the Design and Analysis of Irregular Algorithms on the Cell Processor: A Case Study on List Ranking. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[14] D.A. Bader, B.M.E. Moret, and L. Vawter. Industrial applications of high-performance computing for phylogeny reconstruction. In Proc. of SPIE ITCom, volume 4528, pages 159–168, 2001.

[15] David A. Bader, Virat Agarwal, Kamesh Madduri, and Seunghwa Kang. High performance combinatorial algorithm design on the cell broadband engine processor. Parallel Comput., 33(10-11):720–740, 2007.

[16] Jairo Balart, Marc Gonzalez, Xavier Martorell, Eduard Ayguade, Zehra Sura, Tong Chen, Tao Zhang, Kevin O’Brien, and Kathryn O’Brien. A novel asynchronous software cache implementation for the cell/be processor. In The 20th International Workshop on Languages and Compilers for Parallel Computing, 2007.

[17] P. Bellens, J. Perez, R. Badia, and J. Labarta. CellSs: A Programming Model for the Cell BE Architecture. In Proc. of Supercomputing’2006, Tampa, FL, November 2006.

[18] Carsten Benthin, Ingo Wald, Michael Scherbaum, and Heiko Friedrich. Ray Tracing on the CELL Processor. Technical Report, inTrace Realtime Ray Tracing GmbH, No inTrace-2006-001 (submitted for publication), 2006.

[19] F. Blagojevic, D. Nikolopoulos, A. Stamatakis, and C. Antonopoulos. Dynamic Multigrain Parallelization on the Cell Broadband Engine. In Proc. of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 90–100, March 2007.

[20] F. Blagojevic, A. Stamatakis, C. Antonopoulos, and D. Nikolopoulos. RAxML-CELL: Parallel Phylogenetic Tree Construction on the Cell Broadband Engine. In Proc. of the 21st International Parallel and Distributed Processing Symposium, March 2007.

[21] G. Blelloch, S. Chatterjee, J. Harwick, J. Sipelstein, and M. Zagha. Implementation of a Portable Nested Data Parallel Language. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’93), pages 102–112, San Diego, CA, June 1993.

[22] R. Blumofe, C. Joerg, B. Kuszmaul, C. Leiserson, K. Randall, and Y. Zhou. Cilk: an Efficient Multithreaded Runtime System. In Proc. of the 5th ACM Symposium on Principles and Practice of Parallel Programming (PPoPP’95), pages 207–216, Santa Barbara, California, August 1995.

[23] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the gpu: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.

[24] J. Bosque and L. Pastor. A Parallel Computational Model for Heterogeneous Clusters. IEEE Transactions on Parallel and Distributed Systems, 17(12):1390–1400, December 2006.

[25] Armin Bäumker and Wolfgang Dittrich. Fully dynamic search trees for an extension of the BSP model.

[26] John M. Calandrino, Dan Baumberger, Tong Li, Scott Hahn, and James H. Anderson. Soft real-time scheduling on performance asymmetric multicore platforms. In RTAS ’07: Proceedings of the 13th IEEE Real Time and Embedded Technology and Applications Symposium, pages 101–112, Washington, DC, USA, 2007. IEEE Computer Society.

[27] K. Cameron and X. Sun. Quantifying Locality Effect in Data Access Delay: Memory LogP. In Proc. of the 17th International Parallel and Distributed Processing Symposium, Nice, France, April 2003.

[28] Kirk W. Cameron and Rong Ge. Predicting and evaluating distributed communication performance. In SC ’04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 43, Washington, DC, USA, 2004. IEEE Computer Society.

[29] F. Cappello and D. Etiemble. MPI vs. MPI+OpenMP on the IBM SP for the NAS Benchmarks. In Proc. of the IEEE/ACM Supercomputing’2000: High Performance Networking and Computing Conference (SC’2000), Dallas, Texas, November 2000.

[30] L. Chai, Q. Gao, and D. K. Panda. Understanding the Impact of Multi-Core Architecture in Cluster Computing: A Case Study with Intel Dual-Core System. In Proc. of CCGrid2007, May 2007.

[31] Maria Charalambous, Pedro Trancoso, and Alexandros Stamatakis. Initial experiences porting a bioinformatics application to a graphics processor. In Panhellenic Conference on Informatics, pages 415–425, 2005.

[32] T. Chen, Z. Sura, K. O’Brien, and K. O’Brien. Optimizing the Use of Static Buffers for DMA on a Cell Chip. In Proc. of the 19th International Workshop on Languages and Compilers for Parallel Computing, New Orleans, LA, November 2006.

[33] Thomas Chen, Ram Raghavan, Jason Dale, and Eiji Iwata. Cell broadband engine architecture and its first implementation. IBM developerWorks, Nov 2005.

[34] Benny Chor and Tamir Tuller. Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics, 21(1):97–106, 2005.

[35] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. Von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’93), pages 1–12, San Diego, California, May 1993.

[36] Matthew Curtis-Maury, Filip Blagojevic, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos. Prediction-based power-performance adaptation of multithreaded scientific codes. IEEE Transactions on Parallel and Distributed Systems.

[37] William J. Dally, Francois Labonte, Abhishek Das, Patrick Hanrahan, Jung-Ho Ahn, Jayanth Gummaraju, Mattan Erez, Nuwan Jayasena, Ian Buck, Timothy J. Knight, and Ujval J. Kapasi. Merrimac: Supercomputing with streams. In SC ’03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

[38] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. of the 6th Symposium on Operating Systems Design and Implementation (OSDI ’04), pages 137–150, 2004.

[39] A. Eichenberger, Z. Sura, A. Wang, T. Zhang, P. Zhao, M. Gschwind, K. O’Brien, K. O’Brien, P. Wu, T. Chen, P. Oden, D. Prener, J. Shepherd, and B. So. Optimizing Compiler for the CELL Processor. In Proc. of the 14th International Conference on

Parallel Architectures and Compilation Techniques, pages 161–172, Saint Louis, MO, September 2005.

[40] B. Flachs et al. The Microarchitecture of the Streaming Processor for a CELL Processor. Proceedings of the IEEE International Solid-State Circuits Symposium, pages 184–185, February 2005.

[41] Carlo Fantozzi, Andrea Pietracaprina, and Geppino Pucci. Translating submachine locality into locality of reference. J. Parallel Distrib. Comput., 66(5):633–646, 2006.

[42] Alexandra Fedorova, Margo Seltzer, and Michael D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT ’07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 25–38, Washington, DC, USA, 2007. IEEE Computer Society.

[43] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17:368–376, 1981.

[44] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.

[45] X. Feng, K. Cameron, B. Smith, and C. Sosa. Building the Tree of Life on Terascale Systems. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[46] Steven Fortune and James Wyllie. Parallelism in random access machines. In STOC ’78: Proceedings of the tenth annual ACM symposium on Theory of computing, pages 114–118, New York, NY, USA, 1978. ACM.

[47] Matthew Frank, Anant Agarwal, and Mary K. Vernon. LoPC: Modeling contention in parallel algorithms. In Principles and Practice of Parallel Programming, pages 276–287, 1997.

[48] Bugra Gedik, Rajesh Bordawekar, and Philip S. Yu. Cellsort: High performance sorting on the cell processor. In Proc. of the 33rd Very Large Databases Conference, pages 1286–1207, 2007.

[49] Buğra Gedik, Philip S. Yu, and Rajesh R. Bordawekar. Executing stream joins on the cell processor. In VLDB ’07: Proceedings of the 33rd international conference on Very large data bases, pages 363–374. VLDB Endowment, 2007.

[50] A. Gerndt, S. Sarholz, M. Wolter, D. An Mey, C. Bischof, and T. Kuhlen. Particles and Continuum – Nested OpenMP for Efficient Computation of 3D Critical Points in Multiblock Data Sets. In Proc. of Supercomputing’2006, Tampa, FL, November 2006.

[51] P. Gibbons. A More Practical PRAM Model. In Proc. of the First Annual ACM Symposium on Parallel Algorithms and Architectures, pages 158–168, Santa Fe, NM, June 1989.

[52] M. Girkar and C. Polychronopoulos. The Hierarchical Task Graph as a Universal Intermediate Representation. International Journal of Parallel Programming, 22(5):519–551, October 1994.

[53] M. Gordon, W. Thies, and S. Amarasinghe. Exploiting Coarse-Grained Task, Data and Pipelined Parallelism in Stream Programs. In Proc. of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 151–162, San Jose, CA, October 2006.

[54] Naga K. Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, and Dinesh Manocha. Fast computation of database operations using graphics processors. In SIGMOD ’04: Proceedings of the 2004 ACM SIGMOD international conference on Management of data, pages 215–226, New York, NY, USA, 2004. ACM.

[55] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI User’s Group Meeting, pages 11–18, Barcelona, Spain, September 1999.

[56] W. Gropp and E. Lusk. Reproducible Measurements of MPI Performance Characteristics. In Proc. of the 6th European PVM/MPI Users Group Meeting, pages 11–18, September 1999.

[57] M. Hill and M. Marty. Amdahl's Law in the Multi-core Era. Technical Report 1593, Department of Computer Sciences, University of Wisconsin-Madison, March 2007.

[58] Nils Hjelte. Smoothed particle hydrodynamics on the cell broadband engine. Master's thesis, Umeå University, Department of Computer Science, Jun 2006.

[59] F. Ino, N. Fujimoto, and K. Hagihara. LogGPS: A Parallel Computational Model for Synchronization Analysis. In Proc. of the 8th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 133–142, Snowbird, UT, June 2001.

[60] Ben H. H. Juurlink and Harry A. G. Wijshoff. The e-BSP model: Incorporating general locality and unbalanced communication into the BSP model. In Euro-Par, Vol. II, pages 339–347, 1996.

[61] W. Kahan. Lecture notes on the status of IEEE Standard 754 for binary floating-point arithmetic. 1997.

[62] Richard M. Karp, Michael Luby, and Friedhelm Meyer auf der Heide. Efficient PRAM simulation on a distributed memory machine. In STOC ’92: Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pages 318–326, New York, NY, USA, 1992. ACM.

[63] Mike Kistler, Michael Perrone, and Fabrizio Petrini. Cell Multiprocessor Interconnection Network: Built for Speed. IEEE Micro, 26(3), May-June 2006. Available from http://hpc.pnl.gov/people/fabrizio/papers/ieeemicro-cell.pdf.

[64] G. Krawezik. Performance Comparison of MPI and three OpenMP Programming Styles on Shared Memory Multiprocessors. In Proc. of the 15th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 118–127, San Diego, CA, June 2003.

[65] E. Scott Larsen and David McAllister. Fast matrix multiplies using graphics hardware. In Supercomputing ’01: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM), pages 55–55, New York, NY, USA, 2001. ACM.

[66] L-K. Liu, Q. Li, A. Natsev, K.A. Ross, J.R. Smith, and A.L. Varbanescu. Digital media indexing on the cell processor. In ICME 2007, pages 1866–1869. IEEE Signal Processing Society, July 2007.

[67] Cathy McCann, Raj Vaswani, and John Zahorjan. A dynamic processor allocation policy for multiprogrammed shared-memory multiprocessors. ACM Trans. Comput. Syst., 11(2):146–178, 1993.

[68] Kurt Mehlhorn and Uzi Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Inf., 21(4):339–374, 1984.

[69] Barry Minor, Gordon Fossum, and Van To. Terrain rendering engine (TRE), http://www.research.ibm.com/cell/whitepapers/tre.pdf. May 2005.

[70] C. Moritz and M. Frank. LoGPC: Modeling Network Contention in Message Passing Programs. In Proc. of the 1998 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 254–263, Madison, WI, June 1998.

[71] PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual. http://www-306.ibm.com/chips/techlib.

[72] Christos Papadimitriou and Mihalis Yannakakis. Towards an architecture-independent analysis of parallel algorithms. In STOC ’88: Proceedings of the twentieth annual ACM symposium on Theory of computing, pages 510–513, New York, NY, USA, 1988. ACM.

[73] F. Petrini, G. Fossum, A. Varbanescu, M. Perrone, M. Kistler, and J. Fernandez Periador. Multi-core Surprises: Lessons Learned from Optimized Sweep3D on the Cell Broadband Engine. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[74] Fabrizio Petrini, Gordon Fossum, Mike Kistler, and Michael Perrone. Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine.

[75] Fabrizio Petrini, Daniel Scarpazza, Oreste Villa, and Juan Fernandez. Challenges in Mapping Graph Exploration Algorithms on Advanced Multi-core Processors. In Proc. of the 21st International Parallel and Distributed Processing Symposium, Long Beach, CA, March 2007.

[76] Mohan Rajagopalan, Brian T. Lewis, and Todd A. Anderson. Thread scheduling for multi-core platforms. In HotOS 2007: Proceedings of the Eleventh Workshop on Hot Topics in Operating Systems, 2007.

[77] T. Rauber and G. Ruenger. Library Support for Hierarchical Multiprocessor Tasks. In Proc. of Supercomputing’2002, Baltimore, MD, November 2002.

[78] Daniele Paolo Scarpazza, Oreste Villa, and Fabrizio Petrini. Peak-performance DFA-based string matching on the cell processor. In IPDPS, pages 1–8. IEEE, 2007.

[79] Harald Servat, Cecilia Gonzalez, Xavier Aguilar, Daniel Cabrera, and Daniel Jimenez. Drug design on the cell broadband engine. In PACT ’07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, page 425, Washington, DC, USA, 2007. IEEE Computer Society.

[80] Alex Settle, Joshua Kihm, Andrew Janiszewski, and Dan Connors. Architectural support for enhanced SMT job scheduling. In PACT ’04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 63–73, Washington, DC, USA, 2004. IEEE Computer Society.

[81] I. Sharapov, R. Kroeger, G. Delamater, R. Cheveresan, and M. Ramsay. A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application. In Proc. of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 81–89, New York, NY, March 2006.

[82] Suresh Siddha, Venkatesh Pallipadi, and Asit Mallick. Process Scheduling Challenges in the Era of Multi-core Processors. Intel Technology Journal, 11(4), 2007.

[83] Allan Snavely and Dean M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In ASPLOS-IX: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, pages 234–244, New York, NY, USA, 2000. ACM.

[84] Allan Snavely, Dean M. Tullsen, and Geoff Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In SIGMETRICS ’02: Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 66–76, New York, NY, USA, 2002. ACM.

[85] Robert Springer, David K. Lowenthal, Barry Rountree, and Vincent W. Freeh. Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. In PPoPP ’06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 230–238, New York, NY, USA, 2006. ACM.

[86] A. Stamatakis. Phylogenetic models of rate heterogeneity: A high performance computing perspective. In Proceedings of 20th IEEE/ACM International Parallel and Distributed Processing Symposium (IPDPS2006), High Performance Computational Biology Workshop, Proceedings on CD, Rhodos, Greece, April 2006.

[87] Alexandros Stamatakis. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics, page btl446, 2006.

[88] J. Subhlok and G. Vondran. Optimal Use of Mixed Task and Data Parallelism for Pipelined Computations. Journal of Parallel and Distributed Computing, 60(3):297–319, March 2000.

[89] J. Subhlok and B. Yang. A New Model for Integrated Nested Task and Data Parallelism. In Proc. of the 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 1–12, Las Vegas, NV, June 1997.

[90] Rajesh Sudarsan and Calvin J. Ribbens. ReSHAPE: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment, 2007.

[91] Alias Systems. Alias cloth technology demonstration for the cell processor, http://www.research.ibm.com/cell/whitepapers/alias_cloth.pdf. 2005.

[92] Cell broadband engine programming tutorial version 1.0; http://www-106.ibm.com/developerworks/eserver/library/es-archguide-v2.html.

[93] R. Thekkath and S. J. Eggers. Impact of sharing-based thread placement on multithreaded architectures. SIGARCH Comput. Archit. News, 22(2):176–186, 1994.

[94] John A. Turner. Roadrunner: Heterogeneous Petascale Computing for Predictive Simulation. Technical Report LANL-UR-07-1037, Los Alamos National Lab, Las Vegas, NV, February 2007. ASC Principal Investigator Meeting.

[95] L. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, August 1990.

[96] Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick Y. Yang, Guei-Yuan Lueh, and Hong Wang. Exochi: architecture and programming environment for a heterogeneous multi-core multithreaded system. In PLDI ’07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, pages 156–166, New York, NY, USA, 2007. ACM.

[97] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick. The Potential of the Cell Processor for Scientific Computing. In Proc. of the 3rd Conference on Computing Frontiers, pages 9–20, Ischia, Italy, June 2006.

[98] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Katherine Yelick. The Potential of the Cell Processor for Scientific Computing. ACM International Conference on Computing Frontiers, May 3-6 2006.

[99] Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, and Kather- ine Yelick. Scientific computing kernels on the cell processor. Int. J. Parallel Program., 35(3):263–298, 2007.

[100] Yun Zhang, Mihai Burcea, Victor Cheng, Ron Ho, and Michael Voss. An adaptive loop scheduler for hyperthreaded SMPs. In David A. Bader and Ashfaq A. Khokhar, editors, ISCA PDCS, pages 256–263. ISCA, 2004.

[101] Yun Zhang and Michael Voss. Runtime empirical selection of loop schedulers on hyperthreaded SMPs. In IPDPS ’05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS’05) - Papers, page 44.2, Washington, DC, USA, 2005. IEEE Computer Society.

[102] Y. Zhao and K. Kennedy. Dependence-based Code Generation for a Cell Processor. In Proc. of the 19th International Workshop on Languages and Compilers for Parallel Computing, New Orleans, LA, November 2006.
