
Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Xin Huo, B.S., M.S.

Graduate Program in the Department of Computer Science and Engineering

The Ohio State University

2014

Dissertation Committee:

Dr. Gagan Agrawal, Advisor

Dr. P. Sadayappan

Dr. Feng Qin

Copyright by

Xin Huo

2014

Abstract

Parallel computing architectures have become ubiquitous nowadays. In particular, Graphics Processing Units (GPUs), the Intel Xeon Phi co-processor (Many Integrated Core architecture), and heterogeneous coupled and decoupled CPU-GPU architectures have emerged as very significant players in high performance computing due to their performance, cost, and energy efficiency. Based on this trend, an important research issue is how to efficiently utilize these varieties of architectures to accelerate different kinds of applications.

Our overall goal is to provide parallel strategies and runtime support for modern GPUs, Intel Xeon Phi, and heterogeneous architectures to accelerate applications with different communication patterns. The challenges arise from two aspects, the application and the architecture. On the application side, not all kinds of applications can be easily ported to GPUs or Intel Xeon Phi with optimal performance. On the architecture side, the SIMT (Single Instruction Multiple Thread) and SIMD (Single Instruction Multiple Data) execution models employed by GPUs and Intel Xeon Phi, respectively, do not favor applications with data and control dependencies. Moreover, heterogeneous CPU-GPU architectures, i.e., the integrated CPU-GPU, bring new models and challenges for data sharing. Our efforts focus on four kinds of application patterns: generalized reduction, irregular reduction, stencil computation, and recursive applications, and explore mechanisms for parallelizing them on various parallel architectures, including GPUs, the Intel Xeon Phi architecture, and heterogeneous CPU-GPU architectures.

We first study the parallel strategies for generalized reductions on modern GPUs. The traditional mechanism to parallelize generalized reductions on GPUs is referred to as the full replication method, which assigns each thread an independent data copy to avoid data races and maximize parallelism. However, it introduces significant memory and result combination overhead, and is inapplicable for a large number of threads. In order to reduce memory overhead, we propose a locking method, which allows all threads in a block to share the same data copy, and supports fine-grained and coarse-grained locking access. The locking method succeeds in reducing memory overhead, but introduces thread contention. To achieve a tradeoff between memory overhead and thread contention, we also present a hybrid scheme.

Next, we investigate strategies for irregular or unstructured applications. One of the key challenges is how to utilize the limited-size shared memory on GPUs. We present a partitioning-based locking scheme, which chooses an appropriate partitioning of the reduction space so that all the required data can be placed in shared memory. Moreover, we propose an efficient partitioning method and a data reordering mechanism.

Further, to better utilize the computing capability of both the host and the accelerator, we extend our irregular parallel scheme to the heterogeneous CPU-GPU environment by introducing a multi-level partitioning scheme and a dynamic task scheduling framework, which supports pipelining between partitioning and computation, and work stealing between the CPU and the GPU. Then, motivated by the emergence of the integrated CPU-GPU architecture and its shared physical memory, we develop a thread block level scheduling framework for three communication patterns: generalized reduction, irregular reduction, and stencil computation. This novel scheduling framework introduces locking-free access between the CPU and the GPU, reduces command launching and synchronization overhead by removing extra synchronizations, and improves load balance.

Although support for unstructured control flow has been included in GPUs, the performance of recursive applications on modern GPUs is limited by the current thread reconvergence method in the SIMT execution model. To efficiently port recursive applications to GPUs, we present a novel dynamic thread reconvergence method, which allows threads to reconverge either before or after the static reconvergence point, and improves instructions per cycle (IPC) significantly.

The Intel Xeon Phi architecture is an emerging x86-based many-core coprocessor architecture with wide SIMD vectors. However, to fully utilize its potential, applications must be vectorized to leverage the wide SIMD lanes, in addition to exploiting effective large-scale shared memory parallelism. Compared to the SIMT execution model on GPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefited applications involving branches, irregular accesses, or even reductions in the past. We consider the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using the available SIMD parallelism. We offer an API for both shared memory and SIMD parallelization, and demonstrate its implementation. We use implementations of overloaded functions as a mechanism for providing SIMD code, assisted by runtime data reorganization and our methods to effectively manage control flow.

In addition, all proposed methods have been extensively evaluated with multiple applications and different data inputs.

This is dedicated to the ones I love: my parents and my fiancee.

Acknowledgments

During the last five years of pursuing my Ph.D., there have been so many people to whom I would like to give my sincerest thanks! Without their help and support, it would not have been possible to write this doctoral dissertation. Only some of them can be given particular mention here.

First and foremost, I would like to thank my advisor, Dr. Gagan Agrawal, for his uncompromising dedication to the development and education of his students. This dissertation would not have been possible without his guidance, support, and patience. During the last five years, he not only guided me in the research of computer systems and software, but also helped me develop a broad and strong foundation upon which a life pursuing computer science and technology will be built. His commitment not only taught me all the required qualities of a researcher but also inspired me as a person. The joy, passion, and enthusiasm he has for his research were contagious and motivational for me and my future career. I would also like to thank the members of my thesis committee, Drs. P. Sadayappan, Feng Qin, and Radu Teodorescu. It is through their active interest and incisive and critical analysis that I have been able to complete my research.

I would like to acknowledge the financial, academic, technical, and all other kinds of support from the National Science Foundation, The Ohio State University, and the Computer Science and Engineering Department, especially the head of our department, Dr. Xiaodong Zhang, and all the administrative staff of our department, especially Kathryn M. Reeves and Lynn Lyons. Also, I would like to extend my sincerest thanks to my internship mentor, Dr. Sriram Krishnamoorthy from Pacific Northwest National Laboratory.

I am truly grateful to the past and present members of my lab, the Data-Intensive and High Performance Computing Research Group. They gave me invaluable help and support in research, career, and life; they are: Wenjing Ma, David Chiu, Qian Zhu, Fan Wang, Vignesh Ravi, Jiang, Tantan, Tekin Bicer, Jiedan Zhu, Bin Ren, Su, Linchuan Chen, Wang, Mehmet Can Kurt, Mucahid Kutlu, Jiaqi Liu, and Sameh Shohdy.

Special thanks are given to my fiancee, Jingyan, who has always been by my side.

I am also thankful to my parents and all the rest of my family in China, as well as all of the friends that I have encountered throughout my life. It is with their love, support, and encouragement that this dissertation exists.

Vita

October 13th, 1984 ...... Born - Tianjin, China

2007 ...... B.S. Software Engineering, Beijing Institute of Technology, Beijing, China

2009 ...... M.S. Software Engineering, Beijing Institute of Technology, Beijing, China

2013 ...... M.S. Computer Science and Engineering, The Ohio State University, Columbus, USA

Publications

Research Publications

Xin Huo, Bin Ren, and Gagan Agrawal. “A Programming System for Xeon Phis with Runtime SIMD Parallelization”. In Proceedings of the 28th ACM International Conference on Supercomputing (ICS), June 2014.

Linchuan Chen, Xin Huo, and Gagan Agrawal. “Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores”. In Proceedings of the 23rd International Heterogeneity in Computing Workshop (HCW), May 2014.

Xin Huo, Sriram Krishnamoorthy, and Gagan Agrawal. “Efficient Scheduling of Recursive Control Flow on GPUs”. In Proceedings of the 27th ACM International Conference on Supercomputing (ICS), June 2013.

Linchuan Chen, Xin Huo, and Gagan Agrawal. “Accelerating MapReduce on a Coupled CPU-GPU Architecture”. In Proceedings of the 25th IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2012.

Xin Huo, Vignesh T. Ravi, and Gagan Agrawal. “Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations”. In Proceedings of the 18th IEEE International Conference on High Performance Computing (HiPC), December 2011.

Xin Huo, Vignesh T. Ravi, Wenjing Ma, and Gagan Agrawal. “An Execution Strategy and Optimized Runtime Support for Parallelizing Reductions on Modern GPUs”. In Proceedings of the 25th ACM International Conference on Supercomputing (ICS), June 2011.

Xin Huo, Vignesh T. Ravi, and Gagan Agrawal. “Approaches for Parallelizing Reductions on Modern GPUs”. In Proceedings of the 17th IEEE International Conference on High Performance Computing (HiPC), December 2010.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Page

Abstract ...... ii

Dedication ...... v

Acknowledgments ...... vi

Vita ...... viii

List of Tables ...... xiv

List of Figures ...... xv

1. Introduction ...... 1

1.1 Overview of Parallel Architectures ...... 3
1.1.1 Decoupled-GPUs and Coupled-GPUs ...... 3
1.1.2 Intel Xeon Phi Many-core Architecture ...... 6
1.2 Dissertation Contributions ...... 9
1.2.1 Approaches for Parallelizing Reductions on Modern GPUs ...... 9
1.2.2 An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs ...... 11
1.2.3 Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations ...... 12
1.2.4 Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture ...... 13
1.2.5 Efficient Scheduling of Recursive Control Flow on GPUs ...... 15
1.2.6 A Programming System for Xeon Phis with Runtime SIMD Parallelization ...... 17
1.3 Outline ...... 19

2. Approaches for Parallelizing Reductions on Modern GPUs ...... 21

2.1 Parallelization Approaches ...... 22
2.1.1 Generalized Reductions ...... 22
2.1.2 Full Replication ...... 24
2.1.3 Locking Scheme ...... 24
2.1.4 Hybrid Scheme ...... 30
2.2 Applications ...... 32
2.2.1 k-means Clustering ...... 32
2.2.2 PCA ...... 33
2.2.3 KNN Search ...... 33
2.3 Experimental Results ...... 36
2.3.1 Results from K-Means ...... 37
2.3.2 Results from PCA ...... 40
2.3.3 Results from KNN ...... 44
2.3.4 Discussion ...... 49
2.4 Related Work ...... 50
2.5 Summary ...... 51

3. An Execution Strategy and Optimized Runtime Support for Parallelizing Reductions on Modern GPUs ...... 52

3.1 Overview of Irregular Reductions ...... 52
3.2 Execution Strategy for Indirection Array Based Computations on GPUs ...... 54
3.2.1 Overheads with Existing Methods ...... 55
3.2.2 Partitioning-Based Locking Scheme ...... 56
3.3 Runtime Support ...... 62
3.3.1 Runtime Partitioning Approaches ...... 63
3.3.2 Iteration Reordering ...... 64
3.4 Experimental Results ...... 66
3.4.1 Experimental Platform ...... 66
3.4.2 Evaluation Goals ...... 66
3.4.3 Efficiency of Partitioning Based Locking Scheme ...... 67
3.4.4 Impact of Various Partitioning Schemes on Irregular Reductions ...... 69
3.4.5 Other Performance Issues ...... 73
3.4.6 Performance with Adaptive Execution ...... 75
3.5 Related Work ...... 76
3.6 Summary ...... 78

4. Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations ...... 80

4.1 Multi-level Partitioning Framework ...... 80
4.1.1 Multi-level Partitioning ...... 82
4.2 Runtime Support and Schemes ...... 85
4.2.1 Task Scheduling Framework ...... 86
4.2.2 Runtime Pipeline Scheme ...... 86
4.2.3 Optimized Task Scheduling ...... 88
4.3 Experimental Results ...... 89
4.3.1 Scalability of Irregular Applications on Multi-core CPU and GPU ...... 90
4.3.2 Performance Trade-offs with Computation and Reduction Space Partitioning ...... 92
4.3.3 Benefits From Pipelining ...... 95
4.3.4 Benefits From Dividing Computations Between CPU and GPU ...... 97
4.3.5 Discussion ...... 99
4.4 Related Work ...... 101
4.5 Summary ...... 102

5. Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture ...... 104

5.1 Proposed Scheduling Frameworks ...... 104
5.1.1 Scheduling Considerations and Existing Schemes ...... 104
5.1.2 Scheduling Requirements for a Coupled Architecture ...... 107
5.1.3 Master-Worker Scheduling Framework ...... 108
5.1.4 Token Scheduling Framework ...... 112
5.2 Experimental Results ...... 115
5.2.1 Experimental Platform and Goals ...... 115
5.2.2 Efficiency of Transmission on Fusion Architecture ...... 116
5.2.3 Effectiveness of Thread Block Level Scheduling Framework ...... 117
5.3 Related Work ...... 127
5.4 Summary ...... 128

6. Efficient Scheduling of Recursive Control Flow on GPUs ...... 130

6.1 Current Recursion Support on Modern GPUs ...... 130
6.1.1 Modern GPU Architectures and Thread Divergence ...... 130
6.1.2 Performance of Recursive Control Flow on GPUs ...... 132
6.2 Static Immediate Post-dominator Reconvergence ...... 134
6.2.1 Reconvergence in NVIDIA (Kepler) GPUs ...... 135
6.2.2 Frontier-Based Early Reconvergence ...... 136

6.2.3 Limitations for Recursive Functions ...... 138
6.3 Dynamic Reconvergence and Scheduling ...... 140
6.3.1 Dynamic Greedy Reconvergence ...... 140
6.3.2 Dynamic Majority-based Scheduling ...... 142
6.4 Implementation ...... 144
6.4.1 Distinguishing Branch Types ...... 146
6.4.2 Implementation of the New Schemes ...... 146
6.5 Experiment Evaluation ...... 152
6.5.1 Evaluation Goals ...... 153
6.5.2 Performance of Methods with Increasing Divergence ...... 154
6.5.3 Scalability with Respect to Warp Width ...... 156
6.6 Related Work ...... 161
6.7 Summary ...... 163

7. A Programming System for Xeon Phis with Runtime SIMD Parallelization . . 165

7.1 Parallelization and Performance Issues in Intel Xeon Phi ...... 166
7.1.1 Our Approach ...... 166
7.1.2 Challenges and Opportunities ...... 166
7.2 API for Application Development on Xeon Phi ...... 169
7.2.1 Shared Memory (MIMD) API ...... 169
7.2.2 SIMD API ...... 171
7.2.3 Sample Kernels ...... 174
7.3 Runtime Scheduling Framework ...... 178
7.3.1 MIMD Parallelization ...... 178
7.3.2 SIMD Parallelization Support ...... 180
7.4 Evaluation ...... 186
7.4.1 Benchmarks ...... 187
7.4.2 Speedups from Our Framework ...... 188
7.4.3 Overall Scalability ...... 192
7.4.4 Comparison with OpenMP ...... 194
7.5 Related Work ...... 195
7.6 Summary ...... 197

8. Conclusions ...... 198

8.1 Contributions ...... 198
8.2 Future Work ...... 201
8.2.1 Support SIMD Vectors with Different Lengths ...... 201
8.2.2 Support CPU-MIC Heterogeneous Environment ...... 202

Bibliography ...... 204

List of Tables

Table Page

1.1 Read / Write Speed on Device, Uncached and Cacheable Host Memory for CPU and GPU ...... 5

2.1 Kmeans: Comparison between sequential time and GPU time ...... 36

2.2 PCA: Comparison between sequential time and GPU time ...... 41

2.3 KNN: Comparison between sequential time and GPU time ...... 43

2.4 Profiler output when k = 10 ...... 45

2.5 Profiler output when k = 20 ...... 47

4.1 Dataset Characteristics of Euler and Molecular Dynamics ...... 93

5.1 The Structure of Task Interface ...... 109

6.1 GPGPU-sim Configuration ...... 145

7.1 Parallelization Challenges of Different Communication Patterns ...... 167

7.2 User Interface and MIMD Parallel Framework API ...... 169

7.3 The data types defined in SIMD API ...... 171

7.4 SIMD API ...... 173

List of Figures

Figure Page

1.1 Intel MIC Architecture ...... 7

2.1 Generalized Reduction Processing Structure ...... 22

2.2 Locking Scheme ...... 25

2.3 Updating a Float with an Atomic Operation ...... 26

2.4 Lock (a) and Unlock (b) Operations ...... 27

2.5 Example of Explicit Conditional Branch (a) and Warp Serialization (b) . . 29

2.6 Hybrid Scheme ...... 30

2.7 KNN on GPU ...... 34

2.8 Inconsistency in KNN ...... 35

2.9 Results from K-Means (k=10) ...... 37

2.10 Comparison of the Best Results from K-Means (k=10) ...... 38

2.11 Results from K-Means (k=100) ...... 39

2.12 Comparison of the Best Results from K-Means (k=100) ...... 40

2.13 Results from PCA (16x16) ...... 41

2.14 Comparison of the Best Results from PCA (16x16) ...... 42

2.15 Results from PCA (32x32) ...... 43

2.16 Comparison of the Best Results from PCA (32x32) ...... 44

2.17 Results from KNN (k=10) ...... 45

2.18 Comparison of the Best Results from KNN (k=10) ...... 46

2.19 Results from KNN (k=20) ...... 47

2.20 Comparison of the Best Results from KNN (k=20) ...... 48

2.21 KNN - Comparison of Full Replication and Locking Scheme with different k 49

3.1 Typical Structure of Irregular Reduction ...... 52

3.2 Computation Space partitioning and reduction size in each partition . . . . 57

3.3 Reduction Space partitioning and computation size in each partition . . . . 57

3.4 Execution with Partitioning based Locking Scheme ...... 61

3.5 Runtime Iteration Reordering ...... 65

3.6 Euler: Comparison of PBL Scheme Over Conventional Strategies and Sequential CPU Execution ...... 67

3.7 Molecular Dynamics: Comparison of PBL Over Conventional Strategies and Sequential CPU Execution ...... 68

3.8 Cost Components of Partitioners (Euler) ...... 69

3.9 Euler: Workload with Metis, GPU, and Multi-dimensional Partitioners on 14, 28, and 42 Partitions ...... 71

3.10 Molecular Dynamics: Workload with Metis, GPU, and Multi-dimensional Partitioners on 14, 28, and 42 Partitions ...... 72

3.11 Comparison of Metis, GPU and Multi-dimensional using 28 Partitions for Euler (20K) ...... 73

3.12 Comparison of Metis, GPU and Multi-dimensional using 42 Partitions for Molecular Dynamics (37K) ...... 74

3.13 Shared Memory Preferred (14 Partitions) Vs. Cache Preferred (28 Partitions) - (left) Shared Memory Preferred (14 Partitions) Vs. Cache Preferred (42 Partitions) - (right) ...... 75

3.14 Comparison of MP, GP, and MD for Adaptive Molecular Dynamics (37K dataset, 42 partitions) ...... 76

4.1 Framework of Multi-level Partitioning ...... 82

4.2 Runtime Support Component for Partitioning ...... 84

4.3 Runtime system: Pipelining Scheme (Left) and Task Scheduling Scheme (Right) ...... 87

4.4 Scalability of Molecular Dynamics on Multi-core CPU and GPU across Different Datasets ...... 91

4.5 Scalability of Euler on Multi-core CPU and GPU across Different Datasets 92

4.6 Performance Trade-offs with Computation and Reduction Space Partitioning on Multi-core CPU (MD) ...... 94

4.7 Performance Trade-offs with Computation and Reduction Space Partitioning on GPU (MD) ...... 95

4.8 Effect of Pipelining CPU Partitioning and GPU Computation (MD) ...... 96

4.9 Effect of Pipelining CPU Partitioning and GPU Computation (EU) ...... 97

4.10 Benefits From Dividing Computations Between CPU and GPU for EU and MD ...... 98

4.11 Effect of Ratio of Partitioning to Computation ...... 99

4.12 Comparison of Fine-grained, Coarse-grained, and Work Stealing Strategies on Euler ...... 100

5.1 Dynamic Task Queue Scheduling Framework ...... 106

5.2 Master-worker Scheduling Framework ...... 108

5.3 Scheduling Master(DevList, TaskInfo) ...... 110

5.4 Scheduling ThreadBlocks(taskInterface) ...... 111

5.5 Token Scheduling Framework ...... 113

5.6 Relative Cost of Data Transmission on Fusion APU and Fermi GPU . . . . 117

5.7 Execution times for Jacobi: Token, Master-worker, Dynamic Task Queue, Optimal Static, CPU-Only and GPU-Only ...... 118

5.8 Execution times for Jacobi in Dynamic Task Queue with different task sizes: Copy DTQ, Uncopy DTQ, and Average execution time per row for CPU and GPU in different versions ...... 119

5.9 Execution times for Jacobi with different task sizes: Master-worker, Unblocking Token and Blocking Token ...... 120

5.10 Execution times for Kmeans: Token, Master-worker, Dynamic Task Queue, Optimal Static, CPU-Only and GPU-Only ...... 121

5.11 Execution times for Kmeans in Dynamic Task Queue with different task sizes: Copy DTQ, Uncopy DTQ, and Average execution time per point for CPU and GPU in different versions ...... 122

5.12 Execution times for Kmeans with different task sizes: Master-worker, Unblocking Token and Blocking Token ...... 123

5.13 Execution times for Molecular Dynamics on different partitions: Token, Master-worker, Dynamic Task Queue, Optimal Static, CPU-Only and GPU-Only ...... 124

5.14 Molecular Dynamics: Standard deviation of computation for different thread blocks with the same partition number in Dynamic Task Queue ...... 125

5.15 Execution times for Molecular Dynamics with different task sizes: Master-worker, Unblocking Token and Dynamic Task Queue, and standard deviation of workloads in DTQ ...... 126

6.1 An example of unstructured control flow and its immediate post-dominator ...... 131

6.2 Performance of Fibonacci and Binomial Coefficients benchmarks on Kepler (Tesla K20M) and Fermi (Tesla C2050) architectures: 20 Fib(24)+Fib(23) and 20 Bio(18,9)+Bio(18,8) invocations on one thread, two threads in matched pattern, and two threads in unmatched pattern ...... 133

6.3 Timing the steps on two threads computing the fourth and third Fibonacci numbers. F() in (c) and (d) denotes the Fibonacci function...... 135

6.4 Recursive Fibonacci code, its control flow graph, and execution steps to compute the 3rd and 2nd Fibonacci numbers on two threads under various scheduling mechanisms ...... 137

6.5 Fibonacci control flow graph with three threads and stack-based implementation of dynamic greedy reconvergence method ...... 144

6.6 stack::update_greedy(stack_t &m_stack, addr_t &next_pc, bool ret_first) ...... 148

6.7 stack::update_majority(stack_t &m_stack, addr_t &next_pc, bool if_threshold) ...... 151

6.8 An example of task rotation with 0, 1, 2, and 4 offsets on 8 threads . . . . . 153

6.9 IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods with 1, 8, 16, and 32 divergence degree in Fibonacci and Binomial Coefficients benchmarks (32 threads in a warp) ...... 154

6.10 IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods with 1, 8, 16, and 32 divergence degree in NQueens and Graph Coloring benchmarks (32 threads in a warp) ...... 154

6.11 IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods with 1, 8, 16, and 32 divergence degree in Tak and Mandelbrot benchmarks (32 threads in a warp) ...... 155

6.12 IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods while increasing the number of threads in a warp in Fibonacci and Binomial Coefficient benchmarks ...... 157

6.13 IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods while increasing the number of threads in a warp in N-Queens and Graph Coloring benchmarks ...... 157

6.14 IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods while increasing the number of threads in a warp in Tak and Mandelbrot benchmarks ...... 158

6.15 Distribution of scheduled warp sizes for a warp width of 16 ...... 160

7.1 Sobel: Stencil Computation with serial codes ...... 175

7.2 Sobel: Stencil Computation with SIMD API ...... 175

7.3 Sobel: Stencil Computation with manual vectorization ...... 176

7.4 Kmeans: Generalized Reduction with SIMD API ...... 177

7.5 An example of vectorization in overloaded functions ...... 181

7.6 An example of control flow (a) without vectorization (b) with vectorization (c) with vectorization and mask type ...... 184

7.7 Overall SIMD Parallelization Framework ...... 185

7.8 Speedup of Pthread without SIMD (Pthread-novec), Pthread with auto-SIMD (Pthread-vec), MIC SIMD with our framework (SIMD-API), and hand-written SIMD (SIMD-manual): Kmeans and NBC with small and large datasets each ...... 188

7.9 Speedup of Pthread without SIMD (Pthread-novec), Pthread with auto-SIMD (Pthread-vec), MIC SIMD with our framework (SIMD-API), and hand-written SIMD (SIMD-manual): Sobel and Heat3D with small and large datasets each ...... 189

7.10 Speedup of Pthread without SIMD (Pthread-novec), Pthread with auto-SIMD (Pthread-vec), MIC SIMD with our framework (SIMD-API), and hand-written SIMD (SIMD-manual): Euler and MD with small and large datasets each ...... 189

7.11 Scalability with Increasing Number of Threads: Pthread without vectorization (Pthread-novec), Pthread with auto-vectorization (Pthread-vec), SIMD with API (SIMD-API), and hand-written SIMD (SIMD-manual) with Kmeans and NBC (large datasets) - Relative Speedups Over 1 Thread Execution on Xeon Phi with no Vectorization ...... 192

7.12 Scalability with Increasing Number of Threads: Pthread without vectorization (Pthread-novec), Pthread with auto-vectorization (Pthread-vec), SIMD with API (SIMD-API), and hand-written SIMD (SIMD-manual) with Sobel and Heat3D (large datasets) - Relative Speedups Over 1 Thread Execution on Xeon Phi with no Vectorization ...... 192

7.13 Scalability with Increasing Number of Threads: Pthread without vectorization (Pthread-novec), Pthread with auto-vectorization (Pthread-vec), SIMD with API (SIMD-API), and hand-written SIMD (SIMD-manual) with Euler and MD (large datasets) - Relative Speedups Over 1 Thread Execution on Xeon Phi with no Vectorization ...... 193

7.14 Benefits of MIMD+SIMD Execution in our Framework (Comparison with OpenMP-vec - left) and MIMD-only execution (Comparison with OpenMP-novec - right) ...... 194

Chapter 1: Introduction

Our overall research work has been motivated by four aspects. First, modern architectures are trying to utilize the increasing density of transistors by increasing the number of processing cores, and using parallelism to improve application performance. One popular approach has been using Single Instruction Multiple Data (SIMD) designs, since they simplify control logic, and allow resources to be used towards supporting a larger number of cores. In recent years, the use of SIMD parallelism has been taken to another level by General Purpose Graphics Processing Units (GPGPUs) and the Intel Xeon Phi coprocessor. Modern GPUs have emerged as the means for achieving extreme-scale, cost-efficient, power-efficient high performance computing within the last few years. The newly emerging Intel Xeon Phi architecture utilizes wide SIMD vectors to further extend shared memory level parallelism. However, compared to traditional HPC platforms, the SIMD execution model employed by GPUs and Intel Xeon Phi still requires high expertise, and it is difficult to extract performance from these architectures.

The second aspect is based on the fact that the coprocessor is used as an accelerator connected to a host, which is usually equipped with multi-core CPUs and can also provide computing capability. Thus, the GPU and the CPU together create a heterogeneous environment. However, their different computing and memory characteristics lead to a series of challenges in fully utilizing both the CPU and the GPU to execute the same application. In addition, the development of coupled-GPUs, i.e., the AMD Fusion APU and Intel Ivy Bridge, provides a new methodology for sharing data in a single physical memory between the CPU and the GPU, but also a challenge in how to effectively utilize this new kind of heterogeneous architecture.

Third, the use of SIMT (Single Instruction Multiple Thread) parallelism in GPUs has been difficult for applications that involve complex control flow, since these applications either have limited data parallelism, or the control flow makes it more difficult to exploit data parallelism. One such class of applications, which has almost never been successfully mapped to SIMD instruction sets, comprises the applications programmed with recursion. The class of algorithms that are popularly or conveniently expressed using recursion is extremely large. Thus, as the speed of a single core does not improve and additional transistor density is at least partially targeted towards SIMD parallelism, speeding up this large class of recursive applications is a major challenge.

Last, use of SSE-like instruction sets has always been a hard problem, and it turns out that such parallelism has not been consistently used for applications outside dense matrix or imaging kernels. Moreover, there are significant programming differences between CUDA and SSE-like instruction sets, since they target SIMT and SIMD models, respectively. Specifically, while coalesced memory accesses are important for performance in SIMT programming, parallelism is still available even without them, whereas programmers need to explicitly create aligned and contiguous accesses in the case of SSE or IMCI. Similarly, while branches are automatically managed in SIMT, with masks internally implemented, programmers or compilers must identify instructions executed by all threads with SSE/IMCI.

Consequently, based on the above four challenges, our overall work focuses on the design of parallel strategies with task scheduling and runtime support for applications with different communication patterns on GPUs, heterogeneous CPU-GPU architectures, and the Intel Xeon Phi architecture.

In this chapter, we first introduce background knowledge about the parallel architectures, including GPUs and the Intel Xeon Phi architecture. Then, we give an overview of our work, including four parallel frameworks for different communication patterns and parallel architectures, and a hardware-level design to support recursive applications on modern GPUs.

1.1 Overview of Parallel Architectures

Parallel co-processors have emerged as a major component of the high performance computing landscape today. Some of the fastest machines in the world involve GPUs or the Intel Xeon Phi architecture. Specifically, in the Top 500 supercomputers list [4], the number of systems with co-processors increased by more than fivefold from 2010 to 2013, and in the latest list released in November 2013 [4], 12% of the systems use co-processors. The current No. 1 and No. 2 fastest supercomputers, i.e., Tianhe-2 and Titan, use Intel Xeon Phi and NVIDIA GPUs, respectively. Next, we introduce the background of the two kinds of GPUs we have been working on, and the Intel Xeon Phi architecture.

1.1.1 Decoupled-GPUs and Coupled-GPUs

Decoupled or discrete GPUs, as co-processors, provide large-scale parallel computing capability, and are connected to hosts through the PCIe bus. However, because the bandwidth of the PCIe bus is an obvious limitation for current decoupled-GPUs, a new trend is to integrate both the CPU and the GPU on a single chip, which is referred to as coupled-GPUs. As a result, there is a need to exploit the performance of both of these architectures. In the following, we introduce the parallel architectures of the decoupled NVIDIA Tesla GPU and the coupled AMD Fusion APU (Accelerated Processing Unit).

For both decoupled and coupled GPUs, the architecture consists of two major components, i.e., a multi-threading processing component and a multi-layer memory component.

Decoupled NVIDIA GPU Architecture: The processing component in a typical GPU is composed of a certain number of streaming multiprocessors (SMs). Each streaming multiprocessor, in turn, contains a set of simple cores (SPs) that perform in-order processing of the instructions. To achieve high performance, a large number of threads, typically a few tens of thousands, are launched. These threads execute the same operation on different sets of data. Furthermore, threads within a block are divided into multiple groups, termed warps in CUDA, or wavefronts in OpenCL. Each warp of threads is co-scheduled on the streaming multiprocessor and executes the same instruction in a given clock cycle (SIMT execution).

The memory component of a modern GPU-based computing system contains several layers. One is the host memory, which is the CPU main memory. This is essential as any GPGPU computation can only be launched from the CPU. The second layer is the device memory, which resides on the GPU card. It represents the global memory on a GPU and is accessible across all streaming multiprocessors. The device memory is interconnected with the host through PCI-Express, thus enabling DMA transfers between the host and the device memory. A scratch-pad memory, which is programmable and supports high-speed access, is available privately to each streaming multiprocessor. The scratch-pad memory is termed shared memory on NVIDIA cards.
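To make these layers concrete, the following minimal CUDA sketch (written for illustration; the kernel, array names, and sizes are hypothetical, not taken from this dissertation's codes) allocates device memory from the host, copies input across PCI-Express, and has each thread block stage a tile of the input in shared memory before computing on it. Each warp of 32 threads in the kernel executes in lock-step, as described above.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

#define TILE 256  // threads per block; one tile is staged in shared memory

// Each warp of 32 threads executes this kernel in lock-step (SIMT).
__global__ void scale_kernel(const float *in, float *out, int n, float factor) {
    __shared__ float tile[TILE];               // per-SM programmable cache
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];           // device (global) memory -> shared memory
    __syncthreads();                           // wait for the whole thread block
    if (gid < n)
        out[gid] = tile[threadIdx.x] * factor; // compute and write back to global memory
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes), *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;                       // device (global) memory on the GPU card
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // host -> device over PCI-Express

    scale_kernel<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n, 2.0f);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[42] = %f\n", h_out[42]);

    cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
    return 0;
}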

             DEVICE     UNCACHED     CACHEABLE
CPU READ     1 GB/s     1 GB/s       8-13 GB/s
CPU WRITE    8 GB/s     8-13 GB/s    8-13 GB/s
GPU READ     17 GB/s    6-12 GB/s    4.5 GB/s
GPU WRITE    12 GB/s    6-12 GB/s    5.5 GB/s

Table 1.1: Read / Write Speed on Device, Uncached and Cacheable Host Memory for CPU and GPU

The NVIDIA Fermi GPU is one that is widely used in high performance computing. The Fermi architecture features 14 or 16 multiprocessors, each of which contains 32 cores. The clock rate is 1.215 GHz, and the memory clock is 4458 MT/s. As compared to earlier systems, Fermi has a much larger shared memory/L1 cache, which can be configured as 48 KB shared memory and 16 KB L1 cache, or 16 KB shared memory and 48 KB L1 cache. In addition, a non-programmable L2 cache is also available on Fermi.

Coupled AMD Fusion APU Architecture: The Fusion architecture involves an integration of the CPU and the GPU in silicon. The processing component comprises a multi-core CPU and an on-chip GPU. The architecture of the on-chip GPU is similar to recent decoupled GPUs, such as those from NVIDIA. However, the memory component in Fusion is very distinct from that of NVIDIA GPUs. There is no physical device memory located on the GPU side in the Fusion architecture; instead, the system memory is virtually divided into two layers, which are referred to as device memory and host memory. The device memory is designed specifically for the GPU's needs, whereas the host memory is primarily for the CPU's use, though, as we will explain below, both of them can be accessed from either of the devices. Unlike a GPU connected through PCI-Express, there is a unified memory controller for both CPU and GPU devices.

Sharing of both device and host memory by the CPU and the GPU is facilitated by the notion of a zero copy buffer. If a buffer is declared to be of zero copy type, it can be accessed by both the GPU and the CPU. Moreover, while creating a zero copy buffer, it can be specified whether it should be stored in device memory or host memory. This decision has a significant impact on the performance achieved by CPU and GPU memory accesses, as we will explain below.

One complication while sharing the two sections of system memory arises because of the possibility of data being cached by the CPU cores. Therefore, in AMD Fusion, the host memory is further divided into cacheable memory and uncached memory. As the names suggest, data residing in cacheable memory can be cached, whereas data residing in the uncached memory cannot be. Correspondingly, code executing on the GPU has two options to access host memory. The first is through the Radeon memory bus (also referred to as the Garlic Bus), which is directly connected to the memory controller, but is not coherent. The other is the AMD Fusion compute link (also referred to as the Onion Bus), which is connected to a coherent request queue shared with the CPU cores. The Onion Bus needs to be used when the GPU wants to snoop the CPU cache, so it is a coherent bus. The data read and write bandwidths for different types of memory accesses are summarized in Table 1.1 [101].
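The zero copy buffers above are created through OpenCL on the Fusion APU. As a rough analogue in CUDA terms (the environment used for the NVIDIA GPUs in this dissertation), mapped pinned host memory likewise gives the CPU and the GPU a view of one allocation without an explicit copy. The sketch below only illustrates that shared-buffer idea under this assumption; it is not the Fusion/OpenCL API, and it says nothing about the Garlic or Onion bus behavior.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;                   // the GPU writes directly into host memory
}

int main() {
    const int n = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);    // enable mapped (zero-copy style) host memory

    int *h_buf;                               // pinned host allocation, visible to the GPU
    cudaHostAlloc(&h_buf, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_buf[i] = i; // the CPU initializes the buffer in place

    int *d_view;                              // device-side alias of the same allocation
    cudaHostGetDevicePointer((void **)&d_view, h_buf, 0);

    increment<<<(n + 255) / 256, 256>>>(d_view, n);
    cudaDeviceSynchronize();                  // no explicit cudaMemcpy is needed

    printf("h_buf[10] = %d\n", h_buf[10]);    // the CPU sees the GPU's update (11)
    cudaFreeHost(h_buf);
    return 0;
}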

1.1.2 Intel Xeon Phi Many-core Architecture

The x86-compatible Intel Xeon Phi coprocessor, which is the latest commercial release of the Intel Many Integrated Core (MIC) architecture, has already been incorporated in 9 of the top 100 supercomputers [4]. MIC is designed to leverage existing x86 experience and benefit from traditional multi-core parallelization programming models, libraries, and tools.

Figure 1.1: Intel MIC Architecture (in-order cores, each with a scalar unit, a vector unit, scalar and vector registers, instruction decode, and L1/L2 caches, connected in a ring with the memory controllers, special function units, and the system & I/O interface)

In the available MIC systems, as shown in Figure 1.1, there are 60 or 61 cores, interconnected in a topological ring structure. These cores are in-order, dual-issue x86 cores, each of which supports up to four hardware threads. In addition, 32 512-bit SIMD vector registers on each core provide extra SIMD operations; for instance, 16 single precision floating point computations can execute simultaneously. The main memory sizes vary from 8 GB to 16 GB, and the memory is shared by all cores. The L1 cache is 32 KB, private to each core, whereas each core also has a coherent 512 KB L2 cache, and the L2 caches of different cores are interconnected in a ring. Different from GPU co-processors, MIC runs a Linux system, which makes it support both native and offload execution models.

Next, we introduce the three important features of the Intel MIC architecture, which need to be exploited for achieving high performance:

Large Number of Concurrent Threads: Each core of Intel Xeon Phi supports up to 4 hardware threads; in other words, as many as 240/244 hardware threads sharing the same memory can run simultaneously. This large number of concurrent threads provides massive

Multiple Instruction Multiple Data (MIMD) parallelism with shared memory, which has not

been common in the past.

Coherent Distributed L2 Cache: The Intel Xeon Phi architecture uses a coherent L2 cache with ring interconnection. When an L2 cache miss occurs for a specific core, an address request is sent to the ring. If the address is found in another core's L2 cache, the corresponding data is forwarded back along the ring. In the worst case, the entire process may take hundreds of clock cycles. Thus, the coherent distributed L2 cache reduces the number of L2 cache misses, but even an L2 cache hit can be very expensive. Therefore, data locality is crucial for the overall performance.

Wide SIMD Registers and Vector Processing Units (VPU): The VPU has been treated as the most significant feature of Xeon Phi by many previous studies [81, 125, 99, 33]. The reason is that the Intel Xeon Phi co-processor has greatly increased the width of the SIMD lanes compared to previous Intel processors, i.e., 128-bit SSE or 256-bit AVX processors, which means that it is possible to execute 16 identical single precision (or 8 double precision) floating point operations at the same time. Additionally, a new 512-bit SIMD instruction set, referred to as the Intel Initial Many Core Instructions (Intel IMCI), is provided. The Intel IMCI instruction set extends the old SSE or AVX, and provides built-in gather and scatter operations that allow irregular memory accesses, a hardware-supported mask data type, and mask operations that allow operating on specific elements within the same SIMD register. Even though all of these new instructions could potentially be simulated by the programmers in the SSE model, explicit new instructions allow easier implementation and higher performance due to hardware support. Note that SIMD instructions can be generated by the ICC compiler through the auto-vectorization option, or the programmers could use the IMCI instruction set directly. The former needs low programming effort, but current compilation systems have several limitations and do not always obtain high performance. In comparison, the latter option can achieve the best performance; however, it is tedious and error prone, and creates non-portable code.
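To make the mask idea concrete, the scalar loop below mimics, in plain C++, what a single 16-wide masked add does: only the lanes whose mask bit is set are written, and the remaining lanes keep their old values. This is a conceptual sketch only; actual Xeon Phi code would issue one 512-bit IMCI instruction (via the compiler or intrinsics) rather than this loop, and the function and variable names here are illustrative.

#include <cstdint>
#include <cstdio>

// Conceptual model of one 16-wide masked SIMD add: out = mask ? a + b : out.
// A real IMCI implementation performs this as a single vector instruction.
void masked_add16(float *out, const float *a, const float *b, uint16_t mask) {
    for (int lane = 0; lane < 16; ++lane)
        if (mask & (1u << lane))
            out[lane] = a[lane] + b[lane];    // only active lanes are updated
}

int main() {
    float a[16], b[16], out[16];
    for (int i = 0; i < 16; ++i) { a[i] = (float)i; b[i] = 10.0f * i; out[i] = -1.0f; }
    masked_add16(out, a, b, 0x00FF);          // update only the low 8 lanes
    printf("out[3] = %.1f, out[12] = %.1f\n", out[3], out[12]);   // 33.0 and -1.0
    return 0;
}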

1.2 Dissertation Contributions

1.2.1 Approaches for Parallelizing Reductions on Modern GPUs

In this work, we focus on designing and evaluating parallelization methods for generalized reductions, in view of the availability of atomic operations on shared memory since CUDA version 1.2.

Prior to the availability of locking operations on GPUs, the only practical approach for parallelizing generalized reductions was to create a copy of the reduction object, or the set of values on which reductions are being performed, for each thread [87]. This was a reasonable approach for cases when the reduction object was small and the memory and global combination overheads were small. However, when the reduction object is large, simply replicating a copy of the reduction object can introduce significant overheads. A particular issue arises because of the shared memory on each GPU. The shared memory is a very small (typically 16 KB) programmer-controlled cache that allows accesses about 100-150 times faster than the device memory. With replication, the reduction object is unlikely to fit in shared memory, and data accesses may slow down considerably.

Use of atomic operations can be an attractive approach for this class of applications, especially since atomic operations on shared memory are quite efficient. As we are not replicating the reduction object, it can be placed in shared memory, and accesses to device memory can be replaced by accesses to shared memory. However, the main performance bottleneck with this approach is the possibility of contention among the threads, which can slow down the processing. In contrast to the full replication approach, this overhead is more significant when the number of independently accessed reduction elements is small.

Nevertheless, the number of threads in a GPU execution is typically very large.

Based on the observations listed above with respect to the use of replication and atomic operations, we have designed another method, referred to as the hybrid approach, since it involves a combination of both atomic operations and replication. A small number of copies of the reduction object are created. The number of replicas is typically chosen in such a way that these copies still fit in shared memory. The threads within a group use atomic operations to update one copy of the reduction object. However, because the threads are divided into several such groups, the contention is reduced. Furthermore, because the replicas fit in shared memory and the number of replicas is small, the global combination overhead is also small.
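A minimal CUDA sketch of these ideas follows; it is written for illustration (a histogram-style reduction with hypothetical names and sizes), not taken from the implementation evaluated in Chapter 2. The first kernel shows the locking scheme: one reduction object per thread block in shared memory, updated with atomic operations. The second shows the hybrid idea: a few shared-memory replicas per block, each updated by a group of threads and combined at the end. Floating-point atomicAdd on shared memory assumes a Fermi-class (compute capability 2.0) or newer GPU.

#include <cuda_runtime.h>

#define BINS 128     // number of elements in the (small) reduction object
#define REPLICAS 4   // hybrid scheme: shared-memory copies per thread block

// Locking scheme: all threads of a block share one copy and use atomics.
__global__ void reduce_locking(const int *keys, const float *vals,
                               int n, float *global_obj) {
    __shared__ float obj[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) obj[b] = 0.0f;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&obj[keys[i] % BINS], vals[i]);   // contended shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&global_obj[b], obj[b]);          // global combination, one copy per block
}

// Hybrid scheme: each group of threads updates its own replica, so contention
// is reduced while the replicas still fit in shared memory.
__global__ void reduce_hybrid(const int *keys, const float *vals,
                              int n, float *global_obj) {
    __shared__ float obj[REPLICAS][BINS];
    for (int b = threadIdx.x; b < REPLICAS * BINS; b += blockDim.x)
        obj[b / BINS][b % BINS] = 0.0f;
    __syncthreads();

    int rep = threadIdx.x % REPLICAS;               // this thread's replica
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&obj[rep][keys[i] % BINS], vals[i]);
    __syncthreads();

    for (int b = threadIdx.x; b < BINS; b += blockDim.x) {
        float sum = 0.0f;                           // combine the replicas within the block
        for (int r = 0; r < REPLICAS; ++r) sum += obj[r][b];
        atomicAdd(&global_obj[b], sum);             // then combine across blocks
    }
}

Full replication would instead give every thread its own copy of obj in device memory, eliminating the atomics but paying the memory and combination costs described above.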

Overall, the contributions of this work are as follows:

• First, we discuss and compare the tradeoffs between different parallel strategies, the full replication and locking methods. Our evaluations show how the performance varies across strategies with the characteristics of the applications and input data sizes.

• Second, we propose a fine-grained locking scheme, which is not supported prior to CUDA version 1.3, and a robust deadlock-free coarse-grained locking scheme, with two implementations, explicit conditional branch and explicit warp serialization.

• Third, we introduce a hybrid parallel strategy, which achieves a balance between the full replication and locking schemes, by introducing an intermediate layer in the thread configuration.

1.2.2 An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs

Despite the popularity of CUDA, there are significant challenges in porting different classes of HPC applications to modern GPUs. One of the 13 dwarfs in the Berkeley view on parallel computing is irregular applications arising from unstructured grids. Considering the importance of irregular reductions in scientific and engineering codes, substantial effort was made in developing compiler and runtime support for parallelization or optimization of these codes in the previous two decades, with different efforts targeting distributed memory machines, distributed shared memory machines, shared memory machines, or cache performance improvement on uniprocessor machines. However, there have not been any systematic studies on parallelizing these applications on modern GPUs.

There are at least two significant challenges associated with porting irregular applications to modern GPUs. The first is related to correct and efficient parallelization while using a large number of threads. Updates in indirection array based computations lead to race conditions while parallelizing the computation. The previous work on shared memory systems assumed a relatively small number of concurrent threads, whereas obtaining efficiency on GPUs requires that the computation be divided among a large number of threads. The second challenge is effective use of shared memory, which is a programmable cache on modern GPUs. Judicious use of the shared memory has been shown to be critical in obtaining high performance on GPUs. Since data accesses cannot be determined statically, runtime partitioning methods are needed for effectively using the shared memory.
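For reference, the loop structure at issue looks roughly like the sequential sketch below (the names are illustrative; molecular dynamics over interaction lists and unstructured-mesh Euler solvers have the same shape). Because two edges can touch the same node, a naive assignment of edge iterations to GPU threads makes the updates to force[] race, which is what the partitioning-based locking scheme summarized next is designed to handle.

// Typical irregular (indirection array based) reduction, sequential form.
// edge_node[2*e] and edge_node[2*e+1] name the two nodes touched by edge e;
// the access pattern is known only at runtime, never at compile time.
void irregular_reduction(int num_edges, const int *edge_node,
                         const double *node_attr, double *force) {
    for (int e = 0; e < num_edges; ++e) {          // computation space: edges
        int n1 = edge_node[2 * e];                 // indirection array accesses
        int n2 = edge_node[2 * e + 1];
        double contrib = 0.5 * (node_attr[n1] - node_attr[n2]);
        force[n1] += contrib;                      // reduction space: nodes;
        force[n2] -= contrib;                      // concurrent edges may collide here
    }
}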

Overall, the contributions of this work are as follows:

• We present a novel execution strategy to parallelize this class of applications on GPUs and to effectively use shared memory. Our strategy is referred to as the partitioning-based locking scheme, and involves creating disjoint partitions of the reduction space to use shared memory.

• We have also designed and implemented runtime modules to support our execution strategy. This includes a new partitioning scheme, multi-dimensional partitioning, which has very low overheads, and efficient modules for reordering the data.

• Using two popular irregular applications, Euler and Molecular Dynamics, we have carried out a detailed experimental study. We show how we clearly outperform other parallelization schemes, and that our proposed multi-dimensional partitioning scheme gives the best trade-offs in terms of partitioning overheads and the resulting application execution time. We also show that our runtime methods are efficient enough to be used for adaptive irregular applications.

1.2.3 Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations

Despite the immense potential of CPU-GPU based heterogeneous architectures, today this kind of system still poses many challenges for programmability and application development [54]. The multi-core CPUs and the GPUs need to be programmed with different languages, and moreover, dividing work between the CPU and the GPU is not trivial. There have been a limited number of efforts in utilizing a combination of CPUs and GPUs to accelerate application performance [119, 30, 86, 109, 13]. OpenCL [3] has also been proposed to enable the programming of applications for heterogeneous architectures. One of the limitations of these efforts is that they have all considered a limited number of application classes, and more effort is needed for mapping other application classes to the emerging CPU-GPU systems.

In our work, we focus on effectively using a system comprising a multi-core CPU and a GPU for irregular applications. In the process, we overcome a key limitation of our prior work, which assumes that the dataset must fit into the GPU's limited device memory. For using a multi-core CPU to further accelerate the application, we make two observations. First, by effectively pipelining the partitioning of the reduction space (performed on the CPU) and the computations, we can eliminate most of the overhead of the partitioning. Second, by partitioning the reduction space at multiple levels, we can divide the computations between the multi-core CPU and the GPU.
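The first observation can be illustrated with a generic host-side sketch: while the GPU reduces over partition p, the CPU builds partition p+1 and stages its transfer in a second CUDA stream. This is only a schematic of the overlap idea under simplifying assumptions (double buffering in pinned host memory, a fixed upper bound on partition size); the helper routines partition_on_cpu and partition_kernel are hypothetical placeholders, not the runtime system described in Chapter 4.

#include <cuda_runtime.h>

// Hypothetical placeholders: build partition p on the CPU, and a kernel that
// reduces over one partition already resident in device memory.
__global__ void partition_kernel(const float *buf, int size) { /* per-partition reduction body omitted */ }

void partition_on_cpu(int p, float *staging, int *out_size) {
    *out_size = 1024;                               // illustrative partition size
    for (int i = 0; i < *out_size; ++i) staging[i] = (float)(p + i);
}

// d_buf and h_staging are double buffers of max_bytes each; h_staging should be
// pinned (cudaHostAlloc) so that cudaMemcpyAsync can overlap with computation.
void pipelined_execution(int num_partitions, float *d_buf[2],
                         float *h_staging[2], size_t max_bytes) {
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);
    int size[2];

    partition_on_cpu(0, h_staging[0], &size[0]);    // prime the pipeline
    cudaMemcpyAsync(d_buf[0], h_staging[0], max_bytes,
                    cudaMemcpyHostToDevice, stream[0]);

    for (int p = 0; p < num_partitions; ++p) {
        int cur = p % 2, nxt = (p + 1) % 2;
        partition_kernel<<<(size[cur] + 255) / 256, 256, 0, stream[cur]>>>(
            d_buf[cur], size[cur]);                 // GPU works on partition p

        if (p + 1 < num_partitions) {               // CPU overlaps: build partition p+1
            partition_on_cpu(p + 1, h_staging[nxt], &size[nxt]);
            cudaMemcpyAsync(d_buf[nxt], h_staging[nxt], max_bytes,
                            cudaMemcpyHostToDevice, stream[nxt]);
        }
        cudaStreamSynchronize(stream[cur]);         // wait only for partition p
    }
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}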

Overall, the contributions of this work are as follows:

• First, we present an overall framework to parallelize irregular reductions on a heterogeneous architecture composed of a multi-core CPU and a GPU. Our framework is referred to as the Multi-Level Partitioning Framework, which involves two levels of partitioning. The underlying scheme addresses the memory limitation of a GPU and supports division of the work between the cores of the GPU and the cores of the CPU.

• As part of this framework, we propose a Pipelining Scheme which overlaps partitioning with computation to reduce the partitioning overheads.

• We have also designed and implemented a work stealing based distribution strategy between the CPU and the GPU, in order to achieve a good load balance.

1.2.4 Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture

Even though GPUs connected through PCI-Express are a critical piece of the high-end computing landscape today, computer architectures have already taken another step towards more effective utilization of the GPUs. This is in the form of integrated chips, where a multi-core CPU and a GPU are integrated in silicon. AMD, Intel, and NVIDIA have all either announced or released chips which have these features. Such integrated chips offer all the benefits of CPU-GPU nodes, while also having an added advantage, which is that passing parameters and invoking a GPU kernel no longer requires communication over the PCI-Express bus. Thus, one can expect better utilization of the GPUs, and for a wider variety of applications.

In this work, we consider the problem of accelerating a single computation on an integrated CPU-GPU node, using the multi-core CPU and the GPU simultaneously. Specifically, we focus on answering the following important questions: 1) How can we use an integrated CPU-GPU for accelerating data parallel applications involving different communication patterns? 2) What kind of opportunities and performance advantages does an integrated CPU-GPU chip offer over the decoupled CPU-GPU architectures? 3) What level of performance improvements are possible by using a GPU and a CPU simultaneously for a single application on a Fusion chip, over using only one of the CPU or the GPU?

In designing a solution for dynamically distributing the work between CPU cores and GPU cores, we develop a task scheduling framework, which we refer to as the thread block level scheduling framework. We present two implementations of this scheduling framework. The first is based on the master-worker model, and employs a locking-free communication mechanism. The second involves use of token passing to eliminate the need for a master. Though our current implementation has been optimized for the AMD Fusion chip, our design can handle increasing intra-node complexity and heterogeneity, which we expect to be the norm in the emerging systems.

Overall, the contributions of this work are as follows:

• We propose a thread block level scheduling framework, which can minimize the command launching overhead, and favors very fine-grained task scheduling.

• A locking-free task scheduling strategy is designed and implemented, both for efficiency and because there are no coherent atomic operations when the CPU and the GPU access shared memory. This strategy also provides a balance for fast access to shared data between the CPU and the GPU.

• Two thread block level scheduling implementations, master-worker and token scheduling, are provided. The token scheduling is optimized to remove the constant master; instead, any core from the GPU or the CPU has a fair chance to be the master.

• Both inter- and intra-device load balance can be gained with thread block level scheduling.

1.2.5 Efficient Scheduling of Recursive Control Flow on GPUs

The use of SIMD parallelism, including in GPUs, has been limited for applications that involve complex control flow, since these applications either have limited data parallelism, or the control flow makes it harder to exploit data parallelism. One such class of applications, which has almost never been successfully mapped to SIMD instruction sets, comprises the applications programmed with recursion.

This work focuses on the problem of obtaining high performance from recursive applications on modern GPU architectures, through better hardware-level thread scheduling. To motivate this problem, consider the following background. GPGPUs did not even support recursive applications for a considerable amount of time. In the case of NVIDIA GPUs, support for recursion has been available only from CUDA SDK 3.1 and compute capability 2.0. AMD GPUs do not yet support recursion.

The existing programming models for GPUs, such as CUDA, OpenCL, and DirectX, all employ the Single Instruction Multiple Threads (SIMT) paradigm. In this paradigm, though programmers can write code assuming independent execution of threads, the actual execution is in SIMD fashion, to match the SIMD architecture. Particularly, a modern GPU comprises a set of streaming multiprocessors (SMs), within which a batch of threads, referred to as a warp in CUDA, or a wavefront in OpenCL, is executed in a lock-step manner. If there is divergence between the threads in the same warp, the threads executing different branches have to be scheduled serially. The result is that while very regular applications, like dense matrix or structured grid computations and graphics applications, use the GPU architecture extremely well, for applications with a high degree of branch divergence, the speedups can be very limited. Thus, even if the GPU allows execution of recursive applications, the control flow involved with recursive calls and the resulting thread divergence is likely to limit performance.

To help improve the state of the art, we study the behavior of various thread scheduling strategies in the context of recursive control flow. We first show that using immediate post-dominator based reconvergence, a variation of which is used in current NVIDIA GPUs, to schedule threads executing recursive computations in a warp results in poor performance. We then consider the notion of dynamic reconvergence, where threads can reconverge either before or after the traditional immediate post-dominator. We introduce two dynamic reconvergence implementations and their variations, namely the greedy method, the greedy return-first method, the majority method, and the majority threshold method.

We implement the various dynamic reconvergence methods using GPGPU-sim, a cycle-level GPU performance simulator [11]. Through extensive evaluation, we show that our dynamic reconvergence mechanisms obtain significant performance improvements on a set of six recursion benchmarks, which have different characteristics. The majority method achieved speedups from 1.05 to 6.00 for the Fibonacci, Binomial Coefficients, NQueens, Graph Coloring, Tak function, and Mandelbrot benchmarks over the immediate post-dominator reconvergence method.
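As a concrete instance of such control flow, the toy kernel below (written for illustration; it is not the benchmark implementation evaluated later) calls a recursive device function. Device-side recursion of this kind requires compute capability 2.0 or higher and enough per-thread stack. Because every thread in the warp receives a different argument, the threads take different call paths, and the SIMT hardware must serialize the divergent branches, which is exactly the behavior the reconvergence methods above target.

#include <cuda_runtime.h>
#include <cstdio>

// Recursive device function; allowed on compute capability >= 2.0.
__device__ int fib(int n) {
    if (n < 2) return n;                  // threads with small n return early
    return fib(n - 1) + fib(n - 2);       // others keep recursing: warp divergence
}

__global__ void fib_kernel(int *out, int base) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = fib(base + (tid % 8));     // a different recursion depth per thread in a warp
}

int main() {
    const int n = 32;                     // a single warp
    int h_out[n], *d_out;
    cudaDeviceSetLimit(cudaLimitStackSize, 4096);   // room for the recursion
    cudaMalloc(&d_out, n * sizeof(int));
    fib_kernel<<<1, n>>>(d_out, 10);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("fib(10) = %d, fib(17) = %d\n", h_out[0], h_out[7]);   // 55 and 1597
    cudaFree(d_out);
    return 0;
}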

Overall, the contributions of this work are as follows:

• We study the current intra-warp thread scheduling for recursive applications on Fermi and Kepler GPUs, and discuss how the performance is limited by the immediate post-dominator reconvergence method.

• We propose a dynamic reconvergence strategy, which allows threads to reconverge either before or after the static reconvergence point.

• Two kinds of implementations, the greedy method and the majority method, and their variations, are implemented on the GPGPU-sim simulator. The evaluation shows that our dynamic reconvergence methods reduce the percentage of sequential execution time significantly for most of the applications.

1.2.6 A Programming System for Xeon Phis with Runtime SIMD Parallelization

Xeon Phi is a promising system, because it allows x86 compatible software to be used.

Thus, users could potentially continue to use their MPI and/or OpenMP applications, and not have to program in (and learn) a new language like OpenCL or CUDA for the use of accelerators. At the same time, there are many similarities between GPUs and Xeon Phis.

Both of these systems have a small amount of memory per thread/core, and moreover, both of them extensively employ a form of SIMD parallelism. NVIDIA GPUs have relied on

the SIMT model. Xeon Phi is built on top of the long-existing Intel SSE (Streaming SIMD

Instructions), and particularly supports the IMCI (Initial Many Core Instructions) instruction set for the use of SIMD. The SIMD width has been extended to 512 bits (16 floats), potentially offering large benefits for applications.

Use of SSE-like instruction sets has always been a hard problem, and it turns out that such parallelism has not been consistently used for applications outside dense matrix or imaging kernels. Moreover, there are significant programming differences between CUDA and SSE-like instruction sets, since they target SIMT and SIMD models, respectively.

Specifically, while coalesced memory accesses are important for performance in SIMT programming, parallelism is still available without them, whereas programmers need to explicitly create aligned and contiguous accesses in the case of SSE or IMCI. Similarly, while branches are automatically managed in SIMT, with masks implemented internally, programmers or compilers must explicitly identify the instructions executed by all threads with SSE/IMCI.
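To make the contrast concrete, the following sketch shows the explicit style of CPU-side SIMD programming. It is written with AVX-512 intrinsics, which we use here only as a stand-in for IMCI-style 512-bit programming (the exact intrinsics and alignment rules of IMCI differ); the function and variable names are our own:

    #include <immintrin.h>

    // Element-wise a[i] += b[i], 16 floats (512 bits) at a time. Unlike SIMT
    // code, the programmer must arrange contiguous accesses and handle the
    // remainder when n is not a multiple of the SIMD width.
    void vadd(float *a, const float *b, int n) {
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(a + i, _mm512_add_ps(va, vb));
        }
        for (; i < n; i++)          // scalar remainder loop
            a[i] += b[i];
    }

Under the SIMT model, the same computation would simply be written as a scalar per-thread statement, with the hardware handling lane masking and, given suitable indexing, memory coalescing.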

Effectively exploiting the power of a coprocessor like Xeon Phi requires that we exploit both MIMD and SIMD parallelism. While the former can be done through Pthreads or

OpenMP, it is much harder to extract SIMD performance. This is because the restrictions of the model make hand parallelization very hard. At the same time, production compilers are unable to exploit SIMD parallelism in many cases.

This work focuses on the problem of application development on any system that supports both shared memory parallelism and SSE-like SIMD parallelism, with a specific emphasis on the Intel Xeon Phi system. We describe an API and a runtime system that help extract both shared memory and SIMD parallelism. One of the key ideas in our approach is to exploit information about the underlying communication patterns, to both partition

and schedule the computation for MIMD parallelism, and reorganize the data for achieving better SIMD parallelism. While our approach is general, we currently focus on stencil computations.

Overall, the contributions of this work are as follows:

• We provide an end-to-end application development system for the Xeon Phi architecture, or more broadly, any system with both shared memory and SIMD parallelism.

• Our work can be viewed as providing a CUDA or OpenCL-like programming API for SSE-like instructions, where the responsibility for determining contiguous vs. non-contiguous accesses or managing conditionals is handled by the underlying library.

• We offer a potential intermediate language which may be generated by a compiler (for example, by systems similar to the ones that generate CUDA code), with runtime transformations and libraries subsequently used for SIMD parallelization. Compared to existing code generation approaches, we can simplify the SIMD compilation process and make it more portable.

1.3 Outline

The rest of the dissertation is organized as follows. In Chapter 2, we introduce parallel strategies for generalized reductions on modern GPUs. In Chapter 3, we present a parallel framework and runtime support for irregular reductions; we compare different partitioning methodologies, and provide a runtime partitioning method and a reordering module to better utilize the GPU's shared memory. Then, in Chapter 4, we extend our irregular reduction framework to heterogeneous CPU-GPU configurations, introducing a new multi-level partitioning method and scheduling module. To study the newly emerging integrated CPU

and GPU architecture, we propose and compare different scheduling frameworks on the integrated Fusion APU for applications with different communication patterns in Chapter 5.

In Chapter 6, to improve the performance of unstructured control flow, particularly recursion, on SIMT GPUs, we present a hardware-level strategy to dynamically reconverge diverged threads. In Chapter 7, we propose a MIMD and SIMD parallel framework with an automatic SIMD API for the Intel Xeon Phi architecture, targeting different communication patterns. Chapter 8 concludes this dissertation and describes potential future work.

Chapter 2: Approaches for Parallelizing Reductions on Modern GPUs

Since the introduction of CUDA 1.0 on NVIDIA G8X cards, GPU hardware and software has evolved rapidly. New GPU products and successive versions of CUDA added new functionality, for example, support for double precision operations. One such functionality is the support for atomic operations. Atomic operations on words in device memory were introduced with CUDA version 1.1, and atomic operations on words in shared memory were introduced with CUDA version 1.2.

In this work, we focus on designing and evaluating parallelization methods for a particular class of applications, in view of the availability of atomic operations on shared memory. Our target application class is the one where the communication pattern is based on generalized reductions, more recently known as the Map-Reduce [28] class of applications. This application class has been considered one of the dwarfs in the Berkeley view on parallel processing. A large number of data mining applications follow this processing structure [63]. Many runtime and compilation efforts targeting scientific applications have also actively considered this processing structure [5, 15, 48, 49, 78, 84, 104, 132]. Since the emergence of GPGPUs, several popular data mining applications involving generalized reductions have been ported to GPUs, with impressive results [18, 105, 43, 47, 87].

2.1 Parallelization Approaches

In this section, we first describe the generalized reduction structure and the challenges in parallelizing this class of applications on GPUs.

2.1.1 Generalized Reductions

{* Outer Sequential Loop *}
While () {
    {* Reduction Loop *}
    Foreach (element e) {
        (i, val) = process(e);
        Reduc(i) = Reduc(i) op val;
    }
}

Figure 2.1: Generalized Reduction Processing Structure

The common structure behind these algorithms is summarized in Figure 2.1. The function op is an associative and commutative function. Thus, the iterations of the foreach loop can be performed in any order. The data structure Reduc is referred to as the reduction object. The reduction performed is, however, irregular, in the sense that the specific elements of the reduction object that are updated depend upon the results of processing an element.

A specific example of such a structure is as follows. In k-means clustering [62], each iteration involves processing each point in the dataset. For each point, we determine the closest center to this point, and compute how this center's location should be updated.
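To make the mapping to Figure 2.1 concrete, a sequential C sketch of the k-means reduction loop is shown below; the types and helper structure are our own illustration, assuming 3-dimensional points (matching the dataset used later in this chapter):

    #include <float.h>

    typedef struct { double coord[3]; } Point;
    typedef struct { double sum[3]; int count; } Accum;   /* one reduction object entry */

    /* One pass of the reduction loop for k-means over 3-dimensional points. */
    void kmeans_pass(const Point *points, int numPoints,
                     const Point *centers, int k, Accum *reduc) {
        for (int p = 0; p < numPoints; p++) {
            /* process(e): find the closest center */
            int best = 0; double bestDist = DBL_MAX;
            for (int c = 0; c < k; c++) {
                double dist = 0.0;
                for (int d = 0; d < 3; d++) {
                    double diff = points[p].coord[d] - centers[c].coord[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            /* Reduc(i) = Reduc(i) op val: accumulate the point into its cluster */
            for (int d = 0; d < 3; d++)
                reduc[best].sum[d] += points[p].coord[d];
            reduc[best].count += 1;
        }
    }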

For algorithms following such a generalized reduction structure, parallelization can be done by dividing the data instances (or records or transactions) among the processing threads. If we have a shared address space, the main correctness challenge arises because of possible race conditions when multiple processors update the same element of the reduction object. The element of the reduction object that is updated in a loop iteration (i) is determined only as a result of the processing. For example, in the k-means clustering algorithm, first the cluster to which a data item belongs is determined. Then, the center of this cluster is updated using a reduction operation. It is not possible to statically partition the reduction object so that different processors update disjoint portions of the collection.

Thus, race conditions must be avoided at runtime. The execution time of the function process can be a significant part of the execution time of an iteration of the loop. Thus, runtime preprocessing or scheduling techniques cannot be applied. Furthermore, in many of the algorithms, the size of the reduction object can be quite large. This means that the reduction object cannot be replicated or privatized without significant memory overheads.

Besides the normal issues in parallelizing this class of applications on any shared memory platform [65, 64, 107], there are specific issues related to features of current GPUs and CUDA. For instance, CUDA does not support atomic operations on floating point numbers on currently available devices, i.e., until devices with compute capability 2.0 become available.

As another issue, the number of threads needed for effective parallelization on a GPU is extremely large, i.e., of the order of a few tens of thousands. Thus, the problem of shared memory parallelization on GPUs involves additional challenges, beyond normal multi-core or SMP machine related issues.

We now describe three techniques we have developed and evaluated. They are Full Replication, the Locking Scheme, and the Hybrid Scheme.

2.1.2 Full Replication

In any shared memory system, one simple approach for avoiding race conditions is that each thread keeps its own replica of the reduction object in device memory, and the work is done separately by each thread. At the end of each iteration, a global combination is done either by a single thread, or using a tree structure. Then, the finalized reduction objects are copied to host memory.

More specifically, three steps are involved in the local reduction phase: read a data block, compute a reduction object update based on the data instance, and write the reduction object update. In the first step, the data to be processed is copied from host to device memory, followed by allocation of the reduction objects and other data structures to be used during the course of the computation. In the second step, the multi-threaded reduction operation is performed on the device. Since each thread has its own replica of the reduction object, race conditions while accessing the same object are avoided. In the last step, the reduction objects can be copied back to host memory, and a global combination can be performed. Alternatively, the global combination can be performed on the device as well.
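A minimal CUDA sketch of this scheme is given below; the kernel, array names, and the per-element update are illustrative only, and the buffer reducCopies is assumed to be allocated and zero-initialized on the device with one reduction object per thread:

    // Each thread updates its private replica; thread 0 of each block then
    // folds the block's replicas together, and the cross-block combination is
    // performed on the host after the results are copied back.
    __global__ void fullReplication(const float *data, int n,
                                    float *reducCopies, int reducSize) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        float *mine = reducCopies + tid * reducSize;   // private replica

        for (int e = tid; e < n; e += nthreads) {
            int i = (int)data[e] % reducSize;          // stand-in for process(e),
            mine[i] += data[e];                        // assuming non-negative data
        }
        __syncthreads();

        if (threadIdx.x == 0) {                        // per-block combination
            float *blockBase = reducCopies + blockIdx.x * blockDim.x * reducSize;
            for (int t = 1; t < blockDim.x; t++)
                for (int i = 0; i < reducSize; i++)
                    blockBase[i] += blockBase[t * reducSize + i];
        }
    }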

There are several aspects that could affect the performance of data mining applications on the GPU, which include the size of the reduction object, the amount of computation compared with the amount of data copied between devices, and whether global data can be copied to shared memory.

2.1.3 Locking Scheme

This scheme is based on the atomic operations provided by CUDA. As we have stated above, current NVIDIA devices do not support atomic operations on floating point numbers.

Hence, a wrapper atomic operation has to be written to support locking on updates to


Figure 2.2: Locking Scheme

floating point numbers. Another important issue is related to the memory hierarchy of

GPUs. The small programmer-controlled cache available on each multi-processor, i.e., the

shared memory, is not only significantly faster for normal data accesses, but for atomic operations as well. In our early experiences, we found that atomic operations on device memory are very slow, and any scheme based on these cannot be a practical option.

Now, we describe the overall idea behind our lock-based scheme. The logical execution grid of a CUDA program is organized as a set of thread blocks (B), where each block contains T threads. Each block is scheduled on a multi-processor, which has its own shared memory (16 KB on current devices). There is no synchronization mechanism available across the shared memories of different multiprocessors (i.e., across different thread blocks).

Thus, we place a separate copy of the reduction object on the shared memory of each multiprocessor. All threads within a thread block use locking to avoid race conditions while performing updates on the reduction object. Similar to full replication, the accumulated updates on different multi-processors need to be combined together to obtain the final results.

This locking scheme is shown in Figure 2.2.

Another issue is related to the granularity of locking. In this regard, we have implemented two different versions, fine-grained locking and coarse-grained locking, as described below.

__device__ void atomicFloatOperation(float *address, float val) {
    int iVal = FloatAsInt(val);
    int tmp0 = initVal;   // initVal != any value stored at address
    int tmp1;
    while ((tmp1 = atomicCAS((int *)address, tmp0, iVal)) != tmp0) {
        tmp0 = tmp1;
        // Recompute the desired result from the value just observed.
        iVal = FloatAsInt(Operation(val, IntAsFloat(tmp1)));
    }
}

Figure 2.3: Updating a Float with an Atomic Operation

Fine-grained Locking: This scheme simply associates a lock with updating a particular word. One complication, however, is that the CUDA versions available on current cards provide atomic operations on shared memory only for integers. We support atomic updates on floating point numbers by writing a wrapper on top of the existing atomicCAS primitive. This extension is shown in Figure 2.3. The loop here ensures that the update is performed atomically. The entire process can be divided into four steps: 1) read the value of an element, 2) perform the update on this value, 3) read the element's value again, and 4) if its value is the same as the value last read, commit the update and release. Otherwise, we return to the first step.
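For concreteness, a specialization of this wrapper for floating point addition, written with the CUDA intrinsics __float_as_int and __int_as_float, could look as follows; this is our own minimal sketch rather than the exact code used in our implementation (on devices of compute capability 2.0 and later, the built-in atomicAdd on floats can be used directly instead):

    __device__ void atomicFloatAdd(float *address, float val) {
        int old = __float_as_int(*address);          // current value as raw bits
        int assumed;
        do {
            assumed = old;
            float updated = __int_as_float(assumed) + val;
            // Commit only if *address has not changed since we read it.
            old = atomicCAS((int *)address, assumed, __float_as_int(updated));
        } while (old != assumed);
    }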

{* Get the lock, when lockVal = 0 *}
void lock(int *lockVal) {
    int tmp0 = 0;
    int tmp1;
    int val = 1;
    while ((tmp1 = atomicCAS(lockVal, tmp0, val)) != tmp0);
}
(a)

{* Drop the lock, then change lockVal to 0 *}
void unlock(int *lockVal) {
    atomicExch(lockVal, 0);
}
(b)

Figure 2.4: Lock (a) and Unlock (b) Operations

Coarse-grained Locking: In many applications, several words may form one logical unit of data, and there may be a need to ensure that only one thread can update any of these words at a given time. This can be done through our coarse-grained locking scheme. The current CUDA version does not provide any direct support for this. Thus, such a coarse-grained lock has to be supported through a combination of a user-defined lock and the atomicCAS primitive. The corresponding lock and unlock operations are shown in Figure 2.4. In this example, only one lock is associated with the entire block of shared memory. Therefore, any update to a word in the shared memory will block all words in the entire shared memory from other threads. The same idea can be applied for creating locks associated with other (smaller) groups of words in shared memory.

Thread Divergence and Deadlock Avoidance

It turns out that the code shown in Figure 2.4 can potentially lead to a deadlock due to

the divergence among threads within a warp and the SIMD nature of a multi-processor in a GPU. On current GPUs, the device function is executed by the GPU in a SIMD manner, with threads executing on the device organized in a grid of thread blocks. Each thread block is executed by one multiprocessor, with threads within a block being launched in warps.

Warps of threads are randomly picked by the device multiprocessor for execution. The size of the thread warp is 32 in most recent cards.

Because threads in a warp can only perform the same instruction at any given time, an issue arises if there are conditional branches. If there is an if-then-else statement, and if

different threads evaluate different results from the if statement, threads are said to diverge.

In this case, the subset of threads that need to execute the then clause will execute, while the other threads will be idle. Subsequently, the second group of threads will execute the else clause, while the first group will be idle.

Now, let us revisit the lock and unlock functions shown in Figure 2.4. lockVal is in the device memory and hence can be accessed by all threads. The reason that this code can deadlock is as follows. When one thread obtains the lock, it comes out of the while loop, and there is thread divergence. The remaining threads in the warp are still at the same instruction, i.e., they keep spinning. As the thread holding the lock is now idle, the remaining threads will never acquire the lock, and there is a deadlock.

bool DoWork = true;
While (DoWork) {
    if (getLock()) {
        {* Enter into the Critical Section *}
        {* Do Work *}
        {* Finish Work *}
        releaseLock();
        DoWork = false;
    }
}
(a)

for (int t = 0; t < Warp Size; t++) {
    if (threadID mod Warp Size == t) {
        getLock();
        {* Enter into the critical section *}
        {* Do Work *}
        releaseLock();
    }
}
(b)

Figure 2.5: Example of Explicit Conditional Branch (a) and Warp Serialization (b)

Now, we describe two ways in which deadlock can be avoided. The first is by an explicit conditional branch and the second is by explicit warp serialization. The locking code corresponding to both of these approaches is shown in Figure 2.5.

Explicit Conditional Branching: Here, data consistency can be guaranteed by a conditional branch. Only the thread that succeeds in acquiring the lock enters the critical section. Moreover, the difference between the earlier version (prone to deadlocks) and this version is that the spinning loop is moved out of the operation of acquiring the lock. So, no thread will spin endlessly, and thus the deadlock possibility is eliminated.


Figure 2.6: Hybrid Scheme

Explicit Warp Serialization: Since the deadlock is caused by the divergence of the threads in a warp, executing all threads in a warp serially can avoid deadlock. Figure 2.5 (b) shows the code corresponding to warp serialization. In this method, race conditions only happen among threads in different warps. Thus, executing the threads within a warp serially eliminates the possibility of a deadlock.

2.1.4 Hybrid Scheme

Both full replication and locking schemes can have significant performance overheads.

The main issue with full replication is the memory overhead. Since a private copy is needed for each thread in a block, for almost any application, the reduction object needs to be stored in the high latency device memory. Moreover, the cost of combination can be very high.

On the other hand, in the locking scheme, a single copy of the reduction object allows the use of low latency shared memory. This also eliminates the need for combination within a thread block. However, contention among the threads in a block can be very high.

On modern GPUs, configuring an application with a large number of threads typically leads to better performance, because context switching between warps can mask latencies. However, increasing the number of threads increases the overheads of both of these schemes. For full replication, both memory overheads and combination costs increase with an increasing number of threads. Similarly, a larger number of threads on a multiprocessor increases the likelihood of contention between them.

To address this problem, we have designed a hybrid scheme. The main goal here is to achieve a balance between the overheads incurred by full replication and locking schemes.

This scheme is described in Figure 2.6. Suppose there are T = M × N threads in a block. We organize these threads as M groups, each of which comprises N threads. For each group, we create a separate copy of the reduction object. Thus, we need M copies of the reduction object for each block. The N threads in one group update the same copy, using one of the locking schemes. After all the input elements have been processed, the M copies of the reduction object are merged.
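The following CUDA sketch outlines one way such a hybrid kernel can be organized; the constants M and REDUC_SIZE, the element-processing step, the use of the native float atomicAdd (available from compute capability 2.0), and the folding of the cross-block combination into global atomic updates are all illustrative assumptions, not the exact implementation:

    #define M 8                    // number of groups per block (assumed)
    #define REDUC_SIZE 64          // reduction object size in floats (assumed)

    __global__ void hybridReduction(const float *data, int n, float *result) {
        __shared__ float reduc[M][REDUC_SIZE];
        int N = blockDim.x / M;            // threads per group (blockDim.x divisible by M)
        int group = threadIdx.x / N;       // which copy this thread updates

        for (int i = threadIdx.x; i < M * REDUC_SIZE; i += blockDim.x)
            reduc[i / REDUC_SIZE][i % REDUC_SIZE] = 0.0f;
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int nthreads = gridDim.x * blockDim.x;
        for (int e = tid; e < n; e += nthreads) {
            int idx = (int)data[e] % REDUC_SIZE;       // stand-in for process(e)
            atomicAdd(&reduc[group][idx], data[e]);    // contention only within one group
        }
        __syncthreads();

        // Merge the M copies and accumulate this block's result into global memory.
        for (int i = threadIdx.x; i < REDUC_SIZE; i += blockDim.x) {
            float sum = 0.0f;
            for (int g = 0; g < M; g++) sum += reduc[g][i];
            atomicAdd(&result[i], sum);
        }
    }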

The advantages of this scheme are three-fold. First, the number of copies of the reduction object is reduced by a factor of N, as compared to the replication scheme. One important consequence of this is that the reduction object replicas can be placed in shared memory. Second, combination costs are also reduced. Third, as compared to the locking scheme, where T threads compete for one memory location, now only T/M threads compete for a memory location. Consequently, the likelihood of lock contention is also reduced.

One important constraint in choosing the number of groups, M, is that M copies of the reduction object should still fit in shared memory. Other considerations in choosing the number of groups are as follows. If the reduction object is large, the overhead of combination is likely higher and the overhead because of conflicts is lower. In this case, a smaller number of groups is preferable. On the other hand, if the reduction object is small, reducing conflicts may be the dominant consideration, and therefore, a larger number of groups may be preferable.

2.2 Applications

In this section, we describe how we have implemented three data mining applications involving generalized reductions on GPUs. The algorithms we have chosen are k-means clustering, principal component analysis (PCA), and k-nearest neighbor search (kNN). We briefly summarize these algorithms and discuss the main issues in applying the techniques described in the previous section.

2.2.1 k-means Clustering

The clustering problem is as follows. We consider transactions or data instances as representing points in a high-dimensional space. Proximity within this space is used as the criterion for classifying the points into clusters. The four steps in the sequential version of the k-means clustering algorithm are as follows [62]: 1) start with k given centers for the clusters; 2) scan the data instances, and for each data instance (point), find the center closest to it and assign the point to the corresponding cluster; 3) determine the k new cluster centers from the points assigned to each center; and 4) repeat this process until the assignment of points to clusters does not change.

The reduction object for this application is the shift to each center, which is accumulated based on the points assigned to each center. The shift comprises m floating point numbers and 1 integer, where m is the number of dimensions of the dataset. Thus, the total size of

the reduction object is k × (m + 1) words. With either the locking scheme or the hybrid scheme, fine-grained locking can be used correctly.

2.2.2 PCA

Principal Component Analysis is a popular dimensionality reduction method, which was developed by Pearson in 1901. The goal of PCA is to find a new set of dimensions

(attributes) that better capture the variability of the data. Since it has many steps which are not very compute-intensive, we only ported the calculation of the correlation matrix to the GPU. In this step, given an n × m data matrix D, with n rows and m columns, we compute the matrix S, which is the covariance matrix of D. Every entry s_ij is defined as

s_ij = covariance(d_{*,i}, d_{*,j}), i.e., s_ij is the covariance of the i-th and j-th columns of the data.
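As an illustration of how this step fits the generalized reduction structure, a sequential C sketch of the covariance computation is shown below; the column means are assumed to be precomputed, and the function and variable names are our own:

    /* Each row of D contributes an update to every entry s_ij, so the
     * covariance matrix S itself is the reduction object. `mean` holds the
     * precomputed column means.                                            */
    void covariance(const double *D, int n, int m,
                    const double *mean, double *S) {
        for (int i = 0; i < m * m; i++) S[i] = 0.0;
        for (int r = 0; r < n; r++) {               /* reduction loop over rows */
            const double *row = D + r * m;
            for (int i = 0; i < m; i++)
                for (int j = 0; j < m; j++)
                    S[i * m + j] += (row[i] - mean[i]) * (row[j] - mean[j]);
        }
        for (int i = 0; i < m * m; i++) S[i] /= (n - 1);
    }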

The reduction object for this application is the covariance matrix. The size of the reduction object increases quadratically with m (the number of columns). As a result,

memory and combination overheads associated with full replication can increase rapidly.

With the locking scheme, fine-grained locking can be used correctly.

2.2.3 KNN Search

The k-nearest neighbor classifier is based on learning by analogy [51]. The training samples are described in an n-dimensional numeric space. Given an unknown sample, the k-nearest neighbor classifier searches the pattern space for the k training samples that are closest, using the Euclidean distance, to the unknown sample.

Again, the algorithm can be parallelized with replication as follows. The training samples are distributed among the threads. Given an unknown sample, each thread processes the training samples it owns to calculate the k nearest neighbors locally. After this local phase,

device reduc
shared memory ← d_center
nBlocks = blockSize / thread_number
tid = thread ID
For i = 1 to nBlocks
    dis = distance(d_data[tid], d_center)
    For j = 1 to k
        if (dis < d_update[tid][j])
            shiftRight(d_update[tid])
            d_update[tid][j] ← dis
Thread 0 combines all copies of d_update

Figure 2.7: KNN on GPU

a combination step computes the overall k-nearest neighbors from the k-nearest neighbors

on each thread. This implementation is shown in Figure 2.7.

The main problem with this approach is that the cost of the combination step can be very high. Thus, use of the locking scheme, where each multi-processor maintains only one set of nearest neighbors, can be an attractive approach. However, locking itself involves several challenges for this application. In the previous two applications, each pair of updated elements is independent, and fine-grained locking can be used. However, in kNN, the reduction object is an ordered k-nearest neighbor queue. Although fine-grained locking can prevent two threads from updating the same element simultaneously, inconsistent results may be obtained if two threads try to update the queue at the same time. Such a case is presented in Figure 2.8. Suppose Thread1 and Thread2 update the k-nearest queue concurrently. Let Thread1 insert C before A, and Thread2 insert B before D. However, when Thread1 finishes, D has been shifted to the next position, and the position where

Figure 2.8: Inconsistency in KNN

D was has now been filled by A. If Thread2 continues to insert B, the queue will be inconsistent.

Thus, we have to use the coarse-grained locking scheme, and associate only one lock with each copy of the reduction object. This obviously limits the performance of the locking scheme, as we will examine in the next section. Among the two coarse-grained locking approaches introduced in the previous section, the performance of explicit warp serialization is better than that of explicit conditional branching.

The hybrid scheme turns out to be an attractive option for this application. Particularly, implementing the hybrid scheme with 32 copies of the reduction object on each multi-processor works very well. As the number of threads in a warp is 32, this allows each thread in the

Version             Execution time (sec)
                    K=10        K=100
Sequential          13.88       81.49
Full replication     0.33        0.51
Locking scheme       1.01        0.32
Hybrid scheme        0.12        0.32

Table 2.1: Kmeans: Comparison between sequential time and GPU time

warp to work with a different copy of the reduction object. Thus, there is no need for serialization of the threads within a warp, which was needed in the locking scheme to avoid deadlocks.
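A minimal sketch of this per-lane arrangement is shown below (our own illustration; K_MAX is an assumed upper bound on k chosen so that 32 queues fit in shared memory, and the queues are assumed to be initialized to large distances):

    #define K_MAX 32   // assumed upper bound on k

    // Each warp lane owns one of 32 shared-memory k-NN distance queues, so no
    // two threads of the same warp touch the same queue and no intra-warp
    // locking (or serialization) is needed. Insertion keeps the queue sorted.
    __device__ void laneInsert(float queues[32][K_MAX], int k, float dis) {
        float *q = queues[threadIdx.x & 31];          // this lane's private queue
        for (int j = 0; j < k; j++) {
            if (dis < q[j]) {                         // found the insertion point
                for (int t = k - 1; t > j; t--)       // shift the tail right
                    q[t] = q[t - 1];
                q[j] = dis;
                break;
            }
        }
    }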

2.3 Experimental Results

In this section, we present results from a number of experiments we conducted to evaluate the relative performance of the three techniques. In understanding the performance trade-offs of the three parallelization techniques, i.e., Full Replication, the Locking Scheme, and the Hybrid Scheme, we also consider different thread block (execution) configurations, and variation in application-specific parameters.

Our experiments were conducted using an NVIDIA Tesla C1060 GPU with 240 processor cores (30 × 8), a clock frequency of 1.296 GHz, and 4 GB of device memory. This GPU was connected to a machine with two AMD 2.6 GHz dual-core Opteron CPUs and 8 GB of main memory. The three data mining applications described in the previous section, i.e., k-means clustering, Principal Component Analysis (PCA), and k-Nearest Neighbor search (kNN), were used for our studies.

2.3.1 Results from K-Means


Figure 2.9: Results from K-Means (k=10)

The dataset we used was 2 GB, comprising 3-dimensional points, i.e., each point in the dataset is a 3-tuple of floating-point values. In Table 2.1, we compare the best performance obtained from the use of the three approaches on the GPU against the sequential CPU version. The best GPU performance is achieved with the hybrid scheme, which is faster than the sequential version by a factor of about 115 and 254 for k = 10 and k = 100, respectively. A detailed comparison of the results from the three approaches and different thread block configurations is presented below.

Figure 2.9 shows the results for K-Means with different thread configurations when the value of k is 10. We show two different versions of the hybrid scheme, hybrid-64 and hybrid-32, where the suffix denotes the number of thread groups, or copies of the reduction object, within a thread block or multi-processor. Specifically, hybrid-64 involves 64 different groups of threads within a thread block. For instance, if there are 256

37 <8=;#

<8;;#

;8@;#

;8?;#

;8>;# !)! (&$#)&"!)+%! ,)

;8=;#

;8;;# )5--#"(1-+&$30/# !0&,+/)##&*(.(# 6%2+'##&*(.(#

Figure 2.10: Comparison of the Best Results from K-Means (k=10)

threads in a block, then there are 4 threads within a group, which use locking to update the

same copy of the reduction object.

For K-Means with k = 10, the size of the reduction object is quite small, leading to low memory and combination overheads. Because each thread updates one of the 10 cluster centers after processing each point, the contention for the reduction object with the locking scheme is very high. Because of these factors, full replication outperforms the locking scheme. However, the hybrid scheme outperforms both full replication and locking by a significant factor, as it achieves a good balance between thread contention and combination overhead.

We have also evaluated these techniques with varying thread configurations. In experimenting with thread configurations, the number of thread blocks was always 30, matching the number of multiprocessors on the device. This is because using a larger number of blocks only increases the contention for resources, while not allowing any additional parallelism. With full replication, the best performance is achieved with 64 threads per block.

More threads per block allow more concurrency, but the overhead of combination increases.

For the locking scheme, increasing the number of threads increases the contention for the lock and thus the synchronization overhead. This is well demonstrated in the execution time with 512 threads per block, which is more than a factor of 3 higher than the execution time with the best thread configuration. Hybrid scheme obtains the best performance with

256 threads per block, which is better than both full replication and locking scheme.

The best results obtained from each scheme are summarized in Figure 2.10. The hybrid scheme is 7.98 times faster than the locking scheme, and 2.62 times faster than full replication. hybrid-64 outperforms hybrid-32.


Figure 2.11: Results from K-Means (k=100)

We also performed another set of experiments with K-Means, where the number of clusters, k, is increased to 100, resulting in a 10-fold increase in the size of the reduction object.

The results are shown in Figures 2.11 and 2.12. Contrary to what we observed with the previous case, here the locking scheme outperforms full replication. This is because with

39 :8@:#

:8?:#

:8>:#

:8=:#

:8<:# !)! (&$#)&"!)+%! ,)

:8;:#

:8::# *5--#"(1-+&$30/# !0&,+/)##&*(.(# 6%2+'##&*(.(#

Figure 2.12: Comparison of the Best Results from K-Means (k=100)

a larger reduction object, the contention among the threads in the locking scheme decreases, whereas the combination overhead of replication increases. As the locking scheme performs really well, the hybrid scheme is unable to improve much over it.

In addition, the best performance for the locking scheme is obtained with 512 threads per block.

2.3.2 Results from PCA

We used datasets of 17 MB and 34 MB for PCA, with column sizes of 16 and 32, respectively. In Table 2.2, we compare the best performance obtained from the use of the three approaches on the GPU against the sequential CPU version. Again, the best GPU performance is achieved with the hybrid scheme, which is faster than the sequential version by a factor of about 20.5 and 53.9 for m = 16 and m = 32, respectively. For PCA, however, the performance of

Version             Execution time (sec)
                    16x16       32x32
Sequential          0.9664      6.5488
Full replication    0.0495      0.1417
Locking scheme      0.0472      0.1344
Hybrid scheme       0.0470      0.1214

Table 2.2: PCA: Comparison between sequential time and GPU time

the locking scheme and full replication is competitive with the hybrid scheme. A detailed comparison of the results from the three approaches and different thread block configurations is presented below.


Figure 2.13: Results from PCA (16x16)

In the first set of experiments, the number of columns, m, was chosen to be 16. The results are shown in Figures 2.13 and 2.14. The locking scheme is 5% faster than

41 :8:@# :8:?# :8:?# :8:># :8:># :8:=# :8:=# :8:<#

!)! (&$#)&"!)+%! ,) :8:<# :8:;# :8:;# :8::# *5--#"(1-+&$30/# !0&,+/)##&*(.(# 6%2+'##&*(.(#

Figure 2.14: Comparison of the Best Results from PCA (16x16)

full replication, and the performance of the hybrid scheme is similar to that of the locking scheme. This result is quite different from K-Means due to the characteristics of this application. PCA performs relatively more computation to calculate the mean and standard deviation matrices before updating the reduction object, so the overhead caused by locking is amortized by the large amount of computation. In addition, the size of the reduction object, which affects the cost of combination, is relatively large compared to that in K-Means.

In varying the thread block configurations, there are specific reasons why full replication and the locking scheme achieve their best performance with a smaller number of threads per block. In full replication, fewer threads cause less combination overhead, so it can achieve better performance. In the locking scheme, the computation before updating the reduction objects depends on the total number of threads. If there are fewer threads, each thread will do more work before its updates, resulting in less contention. In the hybrid scheme, we cannot go beyond 8 groups, as the reduction object does not fit into shared memory beyond this.

42 :6=?#

:6=:#

:6

:6<:# ,2++#"'.+)%#0-,#

:6;?# !-%*),(# 4$/)&7># :6;:# !0$"/-*))-($)2,$"3) 4$/)&7<# :6:?#

:6::# =:3@># =:3;

Figure 2.15: Results from PCA (32x32)

Version             Execution time (sec)
                    K=10        K=20
Sequential          0.221       0.331
Full replication    0.084       0.223
Locking scheme      0.805       0.871
Hybrid scheme       0.012       0.028

Table 2.3: KNN: Comparison between sequential time and GPU time

When increasing the number of columns, m, to 32, we did not observe any significant changes in the trend. The results are presented in Figures 2.15 and 2.16. The hybrid scheme is about 10% faster than the locking scheme in this case.

43 :8;=#

:8;<#

:8;:#

:8:>#

:8:=# !)! (&$#)&"!)+%! ,)

:8:<#

:8::# (5--#"(1-+&$30/# !0&,+/)##&*(.(# 6%2+'##&*(.(#

Figure 2.16: Comparison of the Best Results from PCA (32x32)

2.3.3 Results from KNN

The size of the dataset we used for kNN is 20 MB. As explained in the previous section, this application needs coarse-grained locking. We have experimented with both approaches for coarse-grained locking, i.e., explicit warp serialization and conditional branching. In comparing against the other schemes, we present results from explicit warp serialization, since it gave better performance.

In Table 2.3, we compare the best performance obtained from the use of the three approaches on the GPU against the sequential CPU version. The best GPU performance is achieved with the hybrid scheme, which is faster than the sequential version by a factor of about 18.41 and 11.82 for k = 10 and k = 20, respectively. In contrast to PCA, here the performance of full replication is poor compared to the hybrid scheme. Also, the locking scheme is even slower than the sequential version. A detailed comparison of the results from the three approaches and different thread block configurations is presented below.

Parallelization Method    Divergent Branch    Global Store
Full Replication              54083529           1960770
Locking Scheme               522451527               300
Hybrid Scheme                  8305329             13900

Table 2.4: Profiler output when k = 10

In our first set of experiments, the value of k is 10. The results are shown in Figures 2.17 and 2.18. For the hybrid scheme, we fix the number of reduction object copies to

32, matching the number of threads in a warp in our target architecture. With this choice, as no two threads in the same warp can belong to the same group, race conditions can exist only among the threads in the different warps. This helps to alleviate branch divergence.


Figure 2.17: Results from KNN (k=10)


Figure 2.18: Comparison of the Best Results from KNN (k=10)

We summarize the performance of the three approaches with their best thread configurations in Figure 2.18. Full replication is 9.6 times faster than the locking scheme, though the hybrid scheme obtains the best performance. Specifically, the hybrid scheme is 62.3 times faster than the locking scheme. The cost of locking in this application is much larger than in the previous applications that use fine-grained locking. This is evident from the performance achieved by the locking scheme with a varying number of threads. Specifically, the execution time of the locking scheme with 512 threads per block is 5.2 times higher than the execution time with 64 threads per block. In comparison, the performance of the full replication and hybrid schemes does not vary significantly with different thread block configurations.

To further understand all the trade-offs, we profiled the execution of kNN with cudaProf version 2.3. The key data is summarized in Table 2.4. Here, we focus on two factors, which are divergent branches (number of divergent branches within a warp), and global stores

(the count of global memory stores).

Parallelization Method    Divergent Branch    Global Store
Full Replication             143841413           6353670
Locking Scheme               564941895               600
Hybrid Scheme                 18655192             37760

Table 2.5: Profiler output when k = 20

The major performance factor for the locking scheme is the high branch divergence within each warp. For full replication, thread divergence is not an issue, but the cost of global stores is very high. In the hybrid scheme, we are able to achieve a balance between the overheads due to divergence and global stores. Specifically, for the hybrid scheme, the branch divergences are 1.6% of those in the locking scheme, and the global stores are only 0.7% of what was observed with full replication.


Figure 2.19: Results from KNN (k=20)

47 <8;;# ;8D;# ;8C;# ;8B;# ;8A;# ;8@;# ;8?;# ;8>;# !)! (&$#)&"!)+%! ,) ;8=;# ;8<;# ;8;;# -5--#"(1-+&$30/# !0&,+/)##&*(.(# 6%2+'##&*(.(#

Figure 2.20: Comparison of the Best Results from KNN (k=20)

In the next set of experiments, we increase the value of k to 20. The results are shown

in Figures 2.19 and 2.20. When the value of k is increased to 20, we observe a similar trend among the three approaches with different thread block configurations. This trend is very similar to what we observed in Figure 2.17.

In Figure 2.20, we summarize the best results from each scheme. Full replication is still better than the locking scheme. Also, the hybrid scheme is faster than the locking scheme by a factor of 31.1. Detailed data from profiling this application with k = 20 is shown in Table 2.5. For full replication, compared to the previous profiling results, there is a sharp increase in global stores, whereas branch divergences have not changed much for the locking scheme.

These trends show that the relative performance of locking can be expected to get better with increasing k. To validate this observation, we experimented with a range of different values of k, comparing full replication and locking. Figure 2.21 summarizes these results.

With increasing k, execution time for full replication increases rapidly, whereas there is

48 only a marginal increase in the execution time for the locking scheme. Thus, eventually, the locking scheme outperforms full replication.


Figure 2.21: KNN - Comparison of Full Replication and Locking Scheme with different k

2.3.4 Discussion

To summarize the above results, the performance of each scheme depends on the nature of the application being scaled. For k-means with k = 10, the locking scheme is the worst, whereas the hybrid scheme performs the best. A very different trend is observed with k-means when k = 100, and also with the two different sets of parameters for PCA. In these cases, the hybrid scheme performs the best, though the locking scheme is very comparable to the hybrid scheme. kNN shows a different trend, where the hybrid scheme clearly outperforms the other schemes.

Overall, we can see that full replication can be a viable approach when the reduction object is small and the cost of combination is low. The locking scheme works well when

fine-grained locking can be correctly used, and the number of distinct memory locations

is large enough to keep contention overheads low. The hybrid scheme we have developed in this work is able to achieve a balance between the cost of combination and the cost of synchronization, leading to good performance.

2.4 Related Work

Parallelizing applications on GPUs and evaluating their performance is currently one of the most popular topics in high performance computing. Here, we only compare our work against efforts that have focused on the use of atomic operations, or efforts looking at data mining applications.

Many application studies have evaluated the use of atomic operations [122, 82, 19].

Focusing on the parallel Push-Relabel algorithm, Vineet et al. [122] used atomic operations to address the data consistency problem. Another effort [82] considered acceleration of

Monte Carlo (MC) simulation of photons on GPUs, and atomic instructions were used to ensure data consistency. In their work, floating point data is converted to integers to allow the use of atomic operations.

The key differences in our work are: 1) use of atomic operations for locking on floating point numbers and methods for deadlock-free coarse-grained locking, 2) introduction of a hybrid approach, 3) a detailed comparison of several approaches for a set of applications.

Data mining on GPUs is also widely studied. k-Nearest Neighbor search has been ported to GPUs by several groups [18, 106, 43]. Hall and Hart [47] ported different versions of k-means to the GPU using Cg. Another study [21] presented an analysis of the CUDA computing model and a comparison with other architectures.

2.5 Summary

This work has focused on parallelization methods for reduction-based computations on GPUs, in view of the availability of atomic operations. We have developed methods for locking on floating point numbers and for coarse-grained locking. Moreover, we have developed a hybrid approach that combines replication of the reduction object with locking.

We have evaluated full replication, the locking scheme, and the hybrid scheme using three popular data mining applications that involve reductions.

We show that the relative performance of the three techniques varies depending on the application and its parameters. The hybrid scheme we have introduced either matches the performance of the other techniques, or clearly outperforms them. In k-means, when k is quite small (k = 10), contention is a significant factor. While full replication works well for this case, the hybrid scheme can still outperform it. For k-means with larger values of k, as well as for PCA with two different sets of parameters, full replication is quite slow because of the cost of combination. In these cases, the hybrid scheme is either similar to locking, or outperforms it to some extent. In kNN, both full replication and locking have significant overheads, and the best performance is achieved with the hybrid scheme we have introduced.

Chapter 3: An Execution Strategy and Optimized Runtime Support for Parallelizing Reductions on Modern GPUs

In the last chapter, we showed how to utilize shared memory and atomic operations in different parallel strategies for generalized reductions. However, for applications with irregular structures, which form one of the 13 dwarfs in the Berkeley view on parallel computing, there have not been any systematic studies on modern GPUs. Thus, in this work, we propose an execution methodology and optimized runtime modules to effectively execute irregular applications on GPUs.

3.1 Overview of Irregular Reductions

Real X(numNodes), Y(numEdges);    ! Data arrays
Integer IA(numEdges, 2);          ! Indirection array
Real RA(numNodes);                ! Reduction array

for (i = 1; i < numEdges; i++) {
    RA(IA(i,1)) = RA(IA(i,1)) op (Y(i) op (X(IA(i,1)) op X(IA(i,2))));
    RA(IA(i,2)) = RA(IA(i,2)) op (Y(i) op (X(IA(i,1)) op X(IA(i,2))));
}

Figure 3.1: Typical Structure of Irregular Reduction

A canonical irregular reduction loop is shown in Figure 3.1. The key characteristics of this loop are as follows. Each element of the left-hand-side array might be incremented in multiple iterations of the loop. The set of iterations in which a particular element may be updated depends upon the contents of the indirection array IA, but the updates can only be through an associative and commutative operation op. We refer to the set of elements of the left-hand-side arrays as the reduction space. On the other hand, one or more input arrays might be accessed either directly or through an indirection array. A key aspect of the computation loop is that it iterates over the elements of the indirection array. Thus, the set of elements of the indirection array is referred to as the computation space.

Overall, the data structures involved in an irregular reduction can be categorized into three different types. The first is the input and read-only data, i.e., the arrays X and Y in our example. The second is the reduction or output array. The third type of data structure includes the indirection array(s), which provide the indices for accessing some of the data arrays and the reduction array. The indirection shown in Figure 3.1 includes two portions, IA(i,1) and IA(i,2). In the i-th iteration of the loop, the data array X is accessed using IA(i,1) and IA(i,2). However, some data arrays can also be accessed by the loop index, like the data array Y in our example. The reduction array RA is updated using the two distinct portions of the indirection array in each iteration of the loop.

A loop with an indirection array like the one shown in Figure 3.1 can be found in a large number of scientific and engineering applications. For example, in the CFD application Euler [25], an unstructured mesh contains nodes and edges. The edges in this application form the indirection array, which provides the indices of both end points of an edge. In each iteration, some values

(such as face data, velocities, etc.) associated with each of the two end points are updated.

Molecular Dynamics applications contain a similar loop structure [61]. Here, the nodes represent the molecules, and the edges denote the interaction between a pair of molecules. The loop iterates over the set of interactions, and the corresponding changes in forces for the two interacting molecules are updated.

Clearly, applications like Euler and molecular dynamics involve a large number of loops, not all of which follow the structure shown in Figure 3.1. However, most of the other loops tend to be simple to parallelize, as the data and the iterations can be partitioned in an affine manner. Thus, irregular reduction loops are the ones that involve non-trivial challenges in parallelization. These applications, however, can involve loops with different indirection arrays. For the rest of this chapter, our presentation will focus only on a loop based on a single indirection array.

In addition, a key observation that impacts the parallelization and optimization of these applications is that the sizes of the data structures involved in both the computation space and the reduction space are typically extremely large. Thus, they cannot be stored in the shared memory of a GPU.

3.2 Execution Strategy for Indirection Array Based Computations on GPUs

In the previous section, we discussed the nature of irregular reductions. In this

section, we first elaborate on the problems associated with the use of two popular methods

for parallelizing reductions on shared memory systems, which are locking and replication.

Then, we introduce our proposed partitioning based locking scheme, and explain the reason why we choose to partition on the reduction space.

3.2.1 Overheads with Existing Methods

We now discuss why the existing methods for parallelizing reductions on any shared memory system are not appropriate for the execution of irregular applications on GPUs.

Full Replication: In any shared memory system, one simple method for avoiding race conditions is to have a private copy of the reduction array for each thread. After each iteration is completed, the results from all threads are combined by a single thread, or using a tree

structure. Since each thread has its own replica of the reduction array, no synchronization

is needed while updating the array. The combination of results can be performed in two

ways. In the first method, we can copy the results from all threads and blocks onto the

host and then perform the combination. Alternatively, we perform the combination among

the threads within a block on the GPU and then perform the combination across blocks on

the host. The second combination method is preferred as the combination among threads

within a block can be performed in parallel on GPU.

The main issues with this scheme are the memory and the combination overheads.

Particularly, a GPU like Fermi has 14 or 16 multi-processors, and most application studies

have shown that 256 or 512 threads per block is desirable for achieving best performance.

Such a large number of threads leads to a very high memory overhead. Moreover, this

reduces the possibility that any significant portion of the reduction array can be stored

in the shared memory. In addition, the cost of combination obviously increases with the

number of threads and blocks, and can be quite high when nearly 4,000 to 8,000 copies are

involved.

Locking Mechanism: Another approach is to use locking while updating any element

of the reduction object. On modern GPUs, this scheme can be implemented using the

atomic operations provided by CUDA. Particularly, from compute capability 2.0 onwards, NVIDIA GPUs support atomic operations on floating point numbers and on shared memory. In contrast to the replication scheme discussed above, all threads within a block share the same reduction array, while still maintaining correctness.

Clearly, an atomic operation is more expensive than a simple update, and similarly, contention among threads trying to update the same element may lead to slowdowns. However, this scheme has the benefit of avoiding memory overheads and the cost of performing the combination. A related benefit is that utilization of shared memory is seemingly possible.

But, as we stated earlier, in most irregular reduction based applications, even one copy of the reduction array is too large to fit into the shared memory. Thus, additional work is needed to enable effective use of the shared memory. Furthermore, there is still a copy of the reduction array for each thread block or multi-processor, so a combination among thread blocks is still required.

In summary, while the locking based scheme seems to have certain advantages, both of the above strategies fail to utilize the shared memory effectively. This is the basis for the partitioning-based locking scheme we are proposing.

3.2.2 Partitioning-Based Locking Scheme

We now discuss our proposed scheme for the execution of irregular reductions. Before getting into the details of this scheme, we summarize the key aspects of this scheme:

• Partition the output or the reduction space, such that the data associated with each resulting partition of the reduction space fits into shared memory.

• Reorder the computation space or the indirection arrays in such a way that iterations updating elements of the same partition are grouped together.


Figure 3.2: Computation Space partitioning and reduction size in each partition


Figure 3.3: Reduction Space partitioning and computation size in each partition

• In the process, completely eliminate the requirement for combining the reduction array across different thread blocks.

We now discuss the strategy in more detail, starting with a discussion of why partitioning on the reduction space is chosen.

Partitioning Methodologies for Irregular Reductions

As we discussed earlier, the sizes of the arrays involved in an irregular reduction prohibit simple utilization of the shared memory. Thus, clearly, we need to partition the work or data, and execute partitions in a way that a portion of the data can be loaded into shared memory. Now, let us revisit the code example in Figure 3.1. The indirection array

IA and the data array Y have regular memory access patterns. Moreover, there is no reuse of the data within a single loop. Thus, these accesses can be optimized with coalesced references to the device memory. On the other hand, accesses through indirection arrays

(RA and X), if placed on device memory, have no potential for coalescing. But there is an opportunity for reuse, and hence placing them in shared memory can be beneficial. Thus, while considering partitioning methods, we focus on the accesses to the arrays RA and X.

Now, we can consider two possible partitioning methods:

Computation Space-based Partitioning: One approach is to partition the work or the iterations of the loop into different groups. Subsequently, the reduction array elements that are accessed in the set of iterations from one partition can be stored in the shared memory.

To further optimize this, the partitions of iterations can be chosen so that there is a high reuse of the corresponding reduction array elements.

Suppose we have n multi-processors in the GPU. We can create n × k partitions, such that the size of the set of reduction array elements accessed in each partition fits into the shared memory. Each multi-processor can execute k such partitions, loading the corresponding reduction array elements before the execution of each such partition. The partitions can also be chosen in a way that the number of iterations executed by different multi-processors is almost the same.

This simple scheme has several disadvantages, which we illustrate through an example. Figure 3.2a shows an example of partitioning of the iterations on the computation space. All edges are divided into four partitions, with four edges in each partition. To allocate data in shared memory, we need to determine the set of reduction array elements that will be updated while executing the iterations in each partition. Clearly, an element may be updated by iterations from different partitions. Thus, a reduction element can belong to multiple partitions. Moreover, the number of elements corresponding to one partition may vary significantly, as we show in Figure 3.2b. There are 6 nodes in the second partition, but only 4 nodes in the third partition. This points to a difficulty in optimally reusing shared memory for each partition. Moreover, as reduction array elements can be updated by multiple partitions, we need to perform the combination operation in the end. This can introduce significant overheads.

Reduction Space-based Partitioning: Based on the discussion above, we can consider another approach. We could start by first partitioning the reduction space, i.e. the set of the elements in the reduction array. The number of partitions can be chosen so that the data corresponding to each partition fits into shared memory.

A simple example of Reduction Space-based partitioning is shown in Figure 3.3a. Compared to the previous example, partitioning is based on the nodes, instead of edges. Now, we have several edges that connect elements from two distinct partitions. In the case of these edges, the corresponding iterations from the loop update reduction array elements from two different partitions, i.e., accesses RA(IA(i, 1)) and RA(IA(i, 2)) belong to different partitions. In such a case, we add this iteration i to both partitions. Most of the computation associated with this iteration is repeated, but only one element is updated during the computation associated with each partition. In Figure 3.3b, we see that the

total number of edges in all four partitions has increased to 25, compared to the original number of edges, which is 16. This is due to the crossing edges between different partitions. For larger datasets, and with careful partitioning, we can however minimize the number of such crossing edges. Besides redundant computation, this scheme may also lead to computation space imbalance. As an example, the third partition has 7 edges, which is more than in the other three partitions.

The single biggest advantage of this scheme is that for the entire GPU, there is only one copy of each reduction array element. Moreover, each thread block performs computation on independent reduction space partitions, thus eliminating the requirement for combination. Thus, while there clearly are tradeoffs between the two schemes, we were driven by the following two observations. First, shared memory is a critical resource on a GPU, and it is important to judiciously utilize it. Second, there is a high degree of parallelism available on a GPU, and thus, redundant computations to an extent are acceptable.
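To make the duplication of crossing iterations concrete, the following is a small sketch of how iterations could be assigned to reduction-space partitions; the function and variable names are illustrative, not the actual runtime interface.

#include <vector>

// Given a node-to-partition map produced by reduction-space partitioning,
// assign each edge (iteration) to the partition(s) of its endpoints. Edges
// whose endpoints fall in two different partitions are replicated, which is
// the source of the redundant work discussed above.
std::vector<std::vector<int>> assign_iterations(const int *IA, int numEdges,
                                                const std::vector<int> &nodePart,
                                                int numPartitions)
{
    std::vector<std::vector<int>> edgesOfPartition(numPartitions);
    for (int i = 0; i < numEdges; ++i) {
        int p1 = nodePart[IA[2 * i]];
        int p2 = nodePart[IA[2 * i + 1]];
        edgesOfPartition[p1].push_back(i);
        if (p2 != p1)                          // crossing edge: replicate the iteration
            edgesOfPartition[p2].push_back(i);
    }
    return edgesOfPartition;
}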

Overall GPU Execution Flow Based on Partitioning

We now summarize the execution of an application with the proposed scheme. Figure 3.4 outlines the execution of our scheme. Before the computation begins, a partitioning method is used to divide the original reduction space into multiple chunks, such that the reduction data associated with each chunk fits in the shared memory. Then, the indirection array is also partitioned, with the possibility that some iterations may be inserted in more than one partition. Thus, we have a set of iterations (possibly overlapping) associated with each disjoint partition of the reduction array.

The computation associated with each partition is always performed by a single thread block (executed on one multi-processor), with different threads executing different iterations. Note that different iterations can be simultaneously updating the same reduction array element.

Figure 3.4: Execution with Partitioning-based Locking Scheme (reduction array partitions are staged from host memory into the shared memory of each multiprocessor on the device, and written back after computation)

Therefore, the reduction operations are performed using the atomic operations on shared memory, which are supported in modern GPUs. Typically, the number of partitions we create is a multiple of the number of thread blocks, and each thread block executes the same number of partitions. Between the execution of two different partitions, the updated reduction array elements are written back to the device memory, and a new set of reduction array elements is loaded into the shared memory.
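A simplified CUDA sketch of this per-partition execution is shown below. It assumes the reduction space has already been reordered so that each partition's elements occupy a contiguous range of RA, processes one partition per thread block for brevity (in practice each block would execute several partitions in sequence), and uses a placeholder computation; it is a sketch under these assumptions, not the actual kernel.

// partStart[b], partStart[b+1]: range of reduction elements owned by partition b
// edgeStart, edgeList: (possibly replicated) iterations of each partition after reordering
__global__ void pbl_kernel(const int *IA, const float *X, const float *Y,
                           float *RA,
                           const int *partStart, const int *edgeStart,
                           const int *edgeList)
{
    extern __shared__ float sRA[];                    // partition's slice of RA
    int b  = blockIdx.x;
    int lo = partStart[b], hi = partStart[b + 1];
    int n  = hi - lo;

    for (int j = threadIdx.x; j < n; j += blockDim.x) // load the slice into shared memory
        sRA[j] = RA[lo + j];
    __syncthreads();

    for (int e = edgeStart[b] + threadIdx.x; e < edgeStart[b + 1]; e += blockDim.x) {
        int i  = edgeList[e];
        int n1 = IA[2 * i], n2 = IA[2 * i + 1];
        float contrib = (X[n1] - X[n2]) * Y[i];       // placeholder computation
        // only the locally owned endpoint(s) are updated, using shared-memory atomics
        if (n1 >= lo && n1 < hi) atomicAdd(&sRA[n1 - lo],  contrib);
        if (n2 >= lo && n2 < hi) atomicAdd(&sRA[n2 - lo], -contrib);
    }
    __syncthreads();

    for (int j = threadIdx.x; j < n; j += blockDim.x) // write the updated slice back
        RA[lo + j] = sRA[j];
}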

Consider again Figure 3.1. In our current implementation, only the array RA is stored and updated in shared memory. As we stated previously, there is no reuse of the elements of the arrays IA and Y across iterations of a loop. Therefore, these arrays are kept in device memory, but accesses to their elements are coalesced. The array X can be

partitioned in the same fashion as the array RA. However, a challenge arises for iterations

where X(IA(i, 1)) and X(IA(i, 2)) belong to different partitions. In such a case, one of

the two accesses will have to be to the device memory. In view of this, we choose not to put

X in shared memory at all. Instead, larger (and therefore, fewer) partitions can be created

for the array RA, reducing the fraction of iterations that need to be put in more than one partition.

The key observation from the scheme is that the overall reduction space has been divided into disjoint partitions, which can be placed on shared memory. Because the partitions are disjoint, there is no need to perform combination operations at the end of the processing.

However, there are still several challenges, which need to be addressed to maintain efficiency. First, we need methods for partitioning the data over the reduction space, while keeping the overheads low. The existing partitioners tend to be expensive and CPU-based, which can introduce unacceptable costs while attempting GPU-based acceleration of the application. Similarly, the computation space needs to be partitioned or reordered. Another challenge is associated with adaptive irregular applications, where the indirection array may change between iterations of a time-step loop, which is an outer loop over the loop shown in Figure 3.1. In the next section, we describe the various partitioning and optimization techniques to address these challenges, and to support the execution efficiently.

3.3 Runtime Support

To support the execution scheme described in the previous section, we need to develop optimized runtime modules. Though partitioning of unstructured meshes [67, 66] and associated runtime support [103, 114] are very well studied topics, we focus on the specific challenges associated with GPUs. We first discuss the partitioning methods we have implemented, and then discuss efficient reordering of iterations.

3.3.1 Runtime Partitioning Approaches

We have experimented with three different schemes for partitioning in our system.

METIS Partitioning: METIS is a set of partitioners for graphs and finite element meshes [114], and has been widely disseminated and used in practice. However, because METIS-based partitioners execute sequentially on the CPU, their relative overhead can be very high compared to the application's execution on a GPU. Particularly, the cost of initialization for METIS is very high: because it requires analysis of the edges, it needs to reorganize the unstructured mesh, and the time complexity of this conversion is O(numNodes × numEdges).

GPU-based (Trivial) Partitioning: The primary objective of the partitioners implemented in METIS (or similar partitioners) is to reduce the communication volume while executing the unstructured mesh-based computation. In our case, the primary goal is to exploit the shared memory. Thus, we could consider very simple or trivial partitioning and implement it on a GPU. The algorithm we have implemented divides the reduction space simply in the order of the input. So, the cost of the partitioning is very low, and the reuse of the shared memory is the same as with the other partitioners. However, unlike a smarter algorithm, this method can have a significantly larger number of cross edges. Thus, the amount of redundant work can be high, leading to some slow-downs.
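A minimal sketch of such a trivial, input-order partitioner, written as a CUDA kernel (the names are illustrative), is shown below.

__global__ void trivial_partition(int *nodePart, int numNodes, int numPartitions)
{
    // Node i simply goes to partition i / chunkSize, i.e., partitioning follows
    // the order of the input, with no analysis of edges or coordinates.
    int chunkSize = (numNodes + numPartitions - 1) / numPartitions;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numNodes;
         i += gridDim.x * blockDim.x)
        nodePart[i] = i / chunkSize;
}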

Multi-Dimensional Partitioning: We have also developed a more effective yet very efficient partitioning method, which we refer to as multi-dimensional partitioning. This method does not utilize the information on the edges, because gathering the adjacency information is very time consuming. Instead, node coordinates are utilized to partition the space in a fashion such that cross edges are significantly reduced. The key step in the algorithm is finding the kth smallest value. Suppose that the total number of partitions desired is 12, and we decide to partition the underlying three-dimensional space into 3 partitions along the x dimension, and 2 partitions along each of the y and z dimensions. In our algorithm, we first find the coordinate values of the (numNodes/3)th and (2 × numNodes/3)th nodes in a list sorted by the x coordinate. Similarly, we find partitioning points along the y and z dimensions, repeating the process iteratively, and partition the space.

In our algorithm, the dominant cost is for finding the kth smallest value along each dimension, for either the entire dataset or a subset of it. Thus, the time complexity for each such step is O(n). If we create X, Y, and Z partitions along the x, y, and z dimensions, the total number of times the above step is invoked is (X + X × Y + X × Y × Z), which is O(p), where p is the number of partitions desired. In practice, the number of partitions is quite small. As we will demonstrate, the execution time of this algorithm is very small, while the number of cross edges is reduced compared to the GPU-based trivial partitioner. In summary, this method provides a balance between the partitioning time and the efficiency of application execution.
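The following is a sketch of how this selection-based splitting could be implemented on the CPU, using std::nth_element for the expected O(n) kth-smallest step; the data layout and function names are assumptions rather than the actual implementation.

#include <algorithm>
#include <vector>

struct Node3D { float c[3]; int id; };   // coordinates plus original node index

// Recursively split a range of nodes into parts[dim] equal-sized groups along
// dimension dim by selecting the k-th smallest coordinate, then recurse on each
// group along the next dimension. Leaves assign partition ids to the nodes.
void md_partition(std::vector<Node3D> &nodes, int lo, int hi,
                  const int parts[3], int dim,
                  std::vector<int> &nodePart, int &nextPart)
{
    if (dim == 3) {                              // leaf: one spatial cell = one partition
        int p = nextPart++;
        for (int i = lo; i < hi; ++i) nodePart[nodes[i].id] = p;
        return;
    }
    int k = parts[dim], n = hi - lo;
    for (int g = 0; g < k; ++g) {
        int begin = lo + (int)((long long)n * g / k);
        int end   = lo + (int)((long long)n * (g + 1) / k);
        if (end < hi)                            // place the end-th smallest at position end
            std::nth_element(nodes.begin() + begin, nodes.begin() + end,
                             nodes.begin() + hi,
                             [dim](const Node3D &a, const Node3D &b) {
                                 return a.c[dim] < b.c[dim];
                             });
        md_partition(nodes, begin, end, parts, dim + 1, nodePart, nextPart);
    }
}

For the 12-partition example above, the caller would fill nodes with coordinates and original indices, pass parts = {3, 2, 2}, and start with dim = 0 and nextPart = 0.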

3.3.2 Iteration Reordering

The goal of the iteration reordering is to ensure that each thread block can process all data from one partition of the reduction space continuously. This not only enables better reuse of shared memory, but also ensures that the indirection array and other arrays with affine accesses in the original loops are accessed from device memory in a coalesced fashion. The iteration reordering module has two components: a reordering component and an updating component. As shown in Figure 3.5, after partitioning, all the data associated with nodes will be reordered by the reordering component. For example, in the molecular dynamics application, coordinates, velocities, and forces are associated with each molecule. Thus, these structures need to be reordered based on the results of the reduction space partitioning.

Figure 3.5: Runtime Iteration Reordering (the partitioning module maps nodes to partitions; the reordering and updating components of the runtime reordering module produce the reordered reduction and computation spaces)

Then, the interactions between every pair of molecules need to be updated due to the change in the order of molecules. This work is performed by the updating component. The partition an interaction belongs to is determined by its end points. If both end points of the interaction belong to partition A, this interaction will be reordered to be in partition A only. However, if the two nodes on one interaction belong to different partitions, this interaction will be stored in both partitions. Thus, extra memory is required to store the edges crossing the partitions.

The cost of runtime reordering is as follows. The reordering component involves finding all the elements belonging to the ith partition serially. The complexity of a simple algorithm will be O(numPartitions × N). When N is large, the overhead can be quite significant. Hence, one optimization is to use numPartitions temporary dynamic vectors to store the elements of each partition. Then, only one pass over all the nodes is required to group the nodes within the same partition together. Next, we copy all the vectors, by increasing partition number, into a single data array. The time complexity for this method is only O(N).
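A small sketch of this O(N) bucket-based reordering is shown below (the names are illustrative); the oldToNew map is what the updating component would later use to rewrite both endpoints of every edge.

#include <vector>

// One pass buckets the nodes by partition; a second pass concatenates the
// buckets in partition order, giving the O(N) behavior described above.
void reorder_reduction_space(const std::vector<int> &nodePart, int numPartitions,
                             std::vector<int> &newOrder,   // new position -> old node id
                             std::vector<int> &oldToNew)   // old node id -> new position
{
    int N = (int)nodePart.size();
    std::vector<std::vector<int>> buckets(numPartitions);
    for (int i = 0; i < N; ++i)               // single pass: group nodes by partition
        buckets[nodePart[i]].push_back(i);

    newOrder.clear();  newOrder.reserve(N);
    oldToNew.assign(N, -1);
    for (int p = 0; p < numPartitions; ++p)   // concatenate buckets in partition order
        for (int oldId : buckets[p]) {
            oldToNew[oldId] = (int)newOrder.size();
            newOrder.push_back(oldId);
        }
}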

3.4 Experimental Results

3.4.1 Experimental Platform

Our experiments were conducted using an NVIDIA Tesla C2050 (Fermi) GPU with 448 processor cores (14 × 32). The GPU has a clock frequency of 1.15 GHz and a 2.68 GB device memory. This GPU was connected to a machine with Intel 2.27 GHz Quad-core

Xeon E5520 CPUs, with 48 GB main memory. The sequential CPU execution results we report are also based on this machine.

Our evaluation was based on two representative irregular reduction applications, which have been widely used by studies on this topic [6, 26, 49, 94, 32, 50, 89, 92]. The first application is Euler [25], which is based on Computational Fluid Dynamics (CFD). It takes a description of the connectivity of a mesh and calculates quantities like velocities at each mesh point. The second application is Molecular Dynamics (MD) [61], where simulation is used to study the structural, equilibrium, and dynamic properties of molecules. It shows a view of the motion of the molecules and atoms by simulating the interaction of the particles for a period of time.

3.4.2 Evaluation Goals

Our experiments were conducted with the following goals.

• We evaluate the performance of the two applications with the Partitioning-Based

Locking (PBL) scheme introduced in this chapter, comparing against simple approaches

like full replication and locking, and measuring the speedups against CPU-based sequential versions.

• We further study the performance factors impacting the PBL scheme, like the number

of partitions used, type of partitioner used, and the choice of configuration of shared

memory.

• Last, we focus on the possibility that the irregular application may be adaptive and

may require the runtime modules to be reinvoked. We evaluate how we keep the

overheads low in such scenarios.

Figure 3.6: Euler: Comparison of PBL Scheme Over Conventional Strategies and Sequential CPU Execution

3.4.3 Efficiency of Partitioning Based Locking Scheme

Our experiments were conducted using the following datasets. For Euler, the first dataset comprises 20,000 3-dimensional nodes, 120,000 edges, and 12,000 faces. The

Figure 3.7: Molecular Dynamics: Comparison of PBL Over Conventional Strategies and Sequential CPU Execution

second dataset involves 50,000 nodes, 300,000 edges, and 29,000 faces. For both of the datasets, the main computation was repeated over 10,000 time-step iterations. For Molecular Dynamics, the first dataset comprises 37,000 molecules and 4,600,000 interactions. The second dataset contains 131,000 nodes and 16,200,000 edges. For both the datasets, we initially consider a non-adaptive version, where the time-step loop involves 100 iterations

(with no modifications to the indirection array). Later, in Subsection 3.4.6, we consider a different version, where the indirection array is modified after every 20 iterations.

For this subsection, the multi-dimensional partitioner is used for the PBL scheme. Also, the results presented for the partitioning-based locking scheme, full replication, and the locking scheme are all reported with the thread block configuration which gave the best performance.

In Figure 3.6, for Euler with the 20K dataset, the partitioning-based locking scheme outperforms the full replication and locking schemes. The PBL scheme is about a factor of 3.3

times, 4.1 times, and 30.9 times faster than the locking scheme, full replication, and the sequential CPU version, respectively. The results with the 50K dataset further demonstrate the effectiveness of the PBL scheme.

The results for Molecular Dynamics are similar. Figure 3.7 summarizes the results for

the 37K dataset and 131K dataset, showing the PBL scheme is 3.1 times, 8.2 times, and

17.1 times faster than locking, full replication, and the sequential CPU version, respectively, for the 37K dataset. The results from the second dataset are again very similar

(Figure 3.7).

3.4.4 Impact of Various Partitioning Schemes on Irregular Reductions

In Section 3.3, we described three kinds of partitioning methods: the Metis Partitioner, GPU Partitioner, and Multi-Dimensional Partitioner. Here, we analyze the costs and tradeoffs between these three partitioning methods. These three methods are referred to as MP, GP, and MD, respectively.

Figure 3.8: Cost Components of Partitioners (Euler): initialization, partitioning, and reordering time for MP, GP, and MD as the number of partitions is varied

Overall Efficiency of Partitioning Methods: In Figure 3.8, we first compare the runtime of the three partitioning methods as we vary the number of partitions desired.

For Euler, GP has the shortest running time, across varying number of partitions, as we would expect because of its simplicity. On the other hand, for Euler, MP is around 2.8

times faster than MD when 14 partitions are desired. However, the execution time of MP

increases sharply with the increasing number of partitions, whereas MD is not influenced by

the number of partitions significantly. Thus, we observe that MD outperforms MP when 42

partitions are desired. For Molecular Dynamics, there is a similar trend. GP again achieves the

best performance. However, the running time of MP is high, because there are more edges

(interactions) per node (molecule), and MP performs an analysis of the edge information.

In comparison, both MD and GP methods only focus on the node information, and their cost is based on the number of nodes.

Cost Components of Partitioning Methods: The total cost of a partitioning scheme includes not only the running times we reported above, but also an initialization cost and the

reordering time. Figure 3.8 considers each of these three costs. The results demonstrate

that the most expensive component of MP is its initialization, which takes more than 90%

of the total partitioning time. In comparison, the initialization component for GP and MD

partitioners is either drastically lower or completely eliminated. The reordering time is

similar for all the three methods. The results are very similar for Molecular Dynamics, and

are not shown here.

Impact of Partitioners on GPU Computation Time: Different partitioning methods differ not only with respect to the time it takes to partition the data and reorder iterations, but also in terms of the resulting efficiency during the execution of the application. Particularly, as we partition the reduction space, the workload balance and redundancy in the computation space are critical factors impacting the performance of the application.

70 +!!!"

*!!!"

)!!!" ,-.#&" (!!!" /-.#&" ,0.#&" '!!!" ,-.$*" &!!!" /-.$*"

%!!!" ,0.$*" !"#$%"&'()*+,#&-"./0( ,-.&$" $!!!" /-.&$" #!!!" ,0.&$"

!" #" $" %" &" '" (" )" *" +" #!"##"#$"#%"#&"#'"#("#)"#*"#+"$!"$#"$$"$%"$&"$'"$("$)"$*"$+"%!"%#"%$"%%"%&"%'"%("%)"%*"%+"&!"&#"&$" 1&#--".(.234,#(

Figure 3.9: Euler: Workload with Metis, GPU, and Multi-dimensional Partitioners on 14, 28, and 42 Partitions

Figures 3.9 and 3.10 show the workload distribution achieved with different partitioners and numbers of partitions. MP-14, for example, refers to the use of the Metis partitioner to obtain

14 partitions.

The work distribution with MP is highly balanced as the algorithm explicitly uses the edge information. The workload distribution with GP is clearly more imbalanced when compared to the other two partitioners, since it uses neither the spatial locality of nodes, nor the information on the edges. However, we can see that the imbalance per partition reduces as we increase the number of partitions. On the other hand, MD, using only the spatial locality of nodes, achieves a comparable load balance to MP. Similar trends are observed for the redundant workload arising due to cross edges. In view of the earlier observation about the smaller overheads (because the initialization time is eliminated), MD seems to be the best overall option. We further validate this with the next set of results.

71 )!!!!!"

(!!!!!"

,-.#&" '!!!!!" /-.#&"

&!!!!!" ,0.#&" ,-.$*" %!!!!!" /-.$*" ,0.$*" !"#$%"&'()*+,#&-"./0( $!!!!!" ,-.&$" /-.&$" #!!!!!" ,0.&$"

!" #" $" %" &" '" (" )" *" +" #!" ##" #$" #%" #&" #'" #(" #)" #*" #+" $!" $#" $$" $%" $&" $'" $(" $)" $*" $+" %!" %#" %$" %%" %&" %'" %(" %)" %*" %+" &!" &#" &$" 1&#--".(2345,#(

Figure 3.10: Molecular Dynamics: Workload with Metis, GPU, and Multi-dimensional Partitioners on 14, 28, and 42 Partitions

End-to-End Execution Time Using Different Partitioning Methods: We now compare

the end-to-end execution time using the three partitioning methods. Results with different

thread configurations are shown in Figure 3.11 (Euler, smaller dataset). On one hand, with MP, the partitioning time is even larger than the computation time due to the significant cost of initialization. On the other hand, GP has the lowest computation performance because more redundancy and imbalance are introduced. Thus, MD outperforms both of them by achieving a balance between the partitioning time and the execution efficiency. The same comparison for Molecular Dynamics (37K dataset) is shown in Figure 3.12. The trends are very similar.

Overall, the observation with respect to the choice of partitioners is as follows. The Metis package currently does not include any GPU-accelerated implementation of partitioners.

As compared to the execution of the application itself on the GPU, the partitioning costs associated with the rigorous algorithms become unreasonably high. Moreover, unlike the use of partitioners for distributed memory machines, the cost of cross edges is not as high, since no communication is involved in our proposed scheme for the GPUs.

72 &!"

%#"

%!"

/012302456"7480" $#" .92::15"7480" !"#$%&'$()% ;1<="7480" $!" ;18<>?9:15"7480"

#"

!" %'()*" %'($%'"%'(%#)"%'(#$%" %'()*" %'($%'"%'(%#)"%'(#$%" %'()*" %'($%'"%'(%#)"%'(#$%" +," -." +."

Figure 3.11: Comparison of Metis, GPU and Multi-dimensional using 28 Partitions for Euler (20K)

The MD partitioner, which is based on node information only, has a significantly lower overhead, despite being currently implemented on a CPU only. At the same time, it maintains a good load balance, and introduces only a modest amount of redundant computation for the GPU.

3.4.5 Other Performance Issues

We also evaluated how a very new (and relatively unique) architectural feature of Fermi impacts the performance of our PBL scheme. In particular, in the Fermi architecture, application developers can configure a 64 KB fast memory (on each SM) in two ways. It can be configured as a 48 KB programmable shared memory and 16 KB L1 cache, or as a 16 KB programmable shared memory and a 48 KB L1 cache. We refer to them as shared memory preferred and L1 preferred configurations, respectively.
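As a point of reference, this choice is made through the CUDA runtime API; a minimal sketch of selecting between the two configurations:

#include <cuda_runtime.h>

// Select the Fermi on-chip memory split: "shared memory preferred"
// (48 KB shared / 16 KB L1) versus "L1 preferred" (16 KB shared / 48 KB L1).
void choose_memory_configuration(bool preferSharedMemory)
{
    cudaDeviceSetCacheConfig(preferSharedMemory ? cudaFuncCachePreferShared
                                                : cudaFuncCachePreferL1);
}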

This choice can clearly impact the PBL scheme, as the available shared memory changes the number of partitions we need to create. The number of partitions, in turn, influences performance in the following way.

73 #!!!" +!!" *!!" )!!" (!!" 1234524678"96:2" '!!" 0;4<<37"96:2"

!"#$%&'$()% &!!" =3>?"96:2" %!!" =3:>@A;<37"96:2" $!!" #!!" !" &$,(&" &$,#$*"&$,$'("&$,'#$" &$,(&" &$,#$*"&$,$'("&$,'#$" &$,(&" &$,#$*"&$,$'("$*,'#$" -." /0" -0"

Figure 3.12: Comparison of Metis, GPU and Multi-dimensional using 42 Partitions for Molecular Dynamics (37K)

A larger number of partitions can increase the number of cross edges, and increase the amount of computation. The workload imbalance observed can also vary with the number of partitions, though it is not necessarily higher with a higher number of partitions. A larger L1 cache, however, can allow better reuse of data structures not allocated in shared memory.

We conducted an experiment to evaluate this issue using Euler with both the 20K and 50K datasets. For the 20K dataset, the L1 preferred strategy (requiring 28 partitions) performs better than the shared memory preferred strategy (requiring 14 partitions), as we show in Figure 3.13. The reason seems to be that even with 14 partitions, we use only about 20 KB of the 48 KB available (because there are 14 SMs in the card, fewer partitions would imply that not all SMs can process data). In comparison, with 28 partitions fitting into

16 KB shared memory, we obtain better reuse from L1 cache. The trends change with the larger datasets. Now, while 14 partitions can still fit in 48 KB shared memory, 42 partitions

74 !#," !#+" !#*" !#)" !#(" !#'" !"#$%&'$()% !#&" !#%" !#$" !" $'"-./00123" %+"-./00123" $'"-./00123" '%"-./00123" 4567/%!8" 4567/(!8"

Figure 3.13: Shared Memory Preferred (14 Partitions) Vs. Cache Preferred (28 Partitions) - (left) Shared Memory Preferred (14 Partitions) Vs. Cache Preferred (42 Partitions) - (right)

are needed if we have only 16 KB shared memory. This leads to a substantial increase in redundant workload, and worse performance.

3.4.6 Performance with Adaptive Execution

Our experiments so far with the Molecular Dynamics application executed the main computations for 100 iterations, with no change in interactions between the molecules. In practice, however, this application involves a change in the interactions (set of edges or iterations) over time. Thus, we also experimented with a version that changed the indirection array after every 20 iterations. In comparison to the conventional locking or replication approaches, our PBL scheme needs to perform partitioning and reordering every time the indirection array changes.

Figure 3.14 shows the comparison among different partitioning methods for Adaptive

Molecular Dynamics. For MP, a large overhead is introduced. However, for MD, the cost for re-partitioning is only 0.96% of the total running time. Consequently, the PBL scheme

75 %#!!"

%!!!"

$#!!"

9:0AB:A345"C3D:"

$!!!" -=A>>04"C3D:" ?0;E"C3D:"

?0D;7F=>04"C3D:" #!!"

!" &%'(&" &%'$%)" &%'%#(" &%'#$%" &%'(&" &%'$%)" &%'%#(" &%'#$%" &%'(&" &%'$%)" &%'%#(" &%'#$%" *+" ,-" *-" -./" /012345"6788"9:;<" ?-@" 831=>04"

Figure 3.14: Comparison of MP, GP, and MD for Adaptive Molecular Dynamics (37K dataset, 42 partitions)

still achieves a speedup of 11.6 compared to the sequential CPU version. Figure 3.14 also shows a comparison of the performance obtained from the PBL scheme against the conventional strategies, in which Locking and Full Replication are shown with their best thread configurations. Again, despite the cost of re-invoking the runtime modules, the PBL scheme achieves a significantly better performance when compared to the full replication and locking schemes. Overall, these results show that the PBL scheme with the MD partitioner is effective

even for adaptive irregular reductions.

3.5 Related Work

We now compare our work against similar research efforts.

To the best of our knowledge, there is no existing systematic study considering irregular

reductions on modern GPUs, and the use of shared memory for this class of applications.

However, there have been some individual application studies considering Euler, Molecular

Dynamics, or very similar applications. Andrew Corrigan et al. [24] have parallelized the

76 Euler computation on a per-element basis on a GPU, with one thread per element. In their

approach, a large amount of redundant work arises because of the repeated computation of

edges for all the elements they connect to. In comparison, in our approach, through partitioning, redundant work just exists at the boundaries of the partitions. In addition, we are

able to use shared memory effectively. Another effort [123] parallelizes Lennard-Jones-based molecular dynamics simulation, in which there is no bond between the molecules.

So the computation model in this work is not an irregular reduction. Another study [80]

focusing on bond Molecular Dynamics used a parallelization method very similar to full

replication. The memory overheads are very significant in this method, and there is a limit

on the size of the data that can be processed. The work on Amber 111 and Friedrichs’ work [39] have both created specific molecular dynamics implementations on GPUs. De- tails of their implementation schemes are not available and the actual implementations are also not distributed publicly. The limitations on number of molecules that can be simulated using Amber 11 implementation suggests the use of full replication.

The key differences in our work are: 1) our work is general to the irregular reduction class of applications, 2) shared memory is utilized by partitioning, 3) a multi-dimensional partitioning method has been introduced, which balances the partitioning time and the execution efficiency, and 4) a detailed comparison of several approaches has been carried out across multiple applications.

A class of applications that is somewhat similar to irregular reductions, namely sparse matrix computations, has recently been studied on GPUs as well, including the auto-tuning work at Georgia Tech [23] and earlier studies by Garland [44]. Despite some similarities, there are significant distinct challenges in the parallelization of unstructured


mesh computations, which have not been addressed in the efforts on sparse matrix computations.

Several systems have tried automating the use of shared memory on GPUs. Among

these, Baskaran et al. have provided an approach for automatically arranging shared memory by using the polyhedral model for affine loops [12]. Moazeni et al. have adapted

approaches for register allocation to manage shared memory on GPU [93]. Finally, Ma et

al. have considered an integer programming based formulation [88]. None of these efforts

have considered irregular applications, where runtime preprocessing needs to be used.

As stated earlier, irregular reductions have been studied widely on different types of

architectures. These include efforts targeting distributed memory parallel machines [6, 26, 49, 52, 73, 75, 103, 128], distributed shared memory machines [94, 46], shared memory machines [15, 48, 78, 132], and cache performance improvement on uniprocessor machines [32, 50, 89, 92]. The key distinctive aspect of modern GPUs is the programmable

cache or shared memory. Thus, while the use of runtime preprocessing and partitioning is

similar to several of the above efforts, the details of our work are clearly distinct.

3.6 Summary

GPUs have emerged as a major player in high performance computing today. However,

programmability and performance efficiency continue to be challenges which can restrict

the use of GPUs. Particularly, irregular or unstructured computations can be quite challenging to parallelize and performance-tune on GPUs.

This chapter has considered applications that involve unstructured meshes. We have developed a general methodology and optimized runtime support for this class of applications.

Particularly, we have proposed a novel execution strategy, which we refer to as the partitioning-based locking scheme. The main idea is to partition the reduction space, and then reorder the computation. We have developed a new low-cost partitioning scheme, which we refer to as multi-dimensional partitioning. Several optimizations have also been performed to improve the execution time of the runtime modules.

Our detailed evaluation using two popular irregular reductions has demonstrated the following. First, our partitioning-based scheme clearly outperforms the conventional schemes, and results in impressive speedups over sequential CPU versions. Moreover, our multi-dimensional partitioning scheme achieves the best tradeoff between the partitioning time and the application execution time. Our runtime modules are also efficient enough to support adaptive irregular applications.

Chapter 4: Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations

Based on the execution methodology for irregular reductions on modern GPUs introduced in the last chapter, in this chapter we further extend our original methodology to port irregular reductions on heterogeneous CPU-GPU architectures.

4.1 Multi-level Partitioning Framework

This section describes the overall framework for supporting irregular reductions in a heterogeneous setting. The key component of our framework is a multi-level partitioning method. In the last chapter, we introduced two kinds of single-level partitioning methods: computation space partitioning and reduction space partitioning.

In computation space partitioning, the workload or the iterations of the loop are partitioned into different groups. For the GPU, suppose we have n multiprocessors. We can create n × k partitions, such that the size of the set of reduction array elements accessed in each partition can fit into the shared memory. Each multiprocessor can process k partitions, loading the corresponding reduction array elements before executing each partition. Similarly, for the CPU, partitions can be equally distributed to n processors. Note that the CPU does not have a programmable cache, and all the data, including reduction array elements, are accessed from the main memory.

For both GPU and CPU, this scheme has several disadvantages. First, in this partitioning method a reduction element can belong to multiple partitions. Moreover, the number of elements corresponding to one partition may vary significantly. So the resulting redundancy (i.e., keeping reduction elements in multiple partitions) and unevenness in the size of each partition can make it difficult to reuse shared memory in an effective fashion. In addition, there can be a significant overhead to combine the different copies of reduction elements on both the GPU and the CPU cores.

On the other hand, reduction space partitioning partitions the reduction space, i.e., the set of elements in the reduction array. For the execution on the GPU, the number of partitions can be chosen such that the reduction space corresponding to each partition can be put into shared memory. Compared to computation space partitioning, the reduction space sizes across different partitions are the same, and can be controlled to fit into shared memory.

However, the redundant workload caused by the crossing edges increases with the number of edges. In the last chapter, we introduced a multi-dimensional partitioning method, which can achieve a balance between minimizing the number of cut edges and reducing the execution time of the partitioning itself.

The biggest advantage of this scheme is that there is no overlap between reduction elements on different partitions. Thus, thread blocks on a GPU and threads on CPU cores can be executed in parallel without any locking or need for combination.

Thus, while there are clearly trade-offs between these two schemes, we were driven by the following observations to choose reduction space partitioning for both CPU and GPU execution. First, for the GPU, shared memory is a critical resource, and it is important to judiciously utilize it.

Figure 4.1: Framework of Multi-level Partitioning (first-level partitioning on the host, task scheduling through global and local task queues, second-level partitioning, and computation on GPU multiprocessors and CPU cores)

Second, there is a high degree of parallelism available on a GPU, and thus, redundant computations to an extent are acceptable. For the CPU, computation space partitioning would normally imply the use of locking, which can, however, introduce significant overheads.

4.1.1 Multi-level Partitioning

A partitioning method like the reduction space partitioning can help divide the work between different SMs of a GPU or different cores of a CPU. In our work, in addition to this goal, we have two other goals, which are utilizing both the CPU and the GPU simultaneously, and being able to execute on datasets that exceed the size of the device memory.

Thus, we introduce two levels of partitioning in our framework.

Our overall framework is referred to as the Multi-level Partitioning Framework, and comprises four components: First-level Partitioning, Task Scheduling, Second-level Partitioning, and Computation Execution. The second-level partitioning, where either the

CPU or the GPU is considered independently, is the same as the single-level partitioning discussed above. In the rest of this subsection, we discuss the first-level partitioning, where the goal is simultaneous utilization of heterogeneous resources. Then, we discuss task scheduling on heterogeneous architecture in Section 4.2.1.

As stated above, there are two goals behind first-level partitioning. First, the input data is divided into blocks, such that each block can fit in the GPU's device memory. Thus, an arbitrarily large dataset can be divided into a sufficient number of blocks to enable execution on the GPU. Second, as we can see from Figure 4.1, the output of first-level partitioning is the input to the task scheduling component. Thus, first-level partitioning enables distribution of work on the CPU and the GPU.
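As a rough sketch, the number of first-level blocks could be chosen from a device-memory budget as follows; all sizes and names here are illustrative rather than the framework's actual interface.

// Pick enough first-level blocks that one block's reduction, per-edge, and
// indirection data fit within a device-memory budget (illustrative only).
int first_level_block_count(long long numNodes, long long numEdges,
                            long long bytesPerNode, long long bytesPerEdge,
                            long long deviceMemBudget)
{
    long long total  = numNodes * bytesPerNode + numEdges * bytesPerEdge;
    long long blocks = (total + deviceMemBudget - 1) / deviceMemBudget;
    return (int)(blocks < 1 ? 1 : blocks);
}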

Clearly, both computation-space partitioning and reduction-space partitioning can be possible approaches for first-level partitioning. While we use reduction-space partitioning for the second-level, there is clearly no requirement that the same method has to be used for

first-level partitioning. However, there is a significant advantage associated with reduction space partitioning at the first-level as well. First, in computation space partitioning, if we do not make additional copies, the reduction objects might have to be shared by different partitions, which is not possible when CPU and the GPU have different address spaces.

Although we can possibly use a technology like unified virtual addressing, the large number of remote accesses will significantly impair the performance. Second, if we do make multiple copies, the combination overhead (including the data movement costs) will be very high. Consequently, we also choose reduction space partitioning as the method at the first level.

Figure 4.2: Runtime Support Component for Partitioning (the partitioning module maps nodes to partitions; for second-level partitioning, a search component marks nodes that are not in a partition's reduction space before the reordering and updating components produce the reordered reduction and computation spaces)

At each partitioning level, we need to perform the reordering and the updates on the reduction space, the computation space, and other input data, as shown in Figure 4.2. However, more work has to be done at the second-level partitioning, compared to ordinary single-level partitioning. In the first-level partitioning, the partitioning, reordering, and updating operations are conducted on the entire dataset. But for the second level, the reordering and updating operations cannot just be performed individually on the partitions received from the first level.

Specifically, the reordering component needs to reorder the nodes by their partition indices. But the nodes from a partition created at the first level include only the ones in the partition's reduction space, which is only a subset of the entire input data. For example,

in Figure 3.3b, we can see that the dark nodes belong to the input dataset, but not to the reduction space of a particular partition. On the other hand, an updating operation needs to update the index of both ends of each edge. Thus, both the white nodes' and the dark nodes' indices need to be updated. Consequently, in addition to the original reordering and updating components, a search component is introduced into this framework. This component is responsible for searching and marking all the nodes belonging to the input dataset that are not in a partition's reduction space, i.e., the dark nodes in Figure 3.3b.

Now, we can summarize the key aspects of our overall scheme.

• Partition the output or reduction space, such that the reduction objects in each

partition (at the second level) can fit into the shared memory for GPU.

• Reorder the computation space or indirection array in such a way that iterations updating

elements from the same partition are grouped together. Computation is independent

between thread blocks on GPU and threads on CPU.

• After the computation is finished, there is no requirement of combination between

different thread blocks on a GPU, different threads on a CPU, or across the CPU and

the GPU.

4.2 Runtime Support and Schemes

This section describes the key components of our runtime system. We initially describe the basic task scheduling framework. Later, we describe the pipelining performed in the framework, and describe how task scheduling is optimized when pipelining is enabled.

85 4.2.1 Task Scheduling Framework

Task scheduling is a critical component for enabling the use of both the multi-core CPU and the GPU for the same application. In our framework, two types of task queues, i.e., a global task queue and local task queues, are used. After the first-level partitioning, all the partitions are first put into the global task queue, which is shared between the CPU and the

GPU. Both of these units can access and update it while obtaining work to perform. Locking is used to ensure correctness as both the CPU and the GPU access it. On the other hand, both the CPU and the GPU have their own private local task queue (see Figure 4.1).

The partitions obtained from the global task queue are first inserted into the local task queue. Then, the CPU or the GPU can obtain one partition from their respective local queue, carry out the second-level partitioning, and then perform the same computation.

In heterogeneous architectures comprising a multi-core CPU and a GPU, the GPU has a higher throughput for most applications than a typically sized multi-core CPU. Thus, it makes sense to allow multiple partitions at the first level to be assigned to the GPU. This also helps reduce the overheads due to the data transfer latency on the GPU.
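A minimal sketch of this two-level queue structure (lock-protected global queue, per-device local queues, and a per-device fetch granularity) is given below; the types and names are illustrative, not the actual runtime API.

#include <deque>
#include <mutex>

// Global queue of first-level partitions, shared by the CPU and GPU workers.
struct GlobalTaskQueue {
    std::deque<int> parts;        // ids of first-level partitions
    std::mutex      lock;

    // Move up to `granularity` partitions into a worker's local queue;
    // the GPU worker would typically use a larger granularity than the CPU.
    int fetch(std::deque<int> &localQueue, int granularity) {
        std::lock_guard<std::mutex> guard(lock);
        int taken = 0;
        while (taken < granularity && !parts.empty()) {
            localQueue.push_back(parts.front());
            parts.pop_front();
            ++taken;
        }
        return taken;
    }
};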

4.2.2 Runtime Pipeline Scheme

For processing an irregular reduction using one of the partitioning methods, there is a partitioning component, and the actual computation. Moreover, the results of the partitioning component are the input to the computation. The partitioning is typically performed on the CPU, whereas computations can be performed on the GPU or the CPU. But, while partitioning is performed on the CPU, the GPU is idle, leading to a significant overall impact on the performance.

Figure 4.3: Runtime system: Pipelining Scheme (Left) and Task Scheduling Scheme (Right)

We have developed a pipelining framework to overlap partitioning on the CPU with computations on the GPU. Figure 4.3 shows the processing flow with our pipelining scheme.

Each time, multiple blocks can be loaded from the global task queue to the GPU local task queue. In each step, the CPU can obtain one block from the GPU local task queue, perform the second-level partitioning, and make it ready for processing on the GPU. While the GPU performs the computations, the CPU will partition the next available block in the pipeline. Synchronization is needed when the GPU finishes computation on the first block and the CPU finishes partitioning on the second block.

With this method, the GPU is idle only when the partitioning for the first block is performed. The partitioning for other blocks in the pipeline can be overlapped with computation on the previous block. An important factor for the implementation of the pipeline is the length of the pipeline. Because the pipelining takes blocks from the local task queue, the

87 maximum length of the pipeline is the number of blocks loaded into the GPU’s local task

queue at any given time. We prefer to keep the pipeline as long as possible to reduce the

number of synchronization steps between each loading operation.
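The sketch below illustrates one way such a pipeline could be organized with a CUDA stream; partition_on_cpu and launch_on_gpu are hypothetical helpers standing in for the second-level partitioning step and the asynchronous copy-and-launch step, respectively.

#include <cuda_runtime.h>
#include <deque>

struct PartitionedBlock;                       // opaque here (hypothetical)
PartitionedBlock *partition_on_cpu(int blockId);                 // hypothetical helper
void launch_on_gpu(PartitionedBlock *blk, cudaStream_t stream);  // hypothetical helper

// While the GPU computes on block k, the CPU partitions block k+1;
// the GPU is idle only for the very first block.
void gpu_worker_pipeline(std::deque<int> &localQueue)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    PartitionedBlock *ready = nullptr;
    if (!localQueue.empty()) {                 // fill the pipeline with the first block
        ready = partition_on_cpu(localQueue.front());
        localQueue.pop_front();
    }
    while (ready != nullptr) {
        launch_on_gpu(ready, stream);          // GPU busy with block k ...
        PartitionedBlock *next = nullptr;
        if (!localQueue.empty()) {             // ... while the CPU partitions block k+1
            next = partition_on_cpu(localQueue.front());
            localQueue.pop_front();
        }
        cudaStreamSynchronize(stream);         // wait for block k to finish
        ready = next;
    }
    cudaStreamDestroy(stream);
}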

4.2.3 Optimized Task Scheduling

We now revisit the task scheduling problem, and show how we optimize the basic scheme that was described earlier in this section. The new scheme is driven by two goals, which are improving load balance and factoring in the new pipelining scheme.

The optimized scheme is also shown in Figure 4.3. Recall that in our basic scheme, the

main step is that the GPU and the CPU can load the partitions from the global task queue to

their local queues. The number of partitions loaded from the global task queue is referred

to as the scheduling granularity. As the performance obtained on the CPU and the GPU

may vary considerably, it is reasonable to provide different scheduling granularity for them.

Particularly, larger granularity for the GPU can reduce the data movement latency related

overheads, whereas smaller granularity for the CPU can provide a better load balance.

However, using larger granularity for the GPU may also cause load imbalance. Consider the following situation. Suppose the scheduling granularity for the GPU is k units, and there are exactly k units left towards the end of the processing. If the GPU finishes the previous job assignment, it can now obtain all k units. But, if the CPU also finishes the previous work, it will have to idle till the GPU finishes. Overall, as k increases, this idle time can get worse. At the same time, with a low value of k, the overheads of data

transfer for the GPU will be high.

Driven by this observation, as shown in Figure 4.3, we have optimized the task scheduling scheme based on work stealing. Suppose the GPU scheduling granularity is 3 and the

CPU scheduling granularity is 1. Suppose there are 8 blocks, where the CPU finishes execution of the 5th block, while blocks 6, 7, and 8 have been assigned to the GPU. The load balance can be improved by allowing the CPU to steal work from the GPU's local task queue.

However, a problem can arise if the CPU gets the next available partition in the task queue. For example, when the GPU is doing the computation on the 6th block, because of the pipelining, we may be performing partitioning on the 7th block, which is the next block in the local task queue. Thus, if the CPU now performs the computations for the 7th block, the pipeline between the partitioning and the computation components would be broken, and partitioning and computation for the 8th block will have to be performed serially. To address this problem, the CPU always steals the workload from the tail of the task queue. In this fashion, the pipeline is not disturbed, and better load balance is also achieved.
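A small sketch of the stealing rule (illustrative names, not the actual implementation): the GPU-side pipeline consumes blocks from the head of its local queue, while an idle CPU steals from the tail, so the block currently being partitioned for the pipeline is never taken away.

#include <deque>
#include <mutex>

struct LocalTaskQueue {
    std::deque<int> blocks;
    std::mutex      lock;

    bool pop_front_for_owner(int &blockId) {      // GPU pipeline consumes the head
        std::lock_guard<std::mutex> g(lock);
        if (blocks.empty()) return false;
        blockId = blocks.front(); blocks.pop_front();
        return true;
    }
    bool steal_from_tail(int &blockId) {          // idle CPU steals the tail
        std::lock_guard<std::mutex> g(lock);
        if (blocks.empty()) return false;
        blockId = blocks.back(); blocks.pop_back();
        return true;
    }
};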

4.3 Experimental Results

In this section we present results from a number of experiments we conducted to evaluate the benefits of our framework and the optimizations.

Experimental Setup: Our experiments were conducted on a heterogeneous architecture with an NVIDIA Tesla C2050 GPU and two Intel 2.27 GHz Quad-core Xeon E5520 CPUs.

The Tesla C2050 GPU, also called Fermi, has 448 cores (14×32), each of which has a clock frequency of 1.15 GHz. The memory hierarchy consists of a 2.68 GB device memory, 768

KB common L2 cache, and 16 KB or 48 KB configurable shared memory for each SM. On the other hand, the host is configured with two Quad-core Intel Xeon processors with a total of 8 cores, and a 48 GB main memory. The GPU is connected to the host through a PCI Express 2.0 (x16) link, which can provide 8 GB/s bandwidth. In the following experiments, for the multi-core CPU we always use 8 threads for the 8 cores, and for the GPU, the results are shown with a thread configuration of 14 blocks and 512 threads in each block.

Our evaluation was based on two representative irregular reduction applications, which

have been widely studied [6, 26, 49, 94, 32, 50, 89, 92]. The first application is Euler

(EU) [25], which is based on Computational Fluid Dynamics (CFD). It takes a description

of the connectivity of a mesh and calculates velocities at each mesh point. The second

application is Molecular Dynamics (MD) [61], where simulation is used to study the structural, equilibrium, and dynamic properties of molecules. It shows a view of the motion of

the molecules and atoms by simulating the interaction of the particles for a period of time.

Evaluation Goals: The goals of our evaluation study were as follows.

• We first demonstrate the scalability of the irregular applications on multi-core CPU

and GPU separately, using our Multi-level Partitioning Framework. We also evaluate

the performance trade-offs across two partitioning methods, which are the reduction

space-based and computation space-based methods, on both multi-core CPU and

GPU.

• We then show the benefits obtained from pipelining the data partitioning and irregular

reduction computations.

• Finally, we study the performance gains achieved from dynamically distributing the

computations to utilize the multi-core CPU and GPU simultaneously.

4.3.1 Scalability of Irregular Applications on Multi-core CPU and GPU

In this experiment, we demonstrate the scalability of MD and EU on multi-core CPU and

GPU separately across 3 different datasets. The datasets and their characteristics for both

Figure 4.4: Scalability of Molecular Dynamics on Multi-core CPU and GPU across Different Datasets

the applications are shown in Table 4.1. For each application, we have 3 datasets, which

are small, medium, and large. For MD, the dataset sizes are 0.3 GB, 2.6 GB, and 5.3 GB, respectively. For EU, the dataset sizes are 1.8 GB, 2.7 GB, and 3.4 GB, respectively. The main computation loop in MD and EU is repeated for 20 and 100 time steps, respectively.

The results from the scalability experiments are shown in Figures 4.4 and 4.5. Here, we compare the relative speedup of multi-core (8 threads for 8 cores) and GPU executions against sequential CPU execution. It should be noted that, for datasets larger than the GPU device memory, the pipelining scheme has been used. Later in this section, we quantify the benefits of pipelined execution by comparing the performance against non-pipelined execution.

From Figures 4.4 and 4.5, we can see that for both MD and EU, there is good scalability from both the multi-core as well as the GPU versions. In fact, for MD, the speedup of both multi-core as well as GPU executions increases with the dataset size. In the case of EU, the speedup achieved from both multi-core and GPU executions is similar across

Figure 4.5: Scalability of Euler on Multi-core CPU and GPU across Different Datasets

different datasets. Overall for MD, we achieved a factor of 7.3 speedup from multi-core

CPU and a factor of 21.1 speedup from the GPU. For EU, we obtained a speedup of about

5 times and 16 times from multi-core CPU and GPU, respectively. For both applications,

GPU execution is about 3 times faster than execution on the multi-core CPU. Overall,

these trends indicate that our framework is effective in utilizing the multi-core CPU

and GPU for scaling irregular reductions. Particularly, we can see that even as the dataset

sizes become too large to fit in device memory of the GPU, our relative speedups do not

decrease. Thus, we are not only removing the key limitation of the previous work for

supporting irregular reductions on GPUs, we are also not losing any performance while

doing so. For the remainder of this section, we use the large datasets for both MD and EU.

4.3.2 Performance Trade-offs with Computation and Reduction Space Partitioning

In this experiment we compare the performance obtained from the computation space and reduction space partitioning, for both the multi-core CPU and the GPU. These two

Application           Dataset Description                        Tag
Euler                 7.5 Mil 3-D Nodes, 47.8 Mil Edges          Small
                      11.3 Mil 3-D Nodes, 71.8 Mil Edges         Medium
                      15 Mil 3-D Nodes, 95.7 Mil Edges           Large
Molecular Dynamics    0.13 Mil 3-D Molecules, 14.8 Mil Edges     Small
                      1 Mil 3-D Molecules, 124 Mil Edges         Medium
                      2 Mil 3-D Molecules, 253 Mil Edges         Large

Table 4.1: Dataset Characteristics of Euler and Molecular Dynamics

schemes are referred to as CSP and RSP, respectively. We report these results only for

MD, as the trends are identical for the other application. In Figures 4.6 and 4.7, we present the performance trade-offs between RSP and CSP on the multi-core CPU and the GPU, respectively. The X-axis represents the number of partitions used at the second level of partitioning using RSP. For CSP we always use 8 partitions for the multi-core CPU and 14 partitions on the GPU, to match the number of multi-processors available.

We first discuss the results from the multi-core CPU shown in Figure 4.6. The experiments were conducted with no partitioning (Partition L1=1) and 64 partitions at the first level (Partition L1=64). With only one partition on L1, we can see that when the L2 partition number is 8, RSP is 1.73 times faster than CSP. Moreover, the execution time of RSP decreases with an increasing number of partitions on L2. This is because load balancing is better with an increasing number of partitions for RSP. Thus, when the L2 partition number is 64,

RSP can outperform CSP by a factor of 2.8. At the first level, RSP is always used and hence

Partition L1=64 denotes the 64 partitions through RSP at the first level. In this case, CSP also improves by a factor of 2 compared to the case when there is no partitioning at the first

Figure 4.6: Performance Trade-offs with Computation and Reduction Space Partitioning on Multi-core CPU (MD); the x-axis is the partition number on Level 2, and the y-axis is the execution time (sec)

level. This is due to the reduced contention arising from the larger number of partitions at the first level. Also, RSP is 1.98 times faster than CSP. Another interesting observation is that RSP has a slight degradation in performance with an increasing number of partitions at

the second level. This is opposite to what we observed in the case of Partition L1=1. This is

because too many partitions introduce more redundant computations, leading to a decrease

in performance. Figure 4.7 presents a similar comparison on the GPU and the trends are

very similar to what we observed on the multi-core CPU.

Overall, partitioning on computation space is simple, but may lead to large overheads

due to synchronization. On the other hand, RSP has partitioning overheads and may introduce redundant computations. However, because the synchronization overhead can be

avoided to a large degree, RSP is significantly better than CSP. All experiments in the rest of this section will be based on RSP.


Figure 4.7: Performance Trade-offs with Computation and Reduction Space Partitioning on GPU (MD)
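To make the difference between the two partitioning options concrete, the following C++ sketch contrasts them on a small edge list. It is only an illustration under simplified assumptions (the edge list, node count, and partition count are hypothetical, and the reduction space is split into equal-sized node ranges), not the partitioner used in our framework.

#include <cstdio>
#include <vector>

struct Edge { int u, v; };  // an interaction between two nodes (reduction elements)

int main() {
    const int numNodes = 8, numParts = 2;
    std::vector<Edge> edges = {{0,1},{1,2},{2,3},{3,4},{4,5},{5,6},{6,7},{0,7}};

    // Computation space partitioning (CSP): split the edge list evenly.
    // A partition may update nodes owned by another partition, so the
    // reduction objects must be protected (replication or locking).
    size_t half = edges.size() / numParts;
    printf("CSP: partition 0 gets %zu edges, partition 1 gets %zu edges\n",
           half, edges.size() - half);

    // Reduction space partitioning (RSP): split the nodes into ranges and
    // assign an edge to every partition that owns one of its endpoints.
    // Crossing edges are replicated, which is the source of redundant work,
    // but each partition only ever updates its own nodes.
    auto owner = [&](int node) { return node / (numNodes / numParts); };
    std::vector<size_t> edgeCount(numParts, 0);
    size_t replicated = 0;
    for (const Edge& e : edges) {
        int pu = owner(e.u), pv = owner(e.v);
        edgeCount[pu]++;
        if (pv != pu) { edgeCount[pv]++; replicated++; }  // crossing edge
    }
    for (int p = 0; p < numParts; ++p)
        printf("RSP: partition %d gets %zu edges\n", p, edgeCount[p]);
    printf("RSP: %zu crossing edges replicated (redundant computation)\n", replicated);
    return 0;
}

The count of replicated crossing edges is exactly the redundant computation discussed above, and it grows with the number of partitions, which is why RSP degrades when too many partitions are used.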

4.3.3 Benefits From Pipelining

In this experiment, we demonstrate the benefits obtained from the use of the pipelining

scheme for large datasets that do not fit into the GPU device memory. For both MD and EU, we use the large dataset described earlier in this section, such that at least two partitions are required. Also, it should be noted that when the pipeline scheme is used, the partitioning is performed on the CPU (using a single thread) and the computation performed on the GPU is overlapped with this partitioning step.
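The pipelining idea can be summarized with a small host-side sketch: a producer thread keeps creating partitions while a consumer issues computation on partitions that are already available. This is only schematic, using C++ threads and an unbounded queue; PartitionOnCPU and ComputeOnGPU are hypothetical stand-ins for the actual partitioning and kernel-launch steps, and the real framework bounds the queue by the chosen pipeline length.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-ins for the real partitioning and GPU computation steps.
static void PartitionOnCPU(int id)  { printf("CPU: partitioned chunk %d\n", id); }
static void ComputeOnGPU(int id)    { printf("GPU: computed chunk %d\n", id); }

int main() {
    const int numPartitions = 8;           // pipeline length in this toy example
    std::queue<int> ready;                 // partitions produced but not yet computed
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    // Producer: the single CPU partitioning thread.
    std::thread producer([&] {
        for (int i = 0; i < numPartitions; ++i) {
            PartitionOnCPU(i);
            { std::lock_guard<std::mutex> lk(m); ready.push(i); }
            cv.notify_one();               // the GPU side can start on this chunk
        }
        { std::lock_guard<std::mutex> lk(m); done = true; }
        cv.notify_one();
    });

    // Consumer: overlaps GPU computation with the ongoing partitioning.
    while (true) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !ready.empty() || done; });
        if (ready.empty() && done) break;
        int id = ready.front(); ready.pop();
        lk.unlock();
        ComputeOnGPU(id);                  // runs while the producer keeps partitioning
    }
    producer.join();
    return 0;
}

Note that only the very first partition forces the consumer to wait, which matches the behavior observed for partition zero in the results that follow.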

Figure 4.8 shows the benefits from pipelining for MD. One critical factor affecting the performance of the Pipeline scheme is the length of the pipeline. On the X-axis, we have both the

Non-Pipeline and the Pipeline versions. In Pipeline-X, X indicates the length of the pipeline, and the number of partitions indicates the maximum number of partitions available. The two fractions in the figure refer to the computation time on the GPU and the idle time the GPU spends waiting for results from partitioning. It can be seen that the Non-Pipeline


Figure 4.8: Effect of Pipelining CPU Partitioning and GPU Computation (MD)

execution spends a huge fraction of time idle during partitioning and only a small fraction on actual computation, which is clearly not desirable. However, with Pipeline, as the number of partitions and the pipeline length increase, the overhead due to the partitioning is reduced. While increasing the number of partitions and the pipeline length reduces the percentage of idle time, the additional redundant computation caused by a larger number of partitions degrades performance beyond a point. The sub-view inside Figure 4.8 shows the overlapping of partitioning and computation. It indicates that there is a good overlap between partitioning and computation for all partitions except partition zero, as we would expect.

Figure 4.9 shows similar results for EU on the large dataset. One observation is that, compared to MD, the ratio between the number of edges and the number of nodes is much smaller for this application. Thus, less redundant work is introduced with an increasing number of


Figure 4.9: Effect of Pipelining CPU Partitioning and GPU Computation (EU)

partitions. The best performance is obtained from Pipeline-64, which outperforms the

Non-Pipeline by a factor of 5.05.

4.3.4 Benefits From Dividing Computations Between CPU and GPU

In this experiment, we show the performance benefits from the simultaneous use of the multi-core CPU (8 threads) and the GPU for computations. Figure 4.10 shows the performance achieved from the use of both resources (the CPU+GPU version) compared to using the multi-core CPU only and the GPU only. In comparison to the best of the multi-core only and GPU only versions, the CPU+GPU version achieves 11% and 22% better performance with EU and MD, respectively. From the experiments conducted so far, it might appear that the benefits from pipelining are higher than those from simultaneously using the CPU and GPU for computation. However, this is not the case. For instance, consider Figure 4.11, where we create two versions of MD executions. MD20 has 20 time steps for the main computations, which indicates that the ratio of partitioning to computation is large. On the contrary,


Figure 4.10: Benefits From Dividing Computations Between CPU and GPU for EU and MD

MD100 represents a scenario where the ratio of partitioning to computation is small. From the results, we observe that in the former case, the benefits from pipelining are larger, while in the latter, the benefits from the simultaneous use of CPU and GPU for computation are larger.

Next, in Figure 4.12, we compare our work stealing strategy with two alternate schemes, namely the fine-grained and the coarse-grained distribution schemes. For the fine-grained scheme, the data is divided between the CPU and the GPU at a finer granularity, one partition at a time in our case. For the coarse-grained scheme, the data is distributed at a coarser granularity, five partitions at a time in our case. The X-axis represents the length of execution for the GPU and the CPU, respectively. The Y-axis represents the amount of work (number of partitions) performed by the GPU and the CPU individually. With the fine-grained strategy, though the best division of work is achieved (i.e., the least idle time, 4%), the execution time is the longest due to an ineffective use of the pipeline scheme.

With the coarse-grained scheme, the resources consume five partitions at a time, making effective use of pipelining. However, this leads to increased idle time between the

CPU and the GPU (22%). Finally, with the work stealing strategy, we are able to achieve a


Figure 4.11: Effect of Ratio of Partitioning to Computation

balance between the execution time as well as the idle time. Overall, the work stealing strategy results in 39% better performance than the fine-grained scheme.
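The distribution policy just compared can be illustrated with a toy host-side model, shown below. Each device starts with half of the partitions and steals from the tail of the other device's list when its own list runs out; the two "devices" are ordinary C++ threads and the processing costs are simulated, so this only conveys the idea, not the scheduler used in our runtime.

#include <chrono>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>

// A toy model of distributing partitions between a "CPU" and a "GPU" worker.
struct Device {
    std::deque<int> work;
    std::mutex m;
    int processed = 0;
};

static bool take(Device& own, Device& other, int& part) {
    { std::lock_guard<std::mutex> lk(own.m);
      if (!own.work.empty()) { part = own.work.front(); own.work.pop_front(); return true; } }
    { std::lock_guard<std::mutex> lk(other.m);   // steal from the other device's tail
      if (!other.work.empty()) { part = other.work.back(); other.work.pop_back(); return true; } }
    return false;
}

static void run(Device& own, Device& other, int costMs, const char* name) {
    int part;
    while (take(own, other, part)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(costMs));  // simulated work
        own.processed++;
    }
    printf("%s processed %d partitions\n", name, own.processed);
}

int main() {
    const int numPartitions = 24;
    Device cpu, gpu;
    for (int p = 0; p < numPartitions; ++p)
        (p % 2 ? cpu : gpu).work.push_back(p);   // initial even split

    std::thread tGpu(run, std::ref(gpu), std::ref(cpu), 5, "GPU");   // faster device
    std::thread tCpu(run, std::ref(cpu), std::ref(gpu), 15, "CPU");  // slower device
    tGpu.join(); tCpu.join();
    return 0;
}

Taking one partition at a time from the front corresponds to the fine-grained extreme, while handing out several at once would approximate the coarse-grained scheme; stealing only when a device runs dry is what keeps both the idle time and the pipeline disruption low.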

4.3.5 Discussion

Some of the key results and observations from our experiments are as follows.

We first demonstrated that our proposed framework is effective in scaling irregular reduction applications on both multi-core CPUs and GPUs, independently. We were able to achieve up to 7.3 times speedup from an 8-core CPU and 21.1 times speedup from a Fermi

GPU. Also, for GPU execution, we can scale to datasets that do not fit into GPU’s device memory, while maintaining the same relative speedups.

To further gain insights into the performance factors, we analyzed the performance trade-offs between CSP and RSP on the multi-core CPU and the GPU. Our results indicate that RSP, despite its partitioning and redundant computation overheads, is a better option than CSP. This is because, with RSP, synchronization overheads are almost non-existent

and the memory hierarchy can be utilized effectively. Overall, we show that the performance with RSP can be 98% better than the performance with CSP.

We have also demonstrated the benefits from utilizing the heterogeneous resources (CPU and GPU) simultaneously, in two ways. First, we show the benefits of overlapping the partitioning on the CPU and the actual computations on the GPU. Our studies also show that the number of partitions and the length of the pipeline are critical factors for performance. For EU and MD, we achieved performance gains of about 500% and 70%, respectively, compared to a non-pipelined execution. Second, we have shown the benefits of dividing the actual reduction computations between the multi-core CPU and the GPU, using our task scheduling framework. Here, we have shown that additional performance improvements of about 11% and 22% can be achieved for EU and MD, respectively. However, depending on the ratio of partitioning to computation, we show that it is possible to achieve higher benefits also.

Figure 4.12: Comparison of Fine-grained, Coarse-grained, and Work Stealing Strategies on Euler

4.4 Related Work

In the following, we compare our work against the efforts specific to the modern CPU-

GPU architectures, and various studies that have focused on irregular reductions.

In the last 2-3 years, there has been considerable interest in application development and execution for CPU/GPU heterogeneous environments. The Harmony [30] project has developed an execution model and a runtime framework to schedule computations either on a CPU or on a GPU, based on the estimated kernel performance. Teodoro et al. [119] describe a runtime framework that selects either a CPU core or the GPU for a particular task, from a single task parallel application. Similarly, Ravi et al. [109] presented a code generation and runtime framework for the generalized reduction class of applications, which includes a dynamic scheduling mechanism to split the work between the CPU and the GPU.

The Qilin [86] system involves an adaptable scheme for mapping the computation between the CPU and GPU simultaneously. The adaptivity of the system is based on extensive offline training. Becchi et al. [13] use a data-aware scheduling of kernels to reduce the data transfer overhead between the CPU and GPU. None of these efforts are capable of splitting the work between the CPU and the GPU correctly for irregular reductions, which is the focus of our work.

As stated earlier, irregular reductions have been studied widely on different types of architectures. This includes efforts targeting distributed memory parallel machines [6, 26,

49, 52, 73, 75, 103, 128], distributed shared memory machines [94, 46], shared memory machines [15, 48, 78, 132], cache performance improvement on uniprocessor machines [32, 50, 89, 92], and modern GPUs [59]. Also, there are some individual application studies considering Euler, Molecular Dynamics, or other similar applications on

recent architectures, e.g., one targeting the IBM CBE architecture [14], and another for heterogeneous multi-core clusters [35]. Amber 11 2 provides an implementation of Molecular

Dynamics on a CUDA-based GPU. However, it requires that the entire calculation must fit within the GPU memory, similar to our earlier work with Partitioning-based locking [59].

To the best of our knowledge, there is no existing study considering irregular reductions on heterogeneous architectures composed of GPU and CPU.

4.5 Summary

This work has focused on irregular reduction applications arising from unstructured meshes. We have developed a Multi-level Partitioning Framework to port this class of applications on heterogeneous architectures composed of a multi-core CPU and a GPU.

The main idea is to introduce an additional top-level partitioning to enable the execution of large datasets on the GPU and distribute the workload among the cores of the CPU and the GPU. Our runtime support includes two key features that are instrumental in achieving significant performance improvements. First, a pipeline scheme is developed to overlap partitioning on CPU and computation on GPU. Second, a task scheduling strategy based on work stealing is provided to distribute the work between the multi-core CPU and the

GPU.

Two popular irregular applications are used for our detailed evaluation. Some key results are as follows. First, our framework can provide good scalability, including on large datasets that do not fit into the memory of the GPU. Second, we have shown that the reduction-space partitioning used in our framework clearly outperforms the other option for

2http://ambermd.org/gpus/

partitioning, which is computation-space partitioning. Third, significant performance improvements are obtained from our pipelining and work distribution schemes. Overall, we show that using our framework and the schemes, irregular reduction applications can be ported effectively onto a modern heterogeneous architecture.

Chapter 5: Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture

In the last three chapters, we discussed the parallel strategies and runtime support for generalized and irregular reductions on GPUs and heterogeneous CPU-GPU architectures. In this chapter, we propose a scheduling framework for applications with different communication patterns on an integrated CPU-GPU environment.

5.1 Proposed Scheduling Frameworks

5.1.1 Scheduling Considerations and Existing Schemes

For a heterogeneous computing node comprising a GPU and a CPU, either in a decoupled architecture or a coupled architecture, the key overheads that need to be minimized by a scheduling framework are the following:

Transmission Overhead: For a heterogeneous node with a CPU and a decoupled GPU, input data has to be copied to the GPUs before any computation is performed. Performance can be seriously impacted by the cost of the data transfer, especially for the applications where a large amount of data needs to be transferred and the amount of computation to be performed is only modest. Coupled CPU-GPU architectures, on the other hand, can reduce these costs to a certain extent.

Command Launching Overhead: In OpenCL (and other frameworks), the parallelized parts of the computation are packaged into kernels. The kernel launching, data copying, and address translation operations are viewed as commands, which are enqueued into one or multiple command queues, where they may have to wait for execution. This can be a substantial overhead, especially if a small task size is chosen and the number of kernels to be launched is high.

Synchronization Overhead: In OpenCL, for both the CPU and the GPU, we have to perform synchronization between all the thread blocks when a kernel finishes. This overhead can be high if the tasks are fine-grained, or if the workload varies among the thread blocks.

Load Imbalance Costs: Load balance is one of the most critical considerations in scheduling. On a heterogeneous architecture with a CPU and a GPU, load imbalance can arise at two different levels. The first level is referred to as inter-device load imbalance. This arises due to the different computing capabilities of the CPU and the GPU, which leads to difficulties in obtaining an ideal distribution of work between the two types of devices. Load imbalance can also occur between different thread blocks inside each device, and is referred to as intra-device load imbalance.

As we stated earlier, there have been several research efforts on dividing the computation between a CPU and a GPU [60, 86, 109, 108, 121]. The efforts can be classified into static or off-line ones [86, 121] and dynamic task queue based ones [60, 109, 108]. These efforts all considered decoupled CPU-GPU systems, as they were the only GPU-based systems available at that time. We briefly explain these methods and establish the need for new schemes for coupled CPU-GPU architectures.

Static Scheduling: A very simple scheduling approach is to statically divide the entire workload among all the computation resources before the computation. However, the


Figure 5.1: Dynamic Task Queue Scheduling Framework

difficulty here is in predicting the best ratio of division before computation. Particularly on modern CPU-GPU architectures, different applications have different relative speedups on GPUs. The Qilin system from Georgia Tech takes this approach [86], and uses extensive offline training to find the right work distribution. However, such offline training may not always be practical.

Dynamic Task Queue: A dynamic task queue is another simple method for scheduling that has been widely used for both homogeneous and heterogeneous systems and has been applied for CPU-GPU scheduling [8, 60, 109, 108]. The structure of a dynamic task queue is shown in Figure 5.1. Before computation, the entire workload is divided into small tasks, which are enqueued into a global task queue that is shared between the devices. A device looking for work obtains one or more tasks from the global task queue using locking, and then splits the tasks among the thread blocks belonging to the device. Specific to decoupled CPU-GPU architectures, a synchronization step and a new kernel launch are

required. For this scheme, the size of the task is a critical factor for performance. If the number of tasks is small, load imbalance costs can be high. With an increasing number of tasks, the transmission, command launching, and synchronization overheads increase.
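A minimal sketch of the dynamic task queue idea is shown below. A shared queue of task descriptors is protected by a lock; each device repeatedly dequeues a task, (notionally) launches a kernel for it, and synchronizes before asking for more. LaunchKernelFor is a placeholder for the per-task kernel launch plus the data transfers a decoupled GPU would require; the names and sizes here are illustrative only.

#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

struct Task { int start, size; };   // a contiguous chunk of the workload

// Placeholder for the per-task kernel launch (and, on a decoupled GPU, the
// associated data copies); these are the overheads that grow with the number
// of tasks in this scheme.
static void LaunchKernelFor(const std::string& dev, const Task& t) {
    printf("%s: kernel over [%d, %d)\n", dev.c_str(), t.start, t.start + t.size);
}

int main() {
    const int totalWork = 1000, taskSize = 100;
    std::queue<Task> globalQueue;
    for (int s = 0; s < totalWork; s += taskSize)
        globalQueue.push({s, taskSize});
    std::mutex qLock;

    auto device = [&](const std::string& name) {
        while (true) {
            Task t;
            {   // devices contend on this lock for every task they take
                std::lock_guard<std::mutex> lk(qLock);
                if (globalQueue.empty()) return;
                t = globalQueue.front();
                globalQueue.pop();
            }
            LaunchKernelFor(name, t);   // one launch plus synchronization per task
        }
    };

    std::thread gpu(device, "GPU"), cpu(device, "CPU");
    gpu.join(); cpu.join();
    return 0;
}

The trade-off discussed above is visible directly in this structure: smaller tasks improve load balance, but every additional task adds one more lock acquisition, one more launch, and one more synchronization.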

5.1.2 Scheduling Requirements for a Coupled Architecture

We argue that fine-grained yet low overhead schemes are needed for a coupled CPU-

GPU architecture. Particularly, by fine-grained, we imply that scheduling at the granularity of devices, i.e., CPU and GPU, is not desirable. Instead, scheduling should be performed at the level of cores or thread blocks. Furthermore, work is assigned to kernels already executing on each core, instead of requiring new kernel launches.

Such a scheme will have the following advantages. First, by working at the level of thread blocks, there is no synchronization overhead between the thread blocks. Second, because we do not have to launch a separate kernel with each task, the size of the task can be made extremely small, without introducing extra transmission, command launching, or synchronization overheads.

One possible approach for implementing thread block level scheduling is to extend the traditional Dynamic Task Queue to let each thread block access a global task queue directly using locking. However, this approach will cause high overheads as the size of each task decreases and the number of cores increases. More specifically, on the AMD Fusion, locking operations introduce high overheads and coherence issues, and placing a task queue where it can be accessed by both CPU cores and GPU cores leads to slower accesses for both.

Thus, to support fine-grained yet low overhead methods, we have developed two locking-free schemes, which are the master-worker and token scheduling schemes, respectively.


Figure 5.2: Master-worker Scheduling Framework

5.1.3 Master-Worker Scheduling Framework

Our first locking-free implementation is based on the popular master-worker model, which is summarized in Figure 5.2. A master thread is responsible for scheduling the tasks from a task pool. Once all the tasks have been assigned, the master thread also informs all the thread blocks that all the work has been completed. The decisions about dividing the work into tasks depend upon the scheduling module. We have experimented with a

fixed-sized scheme and a factoring algorithm.

A critical component of this framework is the task interface pool, which can be viewed as a communication interface between the master thread and the thread blocks. The Task Interface Pool is a collection of task interfaces, each of which corresponds to a thread block.

The structure of each task interface in the task interface pool is shown in Table 5.1. There are four integer variables in the structure capturing the following information: the starting

Name           Type   Description
Start point    Int    The starting point of the task
Work size      Int    The size of the task
Assign work    Int    Number of tasks assigned to the thread block
Req work       Int    Number of tasks required by the thread block

Table 5.1: The Structure of Task Interface

point and the size of each task, the number of tasks assigned by the master thread, and the number of tasks required by the particular thread block. Because the task interface is shared between the CPU and the GPU, it is created as a zero copy buffer in uncached host memory, as a tradeoff between the access performance of the CPU and the GPU.
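As a concrete illustration of this structure, the fragment below declares a plain struct with the four integer fields and shows one way such a pool could be created as a host-resident, zero-copy buffer through the standard OpenCL host API. It is a sketch only: CL_MEM_ALLOC_HOST_PTR is used as a generic way to request host-allocated memory, the function name is hypothetical, and the flags in the actual implementation may differ.

#include <CL/cl.h>

/* One task interface per thread block; all fields are plain integers so the
 * structure has the same layout when read from the CPU and the GPU. */
typedef struct {
    cl_int start_point;   /* starting point of the assigned task            */
    cl_int work_size;     /* size of the assigned task                      */
    cl_int assign_work;   /* set by the master: 1 = new task, -1 = finished */
    cl_int req_work;      /* set by the worker: 1 = requesting a new task   */
} TaskInterface;

/* Allocate the task interface pool as a zero-copy buffer in host memory so
 * that both the CPU and GPU thread blocks can access it without explicit
 * copies (illustrative flags; not necessarily those used on the Fusion APU). */
cl_mem create_task_interface_pool(cl_context ctx, cl_uint num_blocks, cl_int *err) {
    return clCreateBuffer(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          sizeof(TaskInterface) * num_blocks,
                          NULL, err);
}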

The pseudo-code of the master routine is shown in Figure 5.3. The processing in the master routine can be divided into two phases, the scheduling phase and the finalization phase. In the scheduling phase, if there are available tasks in the task pool, the master thread iterates over all the existing thread blocks on all the devices, until it finds a free thread block. Such a thread block is located by checking whether Req work is larger than 0. Next, the master thread assigns a new task to this thread block by updating the two values, i.e., the Start point and the Work size, where the value of Work size depends on which scheduling algorithm is used. After it is assigned a new task, the thread block is informed that the new task is now available by updating Assign work. After all the tasks are distributed, the master thread moves to the finalization phase, where all thread blocks are sent the end message.

Figure 5.3: Scheduling Master(DevList, TaskInfo)

/* Do scheduling */
while TaskInfo.start < TaskInfo.end do
    for each Device_i in DevList do
        blockNum_i ← Device_i.blockNum;
        for each threadBlock_j in blockNum_i do
            Get taskInterface_j of threadBlock_j;
            if taskInterface_j.Req_work > 0 then
                taskInterface_j.Req_work ← 0;
                taskInterface_j.Start_point ← TaskInfo.start;
                workSize ← SchedulingAlg(Device_i, TaskInfo);
                workSize ← max(smallSize_i, min(largeSize_i, workSize));
                taskInterface_j.Work_size ← workSize;
                taskInterface_j.Assign_work ← 1;
                TaskInfo.start ← TaskInfo.start + workSize;
            end
        end
    end
end
/* Send end message */
while Not all thread blocks received the end message do
    for each Device_i in DevList do
        blockNum_i ← Device_i.blockNum;
        for each threadBlock_j in blockNum_i do
            Get taskInterface_j of threadBlock_j;
            if taskInterface_j.Req_work > 0 then
                taskInterface_j.Assign_work ← -1;
            end
        end
    end
end

Figure 5.4: Scheduling ThreadBlocks(taskInterface)

while ifFinish ≠ 1 do
    /* Do scheduling */
    if threadID = 0 then
        while taskInterface.Assign_work = 0 do
            /* Waiting for new work */
        end
        /* Received new work */
        if taskInterface.Assign_work > 0 then
            taskInterface.Assign_work ← 0;
            start ← taskInterface.Start_point;
            size ← taskInterface.Work_size;
            /* Prefetch next available work */
            taskInterface.Req_work ← 1;
        /* Received end message */
        else if taskInterface.Assign_work = -1 then
            ifFinish ← 1;
        end
    end
    Barrier(all the threads in the block);
    if ifFinish ≠ 1 then
        /* Do computation */
        Computation(start, size);
    end
end

In the worker routine shown in Figure 5.4, we can see that one particular thread, which is called thread0, is in charge of obtaining new tasks for all the threads in that thread block. If new work is available, thread0 will store the starting point and the work size.

Then, before doing the computation on the new task, it will also indicate to the master that it is ready to prefetch the next task. The benefit of prefetching is that the cost of distribution of the new tasks can be overlapped with the computation.

In the Fusion APU, as long as the event dependency is set up, the runtime system will ensure that the memory is coherent. Thus, from the two pseudo-codes, we can see that no two threads update a particular task interface at the same time, because of the way the conditions are used before reading and writing. The master thread will not update a task interface before the worker thread sets Req work. Similarly, the worker thread will not read the task interface before the master thread sets Assign work. Therefore, this strategy can eliminate the need for any locking, while maintaining correctness. In addition, no spurious wake-ups, deadlocks, or concurrency-induced errors are introduced.
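The handshake between the master and a thread block can be sketched with the two flags written by only one side at a time. The version below uses C++ atomics and CPU threads in place of GPU thread blocks, so it models the protocol (request, assign, prefetch, end message) rather than the OpenCL mechanics on the Fusion APU; the chunk sizes and the processed counter are placeholders.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct TaskInterface {
    std::atomic<int> assign_work{0};  // written by the master: 1 = new task, -1 = end
    std::atomic<int> req_work{1};     // written by the worker: 1 = requesting work
    int start = 0, size = 0;          // task description, written only by the master
};

int main() {
    const int numBlocks = 4, totalWork = 40, chunk = 5;
    std::vector<TaskInterface> pool(numBlocks);
    std::vector<std::thread> blocks;
    std::atomic<int> processed{0};

    // "Thread blocks": each one waits on its own interface, never taking a lock.
    for (int b = 0; b < numBlocks; ++b) {
        blocks.emplace_back([&, b] {
            TaskInterface& ti = pool[b];
            while (true) {
                int a;
                while ((a = ti.assign_work.load()) == 0) { /* wait for the master */ }
                if (a < 0) return;                   // end message
                ti.assign_work.store(0);
                int start = ti.start, size = ti.size;
                ti.req_work.store(1);                // prefetch the next task
                processed.fetch_add(size);           // stand-in for the real computation
                (void)start;
            }
        });
    }

    // Master: scan the interfaces, hand out work to any block that requested it.
    int next = 0;
    while (next < totalWork) {
        for (auto& ti : pool) {
            if (next < totalWork && ti.req_work.load() == 1) {
                ti.req_work.store(0);
                ti.start = next; ti.size = chunk; next += chunk;
                ti.assign_work.store(1);
            }
        }
    }
    for (auto& ti : pool) {
        while (ti.req_work.load() != 1) { /* wait until the block is idle */ }
        ti.assign_work.store(-1);          // end message
    }
    for (auto& t : blocks) t.join();
    printf("processed %d work items\n", processed.load());
    return 0;
}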

5.1.4 Token Scheduling Framework

In the Master-worker implementation, one CPU core is used for the master thread to perform the scheduling. Thus, a fraction of the available computing resources is wasted. This motivates the second locking-free scheduling implementation, which we refer to as the token scheduling framework. It is shown in Figure 5.5.

Unlike the Master-worker model, there is no central master thread for task scheduling. Similar to the structure of a token network, a token, which is created as a zero copy buffer, is passed among all the thread blocks on both the CPU and the GPU. Each thread block can only access the task information when it has the token. Therefore, no two thread


Figure 5.5: Token Scheduling Framework

blocks can access the task information simultaneously, and a locking-free implementation can be supported.

Initially, an ordered unified BlockID is assigned to the thread blocks in both CPU and

GPU. There are three steps for a thread block to obtain a new task, which are shown in

Figure 5.5. First, each thread block checks whether a global value, TokenID, is the same as its own BlockID, which is the mechanism for checking whether the thread block holds the token. Next, the thread block with the token can access the task pool to obtain an available task, and update the task information about the remaining work. Third, the token is passed on to another thread block.

In a token network, the token is always transferred among the nodes until it is captured by a free node. Thus, when a thread block wants to pass the token, it first checks all the other thread blocks. If any of them is in the free status (i.e., waiting for a new task), the TokenID is updated to the BlockID of the free thread block. However, if all

the thread blocks have the busy status, the decision is non-trivial. There are three possible token passing strategies for the case when no thread block is in the free status.

Drop token: If there are no free thread blocks, the thread block with the token can drop the token, by setting the TokenID to a special value indicating that the token itself is now free.

When any thread block finishes its work, it can capture the free token directly. However, because multiple thread blocks may try to capture the free token simultaneously, locking cannot be avoided.

Token Unblocking: In this strategy, a token is not passed until there is a free thread block. Thus, the thread block with the token will keep checking whether there is a free thread block. Its action is very similar to the master thread in the master-worker model, except that no single thread block (core) is acting as the master.

Token Blocking: In this strategy, the token is also kept by the original thread block. However, instead of this core continuously iterating over all thread blocks to search for a free core, it performs an additional small task, referred to as an offset task. More specifically, the thread block with the token will first go through all the other thread blocks to check their status. If all of them are busy, the thread block will perform the offset work, and then check their status again, repeating this process until a free thread block is found. The advantage of this approach is that the idle time spent waiting for a free thread block can be utilized productively, though if the offset task is too large, other thread blocks might finish their computation and wait to get new work.
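The token protocol, including the offset-task variant just described, can be sketched in the same simplified setting. A single atomic TokenID plays the role of the token; the holder takes a chunk from the task pool, looks for an idle block to pass the token to, and, while everyone else is busy, performs a small offset task. As before, CPU threads stand in for thread blocks, and the chunk and offset sizes are arbitrary illustrative values.

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int numBlocks = 4, totalWork = 400, chunk = 10, offsetChunk = 1;
    std::atomic<int> tokenId{0};                     // BlockID of the current token holder
    std::atomic<int> nextWork{0};                    // head of the shared task pool
    std::vector<std::atomic<int>> busy(numBlocks);   // 1 while a block is computing
    for (auto& b : busy) b.store(0);
    std::atomic<int> processed{0};

    auto block = [&](int id) {
        while (nextWork.load() < totalWork) {
            if (tokenId.load() != id) continue;        // 1. wait until we hold the token

            int start = nextWork.fetch_add(chunk);     // 2. take a task from the pool
            if (start >= totalWork) break;
            int size = std::min(chunk, totalWork - start);

            // 3. pass the token: prefer an idle block; in the blocking variant,
            //    perform a small offset task while every other block is busy.
            bool passed = false;
            while (!passed && nextWork.load() < totalWork) {
                for (int other = 0; other < numBlocks && !passed; ++other)
                    if (other != id && busy[other].load() == 0) {
                        tokenId.store(other);
                        passed = true;
                    }
                if (!passed) {
                    int off = nextWork.fetch_add(offsetChunk);         // offset task
                    if (off < totalWork)
                        processed.fetch_add(std::min(offsetChunk, totalWork - off));
                }
            }

            busy[id].store(1);                         // now compute the real task
            processed.fetch_add(size);
            busy[id].store(0);
        }
    };

    std::vector<std::thread> blocks;
    for (int id = 0; id < numBlocks; ++id) blocks.emplace_back(block, id);
    for (auto& t : blocks) t.join();
    printf("processed %d of %d work items\n", processed.load(), totalWork);
    return 0;
}

Dropping the offset branch and simply re-scanning the other blocks corresponds to the unblocking strategy; the drop-token strategy is not shown because, as noted above, it reintroduces locking.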

5.2 Experimental Results

5.2.1 Experimental Platform and Goals

Our experiments were conducted using an AMD Fusion APU (A8-3850) with a quad-core CPU and an HD6550D GPU with 400 shader processing cores (5 × 16 × 5). The CPU has a clock frequency of 2.9 GHz, a 64 KB L1 cache, and a 1 MB L2 cache. The GPU operates at 600 MHz with a 32 KB shared memory. The system memory used in the machine is an 8 GB DDR3 memory with a 1600 MHz clock speed. This memory is partitioned into a 512 MB device memory and a 7680 MB host memory. The thread configurations used in our experiments are as follows. For the GPU, because there are 5 streaming multiprocessors and the maximum thread block size is 256, we use 5 × 256 as the thread configuration. For the quad-core CPU, we use 4 × 1 as the thread configuration.

Our evaluation is based on three representative applications, Jacobi, K-means, and Molecular Dynamics, which represent stencil computations, regular reductions, and irregular reductions, respectively. Our experiments were conducted with the following goals: 1) Quantifying the benefits of the integrated CPU-GPU architecture, by focusing on the ratio between the transmission overhead and the total execution time, for the Fusion APU and a decoupled NVIDIA GPU, 2) Quantifying the speedups achieved while using both the CPU and the GPU with the thread block level scheduling framework, including Master-worker Scheduling and Token Scheduling, over versions that use only the CPU or the GPU, 3) Comparing our scheduling framework against a more traditional scheduling method, which is based on task queues, and evaluating the overheads of dynamic scheduling by a comparison against the optimal static partitioning, and 4) Evaluating the different token passing strategies within the Token Scheduling framework, Blocking Token and Unblocking Token.

5.2.2 Efficiency of Transmission on Fusion Architecture

On a machine with a decoupled GPU, data is required to be copied to the GPU over the PCI-Express bus. Thus, the resulting transmission overhead, which is limited by the bandwidth of PCI-Express, is a critical limiting factor for performance. A coupled architecture like AMD Fusion can likely reduce this overhead. However, an important question is how much of this overhead is reduced, and what the overall impact is on the combined performance of the CPU and the GPU.

To answer this question, we look at the performance of the AMD Fusion and an NVIDIA C2050 (Fermi) GPU connected to a standard desktop. Because both the CPU and GPU architectures are very different in these configurations, an absolute comparison of the performance is meaningless. Thus, instead, we focus on the relative time spent on transmission as an indicator of the effectiveness of the two systems in speeding up the computation. This experiment has been performed in the context of Jacobi. On both platforms, we use the traditional scheduling method, which is based on the Dynamic Task Queue. The transmission overhead we report includes the data copying time and the copy command launching time.

Figure 5.6 shows the comparison with different task numbers. The total size of the dataset is the same, but the work is divided into 1, 10, or 20 tasks that are launched separately on the GPU. With only 1 task, on the Fusion APU, data transmission takes 28% of the total execution time, which is 9% less than the overhead on the Fermi GPU. When 10 tasks are used, the data transmission overhead increases to 38% and 39%, respectively. The overheads are even higher with 20 tasks. With the use of 10 or 20 tasks, the overheads are not lower for the APU, despite a very different design. Moreover, for effective load balancing, we do need a larger number of tasks.


Figure 5.6: Relative Cost of Data Transmission on Fusion APU and Fermi GPU

Thus, we can see that despite a different architecture, the AMD Fusion does not reduce the overheads if the same scheduling scheme (Dynamic Task Queue) is used. However, the architecture of the Fusion does enable other schemes, like the master-worker and the token scheduling frameworks we have developed, which can lead to better performance. We examine this issue in the rest of this section.

5.2.3 Effectiveness of Thread Block Level Scheduling Framework

Stencil Computations - Jacobi: The matrix used for the Jacobi application has 7680 rows and 7680 columns. Figure 5.7 shows the comparison between the token, master-worker, dynamic task queue (DTQ), optimal static, and CPU-only and GPU-only methods. For all the scheduling methods, Figure 5.7 shows the best results obtained by choosing the task size appropriately. In addition, in this experiment, all the data are allocated in host memory and accessed by the CPU and GPU directly, so the total execution time includes the computation time but not the copy time. The token implementation we use here is the unblocking one, which happens to give the best performance.


Figure 5.7: Execution times for Jacobi: Token, Master-worker, Dynamic Task Queue, Optimal Static, CPU-Only and GPU-Only

Jacobi is a memory intensive application in which memory bandwidth is the major bottleneck, so the CPU-only version is 4.5 times faster than the GPU-only version. The best version of DTQ gains only a 1.025 speedup over the CPU-only version, due to the trade-off between load balance and scheduling overhead. However, based on thread block level scheduling, master-worker and token achieve 1.2 and 1.6 times speedups, respectively, compared to the CPU-only version. The performance of token scheduling is very close to that of the optimal static version.

To better understand the relationship between the overhead and task size in DTQ, in

Figure 5.8, we show the execution time of DTQ with different task sizes, which are varied from 768 rows in a task to 10 rows in a task. Because the total row number is 7680, the work can be seen as divided into 10, 20, 40, 80, and 768 kernels with the different task sizes. Two different DTQ versions were created, which are copy DTQ and copy-free DTQ. The copy-free DTQ version, results from which are shown in the bottom part of Figure 5.8, allows the CPU and the GPU to access data from the host memory directly, avoiding explicit


Figure 5.8: Execution times for Jacobi in Dynamic Task Queue with different task sizes: Copy DTQ, Uncopy DTQ, and Average execution time per row for CPU and GPU in different versions

data copy between host memory and device memory. The copy DTQ version, results from which are in the upper part of Figure 5.8, requires the GPU to copy the data in and out of the device memory between every two kernels. So, compared to copy-free DTQ, extra GPU copy time and a result combination step between the CPU and GPU are needed, the latter because the results are stored in different address spaces. To better compare the heterogeneous execution times, we did not show the combination time, though it is actually even larger than the computation time.

For both DTQ versions, we can see that the load balance gets better with an increasing number of tasks, as we would expect. However, the overall performance does not improve all the time, due to the increasing transmission overhead and command launching overhead. The best performance of the two versions is obtained when each task has

192 and 96 rows, respectively. Without including the combination time, the best version

of copy DTQ is 1.16 times faster than copy-free DTQ. However, with decreasing task size, copy-free DTQ outperforms copy DTQ as a result of the relatively low overhead from command launching and data transmission.

To further see how the overhead increases with a decrease in task size, we examined the average execution time per row for the GPU in these two versions. It turns out that the average execution time increases for both versions. Moreover, starting from the task size of 96 rows, the average execution time of copy DTQ exceeds that of copy-free DTQ. Thus, to summarize, in DTQ, fine-grained task scheduling cannot be performed efficiently due to the command launching and transmission overheads, which offset the benefit from better load balance.
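The distinction between the two DTQ variants ultimately comes down to how the input buffer is created for each kernel. The fragment below illustrates the two options through the standard OpenCL host API; it is a schematic sketch under assumed flags and a pre-existing context/queue, not the exact buffer setup of our implementation.

#include <CL/cl.h>

/* Copy DTQ: a device-resident buffer; every task requires an explicit
 * transfer of its rows into device memory before the kernel runs. */
cl_mem make_device_buffer(cl_context ctx, cl_command_queue q,
                          const float *host_rows, size_t bytes, cl_int *err) {
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, err);
    if (*err == CL_SUCCESS)
        *err = clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host_rows,
                                    0, NULL, NULL);
    return buf;
}

/* Copy-free DTQ: a buffer that wraps existing host memory, which both the
 * CPU and the GPU kernels access directly, so no per-task copies are enqueued. */
cl_mem make_zero_copy_buffer(cl_context ctx, const float *host_rows,
                             size_t bytes, cl_int *err) {
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          bytes, (void *)host_rows, err);
}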


Figure 5.9: Execution times for Jacobi with different task sizes: Master-worker, Unblocking Token and Blocking Token

Figure 5.9 compares the performance of master-worker scheduling and token scheduling with two different token passing policies (blocking token passing and unblocking token passing) for different task sizes. Compared to DTQ, thread block level scheduling only

invokes one kernel, with much smaller overheads for command launching, data transmission, and synchronization. So, we can use very fine-grained tasks for both master-worker and token scheduling, as shown in Figure 5.9. For master-worker, the performance improves with decreasing task size, due to the improvement in load balance, and because the scheduling overhead does not increase with the use of our prefetching strategy. However, one CPU thread is used as the master thread during the entire computation. Thus, as shown in Figure 5.9, based on the best case, unblocking token scheduling gains a 1.3 times speedup over master-worker. Unblocking token scheduling obtains the best performance when the task size is 20 rows. In blocking token scheduling, we use an offset task size of 1 row. However, in this case, a 1-row task turns out to be too large, and prevents other thread blocks from obtaining new tasks in time. Therefore, for Jacobi, blocking token scheduling is even slower than DTQ. Moreover, as the task size decreases (and the total number of tasks increases), more blocking overhead is introduced, leading to a slowdown of the blocking token version.


Figure 5.10: Execution times for Kmeans: Token, Master-worker, Dynamic Task Queue, Optimal Static, CPU-Only and GPU-Only

Regular Reductions - K-means: For K-means, we conducted experiments on a dataset with 20 million 3-dimensional points, with the number of clusters, K, being 10. Figure 5.10 shows the execution times with token scheduling, master-worker scheduling, DTQ, the optimal static, and the CPU-only and GPU-only versions. Overall, token scheduling, master-worker, and DTQ achieve 1.92, 1.73, and 1.61 times speedups, respectively, compared to the better of the CPU-only and GPU-only versions.


Figure 5.11: Execution times for Kmeans in Dynamic Task Queue with different task sizes: Copy DTQ, Uncopy DTQ, and Average execution time per point for CPU and GPU in different versions

Figure 5.11 shows the comparison between the copy DTQ and copy-free DTQ with different task sizes, which are 2,000,000, 400,000, and 200,000 points in each task. They correspond to 10, 50, and 100 kernels, respectively. The results are similar to what we observed for Jacobi. For copy-free DTQ, the best performance is gained when we have 50 kernels, whereas the best performance with copy DTQ is only obtained with 10 kernels.

Overall, the improvement in load balance achieved by having a greater number of kernel

launches cannot help improve the overall performance due to the increasing overheads of launching and submitting commands.


Figure 5.12: Execution times for Kmeans with different task sizes: Master-worker, Unblocking Token and Blocking Token

In Figure 5.12, we compared master-worker, blocking token and unblocking token with much more fine-grained tasks, which correspond to 125, 250, 500, and 1000 total tasks.

The largest difference from the results for Jacobi is that the blocking token version gains

6.9% and 9.6% speedups compared to unblocking token and master-worker, respectively. In K-means, the smallest possible offset task is 1 point. Thus, it is much easier to select an offset task that is small enough to significantly lower the possibility of another thread block waiting for the token. Therefore, the blocking token policy can achieve better performance. We can conclude that for applications where the size of the offset task can be made very small, the blocking token scheme can achieve better performance.

Irregular Reductions - Molecular Dynamics: The dataset used for molecular dynamics has 131,072 nodes and 14,877,496 edges. Compared to the stencil computation and regular


Figure 5.13: Execution times for Molecular Dynamics on different partitions: Token, Master-worker, Dynamic Task Queue, Optimal Static, CPU-Only and GPU-Only

reductions, two new challenges arise for irregular reductions on heterogeneous architectures. First, because the crossing edges between different partitions are computed in both thread blocks, increasing the number of partitions will introduce extra computations. This, combined with the fact that load balance is difficult to achieve when the number of partitions is very small, makes this application challenging. Second, though inter-device load balance is important for all applications, load imbalance within the thread blocks of a device, or intra-device load imbalance, is a new challenge for molecular dynamics and other irregular reductions. This is because it is very hard to create reduction space partitions that have equal amounts of computation. In the DTQ method, the synchronization between kernel launches will force a fast thread block to wait for a slow thread block. But in token and master-worker scheduling, there is only one kernel launched and executed throughout the computation, and no synchronization is involved between the thread blocks until all the work is done. Therefore, token and master-worker scheduling can improve both the inter-device and intra-device load balance.

Figure 5.13 shows the comparison between the CPU-GPU versions that use different scheduling strategies, and the CPU-only and GPU-only versions, with 50, 100, 200, and 300 partitions. The token scheduling scheme used here is the unblocking scheme, as the blocking scheme is ineffective for this application. Because the overhead of data copying is very high, the GPU accesses the data from host memory directly in all versions. Because the computation increases with an increasing number of partitions, the execution time for both

CPU-only and GPU-only increases. However, for the CPU-GPU versions, load balance is difficult to achieve with a small number of partitions. With 50 partitions, all CPU-GPU versions are even slower than the CPU-only version. With a higher number of partitions, the performance of token and master-worker scheduling improves. Both versions achieve their best performance with 300 partitions. For master-worker, the best performance is very close to the performance of CPU-only with 50 partitions, i.e., no speedups are possible. Token scheduling, with 300 partitions, obtains a 1.15 times speedup over the best CPU-only version. The performance of DTQ gets worse with an increasing number of partitions, as we will further evaluate.


Figure 5.14: Molecular Dynamics: Standard deviation of computation for different thread blocks with the same partition number in Dynamic Task Queue


Figure 5.15: Execution times for Molecular Dynamics with different task sizes: Master-worker, Unblocking Token and Dynamic Task Queue, and standard deviation of workloads in DTQ

In Figure 5.14, we show the standard deviation of the workloads for each thread block with DTQ scheduling. The partitions are equally distributed to 5 thread blocks. However, because the amount of computation in different partitions varies, the workload distributed to each thread block is also very different. For 50 partitions, each thread block will get 10 partitions to process, but the standard deviation of the workload between the thread blocks is quite high, and further, it increases as the number of partitions increases.

In Figure 5.15, we compare DTQ, master-worker, and unblocking token scheduling with different numbers of partitions. Because creating a sub-partition is non-trivial for an irregular application, and the execution time of one partition is too large for it to be used as an offset task, the blocking token scheme is very inefficient, and is not shown in Figure 5.15. For DTQ, when the partition number is small, inter-device load imbalance is the major performance factor. With an increasing number of partitions, inter-device load balance improves, but intra-device load imbalance gets worse, and overall, DTQ turns out to be inefficient for irregular reductions. For token and master-worker, there is no synchronization between thread blocks at any time. Therefore, both inter-device and intra-device load imbalance can be reduced with an increasing number of partitions.

5.3 Related Work

In the context of heterogeneous computing, much recent activity has centered around OpenCL, which is designed to support portability across CPUs, GPUs, and other devices. The main distinction in our work is in automatically splitting the work between the CPU and GPU, considering the features of an integrated CPU-GPU architecture. Other proposals for heterogeneous computing include Exochi [124], Merge [79], Harmony [31], Twin Peaks [45], and a proposal from Intel [113]. None of these efforts involve automatic distribution of work between the CPU and GPU.

For accelerating an application using the CPU and GPU simultaneously, the Qilin system [86] has been developed with an adaptable scheme for mapping the computation between the CPU and GPU. The Qilin system trains a model for data distribution based on curve-fitting. Our work, in comparison, is based on dynamic work distribution, and does not require any offline training. Becchi et al. [13] use a data-aware scheduling of kernels that reduces the data transfer overhead due to the GPU. Teodoro et al. [118] describe a runtime framework that selects either a CPU core or the GPU for a particular task.

In the area of scheduling framework implementation, Cilk [40] provides a compiler-based task scheduling solution for shared memory multiprocessors. A Cilk-like framework for GPGPU has been attempted (OpenCLunk [76]), though it turns out that the overhead of executing spawn and data transmission is too high on the GPU. Moreover, work stealing in Cilk cannot be utilized in a CPU-GPU heterogeneous architecture without involving high locking overheads. StarPU [8] provides a dynamic queue-based scheduling framework for heterogeneous architectures, though not for coupled architectures. Because tasks are still scheduled and distributed at the device level in this framework, a direct port to coupled CPU-GPU architectures will involve the high overheads of kernel launching and synchronization.

Even prior to the recent development of multi-core CPUs and GPUs, many techniques were proposed for scheduling data parallel loops on heterogeneous systems. The guided self-scheduling scheme [102] initially allocates larger chunk sizes to reduce scheduling overhead, and subsequently reduces the chunk size towards the end of the computation. The factoring scheme is similar to guided self-scheduling, except that in each iteration all the processors get the same amount of work; factoring improves scalability and handles situations in which the execution time varies widely across iterations [58]. Some of the ideas in our schemes, such as the use of offset work before passing the token, have similarities with these ideas.

5.4 Summary

Every new architecture offers both new challenges and opportunities to application developers. This is certainly the case for new or emerging integrated CPU-GPU architectures, such as those released or announced by AMD, Intel, and NVIDIA. While these systems offer a unique opportunity for accelerating applications using the combined processing power of GPUs and CPUs, there are significant challenges in partitioning an application to benefit from them.

This chapter has developed and evaluated a dynamic runtime framework. We have demonstrated how the master-worker scheme can be made locking-free, and then, similarly, shown that a token scheduling framework can be supported efficiently. We have shown how our framework can be used for applications with different communication patterns. We demonstrate that the new runtime schemes have very low overheads and clearly outperform a scheme originally developed for decoupled CPU-GPU architectures. Overall, we have shown that CPU and GPU cores can be effectively used together, with performance improvements ranging between 1.15 and 1.92 times speedup over the best of the CPU-only and GPU-only versions.

Though our current work has focused on acceleration on a single node, the issues in optimizing an existing MPI application for a cluster of such nodes are the same. Similarly, while our current evaluation has been on the AMD architecture, our schemes should be easily adaptable to other integrated CPU-GPU chips, such as those likely to be released by

Intel and NVIDIA soon.

Chapter 6: Efficient Scheduling of Recursive Control Flow on GPUs

In the last four chapters, we discussed applications involving generalized reductions, irregular reductions, and stencil computations. Here, we focus on a new application pattern, recursive applications, on modern GPUs.

6.1 Current Recursion Support on Modern GPUs

This section introduces SIMD architectures, particularly GPU devices, discusses the scheduling of recursive applications on such systems, and finally, details the performance achieved on current GPUs.

6.1.1 Modern GPU Architectures and Thread Divergence

The processing component of a typical modern GPU consists of streaming multiprocessors (SMs). Each streaming multiprocessor, in turn, contains a set of SIMD cores that perform in-order processing of instructions. To achieve high performance, a large number of threads, typically a few tens of thousands, are launched. These threads execute the same operation on different sets of data. A block of threads is mapped to and executed on a streaming multiprocessor. Furthermore, threads within a block are divided into multiple groups, termed warps (wavefronts in OpenCL). Threads within a warp are co-scheduled on a streaming multiprocessor with all threads executing the same instruction in a lock-step

[Figure: panel (a) shows the conditional branches (if / else if / else) executed by threads T0-T3; panel (b) shows the corresponding control flow graph, in which the Exit node is the immediate post-dominator.]

Figure 6.1: An example of unstructured control flow and its immediate post-dominator

manner. If threads in the same warp need to execute different instructions, they have to be executed serially, i.e., at any time, only threads executing the same instruction can proceed while other threads stall. An instruction is said to be divergent if it can lead to different threads executing different instructions. Conditional branches are common examples of divergent instructions.

The serialization of divergent instructions is a basic limitation in supporting threads in SIMD architectures, and cannot be overcome with software and hardware scheduling schemes. However, scheduling strategies can maximize opportunities to make progress on all threads by exploiting the structure of the program. These strategies can be further aided by hardware and/or software. For example, one can identify reconvergence points through compiler analyses, and execute one group of divergent threads serially until they reach the reconvergence point, before stalling this group and executing the other group of divergent threads. To illustrate, in Figure 6.1, threads diverge when executing conditions

cond1 and cond2, and converge after executing the conditional statements. This scheme, referred to as immediate post-dominator based reconvergence and described in detail in

Section 6.2, has been shown to be beneficial in handling control-flow divergence within a procedure [29, 42].

6.1.2 Performance of Recursive Control Flow on GPUs

While control flow resulting from branches has been extensively studied in the literature, recursion has only been preliminarily investigated (a detailed description of related work is covered in Section 7.5). While traditional data-parallel applications are implemented in an iterative manner or using a combination of outer recursion and inner data-parallelism, not all applications can be expressed in this fashion. We focus on the scheduling of threads within a warp, with each thread executing a recursive function. While the

flow of control can vary widely among the threads executing such recursive functions, tasks performing similar operations still expose significant optimization opportunities.

As mentioned earlier, AMD GPUs do not support recursive control flow. To understand the performance of recursive programs on modern NVIDIA GPUs, we conduct the following experiment using two benchmarks, Fibonacci and Binomial coefficients. On Fibonacci, we compute the 24th (Fib(24)) and 23rd (Fib(23)) Fibonacci numbers, each 20 times, in three different ways: using a single thread, using two threads in a matched pattern, and using two threads in an unmatched pattern. When using two threads, each thread executes half the work of the serial case, depicting a strong scaling scenario. In the matched pattern, work is assigned to the two threads in the same order, alternating between

Fib(24) and Fib(23). For example, when the first thread computes Fib(24), the second thread also computes Fib(24). In the unmatched pattern, the tasks are assigned to the two


Figure 6.2: Performance of Fibonacci and Binomial Coefficients benchmarks on Kepler (Tesla K20M) and Fermi (Tesla C2050) architectures: 20 Fib(24)+Fib(23) and 20 Bio(18,9)+Bio(18,8) invocations on one thread, two threads in matched pattern, and two threads in unmatched pattern

threads in different orders. The first thread alternates between Fib(24) and Fib(23) starting

with Fib(24), while the second thread does the same computation starting with Fib(23).

We conduct the same experiment in computing the binomial coefficients C(18,9) and C(18,8), for the terms x^9 and x^8, respectively, in the expansion of (1 + x)^18.
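For reference, the two recursive workloads have the shape shown in the C++ sketch below; this is the standard recursive formulation (with Fib(0) = Fib(1) = 1, matching Figure 6.3), and the device-side code used in the actual benchmarks may differ in detail. The matched and unmatched assignments differ only in which of the two calls each thread starts with.

#include <cstdio>

// Recursive Fibonacci, following the formulation in Figure 6.3 (fib(0) = fib(1) = 1).
static long long fib(int n) {
    if (n < 2) return 1;
    return fib(n - 1) + fib(n - 2);
}

// Recursive binomial coefficient C(n, k) via Pascal's rule.
static long long binom(int n, int k) {
    if (k == 0 || k == n) return 1;
    return binom(n - 1, k - 1) + binom(n - 1, k);
}

int main() {
    // Matched pattern: both "threads" execute the calls in the same order.
    // Unmatched pattern: the second thread starts with the cheaper call, so the
    // two threads are in different recursion trees at any point in time.
    printf("Fib(24) = %lld, Fib(23) = %lld\n", fib(24), fib(23));
    printf("C(18,9) = %lld, C(18,8) = %lld\n", binom(18, 9), binom(18, 8));
    return 0;
}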

Figure 6.2 shows the results of the experiment on Tesla K20M and C2050 GPUs, which are based on the Kepler and Fermi architectures, respectively. We observe that execution on two threads using the matched pattern achieves perfect speedup compared to execution on one thread. However, the unmatched pattern shows significant performance degradation on both Kepler and Fermi GPUs. For example, on Kepler, we only gain 1.4 and 1.2 times speedups on the Fibonacci and Binomial coefficient benchmarks, respectively, in the unmatched pattern, demonstrating that divergence introduces significant overheads. In fact, this execution time is greater than the time to compute only the more expensive of the two terms, Fib(24) or

$C_9^{18}$ in the two benchmarks, on a single thread. This establishes that there are a significant number of wasted cycles with the current scheduling method.

We also observe that the two GPUs exhibit similar behavior, indicating that the newer

Kepler architecture faces similar challenges as Fermi in thread scheduling under intra-warp thread divergence due to recursion. Note that launching a kernel using dynamic parallelism in Kepler GPUs [98] blocks the executing kernel until the invoked kernel completes, and does not address the problems related to intra-warp scheduling or thread divergence that we tackle in this chapter.

6.2 Static Immediate Post-dominator Reconvergence

In a control flow graph, a post-dominator of a given node is a node through which all paths (across different branches) from the given node must pass. The immediate post-dominator is the closest such post-dominator, i.e., the post-dominator that does not post-dominate any other post-dominator of the given node.

Figure 6.1 shows an example code with conditional branches and the corresponding control

flow graph. Four threads, T0 to T3, diverge in different branches. First, all the threads start from the Entry node. Then, T1 and T2 go through basic blocks {BB1, BB2}, while T0 and

T3 go through {BB1, BB2, BB3} and {BB1}, respectively. In the end, all threads converge at the Exit node. Thus, in this example, the Exit node is the immediate post-dominator.

One possible scheduling scheme for SIMT execution is static immediate post-dominator reconvergence. The compiler analyzes the code to determine the immediate post-dominator of the divergent branch as the reconvergence point. Returning to the example in Figure 6.1, while scheduling the threads within a warp, this scheme will use the Exit node as the reconvergence point.

[Figure 6.3 panels: (a) Fibonacci code; (b) simplified PTX code of Fibonacci, with both returns unified into a single ret Ret_Para instruction at label B1; (c) recursion trees of F(4) and F(3), distinguishing parallel execution, serial execution, and re-convergence points; (d) normalized clock time of each Call F(n)/Ret F(n) step on threads T0 and T1.]

Figure 6.3: Timing the steps on two threads computing the fourth and third Fibonacci numbers. F() in (c) and (d) denotes the Fibonacci function.

6.2.1 Reconvergence in NVIDIA (Kepler) GPUs

For establishing a baseline, we wanted to determine the reconvergence mechanism used in the NVIDIA Kepler GPUs. Because this information is not publicly documented, we determine the reconvergence method by studying the generated code, and by instrumenting and monitoring its execution. The compiled PTX code of the Fibonacci function is simplified and shown in Figure 6.3 (b). Compared to the original Fibonacci code (Figure 6.3

(a)), the compiler simplified the conditional branch control, and unified the two return instructions in each branch to a single one, which is then moved to the end of the function.

Because the address labeled by B1 points to the first instruction after convergence of the

two paths (Path1 and Path2), instruction ret Ret_Para is the immediate post-dominator of

the two paths.

Inside the Fibonacci function, we insert two clock function invocations to identify the clock cycle at which specific execution points are reached. One, representing Call F(n), is

inserted at the beginning of the function, and the other, representing Ret F(n), is inserted

just before the return instruction, which is the immediate post-dominator. We then schedule

two threads, denoted by T0 and T1 in Figure 6.3, to compute the fourth and third Fibonacci

numbers (F (4) and F (3)), respectively. Figures 6.3(c) and 6.3(d) show their recursion

trees and the gathered clock time. In Figure 6.3(c), we show which nodes of the recursion

trees are executed in parallel or serial, and the reconvergence, using different shades. From

Figure 6.3(d), we can see that when divergence happens after cycle 1597, thread T1 is

stalled, while thread T0 executes the next recursion level (leftmost nodes 1 and 0 shown in

Figure 6.3(c)). On cycle 5124, threads T0 and T1 reconverge on Ret F(2) and Ret F(1).

Based on this experiment, coupled with similar other experiments, we infer that static

immediate post-dominator reconvergence mechanism is used in NVIDIA Kepler GPUs in

the case of thread divergence due to recursive function calls.
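The measurements summarized in Figure 6.3(d) can be gathered with device-side instrumentation of roughly the following form; the event buffer, the record helper, and the buffer size are assumptions made for illustration, not the exact instrumentation used.

    #define MAX_EVENTS 4096
    __device__ unsigned long long evt_clock[MAX_EVENTS]; // cycle at which the event occurred
    __device__ int evt_tid[MAX_EVENTS];                  // thread that recorded the event
    __device__ int evt_n[MAX_EVENTS];                    // argument n of the call/return
    __device__ int evt_is_ret[MAX_EVENTS];               // 0 = Call F(n), 1 = Ret F(n)
    __device__ int evt_count = 0;

    __device__ void record(int n, int is_ret) {
        int i = atomicAdd(&evt_count, 1);
        if (i < MAX_EVENTS) {
            evt_clock[i] = clock64();
            evt_tid[i] = threadIdx.x;
            evt_n[i] = n;
            evt_is_ret[i] = is_ret;
        }
    }

    __device__ long long fib_timed(int n) {
        record(n, 0);                          // "Call F(n)", at function entry
        long long r = 1;
        if (n >= 2)
            r = fib_timed(n - 1) + fib_timed(n - 2);
        record(n, 1);                          // "Ret F(n)", just before the return,
        return r;                              // i.e., at the immediate post-dominator
    }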

6.2.2 Frontier-Based Early Reconvergence

In an unstructured control flow, there is a chance that the same basic block may appear

on different branch paths. Based on this observation, the frontier-based reconvergence

method [29] improves upon immediate post-dominator reconvergence by supporting two

kinds of reconvergence points. In addition to the immediate post-dominator, any thread can

join the currently active threads in execution, if it needs to execute the same basic block as

the active threads (e.g., basic block BB3 in Figure 6.4 (b)).

In this approach, a compiler analysis is conducted to do a topological sort on all the basic blocks between the divergent branch and its immediate post-dominator. Then, early reconvergence checks are inserted at the appropriate points in the code. The key to this approach is to schedule the basic blocks in a valid scheduling order—a topological sort of the control flow graph—to ensure that each basic block is executed exactly once.

Thus, compared to immediate post-dominator reconvergence, the frontier method can eliminate code expansion due to branch divergence in unstructured control flow. However, the question is whether it can be applied to recursive control flow.

[Figure 6.4 panels: (a) the modified recursive Fibonacci code Fib(n, ret); (b) its control flow graph, with conditions cond1 (n == 1), cond2 (n == 0), and cond3 (ret == true) repeated at each recursion level; (c) execution steps of two threads under the immediate post-dominator mechanism; (d) execution steps under the frontier-based early reconvergence mechanism; (e) execution steps under the dynamic greedy reconvergence mechanism.]

Figure 6.4: Recursive Fibonacci code, its control flow graph, and execution steps to compute the 3rd and 2nd Fibonacci numbers on two threads under various scheduling mechanisms

6.2.3 Limitations for Recursive Functions

First, let us consider immediate post-dominator reconvergence. While it has been shown

to be effective in handling unstructured control flow within a function, immediate post-

dominator reconvergence has limitations when it comes to recursive applications. Figures 6.4(a) and 6.4(b) show a modified recursive Fibonacci example and its control flow graph. In this example, a boolean value ret is introduced in our Fibonacci function. The function returns when ret = TRUE and (n = 1 or n = 0). Otherwise, the else branch involves recursive function invocations with arguments n − 1 and n − 2. Once a thread goes through the else branch, the recursion level is increased by one, and the two function calls are produced with the control flow graph for the new recursion level.
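For reference, the modified routine of Figure 6.4(a) can be written as follows (reproduced here in CUDA C for readability; as in the figure, ret is expected to be TRUE so that the base case actually terminates the recursion).

    // Modified Fibonacci from Figure 6.4(a): cond1 is (n == 1), cond2 is (n == 0),
    // and cond3 is (ret == true); the else branch spawns the next recursion level.
    __device__ int Fib(int n, bool ret) {
        if ((n == 1 || n == 0) && ret == true) {
            return 1;
        } else {
            int x = Fib(n - 1, ret);
            int y = Fib(n - 2, ret);
            return x + y;
        }
    }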

At each recursion level, we can see that the Exit node is the immediate post-dominator at that level. When two threads diverge at a particular recursion level, the returning thread waits for the thread that enters the deeper recursion level to return, so as to reconverge on the Exit node at the same level. We next explain this point through a detailed example.

Figure 6.4(c) shows the execution steps of two threads using the immediate post-dominator reconvergence method. Thread T0 computes the third Fibonacci number, while thread T1 computes the second. Both threads begin executing identical instructions, but begin to diverge at step 7. Thread T0 chooses the function call branch, while thread T1 chooses the other branch so as to return to the invoking function. Because the Exit node is the immediate post-dominator, thread T1 cannot return until T0 returns to the same recursion level and also reaches the Exit node. Consequently T1 stalls for the next 11 steps, waiting for T0 to finish its recursive function execution.

This is in addition to the stalls that occur within each function call due to conditional execution. For example, when the input values in T0 and T1 are 1 and 0, respectively, T0 chooses the path {Entry, cond1, cond3, Exit}, while T1 goes through {Entry, cond1, cond2, cond3, Exit}. That results in three additional stalls starting at step 22 in Figure 6.4(c). Assuming the execution of one instruction costs one cycle, in this example, the instructions per clock (IPC) of the immediate post-dominator method is only 1.42 for 2 threads. Moreover, the thread with the lesser amount of work—T1 in this case—cannot begin executing its next task early.

Frontier-based early reconvergence can eliminate the repeated execution of the same basic block in different divergent paths. For example, in Figure 6.4(d), we show the execution steps using the frontier-based early reconvergence method. Unlike with the immediate post-dominator method, threads T0 and T1 can reconverge early on step 23 in cond3, because both threads execute code block cond3 in their own branches. However, this approach only works for unstructured control flow. Moreover, the static immediate post-dominator is still kept in this method, incurring the disadvantages discussed above for the static immediate post-dominator method. This is the reason why thread T1 still waits for thread T0 to return to the same recursion level, with the associated stall of 11 steps observed in Figure 6.4 (d).

To summarize, when scheduling using static reconvergence-based methods—immediate post-dominator and frontier methods—threads with shorter branches stall for threads with longer branches. If the execution time of the $i$-th branch of a recursive application on the $t$-th thread is $T_{it}$, the total execution time of $N$ threads on a recursive task with $M$ branches is given by $T_{task} = \sum_{i}^{M} \max_{t}^{N}(T_{it})$. Moreover, if there are multiple recursive tasks to be executed by each thread, the threads with the smaller tasks cannot continue to subsequent tasks before the threads with the larger tasks finish. Consequently, if there are $N$ threads, each of which has $P$ recursive tasks to execute, and the execution time of the $t$-th thread on the $i$-th task is $T_{it}$, the total execution time $\sum_{i}^{P} T_{task_i}$ is equal to or greater than the sum of the maximum execution time to execute each task, $\sum_{i}^{P} \max_{t}^{N}(T_{it})$.
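As a small, purely illustrative example of this bound (the numbers are hypothetical, not measured): suppose $N = 2$ threads execute one task with $M = 2$ branches, where $T_{11} = 3$ and $T_{12} = 5$ on the first branch, and $T_{21} = 4$ and $T_{22} = 2$ on the second. Then $T_{task} = \max(3, 5) + \max(4, 2) = 9$, even though each thread has only 7 units of its own work: thread 1 stalls for thread 2 on the first branch, and thread 2 stalls for thread 1 on the second.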

6.3 Dynamic Reconvergence and Scheduling

In this section, we consider the design of dynamic reconvergence mechanisms. Unlike

static reconvergence mechanisms, dynamic reconvergence does not rely on a static (fixed

and compiler determined) reconvergence point.

6.3.1 Dynamic Greedy Reconvergence

The dynamic greedy reconvergence method removes the reconvergence on the static immediate post-dominator. However, sharing the same intuition as frontier-based reconvergence, we greedily reconverge the threads with the same program counter (PC). In this approach,

we define a convergence group, which refers to the group of threads that are at the same

PC address as the currently active threads. In particular, although the instructions at differ-

ent recursion levels can be viewed as belonging to different basic blocks, they might still

execute at the same PC address. So different from frontier-based reconvergence, reconver-

gence with the convergence group on the same PC address can be applied across different

recursion levels. Different from immediate post-dominator reconvergence, before reaching

an immediate post-dominator point, we keep checking whether there are inactive threads in the convergence group, and greedily reconverge them as soon as possible. Moreover, due to

the avoidance of the immediate post-dominator and reconvergence on the same PC address,

static (compile-time) knowledge is not needed in our greedy reconvergence method.

140 Figure 6.4(e) shows the execution steps with our dynamic greedy reconvergence method.

We can see that although threads T0 and T1 diverge at step 7, they reach the same PC ad-

dress (if cond3) in four more steps. Thus, although the two threads are in different recursion levels, they can reconverge based on the same convergence group. The threads can then execute the next few instructions without divergence. Compared to the execution steps of the immediate post-dominator method, the total number of execution steps is decreased by

12% and 31% for threads T0 and T1, respectively. Thus, greedy reconvergence can not only overlap the common basic blocks in different branches, but also let the threads in shorter branches return sooner to have an opportunity to execute subsequent tasks.

Dynamic Greedy-return Mechanism

The motivation for this scheme is as follows. For our greedy method, the threads with the same PC are encouraged to reconverge as soon as possible. However, when divergence occurs, the threads are arranged in the stack depending on their thread ID. In our implementation, which will be elaborated in the next section, we iterate over the threads in the reverse order of their ID, so the threads with the smaller ID have a larger chance to be put on top of the stack. While a different order can be used as well, the key point is that the order in which the threads are traversed is fixed. This could lead to additional stalls. For example, if the tasks with longer branches are in threads that are put on top of the stack, they will be popped and executed first. Other threads, potentially working on tasks with shorter branches, will have to wait longer to get their next tasks.

We address this problem by collecting additional control-flow information during compilation. Specifically, we define a branch (if-then-else) as recursive if a recursive function

call or a return statement is reachable from any of the instructions starting from the branch

to its immediate post-dominator in the control flow graph. Based on this notion, we insert the threads with a return statement to the top of the stack when divergence occurs on a recursive branch. The resulting scheme is referred to as the dynamic greedy-return mechanism. When divergence occurs, the greedy-return method utilizes the compile-time information to check for threads in the return branch. These threads are placed on the top of the stack and are processed with higher priority in the following execution. Note that this variation might not be universally beneficial, because preferentially executing a small number of the threads in return branches can decrease the instructions executed per clock cycle (IPC).

6.3.2 Dynamic Majority-based Scheduling

In the dynamic greedy scheme, each time an instruction is scheduled, the active threads are a subset of the threads at the top of the stack. However, if the threads in the top active group are a minority of all the threads in the warp, we exploit less parallelism, and IPC could be negatively impacted. Note that because we are using a dynamic scheme, finishing a larger number of threads sooner could very well imply that other work may be available to be scheduled together with the threads that are currently a minority. For example, if the tasks in threads T0 to T3 are [A, C], [B, A], [B, A], and [B, A] respectively, on the static or greedy mechanism, it will require 4 steps to finish, as [T0: A, T1-3:B, T0:C, T1-3:A].

However, in the majority mechanism, it only requires 3 steps, as [T1-3:B, T0-3:A, T0:C].

We can see that the minority group, which only includes T0 on the first step, has the chance to become part of the majority group, which includes T0 to T3.

Based on this observation, we introduce the dynamic majority-based mechanism. In this method, before issuing an instruction, all threads are activated. Then, a selector is used

142 to find the majority group of threads with the same PC address. The PC address and active

mask of this majority group are put on the top of the stack to be issued.

Though a key advantage of this scheme is the ability to improve the number of instruc-

tions executed per clock cycle, it has additional benefits. Because all threads are active

before scheduling, there is no need to record the PC for the threads that are inactive. Thus,

compared to the greedy reconvergence in which the stack size increases with increasing

number of the branches, only one entry is required for the stack in the dynamic majority

method. In addition, this method can be easily implemented in hardware. Moreover, sim-

ple stack operations introduce less cost on instruction issuing compared to the greedy and

immediate post-dominator methods.

Dynamic Majority-threshold Scheduling

In the dynamic majority-based scheme described above, there is a possibility of star-

vation for the threads in the minority group. This is overcome through a variant we now

describe, referred to as majority-threshold scheduling. To prevent starvation while keeping

IPC high, we introduce two thresholds. The threads in the minority group will be scheduled only when their waiting time is greater than a predefined value TS. At the same time, threads in the minority group cannot be scheduled if the current number of threads in the majority group is greater than another parameter, TP.

In our implementation of the majority-threshold method, TP is set to the maximum of available_threads/8 and a minimum value (min_value), in which available_threads refers to the number of the

threads that are still alive (not waiting for warp termination). So if there are 32 available

threads, TP is 4. Only when the current number of threads in the majority group is not greater than 4 are the starved threads in the minority group considered for scheduling. With a decrease

in the number of the available threads, TP also decreases. However, when the number of

available threads is too small, TP is set to a minimum value. Thus, when the number of the threads in the majority group is high, the original dynamic majority-based scheme is employed in order to achieve high IPC. However, when high IPC cannot be achieved, the scheduling strategy is altered to prevent starvation.

To summarize, the dynamic reconvergence mechanisms we have presented not only provide more flexible reconvergence, especially allowing threads at different recursion levels to reconverge, but also support different scheduling options, allowing us to increase IPC and prevent starvation. In the next section, we discuss the implementation of the greedy, greedy-return, majority, and majority-threshold schemes.

6.4 Implementation

[Figure 6.5 panels: (a) the Fibonacci control flow graph executed by three threads, with divergence and reconvergence on the same PC marked; (b) the stack-based implementation, showing the SIMT stack contents (block, next PC, active mask) at each step.]

Figure 6.5: Fibonacci control flow graph with three threads and stack-based implementation of dynamic greedy reconvergence method

Number of SM                15
Number of SP / SM           32
Number of Registers / Core  32768
Shared Memory / Core        48KB
L1 Cache                    16KB, 128B line, 4-way assoc
L2 Unified Cache            768KB, 256B line, 8-way assoc
Number of Memory Channels   6
Core Clock                  700MHz
Interconnect Clock          1400MHz
DRAM Clock                  1848MHz
DRAM Scheduler Queue Size   6
Memory Controller           out of order (FR-FCFS)
Memory Channel Bandwidth    8 Bytes/Cycle

Table 6.1: GPGPU-sim Configuration

GPGPU-sim [11] is a cycle-level GPU performance simulator for general purpose computation on GPUs. In our work, we have implemented the various reconvergence mechanisms on the latest version of GPGPU-sim (3.1.1), which has the capability to model

GPU micro-architectures that are similar to NVIDIA GeForce 8x, 9x, and the Fermi series. We chose GPGPU-sim for our work because it has been widely used in GPU research [11, 41, 130], and has been shown to have high simulation accuracy (achieving correlations of 98.3% and 97.3% versus GT200 and Fermi hardware, respectively) on the RODINIA benchmark suite with scaled-down problem sizes [1].

Our modified GPGPU-sim is configured to model a GPU similar to NVIDIA's

GTX480, which is based on the Fermi architecture. The major configuration parameters of

GPGPU-sim are shown in Table 6.1.

6.4.1 Distinguishing Branch Types

Because our methods target recursive applications, it is important to distinguish control

flow related to recursion from other types of control flow. We classify all the branches into

recursive branches—branches that include a recursive function call or a return statement—

and all other (general) branches. When divergence occurs due to a recursive branch, threads may revisit the same PC, though some threads might reach it having followed a different control flow path. Thus, the dynamic reconvergence mechanisms, which allow reconvergence to occur before, on, or after the immediate post-dominator, can provide higher parallelism. However, for a general branch, there is typically no benefit in allowing threads to stay divergent after the immediate post-dominator.

The detection of branch types is carried out at compile-time. During compilation, for each branch instruction, we scan the instructions from the branch target, leading up to the immediate post-dominator, to determine if there is any recursive function call or a return statement. If so, we classify this branch instruction as a recursive branch. At runtime, we switch between static and dynamic reconvergence, for general and recursive branches, respectively.
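A hedged sketch of this compile-time classification is given below, over a simplified instruction representation (the Instr record, opcode constants, and function signature are assumptions for illustration; the actual analysis operates on the compiled code inside the simulator).

    #include <string>
    #include <vector>

    // Simplified instruction record (an assumption for this sketch).
    enum Opcode { OP_OTHER, OP_RET, OP_CALL };
    struct Instr { Opcode op; std::string callee; };

    enum BranchType { GENERAL_BRANCH, RECURSIVE_BRANCH };

    // A branch is recursive if any instruction between its target and its
    // immediate post-dominator is a return or a call back into the enclosing
    // function; otherwise it is a general branch.
    BranchType classify_branch(const std::vector<Instr> &instrs,
                               size_t branch_target, size_t ipdom,
                               const std::string &enclosing_fn) {
        for (size_t i = branch_target; i < ipdom; i++) {
            if (instrs[i].op == OP_RET)
                return RECURSIVE_BRANCH;
            if (instrs[i].op == OP_CALL && instrs[i].callee == enclosing_fn)
                return RECURSIVE_BRANCH;
        }
        return GENERAL_BRANCH;
    }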

6.4.2 Implementation of the New Schemes

In implementing the dynamic mechanisms discussed in the last section, we build on prior work by Fung et al. [42], who provided a stack-based implementation of the immediate post-dominator reconvergence method. To implement our own reconvergence mechanisms, we modified the structure of the SIMT stack, which is used to store the next instruction PC for a warp. Specifically, for each of our proposed schemes, we provided a new

implementation of the SIMT stack updating procedure, which is used to generate the next

scheduled instruction for a warp.

Implementation of Dynamic Greedy Mechanism

There are two important data objects in the modified structure of the SIMT stack, the

PC slot, which is used to store the address of the instructions to be scheduled, and the active

mask, which is used to track which threads are active at the corresponding PC. The active

mask is implemented as a bitset, with one bit per thread and active threads represented by

bit 1.
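A minimal C++ sketch of this data structure is given below; the names mirror those used in Figures 6.6 and 6.7, but the exact layout inside GPGPU-sim differs, so this should be read as an illustration rather than the actual implementation.

    #include <bitset>
    #include <vector>

    typedef unsigned long long addr_t;        // instruction address (PC)
    const int WARP_SIZE = 32;

    struct simt_stack_entry {
        addr_t m_pc;                          // next PC to be scheduled for this entry
        std::bitset<WARP_SIZE> m_active_mask; // bit i set => thread i is active at m_pc
        int stall[WARP_SIZE];                 // per-thread stall counters, used by
                                              // the majority-threshold variant
    };

    // The per-warp SIMT stack; the entry at back() is the top of the stack.
    typedef std::vector<simt_stack_entry> stack_t;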

Figure 6.6 shows the method for updating the SIMT stack with the dynamic greedy

and greedy-return reconvergence methods. If the input parameter ret first is TRUE, the

algorithm switches to the greedy-return scheme. The critical part of Figure 6.6 is the while

loop, where a new pair of PC and its corresponding active mask is added into the SIMT

stack. When divergence occurs, all target PCs and their corresponding active masks are

added into the SIMT stack. Similar to the original stack update implementation, each time,

before updating the stack, we read the last active mask as top active mask. Subsequent

operations only impact the threads that are active in top active mask. This step by itself is

not sufficient, as inactive threads that have the same next PC as the PC on top of the stack,

cannot be scheduled. Therefore, after the while loop, we scan the SIMT stack and activate the threads having the same PC as the top PC address by merging the different entries of the stack.

To further elaborate the runtime procedures, Figure 6.5 shows an example. In this example, three threads execute Fibonacci, shown in Figure 6.5(a). The corresponding SIMT stack operations are shown in Figure 6.5(b). Here, we give the information about the block

Figure 6.6: stack::update_greedy(stack_t &m_stack, addr_t &next_pc, bool ret_first)

    top_active_mask = m_stack.back().m_active_mask;
    top_pc = m_stack.back().m_pc;
    /* Add the pair of PC and active mask to the stack */
    while top_active_mask.any() do
        mask_t tmp_active_mask;
        address_type tmp_next_pc = NULL_PC;
        tmp_active_mask.reset();
        /* Find out the active threads with the same PC */
        for i = warp_size - 1 → 0 do
            if thread_done.test(i) then
                top_active_mask.reset(i);
            else if tmp_next_pc = NULL_PC then
                tmp_next_pc = next_pc[i];
                tmp_active_mask.set(i);
                top_active_mask.reset(i);
            else if tmp_next_pc = next_pc[i] then
                tmp_active_mask.set(i);
                top_active_mask.reset(i);
        /* Add the new PC and active mask into the SIMT stack */
        m_stack.back().m_pc = tmp_next_pc;
        m_stack.back().m_active_mask = tmp_active_mask;
        m_stack.push_back(simt_stack_entry());
    m_stack.pop_back();
    /* Merge entries for the threads in the same convergence group */
    for i = m_stack.size() - 2 → 0 do
        if m_stack[i].m_pc = m_stack.back().m_pc then
            merge(m_stack, i, size-1);
    /* Prefer to execute the branch with RET first */
    if ret_first = TRUE then
        if warp_diverged = TRUE and branch_type = RECURSIVE_BRANCH then
            for i = m_stack.size() - 1 → 0 do
                if checkRet(m_stack[i].m_pc) = TRUE then
                    move_top(m_stack, i);
                    break;

number, the next PC address, and the active mask showing the threads that will be scheduled next. From steps 1 to 3, there is no divergence. Thus, only one entry is kept in the

stack, and the active mask is 111. However, in step 4, there is divergence: the next PC

of thread T2 is PC1, whereas for threads T0 and T1, it is PC2. We insert the PC and the corresponding active mask of all the different branches into the stack. In our implementation,

we always put the PC with the smaller thread ID on the top of the stack. In this example,

PC2 and its corresponding active mask are put on the top of the stack, and threads T0 and

T1 will be scheduled in the next step. Because the threads on top of the stack have higher

priority, in steps 4 and 5, threads T0 and T1 are executed, but T2 is stalled. Then, in step 6,

the next PC for threads T0 and T1 is PC1, which is the same as the next PC for thread T2.

Thus, threads T0, T1, and T2 can reconverge with our dynamic greedy method. The two

entries with the same PC in the stack are also merged, and the active mask is updated back

to 111.

From Figure 6.6, we can see that in the for loop, the active threads are added to the

SIMT stack in the reverse order of their thread ID, so the threads with the smaller ID have a larger chance of being inserted at the top of the stack and scheduled first. As discussed earlier, this can lead to a performance loss, and the greedy-return mechanism was designed to address this. Thus, in the algorithm of the dynamic greedy mechanism with return-first priority, which is activated by setting the ret_first parameter to TRUE, we use compile-time

information to give higher priority to instructions in the return branch. Because only a

recursive branch can have a return statement, we check the type of the branch before taking

actions for return-first priority. On divergence, we scan the SIMT stack, and move the PC

in the path that includes the return instruction, to the top of the stack.

Implementation of the Dynamic Majority-based Mechanisms

The implementation of the dynamic majority-based mechanism is shown in Figure 6.7.

Unlike the dynamic greedy method, all threads are activated at the beginning for computing the majority group. Two functions—maj_pc and maj_mask—are used to find out the majority group of the threads and their corresponding PC. Finally, only the majority PC, together with its active mask, is added into the SIMT stack. Finding the majority group of threads can be implemented by a selector in hardware.
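The selection itself can be sketched in software as follows (a simplified illustration; the signatures in Figure 6.7 differ slightly, and in hardware this would be a selector circuit rather than a map).

    #include <bitset>
    #include <map>

    typedef unsigned long long addr_t;
    const int WARP_SIZE = 32;

    // Return the PC shared by the largest group of still-alive threads.
    addr_t majority_PC(const std::bitset<WARP_SIZE> &alive,
                       const addr_t next_pc[WARP_SIZE]) {
        std::map<addr_t, int> votes;
        for (int i = 0; i < WARP_SIZE; i++)
            if (alive.test(i))
                votes[next_pc[i]]++;
        addr_t best_pc = 0;
        int best_count = -1;
        for (std::map<addr_t, int>::const_iterator it = votes.begin();
             it != votes.end(); ++it)
            if (it->second > best_count) { best_pc = it->first; best_count = it->second; }
        return best_pc;
    }

    // Build the active mask for the threads whose next PC equals maj_pc.
    std::bitset<WARP_SIZE> majority_mask(const std::bitset<WARP_SIZE> &alive,
                                         const addr_t next_pc[WARP_SIZE],
                                         addr_t maj_pc) {
        std::bitset<WARP_SIZE> mask;
        for (int i = 0; i < WARP_SIZE; i++)
            if (alive.test(i) && next_pc[i] == maj_pc)
                mask.set(i);
        return mask;
    }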

If if_threshold is set to TRUE in Figure 6.7, the implementation switches to the majority-threshold scheme. The threshold model uses two thresholds, TP and TS. TS is a threshold provided by the user, fixed at runtime. TP, on the other hand, is typically a fraction of the number of available threads, and its value can vary at runtime as explained earlier in Section 6.3.2. If the number of active threads is greater than TP, the scheduling is identical to the original majority version. Otherwise, we utilize the information about the stalling step (the number of steps a thread has been stalled) of each thread, counted at runtime, to compare with TS. When iterating over the threads in the same warp, we check the stalling step of each thread. If a thread's stalling step is greater than the threshold TS, the PC associated with this thread is inserted onto the top of the SIMT stack. At this time, the number of stalling steps of the thread is reinitialized to 0.

There is a key difference between the implementation of this scheme and those of the dynamic greedy method and the static methods. Specifically, there is always only one entry—the majority group's PC and active mask—in the SIMT stack. So, even if there is divergence, only one branch is required to be in the stack. Consequently, the implementation of the dynamic majority method requires less space and fewer operations on the stack.

Figure 6.7: stack::update_majority(stack_t &m_stack, addr_t &next_pc, bool if_threshold)

    top_active_mask = m_stack.back().m_active_mask;
    top_pc = m_stack.back().m_pc;
    /* Activate all the threads */
    top_active_mask.set();
    mask_t tmp_active_mask;
    address_type tmp_next_pc = NULL_PC;
    tmp_active_mask.reset();
    /* Find out the majority thread group and corresponding PC */
    tmp_next_pc = majority_PC(top_active_mask, next_pc);
    tmp_active_mask = majority_mask(top_active_mask, next_pc);
    if if_threshold = TRUE then
        /* Generate threshold TP */
        int available_threads = 0;
        for i = 0 → m_warp_size do
            if !thread_done.test(i) then
                available_threads++;
        int TP = max(available_threads/8, 2);
        if tmp_active_mask.count() < TP then
            for tid = 0 → m_warp_size do
                /* Check whether the thread has stalled for more than TS steps */
                if m_stack.back().stall[tid] >= TS then
                    tmp_next_pc = next_pc[tid];
                    tmp_active_mask.reset();
                    m_stack.back().stall[tid] = 0;
            /* Find out the active threads with the same PC */
            for i = 0 → m_warp_size - 1 do
                if top_active_mask.test(i) then
                    if thread_done.test(i) then
                        top_active_mask.reset(i);
                    else if tmp_next_pc == next_pc[i] then
                        tmp_active_mask.set(i);
                        top_active_mask.reset(i);
    /* Add the new PC and active mask into the SIMT stack */
    m_stack.back().m_pc = tmp_next_pc;
    m_stack.back().m_active_mask = tmp_active_mask;

6.5 Experimental Evaluation

We evaluated the various reconvergence mechanisms using our implementation in GPGPU-

sim. Because recursive applications have not received attention in the context of GPUs,

standard benchmark suites currently do not include many recursive benchmarks. For our

study, we implemented six recursive algorithms commonly described in popular textbooks.

Fibonacci computes the n-th Fibonacci number by recursively computing the (n − 1)-th and (n − 2)-th Fibonacci numbers. Binomial Coefficient [72] is used to compute the coefficient of the $x^k$ term in the polynomial expansion of the binomial power $(1 + x)^n$. Tak

Function [71] is a recursive function, where the results from the first three branches in Tak

are inputs of its fourth branch. NQueens searches for all valid arrangements of queens on

an N × N chess board so that no two queens attack each other using a simple backtracking

algorithm. We consider N = 6. Graph Coloring [70] colors the nodes of a graph, with

a given number of colors, such that no two adjacent nodes share the same color. Similar

to NQueens, a backtracking algorithm is employed, but the search is terminated when one

solution is found. We use a graph with 32 nodes, 64 edges, and 3 colors. Mandelbrot [37]

computes a visualization of the Mandelbrot set. Unlike other recursive benchmarks consid-

ered, only one branch is generated on each recursive step. If the coordinate of a node is within

the maximum radius or if the search depth is smaller than a threshold, a new coordinate is

computed from the original one and checked recursively. In our experiment, the maximum

radius and the search threshold are 2.0 and 25, respectively. Among these six benchmarks,

Fibonacci and NQueens are also part of the BOTS benchmark suite [34]. The tasks used

in the evaluation can be produced from recursive expansion of corresponding benchmarks

from a recursive formulation, e.g., using OpenMP Tasks on GPU [9] or CUDA async-finish

extensions [20].

[Figure 6.8 content: for 8 threads, the base task sequence 10 8 6 9 7 5 2 4 is shown under 0-offset, 1-offset, 2-offset, and 4-offset rotations; under a k-offset rotation, thread i receives the sequence rotated left by i times k positions.]

Figure 6.8: An example of task rotation with 0, 1, 2, and 4 offsets on 8 threads

6.5.1 Evaluation Goals

Our experiments were conducted with the following goals: 1) evaluating the performance of the reconvergence mechanisms with increasing divergence degree, 2) examining the scalability of the various scheduling schemes with increasing warp size (from 4 to 32), 3) understanding the variation in the performance of our methods on recursions with different characteristics, noting that the six recursive benchmarks used in our experiments can be classified into four categories—recursion with a small number of branches (Fibonacci and Binomial Coefficient), recursion with a large number of branches (NQueens and Graph Coloring), recursion with dependency (Tak function), and recursion with only one branch (Mandelbrot), and 4) evaluating SIMD utilization by studying the distribution of the number of active threads over the execution.


Figure 6.9: IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Major- ity, and Majority-threshold methods with 1, 8, 16, and 32 divergence degree in Fibonacci and Binomial Coefficients benchmarks (32 threads in a warp)


Figure 6.10: IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Major- ity, and Majority-threshold methods with 1, 8, 16, and 32 divergence degree in NQueens and Graph Coloring benchmarks (32 threads in a warp)

6.5.2 Performance of Methods with Increasing Divergence

In this section, we compare the performance of the five scheduling mechanisms— immediate post-dominator, greedy, greedy-return, majority, and majority-threshold—while


Figure 6.11: IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Ma- jority, and Majority-threshold methods with 1, 8, 16, and 32 divergence degree in Tak and Mandelbrot benchmarks (32 threads in a warp)

varying the degree of divergence between the threads. By changing the order of elements in the datasets assigned to each thread, we can control the divergence between threads. The

first version, 0-offset, is created by assigning the same set of N tasks in the same order to each thread. In this case, we expect all threads to execute the same instructions at any given time, thus exhibiting no divergence and a perfect speedup. Next, the 1-offset version is created by doing a rotation on the order of the tasks in each set. Specifically, the i-th thread begins execution starting with the i-th task. Thus, all threads will obtain a different task in the beginning, representing the case with the largest divergence. Between the 0-offset and

1-offset cases, we also create two other cases by rotating the order of the tasks left by multiples of 2 and 4. The resulting 2-offset and 4-offset cases exhibit some divergence across threads, but also contain some threads doing identical work. Figure 6.8 shows an example of task rotation with 0, 1, 2, and 4 offsets on 8 threads for Fibonacci.
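The rotated task orders of Figure 6.8 can be generated as sketched below (a hypothetical helper, consistent with the orders shown in the figure: thread i starts its copy of the task list rotated left by i times the offset).

    #include <vector>

    // Build the task order for one thread under a k-offset rotation.
    // offset = 0 reproduces the 0-offset case (identical order on every thread).
    std::vector<int> rotated_tasks(const std::vector<int> &tasks,
                                   int thread_id, int offset) {
        int n = static_cast<int>(tasks.size());
        int shift = (offset * thread_id) % n;          // left-rotation amount
        std::vector<int> order(n);
        for (int i = 0; i < n; i++)
            order[i] = tasks[(i + shift) % n];
        return order;
    }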

155 Based on the number of distinct tasks in each case, we define the divergence degree as 1, 8, 16, and 32 for 0-offset, 4-offset, 2-offset, and 1-offset on 32 threads, respectively.

Figures 6.9, 6.10, and 6.11 show the instructions per clock cycle (IPC) for the different methods on the six recursive benchmarks with 32 threads per warp. The baseline shown in the figures is the IPC on one thread for the corresponding benchmark. As expected, all methods achieve a nearly perfect speedup of 32 for all benchmarks when the divergence degree is 1. With increasing divergence degree, the performance of all the methods decreases. However, the majority method achieves the best performance on all benchmarks.

For Fibonacci, when increasing the divergence degree to 8, 16, and 32, the performance of the immediate post-dominator method decreases by factors of 2.45, 4.04, and 5.02, respectively. However, the performance of the majority method only decreases by factors of 1.87, 2.09, and 2.21, respectively. The performance of the greedy method also decreases more slowly than that of the immediate post-dominator method, but faster than that of the majority method. A similar trend is observed for the other benchmarks, though the trends are somewhat different for Mandelbrot, which we will elaborate on in the next subsection.

6.5.3 Scalability with Respect to Warp Width

In this section, we discuss the scalability of the different scheduling schemes with respect to change in warp size. For this experiment, tasks with different problem sizes and/or control flow structures are generated and assigned to each thread in a random fashion. However, the total amount of work assigned to all threads is the same. Moreover, with an increasing number of threads in a warp, the workload for each thread is decreased to study strong scaling.


Figure 6.12: IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Major- ity, and Majority-threshold methods while increasing the number of threads in a warp in Fibonacci and Binomial Coefficient benchmarks


Figure 6.13: IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Majority, and Majority-threshold methods while increasing the number of threads in a warp in NQueens and Graph Coloring benchmarks

In Figures 6.12, 6.13, and 6.14, we show the IPC for the different schemes while warp width is increased from 4 threads to 32 threads. The baseline used in reporting the results is the IPC when we execute just one thread. When warp width is 4, the IPC achieved


Figure 6.14: IPC of Post-dom (Immediate post-dominator), Greedy, Greedy-return, Major- ity, and Majority-threshold methods while increasing the number of threads in a warp in Tak and Mandelbrot benchmarks

from all the methods is very similar. However, with increasing warp width, for all the benchmarks, the scalability of the majority and the majority-threshold methods improves as compared to the immediate post-dominator, greedy, and greedy-return methods. The immediate post-dominator method has the worst performance on all benchmarks, except Mandelbrot. In Fibonacci, when increasing the warp width to 4, 8, 16, and 32, the IPC of the majority method is 2.36, 4.16, 6.84, and 10.83 times higher than the IPC of the baseline.

Moreover, the relative improvement (majority method over the immediate post-dominator method) also increases from 1.25 to 1.58 with increasing warp width.

The performance trends do vary depending on the benchmark. For benchmarks with a small number of branches, such as Fibonacci and Binomial Coefficients (both of which have only two branches at each recursion level), increasing warp width resulted in better performance with all methods. With 32 threads, the immediate post-dominator, greedy, and majority methods obtain speedup factors of 6.8, 8.5, and 10.8, respectively, on Fibonacci,

and speedup factors of 4.0, 7.8, and 10.2, respectively, on Binomial Coefficients. Trends are different for the benchmarks with a large number of branches, such as NQueens and Graph

Coloring. While increasing the warp width from 4 to 32, the immediate post-dominator method only gains 1.5x and 1.2x speedups on these benchmarks. This is because with a larger number of branches, there is a greater chance that longer branches can block the return of the shorter branches. As the majority and greedy methods support the early return of shorter branches, increasing the warp width from 4 to 32 yields 5.75x and 5.03x speedups. For the benchmarks with dependencies across different recursion levels, such as the Tak Function, the control flow is more complex than that of Fibonacci or Binomial Coefficient. The scalability of the immediate post-dominator method is still much worse than that of the majority and greedy methods.

Finally, we explain the observed trends with Mandelbrot. In this benchmark, there is only one branch in each recursion level, and all branches have the same control flow and the same longest branch length (25 with the problem size considered). Thus, in the worst case, the immediate post-dominator method requires execution time equal to the sum of the execution times on the 25 recursion levels. When the warp width is less than 16, it is more likely that we see such a worst-case (longest branch length) execution time. In this case, the immediate post-dominator method is slower than the majority and the greedy methods. However, with increasing warp width, more tasks with the same longest branch length are likely to be grouped together. Thus, the immediate post-dominator method turns out to be faster than the greedy method, as the benefit of dynamic reconvergence is small.

The majority scheme can still achieve a small improvement compared to the immediate post-dominator method.

To further understand the reasons underlying the performance of the different schemes, we did an additional analysis. In Figure 6.15, we show the distribution of the active threads on the six benchmarks for a warp width of 16 threads. Each bar in the figure corresponds to the execution of one benchmark with one scheme, and different shades within a bar show the percentage of execution time during which a specific number of threads were active. For example, w1 represents the percentage of time when there is only 1 active thread, whereas w16 represents the percentage of the time when all 16 threads are active.


Figure 6.15: Distribution of scheduled warp sizes for a warp width of 16

We can see that for most benchmarks, w1 takes the largest percentage of the execution time with immediate post-dominator, greedy, and greedy-return methods. In NQueens and

Graph Coloring, with the use of the immediate post-dominator method, 79% and 95% of the time there is only one active thread. With the greedy method, w1 is reduced to 37% and 50%, respectively, for these two benchmarks. The majority method can further decrease

w1 to 4% and 13%. In the majority method, the largest percentage of the execution time is for w8, showing the benefits of scheduling based on the majority group. Thus, although the greedy method has all 16 threads active for a larger fraction of time than the majority scheme, the average number of threads that are active at any given time is higher for the majority scheme. Thus, the majority scheme gives better overall performance.

By comparing the greedy and greedy-return methods, Figures 6.9 and 6.10 show that when the divergence degree is low, such as with a divergence degree of 8, the greedy-return method can usually provide better performance. However, with an increase in divergence degree, the greedy-return method may introduce additional overheads by excessively scheduling the threads in the return branch. Similarly, from Figures 6.12 and 6.13, we can see that when the warp width is small, say 4 or 8, better performance is obtained by the greedy-return method. This is because an execution with a small warp width also incurs a smaller degree of divergence. By comparing the majority and majority-threshold methods, we can see that their performance is very similar in both the divergence and scalability experiments. This is due to the fact that starvation occurs very rarely in our benchmarks. Moreover, if we focus on the starving threads, IPC may also decrease because the starving threads are often not in the majority group. However, from Figure 6.15, we do see that both the greedy-return and majority-threshold methods often have a smaller w1, as compared to the greedy and the majority methods, respectively.

6.6 Related Work

Since SIMD architectures have emerged, supporting control flow on them has been a topic of much interest. The ILLIAC IV system [16] used a single predicate register, together with a vector masked instruction, to support unnested if-then-else branches and while loops.

CHAP [77], a SIMD graphics processor, introduced a runflag stack and explicit if-else-endif and do-while instructions to handle structured control flow. It also supported nested branch statements by introducing push and pop operations on the stack. However, these techniques cannot be easily applied to unstructured control flow or recursive applications.

Specifically on GPUs, the threads in the same warp are executed in the SIMD manner.

The popular branch control method on GPUs involves allowing the threads that have taken the same branch to execute, while all other threads stall. To implement this method, Lorie and Strong [83] proposed a method using mask bits and special instructions such as else and join. Woop et al. [126] demonstrated a hardware stack and mask to implement this method. Fung et al. [42] presented an implementation of the immediate post-dominator reconvergence method, in which arbitrary control flow can be supported by pushing all the branches into the stack and popping the reconvergence instruction. However, it may still involve significant serialization in the presence of unstructured control flow. The thread frontier method [29] provides an approach to early reconvergence.

Besides intra-warp thread scheduling in the presence of divergence (the focus of the above schemes as well as our work), another direction pursued involves optimizing the mapping of threads in a block to warps to reduce divergence. Dynamic warp formation [42] increases warp utilization by creating a new warp comprising the threads that have taken the same branch. Thread block compaction [41] optimizes intra-warp utilization by using a block-width stack instead of a warp-width stack. Warp subdivision [91] creates warp-splits, which are simply extra schedulable entities, to interleave their execution. Two-level warp scheduling [95] combines threads from large warps to improve utilization of SIMD processors. Simultaneous branch and warp interleaving [17] and dual-path execution model [111]

are similar concepts, though they differ in hardware implementation and the level of compiler support needed. All these methods change the warp structure using hardware support and can complement our approaches to handle recursion.

The work discussed above mainly focused on structured or unstructured control flow in the same function. Recursion, where divergence arises because of new function calls or returns, is not a major target of the above work. Some other studies have considered supporting recursion on SIMD architectures. Abdou [131] developed an approach to translate recursive serial codes to SIMD target codes. Such automatic transformation continues to be a non-trivial challenge for general recursive applications and inputs. Ke et al. [129] proposed a method to implement recursion by simulating a call stack. Compared to the built-in recursion support in GPUs, using a stack to simulate recursion in software can be inefficient.

In summary, there have been no systematic studies on hardware thread scheduling for recursive applications on SIMD architectures.

6.7 Summary

GPUs and other SIMD architectures have emerged as a major player in high performance computing, but recursive applications either cannot be mapped to such hardware or do not perform well. This is due to branch divergence and the fact that current reconvergence schemes are not optimized for recursive applications. This chapter has thoroughly studied the branch reconvergence issue for recursive applications. We first analyzed and identified the limitations of existing (static) methods—the immediate post-dominator and frontier methods—for recursive control flow. We then introduced novel (dynamic) schemes, including the dynamic greedy and majority-based methods, which allow more

163 flexible reconvergence of threads either before or after the immediate post-dominator, as well as allowing simultaneous execution of the same instruction from threads that have different recursive call depths.

The main observations from our detailed evaluation using six recursive benchmarks are as follows. First, our dynamic reconvergence mechanisms outperform the immediate post-dominator scheme, with average speedups over the latter method of 1.66, 1.65, 2.23, and

2.24, with greedy, greedy-return, majority, and majority-threshold schemes, respectively.

Second, the dynamic majority mechanism, which focuses on improving instructions per clock, achieves the best scalability both with increasing divergence among the threads and with increasing warp sizes. Finally, across the 6 applications, the arithmetic mean of the speedups using the majority scheme (with 32 threads in a warp) is 9.2 over serial execution, establishing that recursive applications can benefit from GPUs and similar accelerators.

Chapter 7: A Programming System for Xeon Phis with Runtime SIMD Parallelization

The emerging Intel Xeon Phi architecture provides a new way to accelerate applications on a co-processor. It shares some similarities with GPU co-processors, i.e., both connect to the host through a PCIe bus, and input and output data needs to be transferred between the host and the co-processor. However, the Intel Xeon Phi also has its own characteristics, which are very different from those of GPU architectures. The Intel Xeon Phi is based on the x86 architecture, which means that many applications running on current Intel CPUs can be ported to the Intel Xeon Phi without modification. x86-compatible software, such as OpenMP and/or MPI, can be used directly. However, to fully utilize the computational power of the Intel Xeon Phi, we must fully utilize its wide vector registers. The 512-bit vector units in the Intel Xeon Phi can operate on 16 single-precision floating-point values simultaneously in a SIMD manner. However, different from GPUs, SIMD programming on the MIC or CPU is quite difficult. Moreover, it is limited by many factors, such as memory access patterns and control or data dependencies. Thus, in this work we propose a parallelization framework for the Intel Xeon Phi with a new set of APIs to perform automatic SIMD parallelization. We target three different communication patterns: generalized reductions, irregular reductions, and stencil computations.

7.1 Parallelization and Performance Issues in Intel Xeon Phi

7.1.1 Our Approach

Our approach for providing a solution for application development on Xeon Phi systems, including SIMD parallelization, is based on the observation that most applications follow a small number of patterns or dwarfs (e.g., as summarized by Colella and also described in the Berkeley landscape of parallel computing research [7]). By exploiting knowledge of individual patterns, the needed data transformations and partitioning approaches can be applied. Indeed, many previous efforts on SIMD (and SIMT) parallelization have focused specifically on particular patterns, like stencil computations [27, 55, 56] or irregular reductions [59, 69, 133].

We focus on a more general framework for specifying the computations, but where underlying patterns are explicitly known and exploited. Though the idea can be applied to a variety of patterns, we focus on stencil computations, generalized reductions, and irregular reductions in this paper.

7.1.2 Challenges and Opportunities

There are two levels of parallelism one can seek on the Xeon Phi: MIMD parallelism supported by a large number of hyper-threads, and SIMD parallelism provided by the wide VPU. There are challenges associated with each of them, as well as opportunities to exploit information from specific communication patterns. The issues for applications with different types of patterns are summarized in Table 7.1.

Communication Pattern  | MIMD Challenge                   | SIMD Challenge
Generalized Reduction  | job partition                    | unaligned/non-unit-stride access; control flow dependency; data dependency/conflicts
Stencil Computation    | job partition                    | unaligned memory access; control flow dependency
Irregular Reduction    | job partition; load balance      | unaligned/random memory access; control flow dependency; data dependency/conflicts

Table 7.1: Parallelization Challenges of Different Communication Patterns

MIMD Parallelization Issues

A Xeon Phi can be viewed as an SMP machine, in which all the cores share not only the same memory address space but also a coherent cache space. Thus, traditional MIMD parallelization methods, like OpenMP, can also be applied with the support of the Intel compiler.

Yet, there are many opportunities for exploiting information about specific communication patterns.

Particularly, applications with different communication patterns usually have different requirements on task partitioning and scheduling. For stencil computations and generalized reductions, static scheduling can provide better performance, since it achieves load balance with a small scheduling overhead. For irregular reductions, a technique like reduction space partitioning [59] can be used to avoid conflicts between threads; moreover, dynamic, fine-grained scheduling can outperform static scheduling by achieving better load balance.

Communication pattern specific information can also help in other ways. Data reorganization is one of the optimizations that supports vectorization, but data reordering can also provide better cache locality for irregular reductions. These optimizations are normally not performed by a more general framework, such as an OpenMP implementation.

SIMD Parallelization Issues

In SIMD execution, one memory access operation can load (store) multiple data elements simultaneously from (to) memory. However, there are strict restrictions on how and when such operations can be applied.

Unaligned/Non-unit Stride Accesses: To use SIMD parallelism, the starting address of a read or write has to be 64-byte aligned on the Xeon Phi. It is difficult to satisfy this requirement for almost any kind of application. For instance, a stencil computation usually needs to access a node's neighbors in different directions: in a one-dimensional array, if the address of matrix[i] is 64-byte aligned, the addresses of its neighbors, matrix[i-1] and matrix[i+1], will not be. Similar problems arise for matrices with more dimensions. In addition, different SIMD lanes can only access contiguous memory addresses. Thus, accesses to elements of an array of structures, or data accessed through indirection arrays, cannot exploit SIMD parallelism directly.
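The alignment constraint can be made concrete with a small stand-alone sketch (not part of the framework; the buffer name and size are hypothetical): a 64-byte aligned buffer can be obtained with _mm_malloc, but an element one position away from an aligned element can never be 64-byte aligned at the same time.

#include <cstdint>
#include <cstdio>
#include <immintrin.h>   // _mm_malloc / _mm_free

int main() {
    const int n = 1024;
    // Allocate a buffer whose start is 64-byte aligned, as 512-bit loads require.
    float *matrix = static_cast<float *>(_mm_malloc(n * sizeof(float), 64));
    bool base_aligned     = reinterpret_cast<std::uintptr_t>(&matrix[0]) % 64 == 0;
    // matrix[1] is only 4 bytes past matrix[0], so it cannot also be 64-byte aligned.
    bool neighbor_aligned = reinterpret_cast<std::uintptr_t>(&matrix[1]) % 64 == 0;
    std::printf("matrix[0] aligned: %d, matrix[1] aligned: %d\n",
                base_aligned, neighbor_aligned);
    _mm_free(matrix);
    return 0;
}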

Control Flow Dependencies: At any time, all the SIMD lanes have to execute the same instruction on different data elements. However, in the different branches of an if-else clause, different lanes may need to execute different instructions, which is not supported by SIMD. This kind of control flow arises very commonly in generalized reductions and irregular reductions.

Data Dependencies and Conflicts: When different SIMD lanes try to write to the same location, the behavior is undefined, as there is no locking operation. In both generalized reductions and irregular reductions, such write conflicts arise.

User Interface API (class Task)
API                           | Description
struct Configuration          | Configuration of the task size, offset, and accessing stride.
enum Patterns                 | Declares the communication pattern (Generalized Reduction, Stencil, or Irregular Reduction).
tuple Parameters              | Defines the input parameters for a specific application.
void Kernel(vector &index)    | The kernel function provides the computing logic for a single data element, given by the index vector.

MIMD Parallel Framework API (class MIMD_Framework)
API                           | Description
void run(Task &task)          | Registers the user-defined task with the MIMD framework, invoking runtime optimizations, task partitioning, and scheduling on the MIC architecture.
void join()                   | Blocks until the execution on the MIC is finished.

Table 7.2: User Interface and MIMD Parallel Framework API

Thus, how to resolve these data dependencies and conflicts effectively and efficiently under SIMD execution is another challenge.

7.2 API for Application Development on Xeon Phi

Our parallelization framework provides a set of user APIs. Next, we introduce our MIMD and SIMD APIs, and then show how to use them in a variety of sample kernels.

7.2.1 Shared Memory (MIMD) API

The MIMD parallelization API is shown in Table 7.2. The first four entries correspond to the Task class, which has four attributes: Configuration, Pattern, Parameters, and the Kernel function. The Configuration comprises three vector-type variables, representing the size, offset, and stride of the computation space across different dimensions. Pattern indicates which communication pattern the given task belongs to; based on this information, the MIMD parallelization framework applies different partitioning methods, and the information is used by the SIMD parallel framework as well. In addition, users need to define the Parameters types, which include the input and output parameters for a specific application. The most important part of the user interface is the Kernel function, which gives the smallest unit of computation logic on one data element. It has only one input parameter, representing the index of the target data element. Users need to guarantee that the kernel function is independent between different input indices; this independence can be achieved by either replicating the shared output data or using locks while updating.

The last two APIs are related to the execution and optimization of the applications. The run function receives a user-defined task, with a specification of the four sets of parameters, and automatically invokes runtime optimizations, including partitioning and scheduling methods, for parallel execution on the Xeon Phi. The strategies employed in partitioning, scheduling, and optimization are based on the parameters from the user interface; we elaborate on them in Section 7.3. After this preprocessing, the run function launches a group of threads, each of which executes the kernel function with different input indices. The run function is non-blocking and returns immediately after launching a job. The user can then call the join function to wait until the execution of the task finishes.

Overall, our MIMD API provides a way to port applications to the Xeon Phi architecture with very little effort on the part of the users. After giving a task definition, users can call run(Task) directly to execute the target application.
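To make the interface in Table 7.2 concrete, the following is a self-contained mock in C++ (the type names, members, and the serial run() loop are approximations introduced here for illustration; the real framework partitions and schedules the index space on the Xeon Phi instead of iterating serially):

#include <vector>

// Hypothetical stand-ins for the user interface of Table 7.2.
enum Pattern { GENERALIZED_REDUCTION, STENCIL, IRREGULAR_REDUCTION };

struct Configuration {
    std::vector<int> size, offset, stride;   // computation space per dimension
};

struct Task {
    Configuration config;
    Pattern pattern;
    virtual void Kernel(std::vector<int> &index) = 0;  // logic for one data element
    virtual ~Task() {}
};

// Mock framework: run() simply walks a 1-D computation space serially.
struct MIMD_Framework {
    void run(Task &task) {
        for (int i = task.config.offset[0];
             i < task.config.offset[0] + task.config.size[0];
             i += task.config.stride[0]) {
            std::vector<int> idx{i};
            task.Kernel(idx);
        }
    }
    void join() { /* the mock run() is already synchronous */ }
};

// A user-defined task: square every element of an input array.
struct SquareTask : Task {
    const float *in = nullptr;
    float *out = nullptr;
    void Kernel(std::vector<int> &index) override {
        int i = index[0];
        out[i] = in[i] * in[i];   // independent across indices, as required
    }
};

int main() {
    std::vector<float> in(1024, 2.0f), out(1024, 0.0f);
    SquareTask t;
    t.in = in.data(); t.out = out.data();
    t.config.size = {1024}; t.config.offset = {0}; t.config.stride = {1};
    t.pattern = GENERALIZED_REDUCTION;
    MIMD_Framework fw;
    fw.run(t);    // non-blocking in the real framework
    fw.join();    // waits for the co-processor execution to finish
    return 0;
}

A real implementation distributes the index blocks across threads and devices, but the user-visible pattern of defining a Kernel and calling run() followed by join() stays the same.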

Data Type   | Name                        | Description
Scalar Type | int, float, double, ...     | Data shared by all the SIMD lanes; all basic data types and temporary variables belong to this type.
Vector Type | vint, vfloat, vdouble, ...  | Contains multiple data elements, scaled across all the SIMD lanes.
Mask Type   | mask                        | Helps handle control flow in vectorization.

Table 7.3: The data types defined in SIMD API

7.2.2 SIMD API

The main idea of our SIMD API is to express collections of data elements on which parallel operations can be applied. The actual layout and scheduling of the operations are left up to the runtime system.

Before introducing the API for operations, we first define the new data types. We introduce three data types in the SIMD API, shown in Table 7.3. Scalar Type is the basic data type, which contains only one data element; the implication is that if such a variable is involved in a SIMD operation, it will be shared by all the SIMD lanes. In contrast, Vector Type, represented as vint or vfloat, includes an array of data elements. Thus, if we declare an array as Vector Type, the SIMD lanes will each time access a group of contiguous data elements. When SIMD lanes access a Scalar Type, the same data element is automatically duplicated for all the lanes; this automatic duplication is supported by an implicit conversion from Scalar Type to Vector Type in our implementation.

The last data type is the Mask Type. Because SIMD vectorization does not support control flow, we require the use of a mask variable to express which computations are applied to which elements. The mask type is implemented as a bit set, in which each bit represents one vector lane; the values 1 and 0 represent set and unset, respectively.

The supported operations on the different data types are shown in Table 7.4. The main idea is to overload most operators on vector types, and even on operations involving one vector type and a scalar type. Thus, the difference between serial code and code vectorized using our API is quite small, as we will show through several examples. As shown in Table 7.4, for assignments and arithmetic operations, users can use the same operators as in serial code for vector types and for combinations of vector and scalar types; the overloaded operator implementation automatically performs vectorization on the input parameters.

For logic operations, the difference from the traditional logic operators is the return type. Because there is no support for control flow in SIMD, the logic operations in our API return the mask type, which is then used to express the conditional clause that will be applied for a particular element.

Moving on to the rest of the API, there are two types of load and store functions, for reading and writing contiguous and non-contiguous addresses, respectively. A load (store) with a single source or destination parameter reads (writes) between the vector type and a contiguous memory address space. A function with the extra index and scale parameters instead exploits the gather and scatter operations in the IMCI instruction set for non-contiguous memory accesses. For applications where data reorganization can be applied, such as generalized reductions and stencil computations, there is no need for the non-contiguous load and store API; for irregular applications, in which data reorganization cannot eliminate indirect memory accesses, the non-contiguous load and store API provides an alternative.

Declarations used in the examples: vint v1, v2; int s; mask m; op represents a supported arithmetic or logic operation.

API                                                                                   | Examples
Assignment API: supports assignment between vector types, and from scalar to vector. | v1 = v2; v1 = s; v1 op= v2; v1 op= s;
Arithmetic API: supports most arithmetic operations, including +, -, *, /, %, between vector types and between vector and scalar types. | v1 = v2 op v1; v1 = v2 op s;
Logic API: supports most logic operations, including ==, !=, <, >, <=, >=, between vector types and between vector and scalar types; the return type is the mask type. | m = v1 op v2; m = v1 op s;
Load/Store API
  void load(void *src);                                                               | v1.load(addr);
  void store(void *dst);                                                              | v1.store(addr);
  void load(void *src, const vint &index, int scale);                                 | v1.load(addr, index, scale);
  void store(void *dst, const vint &index, int scale);                                | v1.store(addr, index, scale);
Generalized Reduction API
  template<op> void reduction(int *update, int scale, int offset, vint *index, type value, [mask m]) | reduction(update, scale, offset, index, v1);
Mask API
  mask()                                                                              | v1.mask()

Mask State Object Members                 | Description
mask m                                    | mask type variable
type old_val                              | the default value for unset vector lanes
set_mask(const mask &m, type &old_val);   | sets the mask and default value
void clear_mask();                        | clears the default value and sets all vector lanes to active

Table 7.4: SIMD API

One specific feature is the reduction function. As stated before, multiple SIMD lanes cannot update the same element, and as a result, implementing a reduction using SIMD instructions is more complex. The specific reduction operation is given as a parameter in the template, and our runtime system ensures that the SIMD lanes are used correctly for such computation.
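As an illustration of the conflict the reduction API has to avoid (a hedged sketch, not the framework's actual implementation, and with a simplified signature), one safe strategy is to process the lanes one at a time in scalar code, so that two lanes targeting the same reduction index can never collide:

// Reduce 16 per-lane contributions into update[], laid out as
// update[index * scale + offset]; only lanes whose mask bit is set contribute.
const int VLEN = 16;   // lanes in a 512-bit vector of floats

void reduction_add(float *update, int scale, int offset,
                   const int *lane_index, const float *lane_value,
                   unsigned short lane_mask = 0xFFFF) {
    for (int lane = 0; lane < VLEN; ++lane) {
        if (lane_mask & (1u << lane))
            update[lane_index[lane] * scale + offset] += lane_value[lane];
    }
}

Serializing the lanes costs performance, which is why optimizations such as the reordering of reductions, discussed later for Kmeans, matter.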

The goal of the mask function is the conversion of an unmasked vector type to the masked vector type. After this conversion, all operations on the collection use the mask state object to determine which elements an operation is applied to. The set_mask function is used to set up the mask for the current mask state object on a thread, which is then used until it is cleared or updated.

7.2.3 Sample Kernels

We now illustrate the API using kernels involving the different communication patterns. We show how code using our API is similar to sequential code, and much simpler than hand-written vectorized code.

Stencil Application

In Figures 7.1, 7.2, and 7.3, we take a simple stencil computation, the Sobel filter, and compare the serial version, the version vectorized using our API, and a manually vectorized version.

Comparing Figures 7.1 and 7.2, the code vectorized with our API is almost the same as the serial version, except that new vector types (vfloat) replace the original scalar types (float). Another difference is that the assignment from a vector type to the scalar output is achieved through the store API, because it involves multiple data copies from the vector variable to the target memory locations.

void kernel(int i, int j){
    float Dx = 0.0, Dy = 0.0;
    for (int p = -1; p <= 1; p++){
        for (int q = -1; q <= 1; q++){
            Dx += weight_H[p+1][q+1] * b[i+p][j+q];
            Dy += weight_V[p+1][q+1] * b[i+p][j+q];
        }
    }
    float z = sqrt(Dx*Dx + Dy*Dy);
    a[i][j] = z;
}

Figure 7.1: Sobel: Stencil Computation with serial codes

void kernel(int i, int j){
    vfloat Dx = 0.0, Dy = 0.0;
    // Compute the weight for a node in a 3x3 area
    for (int p = -1; p <= 1; p++){
        for (int q = -1; q <= 1; q++){
            Dx += weight_H[p+1][q+1] * b[X(i,p,q)][Y(j,p,q)];
            Dy += weight_V[p+1][q+1] * b[X(i,p,q)][Y(j,p,q)];
        }
    }
    vfloat z = sqrt(Dx*Dx + Dy*Dy);
    z.store(&a[i][j]);
}

Figure 7.2: Sobel: Stencil Computation with SIMD API

Also, to facilitate possible data reorganization at runtime, a function Dim(idx, offset1, offset2, ...) is provided to calculate the transformed index in each dimension by applying offsets on the different dimensions. For example, in Figure 7.2, X(i, p, q) calculates the transformed index in the X dimension when applying the p and q offsets on the original X and Y dimensions, respectively.

void kernel(int i, int j){
    __m512 Dx = _mm512_set1_ps(0.0), Dy = _mm512_set1_ps(0.0);
    // Compute the weight for a node in a 3x3 area
    for (int p = -1; p <= 1; ++p){
        for (int q = -1; q <= 1; ++q){
            __m512 *tmp = (__m512 *)&b[i+q][j + p*vec_width];
            __m512 tmpx = _mm512_mul_ps(*tmp, weight_H[p+1][q+1]);
            Dx = _mm512_add_ps(Dx, tmpx);
            __m512 tmpy = _mm512_mul_ps(*tmp, weight_V[p+1][q+1]);
            Dy = _mm512_add_ps(Dy, tmpy);
        }
    }
    __m512 sqDx = _mm512_mul_ps(Dx, Dx);
    __m512 sqDy = _mm512_mul_ps(Dy, Dy);
    __m512 ret = _mm512_add_ps(sqDx, sqDy);
    ret = _mm512_sqrt_ps(ret);
    _mm512_store_ps(&a[i][j], ret);
}

Figure 7.3: Sobel: Stencil Computation with manual vectorization

To summarize, our API provides a convenient way to achieve vectorization with very little modification of the serial code. It is also clear that the manually vectorized code, shown in Figure 7.3, introduces many new Intel IMCI intrinsics and is much more complicated than the serial and SIMD API versions, with about 40% more lines of code.

Generalized Reduction

In Figure 7.4, we show the main function of Kmeans, a simple data mining kernel, vectorized using our API. Kmeans can be divided into three steps: 1) compute the distance between one node and the candidate clusters, 2) update the index to the cluster with the minimum distance, and 3) perform a reduction on the cluster found in step 2. Thus, step 1 involves only simple arithmetic operations, whereas steps 2 and 3 involve control flow and generalized reduction, respectively.

void kernel(vfloat *data, int i){
    vfloat min = FLT_MAX;
    vint min_index = 0;
    for (int j = 0; j < k; ++j){
        // Step 1 (Computation): compute the distance
        vfloat dis = 0.0;
        for (int m = 0; m < 3; ++m){
            dis += (data[i+m*n] - cluster[j*3+m]) *
                   (data[i+m*n] - cluster[j*3+m]);
        }
        dis = sqrt(dis);
        // Step 2 (Control flow): update index
        mask m = dis < min;
        set_mask(m, min);
        min.mask() = dis.mask();
        set_mask(m, min_index);
        min_index.mask() = j;
    }
    // Step 3 (Reduction): reduction
    reduction(update, 5, min_index, 0, data[i]);
    reduction(update, 5, min_index, 1, data[i+n]);
    reduction(update, 5, min_index, 2, data[i+n*2]);
    reduction(update, 5, min_index, 3, 1.0);
    reduction(update, 5, min_index, 4, min);
}

Figure 7.4: Kmeans: Generalized Reduction with SIMD API

In step 1, similar to the stencil computation, the only modification is that the data types of the corresponding variables are changed from scalar types to vector types. As a result, the computation is automatically vectorized by loading values from the data array into all vector lanes, and computing the distance between the data in each lane and the clusters. Step 2 introduces a branch: if the distance is smaller than the current minimum distance (min), we update min and min_index; otherwise, min and min_index are not changed. Using our mask API, we represent this computation as an if-else branch in which the else branch just assigns a value to itself. As shown in step 2 of Figure 7.4, a mask variable m is returned by the logic operation. Then, m and the default value for the else branch are set by the set_mask function. Next, the mask() function performs the conversion from the unmasked vector type to the masked vector type. In step 3, a reduction is performed on the array update with the add operation. It is not safe to perform the reduction using the general arithmetic and assignment operators, due to potential write conflicts between different vector lanes, so we use the reduction API. Here, the add operation, which is the default reduce operation for the reduction function, is used to reduce values into the array update.

To summarize, in our API, the code with arithmetic operations is almost the same as the original (serial) code. The reduction in our API is provided through a function interface, which allows us to vectorize such code, whereas most compile-time solutions fail to do so. The most complicated part of our API is the handling of control flow, where branches are replaced by mask operations. However, we note that existing vectorizing compilers do not handle this control flow at all (as we will show through experimental results), and manual vectorization in the presence of control flow is very complicated (see the example in Figure 7.6).

7.3 Runtime Scheduling Framework

We now describe the implementation of the framework, and particularly, how runtime scheduling is applied for both MIMD and SIMD parallelization.

7.3.1 MIMD Parallelization

Though MIMD parallelization is supported by a number of existing frameworks, our focus is on providing automatic or guided task partitioning and scheduling for the three different communication patterns (generalized reduction, stencil computation, and irregular reduction) on the Xeon Phi. In each of these patterns, the computation is an iteration over a set of indices, where two steps are applied to each index: Step 1, loading the index of the targeted data and other auxiliary data for the computation, and Step 2, executing the computation logic, including both computation and writing results, for the target data. In MIMD parallelization, each thread loads a different index in Step 1 and executes Step 2 simultaneously and independently, except for the handling of possible race conditions on the output elements. Our API allows the user to provide a task function, the serial code for the computation associated with a single target data element, which is used for Step 2.

Our runtime system has two major components, task partitioning and runtime scheduling, which parallelize the target applications on the Xeon Phi. Task partitioning can potentially be applied on the computation space or the reduction space. The computation space refers to the space of the computation loop; the reduction space refers to the space in which a reduction is executed. For generalized reductions and stencil computations, it is straightforward to perform task partitioning on the computation space. In particular, the task partitioning component can simply divide the computation loop into a number of equal-sized blocks and pass these blocks to the dynamic scheduling component to execute.

However, for irregular reduction applications, there is a tradeoff between computation space partitioning and reduction space partitioning [59]. Briefly, computation space partitioning can introduce significant locking overhead when different threads try to update results at the same reduction index, whereas reduction space partitioning can completely avoid competition between threads by assigning different parts of the reduction space to different threads. Thus, all threads can execute independently by updating non-overlapping parts of the reduction space. In our framework, different task partitioning strategies are therefore launched based on the type of the application, provided by the user.

The runtime scheduling component includes three scheduling methods. The first is static scheduling, in which all the tasks from the task partitioning module are distributed equally to all the available threads. The static method introduces the smallest scheduling overhead, and for stencil computations it achieves better performance, because it can still achieve good load balance. However, for generalized reductions and irregular reductions, the workload in each task partition may be different. Especially for an irregular reduction, after reduction space partitioning the workload in each partition may vary considerably, depending on the number of edges associated with each node in the reduction space. Thus, a dynamic scheduling method based on factoring is provided in our framework, which assigns a large number of tasks to the threads at first and reduces the number of assigned tasks as execution progresses. The third and final scheduling method is the user-defined method, where a user can define the number of tasks in each partition.
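To illustrate the factoring idea (a minimal sketch with assumed parameters, not the framework's scheduler), each scheduling round below hands out a fraction of the remaining tasks, split across the threads, so chunks start large and shrink as execution progresses:

#include <algorithm>
#include <atomic>
#include <utility>

struct FactoringScheduler {
    std::atomic<long> next{0};   // first task not yet handed out
    long total;                  // total number of tasks
    int  threads;                // number of worker threads
    FactoringScheduler(long total_, int threads_) : total(total_), threads(threads_) {}

    // Returns the [begin, end) range of the next chunk for the calling thread,
    // or an empty range once everything has been assigned.
    std::pair<long, long> next_chunk() {
        long begin = next.load();
        long end;
        do {
            long remaining = total - begin;
            if (remaining <= 0) return {total, total};
            // Hand out half of the remaining work, divided among the threads.
            long size = std::max(1L, remaining / (2L * threads));
            end = begin + size;
        } while (!next.compare_exchange_weak(begin, end));
        return {begin, end};
    }
};

Each worker thread repeatedly calls next_chunk() and runs the user kernel over that index range until the returned range is empty.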

7.3.2 SIMD Parallelization Support

Our SIMD parallelization support has three components: implementation of overloaded functions which supports SIMD execution, runtime data reorganization, and handling of control flow.

SIMD Parallelization Through Implementation of Overloaded Functions

Our primary method for automatic vectorization is based on the implementation of the overloaded functions listed in Table 7.4. The basic idea is as follows: the overloaded functions are used inside the definition of a task, which applies the computation to a particular point. Since these computations can be applied in parallel, an overloaded function's implementation uses SIMD instructions to achieve parallel execution.

int func(vfloat *a, vfloat *b, vfloat *c){
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

(a) A vectorized function using overloaded functions

int func(float *a, float *b, float *c){
    for (int i = 0; i < n; i += 16){
        __m512 *s_a = (__m512 *)&a[i];
        __m512 *s_b = (__m512 *)&b[i];
        __m512 *s_c = (__m512 *)&c[i];
        *s_c = _mm512_add_ps(*s_a, *s_b);
    }
}

(b) The expansion of the overloaded functions in (a)

Figure 7.5: An example of vectorization in overloaded functions

An example is shown in Figure 7.5. The initial function performs an add operation between arrays a and b, and writes the results to array c. Sub-figure (a) shows the code with our API, which is essentially the sequential code except for the vector-type declarations; SIMD parallelization is applied automatically through the overloaded add operator on the vector types. Sub-figure (b) shows the expansion of the overloaded function. First, it applies a translation between the scalar type and the vector type on the arrays involved. Then, a SIMD add function is called on the translated arrays to perform 16 add and write operations in SIMD fashion. Finally, the index is moved to the start of the next 16 operands.
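A hedged sketch of this mechanism follows (the real vfloat type is richer, with masks and implicit scalar conversion; this only illustrates how an overloaded operator can expand to 512-bit intrinsics on a target that supports them, and it assumes the arrays are 64-byte aligned):

#include <immintrin.h>

struct vfloat {
    __m512 v;
    vfloat() : v(_mm512_setzero_ps()) {}
    explicit vfloat(__m512 x) : v(x) {}
    vfloat(float s) : v(_mm512_set1_ps(s)) {}      // broadcast a scalar to all lanes

    // One overloaded add performs 16 float additions at once.
    friend vfloat operator+(const vfloat &a, const vfloat &b) {
        return vfloat(_mm512_add_ps(a.v, b.v));
    }
    void load(const float *src)  { v = _mm512_load_ps(src); }   // aligned load
    void store(float *dst) const { _mm512_store_ps(dst, v); }   // aligned store
};

// The element-wise add of Figure 7.5 then vectorizes 16 floats per iteration.
void add_arrays(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 16) {
        vfloat va, vb;
        va.load(&a[i]);
        vb.load(&b[i]);
        (va + vb).store(&c[i]);
    }
}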

Overall, unlike hand-coded SIMD parallelization, our framework uses these overloaded functions to provide a convenient way of achieving the same performance. Application developers do not need to deal with address translation, with different vectorization instructions for different operand types, operations, and architectures, or with the position of the next computing index. In practice, however, there are complications in applying overloaded functions, particularly when data elements are not contiguous and/or when branches are involved. We discuss these issues in the rest of this section.

Data Reorganization

SIMD operations on the Xeon Phi (or with any SSE-like instruction set) can only be applied when memory accesses are contiguous and aligned. Many applications have non-unit-stride, unaligned, or even random memory accesses, and such accesses impede compiler vectorization. In our framework, we exploit the knowledge about the underlying communication patterns to reorganize the data and facilitate SIMD parallelization.

Generalized Reductions: For generalized reductions, the data array is usually given as an Array of Structures (AoS). For instance, in Kmeans, the input point array comprises x, y, and z information (for three-dimensional points). During vectorization, the Vector Processing Unit (VPU) applies the same operation to 16 elements, which means that it needs to access 16 contiguous values from each dimension. With AoS storage, however, the values corresponding to one particular attribute are non-contiguous; moreover, if the x-dimension data is stored aligned, the y and z data are likely to be unaligned. Therefore, both non-contiguous and unaligned memory accesses can either prevent vectorization or hurt its performance, because the compiler or programmer has to introduce extra gather and scatter operations.

Our system applies the standard AoS to SoA (Structure of Arrays) transformation. In the SoA format, instead of storing each structure contiguously, all values of a particular member are grouped together. Thus, accesses to the same member become contiguous, and aligned accesses can be ensured by adding padding at the end of each member array. Because an AoS can be viewed as a matrix in which columns represent the different members of the structure, our framework employs a parallel matrix transpose to apply the transformation efficiently.
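A small illustration of the transformation (the struct names and the padding-to-16 choice are hypothetical, and the real framework performs the transpose in parallel):

#include <cstddef>
#include <vector>

struct PointAoS { float x, y, z; };      // Array of Structures: x, y, z interleaved

struct PointsSoA {                        // Structure of Arrays: each member contiguous
    std::vector<float> x, y, z;
};

PointsSoA aos_to_soa(const std::vector<PointAoS> &in) {
    std::size_t n = in.size();
    std::size_t padded = (n + 15) / 16 * 16;   // pad to a multiple of the 16-lane width
    PointsSoA out;
    out.x.assign(padded, 0.0f);
    out.y.assign(padded, 0.0f);
    out.z.assign(padded, 0.0f);
    for (std::size_t i = 0; i < n; ++i) {      // effectively a matrix transpose
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
    }
    return out;
}

After the transformation, 16 consecutive x values can be loaded into one 512-bit vector; in the real framework an aligned allocation would additionally guarantee the 64-byte alignment discussed earlier.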

Stencil Computations: Unaligned memory accesses are the major problem for vectorizing stencil computations. When computing the value of a target point, we need to access all of its neighbors, so in the original layout it is impossible to ensure that the target point and its neighbors are all aligned. In the literature [56], a non-linear data layout transformation has been proposed to make the target point and its neighbors aligned at the same time, achieved by dimension lifting and a matrix transposition. In our framework, we invoke this data reorganization to achieve aligned memory accesses.

Resolving Control Flow

Control flow (the presence of branches) has been a severe impediment to SIMD parallelization since the initial version of SSE, released more than a decade ago. Without hardware support, as in SSE and AVX, one has to simulate the mask operations, which is cumbersome. We show an example in Figure 7.6: sub-figure (b) shows hand-written SIMD parallelization of the code in sub-figure (a). It is easy to see that even a statement with simple control flow leads to very complex code whose size increases dramatically.

Our framework helps manage this complexity, building on top of the mask data type and mask operations in the latest Xeon Phi (the IMCI instruction set). As shown in Figure 7.6 (c), logic operations between vector variables can return a mask variable, and we can use the mask as part of masked arithmetic operations to obtain the results from the different branches. Thus, compared to the code in sub-figure (b), control flow can be handled in a more concise fashion. However, users are still required to be familiar with the new intrinsics, which remains complicated and error prone; this is what our framework addresses.

if (a < b)
    a += b;
else
    a -= b;

(a) An example of control flow

__m128i mask1 = _mm_cmplt_epi32(a, b);
__m128i mask0 = _mm_andnot_si128(mask1, _mm_set1_epi32(0xffffffff));
__m128i res = _mm_and_si128(_mm_add_epi32(a, b), mask1);
__m128i oldval = _mm_and_si128(a, mask0);
a = _mm_or_si128(res, oldval);
res = _mm_and_si128(_mm_sub_epi32(a, b), mask0);
oldval = _mm_and_si128(a, mask1);
a = _mm_or_si128(res, oldval);

(b) The vectorized code for the control flow in (a)

__mmask16 mask1 = _mm512_cmplt_epi32_mask(a, b);
__mmask16 mask0 = _mm512_cmpge_epi32_mask(a, b);
a = _mm512_mask_add_epi32(a, mask1, a, b);
a = _mm512_mask_sub_epi32(a, mask0, a, b);

(c) The vectorized code for the control flow in (a), using the mask type

Figure 7.6: An example of control flow (a) without vectorization, (b) with vectorization, and (c) with vectorization and the mask type

First, we elaborate on the Mask State Object shown in Table 7.4. It has two members, a mask-type m and a scalar- or vector-type old_val. old_val is the default value assigned to the unset, or inactive, SIMD lanes. The idea is that the inactive SIMD lanes still execute the instruction, but they simply produce old_val. As one can see from Figure 7.6 (c), we set a as the old_val; thus, when executing a += b, the lanes in which a is greater than or equal to b are assigned their old value a in the end. Similar behavior occurs when executing a -= b.

The API we support is simpler, and was summarized towards the end of Tables 7.3 and 7.4. The key point is that the interface of the operations that involve a mask is the same as that of the operations that do not. Each thread owns a local mask state object for its entire life; the mask state object is declared as a global, static, per-thread variable and includes the mask and old_val information for the current control flow. After a user sets a mask state object with the set_mask function (an example is shown in Figure 7.4), it stays in effect until a new mask is set or the current mask is cleared (using the clear_mask function). Each thread's mask information is used to select between two implementations of each overloaded operation: unmasked (the default) and masked. If a mask is set, the masked versions are invoked and use the thread's local mask state object as a parameter. Figure 7.4 includes an example of how to convert an unmasked vector type to a masked vector type.
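The following is a hedged sketch of how such a thread-local mask state and a masked overloaded operation could fit together (the names MaskState, tls_mask, and masked_add are illustrative, not the framework's own, and the code assumes a 512-bit capable target):

#include <immintrin.h>

struct MaskState {
    bool      active = false;
    __mmask16 m      = 0xFFFF;   // which lanes take the new result
    __m512    old_val;           // value produced by the inactive lanes
};
static thread_local MaskState tls_mask;

// set_mask / clear_mask in the spirit of Table 7.4.
inline void set_mask(__mmask16 m, __m512 old_val) {
    tls_mask.active  = true;
    tls_mask.m       = m;
    tls_mask.old_val = old_val;
}
inline void clear_mask() { tls_mask.active = false; }

// Overloaded add: when a mask is in effect, the masked intrinsic is used,
// so inactive lanes keep old_val instead of the sum.
inline __m512 masked_add(__m512 a, __m512 b) {
    if (tls_mask.active)
        return _mm512_mask_add_ps(tls_mask.old_val, tls_mask.m, a, b);
    return _mm512_add_ps(a, b);
}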

Summary: Putting It Together

[Figure 7.7 sketches the overall SIMD parallelization flow: the user interface supplies the input data and the kernel function implemented in our API; a data reorganization step produces reorganized data with contiguous and aligned memory accesses; and the vectorization translation module expands the kernel through the overloaded functions, adding mask operations and the masked vector type when branches are present, and the load and store functions when indirect memory accesses are involved.]

Figure 7.7: Overall SIMD Parallelization Framework

The overall design of the SIMD parallelization module is shown in Figure 7.7. The user inputs, shown in gray, include the input data and the task parameters (listed in Table 7.2), of which the kernel function is a part. Based on the pattern information, the data reorganization module first reorders the data. Then, the kernel function implemented using our API is vectorized (with the overloaded functions replaced by their implementations) through our vectorization module. For the case without any branches or unaligned, non-contiguous memory accesses, the overloaded function implementations perform vectorization by invoking the built-in SIMD functions of the Xeon Phi. For the case where branches are involved, the only thing users need to do is set the mask; the overloaded function implementations then perform vectorization by invoking the masked SIMD functions of the Xeon Phi. In addition, for the case with unaligned and/or non-contiguous accesses, the extra load and store operations can help group and distribute the non-contiguous data, after which the overloaded functions handle SIMD parallelization just as in the case without indirect memory accesses.

7.4 Evaluation

In this section, we evaluate our framework using various applications that involve the communication patterns we have focused on. The objectives of our experiments were:

1) Comparing the performance of applications developed using our framework against hand-written parallel versions (using Pthreads), and evaluating the SIMD parallelization in our framework against ICC compiler-generated SIMD code; 2) Quantifying the overheads of our runtime framework by comparing its performance against hand-written SIMD code; and 3) Comparing the performance of MIMD parallelization from our framework against OpenMP, another high-level framework, and further evaluating the SIMD parallelization by our framework against what is achieved by the ICC compiler with OpenMP directives.

All experiments were conducted on a Xeon Phi SE10P card, which has 61 cores, each running at 1.1 GHz with four hyper-threads, along with a 32 MB L2 cache and 8 GB of GDDR5 memory. The compiler used is the Intel ICC compiler 13.1.0, and all benchmarks are compiled with -O3 optimization. Compiler vectorization is turned on and off with the -vec and -no-vec options, respectively; #pragma vector always was always used with OpenMP. We also attempted SIMD pragmas such as #pragma ivdep and #pragma simd, as well as the SIMD pragmas introduced by OpenMP 4.0; however, none of them could handle irregular and some generalized reductions, due to data dependencies, complicated dereferences, and the need for outer-loop vectorization for our target class of applications. All experiments were conducted in native mode with the -mmic option.

7.4.1 Benchmarks

Six benchmarks are selected from various benchmark suites, two each involving generalized reductions, stencil computations, and irregular reductions. Kmeans [62] is a very popular data clustering kernel: in each iteration, it processes each point in the dataset, determines the closest center to this point, and computes how this center's location should be updated. Our experiments used two different values of the parameter K, the number of clusters: K = 10 and K = 100. Naive Bayes Classifier (NBC) [116] is a simple classification algorithm based on observed probabilities; we used two datasets, 50 MB and 200 MB. Sobel is a stencil computation: for 2D Sobel, two 3 × 3 weight templates are used to compute the weight of the target point. Two matrices, of size 8192 × 8192 and 16384 × 16384, respectively, were used. Heat3D [2] simulates heat transmission in a 3D space, involving a 7-point stencil; the small and large datasets used in the experiments are 512 × 256 × 256 and 512 × 512 × 512, respectively. Molecular Dynamics (MD) is an irregular reduction kernel used to study the structural, equilibrium, and dynamic properties of molecules; the simulation iterates over all the edges and updates the attributes associated with the two end nodes. The small dataset used in the experiments has 16K nodes and 2M edges, while the large one has 256K nodes and 32M edges. Euler is another irregular reduction kernel, based on Computational Fluid Dynamics (CFD), that takes a description of the connectivity of a mesh and calculates quantities like velocities at each mesh point. The small dataset used in our experiments has 182K nodes and 1.13M edges, while the large one has 1.4M nodes and 8.9M edges.

7.4.2 Speedups from Our Framework


Figure 7.8: Speedup of Pthread without SIMD (Pthread-novec), Pthread with auto-SIMD (Pthread-vec), MIC SIMD with our framework (SIMD-API), and hand-written SIMD (SIMD-Manual): Kmeans and NBC with small and large datasets each

Our first set of experiments focused on comparing SIMD parallelization with our framework against compiler-generated SIMD code (auto-vectorization) and against hand-written SIMD code.


Figure 7.9: Speedup of Pthread without SIMD (Pthread-novec), Pthread with auto-SIMD (Pthread-vec), MIC SIMD with our framework (SIMD-API), and hand-written SIMD (SIMD-Manual): Sobel and Heat3D with small and large datasets each


Figure 7.10: Speedup of Pthread without SIMD (Pthread-novec), Pthread with auto-SIMD (Pthread-vec), MIC SIMD with our framework (SIMD-API), and hand-written SIMD (SIMD-Manual): Euler and MD with small and large datasets each

Compiler SIMD parallelization was applied to the Pthreads code, so as to also allow shared memory parallelization. The Pthreads-based shared memory parallel versions used a similar style (and thus obtain similar performance) as the shared memory parallelization supported by our framework, though the programmer effort is much smaller with our framework. In Figures 7.8, 7.9, and 7.10, we compare the best performance of the pthread versions with and without compiler vectorization, and of the vectorized versions with hand-coding and with our API, for the small and large datasets described earlier. The performance shown is for the number of threads that leads to the best performance (which may be different across versions). The numbers reported are relative speedups, with the Pthreads version without vectorization as the baseline.

For generalized reductions, both Kmeans and NBC show similar trends. The SIMD-API version achieves better performance than the Pthread versions with or without compiler-generated SIMD code. Moreover, the runtime overhead introduced by SIMD-API is very small compared to the hand-written SIMD versions. In Kmeans, compiler vectorization can only be applied to the innermost loop, which calculates the distance between one node and all the cluster centers. The performance is sensitive to the amount of computation in this loop, which depends upon the number of clusters, K. Thus, with K = 10 the pthread-vec version is even slower than the pthread-novec version, whereas with K = 100 pthread-vec gains a 3.5x speedup over pthread-novec.

With SIMD-API, however, the vectorization is applied to the outermost loop, which iterates over all input points. In addition, with data reorganization and effective management of branches, we can further improve the performance; thus, SIMD-API gains 2.5 and 7.8 times speedups with K = 10 and K = 100, respectively. Another optimization applicable to Kmeans is the reordering of reductions: when K is smaller than the number of vector lanes, it is impossible to eliminate the write conflicts, but this optimization is effective with a large K.

Now considering NBC, the large number of branches causes significant overhead for vectorization, so the available production compiler failed to vectorize the kernel function in NBC, i.e., the difference between pthread-novec and pthread-vec is negligible. However, with the help of the mask operations introduced in our framework, SIMD-API still gains 1.5 and 1.6 times speedups on the small and large datasets, respectively.

For stencil computations, one of the major problems for vectorization is unaligned memory accesses. In our framework, we overcome this limitation by reorganizing the datasets; however, the ICC compiler also has some capability for automatic vectorization here. For Heat3D, we can see that pthread-vec achieves the best performance, which is very close to SIMD-API and SIMD-Manual. But for Sobel, compiler vectorization fails due to the extra inner loop that applies the weights to the neighbors of the target node. Thus, the performance of pthread-vec is very similar to that of pthread-novec, whereas SIMD-API can still achieve more than a 2x speedup, because its vectorization is not limited to the inner loop.

For irregular reductions, the production compiler cannot vectorize a loop with indirection-based memory accesses at all. In our framework, we use data reordering, together with a reduction in the use of gather and scatter operations, to vectorize such loops, which turns out to be effective when the datasets are large: we achieve 1.5 and 2.5 times speedups over the pthread versions for Euler and MD, respectively. For small datasets, the performance of the best SIMD-API version is comparable to the pthread versions; however, the best configuration with SIMD-API involves fewer threads (60 instead of 244). In other words, for the smaller datasets, enough parallelism is not available to exploit both MIMD and SIMD features. Compared to the best SIMD-Manual versions, SIMD-API incurs negligible overheads.

7.4.3 Overall Scalability


Figure 7.11: Scalability with Increasing Number of Threads: Pthread without vectorization (Pthread-novec), Pthread with auto-vectorization (Pthread-vec), SIMD with API (SIMD-API), and hand-written SIMD (SIMD-manual) with Kmeans and NBC (large datasets) - Relative Speedups Over 1 Thread Execution on Xeon Phi with no Vectorization

Figure 7.12: Scalability with Increasing Number of Threads: Pthread without vectorization (Pthread-novec), Pthread with auto-vectorization (Pthread-vec), SIMD with API (SIMD-API), and hand-written SIMD (SIMD-manual) with Sobel and Heat3D (large datasets) - Relative Speedups Over 1 Thread Execution on Xeon Phi with no Vectorization

Figure 7.13: Scalability with Increasing Number of Threads: Pthread without vectorization (Pthread-novec), Pthread with auto-vectorization (Pthread-vec), SIMD with API (SIMD-API), and hand-written SIMD (SIMD-manual) with Euler and MD (large datasets) - Relative Speedups Over 1 Thread Execution on Xeon Phi with no Vectorization

In Figures 7.11, 7.12, and 7.13, we compare the scalability of pthread-novec, pthread-vec, SIMD-API, and SIMD-Manual with an increasing number of threads. Execution with a single thread and no vectorization on the Xeon Phi is used as the baseline; thus, we are evaluating the combined benefits of shared memory parallelization (61 cores), hardware multi-threading (4 threads per core), and the SIMD units. Performance scales well for all the versions. SIMD-API outperforms both pthread-vec and pthread-novec in most cases, consistent with what we reported earlier, and it achieves better relative performance when the number of threads is small: for instance, with one thread, SIMD-API is 20 times better than the Pthreads-novec version. With small datasets, as the number of threads increases, the vectorization advantage of SIMD-API becomes restricted due to the limited amount of overall work. The overall speedups obtained range from 33 to 580, depending upon the application. Thus, our framework is effective in allowing users to exploit the Xeon Phi chip. As an aside, the benefits of hardware multi-threading (more than 1 thread per core) seem limited, except for Kmeans (a speedup from 2 threads per core, but a slowdown from 4 threads per core) and the irregular applications (where latency is masked by hardware multi-threading). In the future, we will examine this issue further and develop a module for automatically choosing the number of threads for a given application.

7.4.4 Comparison with OpenMP


Figure 7.14: Benefits of MIMD+SIMD Execution in our Framework (Comparison with OpenMP-vec - left) and MIMD-only execution (Comparison with OpenMP-novec - right)

Our last set of experiments had two distinct goals. First, we wanted to examine how SIMD parallelization with our framework compares against SIMD parallelization performed by the ICC compiler with OpenMP directives. Second, because both OpenMP and the MIMD API in our framework provide a high-level model for developing shared memory applications, we wanted to examine whether our framework offers any performance advantages, possibly because it exploits the knowledge of the underlying communication patterns.

In Figure 7.14, we compare our MIMD parallel framework, with and without SIMD parallelization, to OpenMP MIMD parallelization with and without compiler vectorization. Comparing MIMD+SIMD to OpenMP-vec, more than 3 times speedup is achieved for Kmeans and NBC, due to the better SIMD parallelization and efficient MIMD parallelism. For Heat3D, OpenMP with compiler vectorization provides good performance, but our framework still outperforms the OpenMP version, due to its more efficient MIMD parallelism. For Sobel, where SIMD parallelization is not achieved by the compiler, our framework gains significant speedups over the OpenMP version.

Now, focusing just on MIMD parallelization, our parallel framework still obtains better performance compared to OpenMP. The benefits of our framework are modest for Kmeans and stencil computations, but more significant for NBC and the two irregular applications.

Overall, combining both MIMD and SIMD parallelization, our framework is better for all six applications, has a relative speedup of 2.5 or better for five of the six, and for the two irregular reductions it shows improvements by factors of 4 and 7, respectively.

As discussed throughout the paper, these advantages come from a number of factors; e.g., our framework can vectorize an irregular kernel with indirection-based memory accesses while the OpenMP compiler cannot, and its pattern-aware MIMD partitioning and scheduling can avoid locking overheads.

7.5 Related Work

Intel SSE has been a part of the x86 architecture since 1999, and there have been many efforts to automatically accelerate various applications using these instructions. For vectorizing stencils, memory alignment is a key problem, which was addressed by Eichenberger et al. [36] and Nuzman et al. [97] with data reorganization methods. More recently, Henretty et al. [55] proposed a system that involves improving data locality and utilizing short-vector SIMD optimizations, and Kong et al. [74] designed a polyhedral compiler to perform loop transformation, optimization, and vectorization for imperfectly nested loops.

Vectorizing irregular applications on SSE has also gained considerable interest in recent years. Kim and Han [69] propose a compiler method to generate efficient SIMD code for irregular kernels containing indirection-based memory accesses. However, their work is on the Cell SPU, with a much shorter SIMD unit compared to the Xeon Phi, and their method primarily focuses on intra-iteration vectorization; we focus on aggressive inter-iteration parallelism, consistent with the presence of wide SIMD lanes. Tian et al. [120] provided an extension to the current directive-based vectorization methods to support function calls. ISPC [100] provides a compiler-based solution supporting function calls, SoA data structures, and control flow. The focus of our work is different, as we are providing a template-based runtime solution for auto-vectorization: it utilizes the knowledge of patterns to automatically conduct data reorganization for different patterns, and it can also help resolve data dependencies at runtime, which is difficult for compiler solutions to handle. Ren et al. [110] design a virtual machine together with a domain-specific bytecode method for pointer data traversals. There are also efforts on hand-optimizing irregular applications on SSE and other vector units [115, 68].

Some of the GPU compilation efforts have a similar flavor, because SIMT is closely related to SIMD. This includes work on parallelizing stencil applications on GPUs [27, 90, 96, 57, 22]. For irregular applications on GPUs, the coalesced memory access problem has also been addressed [127, 133]; however, because of the differences in the architectures (e.g., lack of atomic stores), our data reorganization methods are different. Overall, compared to the existing work on SIMD compilation, the key distinctive aspects of our work are: 1) handling branches in a general way, 2) exploiting features in the IMCI instruction set, 3) using knowledge of communication patterns for runtime data reorganization, and 4) the use of an overloaded function approach, which is unlike all previous efforts on SIMD parallelization and can simplify compiler code generation in the future.

There are also many efforts to parallelize various applications on the Xeon Phi, including the work of Liu et al. [81] on sparse matrix-vector multiplication, Pennycook et al. [99] on parallelizing a molecular dynamics application, and the work in [85] on optimizing the MapReduce framework. We have, to the best of our knowledge, offered the first general and end-to-end system for exploiting both MIMD and SIMD parallelism on Xeon Phis.

7.6 Summary

This paper has presented and evaluated a framework for parallelization on the Xeon Phi coprocessor. Two distinct aspects of our work are 1) the use of knowledge of the underlying patterns to perform job partitioning and scheduling in the MIMD setting and data reorganization for SIMD parallelization, and 2) a very different approach to SIMD code execution, based on the implementation of overloaded functions with runtime management of masks. Overall, we perform SIMD parallelization in the presence of control flow, irregular accesses, and reductions, unlike previous work with SSE-like instruction sets. Moreover, our work can also be seen as providing a CUDA-like language (and its implementation) for using SSE-like instruction sets.

Chapter 8: Conclusions

This chapter summarizes the contributions of my dissertation and discusses future work.

8.1 Contributions

Modern GPUs and other SIMD accelerators have emerged as a means for achieving extreme-scale, cost-efficient, and power-efficient high performance computing. Due to their simplified control logic, a large number of cores can be included to support extremely high parallelism. However, various challenges arise in performance and programmability. In this dissertation, we introduce different strategies and optimizations on GPU, heterogeneous CPU-GPU, and Intel Xeon Phi architectures for applications with different communication patterns. Our contributions can be summarized as follows:

• We discuss the critical performance issues on modern GPUs, i.e., locking operations and shared memory. Three parallel strategies, full replication, a locking scheme, and a hybrid scheme, are proposed and compared on the GPU. Due to the limited locking support on GPUs, fine-grained and coarse-grained locking methods for floating point data are provided. The proposed hybrid scheme achieves a tradeoff between the large memory overhead of full replication and the race conditions of the locking scheme. In the experiments, the results show that the performance of the different schemes depends on the characteristics and parameters of the applications; however, the hybrid scheme always achieves the best performance because it balances the other two strategies.

• We focus on the parallelization of irregular applications on GPUs. Our goal is to utilize shared memory, which is a critical performance resource on the GPU, for irregular or unstructured applications. Due to the large memory overhead of the traditional full replication method, we choose a locking scheme with a partitioning module, referred to as the partitioning-based locking scheme. Compared to computation space partitioning, reduction space partitioning, which is used in our work, achieves both the benefits of shared memory utilization and the avoidance of result combination. A multi-dimensional partitioning method is proposed to trade off execution time against computation redundancy. In addition, a reordering module is introduced to obtain coalesced accesses on the computation space.

• We extend our partitioning-based locking scheme to heterogeneous CPU-GPU architectures. Considering that the device memory on the GPU is limited, we add a new partitioning level to our original scheme, producing a multilevel partitioning scheme. In the first partitioning level, work is divided between the CPU and the GPU; reduction space partitioning is applied because of the different address spaces of the CPU and GPU, and to avoid result combination. The second-level partitioning then divides the work among all the thread blocks on the GPU or all the cores on the CPU. A dynamic scheduling framework is provided, with dynamic task sizes for the different devices and a work stealing strategy. Moreover, work pipelining is used on the GPU to overlap partitioning and computation.

• With the emergence of integrated CPU-GPU architectures, we provide a scheduling framework for applications with different communication patterns, i.e., generalized reductions, stencil computations, and irregular reductions, on the Fusion APU. Considering the overheads from command launching, synchronization, and load imbalance, we present a thread block level scheduling framework that utilizes the system memory shared between the CPU and GPU. Two implementations, a master-worker scheme and a token scheme, are provided and compared with the traditional dynamic task queue scheduling. The experiments are conducted on three applications with different communication patterns.

• Because recursive control flow is poorly supported on GPUs, we analyze recursive execution on current NVIDIA GPUs and find that the immediate post-dominator reconvergence policy is the main cause of slow recursive execution. We therefore propose a greedy scheme and a majority scheme, both of which provide dynamic thread reconvergence so that threads with small recursion tasks have a chance to obtain their next tasks sooner. Two variations, greedy-ret and majority-threshold, are provided as further optimizations. Extensive experiments are conducted on recursive applications with different characteristics.

• Targeting the Intel Xeon Phi architecture, we present a parallelization framework with our API, which provides both MIMD and SIMD parallelization. Based on knowledge of the application pattern, the MIMD parallel framework employs different task partitioning methods and scheduling strategies. Moreover, data reorganization is performed to further favor SIMD parallelization. Our SIMD API uses operator overloading and function templates to simplify SIMD programming and resolves data and control dependencies; its programming style is sketched after this item.
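
The programming style of the SIMD API can be conveyed by a small wrapper type whose overloaded operators stand in for vector instructions, so that a user-written function template runs unchanged over scalar and vector types. The sketch below is a simplified, scalar-emulated illustration (the type and function names are hypothetical); the actual API maps such operators to the 512-bit IMCI intrinsics and additionally handles masking, gather/scatter, and dependence resolution.

```cpp
// A fixed-width vector type; in the real API the overloaded operators wrap
// the 512-bit intrinsics instead of this scalar fallback loop.
template <typename T, int W>
struct vec {
    T lane[W];
    friend vec operator+(vec a, vec b) {
        vec r;
        for (int i = 0; i < W; ++i) r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }
    friend vec operator*(vec a, vec b) {
        vec r;
        for (int i = 0; i < W; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
};

// A user kernel as a function template: the same source works for scalar
// T = float and for the vectorized T = vec<float, 16>.
template <typename T>
T axpy(T a, T x, T y) {
    return a * x + y;   // operator overloading hides the SIMD instructions
}
```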

8.2 Future Work

Our current work can be extended in two directions. First, our proposed SIMD API currently supports only the 512-bit vector registers of the Intel Xeon Phi architecture, so one extension is to support SIMD vectors of different lengths, e.g., 128 bits for SSE, 256 bits for AVX, or dynamically varying vector lengths. Second, a CPU and an Intel Xeon Phi together compose a heterogeneous architecture, and the task partitioning and scheduling strategies should be reconsidered for this new environment.

8.2.1 Support SIMD Vectors with Different Lengths

A number of common CPUs also support SIMD parallelism, but with shorter vector registers, e.g., 128 bits for SSE or 256 bits for AVX. Providing a SIMD API that supports vectors of different lengths would therefore extend our API to benefit not only the Intel Xeon Phi architecture but also processors with shorter vector registers. There are three major challenges in providing SIMD parallelism across different vector lengths; a minimal sketch of a width-parameterized design follows the list below.

• First, we should hide the different implementations of the SIMD API from users. One of the challenges in SIMD programming is that the instructions for different vector lengths, i.e., SSE, AVX, and IMCI, differ, which means we need a separate API implementation for each vector length. To gain portability, however, we must provide a consistent API interface and a runtime module that loads the appropriate implementation for the available vector length.

• The new Intel IMCI instruction set introduces many new instructions, such as mask, gather, and scatter operations, which did not exist in the earlier SIMD instruction sets, i.e., SSE and AVX. Therefore, we need to emulate these instructions for users on those targets.

• In a CPU-MIC heterogeneous environment, two vector lengths may coexist at runtime. For instance, the host may support the SSE instruction set while the Xeon Phi accelerator supports the IMCI instruction set. Thus, to support a CPU-MIC heterogeneous environment, the implementation of our SIMD API must handle varied vector lengths at runtime.
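
One plausible way to keep a single interface while the register width varies, sketched below under the assumption of a traits-style design (not the dissertation's implementation), is to select the register type and intrinsics by the vector width through template specialization, guarded by the compiler's target macros.

```cpp
#include <immintrin.h>

// Primary template: the register type and operations are chosen by the
// vector width W (in bits). User code is written once against
// simd_traits<W>, and the build or runtime selects the widest available W.
template <int W> struct simd_traits;

template <> struct simd_traits<128> {               // SSE
    using reg = __m128;
    static reg add(reg a, reg b) { return _mm_add_ps(a, b); }
};

#ifdef __AVX__
template <> struct simd_traits<256> {               // AVX
    using reg = __m256;
    static reg add(reg a, reg b) { return _mm256_add_ps(a, b); }
};
#endif

#if defined(__AVX512F__) || defined(__MIC__)
template <> struct simd_traits<512> {               // IMCI / AVX-512
    using reg = __m512;
    static reg add(reg a, reg b) { return _mm512_add_ps(a, b); }
};
#endif
```

A runtime dispatch layer could then instantiate user templates with the widest simd_traits<W> available on each device, addressing the first and third challenges above.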

8.2.2 Support CPU-MIC Heterogeneous Environment

In the work of Chapter 7, we execute applications on the Intel Xeon Phi in the native model, in which the entire application runs on the accelerator. However, not every part of an application is worth executing on an accelerator; for the serial parts, for example, there is no benefit in porting them to the Intel Xeon Phi, which is designed for large-scale parallelism. In such cases, another execution model, the offload model, may be a better choice. Users port only specific parts of the application and their associated data to the accelerator, while the remaining parts continue to execute on the host. By using OpenMP pragmas, users can decide which loops execute on the Intel Xeon Phi and which on the host. This execution model is very similar to the execution model for the GPU, which does not run its own operating system and therefore cannot support the native model.
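
As an illustration of the offload model, the fragment below shows one way such a loop could be expressed with OpenMP 4.0 target directives (a sketch only; the function, array names, and mapping clauses are placeholders, and the dissertation's framework is not tied to this exact form): the marked loop and its data are shipped to the coprocessor while the rest of the program stays on the host.

```cpp
// A sketch of the offload model: only this loop and the mapped arrays move
// to the accelerator; everything around it keeps running on the host.
void scale_offloaded(const float* a, float* b, int n, float alpha) {
    #pragma omp target map(to: a[0:n]) map(from: b[0:n])
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            b[i] = alpha * a[i];
    }
}
```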

However, because the host also provides computation capability, there is a benefit to running a single application across the CPU-MIC heterogeneous environment.

In Chapters 4 and 5, we discussed two kinds of scheduling methods for heterogeneous CPU-GPU architectures. The multi-level partitioning framework for irregular applications can be ported directly to the CPU-MIC architecture. Moreover, because memory sharing between the host and the Xeon Phi is also supported on the MIC architecture, the thread-block-level scheduling framework introduced in Chapter 5 can likewise be applied to the CPU-MIC setting. Unlike a GPU, the Intel Xeon Phi runs an independent Linux system; therefore, synchronization among all the threads, which on a GPU must be performed outside the kernel, can be performed on the Xeon Phi accelerator without involving the host. As a result, this synchronization can be lightweight, since no extra synchronization with the host is involved.
