Cross-Platform Heterogeneous Runtime Environment

Cross-Platform Heterogeneous Runtime Environment A Dissertation Presented by Enqiang Sun to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering Northeastern University Boston, Massachusetts April 2016 To my family. i Contents List of Figures iv List of Tables vi Acknowledgments vii Abstract of the Dissertation viii 1 Introduction 1 1.1 A Brief History of Heterogeneous Computing . 4 1.2 Heterogeneous Computing with OpenCL . 5 1.3 Task-level Parallelism across Platforms with Multiple Computing Devices . 5 1.4 Scope and Contribution of This Thesis . 7 1.5 Organization of This Thesis . 8 2 Background and Related Work 9 2.1 From Serial to Parallel . 9 2.2 Many-Core Architecture . 10 2.3 Programming Paradigms for Many Core Architecture . 14 2.3.1 Pthreads . 14 2.3.2 OpenMP . 15 2.3.3 MPI . 15 2.3.4 Hadoop MapReduce . 15 2.4 Computing with Graphic Processing Units . 16 2.4.1 The Emergence of Programmable Graphics Hardware . 16 2.4.2 General Purpose GPUs . 17 2.5 OpenCL . 19 2.5.1 An OpenCL Platform . 20 2.5.2 OpenCL Execution Model . 20 2.5.3 OpenCL Memory Model . 27 2.6 Heterogeneous Computing . 28 2.6.1 Discussion . 36 2.7 SURF in OpenCL . 36 2.8 Monte Carlo Extreme in OpenCL . 37 ii 3 Cross-platform Heterogeneous Runtime Environment 39 3.1 Limitations of the OpenCL Command-Queue Approach . 39 3.1.1 Working with Multiple Devices . 39 3.2 The Task Queuing Execution System . 40 3.2.1 Work Units . 42 3.2.2 Work Pools . 42 3.2.3 Common Runtime Layer . 44 3.2.4 Resource Manager . 44 3.2.5 Scheduler . 45 3.2.6 Task-Queuing API . 49 4 Experimental Results 50 4.1 Experimental Environment . 50 4.2 Static Workload Balancing . 51 4.2.1 Performance Opportunities on a Single GPU Device . 51 4.2.2 Heterogeneous Platform with Multiple GPU Devices . 52 4.2.3 Heterogeneous Platform with CPU and GPU(APU) Device . 56 4.3 Design Space Exploration for Flexible Workload Balancing . 59 4.3.1 Synthetic Workload Generator . 59 4.3.2 Dynamic Workload Balancing . 59 4.3.3 Workload Balancing with Irregular Work Units . 63 4.4 Cross-Platform Heterogeneous Execution of clSURF and MCXCL . 64 4.5 Productivity . 67 5 Summary and Conclusions 69 5.1 Portable Execution across Platforms . 69 5.2 Dynamic Workload Balancing . 69 5.3 APIs to Expose Both Task-level and Data-level Parallelism . 70 5.4 Future Research Directions . 70 5.4.1 Including Flexible Workload Balancing Schemes . 70 5.4.2 Running specific kernels on the best computing devices . 71 5.4.3 Prediction of data locality . 71 Bibliography 73 iii List of Figures 1.1 Multi-core CPU, GPU, and Heterogeneous System-on-Chip CPU and GPU. At present, designers are able to make decisions among diverse architecture choices: homogeneous multi-core with cores of various size of complexity or heterogeneous system-on-chip architectures. 3 2.1 Intel Processors Introduction Trends . 10 2.2 Multi-core Processors with Shared Memory . 11 2.3 Intel’s TeraFlops Architecture . 12 2.4 Intel’s Xeon Phi Architecture . 13 2.5 IBM’s Cell . 14 2.6 High Level Block Diagram of a GPU . 18 2.7 OpenCL Architecture . 19 2.8 An OpenCL Platform . 20 2.9 The OpenCL Execution Model . 21 2.10 OpenCL work-items mapping to GPU devices. 23 2.11 OpenCL work-items mapping to CPU devices. 24 2.12 The OpenCL Memory Hierarchy. 28 2.13 Qilin Software Architecture . 29 2.14 The OpenCL environment with the IBM OpenCL common runtime. 30 2.15 Maestro’s Optimization Flow . 34 2.16 Symphony Overview . 35 2.17 The Program Flow of clSURF. 37 2.18 Block Diagram of the Parallel Monte Carlo simulation for photon migration. 38 3.1 Distributing work units from work pools to multiple devices. 41 3.2 CPU and GPU execution . 43 3.3 An example execution of vector addition on multiple devices with different processing capabilities. 46 4.1 The performance of our work pool implementation on a single device – One Work Pool. 53 4.2 The performance of our work pool implementation on a single device – Two Work Pools. 54 4.3 Load balancing on dual devices – V9800P and HD6970. 55 iv 4.4 Load balancing on dual devices – V9800P and GTX 285. 56 4.5 Performance assuming different device fission configurations and load balancing schemes between CPU and Fused HD6550D GPU. 57 4.6 The Load Balancing on dual devices – HD6550D and CPU. 58 4.7 Performance of different workload balancing schemes on all 3 CPU and GPU devices, an A8-8350 CPU, a V7800 GPU and a HD6550D GPU, as compared to a V7800 GPU device alone. 62 4.8 Performance of different workload balancing schemes on 1 CPU and 2 GPU devices, an NVS 5400M GPU, a Core i5-3360M CPU and a Intel HD graphics 4000 GPU, as compared to the NVS 5400M GPU device alone. 63 4.9 Performance Comparison of clSURF implemented with various workload balancing schemes on the platform with V7800 and HD6550D GPUs. 65 4.10 Performance Comparison of MCXCL implemented with various workload balancing schemes . 66 4.11 Number of lines of the source code using our runtime API versus a baseline OpenCL implementation. 68 v List of Tables 3.1 Typical memory bandwidth between different processing units for reads. 47 3.2 Typical memory bandwidth between different processing units for writes. 47 3.3 The Classes and Methods . 47 4.1 Input sets emphasizing different phases of the SURF algorithm. 52 vi Acknowledgments First, I would like to thank my advisor, Prof. David Kaeli for his insightful and inspiring guidance during the course of my graduate study. I have always enjoyed talking with him on various research topics, and his computer architecture class is one of the best classes I have ever taken. The enlightening suggestions from my committee of Prof. Norman Rubin and Prof. Ningfang Mi have been a great help to this thesis. Norm was my mentor when I was doing a 6-month internship at AMD, and that’s where this thesis essentially started. I would also like to thank Dr. Xinping Zhu, who gave me valuable guidance for my early graduate study. My fellow NUCAR colleagues, Dana Schaa, Byunghyun Jang, Perhaad Mistry, etc, also helped me so much through technical discussions and feedback. If life is a train ride, I cherish every moment and every scene outside of the window we share together. My deepest appreciation goes to my family, as it is always where I can put myself together with their endless love. I would like to thank my mom and dad for their consistent support and motivation, and my brother for his advice and encouragement. And finally, but most importantly, I would like to thank my wife and mother of my two kids, Liwei, for her understanding, patience, and faith in me. I couldn’t have finished this thesis without her love. vii Abstract of the Dissertation Cross-Platform Heterogeneous Runtime Environment by Enqiang Sun Doctor of Philosophy in Computer Engineering Northeastern University, April 2016 Dr. David Kaeli, Adviser Heterogeneous platforms are becoming widely adopted thanks to the support from new programming languages and models. Among these languages/models, OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units on a system. OpenCL is one of the first standards that focuses on portability, allowing programs to be written once and run unmodified on multiple, heterogeneous devices, regardless of vendor. While OpenCL has been widely adopted, there still remains a lack of support for automatic workload balancing and data consistency when multiple devices are present in the system. To address this need, we have designed a cross-platform heterogeneous runtime environment which provides a high-level, unified, execution model that is coupled with an intelligent resource management facility. The main motivation for developing this runtime environment is to provide OpenCL programmers with a convenient programming paradigm to fully utilize all possible devices in a system and incorporate flexible workload balancing schemes without compromising the user’s ability to assign tasks according to the data affinity. Our work removes much of the cumbersome initialization of the platform, and now devices and related OpenCL objects are hidden under the hood. Equipped with this new runtime environment and associated programming interface, the programmer can focus on designing the application and worry less about customization to the target platform. Further, the programmer can now take advantage of multiple devices using a dynamic workload balancing algorithm to reap the benefits of task-level parallelism. To demonstrate the value of this cross-platform heterogeneous runtime environment, we have evaluated it running both micro benchmarks and popular OpenCL benchmark applications. With minimal overhead of managing data objects across devices, we show that we can achieve scalable performance and application speedup as we increase the number of computing devices, without any changes to the program source code. viii ix Chapter 1 Introduction Moore’s law describes technology advances that double transistor density on integrated circuits every 12 to 18 months [1]. However, with the size of transistors approaching the size of individual atoms, and as power density outpaces current cooling techniques, the end of Moore’s law has appeared on the horizon. This has encouraged the research community to.

Load more