The Distribution of OpenCL Kernel Execution Across Multiple Devices

by

Steven Gurfinkel

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of the Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2014 by Steven Gurfinkel

Abstract

The Distribution of OpenCL Kernel Execution Across Multiple Devices

Steven Gurfinkel
Master of Applied Science
Graduate Department of the Edward S. Rogers Sr. Department of Electrical and Computer Engineering
University of Toronto
2014

Many computer systems now include both CPUs and programmable GPUs. OpenCL, a new programming framework, can program individual CPUs or GPUs; however, distributing a problem across multiple devices is more difficult. This thesis contributes three OpenCL runtimes that automatically distribute a problem across multiple devices: DualCL and m2sOpenCL, which distribute tasks across a single system's CPU and GPU, and DistCL, which distributes tasks across a cluster's GPUs. DualCL and DistCL run on existing hardware; m2sOpenCL runs in simulation. On a system with a discrete GPU and a system with integrated CPU and GPU devices, running programs from the Rodinia benchmark suite, DualCL improves performance over a single device when host memory is used. Running similar benchmarks, m2sOpenCL shows that reducing the overheads present in current systems improves performance. DistCL accelerates unmodified, compute-intensive OpenCL kernels, obtained from Rodinia, the AMD samples, and elsewhere, when distributing them across a cluster.

Acknowledgements

I must start by thanking my supervisors, professors Natalie Enright Jerger and Jason H. Anderson. Without their guidance, care, commitment, and knowledge this work would not have been possible. They always came through for me when I needed some last-minute help and for that I am particularly grateful. It was an honour to be mentored by such great researchers. I must also thank my fellow graduate student Tahir Diop. Our close collaboration was both fun and productive. Furthermore, I thank my committee, professors Tarek S. Abdelrahman, Paul Chow, and Wai Tung Ng for their valuable insights and feedback. Of course, I also thank my family and friends for their support throughout my degree. I thank both my parents for starting me on the path to science and engineering; their academic achievements serve as an inspiration for my own.

Contents

1 Introduction
  1.1 Contributions
  1.2 Outline

2 Background
  2.1 OpenCL
    2.1.1 Device Model
    2.1.2 Memory Coherence and Consistency
    2.1.3 Host Model
  2.2 Other Heterogeneous Programming Models
    2.2.1 CUDA
    2.2.2 BOLT
  2.3 The AMD Fusion APU
  2.4 Multi2Sim
  2.5 Related Work
    2.5.1 Distribution with Custom Interfaces
    2.5.2 Distribution of OpenCL Kernels
    2.5.3 Making Remote Devices Appear Local
    2.5.4 Characterizing and Simulating Heterogeneous Execution

3 Co-operative OpenCL Execution
  3.1 Outline
  3.2 Dividing Kernel Execution
  3.3 DualCL
    3.3.1 Memory Allocation
    3.3.2 Partitioning Kernels
  3.4 m2sOpenCL
  3.5 Partitioning Algorithms
    3.5.1 Static Partitioning
    3.5.2 Block by Block (block)
    3.5.3 NUMA (numa)
    3.5.4 First-Past-The-Post (first)
    3.5.5 Same Latency (same)
  3.6 Experimental Study
    3.6.1 Benchmarks
  3.7 Results and Discussion
    3.7.1 Memory Placement
    3.7.2 Partitioning Algorithms
    3.7.3 Individual Benchmarks
    3.7.4 Simulated Results
  3.8 Conclusions

4 CPU Execution for Multi2Sim
  4.1 Overall Design
    4.1.1 The OpenCL Device Model for CPUs
  4.2 Behaviour of a Work-Item
    4.2.1 Work-Item Stack Layout
  4.3 Fibres
  4.4 Scheduling and Allocating Work-Items
    4.4.1 Stack Allocation Optimization
  4.5 Whole-Kernel Execution and Scheduling
  4.6 Performance Validation
  4.7 Conclusion

5 Distributed OpenCL Execution
  5.1 Introduction
  5.2 DistCL
    5.2.1 Partitioning
    5.2.2 Dependencies
    5.2.3 Scheduling Work
    5.2.4 Transferring Buffers
  5.3 Experimental Setup
    5.3.1 Cluster
  5.4 Results and Discussion
  5.5 Conclusion

6 Conclusion
  6.1 Future Work

Appendices

A The Accuracy of the Multi2Sim CPU Device
  A.1 Microbenchmarks
    A.1.1 Register
    A.1.2 Dependent Add
    A.1.3 Independent Add
    A.1.4 Memory Throughput
    A.1.5 Memory Latency
  A.2 Issues and Best Configuration

B The DistCL and Metafunction Tool Manual
  B.1 Introduction
    B.1.1 Intended Audience
  B.2 Building DistCL
  B.3 Linking DistCL to a Host Program
    B.3.1 Linking Metafunctions
  B.4 Running DistCL
    B.4.1 distcl.conf
  B.5 Writing Metafunctions
    B.5.1 Using require_region
    B.5.2 The Tool
    B.5.3 Building the Tool
  B.6 Enhancing DistCL
    B.6.1 Partition Strategies
  B.7 Limitations
    B.7.1 With AMD 5000 Series GPUs
    B.7.2 Completeness of OpenCL Implementation

Bibliography

List of Tables

2.1 OpenCL Work-Item Functions
2.2 Zero-Copy Memory Creation Flags

3.1 Contributions to Multi2Sim's Co-operative Execution
3.2 Parameters Added to Kernel Functions by DualCL
3.3 Substitutions Performed by DualCL
3.4 Test Systems

4.1 Work-Item Information Data-Structure
4.2 Test Configuration for OpenCL CPU Driver Comparison

5.1 Benchmark description
5.2 Cluster specifications
5.3 Measured cluster performance
5.4 Execution time spent managing dependencies

A.1 Results from Multi2Sim Configuration

List of Figures

2.1 Two Dimensional NDRange with work-items and work-groups
2.2 OpenCL Model of Compute Device

3.1 Co-operative and Regular Kernel Execution
3.2 Dividing an NDRange into Smaller NDRanges
3.3 Structure of Multi2Sim OpenCL
3.4 Different Device and Memory Combinations on Fusion
3.5 Different Device and Memory Combinations on Discrete System
3.6 Partition Algorithms on Fusion System
3.7 Partition Algorithms on Discrete System
3.8 Proportion of Work Done on CPU
3.9 CPU Kernel Invocation Overhead
3.10 GPU Kernel Invocation Overhead
3.11 DualCL Overhead
3.12 Kmeans on Fusion System
3.13 Kmeans on Discrete System
3.14 Nearest Neighbour on Fusion System
3.15 Nearest Neighbour on Discrete System
3.16 Pathfinder on Fusion System
3.17 Pathfinder on Discrete System
3.18 Back Propagation on Fusion System
3.19 First Kernel of Back Propagation on Fusion System
3.20 Second Kernel of Back Propagation on Fusion System
3.21 Gaussian in Simulation
3.22 Smaller Gaussian in Simulation
3.23 Back Propagation in Simulation
3.24 Nearest Neighbour in Simulation

4.1 Stack of a Work-item
4.2 Work-group Execution With and Without Barriers
4.3 Runtime of Our Runtime and AMD Runtime with Synthetic Benchmarks
4.4 Ratio of Our Runtime vs AMD Runtime for APPSDK Benchmarks

5.1 Vector's 1-dimensional NDRange is partitioned into 4 subranges
5.2 The read meta-function is called for buffer a in subrange 1 of vector
5.3 Speedup of distributed benchmarks using DistCL
5.4 DistCL and SnuCL speedups
5.5 Breakdown of runtime
5.6 HotSpot with various pyramid heights

Chapter 1

Introduction

Heterogeneous devices have moved into the mainstream. This shift occurred because such devices offer improved performance and energy efficiency [30] over homogeneous platforms. Recently, Intel [35] and AMD [3] have released heterogeneous processors that contain both a CPU and GPU on the same chip. Like their discrete counterparts, the CPUs and GPUs on heterogeneous chips can be programmed using OpenCL [21], an industry-standard API and programming model used for running computations on modern multi-core devices. For certain algorithms, GPUs can provide significant speedups compared to CPUs [6] [38]. OpenCL allows a programmer to harness the parallelism provided by a single device; however, despite OpenCL's data-parallel nature, it does not directly support distributing a task across multiple devices. In fact, AMD's programming guide [3] brings up several issues, including synchronization and partitioning, that programmers must tackle themselves in order to write a program that efficiently executes across multiple devices. Additionally, most algorithms are simpler to write with only one device in mind, and require extra care to run properly on multiple devices. These issues make distributed OpenCL execution difficult. To address these issues, this work introduces three OpenCL frameworks that distribute the execution of an OpenCL kernel across multiple devices.

1.1 Contributions

This thesis contributes two novel OpenCL implementations that distribute execution across two devices that can access the same memory. The first, DualCL, runs on existing hardware and takes advantage of the memory-sharing capabilities present on current heterogeneous systems. DualCL makes the CPU and GPU appear as a single device by distributing OpenCL kernels between those devices automatically, without any programmer intervention. DualCL runs on systems with discrete graphics cards as well as systems that contain the AMD Fusion APU [10], a processor that contains both a CPU and GPU. Thus, DualCL is able to measure what benefit the physical integration of the Fusion APU yields for executing kernels across both devices. The results generated with DualCL suggest that both single-chip and discrete hardware suffer from memory placement tradeoffs and high GPU invocation overheads that make efficient, automatic, distributed execution difficult. The second implementation, m2sOpenCL, runs in a simulator called Multi2Sim [43], taking advantage of the simulated hardware’s new coherence protocol and tightly-integrated CPU and GPU devices.


For this work, we used Multi2Sim to simulate an x86 CPU and a 'Southern Islands'-architecture GPU. M2sOpenCL is Multi2Sim's built-in OpenCL implementation and allows for the execution of OpenCL kernels on individual devices, as well as on a virtual device that represents the combination of the Southern Islands and x86 devices. M2sOpenCL was written in collaboration with Multi2Sim's developers. To execute an OpenCL program across CPU and GPU devices, both DualCL and m2sOpenCL must use partitioning algorithms that determine how to divide the data-parallel task and schedule it across the devices. Using DualCL, we introduce and evaluate four partitioning algorithms that dynamically divide a kernel without any a-priori knowledge. To examine how tighter CPU and GPU integration can enhance partitioning algorithms, we run the same partitioning algorithms on m2sOpenCL as well. The results generated with Multi2Sim suggest that tighter levels of integration reduce runtime. A third implementation, DistCL, distributes work between multiple GPUs that are connected together in a cluster. With clusters, each device has its own memory, and data must be transferred over a network to move from GPU to GPU. DistCL tracks data placement and initiates network operations automatically, freeing the programmer from these tasks. To determine the memory access patterns of the kernels it runs, DistCL includes a novel mechanism, called meta-functions, which allows the programmer to represent the memory access patterns of OpenCL kernels. Meta-functions are short C functions that accompany an OpenCL kernel, and relate the index of a thread to the addresses of memory it uses. With meta-functions, DistCL can determine how to place memory so that the threads assigned to a particular GPU have the data they need to run.

1.2 Outline

This thesis is divided into six chapters. After this chapter comes Chapter 2, Background, which describes the portions of OpenCL relevant to this thesis and discusses related work. Then, in Chapter 3, DualCL and m2sOpenCL are discussed, and results regarding shared-memory OpenCL kernel distribution are presented. This thesis also describes the design and validation of an x86 CPU driver in Chapter 4, a critical component of m2sOpenCL. Next, in Chapter 5, DistCL is discussed, and results for distributing OpenCL kernels across a cluster are presented. Finally, Chapter 6 concludes the thesis. There are also two appendices to this thesis. Appendix A describes our attempts to validate the performance of Multi2Sim's x86 CPU. Appendix B is a reproduction of the manual we included in our release of DistCL.

Chapter 2

Background

This thesis uses OpenCL for heterogeneous computing. This chapter looks at the OpenCL model and how it applies to the works described in this thesis. It also looks at other commercial and academic methods that perform heterogeneous computing across one and many devices.

2.1 OpenCL

OpenCL is a popular framework used for programming multi-core processors such as CPUs and GPUs. Typically, a programmer will use OpenCL to execute part of a program on a GPU. To do this, the programmer writes a function called a kernel in a subset of C99 (the version of the C programming language published in 1999) that includes extensions for parallelism. This kernel is compiled at runtime for the devices the programmer wishes to target. The programmer also writes a conventional program that makes calls to an OpenCL runtime. The runtime allows the program to submit the kernel to a compute device (such as a GPU) on the computer. A program that interacts with OpenCL is called a host program because it runs on the host computer. When the host program launches a kernel on the compute device, it typically requests the creation of thousands of threads. Each thread runs its own instance of the kernel function, but has an ID unique within the kernel invocation. The way these threads are organized and executed is part of the OpenCL standard. We will refer to this organization as the device model. To pass data between the host and devices, OpenCL uses the concept of buffers. The host program allocates, writes to, and reads from buffers using OpenCL calls. This allows the program to transfer data to and from kernels, which execute on the devices. Host programs cannot necessarily directly access device memory, because many compute devices have their own memory systems. Buffers allow OpenCL to marshal data between the host and devices.
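As a concrete illustration (not one of the thesis's benchmarks), the following OpenCL C kernel adds two vectors; the host program compiles this source at runtime and enqueues it on a device, creating one thread per element:

    /* vadd.cl -- illustrative kernel: each work-item (thread) adds one element. */
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c)
    {
        size_t i = get_global_id(0);   /* this work-item's unique global ID */
        c[i] = a[i] + b[i];
    }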

2.1.1 Device Model

The threads of an OpenCL kernel invocation execute in an abstract grid called an NDRange. An NDRange can be one-dimensional (linear), two-dimensional (rectangular), or three-dimensional (rectangular prism-shaped). An n-dimensional NDRange is specified by n positive integers, each being the NDRange's size in one dimension. The size and dimensionality of the NDRange is decided by the host program when it enqueues the kernel to a device. At the finest granularity, this NDRange contains unit-sized work-items. These work-items are the threads of an OpenCL kernel invocation and each run their own instance of the kernel function. Each work-item has a unique n-dimensional global ID which gives its coordinates in the NDRange. Similarly, the size of the NDRange is the invocation's global size. The NDRange is also divided up into equally-sized n-dimensional regions called work-groups. Each work-item belongs to exactly one work-group, and has an n-dimensional local ID based on its position in the work-group. Similarly, the size of the work-groups is the invocation's local size. Figure 2.1 shows an example of a two-dimensional NDRange with N × M work-groups and n × m work-items per work-group.

Figure 2.1: Two Dimensional NDRange with work-items and work-groups
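For example, a host program could launch a kernel over the two-dimensional NDRange of Figure 2.1 as sketched below (an illustrative snippet; queue, kernel, N, M, n, and m are assumed to be set up elsewhere):

    /* Enqueue a 2-D NDRange of (N*n) x (M*m) work-items in n x m work-groups. */
    size_t global_size[2] = { N * n, M * m };   /* global size per dimension */
    size_t local_size[2]  = { n, m };           /* work-group (local) size   */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,             /* dimensionality     */
                                        NULL,          /* no global offset   */
                                        global_size, local_size,
                                        0, NULL, NULL);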

Device Structure

Each core of a compute device is called a compute-unit. Particularly in GPUs, compute-units contain multiple processing-elements, where a processing element executes a single thread. Figure 2.2 shows OpenCL's model for a compute device.

Figure 2.2: OpenCL Model of Compute Device

The compute-units of a compute-device are assigned work at the work-group granularity and compute-units execute work-groups one at a time. Work-groups are further divided into fixed-sized collections of work-items called wavefronts. At any one time, only work-items from a single wavefront will be executing on the processing elements of the compute unit. Thus, the wavefront size is always a multiple of the number of processing elements per compute unit. GPU hardware interleaves the execution of work-items on processing elements to hide memory access latency. For instance, the GPU hardware used in this thesis interleaves the execution of 4 work-items on the same processing element. Despite this hardware interleaving, from a programming point of view, all the work-items of a wavefront are executing simultaneously, and a compute unit executes a work-group by interleaving the execution of its wavefronts.

Work-Item Functions

Work-items can retrieve information about the geometry of the kernel invocation they are a part of (the geometry of the NDRange), and their particular place in it. Table 2.1 shows the work-item functions that OpenCL supports, and what they return. Most functions in this table have a parameter - the dimension index. As NDRanges may have up to three dimensions, the value of this parameter (either 0, 1, or 2) selects the dimension to consider. Work-items can use the values returned by those functions to coordinate how they perform computations.

Table 2.1: OpenCL Work-Item Functions

Function                                   Value
uint get_work_dim()                        Number of dimensions in the NDRange
size_t get_global_size(uint dimidx)        Global size in a particular dimension
size_t get_global_id(uint dimidx)          Position in the NDRange in a particular dimension
size_t get_local_size(uint dimidx)         Local size in a particular dimension
size_t get_local_id(uint dimidx)           Position in the work-group in a particular dimension
size_t get_num_groups(uint dimidx)         Number of groups in the NDRange in a particular dimension
size_t get_group_id(uint dimidx)           Position of the group in a particular dimension
size_t get_global_offset(uint dimidx)      Value of an offset added to the global IDs in a particular dimension

Types of Memory

OpenCL supports four types of device memory: private memory, local memory, global memory and constant memory. Private memory is private to an individual work-item, and generally consists of the work-item’s copy of the parameters and any automatic variables. Similarly, each work-group has its own instance of local memory. Local memory can be allocated at runtime and passed to the kernel as a parameter, or allocated statically at compile-time. A separate instance of local memory exists for each running work-group; it is impossible for a work-item in one work-group to access the local memory of another work-group. On GPUs, local and private data are typically located in small, fast on-chip memories. Global memory is accessible by all work-groups. It is the only mutable memory that persists between kernel invocations and hence, is the only memory that can be read from and written to by the host. The host allocates regions of global memory as buffers, and passes them to kernels as parameters. Constant memory is like global memory, but can be cached more aggressively and can only be modified from the host.
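The address-space qualifiers below show how a kernel names these memory types (an illustrative kernel, not one from the benchmark suites):

    __constant float SCALE = 0.5f;                      /* constant memory: read-only, set by the host */

    __kernel void scale_copy(__global const float *in,  /* global memory: host-created buffers         */
                             __global float *out,
                             __local float *scratch)    /* local memory: one instance per work-group   */
    {
        size_t gid = get_global_id(0);
        float v = in[gid];                              /* private memory: per-work-item variable      */
        scratch[get_local_id(0)] = v;                   /* visible only within this work-group         */
        out[gid] = SCALE * v;
    }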

2.1.2 Memory Coherence and Consistency

OpenCL does not require coherence between work-groups. In fact, devices can schedule work-groups in any order, so a work-group is not guaranteed to see the effects of another work-group’s execution. In contrast, the work-items within a work-group are guaranteed to be able to synchronize their execution, using the barrier() function, which causes a work-group wide barrier that all work-items must reach before any go past. Intra work-group synchronization is possible because a work-group executes on a single compute-unit. The lack of coherence between work-groups of the same kernel invocation is the main enabling factor that allows for distribution of a kernel invocation between multiple devices. Hence, when a kernel invocation is distributed, the distribution occurs at the work-group level of granularity.
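The sketch below shows the intra-work-group synchronization that barrier() provides: each work-group reduces its own elements in local memory, and nothing is assumed about ordering between work-groups, which is the property that lets whole work-groups be assigned to different devices (illustrative code; it assumes a power-of-two local size):

    __kernel void group_sum(__global const float *in,
                            __global float *per_group_sum,
                            __local float *tmp)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);          /* assumed to be a power of two */

        tmp[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);            /* every work-item has loaded its element */

        for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                tmp[lid] += tmp[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);        /* wait for this reduction step to finish */
        }
        if (lid == 0)
            per_group_sum[get_group_id(0)] = tmp[0];
    }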

2.1.3 Host Model

The host program, which runs in a conventional computing environment, interacts with OpenCL devices through a host interface comprised of C functions with names that start with cl. This interface is provided by a library which we refer to as an OpenCL Runtime or OpenCL Implementation.

Table 2.2: Zero-Copy Memory Creation Flags

Flag                                Effect
CL_MEM_ALLOC_HOST_PTR               Place memory on host.
CL_MEM_USE_PERSISTENT_MEM_AMD       Place memory on device.
CL_MEM_USE_HOST_PTR                 Use pre-existing memory (the host passes a pointer to this memory as a parameter to clCreateBuffer).

Different OpenCL implementations may expose devices with differing capabilities, but they are all accessible through the same host interface. This interface allows the host program to enumerate compute devices, place data onto these devices, and execute kernels on the devices. The host interface returns objects such as devices, contexts, programs, kernels, command queues, and buffers to the host program. These objects allow the host program to manipulate OpenCL resources. Device objects represent an OpenCL device, like a CPU or GPU. Contexts are organizational objects. Objects (like devices, kernels, and buffers) that belong to the same context can interact, whereas objects from different contexts cannot. Programs contain OpenCL source code which can be compiled for various devices. Kernel objects are compiled OpenCL kernels, acquired from compiled programs. Kernels can have their parameters set, and they can be scheduled onto devices through command queues. Buffers represent regions of device-accessible memory. They can be passed to kernels as parameters, and can be read from and written to by the host program.
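A minimal host-side sketch of how these objects fit together is shown below (error handling omitted; the function name run_vadd, the kernel name "vadd", and the parameters src, bytes, and global are illustrative assumptions):

    #include <CL/cl.h>

    /* Illustrative: set up the OpenCL object hierarchy and run one kernel. */
    static void run_vadd(const char *src, size_t bytes, size_t global)
    {
        cl_platform_id plat;
        clGetPlatformIDs(1, &plat, NULL);
        cl_device_id dev;
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Build a program from source and extract a kernel object from it. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", NULL);

        /* Create a buffer, set it as an argument, and enqueue the kernel.
         * (Remaining kernel arguments would be set the same way.)          */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clFinish(q);
    }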

Buffers

To organize device memory, OpenCL uses the concept of buffers. These buffers appear to OpenCL kernels as arrays that can be accessed through C pointers. With OpenCL, the host program can access buffers in two ways. The first way uses clEnqueueReadBuffer and clEnqueueWriteBuffer. These functions allow the host program to read or write a subset of an OpenCL buffer by transferring data between that part of the buffer and a normal host-allocated array. Second, OpenCL also allows the host program to 'map' a subset of a device-resident buffer into the host's address space, and 'unmap' the buffer when the host is done accessing it. This method uses the clEnqueueMapBuffer and clEnqueueUnmapMemObject functions. By default, buffers are allocated on the same device that accesses them; however, the host program can control the placement of buffers with the flags it passes to clCreateBuffer. The buffer can be placed either in host memory or on the device. If the CPU is the device, there is no difference between host and device memory; however, GPUs typically have their own memory, separate from the host's. The placement of memory is important because it affects the performance with which it can be accessed by the CPU and GPU. When the host reads buffers through mapping, the OpenCL implementation can make the buffer available in two ways. The implementation can either copy the device memory into the host program's address space, or it can expose the memory to the host for direct access. This second approach is called 'zero-copy' access, because it does not involve copies. In order to enable zero-copy buffers for the GPU, the host program must pass flags into the clCreateBuffer call used to create OpenCL buffers. Table 2.2 shows the flags used to create zero-copy buffers. Depending on the flag used, the zero-copy buffer will be placed either in host or device memory. On systems with discrete GPUs, CPU access to device memory happens across the PCIe bus. Zero-copy memory is critical for fused execution because DualCL requires that both devices access the same physical memory during kernel execution.
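The following sketch shows zero-copy access through mapping, using the first flag from Table 2.2 (illustrative; ctx and q are an existing context and queue, bytes is the buffer size, and error handling is omitted):

    /* Allocate the buffer in host memory so that mapping it is zero-copy. */
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);

    /* Map the whole buffer into the host's address space and write it directly. */
    float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < bytes / sizeof(float); i++)
        p[i] = 0.0f;

    /* Unmap before letting kernels use the buffer again. */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);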

2.2 Other Heterogeneous Programming Models

Besides OpenCL, there are other programming models for targeting GPU hardware.

2.2.1 CUDA

Compute Unified Device Architecture (CUDA) [32], which pre-dates OpenCL, is a programming model developed by NVIDIA for programming their GPUs. CUDA follows a programming model very similar to that of OpenCL, with an essentially identical device programming model. CUDA and OpenCL differ the most in their host interfaces. While both CUDA and OpenCL have a standard C interface that can be linked to by any compiler, CUDA additionally provides a compiler that supports a more convenient, special syntax for launching kernels.

2.2.2 BOLT

BOLT [4] is a library released by AMD for heterogeneous computing. It allows algorithms that follow common high-order patterns (like reductions, or element-wise operations) to be easily expressed using C++ templates. BOLT hides device selection, scheduling, and memory placement from the programmer. This allows programmers to target devices without needing to learn a new programming model. In addition, BOLT programs contain less boilerplate code than OpenCL programs; however, BOLT is not as flexible as OpenCL.

2.3 The AMD Fusion APU

The AMD Fusion APU [11] is a processor that was released by AMD in 2011. It is called the Fusion APU (Accelerated Processing Unit) because it contains both a multi-core x86 CPU and a programmable GPU on die. Despite having both devices on-die, depending on how memory is allocated, it still favours accesses either by the CPU or GPU [10]. Fusion still partitions memory into three regions: GPU memory, CPU memory, and ‘uncached’ memory, which is very similar to GPU memory. CPU reads to GPU memory are still slow because they cannot be accelerated by the CPU’s cache hierarchy and the Fusion only supports one pending read to GPU memory at a time. Similarly, GPU reads to CPU memory have lower peak bandwidth than GPU reads to GPU memory, because they must go through the CPU’s cache hierarchy, instead of only going through the GPU’s memory system which is specially tuned for memory throughput [10]. Note that while AMD renamed its heterogeneous processor line HSA (Heterogeneous System Architecture), the model used for this research is called Fusion, so we continue to use the original name.

2.4 Multi2Sim

Multi2Sim [43] is a cycle-accurate simulator for several CPU and GPU architectures. This work uses Multi2Sim's ability to simulate multicore x86 CPUs and AMD Southern Islands GPUs. While Multi2Sim supports x86 simulation, it does not, by default, model a particular type of x86 CPU. Instead, it allows the user to specify detailed configuration files that define everything from the core count and operating frequency of the CPU to the number of various functional units. For the x86 CPU, Multi2Sim runs 32-bit binaries directly, emulating system calls, and forwarding file I/O operations to the host. Multi2Sim also directly implements thread scheduling because it does not simulate a guest operating system. Multi2Sim uses additional system calls as a custom interface for interacting with the GPU hardware that it simulates. Thus, the GPU OpenCL driver interacts with the hardware through these system calls, whereas the CPU OpenCL driver runs completely on the x86 CPU and is fully compatible with real hardware. Multi2Sim is particularly good for exploring execution across multiple devices because it implements a new protocol called NMOESI [42] that allows for the CPU and GPU to efficiently use the same cache hierarchy.

2.5 Related Work

We divide the related work into four categories:

1. frameworks that distribute kernels across multiple devices but use their own programming model,

2. frameworks that distribute kernels across multiple devices and use OpenCL,

3. frameworks that make remote devices seem like they are local,

4. characterization and simulation of heterogeneous systems.

2.5.1 Distribution with Custom Interfaces

HASS [39] takes a signature-based approach for determining which core to run a task on in a single-ISA heterogeneous system (which contains faster and slower cores of the same type). In HASS's case, these signatures are based on the frequency of memory accesses generated by the tasks, and the paper argues why this is a good statistic to measure; however, the signature approach could be taken with other statistics as well. Qilin [30] introduces a new programming model that allows heterogeneous execution across a CPU and GPU system. It uses profiling to measure the performance of tasks as they execute on cores and determine how to divide up work. Both Qilin and our work can distribute a task between a CPU and GPU, but Qilin also requires its own programming model, which allows it to track memory. CUDASA [41] extends the CUDA [32] programming model to include network and bus levels on top of the pre-existing kernel, block and thread levels. Although CUDASA does provide a mechanism for the programmer to distribute a kernel between computers, it requires them to make explicit function calls to move memory between computers. The kernel code must also be modified in order to be distributed. Some of these works also focus on particular classes of problems. One work [33] considers splitting 2D FFT operations between the CPU and GPU of a heterogeneous computer. 2D FFT operations contain many steps that can be done in parallel, so they are a good candidate for heterogeneous execution. In the FFT work, the performance of each device is profiled for each step of the 2D FFT and for various sizes of input. Because this work only does 2D FFTs, this profiling only needs to be done once. Another work [40] distributes linear algebra operations across GPU clusters and heterogeneous processors in both shared-memory and distributed environments. It works by dividing the input matrices into blocks, and assigning them to different nodes. Within each node, the blocks are further divided between devices. A graph is dynamically built based on knowledge of the algorithm to detect any dependencies and transfer data between nodes when necessary. Its authors claim near optimality with regards to memory transfers. The linear algebra runtime can be seen as a domain-specific, optimized version of DistCL, with meta-functions and partitioning algorithms particularly suited for linear algebra.

2.5.2 Distribution of OpenCL Kernels

Work by Kim et al. [23] transparently distributes OpenCL kernels between multiple GPUs, with distinct memories, on the same PCIe bus. Instead of using meta-functions, it uses compiler analysis and sample runs of certain work-items to determine the memory accesses that will be performed by a subset of a kernel. Since the kernel is only divided among local GPUs there is no need to worry about network delays or manage multiple processes. Work [26] concurrent with this thesis is similar to the work by Kim et al. in how it distributes a task across the devices of a heterogeneous computer. It uses a compiler to create a distributable version of an OpenCL kernel, analysing the data access patterns of the kernel, to determine whether the kernel accesses buffers in a way that allows distribution. Artificial intelligence techniques are used to try to partition OpenCL kernels on a system by running training sets that characterize the performance of the system and kernels that are to be run on it. DualCL and Multi2Sim's OpenCL implementation do not require such detailed compiler analysis because they use devices that can access a single memory. In the future, DistCL could use similar compiler techniques to eliminate the need for meta-functions, enabling completely automatic distribution across clusters, as well as multi-device computers. Also, this thesis focuses on kernel partitioning based on runtime data only. A similar work [20] focuses on predicting OpenCL kernel performance using compiler analysis to detect features that affect performance, and determine how to best divide the kernel between a CPU and GPU with known performance characteristics. The compiler analysis uses no runtime information.

2.5.3 Making Remote Devices Appear Local

SnuCL [24] is a framework that distributes OpenCL kernels across a cluster. SnuCL can create the illusion that all the OpenCL devices in a cluster belong to a local context, and can automatically copy buffers between nodes based on the programmer's placement of kernels. SnuCL also executes on CPUs as well as GPUs. To do this, SnuCL includes its own CPU OpenCL implementation because the NVIDIA platform it runs on does not support CPU execution. When two kernels on two different devices read the same buffer, SnuCL replicates the buffer. When two kernels write to the same buffer, SnuCL serializes the kernels. Because of this, what would be one buffer for a single-device application is typically many buffers for an application that uses SnuCL. This pattern of dividing buffers into regions is also seen in DistCL, but with DistCL's approach, buffers are automatically divided based on a kernel's memory access pattern. Other prior work also proposes to dispatch work to remote GPUs in a cluster environment as if they were local. rCUDA [16] provides this functionality for NVIDIA devices. MGP [7] also provides OpenMP-style directives for parallel dispatch of many OpenCL kernels across a cluster. Hybrid OpenCL [5], dOpenCL [22], and Distributed OpenCL [17] provide this functionality cross vendor and cross platform. In clOpenCL [1] each remote node is represented locally as a separate platform. All these works, as well as SnuCL, require the programmer to create a separate set of buffers and kernel invocations for each device and rely on explicit commands from the host program for communication between peers. LibWater [19] is also similar to the above works, but takes advantage of task dependencies inferred using the OpenCL event model to automatically schedule tasks efficiently. These works differ from our contributions because we focus on distributing a single kernel, not just marshalling kernels between devices.

2.5.4 Characterizing and Simulating Heterogeneous Execution

Two works [45] [8] examine the effects of memory transfer overheads between the CPU and GPU. Both conclude that sometimes, and unsurprisingly, the cost of transfer overheads outweighs the benefit of doing computations on a faster device. Delorme [14] describes a CPU and GPU radix sort algorithm for the Fusion APU. Versions of radix sort that use disjoint device memories and share memories are developed and compared. The technique that Delorme uses to allow the CPU and GPU to share the same OpenCL buffer was used as part of DualCL's buffer allocation scheme. Delorme only used OpenCL for the GPU, using standard Windows threads for the portion of the sort that runs on the CPU. There are also other heterogeneous system simulators. FusionSim [48] is a simulator that combines GPGPU-Sim [6] (which simulates NVIDIA GPUs) and PLTSim [47] (which simulates an x86 CPU) into a heterogeneous system simulator. FusionSim does not support OpenCL, nor does it currently have a runtime capable of automatically scheduling tasks across the CPU and GPU simultaneously. Another simulator-based work [46] looks at a theoretical fused architecture, where the CPU and GPU share the same last-level cache. To do this, it combines MARSSx86 [34] with GPGPU-Sim. The combined simulator also modelled one way in which memory could be shared between the CPU and GPU, using the Linux ioremap function. The simulator revealed that the CPU could be used to prefetch data into the cache before it is used by the GPU, though the work did not investigate dividing work between the CPU and GPU. Two works [44] [37] combine GPGPU-Sim with gem5 [9] to create a simulator of single-chip heterogeneous CPU-GPU systems. That work maps GPU memory to a part of host memory which is not kept coherent with the CPU. Its authors use this simulator to evaluate running OpenCL programs across both cores, while imposing whole-system power constraints by using DVFS (Dynamic Voltage and Frequency Scaling) and DCS (Dynamic Core Scaling) to control the power consumption of the cores. They create an algorithm that allocates work to GPU and CPU while controlling the number of active cores, the ratio of work assigned to the CPU and GPU, and the frequency and voltage of those cores. The work evaluates this algorithm using the simulator and an AMD Trinity APU, which contains both a CPU and GPU. As kernels are executed, their algorithm converges to a close-to-optimal configuration. That work is similar to ours in that it divides work between the CPU and GPU, and it includes experiments from simulations and a real heterogeneous system. Conversely, that work focuses on power and core performance, whereas our work focuses more on memory and invocation overheads. Further, the algorithm presented in that work only makes one partitioning decision per kernel invocation, and takes several kernel invocations to converge to an optimal solution, whereas our algorithms try to converge during a single kernel invocation.

Chapter 3

Co-operative OpenCL Execution

This chapter takes the approach of dynamically dividing OpenCL kernels between the CPU and GPU of a heterogeneous chip without any beforehand knowledge regarding the kernels or the chip. This approach allows us to evaluate how well-suited a heterogeneous system is to execution across both of its devices while exposing the strengths and weaknesses of the system itself. In particular, this chapter focuses on distributed execution where both devices can access each other's memory. We term this type of distribution co-operative execution. Co-operative execution has two major challenges:

• Allocating memory so that it is effectively used by both devices. This is particularly difficult in current hardware, where each device prefers to access its own memory.

• Scheduling work efficiently. This is difficult in current hardware where GPU invocation overheads are significant compared to kernel run-times. Execution overheads are the single most important factor for dynamic execution, as they affect not only overall system performance, but the performance estimates some of the scheduling algorithms make as well.

This chapter introduces two new OpenCL implementations: DualCL and m2sOpenCL. DualCL works on real hardware, and m2sOpenCL runs in the Multi2Sim simulator, but they both distribute OpenCL kernels between a CPU and GPU that can access the same physical memory - and they both cause these two devices to appear as a single device to their client programs. In order to operate on real hardware, DualCL uses a pre-existing OpenCL implementation to target the CPU and GPU devices. It allocates memory so that it stays accessible to both devices simultaneously, and transforms the source code of OpenCL kernels so that they remain correct when their NDRanges are broken up into smaller parts. M2sOpenCL uses Multi2Sim’s unified memory model to allocate memory. M2sOpenCL itself contains drivers for the CPU and GPU devices it supports, so it does not require any kernel transformations as it can control kernel execution directly. M2sOpenCL is the OpenCL implementation for Multi2Sim, and supports both single-device and co-operative kernel execution. In this chapter, experiments were performed on three systems:

1. A system with a discrete GPU, where CPU-GPU communication happens over the PCIe bus and each device has a physically separate memory.

2. A system with an AMD Fusion APU, where communication is intra-chip and one physical memory is divided into separate CPU and GPU regions.


Table 3.1: Contributions to Multi2Sim’s Co-operative Execution

Component                                      Contributor
x86 OpenCL Driver                              Author
Southern Islands GPU Driver                    Multi2Sim Developers
Co-operative Execution Driver                  Author
NMOESI Protocol                                Multi2Sim Developers
Rodinia Benchmark Analysis                     Author
Some Co-operative Execution Optimizations      Multi2Sim Developers
Non-Device Portions of OpenCL Runtime          Author
CPU and GPU Simulation                         Multi2Sim Developers

3. A simulated system where communication is intra-chip and memory is fully unified.

With these systems, this chapter makes the following findings:

• Co-operative performance on single-chip and discrete hardware is similar despite the tighter integration on chip.

• Co-operative scheduling algorithms fail to use a heterogeneous system's full compute throughput because of compromises made with data placement, and high GPU invocation overhead, but show a speedup when these compromises are factored out.

We did not write all of m2sOpenCL. In particular, the portion of m2sOpenCL that targets the simulated GPU was contributed by Multi2Sim's developers. Our contributions include the runtime's overall design and the CPU runtime. Table 3.1 clarifies what each party did.

3.1 Outline

First, Section 3.2 describes how the work of a kernel is represented and divided by DualCL and m2sOpenCL. Next, Section 3.3 describes the architecture of DualCL: how DualCL allocates buffers so that both the CPU and GPU can use them; modifies the kernels to run correctly across multiple devices; and schedules work onto the CPU and GPU. Then, Section 3.4 describes m2sOpenCL, focusing on the co-operative device driver. After that, Section 3.5 introduces the partitioning algorithms that DualCL and m2sOpenCL use to divide work between the devices. Then, Section 3.6 introduces the benchmarks and describes the experimental study, with DualCL running on discrete and Fusion hardware and m2sOpenCL running on Multi2Sim. The effect of memory placement is explored. In Section 3.7, the overall performance of the partitioning algorithms on both real-hardware systems is shown. For the Fusion system, the effect of overheads from the CPU invocation, GPU invocation, and DualCL's operation is examined. Then, the performance of specific benchmarks is looked at in detail. Finally, the performance of specific benchmarks on the simulated system is investigated as well.

3.2 Dividing Kernel Execution

Conventionally, the entire NDRange of a kernel invocation is executed on a single compute device. This situation is illustrated in Figure 3.1a. With co-operative execution, more than one device is used to execute the same kernel invocation. This is done by dividing up the work-groups of the invocation between the compute devices, as illustrated in Figure 3.1b. Note that in both figures, there is only one memory.

Figure 3.1: Co-operative and Regular Kernel Execution ((a) Regular OpenCL Kernel Execution; (b) Co-operative OpenCL Kernel Execution)

3.3 DualCL

DualCL is an OpenCL implementation that combines two OpenCL devices, which are provided by an underlying OpenCL implementation, into a single OpenCL device. To do this, DualCL uses memory that both devices can access. Because the underlying OpenCL implementation cannot be made aware that a kernel is being scheduled across multiple devices, DualCL must modify the source of the kernel so that it executes correctly. DualCL works on the AMD Fusion APU, as well as systems with discrete AMD GPUs. A primary goal of DualCL is ease of use. For that reason, DualCL runs completely unmodified OpenCL kernels and host programs. While creating two versions of each kernel (one for the CPU and one for the GPU) may result in increased performance, it is also an extra burden on the programmer. We do not claim that DualCL approaches the performance of hand tuning. Instead, we assert that, for kernels with relatively few global accesses, DualCL sees a speedup with no effort from the programmer. Certain simple code transformations are sometimes applied to GPU-centric kernels to make them execute more efficiently on the CPU. For instance, a for-loop can be added around the kernel body, causing each CPU thread to do the work of many GPU threads. Though DualCL does perform some basic source-code transformations to ensure correctness, it does not apply this technique. Indeed, automatic compiler techniques such as work-item coalescing [27] already perform similar transformations, so DualCL leaves it to the underlying OpenCL implementation to run kernels on the CPU efficiently. DualCL takes a dynamic approach to scheduling kernel work, making several scheduling decisions per kernel. While many current GPU workloads can be served well with static scheduling approaches, this trend is starting to change. New features like dynamic parallelism [31] already allow a kernel to launch arbitrary amounts of additional work as a part of its computation. Further, DualCL is not tied to a particular architecture. Any system that supports zero-copy memory and has two OpenCL devices should be able to use DualCL with minimal modification. Though this means that DualCL does not take advantage of any architecture-specific optimizations, it allows DualCL to remain a viable research tool as heterogeneous hardware evolves.

3.3.1 Memory Allocation

DualCL must allocate global memory buffers that are accessible from the CPU and GPU simultaneously. There are two types of memory that DualCL can use for buffers: CPU memory (or host memory), the memory used by the host program; and GPU memory (or device memory). Note that while both the CPU and GPU are OpenCL devices, device memory always refers to the GPU's memory in this chapter. To allocate a buffer, DualCL first allocates a buffer on the GPU (using either host or device memory) and then maps the buffer to the host's address space. Because the CPU device and host program exist in the same process, any memory accessible by the host is also accessible by the CPU device. This allows DualCL to create a buffer object for the CPU that refers to the same memory as the GPU buffer object. Though the OpenCL specification does not define the behaviour of memory that is simultaneously-writeable from more than one device, using a technique inspired by Delorme [14], DualCL always allocates zero-copy memory, ensuring that only one copy of the buffer is ever created by the runtime. By using zero-copy memory, DualCL relies on the hardware coherence and consistency mechanisms for correctness.
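A minimal sketch of this allocation scheme is shown below (assumptions: both devices come from the same underlying OpenCL platform, the GPU and CPU devices have their own contexts and queues, error handling is omitted, and the function name is illustrative):

    #include <CL/cl.h>

    /* Create one region of host memory visible to both the GPU and CPU devices. */
    static cl_mem alloc_shared(cl_context gpu_ctx, cl_command_queue gpu_q,
                               cl_context cpu_ctx, size_t size, cl_mem *cpu_buf)
    {
        cl_int err;

        /* 1. Zero-copy GPU buffer placed in host memory. */
        cl_mem gpu_buf = clCreateBuffer(gpu_ctx,
                                        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                        size, NULL, &err);

        /* 2. Map it into the host address space; the CPU device runs in the
         *    same process as the host, so this pointer is usable by it too. */
        void *host_ptr = clEnqueueMapBuffer(gpu_q, gpu_buf, CL_TRUE,
                                            CL_MAP_READ | CL_MAP_WRITE,
                                            0, size, 0, NULL, NULL, &err);

        /* 3. Wrap the same memory in a buffer object for the CPU device. */
        *cpu_buf = clCreateBuffer(cpu_ctx,
                                  CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                  size, host_ptr, &err);
        return gpu_buf;
    }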

3.3.2 Partitioning Kernels

Work-items can use work-item functions, such as get_global_size() and get_global_id(), to get geometric information about their NDRange. When DualCL divides the NDRange of a kernel up into smaller NDRanges, it must ensure that these work-item functions continue to refer to the original NDRange; otherwise, the kernel will not execute correctly. In other words, if an NDRange is divided into many NDRanges for co-operative execution, the geometric information provided to the kernel should not reflect the division and the work-items should continue to think they are part of one big kernel. Figure 3.2b shows what happens if the NDRange in Figure 3.2a is divided in half naively. In this case, the size of the NDRange and the IDs of some of the work-items no longer correspond to the original NDRange. Particularly, the global size is wrong (8 instead of the original 16), and the work-item and work-group IDs are wrong for the work-items in the second NDRange. Figure 3.2c shows how they must be adjusted to ensure correct execution. To perform these adjustments, DualCL modifies kernels to accept three additional parameters, which contain information about the original NDRange. In addition, DualCL replaces the calls to work-item functions with versions that use the parameters in conjunction with geometric information returned by the underlying (unadjusted) work-item functions. The parameters, and the information they contain, are shown in Table 3.2 and the new functions are shown in Table 3.3. Note that get_global_id() does not need to be modified in this way because DualCL uses a pre-existing OpenCL feature to achieve the same effect. DistCL, which is discussed in Chapter 5, also uses this technique for adjusting the values of work-item functions when it splits up NDRanges.
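For illustration, here is what the transformation might look like on a small kernel; the added parameters and the substitutions come from Tables 3.2 and 3.3, while the kernel itself is hypothetical:

    /* Original kernel: uses get_global_size(), which would report the size of
     * the smaller NDRange if left unmodified. */
    __kernel void wrap_add(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        size_t n = get_global_size(0);
        out[i] = in[i] + in[(i + 1) % n];
    }

    /* After DualCL's rewrite (sketch): three parameters describe the original
     * NDRange and work-item function calls are substituted per Table 3.3.
     * get_global_id() is left unmodified, since DualCL uses an existing OpenCL
     * feature to correct it. */
    __kernel void wrap_add(__global const float *in, __global float *out,
                           ulong4 dualcl_num_groups,
                           ulong4 dualcl_group_start,
                           ulong4 dualcl_offset)
    {
        size_t i = get_global_id(0);
        size_t n = dualcl_num_groups.s0 * get_local_size(0); /* was get_global_size(0) */
        out[i] = in[i] + in[(i + 1) % n];
    }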

3.4 m2sOpenCL

Multi2Sim's OpenCL implementation, m2sOpenCL, consists of two main parts: a device-independent layer that implements the OpenCL interface, and a series of device drivers that expose the capabilities of devices to the device-independent layer. Currently there are three drivers: an x86 CPU driver, a Southern Islands GPU driver, and a co-operative execution driver, which seamlessly divides work between the CPU and GPU drivers. Figure 3.3 shows how the runtime and device drivers of m2sOpenCL interact.

Table 3.2: Parameters Added to Kernel Functions by DualCL

Parameter                       Data
ulong4 dualcl_num_groups        The total number of work-groups, in each dimension, in the original NDRange
ulong4 dualcl_group_start       The starting point of this smaller NDRange in work-groups, in each dimension
ulong4 dualcl_offset            The starting point of this smaller NDRange in work-items, in each dimension

Table 3.3: Substitutions Performed by DualCL

Work-Item Function           Replacement (using parameters above)
get_global_size(x)           (dualcl_num_groups.sx * get_local_size(x))
get_group_id(x)              (dualcl_group_start.sx + get_group_id(x))
get_num_groups(x)            (dualcl_num_groups.sx)
get_global_offset(x)         (dualcl_offset.sx)

(a) An NDRange with 4 Work-groups

(b) The NDRange has been divided into two separate NDRanges naively

(c) The NDRange has been divided into two NDRanges and the geometric information has been adjusted

Figure 3.2: Dividing an NDRange into Smaller NDRanges

Figure 3.3: Structure of Multi2Sim OpenCL

3.5 Partitioning Algorithms

Partitioning algorithms tell DualCL and m2sOpenCL how to divide the NDRange between the devices. To run the partitioning algorithms, each device has a thread associated with it that asks for work, and submits that work to its device. The threads do not ask for more work until their device has completed the work it has already been assigned, nor do threads ever wait to ask for more work once their device becomes ready. To ensure correctness, partitioning algorithms never break-apart work-groups. Also, for simplicity, when partitioning multidimensional NDRanges, our partitioning algorithms partition only along the dimension with the most work-groups. Though the shape of the parts a partitioning algorithm creates can affect the performance of the kernel, an exploration of this effect is left to future work.

3.5.1 Static Partitioning

The static partitioning algorithm simply divides the NDRange into two parts – one for the CPU and one for the GPU. It can be configured (through an environment variable) to divide the NDRange at any proportion. This allows us to determine the division of work that is optimal for a given kernel and system. For programs that execute more than one kernel, if each kernel has a different optimal point, static partitioning may fail to produce the lowest execution time because our implementation of static partitioning does not vary the proportion between kernels of the same program. This situation is explored in detail in the results section.

3.5.2 Block by Block (block)

Block divides the NDRange into 16 equally-sized blocks; though, due to rounding, sometimes more blocks are created. The AMD OpenCL manual recommends dividing work between two OpenCL devices only if, for that particular work, the slower device is at least 10% as fast as the faster device. For this reason, a block size of a sixteenth of the work was chosen because it is a power of two, and because dividing the work further would only be useful in situations where one of the devices was less than 10% the speed of the other. Initially, each device is assigned a block. Whenever a device completes, that device is assigned another block until there are no more blocks left and execution is complete. The faster a device is, the more often it can ask for blocks; hence, faster devices should execute more of the kernel than slower ones. Block uses a work-queue approach, where pre-determined units of work are dequeued and executed on each device.
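The block partitioner amounts to a simple shared work queue; a schematic (not DualCL's actual code, and all names are illustrative) might look like this:

    #include <pthread.h>
    #include <stddef.h>

    /* One entry per block: a contiguous range of work-groups along the
     * partitioned dimension. */
    typedef struct { size_t first_group, num_groups; } block_t;

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static block_t blocks[32];      /* filled in when the kernel is enqueued */
    static size_t  total_blocks;    /* roughly 16                            */
    static size_t  next_block;

    /* Called by a device's scheduler thread whenever its device is idle.
     * Returns 0 once the NDRange is exhausted. */
    static int block_get_work(block_t *out)
    {
        int have = 0;
        pthread_mutex_lock(&queue_lock);
        if (next_block < total_blocks) {
            *out = blocks[next_block++];
            have = 1;
        }
        pthread_mutex_unlock(&queue_lock);
        return have;
    }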

3.5.3 NUMA (numa)

Numa was based on a paper about loop parallelization for NUMA machines [28]. It attempts to strike a balance between load balancing and preserving locality. From a single work queue, numa makes large assignments at the beginning, to preserve locality, but makes smaller ones as the kernel nears completion, to reduce the risk of one device becoming a straggler. Numa sizes assignments in proportion to the amount of work left divided by the number of devices. In this case, the constant of proportionality is 1/2, so with two devices, it assigns a quarter of the remaining work whenever a device becomes ready.
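In code form, numa's sizing rule is simply the following (a sketch; the function name is illustrative):

    /* Assignment size = (work remaining) / (2 * number of devices), so
     * assignments start large and shrink as the kernel nears completion.
     * With two devices this is one quarter of the remaining work-groups. */
    static size_t numa_assignment(size_t groups_remaining, size_t num_devices)
    {
        size_t chunk = groups_remaining / (2 * num_devices);
        if (chunk == 0)
            chunk = groups_remaining;   /* hand out whatever is left */
        return chunk;
    }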

3.5.4 First-Past-The-Post (first)

First executes 3/16 of the problem on each device. This number was selected empirically based on the results from the static partitioning. Then, whichever device completes first is assigned the rest of the problem. For kernels that strongly favour either the CPU or GPU, first will approximate an optimal solution. For short kernels, invocation overhead can mislead first into picking the wrong device.

3.5.5 Same Latency (same)

Same tries to divide work between the CPU and GPU so that they complete at the same time. Initially, same assigns 1/8 of the work to each device. Same assigns 1/8 of the work instead of block's 1/16 to better amortize the invocation overhead, reducing the effect it has on same's calculations. After the initial assignment, whenever a device needs more work, same uses the performance data it collected to try and divide up the remaining work between the two devices. Same bases this division on the rate at which each device is executing work-groups and an estimation of how far along the other device is in its last assignment. In this sense, same is similar to guided self-scheduling [36], which assigns work to processors so that they all finish at the same time, taking into account the time at which each processor started. If the other device has not completed its initial assignment, same assumes it is on the verge of completion. Because the estimations are updated every time a device completes, same can make a varying number of assignments if the performance estimations change.

3.6 Experimental Study

To investigate co-operative execution, DualCL was used to run benchmarks from the Rodinia 2.3 [12] benchmark suite on two systems: a Fusion system, containing an AMD Fusion APU, and a discrete system, containing a discrete AMD GPU. The specifications of the Fusion and discrete systems are displayed in Table 3.4. In the first experiment, the benchmarks were run using a single device. For consistency, these runs were still performed using DualCL. For each device, two runs were performed: one that used host memory and one that used device memory. Next, using host memory, the benchmarks were run using both devices with the static and dynamic partitioning algorithms. Device memory was not used for any other experiments because it sometimes produced incorrect results and has very poor performance when accessed by the CPU. We believe that the inaccurate results with device memory occur when the CPU reads stale data from its write-combining buffer. Instead of using the cache hierarchy, each CPU core uses its write-combining buffer to write to device memory. Similarly, the GPU has its own write-combining buffers. Write-combining buffers allow writes to adjacent memory locations to be 'combined' into one memory transaction, improving performance. When the CPU reads device memory, it goes through its write-combining buffer as a first step [10]. This is presumably so that the CPU can see writes it has performed even if they haven't been flushed to DRAM. Given this setup, consider a situation where the CPU and GPU write different values to the same location in device memory. If either device immediately reads the data it wrote, it will get the value from its write-combining buffer. In this situation, the CPU and GPU do not have a consistent view of memory. In the bfs benchmark, the host program writes a one-byte flag, invokes a kernel, and then reads the flag to check for a change. For kernel invocations with small numbers of work-groups, it is possible that the CPU core running the host program never performs any work and hence never flushes its write-combining buffer. Thus, when it reads the value of the flag after the kernel, it reads back a stale value. This issue only arises when zero-copy memory is mapped on both the CPU and GPU simultaneously, as is the case with DualCL. We did not investigate the Fusion's 'uncached' CPU memory because it has similar characteristics to GPU memory for CPU reads, saturating at less than 1 GB/s, but with degraded performance for GPU reads [10]. DualCL was written with a single, unified memory in mind, as we view the many types of memory present on current heterogeneous processors as a symptom of an intermediate level of integration between the CPU and GPU. The static partitioning algorithm was run 17 times for each benchmark, varying the distribution of work from all on the CPU to all on the GPU in increments of a 16th of the total work. Using OpenCL's profiling capabilities, information about the invocation overheads incurred with each partitioning algorithm run was collected. In addition, DualCL's overhead was calculated too. Each run was performed three times, and the run with the median runtime was taken. Though three runs is not very many, variance was controlled mainly by increasing the problem size instead of increasing the number of runs. Variance between runs is mainly due to variation in GPU invocation overhead, which depends on how state is cached on both the GPU and host process, as well as contention for the CPU.
For the simulated results, the static and dynamic algorithms were run in the same fashion, except that only one run was done for each benchmark, as the variance in runtime in simulation is lower than on real hardware. The simulated hardware has only one type of memory, so different memory placement was not explored.

Table 3.4: Test Systems

Category   Item                         Fusion System                     Discrete System
CPU        Name                         AMD Fusion A3650 APU              Intel Core 2 Duo E6600
           Clock                        2.6 GHz                           2.4 GHz
           Cores                        4                                 2
           L1 Data                      4 × 64 KB                         2 × 32 KB
           L1 Instruction               4 × 64 KB                         2 × 32 KB
           L2                           4 × 1 MB                          4 MB
GPU        Name                         HD 6530D                          FirePro 2270
           Architecture                 AMD Evergreen                     AMD Cedar
           Clock                        444 MHz                           600 MHz
           Compute Units                4                                 2
           Processing Elements Per CU   16                                8
           Connection to CPU            On-Chip                           PCIe
RAM        Speed                        1866 MHz                          667 MHz (CPU), 600 MHz (GPU)
           Amount                       8 GB                              4 GB (CPU), 512 MB (GPU)
Software   Operating System             Windows 7 Professional (32-bit)
           AMD APP SDK Version          2.8

Also, while Appendix A summarizes our attempt to get Multi2Sim to model the Fusion's CPU, we were unsuccessful because Multi2Sim's model is not yet accurate enough and underestimates the performance of x86 CPUs. Furthermore, Multi2Sim only supports Southern Islands-architecture GPUs with the m2sOpenCL driver, and the Fusion contains an Evergreen-architecture GPU. Though the Southern Islands and Evergreen architectures are similar, they are not identical. Instead, we used Multi2Sim's default CPU and GPU configurations, modifying only the core count to match the Fusion system's configuration. The CPU's clock was left at Multi2Sim's default, 1 GHz, but the GPU was set to 10 MHz, to compensate for the pessimistic CPU model. The issues with the CPU model make it difficult to simulate any real-world CPU; hence, the 10 MHz value for the GPU was chosen so that the CPU and GPU have similar performance. Had the GPU operated at a more realistic frequency, it would have been far too powerful in comparison to the CPU for co-operative execution to be effective. The value of 10 in particular was chosen because it is a round number. We did not attempt to match CPU and GPU performance exactly, as it varies by benchmark.

For all results, both simulated and real, the runtimes presented reflect only kernel execution. Because zero-copy memory was used, memory transfer times were not considered.

3.6.1 Benchmarks

Though Rodinia is a UNIX-based benchmark suite, we got nine benchmarks built and running correctly on our Windows-based runtime. Windows was used instead of Linux because AMD only supports zero-copy memory on Evergreen in Windows. The discrete card used in this evaluation supports a maximum of 128 work-items per work-group - half that of the Fusion - so six benchmarks ran correctly on the discrete system. Six of the benchmarks could not run in Multi2Sim because the Southern Islands GPU simulation does not yet support the entire Southern Islands instruction set. The nine benchmarks used are described below.

Back Propagation (backprop)

Back propagation is a method for training a neural network. The training happens in two kernels. The first activates the input neurons and measures the result at the output. The second uses the error observed at the outputs to propagate changes in the weights of the neural network back toward the input neurons. For real hardware, we passed backprop an input size of 102400 elements. This benchmark runs 2 kernels.

Breadth-First Search (bfs)

Breadth-first search performs a search in a graph. We passed bfs a 1 million element graph. As configured, this benchmark runs 24 kernels.

Computational Fluid Dynamics (cfd)

This computation models a compressible fluid using Euler’s equations. As configured, this benchmark runs 14004 kernels.

Gaussian Elimination (gaussian)

This is the standard matrix operation. Each pair of kernels solves a row of the matrix based on the row above it. We passed this kernel a 1024 by 1024 element matrix. As configured, this benchmark runs 2046 kernels.

Hotspot Chip Simulation (hotspot)

Hotspot simulates how heat spreads through a chip by dividing the chip into elements and modelling the interaction between neighbouring elements. We passed this kernel a 1024 by 1024 element chip and simulated it for 8 time steps. As configured, this benchmark runs 2 kernels.

K-Means Clustering (kmeans)

K-Means is the standard artificial intelligence clustering algorithm. As configured, this benchmark runs 38 kernels.

Nearest Neighbour (nn)

Nearest neighbour calculates the distance between a reference point and an array of points that represent the position of hurricanes. After this parallel calculation, the host sorts the array and reports the 10 closest hurricanes. We increased the number of 'hurricanes' used in this benchmark to about 4 million. This benchmark runs one kernel.

Needleman-Wunsch (nw)

Needleman-Wunsch is a matrix operation used to find matches in DNA sequences. As configured, this benchmark runs 511 kernels.

Path Finder (pathfinder)

Path Finder finds the least-weighted route through a grid. Each kernel analyses a row of the grid at a time. As configured, this benchmark runs 5 kernels.

3.7 Results and Discussion

First, this section presents the overall performance of co-operative execution versus single-device execution for each real system. Next, the performance of each of the dynamic partitioning algorithms is examined and compared to the fastest static partitioning run. To gain insight into the results, we examine some key properties of the dynamic partitioning algorithms' behaviour. The most important property is the fraction of the total work that the algorithms assign to each device. This section refers to this division of work as the work ratio and represents the ratio numerically as the proportion of work-groups assigned to the CPU. The amount of invocation overhead incurred by the dynamic scheduling algorithms is also examined. Finally, detailed results for some of the benchmarks are displayed, showing all of the runs of the static sweep and dynamic algorithms.

3.7.1 Memory Placement

Figure 3.4 shows the runtimes of the benchmarks when executed by the CPU and GPU using host and device memory. In addition, the runtime of the fastest dynamic partitioning algorithm and the optimal static partitioning is displayed for each benchmark. Figure 3.5 shows the same results for the discrete machine. For each benchmark, the results are normalized to the performance of the GPU using device memory - the fastest device. The y-axis of each graph is plotted with a logarithmic scale, to lend equal weight to speedups and slowdowns. For the same reason, we used the geometric mean to average the benchmarks, as shown in the rightmost cluster of each graph.

On average, for the Fusion system, the fastest static partitioning beat the GPU with device memory while the fastest dynamic algorithm did not, but both beat the CPU and the GPU when those devices used host memory. This shows that the partitioning algorithms speed up execution, but that the speed-up they provide cannot totally compensate for the compromise of always using host memory. The GPU benefits from using device memory, which is optimized for large coalesced memory accesses at the price of latency. However, for nn and nw, the GPU performs better with host memory. For nw this is not so surprising, as the CPU is better at this algorithm, which suggests that nw is better suited to the CPU. On the other hand, nn involves large contiguous memory accesses, for which the GPU memory hierarchy is better. This trend likely has to do with how OpenCL caches buffers in host memory before their first use. Because nn executes only one kernel per run, it is particularly sensitive to this effect. To test this theory on the discrete system, the nn benchmark was modified to execute the kernel multiple times. In this situation, kernel performance with device memory exceeded performance with host memory by about 10%.

Hotspot and pathfinder, which use local memory, share a clear pattern. Memory placement for these benchmarks does not matter, as the time taken to transfer the data to and from local memory is amortized by the computation, which uses the local, on-chip memory.

Figure 3.4: Different Device and Memory Combinations on Fusion

Overall, the CPU using device memory was over an order of magnitude slower than baseline, making it a poor choice for co-operative execution. This, combined with the issues we experienced trying to get device memory to work correctly when accessed by both devices, meant that we did not use it for any further experiments. The fastest dynamic partitioning algorithms were only faster than the GPU using device memory on the discrete system, but this is mainly because cfd and hotspot, for which dynamic partitioning performed poorly, did not run on the discrete system. That said, looking only at the host-memory averages, the fastest dynamic partitioning outperformed both the CPU and the GPU on both systems.

3.7.2 Partitioning Algorithms

Figure 3.6 and Figure 3.7 show the performance of the individual dynamic partitioning algorithms versus static partitioning for the Fusion and discrete systems, respectively. As before, these results are normalized to the runtime of the fastest device using its own memory. For the Fusion system, static partitioning performed the best - or at least close to the best - with the exception of backprop. This is because backprop contains two kernels (per run) that differ in the ratio in which they should be divided between the CPU and GPU for optimal performance. The dynamic partitioning algorithms are free to vary the ratio they assign from kernel to kernel, so they can accommodate this difference, whereas static partitioning cannot.

The numa, block, and same partitioning algorithms performed similarly, on average, on both systems. First, on the other hand, was the best algorithm on the Fusion system and the worst on the discrete system. This is because first chose to run most of the work on the CPU device for nn on the discrete system even though the CPU is the slower device for that benchmark.

Figure 3.5: Different Device and Memory Combinations on Discrete System

Figure 3.8 shows what proportion of work each scheduling algorithm assigned to the CPU (with the remainder being assigned to the GPU). On average, the optimal static partitioning scheduled about 20% less work on the CPU than the scheduling algorithms did. This is due to the increased GPU invocation overhead compared to the CPU. As expected, first scheduled either 3/16 or 13/16 of the work onto the CPU for bfs, hotspot, nn, and pathfinder. For other benchmarks, first made different scheduling decisions for each kernel in a run, so the average is somewhere in between. For bfs, the fastest static partitioning scheduled everything on the CPU, though all the scheduling algorithms initially assign both devices work; similarly, for nn the fastest static partitioning scheduled everything on the GPU. The insight that most scheduling algorithms scheduled too much work on the CPU explains why first did well on the Fusion machine: it scheduled the least amount of work on the CPU.

Figure 3.9 and Figure 3.10 show the amount of invocation overhead incurred by the CPU and GPU, respectively, as a fraction of total execution time. This overhead was calculated by subtracting the 'queued' time from the 'start' time of the kernels, as reported by the underlying AMD OpenCL runtime. CPU invocation overhead is relatively low: several benchmarks have CPU invocation overheads under 1%. On the other hand, GPU invocation overhead is much higher. Also, while hotspot and nn incurred very little CPU overhead, they incurred very high GPU overhead, implying that GPU overhead and CPU overhead arise from different effects. The high GPU invocation overhead for nn also helps to explain how a scheduling algorithm like first could incorrectly identify the faster device for the benchmark (though for the result considered, first chose correctly). The high (and variable) invocation overhead of the GPU explains why, on average, the scheduling algorithms put more work on the CPU than was optimal: the GPU's overhead made it seem slower. GPU invocation overhead is hard to predict because it depends on how GPU data structures are cached in both the host process and on hardware, and is affected by CPU utilization as well.
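The 'queued'-to-'start' gap described above comes from OpenCL's standard event-profiling queries; a minimal sketch of such a measurement (not DualCL's actual instrumentation, error handling omitted) is shown below. It assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE.

/* Sketch: invocation overhead of one kernel launch, in nanoseconds, measured
 * as the gap between when the command was queued and when it started. */
#include <CL/cl.h>

cl_ulong invocation_overhead_ns(cl_event kernel_event)
{
    cl_ulong queued = 0, started = 0;
    clWaitForEvents(1, &kernel_event);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START,
                            sizeof(started), &started, NULL);
    return started - queued;
}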

Figure 3.6: Partition Algorithms on Fusion System

Figure 3.7: Partition Algorithms on Discrete System

Figure 3.8: Proportion of Work Done on CPU

Of course, even static partitioning incurs the latency of one GPU invocation, but the other scheduling algorithms incur the latency many times, or make future scheduling decisions that take the latency into account. Though these experiments did not investigate situations where the GPU was 'warmed up' by an initial kernel invocation before doing actual work, if the first invocation accounted for most of the overhead, block and numa would have performed better because their first runs would have acted as initializations.

Figure 3.11 shows the overhead of the dynamic scheduling system itself. This overhead was determined by calculating how much time each device spent doing work (including overheads) and subtracting the larger of those two numbers from the total runtime. DualCL's own overhead trends closely with CPU overhead, which makes sense because DualCL and the OpenCL CPU device both contend for the same CPU hardware. We tried to alleviate this contention by forcing the OpenCL CPU device to leave one core unused, but this failed to address the main performance issue - invocation overhead - and hence had a minimal impact on the overall results.

3.7.3 Individual Benchmarks

Many of the benchmarks reveal interesting information when examined in detail. This thesis represents the results of a particular benchmark with a graph that displays the results of all the static and dynamic runs. The x-axis of these graphs is the proportion of the work each partitioning algorithm scheduled on the CPU, and the y-axis is execution time in ms. On each graph, all 17 of the static partition runs are displayed, with each one assigning 1/16 more work to the CPU than the previous one. The static partitioning points are connected with a line to show how the execution time trends as the work ratio is varied. The dynamic partitioning algorithms are each graphed as a single point, representing the overall work ratio and execution time that the algorithm attained.

Figure 3.9: CPU Kernel Invocation Overhead

Figure 3.10: GPU Kernel Invocation Overhead

Figure 3.11: DualCL Overhead

When a dynamic algorithm lies above the static curve, it means that the algorithm operated less efficiently than the static algorithm for the same work ratio. The overall performance of a dynamic partitioning algorithm can be determined by comparing both its work ratio and runtime to the lowest point on the static partitioning curve.

Figure 3.12 and Figure 3.13 show the results for the kmeans benchmark on the Fusion and discrete systems, respectively. For both systems, the static partitioning curve has a minimum that is at neither the CPU nor the GPU, which implies that maximum performance is attained when the devices share the problem. For kmeans, all of the scheduling algorithms found the optimal point. In particular, for the discrete machine, the scheduling algorithms actually beat the static algorithm slightly. This is because kmeans is actually comprised of two kernels, each with its own optimum. However, with kmeans one kernel is run much more frequently than the other, so the effect is minimal; we therefore defer detailed discussion of this effect to backprop, where its impact is much greater.

Figure 3.14 shows nearest neighbour on the Fusion system and Figure 3.15 shows it on the discrete system. Unlike the kmeans results, the static partitioning line does not have a minimum in the middle, signifying that no heterogeneous speedup is possible with DualCL. For both systems, the GPU was the faster device. For the Fusion system, first was the best dynamic partitioning algorithm. All of the other dynamic algorithms divided the work more or less evenly between the CPU and the GPU. In particular, though numa scheduled far more pieces of work onto the GPU than the CPU, the two pieces of work it scheduled on the CPU represented a big enough proportion of the work that the eventual distribution was nowhere near optimal. On the discrete device, numa, block, and same do better. This time, numa scheduled only the initial quarter of work on the CPU. On the other hand, first assigned most of the work to the CPU, resulting in a much higher runtime than the other scheduling algorithms.

Figure 3.16 shows pathfinder on the Fusion system and Figure 3.17 shows it on the discrete system.

Figure 3.12: Kmeans on Fusion System

Figure 3.13: Kmeans on Discrete System

Figure 3.14: Nearest Neighbour on Fusion System

Figure 3.15: Nearest Neighbour on Discrete System

Figure 3.16: Pathfinder on Fusion System

On both the discrete and Fusion systems, pathfinder's static curve shows that there is some benefit to using both devices; however, first is the only scheduling algorithm able to attain any speedup. Nevertheless, the rest of the scheduling algorithms still come close to finding the ideal proportion of work. In particular, on the discrete system, they are only 1/8 away from the optimal.

Figure 3.18 shows the execution of the back propagation algorithm on the Fusion system. Here, the CPU and GPU have similar performance and the scheduling algorithms all perform well. In particular, they outperform the static sweep. This is because the back propagation benchmark is a combination of two kernels, each with its own sweet spot. Figure 3.19 and Figure 3.20 show the runtime of each kernel individually. In these figures, the dynamic scheduling algorithms do not outperform the static partitioning on either of the kernels, but because the static partitioning is varied across benchmark executions, and not within them, it can be beaten by scheduling algorithms that adapt on a per-kernel basis when kernels of the same benchmark have different characteristics.

Figure 3.19 and Figure 3.20 show the results for the first and second kernel invocations of backprop separately. At the individual-kernel level, the scheduling algorithms, as expected, lie on or just above the static partitioning curve. Also, they partition the first kernel with more work on the GPU than the second kernel, allowing them to beat the static partitioning algorithm. Of course, this difference in performance reflects the nature of our experiment more than it does a fundamental limitation of static partitioning, as each kernel could have been statically partitioned differently as well.

Figure 3.17: Pathfinder on Discrete System

Figure 3.18: Back Propagation on Fusion System

Figure 3.19: First Kernel of Back Propagation on Fusion System

Figure 3.20: Second Kernel of Back Propagation on Fusion System

Figure 3.21: Gaussian in Simulation

3.7.4 Simulated Results

We ran three of the Rodinia benchmarks in Multi2Sim. Although several more are supported by the m2sOpenCL runtime, Multi2Sim's Southern Islands GPU implementation is incomplete, and can only run four benchmarks. One of these, kmeans, runs in functional simulation mode, but not in detailed (cycle-accurate) simulation mode. The limited support for these benchmarks in Multi2Sim is because support for the NMOESI protocol, the Southern Islands GPU, and the shared memory hierarchy are all still being actively developed by Multi2Sim's developers.

Figures 3.21, 3.23, and 3.24 show the results of running the static and dynamic scheduling algorithms in Multi2Sim. Note that the execution time on these graphs is simulated time. For gaussian and nn, the CPU and GPU are relatively evenly matched, and the curves produced by the static partitioning algorithms are far more linear than in the real-hardware results, particularly for gaussian. This is particularly interesting because the simulated problem sizes are much smaller than the real-world problem sizes, so any invocation overheads would have a much larger impact on overall performance. In fact, Multi2Sim does not consider PCIe bus latency or system call overheads at all when modelling GPU performance. CPU invocation overhead is also reduced because Multi2Sim's OpenCL implementation is optimized for fast invocation, and Multi2Sim does not model resource contention with an operating system kernel. Multi2Sim's NMOESI protocol also relaxes communication bottlenecks that exist in real hardware.

The gaussian simulation results show all three scheduling algorithms achieving a speedup versus the faster CPU device (as evidenced by the fact that the algorithms' runtimes are below that of the static partition run that was all on the CPU).

Figure 3.22: Smaller Gaussian in Simulation

The static partitioning curve is also an ideal 'check mark' shape, showing almost ideal speedup; however, the scheduling algorithms do not lie on the static partitioning curve. We suspect this is because of cache locality. Gaussian is a very compute-intense algorithm, so small data sets were used for simulation. In fact, the results in Figure 3.21 are from a 64 × 64 matrix. When block and numa partition a kernel into very small parts, these parts operate on very small regions of the matrix that fit entirely in the cache. The next part of the kernel operates on some of the same data and can thus benefit from cache locality if it is assigned to the same device, although sometimes it is not. To test this theory, we reduced the size of the problem to a 32 × 32 matrix to see if the difference in performance would increase. As shown in Figure 3.22, it is considerably more pronounced. To investigate this phenomenon further, we created a version of DualCL that assigned the CPU and GPU work from opposite ends of the NDRange instead of interleaving CPU and GPU allocations; however, this had a negligible effect on the larger problem sizes of real hardware, so we did not use it in simulation.

Back propagation favoured the CPU significantly, taking only 731 µs to execute on the CPU. It is interesting to see that the scheduling algorithms here each favoured the CPU as much as they could. The best-performing algorithm is block, which always executes 1/16 of the work on each device to start; next is first, which executes 3/16 of the work on each device to start; and finally numa, which allocates 1/4 of the work to the first device to request it, which, for this benchmark, was the GPU. In simulation, nearest neighbour also shows a speedup, with both the static partitioning and the scheduling algorithms beating the fastest device. For this benchmark, the CPU got numa's first allocation, so its second allocation, which was 1/4 of the 3/4 of remaining work (or 3/16 of the work), matched that of first's, causing them both to allocate 3/16 of the work to the GPU. Interestingly, block did even worse, allocating only 1/8 of the work to the GPU, whereas an ideal division would have been closer to 3/8.

Figure 3.23: Back Propagation in Simulation

3.8 Conclusions

DualCL and m2sOpenCL are two OpenCL implementations that allow co-operative execution of kernels. DualCL works on real hardware, accessing the CPU and GPU devices using an underlying OpenCL implementation. DualCL allocates buffers in host memory that is accessible to both devices, and applies source code transformations to kernels to allow them to be partitioned. Using partitioning algorithms that dynamically divide the NDRange of a kernel based on earlier performance, DualCL, on average, sees a speedup over a baseline of GPU execution with host memory. M2sOpenCL is Multi2Sim's OpenCL implementation, which allows for direct co-operative execution without the need for code transformations. M2sOpenCL exploits Multi2Sim's NMOESI cache consistency protocol to allow for data exchange between the CPU and GPU.

Host memory performs better than device memory for co-operative execution because the relatively small GPU in the Fusion APU incurs less of a penalty accessing host memory than the CPU does accessing GPU memory. That said, the similar performance trends seen in the discrete system and the Fusion system show that current single-chip heterogeneous processors are still very similar to discrete systems, and offer very little advantage to co-operative execution, exhibiting similar invocation overheads and memory placement trade-offs. Nevertheless, DualCL's dynamic algorithms did outperform either device using host memory, the type of memory chosen for these experiments.

Figure 3.24: Nearest Neighbour in Simulation

Results from m2sOpenCL show that future hardware can improve heterogeneous performance by reducing invocation overheads. In particular, the partitioning algorithms' performance was much closer to their best theoretical performance, though the accuracy of Multi2Sim itself as an experimental platform is still improving.

Chapter 4

CPU Execution for Multi2Sim

Multi2Sim [43] is a cycle-accurate simulator capable of simulating devices such as x86 CPUs and AMD Southern Islands GPUs. For this reason, amidst all the excitement over heterogeneous systems, Multi2Sim positions itself as a valuable tool for the exploration of systems that contain both a CPU and a GPU. To enable the simulation of heterogeneous programs, Multi2Sim allows programs to run computations on its GPUs using OpenCL; however, until our work, Multi2Sim was unable to run OpenCL kernels on its CPUs, because it was missing an x86 CPU driver for its OpenCL implementation. This chapter describes the CPU driver we wrote to enable OpenCL execution on Multi2Sim's CPU. OpenCL CPU execution is particularly exciting because it allows for the automatic distribution of work between the CPU and GPU. The simulation results presented in Chapter 3 used this driver to run OpenCL on the CPU.

This chapter first describes the high-level details and goals of the CPU driver, then its implementation. It explains how the CPU driver executes individual work-items, then work-groups, and finally entire kernel invocations. In particular, it focuses on how barriers were implemented, as they have a significant impact on performance. Finally, this chapter evaluates the performance of the CPU driver, as our driver is not useful if it is too much slower than the real x86 AMD runtime. We find that the driver's performance is within 6% of AMD's implementation when running AMD benchmarks.

4.1 Overall Design

For consistency with Multi2Sim’s GPU OpenCL driver, the CPU driver runs binaries that were com- piled with the AMD APP SDK version 2.7 OpenCL compiler. Because AMD does not publish any documentation explaining the format of their binaries, we had to reverse engineer the binaries to create the driver. Further, the CPU driver runs both in Multi2Sim and on real hardware. This allows us to compare the CPU driver’s performance to that of other software that does not easily run in simulation, like AMD’s x86 runtime. Finally, we designed the CPU driver to support co-operative execution with other devices such as Multi2Sim’s CPU. To facilitate this goal, the CPU driver has been contributed to the Multi2Sim project, and is now part of all Multi2Sim releases.


4.1.1 The OpenCL Device Model for CPUs

Our implementation executes binaries in the same way as the AMD driver with which it has binary compatibility. In both implementations, each core in a CPU becomes a compute unit with one processing element. This means that a core only runs one work-group at a time. Whereas GPUs typically interleave the execution of work-items within a work-group, our CPU driver does not. A call to barrier() made by the work-items of a work-group will cause them to synchronize, but without barriers, work-items within a work-group just run to completion, one after another. Also, the private, local, global, and constant memories are all implemented with normal, cacheable memory on CPUs.

4.2 Behaviour of a Work-Item

An AMD-compiled kernel binary contains machine code that represents the behaviour of a single work-item. This is in contrast with other techniques, like work-item coalescing [27], that generate code representing an entire work-group. Because of this, a work-item's calls to barrier() require the current work-item to yield the CPU to another work-item from the work-group. Hence, each work-item is a co-operatively scheduled thread with its own stack, which it uses to store all private memory and its own copy of the parameters passed to the kernel.

4.2.1 Work-Item Stack Layout

At the ABI (Application Binary Interface) level, work-items are C functions that use the standard C calling convention. This means that the stack grows toward lower memory addresses and parameters to the kernel are pushed onto the stack from right to left. The return address that the work-item should jump to upon completion is pushed onto the stack last. Contrary to 32-bit conventions, work-items expect the first four floating-point parameters to be passed using SSE registers XMM0 through XMM3. Work-item functions like get_global_id() are implemented as lookups into a special stack-resident data structure. Table 4.1 shows the information stored at each offset. This data-structure is located in the highest-addressed part of the stack, just above the parameters passed to the work-item. The data-structure is always aligned to end just below a multiple of 8 kB, allowing a work-item to round its stack pointer up to the nearest multiple of 8 kB to locate the data-structure. For this to work, the size of a work-item's stack (and hence the amount of private data it can store) is limited to just under 8 kB. Figure 4.1 shows a work-item's entire stack, including the data-structure, kernel parameters, and any private memory. Note that each work-item has its own ID, and hence has its own, private data-structure in its stack.
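As a concrete illustration of this lookup, the sketch below locates the data-structure from the current stack pointer and reads the get_global_id(0) slot. It is a sketch only: it assumes a GCC-style compiler, that the current frame address can stand in for the stack pointer, that Table 4.1's offsets are measured from the 8 kB boundary the structure ends just below, and a 32-bit ABI; the driver's real code is emitted by AMD's compiler, not written like this.

/* Hedged sketch: find the work-item information data-structure by rounding
 * the stack pointer up to the next 8 kB (0x2000) boundary, then read the
 * get_global_id(0) slot at offset -0x10 (see Table 4.1). */
#include <stdint.h>

#define WI_STACK_SPAN 0x2000u    /* 8 kB per work-item stack (assumption) */

static inline uint32_t sketch_get_global_id0(void)
{
    uintptr_t sp  = (uintptr_t)__builtin_frame_address(0);
    uintptr_t top = (sp + WI_STACK_SPAN - 1) & ~(uintptr_t)(WI_STACK_SPAN - 1);
    return *(const volatile uint32_t *)(top - 0x10);
}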

4.3 Fibres

For efficiency’s sake, though each work-item has its own stack, all the work-items in a work-group execute in the same operating system thread. Instead of being scheduled pre-emptively by the OS, they are scheduled co-operatively by the CPU driver. Co-operatively scheduled contexts executed in this fashion are often called ‘green threads’, ‘user threads’, ‘co-routines’ or ‘fibres’. This thesis uses the term ‘fibres’. Chapter 4. CPU Execution for Multi2Sim 40

Table 4.1: Work-Item Information Data-Structure.

Contents                              Offset (bytes)
padding                               -0x4
get_global_id(2)                      -0x8
get_global_id(1)                      -0xC
get_global_id(0)                      -0x10
padding                               -0x14
get_group_id(2)                       -0x18
get_group_id(1)                       -0x1C
get_group_id(0)                       -0x20
padding                               -0x24
get_local_size(2)                     -0x28
get_local_size(1)                     -0x2C
get_local_size(0)                     -0x30
padding                               -0x34
get_global_size(2)                    -0x38
get_global_size(1)                    -0x3C
get_global_size(0)                    -0x40
padding                               -0x44
group lowest get_global_id(2)         -0x48
group lowest get_global_id(1)         -0x4C
group lowest get_global_id(0)         -0x50
get_work_dim                          -0x54
pointer to static local memory        -0x58
pointer to barrier function pointer   -0x5C
padding                               -0x60

Figure 4.1: Stack of a Work-item

Though many fibre implementations already exist, for performance reasons, we implemented our own for the CPU driver. The AMD OpenCL compiler never uses the legacy x87 floating-point and segment registers, so they do not need to be saved upon a context switch. In addition, some fibre implementations re-route UNIX signal masks, which is unnecessary for OpenCL. As fibres cannot pre-empt each other, they need to call a fibre-switch method in order to change which fibre is running on the CPU. In OpenCL, that method is barrier(). To switch between two fibres, first, the general-purpose registers are backed up on the suspending fibre's stack and the program counter and stack pointer of the suspending fibre are saved. Then the resuming fibre is resumed by switching to its saved stack pointer and jumping to its saved program counter; the code there restores the registers that were backed up and continues execution past the fibre-switch call. Switching between fibres does not involve the operating system, making them more efficient than pre-emptively scheduled threads. Further, user-mode processes can allocate the stacks of fibres directly, which allows the OpenCL driver to place work-item stacks back-to-back, improving cache locality when executing kernels.

4.4 Scheduling and Allocating Work-Items

In OpenCL, work-items yield to each other using the barrier() call. In fact, work-group-wide barriers are the only form of in-kernel synchronization that OpenCL supports. Unlike get_global_id() (which is a lookup into the work-item information data structure), a barrier call is a conventional call which runs the barrier function present in the CPU driver.

Figure 4.2: Work-group Execution With and Without Barriers

The driver executes the work-items of a work-group in ascending local ID order, so if no calls to barrier() exist in the kernel, the work-items simply run one after another, as illustrated in Figure 4.2a. However, when a work-item hits a barrier, the driver suspends that work-item and schedules the next one. When the last work-item hits the barrier, the driver switches back to the first work-item, which can now continue executing past the barrier, as illustrated in Figure 4.2b. If there are more barriers in the kernel, this process repeats until the first work-item terminates, at which point all the remaining work-items are scheduled and terminate, in order.
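The driver's fibres are hand-rolled x86 context switches, but the same round-robin barrier scheduling can be sketched portably with POSIX ucontext. The sketch below runs a four-item work-group through a single barrier; all names are illustrative and this is not the driver's code.

/* Hedged sketch of barrier scheduling with POSIX ucontext fibres instead of
 * the driver's hand-rolled x86 fibres.  One barrier, four work-items. */
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define WG_SIZE     4
#define STACK_BYTES (64 * 1024)

static ucontext_t driver_ctx;          /* the driver thread's own context   */
static ucontext_t wi_ctx[WG_SIZE];     /* one fibre per work-item           */
static int current;                    /* local ID of the running work-item */

/* barrier(): suspend the current work-item and resume the next one. */
static void barrier_sketch(void)
{
    int me = current;
    current = (current + 1) % WG_SIZE;
    swapcontext(&wi_ctx[me], &wi_ctx[current]);
}

static void work_item(void)
{
    int id = current;                          /* captured at entry */
    printf("work-item %d: before barrier\n", id);
    barrier_sketch();
    printf("work-item %d: after barrier\n", id);
    /* Returning resumes driver_ctx via uc_link. */
}

int main(void)
{
    for (int i = 0; i < WG_SIZE; i++) {
        getcontext(&wi_ctx[i]);
        wi_ctx[i].uc_stack.ss_sp   = malloc(STACK_BYTES);
        wi_ctx[i].uc_stack.ss_size = STACK_BYTES;
        wi_ctx[i].uc_link          = &driver_ctx;
        makecontext(&wi_ctx[i], work_item, 0);
    }
    current = 0;
    /* Runs items 0..3 up to the barrier, then item 0 past it to completion. */
    swapcontext(&driver_ctx, &wi_ctx[0]);
    /* Resume the remaining suspended items past the barrier, in order. */
    for (int i = 1; i < WG_SIZE; i++) {
        current = i;
        swapcontext(&driver_ctx, &wi_ctx[i]);
    }
    for (int i = 0; i < WG_SIZE; i++)
        free(wi_ctx[i].uc_stack.ss_sp);
    return 0;
}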

4.4.1 Stack Allocation Optimization

When a kernel involves calls to barrier(), many work-items are in flight at once, and each work-item needs its own stack to retain its state; however, if there are no calls to barrier(), the CPU driver may execute the work-items one after another, re-using the same region of memory for all the work-items' stacks. Reusing the same memory increases the performance of the driver because it reduces the working set of the kernel invocation, increasing cache locality. We noticed significant speedups for barrierless kernels when we implemented this optimization. For this optimization to work, the driver runs the first work-item. If the work-item terminates without making any calls to barrier(), the driver re-uses the work-item's stack, re-initializing it for the second work-item. Alternatively, if the first work-item calls barrier(), the driver initializes the second work-item's stack before running it.

Table 4.2: Test Configuration for OpenCL CPU Driver Comparison

Processor Name        Intel Core 2 Duo E6600
Processor Clock       2.4 GHz
Processor Cores       2
Processor L2 Cache    4 MB (shared)
RAM                   4 GB
Operating System      Ubuntu 13.04 (32-bit)
AMD APP SDK Version   2.7

This strategy avoids initializing all the stacks for kernels that do not call barrier(), but complicates the implementation of the barrier() function. Note that, in OpenCL, all the work-items in a work-group must call barrier() the same number of times, so the first work-item not calling barrier() guarantees that no other work-item will either.

4.5 Whole-Kernel Execution and Scheduling

The driver runs a thread on each core of the CPU. For an n-core CPU, (n − 1) of those threads come from the driver, and the remaining thread comes from the command queue, which has its own thread so that it can run operations asynchronously. To run a kernel, each thread dequeues work-groups and runs them. Because the work-groups are numbered, the dequeueing operation is implemented as an atomic decrement. When this counter reaches zero, and all the threads finish, the kernel is complete, and the command queue advances to the next operation. To reduce the latency associated with dispatching a kernel, the (n − 1) worker threads owned by the driver busy wait for another kernel. Kernel invocation overhead is particularly important for fused execution, when a single kernel can involve scheduling multiple parts onto each device.
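A minimal sketch of that dequeue loop, using a C11 atomic counter as a stand-in for the driver's internal state (all names are illustrative, not the driver's actual interface):

/* Hedged sketch: each worker thread claims numbered work-groups by atomically
 * decrementing a shared counter until none remain. */
#include <stdatomic.h>

static atomic_int groups_left;   /* set to the work-group count before the
                                    worker threads are released             */

static void worker_loop(int total_groups, void (*run_group)(int group_id))
{
    for (;;) {
        int prev = atomic_fetch_sub(&groups_left, 1);
        if (prev <= 0)
            break;                          /* all work-groups claimed      */
        run_group(total_groups - prev);     /* claims group IDs 0..total-1  */
    }
}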

4.6 Performance Validation

We evaluated our driver’s performance with two types of tests: synthetic micro-benchmarks that stress a particular aspects of the driver and more conventional benchmarks from AMD’s development kit. The configuration of the computer used for the test is shown in Table 4.2. We did not use the fusion machine itself because it has a newer version of the AMD APP SDK installed. Figure 4.3 shows the results of the synthetic micro-benchmark tests. Each test was run three times, and the median was taken. The float test launches a work-group with 256 work-items that add four floating-point numbers together and write the sum to an array element. This test does not include any barriers, it just tests the basic functionality of the driver. The next test is local. In this test, work-items in a single work-group write one element of a local memory array, synchronize with a barrier, then read the value in the next element in the local memory array and add that value to a global memory buffer. This tests both the performance and correctness of local memory accesses. The next test, multi barrier is similar to the local test, except that there are multiple work-groups. The next test, multi also launches multiple work-groups, each with 256 work-items, but each work-item only writes to a global array, so Chapter 4. CPU Execution for Multi2Sim 44

Figure 4.3: Runtime of Our Runtime and AMD Runtime with Synthetic Benchmarks

Finally, work group is similar to float except that it does no calculations at all; it only involves runtime overheads.

If the ratio of our execution time to AMD's execution time is taken on a per-benchmark basis, the geometric mean of those ratios shows that our runtime takes 1.27 times as long as AMD's - within 30%. Although we lose out with the work group test, we beat AMD with the float test, suggesting our mechanism for passing floating-point parameters to work-items is more efficient than AMD's. The main area where we lost out is on the tests that involved dispatching many work-groups: we were slower for multi and multi barrier. This might be because AMD's runtime dequeues work-groups many at a time, reducing contention. Finally, our barrier implementation seems to be faster than AMD's, as seen in our local test.

We ran benchmarks from the AMD APPSDK samples on our OpenCL runtime and with AMD's OpenCL runtime targeting the CPU device. For all benchmarks, default parameters were passed, and a verification feature of the benchmarks (which runs a non-OpenCL version of the algorithm and compares the results) confirmed that our runtime ran the binaries properly. Figure 4.4 shows the ratio of our execution time to AMD's execution time for a subset of the AMD APPSDK benchmarks. A ratio above one means that AMD's runtime was faster, whereas a ratio below one means that ours was faster. These ratios are shown on a log graph so that relative speedups and slowdowns are represented equally. For each benchmark, three runs were performed, and the fastest of the three was chosen, again to eliminate the effects of background operations. With the APPSDK samples, our runtime has a geometric mean execution time 1.055 times that of AMD's. Our runtime fared better, relative to AMD's, on these benchmarks than on the micro-benchmarks. This is because some micro-benchmarks stressed operations for which our implementation is slower, like dispatching work-groups, more than the APPSDK benchmarks do.

Figure 4.4: Ratio of Our Runtime vs AMD Runtime for APPSDK Benchmarks

4.7 Conclusion

This section presented an x86 OpenCL driver that is compatible with the binaries produced by the AMD OpenCL runtime. In order to create this driver, we reverse engineered the AMD binaries and created a runtime that could execute these non-standard binaries in a way that meets the OpenCL specification. We used fibres, which allow for efficient execution of work-items, and employed an optimization to reduce the memory footprint of kernels that do not contain barriers. Our kernel dispatching is designed to minimize kernel invocation latency, and work-group dispatching involves a single operation. Overall, our driver has performance similar to that of version 2.7 of the AMD runtime. This driver allows us to explore heterogeneous execution in the Multi2Sim simulator.

Chapter 5

Distributed OpenCL Execution

5.1 Introduction

Recently, there has been significant interest in using GPUs for general purpose and high performance computing. Significant speedups have been demonstrated when porting applications to a GPU [6] [38]; however, additional speedups are still possible beyond the computational capabilities afforded by a single GPU. Therefore, it is no surprise that modern computing clusters are starting to include multiple GPUs [25]. However, distributing a data-parallel task across a cluster still involves the following steps:

1. partitioning work into smaller parts,

2. scheduling these parts across the network,

3. partitioning memory so that each part of memory is written to by at most one device, and

4. tracking and transferring these parts of memory.

This chapter introduces DistCL, a framework for the distribution of OpenCL kernels across an MPI-based multi-node cluster. To simplify this task, DistCL takes advantage of three insights: 1) OpenCL tasks (called kernels) contain threads (called work-items) that are organized into small groups (called work-groups). Work-items from different work-groups cannot communicate during a kernel invocation. Therefore, work-groups only require that the memory they read be up-to-date as of the beginning of the kernel invocation. So, DistCL must know what memory a work-group reads, to ensure that the memory is up-to-date on the device that runs the work-group. DistCL must also know what memory each work-group writes, so that future reads can be satisfied. But no intra-kernel synchronization is required. 2) Most OpenCL kernels make only data-independent memory accesses; the addresses they access can be predicted using only the immediate values they are passed and the geometry they are invoked with. This means that their accesses can be efficiently determined before they run. DistCL requires that kernel writes be data-independent. 3) Kernel memory accesses are often contiguous. This is because contiguous accesses fully harness the wide memory buses of GPUs [38]. DistCL does not require contiguous accesses for correctness, but they improve distributed performance because contiguous accesses made in the same work-group can be treated like a single, large access when tracking writes and transferring data between devices.


DistCL introduces the concept of meta-functions: simple functions that describe the memory access patterns of an OpenCL kernel. Meta-functions are programmer-written, kernel-specific functions that relate a range of work-groups to the parts of a buffer that those work-groups will access. When a meta-function is passed a range of work-groups and a buffer to consider, it divides the buffer into intervals, marking each interval either as accessed or not accessed by the work-groups. DistCL takes advantage of kernels with sequential access patterns, which have fewer (larger) intervals, because it can satisfy their memory accesses with fewer I/O operations. By dividing up buffers, meta-functions allow DistCL to distribute an unmodified kernel across a cluster. To our knowledge, DistCL is the first framework to do so. In addition to describing DistCL, this chapter evaluates the effectiveness of kernel distribution across a cluster based on the kernels' memory access patterns and their compute-to-transfer ratio. It also examines how the performance of various OpenCL and network operations affects the distribution of kernels. This chapter makes the following primary contributions:

• The introduction of DistCL and its concept of meta-functions, which allow for the distribution of unmodified kernels.

• An evaluation of how the properties of different kernels affect their performance when distributed.

This chapter describes DistCL using vector addition as an example, in particular looking at how DistCL handles each step involved in distribution (Section 5.2). The benchmarks are introduced and grouped into three categories: linear runtime benchmarks, compute intensive benchmarks, and benchmarks that involve inter-node communication (Section 5.3). Results for these benchmarks are presented (Section 5.4). A comparison with SnuCL [24] (described in Section 2.5.3) is also provided. Finally, conclusions are drawn about how the properties of an OpenCL kernel and cluster hardware impact distributed performance (Section 5.5).

5.2 DistCL

DistCL executes on a cluster of networked computers. OpenCL host programs use DistCL by interacting with one compute device that represents the aggregate of all the devices in the cluster. When a program is run with DistCL, identical processes are launched on every node. When the OpenCL context is created, one of those nodes becomes the master. The master is the only node that is allowed to continue executing the host program. All other nodes, called peers, enter an event loop that services requests from the master. Nodes communicate in two ways: messages to and from the master, and raw data transfers that can happen between any pair of nodes.

To run a kernel, DistCL divides its NDRange into smaller grids called subranges. These subranges are analogous to the parts created by the partitioning algorithms described in Section 3.5. Kernel execution is distributed because these subranges run on different peers. DistCL must know what memory a subrange will access in order to distribute the kernel correctly. This knowledge is provided to DistCL by meta-functions. Meta-functions are programmer-written, kernel-specific callbacks that DistCL uses to determine what memory a subrange will access. DistCL uses meta-functions to divide buffers into arbitrarily-sized intervals. Each interval of a buffer is either accessed or not. DistCL stores the intervals calculated by meta-functions in objects called access-sets. Once all the access-sets have been calculated, DistCL can initiate the transfers needed to allow the peers to run the subranges they have been assigned.

Listing 5.1: OpenCL kernel for vector addition.

__kernel void vector(__global int *a, __global int *b, __global int *out)
{
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}

Figure 5.1: Vector's 1-dimensional NDRange is partitioned into 4 subranges.

Note the important distinction between subranges, which contain threads, and intervals, which contain data. The remainder of this section describes the execution process in more detail, illustrating each step with a vector addition example, whose kernel source code is given in Listing 5.1.

5.2.1 Partitioning

Partitioning divides the NDRange of a kernel execution into smaller grids called subranges. DistCL never fragments work-groups, as that would violate OpenCL's execution model and could lead to incorrect kernel execution. For linear (1D) NDRanges, if the number of work-groups is a multiple of the number of peers, each subrange will be equal in size. Otherwise, some subranges will be one work-group larger than others. DistCL partitions a multidimensional NDRange along its highest dimension first, in the same way it would partition a linear NDRange. If the subrange count is less than the peer count, DistCL will continue to partition lower dimensions. This strategy assumes that the GPUs in the cluster have similar performance, in contrast to the partitioning algorithms used in DualCL, which attempt to balance performance differences.

Multidimensional arrays are often organized in row-major order, so highest-dimension-first partitioning frequently results in subranges accessing contiguous regions of memory. Transferring fragmented regions of memory requires multiple I/O operations to avoid transferring unnecessary regions, whereas large contiguous regions can be sent all at once.

Our vector addition example has a one-dimensional NDRange. Assume it runs with 1M = 2^20 work-items on a cluster with 4 peers. Assuming 1 subrange per peer, the NDRange will be partitioned into 4 subranges, each with a size of 256k work-items, as shown in Figure 5.1.
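For a linear NDRange, this policy amounts to an even split of work-groups with the remainder spread one per peer. A minimal sketch (with illustrative names, not DistCL's code):

/* Hedged sketch: number of work-groups in peer p's subrange for a 1D NDRange
 * with num_groups work-groups split across num_peers peers. */
#include <stddef.h>

static size_t subrange_groups(size_t num_groups, size_t num_peers, size_t p)
{
    size_t base  = num_groups / num_peers;
    size_t extra = num_groups % num_peers;   /* first 'extra' peers get one more */
    return base + (p < extra ? 1 : 0);
}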

5.2.2 Dependencies

DistCL tracks global memory buffers when they are passed to a kernel. For example, the three parameters a, b, and out in Listing 5.1 are buffers. DistCL must know what parts of each buffer a subrange will access in order to create the illusion of many compute devices with separate memories sharing a single memory.

The set of addresses in a buffer that a subrange reads and writes are called its read-set and write-set, respectively. DistCL represents these access-sets with concrete data-structures and calculates them using meta-functions. Access-sets are calculated every kernel invocation, for every subrange-buffer combination, because the access patterns of a subrange depend on the invocation's parameters, partitioning, and NDRange. In our vector addition example, with 4 subranges and 3 buffers, 24 access-sets will be calculated: 12 read-sets and 12 write-sets.

An access-set is a list of intervals within a buffer. DistCL represents addresses in buffers as offsets from the beginning of the buffer; thus an interval is represented with a low and a high offset into the buffer. These intervals are half-open; low offsets are part of the intervals, but high offsets are not. For instance, subrange 1 in Figure 5.1 contains global IDs from the interval [256k, 512k). As seen in Listing 5.1, each work-item produces a 4-byte (sizeof(int)) integer, so subrange 1 produces the data for interval [1 MB, 2 MB) of out. Subrange 1 will also read the same 1 MB region from buffers a and b to produce this data. The intervals [0 MB, 1 MB) and [2 MB, 4 MB) of a, b, and out are not accessed by subrange 1.

Calculating Dependencies

To determine the access-sets of a subrange, DistCL uses programmer-written, kernel-specific meta-functions. Each kernel has a read meta-function to calculate read-sets and a write meta-function to calculate write-sets.

DistCL passes meta-functions information regarding the kernel invocation's geometry. This includes the invocation's global size (global in Listing 5.2), the current subrange's size (subrange), and the local size (local), although local is not used in this particular meta-function. DistCL also passes the immediate parameters of the kernel (params) to the meta-function. The subrange being considered is indicated by its starting offset in the NDRange (subrange_offset), and the buffer being considered is indicated by its zero-indexed position in the kernel's parameter list (param_num).

DistCL builds access-sets one interval at a time, progressing through the intervals in order, from the beginning of the buffer to the end. Each call to the meta-function generates a new interval. If and only if the meta-function indicates that this interval is accessed, DistCL includes it in the access-set. To call a meta-function, DistCL passes the low offset of the current interval through start, and the meta-function sets next_start to the interval's end. The meta-function's return value specifies whether the interval is accessed. Initially setting start to zero, DistCL advances through the buffer by setting the start of subsequent calls to the previous value of next_start. When the meta-function sets next_start to the size of the buffer, the buffer has been fully explored and the access-set is complete.

Rectangular Regions

Many OpenCL kernels structure multidimensional arrays into linear buffers using row-major order. When these kernels run, their subranges typically access one or more linear, rectangular, or prism-shaped areas of the array. Though these areas are contiguous in multidimensional space, they are typically made up of many disjoint intervals in the linear buffer. Recognizing this, DistCL has a helper function, called require_region, that meta-functions can use to identify which linear intervals of a buffer constitute any such area. require_region, whose prototype is shown in Listing 5.3, operates over a hypothetical dim-dimensional grid. Typically, each element of this grid represents an element in a DistCL buffer. The size of this grid in each dimension is specified by the dim-element array total_size. require_region considers a rectangular region of that grid whose size and offset into the grid are specified by the dim-element arrays required_size and required_start, respectively. Given this, require_region calculates the linear intervals that correspond to that region if the elements of this dim-dimensional grid were arranged linearly, in row-major order.

Listing 5.2: Example read meta-function for vector.

     1  int is_buffer_range_read_vector(
     2      const void **params, const size_t *global,
     3      const size_t *subrange, const size_t *local,
     4      const size_t *subrange_offset, unsigned int param_num,
     5      size_t start, size_t *next_start)
     6  {
     7      int ret = 0;
     8      *next_start = sizeof(int) * global[0];
     9      if (param_num != 2) {
    10          start /= sizeof(int);
    11          ret = require_region(1, global, subrange_offset, subrange, start, next_start);
    12          *next_start *= sizeof(int);
    13      }
    14      return ret;
    15  }

Listing 5.3: require_region helper function.

    int require_region(
        int dim,
        const size_t *total_size,
        const size_t *required_start,
        const size_t *required_size,
        size_t start,
        size_t *next_start);

Given this, require_region calculates the linear intervals that correspond to that region if the elements of this dim-dimensional grid were arranged linearly, in row-major order. Because there may be many such intervals, the return value, start parameter and next_start parameter of require_region work the same way as in a meta-function, allowing the caller to move linearly through the intervals, one at a time. If a kernel does not access memory in rectangular regions, it does not have to use the helper function.

Even though vector has a one-dimensional NDRange, require_region is still used for its read meta-function in Listing 5.2. This is because require_region not only identifies the interval that will be used, but also identifies the intervals on either side that will not be used. require_region is passed global as the hypothetical grid's size, making each grid element correspond to an integer, the datatype changed by a single work-item. Therefore, lines 10 and 12 of Listing 5.2 translate between elements and bytes, which differ by a factor of sizeof(int).

Figure 5.2 shows what actually happens when the meta-function is called on buffer a for subrange 1. In Figure 5.2a, the first time the meta-function is called, DistCL passes in 0 as the start of the interval, and the meta-function calculates that the current interval is not in the read-set and that the next interval starts at an offset of 1 MB. Next, in Figure 5.2b, DistCL passes in 1 MB as the start of the interval. The meta-function calculates that this interval is in the read-set and that the next interval starts at 2 MB. Finally, in Figure 5.2c, DistCL passes in 2 MB as the start of the interval. The meta-function calculates that this interval is not in the read-set and that it extends to the end of the buffer, which has a size of 4 MB.

Figure 5.2: The read meta-function is called for buffer a in subrange 1 of vector.
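The row-major arithmetic behind a helper like require_region can be illustrated with the small, self-contained program below. It is limited to two dimensions and simply prints the disjoint linear intervals covered by a rectangular tile; it is an illustration of the idea, not DistCL's implementation.

```c
/* A rectangular tile of a row-major grid occupies one linear interval per row;
 * this is why a region that is contiguous in 2-D space becomes many disjoint
 * intervals in the linear buffer. */
#include <stdio.h>
#include <stddef.h>

static void print_tile_intervals(size_t total_cols, size_t row0, size_t col0,
                                 size_t rows, size_t cols)
{
    for (size_t r = row0; r < row0 + rows; r++) {
        size_t low  = r * total_cols + col0;   /* row-major linear offset */
        size_t high = low + cols;              /* half-open upper bound   */
        printf("[%zu, %zu)\n", low, high);
    }
}

int main(void)
{
    /* A 4x4 tile starting at (2, 2) of a 16x16 grid: four intervals,
     * [34, 38), [50, 54), [66, 70) and [82, 86). */
    print_tile_intervals(16, 2, 2, 4, 4);
    return 0;
}
```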

5.2.3 Scheduling Work

The scheduler is responsible for deciding when to run subranges and on which peer to run them. The scheduler runs on the master and broadcasts messages to the peers when it assigns work. DistCL uses a simple scheme for determining where to run subranges. If the number of subranges equals the number of peers, each peer gets one subrange; however, if the number of subranges is fewer, some peers are never assigned work.

5.2.4 Transferring Buffers

When DistCL executes a kernel, the data produced by the kernel is distributed across the peers in the cluster. The way this data is distributed depends on how the kernel was partitioned into subranges, how these subranges were scheduled, and the write-sets of these subranges. DistCL must keep track of how the data in a buffer is distributed, so that it knows when it needs to transfer data between nodes to satisfy subsequent reads - which may not occur on the same peer that originally produced the data. DistCL represents the distribution of a buffer in a similar way to how it represents dependency information. The buffer is again divided into a set of intervals, but this time each interval is associated with the ID of the node that has last written to it. This node is referred to as the owner of that interval.
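The ownership information can be pictured much like an access-set: a list of contiguous intervals, each tagged with the node that last wrote it. The sketch below reuses the illustrative access_set_t type from earlier; the ownership_t type, the simplifying assumption that write-set intervals line up with existing ownership intervals, and the function name are all illustrative, not DistCL's real bookkeeping.

```c
/* Illustrative ownership tracking: each interval of a buffer is tagged with the
 * node ID that last wrote it (its owner). */
typedef struct {
    size_t low, high;   /* half-open byte range of the buffer */
    int owner;          /* node that last wrote this range    */
} ownership_t;

/* After a subrange finishes on node `writer`, re-tag every ownership interval
 * covered by its write-set.  A real implementation would split and merge
 * intervals; this sketch assumes they already align. */
static void update_ownership(ownership_t *own, size_t n_own,
                             const access_set_t *write_set, int writer)
{
    for (size_t i = 0; i < write_set->count; i++)
        for (size_t j = 0; j < n_own; j++)
            if (own[j].low >= write_set->intervals[i].low &&
                own[j].high <= write_set->intervals[i].high)
                own[j].owner = writer;
}
```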

Buffers

Every time the host program creates a buffer, the master allocates a region of host (not device) memory, equal in size to the buffer, which DistCL uses to cache writes that the host program makes to the buffer. Whether the host program initializes the buffer or not, the buffer's dependency information specifies the master as the sole owner of the buffer. Additionally, each peer allocates, but does not initialize, an OpenCL buffer of the specified size. Generally, most peers will never initialize the entire contents of their buffers because each subrange only accesses a limited portion of each buffer.

Satisfying Dependencies

When a subrange is assigned to a peer, before the subrange can execute, DistCL must ensure that the peer has an up-to-date copy of all the memory in the subrange’s read-set. For every buffer, DistCL compares the ownership information to the subrange’s read-set. If data in the read-set is owned by another node, DistCL initiates a transfer between that node and the assigned node. Once all the transfers have completed, the assigned peer can execute the subrange. When the kernel completes, DistCL also updates the ownership information to reflect the fact that the assigned peer now has the up-to-date copy of the data in the subrange’s write-set. DistCL also implements host-enqueued buffer reads and writes using this mechanism.
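In terms of the earlier illustrative types, satisfying a subrange's dependencies amounts to intersecting its read-set with the buffer's ownership intervals and requesting a transfer for every overlap owned by another node. The sketch below is a simplified picture of that comparison; request_transfer is a hypothetical placeholder for DistCL's actual transfer machinery.

```c
/* Hypothetical placeholder for the peer-to-peer transfer request. */
extern void request_transfer(int src_node, int dest_node,
                             size_t low, size_t high);

/* Before a subrange runs on node `dest`, transfer every part of its read-set
 * that is currently owned by a different node. */
static void satisfy_reads(const access_set_t *read_set,
                          const ownership_t *own, size_t n_own, int dest)
{
    for (size_t i = 0; i < read_set->count; i++) {
        for (size_t j = 0; j < n_own; j++) {
            /* intersection of read interval i and ownership interval j */
            size_t low  = read_set->intervals[i].low > own[j].low
                        ? read_set->intervals[i].low : own[j].low;
            size_t high = read_set->intervals[i].high < own[j].high
                        ? read_set->intervals[i].high : own[j].high;
            if (low < high && own[j].owner != dest)
                request_transfer(own[j].owner, dest, low, high);
        }
    }
}
```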

Transfer Mechanisms

Peer-to-peer data transfers involve both intra-peer and inter-peer operations. For memory reads, data must first be transferred from the GPU into a host buffer. Then, a network operation can transfer that host buffer. For writes, the host buffer is copied back to the GPU. DistCL uses an OpenCL mechanism called mapping to transfer between the host and GPU.
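For illustration, the sketch below shows how one interval of a device buffer might be staged into host memory with the standard OpenCL mapping calls mentioned above. The function is a simplified, error-handling-free example; only the clEnqueueMapBuffer and clEnqueueUnmapMemObject calls are standard OpenCL, and everything else is assumed for the example.

```c
#include <CL/cl.h>
#include <string.h>

/* Stage bytes [low, high) of a device buffer into host memory via mapping.
 * A network operation (e.g. an MPI send) would then transmit host_dst. */
static void copy_interval_to_host(cl_command_queue q, cl_mem buf,
                                  size_t low, size_t high, void *host_dst)
{
    cl_int err;
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                 low, high - low, 0, NULL, NULL, &err);
    memcpy(host_dst, p, high - low);          /* copy into the host buffer */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    clFinish(q);                              /* wait for the unmap to finish */
}
```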

5.3 Experimental Setup

Eleven applications, from the Rodinia benchmark suite v2.3 [12, 13], the AMD APP SDK [2], and GNU libgcrypt [18], were used to evaluate our framework. Each benchmark was run three times, and the median time was taken. This time starts when the host initializes the first buffer and ends when it reads back the last buffer containing the results, thereby including all buffer transfers and computations required to make it seem as if the cluster were one GPU with a single memory. The time for each benchmark is normalized against a standard OpenCL implementation using the same kernel running on a single GPU, including all transfers between the host and device. We group the benchmarks into three categories:

1. Linear compute and memory characteristics: nearest neighbor, hash, Mandelbrot;

2. Compute-intensive: binomial option, Monte Carlo;

3. Inter-node communication: n-body, bitonic sort, back propagation, HotSpot, k-means, LU decom- position.

These benchmarks were chosen to represent a wide range of data-parallel applications. They will provide insight into what type of workloads benefit from distributed execution. The three categories of problems give a spread of asymptotic complexities. This allows the effect of distribution, which primarily affects memory transfers, to be studied with tasks of varying compute-to-transfer ratios. Note that the first two categories consist solely of benchmarks that require no inter-node communication, and the last category contains no such benchmarks. The important characteristics of the benchmarks are summarized in Table 5.1, and each is described below.

We excluded Rodinia benchmarks that required image support, contained data-dependent writes, or contained too few work-groups. From the remaining benchmarks, five were chosen. For the Rodinia benchmarks, many of the problem sizes are quite small, but they were all run with the largest possible problem size, given the input data distributed with the suite. The worst relative standard deviation in runtime for any benchmark is 11%, with the Rodinia benchmarks' runtime varying the most due to their smaller problem sizes. For the non-Rodinia benchmarks, it was usually under 1%.

Table 5.1: Benchmark description

Benchmark | Description | Source | Inputs | Complexity | Work-Items per Kernel | Kernels | Problem Size (bytes)
Nearest neighbour | nearest neighbor search | Rodinia | 42764 locations (n) | O(n) | n | 1 | 12n
Mandelbrot | Mandelbrot set calculation | AMD | 24M points (n), x: 0 to 0.5, y: -0.25 to 0.25, max iterations (k): 1000 | O(kn) | n | 1 | 4n
Hash | sha-256 cracking | Libgcrypt | 24M hashes (n) | O(n) | n | 1 | 32 + n
Bitonic | parallel sort | AMD | 32M elem. (n) | O(n lg^2 n) | n/2 | lg^2 n | 4n
Binomial option | American option pricing | AMD | 786432 samp. (k), 767 iterations (n) | O(kn^2) | k(n + 1) | 1 | 32k
Monte Carlo | Asian option pricing | AMD | 4k sums (n), 1536 samp. (m), 10 steps (k) | O(knm^2) | 8n | k | 2m^2(2n + 1)
n-body | n-body simulation | AMD | 768k bodies (n), 1 and 8 iter. (k) | O(kn^2) | n | k | 16n
k-means | clustering | Rodinia | 819200 points (n), 34 features (k), 5 clusters (c) | kern. 1: O(nk), kern. 2: O(nck) | n | 1, variable | 8nk, 4(2nk + c)
Back propagation | neural network training | Rodinia | 4M input nodes (n), 16 hidden nodes (k) | kern. 1: O(nk), kern. 2: O(nk) | nk | 2 | 4(kn + 3n + 2k + 4), 4(kn + 2n + k + 3)
HotSpot | heat transfer | Rodinia | chip dim. (n): 1k, time-steps per kernel (x): 5, total time-steps (k): 60 | O(n^2) | 256⌈n/(16-2x)⌉^2 | ⌈k/x⌉ | 4⌈n/(16-2x)⌉^2(32(8-x)) + 4n^2
LUD | LU-decomposition | Rodinia | matrix dim. (n): 2k, n/16 - 1 iterations (k), current iter. denoted (i) | kern. 1: O(n), kern. 2: O(n^2), kern. 3: O(n^3) | 16, 2n - 32(i+1), (n - 16(i+1))^2 | k + 1, k, k | 4n^2, 4n^2, 4n^2

5.3.1 Cluster

DistCL was evaluated using a cluster with an Infiniband interconnect [29]. The configurations and theoretical performance are summarized in Table 5.2. The cluster consists of 49 nodes. Though there are two GPUs per node, we use only one to focus on distribution between machines. We present results for 1, 2, 4, 8, 16 and 32 nodes.

We use three microbenchmarks to test the cluster and to aid in understanding the overall performance of our framework. The results of the microbenchmarks are reported in Table 5.3. We first test the performance of memcpy(), by copying a 64 MB buffer between two points in host memory. We initialize both arrays to ensure that all the memory was paged in before running the timed portion of the code. The measured memory bandwidth was 20.3 Gbps.

To test OpenCL map performance, a program was written that allocates a buffer, executes a GPU kernel that increments each element of that buffer, and then reads that buffer back with a map. The program executes a kernel to ensure that the GPU is the only device with an up-to-date version of the buffer. Every time the host program maps a portion of the buffer back, it reads that portion, to force it to be paged into host memory.

Table 5.2: Cluster specifications

Number of Nodes | 49
GPUs Per Node | 2 (1 used)
GPU | NVIDIA Tesla M2090
GPU memory | 6 GB
Shader / Memory clock | 1301 / 1848 MHz
Compute units | 16
Processing elements | 512
Network | 4× QDR Infiniband (4 × 10 Gbps)
CPU | Intel E5-2620
CPU clock | 2.0 GHz
System memory | 32 GB

Table 5.3: Measured cluster performance

Transfer type | Test | 64 MB Latency ms (Gbps) | 8 B Latency ms (Mbps)
In-memory | Single thread memcpy() | 26.5 (20.3) | 0.0030 (21)
Inter-device | OpenCL map for reading | 36.1 (14.9) | 0.62 (0.10)
Inter-node | Infiniband round trip time | 102 (10.5) | 0.086 (3.0)

The program reports the total time it took to map and read the updated buffer. To test the throughput of the map operation, the mapping program reads a 64 MB buffer with a single map operation. Only the portion of the program after the kernel execution completes gets timed. We measured 14.9 Gbps of bandwidth between the host and the GPU. The performance of an 8-byte map was measured to determine its overhead. An 8-byte map takes 0.62 ms, equivalent to 100 kbps. This shows that small fragmented maps lower DistCL's performance.

The third program tests network performance. It sends a 64 MB message from one node to another and back. The round trip over Infiniband took 102 ms, and each one-way trip took only 51 ms on average, yielding a transfer rate of 10.5 Gbps. Since Infiniband uses 8b/10b encoding, this corresponds to a signalling rate of 13.1 Gbps. This still falls short of the maximum signalling rate of 40 Gbps. Even with a high-performance Infiniband network, transfers are slower than maps and memory copies. For this reason it is important to keep network communication to a minimum to achieve good performance when distributing work. Infiniband is designed to be low-latency, and as such its invocation overhead is lower than that of maps.
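The first microbenchmark is straightforward to picture; the sketch below shows the general shape of such a memcpy() bandwidth test, timed with CLOCK_MONOTONIC as in Appendix A. The 64 MB buffer size and the touching of both buffers beforehand follow the description above; everything else is an illustrative assumption rather than the exact benchmark source.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t size = 64u << 20;            /* 64 MB */
    char *src = malloc(size), *dst = malloc(size);
    memset(src, 1, size);                     /* page both buffers in first */
    memset(dst, 2, size);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, size);                   /* the timed copy */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.1f ms, %.1f Gbps\n", sec * 1e3, size * 8.0 / sec / 1e9);
    free(src);
    free(dst);
    return 0;
}
```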

5.4 Results and Discussion

Figure 5.3 shows the speedups obtained by distributing the benchmarks using DistCL, compared to using normal OpenCL on a single node. Compute-intensive benchmarks see significant benefit from being distributed, with binomial achieving a speedup of over 29x when run on 32 peers. The more compute-intensive linear benchmarks, hash and Mandelbrot, also see speedup when distributed. Of the inter-node communication benchmarks, only n-body benefits from distribution, but it does see almost perfect scaling from 1-8 peers and a speedup of just under 15x on 32 peers. For the above benchmarks, we see better scaling when the number of peers is low. While the amount of data transferred remains constant, the amount of work per peer decreases, so communication begins to dominate the runtime. The remaining inter-node communication and linear benchmarks actually run slower when distributed versus using a single machine. These benchmarks all have very low compute-to-transfer ratios, so they are not good candidates for distribution. For the Rodinia benchmarks in particular, the problem sizes are very small. Aside from LU decomposition, they took less than three seconds to run. Thus, there is not enough work to amortize the overheads.

Figure 5.3: Speedup of distributed benchmarks using DistCL.

Figure 5.4 shows the speedups attained by running the benchmarks on DistCL and SnuCL, relative to normal OpenCL run on a single node, for the benchmarks that we were able to port to SnuCL. Overall, the speedups are comparable, especially for the more compute-intense benchmarks, which saw the most speedup. Thus, when DistCL and SnuCL are both capable of distributing workloads, those that distribute well see good performance using either framework. SnuCL distributes a task across a cluster by requiring the programmer to divide what would remain one buffer in DistCL into many buffers, as SnuCL tracks dependencies at the buffer granularity whereas DistCL uses intervals. The change to multiple buffers also requires the programmer to modify their OpenCL kernels so that each kernel can compute part of the output instead of always producing everything. When SnuCL runs faster than DistCL, it is because SnuCL forces the task of tracking buffers onto the programmer, who must now manage additional buffers, whereas DistCL uses extra synchronization in its dependency tracking algorithms.

Figure 5.4: DistCL and SnuCL speedups.

Figure 5.5 shows a run-time breakdown of the benchmarks for the 8-peer case. Each run is broken down into five parts: buffer, the time taken by host program buffer reads and writes; execution, the time during which there was at least one subrange execution but no inter-node transfers; transfer, the time during which there was at least one inter-node transfer but no subrange executions; overlapped transfer/execution, the time during which both subrange execution and memory transfers took place; and other/sync, the average time the master waited for other nodes to update their dependency information. The benchmarks which saw the most speedup in Figure 5.3 also have the highest proportion of time spent in execution. The breakdowns for binomial, Monte Carlo, and n-body are dominated by execution time, whereas the breakdowns for nearest neighbor, back propagation and LU decomposition are dominated by transfers and buffer operations, which is why they did not see a speedup. One might wonder why Mandelbrot sees a speedup, but bitonic and k-means do not, despite the proportion of time they spent in execution being similar. This is because Mandelbrot and hash are dominated by host buffer operations, which also account for a significant portion of execution with a single GPU. In contrast, bitonic and k-means have higher proportions of inter-node communication, which map to much faster intra-device communication on a single GPU.

Table 5.4 shows the amount of time spent managing dependencies. This includes running meta-functions, building access-sets and updating buffer information. The time it took to manage dependencies is expressed in terms of total time per benchmark, time per kernel invocation, and as a proportion of the total runtime. Thus, if a benchmark contains only one kernel, the first two columns of Table 5.4 will be the same. Benchmarks that have fewer buffers, like Mandelbrot and bitonic sort, spend less time applying dependency information per kernel invocation than benchmarks with more buffers. LU decomposition has the most complex access pattern of any benchmark. Its kernels operate over non-coalescable regions that constantly change shape. Further, the fact that none of LU decomposition's kernels update the whole array means that ownership information from previous kernels is passed forward, forcing the ownership information to become more fragmented and take longer to process. With the exception of LU decomposition, the time spent managing dependencies is low, demonstrating that the meta-function-based approach is intrinsically efficient.

Figure 5.5: Breakdown of runtime.

An interesting characteristic of HotSpot is that the compute-to-transfer ratio can be altered by changing the pyramid height. Pyramid height refers to the number of iterations done per kernel, as each iteration operates on a square one element smaller on each side, forming a step-pyramid-like pattern. The taller the pyramid, the higher the compute-to-transfer ratio. However, this comes at the price of doing more computation than necessary. Figure 5.6 shows the speedup of HotSpot run with a pyramid height of 1, 2, 3, 4, 5, and 6. The distributed results are for 8 peers. Single GPU results were acquired using conventional OpenCL. In both cases, the speedups are relative to that framework's performance using a pyramid height of 1. The number of time-steps used was 60 to ensure that each height was a divisor of the number of time-steps. We can see that for a single GPU, the preferred pyramid height is 2. However, when distributed, the preferred height is 5. This is because with 8 peers we have more compute available but the cost of memory transfers is much greater, which shifts the sweet spot toward a configuration that does less transfer per computation.

Benchmarks like HotSpot and LU decomposition that write to rectangular areas of two-dimensional arrays need special attention when being distributed. While the rectangular regions appear contiguous in two-dimensional space, in a linear buffer, a square region is, in general, not a single interval. This means that multiple OpenCL map and network operations need to be performed every time one of these areas is transferred. We modified the DistCL scheduler to divide work along the y-axis to fragment the buffer regions transferred between peers. This results in performance that is 204× slower on average across all pyramid heights, for 8 peers. This demonstrates that the overhead of invoking I/O operations on a cluster is a significant performance consideration.

Figure 5.6: HotSpot with various pyramid heights.

In summary, like SnuCL, DistCL can distribute OpenCL kernels across a cluster, but it is easier to use. DistCL also exposed important characteristics regarding distributed OpenCL execution. Distribution amplifies the performance characteristics of GPUs. Global memory reads become even more expensive compared to computation, and the aggregate compute power is increased. Further, the performance gain seen by coalesced accesses is not only realized in the GPU's bus, but across the network as well. Synchronization, now a whole-cluster operation, becomes even higher latency. There are also aspects of distributed programming not seen with a single GPU. At a certain point, it is better to transfer extra data with few transfers than it is to transfer less data with many transfers.

5.5 Conclusion

DistCL is a framework for distributing the execution of an OpenCL kernel across a cluster, causing that cluster to appear as if it were a single OpenCL device. DistCL shows that it is possible to efficiently run kernels across a cluster while preserving the OpenCL execution model. To do this, DistCL uses meta-functions that abstract away the details of the cluster and allow the programmer to focus on the algorithm being distributed. Speedups of up to 29x on 32 peers are demonstrated. With a cluster, transfers take longer than they do with a single GPU, so more compute-intense approaches perform better. Also, certain access patterns generate fragmented memory accesses. The overhead of doing many fragmented I/O operations is profound and can be impacted by partitioning. By introducing meta-functions, DistCL opens the door to distributing unmodified OpenCL kernels.

Table 5.4: Execution time spent managing dependencies

Benchmark | Total Time (µs) | Time per Kernel Invocation (µs) | Percent of Runtime
Mandelbrot | 109 | 109 | 0.097
Hash | 112 | 112 | 0.15
Nearest neighbor | 120 | 120 | 0.62
Binomial | 126 | 126 | 0.0043
Monte Carlo | 1500 | 150 | 0.028
n-body | 166 | 166 | 0.00045
Bitonic Sort | 29900 | 91.9 | 2.9
k-means | 30400 | 1450 | 4.6
Back propagation | 434 | 217 | 0.017
HotSpot | 23500 | 981 | 5.9
LUD | 2.31 × 10^7 | 60400 | 30

DistCL allows a cluster with 32 GPUs to be accessed as if it were a single GPU. Using this novel framework, we gain insight into both the challenges and potential of unmodified kernel distribution. DistCL is a joint work between the author and Tahir Diop, where the author wrote DistCL itself, Tahir ported the benchmarks to SnuCL, and both the author and Tahir ported the benchmarks to DistCL. DistCL is available at http://www.eecg.toronto.edu/~enright/downloads.html and was described in our 2013 MASCOTS publication [15].

Chapter 6

Conclusion

Distribution of OpenCL kernels across multiple devices enables faster processing in both supercomputing and commodity systems. This thesis introduces three new OpenCL implementations: DualCL, which distributes kernels across a CPU and GPU that can share the same physical memory, m2sOpenCL, which distributes kernels across unified-memory hardware simulated by Multi2Sim, and DistCL, which distributes kernels across a cluster of networked GPUs with distinct memories.

DualCL demonstrates that for compute-intense workloads and workloads that leverage local memory, automatic scheduling of work across heterogeneous devices can result in speedups without any programmer effort. DualCL also exposes limitations of current hardware, showing that memory must be further unified and GPU overheads reduced for programmers to automatically harness the full parallelism of a heterogeneous chip using DualCL's approach. The fact that DualCL's performance was similar on the discrete and single-chip Fusion system is evidence for this claim. That said, even on current hardware, DualCL's dynamic partitioning algorithms outperformed the CPU and GPU when only host memory was used. Much of this speedup was due to the fact that DualCL identified the faster device, and for some benchmarks, such as pathfinder and back propagation, DualCL found a near-optimal operating point. DualCL shows that a completely automatic approach to co-operative execution is viable even with today's level of device and memory integration. Further, we believe that as memories unify and workloads become more complex, DualCL's performance will improve. DualCL's weaknesses, invocation overhead and memory compromises, will be mitigated. Its strength, dynamic scheduling, will become more useful.

DistCL is a framework for distributing the execution of an OpenCL kernel across a cluster, causing that cluster to appear as if it were a single OpenCL device. DistCL shows that it is possible to efficiently run kernels across a cluster while preserving the OpenCL execution model. Unlike DualCL and m2sOpenCL, which are completely automatic, DistCL uses user-supplied meta-functions. This meta-function approach demonstrates that distribution across a cluster can be accomplished without any actual knowledge of the cluster on the programmer's behalf. DistCL only requires the programmer to describe the memory access patterns of their kernel, something the programmer already knows. DistCL shows a speedup for compute-intense kernels, and gives us two interesting insights into distribution over networks: fragmented accesses have a big impact on distributed performance, and the higher latency of global memory changes the optimal operating point of some algorithms. DistCL has been released as an open-source project.


6.1 Future Work

One source of GPU invocation overhead, CPU contention, can be controlled by DualCL to a degree. DualCL could be augmented to interrupt CPU OpenCL execution when launching work on the GPU. Also, some of the issues with device memory on the Fusion system might be solved by introducing write-combining buffer flushes for certain host-side operations, even if the host still has device memory mapped. DualCL was also written to be extensible to more than two devices; thus, it could be used to study distribution over multiple GPUs as well.

m2sOpenCL will continue to become more valuable as the simulator it runs on, Multi2Sim, matures. Multi2Sim's NMOESI protocol will allow for studying device interaction on the micro-architectural level, enabling the design of new hardware coherence mechanisms for CPU and GPU memories.

DistCL can be improved by adding kernels that re-arrange data in buffers to reduce the fragmentation in data transfers. Also, the compiler techniques used in contemporary work [26] for single-computer kernel distribution can be applied to DistCL to eliminate the need for meta-functions, though the compiler techniques would still be limited to data-independent kernels.

The three works in this thesis, DualCL, m2sOpenCL and DistCL, show the promise that heterogeneous processing can bring to parallel programming. At the same time, they also highlight the challenges that engineers face when they work to realize that promise.

Appendices

Appendix A

The Accuracy of the Multi2Sim CPU Device

To characterize the performance of our AMD Fusion CPU, we wrote five microbenchmarks in x86 assembly. Each microbenchmark repeats a certain type of operation, or pattern of memory accesses, in an unrolled loop. They were run on the Fusion system, and then in Multi2Sim, using the benchmarks as a way of validating our CPU models. On real hardware, to measure time, our microbenchmarks use the CLOCK_MONOTONIC timer exposed by the clock_gettime() function, which returns the number of nanoseconds that have elapsed since a certain undefined point in the past. Multi2Sim's authors enhanced Multi2Sim to support this function using the cycle count of the simulation and the virtual 'clock frequency' of the CPU.

A.1 Microbenchmarks

Each of these five micro-benchmarks was written to look at specific architectural features of the CPU.

A.1.1 Register

This test involves moving a value to one register and then back again, over and over, creating a cycle of dependent register move operations. This test should take a cycle on most modern CPUs and is used as a baseline sanity check.

Listing A.1: Register Test

    mov ecx, ebx
    mov ebx, ecx

Listing A.1 shows a single iteration of this cycle. This iteration was unwound 16 times and put in a loop to create the final benchmark.

A.1.2 Dependent Add

This test is just like the register test, except that the move instructions have been replaced by add instructions, creating a chain of dependent add operations.


A.1.3 Independent Add

This benchmark tries to test how many in-flight add operations can be supported at once by the CPU. It does this by issuing 5 add operations at the same time. Though each add operation has one input register in common, the output register (on the left in Intel syntax) differs for each add, making them independent operations. The sequence of independent adds can be seen in Listing A.2.

Listing A.2: Independent Add Test

    add ecx, ebx
    add edx, ebx
    add edi, ebx
    add esi, ebx
    add ebp, ebx

So that each loop iteration has 16 operations, just like the previous two benchmarks, three of these independent add blocks and an extra add ecx, ebx are included in the loop of the benchmark.

A.1.4 Memory Throughput

This benchmark is designed to stress the memory system of the processor. It involves a series of sequential memory accesses with addresses that can all be calculated independently.

Listing A.3: Memory Throughput Test

    mov edx, [edi + esi]
    mov [edi + esi], edx
    mov edx, [edi + esi + 4]
    mov [edi + esi + 4], edx
    mov edx, [edi + esi + 8]
    mov [edi + esi + 8], edx
    mov edx, [edi + esi + 12]
    mov [edi + esi + 12], edx
    mov edx, [edi + esi + 16]
    mov [edi + esi + 16], edx
    mov edx, [edi + esi + 20]
    mov [edi + esi + 20], edx
    mov edx, [edi + esi + 24]
    mov [edi + esi + 24], edx
    mov edx, [edi + esi + 28]
    mov [edi + esi + 28], edx

Listing A.3 shows the entire loop for the memory throughput test. Each pair of instructions is a read and write to a sequential memory address. Though these instructions do have an interdependence on edx, this can be resolved by the CPU after the memory accesses. This benchmark not only supports varying the number of operations performed, but also the size of the memory region that these operations access, allowing the effect of data locality to be explored.

Table A.1: Results from Multi2Sim Configuration

Benchmark | Problem Size | Measured Cycles / Op | Simulated Cycles / Op | Percent Error
register | - | 1.00 | 1.00 | 0.30
add dep | - | 1.00 | 1.00 | 0.30
add indep | - | 0.38 | 0.37 | 0.99
mem latency | 32 kB | 3.00 | 3.99 | 32.95
mem latency | 64 kB | 3.01 | 5.91 | 96.05
mem latency | 512 kB | 15.05 | 198.20 | 1217.35
mem latency | 1 MB | 27.98 | 210.73 | 653.21
mem latency | 8 MB | 218.95 | 221.57 | 1.2
mem throughput | 32 kB | 1.00 | 3.50 | 248.91
mem throughput | 64 kB | 1.01 | 4.17 | 314.68
mem throughput | 512 kB | 1.34 | 17.65 | 1215.23
mem throughput | 1 MB | 1.81 | 17.64 | 876.89
mem throughput | 8 MB | 3.14 | 17.66 | 462.59

A.1.5 Memory Latency

The memory latency test tests the latency of memory operations by chasing a circular linked list in memory. The design of this linked list is important. Ideally, elements that are adjacent in the list should be far apart in memory. If they are in the same cache line, the CPU may hit on the second element, making the memory latency seem lower than it actually is. Secondly, the whole linked list should be confined to a certain region of memory, so that the effects of locality can be accurately determined. To accomplish this, first, a linked list where each element is visited in order is created in memory. Then, to destroy locality, millions of swaps are performed to scramble the order that the elements are visited in. To make these tests reproducible, and to avoid having to perform these swaps in simulation, scrambled linked lists are pre-generated and stored in files so they can be loaded into the benchmark. The benchmark's loop just consists of eight mov edx, [edx] instructions.
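The construction of such a scrambled list can be illustrated in C as below: build a circular chain of pointers, shuffle the visiting order, and then chase it with repeated dereferences. The real benchmark pre-generates the list, stores it in a file and chases it in unrolled x86 assembly; the code here is only an illustration of the idea.

```c
#include <stdlib.h>

/* Build a circular pointer-chase list whose elements are visited in a
 * scrambled order, defeating cache-line locality.  Chase it with p = *p. */
void **build_chase_list(size_t count)
{
    void **nodes  = malloc(count * sizeof(void *));
    size_t *order = malloc(count * sizeof(size_t));
    for (size_t i = 0; i < count; i++)
        order[i] = i;
    for (size_t i = count - 1; i > 0; i--) {   /* shuffle the visiting order */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < count; i++)         /* link in the shuffled order */
        nodes[order[i]] = &nodes[order[(i + 1) % count]];
    free(order);
    return nodes;
}
```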

A.2 Issues and Best Configuration

Table A.1 shows the results of running a simulation where the L1 and L2 cache latencies were each set to 1 cycle. Although the first three register-based tests and the smaller memory-based tests are accurate, the simulated CPU was much slower for larger memory sizes. This is a problem in particular because changing the access latency of main memory also affects the simulation of the GPU, which has different performance characteristics. We believe this to be a bug with Multi2Sim's memory hierarchy simulation, and we have notified Multi2Sim's developers about the issue, although it has not been resolved.

Appendix B

The DistCL and Metafunction Tool Manual

B.1 Introduction

DistCL is an OpenCL implementation that automatically distributes OpenCL kernels across a cluster of GPU-equipped computers. Unlike other distributed OpenCL implementations, which simply make it easier to schedule work on remote devices, DistCL actually makes all the GPUs in the cluster appear as though they are one device. DistCL does this to increase the compute parallelism available for kernels without manually dividing them up. In particular, DistCL does not focus on memory. DistCL's operation was described at the IEEE MASCOTS 2013 conference. The paper submitted to that conference is available at http://www.eecg.toronto.edu/~enright/MASCOTS13.pdf. In addition, any updated versions of DistCL and this document are available at http://www.eecg.toronto.edu/~enright/downloads.html.

This document describes how to build DistCL, link it to an OpenCL host program, and use DistCL to distribute that program's kernels across a pre-existing MPI-based GPU cluster. It also gives insights into how to write metafunctions, which DistCL needs in addition to the host program. When DistCL distributes an OpenCL kernel, it partitions that kernel's NDRange into smaller pieces called subranges. DistCL will pass a metafunction the position and size of a subrange, and that metafunction must identify which regions of each buffer that subrange will access. This information allows DistCL to correctly distribute data across the cluster. DistCL comes with a metafunction writing tool that helps programmers write metafunctions.

DistCL is released under the GNU LGPL version 3. Our hope is that you enhance and build upon this framework. As such, this document includes a section explaining how to include new kernel partitioning and scheduling algorithms into DistCL. That said, DistCL is released as-is. Its authors will not be supporting it.

B.1.1 Intended Audience

This document assumes the reader is familiar with OpenCL and the build process on POSIX systems, including concepts such as building and linking to static and dynamic libraries. It also assumes familiarity

with MPI (Message Passing Interface) and the way in which it launches programs across a cluster. The reader might also find it helpful to read the first half of the conference paper, as it introduces DistCL's approach in far more detail than this document.

B.2 Building DistCL

DistCL builds in Linux. In particular, DistCL requires the following to build:

• GCC g++ 4.5 or higher (a version that supports C++11 lambda expressions)

• Google protocol buffers (tested with version 2.4.1)

• OpenMPI (tested with version 1.4.3)

• GNU Portable Threads version 2.0.7

• An Underlying OpenCL Implementation

DistCL does not use autotools to build. Instead, the Makefile that comes with this package uses pkg-config. DistCL assumes the following packages will be available:

• protobuf for Google Protocol Buffers

• pth for GNU Portable Threads

In particular, DistCL does not need to be configured with the underlying OpenCL implementation at compile time. Instead, it loads the OpenCL library dynamically at run time based on the configuration file specified in section B.4.1. This package includes the OpenCL 1.2 headers, so DistCL should be able to link to any OpenCL implementation without issue. To build DistCL, go to the root of this archive and type make. This should build the DistCL static library and the vector benchmark.

B.3 Linking DistCL to a Host Program

This package produces a static library called libDistCL.a which includes the compiled DistCL sources. In order to build a program that uses DistCL, it must link to this static library, rt (the Linux real time library) and m (the math library) as well as the aforementioned protobuf and pth libraries.

B.3.1 Linking Metafunctions

DistCL was designed to accept the source code for metafunctions and compile them at runtime, in parallel with the normal OpenCL kernel compilation work flow. To accept metafunction sources, DistCL has an additional OpenCL call, clLoadMetafunctionSource. DistCL accomplishes this compilation by forking an instance of gcc with -x c -fPIC -g -Wall -Werror -I -shared as flags; however, some versions of MPI forbid MPI processes from forking, so this call has been disabled. Instead, to allow DistCL to link against metafunctions, those metafunctions must be compiled into a shared library called meta.so in the same directory as the binary, and it must be accessible to binaries running on every node in the network.

B.4 Running DistCL

DistCL uses MPI to run. It relies on MPI both for communication between peers and for launching the DistCL program on each of the nodes involved in execution. The host program will only run on one of the nodes, which DistCL calls the master. No kernels will execute on the master. Instead, DistCL distributes the kernels amongst the remainder of the nodes in the MPI job, which DistCL calls peers. Because of MPI's programming model, identical processes are launched on each node; however, when a peer makes a call to clCreateContext, it enters an event loop and stays there until the master forces it to terminate when the host program calls clReleaseContext.

To run a DistCL program, because of the master, the -np parameter passed to mpirun (or the number of peers submitted to the cluster's system) must be one greater than the number of peers. For instance, to run the vector addition example from this package on 4 peers over a 2^16-element array with verification enabled, in the bin directory, type:

    $ mpirun -np 5 ./vector 65536 v

One process becomes the master and the rest become peers. Note that DistCL’s strategy of launching peers means that any work done before the call to clCreateContext is done on every node. This can be avoided by moving the call earlier in the program and does not affect DistCL’s potency as a research tool.

B.4.1 distcl.conf

DistCL needs to be configured with various runtime parameters in order to run. These parameters must be present as key-value pairs in a text file named distcl.conf placed in the working directory of the program. Each key-value pair is on a separate line and has the key, a space, and the value. Neither keys nor values ever have spaces in distcl.conf.

Key | Value Type | Value
base_lib | String | Path to the underlying OpenCL library (passed to dlopen).
device_type | Integer | The numerical value of the CL_DEVICE_TYPE_* constant of the type of device to target for the underlying library (4 for GPU).
subranges_per_device | Integer | The number of subranges to make per device. Best set to 1.
master_debug | Integer | Set to 1 to get the master to wait so you can attach a debugger. Otherwise, set to 0.
transfer_chunk_size | Integer | Maximum size of MPI_Send and MPI_Recv. Set to around 33554432 (32 × 2^20).
watchdog_timeout | Integer | Time in seconds before the log is written. Helpful when DistCL freezes. Set to 1800 to avoid mid-program writes.
log_enabled | Integer | 0 to disable logging. Increasing numbers increase the verbosity.
check_overlap | Integer | Set to 0 for performance. 1 to enable checks against overlapping regions produced by write metafunctions.
partition_columns | Integer | Leave this at 0 for most kernels.
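For reference, a distcl.conf assembled from the keys above might look like the following. The library path is only an example, and the numeric values simply repeat the suggestions from the table.

```
base_lib /usr/lib/libOpenCL.so
device_type 4
subranges_per_device 1
master_debug 0
transfer_chunk_size 33554432
watchdog_timeout 1800
log_enabled 0
check_overlap 0
partition_columns 0
```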

B.5 Writing Metafunctions

Once DistCL has partitioned an NDRange into subranges, it calls programmer-supplied metafunctions to determine what regions of memory each subrange will access. Metafunctions do this by relating the size and position of a subrange to the size and position of the parts of each buffer that the kernel will access. There are two main metafunctions for each kernel: a read metafunction, used for reading, and a write metafunction, used for writing. There is also a third metafunction that simply tells DistCL whether each parameter is a buffer or not. This is required because the clGetKernelArgInfo call that reveals information about a parameter is not available in older OpenCL implementations. Metafunctions are associated with their corresponding OpenCL kernels based on their name. The table below specifies the naming convention used, where kernel-name is the name of the kernel.

Purpose | Naming Convention
Read Metafunction | is_buffer_range_read_<kernel-name>
Write Metafunction | is_buffer_range_write_<kernel-name>
Buffer Identification | is_param_buffer_<kernel-name>

The read and write metafunctions are passed the same set of parameters. They are summarized in the table below, in the order they should appear in a function. In addition to the table, the example DistCL programs also have metafunctions that can serve as templates.

C declaration | Description
const void **params | An array of pointers to the immediate (non-buffer) arguments passed to the kernel.
const size_t *global | The global size of the NDRange. The metafunction is assumed to know how many dimensions there are, and thus the number of elements in global.
const size_t *subrange | The size of the current subrange in work-items.
const size_t *local | The local size in work-items.
const size_t *subrange_offset | The start of the subrange in work-items.
unsigned int param_num | The argument index of the buffer to consider.
size_t start | The starting position of the buffer to consider.
size_t *next_start | The next part in the buffer that should be considered.

In addition to accepting these parameters, the read and write metafunctions return an integer which specifies whether the memory between start and *next_start is accessed or not. For each buffer-subrange pair, DistCL iteratively calls both the read and write metafunctions. Initially, DistCL calls the metafunction with start set to zero. The metafunction must determine whether the first byte of the buffer (at index zero) is accessed and, in addition, set *next_start to the point in the buffer where this condition changes. DistCL will then call the metafunction again with start now set to the value returned in *next_start. It is OK if the metafunction's return value remains the same, but returning the same value twice does imply that *next_start could have been set to the second value originally. This iterative process continues until the metafunction sets *next_start to the size of the buffer, in bytes. In this approach, the metafunction and DistCL walk through the buffer, at a pace controlled by the metafunction, and mark sections of it as either accessed or not accessed.
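For concreteness, a write metafunction for the vector kernel of Listing 5.1 could follow the same pattern as the read metafunction in Listing 5.2, except that only parameter 2 (out) is marked as accessed. The code below is an illustration written for this manual, not a listing taken from the DistCL sources.

```c
#include <stddef.h>
#include "metafunction.h"   /* declares require_region() */

/* Illustrative write metafunction for the vector kernel: only parameter 2
 * (out) is written, over the element range computed by this subrange. */
int is_buffer_range_write_vector(
    const void **params, const size_t *global, const size_t *subrange,
    const size_t *local, const size_t *subrange_offset,
    unsigned int param_num, size_t start, size_t *next_start)
{
    int ret = 0;
    *next_start = sizeof(int) * global[0];     /* default: rest of the buffer */
    if (param_num == 2) {                      /* only `out` is written       */
        start /= sizeof(int);                  /* bytes -> elements           */
        ret = require_region(1, global, subrange_offset, subrange,
                             start, next_start);
        *next_start *= sizeof(int);            /* elements -> bytes           */
    }
    return ret;
}
```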

B.5.1 Using require_region

It can be cumbersome to try to sort out regions of buffers manually, particularly when the buffers hold multidimensional arrays. For linear arrays and arrays organized in row-major order, DistCL provides a helper function that automates the iterative process of identifying accessed and not-accessed regions. This function is called require_region. To use require_region, a metafunction must include the header metafunction.h, which is included in this package. The parameters passed to require_region are analogous to those passed to the read and write metafunctions. They are summarized in the table below.

C declaration | Description
int dim | The dimensionality of the array.
const size_t *total_size | The total size of the array, in elements, in each dimension.
const size_t *required_start | The start of the required region, in elements, in each dimension.
const size_t *required_size | The size of the required region, in elements, in each dimension.
size_t start | The current linear address being compared against the required region.
size_t *next_start | The next linear address to compare.
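As a usage illustration, the fragment below shows how a metafunction for a kernel that works on a row-major rows x cols element grid might call require_region for one rectangular tile. The tile coordinates and the conversion to elements are assumptions made for the example; how they are derived from the subrange is kernel-specific.

```c
#include <stddef.h>
#include "metafunction.h"   /* declares require_region() */

/* Report the linear element intervals of one tile_h x tile_w tile whose
 * top-left element is (tile_row, tile_col) in a rows x cols row-major grid.
 * start and *next_start are in elements here; a metafunction would convert
 * to and from bytes around this call. */
static int tile_intervals(size_t rows, size_t cols,
                          size_t tile_row, size_t tile_col,
                          size_t tile_h, size_t tile_w,
                          size_t start, size_t *next_start)
{
    const size_t total[2]  = { rows,     cols     };
    const size_t begin[2]  = { tile_row, tile_col };
    const size_t extent[2] = { tile_h,   tile_w   };
    return require_region(2, total, begin, extent, start, next_start);
}
```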

B.5.2 The Tool

Writing metafunctions involves capturing the relationship between the parameters a work-item is passed, the work-item's IDs and the memory addresses it accesses. Therefore, writing metafunctions for unfamiliar kernels involves discovering the memory access patterns of that kernel. To help this process along, and to help validate the correctness of metafunctions, this package includes a tool, separate from the DistCL code base, that analyses the access patterns of OpenCL kernels and compares them to metafunctions, with minimal programmer intervention. This tool makes writing metafunctions faster and less error-prone.

This tool acts as a host program to an arbitrary OpenCL kernel. As such, it reads a JSON-format configuration file that describes to the tool how many buffers to allocate, what size they should be, what immediate parameters to pass to the kernel, the geometry to invoke the kernel with, and the information to pass to the metafunction. The tool executes the kernel, passing it zeroed buffers, and compares the changes in each buffer to the regions that the metafunction said would be accessed. If the metafunction did not accurately predict the changed regions, the tool prints a summary of the missing or extraneous regions of each mismatching buffer.

When using the tool to detect writes, the user should modify the source of the kernel to prevent false negatives that would occur if the kernel were to write a value of zero back to the buffer. The easiest way to do this is to edit the kernel so that each write just writes a value of 1 instead of whatever computed value it would usually write. Similarly, the tool cannot detect reads, so for validating read metafunctions, the user should disable all writes and change all reads into writes of a value of 1. Keep in mind that increment operations like ++ and += are both reads and writes, and hence must be accounted for in both versions of the kernels.

The JSON file passed to the tool contains a JSON object with 5 fields. A description of each field follows.

Constants

The constants section is an array of constants. Each constant is an object with a name and value field. The name can be any string that is at least two characters in length and does not start with a number. The value is an integer. It may be an integer literal or an RPN expression. RPN expressions are arrays that contain integer literals, the +, -, *, / and % operators, and previously-declared constants.

For instance, the listing below shows a constants section that defines three constants. ARRAY_COUNT is 128, ELEMENT_SIZE is 4, and ARRAY_SIZE is their product, 512. Constants and RPN expressions, which can be used throughout the rest of the configuration file, can be used to ensure that important relationships between buffer sizes and NDRanges stay intact while testing the kernel over various input sizes, and they can make it easier to change which subrange is being tested.

    "constants": [
        {"name": "ARRAY_COUNT", "value": 128},
        {"name": "ELEMENT_SIZE", "value": 4},
        {"name": "ARRAY_SIZE", "value":
            ["ARRAY_COUNT", "ELEMENT_SIZE", "*"]}
    ]

Buffers

The buffers section is an array of buffer declaration objects. These buffers can be passed to the kernel and considered by the metafunction. Each buffer declaration has a name field, a string, which can be used to refer to it later, and a size field that specifies the buffer's size in bytes. Buffer declarations have a gran field, which specifies the granularity, in bytes, at which changes should be tracked. For instance, if the buffer contains 4-byte integers, the granularity should be set to 4, so that the tool knows that a change in any one of four consecutive bytes represents a change in all of them; otherwise, the tool might mistakenly conclude that a metafunction predicted extra changes because some bytes within a single element stayed the same.

Finally, buffer declarations also have a geometry field. This field is an array of integer expressions (integer literals, constants or RPN expressions). The geometry field represents the logical organization of the array, particularly if it is a multidimensional array. This affects how the tool reports missing and extra regions of change back to the user. For instance, different work-groups within a kernel that does a matrix operation may operate on rectangular regions of the matrix. The geometry field allows the tool to report back in terms of rectangular regions instead of just sections of linear buffer. The numbers in the geometry field are in terms of array elements, so the product of the numbers in the geometry field and the gran field should equal the size field. For instance, the buffers section below declares two buffers. The first has 128 4-byte elements, which are logically accessed as a linear array. The second has 256 4-byte elements, which are accessed as a 16 × 16 element array.

    "buffers": [
        {"name": "linear_buffer",
         "gran": 4, "size": "ARRAY_SIZE",
         "geometry": ["ARRAY_COUNT"]},
        {"name": "square_buffer",
         "gran": 4, "size": 1024,
         "geometry": [16, 16]}
    ]

Kernel

The kernel section specifies what kernel to run. It is an object with two fields: file, the source file for the kernel, and name, the name of the kernel. For instance, below, the kernel section specifies that a kernel named test_kernel in kernel.cl be run.

    "kernel": {
        "file": "kernel.cl",
        "name": "test_kernel"
    }

Arguments

The arguments section describes the arguments that should be passed to the kernel. It is an array of argument objects. The order that the argument objects appear in corresponds to the order that they will be passed to the OpenCL kernel. Each argument object has a type field which has four valid values: int (for immediate integers), float (for immediate floating point values), local (for local memory), and buffer (for global memory). If the type is buffer, the only other value in the argument object is name, which should be set to the name of the buffer as declared in the buffers section. If the type is local, the only other value is size, which is an integer expression representing the size of the array in bytes. Finally, if the type is int or float, the only other field is value: an expression representing the value of the immediate parameter. The arguments section below specifies three arguments. The kernel will be passed the linear_buffer buffer declared above as its first parameter. Its second parameter is a local memory buffer 4 bytes in size, and its third parameter is the number 42.

    "arguments": [
        {"type": "buffer", "name": "linear_buffer"},
        {"type": "local", "size": "ELEMENT_SIZE"},
        {"type": "int", "value": 42}
    ]

Invocation

The invocation section specifies the parameters sent to clEnqueueNDRangeKernel. It is an object with three fields: offset, global and local, corresponding to the global_work_offset, global_work_size and local_work_size parameters of clEnqueueNDRangeKernel, respectively. Each of these three fields is an array of integer expressions. Note that in order to simulate the execution of different subranges, the user must tweak the value of global and offset manually. Depending on the work-item functions the kernel calls, the user might also have to edit the source of the kernel so that calls like get_global_size() and get_group_id() work as expected. DistCL applies code transformations to this effect automatically, but the tool does not modify the kernel at all. The invocation section below specifies a one-dimensional kernel invocation of 128 work-items grouped into work-groups of 8.

    "invocation": {
        "offset": [0],
        "global": ["ARRAY_COUNT"],
        "local": [8]
    }

Metafunction

The metafunction section is an object that contains information about what metafunction to run and what information to pass it. Unlike DistCL, the tool only evaluates a single subrange every time it is run. To test every subrange in a kernel, the tool must be called several times, with different values in the invocation and metafunction sections. The metafunction section has a file field which contains the name of the shared library that has the metafunction in it, a name field with the name of the metafunction, and global, subrange_offset, subrange, and local fields, all arrays of integer expressions, that should be passed to the metafunction. Note that, in general, the subrange value in the metafunction section corresponds to the global field in the invocation section, and the subrange_offset field to the offset (global work offset) field. Again, this is because the tool does not divide the kernel up into subranges, but metafunctions operate on subranges exclusively. The metafunction section below specifies the name and location of the metafunction, that it has a global size of ARRAY_COUNT^2, and that the subrange currently being tested has an offset of 0, a size of ARRAY_COUNT, and a local size of 8, values which correspond to the invocation section.

    "metafunction": {
        "file": "./meta.so",
        "name": "is_buffer_range_read_test_kernel",
        "global": [["ARRAY_COUNT", "ARRAY_COUNT", "*"]],
        "subrange_offset": [0],
        "subrange": ["ARRAY_COUNT"],
        "local": [8]
    }

When a configuration file with all these sections present is passed to the tool, the tool will run the kernel and metafunctions and compare their output. For each buffer, it will print whether there were any omitted or extra regions specified by the metafunction. These regions will be in terms of the geometry specified for that buffer.

B.5.3 Building the Tool

The tool is in the tool directory of this archive. To build it, type make. The tool requires a library called cJSON in order to parse its JSON configuration file. cJSON can be downloaded from http://sourceforge.net/projects/cjson/. The Makefile assumes a pkg-config package called cJSON exists for the cJSON library. Unlike DistCL, the tool does link to the underlying OpenCL implementation at compile time; therefore, the variables OPENCL_INC and OPENCL_LIB need to be set in the Makefile accordingly. With this tool, it is easy for programmers to quickly write and verify metafunctions.

B.6 Enhancing DistCL

DistCL was written to be extensible. Some of the algorithms used in DistCL, particularly those that could be interesting for future research, have been implemented as interchangeable strategies, so they can be easily exchanged for other algorithms.

B.6.1 Partition Strategies

All partition strategies are descendants of PartitionStrategy, which is defined in partition-strategy.h. The partition method of a PartitionStrategy is given a target number of subranges and the group count in each dimension, and it must populate a vector of objects, where each object holds the start and offset of a subrange in work-groups. To change the partition strategy used, change the constructor being called on line 55 of kernel-execution.cpp.

B.7 Limitations

This section contains a few things the reader should know about when using DistCL.

B.7.1 With AMD 5000 Series GPUs

As of AMD APP SDK 2.7, at least in 32-bit Linux, there is a bug involving how AMD’s OpenCL implementation handles multiple mapping operations enqueued to an AMD 5000 Series GPU. As a result, these GPUs do not work with DistCL. The AMD CPU device, on the other hand, works perfectly.

B.7.2 Completeness of OpenCL Implementation

According to its documentation, Open MPI does not robustly support the MPI_THREAD_MULTIPLE programming model; therefore, each DistCL process runs in a single OS thread, using user-schedulable contexts to enable multiple I/O operations to run concurrently. This means that all operations in DistCL are blocking, and there is no event model. Also, not all OpenCL calls are implemented; the file mastercl.cpp contains all of the implemented OpenCL calls.
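
The sketch below illustrates the general user-schedulable-context mechanism with POSIX ucontext (getcontext, makecontext, swapcontext). It is only a minimal illustration under the assumption that a ucontext-style facility is what is meant by user-schedulable contexts; it is not DistCL's actual scheduling code, and the names ioTask, mainCtx, and ioCtx are made up for the example.

#include <cstdio>
#include <ucontext.h>

static ucontext_t mainCtx, ioCtx;     // scheduler context and one task context
static char ioStack[64 * 1024];       // stack for the task context

static void ioTask() {
    std::printf("io task: start an I/O operation, then yield while it completes\n");
    swapcontext(&ioCtx, &mainCtx);    // yield back to the scheduler
    std::printf("io task: resumed, finishing\n");
    // returning switches to uc_link, i.e. back to mainCtx
}

int main() {
    getcontext(&ioCtx);
    ioCtx.uc_stack.ss_sp = ioStack;
    ioCtx.uc_stack.ss_size = sizeof(ioStack);
    ioCtx.uc_link = &mainCtx;         // where control goes when ioTask returns
    makecontext(&ioCtx, ioTask, 0);

    swapcontext(&mainCtx, &ioCtx);    // run the task until it yields
    std::printf("scheduler: task yielded; another task could run here\n");
    swapcontext(&mainCtx, &ioCtx);    // resume the task to completion
    std::printf("scheduler: task finished\n");
    return 0;
}

With several such contexts, a single OS thread can switch between tasks at well-defined yield points, allowing multiple I/O operations to be in flight concurrently without requiring MPI_THREAD_MULTIPLE.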
