
Western University
Electronic Thesis and Dissertation Repository
March 24, 2017

MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

Xiaohui Chen, The University of Western Ontario

Supervisor: Marc Moreno Maza, The University of Western Ontario

Graduate Program in Computer Science
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy
© Xiaohui Chen 2017


Recommended Citation:
Chen, Xiaohui, "MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators" (2017). Electronic Thesis and Dissertation Repository. 4429. https://ir.lib.uwo.ca/etd/4429

Abstract

Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent: an optimization preferred on one machine may not be suitable for another machine. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest.

This thesis aims at developing new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) perform automatic code translation between concurrency platforms targeting multi-core architectures; (2) provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and pipelining parallelism; (3) generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block); (4) optimize code depending on parameters that are unknown at compile-time.

Keywords: source-to-source compiler, pipelining, comprehensive parametric CUDA kernel generation, concurrency platforms, high-level parallel programming.

Acknowledgments

The work discussed in this dissertation would not have been possible without the constant encouragement and insight of many people. I must express my deepest appreciation and thanks to my supervisor, Professor Marc Moreno Maza, for his enthusiasm and patience during the course of my PhD research.

I also would like to thank my friends Ning Xie, Changbo Chen, Yuzhen Xie, Robert Moir and Colin Costello at the University of Western Ontario. My appreciation also goes to my IBM colleagues in the compiler group, Wang Chen, Abdoul-Kader Keita, Priya Unnikrishnan and Jeeva Paudel, for their discussions on developing compiler techniques.

Many thanks to the members of my supervisory committee, Professor John Barron and Professor Michael Bauer, for their valuable feedback. Also, my sincere thanks and appreciation go to the members of my examination committee, Professor Robert Mercer, Professor Robert Webber, Professor David Jeffrey and Professor Jeremy Johnson, for their comments and inspiration.

Last but not least, I am also greatly indebted to my family, who deserve too many thanks to fit on this page.

Contents

Abstract ii

Acknowledgments iii

List of Figures vii

List of Tables xi

List of Appendices xii

1 Introduction 1
   1.1 Dissertation Outline 2
   1.2 Contributions 3
   1.3 Thesis Statement 3

2 Background 4
   2.1 Concurrency Platforms 4
      2.1.1 CilkPlus 4
      2.1.2 OpenMP 5
      2.1.3 GPU 5
   2.2 Performance Measurement 7
      2.2.1 Occupancy and ILP 8
      2.2.2 Work-Span Model 9
      2.2.3 Master Theorem 9
   2.3 Compilation Theory 10
      2.3.1 The State of the Art of Compilers 10
      2.3.2 Source-to-Source Compiler 11
      2.3.3 Automatic Parallelization 12

3 A Metalanguage for Concurrency Platforms Based on the Fork-Join Model 13
   3.1 Basic Principles and Execution Model 14
   3.2 Core Parallel Constructs 14
   3.3 Variable Attribute Rules 18
   3.4 Semantics of the Parallel Constructs in MetaFork 20
   3.5 Supported Parallel APIs of Both CilkPlus and OpenMP 21
      3.5.1 OpenMP 23
      3.5.2 CilkPlus 26

   3.6 Translation 27
      3.6.1 Translation from CilkPlus to MetaFork 28
      3.6.2 Translation from MetaFork to CilkPlus 29
      3.6.3 Translation from OpenMP to MetaFork 31
      3.6.4 Translation from MetaFork to OpenMP 39
   3.7 Experimentation 39
      3.7.1 Experimentation Setup 41
      3.7.2 Correctness 42
      3.7.3 Comparative Implementation 42
      3.7.4 Interoperability 43
      3.7.5 Parallelism Overheads 49
   3.8 Summary 50

4 Applying MetaFork to the Generation of Parametric CUDA Kernels 52
   4.1 Optimizing CUDA Kernels Depending on Program Parameters 53
   4.2 Automatic Generation of Parametric CUDA Kernels 55
   4.3 Extending the MetaFork Language to Support Device Constructs 57
   4.4 The MetaFork Generator of Parametric CUDA Kernels 61
   4.5 Experimentation 64
   4.6 Summary 67

5 MetaFork: A Metalanguage for Concurrency Platforms Targeting Pipelining 74
   5.1 Execution Model of Pipelining 74
   5.2 Core Parallel Constructs 75
   5.3 Semantics of the Pipelining Constructs in MetaFork 77
   5.4 Summary 78

6 MetaFork: The Compilation Framework 79
   6.1 Goals 79
   6.2 MetaFork as a High-Level Parallel Programming Language 81
   6.3 Organization of the MetaFork Compilation Framework 83
   6.4 MetaFork Compiler 83
      6.4.1 Front-End of the MetaFork Compiler 84
      6.4.2 Analysis & Transformation of the MetaFork Compiler 85
      6.4.3 Back-End of the MetaFork Compiler 87
   6.5 User Defined Pragma Directives in MetaFork 87
      6.5.1 Registration of Pragma Directives in the MetaFork Preprocessor 88
      6.5.2 Parsing of Pragma Directives in the MetaFork Front-End 89
      6.5.3 Attaching Pragma Directives to the Clang AST 90
   6.6 Implementation 90
      6.6.1 Parsing MetaFork Constructs 90
      6.6.2 Parsing CilkPlus Constructs 94
   6.7 Summary 97

7 Towards Comprehensive Parametric CUDA Kernel Generation 98

   7.1 Comprehensive Optimization 102
      7.1.1 Hypotheses on the Input Code Fragment 102
      7.1.2 Hardware Resource Limits and Performance Measures 103
      7.1.3 Evaluation of Resource and Performance Counters 104
      7.1.4 Optimization Strategies 105
      7.1.5 Comprehensive Optimization 105
      7.1.6 Data-Structures 106
      7.1.7 The Algorithm 107
   7.2 Comprehensive Translation of an Annotated C Program into CUDA Kernels 112
      7.2.1 Input MetaFork Code Fragment 112
      7.2.2 Comprehensive Translation into Parametric CUDA Kernels 113
   7.3 Implementation Details 114
   7.4 Experimentation 116
   7.5 Conclusion 122

8 Concluding Remarks 126

Bibliography 128

A Code Translation Examples 141

B Examples Generated by PPCG 153

C The Implementation for Generating Comprehensive MetaFork Programs 158

Curriculum Vitae 161

List of Figures

2.1 GPU hardware architecture 6
2.2 Heterogeneous programming with GPU and CPU 7

3.1 Example of a MetaFork program with a function spawn 16
3.2 Example of a MetaFork program with a parallel region 17
3.3 Example of a meta_for loop 18
3.4 Various variable attributes in a parallel region 19
3.5 Example of shared and private variables with meta_for 19
3.6 Parallel fib code using a function spawn 20
3.7 Parallel fib code using a block spawn 20
3.8 OpenMP clauses supported in the current program translations 23
3.9 A code snippet of an OpenMP sections example 25
3.10 A code snippet showing how to exclude declarations which come from included header files 28
3.11 A code snippet showing how to translate a parallel for loop from CilkPlus to MetaFork 28
3.12 A code snippet showing how to insert a barrier from CilkPlus to MetaFork 28
3.13 A code snippet showing how to handle data attribute of variables in the process of outlining 32
3.14 A code snippet showing how to translate parallel for loop from MetaFork to CilkPlus 32
3.15 A general form of an OpenMP construct 32
3.16 Translation of the OpenMP sections construct 34
3.17 An array type variable 36
3.18 A code snippet translated from the codes in Figure 3.17 36
3.19 A scalar type variable 36
3.20 A code snippet translated from the codes in Figure 3.19 36
3.21 A code snippet showing the variable m used as an accumulator 36
3.22 A MetaFork code snippet (right), translated from the left OpenMP codes 37
3.23 A code snippet showing how to avoid adding redundant barriers when translating the codes from OpenMP to MetaFork 38
3.24 Example of translation from MetaFork to OpenMP 39
3.25 Example of translating a parallel function call from MetaFork to OpenMP 40
3.26 Example of translating a parallel for loop from MetaFork to OpenMP 40
3.27 A code snippet showing how to generate a new OpenMP main function 40
3.28 Parallel mergesort in size 10^8 43

vii 3.29 Matrix inversion of order 4096 ...... 43 3.30 Matrix transpose : n = 32768 ...... 44 3.31 Naive Matrix Multiplication : n = 4096 ...... 44 3.32 Speedup curve on Intel node ...... 44 3.33 Speedup curve on Intel node ...... 45 3.34 Speedup curve on Intel node ...... 46 3.35 Speedup curve on Intel node ...... 46 3.36 Running time of Mandelbrot set and Linear system solving ...... 47 3.37 FFT test-cases : FSU version and BOTS version ...... 48 3.38 Speedup curve of Protein alignment - 100 Proteins ...... 48 3.39 Speedup curve of Sparse LU and Strassen matrix multiplication ...... 49

4.1 MetaFork examples 58
4.2 Using meta_schedule to define one-dimensional CUDA grid and thread block 59
4.3 Using meta_schedule to define two-dimensional CUDA grid and thread block 59
4.4 Sequential C code computing Jacobi 60
4.5 Generated MetaFork code from the code in Figure 4.4 60
4.6 Overview of the implementation of the MetaFork-to-CUDA code generator 61
4.7 Generated parametric CUDA kernel for 1D Jacobi 62
4.8 Generated host code for 1D Jacobi 63
4.9 Serial code, MetaFork code and generated parametric CUDA kernel for array reversal 65
4.10 Serial code, MetaFork code and generated parametric CUDA kernel for 2D Jacobi 67
4.11 Serial code, MetaFork code and generated parametric CUDA kernel for LU decomposition 69
4.12 Serial code, MetaFork code and generated parametric CUDA kernel for matrix transpose 70
4.13 Serial code, MetaFork code and generated parametric CUDA kernel for matrix addition 71
4.14 Serial code, MetaFork code and generated parametric CUDA kernel for matrix vector multiplication 71
4.15 Serial code, MetaFork code and generated parametric CUDA kernel for matrix matrix multiplication 72

5.1 Pipelining code with meta_pipe construct and its DAG 76
5.2 MetaFork pipelining code and its serial C-elision counterpart code 77
5.3 Computation DAG of algorithms in Figure 5.2 78
5.4 Stencil code using meta_pipe and its serial C-elision counterpart code 78

6.1 MetaFork Pragma directive syntax 81
6.2 A code snippet showing a MetaFork program with Pragma directives 82
6.3 A code snippet showing a MetaFork program with keywords 82
6.4 Overall work-flow of the MetaFork compilation framework 83
6.5 A code snippet showing how to create tools based on Clang's LibTooling 84

6.6 A general command-line interface of MetaFork tools 85
6.7 Overall work-flow of the MetaFork analysis and transformation chain 86
6.8 Overall work-flow of the Clang front-end 88
6.9 A code snippet showing how to create a MetaFork user defined PragmaHandler 88
6.10 A code snippet showing how to register a new user defined Pragma handler instance to Clang preprocessor 89
6.11 Convert MetaFork Keywords to Pragma directives 91
6.12 A code snippet showing how to annotate a parallel for loop with MetaFork Pragma directive 91
6.13 A code snippet showing how to annotate a parallel for loop with MetaFork keyword 91
6.14 The code snippet after preprocessing the code in Figure 6.13 92
6.15 Setting the tok::eod token 92
6.16 A code snippet showing how to annotate a parallel region with MetaFork Pragma directive 92
6.17 A code snippet showing how to annotate a parallel region with MetaFork keyword 93
6.18 The code snippet after preprocessing the code of Figure 6.17 93
6.19 Consuming the shared clause 93
6.20 A code snippet showing how to annotate a join construct with MetaFork Pragma directive 93
6.21 A code snippet showing how to annotate a join construct with MetaFork Keywords 94
6.22 The code snippet after preprocessing the code of Figure 6.21 94
6.23 A code snippet showing how to annotate device code with MetaFork Pragma directive 94
6.24 A code snippet showing how to annotate device code with MetaFork Keywords 95
6.25 The code snippet after preprocessing the code of Figure 6.24 95
6.26 Convert CilkPlus Keywords to Pragma directives 95
6.27 A code snippet showing how to annotate a parallel for loop with CilkPlus Pragma directive 95
6.28 A code snippet showing how to annotate a parallel for loop with CilkPlus keyword 95
6.29 The code snippet after preprocessing the code of Figure 6.28 95
6.30 A code snippet showing how to annotate a parallel function call with CilkPlus Pragma directive 96
6.31 A code snippet showing how to annotate a parallel function call with CilkPlus keyword 96
6.32 The code snippet after preprocessing the code of Figure 6.31 96
6.33 A code snippet showing how to annotate a sync construct with CilkPlus Pragma directive 96
6.34 A code snippet showing how to annotate a sync construct with CilkPlus Keyword 96
6.35 The code snippet after preprocessing the code of Figure 6.34 96

7.1 Matrix addition written in C (the left-hand portion) and in MetaFork (the right-hand portion) with a meta_for loop nest, respectively 99
7.2 Comprehensive translation of MetaFork code to two kernels for matrix addition 100
7.3 The decision tree for comprehensive parametric CUDA kernels of matrix addition 101
7.4 Matrix vector multiplication written in C (the left-hand portion) and in MetaFork (the right-hand portion), respectively 104
7.5 The decision subtree for resource or performance counters 110
7.6 The serial elision of the MetaFork program for matrix vector multiplication 113
7.7 The software tools involved for the implementation 114
7.8 Computing the amount of words required per thread-block for reversing a 1D array 116
7.9 The first case of the optimized MetaFork code for array reversal 117
7.10 The second case of the optimized MetaFork code for array reversal 117
7.11 The third case of the optimized MetaFork code for array reversal 117
7.12 The first case of the optimized MetaFork code for matrix vector multiplication 118
7.13 The second case of the optimized MetaFork code for matrix vector multiplication 118
7.14 The third case of the optimized MetaFork code for matrix vector multiplication 119
7.15 The MetaFork source code for 1D Jacobi 119
7.16 The first case of the optimized MetaFork code for 1D Jacobi 120
7.17 The second case of the optimized MetaFork code for 1D Jacobi 120
7.18 The third case of the optimized MetaFork code for 1D Jacobi 121
7.19 The first case of the optimized MetaFork code for matrix addition 121
7.20 The second case of the optimized MetaFork code for matrix addition 122
7.21 The third case of the optimized MetaFork code for matrix addition 122
7.22 The first case of the optimized MetaFork code for matrix transpose 122
7.23 The second case of the optimized MetaFork code for matrix transpose 123
7.24 The third case of the optimized MetaFork code for matrix transpose 123
7.25 The first case of the optimized MetaFork code for matrix matrix multiplication 124
7.26 The second case of the optimized MetaFork code for matrix matrix multiplication 124
7.27 The third case of the optimized MetaFork code for matrix matrix multiplication 125

B.1 PPCG code and generated CUDA kernel for array reversal 153
B.2 PPCG code and generated CUDA kernel for matrix addition 153
B.3 PPCG code and generated CUDA kernel for 1D Jacobi 154
B.4 PPCG code and generated CUDA kernel for 2D Jacobi 154
B.5 PPCG code and generated CUDA kernel for LU decomposition 155
B.6 PPCG code and generated CUDA kernel for matrix vector multiplication 156
B.7 PPCG code and generated CUDA kernel for matrix transpose 156
B.8 PPCG code and generated CUDA kernel for matrix matrix multiplication 157

List of Tables

3.1 BPAS timings with 1 and 16 workers: original CilkPlus code and translated OpenMP code 50
3.2 Timings on AMD 48-core: underlined timings refer to original code and non-underlined timings to translated code 50

4.1 Speedup comparison of reversing a one-dimensional array between PPCG and MetaFork kernel code 65
4.2 Speedup comparison of 1D Jacobi between PPCG and MetaFork kernel code 65
4.3 Speedup comparison of 2D Jacobi between PPCG and MetaFork kernel code 66
4.4 Speedup comparison of LU decomposition between PPCG and MetaFork kernel code 68
4.5 Speedup comparison of matrix transpose between PPCG and MetaFork kernel code 68
4.6 Speedup comparison of matrix addition between PPCG and MetaFork kernel code 70
4.7 Speedup comparison of matrix vector multiplication among PPCG kernel code, MetaFork kernel code and MetaFork kernel code with post-processing 70
4.8 Speedup comparison of matrix multiplication between PPCG and MetaFork kernel code 72
4.9 Timings (in sec.) of quantifier elimination for eight examples 73

6.1 MetaFork constructs and clauses 90
6.2 CilkPlus constructs and clauses 95

7.1 Optimization strategies with their codes 116

List of Appendices

Appendix A Code Translation Examples 141
Appendix B Examples Generated by PPCG 153
Appendix C The Implementation for Generating Comprehensive MetaFork Programs 158

Chapter 1

Introduction

In the past fifteen years, the pervasive ubiquity of multi-core processors has stimulated a constantly increasing effort in the development of concurrency platforms, such as CilkPlus [22, 98, 78], OpenMP [117, 14] and TBB [77]. While those programming languages are all based on the fork-join concurrency model [23], they largely differ in their way of expressing parallel algorithms and scheduling the corresponding tasks. Therefore, developing software code combining libraries written with several of those languages is a challenge.

Nevertheless, there is a real need for facilitating interoperability between concurrency platforms. Consider for instance the field of symbolic computation. The DMPMC library (from the TRIP project, www.imcce.fr/trip, developed at the Observatoire de Paris) provides sparse polynomial arithmetic and is entirely written in OpenMP, while the BPAS library (the Basic Polynomial Algebra Subprograms, www.bpaslib.org, developed at the University of Western Ontario) provides dense polynomial arithmetic and is entirely written in CilkPlus. Polynomial system solvers require both sparse and dense polynomial arithmetic and thus could take advantage of a combination of the DMPMC and BPAS libraries. However, CilkPlus and OpenMP have different run-time systems. In order to achieve interoperability between them, an automatic source-to-source translation mechanism was desirable, yielding the original objective of this thesis work.

Another motivation for such a software tool is comparative implementation, with the objective of narrowing down performance bottlenecks. The underlying observation is that the same multithreaded algorithm, based on the fork-join parallelism model but implemented with two different concurrency platforms, say CilkPlus and OpenMP, could result in very different performance, often very hard to analyze and compare. If one code scales well while the other does not, one may suspect an inefficient implementation of the latter, as well as other possible causes such as a higher level of parallelism overheads. Translating the inefficient code to the other language can help narrow down the problem. Indeed, if the translated code still does not scale, one can suspect an implementation issue (say, the programmer failed to parallelize one critical portion of the algorithm), whereas if the translated code does scale, then one can suspect a parallelism overhead issue in the original code (say, the grain-size of a parallel for-loop is too small).

In the past decade, the introduction of low-level heterogeneous programming models, in particular CUDA [5, 116], has brought super-computing to the level of the desktop computer. However, these models bring notable challenges, even to expert programmers. Indeed, fully exploiting the power of hardware accelerators, in particular Graphics Processing Units (GPUs) [106], with CUDA-like code often requires significant code optimization effort. While this development can indeed yield high performance [12], it is desirable for some programmers to avoid the explicit management of device initialization and data transfer between memory levels. To this end, high-level models for accelerator programming have become an important research direction. With these models, programmers only need to annotate their C/C++ code to indicate which code portions are to be executed on the device and how data maps between host and device. As a consequence, it is desirable for our proposed source-to-source translation framework not to restrict itself to programming languages based on the fork-join concurrency model, but also to include the low-level heterogeneous programming model of CUDA, which mainly relies on the Single Instruction Multiple Data (SIMD) paradigm; this yields the second motivation of this thesis work.

As of today, OpenMP and OpenACC [6, 144] are among the most developed accelerator programming models. Both OpenMP and OpenACC are built on a host-centric execution model. The execution of the program starts on the host and may offload target regions to the device for execution. The device may have a separate memory space or may share memory with the host, so that memory coherence is not guaranteed and must be handled by the programmer. In OpenMP and OpenACC, the division of the work between thread blocks within a grid, and between threads within a thread block, can be expressed in a loose manner, or even ignored. This implies that code optimization techniques may be applied in order to derive efficient CUDA-like code. Of course, this is a non-obvious task, for a variety of reasons. First of all, for portability reasons, the hardware characteristics of the targeted GPU device should not be assumed to be known in a source code written with a high-level model for accelerator programming. Secondly, and partially as a consequence, program parameters (like grid and thread block formats) should not be assumed to be known either in that same source code. Therefore, this source code should depend on parameters which are symbolic entities with unknown values at compile time.

Being able to generate (say, from annotated C/C++) and optimize such parametric CUDA-like code is the third objective of this thesis work. As we shall see, achieving this objective leads to manipulating systems of non-linear polynomial constraints, which makes this process complex and challenging.
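To make the notion of a parametric kernel concrete, the sketch below (ours, written for illustration; all names are hypothetical) shows a CUDA-style kernel whose thread-block size B is an ordinary run-time parameter rather than a compile-time constant, the grid format being derived from B at launch time so that one source text covers every admissible choice of B.

    __global__ void scale(int n, double alpha, double *x) {
        // each thread owns one array slot
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = alpha * x[i];
    }

    void launch_scale(int n, double alpha, double *d_x, int B) {
        // B (threads per thread block) is unknown at compile time;
        // the grid format is computed from it when the kernel is launched
        int num_blocks = (n + B - 1) / B;
        scale<<<num_blocks, B>>>(n, alpha, d_x);
    }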

1.1 Dissertation Outline

In this thesis, we propose MetaFork, which is both a language accommodating several models of concurrency (fork-join, pipelining, SIMD) and a compilation framework (performing automatic translation from CilkPlus to OpenMP, from OpenMP to CilkPlus, from annotated C/C++ to CUDA, etc.).

In Chapter 3, we present MetaFork as a metalanguage for multithreaded algorithms based on the fork-join parallelism model and targeting multi-core architectures. By its parallel programming constructs, the MetaFork language is currently a super-set of CilkPlus and offers counterparts for widely used parallel constructs of OpenMP. In Chapter 5, we show how the MetaFork language supports pipelining, allowing interoperability with Cilk-P [95]. The implementation of MetaFork as a compilation framework is presented mainly in Chapter 6. However, the software framework that allows automatic generation of CUDA code from annotated MetaFork programs is discussed in Chapter 4. Finally, Chapter 7 is dedicated to the question of optimizing parametric CUDA kernels and, more generally, optimizing programs depending on parameters whose values are unknown at compile-time.

1.2 Contributions

MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) perform automatic code translation between concurrency platforms targeting multi-core architectures; (2) provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and pipelining parallelism; (3) generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block); (4) optimize code depending on parameters that are unknown at compile-time.

As of today, the publicly available and latest release of MetaFork, see www.metafork.org, offers the first three features stated above. The fourth is implemented as a proof of concept and will be integrated into the public version in the near future.

1.3 Thesis Statement

Chapter 3 is a joint work with Marc Moreno Maza, Sushek Shekar and Priya Unnikrishnan, published as [35]. Chapter 4 is a joint work with Changbo Chen, Abdoul-Kader Keita, Marc Moreno Maza and Ning Xie, published as [29]. Chapters 5 and 6 are essentially my work under the supervision of Marc Moreno Maza and Abdoul-Kader Keita. Chapter 7 is a joint work with Marc Moreno Maza and Ning Xie.

Chapter 2

Background

This chapter presents the technical background of our topic. Section 2.1 introduces the concurrency platforms that are relevant to better understand our work. Section 2.2 explains the most commonly used technologies and concepts to give insight into the performance of a parallel algorithm. Finally, a brief overview of directions for compiler research is covered in Section 2.3.

2.1 Concurrency Platforms

Nowadays, more and more promising parallel computing platforms are available for high performance computing. To gain speedup, programmers are expected to identify parallelism and choose suitable underlying platforms; otherwise, the algorithm developed might not display speedup, or even run slower than the sequential version. In the following sections, we summarize the basic properties of several concurrency platforms.

2.1.1 CilkPlus

CilkPlus is a language extension to the C/C++ programming language, designed for task-based parallelism in a multithreaded environment. With the keywords introduced by CilkPlus, a serial program can easily be converted into a parallel program which follows the fork-join parallel control pattern: control flow is split into multiple parallel flows that re-join later. Precisely, a CilkPlus program is derived from a serial program annotated with CilkPlus keywords indicating where parallelism is allowed. To boost applications, programmers only concentrate on developing an algorithm with ample parallelism, while leaving to the underlying CilkPlus run-time system the responsibility of efficiently scheduling the computation on the executing processors. CilkPlus uses a work-stealing strategy [24] to schedule the computational tasks. This method can be efficiently implemented to map a large quantity of parallel tasks onto physical hardware resources at run-time. Moreover, the authors of [59] show that, with its work-stealing strategy, CilkPlus programs are guaranteed to be time and space efficient. In particular, the computational tasks are adaptively load-balanced between worker threads by the CilkPlus scheduler [99].

Note that CilkPlus keywords only denote an opportunity for parallel computation; parallel computation is not mandatory. For example, the cilk_spawn keyword, which permits

the caller to run in parallel with the callee, may not actually result in concurrent execution. For such concurrent execution to take place, there must be at least one thread that is idle and looking for computational tasks to steal from other threads. This implies that, when a parallel CilkPlus program runs on a single-core machine, the serial semantics of that program is retained.

To ease parallel programming efforts, two performance analysis tools, Cilkprof [72, 133] and Cilkscreen [79], are available to programmers using CilkPlus. The Cilkprof scalability and performance analyzer collects the parallel performance data of a CilkPlus program. For example, Cilkprof profiles the work, span and burdened span, which allows developers to diagnose performance bottlenecks in CilkPlus programs. The Cilkscreen race detector reports to the programmer data races, which are extremely hard to debug with serial programming tools.
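As a small illustration (a sketch written for this discussion, not one of the thesis figures), the classical Fibonacci function below uses the two core CilkPlus keywords: cilk_spawn grants permission for the callee to run in parallel with the continuation of the caller, cilk_sync waits for all spawned children, and cilk_for marks a loop whose iterations may execute concurrently; removing the keywords yields the serial elision of the program.

    #include <cilk/cilk.h>

    long fib(long n) {
        if (n < 2) return n;
        long x, y;
        x = cilk_spawn fib(n - 1);   // the callee may run in parallel with the caller
        y = fib(n - 2);              // executed by the parent strand
        cilk_sync;                   // wait for the spawned call to complete
        return x + y;
    }

    void vector_add(int n, const double *a, const double *b, double *c) {
        cilk_for (int i = 0; i < n; ++i)   // iterations may be executed concurrently
            c[i] = a[i] + b[i];
    }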

2.1.2 OpenMP

OpenMP (Open Multi-Processing) is the de-facto standard for programming multithreaded applications; before the release of version 4.0, it was primarily designed for shared-memory systems. By annotating a serial program with various OpenMP compiler directives, programmers can specify fork-join parallelism. Originally, OpenMP only supported implicit tasks and static parallelism, which do not fit well irregular applications that employ recursive algorithms, pointer chasing or load-imbalanced computations. To parallelize such applications as efficiently as CilkPlus, OpenMP 3.0 extends its programming model, in what is referred to as the OpenMP tasking model [143], by allowing programmers to dynamically create asynchronous units of work to be scheduled by the run-time system. The adoption of this tasking model broadens OpenMP to parallelize task-based applications as flexibly as CilkPlus.

Furthermore, beyond single address space parallelism, the OpenMP language committee has been working on a set of extensions to support a heterogeneous computation model using both CPUs and accelerators; this work led to the release of version 4.0 of the OpenMP specification in July 2013. This accelerator model of OpenMP builds on a host-centric execution model. That is, the execution of an OpenMP program starts on the host (i.e. the CPU) and the host may offload a piece of code to an attached accelerator device (e.g. a GPU) for execution, according to the OpenMP directives. The accelerator may have its own separate memory space or may share memory with the host, so that memory coherence [121] is not guaranteed.

The accelerator model of OpenMP is similar to that of OpenACC, which also accelerates computation on an external accelerator by means of OpenACC compiler directives. OpenACC has been developed for several years, so the programming features of OpenACC are rich and stable. OpenMP, however, is just starting in the heterogeneous programming field, but it is catching up.
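The sketch below (ours, for illustration; function names are hypothetical) expresses the same Fibonacci computation with the OpenMP tasking model, together with a loop offloaded to an accelerator using the OpenMP 4.0 device constructs; fib is assumed to be called from within an omp parallel / omp single region.

    #include <omp.h>

    long fib(long n) {
        if (n < 2) return n;
        long x, y;
        #pragma omp task shared(x)   // asynchronous unit of work (tasking model)
        x = fib(n - 1);
        #pragma omp task shared(y)
        y = fib(n - 2);
        #pragma omp taskwait         // wait for both child tasks
        return x + y;
    }

    void vector_add(int n, const double *a, const double *b, double *c) {
        // OpenMP 4.0 accelerator model: offload the loop to the device,
        // mapping a and b to device memory and c back to the host
        #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    }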

2.1.3 GPU

In recent years, the use of GPUs as accelerators offering more opportunities for performance tuning has increased exponentially [52, 111, 45]. Contrary to the traditional CPU, which is designed for serial processing, the GPU provides massive parallelism which is specialized for speeding up intensive computations. Programmers can take advantage of this programming model to combine the power of the CPU and the GPU in order to exploit massive parallelism. However, this model also introduces notable challenges [108], even for expert programmers, due to the complexity of GPU architectures, including the memory hierarchy, the thread hierarchy and hardware resource limitations. To exploit the computational capabilities efficiently, programmers are required to commit significant effort to optimizing their algorithms with respect to specific hardware features [135, 84].

Figure 2.1: GPU hardware architecture

GPU Hardware Model. A GPU is made up of an array of streaming multiprocessors (SMs) that perform computations independently of each other, as well as off-chip device memory, as shown in Figure 2.1. As GPU architectures evolve, the configuration and the number of SMs, and the amount of device memory, vary. Roughly, each SM consists of Scalar Processors (SPs), on-chip shared memory and register files. The register files and shared memory of each SM are statically and evenly allocated among thread blocks as long as they are active and, when context switching happens, the states of those registers and of the shared memory do not need to be saved and restored. This can be viewed as a key basis for achieving high throughput. A thread block, which is composed of threads, is mapped to an SM, and the execution of each thread block is completely independent of every other. Within a thread block, the SM splits the threads into warps of 32 threads which run in a SIMD manner on the SPs. In fact, from the perspective of the GPU, a warp is the basic scheduling and execution unit of an SM. Different warps within a thread block are scheduled asynchronously, but can be synchronized if needed. When one warp is stalled, the SM can quickly switch to other warps if they are ready, thereby effectively tolerating long-latency operations such as memory loads, and maximizing resource utilization.

GPU Programming Model. Programming GPUs for general-purpose applications is accomplished through the Compute Unified Device Architecture (CUDA) programming model. CUDA allows programmers to define functions, called kernels, which are executed in SIMD mode by a large number of threads on the GPU. Figure 2.2 illustrates the heterogeneous programming model of CUDA through a hybrid CPU/GPU computing approach. That is, the execution of a CUDA program starts on the host (i.e. the CPU) and the parallel kernel code is executed on the device (i.e. the GPU), either synchronously or asynchronously with respect to the host. After the completion of kernel calls, control is returned back to the CPU thread.

Figure 2.2: Heterogeneous programming with GPU and CPU

In order to carry out the computations on the device, the programmer is responsible for organizing the data transfers between the host and device memory, transfers in both directions being handled by the host. CUDA abstracts execution on the GPU in the form of thousands of concurrent threads which are organized into a two-level thread hierarchy composed of the grid as the top level and thread blocks as the bottom level. The sizes of the grid and of the thread blocks can be configured in multiple dimensions (one, two or three) by special parameters provided at kernel launch time. Note that these configurations are chosen manually by programmers, without any guarantee that they are optimal. CUDA kernels, written as a single function, are executed by all threads in the grid; hence, each thread needs to distinguish itself from the others and to access unique data portions to operate on. To this end, threads are able to query the CUDA built-in variables (e.g. blockIdx, blockDim, gridDim and threadIdx) and combine them to compute a unique identifier. In addition, to achieve high memory bandwidth [150], GPUs offer a region of memory, called shared memory, which is private to every SM and can be used to share data at thread block scope. Compared to the slow off-chip GPU device memory, which can be accessed by all the threads, the on-chip shared memory provides a very fast memory access speed, comparable to the speed of register access. From the programming perspective, programmers need to explicitly declare variables with the CUDA __shared__ keyword to make those variables resident in shared memory.
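The sketch below (written for this discussion; error checking omitted and all names are illustrative) puts these elements together for a simple vector addition: the kernel computes a unique global index from the built-in variables, while the host code allocates device memory, organizes the transfers in both directions and chooses the execution configuration. A kernel staging data in on-chip memory would, in addition, declare a buffer with the __shared__ qualifier and synchronize the threads of its block with __syncthreads().

    #include <cuda_runtime.h>

    __global__ void vec_add(int n, const float *a, const float *b, float *c) {
        // unique global index computed from the CUDA built-in variables
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    void host_vec_add(int n, const float *a, const float *b, float *c) {
        float *d_a, *d_b, *d_c;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        // the host handles the data transfers between host and device memory
        cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
        int threads_per_block = 256;   // chosen manually by the programmer
        int blocks = (n + threads_per_block - 1) / threads_per_block;
        vec_add<<<blocks, threads_per_block>>>(n, d_a, d_b, d_c);
        cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    }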

2.2 Performance Measurement

Capturing analytically the parallelism overheads (e.g. scheduling costs) that a concurrency platform imposes on an executing program is a crucial question which has received little attention in the literature. However, establishing the leading cause of a performance issue often remains a very challenging task in many practical situations. Indeed, low parallelism or high cache-complexity at the algorithm level, ineffective implementation, hardware limitations and scheduling overheads are very different possible causes of poor performance. This section is devoted to discussing research directions to help with this challenge, targeting standard concurrency platforms on multi-core and many-core architectures.

2.2.1 Occupancy and ILP

A key motivation [36] for the massive parallelism of GPUs is to handle a large number of active warps, i.e. groups of threads executing in SIMD fashion, on each SM. An SM maximizes the utilization of its SIMD lanes [129] by multiplexing those warps according to their states. That is, when a running warp stalls due to the long latency of memory or ALU operations, this warp is switched out and marked as pending, while another warp which is eligible is switched in for execution. When the long-latency operations complete, the pending warp that was switched out earlier becomes eligible for execution again. Ideally, if there are sufficiently many warps at each cycle to tolerate latency, full utilization of the computational resources of the SM is achieved. Hence, the utilization of an SM is directly associated with the number of active warps on that SM. The occupancy metric on GPUs is used to formulate this concept. The occupancy of an SM is defined as the ratio of the number of active warps to the maximum number of warps. A CUDA occupancy calculator tool, included in the CUDA toolkit, helps programmers compute the occupancy of a kernel, depending on the execution configuration and the nature of the kernel algorithm.
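As a small numerical illustration (with hypothetical figures), if an SM can host at most 64 resident warps but the register and shared-memory usage of a kernel allows only 32 warps to be active simultaneously, then the occupancy of that SM is 32/64 = 50%.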

In practice, to improve occupancy, programmers usually configure more thread blocks than SMs, as well as more threads than SPs. However, not all those thread blocks are concurrently active on the SMs. To be active, the resources requested by a thread block, such as registers and shared memory, must be available. Thus, the occupancy of each SM is tightly constrained by the limited amount of SM resources. As reported in [84, 88, 94], even a slight increase in resource usage, like per-thread register usage or per-block shared memory usage, can lead to a sharp occupancy degradation.

Empirically, occupancy has a distinctive impact on performance depending on the computation pattern of the kernel. For memory-intensive applications [51, 112, 145], larger occupancy is better, due to the fact that instruction-level parallelism (ILP) [149] within each thread is not sufficient to mask the long memory access latency, and maximizing occupancy allows hiding the memory access latency by multiplexing the active warps. Nevertheless, optimizing for occupancy does not always equate to gaining better performance. As revealed in [114, 149], in the context of computation-intensive applications, better performance can also be achieved at a relatively lower occupancy by exploiting more independent work per thread, i.e. ILP. Keep in mind that GPU threads do not stall on memory accesses and stall only when an operand is not ready. So, under this circumstance, the memory access latency can be hidden by the overlapping execution of independent instructions. This idea is widely applied in code optimization. For example, loop unrolling [63, 114] creates a larger number of available independent instructions inside the loop body, at the expense of increasing register pressure, which may lead to a degradation of occupancy; however, the compiler has more opportunities for scheduling those instructions to improve ILP, which may be sufficient to offset the loss of thread-level parallelism. Hence, the balance between the number of active warps and the hardware resources available to a thread (or a thread block) is directly related to performance and, by choosing optimal kernel execution configurations as well as re-designing the kernel functions, the inter-convertibility between ILP and occupancy can be leveraged for performance tuning.
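The sketch below (ours, for illustration) shows this trade-off on a saxpy-like kernel: each thread processes four independent elements, so the unrolled updates expose instruction-level parallelism that can overlap memory latency, at the price of a higher per-thread register count, which may in turn lower occupancy.

    __global__ void saxpy_ilp(int n, float alpha, const float *x, float *y) {
        // four independent updates per thread: more ILP, but more registers per thread
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
        #pragma unroll
        for (int k = 0; k < 4; ++k) {
            int i = base + k;
            if (i < n)
                y[i] = alpha * x[i] + y[i];
        }
    }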

2.2.2 Work-Span Model

In the work-span model, a program's execution can be seen as a directed acyclic graph (DAG) composed of nodes and edges. A DAG node represents a sequence of instructions without any parallel control instructions, and edges represent dependencies between nodes. The execution of a DAG follows the prescribed order, where a node of the DAG is eligible to run only if all its predecessor nodes have completed. In order to measure the parallelism using the work-span model, two fundamental metrics, work and span, are derived from DAGs. The work, denoted T1, is the total time needed to execute the DAG serially. The span, denoted T∞, is also called the critical path length and is the execution time of the DAG on an ideal machine with an infinite number of processors.

In fact, the work-span model [23] has been used to support the development of the CilkPlus programming language and its implementation. Two complexity measures, the work T1 and the span T∞, and one machine parameter, the number P of processors, are combined in results like the Graham-Brent theorem [23] or the Blumofe-Leiserson theorem (Theorems 13 & 14) [24] in order to compare algorithm running time estimates. We recall that the Graham-Brent theorem states that the running time TP on P processors satisfies TP ≤ T1/P + T∞.

When designing a parallel algorithm, it is desirable to systematically integrate the expected scheduling costs that a given concurrency platform imposes on this algorithm. To this end, a refinement of the fork-join model was proposed by He et al. [72], together with an enhanced version of the Graham-Brent theorem, which actually supports the implementation (on multi-core architectures) of the parallel performance analyzer called Cilkview. In this context, the running time TP is bounded in expectation by T1/P + 2∆ T̂∞, where ∆ is a constant (called the span coefficient) and T̂∞ is the burdened span. Frigo et al. [59] describe the implementation of the CilkPlus work-stealing scheduler. They introduce the notions of work overhead and span overhead. Spoonhower et al. [139] present a theoretical concept of deviation on which we could rely in order to perform parallelism overhead analysis in the context of schedulers based on principles other than randomized work-stealing. In particular, and as mentioned by Haque [70], analyzing the burdened span may help us discover that an algorithm is not appropriate for an architecture.
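For illustration, with hypothetical values T1 = 10^9 and T∞ = 10^4, the parallelism T1/T∞ equals 10^5, and on P = 100 processors the Graham-Brent bound gives TP ≤ 10^9/100 + 10^4 ≈ 1.001 × 10^7; near-linear speedup can thus be expected as long as P remains much smaller than T1/T∞.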

2.2.3 Master Theorem

The divide-and-conquer technique, a classical approach for designing algorithms (either serial or parallel), is widely used [62, 15]. It applies when a task with input size n can be partitioned recursively into smaller sub-tasks, each with input size n/b, until a base-case problem is reached which is simple enough to be solved directly. Algorithms using this computational pattern can be parallelized efficiently with fork-join platforms, like CilkPlus and the tasking model of OpenMP, see [60]. The cost of the divide-and-conquer algorithms illustrated above can be represented by a recurrence relation given by

T(n) = a T(n/b) + f(n)    (2.1)

where:

1. n is the input size of the original task.

2. a ≥ 1 and b > 1 are constants; a and n/b represent the number of sub-tasks and the size of each sub-task, respectively.

3. f (n) is the cost of combining the answers of the sub-tasks.

4. T(n) is the cost of the task with input size n.

The recurrence relation (2.1) can easily be solved by the master theorem [43] in the form of asymptotic notation. More precisely, T(n) is asymptotically bounded in three common cases, as follows:

1. If f(n) = O(n^(log_b a − ε)) for some constant ε > 0, then T(n) = Θ(n^(log_b a)).

2. If f(n) = Θ(n^(log_b a) log^k n) for some k ≥ 0, then T(n) = Θ(n^(log_b a) log^(k+1) n).

3. If f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and a f(n/b) ≤ c f(n) for some constant c < 1 and all sufficiently large n, then T(n) = Θ(f(n)).

Furthermore, the authors of [9] present a method for solving more general divide-and-conquer recurrences which cannot be handled by the master theorem.
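For instance, the recurrence T(n) = 8 T(n/2) + Θ(n^2) of the classical divide-and-conquer matrix multiplication falls into case 1, since log_2 8 = 3 and n^2 = O(n^(3 − ε)), giving T(n) = Θ(n^3); merge sort, with T(n) = 2 T(n/2) + Θ(n), falls into case 2 with k = 0, giving T(n) = Θ(n log n); and T(n) = 2 T(n/2) + Θ(n^2) falls into case 3, since 2 f(n/2) = n^2/2 ≤ (1/2) f(n), giving T(n) = Θ(n^2).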

2.3 Compilation Theory

Computer hardware is constantly evolving in order to boost performance. However, improving compiler techniques can also bring many benefits. Generally, compilers serve two purposes: as translators, which translate programs written in a source programming language into equivalent programs in an object language, and as optimizers, which transform programs to make them more efficient, enabling high performance and productivity. In the following sections, we describe a broad range of research into compilers.

2.3.1 The State of the Art of Compilers

We survey two open-source compiler frameworks which are particularly relevant to industry and academia.

LLVM/Clang. The LLVM (Low Level Virtual Machine) compiler infrastructure [93] is a research project started in 2000 at the University of Illinois. LLVM is designed modularly and consists of a front-end, a middle-end optimizer and a target-specific back-end. The front-end translates high-level language programs into a low-level LLVM intermediate representation (IR), in static single assignment (SSA) form [44], which is not meant to actually run on a real machine. The middle-end optimizer takes the LLVM IR as input, applies to it most of the optimizations, organized as passes [156, 92, 38], and then outputs an optimized, equivalent LLVM IR. The LLVM optimizer provides lifelong optimization of a program, at all possible stages [93]. In particular, the LLVM optimizer differs from traditional compilation systems by its distinguishing features at link-time and run-time [105, 68], which

allow it to perform more sophisticated and aggressive inter-procedural optimizations. Finally, the LLVM back-end transforms the LLVM IR into processor-specific binary code, or can be used for just-in-time compilation.

Today, LLVM employs Clang [1] as its front-end for the C, C++, Objective-C and Objective-C++ programming languages. Clang aims to deliver fast compilation and useful error and warning messages, and to provide a platform for building great source-code-level tools, such as source code analysis tools or source-to-source transformation tools. It is designed to offer a complete replacement for the GNU Compiler Collection (GCC, https://gcc.gnu.org/). A major design concept of Clang is its use of a library-based architecture. In this design, the various parts of the front-end can be cleanly divided into separate libraries which can then be combined for different needs and uses. In addition, the library-based approach encourages good interfaces and reduces the work needed for new developers to understand and extend the Clang framework.

ROSE. Unlike traditional compilers, ROSE [2, 126, 125] is designed as a source-to-source compiler infrastructure, developed at the Lawrence Livermore National Laboratory, for building arbitrary tools for static analysis, optimization and transformation. ROSE has increasing support in its front-end for multiple programming languages and extensions, including C/C++, Fortran, Python, UPC, OpenMP, PHP and Java. Given an input program, the front-end is responsible for constructing an Abstract Syntax Tree (AST), which is an in-memory graph representation of the program allowing fast operations on it. Contrary to Clang, ROSE provides mechanisms to directly modify its AST in order to facilitate complex code transformations and optimizations. Finally, the back-end generates source code from the transformed AST and calls a vendor compiler to compile the generated source code.

2.3.2 Source-to-Source Compiler

Source-to-source compilation is an area of compilation technology dedicated to the automatic translation of programs written in a high-level language to another such language. The transformation of an un-optimized C program into an optimized C program is an example of source-to-source compilation. The translation of a LaTeX document into HTML is another one.

A source-to-source compiler can be used in various situations. One example is with legacy code. Some legacy code is well developed and has been optimized over many years. Developers can benefit a lot from legacy code if it can be adapted to modern implementation platforms [118]. Manually rewriting this code is time consuming; worse, it may introduce bugs. Sometimes fully automatic translation is difficult to achieve, which means that some extra work by hand is needed; however, one can still take advantage of source-to-source compilation for fragments of code. For example, the authors of [120] designed PoCC, a source-to-source compiler for parallelizing code fragments in static control parts (SCoP) format for multi-core architectures using OpenMP. With hiCUDA [69], high-level directive-based sequential code can be translated into a CUDA program. In [138], the authors propose a source-to-source C compiler, called Panda, which can automatically generate hybrid MPI + CUDA + OpenMP code that uses concurrent CPU + GPU computing from annotated sequential stencil C codes.


A second use case is debugging multithreaded code [91]. Nowadays, programmers have access to many concurrency platforms, but debugging parallel code is often a challenge: when many threads are running concurrently and competing for computer resources, tracing each thread may not be realistic. The translation of a parallel program into a serial counterpart may help in detecting issues such as data races. Thus, source-to-source compilation can be of great benefit in such situations.

Source-to-source compilation has been studied by many researchers in the past. For example, several projects offer automatic one-way translation from a concurrency platform running on one hardware architecture to another concurrency platform running on another hardware architecture, e.g. OpenMP shared-memory code to MPI distributed-memory code, as in the papers [19, 49] (HOMPI project). Another application of source-to-source compilation appears in an article by Lee et al. [97], where a compiler framework for the automatic translation of OpenMP shared-memory programs into CUDA-based GPGPU programs is presented, thus greatly relieving the programming effort, since writing efficient CUDA programs can be quite arduous. Other projects offer extensions of a concurrency platform from one hardware architecture to another, like HOMP [104] or OpenMPC [96], which allow extended OpenMP code to run on NVIDIA GPUs.
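As a minimal illustration (ours, not drawn from the cited systems), a source-to-source parallelizer may simply rewrite a serial C loop into an annotated counterpart in the same language, leaving all further decisions to the target concurrency platform.

    /* serial input */
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];

    /* a possible output, annotated for OpenMP */
    #pragma omp parallel for shared(a, b, c, n)
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];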

2.3.3 Automatic Parallelization

The polyhedron model [65, 55, 81, 17, 25, 16] is a powerful geometrical tool for analyzing the relations (w.r.t. data locality or parallelization) between the iterations of nested-loop programs. Let us consider the case of parallelization. Once the polyhedron representing the iteration space of a loop nest is computed, techniques from linear algebra and linear programming can transform it into another polyhedron encoding the loop steps in a coordinate system based on time and space (processors). From there, a parallel program can be generated. To be practically efficient, one should avoid a too fine-grained parallelization; this is achieved by grouping loop steps into so-called tiles, which are generally trapezoids [75] (a small illustration is given at the end of this section). These extensions lead, however, to the manipulation of systems of non-linear polynomial equations and to the use of techniques like quantifier elimination (QE), as done by the authors of [67]. They observe that classical algorithms for QE are not suitable, since they do not always produce conjunctions of atomic formulas, while this format is required in order to generate code automatically. This issue is addressed by a recent algorithm for computing cylindrical algebraic decompositions [33]. Indeed, this algorithm supports QE in such a way that the output of a QE problem has the form of a case discussion, which is appropriate for code generation.

Developing automatic parallelization compilers brings notable challenges when it comes to producing optimized parallel programs. A key reason is that the peak performance of parallel programs often depends on both hardware and program characteristics, which generally cannot be estimated by static code analysis at compile time. Hence, to achieve portable performance, the produced parallel programs should be expressed in a generic and portable way that allows performance-relevant parameters to be adapted [90, 113, 64]. With such programs in hand, it is possible to use auto-tuning [42, 142, 63], which provides many benefits across different hardware and problem configurations, to pick the parameters with the best performance.
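The sketch below (ours; a, b and c are assumed to be n × n matrices and B denotes a tile-size parameter) illustrates, on a matrix addition, the kind of tiling transformation mentioned above: the iteration space is grouped into B × B tiles and the outer tile loops become the natural candidates for parallelization or for mapping onto a GPU grid.

    /* original loop nest */
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            c[i][j] = a[i][j] + b[i][j];

    /* tiled counterpart: the (ii, jj) loops enumerate the tiles */
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int i = ii; i < ii + B && i < n; ++i)
                for (int j = jj; j < jj + B && j < n; ++j)
                    c[i][j] = a[i][j] + b[i][j];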

Chapter 3

A Metalanguage for Concurrency Platforms Based on the Fork-Join Model

In this chapter, we present MetaFork as a metalanguage for multithreaded algorithms based on the fork-join parallelism model and targeting multi-core architectures. By its parallel pro- gramming constructs, the MetaFork language is currently a super-set of CilkPlus and offers counterparts for the widely used OpenMP parallel constructs detailed in Section 3.5. However, MetaFork does not make any assumptions about the run-time system, in par- ticular about scheduling strategies (work sharing, work stealing). In fact, MetaFork is not designed to be a target language, but rather as the internal intermediate representation (IR) of a source-to-source compiler framework for multithreaded languages. Section 3.1 is going to explain the execution model of MetaFork based on the fork-join concurrency model. The syntax and the semantics of MetaFork’s parallel constructs are spec- ified in Sections 3.2 and 3.4. Since MetaFork is a faithful extension of the C/C++ language, this is actually sufficient to completely define MetaFork. Further, data attributes of the parallel constructs in MetaFork are discussed in Section 3.3. Recall that a driving motivation of the MetaFork project is to facilitate automatic trans- lation of programs between concurrency platforms. To date, our experimental framework in- cludes translators between CilkPlus and MetaFork (both ways) and, between OpenMP and MetaFork (both ways). Hence, through MetaFork, we perform program translations between CilkPlus and OpenMP (both ways). Despite the fact that it does not support all features of OpenMP, the MetaFork language is rich enough to capture the semantics of large bodies of OpenMP code, such as the Barcelona OpenMP Tasks Suite (BOTS) [53] and translate faithfully to CilkPlus most of the BOTS test cases. In the other direction, we could translate the BPAS library to OpenMP. In Section 3.6, we briefly explain how the translators of the MetaFork compilation frame- work are implemented. In particular, we specify which OpenMP data-sharing clauses are cap- tured by the MetaFork translators. Simple examples of code translation are provided. In Section 3.7, we evaluate the benefits of the MetaFork framework through a series of experiments. First, we show that MetaFork can help narrow down performance bottlenecks in multithreaded programs by means of comparative implementation, as discussed above. Sec- ondly, we observe that, if a native CilkPlus (resp. OpenMP) program has little parallelism overhead, then the same holds for its OpenMP (resp. CilkPlus) counterpart translated by


We tested more than 20 examples in total; the experimental results can be found in the technical report [34] and the corresponding code on the website of the MetaFork project. Moreover, the source code of the MetaFork translators can be downloaded from the same website at http://www.metafork.org.

3.1 Basic Principles and Execution Model

We summarize in this section a few principles that guided the design of MetaFork. First of all, MetaFork extends both the C and C++ languages into a multithreaded language based on the fork-join concurrency model. Thus, concurrent execution is obtained by a parent thread creating and launching one or more child threads so that the parent and its children execute a so-called parallel region. An important example of parallel regions is for-loop bodies. MetaFork has the following natural requirement regarding parallel regions: control flow cannot branch into or out of a parallel region. Similarly to CilkPlus, the parallel constructs of MetaFork grant permission for concurrent execution but do not command it. Hence, a MetaFork program can execute on a single-core machine. As mentioned above, MetaFork does not make any assumptions about the run-time system, in particular about task scheduling. Along the same lines, another design intention is to encourage a programming style limiting thread communication to a minimum so as to

- prevent data races while preserving a satisfactory level of expressiveness and,
- minimize parallelism overheads.

In some sense, this principle is similar to one of CUDA's principles [116], which states that the execution of a given kernel should be independent of the order in which its thread blocks are executed. To understand the implication of this idea in MetaFork, let us return to our concurrency platforms targeting multi-core architectures: OpenMP offers several clauses which can be used to exchange information between threads (like threadprivate, copyin and copyprivate), while no such mechanism exists in CilkPlus. Of course, this difference follows from the fact that, in CilkPlus, one can only fork a function call, while OpenMP allows other code regions to be executed concurrently. MetaFork has both types of parallel constructs. But, for the latter, MetaFork does not offer counterparts to the above OpenMP data attribute clauses.

3.2 Core Parallel Constructs

MetaFork has four parallel constructs: function call spawn, block spawn, parallel for-loop and synchronization barrier. The first two use the keyword meta fork while the other two use, respectively, the keywords meta for and meta join. We emphasize the fact that meta fork allows the programmer to spawn a function call (as in CilkPlus) as well as a block (as in OpenMP). As mentioned, the keyword meta fork is used to express the fact that a function call or a block is executed by a child thread, concurrently to the execution of the parent thread. If the program is run by a single processor, the parent thread is suspended during the execution of the child thread; when the latter terminates, the parent thread resumes its execution after the function call (or block) spawn. If the program is run by multiple processors, the parent thread may continue its execution1 after the function call (or block) spawn, without being suspended, while the child thread executes the function call (or block) spawn. In this latter scenario, the parent thread waits for the completion of the execution of the child thread as soon as the parent thread reaches a synchronization point.

Spawning a function call with meta fork. Spawning a call to the function f, with the argument sequence args, is done by

meta fork f(args)

The semantics is similar to that of the CilkPlus counterpart

cilk spawn f(args)

In particular, all the arguments in the sequence args are evaluated before spawning the function call f(args). However, the execution of meta fork f(args) differs from that of cilk spawn f(args) on one feature. While there is an implicit cilk sync at the end of the Cilk block [78] surrounding this latter cilk spawn, no such implicit barrier is assumed with meta fork. This feature is motivated by the fact that, in addition to the fork-join parallelism, we plan to extend the MetaFork language to other forms of parallelism such as parallel futures [139, 21]. Figure 3.1 illustrates how meta fork can be used to define a function spawn. The underlying algorithm in this example is the classical divide-and-conquer quicksort procedure.
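To make this difference concrete, the following minimal sketch (the functions g and h are hypothetical and not part of the MetaFork distribution) contrasts the two behaviors: in CilkPlus, the spawned call to g is implicitly synchronized at the end of the enclosing block, whereas the MetaFork version only waits where an explicit meta join is written.

/* CilkPlus: implicit sync at the end of the enclosing Cilk block */
void caller_cilk() {
  cilk_spawn g();
  h();
  /* implicit cilk_sync here */
}

/* MetaFork: no implicit barrier; the programmer writes it explicitly */
void caller_meta() {
  meta_fork g();
  h();
  meta_join;   /* without this, caller_meta may return before g completes */
}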

Spawning a block with meta fork. The other usage of the meta fork construct is for spawning a basic block B, which is done as follows:

meta fork { B }

If B consists of a single instruction, then the surrounding curly braces can be omitted. We also refer to this construction as a parallel region. There is no equivalent in CilkPlus, while it is offered by OpenMP. Similarly to a function call spawn, this parallel region is executed by a child thread (once the parent thread reaches the meta fork construct) while the parent thread continues its execution after the parallel region. Similarly also to a function call spawn, no implicit barrier is assumed at the end of the surrounding region. Hence, synchronization points have to be added explicitly, using meta join; see the example of Figure 3.2. A variable v which is not local to B may be shared by both the parent and child threads; alternatively, the child thread may be granted a private copy of v. Precise rules about data attributes, for both parallel regions and parallel for-loops, are stated in Section 3.3. Figure 3.2 below illustrates how meta fork can be used to define a parallel region. The underlying algorithm is one of the two subroutines in the work-efficient parallel prefix sum due to Guy Blelloch [20].

1In fact, the parent thread does not participate in the execution of a function call (or block) spawn, but will participate in the execution of the iterations of a parallel for-loop.

#include <algorithm>
#include <functional>
#include <cstdlib>
#include <ctime>
using namespace std;

void parallel_qsort(int * begin, int * end)
{
  if (begin != end) {
    --end;  // Exclude last element (pivot) from partition
    int * middle = std::partition(begin, end,
                                  std::bind2nd(std::less<int>(), *end));
    using std::swap;
    swap(*end, *middle);  // move pivot to middle
    meta_fork parallel_qsort(begin, middle);
    parallel_qsort(++middle, ++end);  // Exclude pivot and restore end
    meta_join;
  }
}

int main(int argc, char* argv[])
{
  int n = 10;
  int *a = (int *)malloc(sizeof(int)*n);
  srand( (unsigned)time( NULL ) );
  for (int i = 0; i < n; ++i)
    a[i] = rand();
  parallel_qsort(a, a + n);
  return 0;
}

Figure 3.1: Example of a MetaFork program with a function spawn

long int parallel_scanup (long int x [], long int t [], int i, int j)
{
  if (i == j) {
    return x[i];
  }
  else{
    int k = (i + j)/2;
    int right;
    meta_fork {
      t[k] = parallel_scanup(x,t,i,k);
    }
    right = parallel_scanup (x,t, k+1, j);
    meta_join;
    return t[k] + right;}
}

Figure 3.2: Example of a MetaFork program with a parallel region

Parallel for-loops with meta for. Parallel for-loops in MetaFork have the following format

meta for (I, C, S) { B }

where I is the initialization expression of the loop, C is the condition expression of the loop, S is the stride of the loop and B is the loop body. In addition:

- the initialization expression initializes a variable, called the control variable, which can be of integer or pointer type,
- the condition expression compares the control variable with a compatible expression, using one of the relational operators <, <=, >, >=, !=,
- the stride uses one of the unary operators ++, --, +=, -= (or a statement of the form cv = cv + incr where incr evaluates to a compatible expression) in order to increase or decrease the value of the control variable cv,
- if B consists of a single instruction, then the surrounding curly braces can be omitted.

The parent thread shares the work of executing the iterations of the loop with the child threads. An implicit synchronization point is assumed after the loop body. That is, the execution of the parent thread is suspended when it reaches meta for and resumes when all children threads (executing the loop body iterations) have completed their execution. As one can expect, the iterations of the parallel loop meta for (I, C, S) { B } must execute independently of each other in order to guarantee that this parallel loop is semantically equivalent to its serial version for (I, C, S) { B }. Figure 3.3 displays an example of a meta for loop, where the underlying algorithm is the naive (and cache-inefficient) matrix multiplication procedure.

void multiply_iter_par(int ii, int jj, int kk, int* A, int* B, int* C)
{
  meta_for(int i = 0; i < ii; ++i)
    for (int k = 0; k < kk; ++k)
      for(int j = 0; j < jj; ++j)
        C[i * jj + j] += A[i * kk + k] * B[k * jj + j];
}

Figure 3.3: Example of a meta for loop

Synchronization point with meta join. The construct meta join indicates a synchronization point (or barrier) for a parent thread and its child tasks. More precisely, a parent thread reaching this point must wait for the completion of its child tasks but not for those of the subsequent descendant tasks. When the parent thread resumes, execution starts at the point immediately after the meta join construct. Typically, this construct is used as a communication mechanism to exchange information between different threads. In particular, a meta join construct ensures that all writes to shared memory performed by the threads preceding this meta join construct are seen in the same view by the code after it. This natural property of the meta join construct addresses the problems of data inconsistency and contention [54] that arise in shared-memory programming models.

3.3 Variable Attribute Rules

Variables that are non-local to the block of a parallel region may be either shared by or private to the threads executing the code paths where those variables are defined. After a terminology review, we specify the rules that MetaFork uses in order to decide whether such a non-local variable is shared or private.

Shared and private variables. Consider a parallel region with block Y (or a parallel for- loop with loop body Y). X denotes the immediate outer scope of Y. We say that X is the parent region of Y and that Y is a child region of X. A variable v which is defined in Y is said to be local to Y; otherwise we call v a non-local variable for Y. Let v be a non-local variable for Y. Assume v gives access to a block of storage before reaching Y. We say that v is shared by X and Y if its name gives access to the same block of storage in both X and Y; otherwise we say that v is private to Y. In particular, if Y is a parallel for-loop we say that a local variable w is shared by Y whenever the name of w gives access to the same block of storage in any loop iteration of Y, which means that all the threads that execute this parallel for-loop share the same variable w; otherwise we say that w is private to Y.

Value-type and reference-type variables. In the C/C++ programming language, a value-type variable contains its data directly, as opposed to a reference-type variable, which contains a reference to its data. Value-type variables are either of primitive types (char, float, int, double, void) or user-defined types (enum, struct, union). Reference-type variables are pointers, arrays, functions and references.

static and const type variables. In the C/C++ programming language, a static variable is a variable that has been allocated statically and whose lifetime extends across the entire run of the program. This is in contrast to automatic variables (local variables are generally automatic), whose storage is allocated and deallocated on the call stack, and to other variables (such as objects) whose storage is dynamically allocated in heap memory. When a variable is declared with the qualifier const, the value of that variable cannot typically be altered by the program during its execution.
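As a small illustration (the function counter below is hypothetical), a static local variable keeps its value across calls, an automatic local variable is re-created on every call, and a const variable cannot be reassigned:

int counter(void) {
  static int calls = 0;   /* allocated once, lives for the whole run of the program */
  int tmp = 0;            /* automatic: re-created on every call                     */
  const int step = 1;     /* read-only: writing "step = 2;" would not compile        */
  calls += step;
  tmp += step;
  return calls;           /* returns 1, 2, 3, ... on successive calls                */
}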

/* This file starts here ... */
#include <stdio.h>
#include <stdlib.h>

int a;

long par_region(long n){
  int b;
  int *c = (int *)malloc(sizeof(int)*10);
  int d[10];
  const int f=0;
  static int g=0;
  meta_fork{
    int e = b;
    subcall(c,d);
  }
}

/* ... and continues here ... */
void subcall(int *a,int *b){
  for(int i=0;i<10;i++)
    printf("%d %d\n",a[i],b[i]);
}

int main(int argc,char **argv){
  long n=10;
  par_region(n);
  return 0;
}
/* ... and finishes here. */

Figure 3.4: Various variable attributes in a parallel region

Variable attribute rules of meta fork. A non-local variable v which gives access to a block of storage before reaching Y is shared between the parent X and the child Y whenever v is: (1) a global variable, (2) a file scope variable, (3) a reference-type variable, (4) declared static or const, or (5) qualified shared. In all other cases, the variable v is private to the child. In particular, value-type variables (that are not declared static or const, not qualified shared, and not global or file scope variables) are private to the child. In Figure 3.4, the variables a, c, d, f and g are shared, while the variables b and e are private.

/* To illustrate variable attributes, three files (a header file "a.h" and
   two source files "a.cpp" and "b.cpp") are used. This file is a.cpp */
#include <stdlib.h>
#include <time.h>
#include "a.h"
int var = 100;
int main(int argc,char **argv)
{
  int *a=(int*)malloc(sizeof(int)*10);
  srand((unsigned)time(NULL));
  for(int i=0;i<10;i++)
    a[i]=rand();
  test(a);
  return 0;
}

/* This file is a.h*/
void test(int *a);

/* This file is b.cpp*/
#include "a.h"
extern int var;
void test(int *array)
{
  int basecase = 100;
  meta_for(int j = 0; j < 10; j++)
  {
    static int var1=0;
    int i = array[j];
    if( i < basecase )
      array[j]+=var;
  }
}

Figure 3.5: Example of shared and private variables with meta for

long fib_parallel(long n)
{
  long x, y;
  if (n < 2)
    return n;
  else{
    x = meta_fork fib_parallel(n-1);
    y = fib_parallel(n-2);
    meta_join;
    return (x+y);}
}

Figure 3.6: Parallel fib code using a function spawn

long fib_parallel(long n)
{
  long x, y;
  if (n < 2)
    return n;
  else{
    meta_fork shared(x)
    {
      x = fib_parallel(n-1);
    }
    y = fib_parallel(n-2);
    meta_join;
    return (x+y);}
}

Figure 3.7: Parallel fib code using a block spawn

Variable attribute rules of meta for. A non-local variable which gives access to a block of storage before reaching Y is shared between parent and child. A variable local to Y is shared by Y whenever it is declared static, otherwise it is private to Y. In particular, loop control variables are private to Y. In the example of Figure 3.5, the variables array, basecase, var and var1, are shared by all threads while the variables i and j are private.

The shared keyword. Programmers can explicitly qualify a given variable as shared by using the shared keyword in MetaFork. In the example of Figure 3.6, the variable n is private to fib parallel(n-1). In Figure 3.7, we specify the variable x as shared and the variable n is still private. Notice that the programs in Figures 3.6 and 3.7 are semantically equivalent.

3.4 Semantics of the Parallel Constructs in MetaFork

Parallel programming allows many orders of execution of a program, which increases the complexity of reasoning about its behavior. A number of research efforts [26, 100, 82, 131] have been undertaken in the area of modeling the semantics of parallel programming languages so as to avoid non-deterministic behavior. The basic idea is that if a program executed sequentially satisfies a given criterion, the parallel execution of that program must meet the same criterion. The goal of this section is to formally define the semantics of each of the parallel constructs in MetaFork. To do so, we introduce the serial C-elision of a MetaFork program M as a C program C whose semantics define those of M. The program C is obtained from the program M by a set of rewriting rules stated through Algorithms 1 to 4. As mentioned before, spawning a function call in MetaFork has the same semantics as spawning a function call in CilkPlus. More precisely, meta fork f(args) and cilk spawn f(args) are semantically equivalent. Next, we specify the semantics of the spawning of a block in MetaFork. To this end, we use Algorithms 2, 3 and 4, which reduce the spawning of a block to that of a function call:

- Algorithm 2 takes care of the case where the spawned block consists of a single instruction where a variable is assigned to the result of a function call,

- Algorithms 3 and 4 take care of all other cases.

Note that, in the pseudo-code of those algorithms, the generate keyword is used to indicate that a sequence of string literals and variables is written to the medium (file, screen, etc.) where the output program is being emitted. A meta for loop allows iterations of the loop body to be executed in parallel. By default, each iteration of the loop body is executed by a separate thread. However, using the grainsize compilation directive, one can specify the number of loop iterations executed per thread2:

#pragma meta grainsize = expression

Nevertheless, in order to obtain the serial C-elision of a MetaFork for-loop, we require that the meta for construct can be replaced by the C-language for, whatever the grainsize of this MetaFork for-loop is, without changing the initialization expression, condition expression and stride. (Of course, the loop body must be replaced with its serial C-elision.)
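For instance, the following sketch shows a meta for loop annotated with a grainsize directive and, below it, its serial C-elision; the array a and the bound n are hypothetical names used only for illustration:

/* MetaFork version: at most 4 loop iterations are grouped per thread */
#pragma meta grainsize = 4
meta_for (int i = 0; i < n; i++) {
  a[i] = 2 * a[i];
}

/* Serial C-elision: the grainsize directive is dropped and meta_for becomes
   a plain for, with the same initialization, condition and stride */
for (int i = 0; i < n; i++) {
  a[i] = 2 * a[i];
}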

Algorithm 1: Fork region(P, S, R)
Input: R is a statement of the form:
    meta fork [shared(Z)] B
where Z is a sequence of variables, B is a piece of code, and P and S are lists of shared variables and private variables of B, respectively, determined by the rules in Section 3.3.
Output: The serial C-elision of the above MetaFork statement.

1 if B consists of a single statement which is a function call without left-value then
2     generate(B);
3 else if B consists of a single statement which is a function call with left-value then
4     Function with lvalue(P, S, B);
5 else
6     Block call(P, S, B);

3.5 Supported Parallel APIs of Both CilkPlus and OpenMP

Through MetaFork, we perform program translations between CilkPlus and OpenMP (both ways). Currently, MetaFork is a super-set of CilkPlus; however, MetaFork does not offer counterparts to all the OpenMP constructs. This section provides an overview of the APIs of both CilkPlus and OpenMP in our implementation.

2The loop iterations of a thread are then executed one after another by that thread.

Algorithm 2: Function with lvalue(P, S, B)
Input: B is a statement of the form:
    G = F(A);
where G is a left-value, A is an argument sequence, F is a function name, and P and S are as in Algorithm 1.
Output: The serial C-elision of the MetaFork statement below:
    meta fork [shared(Z)] G = F(A)

1 /* L is a list consisting of all variables appearing in G */
2 L ← [G]
3 if L is not a sub-set of S then
4     Block call(P, S, B);
5 else
6     generate(B);

Algorithm 3: Block call(P, S, B)
Input: P, S and B are as in Algorithm 1.
Output: The serial C function call of the above input.

1 (E, H, T, P, O) = Outlining region(P, S, B);
2 /* declare the outlined function */
3 generate(E);
4 /* declare an object, say O, to wrap all the parameters */
5 generate(T, O);
6 /* initialize all the members (i.e. parameters) in the object O */
7 generate(P);
8 /* a call to the outlined function */
9 generate(H, "(", O, ")");
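As a rough illustration of the kind of code that Algorithm 3 emits (all names below, such as taskFunc and task_arg, are hypothetical and chosen only for readability, not taken from the MetaFork implementation), a spawned block that reads a private variable j and writes a shared variable m could be rewritten, in its serial C-elision, as an outlined function invoked through a wrapper object:

#include <stdlib.h>

/* wrapper type T: shared variables are passed by address, private ones by value */
typedef struct { int *m; int j; } task_arg;

/* outlined function H */
static void * taskFunc(void *F) {
  task_arg *o = (task_arg *) F;
  int *m = o->m;              /* shared: accessed through a pointer */
  int  j = o->j;              /* private: copied by value           */
  *m = *m + j;                /* body of the spawned block          */
  return NULL;
}

/* at the spawn site: pack the parameters and call the outlined function */
void host(int j) {
  int m = 0;
  task_arg *o = (task_arg *) malloc(sizeof(task_arg));
  o->m = &m;
  o->j = j;
  taskFunc(o);                /* serial C-elision: a plain function call */
}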

3.5.1 OpenMP

Data-Sharing attribute clauses. Some OpenMP constructs accept clauses to control the data-sharing attributes of variables referenced in the construct. Figure 3.8 describes the clauses supported in the current program translations. Note that, for each OpenMP construct discussed below, we specify the clauses that are valid on it.

1. The private clause. The private clause declares variables in its list to be private to each thread. This clause has the following format:

private (list)

2. The collapse clause. The collapse clause is used to specify how many loops are associated with the OpenMP loop construct. This clause has the following format:

collapse (expression)

3. The shared clause. The shared clause declares one or more list variables to be shared by OpenMP tasks. This clause has the following format:

shared (list)

4. The default clause. The default clause explicitly determines the data-sharing attributes of variables that are referenced in a construct and would otherwise be implicitly determined. This clause has the following format:

default (shared | none)

5. The firstprivate clause. The firstprivate clause combines the behavior of the private clause with automatic initialization of the variables in its list. This clause has the following format:

firstprivate (list)

6. The num threads clause. The num threads clause sets the number of threads to use for the next parallel region. This clause has the following format:

num_threads(integer-expression)

Figure 3.8: OpenMP clauses supported in the current program translations
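For example, several of the clauses above can be combined on a single construct; the sketch below (the function scale and the variables a, n, s and t are hypothetical names used only for illustration) requests three threads, makes the array and its bound shared, gives each thread an initialized copy of s and a private temporary t:

void scale(int *a, int n, int s) {
  int t;
  #pragma omp parallel for num_threads(3) default(none) \
          shared(a, n) firstprivate(s) private(t)
  for (int i = 0; i < n; i++) {
    t = a[i] + s;      /* t is private: each thread has its own copy      */
    a[i] = t;          /* a is shared: all threads write to the same array */
  }
}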

Parallel construct. This is the fundamental OpenMP parallel construct that starts parallel execution by defining a parallel region. It has the following format:

#pragma omp parallel [clause[ [, ]clause] ...] new-line
structured-block

When a thread reaches a parallel directive, it creates a team of threads and becomes the master of the team. The master thread has thread number 0. The code is duplicated and all threads execute the code defined in the parallel region. There is an implicit barrier at the end of a parallel region, provided by the OpenMP compiler. Only the master thread continues execution past this point. The parallel directive has five optional clauses that take one or more arguments: private, shared, firstprivate, default and num threads.

Worksharing constructs. OpenMP defines three worksharing constructs for C/C++: the loop, sections and single constructs.

1. The loop construct. The loop directive specifies that the iterations of the loop immediately following it will be executed in parallel. The loop construct has the following format:

#pragma omp for [clause[[,] clause] ... ] new-line
for-loops

It has six optional clauses: private, shared, firstprivate, schedule, collapse and nowait.

2. The sections construct. The sections construct contains a set of enclosed sections of code that are to be divided among the threads. Independent section constructs are nested within a sections construct. Each section is executed once by a thread. It is possible for a thread to execute more than one section. The format of this construct is as follows:

#pragma omp sections [clause[[,] clause] ...] new-line
{
[#pragma omp section new-line]
structured-block

[#pragma omp section new-line structured-block ]

... }

There are three clauses: private, firstprivate and nowait. There is an implied barrier at the end of a sections construct unless the nowait clause is used. Figure 3.9 gives an example of OpenMP sections.

3. The single construct. The single construct specifies that the enclosed code is to be executed by only one thread in the team. Threads in the team that do not execute the single construct wait at the end of the enclosed code block, unless a nowait clause is specified. The format of this construct is as follows:

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];

    #pragma omp section
    for (i=0; i < N; i++)
      d[i] = a[i] * b[i];
  } /* end of sections */
} /* end of parallel section */

Figure 3.9: A code snippet of an OpenMP sections example

#pragma omp single [clause[[,] clause] ...] new-line structured-block

This directive has three optional clauses: private, nowait and firstprivate.

Task construct. This construct was introduced in OpenMP version 3.0. The task construct defines an explicit task, which may be executed by the encountering thread, or deferred for execution by any other thread in the team. The format of this construct is as follows:

#pragma omp task [clause[[,] clause] ...] new-line
structured-block

It has four optional clauses: default, private, firstprivate and shared.

Master and synchronization constructs. OpenMP provides several synchronization mechanisms to ensure the consistency of shared data and to coordinate parallel execution among threads. Currently, we support the following three synchronization constructs.

1. The master construct. The master construct specifies a region that is to be executed only by the master thread of the team. All other threads of the team skip this section of code. There is no implied barrier associated with this construct. The format of this construct is as follows:

#pragma omp master new-line structured-block

2. The barrier construct. The barrier construct synchronizes all threads in the team. When a barrier construct is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier. All threads in a team (or none) must execute the barrier region. The format of this construct is as follows:

#pragma omp barrier new-line

3. The taskwait construct. This construct is new with OpenMP version 3.0. The taskwait construct specifies a wait on the completion of child tasks generated since the beginning of the current task. The format of this construct is as follows:

#pragma omp taskwait new-line
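As a brief illustration (the functions produce_left and produce_right are hypothetical, and the code would typically be reached from within a parallel/single region), two tasks can be created and then joined with taskwait before their results are combined:

int produce_left(void);
int produce_right(void);

int combine(void) {
  int left, right;
  #pragma omp task shared(left)
  left = produce_left();
  #pragma omp task shared(right)
  right = produce_right();
  #pragma omp taskwait   /* wait for the two child tasks created above */
  return left + right;
}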

Combined constructs. OpenMP provides a set of combined constructs for specifying one construct immediately nested inside another construct. The semantics of these constructs are identical to an individual parallel directive being immediately followed by a separate worksharing construct. Currently, we support the following two combined constructs.

1. The parallel loop construct. This construct specifies a parallel construct contain- ing one or more associated loops. The format of this construct is as follows:

#pragma omp parallel for [clause[[,] clause] ...] new-line for-loop

The clause can be any of the clauses accepted by the parallel or for directives, except the nowait clause.

2. The parallel sections construct. This construct illustrates a parallel construct containing one sections construct and no other statements. The format of this construct is as follows:

#pragma omp parallel sections [clause[[,] clause] ...] new-line
{
[#pragma omp section new-line]
structured-block
[#pragma omp section new-line
structured-block ]
...
}

The clause can be any of the clauses accepted by the parallel or sections directives, except the nowait clause.

3.5.2 CilkPlus

CilkPlus extends the C/C++ language with the following three keywords, which are supported by the MetaFork translators.

The cilk for keyword. A cilk for loop allows loop iterations to run in parallel. The general cilk for syntax is:

cilk_for (declaration and initialization; conditional expression; increment expression) loop-body

The cilk for statement divides the loop into chunks containing one or more loop iterations. Each chunk is executed serially and is spawned during the execution of the loop. At the end of the cilk for loop, there is an implicit barrier.

The cilk spawn keyword. The cilk spawn keyword informs the compiler that the function call it precedes may run asynchronously with the caller. A cilk spawn statement can take one of the following forms:

type var = cilk_spawn func(args);

var = cilk_spawn func(args);

cilk_spawn func(args);

The cilk sync keyword. This keyword specifies that all spawned calls in a function must complete before execution continues. After all the spawned calls have completed, the current function can proceed. An implicit cilk sync is executed at the end of a function to synchronize with any preceding cilk spawn statements in it, after the return value, if any, has been assigned or initialized. The syntax is as follows:

cilk_sync;

The cilk sync only syncs with children spawned by this function. Children of other functions are not affected.
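Putting these keywords together, a classical example is the recursive Fibonacci function, where one recursive call is spawned and cilk sync joins it before the two results are added; this is a sketch of the standard CilkPlus idiom:

long fib(long n) {
  if (n < 2)
    return n;
  long x = cilk_spawn fib(n - 1);  /* may run in parallel with the caller */
  long y = fib(n - 2);             /* executed by the parent strand       */
  cilk_sync;                       /* wait for the spawned call           */
  return x + y;
}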

3.6 Translation

In this section, we briefly explain how the translators of the MetaFork compilation framework are implemented. Obviously, for each translator, the semantics of the input program are preserved in the output program. However, scheduling strategies (like the OpenMP clause schedule(static, chunksize)) are ignored by our translators. In addition, when the code to be analyzed contains include directives, we are not interested in the declarations coming from the included files, especially from system headers. However, by default, our visitors, detailed in Section 6.4.2, run for all declarations. Thus, the code in Figure 3.10 is used to avoid running our translation algorithms on declarations from included header files.

const auto& FileID = SourceManager.getFileID(Decl->getLocation());
if (FileID != SourceManager.getMainFileID())
  continue;

Figure 3.10: A code snippet showing how to exclude declarations which come from included header files
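For context, a minimal sketch of where such a check could sit is given below; the surrounding loop over top-level declarations and the function name visitTopLevel are assumptions made for illustration, not the actual MetaFork implementation:

#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"
#include "clang/Basic/SourceManager.h"

void visitTopLevel(clang::ASTContext &Context) {
  clang::SourceManager &SM = Context.getSourceManager();
  for (clang::Decl *D : Context.getTranslationUnitDecl()->decls()) {
    const auto &FileID = SM.getFileID(D->getLocation());
    if (FileID != SM.getMainFileID())
      continue;   // skip declarations pulled in from included headers
    // ... apply the translation visitors to D ...
  }
}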

3.6.1 Translation from CilkPlus to MetaFork

Translating code from CilkPlus to MetaFork is easy in principle since, up to the vectorization constructs of CilkPlus, the MetaFork language is a super-set of CilkPlus. For example, a cilk for loop is faithfully translated to a meta for loop, as can be seen in Figure 3.11. However, implicit CilkPlus barriers need to be explicitly inserted in the target MetaFork code, see Figure 3.12. This implies that, during translation, it is necessary to trace the instruction stream DAG of the CilkPlus program in order to properly insert barriers in the generated MetaFork code. To be precise, a meta join construct is inserted before each return statement (explicit or implicit) if there has been any spawn invocation since the last sync.

CilkPlus version:
void function()
{
  cilk_for ( int j = 0; j < ny; j++ )
  {
    int i;
    for ( i = 0; i < nx; i++ )
    {
      u[i][j] = unew[i][j];
    }
  }
}

MetaFork translation:
void function()
{
  meta_for ( int j = 0; j < ny; j++ )
  {
    int i;
    for ( i = 0; i < nx; i++ )
    {
      u[i][j] = unew[i][j];
    }
  }
}

Figure 3.11: A code snippet showing how to translate a parallel for loop from CilkPlus to MetaFork

CilkPlus version:
void test()
{
  int x;
  x = cilk_spawn test1();
}

MetaFork translation:
void test()
{
  int x;
  x = meta_fork test1();
  meta_join;
}

Figure 3.12: A code snippet showing how to insert a barrier from CilkPlus to MetaFork

3.6.2 Translation from MetaFork to CilkPlus

Since CilkPlus has no construct for spawning a parallel region (which is not a function call), we use the widely used outlining technique [102, 103, 80, 132] to wrap the parallel region into a function, called the outlined function, and then replace that parallel region by a concurrent call to the outlined function. In addition to the outlined function, we refer to the function which encloses a parallel region in its lexical scope as the host function. Indeed, the problem of translating code from MetaFork to CilkPlus is equivalent to that of defining the serial elision of a MetaFork program. We have designed and implemented our outlining algorithm by taking advantage of the Clang AST, which is used to analyze the data attributes of variables and to patch the code in the parallel region. Algorithm 4 depicts our algorithm in detail. To clarify, as mentioned in Section 3.4, P and S are lists of shared variables and private variables of B, respectively, determined by the rules in Section 3.3. The variables in P and S are candidates to be passed as parameters to the outlined function. However, in order to minimize the overheads introduced by outlining, we select a minimal set of candidates as parameters. For instance, global variables, which can be seen by all functions including the outlined function, are not passed as arguments. In addition, a variable which is defined inside the parallel region is not worth being a candidate on the parameter list since it is not live after the parallel region. Moreover, in order to preserve the original semantics after outlining, a parameter must be passed explicitly either by address or by value, depending on its data attributes as discussed in Section 3.3. With this view, it is natural that:

1. the shared variables are accessed via pointers by the outlined function. This gives visibility of the shared variables to the host function as well, so that the intrinsically shared semantics are kept. On the other hand, it can be profitable to minimize the usage of pointers to reduce the code complexity for subsequent program analysis [74, 13, 28]. For that purpose, C++ reference variables, classified as shared variables in MetaFork, are treated in an optimized way; that is, reference variables are passed by reference following C++ conventions. For example, Listing A.5 shows the outlined function, i.e. taskFunc4, corresponding to the parallel region in Figure 3.13, and the concurrent call to the outlined function in the host function at Line (42). The reference variable ref, which is used within the parallel region in Figure 3.13, is accessed by reference at Line (18) in the outlined function. By contrast, the shared variable m is accessed via a pointer at Line (20) in the outlined function.

2. the private variables, such as the variable j at Line (40) in Listing A.5, are passed by value. Thus, private variables are stored on the stack of the outlined function and the private semantics are preserved.

Instead of passing parameters one by one to the outlined function, all of them are wrapped into a struct type (say, T, as defined in Algorithm 4) which is passed to the outlined function as its only parameter, in the form of a pointer. Then T is re-declared in the outlined function and is broken into a sequence of variable declarations; see the outlined function in Listing A.5. This method solves the following issue we observed. For example, a new data type (say, A) is declared locally within the host function, and then A is used within the parallel region.

Algorithm 4: Outlining region(P, S, B)
Input: P, S and B are as in Algorithm 1.
Output: (E, H, T, P, O) where E is a function definition of the form
    static void * H(void *F) { T + U + D }
H is a function name, F is a void pointer name, T is a struct data type, O is an object of T, D is a function body, P is a sequence of initializations, U is a sequence of declarations, C is a sequence of formal constructor parameters of T, and I is a sequence of actual constructor parameters of T.

1 Initialize each of T, U, D, P, U, C, I to the empty string
2 /* convert parameter F to type T */
3 append (T + "*" + O + "=" + "(" + T + "*" + ")" + F) to U
4 foreach variable v in S do
5     /* reference variables are added to the struct constructor */
6     if v is a reference variable then
7         append (v.type + v.name) to T
8         append (v.type + v.name + "=" + T→v.name) to U
9         append (v.name + "(" + v.name + ")") to I
10        append (v.type + v.name) to C
11    else
12        append (v.type + "*" + v.name) to T
13        append (v.type + "*" + v.name + "=" + S→v.name) to U
14        append (O→v.name + "=" + "&" + v.name) to P
15 foreach variable v in P do
16     append (v.type + v.name) to T
17     append (v.type + v.name + "=" + T→v.name) to U
18     append (O→v.name + "=" + v.name) to P
19 /* allocate a new object O of type T */
20 if I is empty then
21     append P to (O + "=" + "(" + T + "*" + ")" + "malloc" + "sizeof(" + T + "))")
22 else
23     append P to (O + "=" + "new" + T + "(" + I + ")")
24 /* add a constructor to T */
25 append (T + "(" + C + ")" + ":" + I + "{}") to T
26 Translate B verbatim except /* PASS 2 */
27 foreach right-value R within B do
28     if (R is in S) and R is not a reference variable then
29         append ("*" + R.name) to D

In fact, without special treatment of A, A is not visible within the scope of the outlined function. We admit that there are several ways to address this issue, but with our method, i.e. by copying A to the outlined function before declaring T in the outlined function, we are able to keep the grammar correct with minor changes to the source program. Additionally, the value of each struct member is initialized either by its address, e.g. the variable arry, or by its value, e.g. the variable j. In particular, reference variables are initialized via the struct constructor, e.g. the variable ref, as noted in Listing A.5. Another key aspect of outlining is to patch the code within the parallel region. This allows the variables to access the correct memory location according to the parameter passing format, i.e. by value or by address. We accomplish this in the second pass, as shown in Algorithm 4. The code patching simply consists in replacing the variables (except reference variables) which are passed by address by their pointer form. To be precise, accesses to a variable x are translated to accesses to (*x). Lastly, the generated outlined function is placed immediately in front of the host function, either as a global function or as a member function within a C++ class. Another problem is related to handling nested parallelism during outlining. As discussed in Section 6.4.2, the immutable design of the Clang AST creates many challenges, as developers can only perform textual changes on the AST, which are not structured. This design does not work well in the case of complex code transformations, such as an implementation which requires a chain of transformations, that is, a transformation phase t_n that requires analysis on the output of the transformation phase t_{n-1}. To deal with this, a multiple parsing approach, which greatly simplifies the design of the outlining algorithm, is introduced to let the outlining procedure only work on the innermost parallel region during every single parsing iteration. At the end of each iteration, the back-end of the MetaFork compiler, detailed in Section 6.4.3, generates an output file serving as the input file for the next iteration, until the outermost parallel region is outlined. The code (translated from the BOTS benchmark) in Listing A.2, taken as an example here, illustrates our multiple parsing approach. Listing A.3 shows the result of the first run of the outlining procedure over the code in Listing A.2, and we observe that only the innermost parallel region at Line (38) in Listing A.2 is outlined. Using the code in Listing A.3 as input, we can perform the outlining procedure once more and, as seen in the result in Listing A.4, the outermost parallel region at Line (78) in Listing A.3 is replaced by a concurrent call to the outlined function (i.e. taskFunc5). Lastly, a meta for loop is faithfully translated to a cilk for loop, as can be seen in Figure 3.14.
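To summarize the mechanism on a toy case (all names below, such as taskFunc and outline_arg, are hypothetical and chosen only for readability, not taken from the generated code), a MetaFork parallel region that shares m and privatizes j could be turned into CilkPlus code along the following lines:

// Original MetaFork sketch:
//   meta_fork shared(m) { m += j; }
//   meta_join;

struct outline_arg { int *m; int j; };   // shared by address, private by value

static void *taskFunc(void *F) {
  outline_arg *o = (outline_arg *) F;
  int *m = o->m;
  int  j = o->j;
  *m += j;                               // patched body: m accessed as (*m)
  return NULL;
}

void host(int j) {
  int m = 0;
  outline_arg *o = new outline_arg;
  o->m = &m; o->j = j;
  cilk_spawn taskFunc(o);                // concurrent call to the outlined function
  cilk_sync;                             // CilkPlus counterpart of meta_join
}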

3.6.3 Translation from OpenMP to MetaFork

We start by providing an abstraction algorithm scheme that models the translated program as a sequence of subroutines that are invoked when relevant OpenMP constructs are encountered. With this scheme, we can formalize the extensive translation methods throughout this section in a more concise way that is easy for readers to follow. Algorithm 5 visualizes this abstraction algorithm scheme, which is sufficiently flexible to translate OpenMP constructs of the general form shown in Figure 3.15.

void outlining_example(void)
{
  int m,j;
  int &ref=m;
  int arry[10];

  meta_fork shared(m)
  {
    arry;
    ref++;
    int local_j = j;
    m;
  }
}

Figure 3.13: A code snippet showing how to handle data attributes of variables in the process of outlining

MetaFork version:
void function()
{
  meta_for ( int j = 0; j < ny; j++ )
  {
    int i;
    for ( i = 0; i < nx; i++ )
    {
      u[i][j] = unew[i][j];
    }
  }
}

CilkPlus translation:
void function()
{
  cilk_for ( int j = 0; j < ny; j++ )
  {
    int i;
    for ( i = 0; i < nx; i++ )
    {
      u[i][j] = unew[i][j];
    }
  }
}

Figure 3.14: A code snippet showing how to translate a parallel for loop from MetaFork to CilkPlus

#pragma omp directive_name [clause[ [, ]clause] ...]
structured-block

Figure 3.15: A general form of an OpenMP construct

Algorithm 5: Translation scheme(R)
Input: R, which represents an OpenMP construct as defined in Figure 3.15.
Output: (E, C, D, B, J) where E is a set of data environment statements placed before C, which is a MetaFork construct, D is a set of newly generated local declarations within C, B is the statement body of C, and J is a MetaFork synchronization construct or null. P is the set of private variables, S is the set of shared variables and F is the set of firstprivate variables.

1 (P, S, F) = DataAttributeAnalysis(R);
2 OuterEnvControl(R);            /* outputs E */
3 MetaConstruct(R, S);           /* outputs C */
4 InitLocalDataEnv(P, F);        /* outputs D */
5 Transform(structured-block);   /* outputs B */
6 PotentialSync(R);              /* outputs J */

When there is no direct one-to-one mapping between OpenMP and MetaFork constructs, we need to abstract away the differences among several OpenMP constructs so that they can be mapped to a single MetaFork construct. This, in turn, illustrates the compact design of MetaFork as a programming language, without losing its ability to express parallelism. Continuing our algorithm scheme, in order to preserve correct program behavior during the transformations, data-sharing attributes of variables referenced within OpenMP constructs are analyzed and determined according to the rules in Chapter 2.14 of the OpenMP application program interface [117]. This is done by the subroutine DataAttributeAnalysis, which is called first for each OpenMP construct. Secondly, programmers may use a set of OpenMP clauses (e.g. num threads) which carry information that controls the behavior of an OpenMP construct before its execution. Those clauses are translated by the OuterEnvControl subroutine, see Algorithm 6, and the translated code is placed immediately before the execution of the parallel region. This situation is illustrated in Figure 3.22. In essence, num threads is the only clause supported so far in this category. The OpenMP directive is then translated to its counterpart MetaFork construct (including optional clauses) by calling the subroutine MetaConstruct, see Algorithm 7. Examples of this subroutine can be found in Listing A.2 and Figures 3.22, 3.23 and 3.16. In connection with the results obtained from the DataAttributeAnalysis function, if there are variables which are local to each thread, we invoke the InitLocalDataEnv subroutine, as shown in Algorithm 8, to declare them as local variables at the beginning of the parallel region. Depending on the type (either private or firstprivate) of each variable, a specific method is dispatched to handle it. To be precise, variables of private data attribute are simply declared as local variables without initialization; in the case of firstprivate variables, as detailed in Algorithm 8, we translate local declarations and generate initialization code to initialize the firstprivate variables with their original values.

OpenMP version:
void test()
{
  int sum_a=0, sum_b=0;
  int a[5] = {0,1,2,3,4};
  int b[5] = {0,1,2,3,4};
  #pragma omp parallel sections
  {
    #pragma omp section
    {
      for(int i=0; i<5; i++)
        sum_a += a[i];
    }
    #pragma omp section
    {
      for(int i=0; i<5; i++)
        sum_b += b[i];
    }
  } /* an implicit barrier here*/
}

MetaFork translation:
void test()
{
  int sum_a=0, sum_b=0;
  int a[5] = {0,1,2,3,4};
  int b[5] = {0,1,2,3,4};
  meta_fork shared(sum_a)
  {
    for(int i=0; i<5; i++)
      sum_a += a[i];
  }
  meta_fork shared(sum_b)
  {
    for(int i=0; i<5; i++)
      sum_b += b[i];
  }
  meta_join;
}

Figure 3.16: Translation of the OpenMP sections construct

Algorithm 6: OuterEnvControl(R)
Input: R is as in Algorithm 5.
Output: (E) where E is a statement which sets the data environment variable controlling the behavior of an OpenMP program.

1 Initialize d to a set consisting of all clauses of R
2 if a "num threads" clause is in d then
3     extract the sub-expression, namely s, from the num threads clause;
4     /* a function call which sets the number of threads */
      generate("meta set nworkers(", s, ")");

Algorithm 7: MetaConstruct(R, S)
Input: R and S are as in Algorithm 5.
Output: C where C is a MetaFork construct (without body).

1 Initialize t to the construct type of R
2 Initialize b to the structured block of R
3 Initialize d to the clauses appearing in R
4 switch t do
5     case single
6         if a "nowait" clause is in d then
7             generate("meta fork ");
8             if S is non-empty then
9                 generate("shared(S) ");
10    case loop or parallel loop
11        if a "collapse" clause is in d then
12            initialize n to the number of the associated loops;
13        else
14            n = 1;
15        for i = 1 to n do
16            generate("meta for") replacing the ith "for" C/C++ keyword appearing in b;
17    case task or master or section
18        if S is non-empty then
19            generate("meta fork " + "shared(S) ");
20        else
21            generate("meta fork ");
22    case barrier or taskwait
23        generate("meta join;");

24    case parallel or sections or parallel sections
25        do nothing;

Regarding the shared variables, they are enclosed in the MetaFork shared clause. For instance, suppose a variable, say v, is categorized as firstprivate within an OpenMP construct. If v is of an array type, as defined in Figure 3.17, v is translated to the MetaFork code in Figure 3.18; however, if v is a scalar variable, as declared in Figure 3.19, its translated counterpart is illustrated in Figure 3.20.

int x[10];

Figure 3.17: An array type variable

int (*_fip_x)[10]=&x;
int x[10];
memcpy(x,_fip_x,sizeof(x))

Figure 3.18: A code snippet translated from the code in Figure 3.17

int x;

Figure 3.19: A scalar type variable

int _fip_x=x;
int x = _fip_x;

Figure 3.20: A code snippet translated from the code in Figure 3.19

int a[6] ={2,3,4,5,6,7},m=0;
#pragma omp parallel num_threads(3)
{
  #pragma omp for private(m)
  for(int i=0;i<6;i++)
  {
    m = m + a[i];
  }
}

Figure 3.21: A code snippet showing the variable m used as an accumulator

Note that if R is a loop construct, a limit is placed on the behavior of the variables in P or F, in order to preserve correctness in the target MetaFork code. Such variables cannot be used as an accumulator across different iterations, since the concept that each thread participating in the execution of a loop owns a copy of a variable is not employed by meta for. For example, the OpenMP code in Figure 3.21 relies on each thread participating in the execution of the parallel loop getting a copy of the variable m, and this piece of code cannot be translated to MetaFork code. In contrast, the OpenMP and MetaFork codes of Figure 3.22 are semantically equivalent and produce the expected execution. Further, unlike the outlining technique shown in Algorithm 4, our compiler ensures that there is no need to modify any references to variables within the structured-block. To clarify the meaning of the function Transform, it is worth mentioning that, as seen in Section 6.4.2, we use the standard RecursiveASTVisitor method to traverse the AST. This implies that the translation of an OpenMP construct is deferred until the completion of all the translations, say T, corresponding to its inner OpenMP constructs.

OpenMP version:
int a[6] ={2,3,4,5,6,7}, b[6], m=0;
#pragma omp parallel num_threads(3)
{
  #pragma omp for private(m)
  for(int i=0;i<6;i++)
  {
    m = a[i];
    b[i] = m * m;
  }
}

MetaFork translation:
int a[6] ={2,3,4,5,6,7}, b[6], m=0;
meta_set_nworkers(3);
meta_for(int i=0;i<6;i++)
{
  int m;
  {
    m = a[i];
    b[i] = m * m;
  }
}

Figure 3.22: A MetaFork code snippet (right), translated from the left OpenMP codes

Algorithm 8: InitLocalDataEnv(P, F)
Input: P and F are as in Algorithm 5.
Output: D where D is a set of newly generated local variable declarations.

1 Initialize t to the construct type of R
2 foreach variable v in F do
3     if v is an array type then
4         create a pointer (say f ip v) declaration to v, namely tp;
5         generate(tp);
6         generate(tp, "=", &v);
7         create a declaration of v, namely tv;
8         generate(tv);
9         generate("memcpy(", v, ",", f ip v, ",", "sizeof(", v, "))");
10    if t is loop or parallel loop then
11        if variable v is not the control variable of the associated parallel loop then
12            create the same declaration of v with a different variable name (say f ip v), namely tp;
13            generate(tp, "=", v);
14            create a declaration of v, namely tv;
15            generate(tv, "=", f ip v);

16 foreach variable v in P do
17     create a declaration of v, namely tv;
18     generate(tv);

Thus, the Transform function represents the changes that happen during T. More precisely, these changes are indirectly introduced by the other subroutines, i.e. OuterEnvControl, MetaConstruct, InitLocalDataEnv and PotentialSync, rather than directly made to the C/C++ statements inside the structured-block.

Algorithm 9: PotentialSync(R)
Input: R is as in Algorithm 5.
Output: J where J is a MetaFork synchronization construct or null.

1 Initialize t to the construct type of R
2 Initialize M to the set {parallel, sections, parallel sections, single}
3 if t is in M then
4     if there was at least one meta fork call since the last explicit synchronization point then
5         generate("meta join");

Lastly, a barrier may be required after the execution of a parallel region finishes, because some OpenMP constructs imply a barrier at their end. This is achieved by making a call to PotentialSync, which aims to minimize the synchronization points by eliminating redundant barriers. As shown in Figure 3.23, the translation of the second OpenMP single construct introduces an explicit meta join construct in the translated code, while the implicit barrier at the end of the OpenMP parallel construct is ignored so as to avoid two consecutive meta join constructs in the translated MetaFork code.

OpenMP version:
void imp_barrier()
{
  int tmp1, tmp2, tmp3;
  #pragma omp parallel
  {
    #pragma omp single nowait
    {
      tmp1 = tmp1 * tmp2;
    }
    #pragma omp single
    {
      tmp3 = tmp2 * tmp3;
    } /* an implicit barrier here*/
  } /* an implicit barrier here*/
}

MetaFork translation:
void imp_barrier()
{
  int tmp1, tmp2, tmp3;
  {
    meta_fork shared( tmp2, tmp1)
    {
      tmp1 = tmp1 * tmp2;
    }
    {
      tmp3 = tmp2 * tmp3;
    }
    meta_join;
  }
}

Figure 3.23: A code snippet showing how to avoid adding redundant barriers when translating the codes from OpenMP to MetaFork

3.6.4 Translation from MetaFork to OpenMP

This is easy in principle since the MetaFork language can be regarded as a subset of the OpenMP language. The meta fork construct can have the following forms:

- type var = meta_fork function();
- var = meta_fork function();
- meta_fork [shared(var)] block

We note that function calls spawned with the meta fork construct are translated using the task construct of OpenMP. Several concerns related to data attributes guide the translations: following the rules defined in Section 3.3 and in Chapter 2.14 of the OpenMP application program interface, array and reference variables have to be explicitly qualified in a shared clause in the translated OpenMP program to preserve the semantics. Figures 3.24 and 3.25 show examples of how the above forms are translated to the OpenMP task construct. The array variable arry and the reference variable ref are explicitly qualified as shared in the translated OpenMP code, as shown in Figure 3.24.

MetaFork version:
void foo(int i)
{
  int m = 0, arry[10];
  int &ref = m;
  int f = meta_fork test();
  meta_fork shared(f)
  {
    f = ref;
    arry[m] = f;
  }
}

OpenMP translation:
void foo(int i)
{
  int m = 0, arry[10];
  int &ref = m;
  int f;
  #pragma omp task shared(f)
  f = test();
  #pragma omp task shared(ref, arry,f)
  {
    f = ref;
    arry[m] = f;
  }
}

Figure 3.24: Example of translation from MetaFork to OpenMP

The translation from MetaFork parallel for loops to OpenMP is straightforward and simple, as shown in Figure 3.26. In addition, to allow the parallel execution of the translated OpenMP program, the creation of a new main function is needed, as can be seen in Figure 3.27. The function taskFunc0 replaces the original main function (on the left) and a new main function (on the right) is generated. The new main function supports nested parallelism in OpenMP by invoking a call to omp set nested(1).

3.7 Experimentation

In this section, we evaluate the performance and the usefulness of the four MetaFork translators (MetaFork to CilkPlus, CilkPlus to MetaFork, MetaFork to OpenMP, OpenMP to MetaFork).

[Figures 3.25 and 3.26: parallel fib code and a MetaFork parallel for loop together with their OpenMP translations; the listings are truncated in this copy.]

Original main function:
int main(int argc, char **argv)
{
  /* body*/
}

Translated OpenMP version:
int _taskFunc0(int argc, char **argv) {
  /* body*/
}

int main(int argc, char **argv){
  omp_set_nested(1);
  #pragma omp parallel
  #pragma omp single
  _taskFunc0(argc, argv);
  return 0;
}

Figure 3.27: A code snippet showing how to generate a new OpenMP main function

To this end, we run these translators on various input programs written either in CilkPlus or OpenMP, or both. We emphasize the fact that our purpose is not to compare the performance of the CilkPlus or OpenMP run-time systems. The reader should notice that the codes used in this study were written by different persons with different levels of expertise. In addition, the reported experimentation is essentially limited to one architecture (Intel Xeon) and one compiler (GCC). Therefore, it would be delicate to draw any clear conclusions comparing CilkPlus and OpenMP.

3.7.1 Experimentation Setup

We conducted three sets of experiments. In the first one, we compared the performance of hand-written codes. The motivation, specified in the introduction, is comparative implementation. The second one, motivated by the interoperability question raised in the introduction, is dedicated to the automatic translation of highly optimized code. Here, for each test-case, we have either a hand-written-and-optimized CilkPlus program or a hand-written-and-optimized OpenMP program. Our goal is to determine whether or not the translated programs have similar serial and parallel running times as their hand-written-and-optimized counterparts. In the last experiment, we compared the parallelism overheads measured on the original codes (either CilkPlus or OpenMP) and on their translated counterparts. For all experiments, apart from student's code, we use codes from the following sources:

1. The BPAS library http://www.bpaslib.org,

2. John Burkardt’s Home Page http://people.sc.fsu.edu/˜%20jburkardt/c_src/ openmp/openmp.html,

3. the BOTS [53] and

4. the CilkPlus distribution examples http://sourceforge.net/projects/cilk/.

The source code of those test cases (except BOTS) was compiled as follows:

1. CilkPlus code with GCC 4.8 using -O2 -g -lcilkrts -fcilkplus

2. OpenMP code with GCC 4.8 using -O2 -g -fopenmp

We run all our programs on

1. AMD Opteron 6168 48-core nodes (with 256GB RAM and 12MB L3).

2. Intel Xeon 2.66GHz/6.4GT 12-core nodes.

The two main quantities that we measure are:

1. scalability, by running our compiled OpenMP and CilkPlus programs on p = 1, 2, 4, 6, 8, ... processors; the speedup curves (or running time data) shown in Figures 3.28, 3.29, 3.30, 3.31, 3.32, 3.33, 3.34, 3.35, 3.36a, 3.36b, 3.37a, 3.38, 3.37b, 3.39a and 3.39b are mainly collected on Intel Xeon nodes and repeated/verified on AMD Opteron nodes.

2. parallelism overheads, by running our compiled OpenMP and CilkPlus programs on p = 1 against their serial elisions. This is shown in Table 3.2.

Note that, throughout Section 3.7, the speedup of a code executed using x threads is defined by

    s(x) = t_serial / t_parallel(x)        (3.1)

where t_serial is the execution time of the sequential version of the code and t_parallel(x) is that of its parallel counterpart using x threads.
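For instance, if the serial version of a program runs in 12 seconds and its parallel version runs in 3 seconds on 8 threads, then s(8) = 12/3 = 4, that is, half of the ideal linear speedup of 8.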

3.7.2 Correctness

Validating the correctness of our translators was a major requirement of our work. Depending on the test-case, we could use one of the following strategies.

1. Assume that the original program, say P, contains both a parallel code and its serial elision (manually written). When program P is executed, both codes run and their results are compared. Let us call Q the translated version of P. Since serial elisions are unchanged by our translation procedures, Q can be verified by the same process used for program P. This first strategy applies to the Cilk++ distribution examples and the BOTS examples.

2. If the original program P does not include a serial elision of the parallel code, then the translated program Q is verified by comparing the outputs of P and Q. This second strategy had to be applied to the FSU (Florida State University) examples.
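A minimal sketch (our own illustration, not code taken from the test suites) of the element-wise output comparison used by this second strategy:

#include <stdio.h>

/* Compare the outputs produced by the original program P and by its
   translation Q on the same input; return 1 when they agree. */
static int compare_outputs(const double *out_P, const double *out_Q, int n)
{
    for (int i = 0; i < n; i++)
        if (out_P[i] != out_Q[i]) {
            fprintf(stderr, "mismatch at index %d: %f vs %f\n",
                    i, out_P[i], out_Q[i]);
            return 0;
        }
    return 1;
}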

3.7.3 Comparative Implementation

For this first purpose, we use a series of test-cases, each of them consisting of a pair of hand-written programs: one written in OpenMP and the other in CilkPlus. Within each pair, a program S, written by a student, has a performance bottleneck, while its counterpart E, written by an expert, does not. For each pair, we translate one program (either S or E) to the other language. For these two programs (expressed in the same concurrency platform) we measure the running time on p processors, for 1 ≤ p ≤ 48, and compare the resulting data so as to narrow down the performance bottleneck in the inefficient program. Figures 3.28, 3.29, 3.30 and 3.31 illustrate four test-cases.

Parallel mergesort. The original OpenMP code (written by a student) fails to parallelize the merge phase and simply spawns the two recursive calls using OpenMP sections, while the original CilkPlus code (written by an expert) does parallelize the merge phase. In Figure 3.28, the running time curve of the translated OpenMP code is as theoretically expected, while the curve of the original OpenMP code shows limited scalability. This suggests that the original hand-written OpenMP code should expose more parallelism.

Matrix inversion. The two original parallel programs are based on different serial algorithms for inverting a dense matrix. The original OpenMP code uses Gauss-Jordan elimination while the original CilkPlus code uses a divide-and-conquer approach based on Schur's complement. Figure 3.29 shows that the code translated from CilkPlus to OpenMP is more appropriate for fork-join multithreaded languages targeting multi-cores. In other words, the Schur complement approach should be preferred in this context.

Matrix transposition. The two original parallel programs are based on different algorithms for matrix transposition, which is a challenging operation on multi-core architectures. Without doing a complexity analysis, it is very subtle to discover that the OpenMP code (written by a student) runs in O(n^2 log(n)) time instead of the O(n^2) achieved by the CilkPlus code (written by Matteo Frigo). Figure 3.30 shows the running times of both the original OpenMP code and the OpenMP code translated from the original CilkPlus code; it suggests that the code translated from CilkPlus to OpenMP is more appropriate for fork-join multithreaded languages targeting multi-cores, because of the algorithm used in the CilkPlus code.

Naive matrix multiplication. This test-case (the CilkPlus code was borrowed from https://computing.llnl.gov/tutorials/openMP/samples/C/omp_mm.c) is the naive three-nested-loops matrix multiplication algorithm, where two loops have been parallelized. The parallelism is O(n^2), as for DnC MM in Section 3.7.4. However, the work-to-memory-access ratio is essentially equal to 2, which is much less than for DnC MM. This limits the ability to scale and reduces performance overall on multi-core processors. As one can see from Figure 3.31, both the CilkPlus and OpenMP codes scale poorly because of this algorithmic issue.
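A minimal sketch of this pattern (ours, not the actual benchmark source): the two outer loops of the triple-nested product are parallelized, the inner reduction loop stays serial, and each output entry performs only a couple of flops per memory access, hence the low work-to-memory-access ratio mentioned above.

#include <omp.h>

/* Naive matrix product with the two outer loops parallelized (a sketch). */
void naive_mm(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)      /* serial reduction */
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}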

Figure 3.28: Parallel mergesort in size 10^8
Figure 3.29: Matrix inversion of order 4096

3.7.4 Interoperability

Our second experiment is dedicated to the automatic translation of highly optimized libraries. The motivation, presented in the introduction, is to facilitate interoperability between libraries/codes developed for different concurrency platforms, namely CilkPlus and OpenMP. For this question, we want to determine whether or not the translated programs have similar serial and parallel running times as their hand-written-and-optimized counterparts.


Figure 3.30: Matrix transpose: n = 32768
Figure 3.31: Naive matrix multiplication: n = 4096

(a) Fibonacci : 40 (b) Fibonacci : 45

Figure 3.32: Speedup curve on Intel node

Fibonacci number computation. This classical example is often used in the literature. The algorithm computes the integer F(n) (given by F(n) = F(n-1) + F(n-2), F(1) = 1, F(0) = 0) for a non-negative integer n. Results of intermediate recursive calls are not remembered. Thus, the algorithm is a divide-and-conquer "5-line procedure" with high parallelism and no data traversal, making it a very easy test case for concurrency platforms based on the fork-join parallelism model. In this example, we have translated the original CilkPlus code to OpenMP. The speedup curves for computing Fibonacci with inputs 40 and 45 are shown in Figure 3.32(a) and Figure 3.32(b), respectively. As one can see from Figure 3.32, the CilkPlus (original) and OpenMP (translated) codes scale well.
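A minimal sketch of this fork-join recursion, written with CilkPlus keywords (ours, not the exact code of the distribution example):

#include <cilk/cilk.h>

long fib(long n)
{
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);   /* the child may run in parallel ... */
    long y = fib(n - 2);              /* ... with this continuation */
    cilk_sync;                        /* wait for the spawned call */
    return x + y;
}

Eliding the cilk_spawn and cilk_sync keywords yields the serial elision used for validation and for measuring work overhead.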

(a) DnC MM : 4096 (b) DnC MM : 8192

Figure 3.33: Speedup curve on Intel node

Divide-and-conquer matrix multiplication. This matrix multiplication algorithm [58] (DnC MM) is another easy case for the fork-join parallelism model. It computes the product of two dense matrices by means of a divide-and-conquer procedure which recursively divides the input matrices into blocks, until a base case is reached. When this latter scenario happens, the naive three-nested-loops matrix multiplication algorithm is applied. DnC MM has a high theoretical parallelism (namely Θ(n^2) in our code) and it is cache-complexity optimal [58] among all dense matrix multiplication algorithms. To be precise, the work-to-memory-access ratio is Θ(√(Z L)), where Z and L are the cache size and the cache line size, respectively. In this example, we have translated the original CilkPlus code to OpenMP code. The speedup curves for matrix multiplication with inputs 4096 and 8192 are shown in Figure 3.33(a) and Figure 3.33(b), respectively. As one can see from Figure 3.33, the CilkPlus (original) and OpenMP (translated) codes scale well.

Parallel prefix sum. This other classical example [20] is often used in research articles dealing with scheduling. Given n items (which could be numbers, matrices, etc.) x_1, ..., x_n, this divide-and-conquer procedure computes all prefixes x_1 + ... + x_i for i in {1, ..., n}. The theoretical parallelism is in Θ(n/log(n)) and the work-to-memory-access ratio is constant. This algorithm is, therefore, more challenging to implement on multi-cores than the previous two. In this example, we have translated the original CilkPlus code to OpenMP code. The speedup curves with inputs 5·10^8 and 10^9 are shown in Figure 3.34(a) and Figure 3.34(b), respectively. As one can see from Figure 3.34, the CilkPlus (original) and OpenMP (translated) codes scale well at almost the same rate.

Quick sort. This is the classical quick-sort algorithm where the division phase has not been parallelized, on purpose. Consequently, the theoretical parallelism drops to Θ(log(n)). The work-to-memory-access ratio is constant. The speedup curves for quick sort are shown in Figure 3.35(a) and Figure 3.35(b). As one can see from Figure 3.35, the CilkPlus (original) and OpenMP (translated) codes scale at almost the same rate.
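A minimal sketch of the pattern (ours, not the benchmark source): only the two recursive calls are spawned, while the partition phase stays serial, which caps the parallelism at Θ(log(n)).

#include <cilk/cilk.h>

static long partition_serial(long *A, long lo, long hi)   /* Lomuto scheme */
{
    long pivot = A[hi], i = lo;
    for (long j = lo; j < hi; j++)
        if (A[j] < pivot) { long t = A[i]; A[i] = A[j]; A[j] = t; i++; }
    long t = A[i]; A[i] = A[hi]; A[hi] = t;
    return i;
}

void quicksort(long *A, long lo, long hi)
{
    if (lo >= hi) return;
    long p = partition_serial(A, lo, hi);   /* serial division phase */
    cilk_spawn quicksort(A, lo, p - 1);
    quicksort(A, p + 1, hi);
    cilk_sync;
}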

(a) Prefix sum: n = 5·10^8 (b) Prefix sum: n = 10^9

Figure 3.34: Speedup curve on Intel node

(a) Quick Sort: n = 2·10^8 (b) Quick Sort: n = 5·10^8

Figure 3.35: Speedup curve on Intel node

(a) Mandelbrot set (b) Linear system solving (dense method)

Figure 3.36: Running time of Mandelbrot set and Linear system solving

Mandelbrot set. This algorithm (see http://en.wikipedia.org/wiki/Mandelbrot_set) does not traverse a large set of data and is compute-intensive. The running times for computing the Mandelbrot set on a 500 × 500 grid with 2000 iterations are shown in Figure 3.36a. As one can see from Figure 3.36a, both the OpenMP (original) and CilkPlus (translated) codes scale well.

Linear system solving (dense method). In this example, different methods of solving the linear system A × x = b are compared. There is a standard sequential code and a slightly modified sequential code that takes advantage of OpenMP. The algorithm used in this example is Gaussian elimination. This algorithm has lots of parallelism; however, minimizing parallelism overheads and memory traffic is a challenge for this operation. The running time of this example is shown in Figure 3.36b.

FFT (FSU version). This example demonstrates the computation of a Fast Fourier Transform in parallel. The algorithm used in this example has a low work-to-memory-access ratio, which makes it challenging to implement efficiently on multi-cores. The running time of this example is shown in Figure 3.37a. As one can see from Figure 3.37a, both the OpenMP (original) and CilkPlus (translated) codes scale well up to 8 cores.

FFT (BOTS). This example computes the one-dimensional Fast Fourier Transform over n complex values using the Cooley-Tukey [41] algorithm. It is a divide-and-conquer algorithm that recursively breaks down a Discrete Fourier Transform (DFT) into many smaller DFTs. The speedup curve for this example is shown in Figure 3.37b. It is clear that the translated code scales better.


(a) FFT (FSU) over the complex numbers in size 2^25 (b) FFT (BOTS)

Figure 3.37: FFT test-cases : FSU version and BOTS version

(a) OpenMP Single version (b) OpenMP For version

Figure 3.38: Speedup curve of Protein alignment - 100 Proteins

Protein alignment (BOTS). This algorithm is implemented with the Myers and Miller [115] method. It exposes relatively high parallelism but also high communication/synchronization costs. The original code was heavily tuned to address these latter costs. The speedup curves for this example are shown in Figure 3.38. As one can see from Figure 3.38, both the OpenMP (original) and CilkPlus (translated) codes scale at almost the same rate.

Sparse LU matrix factorization (BOTS). This example computes an LU matrix factorization over sparse matrices. The challenge in implementing this algorithm efficiently is dealing with load imbalance. The speedup curve for this example is shown in Figure 3.39a. As one can see from Figure 3.39a, both the OpenMP (original) and CilkPlus (translated) codes scale at almost the same rate.

(a) Sparse LU (OpenMP single version) (b) Strassen matrix multiplication

Figure 3.39: Speedup curve of Sparse LU and Strassen matrix multiplication

Strassen matrix multiplication (BOTS). The Strassen algorithm (see http://en.wikipedia.org/wiki/Strassen_algorithm) uses a hierarchical decomposition of a matrix for the multiplication of large dense matrices [56]. The decomposition is done by dividing each dimension of the matrix into two sections of equal size. As one can see from Figure 3.39b, both the OpenMP (original) and CilkPlus (translated) codes scale at almost the same rate.

BPAS library. For this test-case, we have used the BPAS library which counts more than 150,000 lines of CilkPlus code. Half of those lines are dedicated to polynomial multiplication and we translated those to OpenMP. In Table 3.1, we report on timings of two of the main algorithms for polynomial multiplication, namely 8-way Toom-Cook and divide-and-conquer plain multiplication. One can see that the original and translated codes have similar running times on 1 and 16 cores, for all input data sizes that we tested. Therefore, the OpenMP version of the BPAS library retains the good performance of the original version written in CilkPlus.

3.7.5 Parallelism Overheads

Our third experiment is devoted to the following question: do the MetaFork translators add extra parallelism overheads to the generated code w.r.t. the original code? We focus here on work overhead. By work overhead, we mean the time ratio between a multithreaded program run on one core and its serial elision. For this experiment, we have considered original programs using different parallelism patterns (divide-and-conquer, parallel for-loops) and written in both OpenMP and CilkPlus. Our results are collected in Table 3.2. For all the examples that we tested, we could observe that, if the original program has little work overhead, then the same holds for the translated program.
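As an illustration with the prefix sum data of Table 3.2, the original CilkPlus program has a work overhead of 14.71/13.04 ≈ 1.13, and its OpenMP translation has a work overhead of 14.92/13.19 ≈ 1.13 as well, which we regard as little work overhead.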


Table 3.1: BPAS timings with 1 and 16 workers: original CilkPlus code and translated OpenMP code

Test                      Input size   CilkPlus             OpenMP
                                       T1        T16        T1        T16
8-way Toom-Cook           2048         0.423     0.231      0.421     0.213
                          4096         1.849     0.76       1.831     0.644
                          8192         9.646     2.742      9.241     2.774
                          16384        39.597    9.477      39.051    8.805
                          32768        174.365   34.863     172.562   33.032
DnC plain polynomial      2048         0.874     0.259      0.867     0.299
multiplication            4096         3.95      1.264      3.925     1.123
                          8192         18.196    3.335      18.154    4.428
                          16384        77.867    12.778     75.885    12.674
                          32768        331.351   55.841     332.126   55.925

Table 3.2: Timings on AMD 48-core: underlined timings refer to original code and non-underlined timings to translated code

Test                      Input size   CilkPlus            OpenMP
                                       Serial    T1        Serial    T1
Protein alignment (for)   100          24.82     24.82     25.16     25.15
quicksort                 5·10^8       89.72     91.94     89.83     91.61
prefixsum                 5·10^8       13.04     14.71     13.19     14.92
Fibonacci                 45           8.661     8.666     8.87      8.88
DnC MM                    4096         754.11    756.55    754.57    759.09
Mandelbrot                500 × 500    0.64      0.64      0.64      0.65

3.8 Summary

MetaFork allows for rapidly mapping algorithms written for one concurrency platform to another. As we have seen in Section 3.7, MetaFork can be applied for (1) comparing algorithms written with different concurrency platforms and (2) porting more programs to systems that may have a highly optimized run-time for one paradigm (say divide-and-conquer algorithms, or producer-consumer).

The MetaFork translation framework may also avoid the negative interferences of having multiple interfaces between the different components of a large solver written with various concurrency platforms. Along the same idea, the MetaFork translators can be used to transform legacy code into a more adequate concurrency platform.

The experimental results of Section 3.7, as well as those reported in the technical report [34], suggest that our translators can be used to narrow down performance bottlenecks. By translating a parallel program with low performance, we could identify whether the cause of inefficiency was a poor implementation (in the case of parallel mergesort, where not enough parallelism was exposed), an algorithm inefficient in terms of data locality (in the case of matrix inversion) or an algorithm inefficient in terms of work (in the case of matrix transposition).

For the second part of our experimentation, our results show that the speedup curves of the original and translated codes either match or have similar shapes. Nevertheless, in some cases, either the original or the translated program outperforms its counterpart. For John Burkardt's programs, the speedup curves of the original programs are close to those of the translated CilkPlus programs. We observe that the only parallel construct used in his source examples is the parallel for-loop. In addition, the results of Table 3.2 show that if a CilkPlus (resp. OpenMP) program has little work overhead, the same holds for its OpenMP (resp. CilkPlus) counterpart translated by MetaFork.

Last but not least, we think that a great benefit of MetaFork is the abstraction that it provides. It can be useful for parallel language design (for example, in designing parallel extensions to C/C++) as well as a good tool to teach parallel programming.

Acknowledgments. This work was supported in part by NSERC of Canada and in part by an IBM CAS Fellowship in 2013 and 2014. We are also grateful to Abdoul-Kader Keita (IBM Toronto Lab) for his advice and technical support.

(MetaFork project website: www.metafork.org)

Chapter 4

Applying MetaFork to the Generation of Parametric CUDA Kernels

In this chapter, we present the accelerator model of MetaFork together with the software framework that allows automatic generation of CUDA code from annotated MetaFork programs. One of the key properties of this CUDA code generator is that it supports the generation of CUDA kernel code (see Section 2.1.3) where program parameters (like the number of threads per block) and machine parameters (like the shared memory size) are allowed. These parameters need not be known at code-generation time: machine parameters and program parameters can be, respectively, determined and optimized when the generated code is installed on the target machine.

The need for CUDA programs (more precisely, kernels) depending on program parameters and machine parameters is argued in Section 4.1. In Section 4.2, following the authors of [67], we observe that generating parametric CUDA kernels requires the manipulation of systems of non-linear polynomial equations and the use of techniques like quantifier elimination (QE). To this end, we take advantage of the RegularChains library of Maple [32] and its QuantifierElimination command, which has been designed to efficiently support the non-linear polynomial systems coming from automatic parallelization.

Section 4.3 is an overview of the MetaFork language constructs for generating SIMD code. Section 4.4 reports on a preliminary implementation of the MetaFork generator of parametric CUDA kernels from input MetaFork programs. In addition to the RegularChains library, we take advantage of PPCG, the polyhedral parallel code generator for CUDA [147], which we have adapted so as to produce parametric CUDA kernels. Finally, Section 4.5 gathers experimental data demonstrating the performance of our generated parametric CUDA code. Not only do these results show that the generation of parametric CUDA kernels helps optimize code independently of the values of the machine parameters of the targeted hardware, but they also show that the automatic generation of parametric CUDA kernels may discover better values for the program parameters than those computed by a tool generating non-parametric CUDA kernels.


4.1 Optimizing CUDA Kernels Depending on Program Parameters

Estimating the amount of computing resources (time, space, energy, etc.) that a parallel program, written in a high-level language, requires to run on specific hardware is a well-known challenge. A first difficulty is to define models of computation retaining the computer hardware characteristics that have a dominant impact on program performance. That is, in addition to specifying the appropriate complexity measures, those models must consider the relevant parameters characterizing the abstract machine executing the algorithms to be analyzed. A second difficulty is, for a given model of computation, to combine its complexity measures so as to determine the "best" algorithm among different algorithms solving a given problem. Models of computation which offer those estimates necessarily rely on simplification assumptions. Nevertheless, such estimates can deliver useful predictions for programs satisfying appropriate properties.

In [71], the authors propose a many-core machine (MCM) model for multithreaded computation combining the fork-join and SIMD parallelisms; a driving motivation in this work is to estimate parallelism overheads (data communication and synchronization costs) of GPU programs. In practice, the MCM model determines a trade-off among work, span and parallelism overhead by checking the estimated overall running time so as to (1) either tune a program parameter or (2) compare different algorithms independently of the hardware details.

We illustrate the use of the MCM model with a very classical example: the computation of Fast Fourier Transforms (FFTs). Our goal is to compare the running times of two of the most commonly used FFT algorithms: that of Cooley & Tukey [41] and that of Stockham [140]. Let f be a vector over a field K of coefficients, say the complex numbers. Assume that f has size n where n is a power of 2. Let U be the time (expressed in clock cycles) to transfer one machine word between the global memory and the private memory of any SM (see Section 2.1.3), that is, U > 0. Let Z be the size (expressed in machine words) of the private memory of any SM, which sets up an upper bound on several program parameters. In the MCM model, the number of SMs is unbounded, as is the size of the global memory, whereas each SM has a private memory of size Z. Let ℓ ≥ 2 be a positive integer. For Z large enough, both the Cooley & Tukey algorithm and the Stockham algorithm can be implemented

• by calling log2(n) times a kernel using Θ(n/ℓ) SMs, each of those SMs executing a thread block with ℓ threads,

• with a respective work of
    Wct = n (34 log2(n) log2(ℓ) + 47 log2(n) + 333 − 136 log2(ℓ))   and
    Wsh = 43 n log2(n) + 4n/ℓ + 12 ℓ + 1 − 30 n,
  where the work of a thread block is the total number of local operations (arithmetic operations, read/write memory accesses in the private memory of an SM) performed by all its threads, and the work of a kernel is the sum of the works of all its thread blocks,

• with a respective span of
    Sct = 34 log2(n) log2(ℓ) + 47 log2(n) + 2223 − 136 log2(ℓ)   and
    Ssh = 43 log2(n) + 16 log2(ℓ) + 3,
  where the span of a thread block is the maximum number of local operations performed by one of its threads, and the span of a kernel is the maximum span among all its thread blocks,

• with a respective parallelism overhead of
    Oct = 2 n U (4 log2(n)/ℓ + log2(ℓ) − (log2(ℓ) + 15)/ℓ)   and
    Osh = 5 n U log2(n)/ℓ + 5 n U/(4 ℓ),
  where the parallelism overhead (or overhead, for short) of a thread block accounts for the time to transfer data between the global memory of the machine and the private memory of the SM running this thread block, taking coalesced accesses into account, and the parallelism overhead of a kernel is the sum of the overheads of all its thread blocks.

See [71] for details on the above estimates. From those, one observes that the overhead of

the Cooley & Tukey algorithm has an extraneous term in O(n U log2(ℓ)), which is due to a higher amount of non-coalesced accesses. In addition, when n escapes to infinity (while ℓ remains bounded on a given machine, since we have ℓ ∈ O(Z)), the work and span of the algorithm of Cooley & Tukey are increased by a Θ(log2(ℓ)) factor w.r.t. their counterparts in the Stockham algorithm. These theoretical observations suggest that, as ℓ increases, the Stockham algorithm performs better than the one of Cooley & Tukey. This has been verified experimentally by the authors of [71] (whose algorithms are implemented in CUDA and publicly available with benchmarking scripts from http://www.cumodp.org/), as well as by others, see [109] and the papers cited therein. On the other hand, it was also observed experimentally that for ℓ small, the Cooley & Tukey algorithm is competitive with that of Stockham. Overall, this suggests that generating kernel code, for both algorithms, where ℓ is an input parameter, is a desirable goal. With such parametric codes, one can choose at run-time the most appropriate FFT algorithm, once ℓ has been chosen.

The MCM model retains many of the characteristics of modern GPU architectures and programming models, like CUDA [116] and OpenCL [141]. However, in order to support algorithm analysis with an emphasis on parallelism overheads, the MCM abstract machines admit a few simplifications and limitations with respect to actual many-core devices. To go further in our discussion of CUDA kernel performance, let us consider now the programming model of CUDA itself and its differences w.r.t. the MCM model.

In CUDA, instructions are issued per warp; a warp consists of a fixed number Swarp of threads. Typically Swarp is 32 and, thus, executing a thread block on an SM means executing several warps in turn. If an operand of an executing instruction is not ready, then the corresponding warp stalls and a context switch happens between warps running on the same SM. Registers and shared memory are allocated to a thread block as long as that thread block is active. Once a thread block is active, it stays active until all threads in that thread block have completed. Context switching is very fast because registers and shared memory do not need to be saved and restored. The intention is to hide the latency (of data transfer between the global memory and the private memory of an SM) by having more memory transactions in flight. There is, of course, a hardware limitation to this, characterized by (at least) two numbers:

• the maximum number of active warps per SM, denoted here by Mwarp; a typical value for Mwarp is 48 on a Fermi NVIDIA GPU card, leading to a maximum number of 32 × 48 = 1536 active threads per SM;

• the maximum number of active thread blocks per SM, denoted here by Mblock; a typical value for Mblock is 8 on a Fermi NVIDIA GPU card.

One can now define a popular performance counter of CUDA kernels, the occupancy of an SM: it is given by Awarp/Mwarp, where Awarp is the number of active warps on that SM. Since resources (registers, shared memory, thread slots) are allocated for an entire thread block (as long as that block is active), there are three potential limitations to occupancy: register usage, shared memory usage and thread block size. As in our discussion of the MCM model, we denote the thread block size by ℓ. Two observations regarding the possible values of ℓ:

• The total number of active threads is bounded above by Mblock ℓ, hence a small value for ℓ may limit occupancy.

• A larger value for ℓ will reduce the amount of registers and shared memory words available per thread; this will limit data reuse within a thread block and, thus, will potentially increase the amount of data transfer between the global memory and the private memory of an SM.

Overall, this suggests again that generating kernel code, where ℓ and other program parameters are input arguments, is a desirable goal. With such parametric code at hand, one can optimize at run-time the values of those program parameters (like ℓ) once the machine parameters (like Swarp, Mwarp, Mblock, Z (the private memory size) and the size of the register file) are known.
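To make the occupancy discussion above concrete: under the Fermi figures just quoted (Swarp = 32, Mwarp = 48, Mblock = 8), a hypothetical kernel limited only by its thread block size ℓ = 128 yields 128/32 = 4 warps per block; with at most Mblock = 8 active blocks, that is Awarp = 32 active warps, hence an occupancy of 32/48 ≈ 0.67. Doubling ℓ to 256 gives 8 warps per block, so 6 active blocks already reach the Mwarp limit and the occupancy becomes 48/48 = 1, provided registers and shared memory do not become the binding constraint.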

4.2 Automatic Generation of Parametric CUDA Kernels

The general purpose of automatic parallelization is to convert sequential computer programs into multithreaded or vectorized code. Following the discussion of Section 4.1, we are interested here in the following more specific question.

Given a theoretically good parallel algorithm (e.g. divide-and-conquer matrix multiplication) and given a type of hardware that depends on various parameters (e.g. a GPGPU with Z words of private memory per SM and a maximum number Mwarp of warps supported by an SM, etc.), we aim at automatically generating CUDA kernels that depend on the hardware parameters (Z, Mwarp, etc.), as well as on program parameters (e.g. the number ℓ of threads per block), such that those parameters need not be known at compile-time, and are encoded as symbols in the generated kernel code. For this reason, we call such CUDA kernels parametric.

In contrast, current technology requires that machine and program parameters are specialized to numerical values at the time of generating the GPGPU code, see [69, 16, 76, 147]. For example, for the following code computing the product of two univariate polynomials a and b, both of degree n, and writing the result to c,

for (i = 0; i <= n; i++) { c[i] = 0; c[i+n] = 0; }
for (i = 0; i <= n; i++) {
  for (j = 0; j <= n; j++)
    c[i+j] += a[i] * b[j];
}

elementary dependence analysis suggests to set t(i, j) = n − j and p(i, j) = i + j, where t and p represent time and processor respectively [67]. Using Fourier-Motzkin elimination, projecting all constraints on the (t, p)-plane yields the following asynchronous schedule of the above code:

parallel_for (p = 0; p <= 2*n; p++) {
  c[p] = 0;
  for (t = max(0, n-p); t <= min(n, 2*n-p); t++)
    c[p] = c[p] + a[t+p-n] * b[n-t];
}

As mentioned in Section 2.3.3, we should use a tiling approach [122]: we consider a one-dimensional grid of thread blocks where each block is in charge of updating at most B coefficients of the polynomial c. Therefore, we introduce three variables B, b and u where the latter

two represent a thread block index and a thread index (within a thread block). This brings the following additional relations:

    0 ≤ b
    0 ≤ u < B                                    (4.1)
    p = b B + u,

to the previous system

    0 < n
    0 ≤ i ≤ n
    0 ≤ j ≤ n                                    (4.2)
    t = n − j
    p = i + j.

To determine the target program, one needs to eliminate the variables i and j. In this case, Fourier-Motzkin elimination (FME) does not apply any more, due to the presence of non-linear constraints. If all the non-linear constraints appearing in a system of relations are polynomial constraints, the set of real solutions of such a system is a semi-algebraic set. The celebrated Tarski theorem [18] tells us that there always exists a quantifier elimination algorithm to project a semi-algebraic set of R^n to a semi-algebraic set of R^m, m ≤ n. The most popular method for conducting quantifier elimination (QE) of a semi-algebraic set is through cylindrical algebraic decomposition (CAD) [40]. Implementations of QE and CAD can be found in software such as Qepcad [27], Reduce [73], Mathematica [153] as well as the RegularChains library of Maple [32]. Using the function QuantifierElimination (with options 'precondition'='AP', 'output'='rootof', 'simplification'='L4') in the RegularChains library, we obtain the following:

    B > 0
    n > 0
    0 ≤ b ≤ 2n/B
    0 ≤ u < B                                    (4.3)
    0 ≤ u ≤ 2n − B b
    p = b B + u,
    0 ≤ t ≤ n,
    n − p ≤ t ≤ 2n − p,

from where we derive the following program:

for (int p = 0; p <= 2*n; p++) { c[p] = 0; }
parallel_for (int b = 0; b <= 2*n/B; b++) {
  for (int u = 0; u <= min(B-1, 2*n-B*b); u++) {
    int p = b * B + u;
    for (int t = max(0, n-p); t <= min(n, 2*n-p); t++)
      c[p] += a[t+p-n] * b[n-t];
  }
}

An equivalent CUDA kernel to the parallel_for part is as below:

int b = blockIdx.x;
int u = threadIdx.x;
if (u <= 2 * n - B * b) {
  int p = b * B + u;
  for (int t = max(0, n-p); t <= min(n, 2*n-p); t++)
    c[p] += a[t+p-n] * b[n-t];
}

We remark that the polynomial system defined by (4.1) and (4.2) has some special structure. The authors of [66] have exploited this structure to deduce a special algorithm to solve it and similar problems by implementing a form of parametric FME. Although the system defined by (4.1) and (4.2) can be directly processed by QuantifierElimination, we found that it is much more efficient to use the following special QE procedure. We replace the product b B in system (4.1) by a new variable c, and thus obtain a system of linear constraints. We then apply FME to eliminate the variables i, j, t, p, u successively. We now obtain a system of linear constraints in the variables c, b, n, B. Next, we replace c back by b B, which again gives a system of non-linear constraints in the variables b, n, B, and we call QuantifierElimination on this last system. The correctness of the procedure is easy to verify.
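As a quick sanity check of the tiled schedule above, the following short C program (ours; the parallel_for is replaced by an ordinary serial loop and the sizes n and B are arbitrary) verifies that the tiled loop nest computes the same product as the naive double loop:

#include <stdio.h>
#include <stdlib.h>

#define MAX(x,y) ((x) > (y) ? (x) : (y))
#define MIN(x,y) ((x) < (y) ? (x) : (y))

int main(void)
{
    int n = 37, B = 8;                       /* arbitrary; B need not divide 2n+1 */
    int a[64], b[64], c_naive[128] = {0}, c_tiled[128] = {0};
    for (int i = 0; i <= n; i++) { a[i] = rand() % 100; b[i] = rand() % 100; }

    /* naive polynomial product */
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
            c_naive[i+j] += a[i] * b[j];

    /* tiled schedule derived by quantifier elimination (serial version) */
    for (int blk = 0; blk <= 2*n/B; blk++)
        for (int u = 0; u <= MIN(B-1, 2*n - B*blk); u++) {
            int p = blk * B + u;
            for (int t = MAX(0, n-p); t <= MIN(n, 2*n-p); t++)
                c_tiled[p] += a[t+p-n] * b[n-t];
        }

    for (int p = 0; p <= 2*n; p++)
        if (c_naive[p] != c_tiled[p]) { printf("mismatch at %d\n", p); return 1; }
    printf("the two schedules agree\n");
    return 0;
}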

4.3 Extending the MetaFork Language to Support Device Constructs

We enhance the MetaFork language with constructs allowing the programmer to express the fact that a function call can be executed on an external (or remote) hardware component. The latter is referred to as the device, while the hardware component on which the MetaFork program was initially launched is referred to as the host, and this program is then called the host code. Both the host and the device maintain their own separate memory spaces. Such function calls on an external device are expressed by means of two new keywords: meta_device and meta_copy. We call a statement of the form

meta_device ⟨variable declaration⟩

a device declaration; it is used to express the fact that a variable is declared in the memory address space of the device. A statement of the form

meta_device ⟨function call⟩

is called a device function call; it is used to express the fact that a function call is executed on the device concurrently (thus in a non-blocking way) with the execution of the parent thread on the host. All arguments in the function call must be either device-declared variables or values of primitive types (char, float, int, double). A statement of the form

meta_copy (⟨range⟩, ⟨variable⟩, ⟨variable⟩)

copies the bytes whose memory addresses are in range (and which are assumed to be data referenced by the first variable) to the memory space referenced by the second variable; the difference between the lower end of the range and the memory address of the source array is used as the offset at which to write in the second variable. Moreover, either one or both variables must be device-declared variables.

void foo()
{
  int arry_host[N];
  initialize(arry_host, N);
  meta_fork bar(arry_host, N);
  work();
}

void foo()
{
  int arry_host[N];
  initialize(arry_host, N);
  // declare an array on the device
  meta_device int arry_device[N];
  // copy 24 bytes of host array
  // from host to device
  meta_copy(arry_host+8, arry_host+32, arry_host, arry_device);
  meta_device bar(arry_device, N);
  work();
}

Figure 4.1: MetaFork examples

The left part of Figure 4.1 shows a MetaFork code fragment with a spawned function call operating on an array located in the shared memory of the host. On the right part of Figure 4.1, the same function is called on an array located in the memory of the device. In order for this function call to perform the same computation on the device as on the host, the necessary coefficients of the host array are copied to the device array. In this example, we assume that those coefficients are arry_host[2], ..., arry_host[8].

Several devices can be used within the same MetaFork program. In this case, each of the constructs above is followed by a number referring to the device to be used. Therefore, these function calls on an external device can be seen as a one-sided message passing protocol, similar to the one of the Julia programming language (http://julialang.org/).

We stress the following facts about function calls on an external device. Any function declared in the host code can be invoked in a device function call. Moreover, within the body of a function invoked in a device function call, any other function defined in the host code can be: (1) either called (in the ordinary way, that is, as in the C language) and then executed on the same device, or (2) called on another device. As a consequence, device function calls, together with spawned function calls and ordinary function calls, form a directed acyclic graph of tasks to which the usual notions of the fork-join concurrency model can be applied.

The mechanism of function call on an external device can be used to model the distributed computing model of Julia as well as heterogeneous computing in the sense of CUDA. The latter is, however, more complex since, as discussed in Section 4.1, it borrows from both the fork-join concurrency model and SIMD parallelism. In particular, a CUDA kernel call induces how the work is scheduled among the available SMs. Since our goal is to generate efficient CUDA code from an input MetaFork program, it is necessary to annotate MetaFork code in a more precise manner. To this end, we introduce another keyword, namely meta_schedule. Any block of MetaFork code (like a meta_for loop) can be the body of a meta_schedule statement. The semantics of that statement is that of its body, and meta_schedule is an indication to the MetaFork-to-CUDA translator that every meta_for loop nest of the meta_schedule statement must translate to a CUDA kernel call.

A meta_schedule statement generating a one-dimensional grid with one-dimensional thread blocks has the structure shown in Figure 4.2, where the grid (resp. thread block) dimension size is extracted from the outer (resp. inner) meta_for loop upper bound. Similarly, a meta_schedule statement generating a two-dimensional grid with two-dimensional thread blocks has the structure shown in Figure 4.3, where the first two outer meta_for loops correspond to the grid and the inner meta_for loops to the thread blocks. We skip the other possible configurations since they have not been implemented yet in the MetaFork compilation framework.

meta_schedule {
  // only for loops are supported here
  meta_for (int i = 0; i < gridDim.x; i++)
    // only for loops are supported here
    meta_for (int j = 0; j < blockDim.x; j++) {
      ...   // nested for-loop body
    }
}

Figure 4.2: Using meta_schedule to define a one-dimensional CUDA grid and thread block

meta_schedule {
  // only for loops are supported here
  meta_for (int u = 0; u < gridDim.y; u++)
    meta_for (int i = 0; i < gridDim.x; i++)
      // only for loops are supported here
      meta_for (int v = 0; v < blockDim.y; v++)
        meta_for (int j = 0; j < blockDim.x; j++) {
          ...   // nested for-loop body
        }
}

Figure 4.3: Using meta_schedule to define a two-dimensional CUDA grid and thread block

In order to obtain the serial C-elision of MetaFork programs containing device constructs, we apply the following rules: (1) similarly to a cilk_spawn function call, the meta_schedule and meta_device keywords are replaced by null, and (2) the meta_copy keyword is replaced by null along with its parameters. We conclude this section with an example, a one-dimensional stencil computation, namely Jacobi. The original (and naive) C version is shown in Figure 4.4. From this C code fragment, we apply the tiling techniques mentioned in Section 4.2 and obtain the MetaFork code shown in Figure 4.5. Observe that the meta_schedule statement has two meta_for loop nests.

for (int t = 0; t < T; ++t) {
  for (int i = 1; i < N-1; ++i)
    b[i] = (a[i-1] + a[i] + a[i+1]) / 3;
  for (int i = 1; i < N-1; ++i)
    a[i] = b[i];
}

Figure 4.4: Sequential C code computing Jacobi

int ub_v = (N - 2) / B;

meta_schedule {
  for (int t = 0; t < T; ++t) {
    meta_for (int v = 0; v < ub_v; v++) {
      meta_for (int u = 0; u < B; u++) {
        int p = v * B + u + 1;
        int y = p - 1;
        int z = p + 1;
        b[p] = (a[y] + a[p] + a[z]) / 3;
      }
    }
    meta_for (int v = 0; v < ub_v; v++) {
      meta_for (int u = 0; u < B; u++) {
        int w = v * B + u + 1;
        a[w] = b[w];
      }
    }
  }
}

Figure 4.5: Generated MetaFork code from the code in Figure 4.4

4.4 The MetaFork Generator of Parametric CUDA Kernels

In Section 4.2, we illustrated the process of parametric CUDA kernel generation from a sequential C program, using MetaFork as an intermediate language. In this section, we assume that, from a C program, one has generated a MetaFork program which contains one or more meta_schedule blocks. Each such block contains parameters like thread block dimension sizes and is meant to be translated into a CUDA kernel.

Figure 4.6: Overview of the implementation of the MetaFork-to-CUDA code generator

For that latter task, we rely on PPCG [147], a C-to-CUDA code generator, that we have modified in order to generate parametric CUDA kernels. Figure 4.6 illustrates the software architecture of our C-to-CUDA code generator, based on PPCG. Since the original PPCG framework does not support parametric CUDA kernels, the relevant data structures, like non-linear expressions in the for-loop lower and upper bounds, are not supported either. Consequently, part of our code generation is done in a post-processing phase. Among our adaptations of the PPCG framework, we have added a new node in the PET (Polyhedral Extraction Tool) [148] phase of PPCG in order to represent the information of a meta_for loop; this node is, in fact, very similar to a C for-loop node, except that we directly tag the outer meta_for loops as blocks and the inner meta_for loops as threads.

Taking Figure 4.5 as an example, our MetaFork-to-CUDA translator produces two kernel functions, a header file (for those two kernels) and a host code file where those kernels are called. Those two kernel functions are shown in Figure 4.7. In each kernel, we use the shared memory for the arrays that are read and the global memory for the arrays that are written only once. Observe that kernel0 and kernel1 take a program parameter, the thread block format B, as an argument, whereas non-parametric CUDA kernels usually take the parameters a, b, c, N, T, c0 only. Correspondingly, the generated host code replacing meta_schedule and its body is shown in Figure 4.8. Data transfers between the CPU and GPU global memories are done before those two kernels are launched and after they have completed, respectively. In the case where the number of thread blocks per grid, a.k.a. ub_v in the MetaFork code, exceeds the hardware limit, which is 32768 as shown in the host code, each kernel uses 32768 as the grid dimension size, while inside the kernel code the amount of work per thread block is increased via a serial loop.
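For instance (with a hypothetical input larger than those reported in Table 4.2), taking N = 2^25 + 2 and B = 128 gives ub_v = (N − 2)/B = 262144 tiles; the grid is then capped at 32768 thread blocks and the serial loop on c1 inside each kernel makes every thread block process 262144/32768 = 8 tiles.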

__global__ void kernel0(int *a, int *b, int N, int T, int ub_v, int B, int c0)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int private_p;
    __shared__ int shared_a[BLOCK_0+2];   // BLOCK_0 = B

    for (int c1 = b0; c1 < ub_v; c1 += 32768) {
        for (int c2 = t0; c2 <= min(B + 1, N - B * c1 - 1); c2 += B)
            shared_a[c2] = a[B * c1 + c2];
        __syncthreads();
        private_p = (((c1) * (B)) + (t0));
        b[private_p + 1] = (((shared_a[private_p - B * c1] +
                              shared_a[private_p - B * c1 + 1]) +
                              shared_a[private_p - B * c1 + 2]) / 3);
        __syncthreads();
    }
}

__global__ void kernel1(int *a, int *b, int N, int T, int ub_v, int B, int c0)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int private_w;
    __shared__ int shared_b[BLOCK_0];   // BLOCK_0 = B

    for (int c1 = b0; c1 < ub_v; c1 += 32768) {
        if (N >= t0 + B * c1 + 2)
            shared_b[t0] = b[t0 + B * c1 + 1];
        __syncthreads();
        private_w = (((c1) * (B)) + (t0));
        a[private_w + 1] = shared_b[private_w - B * c1];
        __syncthreads();
    }
}

Figure 4.7: Generated parametric CUDA kernel for 1D Jacobi


if (T >= 1 && ub_v >= 1 && B >= 0) {
#define cudaCheckReturn(ret) \
  do { \
    cudaError_t cudaCheckReturn_e = (ret); \
    if (cudaCheckReturn_e != cudaSuccess) { \
      fprintf(stderr, "CUDA error: %s\n", \
              cudaGetErrorString(cudaCheckReturn_e)); \
      fflush(stderr); \
    } \
    assert(cudaCheckReturn_e == cudaSuccess); \
  } while(0)
#define cudaCheckKernel() \
  do { \
    cudaCheckReturn(cudaGetLastError()); \
  } while(0)

  int *dev_a;
  int *dev_b;

  cudaCheckReturn(cudaMalloc((void **) &dev_a, (N) * sizeof(int)));
  cudaCheckReturn(cudaMalloc((void **) &dev_b, (N) * sizeof(int)));

  if (N >= 1) {
    cudaCheckReturn(cudaMemcpy(dev_a, a, (N) * sizeof(int), cudaMemcpyHostToDevice));
    cudaCheckReturn(cudaMemcpy(dev_b, b, (N) * sizeof(int), cudaMemcpyHostToDevice));
  }
  for (int c0 = 0; c0 < T; c0 += 1) {
    dim3 k0_dimBlock(B);
    dim3 k0_dimGrid(ub_v <= 32767 ? ub_v : 32768);
    kernel0 <<<k0_dimGrid, k0_dimBlock>>> (dev_a, dev_b, N, T, ub_v, B, c0);
    cudaCheckKernel();

    dim3 k1_dimBlock(B);
    dim3 k1_dimGrid(ub_v <= 32767 ? ub_v : 32768);
    kernel1 <<<k1_dimGrid, k1_dimBlock>>> (dev_a, dev_b, N, T, ub_v, B, c0);
    cudaCheckKernel();
  }
  if (N >= 1) {
    cudaCheckReturn(cudaMemcpy(a, dev_a, (N) * sizeof(int), cudaMemcpyDeviceToHost));
    cudaCheckReturn(cudaMemcpy(b, dev_b, (N) * sizeof(int), cudaMemcpyDeviceToHost));
  }

  cudaCheckReturn(cudaFree(dev_a));
  cudaCheckReturn(cudaFree(dev_b));
}

Figure 4.8: Generated host code for 1D Jacobi

Since the MetaFork code is obtained after computing affine transformations and tiling (via quantifier elimination), we had to bypass the affine transformation and tiling process that PPCG performs. By doing this, our prototype C-to-CUDA code generator could not fully take advantage of PPCG, which explains why post-processing was necessary. Of course, improving this design is work in progress, so as to completely avoid post-processing.

4.5 Experimentation

In this section, we present experimental results on an NVIDIA Tesla M2050. Most of them were obtained by measuring the running times of CUDA programs generated with the preliminary implementation of our MetaFork-to-CUDA code generator described in Section 4.4, and with the original version of the PPCG C-to-CUDA code generator [147]. We use eight simple examples: array reversal (Figure 4.9, Table 4.1), 1D Jacobi (Table 4.2), 2D Jacobi (Figure 4.10, Table 4.3), LU decomposition (Figure 4.11, Table 4.4), matrix transposition (Figure 4.12, Table 4.5), matrix addition (Figure 4.13, Table 4.6), matrix vector multiplication (Figure 4.14, Table 4.7) and matrix matrix multiplication (Figure 4.15, Table 4.8). In all cases, we use dense representations for our matrices and vectors.

For both the PPCG C-to-CUDA and our MetaFork-to-CUDA code generators, Tables 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7 and 4.8 give the speedup factors of the generated code, computed as the ratio of the running time of the untiled serial C code to that of the generated code. Since PPCG determines a thread block format, the timings in those tables corresponding to PPCG depend only on the input data size. Meanwhile, since the CUDA kernels generated by MetaFork are parametric, the MetaFork timings are obtained for various formats of thread blocks and various input data sizes. Indeed, recall that our generated CUDA code admits parameters for the dimension sizes of the thread blocks. This generated parametric code is then specialized with the thread block formats listed in the first column of those tables.

Figures 4.9, 4.5 (with 4.4 and 4.7), 4.10, 4.11, 4.12, 4.13, 4.14 and 4.15 show the MetaFork code of the eight examples together with their untiled serial C programs and the automatically generated CUDA kernels. In order to allocate unit sizes of shared memory in the kernel code, we predefine BLOCK_0 and BLOCK_1 (if applicable) as macros and specify their values at compile time. The tiled code for each MetaFork program is obtained via quantifier elimination (QE) using the RegularChains library of Maple. The eight examples as generated by PPCG are shown in Appendix B.

Array reversal. Both MetaFork and PPCG generate CUDA code that uses a one-dimensional kernel grid and the shared memory. We specialize the MetaFork generated parametric code successively to the thread block sizes B = 16, 32, 64, 128, 256, 512; meanwhile, PPCG by default chooses 32 as the thread block size. As we can see in Table 4.1, based on the generated parametric CUDA kernel, one can tune the thread block size to 256 to obtain the best performance.

1D Jacobi. Our second example is a one-dimensional stencil computation, namely 1D Jacobi. The kernel generated by MetaFork uses a 1D kernel grid and the shared memory, while the kernel generated by PPCG uses a 1D kernel grid and the global memory. PPCG by default chooses a thread block format of 32, while the MetaFork preferred format is 64.

2D Jacobi. Our next example is a two-dimensional stencil computation, namely 2D Jacobi. Both the CUDA kernels generated by MetaFork and PPCG use a 2D kernel grid and the global memory.
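Regarding the compile-time specialization mentioned above: specializing, say, the 1D Jacobi kernels of Figure 4.7 to B = 64 amounts to defining BLOCK_0 as 64 when compiling (for example, by passing -DBLOCK_0=64 on the nvcc command line) and then passing B = 64 to the kernels at launch time, so that the shared-memory allocation unit matches the thread block size.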

Table 4.1: Speedup comparison of reversing a one-dimensional array between PPCG and MetaFork kernel code

                                Speedup (kernel) per input size
            Thread block size      2^23      2^24      2^25
PPCG               32              8.312     8.121     8.204
MetaFork           16              4.035     3.794     3.568
                   32              7.612     7.326     7.473
                   64              13.183    13.110    13.058
                   128             19.357    19.694    20.195
                   256             20.451    21.614    22.965
                   512             18.768    18.291    19.512

Serial code

for (int i = 0; i < N; i++)
  Out[N - 1 - i] = In[i];

MetaFork code

int ub_v = N / B;
meta_schedule {
  meta_for (int v = 0; v < ub_v; v++)
    meta_for (int u = 0; u < B; u++) {
      int inoffset = v * B + u;
      int outoffset = N - 1 - inoffset;
      Out[outoffset] = In[inoffset];
    }
}

Generated CUDA kernel

__global__ void kernel0(int *In, int *Out, int N, int ub_v, int B)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int private_inoffset;
    int private_outoffset;
    __shared__ int shared_In[BLOCK_0];   // BLOCK_0 = B

    for (int c0 = b0; c0 < ub_v; c0 += 32768) {
        if (N >= t0 + B * c0 + 1)
            shared_In[t0] = In[t0 + B * c0];
        __syncthreads();
        private_inoffset = (((c0) * (B)) + (t0));
        private_outoffset = (((N) - 1) - private_inoffset);
        Out[private_outoffset] = shared_In[private_inoffset - B * c0];
        __syncthreads();
    }
}

Figure 4.9: Serial code, MetaFork code and generated parametric CUDA kernel for array reversal

Table 4.2: Speedup comparison of 1D Jacobi between PPCG and MetaFork kernel code

                                                 Speedup (kernel) per input size
                        Thread block size
                        (kernel0, kernel1)       2^13+2    2^14+2    2^15+2
PPCG (global memory)    32, 32                   1.416     2.424     5.035
MetaFork                16, 16                   1.274     2.660     2.462
                        32, 32                   1.967     3.386     5.268
                        64, 64                   2.122     4.020     7.309
                        128, 128                 1.787     3.234     6.168
                        256, 256                 1.789     3.516     6.218
                        512, 512                 2.193     3.518     6.070

PPCG by default chooses a thread block format of 16 × 32, while the MetaFork preferred format varies based on the input size.

Table 4.3: Speedup comparison of 2D Jacobi between PPCG and MetaFork kernel code

                                  Speedup (kernel) per input size
            Thread block size     (2^12+2)^2   (2^13+2)^2   (2^14+2)^2
PPCG            16 * 32           11.230       11.303       9.785
MetaFork        8 * 4             5.000        5.256        4.666
                16 * 4            7.867        8.724        7.962
                32 * 4            11.607       11.143       9.726
                8 * 8             7.209        7.776        6.704
                16 * 8            10.499       10.502       7.442
                32 * 8            12.236       11.487       9.182
                8 * 16            8.859        8.825        5.637
                16 * 16           10.774       10.709       7.694
                32 * 16           11.969       11.442       10.469

LU decomposition. MetaFork and PPCG both generate two CUDA kernels: one with a 1D grid and one with a 2D grid, both using the shared memory. The default selected thread block formats for PPCG are 32 and 16 × 32; meanwhile, the formats preferred by MetaFork are 128 and 16 × 16. Tuning the number of threads per thread block in our parametric code allows MetaFork to outperform PPCG.

Matrix transpose. Both the CUDA kernels generated by MetaFork and PPCG use a 2D grid and the shared memory. PPCG by default chooses a thread block format of 16 × 32, while the MetaFork preferred format is 8 × 32. For MetaFork, the allocation unit size of shared memory for the input matrix is the same as the thread block format. However, for PPCG, the allocation unit size of shared memory for the input matrix is 32 × 32, while the thread block format is 16 × 32. Thus, the PPCG code transposes two coefficients of the matrix within each thread.

Matrix addition. Both the CUDA kernels generated by MetaFork and PPCG use a 2D grid and the global memory. The default chosen thread block format for PPCG is 16 × 32, while the MetaFork preferred format is 32 × 8.

Matrix vector multiplication. For both MetaFork and PPCG, the generated kernels use a 1D grid and the shared memory. The thread block size chosen by PPCG is 32. Table 4.7 shows the speedup factors obtained with the kernels generated by PPCG and by MetaFork with post-processing, respectively. One can see that the performance of the parametric kernel with coalesced accesses is twice as good as that of the kernels automatically generated by MetaFork and PPCG.

Matrix matrix multiplication. For both MetaFork and PPCG, the generated kernels use a 2D grid and the shared memory. The thread block size chosen by PPCG is 16 × 32, while the MetaFork preferred thread block size varies based on the input sizes. For MetaFork, the allocation unit size of shared memory for each input matrix is the same as the thread block format. However, for PPCG, the allocation unit size of shared memory for each input matrix is 32 × 32, while the thread block format is 16 × 32. Thus, the PPCG code loads two coefficients of each input matrix into the shared memory within each thread.

Serial code

for (int t = 0; t < T; t++) {
  for (int i = 1; i < N-1; i++)
    for (int j = 1; j < N-1; j++)
      b[i][j] = (a[i-1][j] + a[i+1][j]
               + a[i][j-1] + a[i][j+1]) / 4;
  for (int i = 1; i < N-1; ++i)
    for (int j = 1; j < N-1; j++)
      a[i][j] = b[i][j];
}

MetaFork code

int dim0 = (N-2)/B0, dim1 = (N-2)/B1;
meta_schedule {
  for (int t = 0; t < T; t++) {
    meta_for (int v0 = 0; v0 < dim0; v0++)
      ...
  }
}

Generated CUDA kernel

__global__ void kernel0(int *a, int *b, int N, int T, int dim0,
                        int dim1, int B0, int B1, int c0)
{
    int b0 = blockIdx.y, b1 = blockIdx.x;
    int t0 = threadIdx.y, t1 = threadIdx.x;
    int private_p;
    int private_w;

    for (int c1 = b0; c1 < dim0; c1 += 256)
        for (int c2 = b1; c2 < dim1; c2 += 256) {
            private_p = (((c1) * (B0)) + (t0));
            private_w = (((c2) * (B1)) + (t1));
            b[(private_p + 1) * N + (private_w + 1)] =
                ((((a[private_p * N + (private_w + 1)] +
                    a[(private_p + 2) * N + (private_w + 1)]) +
                    a[(private_p + 1) * N + private_w]) +
                    a[(private_p + 1) * N + (private_w + 2)]) / 4);
            ...
        }
}

...

Figure 4.10: Serial code, MetaFork code and generated parametric CUDA kernel for 2D Jacobi

4.6 Summary

In this chapter, we have presented enhancements of the MetaFork language so as to support device constructs, in particular, the SIMD paradigm. For the latter, our objective is to facilitate automatic code translation from high-level programming models supporting hardware accelerators (like OpenMP and OpenACC) to low-level heterogeneous programming models (like CUDA). As illustrated in Section 4.3, MetaFork has language constructs to help generate efficient CUDA code. Moreover, the MetaFork framework relies on advanced techniques (quantifier elimination in non-linear polynomial expressions) for code optimization, in particular tiling.

Table 4.4: Speedup comparison of LU decomposition between PPCG and MetaFork kernel code

                                       Speedup (kernel) per input size
            Thread block size
            (kernel0, kernel1)         2^10 * 2^10    2^11 * 2^11
PPCG        32, 16 * 32                10.712         30.329
MetaFork    128, 4 * 4                 3.063          15.512
            256, 4 * 4                 3.077          15.532
            512, 4 * 4                 3.095          15.572
            32, 8 * 8                  10.721         37.727
            64, 8 * 8                  10.604         37.861
            128, 8 * 8                 10.463         37.936
            256, 8 * 8                 10.831         37.398
            512, 8 * 8                 10.416         37.840
            32, 16 * 16                14.533         54.121
            64, 16 * 16                14.457         54.034
            128, 16 * 16               14.877         54.447
            256, 16 * 16               14.803         53.662
            512, 16 * 16               14.479         53.077

Table 4.5: Speedup comparison of matrix transpose between PPCG and MetaFork kernel code

                                 Speedup (kernel) per input size
            Thread block size    2^13 * 2^13    2^14 * 2^14
PPCG            16 * 32          62.656         103.703
MetaFork        8 * 4            28.626         37.681
                16 * 4           40.381         41.403
                32 * 4           28.728         30.329
                8 * 8            51.889         58.789
                16 * 8           44.759         52.137
                32 * 8           37.586         43.696
                8 * 16           70.716         76.781
                16 * 16          64.812         73.657
                32 * 16          36.109         59.613
                8 * 32           77.327         93.051
                16 * 32          62.268         77.399

Serial code

for (int k = 0; k < n; ++k) {
  for (int i = 0; i < n-k-1; i++) {
    // column major representation of L and U
    int p = i + k + 1;
    L[k][p] = U[k][p] / U[k][k];
    for (int j = k; j < n; j++)
      U[j][p] -= L[k][p] * U[j][k];
  }
}

MetaFork code

int ub = n / B, ut = n / Sqrt_T;
meta_schedule {
  for (int k = 0; k < n-1; k++) {
    meta_for (int bx = 0; bx < ub; bx++)
      meta_for (int ux = 0; ux < B; ux++)
        if ((k + 1 - bx * B < B) &&
            (-B * bx + k < ux) &&
            (ux < n - bx * B)) {
          int l = bx * B + ux;
          L[k][l] = U[k][l] / U[k][k];
        }
    meta_for (int bx = 0; bx < ut; bx++)
      meta_for (int by = 0; by < ut; by++)
        meta_for (int ux = 0; ux < Sqrt_T; ux++)
          meta_for (int uy = 0; uy < Sqrt_T; uy++) {
            int i = by * Sqrt_T + uy;
            if (i < n - k - 1) {
              int j = bx * Sqrt_T + ux;
              if (j < n - k) {
                U[j+k][i+k+1] -= L[k][i+k+1] * U[j+k][k];
              }
            }
          }
  }
}

Generated CUDA kernels

__global__ void kernel0(double *L, double *U, int n, int ut,
                        int Sqrt_T, int ub, int B, int c0)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int private_l;
    __shared__ double shared_U_1[1][1];

    {
        if (t0 == 0)
            shared_U_1[0][0] = U[c0 * n + c0];
        __syncthreads();
        for (int c1 = b0; c1 < ub; c1 += 32768) {
            if ((((((c0) + 1) - ((c1) * (B))) < (B)) &&
                 ((((-(B)) * (c1)) + (c0)) < (t0))) &&
                ((t0) < ((n) - ((c1) * (B))))) {
                private_l = ((c1) * (B)) + (t0);
                L[c0 * n + private_l] = (U[c0 * n + private_l] / shared_U_1[0][0]);
            }
            __syncthreads();
        }
    }
}

__global__ void kernel1(double *L, double *U, int n, int ut,
                        int Sqrt_T, int ub, int B, int c0)
{
    int b0 = blockIdx.y, b1 = blockIdx.x;
    int t0 = threadIdx.y, t1 = threadIdx.x;
    int private_i;
    int private_j;
    // BLOCK_0 = BLOCK_1 = Sqrt_T
    __shared__ double shared_L[1][BLOCK_1];
    __shared__ double shared_U_1[BLOCK_0][1];

    for (int c1 = b0; c1 < ut; c1 += 256) {
        if (t1 == 0 && n >= t0 + c0 + Sqrt_T * c1 + 1)
            shared_U_1[t0][0] = U[(t0 + c0 + Sqrt_T * c1) * n + c0];
        for (int c2 = b1; c2 < ut; c2 += 256) {
            if (t0 == 0 && n >= t1 + c0 + Sqrt_T * c2 + 2)
                shared_L[0][t1] = L[c0 * n + (t1 + c0 + Sqrt_T * c2 + 1)];
            __syncthreads();
            private_i = (((c2) * (Sqrt_T)) + (t1));
            if (private_i < (((n) - (c0)) - 1)) {
                private_j = ((c1) * (Sqrt_T)) + (t0);
                if (private_j < ((n) - (c0))) {
                    U[(private_j + c0) * n + (private_i + c0 + 1)] -=
                        (shared_L[0][private_i - Sqrt_T * c2] *
                         shared_U_1[private_j - Sqrt_T * c1][0]);
                }
            }
            __syncthreads();
        }
    }
}

Figure 4.11: Serial code, MetaFork code and generated parametric CUDA kernel for LU decomposition

The experimentation reported in Section 4.5 shows the benefits of generating parametric CUDA kernels. Not only does this feature provide more portability, but it also helps obtain better performance with automatically generated code.

Serial code:

    for (int v0 = 0; v0 < n; v0++)
        for (int v1 = 0; v1 < n; v1++)
            c[v0][v1] = a[v1][v0];

MetaFork code:

    int dim0 = n / B0, dim1 = n / B1;
    meta_schedule {
        meta_for (int v0 = 0; v0 < dim0; v0++)
            meta_for (int v1 = 0; v1 < dim1; v1++)
                meta_for (int u0 = 0; u0 < B0; u0++)
                    meta_for (int u1 = 0; u1 < B1; u1++)
                    {
                        int i = u0 + v0 * B0;
                        int j = u1 + v1 * B1;
                        c[j][i] = a[i][j];
                    }
    }

Generated parametric CUDA kernel:

    __global__ void kernel0(int *a, int *c, int n, int dim0,
                            int dim1, int B0, int B1) {
        int b0 = blockIdx.y, b1 = blockIdx.x;
        int t0 = threadIdx.y, t1 = threadIdx.x;
        int private_i;
        int private_j;
        // BLOCK_0 = B0, BLOCK_1 = B1
        __shared__ int shared_a[BLOCK_0][BLOCK_1];

        for (int c0 = b0; c0 < dim0; c0 += 256)
            for (int c1 = b1; c1 < dim1; c1 += 256) {
                if (n >= t0 + B0 * c0 + 1 && n >= t1 + B1 * c1 + 1)
                    shared_a[t0][t1] =
                        a[(t0 + B0 * c0) * n + (t1 + B1 * c1)];
                __syncthreads();
                private_i = ((t0) + ((c0) * (B0)));
                private_j = ((t1) + ((c1) * (B1)));
                c[private_j * n + private_i] =
                    shared_a[private_i - B0 * c0][private_j - B1 * c1];
                __syncthreads();
            }
    }

Figure 4.12: Serial code, MetaFork code and generated parametric CUDA kernel for matrix transpose
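Since the thread block format is a parameter of the kernel of Figure 4.12, the launch configuration can be chosen when the program runs. The host-side routine below is only an illustrative sketch, not part of the generated code: it assumes that d_a and d_c already hold the matrices in device memory, that B0 and B1 divide n, and that the macros BLOCK_0 and BLOCK_1 used for the shared array have been defined consistently with the chosen B0 and B1; the grid bound of 256 matches the block strides in the generated kernel.

    // Illustrative launch of the parametric transpose kernel (a sketch).
    void launch_transpose(int *d_a, int *d_c, int n, int B0, int B1)
    {
        int dim0 = n / B0, dim1 = n / B1;      // tiles per dimension
        dim3 block(B1, B0);                    // threadIdx.x spans B1, threadIdx.y spans B0
        dim3 grid(dim1 < 256 ? dim1 : 256,     // the kernel strides by 256 blocks per dimension
                  dim0 < 256 ? dim0 : 256);
        kernel0<<<grid, block>>>(d_a, d_c, n, dim0, dim1, B0, B1);
        cudaDeviceSynchronize();
    }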

Table 4.6: Speedup comparison of matrix addition between PPCG and MetaFork kernel code

             Thread block size    Speedup (kernel)
                                  Input size
                                  2^12           2^13
PPCG          16 * 32                13.024          9.750
MetaFork       8 *  4                19.520         20.329
              16 *  4                32.971         35.227
              32 *  4                54.233         49.734
               8 *  8                28.186         30.221
              16 *  8                44.783         42.008
              32 *  8                56.650         50.547
               8 * 16                33.936         32.793
              16 * 16                45.015         41.606
              32 * 16                54.426         47.930

Table 4.7: Speedup comparison of matrix vector multiplication among PPCG kernel code, MetaFork kernel code and MetaFork kernel code with post-processing

                                   Thread block size    Speedup (kernel)
                                                        Input size
                                                        2^11      2^12      2^13
PPCG                                32                   3.954     3.977     5.270
MetaFork with post-processing       16                   4.976     6.260     7.794
                                    32                   8.698     6.911    10.340
                                    64                   4.260     5.567     6.683

Serial code:

    for (int v0 = 0; v0 < n; v0++)
        for (int v1 = 0; v1 < n; v1++)
            c[v0][v1] = a[v0][v1] + b[v0][v1];

MetaFork code:

    int dim0 = n / B0, dim1 = n / B1;
    meta_schedule {
        meta_for (int v0 = 0; v0 < dim0; v0++)
            meta_for (int v1 = 0; v1 < dim1; v1++)
                meta_for (int u0 = 0; u0 < B0; u0++)
                    meta_for (int u1 = 0; u1 < B1; u1++)
                    {
                        int i = u0 + v0 * B0;
                        int j = u1 + v1 * B1;
                        c[i][j] = a[i][j] + b[i][j];
                    }
    }

Generated parametric CUDA kernel:

    __global__ void kernel0(int *a, int *b, int *c, int n,
                            int dim0, int dim1, int B0, int B1) {
        int b0 = blockIdx.y, b1 = blockIdx.x;
        int t0 = threadIdx.y, t1 = threadIdx.x;
        int private_i;
        int private_j;

        for (int c0 = b0; c0 < dim0; c0 += 256)
            for (int c1 = b1; c1 < dim1; c1 += 256) {
                private_i = ((t0) + ((c0) * (B0)));
                private_j = ((t1) + ((c1) * (B1)));
                c[private_i * n + private_j] =
                    (a[private_i * n + private_j] +
                     b[private_i * n + private_j]);
                __syncthreads();
            }
    }

Figure 4.13: Serial code, MetaFork code and generated parametric CUDA kernel for matrix addition

Serial code:

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            c[i] += a[i][j] * b[j];

MetaFork code:

    int dim = n / B;
    meta_schedule {
        meta_for (int v = 0; v < dim; v++)
            for (int i = 0; i < n / 16; ++i)
                meta_for (int u = 0; u < B; u++)
                    for (int j = 0; j < 16; ++j) {
                        int p = v * B + u;
                        c[p] += a[p][i*16+j] * b[i*16+j];
                    }
    }

Generated parametric CUDA kernel:

    __global__ void kernel0(int *a, int *b, int *c, int n,
                            int dim, int B) {
        int b0 = blockIdx.x;
        int t0 = threadIdx.x;
        int private_p;
        // BLOCK_0 = B
        __shared__ int shared_a[BLOCK_0][BLOCK_0];
        __shared__ int shared_b[BLOCK_0];
        __shared__ int shared_c[BLOCK_0];

        for (int c0 = b0; c0 < dim; c0 += 32768) {
            if (n >= t0 + B * c0 + 1)
                shared_c[t0] = c[t0 + B * c0];
            for (int c1 = 0; c1 < n / BLOCK_0; c1 += 1) {
                if (n >= t0 + B * c0 + 1)
                    for (int c3 = 0; c3 < BLOCK_0; c3 += 1)
                        shared_a[c3][t0] = a[(c3 + B * c0) * n + (B * c1 + t0)];
                shared_b[t0] = b[t0 + B * c1];
                __syncthreads();
                for (int c3 = 0; c3 < BLOCK_0; c3 += 1) {
                    private_p = (((c0) * (B)) + (t0));
                    shared_c[private_p - B * c0] +=
                        (shared_a[private_p - B * c0][c3] * shared_b[c3]);
                }
                __syncthreads();
            }
            if (n >= t0 + B * c0 + 1)
                c[t0 + B * c0] = shared_c[t0];
            __syncthreads();
        }
    }

Figure 4.14: Serial code, MetaFork code and generated parametric CUDA kernel for matrix vector multiplication

Table 4.8: Speedup comparison of matrix multiplication between PPCG and MetaFork kernel code

             Thread block size    Speedup (kernel)
                                  Input size
                                  2^10 * 2^10    2^11 * 2^11
PPCG          16 * 32               129.853        393.851
MetaFork       8 *  4                32.157         96.652
              16 *  4                54.578        171.621
              32 *  4                53.399        156.493
               8 *  8                60.358        182.557
              16 *  8                87.919        287.002
              32 *  8                84.057        289.930
               8 * 16               100.521        299.228
              16 * 16               100.264        330.965
              32 * 16                85.928        247.220

Serial code:

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];

MetaFork code:

    int dim0 = n / B0, dim1 = n / B1;
    meta_schedule {
        meta_for (int i = 0; i < dim0; i++)
            meta_for (int j = 0; j < dim1; j++)
                for (int k = 0; k < n/4; k++)
                    meta_for (int v = 0; v < B0; v++)
                        meta_for (int u = 0; u < B1; u++)
                        {
                            int p = i * B0 + v;
                            int w = j * B1 + u;
                            for (int z = 0; z < 4; z++)
                                c[p][w] +=
                                    a[p][4*k+z] * b[4*k+z][w];
                        }
    }

Generated parametric CUDA kernel:

    __global__ void kernel0(int *a, int *b, int *c, int n,
                            int dim0, int dim1, int B0, int B1) {
        int b0 = blockIdx.y, b1 = blockIdx.x;
        int t0 = threadIdx.y, t1 = threadIdx.x;
        int private_p;
        int private_w;
        // BLOCK_0 = B0, BLOCK_1 = B1
        __shared__ int shared_a[BLOCK_0][4];
        __shared__ int shared_b[4][BLOCK_1];
        __shared__ int shared_c[BLOCK_0][BLOCK_1];

        for (int c0 = b0; c0 < dim0; c0 += 256)
            for (int c1 = b1; c1 < dim1; c1 += 256) {
                if (n >= t0 + B0 * c0 + 1 && n >= t1 + B1 * c1 + 1)
                    shared_c[t0][t1] = c[(t0 + B0 * c0) * n + (t1 + B1 * c1)];
                for (int c2 = 0; c2 < n / 4; c2 += 1) {
                    if (t1 <= 3 && n >= t0 + B0 * c0 + 1)
                        shared_a[t0][t1] =
                            a[(t0 + B0 * c0) * n + (t1 + 4 * c2)];
                    if (t0 <= 3 && n >= t1 + B1 * c1 + 1)
                        shared_b[t0][t1] =
                            b[(t0 + 4 * c2) * n + (t1 + B1 * c1)];
                    __syncthreads();
                    private_p = (((c0) * (B0)) + (t0));
                    private_w = (((c1) * (B1)) + (t1));
                    for (int c5 = 0; c5 <= 3; c5 += 1)
                        shared_c[private_p - B0 * c0][private_w - B1 * c1] +=
                            (shared_a[private_p - B0 * c0][c5] *
                             shared_b[c5][private_w - B1 * c1]);
                    __syncthreads();
                }
                if (n >= t0 + B0 * c0 + 1 && n >= t1 + B1 * c1 + 1)
                    c[(t0 + B0 * c0) * n + (t1 + B1 * c1)] = shared_c[t0][t1];
                __syncthreads();
            }
    }

Figure 4.15: Serial code, MetaFork code and generated parametric CUDA kernel for matrix matrix multiplication

Table 4.9: Timings (in sec.) of quantifier elimination for eight examples

Example                          Timing
Array reversal                    0.072
1D Jacobi                         0.948
2D Jacobi                         7.735
LU decomposition                  4.416
matrix transposition              1.314
matrix addition                   1.314
matrix vector multiplication      0.072
matrix matrix multiplication      2.849

Chapter 5

MetaFork: A Metalanguage for Concurrency Platforms Targeting Pipelining

On-line algorithms [10, 37], which are suitable for applications like signal and image processing, often involve data parallelism. In such algorithms, computations start as soon as part of the input data is available for processing, and results are returned as soon as part of the output data becomes available. Pipelining [87, 61, 39] is a basic pattern of on-line algorithms. A pipelining is a sequence of processing stages operating on tasks by partitioning them into separate units. Stages work independently in a producer-consumer manner. Pipelining is not well served by data parallelism: in that model, processing elements simultaneously perform the same operation on different data regions. Pipelining is also poorly supported by the fork-join model: if the processing of a subset of elements is distributed to each task, synchronization must take place whenever a processing element needs to be read or written. Since synchronization in the fork-join model is expressed by implicit or explicit barrier operations, this can lead to severe parallelism overheads.

In addition, stencil computations [47, 83] are a major pattern in scientific computing. Stencil codes perform a sequence of sweeps (called time-steps) through a given array, and each sweep can be seen as the execution of a pipelining. When expressed with concurrency platforms based on (and limited to) the fork-join model, parallel stencil computations incur excessive parallelism overheads. This problem is studied in [134], together with a solution in the context of OpenMP proposing new synchronization constructs to enable doacross parallelism. In Section 5.1, we briefly explain how pipelining techniques are implemented in practice. The syntax and the semantics of the MetaFork pipelining parallel constructs are specified in Sections 5.2 and 5.3.

5.1 Execution Model of Pipelining

As mentioned above, a pipelining is a sequence of processing stages through which data items flow from the first stage to the last stage. If each stage can process only one data item at a time, then the pipelining is said to be serial and can be depicted by a (directed) path in the sense of graph theory. If a stage can process more than one data item at a time, then the pipelining is said to be parallel and can be depicted by a directed acyclic graph (DAG), where each parallel stage is represented by a co-clique, that is, a set of vertices of which no pair is adjacent. From the perspective of performance, the benefit of pipelining is to improve system throughput, that is, the number of tasks that can be executed per unit of time. Clearly, the slowest stages (either parallel or serial) of the pipelining have a significant impact on the throughput (a small worked example is given after the list below). There are two natural ways of implementing a parallel pipelining:

1. bound-to-stage, where a worker is attached to each processing stage, and

2. bound-to-item, where a worker carries an item through the pipelining.

TBB's implementation [127] is based on the latter and, thanks to the so-called parking trick [107], realizes a greedy scheduler. CilkPlus's implementation [95] combines the bound-to-item and randomized work-stealing scheduling strategies.
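To make the throughput remark concrete, consider a pipelining with three serial stages whose per-item processing times are 1, 3 and 1 time units. Once the pipelining is full, a new item can leave it only every 3 time units, so the throughput is bounded by 1 / (time of the slowest stage) = 1/3 item per unit of time, no matter how fast the other two stages are; only turning the slow stage into a parallel stage, served by several workers, can raise this bound.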

5.2 Core Parallel Constructs

We enhance the MetaFork language with three constructs, meta pipe, meta wait and meta continue, to express pipelining parallelism.

The meta pipe construct. The synopsis of meta pipe is meta pipe([I;]C[;S][;D]) { B }

where I is the initialization expression, C is the condition expression, S is the stride, D is the dependencies, and B is the pipelining body. In addition: - the initialization expression initializes variables, called the control variables, which can be of an integer or pointer type,

- the condition expression compares the control variables with a compatible expression, using one of the relational operators <, <=, >, >=, !=,

- the stride is used to increase or decrease the value of the control variables,

- the dependencies are used to specify the execution order of the data flow, and

- if B consists of a single instruction, then the surrounding curly braces can be omitted.

The above meta pipe statement specifies that its serial counterpart induces a pipelining, that is, a DAG whose vertices are processing stages through which data items flow from the first stage to the last stage. Unless dependencies are specified or meta continue is used, the execution of the loop is sequential, that is, the DAG is linear. If dependencies are used, then they specify the immediate predecessors of each vertex of the DAG. Before each execution of the pipelining body, the condition expression is evaluated: if it is true, the body is executed; otherwise, the loop terminates. The execution of each iteration of the meta pipe loop starts from stage zero in a serial manner, and the pipelining body consists of several stages whose boundaries are delimited by the meta wait or meta continue constructs. We number the stages of each iteration with monotonically increasing non-negative integers as the iteration executes. The dependencies are a set of dependency tuples of the form (d1, d2, ..., dn), where n is the nesting level and di denotes the loop index of the i-th nested loop in the C-elision counterpart of meta pipe (see Section 5.3). Using meta wait and meta continue in a meta pipe body turns each iteration of the loop into a sub-DAG, as discussed in the following.

The meta continue construct. The construct meta continue has the following format:

    meta continue(s)

It indicates that the execution advances to stage s without waiting for stage s of the previous iteration of the loop to terminate. This implies that stage s could process more than one computation at a time in parallel.

The meta wait construct. Analogous to the meta continue construct, meta wait is defined as follows:

    meta wait(s)

It also indicates that the execution advances to stage s, but waits for the completion of stage s in the previous iteration of the loop. Specifically, unlike the traditional pipelining technique, where a stream of data flows through every stage of the pipelining, our framework lets programmers skip stages by taking advantage of the stage argument provided for the meta continue and meta wait constructs; the execution order between stages may even be determined dynamically. This design has the potential of supporting complicated applications. In addition, if the argument of meta wait or meta continue is not explicitly specified, the execution implicitly flows into the next stage.
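The fragment below is a small sketch of this stage-skipping facility; it follows the syntax introduced above, but the helper routines (read_block, compress_block, needs_postprocess, fixup_block, write_block) are hypothetical and only stand for arbitrary stage bodies. Stage 1 of consecutive iterations may overlap because meta continue(1) does not wait for the previous iteration, stage 2 is skipped in iterations that have nothing to fix up, and stages 2 and 3 run in order across iterations because of meta wait.

    meta_pipe(k < n)
    {
        k++;
        read_block(k);                 // stage 0
        meta_continue(1);              // enter stage 1 without waiting for iteration k-1
        compress_block(k);             // stage 1
        if (needs_postprocess(k)) {
            meta_wait(2);              // wait for stage 2 of iteration k-1
            fixup_block(k);            // stage 2
        }
        meta_wait(3);                  // wait for stage 3 of iteration k-1
        write_block(k);                // stage 3 (in-order output)
    }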

    // for-loop or while-loop; meta_pipe indicates the beginning of the pipelining
    meta_pipe(cond)
    {
        Read data;           // stage 0 ("input is serial")
        meta_continue();     // advance to the next stage without waiting for the previous iteration
        Processing data;     // stage 1
        meta_wait();         // advance to the next stage, waiting for the previous iteration
        Store data;          // stage 2
    }

[The right-hand side of the figure depicts the corresponding execution DAG: one vertex per stage and per iteration; a meta wait enforces a dependence edge between the same stage of consecutive iterations.]

Figure 5.1: Pipelining code with meta pipe construct and its DAG

Figure 5.1 illustrates how to use these constructs on a simple example (given as pseudo-code) for which the corresponding execution DAG is also depicted. On the left side of Figure 5.1, the code is divided into three stages: reading data, processing data and storing data.

The meta pipe construct indicates that the subsequent body represents a pipelining. Then, meta continue advances work to the next stage without waiting for the completion of the previous iteration. Finally, meta wait moves to the next stage, but waits for the previous iteration to finish. As shown on the right side of Figure 5.1, stage 1 provides an opportunity for concurrent execution.

5.3 Semantics of the Pipelining Constructs in MetaFork

The semantics of each of the parallel constructs introduced in the MetaFork pipelining model is defined in this section, following the concept of serial C-elision discussed in Section 3.4. The meta pipe construct has the same semantics as the sequential execution of nested for-loops, where the i-th for-loop is driven by the i-th control variable appearing in the condition expression. In the C-elision code, the dependencies can safely be removed without changing the semantics. Finally, meta continue as well as meta wait is converted into a goto statement that points to the position of the code corresponding to its stage argument. In particular, if the argument indicates that the execution goes from stage s to stage s + 1, meta continue and meta wait are removed directly. Figures 5.2 and 5.4 give insight into the above ideas. On the left-hand side of Figure 5.2, a

MetaFork pipelining code:

    int pipe(int n)
    {
        int k = 0;
        meta_pipe(k < n)
        {
            k++;
            stage(0);
            meta_wait(1);
            stage(1);
            meta_continue(2);
            stage(2);
            meta_wait(3);
            stage(3);
        }
        return 0;
    }

Serial C-elision counterpart:

    int serial(int n)
    {
        int k = 0;
        for ( ; k < n; )
        {
            k++;
            stage(0);
            stage(1);
            stage(2);
            stage(3);
        }
        return 0;
    }

Figure 5.2: MetaFork pipelining code and its serial C-elision counterpart code

valid MetaFork code follows the pipelining pattern defined in Section 5.2. On the right-hand side of Figure 5.2, the serial counterpart of that MetaFork pipelining code is displayed; it is obtained by removing meta wait and meta continue and by converting meta pipe into a single for loop. The DAGs make it easy to compare the performance of the pipelining code against its serial counterpart: on the right side of Figure 5.3, the nodes are executed sequentially without any parallelism, while on the left side, stage 2 can run in parallel, which exposes massive parallelism. Figure 5.4 shows a case of stencil computation code (on the left) written with meta pipe. In this example, dependencies are specified by listing, for the vertex (t,i,j), its adjacent vertices (i.e. iterations). The serial counterpart of that MetaFork pipelining code is shown on the right of Figure 5.4. Note that the dependencies are ignored and meta pipe is replaced with nested for-loops.
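The goto-based translation mentioned above only becomes visible when a stage argument skips over a stage. The following fragment is a sketch, with hypothetical helpers (stage bodies stage(...) as in Figure 5.2 and a made-up predicate is_empty), of how a conditional meta continue(2) could be serialized; when the target stage is simply the next one, as in Figure 5.2, the construct is removed and no label is needed.

    /* MetaFork pipelining code (sketch) */
    meta_pipe(k < n)
    {
        k++;
        stage(0);
        if (is_empty(k))
            meta_continue(2);    /* nothing to process: go straight to stage 2 */
        else {
            meta_wait(1);
            stage(1);
            meta_wait(2);
        }
        stage(2);
    }

    /* Serial C-elision counterpart (sketch): the skipping meta_continue(2)
       becomes a goto to the label marking stage 2, while the
       consecutive-stage meta_wait constructs are simply removed. */
    for ( ; k < n; )
    {
        k++;
        stage(0);
        if (is_empty(k))
            goto stage_2;
        stage(1);
    stage_2:
        stage(2);
    }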

[Figure: two computation DAGs for the code of Figure 5.2, one per algorithm. Each iteration contributes vertices for stages 0 to 3; the label "Input is serial" marks stage 0, a meta wait enforces a dependence edge between consecutive iterations of the same stage, the pipelining version exhibits pipelining parallelism at stage 2, and the serial version forms a single chain.]

Figure 5.3: Computation DAG of the algorithms in Figure 5.2

[Figure: MetaFork stencil code, beginning with meta_pipe((t,i,j)=(1,1,1); t < ...), and its serial C-elision counterpart.]

Figure 5.4: Stencil code using meta pipe and its serial C-elision counterpart code

5.4 Summary

In this chapter, we have enhanced the MetaFork framework with extensions to express pipelining parallelism. To this end, we introduced three constructs, namely meta pipe, meta continue and meta wait, whose syntax and semantics were detailed in Sections 5.2 and 5.3. With these high-level constructs, programmers can specify a pipelining program with little programming effort. Furthermore, compared to the traditional pipelining technique, our approach can model applications whose execution DAG is determined during the program's execution. So far, we have focused on the design of the syntax and semantics; the implementation is not yet completed.

Chapter 6

MetaFork: The Compilation Framework

The objective of this chapter is to explain the design and the implementation details of the MetaFork compilation framework. The key concepts of this chapter are arranged into sections and are presented as follows. In Section 6.1, we present the motivation of the MetaFork compilation framework and how we will achieve our goal. In Section 6.2, we discuss the programming specification of MetaFork as a high-level parallel programming language. In Section 6.3, we provide a high-level view of each software component of the MetaFork compilation framework. In Sections 6.4 and 6.5, we present the key contributions of this chapter, which are the implementation of each component of the MetaFork compiler and the details of the extensions of the MetaFork front-end, respectively. Finally, in Section 6.6, we detail the implementation of parsing MetaFork and CilkPlus constructs.

6.1 Goals

Over the last two decades, there has been significant progress in the development of promising parallel platforms for high-performance computing targeting homogeneous and heterogeneous architectures. CilkPlus, OpenMP and CUDA are examples of these parallel platforms. Applications that run on these platforms have the potential to gain a huge boost in performance; however, they incur the cost of a significant programming effort, even for expert programmers. The challenges of parallel computing arise from the following well-known traits: large numbers of concurrent threads, hierarchical memories, inter-processor communication, and synchronization. In order to take full advantage of the tremendous computing power made available through hardware accelerators, intricate knowledge of the parallel software platform and the underlying system hardware is mandatory. Additionally, due to the rapid architectural innovation and the increasing software complexity [85, 110, 46], it is not easy for developers to parallelize and optimize their programs manually. Therefore, in order to maximize performance, productivity, and portability, high-level parallel programming models that transparently adapt to a wide range of parallel architectures are needed. In this context, we propose the MetaFork compilation framework, which is characterized by three significant goals:

1. offering high-level parallel programming models with language-level constructs aiming to exploit parallelism for homogeneous and heterogeneous architectures in a simple manner, as well as

2. offering productive platforms for researchers to develop new compiler techniques to improve software in terms of performance and resource utilization, and

3. facilitating interoperability between concurrency platforms so as to realize the merits of the MetaFork compilation framework.

The first goal is driven by the observed advantages of high-level parallel programming models. Based on past experiences [130, 123, 48], parallel programming models with low-level APIs, like CUDA, offer a great opportunity for performance tuning but limit portability, scalability, and productivity. For instance, the burden of explicitly managing the memory hierarchy (i.e. shared memory, global memory) and multi-level parallelism (e.g. thread blocks within a grid, threads within a thread block) placed by CUDA on programmers makes the manual development of high-performance CUDA code rather complicated. Programmers are required to put much time and effort into optimizing those details.

The development of high-level parallel programming models, which provide powerful abstraction, has the potential to free programmers from the burden of handling such program details. Such models abstract details of how a computation will actually be implemented. The programmers only need to supply annotations which capture the nature of the computation in the sequential code, and the final decision of how to implement the computation rests with the back-end compiler by means of those annotations. A major feature of these models is to abstract away from the notion of a thread in the program as well as to make the underlying hardware features transparent to programmers. For example, these models avoid the explicit management of the hardware accelerator, i.e. thread and memory management, and the mapping of work units to threads; thus these models ease programming. Moreover, programming with such models also leads to more portability towards modern and future architectures, since the high level of abstraction hides the complex details of the parallel hardware from the programmers. Consider this situation: implicit assumptions are sometimes made when programs are developed in low-level programming languages. It is possible that such programs suffer losses in performance if migrated to an architecture for which those assumptions no longer hold. For instance, hardware resources, like registers, have a crucial impact on the performance of CUDA kernels, while the maximum number of registers available to each thread differs between GPU generations. Excessive usage of registers per thread, which causes register spilling, results in substantial performance degradation. By contrast, for an application written in a high-level programming language, achieving high performance becomes the responsibility of the underlying compiler, whose optimizations are precisely what compiler researchers should focus on. Hence, instead of programming with low-level parallel programming APIs at the expense of a huge investment of time and effort, programming with high-level languages, with only minimal effort, is the most promising way to increase productivity and to face radical changes in hardware design in the future [155]. On the other hand, observing the high cost of designing architecture-specific back-end optimizations from scratch, the MetaFork compilation framework is designed to make use of existing optimizing compilers.

In order to carry out such work, source-level compiler infrastructures must be available to enable interoperability between different concurrency platforms. We stress the fact that our work focuses on optimizing the use of user-defined high-level (source code level) abstractions rather than on lower-level (machine instruction level) optimization associated with code generation for specific platforms. We propose the following tasks so as to meet the goals of the MetaFork compilation framework:

1. defining a high-level language for expressing concurrency,

2. providing source-to-source transformation tools to take advantage of contemporary concurrency platforms, and

3. developing platforms which are capable of analyzing multiple programming languages.

At the time of writing, CilkPlus, OpenMP and the MetaFork language are supported by the MetaFork compilation framework.

6.2 MetaFork as a High-Level Parallel Programming Language

As described in Section 6.1, it is desirable for the MetaFork compilation framework to be built around a high-level parallel programming language. Such a language needs to satisfy the following requirements:

1. enhancing the C/C++ language with minimal language extensions while supporting popular schemes of parallelism,

2. increasing programming productivity, and

3. letting programmers explicitly identify opportunities for concurrent execution.

The MetaFork language extends the C/C++ language with parallel language constructs that the programmer can express in the form of compiler directives (i.e. Pragmas) or keywords, see Section 6.6.1. Hereafter, we focus on Pragma directives, which are widely supported by mainstream compilers. MetaFork treats these Pragma directives in the host code as a set of declarative annotations that guide compilers towards carrying out the concrete code generation. The design of MetaFork's Pragma directives is inspired by those of OpenMP, as illustrated by Figure 6.1, which specifies their syntax.

    #pragma mf directive-name [clause[ [,] clause] ... ] new-line
    statement

Figure 6.1: MetaFork Pragma directive syntax

Basically, the syntax of MetaFork Pragma directives follows the convention of the C/C++ standards for compiler directives 1. Directives are case-sensitive and each directive starts with #pragma mf to identify itself as a MetaFork Pragma directive. For some directives, a clause which contains a list of variables may be used. It is the programmer's responsibility to ensure that each variable is valid, which implies not only that it is a valid C/C++ identifier, but also that the variable has been declared before being used. The clauses can appear in any order in a directive, and the clauses in a directive can be repeated if needed, see Figure 6.2. A Pragma directive can be attached to any statement within a function body and must be terminated by a new-line character.

1 The C Preprocessor: https://gcc.gnu.org/onlinedocs/cpp/Pragmas.html#Pragmas

    1 void region(int *a, int *b, int N)
    2 {
    3   int sum_a=0, sum_b=0;
    4
    5   #pragma mf fork shared(sum_a) shared(a)
    6   for(int i=0; i
          ...

The code snippet in Figure 6.2 illustrates the use of MetaFork Pragma directives to express parallelism at the source code level. The Pragma directive shown at Line (5) allows the for-loop statement at Line (6) to be executed concurrently with the for-loop statement at Line (9). Details on the meaning of each Pragma directive (e.g. at Lines (5) and (12)) are presented in Section 3.2.

    1 void region(int *a, int *b, int N)
    2 {
    3   int sum_a=0, sum_b=0;
    4
    5   meta_fork shared(sum_a) shared(a)
    6   for(int i=0; i
          ...

Figure 6.4: Overall work-flow of the MetaFork compilation framework

In addition to the Pragma directive mechanism, MetaFork also features keywords (similar to CilkPlus) as constructs for parallelism. As an example, the program annotated with Pragma directives in Figure 6.2 and the one annotated with keywords in Figure 6.3 are semantically equivalent; but, unlike Pragma directives, which are required to be placed on a standalone line, the keywords flavor is more flexible: MetaFork keywords can be located on the same line as, or on a different line from, the statement that they precede. In fact, MetaFork keywords are turned into Pragma directives with macro definitions. Additional details are presented in Section 6.6.1.

6.3 Organization of the MetaFork Compilation Framework

The MetaFork compilation framework, as a source-to-source transformation compiler, takes programs written in high-level parallel programming languages (i.e. CilkPlus, MetaFork and OpenMP) as input and produces equivalent programs written in another parallel programming language (i.e. CilkPlus, MetaFork, OpenMP or CUDA). Figure 6.4 depicts an overview of the design of the MetaFork compilation framework, which highlights the MetaFork compiler as its core component from the perspective of a source-to-source transformation compiler. The entire processing of input programs happens in two steps: at the beginning of the process, the MetaFork compiler translates the input programs into another format of parallel programs which is readable by users. Then the generated programs are passed directly to a third-party compiler, such as GCC, Clang or NVCC, which compiles the generated programs into an executable binary file, shown as the last step of Figure 6.4.

6.4 MetaFork Compiler

The MetaFork compiler has three main components: (1) the front-end module, (2) the analysis & transformation module and (3) the back-end module. Within the MetaFork compiler, input programs are converted into an abstract syntax tree (AST) by the first component. This AST is then used as a common internal representation (IR) format in the next phase, that is, by the second component, which performs numerous analyses and optimizations. Afterwards, the third component translates the IR obtained from the previous phase into output files. More than one type of input program is handled by the MetaFork compiler. So far, our research work has focused on applications and libraries written in C/C++ with language extensions (i.e. CilkPlus, OpenMP and MetaFork).

In this section, we describe the design and the implementation of the MetaFork compiler, particularly the front-end, to which Section 6.4.1 gives a complete introduction. Following it, the other two modules are covered in Sections 6.4.2 and 6.4.3.

6.4.1 Front-End of the MetaFork Compiler

The purpose of the MetaFork compiler front-end is to parse the input programs and build the corresponding ASTs. To avoid the cumbersome implementation of a family of C-based language front-ends and to favor a widespread use of the MetaFork framework, we have taken advantage of the Clang framework as a basis for developing the MetaFork front-end. Clang is designed with clear interfaces and is also meant to be extensible, so as to allow developers to build their own tools that interact with Clang. Recall that MetaFork, as a high-level parallel programming language, uses Pragma directives to expose parallelism. During the preprocessing phase, the parser of Clang simply treats Pragma directives as sequences of lexical tokens so as to allow them to be handled by extensions. This feature of Clang fits our MetaFork project well since it offers the opportunity to implement the MetaFork compiler as a standalone tool without applying any modifications to the core files or extending any C/C++ grammars of the Clang framework. Moreover, the Clang framework at version 3.6.2, on which MetaFork is built, includes a complete implementation of the OpenMP 3.0 specification. This implies that we only need to focus on enhancing the Clang front-end to parse the Pragma directives of MetaFork and CilkPlus. Based on the above considerations, we chose to use LibTooling 2,

     1 int main(int argc, const char **argv) {
     2   llvm::sys::PrintStackTraceOnErrorSignal();
     3
     4   std::unique_ptr<tooling::FixedCompilationDatabase> Compilations(
     5     tooling::FixedCompilationDatabase::loadFromCommandLine(argc, argv));
     6
     7   cl::ParseCommandLineOptions(argc, argv);
     8
     9   tooling::RefactoringTool Tool(*Compilations, SourcePaths);
    10
    11   return
    12     Tool.runAndSave(newFrontendActionFactory<OpenMPtoMetaFrontendAction>().get());
    13 }

Figure 6.5: A code snippet showing how to create tools based on Clang’s LibTooling

a Clang library that allows users to develop standalone tools as well as to use Clang's parsing functionality. The different tasks involved in building tools upon LibTooling have clearly separated interfaces and are detailed as follows:

1. Compilation database setup. To allow a compiler front-end to run properly, apart from the input programs, it is necessary to pass the various kinds of flags that the input programs are compiled with, such as language-version flags (-std), macro definitions (-D), include

2 http://clang.llvm.org/docs/LibTooling.html

paths (-I), etc. In particular, without the include path information, it is not even possible to parse the input programs as a complete translation unit. Clang figures out all these specific options by configuring a compilation database. A compilation database retrieves the collection of exact compilation options that programs are compiled with. For instance, Line (4) of Figure 6.5 declares a compilation database variable, and the ParseCommandLineOptions function at Line (7) parses all options following the double dash "--" on the command line. For example, the three options in Figure 6.6, namely -include, -D, and -I, are stored in a compilation database.

2. ClangTool instantiation. In the second step, we need to create a ClangTool object, which is the utility that runs a FrontendAction (discussed in step 3) over the input programs. As illustrated at Line (9) of Figure 6.5, we use a specific version called RefactoringTool, a subclass of ClangTool that offers a convenient way to perform source-to-source transformations. Internally, such a class implements all the logic needed for the components of a MetaFork source-to-source compiler to work in coordination with each other, such as parsing the input programs, performing an action when a match occurs (in our case, matching the Pragma directives), running the AST visitor and applying all program modifications. This explains why we end our main function in Figure 6.5 with a call to Tool.runAndSave(); in fact, it performs all these tasks automatically. To create a RefactoringTool object, the compilation database variable declared in step 1 and the source programs, for instance input.cpp in Figure 6.6, are passed as parameters to its constructor.

3. FrontendAction instantiation. The FrontendAction class, which is a basic abstract class for several sub-classes, provides various interfaces for performing actions over input source programs. We use ASTFrontendAction, a sub-class of FrontendAction, because we want to analyze the AST representation of the source programs. Moreover, this is the place where we can interact with the Clang preprocessor so as to handle our Pragma directives, since the AST cannot be built while bypassing the preprocessor. For example, at Line (12) of Figure 6.5, OpenMPtoMetaFrontendAction is an instantiation of ASTFrontendAction. It serves as a parameter to the newFrontendActionFactory function, which creates a new FrontendAction instance for a ClangTool object.

metatoopenmp input.cpp -o output.cpp -- -include header.h -Dmacro -Ipath/

Figure 6.6: A general command-line interface of MetaFork tools

6.4.2 Analysis & Transformation of the MetaFork Compiler

Besides serving as a high-level parallel programming language, the MetaFork compilation framework also provides tools for source-to-source transformations via the analysis performed on the AST. One of the basic functionalities required to navigate the AST is visiting each of its nodes in a predefined order. On top of this, our intention is to modify and rewrite the AST. There are two methods to traverse the AST in Clang: by using Matchers or RecursiveASTVisitors. In our implementation, we use the latter, which not only provides access to the nodes of the AST, which is what Matchers does, but also allows us to change the predefined order in which the AST nodes are traversed.

Figure 6.7: Overall work-flow of the MetaFork analysis and transformation chain

In this section, we present the general scheme of FrontEndAction operations which perform recursive visits and editing over the AST. To this end, programmers are responsible for manipulating four procedures, namely ASTFrontEndAction, ASTConsumer, RecursiveASTVisitor and Rewriter, as depicted in Figure 6.7 (a minimal end-to-end sketch of how these four pieces fit together is given after the following list):

1. ASTFrontEndAction. As illustrated in Section 6.4.1, ASTFrontEndAction is a Clang interface which is used for writing tools that operate on the AST. In our implementation, all it does is to provide a method CreateASTConsumer() to create an ASTConsumer and set up a Clang Rewriter instance which is the main interface to the Clang rewrite buffers for code transformations.

2. ASTConsumer. ASTConsumer is an abstract interface that should be implemented by developers. This interface is the entry point for accessing the AST produced by the Clang parser. ASTConsumer provides several virtual methods that are called at various points when a certain type of AST node has been parsed in the compilation process. For instance, HandleTranslationUnit, which is the one used in our project, is called only after Clang finishes building the AST for the entire input program. HandleTranslationUnit performs a traversal of the whole AST using the RecursiveASTVisitor scheme.

3. RecursiveASTVisitor. The RecursiveASTVisitor, which uses the curiously recurring template pattern 3, is the parent class of all our node visitors. It allows developers to visit any type of AST node, such as FunctionDecl and Stmt, simply by overriding a function with that name, e.g. VisitFunctionDecl and VisitStmt respectively. To be more precise, the RecursiveASTVisitor does a preorder depth-first traversal of the entire Clang AST. Three groups of methods are supplied to visit each node of the AST in a hierarchical manner. The Traverse method group, as the first group, dispatches to a concrete Traverse method which matches the dynamic type of the AST node. From there the WalkUpFrom method, as the second group, is called, which goes up the class hierarchy of the current node until the top-most class, e.g. Decl, Stmt or Type, is reached. Meanwhile, the WalkUpFrom method calls the Visit method to actually visit the node. These three method groups are assigned priorities from highest to lowest, Traverse → WalkUpFrom → Visit, in which a method can only call methods from the same or lower levels. To activate the RecursiveASTVisitor mechanism, we invoke its TraverseDecl function from the HandleTranslationUnit method of our ASTConsumer. This function will traverse all the declarations of a Clang translation unit. More details on the RecursiveASTVisitor class can be found in the comments of its source file 4.

3 https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern

4. Rewriter. Due to the immutability of the Clang AST once created, direct AST manipulations (e.g. inserting or deleting an AST node) are not supported. Instead, Clang offers a Rewriter interface which enables developers to make textual changes, such as inserting, removing or replacing text, in the source programs. For developers who are interested in code refactoring or source-to-source code transformations, the Rewriter interface is a key component coupled with the above classes. Using the Rewriter interface, we perform the necessary analysis on the input programs over the AST and make changes accordingly. By taking advantage of Clang's outstanding preservation of source locations for all AST nodes, this method can take as inputs Clang SourceLocations (or SourceRanges), which indicate the location (or the range) in the input programs where a particular AST node is located, and change the input programs at specific places to perform the transformations precisely.
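The four classes above interact as follows; the snippet is a minimal, self-contained sketch written against the Clang 3.6 API on which MetaFork is built, and the class names (ForVisitor, ForConsumer, ForAction) are ours, not those of the MetaFork sources. It visits every for statement of a translation unit and inserts a comment in front of it through the Rewriter.

    #include "clang/AST/ASTConsumer.h"
    #include "clang/AST/RecursiveASTVisitor.h"
    #include "clang/Frontend/CompilerInstance.h"
    #include "clang/Frontend/FrontendAction.h"
    #include "clang/Rewrite/Core/Rewriter.h"
    #include "llvm/ADT/STLExtras.h"

    using namespace clang;

    // Visits every AST node; only for loops trigger a rewrite here.
    class ForVisitor : public RecursiveASTVisitor<ForVisitor> {
      Rewriter &R;
    public:
      explicit ForVisitor(Rewriter &R) : R(R) {}
      bool VisitForStmt(ForStmt *S) {
        R.InsertTextBefore(S->getLocStart(), "/* candidate loop */ ");
        return true;                      // continue the traversal
      }
    };

    // Entered once the whole translation unit has been parsed.
    class ForConsumer : public ASTConsumer {
      ForVisitor Visitor;
    public:
      explicit ForConsumer(Rewriter &R) : Visitor(R) {}
      void HandleTranslationUnit(ASTContext &Ctx) override {
        Visitor.TraverseDecl(Ctx.getTranslationUnitDecl());
      }
    };

    // Glues the consumer and the rewrite buffers together.
    class ForAction : public ASTFrontendAction {
      Rewriter R;
    public:
      std::unique_ptr<ASTConsumer>
      CreateASTConsumer(CompilerInstance &CI, StringRef File) override {
        R.setSourceMgr(CI.getSourceManager(), CI.getLangOpts());
        return llvm::make_unique<ForConsumer>(R);
      }
    };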

6.4.3 Back-End of the MetaFork Compiler

The goal of the MetaFork back-end is to produce programs written in parallel programming languages (i.e. CilkPlus, MetaFork, OpenMP and CUDA) rather than to generate an executable binary file. In Clang, this is achieved via the RewriteBuffer class. A RewriteBuffer is managed by the Rewriter class and stores the source program. It preserves a lot of high-level information within the source programs, like comments, formatting and so on, aiming at being capable of accurately reproducing the input. It is crucial to point out that, in order to perform code transformations, the analysis is performed on the Clang AST rather than on the RewriteBuffer object. However, Clang maintains a close correspondence between the RewriteBuffer object and the Clang AST in order to mirror the modifications of the input programs, with the help of the Rewriter. So, as modifications are applied to the input programs, a new RewriteBuffer is generated. This new buffer captures these textual modifications as information used to map between source locations obtained from the AST of the input programs and outputs in the new RewriteBuffer. In our work, we redirect this new RewriteBuffer to new files as the Clang back-end's output. For instance, in Figure 6.6, the file output.cpp is the output of the Clang back-end.
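As an illustration (a sketch only, not the actual MetaFork driver), this final emission step could be written as follows, where R is the Rewriter set up by the front-end action and OutPath is the output file chosen on the command line (e.g. the -o output.cpp of Figure 6.6):

    #include "clang/Rewrite/Core/Rewriter.h"
    #include "llvm/Support/FileSystem.h"
    #include "llvm/Support/raw_ostream.h"

    // Sketch: dump the (possibly rewritten) main file to the output path.
    void emitOutput(clang::Rewriter &R, llvm::StringRef OutPath) {
      clang::SourceManager &SM = R.getSourceMgr();
      std::error_code EC;
      llvm::raw_fd_ostream OS(OutPath, EC, llvm::sys::fs::F_Text);
      if (EC)
        return;                                      // could not open the file
      if (const clang::RewriteBuffer *RB =
              R.getRewriteBufferFor(SM.getMainFileID()))
        RB->write(OS);                               // modified program
      else
        OS << SM.getBufferData(SM.getMainFileID());  // nothing was changed
    }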

6.5 User Defined Pragma Directives in MetaFork

The front-end of the MetaFork compiler is required to be able to preprocess and parse input programs written in different parallel programming languages (e.g. CilkPlus, MetaFork and OpenMP) based on Pragma directives. In order for a Pragma directive to be recognized as a user defined directive, and not as an ordinary comment or another directive, each occurrence in the input programs is matched against a specification pre-defined by the developers. If a match between a Pragma directive and the specification is successfully made, then the information contained in the Pragma directive is recorded and

4 http://clang.llvm.org/doxygen/RecursiveASTVisitor_8h_source.html

Figure 6.8: Overall work-flow of the Clang front-end

associated with the corresponding Clang statement to which that Pragma directive refers; otherwise, a syntax error is reported. This section outlines how we deal with our high-level abstractions (i.e. Pragma directives) during the overall processing phase of the MetaFork compiler.

6.5.1 Registration of Pragma Directives in the MetaFork Preprocessor

Figure 6.8 illustrates the whole process of the Clang front-end in a more detailed manner. We focus on how it deals with preprocessor directives, in particular Pragma directives. The Lexer reads from the input programs and divides the input stream into individual tokens for the Parser. More accurately, the input stream needs to be processed by the Preprocessor, which provides the ability to expand preprocessor directives while creating tokens. Over the course of processing the input stream, the Preprocessor is called by the Lexer once a preprocessor directive is detected. The Preprocessor then reads all tokens from the Lexer until the end of that directive. At this point, it is clear that the Preprocessor is the entry point for the MetaFork compiler to interact with the Clang front-end, since Pragma directives belong to the preprocessor directives.

     1 struct PragmaMetaHandler : public PragmaHandler {
     2   Sema &S;
     3   PragmaLocList &List;
     4
     5   PragmaMetaHandler(Sema &S, PragmaLocList &List) :
     6     PragmaHandler("mf"), List(List), S(S) {}
     7
     8   void HandlePragma(Preprocessor &PP, PragmaIntroducerKind Introducer,
     9                     Token &FirstToken) override;
    10 };

Figure 6.9: A code snippet showing how to create a MetaFork user defined PragmaHandler

The first step is to define the user defined Pragma directives which can be recognized by the Clang front-end. The connection between the user defined Pragma directives and the Clang front-end is implemented with the help of the PragmaHandler class 5, which is the handler instance registered to the Clang front-end and invoked when Pragma directives are encountered in input programs. For the sake of brevity, we take the implementation of the MetaFork Pragma directives as a sample to illustrate how a new Pragma handler is specified; the same method is applied to realize the implementation of the CilkPlus Pragma directives. As shown in Figure 6.9, a MetaFork Pragma handler class derived from PragmaHandler, namely PragmaMetaHandler, is defined. Each Pragma handler optionally has a PragmaNamespace to represent the base name of the Pragma directive. As required by the class PragmaHandler, this base name must be a textual string which immediately follows the #pragma on the same line. In the example, the base name is defined to be mf at Line (6). The most important function provided by the Pragma handler is the virtual function HandlePragma, which is a callback invoked on the preprocessor action. Once a Pragma directive with this base name is matched, the HandlePragma function is invoked automatically by the Clang Preprocessor. Then we need to register a Pragma handler instance with the Clang front-end so that it can take effect. In Figure 6.10, an instance of the PragmaMetaHandler class is registered by adding it to the Clang Preprocessor (PP), which can be retrieved from the CompilerInstance instance in the ASTFrontendAction instance. From now on, the Clang Preprocessor is aware of our user defined Pragma directives and will trigger user defined actions through the HandlePragma function as soon as Pragma directives are matched.

    PP.AddPragmaHandler(new PragmaMetaHandler(S, List));

5 http://clang.llvm.org/doxygen/classclang_1_1PragmaHandler.html

Figure 6.10: A code snippet showing how to register a new user defined Pragma handler instance with the Clang preprocessor

6.5.2 Parsing of Pragma Directives in the MetaFork Front-End

The Pragma directives are processed in the preprocessing phase, ahead of the complete Clang AST generation. Keep in mind that, within the MetaFork framework, Pragma directives are implemented without touching any core classes or functions in the Clang/LLVM source files. However, a minor drawback of this method is that the Clang AST cannot represent any parallelism information carried by Pragma directives. This implies the necessity of building a bridge to easily map a Pragma directive to the AST node it precedes. Hence, the objective of the Preprocessor comprises two aspects: (1) parsing the Pragma directives correctly and (2) offering an interface to match a Pragma directive with its corresponding AST node.

The procedure to achieve the first objective is straightforward. In the preprocessing phase, the Clang Preprocessor takes ownership of the Lexer in the presence of a tok::pp_pragma token; then it consumes the next token, which leads to two scenarios. If the token is correctly matched with mf (i.e. the base name of MetaFork Pragma directives), the Lexer consumes the input stream until the end of the Pragma directive, that is, until the tok::eod token is encountered, and this Pragma directive is stored in a list data structure. Conversely, if the matching cannot be performed, then the default behavior of the Clang front-end is applied.

Regarding the second objective, the observation is that by the time the PragmaHandler is called, the statement attached to that Pragma directive has not been parsed yet. Thus, the link between the Pragma directive and the AST node associated with that statement cannot be established at that time. To overcome this limitation imposed by the lack of any AST information when the PragmaHandler is being invoked, we retrieve the only context information available at that time, i.e. locations. The key point is that, immediately after the tok::eod token, the Lexer peeks ahead one token without consuming it. The location of this peeked token, namely the context location, is tied to its preceding Pragma directive and, as a basic property, this location information will also be captured by the AST during its construction. However, if a Pragma directive does not refer to a statement, like a #pragma mf join directive detailed in Section 6.6.1, the context location is ignored automatically by the MetaFork front-end.
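In terms of code, the peek-ahead step described above can be sketched as follows; the body is simplified (the real handler also parses clauses, see Figure 6.19), and the token buffer Body, the element type PragmaLoc and the push_back call on PragmaLocList are assumptions made only for this illustration.

    void PragmaMetaHandler::HandlePragma(Preprocessor &PP,
                                         PragmaIntroducerKind Introducer,
                                         Token &FirstToken) {
      Token Tok;
      PP.Lex(Tok);                          // first token after "mf"
      llvm::SmallVector<Token, 8> Body;
      while (Tok.isNot(tok::eod)) {         // read the directive up to end-of-directive
        Body.push_back(Tok);
        PP.Lex(Tok);
      }
      // Peek at the token following the directive without consuming it; its
      // location (the "context location") later identifies the AST node that
      // this directive precedes.
      SourceLocation ContextLoc = PP.LookAhead(0).getLocation();
      List.push_back(PragmaLoc(Body, ContextLoc)); // PragmaLoc is hypothetical
    }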

6.5.3 Attaching Pragma Directives to the Clang AST

As stated before, the MetaFork high-level annotations expressed in programs through Pragma directives are not visible in the Clang AST. Without the association between a Pragma directive and its related AST node, a pure AST representation is meaningless from the perspective of a source-to-source compiler, because we cannot determine where to start the transformations. Our solution is to use a RecursiveASTVisitor to visit each node of the Clang AST. We query the location from which the AST node was created. If it is identical to the context location of a Pragma directive, then this Pragma directive gets associated with that AST node.
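A sketch of this matching step, in terms of the Clang 3.6 API, could look as follows; MetaVisitor, PragmaList and the fields ContextLoc and Node are hypothetical names standing for the visitor and for the list of directives recorded as in Section 6.5.2.

    // Sketch: associate each recorded directive with the statement whose
    // starting location equals the directive's context location.
    bool MetaVisitor::VisitStmt(clang::Stmt *S) {
      for (auto &P : PragmaList)               // directives recorded at parse time
        if (S->getLocStart() == P.ContextLoc)  // same "context location"
          P.Node = S;                          // attach the AST node
      return true;                             // keep traversing
    }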

6.6 Implementation

To date, the MetaFork compilation framework supports two flavors of annotations in programs to express parallelism: keywords and Pragma directives. Both flavors are designed to achieve the same goal of aiding the development of compiler-supported accelerator technologies. In fact, the keywords flavor, as a supplement to the Pragma directives flavor, is no more than syntactic sugar for programmers, since keyword constructs are turned into Pragma directives by the Preprocessor. In Sections 6.6.1 and 6.6.2, we describe the way in which the various MetaFork and CilkPlus constructs (both keywords and Pragma directives) are handled by the MetaFork front-end, by means of a list of example cases.

6.6.1 Parsing MetaFork Constructs

Table 6.1: MetaFork constructs and clauses

Keywords          Pragma directives           Optional clauses
meta_fork         #pragma mf fork             shared
meta_for          #pragma mf parallel for
meta_join         #pragma mf join
meta_schedule     #pragma mf schedule         cache

Table 6.1 summarizes the MetaFork constructs, in both the Keywords and Pragma directives flavors, and the clauses currently recognized by the MetaFork front-end. In addition, the first column (resp. second column) in conjunction with the third column denotes the legal combinations of Keywords (resp. Pragma directives) and clauses. Finally, the MetaFork front-end substitutes the MetaFork Keywords with the corresponding Pragma directives defined in Figure 6.11 when preprocessing the input programs.

    #define meta_fork     _Pragma(" mf fork ")
    #define meta_schedule _Pragma(" mf schedule ")
    #define meta_for      _Pragma(" mf parallel for for ")
    #define meta_join     _Pragma(" mf join ")

Figure 6.11: Converting MetaFork Keywords to Pragma directives

Parsing the meta for construct. In Figure 6.12, the MetaFork Pragma directive at Line (1) annotates the for loop at Line (2) as a parallel for loop. After consuming the tok::pp_pragma token, as usual, the Preprocessor reads the next token mf, which indicates that this Pragma directive is introduced by the MetaFork programming language. Then the following two tokens (parallel and for) are consumed, denoting this Pragma directive as a parallel for loop directive. Immediately after, a new-line token, tok::eod, terminates the parsing of the whole Pragma directive at Line (1). In the last step, the Lexer uses a look-ahead technique to collect the location information of the for token at Line (2). The entire procedure follows the steps discussed in Section 6.5.2.

    1 #pragma mf parallel for
    2 for(int i=0;i<10;i++)
    3 {
    4
    5 }

Figure 6.12: A code snippet showing how to annotate a parallel for loop with MetaFork Pragma directive

On the other hand, the Keywords version (Figure 6.13) of its Pragma directive counterpart (Figure 6.12) is obtained using the meta for keyword. The code in Figure 6.14 is obtained after preprocessing the code in Figure 6.13. Note the difference between Figures 6.12

    1 meta_for(int i=0;i<10;i++)
    2 {
    3
    4 }

Figure 6.13: A code snippet showing how to annotate a parallel for loop with MetaFork keyword

and 6.14. Due to the substitution of the meta for construct, the last for token at Line (1), which should be part of the C/C++ for loop at Line (2) to form valid code, appears on the same line as the Pragma directive. By default, the Clang Preprocessor would consume that last for token as part of the Pragma directive; thus, from the Clang Parser's perspective, this results in invalid code starting from Line (2). To address this problem, a trick is applied by the MetaFork Preprocessor. After the Preprocessor consumes the first for token at Line (1), we set it as a tok::eod token by force. As a result, the Preprocessor ends and the Lexer advances to the next token, that is, the second for, which is now associated with Line (2) to constitute a valid for loop statement. This trick is done by the code in Figure 6.15.

    1 #pragma mf parallel for for
    2 (int i=0;i<10;i++)
    3 {
    4
    5 }

Figure 6.14: The code snippet after preprocessing the code in Figure 6.13

    1 // consume the first for
    2 PP.Lex(Tok);
    3
    4 // we end Clang Preprocessor and set end of directive
    5 PP.getCurrentLexer()->setParsingPreprocessorDirective(false);
    6 Tok.setKind(tok::eod);

Figure 6.15: Setting the tok::eod token

Parsing the meta fork construct. In Figure 6.16, the MetaFork Pragma directive at Line (1) annotates the statement at Line (2) as a parallel region. The equivalent Keywords version is displayed in Figure 6.17 using the meta fork keyword and the shared clause. The preprocessed code of Figure 6.17 is listed in Figure 6.18.

    1 #pragma mf fork shared(...)
    2 statement

Figure 6.16: A code snippet showing how to annotate a parallel region with MetaFork Pragma directive

The issue arising from the code in Figure 6.18 is that the shared clause, as a component of the Pragma directive, is on a separate line, which by default the Preprocessor cannot reach. We use a unified strategy to handle the cases in Figures 6.16 and 6.18. After the fork token is parsed, we set it as a tok::eod token by force and terminate the Preprocessor. The Lexer then continues and looks ahead to the next token. If it is a shared token, the Lexer consumes tokens up to the closing ")", as described in Figure 6.19, and then collects the context location information. Otherwise, the Lexer merely extracts the context location information.

Parsing the meta join construct. In Figure 6.20, Line (1) annotates a join directive; the counterpart Keywords version is shown in Figure 6.21 using the meta join keyword. The preprocessed code of Figure 6.21 is shown in Figure 6.22. The parsing procedure is almost the same as the steps discussed in Section 6.5.2, except that a context location is not needed since a join directive is a standalone Pragma directive.

    1 meta_fork shared(...)
    2 statement

Figure 6.17: A code snippet showing how to annotate a parallel region with MetaFork keyword

    1 #pragma mf fork
    2 shared(...)
    3 statement

Figure 6.18: The code snippet after preprocessing the code of Figure 6.17

     1 // consume "fork" token
     2 PP.getCurrentLexer()->setParsingPreprocessorDirective(false);
     3 Tok.setKind(tok::eod);
     4
     5 while (PP.LookAhead(0).is(tok::identifier) &&
     6        PP.LookAhead(0).getIdentifierInfo()->isStr("shared"))
     7 {
     8   // consume shared
     9   PP.Lex(Tok);
    10   // consume '('
    11   PP.Lex(Tok);
    12   bool LexID = true;
    13
    14   while (true) {
    15     PP.Lex(Tok);
    16
    17     if (LexID) {
    18       if (Tok.is(tok::identifier)) {
    19         Vars.push_back(Tok.getIdentifierInfo()->getName());
    20         LexID = false;
    21         continue;
    22       }
    23     }
    24     // We are expecting a ')' or a ','.
    25     if (Tok.is(tok::comma)) {
    26       LexID = true;
    27       continue;
    28     }
    29     if (Tok.is(tok::r_paren)) {
    30       enddirective = Tok.getLocation();
    31       break;
    32     }
    33   }
    34 }

Figure 6.19: Consuming the shared clause

    1 #pragma mf join

Figure 6.20: A code snippet showing how to annotate a join construct with MetaFork Pragma directive

    1 meta_join;

Figure 6.21: A code snippet showing how to annotate a join construct with MetaFork Keywords

1 #pragma mf join
2 ;
Figure 6.22: The code snippet after preprocessing the code of Figure 6.21

counterpart Keywords version is shown in Figure 6.24 using the meta schedule keyword together with the cache clause. The preprocessed code of Figure 6.24 is shown in Figure 6.25. The meta schedule construct is parsed with the same method as the meta fork construct.

1 #pragma mf schedule cache(...)
2 statement
Figure 6.23: A code snippet showing how to annotate device code with MetaFork Pragma directive
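The MetaFork Pragma directives above all share the base name mf. As an illustration of how such user-defined directives can be hooked into Clang without modifying its sources, the sketch below registers a custom clang::PragmaHandler for that namespace. This is only an illustrative assumption, not the actual MetaFork front-end code: the class name MetaForkForkHandler is ours, and the exact HandlePragma signature varies across Clang releases.

#include "clang/Lex/Pragma.h"
#include "clang/Lex/Preprocessor.h"

namespace {

// Hypothetical handler for "#pragma mf fork [shared(...)]".
class MetaForkForkHandler : public clang::PragmaHandler {
public:
  MetaForkForkHandler() : clang::PragmaHandler("fork") {}

  void HandlePragma(clang::Preprocessor &PP,
                    clang::PragmaIntroducer Introducer,
                    clang::Token &ForkTok) override {
    // End the directive immediately so that the Lexer can keep scanning the
    // annotated statement, mirroring the trick of Figures 6.15 and 6.19.
    PP.getCurrentLexer()->setParsingPreprocessorDirective(false);
    ForkTok.setKind(clang::tok::eod);
    // An optional shared(...) clause would be consumed here by looking ahead
    // with PP.LookAhead(0), exactly as in Figure 6.19.
  }
};

} // anonymous namespace

// Registration under the "mf" namespace, for example when the front-end
// action sets up its preprocessor:
//   PP.AddPragmaHandler("mf", new MetaForkForkHandler());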

6.6.2 Parsing CilkPlus Constructs

Table 6.2 summarizes the CilkPlus constructs, in both their Keywords and Pragma directive flavors, that the MetaFork front-end currently recognizes. Note that no clause is supported yet. In the preprocessing phase, the MetaFork front-end substitutes the CilkPlus Keywords constructs with the corresponding Pragma directives defined in Figure 6.26. Compared to MetaFork constructs, CilkPlus constructs are defined with another base name, namely cilk, while the parsing method is almost identical to that of the MetaFork constructs.

Parsing the cilk for construct. In Figure 6.27, the CilkPlus Pragma directive at Line (1) annotates the for loop at Line (2) as a parallel for loop, and the counterpart Keywords version is shown in Figure 6.28 using the cilk for keyword. The preprocessed code of Figure 6.28 is shown in Figure 6.29. The cilk for construct is parsed with the same method as the meta for construct, as shown in Section 6.6.1.
Parsing the cilk spawn construct. In Figure 6.30, the CilkPlus Pragma directive at Line (1) annotates the statement at Line (2) as a parallel function call. The equivalent CilkPlus Keywords version is displayed in Figure 6.31 using the cilk spawn keyword. The preprocessed code of Figure 6.31 is listed in Figure 6.32. The cilk spawn construct is parsed with almost the same method as the meta fork construct, as shown in Section 6.6.1, except that no clause is parsed here.
Parsing the cilk sync construct. In Figure 6.33, Line (1) annotates a sync directive and the counterpart Keywords version is shown in Figure 6.34 using the cilk sync keyword. The preprocessed code of Figure 6.34 is shown in Figure 6.35. The cilk sync construct is parsed with the same method as the meta join construct, as shown in Section 6.6.1.

1 meta_schedule cache(...)
2 statement
Figure 6.24: A code snippet showing how to annotate device code with MetaFork Keywords

1 #pragma mf schedule
2 cache(...)
3 statement
Figure 6.25: The code snippet after preprocessing the code of Figure 6.24

Table 6.2: CilkPlus constructs and clauses

Keywords      Pragma directives            optional clauses
cilk_spawn    #pragma cilk spawn
cilk_for      #pragma cilk parallel for
cilk_sync     #pragma cilk sync

#define cilk_spawn _Pragma(" cilk spawn ")
#define cilk_for   _Pragma(" cilk parallel for for ")
#define cilk_sync  _Pragma(" cilk sync ")
Figure 6.26: Convert CilkPlus Keywords to Pragma directives

1 #pragma cilk parallel for
2 for(int i=0;i<10;i++)
3 {
4
5 }
Figure 6.27: A code snippet showing how to annotate a parallel for loop with CilkPlus Pragma directive

1 cilk_for(int i=0;i<10;i++)
2 {
3
4 }
Figure 6.28: A code snippet showing how to annotate a parallel for loop with CilkPlus keyword

1 #pragma cilk parallel for for
2 (int i=0;i<10;i++)
3 {
4
5 }
Figure 6.29: The code snippet after preprocessing the code of Figure 6.28

1 #pragma cilk spawn
2 function call;
Figure 6.30: A code snippet showing how to annotate a parallel function call with CilkPlus Pragma directive

1 cilk_spawn function call;
Figure 6.31: A code snippet showing how to annotate a parallel function call with CilkPlus keyword

1 #pragma cilk spawn
2 function call;
Figure 6.32: The code snippet after preprocessing the code of Figure 6.31

1 #pragma cilk sync
Figure 6.33: A code snippet showing how to annotate a sync construct with CilkPlus Pragma directive

1 cilk_sync;
Figure 6.34: A code snippet showing how to annotate a sync construct with CilkPlus Keyword

1 #pragma cilk sync
2 ;
Figure 6.35: The code snippet after preprocessing the code of Figure 6.34

6.7 Summary

In this chapter, we have presented our MetaFork compilation framework with two objectives: (1) offering high-level parallel programming constructs, and (2) offering source-to-source transformation tools that make use of this framework. With the first objective, our framework addresses major challenges, such as performance, portability and scalability, imposed by low-level parallel programming APIs; with the second, it builds a bridge between our framework and contemporary concurrency platforms. Developing such a framework from scratch is strenuous. Hence, the MetaFork compilation framework is built on top of Clang, a modern production compiler, for parsing input programs and generating their ASTs. In this way, we can manipulate the mainstream C/C++ language extensions OpenMP and CilkPlus, as well as MetaFork.
As a part of the MetaFork compilation framework, we defined a general scheme for handling user-defined Pragma directives, since MetaFork, as a parallel programming language, opts for Pragma directives to explicitly expose parallelism in source programs. This scheme makes our framework a standalone tool that does not edit any Clang source files and eases the burden of integrating new Pragma directives into our framework if needed. Additionally, we outlined methods to traverse the Clang AST efficiently, since developing source-to-source transformation tools requires querying the AST frequently.

Chapter 7

Towards Comprehensive Parametric CUDA Kernel Generation

In Chapter 4, we demonstrated that, from an annotated C code, it was possible to generate CUDA kernels that depend on program parameters considered unknown at compile-time. Our experimental results in Chapter 4 suggest that those parametric CUDA kernels can help increase the portability and performance of CUDA code.
In the present chapter, we enhance this strategy as follows. First, we propose an algorithm for comprehensive optimization, allowing us to optimize C code (and in particular CUDA code) depending on unknown machine and program parameters. Then, we use this algorithm to generate optimized parametric CUDA kernels, in the form of a case distinction based on the possible values of the machine and program parameters [137]. We call the resulting CUDA kernels comprehensive parametric CUDA kernels, see Section 7.2. In broad terms, this is a decision tree, where each edge holds a Boolean expression, given by a conjunction of polynomial constraints, and each leaf is either a CUDA kernel or the symbol ∅, such that for each leaf K, with K ≠ ∅, we have:
1. K works correctly under the conjunction of the Boolean expressions located between the root node and the leaf, and
2. K is semantically equivalent to a common input annotated C code P.
In each Boolean expression, the unknown variables represent machine parameters (like hardware resource limits), program parameters (like dimension sizes of thread-blocks) or data parameters (like input data size). The symbol ∅ is used to denote a situation (in fact, value ranges for the machine and program parameters) where no CUDA kernel equivalent to P is provided.
The intention, with the concept of comprehensive parametric CUDA kernels, is to automatically generate optimized CUDA kernels from an annotated C code without knowing the numerical values of some, or all, of the machine and program parameters. This naturally yields a case distinction depending on the values of those parameters. Indeed, some optimization techniques (like loop unrolling) can only be applied when enough computing resources are available, while other optimization techniques (like common sub-expression elimination) can be applied to reduce computing resource consumption. These case distinctions can be handled by techniques from symbolic computation, for which software libraries are available, in particular, the RegularChains library freely available at www.regularchains.org.
Other research groups have approached the questions of code portability and code

optimization in the context of CUDA code generation from high-level programming models. They use techniques like auto-tuning [63, 86], dynamic instrumentation [89] or both [136]. Rephrasing [86], "those techniques explore empirically different data placement and thread/block mapping strategies, along with other code generation decisions, thus facilitating the finding of a high-performance solution." In the case of auto-tuning techniques, which have been used successfully in the celebrated projects ATLAS [151], FFTW [57], and SPIRAL [124], part of the code optimization process is done off-line, that is, the input code is analyzed and an optimization strategy (i.e. a sequence of composable code transformations) is generated, which is then applied on-line (i.e. on the targeted hardware). We propose to push this idea further by also applying the optimization strategy off-line, that is, even before the code is loaded on the targeted hardware.
Let us illustrate, with an example, the notion of comprehensive parametric CUDA kernels, along with a procedure to generate them. For computing the sum of two matrices a and b of order N, our input is the meta for-loop nest within the meta schedule statement on the right-hand portion of Figure 7.1, whereas the serial code without tiling is provided on the left-hand portion of Figure 7.1.

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    c[i][j] = a[i][j] + b[i][j];

int dim0 = N/B0, dim1 = N/(2*B1);
meta_schedule {
  meta_for (int v = 0; v < dim0; v++)
    meta_for (int p = 0; p < dim1; p++)
      meta_for (int u = 0; u < B0; u++)
        meta_for (int q = 0; q < B1; q++) {
          int i = v * B0 + u;
          int j = p * B1 + q;
          if (i < N && j < N/2) {
            c[i][j] = a[i][j] + b[i][j];
            c[i][j+N/2] = a[i][j+N/2] + b[i][j+N/2];
          }
        }
}

(a) Before tiling, the C program    (b) After tiling, the MetaFork program
Figure 7.1: Matrix addition written in C (the left-hand portion) and in MetaFork (the right-hand portion) with a meta for loop nest, respectively

We make the following simplistic assumptions for the translation of this meta for-loop nest to a CUDA program.
1. The target machine has two parameters: the maximum number R2 of registers per thread, and the maximum number R1 of threads per thread-block; moreover, all other hardware limits are ignored.
2. The generated kernels depend on two program parameters, B0 and B1, which define the format of a 2D thread-block.
3. The optimization strategy (w.r.t. register usage per thread) consists in reducing the work per thread by removing the 2-way loop unrolling.

The possible comprehensive parametric CUDA kernels are given by the pairs (C1, K1) and (C2, K2), where C1, C2 are two sets of algebraic constraints on the machine and program parameters and K1, K2 are two CUDA kernels that are optimized under the constraints given, respectively, by C1, C2, see Figure 7.2. The following computational steps yield the pairs (C1, K1) and (C2, K2).
(S1) Tiling techniques, based on quantifier elimination (QE), are applied to the meta for loop nest of Figure 7.1 in order to decompose the matrices into tiles of format B0 × B1, see [29] for details.
(S2) The tiled MetaFork code is mapped to an intermediate representation (IR), say that of LLVM, or alternatively, to PTX1 code.
(S3) Using this IR (or PTX) code, one can estimate the number of registers that a thread requires; using LLVM IR on this example, we obtain an estimate of 14.
(S4) Next, we apply the optimization strategy [50], yielding a new IR (or PTX) code, for which the register pressure reduces to 10. Since no other optimization techniques are considered, the procedure stops with the result shown in Figure 7.2.
Based on the above steps, Figure 7.3 shows the decision tree for generating these two pairs, each of them consisting of a system of polynomial constraints and a CUDA kernel for matrix addition. Note that, strictly speaking, the kernels K1 and K2 in Figure 7.2 should be given by PTX code. But for simplicity, we present them by their CUDA code counterparts.

C1 :  B0 × B1 ≤ R1
      14 ≤ R2

__global__ void K1(int *a, int *b, int *c, int N, int B0, int B1) {
  int i = blockIdx.y * B0 + threadIdx.y;
  int j = blockIdx.x * B1 + threadIdx.x;
  if (i < N && j < N/2) {
    a[i*N+j] = b[i*N+j] + c[i*N+j];
    a[i*N+j+N/2] = b[i*N+j+N/2] + c[i*N+j+N/2];
  }
}
dim3 dimBlock(B1, B0);
dim3 dimGrid(N/(2*B1), N/B0);
K1 <<<dimGrid, dimBlock>>> (a, b, c, N, B0, B1);

C2 :  B0 × B1 ≤ R1
      10 ≤ R2 < 14

__global__ void K2(int *a, int *b, int *c, int N, int B0, int B1) {
  int i = blockIdx.y * B0 + threadIdx.y;
  int j = blockIdx.x * B1 + threadIdx.x;
  if (i < N && j < N)
    a[i*N+j] = b[i*N+j] + c[i*N+j];
}
dim3 dimBlock(B1, B0);
dim3 dimGrid(N/B1, N/B0);
K2 <<<dimGrid, dimBlock>>> (a, b, c, N, B0, B1);

Figure 7.2: Comprehensive translation of MetaFork code to two kernels for matrix addition

1 The Parallel Thread Execution (PTX) [7] is the pseudo-assembly language to which CUDA programs are compiled by NVIDIA's NVCC compiler. PTX code can also be generated from (enhanced) LLVM IR, using the nvptx back-end [3], following the work of [128].

[Figure: a decision tree whose root assumes B0 × B1 ≤ R1 and then branches on the register limit: the branch 14 ≤ R2 leads to the leaf K1, while the branch R2 < 14 is further split into 10 ≤ R2, leading to the leaf K2, and R2 < 10, for which no kernel is produced.]
Figure 7.3: The decision tree for comprehensive parametric CUDA kernels of matrix addition
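To make the use of such a decision tree concrete, the following host-side sketch selects, at run time, which generated kernel applies once concrete values of the machine parameters (R1, R2) and program parameters (B0, B1) are known. This is our own illustration, not code produced by MetaFork; the function and enum names are assumptions.

#include <cstdio>

// Hypothetical selector mirroring the decision tree of Figure 7.3:
// C1 = { B0*B1 <= R1, 14 <= R2 }       -> kernel K1 (2-way unrolled)
// C2 = { B0*B1 <= R1, 10 <= R2 < 14 }  -> kernel K2 (no unrolling)
enum class KernelChoice { K1, K2, None };

KernelChoice selectMatrixAdditionKernel(int R1, int R2, int B0, int B1) {
  if (B0 * B1 > R1)   // thread-block too large for this device
    return KernelChoice::None;
  if (R2 >= 14)       // enough registers per thread for K1
    return KernelChoice::K1;
  if (R2 >= 10)       // only K2 fits within the register budget
    return KernelChoice::K2;
  return KernelChoice::None;
}

int main() {
  // Example: a device with 1024 threads per block and 32 registers per
  // thread, and a 16 x 32 thread-block format; this selects K1.
  KernelChoice k = selectMatrixAdditionKernel(1024, 32, 16, 32);
  std::printf("selected kernel: %d\n", static_cast<int>(k));
  return 0;
}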

One can observe that loop unrolling is applied to K1 so as to increase arithmetic intensity. This code transformation increases register pressure, which is possible under the constraints of C1 but not under those of C2.
In general, to achieve a comprehensive translation of the annotated C program P into CUDA kernels, one could be tempted to proceed as follows:
(S1) Perform a comprehensive optimization of P, as a MetaFork program, by applying the comprehensive optimization algorithm demonstrated in Section 7.1.

(S2) Apply the MetaFork-to-CUDA code generator introduced in Chapter 4 to each MetaFork program generated in the first step.
However, since the system of polynomial constraints associated with each optimized MetaFork program is determined by an IR representation of that program, this system would not be accurate for the CUDA code generated by a source-to-source MetaFork-to-CUDA code generator. In fact, if PTX is used as the IR in Step (S1), then Step (S2) is no longer necessary. However, Step (S2) produces human-readable code that gives the programmer some sense of the source code transformations performed at Step (S1).
The reason we choose MetaFork programs as our input programs is that the MetaFork language can be used to write both "high-level" (in the spirit of OpenMP) and "lower-level" (closer to CUDA) parallel programs. Indeed, MetaFork is another high-level heterogeneous programming model like OpenMP, OpenACC and C++ AMP [154]. Hence, turning an unoptimized MetaFork program into optimized MetaFork programs can be seen as a first approximation of our final goal, that is, generating optimized parametric CUDA kernels from unoptimized annotated C input programs. The presented comprehensive optimization algorithm, combined with our previous work in Chapter 4, can be used to complete a comprehensive translation of a MetaFork program into parametric CUDA kernels.
Section 7.1 proposes a comprehensive optimization algorithm for optimizing an input code fragment depending on unknown machine and program parameters. In Section 7.2, we describe the procedure of comprehensive translation of an annotated C program, namely a MetaFork program, into parametric CUDA kernels. An implementation of the comprehensive optimization algorithm is presented in Section 7.3 so as to generate optimized MetaFork programs from a given MetaFork program. We conduct experiments in Section 7.4 on six simple test cases: array reversal, matrix vector multiplication, 1D Jacobi, matrix addition, matrix transpose and matrix matrix multiplication.

This work is a joint project with Ning Xie and Marc Moreno Maza.

7.1 Comprehensive Optimization

We consider a code fragment written in the C language or in one of its linguistic extensions targeting a computing device, which can be, for instance, a hardware accelerator or a desktop CPU. We assume that some, or all, of the hardware characteristics of this device are unknown at compile time. However, we would like to optimize our input code fragment w.r.t. prescribed resource counters (e.g. memory usage) and performance counters (e.g. clock-cycles per instruction). To this end, we treat the hardware characteristics of this device as symbols and generate polynomial constraints (with those symbols as indeterminate variables) stating when a given code transformation is valid.
Section 7.1.1 states the hypotheses made on the input code fragment. Section 7.1.2 specifies the notations for the hardware characteristics of the targeted device. In Section 7.1.3, we describe the evaluation of resource and performance counters. In Section 7.1.4, we define the optimization strategies that can reduce resource counters or increase performance counters. Section 7.1.5 formally defines the comprehensive optimization of an input code fragment. Section 7.1.6 specifies the data structures used in our algorithm. Finally, in Section 7.1.7, we present our algorithm for comprehensive optimization.

7.1.1 Hypotheses on the Input Code Fragment

We consider a sequence S of statements from the C programming language and introduce the following.

Definition 1 We call parameter of S any scalar variable that is (i) read in S at least once and (ii) never written in S. We call data of S any non-scalar variable (e.g. an array) that is not initialized but possibly overwritten within S. If a parameter of S gives a dimension size of a data of S, then this parameter is called a data parameter; otherwise, it is simply called a program parameter.

Notation 1 We denote by D1,..., Du and E1,..., Ev the data parameters and program parameters of S, respectively.

Hypothesis 1 We make the following assumptions on S.
(H1) All parameters are assumed to be non-negative integers.
(H2) We assume that S can be viewed as the body of a valid C function having the parameters and data of S as its only arguments.

Example 1 S can be the body of a kernel function in CUDA. Recall the kernel code for computing matrix vector multiplication in Figure 4.14 of Chapter 4. This kernel code multiplies a square matrix a of order n with a vector b of length n and stores the result in a vector c of length n. We note that a, b and c are the data, and that n is the data parameter. Moreover, the grid and thread-block dimensions of this kernel are specified as dim and B, respectively, which are then the program parameters.

7.1.2 Hardware Resource Limits and Performance Measures

We denote by R1,..., Rs the hardware resource limits of the targeted hardware device. Examples of these quantities for the NVIDIA Kepler micro-architecture are:
- the maximum number of registers to be allocated per thread,
- the maximum number of shared memory words to be allocated per thread-block,
- the maximum number of threads in a thread-block.
We denote by P1,..., Pt the performance measures of a program running on the device. These are dimensionless quantities typically defined as percentages. Examples of these quantities for the NVIDIA Kepler micro-architecture are:
- the ratio of the actual to the maximum number of words that can be read or written per unit of time from the global memory,
- the ratio of the actual to the maximum number of floating point operations that can be performed per unit of time by all streaming multiprocessors (SMs),
- the SM occupancy, that is, the ratio of active warps to the maximum number of active warps,
- the cache hit rate in an SM [11].
For a given hardware device, R1,..., Rs are positive integers, and each of them is the maximum value of a hardware resource. Meanwhile, P1,..., Pt are rational numbers between 0 and 1. However, for the purpose of writing code portable across a variety of devices with similar characteristics, the quantities R1,..., Rs and P1,..., Pt will be treated as unknown and independent variables. These hardware resource limits and performance measures will be called the machine parameters.
Each function K (and, in particular, our input code fragment S) written in the C language for the targeted hardware device has resource counters r1,..., rs and performance counters p1,..., pt corresponding, respectively, to R1,..., Rs and P1,..., Pt. In other words, the quantities r1,..., rs are the amounts of resources, corresponding to R1,..., Rs, respectively, that K requires for executing. Similarly, the quantities p1,..., pt are the performance measures, corresponding to P1,..., Pt, respectively, that K exhibits when executing. Therefore, the inequalities 0 ≤ r1 ≤ R1, ..., 0 ≤ rs ≤ Rs must hold for the function K to execute correctly. Similarly, 0 ≤ p1 ≤ 1, ..., 0 ≤ pt ≤ 1 are satisfied by the definition of the performance measures.

Remark 1 We note that r1,..., rs, p1,..., pt may be numerical values, which we can assume to be non-negative rational numbers. This will be the case, for instance, for the minimum number of registers required per thread in a thread-block. The resource counters r1,..., rs may also be polynomial expressions whose indeterminate variables can be program parameters (like the dimension sizes of a thread-block or grid) or data parameters (like the input data sizes). Meanwhile, the performance counters p1,..., pt may further depend on the hardware resource limits (like the maximum number of active warps supported by an SM). To summarize, we observe that r1,..., rs are polynomials in Q[D1,..., Du, E1,..., Ev] and p1,..., pt are rational functions whose numerators and denominators are in Q[D1,..., Du, E1,..., Ev, R1,..., Rs]. Moreover, we can assume that the denominators of those rational functions are positive.

Example 2 For computing the product of a dense square matrix of order N by a dense vector of length N, consider the serial C code on the left-hand portion of Figure 7.4 and the corresponding MetaFork code on the right-hand portion of Figure 7.4. The meta schedule statement yields

the CUDA kernel shown in Figure 4.14 of Chapter 4. The grid of this kernel is one-dimensional of size dim, and its thread-blocks are also one-dimensional, each counting B threads.

for (int p = 0; p < N; p++)
  for (int q = 0; q < N; q++)
    c[p] += a[p][q] * b[q];

int dim = N / B;
meta_schedule {
  meta_for (int v = 0; v < dim; v++)
    for (int i = 0; i < dim; ++i)
      meta_for (int u = 0; u < B; u++)
        for (int j = 0; j < B; ++j) {
          int p = v * B + u;
          int q = i * B + j;
          c[p] += a[p][q] * b[q];
        }
}

(a) Before tiling, the C program    (b) After tiling, the MetaFork program
Figure 7.4: Matrix vector multiplication written in C (the left-hand portion) and in MetaFork (the right-hand portion), respectively

Observe that, in the process of generating this CUDA kernel, some optimization techniques (like using the shared memory for arrays a, b and c in Figure 4.14 of Chapter 4) are applied, while other optimization techniques remain to be applied (like controlling the granularity of threads, as used in Figure 7.9). If R2 and R3 are two machine parameters that define, respectively, the maximum number of threads in a thread-block and the maximum number of shared memory words per thread-block, then the following constraints must hold:

B ≤ R2 and B² + 2B ≤ R3.

Note that, in Figure 4.14 of Chapter 4, each thread-block allocates B² shared memory words for the array a, and B shared memory words for each of the arrays b and c. This leads to the inequality B² + 2B ≤ R3.
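For illustration only, and with assumed numerical values (Kepler-like limits of R2 = 1024 threads per thread-block and 48 KB of shared memory, i.e. R3 = 12288 four-byte words, per thread-block): taking B = 32 gives B = 32 ≤ 1024 and B² + 2B = 1024 + 64 = 1088 ≤ 12288, so both constraints hold; taking B = 128 instead gives B² + 2B = 16384 + 256 = 16640 > 12288, so the shared memory constraint fails even though B = 128 ≤ 1024 still holds.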

7.1.3 Evaluation of Resource and Performance Counters

Let GC(S) be the control flow graph (CFG) [152] of S. Hence, the statements in the basic blocks of GC(S) are C statements, and we call such a CFG the source CFG. We also map S to an intermediate representation, which, itself, is encoded in the form of a CFG, denoted by GL(S), and we call it the IR CFG. Here, we refer to the landmark textbook [8] for the notion of the control flow graph and that of intermediate representation. We observe that S can trivially be reconstructed from GC(S); hence, the knowledge of S and that of GC(S) can be regarded as equivalent. In contrast, GL(S) depends not only on S but also on the optimization strategies that are applied to the IR of S. Equipped with GC(S) and GL(S), we assume that we can estimate each of the resource counters r1,..., rs (resp. performance counters p1,..., pt) by applying functions f1,..., fs (resp. g1,..., gt) to either GC(S) or GL(S). We call f1,..., fs (resp. g1,..., gt) the resource (resp. performance) evaluation functions.

For instance, when S is the body of a CUDA kernel and S reads (resp. writes) a given array, the total number of elements read (resp. written) by one thread-block can be determined from GC(S). Meanwhile, computing the minimum number of registers to be allocated to a thread executing S requires the knowledge of GL(S).

7.1.4 Optimization Strategies

In order to reduce the consumption of hardware resources and increase the performance counters, we assume that we have w optimization procedures O1,..., Ow, each of them mapping either a source CFG to another source CFG, or an IR CFG to another IR CFG. Of course, we assume that the code transformations performed by O1,..., Ow preserve semantics. We associate each resource counter ri, for i = 1 ··· s, with a non-empty subset σ(ri) of {O1,..., Ow}, such that we have

fi(O(S)) ≤ fi(S) for O ∈ σ(ri). (7.1)

Hence, σ(ri) is a subset of the optimization strategies among O1,..., Ow that have the potential to reduce ri. Of course, the intention is that for at least one O ∈ σ(ri), we have fi(O(S)) < fi(S). A reason for not finding such O would be that S cannot be further optimized w.r.t. ri. We also make a natural idempotence assumption:

fi(O(O(S))) = fi(O(S)) for O ∈ σ(ri). (7.2)

Similarly, we associate each performance counter pi, for i = 1 ··· t, with a non-empty subset σ(pi) of {O1,..., Ow}, such that we have

gi(O(S)) ≥ gi(S) and gi(O(O(S))) = gi(O(S)) for O ∈ σ(pi). (7.3)

Hence, σ(pi) is a subset of the optimization strategies among O1,..., Ow that have the potential to increase pi. The intention is, again, that for at least one O ∈ σ(pi), we have gi(O(S)) > gi(S).
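As a concrete (and purely illustrative) association, anticipating the implementation of Section 7.3, one could record σ by mapping each counter to the strategies that may reduce it. The string identifiers below are assumptions made for this sketch, not definitions from the MetaFork implementation.

#include <map>
#include <set>
#include <string>

int main() {
  // sigma(r_i): the optimization procedures that may reduce a given counter.
  // The pairing mirrors the strategies listed in Section 7.3 and is illustrative.
  std::map<std::string, std::set<std::string>> sigma = {
      {"register_pressure_per_thread",
       {"common_subexpression_elimination", "reduce_register_pressure",
        "granularity_set_to_1"}},
      {"shared_memory_words_per_thread_block",
       {"granularity_set_to_1", "refuse_caching"}}};
  return 0;
}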

7.1.5 Comprehensive Optimization

Let C1,..., Ce be semi-algebraic systems with P1,..., Pt, R1,..., Rs, D1,..., Du, E1,..., Ev as indeterminate variables. Let S1,..., Se be fragments of C programs such that the parameters of each of them are among D1,..., Du, E1,..., Ev.

Definition 2 We say that the sequence of pairs (C1, S1),..., (Ce, Se) is a comprehensive optimization of S w.r.t.
- the resource evaluation functions f1,..., fs,
- the performance evaluation functions g1,..., gt and
- the optimization strategies O1,..., Ow
if the following conditions hold:
(i) [constraint soundness] Each of the semi-algebraic systems C1,..., Ce is consistent, that is, admits at least one real solution.

(ii) [code soundness] For all real values h1,..., ht, x1,..., xs, y1,..., yu, z1,..., zv of P1,..., Pt, R1,..., Rs, D1,..., Du, E1,..., Ev, respectively, and for all i ∈ {1,..., e} such that (h1,..., ht, x1,..., xs, y1,..., yu, z1,..., zv) is a solution of Ci, the code fragment Si produces the same output as S on any data that makes S execute correctly.
(iii) [coverage] For all real values y1,..., yu, z1,..., zv of D1,..., Du, E1,..., Ev, respectively, there exist i ∈ {1,..., e} and real values h1,..., ht, x1,..., xs of P1,..., Pt, R1,..., Rs, such that (h1,..., ht, x1,..., xs, y1,..., yu, z1,..., zv) is a solution of Ci and Si produces the same output as S on any data that makes S execute correctly.
(iv) [optimality] For every i ∈ {1,..., s} (resp. {1,..., t}), there exists ℓ ∈ {1,..., e} such that for all O ∈ σ(ri) (resp. σ(pi)) we have fi(O(Sℓ)) = fi(Sℓ) (resp. gi(O(Sℓ)) = gi(Sℓ)).

To summarize Definition 2 in non-technical terms:
- Condition (i) states that each system of constraints is meaningful.
- Condition (ii) states that, as long as the machine, program and data parameters satisfy Ci, the code fragment Si produces the same output as S on whichever data makes S execute correctly.
- Condition (iii) states that, as long as S executes correctly on a given set of parameters and data, there exists a code fragment Si, for suitable values of the machine parameters, such that Si produces the same output as S on that set of parameters and data.
- Condition (iv) states that for each resource counter ri (resp. performance counter pi), there exists at least one code fragment Sℓ for which this counter is optimal in the sense that it cannot be further optimized by the optimization strategies from σ(ri) (resp. σ(pi)).

7.1.6 Data-Structures

The algorithm presented in Section 7.1.7 computes a comprehensive optimization of S w.r.t. the evaluation functions f1,..., fs, g1,..., gt and the optimization strategies O1,..., Ow. Hereafter, we define the main data-structure used during the course of the algorithm. We associate S with what we call a quintuple, denoted by Q(S) and defined as follows:

Q(S) = (GC(S), λ(S), ω(S), γ(S), C(S)), where
1. λ(S) is the sequence of the optimization procedures among O1,..., Ow that have already been applied to the IR of S; hence, GC(S) together with λ(S) defines GL(S); initially, λ(S) is empty,
2. ω(S) is the sequence of the optimization procedures among O1,..., Ow that have not been applied so far to either GC(S) or GL(S); initially, ω(S) is O1,..., Ow,
3. γ(S) is the sequence of resource and performance counters that remain to be evaluated on S; initially, γ(S) is r1,..., rs, p1,..., pt,
4. C(S) is the sequence of the constraints (polynomial equations and inequalities) on P1,..., Pt, R1,..., Rs, D1,..., Du, E1,..., Ev that have been computed so far; initially, C(S) is 1 ≥ P1 ≥ 0,..., 1 ≥ Pt ≥ 0, R1 ≥ 0,..., Rs ≥ 0, D1 ≥ 0,..., Du ≥ 0, E1 ≥ 0,..., Ev ≥ 0.
We say that the quintuple Q(S) is processed whenever γ(S) is empty; otherwise, we say that the quintuple Q(S) is in-process.
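For readers who prefer code to notation, the quintuple can be pictured as a small record; the following C++ sketch is an illustrative assumption (the types and names are ours, not the MetaFork implementation), with each sequence kept in a container that the algorithm treats as a stack (see Remark 2).

#include <string>
#include <vector>

// Illustrative stand-ins for the objects manipulated by the algorithm.
struct SourceCFG {};                       // G_C(S)
using Strategy   = std::string;            // one of O_1, ..., O_w
using Counter    = std::string;            // one of r_1..r_s, p_1..p_t
using Constraint = std::string;            // a polynomial equation/inequality

// Q(S) = (G_C(S), lambda(S), omega(S), gamma(S), C(S))
struct Quintuple {
  SourceCFG               sourceCFG; // G_C(S); with lambda it determines G_L(S)
  std::vector<Strategy>   lambda;    // strategies already applied to the IR of S
  std::vector<Strategy>   omega;     // strategies not applied yet
  std::vector<Counter>    gamma;     // counters still to be evaluated
  std::vector<Constraint> C;         // constraints accumulated so far

  bool processed() const { return gamma.empty(); }
};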

Remark 2 For the above Q(S), each of the sequences λ(S), ω(S), γ(S) and C(S) is implemented as a stack in Algorithms 10 and 11. Hence, we need to specify how operations on a sequence are performed on the corresponding stack. Let s1, s2,..., sN be a sequence.
- Popping one element out of this sequence returns s1 and leaves that sequence with s2,..., sN.
- Pushing an element t1 on s1, s2,..., sN updates that sequence to t1, s1, s2,..., sN.
- Pushing a sequence of elements t1, t2,..., tM on s1, s2,..., sN updates that sequence to tM,..., t2, t1, s1, s2,..., sN.

7.1.7 The Algorithm

Algorithm 10 is the top-level procedure. If its input is a processed quintuple Q(S), then it returns the pair (GC(S), λ(S)) (such that, after optimizing S with the optimization strategies in λ(S), one can generate the IR of the optimized S) together with the system of constraints C(S). Otherwise, Algorithm 10 is called recursively on each quintuple returned by Optimize(Q(S)). The pseudo-code of the Optimize routine is given by Algorithm 11.

Algorithm 10: ComprehensiveOptimization(Q(S))
Input: The quintuple Q(S)
Output: A comprehensive optimization of S w.r.t. the resource evaluation functions f1,..., fs, the performance evaluation functions g1,..., gt and the optimization strategies O1,..., Ow
1 if γ(S) is empty then
2   return ((GC(S), λ(S)), C(S));
3 The output stack is initially empty;
4 for each Q(S′) ∈ Optimize(Q(S)) do
5   Push ComprehensiveOptimization(Q(S′)) onto the output stack;

6 return the output stack;

Remark 3 We make a few observations about Algorithm 11.
(R1) Observe that at Line (5), a deep copy of the input Q(S′) is made, and this copy is called Q(S″). This duplication allows the computations to fork. Note that at Line (6), Q(S′) is modified.
(R2) In this forking process, we call Q(S′) the accept branch and Q(S″) the refuse branch. In the former case, the relation 0 ≤ vi ≤ Ri holds, thus implying that enough Ri-resources are available for executing the code fragment S′. In the latter case, the relation Ri < vi holds, thus implying that not enough Ri-resources are available for executing the code fragment S″.
(R3) Observe that vi is either a numerical value, a polynomial in Q[D1,..., Du, E1,..., Ev] or a rational function whose numerator and denominator are in Q[D1,..., Du, E1,..., Ev, R1,..., Rs].
(R4) At Lines (18-20), a similar forking process occurs. Here again, we call Q(S′) the accept branch and Q(S″) the refuse branch. In the former case, the relation 0 ≤ vi ≤ Pi implies that the Pi-performance counter may have reached its maximum ratio; hence, no

Algorithm 11: Optimize
Input: A quintuple Q(S′)
Output: A stack of quintuples
1 Initialize an empty stack, called result;
2 Take out from γ(S′) the next resource or performance counter to be evaluated, say c;
3 Evaluate c on S′ (using the appropriate functions among f1,..., fs, g1,..., gt), thus obtaining a value vi, which can be either a numerical value, a polynomial in Q[D1,..., Du, E1,..., Ev] or a rational function whose numerator and denominator are in Q[D1,..., Du, E1,..., Ev, R1,..., Rs];
4 if c is a resource counter ri then
5   Make a deep copy Q(S″) of Q(S′), since we are going to split the computation into two branches: Ri < vi and 0 ≤ vi ≤ Ri;
6   Add the constraint 0 ≤ vi ≤ Ri to C(S′) and push Q(S′) onto result;
7   Add the constraint Ri < vi to C(S″) and search ω(S″) for an optimization strategy of σ(ri);
8   if no such optimization strategy exists then
9     return result;

10  else
11    Apply such an optimization strategy to Q(S″) yielding Q(S‴);
12    Remove this optimization strategy from ω(S‴);
13    if this optimization strategy is applied to the IR of S″ then
14      Add it to λ(S‴);
15    Push r1,..., ri−1, ri onto γ(S‴);
16    Make a recursive call to Optimize on Q(S‴) and push the returned quintuples onto result;

17 if c is a performance counter pi then
18   Make a deep copy Q(S″) of Q(S′), since we are going to split the computation into two branches: 0 ≤ vi ≤ Pi and Pi < vi ≤ 1;
19   Add the constraint 0 ≤ vi ≤ Pi to C(S′) and push Q(S′) onto result;
20   Add the constraint Pi < vi ≤ 1 to C(S″) and search ω(S″) for an optimization strategy of σ(pi);
21   if no such optimization strategy exists then
22     return result;

23   else
24     Apply such an optimization strategy to Q(S″) yielding Q(S‴);
25     Remove this optimization strategy from ω(S‴);
26     if this optimization strategy is applied to the IR of S″ then
27       Add it to λ(S‴);
28     Push r1,..., rs, pi onto γ(S‴);
29     Make a recursive call to Optimize on Q(S‴) and push the returned quintuples onto result;

30 Remove from result any quintuple with an inconsistent system of constraints;
31 return result;

optimization strategies are applied to improve this counter. In the latter case, the relation Pi < vi ≤ 1 holds, thus implying that the Pi-performance counter has not reached its maximum value; hence, optimization strategies are applied to improve this counter if such optimization strategies are available. Observe that if this optimization strategy does make the estimated value of Pi larger, then an algebraic contradiction arises and the branch is discarded.
(R5) Line (30) in Algorithm 11 requires non-trivial computations with polynomial equations and inequalities. The necessary algorithms can be found in [30] and are implemented in the RegularChains library of Maple.
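The following self-contained C++ sketch illustrates the accept/refuse forking of Algorithm 11 for a resource counter, together with the top-level recursion of Algorithm 10. All names and the string encoding of constraints are assumptions made for the illustration; in particular, constraint consistency checking, performed with RealTriangularize in the actual implementation, is reduced here to a stub.

#include <iostream>
#include <string>
#include <vector>

// Illustrative quintuple: only the pieces needed for this sketch are kept.
struct Quintuple {
  std::vector<std::string> lambda;  // strategies already applied
  std::vector<std::string> omega;   // strategies still available
  std::vector<std::string> gamma;   // counters still to be evaluated
  std::vector<std::string> C;       // accumulated constraints (as text)
};

// Stubs: the real implementation evaluates counters on G_C(S)/G_L(S) and
// checks consistency with the RegularChains library.
std::string evaluateCounter(const Quintuple &, const std::string &counter) {
  return counter == "r_registers" ? "14" : "2*s*B";  // symbolic estimate v_i
}
bool consistent(const std::vector<std::string> &) { return true; }

// One step of Algorithm 11, restricted to resource counters.
std::vector<Quintuple> optimize(Quintuple q) {
  std::vector<Quintuple> result;
  std::string c = q.gamma.front();                   // line 2: next counter
  q.gamma.erase(q.gamma.begin());
  std::string v = evaluateCounter(q, c);             // line 3

  Quintuple refuse = q;                              // line 5: deep copy
  q.C.push_back("0 <= " + v + " <= R(" + c + ")");   // line 6: accept branch
  result.push_back(q);

  refuse.C.push_back("R(" + c + ") < " + v);         // line 7: refuse branch
  if (!refuse.omega.empty()) {                       // lines 10-16
    // pick an available strategy (in the real algorithm, one from sigma(c))
    refuse.lambda.push_back(refuse.omega.back());
    refuse.omega.pop_back();
    refuse.gamma.insert(refuse.gamma.begin(), c);    // re-evaluate the counter
    for (Quintuple &sub : optimize(refuse)) result.push_back(sub);
  }

  std::vector<Quintuple> kept;                       // line 30
  for (Quintuple &r : result)
    if (consistent(r.C)) kept.push_back(r);
  return kept;
}

// Algorithm 10: recurse until every produced quintuple is processed.
void comprehensiveOptimization(const Quintuple &q,
                               std::vector<Quintuple> &leaves) {
  if (q.gamma.empty()) { leaves.push_back(q); return; }
  for (const Quintuple &next : optimize(q))
    comprehensiveOptimization(next, leaves);
}

int main() {
  Quintuple start;
  start.omega = {"granularity_set_to_1", "common_subexpression_elimination"};
  start.gamma = {"r_registers", "r_shared_memory"};
  std::vector<Quintuple> leaves;
  comprehensiveOptimization(start, leaves);
  std::cout << leaves.size() << " optimized variants generated\n";
  return 0;
}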

Remark 4 We make a few remarks about the handling of algebraic computations during the execution of Algorithm 11:
(R6) Each system of algebraic constraints C is updated by adding a polynomial inequality to it at either Line (6), (7), (19) or (20). This incremental process can be performed by the RealTriangularize algorithm [30] as implemented in the RegularChains library.
(R7) Each of these inequalities can be either strict (using >) or non-strict (using ≤); the left-hand side is a polynomial of either Q[D1,..., Du, E1,..., Ev] or Q[D1,..., Du, E1,..., Ev, R1,..., Rs], and the right-hand side is either one of the variables R1,..., Rs or a polynomial of Q[D1,..., Du, E1,..., Ev, R1,..., Rs] times one of the variables P1,..., Pt.
(R8) Because of the recursive calls at Lines (16) and (29), several inequalities involving the same variable among R1,..., Rs, P1,..., Pt may be added to a given system C. As a result, C may become inconsistent, for instance if 10 ≤ R1 and R1 < 10 are both added to the same system C. Note that inconsistency is automatically detected by the RealTriangularize algorithm.
(R9) When using RealTriangularize, variables should be ordered. We choose a variable ordering such that
- any of P1,..., Pt is greater than any of the other variables,
- any of R1,..., Rs is greater than any of D1,..., Du, E1,..., Ev.
(R10) Then, RealTriangularize represents the solution set of C as the union of the solution sets of finitely many regular semi-algebraic systems. Each such regular semi-algebraic system Υ consists of
- polynomial constraints involving D1,..., Du, E1,..., Ev only,
- polynomial constraints involving D1,..., Du, E1,..., Ev, R1,..., Rs only and of positive degree in at least one of R1,..., Rs,
- constraints that are linear in P1,..., Pt and polynomial in D1,..., Du, E1,..., Ev, R1,..., Rs.
Let us denote by ΥD,E, ΥD,E,R and ΥD,E,R,P these three sets of polynomial constraints, respectively. It follows from the properties of regular semi-algebraic systems that
(a) ΥD,E is consistent,
(b) for all real values y1,..., yu, z1,..., zv of D1,..., Du, E1,..., Ev such that (y1,..., yu, z1,..., zv) solves ΥD,E, there exist real values x1,..., xs of R1,..., Rs such that (y1,..., yu, z1,..., zv, x1,..., xs) solves ΥD,E,R and real values h1,..., ht of P1,..., Pt such that (y1,..., yu, z1,..., zv, x1,..., xs, h1,..., ht) solves ΥD,E,R,P.

[Figure: two decision subtrees rooted at a quintuple Q(S^(2k+1)). In each, a deep copy is made and the computation forks: for a resource counter ri, the accept edge carries 0 ≤ vi ≤ Ri and the refuse edge carries Ri < vi; for a performance counter pi, the accept edge carries 0 ≤ vi ≤ Pi and the refuse edge carries Pi < vi ≤ 1, with an optimization strategy O ∈ σ(pi) applied on the refuse side before the process repeats.]
(a) The decision subtree for resource counter ri    (b) The decision subtree for performance counter pi
Figure 7.5: The decision subtree for resource or performance counters

Notation 2 We associate the execution of Algorithm 10, applied to Q(S), with a tree denoted by T (Q(S)), where both the nodes and the edges of T (Q(S)) are labelled. We use the same notations as in Algorithm 11. We define T (Q(S)) recursively as follows:
(T1) We label the root of T (Q(S)) with Q(S).
(T2) If γ(S) is empty, then T (Q(S)) has no children; otherwise, two cases arise:
(T2.1) If no optimization strategy is to be applied for optimizing the counter c, then T (Q(S)) has a single subtree, which is that associated with Optimize(Q(S′)), where Q(S′) is obtained from Q(S) by augmenting C(S) either with 0 ≤ vi ≤ Ri if c is a resource counter, or with 0 ≤ vi ≤ Pi otherwise.
(T2.2) If an optimization strategy is applied, then T (Q(S)) has two subtrees:
(T2.2.1) The first one is the tree associated with Optimize(Q(S′)) (where Q(S′) is defined as above) and is connected to its parent node by the accept edge, labelled with either 0 ≤ vi ≤ Ri or 0 ≤ vi ≤ Pi; see Figure 7.5.
(T2.2.2) The second one is the tree associated with Optimize(Q(S‴)) (where Q(S‴) is obtained by applying the optimization strategy to the deep copy of the input quintuple Q(S)) and is connected to its parent node by the refuse edge, labelled with either Ri < vi or Pi < vi ≤ 1; see Figure 7.5.
Observe that every node of T (Q(S)) is labelled with a quintuple and every edge is labelled with an inequality constraint.

Remark 5 Figure 7.5 illustrates how Algorithm 11, applied to Q(S′), generates the associated tree T (Q(S′)). The cases for a resource counter and a performance counter are distinguished in the sub-figures (a) and (b), respectively. Observe that, in both cases, the accept edges go

south-east, while the refuse edges go south-west.

Lemma 7.1.1 The height of the tree T (Q(S)) is at most w(s + t). Therefore, Algorithm 10 terminates.

Proof Consider a path Γ from the root of T (Q(S)) to any node N of T (Q(S)). Observe that Γ counts at most w refuse edges. Indeed, following a refuse edge decreases by one the number of optimization strategies to be used. Observe also that the length of every sequence of consecutive accept edges is at most s + t. Indeed, following an accept edge decreases by one the number of resource and performance counters to be evaluated. Therefore, the number of edges in Γ is at most w (s + t).
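As a worked instance of this bound, consider the implementation described later in Section 7.3: it uses s = 2 resource counters, t = 0 performance counters and w = 4 optimization strategies, so the height of T (Q(S)) is at most w(s + t) = 4 × 2 = 8.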

Lemma 7.1.2 Let U := {U1,..., Uz} be a subset of {O1,..., Ow}. There exists a path from the root of T (Q(S)) to a leaf of T (Q(S)) along which the optimization strategies being applied are exactly those of U.

Proof Let us start at the root of T (Q(S)) and apply the following procedure:

1. follow the refuse edge if it uses an optimization strategy from {U1,..., Uz},
2. follow the accept edge, otherwise.
This creates a path from the root of T (Q(S)) to a leaf with the desired property.

Definition 3 Let i ∈ {1,..., s} (resp. {1,..., t}). Let N be a node of T (Q(S)) and Q(SN) be the quintuple labelling this node. We say that ri (resp. pi) is optimal at N w.r.t. the evaluation function fi (resp. gi) and the subset σ(ri) (resp. σ(pi)) of the optimization strategies O1,..., Ow, whenever for all O ∈ σ(ri) (resp. σ(pi)) we have fi(O(SN)) = fi(SN) (resp. gi(O(SN)) = gi(SN)).

Lemma 7.1.3 Let i ∈ {1,..., s} (resp. {1,..., t}). There exists at least one leaf L of T (Q(S)) such that ri (resp. pi) is optimal at L w.r.t. the evaluation function fi (resp. gi) and the subset σ(ri) (resp. σ(pi)) of the optimization strategies O1,..., Ow.

Proof Apply Lemma 7.1.2 with U = σ(ri) (resp. U = σ(pi)).

Lemma 7.1.4 Algorithm 10 satisfies its output specifications.

Proof From Lemma 7.1.1, we know that Algorithm 10 terminates. So let (C1, S1),..., (Ce, Se) be its output. We shall prove that (C1, S1),..., (Ce, Se) satisfies conditions (i) to (iv) of Definition 2. Condition (i) is satisfied by the properties of the RealTriangularize algorithm. Condition (ii) follows clearly from the assumption that the code transformations performed by O1,..., Ow preserve semantics. Observe that each time a polynomial inequality is added to a system of constraints, the negation of this inequality is also added to the same system in another branch of the computations. By a simple induction on s + t, we deduce that Condition (iii) is satisfied. Finally, we prove Condition (iv) by using Lemma 7.1.3.

7.2 Comprehensive Translation of an Annotated C Program into CUDA Kernels

Given a high-level model for accelerator programming (like OpenCL [141], OpenMP, OpenACC or MetaFork), we consider the problem of translating a program written for such a high-level model into a programming model for GPGPU devices, such as CUDA. We assume that the numerical values of some, or all, of the hardware characteristics of the targeted GPGPU device are unknown. Hence, these quantities are treated as symbols. Similarly, we would like some, or all, of the program parameters to remain symbols in the generated code.
In our implementation, we focus on one high-level model for accelerator programming, namely MetaFork. However, we believe that an adaptation to another high-level model for accelerator programming would not be difficult. One supporting reason for that claim is the fact that automatic code translation between the MetaFork and OpenMP languages can already be done within the MetaFork compilation framework, see [35].
The hardware characteristics of the GPGPU device can be the maximum number of registers to be allocated per thread in a thread-block and the maximum number of shared memory words to be allocated per thread in a thread-block. Similarly, the program parameters can be the number of threads per thread-block and the granularity of threads. For the generated code to be valid, hardware characteristics and program parameters need to satisfy constraints in the form of polynomial equations and inequalities. Moreover, applying code transformations (like optimization techniques) requires a case distinction based on the values of those symbols, as we saw with the example in the introduction.
In Section 7.2.1, we specify the required properties of the input code fragment S from the given MetaFork program, so that the comprehensive optimization algorithm, demonstrated in Section 7.1, can handle this MetaFork program. Section 7.2.2 discusses the procedure of comprehensive translation of the MetaFork program into parametric CUDA kernels, which yields the definition of comprehensive parametric CUDA kernels.

7.2.1 Input MetaFork Code Fragment

Consider a meta schedule statement M and its surrounding MetaFork program P. In this process of code analysis and transformation, we focus on the meta schedule statement M and assume that the rest of the program P is serial C code. Hence, our examples, like the matrix vector multiplication and matrix addition examples, consist simply of a meta schedule statement M together with a few (possibly none) statements located before M and initializing variables used in M.
Consider the meta schedule statement M, that is, a statement of the form

meta schedule A
where A is a compound statement of the form {A0 A1 ··· Aℓ} and each of A0, A1,..., Aℓ is a for-loop nest, such that:
1. each for-loop nest contains 2 or 4 meta for loops; hence, it can be executed in a parallel fashion,

2. the body of the innermost loop can be any valid sequence of C statements; in particular, such a statement can be a for-loop.
In practice, a parameter (in the sense of Definition 1) of the meta schedule statement M is either a data parameter (that is, related to the data being processed, like a number of rows or columns in a matrix) or a program parameter (that is, related to the division of the work among the threads executing the parallel for-loops). In Example 2, the variable N is a data parameter, whereas the variables dim and B are the program parameters.
Moreover, for the sake of clarity, we shall assume that the meta schedule statement M counts a single meta for loop nest A. Extending the present section to the case where M counts several meta for loop nests can be done by existing techniques, as we briefly explain now. Indeed, each meta for loop nest can be handled separately. Then, "merging" the corresponding results can be done by techniques from symbolic computation, see [31, 32]. Therefore, we consider the serial elision (as defined in [35]) of A in M as the code fragment S. Turning our attention back to Example 2, Figure 7.6 shows the serial elision of the MetaFork program on the right-hand portion of Figure 7.4.

int dim = N / B, v, u;
// v corresponds to the thread-block index
// u corresponds to the thread index in a thread-block

// The following code is the serial elision
for (int i = 0; i < dim; ++i)
  for (int j = 0; j < B; ++j) {
    int p = v * B + u;
    int q = i * B + j;
    c[p] += a[p][q] * b[q];
  }

Figure 7.6: The serial elision of the MetaFork program for matrix vector multiplication

7.2.2 Comprehensive Translation into Parametric CUDA Kernels

Now, applying the comprehensive optimization algorithm (described in Section 7.1) to the serial elision S of the meta schedule statement M (with prescribed resource evaluation functions, performance evaluation functions and optimization strategies), we obtain a sequence of processed quintuples of meta schedule statements Q1(M), Q2(M),..., Qℓ(M), which forms a comprehensive optimization in the sense of Definition 2.
If, as mentioned in the introduction of this chapter, PTX is used as the intermediate representation (IR), then, for each i = 1, . . . , ℓ, under the constraints defined by the polynomial system associated with Qi(M), the IR code associated with Qi(M) is the translation into assembly language of a CUDA counterpart of M. In our implementation, we also translate the MetaFork code in each Qi(M) to CUDA source code, since this is easier for a human being to read.
Therefore, in broad terms, a comprehensive translation of the meta schedule statement M into parametric CUDA kernels is a decision tree, where each edge holds a Boolean expres-

sion (given by a polynomial constraint) and each leaf is either a CUDA program in PTX form, or the symbol ∅, such that for each leaf K, with K ≠ ∅, we have:
1. K works correctly under the conjunction of the Boolean expressions located between the root node and the leaf, and
2. K is semantically equivalent to P.
The symbol ∅ is used to denote a situation (in fact, value ranges for the machine and program parameters) where no CUDA program equivalent to P is provided.

7.3 Implementation Details

In this section, we present a preliminary implementation of the comprehensive optimization algorithm demonstrated in Section 7.1. This implementation takes a given MetaFork program as input and is dedicated to the optimization of meta schedule statements in view of generating parametric CUDA kernels. For the algorithm stated in Section 7.1.7 to satisfy its specifications, one should use the PTX language for the IR. However, for simplicity, in our proof-of-concept implementation here, we use the IR of the LLVM compiler infrastructure, since the MetaFork compilation framework is based on Clang.
Two hardware resource counters are considered: register usage per thread and local/shared memory allocated per thread-block. No performance counters are specified; however, by design, the algorithm tries to minimize the usage of hardware resources. Four optimization strategies are used: (i) reducing register pressure, (ii) controlling thread granularity, (iii) common sub-expression elimination (CSE), and (iv) caching2 data in local/shared memory [101]. Details are given hereafter. Figure 7.7 gives an overview of the software tools that are used for our implementation. Appendix C shows the implemented algorithms with these two resource counters and these three optimization strategies.

[Figure: toolchain overview — the MetaFork source code is parsed by Clang into a source CFG, which is converted to a CFG in Maple where CSE and thread granularity control are applied; LLVM produces the IR CFG used to estimate register pressure; PET and the RegularChains library are used to compute the cache amount.]

Figure 7.7: The software tools involved for the implementation

2 In the MetaFork language, the keyword cache is used to indicate that every thread accessing a specified array a must copy in local/shared memory the data it accesses in a.

Conversion between source code and CFG. Clang is used as the front-end for parsing the MetaFork source code and generating the source CFG. The latter is converted into a Maple DAG in order to take advantage of Maple's capabilities for algebraic computation. Conversely, given a Maple version of a CFG, we can generate the corresponding MetaFork code by traversing this CFG.
Register pressure. Given the MetaFork source code, we use LLVM to generate the low-level machine instructions in the intermediate representation (IR), which are in static single assignment (SSA) form [44]. A benefit of using the SSA form is that one can calculate the liveness sets3 [152] without data flow analysis [152]. Once the lifetime information is computed, we use the classical linear scan algorithm [119] to estimate the register usage.
Thread granularity. A common method to improve arithmetic intensity and instruction level parallelism (ILP) is to control the granularity of threads [146], that is, to have each thread compute more than one of the output results. One can achieve this goal by adding a serial for loop within a thread. However, this method may increase the register usage as well as the amount of required shared memory (see the example in Figure 7.9). In the case that adding the granularity loop causes the needed resources to exceed the hardware limits, our algorithm applies the optimization strategy "Granularity set to 1" to remove that serial loop from the generated kernel code. We implement this strategy during the translation phase from the CFG to the MetaFork code by not generating this loop.
Common sub-expression elimination (CSE). For each basic block in the CFG built from the MetaFork source code, we consider the basic blocks with more than one statement. Then, we use the codegen[optimize] package of Maple, such that CSE is applied to those statements and a sequence of new statements is generated. Finally, we update each basic block with those new statements. Moreover, this optimization technique has two levels: one using Maple's default CSE algorithm and the other using the try-harder option of codegen[optimize].
Cache amount. We take advantage of the PET (polyhedral extraction tool) [148] to collect information related to the index expression of an array: occurring program parameters (defined as in Section 7.2), loop iteration counters and inequalities (that give the lower and upper bounds of those loop iteration counters). We now illustrate, with an example, how to compute the number of words that a thread-block requires. Consider the MetaFork program with the granularity loop for reversing a 1D array, as shown on the left-hand portion of Figure 7.8. Note that, w.r.t. the MetaFork code, the iteration counters i, j and k of for- (and meta for-) loops are counted as neither program nor data parameters; thus, we shall consider them as bounded variables. However, in order to calculate the number of words required per thread-block in the RegularChains library of Maple, we treat the iteration counter i, which indicates the thread-block index of the CUDA kernel, as a program parameter. The call to the function ValueRangeWithConstraintsAndParameters of the RegularChains library, as shown on the right-hand portion of Figure 7.8, is used to calculate the required words per thread-block for the vector c. As a result, the range [1, s*B+1] is returned, from which we can determine that the vector c accesses s*B words per thread-block.
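As an illustration of the register-pressure estimate described above, the sketch below computes a simple lower bound on register usage from SSA live intervals, namely the maximum number of simultaneously live values. This is a simplification of the linear scan allocator cited above, and all names are assumptions made for the example rather than MetaFork implementation code.

#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

// A live interval [start, end) for one SSA value, in instruction indices.
struct LiveInterval { int start; int end; };

// Estimate register pressure as the maximum number of values live at once.
// (The actual implementation uses the classical linear scan algorithm.)
int estimateRegisterPressure(const std::vector<LiveInterval> &intervals) {
  std::vector<std::pair<int, int>> events; // (position, +1 start / -1 end)
  for (const LiveInterval &iv : intervals) {
    events.push_back({iv.start, +1});
    events.push_back({iv.end, -1});
  }
  // Process ends before starts at the same position: [a,b) and [b,c) can share.
  std::sort(events.begin(), events.end(),
            [](const std::pair<int, int> &x, const std::pair<int, int> &y) {
              return x.first != y.first ? x.first < y.first
                                        : x.second < y.second;
            });
  int live = 0, peak = 0;
  for (const std::pair<int, int> &e : events) {
    live += e.second;
    peak = std::max(peak, live);
  }
  return peak;
}

int main() {
  // Toy SSA lifetimes: three values overlap around instruction 4.
  std::vector<LiveInterval> ivs = {{0, 5}, {2, 8}, {4, 6}, {7, 9}};
  std::printf("estimated register pressure: %d\n",
              estimateRegisterPressure(ivs));
  return 0;
}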

3 https://en.wikipedia.org/wiki/Live_variable_analysis

// Data parameters: a, c, N
// Program parameters: B, s
int dim = N / (B * s);
meta_schedule {
  meta_for (int i = 0; i < dim; i++)
    meta_for (int j = 0; j < B; j++)
      for (int k = 0; k < s; ++k) {
        int x = i * B * s + k * B + j;
        int y = N - 1 - x;
        c[y] = a[x];
      }
}

lowerBound := [];
upperBound := [];
fixedVars := [];
boundedVars := [j, k];
params := [i, s, B, N];
x := ((((i)*(B))*(s))+((k)*(B)))+(j);
y := ((N)-(1))-(x);
S := [ j >= 0, j <= -1 + B, k >= 0, k <= -1 + s ];
i0 := y;
bounds := ValueRangeWithConstraintsAndParameters
          (i0, S, lowerBound, upperBound, fixedVars,
           boundedVars, params);

(a) MetaFork program with the granularity    (b) Function call to the RegularChains library
Figure 7.8: Computing the amount of words required per thread-block for reversing a 1D array

7.4 Experimentation

We present experimental results for the implementation described in Section 7.3. We consider six simple test cases: array reversal, matrix vector multiplication, 1D Jacobi, matrix addition, matrix transpose and matrix matrix multiplication. Recall that we consider two machine parameters: the amount ZB of shared memory per streaming multiprocessor (SM) and the maximum number RB of registers per thread in a thread-block.
Three scenarios of optimized MetaFork programs, based on different systems of constraints, are generated by our implementation of the comprehensive optimization algorithm. The first case of optimized MetaFork programs uses the shared memory and a granularity parameter s. The second case uses the shared memory but sets the granularity parameter s to 1. The third case removes the cache4 keyword and sets s to 1. In this latter case, the amount of words read and written per thread-block exceeds the maximum amount ZB of shared memory per SM. However, the cache keyword is not implemented in the MetaFork compilation framework yet, so we process this keyword by manual editing. For each of these optimization strategies, we use the shortened codes shown in Table 7.1.

Table 7.1: Optimization strategies with their codes

Strategy name                   Its code
“Accept register pressure”      (1)
“CSE applied”                   (2)
“No granularity reduction”      (3a)
“Granularity set to 1”          (3b)
“Accept caching”                (4a)
“Refuse caching”                (4b)

Array reversal. For the first case, our comprehensive optimization algorithm applies the optimization strategy codes (1) (4a) (3a) (2) (2) to the source code and generates the optimized MetaFork code shown in Figure 7.9. Applying the optimization strategy codes in a sequence either (1) (3b) (4a) (3a) (2) (2) or (2) (2) (3b) (1) (4a) (3a), the second case generates the optimized MetaFork code shown in Figure 7.10.

⁴ In the MetaFork language, the keyword cache indicates that every thread accessing a specified array a must copy into local/shared memory the data it accesses in a.

int N, s, B, dim = N/(s*B);
int a[N], c[N];
meta_schedule cache(a, c) {
  meta_for (int i = 0; i < dim; i++)
    meta_for (int j = 0; j < B; j++)
      for (int k = 0; k < s; ++k) {
        int x = (i*s+k)*B+j;
        int y = N-1-x;
        c[y] = a[x];
      }
}

Constraints: 2sB ≤ ZB, 4 ≤ RB; strategies (1) (4a) (3a) (2) (2) applied.

Figure 7.9: The first case of the optimized MetaFork code for array reversal

int N, s = 1, B, dim = N/(s*B);
int a[N], c[N];
meta_schedule cache(a, c) {
  meta_for (int i = 0; i < dim; i++)
    meta_for (int j = 0; j < B; j++) {
      int x = i*B+j;
      int y = N-1-x;
      c[y] = a[x];
    }
}

Constraints (one of):
  2B ≤ ZB < 2sB, 3 ≤ RB; strategies (1) (3b) (4a) (3a) (2) (2) applied
  2B ≤ ZB < 2sB, 3 ≤ RB < 4; strategies (2) (2) (3b) (1) (4a) (3a) applied

Figure 7.10: The second case of the optimized MetaFork code for array reversal

Applying optimization strategy codes in a sequence either (1) (3b) (2) (2) (4b) or (2) (2) (3b) (1) (4b), the third case generates the optimized MetaFork code shown in Figure 7.11.

int N, s = 1, B, dim = N/(s*B);
int a[N], c[N];
meta_schedule {
  meta_for (int i = 0; i < dim; i++)
    meta_for (int j = 0; j < B; j++) {
      int x = i*B+j;
      int y = N-1-x;
      c[y] = a[x];
    }
}

Constraints (one of):
  ZB < 2B, 3 ≤ RB; strategies (1) (3b) (2) (2) (4b) applied
  ZB < 2B, 3 ≤ RB < 4; strategies (2) (2) (3b) (1) (4b) applied

Figure 7.11: The third case of the optimized MetaFork code for array reversal
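To illustrate how such a case distinction could be consumed once the target machine is known, the sketch below selects one of the three array-reversal variants of Figures 7.9, 7.10 and 7.11 from the values of ZB, RB, s and B. The function name, the enumeration and the ordering of the tests are hypothetical and simplified relative to the exact constraint systems; they are not part of the MetaFork framework.

    // Hypothetical install-time selection among the three generated variants,
    // driven (in simplified form) by the constraint systems of Figures 7.9-7.11.
    enum class Variant { CachedWithGranularity,    // Figure 7.9
                         CachedGranularityOne,     // Figure 7.10
                         UncachedGranularityOne }; // Figure 7.11

    Variant chooseArrayReversalVariant(long ZB,  // shared memory per SM, in words
                                       long RB,  // registers available per thread
                                       long s, long B) {
        if (2 * s * B <= ZB && RB >= 4)
            return Variant::CachedWithGranularity;  // constraints of Figure 7.9
        if (2 * B <= ZB && RB >= 3)
            return Variant::CachedGranularityOne;   // constraints of Figure 7.10
        return Variant::UncachedGranularityOne;     // fallback case, Figure 7.11
    }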

Matrix vector multiplication. Applying the optimization strategy codes in a sequence either (1) (4a) (3a) (2) (2) or (2) (1) (4a) (3a) (2), the first case generates the optimized MetaFork code shown in Figure 7.12. Applying the optimization strategy codes in a sequence either (1) (3b) (4a) (3a) (2) (2), (2) (1) (3b) (4a) (3a) (2) or (2) (2) (3b) (1) (4a) (3a), the second case generates the optimized MetaFork code shown in Figure 7.13. Applying the optimization strategy codes in a sequence either (1) (3b) (2) (2) (4b), (2) (1)

int N, s, B, dim0 = N/(s*B), dim1 = N/B;
int a[N][N], b[N], c[N];
meta_schedule cache(a, b, c) {
  meta_for (int v = 0; v < dim0; v++)
    for (int i = 0; i < dim1; i++)
      meta_for (int u = 0; u < B; u++)
        for (int j = 0; j < B; ++j)
          for (int k = 0; k < s; ++k) {
            int p = (v*s+k)*B+u;
            int q = i*B+j;
            c[p] = a[p][q]*b[q]+c[p];
          }
}

Constraints (one of):
  sB² + sB + B ≤ ZB, 8 ≤ RB; strategies (1) (4a) (3a) (2) (2) applied
  sB² + sB + B ≤ ZB, 8 ≤ RB < 9; strategies (2) (1) (4a) (3a) (2) applied

Figure 7.12: The first case of the optimized MetaFork code for matrix vector multiplication

int N, s = 1, B, dim0 = N/(s*B), dim1 = N/B;
int a[N][N], b[N], c[N];
meta_schedule cache(a, b, c) {
  meta_for (int v = 0; v < dim0; v++)
    for (int i = 0; i < dim1; i++)
      meta_for (int u = 0; u < B; u++)
        for (int j = 0; j < B; ++j) {
          int p = v*B+u;
          int q = i*B+j;
          c[p] = a[p][q]*b[q]+c[p];
        }
}

Constraints (one of):
  B² + 2B ≤ ZB < sB² + sB + B, 7 ≤ RB; strategies (1) (3b) (4a) (3a) (2) (2) applied
  B² + 2B ≤ ZB < sB² + sB + B, 7 ≤ RB < 9; strategies (2) (1) (3b) (4a) (3a) (2) applied
  B² + 2B ≤ ZB < sB² + sB + B, 7 ≤ RB < 8; strategies (2) (2) (3b) (1) (4a) (3a) applied

Figure 7.13: The second case of the optimized MetaFork code for matrix vector multiplication

(3b) (2) (4b) or (2) (2) (3b) (1) (4b), the third case generates the optimized MetaFork code shown in Figure 7.14.

1D Jacobi. Given the 1D Jacobi source code written in MetaFork, shown in Figure 7.15, the CSE strategy is applied successfully in all cases of optimized MetaFork programs. This example requires post-processing to calculate the total amount of required shared memory per thread-block, due to the fact that array a has multiple accesses and each access has a different index. Applying the optimization strategy codes in a sequence (1) (4a) (3a) (2) (2), the first case generates the optimized MetaFork code shown in Figure 7.16. Applying the optimization strategy codes in a sequence (1) (3b) (4a) (3a) (2) (2), the second case generates the optimized MetaFork code shown in Figure 7.17. Applying the optimization strategy codes in a sequence (1) (3b) (2) (2) (4b), the third case generates the optimized MetaFork code shown in Figure 7.18.

Matrix addition. Due to a limitation in the codegen[optimize] package of Maple, the CSE optimizer cannot handle a two-dimensional array on the left-hand side of assignments.

int N, s = 1, B, dim0 = N/(s*B), dim1 = N/B;
int a[N][N], b[N], c[N];
meta_schedule {
  meta_for (int v = 0; v < dim0; v++)
    for (int i = 0; i < dim1; i++)
      meta_for (int u = 0; u < B; u++)
        for (int j = 0; j < B; ++j) {
          int p = v*B+u;
          int q = i*B+j;
          c[p] = a[p][q]*b[q]+c[p];
        }
}

Constraints (one of):
  ZB < B² + 2B, 7 ≤ RB; strategies (1) (3b) (2) (2) (4b) applied
  ZB < B² + 2B, 7 ≤ RB < 9; strategies (2) (1) (3b) (2) (4b) applied
  ZB < B² + 2B, 7 ≤ RB < 8; strategies (2) (2) (3b) (1) (4b) applied

Figure 7.14: The third case of the optimized MetaFork code for matrix vector multiplication

int T, N, s, B, dim = (N-2)/(s*B);
int a[2*N];
for (int t = 0; t < T; ++t)
  meta_schedule {
    meta_for (int i = 0; i < dim; i++)
      meta_for (int j = 0; j < B; j++)
        for (int k = 0; k < s; ++k) {
          int p = i * s * B + k * B + j;
          int p1 = p + 1;
          int p2 = p + 2;
          int np = N + p;
          int np1 = N + p + 1;
          int np2 = N + p + 2;
          if (t % 2)
            a[p1] = (a[np] + a[np1] + a[np2]) / 3;
          else
            a[np1] = (a[p] + a[p1] + a[p2]) / 3;
        }
  }

Figure 7.15: The MetaFork source code for 1D Jacobi

Thus, we use a one-dimensional array to represent the output matrix. Applying the optimization strategy codes in a sequence (1) (4a) (3a) (2) (2), the first case generates the optimized MetaFork code shown in Figure 7.19. Applying the optimization strategy codes in a sequence either (1) (3b) (4a) (3a) (2) (2) or (2) (2) (3b) (1) (4a) (3a), the second case generates the optimized MetaFork code shown in Figure 7.20. Applying the optimization strategy codes in a sequence either (1) (3b) (2) (2) (4b) or (2) (2) (3b) (1) (4b), the third case generates the optimized MetaFork code shown in Figure 7.21.

Matrix transpose. Applying the optimization strategy codes in a sequence (1) (4a) (3a)

int T, N, s, B, dim = (N-2)/(s*B);
int a[2*N];
for (int t = 0; t < T; ++t)
  meta_schedule cache(a) {
    meta_for (int i = 0; i < dim; i++)
      meta_for (int j = 0; j < B; j++)
        for (int k = 0; k < s; ++k) {
          int p = j+(i*s+k)*B;
          int t16 = p+1;
          int t15 = p+2;
          int p1 = t16;
          int p2 = t15;
          int np = N+p;
          int np1 = N+t16;
          int np2 = N+t15;
          if (t % 2)
            a[p1] = (a[np]+a[np1]+a[np2])/3;
          else
            a[np1] = (a[p]+a[p1]+a[p2])/3;
        }
  }

Constraints: 2sB + 2 ≤ ZB, 9 ≤ RB; strategies (1) (4a) (3a) (2) (2) applied.

Figure 7.16: The first case of the optimized MetaFork code for 1D Jacobi
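One plausible accounting, consistent with the constraint shown in Figure 7.16 and given here only as our own reconstruction of the post-processing step, is the following: within a thread-block, p takes sB consecutive values, so the three reads of a in either branch cover sB + 2 consecutive words while the single write covers another sB words, whence the requirement

    (sB + 2) + sB = 2sB + 2 ≤ ZB.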

int T, N, s = 1, B, dim = (N-2)/(s*B);
int a[2*N];
for (int t = 0; t < T; ++t)
  meta_schedule cache(a) {
    meta_for (int i = 0; i < dim; i++)
      meta_for (int j = 0; j < B; j++) {
        int p = i*B+j;
        int t20 = p+1;
        int t19 = p+2;
        int p1 = t20;
        int p2 = t19;
        int np = N+p;
        int np2 = N+t19;
        int np1 = N+t20;
        if (t % 2)
          a[p1] = (a[np]+a[np1]+a[np2])/3;
        else
          a[np1] = (a[p]+a[p1]+a[p2])/3;
      }
  }

Constraints: 2B + 2 ≤ ZB < 2sB + 2, 9 ≤ RB; strategies (1) (3b) (4a) (3a) (2) (2) applied.

Figure 7.17: The second case of the optimized MetaFork code for 1D Jacobi

(2) (2), the first case generates the optimized MetaFork code shown in Figure 7.22. Applying the optimization strategy codes in a sequence either (1) (3b) (4a) (3a) (2) (2) or (2) (2) (3b) (1) (4a) (3a), the second case generates the optimized MetaFork code shown in Figure 7.23. Applying the optimization strategy codes in a sequence either (1) (3b) (2) (2) (4b) or (2) (2) (3b) (1) (4b), the third case generates the optimized MetaFork code shown in Figure 7.24.

int T, N, s = 1, B, dim = (N-2)/(s*B);
int a[2*N];
for (int t = 0; t < T; ++t)
  meta_schedule {
    meta_for (int i = 0; i < dim; i++)
      meta_for (int j = 0; j < B; j++) {
        int p = j+i*B;
        int t16 = p+1;
        int t15 = p+2;
        int p1 = t16;
        int p2 = t15;
        int np = N+p;
        int np1 = N+t16;
        int np2 = N+t15;
        if (t % 2)
          a[p1] = (a[np]+a[np1]+a[np2])/3;
        else
          a[np1] = (a[p]+a[p1]+a[p2])/3;
      }
  }

Constraints: ZB < 2B + 2, 9 ≤ RB; strategies (1) (3b) (2) (2) (4b) applied.

Figure 7.18: The third case of the optimized MetaFork code for 1D Jacobi

int N, B0, B1, s, dim0 = N/B0, dim1 = N/(B1*s);
int a[N][N], b[N][N], c[N*N];

meta_schedule cache(a, b, c) {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      meta_for (int u0 = 0; u0 < B0; u0++)
        meta_for (int u1 = 0; u1 < B1; u1++)
          for (int k = 0; k < s; ++k) {
            int i = v0*B0+u0;
            int j = (v1*s+k)*B1+u1;
            c[i*N+j] = a[i][j] + b[i][j];
          }
}

Constraints: 3sB0B1 ≤ ZB, 7 ≤ RB; strategies (1) (4a) (3a) (2) (2) applied.

Figure 7.19: The first case of the optimized MetaFork code for matrix addition

Matrix matrix multiplication. Applying the optimization strategy codes in a sequence (1) (4a) (3a) (2) (2), the first case generates the optimized MetaFork code shown in Figure 7.25.

Applying the optimization strategy codes in a sequence either (1) (3b) (4a) (3a) (2) (2) or (2) (2) (3b) (1) (4a) (3a), the second case generates the optimized MetaFork code shown in Figure 7.26.

Applying the optimization strategy codes in a sequence either (1) (3b) (2) (2) (4b) or (2) (2) (3b) (1) (4b), the third case generates the optimized MetaFork code shown in Figure 7.27.

int N, B0, B1, s = 1, dim0 = N/B0, dim1 = N/(B1*s);
int a[N][N], b[N][N], c[N*N];

meta_schedule cache(a, b, c) {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      meta_for (int u0 = 0; u0 < B0; u0++)
        meta_for (int u1 = 0; u1 < B1; u1++) {
          int i = v0*B0+u0;
          int j = v1*B1+u1;
          c[i*N+j] = a[i][j] + b[i][j];
        }
}

Constraints (one of):
  3B0B1 ≤ ZB < 3sB0B1, 6 ≤ RB; strategies (1) (3b) (4a) (3a) (2) (2) applied
  3B0B1 ≤ ZB < 3sB0B1, 6 ≤ RB < 7; strategies (2) (2) (3b) (1) (4a) (3a) applied

Figure 7.20: The second case of the optimized MetaFork code for matrix addition

int N, B0, B1, s = 1, dim0 = N/B0, dim1 = N/(B1*s);
int a[N][N], b[N][N], c[N*N];

meta_schedule {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      meta_for (int u0 = 0; u0 < B0; u0++)
        meta_for (int u1 = 0; u1 < B1; u1++) {
          int i = v0*B0+u0;
          int j = v1*B1+u1;
          c[i*N+j] = a[i][j] + b[i][j];
        }
}

Constraints (one of):
  ZB < 3B0B1, 6 ≤ RB; strategies (1) (3b) (2) (2) (4b) applied
  ZB < 3B0B1, 6 ≤ RB < 7; strategies (2) (2) (3b) (1) (4b) applied

Figure 7.21: The third case of the optimized MetaFork code for matrix addition

int N, B0, B1, s, dim0 = N/B0, dim1 = N/(B1*s);
int a[N][N], c[N*N];

meta_schedule cache(a, c) {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      meta_for (int u0 = 0; u0 < B0; u0++)
        meta_for (int u1 = 0; u1 < B1; u1++)
          for (int k = 0; k < s; ++k) {
            int i = v0*B0+u0;
            int j = (v1*s+k)*B1+u1;
            c[j*N+i] = a[i][j];
          }
}

Constraints: 2sB0B1 ≤ ZB, 6 ≤ RB; strategies (1) (4a) (3a) (2) (2) applied.

Figure 7.22: The first case of the optimized MetaFork code for matrix transpose

7.5 Conclusion

In this chapter, we proposed a comprehensive optimization algorithm that optimizes the input code fragment depending on unknown machine and program parameters; meanwhile, we realized a proof-of-concept implementation for generating comprehensive parametric MetaFork programs, in the form of a case distinction based on the possible values of the machine and program parameters.

int N, B0, B1, s = 1, dim0 = N/B0, dim1 = N/(B1*s);
int a[N][N], c[N*N];

meta_schedule cache(a, c) {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      meta_for (int u0 = 0; u0 < B0; u0++)
        meta_for (int u1 = 0; u1 < B1; u1++) {
          int i = v0*B0+u0;
          int j = v1*B1+u1;
          c[j*N+i] = a[i][j];
        }
}

Constraints (one of):
  2B0B1 ≤ ZB < 2sB0B1, 5 ≤ RB; strategies (1) (3b) (4a) (3a) (2) (2) applied
  2B0B1 ≤ ZB < 2sB0B1, 5 ≤ RB < 6; strategies (2) (2) (3b) (1) (4a) (3a) applied

Figure 7.23: The second case of the optimized MetaFork code for matrix transpose

int N, B0, B1, s = 1, dim0 = N/B0, dim1 = N/(B1*s);
int a[N][N], c[N*N];

meta_schedule {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      meta_for (int u0 = 0; u0 < B0; u0++)
        meta_for (int u1 = 0; u1 < B1; u1++) {
          int i = v0*B0+u0;
          int j = v1*B1+u1;
          c[j*N+i] = a[i][j];
        }
}

Constraints (one of):
  ZB < 2B0B1, 5 ≤ RB; strategies (1) (3b) (2) (2) (4b) applied
  ZB < 2B0B1, 5 ≤ RB < 6; strategies (2) (2) (3b) (1) (4b) applied

Figure 7.24: The third case of the optimized MetaFork code for matrix transpose

The comprehensive optimization algorithm that we proposed takes optimization strategies, resource counters and performance counters into account; we implemented two resource counters and four optimization strategies in this comprehensive optimization algorithm.

With this preliminary implementation, experimentation shows that, given a MetaFork program, three scenarios of optimized MetaFork programs are generated, each of them with a system of constraints specifying when the corresponding code is valid.

In addition, from the experimental results, we observe that different sequences of the optimization strategies yield the same optimized MetaFork program. However, since some optimization strategies are applied to the intermediate representation of the source code, the corresponding improvements are not shown in Section 7.4.

int N, B0, B1, s, dim1 = N/(B1*s);
int dim0 = N/B0, B = min(B0, B1), dim = N/B;
int a[N][N], b[N][N], c[N*N];

meta_schedule cache(a, b, c) {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      for (int w = 0; w < dim; w++)
        meta_for (int u0 = 0; u0 < B0; u0++)
          meta_for (int u1 = 0; u1 < B1; u1++)
            for (int k = 0; k < s; ++k) {
              int i = v0*B0+u0;
              int j = (v1*s+k)*B1+u1;
              for (int z = 0; z < B; z++) {
                int p = B*w+z;
                c[i*N+j] = c[i*N+j] + a[i][p] * b[p][j];
              }
            }
}

Constraints: sB0B1 + sBB1 + B0B ≤ ZB, 9 ≤ RB; strategies (1) (4a) (3a) (2) (2) applied.

Figure 7.25: The first case of the optimized MetaFork code for matrix matrix multiplication
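A plausible accounting for the left-hand side of the constraint in Figure 7.25 (again our own reconstruction, not a trace of the implementation) is the following: for a fixed value of the serial counter w, the cached tile of c covers B0 × (sB1) words, the tile of b covers B × (sB1) words and the tile of a covers B0 × B words, which gives

    sB0B1 + sBB1 + B0B ≤ ZB.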

int N, B0, B1, s = 1, dim1 = N/(B1*s);
int dim0 = N/B0, B = min(B0, B1), dim = N/B;
int a[N][N], b[N][N], c[N*N];

meta_schedule cache(a, b, c) {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      for (int w = 0; w < dim; w++)
        meta_for (int u0 = 0; u0 < B0; u0++)
          meta_for (int u1 = 0; u1 < B1; u1++) {
            int i = v0*B0+u0;
            int j = v1*B1+u1;
            for (int z = 0; z < B; z++) {
              int p = B*w+z;
              c[i*N+j] = c[i*N+j] + a[i][p] * b[p][j];
            }
          }
}

Constraints (one of):
  B0B1 + BB1 + B0B ≤ ZB < sB0B1 + sBB1 + B0B, 8 ≤ RB; strategies (1) (3b) (4a) (3a) (2) (2) applied
  B0B1 + BB1 + B0B ≤ ZB < sB0B1 + sBB1 + B0B, 8 ≤ RB < 9; strategies (2) (2) (3b) (1) (4a) (3a) applied

Figure 7.26: The second case of the optimized MetaFork code for matrix matrix multiplication

int N, B0, B1, s = 1, dim1 = N/(B1*s);
int dim0 = N/B0, B = min(B0, B1), dim = N/B;
int a[N][N], b[N][N], c[N*N];

meta_schedule {
  meta_for (int v0 = 0; v0 < dim0; v0++)
    meta_for (int v1 = 0; v1 < dim1; v1++)
      for (int w = 0; w < dim; w++)
        meta_for (int u0 = 0; u0 < B0; u0++)
          meta_for (int u1 = 0; u1 < B1; u1++) {
            int i = v0*B0+u0;
            int j = v1*B1+u1;
            for (int z = 0; z < B; z++) {
              int p = B*w+z;
              c[i*N+j] = c[i*N+j] + a[i][p] * b[p][j];
            }
          }
}

Constraints (one of):
  ZB < B0B1 + BB1 + B0B, 8 ≤ RB; strategies (1) (3b) (2) (2) (4b) applied
  ZB < B0B1 + BB1 + B0B, 8 ≤ RB < 9; strategies (2) (2) (3b) (1) (4b) applied

Figure 7.27: The third case of the optimized MetaFork code for matrix matrix multiplication

Chapter 8

Concluding Remarks

As of today, the publicly available and latest release of MetaFork, see www.metafork.org, offers all the features stated in the abstract. To be more specific, MetaFork is a meta-language for concurrency platforms, based on the fork-join model and pipelining parallelism, which targets multi-core architectures. This meta-language forms a bridge between actual multithreaded programming languages, and we use it to perform automatic code translation between those languages, which currently consist of CilkPlus and OpenMP. The experimental results in Section 3.7 show that, first of all, the translators can faithfully translate programs written in one supported language to another. Secondly, translation can preserve (or improve) the performance of the translated fork-join programs. Thirdly, the translators are robust in the sense that they can translate large projects and not just simple test files; indeed, we can successfully translate the BOTS benchmark.

In Chapter 4, from an input MetaFork program we generate parametric CUDA kernels, that is, CUDA kernels for which program parameters (like the number of threads per block) and machine parameters (like the shared memory size) are symbols. Hence, the values of these parameters need not be known at code-generation time: machine parameters and program parameters are determined when the generated code is installed on the target machine. The experimental results show that these parametric CUDA programs can yield higher performance than the non-parametric CUDA programs obtained by other tools (namely PPCG) for C-to-CUDA automatic parallelization.

Moreover, for the fourth feature mentioned in the abstract, a proof-of-concept implementation dedicated to the generation of comprehensive, optimized MetaFork programs from an input MetaFork program is realized. While going from MetaFork to MetaFork is certainly not our final goal, turning an un-optimized MetaFork program into optimized MetaFork programs can be seen as a first approximation of our final goal, that is, generating optimized parametric CUDA kernels from input un-optimized annotated C/C++ programs, in the form of a case discussion based on the possible values of the machine and program parameters. In the future, we shall integrate this mechanism into the MetaFork compilation framework.

Finally, in Chapter 6 we have addressed the problem of developing portable high-level programming language extensions, by using directive-based programming, that take advantage of novel accelerators. We illustrate the implementation of our MetaFork compilation framework, which is built on top of the modern compiler product Clang for parsing input programs and generating their ASTs. We emphasize the fact that the compilation of our MetaFork framework is independent of that of LLVM/Clang.


In other words, our MetaFork framework is a standalone tool that does not modify any LLVM/Clang source code.

Bibliography

[1] Clang: a C language family frontend for LLVM. http://clang.llvm.org/. [2] ROSE compiler infrastructure. http://rosecompiler.org/. [3] User guide for NVPTX back-end. The LLVM Compiler Infrastructure. http://llvm. org/docs/NVPTXUsage.html#introduction. [4] 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Conference Proceedings. IEEE, 2010. [5] CUDA runtime API: v7.5. NVIDIA Corporation, 2015. http://docs.nvidia.com/ cuda/pdf/CUDA_Runtime_API.pdf. [6] The OpenACC application programming interface. OpenACC-Standard.org, 2015. http://www.openacc.org/sites/default/files/OpenACC_2pt5.pdf. [7] Parallel thread execution ISA : v4.3. NVIDIA Corporation, 2015. http://docs. nvidia.com/cuda/pdf/ptx_isa_4.3.pdf. [8] R. S. Aho, Alfred V. and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, Massachusetts, 2006. [9] M. Akra and L. Bazzi. On the solution of linear recurrence equations. Comput. Optim. Appl., 10(2):195–210, May 1998. [10] S. Albers and S. Leonardi. On-line algorithms. ACM Comput. Surv., 31(3es), Sept. 1999. [11] J. M. Andion,´ M. Arenaz, F. Bodin, G. Rodr´ıguez, and J. Tourino.˜ Locality-aware automatic parallelization for GPGPU with OpenHMPP directives. International Journal of Parallel Programming, 44(3):620–643, 2016. [12] C. Andreetta, V. Begot,´ J. Berthold, M. Elsman, F. Henglein, T. Henriksen, M.-B. Nord- fang, and C. E. Oancea. Finpar: A parallel financial benchmark. ACM Trans. Archit. Code Optim., 13(2):18:1–18:27, June 2016. [13] D. C. Atkinson and W. G. Griswold. Effective whole-program analysis in the presence of pointers. In Proceedings of the 6th ACM SIGSOFT International Symposium on Foundations of Software Engineering, SIGSOFT ’98/FSE-6, pages 46–55, New York, NY, USA, 1998. ACM.


[14] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP Tasks. IEEE Trans. Parallel Distrib. Syst., 20(3):404–418, 2009.

[15] Z. Bai, J. Demmel, and M. Gu. An inverse free parallel spectral divide and conquer algorithm for nonsymmetric eigenproblems. Numerische Mathematik, 76(3):279–308, 1997.

[16] M. Baskaran, J. Ramanujam, and P. Sadayappan. Automatic C-to-CUDA code gen- eration for affine programs. In Proceedings of the 19th joint European conference on Theory and Practice of Software, international conference on Compiler Construction, CC’10/ETAPS’10, pages 244–263, Berlin, Heidelberg, 2010. Springer-Verlag.

[17] C. Bastoul. Code generation in the polyhedral model is easier than you think. In Pro- ceedings of the 13th International Conference on Parallel Architectures and Compila- tion Techniques, PACT ’04, pages 7–16, Washington, DC, USA, 2004. IEEE Computer Society.

[18] S. Basu, R. Pollack, and M.-F. Roy. Algorithms in real algebraic geometry, volume 10 of Algorithms and Computations in Mathematics. Springer-Verlag, 2006.

[19] A. Basumallik and R. Eigenmann. Towards automatic translation of OpenMP to MPI. In Proceedings of the 19th Annual International Conference on Supercomputing, ICS ’05, pages 189–198, New York, NY, USA, 2005. ACM.

[20] G. E. Blelloch and J. J. Little. Parallel solutions to geometric problems in the scan model of computation. J. Comput. Syst. Sci., 48(1):90–115, 1994.

[21] G. E. Blelloch and M. Reid-Miller. Pipelining with Futures. Theory Comput. Syst., 32(3):213–239, 1999.

[22] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y.Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIG- PLAN Symposium on Principles and Practice of Parallel Programming, PPOPP ’95, pages 207–216, New York, NY, USA, 1995. ACM.

[23] R. D. Blumofe and C. E. Leiserson. Space-efficient scheduling of multithreaded com- putations. SIAM J. Comput., 27(1):202–229, 1998.

[24] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. J. ACM, 46(5):720–748, 1999.

[25] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic polyhedral parallelizer and locality optimizer. SIGPLAN Not., 43(6):101–113, June 2008.

[26] G. Bronevetsky and B. R. de Supinski. Complete formal specification of the OpenMP memory model. Int. J. Parallel Program., 35(4):335–392, Aug. 2007.

[27] C. W. Brown. QEPCAD B: A program for computing with semi-algebraic sets using CADs. ACM SIGSAM Bulletin, 37(4):97–108, 2003.

[28] M. G. Burke, P. R. Carini, J. Choi, and M. Hind. Flow-insensitive interprocedural alias analysis in the presence of pointers. In Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing, LCPC ’94, pages 234–250, London, UK, UK, 1995. Springer-Verlag.

[29] C. Chen, X. Chen, A.-K. Keita, M. M. Maza, and N. Xie. Metafork: A compilation framework for concurrency models targeting hardware accelerators and its application to the generation of parametric cuda kernels. In Proceedings of the 25th Annual Inter- national Conference on Computer Science and Software Engineering, CASCON ’15, pages 70–79, Riverton, NJ, USA, 2015. IBM Corp.

[30] C. Chen, J. H. Davenport, J. P. May, M. Moreno Maza, B. Xia, and R. Xiao. Triangular decomposition of semi-algebraic systems. J. Symb. Comput., 49:3–26, 2013.

[31] C. Chen and M. Moreno Maza. An incremental algorithm for computing cylindrical algebraic decompositions. In Computer Mathematics, 10th Asian Symposium (ASCM 2012), pages 199–221, 2012.

[32] C. Chen and M. Moreno Maza. Quantifier elimination by cylindrical algebraic decompo- sition based on regular chains. In International Symposium on Symbolic and Algebraic Computation, ISSAC ’14, Kobe, Japan, July 23-25, 2014, pages 91–98, 2014.

[33] C. Chen and M. Moreno Maza. An incremental algorithm for computing cylindrical algebraic decompositions. Proceedings of Asian Symposium of Computer Mathematics (ASCM) 2012, Oct. 2012.

[34] X. Chen, M. Moreno Maza, and S. Shekar. Experimenting with the MetaFork framework targeting multicores. Technical report, U. of Western Ontario, 2013.

[35] X. Chen, M. Moreno Maza, S. Shekar, and P. Unnikrishnan. Metafork: A framework for concurrency platforms targeting multicores. In Processing of IWOMP 2014, pages 30–44, 2014.

[36] J. Cheng, M. Grossman, and T. McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014.

[37] Y. Cho, S. Oh, and B. Egger. Online scalability characterization of data-parallel pro- grams on many cores. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, pages 191–205, New York, NY, USA, 2016. ACM.

[38] C. Lattner and V. Adve. The LLVM Instruction Set and Compilation Strategy. Tech. Report UIUCDCS-R-2002-2292, CS Dept., Univ. of Illinois at Urbana-Champaign, Aug 2002.

[39] N. Chugh, V. Vasista, S. Purini, and U. Bondhugula. A DSL compiler for accelerat- ing image processing pipelines on FPGAs. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, pages 327–338, New York, NY, USA, 2016. ACM.

[40] G. E. Collins. Quantifier elimination for real closed fields by cylindrical algebraic de- composition. Springer Lecture Notes in Computer Science, 33:515–532, 1975.

[41] J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Math. Comp., 19:297–301, 1965.

[42] K. D. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. The Journal of Supercomputing, 23(1):7–22, 2002.

[43] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Third Edition. The MIT Press, 3rd edition, 2009.

[44] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst., 13(4):451–490, Oct. 1991.

[45] S. Dalton, L. Olson, and N. Bell. Optimizing sparse matrix—matrix multiplica- tion for the GPU. ACM Trans. Math. Softw., 41(4):25:1–25:20, Oct. 2015.

[46] D. P. Darcy, C. F. Kemerer, S. A. Slaughter, and J. E. Tomayko. The structural complex- ity of software: An experimental test. IEEE Trans. Softw. Eng., 31(11):982–995, Nov. 2005.

[47] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, and K. Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Super- computing, SC ’08, pages 4:1–4:12, Piscataway, NJ, USA, 2008. IEEE Press.

[48] J. Diaz, C. Munoz-Caro, and A. Nino. A survey of parallel programming models and tools in the multi and many-core era. IEEE Trans. Parallel Distrib. Syst., 23(8):1369– 1386, Aug. 2012.

[49] V. V. Dimakopoulos and P. E. Hadjidoukas. HOMPI: A hybrid programming framework for expressing and deploying task-based parallelism. In Proceedings of the 17th Inter- national Conference on Parallel Processing - Volume Part II, Euro-Par’11, pages 14–26. Springer-Verlag, 2011.

[50] L. Domagala, D. van Amstel, F. Rastello, and P. Sadayappan. Register allocation and promotion through combined instruction scheduling and loop unrolling. In Proceedings of the 25th International Conference on Compiler Construction, CC 2016, pages 143– 151, New York, NY, USA, 2016. ACM.

[51] Y. Dotsenko, S. S. Baghsorkhi, B. Lloyd, and N. K. Govindaraju. Auto-tuning of fast Fourier transform on graphics processors. SIGPLAN Not., 46(8):257–266, Feb. 2011.

[52] Y. Dotsenko, N. K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. Fast scan algorithms on graphics processors. In Proceedings of the 22Nd Annual International Conference on Supercomputing, ICS ’08, pages 205–213, New York, NY, USA, 2008. ACM.

[53] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguade. Barcelona OpenMP Tasks Suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proc. of the 2009 International Conference on Parallel Processing, ICPP ’09, pages 124–131, Washington, DC, USA, 2009. IEEE Computer Society.

[54] C. Dwork, M. Herlihy, and O. Waarts. Contention in shared memory algorithms. J. ACM, 44(6):779–805, Nov. 1997.

[55] P. Feautrier. Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20, 1991.

[56] P. C. Fischer and R. L. Probert. Efficient procedures for using matrix algorithms. In J. Loeckx, editor, ICALP, volume 14 of Lecture Notes in Computer Science, pages 413– 427. Springer, 1974.

[57] M. Frigo and S. G. Johnson. FFTW: an adaptive software architecture for the FFT. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98, Seattle, Washington, USA, May 12-15, 1998, pages 1381–1384. IEEE, 1998.

[58] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious Algo- rithms. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, pages 285–297, New York, USA, October 1999.

[59] M. Frigo, C. E. Leiserson, and K. H. Randall. The Implementation of the Cilk-5 Multi- threaded Language. In ACM SIGPLAN, 1998.

[60] M. Frigo and V. Strumpen. The Cache Complexity of Multithreaded Cache Oblivious Algorithms. In Proceedings of the Eighteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’06, pages 271–280, New York, NY, USA, 2006. ACM.

[61] F. Gebali. Algorithms and Parallel Computing. Wiley Publishing, 1st edition, 2011.

[62] S. Gorlatch. Programming with divide-and-conquer skeletons: A case study of FFT. J. Supercomput., 12(1-2):85–97, Jan. 1998.

[63] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012, pages 1–10. IEEE, 2012.

[64] M. Griebl. Automatic parallelization of loop programs for distributed memory architectures. Univ. Passau, 2004.

[65] T. Grosser and T. Hoefler. Polly-ACC transparent compilation to heterogeneous hard- ware. In Proceedings of the 2016 International Conference on Supercomputing, ICS ’16, pages 1:1–1:13, New York, NY, USA, 2016. ACM.

[66] A. Größlinger, M. Griebl, and C. Lengauer. Introducing non-linear parameters to the polyhedron model. In Proceedings of 11th Workshop on Compilers for Parallel Computers, CPC ’04, pages 1–12, 2004.

[67] A. Größlinger, M. Griebl, and C. Lengauer. Quantifier elimination in automatic loop parallelization. J. Symb. Comput., 41(11):1206–1221, 2006.

[68] Y. Guobin. Getting to know the LLVM compiler. Master’s thesis, The University of Edinburgh, 2011.

[69] T. D. Han and T. S. Abdelrahman. hiCUDA: a high-level directive based language for GPU programming. In Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units,, GPGPU-2, pages 52–61, New York, NY, USA, 2009.

[70] A. Z. Haque. Multi-threaded real root isolation on multi-core architectures. Master’s thesis, University of Western Ontario, 2011.

[71] S. A. Haque, M. Moreno Maza, and N. Xie. A many-core machine model for designing algorithms with minimum parallelism overheads. In Proc. of International Conference on Parallel Computing (ParCo), 2015.

[72] Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview Scalability Analyzer. In Pro- ceedings of the 22Nd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, pages 145–156, New York, NY, USA, 2010. ACM.

[73] A. C. Hearn. REDUCE: The first forty years. In Invited paper presented at the A3L Conference in Honor of the 60th Birthday of Volker Weispfenning, 2005.

[74] M. Hind. Pointer analysis: Haven’t we solved this problem yet? In Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE ’01, pages 54–61, New York, NY, USA, 2001. ACM.

[75] K. Högstedt, L. Carter, and J. Ferrante. Determining the idle time of a tiling. In Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’97, pages 160–173, New York, NY, USA, 1997. ACM.

[76] J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on GPU architectures. In Proceedings of the 26th ACM inter- national conference on Supercomputing, ICS ’12, pages 311–320, New York, NY, USA, 2012. ACM.

[77] Intel Corporation. Intel Threading Building Blocks, 2011. https://www.threadingbuildingblocks.org/.

[78] Intel Corporation. Intel CilkPlus language specification, version 0.9, 2013. http://www.cilkplus.org/sites/default/files/open_specifications/ cilk_plus_language_specification_0_9.pdf.

[79] Intel Corporation. An introduction to the Cilkscreen Race Detector, 2013. https: //www.cilkplus.org/news/introduction-cilk-screen-race-detector.

[80] Intel Corporation. Intel OpenMP runtime library, 2016. https://www.openmprtl. org/sites/default/files/resources/libomp_20160322_manual.pdf.

[81] F. Irigoin and R. Triolet. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on principles of programming languages, POPL ’88, pages 319–329, New York, NY, USA, 1988. ACM.

[82] G. Kahn. The semantics of simple language for parallel programming. In IFIP Congress, pages 471–475, 1974.

[83] S. Kamil, C. P. Chan, L. Oliker, J. Shalf, and S. Williams. An auto-tuning framework for parallel multicore stencil computations. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Conference Proceedings [4], pages 1–12.

[84] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither More nor Less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22Nd International Confer- ence on Parallel Architectures and Compilation Techniques, PACT ’13, pages 157–166, Piscataway, NJ, USA, 2013. IEEE Press.

[85] C. F. Kemerer. Software complexity and software maintenance: A survey of empirical research. Ann. Software Eng., 1:1–22, 1995.

[86] M. Khan, P. Basu, G. Rudy, M. Hall, C. Chen, and J. Chame. A script-based au- totuning compiler system to generate high-performance CUDA code. ACM Trans. Archit. Code Optim., 9(4):31:1–31:25, Jan. 2013. http://doi.acm.org/10.1145/ 2400682.2400690.

[87] C. T. King, W. H. Chou, and L. M. Ni. Pipelined data parallel Algorithms-I: Concept and Modeling. IEEE Trans. Parallel Distrib. Syst., 1(4):470–485, Oct. 1990.

[88] Kirk, David B. and Hwu, Wen-mei W. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1st edition, 2010.

[89] T. Kistler and M. Franz. Continuous program optimization: A case study. ACM Trans- actions on Programming Languages and Systems (TOPLAS), 25(4):500–548, 2003. http://dl.acm.org/citation.cfm?id=778562.

[90] A. Konstantinidis. Source-to-source compilation of loop programs for manycore processors. PhD thesis, Imperial College London, 2013.

[91] O. Krzikalla, R. Müller-Pfefferkorn, and W. E. Nagel. Synchronization debugging of hybrid parallel programs. In Proceedings of the 22Nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833, pages 37–50, New York, NY, USA, 2016. Springer-Verlag New York, Inc.

[92] C. Lattner. LLVM: An Infrastructure for Multi-Stage Optimization. Master’s thesis, Computer Science Dept., University of Illinois at Urbana-Champaign, Urbana, IL, Dec 2002.

[93] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO ’04, pages 75–, Washington, DC, USA, 2004. IEEE Computer Society.

[94] D. Lee, I. Dinov, B. Dong, B. Gutman, I. Yanovsky, and A. W. Toga. CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms. Comput. Methods Prog. Biomed., 106(3):175–187, June 2012.

[95] I.-T. A. Lee, C. E. Leiserson, T. B. Schardl, J. Sukha, and Z. Zhang. On-the-fly pipeline parallelism. In Proceedings of the 25th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’13, pages 140–151, New York, NY, USA, 2013. ACM.

[96] S. Lee and R. Eigenmann. OpenMPC: Extended OpenMP programming and tuning for GPUs. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’10, pages 1–11. IEEE Computer Society, 2010.

[97] S. Lee, S.-J. Min, and R. Eigenmann. OpenMP to GPGPU: a compiler framework for automatic translation and optimization. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’09, pages 101– 110, New York, NY, USA, 2009. ACM.

[98] C. E. Leiserson. The Cilk++ concurrency platform. The Journal of Supercomputing, 51(3):244–257, 2010.

[99] C. E. Leiserson and A. Plaat. Programming parallel applications in Cilk. SIAM News, 31(4).

[100] G. Li, R. Palmer, M. DeLisi, G. Gopalakrishnan, and R. M. Kirby. Formal specification of MPI 2.0: Case study in specifying a practical concurrent programming API. Sci. Comput. Program., 76(2):65–81, Feb. 2011.

[101] J. Li, L. Liu, Y. Wu, X. Liu, Y. Gao, X. Feng, and C. Wu. Pragma directed shared memory centric optimizations on GPUs. J. Comput. Sci. Technol., 31(2):235–252, 2016.

[102] C. Liao, O. Hernandez, B. Chapman, W. Chen, and W. Zheng. OpenUH: An Optimizing, Portable OpenMP Compiler: Research Articles. Concurr. Comput.: Pract. Exper., 19(18):2317–2332, Dec. 2007.

[103] C. Liao, D. J. Quinlan, R. Vuduc, and T. Panas. Effective Source-to-Source Outlining to Support Whole Program Empirical Optimization, pages 308–322. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

[104] C. Liao, Y. Yan, B. R. de Supinski, D. J. Quinlan, and B. M. Chapman. Early experiences with the OpenMP accelerator model. In IWOMP, volume 8122 of Lecture Notes in Computer Science, pages 84–98. Springer, 2013.

[105] B. C. Lopes and R. Auler. Getting Started with LLVM Core Libraries. Packt Publishing, 2014.

[106] D. Luebke, M. Harris, J. Krüger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley, and A. Lefohn. GPGPU: General purpose computation on graphics hardware. In ACM SIGGRAPH 2004 Course Notes, SIGGRAPH ’04, New York, NY, USA, 2004. ACM.

[107] S. Macdonald, D. Szafron, and J. Schaeffer. Rethinking the pipeline as object-oriented states with transformations. In 9th International Workshop on High-Level Parallel Pro- gramming Models and Supportive Environments (HIPS2004) at IPDPS, pages 12–21, 2004.

[108] D. Majeti, K. S. Meel, R. Barik, and V. Sarkar. Automatic data layout generation and kernel mapping for CPU+GPU architectures. In Proceedings of the 25th International Conference on Compiler Construction, CC 2016, pages 240–250, New York, NY, USA, 2016. ACM.

[109] L. Meng, Y. Voronenko, J. R. Johnson, M. Moreno Maza, F. Franchetti, and Y. Xie. Spiral-generated modular FFT algorithms. In Proceedings of the 4th International Work- shop on Parallel and Symbolic Computation, PASCO ’10, pages 169–170, USA, 2010. ACM.

[110] T. Mens. On the complexity of software systems. Computer, 45(8):79–81, 2012.

[111] D. Merrill, M. Garland, and A. Grimshaw. High-performance and scalable GPU graph traversal. ACM Trans. Parallel Comput., 1(2):14:1–14:30, Feb. 2015.

[112] A. Monakov and A. Avetisyan. Implementing Blocked Sparse Matrix-Vector Multiplica- tion on NVIDIA GPUs, pages 289–297. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.

[113] N. Moore, M. Leeser, and L. S. King. Kernel specialization for improved adaptabil- ity and performance on graphics processing units (GPUs). In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 1037–1048. IEEE, 2013.

[114] G. S. Murthy, M. Ravishankar, M. M. Baskaran, and P. Sadayappan. Optimal loop unrolling for GPGPU programs. In 24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Conference Proceedings [4], pages 1–11.

[115] G. Myers, S. Selznick, Z. Zhang, and W. Miller. Progressive multiple alignment with constraints. In Proceedings of the First Annual International Conference on Computa- tional Molecular Biology, RECOMB ’97, pages 220–225, New York, NY, USA, 1997. ACM.

[116] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008. http://dl.acm.org/citation.cfm?id= 1365500.

[117] OpenMP Architecture Review Board. OpenMP application program interface, version 4.0, 2013. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.

[118] D. A. Plaisted. Source-to-source translation and software engineering. Jour- nal of Software Engineering and Applications, Vol.6(4A):30–40, 2013. doi: 10.4236/jsea.2013.64A005.

[119] M. Poletto and V. Sarkar. Linear scan register allocation. ACM Trans. Program. Lang. Syst., 21(5):895–913, Sept. 1999.

[120] L.-N. Pouchet, C. Bastoul, and U. Bondhugula. PoCC: the polyhedral compiler collection, 2010. http://web.cs.ucla.edu/~pouchet/software/pocc/.

[121] J. Power, A. Basu, J. Gu, S. Puthoor, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Heterogeneous system coherence for integrated CPU-GPU systems. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 457–467, New York, NY, USA, 2013. ACM.

[122] N. Prajapati, W. Ranasinghe, S. Rajopadhye, R. Andonov, H. Djidjev, and T. Grosser. Simple, accurate, analytical time modeling and optimal tile size selection for GPGPU stencils. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’17, pages 163–177, New York, NY, USA, 2017. ACM.

[123] K. Proudfoot, W. R. Mark, S. Tzvetkov, and P. Hanrahan. A real-time procedural shad- ing system for programmable graphics hardware. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’01, pages 159–170, New York, NY, USA, 2001. ACM.

[124] M. Püschel, J. M. F. Moura, B. Singer, J. Xiong, J. R. Johnson, D. A. Padua, M. M. Veloso, and R. W. Johnson. Spiral: A generator for platform-adapted libraries of signal processing algorithms. IJHPCA, 18(1):21–45, 2004.

[125] D. Quinlan. ROSE: Compiler support for object-oriented frameworks. Parallel Process- ing Letters, 10(02n03):215–226, 2000.

[126] D. Quinlan and C. Liao. The ROSE source-to-source compiler infrastructure. In Cetus users and compiler infrastructure workshop, in conjunction with PACT, volume 2011, page 1, 2011.

[127] E. C. Reed, N. Chen, and R. E. Johnson. Expressing pipeline parallelism using TBB constructs: A case study on what works and what doesn’t. In Proceedings of the Com- pilation of the Co-located Workshops on DSM’11, TMC’11, AGERE!’11, AOOPES’11, NEAT’11, & VMIL’11, SPLASH ’11 Workshops, pages 133–138, New York, NY, USA, 2011. ACM.

[128] H. Rhodin. A PTX code generator for LLVM. Master’s thesis, Saarland Univer- sity, 2010. http://compilers.cs.uni-saarland.de/publications/theses/ rhodin_bsc.pdf.

[129] M. Rhu and M. Erez. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. In Proceedings of the 40th Annual International Symposium on Com- puter Architecture, ISCA ’13, pages 356–367, New York, NY, USA, 2013. ACM.

[130] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu. Program optimization space pruning for a multithreaded GPU. In Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’08, pages 195–204, New York, NY, USA, 2008. ACM.

[131] V. Sarkar. A concurrent execution semantics for parallel program graphs and program dependence graphs. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, pages 16–30, London, UK, UK, 1993. Springer- Verlag.

[132] M. Sato, S. Satoh, K. Kusano, and Y. Tanaka. Design of OpenMP compiler for an SMP cluster. In EWOMP ’99, pages 32–39, 1999.

[133] T. B. Schardl, B. C. Kuszmaul, I.-T. A. Lee, W. M. Leiserson, and C. E. Leiserson. The Cilkprof Scalability Profiler. In Proceedings of the 27th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’15, pages 89–100, New York, NY, USA, 2015. ACM.

[134] J. Shirako, P. Unnikrishnan, S. Chatterjee, K. Li, and V. Sarkar. Expressing doacross loop dependences in OpenMP. In A. P. Rendell, B. M. Chapman, and M. S. Müller, editors, IWOMP, volume 8122 of Lecture Notes in Computer Science, pages 30–44. Springer, 2013.

[135] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 11–22, New York, NY, USA, 2012. ACM.

[136] C. Song, L.-P. Wang, and T. J. Martínez. Automated code engine for graphical processing units: Application to the effective core potential integrals and gradients. Journal of Chemical Theory and Computation, 2015. http://pubs.acs.org/doi/abs/10.1021/acs.jctc.5b00790.

[137] R. Sotomayor, L. M. Sanchez, J. Garcia Blas, J. Fernandez, and J. D. Garcia. Automatic CPU/GPU generation of multi-versioned OpenCL kernels for C++ scientific applica- tions. Int. J. Parallel Program., 45(2):262–282, Apr. 2017.

[138] M. Sourouri, S. Baden, and X. Cai. Panda: A compiler framework for concurrent CPU + GPU execution of 3D stencil computations on GPU-accelerated supercomputers. In- ternational Journal of Parallel Programming, 10/2016 2016.

[139] D. Spoonhower, G. E. Blelloch, P. B. Gibbons, and R. Harper. Beyond Nested Paral- lelism: Tight Bounds on Work-stealing Overheads for Parallel Futures. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, pages 91–100, New York, NY, USA, 2009. ACM.

[140] T. G. J. Stockham. High-speed convolution and correlation. In Proc. of AFIPS, pages 229–233. ACM, 1966.

[141] J. E. Stone, D. Gohara, and G. Shi. OpenCL: A parallel programming standard for het- erogeneous computing systems. Computing in science & engineering, 12(3):66, 2010.

[142] R. Suda, K. Naono, K. Teranishi, and J. Cavazos. Software automatic tuning: Concepts and state-of-the-art results. In Software Automatic Tuning, pages 3–15. Springer, 2011.

[143] X. Teruel, P. Unnikrishnan, X. Martorell, E. Ayguadé, R. Silvera, G. Zhang, and E. Tiotto. OpenMP Tasks in IBM XL compilers. In CASCON’08, pages –1–1, 2008.

[144] X. Tian, R. Xu, Y. Yan, Z. Yun, S. Chandrasekaran, and B. M. Chapman. Compiling a high-level directive-based programming model for GPGPUs. In C. Cascaval and P. Mon- tesinos, editors, Languages and Compilers for Parallel Computing - 26th International Workshop, LCPC 2013, San Jose, CA, USA, September 25-27, 2013. Revised Selected Papers, volume 8664 of Lecture Notes in Computer Science, pages 105–120. Springer, 2013.

[145] C. Trapnell and M. C. Schatz. Optimizing data intensive GPGPU computations for DNA sequence alignment. Parallel Comput., 35(8-9):429–440, Aug. 2009.

[146] S. Unkule, C. Shaltz, and A. Qasem. Automatic restructuring of GPU kernels for exploit- ing inter-thread data locality. In Compiler Construction - 21st International Conference, Proceedings, pages 21–40, 2012.

[147] S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio Gómez, C. Tenllado, and F. Catthoor. Polyhedral parallel code generation for CUDA. TACO, 9(4):54, 2013.

[148] S. Verdoolaege and T. Grosser. Polyhedral extraction tool. In Second International Workshop on Polyhedral Compilation Techniques (IMPACT12), Paris, France, 2012.

[149] V. Volkov. Better performance at lower occupancy. In Proceedings of the GPU technology conference, GTC, volume 10, page 16. San Jose, CA, 2010.

[150] J. Wang, F. Fan, L. Jiang, X. Liang, and N. Jing. Incorporating selective victim cache into GPGPU for high-performance computing. Concurrency and Computation: Practice and Experience, pages e4104–n/a, 2017. e4104 cpe.4104.

[151] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In PPSC, 1999.

[152] C. Wimmer and M. Franz. Linear scan register allocation on SSA form. In Proceed- ings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’10, pages 170–179, New York, NY, USA, 2010. ACM.

[153] S. Wolfram. Mathematica: A system for doing mathematics by computer. Addison Wesley Longman Publishing Co., Inc., 1991.

[154] E. Wynters. Fast and easy parallel processing on GPUs using C++ AMP. J. Comput. Sci. Coll., 31(6):27–33, June 2016. http://dl.acm.org/citation.cfm?id=2904446. 2904451.

[155] Y. Yan, B. M. Chapman, and M. Wong. A comparison of heterogeneous and manycore programming models. HPC Wire, march 2015. http://www.hpcwire.com/2015/03/ 02/a-comparison-of-heterogeneous-and-manycore-programming-models/.

[156] J. Zhao, S. Nagarakatte, M. M. Martin, and S. Zdancewic. Formalizing the LLVM intermediate representation for verified program transformations. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’12, pages 427–440, New York, NY, USA, 2012. ACM.

Appendix A

Code Translation Examples

Listing A.1: A parallel region of OpenMP code 1 int pairalign() 2 { 3 int i, n, m, si, sj; 4 int len1, len2, maxres; 5 double gg, mm_score; 6 int *mat_xref, *matptr; 7 8 matptr = gon250mt; 9 mat_xref = def_aa_xref; 10 maxres = get_matrix(matptr, mat_xref, 10); 11 if (maxres == 0) return(-1); 12 13 bots_message("Start aligning "); 14 15 #pragma omp parallel 16 { 17 #pragma omp single private(i,n,si,sj,len1,m) 18 for (si = 0; si < nseqs; si++) { 19 n = seqlen_array[si+1]; 20 for (i = 1, len1 = 0; i <= n; i++) { 21 char c = seq_array[si+1][i]; 22 if ((c != gap_pos1) && (c != gap_pos2)) len1++; 23 } 24 for (sj = si + 1; sj < nseqs; sj++) 25 { 26 m = seqlen_array[sj+1]; 27 if ( n == 0 || m == 0 ) { 28 bench_output[si*nseqs+sj] = (int) 1.0; 29 } else { 30 #pragma omp task untied \ 31 private(i,gg,len2,mm_score) firstprivate(m,n,si,sj,len1) \ 32 shared(nseqs, bench_output,seqlen_array,seq_array,gap_pos1,


gap_pos2,pw_ge_penalty,pw_go_penalty,mat_avscore) 33 { 34 int se1, se2, sb1, sb2, maxscore, seq1, seq2, g, gh; 35 int displ[2*MAX_ALN_LENGTH+1]; 36 int print_ptr, last_print; 37 38 for (i = 1, len2 = 0; i <= m; i++) { 39 char c = seq_array[sj+1][i]; 40 if ((c != gap_pos1) && (c != gap_pos2)) len2++; 41 } 42 if ( dnaFlag == TRUE ) { 43 g = (int) ( 2 * INT_SCALE * pw_go_penalty * gap_open_scale ); // gapOpen 44 gh = (int) (INT_SCALE * pw_ge_penalty * gap_extend_scale); //gapExtend 45 } else { 46 gg = pw_go_penalty + log((double) MIN(n, m)); // temporary value 47 g = (int) ((mat_avscore <= 0) ? (2 * INT_SCALE * gg) : (2 * mat_avscore * gg * gap_open_scale) ); // gapOpen 48 gh = (int) (INT_SCALE * pw_ge_penalty); //gapExtend 49 } 50 51 seq1 = si + 1; 52 seq2 = sj + 1; 53 54 forward_pass(&seq_array[seq1][0], &seq_array[seq2][0], n, m, & se1, &se2, &maxscore, g, gh); 55 reverse_pass(&seq_array[seq1][0], &seq_array[seq2][0], se1, se2, &sb1, &sb2, maxscore, g, gh); 56 57 print_ptr = 1; 58 last_print = 0; 59 60 diff(sb1-1, sb2-1, se1-sb1+1, se2-sb2+1, 0, 0, &print_ptr, & last_print, displ, seq1, seq2, g, gh); 61 mm_score = tracepath(sb1, sb2, &print_ptr, displ, seq1, seq2); 62 63 if (len1 == 0 || len2 == 0) mm_score = 0.0; 64 else mm_score /= (double) MIN(len1,len2); 65 66 bench_output[si*nseqs+sj] = (int) mm_score; 67 } // end task 68 } // end if (n == 0 || m == 0) 69 } // for (j) 70 } // end parallel for (i) 71 } // end parallel 143

72 bots_message(" completed!\n"); 73 return 0; 74 }

Listing A.2: A parallel region of MetaFork code translated from Listing A.1 1 int pairalign() 2 { 3 int i, n, m, si, sj; 4 int len1, len2, maxres; 5 double gg, mm_score; 6 int *mat_xref, *matptr; 7 8 matptr = gon250mt; 9 mat_xref = def_aa_xref; 10 maxres = get_matrix(matptr, mat_xref, 10); 11 if (maxres == 0) return(-1); 12 13 bots_message("Start aligning "); 14 15 { 16 meta_fork shared( gap_pos2, gap_pos1, bench_output, nseqs, seq_array, seqlen_array) { 17 int len1; 18 int i; 19 int m; 20 int n; 21 int sj; 22 int si; 23 24 for (si = 0; si < nseqs; si++) { 25 n = seqlen_array[si+1]; 26 for (i = 1, len1 = 0; i <= n; i++) { 27 char c = seq_array[si+1][i]; 28 if ((c != gap_pos1) && (c != gap_pos2)) len1++; 29 } 30 for (sj = si + 1; sj < nseqs; sj++) 31 { 32 m = seqlen_array[sj+1]; 33 if ( n == 0 || m == 0 ) { 34 bench_output[si*nseqs+sj] = (int) 1.0; 35 } else { 36 37 38 meta_fork shared( gap_open_scale, pw_ge_penalty, gap_pos2, dnaFlag, bench_output, gap_extend_scale, gap_pos1, mat_avscore, nseqs, seq_array, seqlen_array, pw_go_penalty) { 144 Chapter A. Code Translation Examples

39 40 { 41 int se1, se2, sb1, sb2, maxscore, seq1, seq2, g, gh; 42 int displ[2*MAX_ALN_LENGTH+1]; 43 int print_ptr, last_print; 44 45 for (i = 1, len2 = 0; i <= m; i++) { 46 char c = seq_array[sj+1][i]; 47 if ((c != gap_pos1) && (c != gap_pos2)) len2++; 48 } 49 if ( dnaFlag == TRUE ) { 50 g = (int) ( 2 * INT_SCALE * pw_go_penalty * gap_open_scale ) ; // gapOpen 51 gh = (int) (INT_SCALE * pw_ge_penalty * gap_extend_scale); //gapExtend 52 } else { 53 gg = pw_go_penalty + log((double) MIN(n, m)); // temporary value 54 g = (int) ((mat_avscore <= 0) ? (2 * INT_SCALE * gg) : (2 * mat_avscore * gg * gap_open_scale) ); // gapOpen 55 gh = (int) (INT_SCALE * pw_ge_penalty); //gapExtend 56 } 57 58 seq1 = si + 1; 59 seq2 = sj + 1; 60 61 forward_pass(&seq_array[seq1][0], &seq_array[seq2][0], n, m, & se1, &se2, &maxscore, g, gh); 62 reverse_pass(&seq_array[seq1][0], &seq_array[seq2][0], se1, se2 , &sb1, &sb2, maxscore, g, gh); 63 64 print_ptr = 1; 65 last_print = 0; 66 67 diff(sb1-1, sb2-1, se1-sb1+1, se2-sb2+1, 0, 0, &print_ptr, & last_print, displ, seq1, seq2, g, gh); 68 mm_score = tracepath(sb1, sb2, &print_ptr, displ, seq1, seq2); 69 70 if (len1 == 0 || len2 == 0) mm_score = 0.0; 71 else mm_score /= (double) MIN(len1,len2); 72 73 bench_output[si*nseqs+sj] = (int) mm_score; 74 } 75 } 76 // end task 77 } // end if (n == 0 || m == 0) 78 } // for (j) 145

79 } 80 } 81 82 meta_join; 83 // end parallel for (i) 84 } 85 // end parallel 86 bots_message(" completed!\n"); 87 return 0; 88 }

Listing A.3: A parallel region of CilkPlus code translated from Listing A.2 1 static void * _taskFunc4(void * __tdata2) 2 { 3 struct __taskenv__0 { 4 int len1; 5 int i; 6 double gg; 7 double mm_score; 8 int len2; 9 int n; 10 int m; 11 int si; 12 int sj; 13 }; 14 15 struct __taskenv__0 * _tenv1 = ( struct __taskenv__0 *) __tdata2; 16 17 int len1 = _tenv1->len1; 18 int i = _tenv1->i; 19 double gg = _tenv1->gg; 20 double mm_score = _tenv1->mm_score; 21 int len2 = _tenv1->len2; 22 int n = _tenv1->n; 23 int m = _tenv1->m; 24 int si = _tenv1->si; 25 int sj = _tenv1->sj; 26 { 27 { 28 int se1, se2, sb1, sb2, maxscore, seq1, seq2, g, gh; 29 int displ[10001]; 30 int print_ptr, last_print; 31 for (i = 1 , len2 = 0; i <= m; i++) { 32 char c = seq_array[sj + 1][i]; 33 if ((c != gap_pos1) && (c != gap_pos2)) 34 len2++; 35 } 146 Chapter A. Code Translation Examples

36 if (dnaFlag == 1) { 37 g = (int)(2 * 100 * pw_go_penalty * gap_open_scale); 38 gh = (int)(100 * pw_ge_penalty * gap_extend_scale); 39 } else { 40 gg = pw_go_penalty + log((double)((n) < (m) ? (n) : (m))); 41 g = (int)((mat_avscore <= 0) ? (2 * 100 * gg) : (2 * mat_avscore * gg * gap_open_scale)); 42 gh = (int)(100 * pw_ge_penalty); 43 } 44 seq1 = si + 1; 45 seq2 = sj + 1; 46 forward_pass(&seq_array[seq1][0], &seq_array[seq2][0], n, m, &se1, & se2, &maxscore, g, gh); 47 reverse_pass(&seq_array[seq1][0], &seq_array[seq2][0], se1, se2, & sb1, &sb2, maxscore, g, gh); 48 print_ptr = 1; 49 last_print = 0; 50 diff(sb1 - 1, sb2 - 1, se1 - sb1 + 1, se2 - sb2 + 1, 0, 0, & print_ptr, &last_print, displ, seq1, seq2, g, gh); 51 mm_score = tracepath(sb1, sb2, &print_ptr, displ, seq1, seq2); 52 if (len1 == 0 || len2 == 0) 53 mm_score = 0.; 54 else 55 mm_score /= (double)((len1) < (len2) ? (len1) : (len2)); 56 bench_output[si * nseqs + sj] = (int)mm_score; 57 } 58 } 59 } 60 61 int pairalign() 62 { 63 int i, n, m, si, sj; 64 int len1, len2, maxres; 65 double gg, mm_score; 66 int *mat_xref, *matptr; 67 68 matptr = gon250mt; 69 mat_xref = def_aa_xref; 70 maxres = get_matrix(matptr, mat_xref, 10); 71 if (maxres == 0) return(-1); 72 73 bots_message("Start aligning "); 74 75 76 77 { 78 meta_fork shared( gap_pos2, gap_pos1, bench_output, seqlen_array, nseqs, 147

seq_array) { 79 int i; 80 int len1; 81 int sj; 82 int si; 83 int m; 84 int n; 85 86 for (si = 0; si < nseqs; si++) { 87 n = seqlen_array[si+1]; 88 for (i = 1, len1 = 0; i <= n; i++) { 89 char c = seq_array[si+1][i]; 90 if ((c != gap_pos1) && (c != gap_pos2)) len1++; 91 } 92 for (sj = si + 1; sj < nseqs; sj++) 93 { 94 m = seqlen_array[sj+1]; 95 if ( n == 0 || m == 0 ) { 96 bench_output[si*nseqs+sj] = (int) 1.0; 97 } else { 98 { 99 struct __taskenv__0 { 100 int len1; 101 int i; 102 double gg; 103 double mm_score; 104 int len2; 105 int n; 106 int m; 107 int si; 108 int sj; 109 } * _tenv1; 110 111 _tenv1 = ( struct __taskenv__0*) malloc(sizeof(struct __taskenv__0)); 112 _tenv1->len1 = len1; 113 _tenv1->i = i; 114 _tenv1->gg = gg; 115 _tenv1->mm_score = mm_score; 116 _tenv1->len2 = len2; 117 _tenv1->n = n; 118 _tenv1->m = m; 119 _tenv1->si = si; 120 _tenv1->sj = sj; 121 122 cilk_spawn _taskFunc4(_tenv1); 123 } 124 // end task 148 Chapter A. Code Translation Examples

125 } // end if (n == 0 || m == 0) 126 } // for (j) 127 } 128 } 129 cilk_sync ; 130 // end parallel for (i) 131 } 132 // end parallel 133 bots_message(" completed!\n"); 134 return 0; 135 }

Listing A.4: A parallel region of CilkPlus code translated from Listing A.3 1 static void * _taskFunc4(void * __tdata2) 2 { 3 struct __taskenv__0 { 4 int len2; 5 int sj; 6 double mm_score; 7 double gg; 8 int si; 9 int m; 10 int n; 11 int len1; 12 int i; 13 }; 14 struct __taskenv__0 * _tenv1 = ( struct __taskenv__0 *) __tdata2; 15 16 int len2 = _tenv1->len2; 17 18 int sj = _tenv1->sj; 19 20 double mm_score = _tenv1->mm_score; 21 22 double gg = _tenv1->gg; 23 24 int si = _tenv1->si; 25 26 int m = _tenv1->m; 27 28 int n = _tenv1->n; 29 30 int len1 = _tenv1->len1; 31 32 int i = _tenv1->i; 33 { 34 { 149

35 int se1, se2, sb1, sb2, maxscore, seq1, seq2, g, gh; 36 int displ[10001]; 37 int print_ptr, last_print; 38 for (i = 1 , len2 = 0; i <= m; i++) { 39 char c = seq_array[sj + 1][i]; 40 if ((c != gap_pos1) && (c != gap_pos2)) 41 len2++; 42 } 43 if (dnaFlag == 1) { 44 g = (int)(2 * 100 * pw_go_penalty * gap_open_scale); 45 gh = (int)(100 * pw_ge_penalty * gap_extend_scale); 46 } else { 47 gg = pw_go_penalty + log((double)((n) < (m) ? (n) : (m))); 48 g = (int)((mat_avscore <= 0) ? (2 * 100 * gg) : (2 * mat_avscore * gg * gap_open_scale)); 49 gh = (int)(100 * pw_ge_penalty); 50 } 51 seq1 = si + 1; 52 seq2 = sj + 1; 53 forward_pass(&seq_array[seq1][0], &seq_array[seq2][0], n, m, &se1, &se2 , &maxscore, g, gh); 54 reverse_pass(&seq_array[seq1][0], &seq_array[seq2][0], se1, se2, &sb1, &sb2, maxscore, g, gh); 55 print_ptr = 1; 56 last_print = 0; 57 diff(sb1 - 1, sb2 - 1, se1 - sb1 + 1, se2 - sb2 + 1, 0, 0, &print_ptr, &last_print, displ, seq1, seq2, g, gh); 58 mm_score = tracepath(sb1, sb2, &print_ptr, displ, seq1, seq2); 59 if (len1 == 0 || len2 == 0) 60 mm_score = 0.; 61 else 62 mm_score /= (double)((len1) < (len2) ? (len1) : (len2)); 63 bench_output[si * nseqs + sj] = (int)mm_score; 64 } 65 } 66 } 67 static void * _taskFunc5(void * __tdata2) 68 { 69 struct __taskenv__0 { 70 double gg; 71 double mm_score; 72 int len2; 73 }; 74 struct __taskenv__0 * _tenv1 = ( struct __taskenv__0 *) __tdata2; 75 76 double gg = _tenv1->gg; 77 double mm_score = _tenv1->mm_score; 150 Chapter A. Code Translation Examples

78 int len2 = _tenv1->len2; 79 80 { 81 int len1; 82 int i; 83 int m; 84 int n; 85 int sj; 86 int si; 87 88 for (si = 0; si < nseqs; si++) { 89 n = seqlen_array[si + 1]; 90 for (i = 1 , len1 = 0; i <= n; i++) { 91 char c = seq_array[si + 1][i]; 92 if ((c != gap_pos1) && (c != gap_pos2)) 93 len1++; 94 } 95 for (sj = si + 1; sj < nseqs; sj++) { 96 m = seqlen_array[sj + 1]; 97 if (n == 0 || m == 0) { 98 bench_output[si * nseqs + sj] = (int)1.; 99 } else { 100 { 101 struct __taskenv__0 { 102 int len2; 103 int sj; 104 double mm_score; 105 double gg; 106 int si; 107 int m; 108 int n; 109 int len1; 110 int i; 111 } *_tenv1; 112 113 _tenv1 = (struct __taskenv__0 *)malloc(sizeof(struct __taskenv__0)); 114 _tenv1->len2 = len2; 115 _tenv1->sj = sj; 116 _tenv1->mm_score = mm_score; 117 _tenv1->gg = gg; 118 _tenv1->si = si; 119 _tenv1->m = m; 120 _tenv1->n = n; 121 _tenv1->len1 = len1; 122 _tenv1->i = i; 123 124 cilk_spawn _taskFunc4(_tenv1); 151

125 } 126 } 127 } 128 } 129 } 130 } 131 132 int pairalign() 133 { 134 int i, n, m, si, sj; 135 int len1, len2, maxres; 136 double gg, mm_score; 137 int *mat_xref, *matptr; 138 139 matptr = gon250mt; 140 mat_xref = def_aa_xref; 141 maxres = get_matrix(matptr, mat_xref, 10); 142 if (maxres == 0) return(-1); 143 144 bots_message("Start aligning "); 145 146 { 147 { 148 struct __taskenv__0 { 149 double gg; 150 double mm_score; 151 int len2; 152 } * _tenv1; 153 154 _tenv1 = ( struct __taskenv__0*) malloc(sizeof(struct __taskenv__0)); 155 _tenv1->gg = gg; 156 _tenv1->mm_score = mm_score; 157 _tenv1->len2 = len2; 158 159 cilk_spawn _taskFunc5(_tenv1); 160 } 161 162 cilk_sync ; 163 // end parallel for (i) 164 } 165 // end parallel 166 bots_message(" completed!\n"); 167 return 0; 168 }

Listing A.5: Outlined code of the parallel region in Figure 3.13

static void * _taskFunc4(void * __tdata2)
{
    struct __taskenv__0 {
        int (*m);
        int j;
        int (*arry)[10];
        int &ref;
        __taskenv__0 ( int &ref ) : ref(ref) { }
    };

    struct __taskenv__0 * _tenv1 = ( struct __taskenv__0 *) __tdata2;
    int (*m) = _tenv1->m;
    int j = _tenv1->j;
    int (*arry)[10] = _tenv1->arry;
    int &ref = _tenv1->ref;
    {
        ( *arry);
        ref++;
        int local_j = j;
        ( *m);
    }
}

void outlining_example(void)
{
    int m, j;
    int &ref = m;
    int arry[10];

    {
        struct __taskenv__0 {
            int (*m);
            int j;
            int (*arry)[10];
            int &ref;
            __taskenv__0 ( int &ref ) : ref(ref) { }
        } * _tenv1;

        _tenv1 = new __taskenv__0( ref);
        _tenv1->m = &m;
        _tenv1->j = j;
        _tenv1->arry = &arry;
        cilk_spawn _taskFunc4(_tenv1);
    }
}

Appendix B

Examples Generated by PPCG

We present, for eight examples, the serial input code together with the CUDA kernels generated by PPCG: array reversal (Figure B.1), matrix addition (Figure B.2), 1D Jacobi (Figure B.3), 2D Jacobi (Figure B.4), LU decomposition (Figure B.5), matrix vector multiplication (Figure B.6), matrix transposition (Figure B.7), and matrix matrix multiplication (Figure B.8).
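For reference, the serial input consumed by PPCG is ordinary C code in which the region to be mapped to the GPU is delimited by scop pragmas, as visible in each figure below. A minimal sketch of such an input file for the array reversal example of Figure B.1 follows; the function name and the calling convention are ours, only the annotated loop is taken from the figure.

    /* Minimal sketch of a PPCG input for array reversal (cf. Figure B.1).
       The #pragma scop / #pragma endscop markers delimit the static control
       part that PPCG analyzes and turns into the CUDA kernel kernel0.      */
    void reverse(int N, int *In, int *Out)
    {
    #pragma scop
        for (int i = 0; i < N; i++)
            Out[N - 1 - i] = In[i];
    #pragma endscop
    }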

Serial input code:

    #pragma scop
    for (int i = 0; i < N; i++)
        Out[N - 1 - i] = In[i];
    #pragma endscop

Generated CUDA kernel:

    __global__ void kernel0(int *In, int *Out, int N)
    {
        int b0 = blockIdx.x;
        int t0 = threadIdx.x;
        __shared__ int shared_Out[32];

        #define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d))
        for (int c0 = 32 * b0; c0 < N; c0 += 1048576) {
            __syncthreads();
            if (N >= t0 + c0 + 1)
                shared_Out[-t0 + 31] = In[t0 + c0];
            __syncthreads();
            if (N + t0 >= c0 + 32)
                Out[N + t0 - c0 - 32] = shared_Out[t0];
        }
    }

Figure B.1: PPCG code and generated CUDA kernel for array reversal

Serial input code:

    #pragma scop
    for (int v0 = 0; v0 < n; v0++)
        for (int v1 = 0; v1 < n; v1++)
            c[v0][v1] = a[v0][v1] + b[v0][v1];
    #pragma endscop

Generated CUDA kernel:

    __global__ void kernel0(int *a, int *b, int *c, int n)
    {
        int b0 = blockIdx.y, b1 = blockIdx.x;
        int t0 = threadIdx.y, t1 = threadIdx.x;

        #define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d))
        #define min(x,y) ((x) < (y) ? (x) : (y))
        for (int c0 = 32 * b0; c0 < n; c0 += 8192)
            if (n >= t0 + c0 + 1)
                for (int c1 = 32 * b1; c1 < n; c1 += 8192)
                    for (int c3 = t1; c3 <= min(31, n - c1 - 1); c3 += 16)
                        c[(t0 + c0) * n + (c1 + c3)] =
                            (a[(t0 + c0) * n + (c1 + c3)] + b[(t0 + c0) * n + (c1 + c3)]);
    }

Figure B.2: PPCG code and generated CUDA kernel for matrix addition


__global__ void kernel0(int *a, int *b, int N, int T, int c0) { int b0 = blockIdx.x; int t0 = threadIdx.x;

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) for (int c1 = 32 * b0; c1 < N - 1; c1 += 1048576) #pragma scop if (N >= t0 + c1 + 2 && t0 + c1 >= 1) for (int t = 0; t < T; ++t) { b[t0 + c1] = (((a[t0 + c1 - 1] + a[t0 + c1]) + for (int i = 1; i < N-1; ++i) a[t0 + c1 + 1]) / 3); b[i] = (a[i-1] + a[i] + a[i+1]) / 3; } for (int i = 1; i < N-1; ++i) __global__ void kernel1(int *a, int *b, int N, int T, int c0) a[i] = b[i]; { } int b0 = blockIdx.x; #pragma endscop int t0 = threadIdx.x;

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) for (int c1 = 32 * b0; c1 < N - 1; c1 += 1048576) if (N >= t0 + c1 + 2 && t0 + c1 >= 1) a[t0 + c1] = b[t0 + c1]; }

Figure B.3: PPCG code and generated CUDA kernel for 1D Jacobi

__global__ void kernel0(int *a, int *b, int N, int T, int c0) { int b0 = blockIdx.y, b1 = blockIdx.x; int t0 = threadIdx.y, t1 = threadIdx.x;

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) #define min(x,y) ((x) < (y) ? (x) : (y)) for (int c1 = 32 * b0; c1 < N - 2; c1 += 8192) if (N >= t0 + c1 + 3) for (int c2 = 32 * b1; c2 < N - 2; c2 += 8192) #pragma scop for (int c4 = t1; c4 <= min(31, N - c2 - 3); c4 += 16) for (int t = 0; t < T; t++) { b[(t0 + c1 + 1) * N + (c2 + c4 + 1)] = for (int i = 0; i < N-2; i++) ((((a[(t0 + c1) * N + (c2 + c4 + 1)] + for (int j = 0; j < N-2; j++) a[(t0 + c1 + 2) * N + (c2 + c4 + 1)]) + b[i+1][j+1] = (a[i][j+1] + a[(t0 + c1 + 1) * N + (c2 + c4)]) + a[i+2][j+1] + a[i+1][j] + a[(t0 + c1 + 1) * N + (c2 + c4 + 2)]) / 4); a[i+1][j+2]) / 4; } for (int i = 0; i < N-2; ++i) __global__ void kernel1(int *a, int *b, int N, int T, int c0) for (int j = 0; j < N-2; j++) { a[i+1][j+1] = b[i+1][j+1]; int b0 = blockIdx.y, b1 = blockIdx.x; } int t0 = threadIdx.y, t1 = threadIdx.x; #pragma endscop #define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) #define min(x,y) ((x) < (y) ? (x) : (y)) for (int c1 = 32 * b0; c1 < N - 2; c1 += 8192) if (N >= t0 + c1 + 3) for (int c2 = 32 * b1; c2 < N - 2; c2 += 8192) for (int c4 = t1; c4 <= min(31, N - c2 - 3); c4 += 16) a[(t0 + c1 + 1) * N + (c2 + c4 + 1)] = b[(t0 + c1 + 1) * N + (c2 + c4 + 1)]; }

Figure B.4: PPCG code and generated CUDA kernel for 2D Jacobi

__global__ void kernel0(double *L, double *U, int n, int c0) { int b0 = blockIdx.x; int t0 = threadIdx.x; __shared__ double shared_U_1[1][1];

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) { if (t0 == 0) shared_U_1[0][0] = U[c0 * n + c0]; __syncthreads(); for (int c1 = 32 * b0; c1 < n - c0 - 1; c1 += 1048576) if (n >= t0 + c0 + c1 + 2) L[c0 * n + (t0 + c0 + c1 + 1)] = (U[c0 * n + (t0 + c0 + c1 + 1)] / shared_U_1[0][0]); } } __global__ void kernel1(double *L, double *U, int n, int c0) #pragma scop { for (int k = 0; k < n; ++k) { int b0 = blockIdx.y, b1 = blockIdx.x; for (int i = 0; i < n-k-1; i++) { int t0 = threadIdx.y, t1 = threadIdx.x; // column major representation __shared__ double shared_L[1][32]; // of L and U __shared__ double shared_U_1[32][1]; int p = i + k + 1; L[k][p] = U[k][p] / U[k][k]; #define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) for (int j = k; j < n; j++) #define min(x,y) ((x) < (y) ? (x) : (y)) U[j][p] -= L[k][p] * U[j][k]; #define max(x,y) ((x) > (y) ? (x) : (y)) } if (n + 30 >= ((32 * b1 + 8191 * c0 + 31) \% 8192) + c0) } for (int c1 = 32 * b0; c1 < n - c0 - 1; c1 += 8192) { #pragma endscop if (t0 == 0) for (int c3 = t1; c3 <= min(31, n - c0 - c1 - 2); c3 += 16) shared_L[0][c3] = L[c0 * n + (c0 + c1 + c3 + 1)]; __syncthreads(); for (int c2 = 32 * b1 + 8192 * ((-32 * b1 + c0 + 8160) / 8192); c2 < n; c2 += 8192) { if (t1 == 0 && n >= t0 + c2 + 1) shared_U_1[t0][0] = U[(t0 + c2) * n + c0]; __syncthreads(); if (n >= t0 + c0 + c1 + 2) for (int c4 = max(t1, t1 + 16 * floord(-t1 + c0 - c2 - 1, 16) + 16); c4 <= min(31, n - c2 - 1); c4 += 16) U[(c2 + c4) * n + (t0 + c0 + c1 + 1)] -= (shared_L[0][t0] * shared_U_1[c4][0]); __syncthreads(); } __syncthreads(); } }

Figure B.5: PPCG code and generated CUDA kernel for LU decomposition

__global__ void kernel0(int *a, int *b, int *c, int n) { int b0 = blockIdx.x; int t0 = threadIdx.x; __shared__ int shared_a[32][32]; __shared__ int shared_b[32]; int private_c[1];

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) #define min(x,y) ((x) < (y) ? (x) : (y)) for (int c0 = 32 * b0; c0 < n; c0 += 1048576) { for (int c1 = 0; c1 < n; c1 += 32) { #pragma scop if (n >= t0 + c1 + 1) { for (int i = 0; i < n; i++) { for (int c2 = 0; c2 <= min(31, n - c0 - 1); c2 += 1) c[i] = 0; shared_a[c2][t0] = a[(c0 + c2) * n + (t0 + c1)]; for (int j = 0; j < n; j++) shared_b[t0] = b[t0 + c1]; c[i] += a[i][j] * b[j]; } } __syncthreads(); #pragma endscop if (n >= t0 + c0 + 1 && c1 == 0) private_c[0] = 0; if (n >= t0 + c0 + 1) for (int c3 = 0; c3 <= min(31, n - c1 - 1); c3 += 1) private_c[0] += (shared_a[t0][c3] * shared_b[c3]); __syncthreads(); } if (n >= t0 + c0 + 1) c[t0 + c0] = private_c[0]; __syncthreads(); } }

Figure B.6: PPCG code and generated CUDA kernel for matrix vector multiplication

__global__ void kernel0(int *a, int *c, int n) { int b0 = blockIdx.y, b1 = blockIdx.x; int t0 = threadIdx.y, t1 = threadIdx.x; __shared__ int shared_a[32][32];

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) #define min(x,y) ((x) < (y) ? (x) : (y)) #pragma scop for (int c0 = 32 * b0; c0 < n; c0 += 8192) for (int v0 = 0; v0 < n; v0++) for (int c1 = 32 * b1; c1 < n; c1 += 8192) { for (int v1 = 0; v1 < n; v1++) if (n >= t0 + c1 + 1) c[v0][v1] = a[v1][v0]; for (int c3 = t1; c3 <= min(31, n - c0 - 1); c3 += 16) #pragma endscop shared_a[t0][c3] = a[(t0 + c1) * n + (c0 + c3)]; __syncthreads(); if (n >= t0 + c0 + 1) for (int c3 = t1; c3 <= min(31, n - c1 - 1); c3 += 16) c[(t0 + c0) * n + (c1 + c3)] = shared_a[c3][t0]; __syncthreads(); } }

Figure B.7: PPCG code and generated CUDA kernel for matrix transpose

__global__ void kernel0(int *a, int *b, int *c, int n) { int b0 = blockIdx.y, b1 = blockIdx.x; int t0 = threadIdx.y, t1 = threadIdx.x; __shared__ int shared_a[32][32]; __shared__ int shared_b[32][32]; int private_c[1][2];

#define floord(n,d) (((n)<0) ? -((-(n)+(d)-1)/(d)) : (n)/(d)) #define min(x,y) ((x) < (y) ? (x) : (y)) for (int c0 = 32 * b0; c0 < n; c0 += 8192) for (int c1 = 32 * b1; c1 < n; c1 += 8192) { if (n >= t0 + c0 + 1 && n >= t1 + c1 + 1) { private_c[0][0] = c[(t0 + c0) * n + (t1 + c1)]; if (n >= t1 + c1 + 17) private_c[0][1] = c[(t0 + c0) * n + (t1 + c1 + 16)]; } for (int c2 = 0; c2 < n; c2 += 32) { if (n >= t0 + c0 + 1) #pragma scop for (int c4 = t1; c4 <= min(31, n - c2 - 1); c4 += 16) for (int i = 0; i < n; i++) shared_a[t0][c4] = a[(t0 + c0) * n + (c2 + c4)]; for (int j = 0; j < n; j++) if (n >= t0 + c2 + 1) for (int k = 0; k < n; ++k) for (int c4 = t1; c4 <= min(31, n - c1 - 1); c4 += 16) c[i][j] += a[i][k] * b[k][j]; shared_b[t0][c4] = b[(t0 + c2) * n + (c1 + c4)]; #pragma endscop __syncthreads(); if (n >= t0 + c0 + 1 && n >= t1 + c1 + 1) for (int c3 = 0; c3 <= min(31, n - c2 - 1); c3 += 1) { private_c[0][0] += (shared_a[t0][c3] * shared_b[c3][t1]); if (n >= t1 + c1 + 17) private_c[0][1] += (shared_a[t0][c3] * shared_b[c3][t1 + 16]); } __syncthreads(); } if (n >= t0 + c0 + 1 && n >= t1 + c1 + 1) { c[(t0 + c0) * n + (t1 + c1)] = private_c[0][0]; if (n >= t1 + c1 + 17) c[(t0 + c0) * n + (t1 + c1 + 16)] = private_c[0][1]; } __syncthreads(); } }

Figure B.8: PPCG code and generated CUDA kernel for matrix matrix multiplication

Appendix C

The Implementation for Generating Comprehensive MetaFork Programs

In this appendix, we exhibit the pseudocode of the preliminary implementation of the comprehensive optimization algorithm presented in Chapter 7. Algorithm 12 is the implemented algorithm for generating comprehensive MetaFork programs from a given MetaFork program, while Algorithm 13 and Algorithm 14 comprise the implemented Optimize procedure. In this implementation, we consider two resource counters: the number of registers used per thread and the amount of data per thread block to be cached in the shared memory. We apply three optimization strategies: reducing register pressure, controlling thread granularity, and common sub-expression elimination.
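To make the notion of a case discussion concrete, recall from Chapter 7 that each optimized variant of the input program is guarded by a system of constraints on the hardware resource limits, here the maximum number of registers per thread and the amount of shared memory per thread block (the quantities denoted R B and Z B in Algorithms 13 and 14). One natural way to consume such an output is a host-side dispatch that picks a variant once the actual limits of the target device are known. The sketch below is purely illustrative: the variant names, the register estimates and the cached data amounts are hypothetical and are not produced by the current implementation.

    // Illustrative host-side dispatch over a case discussion produced by the
    // comprehensive optimizer.  R_B and Z_B stand for the register and
    // shared-memory limits of the target device; the three variants and the
    // constants r1, r2, z1 below are hypothetical.
    #include <cstddef>

    enum class Variant { FullCaching, NoCaching, GranularityOne };

    Variant selectVariant(int R_B, std::size_t Z_B)
    {
        const int r1 = 24;                       // estimated registers per thread, variant 1
        const int r2 = 16;                       // after CSE / granularity reduction
        const std::size_t z1 = 32 * sizeof(int); // data cached per thread block, variant 1

        if (r1 <= R_B && z1 <= Z_B)
            return Variant::FullCaching;         // "accept" cases of Algorithms 13 and 14
        if (r2 <= R_B)
            return Variant::NoCaching;           // caching refused, register pressure kept low
        return Variant::GranularityOne;          // fall back to thread granularity one
    }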

Algorithm 12: MultiParametricCodeOptimizer(fileName)
Input: fileName, giving the location of the input program
Output: optimized versions of the input program in the form of a case discussion
        (depending on the hardware resource limits)
 1  plans := [Create Optimization Plan(fileName)];
 2  results := [];
 3  while the number of plans <> 0 do
 4      plan := plans[1]; plans := plans[2..-1];
 5      task := ExtractTask(plan)[1];
 6      new plans := Optimize(plan, task);
 7      for new plan in new plans do
 8          if IsCompleted(new plan) then
 9              results := [new plan, op(results)];
10          else
11              plans := [new plan, op(plans)];
12  return results;
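For readers less familiar with the Maple conventions used in these algorithms (plans[2..-1] drops the first element of a list, and op(L) splices the elements of L into an enclosing list), the worklist structure of Algorithm 12 can be summarized by the following C++-style sketch. The types and procedures Plan, Task, CreateOptimizationPlan, ExtractTask, Optimize and IsCompleted are placeholders standing for the Maple-level objects of the actual implementation, not part of the MetaFork code base.

    // C++-style sketch of the worklist loop of Algorithm 12.
    // Plan and Task are placeholders for the Maple-level records used by the
    // implementation; the four procedures below are declared but not defined here.
    #include <deque>
    #include <string>
    #include <vector>

    struct Task { /* name, current level, final level */ };
    struct Plan { /* constraints, polynomial ring, log, pending tasks */ };

    Plan CreateOptimizationPlan(const std::string &fileName);
    std::vector<Task> ExtractTask(const Plan &plan);
    std::vector<Plan> Optimize(const Plan &plan, const Task &task);
    bool IsCompleted(const Plan &plan);

    std::vector<Plan> MultiParametricCodeOptimizer(const std::string &fileName)
    {
        std::deque<Plan> plans{CreateOptimizationPlan(fileName)};   // step 1
        std::vector<Plan> results;                                  // step 2
        while (!plans.empty()) {                                    // step 3
            Plan plan = plans.front();                              // plan := plans[1]
            plans.pop_front();                                      // plans := plans[2..-1]
            Task task = ExtractTask(plan).front();                  // step 5
            for (Plan &newPlan : Optimize(plan, task)) {            // steps 6-7
                if (IsCompleted(newPlan))
                    results.push_back(newPlan);                     // step 9
                else
                    plans.push_front(newPlan);                      // step 11: keep exploring this branch
            }
        }
        return results;                                             // step 12
    }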


Algorithm 13: Optimize(plan, task) Input: plan, encoding a program being optimized, and task, an optimization task of plan Output: A list of new plans obtained by optimizing plan according to task 1 local caching task, granularity task, register task, new plans, optimized plans, current vars, alternative; 2 new plans := []; 3 if task[NAME] = “Register Pressure Control” and task[CURRENTLEVEL] <= task[FINALLEVEL] then 4 alternative := Copy Optimization Plan(plan); 5 r := RegisterPressure(plan,task[CURRENTLEVEL]); /* Accept case */ 6 plan[CONSTRAINTS] := [‘<=’(r, R B), op(plan[CONSTRAINTS]) ]; 7 if IsConsistent(plan) then 8 register task := FindTask(plan, “Register Pressure Control”); 9 register task[CURRENTLEVEL] := register task[FINALLEVEL] + 1; 10 plan[LOG] := [“Accept register pressure”, op(plan[LOG])]; 11 caching task := FindTask(plan, “Caching”); 12 if caching task[CURRENTLEVEL] > caching task[FINALLEVEL] then 13 granularity task := FindTask(plan, “Granularity Control”); 14 granularity task[CURRENTLEVEL] := granularity task[FINALLEVEL] + 1; 15 plan[LOG] := [“No granularity reduction”, op(plan[LOG])];

16 new plans := [plan, op(new plans)]; /* Refuse case */ 17 alternative[CONSTRAINTS] := [‘<’(R B, r), op(alternative[CONSTRAINTS]) ]; 18 if IsConsistent(alternative) then 19 register task := FindTask(alternative, “Register Pressure Control”); 20 if (register task[CURRENTLEVEL] < register task[FINALLEVEL]) then 21 register task[CURRENTLEVEL] := register task[CURRENTLEVEL] + 1; 22 new plans := [alternative, op(new plans)];

23 else 24 optimized plans := Optimize(alternative,FindTask(alternative, “CSE”)); 25 if evalb(nops(optimized plans) <> 0) then 26 new plans := [op(optimized plans), op(new plans)];

27 else 28 optimized plans := Optimize(alternative,FindTask(alternative, “Granularity Control”)); 29 if evalb(nops(optimized plans) <> 0) then 30 new plans := [op(optimized plans), op(new plans)]; /* No ‘‘else" case since we tried everything we could */ /* to reduce register pressure and we failed! */

/* To be continued in Algorithm 14 */

Algorithm 14: Optimize(plan, task) Input: plan, encoding a program being optimized, and task, an optimization task of plan Output: A list of new plans obtained by optimizing plan according to task /* continuing Algorithm 13 */ 1 else if task[NAME] = “Caching” and task[CURRENTLEVEL] <= task[FINALLEVEL] then 2 alternative := Copy Optimization Plan(plan); 3 z := CacheAmount(plan); /* Accept case */ 4 plan[CONSTRAINTS] := [‘<=’(z, Z B), op(plan[CONSTRAINTS]) ]; 5 current vars := { op((plan[RING])[variables]) }; 6 current vars := (current vars union indets(z)) minus R B, Z B; 7 plan[RING] := RegularChains:∼PolynomialRing([R B, Z B, op(current vars)]); 8 if IsConsistent(plan) then 9 plan[LOG] := [“Accept caching”, op(plan[LOG])]; 10 task[CURRENTLEVEL] := task[FINALLEVEL] + 1; 11 register task := FindTask(plan, “Register Pressure Control”); 12 if register task[CURRENTLEVEL] > register task[FINALLEVEL] then 13 granularity task := FindTask(plan, “Granularity Control”); 14 granularity task[CURRENTLEVEL] := granularity task[FINALLEVEL] + 1; 15 plan[LOG] := [“No granularity reduction”, op(plan[LOG])];

16 new plans := [plan, op(new plans)]; /* Refuse case */ 17 alternative[CONSTRAINTS] := [‘<’(Z B, z), op(alternative[CONSTRAINTS]) ]; 18 alternative[RING] := plan[RING]; 19 if IsConsistent(alternative) then 20 optimized plans := Optimize(alternative,FindTask(alternative, “Granularity Control”)); 21 if evalb(nops(optimized plans) <> 0) then 22 new plans := [op(optimized plans), op(new plans)];

23 else 24 optimized plans := Optimize(alternative,FindTask(alternative, “CSE”)); 25 if evalb(nops(optimized plans) <> 0) then 26 new plans := [op(optimized plans), op(new plans)];

27 else 28 task := FindTask(alternative, “Caching”); 29 task[CURRENTLEVEL] := task[FINALLEVEL] + 1; 30 new plan := AbandonCaching(alternative); 31 new plan[LOG] := [“Refuse caching”, op(new plan[LOG])]; 32 new plans := [new plan, op(new plans)];

33 else if task[NAME] = “CSE” and task[CURRENTLEVEL] <= task[FINALLEVEL] then 34 task[CURRENTLEVEL] := task[CURRENTLEVEL] + 1; 35 new plan := ApplyCSE(plan, task[CURRENTLEVEL]); 36 new plan[LOG] := [“CSE applied”, op(new plan[LOG])]; 37 new plans := [new plan, op(new plans)];

38 else if task[NAME] = “Granularity Control” and task[CURRENTLEVEL] <= task[FINALLEVEL] then 39 task[CURRENTLEVEL] := task[FINALLEVEL] + 1; 40 new plan := SetGranularityToOne(plan); 41 new plan[LOG] := [“Granularity set to 1”, op(new plan[LOG])]; 42 new plans := [new plan, op(new plans)];

43 return (new plans);

Curriculum Vitae

Name: Xiaohui Chen

Education Degrees:
Doctor of Philosophy in Computer Science
University of Western Ontario, 2012.09 - 2016.09

Master in Computer Science and Technology
University of Science and Technology of China, 2009.09 - 2012.05

Bachelor in Computer Science and Technology
Northwestern Polytechnical University, 2005.09 - 2009.07

Related Work Experience:
Internship in the compiler group, IBM Toronto Lab
2014.09 - 2015.01, 2015.09 - 2016.01, 2016.07 - 2016.11

Publications:

• Xiaohui Chen, Marc Moreno Maza, Sushek Shekar and Priya Unnikrishnan. “MetaFork: A framework for concurrency platforms targeting multicores”. In Proceedings of IWOMP 2014, pages 30–44, 2014.

• Changbo Chen, Xiaohui Chen, Abdoul-Kader Keita, Marc Moreno Maza and Ning Xie. “MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators and Its Application to the Generation of Parametric CUDA Kernels”. In Proceedings of the 25th Annual International Conference on Computer Science and Software Engineering (CASCON ’15), IBM Corp., November 2015, pages 70–79.
