Download (2MB)
Total Page:16
File Type:pdf, Size:1020Kb
Urlea, Cristian (2021) Optimal program variant generation for hybrid manycore systems. PhD thesis. http://theses.gla.ac.uk/81928/ Copyright and moral rights for this work are retained by the author A copy can be downloaded for personal non-commercial research or study, without prior permission or charge This work cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given Enlighten: Theses https://theses.gla.ac.uk/ [email protected] Optimal program variant generation for hybrid manycore systems Cristian Urlea Submitted in fulfilment of the requirements for the Degree of Doctor of Philosophy School of Engineering College of Science and Engineering University of Glasgow January 2021 Abstract Field Programmable Gate Arrays (FPGAs) promise to deliver superior energy efficiency in het- erogeneous high performance computing (HPC), as compared to multicore CPUs and GPUs. The rate of adoption is however hampered by the relative difficulty of programming FPGAs. High-level synthesis tools such as Xilinx Vivado, Altera OpenCL or Intel’s HLS address a large part of the programmability issue by synthesizing a Hardware Description Languages (HDL) representation from a high-level specification of the application, given in programming lan- guages such as OpenCL C, typically used to program CPUs and GPUs. Although HLS solutions make programming easier, they fail to also lighten the burden of optimization. Application de- velopers must rely on expert knowledge to manually optimize their applications for each target device, meaning that traditional HLS solutions do not offer a solution to the issue of perfor- mance portability. This state of fact prompted the development of compiler frameworks such as TyTra that operate at an even higher level of abstraction, amenable to the use of Design Space Exploration (DSE). With DSE the initial program specification can be seen as the start- ing location in a search-space of correct-by-construction program transformations. In TyTra the search-space is generated from the transitive-closure of term-level transformations derived from type-level transformations. Compiler frameworks such as TyTra theoretically solve the is- sue of performance portability by providing a way to automatically generate alternative correct program variants. They however suffer from the very practical issue that the generated space is often too large to fully explore. As a consequence, the globally optimal solution may be overlooked. In this work we provide a novel solution to issue performance portability by deriving an efficient yet effective DSE strategy for the TyTra compiler framework. We make use of categor- ical data types to derive categorical semantics for the formal languages that describe the terms, types, cost-performance estimates and their transformations. From these we define a category of interpretations for TyTra applications, from which we derive a DSE strategy that finds the globally optimal transformation sequence in polynomial time. This is achieved by reducing the size of the generated search space. We formally state and prove a theorem for this claim and then show that the polynomial run-time for our DSE strategy has practically negligible coefficients leading to sub-second exploration times for realistic applications. i Contents Abstract i Acknowledgements viii Declaration ix 1 Introduction 1 1.1 Design Space Exploration . .3 1.2 Thesis statement . .5 1.3 Contributions and Publications . .6 2 Background 7 2.1 Performance portability . .8 2.2 Taxonomy of parallel computation . 10 2.2.1 Narrow and wide: data parallelism . 11 2.2.2 High and low: abstraction level . 15 2.2.3 Near and far: locality . 17 2.2.4 The middle way . 22 2.3 Formal methods . 23 2.3.1 Models of computation . 24 2.3.2 Imperative languages . 26 2.3.3 Functional languages . 28 2.3.4 Bridging models . 32 2.4 The TyTra Compiler Framework . 33 2.4.1 TyTra Compiler workflow . 33 2.4.2 TyTra Coordination Language . 38 2.4.3 TyTra Intermediate Representation . 45 2.4.4 TyTra Semantics . 46 2.5 Introduction to Category Theory . 50 ii CONTENTS iii 3 Related Work 58 3.1 Practical . 58 3.1.1 Imperative Languages . 59 3.1.2 Functional Languages . 67 3.1.3 Program and Behavioural Synthesis . 71 3.1.4 Compiler Optimization Techniques . 72 3.2 Theoretical . 73 3.2.1 Structured Parallelism . 73 3.2.2 Categorical Data Types and Techniques . 74 3.3 Graphical summary . 78 4 Categorical Semantics 80 4.1 Categorical Data Types . 81 4.1.1 Optimisation from recursion schemes . 85 4.1.2 Terms and Transformations . 87 4.1.3 Types and Type Constructors . 90 4.1.4 Cost-Performance Estimates . 95 4.2 DSE in Categorical Terms . 99 4.2.1 Specification . 99 4.2.2 Analysis . 106 4.2.3 Selection . 107 4.3 Optimal DSE in TyTra . 108 4.3.1 Naïve tactic . 111 4.3.2 Expert tactic . 116 4.4 TyTra Categorical Semantics . 123 4.4.1 Type Inference and checking . 124 4.4.2 Correct term transformation . 125 4.4.3 Cost-performance aware transformation . 126 4.4.4 Borrowing structure . 127 4.4.5 Fused DSE . 137 4.5 Efficient DSE Theorem . 140 4.5.1 Proof . 140 5 Experimental Evaluation, Conclusion & Future Work 143 5.1 Experimental validation . 144 5.2 Conclusion . 151 5.3 Future work . 152 List of Figures 1.1 Conceptual DSE stages: specification, analysis, selection. .3 2.1 Abstract three-dimensional space of programming languages and compilers. 10 2.2 CPU (narrow) and GPU (wide) architectures, representing the number of parallel compute units . 11 2.3 Pragma directives have a local effect, typically function, variable scope or basic- block level. 13 2.4 SIMD (left) vs scalar (right) addition. 14 2.5 FPGAs: A switched sea of lookup tables. 15 2.6 Harvard Architectures and logical distance. 18 2.7 Relations between parallel architectures and tools. 22 2.8 Graphical depiction of a finite state machine. 26 2.9 Composition of finite state machines. 27 2.10 Lambda abstraction and application. Execution Trace. 29 2.11 I and K combinators. 30 2.12 The Haskell Monad interface. 31 2.13 High-level view of the TyTra Compiler Workflow. 33 2.14 Detailed view of the TyTra Compiler Workflow. 34 2.15 Performance measured in clock-cycles. 36 2.16 Simplified presentation of the performance model. 37 2.17 TyTra CL term-level expression grammar. 39 2.18 TyTra CL expressions and actions . 39 2.19 Alternative graphical representation for TyTra CL expressions and actions. 39 2.20 TyTra CL Size Constructors. 41 2.21 The TyTra CL Atomic Type Constructor . 41 2.22 TyTra CL Structural Type Constructors. 42 2.23 Graphical depiction of a TyTra CL expression and types. 42 2.24 An abstract type transformation. 43 2.25 Nested type constructors can be simply seen as a single constructor. 43 2.26 The effect of Vector type Split and Merge operations on terms. 44 iv LIST OF FIGURES v 2.27 Type-level merge/split rules. Their composition yields the identity transformation. 44 2.28 TyTra IR, its sub-languages and the relationship to TyTra CL . 45 2.29 Abstract and full interpretation (sequential). 49 2.30 Abstract interpretation (average time). 49 2.31 The category C with three objects. 50 2.32 Composition or morphisms f and g in C..................... 51 2.33 The composition of morphisms must be associative. 51 2.34 The objects of a category are identified by their identity morphisms. 52 2.35 Functors: structure preserving maps between categories. 52 2.36 Triangle identity. 54 2.37 Pentagon identity. 54 2.38 Morphisms from Initial F-Algebra to all F-Algebras in that category. 56 3.1 OpenCL compilation and runtime. 59 3.2 OpenCL optimization workflow. 60 3.3 Fusion rules in Lift. 69 3.4 Cancellation rules in Lift. 69 3.5 The OpenCL Specific map, reduce and reorderStride rules in Lift. 69 3.6 Denotational semantics relating the various map implementations to an abstract map operation in the Lift compiler. 69 3.7 The build/fold rule corresponds to the Hylomorphism recursion scheme. 77 3.8 Overview of the relationship between practical and theoretical related work. 78 3.9 Hylomorphisms as a unifying construct for the object-language of Structured Arrows....................................... 79 3.10 Hylomorphisms as a unifying construct for the meta-language of the TyTra Compiler Framework. 79 4.1 ListF : 1 + a × List a describes an Endofunctor. 81 4.2 Category of Integers and List of Integers. ..