Optimising Purely Functional GPU Programs
Trevor L. McDonell

A thesis in fulfilment of the requirements for the degree of Doctor of Philosophy
School of Computer Science and Engineering
The University of New South Wales
2015

COPYRIGHT STATEMENT

'I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or hereafter known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstracts International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.'

AUTHENTICITY STATEMENT

'I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.'

ORIGINALITY STATEMENT

'I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.'

This thesis is submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, School of Computer Science and Engineering, The University of New South Wales, Australia.

Contents

Optimising Purely Functional GPU Programs
Contents
Figures
Listings
Tables
Abstract
Acknowledgements

1 Introduction

2 Basics
  2.1 Data parallelism
  2.2 GPU computing
  2.3 CUDA
  2.4 Data parallel algorithms in CUDA
    2.4.1 Reduction
    2.4.2 Scan
    2.4.3 Backpermute
    2.4.4 Permute
    2.4.5 Stencil
    2.4.6 Segmented operations
  2.5 Embedded domain-specific languages
    2.5.1 Shallow embedding
    2.5.2 Deep embedding
  2.6 Discussion

3 Embedding array computations
  3.1 The Accelerate EDSL
    3.1.1 Computing a vector dot product
    3.1.2 Arrays, shapes, and indices
    3.1.3 Arrays on the host and device
    3.1.4 Array computations versus scalar expressions
    3.1.5 Computing an n-body gravitational simulation
    3.1.6 Non-parametric array representation
    3.1.7 Richly typed terms
    3.1.8 Tying the recursive knot
  3.2 Manipulating embedded programs
    3.2.1 Typed equality
    3.2.2 Simultaneous substitution
    3.2.3 Inlining
    3.2.4 Function composition
    3.2.5 Environment manipulation
  3.3 Related work
  3.4 Discussion

4 Accelerated array code
  4.1 Embedding GPU programs as skeletons
    4.1.1 Array operations as skeletons
    4.1.2 Algorithmic skeletons as template meta programs
  4.2 Instantiating skeletons
    4.2.1 Producer-consumer fusion by template instantiation
    4.2.2 Instantiating skeletons with scalar code
    4.2.3 Representing tuples
    4.2.4 Shared subexpressions
    4.2.5 Eliminating dead code
    4.2.6 Shapes and indices
    4.2.7 Lambda abstractions
    4.2.8 Array references in scalar code
    4.2.9 Conclusion
  4.3 Using foreign libraries
    4.3.1 Importing foreign functions
    4.3.2 Exporting Accelerate programs
  4.4 Dynamic compilation & code memoisation
    4.4.1 External compilation
    4.4.2 Caching compiled kernels
    4.4.3 Conclusion
  4.5 Data transfer & garbage collection
    4.5.1 Device-to-host transfers
    4.5.2 Host-to-device transfers
    4.5.3 Allocation & deallocation
    4.5.4 Memory manager implementation
    4.5.5 Conclusion
  4.6 Executing embedded array programs
    4.6.1 Execution state
    4.6.2 Annotating array programs
    4.6.3 Launch configuration & thread occupancy
    4.6.4 Kernel execution
    4.6.5 Kernel execution, concurrently
    4.6.6 Conclusion
  4.7 Related work
    4.7.1 Embedded languages
    4.7.2 Parallel libraries
    4.7.3 Other approaches
  4.8 Discussion

5 Optimising embedded array programs
  5.1 Sharing recovery
    5.1.1 Our approach to sharing recovery
  5.2 Array fusion
    5.2.1 Considerations
    5.2.2 The main idea
    5.2.3 Representing producers
    5.2.4 Producer/producer fusion
    5.2.5 Producer/consumer fusion
    5.2.6 Exploiting all opportunities for fusion
  5.3 Simplification
    5.3.1 Shrinking
    5.3.2 Common subexpression elimination
    5.3.3 Constant folding
  5.4 Related work
    5.4.1 Shortcut fusion
    5.4.2 Loop fusion in imperative languages
    5.4.3 Fusion in a parallel context
  5.5 Discussion
    5.5.1 Correctness
    5.5.2 Confluence and termination
    5.5.3 Error tests eliminated
    5.5.4 Work duplication
    5.5.5 Ordering and repeated evaluations

6 Results
  6.1 Runtime overheads
    6.1.1 Program optimisation
    6.1.2 Memory management & data transfer
    6.1.3 Code generation & compilation
  6.2 Dot product
    6.2.1 Too many kernels
    6.2.2 64-bit arithmetic
    6.2.3 Non-neutral starting elements
    6.2.4 Kernel specialisation
  6.3 Black-Scholes option pricing
    6.3.1 Too little sharing
  6.4 Mandelbrot fractal
    6.4.1 Fixed unrolling
    6.4.2 Loop recovery
    6.4.3 Explicit iteration
    6.4.4 Unbalanced workload
    6.4.5 Passing inputs as arrays
  6.5 N-body gravitational simulation