Improving the programmability of GPU architectures

Citation for published version (APA):
Nugteren, C. (2014). Improving the programmability of GPU architectures. Technische Universiteit Eindhoven. https://doi.org/10.6100/IR771987

DOI: 10.6100/IR771987
Document status and date: Published: 01/01/2014
Document version: Publisher’s PDF, also known as Version of Record (includes final page, issue and volume numbers)

Improving the Programmability of GPU Architectures

DISSERTATION

to obtain the degree of doctor at the Technische Universiteit Eindhoven, on the authority of the rector magnificus prof.dr.ir. C.J. van Duijn, to be defended in public before a committee appointed by the Doctorate Board on Wednesday 30 April 2014 at 16:00

by

Cedric Nugteren

born in Dordrecht

This dissertation has been approved by the promotors, and the composition of the doctorate committee is as follows:

chairman: prof.dr.ir. A.C.P.M. Backx
1st promotor: prof.dr. H. Corporaal
2nd promotor: prof.dr.ir. H.E. Bal (Vrije Universiteit Amsterdam)
members: prof.dr. P.H.J. Kelly (Imperial College London)
dr. A. Cohen (École Polytechnique)
dr.ir. A.L. Varbanescu (Universiteit van Amsterdam)
prof.dr.ir. G. de Haan
prof.dr. J.J. Lukkien

PhD thesis
Improving the Programmability of GPU Architectures

Doctorate committee:
prof.dr. H. Corporaal, Eindhoven University of Technology, promotor
prof.dr.ir. H.E. Bal, Vrije Universiteit Amsterdam, promotor
prof.dr.ir. A.C.P.M. Backx, Eindhoven University of Technology, chairman
prof.dr. P.H.J. Kelly, Imperial College London
dr. A. Cohen, École Polytechnique
dr.ir. A.L. Varbanescu, Universiteit van Amsterdam
prof.dr.ir. G. de Haan, Eindhoven University of Technology
prof.dr. J.J. Lukkien, Eindhoven University of Technology

This work was supported by the Dutch government in their Point-One research program within the Morpheus project PNE101003 and carried out at the TU/e.

Advanced School for Computing and Imaging
This work was carried out in the ASCI graduate school.
ASCI dissertation series number 295.

This PhD-trajectory included a 4-month HiPEAC sponsored internship at ARM.

© Cedric Nugteren 2014. All rights are reserved. Reproduction in whole or in part is prohibited without the written consent of the copyright owner.

Printing: Printservice Technische Universiteit Eindhoven
A catalogue record is available from the Eindhoven University of Technology Library.
ISBN: 978-90-386-3599-6

Table of contents

Preface

1 Introduction
  1.1 Using GPUs as accelerators
  1.2 Key aspects of the GPU architecture
  1.3 Problem statement
  1.4 Contributions and thesis outline
  1.5 Context of this work

2 Motivation and outlook
  2.1 Current trends
    2.1.1 The multi-core and many-core decade
    2.1.2 The memory wall
    2.1.3 Implications to programmability
  2.2 The prospect of dark silicon
    2.2.1 Dark and dim silicon
    2.2.2 Implications to computer architecture
    2.2.3 Implications to programmability
  2.3 Addressing programmability issues
    2.3.1 Programming languages and frameworks
    2.3.2 Architectural support for programmability
    2.3.3 Iterative compilation
  2.4 Example: An adaptive GPU architecture
    2.4.1 Parameter 1: The number of active threads
    2.4.2 Parameter 2: Compute-memory ratio
    2.4.3 Parameter 3: Core and warp sizing
    2.4.4 Discussion
  2.5 Summary

3 Classifications of program code
  3.1 A survey of algorithm classifications
    3.1.1 High abstraction-level classifications
    3.1.2 Algorithmic skeletons and related classifications
    3.1.3 Directive-based classifications
    3.1.4 Mathematical code representations
    3.1.5 Evaluation of existing classifications
  3.2 Algorithmic species
    3.2.1 Background: the polyhedral model
    3.2.2 Polyhedral model-based algorithmic species
    3.2.3 Automatic extraction of species
    3.2.4 Evaluation and discussion
    3.2.5 Conclusions
  3.3 Algorithmic species revisited
    3.3.1 Array reference characterisations
    3.3.2 Array reference-based algorithmic species
    3.3.3 Automatic extraction of species
    3.3.4 Evaluation and discussion
    3.3.5 Conclusions
  3.4 Finer-grained species
    3.4.1 Species+: a finer-grained classification
    3.4.2 Evaluation and discussion
    3.4.3 Conclusions

4 Compilation using algorithmic skeletons
  4.1 A survey of source-to-source compilers
    4.1.1 Directives using hiCUDA
    4.1.2 Algorithmic skeletons through SkePU
    4.1.3 OpenACC directives with PGI Accelerator
    4.1.4 Automatic compilation with Par4All and PPCG
    4.1.5 Evaluation and discussion
  4.2 A skeleton-based source-to-source compiler
    4.2.1 Example skeletons
    4.2.2 Compiler optimisations
  4.3 Optimising host-accelerator data transfers
  4.4 Kernel fusion
    4.4.1 Legality of fusion
    4.4.2 Performance considerations
  4.5 Experimental results
    4.5.1 Evaluating compiler optimisations
    4.5.2 Comparison of multiple targets
    4.5.3 Comparison against the state-of-the-art
  4.6 Discussion

5 Towards a programmable GPU architecture
  5.1 A detailed GPU cache model
    5.1.1 Related work
    5.1.2 Background: reuse distance theory
    5.1.3 Parallel execution model
    5.1.4 Memory latencies
    5.1.5 Cache associativity
    5.1.6 Miss-status holding-registers
    5.1.7 Warp divergence
    5.1.8 Implementation of the model
    5.1.9 Micro-benchmarks
    5.1.10 Verification of the model
    5.1.11 Example use: evaluating cache parameters
    5.1.12 Summary and future work
  5.2 A case for locality-aware thread scheduling
    5.2.1 Related work
    5.2.2 Experimental setup
    5.2.3 The potential of thread scheduling
    5.2.4 Detailed case studies
    5.2.5 Summary and future work

6 Conclusions and future work
  6.1 Conclusions
  6.2 Future work

Bibliography
Summary
Acknowledgements
About the author

Preface

In front of you lies the pinnacle of a PhD student’s hard work: the thesis. However, hidden from the reader’s eyes is the path that has led to this result. How was the subject chosen? What was left out? What were the difficult parts? Which part of the process could be improved? Prospective and current PhD students, as well as anyone else interested in the process that led to this thesis, are encouraged to read this preface.

In retrospect, the subject of this work was chosen in 2008 by browsing through the first CUDA documents for my master’s thesis work. Because CUDA and the acceleration of scientific workloads on GPUs were new at the time, mapping application X on GPU Y was still considered a scientific contribution. I therefore spent the first months of my PhD accelerating a histogram computation on a GPU, which led to a publication at the GPGPU workshop. The results of this work encouraged me to investigate whether the GPU architecture could be improved. But how to do any experiments without a simulator or model at hand? The lack of a simulator forced me to search for a different research question, eventually leading to the question “how can a compiler improve the programmability of GPU architectures?”

My advisors pointed me to the work of Wouter Caarls, describing skeleton-based compilation for embedded systems. Without a compiler background and without substantial literature research (don’t try this at home), I started the development of my own skeleton-based compiler, focussing on GPUs instead. Although this would eventually lead to bones (the compiler presented in chapter 4 of this thesis), a more thorough background study would have saved time and work (including my ‘deprecated’ SAMOS and GPGPU papers).
