Imperial College London
Department of Computing

Inter-workgroup Barrier Synchronisation on Graphics Processing Units

Tyler Rey Sorensen

May 2018

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Computing of Imperial College London and the Diploma of Imperial College London

Declaration

This thesis and the work it presents are my own except where otherwise acknowledged.

Tyler Rey Sorensen

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Abstract

GPUs are parallel devices that are able to run thousands of independent threads concurrently. Traditional GPU programs are data-parallel, requiring little to no communication, i.e. synchronisation, between threads. However, classical concurrency in the context of CPUs often exploits synchronisation idioms that are not supported on GPUs. By studying such idioms on GPUs, with an aim to facilitate them in a portable way, a wider and more generic space of GPU applications can be made possible. While the breadth of this thesis extends to many aspects of GPU systems, the common thread throughout is the global barrier: an execution barrier that synchronises all threads executing a GPU application. The idea of such a barrier might seem straightforward; however, this investigation reveals many challenges and insights. In particular, this thesis includes the following studies:

• Execution models: while a general global barrier can deadlock due to starvation on GPUs, it is shown that the scheduling guarantees of current GPUs can be used to dynamically create an execution environment that allows for a safe and portable global barrier across a subset of the GPU threads.

• Application optimisations: a set of GPU optimisations is examined, tailored for graph applications, including one optimisation enabled by the global barrier. It is shown that these optimisations can provide substantial performance improvements, e.g. the barrier optimisation achieves over a 10× speedup on AMD and Intel GPUs. The performance portability of these optimisations is investigated, as their utility varies across input, application, and architecture.

• Multitasking: because many GPUs do not support preemption, long-running GPU compute tasks (e.g. applications that use the global barrier) may block other GPU functions, including graphics. A simple cooperative multitasking scheme is proposed that allows graphics tasks to meet their deadlines with reasonable overheads.

Acknowledgements

First and foremost I thank my adviser, Ally Donaldson. These four years have shaped me not only as a researcher, but also as a person. Research is difficult mentally, emotionally and socially. I could not have learned from a better mentor than Ally in all of these aspects. I am thankful that Ally trusted me and gave me the freedom to work on the ideas that eventually turned into this thesis. My favourite emails to write to Ally would start with “I have an idea...”. In a few weeks, Ally would patiently help turn my rough thoughts into an actionable plan. He always pushed me to do my best possible work, while also allowing me to develop my own creative voice. I will always look up to his scientific integrity, driven by genuine curiosity and an aim to better the wider scientific community. Although the programme was extremely difficult, working with Ally has been one of the most fulfilling experiences of my life.

I thank all my wonderful collaborators and mentors who have taught me so much about science and have been excellent examples of how to navigate this journey. In particular, I thank John Wickerson, Hugues Evrard, Nathan Chong, Mark Batty, and Sreepathi Pai. I thank my internship mentors who gave me valuable industrial experience: Vinod Grover, Todd Mytkowicz, and Madan Musuvathi. I thank Margaret Martonosi for allowing me to join her group; our short time working together has already been full of amazing opportunities and support. I thank Ganesh Gopalakrishnan for patiently introducing me to research while I was an undergraduate at the University of Utah. He is a role model not only as a scientist, but as a mentor and a human. He is a moral lighthouse that I constantly look to. I am especially grateful to him and Zvonimir Rakamarić for allowing me to spend a few months recharging at the University of Utah during my PhD.

I thank the new friends I made during my time at Imperial College. In particular, I thank Kareem Khazem for being a constant and reliable source of support, advice, and fun. I’m sure East Acton is happy to be rid of us! I thank Carsten Fuhs for making Friday pub night academic enough for me to justify attending. I thank Heidy Khlaaf for paving a way for me. I thank Marija Selakovic (and Jovan) for immediately making me

feel like family. I thank everyone else that I was able to interact with in the Department of Computing and especially the Multicore and Software Reliability groups. I thank Richard Bornat, James Brotherston and David Pym. They selflessly helped me during an extremely difficult time and helped me find my footing at Imperial College. This thesis would not have been finished without their support during that time.

I thank my parents, Ken and Valerie Sorensen, for the principles they instilled in me and for all their support and patience. I wasn’t the most cooperative teenager growing up, but the one thing I knew they wouldn’t tolerate was poor performance in school. Since the first time I had homework in grade school, my parents would ensure that it was completed to the best of my ability. I still carry these expectations with me, along with the confidence that has grown out of them. This journey has been full of rash decisions on my end, but their support gave me the courage to dive in head first. Finally, I will always look up to them for their integrity and work ethic.

I thank the rest of my family: my brothers, Jason and Adam, and my sister, Annalynn. It wasn’t easy being so far away for so long. Although I couldn’t be there for all the birthdays and camping trips, I love you guys.

I thank my oldest and dearest friends, with whom I grew up and still am lucky enough to see occasionally. Something about small town central Utah creates some of the kindest and most genuine people I have ever met. No matter how much time or distance separates us, we can always pick up like it was yesterday. In particular, I thank: Scott and Mardi McDonald for making their spare bedroom my home away from home; Dallin Brown for always being up for an adventure; Clay and Stephanie Nielsen for holding down the hometown; and Jordan Peterson for hanging out in the deep end.

I thank Tony Field and Albert Cohen for providing an engaging final examination. Their feedback and discussions provided a wider context for me to see this work in. For the same reasons I thank all the anonymous reviewers that provided feedback for my submissions. I know reviewing can feel like a thankless task, but it has always improved my work.

I thank all the universities that have supported me and gave me a place to call home: Snow College, University of Utah, Imperial College London, and Princeton University.

We can always do better. I’m going to keep trying if you guys keep trying. Let’s keep rockin’ and rollin’, man. [And97]

Contents

1 Introduction
  1.1 A short story
    1.1.1 Classic concurrency on GPUs?
    1.1.2 A GPU global barrier
    1.1.3 Using the GPU barrier
    1.1.4 Barriers freezing graphics
    1.1.5 Looking forward
  1.2 Contributions
  1.3 Publications
  1.4 Formal acknowledgements

2 Background
  2.1 A brief history
  2.2 GPU hardware
  2.3 GPU programming: OpenCL
    2.3.1 OpenCL feature classes
    2.3.2 OpenCL 2.0 memory consistency model
  2.4 GPUs used in this work
  2.5 Summary

3 A Portable Global Barrier
  3.1 Personal context
  3.2 Motivation
    3.2.1 Chapter contributions
  3.3 Occupancy-bound execution model
  3.4 Occupancy discovery
    3.4.1 Implementation
    3.4.2 Properties and correctness
  3.5 Inter-workgroup barrier
    3.5.1 The XF barrier
    3.5.2 OpenCL 2.0 memory model primer
    3.5.3 Implementation
    3.5.4 Barrier specification
    3.5.5 Correctness of the inter-workgroup barrier
  3.6 Empirical evaluation
    3.6.1 Recall of discovery protocol
    3.6.2 Comparison with the multi-kernel approach
    3.6.3 Portability vs. specialisation
    3.6.4 OpenCL on CPUs
  3.7 Related work
  3.8 Summary

4 Functional and Performance Portability for GPU Graph Applications
  4.1 Personal context
  4.2 Motivation
    4.2.1 Chapter contributions
  4.3 Generalising optimisations to OpenCL
    4.3.1 Cooperative conversion
    4.3.2 Nested parallelism
    4.3.3 Iteration outlining
    4.3.4 Workgroup size
    4.3.5 The optimisation space
  4.4 Experimental methodology
    4.4.1 Terminology
    4.4.2 Creating optimisation strategies by specialisation
    4.4.3 GPU platforms tested
    4.4.4 Benchmarks
    4.4.5 Gathering data
  4.5 Results and discussion
    4.5.1 Top speedups across architectures
    4.5.2 Top speedups for feature classes
    4.5.3 Applying optimisations universally
    4.5.4 Optimisations in top speedups
    4.5.5 Portable optimisations
    4.5.6 Optimisations and features of chips, applications and inputs
    4.5.7 Optimisation policies
  4.6 Revisiting results from Chapter 3
    4.6.1 Multi-kernel vs. oitergb
    4.6.2 Cost of portability
  4.7 Related work
  4.8 Summary

5 Cooperative GPU Multitasking
  5.1 Personal context
  5.2 Motivation
    5.2.1 Chapter contributions
  5.3 Two blocking GPU idioms
    5.3.1 Work stealing
    5.3.2 Graph traversal
  5.4 Cooperative kernels
    5.4.1 Semantics of cooperative kernels
    5.4.2 Programming with cooperative kernels
    5.4.3 Non-functional requirements
  5.5 Prototype implementation
    5.5.1 The megakernel mock up
    5.5.2 Scheduler design
    5.5.3 Alternative semantic choices
  5.6 Evaluation applications and GPUs
    5.6.1 Cooperative applications
    5.6.2 GPUs used in evaluation
    5.6.3 Developing non-cooperative kernels
  5.7 Evaluation
    5.7.1 Overhead of cooperative kernels
    5.7.2 Multitasking via cooperative scheduling
    5.7.3 Comparison with kernel-level preemption
  5.8 Related Work
  5.9 Summary

6 A Formalisation of GPU Fairness Properties
  6.1 Personal context
  6.2 Motivation
    6.2.1 Schedulers
    6.2.2 Semi-fair schedulers
    6.2.3 Chapter contributions
  6.3 GPU program assumptions
  6.4 Formal program reasoning
  6.5 Formalising semi-fairness
  6.6 Inter-workgroup synchronisation in the wild
  6.7 Unified GPU semi-fairness
    6.7.1 LOBE discovery protocol
  6.8 Summary

7 Conclusion
  7.1 Contributions
  7.2 Future work
    7.2.1 Immediate future work
    7.2.2 Fundamental future work
  7.3 Summary

Bibliography

List of Tables

2.1 OpenCL execution environment functions.
2.2 The GPUs considered in this work, a short name used throughout the thesis (Short name), the number of compute units (#CUs), the subgroup size (SG size), the supported OpenCL version (OCL), and the chapter(s) where each GPU is used for empirical studies (Chapter).

3.1 Occupancy discovery execution environment functions.
3.2 Occupancy for each chip and microbenchmark for occupancy bound (OB), and the average discovered occupancy using the ticket-lock (TL) and spin-lock (SL).
3.3 Occupancy for two CPU chips; the occupancy bound (OB), and the average discovered occupancy (out of 50 runs) using the ticket-lock with different amounts of delay between the polling and closing phases.
3.4 Pannotia application speedups using the inter-workgroup barrier vs. multi-kernel (higher is better), and Lonestar-GPU application slowdowns for using a portable vs. non-portable barrier (lower is better).

4.1 List of optimisations, the OpenCL features they exploit and the architectural parameters that influence performance. Feature class refers to the OpenCL feature classes described in Section 2.3.
4.2 The applications considered along with their available optimisations. Application variants considered state-of-the-art are noted with a (*). Applications with no variants have only one implementation.
4.3 The maximum speedup and slowdown per chip and associated application. The input for each is usa.ny.
4.4 The top and bottom five universal optimisation strategies ranked according to the number of tests slowed down.
4.5 The optimisations enabled by the analysis for the following levels of portability: global, per-chip, per-app, and per-input.
4.6 Microbenchmark results for subgroup atomic combining and workgroup memory divergence.
4.7 Kernel launches and worklist sizes for different inputs when running BFSwl.
4.8 Average worklist sizes for three variants of BFS across different inputs.
4.9 Number of kernel launches and average time (ms) per kernel for Hd5500.
4.10 Average runtime (ms) over 20 iterations of the discovery protocol and the portability overhead of the BFS application of Section 3.6.3 using two mutex strategies.

5.1 Overview of new cooperative kernel constructs.
5.2 Blocking GPU applications investigated: the number of global barriers converted into resizing barriers, the number of cooperative constructs added, the lines of code (LoC) and the number of inputs.
5.3 Period and execution time for each rendering task.
5.4 Cooperative kernel slowdown without multitasking.
5.5 The number of resizing barriers for the barrier applications of Figure 5.6b and 5.6c, along with their execution time (on Hd5500) and average frequency of resizing barrier calls.
5.6 Application overhead of multitasking three graphics workloads using kernel-level preemption and using cooperative kernels.

6.1 Blocking synchronisation idioms considered in this work.
6.2 Blocking synchronisation idioms guaranteed starvation-freedom under various schedulers.

List of Figures

2.1 A vector addition OpenCL kernel using local memory and workgroup synchronisation.

3.1 A non-portable kernel that potentially deadlocks.
3.2 Illustration of code transformation required when going from (a) the native OpenCL programming style to (b) the persistent thread programming style, where kernels are parameterised by N, the number of executing workgroups.
3.3 Occupancy discovery protocol, executed by a single representative thread per workgroup.
3.4 XF inter-workgroup execution barrier, one master workgroup executes (a), while all other workgroups execute (b).
3.5 Two OpenCL synchronisation patterns used by the XF inter-workgroup barrier.
3.6 Idiomatic execution of XF barrier.
3.7 Discovered occupancy and number of compute units compared to the occupancy bound.
3.8 Protocol timing for K5200 and Mali-2.
3.9 Runtime comparison of multi-kernel paradigm vs. inter-workgroup barrier.
3.10 Portability slowdown for inter-workgroup barriers.
3.11 Portability overhead as runtime increases.

4.1 Summary of optimisations used per-chip to obtain the top speedups.
4.2 Summary of top speedups for all benchmarks, split by chip.
4.3 Summary of top speedups over all tests, grouped by feature class.
4.4 Percent of tests for which each optimisation is beneficial, benign, or harmful.
4.5 The percentage of tests for each optimisation strategy that provided a speedup, no difference or slowdown. Concrete test counts are given on the bars.
4.6 The geomean slowdown compared to the oracle across all tests for different optimisation strategies. Concrete test counts are given on the bars.
4.7 Results of the kernel launch frequency microbenchmark per chip.
4.8 Heatmap of the geomean slowdown compared to the oracle of executing an optimisation configuration tuned for one chip on all other chips.

5.1 Cooperative kernels can flexibly resize to allow other tasks, e.g. graphics, to run concurrently.
5.2 An excerpt of a work stealing algorithm in OpenCL.
5.3 An outline of an OpenCL graph traversal.
5.4 Cooperative kernel version of the work stealing kernel of Figure 5.2. Changes to correctly use cooperative features are highlighted.
5.5 Cooperative kernel version of the graph traversal kernel of Figure 5.3. Changes to correctly use cooperative features are highlighted.
5.6 Gather time and non-cooperative task time results for three cases: (a) OCTREE on Iris; (b) MIS[eco] on Hd520; and (c) BFS[rmat22] on Hd5500.
5.7 (a) slowdown of cooperative kernel when multitasking various non-cooperative workloads, and (b) the period with which non-cooperative kernels are able to execute.

6.1 The semi-fair schedulers defined in this work from strongest to weakest.
6.2 Two threaded mutex idiom (a) program code and (b) corresponding LTS.
6.3 Two threaded PC idiom (a) program code and (b) corresponding LTS. Omitting (a) lines in gray and (b) states and transitions in gray and dashed lines yields the one-way variant of this idiom.
6.4 Sub-LTS of a barrier, with an optional discovery protocol preamble.

1 Introduction

This thesis covers a range of topics related to GPU computing, namely: GPU execution models, optimisations for GPU applications, and multitasking on GPUs. However, the motivation behind the work has always had a simple foundation: explore uncharted territories of concurrent interactions on GPUs to gain the understanding required to build synchronisation constructs that, in turn, enable new ways for applications to take advantage of GPU acceleration. The synchronisation construct that this thesis focuses on is a global barrier, which aligns the computation of all threads executing a program. Such a construct is widely available as a primitive in other parallel programming frameworks, yet it is not provided as a portable primitive on GPUs. Given this, it is only natural for developers to (1) desire a GPU global barrier in order to leverage intuitions and applications from other common parallel frameworks, and (2) wonder for what reasons such a barrier is not provided.

As this thesis shows, the implementation, application use cases, and multitasking consequences of a global barrier on GPUs present many interesting problems and yield insights that we believe are fundamental to parallel programming. Although the global barrier is the unifying thread of this thesis, the studies presented have a wider breadth, with interesting results in many areas of GPU computing.

The outline of this chapter is as follows: first, the issues investigated in this thesis are illustrated using a short story about two curious undergraduate students attempting to write GPU programs (Section 1.1). Afterwards, the thesis contributions are more formally presented, organised by chapter (Section 1.2). Following this, a list of publications by the author is given, including both publications directly included in this thesis and other publications, the involvement in which has had a substantial influence on the works presented here (Section 1.3). The chapter concludes with formal acknowledgments of contributions that have directly influenced the work in this thesis (Section 1.4).

1.1 A short story

Inspired by the introduction of Marino et al.’s paper about the difficulties of weak memory consistency models [MMM+15] as they might appear to an undergraduate computer science student.

1.1.1 Classic concurrency on GPUs?

Scott has just finished his second year as a computer science undergraduate student. His favourite course this term was a module on concurrency where he learned about Pthreads [NBF96], OpenMP [Ope15], and Java threads [PGB+05]. He was intrigued by the fantastic speedups that parallelism could offer on today’s machines, but he also understood that parallel programming is difficult, having spent many late nights debugging his coursework. Over the summer holiday, Scott wanted to practice parallel programming. Being a curious student, Scott wanted to try programming a GPU. While GPUs weren’t covered in the module, Scott was very aware of these devices as they were popular in high-profile areas of computing such as machine-learning, crypto-currency, and video games (of course!). Additionally, he had read that GPUs could be programmed in a manner similar to that which he had learned in his module, using languages such as OpenCL [Khr15]. Scott was excited to find that his laptop had an Intel HD5500 GPU equipped with OpenCL-capable drivers.

1.1.2 A GPU global barrier

After working through the usual GPU tutorials of reductions and matrix multiplication, Scott wanted to try applying material he had learned in his course. He was particularly fond of barrier synchronisation, which aligned the computation of all the threads running the program. By helping to tame the large amount of scheduling non-determinism of concurrent programs, applications that used this bulk synchronisation construct were easier for him to reason about. Most of the languages he had studied in his course provided barriers as primitives, but OpenCL did not, at least not a global barrier across all the threads executing the program. He knew, however, from his favourite textbook [HS08, ch. 17], that he could implement his own barrier using atomic read-modify-write instructions, which were provided in OpenCL. Before long, Scott had a barrier implementation in OpenCL and a short driver program to test the barrier. However, his excitement was short-lived; every time he tried to execute

the barrier program, his computer froze and he had to restart it!

Frustrated, but not defeated, Scott started searching the internet to understand what was happening. He quickly found a reference to a research paper discussing barriers on GPUs [XF10], where he learned that he should only run his barrier program with as many workgroups (OpenCL thread groups) as the GPU had processors, or compute units. This solution seemed simple enough. Using the OpenCL device query functions [Khr15, ch. 4.2], Scott learned that his HD5500 had 24 compute units. However, even when the program was restricted to only run with this number of workgroups, the barrier program still froze the computer! Scott’s frustration turned into determination and he continued debugging, trying everything he could think of. Eventually, late into the night, Scott found that the barrier program would work if executed with up to three workgroups. However, executing the program with four or more workgroups caused the computer to freeze.

Exhausted, but feeling triumphant, Scott wanted to tell someone about his success. He knew that his concurrency classmate, Mardi, had a machine built for gaming with a high-end AMD R9 Fury GPU. He sent her the OpenCL barrier program to try.

The next morning, Scott saw a message from Mardi. She had successfully run the barrier program on her AMD GPU! However, in his exhausted state, he had forgotten to tell her to limit the workgroup count, and yet, somehow the barrier driver program still ran without issue for her using the full number of compute units on her AMD GPU.

Key idea: A standard barrier implementation may deadlock due to starvation on current GPUs. Prior work [XF10, GSO12] has shown that such barriers can execute successfully if the number of workgroups that synchronise is limited to the number of workgroups that can run in parallel on the GPU. However, this number is difficult to determine statically. Depending on the architecture, it may or may not correspond to the number of documented processors (compute units) on the GPU. To complicate matters further, the number of workgroups that execute in parallel is also affected by the shared memory and register usage of the program.
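To make the failure mode concrete, the following is a minimal sketch, not code from this thesis, of the kind of textbook single-counter barrier Scott might have written in OpenCL 2.0 C (the kernel name and the design are assumptions for illustration). If the GPU schedules only some of the launched workgroups, the spinning workgroups starve the ones that have not yet started, and the kernel hangs.

  // Sketch of a "textbook" single-counter global barrier (illustrative only).
  // The host is assumed to initialise *arrived to 0 before the launch.
  kernel void naive_global_barrier(global atomic_int *arrived) {
    // One representative thread per workgroup takes part in the barrier.
    if (get_local_id(0) == 0) {
      atomic_fetch_add(arrived, 1);                       // announce arrival
      // Spin until every launched workgroup has arrived. If some workgroups
      // are never scheduled, this loop never exits and the GPU appears to freeze.
      while (atomic_load(arrived) < (int)get_num_groups(0)) {
        ; // busy-wait
      }
    }
    // Release the remaining threads of this workgroup.
    barrier(CLK_GLOBAL_MEM_FENCE);
    // ... code that assumes all workgroups are now synchronised ...
  }

The barrier developed in Chapter 3 additionally addresses the memory-consistency and occupancy issues that this naive sketch ignores.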

Scott found this experience disconcerting. None of the other parallel programming models that he had studied before limited the number of threads that he could synchronise successfully with a barrier. Even more strange was that his GPU behaved differently from Mardi’s in terms of the relationship between the reported number of processors and the number of workgroups that could synchronise using the barrier. However, these concerns could be put aside for now: he was excited to find a GPU application that he could accelerate using this barrier.

1.1.3 Using the GPU barrier

Scott began looking for GPU applications in which his barrier might be useful. He noticed that many common GPU applications, like matrix multiplication, were embarrassingly parallel, i.e. there was little to no communication between threads. Because of this, it wasn’t clear that a barrier could be useful in such applications. After some searching, Scott found several recent works describing graph computations on GPUs [CBRS13, PP16, BNP12], e.g. breadth first search (BFS). He had learned about these algorithms in his early courses and decided to investigate.

While looking at a GPU implementation of BFS [BNP12], Scott noticed that the program as a whole was not just a single GPU program, but rather it was a series of small GPU programs launched iteratively from the CPU. The reason for this, Scott learned, was that global synchronisation was required between each iteration of the GPU program, and this was achieved via repeat GPU program launches [Khr15, ch. 3.2.4]. He was excited to realise that his global barrier could provide the same global synchronisation without having to repeatedly relaunch the GPU program. If the barrier overhead was less than the overhead of repeat program launches, then there could be some performance benefits!

It wasn’t difficult to apply the barrier optimisation to the BFS application, and soon Scott was able to do some timing experiments. He was able to find the road-network of New York in a graph representation, which he used as an input to his application. He was thrilled when he saw that his barrier-optimised BFS application outperformed the original iterative application by 3.5×!

Again, wanting to share his success, he sent his barrier-optimised application to Mardi. He knew Mardi was working on analysing social networks, which can be represented as graphs similar to his road-networks, so perhaps the application would be useful for her. The next day he had a new message from Mardi, this one not as positive as her last. The barrier-optimised BFS application on her social network graphs was not any faster than the iterative application on her AMD GPU! Again, Scott found this disconcerting: even after all the work to get a functional barrier, it did not provide a performance improvement for a different GPU and use case.
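The two structures Scott is comparing can be sketched roughly as follows; this is an assumed illustration, not the actual code, and bfs_level_kernel, frontier_empty, process_level and global_barrier are hypothetical names.

  /* Iterative (multi-kernel) structure: the host relaunches a kernel once per
     BFS level; the end of each launch acts as the global synchronisation point. */
  while (!host_done) {                        /* host_done updated from a GPU flag */
    clEnqueueNDRangeKernel(queue, bfs_level_kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);                          /* wait for this level to complete */
  }

  /* Barrier-optimised structure (device side): one long-running kernel whose
     workgroups synchronise with a software global barrier between levels. */
  kernel void bfs_persistent(global int *graph /* , frontier buffers, ... */) {
    while (!frontier_empty()) {               /* hypothetical helper */
      process_level();                        /* hypothetical helper */
      global_barrier();                       /* hypothetical software barrier */
    }
  }

Whether the in-kernel barrier wins depends on how the barrier overhead compares to the accumulated cost of kernel relaunches, which is exactly the trade-off studied later in this thesis.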

Key idea: Performance portability is known to be difficult (e.g. see [SRD17]) and graph applications on GPUs are no exception. Due to the irregular parallelism of such applications, the utility of optimisations depends not only on architectural features, but also on application input. A global barrier optimisation that achieves impressive speedups in one instance may not have a performance effect in a different instance; in fact, it can even cause a slowdown.

1.1.4 Barriers freezing graphics

In addition to the barrier-optimised BFS application not speeding up Mardi’s social network graph computation, she also complained that her entire computer would have short periods of temporary freezes when running the application. These freezes were intrusive enough that she could not work on her computer while the application ran in the background, as she was used to doing with the original iterative application.

Scott hadn’t noticed the temporary freezes: he had been so focused on the task at hand that he hadn’t tried to use his computer for anything else while running the timing experiments. When he started programming again he noticed that Mardi was right; his computer would temporarily freeze while the barrier-optimised BFS application was running. Even though the analysis only took about a second to run, the UI disruption was clearly noticeable. This GPU barrier, which started as an innocent exploration of core ideas from his concurrency course, was having issues at every point!

Key idea: Many current GPUs do not support preemptive multi-tasking. If long-running compute tasks (e.g. a barrier-optimised BFS) are run on a GPU that is also responsible for graphical UI, then the UI may become unresponsive until the compute task finishes executing.

1.1.5 Looking forward

It was now almost the end of the summer holiday and the term would be starting soon. Scott and Mardi felt refreshed and prepared from their extracurricular programming exercises. They spent the next day planning their fall term schedules. In particular, Scott noticed a course titled “Formal Programming Reasoning”, in which maths and logic were used to describe the behaviours of programs. He thought back to the strange differences of behaviour

between his Intel GPU and Mardi’s AMD GPU when executing the global barrier. Could there be any sort of maths to describe what they observed? He couldn’t resist signing up!

Key idea: The problematic barrier behaviours that cause machine freezes can be formally reasoned about using temporal logic. These freezes occur because current GPUs do not provide fair scheduling guarantees. However, because limited variants of the barrier do execute successfully, GPU schedulers are not completely unfair. Formalising the exact fairness guarantees provided (or assumed) on GPUs allows precise program reasoning not just for barriers, but other important blocking synchronisation idioms, such as mutexes.

After finalising their course schedule, Scott and Mardi talked about their future plans after they graduated. Scott said he wanted a tech job where he got free food and massages! While that all sounded very nice, Mardi wasn’t so sure. She was considering applying for grad school...

1.2 Contributions

The contributions of this thesis correspond to investigations of the issues faced by Scott, Mardi, and likely many other GPU developers1 when experimenting with a GPU global barrier. While these barriers are not supported officially by any major GPU programming model, it is hoped that the contributions in this thesis will be useful as these programming models evolve. The original contributions of this thesis are as follows:

• Chapter 3 presents the occupancy-bound execution (OBE) model: an abstract description of GPU scheduling, providing fair execution to any thread that has previously executed an instruction. These guarantees can be used to dynamically create an environment where it is safe to execute a barrier across a subset of the threads. An experimental campaign shows that a wide range of GPUs from different vendors (7 GPUs across 4 vendors) honour OBE guarantees. Using insights from this chapter, Scott and Mardi could implement a global barrier that worked on both of their GPUs, without considering architectural details.

1 see: https://stackoverflow.com/questions/7703443/inter-block-barrier-on- and https://stackoverflow.com/questions/34476631/opencl-and-gpu-global-synchronization

• Chapter 4 examines a domain-specific language (DSL) and an optimising compiler for GPU graph applications. The work explores the functional generalisation of these optimisations to target a wide range of GPUs. In particular, the OBE guarantees (as defined in Chapter 3) are used to enable an optimisation that requires a global barrier in a portable way. Experimental results show that performance profiles (across 6 GPUs spanning 4 vendors) can be achieved that resemble the original Nvidia results. The performance portability of the optimisations is investigated, as the runtime effect of optimisations varies across GPUs, applications, and inputs. Using insights from this chapter, Scott and Mardi could understand that the performance effect of the global barrier optimisation is highly sensitive to different application graph inputs, but not as much to different GPUs. Had Mardi tried the same input on her AMD GPU, she would have seen speedups similar to those on the Intel GPU. The DSL compiler presented in this chapter would allow Scott and Mardi to automatically and rapidly try many different optimisation settings on their respective GPUs.

• Chapter 5 is motivated by the observation that long-running GPU compute applications cause graphical applications to become temporarily unresponsive if both applications leverage the same GPU. To address this, a cooperative multi-tasking framework is presented where GPU compute applications and interactive tasks, e.g. graphics, can efficiently share resources. The primary primitive in this framework is a new global barrier, called a resizing barrier, in which applications surrender and reclaim compute resources at barrier calls. A prototype scheduler is presented and it is shown that synthetic graphics tasks are able to meet their interactive deadlines while multi-tasking with long-running GPU compute applications. Using insights from this chapter, Scott and Mardi could understand why their machines would temporarily freeze when executing a barrier-optimised application and understand how researchers are thinking about addressing the problem.

• Chapter 6 presents the notion of semi-fairness: a theoretical basis to reason about scheduling guarantees, such as the ones provided by OBE. Semi-fairness formally defines fairness properties through a temporal logic formula, based on weak fairness, that is parameterised by a thread fairness criterion, a predicate that enables fairness per-thread at certain points of an execution. Using this framework, the fairness properties of several GPU schedulers are formally defined and the behaviours of common synchronisation idioms are analysed under each scheduler.

20 Using insights from this chapter, Scott and Mardi could apply the material from their formal reasoning course to understand precisely the scheduling guarantees that are officially provided, or commonly assumed, for GPUs. This would allow them to formally reason about the next blocking synchronisation idiom that they wish to execute on a GPU.

1.3 Publications

The material presented in this thesis has either been published in conference articles or is currently under submission. Organised by chapter, these publications are as follows:

• The description of occupancy-bound execution and the discovery protocol, presented in Chapter 3, is based on material originally published in the 2016 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA’16) [SDB+16]. The work has an associated approved artifact, which can be found at: https://github.com/mc-imperial/gpu_discovery_barrier. A companion experience report, detailing the difficulties of running the associated experimental campaign across 7 GPUs, was published in the 4th International Workshop on OpenCL (IWOCL’16) [SD16b].

• The portable GPU graph application framework, presented in Chapter 4, is currently under submission. The citation references the current draft [SPD18].

• The GPU cooperative multi-tasking scheme, presented in Chapter 5, is based on material originally published in the 11th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE’17) [SED17a]. The work received a distinguished paper award.

• The formal description of scheduler fairness guarantees, presented in Chapter 6, is based on material originally published in the 29th International Conference on Concurrency Theory (CONCUR’18) [SED18].

The author has additionally been involved in several other published works, all with a common theme of testing or specifying fine-grained interactions of threads in concurrent systems. These additional works have helped shape the research direction of this thesis:

• The Semantics of Transactions and Weak Memory in x86, Power, ARMv8, and C++: this work specifies the semantics of programs that use transactional memory on systems that have relaxed memory models; published in the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’18) [CSW18]. The work received a distinguished paper award.

• Automatically Comparing Memory Consistency Models: this work presents a framework that, given two different memory consistency models, can synthesise programs that have different behaviours under the provided memory models; published in the 44th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL’17) [WBSC17].

• Exposing Errors Related to Weak Memory in GPU Applications: this work presents a black-box testing methodology that is able to effectively reveal bugs related to weak memory in GPU applications; published in the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’16) [SD16a].

• GPU Concurrency: Weak Behaviours and Programming Assumptions: this work describes the synthesis of weak memory unit tests and a testing framework that is able to effectively run such tests on GPUs. The results show that many GPU applications assumed an unsound memory consistency model; published in the 20th Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’15) [ABD+15].

1.4 Formal acknowledgements

I gratefully acknowledge and thank my collaborators throughout my PhD. While the majority of the work presented here was done by me, each co-author of the publications discussed in this thesis helped to develop the scientific narrative (this includes the prose in some cases) that will be presented. Specifically: Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan and Zvonimir Rakamarić contributed to the work presented in Chapter 3. Alastair F. Donaldson and Sreepathi Pai contributed to material presented in Chapter 4. Alastair F. Donaldson and Hugues Evrard contributed to material presented in Chapters 5 and 6. In terms of technical contributions, I acknowledge and thank Sreepathi Pai for doing substantial work in modifying the IrGL compiler to generate OpenCL code, discussed in Chapter 4. I acknowledge and thank Alastair F. Donaldson for developing the kernel

merge tool described in Chapter 5, which combines two GPU kernels into one megakernel and also adds hooks to interact with the prototype scheduler. I acknowledge and thank Hugues Evrard for (1) adapting the two work-stealing applications discussed in Chapter 5 to use cooperative kernels and for producing Figure 5.1 in Chapter 5; and (2) generating labelled transition systems which influenced Figures 6.2b, 6.3b and 6.4 in Chapter 6. While not directly appearing in this thesis, I thank and acknowledge: Mark Batty for producing a library abstraction proof for the memory consistency properties of the global barrier shown in Chapter 3 (as seen in [SDB+16]); Alastair F. Donaldson for writing operational semantics for the cooperative kernel extensions in Chapter 5 (as seen in [SED17b]); and Hugues Evrard for leading model-checking efforts in the early stages of the work presented in Chapter 6. These three contributions substantially improved their respective publications.

2 Background

In this chapter, an overview of GPU hardware and programming models is given, along with some related context. Section 2.1 provides a brief history of GPUs, from their beginnings as devices used exclusively to accelerate graphics rendering in video games, to their current status as programmable accelerators used in many domains. Next, details about GPU hardware and the programming models used to write programs that execute on such hardware are given in Sections 2.2 and 2.3, respectively. Finally, the GPUs used throughout this thesis and their technical profiles are presented in Section 2.4.

2.1 A brief history

A brief history of GPUs is presented here, with an emphasis on the development of general purpose computing on GPUs and some examples of their current usages. Unless otherwise mentioned, this history is derived from Chapter 1 of the textbook CUDA by Example [SK10] and the TechSpot article The History of the Modern Graphics Processor [Sin13].

The idea of using special-purpose hardware to accelerate graphics rendering has been around since the early 1950s. It is believed that a flight simulator developed by MIT in 1951 was the first successful instantiation of this idea. Through the next 30 years, specialised hardware was developed to render 2D graphics, including the Television Interface Adaptor in 1977, which was a special purpose chip designed for video and audio output of the Atari video game system.

In the early 1990s, GUI-driven operating systems, e.g. Microsoft Windows, imposed the requirement that home computers must be equipped with a 2D graphics card. Along with these operating systems, APIs for programming such graphics cards were developed, most notably a cross-vendor collaboration called OpenGL in 1992 and Microsoft’s DirectX in 1994.

In 1996, 3Dfx released the Voodoo GPU, the first consumer GPU to accelerate 3D graphics. Other companies, such as Nvidia and ATI (now a part of AMD), began developing such cards in fierce competition. In 1999, Nvidia officially used the term GPU for the first time with the release of the GeForce 256, stating [NVI99]:

The technical definition of a GPU is “a single chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second.”

while also humbly adding:

The GPU changes everything you have ever seen or experienced on your PC.

The next breakthrough came in 2001. Version 8.0 of the DirectX standard required hardware that allowed programmable vertex and pixel shaders. The first GPUs to satisfy this requirement were the Nvidia GeForce 3 series. Now graphics developers had the ability to customise the computations on the GPU. Given the high throughput potential of GPUs, this new flexibility also attracted developers from other domains. Using graphics APIs, developers could execute general applications in a roundabout way by (1) presenting their application as a graphics application and then (2) interpreting the GPU graphical output as general output.

Nvidia then moved to accommodate more general purpose programming on their GPUs. The GeForce 8800 GTX GPU, released in 2006, supported the new CUDA programming model. CUDA allowed for C-like programs to be developed and executed on Nvidia GPU hardware. To support this new programming model, the new GPU hardware was designed to conform to the IEEE floating point standard, and featured a unified core that combined the vertex and pixel shaders. However, other vendors, e.g. AMD, Intel, and Qualcomm, were also interested in developing a framework for general purpose computations on their respective GPUs. These vendors organised through the Khronos Group, a non-profit organisation whose aim is the development of open APIs for acceleration across a range of devices. In 2008, Apple began a Khronos working group for a new and portable GPU compute language, which would eventually become OpenCL. The Mac OS X Snow Leopard operating system supported the first version of OpenCL in 2009 [Wik18]. Interestingly, Apple deprecated support of OpenCL and OpenGL in 2018 to support their new GPU language, Metal.1

These general purpose languages and frameworks continue to evolve. CUDA is currently on version 9.1 (January 2018) [Nvi18a], and OpenCL is on version 2.2 (May 2017) [Khr17]. Each version adds new features, allowing more ways for applications to take advantage of GPU acceleration. These frameworks enjoy significant usage both in research and in

1 See: https://appleinsider.com/articles/18/06/04/opengl-opencl-deprecated-in-favor-of-metal-2-in-macos-1014-mojave

industry. For example, the website https://hgpu.org/, an aggregate site for GPU research, reports that as of 2017 more than 14,000 and 63,000 academic papers have been published using OpenCL and CUDA, respectively. In an industrial setting, clusters of GPUs are widely used for machine-learning applications. For example, Facebook reports 90% scaling efficiency using a cluster of 256 GPUs [GDG+17]. Consumers are also using general purpose GPU computation: the online tech publication Tom’s Hardware reported in February of 2018 that there was a shortage of consumer GPUs, and as such, the prices had doubled or tripled. The high demand is attributed to GPU acceleration for cryptocurrency applications.2

GPU architectures, their programming models, and the applications accelerated by GPUs continue to grow and evolve. This thesis explores ways in which the programming model might evolve, and shows that there is potential for new types of application acceleration using the proposed extensions.

2.2 GPU hardware

Many vendors currently produce GPUs, including Nvidia, AMD, Intel, ARM, and Qualcomm. Because GPUs from different vendors have varying architectural features, describing a generic common GPU architecture is difficult. However, several common characteristics pertinent to this thesis are described now and concrete details for three different GPU architectures will be given when possible. Specifically, the three architectures explored are: (1) the Nvidia Volta architecture [NVI17], as it is the most recent Nvidia architecture; (2) the AMD GCN architecture [AMD12], as it is the architecture of the two AMD chips explored in this thesis (see Section 2.4); and (3) the Intel Gen9 architecture [Int15a], as two of the Intel GPUs used in this thesis belong to this architecture, Iris and Hd5500. Additionally, two ARM GPUs are used in this thesis, but authoritative architecture specifications for either Midgard (the architecture of the ARM GPUs used in this work) or Bifrost (the most current ARM architecture) could not be found. Because OpenCL is a portable GPU programming framework, terminology from OpenCL will be used when possible.

2 See: https://www.tomshardware.com/news/ethereum-effect-graphics-card-prices,34928.html

26 Execution hierarchy The base unit of execution on a GPU is called a processing unit (PU), a scalar processor capable of executing an instruction set, including integer and floating-point computations. Groups of processing units are partitioned into same-sized disjoint SIMD execution groups (SEGs).3 Different architectures have different sizes and architectural names for SEGs. AMD calls such groups wavefronts and they have a fixed size of 64. Nvidia calls such groups warps and they have a fixed size of 32. SEG sizes on Intel GPUs vary between 8 and 16, adapted to the resource requirements of a program. SEGs are partitioned into same-sized disjoint compute units (CUs). AMD architectures have 4 SEGs per CU. Nvidia uses the terminology streaming multiprocessor (SM) to refer to CUs and each SM contains 4 SEGs. Intel uses the term execution units (EUs) to refer to a compute unit; because Intel SEGs are variable sized, it is unclear from the documentation how many SEGs each EU contains.

Memory hierarchy The base memory unit of a GPU is a register, which exists in a register file. A register file is either local to an SEG or a CU. On AMD, each SEG has a register file of 64 KB. On Intel, each CU has a register file of 28 KB. On Nvidia, each CU has a register file of 256 KB. Each CU has two caches: one for CU-local memory and another serving as an L1 cache for main memory. On AMD, the CU-local memory is 64 KB and the L1 cache is 16 KB. On Intel, there are no CU-local caches; instead, eight CUs are grouped into a slice, which has a cache of 512 KB. The slice cache is automatically partitioned between CU-local memory and L1 cache depending on the resource requirements of a program. On Nvidia, the CU-local memory and the L1 cache share a single 96 KB memory unit, and the partition between them can be configured manually. Finally, a GPU has a main memory shared across all CUs. AMD has an L2 cache, whose size varies across GPU models, shared across CUs to accelerate accesses to main memory. Intel GPUs share main memory with the CPU, and the documentation does not describe an L2 cache. Nvidia has an L2 cache of size 6144 KB. The main memory size is typically large, being the primary storage location for large amounts of data to compute. For example, the AMD R9 Fury GPU has 4 GB and the Nvidia Volta V100 GPU has up to 32 GB of main memory.
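For reference, several of the parameters mentioned above can be queried at runtime through the OpenCL API. The following host-side sketch (error handling omitted; it assumes a single platform with at least one GPU device) prints the compute unit count and the local and global memory sizes. As the story in Chapter 1 illustrates, the reported compute unit count does not by itself determine how many workgroups can safely synchronise.

  #include <stdio.h>
  #include <CL/cl.h>

  int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_uint num_cus;
    cl_ulong local_mem_bytes, global_mem_bytes;

    /* Assumes exactly one platform and at least one GPU device. */
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Query the architectural parameters discussed in the text. */
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_cus), &num_cus, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem_bytes), &local_mem_bytes, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem_bytes), &global_mem_bytes, NULL);

    printf("compute units:   %u\n", num_cus);
    printf("CU-local memory: %lu bytes\n", (unsigned long)local_mem_bytes);
    printf("global memory:   %lu bytes\n", (unsigned long)global_mem_bytes);
    return 0;
  }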

3 This terminology is specific to this thesis, as there does not appear to be a unified name for such hardware components.

27 2.3 GPU programming: OpenCL

GPU programming models enable development of programs that are accelerated on GPU hardware. There are several GPU programming models that exist today, each with benefits and drawbacks. CUDA [Nvi18a] is the Nvidia proprietary programming model. The benefits of CUDA are that it is well supported by Nvidia and quick to evolve. The drawback of CUDA is that it is only supported on Nvidia GPUs. OpenCL [Khr16a] is a portable GPU programming model similar to CUDA and developed by a Khronos working group with members across many hardware and software vendors (e.g. Nvidia, Intel, AMD). The benefits of OpenCL are that it is portable across vendors and is reasonably well supported.4 The drawback of OpenCL is that it is slow to evolve, perhaps to accommodate the wide range of target devices that may be required to implement it. OpenCL often seems to adopt CUDA features, but lags behind by several years, especially for availability on consumer devices. Heterogeneous System Architecture (HSA) [HSA17] aims to be a portable programming model and framework, but has not appeared to pick up much practical support outside of AMD. All of these models exist at a low level of abstraction, providing the opportunity for hand-written, detailed architectural optimisations. However, this low level also means that program development may be difficult, requiring domain expertise and being vulnerable to C-like bugs (e.g. out-of-bounds memory errors).

The different GPU programming models generally share common characteristics, i.e. programming constructs to exploit the GPU hardware hierarchy. The high-level ideas presented in this thesis are not tied to any particular GPU programming model and generally use only the common features. However, for cohesion and to provide portable experimental artifacts, this thesis largely uses OpenCL throughout. Specifically, version 2.0 of OpenCL is used, as it provides most of the features required by the constructs explored in this thesis, and enjoys support from current Intel and AMD GPUs (see Section 2.4). An OpenCL GPU program consists of two parts:

• a host program, written in C/C++ and executed on the CPU. The host program interfaces with the GPU, e.g. initialising GPU memory and launching GPU programs, through the OpenCL API.

4 OpenCL is not limited to GPUs, and can be executed on CPUs and FPGAs. However, GPUs are the main target of this thesis, and thus the focus will be limited to OpenCL execution on GPUs.

• a device program, often called a kernel, executed on the GPU. This program is written in OpenCL C, which is based on C99. It is executed in a single instruction, multiple thread (SIMT) manner, whereby all threads execute the same program but may query unique thread identifiers to access distinct data and follow varying control paths.

OpenCL C supports a hierarchical programming abstraction with components that map naturally to GPU architectural features. At the base level there are threads,5 which execute on a GPU’s PUs. Threads are organised into disjoint, equal-sized subgroups. Threads in the same subgroup are often executed on the same SEG. Subgroups are organised into disjoint, equal-sized workgroups. Typically subgroups in the same workgroup are executed on the same CU, which can execute several subgroups concurrently. A kernel is executed by one or more workgroups. The number of threads per workgroup and the number of workgroups executing a kernel are specified by the host upon kernel launch. Most GPU vendors support a maximum workgroup size of 256, with the exception of Nvidia, which supports a maximum workgroup size of 1024. The maximum number of workgroups that can execute a kernel is not practically constrained, as threads across workgroups do not enjoy the same locality-exploiting primitives that intra-workgroup threads have. Additionally, because there are no requirements for workgroups to execute concurrently, if the device is oversubscribed, workgroups can simply execute in waves. The relative location of threads in the hierarchy dictates the extent to which they can exploit locality of resource when communicating, or otherwise interacting, with one another. Threads in the same subgroup enjoy the most locality and threads in different workgroups enjoy the least.
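To make the launch parameters concrete, the following host-side fragment is a sketch only; the queue and kernel handles, and the choice of 120 workgroups of 256 threads, are assumptions for illustration. In OpenCL the host passes the total number of threads (the global size) and the workgroup size (the local size), so the number of workgroups is their quotient.

  /* Launch `kernel` on `queue` with 120 workgroups of 256 threads each.
     The global size must be a multiple of the local (workgroup) size. */
  size_t local_size  = 256;
  size_t global_size = 120 * local_size;      /* 120 workgroups in total */
  cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                      1,      /* one-dimensional NDrange */
                                      NULL,   /* no global offset */
                                      &global_size, &local_size,
                                      0, NULL, NULL);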

Memory spaces OpenCL exposes four memory spaces, each with varying locality, and hence performance, properties:

• Private Memory. This is the most localised memory, being thread-local. Private memory is provided in the register file, or an overflow memory region. Accesses to private memory are typically very fast due to its spatial locality to the PU; however, it is not useful for inter-thread interactions due to its limited scope. Recall that register files are typically local to SEGs; thus, subgroup primitives (as provided in OpenCL 2.1) may allow intra-subgroup threads to efficiently share memory in this region. Private memory does not persist across kernel invocations.

5 In strict OpenCL terminology, threads are called work-items; this thesis opts to use threads due to the conceptual similarities and historical prevalence of the word.

• Local memory.6 This memory is local to a workgroup and typically implemented in the CU-local cache. Local memory is considered efficient to use for intra-workgroup interactions, but is not shared across workgroups. Like private memory, local memory does not persist across kernel invocations.

• Global memory. This memory is shared across all the threads on the device and typically implemented in the GPU’s main memory. Compared to private and local memory, accesses to global memory are considered to be slow, but it is large and can facilitate inter-workgroup interactions. Unlike private and local memory, global memory does persist across kernel invocations.

• Shared virtual memory (SVM). This memory is shared between host and device and can be used for fine-grained GPU-CPU interactions. SVM is not yet well supported on current GPUs; in the cases where support is provided, accesses to this memory region are considered to be very slow. In this thesis, SVM is used only in Chapter 5 and was only reliably provided on the Intel GPUs (see Section 2.4).
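These memory spaces correspond to address-space qualifiers in OpenCL C. The following kernel is a small illustrative sketch, not taken from this thesis, showing where variables of each kind live; shared virtual memory is allocated on the host (e.g. with clSVMAlloc in OpenCL 2.0) and arrives at the kernel as a global pointer, so it is only indicated in a comment.

  kernel void memory_spaces_demo(global int *g,   // global memory: visible to all
                                                  // workgroups, persists across launches
                                 local  int *l) { // local memory: one copy per workgroup
    private int p = (int)get_global_id(0);        // private memory: one copy per thread
                                                  // ("private" is the default qualifier)
    l[get_local_id(0)] = p;                       // intra-workgroup sharing via local memory
    barrier(CLK_LOCAL_MEM_FENCE);                 // make the local writes visible in the workgroup
    g[get_global_id(0)] = l[0];                   // result visible to other workgroups (and the host)
    // A shared virtual memory buffer, if used, would also arrive here as a
    // global pointer whose contents the host can access during execution.
  }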

Workgroup barrier GPU programming models typically provide an intra-workgroup barrier instruction, which aligns the computation of threads within a workgroup and ensures up-to-date memory values can be read across these threads. For OpenCL, this construct is provided through the barrier instruction [Khr16a, p. 99].7 This instruction must be called in a workgroup-uniform location; that is, the barrier instruction must be encountered by all threads within a workgroup or by none of them. This constraint applies to loop iterations as well: all threads within a workgroup must encounter the same barrier instance on the same loop iteration. A thread that executes an instance of a barrier instruction must wait at the barrier instruction until all other threads in the workgroup have reached the same barrier instance. At that point, often called the barrier release, all threads can continue execution. Furthermore, all memory writes to workgroup-shared locations (i.e. local, global and shared virtual memory) will be visible to all other intra-workgroup threads after the barrier release.8 This provides a simple execution and memory model for intra-workgroup cooperation and communication.

6A potentially confusing name given that local is not qualified. The situation is not improved in CUDA, in which this memory region is referred to as shared.
7This instruction was officially renamed to work_group_barrier in OpenCL 2.0, but barrier remains supported; the latter is used in this thesis for brevity.
8In OpenCL, the barrier instruction takes additional arguments denoting which memory regions to synchronise, local or global (or both). In this thesis, it is assumed both memory regions are synchronised.

Table 2.1: OpenCL execution environment functions.

Function            Description
get_local_id        a workgroup-local id, unique and contiguous within a workgroup
get_group_id        a workgroup id, the same value for all threads in a workgroup
get_global_id       a global id, unique and contiguous for all threads
get_local_size      the number of threads per workgroup
get_num_groups      the number of workgroups executing the kernel
get_global_size     the number of threads, across all workgroups

A similar instruction, sub_group_barrier, aligns intra-subgroup threads in a similar manner. There currently does not exist a native way to barrier-synchronise threads across workgroups during kernel execution. Current methods for such synchronisation either involve: (1) ending and relaunching the kernel, which invalidates non-persistent memory and requires interaction with the host, or (2) using ad hoc and non-standardised software inter-workgroup barrier implementations. Both of these methods are discussed in Section 3.2. This thesis explores the semantic requirements to implement such a barrier in a safe and portable way.

Execution environment The OpenCL execution environment provides information about the ids and number of threads executing a kernel, accessed through the functions detailed in Table 2.1 (called workitem built-in functions in OpenCL [Khr16a, pp. 69-70]). The execution environment is static: the number of threads and workgroups is fixed on kernel launch. Threads can query the execution environment to access contiguous unique data in a data-parallel program. Other GPU programming models, e.g. CUDA and HSA, provide similar execution environment functions. Chapters 3 and 5 propose new execution environment functions that return information about a subset of OpenCL threads for which relative forward progress is guaranteed. OpenCL kernels may be launched with multi-dimensional workgroup and NDrange (the number of workgroups to execute a kernel) sizes. The execution environment functions would then take an integer argument to specify the dimension of the id to be returned. Similar to multi-dimensional arrays, multi-dimensional workgroups and NDranges can conceptually be thought of as a re-organisation of a large single-dimensional structure. Thus, in this thesis only single-dimensional workgroups and NDranges are considered.
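To make the flattening claim concrete, the sketch below shows one way a two-dimensional id could be collapsed into the single-dimensional id used in the rest of this thesis; the helper flat_global_id is hypothetical and not part of OpenCL, but the built-in functions it calls are standard.

  size_t flat_global_id() {
    // Collapse a 2D NDRange position into a single contiguous id:
    // the row index scaled by the row length, plus the column index.
    return get_global_id(1) * get_global_size(0) + get_global_id(0);
  }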

 1  kernel void vector_add(global float *A,
 2                         global float *B,
 3                         global float *C) {
 4
 5    local float scratchpad_A[256];
 6
 7    if (get_local_id() == 0) {
 8      for (int local_index = 0;
 9           local_index < get_local_size();
10           local_index++) {
11        int global_index = get_global_id() + local_index;
12        scratchpad_A[local_index] = A[global_index];
13      }
14    }
15
16    barrier();
17
18    C[get_global_id()] = scratchpad_A[get_local_id()]
19                         + B[get_global_id()];
20
21  }

Figure 2.1: A vector addition OpenCL kernel using local memory and workgroup synchronisation.

An example To illustrate how an OpenCL kernel is written, Figure 2.1 shows an example of a vector addition OpenCL kernel. This example uses local memory and workgroup synchronisation in a contrived way to illustrate their uses. An actual implementation would likely only use global memory, as vector elements are not used in a way that justifies the movement to local memory. A kernel is a void function annotated with the kernel keyword, as seen in line 1. This kernel function is executed by a number of threads and workgroups specified by the host at kernel launch time. Threads can then branch to different code paths, or access different memory locations, based on the execution environment functions, as will be discussed below. The host is responsible for allocating and initialising global memory in buffers through the OpenCL API. The host can then set the buffers as kernel arguments; global memory arguments are annotated with the global keyword. In this example, the host must provide three global float arrays, A, B, and C (lines 1, 2, and 3). The kernel will then compute A plus B element-wise and store the result in C. One element of the array will be

processed per thread. After kernel computation, the host will be able to copy C back to host memory and obtain the result. This kernel uses a local memory region to cache the A array, declared on line 5 using the local keyword. The local memory array has only 256 elements, one per workgroup thread. Thus, the host is responsible for not launching more than 256 threads per workgroup. In practice, such numbers are often defined via macros, or through a variable-sized local array whose size can be specified at kernel launch. Additionally, because the sizes of the arrays are not passed in or checked, the host is responsible for launching the kernel with a number of threads equal to the number of items in the float arrays. On line 7, one representative thread per workgroup (with local id of 0) will branch to cache a subset of the A array into the local memory scratchpad. The for-loop of line 8 copies one element for each thread of the workgroup into local memory. Because local memory is workgroup-local, the scratchpad is indexed from 0 up to the workgroup size minus one (255). The global array is shared across workgroups, so its index is computed as an offset from the representative’s global id. Threads in the same workgroup wait for the representative to populate the local memory cache at the barrier instruction of line 16. Recall that the barrier instruction synchronises all threads in the same workgroup. Finally, lines 18 and 19 compute the vector addition. The C and B arrays are accessed by each thread using their unique global ids. The A array, cached in local memory, is indexed using the workgroup-local id.

2.3.1 OpenCL feature classes

OpenCL has evolved rapidly: since the announcement of version 1.0 in October 2009, there have been five major versions of OpenCL; the most recent, version 2.2, was announced in May 2017. Each new version introduces a variety of new features that enable new ways for threads to interact, possibly exploiting the architectural features at different levels of the GPU hierarchy. Additionally, there are features that are not officially supported by OpenCL but that have been shown empirically to be supported by a range of current GPUs. Because the version of OpenCL supported by current GPUs varies across vendors (see Section 2.4 for examples), an OpenCL application that uses cutting-edge features introduced in recent versions will not be able to run directly on platforms whose version support is lagging. To account for this, in this thesis OpenCL features are described via a series of backward-compatible feature classes, essentially corresponding to different versions of OpenCL. These feature classes are used in Section 4.3 to describe the level of support

required for the different optimisations used by a graph application DSL. The feature classes are as follows:

OCL 1.x (support for OpenCL 1.0–1.2) This lowest-common-denominator class assumes only the concurrency support provided by OpenCL 1.0, which remains unchanged in versions 1.1 and 1.2. All threads executing a kernel can access the device’s global memory. Threads can perform vanilla reads and writes from this memory along with a variety of read-modify-write instructions (e.g. compare-and-swap). Threads in the same workgroup can communicate through the faster local memory. Additionally, threads in the same workgroup can synchronise locally via the barrier instruction.
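As a concrete illustration of what this feature class permits, the following sketch (not taken from the thesis) uses only OpenCL 1.x constructs: a global work counter built from the atomic_inc read-modify-write, with no assumptions about scheduling between workgroups.

  kernel void grab_work(volatile global int *next, global float *data, int n) {
    // Each loop iteration atomically claims a unique index; no thread blocks
    // waiting on another, so no inter-workgroup progress guarantee is needed.
    for (int i = atomic_inc(next); i < n; i = atomic_inc(next)) {
      data[i] *= 2.0f;
    }
  }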

OCL 2.0 (support for OpenCL 2.0) OpenCL 2.0 provides a detailed memory consistency model, similar to that of C++, which allows fine-grained inter-thread communication in a well-defined manner through special atomic instructions and variables. Atomic instructions can be syntactically annotated with the level of the OpenCL hierarchy that the interacting threads share; this allows implementations to take advantage of architectural locality [Khr15, pp. 45-53]. Such a memory consistency model requires hardware support for cache flushing and invalidation to implement synchronisation semantics. More details about the OpenCL memory model are provided in Section 2.3.2.

OCL 2.1 (support for OpenCL 2.1) OpenCL 2.1 allows programs to exploit the SIMD execution groups through subgroups. New primitive functions are provided that enable threads in the same subgroup to efficiently synchronise, share thread-local variables (through the register file), and perform higher-level functions across subgroups, such as reductions and prefix sums. Additionally, there are primitives that allow thread-local predicates to be evaluated across a subgroup, either as a disjunction or conjunction [Khr16b, pp. 133-140].9
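To ground these primitives, the hedged sketch below combines a subgroup reduction with a subgroup-wide predicate; the function names are those of the cl_khr_subgroups functionality that became core in OpenCL 2.1, and the output indexing is merely illustrative.

  kernel void subgroup_demo(global int *in, global int *out) {
    int v = in[get_global_id(0)];
    int sum = sub_group_reduce_add(v);     // reduction across the subgroup
    int all_pos = sub_group_all(v > 0);    // conjunction of a per-thread predicate
    if (get_sub_group_local_id() == 0) {
      // One result slot per subgroup: workgroup id scaled by subgroups per workgroup.
      out[get_group_id(0) * get_num_sub_groups() + get_sub_group_id()] =
          all_pos ? sum : 0;
    }
  }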

OCL FP (assumption of forward progress) The OpenCL standard does not provide any independent forward progress guarantees between threads in different workgroups. This means that in principle, implementations of common synchronisation primitives, such as inter-workgroup execution barriers, are prone to unfair executions where threads are blocked indefinitely. However, Chapter 3, as well as prior work [GSO12, XF10], shows that many current GPUs empirically provide forward progress under the occupancy-bound

9The documentation exists as a 2.0 extension that was made core in 2.1.

execution model, which states: concurrently executing threads will continue to be concurrently executed. Such guarantees require that the OpenCL framework, e.g. through the driver, continues to schedule workgroups in a fair way after they have been scheduled initially. This feature class assumes support for the occupancy-bound execution model.

2.3.2 OpenCL 2.0 memory consistency model

The OpenCL 2.0 memory model, based on that of C++11 [ISO12], employs a catch-fire semantics, where races on regular variables lead to undefined behaviour. Atomic variables, and corresponding memory accesses, are provided to give semantics to code that would otherwise be racy. Synchronisation between threads can be achieved by associating a memory order with each atomic variable access. The memory orders relevant to this work are release (applied to store operations) and acquire (applied to load operations). An acquire load that reads a value written by a release store in a different thread creates a happens-before edge from the store operation to the load operation, i.e. the operations synchronise. The formal synchronisation properties required for this work are provided in Section 3.5.2. The final memory order relevant to this work is relaxed (applied to store or load operations), which allows concurrent accesses, but provides no synchronisation. In OpenCL (and unlike C++11), an atomic access has an associated memory scope annotation, specifying a level of the OpenCL execution hierarchy. This declares the intent to concurrently access the variable only within this level of the hierarchy, so that synchronisation is only provided within the given scope. The scope annotations relevant to this work are workgroup and device, allowing synchronisation between threads only in the same workgroup, and between arbitrary threads executing a kernel, respectively.
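The release/acquire and scope machinery can be made concrete with a small message-passing sketch, shown below. It is an illustrative example under the assumption that the two functions are executed by different threads (setting aside the inter-workgroup scheduling concerns discussed in Chapter 3); the function and variable names are hypothetical, but the atomic builtins and their arguments are standard OpenCL 2.0 C.

  void produce(global int *data, global atomic_int *flag) {
    *data = 42;                                      // plain (non-atomic) write
    atomic_store_explicit(flag, 1,
                          memory_order_release,
                          memory_scope_device);      // release store publishes it
  }

  void consume(global int *data, global atomic_int *flag, global int *out) {
    while (atomic_load_explicit(flag,
                                memory_order_acquire,
                                memory_scope_device) != 1)
      ;                                              // acquire load synchronises
    *out = *data;                                    // now guaranteed to read 42
  }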

2.4 GPUs used in this work

Table 2.2 lists the GPUs used in empirical investigations throughout this thesis. A total of 11 GPUs spanning four major vendors, Nvidia, AMD, Intel and ARM, were used. While the number of CUs is reported to give an idea of the amount of parallelism of the GPU, Chapter 3 shows that compute units are not a completely reliable measure of this. All ARM and Intel GPUs are integrated; all Nvidia GPUs are discrete; for AMD, the R9 is discrete and the R7 is integrated. The two ARM GPUs are the same model, differing only by the number of compute units. Two Intel GPUs (Iris and Hd5500) have subgroup sizes that change depending on the amount of resources used by a compiled kernel; sizes of 8 and 16 have been observed. The other Intel chip (Hd520) was not used in any study

requiring subgroups and thus the subgroup size was never queried. The ARM GPUs do not have SIMD execution groups, but it is semantically sound to model their subgroup size as one.

Table 2.2: The GPUs considered in this work, a short name used throughout the thesis (Short name), the number of compute units (#CUs), the subgroup size (SG size), the supported OpenCL version (OCL), and the chapter(s) where each GPU is used for empirical studies (Chapter).

Vendor   Chip           Short name   #CUs   SG size   OCL   Chapter
Nvidia   M4000          M4000        13     32        1.2   4
Nvidia   GTX 1080       Gtx1080      20     32        1.2   4
Nvidia   GTX 980        Gtx980       16     32        1.2   3
Nvidia   Quadro K5200   K5200        12     32        1.2   3
Intel    HD520          Hd520        24     ?         2.0   5
Intel    HD5500         Hd5500       27     8,16      2.0   3,4,5
Intel    Iris 6100      Iris         47     8,16      2.0   3,4,5
AMD      R9 Fury        R9           28     64        2.0   3
AMD      Radeon R7      R7           8      64        2.0   3,4
ARM      Mali-T628      Mali-4       4      1         1.2   3,4
ARM      Mali-T628      Mali-2       2      1         1.2   3

The justification for why particular GPUs are used in the different chapters of this thesis is as follows:

• Chapter 3 excludes M4000 and Gtx1080 (Maxwell and Pascal architectures, respectively) because these GPUs were provided by Sreepathi Pai at UT Austin and were not available at the time of this work. Chapter 3 additionally excludes Hd520 because it was in Alastair F. Donaldson’s personal laptop and two other Intel GPUs were available.

• Chapter 4 excludes K5200 and Gtx980 as two newer Nvidia GPUs were provided by Sreepathi Pai. R7 was excluded because it started showing unpredictable behaviour shortly after the work of Chapter 3 and was deemed to be unreliable; e.g., execution times of the same kernel varied by several orders of magnitude. Mali-2 was excluded due to its similarity to Mali-4 and the extremely long execution time of the experimental campaign on ARM (97 hours).

• Chapter 5 uses only Intel GPUs as they provided the only reliable support for shared virtual memory, needed by the prototype scheduler. Alastair Donaldson kindly allowed the use of his personal laptop, with Hd520, so that one more GPU could be used in this study.

• Chapter 6 is entirely theoretical without an experimental study.

Memory model implementation The chips considered (Table 2.2) all support the OpenCL 2.0 memory model with the exception of the Nvidia and ARM chips. Because all experimental studies in this thesis require memory consistency guarantees, custom implementations of the OpenCL 2.0 atomic operations were developed for these chips. The Nvidia implementation is based on empirical testing of these chips from previous publications [ABD+15] and [WBSC17], both of which include the thesis author as a publication author. Specifically, inline PTX (Nvidia’s low-level intermediate language) is used to provide Nvidia-specific memory fences. The ARM implementation uses OpenCL 1.1 memory fence instructions. While the ARM implementations come with no proof of correctness or rigorous testing, the fence placement is conservative and no issues were encountered. These implementations should be seen as temporary until OpenCL 2.0 is more widely supported.
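To give a flavour of what such a custom implementation looks like, the sketch below emulates a device-scoped release store and acquire load using only OpenCL 1.x fences and read-modify-write atomics, in the conservative spirit of the ARM implementation described above. It is an assumption-laden sketch, not the thesis's actual code (the Nvidia version uses inline PTX fences, which are not reproduced here), and OpenCL 1.x gives these fences no formal cross-workgroup guarantee.

  void release_store(volatile global int *flag, int value) {
    mem_fence(CLK_GLOBAL_MEM_FENCE);   // informally order earlier global writes
    atomic_xchg(flag, value);          // write the flag with a 1.x atomic
  }

  int acquire_load(volatile global int *flag) {
    int v = atomic_add(flag, 0);       // read the flag with a 1.x atomic
    mem_fence(CLK_GLOBAL_MEM_FENCE);   // informally order later global reads
    return v;
  }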

2.5 Summary

GPUs began as dedicated hardware to enable graphics rendering. Through the years, their features and computational flexibility increased to the point of being able to execute general massively parallel C-like programs. Although GPUs from different vendors have different architectural parameters, programming models such as OpenCL allow GPU programs to be written in a portable manner. These programming models evolve and adapt to support new features; this thesis explores a currently unsupported idiom: the inter-workgroup barrier.

3 A Portable Global Barrier

3.1 Personal context

I undertook this work at the beginning of my PhD. The original aim was to continue the work of my master’s thesis about testing weak memory models on GPUs [Sor14]. In these memory model tests, a kernel consisted of a set-up phase, where memory locations were distributed to threads in various ways, followed by the actual memory model test (or litmus test). One of the heuristics that made the tests more effective was using a global barrier between the set-up phase and the litmus test; this temporally aligned the threads, causing more interesting behaviours to be observed. In the earlier work, our focus was restricted to Nvidia GPUs. I was interested in making this work portable across GPUs from different vendors. The immediate difficulty was making the global barrier work on different GPUs. While Nvidia provides a convenient occupancy calculator, this is not available for any other vendor. After fighting with many deadlocks and seeing no clear path to a simple and portable occupancy calculator, it became clear that a portable GPU global barrier was interesting in its own right. This chapter presents a solution to providing a portable global barrier for GPUs.

3.2 Motivation

Execution barrier synchronisation, where a thread waits at a barrier until all threads reach the barrier, is a popular method for inter-thread communication because (1) it simultaneously aligns the computations of threads, and (2) it makes pre-barrier memory updates available to all threads, implementing a well-specified and simple memory consistency model. Unfortunately, existing GPUs and their associated programming models do not support inter-workgroup barriers. This is because a GPU kernel can be configured with many more workgroups than the underlying hardware can execute concurrently. Execution of large numbers of workgroups is achieved in an “occupancy-bound” fashion, by delaying the scheduling of some workgroups until others have executed to completion. This is a pragmatic

design decision made by vendors, as traditional GPU kernels have not contained blocking synchronisation idioms, and preemption at the workgroup level would be difficult to achieve efficiently. Namely, the large local memory and register files would need to be rapidly stashed and restored. As a consequence, traditional fair scheduling guarantees associated with CPU threads are not provided between workgroups executing a kernel [GSO12]. As an example, consider the kernel in Figure 3.1 executed with one thread per workgroup. This kernel accepts a pointer to a global variable, flag (initialised to 0), and two (not equal) integers, A and B. The thread with workgroup id B writes 1 to flag, while the thread with workgroup id A spins waiting for workgroup B to write to flag. However, the GPU execution model is licensed to postpone execution of workgroup B until execution of workgroup A has completed. In this scenario, the kernel will deadlock due to starvation. This idiom of one workgroup waiting on another is at the heart of an inter-workgroup barrier. When executing a barrier instance, each workgroup will wait on all other workgroups to reach the same barrier instance. If a single workgroup is blocked by the occupancy-bound execution model, then the barrier execution will deadlock. This behaviour is easy to observe on current GPUs. A traditional CPU software barrier [HS08, ch. 17], ported to synchronise across workgroups in OpenCL, leads to deadlock on an AMD Radeon R7 GPU when used in a kernel launched with more than 4096 threads (with 256 threads per workgroup). The deadlock occurs because the number of threads exceeds the occupancy of the GPU for the kernel realisation, i.e. the number of workgroups that can concurrently execute the binary version of the kernel on the GPU.

Current approaches Existing GPU applications that require inter-workgroup barrier synchronisation rely on either the multi-kernel or the occupancy assumption method. In the multi-kernel method, an application is manually split into multiple kernels, with a transition from one kernel to another each time an inter-workgroup barrier is required.

kernel void unsafe(global int *flag, int A, int B) {
  if (get_group_id(0) == A)
    while (*flag != 1);

  if (get_group_id(0) == B)
    *flag = 1;
}

Figure 3.1: A non-portable kernel that potentially deadlocks.

The transfer of control from the GPU to the CPU host between kernels provides the barrier semantics implicitly [Khr15, ch. 3.2.4]. This method is portable, making no assumptions about concurrent execution of workgroups (a host-side sketch of this structure is given after the list below). However, there are three immediate drawbacks to this method:

1. interaction between the GPU and CPU can be expensive, due to the overhead associated with kernel launches. Indeed, Chapter 4 measures the kernel launch overhead for many of the chips of Table 2.2 and shows that it is a significant bottleneck in many applications;

2. because the contents of registers and local memory do not persist across kernel calls, in the multi-kernel method these local resources cannot be reused from one kernel launch to the next;

3. from a software engineering standpoint, requiring multiple kernel launches may not be the most natural or maintainable structure for the code.
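For concreteness, the sketch below shows the host-side shape of the multi-kernel method under some assumptions: the kernel, queue, and size variables are set up elsewhere, num_phases and the phase argument are hypothetical, and error handling is omitted. The blocking clFinish call after each launch is what provides the implicit inter-workgroup barrier.

  for (int phase = 0; phase < num_phases; phase++) {
    // Pass the phase index so the kernel knows which stage of work to perform.
    clSetKernelArg(kernel, 0, sizeof(int), &phase);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);   // host waits: every workgroup has completed this phase
  }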

The alternative occupancy assumption approach (studied in related work [GSO12, XF10]) employs a traditional software execution barrier. Starvation, and hence deadlock, is avoided through a priori knowledge about the number of workgroups that can be scheduled concurrently for a particular kernel and GPU. This avoids the efficiency problems of the multi-kernel approach, but is non-portable, since workgroup occupancy varies dramatically between (1) GPU architectures, depending on hardware resources, e.g. the number of compute units, and (2) GPU kernels, depending on the resources that a workgroup requires to execute the kernel, e.g. registers and local memory. The kernel's resource requirements are in part determined by the compiler, and thus may vary even between compiler versions. In a similar vein, OpenCL provides nested parallelism [Khr15, pp. 32–33] (or dynamic parallelism in CUDA [Nvi18a, app. D]) where GPU threads can themselves launch a new kernel. While this feature may be useful for the applications with irregular parallelism examined in this work, the high-level idiom of nested parallelism, i.e. fork and join, is different enough from barrier synchronisation, i.e. synchronisation of executing threads, that it is not considered further in this chapter.

Remark (Performance of nested parallelism). In addition to the different concurrency models of nested parallelism and global synchronisation, prior work by Pai and Pingali showed that the performance of nested parallelism on modern Nvidia GPUs is significantly lower than that of global synchronisation. Thus, global synchronisation appears to be the more pragmatic direction of research [PP16, Table 1].

Memory consistency An inter-workgroup barrier must also provide memory ordering properties: threads must observe up-to-date memory values during post-barrier execution, and data-races between accesses separated by the barrier must be forbidden. The barrier must be implemented using sufficient synchronisation constructs, such as atomic operations and memory fences, to ensure these properties.

3.2.1 Chapter contributions

The overall contribution of this chapter is to show that portable inter-workgroup barrier synchronisation can be successfully achieved, if both the execution model and the memory consistency model are considered. The heart of the contribution is an occupancy discovery protocol that provides a (safe) estimate of the number of occupant workgroups dynamically at the beginning of a kernel execution. The protocol can then be used to set up an execution environment such that the remainder of the kernel is only executed by the workgroups found to be occupant. Because the number of workgroups that execute the kernel is dynamic, depending on the discovered occupancy, this execution environment requires that kernels are agnostic to the number of executing workgroups. This is discussed further in Section 3.4. Because occupant workgroups exhibit traditional fair scheduling guarantees, they can reliably participate in an inter-workgroup barrier. In this context, Section 3.5 describes the extension of an existing GPU barrier implementation from [XF10] to use the atomic operations of OpenCL 2.0. The memory ordering properties of these instructions are well-defined and allow for the barrier implementation to be formally reasoned about. In particular, the analysis includes (1) a formal specification of the memory ordering prop- erties to be provided by an abstract inter-workgroup barrier, and (2) an illustration that the concrete barrier implementation honours the specification. To assess portability, the discovery protocol and global barrier are evaluated across eight GPUs spanning four vendors (discussed in Section 2.4). First, the recall (i.e. the percent- age of instances found) of the occupancy estimate returned by the discovery protocol is assessed. Using heuristics, it is shown that the recall is almost 100%, i.e. the occupancy estimate almost perfectly matches the occupancy bound. The Pannotia [CBRS13] and Lonestar-GPU [BNP12] benchmarks are then examined in Sections 3.6.2 and 3.6.3. These benchmarks currently achieve inter-workgroup synchronisation using the multi-kernel and occupancy assumption methods, respectively. All relevant applications from each suite are adapted to use the protocol/barrier combination and the runtimes from the adapted

applications are compared with the original implementations. In all cases, the discovery protocol/barrier combination enables portable execution as expected. The performance effect of moving from multi-kernel to global barrier synchronisation varies between GPU and application. However, a reliable speedup is observed in some cases (e.g. a geomean speedup of 1.36× is observed on Iris). On the other hand, the slowdown associated with moving from an occupancy assumption barrier to a portable barrier is shown to be reasonable. This allows portable versions of the Lonestar-GPU applications to run (for the first time) on GPUs produced by vendors other than Nvidia. To summarise, the main contributions of this chapter are:

• The presentation of the occupancy discovery protocol, which dynamically computes a safe estimate of the workgroup occupancy for a given GPU and kernel (Section 3.4).

• The adaptation of an existing inter-workgroup barrier [XF10] to exploit the dynamic workgroup occupancy discovered by the discovery protocol. Using OpenCL 2.0 atomic operations, it is shown that the barrier meets its intuitive memory ordering specification (Section 3.5).

• An evaluation of the discovery protocol on eight GPUs spanning four vendors (Table 2.2) showing that the approach is able to achieve near-perfect occupancy estimates (Section 3.6.1).

• An evaluation of the performance benefits of using the protocol/barrier combination against the multi-kernel and occupancy assumption methods for inter-workgroup synchronisation. Results vary across GPUs and applications, but in some cases the protocol/barrier application variants provide a substantial speedup, up to 2.34× (Sections 3.6.2 and 3.6.3).

Related publications The material presented in this chapter is based on work published in the 2016 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'16) [SDB+16]. The work has an associated approved artifact, which can be found at: https://github.com/mc-imperial/gpu_discovery_barrier A companion experience report, detailing the difficulties of running the experimental campaign, was published in the International Workshop on OpenCL (IWOCL'16) [SD16b].

3.3 Occupancy-bound execution model

OpenCL does not currently specify a formal execution model for inter-workgroup interactions, and the behaviour of programs with such interactions is cautioned against in the standard [Khr15, p. 31]: “A conforming implementation may choose to serialize the workgroups so a correct algorithm cannot assume that workgroups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of workgroups since once in the work-pool, they can execute in any order.” In previous work [GSO12, XF10], it is suggested, and supported by empirical evidence, that GPUs provide an occupancy-bound execution model for inter-workgroup interactions. Here, a more formal definition of the occupancy-bound execution model is given. This definition is limited to the execution of a single kernel and also assumes that non-compute functionality, e.g. graphics, is disabled by kernel execution. This assumption is consistent with empirical observations during this work, e.g. the OS graphics layer became unresponsive during kernel execution.

Looking forward: In Chapter 5, co-scheduling between interactive tasks, e.g. graphics, and compute tasks on the same GPU is explored through a small set of new programming primitives for OpenCL.

A workgroup is occupant if (1) it has executed at least one instruction using the GPU’s resources, and (2) it has not yet completed kernel execution. The occupancy-bound execution model requires the following conditions, related to occupant workgroups, to hold:

• No indefinite preemption: An occupant workgroup is guaranteed to eventually be scheduled for further execution on the GPU, regardless of the behaviour of other workgroups. Consequently, two workgroups that are simultaneously occupant can participate in blocking communications with each other that require fair scheduling, e.g. spin-locks.

• Utilisation: There is a constant N > 0, the occupancy bound, such that if m < N workgroups are occupant, and there exist k > 0 workgroups that have not yet commenced execution, then one of these k workgroups will eventually become occupant. Additionally, no more than N workgroups can be occupant. Consequently, a blocking communication that requires up to N workgroups to be simultaneously occupant (but makes no assumptions about the order in which they become occupant) can be used without fear of starvation-induced deadlock.

kernel void native_openCL(...) {
  // A thread will do at most one element of work based on global id
  int id = get_global_id();
  if (id < data_size) {
    do_work(id);
  }
}

(a)

kernel void persistent_thread(...) {
  // A thread may do multiple elements of work based on the
  // number of participating groups
  for (int id = get_global_id(); id < data_size;
       id += get_global_size()) {
    do_work(id);
  }
}

(b)

Figure 3.2: Illustration of the code transformation required when going from (a) the native OpenCL programming style to (b) the persistent thread programming style, where kernels are parameterised by N, the number of executing workgroups.

The occupancy bound N depends on the available hardware resources. This clearly depends on the GPU architecture, but also the resource requirements of the target kernel. A kernel that uses a large amount of local memory or registers will have a high resource requirement, which may lower N. Because the resources used by a kernel may depend on details of compilation, e.g. register allocation, N also depends on the OpenCL compiler version.

The persistent thread model The persistent thread model is a GPU programming model that allows kernels to exploit fair scheduling guarantees provided by the occupancy-bound execution model. To use the persistent thread model, programmers must ensure that they launch kernels with at most N workgroups, sometimes referred to as the maximal launch [GSO12]. Under this restriction, all workgroups will eventually be occupant, so idioms that require fair scheduling across workgroups, e.g. inter-workgroup barriers, can be used. Applications that use the persistent thread programming model currently use occupancy assumption methods to determine N. A main contribution of this chapter is a method for programmatically determining a safe estimate for N.

One property of the persistent thread programming model is that programs are written in a way that is agnostic to the number of executing workgroups. This is because N can vary across chips or even compiler settings, and persistent thread programs aim to be portable as long as N is provided. The transformation of traditional OpenCL programs into a persistent thread style program is typically straightforward. An example is illustrated in Figure 3.2. The code of Figure 3.2a illustrates the native OpenCL programming model, where threads compute at most one item of work depending on their id. The code of Figure 3.2b shows how the code of Figure 3.2a is adapted to use the persistent thread programming style. That is, each thread performs a dynamic amount of computation based on the global size, which is determined by N. Throughout the rest of this chapter, it is assumed that programs are given in the persistent thread programming style, and thus are agnostic to the number of executing workgroups.

Occupancy bound One empirical validation of the occupancy-bound execution model is the existence of an occupancy bound N such that an inter-workgroup barrier succeeds when executed with N workgroups but deadlocks when executed with N + 1 workgroups. Such occupancy bounds are found across a range of GPUs in Section 3.6.1. Many instances of previous work, which exploit the persistent thread model, also acknowledge this bound [WCL+15, GSO12, TPO10, CT08, XF10, HHB+14, OCY+15, MZ16, MYB16, BNP12], adding to the empirical evidence that this execution model is supported on current GPUs.

3.4 Occupancy discovery

Here, the occupancy discovery protocol for dynamically determining a safe estimate of the occupancy bound for a given GPU and kernel is presented. Recall that the occupancy-bound execution model guarantees the existence of an occupancy bound N for a given GPU and kernel such that N workgroups can be simultaneously occupant during execution, and are guaranteed to be fairly scheduled. Given that N is unknown, the aim is to dynamically discover an estimate n with 0 < n ≤ N, i.e. a lower bound for N. The n discovered workgroups may then proceed to successfully complete computation that requires forward progress between workgroups using a persistent thread style programming model. To achieve this, all workgroups execute the discovery protocol at the start of the kernel, prior to any other computation. In the protocol, each workgroup executes a routine that returns a value indicating whether the calling workgroup is participating. A workgroup is participating if and only if the workgroup commences execution of the protocol before any

 1  lock(mutex);                  // polling phase (lines 1-8)
 2  if (poll_open) {
 3    M[get_group_id()] = count;
 4    count++;
 5    unlock(mutex);
 6  } else {
 7    unlock(mutex);
 8    return NON_PARTICIPATING;
 9  }
10
11  lock(mutex);                  // closing phase (lines 11-16)
12  if (poll_open) {
13    poll_open = false;
14  }
15  unlock(mutex);
16  return PARTICIPATING;

Figure 3.3: Occupancy discovery protocol, executed by a single representative thread per workgroup.

workgroup has finished executing the protocol. By definition, participating workgroups are simultaneously occupant. The occupancy-bound execution model thus guarantees that participating workgroups are fairly scheduled, and the total number of participating workgroups is a lower bound for N. Workgroups found to be non-participating immediately exit; workgroups found to be participating continue the kernel computation. Thus computation occurring after the discovery protocol, called the main kernel computation, can assume fair scheduling of workgroups. Under this scheme, the native workitem built-in functions (see Section 2.3) may no longer provide contiguous unique values for threads executing the main kernel computation. For example, if the discovered participating workgroups do not have contiguous native ids then they cannot easily partition data in data-parallel programs. To overcome this, the occupancy discovery protocol constructs a new, dynamically determined, execution environment with replacements for the workitem built-in functions that do satisfy the contiguous unique property for participating workgroups.

3.4.1 Implementation

Figure 3.3 shows the discovery protocol implementation. There are four variables located in the global memory region: count, an integer to record the number of participating groups (initially 0); poll_open, a boolean recording whether the poll is open (initially

true); mutex, an object that provides mutual exclusion through lock and unlock functions (initially unlocked); and M, an array that records intermediate values of count (initial values irrelevant). All protocol variables are required to be initialised prior to the protocol execution, e.g. by a separate kernel or via OpenCL host-to-device memory copies. The protocol is split into two phases, a polling phase (lines 1-8) and closing phase (lines 11-16). Both phases are protected using mutex. For now, the implementation of mutex along with the corresponding locking functions are left abstract, only supplying mutual exclusion in locked regions. Concrete mutex implementations are discussed in Section 3.6.1. The protocol is executed by only one representative thread per workgroup. In the polling phase, a thread checks whether the poll is open (line 2). If so, the thread marks its workgroup as participating by recording the current value of count in array M, at an index according to the thread’s workgroup id, and incrementing count (lines 3-4); the thread then moves on to the closing phase. On the other hand, if the poll is closed, the thread simply exits the protocol, returning a flag value to indicate that its workgroup is non-participating (line 8). A thread that successfully completes the polling phase, i.e. observed an open poll, continues to the closing phase (line 11). If the poll is still open, the thread closes the poll (lines 12-13). Only the first thread to enter the closing phase performs this action; future threads observe a closed poll. Either way, the thread returns a participating flag (line 16). Because thread scheduling is non-deterministic, a single thread might execute the polling phase and closing phase prior to any other thread commencing execution of the protocol. This would lead to an estimated occupancy bound of 1. Clearly it would be preferable to discover a tighter lower bound, especially if the true occupancy bound N is large. Experimentally it is shown that a suitable choice of mutex implementation leads to tight estimates of N in practice (Section 3.6.1). It may be possible to add heuristics to this protocol to improve the recall of discovered workgroups. For example, it seems natural that a pause inserted between the polling and closing phase may increase the recall, as this would provide occupant threads more time to poll before the poll is closed. In this work, such heuristics are not considered for two reasons: (1) it is shown that with the right mutex implementation, a very high recall is achieved without any additional heuristics (Section 3.6.1), and (2) because OpenCL does not provide any portable pause or sleep function, the pause would have to be implemented as some kind of busy spin. The busy spin would have to be tuned per GPU in order to minimise overhead while maximising recall. In Section 3.6.4, pilot experiments applying the occupancy discovery protocol to CPU implementations of OpenCL are discussed, in which this spinning heuristic is employed.
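Although mutex is left abstract here (concrete choices are evaluated in Section 3.6.1), the sketch below gives one simple possibility, a test-and-set lock built from OpenCL 2.0 device-scoped atomics; it is an illustrative assumption, not necessarily the implementation used in the experiments. Since only one representative thread per workgroup runs the protocol, contention is between occupant workgroups, which the occupancy-bound execution model guarantees are fairly scheduled.

  void lock(global atomic_int *mutex) {
    // Spin until the lock word transitions from 0 (free) to 1 (held).
    while (atomic_exchange_explicit(mutex, 1,
                                    memory_order_acquire,
                                    memory_scope_device) == 1)
      ;
  }

  void unlock(global atomic_int *mutex) {
    atomic_store_explicit(mutex, 0,
                          memory_order_release,
                          memory_scope_device);
  }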

Table 3.1: Occupancy discovery execution environment functions.

Function               Derivation
p_get_num_groups()     count
p_get_group_id()       M[get_group_id()]
p_get_global_id()      (p_get_group_id() * get_local_size()) + get_local_id()
p_get_global_size()    count * get_local_size()

Execution environment construction The variables M and count are used to construct a new execution environment in which only participating workgroups take part in computation. Because count is initialised to zero and incremented once by each participating workgroup, after the protocol execution count contains the number of participating workgroups. Additionally, the value of count observed by a participating workgroup prior to incrementing (line 3) is recorded in array M at an index corresponding to the id of the workgroup, providing a unique participating workgroup id, such that the sequence of participating workgroup ids is contiguous and unique. Table 3.1 defines the new execution environment functions. A thread can query: the number of participating workgroups (p_get_num_groups), its participating workgroup id (p_get_group_id), a contiguous unique id for participating threads (p_get_global_id), and the total number of participating threads (p_get_global_size). Each function is analogous to a native OpenCL function (the function without the p_ prefix), but the new functions consider only participating workgroups.

Execution environment constraints As participating workgroups are determined dynamically, the main kernel computation must be agnostic to the number of executing workgroups at launch time (i.e. as in the persistent thread programming model). This is in contrast to the native OpenCL execution environment, where the number of executing workgroups is fixed on kernel launch. In the native OpenCL execution environment a kernel can be launched with enough workgroups such that each thread computes at most one piece of data. A kernel that uses the occupancy discovery execution environment must be able to dynamically adapt the per-thread computation based on the number of participating groups, which can be queried during kernel execution. The kernels adapted to use occupancy discovery in the experiments (Section 3.6) required only simple transformations to satisfy this constraint, which is similar to the program transformations required for the persistent thread model [GSO12].

In fact, the persistent thread program of Figure 3.2b is easily adapted to the participating thread model by: (1) executing the discovery protocol before any kernel computation; and (2) replacing get_global_id() and get_global_size() with their participating group mirrors: p_get_global_id() and p_get_global_size(), respectively.
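Putting those two steps together gives roughly the kernel shape sketched below. It is a minimal illustration, assuming the protocol of Figure 3.3 is exposed as a discovery_protocol() routine whose result is shared with the whole workgroup (that detail, along with data_size and do_work, is taken over from Figure 3.2 and elided here).

  kernel void participating_thread(...) {
    // Step (1): run the discovery protocol; non-participating groups exit.
    if (discovery_protocol() == NON_PARTICIPATING)
      return;

    // Step (2): stride over the data using the participating-group environment.
    for (int id = p_get_global_id(); id < data_size;
         id += p_get_global_size()) {
      do_work(id);
    }
  }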

3.4.2 Properties and correctness

Here, it is argued that the occupancy discovery protocol satisfies certain key properties required by client applications. Line numbers refer to Figure 3.3.

At least one participating workgroup To ensure that main kernel computation occurs, at least one workgroup must be identified as participating. The protocol satisfies this requirement because the poll is initialised to open. This means that at least the first thread to enter the polling phase will mark itself as participating and go on to finish the protocol, returning PARTICIPATING at line 16.

Consistent participating count To provide participating threads with valid execution environment values, the protocol must ensure that, upon completion, all threads view the same value in count and that this value is equal to the number of workgroups that return PARTICIPATING. Every thread returning PARTICIPATING increments count exactly once (in a crit- ical section), as there are no loops in the protocol and the only path to returning PARTICIPATING (line 16) requires incrementing count (line 4). Furthermore, the value in count does not change once any thread returns from the protocol. There are two possible return points: participating (line 16) and non-participating (line 8). At both return points, the poll must be closed. In the case of a participating return, the poll has either been closed by the returning thread or by an earlier thread (lines 12 and 13). In the non-participating case, the poll was observed to be closed (line 2). Once the poll is closed, it remains closed (there is no place in the protocol that re-opens the poll). If the poll is closed, count cannot be modified. Because all threads that return PARTICIPATING increment count once, and count does not change after a thread returns, all threads that return PARTICIPATING must observe count to contain the total number of threads that ultimately return PARTICIPATING.

Participating workgroups are simultaneously occupant The main purpose of the protocol is to create an execution environment that guarantees fair scheduling between workgroups, under the assumption of the occupancy-bound execution model. It is now argued that workgroups identified as participating are indeed simultaneously occupant. This property is shown by contradiction. Let P be the set of workgroups that return PARTICIPATING. Because simultaneous occupancy concerns multiple workgroups, suppose that P contains at least two workgroups. Assume that there exist workgroups w, v ∈ P that are not simultaneously occupant. Without loss of generality, assume that w finishes execution before v starts execution. In order for w to finish execution, w must have executed the discovery protocol, returning PARTICIPATING (line 16). In order for w to have reached this line, the poll must be closed, either by w or a different participating workgroup (lines 12 and 13). Now v eventually starts execution and begins executing the discovery protocol. However, because w has finished execution (and consequently the poll is closed), v must observe a closed poll at line 2. Thus, v must return NON_PARTICIPATING (line 8). Therefore v is not a participating group, a contradiction.

Looking forward: Notice that the definition of the occupancy-bound execution model does not consider thread or workgroup ids. In Chapter 6, a variant of the occupancy-bound execution model is explored where thread ids are considered. In particular, workgroups are scheduled in their id order. This new execution model enables an optimised version of the discovery protocol which allows the native workgroup id and global id functions to be used.

3.5 Inter-workgroup barrier

In this section, a global barrier that can be used to synchronise participating workgroups is described. First, an overview of the global barrier presented by Xiao and Feng [XF10] is given (abbreviated as the XF barrier from here on), as this barrier is used as a basis for the implementation in this work. Then, two changes to the XF barrier are presented, which allow the barrier to claim additional portability and correctness properties: (1) the addition of atomic instructions necessary for formal memory ordering properties and data race freedom, and (2) using an execution environment that ensures fair scheduling between workgroups.

1  if (get_local_id() + 1 < get_num_groups()) {
2    while (!flag[get_local_id() + 1]);
3  }
4
5  barrier();
6
7  if (get_local_id() + 1 < get_num_groups()) {
8    flag[get_local_id() + 1] = 0;
9  }

(a)

1  barrier();
2
3  if (get_local_id() == 0) {
4    flag[get_group_id()] = 1;
5    while (flag[get_group_id()] == 1);
6  }
7
8  barrier();

(b)

Figure 3.4: XF inter-workgroup execution barrier, one master workgroup executes (a), while all other workgroups execute (b).

3.5.1 The XF barrier

Figure 3.4 illustrates a variant of the XF software execution barrier [XF10], a GPU inter-workgroup barrier provided in the CUB CUDA library [Nvi]. Originally implemented in Nvidia’s CUDA language, the XF barrier has been shown to offer high performance, exhibiting a low level of memory contention and avoiding read-modify-write instructions. The variant shown is ported to OpenCL and removes a redundant intra-workgroup barrier. The XF barrier uses a master/slave model, where one workgroup is selected to be the master, executing the code of Figure 3.4a, and the remaining workgroups are slaves, executing the code of Figure 3.4b. Function barrier() denotes an intra-workgroup barrier operation. The slave workgroups start with an intra-workgroup barrier, to ensure that all threads in the local workgroup have arrived at the inter-workgroup barrier (slave line 1). A representative thread in the workgroup (with local id 0) writes 1 to the workgroup’s index in flag, an array of flags at least as large as get_num_groups(), to indicate that the workgroup has arrived at the inter-workgroup barrier (slave line 4). The representative thread then spins (slave line 5), waiting for the master workgroup to release the barrier.

The remaining threads in the workgroup wait at the final workgroup barrier instruction (slave line 8) for the representative. Each thread in the master workgroup takes responsibility for managing one slave workgroup: the workgroup with group id equal to the thread’s local id plus one; one is added to the local id because the group with id 0 is the master workgroup and does not need to be managed. Each master thread spins until the workgroup it is managing has arrived at the barrier (master line 2). The master workgroup then performs a workgroup barrier (master line 5). Given that master threads manage all other (slave) workgroups, the completion of this workgroup barrier denotes that all threads across all workgroups have arrived at the barrier. Each master thread now releases the workgroup it is managing by setting that workgroup’s flag to 0 (line 8). The code of Figure 3.4 assumes that there are at least as many threads per workgroup as there are workgroups, but is easily adapted to cater for a larger number of workgroups by having each thread in the master workgroup manipulate the flags of more than one workgroup. The XF barrier fails to directly provide portable inter-workgroup synchronisation for two reasons. First, because progress between workgroups is not guaranteed, the barrier is prone to deadlock due to starvation. Running the XF barrier on the Chapter 3 chips of Table 2.2 with 1024 workgroups (each with 256 threads) causes deadlock for every GPU studied in this chapter. Reducing the number of workgroups to 128 results in deadlock for all chips except Gtx980 and R9. Reducing the workgroup count to 2 avoids deadlock in all cases. The XF barrier was originally evaluated on GPUs where the number of concurrently executing workgroups was known a priori so that deadlock could be avoided [XF10]. Secondly, as shown in prior work by the thesis author, some GPUs have relaxed memory models, in which memory operations may appear to execute out of order [SD16a, ABD+15]. The original XF barrier implementation, presented in CUDA (which lacks a rigorous memory model), does not formally take account of memory ordering properties.

3.5.2 OpenCL 2.0 memory model primer

To ensure that the barrier induces sufficient synchronisation, the memory consistency model must be considered. OpenCL 2.0 provides a formal memory model that allows rigorous reasoning about inter-thread communication guarantees. Because this is a language-level memory model, any device that correctly supports the OpenCL 2.0 memory

model (i.e. by compiling the language constructs to assembly-level fences) will enjoy the synchronisation properties shown here. The OpenCL memory model defines the behaviour of concurrent memory accesses, including the atomic accesses and workgroup barriers used in the inter-workgroup barrier. The memory model is axiomatic: program behaviours are represented as sets of executions, each a graph of the memory events of one path of control flow with relations representing ordering in the execution. The memory model has two phases. The first finds consistent executions by filtering prospective program executions: imposing a set of constraints on happens-before, hb, a relation on events that collects together thread-local program order and inter-thread synchronisation. The second looks for data-races in the consistent executions, defined as an absence of happens-before between conflicting non-atomic accesses to a single variable. If even a single race exists in a single consistent execution, the entire program is given undefined behaviour. Otherwise, the consistent executions represent the program's behaviour. The global barrier implementation relies on happens-before edges created between threads as demonstrated by the dashed arrows in the two consistent execution shapes of Figure 3.5.

(a)                          (b)
T0            T1             T0               T1
a: Bentry     c: Bentry      a: Wna x = 1     c: Racq y = 1
b: Bexit      d: Bexit       b: Wrel y = 1    d: Rna x = 1

Figure 3.5: Two OpenCL synchronisation patterns used by the XF inter-workgroup barrier.

Figure 3.5a presents an execution with intra-workgroup barrier synchronisation. Intuitively, a given barrier call gives rise to barrier entry (Bentry) and exit (Bexit) events that span the threads of a workgroup. A barrier entry and exit event on the same thread are ordered by program order. Program order induces happens-before between events on the same thread, drawn as solid vertical arrows. For each intra-workgroup barrier call, its associated events are collected into a barrier instance, signified by the surrounding dashed box. Each barrier entry synchronises with every exit in the same instance. These synchronisation edges are called a barrier web.

In the example of Figure 3.5b, called message passing, thread T0 writes to x with a non-atomic write (Wna) and then writes to y with an atomic write annotated as a release at the device scope (Wrel). Thread T1 then reads from y using an atomic read annotated as an acquire at device scope (Racq) and then reads from x with a non-atomic read (Rna). In the absence of synchronisation, the non-atomic accesses to x would form a race. However, for threads in either the same or different workgroups, a device-scoped acquire read such as c that reads from a device-scoped release write such as b results in the creation of a happens-before edge, drawn as a dashed arrow in Figure 3.5b. Happens-before is transitive, so a happens-before d; this avoids a data race on x and forces d to read from a.

3.5.3 Implementation

Recall the original XF barrier implementation shown in Figure 3.4. As discussed in Section 3.5.1, the XF barrier has an arrival phase, where the master waits for every slave to announce its presence at the barrier, and a departure phase where the master releases the slaves from the barrier. Concurrent non-atomic accesses to the flag array would lead to data races in OpenCL. As a consequence, the XF barrier, as presented, has undefined behaviour. Moreover, the purpose of each phase is to synchronise, first from the slave threads to the master, and then from the master to the slave threads. Non-atomic accesses do not provide this sort of synchronisation. To realise the intended behaviour of the XF barrier in OpenCL 2.0, the original implementation is augmented with atomic instructions. Specifically, flag is declared as an array of atomic integers and all accesses are annotated as release and acquire device-scoped atomics, avoiding races and inducing synchronisation. More precisely, slave line 4 and master line 8 become device-scoped release stores, and master line 2 and slave line 5 become device-scoped acquire loads. Because the barrier covers only participating groups, its thread id functions are replaced to use the execution environment provided by the discovery protocol (see Table 3.1): get_num_groups becomes p_get_num_groups (master lines 1 and 7), returning the number of participating groups, and get_group_id becomes p_get_group_id (slave lines 4 and 5).
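As an illustrative sketch of this augmentation (under the assumption that flag is now a global atomic_int array and that the p_* functions of Table 3.1 are available), the slave-side arrival and wait of Figure 3.4b might look as follows; the master side is adapted symmetrically.

  void slave_arrive_and_wait(global atomic_int *flag) {
    if (get_local_id(0) == 0) {
      // Slave line 4: announce arrival with a device-scoped release store.
      atomic_store_explicit(&flag[p_get_group_id()], 1,
                            memory_order_release, memory_scope_device);
      // Slave line 5: spin on a device-scoped acquire load until released.
      while (atomic_load_explicit(&flag[p_get_group_id()],
                                  memory_order_acquire,
                                  memory_scope_device) == 1)
        ;
    }
  }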

3.5.4 Barrier specification

Here, a generic barrier specification, parameterised by an OpenCL scope, is presented and integrated with the formal OpenCL memory model of Batty et al. [BDW16]. The OpenCL workgroup barrier primitive behaves according to the specification instantiated with

workgroup scope, closely following OpenCL [Khr15, p. 53], whereas the inter-workgroup barrier behaves according to the specification instantiated with device scope. This symmetry suggests that the device-scoped barrier specification should be natural to OpenCL programmers familiar with the workgroup barrier. OpenCL programs are required to be free from barrier divergence [Khr16a, p. 99]. Barrier divergence occurs when either (a) two workitems in the same workgroup reach syntactically distinct intra-workgroup barrier statements, or (b) an intra-workgroup barrier statement appears in a nest of loops, and two workitems reach the barrier statement having executed different numbers of iterations for at least one enclosing loop. See Collingbourne et al. for a formal definition [CDKQ13]. Kernels that exhibit barrier divergence have undefined behaviour, thus the focus of the approach is restricted to kernels that are divergence-free. It is assumed that two properties follow from barrier divergence-freedom: all barrier instances cover all threads at the workgroup scope and all participating threads at device scope, and no barrier instance links barrier events from within the XF barrier to those outside. The specification updated with the global barrier provides two additions to the formal model of Batty et al.: new machinery to generate prospective executions with barrier events from programs, and new happens-before edges in the memory model.
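As a small illustration (a hypothetical kernel, not from the thesis), the loop below triggers case (b): each thread executes a different number of iterations, so threads of the same workgroup reach the barrier on different iterations and the kernel has undefined behaviour.

  kernel void divergent_barrier() {
    // Trip count depends on the thread's local id, so the barrier below is
    // not encountered uniformly across the workgroup: barrier divergence.
    for (int i = 0; i <= (int)get_local_id(0); i++) {
      barrier(CLK_LOCAL_MEM_FENCE);
    }
  }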

Memory-model barrier specification The execution of a dynamic instance of a barrier gives rise to two program-ordered events on each thread in its scope (workgroup or device): a barrier entry event (Bentry) followed by a barrier exit event (Bexit). For all threads executing the barrier instance, the Bentry of one thread happens-before the Bexit of all other threads. These new hb edges arising from a single barrier instance are precisely the edges in the barrier web of Figure 3.5a. The dashed box surrounding the web is called an instance-box — in future drawings, the web is elided and only the box is drawn. Intuitively, the web ensures that any access preceding a barrier happens-before any access following any barrier in the same instance, avoiding races between prior and following accesses.
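As a sketch, using notation that is illustrative rather than drawn verbatim from the formal model, the edges contributed by a single barrier instance b, executed by the set of threads T_b in its scope, can be written as:

\[
\forall t_1, t_2 \in T_b.\; t_1 \neq t_2 \;\Longrightarrow\;
   \big(B^{\mathrm{entry}}_{t_1}(b),\, B^{\mathrm{exit}}_{t_2}(b)\big) \in \mathit{hb}
\]

Since sequenced-before edges contribute to happens-before in the model and happens-before is transitive, any event sequenced before the entry event on one thread is related by hb to any event sequenced after the exit event on any other thread in T_b, which is exactly the race-avoidance property stated above.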

3.5.5 Correctness of the inter-workgroup barrier

Here it is argued that the global barrier implementation meets its specification in terms of synchronisation. A more formal proof, following the library abstraction method of Batty et al. [BDG13], is given in the publication corresponding to this chapter [SDB+16].

Progress guarantee Correctness of the barrier relies on the progress guarantees assumed from the occupancy-bound execution model and the participating-group execution environment produced by the discovery protocol. Threads in the barrier implementation use spin loops to wait for writes by other threads (master line 2 and slave line 5 in Figure 3.4). If they were to repeatedly read from older writes, the barrier would hang. To avoid this, an infinite sequence of happens-before-ordered reads failing to see a write from another thread is prohibited.

Abstraction of the inter-workgroup barrier Considering the augmented XF barrier as the implementation, and the abstract device-level barrier described in Section 3.5.4 as the specification, it is now argued that the implementation induces the synchronisation required by the specification. First, the implementation is shown to be free from data races. This is straightforward, as the augmented XF barrier contains only device-scoped atomic operations when accessing inter-workgroup shared memory. By definition, these accesses cannot produce a data race. Second, the synchronisation induced by the barrier is shown to match the specification. The crux of this argument involves looking at a single barrier call across all threads. There are four cases to be considered for call-return pairs: same-workgroup, master-slave, slave-master, and slave-slave, where in the latter three cases the master and slave are in different workgroups. In the specification, the barrier web covers all of the threads belonging to participating workgroups identified by the discovery protocol, and creates synchronisation between every pair of threads. For each of the four cases, it must be shown that the augmented XF barrier replicates this synchronisation. The axiomatisation of the events that make up an execution of the barrier across three workgroups (simplified with only two threads each) is shown in Figure 3.6: this event graph coincides with the various paths of control flow through the XF barrier. The spin-loops and intra-workgroup barriers constrain the execution such that there is essentially only one execution allowed. Other allowed executions would simply include failed read events at the spin-loop locations. It is left to show that the implementation produces synchronisation in each of the four cases. This is argued by identifying synchronisation of the two varieties presented in Figure 3.5 in the execution of the augmented XF barrier, and recalling that happens-before is transitive. Thus, showing that there is a hb path from the top of a thread in Figure 3.6 to the bottom of another thread establishes a barrier-web hb edge between the two threads. The four cases are:

Figure 3.6: Idiomatic execution of XF barrier. [Figure: an event graph spanning slave workgroup 0, the master workgroup and slave workgroup 1 (two threads each), in which barrier webs (a, d, f, j, m) are linked by device-scoped flag accesses: the slaves' release writes of 1 to their flag entries (b: Wrel x0 = 1, k: Wrel x1 = 1) are read by the master's acquire loads (e, h), and the master's release writes of 0 (g, i) are read by the slaves' acquire loads (c, l).]

• same-workgroup: whether the call and return reside on the master workgroup or a slave, in Figure 3.6 there is an instance box across the workgroup. The happens-before edge is ensured by the barrier web of a workgroup-scoped barrier, as in Figure 3.5a.

• slave-master: as the master breaks out of its loop, it must read the slave’s write of the flag array. The device-level release and acquire synchronise as in Figure 3.5b, creating a happens-before edge from one thread on the slave to one on the master. In Figure 3.6, the preceding barrier on the slave, and the following one on the master then complete the happens-before edge between the slave’s call and the master’s return.

• master-slave: similarly, the slave breaks out of its loop by reading the master's write, creating synchronisation that is extended by the barriers on the master and slave.

• slave-slave: in this case, a slave synchronises with the master, and then the master synchronises with a slave. The prior, intervening, and following barriers complete the call-to-return happens-before edge.

3.6 Empirical evaluation

Here, an empirical evaluation of the methods described in this chapter is presented across the Chapter 3 GPUs of Table 2.2. First, the results of microbenchmarks measuring the recall of the occupancy estimate provided by the discovery protocol with respect to the occupancy bound are shown (Section 3.6.1). Then several benchmarks that use the multi-kernel method for inter-workgroup synchronisation are examined and adapted to use the discovery protocol/barrier combination. The runtimes of the two approaches are then compared (Section 3.6.2). In a similar vein, applications written using the non-portable occupancy-assumption approach are examined and adapted to be portable using the discovery protocol. The overhead caused by the discovery protocol and the associated execution environment is presented (Section 3.6.3). Finally, pilot experiments in which the protocol is tested on CPU implementations of OpenCL are reported (Section 3.6.4). To preface these experimental results, it is noted that even though the occupancy-bound execution model is not officially endorsed by OpenCL, the experiments indicate that the model is supported by all the GPUs evaluated in this chapter (Table 2.2). Specifically, the use of the inter-workgroup barrier never led to a deadlock, and a concrete value N for the occupancy bound was always observable. It appears that the occupancy-bound execution model is a useful de facto property of devices that support OpenCL, so we believe there is a case for incorporating the model officially in OpenCL, perhaps as an extension.

3.6.1 Recall of discovery protocol

As discussed in Section 3.4, the occupancy discovery protocol identifies a subset estimate of co-occupant workgroups for a given GPU and kernel. It is desirable for the discovered occupancy to be as close as possible, and ideally equal, to the occupancy bound for the GPU and kernel. To assess this, microbenchmarks are used to experimentally evaluate the recall of the occupancy discovered by the protocol, experimenting with two concrete mutex implementations and four kernel configurations.

Microbenchmarks The occupancy bound for a given GPU and kernel depends on the resources used by the kernel. To thoroughly evaluate the recall of the discovery protocol, a set of kernels with a variety of resource-usage characteristics is required. Two resources are considered here: local memory, and threads per workgroup. Register pressure can also affect the occupancy bound of a kernel [PTG13], but register allocation is a function of the compiler, and as such fine-grained control over this resource is unavailable.

The microbenchmarks are based on a kernel that calls the discovery protocol to identify participating groups, which then execute one inter-workgroup barrier. The single barrier synchronisation in the microbenchmark can be configured to use participating workgroups, or all workgroups; this allows for an unsafe search for the occupancy bound. This kernel is parameterised by (1) the amount of local memory that is allocated, and (2) the number of threads per workgroup. A single microbenchmark is an instance of the kernel with a particular choice for these parameters.
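A sketch of such a kernel is given below. The discovery_protocol entry point, its context type, and the LOCAL_WORDS macro (set at compile time to control the local-memory allocation) are illustrative names; the actual protocol interface is the one described in Sections 3.3 and 3.4, and the workgroup size is the second resource parameter, chosen at launch time.

// Sketch of a microbenchmark kernel: local-memory usage and threads per
// workgroup are the two resource parameters varied in the experiments.
__kernel void occupancy_microbenchmark(__global discovery_ctx *ctx,
                                       __global atomic_int *barrier_flags,
                                       __global int *discovered) {
  __local int padding[LOCAL_WORDS];                  // parameter (1): local memory
  padding[get_local_id(0) % LOCAL_WORDS] = (int)get_local_id(0); // keep it live

  if (!discovery_protocol(ctx))                      // non-occupant groups exit
    return;

  global_barrier(barrier_flags);                     // the single barrier

  if (p_get_group_id() == 0 && get_local_id(0) == 0)
    *discovered = p_get_num_groups();                // record discovered occupancy
}

The unsafe variant used to search for the occupancy bound simply skips the discovery call and executes the barrier across all launched workgroups.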

Mutex implementations In Section 3.4, the mutex implementation used by the discovery protocol was left abstract. Two mutex implementations are evaluated here: a spin-lock and a ticket-lock. Both have straightforward implementations using OpenCL 2.0 atomic operations. The spin-lock [Sol09, p. 269] is implemented via a flag variable that can be locked or unlocked. To lock the mutex, a thread enters a spin loop that uses an atomic test-and-set operation to write locked to the flag and obtain the previous flag value. The thread exits the loop when the previous flag value is unlocked, indicating that the thread has successfully acquired the lock. A thread unlocks the mutex by writing unlocked to the flag. The ticket-lock [Sol09, p. 276] mutex uses two counters: a ticket value and a servicing value. To lock the mutex, a thread first atomically increments the ticket value, obtaining the old value as a ticket. The thread then polls the servicing value until it matches the thread's ticket, at which point the thread has acquired the lock. A thread unlocks the mutex by incrementing the servicing value. Unlike the spin-lock, where contending threads may obtain the mutex in any order, the ticket-lock is fair: threads obtain the mutex in the order in which they request it.
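The two mutexes can be written with OpenCL 2.0 atomics roughly as follows. This is a sketch rather than the exact code used in the experiments; in particular, the test-and-set is expressed with an atomic exchange, and the memory orders shown are one reasonable choice.

// Spin-lock: a single flag; 0 = unlocked, 1 = locked.
void spin_lock(__global atomic_int *flag) {
  while (atomic_exchange_explicit(flag, 1,           // test-and-set
                                  memory_order_acq_rel, memory_scope_device) == 1)
    ;                                                // previous value was locked
}
void spin_unlock(__global atomic_int *flag) {
  atomic_store_explicit(flag, 0, memory_order_release, memory_scope_device);
}

// Ticket-lock: a ticket counter and a servicing counter; fair (FIFO).
typedef struct { atomic_int ticket; atomic_int serving; } ticket_lock_t;

void ticket_lock(__global ticket_lock_t *m) {
  int my_ticket = atomic_fetch_add_explicit(&m->ticket, 1,   // take a ticket
                                            memory_order_relaxed, memory_scope_device);
  while (atomic_load_explicit(&m->serving,           // wait until served
                              memory_order_acquire, memory_scope_device) != my_ticket)
    ;
}
void ticket_unlock(__global ticket_lock_t *m) {
  atomic_fetch_add_explicit(&m->serving, 1,          // serve the next ticket
                            memory_order_release, memory_scope_device);
}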

Experimental setup For a given GPU, the OpenCL framework is used to identify the maximum number of bytes of local memory that can be allocated (L), and the maximum number of threads per workgroup (W); these values vary between devices. Four microbenchmark instances are then considered for the GPU, with allocated local memory set to either 1 or L bytes, and either 1 or W threads per workgroup. These instances capture extreme points of the resource parameter space. Each microbenchmark is executed 50 times per chip and mutex implementation, recording the mean and standard deviation of the number of discovered workgroups. To determine the occupancy bound of a microbenchmark, the discovery protocol is disabled. The microbenchmark then attempts to execute the inter-workgroup barrier (unsafely) across all workgroups, searching for a value N (the occupancy bound) such that the unsafe inter-workgroup barrier succeeds for N workgroups, but hangs with N + 1 workgroups.

Figure 3.7: Discovered occupancy and number of compute units compared to the occupancy bound. [Bar chart, one group of bars per GPU (Gtx980, K5200, Iris, Hd5500, R9, R7, Mali-4, Mali-2) and per configuration (11: min local memory, min workgroup size; L1: max local memory, min workgroup size; 1W: min local memory, max workgroup size; LW: max local memory, max workgroup size), showing spin-lock and ticket-lock recall as a percentage of the occupancy bound (log scale), with the number of CUs marked by a horizontal bar.]

Recall results Figure 3.7 shows the recall of occupancy discovery for both mutex implementations, as a percentage of the occupancy bound (y-axis, plotted using a log scale to account for the low recall of the spin-lock). For each GPU, results are shown for four microbenchmarks; label xy (with x ∈ {1, L} and y ∈ {1, W}) indicates the resource parameters associated with a microbenchmark. Light and dark grey bars show recall for the spin-lock and ticket-lock mutexes, respectively. Standard-deviation whiskers are shown for the spin-lock results, and omitted for the ticket-lock results, which exhibited negligible deviation. The black horizontal bars show the number of compute units (CUs) as reported by the OpenCL framework per chip (as a percentage of the occupancy bound). Table 3.2 shows the average concrete occupancy numbers for each chip, benchmark and mutex. The results show that with the ticket-lock mutex, the protocol almost always provides 100% recall (discovering the occupancy bound); recall is suboptimal, though still above 95%, in two cases: R7 and Iris for the 11 microbenchmark. However, the 11 configuration would be unlikely to be used in practice, as using only one thread per workgroup would severely underutilise hardware resources (e.g. SIMD processing elements). In contrast, the spin-lock performs poorly, both in the mean discovered occupancy (often less than 50% of the occupancy bound) and in consistency across runs, as shown by the high standard deviation.

Table 3.2: Occupancy for each chip and microbenchmark for occupancy bound (OB), and the average discovered occupancy using the ticket-lock (TL) and spin-lock (SL).

Chip     #CUs          11      L1      1W      LW
Gtx980    16    OB   512.0    32.0    32.0    32.0
                TL   512.0    32.0    32.0    32.0
                SL    24.9     4.0     3.3     3.1
K5200     12    OB   192.0    12.0    24.0    12.0
                TL   192.0    12.0    24.0    12.0
                SL    10.1     1.1     3.7     2.3
Iris      47    OB    96.0     6.0    41.0     6.0
                TL    94.9     6.0    41.0     6.0
                SL    12.4     3.8     8.2     3.5
Hd5500    24    OB    48.0     3.0    21.0     3.0
                TL    48.0     3.0    21.0     3.0
                SL     8.1     2.9     7.2     2.7
R9        28    OB   896.0    48.0   224.0    48.0
                TL   896.0    48.0   224.0    48.0
                SL    39.9     7.3    19.4     7.7
R7         8    OB   256.0    16.0    64.0    16.0
                TL   250.3    16.0    64.0    16.0
                SL    20.0     3.1     9.2     4.6
Mali-4     4    OB   256.0   256.0     4.0     4.0
                TL   256.0   256.0     4.0     4.0
                SL    19.6    18.3     3.2     3.1
Mali-2     2    OB   128.0   128.0     2.0     2.0
                TL   128.0   128.0     2.0     2.0
                SL    11.2     9.5     2.0     2.0

The high accuracy of the ticket-lock-based discovery protocol can be attributed to the fairness provided by this mutex implementation, which makes it likely that many workgroups will enter the poll before any workgroup closes the poll (see Figure 3.3). In contrast, with the unfair spin-lock, there is a higher likelihood that a workgroup will execute the polling and closing phases in quick succession, closing the poll before many other workgroups enter the polling phase. The black horizontal bars show that the number of CUs (reported by OpenCL) provides neither an accurate nor a safe estimate of occupancy. For example, on Gtx980, R9 and R7, the number of CUs is always lower than the occupancy bound, even for the extreme LW case, where the maximum amount of local memory is allocated and the maximum number of threads per workgroup is requested.

Thus, using CUs to estimate occupancy would under-utilise GPU resources. For Mali-4, Mali-2, and K5200, the number of CUs corresponds to the occupancy bound only with high resource parameters. It is surprising to see that on Iris and Hd5500 (Intel), the number of CUs can be higher than the occupancy bound. In these cases, using the number of CUs as an occupancy estimate would cause an inter-workgroup barrier to deadlock. The reason behind this, as reported by Mrozek and Zdanowicz [MZ16], is that Intel reports the number of execution units (EUs) as the number of CUs, and more than one EU may be required to run a workgroup. Mrozek and Zdanowicz also provide an Intel-specific formula for computing the thread occupancy (as opposed to workgroup occupancy) for kernels that (1) do not use any local memory and (2) are launched with large workgroup sizes. These constraints correspond to microbenchmark 1W. The occupancy formula states that the number of occupant threads is found by multiplying three values together: the number of EUs (the analogue of CUs in this specific Intel case), the threads per EU (obtained through device documentation [Int15b]), and the SIMD size (queried through the OpenCL API). For example, on Hd5500 this gives 24 × 7 × 32 = 5376 as the number of occupant threads. This is exactly the number of occupant threads found for Hd5500 on microbenchmark 1W; recall that to get the number of threads, the number of workgroups is multiplied by the number of threads per workgroup, in this case 21 × 256 = 5376. No formula is given for when a kernel uses local memory or small workgroup sizes, although a reading of the documentation suggests some formula may exist based on the amount of local memory allocated per group of EUs. However, such a formula would (1) be difficult to deduce from the documentation, (2) have no guarantees of safety, and (3) only be valid until a new graphics architecture is released.
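Restating the calculation above as a formula (following Mrozek and Zdanowicz; the Hd5500 numbers are those quoted in the text):

\[
\text{occupant threads} \;=\; \#\mathrm{EUs} \times \text{threads per EU} \times \text{SIMD width}
\;=\; 24 \times 7 \times 32 \;=\; 5376 \quad \text{(Hd5500, microbenchmark 1W)}
\]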

Timing results To measure the runtime cost of the discovery protocol, timing measurements were also performed. That is, protocol timings were measured per chip as a function of the occupancy bound. To vary the occupancy bound, a series of microbenchmarks was considered, instantiated with increasing resource usage that leads to a decreasing occupancy bound. The workgroup size was used as the resource, and values from one up to the maximum workgroup size (in multiples of eight) were considered. Each microbenchmark was run 20 times, and both the average discovered occupancy and the average time to run the discovery protocol were recorded. These microbenchmarks do not contain an execution of the inter-workgroup barrier, as only the protocol time is of interest here.

Figure 3.8: Protocol timing for K5200 and Mali-2. [Plot of protocol execution time in ms (log scale) against the occupancy bound, for the spin-lock and ticket-lock on each of K5200 and Mali-2; each point is labelled with the average discovered occupancy.]

Discovery protocol timing results for two representative chips, K5200 and Mali-2, are shown in Figure 3.8. For these two chips, K5200 outperforms Mali-2 in all cases; however, comparing the differing mutex strategies is more interesting here. The data labels give the average discovered occupancy for each point. There are more data points for the K5200 as it supports a larger maximum workgroup size and hence has more values for the resource parameter. For chips K5200 (shown in Figure 3.8), Gtx980, R9, and R7, the protocol using the ticket-lock is less performant than using the spin-lock. This is consistent with what is generally reported for fair vs. unfair mutexes [Sol09, ch. 8]. Conversely, for chips Mali-2 (shown in Figure 3.8), Mali-4, Iris and Hd5500, the protocol using the ticket-lock is more performant than using the spin-lock. This may be due to inefficient RMW operations inside the spin-lock loop. For all chips, the protocol runtime increases with the occupancy bound, although the extent varies across chips. Regardless of performance, the low recall of the spin-lock disqualifies it for use in the protocol (Section 3.6.1). Future work might consider augmenting the spin-lock, e.g. through busy waiting, to achieve a higher recall. For most chips, the protocol executed in 5–10 ms depending on the occupancy bound. The exceptions are the embedded ARM chips, which took 20–100 ms to execute the protocol (for either mutex).

3.6.2 Comparison with the multi-kernel approach

Applications that originally used the multi-kernel paradigm, but which can be naturally expressed as a single kernel with inter-workgroup barriers, are now considered. In these applications, several kernels are called from a host-side loop that exits when an application-specific stopping criterion is met. The kernels compute data that must be copied back to the host at the end of each loop iteration in order for the stopping criterion to be evaluated. An inter-workgroup barrier allows the computation to be expressed as one large kernel, with the host-side loop migrated to run on the GPU. This requires fewer kernel launches and removes the host/device data movement between the previous kernel launches. For such applications, the performance of the application expressed in its original multi-kernel form is compared with its performance as a single kernel using an inter-workgroup barrier; the transformation is sketched below.
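The sketch shows the host-side loop in C (using standard OpenCL API calls) and the fused device-side loop in OpenCL C. The kernel, buffer and helper names (relax_edges, discovery_protocol, global_barrier, and so on) are illustrative, and details such as resetting the convergence flag before each launch are elided.

/* Original multi-kernel idiom: one launch per iteration, with a blocking
   read-back of the stopping flag each time around the host loop. */
int done = 0;
while (!done) {
  clEnqueueNDRangeKernel(queue, relax_kernel, 1, NULL,
                         &global_size, &wg_size, 0, NULL, NULL);
  clEnqueueReadBuffer(queue, done_buf, CL_TRUE, 0, sizeof(int), &done,
                      0, NULL, NULL);
}

/* Single-kernel idiom: the loop moves onto the GPU; the global barrier
   replaces the synchronisation implicit in the kernel boundaries. */
__kernel void relax_fused(__global discovery_ctx *ctx,
                          __global graph_t *g, __global int *dist,
                          __global atomic_int *done,
                          __global atomic_int *barrier_flags) {
  if (!discovery_protocol(ctx)) return;              // non-occupant groups exit
  for (;;) {
    if (p_get_group_id() == 0 && get_local_id(0) == 0)
      atomic_store_explicit(done, 1, memory_order_relaxed, memory_scope_device);
    global_barrier(barrier_flags);
    relax_edges(g, dist, done);        // clears done (relaxed store) on progress
    global_barrier(barrier_flags);     // publishes this round's updates and done
    if (atomic_load_explicit(done, memory_order_relaxed, memory_scope_device))
      break;
  }
}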

Multi-kernel applications The Pannotia OpenCL applications benchmark the performance of graph algorithms containing irregular parallelism patterns [CBRS13]. Four of the applications that utilise the multi-kernel idiom are considered: SSSP (single source shortest path), MIS (maximal independent set), COLOR (graph colouring), and BC (betweenness centrality). The remaining applications in Pannotia either do not use the multi-kernel idiom, or use multi-dimensional execution environments, where thread ids are assigned in a multi-dimensional structure rather than a flat, linear structure, which the discovery protocol does not currently handle. The discovery protocol and inter-workgroup barrier are used to write a single-kernel version of each application. The Pannotia applications are reported to have different performance characteristics depending on the input data sets to which they are applied [CBRS13]. Given this, each application is benchmarked with all provided data sets. Each application comes with two data sets, with the exception of SSSP, which has only one. In Chapter 4, similar applications are explored using more data sets. This discrepancy in the available data-sets arises because the Pannotia applications take a different input graph format from the applications examined in Chapter 4, and a unifying parser had not been developed at this point.

Experimental setup Because the number of threads per workgroup can have an impact on runtime and maximum occupancy [TGEL11], a tuning phase was first run to determine a good number of threads per workgroup. For each GPU, application, and input data-set, the application was run repeatedly with power-of-two workgroup sizes, ranging from 32 to the maximum workgroup size supported by the GPU. Preliminary testing suggested that workgroup sizes below 32 did not perform well. Each combination was executed 10 times, and the workgroup size that provided the fastest average runtime was recorded.

Figure 3.9: Runtime comparison of multi-kernel paradigm vs. inter-workgroup barrier. [Bar chart of inter-workgroup barrier speedup (log scale) per GPU for each application/input combination (BC [128k], BC [1M], COLOR [eco], COLOR [circ], MIS [eco], MIS [circ], SSSP [usa]); per-GPU summary statistics annotated on the chart:

           Gtx980  K5200   Iris  Hd5500    R9     R7   Mali-4  Mali-2
  GM        1.03    0.96   1.36    1.28   1.29   0.98    0.58    0.64
  Median    1.02    0.98   1.28    1.22   1.20   0.99    0.51    0.57
  Max       1.10    1.04   2.13    1.66   1.68   1.51    2.34    1.78
  Min       0.97    0.85   1.15    1.13   1.05   0.55    0.22    0.33 ]

The original multi-kernel applications and the adapted discovery protocol applications were tuned independently. For both the multi-kernel and discovery protocol implementations, the application/data-set combinations were then executed 20 times with the workgroup size found during the tuning phase, and the average runtime (excluding file IO for data-sets) was recorded. Application runtime was significantly longer for the ARM chips, which are designed to maximise energy efficiency; the number of iterations for these chips was thus halved.

Results For each GPU, application and input, Figure 3.9 contains a bar plotting the average speedup associated with executing the single-kernel version of the application, enabled by the inter-workgroup barrier, compared with the original multi-kernel version. The shade and border of a bar indicate the associated application and input, respectively. For each GPU, the figure also shows the geometric mean (GM), median, maximum, and minimum speedups taken over all applications and inputs. All speedup values are given in Table 3.4. The results are varied across chips and applications. For Nvidia GPUs (Gtx980 and K5200), the inter-workgroup barrier and multi-kernel variants have similar runtimes. For Intel chips (Iris and Hd5500), the inter-workgroup barrier always provides a non-negligible speedup, with mean speedups of 1.36× and 1.28×, respectively. The AMD results differ between chips: the inter-workgroup barrier always improves runtime on R9, while on R7 the following was observed: three speedups, three slowdowns, and one case where performance is unaffected.

On ARM, most results show that using the inter-workgroup barrier worsens runtime substantially, except for the BC application on the 128k data-set, which shows a large improvement. This is likely due to the performance trade-offs of kernel launch overhead and global synchronisation, which are application and input specific. That is, kernels with long execution times amortise the cost of kernel launch, as the kernel execution time becomes the bottleneck. The BC application on the 128k data-set has the shortest kernel execution time and benefits from the barrier (also seen on the Intel and AMD GPUs). The other applications have longer kernel execution times, and thus the synchronisation overhead of the global barrier can affect performance. The performance of the barrier is sensitive to the input for an application. For example, the inter-workgroup barrier on R7 accelerates the COLOR application for the eco data-set, but slows this application down for the circ data-set.

Looking forward: In Chapter 4, a detailed analysis of the performance considerations of moving from multi-kernel to global barrier synchronisation is presented. It is shown that the runtime consequence of this transformation depends on certain properties of chips (e.g. kernel launch overhead) and application inputs (e.g. the average degree of graph nodes). Properties of the Pannotia MIS benchmark are discussed explicitly in Section 4.6.1.

3.6.3 Portability vs. specialisation

Attention is now turned to the use of the inter-workgroup barrier to create portable versions of applications that previously relied on a priori knowledge related to occupancy, and to the performance price one pays for deploying a portable version of an application rather than relying on assumptions about occupancy. For this purpose, applications from the CUDA Lonestar-GPU benchmark suite are examined. These applications were originally written in CUDA and thus targeted only Nvidia GPUs. Like Pannotia, the Lonestar-GPU applications consist of graph algorithms that exhibit irregular parallelism patterns [BNP12]. Four Lonestar-GPU applications use a non-portable XF inter-workgroup barrier that (1) relies on assumptions about occupancy, and (2) does not consider formal memory model issues, in part due to the lack of an agreed formal memory model for CUDA. The Lonestar-GPU suite provides an occupancy estimation method that uses a CUDA-specific query function to assess the resources used by a kernel, but this estimate is not guaranteed to be safe, and is not portable.

The relevant Lonestar-GPU applications are: MST (minimum spanning tree), DMR (Delaunay mesh refinement), SSSP (single source shortest path) and BFS (breadth first search). The Lonestar-GPU and Pannotia SSSP applications are fundamentally different, the former using task queues to manage the workload and the latter using linear algebra methods. Much like Pannotia, multiple data sets are provided for each application. The SSSP and BFS applications have three input data-sets; MST and DMR have two.

Portable and specialised OpenCL applications These four applications were ported to OpenCL, replacing the barrier implementation with the memory-model-aware barrier, and the non-portable occupancy estimation function with the portable occupancy discovery protocol. These versions of the applications should be portable (at least in terms of their barrier behaviour) across OpenCL-conformant platforms that honour the occupancy-bound execution model. However, it is speculated that there may be specific OpenCL platforms for which a developer might wish to trade portability for performance, using the memory-model-aware barrier, but exploiting a priori knowledge about occupancy to avoid the overhead of running the discovery protocol. To investigate this trade-off, non-portable specialised variants of the four applications were created for each of the GPU platforms. These variants store pre-computed occupancy bound data, and launch kernels with maximum thread counts derived from this occupancy data. As well as avoiding running the discovery protocol, these specialised variants can use the native OpenCL thread id functions (see Section 3.4), which may, for example, allow compilers to make more aggressive optimisations.

Portability constraints and issues Numerous hurdles were encountered during the process of porting the Lonestar-GPU applications to OpenCL; these were written up as a separate experience report [SD16b]. The issues include OpenCL compiler bugs, OpenCL driver issues, and program bugs that only manifested on certain chips. Many of these issues were reported and confirmed with industry representatives. In particular, the following benchmarks are omitted from the study: DMR on non-Nvidia chips due to floating point issues, SSSP with the usa dataset on Intel due to memory requirements, and SSSP with the usa dataset on ARM or R7 due to prohibitively long runtimes (over 45 minutes per execution). These issues are all independent of the inter-workgroup barrier and stem from portability issues in GPU programming languages. Additionally, some porting judgement was required in cases where CUDA constructs have no direct OpenCL analogue; for instance, 1D texture memory is not supported in OpenCL 1.1 (the OpenCL version supported by Nvidia), and warp-aware primitives (e.g. warp shuffle) are not provided until OpenCL 2.1.¹ For these issues, global memory was used instead of texture memory, and warp-aware idioms were re-written to use workgroup locality in place of warps.

Figure 3.10: Portability slowdown for inter-workgroup barriers. [Bar chart of portability slowdown (log scale) per GPU for each application/input combination (BFS [2e23], BFS [rmat22], MST [2e20], MST [usa], SSSP [2e23], SSSP [rmat22]); per-GPU summary statistics annotated on the chart:

           Gtx980  K5200   Iris  Hd5500    R9     R7   Mali-4  Mali-2
  GM        1.31    1.29   1.05    1.06   1.71   1.17    0.83    0.77
  Median    1.31    1.30   1.05    1.05   1.72   1.20    0.77    0.70
  Max       1.80    1.54   1.11    1.11   2.53   1.41    1.09    1.05
  Min       1.09    1.08   1.00    1.03   1.17   0.93    0.61    0.59 ]


Experimental setup The same tuning process described for the Pannotia benchmarks was used to find a suitable workgroup size per GPU for each application and input data-set combination. To determine the occupancy bound for the specialised applications, specialised variants for each chip, application and input combination were created; these applications have the occupancy bound hard-coded inside the application. The occupancy bound, N, is validated if the application terminates with N workgroups but not with N + 1. Much like in the microbenchmark experiments (Section 3.6.1), N is found through a trial-and-error binary search. Each chip, application and input combination was run for 20 iterations using both the portable and specialised variants. The workgroup size found in the tuning phase for both variants was used, and the specialised variants use the validated occupancy bound. The average runtime (excluding file IO for data-sets) was recorded. The runtime of these applications is such that all iterations were run on the ARM chips, rather than halved as in Section 3.6.2.

Results Figure 3.10 shows the slowdown caused by portability (concrete slowdown numbers per chip and application are given in Table 3.4). For consistency across chips, only two data-sets for each of the BFS, MST and SSSP applications are shown.

¹Section 4.4 discusses how subgroup support can be provided across several GPUs using chip-specific features.

Figure 3.11: Portability overhead as runtime increases. [Scatter plot of portability slowdown (1 to 2.6) against specialised runtime in ms (50 to 1600, log scale); each cross is one application, input and GPU combination.]

Because this graph measures slowdown, bars that are taller than the 1.0 mark indicate that performance was degraded by using the portable constructs. Portability on Nvidia chips (for the OpenCL application variants) costs a mean 1.3× slowdown compared with relying on occupancy assumptions. For Intel chips, the cost of portability is low, with a slowdown of 1.11× at worst. The results for AMD are split: R9 suffers the worst slowdowns across all the GPUs considered, with a mean slowdown of 1.71×, while slowdowns associated with R7 are more modest, with a mean slowdown of 1.17× and even one speedup (SSSP on 2e23). Interestingly, portability provides a speedup for the BFS and SSSP applications on the ARM GPUs. This is because occupancy on the ARM GPUs is extremely sensitive to register pressure. In the non-portable versions of the applications, native execution environment functions are used (e.g. get_global_id()). The compiler maps these functions to registers, increasing register pressure and reducing occupancy. In the portable version, the discovery execution environment functions are used and their values are cached in local memory instead of registers. Thus, the occupancy is increased, allowing more parallelism and higher performance. While these numbers suggest that portability may have a high cost, Chapter 4 presents an optimisation to the discovery protocol in which the mutex only provides mutual exclusion to participating groups. This optimisation significantly reduces the portability overhead. For example, the highest overhead is seen on R9 with the BFS [2e23] application, which suffers a 2.53× slowdown when the portable constructs are used. However, the optimised discovery protocol reduces the portability overhead to 1.11×. The details of this optimisation are discussed in Section 4.6.2. The scatter plot of Figure 3.11 provides an overview of how the slowdown associated with portability relates to the overall runtime of the specialised application. Each cross refers to a particular application, input data set and GPU.

The x-coordinate of the cross indicates the runtime of the specialised version of the application (averaged over 20 runs), and the y-coordinate shows the slowdown resulting from switching to a portable version of the application. The results suggest that portability has a constant cost, rather than a scaling cost: the longer the runtime of the specialised application, the smaller, in general, the overhead associated with portability. For example, across all GPUs, applications that took longer than 400 and 800 ms for computation in their original form showed a maximum 1.6× and 1.2× portability slowdown, respectively. Overall, these results show that using portable constructs rather than specialisation can cause a slowdown, which varies between applications and chips. However, when interpreting these results one should consider that specialised applications may be fragile, not only when ported, but also on the same GPU under compiler upgrades or code modifications. Anything that affects the resource requirements of the kernel could potentially change the occupancy of the kernel and introduce deadlocks. Given these considerations, it may be best for developers to investigate the optimisation of portable constructs rather than taking the specialised approach. It may be possible for vendors to modify their OpenCL frameworks to provide native support for the occupancy-bound execution model. For example, vendors could provide a way to launch a kernel without specifying the number of workgroups, instead specifying that only occupant workgroups should execute the kernel. It is possible that the runtime could determine occupancy bounds efficiently using low-level proprietary knowledge. It is likely then that the native execution environment functions could be used (allowing for any compiler or hardware optimisations around these native functions). Such support would likely reduce the portability slowdown of using the methods presented in this chapter.

Table 3.3: Occupancy for two CPU chips; the occupancy bound (OB), and the average discovered occupancy (out of 50 runs) using the ticket-lock with different amounts of delay between the polling and closing phases.

                                       Delay
CPU chip         #CUs   OB      0     10    100   1000
Intel i7-5600U     4     4     1.0    1.3    3.3    4.0
AMD A10-7850K      4     4     1.0    1.0    1.5    3.8

3.6.4 OpenCL on CPUs

This work exclusively focuses on GPU platforms because today's implementations of OpenCL appear to implement a non-traditional execution model, i.e. the occupancy-bound execution model, which makes implementing a portable execution barrier non-trivial. On the other hand, execution barriers for CPU systems do not account for such execution models (e.g. see [HS08, ch. 17]). This is because many CPU concurrency frameworks allow for preemption and have a scheduler that, in practice, provides fair scheduling across all threads. Because OpenCL can be run on some CPU systems, some pilot experiments were run using the occupancy benchmarks of Section 3.6.1 to investigate the execution models associated with CPU implementations. It was hypothesised that an occupancy bound would not exist; that is, an inter-workgroup barrier would be deadlock-free on CPUs for any (reasonable) number of workgroups. It was surprising to observe an occupancy bound of 4 for the two CPU frameworks tested: the Intel SDK for OpenCL (build 10094, driver version 5.2.0.10094) and the AMD APP SDK (version 2004.6), running on the Intel and AMD CPUs (respectively) detailed in Table 3.3. Thus, OpenCL implementations of inter-workgroup barriers can deadlock even on CPU systems if executed with too many workgroups. Because CPUs do not natively provide this behaviour, it is hypothesised that the runtime must implement the occupancy-bound execution model on top of the native CPU scheduler; this would be straightforward, e.g. by storing workgroups in a queue. It was also found that the recall of the discovery protocol, even using the ticket-lock, was poor on CPUs. To increase the recall, a delay between the polling and closing phase of the protocol (see Section 3.3) was inserted. The delay consists of a loop in which the thread performs a lock and unlock operation on the mutex. For the ticket-lock, this allows other threads to queue in front of the spinning thread to poll. The results (Table 3.3) show that a delay can increase the recall of the protocol on CPUs. In these experiments, there was no difference in the occupancy bound when varying the amount of local memory allocated (unsurprising, since CPU architectures do not feature software-managed memory) or the number of threads per workgroup. These experiments show that synchronisation constructs (in this study, an execution barrier) must account for the execution model of the framework in which they operate, not just the hardware on which they will be executed: CPU devices are adept at managing preemption between threads, yet the CPU OpenCL implementations tested still exhibit an occupancy-bound execution model. Defining and reasoning about execution models and the synchronisation constructs that they allow promises to be a fruitful area of research for frameworks with native support for concurrency, e.g. OpenCL, C++, and OpenMP.
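The delay used in these experiments is essentially the following loop, inserted between the polling and closing phases; DELAY is the parameter varied in Table 3.3, and ticket_lock/ticket_unlock are the mutex operations sketched earlier in Section 3.6.1 (so this is an illustration of the mechanism, not the exact benchmark code).

// Delay between the polling and closing phases of the discovery protocol:
// repeatedly locking and unlocking the mutex gives other workgroups a
// chance to queue behind the (fair) ticket-lock and be counted.
for (int i = 0; i < DELAY; i++) {
  ticket_lock(m);
  ticket_unlock(m);
}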


Looking forward: In Chapter 6, a formal framework for describing scheduler fairness properties, i.e. the forward-progress guarantees of the execution model, is given. Fairness properties for several execution models are given, including the occupancy-bound execution model, OpenCL, and another GPU programming model: HSA.

3.7 Related work

The basis for the work in this chapter is the persistent thread model [GSO12] and the XF software execution barrier [XF10]; the formal memory model reasoning follows techniques for C++11 concurrency abstraction reasoning [BDG13]. The empirical evaluation uses the Pannotia [CBRS13] and Lonestar-GPU benchmarks [BNP12], which were originally designed to examine performance characteristics of irregular GPU workloads.

Irregular parallelism on GPUs The need for an inter-workgroup barrier arises due to irregular parallelism on GPUs; this has been examined with various approaches in several related works. Hower et al. propose a work-stealing programming idiom for irregular computations using shared concurrent queue data structures [HHB+14]. Orr et al. extend this work by proposing new primitive GPU synchronisation operations, allowing for more efficient work stealing [OCY+15]. Other approaches (e.g. [MGG12b, WDP+16]) use the multi-kernel method, but focus on data representations and computations that exploit low-level GPU specifics, e.g. warp-level instructions. None of these works consider portability across GPUs from multiple vendors. Much work has been done on accelerating blocking irregular algorithms on GPUs using the persistent threads programming style for long-running kernels [KVP+16, DBGO14, HN07, MGG12b, VHPN09, NCKB12, STT10, MBP12, PP16, CT08, TPO10, BNP12]. These approaches rely on the occupancy-bound execution model, flooding available compute units with work, so that the GPU is unavailable for other tasks, and assuming fair scheduling between occupant workgroups.

GPU models The formal memory model reasoning is based on a formalisation of the OpenCL 2.0 memory model [BDW16]. Other notable models developed in prior work include several variants of the hierarchical-race-free model [GHH15], and a formal model of a fragment of PTX, the compiler intermediate representation for Nvidia GPUs [ABD+15]. The occupancy discovery protocol is based on enabling the persistent thread model for GPUs [GSO12]. Related works studying GPU execution models, e.g. GPUVerify [BCD+15] and GKLEE [LLS+12], do not account for relative scheduling properties of workgroups. Gaster provides an execution model for intra-workgroup interactions, and describes how barriers can be implemented at this level [Gas15]. Wu et al. present a persistent-thread, CU-centric programming model that exploits CU locality across workgroups [WCL+15]. To prevent over-saturating the hardware, they present a protocol in which some of the persistent threads immediately exit the kernel without performing any computation. This is similar to the discovery protocol, where non-occupant workgroups immediately exit. Essentially, both protocols circumvent the proprietary GPU scheduler to enforce a certain mapping between workgroups and CUs.

3.8 Summary

As GPUs and GPU applications grow in variety, so will the need for robust and portable synchronisation and programming idioms. In this chapter, building on non-portable previous work, a GPU inter-workgroup barrier was developed, analysed, and evaluated in a portable context for the first time. To achieve this, an occupancy discovery protocol was used to estimate the number of persistent threads, i.e. threads that have traditional fair scheduling guarantees. The OpenCL 2.0 memory model was then used to create an inter-workgroup barrier with intuitive memory model properties. The discovery protocol was evaluated through microbenchmarks, and the runtime effects of the portable barrier were compared to existing methods for inter-workgroup synchronisation on GPUs. While the experimental results show that the runtime behaviour of the barrier varies between chips, applications and even inputs, it was shown that the inter-workgroup barrier is semantically portable across a wide range of GPUs. The discovery protocol and associated portable inter-workgroup barrier presented in this chapter form the basis of the rest of the thesis. In particular:

• Chapter 4 discusses how the portable barrier can be used in a compiler for GPU graph applications;

• Chapter 5 discusses a cooperative multitasking scheme for GPUs, in which one of the newly proposed primitives is a special global barrier. The portable barrier presented here is used as a basis for the prototype scheduler presented;

• Chapter 6 discusses a formal framework, using temporal logic, to reason about blocking synchronisation on GPUs, including the discovery protocol and barrier of this chapter.

Table 3.4: Pannotia application speedups using the inter-workgroup barrier vs. multi-kernel (higher is better), and Lonestar-GPU application slowdowns for using a portable vs. non-portable barrier (lower is better).

                 Barrier speedup for Pannotia                 Portability slowdown for Lonestar-GPU
            BC          COLOR         MIS       SSSP       BFS             MST            SSSP
Chip     128k   1M    eco   circ   eco   circ    usa    2e23  rmat22    2e20   usa    2e23  rmat22
Gtx980   1.10  1.02   1.07  1.02   1.05  0.97   0.99    1.38   1.29     1.34   1.80   1.09   1.10
K5200    1.02  0.98   1.04  0.85   1.02  0.93   0.92    1.30   1.43     1.29   1.54   1.08   1.14
Iris     2.13  1.31   1.26  1.15   1.41  1.28   1.20    1.08   1.11     1.05   1.06   1.03   1.00
Hd5500   1.66  1.22   1.22  1.13   1.29  1.15   1.34    1.09   1.11     1.03   1.05   1.03   1.04
R9       1.62  1.16   1.68  1.08   1.40  1.20   1.05    2.53   2.09     1.43   2.01   1.40   1.17
R7       1.51  1.16   1.13  0.55   0.99  0.92   0.85    1.41   1.34     1.05   1.35   0.93   1.00
Mali-4   2.34  1.12   0.65  0.51   0.38  0.30   0.22    0.78   0.76     1.05   1.09   0.77   0.61
Mali-2   1.78  1.02   0.63  0.57   0.51  0.43   0.33    0.75   0.66     1.04   1.05   0.65   0.59

4 Functional and Performance Portability for GPU Graph Applications

4.1 Personal context

While I was pleased with the functional portability of the global barrier, as presented in Chapter 3, the performance results were underwhelming. I wanted to push for a stronger case for official support of a GPU global barrier and I felt that the best way forward was to find GPU applications where a global barrier provided significant speedups across GPUs spanning many vendors. Luckily, a perfect opportunity quickly presented itself. When I presented the work of Chapter 3 at the 2016 OOPSLA conference in Amsterdam, the talk directly after mine was given by Sreepathi Pai, who presented his work on a GPU graph application DSL and an associated optimising compiler [PP16]. One of the optimisations of his compiler was exactly transforming multi-kernel synchronisation into global barrier synchronisation, as discussed in Section 3.6.2. His work targeted Nvidia GPUs exclusively, and as such he was able to use Nvidia's occupancy calculator to determine the occupancy bound required for a global barrier. Our work provided a way for this optimisation to become functionally portable across many GPUs. This chapter presents the results of a collaboration that followed, in which we combined the two works to develop a portable version of the optimising compiler, of which the portable global barrier was an essential ingredient. As the results of this chapter show, we now further understand situations where a global barrier can provide significant performance improvements across a range of GPUs spanning several vendors. Additionally, the rich set of applications, inputs, and GPUs available allowed for an exploration of performance portability in the domain of graph algorithms on GPUs.

4.2 Motivation

Is a GPU in a mobile device similar to a GPU running in a data center? For certain attributes, such as the number of compute units, clock frequency and power consumption, the answer is clearly no. These differences are by design, and therefore these GPUs will continue to differ on these attributes. As a result, it is unrealistic to expect a program that achieves peak performance on one GPU to achieve similar peak performance on another GPU unchanged. Additionally, code transformations that provide runtime differences (e.g. a speedup) on one GPU may provide different runtime behaviour on another GPU (e.g. a slowdown). As a concrete example, Figure 3.9 from Chapter 3 shows that the runtime effects of transforming multi-kernel synchronous code to global barrier synchronous code vary significantly across different GPUs. Over the years, programmers have become skilled at capturing differences in machines as tunable parameters, whose values are determined as late as possible to allow a program to adapt to the final machine on which it is being run, i.e. to make it more performance portable. Such parameters were initially machine-linked [WD98], but increasingly sophisticated techniques allow tuning over algorithms [ACW+09], inputs [DAV+15], and microarchitecture designs [CHR+16], often under the control of a generic autotuner or search strategy [AKV+14, MSH+14]. GPU programs, too, contend with a diverse set of implementations and have used similar strategies [MGG12a, SRD17, PP16] to achieve performance portability. Debugging a lack of performance portability is hard. By construction, autotuning specialises for a particular machine, application and input; thus it provides no guarantees of performance across these dimensions. Although autotuning has proved to be useful in many cases, questions remain as to how differences in machines affect performance. In particular, opportunities to identify and rectify performance issues are missed. The goal of this chapter is to ask: if exhaustive data on program performance is available, can issues be identified that impede performance portability, and can they be apportioned to microarchitecture, applications, inputs or a combination of these? The domain for this study is GPU graph algorithms written in OpenCL. Graph algorithm kernels were studied for three reasons:

1. The irregular nature of these algorithms requires global synchronisation in many cases. The work of Chapter 3 provides a recipe for providing portable global synchronisation using a global barrier, and the performance considerations of such a barrier have not been rigorously explored across many GPUs.

2. As is shown in Section 4.3, high-performance graph kernels use many more OpenCL features than, for example, a high-performance matrix multiplication kernel. Thus, they exercise more of the hardware implementation. For this reason, graph algorithm kernels are notoriously hard to write in a performance-portable way.

3. Recent work [PP16] has produced a compiler that applies GPU optimisations to graph algorithms written in an intermediate language called irgl and generates CUDA code. This can be used to generate many different kernel implementations of the same algorithm in a controlled fashion to explore performance portability. Retargeting this compiler to generate portable OpenCL code is a natural choice.

Example 4.1 (Performance and functional portability). On an AMD R9 Fury GPU, a maximal-independent-set (MIS) algorithm on a road network graph enjoyed a 1.13× speedup when two optimisations were applied: one involving SIMD atomic combining and another involving global barrier synchronisation. Running the same OpenCL code on an Nvidia Quadro M4000 GPU causes a dramatic 1.69× slowdown (0.59× speedup). The code does not even compile on an ARM Mali-T628 GPU, since that GPU does not provide support for the SIMD primitives required for the optimisations. The global barrier optimisation relies on a degree of independent forward progress that is empirically provided by current GPUs (as described in Chapter 3), but is not mandated by the OpenCL standard. Therefore, it might not be provided by future architectures. For future-proof portability, this optimisation might be disabled, in which case the observed speedup drops from 1.13× to 1.06× on the AMD GPU.

In general, writing performance-portable code for GPU architectures is difficult for three principal reasons:

1. Architectural differences between platforms mean that optimisations that benefit one platform may have little impact, or a negative impact, on another platform;

2. Differences in supported language features between platforms may harm performance if one opts to exploit only those language features that are supported by all implementations, and may harm portability if support for the latest language features is assumed;

3. Data set diversity can make it hard to choose good optimisation settings for irregular applications that work well across a range of interesting data sets, even if the target platform is fixed.

These challenges are not limited to GPUs, but are particularly relevant in this domain for the following reasons: (1) multiple vendors offer competing chips with significant architectural differences; (2) OpenCL version support varies widely in practice; and (3) it has been shown that the utility of optimisations for existing irregular GPU algorithms varies significantly based on the input data set [PP16].

4.2.1 Chapter contributions

This chapter presents a large experimental study that explores the 95 optimisation combinations (all possible configurations for the optimisations considered) applied to 49 (graph application, input) combinations, across 6 GPUs from 4 vendors (see Table 2.2). The rich set of performance data is queried, investigating the following seven research questions; for each research question, the associated subsection of the chapter in which it is addressed is given:

1. Are optimisations for GPU graph applications, developed in previous work and targeted at Nvidia GPUs, beneficial for GPUs from other vendors? How do the top speedups and slowdowns compare across GPUs, and how does the distribution of optimisations required for top speedups vary across GPUs (Section 4.5.1)?

2. How do top speedups fare when functional portability is reduced? That is, how much performance do newer, less supported, features enable across benchmarks and chips (Section 4.5.2)?

3. Are there “portable optimisations”; that is, optimisation settings that provide better than baseline performance across the board (Section 4.5.3)?

4. Are certain optimisations always, or commonly, required for top speedups across the board? If not, what is the distribution of optimisations required for top speedups (Section 4.5.4)?

5. What are the performance consequences of abandoning portability by degrees and specialising to chips, applications or inputs across benchmarks in this domain (Section 4.5.5)?

6. Can optimisation strategies specialised per chip deliver insights into performance-critical architectural, application, or input features (Section 4.5.6)?

7. How does an optimisation policy guided by tuning on a single GPU platform fare when used to make optimisation decisions for other GPU platforms (Section 4.5.7)?

Outline First, descriptions of the GPU graph algorithm optimisations considered in this chapter are given, including the required OpenCL features for each and, thus, the architectural features that each optimisation is sensitive to (Section 4.3). The experimental methodology is described in Section 4.4, containing a description of an analysis that creates optimisation strategies for various levels of specialisation (Section 4.4.2), and a description of the GPUs and applications used in the empirical study (Sections 4.4.3 and 4.4.4). The results of the empirical study, organised by the above research questions, are given in Section 4.5. In Section 4.6, several results from Chapter 3 are revisited given the new findings of this chapter. The chapter concludes with related work in Section 4.7.

Related publications The material presented in this chapter is based on work currently under submission. The citation references the current draft [SPD18].

4.3 Generalising optimisations to OpenCL

Recent work on GPU acceleration of graph algorithms presented four key architecture-independent graph algorithm optimisations. These were shown, via their embedding in an optimising compiler generating CUDA code, to achieve state-of-the-art performance for a number of applications running on Nvidia GPUs [PP16]. To study performance portability, this compiler, originally generating only CUDA, is retargeted to generate OpenCL. The four optimisations are discussed in Sections 4.3.1–4.3.4, in each case providing: (1) a high-level description of the optimisation; (2) details of challenges associated with generalising the optimisation to OpenCL; and (3) details of the GPU architectural features that govern the performance potential of the optimisation. The optimisations are summarised in Table 4.1.

4.3.1 Cooperative conversion

The OpenCL execution hierarchy (see Section 2.3) can be exploited to reduce the cost of expensive, serialising operations such as atomic read-modify-write (RMW) instructions. For example, many graph algorithms track the dynamic workload through a global worklist. Each push by a thread ordinarily requires one RMW. However, threads can communicate at the subgroup or workgroup level to combine individual pushes into a single push of multiple results that uses only one RMW. This cooperative conversion optimisation is abbreviated to coop-cv.

Table 4.1: List of optimisations, the OpenCL features they exploit and the architectural parameters that influence performance. Feature class refers to the OpenCL feature classes described in Section 2.3.

Optimisation               OCL features                          Feature class   Architecture parameters
cooperative conversion     Local memory, sub_group_any,          OCL 2.1         workgroup size, subgroup size,
(coop-cv)                  sub_group_reduce, barrier,                            atomic read–modify–write throughput,
                           atomic fetch-and-add, popcount                        subgroup collectives throughput
fine-grained nested        Local memory, barrier                 OCL 1.x         local memory size, barrier throughput
parallelism (fg)
subgroup nested            Local memory, sub_group_barrier,      OCL 2.1         subgroup size, subgroup barrier
parallelism (sg)           sub_group_any, atomic_store,                          throughput, local memory size
                           atomic_load
workgroup nested           Local memory, atomic_store,           OCL 2.0         workgroup size, local memory size,
parallelism (wg)           atomic_load, barrier                                  barrier throughput, atomic load/store
                                                                                 throughput
iteration outlining        atomic_load, atomic_store             OCL FP          overheads for kernel launch and memory
(oitergb)                                                                        transfers, global memory fence throughput,
                                                                                 workgroup scheduler behaviour
workgroup size of 256      clEnqueueNDRangeKernel                OCL 1.x         occupancy, resource limits
(sz256)

OpenCL generalisation Unlike in CUDA, subgroup operations in OpenCL must be uniform, i.e. they must be executed by all or none of the threads in the subgroup. Thus, subgroup-uniform branches must be generated (for example, by equalising loop trip counts across threads), and predication is used to prevent execution of code that would not originally have executed. Subgroup primitives can then be executed uniformly, performing a subgroup reduction on the predicate that caused the non-uniform branch in the original optimisation. Cooperative conversion requires fine-grained subgroup communication, thus falling into the OCL 2.1 feature class.
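To make the shape of the generated code concrete, the following OpenCL C sketch shows a subgroup-aggregated worklist push. It is illustrative only: the names worklist, wl_count and want_push are hypothetical, and the code actually emitted by the compiler differs in detail. All threads of the subgroup execute the subgroup built-ins uniformly, with want_push acting as the predicate for the originally non-uniform push.

    // Sketch: subgroup-aggregated worklist push using OpenCL 2.1 subgroup built-ins.
    // Names are illustrative; this is not the exact code generated by the compiler.
    inline void push_aggregated(__global int *worklist,
                                volatile __global atomic_int *wl_count,
                                int item, int want_push)
    {
        // Executed uniformly by the whole subgroup; want_push is 1 if this
        // thread would have pushed in the original, non-uniform code.
        int offset = sub_group_scan_exclusive_add(want_push); // slot within the subgroup
        int total  = sub_group_reduce_add(want_push);         // pushes from this subgroup

        int base = 0;
        if (get_sub_group_local_id() == 0 && total > 0)
            base = atomic_fetch_add(wl_count, total);         // one RMW for the whole subgroup
        base = sub_group_broadcast(base, 0);

        if (want_push)
            worklist[base + offset] = item;
    }

With a subgroup of size s, up to s individual RMW operations are replaced by a single atomic fetch-and-add plus subgroup communication, which is exactly the trade-off discussed under the performance considerations below.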

Performance considerations The number of atomic RMW operations that can be avoided depends on workgroup size and subgroup size. Communication uses local memory and the appropriate barriers (workgroup/subgroup). Note that on architectures that implement subgroups with lockstep execution, subgroup barriers are free. The performance impact of this optimisation depends on the overhead of the orchestration (i.e. synchronisation and communication) vs. the cost of the global RMW operations.

4.3.2 Nested parallelism

Not to be confused with OpenCL nested parallelism (which mimics CUDA's dynamic parallelism), the nested parallelism optimisation tackles the classic problem of parallelising nested loops, the inner of which is usually irregular in graph algorithms. Specifically, it generates inspectors and executors that inspect the inner loop iteration space at runtime and redistribute work among the threads. The specific schemes for distributing work are based on proposals by [MGG12b] and redistribute work among threads of the workgroup (wg) or threads of the subgroup (sg). When redistributing to threads of the workgroup, the executor can choose between serialising the outer loop or linearising the iteration space (fine-grained, or fg). Often, all three strategies wg, sg and fg must be used in combination, with wg handling high-degree nodes, sg handling medium-degree nodes and fg handling the rest. The fg variant can also be parameterised by the number of edges processed per iteration. Two possibilities were considered in this work: a single edge (denoted fg1) and eight edges (denoted fg8). The entire class of nested parallelism optimisations is abbreviated to np.

OpenCL generalisation The fg scheme of this optimisation is straightforwardly ported to OpenCL, and thus is placed in the OCL 1.x class. The wg scheme has two OpenCL considerations. First, the work distribution requires several specialised workgroup scan operations. In CUDA, these are provided via a high-performance library, for which there is no OpenCL equivalent. In this work, a simple OpenCL scan implementation provided in [Mer] was used. Additionally, the wg scheme requires concurrent writes to the same location in a leader-election idiom. OpenCL deems this a data race, rendering the entire program undefined [Khr15, p. 48]. Thus, all racy accesses were identified and changed to OpenCL 2.0 atomic operations (see Section 2.3.2); each loop redistribution required six memory accesses to be changed. As a result of using atomic operations, wg is placed in the OCL 2.0 class.
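As an illustration of the kind of change involved, the sketch below replaces a racy plain store and load in a leader-election idiom with OpenCL 2.0 relaxed atomics. The names are hypothetical and the actual compiler-generated code differs; the sketch only conveys the flavour of the rewrite.

    // Sketch: leader election made race-free with OpenCL 2.0 atomics.
    // `leader` is a __local array of atomic_int; `bucket` and `lid` are
    // per-thread values (names hypothetical).
    bool elect_leader(__local atomic_int *leader, int bucket, int lid)
    {
        // Originally a plain (racy) store: leader[bucket] = lid;
        atomic_store_explicit(&leader[bucket], lid,
                              memory_order_relaxed, memory_scope_work_group);
        barrier(CLK_LOCAL_MEM_FENCE);
        // Originally a plain (racy) load; whichever id survived is the leader.
        return atomic_load_explicit(&leader[bucket],
                                    memory_order_relaxed,
                                    memory_scope_work_group) == lid;
    }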

The sg scheme requires subgroup orchestration similar to coop-cv. However, sg makes the additional assumption that subgroup threads execute in lock-step; this is not guaranteed in OpenCL (and, starting with Volta, this is not the case for Nvidia either [NVI17, p. 26]). Two subgroup barriers per loop were required to explicitly provide the lock-step synchronisation that was previously assumed. These subgroup primitives place the sg scheme in the OCL 2.1 class.
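The flavour of this change is sketched below with hypothetical names: a value published through local memory by a leader thread is bracketed by the two explicit subgroup barriers that the original CUDA code left implicit under lock-step execution.

    // Sketch: explicit subgroup barriers replacing an implicit lock-step assumption
    // (OpenCL 2.1 built-ins; buffer and function names are illustrative).
    int publish_to_subgroup(__local int *slot, int value)
    {
        if (get_sub_group_local_id() == 0)
            slot[get_sub_group_id()] = value;      // leader writes
        sub_group_barrier(CLK_LOCAL_MEM_FENCE);    // was implicit under lock-step
        int v = slot[get_sub_group_id()];          // all threads of the subgroup read
        sub_group_barrier(CLK_LOCAL_MEM_FENCE);    // before the slot is reused
        return v;
    }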

Performance considerations Because these schemes involve a large amount of inter-thread communication mediated through barriers (both workgroup and subgroup), the throughput of barriers is a critical factor. The fg scheme has the highest number of barriers, but this can be reduced by increasing the number of edges processed per thread, though this consumes more local memory. Read and write latency to local memory is important as communication and work redistribution occurs through this memory region. Finally, if there is very little load imbalance among threads (for example, due to uniform-degree graphs), the overhead of these schemes may outweigh the benefits.

4.3.3 Iteration outlining

Many graph algorithms execute kernels iteratively until a fixed point has been reached. For example, in breadth-first search (BFS), the number of dependent iterations is proportional to the diameter of the graph. In the case of planar graphs, such as road networks, this may be thousands of iterations. If each kernel execution is very short, then the launch overhead dominates execution time. In iteration outlining, code that launches kernels is outlined to the GPU. As a result, kernel launches are turned into GPU function calls, with synchronisation between function calls provided by a global barrier, i.e. the GPU barrier construct discussed in Chapter 3, which synchronises across all threads on the GPU. This optimisation is exactly the transformation from the multi-kernel synchronisation approach to the global barrier approach discussed in Section 3.6.2. This optimisation is abbreviated to oitergb.
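The shape of this transformation is sketched below in OpenCL C, assuming the global barrier of Chapter 3 is exposed as a function global_barrier() and the former kernel body as a device function step(); both names are hypothetical, and details such as the barrier's resource arguments are omitted.

    // Sketch of iteration outlining (oitergb). The host-side loop
    //     do { enqueue step kernel; copy `done` back to the host; } while (!done);
    // becomes a single persistent kernel in which kernel launches are function calls.
    void step(__global int *data, volatile __global atomic_int *done); // hypothetical
    void global_barrier(void);                                         // Chapter 3 barrier (sketch)

    __kernel void outlined(__global int *data,
                           volatile __global atomic_int *done)
    {
        for (;;) {
            if (get_global_id(0) == 0)
                atomic_store(done, 1);     // reset the convergence flag
            global_barrier();              // make the reset visible to all threads
            step(data, done);              // former kernel body; clears *done if
                                           // another iteration is required
            global_barrier();              // replaces the kernel relaunch
            if (atomic_load(done))         // former host-side check, now on the GPU
                break;
        }
    }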

OpenCL generalisation The crux of this optimisation for OpenCL is a portable global barrier, which is exactly the topic of Chapter 3 of this thesis. The discovery protocol and barrier presented earlier can be used here directly. To recap, the discovery protocol dynamically discovers the GPU occupancy (i.e. the workgroups that are guaranteed relative forward progress) and creates a custom execution environment. Using this dynamic information, a portable global barrier can be used. Because of the forward progress requirements of this optimisation, it falls into the OCL FP class.

A possible concern about using the methods of Chapter 3 in this optimisation is that the results of Section 3.6.3 show that the portable constructs can have a significant overhead compared with a priori occupancy approaches. For example, BFS[2e23] suffers a 2.53× slowdown on R9 when the portable constructs are used. However, in this chapter, an optimised discovery protocol is used which significantly reduces the portability overhead. For example, the same application on R9 using the optimised discovery protocol suffers only a 1.11× slowdown. Thus, the portability overhead is greatly reduced from what the results of Section 3.6.3 suggest. The optimisation, along with some results, is discussed in Section 4.6.2. OpenCL 2.0 provides support for nested parallelism [Khr15, p. 32], where a thread can enqueue another kernel while executing on the device. In theory, this nested parallelism could be used to achieve similar on-chip synchronisation. Two difficulties were encountered when investigating this. First, the nested parallelism model is tail recursive, that is, the child kernel is not guaranteed to execute until the parent finishes. This proved to be a difficult transformation for loops in the applications. Secondly, the available implementations of nested parallelism were found to be extremely unreliable, causing compiler crashes and runtime errors that could not be diagnosed, despite a best-effort following of the standard. Thus, OpenCL nested parallelism is not provided as an iteration outlining optimisation at this time.

Performance considerations The performance impact of iteration outlining on a plat- form is a function of the kernel launch overhead (including a memory copy), the execution time of the barrier and the execution time of the kernel. While the first two are largely architecture dependent, the last also depends on the application and input.

4.3.4 Workgroup size

The final optimisation considered is simply changing the number of threads in the workgroup from 128 to 256. A workgroup size of 128 is used as the default because not all of the GPUs investigated support a workgroup size of 256 in all cases (see Section 4.4.5). This optimisation is a simple change to the kernel launch parameters and thus falls into the OCL 1.x class. The workgroup size is known to affect occupancy (see Chapter 3), which can affect performance. This optimisation is abbreviated to sz256.

4.3.5 The optimisation space

With the exception of fg, all optimisations can be enabled or disabled independently. In the case of fg, two variants were considered: fg1 and fg8. Thus, the five binary optimisations and the three possible settings of fg (disabled, fg1 or fg8) give 2^5 × 3 = 96 configurations, i.e. 95 total optimisation combinations excluding the baseline, where no optimisations are enabled.

4.4 Experimental methodology

The compiler for graph algorithms, generating OpenCL code via the optimisations described in Section 4.3, provided the opportunity to undertake a large study of performance portability for this domain. First, useful terminology is defined for describing the methodology and results (Section 4.4.1). Then an analysis is described which, given a set of empirical data, is able to produce optimisation strategies at various levels of specialisation (Section 4.4.2). Next, the GPUs considered in this chapter are described in detail (Section 4.4.3). Following this, the graph algorithms to which the optimisations were applied are discussed, along with their graph inputs (Section 4.4.4). Finally, the methods used to gather experimental data are detailed (Section 4.4.5). A discussion of the experimental findings then follows in Section 4.5.

4.4.1 Terminology

An application is a graph algorithm expressed in the compiler DSL; the same DSL used in [PP16]. An application accepts an input, which is a graph. A chip refers to one of the GPUs listed in Table 2.2, but also includes the runtime environment. A benchmark is an (application, input) tuple, and a test is an (application, input, chip) tuple. That is, a benchmark is the instantiation of an application for a given input, and a test is the instantiation of a benchmark for a given chip. The compiler generates OpenCL code by applying zero or more optimisations to an application. An optimisation configuration records which optimisations were enabled. Most optimisations are binary (enabled or disabled), with the exception of the fg optimisation, which may either be disabled, or enabled and set to a numeric value. As previously mentioned, the two values for fg considered in this chapter are one and eight, denoted fg1 and fg8, respectively. The baseline configuration disables all optimisations.

For a test and two optimisation configurations o1 and o2: o1 yields a confident speedup over o2 if the difference is statistically significant at the 95% confidence level. A confident slowdown is defined analogously. Henceforth, speedup and slowdown refer to confident speedups and slowdowns unless stated otherwise. An optimisation configuration yields a

speedup (slowdown) for a test if it yields a speedup (slowdown) for the test over the baseline configuration. An optimisation strategy maps a test to an optimisation configuration. For example, the baseline strategy maps every test to the baseline configuration. Similarly, the oracle strategy maps a test to the configuration that led to the maximum speedup observed during the experiments.

4.4.2 Creating optimisation strategies by specialisation

An optimisation strategy is more portable if it uses less information about a test when mapping it to an optimisation configuration. The baseline strategy is completely portable: all optimisations are disabled regardless of the test. In contrast, the oracle strategy is the least portable since it is the most specialised, being generated via exhaustive search. An analysis is now described that is able to produce, from empirical data, a spectrum of optimisation strategies between baseline and oracle by incorporating more and more information about a test. A key challenge in this analysis is to soundly identify useful optimisations – those that impact positively on performance on average – under various degrees of specialisation. This analysis uses a set of exhaustive test results, with runtimes from all possible optimisation configurations. The aim of the approach described here is not to determine the best optimisations, as that is easily accomplished through exhaustive search, i.e. the oracle model. Similarly, it is also not the aim to make predictive models, in which models prescribe optimisations to use in different situations. Indeed, evaluation of predictive models would require partitioning data into training and evaluation sets, which is not done here. Rather, the aim is to develop descriptive models, in which exhaustive data is mined to understand trade-offs between portability and specialisation. Such models have value in exploring various portability dimensions and might serve as a comparison point for future predictive approaches. At a high level, to consider whether an optimisation should be enabled for a (set of) test(s), the empirical data is split into two sets. The first set contains runtime information when the optimisation was enabled and the second when it was disabled. A statistical procedure, the Mann-Whitney U (MWU) test [MW47], is used to determine whether enabling the optimisation caused a statistically significant change in test runtimes. If so, then the median of the set of runtimes where the optimisation was enabled is examined to determine whether the optimisation caused a speedup or slowdown. In the case of a speedup, the optimisation is recommended to be enabled. If the optimisation did not

cause a statistically significant change in runtimes across the tests, then the optimisation is conservatively recommended to be left disabled. While a variety of statistical methods could be used on the list of normalised runtimes, e.g. simply checking means or medians, the MWU test has several nice properties. First, the MWU test is non-parametric and does not assume any specific distribution of values. Because the empirical runtimes may contain values from different chips or applications, it cannot be assumed, for example, that the runtimes have a normal distribution. Secondly, the MWU test does not consider the magnitude of value differences, only whether they are greater or less than those in the comparison set. This is important as the magnitudes of runtimes can vary greatly across chips and the aim of the analysis is not to favour one chip over another. To specialise the optimisation configurations per chip, application, input or a combination of these, the above analysis is performed, but on partitions of the data. For example, when specialising optimisations per chip, the data is partitioned into subsets, one for each chip, and the aforementioned procedure is applied to each partition. Each partition will yield an optimisation strategy specialised to that partition's chip. The analysis is shown in detail in Algorithm 4.1, with specialisation illustrated for the per-chip case (other specialisations are similar). The approach is described top-down, starting with the function specialise_for_chip (line 1). This function simply iterates through the available chips in the global list CHIPS (line 3), partitions the data into a set per chip (line 4), and then launches the analysis for the partition, which returns an optimisation strategy recommended for the partition (line 5). The function that identifies an optimisation configuration for a partition is opts_for_partition (line 7). Here, the aim is to extract the effect of an individual optimisation opt across the entirety of the data partition. To do this, two lists A and B are constructed per optimisation (line 10). The lists are populated by iterating through all the valid optimisation configurations where opt is enabled. This is computed by iterating through all possible optimisation configurations, stored in ALL_OPT_CONFIGS, and filtering the configurations where opt is enabled using the function enabled (line 11). For each optimisation configuration os, the mirror optimisation configuration is created where opt is disabled (line 13); thus, the only difference between the two optimisation configurations os and dis_os is that opt is enabled in the former and disabled in the latter. All test data in partition is then considered (line 14), checking for each test p whether the runtime under the two optimisation configurations is statistically significant (i.e. whether the runtimes are confidently different) via confident (line 15). If so, then the runtime with opt enabled is normalised to the runtime when opt was disabled (line 16) and added to the list A.

Algorithm 4.1 Finding optimisation strategies for different levels of specialisation.

 1: function specialise_for_chip(data)
 2:     chip_opts = map()
 3:     for chip in CHIPS do
 4:         chip_data = {X in data where X was run on chip}
 5:         chip_opts[chip] = opts_for_partition(chip_data)
 6:     return chip_opts

 7: function opts_for_partition(partition)
 8:     enabled_opts = {}
 9:     for opt in OPTS do
10:         A = [], B = []
11:         opt_configs = [os in ALL_OPT_CONFIGS if enabled(opt, os)]
12:         for os in opt_configs do
13:             dis_os = os[opt = disabled]
14:             for p in partition do
15:                 if confident(p[os], p[dis_os]) then
16:                     A.add(p[os] / p[dis_os])
17:                     B.add(1.0)
18:         if enable_opt(A, B) then
19:             enabled_opts.add(opt)
20:     return enabled_opts

21: function enable_opt(A, B)
22:     significant = MWU(A, B)
23:     return significant and median(A) < 1.0

Because the value added to A was normalised, the baseline value of 1.0 is added to the second list B. With data from A and B, a statistical analysis can be performed to determine whether opt should be enabled (lines 18 and 19). Essentially, enable_opt (line 21) takes the two lists A and B and applies the MWU test (line 22). The MWU test takes two lists of values X and Y and tests whether there is a statistically significant difference in ranks between the two lists. That is, it tests whether one list has stochastically larger values than the other. If the test determines this with p < .05, the median of A is considered: if it is less than 1.0, indicating that half of the tests have sped up, then the optimisation is enabled (line 23); otherwise it is disabled.

Remark (Ranks and confidence intervals). The procedure for finding optimisation strategies at various levels of specialisation (i.e. Algorithm 4.1) uses a nonparametric test, that is, it does not assume a normal distribution of values. This is important as it compares runtime data across chips, inputs and applications. However, at the finer granularity where the chip, input and application are fixed, a confidence interval around the mean is used to determine whether a speedup is confident or not. This is valid, as the runtime data collected in these comparisons is simply from repeated runs in the same environment.

4.4.3 GPU platforms tested

Table 2.2 summarises the GPU platforms used in the experimental evaluation of this chapter: 6 GPUs spanning 4 vendors. The AMD R9 and the Nvidia GPUs are discrete; all others are integrated. GPUs from the same vendor also span different architecture configurations. The Nvidia M4000 and Gtx1080 belong to the Maxwell and Pascal architectures, respectively. The Hd5500 and Iris are both of the Broadwell architecture, but at different graphics tiers (as described by Intel): tiers GT2 and GT3, respectively, where GT3 is the top tier for this architecture. They also differ in the number of compute units, the size of subgroups, and the highest OpenCL version supported. Given these differences, the GPUs considered in this chapter represent a compelling landscape over which to examine functional and performance portability.

Closing the functional gap Table 2.2 shows that none of the GPUs examined in this chapter support OpenCL 2.1.¹ Recall from Section 4.3 that the complete set of graph algorithm optimisations requires an implementation to support OpenCL 2.1 and provide forward progress guarantees, i.e. the OCL FP feature class described in Section 2.3.1. In order to facilitate the full range of optimisations, GPU-specific workarounds were devised to provide the needed features, using facilities available in earlier OpenCL versions to model future version features as faithfully as possible, allowing all GPUs of this chapter to support the OCL FP feature class. Section 2.4 describes how ARM and Nvidia chips were brought to OCL 2.0 from OCL 1.x by a best-effort OpenCL 2.0 memory model implementation for each chip. Indeed, recall that this functionality was also required for the experimental study in Chapter 3.

¹ In fact, we are not aware of any GPU that supports OpenCL 2.1 at the time of writing.

Table 4.2: The applications considered along with their available optimisations. Application variants considered state-of-the-art are noted with a (*). Applications with no variants have only one implementation.

  App    Variants                     Available opts                  Citation
  BFS    cx                           sz256, np, coop-cv, oitergb     [MGG12b]
         topo                         sz256, np, coop-cv, oitergb     [BNP12]
         tp                           sz256, np, coop-cv, oitergb     [MGG12b]
         wl                           sz256, np, coop-cv, oitergb     [MGG12b]
         hybrid*                      sz256, np, coop-cv, oitergb     [MGG12b]
  CC     *                            sz256, np, coop-cv, oitergb     [SKN10]
  MIS    worklist                     sz256, coop-cv, oitergb         [Lub86]
         pannotia*                    sz256, coop-cv, oitergb         [CBRS13]
  MST    boruvka (single worklist)    sz256, coop-cv                  [BNP12]
         boruvka (multi worklists)*   sz256, coop-cv                  [BNP12]
  PR     residual                     sz256, np                       [WLDP15]
         residual-wl*                 sz256, np, coop-cv              [WLDP15]
         tp                           sz256, np                       [PP16]
  SSSP   nf*                          sz256, np, coop-cv, oitergb     [DBGO14]
         topo                         sz256, coop-cv, oitergb         [BNP12]
         wl                           sz256, np, coop-cv, oitergb     [BNP12]
  TRI    *                            sz256, np                       [Pol16]

To bring all chips up to the OCL 2.1 feature class, support for subgroups is required. On AMD, the Khronos subgroup extension mirroring the OpenCL 2.1 subgroup functionality [Khr16b, pp. 133-140] is available. Intel GPUs likewise support a specialised subgroup extension that provides the functionality required by the optimisations [Ash16]. On Nvidia, inline PTX provides warp intrinsics [Nvi18e, p. 209], which are Nvidia's own subgroup implementation. The architecture of the ARM GPU does not feature subgroups [Lok11], so a subgroup size of one is used, which is a valid subgroup size. This nullifies any performance benefit that might be obtained from subgroup optimisations, but allows OpenCL kernels generated by optimisations that assume subgroups to run in a functionally portable manner on ARM. For OCL FP support, Chapter 3 has demonstrated that GPUs from the four vendors considered do support the occupancy-bound execution model in practice. Thus, this satisfies the requirements of OCL FP directly.
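To illustrate the ARM case: with a subgroup size of one, every subgroup collective degenerates to the identity, so the required built-ins can be modelled trivially. The macros below are a hypothetical sketch of such a fallback, not necessarily the exact workaround used in this work.

    // Hypothetical subgroup-size-one fallback (sketch only): with one work-item
    // per subgroup, each collective reduces to the identity or a constant.
    #define get_sub_group_size()             1
    #define get_sub_group_local_id()         0
    #define sub_group_any(p)                 (p)
    #define sub_group_reduce_add(x)          (x)
    #define sub_group_scan_exclusive_add(x)  0
    #define sub_group_broadcast(x, id)       (x)
    #define sub_group_barrier(flags)         ((void)0)  /* a one-item subgroup needs no barrier */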

4.4.4 Benchmarks

The irgl compiler is accompanied by 19 graph applications, of which 17 were used in this work. Delaunay Mesh Refinement (DMR) is not used, as some of its (large) support libraries are written in CUDA. SSSP with a priority worklist is not used, as it uses a high-performance key–value GPU sort from ModernGPU [Bax], for which an equivalent OpenCL library was not found. The applications can be split into 7 high-level problems – breadth-first search (BFS), connected components (CC), maximal independent set (MIS), minimal spanning tree (MST), pagerank (PR), single-source shortest path (SSSP), and triangle counting (TRI). Each problem has multiple implementation strategies, summarised in Table 4.2; the strategies marked (*) implement the fastest algorithms, as discussed in [PP16]. The “available opts” column shows which optimisations apply to an implementation; variants with all possible combinations of these optimisations were generated for the experiments. The “citation” column points to the work where the application variant was originally presented.

It is noted that SSSPtopo uses a similar strategy to BFStopo; however, the np optimisation is not available for SSSPtopo. This is due to an unresolved issue in the compiler implementation, and in principle np should be applicable to SSSPtopo. More engineering effort is planned for the compiler, which should address this issue. Each application is accompanied by a checker that can be used to validate executions. BFS, CC, MST, SSSP, and TRI applications must produce bit-exact results every time. MIS applications, while non-deterministic, must produce solutions that satisfy key invariants that were checked by a script provided with irgl. Finally, although PR applications exhibit floating point variance, their relative differences from a high-confidence oracle must be within an accepted range, as checked using the numdiff utility. Three graph inputs from three important classes were used in this work: a road network graph of New York (usa.ny), a uniformly random graph (2e23), and an RMAT-style graph (rmat20). The usa.ny road network input is a high-diameter graph with uniformly low degree. The 2e23 input is a synthetic random graph with uniformly distributed edges. The input rmat20 is also a synthetic graph, but a recursive procedure is used to construct it, so that vertex degrees exhibit a power-law distribution [CZF04]. Bi-directional versions of these graphs were constructed for problems like MIS and MST that require undirected graphs. MST is not run on 2e23 due to an undiagnosed error.

4.4.5 Gathering data

For each execution, the time taken to execute the application/input on the GPU was recorded, as this is the primary performance characteristic of interest. As in the original applications, the time taken to load the graph inputs and the initial/final transfers of graph data to and from the GPU were ignored. Each test was run three times. From the runtime data, the average and 95% confidence intervals [Jai91] were gathered. About 20% of the collected data have confidence intervals that exceed 15% of the average. These appear to be random outliers, but are nevertheless included in the analysis directly. More runs would likely reduce the noise; however, as mentioned below, the total experimental runtime is high and more runs would not be feasible. As is common in such large experimental campaigns, several failures were experienced. First, a small number of runs (typically fewer than five out of ten thousand) failed without producing any output. In these cases, the experiments were simply re-run and the issue did not occur in the repeated runs. Second, on ARM, 130 benchmark/optimisation-configuration pairs failed due to the device not being able to support a workgroup size of 256. This is valid OpenCL behaviour, although it was only observed on this device. In these cases, the sz256 optimisation is considered unavailable. Third, a variety of vendor compiler issues were encountered, similar to the issues described in the work for Chapter 3 and presented in [SD16b]. For these issues, workarounds were developed in which parts of the code were transformed in a semantically equivalent way. Typically, the transformed code uses variables that the compiler cannot know statically, e.g. kernel arguments. For example, it was found that while(true) loops that used an explicit break condition commonly caused compiler issues. A simple transformation changing the loop to while(one != 0), where one is a kernel argument that the host always sets to 1, resolved many of the issues. In total, across all platforms and benchmarks, the experimental run takes approximately 237 hours. The platform that accounts for the largest amount of this runtime is Mali-4, taking 97 hours, as its GPU and CPU are smaller and slower devices. Specifically, Mali-4 has a clock rate of 533 MHz and has 4 compute units. Out of the chips used in this chapter, the GPU with the next lowest clock rate is the M4000 with a clock rate of 772 MHz, but it has 13 compute units. The GPU with the lowest occupancy is the Hd5500 with an occupancy of 3 using maximum resources (see Table 3.2); however, it has a clock rate of 950 MHz.
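A minimal sketch of the loop workaround is shown below on an illustrative kernel (not one of the benchmark kernels); the host always passes one = 1, but the vendor compiler can no longer treat the loop condition as a compile-time constant.

    // Sketch of the while(true) workaround (kernel and names are illustrative).
    // Original form that triggered compiler issues on some platforms:
    //     while (true) { ...; if (converged) break; }
    __kernel void relax(__global int *val, int one)
    {
        int i = get_global_id(0);
        while (one != 0) {         // the host always sets one = 1
            val[i] -= 1;
            if (val[i] <= 0)
                break;
        }
    }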

[Bar chart: percent of benchmarks (y-axis) against optimisation (x-axis: sz256, coop-cv, oitergb, fg1, fg8, sg, wg), with one bar series per chip (R9, Hd5500, Iris, Mali-4, Gtx1080, M4000).]

Figure 4.1: Summary of optimisations used per-chip to obtain the top speedups.

[Bar chart: percent of benchmarks (y-axis) against speedup buckets (x-axis: 1x, <1.25x, <1.5x, <1.75x, <2x, >=2x), with one bar series per chip.]

Figure 4.2: Summary of top speedups for all benchmarks, split by chip.

4.5 Results and discussion

The research questions of the chapter introduction are now explored by mining the experimental data.

4.5.1 Top speedups across architectures

Research question Are optimisations for GPU graph applications, developed in previ- ous work and targeted to Nvidia GPUs, beneficial for GPUs from other vendors? How do the top speedups and slowdowns compare across GPUs and how does the distribution of optimisations required for top speedups vary across GPUs?

Table 4.3: The maximum speedup and slowdown per chip and associated application. The input for each is usa.ny.

                 R9          Hd5500   Iris     Mali-4   Gtx1080   M4000
  max speedup    14.61×      16.61×   13.25×   3.95×    5.10×     3.54×
  app            BFShybrid   SSSPnf   BFStp    BFStp    SSSPnf    SSSPnf
  max slowdown   6.89×       22.15×   18.70×   15.21×   7.99×     10.00×
  app            SSSPwl      PRtp     PRwl     SSSPwl   PRwl      PRtp

First, the effectiveness of the optimisations, previously tested only on Nvidia GPUs [PP16], was assessed across a range of chips from multiple vendors. Figure 4.1 shows a histogram of the optimisations that were necessary for the top speedup of a benchmark per chip. By “necessary”, it is meant that if the optimisation is disabled, then the benchmark suffers a slowdown from the top observed speedup. The figure shows that each optimisation is necessary to achieve a top speedup for at least one benchmark across all chips. Thus, these optimisations are not specific to Nvidia GPUs. Second, these optimisations are shown to also lead to generous speedups across chips. The values of the top speedups achieved are distributed similarly across chips, as shown in the histogram of Figure 4.2. The “1×” bucket counts benchmarks for which no (confident) speedup was obtained; the “< S×” bins count the number of benchmarks for which the best speedup was less than S× but greater than or equal to the bound of the previous bin; the final bin counts the benchmarks for which a top speedup of at least 2× was achieved. For most chips, the percentage of benchmarks that could not be sped up is less than 37%. However, speedups appear to be more difficult to obtain for Iris and Gtx1080, for which no speedups were observed for 65% and 57% of benchmarks, respectively. Notably, these “difficult” chips are from different vendors, and other chips from the same vendors (Hd5500 and M4000) exhibit significantly fewer results in this category (i.e. applications that did not exhibit speedups). The other extreme of the histogram, the “>= 2×” bin, shows that larger speedups were observed for between 12% and 21% of benchmarks across all chips. Thus, while not all chips were able to obtain speedups on the same percentage of benchmarks, all chips were able to observe at least a 2× speedup on a similar percentage of benchmarks. Table 4.3 shows the maximum speedup and slowdown per chip and the associated benchmark; the relevant input turns out to be usa.ny in all cases. R9, Iris and Hd5500 achieve speedups of more than 10× in the best case; the speedups for Mali-4, Gtx1080

[Bar chart: percent of tests (y-axis) against speedup buckets (x-axis: 1x, <1.25x, <1.5x, <1.75x, <2x, >=2x), with one bar series per feature class (OCL 1.x, OCL 2.0, OCL 2.1, OCL FP).]

Figure 4.3: Summary of top speedups over all tests, grouped by feature class.

and M4000 are more modest, never exceeding 6×. The only commonly-enabled optimisation for these top speedups is oitergb; the reasons for this are investigated through microbenchmarking in Section 4.5.5. With respect to maximum slowdowns, all chips can suffer at least a 7× slowdown in the worst case. Thus, optimisation choices can have a significant effect on the runtime of benchmarks across chips.

4.5.2 Top speedups for feature classes

Research question How do top speedups fare when functional portability is reduced? That is, how much performance do newer, less supported features enable across benchmarks and chips? The focus now turns to assessing the extent to which the availability of the latest, or even possible next-generation, features of GPU programming models impacts performance across the benchmarks and chips. The histogram of Figure 4.3 shows top speedups across all tests when optimisations are limited to their minimum required feature class (see Table 4.1). For example, values corresponding to OCL 1.x are limited to the fg and sz256 optimisations, and OCL FP values are free to use any optimisation. These results show clearly that recent and next-generation programming model features matter for performance in this domain. More than 70% of tests show no speedups when optimisations are limited to the OCL 1.x and OCL 2.0 feature classes, and 94% of tests in these classes have less than a 1.5× associated speedup. Optimisations from the OCL 2.1 class, which adds subgroup support, improve the situation, reducing the tests for which no speedup is achieved to 61%. The majority of the largest speedups, that is speedups of 2× or larger, feature the oitergb optimisation, which is unique to the OCL FP class.

Table 4.4: The top and bottom five universal optimisation strategies, ranked according to the number of tests slowed down.

  Enabled optimisations          Slowdowns   Speedups   Geomean speedup
  fg8                            36          60         1.01
  fg                             37          58         0.98
  wg, fg8                        38          61         1.00
  wg, fg                         41          55         0.96
  sg                             41          56         1.00
  wg, coop-cv, oitergb           157         44         0.72
  sz256, wg, oitergb             167         44         0.61
  sz256, wg                      173         23         0.60
  sz256, wg, coop-cv, oitergb    189         35         0.54
  sz256, wg, coop-cv             195         22         0.53

In summary, support for cutting-edge OpenCL features leads to significant speedups across the test set. These results offer a compelling motivation for vendors to efficiently support these language-level abstractions. Additionally, the forward progress guarantees required by OCL FP are not officially supported by any vendor; again, these results provide a compelling case for official support to be considered, e.g. via an OpenCL extension.

4.5.3 Applying optimisations universally

Research question Are there “portable optimisations”; that is, optimisation settings that provide better than baseline performance across the board? Having established that the optimisations are portable, it is now explored whether any optimisation configuration produces speedups universally, i.e. across all tests. Such a configuration would yield an appealingly simple optimisation strategy. To answer this, all 95 optimisation configurations were enumerated, where at least one optimisation is enabled. For each test, the runtime for every optimisation configuration is compared with the runtime of the baseline. If an optimisation o in an optimisation configuration is not applicable to an application (see Table 4.2), then the runtime of the test using the optimisation configuration without o enabled is used. Every optimisation configuration yields a slowdown for at least one test. Table 4.4 shows the top and bottom optimisation configurations, ranked by the number of tests slowed down by the configuration. The configuration that causes the fewest slowdowns (36; 12% of all tests) applies fg8 in isolation. However, this “least harmful” optimisation can be seen as low-risk and low-reward: it is in the OCL 1.x feature class, which provides very

[Stacked bar chart: percent of tests (y-axis) against optimisation (x-axis: sz256, coop-cv, oitergb, fg1, fg8, sg, wg), with each bar split into beneficial, benign and harmful portions; concrete test counts are annotated on the bars.]

Figure 4.4: Percent of tests for which each optimisation is beneficial, benign, or harmful.

limited speedups in general (see Section 4.5.2), and applying this optimisation universally provides only a 1.01× geomean speedup, compared with a 1.5× geomean speedup for the oracle strategy. At the other extreme, the combination of sz256, wg, and coop-cv causes the largest number of slowdowns (195; 66% of all tests), leading to a 0.53× geomean speedup overall (i.e. nearly a 2× slowdown). However, this configuration should not be discounted entirely: it yields a 1.7× speedup for the (Gtx1080, sssp-nf, rmat20) test, and it yields the maximum observed speedup for 2.3% of all tests. Thus, while this configuration appears largely harmful, there are a small number of cases where it is beneficial and even optimal. These results show that achieving performance portability universally in this domain is very difficult: every optimisation configuration leads to a slowdown for at least one test.

4.5.4 Optimisations in top speedups

Research question Are certain optimisations always, or commonly, required for top speedups across the board? If not, what is the distribution of optimisations required for top speedups? Since no optimisation combination avoids slowdowns completely, optimisations that contribute to top speedups are now examined. For each test, the optimisation configuration C that led to the top observed speedup is found. This top speedup is then compared with the speedups observed when optimisations are individually toggled. If disabling an optimisation o that is enabled in C leads to a slowdown over C, o is marked as beneficial for the test. If enabling an optimisation o that is disabled in C leads

to a slowdown over C, o is marked as harmful for the test. An optimisation that is neither beneficial nor harmful for a test is marked as benign. Figure 4.4 summarises the extent to which the optimisations are beneficial, benign, or harmful, across all tests; cases where an optimisation is not applicable for a given test (see Table 4.2) were excluded. Nearly all optimisations are harmful for at least 18% of tests, and most optimisations are harmful in more cases than they are beneficial. The most harmful optimisation is fg1, harmful for 43% of tests; the most beneficial are oitergb and fg8, beneficial for 33% and 30% of tests, respectively; the least beneficial is wg, benefiting just 4% of tests. Indeed, all optimisations that are present in top speedup configurations for some tests cause slowdowns in top speedup configurations of other tests. Thus, there is no optimisation that can be safely enabled (or disabled) in a majority of cases to achieve top speedups.

4.5.5 Portable optimisations

Research question What are the performance consequences of abandoning portability by degrees and specialising to chips, applications or inputs across benchmarks in this domain? The results of Section 4.5.4 show that the effects of optimisations vary substantially across tests; thus, deriving optimisation strategies from the runtime data is challenging. The results of applying the analysis of Section 4.4.2 to the runtime data for various degrees of portability are now presented. This allows an investigation into the trade-offs whereby performance improves as portability constraints are reduced. High-level results relating to portability vs. specialisation are shown for the test set, considering three dimensions of specialisation: chip, application, and input. To specialise over a dimension d, the tests are partitioned into distinct subsets where all tests in a subset have the same value for d. Each subset is then assigned an optimisation configuration using the analysis of Section 4.4.2. In the present analysis, all possible dimensions were considered: chip, app(lication), input, chip/app, app/input, chip/input. The global strategy employs no specialisation (and hence no partitions). For the oracle strategy, which is specialised to a test (i.e. all three basic dimensions), a simple search through the experimental data for the maximum speedup was performed.

The effects of specialisation The discussion starts with Figure 4.5, which shows for each optimisation strategy the percentage of tests that exhibited a speedup, slowdown or

[Stacked bar chart: percent of tests (y-axis) against optimisation strategy (x-axis: baseline, global, chip, app, input, chip/app, input/app, chip/inputs, oracle, grouped into portable, specialised 1 dim and specialised 2 dim), with each bar split into speedups, no difference and slowdowns; concrete test counts are annotated on the bars.]

Figure 4.5: The percentage of tests for each optimisation strategy that provided a speedup, no difference or slowdown. Concrete test counts are given on the bars.

[Bar chart: geomean slowdown compared to the oracle (y-axis, log scale) against optimisation strategy (x-axis: baseline, global, app, input, chip, chip/app, input/app, chip/inputs, oracle, grouped into portable, specialised 1 dim and specialised 2 dim); the plotted values include 1.5, 1.31, 1.3, 1.26, 1.24, 1.24, 1.16, 1.15 and 1.00.]

Figure 4.6: The geomean slowdown compared to the oracle across all tests for different optimisation strategies. Concrete geomean values are given on the bars.

no significant change under the optimisation strategy. In this chart, the 43% of tests for which confident speedups were not observed are excluded (see Figure 4.3). As a result, the baseline strategy shows no difference on all tests and the oracle strategy shows speedups on all tests. The completely portable strategy (global) provides a speedup on 62% of tests and a slowdown on 18% of tests. Each additional dimension of specialisation halves the number of slowdowns. The number of speedups remains of roughly the same order of magnitude from the global optimisation strategy up to (but not including) the oracle strategy. While Figure 4.5 shows the number of speedups and slowdowns, it does not measure the magnitude of the runtime differences between optimisation strategies. To illustrate the magnitude of these differences, the geomean under the oracle strategy and the geomean under each optimisation strategy across all tests were computed. Figure 4.6 shows, for each strategy, the strategy geomean normalised to the oracle geomean across all tests. Thus, the value shows the average slowdown per test under the strategy against the best observed. The baseline strategy shows a geomean slowdown of 1.5×, while the oracle strategy shows no slowdown. The following observations can be made from Figures 4.5 and 4.6: the optimal single dimension to specialise across for speedups is chip, which provides 120 speedups as opposed to 104 and 101 for app and input, respectively. Additionally, the geomean slowdown is 1.24× as opposed to 1.3× and 1.26× for app and input. Specialising for application gives the fewest slowdowns but also the largest mean slowdown. The optimal two dimensions to specialise across for speedups are input and application, with 120 tests showing a speedup and a geomean slowdown of 1.15×. While the single-dimension chip specialisation gives the same number of speedups, it has twice as many slowdowns (32 vs. 16). The optimal two dimensions for the fewest slowdowns are application and chip, with 13 slowdowns. Much like the single dimension, this optimisation strategy also has the largest geomean slowdown, at 1.24×. Interestingly, this is the same geomean slowdown as the strategy specialised only across chips. This suggests that the chip and application dimensions do not synergise well.

4.5.6 Optimisations and features of chips, applications and inputs

Research question Can optimisation strategies specialised per chip deliver insights into performance-critical architectural, application, or input features? Here the optimisation strategies produced by the analysis are explored. In particular, the global strategy and the single dimension (chip, input, application) specialisation strategies

are considered, as shown in Table 4.5. The optimisations are studied to reveal details about the domain of specialisation, e.g. why the strategies for the R9 and Iris chips enable coop-cv while the other chips do not. Optimisations for combined dimensions (e.g. chip and input) are not shown, as the optimisations can be seen as the combination of information from the two constituent dimensions. First, universal properties of Table 4.5 are briefly discussed. In particular, no strategy shown ever enables the sz256, fg1 or wg optimisation. Thus, the information available at these dimensions of specialisation is not enough for the analysis to ever enable these optimisations. For wg, this observation is consistent with earlier results. That is, Figure 4.4 shows that wg is only required in 6 tests to achieve the top speedup, the lowest of any optimisation. This is not the case for sz256 or fg1, and it has not yet been diagnosed why these optimisations are never enabled by the analysis. Nevertheless, again as shown by Figure 4.4, sz256, fg1, and wg are required in some cases to observe some top speedups; however, the analysis here does not shed light on these situations.

Global optimisations

As seen in Table 4.5, the global strategy enables the optimisations oitergb, fg8 and sg. This result is consistent with Figure 4.4; namely, these three optimisations have the highest percentage of tests for which they are beneficial. The exception is that the fg1 optimisation has a larger percentage than sg; however, fg1 also has a large percentage of tests for which it is harmful, and this is likely why the analysis did not enable fg1. While it is shown in Section 4.5.3 that no global strategy avoids slowdowns completely, Figure 4.5 shows that this global strategy gives a speedup on over half of the tests. Figure 4.6 shows that while this strategy still suffers a considerable average slowdown over the oracle strategy (1.31×), the global strategy gives a considerable mean speedup over the baseline strategy. Thus, while the analysis can produce specialised models, it appears to be capable of discovering effective global strategies as well.

Chip-specific optimisations

Here, the following observations about the chip-specialised strategies in Table 4.5 are explored: (1) the two Nvidia chips are the only two chips that disable oitergb; (2) Iris and R9 uniquely enable coop-cv; and (3) Mali-4 enables sg even though it does not have subgroups. The observations are each examined using microbenchmarks, revealing interesting performance features of the chips.

Table 4.5: The optimisations enabled by the analysis for the following levels of portability: global, per-chip, per-app, and per-input.

  Strategy          sz256  coop-cv  oitergb  fg1  fg8  sg  wg
  global            XXX
  chip
    R9              XXXX
    Hd5500          XXX
    Iris            XXXX
    Mali-4          XX
    Gtx1080         XX
    M4000           XX
  input
    usa.ny          XX
    rmat20          XX
    2e23            XX
  app
    BFScx           XXX
    BFStopo         X
    BFStp           XXX
    BFSwl           XXX
    BFShybrid       XX
    CC              XX
    MISworklist
    MISpannotia
    MSTsingle-wl
    MSTmulti-wl
    PRresidual      XX
    PRresidual-wl   XX
    PRtp            XX
    SSSPnf          XXX
    SSSPtopo        X
    SSSPwl          XXX
    TRI             XX

[Line chart: GPU utilisation in percent (y-axis) against time per kernel invocation in µs, from 30 to 1000 (x-axis), with one line per chip (R9, Hd5500, Iris, Mali-4, Gtx1080, M4000).]

Figure 4.7: Results of the kernel launch frequency microbenchmark per chip.

Overhead of kernel launches and memory copies The oitergb optimisation is enabled by all chips except those from Nvidia. To establish a causal relationship, a microbenchmark (similar to the one used in previous work to motivate the optimisation on Nvidia architectures [PP16]) was developed and run across the GPUs of this chapter. Essentially, the microbenchmark launches a constant-time kernel a fixed number of times (10,000), interleaving the launches with a memory copy of a single integer from the GPU to the CPU. The memory copy between kernel launches mimics the loop dependence that oitergb moves to the GPU. The constant-time kernels establish the exact utilisation of the GPU, so timing the entire procedure reveals the overhead of launching these kernels and of the memory copies that oitergb is designed to reduce. OpenCL does not provide device timers like CUDA's clock64, so a calibration loop is used to approximate constant-time kernels, and therefore the results are somewhat noisy. Figure 4.7 shows the utilisation of the GPU as the kernel execution time is varied. For a given kernel time, Nvidia chips have relatively higher utilisation compared to other chips, implying that they have the lowest launch and memory copy latencies. The kernel launch and memory copy overheads are sufficiently higher for all other chips that they must include oitergb for performance. Note that oitergb is used by some benchmarks on Nvidia GPUs (Figure 4.1), but for fewer benchmarks than on other chips.
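The constant-time kernel is approximated with a calibration loop along the following lines; this is a sketch, and the calibration constant and output buffer are illustrative rather than the exact ones used.

    // Sketch of the calibrated constant-time kernel used by the launch-overhead
    // microbenchmark. spin_iters is tuned per chip so that the kernel runs for a
    // fixed wall-clock time; a single integer is copied back between launches.
    __kernel void spin(__global int *out, int spin_iters)
    {
        volatile int x = 0;            // volatile keeps the loop from being optimised away
        for (int i = 0; i < spin_iters; i++)
            x += i;
        if (get_global_id(0) == 0)
            out[0] = x;
    }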

Subgroup RMW combining The coop-cv optimisation is enabled for R9 and Iris (but not the other Intel chip, Hd5500). The most common form of this optimisation aggregates atomic RMW instructions within a subgroup. To investigate why this optimisation

Table 4.6: Microbenchmark results for subgroup atomic combining and workgroup memory divergence.

                 R9      Hd5500   Iris   Mali-4   Gtx1080   M4000
  sg-combine     22.10   0.98     7.95   1.06     0.88      0.97
  m-divergence   1.04    1.07     1.08   6.45     1.27      1.08

is only turned on for a few architectures, an OpenCL microbenchmark was written to measure the time for N atomic fetch-and-add invocations on a single memory location (here, N = 20,000). In a separate microbenchmark, all atomics in the subgroup are combined into one atomic (mimicking coop-cv), thus potentially improving throughput by the size of the subgroup. The sg-combine row of Table 4.6 shows the speedup of this coop-cv microbenchmark version over the original. The speedups for R9 and Iris, the two chips for which the analysis suggests the coop-cv optimisation, are notably higher than the values for the other chips. The overhead of subgroup communication for combining causes the speedups to be a fraction of the subgroup size: R9 has a subgroup size of 64, but sees only a 22× speedup. Iris uses a subgroup size of 16 for these kernels, and delivers about half of that as a speedup. The Mali-4 chip has a subgroup size of 1, and does not show a speedup, as expected. The Nvidia chips do not exhibit a speedup, but investigation has shown that the CUDA compiler already performs this optimisation natively.² It is suspected this is also the case for the Nvidia OpenCL compiler. Stranger is that coop-cv is enabled for Iris but not Hd5500, both from Intel. The compiled GPU assembly is not available in a documented form for these chips and thus OpenCL compiler optimisations cannot be explored. It is speculated that coop-cv occurs in the OpenCL compiler, similarly to the Nvidia GPUs, but it is not clear why it does not occur on Iris.

Intra-workgroup memory divergence Since the Mali-4 has a subgroup size of 1, it is intriguing to investigate why the sg optimisation is enabled for it. Recall that sg improves load balance by redistributing work over all threads from the same subgroup. By careful elimination, it was eventually discovered that the workgroup barriers that are a part of the sg optimisation were the source of the speedup. Previous work [LLK+14] has found that semantically unnecessary workgroup barriers can improve performance in GPU kernel execution by limiting the memory divergence of threads in the same workgroup.

² From personal communication with Sreepathi Pai.

Table 4.7: Kernel launches and worklist sizes for different inputs when running BFSwl.

  Input    Kernel launches   Average worklist size
  usa.ny   621               606.5
  rmat20   14                65155.2
  2e23     19                466083.4

Again, a microbenchmark was developed to test this by having two kernels read and write to a large array using strided accesses indexed by thread id. In one of the kernels, a semantically unnecessary barrier is introduced into the loop, so that threads in a workgroup are never more than one iteration away from each other. The speedup of the kernel with barriers over the kernel without barriers across all chips is shown in the m-divergence row of Table 4.6. While all chips appear to benefit from the barrier, the clear outlier is the Mali-4, on which adding the gratuitous barrier leads to an impressive 6.45× speedup. Thus, the Mali-4 appears to be extremely sensitive to intra-workgroup memory divergence; indeed, Figure 4.1 shows that sg is required for top speedups on Mali-4 more than on any other chip. While initially confounding, these findings suggest that a new optimisation may be required to protect against memory divergence on all GPUs.
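The two kernels differ only in the presence of the semantically unnecessary workgroup barrier; the access pattern below is illustrative rather than the exact one used in the microbenchmark.

    // Sketch of the memory-divergence microbenchmark. Both kernels perform the
    // same strided reads and writes; the second adds a redundant barrier so that
    // threads of a workgroup stay within one loop iteration of each other.
    __kernel void strided(__global int *a, int iters, int stride, int len)
    {
        int id = get_global_id(0);
        for (int i = 0; i < iters; i++)
            a[(id + i * stride) % len] += 1;
    }

    __kernel void strided_barrier(__global int *a, int iters, int stride, int len)
    {
        int id = get_global_id(0);
        for (int i = 0; i < iters; i++) {
            a[(id + i * stride) % len] += 1;
            barrier(CLK_GLOBAL_MEM_FENCE);   // semantically unnecessary
        }
    }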

Input-specific optimisations

The different optimisation configurations for the input strategy from Table 4.5 are now examined; namely, that usa.ny enables oitergb and disables sg. Recall from Sections 4.3.3 and 4.5.6 that oitergb is beneficial in the presence of many short kernels being launched iteratively. Graphs with a high diameter, such as usa.ny, tend to lead to many short kernels. The graphs 2e23 and rmat20 have lower diameters and thus have the opposite property: a small number of longer-running kernels. For example, Table 4.7 shows the number of iterations as well as the average size of the input worklist for each input when running the BFSwl application. The number of items in the worklist can be used as an estimate of the time each kernel launch takes. Notice that usa.ny causes at least 30× more iterations than the other two inputs. Additionally, on average there is at least 100× less work per kernel. Because of this property, usa.ny is able to benefit from oitergb regardless of chip or application. The sg optimisation, disabled for usa.ny, is reasoned about similarly. The distribution optimisation of sg balances the worklist items. If there are not many items in the worklist, then the overhead of the distribution orchestration outweighs the benefits. Again, Table 4.7

Table 4.8: Average worklist sizes for three variants of BFS across different inputs.

  App     usa.ny   rmat20     2e23
  BFSwl   606.5    65155.2    466083.4
  BFScx   3698.5   590010.9   1766063.3
  BFStp   3661.4   627411.1   1864174.7

shows that the average number of items in the worklist for usa.ny is much smaller than for the other two inputs, and is small enough that sg is disabled.

Application-specific optimisations

To complete the discussion, a few observations on the application-specific optimisations shown in Table 4.5 are explored. It should be borne in mind that not all optimisations are applicable to all applications (Table 4.2); optimisations that are not applicable for an application will never be enabled for that application. Four high-level observations about application-specific optimisations can be made.

The first observation is that coop-cv is enabled only for BFScx and BFStp. The reason for this is that these two variants of BFS employ an optimistic worklist strategy, where all candidate nodes in a frontier are pushed to the worklist. The other variants, e.g. BFSwl, only push nodes that have not yet been examined. The optimistic strategy can be beneficial due to streaming memory accesses, as node labels do not have to be checked before pushing. However, the optimistic strategy also leads to many more worklist pushes, and thus atomic RMW operations. This can be seen from the average worklist sizes presented in Table 4.8; notice how the tp and cx variants have roughly 5× the number of items as the wl variant. Because of this, aggregating atomics is especially valuable in these applications. The second observation is that, of all the applications for which np is applicable, only

BFStopo disables fg8 and sg. This is because the topo applications use a strategy in which the input worklist logically contains all nodes in the graph. These nodes are partitioned across threads. If a thread finds that a node is not important for the current iteration, the thread is able to simply exit. However, if an np optimisation is used, then all threads must continue executing, including participating in barrier synchronisation. Because fg8 and sg only distribute work across subgroups, there may be many subgroups without meaningful work. These threads then simply add overhead, making fg8 and sg unattractive for this strategy. Because SSSPtopo implements a similar strategy, it is speculated that the analysis would also disable fg8 and sg for this application. However, as discussed

Table 4.9: Number of kernel launches and average time (ms) per kernel for Hd5500.

                 usa.ny                 rmat20                 2e23
  App            launches   avg. time   launches   avg. time   launches   avg. time
  SSSPnf         2407       0.03        60         13.77       80         31.25
  MISpannotia    17         1.65        8          25.00       17         165.47

in Section 4.4.4, the np optimisations are not available for SSSPtopo due to an unresolved issue in the compiler implementation. The third observation is that oitergb is mixed across the applications. While not all cases are examined here, consider the two applications with the highest and lowest common language effect size (i.e. the percentage of observations for which the optimisation caused a performance improvement) for oitergb in the MWU analysis. The application with the highest common language effect size is SSSPnf, for which oitergb caused a speedup in 89% of 481 total samples. The application with the lowest common language effect size is MISpannotia, for which oitergb caused a speedup in only 20% of 44 samples. There are fewer samples for

As discussed throughout, oitergb benefits cases where there are many short kernel launches. Table 4.9 shows the number of kernel launches for both applications, as well as the average time per kernel on Hd5500. Notice that SSSPnf always has at least 4× the number of kernel launches of MISpannotia for each input. Additionally, the average kernel time is roughly 55×, 2×, and 5× less for SSSPnf than for MISpannotia for the three inputs. As a result, SSSPnf is a better candidate for oitergb than MISpannotia. The final observation is that the MIS and MST applications do not enable any optimisations. Thus, these applications likely do not have chip- or input-independent features that optimisations can exploit.
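For reference, the common language effect size of an optimisation o over n paired observations can be written as follows; the notation is introduced here for illustration and is not taken verbatim from the analysis:

\[
\mathrm{CL}(o) \;=\; \frac{\bigl|\{\, i : t_i^{\mathrm{off}} > t_i^{\mathrm{on}} \,\}\bigr|}{n},
\]

where \(t_i^{\mathrm{off}}\) and \(t_i^{\mathrm{on}}\) are the runtimes of the i-th observation with the optimisation disabled and enabled, respectively. For example, the 89% reported for SSSPnf corresponds to roughly 428 of the 481 samples running faster with oitergb enabled.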

4.5.7 Optimisation policies

Research question How does an optimisation policy guided by tuning on a single GPU platform fare when used to make optimisation decisions for other GPU platforms? The effect of taking an optimisation strategy tuned for a given chip and applying it to all chips is now discussed. Such situations may arise if optimisation strategies are specialised for one chip, e.g. a chip that is readily available to a developer, and then ported to a new chip, e.g. a chip that is unavailable to the developer. For this, a new optimisation strategy per chip C is generated, which, given a benchmark b, returns the optimisation configuration that achieved the highest speedups for C on b. These strategies were generated through an exhaustive search through the data, similar to the oracle strategy.

                          optimisations tuned for chip
executed on chip      R9  Hd5500    Iris  Mali-4  Gtx1080   M4000  geomean
R9                  1.00    1.20    1.15    1.32     1.15    1.15     1.16
Hd5500              1.19    1.00    1.06    1.23     1.25    1.18     1.15
Iris                1.08    1.05    1.00    1.19     1.18    1.15     1.11
Mali-4              1.46    1.33    1.41    1.00     1.50    1.43     1.34
Gtx1080             1.23    1.32    1.29    1.36     1.00    1.14     1.22
M4000               1.10    1.16    1.15    1.23     1.05    1.00     1.11
geomean             1.17    1.17    1.17    1.22     1.18    1.17       NA

Figure 4.8: Heatmap of the geomean slowdown compared to the oracle when executing an optimisation configuration tuned for one chip on all other chips.

For all pairs (C, C′) of chips, the runtimes on chip C using the optimisation strategy tuned for C are compared against the runtimes on C using the optimisation strategy tuned for C′. As a metric, the geomean slowdown over the oracle strategy is used (similar to Figure 4.6); that is, the average slowdown across all benchmarks when chip C uses an optimisation strategy tuned for chip C′. Figure 4.8 shows a heatmap of these results, where the rows correspond to chips running the benchmarks and the columns correspond to optimisation strategies tuned for individual chips. The diagonal is 1.00 as there are no slowdowns for a chip using the optimisation strategy tuned for that chip. The values of the bottom row show the geomean across all values in the column associated with an optimisation strategy. Smaller values indicate that a given optimisation strategy is more portable, causing fewer slowdowns across chips. The numbers in the far-right column show the geomean slowdown across the row associated with the chip. Smaller values indicate that a given chip suffers fewer slowdowns under optimisation strategies tailored for different chips. No chip-optimised strategy is completely portable to another chip. However, the two Intel chips (Iris and Hd5500) behave similarly to each other. In contrast, Nvidia's Gtx1080 (a newer architecture) slows down when the optimal configuration from the M4000 (an older architecture) is used, whereas the M4000 works well with the Gtx1080 configuration. These counter-intuitive generational differences are concerning, for they limit the extent to which knowledge gained on one GPU is transferable to another.

Interestingly, the Iris behaves well under the R9 strategy (1.08×), but the reverse is not true (1.15×). Mali-4 provides the highest geomean slowdowns, possibly owing to its architectural differences: an optimisation strategy tailored for Mali-4 slows down benchmarks on other chips. The Mali-4 also suffers the most slowdowns when optimisation strategies tuned for other GPUs are applied.
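To state the metric behind Figure 4.8 explicitly, each heatmap entry can be written as follows; the notation is chosen here for illustration only:

\[
\mathrm{slowdown}(C, C') \;=\; \Bigl(\,\prod_{b \in B} \frac{T_{C}\bigl(\mathit{strategy}_{C'}(b)\bigr)}{T_{C}\bigl(\mathit{strategy}_{C}(b)\bigr)}\Bigr)^{1/|B|},
\]

where B is the set of benchmarks, \(T_{C}(x)\) is the runtime of optimisation configuration x on chip C, and \(\mathit{strategy}_{C}(b)\) is the configuration tuned for chip C on benchmark b (which, by construction, coincides with the oracle configuration for C on b). In Figure 4.8 the row gives the executing chip C and the column gives the chip C′ for which the strategy was tuned.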

4.6 Revisiting results from Chapter 3

The discussion of results in this chapter raises several questions about the results of the previous chapter. Here, two separate issues are discussed that may appear to be in tension across chapters but, upon further investigation, are explainable.

4.6.1 Multi-kernel vs. oitergb

Section 3.6.2 discusses the performance benefits of modifying applications that use iterative kernel launches to instead use a portable global barrier, which is exactly the oitergb optimisation presented in this chapter. The results of Section 3.6.2 show that performance improvements are modest, with only two tests out of 56 (about 3%) showing a speedup of 2× or more. However, in this chapter, Figure 4.3 shows that adding the oitergb optimisation leads to 16% of the tests showing a speedup of 2× or more (as it is the only optimisation in the OCL FP feature class). The reason for this disconnect is that the Pannotia benchmarks are not as optimised as the irgl benchmarks, and as such, the kernel execution times are not short enough to yield as much benefit as was observed in this chapter. As a concrete example, consider MISpannotia, the only benchmark common to the results of this chapter and Section 3.6.2.

It is shown in Table 4.9 that MISpannotia has many fewer (and longer) kernel invocations than SSSPnf, an application that benefits greatly from oitergb. As another example, the Pannotia SSSP implementation uses a sparse matrix–vector multiplication approach that examines every node in every kernel launch. The SSSP approaches in this chapter examine fewer nodes per iteration and thus have shorter kernel execution times. As an additional piece of supporting evidence, consider the oitergb microbenchmark results of Figure 4.7. Notice that the rightmost data point shows that Iris has the lowest GPU utilisation for the longest-running kernels. Thus, for longer-running kernels, Iris would be expected to benefit the most from the global barrier. This is exactly what the results of the previous chapter show in Section 3.9.

For the long-running Pannotia benchmarks, Iris has the largest geomean speedup when using the global barrier. These results reinforce the conclusion that global barrier optimisations are best used to avoid many short kernel launches.
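To illustrate the structural difference at the heart of this comparison, the sketch below contrasts the iterative multi-kernel pattern with a single persistent kernel that uses the global barrier. The per-level work is a trivial stand-in and global_barrier is the Chapter 3 construct; this is an illustration rather than code taken from the benchmarks.

  // Multi-kernel pattern: the host relaunches this kernel once per level;
  // for high-diameter inputs each launch may do very little work.
  kernel void level_step(global int *data, int level) {
    int i = get_global_id(0);
    data[i] += level;                 // stand-in for real per-level work
  }

  // oitergb-style pattern: one launch keeps the occupant workgroups resident
  // and replaces the per-level relaunch with an inter-workgroup barrier.
  kernel void level_loop(global int *data, int num_levels) {
    int i = get_global_id(0);
    for (int level = 0; level < num_levels; level++) {
      data[i] += level;               // same stand-in work
      global_barrier();               // portable global barrier of Chapter 3
    }
  }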

4.6.2 Cost of portability

In this chapter, the discovery protocol and global barrier approach of the previous chapter were used to implement the oitergb optimisation. However, Figure 3.10 shows that the overhead of these portable constructs can be quite severe when compared to a barrier approach that uses a priori occupancy knowledge. In the worst case, R9 suffers a geomean slowdown of 1.71× when using the portable approach. This raises the question as to whether the oitergb implementation in this chapter suffers similarly severe slowdowns due to the portable implementation. Upon deeper investigation into the results of Figure 3.10, the cause of the large slowdowns was identified to be the execution of the discovery protocol. In particular, the slowdown was greatest on the chips with higher occupancies, i.e. the R9 and the Nvidia chips. This is because many workgroups contend for the discovery protocol mutex. While a fair mutex is critical for a high recall, it has low throughput in the presence of high contention, as the lock must be acquired in the order in which it was requested. For the oitergb implementation in this chapter, this issue was addressed by adding a self-destruct mechanism to the discovery protocol mutex. The idea behind this optimisation is that the mutex is no longer required after the poll is closed; threads will simply exit (see Figure 3.3). Thus, the thread that closes the poll can also destroy the mutex. In turn, this alleviates lock contention. In this scheme, the poll open flag must be declared atomic to avoid OpenCL data races. Table 4.10 shows timing results, restricted to R9 as it suffered the most extreme slowdowns, of: (1) running the discovery protocol in isolation with and without the self-destructing mutex; and (2) the new portability overhead of the two most severe slowdowns of Figure 3.10, BFS with 2e23 and rmat22.

Table 4.10: Average runtime (ms) over 20 iterations of the discovery protocol and the portability overhead of the BFS application of Section 3.6.3 using two mutex strategies.

Mutex              Protocol time (ms)   BFS[2e23]   BFS[rmat22]
original                       136.25       2.53×         2.09×
self-destructing                 2.21       1.11×         1.09×

Clearly, the self-destructing mutex reduces the cost of the discovery protocol significantly, and as a result, the portability overhead of the applications becomes much smaller. Thus, while there is likely some cost associated with the discovery protocol and custom execution environment, the portability costs of the oitergb optimisation presented in this chapter are likely much smaller than Figure 3.10 might suggest.
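To make the self-destruct mechanism described above concrete, a minimal sketch is given below, assuming an OpenCL 2.0 ticket-style fair lock; the field and function names are illustrative and do not correspond exactly to the discovery protocol implementation.

  // Illustrative self-destructing ticket mutex: once the poll has been
  // closed the mutex is "destroyed", so waiting and future callers stop
  // queueing and simply exit the protocol.
  typedef struct {
    atomic_int ticket;      // next ticket to hand out
    atomic_int serving;     // ticket currently being served
    atomic_int destroyed;   // set once the poll has been closed
  } discovery_mutex;

  // Returns true if the lock was acquired; false if the mutex has been
  // destroyed, in which case the caller should exit the discovery protocol.
  bool lock_or_exit(global discovery_mutex *m) {
    if (atomic_load(&m->destroyed))
      return false;
    int my_ticket = atomic_fetch_add(&m->ticket, 1);
    while (atomic_load(&m->serving) != my_ticket) {
      if (atomic_load(&m->destroyed))
        return false;       // poll closed while waiting: give up
    }
    return true;
  }

  void unlock(global discovery_mutex *m) {
    atomic_fetch_add(&m->serving, 1);
  }

  // Executed by the workgroup that closes the poll; poll_open is atomic to
  // avoid OpenCL data races, as noted above.
  void close_poll_and_destroy(global discovery_mutex *m,
                              global atomic_int *poll_open) {
    atomic_store(poll_open, 0);
    atomic_store(&m->destroyed, 1);
  }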

4.7 Related work

A study [VVdL+15] of the portability of three OpenCL graph algorithms using a CPU and two Nvidia GPUs concluded that the effects of inputs swamped any benefits gained by optimising them for specific OpenCL platforms, and that maintaining a single (functionally) portable version across platforms was more productive. The results of this chapter support the finding that inputs play a significant role in performance, but show that specialising for input is better than not optimising at all, and that specialising across other dimensions using a compiler can deliver even more performance. Merrill et al. [MGG12a] construct a performance-portable library of parallel primitives containing reductions, scans, sorting algorithms, etc., by encoding tunable parameters such as the number of items per load, the number of threads per thread block, etc., in the CUDA/C++ type system. Their system is evaluated on three Nvidia GPUs and reaches similar conclusions about the lack of globally applicable optimisations for the problems they consider. Other work [ZSC13] has studied the performance portability of OpenCL on more traditional GPU problems, such as SGEMM, SpMV and FFT. In particular, Price and McIntosh-Smith [PMS17] study the performance portability of a GPU Jacobi solver over Nvidia, AMD and Intel chips (including two Intel CPU chips). The parameter space in their work is sufficiently large (13 parameters, some with many options) that they do not perform an exhaustive search, but rather run an autotuning method for several hours. A heat map, similar to Figure 4.8, is used to show the difficulties of portability, especially across vendors. To address these issues, they employ a bi-level autotuning optimisation approach, using the worst-performing test as a metric. That is, they consider one optimisation configuration o better than another o′ for a set of benchmarks b if the largest slowdown when applying o is smaller than the largest slowdown when applying o′ over b. This approach would not work immediately in the domain of this chapter, as Section 4.5.3 shows that all possible optimisations cause at least one slowdown. Thus, this approach would simply converge to the completely unoptimised configuration.

Using a metric other than the largest slowdown may be possible, but many candidates are immediately problematic. For example, if the number of slowdowns is used as a metric, then the first optimisation configuration of Table 4.4 would be found, which, as discussed, is a low-risk, low-reward configuration. Additionally, if the geomean were used as a metric, then the approach may favour certain chips since, as Table 4.3 shows, some chips provide higher speedups than others. Thus, the analysis presented in this chapter uses a non-parametric, rank-based test, which appears to overcome these limitations. It has been shown that the layout of byte code, stack frames, and the heap can introduce performance measurement bias on CPU systems. Previous work has used layout randomisation techniques across runs to remove such biases [CB13, MDHS09]. It is unclear whether GPU systems suffer similar measurement biases, and proprietary OpenCL runtimes make it difficult to implement similar randomisation techniques. Randomisation might improve the quality of the data collected for each test, but otherwise the methods of this chapter would be unchanged. The choice to use a non-parametric statistical method in this chapter was motivated by previous work that shows experimental data from computer systems is largely not normally distributed, making techniques like ANOVA inapplicable, and proposes the use of quantile regression [dOFD+13]. The MWU test is similarly non-parametric, but quantile analysis was not necessary for the application domain of this chapter. Muralidharan et al. [MRH+16] propose a technique for autotuning across architectures without retraining for the target architecture. Using a set of performance data (e.g. performance counters) obtained from 6 different Nvidia GPUs, together with target architecture features available at runtime, they use an SVM to predict the most appropriate variants. The results of this chapter aim to construct descriptive models as opposed to predictive models, treating GPUs, applications and inputs as black boxes. Appropriately, given the poor state of portable OpenCL profilers, the approach in this chapter requires only the ability to run and time a program.

4.8 Summary

Enabled by the functionally portable global barrier of Chapter 3, this chapter has demonstrated the difficulty of achieving performance portability of OpenCL graph algorithms across a diverse set of GPUs. The results have shown that universally beneficial (or harmful) optimisations do not exist. A data analysis was presented that can consume exhaustive experimental data and yield optimisation strategies that can be tailored for portability by various degrees of specialisation over chips, applications and inputs. The geomean gap between an oracular optimisation strategy and a strategy that specialises for application and input (but is oblivious to chip) is 16%.

Using strategies customised to each dimension, several performance bottlenecks were identified. Along each dimension, the following were shown to be key features for performance:

• chip features: kernel launch latency, lack of atomic RMW combining, and memory divergence;

• input features: the diameter of the graph;

• application features: the number of nodes pushed to a frontier-based worklist, and the number and average runtime of individual kernels.

Overall, for the domain of GPU graph algorithms, this chapter has provided bounds for performance under various portability constraints. While the performance evaluation of the global barrier was inconclusive in Chapter 3, this chapter has explored in detail a domain of GPU algorithms where there is a clear benefit to the global barrier. We believe this study strengthens the argument that official support for an OpenCL global barrier should be considered. Next, Chapter 5 addresses concerns raised by industry about the interaction between interactive GPU applications, e.g. graphics, and applications that use a global barrier.

5 Cooperative GPU Multitasking

5.1 Personal context

The obvious limitation of the global barrier work of Chapters 3 and 4 is that the occupancy-bound execution model on which it relies is not officially supported by OpenCL, nor by any vendor. We wanted to understand whether the OpenCL committee (and the associated vendors) would be open to considering the execution model given that we had shown that it has some pragmatic utility. To that end, I presented the global barrier of Chapter 3 at ARM (in Cambridge, UK), at AMD (in Redmond, WA) and at the International Workshop on OpenCL (in Vienna, AT). Additionally, we had email correspondence with Lee Howes, who had worked on GPUs at Qualcomm and is an editor of the OpenCL specification [Khr15], and with Andrew Richards at Codeplay. The take-away from all these conversations was that while today's GPUs empirically support the occupancy-bound execution model, no vendor wants to commit to the model. That is, vendors want to keep open the possibility of developments that would break the progress guarantees of the model. In particular, we understood that: (1) a completely fair preemptive model, e.g. like that of mainstream CPUs, would likely be too expensive due to the large state of workgroups, but (2) unfair preemption, where an occupant workgroup is preempted with no guarantee of returning to execution relative to other occupant workgroups, may be useful for cases of multitasking with high-priority tasks (e.g. graphics) or energy throttling, especially on mobile GPUs. The work presented in this chapter provides new programming constructs that aim to address these concerns from industry, while still offering enough progress guarantees to use blocking idioms, including global barrier synchronisation.

5.2 Motivation

Many interesting parallel algorithms are irregular: the amount of work to be processed is unknown ahead of time and may change dynamically in a workload-dependent manner. There is growing interest in accelerating such algorithms on GPUs, with the work in Chapter 4 being a concrete example.

Figure 5.1: Cooperative kernels can flexibly resize to allow other tasks, e.g. graphics, to run concurrently.

In some cases, irregular algorithms can be optimised using blocking synchronisation between workgroups, as seen with the global barrier used by the oitergb optimisation in Chapter 4. Another blocking idiom used in parallel irregular algorithms is work stealing, which requires each workgroup to maintain a queue, typically mutex-protected, to enable stealing by other workgroups. To avoid starvation, a blocking concurrent algorithm requires fair scheduling of workgroups. For example, if one workgroup holds a mutex, an unfair scheduler may cause another workgroup to spin-wait forever for the mutex to be released. Similarly, an unfair scheduler can cause a workgroup to spin-wait indefinitely at a global barrier if the other workgroups are never scheduled and so never reach the barrier.

A degree of fairness: occupancy-bound execution As discussed in Chapter 3, current GPU programming models specify almost no fairness guarantees regarding the scheduling of workgroups, and current GPU schedulers are not totally fair in practice. Instead, the occupancy-bound execution model, empirically observed on many current GPUs, maps each workgroup to a hardware resource and provides relative fairness to as many workgroups as can simultaneously occupy the GPU. In this chapter, the hardware resource required to execute a workgroup will be called a compute unit, although, as noted in Section 3.6.1, sometimes a compute unit can be oversubscribed with more than one workgroup occupant. The occupancy-bound execution model does not guarantee fair scheduling between all workgroups: if the hardware resources of a GPU are fully occupied, then a not-yet-occupant workgroup will not be scheduled until some occupant workgroup completes execution. Yet the execution model does provide fair scheduling between occupant workgroups, which are bound to separate compute units that operate in parallel. Current GPU implementations of blocking algorithms assume the occupancy-bound execution model, using either occupancy assumptions or discovery methods, as discussed in Chapter 3.

Resistance to occupancy-bound execution Despite its practical prevalence, none of the current GPU programming models actually mandates occupancy-bound execution. Further, there are reasons why this model is undesirable. First, occupancy-bound execution does not enable multitasking, since a workgroup effectively owns a compute unit until the workgroup has completed execution. The GPU cannot be used meanwhile for other tasks, even high-priority interactive tasks (e.g. graphics rendering). Second, energy throttling is an important concern for battery-powered devices [VC13]. In the future, it may be desirable for a mobile GPU driver to power down some compute units, unfairly suspending execution of the associated occupant workgroups, if the battery level is low. Our assessment, informed by discussions with a number of industrial practitioners who have been involved in the OpenCL and/or HSA standardisation efforts, is that GPU vendors (1) will not commit to the occupancy-bound execution model they currently implement, for the above reasons, yet (2) will not guarantee fair scheduling using preemption. This is due to the high runtime cost of preempting workgroups, which requires managing thread-local state (e.g. registers, program counters) for all workgroup threads (up to 1024 on Nvidia GPUs), as well as local memory (up to 64 KB on Nvidia GPUs). Vendors instead wish to retain the essence of the simple occupancy-bound model, supporting preemption only in key special cases. As a concrete example, Nvidia's Pascal architecture is documented to support preemption [NVI16], yet the documentation provides no fairness guarantees. Indeed, on a GPU from this architecture (GTX Titan X), starvation for blocking algorithms is still observed: a kernel containing a global barrier executes successfully when run with 56 workgroups, but hangs indefinitely when run with 57 workgroups. This indicates that the ability to preempt does not imply totally fair scheduling.

A way forward: cooperative kernels To summarise: blocking algorithms demand fair scheduling, but for various pragmatic reasons GPU vendors will not commit to fairness guarantees, even the limited guarantees provided by the occupancy-bound execution model. This chapter presents cooperative kernels, an extension proposal for the OpenCL GPU programming model that aims to resolve this impasse. The extensions and their semantics are briefly described below. A kernel that requires fair scheduling is identified as cooperative, and is written using two additional language primitives, offer kill and request fork, placed by the programmer.

Where the cooperative kernel could proceed with fewer workgroups, a workgroup can execute offer kill, offering to sacrifice itself to the scheduler. This indicates that the workgroup would ideally continue executing, but that the scheduler may preempt the workgroup; the cooperative kernel must be prepared to deal with either scenario. Where the cooperative kernel could use additional resources, a workgroup can execute request fork to indicate that the kernel is prepared to proceed with the existing set of workgroups, but is able to benefit from one or more additional workgroups commencing execution directly after the request fork program point. The use of request fork and offer kill creates a contract between the scheduler and the cooperative kernel. Functionally, the scheduler must guarantee that the workgroups executing a cooperative kernel are fairly scheduled, while the cooperative kernel must be robust to workgroups leaving and joining the computation in response to offer kill and request fork. Non-functionally, a cooperative kernel must ensure that offer kill is executed frequently enough that the scheduler can accommodate soft real-time constraints, e.g. allowing a smooth frame rate for graphics. In return, the scheduler should allow the cooperative kernel to utilise hardware resources where possible, killing workgroups only when demanded by other tasks, and forking additional workgroups when possible. Cooperative kernels allow for cooperative multitasking, used historically when preemption was not available or too costly. Cooperative kernels avoid the cost of arbitrary preemption, as the state of a workgroup killed via offer kill does not have to be saved. Previous cooperative multitasking systems have provided yield semantics, whereby a processing unit would temporarily give up its hardware resource. Cooperative kernels deviate from this design because, in the case of a global barrier, adopting yield would force the cooperative kernel to block completely when a single workgroup yields, stalling the kernel execution until the given workgroup resumes. Instead, offer kill allows a kernel to make progress with a smaller number of workgroups, with workgroups potentially joining again later via request fork. Figure 5.1 illustrates the sharing of GPU compute units between a cooperative kernel and a graphics task. Workgroups 2 and 3 of the cooperative kernel are killed at an offer kill to make room for a graphics task. The workgroups are subsequently restored to the cooperative kernel when workgroup 0 calls request fork. The gather time is the time between resources being requested and the application surrendering them via offer kill. To satisfy soft real-time constraints, the gather time should be low; the experimental evaluation of Section 5.7.2 shows that, in practice, the gather time for applications is acceptable for a range of graphics workloads.

The cooperative kernels model has several appealing properties:

1. By providing fair scheduling between workgroups, cooperative kernels meet the needs of blocking algorithms, including irregular algorithms.

2. The model has no impact on the development of regular (non-cooperative) compute and graphics kernels.

3. The model is backwards-compatible: offer kill and request fork may be ignored, and a cooperative kernel will behave exactly as a regular kernel does on current GPUs.

4. Cooperative kernels can be implemented over the occupancy-bound execution model provided by current GPUs: the prototype implementation of this chapter uses no special hardware or driver support beyond what is provided by the OpenCL standard.

5. Cooperative kernels are complementary to preemption; that is, if hardware support for preemption is available, it can be leveraged to implement cooperative kernels efficiently, as the programmer provides “smart” preemption points.

Placing the primitives manually is straightforward for the representative set of GPU-accelerated irregular algorithms examined in this chapter. The experimental evaluation shows that the model can enable efficient multitasking of cooperative and non-cooperative tasks.

5.2.1 Chapter contributions

In summary, the main contributions of this chapter are:

• cooperative kernels, an extension to the OpenCL GPU programming model that supports the scheduling requirements of blocking algorithms (Section 5.4), while also addressing pragmatic concerns of vendors;

• a prototype implementation of cooperative kernels, including a simple scheduler, on top of OpenCL 2.0 (Section 5.5);

• two sets of applications: one set adapted to use the cooperative kernels model, and another chosen to be representative of interactive GPU tasks (Section 5.6);

• an experimental evaluation assessing the overhead and responsiveness of the cooperative kernels approach over a set of irregular algorithms (Section 5.7), including a best-effort comparison with the efficiency afforded by the hardware-supported preemption available on Nvidia GPUs (Section 5.7.3).

Related publications The material presented in this chapter is based on work published in the 11th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE'17) [SED17a]. The work received a distinguished paper award.

5.3 Two blocking GPU idioms

This section presents two high-level code patterns that appear in existing GPU applications and use blocking synchronisation between workgroups: work stealing (Section 5.3.1) and graph traversal (Section 5.3.2).

5.3.1 Work stealing

Work stealing enables dynamic balancing of tasks across processing units. It is useful when the number of tasks to be processed is dynamic, due to a task potentially creating an arbitrary number of new tasks. Work stealing has been explored in the context of GPUs [CT08, TPO10]. Each workgroup has a local queue from which it obtains tasks to process, and to which it stores new tasks. If its queue is empty, a workgroup tries to steal a task from another queue. Figure 5.2 illustrates a work stealing kernel. Each thread receives a pointer to the task queues, in global memory, initialised by the host to contain initial tasks. A thread uses its workgroup id (line 2) as a queue id to access the relevant task queue.

1 kernel void work_stealing(global Task * queues) {
2   int queue_id = get_group_id();
3   while (more_work(queues)) {
4     Task * t = pop_or_steal(queues, queue_id);
5     if (t) {
6       process_task(t, queues, queue_id);
7     }
8   }
9 }

Figure 5.2: An excerpt of a work stealing algorithm in OpenCL.

The pop_or_steal function (line 4) pops a task from the workgroup's queue or tries to steal a task from other queues. Although not depicted here, concurrent accesses to queues inside more_work and pop_or_steal are guarded by a mutex per queue, implemented using atomic compare-and-swap operations on global memory. If a task is obtained, then the workgroup processes it (line 6), which may lead to new tasks being created and pushed to the workgroup's queue. The kernel presents two opportunities for spin-waiting: spinning to obtain a mutex, and spinning in the main kernel loop to obtain a task. Without fair scheduling, threads waiting at these points might spin indefinitely, causing the application to hang.
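Although the queue internals are not shown in Figure 5.2, the following sketch gives one possible shape for the mutex-guarded pop-or-steal logic; the TaskQueue layout, the spin lock and the stealing order are assumptions made for illustration and do not reflect the exact benchmark code (Task is the opaque task type of Figure 5.2).

  // Illustrative per-queue spin lock and pop-or-steal logic.
  typedef struct {
    atomic_int lock;        // 0 = free, 1 = held
    int head;               // index of the next task to pop
    int tail;               // index one past the last task
    global Task *tasks;     // backing storage for this queue
  } TaskQueue;

  void queue_lock(global TaskQueue *q) {
    int expected = 0;
    // Spin until the compare-and-swap succeeds; without fair scheduling a
    // waiting workgroup could starve here.
    while (!atomic_compare_exchange_strong(&q->lock, &expected, 1))
      expected = 0;
  }

  void queue_unlock(global TaskQueue *q) {
    atomic_store(&q->lock, 0);
  }

  global Task *pop_or_steal(global TaskQueue *queues, int queue_id) {
    // Try this workgroup's own queue first, then the others in order.
    int num_queues = get_num_groups(0);
    for (int i = 0; i < num_queues; i++) {
      global TaskQueue *q = &queues[(queue_id + i) % num_queues];
      queue_lock(q);
      global Task *t = (q->head < q->tail) ? &q->tasks[q->head++] : 0;
      queue_unlock(q);
      if (t)
        return t;
    }
    return 0;               // no task available anywhere this round
  }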

 1 kernel void graph_app(global graph *g, global nodes *n0,
 2                       global nodes *n1) {
 3   int level = 0;
 4   global nodes *in_nodes = n0;
 5   global nodes *out_nodes = n1;
 6   int tid = get_global_id();
 7   int stride = get_global_size();
 8   while(in_nodes.size > 0) {
 9     for (int i = tid; i < in_nodes.size; i += stride) {
10       process_node(g, in_nodes[i], out_nodes, level);
11     }
12     swap(&in_nodes, &out_nodes);
13     global_barrier();
14     reset(out_nodes);
15     level++;
16     global_barrier();
17   }
18 }

Figure 5.3: An outline of an OpenCL graph traversal.

5.3.2 Graph traversal

Figure 5.3 illustrates a frontier-based graph traversal algorithm, representative of algorithms studied in Chapter 4 and in related work [BNP12, PP16]. The kernel is given three arguments in global memory: a graph structure, and two arrays of graph nodes. Initially, n0 contains the starting nodes to process. The private variable level records the current frontier level, and in_nodes and out_nodes point to distinct arrays recording the nodes to be processed during the current and next frontier, respectively.

The application iterates as long as the current frontier contains nodes to process (line 8). At each frontier, the nodes to be processed are evenly distributed between threads through stride-based processing. In this case, the stride is the total number of threads, obtained via get_global_size. A thread calls process_node to process a node given the current level, with nodes to be processed during the next frontier being pushed to out_nodes. After processing the frontier, the node array pointers of a thread are swapped (line 12). At this point, the GPU threads must wait for all other threads to finish processing the frontier. To achieve this, a global barrier is used (line 13). After all threads reach this point, the output node array is reset (line 14) and the level is incremented. The threads use another global barrier to wait until the output node array has been reset (line 16), after which they continue to the next frontier. As discussed throughout this thesis, the global barrier is not provided as a GPU primitive. However, a robust, portable global barrier can be implemented following the approach of Chapter 3.

Blocking summary The mutexes and barriers used by these two examples appear to run reliably on current GPUs for kernels that are executed with no more workgroups than can be simultaneously occupant on the GPU, e.g. as in Chapter 4. This is due to the fairness of the occupancy-bound execution model, discussed in Chapter 3, that current GPUs have been shown, experimentally, to provide. But, as discussed in Section 5.2, this model is not endorsed by language standards or vendor implementations, and may not be respected in the future. In Section 5.4.2, it is shown how the work stealing and graph traversal examples of Figures 5.2 and 5.3 can be updated to use the cooperative kernels programming model to resolve the scheduling issue.

5.4 Cooperative kernels

The new constructs of the cooperative kernels programming model are summarised in Table 5.1, and the semantics of the new constructs are described in more detail in Section 5.4.1. The blocking idioms presented in the previous section are used to assess the programmability of these constructs (Section 5.4.2), and then the non-functional properties required by the model are outlined (Section 5.4.3). In the extended version of the published work, operational semantics are given [SED17b]. These are not provided in this thesis as they are primarily a contribution of Alastair F. Donaldson. Section 5.5.3 discusses reasonable alternative semantics for the cooperative kernel constructs that might be considered.

Table 5.1: Overview of new cooperative kernel constructs.

transmit: A qualifier for private variables. The values in variables qualified with transmit in thread 0 of a workgroup will be copied into the private memory of new workgroups forked with request fork.

request fork: Signals to the scheduler that, if there are sufficient resources available, new workgroups can be forked and join the computation at the call of request fork. New workgroups are assigned contiguous ids and have access to kernel arguments and variables qualified with transmit from the workgroup that called request fork. Must be called in a workgroup-uniform location.

offer kill: Signals to the scheduler that, if resources are needed elsewhere, the calling workgroup can permanently leave the computation at this point. The scheduler may only kill the active workgroup with the highest id. Must be called in a workgroup-uniform location.

global barrier: Like the global barrier discussed in Chapters 3 and 4, synchronises all active threads executing the cooperative kernel. Must be called in a kernel-uniform location.

resizing global barrier: Similar to global barrier, synchronises all active threads. Additionally, workgroups may be forked or killed (similar to request fork and offer kill) at the point of synchronisation. Must be called in a kernel-uniform location.

5.4.1 Semantics of cooperative kernels

As with a regular OpenCL kernel, a cooperative kernel is launched by the host application, passing parameters to the kernel and specifying a desired number of threads and workgroups. Unlike in a regular kernel, the parameters to a cooperative kernel are immutable (though pointer parameters can refer to mutable data). Cooperative kernels are written using the following extensions: transmit, a qualifier on the variables of a thread; offer kill and request fork, the key functions that enable cooperative scheduling; and the global barrier and resizing global barrier primitives for inter-workgroup synchronisation.

Transmitted variables A variable declared in a cooperative kernel can optionally be annotated with a new transmit qualifier. Annotating a variable v with transmit means that when a workgroup uses request fork to spawn new workgroups, the workgroup should transmit its current value for v to the threads of the new workgroups. The semantics for this are detailed in the discussion of request fork below.

Active workgroups If the host application launches a cooperative kernel requesting N workgroups, this indicates that the kernel should be executed with a maximum of N workgroups, and that as many workgroups as possible, up to this limit, are desired. However, the scheduler may initially schedule fewer than N workgroups, and, as explained below, the number of workgroups that execute the cooperative kernel can change during the lifetime of the kernel. The number of active workgroups—workgroups executing the kernel—is denoted M. Active workgroups have consecutive ids in the range [0, M − 1]. Initially, at least one workgroup is active; if necessary, the scheduler must postpone the kernel until some compute unit becomes available. For example, in Figure 5.1: at the beginning of the execution M = 4; while the graphics task is executing M = 2; after the fork M = 4 again. When executed by a cooperative kernel, get_num_groups returns M, the current number of active workgroups. This is in contrast to get_num_groups for regular kernels, which returns the fixed number of workgroups that execute the kernel. Fair scheduling is guaranteed between active workgroups; i.e. if some thread in an active workgroup is enabled, then eventually this thread is guaranteed to execute an instruction.

Semantics for offer kill The offer kill primitive allows the cooperative kernel to return compute units to the scheduler by offering to sacrifice workgroups. The idea is as follows: allowing the scheduler to arbitrarily and abruptly terminate execution of workgroups might be drastic, yet the kernel may contain specific program points at which a workgroup could gracefully leave the computation. Similar to the OpenCL workgroup barrier primitive, offer kill is a workgroup-level function—it must be encountered uniformly by all threads in a workgroup. Suppose a workgroup with id m executes offer kill. If the workgroup has the largest id among active workgroups then it can be killed by the scheduler, except that workgroup 0 can never be killed (to avoid early termination of the kernel). More formally, if m < M − 1 or M = 1 then offer kill is a no-op. If instead M > 1 and m = M − 1, the scheduler can choose to ignore the offer, so that offer kill executes as a no-op, or accept the offer, so that execution of the workgroup ceases and the number of active workgroups M is atomically decremented by one. Figure 5.1 illustrates this, showing that workgroup 3 is killed before workgroup 2.

Semantics for request fork Recall that a desired limit of N workgroups was specified when the cooperative kernel was launched, but that the number of active workgroups, M, may be smaller than N, either because (due to competing workloads) the scheduler did not provide N workgroups initially, or because the kernel has given up some workgroups via offer kill calls. Through the request fork primitive (also a workgroup-level function), the kernel and scheduler can collaborate to allow new workgroups to join the computation at an appropriate point and with appropriate state. Suppose a workgroup with id m < M executes request fork. Then the following occurs: an integer k ∈ [0, N − M] is chosen by the scheduler; k new workgroups are spawned with consecutive ids in the range [M, M + k − 1]; and the active workgroup count M is atomically incremented by k. The k new workgroups commence execution at the program point immediately following the request fork call. The variables that describe the state of a thread are all uninitialised for the threads in the new workgroups; reading from these variables without first initialising them is undefined behaviour. There are two exceptions to this: (1) because the parameters to a cooperative kernel are immutable, the new threads have access to these parameters as part of their local state and can safely read from them; (2) for each variable v annotated with transmit, every new thread's copy of v is initialised to the value that thread 0 in workgroup m held for v at the point of the request fork call. In effect, thread 0 of the forking workgroup transmits the relevant portion of its local state to the threads of the forked workgroups. Figure 5.1 illustrates the behaviour of request fork. After the graphics task finishes executing, workgroup 0 calls request fork, spawning two new workgroups with ids 2 and 3. Workgroups 2 and 3 join the computation where workgroup 0 called request fork. Notice that k = 0 is always a valid choice for the number of workgroups to be spawned by request fork, and is guaranteed if M is equal to the workgroup limit N.

Global barriers Because the workgroups of a cooperative kernel are fairly scheduled, a global barrier primitive can be provided in this model. Two variants are specified: global barrier and resizing global barrier. The global barrier primitive is similar to the global barrier described throughout this thesis. It is a kernel-level function, i.e. if it appears in conditional code then it must be reached by all threads executing the cooperative kernel. On reaching a global barrier, a thread waits until all threads have arrived at the barrier. Once all threads have arrived, the threads may proceed past the barrier with the guarantee that all global memory accesses issued before the barrier have completed. The global barrier primitive can be implemented similarly to the approach described in Chapter 3; however, the implementation must take into account a growing and shrinking number of workgroups, as request fork and offer kill may change the count of active workgroups.

The resizing global barrier primitive is also a kernel-level function. It is identical to global barrier, except that it caters for cooperation with the scheduler: by issuing a resizing global barrier the programmer indicates that the cooperative kernel is prepared to proceed after the barrier with more or fewer workgroups. When all threads have reached resizing global barrier, the number of active workgroups, M, is atomically set to a new value, say M′, with 0 < M′ ≤ N. If M′ = M then the active workgroups remain unchanged. If M′ < M, workgroups [M′, M − 1] are killed. If M′ > M then M′ − M new workgroups join the computation after the barrier, as if they were forked from workgroup 0. In particular, the transmit-annotated local state of thread 0 in workgroup 0 is transmitted to the threads of the new workgroups. The semantics of resizing global barrier can be modelled via calling request fork and offer kill, surrounded and separated by calls to a global barrier, i.e.:

  global_barrier();
  if (get_group_id() == 0)
    request_fork();
  global_barrier();
  offer_kill();
  global_barrier();

The enclosing global barrier calls ensure that the change in the number of active workgroups from M to M′ occurs entirely within the resizing barrier, so that M changes atomically from a programmer's perspective. The middle global barrier, while not semantically necessary, ensures that forking occurs before killing, so that workgroups [0, min(M, M′) − 1] are left intact; that is, workgroups are not killed only to be immediately forked again. Because resizing global barrier can be implemented as above, it is not regarded conceptually as a primitive of the cooperative kernels model. However, Section 5.5.2 shows how a resizing barrier can be implemented more efficiently through direct interaction with the scheduler.

5.4.2 Programming with cooperative kernels

A changing workgroup count Unlike in regular OpenCL, the value returned by get_num_groups is not fixed during the lifetime of a cooperative kernel: it corresponds to the active group count M, which changes as workgroups execute offer kill and request fork. The value returned by get_global_size is similarly subject to change. A cooperative kernel must thus be written in a manner that is robust to changes in the values returned by these functions. In general, their volatility means that use of these functions should be avoided. However, the situation is more stable if a cooperative kernel does not call offer kill and request fork directly, so that only resizing global barrier can affect the number of active workgroups. Then, at any point during execution, the threads of a kernel are executing between some pair of resizing barrier calls, called a resizing barrier interval (considering the kernel entry and exit points conceptually to be special cases of resizing barriers). The active workgroup count is constant within each resizing barrier interval, so that get_num_groups and get_global_size return stable values during such intervals. As illustrated below for graph traversal, this can be exploited by algorithms that perform strided data processing.

 1 kernel void cooperative_work_stealing(global Task * queues) {
 2   int queue_id;
 3   while (more_work(queues)) {
 4     request_fork();
 5     offer_kill();
 6     queue_id = get_group_id();
 7     Task * t = pop_or_steal(queues, queue_id);
 8     if (t) {
 9       process_task(t, queues, queue_id);
10     }
11   }
12 }

Figure 5.4: Cooperative kernel version of the work stealing kernel of Figure 5.2. Changes to correctly use cooperative features are highlighted.

Adapting work stealing In this pattern, there is no state to transmit, since a computation is entirely parameterised by a task, which is retrieved from a global queue. Figure 5.4 shows the original work stealing code of Figure 5.2 adapted to use cooperative features. Calls to request fork and offer kill are added at the start of the main loop (lines 4 and 5) to let a workgroup offer itself to be forked or killed, respectively, before it processes a task. Note that a workgroup may be killed even if its associated task queue is not empty, since remaining tasks will be stolen by other workgroups. In addition, since request fork may be the entry point of a workgroup, the queue id must now be computed after it, so the call to get_group_id is moved inside the loop after request fork, i.e. to line 6 in the adapted code. In particular, the queue id cannot be transmitted, since a newly spawned workgroup should read from its own queue and not from that of the forking workgroup.

 1 kernel void cooperative_graph_app(global graph *g,
 2                                   global nodes *n0,
 3                                   global nodes *n1) {
 4   transmit int level = 0;
 5   transmit global nodes *in_nodes = n0;
 6   transmit global nodes *out_nodes = n1;
 7   while(in_nodes.size > 0) {
 8     int tid = get_global_id();
 9     int stride = get_global_size();
10     for (int i = tid; i < in_nodes.size; i += stride) {
11       process_node(g, in_nodes[i], out_nodes, level);
12     }
13     swap(&in_nodes, &out_nodes);
14     resizing_global_barrier();
15     reset(out_nodes);
16     level++;
17     resizing_global_barrier();
18   }
19 }

Figure 5.5: Cooperative kernel version of the graph traversal kernel of Figure 5.3. Changes to correctly use cooperative features are highlighted.

Adapting graph traversal Figure 5.5 shows a cooperative version of the graph traversal kernel of Figure 5.3 from Section 5.3. On lines 14 and 17, the original global barriers are changed into resizing barriers. Several variables are marked to be transmitted in the case of workgroups joining at the resizing barriers (lines 4, 5 and 6): level must be restored so that new workgroups know which frontier they are processing; in_nodes and out_nodes must be restored so that new workgroups know which of the node arrays to use for input and output. Lastly, the static work distribution of the original kernel is no longer valid in a cooperative kernel. This is because the stride (which is based on M) may change after each resizing barrier call. To fix this, the work is re-distributed after each resizing barrier call by recomputing the thread id and stride (lines 8 and 9). This example exploits the fact that the cooperative kernel does not issue offer kill or request fork directly: the value of stride obtained from get_global_size at line 9 is stable until the next resizing barrier at line 14.

Patterns for irregular algorithms Section 5.6.1 describes the set of irregular GPU algorithms used in the experiments of this chapter, which largely captures the irregular blocking algorithms that were available as open-source GPU kernels at the time of the work. These algorithms all either employ work stealing or operate on graph data structures. Placing the new constructs in these applications followed common, easy-to-follow patterns in each case. The work stealing algorithms have a transactional flavour and require little or no state to be carried between transactions. The point at which a workgroup is ready to process a new task is a natural place for offer kill and request fork, and few or no transmit annotations are required. Figure 5.5 is representative of most level-by-level graph algorithms. It is typically the case that, on completing a level of the graph algorithm, the next level could be processed by more or fewer workgroups, which resizing global barrier facilitates. Some level-specific state must be transmitted to new workgroups, which was determined on a case-by-case basis.

5.4.3 Non-functional requirements

The semantics presented in Section 5.4.1 describe the behaviours that a developer of a cooperative kernel should be prepared for. However, the aim of cooperative kernels is to find a balance that allows efficient execution of algorithms requiring fair scheduling, and responsive multitasking, so that the GPU can be shared between cooperative kernels and other, shorter tasks with soft real-time constraints. To achieve this balance, an implementation of the cooperative kernels model, and the programmer of a cooperative kernel, must strive to meet the following non-functional requirements. The purpose of offer kill is to let the scheduler destroy a workgroup in order to schedule higher-priority tasks. The scheduler relies on the cooperative kernel to execute offer kill sufficiently frequently that the soft real-time constraints of other workloads can be met. Using the work stealing example: a workgroup offers itself to the scheduler after processing each task. If tasks are sufficiently short then the scheduler will have ample opportunities to de-schedule workgroups. But if tasks are very long then it might be necessary to rewrite the algorithm so that tasks are shorter and more numerous, to achieve a higher rate of calls to offer kill. Getting this non-functional requirement right is GPU- and application-dependent. In Section 5.6.3, experiments are presented that measure the response rate that would be required to co-schedule graphics rendering with a cooperative kernel while maintaining a smooth frame rate.

Recall that, on launch, the cooperative kernel requests N workgroups. The scheduler should thus aim to provide N workgroups if other constraints allow it, by accepting an offer kill only if a compute unit is required for another task, and responding positively to request fork calls if compute units are available.

5.5 Prototype implementation

Our vision was that cooperative kernel support would be integrated into the runtimes of future GPU implementations of OpenCL, with driver support for the required new primitives. However, to explore cooperative kernels experimentally on current GPUs, a prototype was developed that mocks up the required runtime support via a megakernel, and exploits the occupancy-bound execution model to ensure fair scheduling between workgroups. It is emphasised, though, that an aim of cooperative kernels is to avoid depending on the occupancy-bound model. The prototype exploits this model simply to allow experimentation on current GPUs whose proprietary drivers are not available to modify. The megakernel approach is described in Section 5.5.1, followed by implementation details of various scheduler components in Section 5.5.2.

5.5.1 The megakernel mock up

Instead of multitasking multiple separate kernels, a set of kernels is merged into a megakernel—a single, monolithic kernel. The megakernel is launched with as many workgroups as can be concurrently occupant. One workgroup takes the role of the scheduler,1 and the scheduling logic is embedded as part of the megakernel. The remaining workgroups act as a pool of workers. A workgroup repeatedly queries the scheduler to be assigned a task, which corresponds to executing a cooperative or non-cooperative kernel. In the non-cooperative case, the workgroup executes the relevant kernel function uninterrupted, then awaits further work. In the cooperative case, the workgroup either starts from the kernel entry point or immediately jumps to a designated point within the kernel, depending on whether the workgroup is an initial workgroup of the kernel, or a workgroup forked via request fork. In the latter case, the new workgroup also receives a struct containing the values of all relevant transmit-annotated variables.

1The scheduler requirements given in Section 5.4 are agnostic to whether the scheduling logic takes place on the CPU or the GPU. To avoid expensive communication between the CPU and the GPU, in this work the scheduler was implemented on the GPU.
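Returning to the megakernel structure described above, a minimal sketch of the worker-side loop is shown below; the task encoding, the scheduler_ctx and kernel_args types, and the helper functions are invented here for illustration and do not reflect the exact prototype.

  // Worker-side sketch of the megakernel: workgroup 0 runs the scheduling
  // logic, the remaining workgroups repeatedly ask it for work.
  #define TASK_EXIT        0   // shut the worker down
  #define TASK_NON_COOP    1   // run the non-cooperative kernel to completion
  #define TASK_COOP_START  2   // start the cooperative kernel from its entry
  #define TASK_COOP_FORK   3   // join the cooperative kernel at a fork point

  kernel void megakernel(global scheduler_ctx *ctx, global kernel_args *args) {
    if (get_group_id(0) == 0) {
      run_scheduler(ctx);                    // scheduling logic on the GPU
      return;
    }
    for (;;) {
      int task = wait_for_assignment(ctx, get_group_id(0));
      if (task == TASK_EXIT) {
        return;
      } else if (task == TASK_NON_COOP) {
        non_cooperative_body(args);          // runs uninterrupted
      } else if (task == TASK_COOP_START) {
        cooperative_body(ctx, args, 0);      // start at the kernel entry
      } else {                               // TASK_COOP_FORK
        cooperative_body(ctx, args, 1);      // jump to the request_fork point
      }
    }
  }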

Simplifying assumptions For ease of implementation, the prototype has several limitations. Namely, the prototype:

• supports multitasking a single cooperative kernel with a single non-cooperative kernel, though the non-cooperative kernel can be invoked many times;

• requires that offer kill, request fork and resizing global barrier are called from the entry function of a cooperative kernel. This allows the use of goto and return to direct threads into and out of the kernels;

• requires that all transmit variables are declared in the root scope. This constraint simplifies sharing transmit-annotated variables with newly forked workgroups.

These constraints did not exclude any of the applications considered as benchmarks in this work. A non-mock implementation of cooperative kernels would not use the megakernel approach, so the engineering effort associated with lifting these restrictions in the prototype implementation was not deemed worthwhile.

5.5.2 Scheduler design

To enable multitasking through cooperative kernels, the runtime (in this work, the megakernel) must: (1) track the state of workgroups, i.e. whether a workgroup is waiting or executing a kernel; (2) maintain consistent context states for each kernel, e.g. tracking the number of active workgroups; and (3) provide a safe way for these states to be modified in response to request fork and offer kill. These issues are now discussed. Additionally, the implementation of a more efficient resizing barrier is presented. The presentation discusses how the scheduler would handle arbitrary combinations of kernels, though, as noted above, the current prototype implementation is restricted to the case of two kernels.

Scheduler contexts and resource messages To dynamically manage workgroups executing cooperative kernels, the framework must track the state of each workgroup and provide a channel of communication from the scheduler workgroup to workgroups executing request fork and offer kill. To achieve this, a scheduler context structure is used, mapping a primitive workgroup id to the workgroup's status, which is either available or the id of the kernel that the workgroup is currently executing. The scheduler can then send workgroups executing cooperative kernels a resource message, commanding workgroups to exit at offer kill, or to spawn additional workgroups at request fork. Thus, the scheduler context needs a communication channel for each cooperative kernel.

In the prototype implementation, communication channels are provided using OpenCL 2.0 atomic variables in global memory.
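One possible layout for the scheduler context and its per-kernel channels, assuming the OpenCL 2.0 atomics mentioned above, is sketched below; the field names, sizes and message encoding are illustrative rather than the prototype's actual definitions.

  // Illustrative scheduler context: one status slot per primitive workgroup
  // and one resource-message channel per cooperative kernel.
  #define WG_AVAILABLE     (-1)   // otherwise: id of the kernel being executed
  #define MAX_WORKGROUPS   64
  #define MAX_COOP_KERNELS 4

  typedef struct {
    atomic_int status[MAX_WORKGROUPS];          // WG_AVAILABLE or a kernel id
    // Resource message per cooperative kernel: a negative value asks the
    // kernel to give up that many workgroups at offer_kill; a positive value
    // tells it how many workgroups should join at the next request_fork.
    atomic_int resource_message[MAX_COOP_KERNELS];
  } scheduler_ctx;

  // Example scheduler action: ask cooperative kernel k to relinquish two
  // workgroups; workgroups executing offer_kill observe this and may exit.
  void request_two_workgroups(global scheduler_ctx *ctx, int k) {
    atomic_store(&ctx->resource_message[k], -2);
  }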

Launching kernels and managing workgroups To launch a kernel, the host sends a data packet to the GPU scheduler consisting of a kernel to execute, kernel inputs, and a flag indicating whether the kernel is cooperative. In the prototype implementation, this host-to-device communication channel is provided using OpenCL fine-grained SVM atomic operations (see Section 2.3.2). Upon receiving a data packet describing a kernel launch K, the scheduler must decide how to schedule K. Suppose K requests N workgroups. The scheduler queries the scheduler context. If there are at least N available workgroups, K can be scheduled and commence execution immediately. Suppose instead that there are only Na < N available workgroups, but a cooperative kernel Kc is executing. The scheduler can use Kc's channel in the scheduler context to command Kc to relinquish N − Na workgroups via executions of offer kill. Once N workgroups become available, the scheduler then instructs N of the available workgroups to execute kernel K. If the new kernel K is itself a cooperative kernel, the scheduler would be free to provide K with fewer than N active workgroups initially.

If a cooperative kernel Kc is executing with fewer workgroups than it initially requested and there are available workgroups, then the scheduler may decide to instruct the available workgroups to join the execution of Kc; these workgroups will join the next time a workgroup of Kc executes request fork. To do this, the scheduler asynchronously signals Kc through the communication channel to indicate the number of workgroups that should join at the next request fork command. When a workgroup w of Kc subsequently executes request fork, thread 0 of w updates the kernel and scheduler contexts so that the given number of new workgroups are directed to the program point after the request fork call. This involves searching through the workgroup pool to find available workgroups, as well as copying the values of transmit-annotated variables to the new workgroups.

An efficient resizing barrier In Section 5.4.1, the semantics of a resizing barrier were defined using calls to request fork and offer kill between two global barrier calls. It is possible, however, to implement the resizing barrier using only one call to a global barrier, with request fork and offer kill embedded inside. Such an implementation is now described. The starting global barrier implementation follows the master/slave XF barrier, described in Section 3.5.1. Recall that in this implementation one workgroup (the master) collects signals from the other workgroups (the slaves) indicating that they have arrived at the barrier and that they are waiting for a reply indicating that they may leave the barrier. Once the master has received a signal from all slaves, it replies with a signal to each slave saying that they may leave.

Incorporating request fork and offer kill into such a barrier implementation is straightforward. Upon entering the barrier, the slaves first execute offer kill, possibly exiting. The master then waits for M slaves (the number of active workgroups), which may decrease due to offer kill calls by the slaves, but will not increase. Once the master observes that M slaves have arrived, it knows that all other workgroups are waiting to be released. The master executes request fork, and the statement immediately following this request fork is a conditional that forces newly spawned workgroups to join the slaves in waiting to be released. Finally, the master releases all the slaves: the original slaves and the new slaves that joined at request fork.

While the implementation described above is simple, it is also sub-optimal. Workgroups execute offer kill only once per resizing barrier call and, depending on the order of arrival, it is possible that only one workgroup is killed per resizing barrier call. This prevents the scheduler from gathering workgroups quickly, which may not be sufficient for interactive tasks, e.g. graphics. The gather time, i.e. the time between when the scheduler receives a kernel launch request and when enough workgroups to execute the new kernel are made available, can be reduced by providing a new query function for cooperative kernels, which returns the number of workgroups that the scheduler wants to obtain from the cooperative kernel. A more efficient resizing barrier can then be implemented using this new scheduler hook as follows: (1) the master waits for all slaves to arrive; (2) the master calls request fork and commands the new workgroups to be slaves; (3) the master calls query, obtaining a value W; (4) the master releases the slaves, broadcasting the value W to them; (5) workgroups with ids larger than M − W spin, calling offer kill repeatedly until the scheduler claims them (from the semantics of query it is known that the scheduler will eventually do so). Results show that the barrier using query greatly reduces the gather time in practice (Section 5.7.2).
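
The following C-style pseudocode sketches steps (1) to (5) of the query-based resizing barrier. All names are hypothetical, sequentially consistent atomics are assumed, and the path by which newly forked workgroups re-join as slaves is elided; this is an illustration of the protocol, not the prototype code.

    #include <stdatomic.h>

    #define MAX_GROUPS 64              /* upper bound on the workgroup pool (assumption) */

    /* Hypothetical cooperative-kernel hooks and workgroup queries. */
    extern int  group_id(void);        /* id of the executing workgroup             */
    extern int  active_groups(void);   /* M: current number of active workgroups    */
    extern void request_fork(void);
    extern void offer_kill(void);
    extern int  query(void);           /* workgroups the scheduler wants to reclaim */

    /* Global-memory flags for the master/slave barrier. */
    atomic_int arrived[MAX_GROUPS], released[MAX_GROUPS], wanted;

    void resizing_barrier_with_query(void) {
      int id = group_id();
      int M  = active_groups();
      if (id == 0) {                                 /* master                        */
        for (int w = 1; w < M; w++)                  /* (1) wait for every slave      */
          while (atomic_load(&arrived[w]) == 0) { }
        request_fork();                              /* (2) forked workgroups are
                                                            directed to wait as slaves
                                                            (re-entry path elided)    */
        atomic_store(&wanted, query());              /* (3) obtain W from the scheduler */
        for (int w = 1; w < M; w++) {                /* (4) release the slaves,
                                                            broadcasting W            */
          atomic_store(&arrived[w], 0);
          atomic_store(&released[w], 1);
        }
      } else {                                       /* slave                         */
        atomic_store(&arrived[id], 1);               /* signal arrival                */
        while (atomic_load(&released[id]) == 0) { }  /* wait for the master           */
        atomic_store(&released[id], 0);
        if (id >= M - atomic_load(&wanted))          /* (5) wanted by the scheduler:  */
          for (;;) offer_kill();                     /*     spin on offer kill until
                                                            the scheduler claims it   */
      }
    }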

5.5.3 Alternative semantic choices

The semantics of cooperative kernels have been guided by the applications studied in this work (described in Section 5.6.1). Several cases where different, but also reasonable, semantic decisions could have been made are now discussed.

Killing workgroups in decreasing id order The semantics of offer kill are such that only the active workgroup with the highest id can be killed. This has an appealing property: it means that the ids of active workgroups are contiguous, which is important for efficient strided accessing of data. The cooperative graph traversal algorithm of Figure 5.5 illustrates this: the algorithm is prepared for get_global_size to change after each resizing barrier call, but depends on the fact that get_global_id returns a contiguous range of thread ids. A disadvantage of this decision is that it may provide sub-optimal responsiveness from the point of view of the scheduler. Suppose the scheduler requires an additional compute unit, but the active workgroup with the largest id is processing some computationally intensive work and will take a while to reach offer kill. The chosen semantics mean that the scheduler cannot take advantage of the fact that another active workgroup may invoke offer kill sooner. Cooperative kernels that do not require contiguous thread ids (e.g. the work stealing example of Figure 5.4) might be more suited to a semantics in which workgroups can be killed in any order, but where workgroup ids, and thus thread global ids, are not guaranteed to be contiguous. Although cooperative multitasking is not mentioned there, fairness properties that consider decreasing workgroup ids also appear in the forward progress guarantees of the HSA GPU programming model [HSA17]. This model, and the associated fairness guarantees, are discussed in Chapter 6.

Keeping one workgroup alive The cooperative kernel semantics dictate that the workgroup with id 0 will not be killed if it invokes offer kill. This avoids the possibility of the cooperative kernel terminating early due to the programmer inadvertently allowing all workgroups to be killed, and the decision to keep workgroup 0 alive fits well with the choice to kill workgroups in descending order of id. However, there might be a use case in which a cooperative kernel reaches a point where it would be acceptable for the kernel to exit, although desirable for some remaining computation to be performed if competing workloads allow it. In this case, a semantics where all workgroups can be killed via offer kill would be appropriate, and the programmer would need to guard each offer kill with an id check in cases where killing all workgroups would be unacceptable. For example:

    if (get_group_id(0) != 0)
      offer_kill();

would ensure that at least workgroup 0 is kept alive.

Transmission of partial state from a single thread Recall from the semantics of request fork that newly forked workgroups inherit the values of transmit-annotated variables associated with thread 0 of the forking workgroup. Alternative choices here would be to have forked workgroups inherit values for all variables from the forking workgroup, and to have thread i in the forking workgroup provide the valuation for thread i in each spawned workgroup, rather than having thread 0 transmit the valuation to all new threads. The decision to transmit only selected variables is based on the observation that many of a thread's private variables are dead at the point of issuing request fork or a resizing global barrier, thus it would be wasteful to transmit them. A live variable analysis could instead be employed to over-approximate the variables that might be accessed by newly arriving workgroups, so that these are automatically transmitted. In all cases, it was found that a variable that needed to be transmitted had the property of being uniform across the workgroup. That is, despite each thread having its own copy of the variable, each thread was in agreement on the variable's value. As an example, the level, in_nodes and out_nodes variables used in Figure 5.5 are all stored in thread-private memory, but all threads in a workgroup agree on the values of these variables at each resizing global barrier call. As a result, transmitting thread 0's valuation of the annotated variables is equivalent to (and more efficient than) transmitting values on a thread-by-thread basis. A real-world example where the current semantics would not suffice has not yet been encountered.

5.6 Evaluation applications and GPUs

Here, the experience of porting irregular kernels to cooperative kernels, which were used in the evaluation, is discussed (Section 5.6.1). The GPUs used in this chapter's experiments are then detailed (Section 5.6.2). The section concludes with a description of how non-cooperative workloads that mimic real interactive GPU tasks, specifically graphics rendering, were developed (Section 5.6.3).

5.6.1 Cooperative applications

Table 5.2 gives an overview of the 8 blocking applications ported to cooperative kernels in this work. There are four graph algorithms from the global barrier variants of the Pannotia [CBRS13] applications (see Section 3.6.2). Two of the applications are based on the Lonestar GPU applications of Section 3.6.3. The blocking idiom common to these two sets of applications is the global barrier. The table indicates how many of the original global barriers were changed to resizing barriers, and how many variables need to be annotated with transmit.

Table 5.2: Blocking GPU applications investigated: the number of global barriers converted into resizing barriers, the number of cooperative constructs added, the lines of code (LoC) and the number of inputs.

    App            Suite          Barriers to resizing   offer kills   request forks   transmits   LoC   Inputs
    COLOR          Pannotia       2 / 2                  0             0               4            55   2
    MIS            Pannotia       3 / 3                  0             0               0            71   2
    BC             Pannotia       3 / 6                  0             0               3           150   2
    SSSPpannotia   Pannotia       3 / 3                  0             0               0            42   1
    BFS            Lonestar GPU   2 / 2                  0             0               4           185   2
    SSSPlonestar   Lonestar GPU   2 / 2                  0             0               4           196   2
    OCTREE         work stealing  0 / 0                  1             1               0           213   1
    GAME           work stealing  0 / 0                  1             1               0           308   1

In all cases, all of the global barriers were converted to resizing global barriers, except in the BC application, which contains barriers deeper in the call stack. Recall that this case is not supported by the prototype implementation, although these barriers could in principle be converted. Because there is an SSSP application common to both Pannotia and Lonestar GPU, a subscript is used to distinguish them. The Lonestar GPU version uses a worklist approach, while the Pannotia version uses sparse matrix operations. Thus, while the two applications share the same name, they implement significantly different strategies. Although many applications of Chapter 4 use a global barrier, these applications were not available at the time that the work of this chapter was conducted. However, the Lonestar GPU BFS application is similar to many of the applications of Chapter 4. The remaining two applications are based on work stealing idioms from [CT08]. Originally written in CUDA, a port to OpenCL was required. These two applications required the addition of one request fork and one offer kill at the start of the main loop, and no variables needed to be transmitted, similar to the example discussed in Section 5.4.2. Most graph applications come with two different data sets as input, while the work stealing applications have just one. This leads to 13 application/input pairs in total.

5.6.2 GPUs used in evaluation

The prototype scheduler implementation (Section 5.5) requires two optional features of OpenCL 2.0: SVM fine-grained buffers and SVM atomics. Out of the available GPUs (see Table 2.2), four met the requirements: Hd520, Hd5500, Iris, and R7. However, AMD's Linux drivers for OpenCL suffer from a known defect whereby long-running kernels lead to defunct processes that the OS cannot kill [SD16b]. This issue was observed on R7 using the latest driver, and thus made this chip difficult to experiment on. Additionally, alarming results were observed on this platform when using SVM atomics: modifying a kernel to include a simple CPU/GPU handshake through SVM slowed down kernel execution by an order of magnitude in some cases. This behaviour, combined with the defunct process issue, led us to believe that the support required for cooperative kernels is not yet mature enough within AMD's drivers. Thus the experiments were run on three Intel GPUs: Hd520, Hd5500 and Iris.

5.6.3 Developing non-cooperative kernels

Enabling rendering of smooth graphics in parallel with blocking irregular algorithms is an important use case for cooperative kernels. However, because the prototype implementation is based on a megakernel that takes over the entire GPU (see Section 5.5), the interaction with native graphics rendering cannot be assessed directly. The following method was devised to determine OpenCL workloads that simulate the computational intensity of various graphics rendering workloads. This method makes the following assumption: the system has a GUI driven by the GPU, and insufficient GPU resources will cause the GUI to momentarily glitch.2 First, a synthetic kernel is designed that occupies all compute units of a GPU for a parameterised time period t. This kernel is invoked in an infinite loop by a host application. A maximum value for t is found such that the synthetic kernel executes without having an observable impact on the graphics rendering of the host (i.e. there are no screen glitches). Using the computed value, the application is run for X seconds; the time Y < X dedicated to GPU execution during this period is measured, and the number of kernel launches n that were issued is recorded. A value of X ≥ 10 was used in all experiments. The values (X − Y)/n and X/n estimate the average time spent using the GPU to render the display between kernel calls (call this E) and the period at which the OS requires the GPU for display rendering (call this P), respectively.

2This was observed to be the case on the Intel GPUs experimented with in this chapter.
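
For illustration, a minimal host-side sketch of the arithmetic just described (the helper name and the concrete measurements are hypothetical; the example numbers are chosen to reproduce the medium row of Table 5.3):

    #include <stdio.h>

    /* Given a run of X seconds, of which Y seconds were spent executing the
     * synthetic kernel over n launches, estimate:
     *   E = (X - Y) / n : average GPU time spent rendering between kernel launches
     *   P = X / n       : period at which the OS requires the GPU for rendering */
    static void estimate_rendering_demand(double X, double Y, int n) {
      double E = (X - Y) / n;
      double P = X / n;
      printf("E = %.1f ms, P = %.1f ms\n", E * 1000.0, P * 1000.0);
    }

    int main(void) {
      /* hypothetical measurements: X = 10 s, Y = 9.25 s, n = 250 launches */
      estimate_rendering_demand(10.0, 9.25, 250);   /* prints E = 3.0 ms, P = 40.0 ms */
      return 0;
    }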

Table 5.3: Period and execution time for each rendering task.

    Rendering task   Period P (ms)   Execution time E (ms)
    light            70              3
    medium           40              3
    heavy            40              10

This approach was used to measure the GPU availability required for three rendering tasks:

• light, whereby desktop icons were smoothly emphasised under the mouse pointer;

• medium, whereby window dragging over the desktop was smoothly animated; and

• heavy, which required smooth animation of a WebGL shader in a browser. In this case, the Chrome experiments were used.3

The results for each rendering task are shown in Table 5.3. For medium and heavy, the 40ms period coincides with the human persistence of vision. The 3ms execution duration shared by the light and medium configurations indicates that basic display rendering requires less GPU computation than the more complex rendering of the heavy configuration.

5.7 Evaluation

Here, the results of running cooperative kernels on the prototype implementation are presented. First, the overhead associated with moving to cooperative kernels when multitasking is not required is examined (Section 5.7.1). Then, the responsiveness and throughput of non-cooperative workloads in the presence of cooperative workloads are examined (Section 5.7.2). The section concludes with a comparison of cooperative kernels against a model of kernel-level preemption, which appears to be what current Nvidia GPUs provide (Section 5.7.3).

5.7.1 Overhead of cooperative kernels

Invoking the cooperative scheduling primitives incurs some overhead even if no killing, forking or resizing actually occurs, because the cooperative kernel still needs to interact with the scheduler to determine this. This overhead is assessed by measuring the slowdown in execution time between the original and cooperative versions of a kernel, forcing the scheduler to never modify the number of active workgroups in the cooperative case.

3See https://www.chromeexperiments.com

Table 5.4: Cooperative kernel slowdown without multitasking.

    Chip     Overall           Global barrier apps   Work stealing apps
             geomean   max     geomean   max         geomean   max
    Hd520    1.20      1.75∗   1.18      1.75∗       1.42      1.42
    Hd5500   1.14      1.45†   1.16      1.45†       1.10      1.18
    Iris     1.08      1.37‡   1.07      1.37‡       1.12      1.23

∗ COLOR[eco], † BC[128k], ‡ SSSPlonestar[usa.ny], OCTREE

Recall that the megakernel-based implementation merges the code of a cooperative and a non-cooperative kernel. This can reduce the occupancy for the merged kernel, e.g. due to higher register pressure. This is an artefact of the prototype implementation, and would not be a problem if cooperative primitives were implemented inside the GPU driver. Thus, both the original and cooperative versions of a kernel were launched with the reduced occupancy bound in order to meaningfully compare execution times.

Results Table 5.4 shows the geometric mean and maximum slowdown across all applications and inputs, with averages and maxima computed over 10 runs per benchmark. For the maximum slowdowns, the application and input are indicated. The slowdown is below 1.75× even in the worst case, and closer to 1.2× on average. The best-performing chip is Iris, with an average slowdown of 1.08× and a maximum slowdown of 1.37×. The worst-performing chip is Hd520, with an average slowdown of 1.2× and a maximum slowdown of 1.75×. In all cases, the worst slowdown occurs on a global barrier application, although the application/input combination differs across chips. However, on average, work stealing applications suffer larger slowdowns than barrier applications, except on the Hd5500 chip. These results are encouraging, especially since the performance of the prototype could clearly be improved upon in a native implementation.

5.7.2 Multitasking via cooperative scheduling

The responsiveness of multitasking between a long-running cooperative kernel and a series of short, non-cooperative kernel launches is now assessed. Additionally, the performance impact of multitasking on the cooperative kernel is investigated. For a given cooperative kernel and its input, the kernel was launched and then a non-cooperative kernel was repeatedly scheduled. The non-cooperative kernel aims to simulate the intensity of one of the three classes of graphics rendering workloads discussed in Section 5.6.3.

In these experiments, matrix multiplication was used as the non-cooperative workload, with matrix dimensions tailored to reach the appropriate execution duration. Four different cases for the number of workgroups requested by the non-cooperative kernel were considered: (1) one workgroup; (2) a quarter of the available workgroups; (3) half of the available workgroups; and (4) all-but-one of the available workgroups. For the graph algorithms, both the regular and query resizing global barriers were experimented with. Thus, these experiments span 13 pairs of cooperative kernels and inputs, 3 classes of non-cooperative kernel workloads, 4 quantities of workgroups requested for the non-cooperative kernel, and 2 variations of resizing barriers for the graph algorithms, leading to 288 configurations. Each configuration was run 10 times on each GPU in order to report averaged performance numbers. For each run, the execution time of the cooperative kernel was recorded. For each scheduling of the non-cooperative kernel during the run, the gather time needed by the scheduler to collect workgroups to launch the non-cooperative kernel and the non-cooperative kernel execution time were recorded. For every kernel (cooperative and non-cooperative), the results were sanity-checked by comparing their computed result with expected reference results. To work around compiler crash and compiler hang bugs in Intel's current OpenCL drivers, some semantics-preserving changes were applied to a few of the applications. Driver bugs also meant that results could not be obtained for some configurations where the non-cooperative kernel asks for all-but-one workgroups (namely BC, BFS and SSSPlonestar on all chips, and COLOR on Iris). Furthermore, the GAME application did not run at all on Hd520. These driver bugs led to severe failures such as machine freezes and blue screens. Thus, the results exclude these configurations.
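
For reference, the 288 configurations quoted above decompose as follows, taking from Table 5.2 that 11 of the 13 application/input pairs are graph algorithms (run with both resizing barrier variants) and 2 are work stealing pairs (which use only one variant):

    11 × 3 × 4 × 2 + 2 × 3 × 4 = 264 + 24 = 288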

Responsiveness Figure 5.6 reports, for three configurations, the average gather and execution times for the non-cooperative tasks with respect to the number of workgroups allocated to them. A logarithmic scale is used for time since gather times tend to be much smaller than execution times. The horizontal grey lines indicate the desired period (P) for non-cooperative tasks. These graphs show a representative sample of the results; the full set of graphs for all configurations is provided in the extended version of the paper that this chapter is based on [SED17b]. The top graph (Figure 5.6a) illustrates results from the OCTREE work stealing application on Iris. When the non-cooperative kernel (light, medium or heavy) uses only one workgroup, the execution time is long enough that the non-cooperative task cannot complete within the period required for a screen refresh. The gather time is very low though, since the scheduler needs to collect only one workgroup.

[Plots for Figure 5.6: time (ms, log scale) against the number of non-cooperative workgroups (1, N/4, N/2, N−1), showing the gather time and the light/medium and heavy task execution times; horizontal lines mark the light period P (70 ms) and the medium/heavy period P (40 ms). In (b) and (c) the gather time is shown both with and without the query barrier.]

Figure 5.6: Gather time and non-cooperative task time results for three cases: (a) OCTREE on Iris; (b) MIS[eco] on Hd520; and (c) BFS[rmat22] on Hd5500.

The more workgroups that the non-cooperative task asks for, the faster the task computes: here the non-cooperative kernel becomes fast enough with a quarter (resp. half) of the available workgroups for the light and medium (resp. heavy) graphics workloads. Conversely, the gather time increases, since the scheduler must collect an increasing number of workgroups and thus must wait for the workgroups executing the OCTREE application to encounter an offer kill. The bottom two graphs (Figures 5.6b and 5.6c) show results for graph algorithms on the other two GPUs, Hd520 and Hd5500 respectively. These applications use global barriers, and results are shown using both the regular and query barrier designs described in Section 5.5.2. The execution times for the non-cooperative task were averaged across all runs, including with both types of barrier. The average gather time associated with each type of barrier is shown separately. The graphs show a similar trend to the work stealing graph of Figure 5.6a: as the number of workgroups requested by the non-cooperative kernel increases, the execution time of the non-cooperative tasks decreases and the gather time increases.4 A difference between the two graphs for the barrier applications is that the gather time is higher on the right graph (Figure 5.6c) than on the left graph (Figure 5.6b). As discussed in the previous chapter (Section 4.4.4), the rmat22 input is a high-degree graph, and thus executes resizing barriers infrequently for BFS. The MIS application encounters resizing barriers more frequently on the eco input. This is illustrated in Table 5.5, which shows how many resizing barriers are executed and the execution time on Hd5500 for these two benchmarks. The average frequency of resizing barrier calls can then be computed as the benchmark execution time divided by the number of resizing barriers. The table shows that MIS[eco] executes resizing barriers roughly 10× more frequently than BFS[rmat22]. The scheduler thus has fewer opportunities to collect workgroups, and the gather time increases. Nonetheless, on both global barrier examples, the scheduling responsiveness can benefit from the query barrier: when used, this barrier lets the scheduler collect all needed workgroups as soon as they hit a resizing barrier. As illustrated on both barrier graphs, the gather time of the query barrier is almost stable with respect to the number of workgroups that need to be collected.

Performance Figure 5.7a reports, for Iris, the overhead brought by the scheduling of non-cooperative kernels over the cooperative kernel execution time. This is the slowdown associated with running the cooperative kernel in the presence of multitasking, vs. running the cooperative kernel in isolation (geometric mean over all applications and inputs).

4The right figure has only 3 points on the x-axis: recall that BFS experienced an undiagnosed error with the final configurations, and thus there are no runs with “N-1” non-cooperative workgroups.

Table 5.5: The number of resizing barriers for the barrier applications of Figures 5.6b and 5.6c, along with their execution time (on Hd5500) and average frequency of resizing barrier calls.

    App[input]    Resizing global barriers   Execution time (ms)   Avg. resizing frequency (ms)
    BFS[rmat22]   30                         311                   10.37
    MIS[eco]      3003                       3263                  1.09
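
The average resizing frequencies in Table 5.5 are simply the execution time divided by the number of resizing barriers executed:

    311 ms / 30 ≈ 10.37 ms (BFS[rmat22])        3263 ms / 3003 ≈ 1.09 ms (MIS[eco])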

[Plots for Figure 5.7: (a) geomean slowdown of the cooperative kernel and (b) average period (ms, log scale), each plotted against the number of non-cooperative workgroups (1, N/4, N/2, N−1) for the heavy, medium and light workloads; horizontal lines in (b) mark the light period P (70 ms) and the medium/heavy period P (40 ms).]

Figure 5.7: (a) slowdown of the cooperative kernel when multitasking various non-cooperative workloads, and (b) the period with which non-cooperative kernels are able to execute.

Figure 5.7b shows the period at which non-cooperative kernels can be scheduled (arithmetic mean over all applications and inputs). The results for the other GPUs were similar. Results are partitioned for the three non-cooperative workloads: light, medium and heavy. The two horizontal lines in Figure 5.7b correspond to the period goals of the workloads: the higher (resp. lower) line corresponds to a period of 70ms (resp. 40ms) for the light (resp. medium and heavy) workload. The data includes some outliers that occur with benchmarks in which the resizing barriers are not called very frequently and the graphics task requires half or more of the workgroups. For example, a medium graphics workload for BFS on the rmat22 input has over an 8× overhead when asking for all-but-one of the workgroups. As Figure 5.7a shows, most of the benchmarks were much better behaved than this. The massive overhead of this outlier is likely because resizing barriers were called so infrequently that the cooperative kernel was never able to fork workgroups, and was thus resigned to executing with only one workgroup.

Table 5.6: Application overhead of multitasking three graphics workloads using kernel-level preemption and using cooperative kernels.

    Graphics task   Kernel-level   Cooperative   Resources
    light           1.04           1.04          N/4
    medium          1.08           1.08          N/4
    heavy           1.33           1.39          N/2

Remedies for situations like this might include introducing semantically unnecessary resizing barriers into the cooperative kernel in order to provide the scheduler with more opportunities to fork workgroups. As Figure 5.7a shows, co-scheduling non-cooperative kernels that request a single workgroup leads to almost no overhead, but as Figure 5.7b shows, the period is far too high to meet the needs of any of the three non-cooperative workloads. For example, a heavy workload averages a period of 939ms, while a period of 40ms is required for a smooth graphics frame rate. As more workgroups are dedicated to non-cooperative kernels, they execute quickly enough to be scheduled at the required period. For the light and medium workloads, a quarter of the workgroups executing the non-cooperative kernel is sufficient to meet the goal period (70ms and 40ms resp.). However, a quarter of the workgroups is not sufficient to meet the goal for the heavy workload (giving a mean period of 88ms where 40ms is required). If half of the workgroups are allocated to the non-cooperative kernel, the heavy workload averages a period 10% over its goal (mean of 44ms). If all-but-one are allocated, the heavy workload reaches its goal period. As expected, allocating more non-cooperative workgroups increases the execution time of the cooperative kernel. Still, heavy workloads meet their period by allocating all-but-one non-cooperative workgroups, incurring a slowdown of less than 1.4× on average. Light and medium workloads meet their period with only a negligible overhead to the cooperative kernel (less than a 1.03× slowdown on average). Overall, these experimental findings are encouraging, especially because they provide a lower bound on the potential performance of the cooperative kernel model. Implementing the model natively, with device-specific runtime support, could only improve performance and responsiveness compared with that of the megakernel-based prototype used in these experiments.

5.7.3 Comparison with kernel-level preemption

Nvidia's recent Pascal architecture provides hardware support for instruction-level preemption [NVI16, SA16]. However, this preemption occurs at the kernel granularity, and not at the finer granularity of workgroups, as described in this chapter. Intel GPUs do not provide this feature, and the OpenCL prototype of cooperative kernels cannot run on Nvidia GPUs, making a direct comparison impossible. Because empirical comparisons are not possible, a theoretical analysis of the overheads associated with sharing the GPU between graphics and compute tasks via kernel-level preemption is presented here. Suppose a graphics workload is required to be scheduled with period P and duration D, and that a compute kernel requires time C to execute without interruption. Assuming the cost of preemption is negligible (e.g. Nvidia have reported preemption times of 0.1ms for Pascal [SA16], because of special hardware support), then the overhead associated with switching between compute and graphics every P time steps is P/(P − D). This theoretical kernel-level preemption overhead model is compared, in Table 5.6, with the experimental cooperative kernel results of Figure 5.7 for each graphics workload. The reported cooperative kernel overhead uses the geomean slowdown of the configuration that allowed the deadline of the graphics task to be met. Based on the above assumptions, the cooperative kernel approach provides similar overhead to kernel-level preemption for the light and medium graphics workloads, but has a higher overhead for the heavy workload. The lower performance of cooperative kernels for heavy workloads is because the graphics task requires half of the workgroups, crippling the cooperative kernel enough that request fork calls are not issued as frequently. As mentioned earlier, future work may examine how to insert more resizing calls in these applications to address this. These results suggest that a hybrid preemption scheme may work well. That is, the cooperative approach works well for light and medium tasks; on the other hand, heavy graphics tasks benefit from the coarser grained, kernel-level preemption strategy. However, the kernel-level preemption strategy appears to require specialised hardware.
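
As a numerical check of the kernel-level column of Table 5.6, instantiating P/(P − D) with the periods of Table 5.3 and taking D to be the corresponding execution time E gives:

    light: 70/(70 − 3) ≈ 1.04        medium: 40/(40 − 3) ≈ 1.08        heavy: 40/(40 − 10) ≈ 1.33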

5.8 Related Work

Much work has been done on accelerating irregular computations on GPUs, often relying on the occupancy-bound execution model; references to such work are given in Section 4.7. However, the occupancy-bound execution model is not guaranteed on current or future GPU platforms. As the experiments of this chapter demonstrate, the cooperative kernels model allows blocking algorithms to be upgraded to run in a manner that facilitates responsive multitasking.

GPU multitasking and scheduling Hardware support for preemption has been proposed for Nvidia GPUs, as well as SM-draining, whereby workgroups occupying a compute unit are allowed to complete until the SM becomes free for other tasks [TGC+14]. SM draining is limited in the presence of blocking constructs, since it may not be possible to drain a blocked workgroup. A follow-up work adds the notion of SM flushing, where a workgroup can be re-scheduled from scratch if it has not yet committed side-effects [PPM15]. Both approaches have been evaluated using simulators, over sets of regular GPU kernels. Very recent Nvidia GPUs (i.e. the Pascal architecture) support preemption, though, as discussed in Section 5.2 and Section 5.7.3, it is not clear whether they guarantee fairness or allow tasks to share GPU resources at the workgroup level [NVI16]. A number of works have considered how best to schedule dynamic workloads on GPUs. Among these, the Whippletree approach employs a persistent megakernel to schedule multiple, dynamic tasks in a manner that aims to best utilise GPU resources [SKB+14], and the TimeGraph approach similarly aims to optimise scheduling of competing workloads, in the context of OpenGL [KLRI11]. None of these approaches tackles the problem of fair scheduling on GPUs, thus they do not aid in the safe deployment of blocking irregular algorithms. As mentioned throughout this thesis, CUDA and OpenCL provide the facility for a kernel to spawn further kernels, called dynamic parallelism in CUDA [Nvi18a, app. D] and nested parallelism in OpenCL [Khr15, pp. 32–33]. This can be used to implement a GPU-based scheduler, by having an initial scheduler kernel repeatedly spawn further kernels as required, according to some scheduling policy [MO16]. However, kernels that use dynamic parallelism are still prone to unfair scheduling of workgroups, and thus do not help in deploying traditional blocking algorithms on GPUs.

Cooperative multitasking Cooperative multitasking is used by some current operating systems, especially in the domain of real-time computing. For example, FreeRTOS can be configured with either a preemptive or cooperative scheduling system [Ama17, p. 355]. RISC OS [RIS] currently only provides a cooperative scheduler. Additionally, cooperative multitasking can be implemented in high-level languages for cases in which preemptive multitasking is either too costly or not supported on legacy systems [Tar91]. In the context of GPUs, cooperative multitasking has been used to refer to the standard programming model, where the onus is on the GPU kernel to complete execution within a reasonable time budget [ACKS12, IOOH12]. This is in contrast to the cooperative kernel model, which specifically aims to support the needs of long-running GPU tasks.

5.9 Summary

This chapter has presented cooperative kernels, a small set of GPU programming extensions (summarised in Table 5.1) that allow long-running, blocking kernels to be fairly scheduled and, at the same time, dynamically adjust resource usage to account for external GPU requirements, e.g. multitasking and energy throttling. In particular, cooperative kernels allow global barrier synchronisation, which was shown to enable significant performance improvements for irregular applications in Chapter 4. In fact, Section 4.5.6 shows that the situations where the global barrier is most beneficial in terms of performance are exactly the situations where the resizing global barrier is able to reliably meet multitasking deadlines. These examples have many global barriers and short global barrier intervals, meaning that if the global barriers were changed to resizing global barriers, then the scheduler would frequently have the opportunity to manage resources. Experimental results, using a megakernel-based prototype, show that the model is a good fit for current GPU-accelerated irregular algorithms. Light and medium graphics tasks are able to meet their deadlines with cooperative kernels with no more overhead than a kernel-level preemption approach, which likely requires special hardware support. The performance that could be gained through a native implementation with driver support would likely be even better. Additionally, programming with the cooperative kernel constructs is straightforward for existing blocking kernels. The placement of cooperative constructs could even be automated, e.g. they could be emitted by the optimising compiler of Chapter 4.

Another benefit of cooperative kernels is that they are complementary to preemption strategies. Indeed, if a GPU model had a fair scheduler enabled by preemption, the implementation could simply ignore the cooperative constructs, i.e. choose never to kill or fork workgroups, or it might use the cooperative constructs as hints to where more efficient preemption could occur. Such a hybrid model could overcome the shortcomings of cooperative kernels by guaranteeing deadlines, and also benefit from the strengths of cooperative kernels by efficiently preempting at cooperative hooks when possible.

6 A Formalisation of GPU Fairness Properties

6.1 Personal context

During the experimental campaigns of the previous chapters, several experiences with inter-workgroup synchronisation stood out to me. When executing a global barrier, the occupancy needed to be considered to avoid starvation. However, when executing a mutex, the occupancy did not need to be considered. Furthermore, a GPU scan application from the CUB library [Nvi] had a "waterfall" synchronisation pattern, in which workgroups passed values only to their immediate neighbours; we could not show starvation-freedom for this application using the occupancy-bound execution model. Clearly these idioms had different blocking properties. Looking through the literature, it seemed like there were many classes of non-blocking programs (e.g. lock-free, wait-free, obstruction-free), but blocking programs always seemed to be lumped together without distinction. Given this, I wanted to explore finer-grained classes of blocking programs and the corresponding fairness properties that schedulers need to provide in order to guarantee starvation-freedom. I was unaware of any previous attempt to formalise this area, and this was a glaring blindspot in the reasoning behind the applications we were developing. I was inspired by the formalisation efforts of weak memory models in [AMT14], where a single formal framework could be tweaked to describe the behaviours of different systems, and I hoped to lay some groundwork for something similar for fairness. This chapter presents an effort to bring some formal clarity to the fairness properties of various GPU schedulers.

6.2 Motivation

As discussed throughout this thesis, blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. Schedulers on traditional multi-core CPU systems typically provide fairness guarantees that are strong enough that blocking synchronisation works as expected. However, systems with semi-fair schedulers, e.g. GPUs, are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do run on today's GPUs (e.g. applications of Chapter 4 using the oitergb optimisation), reasoning about liveness properties of such applications is difficult as documentation and examples are scarce and scattered. The aim of this chapter is to clarify the fairness properties of semi-fair schedulers. To do this, a general temporal logic formula is defined, based on weak fairness and parameterised by a thread fairness criterion: a predicate that enables fairness per-thread at certain points of an execution. To investigate the practical implications of the formalisation, the fairness properties of three GPU schedulers are formally defined: OpenCL, HSA, and occupancy-bound execution. Then, existing GPU applications are examined and it is shown that none of these schedulers is strong enough to provide the fairness properties required by these applications. Because these applications execute as expected on current GPUs, it thus appears that existing GPU scheduler descriptions do not entirely capture the fairness properties that are provided on current GPUs. Thus, fairness guarantees for two new schedulers are presented that aim to support existing GPU applications. The behaviour of common blocking idioms under each scheduler is examined and one of the new schedulers is shown to allow a more natural implementation of the discovery protocol of Chapter 3.

6.2.1 Schedulers

The scheduler of a concurrent system is responsible for the placement of virtual threads onto hardware resources. There are often insufficient resources for all threads to execute in parallel, and it is the job of the scheduler to dictate resource sharing, potentially influencing the temporal semantics of concurrent programs. For example, consider a two-threaded program where thread 0 waits for thread 1 to set a flag. If the scheduler never allows thread 1 to execute then the program will hang due to starvation. Thus, to reason about liveness properties, developers must understand the fairness guarantees provided by the scheduler. Current GPU programming models offer a compelling case study for scheduler semantics for three reasons: (1) some blocking idioms are known to hang due to starvation on current GPUs, e.g. as shown in Chapter 3, a global barrier will hang if executed with more workgroups than the occupancy bound; (2) other blocking idioms, e.g. mutexes, run without starvation on current GPUs; yet (3) documentation for some GPU programming models explicitly states that no guarantees are provided, while others state only minimal guarantees that are insufficient to ensure starvation-freedom even for mutexes. Because GPU schedulers are largely embedded in closed-source proprietary frameworks (e.g. drivers), this chapter does not consider concrete scheduling logic, but instead aims to derive formal fairness guarantees from prose documentation and empirical behaviours.

GPU threads in the same workgroup can synchronise efficiently via the OpenCL barrier instruction. Yet, despite practical use cases, there are no such intrinsics for inter-workgroup synchronisation. Instead, inter-workgroup synchronisation is achieved by building constructs, e.g. mutexes, using finer-grained primitives, such as atomic read-modify-write instructions (RMWs). However, reasoning about such constructs is difficult, as inter-workgroup thread interactions are relatively unstudied, especially in relation to fairness. Thus, this chapter considers inter-workgroup interactions exclusively, and the threads considered are assumed to be in disjoint workgroups. Under this constraint, it is cumbersome to use the word workgroup, and doing so distracts from the theoretical aims of this work. Because a workgroup can be thought of as being a "composite thread", henceforth the word thread is used to mean workgroup in this chapter.

The unfair OpenCL scheduler and non-blocking programs As noted in Chapter 3, OpenCL disallows all blocking synchronisation due to scheduling concerns, stating [Khr15, p. 31]: "A conforming implementation may choose to serialize the [threads] so a correct algorithm cannot assume that [threads] will execute in parallel. There is no safe and portable way to synchronize across the independent execution of [threads] . . . ." Such weak guarantees are acceptable for many popular GPU programs, such as matrix multiplication, as they are non-blocking. That is, these programs can terminate without starvation under an unfair scheduler, i.e. a scheduler that provides no fairness properties.

Blocking synchronisation and fair schedulers On the other hand, there are many useful blocking synchronisation idioms, which require fairness properties from the scheduler to ensure starvation-freedom. Three common examples of blocking idioms considered throughout this work, barrier, mutex and producer-consumer (PC), are described in Table 6.1. Intuitively, a fair scheduler provides the guarantee that any thread that is able to execute will eventually execute. Fair schedulers are able to guarantee starvation-freedom for the idioms of Table 6.1.

Table 6.1: Blocking synchronisation idioms considered in this work.

    barrier: Aligns the execution of all participating threads: a thread waits at the barrier until all threads have reached the barrier. Blocking, as a thread waiting at the barrier relies on the other threads to make enough progress to also reach the barrier.

    mutex: Provides mutual exclusion for a critical section. A thread acquires the mutex before executing the critical section, ensuring exclusive access. Upon leaving, the mutex is released. Blocking, as a thread waiting to acquire relies on the thread in the critical section to eventually release the mutex.

    producer-consumer (PC): Provides a handshake between threads. A producer thread prepares some data and then sets a flag. A consumer thread waits until the flag value is observed and then reads the data. Blocking, as the consumer thread relies on the producer thread to eventually set the flag.

Table 6.2: Blocking synchronisation idioms guaranteed starvation-freedom under various schedulers.

    Idiom     Fair   HSA       OBE                 Unfair (e.g. OpenCL)
    barrier   yes    no        occupancy-limited   no
    mutex     yes    no        yes                 no
    PC        yes    one-way   occupancy-limited   no

6.2.2 Semi-fair schedulers

Two schedulers have been described so far: fair and unfair, under which starvation-freedom for blocking idioms is either always or never guaranteed, respectively. However, some GPU programming models have semi-fair schedulers, under which starvation-freedom is guaranteed for only some blocking idioms. Two such schedulers are now described and an informal analysis of the idioms of Table 6.1 under these schedulers is given (summarised in Table 6.2). If starvation-freedom is guaranteed for all threads executing idiom i under scheduler s then it is said that i is allowed under s.

Heterogeneous system architecture (HSA) Similar to OpenCL, Heterogeneous System Architecture (HSA) is a parallel programming model designed to efficiently target GPUs [HSA17]. Unlike OpenCL, however, the HSA scheduler conditionally allows blocking between threads based on thread ids. Thread B can block thread A if: "[thread] A comes after B in [thread] flattened id order" [HSA17, p. 46]. Under this scheduler: a barrier is not allowed, as all threads wait on all other threads regardless of id; a mutex is not allowed, as the ids of threads are not considered when acquiring or releasing the mutex; and PC is conditionally allowed, namely if the producer has a lower id than the consumer.
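
As an illustration of the conditionally allowed PC case, the following minimal C-style sketch (hypothetical names; sequentially consistent atomics are assumed, as per the program constraints later stated in Section 6.3) shows the one-way handshake in which the producer has a lower id than the consumer:

    #include <stdatomic.h>

    atomic_int flag = 0;
    int data;

    /* Producer: assumed to run as the thread (workgroup) with the LOWER id. */
    void producer(void) {
      data = 42;               /* prepare the data       */
      atomic_store(&flag, 1);  /* then publish the flag  */
    }

    /* Consumer: assumed to run as the thread with the HIGHER id, so blocking
     * on the producer is permitted under HSA's id-ordered guarantee. */
    void consumer(void) {
      while (atomic_load(&flag) == 0) { }  /* busy-wait for the producer */
      int observed = data;                 /* read the published data    */
      (void)observed;
    }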

Occupancy bound execution (OBE) As presented in Chapter 3, occupancy-bound execution (OBE) is a pragmatic GPU execution model that aims to capture the guarantees that current GPUs have been shown experimentally to provide. While OBE is not officially supported, a substantial number of GPU programs (including the applications of Chapter 4 using the oitergb optimisation and the applications of Chapter 3) depend on the guarantees provided by this execution model. Recall that the OBE scheduler guarantees fairness among the threads that are currently occupant (i.e., are actively executing) on the GPU hardware resources. The fairness properties, given in Section 3.3, are: "A [thread that has executed at least one instruction] is guaranteed to eventually be scheduled for further execution on the GPU." Under this scheduler: a barrier is not allowed, as all threads wait on all other threads regardless of whether they have been scheduled previously; a mutex is allowed, as a thread that has previously acquired a mutex will be fairly scheduled such that it eventually releases the mutex; and PC is not allowed, as there is no guarantee that the producer will be scheduled fairly with respect to the consumer.

While general barrier and PC idioms are not allowed under OBE, Chapter 3 shows that constrained variants of these idioms are allowed by using the occupancy discovery protocol. Recall that the protocol, as described in Section 3.4, works by identifying a subset of threads that have been observed to take an execution step, i.e. it discovers a set of co-occupant threads. Barrier and PC idioms are then able to synchronise threads that have been discovered; thus OBE allows occupancy-limited variants of these idioms. It is worth noting that the two variants of PC shown in Table 6.2 (occupancy-limited and one-way) are incomparable. That is, one-way is not occupancy-limited, as the OBE scheduler makes no guarantees about threads with lower ids being scheduled before threads with higher ids. Similarly, occupancy-limited is not one-way, as the OBE scheduler allows bi-directional PC synchronisation if both threads have been observed to be co-occupant.

Remark (CUDA). Like OpenCL, CUDA gives no scheduling guarantees, stating [Nvi18a, p. 11]: "[Threads] are required to execute independently: It must be possible to execute them in any order, in parallel or in series." Still, some CUDA programs rely on OBE or HSA guarantees (see Chapter 4). The recent version 9 of CUDA introduces cooperative groups [Nvi18a, app. C], which provide primitive barriers between programmer-specified threads. Because only primitive barriers are provided, cooperative groups are not considered in this chapter. Indeed, the aim of this work is to reason about fine-grained fairness guarantees, as required by general blocking synchronisation.

6.2.3 Chapter contributions

The results of Table 6.2 raise the following points, which this chapter aims to address:

1. The temporal correctness of common blocking idioms varies under different GPU schedulers; however, no formal scheduler descriptions are known that are able to validate these observations.

2. Two GPU models, HSA and OBE, have schedulers that are incomparable. However, for each scheduler, there exist programs that rely on its scheduling guarantees. Thus, neither of these schedulers captures all of the guarantees observed on today’s GPUs.

To address (1), a formula is provided, based on weak fairness, for describing the fairness guarantees of semi-fair schedulers. This formula is parameterised by a thread fairness criterion (TFC), a predicate over a thread and the program state, that can be tuned to provide a desired degree of fairness. This formula is evaluated by defining thread fairness criteria for the two existing semi-fair schedulers: HSA and OBE (Section 6.5). To address (2), the claim that existing programs rely on HSA and OBE is substantiated by examining existing blocking GPU programs that run on current GPUs. It is shown that there exist programs that rely on HSA guarantees, as well as programs that rely on OBE guarantees (Section 6.6). That is, neither the HSA nor the OBE schedulers entirely capture guarantees on which existing GPUs applications rely. To remedy this, the fairness properties of two new schedulers are presented, defined by their respective thread fairness criterion. The two new schedulers are: HSA+OBE, a simple combination of HSA and OBE, and LOBE (linear OBE), an intuitive strengthening of OBE based on contiguous thread ids. Both provide the guarantees required by current

[Figure 6.1 content: each scheduler is drawn as a box giving its thread fairness criterion (TFC) over a thread t and the idiom(s) it allows. From strongest to weakest fairness guarantees: Weak Fairness (TFC: True); LOBE (TFC: ∃t′ ∈ T : ex(t′) ∧ t′ ≥ t); HSA+OBE (TFC: TFC_HSA ∨ TFC_OBE); HSA (TFC: ¬∃t′ ∈ T : (t′ < t) ∧ en(t′)) and OBE (TFC: ex(t)), which are placed side by side as they are incomparable; and Unfair (TFC: False). The idioms annotated in the figure include the general barrier, the discovery barrier, one-way PC, the mutex and the CUDA reduction (Examples 6.3, 6.4 and 6.6; Sections 6.6 and 6.7).]

Figure 6.1: The semi-fair schedulers defined in this work, from strongest to weakest.

Next, an optimisation to the occupancy discovery protocol of Chapter 3 is shown, which exploits guarantees exclusive to LOBE (Section 6.7.1). The schedulers and their properties discussed in this chapter are summarised in Figure 6.1. Each scheduler is given a box in the figure. The first line in the box shows the scheduler's TFC (over a thread t), followed by the idiom(s) allowed under the scheduler and where in the chapter the idiom is analysed. Because the schedulers are ordered by strength, any idiom to the right of a scheduler is allowed under that scheduler and any idiom to the left is disallowed. HSA and OBE are aligned vertically as they are incomparable. To summarise, the chapter contributions are as follows:

• A formalisation of semi-fair schedulers using a temporal logic formula, and the use of this definition to describe the HSA and OBE GPU schedulers (Section 6.5).

• An analysis of blocking GPU applications showing that no existing GPU scheduler definition is strong enough to describe the guarantees required by all such programs (Section 6.6).

• Two new semi-fair schedulers that meet the requirements of current blocking GPU programs are presented: HSA+OBE and LOBE (Section 6.7). The guarantees of the LOBE scheduler are shown to provide a more natural implementation of the discovery protocol (Section 6.7.1).

Related work on programming models and GPU schedulers was discussed in Section 6.2.1, while applications that depend on specific schedulers are surveyed in Section 6.6.

Related publications The material presented in this chapter is based on material originally published in the 29th International Conference on Concurrency Theory (CONCUR'18) [SED18].

6.3 GPU program assumptions

The formal semantics in this chapter assume these program constraints, which are common to GPU applications:

1. Termination: programs are expected to terminate under a fair scheduler. GPU programs generally terminate, and in fact, they get killed by the OS if they execute for too long [SD16b].

2. Static thread count: while dynamic thread creation has recently become available, e.g. through OpenCL nested parallelism [Khr15, pp. 32–33], this work addresses only static parallelism. The survey of programs in Section 6.6 did not reveal any programs that used nested parallelism.

3. Deterministic threads: the scheduler is the only source of nondeterminism; the computation performed by a thread depends only on the program input and the order in which threads interleave. This is the case for all GPU programs examined.

4. Enabled threads: all threads are enabled, i.e. able to be executed, at the beginning of the program and do not cease to be enabled until they terminate. While some systems contain scheduler-aware intrinsics, e.g. condition variables [Bar], GPU programming models do not. As a result, the idioms of Table 6.1 are implemented using atomic operations and busy-waiting, which do not change whether a thread is enabled or not.

5. Sequential consistency: while GPUs have relaxed memory models (e.g. see [SD16a]), this work aims to understand scheduling under the interleaving model as a first step.

6.4 Formal program reasoning

A sequential program is a sequence of instructions and its behaviour can be reasoned about by step-wise execution of instructions. This work does not provide instruction-level semantics, but examples can be found in the literature (e.g. for GPUs, see [BCD+15]). A concurrent program is the parallel composition of n sequential programs, for some n > 1. The set T = {0, 1, . . . , n − 1} provides a unique id for each thread, often called the tid. The behaviour of a concurrent program is defined by all possible interleavings of atomic (i.e. indivisible) instructions executed by the threads. Let A be the set of available atomic instructions.

[Figure 6.2 (a): program code]

    void thread0(mutex m) {
      // acquire
      while (!CAS(m, 0, 1));
      // release
      store(m, 0);
    }

    void thread1(mutex m) {
      // acquire
      while (!CAS(m, 0, 1));
      // release
      store(m, 0);
    }

[Figure 6.2 (b): the corresponding LTS, with states numbered 0 to 7 and state 0 initial; each transition is labelled with the id of the executing thread and its atomic action, where CAS is annotated with its return value (T or F).]

Figure 6.2: Two-threaded mutex idiom: (a) program code and (b) corresponding LTS.

For example, Figure 6.2a shows two sequential programs, thread0 (with tid of 0) and thread1 (with tid of 1), which both have access to a shared mutex object. The set A of atomic instructions is {CAS(m,old,new), store(m,v)}, whose semantics are as follows:

• CAS(m,old,new): atomically checks whether the value of m is equal to old. If so, updates the value to new and returns true. Otherwise returns false.

• store(m,v): atomically stores the value of v to m.

Using these two instructions, Figure 6.2a implements a simple mutex idiom, in which each thread loops trying to acquire a mutex (via the CAS instruction), and then immediately releases the mutex (via the store instruction). While other mutex implementations exist, e.g. see [HS08, ch. 7.2], the blocking behaviour shown in Figure 6.2a is idiomatic to mutexes.
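
For concreteness, the CAS/store idiom of Figure 6.2a can be rendered using standard C11 atomics as follows. This is a sketch only: thread-creation boilerplate is omitted and, as elsewhere in this chapter, sequentially consistent atomics are assumed.

    #include <stdatomic.h>

    atomic_int m = 0;   /* the mutex: 0 = free, 1 = held */

    /* One thread's acquire/release, mirroring thread0/thread1 of Figure 6.2a. */
    void acquire_then_release(void) {
      int expected = 0;
      /* acquire: CAS(m, 0, 1), retried until the mutex is observed to be free */
      while (!atomic_compare_exchange_strong(&m, &expected, 1))
        expected = 0;   /* on failure, CAS writes the observed value into `expected` */
      /* (critical section would go here) */
      /* release: store(m, 0) */
      atomic_store(&m, 0);
    }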

Labelled transition systems To reason about concurrent programs, a labelled transition system (LTS) is used. Formally, an LTS is a 4-tuple (S, I, L, →), where

• S is a finite set of states, with I ⊆ S the set of initial states. A state contains values for all program variables and a program counter for each thread.

• L ⊆ T × A is a set of labels. A label is a pair (t, a) consisting of a thread id t ∈ T and an atomic instruction a ∈ A.

• → ⊆ S × L × S is a transition relation. For convenience, the relation (p, (t, a), q) ∈ → is sometimes abbreviated to p −→t,a q, or even p −→t q if a is not relevant to the discussion. Leaving out the a element of the tuple is not ambiguous as only per-thread deterministic programs are considered. Dot notation is used to refer to members of the tuple; e.g., α.t is written to refer to the thread id component of α for an α ∈ →.

Given a concurrent program, the LTS can be constructed iteratively. A start state s is created with the program's initial values. For each thread t ∈ T, the next instruction a ∈ A is executed to create a state s′ to explore. L is updated to include (t, a), and (s, (t, a), s′) is added to →. This process iterates until there are no more states to explore. For example, the LTS for the program of Figure 6.2a is shown in Figure 6.2b. For ease of presentation, the program values in each state are omitted; labels show the thread id followed by the atomic action. If the action has a return value (e.g. CAS), that value is shown following the action. In the example figure, T and F indicate return values of true and false, respectively.

For a thread id t ∈ T and state p ∈ S, t is said to be enabled in p if there exists a state q ∈ S such that p −→t q. This is captured by the predicate en(p, t). Next, p is a terminal state if no thread is enabled in p; that is, ¬en(p, t) holds for all t ∈ T. This is captured by the predicate terminal(p). Intuitively, en(p, t) states that it is possible for thread t to take a step at state p, and terminal(p) states that all threads have completed execution at state p. The program constraints of Section 6.3 ensure that all threads are enabled until their termination.

Program executions and temporal logic The executions E of a concurrent program are all possible paths through its LTS. Formally, a path z ∈ E is a (possibly infinite) sequence of transitions α0α1 . . . , with each αi ∈ →, where αi.p and αi.q refer to the p and q elements of the relation in →. An execution path is constrained such that: the path starts in an initial state, i.e. α0.p ∈ I; adjacent transitions are connected, i.e. αi.q = αi+1.p; and if the path is finite, with n transitions, then it leads to a terminal state, i.e. terminal(αn−1.q).

Remark (Infinite paths). Because the programs in this work are assumed to terminate under fair scheduling (Section 6.3), infinite paths are discarded by the fair scheduler, but not necessarily by semi-fair schedulers. Indeed, these infinite paths allow semi-fair schedulers to be distinguished.

Given a path z and a transition αi in z, pre(αi, z) is used to denote the transitions up to, and including, i of z, that is, α0α1 . . . αi. Similarly, post(αi, z) is used to denote the (potentially infinite) transitions of z from αi, that is, αiαi+1 . . . . For convenience, en(α, t) is used to denote en(α.p, t) and terminal(α) is used to denote terminal(α.q). Finally, a new predicate is defined: ex(α, t′), which holds if and only if t′ = α.t. Intuitively, ex(α, t′) indicates that thread t′ executes the transition α. The notion of executions E over an LTS allows reasoning about liveness properties of programs. However, the full LTS may yield paths that realistic schedulers would exclude, as illustrated in Example 6.1. Thus, fairness properties, provided by the scheduler, are modelled as a filter over the paths in E.

Example 6.1 (Mutex without fairness). The two-threaded mutex LTS given in Figure 6.2b shows that it is possible for a thread to loop indefinitely waiting to acquire the mutex if the other thread is in the critical section, as seen in states 1 or 2. Developers with experience writing concurrent programs for traditional CPUs know that on most systems, these non-terminating paths do not occur in practice!

Fairness filters and liveness properties can be expressed using temporal logic. For a path z and a transition α in z, temporal logic allows reasoning over post(α, z) and pre(α, z), i.e. reasoning about future and past behaviours. Following the classic definitions of fairness, linear time temporal logic (LTL) is used in this work (see, e.g. [BK08, ch. 5] for an in-depth treatment of LTL). For ease of presentation, a less common operator, ⧫, from past-time temporal logic (which has the same expressiveness as LTL [LPZ85]) is used. Temporal operators are evaluated with respect to z (a path) and α (a transition) in z. They take a formula φ, which is either another temporal formula or a transition predicate, ranging over α.p, α.t, or α.q (e.g. terminal). The three temporal operators used in this work are:

• The global operator □, which states that φ must hold for all α′ ∈ post(α, z).

• The future operator ♦, which states that φ must hold for at least one α′ ∈ post(α, z).

• The past operator ⧫ (read "once"), which states that φ must hold for at least one α′ ∈ pre(α, z).

To show that a liveness property f holds for a program with executions E, it is sufficient to show that f holds for all pairs (z, α) such that z ∈ E and α is the first transition in z. For example, one important liveness property is eventual termination: ♦terminal.

Applying this formula to the LTS of Figure 6.2b, a counter-example (i.e. a path that does not terminate) is easily found: 0 −→1 2, (2 −→0 2)ω. In this path, thread 0 loops indefinitely trying to acquire the mutex. Infinite paths are expressed using ω-regular expressions [BK08, ch. 4.3]. However, many systems are able to reliably execute programs that use mutexes in a manner similar to Figure 6.2a. Such systems have fair schedulers, which do not allow the infinite looping paths described above. A fairness guarantee provided by a scheduler is expressed as a temporal predicate on paths and is used to filter out problematic paths before a liveness property, e.g. eventual termination, is considered. In this work, weak fairness [BK08, p. 258] is considered, which is typically expressed as:

∀t ∈ T : ♦□en(t) =⇒ □♦ex(t) (6.1)

Intuitively, weak fairness states that if a thread is continuously able to execute, then it will eventually execute. There is also a notion of strong fairness that is useful in more complicated interactions, e.g. if per-thread actions are not deterministic. However, the program assumptions in this work do not allow programs containing such interactions.

Remark (Weak fairness and program assumptions). The formula for weak fairness given in Equation 6.1 is the definition as traditionally given in the literature. However, the program assumptions given in Section 6.3 make the initial ♦ operator unnecessary. That is, the left-hand side of the implication in the original definition is satisfied when thread t is eventually enabled and globally remains enabled. The program assumptions state that all threads are initially enabled and remain enabled until termination. Thus, the ♦ predicate is trivially satisfied by every thread at the beginning of the execution. As a result, a simpler and equivalent form of weak fairness could be considered in this work:

∀t ∈ T : □en(t) =⇒ □♦ex(t) (6.2)

Because the two definitions are equivalent under the program assumptions, this chapter will continue to reference the traditional definition due to its prevalence in the literature.

Example 6.2 (Mutex with weak fairness). Having defined fairness, the question of termination for the LTS of Figure 6.2b can now be revisited. If the scheduler provides weak fairness, then any paths that do not satisfy the weak fairness definition can be discarded. The two problematic paths are: 0 −→1 2, (2 −→0 2)ω and 0 −→0 1, (1 −→1 1)ω. Neither path satisfies weak fairness: in both cases the thread that can break the cycle is always enabled, yet it is not eventually executed once the infinite cycle begins. Thus, if executed on a system that provides weak fairness, the program of Figure 6.2a is guaranteed to eventually terminate.
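To make the filtering step concrete, an infinite execution can be represented as a lasso: a finite stem followed by a cycle that repeats forever. The Python sketch below is an illustration only, with the state numbers and labels of Figure 6.2b assumed; it checks Equation 6.1 on such a lasso and, applied to the first problematic path of Example 6.2, reports the path as unfair, so a weakly fair scheduler discards it. The second problematic path is discarded symmetrically.

    def weakly_fair(stem, cycle, threads, en):
        """Check weak fairness (Equation 6.1) on the lasso path stem + cycle^omega.
        Each transition is a triple (source_state, (thread, action), target_state);
        en(state, t) says whether thread t can take a step in `state`.
        The finite stem cannot affect an 'eventually always' or 'infinitely often'
        property, so only the cycle is inspected."""
        for t in threads:
            eventually_always_enabled = all(en(src, t) for (src, _, _) in cycle)
            infinitely_often_executed = any(lbl[0] == t for (_, lbl, _) in cycle)
            if eventually_always_enabled and not infinitely_often_executed:
                return False   # t stays enabled but is starved: the path violates weak fairness
        return True

    # In the mutex program no thread terminates along the problematic paths, so en is simply true.
    always_enabled = lambda state, t: True

    # Path 0 -(thread 1)-> 2, then thread 0 spins on CAS forever at state 2.
    stem  = [(0, (1, "CAS:T"), 2)]
    cycle = [(2, (0, "CAS:F"), 2)]
    print(weakly_fair(stem, cycle, threads=[0, 1], en=always_enabled))   # False: the path is discarded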

6.5 Formalising semi-fairness

The formalism for reasoning about fairness properties for semi-fair schedulers is now presented. Semi-fairness is parameterised by a per-thread predicate called the thread fairness criterion, or TFC. Intuitively, the TFC states a condition which, if satisfied by a thread t, guarantees fair execution for t. Formally, an execution is semi-fair with respect to a TFC if the following holds:

∀t ∈ T : ♦□(en(t) ∧ TFC(t)) =⇒ □♦ex(t) (6.3)

The formula is similar to weak fairness (Equation 6.1), but in order for a thread t to be guaranteed eventual execution, not only must t be enabled, but the TFC for t must also hold. Semi-fairness for different schedulers, e.g. HSA and OBE, can be instantiated by using different TFCs, which in turn will yield different liveness properties for programs under these schedulers, e.g. as shown in Table 6.2. The weaker the TFC is, the stronger the fairness condition is. Semi-fairness with the weakest TFC, i.e. true, yields classic weak fairness. Conversely, semi-fairness with the strongest TFC, i.e. false, yields no fairness. Formalising a specific notion of semi-fairness now simply requires a TFC. This is illustrated by defining TFCs to describe the semi-fair guarantees provided by the OBE and HSA GPU schedulers, introduced informally in Section 6.2.1.
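On lasso-shaped paths, the parameterised condition of Equation 6.3 can be checked in the same style as weak fairness. The Python sketch below is illustrative only; it assumes that the TFC is either monotone in the execution history or depends only on the current state (true of all TFCs considered in this chapter), so that unrolling the cycle twice suffices to decide the "eventually always" antecedent.

    def semi_fair(stem, cycle, threads, en, tfc):
        """Check the semi-fairness condition (Equation 6.3) on the lasso path stem + cycle^omega.
        en(state, t) is thread t's enabledness; tfc(history, state, t) is the thread fairness
        criterion, where history is the list of transitions taken so far (up to and including
        the current one, matching pre)."""
        unrolled = stem + cycle + cycle
        start = len(stem) + len(cycle)      # inspect the second unrolled copy of the cycle
        for t in threads:
            antecedent = all(
                en(src, t) and tfc(unrolled[:i + 1], src, t)
                for i, (src, _, _) in enumerate(unrolled) if i >= start
            )
            infinitely_often_executed = any(lbl[0] == t for (_, lbl, _) in cycle)
            if antecedent and not infinitely_often_executed:
                return False   # t satisfies its TFC while enabled, yet starves
        return True

    # The weakest TFC (always true) recovers weak fairness; the strongest (always false) gives no fairness.
    en = lambda state, t: True
    stem, cycle = [(0, (0, "CAS:T"), 1)], [(1, (1, "CAS:F"), 1)]   # thread 1 spins while thread 0 holds the mutex
    print(semi_fair(stem, cycle, [0, 1], en, lambda h, s, t: True))    # False: discarded under weak fairness
    print(semi_fair(stem, cycle, [0, 1], en, lambda h, s, t: False))   # True: permitted by a totally unfair scheduler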

Formalising OBE semi-fairness The prose definition for the OBE scheduler fits this framework nicely, as it describes the per-thread condition for fair scheduling: once a thread has been scheduled (i.e. executed an instruction), it will continue to be fairly scheduled. This is straightforward to encode in a TFC using the temporal logic operator ⧫ (see Section 6.4), which holds for a given predicate if that predicate has held at least once in the past. Thus the TFC for the OBE scheduler can be stated formally as follows:

TFC OBE (t) = ⧫ex(t) (6.4)

Formalising HSA semi-fairness A TFC for the HSA scheduler is less straightforward because the prose documentation is given in terms of relative allowed blocking behaviours, rather than in terms of thread-level fairness. Recall the definition from Section 6.2.1: thread B can block thread A if: “[thread] A comes after B in [thread] flattened id order” [HSA17, p. 46]. Searching the documentation further, another snippet phrased closer to a TFC is found, stating [HSA17, p. 28]: “[Thread] i + j might start after [thread] i finishes, so it is not valid for a [thread] to wait on an instruction performed by a later [thread].” It is assumed here that j refers to any positive integer. Because these prose documentation snippets do not discuss fairness explicitly, it is difficult to directly extract a TFC. Thus, a best-effort attempt is made following this reasoning: (1) if thread i is fairly scheduled, no thread with id greater than i is guaranteed to be fairly scheduled; and (2) threads that are not enabled (i.e. they have terminated) have no need to be fairly scheduled. Using these two points, a TFC can be derived for HSA stating: a thread is guaranteed to be fairly scheduled if there does not exist another thread that has a lower id and is enabled. Formally:

TFC HSA(t) = ¬∃t′ ∈ T : (t′ < t) ∧ en(t′) (6.5)

Although this TFC is somewhat removed from the prose snippets in the HSA documentation, this formal definition has value in enabling precise discussions about fairness. For example, confidence in this definition can be provided by illustrating that the idioms informally analysed in Section 6.2.1 behave as expected; see Examples 6.3, 6.4 and 6.6.
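Phrased over lasso histories as in the earlier sketches, the two criteria become simple predicates. The following Python sketch is illustrative; the two-thread configuration and the en predicate are assumptions, and the printed values anticipate the analysis of Example 6.3 below.

    THREADS = [0, 1]

    def tfc_obe(history, state, t):
        """TFC OBE (Equation 6.4): thread t has executed at least once in the past."""
        return any(lbl[0] == t for (_, lbl, _) in history)

    def make_tfc_hsa(en):
        """TFC HSA (Equation 6.5): no enabled thread has a lower id than t."""
        def tfc_hsa(history, state, t):
            return not any(u < t and en(state, u) for u in THREADS)
        return tfc_hsa

    # State 2 of Figure 6.2b: thread 1 holds the mutex and thread 0 spins; both remain enabled.
    en = lambda state, t: True
    history = [(0, (1, "CAS:T"), 2), (2, (0, "CAS:F"), 2)]
    print(tfc_obe(history, 2, 1))            # True:  OBE guarantees thread 1 eventual execution
    print(make_tfc_hsa(en)(history, 2, 1))   # False: thread 0 has a lower id and is enabled, so HSA does not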

Example 6.3 (Mutex with semi-fairness). Here, the mutex LTS of Figure 6.2b is analysed under OBE and HSA semi-fairness guarantees. Recall that the two problematic paths (causing starvation) are: 0 −→0 1, (1 −→1 1)ω and 0 −→1 2, (2 −→0 2)ω.

• OBE: In both problematic paths, one thread t acquires the mutex, and the other thread t′ spins indefinitely. However, thread t has executed an instruction (acquiring the mutex) and is thus guaranteed eventual execution under OBE; the problematic paths violate this guarantee as thread t never executes again after it acquires the mutex. Therefore both paths are discarded, guaranteeing starvation-freedom for mutexes under OBE.

• HSA: The second problematic path: 0 −→1 2, (2 −→0 2)ω, cannot be discarded as thread 0 waits for thread 1 to release. Thread 1 does not have the lowest id of the enabled threads, thus there is no guarantee of eventual execution. Therefore starvation-freedom for mutexes cannot be guaranteed under HSA.

(a) Program code:

    1  void thread0(int x0, int x1) {
    2    // produce to tid 1
    3    store(x0,1);
    4    // consume from tid 1
    5    while(load(x1) != 1);
    6  }
    7  void thread1(int x0, int x1) {
    8    // produce to tid 0
    9    store(x1,1);
    10   // consume from tid 0
    11   while(load(x0) != 1);
    12 }

(b) [LTS diagram not reproduced in this text rendering: states are numbered 0-6 and edges are labelled with the executing thread id, its atomic action, and the value returned by load. Lines 4, 5, 8 and 9 of the code, and states 0, 1, 5 and 6 of the LTS, are the parts shown in gray.]

Figure 6.3: Two-threaded PC idiom (a) program code and (b) corresponding LTS. Omitting (a) lines in gray and (b) states and transitions in gray and dashed lines yields the one-way variant of this idiom.

Example 6.4 (PC with semi-fairness). Figure 6.3 illustrates a two-threaded producer-consumer (PC) program. A new atomic instruction, load, is used, which simply reads a value from memory (the return value is given on the LTS edges). Thread 0 produces a value via x0 and then spins, waiting to consume a value via x1. Thread 1 is similar, but with the variables swapped. A subset of this program, omitting lines 4, 5, 8, and 9, shows the one-way PC idiom, where threads only consume from threads with lower ids, i.e., only thread 1 consumes from thread 0. The LTS for the one-way variant omits states 0, 1, 5, and 6, and the start state changes to state 2. There are two problematic paths for the general variant, in which one of the threads spins indefinitely waiting for the other thread to produce a value: 0 −→0 1, (1 −→0 1)ω, and 0 −→1 2, (2 −→1 2)ω. For the one-way variant, there is one problematic path: (2 −→1 2)ω. This program is now analysed under OBE and HSA semi-fairness.

• OBE: Consider the problematic path 0 −→1 2, (2 −→1 2)ω. Because thread 0 has not executed an instruction, OBE does not guarantee eventual execution for thread 0 and thus this path cannot be discarded. Similar reasoning shows that the problematic path for the one-way variant cannot be discarded either. Thus, neither the general nor the one-way producer-consumer idiom is allowed under OBE.

• HSA: Consider the problematic path 0 −→0 1, (1 −→0 1)ω. Because thread 1 does not have the lowest id of the enabled threads, HSA does not guarantee eventual execution for thread 1 and this path cannot be discarded. However, consider the problematic path for the one-way variant: (2 −→1 2)ω. Because thread 0 has the lowest id of the enabled threads, HSA guarantees thread 0 will eventually execute, thus causing this path to be invalid. Therefore, general PC is not allowed under HSA, but one-way PC, following increasing order of thread ids, is allowed.

6.6 Inter-workgroup synchronisation in the wild

The scheduling guarantees assumed by existing GPU applications are now examined. This provides a basis for understanding (1) what scheduling guarantees are actually provided by existing GPUs, as these applications run without issues on current devices, and (2) the utility of schedulers, i.e. whether their fairness guarantees are exploited by current applications. The exploration of GPU applications that use inter-workgroup synchronisation was performed on a best-effort basis by searching through popular works in this domain. Candidate programs were manually examined, searching for the idioms in Table 6.1, and relating them to the corresponding scheduler under which they are allowed.

OBE programs The most prevalent examples of applications that require OBE guarantees use the occupancy-limited global barrier. Such applications were studied in detail in Chapter 4. Earlier examples, such as the Xiao and Feng global barrier [XF10], which Chapter 3 references heavily, use a priori occupancy knowledge for scheduling guarantees. The Xiao and Feng work has been cited many times, mostly by works describing applications that are accelerated using the global barrier. Also, as mentioned throughout this thesis, the persistent thread model [GSO12] describes other use cases of OBE guarantees, including work stealing. A concrete work-stealing application was originally presented in [CT08] and used as an example in Chapter 5. Thus, the OBE scheduler guarantees appear to be well-tested and useful on current GPUs.

HSA programs Only four applications that use the one-way PC idiom were found: two scan implementations, a sparse triangular solve (SpTRSV) application, and a sparse matrix vector multiplication (SpMV) application. While there are few applications in this category, it is argued that they are important, as they appear in vendor-endorsed libraries. The two scan applications, one found in the popular Nvidia CUB GPU library [Nvi] and the second presented in [YLZ13], use a straightforward one-way PC idiom. Both scans work by computing workgroup-local scans on independent chunks of a large array. Threads compute chunks according to their thread id, e.g. thread 0 processes the first chunk. A thread t then passes its local sum to its immediate neighbour, thread t+1, who spins while waiting for this value. The neighbour factors in this sum and then passes an updated value to its neighbour, and so forth. A 2014 study on the performance of different GPU scan implementations found the CUB implementation to be the most efficient [Mer15].
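To illustrate the chained, one-way PC structure described above, the following Python sketch runs a decoupled scan on a CPU using Python threads (whose interpreter provides a fair scheduler). It is not the CUB or [YLZ13] implementation; the chunk size, the spin with a yielding sleep, and the carry array are all choices made for this sketch.

    import threading, time
    from itertools import accumulate

    data = list(range(1, 17))          # toy input
    CHUNK = 4
    nchunks = len(data) // CHUNK
    carry = [None] * nchunks           # carry[t] = inclusive running sum up to the end of chunk t
    out = [0] * len(data)

    def workgroup(t):
        lo = t * CHUNK
        local = list(accumulate(data[lo:lo + CHUNK]))    # workgroup-local inclusive scan
        if t == 0:
            prefix = 0
        else:
            while carry[t - 1] is None:                  # one-way PC: consume only from thread t-1
                time.sleep(0)                            # spin; fair scheduling guarantees the producer runs
            prefix = carry[t - 1]
        carry[t] = prefix + local[-1]                    # produce for thread t+1
        out[lo:lo + CHUNK] = [prefix + v for v in local]

    workers = [threading.Thread(target=workgroup, args=(t,)) for t in range(nchunks)]
    for w in workers: w.start()
    for w in workers: w.join()
    assert out == list(accumulate(data))                 # the chained partial sums give the full scan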

The SpMV application, presented in [DG15], has several workgroups cooperate to calculate the result for a single row. Before any cooperation, the result must first be initialised, which is performed by the workgroup with the lowest id out of the cooperating workgroups. The other workgroups spin, waiting for the initialisation. This algorithm is implemented in the clSPARSE library [clS], a joint project between AMD and Vratis. The SpTRSV application, presented in [LLH+16], allows multiple producers to accumulate data to send to a consumer. However, in the triangular solver system, all producers will have lower ids than their corresponding consumers. Thus the PC idiom remains one-way.

OpenCL programs Applications that contain non-trivial inter-workgroup synchronisation and are non-blocking were also searched for. These applications would be guaranteed starvation-freedom under any scheduler, including the unfair OpenCL scheduler. The criterion used for non-trivial synchronisation is: inter-workgroup interactions that cannot be achieved by a single atomic read-modify-write (RMW) instruction. While examples of non-blocking data structures were found (e.g. in the work-stealing applications of [Hwu11, ch. 35]), the top-level loop was blocking, as threads without work waited on other threads to complete work. Interestingly, we found only one application that appeared to be globally non-blocking: a reduction application in the CUDA SDK [Nvi18b], called threadFenceReduction, in which the final workgroup to finish local computations also does a final reduction over all other local computations.

6.7 Unified GPU semi-fairness

The exploration of applications in Section 6.6 shows that there are current applications that rely on either HSA or OBE guarantees. Because these applications run without starvation on current GPUs, it appears that current GPUs provide stronger fairness guarantees than either HSA or OBE describes. In this section, two new semi-fairness guarantees are proposed, which unify the HSA and OBE guarantees. As such, these new guarantees potentially provide a more accurate description of current GPU schedulers.

HSA+OBE semi-fairness A straightforward approach to creating a unified fairness property from two existing semi-fair properties is to take the disjunction of the TFCs. Thus, threads guaranteed fairness under either existing scheduler are guaranteed fairness under the unified scheduler. This can be done with the HSA and OBE semi-fair schedulers to create a new unified semi-fairness condition, called HSA+OBE, i.e.,

TFC HSA+OBE (t) = TFC HSA(t) ∨ TFC OBE (t) (6.6)

Thinking about the set of programs for which a scheduler guarantees starvation-freedom, let PHSA be the set of programs allowed under HSA, with POBE and PHSA+OBE defined similarly. Note that PHSA ∪ POBE ⊂ PHSA+OBE; that is, there are programs in PHSA+OBE that are neither in PHSA nor POBE. For example, consider a program that uses one-way PC synchronisation and also a mutex. This program is not allowed under the OBE or HSA scheduler in isolation, but is allowed under the semi-fair scheduler defined as their disjunction. However, this idiom combination seems contrived, as only the four applications discussed in Section 6.6 exploit the one-way PC idiom and it is not obvious that a mutex would be useful in these applications.

LOBE semi-fairness The HSA+OBE fairness guarantees may be sufficient for reasoning about existing applications, but these guarantees do not seem like they would naturally be provided by a system scheduler implementation. HSA+OBE guarantees fairness to (1) the thread with the lowest id that has not terminated (thanks to HSA) and (2) threads that have taken an execution step (thanks to OBE). For example, it might allow relative fair scheduling only between threads 0, 23, 29, and 42, if they were scheduled at least once in the past. Thus, HSA+OBE allows for “gaps”, where threads with relative fairness do not have contiguous ids. Perhaps a more intuitive scheduler would guarantee that threads with relative fairness have contiguous ids. Given these intuitions, a new semi-fair guarantee is defined, called LOBE (linear occupancy-bound execution). Similar to OBE, LOBE guarantees fair scheduling to any thread that has taken a step. Additionally, LOBE guarantees fair scheduling to any thread t if another thread t′ (1) has taken a step, and (2) has an id greater than or equal to that of t (hence the word linear). Formally, the LOBE TFC can be written:

TFC LOBE (t) = ∃t′ ∈ T : ⧫ex(t′) ∧ t′ ≥ t (6.7)

It is now shown that LOBE is a unified scheduler, i.e. any program allowed under HSA or OBE is allowed under LOBE. To do this, it is sufficient to show that TFC OBE =⇒ TFC LOBE and TFC HSA =⇒ TFC LOBE. First, consider TFC OBE =⇒ TFC LOBE: this is trivial, as the comparison in TFC LOBE includes equality, thus any thread that has taken a step is guaranteed to be fairly scheduled.

Considering now TFC HSA =⇒ TFC LOBE: first recall a property of executions from Section 6.4, namely that an execution either ends in a state where all threads have terminated, or it is infinite. Thus, at an arbitrary non-terminal point in an execution, some thread t must take a step. If t has the lowest id of the enabled threads, then both LOBE and HSA guarantee that t will be fairly executed. If t does not have the lowest id of the enabled threads, then LOBE guarantees that all threads with lower ids than t will be fairly executed, including the thread with the lowest id of the enabled threads, thus satisfying the fairness constraint of HSA.
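The unified criteria can be phrased as history predicates in the same illustrative style as before (restated here so the snippet is self-contained; the three-thread configuration, the en predicate, and the placeholder state and action names are assumptions of the sketch). The sample configuration mirrors the gap exploited in Section 6.7.1: thread 2 has taken a step, threads 0 and 1 have not, and all three remain enabled.

    THREADS = [0, 1, 2]

    def tfc_obe(history, state, t):
        return any(lbl[0] == t for (_, lbl, _) in history)              # Equation 6.4

    def tfc_hsa(history, state, t, en):
        return not any(u < t and en(state, u) for u in THREADS)         # Equation 6.5

    def tfc_hsa_obe(history, state, t, en):
        return tfc_hsa(history, state, t, en) or tfc_obe(history, state, t)   # Equation 6.6

    def tfc_lobe(history, state, t):
        return any(lbl[0] >= t for (_, lbl, _) in history)              # Equation 6.7

    en = lambda state, t: True
    history = [("DP", (2, "poll"), "s")]   # "DP", "s" and the action name "poll" are placeholders
    for t in THREADS:
        print(t, tfc_hsa_obe(history, "s", t, en), tfc_lobe(history, "s", t))
    # thread 0: True  True  (lowest enabled id; also covered by thread 2's step under LOBE)
    # thread 1: False True  (only LOBE guarantees thread 1 fair execution)
    # thread 2: True  True  (has taken a step)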

6.7.1 LOBE discovery protocol

Because TFC HSA+OBE is defined as the disjunction of TFC HSA and TFC OBE, the reasoning in Section 6.7 is sufficient to show that LOBE fairness guarantees are at least as strong as HSA+OBE. A practical GPU program is now discussed for which correctness relies on the stronger guarantees provided by LOBE compared to HSA+OBE. This example shows that (1) LOBE guarantees are strictly stronger than HSA+OBE, and (2) fairness guarantees exclusive to LOBE can be useful in GPU applications.

The example is a modified version of the discovery protocol from Chapter 3, which dynamically discovers threads that are guaranteed to be co-occupant, and are thus guaranteed relative fairness by OBE. Recall that the protocol works using a virtual poll, in which threads have a short time window to indicate, using shared memory, that they are co-occupant. The protocol acts as a filter: discovered co-occupant threads execute a program, and undiscovered threads exit without performing any meaningful computation. Because only co-occupant threads execute the program, OBE guarantees that blocking idioms such as barriers can be used reliably. GPU programs are often data-parallel, and threads use their ids to efficiently partition arrays; thus having contiguous ids is vital. Because OBE fairness does not consider thread ids, in order to provide contiguous ids, the discovery protocol dynamically assigns new ids to discovered threads (see Table 3.1 in Section 3.4). While functionally this approach is sound, there are two immediate drawbacks: (1) programs must be written using new thread ids, which can require intrusive changes, and (2) the native thread id assignment on GPUs may be optimised by the driver for memory accesses in data-parallel programs; using new ids would forego these optimisations. Exploiting the scheduling guarantees of LOBE, the discovery protocol can be modified to preserve native thread ids while also ensuring contiguous ids.

Algorithm 6.1 Occupancy discovery protocol. Applying the LOBE optimisation removes the code in dashed boxes and adds the code in solid boxes.

    1:  function Discovery protocol(open, count, id map, m)    [dashed box: the id map parameter]
    2:    Lock(m)
    3:    if open ∨ (tid < count) then                         [solid box: the ∨ (tid < count) disjunct]
    4:      id map[tid] ← count                                [dashed box]
    5:      count ← count + 1                                  [dashed box]
    6:      count ← max(count, tid + 1)                        [solid box]
    7:      Unlock(m)
    8:    else
    9:      Unlock(m)
    10:     return False
    11:   Lock(m)
    12:   if open then
    13:     open ← False
    14:   Unlock(m)
    15:   return True

Example 6.5 (Thread ids and data locality). It is possible that the protocol discovers four threads (with tids 2-5) and creates the following mapping for their new dynamic ids: {(5 −→ 0), (2 −→ 1), (3 −→ 2), (4 −→ 3)}. The GPU runtime might have natively assigned threads 2 and 3 to one processor (or a GPU compute unit) and threads 4 and 5 to another. Because these compute units often have caches, data-locality between threads on the same compute unit could offer performance benefits [WCL+15]. In data-parallel programs, there is often data-locality between threads with consecutive ids. Thus, in our example mapping, the (native) threads 2 and 5 could not exploit data locality, as their new ids are consecutive, but their native ids are not.

The discovery protocol algorithm, given in Figure 3.3, is reproduced in Algorithm 6.1. The changes made to exploit LOBE guarantees are indicated by dashed boxes for removed code and solid boxes for added code (marked [dashed box] and [solid box] in the rendering above). Here, a quick reiteration of the original discovery protocol algorithm is given. The algorithm has two phases, both protected by the same mutex m. The first phase is the polling phase (lines 2-10), where threads are able to indicate that they are currently occupant (i.e. executing). The open shared variable is initialised to true to indicate that the poll is open. A thread first checks whether the poll is open (line 3). If so, then the thread marks itself as discovered; this involves obtaining a new id (line 4) and incrementing the number of discovered threads through the shared variable count (line 5). The thread can then continue to the closing phase (starting at line 11). If the poll was not open, the thread indicates that it was not discovered by returning false (lines 8-10). In the closing phase, a thread checks to see if the poll is open; if so, the thread closes the poll and no other threads can be discovered from this point (lines 12-13). All threads that enter the closing phase have been discovered to be co-occupant, thus they return true (line 15). The total number of co-occupant threads will be stored in count.

This protocol can be optimised by exploiting the fairness guarantees of LOBE. In particular, because LOBE guarantees that threads are fairly scheduled in contiguous id order, the protocol can allow a thread with a higher id to discover all threads with lower ids. As a result, threads are able to keep their native ids, although the number of discovered threads is still dynamic. The optimisation to the discovery protocol is simple: first, the id map, which originally mapped threads to their new dynamic ids, is not needed (lines 1 and 4). Next, the number of discovered threads is no longer based on how many threads were observed to poll, but rather on the highest id of the discovered threads (line 6). Finally, even if the poll is closed, a thread entering the poll may have been discovered by a thread with a higher id; this is now reflected by each thread comparing its id with count (line 3). In Example 6.6, we show that a barrier prefaced by the LOBE-optimised protocol is not allowed under HSA+OBE guarantees, and thus illustrate that LOBE fairness guarantees are strictly stronger than HSA+OBE.
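The following Python sketch simulates the LOBE-optimised protocol on a CPU (the thread count, the lock, and the arrival order are arbitrary choices of the sketch, and it is not GPU code); it checks that discovered threads keep their native ids and form the contiguous range 0..count-1.

    import random, threading

    N_THREADS = 8
    lock = threading.Lock()
    state = {"open": True, "count": 0}
    discovered = [None] * N_THREADS

    def discovery_protocol(tid):
        """LOBE-optimised Algorithm 6.1: no id re-mapping; a thread is also discovered
        if a higher-id thread polled before the poll closed (tid < count)."""
        with lock:                                          # polling phase (lines 2-10)
            if state["open"] or tid < state["count"]:
                state["count"] = max(state["count"], tid + 1)
            else:
                discovered[tid] = False
                return
        with lock:                                          # closing phase (lines 11-15)
            if state["open"]:
                state["open"] = False
        discovered[tid] = True

    workers = [threading.Thread(target=discovery_protocol, args=(t,)) for t in range(N_THREADS)]
    random.shuffle(workers)                                 # arrival order is arbitrary
    for w in workers: w.start()
    for w in workers: w.join()

    print(state["count"], [t for t in range(N_THREADS) if discovered[t]])
    assert all(discovered[t] for t in range(state["count"]))   # contiguous native ids 0..count-1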

[Figure 6.4 diagram not reproduced in this text rendering: an optional discovery protocol step (DP) leads to state 0; the threads then arrive in the order 0, 2, 1, through states 1, 2 and 3; thread 0 can spin-wait at state 1, threads 0 and 2 can spin-wait at state 2, and after the final arrival all threads can leave the barrier.]

Figure 6.4: Sub-LTS of a barrier, with an optional discovery protocol preamble.

Example 6.6 (Barriers under semi-fairness). The behaviour of barriers, with optional discovery protocols, is now analysed under our semi-fair schedulers. Figure 6.4 shows a subset of an LTS for a barrier idiom that synchronises three threads with ids 0, 1, and 2. For the sake of clarity, instead of using atomic actions that correspond to GPU instructions, abstract instructions arrive and wait are used, which correspond to a thread marking its arrival at the barrier and waiting at the barrier, respectively. The sub-LTS shows one possible interleaving of threads arriving at the barrier, in the order 0, 2, 1. The final thread to arrive (thread 1) allows all threads to leave. The sub-LTS shows the various spin-waiting scenarios that can occur in a barrier at states 1 and 2. A discovery protocol can optionally be used before the barrier synchronisation. The sub-LTS is first analysed using the LOBE-optimised discovery protocol (Section 6.7.1). Recall that under the LOBE-optimised protocol a thread is discovered if a step has been observed from a thread with an equal or greater id. In this example with three threads, the least the protocol is guaranteed to have observed is a step by thread 2, denoted: DP −→2 0.

• HSA+OBE: Consider the starvation path: DP −→2 0, 0 −→0 1, 1 −→2 2, (2 −→0 2, 2 −→2 2)ω. This path cannot be disallowed by HSA+OBE as, at state 2, HSA+OBE guarantees fair scheduling only for the thread with the lowest id (thread 0) and any threads that have taken a step (threads 0 and 2). This path requires fair execution from thread 1 to break the starvation loop. Thus, barrier synchronisation using the LOBE discovery protocol is not allowed under HSA+OBE or any of the weaker schedulers (OBE and HSA).

• LOBE: The above starvation path is disallowed by LOBE, as LOBE guarantees fair execution for any thread t that has executed and any thread with a lower id than t. Because at state 0, the LOBE discovery protocol has observed a step from thread 2, fairness is guaranteed for threads 2, 1, and 0. Thus, barriers with LOBE discovery protocol are allowed under LOBE.

Example 6.6 (cont.). Now, the general barrier (i.e. with no discovery protocol) is analysed:

• LOBE: The starvation path 0 −→0 1, (1 −→0 1)ω is not disallowed by LOBE, as LOBE cannot guarantee fair execution for any thread other than thread 0 at state 1 where the infinite starvation path begins. Thus, general barriers are not allowed under LOBE. Because LOBE is stronger than HSA+OBE, HSA and OBE, the general barrier is not allowed under these schedulers either.

Finally, the barrier using the original discovery protocol (as described in Section 6.7.1) is analysed:

• HSA: The starvation path DP −→0,1,2 0, 0 −→0 1, (1 −→0 1)ω is not disallowed by HSA, as HSA only guarantees fair execution to the lowest enabled thread (i.e. thread 0). To break this starvation loop in the sub-LTS, thread 2 would need fairness guarantees. Thus barriers using the original discovery protocol are not allowed under HSA.

• OBE: Because the original discovery protocol guarantees all threads have taken a step before the barrier execution (i.e. DP −→0,1,2 0), OBE guarantees all three threads fair scheduling. Thus all starvation loops in the sub-LTS are guaranteed to be broken, and the barrier using the original discovery protocol is allowed under OBE. Because HSA+OBE and LOBE are stronger than OBE, this synchronisation idiom is also allowed under those schedulers.

6.8 Summary

Current GPU programming models provide loose scheduling fairness guarantees in English prose, if any at all. In practice, Chapter 3 has shown that GPUs feature semi-fair schedulers that are neither fair nor totally unfair. The goal of this chapter is to clarify the fairness guarantees that GPU programmers can rely on, or at least the ones they assume. To this aim, a formal framework was introduced that combines classic weak fairness with a thread fairness criterion (TFC), enabling fairness to be specified at a per-thread level. The framework is illustrated by defining the TFC for HSA (from its specification) and OBE (from its description in Chapter 3). The scheduling guarantees are used to analyse three classic concurrent programming idioms: barrier, mutex and producer-consumer. While some existing GPU programs rely on either HSA or OBE guarantees, these two models are not comparable. Specifically, the one-way producer-consumer idiom is allowed under HSA but may starve under OBE. Conversely, the occupancy-limited producer-consumer idiom is allowed under OBE but may starve under HSA. Thus, GPUs that aim to support the examined current GPU programs must support stronger guarantees that neither HSA nor OBE entirely captures. The presented framework allows for the straightforward combination of scheduling guarantees, which is used to define the HSA+OBE scheduler. Another unified scheduler, LOBE, is also defined, which offers slightly stronger fairness guarantees than HSA+OBE, but corresponds to a more intuitive scheduler implementation. The guarantees unique to LOBE are shown to be useful through a GPU protocol optimisation for which other GPU semi-fair schedulers do not guarantee starvation-freedom, but LOBE does. The formal framework, examples, and application survey presented in this chapter lay a foundation that we hope GPU developers and vendors can use to work toward a clearer understanding of scheduling guarantees on GPU accelerators.

7 Conclusion

The aim of this thesis was to provide an exploration of the GPU inter-workgroup barrier, and in particular to investigate the behaviour of such a barrier under various GPU semantics and situations where significant performance improvements might be enabled by such a barrier. This thesis concludes with a summary of the contributions and a discussion of future work.

7.1 Contributions

This thesis has made the following original contributions in the exploration of inter-workgroup barrier synchronisation for GPUs:

• Chapter 3 introduced the occupancy-bound execution (OBE) model: an abstraction that captures forward progress guarantees empirically observed on many current GPUs. Using these guarantees, a discovery protocol was presented, which safely estimates a set of co-occupant workgroups; such workgroups are guaranteed relative forward progress. The discovery protocol, built on top of the execution model, allowed the implementation of a portable inter-workgroup barrier. Experimental results showed that, using heuristics, the discovery protocol was nearly always able to identify the maximum number of workgroups that can be co-occupant on the GPU. This scheme was successfully demonstrated on six GPUs spanning four vendors, showing, for the first time, a portable inter-workgroup barrier.

• Chapter 4 presented a set of GPU optimisations for irregular applications generalised to portable OpenCL. One of the optimisations required a portable global barrier, for which the scheme of Chapter 3 was used. This generalisation enabled a large empirical study, spanning 6 GPUs, 17 applications, and 3 inputs. All combinations of GPUs, applications and inputs were run under all available optimisation combinations. A methodology for evaluating performance trade-offs of portability versus specialisation was developed and used to evaluate various optimisation strategies in this domain. While the observed result trends were intuitive (more specialisation yields more performance), the methodology quantifies the trade-offs and offers insights into characteristics of GPUs, applications, or inputs. For example, GPUs that have a high kernel launch overhead can greatly benefit from using the optimisation enabled by the portable inter-workgroup barrier.

• Chapter 5 introduced the cooperative kernels programming model, developed in response to industry feedback suggesting that OBE progress guarantees may not hold for future GPUs; the main concern being that multi-tasking might create situations where occupant workgroups are unfairly preempted. The cooperative kernel model consists of three new primitives added to the OpenCL programming model: (1) an instruction where a workgroup volunteers to be killed if its hardware resources are needed; (2) an instruction where a workgroup requests that additional workgroups be forked if resources are available; and (3) an inter-workgroup barrier at which workgroups can be killed or forked. These three instructions, via their interactions with a workgroup scheduler, can allow multi-tasking while also avoiding the high cost of workgroup preemption. Several long-running compute kernels that use work stealing and inter-workgroup barriers were shown to be easily adapted to the cooperative kernel programming model. A prototype scheduler was implemented in OpenCL and it was shown that the long-running compute kernels could be multi-tasked with three intensities of interactive tasks while the interactive tasks met their deadlines.

• Chapter 6 formalised a hierarchy of progress guarantees using a temporal logic formula based on weak fairness. The formula is parameterised by a thread fairness criterion (TFC), which determines fairness on a per-thread basis. A TFC, and thus progress guarantees, was formalised for each of the workgroup schedulers of the OpenCL, OBE, and HSA GPU programming models. Three common blocking idioms (mutex, barrier, and producer-consumer) were analysed under each model; each was shown to either deadlock due to starvation or successfully terminate. Under this analysis, it was shown that OBE and HSA are incomparable, yet there exist programs, executing successfully on current GPUs, that rely on one or the other model. Thus, current GPUs appear to offer guarantees stronger than any existing description. To address this, two unified models were presented. The guarantees of one of the unified models, linear OBE (LOBE), were shown to enable an optimisation of the discovery protocol of Chapter 3.

7.2 Future work

Directions for future work are now discussed. Section 7.2.1 begins by presenting immediate avenues for future work, in which the thesis contributions could be extended in a straightforward way. Section 7.2.2 discusses the higher-level research context exposed in this work and ideas about future work that address the fundamental concepts that motivated this thesis.

7.2.1 Immediate future work

Utilising open stacks In this thesis, the OpenCL execution stack was treated as a black box; only the OpenCL API and kernel language were used. However, recently both AMD and Intel have provided open GPU stacks, including the OpenCL framework (e.g. compilers and drivers). Specifically, AMD has released ROCm1 and Intel has released NEO.2 Both of these open stacks provide lower-level controls to modify the OpenCL runtime. For example, it may be possible to query the occupancy bound for a given GPU kernel and natively provide a maximal launch [GSO12] option. Under this scheme, the program does not specify the number of workgroups, but instead the runtime provides as many workgroups as can be simultaneously occupant. Such a scheme would remove the need for the discovery protocol, and thus the associated application performance overheads reported in Sections 3.6.3 and 4.6.2. However, without official progress guarantees, such an effort would be relying on undocumented features and, as discussed in Chapter 5, such guarantees may not be supported on future GPUs.

1 See: https://rocm.github.io/
2 See: https://software.intel.com/en-us/forums/opencl/topic/758168

Prescriptive performance models The work of Chapter 4 used descriptive models to explain the empirical results. However, such models require a large amount of empirical data to produce. The more common use case for an optimising compiler is to be able to determine effective optimisations without having to run a large number of experiments. This use case would require prescriptive performance models, in which a compiler would use static, or a small amount of dynamic, information to confidently pick an effective optimisation configuration for an application, GPU, and possibly input. Section 4.5.6 starts to lay the foundation for such models, by teasing out characteristics of applications, GPUs, and inputs for which certain optimisations appear to be effective. In particular, for the inter-workgroup barrier optimisation, the effective situations actually appear to be fairly straightforward. The simple micro-benchmark of Section 4.5.6 can measure the kernel launch overhead: the more overhead, the more effective the global barrier appears to be. Additionally, applications and inputs that cause a large number of short kernel launches also appear to be good candidates for inter-workgroup barrier optimisations. Such a prescriptive model could have been used in Section 3.6.2 to quickly determine that these tests were not suitable for barrier optimisations, yet many of the tests of Chapter 4 were suitable.
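As an indication of what such a micro-benchmark might look like (a sketch only, assuming the pyopencl package and an OpenCL device are available; the empty-kernel name, work sizes, and iteration count are arbitrary choices, and this is not the benchmark of Section 4.5.6), per-launch overhead can be estimated by timing many launches of a kernel that does no work:

    import time
    import pyopencl as cl

    SOURCE = "__kernel void noop() { }"

    ctx = cl.create_some_context()            # may prompt for a platform/device
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, SOURCE).build()

    prg.noop(queue, (256,), (64,))            # warm up: exclude first-launch and compilation costs
    queue.finish()

    launches = 1000
    start = time.perf_counter()
    for _ in range(launches):
        prg.noop(queue, (256,), (64,))        # enqueue an empty kernel
        queue.finish()                        # wait, so the host observes the full launch latency
    elapsed = time.perf_counter() - start
    print("mean launch overhead: %.1f us" % (1e6 * elapsed / launches))

A GPU for which the reported overhead is high would, by the reasoning above, be a promising target for the barrier optimisation.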

Complete progress models The work of Chapter 6 was constrained to inter-workgroup progress guarantees. However, semi-fair scheduling is found at other levels of the GPU hierarchy. For example, it is well known that an intra-subgroup spin-lock will hang due to SIMD thread scheduling on pre-Volta Nvidia GPUs;3 pilot experiments on other GPUs examined in this work show similar behaviour. However, intra-subgroup mutexes can be implemented if there is no blocking behaviour in subgroup-divergent blocks. Thus, a mutex at which all subgroup threads spin appears to work as intended on current GPUs. In fact, several applications of Chapter 4 use such a mutex without issue. Interestingly though, the mutex implementation must use volatile qualifiers, otherwise several of the compilers transform the convergent spin-lock into a divergent spin-lock and the application hangs. These experiences show that much formalisation is needed in this area. In addition to intra-subgroup interactions, intra-workgroup (and inter-subgroup) interactions should also be considered. Much like inter-workgroup scheduling, the OpenCL standard gives no scheduling guarantees at the intra-workgroup level of the GPU hierarchy, stating [Khr15, p. 31]:

The work-items within a single work-group execute concurrently but not necessarily in parallel (i.e. they are not guaranteed to make independent forward progress).

These intra-workgroup scheduling guarantees were not explored in Chapter 6 as we were unaware of any motivating examples. Many programs use the OpenCL intra-workgroup barrier primitive instruction, which appears to mostly capture the needed blocking synchronisation at this level of the hierarchy.

3 See: https://stackoverflow.com/questions/2021019/implementing-a-critical-section-in-cuda

Thus, a more complete scheduling model of GPUs should consider interactions at different levels of the hierarchy, much like how prior work on GPU memory models introduced new scopes and provided different semantics depending on the scope of the interaction [ABD+15]. Ideally, each scope could be instantiated with a non-scoped progress model, e.g. the models presented in Chapter 6. Such a framework would be modular and provide a more complete picture of GPU scheduling.

High impact use case Although the applications studied throughout this thesis, e.g. BFS and SSSP, provide interesting challenges for GPU acceleration, it is not clear whether they have impact in high-priority domains where GPUs are vital for high performance, such as machine learning or computer vision. The lack of a high impact use case may be one reason why the GPU language standards have not pursued official support of an inter-workgroup barrier. To encourage development of a portable, impactful use case, one approach would be to provide a well-documented and approachable library of the optimised applications of Chapter 4. For example, Nvidia provides the nvGraph CUDA library [Nvi18d], which includes an API for SSSP and PR routines. Because Nvidia has dedicated resources to this library, it is likely that developers are using these routines. Thus, a portable OpenCL equivalent library may be appealing. More concretely, previous work on Deep Recurrent Neural Networks (RNNs) has shown that a global barrier optimisation can improve performance by nearly an order of magnitude on Nvidia GPUs [DSC+16]. This approach not only avoids the kernel launch overhead (as seen in Section 4.5.6), but also takes advantage of fast persistent local memory across barrier calls. None of the applications studied in this work exploit persistent local memory; there was not a clear path in the algorithms to use such a feature. The barrier-optimised RNN implementation appears to have been incorporated into CuDNN through the function cudnnCreatePersistentRNNPlan [Nvi18c, p. 68]. RNNs are used in state-of-the-art speech recognition systems,4 and a portable optimised OpenCL version might enable a more diverse range of GPUs to be efficiently used for machine-learning applications.

4 Seen firsthand during the internship at Microsoft while working on the Cortana speech recognition system.

7.2.2 Fundamental future work

Even under possibly the simplest concurrent programming model, i.e. a sequentially consistent memory model and fair scheduling, bugs such as data races and deadlocks are readily found. The OBE progress model, like the other progress models, adds complexity to the already difficult concurrency model. Although pragmatic, these complicated models might not be the ideal way forward.

Personal perspective I believe that fundamental future work should take a step back to examine our GPU programming models and find ways to simplify them while allowing for intuitive and performance-enhancing idioms, such as the inter-workgroup barrier. This future work is not as straightforward as the ideas presented above. To carry out this work in a meaningful way would require in-depth knowledge of GPU applications, GPU architecture, and programming languages. However, as GPUs (and other accelerators) become more common, how can we expect them to be programmable by developers who are not domain experts, given such complicated models? The idea of aiming for simpler concurrent programming models is not new, and previous work can provide guidance on how to approach such a daunting task. Another domain that I am familiar with is weak memory models. In this domain, it appears that there are three schools of thought about how programming models should address weak memory:

1. The weak memory models on current architectures are difficult, but can be tackled by sufficient formalisation and exposition, e.g. see [AMT14].

2. Current memory models are weak, yet programmers should not have to deal with their complex semantics. Thus, a contract model is used whereby programmers follow certain rules and the architecture must provide intuitive semantics. If the rules are not followed, then program behaviour is undefined; e.g. the DRF model of [AH90].

3. An intuitive programming model is the highest priority. Architectures and compilers should work together to always ensure intuitive semantics, without the programmer required to follow any certain rules; e.g. see [MMM+15].

I believe research into applying any of these programming model approaches to forward progress concerns would be fruitful. I, however, am most interested in thinking about a contract model. For example, a progress contract might state: all blocking behaviours in a program must be annotated; if properly annotated, then the architecture guarantees that the program will exhibit behaviours as if run under a weakly fair scheduler. A contract like this would require a formalisation of blocking behaviours and new annotations that can intuitively indicate such behaviours. I think that an exploration along these lines would be very interesting and I hope to pursue such work during my career.

7.3 Summary

Given the small, but growing, presence of GPU programs that contain blocking idioms, it seems likely that vendors and languages will have to officially support forward progress guarantees in some form. The longer the community goes without official guarantees, the more programs will be written relying on empirical and folklore guarantees. This “wild west” environment is harmful to many central tenets of computer science, including portability, verification, and abstraction. This thesis has explored one such blocking idiom: the inter-workgroup barrier. Sufficient guarantees have been formalised and empirically shown to hold on current GPUs to support such a barrier. Situations where the barrier provides significant performance improvements have also been shown. Finally, foundations for future scheduling models have been proposed. These three investigations make a strong case for carrying forward discussions with programmers and vendors about restoring order to the “wild west” by providing official documentation around progress guarantees and, hopefully, support for the inter-workgroup barrier.

Bibliography

[ABD+15] Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. GPU concurrency: Weak behaviours and programming assumptions. In ASPLOS, pages 577–591. ACM, 2015.

[ACKS12] Jacob T Adriaens, Katherine Compton, Nam Sung Kim, and Michael J Schulte. The case for GPGPU spatial multitasking. In HPCA, pages 1–12, 2012.

[ACW+09] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman Amarasinghe. PetaBricks: A Language and Compiler for Algorithmic Choice. In PLDI, pages 38–49. ACM, 2009.

[AH90] Sarita V. Adve and Mark D. Hill. Weak ordering-a new definition. In ISCA, pages 2–14. ACM, 1990.

[AKV+14] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. OpenTuner: An Extensible Framework for Program Autotuning. In PACT, pages 303–316. ACM, 2014.

[Ama17] Amazon Web Services. The FreeRTOS reference manual: version 10.0.0 issue 1, 2017.

[AMD12] AMD. AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE, 2012. Whitepaper.

[AMT14] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding cats: Mod- elling, simulation, testing, and data mining for weak memory. ACM Trans. Program. Lang. Syst., 36(2):7:1–7:74, 2014.

[And97] Paul Thomas Anderson. Boogie Nights, 1997. New Line Cinema.

[Ash16] Ben Ashbaugh. cl intel subgroups version 4, Aug. 2016.

[Bar] Blaise Barney. POSIX threads programming: Condition variables. https://computing.llnl.gov/tutorials/pthreads/#ConditionVariables (visited January 2018).

[Bax] Sean Baxter. ModernGPU. Retrieved June 2018 from https://moderngpu.github.io/intro.html.

[BCD+15] Adam Betts, Nathan Chong, Alastair F. Donaldson, Jeroen Ketema, Shaz Qadeer, Paul Thomson, and John Wickerson. The design and implementation of a verification technique for GPU kernels. ACM Trans. Program. Lang. Syst., 37(3):10, 2015.

[BDG13] Mark Batty, Mike Dodds, and Alexey Gotsman. Library abstraction for C/C++ concurrency. In POPL, pages 235–248. ACM, 2013.

[BDW16] Mark Batty, Alastair F. Donaldson, and John Wickerson. Overhauling SC atomics in C11 and OpenCL. In POPL, pages 634–648. ACM, 2016.

[BK08] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. The MIT Press, 2008.

[BNP12] M. Burtscher, R. Nasre, and K. Pingali. A quantitative study of irregular programs on GPUs. In IISWC, pages 141–151. IEEE, 2012.

[CB13] Charlie Curtsinger and Emery D. Berger. STABILIZER: Statistically Sound Performance Evaluation. In ASPLOS, pages 219–228. ACM, 2013.

[CBRS13] Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt, and Kevin Skadron. Pannotia: Understanding irregular GPGPU graph applications. In IISWC, pages 185–195, 2013.

[CDKQ13] Peter Collingbourne, Alastair F. Donaldson, Jeroen Ketema, and Shaz Qadeer. Interleaving and lock-step semantics for analysis and verification of GPU kernels. In ESOP, pages 270–289. Springer, 2013.

[CHR+16] L. W. Chang, I. E. Hajj, C. Rodrigues, J. Gómez-Luna, and W. M. Hwu. Efficient kernel synthesis for performance portable programming. In MICRO, pages 1–13. ACM, 2016.

[clS] clSPARSE. Retrieved June 2018 from https://github.com/clMathLibraries/clSPARSE.

[CSW18] Nathan Chong, Tyler Sorensen, and John Wickerson. The semantics of transactions and weak memory in x86, Power, ARMv8, and C++. In PLDI. ACM, 2018.

[CT08] Daniel Cederman and Philippas Tsigas. On dynamic load balancing on graphics processors. In SIGGRAPH, pages 57–64. Eurographics Association, 2008.

[CZF04] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-MAT: A recursive model for graph mining. In SDM, pages 442–446, 2004.

[DAV+15] Yufei Ding, Jason Ansel, Kalyan Veeramachaneni, Xipeng Shen, Una-May O’Reilly, and Saman Amarasinghe. Autotuning Algorithmic Choice for Input Sensitivity. In PLDI, pages 379–390. ACM, 2015.

[DBGO14] Andrew Davidson, Sean Baxter, Michael Garland, and John D. Owens. Work-efficient parallel GPU methods for single-source shortest paths. In IPDPS, pages 349–359. IEEE Computer Society, 2014.

[DG15] M. Daga and J. L. Greathouse. Structural agnostic SpMV: Adapting CSR-adaptive for irregular matrices. In HiPC, pages 64–74. IEEE, 2015.

[dOFD+13] Augusto Born de Oliveira, Sebastian Fischmeister, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Why you should care about quantile regression. In ASPLOS, pages 207–218. ACM, 2013.

[DSC+16] Greg Diamos, Shubho Sengupta, Bryan Catanzaro, Mike Chrzanowski, Adam Coates, Erich Elsen, Jesse Engel, Awni Hannun, and Sanjeev Satheesh. Persistent RNNs: Stashing recurrent weights on-chip. In Proceedings of The 33rd International Conference on Machine Learning (ICML), pages 2024–2033. PMLR, 2016.

[Gas15] Benedict Gaster. A look at the OpenCL 2.0 execution model. In IWOCL, pages 2:1–2:1. ACM, 2015.

[GDG+17] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.

[GHH15] Benedict R. Gaster, Derek Hower, and Lee Howes. HRF-relaxed: Adapting HRF to the complexities of industrial heterogeneous memory models. Trans. Archit. Code Optim. (TACO), 12(1):7:1–7:26, 2015.

[GSO12] Kshitij Gupta, Jeff Stuart, and John D. Owens. A study of persistent threads style GPU programming for GPGPU workloads. In InPar, pages 1–14, 2012.

[HHB+14] Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. Heterogeneous-race-free memory models. In ASPLOS, pages 427–440, 2014.

[HN07] Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In HiPC, pages 197–208, 2007.

[HS08] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., 2008.

[HSA17] HSA Foundation. HSA programmer’s reference manual: HSAIL virtual ISA and programming model, compiler writer, and object format (BRIG). (rev 1.1.1), March 2017.

[Hwu11] Wen-mei W. Hwu. GPU Computing Gems Jade Edition. Morgan Kaufmann, 2011.

[Int15a] Intel. The Compute Architecture of Intel Processor Graphics Gen8, 2015. Version 1.1.

[Int15b] Intel. The Compute Architecture of Intel Processor Graphics Gen9, Version 1.0, Aug. 2015.

[IOOH12] Fumihiko Ino, Akihiro Ogita, Kentaro Oita, and Kenichi Hagihara. Cooperative multitasking for GPU-accelerated grid systems. Concurrency and Computation: Practice and Experience, 24(1), 2012.

[ISO12] ISO/IEC. Standard for programming language C++, 2012.

[Jai91] Raj Jain. The art of computer systems performance analysis - techniques for experimental design, measurement, simulation, and modeling. Wiley professional computing. Wiley, 1991.

[Khr15] Khronos Group. The OpenCL specification version: 2.0 (rev. 29), July 2015.

[Khr16a] Khronos Group. The OpenCL C specification version 2.0 (rev. 33), April 2016.

[Khr16b] Khronos OpenCL Working Group. The OpenCL extension specification, November 2016.

[Khr17] Khronos Group. The OpenCL specification version: 2.2 (rev. 2.2-3), May 2017.

[KLRI11] Shinpei Kato, Karthik Lakshmanan, Ragunathan Rajkumar, and Yutaka Ishikawa. Timegraph: GPU scheduling for real-time multi-tasking environments. In USENIX ATC, 2011.

[KVP+16] Rashid Kaleem, Anand Venkat, Sreepathi Pai, Mary W. Hall, and Keshav Pingali. Synchronization trade-offs in GPU implementations of graph algorithms. In IPDPS, pages 514–523, 2016.

[LLH+16] Weifeng Liu, Ang Li, Jonathan Hogg, Iain S. Duff, and Brian Vinter. A synchronization-free algorithm for parallel sparse triangular solves. In Euro-Par, pages 617–630. Springer, 2016.

[LLK+14] S. H. Lo, C. R. Lee, Q. L. Kao, I. H. Chung, and Y. C. Chung. Improving GPU memory performance with artificial barrier synchronization. IEEE Transactions on Parallel and Distributed Systems, 25(9):2342–2352, 2014.

[LLS+12] Guodong Li, Peng Li, Geof Sawaya, Ganesh Gopalakrishnan, Indradeep Ghosh, and Sreeranga P. Rajan. GKLEE: concolic verification and test generation for GPUs. In PPoPP, pages 215–224. ACM, 2012.

[Lok11] Anton Lokhmotov. ARM Midgard architecture, 2011. Retrieved June 2018 from http://www.heterogeneouscompute.org/hipeac2011Presentations/OpenCL-Midgard.pdf.

[LPZ85] Orna Lichtenstein, Amir Pnueli, and Lenore Zuck. The glory of the past. In Logics of Programs, pages 196–218. Springer Berlin Heidelberg, 1985.

[Lub86] Michael Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4):1036–1053, 1986.

[MBP12] Mario Méndez-Lojo, Martin Burtscher, and Keshav Pingali. A GPU implementation of inclusion-based points-to analysis. In PPoPP, pages 107–116, 2012.

[MDHS09] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! In ASPLOS, pages 265–276. ACM, 2009.

[Mer] Bruce Merry. clogs: C++ library for sorting and searching in OpenCL applications.

[Mer15] Bruce Merry. A performance comparison of sort and scan libraries for GPUs. Parallel Processing Letters, 25, 2015.

[MGG12a] Duane Merrill, Michael Garland, and Andrew Grimshaw. Policy-based tuning for performance portability and library co-optimization. In InPar, pages 1–10, 2012.

[MGG12b] Duane Merrill, Michael Garland, and Andrew Grimshaw. Scalable GPU graph traversal. In PPoPP, pages 117–128. ACM, 2012.

[MMM+15] Daniel Marino, Todd Millstein, Madanlal Musuvathi, Satish Narayanasamy, and Abhayendra Singh. The silently shifting semicolon. In SNAPL, LIPIcs, pages 177–189. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2015.

[MO16] Pınar Muyan-Özçelik and John D. Owens. Multitasking real-time embedded GPU computing tasks. In PMAM, pages 78–87, 2016.

[MRH+16] Saurav Muralidharan, Amit Roy, Mary Hall, Michael Garland, and Piyush Rai. Architecture-adaptive code variant tuning. In ASPLOS, pages 325–338. ACM, 2016.

[MSH+14] Saurav Muralidharan, Manu Shantharam, Mary Hall, Michael Garland, and Bryan Catanzaro. Nitro: A framework for adaptive code variant tuning. In IPDPS, pages 501–512. IEEE Computer Society, 2014.

[MW47] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50–60, 1947.

[MYB16] Sepideh Maleki, Annie Yang, and Martin Burtscher. Higher-order and tuple-based massively-parallel prefix sums. In PLDI, pages 539–552. ACM, 2016.

[MZ16] Michał Mrozek and Zbigniew Zdanowicz. GPU daemon: Road to zero cost submission. In IWOCL, pages 11:1–11:4. ACM, 2016.

[NBF96] Bradford Nichols, Dick Buttlar, and Jacqueline Proulx Farrell. Pthreads Programming. O'Reilly & Associates, Inc., 1996.

[NCKB12] Sadegh Nobari, Thanh-Tung Cao, Panagiotis Karras, and Stéphane Bressan. Scalable parallel minimum spanning forest computation. In PPoPP, pages 205–214, 2012.

[Nvi] Nvidia. CUB. http://nvlabs.github.io/cub/ (visited January 2018).

[NVI99] NVIDIA. Graphics Processing Unit (GPU), 1999. Archived at https://web.archive.org/web/20160408122443/http://www.nvidia.com/object/gpu.html.

[NVI16] NVIDIA. P100 GPU ARCHITECTURE, 2016. Whitepaper WP-08019-001 v01.1.

[NVI17] NVIDIA. NVIDIA TESLA V100 GPU ARCHITECTURE, 2017. Whitepaper WP-08608-001 v1.1.

[Nvi18a] Nvidia. CUDA C Programming Guide, Version 9.1, January 2018.

[Nvi18b] Nvidia. CUDA Code Samples, Version 9.1, January 2018.

[Nvi18c] Nvidia. CuDNN 7.1.2 Developer Guide, May 2018.

[Nvi18d] Nvidia. nvGraph Library User’s Guide v10.0.130, May 2018.

[Nvi18e] Nvidia. Parallel thread execution ISA: Version 6.1, March 2018.

[OCY+15] Marc S. Orr, Shuai Che, Ayse Yilmazer, Bradford M. Beckmann, Mark D. Hill, and David A. Wood. Synchronization using remote-scope promotion. In ASPLOS, pages 73–86. ACM, 2015.

[Ope15] OpenMP Architecture Review Board. OpenMP application program interface version 4.5, 2015.

[PGB+05] Tim Peierls, Brian Goetz, Joshua Bloch, Joseph Bowbeer, Doug Lea, and David Holmes. Java Concurrency in Practice. Addison-Wesley Professional, 2005.

[PMS17] James Price and Simon McIntosh-Smith. Exploiting auto-tuning to analyze and improve performance portability on many-core architectures. In High Performance Computing, pages 538–556. Springer, 2017.

[Pol16] Adam Polak. Counting triangles in large graphs on GPU. In IPDPS, pages 740–746. IEEE Computer Society, 2016.

[PP16] Sreepathi Pai and Keshav Pingali. A compiler for throughput optimization of graph algorithms on GPUs. In OOPSLA, pages 1–19, 2016.

[PPM15] Jason Jong Kyu Park, Yongjun Park, and Scott A. Mahlke. Chimera: Collaborative preemption for multitasking on a shared GPU. In ASPLOS, pages 593–606, 2015.

[PTG13] Sreepathi Pai, Matthew J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic kernels. In ASPLOS, pages 407–418. ACM, 2013.

[RIS] RISC OS. Preemptive multitasking. Retrieved June 2018 from http://www.riscos.info/index.php/Preemptive_multitasking.

[SA16] Ryan Smith and Anandtech. Preemption improved: Fine-grained preemption for time-critical tasks, 2016. Retrieved June 2018 from http://www.anandtech.com/show/10325/the-nvidia-geforce-gtx-1080-and-1070-founders-edition-review/10.

[SD16a] Tyler Sorensen and Alastair F. Donaldson. Exposing errors related to weak memory in GPU applications. In PLDI, pages 100–113. ACM, 2016.

[SD16b] Tyler Sorensen and Alastair F. Donaldson. The hitchhiker’s guide to cross- platform OpenCL application development. In IWOCL, pages 2:1–2:12, 2016.

[SDB+16] Tyler Sorensen, Alastair F. Donaldson, Mark Batty, Ganesh Gopalakrishnan, and Zvonimir Rakamarić. Portable inter-workgroup barrier synchronisation for GPUs. In OOPSLA, pages 39–58, 2016.

[SED17a] Tyler Sorensen, Hugues Evrard, and Alastair F. Donaldson. Cooperative kernels: GPU multitasking for blocking algorithms. In FSE, pages 431–441, 2017.

[SED17b] Tyler Sorensen, Hugues Evrard, and Alastair F. Donaldson. Cooperative kernels: GPU multitasking for blocking algorithms (extended version). CoRR, 2017.

[SED18] Tyler Sorensen, Hugues Evrard, and Alastair F. Donaldson. GPU schedulers: How fair is fair enough? In CONCUR, pages 23:1–23:17. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018.

[Sin13] Graham Singer. The history of the modern graphics processor, 2013.

[SK10] Jason Sanders and Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. Addison-Wesley Professional, 2010.

[SKB+14] Markus Steinberger, Michael Kenzel, Pedro Boechat, Bernhard Kerbl, Mark Dokter, and Dieter Schmalstieg. Whippletree: task-based scheduling of dynamic workloads on the GPU. ACM Trans. Graph., 33(6):228:1–228:11, 2014.

[SKN10] J. Soman, K. Kishore, and P. J. Narayanan. A fast GPU algorithm for graph connectivity. In IPDPSW, pages 1–8, 2010.

[Sol09] Y. Solihin. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin Publishing, 2009.

[Sor14] Tyler Sorensen. Testing and exposing weak GPU memory models. Master’s thesis, University of Utah, 2014.

[SPD18] Tyler Sorensen, Sreepathi Pai, and Alastair F. Donaldson. When one size doesn't fit all: Quantifying performance portability of graph applications on GPUs. Under submission, 2018.

[SRD17] Michel Steuwer, Toomas Remmelg, and Christophe Dubach. Lift: A functional data-parallel IR for high-performance GPU code generation. In CGO, pages 74–85. IEEE Press, 2017.

[STT10] Steven Solomon, Parimala Thulasiraman, and Ruppa K. Thulasiram. Exploiting parallelism in iterative irregular maxflow computations on GPU accelerators. In HPCC, pages 297–304, 2010.

[Tar91] Marc Tarpenning. Cooperative multitasking in C++. Dr. Dobb’s J., 16(4), April 1991.

[TGC+14] Ivan Tanasic, Isaac Gelado, Javier Cabezas, Alex Ramírez, Nacho Navarro, and Mateo Valero. Enabling preemptive multiprogramming on GPUs. In ISCA, pages 193–204, 2014.

[TGEL11] Y. Torres, A. Gonzalez-Escribano, and D. R. Llanos. Understanding the impact of CUDA tuning techniques for Fermi. In High Performance Computing and Simulation (HPCS), pages 631–639, 2011.

[TPO10] Stanley Tzeng, Anjul Patney, and John D. Owens. Task management for irregular-parallel workloads on the GPU. In HPG, pages 29–37, 2010.

[VC13] Narseo Vallina-Rodriguez and Jon Crowcroft. Energy management techniques in modern mobile handsets. IEEE Communications Surveys and Tutorials, 15(1):179–198, 2013.

[VHPN09] Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum spanning tree for large graphs on the GPU. In HPG, pages 167–171, 2009.

[VVdL+15] Ana Lucia Varbanescu, Merijn Verstraaten, Cees de Laat, Ate Penders, Alexandru Iosup, and Henk Sips. Can portability improve performance?: An empirical study of parallel graph analytics. In ICPE, pages 277–287. ACM, 2015.

[WBSC17] John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. Automatically comparing memory consistency models. In POPL, pages 190–204. ACM, 2017.

[WCL+15] Bo Wu, Guoyang Chen, Dong Li, Xipeng Shen, and Jeffrey S. Vetter. Enabling and exploiting flexible task assignment on GPU through SM-centric program transformations. In ICS, pages 119–130, 2015.

[WD98] R. Clint Whaley and Jack J. Dongarra. Automatically Tuned Linear Algebra Software. In SC, pages 1–27. IEEE Computer Society, 1998.

[WDP+16] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A high-performance graph processing library on the GPU. In PPoPP. ACM, 2016.

[Wik18] Wikipedia contributors. OpenCL — Wikipedia, the free encyclopedia. https://en.wikipedia.org/w/index.php?title=OpenCL&oldid=842868273, 2018. [Online; accessed 12-June-2018].

[WLDP15] Joyce Jiyoung Whang, Andrew Lenharth, Inderjit S. Dhillon, and Keshav Pingali. Scalable data-driven pagerank: Algorithms, system issues, and lessons learned. In Euro-Par, pages 438–450. Springer, 2015.

[XF10] Shucai Xiao and Wu-chun Feng. Inter-block GPU communication via fast barrier synchronization. In IPDPS, pages 1–12, 2010.

[YLZ13] Shengen Yan, Guoping Long, and Yunquan Zhang. Streamscan: Fast scan algorithms for GPUs without global barrier synchronization. In PPoPP, pages 229–238. ACM, 2013.

[ZSC13] Yao Zhang, Mark Sinclair, and Andrew A. Chien. Improving Performance Portability in OpenCL Programs. In Supercomputing, Lecture Notes in Computer Science, pages 136–150. Springer, 2013.
