Improving Resource Utilization in Heterogeneous CPU-GPU Systems
A Dissertation Presented to the Faculty of the School of Engineering and Applied Science, University of Virginia, in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Computer Engineering)

by Michael Boyer

May 2013

© 2013 Michael Boyer

Abstract

Graphics processing units (GPUs) have attracted enormous interest over the past decade due to substantial increases in both performance and programmability. Programmers can potentially leverage GPUs for substantial performance gains, but at the cost of significant software engineering effort. In practice, most GPU applications do not effectively utilize all of the available resources in a system: they either fail to use a resource at all or use a resource to less than its full potential. This underutilization can hurt both performance and energy efficiency.

In this dissertation, we address the underutilization of resources in heterogeneous CPU-GPU systems in three different contexts. First, we address the underutilization of a single GPU by reducing CPU-GPU interaction to improve performance. We use as a case study a computationally intensive video-tracking application from systems biology. Because of the high cost of CPU-GPU coordination, our initial, straightforward attempts to accelerate this application failed to effectively utilize the GPU. By leveraging some non-obvious optimization strategies, we significantly decreased the amount of CPU-GPU interaction and improved the performance of the GPU implementation by 26x relative to the best CPU implementation. Based on the lessons we learned, we present general guidelines for optimizing GPU applications as well as recommendations for system-level changes that would simplify the development of high-performance GPU applications.

Next, we address underutilization at the system level by using load balancing to improve performance.
We propose a dynamic scheduling algorithm that automatically and efficiently divides the execution of a data-parallel kernel across multiple, possibly heterogeneous GPUs. We show that our scheduler can nearly match the performance of an unrealistic static scheduler when device performance is fixed, and can provide better performance when device performance varies.

Finally, we address underutilization within a GPU by using frequency scaling to improve energy efficiency. We propose a novel algorithm for predicting the energy-optimal GPU clock frequencies for an arbitrary kernel. Using power measurements from real systems, we demonstrate that our algorithm improves significantly on the state of the art across multiple generations of GPUs. We also propose and evaluate techniques for decreasing the CPU's energy consumption during GPU computation.

Many of the techniques presented in this dissertation can be used to improve the performance and energy efficiency of GPU applications with no programmer effort or software modifications required. As the diversity of available hardware systems continues to increase, automatic techniques such as these will become critical for software to fully realize the benefits of future hardware improvements.

Approval Sheet

This dissertation is submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Engineering).

Michael Boyer

This dissertation has been read and approved by the Examining Committee:

Kevin Skadron, Advisor
Sudhanva Gurumurthi, Committee Chair
Jack Davidson
Mircea Stan
Westley Weimer

Accepted for the School of Engineering and Applied Science:

James H. Aylor, Dean, School of Engineering and Applied Science

May 2013

Acknowledgments

The work presented in this dissertation, as well as other work completed during my graduate career, would not have been possible without the support of many different people. First and foremost, I thank my advisor, Kevin Skadron, who has played an important role in nearly everything I have accomplished as a graduate student. Advisors have a tough job: they must walk a fine line between being too overbearing and being too laid back, with the right mix often depending heavily on the particular student. Kevin has always managed to find the right balance between the two extremes, consistently helping me stay confident and motivated but also challenging me to keep improving.

Other than my advisor, Nuwan Jayasena of AMD played the largest role in helping shape much of the work presented here. I also thank the members of my Ph.D. committee: Professors Jack Davidson, Sudhanva Gurumurthi, Mircea Stan, and Wes Weimer. Their guidance helped steer my work in a more interesting and useful direction. Wes deserves special recognition as a co-author of my first peer-reviewed publication, which was based on work that I completed as part of his Programming Languages course and which remains my most-cited (first-author) publication. Professor Andrew Grimshaw provided me with a unique opportunity to co-teach a course on parallel programming; this experience was rewarding and enlightening, and remains one of the highlights of my time in graduate school.

The members of the LAVA Lab have supported me in innumerable ways over the years. David Tarjan took me under his wing early in my career and was a wonderful collaborator on a number of different projects. Shuai Che, Jeremy Sheaffer, and Runjie Zhang provided feedback on numerous matters both technical and otherwise.
I am also indebted to other past and current members of the LAVA Lab: Greg Faust, Marisabel Guevara, Tibor Horvath, Jiayuan Meng, Brett Meyer, Donnie Newell, Karthik Sankaranarayanan, Prateeksha Satyamoorthy, Lukasz Szafaryn, Jack Wadden, Ke Wang, and Liang Wang. I am especially thankful for the support of my fellow graduate students Chris Gregg, Jiawei Huang, Sudeep Ghosh, and Nishant George throughout my time at the University of Virginia. Saurav Basu helped me understand the underlying algorithms and data structures used in the leukocyte tracking work. Jonathan Dorn provided valuable insight on the frequency scaling work. This dissertation was typeset using a LaTeX template created by Joel Coffman.

My parents encouraged my intellectual curiosity from a young age, and are in many ways responsible for my early interest in computers and my subsequent choice to study computer engineering as both an undergraduate and graduate student. Their support throughout the years has been unwavering despite the physical distance that has separated us for the last decade. Any homesickness I may have felt has been tempered slightly by my in-laws, Jay, Cathy, Alanna, and Brett, who have warmly welcomed me into their family here in Virginia.

Last but certainly not least, I thank my wife Jessica for her love and support. I am grateful for her constant encouragement, her occasional advice, and, perhaps most importantly, her patience. Seven years was a long time to wait; now it is time to see what adventures are in store for us in the next chapter of our life.

This work was supported in part by NSF grants IIS-0612049 and CNS-0916908, SRC grants 1607.001, 1972.001, and 2233.001, a GRC AMD/Mahboob Khan Ph.D. fellowship, an NVIDIA research grant, and equipment donated by AMD and NVIDIA.

Contents

Contents
List of Tables
List of Figures
1 Introduction
  1.1 Leukocyte Tracking
  1.2 Heterogeneous Load Balancing
  1.3 GPU Frequency Scaling
  1.4 Organization
2 Background
  2.1 GPU Hardware
  2.2 GPU Software
    2.2.1 GPU Programming Model
  2.3 Energy Efficiency
3 Case Study: Leukocyte Tracking
  3.1 Leukocyte Detection and Tracking
  3.2 Accelerating the Detection Stage
    3.2.1 Comparison to OpenMP
    3.2.2 Naïve CUDA Implementation
    3.2.3 CUDA Optimizations
    3.2.4 Single- vs. Double-Precision Performance
  3.3 Accelerating the Tracking Stage
    3.3.1 Comparison to OpenMP
    3.3.2 Naïve CUDA Implementation
    3.3.3 CUDA Optimizations
  3.4 Discussion
    3.4.1 Lessons for Software Developers
    3.4.2 Lessons for System Architects
  3.5 Related Work
  3.6 Conclusions
  3.7 Postscript
  3.8 Publications and Impact
4 Heterogeneous Load Balancing
  4.1 Motivation
    4.1.1 Need for Heterogeneous Load Balancing
    4.1.2 Need for Dynamic Load Balancing
  4.2 Related Work
  4.3 Dynamic Load Balancing
    4.3.1 Optimal Partition
    4.3.2 Scheduling Algorithm
    4.3.3 Scheduler Implementation
    4.3.4 Example Schedule
  4.4 Experimental