Improving GPU Performance Through Instruction Redistribution and Diversification
Improving GPU Performance through Instruction Redistribution and Diversification

A Dissertation Presented by
Xiang Gong

to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Engineering

Northeastern University
Boston, Massachusetts
August 2018

ProQuest Number: 10933866
All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion.

ProQuest 10933866
Published by ProQuest LLC (2018). Copyright of the Dissertation is held by the Author.
All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code, Microform Edition © ProQuest LLC.

ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

To my family for their unconditional love and support.

Contents

List of Figures
List of Tables
List of Acronyms
Acknowledgments
Abstract of the Dissertation

1 Introduction
  1.1 Toward Peak Performance on GPUs
    1.1.1 Challenge 1: A general lack of appropriate tools to perform systematic hardware/software studies on GPUs
    1.1.2 Challenge 2: Alleviating memory contention in GPUs
    1.1.3 Challenge 3: Insufficient instruction diversity in GPUs
  1.2 Contributions of This Thesis
  1.3 Organization of the Thesis

2 Background
  2.1 Overview of Parallelism
    2.1.1 The Rise of Parallelism
    2.1.2 The Evolution of Hardware and Software
  2.2 GPU Architecture
    2.2.1 Compute Unit
    2.2.2 Memory Hierarchy
  2.3 GPU Programming Model
    2.3.1 OpenCL Platform Model
    2.3.2 OpenCL Memory Model
    2.3.3 OpenCL Execution Model
    2.3.4 OpenCL Programming Model
    2.3.5 OpenCL Compilation Model
  2.4 Multi2Sim
    2.4.1 Disassembler
    2.4.2 Functional Simulator
    2.4.3 Architectural Simulator
    2.4.4 Multi2Sim OpenCL Runtime and Driver Stack
  2.5 Summary

3 Related Work
  3.1 Tools for GPU Research
    3.1.1 GPU Benchmark Suites
    3.1.2 GPU Compilers
    3.1.3 GPU Simulators
  3.2 Optimization of Memory Contention in GPUs
    3.2.1 Wavefront Scheduling
    3.2.2 Thread Block/Work-group Scheduling
  3.3 Optimization of Resource Utilization on GPUs
    3.3.1 Software-based CKE
    3.3.2 Hardware-based CKE
  3.4 Summary

4 M2C and M2V
  4.1 Overview
  4.2 M2C: An Open-source GPU Compiler for Multi2Sim
    4.2.1 The Front-end
    4.2.2 The Back-end
    4.2.3 The Finalizer
    4.2.4 The Test Suite
  4.3 M2V: A Visualization Infrastructure for Multi2Sim
    4.3.1 The Data Serialization Layer
    4.3.2 The Back-end
    4.3.3 The Front-end
    4.3.4 Visualization Examples
  4.4 Summary

5 Twin Kernels
  5.1 Motivation
    5.1.1 Instruction Scheduling and Performance
  5.2 Implementation
    5.2.1 Memory Consistency
    5.2.2 The Twin Kernel Compiler
    5.2.3 The Twin Kernel Binary
    5.2.4 Twin Kernel Assignment
    5.2.5 The Twin Kernel Execution Engine
  5.3 Evaluation
    5.3.1 Methodology
    5.3.2 Results
  5.4 Summary

6 Rebalanced Kernel
  6.1 Motivation
  6.2 Rebalanced Kernel
    6.2.1 Addressing Spatial Imbalance with Kernel Fusion
    6.2.2 Addressing Temporal Imbalance with Twin Kernels
    6.2.3 Rebalanced Kernel
  6.3 Evaluation
    6.3.1 Methodology
    6.3.2 Performance Improvement: Stage 1
    6.3.3 Performance Improvement: Stage 2
    6.3.4 Further Discussion
    6.3.5 Predictive Modeling
  6.4 Summary of Rebalanced Kernels

7 Conclusion
  7.1 Conclusion
  7.2 Future Work

Bibliography

List of Figures

2.1 Overview of a GPU
2.2 Diagram of an AMD GCN Compute Unit
2.3 Diagram of the GCN GPU memory system
2.4 Diagram of the OpenCL Platform Model
2.5 Diagram of the OpenCL Memory Model
2.6 Diagram of the OpenCL Kernel Execution Model
2.7 OpenCL Compilation Models
2.8 Multi2Sim software modules [36]
2.9 Multi2Sim Runtime and Driver Software Stack [36]
4.1 M2C architecture diagram
4.2 The stages of the M2C front-end
4.3 Abstract syntax tree of the Vector Addition kernel
4.4 The transformation of the VecAdd kernel in the M2C front-end
4.5 Stages of the M2C back-end pipeline
4.6 Vector Addition: translating LLVM IR to a SelectionDAG
4.7 Vector Addition: the DAG before and after instruction selection
4.8 M2V architecture
4.9 Breakdown of stalls in a Compute Unit
4.10 Breakdown of instruction count and cycles
4.11 Utilization of execution units
4.12 Length of different types of instructions
4.13 Instruction activity comparison
5.1 Visualization of instructions of a Matrix Multiplication kernel
5.2 Execution timeline for five systems
5.3 Workflow of the Twin Kernel Compiler
5.4 Comparison of a Twin Kernel Binary and two regular kernel binaries
5.5 Assigning Twin Kernels to wavefronts by setting the initial PC
5.6 Performance of 5 instruction schedulers in the SIMT model, relative to the average performance of 10 instruction schedulers
5.7 Best performance in SIMT, relative to the baseline
5.8 Speedup of TKMT, relative to the baseline