Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

A Dissertation Presented by

Yash Ukidave

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University
Boston, Massachusetts

December 2015

© Copyright 2015 by Yash Ukidave
All Rights Reserved

Contents

List of Figures

List of Tables

List of Acronyms

Acknowledgments

Abstract

1 Introduction
  1.1 Types of Concurrency on GPUs
  1.2 Benefits of Multi-level Concurrency
  1.3 Challenges in Implementing Multi-Level Concurrency
    1.3.1 Runtime Level Challenges
    1.3.2 Architectural Challenges
    1.3.3 Scheduling Challenges for Multi-Tasked GPUs
  1.4 Contributions of the Thesis
  1.5 Organization of the Thesis

2 Background
  2.1 GPU Architecture
    2.1.1 AMD Graphics Core Next (GCN) Architecture
    2.1.2 Kepler Architecture
    2.1.3 Evolution of GPU Architecture
  2.2 The OpenCL Programming Model
    2.2.1 Platform Model
    2.2.2 Memory Model
    2.2.3 Runtime and Driver Model
  2.3 Multi2Sim: Cycle-level GPU Simulator
    2.3.1 Disassembler
    2.3.2 Functional Simulation / Emulation
    2.3.3 Architectural Simulator
    2.3.4 OpenCL Runtime and Driver Stack for Multi2Sim

  2.4 Concurrent Kernel Execution on GPUs
    2.4.1 Support for Concurrent Kernel Execution on GPUs
  2.5 Concurrent Context Execution on GPUs
    2.5.1 Context Concurrency for Cloud Engines and Data Centers
    2.5.2 Key Challenges for Multi-Context Execution

3 Related Work
  3.1 GPU Modeling and Simulation
    3.1.1 Functional simulation of GPUs for compute and graphics applications
    3.1.2 Architectural Simulation of GPUs
    3.1.3 Power and Area Estimation on GPUs
  3.2 Design of Runtime Environments for GPUs
  3.3 Resource Management for Concurrent Execution
    3.3.1 Spatial Partitioning of Compute Resources
    3.3.2 QoS-aware Shared Resource Management
    3.3.3 Workload Scheduling on GPUs
    3.3.4 GPU Virtualization for Data Center and Cloud Engines
    3.3.5 Scheduling on Cloud Servers and Datacenters
    3.3.6 Address Space and Memory Management on GPUs

4 Adaptive Spatial Partitioning
  4.1 Mapping Multiple Command Queues to Sub-Devices
  4.2 Design of the Adaptive Spatial Partitioning Mechanism
    4.2.1 Compute Unit Reassignment Procedure
    4.2.2 Case Study for Adaptive Partitioning
  4.3 Workgroup Scheduling Mechanisms and Partitioning Policies
  4.4 Evaluation Methodology
    4.4.1 Platform for Evaluation
    4.4.2 Evaluated Benchmarks
  4.5 Evaluation Results
    4.5.1 Performance Enhancements Provided by Multiple Command Queue Mapping
    4.5.2 Effective Utilization of Cache Memory
    4.5.3 Effects of Workgroup Scheduling Mechanisms and Partitioning Policies
    4.5.4 Timeline Describing Load-Balancing Mechanisms for the Adaptive Partitioning
  4.6 Summary

5 Multi-Context Execution
  5.1 Need for Multi-Context Execution
  5.2 Architectural and Runtime Enhancements for Multi-Context Execution
    5.2.1 Extensions in OpenCL Runtime
    5.2.2 Address Space Isolation
  5.3 Resource Management for Efficient Multi-Context Execution
    5.3.1 Transparent Memory Management using CML
    5.3.2 Remapping Compute-Resources between Contexts
  5.4 Firmware Extensions on Command-Processor (CP) for Multi-Context Execution

  5.5 Evaluation Methodology
    5.5.1 Platform for Evaluation
    5.5.2 Applications for Evaluation
  5.6 Evaluation Results
    5.6.1 Speedup and Utilization with Multiple Contexts
    5.6.2 Evaluation of Global Memory Efficiency
    5.6.3 TMM Evaluation
    5.6.4 Evaluation of Compute-Unit Remapping Mechanism
    5.6.5 L2 Cache Efficiency for Multi-Context Execution
    5.6.6 Firmware Characterization and CP Selection
    5.6.7 Overhead in Multi-Context Execution
  5.7 Discussion
    5.7.1 Static L2 Cache Partitioning for Multiple Contexts
  5.8 Summary

6 Application-Aware Resource Management for GPUs
  6.1 Need and Opportunity for QoS
  6.2 Virtuoso Defined QoS
    6.2.1 QoS Policies
    6.2.2 QoS Metrics
  6.3 Virtuoso Architecture
    6.3.1 QoS Interface: Settings Register (QSR)
    6.3.2 QoS Assignment: Performance Analyzer
    6.3.3 QoS Enforcement: Resource Manager
  6.4 Evaluation Methodology
    6.4.1 Platform for Evaluation
    6.4.2 Applications for Evaluation
    6.4.3 Choosing the QoS Reallocation Interval
  6.5 Virtuoso Evaluation
    6.5.1 Evaluation of QoS Policies
    6.5.2 Impact of Compute Resource Allocation
    6.5.3 Effects of QoS-aware Memory Management
    6.5.4 Case Study: Overall Impact of Virtuoso
    6.5.5 Evaluation of Memory Oversubscription
    6.5.6 Sensitivity Analysis of High-Priority Constraints
    6.5.7 Measuring QoS Success Rate
  6.6 Discussion
  6.7 Summary

7 Machine Learning Based Scheduling for GPU Cloud Servers
  7.1 Scheduling Challenges for GPU Cloud Servers and Datacenters
  7.2 GPU Remoting for Cloud Servers and Datacenters
  7.3 Collaborative Filtering for Application Performance Prediction
  7.4 Mystic Architecture
    7.4.1 The Causes of Interference (CoI)

    7.4.2 Stage-1: Initializer and Profile Generator
    7.4.3 Stage-2: Collaborative Filtering (CF) based Prediction
    7.4.4 Stage-3: Interference Aware Scheduler
    7.4.5 Validation
  7.5 Evaluation Methodology
    7.5.1 Server Setup
    7.5.2 Workloads for Evaluation
    7.5.3 Metrics
  7.6 Evaluation Results
    7.6.1 Scheduling Performance
    7.6.2 System Performance Using Mystic
    7.6.3 GPU Utilization
    7.6.4 Quality of Launch Sequence Selection
    7.6.5 Scheduling Decision Quality
    7.6.6 Scheduling Overheads Using Mystic
  7.7 Summary

8 Conclusion
  8.1 Future Work

Bibliography

List of Figures

2.1 Basic architecture of a GPU
2.2 AMD GCN Compute Unit block diagram
2.3 NVIDIA Advanced Stream Multiprocessor (SMX)
2.4 OpenCL Host Device Architecture
2.5 The OpenCL memory model
2.6 The runtime and driver model of OpenCL
2.7 Multi2Sim's three stage architectural simulation
2.8 Pipeline model of GCN SIMD unit
2.9 Multi2Sim Runtime and Driver Stack for OpenCL
2.10 Compute Stages of CLSurf
2.11 Kernel scheduling for Fermi GPU
2.12 Multiple hardware queues in GPUs
2.13 Active time of data buffers in GPU applications

4.1 High-level model describing multiple command queue mapping
4.2 Flowchart for adaptive partitioning handler
4.3 Procedure to initiate compute unit reallocation between kernels
4.4 Case study to demonstrate our adaptive partitioning mechanism
4.5 Workgroup scheduling mechanisms
4.6 Performance speedup using multiple command queue mapping
4.7 Memory hierarchy of the modeled AMD HD 7970 GPU device
4.8 L2 cache hit rate when using multiple command queue mapping
4.9 Effect of different partitioning policy and workgroup scheduling mechanisms
4.10 Timeline analysis of compute-unit reassignment
4.11 Overhead for CU transfer

5.1 Resource utilization and execution time of Rodinia kernels
5.2 Position of CML and context dispatch in the runtime
5.3 Address space isolation using CID
5.4 Combined host and CP interaction for TMM
5.5 TMM operation, managing three concurrent contexts
5.6 Speedup and resource utilization improvement using multi-context execution
5.7 Compute unit utilization using multi-context execution
5.8 Increase in execution time for multi-context execution

5.9 Global memory bandwidth utilization using multi-context execution
5.10 Timeline of global memory usage for contexts from different mixes
5.11 Evaluation of different global memory management mechanisms
5.12 Performance of pinned memory updates over transfer stalling
5.13 Timeline analysis of the compute unit remapping mechanism
5.14 L2 cache hit rate for multi-context execution
5.15 Conflict misses in the L2 cache as a percentage of total misses
5.16 TLB hit rate for multi-context execution
5.17 Analysis of firmware tasks for the CP
5.18 Execution performance of firmware on CP cores
5.19 Context handling delay added by the CP
5.20 L2 cache hit rate for static partitioned cache and CID-based cache

6.1 Need for supporting QoS awareness on multi-tasked GPU
6.2 Opportunity for achieving QoS through resource management
6.3 The three-layer architecture of Virtuoso
6.4 QoS settings register (QSR) and QoS information table (QIT)
6.5 Calculating the Token Release Register value using impact analysis
6.6 Resource allocation using tokens
6.7 Calculation of the appropriate QoS reallocation interval
6.8 Performance of Virtuoso based QoS enforcement
6.9 IPC and STP improvement using Virtuoso
6.10 Improvement in STP with Virtuoso guided CU allocation
6.11 L2 cache performance when using Virtuoso
6.12 Global memory throughput improvement using DRAM cache
6.13 Timeline highlighting the behavior-aware resource allocation when using Virtuoso
6.14 Oversubscription of GPU memory using Virtuoso and TMM
6.15 Sensitivity analysis of high priority constraints
6.16 Side-effects of aggressive selection of high priority constraints
6.17 Improving global memory throughput using QoS-aware interconnect network

7.1 Performance using interference-oblivious scheduling
7.2 GPU Remoting Architecture and Implementation
7.3 SVD on a rating matrix
7.4 Mystic Architecture
7.5 Collaborative Filtering for Predicting Performance
7.6 Comparing the clustering of applications using the Actual and Predicted metrics
7.7 Performance impact using Mystic with medium load
7.8 Performance of Mystic when scaling the number of compute nodes
7.9 Performance of our three schedulers with light and heavy load
7.10 System level performance of schedulers
7.11 GPU utilization on server
7.12 Variation in runtime for applications across 100 sequences
7.13 Scheduling Decision Quality
7.14 Quality of Collaborative Filtering Based Prediction

7.15 Overheads incurred with the Mystic scheduler

List of Tables

2.1 Evolution of NVIDIA GPU architecture from Fermi to Kepler
2.2 Evolution of AMD GPU architecture from Evergreen to GCN

4.1 The device configuration of the AMD HD 7970 GPU

5.1 Device configuration of AMD R9-290X GPU [13]
5.2 Summary of benchmarks used for evaluation
5.3 Workload mixes with varied compute characteristics
5.4 Workload mixes with varied sharing levels
5.5 Area and power overheads of the CP alternatives
5.6 L2 cache area overhead
5.7 TLB area overhead

6.1 QoS parameters and resource token settings used in Virtuoso evaluation
6.2 Workload mixes used to evaluate Virtuoso's support for QoS goals
6.3 QoS success rates when using Virtuoso based resource management

7.1 The metrics used for profiling resource usage
7.2 Server configuration for evaluating Mystic
7.3 Multi-application workloads from 8 application suites

List of Acronyms

NDR OpenCL NDRange

CU Compute Unit

CML Context Management Layer

CP Command Processor

TMM Transparent Memory Management

IPC Instructions Per Cycle

STP System Throughput

GMT Global Memory Throughput

Acknowledgments

I would like to thank my parents, Vandana and Sanjeev, for supporting me through this process. I thank my aunt Sadhana for being an important part of my family and supporting my work. I thank them for understanding my frustrations and keeping me motivated to complete this work. A special thanks to all my colleagues in the NUCAR group, especially Charu Kalra, Perhaad Mistry, Dana Schaa, and Xiangyu Li, for their contributions to the concepts and ideas in this thesis. I would also like to thank our collaborators Vilas Sridharan (AMD), John Kalamatianos (AMD), and Norman Rubin (NVIDIA) for their constructive feedback on our work. I would like to acknowledge my friends Arvind, Parag, Niranjan, Kunal, Sushant, Venkatesh, and Disha, who helped me remain sane through this process. Finally, this thesis would never have been possible without the constant support and guidance of my advisor, Dr. David Kaeli. I thank him for his continued encouragement and feedback to improve the quality of our work.

Abstract

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

by Yash Ukidave

Doctor of Philosophy in Electrical and Computer Engineering
Northeastern University, December 2015

Dr. David Kaeli, Adviser

GPUs have gained tremendous popularity as accelerators for a broad class of applications belonging to a number of important computing domains. Many applications have achieved significant performance gains using the inherent parallelism offered by GPU architectures. Given the growing impact of GPU computing, there is a pressing need to provide improved utilization of compute resources and increased application throughput. Modern GPUs support concurrent execution of kernels from a single application context in order to increase resource utilization. However, this support is limited to statically assigning compute resources to multiple kernels, and lacks the flexibility to adapt resources dynamically. The degree of concurrency present in a single application may also be insufficient to fully exploit the resources on a GPU. Applications developed for modern GPUs include multiple compute kernels, where each kernel exhibits a distinct computational behavior with differing resource requirements. These applications place high demands on the hardware and may also include strict deadline constraints. Their multi-kernel nature may allow for concurrent execution of kernels on the device. The use of GPUs in cloud engines and data centers will require a new class of GPU sharing mechanisms enforced at an application context level. What is needed are new mechanisms that can streamline concurrent execution for multi-kernel applications. At the same time, concurrent-execution support must be extended to schedule multiple applications on the same GPU. The implementation of application-level and kernel-level concurrency can deliver significantly improved resource utilization and application throughput. A number of architectural and runtime-level design challenges need to be addressed to enable efficient scheduling and management of memory and compute resources to support multiple levels of concurrency.

In this thesis, we propose a dynamic and adaptive mechanism to manage multi-level concurrency on a GPU. We present a new scheduling mechanism for dynamic spatial partitioning on the GPU. Our mechanism monitors and guides the current execution of compute workloads on a device. To enable this functionality, we extend the OpenCL runtime environment to map multiple command queues to a single GPU, and effectively partition the device. The result is that kernels that can benefit from concurrent execution on a partitioned device can more effectively utilize the available compute resources of a GPU. We also introduce new scheduling mechanisms and partitioning policies to match the computational requirements of different applications.

We present new techniques that address design challenges for supporting concurrent execution of multiple application contexts on the GPU. We design a hardware/software-based mechanism to enable multi-context execution, while preserving adaptive multi-kernel execution. Our Transparent Memory Management (TMM) combines host-based and GPU-based control to manage data access/transfers for multiple executing contexts. We enhance the Command Processor (CP) on the GPU to manage complex firmware tasks for dynamic compute resource allocation, memory handling, and kernel scheduling. We design a hardware-based scheme that modifies the L2 cache and TLB to implement virtual memory isolation across multiple contexts. We provide a detailed evaluation of our adaptive partitioning mechanism while leveraging multi-context execution using a large set of real-world applications. We also present a hardware-based runtime approach to enable profile-guided partitioning and scheduling mechanisms. Our partitioning/scheduling mechanism uses machine learning to analyze the current execution state of the GPU. We improve the effectiveness of adaptive partitioning and TMM by tracking the execution-time behavior of real-world applications.

Chapter 1

Introduction

The GPU (Graphics Processing Unit) has become ubiquitous as an accelerator device. Heterogeneous computing systems are being developed with multi-core CPUs and single or multiple GPUs. The parallelism offered by the GPU architecture has attracted many developers to port their applications to a GPU. The large number of cores on the GPU allows for launching thousands of compute threads that execute in a SIMD (Single-Instruction Multiple-Data) manner. The large amount of data parallelism GPUs offer has been leveraged by many applications belonging to a number of computing domains such as signal processing, molecular dynamics, and high performance computing (HPC) [39, 59, 115, 116, 139]. Applications for GPUs are commonly developed using programming frameworks such as Khronos's OpenCL (Open Computing Language) and NVIDIA's CUDA (Compute Unified Device Architecture). Both OpenCL and CUDA are based on high-level programming constructs of the C and C++ languages. The data-parallel and computationally intensive portions of an application are offloaded to the GPU for accelerated execution. The parallel portions of the application code which execute on the GPU are called kernels. These programming frameworks offer a rich set of runtime APIs (Application Programming Interfaces) to allow a developer to write optimized kernels for execution on GPUs. Applications developed using OpenCL or CUDA can launch a large number of threads, with each executing the same kernel code. This model provides for massive data parallelism by maximizing thread-level concurrency. Popular applications in scientific computing, molecular dynamics, graph analytics, and medical imaging have been ported to GPUs to achieve attractive speedups. The range of these applications is not limited to scientific computing or HPC domains. There has been growing use of GPUs in cloud engines, data centers, smart phones, and handheld tablets, giving rise to a new class of applications for these platforms.

The performance requirements of these applications go well beyond traditional thread-level concurrency on GPUs. These applications require maximizing kernel and application throughput and efficiency by fully utilizing the compute resources on the GPU. The varying degrees of parallelism present in this demanding class of applications require us to explore concurrency at multiple levels of execution on the GPU. Future GPUs need to accommodate the growing demands of flexible concurrency in applications, and have to evolve architecturally to fulfill such demands. In this thesis, we will consider how best to exploit concurrency at multiple levels and explore enhancements to GPU architectures and runtime environments to deliver this concurrency.

1.1 Types of Concurrency on GPUs

GPU architectures have been known to support concurrent execution models. The architecture can be enhanced to support concurrency at multiple levels of execution. The different types of concurrency on a GPU include:

1. Thread-level concurrency to use device-level parallelism

2. Kernel-level concurrency for multi-kernel execution

3. Context-level concurrency to support multiple applications on the GPU

Most present-day GPUs are optimized to provide thread-level concurrency, which allows applications to launch thousands of threads on the GPU. Each thread, also known as a work-item, is an instance of the computation kernel. The threads execute in parallel on the large number of compute cores available on the GPU. These threads follow a SIMD execution pattern, where each thread operates on different data while performing the same compute function. Thread-level concurrency on GPUs was primarily designed for graphics computations, where each graphics kernel (shader) performed operations on the pixels in a frame. The same technique has been adopted for general purpose applications on GPUs, which perform computations on large array-based data structures. GPU programming models such as OpenCL and CUDA expose thread-level concurrency to the developer. Applications have been configured to use this data parallelism to achieve promising speedups on the GPU. Support for thread-level concurrency has also prompted many CPU-oriented multi-threaded applications to be ported to GPUs for maximizing execution throughput. In kernel-level concurrency, an application can launch multiple kernels on the same GPU for simultaneous execution. Modern GPU architectures execute multiple data-independent kernels from the same application if sufficient compute resources are available. To facilitate the hosting of kernels, GPUs are equipped with multiple hardware queues, which handle the mapping of kernels to the compute resources.

Applications with multiple kernels for stage-based computation can execute their kernels concurrently on the GPU. The developer has to carefully pipeline the execution of these compute stages. GPU vendors such as NVIDIA support concurrent kernel execution using their Hyper-Q technology. NVIDIA provides 32 hardware queues which can theoretically launch 32 concurrent kernels, given that enough compute resources are available [109]. AMD GPUs use Asynchronous Compute Engine (ACE) units to manage multiple kernels for concurrent execution [95]. Kernel-level concurrency is limited by the availability of compute resources on the GPU, and leads to a fixed static partitioning of the device when executing multiple kernels. Kernel-level concurrency adds another level of execution parallelism over thread-level concurrency. This improves compute resource utilization on the GPU and helps to improve the overall kernel throughput of the application. Application-level concurrency allows multiple host applications to execute their workloads simultaneously on the GPU. Current GPU architectures allow only a single host application to execute a workload on the device at any given instant. Alternative schemes based on interleaved execution and time-sliced execution of multiple applications have been evaluated on the GPU [22, 73]. None of these schemes guarantee the physical sharing of compute resources by simultaneously executing host applications. Application-level concurrency benefits datacenter and cloud engine environments, where the GPU is provided as a service to multiple users. Sharing of the GPU by applications from multiple users could potentially improve resource utilization on the GPU and also increase application throughput.
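To make the API-level view of kernel-level concurrency concrete, the following hedged sketch (added here; not part of the original text) submits two data-independent kernels through two OpenCL command queues bound to the same device. The context, device, and kernel objects `kernelA` and `kernelB` (with their arguments already set) are assumed to exist; whether the kernels actually overlap depends on the hardware queues (e.g., Hyper-Q or ACE units) and on available compute resources.

```c
#include <CL/cl.h>

/* Sketch: enqueue two independent kernels on two command queues so the
 * runtime is free to map them to separate hardware queues.  Error checking
 * is omitted for brevity. */
void launch_independent_kernels(cl_context ctx, cl_device_id dev,
                                cl_kernel kernelA, cl_kernel kernelB,
                                size_t global_size)
{
    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    /* No events order the two submissions, so there is no dependence
     * between them; the device may execute them concurrently. */
    clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &global_size, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &global_size, NULL, 0, NULL, NULL);

    clFinish(q0);
    clFinish(q1);

    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
}
```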

1.2 Benefits of Multi-level concurrency

Computation on GPUs has been traditionally limited to thread-level parallelism for operating on large data sets with thousands of concurrently executing threads. Concurrency at multiple levels of execution provides developers with added flexibility. Kernel-level concurrency can be used for executing several small-sized kernels simultaneously. The Madness framework launches multiple small-sized kernels simultaneously for solving multi-dimensional integrals and differential equations on GPUs [134]. Similarly, multiple kernels of different sizes are used for Biometric Optical Pattern Matching [17] and Speeded-Up Robust Features (SURF) on GPUs [98]. This class of applications typically cannot utilize the entire GPU when executing a single kernel. Many applications use multi-sized kernels, opening up the potential for concurrent execution. Such applications can benefit from using an adaptive policy of resource allocation with respect to the size of the computation. We describe a novel adaptive spatial partitioning mechanism for GPUs in Chapter 4.

The developer is given the freedom to assign the number of resources required by the application, or can let the task be controlled by the runtime. Context-level concurrency empowers administrators and schedulers to implement a flexible approach for physical sharing of GPU resources between applications from multiple hosts. Multi-context sharing of a GPU is extremely useful in cloud engines and data centers.

A major advantage of multi-level concurrency is more effective utilization of compute resources on the GPU. Simultaneous execution of multiple kernels from the same or different application contexts helps in achieving a fully utilized GPU. Dynamic assignment of compute resources at runtime helps to better utilize the GPU. Our proposed mechanism provides a hardware-level approach to achieve runtime-level partitioning and load balancing. For multiple applications executing on the GPU, an advanced memory management mechanism is described in Chapter 5. The mechanism provides each application with a complete view of the GPU memory. It manages data-buffer transfers between the CPU and the GPU using an approach that is transparent to the user. Load balancing is used to avoid GPU memory overflow. Our memory management mechanism employs GPU-controlled and CPU-controlled buffer management, where the GPU can also initiate decisions for data transfers.

Among the multiple advantages of multi-level concurrency, the improvement in overall execution throughput is paramount. Thread-level parallelism provides impressive speedups for applications ported to the GPU, but the addition of kernel-level concurrency adds to the kernel throughput of the applications. Concurrent execution of kernels also reduces the time needed for powering the GPU, potentially achieving considerable energy savings. Data centers and cloud engines have strict energy budgets and demand a highly efficient scheduler to achieve the maximum utilization of resources while minimizing energy consumption. Enabling multiple applications to execute concurrently on GPUs, with physical sharing of compute resources, can further increase overall application throughput, and may also reduce energy consumption. Each level of concurrency is built using the key features of the lower levels of execution. For example, supporting application-level concurrency enables each application to launch multiple kernels that execute simultaneously on the GPU, enabling the launch of a larger number of concurrent threads. In this thesis, we consider two complementary approaches to implement multi-level concurrency. In the first approach, we design the software support and runtime framework that provide the programming constructs for multi-level concurrency. In the second, we implement hardware support on GPUs to efficiently manage concurrency at every level, while maintaining fairness and correctness in execution.


1.3 Challenges in Implementing Multi-Level Concurrency

We describe the challenges for our implementation of the two proposed approaches to enable multi-level concurrency.

1.3.1 Runtime Level Challenges

GPU programming languages such as OpenCL and CUDA provide a rich set of Application Programming Interfaces (APIs) for developing a GPU-based application [3, 100]. Support for kernel-level and application-level concurrency is exposed to the user through a runtime API. New constructs must be defined in the programming framework that allow the user to design applications for the added levels of concurrency. The API libraries of the OpenCL and CUDA runtimes are managed by the Khronos group and NVIDIA, respectively [3, 100]. One important challenge is to conform to the standards defined by the API when implementing new constructs in programming frameworks. The new constructs for multi-level concurrency should provide a simple, yet efficient, mechanism for controlling the behavior of the application on the GPU. In this thesis, we implement the new constructs as a part of the existing API standards. This approach achieves our desired objectives, while using the standards defined for the programming language. The runtime library interprets each of the APIs and generates several system-level tasks to perform the function requested by the API. The system-level tasks are used to send commands to the GPU using a device driver. The management of multiple kernels to support kernel-level concurrency is done at the lower runtime layer. The existing runtime modules do not provide support for the advanced scheduling and memory transfer mechanisms required for kernel-level and application-level concurrency. The runtime modules and the driver modules must be extended in order to take advantage of the multiple levels of concurrency. These extensions have to be implemented such that they do not impact the operational performance or correctness of the existing modules. Our work addresses the challenges posed by the API module, the runtime module, and the driver, developing an efficient solution.

1.3.2 Architectural Challenges

The GPU architecture has evolved over the years to manage computations and graphics using the same Instruction Set Architecture (ISA). The implementation of new features on GPUs and provisions for multi-level concurrency introduce some key architectural challenges. Traditionally, the GPU hardware, which includes the memory hierarchy, compute resources, and schedulers, was designed to execute a single kernel belonging to a single application context at any instant of time.

GPUs would serialize the execution of kernels from a multi-kernel application. This resulted in wasted compute resources and decreased kernel throughput. Modern GPU vendors such as NVIDIA and AMD have overcome this limitation by allowing multi-kernel execution using multiple hardware queues. But the current approaches follow a strict partitioning scheme and do not account for changes in the characteristics of different kernels executing concurrently. Using a fixed partitioning can lead to unfair allocation of resources and can starve small kernels. For true adaptive partitioning of resources, the scheduler on the GPU should be aware of the characteristics of the kernel. This requires changes to the scheduler and resource allocation unit on the GPU. The scheduling mechanism and adaptive partition handling mechanism are discussed in Chapter 4. The current GPU architecture has to be modified to support inter-kernel communication without the need for global memory synchronization. Implementation of an appropriate communication channel requires modifications to memory object management on the GPU.

The next level of concurrency on GPUs focuses on multiple application contexts executing concurrently. Current GPUs do not provide support for executing multiple applications simultaneously. Enabling multi-context execution requires major architectural changes to support dynamic resource management, memory management for each context, and separate address space management for each context. Every context executing on the device must be allocated a set of compute resources which can be dynamically adjusted during runtime. This requires a flexible compute resource reallocation policy to be implemented on the GPU. Each executing context may impose large memory requirements based on the input data-set sizes. A sophisticated memory management mechanism is required to monitor and control the global memory usage on the GPU. The GPU must be extended with a management processor to monitor and control the global memory transactions for multiple contexts. Another architectural challenge for multi-context execution is to isolate the virtual address space of each executing context. The memory hierarchy and address translation mechanism on the GPU have to be modified to allow for simultaneous memory accesses to multiple virtual address spaces.

1.3.3 Scheduling Challenges for Multi-Tasked GPUs

Introducing multi-level concurrency on GPUs allows multiple application contexts and multiple kernels to co-execute on the GPU. While this support is required to improve GPU occupancy and to increase overall system throughput, it also leads to resource management and scheduling challenges. Many applications of varying compute behavior can be simultaneously scheduled on GPUs which support multi-level concurrency. The performance of these applications may depend on multiple shared resources on the GPU.

The applications may contend for compute resources, the memory subsystem, and the interconnect network on the GPU. The framework supporting multi-level concurrency on the GPU must be aware of the resource requirements of the applications, and should reduce the contention for resources. Such contention leads to interference among co-executing applications, which can degrade their individual performance and reduce system throughput. In addition to disparate compute behavior, the applications executing on a multi-tasked GPU can have different priority levels or may require strict Quality-of-Service (QoS) guarantees (e.g., graphics applications). These QoS requirements can be in the form of min/max performance or a dedicated resource allocation. The multi-level concurrency supported on the GPU must be aware of contention and must also provide an interface for applications to declare their QoS goals. The frameworks described in Chapter 6 and Chapter 7 present mechanisms which realize QoS-aware resource management, and machine learning based techniques to predict interference among co-scheduled applications on a GPU.

1.4 Contributions of the Thesis

The key contributions of this thesis are summarized below:

• We design and implement a new workgroup scheduling mechanism and adaptive partitioning handling scheme on a model of the AMD Graphics Core Next (GCN) architecture GPU using Multi2Sim, a cycle-level GPU simulator [148]. We define an OpenCL runtime API interface to expose the programming constructs for adaptive partitioning to the programmer. We design a hardware-level compute resource reallocation scheme to execute adaptive partitioning. To the best of our knowledge, ours is the first work to define, implement, and evaluate an adaptive partitioning scheme on a GPU by providing a combined runtime-level and architecture-level approach.

• Our proposed partitioning mechanism is evaluated using real-world applications taken from the signal and image processing, linear algebra, and data mining domains. For these performance-hungry applications we achieve a 3.1X performance speedup and a 41% improvement in resource utilization, using a combination of the proposed scheduling schemes versus relying on the conventional GPU runtime [150].

• We provide support for multi-context execution using the Multi2Sim simulation framework. We design and implement a hardware- and software-level solution to manage multi-context execution. Our design of Transparent Memory Management (TMM) combines host-based and GPU-based support to manage data access/transfers for multiple executing contexts.


We enhance the Command Processor (CP) on the GPU to manage complex firmware tasks for dynamic compute resource management, memory handling, and kernel scheduling. We evaluate a hardware-based scheme that modifies the L2 cache and TLB to provide virtual memory isolation across multiple contexts.

• We evaluate the performance of our design using workload mixes of real-world applications executed concurrently. Our approach shows an impressive speedup of 1.5X (on average) and 2.1X (peak). Further, we improve compute unit utilization by 19% (on average) and 21% (peak). TMM shows improvements of 22% and 40.7% over previously proposed software-managed and pinned-memory approaches, respectively. We achieve these benefits using a CP core which adds only a 0.25% increase in die area on the GPU and impacts L2 cache area by less than 0.4%.

• We implement Virtuoso, a QoS-aware resource management framework for multi-tasked GPUs. Virtuoso defines a novel hardware- and software-level mechanism for resolving resource contention while providing QoS guarantees for co-executing applications. Virtuoso uses dynamic performance monitoring, runtime analysis, and QoS policies to guide the allocation of multiple hardware resources, including CUs, L2/L3 caches, and DRAM memory.

• Virtuoso improves the execution performance of applications by 17.8%, with an average 20% improvement in CU allocation over spatial multi-tasking. Similarly, the use of a priority-based and performance-based cache allocation mechanism improves cache performance by 1.3X over spatial multi-tasking.

• We realize Mystic, a runtime framework enabling interference-aware scheduling for GPU workloads. Mystic utilizes a collaborative filtering approach from machine learning to accurately characterize an unknown application, and then predicts the interference it would cause when co-scheduled with other executing applications. Mystic uses the predicted interference to guide the co-scheduling of applications on GPUs with multi-level concurrency. It exploits the data obtained from previously executed applications, coupled with an offline-trained application corpus and a minimal signature obtained from the incoming application, to make an informed prediction.

• We evaluate Mystic using 55 distinct real-world applications belonging to a variety of GPU benchmark suites, executing them on a cluster with 32 NVIDIA K40m GPUs. Mystic is able to honor more than 90% of the QoS guarantees for our applications, while improving overall system throughput by 27.5% over interference-oblivious schedulers.


1.5 Organization of the Thesis

The central focus of this thesis is to design and implement architectural enhancements and runtime-level constructs to facilitate concurrency at multiple levels (i.e., thread-level, kernel-level, and application-level) on GPUs. The remainder of the thesis is organized as follows: Chapter 2 presents background information on GPU architecture and advanced features introduced for GPUs, along with a description of the OpenCL programming framework. We describe the Multi2Sim simulation infrastructure, kernel-level concurrency on GPUs, and concurrent context execution for cloud and data-center environments. In Chapter 3, we present related work in GPU modeling and simulation, including work on concurrent kernel execution on GPUs, describing various software and hardware approaches used to achieve such concurrency. We also survey the literature in the area of concurrent application support on a single GPU and runtime-level techniques for virtualizing GPUs in a data center. In Chapter 4, we describe our runtime-based design of adaptive partitioning on GPUs, along with an analysis of workgroup scheduling mechanisms for concurrent kernel execution on GPUs. Chapter 5 describes our combined hardware- and runtime-based implementation of multi-context concurrency on GPUs. We discuss our implementation of a dynamic resource allocator and a sophisticated memory management mechanism to manage multiple contexts on a GPU. Chapter 6 describes the Virtuoso mechanism to achieve QoS-aware resource allocation on a multi-tasked GPU. We present the implementation of our dynamic hardware monitors and priority-guided management scheme for multiple shared resources on the GPU. Chapter 7 details the Mystic mechanism, which provides techniques for interference-aware scheduling on a multi-tasked GPU cluster. Mystic uses state-of-the-art machine learning mechanisms to predict the performance of applications and detect interference among active applications to guide the scheduler.

Chapter 2

Background

In this chapter, we present background information and concepts used throughout this thesis, including GPU architecture, the OpenCL programming framework, the Multi2Sim simulation framework, concurrent kernel execution on GPUs, and multi-context concurrency for cloud engines, data centers, and interoperable graphics and compute.

2.1 GPU architecture

The Graphics Processing Unit (GPU) has been traditionally used as an accelerator for computer graphics rendering. The device architecture of the GPU is primarily designed for efficient execution of applications on the graphics processing pipeline for OpenGL and DirectX [91, 133, 157]. The program executing on the GPU for graphics rendering is called a shader. The GPU Instruction Set Architecture (ISA) supports instructions and hardware for the distinct shaders executed by OpenGL or DirectX. GPU architectures are well suited for processing matrix-based computations which represent a graphics frame. GPUs use a large number of computation cores to facilitate simultaneous processing of multiple pixels of a frame. GPU cores are vector processors supporting Single-Instruction-Multiple-Data (SIMD) style computation. Instructions on the GPU are predominantly vector instructions performed by each core. The massive amount of data parallelism provided by GPUs has attracted developers to port general purpose applications to GPUs. Early on, many scientific computations were ported to the GPU using OpenGL shaders, before the introduction of high-level language programming frameworks such as OpenCL and CUDA [52, 80, 82, 145]. The data for such computations was stored in the form of RGBA pixel data and accessed using texture-based operations in OpenGL and DirectX. The popularity of general purpose computing on GPUs led to the introduction of compute-unified ISAs for GPUs.

The new ISA supported orthogonality, allowing the use of the same instructions for compute and graphics [89, 112]. Modern-day GPUs provide a compute-unified ISA and support specialized programming frameworks (e.g., OpenCL, CUDA, OpenACC) for general purpose computing. The architecture of the latest state-of-the-art GPUs includes hardware modules specifically designed to better handle general purpose computations. We explain the architecture of the latest families of GPUs from AMD and NVIDIA.

2.1.1 AMD Graphics Core Next (GCN) Architecture

The latest AMD GPUs are based on the Graphics Core Next (GCN) architecture [95]. Figure 2.1a shows the basic structure of a GPU with compute units. The GPU consists of many compute units (CUs), as shown in Figure 2.1b, which are the hardware blocks for performing multi-threaded computations. The block-based structure of an AMD GCN CU is shown in Figure 2.2. A kernel executing on a GPU can launch a large number of threads which are scheduled for execution on CUs. The threads are executed on a CU in groups of 64 threads, called wavefronts. The front-end of the CU consists of 4 wavefront pools and a scheduler unit. The scheduler fetches instructions from the oldest wavefront in the wavefront pool. The instructions are decoded and assigned to different functional units. The CU consists of 4 SIMD units, where each unit consists of 16 SIMD lanes (compute cores) which perform vector computation for 16 different threads. Each lane of the SIMD unit performs the same instruction for all the threads in each cycle. Every lane consists of an ALU and a floating point unit. Each SIMD unit is equipped with a 64KB vector register file, for a total of 256KB per CU. The other important functional unit is the branch unit, which handles control-flow instructions. The local data share (LDS) unit manages the instructions dealing with local memory. The scalar unit is a special unit to manage scalar arithmetic and scalar memory operations. Each CU is equipped with 16KB of L1 data cache.

Figure 2.1: Basic GPU architecture (a) Compute device (b) Compute unit (c) SIMD lane.


Figure 2.2: AMD GCN Compute Unit block diagram.

The GPU consists of two types of memory:

• Local Memory: On-chip memory associated with each compute unit on the device. Wave- fronts from separate CUs cannot share the local memory. Local memory accesses exhibit low access latency. 64 KB local memory is available for each CU on an AMD GCN GPU.

• Global DRAM memory: The off-chip DRAM memory on the GPU is called global memory. It can be accessed by all of the CUs on the device. Global memory is large in size (2GB, 4GB, 12GB, etc.). The major use of global memory is to hold the input and output data accessed by the kernel. The global memory exhibits high access latency due to off-chip access. GPUs use a GDDR-5 interconnect for accessing global memory.

The GCN AMD Radeon 7970 GPU includes 32 CUs, while the R9-290X GPU includes 44 CUs.
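A quick arithmetic sketch (added here for clarity, derived from the parameters above): since a wavefront contains 64 work-items and each SIMD unit has 16 lanes, one vector instruction occupies a SIMD unit for four cycles; with four SIMD units per CU, a CU can sustain the issue of roughly one wavefront-wide vector instruction per cycle in steady state.

\[
\frac{64~\text{work-items per wavefront}}{16~\text{SIMD lanes}} = 4~\text{cycles per vector instruction},
\qquad
\frac{4~\text{SIMD units per CU}}{4~\text{cycles}} = 1~\text{wavefront instruction issued per cycle per CU}.
\]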

2.1.2 NVIDIA Kepler Architecture

The latest family of NVIDIA GPUs is the Kepler architecture. The primary unit for computation on an NVIDIA GPU is the Stream Multiprocessor (SMX). The SMX of the Kepler GK110 architecture is shown in Figure 2.3. Each Kepler GK110 SMX unit features 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer ALUs. Threads are executed on an SMX in groups of 32 called warps. The warps are scheduled to execute on the CUDA cores using the warp scheduler present on the SMX.


Figure 2.3: Advanced Stream Multiprocessor (SMX) block diagram for Kepler architecture by NVIDIA [108].

The Kepler SMX uses a quad warp scheduling mechanism which issues and executes 4 warps concurrently. The quad warp scheduler can dispatch two independent instructions per warp using the 8 dispatch modules present on the SMX. Each SMX consists of a 256KB vector register file, and each thread can access up to 255 registers. The Kepler GPU provides 64KB of on-chip memory for each SMX. This memory can be partitioned between shared memory (local memory) and L1 data cache. The Kepler GPU permits three shared-memory/L1-cache partitions: (1) 48KB/16KB, (2) 16KB/48KB, and (3) 32KB/32KB. The total off-chip memory of the Kepler series GPUs varies from 2GB to 12GB and is accessed using a GDDR5 memory interface. The NVIDIA GTX 680, K20m, and K40m GPUs consist of 8, 15, and 16 SMX units, respectively.



2.1.3 Evolution of GPU Architecture

NVIDIA Devices               NVIDIA C2070    NVIDIA K40
Microarchitecture            Fermi           Kepler
Fabrication                  40nm            28nm
Compute Capability           2.0             3.5
CUDA Cores                   448             2880
Core Frequency               575 MHz         745 MHz
Memory Bandwidth             144 GB/s        288 GB/s
Peak Single Precision FLOPS  1288 GFlops     4290 GFlops
Peak Double Precision FLOPS  515.2 GFlops    1430 GFlops
Thermal Design Power (TDP)   247 W           235 W

Table 2.1: Evolution of NVIDIA GPU architecture from Fermi to Kepler.

AMD Devices                  Radeon HD 5870  Radeon HD 7970
Microarchitecture            Evergreen       Southern Islands
Fabrication                  40nm            28nm
Stream Cores                 320             2048
Compute Units                20              32
Core Frequency               850 MHz         925 MHz
Memory Bandwidth             153.6 GB/s      264 GB/s
Peak Single Precision FLOPS  2720 GFlops     3789 GFlops
Peak Double Precision FLOPS  544 GFlops      947 GFlops
Thermal Design Power (TDP)   228 W           250 W

Table 2.2: Evolution of AMD GPU architecture from Evergreen to GCN.

GPU architectures have evolved over the past few years to provide greater compute throughput with lower power consumption. GPUs have increased in performance, architectural complexity, and programmability. This trend can be observed across various generations of GPUs from different vendors [107]. Table 2.1 shows how NVIDIA GPUs have improved from the Fermi to the Kepler microarchitecture in terms of their compute capability, architectural complexity, and memory bandwidth. Computing performance has increased fourfold, while the power consumption (Thermal Design Power, TDP) of the GPU remains comparable. Advances in fabrication technology and power saving strategies have allowed GPUs to achieve superior power-performance (GFlop/Watt) ratings. The advancements in fabrication are also supported by architectural refinements to enhance the compute capability of the GPU.
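As a back-of-the-envelope check on the peak throughput entries in Tables 2.1 and 2.2 (a derivation added here, assuming each core retires one fused multiply-add, i.e., two floating-point operations, per cycle at the listed core frequency):

\[
\text{Radeon HD 7970: } 2048 \times 2 \times 0.925\,\text{GHz} \approx 3789~\text{GFLOPS},
\qquad
\text{K40: } 2880 \times 2 \times 0.745\,\text{GHz} \approx 4290~\text{GFLOPS},
\]

which matches the single-precision figures reported in the tables.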

Newer hardware modules have been added to the GPU architecture to enhance programmability using CUDA and OpenCL. Table 2.2 shows similar improvements that have been made in AMD's GCN architecture for the Southern Islands family of GPUs over its predecessor. The AMD GCN architecture added more compute units, with a very small increase in power consumption over its predecessor. AMD also deprecated the prior Very Large Instruction Word (VLIW) based ISA used in their older GPUs when they moved to the GCN architecture [95]. Modern AMD GPUs have also added hardware support for improving programmability using OpenCL. We highlight some of the hardware units present in modern GPUs, developed for improving execution efficiency of GPU workloads.

• Advanced Hardware Queues: A hardware queue is used to map a workload (kernel) to different SMXs/CUs. Traditionally, GPUs used one hardware queue which managed all the kernels issued to it. This reduced the degree of concurrency on the GPU by imposing a serial execution order on data-independent kernels. The latest GPUs from NVIDIA and AMD added multiple hardware queues to manage the mapping of multiple kernels to SMXs/CUs. The Hyper-Q feature was introduced by NVIDIA on their Kepler GK110 architecture devices [109]. Multiple kernels get mapped to different hardware queues, which can schedule the execution of kernels on the device concurrently. Hyper-Q permits up to 32 simultaneous, hardware-managed, concurrent executions, if the kernels have no data dependencies and the GPU has enough compute resources to support such execution. The multiple streams of an application are handled by separate hardware queues, and data-independent kernels can be executed concurrently. AMD introduced Asynchronous Compute Engines (ACE) on their GPUs as hardware queues to schedule workloads on different compute units [95]. Work from different OpenCL command queues is mapped to different ACE units on the AMD GCN hardware. The ACE units support interleaved execution of compute kernels on the GPU. The latest GPUs in the GCN family include 8 ACE units to manage multi-kernel execution.

• Scalar Unit: The scalar unit is a specialized functional unit added to the AMD GCN architecture. The ISA of the GCN architecture supports special instructions for the scalar unit. The major function of this unit is to prevent wastage of SIMD resources (vector registers) on scalar computations. A computation on a GPU which uses the same data and produces the same result across all threads is called a scalar computation. A common case of a scalar operation is the management of loop variables, which are common across all threads and do not require vector processing using SIMD execution. The scalar unit thus minimizes the wastage of vector registers.


• Nested Parallelism: Nested parallelism is defined as the process of launching child threads from a parent thread. A child kernel performing the same or a different computation can be launched using a thread from its parent kernel. Dynamic Parallelism is an extension of the CUDA and OpenCL programming models [3, 100]. It provides the user with constructs to implement nested thread-level parallelism. The ability to create and control workloads directly from the GPU avoids the need to transfer execution control and data between the host and the device. This reduces the overhead of invoking a GPU kernel from the host CPU. Dynamic Parallelism also offers applications the flexibility to adapt thread assignments to cores during runtime. Nested parallelism enables an application to establish the runtime pattern dynamically through the parent threads executing on the device. Additionally, since child threads can be spawned by a kernel, the GPU's hardware scheduler and load balancer are utilized dynamically to support data-driven applications. A short device-side enqueue sketch follows this list.
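A minimal sketch of the device-side enqueue construct is shown below (added here for illustration), assuming an OpenCL 2.0 device and a host that has created a default on-device queue (CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT); the kernel name and the element-scaling operation are hypothetical.

```c
/* OpenCL 2.0 kernel: a parent kernel spawns a child NDRange on the device,
 * without returning control to the host. */
__kernel void parent(__global float *data, uint n, float factor)
{
    if (get_global_id(0) == 0) {
        queue_t q = get_default_queue();   /* default on-device queue        */
        ndrange_t child = ndrange_1D(n);   /* 1-D child NDRange of n items   */

        /* The block is the child kernel body: each child work-item
         * scales one element of the buffer. */
        enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, child,
                       ^{ data[get_global_id(0)] *= factor; });
    }
}
```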

2.2 The OpenCL Programming Model

The emerging software framework for programming heterogeneous devices is the Open Computing Language (OpenCL) [101]. OpenCL is an industry standard maintained by Khronos, a non-profit technology consortium. Support for OpenCL has been increasing among major vendors such as Apple, AMD, ARM, MediaTek, NVIDIA, Imagination, Qualcomm, and S3. OpenCL terminology refers to a GPU as the device, and a CPU as the host. These terms are used in the same manner for the remainder of this work. Next, we summarize the OpenCL platform and memory models.

2.2.1 Platform Model

The platform model for OpenCL consists of a host connected to one or more OpenCL devices. An OpenCL device is divided into one or more Compute Units (CUs), which are further divided into one or more Processing Elements (PEs). Computations on a device occur within the processing elements. The architecture is shown in Figure 2.4. An OpenCL application runs on a host according to the models native to the host platform. The OpenCL application submits commands from the host to execute computations on the processing elements within a device. The processing elements within a compute unit execute as SIMD units, in lockstep with a single stream of instructions.


Figure 2.4: OpenCL Host Device Architecture

The code that executes on a compute device is called a kernel. OpenCL programs execute on a GPU in the form of kernel(s) that execute on one or more OpenCL devices, and a host program that executes on the CPU. The host program defines the context for the kernels and manages their execution. The kernel defines an index space that is unique to each application. An instance of the kernel executes for each point in this index space. This kernel instance is called a work-item and is identified by its point in the index space, i.e., by a global ID for the work-item. Each work-item executes the same code, but the specific execution pathway through the code and the data operated upon can vary per work-item. Work-items are organized into work-groups. Work-groups are assigned a unique work-group ID with the same dimensionality as the index space used for the work-items. Work-items are assigned a unique local ID within a work-group, so that a single work-item can be uniquely identified either by its global ID or by a combination of its local ID and work-group ID. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit. The index space is called an NDRange, which is an N-dimensional index space, where N is equal to one, two, or three.
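The sketch below (illustrative names only; added here, not part of the original text) shows how these IDs are queried inside an OpenCL C kernel; each work-item applies the same code to a different element of the NDRange.

```c
/* Sketch of the execution-model IDs described above. */
__kernel void scale_elements(__global float *data, float alpha)
{
    size_t gid   = get_global_id(0);    /* unique ID within the NDRange  */
    size_t lid   = get_local_id(0);     /* ID within the work-group      */
    size_t group = get_group_id(0);     /* work-group ID                 */
    size_t lsz   = get_local_size(0);   /* work-items per work-group     */

    /* The global ID can equivalently be derived from the work-group and
     * local IDs (assuming no global offset). */
    size_t derived_gid = group * lsz + lid;
    (void)derived_gid;

    data[gid] = alpha * data[gid];      /* same code, different data     */
}
```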

2.2.2 Memory Model

Work-items executing a kernel have access to four distinct memory regions, as illustrated in Figure 2.5. A short kernel sketch following the list below shows how these regions appear in OpenCL C:


Figure 2.5: The OpenCL memory model.

• Global Memory: This memory region permits read/write access to all work-items in all work- groups. Work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached depending on the capabilities of the device.

• Constant Memory: This memory is a region of global memory that remains constant dur- ing the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.

• Local Memory: This is a region of memory that is local to a work-group. This memory region can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.

• Private Memory: This region of memory is private to a work-item. Variables defined in one work-item's private memory are not visible to other work-items.
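The short OpenCL C kernel below is a minimal sketch (names are illustrative) showing the address-space qualifiers that correspond to the four memory regions listed above.

/* Minimal sketch of the four OpenCL memory regions; names are illustrative. */
__kernel void region_demo(__global   float *in,      /* global memory   */
                          __global   float *out,
                          __constant float *coeff,   /* constant memory */
                          __local    float *scratch) /* local memory    */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    float tmp = in[gid] * coeff[0];    /* 'tmp' resides in private memory */

    scratch[lid] = tmp;                /* visible to the whole work-group */
    barrier(CLK_LOCAL_MEM_FENCE);      /* synchronize the work-group      */

    out[gid] = scratch[lid];
}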

2.2.3 Runtime and Driver Model

Runtime environments for GPU programming frameworks handle the creation, management, and deletion of all the program objects and memory objects required by the application. The host application uses a rich API to communicate with the GPU. The required kernel objects and memory buffers are created using API calls. The transfer of data from the host to the GPU, and the launch of kernels on the GPU, are also managed using the runtime API. The OpenCL framework consists of a runtime environment which supports all the APIs described in the OpenCL standard [101, 100]. While the OpenCL standard is maintained by Khronos, the runtime environment is provided separately by each GPU vendor. The runtime layer interprets the API calls and executes them on the GPU using a device driver for the GPU. Each vendor with an OpenCL-conformant GPU provides a runtime environment which follows the specification and behavior of each API, as described by the standard.

Figure 2.6: (a) CML position in OpenCL runtime stack and (b) context dispatch using CML.

The OpenCL runtime environment can be visualized as a two-level structure, as shown in Figure 2.6a. The upper-level runtime consists of the OpenCL API library and data structures to manage the memory and program objects for the GPU computation. The upper-level runtime creates the user-mode objects for all the memory and program objects. A memory buffer created using the OpenCL API is first registered with the upper-level runtime, which creates a placeholder for the buffer and associates it with the corresponding pointer on the host side. The upper-level runtime is also responsible for creating kernel objects from OpenCL source code. The OpenCL compiler is invoked by the upper-level runtime to compile the OpenCL source code into a machine-readable binary format. The binary is maintained by the runtime for the entire duration of the application, which avoids the need to re-compile each kernel for multiple launches. The data transfer and kernel launch APIs require specific actions from the host to the device. Such commands are passed to the GPU using a software abstraction called the command queue. The upper-level runtime monitors each command queue and communicates with the GPU using a lower-level runtime interface.
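The host-side flow through the upper-level runtime can be summarized by the C sketch below (error handling is omitted; the kernel name, buffer size, and work sizes are illustrative). A buffer is first registered with the runtime, the kernel is compiled and cached, and the launch command is placed on a command queue.

#include <CL/cl.h>

/* Illustrative host-side sketch of the upper-level runtime interactions.
 * 'ctx', 'dev', 'src', and 'host_data' are assumed to exist; sizes are
 * illustrative and error handling is omitted. */
void enqueue_work(cl_context ctx, cl_device_id dev,
                  const char *src, float *host_data, size_t n)
{
    cl_int err;
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

    /* Buffer registration: the runtime creates a placeholder and associates
     * it with the host-side pointer when the write is enqueued. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, n * sizeof(float), NULL, &err);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* Kernel objects: the OpenCL compiler is invoked once and the binary is
     * kept by the runtime for subsequent launches. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "scale_elements", &err);

    /* Launch: the command is placed on the command queue, which the
     * upper-level runtime hands to the lower-level runtime interface. */
    float factor = 2.0f;
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kern, 1, sizeof(float), &factor);
    size_t global = n;
    clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);
}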


The lower-level runtime consists of the OS-level interface and the Application Binary Interface (ABI) for handling the execution of the OpenCL program. The commands for the GPU are communicated by the lower-level runtime using the device driver [127]. The lower-level runtime uses an OS system-call interface, similar to the Intel TBB model, to communicate with the device driver of the GPU [123]. The device driver is invoked in kernel mode (OS controlled), and manages all communication with the GPU over the PCIe interconnect. Figure 2.6b illustrates the implementation of an OpenCL kernel-execution API call using all the layers of the OpenCL runtime.

2.3 Multi2Sim: Cycle-level GPU Simulator

We introduce the Multi2Sim framework, which supports cycle-level simulation of x86 CPUs, ARM CPUs, and AMD GCN GPUs [148]. We use Multi2Sim to implement the hardware- and runtime-based support for multi-level concurrency described in this thesis. Multi2Sim uses a three-stage approach for the architectural simulation of a CPU-GPU heterogeneous system, shown in Figure 2.7. The architectural simulation includes a detailed timing analysis of the microarchitecture and memory hierarchy of the selected CPU and GPU. Multi2Sim provides software layers, defined as the “runtime” and the “driver”, to execute OpenCL applications on different simulated GPUs. We focus on the simulation of an x86 CPU as the host and an AMD GCN Southern Islands GPU as the device to execute OpenCL applications. The three stages of the CPU-GPU simulation are:

1. Disassembler

2. Functional Simulator

3. Architectural Simulator

2.3.1 Disassembler

Given a bit stream representing machine instructions for a specific instruction set architecture (ISA), the disassembler decodes these instructions into an alternative representation that allows a straightforward interpretation of the instruction fields, such as the operation code, input/output operands, and immediate constants. Multi2Sim's disassembler for a given microprocessor architecture can operate autonomously or serve later simulation stages. The disassembler reads directly from a program binary generated by a compiler, such as an x86 application binary, and outputs machine instructions one by one in the form of organized data structures that split each instruction into its constituent fields.


Figure 2.7: Multi2Sim’s three stage architectural simulation [148].

In the case of GPU simulation, the disassembler for the AMD GPU decodes the instructions present in the binary generated after the compilation of the OpenCL source code.

2.3.2 Functional Simulation / Emulation

The purpose of the functional simulator, also called the emulator, is to reproduce the original behavior of a guest program, providing the illusion that it is running natively on a given microarchitecture. To accomplish this, Multi2Sim keeps track of the guest program state and dynamically updates it instruction by instruction until the program finishes. The state of a program is expressed as its virtual memory image and its architected register file. The virtual memory image consists of the set of values stored at each memory location addressable by the program. The state of the architected register file is formed of the values for each register defined in a specific architecture (e.g., x86, GCN). Given a program state associated with a specific point in its execution, the emulator updates it to the next state after consuming one single ISA instruction. This process is done in four steps: i) the new instruction is read from the memory image containing the program's code, at the location pointed to by the architected instruction-pointer register; ii) the instruction is decoded, taking advantage of the interface provided by the disassembler module; iii) the instruction is emulated, updating the memory image and architected registers according to the instruction's opcode and input/output operands; and iv) the instruction pointer is updated to point to the next instruction to be emulated. A simulation loop keeps emulating instructions repeatedly until the program runs its termination routine. The functional simulator also provides an interface for the timing simulation phase, in which the timing simulator requests that it emulate the next available instruction.
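The four-step loop above can be sketched in C as follows; the structures and function names are illustrative assumptions and do not correspond to Multi2Sim's actual internal interfaces.

/* Illustrative sketch of a functional-simulation loop; types and functions
 * are assumptions, not Multi2Sim's real interfaces. */
typedef struct { unsigned int eip; /* ... other architected registers ... */ } regs_t;
typedef struct {
    regs_t         regs;       /* architected register file      */
    unsigned char *mem;        /* virtual memory image           */
    int            finished;   /* set by the termination routine */
} guest_ctx_t;
typedef struct { int opcode; unsigned int size; /* decoded fields */ } inst_t;

void decode(inst_t *inst, const unsigned char *bytes);   /* disassembler interface */
void emulate(guest_ctx_t *ctx, const inst_t *inst);      /* updates guest state    */

void run_emulation(guest_ctx_t *ctx)
{
    while (!ctx->finished) {
        /* i)   fetch the bytes at the architected instruction pointer      */
        const unsigned char *bytes = &ctx->mem[ctx->regs.eip];
        /* ii)  decode through the disassembler interface                   */
        inst_t inst;
        decode(&inst, bytes);
        /* iii) emulate: update the memory image and architected registers  */
        emulate(ctx, &inst);
        /* iv)  advance the instruction pointer to the next instruction     */
        ctx->regs.eip += inst.size;
    }
}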


Figure 2.8: Pipeline model of GCN SIMD unit [148].

The emulation of an x86 OpenCL host program sets up the Southern Islands emulator as a result of the OpenCL API calls intercepted by the OpenCL runtime, such as clGetDeviceIDs, clCreateBuffer, or clEnqueueWriteBuffer. The call that ultimately launches the Southern Islands emulator is clEnqueueNDRangeKernel, which initializes the ND-Range, work-groups, wavefronts, and work-items, and transfers control to the GPU. During emulation, Multi2Sim enters a loop that runs one instruction for each wavefront in all work-groups of the ND-Range at the same time, until the whole ND-Range completes execution.

2.3.3 Architectural Simulator

The architectural simulator is also referred to as the timing simulator in Multi2Sim. The timing simulator provides a cycle-level simulation of an application by modeling each hardware stage of the processor pipeline with variable latencies. Multi2Sim supports timing simulation for multi-core x86 CPUs with superscalar pipeline structures. The modeled hardware includes pipeline stages, pipe registers, instruction queues, functional units, and cache memories. The timing simulator is structured as a main loop that calls all pipeline stages in each iteration; one iteration of the loop models one clock cycle of the real hardware. While hardware structures are modeled in the timing simulator, the flow of instructions that utilize them is obtained from invocations of the functional simulator. The timing simulator requests emulation of a new instruction from the functional simulator, after which the latter returns all information about the emulated instruction, as propagated by its internal call to the disassembler. While the instruction travels across pipeline stages, it accesses different models of hardware resources, such as functional units, effective address calculators, and data caches, with potentially diverse latencies.
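The structure of the timing simulator's main loop can be sketched as follows; the stage functions are illustrative placeholders rather than Multi2Sim's actual code.

/* Illustrative sketch of a cycle-driven timing loop: one iteration models
 * one clock cycle. Stage functions are placeholders. */
int  pipeline_active(void);
void commit_stage(void);
void writeback_stage(void);
void execute_stage(void);
void decode_stage(void);
void fetch_stage(void);   /* internally asks the emulator for the next instruction */

void run_timing_simulation(void)
{
    long long cycle = 0;
    while (pipeline_active()) {
        /* Stages are called in reverse pipeline order so that each stage
         * observes the previous cycle's state of the stage ahead of it. */
        commit_stage();
        writeback_stage();
        execute_stage();
        decode_stage();
        fetch_stage();
        cycle++;
    }
}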


The timing simulator for the GPU model also follows the same methodology, but for multiple compute units. The GPU timing model is based on the design of the GCN compute unit shown in Figure 2.2. Each compute unit in the GPU is replicated with an identical design. The compute unit front-end fetches instructions from instruction memory for different wavefronts, and sends them to the appropriate execution unit. The execution units present in a compute unit are the scalar unit, the vector memory unit, the branch unit, the LDS (Local Data Store) unit, and a set of Single-Instruction Multiple-Data (SIMD) units. The LDS unit interacts with local memory to service its instructions, while the scalar and vector memory units can access global memory, shared by all compute units. The pipeline stages of each functional unit are modeled and are replicated for each compute unit. The modeled pipeline structure for the SIMD unit is shown in Figure 2.8.

2.3.4 OpenCL Runtime and Driver Stack for Multi2Sim

Execution of an OpenCL application using the CPU and GPU simulators requires a fully functional OpenCL runtime. Multi2Sim provides a software stack which models the OpenCL runtime and driver. This software stack creates an interface between the CPU simulator and the GPU simulator, and allows the user to realize a complete heterogeneous system simulation. The four major components of the OpenCL runtime and driver stack are:

• A runtime library is a software module running in the user space, implementing the OpenCL interface. Functions using this interface are invoked by a user program, with the likely ultimate purpose of accessing system hardware. Runtime libraries on a native framework (sometimes referred to as user-mode drivers (UMD)) can be either generic or provided by the hardware vendor. Runtime libraries on Multi2Sim run as emulated code which is a part of the CPU simulation.

• A device driver is a software module running in privileged mode. When the runtime library requires access to the hardware, it invokes the device driver through system calls, such as ioctl, or open/read/write calls using special file descriptors. On a native environment, the code serving these system calls is a kernel module that has direct access to the hardware.

• The Application Programming Interface (API) is the interface between a runtime library and the user program. The Multi2Sim runtime exposes exactly the same API as a native OpenCL runtime, which is why the two can be used interchangeably to satisfy external references in the user program.


Figure 2.9: Multi2Sim Runtime and Driver Stack for OpenCL [148].

• The Application Binary Interface (ABI) is the interface provided between a device driver and the runtime library, and is based on a system call interface. In Multi2Sim, each ABI between a runtime-driver pair is assigned a unique system call code that is not used by the host operating system. The part of the ABI which executes in kernel mode and invokes the device driver is also referred to as the kernel-mode driver (KMD). A sketch of such an ABI call is shown at the end of this section.

Figure 2.9 shows the interaction of the host code with the simulated runtime and driver stack provided by Multi2Sim.
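The C sketch below illustrates how a user-mode runtime library might forward an ABI call to the kernel-mode driver through the system-call interface. The system call number, call code, and argument layout are assumptions for illustration only and are not Multi2Sim's actual ABI.

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Assumed values: a syscall code left unused by the host OS identifies the
 * simulated driver, and a call code selects the ABI operation. */
#define GPU_DRIVER_SYSCALL_CODE  325
#define ABI_NDRANGE_LAUNCH         8

struct ndrange_launch_args {
    unsigned int  kernel_id;
    unsigned int  work_dim;
    unsigned long global_size[3];
    unsigned long local_size[3];
};

static long gpu_abi_call(int call_code, void *args)
{
    /* The user-mode runtime traps into the kernel-mode driver (KMD); in
     * Multi2Sim the simulated driver intercepts this unique syscall code. */
    return syscall(GPU_DRIVER_SYSCALL_CODE, call_code, args);
}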

2.4 Concurrent Kernel Execution on GPUs

The large number of compute resources available on the GPU leads to the problem of efficiently utilizing these resources. Many real-world applications possess multiple kernels for computation [17, 32, 98, 113, 134], and many of these kernels do not fully utilize the available resources [10, 150]. Allowing data-independent kernels to execute concurrently on the GPU can help utilize the resources, as well as increase the kernel throughput of the application.

Figure 2.10: Compute Stages of CLSurf [98].

As mentioned in Section 1.2, the Madness (Multiresolution Adaptive Numerical Environment for Scientific Simulation) framework is a commonly used computational chemistry code that is run on GPUs. It uses the GPU to compute multi-dimensional integrals to simulate an adaptive multiresolution grid of molecules [134]. The integral calculation in the application is implemented as repeated launches of small-sized matrix multiplication kernels. Each integral operates on a small portion of the data set, which leads to the launch of a large number of data-independent kernels on the GPU. The small kernel sizes do not completely utilize the available resources on the GPU, which impacts kernel throughput. Similarly, the Biometric Optical Pattern Matching application has been migrated to run on a GPU [17] using multiple kernels of different sizes. The launch order of these kernels is non-deterministic, and depends on the data set and the resolution of computation requested by the user. Concurrent kernel execution allows such multi-sized kernels to execute simultaneously. This can avoid under-utilization of resources, or starvation of certain kernels due to the launch order. In recent years, researchers have also augmented Linux system services, such as AES encryption and garbage collection, with the services of a GPU. The computations present in these services can be launched on the GPU using multiple kernels of various sizes [93, 141]. These applications can potentially benefit from concurrent kernel execution, where independent services are processed simultaneously. An advanced resource partitioning and scheduling mechanism required to facilitate such concurrency is described in Chapter 4.

clSURF is an OpenCL- and OpenCV-based framework for image processing on GPUs using the SURF (Speeded-Up Robust Features) algorithm [21, 29, 98]. This application uses multiple kernels to provide a pipelined computation over an image or video frame. Each frame passes through 5 stages of computation using 7 kernels, and these kernels are launched multiple times in different stages. The basic computation stages of the clSURF application are shown in Figure 2.10. The kernels used by each stage can be launched concurrently on the GPU. This can improve kernel throughput, as well as provide an efficient pipelined operation for streaming data (e.g., video processing).
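In OpenCL, an application typically expresses this kind of kernel-level concurrency by enqueueing data-independent kernels on separate command queues, as in the illustrative C sketch below (the kernel objects and NDRange sizes are assumed to have been created by the caller).

#include <CL/cl.h>

/* Illustrative sketch: two data-independent kernels enqueued on two command
 * queues so the hardware may execute them concurrently. Whether they overlap
 * depends on the hardware queues and scheduler of the target GPU. */
void launch_independent_kernels(cl_context ctx, cl_device_id dev,
                                cl_kernel small_kernel, cl_kernel large_kernel)
{
    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    size_t small_gsize = 1024;      /* illustrative NDRange sizes */
    size_t large_gsize = 65536;

    clEnqueueNDRangeKernel(q0, small_kernel, 1, NULL, &small_gsize, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, large_kernel, 1, NULL, &large_gsize, NULL, 0, NULL, NULL);

    clFinish(q0);
    clFinish(q1);

    clReleaseCommandQueue(q0);
    clReleaseCommandQueue(q1);
}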

2.4.1 Support for Concurrent Kernel Execution on GPUs

Figure 2.11: Single hardware queue based scheduling for Fermi GPUs [51].

Concurrent execution of kernels on a single GPU was first supported by Nvidia's Fermi GPUs [51]. Fermi GPUs support a scheduling policy similar to a left-over policy, which schedules a complete kernel at a time [113]; the left-over policy only schedules the next kernel if the required number of compute units is available. The Fermi GPUs used a single hardware work queue for scheduling kernels on the device. This commonly leads to false dependencies being introduced between kernels, and does not allow for concurrent execution of data-independent kernels. Figure 2.11 illustrates a scenario of three kernels scheduled for execution on the GPU. K_A executes on the GPU, while K_B waits since it depends on K_A's data. K_C is an independent kernel which could be scheduled for execution, but is not, due to the single hardware queue. Thus, K_C encounters a false dependency on K_B. The single hardware queue policy inhibits the concurrent execution of kernels and limits kernel throughput. The current state-of-the-art GPUs from AMD and NVIDIA have resolved this issue by implementing multiple hardware queues on the GPU (discussed in Section 2.1.3). Nvidia's Kepler GPUs achieve concurrent execution of kernels using Hyper-Q technology [109], while AMD GPUs support multiple hardware queues using the Asynchronous Command Engine (ACE) units defined in the GCN architecture [95]. These multiple hardware queues can manage the scheduling of more than one kernel simultaneously. Figure 2.12 illustrates the scheduling effects of multiple hardware queues. The scheduling mechanism associated with the hardware queues allows an application to use the free/unused compute units, even if they do not completely satisfy the resource requirements of the kernel. The scheduler assigns more resources to the kernel as they become available. The use of hardware queues improves the overall resource utilization of the device and avoids false dependencies between kernels.

Figure 2.12: Multiple hardware queues used for modern AMD and NVIDIA GPUs.

2.5 Concurrent Context Execution on GPUs

The concurrency supported on current GPUs is limited to kernels belonging to the same application context on the GPU. An application context on a GPU defines the virtual memory space created for the execution of one or more kernels. A context is typically created by an application executing on the CPU, as defined by the CUDA and OpenCL programming frameworks. Many applications include only a single kernel, or a few small-sized kernels, that do not fully utilize the massive compute resources on the GPU. For GPUs used in enterprise-level computing, wasted compute resources reduce the power-performance efficiency of the cluster.

2.5.1 Context Concurrency for Cloud Engines and Data Centers

GPUs are used as accelerators on the computing nodes of many leading supercomputers on the Top-500 list [147]. Cloud computing services provide users with flexible options for choosing computing hardware. Cloud computing service providers such as Amazon Web Services1 and Nimbix2 have recently started to offer GPUs as a part of their cloud engines. Such infrastructure allows users to adopt a pay-per-use model, which enables small-scale users to execute their applications on large clusters without the need to own or maintain them. Workloads from multiple users are submitted to the cloud engine to be executed on the GPU. MapReduce applications for big-data analytics [139], NAMD for parallel molecular dynamics

1http://aws.amazon.com/ 2http://www.nimbix.net/

simulation [116], the Madness framework for multi-dimensional integral solvers [134], WRF for numerical weather prediction [97], and the Caffe framework for neural network analysis [67] are a few examples of applications that have been adapted for execution on a cloud engine equipped with a GPU. While the GPU can provide benefits to these applications, many of them are not able to fully utilize the GPU. If these applications could physically share the GPU resources with other applications, we could see a significant increase in application throughput. This would also improve the utilization of the GPUs in a cluster environment. Architectural and software support is required to facilitate concurrent execution of multiple application contexts.

2.5.2 Key Challenges for Multi-Context Execution

Support for multi-context execution on GPUs requires us to consider some major architectural and runtime-level challenges.

Figure 2.13: Average active time of data buffers in applications.

Global memory management: Modern GPU applications use multiple kernels and numerous data buffers to complete their computation. These buffers occupy space in the global memory of the GPU, but not all buffers are frequently used in each kernel. The frequently used data buffers are referred to as persistent buffers. A data buffer is active for the duration of the execution of the kernel that accesses it. Figure 2.13 reports the average active time of data buffers in multi-kernel GPU applications from the Rodinia benchmark suite [32]. We observe that the majority of data buffers are inactive for 48% (on average) of an application's execution time. The space occupied in GPU memory by such buffers can inhibit the launch of multiple contexts due to memory limitations. Sophisticated management of global memory is required to prevent memory consumption by inactive data buffers.

GPU task control: Multi-context execution demands constant interaction between the runtime and the GPU. The global memory management mechanism needs to frequently transfer data buffers between the CPU and the GPU. The status of buffer transfers can be collected by querying the GPU. The runtime layer also queries the GPU for the amount of free resources and the state of applications executing on the GPU. Given the number of queries required and the limited speed of the PCIe bus, timely context management would be hampered and PCIe traffic would be substantially increased. This issue can be addressed by enhancing the controller on the GPU, which can monitor the status of global memory and initiate decisions relative to buffer transfers. The controller can also update the runtime periodically, reporting the state of the GPU. This facilitates efficient CPU-GPU interaction for context management. The multiple contexts executing on the GPU require frequent and timely dynamic task scheduling and resource allocation decisions.

Chapter 3

Related Work

In this chapter, we review related work in the areas of architecture and runtime modeling, scheduling schemes, spatial partitioning, and support for concurrent context execution on GPUs.

3.1 GPU Modeling and Simulation

Researchers have frequently used simulators for designing new architectural features, or for studying the behavior of applications while considering microarchitectural tradeoffs. Functional simulators are generally used for verifying the correctness of designs and for first-order evaluation. Popular functional simulators such as MARS [154], Simics [94], M5 [28], and QEMU [24, 25] are used for emulating instructions from various CPU ISAs (ARM, MIPS, SPARC, x86, etc.). Timing simulators such as SimpleScalar [31], Gem5 [27], PTLsim [160], Multi2Sim [148, 149], and MARSS [114] are used to understand the performance of the CPU microarchitecture, memory hierarchy, and interconnect networks. Timing simulators realize a cycle-level simulation of each microarchitectural unit. Power, energy, and area estimation are required to judge the feasibility of a proposed architectural design. Simulators such as McPAT [85] and Wattch [30] provide detailed analysis of the power consumption and area requirements of a proposed CPU design. Many CPU simulators provide an interface to these power and area estimation tools. Simulation tools have also been developed to analyze the performance of GPU architectures. GPU simulators support the execution of GPU applications developed using CUDA and OpenCL. Similar to CPU simulators, GPU simulators are classified into three categories: (1) functional simulators (i.e., emulators), (2) cycle-level architectural simulators, and (3) power and area simulators.


3.1.1 Functional simulation of GPUs for compute and graphics applications

A functional simulation of the OpenGL graphics pipeline and the D3D pipeline was implemented in the Attila simulator [37]. The Attila simulator consists of a functional simulator for render-based GPUs, such as the AMD R580 and the NVIDIA G80. The simulator provides a driver layer to trap OpenGL calls and obtain render traces from OpenGL. The functional emulation of these traces can be done on the Attila framework. The GPU simulation framework of Attila includes a timing model to estimate the performance of the simulated GPU configuration when rendering a frame. The timing simulator is based on an understanding of GPU hardware from a graphics perspective, and is not validated against any physical GPU. The Attila emulation framework supports both unified and non-unified shader architecture models.

Functional simulators for GPUs are able to emulate a GPU kernel and the operation of each thread. They support emulation of the ISAs belonging to different GPU vendors. The Ocelot functional simulator supports emulation of CUDA applications and provides an implementation of the CUDA Runtime API [75]. Ocelot supports emulation of the NVIDIA PTX (Parallel Thread Execution) virtual ISA, which is used as a device-agnostic program representation that captures the data-parallel SIMT execution model of CUDA applications. Ocelot also supports translation of PTX to other backend targets such as x86 CPUs and AMD GPUs. The Ocelot tool has been used in various studies on GPUs, including software reliability [86], instrumentation and workload characterization [47], and binary translation for cross-vendor execution of CUDA applications [42].

The Barra simulator was designed specifically for emulation of CUDA applications at the assembly language level (Tesla ISA) [35]. Its ultimate goal is to provide a 100% bit-accurate simulation, offering bug-for-bug compatibility with NVIDIA G80-based GPUs. It works directly with CUDA executables; neither source modification nor recompilation is required. Barra can be used as a tool for research on GPU architectures, and can also be used to debug, profile, and optimize CUDA programs at the lowest level. Functional simulators provide a limited amount of detail regarding the architectural behavior of the GPU; their major use is for first-order analysis and for debugging applications.

3.1.2 Architectural Simulation of GPUs

Architectural simulators model the microarchitectural features of the GPU. They provide a detailed cycle-level analysis of the kernel execution on the selected GPU configuration. We use the Multi2Sim simulation framework which provides detailed cycle-level CPU and GPU models for

executing OpenCL applications. The features of Multi2Sim are described in Section 2.3. In this section we discuss other popular frameworks used for cycle-level GPU simulation.

GPGPUsim is a simulation framework for executing CUDA and OpenCL applications on cycle-level models of NVIDIA GPUs belonging to the Fermi class [18, 51]. The simulator provides detailed models of the Single-Instruction Multiple-Thread (SIMT) engines and Streaming Multiprocessors (SMs) of NVIDIA GPUs. The SIMT cores contain separate ALU and memory-unit pipelines. The simulator also models a workgroup and warp (wavefront) scheduler. The main feature of the simulator is its detailed modeling of the interconnect networks, cache memories, and DRAM memory. Along with the cycle-level analysis of kernel performance, the simulator also provides statistics for each module, such as the caches, SIMT units, operand collectors, interconnect networks, and DRAM. The architectural configuration of the GPU can be modified using configuration files in the simulator. The framework also allows the user to describe fine-grained details such as ld/st buffer lengths, SIMT fetch width, and latencies for each cache level and interconnect network. GPGPUsim provides an interface to Booksim, an interconnect network simulator for advanced, in-depth simulation of interconnect networks and topologies [68]. The tool reports a 98.3% correlation with the NVIDIA GT200 GPU and a 97.3% correlation with Fermi hardware for applications from the Rodinia benchmark suite [32]. GPGPUsim does not provide a detailed model for the host CPU of the heterogeneous system. It also does not model the CUDA/OpenCL runtime, and does not include the overheads due to the runtime and communication with the host. Our mechanisms for multi-level concurrency presented in this thesis depend on the GPU architecture, as well as the runtime environment of OpenCL. GPGPUsim is therefore not a good fit for our proposed research.

Gem5-GPU: The gem5-gpu simulation framework is an amalgamation of the gem5 CPU simulator and the GPGPUsim GPU simulator [119]. Gem5 provides cycle-level models of x86, Alpha, SPARC, and ARM CPUs. It provides detailed models for the processor pipeline stages, cache hierarchy, and DRAM memory. The gem5-gpu tool uses a specialized interface to glue the CPU and GPU simulations together. It provides host code simulation using the gem5 model, while the CUDA kernel is simulated using the GPU model of GPGPUsim. Gem5-gpu routes most memory accesses through Ruby, which is a highly configurable memory system in gem5. By doing this, it is able to simulate many system configurations, ranging from a system with coherent caches and a single virtual address space across the CPU and GPU, to a system that maintains separate GPU and CPU physical address spaces. The simulation model traps and emulates the CUDA/OpenCL calls originating from the host code in the gem5 model. The GPU simulator is initiated when a kernel

starts execution. The simulation framework provides details about the host-code execution, but does not simulate the OpenCL/CUDA runtime. Hence, the effects of the runtime environment cannot be captured using the framework. Our proposed mechanism relies heavily on accurate simulation of the OpenCL runtime, and so our work cannot be evaluated using the gem5-gpu simulator.

3.1.3 Power and Area Estimation on GPUs

Prior research has suggested modifications to the microarchitecture of the GPU which require analysis of power consumption and area overheads. The GPUWattch framework is a detailed power simulator for GPUs [84]. The model uses a bottom-up methodology which abstracts parameters from the microarchitectural components as model inputs. It also ensures that both program-level and microarchitecture-level interactions are captured during execution, thereby enabling new power-management techniques specifically targeted at GPUs. The stability of the model is examined by validating the power model against measurements of two commercial GPUs (GTX 480 and Quadro FX 5600). The power model modifies and extends the McPAT CPU power-model simulator [85] to model the power of contemporary GPGPUs, and drives the modified McPAT version with GPGPUsim [18], a cycle-level GPU simulator. The area estimation of a microarchitectural feature is also provided by the modified version of the McPAT tool. Statistical and empirical power models have also been developed for GPUs from AMD and NVIDIA [70, 92, 102, 162]. Work by Diop et al. provides an empirical power model for AMD GPUs using the Multi2Sim simulation framework [40]. Their work captures the activity factors of different modules of the GPU using Multi2Sim statistics, and provides a power estimate using a linear-regression-based model.

3.2 Design of Runtime Environments for GPUs

One of the goals of a runtime system is to simplify application development and provide support for task scheduling. OpenCL, due to the architectural abstraction it provides, is commonly referred to as a virtual device driver [46]. Sun et al. demonstrate software-based support for task scheduling across CPU and GPU devices [140]. Synchronization requirements and the disparity in computational capacity when targeting multiple OpenCL compute devices make writing scalable code challenging. Runtimes targeting multiple heterogeneous systems have been developed [14, 135], which are similar to equivalent CPU-based frameworks such as Intel's TBB [124]. The profiling and specialization extensions discussed in this thesis are complementary to such frameworks.


GPU architectures are presently accessed via calls to a driver. However, the driver software stack imposes an overhead on kernel launch on modern GPUs. Driver overhead is caused by events such as switching to and from kernel and user space [128]. This software architecture prevents the GPU from being utilized for lightweight tasks and operating system services such as encryption [128]. Runtimes such as Bothnia allow access to compute devices such as GPUs without the overhead of the driver stack [34]. Other runtimes, which provide the application developer with abstractions such as a dataflow programming model managed by an operating system, have been shown to simplify data management between kernel and user space [128]. One of the common goals of runtimes targeting heterogeneous systems is to perform different data transformations on user data in order to simplify programming. The goal of the data transformations is usually to improve data locality for memory performance, or to manage host-to-device I/O. Related work in runtime data transformations on GPUs is usually specific to molecular dynamics [158]. However, recent approaches in data transformations and application runtimes have begun to utilize the CPU as a symbiotic resource to facilitate the execution of GPGPU programs on fused CPU-GPU architectures [159].

3.3 Resource Management for Concurrent Execution

Multi-level concurrency requires efficient management of compute and memory resources on the GPU. In this section, we discuss prior work relevant to the resource management mechanisms described in this thesis.

3.3.1 Spatial Partitioning of Compute Resources

Multi-level concurrency requires a fair mechanism for allocating compute units to the executing kernels/contexts. Adriaens et al. proposed spatial partitioning of compute resources across concurrent applications using heuristics such as best obtained speedup, best profiling results, and equal partitioning of compute units [10]. They developed their heuristics by simulating several kernels from different applications running in parallel. Their work focuses on evaluating the benefits of spatial multitasking, rather than proposing mechanisms to implement it. Tanasic et al. introduce flexible spatial partitioning of GPU resources to support preemption on GPUs [143]. Their partitioning mechanism requires explicit assignment of compute units (CUs) to kernels, which is then modified in the presence of additional workloads. Our approach, described in Chapter 4, provides the choice of using explicit assignment of CUs, or runtime-controlled assignment with superior load-balancing.


Several software techniques have been proposed in the past to establish kernel-level and context-level concurrency. Wang et al. proposed an approach of context funneling to support multi-application execution on GPUs [155]. Their approach provides interleaved execution of the applications by using the same shared context. They propose the use of multiple CUDA streams (or OpenCL command queues) to launch different kernels belonging to different applications. Real-time and adaptive applications have used multiple OpenCL command queues to establish concurrency across multiple kernels and extract performance benefits from the device [78, 153]. Kayiran et al. propose an integrated concurrency management strategy that modulates thread-level parallelism on GPUs to control the performance of both CPU and GPU applications [74]. Another software-based approach is a technique known as Kernel Fusion, which statically transforms the code of two kernels into one; the single fused kernel is launched with an appropriate number of workgroups. Guevara et al. propose such a system, which chooses between the execution of fused kernels and sequential execution [56]. Gregg et al. present KernelMerge, which occupies the entire GPU and dynamically invokes kernels for execution [54]. They show that the resource allocation of concatenated kernels limits performance. We extend their evaluation by independently scheduling the workgroups of each kernel. The three-dimensional NDRange in OpenCL and CUDA has also been used in a prudent manner to increase concurrency between kernels. Ravi et al. rely on the molding technique, which changes the dimensions of the grid and workgroups while preserving the correctness of the computation [121]. The molding technique allows the user to execute different computations in different dimensions, thereby establishing concurrent execution. Pai et al. proposed a similar technique and an associated code transformation, based on iterative wrapping, to produce an elastic kernel [113, 137]. The elastic kernel implements the same strategy of using dimensional execution to establish concurrency. These techniques rely on the developer or on compiler transformations to prepare for concurrent execution. The mechanisms suggested for concurrent kernel and context execution in this thesis describe a hardware-level implementation on the GPU.

3.3.2 QoS-aware Shared Resource Management

Resource sharing for contention management and QoS has long been studied in real-time embedded environments [129, 130], networking [16, 55], and CMP architectures [9, 62, 163]. Notable approaches have considered dynamic contention management for interconnect networks, memory controllers, and caches in CMP architectures [9, 55, 77, 105]. Kim et al. use fairness as an objective for shared caches [76], whereas Iyer et al. propose QoS-aware shared caches for


CMPs [62]. Similarly, Qureshi et al. implement a QoS-aware dynamic spill receiver for capacity sharing in last-level caches in CMPs [120]. For GPUs, Kyu et al. dynamically adapt the priorities of GPU memory requests to meet deadlines in an MPSoC environment [65]. Their work is focused on reducing memory bandwidth contention on GPUs by rescheduling memory accesses. Aguilera et al. implement resource allocation on GPUs for concurrent applications with QoS requirements [11]. While their study focuses on CU allocation, it does not consider the effects of memory resource contention on QoS. None of the existing work considers multi-resource contention management, nor does it evaluate the effects of multi-resource management on QoS for applications on a multi-tasked GPU.

3.3.3 Workload Scheduling on GPUs

Scheduling of workgroups and wavefronts on compute units requires special attention when implementing kernel and context concurrency. We describe a novel workgroup scheduling scheme in Chapter 4 of this thesis. In this section, we highlight some important related work in the area of workload scheduling on GPUs. NVIDIA introduced the Multi-Process Service (MPS) to support the execution of multiple MPI applications on the GPU [2]. MPS uses a time-sliced scheduler to dispatch kernels from different MPI applications to the GPU. Context switching and multiple-kernel management is done using the Hyper-Q technology on the device. Several other techniques have been proposed to implement time-multiplexing on the GPU. The kernel-slicing technique launches a transformed kernel several times, and includes a launch offset so that each slice performs only a part of the computation [19, 113, 121]. Bautin et al. and Kato et al. proposed approaches, such as TimeGraph, that focus on the scheduling of graphics applications. Their work allows a GPU command scheduler to be integrated in the device driver. They monitor the GPU commands using the CPU host to dynamically schedule OpenGL graphics commands on the GPU using different scheduling strategies [20, 72]. Kato et al. also design the RGEM software runtime library, which is targeted at providing responsiveness to prioritized CUDA applications by scheduling DMA transfers and kernel invocations [71]. RGEM implements memory transfers as a series of small transfers to effectively hide the transfer latency. The Gdev framework is developed using the same approach, but integrates the runtime support for GPUs in the OS [73]. Rossbach et al. design the PTask approach to schedule GPU workloads by making the OS aware of the GPU using a task-based dataflow programming approach [127]. None of the described strategies can control the assignment of workgroups to the compute units of the GPU, as they do not target the hardware schedulers of the GPU.


Fine-grained scheduling of kernel wavefronts (or warps) using various control parameters has been suggested in the past. Rogers et al. suggested a wavefront/warp scheduling mechanism which is data-cache aware [125]. Their mechanism provides a hardware-based implementation to detect cache thrashing and reschedule active warps. As an improvement over cache-aware scheduling, Rogers et al. proposed divergence-aware scheduling of warps [126]. Similar warp/wavefront scheduling schemes have been proposed by Fung et al., who regroup threads that diverge on a branch [48]. Another approach to warp scheduling was presented by Narasiman et al., who use two-level scheduling to improve utilization on the GPU [103]. They also tackle the issue of occupancy on the GPU by proposing the use of large warps.

3.3.4 GPU virtualization for Data Center and Cloud Engines

Virtualization is typically used to enable sharing of a GPU in a cloud environment. Becchi et al. proposed a runtime-level implementation in combination with VM-based cloud computing services to enable virtualization of GPUs [22]. Their runtime provides a software-level abstraction, allowing for isolation of concurrent applications which share the GPU. Li et al. suggest a GPU resource virtualization approach for multi-context execution [87], similar to NVIDIA MPS [2]. In this approach, the runtime virtualization layer manages the underlying compute and memory resources, and data transfers. Ravi et al. implement different scheduling algorithms to concurrently execute GPU applications on a cluster node [122]. Their approach also considers executing an application on the CPU, as opposed to the GPU, to improve utilization of the cluster node. Other approaches for GPU sharing provide support to virtualize a physical GPU into multiple logical GPUs by adding a scheduling layer in the OS (rCUDA [43], GDev [73]). These approaches manage memory transfers and kernel execution on the GPU using a software scheduler transparent to the user. They provide a software-level abstraction for sharing a GPU without implementing actual GPU partitioning. The absence of physical resource sharing can introduce false dependencies between applications and can reduce overall application throughput. Apart from GDev [73] and rCUDA [43], other studies on GPU virtualization and server management include GVim [57], vCUDA [132], Pegasus [58], and gVirtus [50]. GVim and Pegasus support QoS-aware scheduling, but do not address scheduling workloads across multiple GPUs. Similarly, vCUDA and rCUDA provide a good multiplexing mechanism using a CUDA runtime library, but do not provide scheduling strategies for GPU clusters.


3.3.5 Scheduling on Cloud Servers and Datacenter

Interference-aware scheduling has been at the center of the work on datacenter and cloud server schedulers. For CPU-based servers, Nathuji et al. present a system-level control mechanism which considers the interference effects of caches, hardware prefetching, and memory while co-executing applications [104]. Delimitrou et al. introduced QoS-aware scheduling for data centers using machine learning for interference detection. Their work on the Paragon scheduler presents the advantages and accuracy of using machine learning, through recommendation systems, for providing estimates of heterogeneity and interference of CPU-bound workloads [38]. The work by Mars et al. showed that the performance of Google workloads can degrade by up to 2x when co-executing multiple applications [96]. They propose an offline combinatorial optimization technique to guide a scheduler to tune datacenter workloads. The recommendation strategy presented by Paragon detects interference by comparing application performance against microbenchmarks for each contending resource. Their work implements a stochastic gradient descent (SGD) model to predict performance values [38]. Their interference classification does not consider the cumulative effect of all resources. There has been an increasing focus on techniques to improve scheduling on GPU clusters and cloud servers. Some of these heterogeneous schedulers, including StarPU [15], use a learning model to detect available resources and schedule different optimized versions of the same application targeted for different resources. The work by Sengupta et al. presents the Strings scheduler, which decouples CPU and GPU execution, and guides scheduling using feedback about execution time, GPU utilization, data transfer, and memory bandwidth utilization from each GPU [131]. Phull et al. discuss an interference-aware scheduler for GPUs, using a static profile of applications to guide the co-location of applications [117].

3.3.6 Address Space and Memory Management on GPUs

Concurrent execution of contexts requires efficient management of memory and address-space isolation. Memory management should ensure that contexts are not stalled due to memory contention or overcommitment. We discuss some important prior work in the area of memory management on GPUs for concurrent execution. The virtualization scheme for GPUs developed by Li et al. provides an effective approach using host-pinned memory for asynchronous data transfers and kernel execution [87]. While pinned memory accesses provide better I/O bandwidth for single-context execution, over-allocation of pinned memory may lead to performance degradation due to increased network traffic when running multiple contexts. Work by Gelado et al. describes a memory management scheme for their


Asymmetric Distributed Shared Memory (ADSM) mechanism to handle shared virtual memory transactions on GPUs [49]. Their approach for shared data transfers is transparent to the user and is completely controlled in software on the host side. Their approach achieves performance comparable to programmer-managed memory transfers, without the need for code modifications. Our memory management work implements a technique partly controlled by the GPU, and does not make excessive use of pinned memory. We compare the performance of these notable works with our approach in Chapter 5. Jog et al. proposed a memory scheduling mechanism for improving performance when running multiple applications on the GPU. Their analysis was limited to studying the interaction of two-kernel applications with the memory hierarchy and the resulting global memory bandwidth utilization [69]. Pichai et al. study the impact of TLBs on cache-conscious wavefront scheduling, and evaluate performance when TLBs and page table walkers are placed before the private caches [118, 125]. Their work provides a good foundation for exploring address translation techniques on the GPU. NVIDIA GRID enhances the MMU hardware to keep the virtual address spaces of multiple processes physically separate for GPUs in the cloud [4].

Chapter 4

Adaptive Spatial Partitioning for Concurrent Kernel Execution

Modern GPU workloads demand flexible techniques to schedule multiple kernels on the device. It has become increasingly important to support effective sharing of a single GPU across parallel applications. In order to leverage the benefits of GPUs on multicore platforms, a mechanism which allows for high utilization and elastic resource allocation is required. High Performance Computing (HPC) clusters that include GPUs have an even stronger case for exploring dynamic allocation and deallocation of compute resources when applications need to share GPU resources [50]. In these scenarios, if we want to maximize application throughput, a kernel may need to adjust its own resource demands in order to leave room on the GPU to service the needs of other kernels executing concurrently.

As described in Chapter 2, concurrent execution of kernels was first supported on Nvidia's Fermi GPUs. The left-over policy supported on Fermi GPUs [113] only schedules the next kernel if the required number of compute units is available. To broaden the scope of device-level parallelism, Nvidia's Kepler GPUs included support for concurrent execution of kernels using Hyper-Q technology [109]. Hyper-Q employs multiple hardware queues. However, the number of concurrent kernels executing on the GPU is limited by the number of available compute units, which leads to a fixed partitioning of the GPU. The opportunity to schedule small kernels is lost in the presence of a dominant, compute-hungry kernel. AMD GPUs also support multiple hardware queues using Asynchronous Command Engines (ACEs) [95]. To address these issues, we introduce an adaptive spatial partitioning mechanism. Our mechanism dynamically changes the compute-unit allocation according to the resource requirements of the kernels scheduled on the device. This leads to better utilization of the GPU and avoids starvation of small-sized kernels. The fine-grained control required for this implementation cannot be achieved at the source-code level of an application. This has motivated us to develop OpenCL runtime-level support for concurrent execution using adaptive partitioning. We expose the options for adaptive and fixed partitioning to the user through the OpenCL runtime API. OpenCL 1.1 allows the compute resources of the GPU to be divided into logical collections of compute units known as sub-devices [101]. We use multiple OpenCL command queues and sub-devices to submit workloads to the same GPU.

The remainder of this chapter is organized as follows. Section 4.1 describes how we map multiple command queues to several sub-devices. Section 4.2 describes our adaptive partitioning mechanism. Section 4.3 details the different workgroup scheduling mechanisms and partitioning policies. Sections 4.4 and 4.5 present our evaluation methodology and discuss the performance of adaptive partitioning, respectively.

Figure 4.1: High-level model of a device capable of executing multiple command queues concurrently. The lists allow for flexible mapping of NDRange workgroups to compute units on the Southern Islands device.

4.1 Mapping Multiple Command Queues to Sub-Devices

To enable scheduling of workloads with multiple NDRanges enqueued on multiple command queues, enhancements are made to the workgroup scheduling subsystem. These modifications are made to the scheduler of the simulated architectural model of the AMD Southern Islands (SI) GPU. The OpenCL sub-device API (clCreateSubDevices) is used to assign the requested number of compute units to a sub-device. Multiple command queues are mapped to different sub-devices to submit workloads for execution. Each sub-device maintains a record of the command queues mapped to it. Figure 4.1 shows the described mapping scheme of multiple command queues to sub-devices; a host-side sketch of this mapping is shown after the list below. An NDRange enqueued for execution on a sub-device utilizes the compute units assigned to that specific sub-device. This leads to a fixed spatial partitioning of the GPU. Later, we relax this constraint and describe a methodology for adaptive partitioning. The workgroups of an NDRange can be scheduled on the compute units of a sub-device using our proposed workgroup scheduler, as shown in Algorithm 1. The implementation uses lists to track the availability of compute units of a sub-device, and to track the workgroups of the NDRanges mapped to that sub-device.

1. Available CU List: a list of compute units (CUs) where workgroups of active NDRanges can be dispatched. This list includes all of the compute units present on the GPU. The available CU list can be changed to study the effects of power gating of CUs.

2. Pending Workgroup (WG) List: a list of workgroups of active NDRanges which are yet to be assigned a CU for execution. The list is queried by the scheduler to assign pending workgroups whenever a new CU becomes available, or when a CU has enough resources to map a new workgroup.

3. Usable CU List: a list of CUs assigned to each sub-device. It is a subset of the available CUs on the CU list. NDRanges mapped to a particular sub-device use CUs as described in the usable list. The CUs which have not exhausted their allocated resources are present on this list. Once a CU belonging to this list exhausts all its resources, it is marked as full and cannot be used to assign additional workgroups.
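The host-side sketch referred to above shows one way an application might create two fixed-size sub-devices and bind one command queue to each using the standard OpenCL device-fission API; the compute-unit counts are illustrative, and the adaptive/fixed flags added by our runtime are not shown.

#include <CL/cl.h>

/* Illustrative sketch: partition a GPU into two fixed-size sub-devices and
 * map one command queue to each. The CU counts (20 and 12) are illustrative. */
void create_subdevice_queues(cl_device_id gpu,
                             cl_context *ctx_out, cl_command_queue queues[2])
{
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS, 20, 12,
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
    };

    cl_device_id subdevs[2];
    cl_uint num_subdevs = 0;
    clCreateSubDevices(gpu, props, 2, subdevs, &num_subdevs);

    cl_int err;
    cl_context ctx = clCreateContext(NULL, 2, subdevs, NULL, NULL, &err);

    /* One command queue per sub-device: NDRanges enqueued on a queue use
     * only the compute units assigned to that sub-device. */
    queues[0] = clCreateCommandQueue(ctx, subdevs[0], 0, &err);
    queues[1] = clCreateCommandQueue(ctx, subdevs[1], 0, &err);
    *ctx_out = ctx;
}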

Our scheduler implementation can be described as a round-robin scheduling scheme. Once a workgroup is scheduled to a compute unit, its local memory requirements and wavefront resource requirements do not change, which facilitates the use of a round-robin algorithm. Algorithm 1 assumes a fixed workgroup-level granularity for populating the wavefront pools of compute units. We provide resource checking to verify whether a workgroup can be mapped to a compute unit: we check the available local memory, wavefront pool entries, vector registers, and scalar registers before scheduling a workgroup on a compute unit. This scheduling step differs from the scheduling mechanism used on older GPU systems, where the resource requirements of all workgroups could be calculated prior to kernel dispatch (occupancy). The scheduler starts with all the workgroups of the NDRange marked as pending. Algorithm 1 acquires the usable CU list and the pending workgroups for a given NDRange. A workgroup is mapped only if the required resources are available.


Algorithm 1 Workgroup scheduling mechanism to map workgroups to usable compute units.
1: Input: Usable Compute Unit (CU) Pool
2: Input: Active NDRanges
3: Output: Mapping of Workgroups to Compute Units
4: for all Active NDRanges do
5:     Get NDRange.Usable CUs
6:     WG = Get NDRange.Pending Workgroups
7:     for all NDRange.Pending Workgroups do
8:         k = ListHead( NDRange.Usable CUs )
9:         Schedulable = Check Resources(k, WG)
10:        if Schedulable == True then
11:            Map Workgroup (k, WG)
12:            Remove WG from NDRange.Pending Workgroups
13:        end if
14:    end for  // Break after one scheduling iteration for each workgroup
15: end for  // Break if no active NDRanges

The scheduling step is repeated for each active NDRange and continues until all NDRanges finish execution.

4.2 Design of the Adaptive Spatial Partitioning Mechanism

The command queue mapping and workgroup scheduling scheme for fixed partitioning of the GPU was described in Section 4.1. This scheme can lead to resource starvation for small-sized NDRanges in applications possessing multiple NDRanges which could execute concurrently. We extend the fixed spatial partitioning mechanism to support adaptive partitioning of resources (i.e., compute units) across different sub-devices. The OpenCL sub-device API (clCreateSubDevices) is modified to support two new flags that specify either (1) an adaptive or (2) a fixed partitioned sub-device. The compute units and the NDRanges that belong to an adaptive/fixed sub-device are referred to as adaptive/fixed compute units and adaptive/fixed NDRanges, respectively. This added support for adaptive partitioning allows the runtime to allocate compute units to sub-devices based on the size of the NDRange. Figure 4.2 presents our handler for adaptive partitioning on a GPU. The adaptive partitioning handler is invoked whenever a new NDRange is scheduled for execution or an active


Figure 4.2: Flowchart describing the mechanism for handling adaptive partitioning of a GPU in the OpenCL runtime.

NDRange completes its execution on the GPU. The adaptive handler includes three modules: (1) a Dispatcher, (2) an NDRange Scheduler, and (3) a Load Balancer.

• Dispatcher: The dispatcher maintains the Pending List of NDRanges queued for execution on the device. If the pending list is not empty, the NDRange which is at the head of this list is scheduled for execution. If the pending list is empty, the Load Balancer is invoked to reassign resources to the active NDRanges.

• NDR Scheduler: The NDR scheduler is run by the dispatcher whenever an NDRange begins execution on the device. The sub-device property of the NDRange is checked by the scheduler. For a sub-device with the fixed property, the scheduler checks for available compute units from the free list, as well as from the adaptive compute units. If the number of available compute units is less than the number requested, the associated NDRange is added to the pending list. The Load Balancer is invoked once a fixed sub-device has been assigned the requested number of compute units, or if the incoming NDRange is labeled as adaptive.

• Load-Balancer: The Load-Balancer manages resource allocation for all of the adaptive NDRanges on the device. It forms sets of compute units proportional to the size of each adaptive NDRange. After mapping compute units to the adaptive NDRanges, the load balancer executes the workgroup scheduler, as described in Algorithm 1. The load balancer never reallocates all of the compute units of an active NDRange; it maintains at least one compute unit for each active adaptive NDRange. An incoming adaptive NDRange is added to the pending list to avoid reallocation away from an active adaptive NDRange. Our mechanism ensures that there is at least one compute unit available for each adaptive NDRange executing on the device. A long-running adaptive NDRange can be terminated to prevent blocking of resources. Such termination is done using a timer, which can be configured by the user; the default value of the timer is set to 10 billion GPU cycles. A minimal sketch of the proportional compute unit allocation follows this list.
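The sketch below illustrates the proportional allocation performed by the Load-Balancer, assuming the share of each adaptive NDRange is derived from its workgroup count. Function and variable names are illustrative, and the handling of rounding leftovers is one possible choice rather than the runtime's exact policy.

/* Allocate 'adaptive_cus' compute units across 'num_ndr' active adaptive
 * NDRanges in proportion to their sizes (workgroup counts), keeping a floor
 * of one CU per active NDRange. Assumes adaptive_cus >= num_ndr, which the
 * load balancer guarantees by queueing incoming NDRanges otherwise. */
void balance_adaptive_cus(const long *ndr_size, int num_ndr,
                          int adaptive_cus, int *cus_out)
{
    long total = 0;
    for (int i = 0; i < num_ndr; i++)
        total += ndr_size[i];

    int assigned = 0, largest = 0;
    for (int i = 0; i < num_ndr; i++) {
        int share = (int)((adaptive_cus * ndr_size[i]) / total);
        cus_out[i] = (share > 0) ? share : 1;   /* never starve an NDRange */
        assigned += cus_out[i];
        if (ndr_size[i] > ndr_size[largest])
            largest = i;
    }
    /* Give any CUs left over from integer rounding to the largest NDRange. */
    cus_out[largest] += adaptive_cus - assigned;
}

Under this kind of split, the arrival of a second adaptive NDRange is what shrinks the first NDRange's allocation, as in the case study of Section 4.2.2.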

4.2.1 Compute Unit Reassignment Procedure

Figure 4.3 shows the steps involved in the transfer of one compute unit between kernels. In the detection phase, the CU to be transferred is identified using the adaptive partitioning mechanism. The CU is then locked, disallowing any more workgroups from being scheduled to it. The Quiesce stage allows the compute unit to complete the execution of its already active workgroups; the duration of this stage is variable and depends on the computation being executed by the kernel. Once the CU is free, it is registered with the adaptive handler (NDR Scheduler). The Transfer stage transfers the CU to another kernel and updates the resource tables. As a last step, the CU is unlocked and can accept workgroups from the new kernel.
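The reassignment can be viewed as a small per-CU state machine. The sketch below, with illustrative structure and state names, walks a CU through the lock, quiesce, and transfer steps described above; it is a simplified model, not the simulator implementation.

#include <stdbool.h>

typedef enum { XFER_DETECTED, XFER_QUIESCING, XFER_DONE } cu_xfer_state_t;

typedef struct {
    int  id;
    int  active_workgroups;    /* workgroups still running on this CU       */
    bool accepts_workgroups;   /* false while the CU is locked for transfer */
    int  owner_kernel;         /* kernel allowed to schedule onto this CU   */
} cu_t;

/* Advance the transfer of one CU by a single step and return the new state. */
cu_xfer_state_t cu_transfer_step(cu_t *cu, cu_xfer_state_t state, int new_kernel)
{
    switch (state) {
    case XFER_DETECTED:
        /* Lock: the adaptive handler selected this CU; stop accepting
         * new workgroups from the current kernel. */
        cu->accepts_workgroups = false;
        return XFER_QUIESCING;

    case XFER_QUIESCING:
        /* Quiesce: wait for the already active workgroups to drain. */
        if (cu->active_workgroups > 0)
            return XFER_QUIESCING;
        /* Transfer: update ownership (resource tables) and hand the CU to
         * the new kernel, then unlock it. */
        cu->owner_kernel = new_kernel;
        cu->accepts_workgroups = true;
        return XFER_DONE;

    default:
        return XFER_DONE;
    }
}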


Figure 4.3: Procedure to initiate compute unit reallocation between kernels.

Figure 4.4: Case study to demonstrate the adaptive partitioning mechanism.

4.2.2 Case Study for Adaptive Partitioning

Figure 4.4 shows a case where our adaptive partitioning mechanism is applied. In the first stage, only one NDRange executes on a sub-device with the adaptive property. The load-balancer allocates 22 compute units to NDRange0, as shown in Figure 4.4a. A second NDRange on a different adaptive sub-device is scheduled for execution in the second stage, as shown in Figure 4.4b. The load balancer allocates 12 compute units to NDRange1, based on the ratio of the sizes of all adaptive NDRanges (i.e., NDRange0 and NDRange1). This results in a change in allocation for two compute units (CU#20 and CU#21), which are reassigned from NDRange0 to NDRange1.


The reassigned compute units are blocked from scheduling new workgroups from NDRange0. The mechanism allows both compute units to complete execution of the previously scheduled workgroups from NDRange0. The compute units of the GPU cannot be pre-empted once a workgroup is scheduled for execution. Hence, the adaptive partitioning mechanism implements a strict blocking policy once an active compute unit is reassigned. NDRange1 can schedule workgroups to CU#20 and CU#21, but only after there are no pending workgroups from NDRange0 running on either compute unit. Figure 4.4c shows a successful mapping of both NDRanges after adaptive partitioning is performed.

4.3 Workgroup Scheduling Mechanisms and Partitioning Policies

Figure 4.5: Workgroup scheduling mechanisms: (a) Occupancy-based scheduling and (b) Latency-based scheduling. Kernel A consists of 8 workgroups, which are mapped to the compute units (CUs) using one of the two scheduling mechanisms.

Next, we introduce two workgroup scheduling mechanisms for dispatching pending workgroups to the available compute units.

1. Occupancy-based: The scheduler maps pending workgroups of an NDRange to each available compute unit. The scheduler only moves to the next compute unit when the maximum allowable number of workgroups per compute unit is mapped, or when the compute unit expends all of its resources. This mechanism assures maximum occupancy of a compute device and can help preserve resources.

2. Latency-based: The scheduler iterates over the compute units in a round-robin fashion and assigns one pending workgroup to each usable compute unit. The scheduler does not consider whether the compute units are fully occupied. The mapping process continues until all pending workgroups are assigned. This mechanism ensures that each usable compute unit does useful work on each GPU cycle and minimizes compute latency. The two dispatch orders are contrasted in the sketch that follows this list.
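The difference between the two mechanisms is easiest to see as two dispatch loops. The sketch below uses an illustrative per-CU capacity model (slots[] and MAX_WGS_PER_CU are assumptions); it is not the simulator's scheduler, only a contrast of the two traversal orders.

#define NUM_CUS        8
#define MAX_WGS_PER_CU 4

static int slots[NUM_CUS];   /* workgroups currently mapped to each CU */

/* Occupancy-based: fill one CU up to its limit before moving on. */
void schedule_occupancy(int num_wgs)
{
    int wg = 0;
    for (int cu = 0; cu < NUM_CUS && wg < num_wgs; cu++)
        while (wg < num_wgs && slots[cu] < MAX_WGS_PER_CU) {
            slots[cu]++;     /* dispatch workgroup 'wg' to CU 'cu' */
            wg++;
        }
}

/* Latency-based: hand one workgroup to each CU in round-robin order,
 * regardless of how occupied the CUs already are. */
void schedule_latency(int num_wgs)
{
    int wg = 0;
    while (wg < num_wgs)
        for (int cu = 0; cu < NUM_CUS && wg < num_wgs; cu++) {
            slots[cu]++;     /* dispatch workgroup 'wg' to CU 'cu' */
            wg++;
        }
}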

Each of these scheduling schemes is illustrated in Figure 4.5. We also introduce three partitioning policies that users can adopt to suit the requirements of their applications.

• Full-Fixed partitioning: This policy maps the NDRanges of each application to a sub-device with the fixed property. Kernels using the full-fixed partitioning policy are limited to the compute units allocated to their sub-device; the compute unit allocation for a particular sub-device cannot be changed. Users benefit from a dedicated compute unit allocation, but can also suffer from a lack of flexibility as new compute units become available.

• Full-Adaptive partitioning: Full-adaptive partitioning maps the different NDRanges of an application to sub-devices possessing the adaptive property. In this policy, the compute units are assigned to each sub-device by the load-balancer, as described in Section 4.2. An NDRange using adaptive partitioning is allowed to occupy the entire GPU in the absence of any other NDRange.

• Hybrid partitioning: The hybrid partitioning policy uses a combination of adaptive and fixed sub-devices for mapping the different NDRanges of an application. The user can choose to allocate only a few NDRanges to a fixed sub-device and allow the other NDRanges to be mapped adaptively to sub-devices. Allocating the performance-critical NDRanges to a fixed partition, while allocating adaptive sub-devices to the other NDRanges, should yield performance benefits.


4.4 Evaluation Methodology

4.4.1 Platform for Evaluation

Our evaluation requires fine-grained control over workgroup scheduling, NDRange creation, and support for OpenCL sub-devices on the targeted platform. We have implemented multiple command queue mapping, a new workgroup scheduler, and an adaptive partitioning handler using Multi2Sim. The runtime layer provided by Multi2Sim includes our library that extends the OpenCL API. Support for simulating multiple command queues and sub-devices is added within the OpenCL runtime layer. Resource usage tracking within the compute units, and information regarding the runtime requirements of each workgroup, is added to the driver layer. To execute applications on Multi2Sim, we link our program with the Multi2Sim OpenCL runtime library. We simulate the AMD GCN-based HD Radeon 7970 GPU model in the results presented here. The details of the configuration of this simulated GPU are shown in Table 4.1.

Device Configuration
    # of CUs                         32
    # of Wavefront Pools / CU        4
    # of SIMD Units / CU             4
Compute Unit Configuration
    # of lanes / SIMD                16
    # of vector registers / CU       64K
    # of scalar registers / CU       2K
    Frequency                        1 GHz
Memory Architecture
    L1 cache (1 per CU)              16 KB
    # of shared L2 caches            6
    L2 size                          128 KB
    Global memory                    1 GB
    Local memory / CU                64 KB

Table 4.1: The device configuration of the AMD HD Radeon 7970 GPU.

We also study the behavior of multiple command queue-based applications on devices equipped with Nvidia's Hyper-Q technology. The evaluation for this study is performed on Nvidia Tesla K20c GPU hardware.

4.4.2 Evaluated Benchmarks

The benchmarks described in this section are used to evaluate the potential of the different partitioning policies. Existing suites such as Rodinia and Parboil are not suitable for our study since they only utilize single command queue mapping [32, 136]. We use benchmarks representing real-world applications from different domains of computation. Each of the benchmarks uses multiple command queues which map to unique sub-devices.

1. Matrix Equation Solver (MES): The application is a linear matrix-based solver implemented on GPUs [134]. The benchmark evaluates the equation C = α(AB^-1) ∗ β((A + B) ∗ B), where A, B, and C are square matrices. The computation is done in three parts: (a) C0 = α(AB^-1), (b) C1 = β((A + B) ∗ B), and (c) C = C0 ∗ C1. Parts (a) and (b) execute in parallel using different sub-devices. Part (c) is computed using the entire GPU device.

2. Communication Channel Analyzer (COM): The application emulates a communication receiver with 4 channels. Each channel receives the same data from the sender and processes the data in different ways.

• Channel 0: Performs lossless packet processing using a series of data manipulations.
• Channel 1: Performs lossy packet processing using Gaussian elimination and reduction over the data packet.
• Channel 2: Performs priority-based packet processing over the received data (lossless for high-priority data and lossy for low-priority data).
• Channel 3: Scrambles the received data packet with a pseudo-random code (PRC).

Each channel is assigned a separate command queue and executes on different compute units.

3. Big Data Clustering (BDC): The application belongs to the domain of big-data analysis and uses three computational kernels. Each kernel is mapped to a different command queue. The first kernel performs clustering of the input data, involving all of the data points; this kernel results in a large NDRange size with long-running computations. The second kernel is a reduction kernel, where each thread operates over 20 clusters to reduce those clusters to 100 representative points. The third kernel is a sorting computation which sorts the representative points. All three kernels have different NDRange sizes.

4. Texture Mixing (TEX): This application belongs to the graphics and image processing domain. TEX mixes three different textures to form one blended texture output [41]. The first kernel operates on an image to extract the background using a filtering computation; the background kernel has the largest NDRange size. The second kernel performs blurring of a texture to form a saturation hue for the final image. The third kernel extracts the foreground from a texture to apply to the final image. All the kernels have different NDRange sizes and are mapped to separate command queues for simultaneous execution.

5. Search Application (SER): A search application benchmark from the Valar benchmark suite [99]. The application performs a distributed search using two kernels. The kernels are mapped to different command queues for parallel execution.

4.5 Evaluation Results

In this section, we present results evaluating the performance of the proposed adaptive spatial partitioning, command queue mapping, and workgroup scheduling mechanisms.

4.5.1 Performance Enhancements provided by Multiple Command Queue Mapping

We evaluate the execution performance of the Set 1 benchmark applications when applying multiple command queue mapping for a fixed-partition run on both a simulated AMD device and an actual Nvidia K20 GPU with Hyper-Q. We use single queue mapping as the baseline for evaluating the speedup provided by multiple command queues on both platforms. The execution performance is shown in Figure 4.6. COM enqueues 4 kernels to the same device by mapping each kernel to 8 compute units on the SI GPU. The MES application enqueues 2 kernels by mapping each of them to 16 compute units on the AMD Southern Islands device. Both applications experience a performance gain of 2.9x over the single-queue implementation. Applications using multiple command queues on the fixed-partition SI device exhibit an average performance speedup of 3.1x over the single command queue. Concurrent execution of kernels overlaps the computation of the kernels and reduces the execution time of the application. The applications using multiple command queues on the K20 GPU show an average performance gain of 2.2x over a single queue implementation. The multiple command queues on the Kepler GPU are mapped to different hardware queues using the Hyper-Q feature and are executed concurrently. We see significant benefits for applications consisting of kernels which do not saturate GPU resources. Applications whose kernels are all optimized to saturate the GPU see only marginal performance improvements.


Figure 4.6: Performance improvement when using multiple command queues mapped to different sub-devices on the simulated SI GPU and K20 GPU.

Figure 4.7: Memory hierarchy of the modeled AMD HD 7970 GPU device.


Figure 4.8: L2 cache efficiency for applications implemented using multiple command queues for Set 1 benchmarks on the simulated SI GPU.

4.5.2 Effective Utilization of Cache Memory

We explore the performance of the L2 cache on a simulation model of the SI GPU using multiple command queues. Our L2 cache study is motivated by the fact that the L2 is the first shared memory resource (L1s are private on a GPU). As shown in Figure 4.7, each compute unit has its own L1 cache unit, while all of the CUs share six L2 cache units. Applications using multiple command queues execute on different sets of compute units. Kernels which utilize the same input data benefit from the shared L2 caches.


The L2 cache efficiency is evaluated in Figure 4.8. Applications such as MES, COM and BDC show a 22% improvement in the L2 cache hit rate when using multiple command queues. As the global writes are visible to the L1 and L2 caches on SI GPUs, the single command queue implementation flushes the L2 cache after completion of the kernel [60]. Such L2 cache flushes do not occur when using multiple command queues. This improves the cache hit-rate of concurrent kernels operating on the same input data. SER and TEX enqueue kernels which operate on different input data. Hence, the kernels do not access the same data and exhibit a cache hit-rate similar to their single command queue implementation. BDC schedules three different kernels which use the same data buffers, and hence shows an improvement in cache hit rate.

4.5.3 Effects of Workgroup Scheduling Mechanisms and Partitioning Policies

Next, we evaluate the performance of the workgroup scheduling and partitioning policies described in Section 4.3. The performance of occupancy-based scheduling for the different partitioning schemes is shown in Figure 4.9a. Full-fixed partitioning yields superior performance compared to single command queue implementations, as seen in Figure 4.6. But fixed partitioning also limits the applications to using only a fixed set of compute units for each NDRange. An active NDRange cannot use all the available compute units on the GPU in the fixed partition, even if it is the only active NDRange on the device. This leads to poor occupancy on the GPU.

Figure 4.9: Execution performance of benchmarks for different workgroup scheduling mechanisms and different partitioning policies.

Using full-adaptive partitioning lets the runtime handle the assignment of compute units to the sub-devices. It leads to better occupancy on the GPU by adapting to the resource requirements of the NDRanges. All applications (except for BDC) show an average improvement of greater than 26% in execution performance using full-adaptive partitioning. The BDC application shows a 32% degradation in performance compared to full-fixed partitioning. The largest NDRange in BDC is a long-running kernel executing lengthy computations. This NDRange occupies a large number of compute units when using adaptive partitioning, resulting in a shortage of compute units for the other, smaller NDRanges present in the BDC application.

The hybrid partitioning approach assigns the smaller NDRanges in the applications to adaptive sub-devices and the larger NDRanges to fixed sub-devices. As a result, BDC and TEX show an improvement in execution performance of 35.7% and 23%, respectively (as compared to full-adaptive partitioning). Hybrid partitioning produces a compute unit assignment similar to full-fixed partitioning for applications which use similar-sized NDRanges for their computations. Thus, MES and SER (which have similar-sized NDRanges) do not receive any significant performance advantage when using hybrid partitioning. In contrast, COM shows a performance improvement of 16.6% over adaptive partitioning when using hybrid partitioning. This improvement is attributed to the improved load-balancing of the adaptive NDRanges in the COM application. Assigning large NDRanges to adaptive sub-devices in hybrid partitioning may lead to over-subscription of that sub-device, thereby degrading the performance of the large NDRange.

Latency-based scheduling yields the best performance when using the full-fixed and hybrid partitioning policies, as seen in Figure 4.9b. The compute units assigned to a fixed sub-device are not re-assigned during execution. This enables latency-based scheduling to use all compute units as effectively as possible in each fixed sub-device when applying hybrid and full-fixed partitioning. Latency-based scheduling experiences some degradation in performance for full-adaptive partitioning. The first NDRange scheduled for execution occupies the entire GPU, and the subsequent NDRanges do not receive any free compute units for scheduling their computation. This leads to a delay due to the re-assignment of compute units from the first NDRange in the case of full-adaptive partitioning. The first NDRange of the BDC application is a large, long-running kernel and experiences a 96% increase in execution time (cycles) versus using full-fixed partitioning with latency-based scheduling. All other applications show an average increase of 45.2% in execution time (cycles) for full-adaptive partitioning using latency-based scheduling.

4.5.4 Timeline Describing Load-Balancing Mechanisms for the Adaptive Partitioning

The load-balancer described in Section 4.2 allocates the compute units of the GPU to different adaptive sub-devices according to the ratio of NDRange sizes. The allocation of compute units considers the number of active adaptive NDRanges on the GPU. The process of re-allocating compute units is described in Section 4.2.


Figure 4.10: Timeline showing the re-assignment of compute units for each NDRange of the applications (a) COM, (b) TEX, and (c) SER using full-adaptive partitioning.


Figure 4.11: Breakdown of time taken for each stage of CU transfer.

All applications execute using the full-adaptive partitioning policy. Figure 4.10 shows the timeline for the allocation of compute units to the different NDRanges of an application. NDR0 (i.e., NDRange 0) is the first to execute on the GPU for all the applications. NDR0 arrives when the GPU is unoccupied and hence receives a large number of compute units (20 CUs for COM, 22 CUs for SER, and 22 CUs for TEX). As new NDRanges arrive on the device, the load-balancer re-assigns the compute units. An equal number of compute units is assigned to the same-sized NDRanges in COM and SER. As observed in Figure 4.10b, the NDRanges of different sizes in TEX are allocated different numbers of compute units when the applications reach a steady state. The load-balancer re-assigns the compute units whenever an NDRange completes execution. As seen in Figure 4.10a, NDR2 and NDR3 in the COM application are assigned additional compute units as NDR0 and NDR1 complete their execution. The same effect is observed for NDR1 of the SER application and for NDR2 and NDR3 of the TEX application. The time taken to complete the re-assignment of compute units from one NDRange to another can also be observed in Figure 4.10. The average overhead incurred while re-assigning one compute unit across two NDRanges for all the applications is 9300 cycles. Figure 4.11 shows the breakdown of the time involved in transferring a compute unit from one kernel to another.


4.6 Summary

The previous sections highlight the performance gains produced when employing multiple command queue mapping. A performance speedup of 3.4x is obtained with the use of multiple command queue mapping, as observed in Section 4.5.1. Benchmarks operating on shared data buffers showed increased L2 cache hit rates of more than 72%. The adaptive partitioning of compute units results in performance gains and better utilization of device resources. Applications with large NDRanges (e.g., BDC) obtained better performance with the hybrid and fixed partitioning schemes. The different partitioning mechanisms enable users to select the policy appropriate for their application. A developer with no inherent knowledge of the device architecture can rely on adaptive partitioning for their application. Developers with a better understanding of the device architecture can select the hybrid or fixed partitioning policy to tune their application for maximum performance.

Chapter 5

Runtime and Hardware Assisted Multi-Context Execution

We explored the limitations of kernel-level concurrency on current GPUs in Chapter 4. The concurrency supported on current GPUs is limited to kernels belonging to the same application context on the GPU. An application context defines the virtual memory space created for the execution of one or more kernels. Many applications include only a single kernel or a few small kernels that do not fully utilize the massive compute resources on the GPU. For GPUs used in datacenter and cluster environments, such wastage of compute resources reduces the power/performance efficiency of the cluster. The concurrency mechanisms present on current GPUs should be expanded to allow for simultaneous execution of multiple application contexts, leveraging any or all unused resources. New and efficient mechanisms are required to manage memory and compute resource allocation to enable multiple application contexts on GPUs. In this chapter, we describe a novel hardware- and runtime-level implementation to support the execution of multiple application contexts on the same GPU. We develop the Context-Management-Layer (CML) as a part of the OpenCL runtime to manage multiple contexts launched for execution on the device. The CML and the Command Processor (CP) present on the GPU are responsible for handling data transfers for each context executing concurrently. The CP is also responsible for the creation, execution, and destruction of contexts. We extend the role of the CP to perform additional functions such as memory management, compute resource management, and kernel scheduling. We characterize the firmware tasks on the GPU to better understand the required design complexity of the CP. We pursue a hardware-based scheme to provide virtual address space isolation in the shared L2 caches and TLB for multi-context execution. The CML, along with the CP and the memory isolation mechanism, provides an efficient framework to host multiple contexts on a GPU.


The remainder of the chapter is organized as follows: Section 5.1 discusses the motivation for our work. The architectural and runtime modifications to support the proposed design are described in Section 5.2. The details of resource management using the CML and CP are provided in Sections 5.3 and 5.4. Section 5.5 presents the evaluation methodology, Section 5.6 presents the evaluation results, and conclusions are presented in Section 5.8.

5.1 Need for Multi-Context Execution

Figure 5.1: Normalized execution time and average resource utilization for the Rodinia kernels on NVIDIA and AMD GPUs.

From our previous study described in Chapter 2 and Chapter 4, we know that Hyper-Q technology and Asynchronous Compute Engine (ACE) units are used to support concurrent kernel execution on NVIDIA Kepler and AMD GCN GPUs, respectively [95, 109]. We evaluate the efficiency of these concurrent kernel execution mechanisms on modern GPUs. Figure 5.1 shows the execution time of six distinct multi-kernel benchmarks, selected from the Rodinia suite [32], normalized to the K10 for the NVIDIA GPUs and to the HD7970 for the AMD GPUs. The NVIDIA K10 and the AMD HD7970 do not support concurrent kernel execution, whereas the NVIDIA K40 and AMD R9-290X GPUs provide support for concurrent kernel execution using hardware queues. Each pair of GPUs from the same vendor possesses a comparable number of compute cores, but differs in its support for concurrent kernel execution. When running concurrent kernels, we observe an average speedup of 1.39X on the NVIDIA K40 and 1.31X on the AMD R9-290X GPU. Although we observe impressive speedups, most of the applications do not fully utilize the compute resources (compute units, global memory, and hardware queues) on the GPU. The average resource utilization reported by vendor profiling tools [12, 5] is 67% on the NVIDIA K40 and 64% on the AMD R9-290X. The issue of unused compute resources can be partially addressed by allowing multiple application contexts to execute concurrently on the GPU. Physical sharing of GPU resources with other applications can result in better resource utilization, which can lead to an overall increase in application throughput.

Implementing multi-context execution faces key challenges in global memory management and GPU task control (discussed in Chapter 2, Section 2.5.2). Apart from these challenges, simultaneous execution of multiple contexts requires isolation of memory state and architectural state between contexts. This requires either logically or physically partitioning the compute resources, and management of the L2 cache and TLB. We implement hardware-based L2 cache and TLB management mechanisms.

5.2 Architectural and Runtime Enhancements for Multi-Context Execution

In this thesis, we consider the support needed to allow for multi-context execution using OpenCL, a programming framework for heterogeneous computing maintained by Khronos [100]. The OpenCL runtime environment can be visualized as a two-level structure, as shown in Figure 5.2a. The upper-level runtime consists of the OpenCL API library and data structures to manage memory and program objects for the GPU computation. The lower-level runtime consists of the OS-level interface and the Application Binary Interface (ABI) for handling the execution of the OpenCL program.

5.2.1 Extensions in OpenCL Runtime

We implement a Context Management Layer (CML) which monitors and performs the necessary bookkeeping for each context to be executed on the device. The CML is an extension to the low-level runtime of OpenCL. Figure 5.2a shows the operation of the CML with respect to the runtime and driver stack of the GPU. The CML maintains information about the contexts created on the host through data structures maintained in the upper-level runtime. In addition to context dispatch, the CML is also responsible for the movement of data buffers between the host and the device for each executing context. Our CML layer implements a sophisticated Transparent-Memory-Management (TMM) mechanism using the copy engines (implemented using the DMA units present on the GPU) and the command processor (CP) present on the GPU to control data buffer transfers to/from the GPU.


Figure 5.2: (a) CML position in OpenCL runtime stack and (b) context dispatch using CML.

We extend the OpenCL context creation API to provide support for multi-context execution. Using the new extensions, an OpenCL context can be explicitly identified for shared execution on the GPU. This provides GPU programmers with the option of indicating the minimum resource requirement for their applications. The clCreateContext API is extended to support three additional properties for context sharing.

• cl_context_shared_75: Creates an OpenCL context allotted 75% of the compute resources available on the GPU.

• cl_context_shared_50: Creates an OpenCL context allotted 50% of the compute resources on the GPU.

• cl_context_shared_25: Creates an OpenCL context allotted 25% of the compute resources on the GPU.

While we select a set of initial allocations based on the maximum number of contexts we evaluate in this thesis (i.e., a maximum of 4 concurrent contexts), these values could easily be modified to support a larger number of contexts. The values provide initial allocation shares that are under programmer control. If a sharing allocation is not specified, the context is treated as a non-shared context and is allotted 100% of the GPU resources. A context specifying an allotment of less than 100% can be promoted to utilize all available resources on the GPU. Users with knowledge of the resource requirements of their application can choose a specific sharing property. Other users (e.g., those with no knowledge of their resource requirements) can select the minimum sharing value (cl_context_shared_25) and rely on our dynamic allocation mechanism to assign additional resources if available. We use this set of sharing allocations in our evaluations to analyze the impact of varying resource utilization on memory behavior and application throughput while executing multiple concurrent contexts.
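As an illustration of the extended API, the fragment below creates a context with the cl_context_shared_50 property. The numeric token value and the property encoding (a standalone flag in the property list) are assumptions made for this sketch; only the property names come from the text above.

#include <CL/cl.h>
#include <stddef.h>

/* Placeholder token value for the thesis-defined property (value assumed). */
#define cl_context_shared_50 ((cl_context_properties)0x5002)

/* Create a context that initially requests 50% of the GPU's compute
 * resources; the CML may later promote it if more resources are free. */
cl_context create_half_share_context(cl_platform_id platform, cl_device_id gpu)
{
    const cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform,
        cl_context_shared_50,                 /* assumed flag-style property */
        0                                     /* property list terminator    */
    };
    cl_int err;
    cl_context ctx = clCreateContext(props, 1, &gpu, NULL, NULL, &err);
    return (err == CL_SUCCESS) ? ctx : NULL;
}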

Context Dispatch using CML: Each context dispatched for execution on the GPU can issue commands to the device using the OpenCL command queue-based APIs. The CML manages the dispatch of these OpenCL contexts for execution on the GPU. As shown in Figure 5.2b, the CML maintains one context queue for dispatching both shared contexts and non-shared contexts. The dispatcher implements a FCFS policy for selecting contexts for execution. The dispatcher checks for the availability of compute resources before selecting which contexts to dispatch. If the available compute resources are insufficient for a selected context, the dispatcher waits for the GPU to release resources.
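A minimal sketch of this FCFS dispatch logic is shown below, with an illustrative context structure; the real CML also tracks memory and hardware-queue availability, which is omitted here.

#include <stddef.h>

typedef struct context {
    int requested_cus;        /* from the sharing allocation (25/50/75/100%) */
    struct context *next;
} context_t;

typedef struct {
    context_t *head, *tail;   /* single FCFS context queue                   */
    int free_cus;             /* compute units currently unallocated         */
} cml_dispatcher_t;

void cml_enqueue(cml_dispatcher_t *d, context_t *ctx)
{
    ctx->next = NULL;
    if (d->tail) d->tail->next = ctx; else d->head = ctx;
    d->tail = ctx;
}

/* Returns the next context to launch, or NULL if the head of the queue must
 * wait for the GPU to release resources (strict FCFS: no reordering). */
context_t *cml_try_dispatch(cml_dispatcher_t *d)
{
    context_t *ctx = d->head;
    if (!ctx || ctx->requested_cus > d->free_cus)
        return NULL;                       /* wait for resources          */
    d->head = ctx->next;
    if (!d->head) d->tail = NULL;
    d->free_cus -= ctx->requested_cus;     /* reserve the context's share */
    return ctx;
}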

Figure 5.3: Address space isolation using CID for (a) L2 cache lines and (b) TLB entries.

5.2.2 Address Space Isolation

When executing multiple kernels from a single context, the virtual address space for all the kernels is the same. But for multi-context execution, the memory hierarchy has to add support to enforce memory protection between different address spaces, preventing address space aliasing [63]. GPUs implement memory hierarchies with two forms of address translation [118]. The memory hierarchy can be accessed using a virtual address, with translation being performed in the memory controller. Alternatively, the address translation can be performed at the private levels of the hierarchy in the compute unit of the GPU. Current AMD GPUs use the first method of managing address translation, relying on the memory controller. For effective sharing of caches and the TLB, a static partitioning of the cache has been proposed in past work [83]. An alternative approach is to include context-specific information in the GPU compute units [143]. We use the latter approach to understand the effects of multi-context execution on the memory hierarchy of the GPU, and avoid the complex additional hardware required for implementing static cache partitioning [83]. We include a 2-bit context-identifier (CID) tag for address space isolation in the shared L2 cache lines and TLB segments. The 2-bit CID supports 4 concurrent contexts. We include a 16-bit register for each compute unit to save the CID, which is appended to each L2 access issued from that compute unit. We also include a page table register for each context, to allow efficient page-table walks on a TLB miss. The additional tag bits and registers add some storage overhead to the memory hierarchy. The CID length can be increased to support additional concurrent contexts. Figure 5.3 illustrates our proposed CID-based tagging for the L2 cache and TLB.
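The following sketch shows how the 2-bit CID extends an L2 lookup: a line hits only if both the address tag and the CID carried with the request match. Structure and function names are illustrative; the same tagging applies to TLB entries.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;
    uint8_t  cid   : 2;   /* 2-bit context identifier (4 concurrent contexts) */
    uint8_t  valid : 1;
} l2_line_t;

typedef struct {
    l2_line_t *ways;
    int        num_ways;
} l2_set_t;

/* Hit only if the address tag matches AND the line belongs to the same
 * context as the requesting compute unit (cu_cid). */
bool l2_lookup(const l2_set_t *set, uint64_t tag, uint8_t cu_cid)
{
    for (int w = 0; w < set->num_ways; w++) {
        const l2_line_t *l = &set->ways[w];
        if (l->valid && l->tag == tag && l->cid == (cu_cid & 0x3))
            return true;
    }
    return false;
}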

5.3 Resource Management for Efficient Multi-Context Execution

5.3.1 Transparent Memory Management using CML

Along with context scheduling, the CML also handles data transfers between the host and the GPU device for every context. This Transparent-Memory-Management (TMM) mechanism is implemented using the CML, the device driver, the copy engines, and the CP. The TMM maintains information about all data buffers created in each context in the form of lists. It monitors the command queues to determine the order of kernel execution, and also records the number of kernels accessing each data buffer. The TMM also ensures that the data buffers required by each kernel are present on the GPU when kernel execution begins. On the GPU side, the CP keeps track of the data buffers on the GPU and monitors the global memory usage. Figure 5.4 illustrates the TMM-controlled data transfer procedure from host to device, and from device to host, involving the CP and the CML. The TMM in the CML manages the host-to-device transfer of data buffers. The TMM checks the context list and kernel list to determine the active kernel. It chooses the buffer required for that kernel from the buffer list and fetches the data associated with that buffer from system memory. The copy engines are used to transfer the data to the GPU global memory. On completion of the transfer, the CP inserts the buffer into the buffer list on the GPU side. This action completes the transfer from the host and signals the availability of the buffer on the GPU. The CP also analyzes whether the new buffer is a hot-buffer or a cold-buffer, and updates the respective buffer list. The transfer of cold data buffers is initiated by the CP using the buffer lists.


Figure 5.4: Combined host and CP interaction for TMM.

Data is fetched from global memory and transferred to the host using the copy engines. Once the buffer is stored in host memory, the CML updates the buffer list to indicate the presence of the buffer on the host side. The host-to-device and device-to-host transfers can occur simultaneously using multiple copy engines, under the control of the CML and CP, respectively. All data transfers can be overlapped with application execution on the GPU, either reducing or eliminating the cost of communication. The hardware-software-based management of TMM ensures efficient data transfer control for multi-context execution. Hiding communication overhead is another motivation to pursue concurrent context execution.

Figure 5.5 shows the operation of the TMM for three contexts scheduled for execution on the GPU. This example shows the state of the GPU physical memory, the executing NDRanges, and the state of the copy engines. At time t0, the copy engine transfers the data required for the execution of the first kernel belonging to each context. At time t1, the copy engine transfers the data buffers (c1, d1) to the device, which are required for executing future kernels (0_k1, 1_k1, and 2_k1).


Figure 5.5: TMM operation managing three concurrent contexts.

The data buffers (e.g., a1 and b2) are maintained on the device, as they are required by the next kernel. At times t2 and t3, we can see the activity of the TMM mechanism when executing three concurrent contexts.

At t2, the copy engine transfers the cold data buffers (a0, b1, a2) back to the host. At time t4, only one kernel (0_k3) is executing on the device, and the copy engine concurrently transfers data back to the host. If a data buffer created by a context cannot be allocated in the available global memory on the GPU, the TMM allocates host-based pinned memory for the buffer. All of the changes to this buffer are updated in pinned memory on the host side using the PCIe interface. The TMM manages the data buffers efficiently and minimizes the need for pinned memory buffers.
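A simplified sketch of the TMM placement decision for a new data buffer is shown below, assuming that buffers with no remaining kernel uses are the cold candidates. The names and the single-eviction policy are illustrative choices, not the exact CML/CP implementation.

#include <stdbool.h>
#include <stddef.h>

typedef enum { BUF_ON_HOST, BUF_ON_GPU, BUF_PINNED } buf_location_t;

typedef struct {
    size_t         size;
    int            pending_kernel_uses;  /* kernels still scheduled to use it */
    buf_location_t loc;
} buffer_t;

static bool is_cold(const buffer_t *b)
{
    return b->loc == BUF_ON_GPU && b->pending_kernel_uses == 0;
}

/* Decide where a new buffer of 'size' bytes should live. */
buf_location_t tmm_place_buffer(buffer_t *bufs, int n, size_t size,
                                size_t *free_gpu_mem)
{
    if (size <= *free_gpu_mem) {                 /* enough free global memory */
        *free_gpu_mem -= size;
        return BUF_ON_GPU;
    }
    for (int i = 0; i < n; i++) {                /* try evicting a cold buffer */
        if (is_cold(&bufs[i]) && bufs[i].size + *free_gpu_mem >= size) {
            bufs[i].loc = BUF_ON_HOST;           /* copy engine moves it back  */
            *free_gpu_mem = *free_gpu_mem + bufs[i].size - size;
            return BUF_ON_GPU;
        }
    }
    return BUF_PINNED;                           /* fall back to pinned memory */
}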

5.3.2 Remapping Compute-Resources between Contexts

The set of kernels present in a single context may not fully utilize the compute resources allocated to that context. The CP maintains a usage log for each compute unit assigned to a context. If a context exhausts all of its allocated resources for the execution of its kernels, it can utilize the unused compute units (CUs) belonging to other contexts. CUs that are assigned to a context but never used are considered mappable CUs. Remapping such unused compute resources helps achieve better utilization of those resources.

Algorithm 2 Dynamic Compute Unit remapping using the CP.
 1: // Process for each context (Ctx)
 2: if Mappable_CUs != NULL then
 3:     for all Exec_ctx do            // Other executing contexts (Exec_ctx)
 4:         if (Exec_ctx.Pend_work != NULL) then
 5:             // Assign 1 mappable CU to each executing context
 6:             Exec_ctx.CUs ← Exec_ctx.CUs + 1
 7:             Ctx.Mappable_CUs ← Ctx.Mappable_CUs - 1
 8:         end if
 9:     end for
10: else                               // No mappable CUs
11:     if (Ctx.Pend_work != NULL) and (Ctx.CUs < Ctx.min_CUs) then
12:         Reclaim_Min(Ctx)           // Reclaim minimum requested CUs
13:     end if
14: end if
15:
16: Reclaim_Min(Ctx)                   // Procedure to regain minimum requested CUs
17: while Ctx.CUs != Ctx.min_CUs do
18:     for all Exec_ctx != Ctx do
19:         if Exec_ctx.CUs > Exec_ctx.min_CUs then
20:             Ctx.CUs ← Ctx.CUs + 1
21:             Exec_ctx.CUs ← Exec_ctx.CUs - 1
22:         end if
23:     end for
24: end while

The CP performs the remapping task described in Algorithm 2 every 10 million cycles (a design choice) of GPU execution. The CP refers to the usage log of the CUs that belong to a context and checks for mappable CUs. The CP reassigns mappable compute units to other contexts that have already fully utilized their assigned compute units and have additional work to complete. The CP iterates over the other executing contexts in a round-robin fashion and assigns one mappable CU per iteration to every executing context. This process continues until all mappable CUs are assigned. The reassignment of mappable CUs can lead to situations where a context continues execution with a smaller number of CUs than the number it requested. If any such context has additional pending work, the CP performs a CU reclaim procedure. The CP restores the minimum requested CUs for that context by reclaiming CUs from other contexts with surplus CUs. Our remapping strategy ensures that the mappable CUs are evenly distributed across the other executing contexts. It also maintains fairness by providing a mechanism to reclaim the CUs. This helps the contexts complete their execution faster. The time period used for remapping is kept configurable.
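The configurable period itself can be expressed as a simple firmware check. The sketch below uses illustrative names, with the remapping body standing in for Algorithm 2; it only shows how the interval would gate the remapping task.

#include <stdint.h>

#define DEFAULT_REMAP_PERIOD 10000000ULL  /* 10 million GPU cycles (default) */

/* Body of Algorithm 2: redistribute mappable CUs and reclaim minimums. */
static void cp_remap_contexts(void)
{
    /* ... see Algorithm 2 ... */
}

/* Called once per GPU cycle (or per firmware tick) by the CP. */
void cp_tick(uint64_t cycle, uint64_t remap_period)
{
    if (remap_period == 0)
        remap_period = DEFAULT_REMAP_PERIOD;  /* user-configurable period */
    if (cycle % remap_period == 0)
        cp_remap_contexts();
}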

5.4 Firmware Extensions on Command-Processor (CP) for Multi-Context Execution

The Command Processor on the GPU has traditionally been used to schedule work through commands from the graphics runtime. The CP is used to initialize the GPU units, including the compute units and the rasterizer, for executing the various shaders belonging to the graphics pipeline [37]. The AMD R9-290X GPU architecture utilizes the CP for tasks such as global data sharing among kernels, scheduling of kernels, and managing cache coherence [95]. The CP plays an important role in managing the execution of multiple contexts on the GPU. We extend the functionality of the CP by adding complex firmware tasks to manage compute resources and data buffers on the GPU. The CP also establishes communication with the runtime through the CML. The firmware tasks implemented on the CP include the following (a simplified sketch of a CP service loop covering these tasks appears after the list):

1. Resource Checking: The CP receives information regarding the initial sharing allocation for the context and verifies the availability of compute resources requested by the context. This verification step checks for the availability of physical global memory, CUs and hardware queues.

2. Resource Partitioning and Remapping: The CP allocates compute units and hardware queues to each context according to their sharing allocation. The CP also initiates and controls the CU remapping mechanism described in Section 5.3.2.

3. TMM control and global memory management: The amount of global memory made available to the context is based on the sharing allocation requested by the context. The TMM mechanism shown in Figure 5.4 uses the CP to monitor the use of global memory by each context. The CP makes decisions regarding transfers of non-persistent data buffers from the GPU to the host, and initiates such actions using the copy engines.


4. Kernel Scheduling: The CP allocates and controls the hardware queues to manage the scheduling of the kernels belonging to an application context [13, 95]. For a context with multiple kernels, each kernel can be managed by a separate hardware queue. The hardware queues control the assignment of compute units to the multiple concurrent kernels of an application context, and can execute data-independent kernels of a context in parallel. Thus, in addition to the context-level concurrency implemented by the CP, kernel-level concurrency within each context is obtained by using the hardware queues.

5. Hardware Synchronization: Kernels belonging to a context may require synchronization when executing control barriers. The CP is used to manage synchronization between the compute units allocated to each context. Compute unit synchronization is also required before the launch of each new context on the GPU.
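The sketch below groups these five firmware tasks into a single command-dispatch routine. The command names and handlers are placeholders meant only to show how the CP's responsibilities could be organized; they are not the firmware implementation.

typedef enum {
    CP_CMD_LAUNCH_CONTEXT,   /* resource check + partitioning          */
    CP_CMD_REMAP_TIMER,      /* periodic CU remapping (Section 5.3.2)  */
    CP_CMD_BUFFER_EVENT,     /* TMM / global memory management         */
    CP_CMD_KERNEL_READY,     /* enqueue a kernel on a hardware queue   */
    CP_CMD_BARRIER           /* hardware synchronization               */
} cp_cmd_t;

void cp_handle(cp_cmd_t cmd)
{
    switch (cmd) {
    case CP_CMD_LAUNCH_CONTEXT: /* verify CUs, memory, HW queues; allocate */ break;
    case CP_CMD_REMAP_TIMER:    /* run Algorithm 2 over executing contexts */ break;
    case CP_CMD_BUFFER_EVENT:   /* initiate copy-engine transfers (TMM)    */ break;
    case CP_CMD_KERNEL_READY:   /* bind the kernel to a hardware queue     */ break;
    case CP_CMD_BARRIER:        /* synchronize CUs assigned to the context */ break;
    }
}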

5.5 Evaluation Methodology

5.5.1 Platform for Evaluation

To evaluate the potential of our multi-context execution scheme, we use the Multi2Sim simulation framework. The proposed design for the CML is implemented as an extension to the lower-level runtime provided by Multi2Sim. We also implement the proposed extensions to the CP on a simulated model of the AMD Radeon R9-290X GPU using Multi2Sim. The configuration of the simulated AMD Radeon R9-290X GPU is provided in Table 5.1. The TLB latencies are set to 500 cycles for accesses to the L1 TLB and 550 cycles for accesses to the L2 TLB; these values are based on measurements on an actual R9-290X GPU and take into consideration future design projections.

5.5.2 Applications for Evaluation

To properly evaluate multi-context execution, we have elected to use multi-kernel applications from the Rodinia [32], Parboil [136] and AMD SDK [1] benchmark suites. This set of applications is described in Table 5.2. Note that we include the multi-kernel applications from these suites, except for 4 multi-kernel benchmarks from Rodinia which are not currently supported by the simulation framework. To simulate multiple contexts, we execute multiple benchmarks concurrently. The characteristics of one context can affect the performance and resources available to the other contexts. If all contexts executing concurrently are memory bound, then memory accesses and memory resources become a bottleneck for execution. This may lead to degradation in the execution performance of all of the contexts. We consider the impact of pairing application contexts that possess a range of execution characteristics.


Device Configuration
    # of CUs                             44
    # of Hardware Queues (ACE units)     8
    # of SIMD Units / CU                 4
Compute Unit Configuration
    # of lanes / SIMD                    16
    # of vector registers / CU           256K
    # of scalar registers / CU           4K
    Frequency                            1 GHz
Memory Architecture
    L1 cache (1 per CU)                  16 KB
    # of shared L2 caches                8
    L2 size                              1 MB
    Global memory                        4 GB
    Local memory / CU                    64 KB

Table 5.1: Device configuration of the AMD R9-290X GPU [13].

Table 5.3 lists four workload mixes formed from the individual benchmarks in Table 5.2. Our goal here is to produce concurrent multi-context workloads that possess different computational characteristics, including memory-bound execution, compute-bound execution, and balanced execution. We will refer to these compound workloads as the Characteristic-based mixes. We also evaluate the behavior of multi-context execution by scaling the number of concurrently executing contexts using a different set of application mixes, given in Table 5.4. Each mix comprises a distinct number of concurrently executing contexts, referred to as the Sharing-Level. Each context requests compute resources based on its initial sharing allocation. This allows us to evaluate the concurrent benchmarks with varied resource allocations. The workload mixes included in this set are referred to as the Sharing-allocation-based mixes.

5.6 Evaluation Results

5.6.1 Speedup and Utilization with Multiple Contexts

We have evaluated the execution performance of the context mixes described in Section 5.5. We report on the efficiency of multi-context execution by using sequential execution of contexts as our baseline. In sequential execution, the individual contexts from the different mixes execute on the GPU with 100% of the compute resources assigned to each context. Sequential execution still permits the concurrent execution of kernels belonging to the same context.


#    Application                  Suite      Dataset                           # Kernels   # Workgroups
1    Stream Cluster (SC)          Rodinia    1048576 points, 512 dimensions    2           2048
2    HotSpot (HS)                 Rodinia    1000 x 1000 data points           2           1024
3    Back Propagation (BP)        Rodinia    1048576 points                    2           4096
4    Leukocyte Tracking (LC)      Rodinia    486 x 720 pixels/frame            3           3600
5    LU Decomposition (LUD)       Rodinia    1024 x 1024 data points           3           2048
6    Similarity Score (SS)        Rodinia    1024 points, 128 features         2           1024
7    CFD Solver (CFD)             Rodinia    125K elements                     5           500
8    Kmeans Clustering (KM)       AMD SDK    204800 points, 48 features        2           1600
9    Gaussian Elimination (GE)    AMD SDK    1024 x 1024 data points           2           2048
10   Mri-Q Calculation (MRI)      Parboil    64 x 64 x 64 data points          2           2048
11   Breadth-First Search (BFS)   Parboil    1000000 nodes                     2           1952

Table 5.2: Summary of benchmarks used for evaluation.

Application Mix   Overall Characteristic               Benchmark (Characteristic)   Resources Allocated
CHR-1             Memory Bound (Worst Case)            BFS (Memory)                 50.00%
                                                       SS (Memory)                  25.00%
                                                       BP (Memory)                  25.00%
CHR-2             Compute Bound (Worst Case)           SC (Compute)                 50.00%
                                                       LC (Compute)                 25.00%
                                                       LUD (Compute)                25.00%
CHR-3             Memory + Compute (Compute Heavy)     BFS (Memory)                 50.00%
                                                       SC (Compute)                 25.00%
                                                       KM (Compute)                 25.00%
CHR-4             Memory + Compute (Memory Heavy)      GE (Compute)                 50.00%
                                                       BFS (Memory)                 25.00%
                                                       SS (Memory)                  25.00%

Table 5.3: Workload mixes with varied compute characteristics.


Application Mix   Sharing Level   Benchmark   Resources Allocated
SHR-1             2               BFS         50.00%
                                  SC          50.00%
SHR-2             2               SC          75.00%
                                  CFD         25.00%
SHR-3             3               LUD         50.00%
                                  HS          25.00%
                                  BP          25.00%
SHR-4             4               SS          25.00%
                                  BFS         25.00%
                                  MRI         25.00%
                                  LC          25.00%

Table 5.4: Workload mixes with varied sharing levels.

The overall speedup, and the increase in utilization of compute resources such as compute units and hardware queues when using multi-context execution, are presented in Figure 5.6.

Figure 5.6: Speedup and resource utilization improvement using multi-context execution versus sequential execution of contexts.

The evaluated context mixes show an effective speedup when executed concurrently. The sharing-allocation mixes (SHR-1 to SHR-4) and the characteristic mixes (CHR-1 to CHR-4) show an average speedup of 1.53X and 1.61X, respectively, when using multi-context execution. These improvements are attributed to better utilization of hardware resources and the overlap of time-consuming communication. The average improvement in compute unit utilization across all the context mixes is 18%.

Figure 5.7: Compute unit utilization improvement using multi-context execution versus sequential execution of contexts.

The BFS, SC, BP, and SS workloads have low compute unit utilization when executed individually. The mixes SHR-1, SHR-2, CHR-1, and CHR-4, formed by combining these workloads, obtain the greatest benefit from concurrency, resulting in much higher compute unit utilization. The hardware queues manage the mapping of kernels to the compute units for a particular context. Hardware queues are assigned to each context based on its sharing allocation. For sequential execution, many hardware queues on the GPU remain unused, but when contexts execute concurrently, the total number of active hardware queues increases. This improves hardware queue utilization; the average improvement in utilization of hardware queues is 19%.

The overall speedup decreases as we increase the number of concurrent contexts. Due to increased memory pressure and a limited number of compute units (CUs), we see an increase in the execution time of individual contexts in each mix. The SHR-1 and SHR-2 mixes have only 2 contexts competing for resources, and so experience better speedups compared to the other sharing-allocation mixes. The SHR-4 mix executes 4 concurrent contexts and achieves only a 1.25X speedup.

The characteristic mixes use 3 contexts in each mix. The CHR-1 and CHR-4 mixes combine memory-bound workloads, which experience very little benefit from concurrency (i.e., their performance is similar to sequential execution). The memory-bound contexts increase the number of in-flight memory accesses. This results in a memory bottleneck for the individual contexts, leading to slower execution of each context. The compute-bound mixes, such as CHR-2 and CHR-3, show an impressive speedup compared to the other mixes. The compute unit remapping mechanism helps the compute-bound applications effectively reclaim unused resources, further accelerating the associated contexts. Figure 5.7 details the improvement in CU utilization on a per-mix basis.


Figure 5.8: Increase in execution time for multi-context execution over sequential execution.

Figure 5.8 shows the increase in execution time of the individual contexts under multi-context execution. The memory-bound mixes show a 36% increase in execution time due to memory bottlenecks. The compute-bound and balanced mixes experience an average increase of 34% in execution time, due to the smaller number of compute resources available to each individual context. Even though we observe a slowdown in individual context execution, we achieve an average speedup of 1.56X for concurrent execution of these contexts in the mixes. Dynamic compute unit remapping and the TMM mechanism, implemented using the runtime and the CP, contribute to the overall speedup.

5.6.2 Evaluation of Global Memory Efficiency

Figure 5.9: Global memory bandwidth utilization using multi-context execution.

Executing multiple contexts concurrently on the GPU leads to an increase in the overall number of global memory accesses. This can lead to improvements in global memory bandwidth utilization. The global memory bandwidth utilization of contexts executed individually is low, but increases when they are combined with other contexts for concurrent execution. Figure 5.9 shows the global memory bandwidth utilization when using multi-context execution. The sharing-allocation mixes show an increase in their bandwidth utilization as the number of concurrent contexts increases. The mixes with fewer concurrent contexts (e.g., SHR-1 and SHR-2) and the compute-bound mixes (e.g., CHR-2 and CHR-3) show a 17.8% improvement in global memory bandwidth utilization over sequential execution. The memory-bound mixes such as CHR-1 and CHR-4 show a significant improvement of 22% in global bandwidth utilization for multi-context execution. The average increase in global memory bandwidth utilization for multi-context execution over sequential execution is 18.1%.

5.6.3 TMM Evaluation

Figure 5.10: Timeline of global memory usage for contexts from different mixes.


Figure 5.10 shows the timeline-based usage of global memory by contexts belonging to different mixes under multi-context execution. Figure 5.10 also shows the operation of the TMM mechanism (described in Section 5.3.1), which manages data buffer transfers. The impact of the TMM mechanism can be clearly seen for the BFS context present in the SHR-1 and SHR-4 mixes. BFS is initially allocated 25% of the resources in SHR-4 and uses 27% (max) of global memory. The same BFS application uses 44% (max) of global memory when executed in the SHR-1 mix, with a 50% resource allocation and more free memory. In the case of SHR-1 and SHR-2, the TMM does not need to transfer the cold data buffers out of GPU memory since there is enough free memory available. We do see frequent data buffer movement in the SHR-3 and SHR-4 mixes, since we have to manage up to four concurrent contexts, which stress the global memory. This kind of frequent movement is not present in SHR-1 and SHR-2 since they have fewer concurrent contexts and sufficient free memory. The TMM mechanism in the CML manages global memory and data buffer transfers efficiently, even when working with multiple concurrent contexts. The transfers handled by the TMM mechanism are overlapped with computation on the GPU and do not add any significant data-transfer delay.

We compare the performance of the TMM against previously proposed memory management approaches. We model a software-managed memory transfer approach implemented in ADSM [49]. The software-managed approach performs data buffer state (active, inactive, hot, cold) analysis and makes transfer decisions in the runtime on the host side. We also compare against a mechanism which uses asynchronous pinned-memory-based data transfers to enable GPU virtualization [87]. The transfer decisions for the pinned approach are also implemented on the host side. Figure 5.11a shows the overall speedup obtained by the three mechanisms over sequential context execution. The benefits of the software approach decrease as the number of concurrent contexts increases. The pinned-memory approach struggles to achieve any significant speedup over sequential execution for multiple concurrent contexts. TMM shows an improvement in speedup of 22% and 40.7% over the software-managed and pinned-memory approaches, respectively. The software approach communicates each transfer decision to the GPU over the PCIe bus before initiating the transfer. It also issues queries to the GPU regarding the state of buffers and contexts. This communication delay adds latency to the software-managed mechanism. The pinned-memory approach introduces high overhead due to increased PCIe traffic and decision communication delay, especially as we increase the number of concurrent contexts. Figure 5.11b compares the number of queries from the host to the GPU for each management mechanism. TMM reduces the query traffic by initiating transfer control on the CP. A CP with good performance is essential for efficient decision control and transfer initialization, reducing the overhead incurred due to the processing of transfer decisions.


Figure 5.11: Evaluation of different global memory management mechanisms: (a) overall speedup and (b) number of context-status and data-buffer-status queries from host to GPU.

Figure 5.12: Performance of pinned memory updates over transfer stalling (blank space indicates no pinned memory update).


The TMM uses pinned memory for new buffers when GPU memory is fully utilized by hot-buffers. All updates to those buffers involve accesses to pinned memory. The software-managed mechanism provides an alternative approach that stalls transfers when the GPU memory is full. This can result in stalled contexts, which wait for their data buffers. Figure 5.12 shows the total number of accesses to pinned buffers as a proportion of total buffer accesses when using TMM. It also shows the relative speedup obtained by relying on pinned buffers over the transfer stalling implemented in the software-managed approach. Applications that rely on pinned memory when GPU memory is full perform 1.12X better than those that wait for memory to become available. For cases such as SHR-4 and CHR-1, the global memory is released soon after the decision to use pinned memory is taken. In these cases, the software-managed stalling approach performs better than the pinned-memory buffer, provided the stall time is low. A side effect of the software-managed approach is the simultaneous stalling of all the executing contexts, leading to a deadlock on the GPU. The pinned-buffer approach of TMM helps avoid such deadlocks by continuing execution of the contexts.

Figure 5.13: Timeline analysis of the compute unit remapping mechanism.


5.6.4 Evaluation of Compute-Unit Remapping Mechanism

We analyze the efficiency of the compute unit (CU) transfer mechanism described in Section 5.3.2. Figure 5.13 presents a timeline-based analysis of the number of CUs controlled by each context, and clearly illustrates the CU remapping mechanism. All contexts are assigned the number of CUs according to their specified sharing-allocation. The SC context is allotted 33 CUs in the SHR-2 mix and is allotted 22 CUs in the CHR-2 mix, as seen in Figure 5.13. Thus, the execution time for SC in the CHR-2 mix is 21% longer than in the SHR-2 mix. The CP remaps the mappable CUs to other contexts which have exhausted their allocated CUs and have additional work to complete. The reclamation procedure of CUs can also be observed for most of the contexts in mixes SHR-2, CHR-2 and CHR-3. The contexts in SHR-4 benefit from the early completion of the SS context, which allows for aggressive remapping of CUs. The remapping mechanism is critical for application throughput and must be managed using an efficient CP.
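The remapping and reclamation behavior visible in the timeline can be summarized by the pair of routines sketched below. This is an illustrative Python sketch under simplifying assumptions (one CU moved per step, plain dictionary bookkeeping); it is not the CP firmware itself.

# Sketch of CU remapping (idle CUs move to contexts with pending work) and
# reclamation (a context takes back CUs up to its sharing-allocation).
def remap_cus(contexts):
    """contexts: dict name -> {'allocated': int, 'in_use': int, 'pending_work': bool}"""
    donors = [c for c in contexts.values() if not c['pending_work'] and c['in_use'] > 0]
    takers = [c for c in contexts.values() if c['pending_work'] and c['in_use'] >= c['allocated']]
    for donor in donors:
        while donor['in_use'] > 0 and takers:
            takers.sort(key=lambda c: c['in_use'])  # give to the least-served context
            donor['in_use'] -= 1
            takers[0]['in_use'] += 1                # one CU remapped

def reclaim_cus(contexts, name):
    """A context with new work reclaims CUs, up to its original allocation,
    from contexts currently holding more than their share."""
    ctx = contexts[name]
    while ctx['pending_work'] and ctx['in_use'] < ctx['allocated']:
        over = [c for c in contexts.values() if c['in_use'] > c['allocated']]
        if not over:
            break
        over[0]['in_use'] -= 1
        ctx['in_use'] += 1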

5.6.5 L2 Cache Efficiency for Multi-Context Execution

Figure 5.14: L2 cache hit rate for multi-context execution.

The address space isolation mechanism proposed in Section 5.2.2 allows for the effective sharing of an L2 cache between concurrently executing contexts. Contexts using large portions of the L2 cache may suffer due to interference from other concurrent contexts. This may result in lower cache hit rates and a slowdown of the context. The L2 cache hit rates for the contexts belonging to different mixes are shown in Figure 5.14. The contexts in memory-bound mixes (CHR-1 and CHR-4) exhibit thrashing in the L2 cache. Thrashing leads to the eviction of cache blocks by another context in the mix. The compute-bound mixes (CHR-2 and CHR-3) have hit rates comparable to sequential execution. The contexts from these mixes do not stress the cache and show efficient cache sharing for multi-context execution. The sharing-allocation mixes show the impact of scaling the number of concurrent contexts on the L2 cache hit rate. The reduction in hit rate for SHR-4 (with 4 concurrent contexts) and SHR-1 (with 2 concurrent contexts), when compared to their sequential execution, is 25% and 8%, respectively. The effective hit rate falls as the number of concurrent contexts increases.

Figure 5.15: Conflict miss in L2 cache as % of total misses.

Figure 5.16: TLB hit rate for multi-context execution.

Figure 5.15 shows the number of conflict misses occurring in the L2 cache, where blocks of one context evict blocks of another context. For the memory-bound mixes, conflict misses make up a majority of the total L2 cache misses (avg. 51%). For the compute-bound mixes, we see that L2 conflicts are less of an issue (avg. 27%). The number of conflict misses is higher when concurrently executing 4 contexts (avg. 49% in SHR-4) than in SHR-1 (avg. 29%), which executes 2 concurrent contexts. Figure 5.16 shows the L1 and L2 TLB hit rates for contexts from all of the mixes. The interference due to multi-context execution results in a reduced hit rate for individual contexts on both the L1 and L2 TLBs. The average reduction in hit rate for all contexts is only 8.4% on the L1 TLB and 6.9% on the L2 TLB for multi-context execution. The proposed mechanism for address space isolation in the L2 cache allows for efficient execution of multiple contexts. This is evident from the substantial speedup in execution performance for all of the mixes using multi-context execution, with only a small degradation (less than 9%) in L2 cache hit rates for individual contexts.


5.6.6 Firmware Characterization and CP Selection

Figure 5.17: Analysis of firmware tasks for the CP.

To ensure that the performance of the CP does not become a bottleneck, we characterize the execution of the firmware tasks needed to support multi-context execution. Figure 5.17a shows the proportion of time each firmware task runs on the CP. We observe that more than 60% of the time is spent in TMM control, kernel scheduling, and resource management while handling multiple contexts. The design of the CP core must therefore aim to reduce the latencies incurred in executing these three tasks. Figure 5.17b shows the operation-based breakdown of the firmware tasks, averaged over all of the application mixes. The firmware tasks are fairly compute-bound, with more than 75% of the time spent on integer and floating-point operations. The small percentage of memory operations in every firmware task indicates low data usage. Thus, these tasks will not require a complex processor core to hide memory latencies. We considered three designs for the CP core: (1) a simple thin core with a 5-stage pipeline, comparable to current CP designs; (2) an in-order core with an 8-stage pipeline, comparable to modern ARM-like embedded processors; and (3) a complex 2-way out-of-order core (ROB size: 8, ld/st queue size: 8/8), which resembles a modern low-power CPU.

Core Type                           Simple    In-order    2-way-OoO
Technology (nm)                     28        28          28
Core Area (mm2)                     1.678     1.075       9.84
Peak Power (W)                      1.169     0.1591      26.114
Subthreshold Leakage (W)            0.796     0.059       4.57
Area of AMD R9-290X GPU (mm2)       438       438         438
% increase in GPU area due to CP    0.38      0.25        2.25

Table 5.5: Area and power overheads of the CP alternatives.


Figure 5.18: Execution performance of firmware on CP cores.

Figure 5.18 shows the firmware execution performance of the cores, normalized to the simple core. We observe that the in-order core shows a 1.21X increase in performance over the simple core, whereas the out-of-order core shows a 1.28X improvement over the simple core. Table 5.5 shows the peak power and area of the CP core configurations calculated using the McPAT power and area modeling tool, integrated with Multi2Sim [85, 148]. The ARM-like in-order core is the most efficient design considering execution performance, peak power, and die area. The in-order core provides a good balance between performance and core complexity and is used for the evaluation presented in previous sections.

5.6.7 Overhead in Multi-Context Execution

Figure 5.19: Context handling delay added by the CP.

The CP controls context creation and deletion on the GPU. For every context dispatched to the GPU, the CP performs initialization steps such as resource checking, resource allocation, and setting up global memory and page tables. This initialization adds execution overhead for each context. Figure 5.19 shows the context creation and context deletion delay as a percentage of the total execution time for each context across all evaluated mixes. The same application contexts, when executed using different mixes and sharing-levels, exhibit variations in context creation and deletion delay. The change in sharing-allocation results in a change in execution time of the context and impacts the context management overhead. We observe an average context creation delay of 3.63% and an average context deletion delay of 1.79% of total execution time.

L2 cache size                                   1 MB
L2 cache line size                              64 B
Associativity                                   16-way
Number of cache lines (1 MB / 64 B)             16K
Total CID overhead (2 bits/line * 16K lines)    4 KB
Total L2 area (64 KB tags + 1 MB data)          1088 KB
% increase in L2 cache area (4 KB / 1088 KB)    0.36

Table 5.6: L2 cache area overhead.

Characteristics                                      L1 TLB    L2 TLB
TLB entries                                          2048      8192
Tags (52-bit VPN * entries)                          13 KB     52 KB
Total CID overhead (2 bits/entry * entries)          512 B     2 KB
Total TLB area (20-bit PPN * entries + tags)         18 KB     72 KB
% increase in TLB area (512 B/18 KB, 2 KB/72 KB)     2.77      2.77

Table 5.7: TLB area overhead.

The context-id (CID) bits add storage overhead to the L2 cache tags and TLB tags. Table 5.6 and Table 5.7 detail the storage overhead due to the additional CID bits in the L2 cache and TLB, respectively. The CID bits in the L2 cache add 4KB of storage overhead (less than 0.4% of the area of the baseline 1MB L2 cache). The CID tag adds less than 3% storage overhead for the L1 TLB (an extra 512B) and the L2 TLB (an extra 2KB). The CID registers for CUs and the page table registers add a 120B storage overhead. The low additional overhead for the L2 cache, TLB, and control registers provides a cost-effective and performance-efficient address space isolation mechanism using the CID.
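A sketch of how the CID participates in a lookup, together with the overhead arithmetic behind Tables 5.6 and 5.7, is shown below. The cache-line structure is an illustrative stand-in for the hardware tag array, not the actual implementation.

# Illustrative CID-extended tag check and the storage-overhead arithmetic
# (2-bit context-id per L2 line and per TLB entry).
from collections import namedtuple

CacheLine = namedtuple("CacheLine", ["valid", "tag", "cid", "data"])

def l2_hit(line, addr_tag, ctx_id):
    # A hit requires the address tag AND the context-id to match, so concurrently
    # executing contexts cannot read each other's cached blocks.
    return line.valid and line.tag == addr_tag and line.cid == ctx_id

l2_lines     = (1 << 20) // 64       # 1 MB / 64 B lines = 16K lines
l2_cid_bytes = l2_lines * 2 // 8     # 2 bits per line  -> 4 KB
l1_tlb_cid   = 2048 * 2 // 8         # 2 bits per entry -> 512 B
l2_tlb_cid   = 8192 * 2 // 8         # 2 bits per entry -> 2 KB
print(l2_cid_bytes, l1_tlb_cid, l2_tlb_cid)   # 4096 512 2048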


Figure 5.20: L2 cache hit-rate for static partitioned cache and CID-based cache.

5.7 Discussion

5.7.1 Static L2 Cache Partitioning for Multiple Contexts

Our proposed mechanism for address space isolation using a context-id (CID) is discussed in Section 5.2.2. Another possible scheme for address space isolation is static cache partitioning. Such schemes have been discussed in previous work on cache management for GPUs [83]. The L2 cache on the AMD R9-290X GPU is split across 8 memory controllers [13]. We can implement static partitioning by assigning L2 units to different contexts. Figure 5.20 shows the hit rates for static partitioning and our CID-based cache design. Static partitioning of the L2 cache provides limited benefits in terms of cache hit rates for multi-context execution. Static partitioning restricts each context to a limited portion of the cache and does not tolerate large variations in data access patterns, which can lead to thrashing in the L2 cache. In our approach, the entire L2 cache is visible and accessible to all contexts through their unique CIDs, and can therefore adapt to contexts with varying data requirements. To summarize, for any combination of contexts, a context that would use more than P% of the cache space in a CID-based cache (where P is its sharing-allocation, e.g., 25% or 50%) can exhibit L2 cache thrashing under static cache partitioning.

5.8 Summary

In this chapter, we proposed a hardware/software approach to support multi-context execution on a GPU. Our design increases the utilization of the GPU resources, which can significantly benefit workloads targeted for data centers and cloud engines. We describe new architectural and runtime enhancements needed to support multi-context execution on a GPU. We add a context-management-layer (CML) in the runtime to handle memory transfers for contexts and manage the dispatch of contexts to the GPU. We also describe a new Transparent Memory Management (TMM) facility and evaluate the benefits of this combined host/GPU-based control mechanism versus other host-side approaches. Similarly, we enable the GPU to handle complex firmware tasks using an upgraded command processor (CP). We also implement address space isolation in the shared L2 cache and TLBs with less than a 0.4% increase in cache area.

Chapter 6

Application Behavior-Aware Resource Management for Multi-tasked GPUs

Popular compute environments, such as mobile SoCs (Systems-on-Chip), datacenters, and cloud engines, allow the use of GPUs for offloading computations. The growing demand for these compute platforms increases the need for GPUs that can support the execution of multiple applications concurrently on the same device. We presented our design to implement multi-tasking on GPUs in Chapters 4 and 5. Our approach provides a mechanism to seamlessly host multiple application contexts on the same GPU. The adaptive spatial partitioning mechanism allows each of these contexts to execute multiple concurrent kernels, thereby enabling multi-level concurrency. While our approach for multi-tasking is efficient, it may also introduce further challenges, such as added contention on multiple hardware resources. GPUs consist of many compute units (CUs), which can be shared among multiple applications. The contention for CUs may affect the amount of parallelism that can be achieved by an application. Changing the number of CUs allocated to an application can have a significant impact on the resulting execution performance. Another source of contention on a multi-tasked GPU is the added pressure on the shared memory hierarchy. In modern GPU designs, all CUs share the L2 cache, the L3 cache (if present), and the global DRAM memory. The overall performance of a multi-tasked GPU is observed to degrade when the working sets of co-executing applications exceed the L2/L3 cache sizes or the global memory size [144]. Similarly, applications executing on a multi-tasked GPU can present disparate compute behaviors. The performance of these applications may be dependent on more than one resource. We enhance our multi-tasking framework on the GPU to reconsider the existing resource allocation for applications, and make changes to reduce resource contention and improve throughput. In addition, the co-executing applications may have different priority levels or may require strict Quality-of-Service (QoS) guarantees (e.g., graphics applications). The QoS requirements can be service level agreements in the form of min/max performance or a dedicated resource allocation. Our design described in previous chapters primarily focuses on managing the contention across CUs, and provides limited adaptability to match the compute behavior of applications. In this chapter, we enhance our design by introducing Virtuoso, a dynamic resource management framework that provides contention-free, QoS-aware, and application behavior-aware allocation of compute and memory resources to applications on a multi-tasked GPU. Virtuoso is implemented by modifying the software runtime and the hardware-level architecture of the GPU. Virtuoso provides priority-based resource allocation mechanisms to achieve QoS targets for applications. The QoS targets for applications are set using QoS policies. A wide range of QoS policies are supported, including performance targets, throughput targets, and resource utilization targets. At the software level, Virtuoso provides an interface which allows the host OS to define QoS policies that match the requirements of the system. The rest of the chapter is organized as follows. Section 6.1 describes the need for QoS-aware resource management on multi-tasked GPUs. Section 6.2 describes the structure of the QoS policies for GPUs, and the metrics used for measuring QoS. We then describe the software and hardware architecture of Virtuoso in Section 6.3. Section 6.4 describes our evaluation methodology, while Section 6.5 presents the evaluation of our Virtuoso framework across multiple application mixes and QoS policies.

6.1 Need and Opportunity for QoS

Fig. 6.1 presents an example to illustrate the need for QoS on GPU platforms that implement spatial multi-tasking. The graph compares the execution performance (IPC) and system throughput (STP) of applications executing in multi-tasked mode against dedicated (DED) mode execution. STP is calculated as suggested by Eyerman et al. [45]. STP measures the overall system performance and expresses the amount of work done during a unit of time. In this example, we execute 20 distinct pairs of compute-bound (COM) and memory-bound (MEM) applications selected from Table 5.2 on a detailed GPU simulator supporting spatial multi-tasking [148]. The system throughput of the multi-tasked GPU improves on average by 25.7% when applications execute simultaneously. But this improvement in overall throughput comes at the cost of the performance (IPC) of each individual application. The COM and MEM applications show an average 34% and 29% reduction in IPC over their DED performance, respectively.


Figure 6.1: System throughput improves, but at the cost of application performance for co-executing applications on multi-tasked GPUs. This illustrates the need to support applications with QoS re- quirements when sharing GPUs.

Figure 6.2: Large improvements in IPC obtained with small variation in STP, for co-executing COM/MEM application pairs by (a) varying the number of allocated CUs for the COM application, and (b) varying the L2 cache allocation for MEM applications. This demonstrates the opportunity for achieving QoS through reallocation of resources. The geomean of IPC and System Throughput (STP) is provided.

To improve the performance of applications executing on shared GPUs, we need to develop robust policies which can enforce QoS guarantees for certain high-priority applications. This goal can be achieved by improving the IPC of high-priority applications, while maintaining a near-constant system throughput. We now explore the opportunity to achieve QoS guarantees for applications on a multi-tasked GPU through resource scaling. We observe the performance of the 20 COM/MEM application pairs by varying the number of resources allocated to an application. Each application in the pair is assigned a distinct priority level. We scale the number of compute units and the L2 cache space allocated to the high-priority application in the pair. The evaluation is done for both cases:


(1) COM has higher priority (CUs varied), and (2) MEM has higher priority (L2 cache space varied), for each of the pairs. When one resource is varied, the other resource is distributed equally across the applications. As observed in Fig. 6.2, when a compute-bound application (COM) has high priority, its IPC can be improved by 14x over an equal distribution, with less than a 5% variation in STP. Moreover, the memory-bound applications are sensitive to variations in L2 cache space allocation and can be improved by 6x over an equal distribution, with only a 3% variation in STP. We can clearly see an opportunity to improve the individual performance (IPC) of each application through resource scaling, with little impact on overall system throughput. This behavior can be exploited to achieve QoS targets for various types of applications (COM, MEM) on multi-tasked GPUs. This motivates the need for a framework which can achieve the QoS targets of prioritized applications through allocation of multiple compute and memory resources.

6.2 Virtuoso Defined QoS

6.2.1 QoS policies

QoS policies for a multi-tasked GPU can vary greatly, based on the requirements of the host environment (e.g., SoCs or cloud engines). The host OS in such environments should be able to implement a wide range of QoS policies for the GPU. The policy should be able to support the QoS targets for multiple priority levels. The selection of QoS targets has to be flexible and must allow for resource utilization targets and execution performance targets. The host OS should also have the ability to control the system throughput, along with QoS targets, for applications. The system should further be able to define QoS policies for time-sensitive applications such as graphics applications. Virtuoso provides a flexible, microarchitecture-agnostic interface which allows the host OS to define the QoS policy, along with the control parameters which suit the needs of the system. This interface is described in Section 6.3. A range of policies can be defined by specifying targets (resource or performance) for different priorities using the Virtuoso interface. To evaluate the performance and robustness of Virtuoso, we implement three priority-based QoS policies. We specify the QoS targets for the applications in terms of performance-level constraints. The performance constraint is defined as the level below which the performance of an application must not degrade when compared to its dedicated mode execution (DED). The policies used in our design for dynamic QoS are: (a) Low Priority Constraint – a goal that any/all low-priority applications should not degrade below a certain level of baseline performance.


(b) High Priority Constraint – a performance goal for high-priority applications relative to their DED performance, while ignoring the low-priority constraint. (c) Overall Constraint – a threshold defined for the system throughput (STP) of the multi-tasked GPU. There might be cases where the degradation of a low-priority application far outweighs the performance improvement from a high-priority application. In such scenarios, maintaining a constant STP is beneficial for multi-tasked GPUs.

6.2.2 QoS Metrics

Multi-tasked GPUs have to efficiently manage compute and memory resources to achieve the QoS targets. For our Virtuoso implementation, we focus on managing the allocation of compute units (CUs), the L2 cache, the DRAM cache, and off-chip DRAM memory. The utilization of compute resources is measured using the number of CUs allocated to the applications. The memory resource utilization considers the L2 cache space and the DRAM cache space used by each application. For global memory utilization, we consider the total usage of off-chip DRAM memory. As described in Section 6.2.1, the QoS targets may not only be related to resource utilization, but also to execution performance. In such cases, resource usage of an application is not sufficient to measure QoS and make allocation decisions, because allocating additional resources may not necessarily improve the performance of an application. For example, allocating more memory to a compute-bound application may not impact its performance at all. Therefore, we use performance metrics such as Instructions Per Cycle (IPC) to measure CU allocation efficiency, Misses Per Interval (MPI) for L2 cache effectiveness, and global memory throughput (GMT) in kilobytes/cycle for the DRAM cache performance. System throughput (STP) measures the overall progress of the system with multiple co-executing applications.

6.3 Virtuoso Architecture

In this section, we describe the implementation of the Virtuoso architecture. Virtuoso is a layered architecture, consisting of three primary layers: (i) QoS interface, (ii) QoS assignment, and (iii) QoS enforcement. Virtuoso is integrated into the GPU architecture, as shown in Fig. 6.3.

6.3.1 QoS Interface: Settings Register (QSR)

The QoS interface layer is used to initialize the QoS hardware on the GPU. The layer consists of a memory-mapped QoS settings register (QSR) which provides an interface between the host OS and the GPU. The OS can set the QSR to indicate the number of applications executing on the GPU, the QoS policy type, and the QoS targets.


Figure 6.3: Virtuoso, as implemented in a current GPU architecture (left). The three layers of the Virtuoso implementation (right). The interface layer provides an interface to the host OS, the assignment layer allocates resources, and the enforcement layer implements QoS-aware resource allocation/reallocation.

Figure 6.4: Information stored in the (a) QoS setting register and the (b) QoS information table.

The QoS policy type is one of High Priority Constraint, Overall Constraint, or Low Priority Constraint, and the QoS targets specify the objectives and constraints. For example, the High Priority Constraint policy can be set with a constraint value of 20%. This would direct Virtuoso to reallocate resources in such a way that the performance of the high-priority application is no more than 20% below its dedicated (DED) execution. A resource allocation objective can also be set if an objective-based QoS policy is selected. The QSR register contents are shown in Fig. 6.4a. The dynamic QoS policy implemented by Virtuoso uses three priority levels (High, Medium, Low). For effective management of resources across these priorities, we consider two additional parameters: (a) the partition level and (b) the split ratio. All applications that are assigned a priority lower than the partition level are candidates to release resources. The split ratio dictates the ratio in which the released resources are distributed across the higher-priority applications. For our evaluation, we set the partition level to medium priority (i.e., resources are only released by low-priority applications). Both the partition level and the split ratio can be modified by the host OS by writing to the QSR. This protocol provides the flexibility to implement multiple levels of priority without any changes to the hardware.
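The information the host OS writes into the QSR can be pictured as the structure below. The field names and types are assumptions made for illustration; the actual bit-level layout is the one shown in Fig. 6.4a.

# Illustrative view of the QSR contents set by the host OS (field names assumed).
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class QoSSettingsRegister:
    num_applications: int          # applications executing on the GPU
    policy_type: str               # 'high_priority', 'low_priority', or 'overall'
    constraint: float              # e.g. 0.8 -> stay within 20% of DED performance
    objective: Optional[str]       # optional resource-allocation objective
    partition_level: str           # priorities below this level release resources
    split_ratio: Tuple[int, int]   # high : medium share of released resources

# Example: High Priority Constraint at 0.8x of DED; resources released only by
# low-priority applications and split 67:33 between high and medium priority.
qsr = QoSSettingsRegister(num_applications=3, policy_type='high_priority',
                          constraint=0.8, objective=None,
                          partition_level='medium', split_ratio=(67, 33))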

6.3.2 QoS Assignment: Performance Analyzer

Once the QSR is initialized, the next step is to determine how to adjust the resources assigned to every application. The QoS assignment layer implements a QoS information table (QIT), which stores the QoS information for each executing application. There is one table entry for each application in the QIT. Dynamically monitored system performance metrics are also stored in the QIT. The contents of a QIT table entry are shown in Fig. 6.4b.

Algorithm 3 Performance Analyzer
 1: for all co-executing applications do
 2:     // Calculate error between current interval and target for each metric
 3:     IPC_err = (IPC_targ − IPC_curr) / IPC_targ
 4:     MPI_err = (MPI_targ − MPI_curr) / MPI_targ
 5:     GMT_err = (GMT_targ − GMT_curr) / GMT_targ
 6:
 7:     // Sort the metric errors in descending order
 8:     Sort(IPC_err, MPI_err, GMT_err)
 9:     AssignScore(IPC_err, MPI_err, GMT_err)
10: end for

The performance metrics for each application are recorded in the QIT using GPU performance monitors. This recording is done periodically using a driver-defined QoS interval. The performance analyzer module decides the resources to be assigned to each application in the next QoS interval using Alg. 3. The algorithm calculates the normalized error between the target value and the current interval value for each metric (lines 3–5). It sorts the error values in descending order using a merge-sort operation (line 8). The analyzer then assigns a score to each metric according to its position in the sorted order. This process is repeated for each of the co-executing applications. As described in Section 6.2.2, each metric is associated with a particular provisioned resource. The sorted order of performance metrics therefore represents the order in which resources have to be provisioned for the next QoS interval for each application. For n co-executing applications and m provisioned resources, the algorithm has a complexity of O(nm log(m)). As the number of provisioned resources for any GPU remains constant, the time complexity becomes linear (O(n)). The analyzer uses the command processor on the GPU for its calculations. The command processor executes the performance analyzer algorithm in parallel with the executing applications, and does not contribute to application stalls.
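For clarity, the listing below renders Algorithm 3 in Python. It is a behavioral sketch of the analyzer (sorted() stands in for the merge sort performed in firmware); metric names follow the algorithm, and the example values are made up.

# Behavioral sketch of Algorithm 3 (performance analyzer).
def analyze(applications):
    """applications: list of dicts with <metric>_targ and <metric>_curr values
    for IPC, MPI and GMT. Returns, per application, a score for each metric;
    the highest-scored metric's resource is provisioned first next interval."""
    scores = {}
    for app in applications:
        errors = {}
        for metric in ('IPC', 'MPI', 'GMT'):
            targ, curr = app[metric + '_targ'], app[metric + '_curr']
            errors[metric] = (targ - curr) / targ               # normalized error (lines 3-5)
        ordered = sorted(errors, key=errors.get, reverse=True)  # line 8
        scores[app['name']] = {m: len(ordered) - i              # AssignScore (line 9)
                               for i, m in enumerate(ordered)}
    return scores

apps = [{'name': 'BP', 'IPC_targ': 100, 'IPC_curr': 60,
         'MPI_targ': 0.20, 'MPI_curr': 0.25,
         'GMT_targ': 4.0, 'GMT_curr': 3.8}]
print(analyze(apps))   # IPC is farthest from target -> highest score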


Figure 6.5: Calculating the Token Release Register value using impact analysis.

Figure 6.6: Resource allocation using tokens. The Token Release Register sets the number of tokens assigned to each application in each interval. The resource table triggers the allocation mechanism for each resource. Tokenizers convert the token number to the appropriate resource amount.

6.3.3 QoS Enforcement: Resource Manager

The resource manager represents each resource in the form of a distinct token (or share) [152]. The token for each resource represents the minimum amount of the resource which can be assigned to an application (i.e., the number of CUs, the amount of L2 cache, and the amount of DRAM cache). The applications below the partition level (defined by the QoS policy) release tokens for each resource. These tokens can be reallocated to applications running above the defined partition level. The relative performance impact of releasing tokens for a particular resource is based on the information from the performance analyzer. The following steps are implemented by the resource manager for dynamic allocation of resources:
1. A score for each resource of a high-priority application is used to calculate the number of tokens to be released by the low-priority application (see Fig. 6.5).
2. The number of released tokens for each resource is stored in the Token Release Register (TRR).
3. The contents of the TRR are used to modify the token count for each application in the resource table. Modifications are done according to the split ratio defined through the host OS.
4. Tokenizers start the allocation mechanism by converting the token amount for each application to the appropriate level for each resource.
5. The reallocation steps are repeated for every QoS interval.
Virtuoso resets this dynamic allocation process whenever a new high-priority application arrives, or an old high-priority application completes execution. Next, we describe the allocation mechanism for each of the controlled resources.

6.3.3.1 Compute Unit Reassignment:

The number of CU tokens assigned to each application in the resource token table represents the CU budget of each application. The CU tokenizer translates the number of tokens to the number of CUs using a token-to-CU mapping. An increase in the token count corresponds to an increase in the CUs assigned to an application; similarly, a decrease in the token count corresponds to a decrease in the number of CUs. The tokenizer triggers the preemption mechanism to initiate a reassignment of CUs to the applications. The CU allocation for higher-priority applications stops if the required QoS objective is achieved, or if the application has no pending work. The token-based allocation is valid for all priority levels except for time-sensitive applications. In such cases, the preemption mechanism triggers a device-wide context-switch, which saves the execution state of all running applications and hands over all CUs and cache space to the time-sensitive application. Upon completion, the execution state and the resource allocation of the preempted applications are restored.
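A compact sketch of the token flow, covering steps 1–3 of the resource manager together with the CU tokenizer, is given below. The split-ratio and token granularities follow Table 6.1; the helper names and data layout are illustrative assumptions, not the actual hardware tables.

# Sketch of token release (TRR), split-ratio redistribution, and the CU tokenizer.
SPLIT_RATIO = {'CU': (67, 33), 'L2': (75, 25), 'DRAM$': (75, 25)}   # high : medium

def release_tokens(low_app, scores):
    """Steps 1-2: the high-priority application's per-resource scores determine
    how many tokens the low-priority application releases (the TRR contents)."""
    trr = {}
    for resource, score in scores.items():
        give = min(score, low_app['tokens'][resource])
        low_app['tokens'][resource] -= give
        trr[resource] = give
    return trr

def distribute_tokens(trr, high_app, med_app):
    """Step 3: split the released tokens between the high- and medium-priority
    applications according to the split ratio set through the QSR."""
    for resource, released in trr.items():
        hi_share, _ = SPLIT_RATIO[resource]
        to_high = round(released * hi_share / 100)
        high_app['tokens'][resource] += to_high
        med_app['tokens'][resource] += released - to_high

def cu_tokenizer(app):
    """Step 4 for compute units: one CU token maps to one CU (Table 6.1)."""
    return app['tokens']['CU']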

6.3.3.2 Partitioning L2 cache and DRAM cache:

The L2 cache and DRAM cache utilization of each application is monitored for each QoS interval. The partitioning of the caches is established by imposing a cache allocation threshold on the application. Token-based partitioning for a cache defines the granularity at which the resource quantity increases or decreases for a given application. For example, one token for the L2 cache translates to 10% of the cache space to be allocated to, or deallocated from, an application. The QoS target is achieved by controlling the cache allocation through a modified cache replacement policy during each interval. The utilization of the cache by each priority level is assessed through performance monitoring. If the present utilization is lower than the allocation threshold, then replacement follows a standard LRU algorithm. When the utilization of the constrained priority level reaches its threshold, our replacement policy overrides the default LRU to ensure that it finds the victim within the same priority level. There may be cases where the set does not contain a line from the same priority level, even though the utilization threshold is reached. In such cases, the victim is selected from a lower-priority application, or at random if none is available. The same modifications apply to DRAM caches, where the victim is an entire cached page within the set, rather than a cache line.
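The victim-selection override can be captured in a few lines, as sketched below. The line representation is a simplification (priorities are integers, larger meaning higher priority); it is not the controller implementation.

# Sketch of the priority-aware victim selection used to enforce cache quotas.
import random

def select_victim(cache_set, req_priority, over_threshold):
    """cache_set: list of dicts with 'priority' and 'lru_age'. If the requesting
    priority level has reached its allocation threshold, pick the victim from the
    same level, else from a lower level, else at random; otherwise standard LRU."""
    if not over_threshold:
        return max(cache_set, key=lambda l: l['lru_age'])     # default LRU
    same = [l for l in cache_set if l['priority'] == req_priority]
    if same:
        return max(same, key=lambda l: l['lru_age'])
    lower = [l for l in cache_set if l['priority'] < req_priority]
    if lower:
        return max(lower, key=lambda l: l['lru_age'])
    return random.choice(cache_set)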

6.3.3.3 Management of Multi-priority Structures:

There may exist situations where multiple applications from the same priority level execute concurrently. In such cases, Virtuoso allocates an equal number of resources to each application to maintain fairness. For cases where co-executing applications come from only two priority levels, the QoS policy adapts to a two-level priority structure. For example, if one high-priority application and two medium-priority applications execute concurrently, the system changes to a two-level priority structure and classifies the high-priority application as the partition level (applications below the partition level can release resources). In such cases, the resources are released by the medium-priority applications in every other QoS interval. This effectively halves the rate at which medium-priority applications release resources compared to low-priority applications, and prevents the medium-priority applications from behaving as low-priority workloads. Virtuoso knows the priority levels of the applications since they are stored in the QSR and QIT. The decision to release resources from an application is made by setting a bit in the QIT entry for the application. Implementing various QoS policies may lead to scenarios where low-priority applications lose ownership of some of the Virtuoso-controlled resources. In such cases, Virtuoso waits for the completion of a higher-priority application, and then reinstates the low-priority task with an upgraded priority (i.e., the low-priority task is upgraded to a medium-priority task). This mechanism prevents starvation of the low-priority application.

6.4 Evaluation Methodology

6.4.1 Platform for Evaluation

We evaluate our QoS-aware resource allocation mechanism using the same GPU model and simulation framework as described in Chapter 5, Section 5.5. We have integrated the hardware implementation of Virtuoso into the existing simulation infrastructure. The profiling counters used for performance monitoring are similar to those provided in today's GPU architectures. No additional performance counters were added to the GPU architecture.


QoS Configuration                                     QoS Settings
# of priorities                                       3
Partition level                                       Medium priority
QoS token settings (1 token = amount of resource):
    1 CU token                                        1 CU
    1 L2 cache token                                  5% of L2 cache space
    1 DRAM cache token                                5% of DRAM cache space
QoS split ratio (high : medium) in %:
    CU                                                67 : 33
    L2 cache                                          75 : 25
    DRAM cache                                        75 : 25

Table 6.1: QoS parameters and resource token settings used in Virtuoso evaluation.

The baseline dedicated (DED) performance is predicted by applying a linear estimation of the performance metrics obtained for each application in its first two QoS intervals. Leveraging linear estimation has been used previously by Aguilera et al. [11], and is also adopted in this work. The DRAM cache model and its latencies are modeled as described in previous work by Jevdjic et al. [66]. The configuration of the simulated GPU and the QoS settings are provided in Table 6.1.
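As an illustration, a straight-line extrapolation from the first two QoS intervals could look like the sketch below; the exact estimator is not detailed here, so this is only an assumed form of the linear estimation.

# Assumed form of the linear DED-performance estimate from two early samples.
def predict_ded(metric_interval1, metric_interval2, target_interval):
    slope = metric_interval2 - metric_interval1
    return metric_interval1 + slope * (target_interval - 1)

# e.g., IPC samples of 220 and 232 in the first two intervals
print(predict_ded(220, 232, 10))   # -> 328.0 predicted at interval 10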

Mix name    Applications (Priority Level)            Characteristic
Mix-1       SC (High), LC (Med), BP (Low)            Compute Bound
Mix-2       LUD (High), CFD (Med), BFS (Low)         Memory Bound
Mix-3       GE (High), MRI (Med), KM (Low)           Memory + Compute (Compute Heavy)
Mix-4       BP (High), BFS (Med), GE (Low)           Memory + Compute (Memory Heavy)
Mix-5       SC (High), LUD (Med), GE (Med)           Compute + Memory (Balanced)

Table 6.2: Workload mixes used to evaluate Virtuoso's support for QoS goals.

6.4.2 Applications for Evaluation

To properly evaluate Virtuoso, we build application mixes using the same multi-kernel OpenCL applications as described in Chapter 5, Section 5.5 (see Table 5.2). We build the application mixes such that each application is assigned a different priority level. The application mixes are also formed so that each mix displays a distinct execution behavior (memory bound or compute bound). The application mixes used for our evaluation are presented in Table 6.2.


Figure 6.7: Average time for CU reallocation, and wavefront execution for each application. The values are used to calculate the QoS reallocation interval.

6.4.3 Choosing the QoS Reallocation Interval

An important decision for dynamic resource allocation is the choice of the time interval between two allocation decisions. Using a shorter interval may not amortize the resource reallocation latency, and can be prone to sampling anomalies in our performance counts. Similarly, a longer interval may not capture the fine-grained behavior of an application, and may not allocate enough resources to achieve the QoS targets. The QoS interval for Virtuoso is chosen by considering: (1) the time taken to reallocate compute resources, and (2) the average time taken to capture the behavior of an application when using the reallocated resource. The first quantity is defined as the average time taken by an application to reassign one CU to another application (T_cuassgn). The CU preemption mechanism implemented in the simulator is used to calculate this value for each application [143]. The L2 cache and DRAM cache allocation scheme does not impose any latency, as it is controlled through the cache replacement policy. The second quantity used to determine the interval is defined as the average time taken by an application to complete the work allocated to one CU. This is calculated as the time taken to execute all wavefronts (warps in CUDA) scheduled on one CU for a given application. WF_perCU defines the average number of wavefronts on each CU, and WF_exec defines the average execution time for each wavefront of an application. The QoS interval is calculated using Equation 6.1, which amortizes the quantities described above.

QoS_{interval} = \frac{1}{N} \sum_{i=0}^{N-1} \left[ (T_{cuassgn})_i + (WF_{perCU})_i \times (WF_{exec})_i \right]    (6.1)

where N is the number of applications used for calculating the QoS interval value.

The values for WF_perCU and WF_exec are obtained by profiling the execution of our selected application benchmarks (Table 5.2) on a physical AMD FirePro GPU. Values of T_cuassgn and WF_exec for all of the applications in this thesis are presented in Fig. 6.7. Using Equation 6.1 and the data obtained from the profiling runs, the value of the QoS interval is calculated as 35K GPU cycles. This value is used as the QoS interval for the remaining evaluations.
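Equation 6.1 reduces to a simple average over the profiled per-application quantities, as the sketch below shows. The numbers in the example are placeholders, not the profiled values of Fig. 6.7.

# Equation 6.1 as a helper: average of (T_cuassgn + WF_perCU * WF_exec) per app.
def qos_interval(apps):
    """apps: list of (t_cuassgn, wf_per_cu, wf_exec) tuples, in GPU cycles."""
    total = sum(t + per_cu * exec_time for t, per_cu, exec_time in apps)
    return total / len(apps)

example = [(5000, 12, 2500), (8000, 10, 3000)]   # placeholder profile data
print(qos_interval(example))                     # -> 36500.0 cycles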

Figure 6.8: Virtuoso-based QoS enforcement leads to sharp gains in IPC (normalized to DED) for applications with higher STP. Restricting the degradation of low-priority applications leads to higher throughput with modest gains in IPC for each application.

6.5 Virtuoso Evaluation

6.5.1 Evaluation of QoS Policies

We evaluate the effects of prioritization and dynamic reallocation for the QoS policies proposed in Section 6.2. The performance constraints for each policy are specified as a multiplier of the baseline dedicated (DED) mode execution performance (e.g., 0.9x, 0.8x, etc.). Fig. 6.8 shows the performance of each application in a mix as compared to its dedicated execution performance. We observe that the Overall Constraint policy leads to 0.8x of DED performance for high-priority applications, even while achieving a constant STP. The resulting performance for the Overall Constraint policy can be attributed to the reassignment of low-priority resources to high- and medium-priority applications. The low-priority applications experience a ∼70% performance degradation as compared to their DED mode performance. Significant degradation of low-priority applications can be avoided using the Low Priority Constraint policy. The average performance of high-priority applications is lower when executing under the Low Priority Constraint policy, as compared to the other policies. Mix-5 does not contain a low-priority application. In such cases, Virtuoso uses a two-level priority approach, and the medium-priority application releases resources that are acquired by the high-priority application. The rate of resource reduction for the medium-priority application is half the rate of resource reduction for low-priority applications. This approach ensures fair management of medium-priority applications in such scenarios. As observed in Fig. 6.8, regardless of the policy, the degradation experienced by the medium-priority applications in Mix-5 is less than the degradation experienced by the low-priority applications.


Figure 6.9: Improvement seen in performance (IPC) and system throughput (STP) of applications using Virtuoso policies as compared to Naive-MT. The geomean of IPC (normalized to DED) for all priority levels is considered.

We also compare the performance of Virtuoso against a naive spatial multi-tasking (Naive-MT) strategy. The Naive-MT technique establishes the spatial allocation on the basis of the number of workitems (threads) launched by each application [10]. The Naive-MT technique does not implement any QoS-aware partitioning. Fig. 6.9 illustrates the performance of applications at each priority level versus their dedicated execution (DED). The comparison is done across the different Virtuoso policies and the Naive-MT technique. The aggressive reallocation of resources to high- and medium-priority applications using the Virtuoso policies results in enhanced performance over the Naive-MT scheme for the high and medium priority levels. The low-priority applications suffer more degradation using Virtuoso, but this can be controlled by using the Low Priority Constraint policy. Overall, Virtuoso shows an average improvement of 17.8% in IPC and 37% in STP over Naive-MT.

6.5.2 Impact of Compute Resource Allocation

The QoS-aware reallocation of compute units (CUs) helps high-priority applications achieve significant performance advantages, at the cost of performance degradation for low-priority applications. Virtuoso does not allow one CU to be shared by two applications simultaneously. If the split ratio for distributing CUs across priorities is non-integral, then the number of CUs assigned to the high-priority applications is rounded up, while the number of CUs assigned to the medium-priority applications is rounded down. We study the impact of CU allocation on Mix-1 (compute bound) and Mix-2 (memory bound) using the Overall Constraint policy. The evaluation is presented in Fig. 6.10. The CU allocation is found to have a direct impact on the system throughput (STP) for the compute-bound mix. In contrast, the STP for the memory-bound mix is less sensitive to CU allocation and is governed by the allocation of memory resources. The performance analyzer determines that the cache is the most sensitive resource for memory-bound applications (Mix-2), and reduces the rate of CU deallocation for low-priority applications. Hence, the rate at which CUs are deallocated from low-priority applications is slower for Mix-2 than for Mix-1.

Figure 6.10: Improvement in STP with CU allocation for compute-bound and memory-bound application mixes. The Overall Constraint policy is used with an STP constraint of 1.75.

6.5.3 Effects of QoS-aware Memory Management

1. L2 cache management: Fig. 6.11a shows the change in L2 cache space for every application, versus DED mode execution, for the different QoS policies. The average reduction in cache utilization for all applications, as compared to their DED mode execution, is 38%. The High Priority Constraint and Overall Constraint policies show an improvement in cache space allocation for high- and medium-priority applications, as compared to the Naive-MT technique. We analyze the impact of cache space allocation using the Misses-Per-Instruction (MPI) metric. Fig. 6.11b illustrates the advantages of using QoS-aware cache allocation by comparing cache space allocation and MPI for each application mix against the Naive-MT technique. The Overall Constraint policy does not limit the degradation of low-priority applications. The loss of cache space for low-priority applications can be larger than 60% when operating under the Overall Constraint policy. Such degradation increases the MPI of the platform versus the Naive-MT execution. The high-priority and medium-priority applications experience significant benefits using the Virtuoso-defined policies, and exhibit an average 38% increase in cache space over the Naive-MT execution. The performance analyzer identifies the L2 cache and the DRAM cache as the primary factors influencing the performance of the high-priority application in memory-bound mixes. Therefore, the change in L2 cache space for memory-bound mixes (Mix-2 and Mix-4) is much higher than that of compute-bound mixes (Mix-1, Mix-3 and Mix-5).


Figure 6.11: (a) Virtuoso-based policies show less reduction in cache space (compared to DED) versus Naive-MT, and (b) reduction in MPI achieved by assigning more cache space to applications using Virtuoso. Performance is compared versus Naive-MT.

2. Benefits of die-stacked DRAM cache: Inclusion of a die-stacked DRAM cache proves beneficial for applications since it can increase their global memory throughput [66]. Providing more DRAM cache space to higher-priority applications not only improves their global memory throughput, but also improves their IPC. We compare the global memory throughput of our QoS policies against Naive-MT mode execution. We include a DRAM cache in our Naive-MT executions for a fair comparison, and also to highlight the benefits of QoS-aware allocation. Fig. 6.12 shows the improvement in global memory throughput for the Overall Constraint policy and the Low Priority Constraint policy. The high-priority applications experience more than a 2x improvement in global memory throughput over Naive-MT. Memory-bound applications (Mix-2, Mix-4) are more sensitive to changes in DRAM cache allocation and show large improvements in IPC as we increase their DRAM cache space.


Figure 6.12: QoS-aware allocation of DRAM cache space using Virtuoso leads to improvements in global memory throughput (GMT) over Naive-MT.

Figure 6.13: Timeline highlighting the behavior-aware resource allocation along with performance enhancements obtained with Virtuoso to achieve QoS targets. Mix-4 (BP/BFS/GE) execution for first 1M cycles is shown.

6.5.4 Case Study: Overall Impact of Virtuoso

As a case study, we show an execution timeline of Mix-4 (BP/BFS/GE) on the multi-tasked GPU with Virtuoso. The High Priority Constraint policy is used, with a constraint of 0.8x of DED performance. We focus on the first 1M cycles of execution of the mix. As observed in Fig. 6.13, Virtuoso is invoked after 35K cycles. The performance of BP improves (by more than 60%) as more CUs are allocated between cycles 150K and 500K. At 450K cycles, 8 CUs are transferred to the compute-bound BP application, which results in large performance gains.


Figure 6.14: Predictive data buffer movement between GPU and host memory using TMM allows mixes to oversubscribe the GPU memory. The total memory subscribed by applications to GPU DRAM memory is shown on the secondary axis.

The L2 cache space reallocated from GE does not result in large benefits for the compute-bound BP, but improves the performance of the memory-bound BFS application. The same effect is seen when DRAM cache space is reallocated from GE and given to BP and BFS. We see a performance increase of greater than 20% for BP and BFS, due to the added space in the DRAM cache (after 750K cycles). The STP shows a standard deviation of only 5.3% throughout the execution of the mix. As observed in Fig. 6.9, Mix-4 also shows an average performance improvement of 14% using Virtuoso compared to Naive-MT.

6.5.5 Evaluation of Memory Oversubscription

We analyze the efficiency of the TMM mechanism by monitoring the usage of the off-chip DRAM memory (global memory) by each application in the concurrently executing application mix. At the same time, we track the total amount of memory subscribed by the executing mix. We examine this in Fig. 6.14 by showing the memory usage of applications belonging to Mix-1 (compute bound) and Mix-2 (memory bound) for the first 10 QoS intervals. The mixes execute under the Overall Constraint policy. The impact of TMM can be clearly observed for the LC (Mix-1) and LUD (Mix-2) applications, where the memory usage increases for some intervals and then decreases during others. TMM actively transfers the cold data buffers to the host memory, creating more space for new buffers to be allocated. The total memory subscribed by the mix continues to increase as applications create more data buffers. The average amount of memory requested by any mix is at least 1.4x greater than the DRAM capacity of the simulated GPU. Some application mixes, such as Mix-1 and Mix-2, show regions of free memory in the DRAM during the execution of the mix. Both mixes contain cold buffers which are transferred back to the host memory, thereby creating more free memory.


Figure 6.15: Aggressive selection of QoS constraints for high-priority applications leads to failure in achieving QoS targets and to STP degradation.

6.5.6 Sensitivity Analysis of High-Priority Constraints

The High Priority Constraint policy is used to secure a majority of the resources, for a majority of the time, for the high-priority applications. As we select a constraint closer to the DED performance, the High Priority Constraint policy becomes more aggressive in achieving the QoS targets. This results in acute degradation for the low-priority application and can also lower the STP. We analyze the performance sensitivity of the High Priority Constraint policy by increasing the value of the constraint from 0.4x to 0.95x of dedicated performance in Fig. 6.15. As we increase the constraint value, the low-priority applications experience extreme performance degradation. The shaded portion of the figure indicates the regions where the low-priority application loses ownership of some of the Virtuoso-controlled resources and the high-priority application fails to achieve the QoS target. Fig. 6.16 highlights this behavior for the low-priority applications from the compute-bound Mix-1 and the memory-bound Mix-2. Compute-bound mixes (Mix-1) cause the low-priority application to lose control of CUs, while the low-priority applications in memory-bound mixes (Mix-2) lose control of the L2 cache.


Figure 6.16: Degradation of low-priority applications from (a) Mix-1 (compute bound) and (b) Mix-2 (memory bound), leading to loss of control of one or more resources and an acute decrease in IPC. The High Priority Constraint policy with a constraint of 0.9x of DED performance is used.

6.5.7 Measuring QoS Success Rate

The QoS success rate shows the ability of the framework to achieve QoS targets for applications. It is measured using Equation 6.2:

\text{QoS Success Rate} = \frac{QoS\_achieved\_intervals}{Total\_num\_QoS\_intervals} \times 100    (6.2)

where QoS_achieved_intervals is the number of QoS intervals in which the QoS target for the application was met or surpassed. Table 6.3 shows an average QoS success rate of 91.1% across the evaluated Virtuoso policies for all of the application mixes and pairs.

                                 High Priority Constraint      Low Priority Constraint      Overall Constraint
                                 (Constraint = 0.85x of DED)   (Constraint = 0.5x of DED)   (STP = 1.75)
Mixes (Mix-1 to Mix-5)           91                            95                           87.5
COM/MEM pairs (COM-as-High)      89                            97.3                         90.1
COM/MEM pairs (MEM-as-High)      86.3                          98.1                         86.2

Table 6.3: High success rates for Virtuoso-based QoS policies across the evaluated application mixes and combinations.

6.6 Discussion

Virtuoso enables a multi-tasked GPU to control performance-critical compute and memory resources.


Figure 6.17: Improvements in global memory throughput (GMT) using a QoS-aware interconnect network.

The QoS-aware architecture can be extended to control other performance-critical elements such as the on-chip interconnect network fabric. Co-executing applications on a multi-tasked GPU may tend to oversubscribe the interconnect fabric while trying to access the L1 cache, L2 cache, and DRAM cache from the CUs. Enforcing QoS-aware policies for interconnect utilization can reduce access latencies for high-priority applications and help us achieve QoS targets more quickly. To understand the potential of controlling the interconnect for QoS, we implement a prototype scheme to accelerate high-priority accesses between the L2 cache and the DRAM cache. We modify the structure of the DRAM cache read/write buffer by assigning slots to high-, medium-, and low-priority tasks. For the DRAM cache, we implement a 2KB (32 entries x 64-byte entry size) read/write buffer [66]. We reserve 16 entries for high-priority, 12 entries for medium-priority, and 4 entries for low-priority tasks. The unused entries of lower priority levels can be used by accesses from higher priority levels. An evaluation of our application mixes using this configuration is shown in Fig. 6.17, comparing Virtuoso equipped with a QoS-aware network against Virtuoso without one. We observe an 8.3% improvement in overall global memory throughput with the use of QoS-aware networks in Virtuoso. In future work, we plan to extend this design to provide QoS-aware traffic management from the L1 to the L2 cache.
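The slot-reservation policy of the prototype read/write buffer can be sketched as follows; the admission routine and occupancy bookkeeping are illustrative assumptions, reflecting only the 16/12/4 split and the borrow-from-lower-priority rule described above.

# Sketch of priority-slotted admission into the 32-entry DRAM-cache read/write buffer.
RESERVED = {'high': 16, 'medium': 12, 'low': 4}
ORDER = ['high', 'medium', 'low']

def admit(priority, occupancy):
    """occupancy: dict priority -> entries currently held. Returns the slot class
    used by the access, or None if it must wait for a free entry."""
    if occupancy[priority] < RESERVED[priority]:
        occupancy[priority] += 1
        return priority
    # higher-priority accesses may borrow unused lower-priority entries
    for lower in ORDER[ORDER.index(priority) + 1:]:
        if occupancy[lower] < RESERVED[lower]:
            occupancy[lower] += 1
            return lower
    return None

occ = {'high': 16, 'medium': 3, 'low': 0}
print(admit('high', occ))   # -> 'medium' (borrows an unused medium-priority slot)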

6.7 Summary

In this chapter we developed Virtuoso, which extends the concurrent context execution mechanism described in Chapter 5. Virtuoso realizes a dynamic, QoS-aware, and application behavior-aware resource manager for multi-tasked GPUs. Our design for Virtuoso uses both the software runtime and the hardware architecture of the GPU. The hardware level of Virtuoso periodically monitors the performance of the executing applications, and performs a priority-guided reallocation of CUs, L2 cache, and DRAM cache. The software-level component of Virtuoso provides a flexible interface to define QoS policies for co-executing applications. It also uses the Transparent Memory Manager to manage oversubscription of GPU memory by concurrent applications. Virtuoso is able to improve application throughput by 37% over a current state-of-the-art spatial multi-tasking system, and achieves a QoS success rate of 91.1% across all of the application mixes and pairs for all evaluated QoS policies.

Chapter 7

Predictive Scheduling for GPU based Cloud Servers using Machine Learning

Cloud and server infrastructures routinely use GPUs for servicing compute tasks from various domains such as signal processing, data mining, biomedical simulations, search, and gaming [61, 106, 110]. A number of large-scale cloud providers and data center operators, such as Amazon EC2 [6], Nimbix [7], and Peer1 Hosting [8], offer GPU services. Given that application kernels may not always fully utilize a GPU, we developed architectural techniques to support multi-tasking on GPUs, along with QoS-aware resource provisioning. Our work proposed in the previous chapters helps achieve higher throughput using a multi-tasked GPU, and provides an interface to define flexible QoS policies for the GPU. The Virtuoso framework ensures that applications achieve their QoS through priority-guided management of multiple resources. In addition, the support for multi-tasking on GPUs has advanced over the years. Modern NVIDIA GPUs can support concurrent execution of data-independent computations using Hyper-Q technology [109]. Techniques such as NVIDIA GRID allow co-execution of up to 16 applications on a cloud platform [4]. However, it can be difficult to effectively schedule concurrent applications on a GPU, since the operator does not have any a priori knowledge of the characteristics of the workloads scheduled for execution on the server or cloud system. Hence, scheduling multiple applications on a multi-tasked GPU may lead to workload imbalance and resource contention caused by application interference. The Virtuoso framework aids interference detection by providing a linear estimation of the dedicated performance of the scheduled application, and uses this estimate to initiate its resource management. However, linear estimation models can be limited due to their inability to predict the different behavior phases of an application. Machine learning (ML) based techniques have been suggested for estimating application performance on GPUs [161]. But realizing these techniques in GPU hardware leads to heavy compute, storage, and hardware overheads. An efficient, low-overhead runtime mechanism to detect interference among scheduled applications is required for GPU cloud servers and datacenters.

We introduce Mystic, a framework enabling interference-aware scheduling for GPU workloads. Our work targets server and cloud schedulers, and utilizes machine learning algorithms. Mystic utilizes the concurrency features of modern GPUs exposed by programming frameworks such as CUDA 7.0. The major novelty of Mystic is its ability to accurately characterize an unknown application in order to predict the interference it would cause when co-scheduled with other executing applications. The characterization process of Mystic uses a collaborative filtering approach. Collaborative filtering is commonly used in state-of-the-art recommendation systems deployed by Netflix [26] and Amazon [88]. The characterization predicts the execution behavior of incoming applications, and their similarity to previously scheduled applications. It exploits the data obtained from previously executed applications, coupled with an offline-trained application corpus and a minimal signature obtained from the incoming application.

The rest of the chapter is organized as follows. Section 7.1 describes the challenges for application scheduling on GPU cloud servers. Sections 7.2 and 7.3 provide background on GPU remoting and collaborative filtering based prediction, respectively; these techniques are used to realize the Mystic framework. Section 7.4 describes the Mystic architecture. Section 7.5 describes the evaluation methodology, while Section 7.6 presents the evaluation results.

7.1 Scheduling Challenges for GPU Cloud Servers and Datacenters

The major scheduling goals for a server or cloud service operator are to achieve high resource utilization, fast execution (low turnaround time), and high system throughput. However, the effective use of GPUs in current multi-tenant servers and cloud engines is limited by static provisioning of GPU resources. Applications are statically assigned to GPUs and are provided dedicated access to the GPU for the duration of their computation. This dedicated access (i.e., GPU pass-through) often leads to cases where long-latency applications block an entire device, thereby reducing the overall system throughput [90]. Additionally, current static provisioning policies can also underutilize the GPUs, resulting in idle compute resources. Scheduling applications in such cases becomes more difficult, since cloud/server operators have to accommodate a diverse set of GPU workloads in terms of their resource needs and performance requirements. Moreover, these applications may interfere with each other during co-execution, which lowers system throughput and degrades their individual performance.


Figure 7.1: Degradation in performance for 45 applications run on an 8-GPU cluster using interference-oblivious scheduling. The baseline is the execution performance of applications when executed alone (dedicated mode). Applications are arranged according to performance.

We define interference as any performance degradation of an application on a multi-tasked GPU caused by resource contention between co-executing applications. The co-executing applications share the Streaming Multiprocessors (SMs), caches, memory channels, storage, and the interconnect network. To motivate our work, we conduct a study to measure the impact of interference (see Figure 7.1). We launch 45 applications selected from the Rodinia (17 apps.) [32], Parboil (9 apps.) [136], and Polybench (10 apps.) [53] suites, as well as the OpenFOAM solvers (9 apps.) [142], on four nodes of a private cluster, with 2 GPUs on each node. We experiment with 2 different interference-oblivious scheduling schemes: Round Robin (RR) and Least Loaded (LL). We observe that the interference-oblivious schedulers introduce resource contention, which generates an average slowdown of 41%, with 8 applications showing more than a 2.5x degradation. We define the Quality of Service (QoS) goal for co-executing applications to be 80% of their dedicated performance. Only 9% of the applications achieve this QoS target using LL or RR scheduling. To address this problem, our work focuses on resolving workload interference through informed scheduling for a range of workloads run on GPU servers and cloud engines. Interference-aware scheduling on GPU servers has been attempted in the past [117, 131]. However, the techniques described in these approaches implement time-sharing on GPUs and do not effectively use the advanced support for concurrency on modern GPUs. Interference-aware scheduling for CPU workloads on servers and data centers has been studied in great detail over the past few years [96, 104]. The approaches presented in these studies use empirical analysis to develop models that can help predict interference and guide resource management.


Figure 7.2: GPU Remoting: (a) architecture of GPU remoting, and (b) implementation of GPU remoting with frontend applications and backend threads (BTs).

In our work, we do not have any a priori empirical or static analysis of the application characteristics. We develop an interference-aware scheduler that can learn the behavior of applications which have already been executed on the system. Mystic tracks the applications executing on each GPU of the server and monitors their execution performance. Once the incoming application is characterized, it is scheduled to run on the GPU that will produce the least amount of interference with the already executing applications. Since collaborative filtering provides a highly efficient characterization of potential interference, system throughput and application performance can benefit in a timely manner. Mystic leverages analytical techniques and machine learning, and hence provides a robust framework which can scale with the number of GPUs and nodes in a server.

7.2 GPU Remoting for Cloud Servers and Datacenters

GPU remoting is an efficient technique for scheduling multiple applications on multiple GPUs in a cluster or cloud server. Previous works such as GVim [57], rCUDA [44], vCUDA [132], Pegasus [58], and gVirtus [50] each propose a slightly different approach to GPU remoting. GPU remoting provides an API-driven separation of the CPU component and the GPU component of an application. Figure 7.2a shows an architecture that supports GPU remoting. The frontend of the implementation presents an interposer library that intercepts CUDA calls, while the backend executes as a daemon responsible for receiving GPU requests. Requests from various frontend user applications can be aggregated into backend threads, which can be handled as a single GPU context (see Figure 7.2b). The GPU components of all the frontend applications co-executing on the GPU are assigned to separate backend threads. The backend threads map to the same device on a per-GPU

context basis. This design enables GPU operations from different applications to be executed concurrently, which allows a single GPU to be shared in both space and time [44, 131]. Our work does not propose a new GPU remoting technique, but instead leverages the GPU remoting technique proposed by rCUDA [44] for managing a GPU cluster. We add the Mystic framework on top of rCUDA to provide support for machine learning based application scheduling.
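The sketch below illustrates the frontend/backend split described above. It is a purely conceptual Python mock-up, not rCUDA's API: the class name BackendDaemon, the register_app/submit methods, and the queue-per-application structure are illustrative assumptions.

```python
import queue
import threading

class BackendDaemon:
    """One backend thread (BT) per frontend application; all BTs assigned to a GPU
    share that GPU's single context, so their operations can overlap on the device."""
    def __init__(self):
        self.app_queues = {}

    def register_app(self, gpu_id, app_id):
        # Create a request queue and a backend thread for this frontend application.
        q = queue.Queue()
        self.app_queues[app_id] = q
        threading.Thread(target=self._serve, args=(gpu_id, q), daemon=True).start()

    def submit(self, app_id, call, *args):
        # Invoked by the frontend interposer library after intercepting a GPU call.
        self.app_queues[app_id].put((call, args))

    def _serve(self, gpu_id, q):
        while True:
            call, args = q.get()
            call(gpu_id, *args)   # a real backend would issue the CUDA operation here
```

In a real deployment the frontend is a CUDA interposer library and the intercepted calls are forwarded over the network to the backend node, as rCUDA does.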

7.3 Collaborative Filtering for Application Performance Prediction

Our work aims to predict the potential interference between GPU applications. The Mystic framework does not try to analyze every application in detail, but leverages information about the applications that are currently running on the system. We characterize the incoming application according to its similarity with currently running applications. Applications with similar characteristics will contend for the same resources, and thus potentially interfere with one another if scheduled on the same GPU. To detect potential interference, we use a recommender system built on knowledge of resource usage from a corpus of offline-profiled applications. Note that our training corpus workloads are a different set of applications than those used during the evaluation of our framework. The recommender system is used to guide the scheduler on where to run the incoming applications. As each incoming application is about to be scheduled, we profile the application for a short amount of time to generate a limited profile signature. We then use collaborative filtering to predict the full characteristics of the incoming application based on the limited profile. Collaborative filtering (CF) is a technique from machine learning that is widely used in recommender systems, an enabling framework made popular by e-commerce companies such as Amazon [88] and Netflix [26]. In an e-commerce environment, the recommender system analyzes the buying history of a user and provides personalized recommendations based on the user's interests. There are two common methods used in collaborative filtering: neighborhood-based methods and latent factor models [79]. Using a neighborhood-based method, the system explores the relationships between users (user-based collaborative filtering) or items (item-based collaborative filtering). To recommend an item to user u, the system can either find similar users and recommend the items they preferred to u, or it can find the items that are similar to what u has rated positively and recommend those items to u. An alternative approach is to use a latent factor model. Instead of characterizing the relationships between users or items alone, a latent factor model describes both users and items in a latent factor domain. Unlike human-produced domains, where a product is categorized into factors (e.g., genre for a movie), the latent factors are concepts that can uncover complicated relationships, such as a group of movies with product placements, or two products with totally different user bases.


Figure 7.3: Singular Value Decomposition (SVD) on a rating matrix.

In the case of scheduling applications on GPUs, we use the latent factor model, where the incoming application represents the user u, and the missing values in the limited profile represent the products to be recommended. Singular Value Decomposition (SVD) is a popular approach for decomposing user-product relations into latent factor domains. Figure 7.3 shows an example of how SVD is applied in a recommender system. The input to the recommender is a matrix Rm×n (i.e., the rating matrix), where each row represents a user, each column represents a product, and each cell rui represents the rating that user u gives to product i. The goal of the recommender is to predict the missing values in the matrix (in our case, the recommender predicts the characteristics of the incoming application). The latent factor model first applies SVD, a well-studied matrix decomposition method, to

the rating matrix. SVD generates three matrices: Um×r, Σr×r, and Vn×r. Σ is a diagonal matrix whose dimension r is the number of latent factors, and each element on the diagonal represents the confidence of the corresponding factor, in descending order. A row u in matrix U describes to what extent the user u likes each factor, such as how much u prefers one movie concept over another, or how much u dislikes a movie with a particular concept. A row i in matrix V, on the other hand, captures to what extent the product i possesses those corresponding concepts. To predict the missing value rui using SVD, we use:

\[
r_{ui} = \frac{\sum_{j=1,\, j \neq i}^{n} r_{uj} \cdot \mathrm{sim}(V_i, V_j)}{\sum_{j=1,\, j \neq i}^{n} \mathrm{sim}(V_i, V_j)}
\]

where sim() measures the similarity of the vectors Vi and Vj, and Vi represents the i-th row of the matrix V, which identifies the latent factors of product i. We use Vi and Vj to identify the latent factors of the various performance characteristics. In a product/movie dataset, the training rating matrix is always sparse. Thus, Stochastic Gradient Descent (SGD) is commonly used to reconstruct the matrix and recover the missing values before applying SVD.


In our implementation of the Mystic framework, we have a fully populated training rating matrix, with rows representing training applications from our corpus and columns representing the performance characteristics. Thus, we can use SVD directly to predict the missing performance metrics for an incoming application provided with a limited profile. The performance predictions are then used for determining application interference.
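The sketch below shows one way the prediction rule above could be realized in practice; it is an illustrative assumption rather than the dissertation's implementation. In particular, the function name predict_missing_metrics, the use of cosine similarity for sim(), the choice of r = 8 latent factors, and restricting the sums to the measured metrics of the incoming application are all assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two latent-factor vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def predict_missing_metrics(trm, partial, r=8):
    """Predict missing entries of a partial profile row.

    trm     : (m, n) fully populated training matrix (applications x metrics)
    partial : length-n vector with np.nan for metrics that were not measured
    r       : number of latent factors to keep
    """
    # Factor the training matrix: trm ~ U * diag(s) * Vt
    U, s, Vt = np.linalg.svd(trm, full_matrices=False)
    V = Vt[:r].T                                   # (n, r): latent factors per metric
    known = np.where(~np.isnan(partial))[0]
    pred = partial.copy()
    for i in np.where(np.isnan(partial))[0]:
        # Similarity-weighted average over the measured metrics (the equation above)
        sims = np.array([cosine_sim(V[i], V[j]) for j in known])
        pred[i] = np.dot(sims, partial[known]) / (sims.sum() + 1e-12)
    return pred
```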

7.4 Mystic Architecture

Mystic is implemented as a 3-stage control layer, hosted in the head node of a GPU cloud server or cluster. The framework is capable of predicting the interference between an incoming application and the applications currently running on the server. Mystic guides the scheduler to optimize the co-execution of applications on a GPU using the predicted interference. The architecture of the Mystic framework is shown in Figure 7.4.

7.4.1 The Causes of Interference (CoI)

Contention for resources on a multi-context GPU can lead to interference between multiple co-executing applications. The applications contend for compute resources such as streaming multiprocessors (SMs), memory resources (e.g., L1 cache, L2 cache, texture caches, global DRAM memory), and the interconnect network. These six resources form the major causes of interference (CoIs) between co-executing applications. To understand the resource usage of each application, we profile the applications using the NVIDIA profiler tools [5]. Table 7.1 presents details on the metrics acquired during profiling for each contended resource. The performance of an application is subject to dynamic factors such as the number of kernel launches and the execution grid size on the GPU. To establish a fair comparison among applications, we normalize the profiled performance metrics against the kernel grid size and the number of kernel launches. When comparing two applications, we consider the difference in their performance metrics for all the CoIs. If the percentage variation (difference) is less than 40% for three or more contended resources, then the two applications are marked as similar, and thereby noted to be interfering.
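As a concrete reading of this rule, the sketch below marks two applications as interfering when their normalized metrics differ by less than 40% for at least three CoIs. The aggregation of each CoI's metrics into a single value per resource and the exact definition of percentage variation (here, relative to the larger of the two values) are assumptions.

```python
def are_interfering(profile_a, profile_b, threshold=0.40, min_matches=3):
    """profile_a, profile_b: dicts mapping each CoI name to one normalized metric value.
    Returns True when the two applications are considered similar (i.e., interfering)."""
    matches = 0
    for coi, a in profile_a.items():
        b = profile_b[coi]
        denom = max(abs(a), abs(b), 1e-12)      # guard against division by zero
        if abs(a - b) / denom < threshold:      # <40% variation on this resource
            matches += 1
    return matches >= min_matches               # similar on 3 or more of the 6 CoIs
```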

7.4.2 Stage-1: Initializer and Profile Generator

The initializer starts Mystic, invokes the different modules of the framework, and creates the book-keeping mechanism to track the progress and status of each application. The initializer queries the cluster status table. The framework obtains the IP address, the number of CPU cores, the number of GPUs, and the amount of system memory present on each compute node.


Figure 7.4: The architecture of Mystic. The Mystic framework is implemented as a 3-stage controller hosted in the head node of the GPU cluster or cloud server.

The initializer uses this information to create a status entry for each incoming application in the Mystic Application Status Table (MAST). Mystic detects the interference between applications by classifying applications according to their similarity. The framework needs to obtain profiler metrics for each incoming application to aid the process of similarity detection. Obtaining a full profile for each incoming application leads to redundant computation and introduces delay, which cannot be tolerated in service-oriented cloud servers and clusters.


Metric Name               Resource     Metric Name                      Resource
SM Efficiency             SM           L2 reads                         L2 cache
IPC                       SM           L2 writes                        L2 cache
DRAM read throughput      DRAM         Texture cache read throughput    Texture cache
DRAM write throughput     DRAM         Texture cache hit rate           Texture cache
DRAM reads                DRAM         Texture cache reads              Texture cache
DRAM writes               DRAM         Global Loads                     Interconnect
L1 reads                  L1 cache     Global Stores                    Interconnect
L1 hit rate               L1 cache     Global Load throughput           Interconnect
L1 read throughput        L1 cache     Global Store throughput          Interconnect

Table 7.1: The metrics used for profiling resource usage.

Instead, Mystic initiates two short profiling runs for each incoming application to obtain metrics for two randomly selected CoIs (out of the 6 identified CoIs). The profiler run needs to be long enough to profile each distinct kernel in the application at least once. We analyzed 37 applications (from Rodinia [32], Parboil [136], and the NVIDIA SDK [111]) and found that a profiling time of 5 seconds captures profile metrics for every kernel in our applications. The head node maintains two separate GPUs for initiating the profile runs. The short profiles (∼5 seconds) of incoming applications are collected and stored in the Profile Information Table (PIT) in the form of sparse rows, since metrics for only 2 of the 6 CoIs are captured. The PIT is indexed by the application process ID (pid). To fill the missing values in the PIT entry for each application, we use collaborative filtering (CF). As described in Section 7.3, CF requires a rating matrix which contains the history of previously observed values (metrics in our case). Our rating matrix maintains full profiles of several applications which are profiled offline. The rating matrix is referred to as the Training Matrix (TRM) in the Mystic framework. The next stage of Mystic performs the collaborative filtering using the TRM and PIT.
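The sketch below shows how a sparse PIT row could be assembled from one short profiling run, following the metric layout of Table 7.1. The helper run_short_profile (a wrapper around the profiler), the metric_to_coi mapping, and the NaN encoding of unmeasured metrics are hypothetical.

```python
import random
import numpy as np

COIS = ["SM", "L1 cache", "L2 cache", "Texture cache", "DRAM", "Interconnect"]

def build_pit_row(app, metric_names, metric_to_coi, run_short_profile):
    """Return a sparse PIT row: measured metrics filled in, the rest left as NaN."""
    profiled_cois = random.sample(COIS, 2)            # 2 of the 6 CoIs per incoming app
    measured = run_short_profile(app, profiled_cois)  # ~5 s run -> {metric name: value}
    row = np.full(len(metric_names), np.nan)
    for idx, name in enumerate(metric_names):
        if metric_to_coi[name] in profiled_cois and name in measured:
            row[idx] = measured[name]
    return row
```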

7.4.3 Stage-2: Collaborative Filtering (CF) based Prediction

The CF predictor takes the PIT and TRM as inputs. When a new application A0 is enqueued

for execution on the system, the predictor identifies A0's profile information by searching the PIT using the process id (pid) of the application. The PIT returns a sparse vector ~v with the metrics obtained from the short profiles collected in Stage-1. The sparse PIT entry does not contain the metrics of all CoIs. To fill out the missing entries using collaborative filtering, we append the vector ~v to the TRM (Training Matrix), as shown in Figure 7.5.


Figure 7.5: Collaborative Filtering for Predicting Performance. vector ~v to the TRM (Training Matrix), as shown in Figure 7.5. We then perform SVD-based col- laborative filtering (as described in Section 7.3) to fill out all of the missing values in the TRM. The rating matrix (TRM in our case) has only one row of missing incoming values, since we only ap- pend one new application vector ~v to the TRM at a time. This allows predictions to be independent of other incoming applications, which increases the prediction accuracy. The prediction generated by the predictor is the vector ~v′, for which all the CoI metrics have been predicted. The vector ~v′ is written to the Prediction Table (PRT), which holds the CF predicted performance values of all CoIs for the incoming applications and current running applications. The PRT is used to guide the interference-aware scheduler in Stage-3 of the Mystic framework.
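One simple way to realize this append-and-fill step is low-rank SVD imputation, sketched below. The mechanics (seeding the unknown entries with column means, then projecting onto the top r latent factors) are an assumption, not necessarily the exact procedure used by Mystic.

```python
import numpy as np

def fill_with_svd(trm, v_sparse, r=8):
    """trm: (m, n) fully populated training matrix; v_sparse: length-n row with NaNs.
    Returns the completed row ~v' that would be written to the PRT."""
    known = ~np.isnan(v_sparse)
    v0 = np.where(known, v_sparse, trm.mean(axis=0))   # seed missing metrics with column means
    M = np.vstack([trm, v0])                           # append the one incoming row

    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_r = (U[:, :r] * s[:r]) @ Vt[:r]                  # rank-r reconstruction of the matrix

    v_pred = v_sparse.copy()
    v_pred[~known] = M_r[-1, ~known]                   # predicted values for the missing CoIs
    return v_pred
```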

7.4.4 Stage-3: Interference Aware Scheduler

To guide the scheduling of incoming applications onto a GPU node while introducing the minimum interference, we implement an interference-aware scheduler. The scheduler includes a similarity detector, which takes the Prediction Table (PRT) and the MAST as inputs, and generates the similarity between pairs of applications. The scheduling algorithm decides that the incoming application A0 will not be executed concurrently with an executing application A1 on the same GPU if: (1) there are other idle GPUs in the system, or

(2) there is at least one currently running application A2 that is less similar to A0 than A1 is. More specifically, whenever a new application arrives, the scheduler first checks for an idle GPU. If more than one idle GPU is found, the application is assigned to the idle GPU nearest to the head node. In the absence of an idle GPU, the similarity detector finds the entry for the incoming application in the PRT using the application pid. The detector extracts the predicted metrics of the new application from the PRT and arranges them in a vector ~vnew.


The detector then collects the pids of all the executing applications from the MAST. Using these pids, the detector acquires the prediction vectors (~vrpid) of all the executing applications from the

PRT. An interference score is obtained by computing the similarity between ~vnew and every ~vri (where i is the application pid). The higher the similarity, the higher the interference score. The interference score for each GPU in the system is calculated, and the incoming application is dispatched to the GPU with the lowest interference score.
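The decision logic of Stage-3 can be summarized by the sketch below. The use of cosine similarity and the summation of pairwise similarities into a per-GPU interference score are assumptions; the text above only states that higher similarity maps to a higher interference score.

```python
import numpy as np

def similarity(v1, v2):
    """Cosine similarity between two predicted metric vectors (assumed measure)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12))

def pick_gpu(v_new, running_vectors, idle_gpus):
    """running_vectors: {gpu_id: [prediction vectors of apps running on that GPU]}."""
    if idle_gpus:                                  # rule (1): prefer an idle GPU
        return idle_gpus[0]                        # e.g., the idle GPU nearest the head node
    scores = {}
    for gpu_id, vectors in running_vectors.items():
        # Higher similarity with the co-runners means higher predicted interference.
        scores[gpu_id] = sum(similarity(v_new, v_run) for v_run in vectors)
    return min(scores, key=scores.get)             # rule (2): lowest interference score
```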

7.4.5 Validation

We validate our CF-based prediction model by comparing the clustering results for the predicted performance (CF-based prediction) against the actual performance profiles of the applications. During validation, the Training Matrix (TRM) consists of full profiles of applications from the Rodinia [32] and Parboil [136] suites. We use 16 applications taken from the NVIDIA SDK as the testing set [111]. As a first step, we capture full profiles of the applications in the testing set. Next, we capture the limited profiles (∼5-second runs), covering 2 of the 6 CoIs, for all the testing-set applications. We then fill out the missing metrics in the limited profiles using our CF-based predictor described in Section 7.4.3, so that Mystic creates a predicted profile for each testing-set application. As a final step, we apply agglomerative clustering [23] to both the full profiles and the predicted profiles of the validation-set applications. Figure 7.6 shows the clustering results using both the full profiles and the predicted profiles. Mystic relies on accurate predictions of similarity between applications to calculate the interference scores. Hence, the similarity of the two clusterings provides an indication of the accuracy of the CF predictor. Clustering similarity can be compared using the Normalized Mutual Information (NMI) metric described by Strehl et al. [138]. The NMI1 of the clustering results for the 3 clusters is 0.89.
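A sketch of this validation step is shown below, using scikit-learn's agglomerative clustering and NMI score; the array names and the default linkage choice are assumptions, but the two library calls are standard.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import normalized_mutual_info_score

def clustering_nmi(full_profiles, predicted_profiles, n_clusters=3):
    """full_profiles, predicted_profiles: (apps x metrics) arrays for the testing set."""
    labels_full = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(full_profiles)
    labels_pred = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(predicted_profiles)
    # An NMI close to 1.0 means the two clusterings agree (0.89 in our validation).
    return normalized_mutual_info_score(labels_full, labels_pred)
```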

7.5 Evaluation Methodology

7.5.1 Server Setup

We evaluate the performance and efficiency of the Mystic framework using a private GPU cluster. The cluster system includes a head node housing two GPUs, and 16 distinct compute nodes, each with two GPUs per node. Table 7.2 provides the details of the configuration of our evaluation platform.

1 NMI is a higher-is-better metric, ranging from zero to one.


Figure 7.6: Comparing the clustering of applications using the Actual and Predicted metrics. (The validation set consists of the 16 NVIDIA SDK applications: MC, scan, interval, binoOpt, MerTwist, reduction, matrixMul, transpose, conjuGrad, mergeSort, sortingNet, lineOfSight, radixSrtThr, SobolQRNG, convolFFT2D, and quasirandom.)

                    Head Node              Compute Node
CPU                 Intel Xeon E5-2695     Intel Xeon E5-2695
Cores               12 (2 threads/core)    12 (2 threads/core)
Core Clock          2.4 GHz                2.4 GHz
L2/L3 Cache         30 MB                  30 MB
GPU                 NVIDIA K40m            NVIDIA K40m
GPU count           2                      2
GPU DRAM            12 GB / device         12 GB / device
System Memory       24 GB                  16 GB
Server Management   Rocks version 4.1      Rocks version 4.1
Comm. Network       IB FDR (13.6 Gb/s)     –

Table 7.2: Server configuration for evaluating Mystic.

We have deployed the rCUDA framework for GPU remoting on each of the compute nodes [44]. The Mystic framework is hosted on the head node of the cluster, and uses node queues to dispatch work to the compute nodes. Mystic implements routines to communicate with the rCUDA framework on each of the compute nodes to update the MAST (Mystic Application Status Table). We use configurations of 2, 4, 8, and 16 nodes to evaluate the scalability of Mystic.

7.5.2 Workloads for Evaluation

We select 55 distinct workloads taken from the Polybench [53], Lonestar [81], NUPAR [151], and SHOC [36] benchmark suites. We also include workloads from the OpenFOAM GPU solvers [142], and ASC proxy applications used for exascale computing research [156]. In addition, we leverage tuned CUDA libraries such as cuDNN (deep learning libraries) [33] and MAGMA (linear algebra solvers) [146].


Suite            # Apps.   Name of Applications
PolyBench-GPU    14        2d Conv, 3d Conv, 3mm, Atax, Bicg, Gemm, Gesummv, Gramschmidt, Mvt, Syr2k, Syrk, Correlation, Covariance, Fdtd-2d
SHOC-GPU         12        Bfs-shoc, FFT, MD, MD5Hash, Neuralnet, Scan, Sort, Reduction, Spmv, Stencil-2D, triad, qt-Clustering
Lonestar-GPU     6         BHN-body, BFS, LS, MST, DMR, SP, Sssp
NUPAR-Bench      5         CCL, LSS, HMM, LoKDR, IIR
OpenFOAM-GPU     8         dsmcFoam, PDRFoam, thermoFoam, buoyantSFoam, mhdFoam, simpleFoam, sprayFoam, driftFluxFoam
ASC Proxy Apps   4         CoMD, LULESH, miniFE, miniMD
Caffe-cuDNN      3         leNet, logReg, fineTune
MAGMA            3         dasaxpycp, strsm, dgesv

Table 7.3: Multi-application workloads from 8 application suites.

Table 7.3 lists all of the workloads used in our evaluation. Apart from the evaluation workloads, we use full profiles of 42 distinct applications from Rodinia [32], Parboil [136], and the NVIDIA SDK [111] as the Training Matrix for our CF predictor (see Section 7.4). The 55 applications can be scheduled on the cluster using 55! (∼1.269e+73) launch schedules. We select 100 random launch sequences for evaluating our scheduler. We vary the load on the server by randomly selecting the application arrival times, and consider light, medium, and heavy application loads. Applications are scheduled with an inter-arrival time of 8 seconds, 5 seconds, and 2 seconds for the light, medium, and heavy load scenarios, respectively. Most of our evaluations are done using the medium load scenario.
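The sketch below shows one way such launch sequences could be generated: each sequence is a random permutation of the 55 workloads with a fixed inter-arrival time per load level. The function and dictionary names are illustrative assumptions.

```python
import random

INTER_ARRIVAL = {"light": 8.0, "medium": 5.0, "heavy": 2.0}   # seconds between launches

def make_sequences(workloads, n_sequences=100, load="medium", seed=0):
    """Return n_sequences lists of (application, arrival_time) pairs."""
    rng = random.Random(seed)
    gap = INTER_ARRIVAL[load]
    sequences = []
    for _ in range(n_sequences):
        order = list(workloads)
        rng.shuffle(order)                 # one of the 55! possible launch orders
        sequences.append([(app, k * gap) for k, app in enumerate(order)])
    return sequences
```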

7.5.3 Metrics

We use the performance metrics suggested by Eyerman et al. for evaluating multi-programmed workloads and systems [45]. These metrics compare the performance of an application run in dedicated mode versus in co-execution mode. Average Normalized Turnaround Time (ANTT) is the arithmetic average of the turnaround times of the applications in a workload, normalized to their single-application execution. System Throughput (STP) measures the overall system performance and expresses the work completed in unit time.
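For reference, the standard forms of these two metrics (following Eyerman and Eeckhout [45]; the notation here is ours), with $T_i^{SP}$ and $T_i^{MP}$ denoting the turnaround time of application $i$ in dedicated (single-program) and co-executed (multi-program) mode, are:

\[
\mathrm{ANTT} = \frac{1}{n}\sum_{i=1}^{n} \frac{T_i^{MP}}{T_i^{SP}}, \qquad
\mathrm{STP} = \sum_{i=1}^{n} \frac{T_i^{SP}}{T_i^{MP}}
\]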


Figure 7.7: Performance impact using Mystic under medium load (speedup normalized to the dedicated baseline; the QoS goal of 0.8 is marked). Applications are arranged in order of increasing performance.

Fairness measures the equal progress of applications in a multiprogrammed workload. We use Jain's fairness index, which is expressed as a number between zero and one [64]. One indicates perfect fairness (co-executing processes experience equal slowdown) and zero indicates no fairness at all. Mystic is also evaluated for its efficiency in improving GPU utilization.
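For reference, Jain's index over per-application values $x_i$ (here taken, as an assumption consistent with the description above, to be each application's normalized progress) has the standard form given in [64]:

\[
J(x_1,\dots,x_n) = \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n \sum_{i=1}^{n} x_i^2}
\]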

7.6 Evaluation Results

7.6.1 Scheduling Performance

The performance of Mystic is compared to the Round Robin (RR) and Least Loaded (LL) schedulers in Figure 7.7. Performance is evaluated as the overall speedup achieved, normalized to the baseline. The baseline for all our evaluations is the performance of applications when run in dedicated mode (executing alone on the GPU). Overall, Mystic-based scheduling outperforms the interference-oblivious schedulers. We observe that over 90% of the applications achieve their QoS objectives using Mystic scheduling. Only 21% of the applications achieve their QoS objectives using the RR scheduler, while none of the applications achieve their QoS goal when using the LL scheduler.


Figure 7.8: Performance of Mystic when scaling the number of compute nodes.


Figure 7.9: Performance of our three schedulers when the GPU server is under (a) Light Load, and (b) Heavy Load.

We allow co-execution of at most 3 applications on the same GPU. This hard bound on the number of applications per GPU helps to limit the amount of performance degradation. The LL scheduler focuses only on the load on each GPU, and unknowingly co-schedules applications that interfere significantly with each other on the same GPU. Many short-running applications with high interference get scheduled to the same GPU using LL. This results in the LL scheduler experiencing a linear performance degradation as we increase the number of applications. In comparison, Mystic bounds the performance degradation, as more than 75% of the applications see less than 15% degradation. As shown in Figure 7.8, the performance of Mystic scales effectively as we double the number of compute nodes on the server. For fewer compute nodes, the LL scheduler utilizes the GPUs more effectively than the RR scheduler; thus, LL shows 19% less performance degradation than RR for 2-node execution. We also evaluate the behavior of Mystic while we vary the load on the server. Figure 7.9a shows results for a lightly loaded server, and Figure 7.9b shows results for a heavily loaded scenario. For light loads, all the schedulers show little degradation in performance versus the baseline.



Figure 7.10: System-level performance of Mystic, comparing changes in (a) ANTT, and (b) STP across the application launch sequences.

Mystic achieves QoS for more than 94% of the applications under light loads. Some of the applications achieve their baseline performance when executing with Mystic and the RR scheduler. In the light load scenario, many applications can finish executing before the arrival of new applications. This can lead to scenarios where applications find an empty GPU for execution. The performance of all the schedulers improves as the degradation over the baseline is reduced: RR and LL show an average of 20% and 28% degradation in performance, respectively. Lighter loads on the server do not generate significant interference between applications, leading to an improvement in overall co-execution performance.

7.6.2 System Performance Using Mystic

In this section, we analyze the overall performance of the system when using the Mystic framework. Figure 7.10 shows the average normalized turnaround time (ANTT) and the system throughput (STP) for each of the 100 launch sequences. Both ANTT and STP are system-level metrics which analyze the performance of a workload mix, which is the launch sequence in our case. From Figure 7.10a we observe that Mystic enjoys the lowest ANTT among all of the schedulers (ANTT is a lower-is-better metric). The average ANTT for the LL scheduler is 9.1, which indicates an average slowdown of 9.1x per application across all the launch sequences. Interference-oblivious scheduling of applications for co-execution can heavily impact the execution performance of individual applications. In contrast, Mystic achieves an ANTT value of 4.7, which is 48% lower than the ANTT for LL, and 41% lower than the ANTT found for RR. System throughput (STP) is an important metric to assess the overall efficiency of the system.



Figure 7.11: Average utilization of GPUs on the server across all launch sequences. Utilization is recorded every 5 seconds during the execution of each sequence.

We plot the STP for each scheduler in Figure 7.10b. The STP obtained using Mystic is 40% higher than the LL scheduler, and 16% higher than the RR scheduler. The LL scheduler often selects the same GPU repeatedly when launching small, short-running applications. This leads to cases where only a few GPUs are used to execute many small, interfering applications, while the larger applications occupy other GPUs. This results in the slowdown of the smaller applications and impacts system throughput significantly, reducing the STP for LL. Alternatively, the RR scheduler inherently uses all the GPUs to launch applications, but interference between co-executing applications results in a lower STP. Moreover, on average 63% of the co-executing application pairs/triads scheduled by RR and LL are characterized as interfering by Mystic.

7.6.3 GPU Utilization

GPUs are the prime resources in our evaluation. We analyze the utilization of the 32 GPUs across all nodes for a medium load scenario. Figure 7.11 shows the average utilization of each GPU for the entire execution over all launch sequences. We observe that Mystic is able to balance the distribution of applications on the server and enables an average utilization of 90.5% across all of the GPUs. The Mystic scheduler is interference-aware and utilization-sensitive. As described in Section 7.4, the Mystic scheduler tries to assign applications to the GPU that results in the lowest possible interference, while also considering device utilization. The LL scheduler oversubscribes GPUs by co-executing interfering applications on the same GPUs. As described in the previous section, the LL scheduler can increase the load on a GPU with short-running applications. Figure 7.11 shows that, for the LL scheduler, many GPUs are underutilized, while others are oversubscribed.


Figure 7.12: Coefficient of variation (COV) computed for the runtime of each application across the 100 launch sequences. Lower COV indicates higher quality of the selected launch sequence.


Figure 7.13: Scheduling decision quality, evaluated using (a) the performance degradation observed for each scheduling scheme, and (b) the fairness achieved for each of the co-executing applications.

The average GPU utilization for the RR scheduler is 79.3%.

7.6.4 Quality of Launch Sequence Selection

As described in Section 7.5, we use 55 distinct applications for evaluating the Mystic framework. A launch sequence describes the order of arrival of the applications on the server. As we have chosen 100 random sequences (from a pool of 55! possibilities) to conduct our experiments, we need to evaluate the quality of the selected sequences. To analyze this quality, we determine the execution time of each of the 55 applications when executed in each of the 100 sequences. Next, we calculate the coefficient of variation (COV) of each application's execution time across the 100 launch sequences. The COV describes how spread out the execution times of an application are across the sequences. Figure 7.12 shows the COV of each application across the 100 launch sequences. The selected launch sequences show a low average COV of 6.3%. The low COV suggests that our selected launch sequences are representative of all possible sequences.
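The COV is simply the standard deviation divided by the mean; a minimal sketch of the per-application computation is shown below, assuming a runtimes array of shape (applications x sequences).

```python
import numpy as np

def coefficient_of_variation(runtimes):
    """runtimes: (n_apps, n_sequences) array of execution times.
    Returns one COV value (in percent) per application."""
    return 100.0 * runtimes.std(axis=1) / runtimes.mean(axis=1)
```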


Figure 7.14: Performance variation between collaborative filtering based scheduling and full-profile guided scheduling using Mystic. Lower variation values suggest higher accuracy.

7.6.5 Scheduling Decision Quality

We evaluate the scheduling decisions made by Mystic by analyzing the performance degradation suffered by each application relative to its dedicated execution. From Figure 7.13, we observe that more than 90% of the applications experience less than 20% degradation using Mystic. In contrast, the interference-oblivious schedulers cause more than 30% degradation for a large fraction of their applications. The interference introduced by the LL and RR schedulers leads to significant slowdown. More than 80% of the applications scheduled using Mystic show no negative interference. This is evident from the fairness index of the Mystic scheduler, as observed in Figure 7.13. The fairness index represents the per-application interference when two or more applications share GPU resources; a fairness index closer to one represents less interference between applications. Mystic achieves an average fairness improvement of 34% and 81% over the RR scheduler and the LL scheduler, respectively. Another important factor which affects the scheduling decisions of Mystic is the accuracy of the collaborative filtering (CF) based prediction. As described in Section 7.4, the CF predictor uses short execution profiles (∼5 seconds) to predict the overall performance of the application. The short profile runs do not capture the entire dynamic behavior of the application, and may provide misleading predictions. These predictions may cause Mystic to incorrectly schedule the applications. We evaluate the accuracy of CF by scheduling applications using the CF predictor, and then using the full profile of each application. We compare the execution performance of each application when scheduling a medium load on the GPU server. The average performance variation is calculated for each of the 55 applications across the 100 launch sequences. Figure 7.14 shows an average difference of only 7.8% in performance between full-profile scheduling and CF prediction-based scheduling.

7.6.6 Scheduling Overheads Using Mystic

We evaluate the overhead imposed by each of the evaluated schedulers. This includes the overhead for scheduling and training (profiling and recommender-based prediction). We measured the overhead of launching 20 representative applications from our application mix.


Figure 7.15: Overheads incurred with the Mystic scheduler for twenty representative applications across 100 launch sequences.

The applications are executed as part of the medium load scenario on the server. Figure 7.15 presents the application performance normalized to the baseline, and shows the breakdown of each overhead. The training phase of profiling and prediction does not exist for the interference-oblivious schedulers, whereas it constitutes an average of 13.4% of the execution time for the Mystic scheduler. The profiling stage (of ∼5 seconds) makes up 89.3% of the total training time. Effectively, the overall overhead of the recommender is less than 1.5% of the total execution time of the application.

7.7 Summary

In this chapter, we have presented Mystic, an interference-aware scheduler for GPU clusters and cloud servers. Mystic uses collaborative filtering to predict the performance of incoming applications and determine the interference with running applications. Mystic relies on information from previously executed applications, as well as a limited profile of the incoming application. The Mystic scheduler assigns the incoming application to a GPU for co-execution such that it minimizes interference and improves system throughput. We perform an extensive evaluation of the Mystic framework using 55 real-world applications, with 100 distinct launch sequences and varying load scenarios. We observe that Mystic achieves the QoS goals for over 90% of our applications when scheduling a medium load on a GPU server. For 35 of the 55 applications evaluated, Mystic was able to maintain co-execution performance within 15% of dedicated performance. Mystic improves system throughput by 27.5% and GPU utilization by 16.3% when compared to the best performing interference-oblivious schedulers.

Chapter 8

Conclusion

We have analyzed the requirements of applications ported to the GPU across different computing domains, such as cloud engines and data centers. Our analysis identifies that modern applications require flexible resource allocation, strict timing constraints, and quality-of-service (QoS) guarantees. The platforms hosting these applications demand increased resource utilization and sustained system throughput when using a GPU. We develop hardware- and runtime-level solutions for GPUs to realize the demands of future GPU computing. In this thesis, we performed an in-depth exploration of the challenges associated with adding multi-level concurrency support to modern GPUs. We showed the potential benefits of mapping multiple command queues to a GPU. We implemented an adaptive spatial partitioning mechanism on the GPU, which partitions the GPU compute units according to the size of the applications. We provided a runtime interface to expose the kernel concurrency to users. Our study explored a range of scheduling policy trade-offs related to partitioning the GPU for concurrent execution of multiple kernels, and also explored different workgroup scheduling mechanisms. The adaptive partitioning of compute units results in performance gains and better utilization of device resources (up to 3x over a single command queue implementation). To increase the levels of concurrency supported on the GPU, we provided the architectural and runtime design for concurrently executing multiple application contexts on the GPU. Our design increases the utilization of the GPU resources, which significantly benefits workloads targeted for data centers and cloud engines. We described the new architectural and runtime enhancements needed to support multi-context execution on a GPU. We added a context-management layer (CML) in the runtime to handle memory transfers for contexts and manage the dispatch of contexts to the GPU. We also described a new Transparent Memory Management (TMM) facility and evaluated the benefits of this combined host/GPU-based control mechanism versus other host-side approaches. Similarly,

we enabled the GPU to handle complex firmware tasks using an upgraded command processor (CP). Our implementation of address space isolation in a shared L2 cache and TLBs incurs less than 0.4% overhead in cache area. Our overall design for multi-context execution achieves an average speedup of 1.56x across application mixes executing concurrently on the GPU. We have enhanced our multi-level concurrency framework by providing support for QoS-aware resource management, and an interface to change QoS policies on the GPU. We implemented the Virtuoso framework, which provides dynamic, QoS-aware, and application behavior-aware resource management on multi-tasked GPUs. Virtuoso periodically monitors the performance of the executing applications to perform a priority-guided reallocation of CUs, L2 cache, and DRAM cache. We also provided a memory-mapped register based interface to define QoS policies for co-executing applications. Virtuoso is able to improve application throughput by 37% versus the current state-of-the-art spatial multi-tasking system. Using the TMM mechanism, concurrent applications are observed to efficiently access all of their DRAM memory data, even if the GPU memory is oversubscribed up to 1.4x its capacity. We implemented the Mystic framework, which provides an interference-aware scheduler for GPU clusters and cloud servers. Mystic is designed as a runtime solution that complements the multi-level concurrency architecture on the GPU. Mystic uses collaborative filtering to predict the performance of incoming applications and determine the interference with running applications. Mystic relies on information from previously executed applications, as well as a limited profile of the incoming application. The Mystic scheduler assigns the incoming application to a GPU for co-execution such that it minimizes interference and improves system throughput. We performed an extensive evaluation of the Mystic framework using 55 real-world applications, with 100 distinct launch sequences and varying load scenarios. We observed that Mystic achieves the QoS goals for over 90% of our applications when scheduling a medium load on a GPU server. This thesis delivers an in-depth design of multi-level concurrency for GPUs. We provide sophisticated, yet low-overhead, mechanisms for compute resource orchestration, memory sub-system management, QoS support, and interference detection. Our solution encompasses these different design aspects and helps fulfill the demands of the newer class of GPU applications, while improving the performance of existing applications.

8.1 Future Work

The multi-level concurrency support for GPUs presented in this thesis opens new avenues for future research.


The adaptive spatial partitioning can be extended to support compute unit affinity, to preserve cache locality. The scheduler can also be extended to support power-aware throttling of certain kernels, to meet the energy goals of the system. Our implementation of multi-context execution uses tag extensions for identifying contexts at the L2-cache and DRAM-cache level. Future research can target more sophisticated mechanisms to partition the L2 cache and DRAM cache based on utility or reuse distance. As GPU designs continue to evolve, our next steps include support for multi-context execution on the same compute unit and L1 cache partitioning for the multiple contexts. In our work, we provided a partial implementation of a QoS-aware interconnect network; as immediate future work, we plan to extend Virtuoso to provide full support for managing the interconnect network as part of the QoS framework. We also plan to enhance the Mystic framework to be GPU architecture-aware, for efficient co-execution on diverse GPU devices in a dense cloud server. Another promising research direction is effective GPU workload migration from one device to another with the help of checkpointing schemes.

Bibliography

[1] AMD SDK (formerly ATI Stream). Website. Online available at http://developer.amd.com/gpu/AMDAPPSDK/.

[2] NVIDIA MPS (Multi Process Service): Sharing GPU between MPI processes, 2013.

[3] CUDA C Programming Guide. NVIDIA Corporation, Feb, 2014.

[4] NVIDIA GRID: Graphics accelerated VDI with the visual performance of a workstation, 2014.

[5] NVIDIA Visual Profiler, 2014.

[6] Amazon Web Services EC2, 2015.

[7] Nimbix Cloud Services, 2015.

[8] Peer1 Cloud Hosting, 2015.

[9] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti. Achieving predictable performance through better memory controller placement in many-core cmps. In ACM SIGARCH Computer Architecture News, volume 37, pages 451–461. ACM, 2009.

[10] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte. The case for gpgpu spatial multitasking. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1–12. IEEE, 2012.

[11] P. Aguilera, K. Morrow, and N. S. Kim. Qos-aware dynamic resource allocation for spatial-multitasking gpus. In ASP-DAC, pages 726–731, 2014.

[12] AMD. AMD CodeXL-Powerful Debugging, Profiling and Analysis Tool, 2014.

[13] AMD. AMD R9 Radeon Series Graphics, 2014.


[14] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187–198, 2011.

[15] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. Starpu: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187–198, 2011.

[16] P. Aukia, T. Lakshman, D. Mitra, K. G. Ramakrishnan, and D. Stiliadis. Adaptive routing system and method for qos packet networks, July 15 2003. US Patent 6,594,268.

[17] D. Bader, K. Madduri, J. Gilbert, V. Shah, J. Kepner, T. Meuse, and A. Krishnamurthy. Designing scalable synthetic compact applications for benchmarking high productivity computing systems. Cyberinfrastructure Technology Watch, 2:1–10, 2006.

[18] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software, pages 163–174. IEEE, Apr. 2009.

[19] C. Basaran and K.-D. Kang. Supporting preemptive task executions and memory copies in gpgpus. In Real-Time Systems (ECRTS), 2012 24th Euromicro Conference on, pages 287–296. IEEE, 2012.

[20] M. Bautin, A. Dwarakinath, and T.-c. Chiueh. Graphic engine resource management. In Electronic Imaging 2008, pages 68180O–68180O. International Society for Optics and Photonics, 2008.

[21] H. Bay, T. Tuytelaars, and L. Van Gool. Surf: Speeded up robust features. Computer Vision–ECCV 2006, 2006.

[22] M. Becchi, K. Sajjapongse, I. Graves, A. Procter, V. Ravi, and S. Chakradhar. A virtual memory based runtime to support multi-tenancy in clusters with gpus. In Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing, pages 97–108. ACM, 2012.

[23] D. Beeferman and A. Berger. Agglomerative clustering of a search engine query log. In ACM SIGKDD, pages 407–416. ACM, 2000.

[24] F. Bellard. Qemu, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track, pages 41–46, 2005.


[25] F. Bellard. Qemu open source processor emulator. URL: http://www.qemu.org, 2007.

[26] J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35, 2007.

[27] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, 2011.

[28] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The m5 simulator: Modeling networked systems. IEEE Micro, 26(4):52–60, 2006.

[29] G. Bradski and A. Kaehler. Learning OpenCV: Computer vision with the OpenCV library. O’Reilly Media, Inc., 2008.

[30] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations, volume 28. ACM, 2000.

[31] D. Burger and T. M. Austin. The simplescalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 25(3):13–25, 1997.

[32] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, and K. Skadron. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In IEEE International Symposium on Workload Characterization (IISWC’10), pages 1–11. IEEE, Dec. 2010.

[33] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.

[34] G. N. Chinya, J. D. Collins, P. H. Wang, H. Jiang, G.-Y. Lueh, T. A. Piazza, and H. Wang. Bothnia: a dual-personality extension to the intel integrated graphics driver. SIGOPS Oper. Syst. Rev., 45:11–20, February 2011.

[35] S. Collange, M. Daumas, D. Defour, and D. Parello. Barra: A parallel functional simulator for gpgpu. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2010 IEEE International Symposium on, pages 351–360. IEEE, 2010.

[36] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In ACM International Conference Proceeding Series, 2010.


[37] V. M. Del Barrio, C. González, J. Roca, A. Fernández, and R. Espasa. Attila: a cycle-level execution-driven simulator for modern gpu architectures. In Performance Analysis of Systems and Software, 2006 IEEE International Symposium on, pages 231–241. IEEE, 2006.

[38] C. Delimitrou and C. Kozyrakis. Paragon: Qos-aware scheduling for heterogeneous datacenters. ACM SIGARCH Computer Architecture News, 41(1):77–88, 2013.

[39] B. Deschizeaux and J. Blanc. Imaging earth’s subsurface using CUDA. GPU Gems, 3:831–850, 2007.

[40] T. Diop, N. E. Jerger, and J. Anderson. Power modeling for heterogeneous processors. In Proceedings of Workshop on General Purpose Processing Using GPUs, page 90. ACM, 2014.

[41] J. Döllner, K. Baumman, and K. Hinrichs. Texturing techniques for terrain visualization. In Proceedings of the conference on Visualization’00, pages 227–234. IEEE Computer Society Press, 2000.

[42] R. Dominguez, D. Schaa, and D. Kaeli. Caracal: Dynamic translation of runtime environments for gpus. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, page 5. ACM, 2011.

[43] J. Duato, F. D. Igual, R. Mayo, A. J. Pena, E. S. Quintana-Ortí, and F. Silla. An efficient implementation of gpu virtualization in high performance clusters. In Euro-Par 2009–Parallel Processing Workshops, pages 385–394. Springer, 2010.

[44] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S. Quintana-Ortí. rcuda: Reducing the number of gpu-based accelerators in high performance clusters. In High Performance Computing and Simulation (HPCS), 2010 International Conference on, pages 224–231. IEEE, 2010.

[45] S. Eyerman and L. Eeckhout. System-level performance metrics for multiprogram workloads. IEEE micro, (3):42–53, 2008.

[46] R. Farber. Coordinating Computations with OpenCL Queues. 2011.

[47] N. Farooqui, A. Kerr, G. Eisenhauer, K. Schwan, and S. Yalamanchili. Lynx: A dynamic instrumentation system for data-parallel applications on gpgpu architectures. In Performance Analysis of Systems and Software (ISPASS), 2012 IEEE International Symposium on, pages 58–67. IEEE, 2012.


[48] W. W. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient gpu control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 407–420. IEEE Computer Society, 2007.

[49] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu. An asymmetric distributed shared memory model for heterogeneous parallel systems. In ACM SIGARCH Computer Architecture News, volume 38, pages 347–358. ACM, 2010.

[50] G. Giunta, R. Montella, G. Agrillo, and G. Coviello. A gpgpu transparent virtualization component for high performance computing clouds. In Euro-Par 2010-Parallel Processing. Springer, 2010.

[51] P. N. Glaskowsky. Nvidia’s fermi: the first complete gpu computing architecture. White paper, 2009.

[52] N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys. A multigrid solver for boundary value problems using programmable graphics hardware. In ACM SIGGRAPH 2005 Courses, page 193. ACM, 2005.

[53] S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to gpu codes. In InPar, 2012, pages 1–10. IEEE.

[54] C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron. Fine-grained resource sharing for concurrent gpgpu kernels. In Proceedings of 4th USENIX conference on Hot Topics in Parallelism, pages 10–15, 2012.

[55] B. Grot, S. W. Keckler, and O. Mutlu. Preemptive virtual clock: a flexible, efficient, and cost-effective qos scheme for networks-on-chip. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 268–279. ACM, 2009.

[56] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. Enabling task parallelism in the scheduler. In Workshop on Programming Models for Emerging Architectures, pages 69–76, 2009.

[57] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan. Gvim: Gpu-accelerated virtual machines. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, pages 17–24. ACM, 2009.

[58] V. Gupta, K. Schwan, N. Tolia, V. Talwar, and P. Ranganathan. Pegasus: Coordinated scheduling for virtualized accelerator-based systems. In USENIX ATC11, page 31, 2011.


[59] H. Hassanieh, F. Adib, D. Katabi, and P. Indyk. Faster GPS via the sparse fourier transform. In Proceedings of the 18th annual international conference on Mobile computing and networking - Mobicom ’12, page 353, New York, New York, USA, 2012. ACM Press.

[60] B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. Quickrelease: a throughput oriented approach to release consistency on gpus. In Proceedings of the 20th International Symposium on High Performance Computer Architecture (HPCA), 2014.

[61] J. Y. Hong and M. D. Wang. High speed processing of biomedical images using programmable gpu. In ICIP’04, volume 4, pages 2455–2458. IEEE, 2004.

[62] R. Iyer. Cqos: a framework for enabling qos in shared caches of cmp platforms. In Proceedings of the 18th annual international conference on Supercomputing, pages 257–266. ACM, 2004.

[63] B. Jacob and T. Mudge. Virtual memory: Issues of implementation. Computer, 31(6):33–43, 1998.

[64] R. Jain, D.-M. Chiu, and W. Hawe. A quantitative measure of fairness and discrimination for resource allocation in shared computer systems. 1998.

[65] M. K. Jeong, M. Erez, C. Sudanthi, and N. Paver. A qos-aware memory controller for dynamically balancing gpu and cpu bandwidth use in an mpsoc. In Proceedings of the 49th Annual Design Automation Conference, pages 850–855. ACM, 2012.

[66] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi. Unison cache: A scalable and effective die-stacked dram cache. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pages 25–37. IEEE, 2014.

[67] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[68] N. Jiang, G. Michelogiannakis, D. Becker, B. Towles, and W. Dally. Booksim interconnection network simulator. Online: https://nocs.stanford.edu/cgi-bin/trac.cgi/wiki/Resources/BookSim.

[69] A. Jog, E. Bolotin, Z. Guz, M. Parker, S. W. Keckler, M. T. Kandemir, and C. R. Das. Application-aware memory system for fair and efficient execution of concurrent gpgpu applications. In Proceedings of Workshop on General Purpose Processing Using GPUs, page 1. ACM, 2014.

[70] K. Kasichayanula, D. Terpstra, P. Luszczek, S. Tomov, S. Moore, and G. D. Peterson. Power aware computing on gpus. In Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on, pages 64–73. IEEE, 2012.

[71] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, Y. Ishikawa, and R. Rajkumar. Rgem: A responsive gpgpu execution model for runtime engines. In Real-Time Systems Symposium (RTSS), 2011 IEEE 32nd, pages 57–66. IEEE, 2011.

[72] S. Kato, K. Lakshmanan, R. R. Rajkumar, and Y. Ishikawa. Timegraph: Gpu scheduling for real-time multi-tasking environments. In 2011 USENIX Annual Technical Conference (USENIX ATC '11), page 17, 2011.

[73] S. Kato, M. McThrow, C. Maltzahn, and S. A. Brandt. Gdev: First-class gpu resource man- agement in the operating system. In USENIX Annual Technical Conference, pages 401–412, 2012.

[74] O. Kayıran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das. Managing gpu concurrency in heterogeneous architectures. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, 2014.

[75] A. Kerr, G. Diamos, and S. Yalamanchili. Modeling gpu-cpu workloads and systems. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31–42. ACM, 2010.

[76] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 111–122. IEEE Computer Society, 2004.

[77] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter. Atlas: A scalable and high-performance scheduling algorithm for multiple memory controllers. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pages 1–12. IEEE, 2010.

[78] K. Kondo, T. Inaba, and T. Tsumura. RaVioli: a GPU Supported High-Level Pseudo Real-time Video Processing Library. Communication, 2011.

[79] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, (8):30–37, 2009.

[80] J. Krüger and R. Westermann. Linear algebra operators for gpu implementation of numerical algorithms. In ACM Transactions on Graphics (TOG), volume 22, pages 908–916. ACM, 2003.

[81] M. Kulkarni, M. Burtscher, C. Caşcaval, and K. Pingali. Lonestar: A suite of parallel irregular programs. In ISPASS 2009, pages 65–76. IEEE, 2009.

[82] E. S. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. In Proceedings of the 2001 ACM/IEEE conference on Supercomputing, pages 55–55. ACM, 2001.

[83] J. Lee and H. Kim. Tap: A tlp-aware cache management policy for a cpu-gpu heterogeneous architecture. In High Performance Computer Architecture (HPCA), 2012 IEEE 18th International Symposium on, pages 1–12. IEEE, 2012.

[84] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. Gpuwattch: Enabling energy optimizations in gpgpus. In Proceedings of the 40th Annual International Symposium on Computer Architecture, pages 487–498. ACM, 2013.

[85] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. Mcpat: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Microarchitecture, 2009. MICRO-42. 42nd Annual IEEE/ACM International Symposium on, pages 469–480. IEEE, 2009.

[86] S. Li, V. Sridharan, S. Gurumurthi, and S. Yalamanchili. Software based techniques for reducing the vulnerability of gpu applications. In Workshop on Dependable GPU Computing (at DATE), March 2014.

[87] Y. Li, Y. Zhang, H. Jia, G. Long, and K. Wang. Automatic FFT Performance Tuning on OpenCL GPUs. 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pages 228–235, Dec. 2011.

[88] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. Internet Computing, IEEE, 7(1):76–80, 2003.

[89] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. Nvidia tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, 2008.

[90] M. Liu, T. Li, N. Jia, A. Currid, and V. Troy. Understanding the virtualization "tax" of scale-out pass-through gpus in gaas clouds: An empirical study. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pages 259–270. IEEE, 2015.

[91] F. D. Luna. Introduction to 3D game programming with DirectX 10. Jones & Bartlett Publishers, 2008.

[92] X. Ma, M. Dong, L. Zhong, and Z. Deng. Statistical power consumption analysis and modeling for gpu-based computing. In Proceedings of the ACM SOSP Workshop on Power Aware Computing and Systems (HotPower), 2009.

[93] M. Maas, P. Reames, J. Morlan, K. Asanović, A. D. Joseph, and J. Kubiatowicz. Gpus as an opportunity for offloading garbage collection. In ACM SIGPLAN Notices, volume 47, pages 25–36. ACM, 2012.

[94] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, 2002.

[95] M. Mantor and M. Houston. AMD Graphics Core Next. In AMD Fusion Developer Summit, 2011.

[96] J. Mars, L. Tang, and R. Hundt. Heterogeneity in homogeneous warehouse-scale computers: A performance opportunity. Computer Architecture Letters, 10(2):29–32, 2011.

[97] J. Michalakes and M. Vachharajani. Gpu acceleration of numerical weather prediction. Parallel Processing Letters, 18(04):531–548, 2008.

[98] P. Mistry, C. Gregg, N. Rubin, D. Kaeli, and K. Hazelwood. Analyzing program flow within a many-kernel OpenCL application. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units - GPGPU-4, New York, USA, 2011.

[99] P. Mistry, Y. Ukidave, D. Schaa, and D. Kaeli. Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems. In Proceedings of the 6th Workshop on General Purpose Processing Using Graphics Processing Units (GPGPU-6), Austin, USA, 2013.

[100] A. Munshi. The OpenCL specification version 2.0. Khronos OpenCL Working Group.

[101] A. Munshi. The OpenCL specification version 1.1. Khronos OpenCL Working Group, 2010.

[102] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka. Statistical power modeling of gpu kernels using performance counters. In Green Computing Conference, 2010 International, pages 115–122. IEEE, 2010.

[103] V. Narasiman, M. Shebanow, C. J. Lee, R. Miftakhutdinov, O. Mutlu, and Y. N. Patt. Improving gpu performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pages 308–317. ACM, 2011.

[104] R. Nathuji, A. Kansal, and A. Ghaffarkhah. Q-clouds: managing performance interference effects for qos-aware clouds. In EuroSys-2010, pages 237–250. ACM, 2010.

[105] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair queuing memory systems. In Microarchitecture, 2006. MICRO-39. 39th Annual IEEE/ACM International Symposium on, pages 208–222. IEEE, 2006.

[106] H. Nguyen. Gpu gems 3. Addison-Wesley Professional, 2007.

[107] J. Nickolls and W. J. Dally. The gpu computing era. IEEE micro, 30(2):56–69, 2010.

[108] NVIDIA. Whitepaper on NVIDIA GeForce GTX 680. Technical report.

[109] Nvidia. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110. 2012.

[110] NVIDIA. Nvidia cloud gaming, http://www.nvidia.com/object/visual-computing-appliance.html, 2015.

[111] Nvidia. Nvidia cuda toolkit 7.0. 2015.

[112] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. Gpu computing. Proceedings of the IEEE, 96(5):879–899, 2008.

[113] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concurrency with elastic kernels. In Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systems - ASPLOS ’13, New York, USA. ACM, 2013.

[114] A. Patel, F. Afram, S. Chen, and K. Ghose. Marss: a full system simulator for multicore x86 cpus. In Proceedings of the 48th Design Automation Conference, pages 1050–1055. ACM, 2011.

[115] M. Pharr and R. Fernando. GPU Gems 2: Programming Techniques For High-Performance Graphics And General-Purpose Computation. 2005.

[116] J. C. Phillips, J. E. Stone, and K. Schulten. Adapting a message-driven parallel application to gpu-accelerated clusters. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1–9. IEEE, 2008.

[117] R. Phull, C.-H. Li, K. Rao, H. Cadambi, and S. Chakradhar. Interference-driven resource management for gpu-based heterogeneous clusters. In HPDC-2012, pages 109–120. ACM, 2012.

[118] B. Pichai, L. Hsu, and A. Bhattacharjee. Architectural support for address translation on gpus: Designing memory management units for cpu/gpus with unified address spaces. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages 743–758. ACM, 2014.

[119] J. Power, J. Hestness, M. Orr, M. Hill, and D. Wood. gem5-gpu: A heterogeneous cpu-gpu simulator. Computer Architecture Letters, PP(99):1–1, 2014.

[120] M. K. Qureshi. Adaptive spill-receive for robust high-performance caching in cmps. In High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on, pages 45–54. IEEE, 2009.

[121] V. T. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. Supporting gpu sharing in cloud environments with a transparent runtime consolidation framework. In Proceedings of the 20th international symposium on High performance distributed computing, pages 217–228. ACM, 2011.

[122] V. T. Ravi, M. Becchi, W. Jiang, G. Agrawal, and S. Chakradhar. Scheduling concurrent applications on a cluster of cpu–gpu nodes. Future Generation Computer Systems, 29(8):2262–2271, 2013.

[123] J. Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Inc., 2007.

[124] J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O’Reilly Media, 2007.

[125] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 72–83. IEEE Computer Society, 2012.

[126] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Divergence-aware warp scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 99–110. ACM, 2013.

[127] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. Ptask: Operating system abstractions to manage gpus as compute devices. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 233–248. ACM, 2011.

[128] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: operating system abstractions to manage GPUs as compute devices. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles - SOSP ’11, page 233. ACM Press, 2011.

[129] R. E. Schantz, J. P. Loyall, C. Rodrigues, D. C. Schmidt, Y. Krishnamurthy, and I. Pyarali. Flexible and adaptive qos control for distributed real-time and embedded middleware. In Proceedings of the ACM/IFIP/USENIX 2003 international Conference on Middleware, pages 374–393. Springer-Verlag New York, Inc., 2003.

[130] D. C. Schmidt. Middleware for real-time and embedded systems. Communications of the ACM, 45(6):43–48, 2002.

[131] D. Sengupta, A. Goswami, K. Schwan, and K. Pallavi. Scheduling multi-tenant cloud workloads on accelerator-based systems. In SC-14, pages 513–524. IEEE Press, 2014.

[132] L. Shi, H. Chen, J. Sun, and K. Li. vcuda: Gpu-accelerated high-performance computing in virtual machines. Computers, IEEE Transactions on, 61(6):804–816, 2012.

[133] D. Shreiner, G. Sellers, J. M. Kessenich, and B. M. Licea-Kane. OpenGL programming guide: The Official guide to learning OpenGL, version 4.3. Addison-Wesley Professional, 2013.

[134] V. Slavici, R. Varier, G. Cooperman, and R. J. Harrison. Adapting irregular computations to large cpu-gpu clusters in the madness framework. In Cluster Computing (CLUSTER), 2012 IEEE International Conference on, pages 1–9. IEEE, 2012.

[135] K. Spafford, J. Meredith, and J. S. Vetter. Maestro: Data Orchestration and Tuning for OpenCL Devices. Euro-Par 2010 Parallel Processing, 6272:275–286, 2010.

[136] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Center for Reliable and High-Performance Computing, 2012.

[137] J. A. Stratton, S. S. Stone, and W.-m. W. Hwu. Mcuda: An efficient implementation of cuda kernels for multi-core cpus. In Languages and Compilers for Parallel Computing, pages 16–30. Springer, 2008.

[138] A. Strehl and J. Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. JMLR’03, 3, 2003.

[139] J. A. Stuart and J. D. Owens. Multi-gpu mapreduce on gpu clusters. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1068–1079. IEEE, 2011.

[140] E. Sun, D. Schaa, R. Bagley, N. Rubin, and D. Kaeli. Enabling task-level scheduling on heterogeneous platforms. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units - GPGPU-5, pages 84–93, New York, New York, USA, 2012. ACM Press.

[141] W. Sun and R. Ricci. Augmenting Operating Systems With the GPU. Technical Report, University of Utah, 2013.

[142] Symscape. Gpu v1.1 linear solver library for openfoam, http://www.symscape.com/gpu-1-1-openfoam, 2015.

[143] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero. Enabling preemptive multiprogramming on gpus. In 41st International Symposium on Computer Architecture (ISCA-14), Minneapolis, USA, 2014.

[144] T. Tang, X. Yang, and Y. Lin. Cache miss analysis for gpu programs based on stack distance profile. In Distributed Computing Systems (ICDCS), 2011 31st International Conference on, pages 623–634. IEEE, 2011.

[145] C. J. Thompson, S. Hahn, and M. Oskin. Using modern graphics architectures for general-purpose computing: a framework and analysis. In Proceedings of the 35th annual ACM/IEEE international symposium on Microarchitecture, pages 306–317. IEEE Computer Society Press, 2002.

[146] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra. Dense linear algebra solvers for multicore with gpu accelerators. In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8. IEEE, 2010.

[147] Top500. Top 500 Sites, 2014.

[148] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli. Multi2sim: a simulation framework for cpu-gpu computing. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques, pages 335–344. ACM, 2012.

[149] R. Ubal, D. Schaa, P. Mistry, X. Gong, Y. Ukidave, Z. Chen, G. Schirner, and D. Kaeli. Exploring the heterogeneous design space for both performance and reliability. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6. ACM, 2014.

[150] Y. Ukidave, C. Kalra, and D. Kaeli. Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs. In Computer Architecture and High Performance Computing (SBAC-PAD), 2014 26th International Symposium on. IEEE, 2014.

[151] Y. Ukidave, F. N. Paravecino, L. Yu, C. Kalra, A. Momeni, Z. Chen, N. Materise, B. Daley, P. Mistry, and D. Kaeli. Nupar: A benchmark suite for modern gpu architectures. In ICPE, 2015.

[152] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource overbooking and application profiling in shared hosting platforms. ACM SIGOPS Operating Systems Review, 36(SI):239–254, 2002.

[153] U. Verner and M. Silberstein. Processing Data Streams with Hard Real-time Constraints on Heterogeneous Systems. In International Conference on Supercomputing, 2011.

[154] K. Vollmar and P. Sanderson. Mars: an education-oriented mips assembly language simulator. In ACM SIGCSE Bulletin, volume 38, pages 239–243. ACM, 2006.

[155] L. Wang, M. Huang, and T. El-Ghazawi. Exploiting concurrent kernel execution on graphic processing units. In High Performance Computing and Simulation (HPCS), 2011 International Conference on, pages 24–32. IEEE, 2011.

[156] A. White. Exascale challenges: Applications, technologies, and co-design. In From Petascale to Exascale: R&D Challenges for HPC Simulation Environments ASC Exascale Workshop, 2011.

[157] M. Woo, J. Neider, T. Davis, and D. Shreiner. OpenGL programming guide: the official guide to learning OpenGL, version 1.2. Addison-Wesley Longman Publishing Co., Inc., 1999.

[158] B. Wu, E. Zhang, and X. Shen. Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control. In Proceedings of the Interna- tional Conference on Parallel Architecture and Compilation Techniques (PACT), 2011.

[159] Y. Yang, P. Xiang, M. Mantor, and H. Zhou. CPU-Assisted GPGPU on Fused CPU-GPU Architectures. In IEEE International Symposium on High-Performance Computer Architecture, pages 1–12. IEEE, Feb. 2012.

[160] M. T. Yourst. Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator. In Performance Analysis of Systems & Software, 2007. ISPASS 2007. IEEE International Symposium on, pages 23–34. IEEE, 2007.

[161] D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski. Top-pim: throughput-oriented programmable processing in memory. In Proceedings of the 23rd international symposium on High-performance parallel and distributed computing, pages 85–98. ACM, 2014.

[162] Y. Zhang, Y. Hu, B. Li, and L. Peng. Performance and power analysis of ati gpu: A statistical approach. In Networking, Architecture and Storage (NAS), 2011 6th IEEE International Conference on, pages 149–158. IEEE, 2011.

[163] Y. Zhang, L. Zhao, R. Illikkal, R. Iyer, A. Herdrich, and L. Peng. Qos management on heterogeneous architecture for parallel applications. In Computer Design (ICCD), 2014 32nd IEEE International Conference on, pages 332–339. IEEE, 2014.
