UNIVERSITY OF SOUTHAMPTON

FACULTY OF ENGINEERING AND PHYSICAL SCIENCES
Electronics and Computer Science

Runtime Energy Management of Concurrent Applications for Multi-core Platforms

by

Karunakar Reddy Basireddy

Thesis submitted for the degree of Doctor of Philosophy

April 2019

UNIVERSITY OF SOUTHAMPTON
ABSTRACT
FACULTY OF ENGINEERING AND PHYSICAL SCIENCES
Electronics and Computer Science
Doctor of Philosophy
RUNTIME ENERGY MANAGEMENT OF CONCURRENT APPLICATIONS FOR MULTI-CORE PLATFORMS
by Karunakar Reddy Basireddy

Multi-core platforms are employing a greater number of heterogeneous cores and resource configurations to achieve energy efficiency and high performance. These platforms often execute applications with different performance constraints concurrently; these applications contend for resources simultaneously, thereby generating varying workload and resource demands over time. There is little reported work on runtime energy management of concurrent execution, and it focuses mostly on homogeneous multi-cores and limited application scenarios. This thesis considers both homogeneous and heterogeneous multi-cores and broadens the application scenarios. The following contributions are made in this thesis.

Firstly, this thesis presents online Dynamic Voltage and Frequency Scaling (DVFS) techniques for the concurrent execution of single-threaded and multi-threaded applications on homogeneous multi-cores. This includes an experimental analysis and the derivation of metrics for efficient online workload classification. The DVFS level is proactively set according to the predicted workload, measured through Memory Reads Per Instruction. The analysis also considers thread synchronisation overheads, and the underlying memory and DVFS architectures. Average energy savings of up to 60% are observed when evaluated on three different hardware platforms (Odroid-XU3, Xeon E5-2630, and Xeon Phi 7120P).

Next, an energy-efficient static mapping and DVFS approach is proposed for heterogeneous multi-core CPUs. This approach simultaneously exploits different types of cores for each application in a concurrent execution scenario. It first selects, using offline results, a performance-meeting mapping (number and type of cores) with minimum energy consumption for each application. Then, online DVFS is applied to adapt to workload and performance variations. Compared to recent techniques, the proposed approach has an average of 33% lower energy consumption when validated on the Odroid-XU3.

To eliminate the dependency on offline application profiling and to adapt to dynamic application arrival/completion, an adaptive mapping approach coupled with DVFS is presented. This is achieved through an accurate performance model, an energy-efficient resource selection technique, and a resource manager. Experimental evaluation on the Odroid-XU3 shows an improvement of up to 28% in energy efficiency and 7.9% better prediction accuracy by the performance models.

Contents

Abbreviations xv

Nomenclature xxi

Declaration of Authorship xxiii

Acknowledgements xxv

1 Introduction 1
1.1 Runtime Energy Management 3
1.2 Research Justification 3
1.3 Research Questions 5
1.4 Research Contributions 5
1.5 Software Contributions 7
1.6 Publications 7
1.7 Thesis Organization 8

2 Literature Review of Runtime Energy Management 11
2.1 DVFS and Thread-to-Core Mapping 12
2.1.1 DVFS 13
2.1.2 Thread-to-Core Mapping 15
2.2 Multi-Processor System-on-Chips 17
2.2.1 Number of Processing Cores 17
2.2.2 Heterogeneity in Processing Cores 18
2.2.3 Types of Multi-core Architecture 19
2.2.3.1 Homogeneous Multi-core Architectures 20
2.2.3.2 Heterogeneous Multi-core Architectures 21
2.3 Approaches to Runtime Energy Management 24
2.3.1 Linux Power Governors 24
2.3.2 RTM of Homogeneous Multi-cores 26
2.3.3 RTM of Heterogeneous Multi-cores 30
2.4 Experimental Platforms, Applications and Tools 35
2.4.1 Experimental Platforms 35
2.4.1.1 Odroid-XU3 36
2.4.1.2 Intel Xeon E5-2630V2 37
2.4.1.3 Intel Xeon Phi 7120P 38
2.4.2 Applications 39
2.4.3 Tools for Monitoring and Analysis of Performance Events 41


2.5 Discussion 42

3 Runtime Management of DVFS for Homogeneous Multi-cores 45
3.1 Workload Classification 45
3.1.1 Methodology 46
3.1.2 Results and Discussion 49
3.1.2.1 Execution Time Based Workload Classification 49
3.1.2.2 PMC Data Based Workload Classification 51
3.2 Runtime Management of DVFS for Concurrent Execution of Single-threaded Applications 55
3.2.1 Runtime Management of DVFS 57
3.2.1.1 Workload Selection and Prediction 58
3.2.1.2 Workload Selection 59
3.2.1.3 Workload Prediction 59
3.2.1.4 Workload Classification and Frequency Selection 61
3.2.1.5 Adaptive Sampling 62
3.2.1.6 Performance Evaluation and Compensation 62
3.2.2 Experimental Results 63
3.2.2.1 Workload Prediction and Adaptive Sampling 63
3.2.2.2 Energy Savings 64
3.2.2.3 Application Performance 66
3.2.2.4 Runtime Overheads 66
3.2.3 Summary 67
3.3 Runtime Management of DVFS for Concurrent Execution of Multi-Threaded Applications 68
3.3.1 Runtime DVFS Approach 69
3.3.1.1 PMC data collection 70
3.3.1.2 Computing MAPM, utilization and NUMA latency 70
3.3.1.3 Workload prediction 71
3.3.1.4 Identification of V-f setting 72
3.3.2 Experimental Results 73
3.3.2.1 Evaluation on 24-core Xeon E5-2630 75
3.3.2.2 Evaluation on 61-core Xeon Phi 78
3.3.2.3 Runtime Overheads 79
3.4 Discussion 80

4 Static Mapping and DVFS for Heterogeneous Multi-cores 83
4.1 Problem Formulation 87
4.2 Static Thread-to-Core Mapping and DVFS Approach 88
4.2.1 Thread-to-Core Mapping 89
4.2.1.1 Offline Analysis 89
4.2.1.2 Runtime Mapping Selection 90
4.2.2 DVFS Approach 92
4.2.2.1 Workload Selection and Prediction 92
4.2.2.2 Workload Classification and Frequency Selection 95
4.2.2.3 Performance Observation and Compensation 95
4.3 Experimental Validation 96

4.3.1 Energy Savings and Performance Comparison 98
4.3.1.1 Energy Savings 98
4.3.1.2 Breakdown of Energy Savings 102
4.3.1.3 Performance 103
4.3.2 Workload Prediction 105
4.3.3 Overheads of the Proposed Approach 105
4.3.3.1 Runtime Overhead 105
4.3.3.2 Offline Analysis Overhead 106
4.4 Discussion 106

5 Adaptive Mapping and DVFS for Heterogeneous Multi-cores 109
5.1 Problem Formulation 113
5.2 Adaptive Thread-to-Core Mapping and DVFS 113
5.2.1 Online Identification of Energy-efficient Mapping 115
5.2.1.1 Runtime Data Collector 115
5.2.2 Performance Predictor 116
5.2.3 Resource Combination Enumerator 119
5.2.4 Resource Selector 120
5.2.5 Resource Manager/Runtime Adaptation 121
5.2.5.1 Resource Allocator/Reallocator 121
5.2.5.2 Performance Monitor 123
5.2.5.3 DVFS Governor 124
5.3 Experimental Results 125
5.3.1 Experimental Setup and Implementation 125
5.3.2 Evaluation of Performance Predictor 127
5.3.3 Comparison of Energy Consumption 129
5.3.4 Performance 130
5.3.5 Runtime Overheads 131
5.4 Discussion 132

6 Conclusions and Future Work 133
6.1 Conclusions 133
6.2 Future Work 136

Appendix A Application Offline Analysis Results 139

Appendix B Experimental Data Used in Classification Bins 145

Appendix C Sample Code for Data Collection and Processing 149

References 161

List of Figures

1.1 Historical trends in CPU performance, reprinted from [1] 1
1.2 Hot spot power density scaling with feature size. Reprinted from [2] 2

2.1 Illustration of static and dynamic current through an inverter gate 12
2.2 Relation between maximum frequency and dynamic power versus supply voltage. Reprinted from [3] 14
2.3 An illustration of the DVFS technique. Reprinted from [4] 14
2.4 Illustration of thread-to-core mapping under a power cap. (a) Executing a multithreaded program. (b) and (c) Executing four multithreaded applications concurrently. Reprinted from [5] 16
2.5 Evolution of the number of cores in embedded platforms. Extracted from [6] 18
2.6 Speedup vs. number of processors for different portions of parallel code in an application, using Amdahl's law [7] 19
2.7 Homogeneous multi-core architecture of a system in which 14 software applications are allocated by a single host to the cores, reprinted from [8] 20
2.8 A heterogeneous multi-core architecture, running four different operating systems and 14 applications, reprinted from [8] 22
2.9 Representing an embedded system as three stacked layers: application, operating system and the hardware 23
2.10 cpufreq infrastructure, a subsystem of the Linux kernel. Extracted from [9] 25
2.11 Normalized big-core CPI stacks (right axis) and small-core slowdown (left axis). Benchmarks are sorted by their small-versus-big core slowdown. Reprinted from [10] 30
2.12 Odroid-XU3 board (top) containing Samsung 5422 heterogeneous MPSoC (bottom). Adapted from [11] 37
2.13 The block diagram of the Intel Xeon E5-2630V2. Redrawn from [12] 38
2.14 The Intel Xeon Phi architecture layout with ring-based interconnect. Taken from [13] 38
2.15 Perfmon2 and pfmon, sitting in the kernel and userspace levels, respectively. Reprinted from [14] 41

3.1 Experimental methodology used for workload classification 47
3.2 Variation in execution time with frequency for lbm, astar and mcf applications running on the Cortex-A15 core 49
3.3 Variation in execution time with frequency for lbm, astar and mcf applications running on the Cortex-A7 core 49
3.4 Effect of concurrent execution on application execution time 50


3.5 Correlation between different performance metrics and application performance, measured in IPC 51
3.6 Cycle count for sequential and concurrent execution of astar, lbm and mcf. Samples are collected every 200 ms 52
3.7 L2 cache misses for individual and concurrent execution of astar, lbm and mcf. Samples are collected every 200 ms 52
3.8 K-means clustering of workloads for concurrent execution of astar and lbm 53
3.9 MRPI and MRPC of various applications, computed by executing each application to completion. The change in execution time when frequency is scaled from 2 GHz to 1 GHz 54
3.10 Conceptual overview: concurrent applications with associated workload classes and the need for optimal V-f selection 55
3.11 Variation in MRPI for individual (top) and concurrent (bottom) execution of applications 56
3.12 Representing a multi-core system as three stacked layers: application, operating system and hardware (left), and overview of the proposed online energy minimization technique (right) 58
3.13 Workload prediction for the application scenario mi-mc 63
3.14 Comparison of the proposed approach in terms of energy consumption with the existing approaches for different execution scenarios: a) single-application, b) double-application, and c) triple-application 65
3.15 MRPI and frequency at different time intervals of the application scenario lb-mi execution for various approaches 66
3.16 Average performance difference between the proposed approach and other reported approaches 66
3.17 Runtime overhead of the proposed approach 67
3.18 Illustration of various steps in the proposed approach for runtime energy management 69
3.19 An example of V-f setting selection using the binning based approach 72
3.20 Comparison of the proposed technique with reported approaches for the single application scenario in terms of normalized energy consumption, executing on the Xeon E5-2630 75
3.21 Normalized energy consumption of various approaches for the double application scenario, executing on the Xeon E5-2630 76
3.22 Comparison of the proposed approach with reported approaches for the triple application scenario in terms of normalized energy consumption (evaluated on the Xeon E5-2630) 76
3.23 Mean performance difference between prop-NUMA and other approaches, with standard deviation error bars (evaluated on the Xeon E5-2630) 77
3.24 Comparison of the proposed approach (prop-NUMA) with reported approaches in terms of normalized energy consumption for various application scenarios, executing on the Xeon Phi 78
3.25 Mean performance difference between prop-NUMA and other approaches, with standard deviation error bars (evaluated on the Xeon Phi) 79

4.1 Execution time (seconds) and energy consumption (J) values by executing the Blackscholes application (from the PARSEC benchmark [15]) with various core combinations, including inter-cluster, on ARM's big.LITTLE architecture containing 4 big (B) and 4 LITTLE (L) cores 84
4.2 Key steps in runtime management of concurrent execution of multi-threaded applications on a heterogeneous multi-core architecture 85
4.3 Overview of a three-layer representation of a multi-core system (left) and proposed runtime manager (right) 88
4.4 Design points representing performance and energy trade-off points for Blackscholes (top) and Bodytrack (bottom) applications 90
4.5 Effect of adaptive sampling on energy and performance for various application scenarios 96
4.6 Comparison of ITMD approach with reported approaches for single active application 100
4.7 MRPI and frequency at different time intervals of the application fr execution for various approaches 100
4.8 Comparison of ITMD approach with reported approaches for two active applications 101
4.9 Comparison of ITMD approach with reported approaches for three active applications 102
4.10 Percentage of energy savings achieved by proposed ITM and ITMD, respectively 102
4.11 Performance improvement/degradation of the adopted approach and HMPP 103
4.12 Workload prediction using EWMA for three different application scenarios: one, two and three active applications (top to bottom) 104
4.13 Runtime overhead of the proposed approach 106

5.1 A motivational example showing three possible runtime execution scenarios (b, c & d) when a system, having two types of cores (Type-1 and Type-2), starts with executing three performance-constrained applications (a). Cores running the same application are encircled with a line of the same color. App1, App2, App3, and App4 represent user applications 111
5.2 Illustration of various steps in the proposed adaptive thread-to-core mapping and DVFS approach (bottom) and its placement in the three-layer representation of a multi-core platform (top) 114
5.3 Energy and execution time at different resource combinations of big (B) and LITTLE (L) for the applications Blackscholes (top) and Bodytrack (bottom) from PARSEC [15], executing on the Odroid-XU3 122
5.4 Box plot of absolute percentage error in IPC prediction by the proposed performance model for different numbers of decision stumps used in the additive regression, showing the median, lower quartile, upper quartile and outliers: (a) estimating the performance of LITTLE given the information about the big core; (b) estimating the performance of big given the information about the LITTLE core 126
5.5 Comparison of the proposed approach with reported approaches in terms of energy consumption for single and concurrent applications 128
5.6 Energy consumption of different approaches for one and two applications added dynamically to the system while an application is executing 128

5.7 Resource combination (number of big (B) and LITTLE (L) cores) allocated to Blackscholes and Bodytrack by the proposed approach to adapt to application arrival/completion and performance variation 129
5.8 Evaluation of various approaches in meeting application performance constraints 130

List of Tables

2.1 Comparison of various run-time management approaches for homogeneous (homo.) and heterogeneous (heter.) multi-core architectures 33

3.1 Selected applications from Rodinia [16] and NPB [17] and their abbreviations 74

4.1 Design time analysis of workload classes (MRPI range) and corresponding frequencies 94
4.2 Selected applications from PARSEC [15] and SPLASH [18] benchmarks 96
4.3 Approaches considered for comparison 97
4.4 Resource combination achieved by proposed mapping approach at runtime for different application scenarios 99

5.1 Parameters used in the proposed approach 115

A.1 Offline profiling results for Blackscholes, executing on the Odroid-XU3 139
A.2 Offline profiling results for Bodytrack, executing on the Odroid-XU3 140
A.3 Offline profiling results for Swaptions, executing on the Odroid-XU3 140
A.4 Offline profiling results for Freqmine, executing on the Odroid-XU3 141
A.5 Offline profiling results for Streamcluster, executing on the Odroid-XU3 141
A.6 Offline profiling results for Vips, executing on the Odroid-XU3 142
A.7 Offline profiling results for Water_Spatial, executing on the Odroid-XU3 142
A.8 Offline profiling results for Raytrace, executing on the Odroid-XU3 143
A.9 Offline profiling results for Fmm, executing on the Odroid-XU3 143

B.1 The classification bins, namely utilisation and MRPI, and corresponding frequencies for the Odroid-XU3 145
B.2 Classification bins data continued from Table B.1 146
B.3 The classification bins, namely utilisation and MRPI, and corresponding frequencies for the Intel Xeon E5-2630 147


List of Algorithms

1 Online energy minimization approach 60
2 Runtime thread-to-core mapping selection 91
3 Adaptive DVFS approach 93
4 Proposed Dynamic Thread-to-core Mapping 117
5 DVFS governor (DVFS()) 123


Abbreviations

ACPI Advanced Configuration and Power Interface
AsAP Asynchronous Array of Simple Processors
BIOS Basic Input/Output System
CFA Continuous Frequency Adjustment
CFD Computational Fluid Dynamics
CP Combination Point
CPI Cycles Per Instruction
CPU Central Processing Unit
DLP Data-Level Parallelism
DP Design Point
DRAM Dynamic Random Access Memory
DSE Design Space Exploration
DSP Digital Signal Processors
DVFS Dynamic Voltage and Frequency Scaling
DVS Dynamic Voltage Scaling
DyPO Dynamic Pareto-Optimal
ES Exhaustive Search
EWMA Exponential Weighted Moving Average
FPGA Field-Programmable Gate Array
GPP General Purpose Processors
GPU Graphics Processing Unit
HMP Heterogeneous Multi-Processing
HIBI Heterogeneous IP Block Interconnection
ILP Instruction-Level Parallelism
I/O Input/Output
IP Intellectual Property
IPC Instructions Per Cycle
IPS Instructions Per Second
ISA Instruction-Set Architecture
ITMD Inter-cluster Thread-to-core Mapping and DVFS
ITRS International Technology Roadmap for Semiconductors
IVP Initial Value Problem
LLC Last Level Cache


MAE Mean Absolute Error
MAPC Memory Accesses Per Cycle
MAPE Mean Absolute Percentage Error
MAPI Memory Accesses Per Instruction
MAPM Memory Accesses Per Micro-operation
MDE Multicore Development Environment
MIC Intel Many Integrated Core
MIPS Microprocessor without Interlocked Pipelined Stages
MLP Memory-Level Parallelism
MLR Multinomial Linear Regression
MPSoC Multi-Processor System-on-Chip
MPSS Manycore Platform Software Stack
MSR Model Specific Register
MRPC Memory Reads Per Cycle
MRPI Memory Reads Per Instruction
NP Non-deterministic Polynomial time
NUMA Non-Uniform Memory Access
OoO Out-of-Order
PC Personal Computers
PCA Principal Component Analysis
PCIe Peripheral Component Interconnect express
PDA Personal Digital Assistant
PE Processing Element
PID Proportional-Integral-Derivative
PIE Performance Impact Estimation
PMC Performance Monitoring Counter
PMU Performance Monitoring Unit
PPW Performance Per Watt
QoS Quality of Service
QPI Quick Path Interconnect
RAW Raw Architecture Workstation
RISC Reduced Instruction Set Computer
RL Reinforcement Learning
RMS Recognition, Mining, and Synthesis
ROI Region Of Interest
RTEM Runtime Energy Management
RTL Register Transfer Level
RTM Runtime Manager
SMT Simultaneous Multi-threading
SoC System-on-Chip
SPEC Standard Performance Evaluation Corporation

TRIPS Tera-op, Reliable, Intelligently adaptive Processing System
VFS Voltage and Frequency Scaling
WED Weighted Euclidean Distance
WEKA Waikato Environment for Knowledge Analysis
WMI Workload Memory-Intensity

Nomenclature

ai Actual workload
Appsprfr Performance requirements of applications
CAApps Concurrently active applications
Ci_coresm Used Ci-type cores for an application m
CL Load capacitance [F]
DAm Design points of an application
δ, ζ Effect of memory contention on performance (latency)
Ei Cluster with cores of type Ci
Energym Energy consumption of an application m [J]
Eo Energy overhead [J]
ET Total energy consumption [J]
η Speedup
f Operating frequency [Hz]
fmax Maximum operating frequency [Hz]
γ Smoothing factor
λ Performance loss
li Number of cores of type Ci
µ-ops Micro-operations retired
Mi MAPM of a processing core
ml Accesses to local memory
mr Accesses to remote memory
NrCApps Number of active applications
nL Number of LITTLE cores
nb Number of big cores
Pdynamic Dynamic power consumption [W]
pi Predicted workload
PMCo PMC data collection overhead [s]
Pstatic Static power consumption [W]
PT Total power consumption [W]
r Number of times adaptation is paused
R Number of threads of an application
TCmap Number of thread-to-core mappings
TCmap_VF Total number of resource combinations
te Execution time [s]
θ Measure of latency associated with remote memory access
Tmap Thread-to-core mapping overhead [s]
To Total overhead [s]
Tsample Time interval [s]
TV-f Overhead associated with V-f pair selection [s]
V Supply voltage [V]
VfSo V-f switching overhead [s]
W Workload
WEi Workload on a cluster Ei

Declaration of Authorship

I, Karunakar Reddy Basireddy, declare that the thesis entitled Runtime Energy Management of Concurrent Applications for Multi-core Platforms and the work presented in the thesis are both my own, and have been generated by me as the result of my own original research. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University;

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated;

• Where I have consulted the published work of others, this is always clearly attributed;

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work;

• I have acknowledged all main sources of help;

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself;

• Parts of this work that have been published are listed under the Research Contributions.

Signed:......

Date:......


Acknowledgements

I would like to take this opportunity to extend my thanks to the many people who directly and indirectly supported me during my PhD. First of all, my heartfelt gratitude goes to my supervisors: Dr. Geoff V. Merrett, for his invaluable guidance, support, and suggestions throughout my PhD, and Prof. Bashir M. Al-Hashimi, for his direction and advice at many times, and for providing me with the opportunity to undertake this research. In spite of being busy with many responsibilities, they have always found time for regular discussions with me. I would also like to express my thanks to Dr. Amit Kumar Singh for his insightful and stimulating discussions and suggestions. Without their continuous support, this thesis would not have been in its present form.

Many thanks also go to the Engineering and Physical Sciences Research Council (EPSRC) funded PRiME Project for financially supporting me throughout the PhD. I also extend my thanks to the School of Electronics and Computer Science, the University of Southampton, for providing state-of-the-art research facilities. Being part of the PRiME project, I had the opportunity to work with a unique group of colleagues: Dr. Dwaipayan Biswas, Dr. Domenico Balsamo, Dr. Eduardo Wachter, Matt Walker, Charles Leech, and Dr. Graeme Bragg, to name a few. Thanks to all of them for creating a vibrant research environment for thought-provoking discussions and knowledge sharing. Thanks to Kath Kerr for her kind administrative and logistics support. I would also like to thank Dr. Amit Acharyya for always providing me with career guidance and support.

Further, my sincere thanks to all my friends who have always been there to help and for making my stay in Southampton a pleasant memory. In this regard, I would like to mention a few notable names who are part of our regular discussions at tea breaks/gym sessions: Charan, Satya, Rakshith, Ashwini, Amin, Alberto, Oktay, Cai, Sadegh and many more.

Finally, greatest thanks go to my family (Vengamma, Srinivasulu Reddy, Archana and Duggaiah) and Shri Krishna for their infinite love, believing in me and making me into what I am today.


Chapter 1

Introduction

Computing systems have become pervasive in many fields (e.g., medical and business), and the demand for higher performance is ever increasing. Performance is usually defined as the execution time of an application, instructions per second, or instructions per cycle [10, 19, 20]. In the '80s and '90s, frequency scaling helped in meeting the increasing demands for performance. Shortening the critical path and exploiting instruction-level parallelism allowed the CPU to run at higher clock speeds to improve throughput [21]. Consequently, processor manufacturers were able to double single-threaded performance approximately every 18 months [22], which added an increasing amount of complex logic to the processor core. For a long time, new process technologies allowed for smaller and less energy-consuming transistors. Furthermore, techniques such as pipelining, superscalar execution and Out-of-Order (OoO) execution were proposed to improve performance [23].

Figure 1.1: Historical trends in CPU performance, reprinted from [1].


Figure 1.2: Hot spot power density scaling with feature size. Reprinted from [2]

However, as Dennard scaling [24, 25] came to an end, increasing the gate count on a die and scaling up the frequency to meet performance demands is no longer a favorable approach, as heat generation has become a serious problem.

As a result, the rapid increase in single-threaded performance seen during the last few decades has come to an end. Fig. 1.1 shows how Moore's law has been continuing over time [21, 22]. As can be seen from the figure, the transistor count on a die is still increasing exponentially. However, with the end of Dennard scaling [24, 25], power density became a problem as more transistors are crammed together. The increased power in a tiny area leads to higher heat generation than conventional cooling solutions can dissipate. To address this issue, researchers have been working towards achieving higher performance without a further increase in power density. Fig. 1.2 shows the hot spot (heavily used parts of the cores) power density at each technology node. The figure shows that hot spot power density can increase by up to 3× from 45 nm to 10 nm.

While improving performance, one must take care not to exceed the physical limits of power dissipation. Thus, energy efficiency is key to achieving additional performance gains. Moreover, processors targeting laptops, smartphones and other battery-operated mobile devices have always been energy-constrained. In such cases, improved energy efficiency allows for longer battery life. More recently, mobile processors have become increasingly popular in alternative domains, such as high performance computing, due to their low energy budget [26, 27]. Building data centers from low-cost embedded processors is believed to have huge potential and could change the landscape of supercomputing in the future [27].

1.1 Runtime Energy Management

Traditionally, software applications have been written for serial computation and executed on a single-core processor. To improve the performance of such applications, techniques such as Instruction-Level Parallelism (ILP), the simultaneous execution of multiple instructions with no data dependencies, and Memory-Level Parallelism (MLP), the servicing of multiple outstanding memory accesses at a time, have been used. However, due to limited ILP and MLP, single-core processors are not effective in improving performance further [2]. Prior art showed that parallelism inherent to an application, including thread-/task-level parallelism (TLP) and data-level parallelism (DLP), is a more power-efficient way of achieving high performance than frequency scaling under power limitations. Considering these factors and the increased power density, modern embedded systems [28–30] employ multi-core processors to utilize the billions of available transistors efficiently. Moreover, energy efficiency has been a primary design objective in embedded systems [31]. These multi/many-core architectures (expected to have 1000+ cores on a single chip [32]) have been demonstrated as a promising solution for energy-efficient computing [33, 34].

Fundamental to improving energy efficiency is managing the operation of the processor in an intelligent way. Modern processors employ various power-saving techniques, such as Dynamic Voltage and Frequency Scaling (DVFS), clock gating, power gating and multiple asymmetric cores (e.g., ARM's big.LITTLE™ architecture [4]). DVFS is one of the promising solutions for energy minimization under performance constraints and workload variations. It has been shown that runtime energy management software can achieve significant energy improvements by controlling these power-saving techniques to smartly manage the energy-performance trade-off while taking other factors into account (e.g., performance requirements and workload variations) [20, 35]. DVFS is usually controlled by such software, e.g., Linux's power governors and ARM's Voltage and Frequency Scaling (VFS) firmware [36]. The basic principle of DVFS is to reduce the operating voltage-frequency (V-f) dynamically at runtime, resulting in a cubic decrease in power consumption [37–39]. A runtime manager can also manage the allocation of processing cores to an application by taking performance constraints, available resources, and the optimization goal into account. This is usually called task mapping or thread-to-core mapping, which defines the number of cores and their type allocated to each application. Previous studies [19, 40] showed that efficient thread-to-core mapping leads to energy savings and better performance. A minimal sketch of such a runtime manager's control loop is given below.
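The skeleton below illustrates, in simplified form, the control loop a runtime manager of this kind implements: sample performance counters, predict the next period's workload, and map the prediction to a frequency. It is a minimal sketch, not the thesis implementation; the PMC values, the EWMA smoothing factor and the classification thresholds are all invented for illustration.

    // Minimal runtime-manager skeleton (illustrative only, not the thesis code):
    // sample PMCs, predict the next period's workload with an EWMA, and map the
    // prediction to a frequency bin. All constants here are invented.
    #include <chrono>
    #include <iostream>
    #include <thread>

    struct PmcSample { long long instructions; long long memory_reads; };

    // Stub: a real manager reads these via perf_event_open or the PMU driver.
    PmcSample read_pmcs() { return {1000000, 25000}; }

    // Exponentially weighted moving average for workload prediction.
    double predict_workload(double mrpi) {
        static double ewma = mrpi;
        const double gamma = 0.5;                 // smoothing factor (assumed)
        ewma = gamma * mrpi + (1.0 - gamma) * ewma;
        return ewma;
    }

    // Classify the predicted MRPI into a frequency bin (thresholds invented).
    long select_frequency_khz(double mrpi) {
        if (mrpi > 0.05) return 800000;           // memory-bound: run slow
        if (mrpi > 0.01) return 1400000;          // mixed workload
        return 2000000;                           // compute-bound: run fast
    }

    int main() {
        for (int period = 0; period < 5; ++period) {
            PmcSample s = read_pmcs();
            double mrpi = double(s.memory_reads) / double(s.instructions);
            long f = select_frequency_khz(predict_workload(mrpi));
            std::cout << "period " << period << ": MRPI = " << mrpi
                      << " -> " << f << " kHz\n"; // a real RTM writes cpufreq here
            std::this_thread::sleep_for(std::chrono::milliseconds(200));
        }
    }

The design point worth noting is that prediction makes the manager proactive: the V-f level for the coming period is set from the smoothed workload estimate rather than reacting to the previous period alone.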

1.2 Research Justification

The existing approaches to runtime energy management for multi-core platforms are generally of three types: 1) offline optimization, 2) online optimization, and 3) hybrid (online optimization supported by offline profiling). Existing offline optimization based approaches employ DVFS and/or task mapping to find performance-energy trade-off points [4, 41–46]. However, these works have several drawbacks: they consider only a single application at a time [41–45], and most of them are not evaluated on real hardware platforms [42, 45, 46].

Online optimization has also been considered to cater for dynamic workload scenarios, optimizing energy consumption while respecting timing constraints [20, 35, 47–49]. In [47–49], the online algorithm uses hardware performance monitoring counters (PMCs) to achieve energy savings without recompiling the applications. In [20, 35], an online reinforcement learning based DVFS is proposed. These approaches can handle unknown applications, but lead to inefficient results as optimization decisions need to be taken quickly and offline analysis results are not used. Further, they are agnostic to the workload variations of concurrently executing applications. Note that, in this thesis, concurrent execution refers to the simultaneous execution of multiple applications that may support various types of parallel computation such as ILP, MLP, DLP, and TLP.

There has also been a focus on online optimization facilitated by offline analysis results [5, 10, 19, 40, 50–52]. Such approaches lead to better results than online-only optimization, as they take advantage of both offline and online computations. In [40], thread-to-core mapping and DVFS are performed based on a power constraint. Similarly, in [5], a thread-to-core mapping is first obtained based on utilization, and then DVFS is applied depending upon the surplus power. However, the approaches of [5, 40] have several drawbacks: they target homogeneous multi-core architectures, and they do not take workload variations into account while setting the V-f.

For heterogeneous multi-cores, whose processing cores have distinct micro-architectures with the same (ARM's big.LITTLE architecture) or different instruction-set architectures (ARM's big.LITTLE CPU + Mali GPU), recent works have considered multi-threaded applications [10, 19, 50–52]. However, at a given moment in time, most of these approaches map application threads entirely onto one type of core situated within a cluster [10, 19, 51, 52]. This reduces the thread-to-core mapping complexity, but misses the benefit of distributing the threads of an application across multiple types of cores. Unlike [10] and [51], the approaches of [19, 50] exploit DVFS, but they have several drawbacks. For example, in [19], each single-threaded application is mapped onto only one type of core at a time. In [50], application threads are mapped onto more than one type of core, but the approach depends heavily on offline regression analysis and the DVFS level is not adjusted during execution. Moreover, the above approaches do not perform adaptations (changing the DVFS setting and/or thread-to-core mapping) upon application arrival or completion. As demonstrated in this thesis, this leads to lower resource utilisation, increased energy consumption and frequent violation of performance constraints.

Multi-core systems usually execute multiple applications with distinct performance constraints concurrently. The interference between applications due to shared resources, workload variations, and limited and dynamic resource availability make runtime energy management a challenging task. As existing approaches are not effective for concurrent execution of multiple applications, there is a need for efficient runtime energy minimization approaches that address the above problems, which is the aim of this thesis.

1.3 Research Questions

The above discussion motivates us to answer the following research questions related to runtime energy management:

1. Can the application workload be classified efficiently at runtime? Subsequently, how can workload classification be applied to the concurrent execution of applications on homogeneous multi-cores to achieve energy efficiency with negligible performance loss?

2. For concurrently executing multi-threaded applications on a heterogeneous multi-core system, how can the available heterogeneity and the DVFS potential of cores be efficiently exploited to minimize energy consumption while satisfying performance and resource constraints?

3. How can the dependency of runtime decisions on application offline profiling be removed to facilitate energy-efficient management of unknown applications and adaptation to runtime execution scenarios (e.g., application arrival/completion, performance violations, etc.)?

1.4 Research Contributions

To address the above research questions, the contributions that have been made during the course of this research are summarized as follows:

1. A runtime energy minimization approach through concurrent workload classification on homogeneous multi-cores. To select appropriate DVFS settings as per the workload, an online workload classification is proposed, taking concurrent execution and contention on shared resources into account. The DVFS settings are decided proactively through workload prediction, thereby adapting to workload variations efficiently and reducing power consumption. The online concurrent workload classification is done using a metric called Memory Reads Per Instruction (MRPI), derived through extensive statistical analysis of PMC data and

identification of appropriate PMCs that can capture workload variations at runtime (a sketch of the PMC-based measurement behind MRPI is given after this list). Further, this approach has been extended to utilise thread synchronisation overheads, non-uniform memory access latencies and the underlying DVFS architecture to improve energy efficiency for multi-threaded applications. It proposes a binning based approach for efficiently determining the DVFS setting as per the predicted workload. When validated on three hardware platforms (Odroid-XU3, Intel Xeon E5-2630, and Xeon Phi 7120P), the approach achieves an average improvement in energy efficiency of up to 60% compared to existing approaches. The above contribution addresses Research Question 1, and has been published at IEEE/ACM DATE-2018 [53], the AMAS Workshop-2018 [54] and IEEE OPTIM-2018 [55].

2. An RTM approach for minimizing energy consumption of concurrently executing multi-threaded applications on a heterogeneous multi-core. This approach first selects a thread-to-core mapping based on the performance requirements and resource availability. Unlike existing approaches, the thread-to-core mapping simultaneously exploits different types of cores for each performance-constrained application in a concurrent execution scenario, thereby improving energy efficiency and performance. Then, it applies online adaptation by adjusting the V-f levels of each cluster through concurrent workload classification to achieve energy optimization without trading off application performance. Here, the DVFS approach proposed for homogeneous multi-cores in Contribution 1 is extended to take core heterogeneity into account. The experimental results show that the proposed approach achieves an average improvement of up to 33% in energy efficiency compared to existing approaches while meeting the performance requirements. This contribution addresses Research Question 2, and has been published in IEEE/ACM DATE-2017 [56] and IEEE TMSCS-2017 [57].

3. An energy-efficient adaptive mapping and DVFS approach. The key feature of the proposed approach is the elimination of dependency on offline profiled results when making runtime decisions. This is achieved through a performance prediction model with up to 7.9% better accuracy than the previously reported model, and an energy-efficient mapping approach that allocates processing cores to applications while respecting performance constraints. Furthermore, this approach adapts to runtime execution scenarios efficiently by monitoring application status and performance/workload variations to adjust the previous DVFS settings and thread-to-core mappings. The proposed approach is experimentally validated on the Odroid-XU3 with various combinations of diverse multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an improvement of up to 28% in energy efficiency compared to the recently proposed approach while meeting performance constraints. This contribution addresses Research Question 3, and has been submitted for publication in IEEE TCAD-2019 [58].
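To make the PMC measurement behind MRPI (Contribution 1) concrete, the sketch below counts retired instructions and a memory-read-related event over one sampling window using Linux's perf_event_open. Using PERF_COUNT_HW_CACHE_MISSES as a stand-in for "memory reads" is an assumption for illustration; the thesis derives MRPI from platform-specific PMC events.

    // Illustrative MRPI measurement over one 200 ms window via perf_event_open.
    // PERF_COUNT_HW_CACHE_MISSES is a stand-in for "memory reads"; the event
    // actually used in the thesis is platform-specific.
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdio>
    #include <iostream>

    static int open_counter(unsigned type, unsigned long long config) {
        perf_event_attr attr{};
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        // pid = 0, cpu = -1: count this process on any CPU. A runtime manager
        // would instead attach counters per application PID or per core.
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main() {
        int instr = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
        int reads = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
        if (instr < 0 || reads < 0) { std::perror("perf_event_open"); return 1; }

        ioctl(instr, PERF_EVENT_IOC_RESET, 0);  ioctl(reads, PERF_EVENT_IOC_RESET, 0);
        ioctl(instr, PERF_EVENT_IOC_ENABLE, 0); ioctl(reads, PERF_EVENT_IOC_ENABLE, 0);

        usleep(200000);                           // one 200 ms sampling window

        long long n_instr = 0, n_reads = 0;
        if (read(instr, &n_instr, sizeof(n_instr)) < 0) return 1;
        if (read(reads, &n_reads, sizeof(n_reads)) < 0) return 1;
        std::cout << "MRPI = " << (double)n_reads / (double)n_instr << "\n";
    }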

In the co-authored publications below, which have resulted from the research work presented in this thesis, my contributions include developing the ideas, implementing and validating the approaches, and writing most of the papers. The remaining co-authors have primarily been involved in discussions on how best to carry out the experimental validations and place the research contributions in the context of relevant previous work, and/or in providing feedback to improve the presentation of the papers.

1.5 Software Contributions

Intelligent Runtime Management Algorithms: The RTM algorithms presented in Chapter 3 have been implemented as userspace applications for experimental validation. The source code is available at the following addresses: https://github.com/PRiME-project/rtm-Xeon (Xeon platforms) and https://github.com/PRiME-project/RTM_Concurrent (Odroid-XU3). This RTM has been further integrated into the cross-layer PRiME Framework (https://github.com/PRiME-project/PRiME_Framework) for experimental validation.

Furthermore, the runtime energy management approaches presented in Chapters 4 and 5 are also implemented as userspace applications, whose source code is made available at https://github.com/PRiME-project/RTM_Concurrent and https://github.com/PRiME-project/rtm-dtcm, respectively.

1.6 Publications

The following publications have no PhD student as a co-author and, therefore, have not been reported in anyone else's thesis.

1. K. R. Basireddy, A. K. Singh, G. V. Merrett, and B. M. Al-Hashimi, “AdaMD: Adaptive Mapping and DVFS on Heterogeneous Multi-cores,” submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2019 [submitted].

2. K. R. Basireddy, A. K. Singh, G. V. Merrett, B. M. Al-Hashimi, “Online concurrent workload classification for multi-core energy management,” Design, Automation and Test in Europe (DATE), 2018. [53]

3. K. R. Basireddy, E. W. Wachter, G. V. Merrett and B. M. Al-Hashimi, “Workload-Aware Runtime Energy Management for HPC Systems,” in IEEE International Workshop on Optimization of Energy Efficient HPC & Distributed Systems (OPTIM), 2018. [55]

4. K. R. Basireddy, E. W. Wachter, B. M. Al-Hashimi and G. V. Merrett, “Memory- and thread synchronization contention-aware DVFS for HPC systems,” Adaptive Many-Core Architectures and Systems workshop, 2018. [54]

5. K. R. Basireddy, A. K. Singh, G. V. Merrett, and B. M. Al-Hashimi, “ITMD: Run-time Management of Concurrent Multi-threaded Applications on Heterogeneous Multi-cores,” Design, Automation and Test in Europe (DATE), pp. 1, 2017. [56]

6. K. R. Basireddy, A. K. Singh, Dwaipayan Biswas, G. V. Merrett, and B. M. Al-Hashimi, “Inter-cluster Thread-to-core Mapping and DVFS on Heterogeneous Multi-cores,” IEEE Transactions on Multi-Scale Computing Systems (TMSCS), 2017. [57]

As part of the EPSRC funded PRiME project, the following publications were also made through collaboration, which are not reported as contributions in this thesis:

1. A. K. Singh, K. R. Basireddy, A. Prakash, G. V. Merrett, and B. M. Al-Hashimi, “Energy-Efficient Collaborative Adaptation in Heterogeneous Mobile SoCs,” IEEE Transactions on Computers (TC), 2018 [59] (under review).

2. E. W. Wachter, C. d. Bellefroid, K. R. Basireddy, A. K. Singh, B. M. Al-Hashimi, and G. V. Merrett, “Predictive Thermal Management for Energy-efficient Execution of Concurrent Applications on Heterogeneous Multi-cores,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019 [60].

3. A. K. Singh, A. Prakash, K. R. Basireddy, G. V. Merrett, and B. M. Al-Hashimi, “Energy-efficient run-time mapping and thread partitioning of concurrent OpenCL applications on CPU-GPU MPSoCs,” ACM Transactions on Embedded Computing Systems (TECS), 2017. [61]

4. A. K. Singh, C. Leech, K. R. Basireddy, B. M. Al-Hashimi, and G. V. Merrett, “Learning-based run-time power and energy management of multi/many-core systems: current and future trends,” Journal of Low Power Electronics (JOLPE), 2017. [62]

5. K. R. Basireddy, M. J. Walker, D. Balsamo, S. Diestelhorst, B. M. Al-Hashimi, and G. V. Merrett, “Empirical CPU power modelling and estimation in the gem5 simulator,” in IEEE International Conference on Power and Timing Optimization and Simulation (PATMOS), 2017. [63]

1.7 Thesis Organization

The remainder of the thesis is organized as follows. Chapter 1 Introduction 9

Chapter 2 first presents an introduction to energy consumption, DVFS and thread-to-core mapping. It then reviews existing runtime energy management systems and identifies the drawbacks of those approaches.

Chapter 3 presents the investigation of workload classification using execution time and hardware PMC data. Further, it discusses the proposed energy minimization approach for concurrently executing single-threaded applications, specifically on a homogeneous multi-core system supporting cluster-wide DVFS (cores in a cluster are set to the same DVFS level). It also presents the runtime classification of concurrent workloads and the selection of an appropriate DVFS level based on the predicted workload.

Chapter 4 extends the online DVFS technique proposed in Chapter 3 to a cluster-based heterogeneous multi-core architecture. It also proposes inter-cluster thread-to-core mapping to efficiently manage multi-threaded applications, improving energy efficiency while meeting their performance and system resource constraints.

Chapter 5 presents an energy-efficient adaptive mapping approach coupled with DVFS for performance-constrained applications executing on heterogeneous multi-cores. It starts with an introduction and gives a motivational example to show the importance of the proposed approach. Then, a detailed discussion of the problem formulation and the various steps in the proposed approach is given. Finally, experimental evaluation of the performance models and the runtime management approach is carried out to show the benefits in comparison with existing approaches.

Chapter 6 concludes this work, providing a summary of the key contributions made throughout this thesis. Additionally, potential future directions for extending this work are detailed.

Chapter 2

Literature Review of Runtime Energy Management

This chapter presents the necessary background and a thorough review of existing approaches to runtime energy management. Embedded systems, mainly referred to as fixed or programmable systems executing a specific function, have crossed the boundaries of embedded computing by supporting general-purpose computing with sophisticated system software. For example, mobile phones used to be simply part of a cellular network, but they now serve various purposes such as audio/video playback, video games, voice calling, etc. Similarly, the real-time embedded systems in self-driving cars have sophisticated computing infrastructure to handle various aspects (e.g., constructing internal maps, processing input and plotting the path). As a result, these systems have been employing multi-core processors to meet the ever-rising performance and power demands of modern complex embedded applications and to improve energy efficiency. Multi-core platforms containing several homogeneous and/or heterogeneous processing cores are becoming ubiquitous in embedded processing [64]. These platforms can provide better performance-energy trade-offs by executing parallel threads of applications on different types of processing cores simultaneously. In addition, in some cases each processing core can operate at a different Dynamic Voltage and Frequency Scaling (DVFS) level, possibly at a lower frequency than the very high frequencies of single-core processors, thus fulfilling the energy consumption constraint.

In this context, runtime energy management is required to adapt to the diverse resource and performance requirements of applications and to achieve energy savings. Several approaches have been proposed for homogeneous and heterogeneous multi-core embedded platforms to improve energy efficiency. These approaches usually consider DVFS and/or thread-to-core mapping techniques for efficiently managing the available system resources. For a better understanding of the existing and proposed runtime energy management approaches, the DVFS and thread-to-core mapping concepts are briefly explained in Section 2.1.



Figure 2.1: Illustration of static and dynamic current through an inverter gate.

The recent architectural innovations to improve performance and energy efficiency are given in Section 2.2. Then, a critical review of existing runtime energy management approaches is presented in Section 2.3. This section mainly covers runtime energy management approaches for homogeneous and single-Instruction Set Architecture (ISA) heterogeneous architectures, as these are the focus of this thesis. Finally, the contents of this chapter are summarized in Section 2.5.

2.1 DVFS and Thread-to-Core Mapping

The two components that determine total energy consumption in a system are static and dynamic/switching energy consumption. Static energy consumption is caused by a small current continuously leaking through the transistors, while dynamic energy consumption is due to the charging and discharging of output capacitance [65]. The total energy consumption (E_T) can be represented as the product of total power consumption (P_T) and execution time (t_e) as follows:

E_T = P_T · t_e    (2.1)

where P_T is computed as:

P_T = P_dynamic + P_static    (2.2)
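In practice, Equation (2.1) is applied by integrating sampled power over a run, E_T ≈ Σ P_i · Δt. A minimal sketch follows, assuming an on-board power sensor readable through sysfs; the path below imitates the INA231 sensors of the Odroid-XU3 used later in this thesis, but is an assumption and varies by kernel and board revision.

    // Estimate E_T by sampling an on-board power sensor and summing P_i * dt.
    // The sensor path is an assumed example (Odroid-XU3 INA231, one cluster).
    #include <chrono>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <thread>

    int main() {
        const std::string sensor =
            "/sys/bus/i2c/drivers/INA231/3-0040/sensor_W";   // assumed path
        const double dt_s = 0.1;                             // 100 ms interval
        double energy_j = 0.0;

        for (int i = 0; i < 100; ++i) {                      // ~10 s window
            std::ifstream f(sensor);
            double watts = 0.0;
            if (!(f >> watts)) { std::cerr << "sensor read failed\n"; return 1; }
            energy_j += watts * dt_s;                        // P_i * dt
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        }
        std::cout << "E_T ~= " << energy_j << " J\n";
    }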

Figure 2.1 shows how current flows through an inverter gate at the transistor level. The solid arrow indicates where static leakage occurs, and the two dotted arrows show where charges escape during switching. Static power consumption originates from transistor size and layout, while dynamic power consumption depends on the amount of transistor switching.

Energy consumption of large-scale and embedded computing systems has been increasing by over 15% per year [66]. Furthermore, due to the growing need for high-performance yet energy-efficient computing systems, energy management has central importance in almost every current and future computing domain [67]. As a consequence, modern embedded systems are equipped with various low-power techniques such as power gating, clock gating, and DVFS. To achieve energy efficiency, these techniques need to be managed in an intelligent way during system operation. Clock gating and power gating are two widely used techniques that reduce dynamic/leakage power by disconnecting unused components and cores from the clock tree and power supply, respectively. For runtime adaptation and energy efficiency, DVFS scales the voltage-frequency (V-f) during application execution. Moreover, thread-to-core mapping allocates the system resources (number of cores and their type) to each application to minimize energy consumption while meeting performance and resource constraints.

2.1.1 DVFS

DVFS is a commonly-used technique to save dynamic power under workload and/or performance variations on a wide range of computing systems, from embedded, laptop and desktop systems to high-performance server-class systems [68]. DVFS improves power efficiency by lowering the CPU frequency and voltage when cycles are being wasted, stalled on memory resources, or under low performance requirements. It is based on the following relation between frequency, voltage, and power:

P_dynamic ∝ C · f · V²    (2.3)

where C, f and V represent load capacitance, operating frequency and supply voltage, respectively. DVFS takes advantage of the quadratic relationship between supply voltage and energy consumption [69], which can result in significant energy savings. Fig. 2.2 shows the maximum operating frequency and dynamic power dissipation of one core versus supply voltage. However, DVFS poses a difficult challenge to real-time systems, as it has to minimize energy without impacting the desired quality of service offered by the system to applications and end users. Moreover, energy can only be saved if the power consumption is reduced enough to cover the extra time it takes to run the workload at a lower frequency.

The basic concept of DVFS for real-time application scenarios is illustrated in Fig. 2.3 [4].

Figure 2.2: Relation between maximum frequency and dynamic power versus supply voltage. Reprinted from [3].

Figure 2.3: An illustration of the DVFS technique. Reprinted from [4].

In this figure, T2 and T4 denote deadlines for tasks W1 and W2, respectively. W1 finishes at T1 if the CPU is operated with a supply voltage level of V1; the CPU will be idle during the remaining (slack) time, S1. To provide a precise quantitative example, let us assume that T2 − T0 = T4 − T2 = ∆T and T1 − T0 = ∆T/2; that the CPU clock frequency at V1 is f1 = n/∆T for some integer n; and that the CPU is powered down or put into standby with zero power dissipation during the slack time. The total energy consumption of the CPU is:

E1 = C · V1² · f1 · ∆T/2 = n · C · V1²/2    (2.4)

where C is the effective switched capacitance of the CPU per clock cycle. Alternatively, W1 may be executed on the CPU using a voltage level of V2 = V1/2, and is thereby completed at T2. Assuming a first-order linear relationship between the supply voltage level and the CPU clock frequency, f2 = f1/2. The total energy consumed by the CPU is then:

E2 = C · V2² · f2 · ∆T = n · C · V1²/8    (2.5)

Clearly, there is a 75% energy saving ((E1 − E2)/E1 × 100) as a result of lowering the supply voltage (this saving is achieved in spite of a perfect, i.e., immediate and with no overhead, power down of the CPU). This energy saving is achieved without sacrificing QoS because the given deadline is met. An energy saving of 89% is achieved when scaling V1 to V3 = V1/3 and f1 to f3 = f1/3 in the case of task W2.
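As a quick check of these figures: for a fixed cycle count, scaling both V and f by a factor k scales energy by k², so the saving is 1 − k². Here E1′ denotes W2's energy at the original voltage V1 (the task's cycle count cancels in the ratio):

    % Worked check: for a fixed cycle count, scaling V and f by a factor k
    % stretches execution by 1/k but scales energy by k^2, so saving = 1 - k^2.
    \begin{align*}
      \frac{E_2}{E_1} &= \frac{C\,(V_1/2)^2\,(f_1/2)\,\Delta T}{C\,V_1^2\,f_1\,\Delta T/2}
                       = \left(\frac{1}{2}\right)^{2} = \frac{1}{4}
        \quad\Rightarrow\quad \text{saving} = 1 - \frac{1}{4} = 75\% \\
      \frac{E_3}{E_1'} &= \left(\frac{1}{3}\right)^{2} = \frac{1}{9}
        \quad\Rightarrow\quad \text{saving} = 1 - \frac{1}{9} \approx 89\%
    \end{align*}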

Weiser et al. [70] were the first to propose the use of DVFS to reduce the energy consumption of processors, using simulated execution traces and the amount of slack time to choose the new frequency. Weissel and Bellosa [47] developed models for predicting a workload's response to a change in frequency based on memory requests per cycle and instructions per cycle, counted using the Performance Monitoring Unit (PMU) available in most processors. Snowdon et al. [71] enhanced this approach by developing a technique to automatically choose the best model parameters from the hundreds of possible events that modern PMUs can measure.

Multi-core processors usually support per-core and per-cluster DVFS to improve energy efficiency through independent control of the V-f levels of each core or cluster, respectively. Homogeneous multi-core architectures can emulate heterogeneity with fine-grained per-core DVFS, as individual core power/performance varies with the selected V-f. For per-cluster DVFS, a group of homogeneous cores is tuned to a particular V-f through some intelligent mechanism. DVFS is controlled by a runtime manager to achieve energy efficiency while providing a better quality of service; on Linux, this control is typically exercised through the cpufreq interface, as sketched below.
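The sketch below sets one cpufreq policy to the userspace governor and pins its frequency via sysfs. On a per-cluster DVFS platform such as the Odroid-XU3, one policy per cluster is a common arrangement; the policy index and the frequency value are illustrative assumptions, and writing these files requires root.

    // Set one cpufreq policy to the "userspace" governor and pin a frequency.
    // The policy index and target frequency are illustrative; requires root.
    #include <fstream>
    #include <iostream>
    #include <string>

    static bool write_sysfs(const std::string& path, const std::string& value) {
        std::ofstream f(path);
        return static_cast<bool>(f << value);
    }

    int main() {
        // On big.LITTLE SoCs one policy per cluster is common (assumption).
        const std::string base = "/sys/devices/system/cpu/cpufreq/policy0/";
        if (!write_sysfs(base + "scaling_governor", "userspace") ||
            !write_sysfs(base + "scaling_setspeed", "1400000")) {  // kHz
            std::cerr << "cpufreq write failed (run as root?)\n";
            return 1;
        }
        std::ifstream cur(base + "scaling_cur_freq");
        std::string khz;
        cur >> khz;
        std::cout << "current frequency: " << khz << " kHz\n";
    }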

2.1.2 Thread-to-Core Mapping

With technological advancement and increasing performance demands, the number of cores in the same chip area has grown exponentially, and different types of cores have been integrated. To efficiently exploit the underlying core-level parallelism, application threads have to be carefully allocated onto the processing cores to minimize the overall energy consumption while satisfying performance and resource constraints. This process is usually called thread-to-core mapping, task mapping, or simply mapping, and involves binding applications to the underlying hardware resources, i.e., processing cores; at the OS level this binding reduces to setting CPU affinity masks, as sketched below. In the case of a heterogeneous multi-core system, in addition to allocating a set of processing cores, the mapping defines the number of processing cores and their type. The mapping and scheduling problem is similar to the Quadratic Assignment Problem, a well-known NP-hard problem [72]. Therefore, finding an optimal solution satisfying all the given constraints is tedious and time-consuming, and heuristics based on application domain knowledge need to be employed to find a nearly optimal solution. Optimal solutions can be explored by advance design-time analysis and then used at runtime. However, the explosion in the number of use-cases (combinations of simultaneously active applications) with the increasing number of applications makes the analysis infeasible: for n applications, the analysis needs to be performed for 2ⁿ use-cases. Additionally, such an analysis cannot deal with dynamic scenarios such as runtime-changing standards and the addition of new applications.
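A minimal affinity-pinning sketch using the standard Linux API follows; the core IDs are illustrative (on the Odroid-XU3, for example, cores 0–3 and 4–7 belong to different clusters).

    // Pin the calling thread to an explicit set of cores via its affinity mask.
    // Core IDs are illustrative; the cluster layout differs across platforms.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>

    int main() {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(4, &set);   // e.g., two cores of the "big" cluster (assumed IDs)
        CPU_SET(5, &set);
        // pid 0 = calling thread; a mapper would target application thread IDs.
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            std::perror("sched_setaffinity");
            return 1;
        }
        std::printf("thread pinned to cores {4, 5}\n");
        return 0;
    }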

Figure 2.4: Illustration of thread-to-core mapping under a power cap. (a) Executing a multithreaded program. (b) and (c) Executing four multithreaded applications concurrently. Reprinted from [5].

Runtime management is required to handle such dynamism, even if optimal mapping solutions cannot be found. The application mapping problem has been identified as one of the most urgent problems to be solved for implementing embedded systems [73, 74].

Fig. 2.4 (a) shows an example of thread-to-core mapping through thread packing (confining application threads onto a variable number of cores) and DVFS applied to a 64-core processor running a multi-threaded program [5]. The resource distribution is optimized by allocating 36 cores at maximum frequency while keeping the rest of the cores idle at minimum frequency. The allocation becomes much more complicated when multiple multi-threaded applications are executed concurrently. A straightforward approach is to partition the resources (cores and power budget) equally between the applications and apply an optimization within each partition. Fig. 2.4 (b) shows such an example with the execution of four programs, App1, App2, App3 and App4, optimized by allocating 16 cores at minimum frequency (or 16-Min), 12-Mid, 16-Max, and 8-Max, respectively. As the performance is locally optimized for each partition, this is not always guaranteed to be the globally optimal assignment. For example, consider a situation where App3 and App4 have larger power budgets than they need: App3 fully utilizes its partition at maximum frequency but still has surplus power (i.e., it is not power hungry), and App4's performance cannot be improved by adding extra cores (i.e., it is poorly scalable). On the other hand, App1 and App2 have the potential to increase their performance if they are allowed to consume more power. Fig. 2.4 (c) illustrates an example where the performance is improved over Fig. 2.4 (b) by allowing the surplus power of App3 and App4 to be consumed by App1 and App2, by increasing the number of cores assigned (App1) and by operating at a higher frequency (App2). The exploration space for such global optimization is multi-dimensional: the number of cores allocated to each program and its operating frequency can be varied under two global conditions, namely that the sum of the allocated cores must not exceed the available number of cores, and that the total power consumption must satisfy the power constraint.
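As a concrete illustration of this global optimization, the following Python sketch exhaustively searches per-application (cores, frequency) pairs under the two global constraints. The performance model and per-core power numbers are invented stand-ins, not taken from [5]; a real runtime manager would use measured characteristics and a cheaper search than full enumeration.

from itertools import product

FREQS = ["Min", "Mid", "Max"]
POWER = {"Min": 0.5, "Mid": 1.0, "Max": 2.0}   # assumed Watts per core
CORE_OPTIONS = [4, 8, 12, 16]

def perf(cores: int, freq: str, scalability: float) -> float:
    """Stand-in performance model: sub-linear in cores, scaled by V-f level."""
    return (cores ** scalability) * {"Min": 1.0, "Mid": 1.5, "Max": 2.0}[freq]

def allocate(apps, total_cores=64, power_cap=60.0):
    """apps: {name: scalability}. Exhaustive search (fine for four apps)."""
    best, best_perf = None, -1.0
    choices = list(product(CORE_OPTIONS, FREQS))
    for combo in product(choices, repeat=len(apps)):
        cores = sum(c for c, _ in combo)
        power = sum(c * POWER[f] for c, f in combo)
        if cores <= total_cores and power <= power_cap:
            total = sum(perf(c, f, s) for (c, f), s in zip(combo, apps.values()))
            if total > best_perf:
                best, best_perf = dict(zip(apps, combo)), total
    return best

# Poorly scalable apps (low exponent) get fewer cores; the rest absorb the slack.
print(allocate({"App1": 0.9, "App2": 0.8, "App3": 0.7, "App4": 0.4}))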

Mapping methodologies perform optimization at design time for static workload scenarios and at runtime for dynamic ones, and are accordingly classified as design-time and runtime methodologies; both target either homogeneous or heterogeneous multi-core systems. Runtime mapping requires a platform manager that handles the mapping of threads at runtime. In addition to mapping, the manager is also responsible for thread scheduling, resource control and thread migration at runtime [75].

Design-time mapping methodologies are suitable for static workload scenarios, where a predefined set of applications with known computation and communication behavior and a static platform are considered [76]. They are unable to handle dynamism incurred at runtime (e.g., in multimedia and networking applications), such as the addition of a new application to the system. Since applications are often added to the platform at runtime (for example, downloading a Java application to a mobile phone), workload variations take place, and runtime mapping methodologies have to handle such dynamic workloads. These methodologies face the challenge of mapping the tasks of new applications onto the platform resources so as to satisfy their performance requirements while keeping accurate knowledge of resource availability. After mapping threads, thread migration can also be used to revise the placement of already-executing applications if performance requirements change or a new application is added to the system.

2.2 Multi-Processor System-on-Chips

This section describes various architectural innovations that have been proposed over the years to address the ever-increasing performance and/or power demands of modern embedded Multi-Processor System-on-Chip (MPSoC) platforms.

2.2.1 Number of Processing Cores

The advancements in process technology have helped double the density of transistors, as predicted by Moore's law. According to the International Technology Roadmap for Semiconductors (ITRS), the number of cores has followed roughly the same exponential trend and is expected to continue growing, as shown in Fig. 2.5. Thus, as process technology evolves, integration of thousands of cores on the same chip, just like logic gates, becomes a reality, as predicted by sources such as Intel and Berkeley [32, 77]. As a result, almost all computing vendors, such as Intel and AMD, have been producing processors with an increasing number of cores, approximately doubling the number of cores per chip per year. These chips have revolutionized processor architecture and are widely called chip multiprocessors, multi-core chips, or many-core chips. The complete system employing such processors is commonly referred to as an MPSoC.

Figure 2.5: Evolution of the number of cores in embedded platforms. Extracted from [6].

2.2.2 Heterogeneity in Processing Cores

Multi-threading/parallel processing of applications has made it possible to realize the true benefits of having more processing cores in the multi-/many-core era. According to Amdahl's law [78], the speedup of an application through parallel execution on a multi-core processor is limited by the time needed to execute the sequential section of the application. As per Amdahl's law, the theoretical speedup (S) of the execution of the whole task is formulated in the following way:

S = \frac{1}{(1 - p) + \frac{p}{s}} \qquad (2.6)

where s is the speedup of the part of the task that benefits from improved system resources and p is the proportion of execution time that this part originally occupied. Fig. 2.6 shows the speedup obtained with different numbers of processors at various levels of parallelization. It is clear that if 5% of the application cannot be parallelized (95% parallelized), the maximum speedup that can be achieved is 20×, and it cannot be improved further even by increasing the number of processors. However, the speedup can be increased by accelerating the execution of the non-parallelized part (5%), which can be facilitated by executing the sequential part on a powerful processor with better sequential performance.
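A short numerical check of Eq. (2.6), assuming the parallel part scales perfectly with the processor count (s equal to the number of processors):

def amdahl_speedup(p: float, s: float) -> float:
    """Theoretical speedup with parallel fraction p sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

for n_proc in (8, 64, 1024, 10**6):
    print(n_proc, round(amdahl_speedup(p=0.95, s=n_proc), 2))
# Output approaches 1 / (1 - 0.95) = 20x as the processor count grows.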

Figure 2.6: Speedup vs. number of processors for different portions of parallel code in an application, using Amdahl's law [7].

This can be achieved through heterogeneous multi-core processors, where various portions of the application can exploit the distinct features of different types of cores. Thus, heterogeneous MPSoCs have become promising alternatives for achieving significant performance gains over their homogeneous counterparts. Further, heterogeneous MPSoCs may contain general purpose processors (GPPs) for flexibility, custom accelerators for compute-intensive tasks, reconfigurable hardware blocks, and specialized processors like digital signal processors (DSPs) for signal processing tasks. Such a system architecture can help reduce power consumption and improve performance simultaneously. It also provides an opportunity to meet the high-performance demands of complex applications and to migrate tasks onto the appropriate type of core as applications go through different phases of execution.

2.2.3 Types of Multi-core Architecture

Implementation of processors using multi-core architectures has proved suitable for most application domains [64]. The multi-core architecture offers a number of advantages over single-core architectures, which are listed below [79]:

1. High transistor density offered by future process technologies can be well utilized by having more cores while keeping the complexity of the cores the same.

Figure 2.7: Homogeneous multi-core architecture of a system in which 14 software applications are allocated by a single host operating system to the cores, reprinted from [8].

2. Small cores with less complex architecture can be optimized extensively.

3. Computational performance can be improved with the number of cores.

4. Some cores can be switched off/on depending on the requirements.

5. Faulty cores can be discarded, making the multi-core concept fault tolerant, and multiple cores can be configured in parallel to improve performance.

6. An individual clock domain per core is viable, and partial dynamic reconfiguration is possible on a per-core basis for reconfigurable cores.

Several multi-core chips have been developed, targeting different application domains [80]. These multi-core platforms are viable alternatives for high-performance computing.

2.2.3.1 Homogeneous Multi-core Architectures

Homogeneous multi-core architectures contain identical cores, which have the same power consumption and performance if process variations are neglected. Unlike heterogeneous multi-core architectures, the VLSI implementation and programming of homogeneous architectures are generally simple, because all processing cores share the same instruction set architecture (ISA), micro-architecture, memory hierarchy and Input/Output (I/O) facilities. Typically, each core has private I-cache, D-cache and L2-cache (in some implementations there is also a shared last-level L3-cache). The memory subsystem is connected to I/O and external components through a system bus or other interconnect. The uniform architecture of homogeneous multi-cores makes the implementation of system software and application development simple by supporting a single Operating System (OS) that spans all the cores. A typical homogeneous quad-core architecture executing 14 software applications using a single operating system is shown in Fig. 2.7 (extracted from [8]).

Academic researchers have proposed numerous homogeneous MPSoCs; some are listed here as examples. A 16-core Raw Architecture Workstation (RAW) processor was proposed by a research group at the Massachusetts Institute of Technology [81]. The University of California at Davis presented a 167-core Asynchronous Array of Simple Processors (AsAP) [82]. The University of Texas at Austin proposed the Tera-op, Reliable, Intelligently adaptive Processing System (TRIPS) with 32 chips, each containing two cores [83]. A scalable MPSoC for next-generation architectures, based on RISC processors and distributed memories, has been proposed in [84]. A WaveScalar processor has been proposed by the University of Washington, which contains approximately 2K simple processing elements (PEs) arranged into 16 clusters [85]. The common goal of all these chips is to offer high performance for complex applications.

In addition to the academic multi-core chips, commercial chip/IP vendors like Intel, AMD, Arm, and Tilera Corporation have also released homogeneous MPSoCs [33, 86]. The homogeneous MPSoC proposed by Intel in [33] contains 80 cores, each having two floating-point units, connected by an interconnection network. The interconnection network is arranged as a 10×8 2D array in a 275 mm² area that contains the 80 cores and packet-switched routers, operating at 4 GHz. Continuing this trend, Intel has also announced the Core i3, Core i5, and Core i7 families of multi-core processors for desktop and mobile platforms [80]. Tilera Corporation announced the TILE-Gx100, the world's first 100-core general purpose processor, offering the highest performance of any microprocessor at the time of its announcement, and simplified many-core programming through its Multicore Development Environment (MDE). HP also announced an MPSoC platform containing multiple MIPS cores [87]. This trend has influenced other chip vendors to produce multi-core platforms targeting different application domains such as scientific, embedded and general purpose computing [80].

2.2.3.2 Heterogeneous Multi-core Architectures

Heterogeneous MPSoCs contain processing cores of different types, with diverse power-performance tradeoffs. The processing elements can be GPPs, processing cores running at different DVFS levels, specialized processors like DSPs, Graphics Processing Units (GPUs), FPGA fabrics, etc. Fig. 2.8 shows a hypothetical hardware and software architecture of a heterogeneous MPSoC running four different operating systems and 14 applications [8].

Figure 2.8: A heterogeneous multi-core architecture, running four different operating systems and 14 applications, reprinted from [8].

In contrast to the architecture presented in Fig. 2.7, the heterogeneous MPSoC shown in Fig. 2.8 has different types of cores with varying ISA and micro-architecture, and simultaneously runs OSs that are well suited to particular types of processing cores. By carefully exploiting the distinct features of the different types of processing cores for appropriate applications, performance and power consumption can be substantially improved with such an architecture. These advantages have been noticed by academic researchers, who have proposed various heterogeneous MPSoC architectures [88–92].

The processing cores of heterogeneous multi-core architectures are usually designed with varying sizes and complexities to complement each other in terms of performance and energy efficiency. A typical system contains smaller cores for processing simple tasks in an energy efficient way and larger cores for handling compute-intensive tasks with higher performance at the cost of increased power consumption. Further, these heterogeneous cores can share the same Instruction Set Architecture (single-ISA), e.g., the ARM big.LITTLE architecture, or have different ISAs (functional heterogeneity), e.g., ARM big.LITTLE with a Mali GPU [28]. Single-ISA heterogeneous multi-core processors are typically composed of small (e.g., in-order) power-efficient cores and big (e.g., out-of-order) high-performance cores. Various complex MPSoCs have been proposed; for example, Nollet et al. [88] presented an MPSoC with four different types of processing cores: GPPs, DSPs, specialized accelerators, and reconfigurable hardware blocks. Similarly, Arpinen et al. [91] proposed an MPSoC consisting of multiple Altera Nios II soft-core processors and custom hardware accelerators connected by a communication network called Heterogeneous IP Block Interconnection (HIBI). These systems provide high performance for a fixed set of applications.


Figure 2.9: Representing an embedded system as three stacked layers: application, operating system and hardware.

These architectures also support runtime monitoring of various performance statistics through a Performance Monitoring Unit (PMU). For example, the ARM Cortex-A15 includes logic to gather various statistics on the operation of the processor and memory system during runtime, based on the PMUv2 architecture [93]. The PMU usually contains multiple counters (registers), called performance monitoring counters (PMCs). Each PMC can be configured to monitor a specific event (e.g., instructions retired, L1-cache misses, etc.). For example, the ARM Cortex-A15 [93] and Cortex-A7 [94] processors in the Samsung Exynos 5422 [28] have seven and five 32-bit counters (including the cycle counter), respectively. These events provide useful information about the behavior of the processor that can be used for debugging, profiling, and making appropriate runtime management decisions (mapping and DVFS). These platforms also support real-time measurement of power and temperature, which can be used in the design and validation of runtime management systems.
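As an illustration of user-space PMC access, the sketch below samples generic events with the Linux perf tool rather than programming the PMU registers directly; the event set and the derived metric are examples, and the events actually available depend on the processor.

import subprocess

def sample_pmcs(command: list[str], events: list[str]) -> dict[str, int]:
    """Run `command` under `perf stat` and return raw event counts."""
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(events)] + command,
        capture_output=True, text=True,
    )
    counts = {}
    for line in result.stderr.splitlines():      # -x sends CSV to stderr
        fields = line.split(",")
        if len(fields) > 2 and fields[0].strip().isdigit():
            counts[fields[2]] = int(fields[0])   # value, unit, event name, ...
    return counts

counts = sample_pmcs(["sleep", "1"], ["instructions", "cache-misses", "cycles"])
print(counts)
if counts.get("instructions"):
    # a simple memory-intensity proxy, similar in spirit to the metrics above
    print("cache-misses per instruction:",
          counts["cache-misses"] / counts["instructions"])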

2.3 Approaches to Runtime Energy Management

A runtime manager (RTM), in general, is system software that makes intelligent runtime decisions to manage various metrics of the system such as energy, performance, and reliability. Fig. 2.9 shows a three-layer view of a typical embedded system. The topmost layer is the application layer, which is composed of active applications executing sequentially or concurrently. The middle layer is the operating system layer (e.g., Linux, iOS, Android), which coordinates an application's execution on the hardware. Finally, at the bottom, there is the hardware layer, consisting of multi-core processors with various sensors (e.g., power, temperature) and a performance monitoring unit (PMU). All three layers interact with each other to execute an application, as indicated by the arrows in the figure. The RTM is represented by a box inside the operating system layer; it adapts to varying performance requirements and workload through DVFS, core scaling and/or thread-to-core mapping. The runtime manager can be implemented in user space or kernel space. Various runtime energy management (RTEM) systems have been proposed to improve energy efficiency and meet performance constraints.

Energy minimization in embedded systems is usually achieved through offline optimization, or online optimization (sometimes supported by offline analysis). Based on the underlying hardware architecture, state-of-the-art approaches to runtime energy management from the literature can broadly be classified as follows:

1. Approaches for homogeneous multi-cores

2. Approaches for heterogeneous multi-cores

In addition to the approaches proposed by academic researchers, the Linux kernel offers a set of governors for power management, which are widely used in both homogeneous and heterogeneous mobile and desktop platforms. Therefore, the default Linux power governors are discussed first, followed by a thorough review of existing approaches.

2.3.1 Linux Power Governors

Linux cpufreq offers a set of default power governors for controlling the CPU frequency. Fig. 2.10 shows the high-level cpufreq infrastructure, taken from [9]. cpufreq is the subsystem of the Linux kernel that allows the frequency to be explicitly set on processors [95] and offers a modularized set of interfaces to manage CPU frequency changes. It primarily contains the cpufreq module, CPU-specific drivers, and in-kernel governors. The cpufreq module provides a common interface between the various low-level, CPU-specific frequency control technologies and the high-level CPU frequency controlling policies. cpufreq decouples the CPU frequency controlling mechanisms and policies, allowing the two to be developed independently. It also provides standard interfaces to the user, through which the user can choose the policy governor and set parameters for that particular governor.


Figure 2.10: cpufreq infrastructure, a subsystem of the Linux kernel. Extracted from [9].

CPU-specific drivers implement different CPU frequency changing technologies, such as Intel® SpeedStep® Technology, Enhanced Intel® SpeedStep® Technology [6], AMD PowerNow!™, and Intel Pentium® 4 processor clock modulation. On a given platform, one or more frequency modulation technologies can be supported, and a proper driver must be loaded for the platform to perform efficient frequency changes. The cpufreq infrastructure allows the use of one CPU-specific driver per platform. Some of these low-level drivers also depend on Advanced Configuration and Power Interface (ACPI) methods to get information from the Basic Input/Output System (BIOS) about the CPU and the frequencies it can support. The cpufreq infrastructure allows for frequency-changing policy governors, which can change the CPU frequency based on different criteria, such as CPU usage. It can also show the governors available on the system and allows the user to select a governor to manage the frequency of each independent CPU. The recent Linux kernel 3.10.96+ comes bundled with six different governors. The user interface to cpufreq is through sysfs. cpufreq provides the flexibility to manage CPUs at a per-processor level (as long as the hardware supports management at that level). The interface for each CPU is under sysfs, typically at /sys/devices/system/cpu/cpuX/cpufreq, where X ranges from 0 through N-1, with N being the total number of logical CPUs in the system.

Performance governor sets the CPU statically to the highest frequency within the user-specified range between ‘scaling_min_freq’ and ‘scaling_max_freq’.

Powersave governor locks the CPU frequency at the lowest frequency within the user-specified range between 'scaling_min_freq' and 'scaling_max_freq'.

Userspace governor allows the user, or any userspace program running with 'root' privilege, to set the CPU to a specific frequency via the sysfs file 'scaling_setspeed' available in the CPU-device directory. All userspace dynamic CPU frequency governors use this governor as their proxy.
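For example, a user-space runtime manager could pin a core's frequency through this interface roughly as follows (a hedged sketch; it requires root privileges and a kernel built with the userspace governor):

from pathlib import Path

def set_frequency(cpu: int, khz: int) -> None:
    """Select the userspace governor and pin the given CPU to `khz`."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    (base / "scaling_governor").write_text("userspace")
    (base / "scaling_setspeed").write_text(str(khz))

def read_frequency(cpu: int) -> int:
    """Read back the current frequency in kHz."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq")
    return int((base / "scaling_cur_freq").read_text())

set_frequency(cpu=0, khz=1_200_000)   # pin CPU0 to 1.2 GHz
print(read_frequency(0))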

Ondemand governor sets the CPU frequency depending on the current system load. Load estimation is triggered by the scheduler through the update_util_data→func hook; when triggered, cpufreq checks the CPU-usage statistics over the last period, and the governor sets the CPU frequency accordingly. The CPU must have the capability to switch frequency very quickly.

Conservative governor is much like the ondemand governor, setting the CPU frequency depending on the current usage. It differs in behavior by gracefully increasing and decreasing the CPU speed rather than jumping to the maximum speed the moment there is any load on the CPU. This behavior is more useful in a battery-powered environment.

Interactive governor is designed for latency-sensitive and interactive workloads. The CPU speed is set depending on usage, just like ‘ondemand’, but the governor is more aggressive about scaling the CPU speed up in response to CPU-intensive activity. In- stead of sampling the CPU at a specified rate, the interactive governor checks whether to scale the CPU frequency up soon after coming out of idle state by configuring a timer within 1-2 ticks.

The frequency set by the above power governors can be overridden by the thermal daemon if the platform temperature rises above a threshold. For example, on the Odroid-XU3, if the platform temperature goes above 95°C (the default threshold), the operating frequency is scaled down to 900 MHz until the temperature falls below the threshold.

2.3.2 RTM of Homogeneous Multi-cores

There are several works on offline optimization that achieve performance-energy trade-off points for homogeneous multi-cores by employing DVFS and/or task mapping [4, 41, 96, 97]. In the offline DVFS approach, pre-characterised workloads of a given application are used during runtime to adjust the power control levers at regular intervals to minimize energy. In [41], the design and implementation of a compiler-directed Dynamic Voltage Scaling (DVS) algorithm are presented for reducing the energy usage of memory-bound applications. The algorithm identifies program regions with significant memory stalls, where the CPU can be slowed down with a negligible performance penalty (0% to 5%). Choi et al. [4] presented a DVFS technique for MPEG decoding that reduces energy consumption while maintaining a Quality of Service (QoS) constraint. It predicts and characterizes the workload of an incoming frame using a frame-based history, and scales the processor voltage and frequency to provide exactly the amount of computing power needed to decode the frame. A similar practical voltage scaling technique for mobile devices primarily running multimedia applications is proposed in [96].

This approach extends a real-time scheduling algorithm by changing the CPU speed based on the allocated and actual cycles of a task.

Considering that the workload prediction techniques used for video processing applications cannot be applied to interactive games, Gu et al. [97] proposed a control-theory-based DVS algorithm for interactive 3D game applications running on battery-powered portable devices. It predicts the workload of the frame being processed using a Proportional-Integral-Derivative (PID) predictor and sets the voltage-frequency accordingly. In [67], an accurate and scalable method for determining the optimal system operating points (number of threads and DVFS settings) for parallel workloads is proposed. The operating point is computed considering a set of objective functions (minimize delay or energy-delay product) and constraints (peak power or delay) to optimize for energy efficiency. This approach estimates the optimal system settings as a function of workload characteristics using Multinomial Logistic Regression (MLR) models, and applies L1-regularization to automatically determine the workload metrics relevant for energy optimization. Wildermann et al. [98] presented an analysis and a realistic model that relies on only limited local communication for the negotiation of cores to find an optimal decentralized core allocation for many-core systems. This approach mainly focuses on maximizing the average speedup of running applications without considering other general metrics such as energy and performance constraints.

Energy, power, and performance characteristics change dramatically as a function of workload. Moreover, a diverse set of workloads (e.g., scientific computing, financial applications, and media processing) makes workload characterization using offline techniques inefficient. In such cases, online optimization techniques have proved to be more effective [20, 35, 47, 99, 100]. Online optimization caters for dynamic workload scenarios, optimizing energy consumption while respecting timing constraints. For online optimization, either all the processing is performed at runtime, or the optimization is supported by offline-analysed results. Depending on the control mechanism, an online approach can be reactive or proactive. In the reactive approach, V-f scaling is controlled based on the history of CPU workloads: V-f is increased/decreased when the CPU workload is higher/lower than a predefined value [9]. In the proactive approach, predicted workloads are used to manage the hardware power control levers [99, 101], with continuous adjustment through feedback from the hardware performance monitors.
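The two control styles can be contrasted with a schematic sketch: the reactive controller reacts to the observed utilization of the last period, whereas the proactive controller selects the level matching a workload prediction. The thresholds, frequency levels and the simple averaging predictor are illustrative assumptions only.

FREQS = [600, 1000, 1400, 1800]           # MHz; illustrative DVFS levels

def reactive_step(cur_idx: int, utilization: float,
                  up: float = 0.8, down: float = 0.3) -> int:
    """Raise V-f when the observed load is high, lower it when load is low."""
    if utilization > up and cur_idx < len(FREQS) - 1:
        return cur_idx + 1
    if utilization < down and cur_idx > 0:
        return cur_idx - 1
    return cur_idx

def proactive_step(history: list[float]) -> int:
    """Pick the V-f level matching the *predicted* next-interval workload."""
    predicted = sum(history[-3:]) / min(len(history), 3)   # moving average
    target = predicted * FREQS[-1]                         # needed capacity
    return min(range(len(FREQS)), key=lambda i: abs(FREQS[i] - target))

print(reactive_step(cur_idx=1, utilization=0.9))   # -> 2 (scale up)
print(proactive_step([0.35, 0.4, 0.45]))           # -> 0 (600 is nearest to 0.4*1800)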

Jejurikar et al. [99] proposed an energy efficient algorithm to compute task slowdown factors based on the contribution of processor leakage and the standby energy consumption of the resources in a system. The presented algorithm has two phases: (1) computing the critical speed for each task; and (2) increasing the task slowdown factors if the task set is not feasible. The critical speed of a task is set by computing the energy consumption of the task at all possible discrete slowdown factors and choosing the factor that minimizes the task energy. In [100], a technique for Continuous Frequency Adjustment (CFA) of various functional blocks in the system at very low granularity is proposed to minimize energy while meeting a performance constraint. The basic idea of CFA is to eliminate the power and delay costs incurred by power-mode transitions involving clock generators (e.g., phase-locked loops). The frequency is adjusted based on the predicted workload of a task, which is formulated as an Initial Value Problem (IVP) [102]. Choi et al. [103] presented a similar approach for task workload classification and V-f control with feedback from performance counters. In this DVFS technique, the workload of a task is dynamically decomposed into on-chip and off-chip parts using an embedded hardware unit in the processor (the PMU). It also presents an accurate and efficient calculation of the ratio of on-chip computation time to off-chip access time, using information about data cache misses and CPU stall cycles due to data dependencies.

Weissel et al. [47] presented an energy-aware scheduling policy for non-real-time operating systems using PMCs. Based on the information extracted from the PMCs, the scheduler determines the appropriate frequency for each thread running in a time-sharing environment. A recurrent analysis of the thread-specific energy and performance profile allows the frequency to be adjusted to changes in application workload with little loss in performance. The authors of [48] present an accurate runtime prediction of execution time and a corresponding DVFS technique based on memory resource utilization. This approach improves the accuracy of runtime predictions by using task execution times and cache miss rates as feedback; cache miss rates indicate the frequency of memory accesses and allow the latencies introduced by these operations to be derived. A similar approach, a hardware-specific implementation of the stall-based model, is proposed in [49]. It uses the processor performance model to scale the frequency, expressed as a change (in cycles) of the main memory latency.

Siyu et al. [104] and Shen et al. [105] proposed online approaches using machine learning algorithms for selecting the optimal V-f level required for an application in the presence of performance variations due to application-generated CPU workloads. However, such approaches cannot adapt to variations in application performance metrics, such as the frame rate for video decoders or the page loading rate for browsers. Moreover, existing machine learning approaches [104, 105] use a single runtime formulation of VFS for a given performance requirement, which cannot cater to intra- and inter-application workload variations. Therefore, to adapt to such changes, energy minimization approaches [3, 106] using Reinforcement Learning (RL) to suitably control V-f for a given performance requirement have been proposed. Furthermore, to reduce the relearning time and Q-table size for new workloads/applications, learning transfer has been proposed as an extension of the RL algorithm [20]. These approaches perform well for unknown applications executed at runtime, but lead to inefficient results because optimization decisions need to be taken quickly and offline analysis results are not used.

Further, they are agnostic of concurrent workload variations and thus fail to adapt when multiple applications execute concurrently.

There has also been a focus on online optimization facilitated by offline analysis results [5, 40, 107, 108]. Such approaches lead to better results than purely online optimization, as they take advantage of both offline and online computation. In [40], thread-to-core mapping and DVFS are performed to maximize performance within a power budget. To capture the workload dependence of the performance-power Pareto frontier, an MLR classifier is built offline using performance counter, temperature, and power characterization data. During runtime, based on sampled PMC and temperature data, the classifier selects the operating point that maximizes performance within the power budget. Sasaki et al. [5] presented a similar runtime coordinated power-performance management system, called C-3PO, which optimizes the performance of many-core processors under a power constraint by controlling two software knobs: thread packing and DVFS. This approach distributes the power budget among programs by controlling the number of cores and the operating frequency assigned to each program's threads. The power budget is distributed carefully, in the form of allocated cores or operating frequency depending on the power-performance characteristics of the workload, so that each program can effectively convert power into performance. However, these two approaches do not consider concurrent workload variations when setting the V-f level, and they are not suitable for cluster-based DVFS multi-cores (where a group of cores is set to the same V-f level).

Quan and Pimentel [108] proposed a scenario-based runtime mapping approach targeting homogeneous MPSoCs, in which mappings derived from design-time design space exploration (DSE) are stored for runtime mapping decisions. It uses a clustering method to group similar inter-application scenarios, storing only the single best-performing mapping for each cluster of workloads. Additionally, this approach performs online mapping optimization by continuously monitoring the system and customizing mappings to further improve system performance.

The techniques proposed in [109–111] determine bottlenecks during application execution using performance counters. They present a framework for direct, automatic profiling of the power consumption of non-interactive, parallel scientific applications to assist the scheduler. Rountree et al. [112] perform a critical path analysis to determine which tasks may be slowed down to achieve energy savings while minimizing the performance loss in parallel execution. This analysis appears beneficial only when applications have computation or communication imbalances among participating processes, which is typically not the case for highly efficient parallel applications [113].

The schemes presented in [114, 115] determine the communication phases in which to apply DVFS. A technique that applies both DVFS and over-clocking to CPUs to save energy and improve execution time is discussed in [116]. Marathe et al. [117] proposed a runtime system, Conductor, that dynamically distributes the available power to different compute nodes and cores based on the available slack to improve performance.

Figure 2.11: Normalized big-core CPI stacks (right axis) and small-core slowdown (left axis). Benchmarks are sorted by their small-versus-big core slowdown. Reprinted from [10].

Conductor performs both upscaling and downscaling of the processor frequency to decrease execution time and to save energy indirectly through power clamping, which differs from the traditional approach of only downscaling to save energy. The authors of [118] proposed a latency-aware DVFS algorithm to avoid aggressive power-state transitions. They argue that overly frequent DVFS changes are not only unprofitable but also detrimental to performance, due to the extra time and energy costs introduced. This approach divides each application into phases through profiling and uses this information to decide whether changing the V-f setting is beneficial during application execution.

Sundriyal et al. [119] present an energy management approach relying on the Intel Running Average Power Limit (RAPL), which provides a standard interface for measuring and limiting processor and memory power. This approach uses Memory Accesses Per Micro-operation (MAPM) and MIPS (millions of instructions per second) metrics to indicate how a change of frequency affects performance. A procedure to select power capping thresholds dynamically on the Xeon Phi platform is given in [113]. The authors noticed that the default power capping limits employed on this platform are much higher than the majority of applications would reach. Considering this, different power limits are defined according to workload characteristics and application performance.

2.3.3 RTM of Heterogeneous Multi-cores

Heterogeneous multi-cores can enable higher performance and reduced energy consumption by executing workloads on the most appropriate core type. Fig. 2.11 compares the slowdown of SPEC CPU2006 [120] workloads on a small core relative to a big core (left y-axis) with the normalized Cycles Per Instruction (CPI) stack [121] on a big core (right y-axis). The normalized CPI stack indicates whether a workload is memory-intensive or compute-intensive: if the normalized CPI stack is memory dominant, the workload is memory-intensive (e.g., mcf); otherwise, it is compute-intensive (e.g., astar). The figure groups workloads into three categories on the x-axis: workloads with a reasonable slowdown (<1.75×) on the small core (type-I), workloads with a significant slowdown (>2.25×) on the small core (type-III), and the remaining workloads, labeled type-II. Making correct scheduling decisions in the presence of type-I and type-III workloads is most critical: an incorrect decision, i.e., executing a type-III workload on a small core instead of a type-I workload, leads to poor overall performance. Therefore, mapping a workload onto an appropriate core type is essential to improve performance and energy efficiency in heterogeneous multi-cores.
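A minimal sketch of this classification is given below. The 1.75× and 2.25× slowdown bands follow the text, but the mapping from the memory fraction of the CPI stack to an estimated small-core slowdown is an invented linear model, not the characterization from [10].

def estimated_small_core_slowdown(mem_cpi_fraction: float,
                                  max_slowdown: float = 3.0) -> float:
    """Assume compute-dominated work slows down most on a small core."""
    return 1.0 + (max_slowdown - 1.0) * (1.0 - mem_cpi_fraction)

def classify(mem_cpi_fraction: float) -> str:
    """Apply the 1.75x / 2.25x bands from Fig. 2.11 to the estimate."""
    slowdown = estimated_small_core_slowdown(mem_cpi_fraction)
    if slowdown < 1.75:
        return "type-I: schedule on small core"
    if slowdown > 2.25:
        return "type-III: keep on big core"
    return "type-II"

print(classify(0.8))   # memory-dominant (mcf-like)  -> type-I
print(classify(0.1))   # compute-dominant (astar-like) -> type-III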

Goraczko et al. [43] presented a resource model that improves the accuracy of existing models by considering the time and energy costs of runtime mode switching. For a given application, the software partitioning problem (assigning parts of an application to each processor to achieve maximum system lifetime without sacrificing application performance) is formulated as an Integer Linear Programming (ILP) problem. The approach presented in [107] generates multiple mappings for each application, trading off resource requirements against throughput. It contains an offline phase, which computes template mappings for each scenario sequence by applying a multi-step task-assignment approach, independently of the actual scenario. At runtime, a manager observes mode changes and chooses an appropriate precomputed template mapping. The evaluation considers only power consumption and does not report any performance values. In a similar direction, [122] proposes a domain-specific hybrid task mapping (HTM) algorithm for heterogeneous multimedia MPSoCs that relies heavily on offline analysis results to capture dynamic behavior both between and within applications. The above two approaches have been validated through simulation.

In [123], an adaptive energy minimization approach for heterogeneous embedded systems is presented, based on regression-based learning of the energy/performance tradeoffs between the different computing resources in the system. Using this model, an application task is mapped onto the computing resource that minimizes energy consumption while meeting the application's performance requirements. The approach also performs DVFS to adapt to performance and workload variations.

Craeynest et al. [10] proposed a Performance Impact Estimation (PIE) technique to map workloads onto the appropriate core type in a heterogeneous multi-core processor. The mapping is done by estimating the expected performance on each core type for a given workload. PIE collects CPI stack, Memory-Level Parallelism (MLP) and Instruction-Level Parallelism (ILP) profile information during runtime on any one core type, and estimates the performance if the workload were to run on another core type (its impact on overall performance). Dynamic PIE scheduling collects profile information on a per-interval basis and dynamically adjusts the workload-to-core mapping, thereby exploiting time-varying execution behavior. Ma et al. [51] proposed a comprehensive analytical model to predict the power efficiency of thread-level parallelism (scale-out) and of thread migration between heterogeneous cores (scale-up). The model is composed of a performance model and a power model. The performance model is built from two orthogonal functions: scale-out speedup from multi-threading and scale-up speedup from core heterogeneity. The power model estimates the power consumption of the corresponding scale-out and scale-up configurations, simultaneously capturing the power variations caused by thread synchronization and thread migration between heterogeneous cores.

Aalsaud et al. [50] presented a runtime optimization approach to maximize power-normalized performance considering dynamic variation of workloads and application scenarios. It involves extensive offline analysis to identify the tradeoffs between inter-application concurrency, performance and power consumption under different thread-to-core mappings and frequencies. Then, using experimental data collected on an Odroid-XU3 heterogeneous platform with a number of PARSEC benchmark applications, a model is presented for power-normalized performance, i.e., Instructions Per Second (IPS)/Watt, underpinned by analytical power and performance models derived through multivariate linear regression.

In a similar direction, Gupta et al. [124] presented an approach, called DyPO, to find Pareto-optimal configurations at runtime as a function of the workload. This approach involves application instrumentation to identify workload snippets that form different workload phases; offline characterisation data collection to build classifiers that map each workload snippet to an optimal configuration; and runtime selection, feeding the classifiers with runtime statistics to find the optimal configuration. In [52], Sozzo et al. proposed a runtime resource management policy for ARM's big.LITTLE architecture running a single parallel application at a time. This policy exploits DVFS and dynamic thread management to minimize power consumption under performance constraints by balancing thread execution and adapting to workload variations.

Donyanavard et al. [19] presented SPARTA, a throughput-aware runtime task allocation approach for heterogeneous many-core platforms to achieve energy efficiency. It uses classification-based prediction to model tasks' runtime performance/power behavior (instructions-per-Joule) across different core types. Based on the predicted performance/power, it maps tasks onto the cores that maximize energy efficiency. Becchi et al. [125] presented a dynamic policy that maps running tasks to processor cores to maximize performance by accurately exploiting the available heterogeneity of the system. To share resources equally, the policy observes the runtime behavior of the running threads and migrates threads between cores in a round-robin fashion or based on estimated Instructions Per Cycle (IPC).

Table 2.1: Comparison of various run-time management approaches for homogeneous (homo.) and heterogeneous (heter.) multi-core architectures.

Chen et al. [126] presented a framework to leverage the fundamental relationship between inherent program characteristics and the corresponding resource demands for scheduling an application. This framework addresses three aspects of a program scheduler in the heterogeneous multi-processing context: understanding the physical configurations of the core supply, estimating the resource demands of the running applications, and identifying the program-to-core matching for a given criterion. The method identifies program-to-core matching with Weighted Euclidean Distances (WED), by projecting core configurations and the program's resource demands into a unified multi-dimensional space. In [127], Koufaty et al. proposed bias scheduling for heterogeneous systems with cores that have different microarchitectures and performance. By dynamically monitoring application bias (the difference observed when executing on cores of a different type), the operating system matches threads to the core type that can maximize system throughput. Bias scheduling influences the existing scheduler to select the core type that best suits the application when performing load balancing operations. Moreover, this technique does not require sampling of CPI on all core types or offline profiling.
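A minimal sketch of WED-style program-to-core matching follows, with invented dimensions, weights and core vectors (the actual feature space used in [126] is richer):

import math

def wed(demand: list[float], core: list[float], weights: list[float]) -> float:
    """Weighted Euclidean distance between a demand and a core vector."""
    return math.sqrt(sum(w * (d - c) ** 2
                         for w, d, c in zip(weights, demand, core)))

# dimensions: [compute intensity, memory intensity, ILP]
CORES = {
    "big":    [0.9, 0.5, 0.9],
    "LITTLE": [0.3, 0.6, 0.3],
}
WEIGHTS = [1.0, 0.5, 1.0]

def match(demand: list[float]) -> str:
    """Pick the core whose capability vector is nearest to the demand."""
    return min(CORES, key=lambda name: wed(demand, CORES[name], WEIGHTS))

print(match([0.8, 0.2, 0.7]))   # compute-heavy program -> "big"
print(match([0.2, 0.8, 0.2]))   # memory-heavy program  -> "LITTLE"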

To present the features of recently reported approaches that are closely related to the investigations presented in this thesis, Table 2.1 shows a summary and comparison of approaches that consider run-time management for multi-core architectures. Column I gives the references corresponding to the approaches, and Column II presents the architecture type considered (Homogeneous (Homo.) or Heterogeneous (Heter.)), while Columns III and IV show the mapping and thread distribution strategies for multi-threaded applications (mapping entirely onto one core type or simultaneously allocating multiple types of cores), respectively. Column V indicates whether an approach adapts to workload variations by periodically adjusting the DVFS settings. The kind of technique (offline, online, or hybrid (online + offline)) is presented in Column VI. Details of the application execution scenario (single: executing one application at a time, or concurrent: running multiple applications simultaneously) are shown in Column VII. Information about workload classification of concurrently executing applications is given in Column VIII, while Column IX shows whether an approach adapts to application arrival or completion at runtime in a concurrent execution scenario. Finally, Column X presents the applications used for evaluating each approach.

From the table, approaches targeting homogeneous architectures have mostly considered single-application scenarios [3, 4, 20, 35, 41, 47–49, 67, 96–100, 104–106, 115]. The works that consider the concurrent execution scenario have the following drawbacks. For example, [98] concentrates only on thread-to-core mapping to maximize average speedup, and does not consider workload variations when adjusting the DVFS settings, nor other important metrics such as performance constraints and energy consumption. Approaches [40] and [5] try to meet a power budget while maximizing performance by increasing/decreasing frequencies and core counts. These approaches also do not adapt to concurrent workload variations and application interference. As a result, such approaches miss the opportunity to minimize energy consumption and improve performance.

The aforementioned works that consider multi-threaded applications executing on heterogeneous multi-cores map the threads of an application entirely onto one type of core situated within a cluster [10][51][52][19]. This reduces thread-to-core mapping complexity, but misses the benefit of distributing an application's threads across multiple types of cores at a time. In a similar direction, approaches [125–127] used workload memory intensity as an indicator to guide thread-to-core mapping. For example, on a platform containing two types of cores (big and LITTLE), such proposals map memory-intensive workloads onto a LITTLE core and compute-intensive workloads onto a big core. Unlike [10, 51], the techniques proposed in [19, 50, 52] exploit DVFS, but they have several drawbacks. For example, in [52], the design space is explored for a single application, and it grows exponentially if concurrent applications have to be considered. In [19], each application is executed as single-threaded and uses only one type of core at a time.

In [50, 124], the threads of an application are mapped onto more than one type of core, but the approach depends heavily on offline regression analysis of performance and energy for all possible thread-to-core mappings and V-f settings, which is not scalable. Additionally, in [50], the V-f setting is not adjusted during execution, which would be beneficial for adapting to workload variations. The approach presented in [124] needs extensive analysis and instrumentation of each application into basic blocks for collecting the PMC data during runtime, and does not consider application performance constraints. The above two approaches do not perform adaptive mapping at application arrival/exit, and thus are not efficient when a new, unknown application arrives or an existing application finishes. The following chapters focus on addressing the above problems for both homogeneous and heterogeneous multi-cores.

The table also shows the applications used for evaluation in different works. To facilitate comparison with existing works, the following benchmarks, which are widely used for experimental validation or for training models, are selected: MiBench, SPEC CPU2006, PARSEC, ALPBench, SPLASH-2, RoyLongbottom, LMBench, Rodinia, and NAS Parallel. The details of these benchmarks are given in Section 2.4.2.

2.4 Experimental Platforms, Applications and Tools

This section presents the details of hardware platforms and applications used to conduct experiments in the following chapters. Further, tools that are used for analysing the characterisation data and/or implementing the proposed approaches are detailed.

2.4.1 Experimental Platforms

For experimental analysis and validation of the proposed approaches, three different platforms have been considered. They have different characteristics in terms of ISA, microarchitecture, platform DVFS type, and number of processing cores. For example, the Odroid-XU3 [28] has a state-of-the-art mobile SoC with the popular ARM big.LITTLE heterogeneous processing architecture, which is prevalent in tablet/smartphone devices. In contrast, the Intel Xeon E5-2630V2 and Xeon Phi 7120P are homogeneous multi-core platforms. All platforms offer a wide variety of platform configurations, such as core enabling/disabling and multiple DVFS settings per cluster, and support real-time sampling of power sensors and hardware PMCs. Furthermore, different types of DVFS architecture are supported: the Odroid-XU3 has cluster-wide DVFS, the Xeon E5-2630V2 supports per-core DVFS, and the Xeon Phi 7120P is based on system-wide DVFS. In cluster-/system-wide DVFS platforms, the whole cluster or system (a group of cores) is set to operate at a single DVFS level, whereas on per-core DVFS platforms, individual cores can run at different DVFS settings at the same time. Further details of each platform are given below.

2.4.1.1 Odroid-XU3

Modern heterogeneous architectures contain different types of cores in varying numbers. One such architecture is considered for this work: the 28nm Samsung Exynos 5422 System-on-Chip (SoC) [128] hosted on the Odroid-XU3 board [28]. It is based on ARM's big.LITTLE heterogeneous architecture and contains two clusters, named big and LITTLE [129], each offering a different tradeoff between power and performance. Figure 2.12 shows the Odroid-XU3 board, emphasizing the Exynos 5422 SoC. In addition, the chip contains a Mali-T628 GPU and 2 GB of LPDDR3 DRAM. The big and LITTLE clusters contain a high-performance Cortex-A15 quad-core processor and a low-power Cortex-A7 quad-core processor, respectively. The four Cortex-A15 cores each have 32 KB instruction and data caches and share a 2 MB L2-cache. The four Cortex-A7 cores each have 32 KB instruction and data caches and share a 512 KB L2-cache.

The board also contains four real-time current/voltage sensors that facilitate measurement of power consumption (static and dynamic) on four separate power domains: big (A15) cores, LITTLE (A7) cores, GPU and DRAM. The Odroid-XU3 board can run different flavors of Linux. It also supports core disabling and DVFS, helping to optimize system operation in terms of performance and energy consumption. DVFS can be used to change V-f levels at per-cluster granularity. For each cluster's power domain, the supply voltage and clock frequency can be adjusted to pre-set pairs of values. The Cortex-A15 cluster supports frequencies between 200 MHz and 2000 MHz with a 100 MHz step, whereas the Cortex-A7 cluster can adjust its frequency between 200 MHz and 1400 MHz with a 100 MHz step. The device firmware automatically adjusts the voltage for a selected frequency. The availability of multiple heterogeneous processing elements and accurate power measurement sensors makes this an ideal platform for validating runtime energy management techniques.
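For instance, the energy of a power domain over an execution window can be obtained by periodically sampling its power sensor and integrating. The sketch below assumes a hypothetical sysfs node for the A15-domain INA231 sensor; the actual path depends on the kernel build and must be located on the board.

import time

SENSOR = "/sys/bus/i2c/drivers/INA231/sensor_W"   # placeholder path

def measure_energy(duration_s: float, period_s: float = 0.1) -> float:
    """Riemann-sum energy (Joules) from periodic power samples (Watts)."""
    energy, end = 0.0, time.time() + duration_s
    while time.time() < end:
        with open(SENSOR) as f:
            power_w = float(f.read())
        energy += power_w * period_s
        time.sleep(period_s)
    return energy

print(f"A15 cluster energy: {measure_energy(5.0):.2f} J")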


Figure 2.12: Odroid-XU3 board (top) containing the Samsung Exynos 5422 heterogeneous MPSoC (bottom). Adapted from [11].

2.4.1.2 Intel Xeon E5-2630V2

The Intel Xeon E5-2630V2 is a 64-bit multi-core enterprise processor built on 22 nm process technology. This processor is based on the high-performance Ivy Bridge (IVB) microarchitecture. The block diagram of the two-socket Xeon E5-2630V2 system is shown in Fig. 2.13. It has a bi-directional ring bus that connects all of the components (cores, L3 cache, memory controller, QuickPath Interconnect (QPI) and Peripheral Component Interconnect Express (PCIe) controller). Each socket contains six cores, and the platform has a total of 12 physical cores, or 24 logical cores with hyper-threading enabled. It has a three-level cache hierarchy with 32 KB of L1 (I/D), 256 KB of L2 and 15 MB of L3, and 32 GB of main memory running at 1600 MHz. The processor supports per-core DVFS with 15 levels ranging from 1.2 GHz to 2.6 GHz in 100 MHz steps.


Figure 2.13: The block diagram of the Intel Xeon E5-2630V2. Redrawn from [12].

Figure 2.14: The Intel Xeon Phi chip architecture layout with ring-based interconnect. Taken from [13].

2.4.1.3 Intel Xeon Phi 7120P

The Intel Xeon Phi 7120P is based on Intel's Many Integrated Core (MIC) architecture, containing 61 cores connected by a high-performance on-die bidirectional interconnect. The coprocessor is connected to an Intel Xeon processor – the "host" – via the PCIe bus. Fig. 2.14 depicts the architecture layout of the coprocessor. The processor core is an in-order architecture (scalar unit) based on the Intel Pentium processor family. Each core can fetch and decode instructions from four hardware threads, i.e., each core supports four-way hyper-threading. There are eight memory controllers supporting up to 16 GDDR5 channels, providing a theoretical aggregate bandwidth of 352 GB/s. Each core has a 32 KB L1 instruction cache and a 32 KB L1 data cache, both 8-way associative with a 64-byte cache line. The L2 cache is a unified cache that is inclusive of the L1 data and instruction caches. Each core contributes 512 KB of L2 to the total global shared L2 cache storage. If no cores share any data or code, the effective total L2 size of the chip is up to 31 MB. On the other hand, if every core shares the same code and data in perfect synchronization, the effective total L2 size of the chip is only 512 KB. The actual size of the workload-perceived L2 storage is therefore a function of the degree of code and data sharing among cores and threads. As with the L1 cache, associativity is 8-way with a 64-byte cache line.

Details about the system startup and the network configuration can be found in [130] and in the documentation accompanying the Intel Manycore Platform Software Stack (Intel MPSS) [131]. Fig. 2.14 also shows the interface to the 16 GB GDDR5 main memory and the globally distributed tag directories (TD), which ensure cache coherency. All cores share a common voltage-frequency island, and the frequency can be varied from 619 MHz to 1238 MHz in nine steps, with the corresponding voltage ranging from 0.995 V to 1.060 V. To measure the energy consumption of the cores and main memory, the coprocessor provides the read-only model-specific registers (MSRs) MSR_PP0_ENERGY_STATUS and MSR_DRAM_ENERGY_STATUS. These MSRs are updated every 1 ms with a wraparound time of around 60 seconds when power consumption is high, and may be longer otherwise.
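To make the measurement mechanism concrete, the sketch below shows how such an energy MSR can be sampled from user space through Linux's msr driver. It is a minimal illustration only: it assumes the msr module is loaded, and the register offset used (0x639, the usual RAPL MSR_PP0_ENERGY_STATUS address on mainstream Intel parts) and the raw-count-to-joules conversion are platform-specific assumptions that would need checking against the coprocessor's documentation.

/* Sketch: sampling a core-energy MSR via the Linux msr driver.
 * Assumes /dev/cpu/0/msr exists (modprobe msr) and that the MSR
 * offset 0x639 (MSR_PP0_ENERGY_STATUS on mainstream Intel parts)
 * applies to the target platform -- an assumption, not a given. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_PP0_ENERGY_STATUS 0x639  /* assumed register offset */

static int read_msr(int fd, uint32_t reg, uint64_t *val)
{
    /* The msr driver maps the register number to the file offset. */
    return pread(fd, val, sizeof(*val), reg) == sizeof(*val) ? 0 : -1;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open msr"); return 1; }

    uint64_t e0 = 0, e1 = 0;
    read_msr(fd, MSR_PP0_ENERGY_STATUS, &e0);
    sleep(1);
    read_msr(fd, MSR_PP0_ENERGY_STATUS, &e1);

    /* The counter is 32 bits wide and wraps; the joules-per-count
     * energy unit must be read from the platform's power-unit MSR. */
    uint32_t delta = (uint32_t)e1 - (uint32_t)e0;
    printf("raw energy counts over 1 s: %u\n", delta);
    close(fd);
    return 0;
}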

2.4.2 Applications

To understand the performance of the underlying hardware platform, in terms of execution time and power, under different workloads, a diverse set of applications is essential. Therefore, a set of benchmarks has been considered in the following chapters, and their details are given in this section.

MiBench [132] consists of six suites (35 single-threaded applications), each representing a specific area of the embedded domain. The six categories are automotive and industrial control, consumer devices, office automation, networking, security, and telecommunications. Each suite has a varying number of applications, with small and large input data sets representing embedded and real-world applications, respectively.

ALPBench [133] is composed of five complex multi-threaded multimedia applications, namely speech recognition (CMU Sphinx 3), face recognition (CSU), ray tracing (Tachyon), MPEG-2 encoding and MPEG-2 decoding (MSSG). It uses a traditional POSIX threads (pthreads) implementation and supports changing the thread count for each application at compile time.

SPEC CPU2006 [120] is an industry-standardized suite with a wide range of applications and diverse workload profiles [49,134], stressing a system's processor, memory subsystem and compiler. SPEC designed this suite to provide a comparative measure of compute-intensive performance across the widest practical range of hardware, using workloads developed from real user applications. It has twelve integer benchmarks: perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk; and eighteen floating-point benchmarks: bwaves, gamess, milc, zeusmp, gromacs, cactusADM, leslie3d, namd, dealII, soplex, povray, calculix, GemsFDTD, tonto, lbm, wrf, sphinx3, specrand.

PARSEC [15] is a conventional RMS (Recognition, Mining, and Synthesis) benchmark suite representative of the emerging large-scale multi-threaded commercial programs envisioned by Intel a decade ago. It is a collection of 13 applications from diverse areas such as video encoding, computer vision, financial analytics, image processing, and animation physics. All the applications are parallelised, supporting multi-threading on MPSoC platforms, with the number of threads configurable through a command-line argument. It targets emerging workloads that have become essential applications over the years, and it covers applications from several domains, including server and desktop applications. The suite is primarily intended for research and can easily be used for performance measurements.

SPLASH-2 [18] is a suite of eleven parallel applications released to facilitate the study of centralized and distributed shared-address-space multiprocessors. The suite is composed of multi-threaded applications from the high-performance scientific computing, signal processing and graphics domains; it is based on the original SPLASH-2 benchmark suite (1995), with changes that make it compatible with modern practices for creating and handling parallelism. SPLASH-2 has been the most commonly used suite for research in parallel programming on shared-memory systems.

The RoyLongbottom [135] benchmark collection is a free set of programs that measure the performance and reliability of various components of a computing system, such as CPUs, caches, memory, disks, and graphics.

LMBench [136] is a suite of simple, portable ANSI C microbenchmarks for UNIX/POSIX systems. In general, it measures two key features: latency and bandwidth. LMBench is intended to give system developers insight into the basic costs of key operations.

The Rodinia [16] benchmark suite was developed for heterogeneous architectures and accelerators by the University of Virginia. The benchmarks are inspired by the Berkeley dwarfs taxonomy, featuring application kernels that have been parallelised for multi-core CPU and GPU platforms. The suite has diverse workload profiles, encompassing a range of parallel communication patterns, synchronisation techniques, and power consumption patterns.

The NAS Parallel Benchmarks (NPB) [17] are a small set of applications designed to help evaluate high-performance computing systems; they support commonly used programming models such as MPI and OpenMP. The first version has five kernels and three pseudo-applications derived from computational fluid dynamics (CFD) applications.

Figure 2.15: Perfmon2 and pfmon, sitting at the kernel and userspace level, respectively. Reprinted from [14].

The suite has been extended to include new benchmarks for unstructured adaptive mesh, parallel I/O, multi-zone applications, and computational grids.

The benchmarks above provide applications from different domains (e.g., embedded, multimedia, high-performance computing), different implementations (single-/multi-threaded), and varying synchronisation overheads with distinct data partitions and sharing patterns. Moreover, benchmarks such as RoyLongbottom and LMBench are more suitable for platform characterisation, stressing various parts of the CPU. MiBench and ALPBench cover applications from the embedded and multimedia domains, while PARSEC and SPLASH-2 represent emerging applications with multi-threaded implementations and relatively long execution times. Rodinia and NPB are mostly used for high-performance computing research and are suitable for platforms with more cores (e.g., the Intel Xeon Phi).

2.4.3 Tools for Monitoring and Analysis of Performance Events

To make runtime decisions efficiently, application characteristics such as cache misses and CPU cycles have to be monitored during application execution or at design time. This necessitates a standard interface for accessing the Performance Monitoring Unit (PMU) on the processor. Therefore, the perfmon2 tool [14] is used for monitoring various performance events at design time and runtime. Furthermore, the WEKA workbench [137] is used to analyse the collected performance events, to understand their relationship with performance and energy, and to build models.

The Perfmon2 tool [14] is a software component proposed as the standard Linux kernel interface to the PMU on popular processors, including ARM, AMD and Intel x86, Sun SPARC, MIPS, IBM Power, and Intel Itanium. It is designed to give full access to a given PMU and all the corresponding hardware performance counters. PMU hardware implementations typically use different numbers of registers, counters of different widths, and possibly other unique features; although processors have different PMU implementations, they typically provide configuration registers and data registers. Perfmon2 offers a uniform abstract model of these registers and exports read/write operations accordingly. The libpfm library and pfmon complement perfmon2 by enabling user applications to access (read/write) the perfmon2-exported registers. Fig. 2.15 shows perfmon2 and pfmon as part of kernel space and userspace, respectively. This tool is used to measure the metrics that are essential to classify the benchmarks. As experimentally shown in Section 3.1, various metrics contribute to variations in application workload. However, only the metrics with significant influence can be monitored, as hardware platforms support a limited set of counters (only seven, including the cycle counter, on the Odroid-XU3, and four on the Intel Xeon E5-2630V2). The important metrics are branch misses, L1 D-/I-cache misses, L2 cache misses, instructions retired, active CPU cycles, per-core CPU utilisation, memory reads per instruction, number of active cores, core frequency and, in the case of NUMA-based platforms, remote memory accesses.
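For illustration, the sketch below counts two of these events (instructions retired and last-level cache read misses) over one sampling window. It deliberately uses the perf_event_open interface that later became the mainline Linux standard, rather than the perfmon2/pfmon stack used in this thesis, so it is an analogue of the measurement loop rather than the actual tooling; the 200 ms window and the choice of core 0 are illustrative.

/* Sketch: periodic PMC sampling with Linux perf_event_open, shown
 * as an analogue of what perfmon2/pfmon provide. Counts retired
 * instructions and LLC read misses for everything on one core. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int open_counter(uint32_t type, uint64_t config, int cpu)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    /* pid = -1, cpu = cpu: count all tasks on that core (needs
     * sufficient privileges). */
    return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

int main(void)
{
    int instr = open_counter(PERF_TYPE_HARDWARE,
                             PERF_COUNT_HW_INSTRUCTIONS, 0);
    int llc = open_counter(PERF_TYPE_HW_CACHE,
                           PERF_COUNT_HW_CACHE_LL |
                           (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                           (PERF_COUNT_HW_CACHE_RESULT_MISS << 16), 0);
    if (instr < 0 || llc < 0) { perror("perf_event_open"); return 1; }

    ioctl(instr, PERF_EVENT_IOC_RESET, 0);
    ioctl(llc, PERF_EVENT_IOC_RESET, 0);
    ioctl(instr, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(llc, PERF_EVENT_IOC_ENABLE, 0);

    usleep(200000);                      /* one 200 ms sample window */

    uint64_t n_instr = 0, n_miss = 0;
    read(instr, &n_instr, sizeof(n_instr));
    read(llc, &n_miss, sizeof(n_miss));
    printf("instructions=%llu llc-read-misses=%llu\n",
           (unsigned long long)n_instr, (unsigned long long)n_miss);
    return 0;
}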

The WEKA Workbench [137] is a collection of machine learning algorithms and data preprocessing tools, including implementations of the algorithms described in [138]. WEKA was developed at the University of Waikato in New Zealand; the name stands for the Waikato Environment for Knowledge Analysis. It is designed so that users can quickly try out existing methods on new datasets in flexible ways. It provides extensive support for the whole process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the results of learning. As well as a wide variety of learning algorithms, it includes a wide range of preprocessing tools, all accessed through a common interface so that users can compare different methods and identify those most appropriate for the problem at hand. This tool is used for exploring various methods for building performance models and selecting the one with low overhead and good accuracy.

2.5 Discussion

The increased demand for power-efficiency and performance has led to powerful MPSoC platforms with a higher number of processing cores and heterogeneity across the cores. These platforms provide a range of operating points, such as DVFS levels and active core configurations, that can be tuned at runtime to offer various power-performance trade-offs. These configurations have to be managed appropriately to be able to meet application performance constraints and/or energy efficiency. These platforms often execute multiple applications, with different performance requirements, concurrently.

These applications usually contend for hardware resources (e.g., processing cores and memory bandwidth) to meet their performance requirements. Unlike in a single-application scenario, the concurrent execution of applications generates varying and mixed workloads due to shared resources. Therefore, the runtime management system plays a vital role in energy management under performance and resource constraints.

This chapter has presented the necessary background to appreciate current trends and the contributions made in the following chapters. This includes covering low-power techniques, different MPSoC architectures, current trends in runtime energy management (summarized in Table 2.1), and finally the hardware platforms, benchmark applications, and tools used for experimental analysis and validation.

State-of-the-art approaches to runtime energy management, considering homogeneous and/or heterogeneous MPSoC architectures, have been critically analysed to identify the areas where contributions can be made. These approaches usually control DVFS and/or thread-to-core mapping through offline, online, or hybrid methods to achieve energy efficiency. The offline approaches typically use computationally intensive design space exploration (DSE) methods to find an optimal or near-optimal configuration. Conversely, online/dynamic methods must not be computationally expensive, and therefore use heuristics to find a good solution. Given their larger exploration budget, offline methods naturally perform better than online approaches in finding DVFS settings and mappings. To address the shortcomings of pure offline and online methods, hybrid approaches have been widely used. The remainder of the thesis investigates various hybrid approaches that can cope with dynamic workloads, different application execution scenarios (single, concurrent, and dynamic arrival/exit of applications), and different hardware architectures.

As shown in Table 2.1, the existing approaches for homogeneous and heterogeneous MPSoCs do not efficiently exploit the DVFS potential of the cores, as they are unaware of concurrent workload variations (Research Question 1). Moreover, these approaches are not adaptable to diverse DVFS architectures (e.g., cluster-wide, per-core) or to the dynamic arrival and exit of applications. In the case of heterogeneous multi-cores, the benefit of heterogeneity is not fully utilised; for example, simultaneously mapping an application onto more than one type of core may reduce the energy consumption (Research Question 2). This motivates the development of a runtime management system that manages concurrent execution scenarios energy-efficiently under performance and workload variations (Chapter 4). Furthermore, as new applications are continuously being developed, and smartphone users may install/uninstall different applications, a runtime management approach that is independent of offline application profiling is necessary (Research Question 3). This is investigated in Chapter 5. In the following chapters, the above issues are addressed first for homogeneous multi-core platforms, and then the focus shifts to heterogeneous multi-core platforms.

Chapter 3

Runtime Management of DVFS for Homogeneous Multi-cores

Chapter 2 has outlined that modern embedded systems need to deal with multiple applications concurrently and that the existing approaches are not efficient in handling such scenarios, as experimentally demonstrated in the following sections. Due to limited resource availability, when multiple applications run at the same time, they compete for resources (e.g., processing cores, memory bandwidth, etc.). Moreover, each application may have different performance requirements and exhibit workload variations during runtime. Therefore, the runtime energy management system should proactively select the Voltage-frequency (V-f) setting to adapt to the varying workload. To accomplish that, this thesis postulates that the runtime manager should be aware of the relation between V-f, workload, and performance requirements. Such a runtime manager requires an efficient way of identifying the kind of workload present on the processor and of mapping that workload to an appropriate DVFS setting at runtime. As part of this, classification of application workload based on execution time and Performance Monitoring Counter (PMC) data for various application execution scenarios is presented first, in Section 3.1. Then, based on this workload classification, an online energy management technique for concurrent execution of applications (single-threaded (Section 3.2) and multi-threaded (Section 3.3)) is discussed. To evaluate the approaches on different platforms and determine how they perform, homogeneous multi-core platforms with various execution scenarios are initially considered for the experimental analysis and validation.

3.1 Workload Classification

The application workload on the processor varies depending upon the dynamic instruction mix of an application (percentage of branches, loads, stores, etc.) [134]. As the application executes, the proportion of various instructions may vary, and thereby the underlying hardware will be exercised differently. For example, some parts of an application may use the branch predictor more frequently than the L1 Instruction (I)-/Data (D)-cache. Similarly, if an application has a large portion of load/store instructions, it spends most of its execution time accessing the caches and/or memory to fetch data rather than on actual computation. Depending on the instruction mix, as mentioned in [49], applications can be broadly classified into three categories: (i) compute-intensive, (ii) memory-intensive, and (iii) intermediate or mixed. The execution time (performance) of each category scales differently with operating frequency, depending mainly on the proportion of arithmetic, logical and memory operations. As reported in [20], MPEG4 decoding at 24 SVGA frames/s exhibits up to 7x workload variation, measured in CPU cycles, due to the decoding of intra-coded (I) SVGA frames with higher computation, followed by a number of predictive-coded (P) frames with lower computation [139]. In summary, it has been observed that the workload on the CPU varies depending upon the type of application and the instruction mix during execution [49]. Furthermore, the workload is also influenced by the performance requirements of the application, such as the frame rate for video processing applications, the page loading rate for browsers, etc. [20].

It is well established that the V-f setting can be adjusted according to the CPU workload to achieve energy savings [41,50]. For example, Linux's ondemand power governor increases/decreases the V-f level when the CPU workload (utilisation) is higher/lower than a predefined threshold [9]. However, such existing approaches are not effective for managing concurrent workloads, as they were not developed considering concurrent execution scenarios, contention on shared resources, or the type of DVFS supported by the hardware platform (e.g., cluster-based DVFS, per-core DVFS, etc.). To understand and address this problem, the effect of concurrent execution on the CPU workload profile and on application performance has to be investigated. Further, it is essential to devise a metric for classifying the workload in such scenarios to make appropriate V-f decisions. In the era of MPSoCs employing an increasing number of processing cores, the workload classification should be simple, to minimize overheads while providing the information necessary for runtime decisions. Therefore, an investigation of online workload classification is presented in the following sections.

3.1.1 Methodology

Application workload variations can be captured through hardware PMCs or through the change in execution time with frequency. Fig. 3.1 shows the proposed methodology for classifying the workload based on statistical analysis of PMC data and execution time. It involves the following: the Odroid-XU3 hardware platform [28]; collection of execution time and PMC data for different execution scenarios; and statistical analysis of the collected data to classify the workload.


Figure 3.1: Experimental methodology used for workload classification.

The application execution scenarios include sequential (individual) and concurrent execution of applications on the chosen hardware platform, the Odroid-XU3. This platform has been selected as it has a state-of-the-art mobile SoC with the popular ARM big.LITTLE heterogeneous processing architecture, which is prevalent in tablet/smartphone devices. Moreover, as discussed in Section 2.4.1.1, it offers a wide variety of platform configurations, such as core enabling/disabling and multiple DVFS settings per cluster, and supports real-time sampling of power sensors and hardware PMCs. To understand the relationship between execution time and frequency for different application execution scenarios, the following experiments have been conducted. The lbm, astar and mcf applications from SPEC CPU2006 [120] are considered, as these applications have significantly different instruction mixes: lbm and mcf have more memory operations than arithmetic operations, whereas astar has more arithmetic operations than memory operations. Each application is run on an ARM Cortex-A15 core and on a Cortex-A7 core of the Odroid-XU3 separately, while sweeping the frequency from 200 MHz to 2.00 GHz on the Cortex-A15 core and from 200 MHz to 1.40 GHz on the Cortex-A7 core, in 100 MHz steps. The execution time of an application at every frequency is measured by repeating the experiment ten times. Finally, for each frequency, the average execution time in seconds is computed, neglecting the execution time of the first run to minimize the variation caused by inconsistent processor states, as recommended by ARM [140].

Similarly, the workload type also influences runtime information, i.e., hardware PMC data. Most modern processors support hardware PMCs, which can be sampled at runtime for various purposes, such as application profiling, debugging, and runtime energy/performance management. For example, ARM Cortex-A15 and Cortex-A7 cores support 66 and 42 performance events (e.g., cycle count, L1 data cache misses, etc.), respectively. However, only seven PMCs on the Cortex-A15 and five PMCs on the Cortex-A7 (including the cycle count) can be sampled at a time without using time-division multiplexing. Although collecting all events may give more information with which to measure the workload on the CPU accurately, this would have a significant impact on application performance and energy consumption due to runtime overheads [141]. Therefore, performance events with little mutual correlation are selected using principal component analysis (PCA) [142] and variance-based ranking (VBR) [143], giving enough information to classify the workload and select an appropriate V-f setting.

Further, to identify the workload classes (e.g., memory-intensive, compute-intensive, etc.), k-means clustering is used, owing to its computational simplicity and efficiency [144,146]. The fundamental concept of cluster analysis is to form groups of similar objects as a means of distinguishing them from each other, and it can be applied in any discipline involving multivariate data [145]. Cluster analysis is primarily used for unsupervised learning, where class labels for the training data are not available [147]. Given a data set X = {x_i}, i = 1, ..., n to be clustered into a set of k clusters, the k-means algorithm iterates to minimize the squared error between the empirical mean of a cluster and the individual data points, defined as the cost function (J),

J(\theta, u) = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij} (x_i - \theta_j)^2 \qquad (3.1)

where \theta_j is the cluster centre and u_{ij} = 1 if x_i lies closest to \theta_j, and 0 otherwise [148]. Although k-means is a simple and efficient clustering technique, it is sensitive to outliers, and its efficiency and number of iterations depend largely on the initial seeds. A bad choice of initial centroids can have a great impact on performance, and in some cases the algorithm may converge to a local minimum. To surmount these issues, k-means is augmented with a simple seeding technique: initial centroids are chosen with probability proportional to their overall contribution to the cost. After defining the initial centroids, the data vectors are assigned a cluster label depending on how close they are to each centroid. The k centroids are then recalculated from the newly defined clusters, and the process of reassigning each data vector to the nearest new centroid is repeated. The algorithm iterates over this loop until the data vectors from the dataset X form stable clusters and the cost function J is minimized [146].
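A compact sketch of this clustering step is given below, on one-dimensional MRPI-like samples for brevity. The data values, k = 3 and the iteration bound are illustrative; the seeding loop picks each new centroid with probability proportional to its squared distance from the nearest centroid already chosen, which is one concrete reading of the proportional-contribution seeding described above.

/* Sketch: 1-D k-means with distance-proportional seeding, as an
 * illustration of the workload clustering step. Data, k and the
 * iteration bound are illustrative. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 12
#define K 3

int main(void)
{
    /* Illustrative MRPI-like samples from mixed workload types. */
    double x[N] = {0.002, 0.003, 0.0025, 0.018, 0.02, 0.019,
                   0.009, 0.01, 0.011, 0.021, 0.0028, 0.0095};
    double c[K];
    int label[N];

    /* Seeding: first centre uniformly at random; each further centre
     * with probability proportional to the squared distance from the
     * nearest centre chosen so far. */
    srand(1);
    c[0] = x[rand() % N];
    for (int j = 1; j < K; j++) {
        double d[N], sum = 0.0;
        for (int i = 0; i < N; i++) {
            d[i] = INFINITY;
            for (int m = 0; m < j; m++) {
                double t = (x[i] - c[m]) * (x[i] - c[m]);
                if (t < d[i]) d[i] = t;
            }
            sum += d[i];
        }
        double r = sum * rand() / RAND_MAX, acc = 0.0;
        c[j] = x[N - 1];
        for (int i = 0; i < N; i++) {
            acc += d[i];
            if (acc >= r) { c[j] = x[i]; break; }
        }
    }

    /* Standard assign/update iterations minimising the cost J. */
    for (int it = 0; it < 100; it++) {
        for (int i = 0; i < N; i++) {
            label[i] = 0;
            for (int j = 1; j < K; j++)
                if (fabs(x[i] - c[j]) < fabs(x[i] - c[label[i]]))
                    label[i] = j;
        }
        for (int j = 0; j < K; j++) {
            double s = 0.0; int n = 0;
            for (int i = 0; i < N; i++)
                if (label[i] == j) { s += x[i]; n++; }
            if (n) c[j] = s / n;
        }
    }
    for (int j = 0; j < K; j++) printf("centre %d: %.4f\n", j, c[j]);
    return 0;
}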

Figure 3.2: Variation in execution time with frequency for lbm, astar and mcf applications running on the Cortex-A15 core.

Figure 3.3: Variation in execution time with frequency for lbm, astar and mcf applications running on the Cortex-A7 core.

3.1.2 Results and Discussion

This section presents the experimental results and analysis of workload classification using execution time and PMC data.

3.1.2.1 Execution Time Based Workload Classification

Fig. 3.2 and 3.3 show the variation in execution time for the lbm, astar and mcf applications when run individually. The relative percentage increase in execution time at a particular frequency (f) is computed with respect to the execution time obtained at the maximum frequency (f_max), i.e., (Execution Time@f / Execution Time@f_max) x 100. It can be observed that the execution time of astar increases proportionally with decreasing operating frequency, whereas for lbm and mcf the variation is small down to a certain frequency (800 MHz on both the Cortex-A15 and Cortex-A7 cores). This is because of the high number of arithmetic and branch operations in astar compared to lbm and mcf, which have more memory (load and store) operations. Therefore, it can be concluded that lbm and mcf are memory-intensive applications and astar is a compute-intensive application. When the operating frequency is lower than the memory speed (933 MHz), the increase in execution time becomes prominent for all applications, including memory-intensive ones.

Figure 3.4: Effect of concurrent execution on application execution time.

Moreover, the effect of concurrent execution on the execution time of each individual application is also investigated. To do this, the application scenarios astar-lbm, astar-mcf, lbm-mcf and astar-lbm-mcf are considered for the analysis. Each application is allocated to a particular core in the A15 cluster. Then, the execution time of each application in all combinations is compared with its execution time when run individually. Our observations show a change in the execution time of each application when run concurrently with other applications. This change can be attributed to contention due to simultaneous access to the shared L2 cache/memory by each application when all concurrent applications start their execution. If the data of each application fits into the L1 cache of its allocated core, the change in execution time is insignificant; however, if an application accesses the shared L2 cache/memory frequently, there is a significant variation in execution time. As shown in Fig. 3.4, the execution time of each application increases (by up to 50%) due to contention on memory. The effect on application execution time becomes more prominent as the number of concurrent applications, especially memory-intensive ones, increases. This necessitates an efficient online classification of concurrent workloads to understand the above effects and select an appropriate V-f setting for achieving energy efficiency.


Figure 3.5: Correlation between different performance metrics and application performance, measured in IPC.

3.1.2.2 PMC Data Based Workload Classification

The aforementioned execution-time-based analysis helps in selecting memory- and compute-intensive applications and in understanding the variation in execution time when multiple applications execute concurrently. However, it is not effective for runtime characterisation of the workload and selection of an appropriate V-f level, as it requires executing the application at different frequencies, leading to significant runtime overheads.

To identify the most appropriate metric among the top performance events reported by PCA and VBR, Pearson's correlation coefficient (ranging between -1 and +1) is computed between the metrics derived from PMCs and instructions per cycle (IPC). Fig. 3.5 shows the average correlation between various performance metrics and IPC. Note that the negative correlations in the figure show that an increase in the selected metric results in decreased performance. The figure demonstrates that branch misprediction and cache miss events have a high correlation with IPC. Performance events such as branch mispredictions and L1 I-/D-cache misses involve in-core activity and thus do not benefit from frequency scaling. As a similar study [49] shows, a successful approach to DVFS is to account only for the stall CPU cycles introduced by off-chip non-overlapping misses, i.e., Last Level Cache (LLC) misses, which correspond to L2 cache misses on the Odroid-XU3. A DVFS approach can capitalise on the LLC misses by appropriately tuning the DVFS setting. As an example, L2 cache misses (LLC misses) and cycle count have been collected every 200 ms for two execution scenarios: executing individually and concurrently.

Figure 3.6: Cycle count for sequential and concurrent execution of astar, lbm and mcf. Samples are collected every 200 ms.

Figure 3.7: L2 cache misses for individual and concurrent execution of astar, lbm and mcf. Samples are collected every 200 ms.

Fig. 3.6 and Fig. 3.7 show the cycle count and L2 cache misses for the application scenarios astar, lbm, astar-lbm and lbm-mcf, which mix compute- and memory-intensive applications. It can be observed that the average number of L2 cache misses of astar is substantially smaller than that of lbm or of any concurrent scenario containing at least one memory-intensive application (e.g., astar-lbm). On the other hand, the average cycle count of lbm is significantly lower than that of astar and the other application scenarios. Moreover, for concurrent execution of memory- and compute-intensive applications, the cycle count is mainly influenced by the compute-intensive workload, while L2 cache read refills are mostly affected by the memory-intensive workload.

Figure 3.8: K-means clustering of workloads for concurrent execution of astar and lbm.

Fig. 3.8 shows the k-means clustering of workloads for parallel execution of astar and lbm. Each cluster in the figure represents a workload class. Even though astar and lbm are classified as compute- and memory-intensive respectively, based on the PMC data analysis, three or more workload classes can be formed by changing the number of clusters, as shown in the figure. This signifies that when two applications with distinct workload characteristics are executed concurrently, more than two workload classes (compute-intensive, memory-intensive, mixed, etc.) may be observed. This implies that if a multi-core platform supports fine-grained control of DVFS, each workload class can be assigned an appropriate DVFS setting, leading to lower power consumption with negligible performance loss. From the above discussion, it follows that the number of LLC misses, equivalent to the number of memory accesses, can measure the degree of memory-intensiveness of an application. During the execution of an application, workload variations have to be captured at regular intervals to select an appropriate DVFS level and improve energy efficiency. Therefore, to classify the workload of an application during its execution, the following two metrics are derived using LLC misses:

1. Memory Accesses Per Cycle (MAPC)

2. Memory Accesses Per Instruction (MAPI)


Figure 3.9: MRPI and MRPC of various applications, computed by executing each application to completion, together with the change in execution time when the frequency is scaled from 2 GHz to 1 GHz.

MAPC and MAPI are calculated by dividing the number of LLC misses by the cycle count and by the number of instructions retired, respectively. The Odroid-XU3 supports measuring only memory reads (not memory writes), reported as L2 cache read refills. Therefore, analogous to MAPC and MAPI, the following metrics are derived: Memory Reads Per Cycle (MRPC) and Memory Reads Per Instruction (MRPI). To show that MRPI and MRPC are appropriate metrics for workload classification, the following experiment is conducted over a set of applications from SPEC CPU2006. For each application, MRPI, MRPC and execution time are measured with the help of hardware PMCs at two different frequencies, 2 GHz and 1 GHz. As discussed before, the execution time of compute-intensive applications increases greatly compared to memory-intensive ones when the frequency is scaled down. This behavior can be seen in Fig. 3.9, where some applications show a nearly 100% increase in execution time when the frequency is halved from 2 GHz to 1 GHz. These workload characteristics are well captured by MRPI and MRPC: a highly compute-intensive workload on the processing core exhibits a low MRPI or MRPC, and vice versa. Even though both MRPI and MRPC help in determining the workload behavior at runtime, MRPI is selected over MRPC because it is frequency-agnostic: the number of cycles elapsed in an interval scales with the operating frequency, whereas the instruction count does not. This property allows MRPI to capture workload changes during execution even when the frequency varies, which is essential for selecting an appropriate DVFS setting.
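A minimal sketch of how the two metrics fall out of raw counter deltas over one sampling window is shown below; the structure and field names are illustrative, and the sample values are made up for the example.

/* Sketch: deriving MRPI and MRPC from raw counter deltas collected
 * over one sampling interval. Field names are illustrative. */
#include <stdio.h>
#include <stdint.h>

struct pmc_sample {
    uint64_t l2_read_refills;   /* memory reads (LLC read misses) */
    uint64_t instructions;      /* instructions retired           */
    uint64_t cycles;            /* active CPU cycles              */
};

static double mrpi(const struct pmc_sample *s)
{
    return s->instructions ?
        (double)s->l2_read_refills / (double)s->instructions : 0.0;
}

static double mrpc(const struct pmc_sample *s)
{
    return s->cycles ?
        (double)s->l2_read_refills / (double)s->cycles : 0.0;
}

int main(void)
{
    /* Illustrative deltas for one 200 ms window. */
    struct pmc_sample s = {420000, 180000000, 360000000};
    printf("MRPI = %.5f, MRPC = %.5f\n", mrpi(&s), mrpc(&s));
    return 0;   /* a low MRPI indicates a compute-intensive interval */
}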

Furthermore, as demonstrated in the following section, MRPI-based energy management approaches perform better than the commonly used CPU cycles, Instructions Per Cycle (IPC), or utilisation metrics [9,149,150]. Note that CPU cycles, utilisation and IPC are affected not only by memory reads but also by other statistics, such as lower-level cache misses and branch mispredictions, which involve only in-core operations; the penalty (measured in cycles) for these extra in-core operations remains the same regardless of frequency [151]. However, in the case of multi-threaded applications, the CPU workload is also affected by thread synchronisation contention. Section 3.3 therefore shows that, in such cases, metrics such as utilisation, which indicate in-core activity, are important for runtime power management.


Figure 3.10: Conceptual overview: Concurrent applications with associated workload classes and the need for optimal V-f selection.


3.2 Runtime Management of DVFS for Concurrent Execution of Single-threaded Applications

Fig. 3.10 presents a conceptual overview of multiple applications undergoing concurrent execution, having different workload classes, such as compute-intensive, memory-intensive and mixed [49,134], resulting from the varying ways in which they exercise the hardware. From a performance perspective, it is desirable to run a compute-intensive application at a higher clock frequency than a memory-intensive one. Hence, selecting the appropriate V-f for the associated workloads is key to achieving the desired energy-performance trade-off.

Energy efficiency and high performance have been key research foci of processor designers for multi-core platforms [34]. These systems are equipped with dynamic voltage and frequency scaling (DVFS), which enables on-the-fly linear reduction of frequency (f) and voltage (V), yielding a cubic reduction in dynamic power consumption (proportional to V^2 f). As discussed in Chapter 2, energy can only be saved if the power consumption is reduced enough to cover the extra time it takes to run the workload at a lower Voltage-frequency (V-f) setting. To reduce the complexity of hardware design and implementation, modern embedded multi-core architectures with a fixed number of cores are organized as clusters, where all cores in a cluster operate at the same V-f (cluster-wide DVFS) [152]. For example, popular hardware platforms such as the Odroid-XU3 [28], MediaTek Helio X20 [30] and Juno r2 [29] contain such an architecture.

Modern systems usually need to execute multiple applications concurrently, where each application exercises the hardware differently based on its instruction mix (e.g., 70% load/store and 30% arithmetic). Moreover, these applications share system resources, resulting in varying and mixed workloads (compute-intensive, memory-intensive, etc.). Managing such workload variations on multi-core architectures supporting cluster-wide DVFS is a challenging task, as each core may execute a different type of workload, and thus an appropriate V-f for the entire cluster has to be determined. From a performance perspective, it is desirable to run a compute-intensive application at a higher V-f than a memory-intensive one. Hence, selecting the appropriate V-f setting for the associated workload is key to achieving the desired energy-performance trade-off.

Figure 3.11: Variation in MRPI for individual (top) and concurrent (bottom) execution of applications.

In the case of concurrent execution, there is no longer a marked distinction between compute- and memory-intensive workloads, due to resource sharing (e.g., last-level cache and memory). Fig. 3.11 demonstrates the variation in workloads when multiple applications are run in two different configurations, individually (top) and concurrently (bottom), on the A15 cluster of the Odroid-XU3 platform. Here, three applications with different workloads from SPEC CPU2006 are considered, astar, lbm and mcf, along with their various combinations astar-lbm, astar-mcf, lbm-mcf and astar-lbm-mcf.

It can be observed from Fig. 3.11 that the different workload types (of the applications astar, lbm and mcf) can clearly be distinguished when run individually, whereas concurrent execution shows much greater workload variability and no such clear separation.

Moreover, the execution time of each application increases due to contention on memory, and this becomes more prominent as the number of concurrent applications, especially memory-intensive ones, increases. This necessitates an efficient online classification of concurrent workloads to understand the above effects and select an appropriate V-f setting for achieving energy efficiency.

A close examination of existing approaches [19,49,50,153] (detailed in Chapter 2) shows that they are not efficient for online energy management of concurrently executing applications on a cluster-based multi-core architecture. In order to overcome the limitations of existing approaches, this work makes the following contributions:

1. An approach for workload selection and prediction that takes memory contention into account for proactive control of the V-f setting.

2. Online concurrent workload classification using MRPI to efficiently select V-f settings based on the predicted workload.

3. A lightweight technique for identifying cores that are not executing any application (unused cores), to avoid selection of a high V-f value, as unused cores usually have low MRPI.

4. Implementation and validation of the proposed approach on a real hardware platform, the Odroid-XU3 [28].

3.2.1 Runtime Management of DVFS

The proposed approach addresses the following problem:

Given: a set of concurrent applications and a multi-core platform supporting cluster-wide DVFS.
Objective: optimize energy consumption while maximizing performance by selecting efficient V-f settings.

A multi-core system can be represented as three layers, as shown in Fig. 3.12 (left). These layers interact with each other to execute an application, as indicated by the arrows. The topmost layer is the application layer, which is composed of multiple applications having different workload profiles. The operating system layer in the middle (e.g., iOS, Linux, etc.) coordinates the execution of applications on the hardware (bottom layer), which consists of a multi-core processor. An overview of the proposed approach is shown in Fig. 3.12 (right), comprising the following stages:


Figure 3.12: Representing a multi-core system as three stacked layers: application, operating system and hardware (left), and overview of the proposed online energy minimization technique (right).

(A) workload selection and prediction (Section 3.2.1.1)

(B) workload classification and frequency selection (Section 3.2.1.4)

(C) performance evaluation and compensation (Section 3.2.1.6)

A detailed discussion of each stage is presented in the following sections, with pseudocode shown in Algorithm 1.

3.2.1.1 Workload Selection and Prediction

An appropriate V-f setting depends on the application workload. When applications execute individually, the V-f setting is mostly guided by a single workload [149]; for concurrent execution, it depends on multiple workloads. Therefore, the V-f setting at time t_i can be represented as

V\text{-}f_i = \begin{cases} f(w_i) & \text{individual workload} \\ f(w_{1i}, w_{2i}, \ldots, w_{Ni}) & \text{concurrent workloads} \end{cases} \qquad (3.2)

where 1 to N index the applications. It is important to note that, for concurrently executing applications on a cluster-based multi-core, the V-f of each cluster should be chosen such that all the applications meet their performance requirements. Furthermore, these applications generate varying and mixed workloads due to resource sharing (e.g., L2 cache and memory), showing intra- and inter-workload variations during their execution (see Fig. 3.11). Therefore, a representative V-f setting has to be chosen to guide the further stages, achieving energy efficiency without degrading application performance.

3.2.1.2 Workload Selection

Let us assume that there are N concurrently executing applications and that w_{ni} is the workload of application n for the time interval t_{i-1} -> t_i. There will then be N different workloads at every time interval of the execution. The workload is quantified by MRPI, where a low value represents a high load on the processing core and vice versa. Due to limited resource availability, concurrent execution of applications creates contention for shared resources, especially memory, impacting the performance of the individual applications, i.e., increasing their execution times.

The increase in execution time (latency) of each application is calculated at runtime based on the average memory-intensiveness of the running applications. We represent this latency as a δ amount of increase in the workload. If all the running applications are memory-intensive, the value of δ will be high due to increased memory traffic. The value of δ is used to scale the frequency down further, to minimize wasted cycles and switching activity. To achieve maximum performance for each application, the V-f setting for the time interval t_{i-1} -> t_i (considering cluster-wide DVFS) is driven by the workload

w_{target_i} = \min\{w_{1i}, w_{2i}, w_{3i}, w_{4i}, \ldots, w_{Ni}\} + \delta_i \qquad (3.3)
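A minimal sketch of this selection rule is shown below; the per-core MRPI values and δ are illustrative inputs that would come from the monitoring and contention-estimation stages.

/* Sketch of Equation 3.3: the cluster-wide target workload is the
 * minimum per-core MRPI (the most compute-intensive workload on the
 * cluster) plus the contention term delta. Values are illustrative. */
#include <stdio.h>

static double target_workload(const double *mrpi, int n, double delta)
{
    double w = mrpi[0];
    for (int i = 1; i < n; i++)
        if (mrpi[i] < w)
            w = mrpi[i];
    return w + delta;            /* w_target_i of Equation 3.3 */
}

int main(void)
{
    double mrpi[4] = {0.004, 0.021, 0.0007, 0.013};  /* one per core */
    printf("target workload: %.4f\n",
           target_workload(mrpi, 4, 0.0005));
    return 0;
}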

3.2.1.3 Workload Prediction

To adapt to workload variations and to achieve energy minimization, proactive control of V-f is of utmost importance. Therefore, the future workload (t_{i+1}) needs to be predicted at t_i to set the appropriate V-f value for the time interval t_i -> t_{i+1}. To accomplish this, an exponentially weighted moving average (EWMA) filter [101] is used to predict the workload p_{i+1} during the interval t_i -> t_{i+1} (line 12, Algorithm 1),

p_{i+1} = \gamma \times a_i + (1 - \gamma) \times p_i \qquad (3.4)

where γ ∈ [0, 1] is the smoothing factor, and p_i and a_i are the predicted and actual workloads, respectively, during the interval t_{i-1} -> t_i. Note that w_{target_i}, computed from Equation 3.3, represents the actual workload a_i.

Algorithm 1 Online energy minimization approach
Input: Ts, profile data (f-tab) and las
Output: Frequency
 1: PMU_initialize()
 2: fcur = cpufreq_get_frequency(core#)
 3: pw <- 0; Wc <- 0
 4: run for Ts at fmax and compute IPSmax
 5: while (1) do
 6:   if (Wc != las) then
 7:     wait for Ts
 8:     compute new IPS value (IPSn)
 9:     actual workload (a) = get_workload_pmcdata()
10:     prediction error (Pe) = a - p
11:     gamma = alpha x Pwc + beta
12:     EWMA(a, p, gamma, Pe) (Equations 3.4 & 3.5)
13:     compute memory latency (delta), classify workload, and get fnew from f-tab
14:     if (fnew != fcur) then
15:       cpufreq_set_frequency(core#, fnew)
16:       fcur <- fnew
17:       Wc <- Wc - 1
18:     else
19:       Wc <- Wc + 1
20:     end if
21:     perf_loss (lambda) <- ((IPSmax - IPSn)/IPSmax) x 100
22:     if (lambda > Delta%) then
23:       fnew <- fnew + lambda x fmax
24:     end if
25:   else
26:     wait for Ts x las
27:     Wc <- 0
28:   end if
29: end while
30: function get_workload_pmcdata()
31:   collect PMC data and set MRPI of unused cores to 10
32:   return the selected workload (Equation 3.3)
33: end function
34: PMU_terminate()

To minimize workload miss-predictions, the predicted workload of the interval t_{i-1} -> t_i is compared with the actual workload measured from the hardware PMCs. Subsequently, the computed prediction error P_e (the difference between the actual and predicted workloads) is used to improve the workload prediction for t_i -> t_{i+1}. The accuracy of the workload prediction depends strongly on γ, and a fixed value of γ would result in frequent miss-predictions when there are large workload variations.

Therefore, the value of γ is changed in proportion to P_{wc} (= P_e/p_i),

\gamma = \alpha \times P_{wc} + \beta \qquad (3.5)
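The following sketch implements Equations 3.4 and 3.5 together. The α and β values follow the ones chosen experimentally later in this chapter (0.3 and 0.6), the clamping of γ to [0, 1] is an added safeguard implied by the definition of γ rather than a step stated in the text, and the MRPI trace is made up for the example.

/* Sketch of Equations 3.4 and 3.5: EWMA workload prediction with a
 * smoothing factor adapted to the relative prediction error. */
#include <stdio.h>

#define ALPHA 0.3   /* values chosen experimentally in Section 3.2.2 */
#define BETA  0.6

static double predict_next(double actual, double predicted)
{
    double pe = actual - predicted;             /* prediction error  */
    double pwc = predicted != 0.0 ? pe / predicted : 0.0;
    double gamma = ALPHA * pwc + BETA;          /* Equation 3.5      */
    if (gamma < 0.0) gamma = 0.0;               /* clamp to [0, 1]:  */
    if (gamma > 1.0) gamma = 1.0;               /* added safeguard   */
    return gamma * actual + (1.0 - gamma) * predicted;  /* Eq. 3.4   */
}

int main(void)
{
    double p = 0.01;                            /* initial prediction */
    double trace[] = {0.012, 0.011, 0.02, 0.019, 0.005};
    for (int i = 0; i < 5; i++) {
        p = predict_next(trace[i], p);
        printf("interval %d: predicted MRPI %.4f\n", i + 1, p);
    }
    return 0;
}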

3.2.1.4 Workload Classification and Frequency Selection

Classification of the predicted workload is important for identifying an appropriate V-f setting that achieves energy savings without performance loss. For online workload classification, hardware PMCs are used to periodically collect performance statistics (L2 cache misses and instructions retired) during application execution. The modified performance monitoring tool perfmon [14] is used for accessing the PMCs, initialized and terminated through the PMU_initialize and PMU_terminate routines (lines 1 & 34, Algorithm 1).

To minimize the runtime overhead, workload types are predetermined through a custom program generating a varying number of memory accesses. The custom program copies a varying amount of data from one memory location to another, like Memcopy [154], after performing addition and multiplication operations on two large arrays. For each number of memory accesses, the variation in MRPI and execution time is recorded by sweeping the frequency from 0.2 GHz to 2 GHz on the A15 cluster of the Odroid-XU3. The offline profiling results contain MRPI ranges and the corresponding appropriate V-f settings, which are used at runtime to set the operating frequency to the desired value through the utility cpufreq-set (lines 13 & 15, Algorithm 1).
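A sketch of such a profiling microbenchmark is shown below; the array sizes, repetition count and command-line control of the copy volume are illustrative choices, not the actual program used in the thesis.

/* Sketch of the profiling microbenchmark described above: arithmetic
 * on two large arrays followed by a memcopy-like pass, with the copy
 * volume controlling how memory-intensive the run is. */
#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define N (8 * 1024 * 1024)

int main(int argc, char **argv)
{
    /* copy_bytes controls the number of memory accesses generated */
    size_t copy_bytes = argc > 1 ? strtoul(argv[1], NULL, 10) : N;
    if (copy_bytes > N) copy_bytes = N;

    static double a[N / 8], b[N / 8];
    static char src[N], dst[N];

    for (size_t i = 0; i < N / 8; i++) {        /* compute phase */
        a[i] = (double)i * 1.5;
        b[i] = a[i] + a[i] * 2.0;
    }
    for (int r = 0; r < 100; r++)               /* memory phase  */
        memcpy(dst, src, copy_bytes);

    printf("checksum %f\n", b[N / 16]);
    return 0;
}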

Ranges of MRPI values having little (<1%) or no effect on execution time at the same frequency are grouped into a single class (the same workload type). Application execution intervals with large MRPI are memory-intensive workloads, so they can be run at lower frequencies to save energy: higher frequencies would simply result in wasted CPU cycles while waiting for data from memory. This motivates assigning a low frequency to large-MRPI workloads, bounded by the speed of memory (933 MHz in the present case). Similarly, a workload having a significantly low MRPI, i.e., whose execution time scales linearly with frequency, is assigned the maximum available frequency (2 GHz on the A15 cluster).
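The sketch below shows the shape of such an f-tab lookup followed by a frequency write. The MRPI thresholds and frequencies are illustrative placeholders, not the thesis's profiled values, and writing scaling_setspeed through sysfs (which assumes the userspace governor is active) stands in for the cpufreq-set utility used in the thesis.

/* Sketch: mapping a predicted MRPI to a V-f level via an offline
 * profiled table (f-tab) and applying it through cpufreq sysfs.
 * Thresholds and frequencies below are illustrative placeholders. */
#include <stdio.h>

struct ftab_entry { double mrpi_max; long freq_khz; };

/* Higher MRPI (more memory-bound) => lower frequency. */
static const struct ftab_entry ftab[] = {
    {0.005, 2000000},   /* compute-intensive: run at fmax      */
    {0.015, 1400000},   /* mixed workload                      */
    {1e9,    900000},   /* memory-intensive: near memory speed */
};

static long lookup_frequency(double mrpi)
{
    for (unsigned i = 0; i < sizeof(ftab) / sizeof(ftab[0]); i++)
        if (mrpi <= ftab[i].mrpi_max)
            return ftab[i].freq_khz;
    return ftab[0].freq_khz;
}

static int set_frequency(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed",
             cpu);
    FILE *f = fopen(path, "w");   /* requires the userspace governor */
    if (!f) return -1;
    fprintf(f, "%ld\n", khz);
    fclose(f);
    return 0;
}

int main(void)
{
    double predicted_mrpi = 0.018;   /* from the prediction stage */
    long f = lookup_frequency(predicted_mrpi);
    printf("setting %ld kHz\n", f);
    return set_frequency(4, f);      /* cpu4: first A15 core */
}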

Identification of unused cores

The MRPI of unused cores, i.e., cores on which no application is executing, is usually low due to few memory reads. As a result, if the unused cores are not identified, they give the false impression of executing a compute-intensive application, leading to the selection of a high V-f value for the whole cluster and thus increased energy consumption. This becomes prominent when a cluster has more cores than concurrent applications. To address this, the proposed algorithm determines unused cores at runtime using an IPS threshold. If the IPS of a core is below the threshold value, it is identified as unused, and its MRPI is subsequently set to 10 (any value larger than one would do, as the value of MRPI usually does not exceed one). This eliminates the influence of unused cores on the V-f setting, which is decided by the minimum MRPI over the applications (Equation 3.3). The IPS threshold value was experimentally identified as 2.9x10^6 for the cores in the A15 cluster of the Odroid-XU3, which is used for the experimental validation. Determining unused cores using IPS does not require extra PMCs, as the number of instructions retired during each time interval is already available for computing MRPI. In contrast, checking the unused status with the commonly used CPU utilisation [150] needs an extra PMC (number of active CPU cycles), leading to increased runtime overhead.
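A minimal sketch of this masking step, using the IPS threshold and sentinel value given above, is shown below; the per-core IPS and MRPI values are made up for the example.

/* Sketch: masking unused cores before the min() of Equation 3.3.
 * A core whose IPS falls below the threshold is treated as idle and
 * its MRPI forced to a large sentinel, so it cannot pull the cluster
 * towards a high V-f. Threshold value as given above (2.9e6 IPS). */
#include <stdio.h>

#define IPS_THRESHOLD 2.9e6
#define MRPI_SENTINEL 10.0   /* any value > 1 works */

static void mask_unused_cores(double *mrpi, const double *ips, int n)
{
    for (int i = 0; i < n; i++)
        if (ips[i] < IPS_THRESHOLD)
            mrpi[i] = MRPI_SENTINEL;
}

int main(void)
{
    double mrpi[4] = {0.004, 0.0001, 0.02, 0.0002};
    double ips[4]  = {9.1e8, 1.2e5, 4.5e8, 0.8e5};  /* cores 1, 3 idle */
    mask_unused_cores(mrpi, ips, 4);
    for (int i = 0; i < 4; i++)
        printf("core %d: MRPI %.4f\n", i, mrpi[i]);
    return 0;
}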

3.2.1.5 Adaptive Sampling

In general, not all applications exhibit frequent workload variations during execution. For example, in Fig. 3.11, astar, mcf and astar-lbm show significantly less workload variation than lbm and lbm-mcf. In such cases, performing PMC data collection, subsequent processing and V-f switching (Algorithm 1) periodically (at every Ts) would incur unnecessary runtime overheads without any benefit. Therefore, to avoid this, the sampling interval is adjusted according to the application workload variations. The proposed algorithm keeps track of workload variations through a counter Wc, which is incremented/decremented if the V-f settings for the time intervals t_{i-1} -> t_i and t_i -> t_{i+1} are the same/different. When the value of Wc reaches las, a configurable parameter, the runtime adaptation is paused for las x Ts. This stops unnecessary PMC data collection, subsequent processing and V-f switching.
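The counter logic can be sketched as follows; the las value and the clamping of Wc at zero are illustrative choices, with the increment/decrement direction matching lines 14-20 of Algorithm 1.

/* Sketch: the stability counter used for adaptive sampling. Wc is
 * incremented when the selected V-f is unchanged and decremented
 * otherwise; once Wc reaches las, adaptation pauses for las * Ts. */
#include <stdio.h>

#define LAS 5   /* illustrative pause threshold */

static int wc = 0;

static int next_wait_multiplier(long f_new, long f_cur)
{
    wc += (f_new == f_cur) ? 1 : -1;
    if (wc < 0) wc = 0;          /* clamp at zero (assumption) */
    if (wc >= LAS) {             /* workload stable: pause      */
        wc = 0;
        return LAS;              /* wait las x Ts               */
    }
    return 1;                    /* sample again after one Ts   */
}

int main(void)
{
    long freqs[] = {1400, 1400, 1400, 1400, 1400, 1400, 900};
    long cur = 1400;
    for (int i = 0; i < 7; i++) {
        int m = next_wait_multiplier(freqs[i], cur);
        printf("interval %d: wait %d x Ts\n", i, m);
        cur = freqs[i];
    }
    return 0;
}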

3.2.1.6 Performance Evaluation and Compensation

Considering the dynamic resource availability and interference between concurrent applications, it is important to evaluate performance during execution to ensure that no application experiences a performance loss. Therefore, instructions per second (IPS) is used as the metric quantifying the runtime performance of each application for every elapsed time interval Ts. The performance loss is calculated by comparing the IPS_n of every time interval with the maximum IPS (IPS_max) achieved by executing the application at the highest available frequency (f_max) (line 4, Algorithm 1). If there is a performance loss of λ% during the interval t_{i-1} -> t_i, the selected V-f is increased by λ x f_max (line 23, Algorithm 1) for the subsequent time interval (t_i -> t_{i+1}) to compensate for it. Furthermore, the frequency is modified only when the performance loss (λ) is significant, to minimize the overheads associated with DVFS [99]. The threshold ∆ was experimentally verified and set to 1%, taking the variations in PMC data into account.
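A minimal sketch of this compensation step is given below, interpreting λ as a percentage as in line 21 of Algorithm 1; the IPS and frequency values are made up for the example.

/* Sketch of the compensation step: performance loss lambda is the
 * percentage drop of IPS relative to IPS at fmax; when it exceeds
 * the threshold, the next frequency is raised by lambda * fmax. */
#include <stdio.h>

#define DELTA_PCT 1.0   /* threshold, as set in the text */

static long compensate(double ips_now, double ips_max,
                       long f_new, long f_max)
{
    double lambda = (ips_max - ips_now) / ips_max * 100.0;
    if (lambda > DELTA_PCT)
        f_new += (long)(lambda / 100.0 * f_max);
    return f_new;
}

int main(void)
{
    /* 3% observed loss at 1.4 GHz against IPS measured at 2 GHz */
    long f = compensate(9.7e8, 1.0e9, 1400000, 2000000);
    printf("compensated frequency: %ld kHz\n", f);
    return 0;
}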


Figure 3.13: Workload prediction for the application scenario mi-mc.

3.2.2 Experimental Results

The proposed online energy optimization technique is validated on the big cluster of the Odroid-XU3 platform running Linux kernel 3.10.96. To show the effectiveness of the proposed approach, the applications lbm (lb), milc (mi), mcf (mc) and bwaves (bw) from SPEC CPU2006 [120], and swaptions (sw) and freqmine (fr) from PARSEC [15], are considered. Details of the platform and benchmarks are given in Sections 2.4.1 and 2.4.2, respectively. Applications are executed concurrently in single, double and triple combinations. The power is measured from the on-board power sensors of the Odroid-XU3 every 100 ms. The experimental results are collected by running each scenario ten times; finally, the average values (energy and performance) are computed for the comparison. The proposed technique is compared against Linux's conservative, ondemand and interactive frequency governors, which are deployed on millions of smartphones, making them competitive baselines [155]. These governors are utilisation-based approaches, as they scale the frequency based on utilisation thresholds (more details are presented in Section 2.3.1). Further, an IPC-based approach [150] and an exhaustive search-based approach, similar to [50], are also considered for the comparison. For the exhaustive search-based approach, each application scenario is executed at all available frequencies and the one with minimum energy consumption is selected, while having the same or better performance than the proposed approach.

3.2.2.1 Workload Prediction and Adaptive Sampling

The values of α and β in Equation 3.5 were experimentally obtained by sweeping them between 0 and 1 and observing the corresponding workload mis-predictions (under/over) for various application scenarios. Finally, values of 0.3 and 0.6 were chosen, as they resulted in relatively accurate workload prediction. Fig. 3.13 shows the actual and predicted MRPI for the application scenario mi-mc at different time intervals. The average error in workload prediction, considering all application scenarios used in the evaluation, is 4.2%.
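Equation 3.5 is defined earlier in the chapter; assuming it takes a weighted-history form ŵ(t+1) = α·w(t) + β·ŵ(t), consistent with the two coefficients swept here, the prediction step reduces to a few lines (a sketch, not the exact formulation):

    /* MRPI prediction sketch with alpha = 0.3, beta = 0.6 as chosen
     * above. w_actual is the MRPI measured over the last interval.  */
    double predict_mrpi(double w_actual)
    {
        static double w_hat = 0.0;   /* previous prediction w_hat(t)  */
        const double alpha = 0.3, beta = 0.6;
        w_hat = alpha * w_actual + beta * w_hat;
        return w_hat;                /* predicted MRPI for t+1        */
    }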

The value of the sampling interval (Ts), at which a new V-f value is computed and set, was determined by varying it from 20 ms to 250 ms in 20 ms steps; 200 ms offered the best trade-off between energy savings and runtime overheads. Further, the effect of adaptive sampling is evaluated for various application scenarios, where improvements of up to 1.83% in energy consumption and 1.62% in execution time are observed.

3.2.2.2 Energy Savings

In the single application scenario, only one application is active at a given moment in time. Fig. 3.14 (a) shows energy consumption values for the various approaches, executing one application at a time. The proposed approach outperforms the reported techniques, showing energy savings of up to 68%, 66%, 65%, 63%, and 39% compared to the interactive, ondemand, conservative, IPC-based and exhaustive search-based approaches, respectively.

At a given moment in time, multi-core platforms may need to execute applications concurrently. Such scenarios are usually observed in smartphones, where a user runs multiple applications at the same time, e.g., an MP3 player and Facebook. To validate such scenarios, concurrent execution of double and triple application combinations is considered. The proposed approach adapts to concurrent workload variations more efficiently than existing techniques and selects an appropriate V-f setting (see Fig. 3.15). Further, it also considers the latency (δ) due to memory contention, whose value is experimentally identified as 4.5% of the average MRPI of all cores in a cluster. For the double application scenario, the proposed technique improves energy savings by up to 69%, 66%, 65%, 65% and 54% compared to the interactive, ondemand, conservative, IPC-based and exhaustive search-based approaches, respectively. Furthermore, for the triple application scenario, the proposed approach consumes 67%, 65%, 65%, 63%, and 35% less energy than the interactive, ondemand, conservative, IPC-based and exhaustive search-based approaches, respectively. As can be seen from Fig. 3.15, the utilisation-based approach chooses the highest frequency even though the workload is memory-intensive. This is because, while the workload is memory-intensive, it still places a high load on the cores as far as the load measured by the kernel is concerned. As a result, the utilisation-based approach runs at the highest frequency even though this does not improve performance.

Figure 3.14: Comparison of the proposed approach in terms of energy consumption with the existing approaches for different execution scenarios: a) single-application, b) double-application, and c) triple-application.

Figure 3.15: MRPI and frequency at different time intervals of the application scenario lb-mi execution for various approaches.

Figure 3.16: Average performance difference between proposed approach and other reported approaches.

3.2.2.3 Application Performance

The application performance is evaluated for various application scenarios by computing the average execution time over several runs. The proposed technique continuously monitors the application performance in terms of IPS for every time interval and, consequently, modifies the chosen frequency if there is a significant performance loss (> 1%). As a result, the proposed approach achieves significantly better energy savings with little performance loss. As shown in Fig. 3.16, the average execution time of the proposed approach, considering different application scenarios, is 4.85%, 3.62%, 3.10%, 3.22% and 0.9% slower compared to the interactive, conservative, ondemand, IPC-based and exhaustive search-based approaches, respectively.

3.2.2.4 Runtime Overheads

The runtime overhead associated with the proposed approach can be represented as,

Figure 3.17: Runtime overhead of the proposed approach.

$$ T_o = \frac{T_{exec} - r \times l_{as} \times T_s}{T_s} \times (S_o + C_o + P_o) \quad (3.6) $$

where To, Texec, r, So, Co and Po represent the total overhead; the execution time; the number of times the adaptation is paused; the overhead associated with V-f switching; the PMC collection overhead; and the remaining processing steps (involving las and Ts) shown in Algorithm 1, respectively.

So and Co are platform dependent, taking 300 ns and 30 ns per counter, respectively, on the Odroid-XU3 [50]. Fig. 3.17 shows the runtime overhead of the proposed technique for various application scenarios as a percentage of application execution time. A maximum overhead of 0.35% is observed for mi-mc-bw, which has a long execution time of 111 seconds. This shows that the proposed approach has a negligible runtime overhead.
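For concreteness, Equation 3.6 can be evaluated directly; a minimal sketch follows (the function name is a placeholder, and the Po value in the usage comment is left open, while So and Co use the figures quoted above):

    /* Runtime overhead To per Equation 3.6 (all times in seconds).
     * so: V-f switching cost per epoch, co: PMC collection cost per
     * epoch, po: remaining processing cost per epoch.                */
    double runtime_overhead(double t_exec, int r, int las, double ts,
                            double so, double co, double po)
    {
        double epochs = (t_exec - (double)r * las * ts) / ts; /* active epochs */
        return epochs * (so + co + po);                       /* total To      */
    }
    /* e.g. runtime_overhead(111.0, r, las, 0.2, 300e-9, 4 * 30e-9, po) */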

3.2.3 Summary

An online energy minimization approach for concurrently executing single-threaded applications on a homogeneous multi-core platform has been presented. It uses workload classification and prediction to select an appropriate V-f setting at runtime. Validation on a real multi-core hardware platform for various application scenarios shows an improvement of up to 69% in energy efficiency compared to existing approaches. Furthermore, it incurs a runtime overhead of at most 0.35% of application execution time, which is relatively small.

3.3 Runtime Management of DVFS for Concurrent Execution of Multi-Threaded Applications

Various approaches have been proposed to improve the energy efficiency of multi-threaded applications [109–115, 118, 119, 156–159] using low-power techniques such as dynamic voltage and frequency scaling (DVFS), power capping and clock gating (please refer to Section 2.3.2 for more details). These approaches introduced various metrics, such as Instructions Per Cycle (IPC) and Millions of Instructions Per Second (MIPS), for determining the application workload on the processor through online and/or offline characterization. These metrics are used to make decisions on the selection of the Voltage-frequency (V-f) setting.

For multi-threaded applications, the workload depends not only on memory-/compute-intensity but also on thread synchronization contention. In addition, some computing systems are based on a Non-Uniform Memory Access (NUMA) architecture, such as AMD Opteron processors and Intel Itanium servers, where memory access time depends on the memory location relative to the processor [160]. A thorough analysis of related works [109–111, 113–119], presented in Section 2.3.2, shows that the existing approaches do not consider the combined effect of the above factors while estimating the CPU workload. As a result, they do not adapt efficiently to workload variations, which is essential for improving energy efficiency. Furthermore, reported approaches do not take memory contention into account when executing applications concurrently, resulting in increased power consumption without any performance benefit.

This work proposes a workload-aware runtime energy management technique that takes the aforementioned factors into account to improve energy efficiency. Our proposal measures the processor workload using MRPI as a measure of memory-intensity, and utilization for estimating the thread synchronization contention. For Intel-based platforms, a metric called Memory Accesses Per Micro-operation (MAPM) is considered, which is equivalent to MRPI on ARM-based platforms (please refer to Section 3.1.2.2). Moreover, the latency due to NUMA is calculated by monitoring the remote and local memory accesses during application execution. As part of this, four hardware performance monitoring counters (PMCs) are used to compute MAPM, utilization and NUMA latency. To determine the appropriate V-f setting, a binning based approach is employed which takes utilization and MAPM as inputs. Furthermore, this approach works on both per-core and system-wide DVFS supporting platforms. Experimental validation is performed on the 12-core (24-thread) Intel Xeon E5-2630 and the 61-core (244-thread) Xeon Phi many-core platform. The former supports per-core DVFS, whereas the latter is based on system-wide DVFS. The main contributions of this work are:

Figure 3.18: Illustration of various steps in the proposed approach for runtime energy management.

1. An accurate estimation of processor workload by considering the combined effect of memory-/compute-intensity, thread synchronization contention and NUMA latency;

2. A binning based approach for efficiently determining the V-f setting as per the predicted workload;

3. Validation of the proposed approach on two hardware platforms, the Xeon E5-2630 and Xeon Phi 7120P.

The remainder of the section is organized as follows. A detailed discussion of the problem formulation and the proposed approach is given in Section 3.3.1, while Section 3.3.2 presents an experimental validation of the proposed approach on two hardware platforms with various benchmark applications, including the runtime overheads, before summarising the chapter.

3.3.1 Runtime DVFS Approach

The proposed approach considers two kinds of many-core platforms: one supporting per-core DVFS and the other system-wide DVFS. The following provides the problem definition, assuming that there are N cores in a platform.

Given a set of multi-threaded applications and a many-core platform supporting per-core/system-wide DVFS

Find an efficient V-f setting periodically for each core (V-f i, where i = 1, 2, ..., N) or for the whole system, such that the overall system energy consumption is minimized while maximizing performance.

The proposed approach is illustrated in Fig. 3.18 and has the following four steps:

1. PMC data collection (Section 3.3.1.1);

2. Computing MAPM, utilization and NUMA latency (Section 3.3.1.2);

3. Workload prediction (Section 3.3.1.3);

4. Identification of V-f setting (Section 3.3.1.4).

The PMC data are collected periodically to obtain information about architectural events. The collected data are used to compute the MAPM, utilization and NUMA latency, which are then fed into a prediction algorithm for estimating the future workload. Considering the underlying architecture (per-core or system-wide DVFS), the V-f setting is determined according to the predicted workload. A detailed discussion of each step is given in the following sections.

3.3.1.1 PMC data collection

Modern processors support runtime monitoring of architectural events (e.g., instructions retired, cache misses, etc.) through a set of hardware performance monitoring counters (PMCs). These counters can be configured to count a particular architectural event from the list of events supported by the processor. The proposed approach needs instructions retired, Last-Level Cache (LLC) misses and active CPU cycles. It has already been shown in [53, 56] that similar events can be used to efficiently estimate the processor workload at runtime. Further, if the chosen platform has a NUMA architecture, remote memory accesses are also collected. On the chosen platforms (Xeon E5-2630 and Phi), the events matching the above have the following symbolic names: UOPS_RETIRED, LAST_LEVEL_CACHE_MISSES, UNHALTED_CORE_CYCLES, and MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM. We use the tool perfmon [14] for periodically sampling the PMCs on each core simultaneously. This tool provides routines to configure the PMCs using symbolic names to count a particular event, to initialize, and to terminate the data collection.
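The thesis uses the perfmon (libpfm) routines for this; as a self-contained illustration, comparable generic events can also be sampled with the raw Linux perf_event_open interface (system-wide sampling on one CPU typically requires elevated privileges; the 200 ms epoch mirrors Ts):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Open one generic hardware counter on a given CPU (all processes). */
    static int open_counter(uint64_t config, int cpu)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HARDWARE;
        attr.size   = sizeof(attr);
        attr.config = config;
        return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0);
    }

    int main(void)
    {
        int fd_ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS, 0);
        int fd_llc = open_counter(PERF_COUNT_HW_CACHE_MISSES, 0);
        uint64_t ins = 0, llc = 0;

        usleep(200 * 1000);                 /* one sampling epoch (Ts) */
        read(fd_ins, &ins, sizeof(ins));
        read(fd_llc, &llc, sizeof(llc));
        printf("memory accesses per instruction ~ %f\n",
               ins ? (double)llc / (double)ins : 0.0);
        return 0;
    }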

3.3.1.2 Computing MAPM, utilization and NUMA latency

To find the appropriate frequency, the processor workload has to be determined accurately. This involves choosing the right metric and taking the underlying memory architecture into account. Therefore, the metric Memory Accesses Per Micro-operation (MAPM) is chosen, similar to [56, 119], along with utilization and the latency associated with non-uniform memory accesses. MAPM is computed as the ratio between memory accesses and micro-operations retired during the DVFS interval. It has been used for measuring the memory-intensity of the workload during application execution [56, 119]. Usually, a high value of MAPM suggests that the workload is memory-bound, and vice versa. The memory accesses and micro-operations retired are measured using the PMCs LAST_LEVEL_CACHE_MISSES and UOPS_RETIRED, respectively.

We have observed, especially in the case of multi-threaded applications, that the actual value of MAPM does not always represent the memory-intensity of the workload accurately. A lower value of MAPM may not always indicate a compute-bound workload, for the following reasons. An application might have a lot of synchronization contention, such as inter-thread locks and barriers, or communication contention on external peripheral devices, e.g., storage devices, keyboard, etc. In such cases, MAPM alone fails to determine the actual load on the processor. To address this issue, utilization is considered along with MAPM to estimate the effect of the aforementioned factors. To compute the processor utilisation, un-halted CPU cycles are used, which are monitored through the PMC UNHALTED_CORE_CYCLES.

Moreover, if a many-core platform is based on a NUMA architecture, the memory latencies differ depending on the relative distance between processor and memory: access to local memory is much faster than access to remote memory. Therefore, to efficiently capture the effect of MAPM on processor performance, the NUMA latency should also be taken into account when selecting an appropriate V-f setting. To accomplish this, the number of remote memory accesses, monitored through the PMC MEM_LOAD_UOPS_LLC_MISS_RETIRED.REMOTE_DRAM, is used in computing the MAPM.

Assume that µ-ops, ml and mr represent the micro-operations retired, and the accesses to local and remote memory, respectively. Then, MAPM can be determined as follows:

$$ MAPM = \frac{m_l + \theta \times m_r}{\mu\text{-}ops} \quad (3.7) $$

The constant θ depends on the latency associated with the remote memory access, which can be determined from the datasheet of the processor. As discussed in [56], the contention on memory, when applications are memory-bound, is also considered (refer to equation (3.8)).
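In code, Equation 3.7 is a one-liner; θ = 2 is the value reported later in Section 3.3.2:

    #include <stdint.h>

    /* MAPM per Equation 3.7: remote accesses weighted by theta.      */
    double mapm(uint64_t m_local, uint64_t m_remote, uint64_t uops)
    {
        const double theta = 2.0;  /* remote-access weight (Sec. 3.3.2) */
        return ((double)m_local + theta * (double)m_remote) / (double)uops;
    }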

3.3.1.3 Workload prediction

We use the same workload prediction as discussed in Section 3.2.1.3 for predicting the utilization and MAPM on each core, to set the V-f level proactively. The values of the coefficients α and β are given in Section 3.3.2. Furthermore, as mentioned in Section 3.2.1.5, workload variations may occur that are not significant enough to trigger a switch of the DVFS setting. Therefore, the adaptive sampling technique presented in Section 3.2.1.5 is employed to minimize the runtime overheads when such scenarios are observed.

Figure 3.19: An example of V-f setting selection using binning based approach.

3.3.1.4 Identification of V-f setting

Determining the appropriate V-f setting is key to achieving energy efficiency while minimizing performance loss. We employ a binning based approach [161] for finding the V-f setting based on the application workload. This approach consists of two bins: a utilization bin and a MAPM bin. The utilization and MAPM computed in the step explained in Section 3.3.1.2 act as inputs to this stage, as shown in Fig. 3.19.

Bins training: We use offline training to determine the bin boundaries. The first part of the training process is obtaining training samples. One training sample consists of the collected metrics, MAPM and utilisation, along with the application performance at the different V-f settings available on the chosen platform. In order to generate a diverse range of training samples, applications from different benchmark suites are considered, including SPEC CPU2006 [134], LMBench [136], RoyLongbottom [135], PARSEC 3.0 [15], NAS [17] and Rodinia [16]. These training samples are grouped into bins based on utilisation and MAPM, and each bin is assigned a V-f setting that is energy efficient with no or little (<1%) performance loss. The experimental data used in the classification bins are given in Appendix B. At runtime, the utilization and MAPM are computed to find an appropriate V-f setting using the pre-determined bins. We use the utility cpufreq-set for changing the V-f setting during application execution.
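At runtime the lookup reduces to nested threshold tests over the two bin layers. The sketch below uses boundary values in the spirit of Fig. 3.19; the actual trained boundaries are given in Appendix B, and the mixed-workload frequencies here are purely illustrative:

    /* Binning-based V-f selection sketch (illustrative boundaries).  */
    long select_frequency_khz(double util_pct, double mapm)
    {
        if (util_pct < 15.0) return 1700000;   /* idle/contended: low f */
        if (util_pct > 95.0)                   /* fully busy cluster    */
            return (mapm < 0.10) ? 2600000     /* compute-bound         */
                 : (mapm > 0.26) ? 1700000     /* memory-bound          */
                                 : 2200000;    /* mixed (illustrative)  */
        return (mapm < 0.10) ? 2500000 : 1900000;
    }
    /* The returned frequency is applied with cpufreq-set / sysfs.     */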

On platforms supporting per-core DVFS, the V-f setting of each core can be set according to its workload. However, if the platform supports only system-wide DVFS, the V-f setting of the whole system should be chosen such that no application thread experiences a performance loss. Such cases can arise when applications with different workload types (e.g., compute-bound, memory-bound, etc.) run concurrently. A memory-bound workload can be run at a lower V-f setting than a compute-bound one, but if both such workloads execute concurrently, selecting an appropriate V-f setting is challenging. We use an approach similar to the one proposed in [56] to address this issue, giving preference to the compute-bound workload over the memory-bound one, and considering the impact of memory contention due to concurrent execution. Therefore, the V-f setting is determined by the following MAPM value (MAPMt):

$$ MAPM_t = \min\{M_1, M_2, M_3, \ldots, M_N\} + \zeta \times \sum_{i=1}^{N} \frac{M_i}{N} \quad (3.8) $$

Here, Mi, N and ζ represent the MAPM of core i, the number of processing cores in the system, and a coefficient determining the effect of memory contention on the performance of each application thread, respectively. Furthermore, note that the result of the minimum-value computation in Equation (3.8) is the same as the MAPM of the core with maximum utilisation, because the most compute-intensive workload has the minimum MAPM.

Identification of unused cores: The MAPM of idle cores, which execute no application thread, is usually low due to few memory accesses [53]. This gives the misleading impression that such cores are executing a compute-intensive application, leading to the selection of a high V-f value for the whole system and thus increasing energy consumption. This becomes prominent when there are more cores than concurrent applications. To address this, the proposed algorithm identifies idle cores at runtime using a utilization threshold. Note that, for per-core DVFS supporting platforms, identification of idle cores is not required, as the V-f setting is determined per core, which is already taken care of by the utilization bins shown in Fig. 3.19. Based on experimental observation, if the utilization of a core is below 5%, it is identified as an idle core. Subsequently, the MAPM of such cores is set to 10 (any value larger than one would suffice, as the value of MAPM usually does not exceed one). This nullifies the influence of idle cores on the V-f setting, as it is mostly decided by the minimum MAPM (Equation (3.8)).
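Combining Equation (3.8) with the idle-core masking gives the following sketch (ζ = 0.05 as reported in Section 3.3.2; masked cores contribute the sentinel value 10, following the text above):

    /* System-wide MAPM_t per Equation 3.8 with idle-core masking.    */
    double mapm_system(const double *mapm, const double *util, int n)
    {
        const double zeta = 0.05;   /* memory-contention coefficient  */
        double min_m = 10.0, sum = 0.0;
        for (int i = 0; i < n; i++) {
            /* idle cores (utilisation < 5%) get MAPM = 10            */
            double m = (util[i] < 5.0) ? 10.0 : mapm[i];
            if (m < min_m) min_m = m;
            sum += m;
        }
        return min_m + zeta * (sum / n);
    }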

3.3.2 Experimental Results

The proposed approach is validated on an Intel Xeon E5-2630 running Red Hat Linux, and on the Xeon Phi 7120P coprocessor. The platform details are given in Section 2.4.1, Chapter 2. The Rodinia [16] and NAS parallel [17] OpenMP benchmarks are used to demonstrate the efficacy of the proposed runtime energy management approach. These benchmarks are chosen, instead of the ones used in Section 3.2, because they offer sufficiently long execution times and are implemented for platforms like the Xeon Phi. The classes of the NAS benchmarks and the inputs to the Rodinia benchmarks are chosen such that their execution times are sufficiently long (more than five seconds).

Table 3.1: Selected applications from Rodinia [16] and NPB [17] and their abbreviations

Benchmark   Application Name                          Abbreviation
Rodinia     Breadth-First Search                      bfs
            HotSpot                                   hs
            K-means                                   km
            lavaMD                                    lmd
            Myocyte                                   mc
            Needleman-Wunsch                          nw
            Particle Filter                           pf
            Stream Cluster                            sc
            Speckle Reducing Anisotropic Diffusion    srad
NAS         Block Tri-diagonal solver                 bt
            Scalar Penta-diagonal solver              sp
            Lower-Upper Gauss-Seidel solver           lu
            Conjugate Gradient                        cg
            Embarrassingly Parallel                   ep
            Multi-Grid                                mg
            Data Cube                                 dc

The details of the benchmarks are given in Section 2.4.2, and the selected applications and their abbreviations are shown in Table 3.1. For the Xeon E5-2630 and Phi, the number of application threads is set to 24 and 61, respectively. These applications are executed in single, double and triple application scenarios. As in the experiments conducted in Section 3.2, the experimental results are collected by running each scenario ten times, and the average values (energy and performance) are computed for the comparison.

The proposed technique is compared against Linux's conservative (CONS), ondemand (OD) and performance (PERF) power governors, which are deployed on millions of devices, making them competitive baselines [155]. In addition, the approach presented in [119] is considered for comparison. To demonstrate the advantage of taking the underlying NUMA architecture into account while estimating the workload, two variants of the proposed (prop) approach are derived: prop-NNUMA (NNUMA stands for 'No NUMA') and prop-NUMA. As opposed to prop-NNUMA, prop-NUMA takes NUMA latency into account when deciding the V-f setting. For better presentation, the energy consumption of the evaluated approaches is normalized to that of prop-NUMA (some applications have relatively low energy consumption due to their short execution times).

Estimation of coefficients

The accuracy of the predicted workload, as compared to the actual workload of the prior time intervals, depends on the coefficients θ, γ and ζ (refer to Equations (3.7), (3.4) and (3.8)). The value of γ is computed from α and β (refer to Equation (3.5)). These coefficients are experimentally obtained by executing a diverse set of applications individually and concurrently. Finally, considering the relative workload prediction accuracy, θ, ζ, α and β are set to 2, 0.05, 0.3 and 0.6, respectively. The same coefficient values are used for all the application scenarios.

Figure 3.20: Comparison of the proposed technique with reported approaches for the single application scenario in terms of normalized energy consumption, executing on the Xeon E5-2630.

Evaluation of Workload Prediction

To evaluate the workload prediction accuracy, various application scenarios (executing single and multiple applications) are considered. For a set of 40 application scenarios, the average error in workload prediction was 5.4% with a minimum and maximum error of 0.5% and 9.3%, respectively.

3.3.2.1 Evaluation on 24-core Xeon E5-2630

Energy Savings

In the case of a single-application scenario, there is only one active application. The number of cores allocated to each application is 24, the same as the number of logical cores available on the Xeon E5-2630. Fig. 3.20 shows a comparison of the adopted approaches, prop-NUMA and prop-NNUMA, with existing techniques in terms of normalized energy consumption. It can be observed that the proposed approach prop-NUMA outperforms existing approaches, except for the application sc executing under the CONS power governor. Moreover, prop-NNUMA also achieves better energy savings than reported approaches, except for applications sc and ep executing under the CONS and PERF power governors, respectively. The proposed prop-NUMA achieves energy savings of up to 57.9%, 60.3% and 61.6% compared to CONS, OD and PERF, respectively. Furthermore, on average, prop-NUMA consumes 3.2% less energy than prop-NNUMA.


Figure 3.21: Normalized energy consumption of various approaches for the double application scenario, executing on the Xeon E5-2630.


Figure 3.22: Comparison of the proposed approach with reported approaches for triple application scenario in terms of normalized energy consumption (evaluated on the Xeon E5-2630).


For the double and triple application scenarios, two and three applications are executed concurrently. In this evaluation, the available cores on the platform are shared equally among the applications; for example, an application gets 12 cores in the double application scenario. This is ensured by setting the core affinity of each application at the start of its execution. Concurrent execution increases the contention on memory, which suggests scaling down the V-f setting to exploit the increased data access latency for energy efficiency. Unlike existing approaches [119, 155], the proposed technique efficiently estimates memory contention, thereby minimizing energy consumption. Fig. 3.21 gives the normalized energy consumption of the various approaches when executing two applications concurrently. The proposed prop-NUMA approach improves energy consumption by up to 77.1%, 79.4% and 79.5% compared to CONS, OD and PERF, respectively.


Figure 3.23: Mean performance difference between prop-NUMA and other approaches, with standard deviation error bars (evaluated on the Xeon E5-2630).

For the bt-nw application scenario, prop-NUMA and PERF have similar energy consumption values. Furthermore, prop-NUMA achieves an average energy saving of 7.8%, with a standard deviation of 10.7%, compared to prop-NNUMA.

The proposed approach is also compared against the reported techniques in terms of normalized energy consumption for the triple application scenario. As shown in Fig. 3.22, prop-NUMA outperforms CONS, OD and PERF by up to 81.2%, 77.9% and 69.8%, respectively. From Figs. 3.20, 3.21 and 3.22, it can be observed that the average energy saving of prop-NUMA over prop-NNUMA keeps increasing, reaching 9.4% with a standard deviation of 8.9%. This suggests that the advantage of considering NUMA latency is more evident when multiple applications execute concurrently, leading to increased LLC misses and remote memory accesses.

Application Performance

The application performance is evaluated for various application scenarios by computing the average execution time over 10 runs. The proposed technique achieves energy savings by scaling down the V-f setting when this does not result in a performance loss. However, hardware platforms do not usually support fine-grained control of the V-f setting, which can lead to performance degradation. Fig. 3.23 shows the mean and standard deviation of the performance difference (%) between prop-NUMA and the other approaches. The average execution time for prop-NUMA, considering different application scenarios, is 0.66% and 0.47% lower compared to CONS and OD, respectively. However, PERF, which runs at the maximum V-f setting, outperforms prop-NUMA by 4.18% on average. This shows that the proposed approach improves energy efficiency with negligible performance loss in most cases.


Figure 3.24: Comparison of the proposed approach (prop-NNUMA) with reported approaches in terms of normalized energy consumption for various application scenarios, executing on the Xeon Phi.

3.3.2.2 Evaluation on 61-core Xeon Phi

Energy Savings

To show the effectiveness of the proposed approach on a platform supporting system-wide DVFS, experiments are also conducted on the Xeon Phi with various application scenarios. It should be noted that the Xeon Phi supports simultaneous sampling of only two PMCs. Time multiplexing can be used to monitor more than two events; however, it increases the uncertainty in the event counts and the runtime overheads [162]. Therefore, the /proc sysfs is used for measuring core utilisation. Furthermore, the remote memory accesses, used in prop-NUMA, are not considered due to the above limitation. As shown in Fig. 3.24, on average, considering all the application scenarios, prop-NNUMA improves energy efficiency by up to 54.3%, 60.9%, and 60.4% compared to CONS, OD and PERF, respectively. These energy savings are relatively lower (by 16%) than those achieved on the per-core DVFS supporting platform (refer to Figs. 3.20, 3.21 and 3.22).

Application Performance

The mean and standard deviation of the performance difference between prop-NNUMA and the other approaches are given in Fig. 3.25. Unlike the per-core DVFS platform, in this case the V-f setting of the whole system is decided by the most compute-intensive thread of all the executing applications. As a result, performance loss is minimized at the cost of lower energy efficiency, which is evident from Figs. 3.24 and 3.25. The average execution time for prop-NNUMA, considering the nine application scenarios, is 0.66% and 0.47% better compared to CONS and OD, respectively. But, as expected, PERF outperforms prop-NNUMA by 1.19%.


Figure 3.25: Mean performance difference between prop-NNUMA and other approaches, with standard deviation error bars (evaluated on the Xeon Phi).


Comparison with Sundriyal et al. [119]

Sundriyal et al. [119] presented a model which predicts the micro-operations retired at different frequencies and the MAPM values using linear regression analysis. It then chooses the scaling frequency at which the energy consumption is minimal, measuring power with the Intel RAPL technology. This approach was proposed for the single application scenario, considering per-core DVFS. For comparison, the evaluation is carried out on the Xeon E5-2630 using ep, cg, lu, mg, sp and bt (the same applications used in [119]). The approach proposed in [119] gives an average of 7.4% energy savings compared to PERF, with a performance loss of 5.58%, whereas the proposed approach (prop-NUMA) achieves average energy savings of 17.8% with a performance loss of 1.8%. This shows that the proposed approach outperforms the technique presented in [119] by 10.4% in energy efficiency and 3.7% in performance.

3.3.2.3 Runtime Overheads

The proposed approach involves sampling four PMCs (only two on the Xeon Phi) on each core, and subsequent processing to find and change the V-f setting. For all the application scenarios used in the evaluation, we have measured the runtime overhead of the proposed approach on the Xeon E5-2630 and Phi by monitoring the time spent in each decision epoch (200 ms). The steps mainly include collection of PMC data; computation of MRPI and utilisation; prediction of workload; DVFS setting identification through the classification bins; and, finally, adjustment of the DVFS setting (V-f switching latency). The runtime overhead (To) is calculated with respect to the application execution time (Texec) and can be represented analytically as follows:

$$ \%T_o = \frac{T_{exec} - r \times T_p \times T_s}{T_s} \times \frac{T_{pmc} + T_{metrics} + T_{wp} + T_{classify} + T_{vfs}}{T_{exec}} \times 100 \quad (3.9) $$

where r, Tp, Ts, Tpmc, Tmetrics, Twp, Tclassify and Tvfs represent the number of times the adaptation is paused; the DVFS pause length; the DVFS sampling period; the time required for collecting PMC data; the MRPI/utilisation computation time; the time taken by workload prediction; the DVFS setting identification time (classification bins); and the overhead associated with V-f switching (10 µs to 50 µs [163]), respectively. The experimental measurements show a maximum overhead of 0.52% and 1.3% (in terms of application execution time) for the Xeon E5-2630 and Phi platforms, respectively. The overheads have been measured by executing applications to completion and logging To whenever a change in the DVFS setting is noticed.

3.4 Discussion

This chapter has presented approaches for runtime management of DVFS through efficient workload classification. First, an investigation of workload classification was carried out through statistical analysis of execution time and performance monitoring counter data. It was observed that: (i) the execution time of a compute-intensive application scales proportionally with frequency, whereas for a memory-intensive application the change is minimal as long as the processing core frequency is higher than the memory operating frequency; and (ii) L2-cache read refills measure the memory-intensiveness of an application. Further, a metric called Memory Reads Per Instruction (MRPI) was derived for workload classification, to efficiently make runtime decisions for adapting to the application workload by tuning the DVFS setting.

As shown in Sections 3.2.2.4 and 3.3.2.3, the proposed runtime energy optimization technique, which classifies the inherent workload characteristics of concurrently executing applications, incurs a lower overhead than existing learning-based approaches, whose significant overheads (up to 400 ms for a single application scenario [35]) are further aggravated by dynamic workload variations causing frequent re-learning. Therefore, the scalability of such approaches, in comparison to the proposed technique, is limited for multi-core platforms executing multiple applications. The approach presented in Section 3.2 performs on-the-fly workload classification using the metric MRPI and proactively selects an appropriate V-f setting for the predicted workload. Subsequently, it monitors the workload prediction error and the performance loss, quantified by instructions per second (IPS), and adjusts the chosen V-f to compensate. Validation on a homogeneous multi-core platform (the A15 cluster) for various combinations of applications shows an improvement of up to 69% in energy efficiency compared to existing approaches.

Furthermore, the above approach has been extended to multi-threaded applications and to per-core DVFS supporting many-core processors by incorporating thread synchronisation and NUMA latency into the workload estimation. A binning based approach with two classification layers, taking MAPM and utilisation as inputs, is used for identifying the appropriate DVFS setting. The proposed technique has been validated on two hardware platforms, the Xeon E5-2630 and Phi, supporting per-core and system-wide DVFS, respectively. In the following chapter, this work is extended to a heterogeneous multi-core architecture running multiple multi-threaded applications by incorporating thread-to-core mapping and DVFS.

Chapter 4

Static Mapping and DVFS for Heterogeneous Multi-cores

The previous chapter showed that efficient online concurrent workload classification-based runtime management of DVFS results in better energy efficiency. However, the approaches presented in Chapter 3 do not consider heterogeneous multi-core platforms and thread-to-core mapping, which help improve energy efficiency. As shown in the literature review (Chapter 2), heterogeneous multi-core architectures are computing alternatives for several application domains, such as embedded [76] and cloud [164]. These architectures integrate several types of processing cores within a single chip. For example, ARM's big.LITTLE architecture contains two types of cores, big and LITTLE, where the big cores are grouped into one cluster and the LITTLE cores into another [165]. The big cluster has both higher cache capacity and higher computational power than the LITTLE one. In such architectures, the distinct features of the different core types can be exploited according to application characteristics and end-user requirements. These architectures are also equipped with dynamic voltage and frequency scaling (DVFS) to help reduce energy consumption. As discussed in Chapter 3, these systems often deal with multiple applications concurrently while achieving the desired level of performance for each of them. Moreover, applications are implemented as multi-threaded programs to exploit multi-core chips; their threads can be mapped onto different cores for parallel execution, thus reducing the overall completion time.

Efficient runtime management of multi-threaded applications on heterogeneous multi-cores is of paramount importance for achieving energy efficiency and high performance, and has been a key research focus for computing systems [19, 34, 166]. In general, for each application, the management process first finds a thread-to-core mapping defining the number of used cores and their type, and then sets the operating voltage-frequency levels of the cores according to the workload, while satisfying the performance requirement. As part of this, the following experimental observations have been made.



Figure 4.1: Execution time (seconds) and energy consumption (J) values obtained by executing the Blackscholes application (from the PARSEC benchmark [15]) with various core combinations, including inter-cluster combinations, on ARM's big.LITTLE architecture containing 4 big (B) and 4 LITTLE (L) cores.

Observations: Fig. 4.1 motivates mapping an application across a heterogeneous multi-core architecture containing two types of cores, big (B) and LITTLE (L), with 4B and 4L cores present. The horizontal axis shows the possible resource combinations. The primary (left-hand) and secondary (right-hand) vertical axes show execution time and energy consumption, respectively, for the various resource combinations. The application execution time scales well with the number of cores, and it further benefits from mapping onto the big and LITTLE clusters at the same time (referred to as inter-cluster thread-to-core mapping). It can be seen that executing on 4L and 4B cores together is beneficial in terms of execution time and energy consumption, and thus the thread-to-core mapping should utilize the 4B and 4L cores. Furthermore, as discussed in Chapter 3, application workloads can be clearly classified as compute-intensive, memory-intensive, or mixed (compute- and memory-intensive) when applications run individually. The situation is completely different for concurrent execution, which exhibits greater workload variability due to applications' interference. Efficient workload classification in such scenarios is essential to choose appropriate DVFS levels that optimize energy while satisfying the performance constraints.

The key steps in the runtime management of concurrent execution of multiple applications on a heterogeneous multi-core system are summarized in Fig. 4.2. Considering the above two observations on inter-cluster exploitation and workload classification, the following four main challenges are associated with runtime management (described subsequently):

(i) Efficient inter-cluster thread-to-core mapping for multiple applications (left-most part of Fig. 4.2).

(ii) Workload classification for concurrently executing applications based on the identified thread-to-core mapping (middle part of Fig. 4.2).


Figure 4.2: Key steps in runtime management of concurrent execution of multi-threaded applications on a heterogeneous multi-core architecture.

(iii) Identification of the appropriate V-f level of cores for the associated workloads (right-most part of Fig. 4.2).

(iv) Analysing concurrent applications’ interference and taking appropriate measures to meet performance requirements.

(i) The inter-cluster thread-to-core mapping step needs to identify the number of used cores and their type for each application while meeting the performance requirements. In the case of multiple applications, the mapping process faces the challenge of allocating the right number and type of cores to each application from the available cores (4 LITTLE and 4 big cores for the ARM big.LITTLE architecture present in the Odroid-XU3 [28]), such that their performance requirements are satisfied and energy consumption is minimized. The left-most part of Fig. 4.2 shows an example thread-to-core mapping for each application, where threads are allocated onto different types of cores (highlighted in various grey-scales).

(ii) The workload classification step needs to classify the workload within each cluster for the concurrently executing applications (shown under Execution in the left-most part of Fig. 4.2) by taking the workload of each core into account. The classification could be into various classes, such as compute-intensive, memory-intensive and mixed [49, 134], which result from the varying ways in which applications exercise the hardware. From a performance perspective, it is desirable to run a compute-intensive application at a higher clock frequency than a memory-intensive one so that high performance can be achieved. However, an appropriate metric needs to be identified in order to classify the workloads; MRPI is used in the present case. Classification at a given time interval plays a pivotal role in achieving the desired energy-performance trade-off, which is discussed in detail in Section 4.2.

(iii) Appropriate V-f identification is required for the associated workloads such that the desired energy-performance trade-off can be achieved. Since multiple applications are executed concurrently, the V-f level needs to be identified by taking the performance requirements of all of them into account. Further, as different sets of V-f levels are available for the cores situated in different clusters, e.g., the big and LITTLE clusters in ARM's big.LITTLE architecture, it becomes challenging to identify the most suitable V-f levels for the different clusters while respecting the applications' performance constraints. The right-most part of Fig. 4.2 shows an example V-f assignment.

(iv) Applications' interference due to concurrent execution may degrade their performance. In order to meet the performance requirements, the interference level should be analysed and then used to take appropriate measures. The interference level can be measured as the joint performance degradation of applications when executing concurrently, in comparison to their individual executions. Clustered heterogeneous architectures such as ARM's big.LITTLE exhibit different amounts of interference on the big and LITTLE clusters for the same workload, due to the different amounts of memory available to them, and thus the clusters need to be analysed separately. Thereafter, the results need to be exploited to meet the performance requirements.

Contributions: A close observation of the existing runtime management approaches (described in Section 2.3) indicates that they cannot address all the aforementioned challenges for executing multi-threaded applications on heterogeneous multi-core platforms. In order to overcome the limitations of the existing approaches in addressing the above-mentioned challenges, this chapter makes the following contributions:

1. Offline analysis of individual applications for performance and energy consumption when mapped to various possible resource combinations on a given heterogeneous multi-core platform to obtain profiled data.

2. For concurrently executing applications, an online mapping strategy facilitated by sorted profile data, to compute the minimum energy consumption point, while satisfying the performance and resource constraints. For each application, the computed point defines thread-to-core mapping, and the platform is configured following the mapping to start the application execution.

3. For the chosen thread-to-core mappings of concurrently executing applications, an online energy optimization technique that extends the DVFS technique presented in Section 3.2 to heterogeneous multi-cores.

4. Implementation and validation of both the offline and online steps on a hardware platform, the Odroid-XU3 [28].

The remainder of the chapter is organized as follows. Section 4.1 discusses the problem definition in detail, while Section 4.2 describes the various stages of the proposed methodology. Section 4.3 presents the experimental results and their analysis with the chosen benchmark applications and hardware platform. Finally, Section 4.4 ends the chapter with a summary.

4.1 Problem Formulation

For an application with R threads to be mapped onto a heterogeneous multi-core architecture with N clusters, i.e. N core types (C1, C2, C3, ..., CN), where each cluster contains li (i = 1, ..., N) cores, the possible number of thread-to-core mappings (TCmap) can be represented as

$$ TC_{map} = \begin{cases} \sum_{i=1}^{N} l_i + \prod_{i=1}^{N} l_i, & R \geq l_i \;\&\; R \geq \sum_{i=1}^{N} l_i \\[4pt] R \times N + R^{N}, & R < l_i \;\&\; R < \sum_{i=1}^{N} l_i \end{cases} \quad (4.1) $$

For runtime power management, modern cluster-based architectures support cluster-wide DVFS, where the cores of the same type, organized as a cluster, are set to the same V-f level from a predefined set of V-f pairs [165]. For example, the Odroid-XU3 [28], Juno r2 [29], and Mediatek X20 [30] platforms employ such an architecture. Let li be the number

of cores of type Ci in a cluster Ei and nFi be the number of available V-f levels. Then,

to incorporate the V-f levels (nFi) into thread-to-core mapping decisions, equation 4.1 is modified as follows,

$$ TC_{map\_VF} = \begin{cases} \sum_{i=1}^{N} l_i \times nF_i + \prod_{i=1}^{N} l_i \times nF_i, & R \geq l_i \;\&\; R \geq \sum_{i=1}^{N} l_i \\[4pt] \sum_{i=1}^{N} R \times nF_i + \prod_{i=1}^{N} R \times nF_i, & R < l_i \;\&\; R < \sum_{i=1}^{N} l_i \end{cases} \quad (4.2) $$

As can be seen from Equation 4.2, the initial design space is prohibitively large to explore during application execution at different time intervals, and thus it cannot be explored at runtime. To overcome this issue, the mapping exploration can be fixed to the initial design space, and the DVFS exploration can be carried out at runtime during different time intervals with the mapping fixed from the initial design space. This helps to solve the thread-to-core mapping and DVFS problems orthogonally. The problem has been tackled in this manner, as defined below:

Given an active application or a set of active applications with performance constraints, and a clustered heterogeneous multi-core architecture supporting DVFS

Find an efficient static thread-to-core mapping for each application at runtime, and then apply DVFS during the application execution to minimize the energy consumption

Subject to meeting the performance requirement of each application without violating the resource constraints (the number of available cores in the platform)

For a total of n applications, there are 2^n possible use-cases, where each use-case represents a set of active applications. Finding all the possible mappings for each use-case might not be possible within a limited time when the number of applications and/or the number of cores in each heterogeneous cluster increases. Therefore, the mappings are explored for each individual application and used in conjunction for the various use-cases at runtime, which also reduces the overhead of storing the mappings. The same measures are employed in the proposed approach.
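To make the counting concrete, the sketch below evaluates Equation 4.1 for the two-cluster case (N = 2); the second branch follows the reconstruction R·N + R^N, and the function name is illustrative:

    /* Possible thread-to-core mappings per Equation 4.1 for N = 2,
     * e.g. the Odroid-XU3 with l1 = l2 = 4.                          */
    int tc_map(int R, int l1, int l2)
    {
        if (R >= l1 + l2)               /* enough threads for all cores */
            return l1 + l2 + l1 * l2;   /* per-cluster + inter-cluster  */
        return R * 2 + R * R;           /* options capped at R threads  */
    }
    /* tc_map(32, 4, 4) = 4 + 4 + 16 = 24, matching Section 4.2.1.1.  */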

Figure 4.3: Overview of a three-layer representation of a multi-core system (left) and the proposed runtime manager (right).

4.2 Static Thread-to-Core Mapping and DVFS Approach

A three-layer view of a typical multi-core system is presented in Fig. 4.3 (left), where each layer interacts with the others to execute an application, as indicated by the arrows. The topmost layer is the application layer, which is composed of multiple applications with various workload classes. The middle layer is the operating system layer (e.g. iOS, Linux, etc.), which coordinates an application's execution on the hardware (bottom), consisting of multi-core processors. An overview of the proposed runtime management approach employed by the OS is illustrated in Fig. 4.3 (right); it has the following stages:

(a) Thread-to-core mapping (Section 4.2.1)

(b) Workload selection and prediction (Section 4.2.2.1)

(c) Workload classification and frequency selection (Section 4.2.2.2)

(d) Performance observation and compensation (Section 4.2.2.3)

A detailed discussion of each stage is presented in the following sections.

4.2.1 Thread-to-Core Mapping

In order to meet the performance requirements of the applications to be run concurrently, and to minimize energy consumption, an effective thread-to-core mapping is important. This involves choosing an appropriate number of cores and their type for each application. Since there are several thread-to-core mapping options for each application, exploring the whole mapping space at runtime is time consuming. Therefore, runtime thread-to-core mapping is facilitated through an extensive offline analysis that guides application execution towards an energy efficient point. The offline analysis evaluates performance and energy consumption for all the possible thread-to-core mappings at the maximum available frequency, which helps to meet high performance requirements. At runtime, the mapping that leads to energy efficiency while meeting the performance and resource constraints is selected for each application. The following sections present a detailed discussion of the offline analysis and the runtime mapping selection.

4.2.1.1 Offline Analysis

For each available application, the offline analysis computes all the possible thread-to-core mappings and their performance and energy consumption on a given cluster-based heterogeneous architecture. For the considered applications, the number of threads is greater than the number of cores available on the chosen hardware platform (Odroid-XU3). Therefore, following Equation 4.1, the total number of thread-to-core mappings for each application is 24 (TCmap = 4 + 4 + 4×4). Fig. 4.1 presents an example analysis for the Blackscholes and Bodytrack applications, showing the 24 mappings and their respective performance (1/execution time) and energy consumption.

The analysis results for each application are stored as design points that represent performance and energy trade-off points for all possible thread-to-core mappings at the maximum frequency. Each design point is represented as a 4-tuple (Prf, EC, nL, nb), where Prf, EC, nL, and nb denote performance, energy consumption, the number of LITTLE cores, and the number of big cores, respectively. These design points are sorted in descending order of performance to quickly identify the points meeting a certain level of performance. This helps in minimizing the runtime overhead when choosing the minimum energy point for each performance-constrained application. Fig. 4.4 shows the design points for the Blackscholes application corresponding to Fig. 4.1. Similarly, design points are stored for the other applications as an outcome of the offline analysis. The offline analysis results for all applications are presented in Appendix A.
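As an illustration, a design point and its descending-performance ordering might be held as follows (a sketch; the struct layout is an assumption):

    #include <stdlib.h>

    /* Design point: the 4-tuple (Prf, EC, nL, nb) described above.   */
    struct design_point {
        double prf;      /* performance (1/seconds)    */
        double ec;       /* energy consumption (J)     */
        int    nL, nb;   /* LITTLE and big cores used  */
    };

    /* Sort by descending performance for quick constrained lookup.   */
    static int by_perf_desc(const void *a, const void *b)
    {
        const struct design_point *x = a, *y = b;
        return (x->prf < y->prf) - (x->prf > y->prf);
    }

    /* Usage: qsort(points, count, sizeof *points, by_perf_desc);     */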


Figure 4.4: Design points representing performance and energy trade-off points for the Blackscholes (top) and Bodytrack (bottom) applications.

4.2.1.2 Runtime Mapping Selection

For a set of active applications, the runtime mapping has to identify appropriate design points for each application such that the overall energy consumption is minimized without violating the performance and resource constraints (the number of cores available in the platform). For example, in the case of the Odroid-XU3 platform, the total number of used big and LITTLE cores should not exceed four of each.

Algorithm2 describes the runtime mapping selection algorithm. The algorithm takes concurrently active applications (CAApps), their performance requirements (AppsP rfr) and design points (generated in the previous step, DP = D1, ..., Dm) as input and provides a static thread-to-core mapping Map in terms of number of used cores and their types as output for each application. It has been observed in [40] that allocating Chapter 4 Static Mapping and DVFS for Heterogeneous Multi-cores 91

Algorithm 2 Runtime thread-to-core mapping selection
Input: CAApps, AppsPrfr, DP
Output: Map for each application
 1: for each application Am do
 2:   Choose points DAm (∈ DP) such that Prf > AmPrfr;
 3: end for
 4: for each combination point CP (from CAApps) do
 5:   Compute energy consumption of CP as Energy[CP] (Equation 4.3);
 6:   Compute total number of used cores of different types (Equation 4.4);
 7:   Add CP with its Energy[CP] and Ci_UsedCores[CP] to the set CPS;
 8: end for
 9: From CPS, select the point having minimum energy consumption (minEnergy[CP]) and satisfying the resource constraint (i.e., Ci_UsedCores[CP] < available Ci_Cores);
10: For the minEnergy[CP], return the number of used cores and their types for each application as Map;

It has been observed in [40] that allocating more threads than cores does not actually give any performance benefits. Moreover, by varying the number of threads per core (1, 2, ..., t, where the value of t can vary depending on the application and resource allocation), the design space becomes prohibitively large. Therefore, to reduce the mapping complexity, the number of threads is chosen to be the same as the number of cores. For each application, the algorithm first chooses the points satisfying its performance requirement from its design points. Since the points are stored in decreasing order of performance, the points are chosen from the first entry in the storage up to the last entry meeting the performance requirement. Thus, a quick selection of points takes place. Then, for each combination point CP (formed by considering one point from each application), the energy consumption and the used cores of type Ci are computed as follows.

$Energy[CP] = \sum_{m=1}^{NrCApps} Energy_m$    (4.3)

$Ci\_UsedCores[CP] = \sum_{m=1}^{NrCApps} Ci\_cores_m$    (4.4)

where Energy_m and Ci_cores_m are the energy consumption and the number of used Ci-type cores of application m, respectively. For NrCApps active applications, a combination point contains one point from each application.
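As a concrete illustration of Equations 4.3 and 4.4 and the minimum-energy selection in line 9 of Algorithm 2, the sketch below enumerates the combination points for two concurrent applications and picks the cheapest point that respects the Odroid-XU3's 4L+4B resource constraint; all design-point values are made up for illustration.

```c
#include <stdio.h>
#include <float.h>

typedef struct { double prf, ec; int nL, nb; } dp_t;

int main(void)
{
    /* Performance-meeting design points for two concurrent apps
     * (illustrative values; real points come from offline analysis). */
    dp_t a1[] = { {0.0030, 900, 2, 2}, {0.0026, 760, 4, 0} };
    dp_t a2[] = { {0.0041, 650, 2, 2}, {0.0036, 540, 0, 2} };
    int n1 = 2, n2 = 2;

    double best_e = DBL_MAX;
    int best_i = -1, best_j = -1;

    /* Each combination point takes one design point per application;
     * Eq. (4.3) sums energies, Eq. (4.4) sums used cores per type. */
    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++) {
            double energy = a1[i].ec + a2[j].ec;   /* Eq. 4.3 */
            int usedL = a1[i].nL + a2[j].nL;       /* Eq. 4.4 */
            int usedB = a1[i].nb + a2[j].nb;
            if (usedL > 4 || usedB > 4) continue;  /* resource constraint */
            if (energy < best_e) { best_e = energy; best_i = i; best_j = j; }
        }
    }
    if (best_i >= 0)
        printf("Map: app1=(%dL+%dB), app2=(%dL+%dB), energy=%.0fJ\n",
               a1[best_i].nL, a1[best_i].nb,
               a2[best_j].nL, a2[best_j].nb, best_e);
    return 0;
}
```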

After the above computations for the different combination points, the point having the minimum energy consumption (based on a minimum value selection algorithm) and satisfying the resource constraint is chosen (line 9, Algorithm 2). Then, for this chosen point, the

number of used cores and their types (nL and nb) for each application are returned as the thread-to-core mapping Map, which is then enforced through the sched_setaffinity interface of the Linux scheduler. Our approach is generic, but a one-time offline analysis is required when the application or platform changes. In case a new application needs to be executed and its offline analysis is not available, best-effort or online learning heuristics [50] can be employed to obtain the mapping, but the achieved results might not be as efficient.
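A minimal sketch of how such a mapping can be enforced with sched_setaffinity(2) is shown below. The CPU numbering (CPUs 0-3 for the A7/LITTLE cluster and CPUs 4-7 for the A15/big cluster) is an assumption about the Odroid-XU3's Linux CPU enumeration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Pin the given process (and threads it creates afterwards) to a chosen
 * nL + nb core mapping. Assumes CPUs 0-3 = LITTLE, CPUs 4-7 = big. */
static int pin_to_mapping(pid_t pid, int nL, int nb)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int c = 0; c < nL && c < 4; c++) CPU_SET(c, &mask);      /* LITTLE */
    for (int c = 4; c < 4 + nb && c < 8; c++) CPU_SET(c, &mask);  /* big */
    return sched_setaffinity(pid, sizeof(mask), &mask);
}

int main(void)
{
    /* Example: enforce a 2L+2B mapping for the calling process. */
    if (pin_to_mapping(0, 2, 2) != 0)
        perror("sched_setaffinity");
    return 0;
}
```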

Simultaneously executing applications may affect each other due to interference in the shared memory. Thus, their execution times might increase in comparison to the scenario where each is executed individually. The stretching of the execution time depends upon the compute- and memory-intensiveness of the applications; e.g., higher stretching is expected for memory-intensive applications due to their heavy accesses to the shared memory. The degraded performance of the applications is compensated by proper V-f selections during various time intervals of execution, as described in the following subsections.

4.2.2 DVFS Approach

The DVFS approach presented for homogeneous multi-cores in Chapter 3 has been extended to heterogeneous multi-cores supporting cluster-based DVFS. The pseudocode of the proposed online DVFS is given in Algorithm 3.

4.2.2.1 Workload Selection and Prediction

In a cluster-based architecture, the V-f of an individual cluster should be decided by considering the workloads of all the cores within the cluster (W_Ei), which can be represented as:

$V\text{-}f_{E_i} = f(W_{E_i})$    (4.5)

It is important to note that, for concurrent execution of multi-threaded applications, the V-f setting of each cluster should be chosen in such a way that all the applications meet their performance requirements. Furthermore, these applications generate varying and mixed workloads due to resource sharing (e.g., L2 cache and memory), showing intra- and inter-application workload variations during their execution. Therefore, a representative V-f pair needs to be selected to guide the further stages in achieving energy efficiency without performance loss.

Assume that there are R concurrently executing threads of application(s) on cluster E_i, and w_rj is the workload of a thread r for the time interval t_{j-1} → t_j. There will be R different workloads at every time interval of execution.

Algorithm 3 Adaptive DVFS approach
Input: Application scenario, Ts, f-tab and len
Output: V-f pair for each cluster
 1: Initialisation: predicted workload (Pw) = 0, cl = cb = 0;
 2: pmu_initialize()
 3: fcur = cpufreq_get_frequency(core#)
 4: while (1) do
 5:   compute the new IPS value (IPSn) for each application
 6:   wait for Ts /*DVFS interval*/
 7:   /*Set V-f level of the A15-cluster*/
 8:   if (*cb ≠ len) then
 9:     actual workload (Aw) = Pmc_data_A15()
10:     prediction error (Pe) = Aw − Pw
11:     q = 0
12:     Find_Set_Vf(Aw, Pw, Pe, cb, core#)
13:     q = 1
14:   else
15:     wait for Ts * len /*Adaptive sampling*/
16:     *cb = 0
17:   end if
18:   /*Set V-f level of the A7-cluster*/
19:   if (*cl ≠ len) then
20:     actual workload (Aw) = Pmc_data_A7()
21:     prediction error (Pe) = Aw − Pw
22:     Find_Set_Vf(Aw, Pw, Pe, cl, core#)
23:   else
24:     wait for Ts * len /*Adaptive sampling*/
25:     *cl = 0
26:   end if
27: end while
28: function Find_Set_Vf(Aw, Pw, Pe, c, core#)
29:   Pw = EWMA(Pw, Aw, Pe, &c)
30:   classify Pw, compute δ and get fn from f-tab
31:   if q == 0 then
32:     for each application do
33:       Perf_loss = (IPSreq − IPSn)/IPSreq
34:     end for
35:     λ = Perf_loss * 100
36:   end if
37:   if (λ > x%) then
38:     fn = fn + λ * fmax
39:   end if
40:   if (fn ≠ fcur) then
41:     cpufreq_set_frequency(core#, fn)
42:     fcur = fn
43:     *c−−
44:   else
45:     *c++
46:   end if
47: end function
48: pmu_terminate()

Table 4.1: Design-time analysis of workload classes (MRPI ranges) and corresponding frequencies

                A15                                     A7
    MRPI Range        Frequency (MHz)      MRPI Range       Frequency (MHz)
    >0.036            1000                 >0.055           900
    (0.033, 0.036]    1100                 (0.05, 0.055]    1000
    (0.03, 0.033]     1200                 (0.044, 0.05]    1100
    (0.027, 0.03]     1300                 (0.04, 0.044]    1200
    (0.024, 0.027]    1400                 (0.032, 0.04]    1300
    (0.021, 0.024]    1500                 <0.025           1400
    (0.018, 0.021]    1600
    (0.015, 0.018]    1700
    (0.012, 0.015]    1800
    (0.009, 0.012]    1900
    <0.009            2000

The workload is quantified by the MRPI, where a low value represents a high load on the core and vice versa. If there are multiple workloads running within a cluster, choosing a V-f setting based on an average or a single workload may lead to violation of the performance requirements of some of the applications. Therefore, as discussed in Section 3.2.1.2, the workload that decides the DVFS setting for cluster E_i for the time interval t_{j-1} → t_j is computed as follows (lines 9 and 20 in Algorithm 3):

$W_{E_i,j} = \min(w_{1j}, w_{2j}, w_{3j}, w_{4j}, \ldots, w_{Rj}) + \delta_j$    (4.6)

The term δ, signifying the effect of memory contention on application execution time, is computed from the average MRPI of the cores within a cluster (please refer to Section 3.2.1.2 for more details). Experimental analysis shows that memory contention increases the application execution time by 1.08% to 3.80% when multiple applications are executed concurrently. Based on the above observation, the δ value is set to 4.5% of the average MRPI.
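The following sketch illustrates Eq. 4.6 with δ fixed at 4.5% of the average MRPI, as stated above; the sample per-thread MRPI values are hypothetical.

```c
#include <stdio.h>

/* Representative cluster workload per Eq. (4.6): the minimum per-thread
 * MRPI (the most compute-bound thread governs the V-f choice), plus a
 * contention term delta = 4.5% of the average MRPI. */
static double cluster_workload(const double w[], int R)
{
    double min = w[0], sum = 0.0;
    for (int r = 0; r < R; r++) {
        if (w[r] < min) min = w[r];
        sum += w[r];
    }
    return min + 0.045 * (sum / R);   /* delta = 4.5% of average MRPI */
}

int main(void)
{
    /* Illustrative per-thread MRPI samples for one DVFS interval. */
    double w[] = { 0.021, 0.034, 0.019, 0.028 };
    printf("W_Ei = %.4f\n", cluster_workload(w, 4));
    return 0;
}
```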

Proactive control of V-f is of utmost importance for online energy optimization. Therefore, the future workload needs to be predicted at t_j to set the V-f pair for the interval t_j → t_{j+1}. The same workload prediction technique presented in Section 3.2.1.3 is used (line 29, Algorithm 3), and its corresponding experimental evaluation is given in Section 4.3.

4.2.2.2 Workload Classification and Frequency Selection

It is essential to classify the predicted workload in order to identify a V-f pair that meets the performance requirements while optimizing the energy. We use hardware PMCs for periodically getting information regarding architectural parameters during application execution (lines 9 and 29, Algorithm 3). A modified version of the performance monitoring tool perfmon [149] (with access enabled to the A15- and A7-clusters) is used for accessing the PMCs, initialized and terminated through pmu_initialize() and pmu_terminate() (lines 2 and 48, Algorithm 3). Section 3.2.1.4 of Chapter 3 already presented a methodology for identifying the appropriate V-f setting for different workload classes on the A15-cluster. The same methodology has been used for the A7-cluster.

The A15 core is an out-of-order core which takes advantage of memory-level parallelism, such that part of an L2 cache miss latency overlaps with other independent L2 cache misses. Furthermore, the A15-cluster has a greater L2-cache capacity than the A7-cluster. The influence of these factors is seen during exploration, and is taken into account while choosing the MRPI ranges and their corresponding frequencies. The classified workloads and the corresponding optimal V-f settings obtained from the f-tab are given in Table 4.1, where the various MRPI ranges are mapped to frequencies through DSE and used at runtime to set the operating frequency to the desired value through the utility cpufreq-set, thereby minimizing runtime overheads (line 41, Algorithm 3).
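As an illustration, the sketch below encodes the A15 column of Table 4.1 as a lookup from predicted MRPI to a target frequency and applies it with the cpufreq-set utility. Invoking the utility through system() and using CPU 4 as a representative A15 core are simplifications of the actual implementation, which uses the cpufrequtils framework directly.

```c
#include <stdio.h>
#include <stdlib.h>

/* f-tab lookup for the A15 cluster (values from Table 4.1): map the
 * predicted MRPI class to a frequency in MHz. */
static int a15_freq_mhz(double mrpi)
{
    const struct { double lo; int mhz; } tab[] = {
        {0.036, 1000}, {0.033, 1100}, {0.030, 1200}, {0.027, 1300},
        {0.024, 1400}, {0.021, 1500}, {0.018, 1600}, {0.015, 1700},
        {0.012, 1800}, {0.009, 1900},
    };
    for (size_t i = 0; i < sizeof tab / sizeof tab[0]; i++)
        if (mrpi > tab[i].lo) return tab[i].mhz;
    return 2000;  /* MRPI < 0.009: compute-intensive, run at fmax */
}

int main(void)
{
    double predicted_mrpi = 0.025;
    int f = a15_freq_mhz(predicted_mrpi);
    char cmd[64];
    /* Apply the frequency via cpufreq-set (cluster-wide DVFS: setting
     * one A15 core, e.g. CPU 4, scales the whole cluster). */
    snprintf(cmd, sizeof cmd, "cpufreq-set -c 4 -f %dMHz", f);
    printf("MRPI %.3f -> %d MHz: %s\n", predicted_mrpi, f, cmd);
    return system(cmd) == 0 ? 0 : 1;
}
```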

4.2.2.3 Performance Observation and Compensation

Each application, especially in a concurrent execution scenario, may experience performance degradation due to variations in the application workload and interference from co-scheduled applications. Therefore, the technique presented in Section 3.2.1.6 is used for measuring each application's performance and compensating it when a performance requirement is violated. The performance loss is calculated once per DVFS interval (lines 31-35, Algorithm 3) by comparing the achieved IPS (IPSn) of each application with its required IPS (IPSreq). More details are given in Section 3.2.1.6.
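A minimal sketch of this compensation step, following lines 31-39 of Algorithm 3, is shown below. The 5% tolerance threshold x and the use of the fractional (rather than percentage) loss in the frequency bump are assumptions, since the thesis does not fix these constants here.

```c
#include <stdio.h>

/* Performance compensation (Algorithm 3, lines 31-39): if the achieved
 * IPS falls short of the requirement by more than a tolerance x, bump
 * the tentative frequency in proportion to the loss; fmax = 2000 MHz
 * on the A15 cluster. */
static double compensate(double f_next, double ips_req, double ips_now)
{
    const double fmax = 2000.0, x = 0.05;        /* 5% tolerance (assumed) */
    double loss = (ips_req - ips_now) / ips_req; /* Perf_loss */
    if (loss > x)
        f_next += loss * fmax;                   /* fn = fn + lambda*fmax */
    return f_next > fmax ? fmax : f_next;
}

int main(void)
{
    /* Example: 8% below the required IPS at a tentative 1400 MHz
     * gives 1400 + 0.08*2000 = 1560 MHz. */
    printf("compensated f = %.0f MHz\n",
           compensate(1400.0, 1.0e9, 0.92e9));
    return 0;
}
```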

Adaptive Sampling

As discussed in Chapter 3, workload variations during execution may sometimes not be significant enough to cause a change in the DVFS setting. Therefore, the approach proposed in Section 3.2.1.5 is extended to incorporate both the A7- and A15-clusters, pausing the DVFS algorithm when workload variations are negligible. To achieve this, counters cb and cl are used for tracking the workload variations on the A15- and A7-clusters, respectively. These counters get incremented when the workload at t_j is not significantly different (i.e., falls in the same MRPI range) from the workload at t_{j-1} (lines 40-46, Algorithm 3).


Figure 4.5: Effect of adaptive sampling on energy and performance for various application scenarios.

Table 4.2: Selected applications from PARSEC [15] and SPLASH [18] benchmarks

    Benchmark   App Name        Abbreviation
    PARSEC      blackscholes    bl
                bodytrack       bo
                swaptions       sw
                freqmine        fr
                vips            vi
    SPLASH      water-spatial   wa
                raytrace        ra
                fmm             fm

When cb or cl becomes equal to a configurable parameter len, the runtime adaptation on the A15- or A7-cluster (PMC data collection and subsequent processing) is paused for a period of len*Ts (lines 15 and 24, Algorithm 3). We evaluated the effect of adaptive sampling for the application workload scenarios fm, ra, fm-ra and fm-rd, where average improvements of 0.9% and 0.6% in energy and performance, respectively, are observed when adaptive sampling is enabled, as shown in Fig. 4.5.

4.3 Experimental Validation

The proposed runtime management approach for energy optimization is extensively validated on an Odroid-XU3 platform running Ubuntu with Linux kernel 3.10.96, using a number of combinations of applications from the PARSEC [15] and SPLASH [18] benchmarks. The selection of these benchmarks is mainly motivated by the availability of a diverse set of workloads (exhibiting different memory behaviours, data partitions and data sharing patterns) that can be used to evaluate concurrency and parallel processing. Moreover, they support changing the number of threads for each application without recompilation. The

Table 4.3: Approaches considered for comparison

    Reference            Approach                                        Abbreviation
    [50]                 Exhaustive Search-based                         ES
    [10, 125–127, 168]   Workload Memory Intensity based
                         thread-to-core mapping                          WMI
    [169, 170]           HMP + Ondemand                                  HMPO
    [169, 170]           HMP + Performance                               HMPP
    [169, 170]           HMP + Conservative                              HMPC
    [169, 170]           HMP + Interactive                               HMPI
    proposed             Inter-cluster Thread-to-core Mapping and DVFS   ITMD

details of the Odroid-XU3 platform are already provided in Section 2.4.1 of Chapter 2. We selected applications from PARSEC and SPLASH based on variations in their MRPI values. Table 4.2 lists the considered applications and their abbreviations. The applications are taken in various combinations to form sets of simultaneously active applications. To have better predictability and to ensure that each application meets its performance requirement, the system is not overloaded, i.e., no two applications share cores. This ensures the scheduler does not delay application execution. If applications arrive at different times, the later arrivals can be mapped by taking into account the available resources and the current status of the existing applications, computed as their remaining time to completion. If existing applications are going to complete soon, the resources they will free can be considered when deciding the mapping of the current application; otherwise, it should be decided based on the currently available resources. This also avoids the overhead of data transfer for the existing applications, as their mapping is not disturbed.

The presented approach is implemented as a user-space application using the Perfmon2 [167] and cpufrequtils frameworks (more details are given in Section 2.4.3). The Perfmon2 tool enables user-space access to the performance monitoring unit (PMU), and cpufrequtils helps in setting/getting the operating frequencies. To bind an application to a core or a set of cores, the standard Linux API sched_setaffinity(2) is used, which controls the CPU affinity of processes. Energy consumption is calculated as the product of average power consumption (dynamic and static) and execution time. This includes both the core and memory energy consumption of all the software components (the proposed algorithms (Algorithms 2 and 3), profiled data, OS, applications, etc.). Similar to the experiments presented in the earlier chapters, the results are collected by running each scenario ten times; finally, the average values (energy and performance) are computed for the comparison.
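For reference, a sketch of how per-cluster power can be sampled on the Odroid-XU3 and integrated into energy is given below. The INA231 sysfs path is an assumption, since the exact i2c bus and address vary with the kernel version and board revision; only the energy = average power × time relation comes from the text.

```c
#include <stdio.h>
#include <unistd.h>

/* Sample the on-chip INA231 power sensor and integrate power over time
 * to obtain energy. The path below is an assumed example for the
 * A15-cluster sensor; it must be checked against the running system. */
#define A15_POWER_NODE "/sys/bus/i2c/drivers/INA231/4-0040/sensor_W"

static double read_power_w(void)
{
    double w = 0.0;
    FILE *f = fopen(A15_POWER_NODE, "r");
    if (f) { fscanf(f, "%lf", &w); fclose(f); }
    return w;
}

int main(void)
{
    const int period_ms = 200;            /* Ts, the DVFS interval */
    double energy_j = 0.0;
    for (int i = 0; i < 50; i++) {        /* 10 s sampling window */
        energy_j += read_power_w() * (period_ms / 1000.0);
        usleep(period_ms * 1000);
    }
    printf("A15-cluster energy over window: %.2f J\n", energy_j);
    return 0;
}
```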

The proposed runtime management approach is compared against the various approaches given in Table 4.3 to show the energy savings achieved while satisfying the performance constraints. As part of this, the state-of-the-art solution for runtime resource management on big.LITTLE, the Heterogeneous Multi-Processing (HMP) scheduler [169] with various DVFS governors (ondemand, performance, conservative and interactive), is considered. HMP is a patch to the standard scheduler in the Linux kernel which dynamically dispatches threads to the big and LITTLE clusters according to their characteristics.

Furthermore, a mapping approach which allocates an application's threads onto only one type of core(s) based on memory-intensiveness [10, 125–127, 168] is considered for the comparison. As concurrent execution of multi-threaded applications is not taken into account by the above approaches, the following changes are made for a fair comparison.

• In the single-application scenario, a memory-intensive application's threads are mapped onto the LITTLE cluster.

• In the multiple-application scenario, applications are sorted based on their memory-intensiveness; the one with the highest memory intensity is mapped onto the LITTLE cores, and the remaining applications are allocated onto the big cluster with equal numbers of cores.

The proposed approach is also compared against a recently published exhaustive search-based (ES) approach [50]. For this, the thread-to-core mappings produced by the proposed approach are used and the frequencies are varied (247 design points; 200 MHz - 1400 MHz on the LITTLE cluster and 200 MHz - 2000 MHz on the big cluster) for different application scenarios. Then, the configuration (number of cores and their frequencies) having the lowest energy consumption while satisfying the performance requirements is selected. Dimitrios et al. [171] presented a process variation- and workload-aware thread-to-core mapping approach for heterogeneous multi-core systems. However, this approach was not considered for direct comparison with the proposed approach for the following reasons. It is designed to maximize throughput under both performance and power constraints, while the proposed approach minimizes energy consumption under performance constraints. Moreover, its system architecture is different from the one (a cluster-based architecture) used in this work.

To show the effectiveness of the proposed methodology compared to various existing approaches in terms of energy savings and performance, single- and multiple-application scenarios are considered for the validation. Moreover, the validation of the workload prediction is also presented.

4.3.1 Energy Savings and Performance Comparison

4.3.1.1 Energy Savings

Table 4.4 presents the resource combinations achieved by the ITMD's mapping approach at runtime for various application scenarios. The mapping approach takes the individual application performance requirements into account, and chooses the points that minimize the total energy consumption from the sorted profiled data without violating the resource constraints. As discussed earlier, the selected thread-to-core mapping is not altered during the application execution.

Table 4.4: Resource combinations achieved by the proposed mapping approach at runtime for different application scenarios.

    App scenario   App combination   Resource combination
    single         bl                4L+4B
                   bo                4L+4B
                   sw                4L+4B
                   fr                4L
                   wa                4L+4B
                   ra                3L+4B
    double         bl-bo             2L+2B : 2L+2B
                   bl-sw             4B : 4L
                   fr-sw             4L : 4B
                   wa-bl             2L+2B : 2L+2B
                   wa-bo             4L+3B : 1B
                   wa-ra             4L+3B : 1B
    triple         bl-bo-sw          3B : 1B : 4L
                   bl-bo-fr          3B : 1B : 4L
                   sw-bo-fr          4L : 1B : 3B
                   bl-sw-fr          3B : 1B : 4L
                   wa-ra-fm          3L+2B : 1B : 1L+1B
                   wa-ra-vi          2L+2B : 1B : 2L+1B


In a single-application scenario, there is only one active application. Fig. 4.6 shows a comparison of the adopted approach with existing techniques in terms of energy consumption. First, an energy-efficient thread-to-core mapping is determined to satisfy the given performance requirement and resource availability. The experimental observation shows that, for most applications, the adopted thread-to-core mapping approach tends to choose all available cores, except for fr and ra (see the single-application scenario in Table 4.4), as this is the energy-efficient point. Afterwards, the ITMD's online DVFS approach takes control of the frequency scaling to minimize the wasted cycles in case of memory-intensive workloads. It periodically samples the PMC data and uses a proactive V-f setting strategy based on the workload prediction. From Fig. 4.6, it can be observed that ITMD outperforms all the existing approaches, which use the HMP scheduler for thread-to-core mapping with various Linux governors for DVFS, as well as WMI. Except for fr and ra, the energy savings are mostly due to DVFS, as both the HMP and ITMD approaches have similar thread-to-core mappings. The higher energy savings in the case of fr are because of mapping its threads to the power-efficient A7 (LITTLE) cores, which is the same as WMI does. It also has a long execution time that benefits from periodic DVFS. On average, the ITMD approach achieves 25%, 20%, 27%, 22%, and 33% energy savings while meeting performance requirements compared to HMPO, HMPC, HMPP, HMPI and WMI, respectively. Furthermore, as ITMD applies online DVFS at regular intervals, it provides better energy savings (17%) than the exhaustive search-based approach (ES).


Figure 4.6: Comparison of the ITMD approach with reported approaches for a single active application.


Figure 4.7: MRPI and frequency at different time intervals of the execution of application fr for various approaches.

Moreover, Fig. 4.7 shows the adaptiveness of the proposed online DVFS technique to workload variations for the application fr. A high MRPI leads to scaling down the frequency, by which the ITMD approach minimizes the power consumption, whereas HMPO and HMPC run at the maximum frequency. This is due to the fact that, whilst the application is memory-intensive, it places a high load on the processor cores as far as the load measured by the kernel is concerned. Therefore, these governors select the highest frequency even when it offers no improvement in performance.


Figure 4.8: Comparison of ITMD approach with reported approaches for two active applications.

In the case of a multiple-application scenario, two or more active applications contend for resources at a given moment to meet their requirements. Such scenarios can be observed in a mobile phone when the user runs several applications at the same time, e.g., an mp3 decoder to listen to music while downloading a file. Sets of two applications from Table 4.2 are considered to stress the effectiveness of the adopted approach in choosing the resources and V-f pairs that minimize energy consumption while meeting each application's performance requirement. Due to limited resource availability and contention, the energy savings are comparatively lower than in the single-application scenario. Fig. 4.8 gives the energy consumption for the various approaches. On average, the proposed approach achieves 13%, 14%, 10%, 20%, 15%, and 23% energy savings while meeting performance requirements compared to ES, HMPO, HMPC, HMPP, HMPI and WMI, respectively. Moreover, the chosen resource combinations are presented in row two of Table 4.4.

To further validate the ability of ITMD to adapt to concurrent execution of multiple applications, a three-application scenario is also considered. An increase in the number of active applications leads to a reduced solution space for choosing an energy-efficient thread-to-core mapping. As mentioned before, this is caused by the resource constraints (see Table 4.4 for the resource combinations) and the increased contention due to concurrent workloads and the demand for meeting their requirements. Furthermore, the online DVFS will have little choice to scale down the frequency, as it has to satisfy the performance requirements of different dynamic workloads (e.g., compute and memory). This results in decreased energy savings compared to the single- and two-application scenarios. Fig. 4.9 presents the comparison of the adopted methodology with various previous techniques. On average, ITMD achieves 11%, 12%, 9%, 16%, 14%, and 30% energy savings while meeting performance requirements compared to ES, HMPO, HMPC, HMPP, HMPI and WMI, respectively.


Figure 4.9: Comparison of ITMD approach with reported approaches for three active applications.


Figure 4.10: Percentage of energy savings achieved by the proposed ITM and ITMD, respectively.


Scenarios with four or more applications turn out not to be feasible because of high resource contention, which leads to the given requirements not being met (some applications were terminated by the out-of-memory (OOM) killer daemon when multiple memory-intensive applications were run). This is explained further in the following section. On average, the adopted approach achieves energy savings of up to 33% compared to existing techniques.

4.3.1.2 Breakdown of Energy Savings

The individual contributions of the thread-to-core mapping (ITM) and the online DVFS to the energy savings are computed by disabling and enabling DVFS, respectively. Further, the percentage energy savings are calculated by comparing against HMPP, as shown in Fig. 4.10 for different application scenarios. On average, ITM achieves energy savings of 9% compared to HMPP.


Figure 4.11: Performance improvement/degradation of the adopted approach and HMPP.

Further, when the proposed online DVFS is applied on top of ITM (ITMD), an extra 11% of energy savings is obtained. It can be observed from Fig. 4.7 that the workload varies over time, for example from low MRPI to high MRPI (at the 8th and 10th time intervals). As the online DVFS is applied at regular intervals, the ITMD approach exploits these variations to achieve energy efficiency even for a compute-intensive application.

4.3.1.3 Performance

As discussed earlier, performance requirements are defined for each application. The proposed approach always tries to meet the performance requirement of each application, i.e., finishing the execution within the stipulated time. To validate the adaptability of ITMD to the performance requirements, the achieved performance is compared against the given performance requirement, computed as a percentage improvement, for all the application scenarios. The average percentage improvement in each scenario is presented in Fig. 4.11 in comparison with HMPP (which is performance requirements-unaware), as it maximizes the performance. The figure shows that ITMD always outperforms HMPP, even when there is high contention due to more active applications (e.g., the three-application scenario). Moreover, in some cases, the adopted technique achieves up to 15% improvement over the given performance requirements, whereas HMPP achieves 10%. Additionally, the following observation can be made from Fig. 4.11. As the number of active applications increases, meeting high performance requirements becomes infeasible (see the three-application scenario in Fig. 4.11) due to resource constraints and interference. Therefore, choosing a lower performance requirement or a platform with more resources may guarantee meeting the requirements while running a higher number of active applications.

Figure 4.12: Workload prediction using EWMA (actual vs. predicted MRPI over time intervals) for three different application scenarios with one, two and three active applications (top to bottom): bo (RMSE = 1.3%), bl-bo (RMSE = 1.9%), and bl-bo-sw (RMSE = 2.4%).

To further substantiate the need for performance requirements-aware approaches, the number of violations is recorded after disabling the performance requirements-aware property of the proposed approach, resembling the technique presented in [50]. As a result, the mapping algorithm produces thread-to-core mappings that minimize the total energy consumption but may not satisfy the performance constraints. For the single-application scenario, the average percentage of performance requirement-violating mappings is nearly zero. This is because using all the cores (4L and 4B) leads to minimum energy and better performance for all the applications (except for fr (4L)). In the case of multiple applications executing concurrently, performance violations are significantly higher: the average percentages of performance requirement-violating mappings are 98.2% and 99.6% for the two- and three-application scenarios, respectively.

4.3.2 Workload Prediction

The accuracy of the predicted workload, as compared to the actual workload of the prior time intervals, depends on the smoothing factor γ (Eq. 3.4 in Section 3.2.1.3). The optimal value of γ was experimentally obtained by sweeping it between 0.1 and 1 and observing the corresponding workload mispredictions (under/over) for various application workloads. A value of 0.6 is used for all the experiments, as it resulted in relatively accurate workload prediction, similar to [35]. Fig. 4.12 shows the actual and predicted values for three different application scenarios, along with the percentage root mean square error (up to 2.4%). The figure shows that the prediction error increases slightly with the number of active applications, which is due to increased dynamic workload variations.
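A minimal sketch of this predictor, assuming the standard EWMA form with γ = 0.6 (the exact form of Eq. 3.4 is defined in Chapter 3), is shown below with made-up MRPI samples.

```c
#include <stdio.h>

/* EWMA workload prediction: the next-interval MRPI is a weighted blend
 * of the last observation and the previous prediction, with smoothing
 * factor gamma = 0.6 as used in the experiments. */
static double ewma_predict(double prev_pred, double actual, double gamma)
{
    return gamma * actual + (1.0 - gamma) * prev_pred;
}

int main(void)
{
    const double gamma = 0.6;
    double actual[] = { 0.0150, 0.0154, 0.0149, 0.0158, 0.0151 };
    double pred = actual[0];              /* bootstrap with first sample */
    for (int j = 1; j < 5; j++) {
        pred = ewma_predict(pred, actual[j - 1], gamma);
        printf("t%d: predicted=%.4f actual=%.4f error=%+.4f\n",
               j, pred, actual[j], actual[j] - pred);
    }
    return 0;
}
```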

4.3.3 Overheads of the Proposed Approach

4.3.3.1 Runtime Overhead

The runtime overhead of the adopted approach includes the time for finding the thread-to-core mapping (T_map) and the V-f pair (T_V-f), which can be represented as

$T_o = T_{map} + T_{V\text{-}f}$    (4.7)

$T_{V\text{-}f} = \dfrac{T_{ex} - r \cdot len \cdot T_s}{T_s} \times [VfS_o + PMC_o + Proc_o]$    (4.8)

where T_o, T_ex, r, VfS_o, PMC_o, and Proc_o represent the total overhead, the execution time, the number of times the adaptation is paused, and the overheads associated with V-f switching, PMC collection, and the remaining processing steps (involving len and Ts, shown in Algorithm 3), respectively.


Figure 4.13: Runtime overhead of the proposed approach

T_map depends on the implementation of Algorithm 2; in our case it is up to 1.6 ms (averaged over various application scenarios). Moreover, T_V-f is about 300 µs, which is 0.15% of Ts (200 ms). Fig. 4.13 illustrates the total runtime overhead, computed as a percentage of the total execution time, for eight application scenarios. The runtime overhead for the application scenario bl-bo-fr, which has a long execution time of 1053 s, is ∼0.17%, which is minimal. In contrast, commonly used pure online learning-based approaches have significant overheads (up to 216 s for learning and 1 s for the subsequent stages in a single-application scenario [35]), which are further aggravated by dynamic workload variations causing frequent re-learning. Therefore, the scalability of such approaches, in comparison to the proposed technique, is limited for multi-core platforms executing multiple multi-threaded applications concurrently.

4.3.3.2 Offline Analysis Overhead

As discussed earlier, the profiled data of each application contains the performance (1/execution time), the energy consumption, and the numbers of big and LITTLE cores for each design point. The total number of design points for each application is 24, which results in a small storage overhead of 770 bytes. The energy overhead due to storing the profiled data is already included in the energy consumption values reported in the previous sections.

4.4 Discussion

This chapter presented a runtime management system for concurrent execution of multiple multi-threaded applications on heterogeneous multi-core platforms that contain different types of cores. It first selects a thread-to-core mapping based on the performance requirements and resource availability. Then, it applies online adaptation by adjusting the V-f levels to achieve energy optimization without trading off application performance. For thread-to-core mapping, offline profiled results are used, which contain the performance and energy characteristics of the applications when executed on the heterogeneous platform using different types of cores in various possible combinations. For an application, the thread-to-core mapping process defines the number of used cores and their types, which are situated in different clusters. The online adaptation process classifies the inherent workload characteristics of the concurrently executing applications, incurring a lower overhead than existing learning-based approaches, as demonstrated in this chapter. The classification of the workload is performed using the metric Memory Reads Per Instruction (MRPI). The adaptation process proactively selects an appropriate V-f pair for a predicted workload. Subsequently, it monitors the workload prediction error and the performance loss, quantified by Instructions Per Second (IPS), and adjusts the chosen V-f to compensate. The proposed runtime management approach is validated on a hardware platform, the Odroid-XU3, with various combinations of multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an average improvement in energy efficiency of up to 33% compared to existing approaches, while meeting the performance requirements.

The next chapter extends this work by incorporating adaptive thread-to-core mapping to improve adaptability and to eliminate the dependency of the runtime management system on offline application analysis results.

Chapter 5

Adaptive Mapping and DVFS for Heterogeneous Multi-cores

The previous chapter presented an energy-efficient static thread-to-core mapping and DVFS approach targeting heterogeneous multi-cores. However, that approach relies on offline application profiling and does not adapt to dynamic application arrival and completion. As detailed in the following sections, such approaches are not efficient in handling unknown applications and in exploiting the available resources under varying runtime execution scenarios. Therefore, in this chapter, an approach to runtime management that addresses the aforementioned problems is presented. As detailed in Section 2.2, modern mobile platforms based on Multiprocessor System-on-Chips (MPSoCs) contain a greater number of heterogeneous cores to support highly diverse and varying workloads (e.g., the Odroid-XU3 [28] and Mediatek X20 [30]). Such platforms often execute applications concurrently, which simultaneously contend for system resources and typically exhibit varying resource demands over time [122]. Each application may have different performance requirements and exhibit various workload phases during its execution [20]. To adapt to such dynamic scenarios, MPSoCs offer an increasing number of resource configurations, such as enabling and disabling cores of different types, defining the thread-to-core mapping for a multi-threaded application, and setting dynamic voltage and frequency scaling (DVFS) operating points.

As discussed in Chapter 4, the processes of thread-to-core mapping and setting DVFS levels play a crucial role in exploiting the system properties such that applications can meet their, often diverse, demands on performance and energy efficiency [122]. In general, for each application, the management process first finds a thread-to-core mapping, and then a core DVFS level, by inspecting the workload profile while satisfying the performance requirement. This problem becomes much more complex when dynamically mapping concurrently executing applications, due to contention for resources, and when the mapping is coupled with DVFS, i.e., energy-efficient allocation of processing cores and selection of DVFS settings [57, 124].

Based on the literature review presented in Section 2.3, the reported approaches for solving the above problem fall into three categories: 1) offline, 2) online, and 3) hybrid approaches. Several offline approaches have been proposed targeting different application domains and hardware architectures [43, 108]. These typically use computationally intensive search methods to find the optimal or near-optimal mapping for the applications that may run on the system. Conversely, online approaches [20, 172–174] must not be computationally intensive, as they are required to make efficient application mapping/DVFS decisions at runtime. Therefore, these techniques generally use heuristics to find a suitable platform configuration. Design-time approaches usually find solutions of higher quality than online techniques, due to extensive design space exploration of the underlying hardware and applications. To address the drawbacks of pure offline and online approaches, various hybrid approaches [5, 10, 19, 40, 50, 52, 108], which use offline analysis to make runtime decisions based on the current state of the system, have been proposed.

However, the review of the prior art given in Section 2.3 shows that the existing approaches targeting heterogeneous MPSoCs have the following shortcomings. They use heavy application-dependent profile data and thus are not efficient in managing dynamic workloads when unknown applications with different performance constraints are executing concurrently. For example, the number of different frequency and core configurations for the Odroid-XU3 platform [28] (four big and four LITTLE cores that can operate at 13 and 19 different frequencies, respectively) is 4080 ((4×13×4×19) + (4×13) + (4×19)). Most importantly, none of these approaches performs adaptations (changing the mappings and/or DVFS settings) at application arrival/completion or under performance variations.

To this end, this chapter presents an energy-efficient adaptive mapping approach coupled with DVFS for performance-constrained multi-threaded applications executing on heterogeneous multi-cores. Our methodology selects an energy-efficient resource combination (number of cores and their type) that meets the application's performance requirement. This is achieved by employing performance prediction models for resource combination enumeration and selection. Furthermore, the application workload, performance, and status (finished or newly arrived) are monitored for adaptive resource allocation and DVFS. The key contributions of this chapter are:

1. A performance prediction model that has a maximum percentage error of 8.1%, which is 7.9% lower than the previously reported model [19].

2. An online energy-efficient mapping approach that allocates processing cores to application(s) based on performance constraints without using any application-dependent offline results.


Figure 5.1: A motivational example showing three possible runtime execution scenarios (b, c & d) when a system having two types of cores, Type-1 and Type-2, starts with executing three performance-constrained applications (a). Cores running the same application are encircled with a line of the same color. App1, App2, App3, and App4 represent user applications.

3. To adapt to application arrival or completion times, and workload/performance variations, an adaptive approach that adjusts the existing thread-to-core mappings and DVFS settings during application execution is presented.

4. Experimental validation of the proposed approach on the Odroid-XU3 [28], using several multi-threaded applications from the PARSEC [15] and SPLASH [18] benchmarks.

The remainder of this chapter is organised as follows. The next section presents a motivational example, while Section 5.1 presents the problem definition for this work. A detailed description of the proposed adaptive mapping and DVFS approach is given in Section 5.2. The experimental setup and validation of the approach are explained in Section 5.3. Finally, Section 5.4 summarises the contributions of this chapter.

Motivation

A heterogeneous computing system with two types of cores, executing multiple performance-constrained applications concurrently, is illustrated in Fig. 5.1. Dotted squares colored in white/black represent processing cores. For example, such scenarios could be observed when a smartphone user simultaneously runs a music player, Facebook, a background email service, a file download, etc. As shown in Fig. 5.1(a), the initial mapping for each application (App1, App2, and App3) is decided based on its performance constraints while considering energy as an optimization goal. This requires finding an energy-efficient resource combination (number of cores and their type). While these applications are executing, there are primarily three possible runtime execution scenarios: i) any application(s) may finish executing, ii) an application(s) may experience performance degradation due to contention for shared resources, and iii) a new application(s) may arrive into the system. In the first case, if application App1 finishes execution, its resources can be allocated to App2 and App3, which may help them execute faster (and hence put them into a low-power mode sooner), as shown in Fig. 5.1(b). This may result in increased performance and lower energy consumption, because power is dissipated for a shorter duration.

For case ii), as reported by previous work [53, 124], applications go through different workload phases during their execution. For example, some workload phases could be more compute-intensive than others, or vice versa. Furthermore, in the case of concurrent execution, an application may experience interference from other applications due to shared resources such as the last-level cache, memory, etc. All the factors above culminate in variation in an application's workload, subsequently leading to variation in application performance. Therefore, the application's performance has to be monitored periodically, and appropriate action (changing the DVFS setting or remapping) taken to avoid/minimize performance violations. Fig. 5.1(c) demonstrates such a case, where more resources are allocated to App2 to mitigate the performance degradation experienced during runtime. If there are no free cores available, as in our case, the cores are taken from the over-performing App3.

For case iii), considering the processing capabilities of the underlying hardware, the user may launch a new application while other applications are running. If all the processing cores have been allocated to the already running applications, the runtime management software should check whether it is possible to re-adjust the current mapping and allocate resources to the newly arrived application without violating performance constraints. This is shown in Fig. 5.1(d), where App4 is added to the system while App1, App2, and App3 are executing. The resources of the over-performing applications App1 and App3 are allocated to App4 while keeping the same number of cores for App2.

As discussed before, existing approaches do not consider the above execution scenarios (cases i, ii and iii) for adaptation; moreover, they also depend on extensive offline characterisation and/or instrumentation of the chosen applications. As experimentally demonstrated in Section 5.3, adaptation at application arrival and completion, and under workload/performance variations, leads to better utilisation of the system resources, and higher energy efficiency and performance.

5.1 Problem Formulation

Earlier studies have shown that the thread-to-core mapping problem alone is NP-complete [122]. Therefore, combining it with DVFS would increase the complexity of the mapping problem due to the huge design space, thereby making the runtime management significantly inefficient. Similarly, if the number of cores, the heterogeneity, or the number of frequency levels increases, the design space becomes too large to solve at runtime, and even for offline analysis [124]. To address this, following the literature, thread-to-core mapping and DVFS are considered separately to minimize the runtime overheads. The problem definition is as follows. Given a set of performance-constrained applications to be executed concurrently or at different moments in time on a heterogeneous MPSoC supporting DVFS, find an initial energy-efficient thread-to-core mapping for each application and then apply DVFS and/or adaptive remapping at runtime to minimize the energy consumption, if any of the following occur:

• An existing application finishes or a new application arrives into the system

• The performance constraints of any running applications are violated

• The workload of an application varies during execution (e.g., from compute-intensive to memory-intensive)

subject to meeting the performance requirement of each application without violating the resource constraints (the number of available cores in the platform).

5.2 Adaptive Thread-to-Core Mapping and DVFS

This section presents a detailed discussion of the proposed dynamic thread-to-core mapping and DVFS methodology. An outline of the proposed approach is presented in Fig. 5.2, with the corresponding pseudocode in Algorithms 4 and 5. The figure also shows the proposed approach as a runtime manager sitting alongside the OS in a three-layer representation of a multi-core system. The proposed approach mainly contains the following steps:

1. Runtime Data Collector (Section 5.2.1.1)

2. Performance Predictor (Section 5.2.2)

3. Resource Combination Enumerator (Section 5.2.3)

4. Resource Selector (Section 5.2.4)

5. Resource Allocator/Reallocator (Section 5.2.5.1) 114 Chapter 5 Adaptive Mapping and DVFS for Heterogeneous Multi-cores Power iOS Multi T Operating System Core Core Scaling DVFS Mapping 11 / Linux Sensors / A - .. Temperature Temperature core CPU core 1

T Setting Hardware 1 Applicationlayer k ... Perf T

m . . Layer 1 constraints A ( Runtime Runtime Manager m .. Perf Temperature OS Monitoring Monitoring Unit Performance Performance T Power ) mk ormance

Layer

RTM Appn Hardware Platform App2 PerfApp1, PerfApp2, …, PerfAppn App1 ROI_starts();

Online Identification of Energy-efficient Mapping

Runtime Data Collector Resource Combination Performance Predictor Initial Mapper Enumerator

Resource Manager/Runtime Adaptation DVFS governor Resource Allocator & Reallocator Resource selector Performance Monitor

Figure 5.2: Illustration of the various steps in the proposed adaptive thread-to-core mapping and DVFS approach (bottom) and its placement in the three-layer representation of a multi-core platform (top).

6. Performance Monitor (Section 5.2.5.2)

7. DVFS Governor (Section 5.2.5.3)

The arriving performance-constrained applications are added to a queue, called Apps, and the initial mapper allocates a processing core to each application in the queue. Meanwhile, the Runtime Data Collector periodically gathers the necessary runtime information through performance monitoring counters (PMCs) for the performance predictor, DVFS governor and performance monitor. The Performance Predictor estimates the application performance, using instructions per cycle (IPC) or instructions per second (IPS), on various types of cores by using the runtime information collected on a single type of core. The estimated performance of an application on the various types of cores is then utilised for enumerating the list of resource combinations (the number of cores and their type) that meet the performance constraints of the application (Resource Combination Enumerator). Next, the Resource Selector picks the resource combination that would lead to the lowest energy consumption. Finally, the Resource Manager keeps track of the variations in application performance, workload and completion/arrival times to decide on allocating cores and voltage-frequency settings. The following discusses each step of the proposed methodology in detail.

Table 5.1: Parameters used in the proposed approach

    Number of Active Cores          Frequency of the Cores
    L1 I-Cache Misses               L1 D-Cache Misses
    L2 Cache Misses                 Instructions Retired
    Branch Misses                   CPU Cycles
    Per-Core CPU Utilisation        Memory Reads Per Instruction

5.2.1 Online Identification of Energy-efficient Mapping

The proposed approach first identifies an energy-efficient thread-to-core mapping for each performance-constrained application in a concurrent execution scenario without using offline profiled results. This process involves the following steps.

5.2.1.1 Runtime Data Collector

The proposed approach requires various parameters for making runtime decisions while concurrent applications are executing on the platform. These parameters are collected by the Runtime Data Collector. The list of parameters used in this work is given in Table 5.1. Of these parameters, CPU Cycles, Instructions Retired, and L2 Cache Misses are periodically collected to measure Memory Reads Per Instruction (MRPI), per-core CPU utilisation, and IPC or IPS, for detecting the workload and/or performance variations by the DVFS governor and the Performance Monitor (details are given in Section 5.2.5). The performance monitoring unit (PMU) of the processor is initialized to monitor the above parameters through the routine PMU_initialize() (line 1, Algorithm 4). Note that all the parameters are collected only when an application arrives into the system, for use by the Performance Predictor. When an application arrives, the Initial Mapper adds it to the application queue and allocates a free core to the application to start its execution (lines 3-9, Algorithm 4). As application execution begins with the serial section, the initial mapper tends to allocate a big core to the application. However, if an application's serial section is memory-intensive, as measured by MRPI, the application is migrated to a LITTLE core, as this results in greater power efficiency [175] (line 10, Algorithm 4). Data collection starts in the region of interest (ROI) (indicating the parallel code in the application), as that is when the actual computation starts and the benefit of allocating more than one processing core can be seen [15]. This is accomplished by notifying the Runtime Data Collector through the ROI_starts() routine when the ROI of an application starts (lines 12-15, Algorithm 4).
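The thesis collects these parameters with the modified Perfmon2 tool. As a self-contained stand-in, the sketch below uses the raw Linux perf_event_open(2) syscall to read instructions retired and CPU cycles for the calling process and derive IPC; it is not the Perfmon2-based implementation itself.

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open one hardware counter for the calling process on any CPU. */
static int open_counter(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.exclude_kernel = 1;
    /* pid=0: this process; cpu=-1: any CPU; group_fd=-1; flags=0 */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd_ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int fd_cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    if (fd_ins < 0 || fd_cyc < 0) { perror("perf_event_open"); return 1; }

    volatile double x = 1.0;                 /* some work to measure */
    for (long i = 0; i < 10000000; i++) x *= 1.0000001;

    long long ins = 0, cyc = 0;
    read(fd_ins, &ins, sizeof(ins));
    read(fd_cyc, &cyc, sizeof(cyc));
    printf("IPC = %.2f (%lld instructions, %lld cycles)\n",
           cyc ? (double)ins / cyc : 0.0, ins, cyc);
    close(fd_ins);
    close(fd_cyc);
    return 0;
}
```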

5.2.2 Performance Predictor

To allocate resources in a heterogeneous multi-core system such that the performance requirements of an application are met, it is essential to know how the application performs on the various types of cores [175]. This can be achieved either by executing the application on all core types in a platform, or by estimating the performance of the application on different core types while running it on only one core type. The former approach requires the migration of the application across the various core types. As observed experimentally in [175], the migration cost across clusters on a big.LITTLE architecture is relatively high: 2.10 ms to move a thread from a LITTLE cluster to a big cluster, and 3.75 ms to move from a big cluster to a LITTLE cluster. This overhead grows with the number of cores and core types. Considering the runtime overheads and scalability, this is not an efficient approach, although it would not need offline analysis, as everything is measured at runtime. On the other hand, a performance prediction-based approach avoids thread migration by using performance models built offline or online. Previous approaches have shown that learning performance models at runtime makes an approach non-scalable and has its own overheads in terms of execution time and power [50, 124]. Therefore, the performance models are built at design time through a generalized methodology, which can easily be adapted to a new platform/architecture.

Performance models

Application performance is usually measured in terms of IPS or IPC, and the relative improvement in the performance is referred to as speedup. The speedup η is defined as follows,

$\eta = \dfrac{IPC_{CoreType1}}{IPC_{CoreType2}}$    (5.1)

where IPC_CoreType1 and IPC_CoreType2 are the IPCs of the application achieved on core type-1 and core type-2, respectively. The performance model estimates the speedup, which is used for computing the application performance on a second core type (IPC_CoreType2) by running the application on one core type, collecting the runtime parameters, and measuring its performance (IPC_CoreType1) (line 16, Algorithm 4).

Algorithm 4 Proposed Dynamic Thread-to-core Mapping
Input: Applications and performance constraints (Apps)
Output: ∀Apps, mappings and DVFS settings
 1: PMU_initialize() // initialises PMCs
 2: while (1) do
 3:   if (NewApp) then
 4:     Update the Application Queue 'Apps';
 5:     NewApp = 0;
 6:   end if
 7:   for ∀i ∈ Apps do
 8:     if (unmapped) then
 9:       Allocate a free core 'l' to 'i' and execute;
10:       Measure MRPI and move onto an appropriate core (j);
11:       /*Data collection for performance model*/
12:       Wait until ROI begins;
13:       pmcs = pmcs_data_collect(j);
14:       f = cpufreq_get_freq_hardware(j);
15:       pmcs.push_back(f);
16:       η = speedup_estimate(pmcs, j);
17:       Compute the possible resource combinations and the energy-efficient resource combination th (Eq. (5.4), (5.5) & (5.6));
18:       Allocate resources as per th;
19:     end if
20:   end for
21:   /*Distribute the free cores to active applications*/
22:   Sort the applications by η (list);
23:   while (freecores > 0) do
24:     Increase the resources of app i ∈ list by y;
25:     freecores = freecores − y;
26:     i++;
27:   end while
28:   /*Application performance and workload adaptation*/
29:   If the application workload changes, call DVFS();
30:   for i ∈ Apps do
31:     if App 'i' under-performs then
32:       Increase frequency or allocate more cores;
33:     end if
34:   end for
35:   /*Application completion detection and adaptation*/
36:   if p ∈ Apps finishes then
37:     Distribute the freed resources of 'p' to under-performing apps;
38:     Allocate the remaining resources to apps equally by sorting them based on η;
39:   end if
40:   /*if stop_governor is set, the process exits*/
41:   if (stop_governor) then
42:     PMU_terminate(); //Terminates PMC collection
43:     exit(0);
44:   end if
45: end while

To build the performance models, three steps are followed. The first step is identifying the parameters/metrics that capture the most performance-limiting factors by analysing the correlation between various metrics and the speedup. Modern processors support monitoring of various architectural events which can be used for analysing performance, power, etc. However, not all metrics that contribute to performance can be monitored simultaneously, due to the limited number of hardware PMCs provided by the platform. For example, on an Odroid-XU3/XU4, the Cortex-A15 processor allows monitoring of seven events, including the cycle counter, at a time. Therefore, the metrics that contribute most to the speedup have to be identified. Based on our analysis and the information given in [175, 176], we have identified that cache misses (L1 I/D-Cache & L2 Cache), branch misses, CPU cycles and instructions retired are the appropriate PMCs for estimating the speedup on the chosen platform (listed in Table 5.1). The second step is the collection of characterisation data for a diverse set of applications. As part of this, a diverse set of workloads is created, containing single- and multi-threaded applications from SPEC CPU2006 [177], LMBench [136], RoyLongbottom [135], PARSEC 3.0 [15], SPLASH [18], and MiBench [132]. The Odroid-XU3 platform has four Cortex-A7 and four Cortex-A15 cores that can operate at 19 and 13 different DVFS levels, respectively.

For each application, data has been collected at all frequencies available on the platform. Furthermore, in the case of multi-threaded applications, the number of threads/cores is varied from one to four (the number of available cores of each type). In each case, the six PMCs, the frequencies of the big and LITTLE CPUs, the execution times of the application on the big and LITTLE clusters, and the number of active cores are all used in the modelling. For consistent results, each experiment is repeated ten times, and the corresponding average values are used when creating the model. To create a more general approach for deriving performance models, several statistical and machine learning techniques are explored, using the open-source WEKA workbench [137] to verify the relationship between the input features/attributes and the output/target variables. As part of this exploration, various machine learning methods have been considered, including simple linear regression, linear regression, multilayer perceptron, SMOreg, additive regression, stacking, random forest, and random tree. Of all the explored methods, the ensemble approach of additive regression over decision stumps, which uses boosting for a regression problem, resulted in good accuracy, as shown in Section 5.3.2. The runtime data is heterogeneous, as the selected parameters in Table 5.1 represent different activities of the processor. Therefore, additive regression helps in boosting the accuracy of the prediction models by employing multiple models targeting different regions of the data. Moreover, earlier studies have shown that the performance of additive regression exceeds or meets that of various other boosting algorithms [178].

The problem of function estimation usually consists of a random output variable y and a set of random input features X = {x_1, x_2, ..., x_n}. Given a training sample {y_i, X_i}_1^N of known (y, X) values, the objective is to identify a function f̂(X) that relates X to y, such that the expected value E_{y,X} of some specified error function ψ(y, f(X)) is minimized.

f̂(X) = arg min_{f(X)} E_{y,X} ψ(y, f(X))        (5.2)

In general, boosting approximates f̂(X) by an additive expansion of the form, i.e., a sum of base learners [179], as shown below:

f(X) = Σ_{k=0}^{M} α_k h(X; β_k)        (5.3)

Here, the base learner functions h(X; β) are simple functions of X with parameters β = {β_1, β_2, ..., β_M}, and {α_k}_0^M are expansion coefficients. Owing to its simplicity, a decision stump (one-level decision tree) is used as the base learner in this work. In brief, additive regression takes an initial guess for the speedup f_0(X) (the average speedup observed over all applications in the training set) and estimates the speedup by summing positive and negative additive-regression factors to f_0(X). Each additive-regression factor is associated with the input feature that the factor depends on. As the base learner is a decision stump, each input feature is associated with two regression factors, i.e., each of {h(X; β_k)}_0^M produces one positive/negative additive-regression factor depending on the value of the input feature. The additive-regression factors are computed in a forward stage-wise manner to minimize the squared error of the predictions after M iterations, where M decides the number of base learners. Readers can refer to [179] for more details on additive regression.
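To make the procedure concrete, the following sketch shows the core of boosted additive regression with decision-stump base learners fitted to the residuals. It is a simplified illustration of the general scheme in Eq. (5.3), not the WEKA implementation used in this work; the shrinkage value and the data layout are assumptions.

#include <limits>
#include <numeric>
#include <vector>

// One base learner h(X; beta): a one-level decision tree (decision stump).
struct Stump { int feature; double threshold; double left, right; };

// Fit a stump to the current residuals r by minimising the squared error.
static Stump fit_stump(const std::vector<std::vector<double>>& X,
                       const std::vector<double>& r) {
    Stump best{0, 0.0, 0.0, 0.0};
    double best_err = std::numeric_limits<double>::max();
    for (std::size_t f = 0; f < X[0].size(); ++f) {
        for (std::size_t c = 0; c < X.size(); ++c) {   // candidate thresholds
            double thr = X[c][f], sl = 0, sr = 0;
            int nl = 0, nr = 0;
            for (std::size_t j = 0; j < X.size(); ++j) {
                if (X[j][f] <= thr) { sl += r[j]; ++nl; } else { sr += r[j]; ++nr; }
            }
            double ml = nl ? sl / nl : 0.0, mr = nr ? sr / nr : 0.0, err = 0.0;
            for (std::size_t j = 0; j < X.size(); ++j) {
                double p = (X[j][f] <= thr) ? ml : mr;
                err += (r[j] - p) * (r[j] - p);
            }
            if (err < best_err) { best_err = err; best = Stump{(int)f, thr, ml, mr}; }
        }
    }
    return best;
}

// Additive regression (Eq. (5.3)): start from the mean target value f0 and
// add M shrunken stump corrections fitted to the residuals.
struct AdditiveRegression {
    double f0 = 0.0, shrinkage = 0.5;   // shrinkage value is an assumption
    std::vector<Stump> stumps;

    void fit(const std::vector<std::vector<double>>& X,
             const std::vector<double>& y, int M) {
        f0 = std::accumulate(y.begin(), y.end(), 0.0) / y.size();
        std::vector<double> r(y.size());
        for (std::size_t i = 0; i < y.size(); ++i) r[i] = y[i] - f0;
        for (int k = 0; k < M; ++k) {
            Stump s = fit_stump(X, r);
            stumps.push_back(s);
            for (std::size_t i = 0; i < y.size(); ++i)   // update residuals
                r[i] -= shrinkage * ((X[i][s.feature] <= s.threshold) ? s.left : s.right);
        }
    }
    double predict(const std::vector<double>& x) const {
        double f = f0;
        for (const Stump& s : stumps)
            f += shrinkage * ((x[s.feature] <= s.threshold) ? s.left : s.right);
        return f;
    }
};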

5.2.3 Resource Combination Enumerator

For each application, a set of all possible resource combinations (number of cores and their type) meeting the performance constraint has to be computed, to choose the one that minimizes the overall energy consumption (line 17, Algorithm 4). Let R be the set of possible resource combinations on a platform, and Perf_App_i the performance constraint for an application App_i; then the performance-meeting thread-to-core mappings (T_map_i) can be defined as follows:

T_map_i = {r ∈ R | perf(r) ≥ Perf_App_i}        (5.4)

Here, perf(r) defines the performance of an application when executed on the resource combination r. For simplicity, let us take the chosen platform, the Odroid-XU3, with two types of cores: big (B) and LITTLE (L); N_b and N_l are the sets of big and LITTLE cores, respectively. Then, perf(r) is computed as:

perf(r) = n_b × η × IPC_l + n_l × IPC_l + IPC_o        (5.5)

where η = IPC_b / IPC_l, and the performance on a big and a LITTLE core is denoted by IPC_b and IPC_l, respectively. Furthermore, n_b ∈ N_b, n_l ∈ N_l and r = n_l ∪ n_b. IPC_o is the performance overhead incurred when an application is mapped onto cores that do not share a cache. For instance, the big and LITTLE clusters in the Odroid-XU3 do not share caches, which results in an inter-cluster communication overhead when the threads of an application run on both the big and LITTLE clusters.
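A minimal sketch of this enumeration step is shown below. The IPC value and the inter-cluster overhead term IPC_o would come from the performance predictor and are placeholders here; treating the overhead as a deduction from the summed IPC is an assumption.

#include <vector>

struct Mapping { int nl, nb; double perf; };

// Enumerate all (LITTLE, big) core combinations on a 4+4 big.LITTLE platform
// and keep the performance-meeting ones (Eq. (5.4)).
std::vector<Mapping> enumerate_mappings(double ipc_l,   // IPC on one LITTLE core
                                        double eta,     // predicted speedup IPC_b/IPC_l
                                        double perf_constraint) {
    std::vector<Mapping> tmap;
    for (int nl = 0; nl <= 4; ++nl) {
        for (int nb = 0; nb <= 4; ++nb) {
            if (nl + nb == 0) continue;
            // The inter-cluster overhead applies only when both clusters are
            // used; its magnitude here is an assumed placeholder.
            double ipc_o = (nl > 0 && nb > 0) ? -0.1 * ipc_l : 0.0;
            double perf = nb * eta * ipc_l + nl * ipc_l + ipc_o;   // Eq. (5.5)
            if (perf >= perf_constraint)                           // Eq. (5.4)
                tmap.push_back({nl, nb, perf});
        }
    }
    return tmap;
}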

5.2.4 Resource Selector

The job of the resource selector is to minimize the energy consumption by selecting an energy-efficient resource combination from the performance-meeting thread-to-core mappings T_map_i = {3L, 4L, 1L+1B, ...}, where L and B refer to LITTLE and big cores, respectively. This can be achieved by selecting the thread-to-core mapping th ∈ T_map_i that has the highest performance per watt (PPW) (line 17, Algorithm 4).

th = arg max_{t ∈ T_map_i} PPW(t)        (5.6)

where PPW(t) is computed as the ratio between the IPC achieved for the resource combination t ∈ T_map_i and its power consumption. This requires measuring the power consumption using on-chip power sensors, or employing a power model when a platform does not have power sensors [180]. However, a power model would also require the collection of various PMC data at regular intervals of time, and its PMCs may be different from the ones used by the performance models [175]. This would need multiplexing of PMCs, leading to runtime overheads. To address this, the estimated speedup η is used as a proxy for identifying the energy-efficient resource combination when power sensors are not available. This is achieved by choosing a resource combination whose ratio between the minimum number of big cores and the minimum number of LITTLE cores (C_r) is higher than or close to the speedup. This leads to efficient utilisation of the big cores and supports balanced execution of an application. For example, if the speedup of an application is 2×, the algorithm initially tends to allocate two big cores and one LITTLE core. This is also demonstrated in Fig. 5.3, where unbalanced execution of the applications Blackscholes (top) and Bodytrack (bottom) resulted in increased execution time and energy consumption; the unbalanced executions are highlighted by arrows. The figure also shows that applications with a speedup greater than one can benefit in terms of energy and performance from the allocation of more cores, as C_r reaches one or higher.

Furthermore, if η is less than 1, all LITTLE cores are allocated, as the application does not benefit from executing on big cores in terms of performance/power. This makes the proposed algorithm effective for single-threaded applications as well, where it maps memory-intensive applications (η ≤ 1) onto LITTLE cores and compute-intensive ones (η > 1) onto big cores. Finally, the output of the resource selector is an energy-efficient resource combination and the minimum resources required for meeting the performance constraints. The information about the minimum resources is used by the resource manager.
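The selection heuristic can be sketched as follows: applications with η < 1 receive a LITTLE-only mapping, and otherwise the candidate whose big-to-LITTLE ratio C_r is closest to the estimated speedup is preferred. The tie-breaking details below are assumptions; the actual implementation uses measured PPW whenever power sensors are available.

#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Mapping { int nl, nb; double perf; };

// Choose an energy-efficient mapping from the performance-meeting set,
// using the estimated speedup eta as a proxy for PPW (Section 5.2.4).
Mapping select_mapping(const std::vector<Mapping>& tmap, double eta) {
    // Memory-intensive applications (eta < 1) do not benefit from big cores.
    if (eta < 1.0)
        for (const Mapping& m : tmap)
            if (m.nb == 0) return m;             // first LITTLE-only candidate

    std::size_t best = 0;
    double best_gap = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < tmap.size(); ++i) {
        double cr = (tmap[i].nl > 0) ? (double)tmap[i].nb / tmap[i].nl
                                     : (double)tmap[i].nb;   // all-big mapping
        double gap = std::fabs(cr - eta);        // balance Cr against speedup
        if (gap < best_gap) { best_gap = gap; best = i; }
    }
    return tmap[best];
}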

5.2.5 Resource Manager/Runtime Adaptation

The Resource Manager, shown in Fig. 5.2, is responsible for adapting to application arrival/completion and performance/workload variation, and for managing resources at runtime. It consists of the Resource Allocator/Reallocator, the Performance Monitor and the DVFS governor. These are discussed in detail in the following sections.

5.2.5.1 Resource Allocator/Reallocator

The Resource Allocator is in charge of finding free cores and allocating them to an application based on its selected resource combination (line 18, Algorithm 4). This is done by keeping track of the allocated and free cores available on the platform. The allocated cores are maintained per application; they are used by the Performance Monitor for measuring application performance and for releasing the resources when the application finishes. While allocating resources to an application, the Resource Allocator keeps track of cores that lead to over-performance of an application, called extra cores. After finishing the allocation of resources to the applications in the application queue (Apps), if free resources are still available, they are allocated to the running applications when the energy consumption can be reduced by shortening application execution time. The allocation of extra resources is done by first creating a list of active applications sorted in descending order of their speedup. Then, the application at the top of the list is selected, and its allocated cores are increased by one. This process is repeated for the remaining applications in the list until no free cores are left (lines 22-27, Algorithm 4, and the sketch below). Note that applications with η < 1 in the list are given only LITTLE cores, as they do not benefit from big cores in terms of energy efficiency.
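Lines 22-27 of Algorithm 4 can be concretised as in the following sketch; the step of one core per application per pass (y = 1) and the wrap-around are assumed details.

#include <algorithm>
#include <cstddef>
#include <vector>

struct App { int id; double eta; int cores; };

// Distribute the remaining free cores to active applications in descending
// order of speedup (lines 22-27 of Algorithm 4).
void distribute_free_cores(std::vector<App>& apps, int free_cores) {
    if (apps.empty()) return;
    std::sort(apps.begin(), apps.end(),
              [](const App& a, const App& b) { return a.eta > b.eta; });
    std::size_t i = 0;
    while (free_cores > 0) {
        apps[i].cores += 1;          // step y = 1 core per application
        --free_cores;                // apps with eta < 1 would only receive
        i = (i + 1) % apps.size();   // LITTLE cores (see text above)
    }
}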

Figure 5.3: Energy and execution time at different resource combinations of big (B) and LITTLE (L) cores for the applications Blackscholes (top) and Bodytrack (bottom) from PARSEC [15], executing on the Odroid-XU3.

The Resource Reallocator keeps track of application completion and the arrival of new applications into the system. When an application completes execution, it invokes the reallocation routine after releasing the allocated resources (lines 36-39, Algorithm 4). The reallocation routine then distributes the freed resources to the active applications. First, it measures the performance of each application (IPC or IPS) to check whether any application is under-performing, i.e., its measured performance is lower than the given performance constraint. If an application is under-performing, it then computes the amount of performance loss (the difference between the achieved performance and the given performance constraint) and estimates the resources required to compensate for it using Eq. 5.5.

Algorithm 5 DVFS governor (DVFS())
Input: for each core i, MRPI[i] and freq[i]
Output: voltage-frequency setting (V-f[i]) for the next epoch
1: MRPI_p = 0, util_p = 0, e_m = 0, e_u = 0;
2: /* Per-core DVFS supporting platforms */
3: pmcs = get_pmc_data(i);
4: compute the actual MRPI: MRPI_a = L2 cache misses / instructions retired;
5: compute the actual utilisation: util_a = active CPU cycles / total CPU cycles;
6: MRPI_p = predict_mrpi(MRPI_p, MRPI_a, e_m);
7: compute the MRPI prediction error: e_m = MRPI_a − MRPI_p;
8: util_p = predict_utilisation(util_p, util_a, e_u);
9: compute the utilisation prediction error: e_u = util_a − util_p;
10: V-f[i] = bin_classify(util_p, MRPI_p);
11: if (V-f[i] < freq[i]) then
12:   V-f[i] = freq[i];
13:   cpufreq_set_frequency(i, V-f[i]);
14: end if
15: /* Cluster-wide DVFS supporting platforms */
16: for each cluster j do
17:   Measure the MRPI and utilisation of each core i ∈ j;
18:   Compute the minimum MRPI (MRPI_a) and utilisation (util_a);
19:   Repeat steps 6 to 13.
20: end for

If any resources remain after allocating the freed resources to the under-performing applications, they are distributed among the applications as described in the previous paragraph. Furthermore, when a new application arrives into the system, the Resource Reallocator tries to identify and allocate resources as per th (Eq. 5.6). This is done by checking whether enough free resources are available on the platform to satisfy the application requirements. In case free resources are not available for meeting the performance constraints, the extra cores of over-performing applications are used. If the application requirements are still not met after this, application execution continues using the available resources until a running application completes and releases its allocated resources.

5.2.5.2 Performance Monitor

Applications usually exhibit varying workload profiles (e.g., compute-intensive to memory-intensive and vice versa) during execution. When multiple applications execute simultaneously, the workload profile of each application is affected by contention on shared resources [53]. As a result, application performance varies over time and may lead to the violation of performance constraints. To address this, each application's performance is periodically monitored to detect and compensate for violations of its performance constraint (lines 30-34, Algorithm 4). An application's performance is measured by collecting the PMCs corresponding to instructions retired and CPU cycles on all the cores that the application is currently running on. When an application's performance constraint is violated, either the operating frequency is increased or more cores are allocated. Raising the operating frequency is given priority over assigning more cores, as the latter incurs a migration overhead which is relatively large compared to the DVFS transition latency [175]. The operating frequency is increased in steps of 200 MHz until the performance constraint is satisfied, and this frequency (freq) is communicated to the DVFS governor (discussed in the next section) to make sure it does not scale the frequency below this value. After the above step, if any of the applications are still under-performing, as a last resort more cores are allocated from the available free cores or from the extra cores of over-performing applications. This allocation is done by computing the performance loss and the corresponding required cores using Eq. 5.5. As explained in Section 5.2.5.1, for applications with η < 1, LITTLE cores are preferred over big cores.
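The compensation logic of this section can be summarised in the following sketch; measure_ips(), set_frequency_floor() and allocate_more_cores() are illustrative stubs rather than the actual implementation.

// Illustrative stubs (not the actual implementation).
double measure_ips(int app);                   // instructions retired per second
void set_frequency_floor(int app, double mhz); // communicated to the DVFS governor
void allocate_more_cores(int app);             // sized via Eq. (5.5)

// Periodic performance check (lines 30-34 of Algorithm 4).
void check_performance(int app, double ips_constraint,
                       double& freq_mhz, double max_freq_mhz) {
    if (measure_ips(app) >= ips_constraint) return;   // constraint met
    if (freq_mhz < max_freq_mhz) {
        // Prefer DVFS first: migration overhead is large compared to the
        // DVFS transition latency [175].
        freq_mhz += 200.0;                            // 200 MHz step
        set_frequency_floor(app, freq_mhz);           // governor stays above this
    } else {
        allocate_more_cores(app);                     // last resort
    }
}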

5.2.5.3 DVFS Governor

Applications go through different workload phases (e.g., compute-intensive, memory-intensive, etc.), and this necessitates choosing a different frequency for each workload phase to reduce the power consumption while maintaining application performance within bounds. For example, a memory-intensive workload can be executed at a lower frequency than a compute-intensive workload with no or negligible performance loss [53]. To this end, the technique proposed in Section 3.3 of Chapter 3 is adopted and modified to take freq into account. By doing this, the DVFS governor avoids scaling the frequency below freq, thereby ensuring that applications meet their performance constraints. Algorithm 5 presents the pseudocode of the DVFS governor. This approach employs a binning-based classification with two layers (line 10). The first layer, consisting of utilisation bins, classifies the compute-intensity, and the second layer classifies the memory-intensity using MRPI bins. The classification bins are computed through an offline analysis of 81 diverse workloads: 25 from SPEC CPU2006 [177], 20 from LMBench [136], 11 from RoyLongbottom [135], 11 from PARSEC 3.0 [15] and 14 from MiBench [132]. For each application, offline profiling data consisting of MRPI, utilisation and application performance (1/execution time) is collected at the different DVFS settings available on the chosen platform. The collected utilisation and MRPI values for the various applications are then grouped into utilisation bins and MRPI bins, and a corresponding voltage-frequency setting is assigned to each bin of the second classification layer. At runtime, the DVFS governor measures the MRPI and utilisation and uses workload prediction to set an appropriate voltage-frequency level (lines 3-9). Furthermore, it can manage both per-core (lines 2-14) and cluster-wide (lines 15-20) DVFS platforms. For more details on the binning-based DVFS approach, please refer to Section 3.3.
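For illustration, the two-layer lookup of line 10 in Algorithm 5 can be sketched as below. The bin boundaries encoded here are illustrative; the complete, platform-specific tables are given in Appendix B.

#include <vector>

// Second-layer bin: an MRPI interval (lo, hi] mapped to a frequency in MHz.
struct MrpiBin { double lo, hi; int freq_mhz; };
// First-layer bin: a utilisation interval (lo, hi] with its own MRPI bins.
struct UtilBin { double lo, hi; std::vector<MrpiBin> mrpi_bins; };

// A small illustrative excerpt in the spirit of Table B.1 (Appendix B);
// the real tables contain many more bins per utilisation level.
static const std::vector<UtilBin> bins = {
    {95.0, 100.0, {{0.02, 1e9, 800}, {0.01, 0.02, 900}, {0.0, 0.01, 1000}}},
    {85.0,  95.0, {{0.01, 1e9, 800}, {0.0, 0.01, 900}}},
};

// Two-layer classification: utilisation (compute-intensity) first,
// then MRPI (memory-intensity).
int bin_classify(double util, double mrpi) {
    for (const UtilBin& u : bins)
        if (util > u.lo && util <= u.hi)
            for (const MrpiBin& m : u.mrpi_bins)
                if (mrpi > m.lo && mrpi <= m.hi)
                    return m.freq_mhz;
    return 200;   // fall back to the lowest setting (assumption)
}

Similarly, the predictor of lines 6-9 is left abstract in Algorithm 5 and is detailed in Chapter 3; one simple hedged reading is an exponentially weighted moving average corrected by the previous prediction error:

// Hedged sketch of predict_mrpi()/predict_utilisation() in Algorithm 5;
// the smoothing factor and the error-compensation form are assumptions,
// not necessarily the exact scheme of Chapter 3.
double predict_next(double prev_prediction, double actual, double prev_error,
                    double alpha = 0.5) {
    return alpha * actual + (1.0 - alpha) * prev_prediction + alpha * prev_error;
}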

5.3 Experimental Results

This section presents the details of the experimental setup, covering the platform, benchmark applications and the reported approaches considered for comparison. Furthermore, an evaluation of the performance prediction models and of the benefits of the proposed approach over previous approaches is discussed, including the associated overheads.

5.3.1 Experimental Setup and Implementation

We use the Odroid-XU3 [28], running Ubuntu OS with kernel version 3.10.96 (more details in Section 2.4.1, Chapter 2). Energy consumption is computed as the product of the average power consumption (dynamic and static) and the application execution time. This includes both the core and memory energy consumption of all the software components, including the proposed approach, the OS, the applications and other background processes. To evaluate the proposed approach, applications – Blackscholes (bl), Bodytrack (bo), Swaptions (sw), Freqmine (fr), Vips (vi), Water-Spatial (wa), Raytrace (ra) and Fmm (fm) – from popular benchmark suites, such as PARSEC 3.0 [15] and SPLASH [18], are taken (more details about these benchmarks are given in Section 2.4.2). These applications exhibit different memory behavior, data partitions, and data-sharing patterns. For consistent results, each scenario is run ten times and the average values (energy and performance) are used for the comparison. Different execution scenarios – single application, concurrent execution of multiple applications, and dynamic addition of application(s) at runtime – are considered to mimic real-world behavior. To ensure deterministic execution of applications and to meet their performance constraints, no two applications share the same cores. However, the threads of the same application share the allocated cores to maximize resource utilisation.

The proposed approach is implemented as a user-space application using the Perfmon2 [167] and cpufrequtils frameworks. Perfmon2 enables user-space access to the performance monitoring unit (PMU), and cpufrequtils helps in setting/getting the operating frequencies. The standard Linux API sched_setaffinity(2) is used to control the CPU affinity of processes, i.e., to bind the applications to specific cores. The thread-to-core mapping algorithm operates at a coarser granularity (500 ms), considering its higher migration overhead. As the workload of an application changes over time, to capitalize on these changes for energy savings the DVFS governor operates at a finer granularity of 200 ms, as per the observations made in previous chapters.
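For reference, the sketch below shows the two OS mechanisms mentioned above: pinning a process to specific cores with sched_setaffinity(2) and setting a frequency through the standard cpufreq sysfs interface (cpufrequtils wraps the same mechanism). The core numbering assumes the Odroid-XU3 convention of LITTLE cores 0-3 and big cores 4-7, and writing scaling_setspeed requires the userspace governor and suitable permissions.

#include <sched.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstdio>

// Bind a process to the LITTLE cluster (cores 0-3 on the Odroid-XU3).
static int pin_to_little(pid_t pid) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int core = 0; core < 4; ++core)
        CPU_SET(core, &mask);
    return sched_setaffinity(pid, sizeof(mask), &mask);
}

// Set the operating frequency (in kHz) of the cluster containing 'core'.
static int set_frequency_khz(int core, long khz) {
    char path[96];
    std::snprintf(path, sizeof(path),
                  "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_setspeed", core);
    FILE* f = std::fopen(path, "w");
    if (!f) return -1;
    std::fprintf(f, "%ld", khz);
    std::fclose(f);
    return 0;
}

int main() {
    pin_to_little(0);                 // 0 = the calling process itself
    set_frequency_khz(0, 1000000);    // 1.0 GHz on the LITTLE cluster
    return 0;
}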


Figure 5.4: Box plots of the absolute percentage error in IPC prediction by the proposed performance model for different numbers of decision stumps used in the additive regression, showing the median, lower quartile, upper quartile and outliers – (a) estimating the performance of the LITTLE core given information about the big core; (b) estimating the performance of the big core given information about the LITTLE core.

Relevant Existing Approaches for Comparison

To show the benefits of the proposed approach, referred to as adaptive mapping and DVFS (AdaMD), compared to the state-of-the-art, the comparison candidates selected from the relevant reported works are given below.

1. HMPx [169]: The state-of-the-art solution for big.LITTLE multi-processing, the Heterogeneous Multi-Processing (HMP) scheduler, with various default Linux power governors x (= Ondemand (O), Conservative (C) and Interactive (I)), is considered. We did not consider the performance and powersave governors, as they do not adapt to workload variation and, therefore, constantly operate at the highest and the lowest possible frequency, respectively. HMP is a patch to the standard Linux symmetric multi-processing scheduler which dynamically dispatches threads to the big and LITTLE clusters according to their CPU load. For a fair comparison, applications are executed with different numbers of threads and the configuration meeting the performance constraint is considered.

2. MIM [10]: This approach maps application threads onto only one type of core(s) based on the workload memory-intensity, called memory-intensity based mapping (MIM). For the single-application execution scenario, a memory-intensive application is mapped onto the LITTLE cores, whereas a compute-intensive one is executed on the big cores. In a multiple-application scenario, applications are sorted by their memory-intensity; the one with the highest memory-intensity is mapped onto the LITTLE cores, and the remaining applications are allocated onto the big cluster with an equal number of cores.

3. EAM [50]: An energy-efficient mapping is selected through an exhaustive search of voltage-frequency settings and thread-to-core mappings. For each possible thread-to-core mapping, the voltage-frequency settings are varied from the lowest possible value to the highest, and the one that meets the performance requirement with the lowest energy consumption is chosen. We refer to this approach as energy-aware mapping (EAM).

4. ITMD [57]: This approach, presented in Chapter 4, uses offline analysis of energy and performance for individual applications to decide on an energy-efficient mapping when multiple applications run concurrently. Furthermore, it also applies workload-classification-based DVFS periodically to minimize the power consumption.

5.3.2 Evaluation of Performance Predictor

The performance prediction models estimate the performance of the LITTLE core given information from the big core (P_bl) and vice versa (P_lb). The number of base learners (decision stumps) M in Eq. 5.3 impacts the model accuracy and runtime overhead. We tested our models over 148 distinct samples to evaluate the accuracy of the IPC estimation, and the corresponding box plots of the percentage error distribution for P_bl and P_lb are given in Figures 5.4a and 5.4b, respectively. As shown, the error range gets narrower with the number of decision stumps, which helps in better predicting the speedup. Furthermore, increasing the number of decision stumps also reduces the outliers, shown as crosses in Figures 5.4a and 5.4b, improving model stability. However, choosing more decision stumps increases the runtime overhead, and the prediction accuracy may not improve beyond a certain number of decision stumps. Therefore, to balance this, additive regression models for different numbers of decision stumps are built. It can be seen from Figures 5.4a and 5.4b that the improvement in model accuracy is negligible after 900 and 1100 decision stumps for P_bl and P_lb, respectively.

Figure 5.5: Comparison of the proposed approach with reported approaches in terms of energy consumption for single and concurrent applications.

Figure 5.6: Energy consumption of different approaches for one and two applications added dynamically to the system while an application is executing.

Therefore, these numbers of decision stumps are chosen for our models P_bl (mean absolute percentage error (MAPE) = 1.57%; maximum error (ME) = 8.1%) and P_lb (MAPE = 3.45%; ME = 8.5%). The maximum error of P_bl and P_lb is about 7.9% and 5% lower, respectively, compared to the previous model [19]. The prediction accuracy of P_lb is 1.88% worse than that of P_bl, and P_lb requires 200 extra decision stumps. This is because the LITTLE cores support accessing only four PMCs simultaneously, compared to the six PMCs supported by the big cores.

Figure 5.7: Resource combinations (number of big (B) and LITTLE (L) cores) allocated to Blackscholes and Bodytrack by the proposed approach to adapt to application arrival/completion and performance variation.

5.3.3 Comparison of Energy Consumption

This section presents the energy consumption results for the various approaches to show the benefits of the proposed approach. Fig. 5.5 shows the energy consumption for different single and concurrently executing applications (launched at the same time). Substantial energy savings are observed compared to the reported approaches for all the application execution scenarios. For single-application execution (bl, bo, sw, fr, wa, and ra), AdaMD achieves average energy savings of 30.7%, 25.8%, 27.3%, 37.4%, 21.8% and 7.8% compared to HMPO, HMPC, HMPI, MIM, EAM, and ITMD, respectively. Furthermore, for the concurrent execution of two and three applications (bl+bo, bl+sw, fr+sw, wa+bl, bo+wa, wa+ra, bl+bo+sw, bl+bo+fr, sw+bo+fr, bl+sw+fr, wa+ra+fm, wa+ra+vi), AdaMD shows 25.5%, 22.4%, 26.5%, 37.5%, 24.8%, and 14.2% lower energy consumption than HMPO, HMPC, HMPI, MIM, EAM, and ITMD, respectively.

In the single-application scenario, ITMD, EAM, and AdaMD choose a similar thread-to-core mapping; the energy savings observed are therefore mainly due to the proposed DVFS technique. This can specifically be noticed for applications like bl and bo, which have thread synchronisation overheads of 49% and 62% (with respect to application execution time), substantially larger than that of fr, which is below 4% [181]. The experimental observations show that applications with high synchronisation overheads have lower MRPI values, which would be wrongly interpreted as the application doing useful heavy computation if the utilisation layer of the bin classification were not considered (please refer to Section 5.2.5.3). Unlike ITMD and EAM, AdaMD takes the thread synchronisation overhead into account while selecting a voltage-frequency setting. In concurrent execution scenarios, the energy savings are due to both DVFS and the utilisation of the freed resources of finished applications for the active applications.

Figure 5.8: Evaluation of various approaches in meeting application performance constraints.

Furthermore, to demonstrate the adaptiveness of AdaMD to application arrival, the following experimental evaluation is performed. The execution starts with one application, and later one (+ 1×DA_App) or two (+ 2×DA_App) applications are added at runtime. The dynamically added applications, abbreviated as DA_App in Fig. 5.6, are from those mentioned in Section 5.3.1. The advantages of the proposed approach with respect to the other approaches in terms of energy savings are shown in Fig. 5.6. On average, AdaMD reduces the energy consumption by 23.8%, 20.6%, 24.8%, 35.8%, 12.2%, and 23.0% compared to HMPO, HMPC, HMPI, MIM, EAM, and ITMD, respectively. To illustrate how the proposed approach adapts to different runtime scenarios and achieves energy savings, the resource combination (number of active cores and their type) over execution time for Blackscholes and Bodytrack is shown in Fig. 5.7. While Blackscholes is executing with four LITTLE and three big cores (4L+3B), Bodytrack is added to the system at t=10s. Considering the performance constraints of Bodytrack, 2L+2B are allocated to it by freeing cores from the over-performing Blackscholes. Due to workload variations, Bodytrack experiences a performance loss at t=160s, thereby triggering the Resource Reallocator to readjust the mappings of both Blackscholes and Bodytrack. Upon Bodytrack's completion at t=293s, the freed cores are again allocated to Blackscholes, as it can benefit from faster execution to lower the energy consumption. This shows that AdaMD adapts to runtime execution scenarios efficiently and exploits the available system resources intelligently to improve energy efficiency and to satisfy performance constraints.

5.3.4 Performance

The proposed approach outperforms all reported approaches in meeting application performance constraints, as shown in Fig. 5.8. We evaluated the percentage of performance constraint misses for all the application scenarios presented in Fig. 5.5 (without application addition) and Fig. 5.6 (with application addition). For the case without application addition, AdaMD meets the application performance constraints for 95% of the considered application scenarios, i.e., 17 out of the 18 cases shown on the horizontal axis in Fig. 5.5. The only scenario where the proposed approach fails to satisfy the performance requirements is bl+sw+fr. Even in this case, except for sw, the performance constraints of bl and fr are met. This is mainly because of the diverse workload profiles of the three applications and the relatively higher performance requirement chosen for sw (approximately 2× compared to bl and fr). In the case of application addition at runtime, AdaMD satisfies the performance constraints for 92% of the evaluated scenarios, i.e., of the 12 scenarios shown in Fig. 5.6, the performance constraints are met in all except sw + 2×DA_App.

5.3.5 Runtime Overheads

To compute the runtime overheads of the proposed approach, the amount of time that the algorithm takes to complete the various steps (A to E), explained in Section 5.2, has been measured. Steps A (Runtime Data Collector), B (Performance Predictor), C (Resource Combination Enumerator) and D (Resource Selector) are triggered when an application arrives into the system, whereas step E (Resource Manager) operates periodically. The runtime overhead for each time epoch (500 ms) can be represented analytically as follows:

T_o = T_AdaMap + 2.5 × T_DVFS        (5.7)

where,

T_AdaMap = T_pmc_m + T_pm + T_th + T_rar        (5.8)

T_DVFS = T_pmc_vf + T_metrics + T_wp + T_classify + T_vfs        (5.9)

where T_pmc_m, T_pm, T_th, T_rar, T_pmc_vf, T_metrics, T_wp, T_classify, and T_vfs represent the time taken for PMC data collection for mapping; performance prediction; identification of the energy-efficient resource combination; resource allocation/reallocation; PMC data collection for DVFS; computation of MRPI and utilisation; workload prediction; finding the DVFS setting through the classification bins; and the DVFS transition latency, respectively. Note that performance prediction happens only when a new application is launched; therefore, the overhead T_pm may not be present in every time epoch. Moreover, the runtime overhead T_DVFS is multiplied by a factor of 2.5, as the DVFS governor operates at a finer granularity of 200 ms compared to the mapping time interval of 500 ms.

We observed an average runtime overhead of 600 µs and 1.4 ms for steps A to D when executed on a big core of the Odroid-XU3 at 2 GHz and 1 GHz, respectively. The DVFS part of step E incurs 320 µs, and the other parts take up to 15 µs when the overhead is measured at the maximum frequency (2 GHz). The DVFS algorithm operates at a granularity of 200 ms, so its overhead is less than 0.5%. The performance monitoring and resource management parts of step E are invoked every 500 ms. The overhead associated with this part depends on the number of times applications miss their performance constraints and on thread migrations across the cores; here, an overhead between 0.15% and 0.75% is observed. Our results show that the total runtime overhead is minimal; moreover, it has been included when computing energy consumption and performance.
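For illustration, substituting the measured values into Eq. (5.7): in an epoch where a new application arrives at 2 GHz, T_o ≈ 600 µs + 2.5 × 320 µs + 15 µs ≈ 1.4 ms, which is about 0.28% of the 500 ms epoch; in an epoch with no application arrival, only step E contributes, giving roughly 0.16%.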

5.4 Discussion

The increasing demand for performance and energy efficiency has forced mobile systems to employ heterogeneous multiprocessor system-on-chips. These systems offer a diverse set of core and frequency configurations to runtime management systems for online tuning. This chapter has presented an energy-efficient adaptive thread-to-core mapping and DVFS technique for choosing an appropriate configuration for each performance-constrained application. By using runtime information while applications are executing, and by eliminating the need for application-dependent offline results, the presented approach is capable of managing even unknown applications efficiently. The proposed algorithm first selects a resource combination (number of cores and their type) that meets the application performance requirement using an accurate performance prediction model and a resource enumerator/selector. It then monitors application performance, workload and application status (finished or newly arrived) for tuning voltage-frequency settings and adjusting thread-to-core mappings. Our experiments show an improvement of up to 28% in energy efficiency compared to the most promising existing approaches. The proposed approach also outperforms previous approaches in meeting application performance constraints.

Chapter 6

Conclusions and Future Work

This chapter presents the conclusions of the thesis and closes with potential future research directions.

6.1 Conclusions

Modern MPSoC platforms with higher numbers of cores of various types have become prevalent due to ever-increasing user demands in terms of performance and energy efficiency. These platforms offer diverse levels of computation, such as heterogeneous computing and multi-tasking, and many platform configurations, including runtime adjustment of DVFS settings and thread-to-core mappings. However, due to the limited energy supply and the need for longer battery life in present and future heterogeneous multi-core platforms, runtime energy management has become important. The large number of platform configurations, coupled with the concurrent execution of performance-constrained applications, has made runtime management a challenging task.

The literature review carried out in Chapter 2 discussed the existing runtime energy management approaches in detail and helped to establish that they have the following shortcomings in handling concurrent execution and the dynamic arrival/completion of applications. Previous approaches do not consider concurrent workload variations and application interference while selecting DVFS settings, leading to lower energy savings and higher performance degradation. For heterogeneous MPSoCs, existing thread-to-core mapping techniques do not exploit different types of cores simultaneously for each application, which further reduces the total energy savings. Furthermore, it was also established that state-of-the-art approaches targeting heterogeneous MPSoCs running concurrent applications are highly dependent on application-specific profile data, and do not perform adaptations at application arrival/completion times. To address the above problems, this thesis made various contributions, and their conclusions are presented below.

In Chapter 3, to efficiently choose DVFS settings on homogeneous MPSoCs for concurrent execution scenarios, an investigation of workload classification based on execution time and a statistical analysis of hardware PMC data are presented. The variation in the performance of an application, when running with other applications concurrently, is analysed considering various diverse benchmark applications and execution scenarios (sequential and concurrent). This has shown that, unlike compute-intensive workloads, memory-intensive workloads in sequential execution experience a small performance degradation when the frequency is scaled down. However, in the case of concurrent execution, the workload profile of individual applications differs from that of sequential execution, thereby making DVFS setting decisions challenging. To solve this, a thorough analysis of runtime statistics collected through hardware PMCs is carried out and, finally, an efficient metric with low runtime overhead, called Memory Reads Per Instruction (MRPI), is derived for classifying the concurrent workload. Furthermore, an online energy optimization technique using MRPI is presented for efficient DVFS control. The proposed technique performs on-the-fly workload classification and proactively selects an appropriate DVFS level for the predicted workload. Subsequently, it monitors the workload prediction error and the performance loss, quantified by Instructions Per Second (IPS), and adjusts the chosen DVFS setting to compensate. Validation on a homogeneous MPSoC platform (the A15 cluster of the Odroid-XU3) for various execution scenarios showed an improvement of up to 69% in energy consumption compared to existing approaches, while incurring a performance loss of up to 4.85%.

This has then been extended to multi-threaded applications involving thread synchronisation overheads, and to different DVFS and memory architectures. The extended runtime energy management technique takes the combined effect of application compute-/memory-intensity, thread synchronisation contention, and non-uniform memory accesses (NUMA) into account when controlling the DVFS setting. This approach showed that periodically estimating the processor workload accurately is important for efficient DVFS control. In the case of concurrent execution, it has been demonstrated that a careful selection of the workload metric and taking memory contention into account while selecting a DVFS setting improve energy efficiency with a minimal performance penalty. Experimental results showed an improvement of up to 81.2% in energy efficiency with a small performance loss when compared to existing approaches.

Further, considering the importance of heterogeneous multi-cores executing multi-threaded applications concurrently, an inter-cluster thread-to-core mapping and cluster-based DVFS approach has been presented in Chapter 4. Our investigation shows that mapping an application simultaneously onto the big and LITTLE clusters improves energy efficiency and performance. For thread-to-core mapping, offline profiled results are used, which contain the performance and energy characteristics of applications when executed on the heterogeneous platform at the maximum frequency with different types of cores in various possible combinations. For an application, the thread-to-core mapping process defines the number of used cores and their type, which are situated in different clusters. At runtime, the mapping process identifies thread-to-core mappings for each performance-constrained application in a concurrent execution scenario while optimizing the overall energy consumption. The online adaptation process classifies the workload of concurrently executing applications and proactively selects an appropriate V-f pair for the predicted workload. This chapter shows that energy efficiency and performance can be improved by considering application interference, concurrent workload classification, and inter-cluster thread-to-core mapping. This approach is also validated on a hardware platform, the Odroid-XU3, with various combinations of multi-threaded applications from the PARSEC and SPLASH benchmarks. Results show an improvement in energy efficiency of up to 33% compared to existing approaches while meeting the performance requirements.

To remove the dependency on application-dependent profile data and offline instrumentation of applications, and to adapt to application arrival and finish times, Chapter 5 has presented an energy-efficient adaptive thread-to-core mapping and DVFS technique for choosing an appropriate configuration for each performance-constrained application. By using runtime information while applications are executing, and by eliminating the dependency on application-dependent offline results, this approach is capable of managing even unknown applications efficiently. The proposed approach first selects a resource combination (number of cores and their type) that meets the application performance requirement using an accurate performance prediction model and a resource enumerator/selector. It then monitors application performance, workload and application status (finished or newly arrived) for tuning voltage-frequency settings and adjusting thread-to-core mappings. Our experiments show an improvement of up to 28% in energy efficiency compared to the most promising existing approaches. The proposed approach also outperforms previous approaches in meeting application performance constraints.

Future computing platforms, such as embedded and mobile systems, are expected to have an increased number of processing cores of various types to meet the diverse energy and performance demands of applications. However, architectural innovations alone may not fully exploit application characteristics to achieve energy efficiency while respecting performance constraints. Therefore, as reported in this thesis, intelligent runtime managers would complement architectural innovations by exploiting hardware and application characteristics efficiently, thereby improving various metrics such as energy and performance. In this context, the advances reported in this thesis are significant contributions towards the development of future energy-efficient, feature-rich multi-core platforms. The following section outlines future research directions that are worth considering to improve certain features of the presented contributions, and also identifies areas which may benefit from this thesis's investigations.

6.2 Future Work

This section highlights some potential future research directions to extend or augment the work presented in this thesis.

1. Validation with Interactive Mobile Workloads: The presented RTM approaches have been validated with several standard benchmarks to show their advantages in terms of energy efficiency and performance. The next interesting step would be to validate them with mobile applications (e.g., web browsers, Facebook, Twitter, Instagram, etc.) that generate interactive workloads depending upon the interplay between the processing cores, system peripherals and the user, such as touch-screen usage during web browsing. Due to time constraints, this has not been carried out during the course of the PhD. Such validation may help identify and tune certain features of the RTM to manage interactive mobile workloads efficiently.

2. Joint Optimization of Mapping and DVFS: To make runtime decisions with negligible overhead, the RTM approaches presented in Chapters 4 and 5 consider mapping and DVFS separately. However, this may result in the selection of sub-optimal platform configurations due to the limited design space, leading to reduced energy savings and/or performance. Therefore, investigating the benefits of joint optimization of mapping and DVFS would be a good extension to this thesis's contributions. It would certainly require low-overhead and efficient techniques to make such runtime optimization beneficial. Furthermore, dynamic power management to gracefully shut down unused components in a system would also be worth investigating alongside the above.

3. Variable Sample Period for DVFS: The time period considered for detecting workload variations and adjusting the DVFS setting has been constant in this thesis. However, some applications may exhibit workload variations more frequently than others, which also depends upon the interaction between the application and the user. Using a uniform and constant time period may not be effective in such scenarios. Therefore, investigating methods to tune the DVFS sample period may better capture workload variations and thus reduce power consumption.

4. Integration of the Runtime Manager into the Scheduler: Recent research on energy-aware scheduling has shown promising results in terms of improved energy efficiency and performance. Therefore, the approaches presented in this thesis, and the future research on runtime managers outlined above, could be integrated into the scheduler to identify possible benefits and to improve system performance at large.

5. Adaptive Energy Management for Massively Heterogeneous Systems: The prior art showed that applications from different areas (e.g., data mining, multimedia, machine learning, etc.) exhibit distinct workload characteristics. Executing such applications on a less sophisticated hardware platform may not exploit the workload characteristics efficiently, thereby leading to higher energy consumption whilst achieving the same or lower performance than a massively heterogeneous system. We envisage that a massively parallel system contains domain-specific systems-on-chip (DS-SoCs), reconfigurable hardware such as Field-Programmable Gate Arrays (FPGAs), and specialized hardware accelerators like Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs). Such complex systems necessitate highly sophisticated runtime management systems that are scalable in terms of processing-core heterogeneity and adaptable to various runtime variations, including performance and workload. The presented ideas for runtime task mapping and DVFS can be extended to such systems.

6. A Holistic Runtime Management Approach to Improving Performance, Power, Temperature, Security and Reliability: In addition to performance, power and temperature, metrics such as security and reliability have gained significant importance, mainly due to highly connected devices posing hardware and software security threats, technology scaling, and frequent device failures. Optimizing one metric may hurt another; for example, minimizing power consumption and device temperature without considering performance may lead to under-utilization of platform resources. Therefore, current and future runtime management systems should take a holistic approach to better understand the relationships between metrics and to design an efficient manager. This would lead to meeting the power, performance, or temperature constraints while providing reasonable device security and performance guarantees even under device errors/failures.

7. Platform- and Application-Agnostic Runtime Management System: There have been continuous advancements in architecture and process technology, and new applications keep emerging. Runtime management software targeting a particular application, domain, or hardware platform may become obsolete or inefficient due to these rapid advancements. Therefore, the ideas presented in Chapter 5 can be extended to incorporate platform-agnostic runtime management by deriving generalized (mostly software) metrics for quantifying various runtime objectives, such as power, performance and reliability. This necessitates a thorough understanding of state-of-the-art architectural and application innovations, and of trends in future advancements. As a result, such a generalized runtime manager can be highly scalable and, if necessary, can be tuned to suit a particular hardware architecture or application domain to exploit certain features to improve, for example, energy efficiency.

Appendix A

Application Offline Analysis Results

This appendix contains the application offline profiling results used in Chapter 4 for making thread-to-core mapping decisions.

Table A.1: Offline profiling results for Blackscholes, executing on the Odroid-XU3

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 1328.65 | 2058.99
1 | 0L+1B | 883.55 | 2327.26
2 | 2L | 1274.90 | 1993.47
2 | 1L+1B | 885.62 | 2359.80
2 | 0L+2B | 880.62 | 2325.78
3 | 3L | 722.72 | 1206.00
3 | 2L+1B | 888.42 | 2351.88
3 | 1L+2B | 476.50 | 1812.30
3 | 0L+3B | 473.69 | 1799.32
4 | 4L | 546.48 | 949.58
4 | 3L+1B | 881.45 | 2352.01
4 | 2L+2B | 450.12 | 1652.92
4 | 1L+3B | 351.39 | 1550.33
4 | 0L+4B | 353.95 | 1582.22
5 | 4L+1B | 882.14 | 2351.62
5 | 3L+2B | 496.11 | 1832.64
5 | 2L+3B | 327.25 | 1323.90
5 | 1L+4B | 320.39 | 1297.37
6 | 4L+2B | 462.08 | 1759.06
6 | 3L+3B | 278.25 | 1173.21
6 | 2L+4B | 290.52 | 1057.66
7 | 4L+3B | 257.12 | 1028.13
7 | 3L+4B | 246.33 | 918.16
8 | 4L+4B | 226.13 | 885.19


Table A.2: Offline profiling results for Bodytrack, executing on the Odroid-XU3

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 1164.18 | 1797.35
1 | 0L+1B | 402.83 | 1017.32
2 | 2L | 1128.68 | 1728.67
2 | 1L+1B | 418.45 | 1039.23
2 | 0L+2B | 399.11 | 1024.35
3 | 3L | 1110.10 | 1706.33
3 | 2L+1B | 400.62 | 1022.49
3 | 1L+2B | 412.83 | 1020.65
3 | 0L+3B | 418.14 | 1058.84
4 | 4L | 589.53 | 955.45
4 | 3L+1B | 412.02 | 1058.96
4 | 2L+2B | 228.58 | 765.54
4 | 1L+3B | 234.68 | 769.55
4 | 0L+4B | 231.01 | 773.02
5 | 4L+1B | 408.50 | 1044.24
5 | 3L+2B | 240.60 | 779.77
5 | 2L+3B | 183.19 | 653.02
5 | 1L+4B | 182.42 | 657.29
6 | 4L+2B | 236.85 | 760.61
6 | 3L+3B | 181.60 | 644.10
6 | 2L+4B | 171.17 | 592.20
7 | 4L+3B | 178.35 | 633.86
7 | 3L+4B | 159.80 | 577.97
8 | 4L+4B | 152.89 | 500.91

Table A.3: Offline profiling results for Swaptions, executing on the Odroid-XU3

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 2181.22 | 3445.20
1 | 0L+1B | 1037.45 | 2510.84
2 | 2L | 2099.98 | 3277.43
2 | 1L+1B | 1038.14 | 2436.14
2 | 0L+2B | 1039.45 | 2474.41
3 | 3L | 1052.34 | 1644.25
3 | 2L+1B | 1039.94 | 2434.03
3 | 1L+2B | 522.94 | 1160.14
3 | 0L+3B | 530.84 | 1218.90
4 | 4L | 727.01 | 1139.44
4 | 3L+1B | 1037.07 | 2310.05
4 | 2L+2B | 524.56 | 1191.67
4 | 1L+3B | 410.74 | 955.67
4 | 0L+4B | 409.30 | 955.75
5 | 4L+1B | 1039.14 | 2358.40
5 | 3L+2B | 521.46 | 1398.28
5 | 2L+3B | 419.96 | 1105.55
5 | 1L+4B | 352.80 | 913.61
6 | 4L+2B | 521.35 | 1375.12
6 | 3L+3B | 345.05 | 991.88
6 | 2L+4B | 351.13 | 950.46
7 | 4L+3B | 311.05 | 849.38
7 | 3L+4B | 297.33 | 797.46
8 | 4L+4B | 282.99 | 715.22

Table A.4: Offline profiling results for Freqmine, executing on the Odroid-XU3

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 2790.53 | 4514.70
1 | 0L+1B | 2775.68 | 8782.74
2 | 2L | 1415.05 | 2296.33
2 | 1L+1B | 894.05 | 2821.96
2 | 0L+2B | 895.30 | 2791.99
3 | 3L | 967.57 | 1565.99
3 | 2L+1B | 898.87 | 2844.26
3 | 1L+2B | 553.52 | 1734.41
3 | 0L+3B | 568.56 | 1773.00
4 | 4L | 763.51 | 1376.72
4 | 3L+1B | 887.17 | 2799.31
4 | 2L+2B | 552.28 | 2646.77
4 | 1L+3B | 477.13 | 2285.95
4 | 0L+4B | 474.89 | 2299.49
5 | 4L+1B | 897.95 | 2818.19
5 | 3L+2B | 536.63 | 2573.14
5 | 2L+3B | 432.25 | 2075.05
5 | 1L+4B | 425.45 | 2046.62
6 | 4L+2B | 484.96 | 2165.45
6 | 3L+3B | 387.11 | 2149.71
6 | 2L+4B | 412.91 | 2297.40
7 | 4L+3B | 411.70 | 2225.60
7 | 3L+4B | 357.59 | 1934.66
8 | 4L+4B | 378.41 | 2045.00

Table A.5: Offline profiling results for Streamcluster, executing on the Odroid-XU3

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 2588.80 | 4137.59
1 | 0L+1B | 881.84 | 2694.36
2 | 2L | 2476.25 | 3935.27
2 | 1L+1B | 883.20 | 2688.13
2 | 0L+2B | 886.76 | 2679.46
3 | 3L | 2463.82 | 4424.74
3 | 2L+1B | 882.30 | 2701.43
3 | 1L+2B | 882.14 | 4545.92
3 | 0L+3B | 885.90 | 4449.48
4 | 4L | 1282.76 | 2560.50
4 | 3L+1B | 908.47 | 2774.88
4 | 2L+2B | 515.23 | 2406.14
4 | 1L+3B | 514.67 | 2765.96
4 | 0L+4B | 515.00 | 2812.07
5 | 4L+1B | 907.77 | 2771.59
5 | 3L+2B | 516.11 | 2662.74
5 | 2L+3B | 515.42 | 2475.65
5 | 1L+4B | 515.28 | 2732.45
6 | 4L+2B | 587.31 | 3111.27
6 | 3L+3B | 397.75 | 2074.12
6 | 2L+4B | 399.47 | 1948.54
7 | 4L+3B | 403.40 | 2140.17
7 | 3L+4B | 402.23 | 2154.34
8 | 4L+4B | 365.15 | 1891.56

Table A.6: Offline profiling results for Vips, executing on the Odroid-XU3

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 993.70 | 1586.06
1 | 0L+1B | 415.60 | 680.93
2 | 2L | 949.52 | 1732.09
2 | 1L+1B | 445.18 | 1403.84
2 | 0L+2B | 405.96 | 1273.09
3 | 3L | 945.88 | 1885.60
3 | 2L+1B | 440.75 | 1387.54
3 | 1L+2B | 254.09 | 1264.97
3 | 0L+3B | 462.34 | 2189.35
4 | 4L | 1024.71 | 2208.15
4 | 3L+1B | 273.31 | 864.04
4 | 2L+2B | 460.12 | 2267.79
4 | 1L+3B | 408.13 | 1980.98
4 | 0L+4B | 404.36 | 2050.51
5 | 4L+1B | 469.31 | 1477.87
5 | 3L+2B | 283.94 | 1420.65
5 | 2L+3B | 283.97 | 1389.51
5 | 1L+4B | 333.75 | 1568.48
6 | 4L+2B | 295.91 | 1499.53
6 | 3L+3B | 243.39 | 1194.39
6 | 2L+4B | 306.86 | 1397.03
7 | 4L+3B | 258.11 | 1175.81
7 | 3L+4B | 244.99 | 1177.07
8 | 4L+4B | 301.73 | 1305.84

Table A.7: Offline profiling results for Water_Spatial, executing on the Odroid-XU3 (rows ordered by decreasing execution time)

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 1787.21 | 2812.24
2 | 2L | 812.59 | 1286.46
1 | 0L+1B | 689.48 | 1788.34
3 | 3L | 572.58 | 882.37
5 | 4L+1B | 528.30 | 1373.47
4 | 3L+1B | 504.37 | 1308.34
2 | 1L+1B | 493.16 | 1306.41
3 | 2L+1B | 447.56 | 1155.18
4 | 4L | 429.63 | 646.38
2 | 0L+2B | 339.60 | 914.92
3 | 1L+2B | 334.17 | 894.93
5 | 3L+2B | 309.61 | 818.89
6 | 4L+2B | 309.54 | 808.20
4 | 2L+2B | 303.57 | 796.57
3 | 0L+3B | 265.76 | 711.39
4 | 1L+3B | 258.72 | 697.55
5 | 1L+4B | 226.84 | 606.37
5 | 2L+3B | 225.10 | 593.21
4 | 0L+4B | 224.08 | 589.21
6 | 2L+4B | 207.33 | 546.34
6 | 3L+3B | 205.45 | 522.74
7 | 4L+3B | 191.02 | 506.21
7 | 3L+4B | 186.06 | 496.19
8 | 4L+4B | 175.53 | 451.55

Table A.8: Offline profiling results for Raytrace, executing on the Odroid-XU3 (rows ordered by decreasing execution time)

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 2343.84 | 3717.67
2 | 2L | 1290.41 | 2250.87
3 | 3L | 915.93 | 1756.57
4 | 4L | 732.71 | 1512.31
5 | 4L+1B | 498.88 | 1548.78
4 | 3L+1B | 488.07 | 1529.05
3 | 2L+1B | 473.71 | 1521.97
2 | 1L+1B | 457.71 | 1432.97
1 | 0L+1B | 444.58 | 1296.94
5 | 3L+2B | 278.44 | 1374.51
6 | 4L+2B | 276.70 | 1383.30
4 | 2L+2B | 270.33 | 1333.19
3 | 1L+2B | 258.37 | 1204.50
2 | 0L+2B | 251.70 | 1180.44
3 | 0L+3B | 226.89 | 1140.63
4 | 1L+3B | 207.02 | 999.05
4 | 0L+4B | 201.31 | 992.15
5 | 2L+3B | 192.08 | 969.55
6 | 3L+3B | 191.43 | 969.39
5 | 1L+4B | 191.23 | 897.53
7 | 4L+3B | 184.48 | 933.73
6 | 2L+4B | 179.64 | 867.52
8 | 4L+4B | 178.74 | 898.58
7 | 3L+4B | 171.47 | 858.35

Table A.9: Offline profiling results for Fmm, executing on the Odroid-XU3 (rows ordered by decreasing execution time)

# cores | Resource combination | Execution Time (s) | Energy Consumption (J)
1 | 1L | 14.48 | 22.81
4 | 4L | 13.82 | 27.36
3 | 3L | 13.79 | 25.73
2 | 2L | 13.79 | 23.63
1 | 0L+1B | 9.57 | 31.61
5 | 3L+2B | 9.55 | 48.19
4 | 3L+1B | 9.54 | 30.57
3 | 2L+1B | 9.54 | 30.78
5 | 4L+1B | 9.54 | 30.27
2 | 1L+1B | 9.53 | 31.13
6 | 4L+2B | 9.52 | 49.65
7 | 4L+3B | 9.52 | 51.30
5 | 2L+3B | 9.51 | 53.18
4 | 2L+2B | 9.51 | 48.75
5 | 1L+4B | 9.51 | 51.69
8 | 4L+4B | 9.51 | 52.14
6 | 2L+4B | 9.50 | 51.67
3 | 0L+3B | 9.50 | 52.57
4 | 0L+4B | 9.50 | 51.34
6 | 3L+3B | 9.50 | 52.99
2 | 0L+2B | 9.49 | 49.89
3 | 1L+2B | 9.49 | 49.51
4 | 1L+3B | 9.49 | 52.90
7 | 3L+4B | 9.48 | 51.52

Appendix B

Experimental Data Used in Classification Bins

This appendix contains the experimental data used in the DVFS governors presented in Section 3.3 of Chapter 3 and Section 5.2.5.3 of Chapter 5.

Table B.1: The classification bins, namely utilisation and MRPI, and corresponding frequencies for the Odroid-XU3

Big: Utilisation | MRPI | Frequency (×100 MHz); LITTLE: Utilisation | MRPI | Frequency (×100 MHz)
>0.02 8 >0.04 8 (0.01, 0.02] 9 (0.03, 0.04] 9 (0.009, 0.01] 10 (0.02, 0.03] 10 (0.008, 0.009] 11 >95.0 (0.01, 0.02] 11 (0.007, 0.008] 12 (0.095, 0.01] 12 (0.006, 0.007] 13 (0.009, 0.0095] 13 >95.0 (0.005, 0.006] 14 <=0.009 14 (0.0045, 0.005] 15 >0.03 8 (0.004, 0.0045] 16 (0.02, 0.03] 9 (0.0035, 0.004] 17 (0.01, 0.02] 10 (85.0, 95.0] (0.003, 0.0035] 18 (0.095, 0.01] 11 (0.0025, 0.003] 19 (0.009, 0.0095] 12 <=0.0025 20 <=0.009 13 >0.01 8 >0.02 9 (0.009, 0.01] 9 (0.01, 0.02] 10 (0.008, 0.009] 10 (75.0, 85.0] (0.095, 0.01] 11 (0.007, 0.008] 11 (0.009, 0.0095] 12 (0.006, 0.007] 12 <=0.009 13 (85.0, 95.0] (0.005, 0.006] 13 >0.01 8 (0.0045, 0.005] 14 (0.095, 0.01] 9 (65.0, 75.0] (0.004, 0.0045] 15 (0.009, 0.0095] 10 (0.0035, 0.004] 16 <=0.009 11 (0.003, 0.0035] 17 >0.0095 7 (55.0, 65.0] <=0.003 18 (0.009, 0.0095] 8


Table B.2: Classification bins data continued from Table B.1

Big: Utilisation | MRPI | Frequency (×100 MHz); LITTLE: Utilisation | MRPI | Frequency (×100 MHz)
>0.009 9 (0.008, 0.009] 9 (55.0, 65.0] (0.008, 0.009] 10 <=0.008 10 (0.007, 0.008] 11 >0.009 6 (0.006, 0.007] 12 (0.008, 0.009] 7 (45.0, 55.0] (75.0, 85.0] (0.005, 0.006] 13 (0.007, 0.008] 8 (0.0045, 0.005] 14 <= 0.007 9 (0.004, 0.0045] 15 >0.008 5 (0.0035, 0.004] 16 (35.0, 45.0] (0.007, 0.008] 6 <=0.0035 17 <= 0.007 7 (65.0, >0.008 8 >0.007 4 (25.0, 35.0] 75.0] (0.007, 0.008] 9 <= 0.007 3 (0.006, 0.007] 10 <=15.0 - 2 (0.005, 0.006] 11 (0.0045, 0.005] 12 (0.004, 0.0045] 13 (0.0035, 0.004] 14 <=0.0035 15 >0.007 7 (0.006, 0.007] 8 (0.005, 0.006] 9 (55.0, 65.0] (0.0045, 0.005] 10 (0.004, 0.0045] 11 (0.0035, 0.004] 12 <=0.0035 13 >0.006 6 (0.005, 0.006] 7 (0.0045, 0.005] 8 (45.0, 55.0] (0.004, 0.0045] 9 (0.0035, 0.004] 10 <=0.0035 11 >0.005 5 (0.0045, 0.005] 6 (35.0, 45.0] (0.004, 0.0045] 7 (0.0035, 0.004] 8 <=0.0035 9 >0.0045 4 (0.004, 0.0045] 5 (25.0, 35.0] (0.0035, 0.004] 6 <0.0035 7 >0.004 3 (15.0, 25.0] <=0.004 2 <15.0 - 2

Table B.3: The classification bins, namely utilisation and MRPI, and corresponding frequencies for the Intel Xeon E5-2630

Utilisation | MRPI | Frequency (×100 MHz); Utilisation | MRPI | Frequency (×100 MHz)
>0.26 17 >0.18 14 (0.24, 0.26] 18 (0.16, 0.18] 15 (0.22, 0.24] 19 (0.14, 0.16] 16 (0.20, 0.22] 20 (0.12, 0.14] 17 (55.0, 65.0] (0.18, 0.20] 21 (0.10, 0.12] 18 >95.0 (0.16, 0.18] 22 (0.08, 0.10] 19 (0.14, 0.16] 23 (0.06, 0.08] 20 (0.12, 0.14] 24 <=0.06 21 (0.10, 0.12] 25 >0.16 13 <=0.10 26 (0.14, 0.16] 14 >0.24 17 (0.12, 0.14] 15 (0.22, 0.24] 18 (0.10, 0.12] 16 (45.0, 55.0] (0.20, 0.22] 19 (0.08, 0.10] 17 (0.18, 0.20] 20 (0.06, 0.08] 18 (85.0, 95.0] (0.16, 0.18] 21 (0.04, 0.06] 19 (0.14, 0.16] 22 <=0.04 20 (0.12, 0.14] 23 >0.14 12 (0.10, 0.12] 24 (0.12, 0.14] 13 <=0.10 25 (0.10, 0.12] 14 >0.22 16 (35.0, 45.0] (0.08, 0.10] 15 (0.20, 0.22] 17 (0.06, 0.08] 16 (0.18, 0.20] 18 (0.04, 0.06] 17 (0.16, 0.18] 19 <=0.04 18 (75.0, 85.0] (0.14, 0.16] 20 >0.12 12 (0.12, 0.14] 21 (0.10, 0.12] 13 (0.10, 0.12] 22 (25.0, 35.0] (0.08, 0.10] 14 (0.08, 0.10] 23 (0.06, 0.08] 15 <=0.08 24 <=0.06 16 >0.20 15 >0.10 12 (0.18, 0.20] 16 (15.0, 25.0] (0.08, 0.10] 13 (0.16, 0.18] 17 <=0.08 14 (0.14, 0.16] 18 <15.0 - 12 (65.0, 75.0] (0.12, 0.14] 19 (0.10, 0.12] 20 (0.08, 0.10] 21 (0.06, 0.08] 22 <=0.06 23

Appendix C

Sample Code for Data Collection and Processing

This appendix contains the sample code used to collect and process the data gathered for experimental validation.

#!/bin/bash
##########################################################
# Power_Sensors.sh
# Script for sampling the power sensors on the Odroid-XU3
# Based on a sample script from https://forum.odroid.com/
# Modified by Karunakar R. Basireddy ([email protected])
##########################################################

# Enable the INA231 power sensors
echo 1 > /sys/bus/i2c/drivers/INA231/3-0045/enable
echo 1 > /sys/bus/i2c/drivers/INA231/3-0040/enable
echo 1 > /sys/bus/i2c/drivers/INA231/3-0041/enable
echo 1 > /sys/bus/i2c/drivers/INA231/3-0044/enable

# Wait one second so that the sensors are fully enabled and have taken their first reading
sleep 1

# Main sampling loop
while true; do
    # ---------- CPU DATA ---------- #

    A7_W=`cat /sys/bus/i2c/drivers/INA231/3-0045/sensor_W`
    A15_W=`cat /sys/bus/i2c/drivers/INA231/3-0040/sensor_W`


    # ---------- Memory DATA ---------- #

    MEM_W=`cat /sys/bus/i2c/drivers/INA231/3-0041/sensor_W`

    # Writing the data to files
    echo "$A15_W" >> ./Results/Power_A15.txt
    echo "$A7_W"  >> ./Results/Power_A7.txt
    echo "$MEM_W" >> ./Results/Power_Memory.txt
    sleep 0.1

    # Checking if the application has finished execution
    a=`cat status.txt`
    if [ $a -eq 1 ]; then
        exit
    fi
done
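As used in the experiments, the script runs alongside the application under test: every 0.1 s one sample per power rail (A15, A7 and memory) is appended to the corresponding trace file under ./Results/, and the loop terminates once the application (or its launch wrapper) writes 1 to status.txt.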

/**********************************************************/
/* Read the RAPL registers on recent (>Sandy Bridge) Intel processors */
/* MSR code originally based on a (never made it upstream) linux-kernel */
/* RAPL driver by Zhang Rui */
/* https://lkml.org/lkml/2011/5/26/93 */
/* Additional contributions by: Romain Dolbeau - romain @ dolbeau.org */
/* For raw MSR access the /dev/cpu/??/msr driver must be enabled and */
/* permissions set to allow read access. */
/* You might need to "modprobe msr" before it will work. */
/* perf_event_open() support requires at least Linux 3.14 and */
/* /proc/sys/kernel/perf_event_paranoid < 1 */
/* The sysfs powercap interface got into the kernel in 2d281d8196e38dd (3.13) */
/* Compile with: gcc -O2 -Wall -o rapl-read rapl-read.c -lm */
/* Vince Weaver - vincent.weaver @ maine.edu - 11 September 2015 */
/* Modified by Karunakar: [email protected] */
/* Date: 9-March-2018 */
/**********************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <fcntl.h>
#include <math.h>
#include <inttypes.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
#include <sys/ioctl.h>

#define MSR_RAPL_POWER_UNIT 0x606

/*
 * Platform-specific RAPL Domains: note that the PP1 RAPL Domain is
 * supported on 062A only and the DRAM RAPL Domain is supported on 062D only
 */
/* Package RAPL Domain */
#define MSR_PKG_RAPL_POWER_LIMIT 0x610
#define MSR_PKG_ENERGY_STATUS 0x611
#define MSR_PKG_PERF_STATUS 0x613
#define MSR_PKG_POWER_INFO 0x614

/* PP0 RAPL Domain */
#define MSR_PP0_POWER_LIMIT 0x638
#define MSR_PP0_ENERGY_STATUS 0x639
#define MSR_PP0_POLICY 0x63A
#define MSR_PP0_PERF_STATUS 0x63B

/* PP1 RAPL Domain, may reflect to uncore devices */
#define MSR_PP1_POWER_LIMIT 0x640
#define MSR_PP1_ENERGY_STATUS 0x641
#define MSR_PP1_POLICY 0x642

/* DRAM RAPL Domain */
#define MSR_DRAM_POWER_LIMIT 0x618
#define MSR_DRAM_ENERGY_STATUS 0x619
#define MSR_DRAM_PERF_STATUS 0x61B
#define MSR_DRAM_POWER_INFO 0x61C

/* PSYS RAPL Domain */
#define MSR_PLATFORM_ENERGY_STATUS 0x64d

/* RAPL UNIT BITMASK */
#define POWER_UNIT_OFFSET 0
#define POWER_UNIT_MASK 0x0F
#define ENERGY_UNIT_OFFSET 0x08
#define ENERGY_UNIT_MASK 0x1F00
#define TIME_UNIT_OFFSET 0x10
#define TIME_UNIT_MASK 0xF000

static int open_msr(int core) {
    char msr_filename[BUFSIZ];
    int fd;

    sprintf(msr_filename, "/dev/cpu/%d/msr", core);
    fd = open(msr_filename, O_RDONLY);
    if ( fd < 0 ) {
        if ( errno == ENXIO ) {
            fprintf(stderr, "rdmsr: No CPU %d\n", core);
            exit(2);
        } else if ( errno == EIO ) {
            fprintf(stderr, "rdmsr: CPU %d doesn't support MSRs\n", core);
            exit(3);
        } else {
            perror("rdmsr:open");
            fprintf(stderr, "Trying to open %s\n", msr_filename);
            exit(127);
        }
    }
    return fd;
}

static long long read_msr(int fd, int which) {
    uint64_t data;

    if ( pread(fd, &data, sizeof data, which) != sizeof data ) {
        perror("rdmsr:pread");
        exit(127);
    }
    return (long long)data;
}

#define CPU_SANDYBRIDGE 42
#define CPU_SANDYBRIDGE_EP 45
#define CPU_IVYBRIDGE 58
#define CPU_IVYBRIDGE_EP 62

#define CPU_HASWELL 60
#define CPU_HASWELL_ULT 69
#define CPU_HASWELL_GT3E 70
#define CPU_HASWELL_EP 63
#define CPU_BROADWELL 61
#define CPU_BROADWELL_GT3E 71
#define CPU_BROADWELL_EP 79
#define CPU_BROADWELL_DE 86
#define CPU_SKYLAKE 78
#define CPU_SKYLAKE_HS 94
#define CPU_SKYLAKE_X 85
#define CPU_KNIGHTS_LANDING 87
#define CPU_KNIGHTS_MILL 133
#define CPU_KABYLAKE_MOBILE 142
#define CPU_KABYLAKE 158
#define CPU_ATOM_SILVERMONT 55
#define CPU_ATOM_AIRMONT 76
#define CPU_ATOM_MERRIFIELD 74
#define CPU_ATOM_MOOREFIELD 90
#define CPU_ATOM_GOLDMONT 92
#define CPU_ATOM_GEMINI_LAKE 122
#define CPU_ATOM_DENVERTON 95

/* Collecting CPU information */
static int detect_cpu(void) {
    FILE *fff;
    int family, model=-1;
    char buffer[BUFSIZ], *result;
    char vendor[BUFSIZ];

    fff=fopen("/proc/cpuinfo","r");
    if (fff==NULL) return -1;

    while(1) {
        result=fgets(buffer,BUFSIZ,fff);
        if (result==NULL) break;
        if (!strncmp(result,"vendor_id",8)) {
            sscanf(result,"%*s%*s%s",vendor);
            if (strncmp(vendor,"GenuineIntel",12)) {
                printf("%s not an Intel chip\n",vendor);
                return -1;
            }
        }
        if (!strncmp(result,"cpu family",10)) {
            sscanf(result,"%*s%*s%*s%d",&family);
            if (family!=6) {
                printf("Wrong CPU family %d\n",family);
                return -1;
            }
        }
        if (!strncmp(result,"model",5)) {
            sscanf(result,"%*s%*s%d",&model);
        }
    }
    fclose(fff);
    return model;
}

#define MAX_CPUS 1024
#define MAX_PACKAGES 16

static int total_cores=0, total_packages=0;
static int package_map[MAX_PACKAGES];

static int detect_packages(void) {
    char filename[BUFSIZ];
    FILE *fff;
    int package;
    int i;

    for(i=0;i<MAX_PACKAGES;i++) package_map[i]=-1;

    /* Map each CPU to its physical package via sysfs */
    for(i=0;i<MAX_CPUS;i++) {
        sprintf(filename,"/sys/devices/system/cpu/cpu%d/topology/physical_package_id",i);
        fff=fopen(filename,"r");
        if (fff==NULL) break;
        fscanf(fff,"%d",&package);
        fclose(fff);
        total_cores=i+1;
        if (package_map[package]==-1) {
            total_packages++;
            package_map[package]=i;
        }
    }
    return 0;
}

/*******************************/
/* MSR code */
/*******************************/
static int rapl_msr(int core, int cpu_model, useconds_t S_period) {
    int fd;
    long long result;
    double power_units, time_units;
    double cpu_energy_units[MAX_PACKAGES], dram_energy_units[MAX_PACKAGES];
    double pp0_before[MAX_PACKAGES], pp0_after[MAX_PACKAGES], total_energy;
    int j;
    int dram_avail=0, pp0_avail=0, pp1_avail=0, psys_avail=0;
    int different_units=0;

    if (cpu_model<0) {
        printf("\tUnsupported CPU model %d\n",cpu_model);
        return -1;
    }

    /* Setting the package info based on the CPU model */
    switch(cpu_model) {
        case CPU_IVYBRIDGE_EP:
            pp0_avail=1;
            pp1_avail=0;
            dram_avail=1;
            different_units=0;
            psys_avail=0;
            break;
    }

    for(j=0;j<total_packages;j++) {
        fd=open_msr(package_map[j]);

        /* Calculate the units used */
        result=read_msr(fd,MSR_RAPL_POWER_UNIT);
        power_units=pow(0.5,(double)(result&0xf));
        cpu_energy_units[j]=pow(0.5,(double)((result>>8)&0x1f));
        time_units=pow(0.5,(double)((result>>16)&0xf));

        /* The DRAM units differ from the CPU ones */
        if (different_units) {
            dram_energy_units[j]=pow(0.5,(double)16);
        } else {
            dram_energy_units[j]=cpu_energy_units[j];
        }
        close(fd);
    }

    for(j=0;j

#!/bin/bash
##########################################################
# Power_Process.sh
# Script for processing power/energy data
# Created by Karunakar R. Basireddy ([email protected])
##########################################################

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 " >&2
    exit 1
fi
a=$1
b=$2
#Core=$1

# Creates a directory Trimmed_Data if it doesn't exist
DIR=""
if [ -d "$DIR" ]; then
    echo -e "Seems the directory exists, it may overwrite the contents! Exiting\n"
    exit
else
    mkdir /home/karunakar//process-data
    exit
fi

DIR_D="/home/karunakar//process-data"
DATD="/home/karunakar//b/a"

# Moves the data between each START and FINISHED marker to a new file
sed 's/<<<<<<<<START>>>>>>>>>>>/\t&\n/g;s/^\n\(<<<<<<<<START>>>>>>>>>>>\)/\1\n/;\
s/<<<<<<<<FINISHED>>>>>>>>>>>/\n&\n/g;s/^\n\(<<<<<<<<FINISHED>>>>>>>>>>>\)/\1\n/'\
"$DATD"/Energy_data.csv | csplit -zf Energy_out_ -b "%03d.csv" -\
'%<<<<<<<<START>>>>>>>>>>>%' '/^<<<<<<<<START>>>>>>>>>>>/' {*}

for file in Energy_out_*
do
    sed -i '/<<<<<<<<FINISHED>>>>>>>>>>>/q' $file
    sed -i 's/<<<<<<<<START>>>>>>>>>>>//' $file
    sed -i 's/<<<<<<<<FINISHED>>>>>>>>>>>//' $file
    sed -i '/^$/d' $file

    # Removing the column labels and the first and last rows, as they contain incomplete data
    awk -F, '{ $1 = ($1 >= 50 ? 0 : $1) } 1' OFS=, $file > temp.csv
    awk -F, '{ $1 = ($1 < 0 ? 0 : $1) } 1' OFS=, temp.csv > $file

    # Computing the total energy consumption for each application
    awk -F, '{for (i=1;i<=NF;i++) sum[i]+=$i;} END{for (i in sum) printf "%d,", sum[i];}' $file >> Energy_"$a"_"$b"_Final.csv

    # Find the number of rows in each file to compute the execution time of each application (x Sampling_period)
    wc -l $file | awk '{printf "%d", $1}' >> Energy_"$a"_"$b"_Final.csv
    printf "\n" >> Energy_"$a"_"$b"_Final.csv
done

mv Energy_"$a"_"$b"_Final.csv "$DIR_D"/
rm Energy_out_*
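The awk one-liners above amount to a discrete integration of the power trace: with the fixed 0.1 s sampling period of Power_Sensors.sh, total energy is the sum of the power samples multiplied by the period, and execution time is the row count multiplied by the period. A minimal Python equivalent, assuming one power sample in watts per CSV row and folding the multiplication by the sampling period in directly, is:

import csv

SAMPLING_PERIOD_S = 0.1  # matches the 'sleep 0.1' in Power_Sensors.sh

def energy_and_time(power_csv_path):
    """Return (energy in joules, execution time in seconds) for one
    application's trace, assuming one power sample (watts) per row."""
    samples = []
    with open(power_csv_path) as f:
        for row in csv.reader(f):
            if row:  # skip blank lines left over from trimming
                samples.append(float(row[0]))
    energy_j = sum(samples) * SAMPLING_PERIOD_S     # discrete integral of P dt
    exec_time_s = len(samples) * SAMPLING_PERIOD_S
    return energy_j, exec_time_s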

##########################################################
# Boxplot.py
# Script for plotting the boxplots in Chapter 5
# Created by Karunakar R. Basireddy ([email protected])
##########################################################
import numpy
import matplotlib as mpl
import xlrd

# Read the Excel file
loc = "/home/karunakar//file.xlsx"

# Font size
mpl.rcParams.update({'font.size': 10})

# To open the workbook
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
xticks = numpy.arange(100, 1400, 100)
#print(sheet.ncols)
#print(sheet.nrows)
# For row 0 and column 0
#sheet.cell_value(0, 0)
data = [[sheet.cell_value(c, r) for c in range(sheet.nrows)]
        for r in range(sheet.ncols)]

# Dumping an array to a csv file
#numpy.savetxt("foo.csv", data, delimiter=",")
#print(data)

# The agg backend is used to create the plot as a .png file
mpl.use('agg')
import matplotlib.pyplot as plt

# Create a figure instance
fig = plt.figure(1, figsize=(7, 2.7))

# Create an axes instance
ax = fig.add_subplot(111)

# Adding grids
#ax.grid(True)
plt.gca().yaxis.grid(True)

# Create the boxplot
bp = ax.boxplot(data)
plt.title('big to LITTLE')
plt.xlabel('Number of Decision Stumps')
plt.ylabel('Percentage Error (%)')

# Custom x-axis labels
ax.set_xticklabels(xticks)

# Save the figure
fig.savefig('fig1.png', bbox_inches='tight')
