Regression Modelling of Power Consumption for Heterogeneous Processors
Regression Modelling of Power Consumption for Heterogeneous Processors

by

Tahir Diop

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2013 by Tahir Diop

Abstract

This thesis is composed of two parts that relate to both parallel and heterogeneous processing.

The first describes DistCL, a distributed OpenCL framework that allows a cluster of GPUs to be programmed like a single device. It uses programmer-supplied meta-functions that associate work-items with memory. DistCL achieves speedups of up to 29× using 32 peers. By comparing DistCL to SnuCL, we determine that the compute-to-transfer ratio of a benchmark is the best predictor of its performance scaling when distributed.

The second is a statistical power model for the AMD Fusion heterogeneous processor. We present a systematic methodology to create a representative set of compute micro-benchmarks using data collected from real hardware. The power model is created with data from both micro-benchmarks and application benchmarks. The model showed an average predictive error of 6.9% on heterogeneous workloads. The Multi2Sim heterogeneous simulator was modified to support configurable power modelling.

Dedication

To my wife and best friend Petra.
Contents

1 Introduction
  1.1 Contributions
  1.2 Organization
2 Background
  2.1 GPU Architecture
    2.1.1 AMD Evergreen
    2.1.2 Nvidia Fermi
  2.2 Fusion APU
    2.2.1 CPU
    2.2.2 GPU
  2.3 Programming Models
    2.3.1 OpenCL
    2.3.2 CUDA
  2.4 Simulators
    2.4.1 Multi2Sim
    2.4.2 GPGPUSim
  2.5 SciNet
3 Distributing OpenCL Kernels
  3.1 Background
  3.2 DistCL
    3.2.1 Partitioning
    3.2.2 Dependencies
    3.2.3 Scheduling Work
    3.2.4 Transferring Buffers
  3.3 Experimental Setup
    3.3.1 Linear Compute and Memory
    3.3.2 Compute-Intensive
    3.3.3 Inter-Node Communication
    3.3.4 Cluster
    3.3.5 SnuCL
  3.4 Results and Discussion
  3.5 Performance Comparison with SnuCL
  3.6 Conclusion
4 Selecting Representative Benchmarks for Power Evaluation
  4.1 Power Measurements
  4.2 Micro-benchmark Selection
    4.2.1 Memory Benchmarks
    4.2.2 Compute Benchmarks
  4.3 Conclusion
5 Power Modelling
  5.1 Background
  5.2 Selecting Benchmarks
    5.2.1 Micro-Benchmarks
    5.2.2 Application Benchmarks
  5.3 Measuring
    5.3.1 Hardware Performance Counters
    5.3.2 Multi2Sim Simulation
  5.4 Modelling
  5.5 Conclusion
6 Power Multi2Sim
  6.1 Epochs
  6.2 Using Power Modelling
    6.2.1 Configuration
    6.2.2 Runtime Usage
    6.2.3 Reports
  6.3 Validation
  6.4 Conclusion
7 Conclusion and Future Work
Bibliography
Appendices
A Clustering Details
B Multi2Sim CPU Configuration Details

List of Tables

2.1 AMD A6-3650 Specification
3.1 Benchmark Description
3.2 Cluster Specifications
3.3 Measured Cluster Performance
3.4 Execution Time Spent Managing Dependencies
3.5 Execution Time Spent Managing Dependencies
3.6 Benchmark Performance Characteristics
4.1 Data Acquisition Unit Specifications
4.2 ACS711 Current Sensor Specifications
4.3 AMD Fusion Cache Specification
4.4 Possible Factor Values for Benchmarks
4.5 Operation Groupings
4.6 Sensitivity Scores for the CPU
4.7 Sensitivity Scores for the GPU
5.1 Application Benchmarks Used
5.2 Instruction Categories
5.3 CPU Configuration Summary
5.4 Memory Latency Comparison
5.5 Memory Configuration
5.6 GPU Model Coefficients
5.7 CPU Model Coefficients
5.8 GPU Model Coefficients
5.9 CPU Model Coefficients
5.10 APU Model Coefficients
A.1 Most Common Property per Cluster for the CPU
B.1 CPU Configuration Details

List of Figures

2.1 AMD Evergreen based streaming processor
2.2 AMD Evergreen based SIMD core
2.3 Nvidia Fermi based CUDA core
3.1 Vector's 1-dimensional NDRange is partitioned into 4 subranges
3.2 The read meta-function is called for buffer a in subrange 1 of vector
3.3 Speedup of distributed benchmarks using DistCL
3.4 Breakdown of runtime
3.5 HotSpot with various pyramid heights
3.6 DistCL and SnuCL speedups
3.7 DistCL and SnuCL compared relative to compute-to-transfer ratio
4.1 Idle power measurements done using the DI-145
4.2 Idle power measurements done using the DI-149
4.3 MSI A75MA-G55 motherboard schematic [63]
4.4 Schematic of the measuring setup
4.5 A picture of the measuring setup in action
4.6 An example of a stack used to store the order of recent memory accesses
4.7 Energy consumption of ALU benchmarks on the CPU
4.8 Energy consumption of ALU benchmarks on the GPU
4.9 Frequency of cluster sizes from the CPU results
4.10 Frequency of property being the most common in a cluster
4.11 Percentage of benchmarks in a cluster that share the most common property
5.1 Steps involved in the modelling process
5.2 Comparison of the literal and best memory configurations
5.3 The regression process
5.4 Fitting error of the training benchmarks for the CPU models
5.5 Fitting error of the training benchmarks for the GPU models
5.6 Fitting error of the validation benchmarks for the CPU models
5.7 Fitting error of the validation benchmarks for the GPU models
5.8 Linear regression of workloads at various frequencies
5.9 Predicted and true values for the total energy of the Rodinia benchmarks
6.1 Measured power consumption of back propagation on real hardware
6.2 Simulated power consumption of back propagation using Multi2Sim

List of Abbreviations

ABI | Application Binary Interface
AGU | Address Generating Unit
ALU | Arithmetic Logic Unit
APU | Accelerated Processing Unit
AMD | Advanced Micro Devices
API | Application