Lehrstuhl für Rechnerarchitektur & Parallele Systeme Fakultät für Informatik Technische Universität München

Automatic Analysis and Exploitation of Dynamism for Energy-Efficient HPC Applications

Madhura Kumaraswamy


Complete reprint of the dissertation approved by the Fakultät für Informatik of the Technische Universität München for the award of the academic degree of

Doktor der Naturwissenschaften (Dr. rer. nat.)

Chair: Prof. Gudrun J. Klinker, Ph.D.

Examiners of the dissertation:
1. Prof. Dr. Hans Michael Gerndt
2. Prof. Dr. Per Gunnar Kjeldsberg, Norwegian University of Science and Technology

The dissertation was submitted to the Technische Universität München on 03.08.2020 and accepted by the Fakultät für Informatik on 13.09.2020.

Abstract

The increase in energy consumption that accompanies growing computational performance is a major challenge on the path to Exascale computing, due to the cooling costs of High Performance Computing (HPC) systems. Unfortunately, developers focus on improving the performance of existing algorithms and neglect potential improvements to energy-efficiency due to a lack of platform knowledge. To overcome these challenges, we propose a dynamic autotuning approach consisting of a design-time and a runtime stage that combines methods from the embedded systems and HPC domains to optimize the energy consumption. The approach targets applications that exhibit changing characteristics, known as application dynamism, both between individual iterations of the time loop and within a single iteration. We leverage the inter-loop and intra-loop dynamism using two tuning plugins, which apply a search strategy to create a search space of different configurations of tuning knobs. To analyze the inter-loop dynamism, we propose features to cluster loops with similar characteristics using the DBSCAN and spectral clustering methods. To prevent the search space from exploding and to save tuning time and cost, we optimize the search space using a probability distribution based on a Gaussian kernel, which allows a large number of configurations to be tested in a single application run. Our approach stores the best configuration for each instance of a code region, based on its unique computational characteristics, in a tuning model. In the runtime tuning stage, we use the tuning model to dynamically switch the system configuration for different code regions. To predict the behaviour of unseen loops during production runs, we implemented predictors based on a second-order Markov chain as well as on one-bit and two-bit dynamic branch prediction schemes. While the proposed method is fully automatic, we also enable the specification of domain knowledge to improve the tuning results. We evaluated our methodology using three real-world applications and one proxy application, and achieved up to 20% improvement in both energy-efficiency and performance.


Acknowledgments

Years of research have culminated in this dissertation, which would not have been possible without the support and guidance of the following people.

First of all, I would like to express my deepest gratitude to my advisor Prof. Dr. Michael Gerndt for giving me this wonderful and fruitful opportunity to pursue research in a stimulating and dynamic research environment. I am thankful for his immense patience, wisdom, support, insightful guidance, and constructive criticism that equipped me with the confidence to independently develop research ideas. I am also grateful to my second advisor Prof. Per Gunnar Kjeldsberg for his valuable and expert feedback during my visit to the Norwegian University of Science and Technology (NTNU).

The work presented in this dissertation is an extension of the core concepts developed in the READEX project. I would like to thank all the researchers who worked on the READEX project and directly or indirectly helped me throughout my journey. I am also indebted to the Centre for Information Services and High Performance Computing (ZIH) at the Technische Universität Dresden for providing me the resources and enabling me to pursue my work. I gratefully acknowledge the HPC Support team for their patience with my constant stream of emails.

I would also like to thank Prof. Dr. Martin Schulz, the head of the Chair of Computer Architecture and Parallel Systems (CAPS) at the Technische Universität München (TUM), for creating a great working environment and encouraging collaboration among PhD students. I am grateful to Andreas Wilhelm, Anamika Chowdhury, Robert Mijaković, Vladimir Podolskiy, and Dai Yang for the fruitful discussions and fun moments at the Chair. I also thank TUM for the financial support for my research during the last years of my PhD.

I express my appreciation and thanks to all my friends for their encouragement that kept me going through tough times. Last but not least, I am incredibly grateful to my family for their moral support and love, and for inspiring me to pursue my dreams. Special thanks go to my wonderful sister and friend who gave me the much required moments of comic relief. Thank you all.


Contents

List of Figures

List of Tables

List of Listings

1 Introduction
  1.1 Motivation and Problem Specification
  1.2 Solution: A Tools-Aided Approach to Optimize the Energy-Efficiency
  1.3 Contributions
  1.4 Structure of the Dissertation

2 Related Work
  2.1 Hardware Mechanisms for Power Management
    2.1.1 Clock Gating
    2.1.2 Power Gating
    2.1.3 Dynamic Duty Cycle Modulation (DDCM)
    2.1.4 Dynamic Voltage Frequency Scaling (DVFS)
    2.1.5 Uncore Frequency Scaling (UFS)
    2.1.6 Power-Aware Hardware
  2.2 Software Methods for Power Management
    2.2.1 Power-Aware Job Scheduling
    2.2.2 Power Management for Heterogeneous Systems
    2.2.3 DVFS
    2.2.4 UFS
    2.2.5 Clock Modulation
    2.2.6 Classification of Software Techniques
  2.3 Summary

3 Background
  3.1 Concepts and Formalism
    3.1.1 System Structure
    3.1.2 System Tuning
    3.1.3 Tuning Potential and Dynamism
  3.2 Tools
    3.2.1 Score-P
    3.2.2 Periscope Tuning Framework
    3.2.3 PAPI
  3.3 Tuning Parameters
    3.3.1 Hardware Tuning Parameters
    3.3.2 System Software Tuning Parameters
    3.3.3 Application Tuning Parameters
  3.4 Summary

4 Tuning Methodology
  4.1 Overview of the Tuning Workflow
  4.2 Architecture of the Tuning Framework
  4.3 Summary

5 Concepts of Inter-Phase Tuning
  5.1 Exploiting the Characteristic Behavior of Phases
    5.1.1 Clustering
  5.2 Targeted Search
  5.3 Dynamic Prediction of Phase Behavior
    5.3.1 Markov Chain Predictor
    5.3.2 Predictors Based on Dynamic Branch Predictors
  5.4 Summary

6 Integration into the Tuning Framework
  6.1 Pre-analysis
    6.1.1 Automatic Reduction of Instrumentation Overhead
    6.1.2 Domain Knowledge Specification
    6.1.3 Detection of Significant Regions
    6.1.4 Analyze Tuning Potential
  6.2 Design-Time Analysis
    6.2.1 Tuning Intra-Phase Dynamism
    6.2.2 Tuning Inter-Phase Dynamism
    6.2.3 Generating the Application Tuning Model
  6.3 Runtime Application Tuning
    6.3.1 Calibration
  6.4 Summary

7 Evaluation
  7.1 Platform Description
  7.2 Benchmark Specification
  7.3 Exploitation of Inter-Phase Dynamism
    7.3.1 128.GAPgeofem
    7.3.2 sam(oa)2
    7.3.3 INDEED
    7.3.4 miniMD

8 Summary and Outlook
  8.1 Future Work

List of Figures

1.1 Variation of the execution time across the phases of INDEED.

3.1 Components of PTF and their interaction with Score-P.

4.1 A high-level workflow of the autotuning methodology consisting of the pre-analysis steps, DTA and RAT.
4.2 The architecture of the READEX methodology, depicting the interaction between PTF, Score-P and RRL.

5.1 Automatic detection of eps for DBSCAN.
5.2 Clusters detected for a toy example consisting of 150 points in the [0,1] X-Y plane using spectral clustering.
5.3 Affinity matrix for the toy example computed using the Gaussian similarity function.
5.4 Eigenvalues computed for the graph Laplacian matrix for the toy example.
5.5 A zoomed-in view of the top 20 eigenvalues for the toy example.
5.6 The overall weights for the attractors and repellers determined using Equation 5.9.
5.7 The final overall weights for the attractors and repellers obtained after translating the values of fr to the positive Y-axis.
5.8 The discrete probability mass function obtained for all the configurations of the tuning parameters.
5.9 The discrete probability mass function obtained under the condition that no phase in the cluster will be evaluated with a previously executed configuration.
5.10 The discrete pmf illustrated as a surface-contour plot.
5.11 State transition diagram of a second-order Markov chain.
5.12 State diagram of one-bit dynamic branch prediction.
5.13 Example of a one-bit dynamic branch predictor to predict the sequence of branches TTTNTTTNTTTN.
5.14 Analogous example of a one-bit cluster predictor applied to predict the sequence of clusters 211131223233.
5.15 State diagram of two-bit dynamic branch prediction.
5.16 Example of a two-bit dynamic branch predictor to predict the sequence of branches TTTNTTTNTTTN.
5.17 Analogous example of a two-bit cluster predictor applied to predict the sequence of clusters 211131223233.

6.1 The workflow of the intraphase tuning plugin, consisting of four tuning steps.
6.2 The workflow of the interphase tuning plugin, consisting of four tuning steps.
6.3 Calling Context Tree (CCT) for a toy application consisting of two significant regions, bar and baz.


7.1 Eigenvalues computed for the graph Laplacian matrix of 128.GAPgeofem for a single node run.
7.2 Single node results of the cluster analysis performed on 80 phases of 128.GAPgeofem.
7.3 Eigenvalues computed for the graph Laplacian matrix for 128.GAPgeofem for a multi-node run.
7.4 Multi-node results of the cluster analysis performed on 80 phases of 128.GAPgeofem.
7.5 The trend in the energy consumption across the phases of 128.GAPgeofem.
7.6 The grid refinement procedure in sam(oa)2 as the simulation progresses through the tsunami time steps.
7.7 Eigenvalues computed for the graph Laplacian matrix for sam(oa)2 for a single node run.
7.8 Single node results of the cluster analysis performed on 120 phases of sam(oa)2.
7.9 Eigenvalues computed for the graph Laplacian matrix for sam(oa)2 for a multi-node run.
7.10 Multi-node results of the cluster analysis performed on 110 phases of sam(oa)2.
7.11 The trend in the energy consumption across the phases of sam(oa)2.
7.12 Eigenvalues computed for the graph Laplacian matrix for INDEED for a single node run.
7.13 Single node results of the cluster analysis performed on 110 phases of INDEED.
7.14 The trend in the energy consumption across the phases of INDEED.
7.15 Eigenvalues computed for the graph Laplacian matrix for miniMD for a single node run.
7.16 Single node results of the cluster analysis performed on 80 phases of miniMD.
7.17 Eigenvalues computed for the graph Laplacian matrix for miniMD for a multi-node run.
7.18 Multi-node results of the cluster analysis performed on 80 phases of miniMD.
7.19 The trend in the energy consumption across the phases of miniMD.

8.1 Variation of the execution time across the phases of INDEED when executed using an input set for an aluminum alloy sheet.

List of Tables

1.1 Power consumption and RMax for the top four supercomputers in the Top500 list.

2.1 Categorization of previous works on power-aware and energy optimization techniques.

3.1 Overview of the hardware, system software and application tuning parameters.

5.1 Hardware metrics selected as features for clustering, and their equivalent PAPI/perf events.

6.1 Identifiers for the significant regions and rts's during DTA.

7.1 Summary of the benchmarks used for evaluation.
7.2 Significant regions identified by readex-dyn-detect for 128.GAPgeofem.
7.3 Rts-specific cluster-best configurations of the tuning parameters for the rts's of 128.GAPgeofem for a single node run.
7.4 Rts-specific cluster-best configurations of the tuning parameters for the rts's of 128.GAPgeofem for a multi-node run.
7.5 Energy savings computed using the interphase plugin for 128.GAPgeofem.
7.6 Runtime savings obtained for 128.GAPgeofem using the three runtime cluster predictors.
7.7 Significant regions identified by readex-dyn-detect for sam(oa)2.
7.8 Rts-specific cluster-best configurations of the tuning parameters for the rts's of sam(oa)2 for a single node run.
7.9 Rts-specific cluster-best configurations of the tuning parameters for the rts's of sam(oa)2 for a multi-node run.
7.10 Energy savings computed using the interphase plugin for sam(oa)2.
7.11 Runtime savings obtained for sam(oa)2 using the three runtime cluster predictors.
7.12 Significant regions identified by readex-dyn-detect for INDEED.
7.13 Rts-specific cluster-best configurations of the tuning parameters for the rts's of INDEED for a single node run.
7.14 Energy savings computed using the interphase plugin for INDEED.
7.15 Runtime savings obtained for INDEED using the three runtime cluster predictors.
7.16 Significant regions identified by readex-dyn-detect for miniMD.
7.17 Rts-specific cluster-best configurations of the tuning parameters for the rts's of miniMD for a single node run.
7.18 Rts-specific cluster-best configurations of the tuning parameters for the rts's of miniMD for a multi-node run.
7.19 Energy savings computed using the interphase plugin for miniMD.
7.20 Runtime savings obtained for miniMD using the three runtime cluster predictors.

List of Listings

6.1 Domain knowledge specification for the application structure (phase region and user region), and application characteristics (region identifier) for the MG benchmark.
6.2 Exported environment variables to enable dynamism detection using readex-dyn-detect.
6.3 Summary of the application pre-analysis for MG.
6.4 Rts tuning request sent from PTF to Score-P.
6.5 Measurement request sent from PTF to Score-P.
6.6 Input identifier specification file IID SPEC.
6.7 Domain knowledge specification for ATP exploration using the ATP library.
6.8 Score-P phase identifier specification for runtime cluster prediction using the cluster prediction library for Fortran applications.
6.9 Score-P phase identifier specification for runtime cluster prediction using the cluster prediction library for C/C++ applications.

1 Introduction

High Performance Computing (HPC) systems are designed to achieve peak performance to enable scientific and industrial advancements. Even as advances in semiconductor technology have increased the number of transistors on a die, Moore's law has hit the power wall [1]. With growing computational performance, there is typically also an increase in a system's energy consumption, which in turn is a major driver for the Total Cost of Ownership (TCO) of HPC systems. With Dennard scaling coming to an end, power density is becoming a problem as more transistors are crammed onto a single die, which reduces the effectiveness of cooling capabilities and results in power-bound Exascale HPC systems. This is a cause for concern because HPC centers now face the challenge of operating the hardware with the power or energy consumption being a prominent contributor to the TCO [2]. Hence, it is likely that users will be forced to optimize their software with respect to energy-efficiency.

Table 1.1 shows the top four systems in the June 2020 Top500 list. The top-ranked system Fugaku, located at the RIKEN Center for Computational Science, Japan, has a maximum performance of 415.5 PFlop/s and consumes 28.3 MW of power, which is more than the power budget set by the US Department of Energy for Exascale machines. The future of supercomputing is in developing Exascale systems. Scaling the current systems to the Exascale level would require more than 50 MW of power (a rough extrapolation from Table 1.1 is shown below). Thus, the challenges related to power and energy consumption are a roadblock to achieving this milestone. Therefore, advances in all areas, ranging from the hardware technology to operating mechanisms such as cooling, are required to optimize the overall energy consumption of current and future HPC systems.
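Scaling the efficiency of the current number-one system linearly to 1 EFlop/s illustrates the gap (a back-of-the-envelope estimate that assumes the energy-efficiency of Fugaku stays constant as performance grows):

P_{Exa} \approx 28.3\,\mathrm{MW} \times \frac{1000\ \mathrm{PFlop/s}}{415.5\ \mathrm{PFlop/s}} \approx 68\,\mathrm{MW}

which is consistent with the estimate of more than 50 MW above and far beyond the power budgets envisioned for Exascale machines.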

1.1 Motivation and Problem Specification

HPC applications typically exhibit changing characteristics, known as application dynamism, in different regions of the program. The program execution may jump between compute-, memory-, and I/O-bound code regions due to changes in the workload at runtime. Examples of dynamism can be found in many current HPC applications, including weather forecasting, molecular dynamics and adaptive mesh-refinement applications.

Table 1.1: Power consumption and RMax (maximum LINPACK performance achieved) for the top four supercomputers in the Top500 list as of June 2020.

Rank  System             Location                                             Cores       RMax (PFlop/s)  Power (MW)
1     Fugaku             RIKEN Center for Computational Science, Japan        7,299,072   415.5           28.3
2     Summit             DOE/SC/Oak Ridge National Laboratory, United States  2,414,592   148.6           10.1
3     Sierra             DOE/NNSA/LLNL, United States                         1,572,480   94.6            7.44
4     Sunway TaihuLight  National Supercomputing Center in Wuxi, China        10,649,600  93.0            15.4

Figure 1.1 illustrates the trend of the execution time across the individual time steps of the main progress loop, known as phases, of INDEED [3], a sheet-metal forming simulation application. INDEED is a highly dynamic application that performs an adaptive mesh refinement when different tools come into contact with the workpiece. The first spike in the graph occurs when the tool makes initial contact with the workpiece, and subsequent spikes are produced when the tool comes into contact with more refined regions. The simulation adapts the time step to keep the problem tractable while refining the mesh.

Such varying characteristics between individual simulation loops of HPC applications offer the potential to improve the energy-efficiency by leveraging their dynamic behaviour. Individual phases can be grouped by characterizing their behaviour using metrics, or features, and different optimal configurations can be selected for each group of phases. However, detecting this dynamism manually is a time-consuming and painstaking task that demands considerable programming effort and domain-level expertise.

Figure 1.1: Variation of the execution time across the phases of INDEED. The variation arises from an adaptive mesh refinement performed when the number of contact points between the tools and the workpiece increases.

Another challenge is that developers focus on improving algorithms with regard to accuracy and performance while neglecting possible improvements to energy-efficiency [4], due to the lack of platform and hardware knowledge required to exploit application dynamism. Several power optimization techniques such as clock gating, clock modulation and power gating have already been implemented in today's processors by hardware vendors. Modern processor architectures also provide the possibility to adjust the clock frequency of the CPU at runtime using Dynamic Voltage Frequency Scaling (DVFS) to improve the energy-efficiency [5]. During the execution of HPC codes, the cores are not always busy, for example, when the program execution moves from a compute-bound to a memory-bound code region. During this time, the presence of memory requests between the last-level cache and the RAM indicates that the frequency can be safely throttled down and the voltage reduced without affecting the execution time.

For earlier processor architectures such as Nehalem-EP and Westmere-EP, the frequency of the uncore components, such as the L3 cache and the interconnect, was fixed. For the Sandy Bridge-EP and Ivy Bridge-EP architectures, a common frequency was set for the core and uncore parts. Starting with the Haswell processors, Intel has introduced a new feature called Uncore Frequency Scaling (UFS), which allows separate frequencies to be set for the core and the uncore domains independently [6]. Previous research [5] has also shown that changing the uncore frequency can significantly impact the energy consumption.

Manually setting these tuning knobs, or so-called tuning parameters, for different program regions is difficult as it requires application domain knowledge. One solution is to use an automatic optimization approach called autotuning, which automatically generates a search space of possible combinations of tuning parameters and evaluates them using experiments to identify the best system configuration [7], i.e., the set of the best settings of the tuning parameters. Most existing tools leverage the hardware features by relying on static tuning [8], where a single frequency is set for the entire application run based on the current workload, with the goal of finding a single optimal configuration. Such a coarse-grained approach has the drawback that it produces a suboptimal solution since it fails to take advantage of the tuning potential of individual program regions. Hence, our goal is to implement a more fine-grained approach, which selects the best configuration for each instance of a code region based on its unique computational characteristics.

1.2 Solution: A Tools-Aided Approach to Optimize the Energy-Efficiency

The process of determining the best system configuration can be done using a brute-force strategy, known as an exhaustive search. The exhaustive search strategy walks the full search space and explores all combinations of the tuning knobs. While this may result in the optimal system configuration, it is not practical to run HPC applications for hours or even days to find the perfect solution. Thus, a key aspect in autotuning is the trade-off between the quality of the results and the effort spent in searching for the best solution. An alternative method is to use a random search strategy, which selects valid points in the search space randomly. Both the random and exhaustive search strategies are good choices for autotuning when the structure of the objective function [9] and the application behavior in response to changing configurations are unknown (a minimal sketch of both strategies follows below).

To automate the tuning process and free the user from learning the intrinsics of autotuning, we present a tools-aided approach that combines several pre-existing tools with novel runtime tuning techniques to guide the optimal tuning of the HPC stack, constituting the hardware, the runtime system, and the application-level tuning parameters. The work done in this thesis is an extension of the READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) [4] methodology, which aimed to automate the process of determining the best system configurations for different program regions by developing a tools-aided methodology for dynamic autotuning. READEX combines technologies from two ends of the computing spectrum: the system scenario methodology from the embedded world and autotuning from the HPC domain.
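The difference between the two strategies can be illustrated on a small tuning-parameter space (a minimal sketch with hypothetical parameter values, not the actual search algorithms implemented in PTF):

import itertools
import random

# Hypothetical tuning knobs: core frequency (GHz), uncore frequency (GHz),
# and the number of OpenMP threads.
search_space = {
    "core_freq": [1.2, 1.6, 2.0, 2.4],
    "uncore_freq": [1.4, 1.8, 2.2],
    "threads": [6, 12, 24],
}

def exhaustive_search(space):
    # Enumerate every combination of the tuning knobs (4 * 3 * 3 = 36 here).
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def random_search(space, experiments, seed=0):
    # Draw a fixed number of valid configurations uniformly at random.
    rng = random.Random(seed)
    for _ in range(experiments):
        yield {k: rng.choice(v) for k, v in space.items()}

# An autotuner runs one experiment per configuration and keeps the one with
# the lowest measured objective, e.g. the energy consumption of a phase.
for configuration in random_search(search_space, experiments=5):
    print(configuration)

With realistic parameter ranges (tens of core and uncore frequency settings per processor), the exhaustive product quickly grows into hundreds of experiments, which is why the random and targeted strategies described later are preferred.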


READEX is a two-staged approach consisting of Design-Time Analysis (DTA) and Runtime Application Tuning (RAT), with Score-P [10] as the common instrumentation and measurement infrastructure. At design time, different instances of the program regions, known as runtime situations (rts's), are detected during the iterations of the main progress loop, or phases. Optimized system configurations are then determined for the phases as well as the rts's, based on the type of dynamism that is exhibited by the application. The variation in the application behaviour and characteristics may arise between individual phases or within a single phase. Rts's with the best system configurations are grouped into scenarios and encapsulated in a tuning model. At runtime, the tuning model is used to guide the dynamic switching of system configurations. To improve the automatic tuning results, application experts may also specify domain knowledge in the form of annotations to define the application structure, the application characteristics and the Application Tuning Parameters (ATPs).

The tuning methodology begins with a series of pre-analysis steps to determine if tuning the application potentially results in energy savings. First, the scorep-autofilter tool filters out all fine-granular regions that generate measurement overhead. Then, the readex-dyn-detect tool determines significant regions, which are a subset of all instrumented regions that are coarse-granular enough for dynamic switching of the tuning parameters. It also determines if the characteristics of the application vary between different phases due to the change in the control flow, or within a single phase due to the changing characteristics between individual rts's that are called in that phase. The findings of readex-dyn-detect are stored in a configuration file that is used during DTA.

Design-Time Analysis (DTA) is performed by the Periscope Tuning Framework (PTF), which was developed at the Technische Universität München, Germany, and is a distributed framework based on an extensible and modular architecture implemented using a hierarchy of analysis agents. PTF invokes tuning plugins to tune different aspects of HPC applications. The intraphase tuning plugin is used to exploit intra-phase dynamism, while the novel interphase plugin is called when inter-phase dynamism is detected for an application. Both plugins find the best settings of the hardware tuning parameters (CPU and uncore frequency) and system software tuning parameters (number of OpenMP threads) for different objectives such as energy consumption, execution time, Energy Delay Product (EDP), Energy Delay Product Squared (ED2P), TCO, and also their normalized versions. The intraphase tuning plugin additionally supports the tuning of ATPs. It first determines their optimal settings, and then tunes the hardware and system software tuning parameters using a user-defined search strategy. Finally, it determines a single best system configuration for all the phases of the application, and rts-best configurations for the individual rts's of the significant regions.

While the intraphase plugin determines optimal configurations for the phase and the rts's, it completely disregards the variations between individual phases. The selection of a single configuration for all the phases could potentially result in suboptimal performance and/or energy-efficiency. This drawback is overcome by the interphase tuning plugin, which exploits the dynamism between individual phases by characterizing their behavior. It clusters phases with similar characteristics using DBSCAN [11] and spectral clustering [12], based on features such as the compute intensity, L2 cache misses and conditional branch instructions (a condensed sketch of this clustering step is shown below). The main motivation driving clustering is that the selected clustering features are capable of depicting the change in the control flow, the amount of work done due to the execution of different algorithms between phases, and the shift from a compute-intensive phase to a memory- or IO-intensive phase.

The interphase plugin first uses a random search strategy to evaluate, in each experiment, the effect of a system configuration drawn randomly from a uniform probability distribution on a single phase. During an experiment, the plugin requests PAPI [13] hardware performance counters and computes features to cluster phases with similar behaviour. To improve the confidence in the tuning result, the plugin performs a targeted or selective tuning of the tuning parameters by probabilistically selecting a system configuration based on a Gaussian distribution. The Gaussian distribution is constructed using the concept of attractor configurations, which result in a lower energy consumption, and repeller configurations, which result in a higher energy consumption. The idea is to evaluate the next phase with a system configuration that is far from a repeller. The interphase tuning plugin ultimately determines the optimal configurations for different clusters of phases as well as rts-specific best configurations for the rts's in each cluster. Thus, the plugin effectively exploits both intra-phase and inter-phase dynamism.

At the end of DTA, rts's that have similar best configurations are grouped into a scenario using a classifier, and a best configuration for each scenario [14] is determined by the selector. This knowledge is encapsulated in a tuning model, which is used by the READEX Runtime Library (RRL), developed in the READEX project, to dynamically switch between different system configurations for the phase and the rts's during production runs. When RRL encounters an unknown rts that was not seen during DTA, it either switches to the default system configuration provided by the batch system, or performs calibration of the rts using Q-learning [15, 16] if the calibration module is set. If the application exhibits inter-phase dynamism, a runtime cluster prediction library automatically predicts the cluster number of the unseen phases during production runs. The library provides three lightweight predictors, of which one is based on a second-order Markov chain, and the other two are inspired by one-bit and two-bit dynamic branch predictors. Thus, the cluster prediction library prevents the setting of the default system configuration for an unseen phase, effectively resulting in better dynamic savings.
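The clustering step referenced above can be sketched with scikit-learn's DBSCAN and spectral clustering on a synthetic feature matrix (an illustrative sketch only: the feature values, eps and cluster counts are made up here, whereas the thesis derives the real features from PAPI counters and determines eps and the number of clusters automatically):

import numpy as np
from sklearn.cluster import DBSCAN, SpectralClustering
from sklearn.preprocessing import StandardScaler

# One row per phase: [compute intensity, L2 cache misses, conditional branches].
# Synthetic values standing in for features derived from PAPI counters.
rng = np.random.default_rng(42)
features = np.vstack([
    rng.normal([10.0, 1e6, 5e5], [0.5, 5e4, 2e4], size=(40, 3)),  # compute-bound phases
    rng.normal([2.0, 8e6, 9e5], [0.3, 2e5, 3e4], size=(40, 3)),   # memory-bound phases
])
X = StandardScaler().fit_transform(features)

# Density-based clustering: phases in low-density regions are labelled -1 (noise).
dbscan_labels = DBSCAN(eps=0.7, min_samples=4).fit_predict(X)

# Spectral clustering on a Gaussian (RBF) affinity matrix of the same features.
spectral_labels = SpectralClustering(n_clusters=2, affinity="rbf", random_state=0).fit_predict(X)

print("DBSCAN clusters:", np.unique(dbscan_labels))
print("Spectral clusters:", np.unique(spectral_labels))

Each resulting cluster is then tuned separately, and the rts's executed in its phases inherit cluster-specific best configurations.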

1.3 Contributions

Our work is an extension of the READEX methodology, with the aim to enable users to leverage inter-phase dynamism, i.e., changing characteristics due to the execution of different algorithms between individual phases, to tune the energy-efficiency. The key contributions presented in this thesis are:

• Extension of the Score-P Online Access interface for call-path profiling: The performance data measured by Score-P is stored as call-tree profiles that can be accessed using measurement retrieval requests from PTF via the Online Access (OA) interface. The current version of the OA interface of Score-P transfers the objective values and the hardware counters from Score-P to PTF in the form of a flat profile, where the measurements for individual program regions are aggregated regardless of the call-path. Our work extends the OA interface to support the transfer of performance data in the form of call-path profiles, where each rts is stored in a separate node representing one call-path. This is replicated in PTF, thus preserving the parent-child relationship between the calling function and the callee.

• Tuning plugins to leverage application dynamism: Our work presents two novel tuning plugins that perform both DVFS and UFS to leverage intra-phase dynamism and inter-phase dynamism respectively. The intraphase tuning plugin exploits the intra-phase dynamism arising from the variation in characteristics such as the compute intensity and execution time across different program regions executed within a single phase. It determines a single best configuration for the phase, and rts-specific best configurations for the rts's of the program regions. However, the intraphase plugin disregards the dynamism due to the variation in the execution time between individual phases. The main contribution of our work is the development of a tuning strategy for applications that exhibit inter-phase dynamism using the interphase tuning plugin to determine best configurations for groups of similarly behaving phases, and rts-specific best configurations for individual rts's. Thus, the interphase plugin determines optimal configurations for different executions of the same region depending on the execution context, e.g., interpolation on different grid levels, as well as the data-dependent variation, e.g., due to grid refinement resulting in the variation of characteristics of a region over time.


• A set of metrics to characterize the phase behaviour: Hardware counters are used to monitor events at the CPU level, and provide detailed insights on the effect of the application execution on the performance of the caches and the main memory. The temporal behaviour of the phases, which depends on the type of algorithm that is currently executing, can be characterized using PAPI hardware performance counters. In this thesis, we discuss a set of counters that may be used as features to characterize the application using the low-level API for native PAPI events.

• Search space optimization: Our approach executes the application, in the worst case, two times during DTA. Moreover, the two tuning plugins limit these runs to a representative set of progress loop iterations instead of the entire application run, thus saving precious tuning time. The interphase plugin can select the best configuration for each cluster of phases immediately after clustering in the first run. However, this could potentially result in a suboptimal solution. Hence, we perform search space optimization using a probabilistic random search strategy that uses a Gaussian distribution to select configurations that are closer to the optimum configuration. We also ensure that a previously evaluated configuration is never repeated for a particular cluster, thus increasing the confidence in the tuning result.

• A runtime cluster prediction library: Since DTA evaluates and clusters only a representative subset of the application phases, we implemented a runtime prediction library to predict the cluster ids of all the phases that were not seen during DTA. The prediction library implements three types of predictors: a second-order Markov chain based predictor, a one-bit dynamic branch based predictor, and a two-bit dynamic branch based predictor (a simplified sketch of the latter two is shown below). The application is first prepared for runtime tuning by linking the cluster prediction library. During production runs, the library is initialized at the beginning of the first phase, and collects the PAPI hardware performance counters that were used for clustering during DTA. The cluster predictors then predict the cluster id for the upcoming phase using the features derived from the measured PAPI counters.

We highlight multiple advantages of our methodology. First, we reduce overheads by pre-computing best solutions at design time and simply switching between the configurations at runtime. This reduces the runtime overhead as expensive runtime search techniques are not applied. Second, our approach can be used with minimal user involvement. The pre-analysis steps filter overly fine-granular regions and identify the dynamism automatically. Tuning can be performed immediately by invoking the intraphase or interphase plugin for an application using automatic compiler instrumentation. The user can additionally reduce the overhead due to compiler instrumentation by manually instrumenting the significant program regions. Finally, our work aims to enable developers to achieve significant improvements in the energy-efficiency of HPC applications on extreme-scale systems using a (semi-)automatic autotuning framework that can scale to extreme node counts.
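The branch-predictor-inspired cluster predictors mentioned above can be summarized as follows (a simplified reading of the analogy, shown for illustration only; the actual library additionally consults the features measured for the preceding phases):

class OneBitClusterPredictor:
    # Predict that the next phase falls into the last observed cluster;
    # a single misprediction immediately changes the prediction.
    def __init__(self):
        self.prediction = None

    def predict(self):
        return self.prediction

    def update(self, actual):
        self.prediction = actual


class TwoBitClusterPredictor:
    # Like the one-bit scheme, but with hysteresis: the prediction only
    # changes after two consecutive mispredictions.
    def __init__(self):
        self.prediction = None
        self.strong = False

    def predict(self):
        return self.prediction

    def update(self, actual):
        if actual == self.prediction:
            self.strong = True
        elif self.strong:
            self.strong = False          # first miss: keep the prediction, weaken it
        else:
            self.prediction = actual     # second consecutive miss: switch


# Example: the cluster sequence 211131223233 used in Figures 5.14 and 5.17.
sequence = [2, 1, 1, 1, 3, 1, 2, 2, 3, 2, 3, 3]
predictor = TwoBitClusterPredictor()
hits = 0
for cluster in sequence:
    if predictor.predict() == cluster:
        hits += 1
    predictor.update(cluster)
print(f"correct predictions: {hits}/{len(sequence)}")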

1.4 Structure of the Dissertation

This dissertation proposes software methods in the form of tuning plugins that leverage the hardware innovations of DVFS and UFS to optimize the energy consumption on newer architectures. The remainder of the dissertation is structured as follows:

Chapter 2 describes the related work in power and energy management. It presents the state-of-the-art hardware mechanisms for power management, such as clock gating, power gating, Dynamic Duty Cycle Modulation (DDCM), DVFS, UFS, and power-aware hardware. It then surveys the relevant research works that employ software methods to leverage the hardware innovations for power management, including power-aware job scheduling and combined power-capping techniques.


Chapter 3 introduces the fundamental concepts of our work using a high-level description of the key concepts and definitions that will be referred to in the dissertation. It also gives a background of the existing tools and APIs, such as the Periscope Tuning Framework (PTF), Score-P and PAPI, that are the backbone of our methodology. Additionally, it describes the layers of the HPC stack, namely the hardware, system software and application tuning parameters, that our methodology exploits.

Chapter 4 outlines the general overview of the architecture of the existing READEX tool suite, and describes how it leverages two existing tools, namely PTF and Score-P, to perform autotuning. It highlights the interactions between the major components of the methodology, and describes the extensions to PTF by presenting the workflow of the intraphase and interphase tuning plugins. Finally, it outlines the workflow of the RRL to perform dynamic tuning during production runs.

Chapter 5 presents our proposed methodology that exploits the inter-phase dynamism in an application to determine the optimal system configurations. It describes the reasoning behind the features that were selected to cluster the application phases using DBSCAN and spectral clustering. It then describes how the interphase tuning plugin selects a targeted set of configurations using a Gaussian probability distribution to determine optimal configurations. Finally, it provides the relevant theoretical concepts and the implementation of the different cluster prediction schemes that are used to predict the phase behaviour at runtime.

Chapter 6 illustrates how our proposed methodology is integrated in the overall tuning methodology. It also describes the domain knowledge that can be specified by the application expert to enhance the tuning process. Finally, it describes the tuning model generation that stores the information of the best configurations determined by the interphase and intraphase tuning plugins in a JSON file.

Chapter 7 presents four different scientific applications, namely 128.GAPgeofem, sam(oa)2, INDEED and miniMD, that were used to evaluate our tuning methodology. Of these, INDEED, sam(oa)2 and 128.GAPgeofem are highly dynamic complex real-world applications, while miniMD is a proxy benchmark from the Mantevo benchmark suite. This chapter illustrates the clusters that were obtained for the applications after performing DBSCAN and spectral clustering during DTA, and presents the maximum theoretical static and dynamic energy savings computed using the interphase plugin. It also compares the runtime savings achieved using the cluster predictors, and outlines the overheads incurred as a result of runtime switching.

Chapter 8 concludes our work by summarizing the contributions, the challenges encountered, and the lessons learned. Finally, it outlines promising opportunities for future improvements.

2 Related Work

Reducing the energy consumption for Exascale computing has become an important research topic in two main areas in recent years: hardware and architectural support at the circuit and logic level for power- and thermal-aware hardware design, and power-aware software development to leverage the hardware support for the entire software stack [17]. Modern-day hardware designs implement different power management techniques, such as clock gating, clock modulation or Dynamic Duty Cycle Modulation (DDCM), power gating, Dynamic Voltage Frequency Scaling (DVFS), and specialized coprocessors [18] to support the reduction of energy consumption. Software techniques such as Dynamic Concurrency Throttling (DCT) and thread/concurrency packing [19], combined with OS-level control, are employed to leverage the available hardware innovations.

The energy consumption E is the integral of the instantaneous power consumption P(t) over the execution time T:

E = \int_{0}^{T} P(t)\, dt  \qquad (2.1)

According to Weste and Harris [20], the power consumption in CMOS circuits comes from two components, dynamic dissipation and static dissipation, and is represented as:

P_{total} = \underbrace{\alpha C V_{DD}^{2} f + I_{SC} V_{DD}}_{P_{dynamic}} + \underbrace{(I_{leak} + I_{cont}) V_{DD}}_{P_{static}}  \qquad (2.2)

Dynamic dissipation includes the switching power, \alpha C V_{DD}^{2} f, resulting from the switching of the load, and the short-circuit power. Dynamic power dissipation occurs when the transistors inside a CMOS circuit are switched, allowing internal capacitances to charge/discharge and short-circuit currents to move through the circuit. The activity factor \alpha is the probability that a circuit node transitions from 0 to 1, which is when the circuit actually consumes power. P_{dynamic} is dominated by the switching power, which is the power consumed while the chip is doing useful work.

Static dissipation occurs due to leakage power, caused by subthreshold, gate, and junction leakage, as well as contention currents (I_{cont}). P_{static} is dominated by the leakage power, I_{leak} V_{DD}. Unlike dynamic power, static power depends on the fabrication technology used rather than the switching frequency [21]. This means that the more transistors are crammed inside a single die, the higher the leakage power. Thus, P_{static} is present even if no switching is performed.

From Equation 2.2, we can see that the supply voltage V_{DD} and the switching frequency f determine the amount of dynamic power; a numeric example is given below. Thus, P_{dynamic} is expected to be smaller during processor idle periods. Modern power-saving techniques target different parts of Equation 2.2 to save either dynamic or static power.

Section 2.1 presents hardware techniques and mechanisms that can be used to reduce the energy consumption of the whole system. Section 2.2 describes in detail the state-of-the-art software techniques, including energy/power-aware methods for heterogeneous systems, that leverage the hardware and the underlying architectural design for improving power and energy-efficiency. Section 2.2.6 classifies existing works in terms of the control method employed to tune the energy-efficiency or power consumption.
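To make the leverage offered by Equation 2.2 concrete, consider a hypothetical DVFS step that lowers both the supply voltage and the clock frequency by 20% (an illustrative calculation; actual voltage-frequency pairs are platform specific):

\frac{P'_{dynamic}}{P_{dynamic}} = \frac{\alpha C (0.8 V_{DD})^{2} (0.8 f)}{\alpha C V_{DD}^{2} f} = 0.8^{3} \approx 0.51

The switching power is roughly halved, while the execution time of purely compute-bound code grows by at most a factor of 1/0.8 = 1.25. This cubic sensitivity of the dynamic power to the voltage-frequency setting is what the DVFS techniques in Section 2.1.4 exploit.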

2.1 Hardware Mechanisms for Power Management

Power saving may be considered in terms of active, standby, and sleep modes. The ACPI (Advanced Configuration and Power Interface) standard [22], a platform-independent interface for configuration and for power monitoring and management, defines five main state families that are supported by current processors: global states (G-states) [23], system sleep states (S-states), processor power states (C-states), processor performance states (P-states), and throttling states (T-states). From a high-level perspective, the global system state G0 is the working state, while G1 is the inactive or sleep state, and G2 is the shutdown state.

During the active state G0, the CPU faces varying levels of utilization, which translates to different C-states. C-states typically range from C0 to C6, with additional vendor-specific states, where only C0 is the working state and the higher C-states correspond to deeper and deeper idle states in which no instructions are executed. This means that deeper C-states are low-power modes that can be used when the CPU is idle in order to obtain higher power savings. However, deeper C-states need an external signal, e.g., an interrupt, to return to a working state, and have a higher latency to return to C0, which creates considerable overhead. At C0, multiple P-states exist, each associated with an operating frequency and a voltage. The number of P-states is dependent on the processor model [23]. P0 operates at the highest frequency, and successively higher P-states represent lower clock frequencies, resulting in lower power consumption [22]. T-states are used to perform on-demand clock modulation, which controls the duty cycle, i.e., the fraction of time that the processor is actively clocked, by omitting a share of the clock cycles.

The following sections describe the hardware mechanisms available for power management. Each of these hardware mechanisms affects different system and processor states.
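On Linux, the C-states and P-states described above are exposed through the cpuidle and cpufreq sysfs interfaces. The following sketch (assuming a typical Linux installation; the exact attribute files depend on the kernel version and the active cpuidle/cpufreq driver) simply lists what the OS reports for core 0:

from pathlib import Path

CPU0 = Path("/sys/devices/system/cpu/cpu0")

def read(p: Path) -> str:
    # Return the file content, or a placeholder if the attribute is absent.
    try:
        return p.read_text().strip()
    except OSError:
        return "n/a"

# C-states (idle states) exposed by the cpuidle driver.
for state in sorted((CPU0 / "cpuidle").glob("state*")):
    name = read(state / "name")
    latency = read(state / "latency")   # wake-up latency in microseconds
    print(f"{state.name}: {name} (exit latency {latency} us)")

# P-states are abstracted by cpufreq as a frequency range and a governor.
cpufreq = CPU0 / "cpufreq"
print("governor:", read(cpufreq / "scaling_governor"))
print("min freq (kHz):", read(cpufreq / "scaling_min_freq"))
print("max freq (kHz):", read(cpufreq / "scaling_max_freq"))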

2.1.1 Clock Gating

Clock gating reduces the power consumption by disabling the clock in the idle parts of the circuit, or in parts that maintain a steady state and do not need to be refreshed [17]. This is done by ANDing a clock signal with an enable signal in order to turn off the clock to idle blocks. For example, when no floating point instructions are being issued, the enable signal can be deasserted, and thus the clock for the floating point unit can be turned off. Clock gating affects the dynamic power P_{dynamic} [20] by influencing the activity factor \alpha in Equation 2.2, since \alpha becomes 0 when the circuit can be turned off. In clock gating, only the clock is disabled while the power supply is kept on, so that the unit retains its state. Since the clock network has a high capacitance, clock gating saves significant power. Clock gating can be performed at the hardware level at a finer granularity by disabling parts of the functional blocks, and at the software level by disabling entire functional blocks [17]. Clock gating affects the shallow C-states by stopping the processing of instructions [24].


2.1.2 Power Gating

Power gating affects the S-states and deep C-states [24] by reducing the static current during sleep mode, turning off the power supply to sleeping blocks. When a logical block is active, the transistors are switched on, thereby connecting it to the voltage supply. During power gating, the power supply to the sleeping logical block is disconnected. Power gating is a more aggressive approach than clock gating, and achieves a better power reduction by reducing both dynamic and static power, since a functional block is disconnected, thereby powering off all its components. Power gating can also be used in combination with clock gating [17].

Power gating requires a number of hardware design decisions: keeping a minimal delay in the circuit during active operation, keeping the leakage low during sleep mode, ensuring that the block is turned off for long enough intervals since the transition between active and sleep modes takes time and energy, saving the state of the registers into memory before power gating since the data is erased when the block is switched off, and finally, reloading the registers from memory upon power-up [20]. The main concern of power gating is the overhead incurred in time and energy when blocks are powered on and off. Therefore, it is crucial to determine when to use power gating, and whether the overhead is worth it. These decisions are made using a set of strategies known as Dynamic Power Management (DPM) [21, 25].

Dev et al. [26, 27] performed power gating for GPUs, and concluded that it improves the overall power-efficiency in GPUs by reducing both the dynamic and the leakage power in the logic. They also concluded that power gating can be used to boost the frequencies of the active control units to improve the performance under a given power budget.

2.1.3 Dynamic Duty Cycle Modulation (DDCM)

Clock modulation, or Dynamic Duty Cycle Modulation (DDCM), forces a part of the total number of CPU clock cycles, or duty cycles, to be idle, and involves the T-states. By gating a fraction of the clock cycles to a core, the frequency of each core can be adjusted nearly instantaneously, which requires less time compared to other techniques such as DVFS [28] (see Section 2.1.4). The advantage of DDCM is that different duty cycles can be set for individual clock domains. Gating the duty cycles prevents the clock signal from driving the processor chip for that time period. For example, a T-state with a duty cycle of 75% makes the CPU run at the same clock frequency and the same voltage, but forces the CPU to be idle 25% of the time [22].

Clock modulation was initially intended to prevent processors from overheating [29], but today, power management can also be performed in addition to thermal control. DDCM saves power more effectively for imbalanced applications because the hardware can provide more power to the critical thread [28]. However, on modern x86 systems it is recommended to use the P-states rather than T-states or even C-states, since this is more beneficial in terms of overall system performance and time: it takes less time to move from C0 to deeper C-states or between different P-states than it does to force down the processor duty cycle [22]. T-states could instead be used as a last line of defense against critical thermal conditions by using a catastrophic shutdown detector to immediately halt the processor execution when the core temperature reaches the temperature limit while the system is in the highest-power P-state. The modulation is stopped once the temperatures return to non-critical levels.

2.1.4 Dynamic Voltage Frequency Scaling (DVFS)

Dynamic Voltage Frequency Scaling (DVFS) is the most commonly used power saving technique at the processor level. Intel introduced the Integrated Voltage Regulator (IVR) in the Haswell processor, which allows individual cores to have their own voltage-frequency domains. This allows the setting of per-core P-states [30]. Most systems have time-varying performance requirements, for example when computation is interspersed with communication. DVFS can be used to set the processor clock frequency to the minimum sufficient value so as to prevent performance degradation, and to reduce the supply voltage to the minimum value sufficient to operate at the set frequency, thereby saving large amounts of energy. The controller determines the supply voltage and clock frequency using information from the system about the workload and the die temperature [20].

DVFS affects both the dynamic power and the static power due to its effect on the supply voltage V_{DD}. However, since it has a cubic effect on the dynamic power, the effect on the static power is generally ignored. In DVFS, the CPU is not shut down as in power gating, and thus the power savings are not as drastic. Since performance has a linear relationship with the frequency [20], using a higher voltage-frequency setting increases the performance at the cost of a high power consumption, while a lower voltage-frequency setting reduces the dynamic power consumption, usually at the cost of a longer application runtime [18]. Therefore, using DVFS to save power is not trivial, and it must be performed when periods of lower processor performance are acceptable, such as in memory-bound or IO-bound code regions [31]. Currently, DVFS is supported on various processors provided by different manufacturers. Intel offers SpeedStep in its processors, and AMD offers Cool 'n' Quiet and PowerNow! [17] in its CPUs and PowerTune for its GPUs.
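From user space on Linux, DVFS is typically exercised through the cpufreq sysfs interface. The sketch below pins a core to a target frequency by narrowing its frequency limits (a simplified illustration with hypothetical frequency values, assuming root privileges and the generic cpufreq interface; the exact behaviour depends on the active driver, e.g. acpi-cpufreq vs. intel_pstate):

from pathlib import Path

def set_core_frequency_khz(core: int, freq_khz: int) -> None:
    # Pin one core to a target frequency by narrowing its cpufreq limits.
    # Requires root; assumes the generic Linux cpufreq sysfs interface.
    cpufreq = Path(f"/sys/devices/system/cpu/cpu{core}/cpufreq")
    (cpufreq / "scaling_max_freq").write_text(str(freq_khz))
    (cpufreq / "scaling_min_freq").write_text(str(freq_khz))

# Example (hypothetical frequencies): throttle core 0 down to 1.2 GHz for a
# memory-bound region, then restore a 2.4 GHz limit afterwards.
set_core_frequency_khz(0, 1_200_000)
# ... run the memory-bound region ...
set_core_frequency_khz(0, 2_400_000)

Runtime tuning frameworks use equivalent, lower-overhead kernel or library interfaces, but the principle of temporarily lowering the voltage-frequency setting for memory- or IO-bound regions is the same.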

2.1.5 Uncore Frequency Scaling (UFS)

The uncore components of a processor are not part of the core, but are essential for core performance. The core contains the execution units, the L1 and L2 caches, and the branch prediction logic, while the uncore consists of the Last-Level Cache (LLC), the memory controllers, the on-chip interconnect, and the power control logic [32]. Earlier Intel processor architectures like Nehalem and Westmere had a fixed uncore frequency [5]. However, this was inefficient in terms of the power consumption at low system utilizations because the L3 cache was run at a higher voltage-frequency setting than necessary [30]. The next generations, Sandy Bridge and Ivy Bridge, saw an improvement in the power-efficiency at low system utilizations since the cores and the uncore were combined into a single voltage-frequency domain and shared the same frequency. Intel introduced Uncore Frequency Scaling (UFS) in newer processor architectures starting from Haswell, which supports separate core and uncore frequency domains and provides improved power- and energy-efficiency by allowing users to manipulate the core and uncore frequencies independently.

During active periods of workload execution, most of the power dissipation comes from the cores, and smaller amounts from the uncore. When the cores are idle or in a deep sleep state, most of the power dissipation comes from the uncore. Previous research has shown that changing the uncore frequency has a significant impact on the memory bandwidth and cache-line transfer rates, and that it can be reduced to save energy [5, 6, 32, 33] by setting the P-states. Hence, UFS has now become a topic of significant interest for research in developing methods for improving the energy- and power-efficiency in HPC.
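On recent Intel server processors, the uncore frequency limits are commonly accessed through a model-specific register (widely reported as MSR 0x620, the uncore ratio limit, with the ratio expressed in multiples of 100 MHz; this layout is an assumption and should be verified against the documentation of the specific processor). A minimal read-only sketch using the Linux msr module:

import os
import struct

UNCORE_RATIO_LIMIT = 0x620  # assumed MSR address; verify against the CPU documentation

def read_uncore_limits_mhz(cpu: int = 0):
    # Read the uncore min/max ratio limits via the Linux msr kernel module
    # (requires root and 'modprobe msr'). The ratio is assumed to be encoded
    # in multiples of 100 MHz: bits 6:0 hold the maximum ratio and bits 14:8
    # the minimum ratio.
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, UNCORE_RATIO_LIMIT)
    finally:
        os.close(fd)
    value = struct.unpack("<Q", raw)[0]
    max_ratio = value & 0x7F
    min_ratio = (value >> 8) & 0x7F
    return min_ratio * 100, max_ratio * 100

print("uncore frequency limits (MHz):", read_uncore_limits_mhz())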

2.1.6 Power-Aware Hardware

Specific processor architectures allow soft power capping, for example by running the system at a lower power with a trade-off in performance, by setting a Thermal Design Power (TDP) limit that specifies the amount of power the CPU may consume, or by specifying a time window and a maximum average power using RAPL in the Intel Sandy Bridge processors.


According to Jin et al. [31], the energy to move data between the cache and the memory increases proportionally to the bandwidth and the transport distance. Hence, both hardware and software cache reconfiguration techniques, such as turning off cache banks that are not in use when executing applications that are not data-intensive, as well as data locality optimizations have been proposed. Since memories have a much lower power density than logic, it is better to use a larger memory than a faster processor [20], and hence today's chips have memories comprising more than half of the entire area.

Another recommendation is to use special-purpose functional units like accelerators, such as GPUs and coprocessors, that offload compute-intensive tasks from the processor for graphics and networking applications. The use of hybrid systems (CPU + accelerator) enables both power and performance improvements at runtime, since the offloading of computations can be pre-conditioned based on the constraints [19]. A high-level design technique called Multi-Vdd [21] can be used to divide the system into several voltage domains consisting of different components, such as a CPU or an accelerator, where each component has its own individual voltage supply and clock.

Additionally, efficient parallel computing architectures, including low-power microprocessors such as the ARM processors used in smart phones and tablets, FPGAs, and heterogeneous systems, may be employed in order to save energy [31]. Another way to reduce the power consumption is to operate more cores simultaneously under a given power budget by reducing the supply voltage to slightly above the threshold voltage, called Near-Threshold Voltage (NTV), at the cost of performance [31].

In addition to hardware methods that can support power management, we can opt for a co-designed hardware and software effort by simultaneously developing the hardware, execution model, OS/runtime, and applications using a joint team of computer architects, system software experts, compiler developers, and application experts [34]. One such joint effort is the Runnemede architecture, which is built for energy-efficiency [34] for extreme-scale computing. However, it has limited compatibility with previous operating systems and applications. The architecture uses NTV circuits, and provides support for clock and power gating in processors, memory modules and networks to minimize power dissipation. Runnemede's processor chip includes a large number of relatively simple, single-issue, in-order cores that are organized into hierarchical groups, and divided into Execution Engines and Control Engines to allow separate optimizations of the hardware.

2.2 Software Methods for Power Management

Hardware innovations provide a rich set of techniques, as described in Section 2.1, which enables parallel computing software, including system software and applications, to exploit and efficiently manage the energy utilization. Power saving can be achieved at the system level, node level or core level. In addition to exposing hardware features to manage the CPU power consumption, the ACPI provides Operating System directed Power Management (OSPM) [22]. OSPM enables the OS to manage the power by defining states of consumption, and to modify the resource behavior depending on the state it is in, for example by lowering the speed of operation when no jobs are executing in a certain time interval.

Setting the tuning parameters manually for different program regions is difficult, as it requires application domain knowledge. A solution is to use an automatic optimization approach called autotuning to automatically generate a search space of possible combinations of tuning parameters, and evaluate them using experiments [7]. Autotuning techniques explore the search space using different search strategies, such as exhaustive, random and heuristic-based, which are described briefly in Section 6.2.
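As a minimal illustration of such a search space, the following sketch enumerates combinations of a few assumed tuning parameters and draws a random subset of configurations for evaluation; the frequency and thread values are placeholders, not taken from any particular platform.

```python
# Minimal sketch of how an autotuner might enumerate and sample a search space
# of system-level tuning parameters. All parameter values are illustrative.
import itertools
import random

core_freqs_ghz   = [1.2, 1.6, 2.0, 2.4]     # CPU P-state frequencies (assumed)
uncore_freqs_ghz = [1.2, 1.8, 2.4]          # uncore frequencies (assumed)
thread_counts    = [4, 8, 16]               # OpenMP thread counts (assumed)

# Exhaustive strategy: every combination is one experiment.
search_space = list(itertools.product(core_freqs_ghz, uncore_freqs_ghz, thread_counts))

# Random strategy: evaluate only a fixed budget of randomly drawn configurations.
budget = 10
random_sample = random.sample(search_space, budget)

print(f"exhaustive: {len(search_space)} experiments, random: {len(random_sample)}")
```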

Autotuning can be performed at compile time, which selects a static-best configuration at design-time in an offline tuning phase, at runtime using online techniques, or as a hybrid in which a model computed at design-time is executed at runtime [7]. Autotuning techniques optimize different objectives, such as the execution time, energy consumption, Energy Delay Product (EDP), Energy Delay Product Squared (ED2P), and the Total Cost of Ownership (TCO), depending on the end goal. While many techniques optimize one objective at a time, some methods also perform multi-objective tuning.

Some software techniques select the best configuration using analytical models or predictive models to predict the performance or power/energy metrics for a specific tuning objective using the data obtained in a pre-analysis step; they are described in Section 2.2.3.4. In recent years, a close connection has developed between machine learning and high performance computing, since machine learning algorithms have been employed to perform autotuning for both energy and performance. Section 2.2.3.5 presents these techniques, which learn a correlation between different performance metrics and the tuning parameters or tuning objectives, and predict the optimum configuration for the application. Software methods also include the following:

• Compiler-based power control using code generation tools that perform code transformations to generate optimized code using compiler optimizations. These optimizations include changing the Instructions Per Cycle (IPC), the sequence of instructions, levels of parallelism [7], or loop-level code optimization using the loop unroll factor [18]. Tiwari et al. [35] tune cache tiling factors, loop unrolling factors, and the clock frequency to minimize the execution time, energy, EDP and ED2P. Their proposed autotuner first performs an offline search using Active Harmony [36], a code generation framework, to obtain new code variants for different settings of the tuning parameters, and at the same time monitors the power consumption to avoid performance degradation.

• Dynamic Concurrency Throttling (DCT), where the number of active cores is reduced by restricting the number of OpenMP threads for multi-threaded applications, in effect forcing some of the cores into an idle mode due to the reduction in the active concurrency [19, 18].

• Enabling the effective use of transistors by using special instructions, such as SIMD vector instructions for the x86 architecture or the AES (Advanced Encryption Standard) instructions for encryption and decryption [17].

• Using virtualization to reduce the power consumption by using a smaller number of more powerful, less power-hungry physical servers to host multiple virtual servers [37].

The following sections present a survey of the state-of-the-art software techniques, such as power-aware job scheduling (Section 2.2.1), DVFS (Section 2.2.3), and UFS (Section 2.2.4), as well as power-aware optimizations for heterogeneous systems (Section 2.2.2), that leverage the hardware mechanisms described in Section 2.1 to improve the energy-efficiency and/or power consumption of parallel applications.

2.2.1 Power-Aware Job Scheduling

HPC systems execute multiple jobs simultaneously using the job scheduler, which distributes waiting jobs to available compute nodes. Previously, job schedulers distributed jobs with the aim of improving the performance and maximizing the overall system utilization. However, this is a suboptimal solution for today’s supercomputing systems, since saving energy is also a target. Hence, the job scheduler can be used to monitor and control the energy consumed by tracking the energy usage in real time and predicting power requirements, since it has a global view of the system, including the available compute resources, job start and end times, and the performance requirements of waiting jobs [31]. Power management can be divided into fine-grained (DVFS) and coarse-grained strategies. Power-aware allocation is a coarse-grained

technique in which the job scheduler allocates sufficient compute nodes to queued jobs based on the power status of the overall system and the power requirement of the job [38].

SLURM (Simple Linux Utility for Resource Management) is an open-source resource and job management system that is used in many HPC systems around the world, and uses the Power Adaptive Scheduling (PAS) algorithm [39] to perform resource monitoring and job scheduling. When a job is submitted, the scheduler predicts the power consumption of the cluster if the job were hypothetically executed by summing up the maximum power consumption of the requested resources. If this prediction is higher than the allowed power budget, the algorithm checks whether DVFS is allowed for the job, and then calculates the power consumption of the job with DVFS. If this value lies under the power threshold or power cap, the job is scheduled for execution; otherwise, the job remains in the queue to prevent violations of the power constraint. In cases where resources remain idle, some nodes are turned off, and the available power is redistributed by the scheduler in order to start jobs faster. In the Extended Power Adaptive Scheduling algorithm [40], SLURM has plugins for ARM and Intel architectures dedicated to gathering resource usage information per node for an executing job. The plugin is dynamically updated with the CPU and memory utilization of all tasks on each node to enable the scheduler to make more accurate estimates.
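A simplified sketch of the admission decision described above is shown below. It is not SLURM's actual interface; the function and all numbers are illustrative.

```python
# Hedged sketch of a PAS-style admission decision: a job is started only if the
# predicted system power with the job fits under the power cap, optionally
# after assuming a lower DVFS power estimate for the job.

def admit_job(current_power_w, job_peak_power_w, job_dvfs_power_w,
              power_cap_w, dvfs_allowed):
    """Return True if the job may start now, False if it stays queued."""
    if current_power_w + job_peak_power_w <= power_cap_w:
        return True                                   # fits at full power
    if dvfs_allowed and current_power_w + job_dvfs_power_w <= power_cap_w:
        return True                                   # fits only with DVFS applied
    return False                                      # would violate the power budget

print(admit_job(current_power_w=8000, job_peak_power_w=1500,
                job_dvfs_power_w=900, power_cap_w=9000, dvfs_allowed=True))
```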
A power-aware job scheduler designed for IBM BG/P [41] uses a variant of the First-Come-First-Serve (FCFS) allocation by first creating a scheduling window consisting of an optimal combination of jobs that can achieve the maximum system utilization and, at the same time, consume less power than the power budget. This selection guarantees both fairness, by selecting jobs based on the system's original scheduling policy, as well as the power budget requirements. The scheduling algorithm then uses a 0-1 Knapsack model to schedule a subset of jobs into the set of available nodes in the system while maximizing the number of nodes allocated such that their aggregated power consumption is under the power budget.

Auweter et al. [42] developed an accurate prediction model for different application workloads for the LoadLeveler scheduler, which was used on the SuperMUC system at LRZ (Leibniz-Rechenzentrum) to perform energy-aware scheduling. The LoadLeveler provides the users with an energy tag to uniquely identify similar jobs based on the executable and the input parameters. When a new job with an unknown energy tag is run, the LoadLeveler first uses the default frequency and collects hardware performance metrics as well as the execution time, power consumption, and energy consumption of the job. The next time that a job with the same energy tag is submitted, the LoadLeveler uses a multiple linear regression model to approximate the power consumption. It then performs DVFS by selecting an appropriate CPU frequency for the application execution by optimizing the trade-off between the energy savings and the execution time.

Conductor [43] performs intelligent power distribution to nodes and cores at runtime to improve performance. It performs node-level power capping while trying to cut down slack times by using a power/time Pareto-curve to select the best configuration for individual MPI ranks. During the first time step, Conductor assigns an equal amount of power to each MPI process to achieve an optimal application performance for a given RAPL-enforced power bound. Next, Conductor uses DCT to select the optimum number of threads, and assigns a thread and frequency configuration to each MPI process in order to explore the search space. It then monitors the application execution during every time step to predict the behavior of upcoming tasks. Conductor finally performs adaptive power balancing by reducing the power consumption of the non-critical parts of the application, and uses the excess power to speed up the critical path.

Although future supercomputers will have more compute nodes, the cost of operating these systems will restrict the power that can be supplied [31], resulting in a limitation on the number of cores that can run at peak performance simultaneously. However, most supercomputers do not fully utilize the maximum power allocated to each compute node. Therefore, hardware overprovisioning was proposed by Patki et al. [44] for power-constrained systems, and was shown to improve the overall system throughput and decrease the average turnaround time.

A hardware overprovisioned system allocates less power per node and provisions more nodes, thus enabling it to operate under the same power budget. The benefit is that capping the package (CPU) and memory power below peak power results in all of the power being used, and thus enables the system to operate all nodes simultaneously [45]. The drawback is that the performance achieved under a power bound varies across processors, so the processors that are the most efficient at lower power bounds are not necessarily the most efficient at higher power bounds. As only a few nodes run at peak power, jobs must be carefully scheduled so as to reach the optimal performance for the execution [46].

Gholkar et al. [46] also argue that a naive approach of enforcing a power constraint for a job by dividing the power budget uniformly across all the processors suffers from performance variation between identical processors. They proposed a two-level variation-aware machine-wide solution consisting of PTune and PPartition for managing power on a hardware overprovisioned machine, and tested their approach on an Ivy Bridge cluster. PPartition partitions a machine's power budget across parallel jobs running on the system such that the power budget is never exceeded. PTune determines the optimal number of processors for a job by eliminating the less power-efficient processors, and distributes the job's power budget across them. This aims to maximize the job's performance under the power budget. PTune uses an offline model to characterize the performance by maximizing the number of retired Instructions Per Second (IPS). It uses RAPL via the msr-safe kernel module to collect power and performance profiles of MPI applications. Once a job is dispatched by the scheduler, PPartition calculates its power budget. If there is not enough power, it steals power from previously scheduled jobs, or only schedules this job once sufficient power is available.

Sarood et al. [45] profiled the strong scaling of an application under different power caps to optimize the number of nodes and the power distribution between CPU and memory subsystems in order to minimize the execution time. Instead of profiling the entire search space of different node configurations, CPU power levels and memory power levels, the proposed method optimizes the search space by estimating the performance using curve fitting or interpolation. Their approach uses RAPL to obtain power readings for each power plane via Model Specific Registers (MSRs) to perform power capping for overprovisioned systems. Their experiments were run on Sandy Bridge servers.

Imes et al. [47] present CoPPer, a feedback controller that performs software-defined power capping as a replacement for software-managed DVFS control. It is designed to be system, application and power capping implementation independent, and performs power-aware job allocation while meeting performance targets. CoPPer uses adaptive feedback control based on a soft performance goal, performance feedback, and the minimum and maximum power caps of the system to perform power capping. After each time window, an adaptation function uses the application performance to return the new power cap that should be applied to the system, preventing an over-allocation of power when applications cannot achieve an additional speedup. The evaluation was done using RAPL to perform power capping. The authors conclude that the overhead from power capping is lower than the overhead from DVFS, and that coarse-grained socket-level power capping provides better runtime benefits than fine-grained DVFS.
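The feedback idea behind software-defined power capping can be illustrated with a generic controller like the one below. This is not CoPPer's actual adaptation function; the proportional update, the clamping to the platform's cap range, and all values are assumptions for illustration.

```python
# Generic sketch of adapting a power cap from a performance feedback signal:
# when measured performance falls short of the goal the cap is raised, when it
# exceeds the goal the cap is lowered, clamped to the platform's cap range.

def adapt_power_cap(cap_w, perf_measured, perf_goal, cap_min_w, cap_max_w, gain=0.5):
    error = (perf_goal - perf_measured) / perf_goal   # relative performance error
    new_cap = cap_w * (1.0 + gain * error)            # proportional adjustment
    return max(cap_min_w, min(cap_max_w, new_cap))    # respect hardware limits

cap = 120.0
for perf in [0.8, 0.9, 1.05, 1.0]:                    # normalized performance samples
    cap = adapt_power_cap(cap, perf, perf_goal=1.0, cap_min_w=60.0, cap_max_w=150.0)
    print(round(cap, 1))
```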
Job scheduling techniques for reducing the power consumption may also be applied when application Service Level Agreements (SLAs) for VMs have to be fulfilled. The job scheduling policy must consider many factors, such as consolidating workloads, preserving the QoS of the tasks as agreed in the SLA, and the virtualization overheads incurred during VM creation, checkpointing and migration [37]. The method proposed by Berral et al. [37] introduces such a job scheduling strategy that decides the best location for executing a new job depending on the resources it requires in order to fulfill its SLA. The method periodically calculates whether to move jobs using the information derived from the system, including job execution and node status. When new VMs have to be created, the strategy decides the optimal combination of the hosts or the physical machines to start up or shut down. The strategy takes into consideration the heterogeneity of the machines in the cluster, the hardware and software requirements of the VM, the amount of resources as well as the energy consumption. Using the above constraints as input, the scheduling policy ranks each machine using a dynamic score, and solves a dynamic optimization problem using a hill-climbing algorithm to assign each VM to the best machine.

2.2.2 Power Management for Heterogeneous Systems

As described in Section 2.1.6, power-aware methods are becoming more important for heterogeneous systems consisting of accelerators such as GPUs, especially as HPC is increasingly being used for machine learning and AI algorithms. Heterogeneous systems face a high degree of variability in the power consumption due to the presence of processing elements of different architectures [48], and a major challenge for these platforms is the lack of a standard energy measurement and monitoring interface [17].

The research by Tsuzuku et al. [48] leverages the iterative behavior exhibited by scientific applications to perform power capping for nodes with accelerators. The method uses a hybrid static/dynamic model to relate the clock speeds with the power consumption and the execution time in order to determine the initial frequency of the GPUs and CPUs at the beginning of the application run. The power and energy consumption are monitored during the application execution for all GPU and CPU clock combinations. Finally, the best clock combination that does not exceed the power limit is selected, while minimizing the energy consumption. In the dynamic tuning stage, the GPU frequency is adjusted based on the monitored real-time power consumption at application runtime. GPU clock speeds are reduced if the power consumption reaches or exceeds the power limit, and increased if it lies under the limit. This method is most useful when an application displays different access patterns.

The ANTAREX project [49] uses a Domain Specific Language (DSL) approach to distribute code between multi-core CPUs and accelerators by specifying adaptivity strategies, parallelization and mapping in the application at design-time. The technique introduces an extra compilation step to translate the DSL into the intended programming language. At runtime, the ANTAREX power manager estimates the energy consumption using an offline linear model [50] that correlates the capacitance with the IPC. The goal of ANTAREX is to optimize the performance under a power constraint by using the priorities for each core and the resource usage of the application to solve an optimization problem and select a set of frequencies for each core. The power manager enforces node-level power capping using msr-safe to change the P-state of each core independently. This approach is specialized for ARM-based multi-cores and accelerators, while our approach targets all conventional HPC systems that support DVFS.

2.2.3 DVFS

DVFS for parallel applications can be applied at coarser granularity by setting a single optimum frequency for the entire application run or for specific workloads, and at finer granularities by setting different fre- quencies for different application phases. Previous works in phase detection and prediction have multiple definitions for the phase. One definition for program phase is that a phase is a period of execution with a sta- ble behavior [51, 52]. The alternate definition defines a phase as a single iteration of the time loop [53, 54]. Phase identification methods can identify phases at fine- or coarse-granularity. For coarse-grained identifi- cation, hardware performance counters are used to dynamically identify and predict workload changes, and are described in detail in Section 2.2.3.3. Most fine-grained phase classification methods divide the application’s execution into non-overlapping, fixed size execution intervals using the number of executed instructions. The phases of the profiled ap- plication are then identified as compute-, memory- or IO-intensive using the similarity in the behaviour during individual execution intervals. The characterization of the application execution into similar regions

depends on the target application [55]. According to the first definition of a phase, a correctly identified phase would have very small variations between any two execution intervals in that phase. DVFS is then performed by applying a different CPU frequency for each phase [31].

Section 2.2.3.1 presents both reactive phase identification techniques, which detect phases at a coarser granularity, i.e., via sampling, and predictive phase identification techniques, which predict phases at a finer granularity. These techniques may not use DVFS specifically. However, the phase identification methods can be considered as a starting point for power-saving techniques. Section 2.2.3.2 presents methods that characterize the application execution into phases, and then also perform DVFS after phase prediction.

2.2.3.1 Phase identification techniques

Sembrant et al. [56] developed ScarPhase, an online phase detection library that collects hardware performance counters using perf events to generate so-called sparse frequency vectors. The frequency vectors are constructed using perf events that either count the occurrences of certain events, such as the number of instructions executed, or perform sampling by periodically triggering interrupts after a given number of occurrences of the event and then recording the CPU state. The library uses Intel's Precise Event Based Sampling to sample conditional branch instructions and determine the current phase of an application at runtime with less than 2% runtime overhead.

Nagpurkar et al. [57] also developed an online phase detection mechanism using a similarity model to transform execution profile elements into a sequence of similarity values using conditional branches. The similarity values represent the degree of similarity between profile elements. If the similarity is sufficiently high, the model marks the current execution as belonging to a certain phase, or as in transition between phases. The framework identifies phase boundaries by correlating a period of repetition with the time of the latest dynamic branch. The drawback of this mechanism is that it assumes that different iterations of a loop and recursive executions of a method belong to the same phase even if they have varying branch behavior.

The work by Kim et al. [58] identifies and predicts program phases using phase tracking hardware to exploit phase behavior in order to perform dynamic optimizations by adaptively reconfiguring the microarchitecture, such as the cache size. Dynamic optimization systems track phase changes by sampling the performance counters or by code instrumentation. The phase tracking hardware tracks functions and loops in the program using a stack of the path from the root node to a dynamic code region. A phase history table is used to track the change of phase signatures between dynamic code regions. A change in the phase signature is detected using Clocks Per Instruction (CPI), and then assigned a phase ID.

Padmanabha et al. [52] demonstrated that heterogeneous systems can gain energy-efficiency by using a predictive scheduling or switching mechanism that dynamically guides program execution to the most energy-efficient core without causing a performance degradation. Their work involves identifying phases at a finer granularity and mapping them to customized hardware. They developed a predictive controller to predict an oncoming phase change at a granularity of hundreds of instructions, and accordingly migrate a thread using this information. They argued that their predictive approach is more energy-efficient than existing reactive controllers, which distinguish coarse-grained phase changes by sampling and assume that the performance will remain stable until the following sampling phase. Their predictive scheduler reduces the energy consumption of an out-of-order processor by 15% with negligible overheads on a tightly coupled heterogeneous system.

Gonzalez et al. [59] developed a fine-grained phase identification technique using hardware counters that are read only when MPI communication (collective or point-to-point) takes place, so as to characterize the behavior of a single computation or CPU burst that takes place between communications. The technique uses completed instructions and IPC to perform a cluster analysis on the computation bursts. The goal of this technique is to use IPC to detect application regions with different computational complexities, as well as regions with the same complexity but different performance. The clustering is done using DBSCAN, a density-based clustering algorithm, in combination with a variant of agglomerative hierarchical clustering.
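A hedged sketch of such density-based clustering of per-burst metrics is given below: each computation burst is described by its completed instructions and IPC, and DBSCAN groups bursts with similar behaviour. It requires scikit-learn, and the sample data and parameters are purely illustrative.

```python
# Sketch: clustering computation bursts by (completed instructions, IPC) with
# DBSCAN after standardizing both metrics. All values are fabricated.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# rows: (completed instructions, IPC) per computation burst
bursts = np.array([
    [1.0e9, 2.1], [1.1e9, 2.0], [0.9e9, 2.2],   # compute-heavy bursts
    [2.5e8, 0.6], [2.4e8, 0.5], [2.6e8, 0.7],   # memory-bound bursts
])

X = StandardScaler().fit_transform(bursts)       # put both metrics on one scale
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)                                    # bursts in the same cluster share behaviour
```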

2.2.3.2 Detection and characterization of application behaviour for DVFS

Zhang et al. [60] present PIFA, an intelligent Phase Identification and Frequency Adjustment framework for energy-efficient and time-sensitive mobile computing by combining offline analysis and online analy- sis. Their main motivation is that a single frequency setting for different application phases may not be energy-efficient enough. The offline analysis profiles an app to collect the resource utilization, and performs cluster analysis using K-Means to identify the major phases, which are representative of the major energy consumption patterns exhibited by the app. Then, a phase classifier is trained to dynamically identify phases during the online analysis using resource usage information collected from the CPUs, GPUs and memory. At runtime, PIFA schedules an app to execute on an idle CPU core whenever possible, and uses the phase classifier to identify phases and perform DVFS. Sherwood et al. [61] developed SimPoint to find simulation points in a program by clustering the code profile of the full execution, and then picking a point from each cluster as representative of the program execution. The advantage of this approach is that the entire analysis is independent of the architectural parameters, and uses only the code execution characteristics for clustering. It captures information about the changes in the program behavior over time using a Basic Block Vector (BBV), which represents a code section with a single entry and exit. The similarity between different intervals is determined using the frequency of execution of the basic blocks by tracking the program counter of every committed branch and the number of instructions executed between the current branch and the last branch. If the distance between two basic block vectors is small, they are similar. The tool uses random linear projection to reduce the dimensionality of the input data, and clusters similar intervals into phases using K-Means. The drawback of this method is that K- Means requires the number of clusters, k as input, so the tool performs clustering for all possible values of k and selects the best result. The authors extended SimPoint with an online phase prediction architecture using a Run Length Encoded (RLE) Markov predictor [51] to perform DVFS to achieve energy savings for a relative reduction in performance. The Runtime Energy Saving Technology (REST) [62] modifies the core frequencies by performing DVFS at runtime without prior knowledge of the application on two different architectures. The proposed method characterizes the phases of an application workload into compute-bound or memory-bound states by implementing an interrupt-based sampling of hardware performance counters to determine the state of the processor and chip-level memory activity. The data from the profiles is interpreted by different kinds of decision makers to determine the frequency whenever there is a shift in the workload. The aim is to lower the frequency during a memory phase and raise it during a compute phase. Frequency switching is performed using different decision-makers. The naive decision-maker always changes frequencies when necessary by accepting the profiling data to be true. A second decision maker determines a change in the frequency during workload changes by verifying the profile data using a history table of past profile data. Another decision maker studies past workload changes and predicts future changes by using a Markov-based predictor. 
While this approach is promising, it is coarse-grained, and does not take into account similarities in the behavior of code regions. Isci et al. [55] perform online phase predictions using a Global Phase History Table predictor, which is in- spired by a branch predictor technique to switch frequencies using DVFS. The predictor characterizes appli- cation behavior using the metrics memory bus transactions/micro-op and micro-ops/cycle. The application

is divided into coarse-grained phases by setting interrupts after each interval of millions of instructions to prevent any observable overheads. The phase predictor is invoked at each interrupt, and observes the patterns from previous samples to determine the behaviour of the next phase. The prediction is mapped to a DVFS setting, which is then applied to the processor for the next interval using the Intel SpeedStep technology.

Booth et al. [63] propose a Hidden Markov Model framework to detect application phases to apply DVFS, which the authors call Phase-based Voltage and Frequency Scaling (PVFS), to save power for NoCs. Unlike Markov chains, which contain only observable states that can be seen or measured, Hidden Markov chains contain hidden states that influence the probability of transitioning from one state to another. The Hidden Markov Models are constructed using hardware counters such as IPC taken at specific code locations, for example, at individual iterations of the time loop. The hidden states are the phases that represent specific power and performance characteristics, such as {high power, low performance}, for which a static DVFS setting is generated. The advantage of this method is that since the Hidden Markov Model is constructed using hardware counters, it can be applied to any heterogeneous many-core system. The limitation is that it statically sets a single frequency for the phases.

Acun et al. [54] propose a fine-grained runtime approach to leverage application dynamism using Charm++ by considering function-to-function variations within the applications to optimize the energy-efficiency using DVFS. The approach runs only a subset of the total iterations of the application time loop. First, the runtime collects the execution time and the energy consumption for each instance of a region in the application under different frequency settings. Then, the optimal frequency is calculated and applied for each instance to either minimize the energy or minimize the performance degradation. The implementation was tested on Haswell processors using the cpufreq kernel module to switch the core frequency. While this approach is similar to our proposed approach, it neglects potential similarities between the individual iterations of the time loop, which our work focuses on.

Intel's open-source Global Extensible Open Power Manager (GEOPM) [64] is a collaborative community effort to develop a tree-based, runtime framework that provides energy management for power-constrained systems. The framework is plugin-based, and supports offline and online analysis. The offline analysis first performs a training run of the application to characterize it, and then collects the energy and performance data to determine the optimal core frequency for individual program regions. At runtime, it collects hardware performance counters from the instrumented application, estimates the power consumption, and dynamically identifies and reallocates power to the nodes on the critical path. The goal of GEOPM is to adjust the power caps for the nodes individually instead of performing a uniform power capping. The drawback of this method is that it only switches the core frequency, and ignores similarities between individual phases.

Schöne et al. [65] developed a method to integrate performance analysis with energy-efficiency optimizations by using the profile data of instrumented applications on Intel Sandy Bridge, Westmere and Ivy Bridge systems.
They perform region-based DVFS to switch the frequency, and DCT to change the number of threads. First, application profile data is collected along with PAPI hardware performance counters to mea- sure the L3 cache miss rate. The profile data is used in a post-mortem step to define the adaptations that have to be made whenever a particular region is entered. For example, the frequency is lowered when the L3 cache miss rate is over a certain threshold, and increased to the reference frequency when the miss rate is below the threshold. The major drawback of this method is that it assumes that regions exhibit a similar behavior across runs for different input data, which is not necessarily true, as described in Section 8.1. Hotta et al. [66] implemented PowerWatch, a profiler to measure and collect the power information of each node in a cluster while performing DVFS for individual program regions of an application. The program is divided into several communication- and compute-intensive regions by manually instrumenting the code at appropriate locations. The instrumented code is then executed to collect the profile data for the execution time and power consumption for each region at every frequency setting. Finally, best clock frequency

settings are determined by taking the switching overhead into account in order to optimize the EDP and Power Delay Product (PDP). The settings are then applied to individual regions during production runs. The drawback of this method is that the code regions are manually instrumented, while our proposed method also works with automatic compiler instrumentation.
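The threshold-based region switching used by Schöne et al. [65] above can be sketched as follows; the miss-rate threshold, the frequency values and the region names are illustrative assumptions, not values from their work.

```python
# Hedged sketch of threshold-based, region-level frequency selection: regions
# whose L3 miss rate exceeds a threshold get a reduced core frequency, all
# other regions keep the reference frequency. All values are illustrative.

REFERENCE_FREQ_KHZ = 2_400_000
REDUCED_FREQ_KHZ   = 1_600_000
L3_MISS_RATE_THRESHOLD = 0.05      # misses per instruction (assumed threshold)

def frequency_for_region(l3_miss_rate):
    """Decide the core frequency to apply when a region is entered."""
    if l3_miss_rate > L3_MISS_RATE_THRESHOLD:
        return REDUCED_FREQ_KHZ    # memory-bound: a lower frequency saves energy
    return REFERENCE_FREQ_KHZ      # compute-bound: keep the reference frequency

# region name -> measured miss rate from a profiling run (fabricated values)
profile = {"assemble_matrix": 0.01, "sparse_solve": 0.12}
plan = {region: frequency_for_region(r) for region, r in profile.items()}
print(plan)
```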

2.2.3.3 Using workload characterization

HPC systems run multiple applications at the same time, where each application can have different resource requirements, and also exhibit variations in the workload at runtime. Hence, it is important for systems running applications with varying workloads to select the voltage-frequency setting dynamically in order to adapt to the workloads [67].

Basireddy [67] developed a runtime manager that identifies changing workloads on the processor and maps a workload to an appropriate DVFS setting using cpufrequtils at runtime. First, the approach selects a resource combination that meets the application performance requirements using a performance prediction model based on the metric Memory Reads Per Instruction. At runtime, it monitors the application workload and collects performance data via hardware performance metrics. It then classifies the workload into classes (compute-intensive, memory-intensive, or mixed) using K-Means. The optimal frequency setting is selected either periodically for each core or for the whole system, depending on the underlying architecture, in order to minimize the overall system energy consumption while maximizing performance. The proposed technique was shown to be a significant improvement over Linux's conservative, ondemand and performance power governors on Odroid-XU3 with ARM's big.LITTLE heterogeneous architecture as well as Intel Xeon E5-2630V2 and Xeon Phi 7120P cores. The drawback of this approach is that the voltage-frequency setting is static for the entire application run.

Chetsa et al. [68] propose an automatic workload detection and characterization methodology to perform dynamic system reconfiguration using DVFS without the involvement of the user. The method introduces the concept of Execution Vectors (EV), which store sensor values, including hardware performance counters, network bytes sent/received and disk read/write counts, and labels workloads as compute-intensive, memory-intensive, mixed or I/O types. Changes in consecutive workloads are detected when the distance between their EVs exceeds a threshold, and are used to label recurring phases. DVFS is then performed by switching the frequency depending on the workload type. For example, the disks are sent to sleep and the frequency is scaled up for compute-intensive workloads, and the frequency is scaled down for memory-intensive workloads. The authors extended their work with modeling techniques [69] to understand the runtime behaviour due to changing workloads on other HPC systems. Their prediction model uses the offline data from the hardware counters on the reference platform to perform an online matching of the data on the target platform in order to estimate the energy consumption of the application.
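The change-detection step described for Chetsa et al. can be sketched as a distance test between consecutive execution vectors; the metrics, normalization and threshold below are illustrative assumptions.

```python
# Hedged sketch: flag a workload change whenever the distance between two
# consecutive execution vectors of sensor readings exceeds a threshold.
import math

THRESHOLD = 0.5   # assumed distance threshold after normalization

def distance(ev_a, ev_b):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ev_a, ev_b)))

def detect_changes(execution_vectors):
    """Return the indices at which the workload type is considered to change."""
    changes = []
    for i in range(1, len(execution_vectors)):
        if distance(execution_vectors[i - 1], execution_vectors[i]) > THRESHOLD:
            changes.append(i)
    return changes

# each vector: (normalized IPC, cache miss rate, network traffic) -- fabricated
evs = [(0.9, 0.1, 0.0), (0.88, 0.12, 0.0), (0.3, 0.6, 0.1), (0.31, 0.58, 0.1)]
print(detect_changes(evs))   # -> [2], a change before the third interval
```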

2.2.3.4 Using analytical models

The AutoTune project [8] extended Periscope [70], an automatic performance analysis tool with autotun- ing support for tuning the performance and energy-efficiency of HPC applications using tuning plugins. The parallelism capping plugin improves the energy consumption of OpenMP applications by withdrawing compute resources whenever they are not utilized efficiently, and uses EDP as the objective function to de- termine the optimal number of threads for each OpenMP parallel region. The DVFS plugin [71] can tune different objectives, such as the energy consumption, TCO, and EDP by generating models to predict the energy consumption, execution time and power at different CPU frequencies using the performance data on a certain platform. The plugin uses PAPI hardware counters to collect performance measurements, and

the enopt library [72] to sample the energy measurements via RAPL every 10 seconds. The plugin also performs a search space reduction around the selected frequency by evaluating the next lower and higher frequency values. This is a static tuning approach where a single CPU frequency is set for a region during the entire program run. In our approach, both core and uncore frequencies can be configured dynamically for different instances of the program regions.

Springer et al. [73] propose a method that explores the energy-time trade-off by developing a model-based algorithm that selects the number of nodes and the CPU frequency to perform dynamic switching for computation and memory phases. The application is first divided into compute- and memory-intensive phases, and then performance models for both execution time and energy consumption are created. The performance is then predicted using estimates that are generated by repeatedly executing the program for a few iterations and performing regressions until a satisfactory configuration is found. The method was tested on an AMD Athlon-64 cluster using sysfs to switch the frequency.

Green Queue [74] uses an application's characteristics as inputs to a power-performance model to perform DVFS in order to minimize the energy consumption of program regions. The framework first uses static properties from binary instrumentation, and runtime behavior from traces, to determine whether the application is load-balanced across MPI ranks. It then performs phase characterization, where each phase is said to be a contiguous execution of code whose optimal frequency is approximately homogeneous. Two strategies were implemented for selecting the clock frequency depending on whether the application is load-balanced or not. If the workload is balanced, DVFS is used to switch the CPU frequency for different phases of the time loop. For unbalanced loads, the CPU frequency is varied for individual process ranks. Once the optimum frequency is selected for the minimum energy consumption, neighboring phases with the same frequency are merged. At runtime, the optimal configurations are read from a file and applied to different application regions.

Elangovan et al. [75] propose DVFS to select different frequencies for different memory channels to selectively lower the frequency of the main memory and reduce the energy consumption while keeping the performance degradation within tolerable limits. The algorithm first reads all the hardware counters, and uses a performance model to calculate the execution time as well as the power consumption of certain parts of the program at all the possible voltage and frequency levels. It then chooses a state for the next code region that maximizes energy savings while maintaining the performance. Since the complexity of the algorithm increases exponentially with the number of memory channels, the authors propose various frequency selection strategies, including an exhaustive search in which all the frequencies are considered, a rule-based search where only three frequency levels are considered at a time, and a ganged search by slaving all the memory channels together.

Hsu et al. [76] implemented a power-aware algorithm for an adaptive runtime system that is based on the CPU utilization. The algorithm lowers the CPU voltage and frequency setting when the CPU utilization is low to conserve energy, and raises them when the utilization exceeds the threshold.
The advantage of this approach is that it is both application- and input-independent, so it does not require any profiling information a priori. Instead, it implicitly gathers information by monitoring the level of off-chip accesses during each interval, and uses hardware performance counters like the MIPS (Millions of Instructions Per Second) rate to model the execution time and make scheduling decisions. When the off-chip accesses are high, the cpufreq module is used to lower the CPU frequency setting without affecting the performance. The drawback of this approach is that it is too coarse-grained due to the default measurement interval of 1 second.

Rauber et al. [77] developed an application-specific and an application-independent analytical model to select the optimal number of threads, and to minimize the energy consumption and EDP by controlling the CPU frequency using DVFS. They used RAPL to obtain power measurements on Haswell processors.

PPEP [78] is a software tool that models the power consumption by reading the hardware performance counters of the cores at regular intervals, and predicts the program execution at different voltage-frequency states for AMD processors. The tool uses a regression model that takes nine hardware performance counters as inputs, and uses the metric Cycles Per Instruction (CPI) to estimate the chip dynamic power at different voltage-frequency states.

PuPPET (Power Performance PETri net) [38] is a modeling tool that implements a combined power-aware job allocation and DVFS strategy. The tool allows users to analyze power-performance trade-offs by modeling the dynamic execution of a parallel job using an approximation of the energy consumption via Petri nets. It implements a power-aware job allocation strategy by estimating the power requirement for the job at the head of the queue. If its power requirement is less than the power cap, it is allocated resources. Otherwise, either less power-hungry jobs are executed, or DVFS is implemented by running the processors at a low power state to respect the power cap. The tool was evaluated on Mira, an IBM Blue Gene/Q machine.
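Several of the approaches above rely on regression models that map hardware counter rates to power. The following generic sketch (not any specific tool's model) fits such a linear model with least squares; the counters, sample values and the resulting coefficients are fabricated for illustration.

```python
# Sketch: fit power ~ w0 + w1*IPC + w2*L3 MPKI + w3*bandwidth via least squares.
import numpy as np

# rows: (IPC, L3 misses per kilo-instruction, memory bandwidth GB/s) per sample
X = np.array([[2.0, 1.0, 10.0],
              [1.5, 5.0, 40.0],
              [0.8, 9.0, 70.0],
              [2.2, 0.5, 8.0]])
y = np.array([95.0, 110.0, 120.0, 92.0])        # measured package power in watts

A = np.hstack([np.ones((X.shape[0], 1)), X])    # add an intercept column
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_power(ipc, mpki, bw):
    return float(coeffs @ np.array([1.0, ipc, mpki, bw]))

print(round(predict_power(1.2, 6.0, 50.0), 1))  # predicted power for a new sample
```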

2.2.3.5 Using machine learning

Vazquez et al. [79] use application dynamism to apply machine learning to reduce the overhead of design space exploration while finding the best configuration of the total size, associativity, and line size for the L1 instruction and data caches. First, Principal Component Analysis (PCA) is used to perform the feature reduction of the execution statistics of the application's time loop. Then, the execution statistics are taken as input by Artificial Neural Networks (ANNs), and a best cache configuration for each unique time loop is predicted. The predicted configurations are stored and retrieved later when the loop is executed again. This method has the advantage that it can be applied to any type of computing system with configurable hardware components. The methodology achieves an average energy degradation of less than 5% using the predicted configurations.

Cochran et al. [80] present a two-step dynamic energy management technique that trains a multinomial logistic regression model at design-time using a set of performance counters, per-core temperatures, the current DVFS setting and the thread count as inputs to minimize the EDP and ED2P. It finds the optimal configuration of the threads and the DVFS setting for multi-threaded applications based on the workload characteristics. At runtime, the model calculates a priori the probability of each candidate operating point being optimal for the objective function, and selects the output with the highest probability. The authors also present Pack & Cap [81], which again uses a multinomial logistic regression classifier to predict the optimal DVFS and thread packing settings. However, it tries to maximize the performance within a power budget. The classifier is trained offline for each workload to estimate the probability of a combination of frequency setting and thread packing yielding the optimal workload performance within a power constraint. The proposed approach uses the cpufreq governor to perform frequency switching.

Ozer et al. [82] optimize the energy consumption of the HPC systems at the Leibniz-Rechenzentrum (LRZ) using machine learning techniques to perform DVFS. They use historical data generated from GEOPM to train different regression models to predict the optimum frequency settings for the next timestamp given the information from the previous timestamps. A supervised learning approach such as a random forest is trained offline a priori using the trace history of an application as input to predict the future values of the CPU power and the number of retired instructions. The approach also uses linear regression to make runtime predictions of the optimal frequency settings for different program regions.

Hajiamini et al. [83] propose a Markov-based DVFS method to predict core utilizations and CPU frequency settings based on the execution times between application time intervals. The DVFS setting is modeled as a Hidden Markov Model (HMM), which predicts the next state based on some hidden states/variables that influence the probability of the selection using the Viterbi algorithm. This method uses core utilizations and execution times as the states of the HMM.

The probabilities for transitioning from one state to another are obtained by profiling the core utilizations. The algorithm uses K-Means to partition the core utilization values and the execution times into clusters representing a range of core utilizations (low, medium or high) and execution times. Thus, lower frequency settings are applied to clusters with shorter execution times, since they represent an underutilized core.

Benedict et al. [84] propose an energy prediction mechanism using a bagging tree-based Random Forest Modeling (RFM) approach for OpenMP applications. RFM is a tree-based modeling technique that is used to reduce the variance of an estimated predictor by using an ensemble learning method. The mechanism finds the optimal problem size, code variant as well as the best frequency for different code regions to optimize the energy consumption while maintaining the performance of the code. The optimizer first submits a list of proposed energy-optimal configurations to the RFM-based energy predictor, which splits them into training and testing datasets. The predictor then creates a regression model using various metrics like the energy consumption, execution time, cache misses, and total number of instructions that are collected during the execution of the configurations in the training dataset. Using the collected metrics, the RFM predicts the energy consumption of the test dataset. The technique uses RAPL to obtain the energy values.

Reinforcement learning is becoming an increasingly popular technique for autotuning HPC applications that have large, complex problem spaces. In reinforcement learning, an agent must learn a certain behavior by a continuous process of trial-and-error interactions with a dynamic environment through punishments and rewards [85]. Reinforcement-learning problems may either use the strategy of finding a behaviour in the search space that performs well in a certain environment, or use statistical techniques and dynamic programming to assess the incentive of taking certain actions. The Q-learning approach is a dynamic programming technique in which the agent observes its current state, issues an action and then observes the resulting new state. A penalty is issued to the agent based on the value of the state transition. The agent finally learns the best state by trying all actions in all states to minimize long-term penalties [25].

Gocht et al. [16] used Q-learning to select the best CPU and uncore frequency settings for different instances of program regions, i.e., call-paths of instrumented functions. First, the initial energy is measured when a region is entered for the first time. The algorithm then searches for new states that decrease the normalized energy consumption. If a new state results in a higher normalized energy consumption, the algorithm returns to the previous state that generated a positive reward. Although the proposed method is region-based, it does not take into account the similarities between the regions in different time loops, and assumes that there is no dynamism between individual iterations of the time loop.

Shen et al. [25] present a system-level dynamic power management method using Q-learning for peripheral devices. The method uses a two-level control model that optimizes the power-performance trade-off so that it operates the system at a relatively constant performance or power to minimize the energy dissipation. The state space for the algorithm is the set of available power modes.
The Q-learning algorithm performs CPU power management by controlling the DVFS settings when the compute and memory intensity vary for different workloads. First, the power manager learns the pattern of the workload by keeping the device active during the idle period. When the workload changes, the power manager learns again and puts the device to sleep during long idle periods. This method sets a single optimal DVFS setting for a specific input via the cpufreq driver and the Intel Enhanced SpeedStep technology.
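The core of such Q-learning-based frequency tuning is the tabular value update. The sketch below is a generic illustration rather than the implementation of any of the works above: states are assumed to be (core frequency, uncore frequency) pairs, the reward is the negative of a placeholder energy measurement, and all parameters are arbitrary.

```python
# Minimal tabular Q-learning sketch for frequency selection. Actions nudge one
# of the two frequency indices; the reward rewards lower (placeholder) energy.
import random
from collections import defaultdict

CORE_FREQS   = [1.2, 1.8, 2.4]          # GHz, assumed available settings
UNCORE_FREQS = [1.2, 1.8, 2.4]
ACTIONS = ["core_down", "core_up", "uncore_down", "uncore_up", "stay"]
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2   # learning rate, discount, exploration

Q = defaultdict(float)                  # Q[(state, action)] -> value estimate

def apply(state, action):
    ci, ui = state
    if action == "core_down":   ci = max(ci - 1, 0)
    if action == "core_up":     ci = min(ci + 1, len(CORE_FREQS) - 1)
    if action == "uncore_down": ui = max(ui - 1, 0)
    if action == "uncore_up":   ui = min(ui + 1, len(UNCORE_FREQS) - 1)
    return (ci, ui)

def choose(state):
    if random.random() < EPSILON:                        # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit

def measure_energy(state):
    # Placeholder: a real tuner would set the frequencies and read RAPL here.
    ci, ui = state
    return 100 + 4 * (ci - 1) ** 2 + 3 * ui              # minimum at (1, 0)

state = (2, 2)                           # start at the highest frequencies
for _ in range(200):                     # one update per tuned region instance
    action = choose(state)
    nxt = apply(state, action)
    reward = -measure_energy(nxt)        # lower energy -> higher reward
    best_next = max(Q[(nxt, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = nxt
print("final state:", state)
```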

2.2.3.6 Intertask DVFS for applications with load imbalance

Some DVFS techniques are specialized for MPI applications that exhibit load imbalance. These schemes optimize the energy consumed during a slack period by identifying the load imbalance in the application caused by blocking MPI communication [31]. Then, they selectively switch the clock frequency to ensure that the critical path of the execution is never slowed down [86].


Rountree et al. [86] propose a runtime DVFS system called Adagio that reduces the CPU frequency during communication phases using the sysfs interface. The technique first creates tasks, which are defined as a period of computation between two MPI calls. To predict the upcoming communication calls, a unique signature is created for each task based on the hash of the task that was previously found to succeed the current task. Adagio performs an online monitoring of PAPI hardware performance counters to estimate how fast a task would run if it ran at the fastest frequency. After a task completes, Adagio collects the execution time and determines the frequency schedule for the next execution of that task. The selected ideal frequency is defined as the lowest CPU frequency at which a given task can be run without incurring any slack or slowdown. The disadvantage of this method is that similar communication patterns between HPC applications are not considered during predictions, thereby causing a performance penalty. Moreover, Adagio is applicable only to single-threaded programs. Lim et al. [87] present an adaptive technique for performing DVFS in communication phases by automat- ically detecting and identifying communication regions by monitoring MPI calls without any user involve- ment. The system comprises a training system and a frequency shifting system. The training system consists of a region-finding algorithm, which works by maintaining begin and end flags for each MPI call. When the begin flag is encountered for an MPI call, a new region begins, and hardware performance counters are collected. Using the metric micro-operations/second as an indicator of CPU load, a frequency is selected for regions with low micro-operations to minimize the EDP. When the end flag is encountered, the frequency is reset. The experiments were conducted on an AMD Athlon-64 cluster using the sysfs interface for shifting the frequency. The drawback of this work is that since the frequency switching is done on the fly, it needs coarse-granular MPI regions to measure the hardware counters. Freeh et al. [88] propose a framework to perform DVFS for MPI programs dynamically. The framework identifies phases of compute- and memory-intensive regions in an application using the idle periods caused when the CPU waits for memory or IO. Phase boundaries are identified when there is a change in the memory pressure, represented using the metric operations/miss. In each experiment, a heuristic executes a new phase using a slower frequency than the one used for the previous phase. If the energy-time trade-off for the new phase is better than the current solution, it is accepted. Otherwise, the algorithm recursively executes the phase with the next lower frequency until the new trade-off is worse than the current, or when all the frequencies are tested. The drawback of this method is that it is not applicable for dynamic applications since it assumes that HPC applications have predictive phases.
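The slack-based frequency choice that underlies intertask schemes such as Adagio can be sketched as picking the lowest frequency whose predicted compute time still fits into the available time before the next communication completes. The inverse-frequency scaling model and the numbers below are illustrative assumptions, not Adagio's actual predictor.

```python
# Hedged sketch of slack-based frequency selection: choose the lowest frequency
# whose predicted compute time fits into the measured compute time plus slack,
# assuming a simple CPU-bound, inverse-frequency scaling model.

FREQS_GHZ = [1.2, 1.6, 2.0, 2.4]          # available frequencies (assumed)

def ideal_frequency(t_compute_at_max_s, slack_s, f_max=max(FREQS_GHZ)):
    budget = t_compute_at_max_s + slack_s             # time available for the task
    for f in sorted(FREQS_GHZ):                       # try the lowest frequency first
        predicted = t_compute_at_max_s * f_max / f    # assumed scaling model
        if predicted <= budget:
            return f
    return f_max                                      # no slack: run at full speed

print(ideal_frequency(t_compute_at_max_s=1.0, slack_s=0.5))   # -> 1.6
```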

2.2.4 UFS

André et al. [89] developed a daemon to automatically switch the uncore frequency and at the same time limit the performance degradation. The daemon uses the ratio of the FLOPs and the memory bandwidth to compute the arithmetic intensity and characterize the application into compute- and communication- intensive regions. The daemon determines how the switching decision impacts the FLOPS and memory bandwidth, and reduces the uncore frequency if the FLOPS drop when compared to the previous iteration, but the memory bandwidth remains stable. The approach used LIKWID to read the counters provided by RAPL and set the uncore frequency. The experiments showed that decreasing the uncore frequency has a negative impact on the latency and bandwidth of the L3 cache and main memory, and no impact on the L1 and L2 caches. The authors also evaluated the potential of UFS for energy savings, and recommend limiting the performance slowdown during UFS to up to 10% in order to reduce the overall energy consumption of the socket and main memory. Gholkar et al. [90] propose Uncore Power Scavenger (UPSCavenger), a lightweight library to dynamically detect phase changes and automatically set the best uncore frequency for individual phases at runtime to save

power while limiting the performance degradation. UPSCavenger detects phase changes by periodically measuring the IPC and the socket power, using RAPL for the DRAM power measurement at every invocation of the phase. The uncore frequency is then switched to the highest value during a phase transition using the msr-safe kernel module. A transition from a compute-intensive phase to a memory-intensive phase is identified as a rise in the DRAM power and a reduction in the IPC. The authors observe that UFS results in reduced power for cache-bound applications, performance degradation for memory-bound applications, and reduced package power for CPU-bound applications. Both of the above methods perform sampling to obtain measurements for the DRAM power. The drawback of both techniques is that they are too coarse-grained due to the large size of the sampling interval.

Gupta et al. [32] demonstrate the efficacy of UFS for heterogeneous workloads on an experimental, heterogeneous multi-core platform consisting of both high-performing and low-performing cores using a set of real-world applications that are typically run on client devices. The workloads are classified as intermittent, sustained, or multi-threaded. For intermittent workloads, heavy processing is performed on client devices in short bursts, thus generating high IPC counts during activities such as web-browsing. Sustained workloads have sustained levels of higher IPC counts, and are marked by longer periods of activity, such as gaming and video editing. Multi-threaded workloads use multiple threads to increase the parallelism for activities such as media decoding and rendering. Each workload is first evaluated on the big cores, and then on the small cores, during which metrics for the application performance, IPC and LLC accesses are collected. The metrics are then used to model both the core and the uncore power.

2.2.5 Clock Modulation

Bhalachandra et al. [28] used a model to inspect the local system state for variables like the time spent in collectives to slow down non-critical threads without impacting the performance. Since a thread performing more work runs at a higher duty cycle than the ones doing less work, the model chooses the lowest clock rate for the fastest threads so that they reach the collective without causing any delay. The model also increases the clock rate for slow threads to prevent them from being last to the collective. For generic loads with imbalance, it was observed that clock modulation saved significant power and energy with minimal impact on performance. Schöne et al. [24] performed a detailed study of software controlled clock modulation for different pro- cessor generations, and concluded that it is a good optimization technique for synchronization calls like MPI_Barrier. They developed optimization models for the performance and energy consumption for the Haswell-EP architecture, and measured the influence of clock modulation on the power consumption and application performance. They also compared clock modulation with DVFS, and observed that for the Sandy Bridge and Ivy Bridge architectures, the processor uses DVFS instead of clock modulation when the clock modulation setting is not high enough, or is the same for all the cores. They also report that clock modulation was used in addition to DVFS only when the clock modulation setting is lower than the lowest supported frequency. The authors suggest using clock modulation when synchronization calls have a low runtime, and DVFS for longer running synchronization steps. Cicotti et al. [91] developed EfficientSpeed, a library that utilizes both clock modulation and DVFS by providing customizable energy optimization policies to improve the energy-efficiency while preserving per- formance. The library allows programmers to indicate loops within their application that are good candidates for energy optimizations. The library first runs the instrumented application using the default settings to ob- tain baselines for the specified loops. It then adjusts the CPU speed to minimize a user-specified objective, such as performance, power, energy or EDP. For each compute phase, the speed of the core is selected based on the selected policy, and performance metrics are collected to adjust the speed for the phase when it is run again.


2.2.6 Classification of Software Techniques

We now categorize existing power and energy optimization software techniques from Section 2.2 that leverage the hardware methods from Section 2.1 on the basis of the tuning methodology employed. We distinguish existing works in terms of their search strategy for search space exploration, the time when tuning is actually performed, the granularity of hardware control that was employed, and finally, the granularity of application control that was exploited by the tuning parameters. The classification is presented in Table 2.1, and defined below:

1. Search space exploration strategy: For large search spaces, autotuning techniques may use search strategies to execute a subset of the total iterations, or use historical data to train runtime models. Search algorithms may be either model-free or model-based [7]. Model-free algorithms consist of global and local search strategies, and do not use models to navigate the search space to determine optimal settings. Global search strategies like genetic search and evolutionary algorithms find the global best configuration at the expense of a long search time. Local search algorithms like hill-climbing do not perform a detailed exploration, and instead move from the current configuration to a nearby improving configuration within its neighborhood, which is defined by the user or the algorithm. They terminate when they reach a time bound or when no better configuration is found for the current configuration. However, they only find a local minimum. Model-based algorithms are used to avoid the cost of walking through the entire search space. Instead, they use analytical or predictive models for predicting performance or power metrics. Analytical models usually use statistical models to draw the relationship between different variables. Predictive models may use various machine learning models, such as supervised learning algorithms that understand and learn patterns in the data, and make predictions during the application run.

2. Time of tuning: This category refers to when the tuning is actually performed: offline or online. In offline tuning, statistical or performance models are usually computed when the relationship between the tuning parameters and the tuning objective is known [92]. It can also be performed by generating profiles or traces for the application execution, which are used post-mortem to perform performance or power analysis and tuning. Tuning adjustments are then made between successive application executions [35]. Offline tuning strategies may also use previously tuned code regions to perform a lookup to obtain the optimal configurations. When a previously executed optimal configuration is unavailable, a prediction model is used to find and store the optimal setting offline for future use [93]. In online tuning, the actual tuning is deferred to application runtime, where machine learning is used to create predictive models using the metrics collected during the application run or from a model derived from an offline step.

3. Granularity of hardware control: Power control mechanisms can tune hardware tuning knobs at different granularities, for example, per-core, per-node, or the whole system. At coarser granularity, power capping and power management techniques for parallel applications can be applied at the node or socket level, and at finer granularity at the core level. For example, some power-aware job allocation strategies may perform system-level power capping, thus enabling coarse-grained control. RAPL provides socket-level power capping capabilities, while DVFS allows for finer-grained per-core frequency control for the Haswell architecture. We categorize previous works into system-level, node-level, socket-level, and core-level power control techniques based on the granularity of the hardware that is exploited.

4. Granularity of application control for tuning parameter switching: Optimal power/energy reduction configurations for parallel applications can be applied at coarse or fine application granularity, as described in Sections 2.2.3.2, 2.2.3.6, and 2.2.3.3. For example, while performing DVFS, some methods statically set a single optimum configuration for the entire application, or in some cases for specific workloads at a coarse-grained level. Application-static configurations are configured at the start of the application run, and are not changed for the entire application run. While this technique may result in energy savings, it does not consider the inherent dynamism in HPC applications, and assumes that most applications have predictable phase behavior. As our proposed method describes in Chapter 7, there are many applications that require an analysis of the similarities between individual iterations of the time loop. At finer granularities, tuning strategies may set different frequencies for different application phases characterized by the level of compute-, memory- or IO-intensity, or occasionally even for individual time loop iterations or instances of program regions. Such dynamic tuning techniques select optimum configurations for different code regions of the application.


Table 2.1: Categorization of previous works on power-aware and energy optimization techniques into (a) model-free and model-based search space exploration strategy, (b) offline and online tuning, (c) system-, node-, socket- and core-level hardware control, (d) application/workload-static and dynamic tuning of applications.

Search space exploration strategy
  Model-free: [41], [43], [37], [60], [61], [51], [63], [54], [64], [65], [66], [68], [71], [74], [75], [83], [16], [25], [86], [87], [88], [89], [90], [35], [91]
  Model-based: [42], [46], [45], [47], [48], [49], [50], [55], [67], [69], [71], [73], [75], [76], [77], [78], [80], [82], [84], [90], [28], [24], [38]

Time of tuning
  Offline tuning: [39], [42], [49], [50], [60], [61], [63], [54], [64], [65], [66], [68], [71], [73], [74], [75], [77], [78], [82], [83], [87], [90], [24], [35], [38], [91]
  Online tuning: [40], [43], [47], [48], [60], [51], [55], [64], [67], [69], [76], [80], [16], [25], [84], [86], [88], [89], [90], [28]

Granularity of hardware control
  System-level: [46], [45], [47], [37], [67], [68], [69], [25], [39], [41], [42]
  Node-level: [43], [48], [71], [73], [80], [88]
  Socket-level: [47], [89]
  Core-level: [43], [49], [50], [60], [61], [51], [55], [63], [54], [65], [66], [67], [71], [74], [75], [76], [77], [78], [82], [83], [16], [86], [87], [90], [28], [24], [91], [38], [35], [64]

Granularity of application control for tuning parameter switching
  Application/workload static: [39], [41], [42], [46], [45], [37], [63], [67], [68], [69], [73], [77], [78], [80], [25], [84], [90], [35], [38]
  Dynamic: [43], [47], [48], [49], [50], [60], [61], [51], [55], [54], [64], [65], [66], [71], [74], [75], [76], [82], [83], [16], [86], [87], [88], [89], [90], [28], [24], [91]



2.3 Summary

In this chapter, we presented the state-of-the-art in power management and power-aware optimization methods. Currently, approaches such as hardware overprovisioning, power-aware job allocation, power gating, clock modulation, clock gating, power capping, DVFS, and UFS are used to reduce the energy and power consumption on HPC systems. We also presented software methods that leverage architectural innovations in power and energy management. While our work aims to tune the energy-efficiency, other objectives such as execution time, EDP, ED2P and TCO can also be tuned.

Power saving may be considered in terms of active, standby, and sleep modes. The ACPI standard defines five main groups of states that are supported by current processors: global states (G-states), system sleep states (S-states), processor power states (C-states), processor performance states (P-states), and throttling states (T-states). From previous works, we can conclude that many techniques use DVFS for optimizing both energy and power consumption. This may be due to the fact that changing the P-states has a lower latency than other states like deep C-states and T-states.

We also note that the definition of a phase varies across previous works dealing with phase identification and application characterization. While some define a phase as a period of program execution with a stable behaviour, others, including our approach, interpret it as a single iteration of the time loop.

Our proposed approach includes methods from a wide range of research areas, such as DVFS, UFS and DCT, application dynamism exploitation using domain-knowledge specification, phase characterization using clustering, multi-step autotuning using a probabilistic search strategy, and runtime prediction of phase behavior. The approaches that were presented in this chapter implement one or more of the above methods. However, we are not aware of existing research that covers all these areas collectively, and at the same time has the potential to scale to future Exaflop machines.

3 Background

This chapter presents the fundamental concepts of our work by introducing a formal description of the concepts and definitions that will be referred to in the following chapters (this section contains definitions from [53]). In Section 3.1, we present the concepts of our methodology, and introduce inter-phase dynamism, which is central to our work. We also describe how the base concepts are applied to define the tuning potential metric, which quantifies the potential improvement that can be achieved through tuning, in Section 3.1.3. Finally, we define the tuning model in Section 3.1.3.1, which is the outcome of the Design-Time Analysis (DTA) stage (see Section 6.2), and is used for runtime tuning.

In Section 3.2, we describe two existing software tools that our work leverages, namely the Periscope Tuning Framework (PTF) and the Score-P instrumentation and measurement infrastructure. Additionally, we provide a short introduction to the PAPI framework that we use to collect hardware performance metrics. In addition to a high-level description of the tool infrastructures, we distinguish and describe the tuning parameters of the HPC stack, namely the hardware, system software and application tuning parameters that are used in our autotuning methodology, in Section 3.3. Understanding these concepts is necessary to understand the contributions of this thesis.

3.1 Concepts and Formalism

This section introduces the core concepts that will be referred to throughout the remainder of this thesis.

3.1.1 System Structure

3.1.1.1 Regions

A region is an arbitrary section of code in the application, and includes functions, parallel OpenMP regions, and MPI operations. The set of all instrumented regions in the program is denoted by Rinstr, and comprises all the program regions that are instrumented and thus visible to the tuning system at application runtime. However, Rinstr could potentially contain a large number of fine-granular regions, so we reduce the set of regions to the set of significant regions Rsig ⊆ Rinstr by selecting regions r ∈ Rinstr that cover a significant part of the execution time. This means that significant regions are suitable targets for dynamic switching of the tuning parameters. We shall henceforth use R instead of Rsig to denote the set of significant regions.

A phase region is a program region that defines the phases of an execution, and is typically the main time-stepping loop of a simulation. Thus, all significant regions should be nested within this region. We assume that there is only one phase region defined in the application.
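The selection of significant regions can be illustrated with a small sketch that keeps only the regions whose exclusive execution time exceeds a fraction of the total runtime. The 10% threshold, region names and timings below are illustrative assumptions; the actual selection is performed automatically by readex-dyn-detect (see Chapter 6).

/* Minimal sketch of reducing Rinstr to Rsig by an execution-time threshold. */
#include <stdio.h>

#define N_REGIONS 4

int main(void)
{
    const char *name[N_REGIONS] = {"solver", "exchange_halo", "pack_buffer", "io_write"};
    double excl_time[N_REGIONS] = {42.0, 7.5, 0.3, 1.2};   /* exclusive seconds (assumed) */
    double total = 0.0;
    for (int i = 0; i < N_REGIONS; i++) total += excl_time[i];

    const double threshold = 0.10;   /* keep regions covering >= 10% of the runtime */
    for (int i = 0; i < N_REGIONS; i++)
        if (excl_time[i] / total >= threshold)
            printf("significant region: %s (%.1f%%)\n",
                   name[i], 100.0 * excl_time[i] / total);
    return 0;
}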

3.1.1.2 Identifiers

An identifier is an element that contains information to identify the current execution or to predict the characteristics of a subsequent execution.

The set of identifiers IDr ⊆ ID for a region r ∈ R consists of the region name, the region call-path, and region parameters. The region call-path is the sequence of region instances from the root node to the called region. Additionally, the user can provide region parameters for application regions to characterize their behaviour, phase identifiers for the phase region to group phases with similar characteristics, and input identifiers for the application input to distinguish different performance and energy characteristics.

3.1.1.3 Runtime situations

A runtime situation, or rts, is an instance of a significant region r ∈ R that occurs during an execution; the set of all rts's is denoted by RTS. An rts can be identified using the region name and the call-path.

3.1.1.4 Phase

In Figure 1.1, we see that the runtimes of different iterations of the time loop of INDEED are not uniform throughout the application run, and thus create application dynamism. To leverage this dynamism, we introduce the concept of a phase to characterize the application on account of changing computational characteristics of the rts’s across individual iterations due to changing workload distribution. A phase ph ∈ PH is an instance of a phase region and thus an rts at the phase level. It is a single execution of the phase region, or an iteration of the time loop. The phases are executed collectively by all of the processes of an MPI application, and thus, all the processes go through the same sequence of phases during the application execution.

3.1.2 System Tuning

3.1.2.1 Tuning parameters

A tuning parameter tp ∈ TP is a tuning knob or a parameter of the HPC stack, e.g. CPU frequency, number of OpenMP threads, accelerator offloading switch, application-level parameters, etc. We focus on the tuning parameters that can be switched at runtime, and also have the potential to influence the energy consumption of an application running on an extreme-scale system. We describe the parameters that were used in our proposed methodology in Section 3.3.


3.1.2.2 System configurations

A system configuration cfg ∈ CFG describes the set of tuning parameters and their associated values. System configurations are switched during the execution of the rts at application runtime whenever the monitoring library enters and exits a significant region.

3.1.2.3 Objective function

The objective function is a tuning aspect that determines the best system configuration. It maps an rts and a system configuration onto a real number, and is defined as o : RTS × CFG → R. The objective of tuning is to minimize or maximize a given objective function, for example, execution time, energy consumption, or EDP, by varying the system configuration.
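As an illustration, a minimal sketch of what such an objective might look like in code is shown below. The struct fields and the choice of EDP as the objective are assumptions made for illustration only, not the data structures used by PTF or the RRL.

/* Sketch of an objective in the sense of o : RTS x CFG -> R (here: EDP). */
#include <stdio.h>

typedef struct {
    unsigned core_freq_mhz;   /* DVFS setting */
    unsigned uncore_freq_mhz; /* UFS setting  */
    unsigned omp_threads;     /* DCT setting  */
} config_t;                   /* a system configuration cfg (assumed layout) */

typedef struct {
    double time_s;            /* execution time of the rts under cfg   */
    double energy_j;          /* energy consumed by the rts under cfg  */
} measurement_t;

/* Energy-delay product: lower is better. */
static double objective_edp(const measurement_t *m)
{
    return m->energy_j * m->time_s;
}

int main(void)
{
    config_t cfg = {2500, 2000, 24};          /* example configuration */
    measurement_t m = {12.5, 3400.0};         /* example measurement   */
    printf("cfg(%u MHz, %u MHz, %u threads): EDP = %.1f Js\n",
           cfg.core_freq_mhz, cfg.uncore_freq_mhz, cfg.omp_threads,
           objective_edp(&m));
    return 0;
}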

3.1.3 Tuning Potential and Dynamism

Tuning dynamism exists in an application if two rts's in the application execution exhibit varying computational characteristics such that different optimal configurations can be applied to the respective rts's, thus making it beneficial to perform tuning. At runtime, the configurations for the respective rts's are switched to improve the objective function. The presence of application intra-phase dynamism indicates changing characteristics of significant regions within a single phase due to the execution of different algorithms within the phase. This essentially means that the rts's of different significant regions will have different best configurations. Inter-phase dynamism results from changes in the execution characteristics between different phases of the application due to the change in the control flow as the simulation progresses through the sequence of phases. The variations in the application characteristics are computed using dynamism metrics, as described in Section 6.1.4.

To successfully apply the above formalism, we first need to estimate the benefit of dynamically tuning an application by determining its tuning potential. Given that tuning-relevant dynamism exists, the tuning potential quantifies the improvement in the objective function compared to a static system-wide default configuration selected by the batch system at application runtime. The tuning potential of the whole application is the sum of the tuning potential of the individual rts's. The tuning potential cannot be determined theoretically; it can only be estimated through measurements by leveraging the varying characteristics of different regions and taking the switching overhead into account.
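One possible way to make this precise, assuming the objective o is to be minimized, cfg_default denotes the static system-wide default configuration, and Switching_Overhead aggregates the cost of all configuration switches, is the following formulation (an illustrative assumption consistent with the description above, not the exact definition used later in Chapter 6):

Tuning_Potential := Σ_{rts ∈ RTS} [ o(rts, cfg_default) − min_{cfg ∈ CFG} o(rts, cfg) ] − Switching_Overhead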

3.1.3.1 Tuning model

A tuning model is produced as a result of Design-Time Analysis (DTA). It captures the knowledge about the optimum system configurations for individual rts's in the presence of application intra-phase dynamism, as well as for phases in the presence of inter-phase dynamism. In order to reduce the number of configurations in the tuning model, multiple rts's are grouped into a scenario using a classifier if they have the same best-found configuration or if they have the same set of values for the identifiers. A selector of a scenario returns the single best configuration with respect to the chosen objective. Selectors can either simply return a single best configuration determined at design-time, or use a cost function as described in Section 6.2.3 to choose from a set of configurations when rts's with different best configurations resulting from different input sets are merged together.


3.2 Tools

Section 3.2.1 describes Score-P, a common instrumentation and measurement infrastructure for DTA and Runtime Application Tuning (RAT). Section 3.2.2 presents the Periscope Tuning Framework, which forms the basis of our work, and was extended using tuning plugins for intra-phase and inter-phase dynamism. Section 3.2.3 introduces PAPI, a low-level API to collect hardware performance counters.

3.2.1 Score-P

Score-P [10] is a joint project driven by leading application performance tools experts, and funded by the German Ministry for Research and Education (BMBF) and the U.S. Department of Energy (DoE). It provides an advanced instrumentation and runtime measurement infrastructure that allows the analysis of the behavior of high performance codes and the detection of performance problems and bottlenecks, and provides information on potential improvements. Through its flexible design, Score-P offers different strategies for application instrumentation, e.g., compiler-based and manual instrumentation. The process of performance analysis in Score-P begins with the instrumentation of the original application to prepare it for the collection of the performance properties. This is done by recompiling the application using the Score-P instrumentation prefix in the original compile and link commands, through which performance properties are collected at program runtime. Score-P also provides macros to annotate specific parts of the code that might be potentially interesting for tuning. We describe this process in detail in Section 6.1.3.

In the traditional approach, the application has to be instrumented for each analysis tool separately, since every tool has its own instrumentation system. To overcome this challenge, Score-P offers a common instrumentation infrastructure, including the Opari2 instrumenter for OpenMP programs, that serves as the basis for many HPC analysis and tuning tools like PTF, CUBE, Scalasca, Vampir and TAU. The runtime measurement infrastructure of Score-P supports various means of data processing and storage, including summarization and storage as profiles in the CUBE4 format in cubex files used by CUBE, TAU and Scalasca, trace data generation through the Open Trace Format 2 (OTF2) used by Vampir and Scalasca, and an Online Access (OA) interface for direct performance analysis and autotuning at runtime used by PTF. The framework supports a wide range of parallelization paradigms including MPI, OpenMP and CUDA, advanced measurement techniques such as phase, dynamic region and parameter-based profiling, sampling through interrupts as an alternative to direct instrumentation, and recording hardware performance counters via PAPI [13] and perf.

The current design of Score-P returns performance metrics for individual program regions using region-based profiling via the OA interface. We extended the profiling component and the Online Access interface of Score-P to send performance metrics for rts's using call-path profiling, as described in Section 6.2.
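As an illustration of manual annotation, the sketch below marks a time loop as the phase region and one candidate significant region using Score-P's generic user instrumentation macros; the application would be built with the scorep compiler wrapper (e.g., scorep --user mpicc ...), which enables these macros. The region names and the use of the SCOREP_USER_* macros shown here are assumptions for illustration; the exact annotations required by our methodology are described in Section 6.1.3.

/* Sketch of Score-P user instrumentation around a time loop (phase region)
 * and one candidate significant region. Requires compilation through the
 * scorep wrapper with user instrumentation enabled. */
#include <scorep/SCOREP_User.h>

void simulate(int n_steps)
{
    SCOREP_USER_REGION_DEFINE(phase_handle);
    SCOREP_USER_REGION_DEFINE(solver_handle);

    for (int step = 0; step < n_steps; step++) {
        SCOREP_USER_REGION_BEGIN(phase_handle, "phase",
                                 SCOREP_USER_REGION_TYPE_PHASE);

        SCOREP_USER_REGION_BEGIN(solver_handle, "solver",
                                 SCOREP_USER_REGION_TYPE_COMMON);
        /* ... computation of one time step ... */
        SCOREP_USER_REGION_END(solver_handle);

        SCOREP_USER_REGION_END(phase_handle);
    }
}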

3.2.2 Periscope Tuning Framework

The Periscope Tuning Framework (PTF) is an extension of Periscope, previously a performance analysis tool, with automatic tuning capabilities. It is a tuning framework that was developed in the Autotune project [8] at the Technische Universität München. For a tool to effectively and efficiently perform autotuning on modern HPC systems with hundreds of NUMA nodes running highly parallel applications, it should enable fast data aggregation and reduced data analysis time. This is satisfied in PTF using a frontend and a tree-like hierarchical structure of backend components called analysis agents. The framework includes several tuning plugins that can be used to tune individual aspects of HPC applications, such as the selection of optimal compiler flags, or to perform parallelism capping for an application. The plugins investigate different configurations of the tuning parameters using search strategies in order to select the best variant.


Figure 3.1: Components of PTF and their interaction with Score-P [94].

Figure 3.1 [94] depicts a high-level view of the architecture of PTF, consisting of the frontend, the tuning plugins, the experiment execution engine and a hierarchy of analysis agents. PTF performs an iterative online automatic analysis and tuning that requires little to no user intervention. The frontend executes a tuning plugin to tune an application for a given objective, and applies an online search by evaluating experiments with different configurations within a single program run. Each experiment is an individual execution of the phase region in which performance properties are requested from the analysis agents. The PTF analysis agents connect to the application via monitors that send measurement requests to and receive measurement results from Score-P via the OA interface. Performance properties include any measurable metric that can be obtained via profiling or a hardware performance measurement tool like PAPI [13], and include execution time, energy consumption, cache misses, branch mispredictions, load imbalances, and MPI-specific profiling data. The performance properties received via the OA interface are finally propagated through the agent hierarchy to the frontend.

Our methodology is based on a two-step approach consisting of DTA and RAT. PTF is used to perform DTA, during which it uses Score-P to request performance measurements. This is discussed in detail in Section 6.2. The current design of PTF only performs static tuning of application code regions, and has no support for dynamic tuning of the rts's. We extend this design using novel plugins to leverage the intra-phase and inter-phase dynamism, as described in Sections 6.2.1 and 6.2.2 respectively.

3.2.3 PAPI

The Performance Application Programming Interface (PAPI) [95] provides tool designers and application engineers a standard API to access hardware performance counters for CPUs, GPUs, chip memory and I/O, independent of the machine and operating system. Hardware counters allow users to examine the relationships between system software performance and hardware events, analyze the bottlenecks in hardware architectures, as well as optimize and scale HPC codes [13]. PAPI is written in C, and provides a standard interface for a number of toolkits that perform profiling, tracing and sampling for HPC applications, such as HPCToolkit, Scalasca, Score-P, TAU, and Vampir [96]. Third-party tools only require the application to be linked to the PAPI library, and handle a single hook to PAPI to access all the hardware performance counters. PAPI also supports fine-grained power management by providing power and energy measurement counters.

PAPI provides over 100 preset events, which include common predefined CPU events that can be accessed using the simple high-level API, as well as native events comprising all the available events for a specific platform. The fully programmable, low-level API provides fine-grained measurement and control of both preset and native events from either C or Fortran programs. The high-level API has the advantage of ease of use and less setup due to the automatic detection of the components, while the low-level API provides complete control over the entire set of events using event sets.
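The following minimal sketch uses the PAPI low-level API to read two of the events later listed in Table 5.1 around a region of interest. Error handling is reduced to asserts, and event availability depends on the platform; the loop body is a placeholder for the instrumented code.

/* Minimal PAPI low-level API sketch: create an event set, add two named
 * events, and read them around a region of interest. */
#include <assert.h>
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long counts[2];

    assert(PAPI_library_init(PAPI_VER_CURRENT) == PAPI_VER_CURRENT);
    assert(PAPI_create_eventset(&evset) == PAPI_OK);
    assert(PAPI_add_named_event(evset, "PAPI_L3_TCM") == PAPI_OK);
    assert(PAPI_add_named_event(evset, "PAPI_BR_CN") == PAPI_OK);

    assert(PAPI_start(evset) == PAPI_OK);
    volatile double x = 0.0;                 /* placeholder workload */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;
    assert(PAPI_stop(evset, counts) == PAPI_OK);

    printf("L3 misses: %lld, conditional branches: %lld\n", counts[0], counts[1]);
    return 0;
}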

3.3 Tuning Parameters

To exploit the dynamicity available inside the application and use it to minimize the energy consumption, we distinguish between three different levels of the HPC stack, i.e., hardware parameters, system software parameters, and application-level parameters, and exploit them using the following two approaches:

1. Code-agnostic approach: This includes hardware and system software tuning parameters that are independent of the application, and are described in Sections 3.3.1 and 3.3.2. Thus, no intrinsic knowledge of the application code, such as the operations performed or the algorithms used, is necessary to exploit these tuning parameters. Simple profiling techniques can be used to detect and define hints to exploit the dynamicity in the application; e.g., defining that different rts's can be identified via different call-paths is sufficient in this approach.

2. Code-aware approach: This includes automatic exploration using additional knowledge of the application provided by the developer as a domain knowledge specification. Application-level tuning parameters described in Section 3.3.3 allow the developer to provide hints in order to find areas in the application code where dynamicity can be exploited.

The tuning parameters that are relevant to our proposed methodology are summarized in Table 3.1, and are described in more detail in the next sections. It should be emphasized that all of the parameters in Table 3.1 have an impact on the energy-efficiency, and can be tuned at runtime, which is a vital property for dynamic autotuning.

3.3.1 Hardware Tuning Parameters

The most relevant hardware tuning parameters that we exploited are processor-related parameters due to the fact that the processor has the highest power dissipation in the system. We used the processor core frequency and the uncore frequency hardware tuning parameters to perform tuning.


3.3.1.1 Processor core frequency

We perform DVFS as a means to reduce the energy consumption by switching the frequency of the CPU cores. DVFS can be implemented through various means: soft control by changing the governor, which is available on Linux-based systems as a pluggable infrastructure and enables the control of the CPU frequency settings using defined policies such as performance, powersave, or ondemand, or full control over the frequency settings using the userspace governor, which allows arbitrary applications to select the P-state.

Our approach leverages the feature introduced with the Intel Haswell processor family that P-states can be selected independently for individual cores, as opposed to full sockets as was the case in previous processor lines. Moreover, the Haswell processor also introduces a switching window for changing the frequency, where a switching request is given to the processor in a certain time window, at the end of which the frequency is changed. Switching requests that are received after the time window has closed will be executed after the next window is closed, thus causing a delay of up to 500 µs [5].
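To make the userspace-governor mechanism concrete, the sketch below sets a fixed frequency for one core through the standard Linux cpufreq sysfs files (values in kHz, root privileges required). The paths, the frequency value and the use of raw sysfs writes are illustrative assumptions; in our setup, frequency switching is performed through dedicated libraries rather than direct sysfs writes.

/* Sketch of per-core DVFS via the Linux cpufreq userspace governor. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    int ok = fprintf(f, "%s", value) > 0;  /* write the governor or frequency */
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    /* Select the userspace governor for core 0, then request 2.0 GHz. */
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "userspace");
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "2000000");
    return 0;
}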

3.3.1.2 Processor uncore frequency

Haswell allows setting the uncore frequency at socket-level granularity using Uncore Frequency Scaling (UFS) to control the transfers to and from memory (the communication between the processor caches and the DRAM) independently of the CPU core frequencies. The uncore frequency can be switched from user space by setting machine-specific registers (MSRs) using the x86_adapt [65] library, which allows controlled unprivileged access to MSRs.
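For illustration, the sketch below pins the uncore frequency by writing the uncore ratio limit MSR through the Linux msr driver. The register address (0x620) and field layout (maximum ratio in bits 6:0, minimum ratio in bits 14:8, in multiples of 100 MHz) are taken from commonly reported documentation for Haswell-EP and should be treated as assumptions to be checked against the processor manuals; our methodology uses the x86_adapt library instead of raw /dev/cpu/<N>/msr access.

/* Illustrative sketch: fix the uncore frequency range to 2.0 GHz via MSR 0x620. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const uint64_t max_ratio = 20, min_ratio = 20;        /* 20 x 100 MHz = 2.0 GHz */
    uint64_t val = (min_ratio << 8) | max_ratio;

    int fd = open("/dev/cpu/0/msr", O_WRONLY);            /* needs the msr module and root */
    if (fd < 0) { perror("open"); return 1; }
    if (pwrite(fd, &val, sizeof(val), 0x620) != (ssize_t)sizeof(val))
        perror("pwrite");
    close(fd);
    return 0;
}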

3.3.2 System Software Tuning Parameters

The OpenMP standard offers users a means to implement thread-level parallelism through preprocessor statements or pragmas, such as parallel for, which are translated by the compiler into thread-parallel code and result in the iterations of a loop being distributed among the threads. We use the API provided by the OpenMP standard to perform DCT, which controls the behaviour of the OpenMP runtime library by influencing the number of threads used by a parallel region. Typically, each logical core runs a single OpenMP thread, or multiple software threads if it supports simultaneous multi-threading (SMT); the number of threads is usually set using the environment variable OMP_NUM_THREADS. By reducing the number of OpenMP threads, we can reduce the energy consumption, since each hardware thread keeps its own state and can request different C-states, known as thread C-states [30]. Hence, when every thread on a core requests a deeper C-state, the core C-state changes, moving unused processor cores from the active C0 state into one of the power-saving C-states C1–C6. Thread and core C-states are identical for cores without SMT support. However, when SMT is enabled, all the threads on a core must request a lower thread C-state to effectively reduce the core C-state.
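The DCT knob itself is a standard OpenMP runtime call, as the following sketch shows: the thread count for the next parallel region is lowered at runtime, leaving the unused cores free to enter deeper C-states. The thread counts and array size are illustrative.

/* Sketch of dynamic concurrency throttling with standard OpenMP calls. */
#include <omp.h>
#include <stdio.h>

void compute_step(double *a, int n, int threads)
{
    omp_set_num_threads(threads);          /* DCT knob for the next region */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = a[i] * 0.5 + 1.0;
}

int main(void)
{
    static double a[1 << 20];
    compute_step(a, 1 << 20, omp_get_max_threads()); /* full thread count */
    compute_step(a, 1 << 20, 4);                     /* throttled phase   */
    printf("%f\n", a[0]);
    return 0;
}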

3.3.3 Application Tuning Parameters

In addition to the hardware and system software tuning parameters, we exploit Application Tuning Parameters (ATPs), which provide a certain degree of tuning potential through specifications in parts of the code itself that can be used as tuning parameters, e.g., when different implementations of the same algorithm are available, and each has its own impact on performance and energy [97]. ATPs may include the selection between different decomposition algorithms, preconditioners or blocking factors, and are problem-specific.


Table 3.1: Overview of the hardware, system software and application tuning parameters.

Level | Tuning Aspect | Tuning Parameter | Granularity
Hardware Parameters | Frequency control | Processor core frequency (DVFS) | Core
Hardware Parameters | Frequency control | Processor uncore frequency (UFS) | Socket
System Software Parameters | OpenMP threading | Dynamic Concurrency Throttling (DCT) | Process/core
Application Tuning Parameters | Code-paths | Domain decomposition | Application
Application Tuning Parameters | Code-paths | Direct and iterative solvers | Application
Application Tuning Parameters | Code-paths | Preconditioner | Application

Thus, they can only be handled generically through the definition of variables and their values that are passed to the tuning environment. Since the hardware and software tuning parameters are more interesting for inter-phase dynamism tuning, which is the core of our work, we shall only provide a brief description of the ATPs in Section 6.2.1.2.

3.4 Summary

In this chapter, we presented the formalism used in our proposed approach. We also introduced the core concepts of application dynamism, including intra-phase dynamism, arising due to varying characteristics of the significant regions within a single phase, and inter-phase dynamism, arising from different execution characteristics between individual phases. We also described how these concepts can be applied to define the tuning potential, which is central to the overall methodology. The result of DTA is the tuning model, which contains groups of rts's with their identifiers, classified into scenarios, and a selector that returns the best configuration for each scenario. By presenting a formal mathematical description of the concepts, we aim to minimize the ambiguity in their definitions.

As a starting point to our methodology, we introduced the Periscope Tuning Framework, a distributed online autotuning framework, which is used to implement the core of our autotuning methodology, and Score-P, a performance measurement infrastructure used by PTF to collect performance metrics. The current design of PTF provides a flexible plugin-based tuning framework, which performs static tuning of program regions. We extended this framework to tune intra-phase and inter-phase application dynamism by leveraging the variance in the characteristics of the application. Moreover, the original design has no concept of rts's, which we introduce for the first time to perform dynamic autotuning. PTF requests performance measurements for the execution time, the objective value and hardware performance counters via the Online Access interface of Score-P. We extended the Score-P OA interface with features to transfer the measurements of call-path based profiling for the rts's to PTF.

We distinguish the tuning parameters between three different levels of the HPC stack, i.e., hardware parameters, system software parameters, and application-level parameters. The parameters that we have identified as relevant include the processor core and uncore frequencies as hardware tuning parameters, the number of OpenMP threads as a software tuning parameter, and the domain decomposition, preconditioner and types of solvers as application tuning parameters. The characteristic of all of these tuning parameters is that they can be influenced at runtime.

4 Tuning Methodology

The foundation for our work is the READEX [4] (Runtime Exploitation of Application Dynamism for Energy-efficient Exascale computing) project, which was funded by the European Union Horizon 2020 research and innovation programme under grant agreement number 671657. The READEX project was a collaboration between seven partners from academia, HPC resource providers, and industry, namely Technische Universität Dresden (TUD), Technische Universität München (TUM), Norwegian University of Science and Technology (NTNU), IT4Innovations National Supercomputing Center (IT4I), Irish Centre for High-End Computing (ICHEC), Intel Corporation, and Gesellschaft für numerische Simulation mbH (GNS), with the aim to develop a tools-aided methodology that improves the energy-efficiency of dynamic current and future Exascale applications while reducing the tuning effort.

We first present a general overview of the READEX tool suite in Section 4.1, and describe how it leverages two existing software tools, namely the Periscope Tuning Framework (PTF) and the Score-P infrastructure. We then describe the extensions to the existing PTF framework to leverage application dynamism and perform automatic dynamic tuning at design-time using two novel tuning plugins, namely the intraphase and the interphase plugins. We also describe in brief the workflow of the READEX Runtime Library (RRL), which is linked to the target application to perform dynamic tuning during application production runs, and how Score-P serves as a common infrastructure for PTF and RRL by providing instrumentation and measurement capabilities during design-time and runtime tuning. We follow the high-level description of the workflow with an introduction to the architecture of the READEX methodology, depicting the dependence of different components, including the data-flow between PTF, Score-P and RRL, in Section 4.2.

4.1 Overview of the Tuning Workflow

The READEX methodology is based on the system scenario methodology [98] from the embedded systems domain for energy-efficiency tuning. The system scenario methodology relies on application profiling and code inspection to identify different rts's that have different costs associated with them, e.g., for execution time or energy consumption. Rts's with similar multi-dimensional system costs, i.e., objective functions, are grouped together into scenarios. Optimized configurations are generated for each scenario based on a trade-off between the costs using a Pareto-optimal front.

At runtime, the upcoming scenarios are predicted, and configurations are switched for different scenarios if the cost of switching is less than what is saved by switching to the new configuration. Hence, the premise is that the cost of running the application by predicting upcoming scenarios and switching to their best configurations is always less than running the application without switching. Similarly, the READEX methodology is split into two phases: the Design-Time Analysis (DTA) that is performed during application development, and Runtime Application Tuning (RAT) that is performed during production runs. The end goal is to produce a tuning model as a result of DTA, and use it to perform dynamic switching of the tuning parameters at application runtime to gain energy savings. Thus, READEX brings the system scenario methodology from the embedded world to the HPC domain, and combines the best technologies from opposite ends of the computing spectrum into a tool suite that will contribute towards realizing Exascale performance.

Figure 4.1 illustrates the overall workflow of the proposed methodology. Before performing DTA, the application undergoes some pre-analysis and preparatory steps. In the first step, the HPC application is instrumented using Score-P, either by manually inserting annotations (probe functions around regions that are of interest to tuning) using Score-P macros, or by simply letting the compiler perform automatic instrumentation. The only essential manual annotation required is that of the phase region, since this is central to our tuning methodology. Any other annotations are considered domain knowledge that helps generate a better tuning model by providing hints to determine optimal configurations for the rts's. However, the drawback of automatic compiler instrumentation with Score-P is that it causes a high overhead due to the instrumentation of frequently executed fine-granular program regions. To reduce the impact of such fine-granular regions, a tool called scorep-autofilter is used to automatically generate a filter file containing the names of the regions that should be omitted from Score-P instrumentation. The tool filters out all the regions in the application whose execution time is less than a certain threshold.

The next step is to automatically identify and characterize the dynamism in the application behaviour. Our tuning methodology only considers applications with significant dynamism, since it is based on tuning hardware, system and application-level tuning parameters depending on the dynamism exhibited by different application regions. A tool called readex-dyn-detect is used to analyze the application to determine if there is tuning potential that can be exploited by the methodology. This process is two-fold. First, the tool detects coarse-granular application regions, or significant regions, that are worth tuning by selecting only those regions whose exclusive execution time is significant, i.e., above a certain threshold. This is critical because selecting significant regions that are not too fine-granular overcomes the cost of switching. Next, the tool examines the presence of intra-phase and inter-phase application dynamism, and quantifies the tuning potential w.r.t. variations in two metrics, the execution time and the compute intensity, for the instances of the significant regions. The threshold for the amount of variation can be user-specified, as described in Section 6.1.4. If no tuning potential is detected, the tuning process is simply aborted, since there is no real benefit of dynamic tuning. readex-dyn-detect exports the tuning potential summary, the list of significant regions, and the intra-phase and inter-phase dynamism information per significant region to a general configuration file in XML format. The configuration file is created from a template, and allows the user to specify the tuning objective, the tuning parameters, the energy metrics to measure, and the search strategy to use to generate the search space, among other information.

After the pre-analysis steps, PTF performs DTA. It reads the configuration file and stores the context information for the significant regions by running a single phase of the target application. It then initializes the interphase plugin if the application has inter-phase dynamism, or the intraphase plugin in the presence of only intra-phase dynamism.



Figure 4.1: A high-level workflow of the autotuning methodology. Pre-analysis steps are colored in orange, Design-Time Analysis (DTA) steps are colored in blue, and the Runtime Application Tuning (RAT) steps are colored in pink.

The tuning plugins run experiments that evaluate three tuning parameters, the CPU core frequency, the uncore frequency and the number of OpenMP threads, within a single program run, where each experiment is an execution of a single phase. Both plugins first read the ranges (minimum, maximum and step size) of the tuning parameters, and the objective to tune the application for. The plugins then create a search space to walk the multi-dimensional space of system configurations in one or more tuning steps using a search algorithm to determine which system configuration to evaluate in each experiment. The two plugins, however, perform different tuning steps, and differ in the way they determine the optimum configurations.

The interphase plugin executes the first tuning step with a user-specified number of phases using the default setting of the tuning parameters, and then uses a random search strategy with a uniform probability distribution to generate the search space in the second tuning step. The search strategy randomly picks a configuration in each experiment to see its effect on the application execution, and requests PAPI hardware performance metrics via Score-P. At the end of the tuning step, it clusters phases that have similar characteristics based on normalized cluster features, namely the compute intensity, the conditional branch instructions and the L2 cache misses, which are derived from the collected PAPI metrics. The clustering is performed by first using a density-based clustering algorithm called DBSCAN, and then a graph-based algorithm called spectral clustering.

In the third tuning step, the plugin uses the random strategy with a probability model based on a Gaussian distribution to select configurations for each phase in a cluster from a search space of configurations that were not executed in the previous tuning step. This step ensures that a sufficient number of system configurations are tested for a cluster of phases before selecting the optimum configurations. Thus, the interphase plugin leverages both inter-phase and intra-phase dynamism by determining different cluster-best configurations for clusters of phases, and individual rts-best configurations for rts's within the clusters. Finally, a verification step is performed to ensure that the theoretically computed savings match the actual savings obtained after the switching overhead.

The intraphase tuning plugin executes the first tuning step using the default settings for the tuning parameters, similar to the interphase plugin. It then tunes the ATPs using a user-specified search algorithm, e.g., the exhaustive or the individual search strategy, to select the best configuration. The ATPs are implemented via the ATP library, which provides functions to specify a set of potential values using constraints, which define valid multi-dimensional points, along with a default value. The Omega Calculator [99] is used to solve the system of constraints specified for a given ATP domain. The third tuning step fixes the optimal configuration for the ATPs obtained in the previous tuning step, and then explores the system-level tuning parameters using the selected search strategy. It finally performs a verification step, similar to the interphase plugin, and computes the savings.

The best configurations for individual rts's at the end of both plugins, as well as the cluster-best configurations at the end of the interphase plugin, are then stored in a tuning model.
For the interphase plugin, additional information about each cluster, such as the phases in that cluster and the ranges of the features used for clustering, is also stored in the tuning model. When the number of rts's that have different best configurations is large, frequently switching between the configurations would result in a corresponding switching overhead w.r.t. both time and energy. Hence, a classifier groups the rts's with similar or identical best-found configurations into scenarios using a similarity score that is determined by computing the closeness of system configurations. The tuning model generation was implemented by NTNU, and is described in Section 6.2.3. The tuning model guides the RAT stage to perform dynamic runtime switching of the configurations using the READEX Runtime Library (RRL), implemented by TUD (see Section 6.3).

During production runs, the RRL determines the upcoming scenario based on the current rts by monitoring the application execution, and applies the corresponding optimal configurations for individual rts's and phases using the knowledge in the tuning model. If inter-phase dynamism was detected during application pre-analysis, the best configurations for the unevaluated phases are unknown at runtime, since the interphase tuning plugin evaluates only a representative subset of phases during DTA. To avoid setting default system configurations for the unseen phases, the application is linked with the runtime cluster prediction library, which predicts the cluster number of an unseen phase during production runs. To perform runtime cluster prediction, we implemented three cluster predictors: a Markov chain predictor, and one-bit and two-bit cluster predictors that are based on the respective namesake branch predictors.

RAT also distinguishes between seen rts's, which were encountered during DTA and are stored in the tuning model, and unknown or unseen rts's. For unseen rts's, a calibration mechanism is used to find the optimal system configuration based on machine learning using Q-Learning [16]. When the RRL encounters an unknown rts, the algorithm first starts at a certain core and uncore frequency, and then selects a different configuration from among the direct neighbors with a certain probability. It then measures the energy consumption, and calculates the so-called Q-Value. The algorithm chooses the next configuration based on the Q-Value, and selects it if the cost is smaller than the previous value. We do not discuss the calibration mechanism in more detail, since this work was implemented by TUD, and is not directly related to our extensions or inter-phase tuning.
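To illustrate the idea borrowed from dynamic branch prediction, the sketch below shows a two-bit saturating predictor adapted to cluster ids: the predicted cluster is only replaced after two consecutive mispredictions. This is a simplified illustration under assumed data, not the implementation of the predictors in the cluster prediction library (see Section 5.3).

/* Sketch of a two-bit saturating cluster predictor. */
#include <stdio.h>

typedef struct {
    int cluster;     /* currently predicted cluster id          */
    int confidence;  /* 0 = weak, 1 = strong                    */
} twobit_predictor_t;

static int predict(const twobit_predictor_t *p) { return p->cluster; }

static void update(twobit_predictor_t *p, int actual)
{
    if (actual == p->cluster) {
        p->confidence = 1;               /* correct: reinforce           */
    } else if (p->confidence == 1) {
        p->confidence = 0;               /* first miss: weaken only      */
    } else {
        p->cluster = actual;             /* second miss: switch cluster  */
        p->confidence = 0;
    }
}

int main(void)
{
    twobit_predictor_t p = {0, 0};
    int observed[] = {0, 0, 1, 0, 1, 1, 1, 2};   /* assumed cluster sequence */
    for (unsigned i = 0; i < sizeof(observed) / sizeof(observed[0]); i++) {
        printf("phase %u: predicted cluster %d, actual %d\n",
               i, predict(&p), observed[i]);
        update(&p, observed[i]);
    }
    return 0;
}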

4.2 Architecture of the Tuning Framework


Figure 4.2: The architecture of the READEX methodology, depicting the interaction between PTF, Score-P and RRL. The arrows represent the data flow between the components. Pre-analysis steps are colored in orange, DTA steps are colored in blue, and the RAT steps are colored in pink.

Figure 4.2 illustrates the overall architecture, which depicts the major components that were developed as part of the READEX methodology, and presents the interaction between PTF, Score-P and RRL. The arrows in the figure represent the data flow between the components. The pre-analysis steps, consisting of coarse-granular region filtering using scorep-autofilter and tuning potential analysis using readex-dyn-detect, are highlighted in orange to maintain consistency with the high-level workflow in Figure 4.1. The DTA stage performed by PTF is colored in blue, and the RAT stage performed by RRL is colored in pink. We also present the cluster prediction library, which is linked to the application during production runs. To perform ATP tuning for intra-phase analysis, the application is linked with the ATP library, which uses an ATP specification file containing the list of ATPs and the constraints for each parameter.

The DTA Management module controls the overall execution of DTA, and interacts with the other PTF components to execute the workflow presented in Figure 4.1. It executes tuning plugins using a state machine to call the different stages of the plugin workflow, incrementally collects information about rts's, and invokes methods for storing the data in the RTS database. The tuning plugins determine the best system configuration for a specific tuning aspect, such as the execution time, energy consumption, or compiler flags. The plugins walk the search space using different search strategies, and use analysis strategies to collect performance measurements during each experiment, which are stored in the Performance Database. The Experiments Engine automatically executes individual experiments to evaluate the effect of system configurations on the application execution, and communicates with the RRL via the OA interface to start and stop the application execution. At the end of DTA, the DTA Management module triggers the Scenario Identification module to generate the tuning model by grouping rts's into scenarios and using a selector to pick the best configuration.

To simplify the architecture of RAT, we merged all the components that were not directly extended in our work and pertain solely to the workflow of the RRL at runtime into the Runtime Management module. The Tuning Model Manager (TMM) is used during DTA as well as RAT, and was implemented by TUD and NTNU. At design-time, it saves configurations for individual rts's, and simply returns the configuration that PTF requests. During RAT, the TMM reads the tuning model created during DTA to detect significant regions at application runtime. If inter-phase dynamism was detected during DTA, the cluster prediction library queries the tuning model data from the TMM, and predicts the cluster number. The Runtime Management module then uses the information about an upcoming scenario to send the configuration settings to the Parameter Controller, which switches the tuning parameters via Parameter Control Plugins (PCPs).

4.3 Summary

In this chapter, we introduced the READEX methodology, which forms the foundation of our work. We presented an overview of the entire tuning workflow, consisting of application pre-analysis steps, followed by DTA, tuning model generation, and finally, RAT. We also illustrated the interactions between the major components of our tuning methodology in an architecture diagram. In our work, we extended the OA interface of Score-P to enable access to the profile measurements for individual rts’s from PTF. We created an RTS database in PTF to store the information of the executed rts’s received from Score-P via the OA interface. We also extended the DTA Management module to trigger experiments for rts-based tuning. We developed two new plugins to perform inter-phase and intra-phase analysis respectively, and introduced a new analysis strategy to request metrics for inter-phase analysis. We extended the tuning model generation to store the phase and cluster related information for inter-phase tuning, and implemented runtime cluster prediction via the new cluster prediction library.

5 Concepts of Inter-Phase Tuning

Phases play an important role in the tuning methodology, as they dramatically reduce the tuning time by enabling the analysis of multiple experiments in a single tuning step without the need for restarting the application. Additionally, phases capture application dynamism, and thus define the temporal dimension for the inter-phase analysis. Most scientific benchmarks usually execute the same set of instructions by repeatedly calling the same program regions in each iteration. However, complex production-level real-world applications are highly dynamic, and exhibit varying behaviour between different iterations. Executing all the phases with a fixed configuration might result in suboptimal behaviour and incur an additional expenditure of energy. Tuning for inter-phase dynamism is essential for applications that show variations in the behaviour of individual phases. Therefore, we employ various clustering concepts to target phases that exhibit similar behaviour. This ensures that we can select multiple optimal configurations, i.e., for each cluster of phases as well as for the individual rts's that are called within the phases of the clusters.

One challenge in this respect is that performing an exhaustive search by walking the search space of the cross-product of all the tuning parameters is simply impractical due to time constraints. Instead, increasing the priority of certain configurations that could potentially result in a higher energy-efficiency enables us to test a higher number of better performing configurations while satisfying the time constraint. Thus, we optimize the search space using the concept of targeted tuning by selecting configurations using a search strategy based on a Gaussian probability distribution. To satisfy the constraint that we do not spend much time in selecting optimal configurations during autotuning, we ensure that only a subset of the entire set of phases is actually run during DTA. These phases are considered to be representative of the application behaviour, and thus enable the selection of near-optimal configurations. During production runs, the entire application is executed, and a runtime prediction mechanism dynamically predicts the cluster id of all the phases that were not seen during DTA. The RAT stage of the tuning workflow uses the cluster prediction library, which provides three kinds of predictors, of which two are based on the one-bit and two-bit dynamic branch prediction schemes, and the other is based on a second-order Markov chain.

In this chapter, we highlight the important concepts of our work with a detailed description of the extensions to the interphase tuning plugin. We describe the different features that can be used to characterize the similarities between phases, and also present the reasoning behind their selection in Section 5.1. These features are used to perform DBSCAN and spectral clustering. We describe how the interphase tuning plugin uses a targeted set of configurations based on a Gaussian distribution to select optimal configurations in Section 5.2. We present the extensions to the dynamic tuning component at runtime, a novel cluster prediction library that predicts the characteristics of unseen phases using different cluster predictors, in Section 5.3.
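As an illustration of such a Gaussian-biased selection, the sketch below draws the next core frequency to test from a discrete set of frequency steps, weighting each candidate with a Gaussian kernel centred on a promising value (e.g., the best setting found so far for a cluster). The frequency steps, centre and sigma are illustrative assumptions; the actual search strategy of the interphase plugin is described in Section 5.2.

/* Sketch of Gaussian-weighted random selection of the next configuration
 * value to evaluate (roulette-wheel sampling over kernel weights). */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double freqs[] = {1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4}; /* GHz steps (assumed) */
    const int n = sizeof(freqs) / sizeof(freqs[0]);
    const double centre = 1.8, sigma = 0.3;                     /* assumed bias */

    double w[16], sum = 0.0;
    for (int i = 0; i < n; i++) {
        double d = freqs[i] - centre;
        w[i] = exp(-(d * d) / (2.0 * sigma * sigma));  /* Gaussian kernel weight */
        sum += w[i];
    }

    double r = sum * ((double)rand() / RAND_MAX);      /* roulette-wheel draw */
    int pick = n - 1;
    for (int i = 0; i < n; i++) {
        if (r < w[i]) { pick = i; break; }
        r -= w[i];
    }
    printf("next core frequency to test: %.1f GHz\n", freqs[pick]);
    return 0;
}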

5.1 Exploiting the Characteristic Behavior of Phases

The interphase tuning plugin selects a representative subset of the progress loop iterations to evaluate during DTA. The plugin executes each phase with a different configuration to evaluate its effect on the behaviour of the phase as well as on the tuning objective, in order to ultimately select the optimal configuration. The selection of the system configurations could be done using a brute-force strategy or an exhaustive search, which walks the full search space and explores all the combinations of the tuning parameters. While this may result in the optimal system configuration, it is not practical, as HPC applications tend to run for a long time, resulting in large tuning costs. Instead, we first use a random search strategy with a uniform probability distribution to select a configuration from a search space of equally likely tuning parameter combinations to test in each experiment. This prevents the search space from exploding. Each phase is then executed with a randomly selected configuration, and measurements are collected for the objective values as well as hardware performance counters, which are used to characterize the phases and perform clustering. We use PAPI [13] events to collect the hardware counters. Selecting the right metrics is an important first step, since this influences the clustering and, ultimately, the selection of optimal configurations. The aim of this step is to use the hardware metrics to group phases with similar characteristics so that a single best configuration can be selected for a cluster.

Selection of metrics

Hardware performance counters help in detecting patterns in an application. For example, AVX instructions expose ALU saturation, or the amount of work done. Codes that have a high compute intensity have the possibility of reusing data in the cache, and hence have fewer Last Level Cache (LLC) misses [100]. Compute intensity can therefore be a good metric to determine the dynamicity in order to perform DVFS. The general formula for the compute intensity of an application is defined as:

Compute_Intensity := [Work_Done] / [Bytes_Transferred]    (5.1)

where [Work_Done] is the total amount of computation performed, and [Bytes_Transferred] measures the amount of data moved between the CPU and memory. A metric that is representative of the work done is the number of retired instructions. However, this is unreliable because it also counts the instructions generated by parallelization overhead, e.g., spin-waiting loops that do not perform actual work [101]. Hence, it is more effective to count the "useful" instructions that perform actual work by defining [Work_Done] as the total number of floating-point operations, denoted by FLOPs. Thus, Equation 5.1 may be redefined as the ratio of FLOPs to the LLC misses.

Compute_Intensity := [FLOPs] / [LLC_Misses]    (5.2)

Unfortunately, the Intel Haswell architecture does not provide events to measure FLOPs. However, it allows counting all AVX instructions, including data movement and calculations, via the AVX vectorization extension using the AVX_INSTS.ALL event (Event 0xC6, Umask 0x07).


Table 5.1: Hardware metrics selected as features for clustering, and their equivalent PAPI/perf events.

Metric                            PAPI/perf event

AVX instructions                  perf_raw::r04C6
L3 cache misses                   PAPI_L3_TCM
L2 cache misses                   L2_LINES_IN.ANY
Conditional branch instructions   PAPI_BR_CN

We further refine the measurements to count only the calculations involving floating-point operations using the perf event perf_raw::r04C6 (Umask 0x04) to represent the actual work done. Thus, we define compute intensity as the ratio of AVX calculation instructions to the L3 cache misses, as shown below:

Compute_Intensity = #AVX_INSTS.CALC / #L3_Cache_Misses    (5.3)

This means that an algorithm that experiences a large number of LLC misses due to a high memory bandwidth demand is memory-bound, and the memory may impede its performance. Compute intensity is, however, not the best metric to determine the right UFS setting, since the L1 and L2 caches are part of the core domain whereas the L3 cache is part of the uncore domain. When the data or instructions are not present in the L1 and L2 caches, including on-demand and prefetching misses, or when a cache line is invalidated, the CPU loads the cache lines into the L2 cache by reading the data from main memory. L2 cache misses have a direct effect on the uncore dynamic power consumption, since they cause activity on the ring interconnect and in the L3 cache [102] due to data movement. If all the data were in L2, no performance trade-off would be imposed, as opposed to an increase in CPU cycles while reading the data from main memory. We count the cache lines that are requested and written back to the L2 cache, i.e., the traffic entering the L2 cache, using the PAPI event L2_LINES_IN.ANY in order to perform UFS. The third metric chosen is the number of conditional branch instructions, which represents a change in the control flow within a phase, and indicates if a compute-intensive region is followed by a memory- or IO-intensive region. Conditional branch instructions expose the inherent data-dependency of the application phases, since they require that the condition code is set and available so that the branch can be executed. They typically incur some pipeline latency, which causes a performance penalty due to stalls in the branch execution. We count the conditional branch instructions using the PAPI event PAPI_BR_CN. Table 5.1 summarizes the features that were selected for characterizing the phases, and their equivalent PAPI/perf events. The listed events are then used to compute the features for clustering, as described in Section 5.1.1. It should be noted that events differ between microarchitectures and across different vendors. Since we do not intend to optimize the tuning framework across all processor architectures, we limit our cluster analysis to the Intel Haswell architecture.
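To make the collection of these features concrete, the following sketch reads the events of Table 5.1 with the PAPI low-level API around a single phase and derives the compute intensity of Equation 5.3. The exact native-event spellings, the error handling, and the wrapping main() are simplified assumptions that may differ between PAPI versions and systems.

#include <papi.h>
#include <cstdio>

int main() {
    // Initialize PAPI and create one event set with the features from Table 5.1.
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);
    PAPI_add_named_event(evset, "perf_raw::r04C6");   // AVX calculation instructions
    PAPI_add_named_event(evset, "PAPI_L3_TCM");       // L3 (LLC) cache misses
    PAPI_add_named_event(evset, "L2_LINES_IN.ANY");   // cache lines entering the L2
    PAPI_add_named_event(evset, "PAPI_BR_CN");        // conditional branch instructions

    PAPI_start(evset);
    // ... execute one phase of the application here ...
    long long counts[4] = {0};
    PAPI_stop(evset, counts);

    // Compute intensity as defined in Equation 5.3; guard against division by zero.
    double computeIntensity =
        counts[1] ? static_cast<double>(counts[0]) / counts[1] : 0.0;
    std::printf("compute intensity: %f, L2 lines in: %lld, branches: %lld\n",
                computeIntensity, counts[2], counts[3]);
    return 0;
}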

5.1.1 Clustering

Clustering is an unsupervised machine learning technique that groups objects together using their attributes (or features) based on their relationships with other objects, e.g., using pairwise distance or similarity [103]. Unlike classification, which is a supervised learning technique that labels objects with pre-identified categories, clustering does not require assumptions about the categories of objects. Clustering involves the following concepts:


1. Input space: The input space consists of the raw information about the objects, and comprises the associated attributes.
2. Feature space: The feature space consists of the features derived from the attributes of the objects. When there are many dimensions, only the relevant subset of the input attributes is included in the feature space.
3. Similarity space: The similarity space consists of the transformation that translates the features of two objects into a pairwise similarity relationship.
4. Output space: The output space is the representation of the cluster membership of the objects into one of k clusters.

The metrics in Table 5.1, i.e., compute intensity, conditional branch instructions and L2 cache misses, capture the characteristics of phases, and are used as features for clustering. The mapping of phases into clusters enables the plugin to select a different best configuration for each cluster. A challenge in clustering techniques is finding the right number of clusters for a dataset, since there can be several solutions depending on the granularity of the problem. Hence, we use two clustering algorithms, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [11] and spectral clustering [12], to cluster phases. First, DBSCAN is performed, since it does not require the number of clusters or make any assumptions about the shape of the clusters a priori, and is robust against noise or measurement outliers. However, it does not consider potential associations or similarities among the outlier points, which spectral clustering models very well. The plugin first normalizes the features using the min-max method. Feature normalization ensures that there is no bias while using features with different data ranges for clustering, and that all features have a similar weight. Without normalization, features with larger values are given more importance, thus degrading the clustering. The min-max method is a range normalization method that scales the numeric range of a feature to [0,1], as shown in the following formula:

∀x_i ∈ X:   x_i' = (x_i − min(X)) / (max(X) − min(X))

The plugin then uses the normalized features to perform DBSCAN (see Section 5.1.1.1) to group points that are close to each other into high-density regions, and then spectral clustering to determine the associations between unclustered points, as described in Section 5.1.1.2.
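As an illustration, a minimal C++ sketch of this min-max normalization over the per-phase feature vectors might look as follows; the container layout is an assumption and does not reflect the plugin's actual data structures.

#include <algorithm>
#include <vector>

// Normalize each feature column of 'features' (one row per phase) to [0,1]
// using the min-max method; columns with zero range are mapped to 0.
void minMaxNormalize(std::vector<std::vector<double>>& features) {
    if (features.empty()) return;
    const size_t dims = features[0].size();
    for (size_t d = 0; d < dims; ++d) {
        double lo = features[0][d], hi = features[0][d];
        for (const auto& row : features) {
            lo = std::min(lo, row[d]);
            hi = std::max(hi, row[d]);
        }
        const double range = hi - lo;
        for (auto& row : features)
            row[d] = (range > 0.0) ? (row[d] - lo) / range : 0.0;
    }
}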

5.1.1.1 DBSCAN

DBSCAN is a well-known density-based clustering algorithm that groups points that are close together, resulting in high-density regions, and marks points lying in low-density regions as noise. It can detect clusters of arbitrary shapes. The algorithm does not require the number of clusters to be specified beforehand, and needs only two parameters to cluster the data points:

1. minPts: the minimum number of points that must lie in the neighborhood of a point to define a cluster. The minimum number of points in a cluster is usually defined using minPts ≥ #features + 1 [104].
2. eps: the threshold, or the maximum distance between any two points in order for them to be considered as belonging to the same neighborhood. A number of distance metrics, for example the Euclidean, Manhattan and Minkowski distances, can be used to compute eps. This ensures that ∀p ∈ D, the distance between any pair of points in a neighborhood is less than or equal to eps.


DBSCAN clusters points based on three aspects:

1. The eps neighborhood: The eps neighborhood of a point p, denoted by N_eps(p), is defined as N_eps(p) = {q ∈ D | dist(p,q) ≤ eps} [11], where D is the set of all data points and dist(p,q) is the distance function between points p and q. A point p is defined as a core point if it has at least minPts points within its eps neighborhood.

2. Density reachability: Point p is density reachable from q if:
   • p ∈ N_eps(q), meaning p is in the eps neighborhood of q,
   • |N_eps(q)| ≥ minPts, meaning q is a core point (see the definition of a core point above),
   • there is a chain of points p_1, p_2, ..., p_n where p_1 = p and p_n = q such that p_{i+1} is directly density reachable from p_i.
   A point p is defined as a border point if it is directly density reachable from a core point q and its eps neighborhood has fewer than minPts points, i.e., |N_eps(p)| < minPts. Finally, a point p is a noise point if it is neither a core point nor a border point, which means that it does not belong to any cluster.

3. Density connectedness: A point p is density connected to a point q if there is a point o such that both p and q are density reachable from o.

Implementation

The algorithm of DBSCAN is presented in pseudocode in Algorithms 1 and 2. For our applications, the parameter minPts was chosen to be 4, which means that a cluster has a minimum of four data points. DBSCAN first starts with an arbitrary unvisited point p, and retrieves all points that are density reachable from p. To determine all the points reachable from p, the eps neighborhood of p is computed. The estimation of eps is not trivial for arbitrary datasets, since this parameter influences the dispersion of a cluster, and hence, must be performed with care.

Figure 5.1: Automatic detection of eps for DBSCAN. eps is the average 3-NN distance of the elbow point colored in red.

In our implementation, eps is automatically determined using the elbow method [105]. The plugin first computes the average k-NN (k-Nearest Neighbor) distances for all the data points.


Algorithm 1. DBSCAN(D, eps, minPts)
Input:  D: set of all points; eps: maximum distance; minPts: minimum number of points to form a cluster
Output: a clustering of all points in D

  Initialize the cluster id C = 1
  foreach unvisited point p ∈ D do
      Mark p as visited
      N = getNeighbors(p, eps), where dist(p, p') ≤ eps for all p' ∈ N
      if |N| ≥ minPts then
          ExpandCluster(p, N, C, eps, minPts)
          C ← C + 1
      else
          Label p as NOISE
      end
  end

Algorithm 2. ExpandCluster(p, N, C, eps, minPts)
Input:  p: a visited point in D; N: neighbors of p; C: cluster id to assign to point p; eps: maximum distance; minPts: minimum number of points to form a cluster

  Add p to cluster C
  foreach point q ∈ N do
      if q is unvisited then
          Mark q as visited
          N' = getNeighbors(q, eps), where dist(q, q') ≤ eps for all q' ∈ N'
          if |N'| ≥ minPts then
              N = N ∪ N'
          end
      end
      if q does not belong to any cluster then
          Add q to cluster C
      end
  end

As we use three-dimensional data points in our methodology, we use the Euclidean distance as the distance metric because it is easily defined for 3D points. We then compute the average 3-NN distances under the condition that the minPts parameter is 4, thus ensuring that point p has a minimum of three other points in its neighborhood. The distances are then arranged in ascending order, and the curve is plotted, as illustrated in Figure 5.1. Since noise points lie further away from the clusters, they have higher 3-NN distances. Thus, the aim is to detect the elbow, or a sharp change in the curve of the average 3-NN distances.

The elbow occurs at the point that has the maximum distance to the line formed by the points with the minimum and the maximum 3-NN distance, i.e., the first and the last points on the curve. eps is computed as the average 3-NN distance of the elbow point [106], which is colored in red in Figure 5.1. It represents a change in the density distribution, indicating that only those points whose average 3-NN distances are lower than eps will be clustered. If the current point p does not have at least minPts points in its eps neighborhood, it is marked as noise. If p is a core point, i.e., it has the specified number of minPts points in its eps neighborhood, a new cluster is started. Each point in its eps neighborhood is then marked as visited, and their eps neighborhoods are added to p's eps neighborhood, as outlined in Algorithm 2. Finally, all the visited points are added to the same cluster. If p is a border point, no points are density reachable from p, and hence, the next visited point results in a new cluster or becomes a noise point [11]. This process repeats until all the points are marked as visited. After the algorithm successfully terminates, the variable cluster_num stores the assigned cluster number for each data point.
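The automatic eps detection via the elbow method can be sketched as follows, assuming the average 3-NN distances of all points have already been computed; the function name and container are illustrative, not the plugin's actual interface.

#include <algorithm>
#include <cmath>
#include <vector>

// Returns eps as the average 3-NN distance of the elbow point: the point of
// the sorted distance curve with the maximum perpendicular distance to the
// straight line through the curve's first and last points.
double detectEps(std::vector<double> avgKnnDist) {
    std::sort(avgKnnDist.begin(), avgKnnDist.end());
    const size_t n = avgKnnDist.size();
    if (n < 2) return n ? avgKnnDist[0] : 0.0;

    // Line through (0, y0) and (n-1, y1) in (index, distance) space.
    const double y0 = avgKnnDist.front(), y1 = avgKnnDist.back();
    const double dx = static_cast<double>(n - 1), dy = y1 - y0;
    const double norm = std::sqrt(dx * dx + dy * dy);

    size_t elbow = 0;
    double best = -1.0;
    for (size_t i = 0; i < n; ++i) {
        // Perpendicular distance of (i, avgKnnDist[i]) to the line.
        const double d = std::fabs(dy * i - dx * (avgKnnDist[i] - y0)) / norm;
        if (d > best) { best = d; elbow = i; }
    }
    return avgKnnDist[elbow];   // eps := average 3-NN distance of the elbow point
}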

5.1.1.2 Spectral Clustering

In the first clustering step, DBSCAN ignores the similarities between the noise points. Although these points do not belong to the clusters obtained as a result of DBSCAN, they can be analyzed to discover associations between them and enhance the tuning result using spectral clustering, which does not make assumptions about the shape of the clusters. It works in the similarity space instead of the original feature space, and relies on the eigen-structure of a similarity matrix derived from pairwise distances to partition similar points into the same cluster [107]. There are several ways of modeling the neighborhood relationships between data points:

• ε-neighborhood graph: In the ε-neighborhood graph [107], all points whose pairwise distances are smaller than ε are connected together. The value of ε should be chosen such that the resulting graph is safely connected. However, this is not easy, since it may result in too large or too small clusters. Moreover, we already use a similar mechanism in DBSCAN to automatically determine the eps neighborhood of a point.
• k-Nearest Neighbor graph: In a k-Nearest Neighbor graph, a point i is connected to another point j if j belongs to i's k-Nearest Neighbors.
• Fully connected graph: In a fully connected graph, all points whose similarity, determined using a similarity measure, lies above some threshold are connected with each other. We use this method to model the similarities between the noise points.

Spectral clustering requires two inputs:

• the similarity matrix, and
• the number of clusters.

The key aspect here is to find a similarity measure that is derived from the cluster features to capture the closeness of a pair of points [103]. A similarity measure s ∈ [0,1] between two points i and j is symmetric, i.e., s(i,j) = s(j,i), and has a self-similarity value of 1, i.e., s(i,i) = 1, indicating that a point i is similar to itself. The goal is to divide the set of points into disjoint subsets with high intra-cluster similarity and low inter-cluster similarity [108]. There are several unnormalized and normalized spectral clustering algorithms available for different types of problems. Unnormalized spectral clustering algorithms use the unnormalized Laplacian matrix, while normalized spectral clustering algorithms use the normalized graph Laplacian matrix to perform clustering.


We use the well-known Ng, Jordan, Weiss normalized spectral clustering algorithm [12] to cluster the set of points N = {1,2,...,n} into k clusters. Similarities between individual points can be specified using either a similarity matrix that stores 1 in the (i, j)th entry if points i and j are similar, indicating that there is an edge connecting i and j, or an affinity matrix that specifies that two vertices are connected if the similarity s(i,j) between the corresponding data points is larger than a certain threshold. The similarity matrix is typically used when the relationships between data points are defined using an ε neighborhood or k-NN graph. An affinity matrix is typically used to construct a fully connected graph of a data set, and consists of the set of weights or affinities assigned for the edges between points i and j. It is defined as:

A_ij = w_ij,   ∀i, j ∈ N    (5.4)

The weights w_ij can be computed in many ways. However, for points that lie in the Euclidean space, the simplest and most commonly used method is the Gaussian similarity function [12]. The original Ng, Jordan, Weiss algorithm used the following similarity function:

w_ij = e^(−d_ij² / (2σ²))    (5.5)

where d_ij denotes the Euclidean distance between points i and j, and σ is the connectivity or scaling parameter, which controls the width of the neighborhood [107]. σ determines how the affinity A_ij falls off with the distance between i and j [12]. The parameter σ may be specified manually using several preset values or determined automatically. The Ng, Jordan, Weiss algorithm proposed an automatic way to determine σ by running the algorithm repeatedly for a manually specified range of values, and then selecting the one whose result was the closest to the true clustering of the points. However, this significantly increases the computation time, and poses a big challenge to our autotuning approach of determining best configurations online in as few restarts as possible. Moreover, a single value of σ may not work well for all the data [109]. Therefore, we use the Gaussian similarity function based on "local scale" [110], a self-adaptive parameter, where σ_i is the k-th nearest neighbor distance of i, and σ_j is the k-th nearest neighbor distance of j, as shown below:

w_ij = e^(−d_ij² / (σ_i σ_j))    (5.6)

This means that the k-NN distances model the local neighborhood relationships between the data points by setting a threshold on the neighborhood. We selected the value of k as 3, so that σ refers to the 3-NN distance, similar to the DBSCAN algorithm. Ideally, w_ij should be 1 or very close to 1 if points i and j are in the same cluster, and close to 0 if the points are in different clusters. Additionally, we set w_ii = 0 to ensure that there are no self-loops. Computing an affinity matrix that represents the distribution structure is important, since spectral clustering treats the points in the dataset as the vertices of a weighted undirected graph, and the similarity between two vertices as the weights of the edges. Thus, the algorithm essentially converts a clustering problem into a graph partitioning problem, where connected graph components are interpreted as clusters [111], such that the edges within a group have high weights, and the edges between different groups have low weights [107].

Implementation

The spectral clustering algorithm was implemented in MATLAB R2019a, and is presented in pseudocode in Algorithms 3 and 4. The algorithm first generates the affinity matrix A using the steps listed in Algorithm 3.


First, it normalizes the features of the noise points in N generated by the DBSCAN algorithm, and computes σ_i and σ_j using the 3-NN Euclidean distances between each pair of points. It then computes the similarity s(i,j) between each pair of points i and j by computing the weight w_ij as defined in Equation 5.6. Using the affinity matrix, the algorithm computes the diagonal matrix D containing the sum of each row in its diagonal elements, and constructs the normalized graph Laplacian matrix L = D^(−1/2) A D^(−1/2). It uses the k largest eigenvectors λ_1, λ_2, ..., λ_k corresponding to the k largest eigenvalues to construct the matrix X. Spectral clustering algorithms use K-means as the final step to extract the clusters from the original data points. When K-means is used directly to cluster phases whose natural clustering does not correspond to convex regions, it results in unsatisfactory clustering [12]. Instead, eigenvectors of a Laplacian matrix are used to transform the low-dimensional points into a mapping that can be used by K-means. Typically, for an unnormalized graph Laplacian matrix, the first k eigenvectors would indicate the number of clusters that will be formed, and would hold the information about the k clusters. This is not true for the normalized graph Laplacian matrix used in the Ng, Jordan, Weiss algorithm due to the presence of vertices/nodes that have a low degree, and result in small-valued eigenvectors [107]. Hence, the rows of X are normalized to have unit length to produce matrix Y, thus transforming the low-dimensional data representation into a spectral mapping, or spectral embedding, in a k-dimensional space.

Algorithm 3. ComputeAffinityMatrix(N)
Input:  N: the set of noise points 1, 2, ..., n
Output: A: the affinity matrix

  Normalize the features of the noise points in N generated by the DBSCAN algorithm
  Compute σ for each point using the 3-NN Euclidean distances
  Compute the similarity s(i,j) by computing the weight between each pair of points (i,j) using Equation 5.6, where the parameter σ is the 3-NN distance from the current point

Algorithm 4. SpectralClustering(N)
Input:  N: the set of noise points 1, 2, ..., n
Output: a clustering of all points in N

  Form the affinity matrix A ∈ R^(n×n), where A_ij is defined in Equation 5.4
  Compute the diagonal matrix D, where D_ii is the sum of the elements in A's i-th row
  Construct the normalized graph Laplacian matrix L = D^(−1/2) A D^(−1/2)
  Find the k largest eigenvectors λ_1, λ_2, ..., λ_k of L, and form the matrix X = [λ_1 λ_2 ... λ_k] ∈ R^(n×k) containing the eigenvectors as columns
  Form the matrix Y by normalizing X's rows to have unit length (Y_ij = X_ij / (Σ_j X_ij²)^(1/2))
  Treat each row of Y as a point in R^k, and cluster the rows into k clusters via K-means
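For illustration, the spectral embedding steps of Algorithms 3 and 4 could be written in C++ with the Eigen library as follows. The thesis implementation is in MATLAB, so this is only an illustrative translation under stated assumptions: 'points' holds the normalized features of the noise points, 'sigma' their 3-NN distances, every point is assumed to have at least one neighbor, and the final K-means step on the rows of Y is not shown.

#include <cmath>
#include <Eigen/Dense>

Eigen::MatrixXd spectralEmbedding(const Eigen::MatrixXd& points,
                                  const Eigen::VectorXd& sigma, int k) {
    const int n = points.rows();

    // Affinity matrix with locally scaled Gaussian similarity (Equation 5.6),
    // with zero diagonal so that there are no self-loops.
    Eigen::MatrixXd A = Eigen::MatrixXd::Zero(n, n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j) {
                const double d2 = (points.row(i) - points.row(j)).squaredNorm();
                A(i, j) = std::exp(-d2 / (sigma(i) * sigma(j)));
            }

    // Normalized graph Laplacian L = D^{-1/2} A D^{-1/2}.
    Eigen::VectorXd dInvSqrt = A.rowwise().sum().cwiseInverse().cwiseSqrt();
    Eigen::MatrixXd L = dInvSqrt.asDiagonal() * A * dInvSqrt.asDiagonal();

    // The k largest eigenvectors form X (eigenvalues come in ascending order).
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(L);
    Eigen::MatrixXd X = es.eigenvectors().rightCols(k);

    // Row-normalize X to unit length to obtain the spectral embedding Y.
    Eigen::MatrixXd Y = X;
    for (int i = 0; i < n; ++i) {
        const double norm = Y.row(i).norm();
        if (norm > 0.0) Y.row(i) /= norm;
    }
    return Y;   // cluster the rows of Y into k clusters with K-means
}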

K-means treats each row in Y as a point in the k-dimensional space R^k [109], and randomly chooses k initial centroids. It repeatedly assigns each point to the cluster with the closest mean, and recalculates the means of the clusters until convergence, when all the points are finally clustered into k clusters. This way, we overcome the drawback of the simple K-means algorithm that assumes the data distribution to be spherical.

We demonstrate the clustering algorithm using a toy example in Figure 5.2. The figure illustrates 150 points that lie in the X-Y plane in the interval [0,1]. The points can be grouped into three natural clusters. First, the affinity matrix is computed using the 3-NN distances of the points, and is illustrated in Figure 5.3. As we can see, the affinity matrix is block-diagonal, and contains three blocks, indicating that our example has three well-separated clusters; this already gives a general idea of how many clusters can be generated. We then compute the normalized graph Laplacian matrix.

Figure 5.2: Clusters detected for a toy example consisting of 150 points in the [0,1] X-Y plane using spectral clustering.

Figure 5.3: Affinity matrix for the toy example computed using the similarity function in Equation 5.6.

K-means requires the number of clusters k as input for clustering. It should be noted that k here is different from the value of k in the k-NN distance calculation while computing σ in Equation 5.6. The value of k can be provided a priori by the application expert if the application characteristics are already known. Unfortunately, this information is not always available. Thus, we compute k automatically by computing the eigenvalues and eigenvectors of the normalized graph Laplacian, and arranging the eigenvalues in descending order, as illustrated in Figure 5.4. We then use the MATLAB function findchangepts() to find points where the mean of the points on the curve changes significantly. These differences or changes in the eigenvalues are called eigengaps. The algorithm uses the concept of matrix perturbation theory, which states that the optimum number of clusters k is the one that maximizes the eigengap, and therefore selects the first point that produces the largest eigengap between the eigenvalues. In Figure 5.5, the points on the graph represent the top 20 eigenvalues for our toy example, where the X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues of the normalized graph Laplacian matrix. We see that there are small and insignificant changes between the first three eigenvalues, and a large eigengap between the third and fourth eigenvalues of the normalized graph Laplacian L, as shown by a red line. Thus, k is selected as 3, indicating that K-means clusters the points into three clusters. The column matrix X = [λ_1 λ_2 ... λ_k] is then formed from the three largest eigenvectors corresponding to the highest eigenvalues.
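For completeness, the eigengap-based choice of k can be sketched as a simple maximum-gap scan over the eigenvalues sorted in descending order. The actual implementation relies on MATLAB's findchangepts(), so this is only an approximation of that behaviour; the function name is illustrative.

#include <vector>

// Choose the number of clusters k as the position of the largest eigengap.
// 'eigenvalues' must be sorted in descending order.
int selectNumClusters(const std::vector<double>& eigenvalues) {
    int k = 1;
    double largestGap = 0.0;
    for (size_t i = 1; i < eigenvalues.size(); ++i) {
        const double gap = eigenvalues[i - 1] - eigenvalues[i];
        if (gap > largestGap) {
            largestGap = gap;
            k = static_cast<int>(i);   // gap after the i-th eigenvalue -> k = i
        }
    }
    return k;
}

For the toy example, the first three eigenvalues are nearly identical and the gap between the third and fourth is the largest, so the scan returns k = 3, matching the result described above.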

Figure 5.4: Eigenvalues computed for the graph Laplacian matrix for the toy example.

Figure 5.5: A zoomed-in view of the top 20 eigenvalues for the toy example.

Sometimes, DBSCAN may not return a large number of noise points for spectral clustering to be effective. As mentioned earlier, our primary condition for clustering is the presence of at least four points in a cluster in order to proceed to the next steps of the tuning workflow. This is crucial because tuning clusters that have only a few phases will result in a high switching overhead due to switching between every few phases. Therefore, if spectral clustering results in clusters with fewer than four points, the interphase plugin simply discards these clusters, and labels the data points as noise.

5.2 Targeted Search

Since we execute only a representative set of the application phases to perform clustering in the previous step, we would have effectively evaluated only a subset of the system configurations. At this point, we can already select near-optimal configurations for the clusters. However, we must do so with relatively low confidence due to the absence of other configurations for comparison. To overcome this problem, we implemented a targeted tuning step that restarts the application and tests each phase in a cluster with a configuration that has never been executed for any phase in that cluster. A configuration is selected for each phase using a random search strategy based on a normal distribution. This is premised on the idea that certain configurations may be attractors or repellers. Attractors are those configurations that result in a lower normalized energy consumption than a certain threshold, while repellers result in a higher normalized energy consumption for a phase in a particular cluster. We define the energy threshold as the average of the maximum and minimum normalized energy consumption values among all the phases in a particular cluster. It is computed as:

Ẽ = (E_max + E_min) / 2    (5.7)

We redefine attractor configurations as those that result in a normalized energy consumption lower than Ẽ, and repellers as those resulting in a normalized energy consumption higher than Ẽ. A configuration may be an attractor for one phase, while being a repeller for another phase.

This is due to the fact that a cluster consists of different phases with similar characteristics, such as the computational intensity, but different phase execution times. For every cluster, the plugin assigns a weight to each configuration of tuning parameters from the cross-product of all the tuning parameters based on its closeness to an attractor or a repeller. A configuration is a set consisting of one combination of the values of the tuning parameters, as described in Section 3.1.2.2, and is represented as cfg_i = {tp_1, tp_2, ..., tp_n}. For the sake of brevity and understanding, we shall represent a configuration simply as x.
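The classification of the configurations already measured for one cluster into attractors and repellers via the energy threshold of Equation 5.7 could be sketched as follows; the struct and function names are illustrative assumptions, not the plugin's actual interfaces, and a non-empty input is assumed.

#include <algorithm>
#include <vector>

struct Configuration { double cpuFreq, uncoreFreq, threads; };  // one point in the search space
struct Measurement  { Configuration cfg; double normalizedEnergy; };

// Split the measured configurations of one cluster into attractors (below the
// energy threshold) and repellers (above it).
void classifyConfigurations(const std::vector<Measurement>& measured,
                            std::vector<Measurement>& attractors,
                            std::vector<Measurement>& repellers) {
    double eMin = measured.front().normalizedEnergy, eMax = eMin;
    for (const auto& m : measured) {
        eMin = std::min(eMin, m.normalizedEnergy);
        eMax = std::max(eMax, m.normalizedEnergy);
    }
    const double eTilde = (eMax + eMin) / 2.0;          // energy threshold, Equation 5.7
    for (const auto& m : measured)
        (m.normalizedEnergy < eTilde ? attractors : repellers).push_back(m);
}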

For each cluster c_i ∈ C, where i = 1, 2, ..., n, we define the weight of each configuration x as a function of the influence of the attractors and repellers using:

W(x) = Σ_{i∈A} f_attractor_i + Σ_{i∈R} f_repeller_i    (5.8)

where A is the set of all attractors and R is the set of all repellers evaluated for the phases of cluster c_i. The functions f_attractor and f_repeller characterize the influence of an attracting or repelling configuration on the selection of the next combination of tuning parameters. They have the following constraints:

• f_attractor and f_repeller should be functions of the normalized energy and of the distance to an attractor or repeller configuration.

• The curve of the function f_attractor for the attractors must gradually increase as we get closer to an attractor configuration, and gradually decrease as we move away from this configuration.

• The curve of the function f_repeller for the repellers must strongly decrease as we get closer to a repeller configuration, and gradually increase as we move away from this configuration.

Thus, the influence of the attractors as well as repellers towards a configuration x can be modeled as the sum of individual functions of the difference between their energy consumption and the energy threshold, and the distance between x and an attractor or a repeller x_i. Equation 5.8 can be expanded as:

W(x) = Σ_{i∈A} (Ẽ − E_i) · f_a(dist(x, x_i)) + Σ_{i∈R} (E_i − Ẽ) · f_r(dist(x, x_i))    (5.9)

The first part of the expanded f_attractor represents the difference between the energy threshold and the normalized energy for an attractor configuration. For f_repeller, the first part represents the difference between the normalized energy for a repeller configuration and the energy threshold. The position of E_i is changed for f_repeller to keep the difference positive. The second part of f_attractor and f_repeller is defined by the functions f_a and f_r, respectively, which are functions of the Euclidean distance between a configuration x and an attractor or repeller configuration x_i. Using f_a and f_r, we want to increase the possibility of picking a configuration around an attractor, and decrease it when the distance between the configuration and the attractor increases. Conversely, we want to decrease the chances of picking a configuration close to a repeller.

The behaviour of f_a and f_r is synonymous with the Gaussian bell curve. A Gaussian function, or a Gaussian, forms a symmetric bell shape, and is of the form (1/√(2πσ²)) · e^(−(x − µ)²/(2σ²)). The factor 1/√(2πσ²) controls the maximum height of the curve, i.e., its peak. µ is the mean, and indicates the position of the center of the peak, and σ is the standard deviation, which controls the width or spread of the bell curve. Smaller values of σ result in sharper or taller peaks, while larger values result in smoother and wider curves. We set the value of σ to 2. If we replace (x − µ)² with the squared Euclidean distance dist(x, x_i)², the resulting function becomes (1/√(2πσ²)) · e^(−dist(x, x_i)²/(2σ²)), which represents the Gaussian as a function of the distance between any configuration x and an attractor or repeller configuration x_i.
The Euclidean distance between configurations x and x_i is the same as the distance between two points in an n-dimensional space. In our case, n is the number of tuning parameters, which is three. Thus, the distance can be computed as:

dist(x, x_i) = √((tp_1,x − tp_1,x_i)² + (tp_2,x − tp_2,x_i)² + (tp_3,x − tp_3,x_i)²)    (5.10)

Thus, f_a is defined as:

f_a ≡ G(dist(x, x_i)) = (1/√(2πσ²)) · e^(−dist(x, x_i)²/(2σ²))    (5.11)

and f_r is defined as:

f_r ≡ −G(dist(x, x_i))    (5.12)

We can observe that the Gaussian in Equation 5.11 is essentially the same Gaussian similarity function that we used in spectral clustering in Equation 5.6. In Equation 5.11, we use the additional factor 1/√(2πσ²) to control the height of the peaks. Thus, a Gaussian is defined for each data point, i.e., for each attractor or repeller, and summed up over the set of all attractors A and repellers R to compute the function W(x).
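As a worked illustration of Equations 5.9 to 5.12, the following sketch computes the weight W(x) of a candidate configuration from the attractor and repeller sets; the Configuration and Measurement types and all names are illustrative assumptions rather than the plugin's actual data structures, and σ = 2 as stated above.

#include <cmath>
#include <vector>

struct Configuration { double cpuFreq, uncoreFreq, threads; };
struct Measurement  { Configuration cfg; double normalizedEnergy; };

// Gaussian G(d) with sigma = 2, as used in Equations 5.11 and 5.12.
double gaussian(double dist, double sigma = 2.0) {
    const double pi = std::acos(-1.0);
    return std::exp(-(dist * dist) / (2.0 * sigma * sigma))
           / std::sqrt(2.0 * pi * sigma * sigma);
}

// Euclidean distance in the space of the three tuning parameters (Equation 5.10).
double distance(const Configuration& a, const Configuration& b) {
    const double d1 = a.cpuFreq - b.cpuFreq;
    const double d2 = a.uncoreFreq - b.uncoreFreq;
    const double d3 = a.threads - b.threads;
    return std::sqrt(d1 * d1 + d2 * d2 + d3 * d3);
}

// Weight W(x) of a candidate configuration x (Equation 5.9): attractors add
// (E~ - E_i) * G(dist), repellers subtract (E_i - E~) * G(dist).
double weight(const Configuration& x, double eTilde,
              const std::vector<Measurement>& attractors,
              const std::vector<Measurement>& repellers) {
    double w = 0.0;
    for (const auto& a : attractors)
        w += (eTilde - a.normalizedEnergy) * gaussian(distance(x, a.cfg));
    for (const auto& r : repellers)
        w -= (r.normalizedEnergy - eTilde) * gaussian(distance(x, r.cfg));
    return w;
}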

Implementation

Algorithms 5 and 6 list the aforementioned steps to perform targeted tuning of the application. First, the plugin computes the cross-product of the entire search space of tuning parameters. Then, for every cluster identified using DBSCAN and spectral clustering, it computes the energy threshold Ẽ using Equation 5.7. It identifies the attractors as the configurations for which the normalized energy is less than Ẽ, and the repellers as the ones whose normalized energy is greater than Ẽ. The algorithm determines the weight W(x) for every configuration x by computing the sum of f_attractor and f_repeller over all the attractors and repellers using Equation 5.9. In order to select the next configuration for a phase in a cluster, we first compute the overall probability mass function (pmf) for every possible configuration in the search space using the following formula:

P(X = x) = W(x) / Σ_{x'∈CFG} W(x')    (5.13)

It is defined as the probability of selecting a certain configuration x for a phase of a cluster c_i from the set of all configurations CFG. The probabilities are all positive, i.e., P(x) ≥ 0, and sum up to 1. The resulting probabilities for all the configurations are then stored for every phase in the cluster. Finally, for each phase in the cluster, the plugin randomly selects a configuration to evaluate in an experiment under the following conditions (a sketch of this selection step follows the list):

1. No phase will be executed with a configuration that was previously executed for any phase in the cluster, either in this tuning step or in the previous step.
2. A configuration should be chosen randomly based on the probability mass function.
3. The chosen configuration should be either close to the attractors or far from the repellers.
4. If a previously executed configuration is randomly selected, it is discarded and another configuration is selected.
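A minimal sketch of such a constrained, pmf-based random selection is shown below. It uses std::discrete_distribution, which renormalizes the remaining non-negative weights automatically; the plugin itself hashes configurations and redraws with a Mersenne twister until an unseen configuration is found, as described later in this section. All names are illustrative, and at least one unseen configuration with positive weight is assumed to remain.

#include <random>
#include <unordered_set>
#include <vector>

// Select the index of the next configuration for a phase of cluster c_i,
// never re-picking an already executed configuration (Equation 5.14).
// 'weights' are the non-negative, translated weights W(x) of Equation 5.9.
int pickConfiguration(const std::vector<double>& weights,
                      const std::unordered_set<int>& alreadyExecuted) {
    // Zero out previously executed configurations; the remaining weights are
    // renormalized by the distribution so that they sum to 1 (Equation 5.13).
    std::vector<double> pmf(weights);
    for (int idx : alreadyExecuted) pmf[idx] = 0.0;

    static std::mt19937 gen{std::random_device{}()};   // Mersenne twister, truly seeded
    std::discrete_distribution<int> dist(pmf.begin(), pmf.end());
    return dist(gen);   // index of the chosen configuration in the search space
}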


Algorithm 5. ComputeDiscreteProbability(x, A, R, C)
Input:  x: a configuration from the cross-product of the tuning parameters; A: the set of all attractors; R: the set of all repellers; C: the set of all clusters
Output: the probability of selecting configuration x for a phase in a cluster

  Compute the cross-product of the tuning parameters
  Compute the energy threshold Ẽ from the normalized energy consumption using Equation 5.7
  Identify the attractors and repellers
  foreach c_i ∈ C do
      foreach configuration x ∈ CFG do
          ComputeWeightForConfiguration(x, A, R, Ẽ)
      end
      Compute the overall probability mass function
      foreach configuration x ∈ CFG_c_i do
          Set the probability P(X = x) to 0, as shown in Equation 5.14
      end
      Compute and store the final probability P(X = x) for all the phases of cluster c_i
  end

Algorithm 6. ComputeWeightForConfiguration(x, A, R, Ẽ)
Input:  x: a configuration from the cross-product of the tuning parameters; A: the set of all attractors; R: the set of all repellers; Ẽ: the energy threshold
Output: the weight for configuration x

  foreach a ∈ A do
      Evaluate the function f_attractor_a using Equations 5.9 and 5.11
  end
  foreach r ∈ R do
      Evaluate the function f_repeller_r using Equations 5.9 and 5.12
  end
  Determine the weight W(x) for x by computing the sum of f_attractor and f_repeller over all the attractors and repellers

To satisfy the first condition, we set the probability of configuration x to 0 if it was previously evaluated for cluster c_i, as shown below:

P(X = x, x ∈ CFG_c_i) = 0    (5.14)

By doing so, we ensure that configurations that were already executed for the phases of a cluster will never be picked for evaluation again. Hence, previously executed configurations now become valleys. We then determine the final probabilities for all the configurations in the search space by automatically updating the probabilities of the neighboring configurations to preserve the total sum of the individual probabilities as 1. To satisfy the second and third conditions, we create a hash of each configuration and an integer value associated with it.

Then, we use the Mersenne twister random number engine in conjunction with a true random number generator, the C++ random_device class, to randomly select a hash value. The configuration corresponding to the value is then selected for evaluation. To satisfy the fourth condition, we first select a configuration in each experiment using the random number generator. The plugin then verifies that this configuration has never been selected for any other phase in the current cluster. If the configuration has already been selected, the random number generator repeatedly picks a configuration until it finds one that has never been picked. We demonstrate the selective tuning step for the toy example presented in Figure 5.2. After performing clustering, we obtain three clusters. Let us assume that the phases in cluster c_i were evaluated with nine unique configurations, out of which five are attractors and four are repellers. The attractors and repellers are listed below in the form {CPU_freq, uncore_freq}:

Attractors               Repellers
a_1 = {1.6, 1.6}         r_1 = {1.4, 1.7}
a_2 = {2.1, 1.2}         r_2 = {1.4, 2.4}
a_3 = {2.4, 2.3}         r_3 = {1.8, 3.0}
a_4 = {2.1, 2.6}         r_4 = {1.9, 2.9}
a_5 = {2.1, 2.8}

We can then compute the individual Gaussians for the attractors and repellers using Equations 5.11 and 5.12 respectively, and determine the weight for each configuration in the search space using Equation 5.9. Figure 5.6 presents a 3D surface-contour plot illustrating the overall weights W(x) in Equation 5.8, obtained by computing f_a over the set of attractors and f_r over the set of repellers. The X-axis represents the uncore frequency and the Y-axis represents the CPU frequency in GHz. We can see three prominent peaks for the attractors a_2, a_3 and a_4, since they are the strongest attractors in the sense that they result in the lowest normalized energy consumption. The peak for a_4 is very close to a_3, so we see one big peak in the figure. Similarly, the repellers sit in the valleys, and we can see two prominent valleys, representing r_1 and r_2. Thus, we can say that r_1 and r_2 result in the highest normalized energy consumption.

Figure 5.6: The overall weights for the attractors and repellers determined using Equation 5.9.

If we look at the weights for the repellers, they are negative because the function f_r in Equation 5.12 for the repellers results in a curve on the negative Y-axis. Hence, we translate W(x) to the positive Y-axis by shifting the curve up by a constant. The final result for Equation 5.9 is illustrated in Figure 5.7. The attractor configurations lie on the dark blue hills, and the repeller configurations lie in the dark red valleys.


Figure 5.7: The final overall weights for the attractors and repellers obtained after translating the values of fr to the positive Y-axis. The attractor configurations lie in the blue regions while the repeller configurations lie in the red regions.

Figure 5.8: The discrete probability mass function obtained for all the configurations of the tuning parameters.

We then compute the discrete pmf for all the phases of the cluster c_i, as illustrated in Figure 5.8. As we can see, the pmf retains the shape of the Gaussian function, and the discrete distribution behaves similarly around an attractor or repeller. Thus, all attractors lie on the peaks and all repellers lie in the valleys of the distribution. The configurations in the neighborhood of the attractors also have a high probability, while the configurations surrounding the repellers have a low probability of being selected. We then set the probabilities for the attractors and repellers to 0 to satisfy the condition that no phase in the cluster will be evaluated with a previously executed configuration, and plot the discrete pmf in Figure 5.9. The figure displays the final probabilities computed for all configurations for a cluster c_i on the Z-axis.

For a better understanding, we also present a 3D surface-contour plot of the final probabilities in Figure 5.10. It should, however, be noted that P(X = x) is a discrete pmf, and hence, the connecting lines in the surface plot do not indicate continuity. The color of the surface varies according to the heights (or probabilities). Attractor configurations have higher probabilities, and are colored in blue. Repeller configurations have much lower probabilities, and are colored in red. If we compare Figures 5.7 and 5.10, we see that in Figure 5.10, the blue surfaces surrounding the attractors have expanded, and the red areas around the repellers have receded to maintain the overall sum of the individual probabilities as 1 in response to setting the probabilities of previously executed configurations to 0.

Figure 5.9: The discrete probability mass function obtained under the condition that no phase in the cluster will be evaluated with a previously executed configuration.

Figure 5.10: The discrete pmf illustrated as a surface-contour plot for better understanding. The lines between the points do not indicate continuity.

During this tuning step, the plugin evaluates each phase using the selected configuration, and requests the objective value and the hardware performance metrics in each experiment to compute the cluster-best configurations, as well as the rts-specific best configurations for the individual rts's of each cluster.


5.3 Dynamic Prediction of Phase Behavior

During DTA, we cluster only a subset of all the application phases, and consider them to be representative of the application behaviour. During production runs, the phases that were already clustered during DTA are executed with the cluster-best configurations. To apply the appropriate configurations for the remaining phases, we created a library to predict the cluster ids of all the phases that were not executed during DTA. The cluster prediction library is written in C++, and is implemented as a Singleton class. A Fortran or C/C++ application is first linked with the cluster prediction library, and the phase region is annotated with a Score-P phase identifier, as described in Section 6.3. The annotation invokes the library, which calls the predict_cluster() method to perform runtime cluster prediction for the current phase. If the current phase was not evaluated during DTA, i.e., it is an unseen phase, the library predicts the cluster id of the phase using one of the following three prediction mechanisms:

• a Markov chain based predictor,
• a one-bit cluster predictor based on the 1-bit dynamic branch predictor, and
• a two-bit cluster predictor based on the 2-bit dynamic branch predictor.

The one-bit cluster predictor works like the one-bit dynamic branch prediction scheme, which predicts that the current branch will be taken if it was taken previously. Similarly, the cluster prediction function assigns the current phase to the cluster of the previous phase until the cluster is mispredicted. The two-bit predictor works like the two-bit dynamic branch prediction scheme, which predicts that the current branch will be taken until it is mispredicted twice in a row. Using a similar concept, the cluster prediction function assigns the current phase to the cluster of the previous phase until the cluster is mispredicted twice in a row. The predicted cluster number is returned by the library to the Tuning Model Manager (TMM), which returns the corresponding cluster-best configuration from the tuning model.
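The two branch-prediction-inspired predictors can be sketched as small state machines, as shown below; class and member names are illustrative and do not reflect the library's actual API.

// One-bit predictor: always predict the cluster of the previous phase.
class OneBitPredictor {
    int lastCluster = 0;            // cluster id of the previous phase
public:
    int predict() const { return lastCluster; }
    void update(int actualCluster) { lastCluster = actualCluster; }
};

// Two-bit predictor: switch the predicted cluster only after two consecutive
// mispredictions, mirroring the 2-bit dynamic branch prediction scheme.
class TwoBitPredictor {
    int predictedCluster = 0;
    int mispredictions = 0;         // consecutive mispredictions of the current id
public:
    int predict() const { return predictedCluster; }
    void update(int actualCluster) {
        if (actualCluster == predictedCluster) {
            mispredictions = 0;                   // prediction confirmed
        } else if (++mispredictions == 2) {
            predictedCluster = actualCluster;     // switch after two misses in a row
            mispredictions = 0;
        }
    }
};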

Implementation

Algorithm 7 presents the proposed steps performed when the function predict_cluster() is called. When the first phase is executed during production runs, predict_cluster() is called for the first time, the cluster prediction library is initialized, and an instance of the cluster prediction class is created. The library requests the cluster information from the TMM, which reads the tuning model generated at the end of DTA. Then, a number of variables are initialized, namely max_dta_phases, which is set to the number of phases executed during DTA, and current_phase, which is a static variable representing the iteration number of the current phase. The PAPI library is then initialized, an event set containing the list of PAPI hardware performance counters from Table 5.1 is created, and measurement collection for the performance counters is started for the phase. The predict_cluster() function then calls the predict() method to check if the current phase belongs to the subset of phases executed during DTA using the cluster information returned by the TMM. If the phase was evaluated during DTA, the function simply returns the corresponding cluster number to the RRL. The Runtime Management module requests the corresponding cluster-best configuration for the current phase, and the configuration is switched dynamically. For the rts's of the significant regions called inside the current phase, the configurations are dynamically switched to the rts-specific best configurations of the current cluster. Starting from the second phase, the call to the predict_cluster() method determines whether the cluster prediction library is initialized by checking if the is_initialized flag is set. Then, the performance counter values for the previous phase are read and aggregated over all the threads and across all the processes.


Algorithm 7. predict_cluster()
Output: the cluster id for the current phase

  if !is_initialized then
      Initialize the cluster prediction library and create an instance
      Request the cluster information from the TMM
      Set max_dta_phases to the number of phases executed during DTA
      Set current_phase to the current phase's iteration number
      Initialize the PAPI library
      Create the event set containing the PAPI hardware performance counters to collect for each phase
      Start the PAPI counters
  else
      if current_phase ≠ 1 then
          Read the PAPI counters for the previous phase
          Aggregate the PAPI counters over all threads and processes
          Compute the cluster features
      end
      // If the current phase was seen during DTA
      if current_phase ≤ max_dta_phases then
          Look up cluster_num for the current phase from the cluster information
          return cluster_num
      end
      // If the current phase is an unseen phase
      if current_phase > max_dta_phases + 1 then
          if the cluster features ∉ the ranges of the predicted cluster then
              if the cluster features ∈ the ranges of another known cluster then
                  Correct the mispredicted cluster_num of the previous phase
              else
                  Correct the cluster_num of the previous phase to noise
              end
          end
      end
      if current_phase > max_dta_phases then
          Call the predict() function to predict the cluster id of the current phase using a predictor based on either:
              • a second-order Markov chain,
              • the 1-bit branch prediction scheme, or
              • the 2-bit branch prediction scheme
          return cluster_num
      end
  end

These values are used to compute the cluster features for the previous phase. The values of the PAPI counters measured during RAT differ slightly from the values returned by Score-P to PTF during DTA. Hence, the prediction library computes the percentage change and adjusts for this difference. The predict() method uses these features to perform the prediction only if the phase was not seen during DTA. Otherwise, the library simply returns the cluster number of the current phase to the RRL. For all the unseen phases, an additional step is performed to determine if the cluster of the previous phase was mispredicted. The collected PAPI counters are compared against the ranges of the cluster features of the predicted cluster to determine if there was a misprediction. If the values of the features fall within the ranges of the predicted cluster, no action is taken. If the values fall within the ranges of another cluster, the cluster number of the previous phase is corrected. If the values of the cluster features are not in the ranges of any known cluster, the phase is assigned as a noise point. The following sections describe the three predictors that are used to predict the cluster ids of the unseen phases during production runs. Section 5.3.1 presents the Markov chain predictor, and Section 5.3.2 presents the one-bit and two-bit cluster predictors.
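The feature-range check used to detect mispredictions could look as follows; the data-structure names and the fixed three-feature layout are assumptions for illustration, not the library's actual interface.

#include <array>
#include <vector>

// Per-cluster lower/upper bounds of the three cluster features
// (compute intensity, conditional branches, L2 cache misses) observed during DTA.
struct FeatureRange { std::array<double, 3> lo, hi; };

// Returns the id of the cluster whose ranges contain 'features',
// or -1 (noise) if the features do not match any known cluster.
int matchCluster(const std::array<double, 3>& features,
                 const std::vector<FeatureRange>& clusterRanges) {
    for (size_t c = 0; c < clusterRanges.size(); ++c) {
        bool inside = true;
        for (size_t f = 0; f < 3; ++f)
            inside = inside && features[f] >= clusterRanges[c].lo[f]
                            && features[f] <= clusterRanges[c].hi[f];
        if (inside) return static_cast<int>(c);
    }
    return -1;   // not in the ranges of any known cluster: treat as noise
}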

5.3.1 Markov Chain Predictor

Markov chains are well-known and prominent tools for modeling complex real-world problems. They are used in multiple areas, such as predicting the prices of shares, describing weather patterns, predicting energy consumption, and modeling purchasing patterns [112]. Markov chains are named after the Russian mathematician Andrei Markov, who introduced the concept as a mathematical system defined by the Markov property, which states that, given the entire past or history of the occurred events [113], the occurrence of the next event depends only on the current state, and not on the sequence of preceding states [114]. A Markov chain is a stochastic mathematical system, and constitutes a collection of random variables that transition from one state to another according to certain probabilistic rules, while satisfying the Markov property. The term Markov chain is typically used for a discrete finite state space and discrete time. A discrete time Markov chain has a domain of a discrete set of states. Here, the system is in a certain state at each step, and the state changes randomly between different steps [115], which may be time steps or any other discrete measurements, such as integers or natural numbers. For continuous time, the term Markov process is used. However, many previous works use these terms interchangeably. Since both Markov chains and Markov processes capture the probability of moving to the next state based solely on the current state in the system, independently of all the other past states, they are characterized as "memoryless".

A discrete time Markov chain is a sequence of random variables X_1, X_2, X_3, ..., X_n, where each random variable can take one of the states from the set of all m states constituting the state space S = {x_1, x_2, x_3, ..., x_m}. It can be defined using the probabilistic formula:

P(X_n = x_n | X_1 = x_1, X_2 = x_2, ..., X_{n−1} = x_{n−1}) = P(X_n = x_n | X_{n−1} = x_{n−1}),   x_1, x_2, ..., x_n ∈ S    (5.15)

Hence, the probability of predicting the state X_n depends only on the probability of X_{n−1} that precedes it. This satisfies the rule of conditional independence, which means that only the knowledge of the current state is necessary to determine the probability of the next state. Since the future state depends solely on a single previous state, this type of Markov chain is defined as a first-order Markov chain.

The conditional probability of the system being in state x_n at time step t_n can be represented using a transition matrix consisting of probabilities derived from transitions, i.e., changes from one state to another.


First, a transition frequency matrix is generated by storing the frequency of transitions from a state in row i of the matrix to a state in column j of the matrix; this count is represented as N_ij. Thus, a random state selected from the matrix has a higher likelihood of being picked if its transition frequency is higher relative to the other states. The transition matrix is then generated using:

P_ij = N_ij / Σ_{s=1}^{m} N_is,   i, j = 1, 2, ..., m    (5.16)

where N_ij is the number of transitions from state i to state j, and Σ_{s=1}^{m} N_is is the total number of transitions from state i in a row of the matrix to all the states in the columns of the matrix. Equation 5.16 essentially converts the transition frequencies into probabilities, as represented below:

P = [ P_11   P_12   ···   P_1m
      P_21   P_22   ···   P_2m
       ⋮      ⋮     ⋱      ⋮
      P_m1   P_m2   ···   P_mm ]    (5.17)

where the rows are indexed by the current state 1, 2, ..., m and the columns by the next state 1, 2, ..., m.

If the Markov chain has m possible states, the transition matrix is an m×m square matrix, such that an entry P_ij is the probability of transitioning from state i to state j. Additionally, each element of the matrix is non-negative, i.e., 0 ≤ P_ij ≤ 1. The transition matrix is a stochastic matrix, where the elements in each row must add up to exactly 1, i.e., Σ_{j=1}^{m} P_ij = 1 for i = 1, 2, ..., m. Thus, each row in the transition matrix represents a probability distribution for the state i of that row. If we model the system of our previous example in Figure 5.2 using a Markov chain, we obtain 3 possible states, representing the three clusters detected during DTA. Hence, the transition matrix would be a 3×3 matrix. Sometimes, the next state depends not only on the current state, but also on a previous state, for example when there are patterns while predicting a series. As opposed to first-order Markov chains, where a future state depends only on the current state, higher-order Markov chains define that the next state or event depends on a sequence of preceding states. This might be more advantageous for performing predictions as compared to the more restrictive first-order Markov chains. Thus, we employ more memory by using a second-order Markov model to predict the cluster ids of the current phase during production runs. Second-order Markov chain models can be represented using a lag-1 autoregressive (AR(1)) model in cases where the autocorrelation of the original process is high [116], or via conditional transition probabilities of different states. In our implementation, we represent the second-order Markov chain using transition probabilities. In a second-order Markov chain, the probability of the next state or event depends on the two immediately preceding states. Thus, the probabilities in the transition matrix for a second-order Markov chain require three subscripts, and are defined as:

P_{x_{n−2} x_{n−1} x_n} = P(X_n = x_n | X_{n−1} = x_{n−1}, X_{n−2} = x_{n−2}),   x_1, x_2, ..., x_n ∈ S    (5.18)

To derive a second-order Markov chain, the transition frequency matrix is first generated by storing the frequency of transitions from the combination of states i and j at time steps n − 2 and n − 1, respectively, in a row of the matrix to state k in the column of the matrix; this count is represented as N_ijk.


Thus, if we randomly sample from the matrix at time step n, there is a higher likelihood that state k is picked if the value of its transition frequency is higher relative to the other states. The transition matrix is then generated from the transition frequencies as:

P_ijk = N_ijk / Σ_{s=1}^{m} N_ijs,   i, j, k = 1, 2, ..., m    (5.19)

where N_ijk is the number of transitions to state k given that the system state at time n − 2 was i and at n − 1 was j. Σ_{s=1}^{m} N_ijs is the total number of transitions from row ij of the matrix to all the states in the columns of the matrix. The second-order transition matrix is represented below:

P = [ P_111   P_112   ···   P_11m
      P_121   P_122   ···   P_12m
       ⋮       ⋮      ⋱      ⋮
      P_1m1   P_1m2   ···   P_1mm
      P_211   P_212   ···   P_21m
      P_221   P_222   ···   P_22m
       ⋮       ⋮      ⋱      ⋮
      P_2m1   P_2m2   ···   P_2mm
       ⋮       ⋮      ⋱      ⋮
      P_mm1   P_mm2   ···   P_mmm ]    (5.20)

where a row corresponds to a pair of preceding states (i, j) = (1,1), (1,2), ..., (m,m), and a column to the next state k = 1, 2, ..., m.

A state i is called absorbing if it is impossible to leave this state, which is the case when P_iii = 1 and P_iij = 0 for all j ≠ i, meaning that all the other elements in the row are 0 [114].

In general, the probabilities for a p-order Markov chain can be estimated as:

P(X_n = x_n | X_{n−1} = x_{n−1}, X_{n−2} = x_{n−2}, ..., X_{n−p} = x_{n−p}),   x_1, x_2, ..., x_n ∈ S    (5.21)

If we have m states, the transition matrix for a p-order Markov chain is defined by m^p × m transition probabilities from each of the m^p possible preceding state sequences to the m possible subsequent states. Thus, if we have 3 states, a second-order Markov chain has a transition matrix of size 9×3, and a total of 27 transitions. It should also be noted that a general problem with higher-order Markov chains is that the number of elements in the matrix rises steeply, thereby limiting their use.

The long-term behavior of a Markov chain can be described using the transition matrix and an initial probability distribution, represented as a probability vector that describes the initial or starting probability of the states. A probability vector is a row vector of size 1 × m^k whose entries are non-negative and sum up to 1 [114]. The row vector of state probabilities at time t = n + 1 can be described using the matrix multiplication of the initial probability vector X_n and the transition probability matrix P, as shown below:

X_{n+1} = X_n P    (5.22)


For a first-order Markov chain, this can be written as:

X_{n+1} = [ P_1(n)  P_2(n)  ...  P_m(n) ] ·
          | P_11  P_12  ...  P_1m |
          | P_21  P_22  ...  P_2m |
          |  ...   ...  ...   ... |
          | P_m1  P_m2  ...  P_mm |

where P_1(n), P_2(n), ..., P_m(n) represent the probabilities of states 1, 2, ..., m at time n.
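As a minimal illustration of Equation 5.22, the following C++ sketch propagates a row vector of state probabilities through a first-order transition matrix; the matrix values and the initial vector are hypothetical and only serve to demonstrate the multiplication, they are not part of the predictor implementation.

#include <array>
#include <cstdio>

int main() {
    constexpr int m = 3;
    // Hypothetical first-order transition matrix P (each row sums to 1).
    const double P[m][m] = {{0.2, 0.5, 0.3},
                            {0.4, 0.4, 0.2},
                            {0.1, 0.6, 0.3}};
    // Initial probability vector X_n: the system is in state 1 with certainty.
    std::array<double, m> X = {1.0, 0.0, 0.0};

    // X_{n+1} = X_n * P (Equation 5.22).
    std::array<double, m> X_next = {0.0, 0.0, 0.0};
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < m; ++i)
            X_next[j] += X[i] * P[i][j];

    for (int j = 0; j < m; ++j)
        std::printf("P(state %d at time n+1) = %.3f\n", j + 1, X_next[j]);
    return 0;
}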

Summarizing the aforementioned definitions, we define a second-order Markov chain using the following parameters:

• Set of states S: This consists of the finite set of all values that a state can take. The state space can represent anything, e.g., letters, numbers, or weather conditions.

• Set of transitions and the associated transition probabilities: A transition refers to the change of the system from the state pair x_{n-2} x_{n-1} to another state x_n at time step n, or to staying in the same state between different time steps. The probability associated with this state transition is called the transition probability, and is defined by Equation 5.19.

• Initial probability vector: The initial probability vector is a row matrix describing the initial probabil- ity distribution for the set of all states of the Markov chain.

The algorithm of the proposed Markov chain predictor is described in the following six steps. To demon- strate the algorithm, we refer to the example in Figure 5.2. Let us assume that only 60% of the total points were selected for clustering using spectral clustering during DTA. To predict the clusters of the remain- ing points at runtime, we construct a Markov chain to represent the system of known cluster ids using the following steps:

1. Step 1: Define the states
In this step, the algorithm counts the number of clusters returned from the TMM, and computes the number of states in the set S. If some phases were identified as noise during DTA, the total number of states m is set to num_clusters + 1, since the Markov chain must model the system inclusive of the noise state. If no noise points were detected during DTA, the number of states is set to num_clusters.

In our toy example, the number of clusters that were detected using spectral clustering is three. Thus, we have a state space of three states, S = {1, 2, 3}, representing the three cluster ids. If a phase lies in cluster 1, the system is in state 1, and so on. A new point in the X-Y space may belong to any one of the three clusters.

2. Step 2: Construct the transition frequency matrix
In this step, the algorithm constructs the transition frequency matrix of size m^2 × m by counting the frequency, i.e., the number of transitions from x_i x_j to x_k. To compute this value, the algorithm iterates over the set of phases, and stores the number of transitions from two consecutive cluster ids to the next cluster id, as shown below:

                                     Next state
                                      1   2   3
                              11      3   1   5
                              12      5   3   2
                              13      5   5   3
                              21      5   4   5
Transition frequency matrix = 22      5   0   1
        (current state)       23      1   3   4
                              31      1   4   3
                              32      4   3   5
                              33      3   4   1

3. Step 3: Construct the transition matrix
The transition probabilities are computed from the transition frequency matrix using Equation 5.19. We normalize the rows of the transition frequency matrix by dividing each matrix element by the sum of the transition frequencies in its row to obtain the transition matrix, as shown below:

                                         Next state
                                          1        2        3
                                  11      0.3333   0.1111   0.5556
                                  12      0.5000   0.3000   0.2000
                                  13      0.3846   0.3846   0.2308
                                  21      0.3571   0.2857   0.3571
Transition probability matrix P = 22      0.8333   0        0.1667
           (current state)        23      0.1250   0.3750   0.5000
                                  31      0.1250   0.5000   0.3750
                                  32      0.3333   0.2500   0.4167
                                  33      0.3750   0.5000   0.1250

4. Step 4: Predict the cluster id of the current phase
Let us assume that the sequence of the clusters detected during DTA is 1, 2, 1, 3, ..., 1, 3, 1 for our example, where the last two known states are x_{n-2} = 3 and x_{n-1} = 1. The initial probability vector is [0 0 0 0 0 0 1 0 0], since the composite state 31 has probability 1. The probability distribution for the cluster ids is derived by multiplying the initial probability vector with the transition probability matrix, and results in the vector [0.1250 0.5000 0.3750]. Thus, the probability that the cluster id for the current phase is 2, given that the two immediately preceding states are 1 and 3 respectively, is 0.5, and is defined as P(2|13). It is represented by the element P_{312} of the probability matrix. We then use the Mersenne twister random number engine in conjunction with a true random number generator via the C++ random_device class, similar to Section 6.2.2.3, to randomly pick the next state based on the transition probabilities computed in the previous step. The cluster prediction library returns the predicted cluster number to the RRL, which switches to the corresponding cluster-best configuration for the phase.

66 5.3 Dynamic Prediction of Phase Behavior

The transition matrix above can also be represented in the form of a weighted, directed graph, or state transition diagram, as shown in Figure 5.11. The vertices are the states of the Markov chain, and the edges define the conditional probabilities of going to a state k at time n from the other states. An edge between two vertices exists only if the transition probability P_{ijk} between the vertices is greater than zero. Thus, each edge defines the conditional probability P(X_n = k | X_{n-1} = j, X_{n-2} = i), which is the probability of event k taking place given that j occurred at time step n-1 and i occurred at time step n-2.

Figure 5.11: State transition diagram of a second-order Markov chain. The vertices represent the composite states 11, 12, 13, 21, 22, 23, 31, 32, 33, and each edge is labeled with a conditional probability, e.g., P(1|11) = 0.33.

To construct the state transition diagram, we first express the Markov chain in the form of a first-order Markov chain [117], such that the state space becomes S = {11, 12, 13, 21, 22, 23, 31, 32, 33}. An edge from vertex ij to jk represents the conditional probability P(k|ji), defined by the element P_{ijk} in the probability matrix. For example, the probability that the cluster number for the current phase is 2 after the occurrence of 1 at t = n-1 and 3 at t = n-2 is defined by the edge from 31 to 12, and represented as P(2|13) = 0.5.

5. Step 5: Correct mispredictions
Initially, the predictor sets the total number of states to num_clusters if no phase was assigned as noise during DTA. To correct mispredictions, the predictor reassigns the predicted cluster of the previous phase if its cluster features fall within the ranges of another cluster. If the features do not lie within the ranges of any known cluster, the cluster id of the previous phase is marked as noise. If this is the first noise point in the system, a new state is created to represent noise points, and the number of states is updated to num_clusters + 1.

6. Step 6: Update the transition matrix
After correcting the mispredicted cluster of the previous phase, the transition frequency matrix and the resulting transition matrix are updated to reflect the newest edge between two states. The cluster number for the current phase is then randomly selected based on the new transition probabilities, and returned to the RRL, which switches the configuration. Steps 2-6 are then repeated for the remaining phases. A condensed code sketch of Steps 2-6 follows below.
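The following C++ sketch condenses Steps 2-6: it counts second-order transitions, normalizes them according to Equation 5.19, randomly draws the next cluster id from the row that corresponds to the two preceding states using the Mersenne twister engine, and adds the confirmed transition afterwards. The class name, interface, and input sequence are illustrative only; the actual cluster prediction library additionally handles the noise state and the feature-range check of Step 5.

#include <random>
#include <vector>

// Minimal second-order Markov chain over m cluster ids (1..m), sketched after
// Steps 2-6; this is an illustration, not the actual cluster prediction library.
class SecondOrderMarkov {
public:
    explicit SecondOrderMarkov(int m)
        : m_(m), counts_(m * m, std::vector<double>(m, 0.0)),
          gen_(std::random_device{}()) {}

    // Step 2: count a transition (i at n-2, j at n-1) -> k at n; ids are 1-based.
    void observe(int i, int j, int k) { counts_[row(i, j)][k - 1] += 1.0; }

    // Steps 3-4: normalize row (i, j) as in Equation 5.19 and randomly draw the
    // next cluster id from the resulting probability distribution.
    int predict(int i, int j) {
        const std::vector<double>& freq = counts_[row(i, j)];
        double sum = 0.0;
        for (double f : freq) sum += f;
        if (sum == 0.0) {
            // Unseen state pair: fall back to a uniform choice.
            std::uniform_int_distribution<int> uniform(1, m_);
            return uniform(gen_);
        }
        // discrete_distribution normalizes the transition frequencies internally.
        std::discrete_distribution<int> dist(freq.begin(), freq.end());
        return dist(gen_) + 1;
    }

    // Step 6: add the confirmed transition so that the probabilities are updated.
    void update(int i, int j, int actual_k) { observe(i, j, actual_k); }

private:
    int row(int i, int j) const { return (i - 1) * m_ + (j - 1); }

    int m_;                                    // number of states (clusters)
    std::vector<std::vector<double>> counts_;  // transition frequency matrix (m^2 x m)
    std::mt19937 gen_;                         // Mersenne twister seeded by random_device
};

int main() {
    // Illustrative sequence of cluster ids detected during DTA.
    const int seq[] = {1, 2, 1, 3, 2, 1, 3, 1};
    const int len = sizeof(seq) / sizeof(seq[0]);

    SecondOrderMarkov chain(3);
    for (int n = 2; n < len; ++n)                      // Step 2
        chain.observe(seq[n - 2], seq[n - 1], seq[n]);

    // Steps 3-4: predict the cluster id of the next (unseen) phase,
    // given x_{n-2} = 3 and x_{n-1} = 1.
    int predicted = chain.predict(3, 1);
    (void)predicted;  // returned to the RRL, which switches the configuration
    return 0;
}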

5.3.2 Predictors Based on Dynamic Branch Predictors

Dynamic branch prediction schemes were originally developed to mitigate control hazards, which limit the instruction-level parallelism of pipelined and superscalar processors. Control hazards arise when the CPU cannot determine which instruction to execute next. The one-bit and two-bit cluster predictors are based on the dynamic one-bit and two-bit branch prediction strategies that utilize the execution history of the conditional branch instructions executed in a program. The recent history of branches can be represented using one or two history bits that define whether the conditional branch was taken or not taken, and are stored in the branch history table. Branch prediction schemes predict the result of a branch instruction using the behaviour of the branches at runtime so that the processor can speculatively fetch (or prefetch) and execute the next instructions. Speculative execution can involve eager execution or predictive execution. In eager execution, both paths of the conditional branch are executed. In predictive execution, the path of the conditional branch is predicted, and the next instructions are executed along the predicted path until the actual result is known. If a branch misprediction occurs, the speculatively executed instructions are flushed, the processor re-fetches the instructions along the correct path, and the branch history table is updated.

One-bit Predictor

The one-bit predictor is based on the simplest dynamic branch prediction scheme, which stores a single bit in the branch history table, indicating whether the branch was recently taken or not. If the bit is set, i.e., the value is 1, the branch is predicted as taken, and if it is not set, i.e., the value is 0, the branch is predicted as not taken. In the case of a misprediction, the bit state is reversed. Similarly, for each phase that was not evaluated during DTA, the one-bit cluster predictor assigns the current phase to the cluster of the previous phase until it is determined that the previous phase’s cluster was mispredicted. The misprediction is confirmed at the beginning of the next phase after comparing the PAPI counters for the previous phase with the ranges of the clusters in the tuning model.

The states of a one-bit predictor are shown in Figure 5.12. A single bit is used to represent two states, taken (bit set, i.e., value is 1) and not taken (bit not set, i.e., value is 0). This predictor does not work well with loops because it always mispredicts twice, i.e., for the first and the last iterations. For the first iteration, it always predicts not taken, and for the last iteration, it always predicts taken when in reality, the loop exits. The overall prediction accuracy of a one-bit branch predictor is not very high. Consider the example in Figure 5.13. We want to predict the sequence TTTNTTTNTTTN, shown in the second column, using a one-bit counter, with the initial state set to not taken, represented as N. The predictor returns the same state until it is mispredicted, for example, in the first row. It is then updated to the actual outcome shown in the second column. This new value is successively returned until it is mispredicted again. We get six mispredictions, with 50% of the branches being correctly predicted, as shown in the third column.

Figure 5.12: State diagram of one-bit dynamic branch prediction. A single bit encodes the states predict taken (1) and predict not taken (0).

Figure 5.13: Example of a one-bit dynamic branch predictor to predict the sequence of branches TTTNTTTNTTTN (initial state: N).

    Prediction   Outcome   Mispredicted?
    N            T         Y
    T            T         N
    T            T         N
    T            N         Y
    N            T         Y
    T            T         N
    T            T         N
    T            N         Y
    N            T         Y
    T            T         N
    T            T         N
    T            N         Y

Figure 5.14: Analogous example of a one-bit cluster predictor applied to predict the sequence of clusters 2 1 1 1 3 1 2 2 3 2 3 3 (initial sequence: ... 2 1 3 1, initial state: 1).

    Prediction   Outcome   Mispredicted?
    1            2         Y
    2            1         Y
    1            1         N
    1            1         N
    1            3         Y
    3            1         Y
    1            2         Y
    2            2         N
    2            3         Y
    3            2         Y
    2            3         Y
    3            3         N

An analogous example in Figure 5.14 presents the output of the one-bit cluster predictor to predict the sequence of clusters 2 1 1 1 3 1 2 2 3 2 3 3 for a group of unseen phases during RAT. The initial sequence ... 2 1 3 1 refers to the sequence of clusters detected during DTA for the last four phases. Here, the initial state starts with 1, since it is the cluster id of the last seen phase during DTA. The predictor predicts the same cluster id until it is mispredicted, for example, in the first row. It is then updated to the actual outcome 2, as shown in the second column. This value is returned until it is mispredicted again. We get eight mispredictions, as shown in the third column, with 33.33% of the clusters being correctly predicted.
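A minimal C++ sketch of this one-bit cluster predictor logic is shown below, replayed on the sequence from Figure 5.14; the class name and interface are illustrative and not the actual library API.

#include <cstdio>

// Illustrative one-bit cluster predictor: it always predicts the cluster id of
// the previous phase and corrects itself after every misprediction.
class OneBitClusterPredictor {
public:
    explicit OneBitClusterPredictor(int last_dta_cluster) : state_(last_dta_cluster) {}

    int predict() const { return state_; }  // prediction for the current phase

    // Called once the actual cluster of the phase is known (after comparing the
    // measured PAPI counters against the cluster ranges in the tuning model).
    void confirm(int actual_cluster) { state_ = actual_cluster; }

private:
    int state_;  // cluster id of the last confirmed phase
};

int main() {
    const int actual[] = {2, 1, 1, 1, 3, 1, 2, 2, 3, 2, 3, 3};  // sequence from Figure 5.14
    OneBitClusterPredictor predictor(1);  // cluster id of the last phase seen during DTA
    int mispredictions = 0;
    for (int a : actual) {
        if (predictor.predict() != a) ++mispredictions;
        predictor.confirm(a);
    }
    std::printf("mispredictions: %d\n", mispredictions);  // prints 8, as in Figure 5.14
    return 0;
}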


Two-bit Predictor

A two-bit dynamic branch prediction scheme uses two bits to represent four states, of which two correspond to taken (values 11 and 10), and the other two correspond to not taken (values 00 and 01). The states of a two-bit predictor are shown in Figure 5.15. If the initial state is not taken with the value 00, the next prediction will again be not taken. If the actual outcome is also not taken, the predictor remains in the same state. If the outcome is taken, the state transitions into the second not taken state with the value 01. The next prediction would then naturally be not taken. If this is the correct prediction, the state transitions back into the 00 state, and if it is a misprediction, the state transitions into the taken state, 11. The advantage of this approach is that two successive mispredictions must occur before the prediction is changed. This predictor is more accurate than the one-bit branch predictor, and works well when branches predominantly go in one direction. Thus, a few atypical branches that result in a temporary change in the direction will not influence the prediction. It does, however, mispredict thrice for loops because it always mispredicts the first and the second iterations as not taken, when they are in fact taken, and the last iteration as taken when the loop actually exits. The two-bit cluster predictor uses a similar technique: it predicts the cluster id of the last phase known to DTA until a misprediction occurs twice in a row. The cluster id is then corrected to the right value, which is predicted again for the next phases until two successive mispredictions occur.

Figure 5.15: State diagram of two-bit dynamic branch prediction. Two bits encode the four states predict taken (11 and 10) and predict not taken (01 and 00).

The overall prediction accuracy of a two-bit branch predictor is higher than that of the one-bit predictor. Consider the example in Figure 5.16 that shows the predictions and the outcomes for the same sequence TTTNTTTNTTTN as before. With the initial state set to not taken, represented as N, the predictor first returns not taken until it is mispredicted twice successively in the second row. It is then corrected to the actual outcome taken, as shown in the second column. This new value is predicted until it is successively mispredicted twice. We get a total of five mispredictions, as shown in the third column, and a branch prediction accuracy of 58.3%. Figure 5.17 presents the output of the two-bit cluster predictor to predict the same sequence of clusters 2 1 1 1 3 1 2 2 3 2 3 3 as before. Here, the initial state starts with 1, since it is the cluster id of the last phase of DTA, similar to the one-bit technique. The predictor predicts the same value until it is mispredicted twice successively, as shown in the eighth row. It is then updated to the actual outcome 2. This value is returned until it is again mispredicted twice in a row. We get seven mispredictions, as shown in the third column, with a prediction accuracy of 41.7%.


Figure 5.16: Example of a two-bit dynamic branch predictor to predict the sequence of branches TTTNTTTNTTTN (initial state: N).

    Prediction   Outcome   Mispredicted?
    N            T         Y
    N            T         Y
    T            T         N
    T            N         Y
    T            T         N
    T            T         N
    T            T         N
    T            N         Y
    T            T         N
    T            T         N
    T            T         N
    T            N         Y

Figure 5.17: Analogous example of a two-bit cluster predictor applied to predict the sequence of clusters 2 1 1 1 3 1 2 2 3 2 3 3 (initial sequence: ... 2 1 3 1, initial state: 1).

    Prediction   Outcome   Mispredicted?
    1            2         Y
    1            1         N
    1            1         N
    1            1         N
    1            3         Y
    1            1         N
    1            2         Y
    1            2         Y
    2            3         Y
    2            2         N
    2            3         Y
    2            3         Y
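A sketch of the two-bit cluster predictor logic, analogous to the one-bit version above, is given below; the names are illustrative, not the actual library interface. The predicted cluster id only changes after two successive mispredictions.

#include <cstdio>

// Illustrative two-bit cluster predictor: it keeps predicting the same cluster
// id and only switches after two successive mispredictions.
class TwoBitClusterPredictor {
public:
    explicit TwoBitClusterPredictor(int last_dta_cluster)
        : state_(last_dta_cluster), successive_misses_(0) {}

    int predict() const { return state_; }

    void confirm(int actual_cluster) {
        if (actual_cluster == state_) {
            successive_misses_ = 0;        // correct prediction: keep the state
        } else if (++successive_misses_ == 2) {
            state_ = actual_cluster;       // two misses in a row: change the prediction
            successive_misses_ = 0;
        }
    }

private:
    int state_;              // currently predicted cluster id
    int successive_misses_;  // number of successive mispredictions
};

int main() {
    const int actual[] = {2, 1, 1, 1, 3, 1, 2, 2, 3, 2, 3, 3};  // sequence from Figure 5.17
    TwoBitClusterPredictor predictor(1);  // cluster id of the last phase of DTA
    int mispredictions = 0;
    for (int a : actual) {
        if (predictor.predict() != a) ++mispredictions;
        predictor.confirm(a);
    }
    std::printf("mispredictions: %d\n", mispredictions);  // prints 7, as in Figure 5.17
    return 0;
}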

5.4 Summary

In this chapter, we presented the core of our work and the extensions to the READEX tuning methodology for inter-phase tuning. In order to characterize the behaviour of phases for clustering, we presented different metrics, namely AVX calculation instructions, LLC misses, conditional branch instructions, L2 cache misses, and their equivalent PAPI/perf performance counters. We also argued for our reasoning behind the selection of these metrics. Compute intensity exposes the ALU saturation, or the amount of work done, and is calculated as the ratio of AVX calculation instructions to the L3 cache misses. This is useful primarily to determine the DVFS setting. To determine the UFS setting, we use the L2 cache misses, since they represent increased activity on the ring interconnect and the L3 cache, both of which are part of the uncore. The conditional branch instructions represent a change in the control flow, indicating if a compute-intensive region is followed by a memory-intensive region in the program.

We described the main concepts of clustering, and the implementation of DBSCAN and spectral clustering to group similarly behaving phases. DBSCAN requires two inputs, namely eps and minPts. eps is determined automatically via the elbow method on the average 3-NN distances between each pair of points. Spectral clustering also requires two inputs, namely the similarity matrix and the number of clusters. The number of clusters is computed automatically using the top k eigenvectors corresponding to the top k eigenvalues. Both algorithms cluster the phases using normalized features computed using the min-max normalization method. The unclustered phases are regarded as noise points.

To evaluate as many configurations as possible, we developed a targeted tuning step that is based on attractor and repeller configurations, depending on whether they result in a decrease or increase in the normalized energy consumption. A weight is assigned to every configuration in the search space based on the Gaussian distribution, which is a function of the distance of the configuration to an attractor or a repeller. The probability of selecting a configuration increases as we get closer to an attractor, and decreases as we get closer to a repeller. A configuration for each phase in a cluster is selected at random based on the discrete pmf under the condition that the configuration has never been selected before for that cluster. To save tuning time, only a representative subset of the entire set of application phases is run and clustered during DTA.

To predict the cluster number of unseen phases at runtime during RAT, we implemented a cluster prediction library. We developed three cluster predictors, namely one-bit and two-bit predictors inspired by the respective one-bit and two-bit dynamic branch predictors, and a second-order Markov chain based predictor. For each unseen phase, the one-bit predictor predicts the previous cluster number until it is mispredicted, and the two-bit predictor predicts the previous cluster number until it is mispredicted twice successively. The Markov chain predictor computes the transition probability from the states x_i at time step n-2 and x_j at time step n-1 to state x_k at time step n. Then, it predicts the cluster number for the current phase using a random selection based on the transition probability matrix. The cluster predictors return the predicted cluster id to the RRL, which switches the configuration of the phase and the rts's to the corresponding cluster-best and rts-best configurations.

6 Integration into the Tuning Framework

This chapter details the integration of our core work and concepts presented in Chapter 5 for inter-phase tuning into the overall READEX framework.

The tuning methodology comprises the DTA and RAT. As described in Section 4.1, PTF performs DTA, and the RRL performs RAT using Score-P as the common instrumentation and measurement infrastructure. Before DTA starts, a series of preparatory steps are performed by two tools, scorep-autofilter and readex-dyn-detect. Section 6.1 presents scorep-autofilter, which filters fine-granular program regions, followed by Section 6.1.4, which describes the steps performed by readex-dyn-detect to identify the significant regions and compute the tuning potential to determine if the tuning effort will potentially result in savings. This chapter also presents the interaction between the intraphase and interphase tuning plugins and the existing READEX modules. Sections 6.2.1 and 6.2.2 respectively describe the four tuning steps performed by the intraphase and the interphase plugins at design-time to determine the best configurations for the phase and the rts's.

In addition to the intraphase and interphase tuning plugins, this chapter describes in Section 6.1.2 the domain knowledge that can be specified by the application expert to enhance the tuning process by annotating special regions or functions to expose application dynamism. Domain knowledge also enables the specification of ATPs, as well as phase, region and input identifiers. The phase and region identifiers distinguish rts's with special characteristics.

At the end of the intraphase plugin, a single best configuration for the phase, and best configurations for the rts’s are selected. At the end of the interphase plugin, a best configuration for each cluster of phases as well as rts-specific best configurations for the rts’s of each cluster are determined. Sections 6.2.1.4 and 6.2.2.4 define the static savings for the phase and the rts’s and the dynamic savings for the rts’s for the intraphase and interphase tuning plugins respectively.

Section 6.2.3 describes the last step of DTA, i.e., the tuning model generation to store the best configurations for inter-phase and intra-phase tuning. Section 6.3 presents the interactions that take place between the components of the RRL and the cluster prediction library during cluster prediction for an unseen phase.


6.1 Pre-analysis

The pre-analysis step consists of a series of preparatory steps, including application instrumentation, overhead analysis, region filtering, significant region detection, and tuning potential analysis.

6.1.1 Automatic Reduction of Instrumentation Overhead

In the first step, the application is instrumented with Score-P. Instrumentation enables the measurement of performance metrics for various program regions by inserting hooks at the entry and exit points of the regions. Score-P can perform automatic or manual instrumentation for various region types, for example, for all application functions, user-annotated code regions, MPI library calls, and OpenMP parallel regions. For a user who does not have a thorough knowledge of the application code, automatic instrumentation may be preferable, as it requires practically no code modification. However, this approach may lead to high overhead due to the instrumentation of frequently executed fine-granular regions, whose execution time per instance is quite low, and thus has an undesirable impact on the measurements. Moreover, recording the measurements for such fine-granular regions requires significant space, and analysis takes longer with relatively little to no improvement in quality. For example, in C++ applications, every call to the Standard Template Library (STL) is instrumented, causing a high overhead due to the enter and exit hooks being called hundreds or thousands of times for these functions.

To reduce this overhead, Score-P can be configured with a list of regions that should be omitted from measurement. This, however, does not mean that the regions won't be instrumented, but rather that the region enter and exit hooks will be present, but no measurements will be taken. If users have a good knowledge of the application, they may be able to manually create a filter file containing the regions to omit from being measured. The tool scorep-autofilter provides an easier and automatic way to do this, and frees the user from manually determining the regions to insert into the filter file. To use the tool, the application is first instrumented using Score-P, which records measurements for each program region. Although Score-P provides both tracing and profiling features, we only use the profiling option in our methodology. Score-P writes the profile data in the CUBE4 format into a cubex file containing a tree of nodes, where each node represents an rts's unique call-path, and profiling data for each call-path, such as the number of visits, execution time or hardware events. scorep-autofilter is then invoked using the command scorep-autofilter -t <threshold> profile.cubex to first read the profile stored in profile.cubex, and generate a list of frequently executed fine-granular regions.

The granularity of a region is defined as the average execution time of its instances:

granularity = t_incl^reg / instances_excl^reg    (6.1)

where t_incl^reg is the inclusive execution time of region reg and instances_excl^reg is the number of exclusive instances of reg, i.e., without counting the instances of nested regions. The tool then filters out all the regions whose granularity is lower than the threshold value in seconds, specified using the -t parameter, and adds the names of the functions to the Score-P filter file. The filter file can be specified via the environment variable SCOREP_FILTERING_FILE at runtime, or via the --instrument-filter flag at compile-time. Score-P filter files have a begin and an end section, followed by the region names that should be excluded from event recording, as shown below:

74 6.1 Pre-analysis

SCOREP_REGION_NAMES_BEGIN
  EXCLUDE
    add*
    binvcrhs*
    binvrhs*
    copy_x_face*
    copy_y_face*
    ...
SCOREP_REGION_NAMES_END

After this step, only the unfiltered regions are candidates for significant region selection in the next step.
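To make the filtering step concrete, the sketch below computes the granularity of Equation 6.1 for a hypothetical list of region profiles and writes the fine-granular region names into a filter file in the format shown above. It is a simplified illustration of the idea behind scorep-autofilter, not its actual implementation.

#include <fstream>
#include <string>
#include <vector>

// Hypothetical per-region data extracted from the CUBE4 profile (profile.cubex).
struct RegionProfile {
    std::string name;
    double inclusive_time;          // t_incl: inclusive execution time of the region
    long long exclusive_instances;  // exclusive instances, without nested regions
};

// Write a Score-P filter file that excludes all regions whose granularity
// (Equation 6.1) is below the threshold given in seconds (the -t parameter).
void write_filter_file(const std::vector<RegionProfile>& regions,
                       double threshold_seconds, const std::string& path) {
    std::ofstream out(path);
    out << "SCOREP_REGION_NAMES_BEGIN\n  EXCLUDE\n";
    for (const RegionProfile& region : regions) {
        if (region.exclusive_instances == 0) continue;
        const double granularity = region.inclusive_time / region.exclusive_instances;
        if (granularity < threshold_seconds)
            out << "    " << region.name << "\n";
    }
    out << "SCOREP_REGION_NAMES_END\n";
}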

6.1.2 Domain Knowledge Specification

The READEX methodology developed the domain knowledge specification interface in order to enhance the tuning process by enabling the application expert to provide user-level specifications that expose domain-level knowledge. The domain knowledge allows the identification of new system scenarios through the information characterizing application dynamism and the additional parameters used for tuning. Domain knowledge may include the following three aspects:

1. Specification of the application structure using phase and user region annotations
2. Specification of application characteristics using phase and region identifiers
3. Specification of Application Tuning Parameters (ATPs)

Each of these aspects of the domain knowledge specification is integrated into the DTA workflow at different stages. In addition to the automatic compiler-based instrumentation, instrumentation can be done manually. Manual instrumentation augments automatic instrumentation by enabling the use of region or phase annotations, which can improve the results of tuning by recording additional user-defined metrics. Immediately following the filtering using scorep-autofilter, the application expert specifies the application structure, which includes annotating the phase region and additional program regions that should be considered for tuning. For our approach, it is mandatory that the phase region be instrumented. All other annotations purely serve to refine the tuning model by differentiating more rts's. The process of specifying the application structure is described in more detail in Section 6.1.3. The presence of application dynamism is then detected with the help of a tool called readex-dyn-detect. The tuning process is either stopped in case of no available dynamism, or the application expert may identify additional application characteristics using region identifiers (see Section 6.1.4), phase identifiers (see Section 6.3), and input identifiers (see Section 6.2.1) to support DTA in generating a more sophisticated tuning model. As mentioned in Section 3.3.3, the tuning methodology also targets ATPs by enabling users to specify parts of the code that can be used as tuning parameters, where different implementations of the same algorithm have varying impact on performance and energy. Section 6.2.1.2 describes the ATP specification in more detail. We shall describe the domain knowledge specification with the help of a single use-case, the MG (MultiGrid) benchmark from the NAS parallel benchmark suite [118], as shown in Listing 6.1. MG uses a V-cycle algorithm to solve a discrete Poisson equation on a 3D grid. After performing an initial residual calculation, each iteration of the time loop of MG executes an entire V-cycle of iterative relaxation and smoothing steps starting from the finest grid, or the highest grid level. First, the result on the current grid level k is projected to the next coarser grid level k-1 until the coarsest level is reached, and an approximate solution is computed. The result is then interpolated from the coarser grid levels to the finer grid until the finest grid, where the residual is calculated and a smoother is applied to correct the result. Several V-cycles are executed to solve the equation.

6.1.3 Detection of Significant Regions

After performing the filtering step, the remaining regions are coarse-granular enough to be candidates for tuning. The second step of pre-analysis uses the tool readex-dyn-detect to determine the set of significant regions R_sig for which DTA will determine best configurations that the RRL will dynamically switch during production runs. Significant regions must be:

1. Coarse enough so that the switching overhead is negligible.
2. Non-nested so that the effect of dynamic tuning is predictable.
3. Coarse-granular enough so as to constitute a major portion of the overall execution time.

Our tuning approach targets applications that have a phase region, or the central progress loop that iteratively performs some computation. Although this loop can be implemented by a standard loop construct of the programming language, automatic detection of the phase region is difficult without analyzing the application source code. Thus, application experts can help provide this information. Score-P offers an Online Access phase region annotation to enable external tools like PTF to configure Score-P dynamically when a phase is started. The phase region and user regions are defined using Score-P macros to enclose arbitrary code, and thus can be handled by the tool suite like any other program region.

First, the region handles are defined using the macro SCOREP_USER_REGION_DEFINE in line 3 of Listing 6.1. Then, the start and end of the phase region are enclosed in the macros SCOREP_USER_OA_PHASE_BEGIN and SCOREP_USER_OA_PHASE_END to annotate the body of the iteration loop of MG as the phase region, as shown in lines 7 and 11. The third parameter in the region and phase definitions is "0", which is Score-P specific, and refers to a region without any specific type. The start and end of the phase region warrant a barrier synchronization of all the MPI processes. After annotating the phase region, the application is run again. The user then invokes readex-dyn-detect using the command readex-dyn-detect -t -p profile.cubex, and specifies a threshold for the minimum granularity of the execution time for a region to be considered significant. The significant region detection algorithm then constructs the transitive closure of the call graph of the application using an adjacency matrix, and selects the leaf node of each candidate region if its execution time is more than the summed up exclusive times of its parents. Otherwise, it selects the parents because they cover more of the execution time, ultimately ending with a list of significant regions.

6.1.4 Analyze Tuning Potential

After detecting the application's significant regions, readex-dyn-detect computes the tuning potential by quantifying the application dynamism using the variation in the execution time and compute intensity to detect intra- and inter-phase dynamism. Variation in the execution time between instances of significant regions in an application during its execution is an indication of different resource requirements. The computational intensity models the behaviour of an application based on the load imposed by it on the CPU and

76 6.1 Pre-analysis

Listing 6.1: Domain knowledge specification for the application structure (phase region and user region), and application characteristics (region identifier) for the MG benchmark.

1  #include "scorep/SCOREP_User.inc"
2  ...
3  SCOREP_USER_REGION_DEFINE(R1)
4
5  do it = 1, max_iter
6    ! Phase region begins
7    SCOREP_USER_OA_PHASE_BEGIN(R1,"VCycle",0)
8    ! Execute a V-Cycle
9    call mg3P(...)
10   ...
11   SCOREP_USER_OA_PHASE_END(R1)
12   ! Phase region ends
13 enddo
14 ...
15
16 subroutine mg3P(...)
17   do k = max_level, min_level+1, -1
18     ! Restrict residual from the fine grid to the coarser grid
19     call rprj3(...)
20   enddo
21
22   ! Compute an approximate solution on the coarsest grid
23   call psinv(...)
24   do k = min_level+1, max_level-1
25     ! Interpolate the result from coarser to finer grid
26     call interpolate(... , k)
27     ! Compute the residual for grid level k
28     call resid(...)
29     ! Apply the smoother to correct the error
30     call psinv(...)
31   enddo
32   ...
33 end subroutine mg3P
34
35 ! Interpolate to grid level k
36 subroutine interpolate(... , k)
37   SCOREP_USER_REGION_DEFINE(R2)
38   SCOREP_USER_PARAMETER_DEFINE(intParam)
39   ! User region begins
40   SCOREP_USER_REGION_BEGIN(R2,"interp",0)
41   ! Region identifier for grid level k
42   SCOREP_USER_PARAMETER_INT64(intParam,"level",k)
43   ...
44   SCOREP_USER_REGION_END(R2)
45   ! User region ends
46 end subroutine interpolate


Listing 6.2: Exported environment variables to enable dynamism detection using readex-dyn-detect

export SCOREP_PROFILING_FORMAT=cube_tuple
export SCOREP_METRIC_PAPI=PAPI_TOT_INS,PAPI_L3_TCM

the memory, and is calculated using the following formula, which is analogous to the operational intensity used in the roofline model [119]:

Compute Intensity = (Total number of instructions executed) / (Total number of L3 cache misses)    (6.2)

The more common definition of compute intensity is based on the floating-point instructions. As these cannot be counted on Intel Haswell, we use the above definition instead. However, it should be noted that this measure of transferred data is very coarse since it does not count all the writes to memory, for example during hardware prefetching. Computational intensity can directly dictate the effect of the hardware tuning parameters: core frequency and uncore frequency. When the application has a low compute intensity, it may indicate that a memory-intensive region is being executed due to increased L3 cache misses, and would therefore benefit from a higher uncore frequency setting. On the other hand, for a high compute intensity value, it would be desirable to increase the frequency of the CPU cores. To identify the dynamism, statistical information is requested from Score-P via the cube-tuple profiling format by exporting the environment variable SCOREP_PROFILING_FORMAT, as shown in line 1 of Listing 6.2. This extended cube format provides the number of samples, minimum, maximum, average, and deviation of the metrics for the program regions. Additionally, PAPI metrics, as shown in line 2, are requested to derive the compute intensity. The dynamism detected by readex-dyn-detect can be divided into intra-phase dynamism and inter-phase dynamism. The tool analyzes the variation in the execution time of each significant region by computing the deviation deviation_r^reg relative to its mean execution time in percentage in Equation 6.3, and relative to the mean execution time of the phase, defined by deviation_p^reg in Equation 6.4. It also computes the variation in the minimum and maximum execution times for the phase region in Equation 6.5.

deviation_r^reg = dev_t_incl^reg / mean_t_incl^reg × 100    (6.3)

deviation_p^reg = dev_t_incl^reg / mean_t_incl^phase × 100    (6.4)

deviation_p^phase = dev_t_incl^phase / mean_t_incl^phase × 100    (6.5)

The tool reports intra-phase dynamism if:

• The region's time variations deviation_r^reg and deviation_p^reg are larger than a threshold v_t, and the minimal weight of the execution time of a significant region with respect to the phase region according to Equation 6.6 is larger than a threshold v_w.

weight = t_incl^reg / t_incl^phase × 100    (6.6)


• The compute intensity varies across the significant regions, and this variation is larger than a threshold v_i.

The tool reports inter-phase dynamism if:

• The variation deviation_p^phase in the minimum and maximum execution time for the phase region is larger than a threshold v_t.
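The checks above can be condensed into the following sketch, which assumes hypothetical per-region statistics and encodes Equations 6.3-6.6 with the thresholds printed in Listing 6.3; it illustrates the criteria only and is not the actual readex-dyn-detect implementation.

#include <algorithm>
#include <string>
#include <vector>

// Hypothetical per-region statistics derived from the cube_tuple profile.
struct RegionStats {
    std::string name;
    double dev_t_incl;         // deviation of the region's inclusive execution time
    double mean_t_incl;        // mean inclusive execution time of the region
    double t_incl;             // inclusive execution time of the region
    double compute_intensity;  // instructions per L3 cache miss (Equation 6.2)
};

struct PhaseStats { double dev_t_incl, mean_t_incl, t_incl; };

// Thresholds; readex-dyn-detect reports 10 for all of them in Listing 6.3.
const double v_t = 10.0, v_w = 10.0, v_i = 10.0;

// Intra-phase dynamism of a region due to time variation (Equations 6.3, 6.4, 6.6).
bool time_variation_dynamism(const RegionStats& r, const PhaseStats& phase) {
    const double dev_r  = r.dev_t_incl / r.mean_t_incl     * 100.0;  // Equation 6.3
    const double dev_p  = r.dev_t_incl / phase.mean_t_incl * 100.0;  // Equation 6.4
    const double weight = r.t_incl     / phase.t_incl      * 100.0;  // Equation 6.6
    return dev_r > v_t && dev_p > v_t && weight > v_w;
}

// Intra-phase dynamism due to variation in the compute intensity across regions.
bool compute_intensity_dynamism(const std::vector<RegionStats>& regions) {
    double lo = regions.front().compute_intensity, hi = lo;  // assumes at least one region
    for (const RegionStats& r : regions) {
        lo = std::min(lo, r.compute_intensity);
        hi = std::max(hi, r.compute_intensity);
    }
    return (hi - lo) > v_i;
}

// Inter-phase dynamism of the phase region (Equation 6.5).
bool inter_phase_dynamism(const PhaseStats& phase) {
    return phase.dev_t_incl / phase.mean_t_incl * 100.0 > v_t;
}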

Thus, intra-phase dynamism arises from the variations in the execution time and compute intensity between rts's of significant regions when each rts exhibits different characteristics, resulting in the selection of different optimum configurations. Inter-phase dynamism arises from variations in the execution time between individual phases of the phase region, thus requiring different configurations for individual phases.

Listing 6.3 presents an example of the summary of significant regions and the dynamism identified by readex-dyn-detect for MG. The printed output for MG can be divided into three parts. First, lines 3 to 6 list the names of the significant regions detected by readex-dyn-detect. Lines 10 to 19 show the profile output for the detected significant regions and the phase region. The columns under the Significant region information section present the minimum and maximum execution times, the time deviation in percentage w.r.t. the mean execution time, the absolute value of compute intensity, and the execution time relative to the phase time respectively for each significant region. The Phase information section prints the statistics for the phase region, and shows the minimum, maximum and mean execution time, the deviation in the execution time w.r.t. the mean execution time, as well as the variation between the minimum and maximum execution times w.r.t. the mean execution time of the phase. Finally, the tool prints the summary of the dynamism analysis, indicating whether inter-phase dynamism due to variation in the execution time of the phases was found, and whether intra-phase dynamism is present for the significant regions due to the variation in the execution time and compute intensity.

After analyzing the tuning potential, the tool outputs the list of significant regions along with the dynamism information into a configuration file in the XML format. The user can then manually annotate the significant regions, which are usually program regions, including subroutines, loops, and structured blocks. Alternatively, the application expert might identify additional regions by combining several calls to different fine-granular functions, or identify certain parts of an algorithm that would otherwise not be a target of tuning because they are not represented by a standard region type, but may have interesting characteristics or tuning potential. Listing 6.1 shows the definition of a user region for the subroutine interpolate in lines 40 and 44, where the start and end of the user region are marked with the macros SCOREP_USER_REGION_BEGIN and SCOREP_USER_REGION_END. The configuration file is then forwarded to PTF, where it is deserialized, and used for DTA. In addition to the region id and the call-path, which are sent to PTF from Score-P by default, the application expert may specify additional region and phase identifiers to distinguish rts's with special characteristics. These identifiers are specified using Score-P user parameter macros for parameter-based profiling, and can be of type integer or string, which are defined using SCOREP_USER_PARAMETER_INT64(handle,name,value) and SCOREP_USER_PARAMETER_STRING(handle,name,value) respectively to associate the parameter name with the value. In Listing 6.1, the size of the grid processed in the call to interpolate(..., k) gets bigger when going from the minimum grid level (coarsest) to the maximum (finest). At a certain grid level, the computation switches from being compute-bound to memory-bound.
Without region identifiers, the call-path of interpolate(..., k) is simply /VCycle/mg3P/interp, for which DTA picks one best configuration. We could, however, enhance the tuning model by enabling DTA to determine special system configurations for compute- and


Listing 6.3: Summary of the application pre-analysis for MG. Significant intra-phase dynamism due to variation in the execution time and compute intensity was found.

1  ...
2  Significant regions are:
3  interp
4  psinv
5  resid
6  rprj3
7
8  Significant region information
9  ======
10 Region name  Min(t)  Max(t)  Time   Time Dev.(%Reg)  Ops/L3miss  Weight(%Phase)
11 rprj3        0.003   0.028   3.433  18.0             171         17
12 psinv        0.003   0.048   5.263  151.6            145         26
13 interp       0.002   0.022   2.573  129.7            62          13
14 resid        0.003   0.054   5.366  152.3            87          26
15
16 Phase information
17 ======
18 Min       Max       Mean      Time     Dev.(% Phase)  Dyn.(% Phase)
19 0.283916  0.299874  0.291602  20.4121  1.62227        5.47265
20
21 threshold time variation (percent of mean region time): 10.000000
22 threshold compute intensity deviation (#ops/L3 miss): 10.000000
23 threshold region importance (percent of phase exec. time): 10.000000
24
25 SUMMARY:
26 ======
27 No inter-phase dynamism
28
29 Intra-phase dynamism due to time variation(%) of the following important
30 significant regions
31 rprj3
32 psinv
33 interp
34 resid
35 Intra-phase dynamism due to variation in the compute intensity of the following
36 important significant regions
37 rprj3
38 psinv
39 interp
40 resid
41 Writing into the configuration file...

memory-bound rts’s by specifying a region identifier for the grid level inside the interpolate region. First,

80 6.2 Design-Time Analysis the identifier is defined in line 38, and then associated with the value of the grid level in line 42. Thus, the region interpolate would now have multiple rts’s representing different grid levels, for example, /VCy- cle/mg3P/interp/level=2, /VCycle/mg3P/interp/level=3,.... We can use a similar strategy to specify phase identifiers. Phase identifiers identify phases with different characteristics or behavior, and can be used in the tuning model to distinguish rts’s based on the variation in the phase behavior. Phases that have similar characteristics can be grouped together into a single cluster using phase identifiers, such as the degree of sparsity of the matrix or the arithmetic intensity of the phase region. This enables the prediction or selection of different best configurations for different clusters of phases instead of picking a single static-best configuration for all the phases. Phase identifiers are provided in the same way as region identifiers via Score-P user parameters, but are placed immediately after the SCOREP_USER_OA_PHASE_BEGIN macro, and hence are attached to the phase region. Section 6.3 presents phase identifiers in more detail. Phase identifiers have a high impact in selecting the best configurations, and are typically provided by the application expert.

6.2 Design-Time Analysis

The output of readex-dyn-detect is stored in a configuration file, which consists of tags through which the user can provide specifications for:

1. Tuning parameters: The intraphase and interphase tuning plugins support three tuning parameters, namely CPU frequency, uncore frequency and the number of OpenMP threads. The settings for the tuning parameters are specified via the ranges (minimum, maximum, step size, and default value) in MHz for the CPU frequency and uncore frequency, and the lower bound and the step size for OpenMP threads. 2. Objective function: Tuning objectives functions include energy, execution time, CPU energy, EDP, ED2P, and TCO. TCO determines the overall costs of a job as the sum of the energy costs plus the execution time dependent fraction of the HPC system costs, i.e., hardware and software investment as well as maintenance costs and personnel. Each of these objectives also has a normalized version to tune applications that have varying amounts of computation in the phase region, but no varying characteristics, indicating that the changing amounts of work is of the same kind. The objective values are normalized by a metric characterizing the computational load, such as the number of AVX instructions. Multiple objectives functions may also be specified in the configuration file. However, the application is tuned only w.r.t. the first objective, and measurements are collected for the rest. 3. Energy metrics: These include the energy plugin name and the associated metric names. The energy plugin provides energy measurements to Score-P, which transfers them to PTF via the OA interface. 4. Search algorithm: The search algorithm can be either exhaustive, random or individual search strategy. Each of these search algorithms generates the search space differently. The exhaustive search creates the cross-product of all the selected tuning parameters. The advantage is that every combination of the tuning parameters can be explored. However, the search space may explode due to the number of possible configurations. The alternatives are the individual and random search strategies. Individual search explores the tuning parameters individually by assuming independence. First, it finds the best setting for the first tuning parameter, sets this value, and then tunes the next tuning parameter until the best settings for all the parameters are determined. The user can also specify the number of best values of the already evaluated tuning parameter to keep for the next search step using the keep tag [8]. The random strategy explores a set number of configurations using the samples tag by selecting a


Table 6.1: Identifiers for the significant regions and rts’s during DTA.

Name: Regions (user regions and phase region)
Description: The phase region and all significant regions found at design-time
Identifiers: Region id, region name, file name, line number

Name: Runtime situations
Description: Rts's of the significant regions, including user parameters found at design-time
Identifiers: Region name, call-path, user parameter name and value (if user parameters are defined)

configuration randomly from the search space using either a uniform probability distribution or a user-specified distribution. It should be noted that the interphase plugin only uses the random strategy to tune the application, and is described in Section 6.2.2.
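As an illustration of the random strategy, the sketch below draws a fixed number of samples from the cross-product search space of CPU frequency, uncore frequency, and OpenMP thread count using a uniform distribution. The parameter ranges and the sample count are placeholders, and the actual search algorithms are implemented inside PTF.

#include <cstddef>
#include <random>
#include <tuple>
#include <vector>

// One system configuration: CPU frequency (MHz), uncore frequency (MHz), threads.
using Configuration = std::tuple<int, int, int>;

// Build the cross-product search space from (minimum, maximum, step) ranges.
std::vector<Configuration> build_search_space() {
    std::vector<Configuration> space;
    for (int cpu = 1200; cpu <= 2500; cpu += 100)              // placeholder CPU frequency range
        for (int uncore = 1200; uncore <= 3000; uncore += 200) // placeholder uncore frequency range
            for (int threads = 1; threads <= 24; threads *= 2) // placeholder OpenMP thread counts
                space.emplace_back(cpu, uncore, threads);
    return space;
}

// Random strategy: uniformly pick `samples` configurations, each of which is
// evaluated in one experiment (one phase execution).
std::vector<Configuration> random_search(const std::vector<Configuration>& space, int samples) {
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<std::size_t> pick(0, space.size() - 1);
    std::vector<Configuration> chosen;
    for (int i = 0; i < samples; ++i)
        chosen.push_back(space[pick(gen)]);
    return chosen;
}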

DTA is then performed by PTF, which consists of a frontend and a hierarchy of analysis agents, in conjunction with Score-P. PTF was extended with two new tuning plugins, i.e., the intraphase plugin and the interphase plugin, to perform intra-phase and inter-phase tuning respectively. The plugins perform DVFS, UFS and DCT to determine the best configurations for the rts's of the significant regions. The interphase plugin goes beyond the intraphase plugin by leveraging the similarities in the characteristics of phases at application runtime.

To perform DTA, the frontend first executes a single phase and requests the region and rts definitions from Score-P. To enable call-path profiling measurements, we extended the OA interface of Score-P on the monitoring side to transfer profiling data for the rts's. Thus, at every phase enter event, Score-P collects profiling data for the phase and the rts's of the instrumented regions that are called within the phase. At the end of the iteration, i.e., the phase exit event, three types of data are transferred from Score-P to PTF over the OA interface: flat profile data for regions, call-path profile data for rts's, and the requested metrics. The flat profile data consists of aggregated measurements and counters for the requested metrics for a significant region, whose identifiers are the region name, region id, the file in which it is called, and the file line number of the region call site. The call-path data for the rts's consists of the measurements and counters for the requested metrics for individual rts's of the significant regions, which are identified using the calling region name, region id, scorep id, parent scorep id, and the user parameter name and value if user parameters are defined. The region id, scorep id and parent scorep ids are the handles that Score-P generates for the regions and the rts's during the application execution. Table 6.1 summarizes the identifiers generated for the significant regions and the rts's during DTA.

After storing the region information, the frontend marks the rts's of significant regions as valid rts's. Only these rts's will then be tuned by the plugins. Partial Calling Context Trees (CCT)¹ are generated at each analysis agent for the MPI processes controlled by it. The partial trees are then gathered in the PTF frontend to create the complete tree that is representative of the application structure as seen by Score-P. We created a new RTS Database to store the rts's received from the OA interface. The phase region represents the root node of the CCT, and its children represent the call sites of the significant regions, i.e., the rts's of the significant regions. Each rts is linked to the calling region using the identifiers region id and region name. The user parameter (represented as parameter_name=value) defined for a region is represented using a new rts. Thus, in case of the example in Listing 6.1, a separate node representing an rts of interpolate is created for each grid level.

¹ A context-sensitive version of a call graph.


The frontend then calls the interphase plugin if inter-phase dynamism was detected, or the intraphase plugin otherwise. Both plugins perform four tuning steps, in which they request measurements and/or perform tuning actions. The following steps describe the general sequence of actions performed by the two tuning plugins.

1. Initialization: First, the plugins read the ranges (minimum, maximum and the step size) of the tuning parameters, the search algorithm, and the objective to tune the application for, from the configuration file. If no search strategy is specified, the intraphase plugin uses the individual search. The interphase plugin only uses the random strategy. If no objective is specified, the default objective function is set to energy, which is the energy consumption of the entire node. The intraphase plugin additionally reads the input identifiers from the input specification file as described in Section 6.2.1, and the ATPs from the ATP specification file as described in Section 6.2.1.2.

2. Experiment creation: The plugins use the search algorithm to create the search space, and then create individual experiments to test the effects of the system configuration on the application execution, where each experiment is the execution of a single phase. PTF configures an analysis strategy only for the valid rts's, which is executed by the analysis agents after performing the experiments. Every analysis strategy processes different measurements. For example, the interphase plugin requests the InterphaseProps strategy that processes both PAPI hardware metrics and the requested objectives.

3. Experiment execution: In each experiment, PTF sends two types of requests to Score-P via the OA interface:

   • Tuning request: PTF generates tuning requests for each rts and the phase to configure the tuning parameters in order to measure the effect of the new configuration on the tuning objective. The tuning request is read by the OA interface in Score-P, and sent to the RRL, which applies the new configuration. The rts tuning request consists of four inputs: the region id where the new configuration should be applied, the list of additional user parameters, the list of tuning parameters and their values, and the ranks for which the configuration should be applied. Listing 6.4 shows an rts tuning request sent from PTF for the interpolate region with a user-defined parameter for the grid level for the MG code example in Listing 6.1. The request encodes a switch of the current core frequency to 2.0 GHz, the uncore frequency to 2.5 GHz, and the number of threads to 1 for the rts whose call-path is /VCycle/mg3P/interp/level=3. The route to this rts is determined by the rts call stack, which consists of four elements, starting from the phase region VCycle, whose Score-P region id is 39, followed by mg3P, whose region id is 30, interp with region id 35, and finally the user parameter with region id 36.

Listing 6.4: Rts tuning request sent from PTF to Score-P.

RTSTUNINGREQUESTS((39,INTPARAMS=(),UINTPARAMS=(),STRINGPARAMS=()),
                  (30,INTPARAMS=(),UINTPARAMS=(),STRINGPARAMS=()),
                  (35,INTPARAMS=(),UINTPARAMS=(),STRINGPARAMS=()),
                  (36,INTPARAMS=("level"=3),UINTPARAMS=(),STRINGPARAMS=()))=
                  ("NUMTHREADS"=1,"CPU_FREQ"=2000,"UNCORE_FREQ"=2500)

   • Measurement requests: The plugin sends measurement requests to the analysis agents for several global metrics, such as execution time, PAPI counters or energy consumption. Listing 6.5 shows an example of a measurement request sent from PTF to Score-P through the OA interface. The request triggers Score-P to collect node energy, defined as x86_energy/BLADE/E from the x86_energy_sync_plugin [120]:


Listing 6.5: Measurement request sent from PTF to Score-P.

REQUEST[0] GLOBAL METRIC PLUGIN "x86_energy_sync_plugin" "x86_energy/BLADE/E"

The OA interface of Score-P parses the metric request, and returns the measurements to PTF after executing the tuning request. Each analysis agent stores the measurements returned by Score-P for those MPI processes controlled by it. The measurements for the phase and the objective values for the rts’s are then stored in the Performance Database, and accessed in the form of properties by the analysis strategies to evaluate the objective functions. The objective value is then propagated to the tuning plugin. The Performance Database also supports generic energy counters for core and node energy to enable easier porting to other systems.

If energy is selected as the objective, Score-P returns the consumed energy via a designated single process on each node, since energy is a global metric. The plugins then compute the node energy consumption by aggregating the values returned by the designated processes across all the nodes for an MPI application.

4. Process results: When there are no more experiments left to be executed, the plugins determine the best setting of the tuning parameters for the selected objective, and compute the savings incurred as a result of DTA.

Both plugins perform steps 1-4 in the first tuning step. At the end of a tuning step, PTF restarts the application, and repeats steps 2-4 during each subsequent tuning step. Sections 6.2.1 and 6.2.2 describe each tuning step performed by the intraphase plugin and interphase plugin respectively in more detail.

6.2.1 Tuning Intra-Phase Dynamism

PTF tunes the intra-phase dynamism by executing the intraphase plugin when there are no changes in the dynamic characteristics across the sequence of phases. The intraphase plugin executes four tuning steps: default execution, ATP tuning, system parameters tuning, and verification, as illustrated in Figure 6.1. First, the plugin executes a single phase of the application using the default settings of the tuning parameters. Next, it determines the optimal configuration of the ATPs, and then the optimal configurations of the system-level tuning parameters for the phase and the rts's. Finally, the plugin checks the variations in the tuning results, and computes the energy savings. The tuning steps are described in detail in Sections 6.2.1.1 to 6.2.1.4.

The plugin implements novel features, such as tuning model merging for different input sets, and tuning of ATPs. The plugin not only tunes the application for a single input, but also learns from running the application for different inputs. We characterize variations in application executions for different input sets via domain knowledge specification for input identifiers, which leads to the identification of more rts’s with different characteristics. Input identifiers may be simple, such as the grid size of the application domain and the number of processors, or complex, like the number of contact points in metal forming simulations like INDEED. The domain knowledge specification interface allows the application expert to provide input identifiers in an accompanying input specification file in the form of key-value pairs, as shown in Listing 6.6.

The input identifier specification files are read by both PTF during DTA and RRL during runtime tuning. PTF reads the specification file called IID SPEC via the command line option using the flag --input-desc="IID SPEC", and RRL interprets them via an environment variable, as described in Section 6.3.



Figure 6.1: The workflow of the intraphase tuning plugin, consisting of four tuning steps: default execution, ATP tuning, system parameters tuning, and verification. The plugin merges the tuning models for new inputs, and generates the final application tuning model at the end of DTA.


Listing 6.6: Input identifier specification file IID_SPEC.

identifier name : ContactPoints : ProblemSize : <256 256 256>

It is not necessary to pass the same input identifier specification file to both PTF and the RRL. If input identifiers from one input file are missing in another input file, the missing identifiers are handled in DTA with a default value. In the MG benchmark in Listing 6.1, when the grid level switches from a coarser grid to a finer grid, the compute intensity characteristics might change from compute-bound to memory-bound. The grid level at which this switch occurs depends, for example, on the grid (or problem) size given in the input file. Thus, the user parameter for the grid level k for the region interpolate is not sufficient to improve the tuning model for different input sizes. In addition to the size of the finest grid, the number of processes is also important: the more processes are used, the better the data are distributed over the caches, and the earlier, in terms of grid level, the application switches between memory- and compute-bound behaviour.

6.2.1.1 Tuning step 1: Default execution

In the first step, the intraphase plugin executes a single phase of the application to determine the objective value for the default execution, i.e., default settings of the tuning parameters as provided by the batch system, and the default values for the ATPs, specified in the ATP specification file. This stage is run as part of the first steps in which PTF gathers the program’s region and rts information in the first phase of the application. The measurements for the default execution are required to evaluate the improvement in the objective value, and are used later to compare the results in the last tuning step. Finally, the objective values are stored in the RTS Database for the phase and the rts’s of the significant regions.

6.2.1.2 Tuning step 2: ATP tuning

The intraphase plugin also tunes ATPs to choose between different decomposition algorithms, preconditioners or solvers. The aim of using ATPs is to exploit the possibility to switch between different implementations or, more generally, to choose between code-level alternatives at runtime. To enable ATP tuning, the application developer passes the details about the tuning parameters to the tuning system by annotating the source code at specific locations. Listing 6.7 illustrates how the ATP library API functions can be used in two dummy functions foo and bar to declare control variables from the code as ATPs. In lines 4 and 24, ATP_PARAM_DECLARE is used to declare the parameter names, namely solver and mesh, their types, default values and the domain. ATP_ADD_VALUES in lines 6 and 26 then provides the possible values each parameter can take. During the experiments in DTA and RAT, ATP values have to be assigned to the corresponding control variable. The ATP_PARAM_GET function in lines 7 and 27 establishes the mapping between the declared ATP and the corresponding control variable. The ATP_EXPLORATION_DECLARE macro in line 18 holds a list of search methods to apply on the domain. An application source code may contain several ATPs, ideally independent from each other, thus enabling independent tuning. In practice, this is not always the case, as the values of one parameter may depend on those of another. The application developer may therefore express such dependences mathematically in the form of logical constraints that group dependent ATPs into domains, in order to create valid configurations of the tuples of parameters. Line 28 declares a constraint between the ATPs solver and mesh.


Listing 6.7: Domain knowledge specification for ATP exploration using the ATP library.

 1 void foo(){
 2     int atp_cv;
 3     ...
 4     ATP_PARAM_DECLARE("solver", RANGE, 1, "DOM1");
 5     int32_t solver_values[2] = {1,5};
 6     ATP_ADD_VALUES("solver", solver_values, 2, "DOM1");
 7     ATP_PARAM_GET("solver", &atp_cv, "DOM1");
 8
 9     switch (atp_cv){
10     case 1:
11         // choose solver 1
12         break;
13     case 2:
14         // choose solver 2
15         break;
16     }
17     int32_t hint_array[2] = {GENETIC, RANDOM};
18     ATP_EXPLORATION_DECLARE(hint_array, "DOM1");
19 }
20
21 void bar(){
22     int atp_ms;
23     ...
24     ATP_PARAM_DECLARE("mesh", RANGE, 40, "DOM1");
25     int32_t mesh_values[2] = {0,80};
26     ATP_ADD_VALUES("mesh", mesh_values, 2, "DOM1");
27     ATP_PARAM_GET("mesh", &atp_ms, "DOM1");
28     ATP_CONSTRAINT_DECLARE("const1", "(solver = 1 && 0 <= mesh <= 40) || (solver = 2 && 40 <= mesh <= 80)", "DOM1");
29     if((atp_ms > 1) && (atp_ms <= 40)){
30         // choose mesh size 1
31     }
32     if((atp_ms > 40) && (atp_ms <= 80)){
33         // choose mesh size 2
34     }
35 }

If ATPs are specified in the application, the intraphase plugin tunes them first, since these typically select algorithmic alternatives, and the selection is independent of the more fine-grained tuning provided by the system parameters. If there is no ATP specification, the tuning plugin directly performs the third tuning step, described in Section 6.2.1.3. The ATP tuning step involves two components: the ATP server and the ATP library. In the first application phase, the ATP library, which was implemented by Intel, generates an ATP configuration file in the JSON format that specifies the ATPs with their domains and their constraints. The intraphase tuning plugin reads the ATPs from the ATP description file and starts the ATP server to request the valid points for each ATP.


The generation of a list of valid points for the parameters requires the resolution of the constraints between parameters, and is solved using a third-party constraint solver called the Omega Calculator [99]. The Omega Calculator is a constraint solver for affine functions, and generates the valid values for each ATP for the recorded constraints by filtering out the values that do not satisfy the constraints. The Omega Calculator software is composed of the Omega Library, which constitutes the core of the solver, as well as a text interpreter to query the library. One big advantage of using the Omega Calculator is the small computational time needed to solve constraints based on affine functions, which makes it fit for solving the constraints at runtime [97]. The plugin then generates the search space of ATPs using either the exhaustive or the individual search strategy. This may be set by the user in the configuration file before starting PTF. The search space can consist of a single or multiple ATP domains. The exhaustive strategy explores all valid points of the ATPs by creating a search space from the cross-product of the points. In each experiment, the ATP library receives the valid ATP settings, and assigns the values to the control variables. The individual strategy tunes the domains one by one: it starts with the first domain, evaluates all valid points for this domain, fixes the best point, and then explores the next ATP domain until all the domains are handled. Finally, at the end of this tuning step, we obtain an optimal configuration for the ATPs.
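The difference between the two ATP search strategies can be sketched as follows. The Domain and Objective types and the evaluate callback are illustrative placeholders rather than the PTF plugin interface; the sketch only contrasts the cross-product enumeration of the exhaustive strategy with the domain-by-domain greedy search of the individual strategy.

#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// A domain is simply a list of valid values for one ATP (placeholder type).
using Domain = std::vector<int>;
// Objective callback: takes one chosen value per domain, returns e.g. the measured energy.
using Objective = std::function<double(const std::vector<int>&)>;

// Exhaustive strategy: evaluate the full cross-product of all domains.
std::vector<int> exhaustive(const std::vector<Domain>& domains, const Objective& evaluate) {
    std::vector<int> current(domains.size()), best;
    double best_obj = std::numeric_limits<double>::max();
    std::function<void(std::size_t)> recurse = [&](std::size_t d) {
        if (d == domains.size()) {
            double obj = evaluate(current);
            if (obj < best_obj) { best_obj = obj; best = current; }
            return;
        }
        for (int v : domains[d]) { current[d] = v; recurse(d + 1); }
    };
    recurse(0);
    return best;
}

// Individual strategy: tune one domain at a time, keeping earlier choices fixed and
// using the first (default) value for the domains that are not yet tuned.
std::vector<int> individual(const std::vector<Domain>& domains, const Objective& evaluate) {
    std::vector<int> chosen(domains.size());
    for (std::size_t d = 0; d < domains.size(); ++d) chosen[d] = domains[d].front();
    for (std::size_t d = 0; d < domains.size(); ++d) {
        double best_obj = std::numeric_limits<double>::max();
        int best_v = chosen[d];
        for (int v : domains[d]) {
            chosen[d] = v;
            double obj = evaluate(chosen);
            if (obj < best_obj) { best_obj = obj; best_v = v; }
        }
        chosen[d] = best_v;  // fix the best point before moving to the next domain
    }
    return chosen;
}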

6.2.1.3 Tuning step 3: System parameters tuning

The third tuning step explores the system software tuning parameters, and determines an optimal system configuration. In this step, the optimal configuration for the ATPs obtained in the previous tuning step is applied. The search strategy for this tuning step is specified in the configuration file. If no strategy is specified, the individual search algorithm is selected by default. The plugin generates the search space based on the ranges of the tuning parameters it reads during initialization, and explores a single system configuration in an experiment. During each experiment, the plugin sends rts tuning requests for the phases to set the system configuration for the entire phase, causing all the rts’s to be executed with the same configuration. The plugin then sends measurement requests for the performance counters as well as the objective functions for the phase and the rts’s. Individual measurements are then collected at the analysis agents, which compute the node energy for each rts. The measurements are serialized as properties and sent to the frontend, where they are deserialized. At the end of this tuning step, the objective values for all the tested system configurations are evaluated to determine the static-best system configuration. Afterwards, the tuning results for individual rts’s are evaluated to compute the optimum rts-best configurations. The combination of the best static configuration and the rts-specific best configurations is finally encapsulated in the tuning model.
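A minimal sketch of how the static-best and rts-best configurations could be derived from the collected measurements is given below. The Config type and the map-based result layout are assumptions for illustration; they do not correspond to the RTS Database or the property classes used by PTF.

#include <limits>
#include <map>
#include <string>

// A system configuration, e.g. {core frequency, uncore frequency} in MHz (illustrative).
struct Config { int core_freq; int uncore_freq; };
// Ordering so that configurations can be used as map keys.
bool operator<(const Config& a, const Config& b) {
    return a.core_freq != b.core_freq ? a.core_freq < b.core_freq
                                      : a.uncore_freq < b.uncore_freq;
}

// phase[config]                = objective value of the whole phase under that configuration
// rts[rts call-path][config]   = objective value of that rts under that configuration
using PhaseResults = std::map<Config, double>;
using RtsResults   = std::map<std::string, std::map<Config, double>>;

// Static-best: the single configuration that minimizes the phase objective.
Config static_best(const PhaseResults& phase) {
    Config best{};
    double best_obj = std::numeric_limits<double>::max();
    for (const auto& [cfg, obj] : phase)
        if (obj < best_obj) { best_obj = obj; best = cfg; }
    return best;
}

// Rts-best: for each rts, the configuration that minimizes its own objective.
std::map<std::string, Config> rts_best(const RtsResults& rts) {
    std::map<std::string, Config> best;
    for (const auto& [path, results] : rts) {
        double best_obj = std::numeric_limits<double>::max();
        for (const auto& [cfg, obj] : results)
            if (obj < best_obj) { best_obj = obj; best[path] = cfg; }
    }
    return best;
}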

6.2.1.4 Tuning step 4: Verification

The verification step is the last tuning step, and evaluates the tuning success. It uses the objective values obtained in the first tuning step for the default execution as the basis to determine whether the best configurations selected for the phase and the rts's are valid. It configures the RRL with the best configurations, and performs experiments to evaluate the stability of the results. During the experiments, the plugin sends tuning requests for the phase to switch to the phase-best configuration, as well as for the individual rts's to switch to the rts-best configurations. Thus, the RRL performs dynamic switching by applying the static-best configuration at the start of the phase when the phase enter event is called, and the rts-best configurations when the region enter event is called. If an rts of an insignificant region is encountered, the RRL sets the system configuration to the default setting.


The experiments are repeated three times to check for any variation in the results due to hardware variations or phase differences. Finally, the plugin outputs the theoretical savings characterizing the tuning result as the improvement due to static and dynamic tuning. It determines the savings for the rts’s and for the whole phase using the following three definitions:

1. Static savings for the rts’s: The total improvement in the objective value when the rts’s are executed with the static-best configuration compared to the objective value for the default setting of the tuning parameters.

\[
S_{RTS}^{static} = \frac{\sum_{rts \in RTS} \left( obj_{rts}^{default} - obj_{rts}^{static} \right)}{\sum_{rts \in RTS} obj_{rts}^{default}} \times 100 \qquad (6.7)
\]

2. Dynamic savings for the rts’s: The total improvement in the objective value for the individual rts- specific best configurations compared to the objective value for the static-best configuration, i.e., best settings for the phase.

\[
S_{RTS}^{dynamic} = \frac{\sum_{rts \in RTS} \left( obj_{rts}^{static} - obj_{rts}^{opt} \right)}{\sum_{rts \in RTS} obj_{rts}^{static}} \times 100 \qquad (6.8)
\]

3. Static savings for the whole phase: The total improvement in the objective value for the static-best configuration of the phase over the objective value for the default setting of the tuning parameters.

\[
S_{phase}^{static} = \frac{obj_{phase}^{default} - obj_{phase}^{opt}}{obj_{phase}^{default}} \times 100 \qquad (6.9)
\]

where,

obj_{phase}^{default} = objective value of the phase under the default configuration
obj_{phase}^{opt} = objective value of the phase under the static-best configuration
obj_{rts}^{default} = objective value of the rts under the default configuration
obj_{rts}^{static} = objective value of the rts under the static-best configuration
obj_{rts}^{opt} = objective value of the rts under the rts-specific optimal configuration
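For concreteness, the three savings metrics of Equations (6.7) to (6.9) can be computed from the recorded objective values as in the following sketch; the map-based data layout is an assumption for illustration and not the RTS Database interface.

#include <map>
#include <string>

// Per-rts objective values (e.g. energy in Joules) recorded during DTA (illustrative layout).
struct RtsObjectives {
    double deflt;        // default configuration
    double static_best;  // static-best (phase-wide) configuration
    double opt;          // rts-specific best configuration
};

// Equation (6.7): static savings for the rts's.
double static_savings_rts(const std::map<std::string, RtsObjectives>& rts) {
    double num = 0.0, den = 0.0;
    for (const auto& [path, o] : rts) { num += o.deflt - o.static_best; den += o.deflt; }
    return num / den * 100.0;
}

// Equation (6.8): dynamic savings for the rts's over the static-best configuration.
double dynamic_savings_rts(const std::map<std::string, RtsObjectives>& rts) {
    double num = 0.0, den = 0.0;
    for (const auto& [path, o] : rts) { num += o.static_best - o.opt; den += o.static_best; }
    return num / den * 100.0;
}

// Equation (6.9): static savings for the whole phase.
double static_savings_phase(double phase_default, double phase_static_best) {
    return (phase_default - phase_static_best) / phase_default * 100.0;
}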

After the execution of the plugin, the tuning model is generated (see Section 6.2.3). The entire tuning process, i.e., the four tuning steps, is repeated if further inputs are provided in the input specification file. Individual tuning models are created for each application input, and finally merged together to create the final application tuning model.

6.2.2 Tuning Inter-Phase Dynamism

PTF performs inter-phase dynamism tuning by executing the interphase plugin when there are changes in the dynamic characteristics across the sequence of phases. The interphase plugin executes four tuning steps:

default execution, cluster analysis, selective tuning and verification, as illustrated in Figure 6.2. First, the plugin executes a subset of the overall phases of the application using the default settings of the tuning parameters. Next, it executes experiments using randomly picked configurations of the tuning parameters for the selected phases, and groups similarly behaving phases into clusters. To improve the tuning results, it performs a selective tuning step by evaluating configurations using a probabilistic random search based on previous good configurations. It then selects cluster-best phase configurations and rts-specific best configurations. Finally, the plugin checks for tuning success by executing the phases with their cluster-best configurations and the rts's with the individual rts-specific best configurations to verify whether the actual energy savings are close to the computed theoretical savings while taking into account the switching overhead. The tuning steps are described in detail in Sections 6.2.2.1 to 6.2.2.4.

The plugin not only determines optimal configurations for different rts's of the code regions depending on the execution context by exploiting intra-phase dynamism, but also determines different best configurations for variations in the characteristics of the phases over time. The novel features of the interphase plugin are the exploitation of phase behavior to cluster similarly behaving phases into groups that share the same optimal cluster-best configuration, search space optimization using a probabilistic random search to improve the confidence of the tuning, and finally, verification of the theoretical savings against the actual savings.

To identify phases with different characteristics or behavior, phase identifiers are specified using domain knowledge, and are used in the tuning model to distinguish rts's based on the variation in the phase behavior. Phases that have similar characteristics can be grouped together into a single cluster using phase identifiers, as described in Section 5.1. This enables the selection of different configurations for rts's in different clusters. Thus, the interphase plugin takes into account these dynamic characteristics to determine the optimal settings, as opposed to the intraphase plugin, which ignores similarities in phase behavior, and exploits only the variations among the rts's of the significant regions. Phase identifiers are provided in the same way as region identifiers via Score-P user parameters that are attached to the phase region, and are described in Section 6.3. Phase identifiers should be chosen carefully as they decide which phases are similar, and thus have a high impact on the selection of the cluster-best configurations. Typically, they should be provided by the application expert, who knows how the application behaves when certain aspects of the code are modified.

6.2.2.1 Tuning Step 1: Default Execution

In the first tuning step, the interphase plugin reads the number of samples from the configuration file. The number of samples determines the number of experiments that will be executed, where each experiment evaluates a single phase of the application. Each experiment measures the objective value for the default configuration of the tuning parameters provided by the batch system. In each experiment, the plugin sends an rts tuning request for the phase, and issues measurement requests for the objective values as well as hardware performance counters for the phase and the rts’s. The measurements for the individual rts’s are then stored in the RTS Database. It should be noted that at the beginning of the PTF run, a single phase of the application is first executed to collect the significant region information. Sometimes, not all the significant regions are encountered in the first phase, resulting in incomplete static information for the missing regions. Hence, we extended the frontend to store additional significant regions encountered during subsequent phases in the RTS Database. Once the new region information is stored in the RTS Database, subsequent experiments can then issue measurement requests for the rts’s of the newly encountered significant region. The measurements in the RTS Database collected for this tuning step are used to compute the energy savings in the final tuning step.



Figure 6.2: The workflow of the interphase tuning plugin, consisting of four tuning steps: default execution, cluster analysis, selective tuning, and verification. The plugin generates the application tuning model at the end of the tuning steps.


6.2.2.2 Tuning Step 2: Cluster Analysis

At the start of the second tuning step, the plugin restarts the application, and creates the same number of experiments as in the previous tuning step. The plugin sends rts tuning requests for the phase to test the effect of a randomly selected configuration on the application execution. It also sends measurement requests for the objective values for the phase and the rts’s. Additionally, the plugin collects hardware performance counters that are used to characterize the phases and perform clustering. Section 5.1 describes in detail the PAPI [13] events that were used to collect the hardware counters. The counters are then converted into features that are used for clustering by DBSCAN and spectral clustering, as described in Sections 5.1.1.1 and 5.1.1.2 respectively.

6.2.2.3 Tuning Step 3: Selective Tuning

The previous tuning step evaluates only a subset of the entire search space of tuning parameters. This means that, in the best case, each phase in a cluster is executed with a different configuration of tuning parameters; in the worst case, all the phases in the cluster are executed with the same configuration. This makes it difficult to select the best configuration for the cluster with high confidence due to the absence of other configurations for comparison. To overcome this problem, we implemented a new tuning step that evaluates each phase in a cluster with a configuration that has never been executed for any phase in that cluster by randomly picking a configuration based on the Gaussian distribution. The idea is that the chances of picking a configuration near a good configuration, which results in a lower normalized energy consumption, are higher than picking a configuration near a worse configuration, which results in a higher normalized energy. In this tuning step, the plugin restarts the application and then optimizes the search space, as described in Section 5.2 using Algorithms 5 and 6. The plugin then requests the objective value and the performance counters in each experiment, similar to the previous tuning steps. After executing all the experiments, the plugin computes the cluster-best configuration, as well as rts-specific best configurations for each rts called in the phases of the cluster. This way, the plugin not only tunes the inter-phase dynamism, but also the intra-phase dynamism arising from changes in the behaviour of the rts's. The best configurations are chosen such that the normalized objective value, for example Energy / #AVX instructions, is minimized. This allows the plugin to tune phases with different amounts of work but of the same kind, such as more iterations of an iterative solver, so that the energy per instruction is minimized.
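The following sketch illustrates the intuition behind the probabilistic selection; it is a simplified stand-in for Algorithms 5 and 6 of Section 5.2, and the attractor weighting, the kernel width sigma, and the Config layout are assumptions made only for this example.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One point in the search space, e.g. (core frequency, uncore frequency) in MHz (illustrative).
struct Config { double core_freq; double uncore_freq; };

// A configuration already evaluated for this cluster, with its normalized energy.
struct Evaluated { Config cfg; double normalized_energy; };

// Squared Euclidean distance between two configurations.
static double dist2(const Config& a, const Config& b) {
    double dc = a.core_freq - b.core_freq, du = a.uncore_freq - b.uncore_freq;
    return dc * dc + du * du;
}

// Pick one untested configuration (assumes 'untested' is non-empty), favouring points
// close to previously good configurations: each candidate is weighted by a Gaussian
// kernel centred on the evaluated points, scaled by how low their normalized energy was.
Config pick_next(const std::vector<Config>& untested,
                 const std::vector<Evaluated>& history,
                 double sigma, std::mt19937& rng) {
    std::vector<double> weights(untested.size(), 0.0);
    double worst = 0.0;
    for (const Evaluated& e : history) worst = std::max(worst, e.normalized_energy);
    for (std::size_t i = 0; i < untested.size(); ++i) {
        for (const Evaluated& e : history) {
            double kernel = std::exp(-dist2(untested[i], e.cfg) / (2.0 * sigma * sigma));
            double goodness = worst - e.normalized_energy;  // larger for better points
            weights[i] += goodness * kernel;
        }
    }
    double total = 0.0;
    for (double w : weights) total += w;
    if (total <= 0.0) {  // no usable history yet: fall back to a uniform choice
        std::uniform_int_distribution<std::size_t> uni(0, untested.size() - 1);
        return untested[uni(rng)];
    }
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());
    return untested[pick(rng)];
}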

6.2.2.4 Tuning Step 4: Verification

Similar to the intraphase plugin, the interphase plugin performs the verification step as the last tuning step to evaluate the tuning success. It uses the objective values for the phase and the rts’s obtained during the first tuning step as the basis for the evaluation to determine if the optimum configurations are valid. In each experiment, the plugin sends rts tuning requests to the RRL to switch to the corresponding cluster-best configuration at the start of a new phase when the phase enter event is called. Similarly, the configurations for the rts’s are switched to the rts-specific best configurations for the current cluster when the region enter event is called. If a phase was marked as a noise point in the cluster analysis step, it is not present in any cluster, and is thus executed using the default system configuration. Similarly, if an rts of an insignificant region, i.e., an invalid rts is encountered, it is run with the default setting. The runtime system thus performs dynamic tuning by

enforcing the static phase configuration for the phase, and switching system configurations for individual rts's. The plugin sends measurement requests after switching the configurations to measure the effect of inter-phase tuning and to verify whether the theoretical savings match the true savings, taking into account the switching overhead.

During production runs, phases belonging to different clusters must be identified by their cluster id by the RRL in order to switch to the right configuration during a phase enter event. Therefore, the plugin performs an additional step during DTA to add the phase identifier, i.e., the cluster id into the CCT once all the experiments are completed. The cluster id represents the phase identifier, since it distinguishes phases by establishing their membership in different clusters. To enable the selection of the right configuration at runtime, the plugin clones the CCT by creating new child nodes representing different cluster ids under the phase node, and then cloning all the children of the phase node.

Figure 6.3a shows the initial CCT of the rts’s of a toy application containing significant regions bar and baz. The call-path of the phase region is represented as /PhaseRegion. If the cluster analysis step detected two clusters, the plugin creates two new nodes under the phase node, and clones all the child nodes of the phase region, as illustrated in Figure 6.3b. After this step, the cluster id represents all the phases belonging to a particular cluster. Thus, the call-path for a phase belonging to cluster 1 is /PhaseRegion/Cluster=1, and the call-path to the rts of bar called from this phase is /PhaseRegion/Cluster=1/bar.
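The cloning step can be pictured as a small tree transformation. The CCTNode type below is a placeholder rather than the PTF data structure; the sketch only shows how one child node per cluster id is inserted under the phase node and how the original children are duplicated underneath it.

#include <memory>
#include <string>
#include <vector>

// Minimal calling-context-tree node (placeholder type for illustration).
struct CCTNode {
    std::string name;                                // e.g. "PhaseRegion", "bar", "Cluster=1"
    std::vector<std::unique_ptr<CCTNode>> children;
};

// Deep copy of a subtree.
static std::unique_ptr<CCTNode> clone_subtree(const CCTNode& node) {
    auto copy = std::make_unique<CCTNode>();
    copy->name = node.name;
    for (const auto& child : node.children)
        copy->children.push_back(clone_subtree(*child));
    return copy;
}

// Insert one "Cluster=<id>" node per detected cluster under the phase node and
// place a clone of the original children under each cluster node.
void add_cluster_nodes(CCTNode& phase, int num_clusters) {
    std::vector<std::unique_ptr<CCTNode>> original = std::move(phase.children);
    phase.children.clear();
    for (int id = 1; id <= num_clusters; ++id) {
        auto cluster = std::make_unique<CCTNode>();
        cluster->name = "Cluster=" + std::to_string(id);
        for (const auto& child : original)
            cluster->children.push_back(clone_subtree(*child));
        phase.children.push_back(std::move(cluster));
    }
}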

Figure 6.3: Calling Context Tree (CCT) for a toy application consisting of two significant regions, bar and baz: (a) before cloning; (b) after cloning.

The plugin then stores the tuning results, including the objective values, and the number of instances of a region executed in a run for each node. The cluster-best configuration is stored in the cluster node for the group of phases belonging to it. Additionally, each cluster node stores the phase identifier data, such as the cluster id, the phases belonging to that cluster, and the ranges of the features, i.e., the maximum and minimum value for the cluster. Storing all the information in a single node has the advantage of reducing the memory overhead because a single cluster node can be used to represent the tuning result for all the phases belonging to it. The child rts nodes store the rts-specific best configurations.

Finally, the plugin outputs the theoretical savings characterizing the tuning result as the reduction in the normalized energy consumption due to static and dynamic tuning using the following four definitions [121]:

1. Static savings for the whole phase: The improvement in the normalized objective value for the static-best configuration for the phase compared to the normalized objective value for the default


configuration, accumulated over all the clusters.

\[
S_{phase}^{static} = \frac{\sum_{cluster=1}^{n} \left( obj_{phase,cluster}^{default} - obj_{phase,cluster}^{opt} \right)}{\sum_{cluster=1}^{n} obj_{phase,cluster}^{default}} \times 100 \qquad (6.10)
\]

2. Static savings for the rts’s: The improvement in the normalized objective value for the rts’s for the static-best setting for the cluster compared to the normalized objective value for the rts’s for the default configuration, accumulated over all the clusters.

\[
S_{RTS}^{static} = \frac{\sum_{cluster=1}^{n} \sum_{rts \in RTS} \left( obj_{rts,cluster}^{default} - obj_{rts,cluster}^{static} \right)}{\sum_{cluster=1}^{n} \sum_{rts \in RTS} obj_{rts,cluster}^{default}} \times 100 \qquad (6.11)
\]

3. Dynamic savings for the rts’s (w.r.t. static-best configuration): The improvement in the normalized objective value for rts-specific best configuration over the normalized objective value for rts’s for the static-best (cluster-best) configuration for the cluster, accumulated over all the clusters.

\[
S_{RTS}^{dyn1} = \frac{\sum_{cluster=1}^{n} \sum_{rts \in RTS} \left( obj_{rts,cluster}^{static} - obj_{rts,cluster}^{opt} \right)}{\sum_{cluster=1}^{n} \sum_{rts \in RTS} obj_{rts,cluster}^{static}} \times 100
\]

4. Dynamic savings for the rts’s (w.r.t. default configuration): The improvement in the normalized objective value for rts-specific best configuration over the normalized objective value for the default configuration accumulated over all the clusters.

\[
S_{RTS}^{dyn2} = \frac{\sum_{cluster=1}^{n} \sum_{rts \in RTS} \left( obj_{rts,cluster}^{default} - obj_{rts,cluster}^{opt} \right)}{\sum_{cluster=1}^{n} \sum_{rts \in RTS} obj_{rts,cluster}^{default}} \times 100
\]

where,

obj_{phase,cluster}^{default} = objective value of the phase for the default configuration for a cluster
obj_{phase,cluster}^{opt} = objective value of the phase for the cluster-best configuration
obj_{rts,cluster}^{default} = objective value of the rts for the default configuration for a cluster
obj_{rts,cluster}^{static} = objective value of the rts for the cluster-best configuration
obj_{rts,cluster}^{opt} = objective value of the rts for the rts-specific optimal configuration for a cluster


6.2.3 Generating the Application Tuning Model

After the execution of the intraphase or the interphase plugin, the frontend initiates the process of tuning model generation, which was implemented by NTNU to store the knowledge from the RTS database. The tuning model helps guide the RRL in dynamic configuration switching for the phase and the rts’s during production runs. The tuning model encapsulates the knowledge from DTA about the best found system configurations via scenarios, classifiers, and selectors. The classifier groups rts’s with similar or identical best configurations into scenarios using a similarity score that is computed by aggregating the closeness of system configurations for a particular objective. Clustering is used to reduce the size of the tuning model by limiting the number of scenarios, and thus reduce the associated runtime overhead. A selector then returns the best configuration for each scenario for the specified objective.
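Conceptually, the classifier and selector can be thought of as two lookup tables, as in the following sketch; the types shown are placeholders and do not reflect the actual TMM interfaces, and the default frequencies are merely illustrative.

#include <map>
#include <string>

// Placeholder types for the tuning model structures (not the actual TMM interfaces).
struct Config { int core_freq; int uncore_freq; };   // a system configuration in MHz
using ScenarioId = int;

struct TuningModel {
    // Classifier: maps an rts call-path to the scenario it was grouped into.
    std::map<std::string, ScenarioId> classifier;
    // Selector: returns the best configuration stored for a scenario.
    std::map<ScenarioId, Config> selector;
    // Default configuration used for unknown call-paths (illustrative values).
    Config default_config{2500, 3000};

    Config best_configuration(const std::string& call_path) const {
        auto scenario = classifier.find(call_path);
        if (scenario == classifier.end())
            return default_config;             // unknown rts: keep the default setting
        return selector.at(scenario->second);  // known rts: scenario-best configuration
    }
};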

The tuning model stores slightly different information for intra-phase and inter-phase tuning. Intra-phase tuning using the intraphase tuning plugin not only tunes the application for a single input, but also tunes ATPs and learns from running the application for different inputs. Rts's with different characteristics in different application runs are distinguished with the help of input identifiers, such as the grid size, as described in Section 6.2.1. For each input provided in the input specification file, the entire tuning process, i.e., the four tuning steps in the intraphase plugin, is repeated, and individual tuning models are created for each application input. A tuning model merger [122] deserializes the individual tuning models and extracts all tuning information, such as rts's and their best configurations as well as the corresponding input identifiers. It filters out all duplicated rts's, produces a new final set of scenarios by clustering rts's with similar configurations, and then serializes the merged tuning model information into a single application tuning model in the JSON format.

The clustering mechanism clusters rts-best configurations using hierarchical clustering, and performs three steps [14]: dendrogram generation, cluster generation, and scenario creation. The first step generates a dendrogram, i.e., a tree expressing the similarity of rts's based on the distances between their system configurations. To perform clustering, the algorithm first treats every rts as a cluster with a single element, i.e., the rts itself. It then uses the Lance-Williams algorithm to recursively compute the inter-cluster distances, merge the two closest clusters, and recompute the distance of the newly created cluster to all other clusters, repeating this until all clusters are merged into a single cluster. The cluster generation step uses the dendrogram to create a clustering of the rts's by performing a tree cut, partitioning the tree's nodes into disjoint subsets such that the dispersion of data within clusters is minimized, while the dispersion between the clusters is maximized. It should be noted that the clusters of rts's here do not refer to the cluster analysis results of the inter-phase tuning. The third step groups the clusters into scenarios, and creates a selector that returns an optimal system configuration for each scenario. This system configuration is then used for all rts's in the cluster at runtime.
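The Lance-Williams scheme referred to above updates inter-cluster distances with the generic recurrence below; the concrete coefficients depend on the linkage criterion, which is not specified here, so the formula is given only in its general form.

\[
d(C_i \cup C_j, C_k) = \alpha_i \, d(C_i, C_k) + \alpha_j \, d(C_j, C_k) + \beta \, d(C_i, C_j) + \gamma \, \lvert d(C_i, C_k) - d(C_j, C_k) \rvert
\]

For single linkage, for example, the coefficients are \(\alpha_i = \alpha_j = 1/2\), \(\beta = 0\) and \(\gamma = -1/2\).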

For inter-phase tuning, the tuning model generation clusters rts’s with identical configurations into scenarios with unique ids. First, it iterates over the phases as well as the rts’s from the RTS Database, and creates a new scenario for each cluster-best and rts-best configuration. If another scenario with an identical configuration exists, the two scenarios are merged. Additionally, the tuning model stores information about the clusters generated during DTA, including the cluster ids, the phases belonging to the corresponding clusters, the cluster features and their ranges that were used for clustering by requesting the information from the RTS Database. This additional information is then used by the RRL during production runs to identify the cluster id of an executing phase.


6.3 Runtime Application Tuning

Runtime Application Tuning (RAT), controlled by the Runtime Management module, starts by initializing the application and the underlying system software, including the measurement infrastructure and the RRL. The Runtime Management module instantiates the TMM at the beginning of the application execution. The TMM reads and deserializes the tuning model generated at the end of DTA, and stores the information as hash maps for efficient lookup at runtime. The Runtime Management module was implemented by TUD, and the TMM was implemented by TUD and NTNU.

When an enter region event is called from Score-P during the application run, the Runtime Management module maintains the current call stack and collects additional identifiers, such as the input parameters, phase identifiers or hints/domain knowledge provided by the application expert in the form of user parameters. It constructs the full call-path for every rts using the total set of context elements, i.e., input, phase, and significant region information. The TMM then checks if this region is found in the tuning model, and if found, marks the region as significant. If the region is not found, it is marked as an unknown region. For each significant region, the RTS Management module sends the complete call-path together with the additional identifiers to the TMM, and requests the best configuration. The TMM returns a configuration, which is stored in a hash map using the call-path and the additional identifier information as the key. Subsequent calls to an rts simply involve a lookup of the scenario id for the rts in the hash map.

The returned configuration is passed to the Parameter Controller, which sets the configuration through dedicated Parameter Control Plugins (PCPs) for each tuning parameter. PCPs are also used during DTA in the system parameters tuning and verification tuning steps of the intraphase plugin, and in the cluster analysis and verification tuning steps of the interphase plugin. The PCPs for DVFS and UFS support x86_adapt [65], msr-safe, LIKWID and sysfs. The Parameter Controller maintains a configuration stack that stores the configurations returned by the TMM in one of the following two modes, selected via the environment variable SCOREP_RRL_CHECK_IF_RESET:

• reset: In the reset mode, every configuration of the tuning parameters is pushed onto the configura- tion stack. When the region exit event is called, the current configuration is unset by popping it from the stack, and the previous configuration from the top of the stack is set. The reset mode is enabled by default.

• no_reset: In the no_reset mode, only the default configuration and the current configuration of the tuning parameters are kept on the stack. Each configuration stays active until a new configuration is set, which overwrites the current configuration in the stack.

If input identifiers were specified for intra-phase tuning, the name of the input identifier specification file is exported for the RRL using the environment variable SCOREP_SUBSTRATE_RRL_INPUT_IDENTIFIER_SPEC_FILE. If the environment variable is not set, the tuning stops with an error message. The RRL also uses the ATP library during RAT to set the best configuration of the ATPs for the current input identifier from the merged tuning model upon encountering a valid rts. The ATP library assigns the ATP value to the control variable defined in the application source code. In case no configuration is available for the current input set, the default parameter settings are assigned, both for the ATPs and the other tuning parameters.

For inter-phase tuning, the RRL uses the cluster id returned by the cluster prediction library, which is linked to the application. The cluster number of a phase is used as a phase identifier to express the combination of the cluster features that were used to group the phases based on their similarities during DTA. For runtime tuning, the cluster id is added to C/C++ and Fortran applications as the phase identifier by defining it as a user parameter. This way, the cluster number is added to the call-paths of the rts's when they are called from the phase.


Listing 6.8: Score-P phase identifier specification for runtime cluster prediction using the cluster prediction library for Fortran applications.

1 #include "scorep/SCOREP_User.inc"

2 ...

3 SCOREP_USER_REGION_DEFINE(R1)

4 SCOREP_USER_PARAMETER_DEFINE(param)

5 ...

6 do it = 1, max_iter

7 ! Phase region begins

8 SCOREP_USER_OA_PHASE_BEGIN(R1,"PhaseRegion",0)

9 SCOREP_USER_PARAMETER_INT64(param, "Cluster", predict_cluster())

10 ...

11 SCOREP_USER_OA_PHASE_END(R1)

12 end do

Phase identifiers are annotated using a Score-P user parameter by first declaring a unique handle with the SCOREP_USER_PARAMETER_DEFINE(handle) macro, and then defining the user parameter via SCOREP_USER_PARAMETER_INT64(handle, "Cluster", predict_cluster()). For Fortran applications, the definition requires the Score-P handle as the first argument, followed by the name and the value of the user parameter. Listing 6.8 presents the phase identifier specification for Fortran applications.

For C/C++ applications, the header file cluster_predictor.h is included in the file containing the phase region. The phase identifier is defined directly using SCOREP_USER_PARAMETER_INT64("Cluster", predict_cluster()), which requires only the parameter name and the value, as shown in Listing 6.9. The user parameter definition for both C/C++ and Fortran applications immediately follows the OA phase region annotation. Both application types are then linked to the cluster prediction library by adding the linker flags in the Makefile. The user parameter definition sets the parameter name to Cluster, and calls the cluster prediction library by invoking the predict_cluster() function. The cluster prediction library predicts and returns the cluster number of the current phase, which is then set as the value of the user parameter Cluster.

Listing 6.9: Score-P phase identifier specification for runtime cluster prediction using the cluster prediction library for C/C++ applications.

#include <scorep/SCOREP_User.h>
#include <cluster_predictor.h>
...
SCOREP_USER_REGION_DEFINE(R1)
...
for (it = 1; it < max_iter; it++) {
    // Phase region begins
    SCOREP_USER_OA_PHASE_BEGIN(R1, "PhaseRegion", 0)
    SCOREP_USER_PARAMETER_INT64("Cluster", predict_cluster())
    ...
    SCOREP_USER_OA_PHASE_END(R1)
}

The predict_cluster() method calls the cluster prediction library, which is implemented as a Singleton class so that only a single instance can ever exist. Section 5.3 describes the library in detail. When the application enters the phase region, i.e., it enters the main progress loop, the enter phase event handler is executed. When

the first phase is executed, the instance of the cluster prediction class is created, and the is_initialized flag is set. The library requests the ranges of the cluster features, as well as the set of all phases belonging to the clusters, from the TMM. It initializes the PAPI library once at the beginning of the first phase. It then creates an event set using the PAPI low-level API to request the performance counters for the AVX calculation instructions, L3 cache misses, conditional branch instructions, and L2 cache misses for every phase.

For each phase, the predict_cluster() function calls the predict() method. If the phase was evaluated during DTA, the method simply returns the corresponding cluster number, which sets the value of the user parameter. The Runtime Management module requests the PCPs to switch to the corresponding cluster-best configuration for the current phase. For rts's of the significant regions called inside the current phase, the configuration switching is handled as usual by the RRL by looking up the configuration for the current rts using its call-path and additional region identifiers. If the current phase is not found in the cluster information, the predict() method assumes that the current phase was designated as a noise point during DTA, and the PCPs switch to the default configuration for the phase and all the rts's called inside it. For phases that were not seen during DTA, the predict() method predicts the cluster number of the current phase based on one of the three cluster predictors, detailed in Sections 5.3.1 and 5.3.2. The configurations for the phase as well as the rts's are switched dynamically by the PCPs based on the cluster id returned by predict_cluster().

An additional step then corrects possible mispredictions of the cluster number for the previous phase. First, the PAPI counters for the previous phase are converted into features, and compared against the ranges of the cluster features of the predicted cluster. If the features fall within the ranges of another cluster, the cluster number of the previous phase is corrected, so that future predictions can use the updated cluster information. If the values of the cluster features are not in the ranges of any known cluster, the phase is assigned as a noise point, and the PCPs switch to the system default configuration for the current phase and all the rts's.
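The correction of a mispredicted cluster amounts to a range check of the measured features against the stored cluster ranges, as in the following sketch. The feature layout and the ClusterInfo type are assumptions for illustration; the real library additionally maintains the predictor state described in Section 5.3.

#include <array>
#include <cstddef>
#include <optional>
#include <vector>

constexpr std::size_t kNumFeatures = 4;  // e.g. AVX instructions, L3 misses, branches, L2 misses

// Per-cluster feature ranges recorded during DTA (placeholder layout).
struct ClusterInfo {
    int id;
    std::array<double, kNumFeatures> min;
    std::array<double, kNumFeatures> max;
};

// Returns the id of the cluster whose feature ranges contain the measured features,
// or std::nullopt if the phase does not fit any known cluster (treated as a noise point).
std::optional<int> classify_phase(const std::array<double, kNumFeatures>& features,
                                  const std::vector<ClusterInfo>& clusters) {
    for (const ClusterInfo& c : clusters) {
        bool inside = true;
        for (std::size_t f = 0; f < kNumFeatures; ++f)
            if (features[f] < c.min[f] || features[f] > c.max[f]) { inside = false; break; }
        if (inside) return c.id;
    }
    return std::nullopt;  // noise point: fall back to the default system configuration
}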

6.3.1 Calibration

RAT distinguishes between seen and unseen rts’s of the significant regions. Unseen rts’s might occur for several reasons, for example, when the significant region is already known, but the application is run with different parameters or inputs during RAT, or when a significant region that was never encountered during DTA is encountered during RAT. For unseen rts’s, RAT can be configured to perform calibration, which was implemented by TUD, and trains an Artificial Neural Network (ANN) to predict optimal configurations at runtime. For intra-phase tuning, RAT performs calibration for the unseen rts’s if the calibration module is enabled. For inter-phase tuning, RAT can take one of two approaches: perform either runtime cluster prediction for the unseen phases, or calibration for the unseen rts’s. In the first approach, the application is linked with the cluster prediction library and the calibration is turned off, during which RAT invokes the cluster prediction function to perform runtime prediction of the cluster ids for unseen phases, as described in Section 5.3. Dur- ing this process, if an unseen rts of a significant region is encountered, the PCPs switch to the default system configuration for the rts since it is not found in the tuning model. In the second approach, the application is linked to the cluster prediction library and calibration is turned on. First, the cluster predictor predicts the cluster id for the current phase, and the PCPs switch to the corresponding cluster-best configuration. Then, if an unseen rts of a significant region is encountered during the current phase, the calibration mechanism is invoked. The calibration module selects an optimal configuration for the rts, which is applied each time this rts is encountered, regardless of the cluster id of the executing phase. This is ineffective, since calibration ignores the similarities between the phases, and selects a single optimal configuration for the rts across all the clusters.


During calibration, the objective values for all the significant regions are collected, and regions whose execution time is less than 100 ms are filtered out. Calibration then creates a state matrix as the product of the tuning parameters, for example of size {core frequencies x uncore frequencies}, where a state is a pair of core and uncore frequencies. When the RRL encounters an unknown rts, the algorithm starts at a certain state and begins measuring the energy consumption. It computes the Q-value for each configuration when a change to a new frequency is requested. Every change to a different frequency is considered an action, and is performed by selecting the next configuration from the direct neighbors of the current configuration based on the Q-value and the learning rate. An action is taken if the cost/Q-value is smaller than the previous value. If the cost is the lowest for a certain state, it is selected as the optimal configuration. The algorithm terminates when the program has finished executing.
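The following sketch captures only the state-grid and move-to-a-cheaper-neighbour aspect of the calibration described above; it omits the ANN and the learning rate, and measure_energy() as well as the frequency grids are placeholders, so it should not be read as the TUD implementation.

#include <cstddef>
#include <utility>
#include <vector>

// Placeholder measurement: a synthetic bowl-shaped cost used only for demonstration.
double measure_energy(int core_freq_mhz, int uncore_freq_mhz) {
    double dc = core_freq_mhz - 1900, du = uncore_freq_mhz - 2200;
    return dc * dc + du * du;
}

// Greedy walk over the state matrix of (core, uncore) frequency pairs: starting from an
// initial state, repeatedly move to a direct neighbour with a lower measured cost, and
// stop when no neighbour is cheaper than the current state.
std::pair<int, int> calibrate(const std::vector<int>& core_freqs,
                              const std::vector<int>& uncore_freqs) {
    std::size_t c = core_freqs.size() / 2, u = uncore_freqs.size() / 2;  // start in the middle
    double cost = measure_energy(core_freqs[c], uncore_freqs[u]);
    bool improved = true;
    while (improved) {
        improved = false;
        const int dc[] = {-1, 1, 0, 0}, du[] = {0, 0, -1, 1};
        for (int n = 0; n < 4; ++n) {
            long nc = static_cast<long>(c) + dc[n], nu = static_cast<long>(u) + du[n];
            if (nc < 0 || nu < 0 || nc >= static_cast<long>(core_freqs.size())
                || nu >= static_cast<long>(uncore_freqs.size()))
                continue;
            double neighbour_cost = measure_energy(core_freqs[nc], uncore_freqs[nu]);
            if (neighbour_cost < cost) {            // accept the action: cheaper state found
                cost = neighbour_cost;
                c = static_cast<std::size_t>(nc);
                u = static_cast<std::size_t>(nu);
                improved = true;
                break;
            }
        }
    }
    return {core_freqs[c], uncore_freqs[u]};        // selected as the optimal configuration
}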

6.4 Summary

In this chapter, we presented the integration of our work into the two stages of the tuning methodology, namely DTA and RAT. We also defined the interactions between the different components of the architecture during the two stages.

First, pre-analysis steps are performed using two automatic tools, namely scorep-autofilter and readex-dyn-detect, to prepare the application for tuning. To reduce the measurement overhead of executing many fine-granular program regions, scorep-autofilter measures the granularity of program regions and generates a Score-P filter file containing the list of frequently executed regions with a very short execution time per instance. Score-P then omits the instrumentation for these regions, thus reducing the instrumentation overhead. From the remaining program regions, readex-dyn-detect selects significant regions that are worth tuning by performing a dynamism analysis w.r.t. variations in the execution time for individual rts's and between phases, and in the compute intensity between significant regions. Dynamic tuning is aborted if insufficient tuning potential is found, and regular static tuning is applied instead.

We also present application experts with the opportunity to specify domain knowledge for defining parameters in the form of identifiers for the input, phase and user regions to expose the dynamic behaviour of the application. The identifiers allow distinguishing different system scenarios and improve the tuning results. Moreover, the domain knowledge specification also exposes ATPs, such as preconditioners or domain decomposition methods, to the tuning process. For a fine-grained analysis and tuning, significant regions are instrumented by inserting probe functions around the relevant code regions.

DTA is performed by PTF using two plugins, namely the intraphase plugin for intra-phase tuning, and the interphase plugin for inter-phase tuning. PTF reads the configuration file, and calls a tuning plugin, which performs one or more tuning steps in which experiments are executed to measure the effect of the system configuration on the objective. Both plugins perform DVFS, UFS and DCT using four tuning steps. However, they use different approaches for DTA. The intraphase tuning plugin finds the best system configuration for the rts's while ignoring the characteristic behaviour of the application phases. For each application input, it executes the phase with the default system configuration, tunes the ATPs, followed by the hardware and software tuning parameters, and finally verifies the tuning success. The interphase tuning plugin first executes the phases with the default system configuration, and then executes experiments to evaluate system configurations using a random strategy. It clusters similarly behaving phases, and then evaluates more configurations using a selective tuning step based on the Gaussian distribution. Finally, it measures the tuning success. The two plugins restart the application at the end of each tuning step.

After all the tuning steps are completed, PTF generates the application tuning model, which stores the best configurations in the form of scenarios, classifiers, and selectors. Rts's with identical configurations are grouped into scenarios using a classifier, and a selector returns the best configuration for each scenario. For intra-phase tuning, individual tuning models are generated for each application input, and then merged to form the application tuning model. For inter-phase tuning, the tuning model stores the cluster information, such as the phases belonging to each cluster and the ranges of the cluster features, as well as cluster-best configurations and rts-specific best configurations.

The tuning model is read and deserialized at runtime to enable scenario detection and configuration switching during production runs for RAT. For intra-phase tuning, the RRL requests the Parameter Controller to set the PCPs to the static-best configuration for the phase at the phase enter event, and to the rts-best configuration at the region enter event. For inter-phase tuning, the Parameter Controller sets the PCPs for a phase to the corresponding cluster-best configuration while switching to the rts-best configurations for the rts's within the phase. Additionally, the cluster prediction library predicts the cluster id for all phases that were not seen during DTA.

In addition to the workflow of DTA and RAT, we provided a brief description of the calibration mechanism that is invoked for unseen rts's of significant regions. For both intra-phase and inter-phase tuning, calibration is performed if the calibration module is enabled. One major difference is that calibration for inter-phase tuning does not take into consideration the similarities in the characteristics between different phases, and can only predict a single common rts-best configuration for an unseen rts across all clusters. Calibration is performed by training an ANN using Q-learning that determines whether a new configuration should be selected for an unseen rts based on the Q-value determined for the configuration.

7 Evaluation

In this chapter, we evaluate our tuning methodology, starting from application instrumentation, fine-granular region filtering, dynamism detection, significant region detection, DTA, tuning model generation, and finally, RAT. We focus on applications that exhibit inter-phase dynamism, and tune them using the interphase tuning plugin. Thus, we tune three real-world applications, namely 128.GAPgeofem from the SPEC MPI2007 benchmark suite, sam(oa)2 from the Chair of Scientific Computing at the Technische Universität München, and INDEED, a production application, as well as miniMD, a proxy benchmark from the Mantevo project.

Our methodology selects the best configuration for each cluster of phases, and for every individual rts within each cluster. We present the static savings for the phase as well as the rts's, and the dynamic savings for the rts's w.r.t. the cluster-best and the system default configurations. The best configurations for the clusters and the rts's are encapsulated in the tuning model, which guides the runtime tuning. At runtime, each application is first linked with the runtime cluster prediction library. The RRL then reads the tuning model and dynamically switches system configurations for the phases and the rts's. We present the improvements in the job energy consumption as well as the CPU energy for each application, and compare the savings for the three cluster predictors. We also perform an overhead analysis to determine the performance loss due to the switching overhead at runtime.

We first describe the platform and system architecture used for the evaluation in Section 7.1, followed by a description of the four benchmarks in Section 7.2. Section 7.3 presents the pre-analysis steps, followed by the results of cluster analysis by DBSCAN and spectral clustering for the four applications. It also compares the dynamic energy savings w.r.t. the job energy and CPU energy, as well as the performance overhead of the three cluster predictors.

7.1 Platform Description

To test our tuning methodology, we used Technische Universität Dresden's (TUD) Top500 cluster Taurus, located in Dresden, Germany. Taurus offers heterogeneous compute resources, namely 1456 Intel Haswell nodes, each with 64, 128 or 256 GB of RAM, 32 Intel Broadwell nodes, each with 64 GB of RAM, 6 large SMP nodes with 2 TB of RAM, and 44 Intel Sandy Bridge CPUs with NVidia K20x


GPUs. We evaluated our approach on the Haswell partition of Taurus. The Haswell partition consists of 1456 compute nodes based on the Intel Haswell-EP architecture. Each node consists of two 12-core Xeon E5-2680 v3 sockets with Hyper-Threading and TurboBoost disabled. We ran our experiments on nodes with 64 GB of RAM per node. The core frequency of each CPU core ranges from 1.2 GHz to 2.5 GHz, while the uncore frequency of the two sockets ranges from 1.2 GHz to 3.0 GHz. Each node runs with a default CPU frequency of 2.5 GHz and an uncore frequency of 3 GHz. The CPU and uncore frequencies are switched using the low-level x86_adapt library, which is a Linux kernel module that enables logging and setting/resetting system parameters stored in MSR or PCI registers of x86 processors [65].

7.2 Benchmark Specification

In this section, we present an overview of the three real-world applications and one proxy benchmark that were used to test our approach. We used the following applications for our evaluation:

1. 128.GAPgeofem: 128.GAPgeofem is a Finite-Element Method (FEM) code that simulates transient thermal conduction, and is part of the SPEC MPI2007 benchmark suite. It is written in a mixture of C and Fortran, and runs 235 iterations of the simulation loop.

2. INDEED: INDEED is a sheet metal forming simulation software that performs adaptive mesh refinement. It is written in Fortran, and runs 154 iterations of the simulation loop.

3. sam(oa)2: sam(oa)2 is an adaptive mesh refinement framework for flows in porous media and tsunami simulations. It is written in Fortran, and runs 1141 iterations of the time loop.

4. miniMD: miniMD is a parallel Molecular Dynamics (MD) code from the Mantevo project at Sandia National Laboratories, USA. It is written in C++, and runs 100 iterations of the simulation loop.

Table 7.1: Summary of the benchmarks used for evaluation.

Application      Parallelism    Language     Iterations   Type
128.GAPgeofem    MPI            C, Fortran   235          Real-world
INDEED           OpenMP         Fortran      154          Real-world
sam(oa)2         MPI+OpenMP     Fortran      1141         Real-world
miniMD           MPI+OpenMP     C++          100          Proxy

Table 7.1 presents an overview of the four applications. 128.GAPgeofem, sam(oa)2 and miniMD are MPI applications, while INDEED is an OpenMP application. All the applications were compiled using the Intel/2018 compiler and Intel MPI version 2018.1.163, and instrumented with Score-P version 4.0. The MPI applications were evaluated on a single node as well as multiple nodes, while INDEED was evaluated on a single node using 12 OpenMP threads.

7.3 Exploitation of Inter-Phase Dynamism

We first introduce an overview of the common steps that were performed for all the applications before starting the inter-phase analysis. We then present individual application-specific details and the results of clustering and the energy savings in the respective sections.

First, the applications were compiled by adding the Score-P instrumentation wrapper scorep --online-access --user to the Makefile of all the applications. The --online-access flag turns on the communication between Score-P and PTF via the OA interface, and the --user flag enables user instrumentation.


The applications were then run normally; at the end of the run, Score-P generated a profile in the cubex format. To reduce the instrumentation overhead, we used scorep-autofilter to create a filter file that suppresses the measurements for fine-granular instrumented regions with an execution time under 10 ms.

In the next step, we used readex-dyn-detect to detect and analyze the dynamism of the applications. The readex-dyn-detect tool requires a single phase region, i.e., a repetitive, single-entry single-exit region, typically the body of the main progress loop. To identify the phase region in each application, we first used CUBE to visualize the flat profile and locate the body of the simulation loop using the metric Number of Visits, which specifies how many times a region was called during the application run. We then manually annotated the phase region of each application, and exported the environment variables SCOREP_PROFILING_FORMAT and SCOREP_METRIC_PAPI as described in Listing 6.2 to obtain a tupled profile, which serves as the input for readex-dyn-detect. We ran readex-dyn-detect with a granularity threshold of 100 ms, which defines the minimal mean execution time for a region to be considered significant for tuning. Additionally, for the detection of intra-phase dynamism, we set the minimum standard deviation of the compute intensity of the significant regions to 10%, and of the execution time of the rts's to 10%. For inter-phase dynamism, we set the minimum standard deviation of the phase time, as a percentage of the mean phase time, to 10%. The tool then identified the significant regions and the dynamism, and generated a configuration file containing this information.

In the next step of DTA, we specified the following attributes in the configuration file:

1. Tuning parameters: The interphase tuning plugin supports three tuning parameters: the core frequency, the uncore frequency and the number of OpenMP threads. In our work, we only used the core and uncore frequencies for tuning due to a bug in the Score-P OA interface, which reinitializes the entire metric subsystem at the beginning of a new phase. This causes problems in measuring PAPI counters, because the spawned OpenMP threads exit after running their parallel sections before the metric subsystem is reinitialized, so the initialization of the PAPI library fails from the second phase onwards. We specified the ranges (minimum, maximum and step size) for the core frequency in MHz as {1200, 2500, 100} and for the uncore frequency in MHz as {1200, 3000, 100}.

2. Objectives: Since our work primarily focuses on optimizing the energy-efficiency, we specified normalized energy as the tuning objective. Additionally, we also enabled the measurement of the execution time.

3. Energy plugin: We used the x86_energy_sync_plugin [120] to collect the energy measurements. This is a Score-P power and energy event plugin counter, and supports reading the msr registers directly or through the x86_adapt library on Taurus. The plugin is strictly synchronous, and returns the energy consumption via one responsible process per node. The x86_energy_sync_plugin uses Intel's RAPL interface to return energy measurements for the following RAPL domains [123]: PKG (whole CPU package), PP0 (processor cores), PP1 (uncore devices), and DRAM (memory controller). For our evaluation, we requested the energy consumption of the whole package.

We compared the theoretical savings obtained during DTA and the dynamic savings obtained at runtime, including the switching overhead, w.r.t. three parameters: job energy, CPU energy, and execution time. To obtain the measurements for job energy and execution time, we used sacct [39], the accounting tool of the SLURM Resource and Job Management System, which provides per-job energy accounting with power profiling capabilities and enables users to query post-mortem data for previously executed jobs. The values of job energy and time were obtained using the --format "ConsumedEnergyRaw,CPUTimeRAW" option. The CPU energy was measured using measure-rapl, a runtime tool developed in the READEX project, which uses the x86_adapt library to measure the CPU energy via RAPL. The values of job energy, CPU energy and execution time for the three cluster predictors were compared against the untuned, instrumented version of each application. The reported runtime savings are the averages of the measurements of two consecutive runs of the applications on the same node(s) for the single-node and multi-node experiments.

To exploit the inter-phase dynamism, the interphase tuning plugin first performs DBSCAN to cluster phases whose features are close to each other. Then, the plugin analyzes the noise points, i.e., the points that were not clustered by DBSCAN, to determine associations between them using spectral clustering. It computes the normalized graph Laplacian matrix, and determines the eigengap from the eigenvalues to define the number of clusters for K-means. The interphase plugin then assigns the noise points to new clusters, and proceeds to the selective tuning step.

The following sections describe in detail the evaluation of DTA and RAT for 128.GAPgeofem (Section 7.3.1), sam(oa)2 (Section 7.3.2), INDEED (Section 7.3.3) and miniMD (Section 7.3.4).
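Before turning to the individual applications, the following minimal Python sketch makes the cluster-analysis step concrete using scikit-learn and SciPy. It is only an illustration under stated assumptions (an RBF similarity kernel, example values for eps and min_samples, and an already normalized feature matrix); the actual interphase plugin implements this step inside PTF.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics.pairwise import rbf_kernel
from scipy.sparse.csgraph import laplacian

def cluster_phases(features, eps=0.3, min_pts=4):
    """features: (n_phases, n_features) matrix of normalized phase features
    (e.g. compute intensity, L2 cache misses, conditional branch instructions)."""
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(features)
    noise = features[labels == -1]
    if len(noise) > min_pts:
        # similarity graph of the noise points and its normalized Laplacian
        W = rbf_kernel(noise)
        L = laplacian(W, normed=True)
        eigvals = np.sort(np.linalg.eigvalsh(L))
        # eigengap heuristic: k is the position of the largest gap between
        # consecutive eigenvalues
        k = int(np.argmax(np.diff(eigvals))) + 1
        noise_labels = KMeans(n_clusters=k, n_init=10).fit_predict(noise)
        # append the new clusters after the DBSCAN clusters
        # (the plugin additionally rejects clusters with fewer than min_pts points)
        labels[labels == -1] = noise_labels + labels.max() + 1
    return labels
```

For the single-node 128.GAPgeofem run described next, this procedure selects k = 2 for the 16 noise points, matching the eigengap between the second and third eigenvalues in Figure 7.1.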

7.3.1 128.GAPgeofem

128.GAPgeofem [124] is a parallel Finite-Element Method (FEM) code from the SPEC MPI2007 benchmark suite, an industry-standard HPC benchmark suite that contains 18 compute-intensive MPI applications. 128.GAPgeofem is written in C and Fortran and provides both a medium and a large input dataset; the medium dataset scales well up to 128 ranks, and the large dataset up to 2048 ranks. It is based on the GeoFEM (Geophysical Finite Element Methods) [125] software. GeoFEM was developed as part of a Japanese five-year project, the 'Earth Simulator project', which forecasts various earth phenomena through the simulation of a virtual earth placed in a supercomputer, modeling solid earth processes such as the convection between the mantle and core, plate tectonics, and seismic wave propagation. GeoFEM includes parallel finite element codes for linear and non-linear solid mechanics and thermal fluid simulations, a parallel iterative linear solver library, a grid partitioning subsystem, a parallel visualization subsystem, and utilities for parallel I/O. It also allows users to plug their own FEM codes into the GeoFEM platform.

GAPgeofem simulates transient thermal conduction with gap radiation, which refers to the radiative heat transfer that occurs between closely adjacent surfaces with heterogeneous material properties. The benchmark uses a backward Euler implicit time-marching scheme, a parallel CG (conjugate gradient) iterative solver, and the SSOR (symmetric successive over-relaxation) preconditioner. GAPgeofem generates three executables during the build: GAPmetis, GAPpart, and GAPgeofem. The first two partition the input mesh for the number of cores in the target system, and the third, GAPgeofem, performs the actual FE simulation. The source files for the first step are written in C, and those for the second and third steps in Fortran.

GAPgeofem uses point-to-point non-blocking MPI communication for exchanging domain boundary information, and collective MPI calls for calculating dot-products and global minimum and maximum values [126]. For our experiments, we used the medium size input dataset, which executes 235 iterations of the simulation loop, and represents an FE mesh of a section of the earth with 2 million degrees of freedom.


7.3.1.1 Evaluation of Design-Time Analysis

The phase region for 128.GAPgeofem was determined using a combination of source code inspection and visualization of the cubex file in CUBE, and was then annotated to perform DTA. Table 7.2 presents the result of the significant region analysis and dynamism detection by readex-dyn-detect.

Table 7.2: Significant regions identified by readex-dyn-detect for 128.GAPgeofem.

Significant region        | Rts call-path                                          | Intra-phase dynamism: compute intensity | Intra-phase dynamism: execution time
geofem_solver_cg_11       | /PhaseRegion/geofem_solver_11/geofem_solver_cg_11      | –                                       | ✓
mat_ass_thermal_main_361  | /PhaseRegion/mat_ass_thermal/mat_ass_thermal_main_361  | ✓                                       | ✓

This shows that most of the computation time in the FEM procedure is spent in two steps: assembling the coefficient matrix, performed in the region mat_ass_thermal_main_361, and solving the linear equations, performed in the region geofem_solver_cg_11. Column 1 lists the significant regions returned by readex-dyn-detect, and column 2 the instance of each significant region, i.e., its rts, which is identified by its call-path. Columns 3 and 4 indicate whether the dynamism detected by readex-dyn-detect was due to the variation in the compute intensity or in the execution time of the rts's. As we can see, both geofem_solver_cg_11 and mat_ass_thermal_main_361 show variations in the execution time, while only the matrix assembly (mat_ass_thermal_main_361) additionally shows variations in the compute intensity.

For both single node and multi-node runs, we set the number of samples, i.e., the number of experiments or iterations of the simulation loop to run during a tuning step, to 80. This means that we performed cluster analysis for these 80 points during DTA.

Single node

For single node experiments, we ran 128.GAPgeofem on a single compute node on Taurus using 24 MPI processes on 24 cores.

Cluster analysis

In Figure 7.1, the points on the graph represent the 16 eigenvalues for the 16 noise points obtained as a result of DBSCAN on 128.GAPgeofem. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix. We can see that there is only a small change between the first two eigenvalues, and a large eigengap between the second and third eigenvalues, as shown by the red line. Thus, the value k, representing the input for the number of clusters for K-means at the end of spectral clustering, was selected as 2. Figure 7.2 illustrates the final clustering consisting of six clusters obtained after performing DBSCAN and spectral clustering for 128.GAPgeofem. Clusters 1-4 result from DBSCAN, while clusters 5 and 6 result from spectral clustering. The individual clusters are demarcated mainly by the difference in the number of normalized L2 cache misses. The best configuration of the CPU and uncore frequencies for each cluster is illustrated by the {CPU_freq, uncore_freq} setting. Cluster 5, which lies in an area of few branch instructions, has the highest setting for both the core and uncore frequencies. It can also be observed that all the clusters except cluster 5 have a lower value of the CPU frequency setting.


Figure 7.1: Eigenvalues computed for the graph Laplacian matrix of 128.GAPgeofem for a single node run.

Figure 7.2: Single node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 80 phases of 128.GAPgeofem. Six clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Best rts-specific configurations

Table 7.3 lists the rts-specific best configurations for the rts's of 128.GAPgeofem for each cluster. The rts's are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. As we can see, the cluster-best configurations for the phases depicted in Figure 7.2 are also the best configurations for the rts's of geofem_solver_cg_11. This means that the intra-phase dynamism does not have an effect on the selection of the rts-best configurations. The best settings for the rts's of mat_ass_thermal_main_361 indicate a high CPU frequency and low uncore frequency for all clusters except for cluster 6.


Table 7.3: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of 128.GAPgeofem for a single node run.

Rts                                                    | Cluster 1  | Cluster 2  | Cluster 3  | Cluster 4  | Cluster 5  | Cluster 6
/PhaseRegion/geofem_solver_11/geofem_solver_cg_11      | {1.2, 1.6} | {1.4, 2.0} | {1.2, 2.1} | {1.7, 1.8} | {2.3, 2.3} | {1.4, 3.0}
/PhaseRegion/mat_ass_thermal/mat_ass_thermal_main_361  | {2.5, 1.7} | {2.4, 1.2} | {2.5, 1.2} | {2.5, 1.7} | {2.3, 1.7} | {2.5, 2.5}
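During RAT, the RRL uses these per-cluster, per-rts settings to switch the configuration dynamically. Conceptually, the tuning model produced by DTA can be viewed as a nested mapping from the predicted cluster id and the rts call-path to the best {CPU_freq, uncore_freq} pair. The following Python snippet is only a data-structure illustration populated with values from Table 7.3; it is not the actual READEX tuning-model format read by the RRL.

```python
# Illustrative lookup structure, not the real tuning-model file format.
# Frequencies are given in GHz as in Table 7.3: (core_freq, uncore_freq).
best_config = {
    1: {"/PhaseRegion/geofem_solver_11/geofem_solver_cg_11": (1.2, 1.6),
        "/PhaseRegion/mat_ass_thermal/mat_ass_thermal_main_361": (2.5, 1.7)},
    5: {"/PhaseRegion/geofem_solver_11/geofem_solver_cg_11": (2.3, 2.3),
        "/PhaseRegion/mat_ass_thermal/mat_ass_thermal_main_361": (2.3, 1.7)},
    # clusters 2-4 and 6 are filled in analogously
}

def configuration_for(cluster_id, call_path):
    """Return the rts-specific best configuration for the predicted cluster,
    or None if the rts or cluster was not seen during DTA."""
    return best_config.get(cluster_id, {}).get(call_path)

print(configuration_for(5, "/PhaseRegion/geofem_solver_11/geofem_solver_cg_11"))
# -> (2.3, 2.3)
```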

Multiple nodes

For multi-node experiments, we ran the application on two compute nodes on Taurus using 48 MPI processes on 48 cores.

Cluster analysis

In Figure 7.3, the points on the graph represent the 21 eigenvalues for the 21 noise points obtained as a result of DBSCAN on 128.GAPgeofem. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix. We can see that there are small changes between the first two eigenvalues, and a large eigengap between the second and third eigenvalues, as shown by a red line. Thus, the value k was selected as 2.

Figure 7.3: Eigenvalues computed for the graph Laplacian matrix for 128.GAPgeofem for a multi-node run.

Figure 7.4 illustrates the final clustering consisting of seven clusters obtained after performing DBSCAN and spectral clustering for 128.GAPgeofem. Clusters 1-5 result from DBSCAN, while clusters 6 and 7 result from spectral clustering. Although the clusters have a dispersion based on the level of L2 cache misses, they consist of fewer points than in the case of the single node run. The best configuration of the CPU and uncore frequencies for each cluster is defined by the {CPU_freq, uncore_freq} setting. As with the case of the single node run, it can be observed that all the clusters have a relatively low value of the CPU frequency.


Figure 7.4: Multi-node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 80 phases of 128.GAPgeofem. Seven clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Best rts-specific configurations

Table 7.4 lists the rts-specific best configurations for the rts’s of 128.GAPgeofem for each cluster. The rts’s are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. Except for clusters 4 and 7, the rts-specific best configurations for geofem_solver_cg_11 are the same as the cluster-best configurations for the phases depicted in Figure 7.4. This shows that the intra-phase dynamism does have an effect in multi-node runs. Moreover, the CPU frequency setting for all the clusters is in a low-medium range, with a medium-high setting for the uncore frequency. The best settings for the rts’s of mat_ass_thermal_main_361 indicate a medium-high range for the CPU frequency and low uncore frequency for all the clusters.

Table 7.4: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of 128.GAPgeofem for a multi-node run.

Rts                                                    | Cluster 1  | Cluster 2  | Cluster 3  | Cluster 4  | Cluster 5  | Cluster 6  | Cluster 7
/PhaseRegion/geofem_solver_11/geofem_solver_cg_11      | {1.6, 2.8} | {1.3, 2.0} | {1.4, 2.1} | {1.3, 3.0} | {1.3, 1.8} | {1.9, 2.0} | {1.3, 1.6}
/PhaseRegion/mat_ass_thermal/mat_ass_thermal_main_361  | {2.0, 1.4} | {2.5, 1.8} | {1.9, 1.2} | {2.5, 1.7} | {2.4, 1.3} | {2.3, 1.7} | {1.8, 2.0}

Theoretical Savings

Table 7.5 presents the theoretical energy savings in percentages computed by the interphase tuning plugin when the phases are executed with their corresponding cluster-best configurations, and the rts's with their rts-specific best configurations. The savings computed by the plugin are theoretical in that they represent the maximum savings that could be obtained if the 80 phases seen during DTA were executed again with their best configurations. These values make no assumptions about the actual switching overhead incurred during production runs.

Column 1 presents the run type (single node or multi-node). Column 2 presents the static savings for the phase, which arise from aggregating the improvement in the normalized energy across all the clusters when the phase is run with its cluster-best configuration. Static savings for the rts's (column 3) arise from aggregating the improvement in the normalized energy across all the clusters when the rts's are run with the static cluster-best configuration. Dynamic savings for the rts's (columns 4 and 5) arise from aggregating the improvement in the normalized energy across all the clusters when the rts's are run with the rts-specific best configurations from Table 7.3. We observe high static savings for the phase of 38.39% for the single node run and 22.90% for the multi-node run. The static savings for the rts's, however, are lower, amounting to 3.45% and 4.53% for the single-node and multi-node runs respectively. We also observe high dynamic savings for the rts's w.r.t. both the cluster-best (static-best) configuration and the default system configuration.

Table 7.5: Energy savings (static savings for the phase and the rts's, and dynamic savings for the rts's w.r.t. static-best and default configurations respectively), computed using the interphase plugin for 128.GAPgeofem.

Run configuration | Static savings for phase (%) | Static savings for rts's (%) | Dynamic savings for rts's w.r.t. static-best config. (%) | Dynamic savings for rts's w.r.t. default config. (%)
Single node       | 38.39                        | 3.45                         | 16.23                                                    | 19.12
Multi-node        | 22.90                        | 4.53                         | 12.69                                                    | 16.65

Figure 7.5: The trend in the energy consumption across the phases of 128.GAPgeofem. (a) Single node run. (b) Multi-node run.

7.3.1.2 Evaluation of Runtime Application Tuning

During production runs, the RRL reads the tuning model containing the best found configurations for the phase and the rts's. Figure 7.5 illustrates the trend in the energy consumption of the untuned version for the 80 DTA phases on a single node and on multiple compute nodes. The X-axis represents the phase number, and the Y-axis represents the node energy consumed in Joules.

Table 7.6 shows the runtime savings obtained for the job energy (column 2), the CPU energy (column 3) and the execution time (column 4) of 128.GAPgeofem for the three cluster predictors: the Markov chain, one-bit and two-bit cluster predictors. As we can see, all three predictors result in job and CPU energy savings of over 5% for the single node run, with the two-bit cluster predictor producing the highest runtime savings. It also produced the least performance (time-to-solution) degradation of only 0.39%. All three predictors perform much better for multi-node runs, with the two-bit predictor resulting in the highest job and CPU energy savings of 10.64% and 13.56% respectively, followed by the Markov chain predictor, which results in CPU energy savings of 11.18%. While the Markov chain predictor comes second to the two-bit cluster predictor in terms of the energy savings, it produces the least performance degradation of 3.73%.

Table 7.6: Runtime savings obtained for 128.GAPgeofem over the untuned version using the Markov chain, one-bit and two-bit runtime cluster predictors.

Run configuration | Job energy (%): Markov chain / One-bit / Two-bit | CPU energy (%): Markov chain / One-bit / Two-bit | Execution time (%): Markov chain / One-bit / Two-bit
Single node       | 6.06 / 5.71 / 6.5                                | 6.03 / 6.51 / 6.68                               | -1.37 / -0.58 / -0.39
Multi-node        | 8.34 / 5.63 / 10.64                              | 11.18 / 8.74 / 13.56                             | -3.73 / -9.7 / -5.22
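The one-bit and two-bit cluster predictors adapt the classic dynamic branch-prediction schemes to the prediction of the next phase's cluster. The following Python sketch shows the idea behind a two-bit (hysteresis) predictor; it is an illustration under the stated assumption that the prediction only changes after two consecutive mispredictions, not the RRL's actual implementation.

```python
class TwoBitClusterPredictor:
    """Two-bit style cluster predictor (illustrative sketch): the prediction
    only flips after two consecutive mispredictions, which tolerates a single
    outlier phase. A one-bit predictor would switch after the first miss."""

    def __init__(self):
        self.predicted = None   # currently predicted cluster id
        self.misses = 0         # number of consecutive mispredictions

    def predict(self):
        return self.predicted

    def update(self, observed):
        if self.predicted is None or observed == self.predicted:
            if self.predicted is None:
                self.predicted = observed
            self.misses = 0
        else:
            self.misses += 1
            if self.misses == 2:            # hysteresis threshold
                self.predicted = observed
                self.misses = 0
```

Before each phase, the RRL would select the configurations stored in the tuning model for predict(); after the phase, the observed cluster (determined from the measured features, or noise if no cluster matches) is fed back via update().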

7.3.2 sam(oa)2

sam(oa)2 [127], which stands for Space-filling Curves and Adaptive Meshes for Oceanic And Other Applications, was developed at the Technische Universität München and performs simulations of tsunami wave propagation and of two-phase porous media flow. It can also be used for all finite-element-type or finite-volume-type applications that are based on matrix-free, element-oriented formulations. The simulation of tsunami wave propagation uses the Sierpinski space-filling curve traversal for its adaptive triangular meshes, and is performed by solving a system of time-dependent Partial Differential Equations (PDEs) [128], the shallow water equations (SWE). The SWEs describe the behaviour of fluids of a certain depth over time, based on some initial condition, under the assumption that the effect of flow in the vertical direction can be neglected, which reduces the problem to a two-dimensional domain.

sam(oa)2 uses ASAGI (a pArallel Server for Adaptive GeoInformation) [129], an open-source library, to read geographical data from NetCDF input files in the form of a Cartesian grid for parallel simulations with adaptive mesh refinement. ASAGI distributes the geographic datasets over all compute nodes on-demand, so that only a portion of the dataset is stored on each node. The inputs for the simulation include the initial conditions, such as changes in the ocean floor caused by an earthquake, the conditions of the uneven ocean floor, i.e., the elevation of the sea floor and bathymetry data, as well as the water height and velocity [130].

The simulation first sets an initial state and then refines the grid incrementally until the user-defined refinement level is reached. Next, it applies a displacement to the bathymetry data and the water height to create the initial tsunami wave, and finally, the time stepping phase executes the simulation on the generated grid, combined with adaptive mesh refinement in each iteration. The simulation domain, which is the ocean, is discretized with a Cartesian grid; the unknowns of the PDE are placed at various grid elements (grid cells, vertices or edges), and are solved by computing the values at each grid element [131]. The computation of new cell states involves the exchange of boundary cell layers between neighboring processes, and the computation of fluxes between cells is done using Augmented Riemann solvers [128].


For our experiments, we used the hybrid MPI+OpenMP version of sam(oa)2, but compiled it for MPI only, with OpenMP disabled. The experiments used the 2D bathymetry input dataset from GEBCO [132] generated for the Tohoku tsunami of 2011 near the east coast of Japan, and we set the maximum refinement level to 20. Figure 7.6 illustrates the sequence of grid refinements performed for a refinement depth of 20 as the simulation, executed with 24 MPI processes, progresses through the tsunami time steps after initially performing the earthquake displacement time steps.

Figure 7.6: The grid refinement procedure in sam(oa)2 as the simulation progresses through the tsunami time steps.

7.3.2.1 Evaluation of Design-Time Analysis

The phase region for sam(oa)2 was determined using a combination of domain knowledge from the application expert and inspection of the cubex file in CUBE. The application first performs small time steps that include a displacement to simulate the earthquake, and then performs the regular tsunami time steps after the earthquake is over. We annotated the second loop, i.e., the one that performs the tsunami time steps, as the phase region. Table 7.7 presents the result of the significant region analysis and dynamism detection by readex-dyn-detect for sam(oa)2. The region traverse_section performs an Euler time stepping, and its rts's exhibit intra-phase dynamism w.r.t. variations in both the compute intensity and the execution time, while the region traverse_grids performs an adaptive time stepping, and exhibits intra-phase dynamism due to variations in only the compute intensity. This is because the size of the adaptive time step changes as the calculation proceeds in order to control the errors and ensure stability, so that a constant simulation execution time is maintained.

Table 7.7: Significant regions identified by readex-dyn-detect for sam(oa)2.

Significant region                   | Rts call-path                                                         | Intra-phase dynamism: compute intensity | Intra-phase dynamism: execution time
swe_euler_timestep.traverse_section  | /PhaseRegion/eulertraverse/eulertraversewrapper/eulertraversesection  | ✓                                       | ✓
swe_adapt.traverse_grids             | /PhaseRegion/traverseinplace/traverseoutofplace/traversegrids         | ✓                                       | –

Single node

For single node experiments, we ran the application on a single compute node on Taurus using 24 MPI processes on 24 cores. We set the number of samples, i.e., the number of iterations to run during a tuning step, to 120. This means that we performed cluster analysis for these 120 points during DTA.

Cluster analysis

In Figure 7.7, the points on the graph represent the 10 eigenvalues for the 10 noise points obtained as a result of DBSCAN on sam(oa)2. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix. We can see that there are small changes between the first two eigenvalues, and a large eigengap between the second and third eigenvalues, as shown by a red line. Thus, the value k, representing the input for the number of clusters for K-means at the end of spectral clustering was selected as 2.

Figure 7.7: Eigenvalues computed for the graph Laplacian matrix for sam(oa)2 for a single node run.

Figure 7.8 illustrates the final clustering consisting of four clusters obtained after performing DBSCAN and spectral clustering for sam(oa)2. Clusters 1 and 2 result from DBSCAN, while clusters 3 and 4 result from spectral clustering. We can see that the majority of the points lie in cluster 1, since these points lie close to each other. The best configuration of the CPU and uncore frequencies for each cluster is defined as {CPU_freq, uncore_freq}. Cluster 2, which has a high value of normalized compute intensity, has a high setting for the CPU frequency, and a low setting for the uncore frequency. Conversely, cluster 3 has a low setting for the CPU frequency since it is identified by low levels of compute intensity. For cluster 4, a combination of a high CPU frequency and a medium-high uncore frequency is optimal, since it lies in a region of medium levels of both compute intensity and L2 cache misses.

Best rts-specific configurations

Table 7.8 lists the rts-specific best configurations for the rts's of sam(oa)2 for each cluster. The rts's are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. As we can see, the cluster-best configurations for the phases depicted in Figure 7.8 are not necessarily the best configurations for the rts's of the significant regions. The best configuration for the rts's of swe_euler_timestep.traverse_section for all the clusters has a high setting for the CPU frequency, and a low setting for the uncore frequency, except in cluster 4. Thus, we can say that this region is compute-bound. A similar behavior is seen for the rts's of the region swe_adapt.traverse_grids for clusters 1 and 2, while clusters 3 and 4 have a lower setting of the CPU frequency and a slightly higher value of the uncore frequency.


Figure 7.8: Single node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 120 phases of sam(oa)2. Four clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Table 7.8: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of sam(oa)2 for a single node run.

Rts                                                                    | Cluster 1  | Cluster 2  | Cluster 3  | Cluster 4
/PhaseRegion/eulertraverse/eulertraversewrapper/eulertraversesection   | {2.3, 1.8} | {2.5, 1.8} | {2.2, 1.3} | {2.3, 2.5}
/PhaseRegion/traverseinplace/traverseoutofplace/traversegrids          | {2.4, 1.2} | {2.5, 1.8} | {1.5, 1.6} | {1.9, 2.1}

Multiple nodes

For multi-node experiments, we ran the application on two compute nodes on Taurus using 48 MPI processes on 48 cores. We set the number of samples to 110 to see the effect of changing the number of sample points on the clustering, and ultimately on the selection of the best configurations.

Cluster analysis

In Figure 7.9, the points on the graph represent the 7 eigenvalues for the 7 noise points obtained as a result of DBSCAN on sam(oa)2. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix. The large eigengap between the first and the second eigenvalues defines the value k as 1.

Figure 7.9: Eigenvalues computed for the graph Laplacian matrix for sam(oa)2 for a multi-node run.

Figure 7.10 illustrates the final clustering consisting of three clusters obtained after performing DBSCAN and spectral clustering for sam(oa)2. Clusters 1 and 2 result from DBSCAN, while cluster 3 results from spectral clustering. There are three noise points colored in red, which were not clustered by either DBSCAN or spectral clustering. The clusters are separated by the number of normalized conditional branch instructions and L2 cache misses. The best configuration of the CPU and uncore frequencies for each cluster is illustrated by the {CPU_freq, uncore_freq} setting. The best configuration for cluster 1 is a high setting for the CPU frequency and a low uncore frequency, resulting from the low number of L2 cache misses and conditional branch instructions. Clusters 2 and 3 have a medium-high setting for both core and uncore frequencies, corresponding to the rise in L2 cache misses and conditional branch instructions.

Figure 7.10: Multi-node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 110 phases of sam(oa)2. Three clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Best rts-specific configurations

Table 7.9 lists the rts-specific best configurations for the rts's of sam(oa)2 for each cluster. The rts's are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. The best configurations for the rts's follow a similar trend as in the single node run, with a high setting for the CPU frequency. The adaptive traversal region, however, shifts from being compute-bound to memory-bound in cluster 3, resulting in a higher setting for the uncore frequency.

Table 7.9: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of sam(oa)2 for a multi-node run.

Rts                                                                    | Cluster 1  | Cluster 2  | Cluster 3
/PhaseRegion/eulertraverse/eulertraversewrapper/eulertraversesection   | {2.5, 1.5} | {2.2, 2.3} | {2.4, 1.2}
/PhaseRegion/traverseinplace/traverseoutofplace/traversegrids          | {2.5, 2.9} | {2.2, 2.3} | {1.6, 2.5}

Theoretical Savings

Table 7.10 presents the theoretical energy savings in percentages computed by the interphase tuning plugin for 120 phases for the single node run, and 110 phases for the multi-node run. Column 1 presents the run type (single node or multi-node), column 2 presents the static savings for the phase, column 3 the static savings for the rts's, and columns 4 and 5 the dynamic savings for the rts's. We observe high static savings for the phase of 24.06% for the single node run and 33.16% for the multi-node run. The static and dynamic savings for the rts's, however, are much lower for the single node run. For the multi-node run, in contrast, the static savings for the rts's and their dynamic savings w.r.t. the default configuration rise sharply to 25.67% and 28.82% respectively.

Table 7.10: Energy savings (static savings for the phase and the rts's, and dynamic savings for the rts's w.r.t. static-best and default configurations respectively), computed using the interphase plugin for sam(oa)2.

Run configuration | Static savings for phase (%) | Static savings for rts's (%) | Dynamic savings for rts's w.r.t. static-best config. (%) | Dynamic savings for rts's w.r.t. default config. (%)
Single node       | 24.06                        | 5.3                          | 4.17                                                     | 9.25
Multi-node        | 33.16                        | 25.67                        | 4.24                                                     | 28.82

7.3.2.2 Evaluation of Runtime Application Tuning

Figure 7.11 illustrates the trend in the energy consumption for the untuned version of sam(oa)2 for the phases of DTA on a single node and multiple compute nodes. The X-axis represents the phase number, and the Y-axis represents the node energy consumed in Joules. We can see that the variations in the energy consumption are more drastic for the multi-node run as compared to the single-node run. Table 7.11 shows the dynamic runtime savings obtained for the job energy (column 2), CPU energy (column 3) and the execution time (column 4) for sam(oa)2 for the three cluster predictors: Markov chain, one-bit and two-bit cluster predictors. As we can see, the three predictors result in significant savings for the job and CPU energy as well as the execution time in the range 16.9%-19.14% across single and multi-node runs. The improvement in the performance can also be attributed to switching to a higher uncore frequency during MPI communications.


Figure 7.11: The trend in the energy consumption across the phases of sam(oa)2. (a) Single node run. (b) Multi-node run.

Table 7.11: Runtime savings obtained for sam(oa)2 over the untuned version using the Markov chain, one-bit and two-bit runtime cluster predictors.

Run configuration | Job energy (%): Markov chain / One-bit / Two-bit | CPU energy (%): Markov chain / One-bit / Two-bit | Execution time (%): Markov chain / One-bit / Two-bit
Single node       | 17 / 16.9 / 17.1                                 | 17.32 / 17.07 / 17.25                            | 17.2 / 17.4 / 17.73
Multi-node        | 18.31 / 18.14 / 18.61                            | 18.4 / 18.15 / 18.6                              | 18.85 / 18.81 / 19.14

The two-bit cluster predictor produced the highest savings in all three parameters for both the single node and multi-node experiments. The Markov chain predictor follows closely, and even gains slightly higher savings of 17.32% for the CPU energy in the single node run.

7.3.3 INDEED

INDEED [3] is a finite element software package with implicit time integration, developed by GNS (Gesellschaft für Numerische Simulation mbH). INDEED offers high-precision calculation models for the simulation of forming processes, such as modeling large plastic deformations and determining the stress distribution during sheet metal forming. The simulation involves a stationary workpiece and a number of tools with different geometries that move towards this workpiece [133]. When there is contact between a tool and the workpiece, the software performs an adaptive mesh refinement, where the number of finite element nodes increases between time steps, which means that the computational cost also grows with the rising number of elements. When the tools come into contact with the workpiece for the first time, a large amount of computational work is required to refine the mesh. In order to keep the problem tractable, INDEED then reduces the time step, thereby reducing the computational cost.

The simulation code in INDEED has more than 2000 subroutines, of which less than half will actually be used in any given run. The invocation of a specific routine depends on the design of the input data set, which involves selecting various mathematical and engineering properties. These include different types of elements (membranes, shells, volume elements), types of contact (rigid/deformable and deformable/deformable), types of material (mild steel, high strength steel, aluminum, ...), friction models, types of tool control (path control, force control, rotation, bending, ...), types of operations (punching of holes, trimming, ...), and surface and volume loads, among other properties.


The simulation in INDEED consists of three nested loops: the time loop, the contact loop and the equilibrium loop. At the start of the time loop, which is the outermost loop, the algorithm checks if there is a new tool, configures the initial settings, and performs an iteration procedure for each time step in order to produce an equilibrium of the simulated forces. It then starts the contact loop upon contact of the tool with the workpiece. Each iteration of the contact loop executes the innermost equilibrium loop, which solves a system of equations until equilibrium is achieved. The contact loop is executed until there is no more change in the contact points. Depending on what happens mechanically in the simulated process at the current time step, it may take a larger or smaller number of iterations to reach equilibrium. As a result of the implicit time integration, the computational cost is relatively high; on the other hand, it provides a high degree of accuracy of the numerical solutions. INDEED offers an MPI and an OpenMP version, but our work concentrates on the OpenMP version because it is more important from the perspective of the users.
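The following Python snippet is a purely schematic rendering of this loop nest (INDEED itself is not written in Python, and the loop conditions here are dummies); it only illustrates which loop we later annotate as the phase region.

```python
# Schematic of INDEED's loop nest; the loop bounds and bodies are dummies,
# not INDEED's actual code.
def run_simulation(n_time_steps):
    for step in range(n_time_steps):         # time loop: annotated as the phase region
        # a new tool may be introduced here and the initial settings configured
        contact_changed = True
        while contact_changed:                # contact loop: until the contact points stop changing
            in_equilibrium = False
            while not in_equilibrium:         # equilibrium loop: solve the system of equations
                in_equilibrium = True         # dummy: assume one solve reaches equilibrium
            contact_changed = False           # dummy: assume the contact points have settled
        # adaptive mesh refinement and time-step reduction follow here

run_simulation(154)   # the steel benchmark used in Section 7.3.3.1 runs 154 time-loop iterations
```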

7.3.3.1 Evaluation of Design-Time Analysis

We used 12 OpenMP threads to run INDEED with the input data set for a steel sheet on a single compute node of Taurus. The steel benchmark runs for a total of 154 iterations of the time loop, and results in a long execution time per phase. We turned off Score-P instrumentation for the threads because of the bug in the Score-P OA interface, as mentioned earlier. Table 7.12 presents the result of the significant region analysis and dynamism detection by readex-dyn-detect. As we can see, five significant regions were detected for INDEED. The region restou has two rts's that can be identified by their different call-paths. However, they exhibit intra-phase dynamism due to variation in only the compute intensity. Of the other significant regions, only write_var and beltop show variations in the execution time.

Table 7.12: Significant regions identified by readex-dyn-detect for INDEED.

Significant region | Rts call-path                                  | Intra-phase dynamism: compute intensity | Intra-phase dynamism: execution time
write_var          | /PhaseRegion/cps04/write_eval_data/write_var   |                                         | ✓
restou             | /PhaseRegion/restou                            | ✓                                       | –
restou             | /PhaseRegion/rezvor/wrdump/restou              | ✓                                       | –
lines3             | /PhaseRegion/cps02/itsteu/lines3               |                                         | –
beltop             | /PhaseRegion/vsorel1/vsorelf/beltop            |                                         | ✓
crit04             | /PhaseRegion/cps01/geocr4/crit04               |                                         | –

The annotation of the phase region in INDEED is slightly more complicated than in the other applications, since there are three loops, i.e., the time loop, the contact loop and the equilibrium loop, that have varying levels of dynamism. The user may annotate any of these loops as the phase region. We used a combination of application expert knowledge and visualization of the cubex file in CUBE to identify the phase region. We observed that the time loop, i.e., the outermost loop, and the equilibrium loop exhibited the most inter-phase dynamism. Although annotating the equilibrium loop as the phase region is a valid choice, we used the time loop as the phase region for our experiments due to its longer running time per phase. To perform DTA, we set the number of samples, i.e., the number of experiments or iterations of the simulation loop, to 110. This means that we performed cluster analysis for these 110 points during DTA.


Cluster analysis

In Figure 7.12, the points on the graph represent the 12 eigenvalues for the 12 noise points obtained as a result of DBSCAN on INDEED. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix. We can see that there is a large eigengap between the second and third eigenvalues, as shown by a red line. Thus, the value of k was selected as 2.

Figure 7.12: Eigenvalues computed for the graph Laplacian matrix for INDEED for a single node run.

Figure 7.13 illustrates the final clustering consisting of five clusters obtained after performing DBSCAN and spectral clustering for INDEED. Clusters 1-3 result from DBSCAN, while clusters 4 and 5 result from spectral clustering. The clusters have a marked distinction between them due to the variation in the normalized L2 cache misses. The exception is cluster 5, which has a much lower range of the L2 cache misses, and a higher degree of variation in the compute intensity.

The best configuration of the CPU and uncore frequencies for each cluster is illustrated by the {CPU_freq, uncore_freq} setting. We see that for clusters that lie in a region of low normalized L2 cache misses, a lower setting of the uncore frequency is recommended. A higher number of requests that miss the L2 cache results in a higher setting of the uncore frequency, as indicated by clusters 1, 2 and 4. This means that a higher uncore frequency is beneficial when the data lies outside the L2 cache, because the accesses served by the uncore can then complete much faster. It is also clear that INDEED benefits from a high setting for the CPU frequency due to the adaptive mesh refinement.

Best rts-specific configurations

Table 7.13 lists the rts-specific best configurations for the rts’s of INDEED for each cluster. The rts’s are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. We can see that the rts’s of restou have the same optimal configurations across all the clusters. This means that the intra-phase variations in the compute intensity for this region had no effect on the selection of the optimal configuration. The rts’s of the region write_var are not called in the phases of clusters 4 and 5, so no optimal configuration exists for these clusters. It can also be observed that the best configuration for the cluster is not necessarily the best configuration for the rts’s, and that all the rts’s with the exception of /PhaseRegion/cps02/itsteu/lines3 have a medium-high setting for the CPU frequency in the range of 1.9-2.5 GHz for all the clusters.


Figure 7.13: Single node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 110 phases of INDEED. Five clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Table 7.13: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of INDEED for a single node run.

Rts                                           | Cluster 1  | Cluster 2  | Cluster 3  | Cluster 4  | Cluster 5
/PhaseRegion/cps04/write_eval_data/write_var  | {2.5, 2.6} | {2.3, 2.4} | {1.9, 1.4} | -          | -
/PhaseRegion/restou                           | {2.5, 1.5} | {2.4, 1.8} | {2.5, 1.2} | {1.9, 2.2} | {2.5, 2.0}
/PhaseRegion/rezvor/wrdump/restou             | {2.5, 1.5} | {2.4, 1.8} | {2.5, 1.2} | {1.9, 2.2} | {2.5, 2.0}
/PhaseRegion/cps02/itsteu/lines3              | {1.7, 2.5} | {2.1, 2.4} | {2.2, 1.9} | {1.9, 1.2} | {2.2, 1.9}
/PhaseRegion/vsorel1/vsorelf/beltop           | {2.5, 1.5} | {2.3, 1.3} | {2.5, 1.2} | {2.2, 2.2} | {2.5, 2.0}
/PhaseRegion/cps01/geocr4/crit04              | {2.1, 2.8} | {2.3, 2.4} | {2.3, 2.0} | {1.9, 1.2} | {2.5, 2.0}

Theoretical Savings

Table 7.14 presents the theoretical energy savings in percentages computed by the interphase tuning plugin for the phase and the rts's of INDEED. Column 1 presents the run type (single node), column 2 presents the static savings for the phase, column 3 the static savings for the rts's, and columns 4 and 5 the dynamic savings for the rts's. We observe static savings of 15.02% for the phase. The static savings for the rts's amount to 7.16%, while the dynamic savings for the rts's are 4.06% w.r.t. the static-best configuration and 10.93% w.r.t. the default configuration. Since all the clusters of the phases of INDEED benefit from a high CPU frequency setting, we can infer that the static savings for the phase are mainly influenced by the uncore frequency switching. Thus, high CPU frequency values combined with uncore frequency switching can theoretically produce good savings for the phases and the rts's.


Table 7.14: Energy savings (static savings for the phase and the rts's, and dynamic savings for the rts's w.r.t. static-best and default configurations respectively), computed using the interphase plugin for INDEED.

Run configuration | Static savings for phase (%) | Static savings for rts's (%) | Dynamic savings for rts's w.r.t. static-best config. (%) | Dynamic savings for rts's w.r.t. default config. (%)
Single node       | 15.02                        | 7.16                         | 4.06                                                     | 10.93

Figure 7.14: The trend in the energy consumption across the phases of INDEED.

7.3.3.2 Evaluation of Runtime Application Tuning

During production runs, the RRL reads the tuning model containing the best found configurations for the phase and the rts's. Figure 7.14 illustrates the trend in the energy consumption for the 110 phases of DTA for the untuned version on a single compute node of Taurus. The X-axis represents the phase number, and the Y-axis represents the node energy consumed in Joules. As we can see, the variations in the energy consumption across the initial 30-40 phases are drastic as a result of the adaptive mesh refinement.

Table 7.15 shows the dynamic runtime savings obtained for the job energy (column 2), the CPU energy (column 3) and the execution time (column 4) of INDEED for the Markov chain, one-bit and two-bit cluster predictors. As we can see, the Markov chain predictor outperforms the one-bit and two-bit cluster predictors in all three parameters: job energy with 6.4% savings, CPU energy with 6.4% savings, and time-to-solution with 7.7% savings. The one-bit predictor also performs well in all three aspects. Moreover, all three predictors improve the time-to-solution for the single node run.

From the above plots and figures, it is evident that it is always advisable to choose the highest possible CPU frequency. The optimal choice of the other tuning parameters, such as the uncore frequency and the number of OpenMP threads, depends on the objective function. For example, varying the number of OpenMP threads results in a lower runtime, but does not have much impact on the energy savings. Since we were unable to optimize this tuning parameter, we cannot make further inferences.


Table 7.15: Runtime savings obtained for INDEED over the untuned version using the Markov chain, one-bit and two-bit runtime cluster predictors.

Run configuration | Job energy (%): Markov chain / One-bit / Two-bit | CPU energy (%): Markov chain / One-bit / Two-bit | Execution time (%): Markov chain / One-bit / Two-bit
Single node       | 6.4 / 6.1 / 5.44                                 | 6.4 / 6.04 / 5.54                                | 7.7 / 7.3 / 7.04

7.3.4 miniMD

miniMD is a lightweight, parallel Molecular Dynamics (MD) micro-benchmark in the Mantevo benchmark suite maintained by Sandia National Laboratories. The code is written in C++, performs a parallel molecular dynamics simulation of a Lennard-Jones or an Embedded Atom Model (EAM) system, and is a weakly scaling benchmark [134]. The benchmark simulates the behavior of atoms, and describes the interaction between two uncharged molecules or atoms under the influence of different forces. miniMD uses spatial-decomposition MD, where the 3D simulation space consisting of atoms is divided into cells with dimensions equal to the sum of the cutoff distance and a margin, and subsets of cells are owned by individual processors [135]. In each iteration of the time loop, short-range force calculations are performed for all pairs of atoms that are within the cutoff distance using a pre-computed list of near neighbors for each atom. This is done to avoid evaluating the distance between all pairs of atoms. The two most time consuming portions of the code are the force computation and the creation of the neighbor list for each atom. New forces are then calculated by considering the current neighbor list for each atom. Finally, inter-node communication is performed between the nodes to exchange the position and force information for the atoms near the boundaries.

During the simulation, atoms may move from one node to another in a process known as atom exchange. The force acting on one atom may depend on another atom residing on another node, which makes it necessary to cross node boundaries. miniMD uses MPI calls to perform the atom exchange by sending the positions of the dependent atoms from each node to its neighboring nodes before the force computation, and receiving force contributions from the neighboring nodes after the force computation [136]. The benchmark takes an input file, which enables users to specify the type of potential (Lennard-Jones or EAM), the problem size, the number of time steps, the size of each time step, the temperature, the atom density, and the particle interaction cutoff distance. We used the Lennard-Jones potential and set the reneighboring of atoms to be performed once every 10 iterations for a total of 100 time steps (or phases) for both the single and multi-node experiments for miniMD.

7.3.4.1 Evaluation of Design-Time Analysis

Table 7.16 presents the result of the significant region analysis and dynamism detection by the readex-dyn-detect tool. To instrument miniMD, we first identified the for-loop in the Integrate::run() function in integrate.cpp as the phase region. The for-loop takes the number of time steps from the input file, and calls three functions, namely borders(), build() and compute(), which readex-dyn-detect returns as significant regions. As we can see, the region borders() has no dynamism w.r.t. either the compute intensity or the execution time. However, readex-dyn-detect returns it as a significant region since the granularity of this region is greater than the minimum threshold of 100 ms. Another interesting observation is that none of the regions have variations in the execution time.


Table 7.16: Significant regions identified by readex-dyn-detect for miniMD.

Significant region | Rts call-path                   | Intra-phase dynamism: compute intensity | Intra-phase dynamism: execution time
build              | /PhaseRegion/NEIGHBOR_BUILD     | ✓                                       | –
compute            | /PhaseRegion/FORCELJ_COMP_HALF  | ✓                                       | –
borders            | /PhaseRegion/COMM_BORDERS       | –                                       | –
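The structure of the instrumented phase loop also explains the periodicity discussed next. The following Python sketch is a simplified schematic of the loop in Integrate::run(); the function bodies are stubs, not miniMD's actual C++ code.

```python
# Simplified schematic of miniMD's phase loop; the stubs below stand in for
# the instrumented regions and are not miniMD code.
def borders():  pass      # COMM_BORDERS: exchange atoms near the domain boundaries
def build():    pass      # NEIGHBOR_BUILD: rebuild the per-atom neighbor lists
def compute():  pass      # FORCELJ_COMP_HALF: Lennard-Jones force computation

def integrate_run(n_timesteps=100, reneigh_every=10):
    for step in range(1, n_timesteps + 1):    # one phase per iteration of the time loop
        if step % reneigh_every == 0:         # reneighboring: every tenth phase in our setup
            borders()
            build()
        compute()                             # dominates all other phases

integrate_run()
```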

We also observe that once every 10 phases, the regions build() and borders() are called due to the reneighboring process. Consequently, during these phases, the compute() region contributes little to the energy consumption and the execution time. This is the source of inter-phase dynamism in miniMD. The periodicity of the dynamism is tied to one of the input parameters of miniMD: reneighboring the atoms once every N steps/iterations. To perform the inter-phase analysis, we set the number of samples to 80 for both the single node and multi-node experiments. This means that we performed cluster analysis for these 80 points during DTA.

Single node

For single node experiments, we ran the application on a single compute node on Taurus using 4 MPI processes on 24 cores. We configured miniMD with a problem size of 90x90x90 to simulate 2,916,000 atoms with a uniform density of 0.8442.

Cluster analysis

In Figure 7.15, the points on the graph represent the 10 eigenvalues for the 10 noise points obtained as a result of DBSCAN on miniMD. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix.

Figure 7.15: Eigenvalues computed for the graph Laplacian matrix for miniMD for a single node run.

We can see that there is one large eigengap between the second and third eigenvalues, as shown by a red line, with essentially no gap between the first two. Thus, the value k, representing the input for the number of clusters for K-means at the end of spectral clustering, was selected as 2. After spectral clustering, it was observed that one of the clusters had fewer points than the minimum number of points, i.e., 4. This cluster was therefore rejected by the plugin, and its points were marked as noise.

Figure 7.16: Single node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 80 phases of miniMD. Nine clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Figure 7.16 illustrates the final clustering consisting of nine clusters obtained after performing DBSCAN and spectral clustering for miniMD. Clusters 1-8 result from DBSCAN, while cluster 9 results from spectral clustering. We can see that the individual clusters are marked by varying levels of the normalized compute intensity, with cluster 1 having the highest values. The best configuration of the CPU and uncore frequencies for each cluster is illustrated by the {CPU_freq, uncore_freq} setting. The best configurations for all the clusters have a medium-high setting for the CPU frequency, and a transition from low to higher values of the uncore frequency as we move from high compute intensity regions to low compute intensity regions. Cluster 9, which lies in an area of high L2 cache misses, has the highest setting of the uncore frequency. Clusters 1-8 generally benefit from a high CPU frequency setting and a low uncore frequency setting.

Best rts-specific configurations

Table 7.17 lists the rts-specific best configurations for the rts’s of miniMD for each cluster. The rts’s are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. As we can see, only the region compute() is called in all the phases, and has a high setting for the core frequency and low setting for the uncore frequency (except in cluster 6). On the other hand, the regions build() and borders() are called only in cluster 9, since this cluster contains every tenth phase of the simulation loop. The core frequency setting for both regions is high, while the uncore frequency is understandably low for build() since it only builds the neighbor list. The high setting for the uncore frequency for borders() is due to the inter-node communication that is performed between the nodes to exchange the position and force information. This could indicate that it is more memory-bound.


Table 7.17: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of miniMD for a single node run.

Cluster | /PhaseRegion/FORCELJ_COMP_HALF | /PhaseRegion/NEIGHBOR_BUILD | /PhaseRegion/COMM_BORDERS
1       | {2.1, 1.5}                     | -                           | -
2       | {2.3, 1.4}                     | -                           | -
3       | {2.3, 1.7}                     | -                           | -
4       | {2.5, 1.7}                     | -                           | -
5       | {2.5, 1.3}                     | -                           | -
6       | {1.9, 2.9}                     | -                           | -
7       | {2.2, 1.4}                     | -                           | -
8       | {2.2, 2.0}                     | -                           | -
9       | {2.4, 2.2}                     | {2.2, 1.4}                  | {2.3, 2.4}

Multiple nodes

For multi-node experiments, we ran the application on two compute nodes on Taurus using 48 MPI processes on 48 cores. We configured miniMD with a problem size of 160x160x160 to simulate 16,384,000 atoms with a uniform density of 0.8442.

Cluster analysis

In Figure 7.17, the points on the graph represent the 12 eigenvalues for the 12 noise points obtained as a result of DBSCAN on miniMD. The X-axis represents the number of eigenvalues, and the Y-axis represents the eigenvalues for the normalized graph Laplacian matrix. We can see that there is a large eigengap between the second and third eigenvalues, as shown by a red line. Thus, the value k was selected as 2. Figure 7.18 illustrates the final clustering consisting of seven clusters obtained after performing DBSCAN and spectral clustering for miniMD. Clusters 1-5 result from DBSCAN, while clusters 6 and 7 result from spectral clustering. Like in the case of the single node run, the clusters have varying degrees of normalized compute intensity. In addition, they also now display different levels of conditional branch instructions. We also observe that cluster 5 is larger than the other clusters since more points lie in close proximity to each other. There is, however, an outlier cluster, namely cluster 7, which consists of points lying close to clusters 5 and 1. This is due to the fact that these points were marked as noise by DBSCAN, and eventually clustered by spectral clustering into a single cluster because they were more similar to each other than the points from cluster 6. The best configuration of the CPU and uncore frequencies for each cluster is illustrated by the {CPU_freq, uncore_freq} setting. As with the case of the single node run, it can be observed that all the clusters have a relatively high value for the best CPU frequency setting. The best configurations for clusters 1-5 and 7 are similar to the single node result, with a medium-high setting for the CPU frequency and a low-medium setting for the uncore frequency. We also observed that varying the problem size for the multi-node config- uration produced similar results.


Figure 7.17: Eigenvalues computed for the graph Laplacian matrix for miniMD for a multi-node run.

Figure 7.18: Multi-node results of the cluster analysis (DBSCAN followed by spectral clustering) performed on 80 phases of miniMD. Seven clusters are produced, and the best configuration for each cluster is depicted in the form {CPU_freq, uncore_freq}.

Best rts-specific configurations

Table 7.18 lists the rts-specific best configurations for the rts's of miniMD for each cluster. The rts's are identified by their call-paths, and the best configurations are specified by the {CPU_freq, uncore_freq} settings. The rts-specific best configurations for the region compute() for all the clusters except cluster 6 have a medium-high setting for the core frequency, and a low setting for the uncore frequency. For cluster 6, the best configuration is a low setting for the CPU frequency and a low-medium setting for the uncore frequency. This is because the phases of this cluster perform reneighboring by calling the functions build() and borders(). The best configuration for borders() is marked by a high setting for the uncore frequency, similar to the single node run.

Table 7.18: Rts-specific cluster-best configurations of the tuning parameters {CPU_freq, uncore_freq} for the rts's of the significant regions of miniMD for a multi-node run.

Cluster | /PhaseRegion/FORCELJ_COMP_HALF | /PhaseRegion/NEIGHBOR_BUILD | /PhaseRegion/COMM_BORDERS
1       | {2.2, 1.4}                     | -                           | -
2       | {1.9, 2.0}                     | -                           | -
3       | {2.5, 1.4}                     | -                           | -
4       | {2.3, 1.4}                     | -                           | -
5       | {2.2, 1.7}                     | -                           | -
6       | {1.6, 1.9}                     | {1.9, 1.6}                  | {1.3, 2.9}
7       | {2.4, 2.1}                     | -                           | -

Theoretical Savings

Table 7.19 presents the theoretical energy savings in percentages computed by the interphase tuning plugin for the phase and the rts's of miniMD. Column 1 presents the run type (single node or multi-node), column 2 presents the static savings for the phase, column 3 the static savings for the rts's, and columns 4 and 5 the dynamic savings for the rts's. We observe the maximum savings for the phase, 10.19%, for the multi-node run. As expected, the dynamic savings for the rts's w.r.t. the static cluster-best configuration amount to at most 2.53% for the multi-node run, while the single node run shows very little savings. This is because the function compute() executes for most of the application run, which does not give rise to much intra-phase dynamism. Thus, the rts's could potentially be run with the cluster-best configurations without much effect on the normalized energy consumption.

Table 7.19: Energy savings (static savings for the phase and the rts's, and dynamic savings for the rts's w.r.t. static-best and default configurations respectively), computed using the interphase plugin for miniMD.

Run configuration | Static savings for phase (%) | Static savings for rts's (%) | Dynamic savings for rts's w.r.t. static-best config. (%) | Dynamic savings for rts's w.r.t. default config. (%)
Single node       | 7.96                         | 7.8                          | 0.14                                                     | 7.93
Multi-node        | 10.19                        | 5.95                         | 2.53                                                     | 8.34

7.3.4.2 Evaluation of Runtime Application Tuning

During production runs, the RRL reads the tuning model containing the best found configurations, i.e., those producing the lowest normalized energy consumption for the phase and the rts's. Figure 7.19 illustrates the trend in the energy consumption of the untuned version for the 80 phases of DTA on a single node and on multiple compute nodes. The X-axis represents the phase number, and the Y-axis represents the node energy consumed in Joules. As we can see, both runs show a similar effect, where the energy rises sharply at every tenth phase as an effect of the reneighboring procedure. We can also see that the energy consumption for the multi-node run shows more variation between the phases than the single node run.

Figure 7.19: The trend in the energy consumption across the phases of miniMD. (a) Single node run. (b) Multi-node run.

Table 7.20 shows the runtime savings obtained for the job energy (column 2), the CPU energy (column 3) and the execution time (column 4) of miniMD for the Markov chain, one-bit and two-bit cluster predictors. As we can see, the three predictors perform poorly in improving the CPU energy for the single-node run. The two-bit predictor outperforms the Markov chain and one-bit predictors for all the test cases of the single node run. Moreover, all predictors improve the time-to-solution for the single node run, with the two-bit predictor reducing the execution time by 12.7%. For multi-node runs, all three predictors perform nearly the same in reducing the job energy, CPU energy and the execution time, and result in much higher CPU energy savings than in the single node run. In contrast to the single node results, the predictors degrade the performance for multi-node runs by increasing the time-to-solution by nearly 11.7%.

Table 7.20: Runtime savings obtained for miniMD over the untuned version using the Markov chain, one-bit and two-bit runtime cluster predictors.

Run configuration | Job energy (%): Markov chain / One-bit / Two-bit | CPU energy (%): Markov chain / One-bit / Two-bit | Execution time (%): Markov chain / One-bit / Two-bit
Single node       | 3.0 / 3.45 / 6.6                                 | 0.21 / 0.33 / 0.8                                | 4.2 / 4.2 / 12.7
Multi-node        | 4.39 / 4.46 / 4.41                               | 8.03 / 8.08 / 8.34                               | -11.69 / -11.69 / -11.04

The reason for the low job energy savings in both the single node and multi-node runs is that the cluster ids change every 10 phases, meaning that after every 10 phases a new cluster is formed. Hence, during RAT, an unseen phase will always belong to a new cluster that was not identified during DTA, and the predictors will always mark the unseen phase as noise. We can therefore conclude that the cluster predictors perform poorly for applications like miniMD, where a new cluster is formed after every n phases.
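For completeness, the following Python sketch illustrates the idea behind the second-order Markov chain cluster predictor; it is an illustration only, not the RRL's implementation. The next cluster is predicted as the most frequent successor of the last two observed cluster ids.

```python
from collections import defaultdict, Counter

class MarkovChainClusterPredictor:
    """Illustrative second-order Markov chain cluster predictor: predict the
    most frequent successor of the last two observed cluster ids."""

    def __init__(self):
        self.counts = defaultdict(Counter)   # (c_{t-1}, c_t) -> Counter of successors
        self.history = []                    # observed cluster ids so far

    def predict(self):
        key = tuple(self.history[-2:])
        if len(key) == 2 and self.counts[key]:
            return self.counts[key].most_common(1)[0][0]
        return self.history[-1] if self.history else None   # fall back to the last cluster

    def update(self, observed):
        if len(self.history) >= 2:
            self.counts[tuple(self.history[-2:])][observed] += 1
        self.history.append(observed)
```

Such a predictor can only learn a periodic pattern like miniMD's every-tenth-phase behaviour if the same cluster ids recur; when unseen phases fall into clusters that were never observed during DTA, as described above, no useful prediction is possible and the phases are treated as noise, which explains the modest job energy savings.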

8 Summary and Outlook

Energy-efficiency optimization remains one of the major hurdles on the way to Exascale levels of computing. Scaling current systems to the Exascale level would require more than 50 MW of power, which exceeds the power budget allocated for HPC systems. One of the biggest challenges in this area is that application programmers focus on improving the performance of HPC applications while neglecting possible improvements in the energy and power consumption. Another challenge is the lack of platform knowledge among developers, even though hardware vendors have implemented multiple power saving features in modern processors, such as clock gating, power gating, clock modulation, DVFS and UFS. A key point is that pure hardware-based solutions cannot address the dynamism that applications exhibit. Thus, a software tuning approach that frees users from learning the intricacies of the application behaviour can be implemented with the help of autotuning. Autotuning automatically walks a search space of tuning knobs, the so-called tuning parameters, using a search strategy to optimize a specific tuning objective.

So far, existing tools generally rely on static tuning, where a single frequency is set for the entire application run. However, this approach is too coarse-grained because it aims for a "one solution fits all" setting, while most HPC applications exhibit changing characteristics. These changes, known as application dynamism, occur in different regions of the program and cause the program execution to jump between compute-, memory-, and I/O-bound code regions. A more fine-grained approach, in which a different setting of the tuning parameters is applied to each program region to exploit the dynamic behavior of the application, therefore overcomes the limitations of static tuning.

To automate the tuning process, we presented a tools-aided approach that combines several pre-existing tools with novel runtime tuning tools to guide the optimal tuning of the hardware (core frequency and uncore frequency), the system software (number of OpenMP threads), and the application-level tuning parameters. As a foundation for our work, we introduced the formalism, including the definitions of the terms used in the rest of the thesis, as well as the tools (PTF, Score-P) and APIs (PAPI) that our work builds on. The work done in this thesis is an extension of the EU-funded Horizon 2020 project READEX, which combined technologies from two ends of the spectrum by applying the system scenario methodology from the embedded systems domain to the HPC world to automate the process of determining the best settings of the tuning parameters, called system configurations.

We identified that the definition of a phase varies across previous works. Some define it as a period of execution with a stable behavior, which may contain multiple regions and iterations of the time loop.

We define a phase as one iteration of the simulation loop that calls different program regions within a single run. Our work also differs from other related work: we not only determine optimal configurations for different code regions when they exhibit varying behaviour (intra-phase dynamism), but also determine different best configurations for different executions of the same region, called runtime situations (rts's), depending on the execution context, e.g., the relaxation operation on a certain grid level. Beyond that, we also identify the data dependent variation in the characteristics of the phase over time, e.g., due to grid refinement, and compute best configurations for groups of similarly behaving phases. Such dynamic changes between the individual phases are called inter-phase dynamism.

Our work is based on a two-stage approach. In the first stage, Design-Time Analysis (DTA), fine-granular application regions are first filtered out using scorep-autofilter to reduce the overhead of instrumentation. Then, readex-dyn-detect identifies the regions that are worth tuning and determines the tuning potential of the application. We quantify the application dynamism w.r.t. two metrics: compute intensity and execution time. A configuration file contains the significant regions, the tuning parameters, the energy measurement plugin, and the tuning objective, and is used by PTF to perform DTA. We presented the intraphase and interphase tuning plugins that exploit the intra-phase and inter-phase dynamism, respectively. The intraphase tuning plugin exploits the application layer of the HPC stack by tuning the ATPs, such as different decomposition algorithms, preconditioners or blocking factors. It then determines a single static-best configuration for all the phases, as well as individual rts-best configurations.

To exploit the inter-phase dynamism, we presented a set of metrics to characterize the phase behaviour. These metrics, known as features, are PAPI hardware performance counters that monitor events at the CPU level and provide detailed insights into the application execution. The interphase tuning plugin uses these features to group phases with similar characteristics: DBSCAN first groups phases that are close to each other into dense clusters, and the noise points from DBSCAN are then analyzed by the spectral clustering algorithm to determine further associations or similarities, using eigengaps to perform a graph cut. We emphasize that we set a hard limit on the minimum number of points in a cluster, and hence discard any clusters containing fewer points than the threshold.

While the interphase tuning plugin could potentially select the best configurations for the clusters at an early stage, we argue for an additional tuning step to improve the confidence in the tuning result. Hence, we perform search space optimization using a targeted search of the tuning parameters: the configuration to evaluate is selected by a probabilistic random search strategy based on a Gaussian distribution. This is premised on the idea that certain configurations are better suited to reduce the normalized energy consumption of a cluster of phases, and are thus more attractive for evaluation. The random strategy picks a configuration from a discrete probability distribution that is generated from the summation of the individual Gaussian bell curves of the attractor and repeller configurations. The Gaussians are functions of the normalized energy consumption and of the distance between a configuration and an attractor or a repeller.
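As an illustration of how such a distribution can be built and sampled, the following minimal Python sketch sums positive Gaussian bells around attractor configurations and subtracts bells around repeller configurations over a one-dimensional search space, then normalizes the result and draws the next configuration to evaluate. The search space, bell widths, base weight and scaling factors are hypothetical; the plugin's actual formulation differs in its details.

    import numpy as np

    # Hypothetical 1-D search space: indices of, e.g., uncore-frequency settings
    configs = np.arange(20)

    def gaussian(x, center, width=2.0):
        return np.exp(-0.5 * ((x - center) / width) ** 2)

    def sampling_distribution(attractors, repellers):
        """Add bells around attractors, subtract bells around repellers,
        then clip and normalize to obtain a discrete probability distribution."""
        weights = np.full(configs.shape, 0.1)       # small base weight everywhere
        for center, gain in attractors:             # lower normalized energy -> taller bell
            weights += gain * gaussian(configs, center)
        for center, loss in repellers:
            weights -= loss * gaussian(configs, center)
        weights = np.clip(weights, 1e-6, None)
        return weights / weights.sum()

    p = sampling_distribution(attractors=[(5, 1.0)], repellers=[(15, 0.8)])
    next_config = np.random.choice(configs, p=p)    # configuration to evaluate next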
The best configurations for the phases and the rts's determined by the intraphase and the interphase tuning plugins are encapsulated in the tuning model. The tuning model guides the second stage of our approach, called Runtime Application Tuning (RAT). During RAT, the READEX Runtime Library (RRL) reads the tuning model and dynamically switches the system configuration for the phase and the rts's. For inter-phase tuning, the tuning model additionally contains the ranges of the cluster features and the phases belonging to each cluster. The application is first linked with the runtime cluster prediction library, which uses three different cluster predictors based on a second-order Markov chain and on the one-bit and two-bit dynamic branch prediction schemes to identify the phase behavior at runtime by predicting the cluster id of an unseen phase.

We highlight that our approach can be used with minimal user involvement. The pre-analysis steps automatically filter overly fine-granular regions and identify the dynamism. The tuning strategy produces a tuning model with automatic compiler instrumentation; we manually instrument significant regions only to reduce the instrumentation overhead, a general problem for all instrumentation-based tools. Furthermore, our approach executes the application, in the worst case, two times during DTA, and these runs can be limited to a representative set of progress loop iterations instead of the entire application run.

We compared the theoretical energy savings computed during DTA with the runtime savings, taking into account the switching overhead. We focused on three complex, dynamic real-world applications, 128.GAPgeofem, sam(oa)2 and INDEED, and one proxy benchmark, miniMD from the Mantevo project, instead of benchmarks that usually call the same regions and exhibit the same behavior in each iteration. Maximum overall improvements of 18.61% for the job energy, 18.6% for the CPU energy, and nearly 20% for the performance were observed for sam(oa)2 for multi-node runs, with the two-bit cluster predictor performing best, followed by the second-order Markov chain predictor. The least savings were observed for the single node run of miniMD, with an improvement of less than 1% in the CPU energy for all three predictors. This is because a new cluster is formed every tenth phase due to the reneighbouring of the atoms. Since the ranges of the cluster features for the new cluster are not known to the predictors, they always predict the wrong cluster for an unseen phase. Moreover, the worst performance degradation was seen for the multi-node run of miniMD.

We can conclude that although dynamic tuning improves the energy-efficiency, in most cases it degrades the performance. This can be attributed to two root causes: the instrumentation overhead from Score-P, and the switching overhead from dynamically switching the CPU and uncore frequencies by the RRL. We trade off execution time for energy improvement, which typically results in higher execution times. On the other hand, our methodology reduces overheads by pre-computing the best solutions at design time and simply switching between the configurations at runtime. This keeps the runtime overhead low, as expensive runtime search techniques are not applied. Moreover, our methodology is capable of tuning dynamic HPC applications and shows the potential to scale to future Exascale systems.

8.1 Future Work

This thesis presented one of several ways of optimizing the energy-efficiency of dynamic HPC applications: identifying the characteristic behaviour of different phases and clustering them based on their behaviour. It also described a targeted or selective tuning step that increases the probability of randomly selecting potentially good configurations to evaluate. While these methods already provide valuable insights and result in energy savings, there are a number of promising opportunities for improvement. In this section, we outline four research directions for future work.

Tuning multiple objectives using a Pareto set: Our work focused on tuning the energy consumption at the cost of increased execution time. This preference for one objective over others has traditionally been formulated as a single-objective minimization problem in autotuning frameworks. However, the optimization of one objective may cause adverse consequences for the others. As we observed for most of our test applications, an improvement in the energy-efficiency typically degrades the performance or time-to-solution. On the other hand, increasing the speed of the processor to shorten the time-to-solution may raise the chip temperature, thereby increasing the risk of chip failures [137]. Thus, the goal of multi-objective optimization is to find a set of solutions that improves two or more objective functions and makes an acceptable trade-off between these objective functions, instead of a single best solution [138].

A decision vector x1 is said to be Pareto dominant over another vector x2 if x1 ≤ x2, i.e., x1 is better than or equal to x2 in every objective (and, for strict dominance, better in at least one) [139]. Since we do not want one solution dominating the others, the autotuning problem can be formulated as a multi-objective optimization, which results in a set of non-dominated solutions called a Pareto-optimal set, in which each solution is a trade-off among different conflicting objectives. The set of objective values corresponding to the Pareto set is referred to as the Pareto-optimal front. Existing evolutionary and stochastic algorithms, as well as simplex-based methods such as the Nelder-Mead algorithm, are popular multi-objective optimization methods. The Nelder-Mead algorithm is a downhill simplex method that approximates the local gradient of an underlying objective and defines simplex transformations that move the simplex towards a local minimum.
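To make the notion of non-dominance concrete, the following minimal Python sketch filters hypothetical (energy, time) measurements of candidate system configurations down to a Pareto-optimal front; both objectives are minimized, and all values are made up.

    def pareto_front(points):
        """Keep the points that no other point strictly dominates
        (both objectives are minimized)."""
        front = []
        for p in points:
            dominated = any(q[0] <= p[0] and q[1] <= p[1] and
                            (q[0] < p[0] or q[1] < p[1]) for q in points)
            if not dominated:
                front.append(p)
        return front

    # hypothetical (energy in J, time in s) results for five configurations
    measurements = [(900, 120), (940, 104), (980, 98), (1000, 97), (990, 120)]
    print(pareto_front(measurements))   # (990, 120) is dominated by (940, 104)

For the simplex-based direction, an off-the-shelf single-objective Nelder-Mead implementation such as scipy.optimize.minimize with method="Nelder-Mead" could serve as a building block, e.g., applied to a scalarized combination of the objectives.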
Specifying input identifiers for inter-phase tuning: Currently, the interphase tuning plugin tunes the application for a specific application input. However, we observe that the cluster analysis is highly dependent on the application input, such as the grid refinement level or the dimensions and properties of a sheet metal, including the material type. In Figure 1.1, INDEED was executed using the input set for a steel sheet of thickness 0.75 mm. A change in the material type completely changes the behaviour and generates a different set of cluster-best and rts-specific best configurations. For example, Figure 8.1 illustrates the trend in the execution time across the phases of INDEED when executed using the input set for an aluminum alloy sheet of thickness 1.03 mm.

Figure 8.1: Variation of the execution time across the phases of INDEED when executed using an input set for an aluminum alloy sheet.

The above example indicates that the tuning results must be identified by specifying domain knowledge for the input identifiers, similar to the intraphase tuning plugin. The tuning model generation will then produce individual tuning models for each input, and merge them to generate the final application tuning model containing the cluster information for the different input identifiers.
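A minimal sketch of such a merged tuning model is shown below; the input identifiers, cluster ids and configuration values are purely hypothetical and only illustrate indexing the cluster information by input identifiers.

    # Hypothetical per-input tuning models produced by design-time analysis
    tm_steel = {1: {"core_ghz": 2.0, "uncore_ghz": 1.8},   # cluster id -> cluster-best config
                2: {"core_ghz": 2.5, "uncore_ghz": 1.4}}
    tm_alu   = {1: {"core_ghz": 2.3, "uncore_ghz": 2.2}}

    # Merged application tuning model indexed by input identifiers
    app_tuning_model = {
        ("steel", 0.75):    tm_steel,    # material and sheet thickness in mm
        ("aluminum", 1.03): tm_alu,
    }

    def cluster_best(material, thickness_mm, cluster_id):
        return app_tuning_model[(material, thickness_mm)][cluster_id]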

Runtime cluster prediction using neural networks: The prediction of the behaviour of an unseen phase at runtime is a challenging task, since the requirements for runtime prediction are strict. First, the predictor must be computationally cheap. Second, the predictions must be performed online at the beginning of a phase. Finally, the predictions must be made using the knowledge obtained from a small number of data points. In our work, identifying the behaviour of an unseen phase at runtime was performed by three simple cluster predictors. The use of neural networks could potentially make the predictions more accurate. Among the candidates are Recurrent Neural Networks (RNNs), whose predictions are influenced by historical knowledge. The aim of an RNN is to make use of sequential, non-linear outcomes to predict the present by using the memory of the information that has been computed so far. The memory of an RNN is kept in a hidden state, and the current outcome is calculated from the previous hidden state and the current input. An RNN repeatedly applies a transformation function to a series of inputs and produces a series of output vectors as probabilities for each output. Thus, previously computed outputs that are far away from the current time step do not contribute to the current prediction [140]. The advantage of RNNs is their high prediction accuracy and their efficiency in learning new patterns in the data; however, they work well only for larger input sizes. Thus, a possible solution would be to use the second-order Markov chain as the primary cluster predictor running in the foreground, and to train an RNN in the background once a minimum number of phases have finished executing. After training the model, the output vector can be uploaded for the current run and used to perform the prediction.
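The following is a minimal sketch of the kind of background-trained predictor envisioned here, assuming PyTorch is available; the number of clusters, the history length, the layer sizes and the training schedule are all hypothetical and not part of this thesis. Cluster ids are assumed to be remapped to 0..N_CLUSTERS-1, with one id reserved for noise.

    import torch
    import torch.nn as nn

    N_CLUSTERS = 8          # hypothetical number of clusters (incl. one id for noise)
    HISTORY    = 10         # predict from the last 10 cluster ids

    class ClusterRNN(nn.Module):
        """Tiny RNN that maps a short history of cluster ids to the next cluster id."""
        def __init__(self, n_clusters, hidden=32):
            super().__init__()
            self.emb = nn.Embedding(n_clusters, 16)
            self.rnn = nn.RNN(16, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_clusters)

        def forward(self, seq):                 # seq: (batch, HISTORY) of cluster ids
            h, _ = self.rnn(self.emb(seq))
            return self.out(h[:, -1])           # logits for the next cluster id

    def train_in_background(history):
        """Called once enough phases have finished; the Markov-chain predictor
        keeps serving predictions in the foreground in the meantime."""
        assert len(history) > HISTORY
        model = ClusterRNN(N_CLUSTERS)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        xs = torch.tensor([history[i:i + HISTORY] for i in range(len(history) - HISTORY)])
        ys = torch.tensor([history[i + HISTORY] for i in range(len(history) - HISTORY)])
        for _ in range(200):
            opt.zero_grad()
            loss = loss_fn(model(xs), ys)
            loss.backward()
            opt.step()
        return model

The second-order Markov chain would keep serving predictions until enough phases have been observed to train such a model, after which the trained predictor could take over for the remainder of the run.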
Comparing tuning results on newer Intel processors: In our evaluation, we focused on optimizing the energy-efficiency for the Intel Haswell architecture. A future direction could be a comparison of the energy savings on newer Intel processors to leverage the new architectural features. First, of the five C-states available on Haswell processors, only a subset, including C0, C1 and C6, remains on Skylake-SP processors. Second, support for 512-bit wide vector operations was introduced in Skylake-SP, besides Xeon Phi [141]. Additionally, starting from the Broadwell architecture, the operating system can assign the control of the P-states to the processor, whose internal power control unit then makes autonomous decisions to switch the core and uncore frequencies based on internally collected execution statistics using Hardware Power Management (HWPM). HWPM is a power saving technique that can be configured in two operating modes: the native mode, which is based on enhanced Intel SpeedStep technology and assigns the control of the P-states to the operating system, and the HWPM mode, in which the power management is taken over by the processor [142]. While Broadwell processors act autonomously, Skylake processors use a collaborative, interface-based switching with the OS via interrupts [141], where the OS can define minimal, efficient and maximal frequencies for an execution.

Bibliography

[1] Parham Haririan. DVFS and Its Architectural Simulation Models for Improving Energy Efficiency of Complex Embedded Systems in Early Design Phase. Computers, 9, 2020. DOI: 10.3390/computers9010002.
[2] Christian Bischof, Dieter an Mey, and Christian Iwainsky. Brainware for Green HPC. Computer Science — Research and Development, 27:227–233, 2012. DOI: 10.1007/s00450-011-0198-5.

[3] Indeed: Highly Accurate Finite Element Simulation for Sheet Metal Forming. http://gns-mbh. com/en/produkte/indeed/. [4] Yury Oleynik, Michael Gerndt, Joseph Schuchart, Per Gunnar Kjeldsberg, and Wolfgang E. Nagel. Run-Time Exploitation of Application Dynamism for Energy-Efficient Exascale Computing (READEX). In 18th International Conference on Computational Science and Engineering (CSE), pages 347–350. IEEE, Oct 2015. DOI: 10.1109/CSE.2015.55. [5] Daniel Hackenberg, Robert Schöne, Thomas Ilsche, Daniel Molka, Joseph Schuchart, and Robin Geyer. An Energy Efficiency Feature Survey of the Intel Haswell Processor. In Parallel and Dis- tributed Processing Symposium Workshop, May 2015. DOI: 10.1109/IPDPSW.2015.70. [6] Mohammed Sourouri, Espen Birger Raknes, Nico Reissmann, Johannes Langguth, Daniel Hacken- berg, Robert Schöne, and Per Gunnar Kjeldsberg. Towards Fine-Grained Dynamic Tuning of HPC Applications on Modern Multi-Core Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17. Association for Com- puting Machinery, 2017. DOI: 10.1145/3126908.3126945. [7] Prasanna Balaprakash, Jack Dongarra, Todd Gamblin, Mary Hall, Jeffrey K. Hollingsworth, Boyana Norris, and Richard Vuduc. Autotuning in High-Performance Computing Applications. Proceedings of the IEEE, 106:2068–2083, Nov 2018. DOI: 10.1109/JPROC.2018.2841200. [8] Michael Gerndt, Eduardo César, and Siegfried Benkner, editors. Automatic Tuning of HPC Applica- tions - The Periscope Tuning Framework. Shaker Verlag, Aachen, 2015. ISBN: 978-3-8440-3517-9. [9] Siegfried Benkner, Franz Franchetti, Hans Michael Gerndt, and Jeffrey K. Hollingsworth. Automatic Application Tuning for HPC Architectures (Dagstuhl Seminar 13401). Dagstuhl Reports, 3:214–244, 2014. DOI: 10.4230/DagRep.3.9.214. [10] Andreas Knüpfer, Christian Rössel, Dieter an Mey, Scott Biersdorff, Kai Diethelm, Dominic Es- chweiler, Markus Geimer, Michael Gerndt, Daniel Lorenz, Allen D. Malony, Wolfgang E. Nagel, Yury Oleynik, Peter Philippen, Pavel Saviankou, Dirk Schmidl, Sameer S. Shende, Ronny Tschüter, Michael Wagner, Bert Wesarg, and Felix Wolf. Score-P: A joint performance measurement run-time infrastructure for Periscope, Scalasca, TAU, and Vampir. In Tools for High Performance Computing 2011, pages 79–91. Springer, 2012. DOI:10.1007/978-3-642-31476-6_7.


[11] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A Density-based Algorithm for Dis- covering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pages 226–231. AAAI Press, 1996. [12] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On Spectral Clustering: Analysis and an Al- gorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, NIPS’01, pages 849–856, Cambridge, MA, USA, 2001. MIT Press. [13] Dan Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. Collecting Performance Data with PAPI-C. In Tools for High Performance Computing 2009, pages 157–173. Springer Berlin Heidel- berg, 2010. DOI: 10.1007/978-3-642-11261-4_11. [14] Michael Gerndt, Madhura Kumaraswamy, Nico Reissmann, Robert Schöne, Uldis Locans, and Per Gunnar Kjeldsberg. Final Computation of Configurations. Deliverable 2.3 of the READEX project funded under the European Union’s Horizon 2020 Programme EC GA No: 671657, 2018. https://www.readex.eu/wp-content/uploads/2018/05/D2_3.pdf. [15] Per Gunnar Kjeldsberg, Robert Schöne, Andreas Gocht, Umbreen Sabir Mian, and Nico Reissmann. Final mechanisms for run-time detection, switching, and calibration. Deliverable 3.2 of the READEX project funded under the European Union’s Horizon 2020 Programme EC GA No: 671657, 2018. https://www.readex.eu/wp-content/uploads/2018/05/D3_2.pdf. [16] Andreas Gocht, Robert Schöne, and Mario Bielert. Q-Learning Inspired Self-Tuning for Energy Efficiency in HPC. arXiv preprint arXiv:1906.10970, 2019. http://arxiv.org/abs/1906.10970. [17] Jesus Carretero, Salvatore Distefano, Dana Petcu, Daniel Pop, Thomas Rauber, Gudula Runger, and David Singh. Energy-Efficient Algorithms for Ultrascale Systems. Supercomput. Front. Innov.: Int. J., 2:77–104, April 2015. DOI: 10.14529/jsfi150205. [18] Pavlos Petoumenos, Lev Mukhanov, Zheng Wang, Hugh Leather, and Dimitrios S. Nikolopoulos. Power Capping: What Works, What Does Not. In Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), ICPADS ’15, page 525–534. IEEE Com- puter Society, 2015. DOI: 10.1109/ICPADS.2015.72. [19] Pawel Czarnul, Jerzy Proficz, and Adam Krzywaniak. Energy-Aware High-Performance Computing: Survey of State-of-the-Art Tools, Techniques, and Environments. Scientific Programming, 2019, 2019. DOI: 10.1155/2019/8348791. [20] Neil Weste and David Harris. CMOS VLSI Design: A Circuits and Systems Perspective. Addison- Wesley Publishing Company, USA, 4th edition, 2010. [21] Luis Maeda-Nunez. System-level power management using online machine learning for prediction and adaptation. PhD thesis, University of Southampton, July 2016. https://eprints.soton.ac. uk/399995/.

[22] acpicpu(4) - NetBSD Manual Pages. https://netbsd.gw.com/cgi-bin/man-cgi?acpicpu+4.i386+NetBSD-7.0.1.
[23] Fabio Ferrero. Analysis and dynamic optimization of energy consumption on HPC applications based on real-time metrics, October 2017. http://webthesis.biblio.polito.it/6423/.
[24] Robert Schöne, Thomas Ilsche, Mario Bielert, Daniel Molka, and Daniel Hackenberg. Software Controlled Clock Modulation for Energy Efficiency Optimization on Intel Processors. In Proceedings of the 4th International Workshop on Energy Efficient Supercomputing, E2SC '16, page 69–76. IEEE Press, 2016. DOI: 10.1109/e2sc.2016.015.


[25] Hao Shen, Ying Tan, Jun Lu, Qing Wu, and Qinru Qiu. Achieving Autonomous Power Manage- ment Using Reinforcement Learning. ACM Trans. Des. Autom. Electron. Syst., 18, 2013. DOI: 10.1145/2442087.2442095. [26] Kapil Dev, Indrani Paul, and Wei Huang. A Framework for Evaluating Promising Power Ef- ficiency Techniques in Future GPUs for HPC. In Proceedings of the 24th High Performance Computing Symposium, HPC ’16. Society for Computer Simulation International, 2016. DOI: 10.22360/SpringSim.2016.HPC.003. [27] Kapil Dev, Sherief Reda, Indrani Paul, Wei Huang, and Wayne P. Burleson. Workload-Aware Power Gating Design and Run-Time Management for Massively Parallel GPGPUs. 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 242–247, 2016. DOI: 10.1109/ISVLSI.2016.60. [28] Sridutt Bhalachandra, Allan Porterfield, and Jan F. Prins. Using Dynamic Duty Cycle Modulation to Improve Energy Efficiency in High Performance Computing. In Proceedings of the 2015 IEEE In- ternational Parallel and Distributed Processing Symposium Workshop, IPDPSW ’15, page 911–918. IEEE Computer Society, 2015. DOI: 10.1109/IPDPSW.2015.144. [29] Intel. Intel 64 and IA-32 Architectures Software Developer’s Manual - Volume 3B: System Program- ming Guide. https://www.intel.com/content/www/us/en/architecture-and-technology/64- ia-32-architectures-software-developer-vol-3b-part-2-manual.html. [30] Corey Gough, Ian Steiner, and Winston Saunders. CPU Power Management, pages 21–70. Apress, 2015. DOI: 10.1007/978-1-4302-6638-9_2. [31] Chao Jin, Bronis R de Supinski, David Abramson, Heidi Poxon, Luiz DeRose, Minh Ngoc Dinh, Mark Endrei, and Elizabeth R. Jessup. A survey on software methods to improve the energy efficiency of parallel computing. The International Journal of High Performance Computing Applications, 31:517–549, 2017. DOI: 10.1177/1094342016665471. [32] Vishal Gupta, Paul Brett, David Koufaty, Dheeraj Reddy, Scott Hahn, Karsten Schwan, and Ganapati Srinivasa. The Forgotten “Uncore”: On the Energy-Efficiency of Heterogeneous Cores. In Proceed- ings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC’12, page 34. USENIX Association, 2012. [33] Vaibhav Sundriyal, Masha Sosonkina, Bryce M. Westheimer, and Mark Gordon. Comparisons of Core and Uncore Frequency Scaling Modes in Quantum Chemistry Application GAMESS. In Pro- ceedings of the High Performance Computing Symposium, HPC ’18. Society for Computer Simula- tion International, 2018. [34] Nicholas P. Carter, Aditya Agrawal, Shekhar Borkar, Romain Cledat, Howard David, Dave Dun- ning, Joshua Fryman, Ivan Ganev, Roger A. Golliver, Rob Knauerhase, Richard Lethin, Benoit Meis- ter, Asit K. Mishra, Wilfred R. Pinfold, Justin Teller, Josep Torrellas, Nicolas Vasilache, Ganesh Venkatesh, and Jianping Xu. Runnemede: An Architecture for Ubiquitous High-Performance Com- puting. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), HPCA ’13, page 198–209. IEEE Computer Society, 2013. DOI: 10.1109/HPCA.2013.6522319. [35] Ananta Tiwari, Michael A. Laurenzano, Laura Carrington, and Allan Snavely. Auto-tuning for Energy Usage in Scientific Applications. In Euro-Par 2011: Parallel Processing Workshops, pages 178–187. Springer Berlin Heidelberg, 2012. DOI: 10.1007/978-3-642-29740-3_21. [36] Cristian Tapus, I-Hsin Chung, and Jeffrey K. Hollingsworth. Active Harmony: Towards Automated Performance Tuning. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, SC ’02, page 1–11. 
IEEE Computer Society Press, 2002. DOI: 10.1109/SC.2002.10062.


[37] Josep Berral, Iñigo Goiri, Ramon Nou, Ferran Julià, J. Oriol Fitó, Jordi Guitart, Ricard Gavaldà, and Jordi Torres. Toward Energy-Aware Scheduling Using Machine Learning, pages 215–244. 07 2012. DOI: 10.1002/9781118342015.ch8. [38] Li Yu, Zhou Zhou, Sean Wallace, Michael E Papka, and Zhiling Lan. Quantitative modeling of power performance tradeoffs on extreme scale systems. Journal of Parallel and Distributed Computing, 84:1–14, 2015. DOI: 10.1016/j.jpdc.2015.06.006. [39] Yiannis Georgiou, Thomas Cadeau, David Glesser, Danny Auble, Morris Jette, and Matthieu Hautreux. Energy Accounting and Control with SLURM Resource and Job Management System. In Distributed Computing and Networking, pages 96–118. Springer Berlin Heidelberg, 2014. DOI: 10.1007/978-3-642-45249-9_7. [40] Dineshkumar Rajagopal, Daniele Tafani, Yiannis Georgiou, David Glesser, and Michael Ott. A Novel Approach for Job Scheduling Optimizations Under Power Cap for ARM and Intel HPC Systems. 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 142–151, 2017. DOI: 10.1109/HiPC.2017.00025. [41] Zhou Zhou, Zhiling Lan, Wei Tang, and Narayan Desai. Reducing Energy Costs for IBM Blue Gene/P via Power-Aware Job Scheduling. In Job Scheduling Strategies for Parallel Processing, pages 96– 115. Springer Berlin Heidelberg, 2014. DOI: 10.1007/978-3-662-43779-7_6. [42] Axel Auweter, Arndt Bode, Matthias Brehm, Luigi Brochard, Nicolay Hammer, Herbert Huber, Raj Panda, Francois Thomas, and Torsten Wilde. A Case Study of Energy Aware Scheduling on SuperMUC. In Supercomputing, pages 394–409. Springer International Publishing, 2014. DOI: 10.1007/978-3-319-07518-1_25. [43] Aniruddha Marathe, Peter E. Bailey, David K. Lowenthal, Barry Rountree, Martin Schulz, and Bro- nis R. de Supinski. A Run-Time System for Power-Constrained HPC Applications. In High Perfor- mance Computing, pages 394–408, 2015. DOI: 10.1007/978-3-319-20119-1_28. [44] Tapasya Patki, David K. Lowenthal, Barry Rountree, Martin Schulz, and Bronis R. de Supin- ski. Exploring Hardware Overprovisioning in Power-Constrained, High Performance Comput- ing. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, page 173–182. Association for Computing Machinery, 2013. DOI: 10.1145/2464996.2465009. [45] Osman Sarood, Akhil Langer, Laxmikant Kale, Barry Rountree, and Bronis Supinski. Optimizing power allocation to CPU and memory subsystems in overprovisioned HPC systems. pages 1–8, 2013. DOI: 10.1109/CLUSTER.2013.6702684. [46] Neha Gholkar, Frank Mueller, and Barry Rountree. Power Tuning HPC Jobs on Power-Constrained Systems. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, page 179–191. Association for Computing Machinery, 2016. DOI: 10.1145/2967938.2967961. [47] Connor Imes, Huazhe Zhang, Kevin Zhao, and Henry Hoffmann. Handing dvfs to hardware: Using power capping to control software performance. Technical Report TR-2018-03, 2018. https:// newtraell.cs.uchicago.edu/files/tr_authentic/TR-2018-03.pdf. [48] Kazuki Tsuzuku and Toshio Endo. Power Capping of CPU-GPU Heterogeneous Systems using Power and Performance Models. SMARTGREENS 2015 - 4th International Conference on Smart Cities and Green ICT Systems, Proceedings, pages 226–233, 2015. DOI: 10.5220/0005445102260233.


[49] Cristina Silvano, Giovanni Agosta, Stefano Cherubin, Davide Gadioli, Gianluca Palermo, Andrea Bartolini, Luca Benini, Jan Martinovic,ˇ Martin Palkovic,ˇ Katerinaˇ Slaninová, et al. The ANTAREX approach to autotuning and adaptivity for energy efficient HPC systems. In Proceedings of the ACM International Conference on Computing Frontiers, Proceedings of the ACM International Conference on Computing Frontiers, pages 288–293. ACM, 2016. DOI:10.1145/2903150.2903470. [50] Daniele Cesarini, Andrea Bartolini, Piero Bonfà, Carlo Cavazzoni, and Luca Benini. COUNTDOWN: a run-time library for application-agnostic energy saving in MPI communication primitives. In Pro- ceedings of the 2nd Workshop on AutotuniNg and ADaptivity AppRoaches for Energy Efficient HPC Systems, pages 1–6. Association for Computing Machinery, 2018. DOI: 10.1145/3295816.3295818. [51] Timothy Sherwood, Suleyman Sair, and Brad Calder. Phase Tracking and Prediction. In Proceedings of the 30th Annual International Symposium on Computer Architecture, ISCA ’03, page 336–349. Association for Computing Machinery, 2003. DOI: 10.1145/859618.859657. [52] Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, and Scott Mahlke. Trace Based Phase Pre- diction for Tightly-Coupled Heterogeneous Cores. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, page 445–456. Association for Comput- ing Machinery, 2013. DOI: 10.1145/2540708.2540746. [53] Joseph Schuchart, Michael Gerndt, Per Gunnar Kjeldsberg, Michael Lysaght, David Horák, Lubomír Ríha,ˇ Andreas Gocht, Mohammed Sourouri, Madhura Kumaraswamy, Anamika Chowdhury, Mag- nus Jahre, Kai Diethelm, Othman Bouizi, Umbreen Sabir Mian, Jakub Kružík, Radim Sojka, Mar- tin Beseda, Venkatesh Kannan, Zakaria Bendifallah, Daniel Hackenberg, and Wolfgang E. Nagel. The READEX formalism for automatic tuning for energy efficiency. Computing, 2017. DOI: 10.1007/s00607-016-0532-7. [54] Bilge Acun, Kavitha Chandrasekar, and Laxmikant V. Kalé. Fine-Grained Energy Efficiency Using Per-Core DVFS with an Adaptive Runtime System. 2019 Tenth International Green and Sustainable Computing Conference (IGSC), pages 1–8, 2019. DOI: 10.1109/IGSC48788.2019.8957174. [55] Canturk Isci, Gilberto Contreras, and Margaret Martonosi. Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, page 359–370. IEEE Computer Society, 2006. DOI: 10.1109/MICRO.2006.30. [56] Andreas Sembrant, David Eklov, and Erik Hagersten. Efficient Software-Based Online Phase Classi- fication. In Proceedings of the 2011 IEEE International Symposium on Workload Characterization, IISWC ’11, page 104–115. IEEE Computer Society, 2011. DOI: 10.1109/IISWC.2011.6114207. [57] Priya Nagpurkar, Chandra Krintz, Michael Hind, Peter F. Sweeney, and V. T. Rajan. Online Phase Detection Algorithms. In Proceedings of the International Symposium on Code Generation and Optimization, CGO ’06, page 111–123. IEEE Computer Society, 2006. DOI: 10.1109/CGO.2006.26. [58] Jinpyo Kim, Sreekumar V. Kodakara, Wei-Chung Hsu, David J. Lilja, and Pen-Chung Yew. Dynamic Code Region (DCR) Based Program Phase Tracking and Prediction for Dynamic Optimizations. In High Performance Embedded Architectures and Compilers, pages 203–217. Springer Berlin Heidel- berg, 2005. DOI: 10.1007/11587514_14. [59] Juan Gonzalez, Judit Gimenez, and Jesus Labarta. 
Automatic Detection of Parallel Applica- tions Computation Phases. In Proceedings of the 2009 IEEE International Symposium on Par- allel & Distributed Processing, IPDPS ’09, page 1–11. IEEE Computer Society, 2009. DOI: 10.1109/IPDPS.2009.5161027.


[60] Xia Zhang, Xusheng Xiao, Liang He, Yun Yong Ma, Yangyang Huang, Xuanzhe Liu, Wenyao Xu, and Cong Liu. PIFA: An Intelligent Phase Identification and Frequency Adjustment Framework for Time-Sensitive Mobile Computing. 2019 IEEE Real-Time and Embedded Technology and Applica- tions Symposium (RTAS), pages 54–64, 2019. DOI: 10.1109/RTAS.2019.00013. [61] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically Characterizing Large Scale Program Behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, page 45–57. Association for Computing Machinery, 2002. DOI: 10.1145/605397.605403. [62] Kelly Livingston, Nicolas Triquenaux, Thibault Fighiera, Jean Christophe Beyler, and William Jalby. Computer using too much power? Give it a REST (Runtime Energy Saving Technology). Computer Science-Research and Development, 29:123–130, 2014. DOI: 10.1007/s00450-012-0226-0. [63] Joshua Dennis Booth, Jagadish Kotra, Hui Zhao, Mahmut T. Kandemir, and Padma Raghavan. Phase Detection with Hidden Markov Models for DVFS on Many-Core Processors. In 35th IEEE Inter- national Conference on Distributed Computing Systems, ICDCS, pages 185–195. IEEE Computer Society, 2015. DOI: 10.1109/ICDCS.2015.27. [64] Jonathan Eastep, Steve Sylvester, Christopher Cantalupo, Brad Geltz, Federico Ardanaz, Asma Al- Rawi, Kelly Livingston, Fuat Keceli, Matthias Maiterth, and Siddhartha Jana. Global Extensible Open Power Manager: A Vehicle for HPC Community Collaboration on Co-Designed Energy Management Solutions. In International Supercomputing Conference, pages 394–412, 2017. DOI: 10.1007/978- 3-319-58667-0_21. [65] Robert Schöne and Daniel Molka. Integrating Performance Analysis and Energy Efficiency Optimiza- tions in a Unified Environment. Comput. Sci., 29:231–239, 2014. DOI: 10.1007/s00450-013-0243-7. [66] Yoshihiko Hotta, Mitsuhisa Sato, Hideaki Kimura, Satoshi Matsuoka, Taisuke Boku, and Daisuke Takahashi. Profile-Based Optimization of Power Performance by Using Dynamic Voltage Scaling on a PC Cluster. In Proceedings of the 20th International Conference on Parallel and Distributed Pro- cessing, IPDPS’06, page 298. IEEE Computer Society, 2006. DOI: 10.1109/IPDPS.2006.1639597. [67] Karunakar Reddy Basireddy. Runtime energy management of concurrent applications for multi-core platforms. PhD thesis, University of Southampton, April 2019. https://eprints.soton.ac.uk/ 433546/. [68] Ghislain Landry Tsafack Chetsa, Laurent Lefevrem, and Patricia Stolf. A three step blind approach for improving high performance computing systems’ energy performance. Concurrency and Compu- tation: Practice and Experience, 26:2612–2629, 2014. DOI: 10.1002/cpe.3312. [69] Ghislain Landry Tsafack, Laurent Lefevre, Jean-Marc Pierson, Patricia Stolf, and Georges Da Costa. Exploiting Performance Counters to Predict and Improve Energy Performance of HPC Systems. Fu- ture Generation Computer Systems, 36:287–298, 2014. DOI: 10.1016/j.future.2013.07.010. [70] Michael Gerndt and Michael Ott. Automatic performance analysis with periscope. Concurrency and Computation: Practice and Experience, 22:736–748, 2010. DOI: 10.1002/cpe.1551. [71] Carla Guillen, Carmen Navarrete, David Brayford, Wolfram Hesse, and Matthias Brehm. DVFS au- tomatic tuning plugin for energy related tuning objectives. In 2nd International Conference on Green High Performance Computing (ICGHPC), pages 1–8, 2016. DOI: 10.1109/ICGHPC.2016.7508061. [72] Carmen B. 
Navarrete, Carla Guillén, Wolfram Hesse, and Matthias Brehm. Autotuning the energy consumption. In Parallel Computing: Accelerating Computational Science and Engineering (CSE), Proceedings of the International Conference on Parallel Computing, ParCo 2013, pages 668–677, 2013. DOI:10.3233/978-1-61499-381-0-668.


[73] Robert Springer, David K. Lowenthal, Barry Rountree, and Vincent W. Freeh. Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster. Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP, pages 230–238, 2006. DOI: 10.1145/1122971.1123006. [74] Joshua Peraza, Ananta Tiwari, Michael Laurenzano, Laura Carrington, and Allan Snavely. PMaC’s Green Queue: A Framework for Selecting Energy Optimal DVFS Configurations in Large Scale MPI Applications. Concurrency and Computation: Practice and Experience, 28:211–231, 2016. DOI: 10.1002/cpe.3184. [75] Karthik Elangovan, Ivan Rodero, Manish Parashar, Francesc Guim, and Isaac Hernandez. Adap- tive memory power management techniques for HPC workloads. pages 1–11, 2011. DOI: 10.1109/HiPC.2011.6152740. [76] Chung-hsing Hsu and Wu-chun Feng. A Power-Aware Run-Time System for High-Performance Computing. In Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC ’05. IEEE Computer Society, 2005. DOI: 10.1109/SC.2005.3. [77] Thomas Rauber, Gudula Rünger, and Matthias Stachowski. Model-based optimization of the energy efficiency of multi-threaded applications. In Eighth International Green and Sustainable Computing Conference (IGSC), pages 1–6, 2017. DOI: 10.1109/IGCC.2017.8323578. [78] Bo Su, Junli Gu, Li Shen, Wei Huang, Joseph L. Greathouse, and Zhiying Wang. PPEP: Online Performance, Power, and Energy Prediction Framework and DVFS Space Exploration. In Proceed- ings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, page 445–457. IEEE Computer Society, 2014. DOI: 10.1109/MICRO.2014.17. [79] Ruben Vazquez, Ann Gordon-Ross, and Greg Stitt. Machine Learning-based Prediction for Dynamic, Runtime Architectural Optimizations of Embedded Systems. In 2019 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC), pages 1–7, 2019. DOI: 10.1109/NORCHIP.2019.8906901. [80] Ryan Cochran, Can Hankendi, Ayse Coskun, and Sherief Reda. Identifying the Optimal Energy- Efficient Operating Points of Parallel Workloads. In Proceedings of the International Conference on Computer-Aided Design, ICCAD ’11, page 608–615. IEEE Press, 2011. DOI: 10.1109/IC- CAD.2011.6105393. [81] Ryan Cochran, Can Hankendi, Ayse K. Coskun, and Sherief Reda. Pack & Cap: Adaptive DVFS and Thread Packing under Power Caps. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, page 175–185. Association for Computing Machinery, 2011. DOI: 10.1145/2155620.2155641. [82] Gence Ozer, Sarthak Garg, Neda Davoudi, Gabrielle Poerwawinata, Matthias Maiterth, Alessio Netti, and Daniele Tafani. Towards a Predictive Energy Model for HPC Runtime Systems Using Supervised Learning. In Proc. of PMACS Workshop, 2019. DOI: 10.1007/978-3-030-48340-1_48. [83] Shervin Hajiamini and Behrooz Shirazi. Evaluation of a Practical Markov model-based Methodology for Energy Efficiency in Multicore Systems. Tenth International Green and Sustainable Computing Conference (IGSC), pages 1–8, 2019. DOI: 10.1109/IGSC48788.2019.8957190. [84] Shajulin Benedict, R.S. Rejitha, Philipp Gschwandtner, Radu Prodan, and Thomas Fahringer. Energy Prediction of OpenMP Applications Using Random Forest Modeling Approach. 2015 IEEE Inter- national Parallel and Distributed Processing Symposium Workshop, pages 1251–1260, 2015. DOI: 10.1109/IPDPSW.2015.12.


[85] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996. DOI: 10.1613/jair.301. [86] Barry Rountree, David K. Lowenthal, Bronis R. de Supinski, Martin Schulz, Vincent W. Freeh, and Tyler Bletsch. Adagio: Making DVS Practical for Complex HPC Applications. In Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, page 460–469. Association for Computing Machinery, 2009. DOI: 10.1145/1542275.1542340. [87] Min Yeol Lim, Vincent W. Freeh, and David K. Lowenthal. Adaptive, Transparent Frequency and Voltage Scaling of Communication Phases in MPI Programs. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, page 107. Association for Computing Machinery, 2006. DOI: 10.1145/1188455.1188567. [88] Vincent W. Freeh and David K. Lowenthal. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster. In Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’05, page 164–173. Association for Computing Machinery, 2005. DOI: 10.1145/1065944.1065967. [89] Etienne André, Remi Dulong, Amina Guermouche, and François Trahay. DUF : Dynamic Un- core Frequency scaling to reduce power consumption. working paper or preprint. https://hal. archives-ouvertes.fr/hal-02401796/file/report.pdf. [90] Neha Gholkar, Frank Mueller, and Barry Rountree. Uncore Power Scavenger: A Runtime for Uncore Power Conservation on HPC Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’19. Association for Computing Machinery, 2019. DOI: 10.1145/3295500.3356150. [91] Pietro Cicotti, Ananta Tiwari, and Laura Carrington. Efficient speed (ES): Adaptive DVFS and clock modulation for energy efficiency. In 2014 IEEE International Conference on Cluster Computing (CLUSTER), pages 158–166. IEEE, 2014. DOI: 10.1109/CLUSTER.2014.6968750. [92] Reiji Suda. A Bayesian method of online automatic tuning. In Software Automatic Tuning, pages 275–293. Springer, 2011. DOI: 10.1007/978-1-4419-6935-4_16. [93] Babak Behzad, Surendra Byna, Prabhat, and Marc Snir. Optimizing I/O Performance of HPC Appli- cations with Autotuning. ACM Trans. Parallel Comput., 5, 2019. DOI: 10.110.1145/3309205. [94] Madhura Kumaraswamy, Anamika Chowdhury, and Michael Gerndt. Design-Time Analysis for the READEX Tool Suite. In Parallel Computing is Everywhere, Proceedings of the International Con- ference on Parallel Computing, ParCo 2017, pages 307–316, 2017. DOI: 10.3233/978-1-61499-843- 3-307.

[95] PAPI: Performance Application Programming Interface. http://icl.cs.utk.edu/papi/.
[96] Heike Jagode, Anthony Danalis, and Jack Dongarra. What it Takes to keep PAPI Instrumental for the HPC Community. In 1st Workshop on Sustainable Scientific Software (CW3S19), 2019. https://collegeville.github.io/CW3S19/WorkshopResources/WhitePapers/JagodeHeike_CW3S19_papi.pdf.
[97] Madhura Kumaraswamy, Anamika Chowdhury, Michael Gerndt, Zakaria Bendifallah, Othman Bouizi, Uldis Locans, Lubomír Říha, Ondřej Vysocký, Martin Beseda, and Jan Zapletal. Domain knowledge specification for energy tuning. Concurrency and Computation: Practice and Experience, 31, 2019. DOI: 10.1002/cpe.4650.


[98] Stefan Valentin Gheorghita, Martin Palkovic, Juan Hamers, Arnout Vandecappelle, Stelios Mam- agkakis, Twan Basten, Lieven Eeckhout, Henk Corporaal, Francky Catthoor, Frederik Vandeputte, and Koen De Bosschere. System-Scenario-Based Design of Dynamic Embedded Systems. ACM Trans. Des. Autom. Electron. Syst., 14, 2009. DOI: 10.1145/1455229.1455232. [99] Evan Rosser, Wayne Kelly, Bill Pugh, Dave Wonnacott, Tatiana Shpeisman, and Vadim Maslov. The Omega Calculator. https://github.com/davewathaverford/the-omega-project. [100] J. Chang, K.B. Nakshatrala, M.G. Knepley, and L. Johnsson. A performance spectrum for parallel computational frameworks that solve PDEs. Concurrency and Computation: Practice and Experi- ence, 30. DOI: 10.1002/cpe.4401. [101] Thomas Röhl, Jan Eitzinger, Georg Hager, and Gerhard Wellein. Validation of Hardware Events for Successful Performance Pattern Identification in High Performance Computing. In Tools for High Performance Computing 2015, pages 17–28. Springer International Publishing, 2016. DOI: 10.1007/978-3-319-39589-0_2. [102] Bhavishya Goel. Measurement, Modeling, and Characterization for Energy-efficient Computing. PhD thesis, Chalmers University of Technology, 2016. https://research.chalmers.se/publication/ 236780/file/236780_Fulltext.pdf. [103] Alexander Strehl and Joydeep Ghosh. Relationship-Based Clustering and Cluster Ensembles for High-Dimensional Data Mining. PhD thesis, 2002. http://hdl.handle.net/2152/967. [104] Jörg Sander, Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Min. Knowl. Discov., 2:169–194, June 1998. DOI:10.1023/A:1009745219419. [105] Manisha Naik Gaonkar and Kedar Sawant. AutoEpsDBSCAN: DBSCAN with Eps automatic for large dataset. International Journal on Advanced Computer Theory and Engineering, 2:11–16, 2013. [106] Madhura Kumaraswamy and Michael Gerndt. Leveraging Inter-Phase Application Dynamism for Energy-Efficiency Auto-tuning. In PDPTA’18: The 24th International Conference on Parallel and Distributed Processing Techniques and Applications, pages 132–138, 2018. https://csce.ucmss. com/cr/books/2018/LFS/CSREA2018/PDP3694.pdf. [107] Ulrike Luxburg. A Tutorial on Spectral Clustering. Statistics and Computing, 17:395–416, 2007. DOI: 110.1007/s11222-007-9033-z. [108] S. Chen, Z. Li, X. Tang, and J. Liu. Noise Robust Spectral Clustering. In 2007 11th IEEE International Conference on Computer Vision, pages 1–8. IEEE Computer Society, 2007. DOI: 10.1109/ICCV.2007.4409061. [109] Yuan Wang, Xiaochun Wang, and Xia Li Wang. A Spectral Clustering Based Outlier Detection Technique. In Machine Learning and Data Mining in Pattern Recognition, pages 15–27. Springer International Publishing, 2016. DOI: 10.1007/978-3-319-41920-6_2. [110] Lihi Zelnik-Manor and Pietro Perona. Self-Tuning Spectral Clustering. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS’04, page 1601–1608, Cambridge, MA, USA, 2004. MIT Press. [111] Francis R. Bach and Michael I. Jordan. Learning Spectral Clustering, With Application To Speech Separation. J. Mach. Learn. Res., 7:1963–2001, December 2006. [112] Nurul Nnadiah Zakaria, Mahmod Othman, Rajalingam Sokkalingam, Hanita Daud, Lazim Abdullah, and Evizal Abdul Kadir. Markov Chain Model Development for Forecasting Air Pollution Index of Miri, Sarawak. Sustainability, 11:5190, 2019. DOI: 10.3390/su11195190.


[113] Bruce A. Craig and Peter P. Sendi. Estimation of the transition matrix of a discrete-time Markov chain. Health Economics, 11:33–42, 2002. DOI: 10.1002/hec.654. [114] Charles Miller Grinstead and James Laurie Snell. Introduction to probability. American Mathemati- cal Soc., 2012. https://math.dartmouth.edu/~prob/prob/prob.pdf. [115] Giorgio Alfredo Spedicato. Discrete Time Markov Chains with R. R Journal, 9, 2017. DOI: 10.32614/RJ-2017-036. [116] Dilek Eren Akyuz, Mehmetcik Bayazit, and Bihrat Onoz. Markov chain models for hydrological drought characteristics. Journal of Hydrometeorology, 13:298–309, 2012. DOI: 10.1175/JHM-D-11- 019.1. [117] Chuan Zhang, Ziyuan Shen, Wei Wei, Jing Zhao, Zaichen Zhang, and Xiaohu You. Molecular com- puting for Markov chains. Natural Computing, pages 1–16, 2019. DOI: 10.1007/s11047-019-09736- 8. [118] David H. Bailey. The NAS Parallel Benchmarks. In Encyclopedia of Parallel Computing, pages 1254–1259. Springer, 2011. DOI: 10.1007/978-0-387-09766-4. [119] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM, 52:65–76, 2009. DOI: 10.1145/1498765.1498785. [120] Joseph Schuchart, Michael Werner, Andreas Gocht, and Sven Schiffner. Score-P x86 Energy plugin. https://github.com/readex-eu/scorep_plugin_x86_energy. [121] Madhura Kumaraswamy and Michael Gerndt. Exploiting Dynamism in HPC Applications to Opti- mize Energy-Efficiency. In 49th International Conference on Parallel Processing - ICPP : Workshops (ICPP Workshops ’20), ICPP Workshops ’20, New York, NY, USA, 2020. Association for Computing Machinery. DOI: 10.1145/3409390.3409399. [122] Madhura Kumaraswamy, Anamika Chowdhury, Andreas Gocht, Jan Zapletal, Kai Diethelm, Lubomir Riha, Marie-Christine Sawley, Michael Gerndt, Nico Reissmann, Ondrej Vysocky, Othman Bouizi, Per Gunnar Kjeldsberg, Ramon Carreras, Robert Schöne, Umbreen Sabir Mian, Venkatesh Kannan, and Wolfgang E. Nagel. Saving Energy Using the READEX Methodology. In International Workshop on Parallel Tools for High Performance Computing. Springer, 2018. Accepted for publication. [123] Marcus Hähnel, Björn Döbel, Marcus Völp, and Hermann Härtig. Measuring Energy Consump- tion for Short Code Paths Using RAPL. SIGMETRICS Perform. Eval. Rev., 40:13–17, 2012. DOI: 10.1145/2425248.2425252. [124] Standard Performance Evaluation Corporation. 128.GAPgeofem SPEC MPI2007 Benchmark De- scription. https://www.spec.org/auto/mpi2007/Docs/128.GAPgeofem.html. [125] Hiroshi Okuda, Kengo Nakajima, Mikio Iizuka, Li Chen, and Hisashi Nakamura. Parallel Finite Element Analysis Platform for the Earth Simulator: GeoFEM. volume 2659, pages 773–780, 01 2003. DOI: 10.1007/3-540-44863-2_75. [126] Matthias S Müller, Matthijs Van Waveren, Ron Lieberman, Brian Whitney, Hideki Saito, Kalyan Kumaran, John Baron, William C Brantley, Chris Parrott, Tom Elken, et al. SPEC MPI2007—an application benchmark suite for parallel systems using MPI. Concurrency and Computation: Practice and Experience, 22:191–205, 2010. DOI: 10.1002/cpe.1535.

[127] Oliver Meister. Sam(oa)2 - SFCs and Adaptive Meshes for Oceanic And Other Applications. https://github.com/meistero/Samoa.


[128] Ao Mo-Hellenbrand. Resource-Aware and Elastic Parallel Software Development for Distributed-Memory HPC Systems. Dissertation, Technische Universität München, München, 2019.

[129] Sebastian Rettenberger. ASAGI: a pArallel Server for Adaptive GeoInformation. https://github.com/TUM-I5/ASAGI.
[130] Alexander Breuer and Michael Bader. Teaching Parallel Programming Models on a Shallow-Water Code. 2012 11th International Symposium on Parallel and Distributed Computing, pages 301–308, 2012. DOI: 10.1109/ISPDC.2012.48.
[131] Oliver Meister. Sierpinski Curves for Parallel Adaptive Mesh Refinement in Finite Element and Finite Volume Methods. Dissertation, Technische Universität München, München, 2016.

[132] British Oceanographic Data Centre. General Bathymetric Chart of the Oceans. https://www.gebco.net/.
[133] Kai Diethelm. Tools for assessing and optimizing the energy requirements of high performance scientific computing software. PAMM, 16:837–838, 05 2016. DOI: 10.1002/pamm.201610407.

[134] GitHub - Mantevo/miniMD: MiniMD Molecular Dynamics Mini-App. https://github.com/Mantevo/miniMD.

[135] Charm++: Mini-Apps. http://charmplusplus.org/miniApps/#leanmd.
[136] John Pennycook and Stephen A. Jarvis. Developing Performance-Portable Molecular Dynamics Kernels in OpenCL. In 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pages 386–395, 2012. DOI: 10.1109/SC.Companion.2012.58.
[137] Herbert Jordan, Peter Thoman, Juan J Durillo, Simone Pellegrini, Philipp Gschwandtner, Thomas Fahringer, and Hans Moritsch. A multi-objective auto-tuning framework for parallel codes. In SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE, 2012. DOI: 10.1109/SC.2012.7.
[138] Mohammad Azzeh, Ali Bou Nassif, Shadi Banitaan, and Fadi Almasalha. Pareto efficient multi-objective optimization for local tuning of analogy-based estimation. Neural Computing and Applications, 27:2241–2265, 2016. DOI: 10.1007/s00521-015-2004-y.
[139] Antoine S Dymond, Andries P Engelbrecht, Schalk Kok, and P Stephan Heyns. Tuning optimization algorithms under multiple objective function evaluation budgets. IEEE Transactions on Evolutionary Computation, 19:341–358, 2014. DOI: 10.1109/TEVC.2014.2322883.
[140] Gang Chen. A gentle tutorial of recurrent neural network with error backpropagation. arXiv preprint arXiv:1610.02583, 2016. https://arxiv.org/abs/1610.02583.
[141] Robert Schöne, Thomas Ilsche, Mario Bielert, Andreas Gocht, and Daniel Hackenberg. Energy efficiency features of the Intel Skylake-SP processor and their impact on performance. arXiv preprint arXiv:1905.12468, 2019. https://arxiv.org/abs/1905.12468.
[142] Fujitsu. BIOS optimizations for Xeon Scalable processors based systems. July 2019. https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-skylake-RXTX-bios-settings-primergy-ww-en.pdf.
