TECHNISCHE UNIVERSITÄT MÜNCHEN FAKULTÄT FÜR INFORMATIK

Fault Tolerant Optimizations for High Performance Computing Systems

Dai Yang

Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines

Doktor der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation.

Vorsitzender: Prof. Dr. Martin Bichler

Prüfende der Dissertation:

1. Prof. Dr. Dr. h.c. (NAS RA) Arndt Bode

2. Prof. Dr. Dieter Kranzlmüller Ludwig-Maximilians-Universität München

Die Dissertation wurde am 16.09.2019 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 18.12.2019 angenommen.

Acknowledgments

This extensive work could only be accomplished thanks to the successful teamwork of present and past colleagues at TUM, KIT, RWTH, and Uni Mainz. I can only express my wholehearted thanks to all my colleagues, professors, students, and friends. First, I would like to thank my advisor, Prof. Dr. Dr. h.c. Arndt Bode, for his inspiring advice and help throughout these years. In addition, I would like to thank Dr. Carsten Trinitis, who was always extremely helpful whenever I needed him. I would like to thank Prof. Dr. Dieter Kranzlmüller for his role as secondary advisor. Finally, I want to thank Prof. Dr. Martin Schulz, who has only recently joined the chair, for his support and help in improving my work, especially with the MPI-related topics. I also want to thank all of my other colleagues at the Chair of Computer Architecture and Parallel Systems at TUM and LRZ, especially Josef Weidendorfer, Tilman Küstner and Amir Raoofy, who contributed many papers and helped me a lot. I will never forget the technical discussions with Josef and Amir, and all the encouraging words from Til which guided me through thick and thin. I would like to thank Beate Hinterwimmer and Silke Albrecht for the best support in the chair that anyone could imagine. I also want to thank my other colleagues for their help: Andreas Wilhelm, Alexis Engelke, Marcel Meyer, Jürgen Obermeier, and 有間 英志. Furthermore, I want to thank several colleagues at the Chair of Astronautics at TUM: Sebastian Rückerl, Nicolas Appel, Martin Langer, and Florian Schummer for all the fun they sparked. In addition, I would like to thank the Federal Ministry of Education and Research of the Federal Republic of Germany for providing the grant for the project ENVELOPE under the grant title 01|H16010D. I gratefully acknowledge the Gauss Centre for Supercomputing e.V. for funding this project by providing computing time on the GCS Supercomputer SuperMUC and Linux Cluster at the Leibniz Supercomputing Centre. Finally, I want to especially thank my family and friends for providing me with all kinds of support. They helped me to make crucial decisions. Just to mention some names: Nina Harders, Clemens Jonischkeit, Michael Schreier, Simon Roßkopf, Leonhard Wank, Lukas von Sturmberg, Marcel Stuht, and 安博. I also want to thank all the Bachelor and Master students, IDP Project students, and the guided research students for their contributions to my research.

Dai Yang September 10, 2019

Abstract

On the road to exascale computing, a clear trend toward a greater number of nodes and increasing heterogeneity in the architecture of High Performance Computing (HPC) systems can be observed. Classic resilience approaches, which are mainly reactive, cause significant overhead for large-scale parallel applications. In this dissertation, we present a comprehensive survey of state-of-the-practice failure prediction methods for HPC systems. We further introduce the concept of data migration as a promising way of achieving proactive fault tolerance in HPC systems. We present a lightweight application library – called LAIK – to assist application programmers in making their applications fault-tolerant. Moreover, we propose an extension – called MPI sessions and MPI process sets – to the state-of-the-art programming model for HPC applications – the Message Passing Interface (MPI) – in order to benefit from failure prediction. Our evaluation shows that no significant additional overhead is generated by using LAIK and MPI sessions for the example applications LULESH and MLEM.


Zusammenfassung

Auf dem Weg zum Exascale Computing ist ein deutlicher Trend bezüglich hoher Parallelität und hoher Heterogenität in den Rechnerarchitekturen der Höchstleistungsrechnensysteme (HLRS) zu beobachten. Klassische Ansätze zur Behandlung von Fehlertoleranz, die hauptsächlich reaktiv sind, verursachen erheblichen Mehraufwand bei großen parallelen Anwendungen. In dieser Dissertation wird ein umfassender Überblick über den Stand der Technik zur Fehlervorhersage für HLRS präsentiert. Darüber hinaus stellen wir das Konzept der Datenmigration als vielversprechenden Weg zur proaktiven Fehlertoleranz in HLRS vor. Wir führen eine leichtgewichtige Anwendungsbibliothek – LAIK – ein, die den Anwendungsprogrammierer dabei unterstützt, seine Anwendungen fehlertolerant zu machen. Außerdem schlagen wir eine Erweiterung – genannt MPI-Sessions und MPI-Prozesssets – für die Standardkommunikationsbibliothek für HPC-Anwendungen – das Message Passing Interface (MPI) – vor, um von der Fehlervorhersage zu profitieren. Unsere Auswertungen zeigen keinen signifikanten Mehraufwand, welcher durch LAIK oder MPI-Sessions für die Beispielanwendungen LULESH und MLEM eingeführt wurde.


Contents

Acknowledgments

Abstract

Zusammenfassung

1. Introduction
   1.1. Technical Background of Computer Architecture
        1.1.1. Types of Parallelism
        1.1.2. Amdahl’s Law
        1.1.3. Gustafson’s Law
        1.1.4. Heterogeneous Computing
        1.1.5. Other Factors in Processor Design
   1.2. Modern HPC System Architectures
        1.2.1. TOP500 and the High Performance LINPACK (HPL) Benchmark
        1.2.2. Parallelism and Heterogeneity in Modern HPC Systems
   1.3. Motivation
   1.4. Contribution
   1.5. Structure of This Dissertation

2. Terminology and Technical Background
   2.1. Terminology on Fault Tolerance
        2.1.1. Fault, Error, Failure
        2.1.2. Fault tolerance
   2.2. Terminology on Machine Learning and Failure Prediction
   2.3. Terminology on Parallel Computer Architecture
        2.3.1. Flynn’s Taxonomy of Computer Architectures
        2.3.2. Memory Architectures
        2.3.3. Scalability
   2.4. Terminology in Parallel Programming
        2.4.1. Message Passing Interface
        2.4.2. OpenMP


3. Failure Prediction: A State-of-the-practice Survey
   3.1. Methodology and Scope
   3.2. Survey on Failure Modes in High Performance Computing (HPC) Systems
        3.2.1. Failure Modes
        3.2.2. On Root Causes Analysis
   3.3. Survey of Failure Prediction Methods
        3.3.1. Probability and Correlation
        3.3.2. Rule-based Methods
        3.3.3. Mathematical/Analytical Methods
        3.3.4. Decision Trees/Forests
        3.3.5. Regression
        3.3.6. Classification
        3.3.7. Bayesian Networks and Markov Models
        3.3.8. Neural Networks
        3.3.9. Meta-Learning
   3.4. Insights and Discussions on Failure Predictions in High Performance Computing Systems
        3.4.1. Effectiveness of Failure Prediction System
        3.4.2. Availability of Datasets, Reproducibility of Research
        3.4.3. Metrics and the Effect of False Positive Rate
        3.4.4. Failure Prediction and Fault-mitigation Techniques

4. Fault Tolerance Strategies
   4.1. System Architecture of Batch Job Processing System
   4.2. Fault-mitigation Mechanisms
        4.2.1. Overview of Fault Tolerance Techniques
        4.2.2. Application-integrated vs. Application-transparent Techniques
        4.2.3. Checkpoint and Restart
        4.2.4. Migration
        4.2.5. Algorithm-based Fault Tolerance
        4.2.6. Summary of Fault Tolerance Techniques

5. Data Migration
   5.1. Basic Action Sequence for Data Migration
   5.2. Data Organization of Parallel Applications
   5.3. Data Consistency and Synchronization of Processes
   5.4. Summary: The Concept of Data Migration


6. LAIK: An Application-integrated Index-space Based Abstraction Library
   6.1. The LAIK Library
        6.1.1. Basic Concept of LAIK
        6.1.2. Architecture of LAIK
        6.1.3. Overview of LAIK APIs
        6.1.4. User API: The Process Group API Layer
        6.1.5. User API: The Index Space API Layer
        6.1.6. User API: The Data Container API Layer
        6.1.7. Callback APIs
        6.1.8. The External Interface
        6.1.9. The Communication Backend Driver Interface
        6.1.10. Utilities
        6.1.11. Limitations and Assumptions in Our Prototype Implementation
   6.2. Basic Example of a LAIK Program
        6.2.1. Extended Example of Automatic Data Migration with LAIK
   6.3. Evaluation of the LAIK Library with Real-world Applications
        6.3.1. Application Example 1: Image Reconstruction with the Maximum-Likelihood Expectation-Maximization (MLEM) Algorithm
        6.3.2. MPI Parallelization
        6.3.3. Evaluation of Application Example 1: MLEM
        6.3.4. Application Example 2: The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) Benchmark
        6.3.5. Evaluation of Application Example 2: LULESH
   6.4. Discussion on Effectiveness of Data Migration with LAIK
        6.4.1. Advantages of LAIK
        6.4.2. Disadvantages and Limitations of LAIK
        6.4.3. Lessons Learned

7. Extending MPI for Data Migration: MPI Sessions and MPI Process Sets
   7.1. MPI Sessions
        7.1.1. MPI Sessions and Fault Tolerance
        7.1.2. MPI Sessions and Data Migration
   7.2. Extension Proposal for MPI Sessions: MPI Process Sets
        7.2.1. Components in Our MPI Sessions / MPI Process Sets Design
        7.2.2. Semantics of MPI Process Sets
        7.2.3. Storage of Process Set Information
        7.2.4. Change Management of the MPI Process Set
        7.2.5. Implementation of Our MPI Process Set Module


        7.2.6. Known Limitations of Our MPI Sessions / MPI Process Set Prototype
   7.3. Evaluation of MPI Sessions and MPI Process Sets
        7.3.1. Basic Example of an MPI Sessions Program
        7.3.2. Application Example 1: The Maximum Likelihood Expectation Maximization (MLEM) Algorithm
        7.3.3. Application Example 2: The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)
   7.4. Discussion on the Effectiveness of MPI Sessions and MPI Process Sets
        7.4.1. Functionality of MPI Sessions and MPI Process Sets
        7.4.2. Advantages of MPI Sessions and MPI Process Sets for Fault Tolerance
        7.4.3. Disadvantages of MPI Sessions and MPI Process Sets for Fault Tolerance
        7.4.4. Limitations of Our Prototype

8. Discussion
   8.1. Fault Tolerance with Data Migration
   8.2. From Fault Tolerance to Malleability

9. Related Work
   9.1. State-of-the-practice on Failure Prediction
   9.2. Migration and Checkpointing
   9.3. Fault-tolerant Programming Models

10. Conclusion

11. Future Work

Appendices

A. Description of Systems
   A.1. SuperMUC Phase II
   A.2. CoolMUC-2
   A.3. CoolMUC-3

B. The System Matrix of MADPET-II

C. Data Structures from LULESH 2.0

D. List of Proposed Calls for MPI Sessions


E. New Calls Introduced by Our MPI Sessions / MPI Process Set Library
   E.1. MPI Sessions Interface
   E.2. Key-value Store Interface

F. List of Own Publications

Acronyms

List of Figures

List of Tables

List of Algorithms

Bibliography


1. Introduction

HPC is an essential branch of computer science which covers the processing and computation of complex scientific and engineering problems using powerful systems. A wide range of computationally intensive tasks form the major application area of HPC systems, including but not limited to weather forecasting, climate simulation, oil and gas exploration, biomedical modeling, fluid dynamics, astrophysics, and mechanical engineering. With its roots back in the 1960s, when Seymour Cray designed the first supercomputers at Control Data Corporation [HTC+89], HPC has grown rapidly and become one of the most important research areas in computer science. Superpowers such as China and the United States are racing head to head in the research and construction of the most powerful HPC systems. Nowadays, HPC systems are massively parallel supercomputers consisting of millions of Commercial-off-the-Shelf (COTS) processors. Currently, HPC systems are in the last decade of the petascale [1] era and are about to reach their next milestone – the exascale [2] era. With the introduction of Sierra [3] and Summit [4], the US has regained the champion position in the race of supercomputers after a five-year lead by China. Moreover, these two machines are believed to be the last systems before the exascale era. With an increasing number of COTS components, such as processors, memory chips, nodes, and network components, HPC systems are becoming increasingly complex. As in all complex systems, uncertainties – such as potential component faults – may propagate throughout the entire system, resulting in a significant impact on the system’s availability. The US Defense Advanced Research Projects Agency (DARPA) published a report [Ber+08] in 2008 stating that "Energy and Power", "Memory and Storage", "Concurrency and Locality", and "Resiliency" will be the major challenges of the exascale computing era. For resiliency, classic approaches to address faults, which are typically based on Checkpoint & Restart, require a significant amount of resources, which contradicts the efficiency goals [Ber+08]. Therefore, new approaches for achieving fault tolerance in HPC systems must be developed.

[1] petascale: a computer system that is capable of at least one petaFLOPS. One petaFLOPS is 10^15 Floating Point Operations per Second (FLOPS).
[2] exascale: a computer system that is capable of at least one exaFLOPS. One exaFLOPS is 10^18 FLOPS.
[3] https://hpc.llnl.gov/hardware/platforms/sierra, accessed in March 2019
[4] https://www.olcf.ornl.gov/summit/, accessed in March 2019

In this dissertation, we focus on the resiliency challenge in HPC systems. We analyze the state-of-the-practice of fault management in HPC systems and introduce two approaches to provide more efficient fault management in future HPC systems.

1.1. Technical Background of Computer Architecture

1.1.1. Types of Parallelism

To understand the development of the architectures of HPC systems, we first introduce the different types of hardware parallelism. While detailed descriptions of the different technologies for parallelism can be found in textbooks on computer architecture [HP11], we introduce the most prominent types of parallelism used in HPC systems. For a better understanding, we present a taxonomy of parallelism in Figure 1.1. The definitions of these different types of parallelism are mostly taken from Hennessy and Patterson [HP11]. A detailed description of these types of parallelism is presented below [HP11].

[Figure 1.1 (diagram): types of parallelism, split into parallelism in a uniprocessor (bit level, instruction level), data parallelism (SIMD), processor parallelism (multiprocessor, multicore), and multicomputer parallelism.]

Figure 1.1.: Taxonomy of Parallelism


[Figure 1.2 (diagram): bit level parallelism (8-, 16-, and 32-bit processors), instruction level parallelism (pipelining, superscalar + pipelining, and VLIW + pipelining over the stages IF, ID, EXE, MEM, WB), and data parallelism (scalar vs. vector processor).]

Figure 1.2.: Parallelism in a Processor

• Parallelism in a Uniprocessor: This group includes parallelism techniques that are used to improve the performance of a single processor (core). The most prominent examples are shown in Figure 1.2, and described below: – Bit level parallelism (cf. Figure 1.2 top): This technique aims at increasing


the computer performance by doubling the word size – the unit of data used by a particular processor. This way, the amount of information which can be processed by the processor is increased. The trend of increasing bit level parallelism has come to an end for Central Processing Units (CPUs) with the introduction of the 64-bit architecture. However, this type of parallelism is still used to increase the performance of other components, such as in the High Bandwidth Memory [GM03]. – Instruction level parallelism (cf. Figure 1.2 middle): This is designed to increase the number of instructions that can be processed within one clock cycle. Any program can be treated as a stream of instructions. Furthermore, each instruction can be divided into multiple stages on modern processors. For example, in a classic processor, a single instruction is divided into five stages [HP11]: Instruction fetch (IF), instruction decode (ID), execution (EXE), memory access (MEM), and writeback (WB). There are three well-known types of instruction level parallelism. Pipelining is used to partially overlap multiple instruction execution. Superscalar techniques introduce multiple units for each of the instruction stages to provide parallelism. Very Long Instruction Word (VLIW) introduces multiple units for the execution (EXE) stage.

• Data parallelism (cf. Figure 1.2 bottom): This inherits the concept of vector processors introduced in the CRAY-I [Rus78] computer. Instead of processing scalars, which consist of a single value, data parallelism allows the processing of multiple data values – called a vector. Modern processors provide special units for vector processing to improve performance, e.g., Advanced Vector Extensions (AVX) [Fir+08] in Intel x86 processors and the Advanced SIMD Extension (NEON) [Red08] in ARM processors. Graphical Processing Units (GPUs) greatly benefit from data parallelism to achieve high computational performance (a short code sketch of this kind of data parallelism follows after this list).

• Processor parallelism: This is the technology where multiple physical processors (or processor cores) are deployed on the same computer (node) to provide concur- rent execution of applications. There are two different subtypes: Multicore and multiprocessor, which are illustrated in Figure 1.3 (top). While multiprocessors comprise multiple full processors including IO controller and memory subsystem, a multicore processor usually shares the same memory subsystem and I/O units.

• Multicomputer or cluster: This is the technique used to combine multiple indepen- dent computers with the same or different hardware configuration to concurrently solve the same problem. Figure 1.3 (bottom) shows an example configuration for a multicomputer.
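The following minimal C sketch illustrates the data parallelism mentioned in the list above. It assumes an x86 processor with AVX support (compile, e.g., with "cc -mavx"); the array contents are arbitrary and purely illustrative, and the code is not taken from the dissertation.

/* One 256-bit AVX instruction adds eight single-precision values at once,
 * instead of one addition per scalar instruction. */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {10, 10, 10, 10, 10, 10, 10, 10};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats into a vector register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* one instruction: 8 additions */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.1f ", c[i]);          /* scalar equivalent: c[i] = a[i] + b[i] */
    printf("\n");
    return 0;
}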


Today, the above-mentioned parallelism technologies are widely used in supercomputers. Rather than deploying a single specific technology, different technologies are usually combined to enhance the performance of supercomputers. However, in recent developments in the architecture of supercomputers, some techniques, such as multicore processors and data parallelism, have become rather common. This development is the result of many different factors. Besides physical limits such as power and thermal design, the parallel performance of applications – which is usually analyzed using Amdahl’s and Gustafson’s laws – is also a contributing factor in the development of parallel computers.

[Figure 1.3 (diagram): processor parallelism (single processor, multi-core processor, and multiprocessor, each with IO and memory) and multicomputers/clusters (nodes connected via a high-speed interconnect).]

Figure 1.3.: Multicore, Multiprocessor and Multicomputer

1.1.2. Amdahl’s Law

Amdahl’s Law [Amd67] is a well-known formula for calculating the theoretical maximum speedup of the execution of a given task for a fixed amount of workload. It is the standard way to predict the theoretical speedup of a parallel application. It is defined in Equation 1.1, where S is the theoretical speedup of the application, s is the speedup of the part of the task that benefits from parallelization, and p stands for the proportion of the original application that benefits from parallelization. Amdahl’s law states that the theoretical speedup of the execution of a parallel task increases with the number of resources in the system and that it is limited by the non-parallel portion of the task. A graphical representation of Amdahl’s law with different percentages of the parallel portion and numbers of processors is given in Figure 1.4.

\[
S(s) = \frac{1}{(1 - p) + \frac{p}{s}}, \qquad
S(s) \leq \frac{1}{1 - p}, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - p}
\tag{1.1}
\]
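As a brief numerical illustration of Equation 1.1 (the values p = 0.95 and s = 128 are chosen here for exposition and do not come from the dissertation):

\[
S(128) = \frac{1}{(1 - 0.95) + \frac{0.95}{128}} \approx 17.4, \qquad
\lim_{s \to \infty} S(s) = \frac{1}{1 - 0.95} = 20
\]

Even with an unlimited number of processors, the 5% sequential portion caps the speedup at 20.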

Figure 1.4.: Amdahl’s Law


1.1.3. Gustafson’s Law

In parallel computing, as an extension of Amdahl’s Law, Gustafson’s Law [Gus88] gives the theoretical speedup of the execution of a given task in a fixed execution time with respect to parallelization. It is shown in Equation 1.2, where S_theo is the theoretical speedup of the application, s is the speedup of the part of the application that benefits from parallelization, and p stands for the proportion of the task which benefits from parallelization in the original application. Unlike Amdahl’s law, Gustafson’s Law aims to estimate the maximum amount of work a parallel system can process with a given parallel application in a fixed amount of time. It also shows that the maximum amount of work increases with the number of resources and is still limited by the non-parallel portion of the application. Figure 1.5 shows the estimates according to Gustafson’s law for the total speedup with different numbers of processors and different portions of the application that are not sequential.

\[
S_{theo}(s) = 1 - p + s \cdot p
\tag{1.2}
\]
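Using the same illustrative values as above (p = 0.95, s = 128; chosen for exposition only), the scaled speedup according to Equation 1.2 is:

\[
S_{theo}(128) = 1 - 0.95 + 128 \cdot 0.95 = 121.65
\]

That is, when the problem size grows with the machine, 128 processors can process roughly 120 times the work of a single processor in the same time.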

Figure 1.5.: Gustafson’s Law

1.1.4. Heterogeneous Computing

Heterogeneous computing is a term that refers to a system which uses more than one type of processor [Sha06]. The ARM big.LITTLE [CKC12] technology, which includes both high-performance and high-efficiency cores in a single chip, is an example of this from the consumer market. In the context of HPC, a system which consists of different processing units (such as CPUs and GPUs) is considered to be heterogeneous. The benefit of a heterogeneous system architecture is to provide the best performance and efficiency by dividing the application's tasks into different roles: While the CPU takes over general processing roles, GPUs can take over the computationally intensive tasks. This way, a much higher system performance can usually be achieved. Nevertheless, heterogeneous system architectures introduce many new challenges [KK11]. Different programs have to be developed to support the different types of processing units according to their instruction-set architecture and their application binary interface. Furthermore, libraries may differ across the different types of systems. Finally, data management and communication organization greatly impact the performance gain on heterogeneous systems.

1.1.5. Other Factors in Processor Design

One limiting factor for the architectural design of modern processors is heat flux, which describes the flow of energy per unit of area per unit of time [Lie11]. While more energy is required to power a more complex processor (consisting of more transistors), the size of the chip does not necessarily grow, due to the (currently) still ongoing shrink of the semiconductor manufacturing process. The heat flux therefore increases. Furthermore, as the yield of healthy processor chips drops significantly with increasing chip size [Mit16], the physical size of the processor cannot grow without limit. Therefore, parallelism in a uniprocessor alone is no longer enough for modern HPC systems. Furthermore, an increase in clock frequency (which usually leads to an increase in single-core performance) would also lead to significantly higher power dissipation, according to the following equations [Int04]:

\[
P \sim C_{dyn} \cdot V^2 \cdot f
\tag{1.3}
\]
and
\[
f \sim V
\tag{1.4}
\]
Substituting Equation 1.4 into Equation 1.3, we obtain:
\[
P \sim C_{dyn} \cdot V^3
\tag{1.5}
\]
where P = power, C_dyn = dynamic capacitance, V = voltage, and f = frequency. This means that even a slight increase in voltage, which is required to make the processor faster, will cause a significant increase in heat dissipation. Consequently, increasing the processor speed by increasing the clock speed is also not an option for performance improvement. Combined with the heat flux problem previously mentioned, an increase in processor frequency cannot be easily achieved.
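As a short numerical illustration of Equation 1.5 (a hypothetical 10% voltage increase, not a figure from the dissertation):

\[
\frac{P_{new}}{P_{old}} \sim \left( \frac{1.1 \cdot V}{V} \right)^3 = 1.1^3 \approx 1.33
\]

A 10% voltage increase, needed to sustain a higher clock frequency, thus already raises the dynamic power dissipation by roughly one third.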

1.2. Modern HPC System Architectures

1.2.1. TOP500 and the High Performance LINPACK (HPL) Benchmark

For investigating the trends and developments in HPC architectures, the TOP500 list [5] is the best source of representative data. TOP500 is a project which was launched back in the early 1990s to publish and present recent statistics on the development of HPC systems [TOP19]. The central idea is to use a benchmark – now well known as the High Performance LINPACK (HPL) [Don+79] – to determine the performance of a given HPC system. This way, a comparable ranking of different machines can be created. Nevertheless, there is no common definition of what a High Performance Computing (HPC) system (supercomputer) [6] is. For everyday use and the sake of simplicity, people usually refer to HPC for all general-purpose computers that can achieve a high Floating Point Operations per Second (FLOPS) measure. The HPL [Don+79], which was introduced by Jack Dongarra et al. in 1979, is an old but well-known benchmark. The basic idea of HPL is to benchmark a system’s floating point computing power, commonly expressed in FLOPS. The benchmark itself uses a kernel – that is, the subroutine which performs the main computational work – a linear equation solver A · x = b for a dense n by n system matrix. This kernel is a common routine required by many engineering and physics applications. The HPL version used for benchmarking is typically parallelized with the Message Passing Interface (MPI) [WD96] and utilizes the Basic Linear Algebra Subprograms (BLAS) [Law+77] as its math backend. However, as the deployment of accelerators has become common today, many other specifically optimized versions exist. Benchmarking with HPL is effective, but is frequently criticized by many researchers. First, HPL only represents the application class of dense linear algebra, which is typically compute intensive. However, many applications have different requirements regarding resources, such as memory bandwidth and interconnect bandwidth. Second, as the main source of funding for HPC systems is mostly governmental entities, the demand for a high rank in TOP500 and a ranking list purely based on a single compute-intensive benchmark may lead to a wrong focus when procuring an HPC system.

[5] https://www.top500.org, accessed in March 2019
[6] In the scope of this thesis, the terms HPC and supercomputer are used as exact synonyms.

Finally, the HPL version used for benchmarking is usually developed and optimized carefully for a given specific system. This additional optimization effort is usually carried out by dedicated system experts, which is typically not the case for applications from a real production environment. To address these problems, newer benchmarks, most notably the High Performance Conjugate Gradient (HPCG) benchmark [DHL15], are designed to provide a better ranking of HPC systems with respect to a real-world application environment. Nevertheless, TOP500 and the HPL are still seen as the metric for judging HPC systems. Hence, the results and discussions in this work about the development and trends of HPC systems are still based on the TOP500 list published in November 2018 [7].
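To make the idea of the HPL kernel more concrete, the following minimal C sketch times a naive solve of a dense linear system A · x = b and reports the achieved FLOPS using the conventional 2/3·n³ + 2·n² operation count. It is emphatically not HPL itself: the matrix size N, the diagonally dominant random matrix, and the elimination without pivoting are simplifying assumptions for illustration only.

/* Compile e.g. with: cc -O2 -o hpl_sketch hpl_sketch.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

int main(void) {
    static double A[N][N], b[N];
    srand(42);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) A[i][j] = (double)rand() / RAND_MAX;
        A[i][i] += N;                       /* diagonally dominant: no pivoting needed */
        b[i] = (double)rand() / RAND_MAX;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* forward elimination */
    for (int k = 0; k < N; k++) {
        for (int i = k + 1; i < N; i++) {
            double m = A[i][k] / A[k][k];
            for (int j = k; j < N; j++) A[i][j] -= m * A[k][j];
            b[i] -= m * b[k];
        }
    }
    /* back substitution: solution ends up in b */
    for (int i = N - 1; i >= 0; i--) {
        for (int j = i + 1; j < N; j++) b[i] -= A[i][j] * b[j];
        b[i] /= A[i][i];
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double flops = (2.0 / 3.0) * N * (double)N * N + 2.0 * (double)N * N;
    printf("n=%d  time=%.3f s  ~%.2f GFLOPS\n", N, secs, flops / secs / 1e9);
    return 0;
}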

1.2.2. Parallelism and Heterogeneity in Modern HPC Systems

To cope with the increasing demand for performance in HPC systems, one of the most common trends is – unsurprisingly – the large number of cores deployed. While early systems such as the Cray-I [Rus78] relied on a compact design with local parallelism using vector processors following the SIMD principle (cf. Section 1.1.1), today's systems are based on massively parallel architectures, which feature a tremendous number of compute units. Parallelism is everywhere in modern HPC architectures, from each processor core to the entire system (cf. Section 1.1.1, multicomputer). Inside each core, instruction level parallelism and data parallelism approaches, such as AVX, have become the standard. Furthermore, bit level parallelism is used to increase memory bandwidth. At the processor level, more and more cores are being incorporated into a single chip. The increasing number of cores per processor results in an increasing complexity per chip: More transistors and die space are required to house the cores and their management units. At the system level, the number of nodes is increasing rapidly. The increase in the number of nodes results in a high number of components, which leads to higher aggregated failure rates. Furthermore, accelerators, such as GPUs, which benefit from data parallelism, are being deployed in many state-of-the-art HPC systems, e.g., in Sierra and Summit. Heterogeneous system architectures have become one of the best ways to increase performance in HPC systems. Besides system designs with CPUs and GPUs, other accelerators (such as the manycore architecture Xeon Phi [JR13] and the Chinese Matrix-2000 [8]) are becoming additional sources of compute power in HPC systems. To visualize the correlation between the number of cores and maximum performance, we have plotted these respective values for the first 100 systems on the TOP500 list in Figure 1.6.

[7] https://www.top500.org/lists/2018/11/, accessed in March 2019
[8] https://en.wikichip.org/wiki/nudt/matrix-2000, accessed in July 2019


[Figure 1.6 (plot): total number of cores (log scale, dots colored by heterogeneity: FALSE/TRUE) and maximum LINPACK performance Rmax [TFlop/s] versus rank on the TOP500 list.]

Figure 1.6.: Overview of Top 100 Systems on the November 2018 TOP500 list

In Figure 1.6, the dots represent the number of cores of a given system, with their color indicating whether the system is heterogeneous, and the trend line represents the maximum LINPACK performance (Rmax) in TFLOPS achieved with HPL for these systems in the TOP500 list published in November 2018. Note that the y-axis of this figure has a log scale. As shown in the figure, a clear correlation between these two trends can be observed; moreover, many of the fastest systems on the TOP500 are heterogeneous, most notably Summit and Sierra. In Figure 1.7, we further observe that there is no clear correlation between flops per core and the corresponding ranking. This observation supports our conclusion that the performance gain in modern HPC systems is mainly due to increased multiprocessor parallelism rather than to improved uniprocessor design.


[Figure 1.7 (plot): Rmax per core (GFlop/s) versus rank for all 500 systems.]

Figure 1.7.: Maximum Performance per Core for all Systems on the November 2018 TOP500 list


1.3. Motivation

The trends in the architecture of modern HPC systems show that increasing parallelism based on a growing number of processor cores, independent nodes, and accelerators is to be expected in future HPC systems. Reliability, besides the management of parallelism and energy efficiency, has become a major goal in exascale computing [Ber+08]. Although there is no indication of an increasing failure rate in modern HPC systems yet, the challenge in resilience remains. Existing approaches suffer from being mostly reactive, while failures can happen at any time and with a higher probability due to the increased number of components in exascale computing. Therefore, a wide body of research targets predictive fault tolerance, which aims at forecasting upcoming failures and taking proactive measures to prevent a failure. This way, mitigation mechanisms are applied while the system is still in nominal operation, which eliminates the need for an expensive recovery mechanism. However, existing work in the field of failure prediction is diverse and lacks structure. Many different sources of failures and of data exist, along with a large number of different techniques for failure prediction. Some of the works are more general and target various scenarios, while others are tuned to solve a specific problem. In summary, the current state-of-the-practice of failure prediction, as well as its gaps and remaining challenges, is still unclear. In the first part of this dissertation, we look closely at existing work on failure prediction and identify the gaps and challenges in the current state-of-the-practice.

Sufficient failure prediction is only one prerequisite for efficient proactive failure prevention; an efficient approach for fault mitigation is the other integral part. Existing techniques such as checkpoint & restart (cf. Section 4.2.3), which cannot take advantage of any failure prediction, require an extensive amount of resources and contradict the high efficiency goals of HPC. They are therefore no longer sufficient to handle the evolving requirements in fault tolerance and reliability [Ber+08]. More recent techniques, which are transparent to the application, include process-level migration [Wan+08] or virtual machines [Cla+05]. Nevertheless, these solutions still demand a significant amount of resources, similar to classic checkpoint & restart. Furthermore, these approaches impose a high overhead on the application, limiting an application's scalability and thus providing only limited applicability to emerging exascale requirements. In contrast to application-transparent methods, an application can be modified to support fault tolerance and react to the prediction of an upcoming error. This way, the resource management overhead of the framework that provides fault tolerance is reduced, and the overall impact on performance and scalability is kept to a minimum.

In the second part of this dissertation, we present the concept of data migration, an application-integrated approach for proactive fault tolerance. We further propose an application-integrated library – called LAIK – to help application programmers utilize data migration to achieve proactive fault tolerance. Moreover, we present a potential extension to the de facto standard programming model in parallel programming – the Message Passing Interface (MPI) – namely MPI sessions and MPI process sets, which can exploit the advantages of failure prediction and provide a suitable interface for programmers to achieve proactive fault tolerance.

1.4. Contribution

In this dissertation, we provide two different new methods for fault tolerance based on data migration, following the trends in the development of exascale HPC systems. With this work, we hope to address a gap and point out promising new approaches to this emerging topic. In particular, the main contributions of this dissertation are as follows:

• We provide a high-quality, in-depth survey based on a large body of recent literature on state-of-the-art failure prediction methods. Furthermore, the research gaps, limitations, and applicability of these methods to modern HPC systems are discussed and analyzed.

• We present an overview of fault tolerance methodologies at different levels: system level, application level, and algorithm level.

• We present the concept of Data Migration, a proactive fault tolerance technique based on the repartitioning and redistribution of data combined with fault pre- diction.

• We introduce LAIK, an application-integrated fault tolerance library. The main goal of LAIK is to assist application programmers in achieving fault tolerance based on data migration by redistributing the data and removing any failing node from parallel execution. We evaluate the efficiency and overhead of LAIK using two existing applications: MLEM and LULESH.

• We introduce MPI sessions, a proposal to extend the MPI standard, and our extension to MPI sessions named MPI process sets. We present a prototype imple- mentation for MPI sessions and MPI process set. Moreover, we show that the added functionalities can be used to achieve data migration based fault toler- ance. We present evaluation results on the effectiveness of our prototype for data migration.


• We extend the concept of data migration and its role in creating more dynamic and malleable applications in the exascale era.

This dissertation is not the result of a standalone workflow; it is the combined result of a wide body of our publications from the last three years. A full list of our publications is given in Appendix F.

1.5. Structure of This Dissertation

The remainder of this dissertation is structured as follows:

• Chapter 2 introduces and explains important terminology used in this dissertation.

• Chapter 3 provides an in-depth overview of failure prediction methods in HPC systems and a discussion on their effectiveness.

• Chapter 4 outlines an overview of different fault management techniques.

• Chapter 5 presents the key concept for fault tolerance that this dissertation focuses on – data migration.

• Chapter 6 introduces and describes LAIK, a library for application-integrated fault tolerance based on index space abstraction and data migration. We also present the effectiveness and efficiency of LAIK by porting two existing applications – MLEM and LULESH.

• In Chapter 7, we propose a possible extension to the Message Passing Interface (MPI), the de facto standard in parallel programming, in order to support fault tolerance based on data migration. The effectiveness and efficiency of our extension is also evaluated with an adapted version of both MLEM and LULESH.

• Chapter 8 includes a discussion on fault management based on failure prediction and data migration in modern HPC systems.

• Chapter 9 provides a range of related work featuring similar topics and approaches for fault tolerance.

• Chapter 10 and 11 conclude this dissertation and provide an outlook on future work.

2. Terminology and Technical Background

The terminology used in this work is presented, clarified, and discussed in this chapter.

2.1. Terminology on Fault Tolerance

2.1.1. Fault, Error, Failure

Across the literature, there are many different definitions of the terms fault, error, and failure, and they are often used inconsistently. The definitions introduced by Avizienis et al. [Avi+04] are used as references in this dissertation; deviations from these definitions are indicated explicitly in the remainder of this dissertation.

• Faults are often referred to as static defects in software or hardware [Avi+04], which can be the cause of an error. A fault is not directly observable; however, its manifestation – the error – is visible. A fault in a system does not necessarily lead to an error. Systems which are resistant to faults are called fault-tolerant. Examples of a fault in a computer system include a bit flip in a Dynamic Random Access Memory (DRAM) or a transient fault in a transistor of the CPU.

• An error occurs when the current (internal) state of the system deviates from the correct state. Errors do not necessarily cause failures; they may even go entirely unnoticed (undetected errors). The root cause of an error (or each of its root causes, if there are several) is typically a fault. An example of an error would be an erroneous result calculated by the CPU due to the impact of a transient fault on a transistor in the CPU.

• A failure is any incorrect system behavior with respect to system response or a deviation from the desired functionality [Avi+04]. The same failure can be manifested in different behaviors, e.g., a node being unavailable or delivering wrong results to users and the rest of the system due to a faulty network card. Sometimes the root cause of a failure may remain unspecified.


For example, Schroeder and Gibson [SG10] interpret every event that requires the attention of an administrator as a failure, and many studies relying on log files simply treat all events tagged with a specific string as failures. Klinkenberg et al. [Kli+17] use a so-called “lock event” – an event preventing the user from using a node – as the key descriptor of a failure.

To summarize, the relation between fault, error, and failure is shown in Figure 2.1. A fault may lead to an error in a system, which may propagate through the system and cause a failure. A failure is caused by one or multiple errors, which can eventually be traced back to one or multiple faults as root causes.

[Figure 2.1 (diagram): a fault may – possibly undetected – cause an error; an error may cause a failure; conversely, a failure is caused by errors, whose root causes are faults.]

Figure 2.1.: Relation between Fault, Error, and Failure

2.1.2. Fault tolerance

Fault tolerance is the capability of a system to continue operating even after the failure of some of its components. Most fault-tolerant systems are designed to cope with specific types of failures in the system [Avi76]. For example, a computer is fault-tolerant against power failure if it is equipped with an Uninterruptible Power Supply (UPS) system [GDU07]. The most trivial way of achieving fault tolerance is fault avoidance, also known as fault intolerance [Avi76]. For example, by selecting enterprise-grade hardware which provides higher reliability, potential manufacturing faults in hardware components can be avoided, effectively making the system more reliable. Another of the most common fault tolerance techniques is redundancy. Quoting Avizienis [Avi76]: “The redundancy techniques that have been developed to protect computer systems against operational faults may assume three different forms: hardware (additional components), software (special programs), and time (repetition).” The use of redundancy is widespread in all kinds of computer systems: The UPS system already mentioned is redundancy in the power supply system. The widely used Redundant Array of Independent/Inexpensive Disks (RAID) [Pat+89] technology is redundancy in the storage system. The well-known Checkpoint & Restart approach used in HPC applications is a kind of software redundancy, which secures relevant data on persistent storage. Introducing some degree of redundancy may be the most efficient way of improving reliability; it manifests itself in the systems-engineering motto of avoiding single points of failure.


[Figure 2.2 (Venn diagram): predicted positives split into true positives and false positives, predicted negatives split into false negatives and true negatives, arranged over the ground truth (positive/negative).]

Figure 2.2.: Venn Diagram on the Relation between True Positive, False Positive, True Negative, and False Negative

2.2. Terminology on Machine Learning and Failure Prediction

The first part of this dissertation is an in-depth analysis of the state-of-the-practice in failure prediction. Most prediction methods are based on machine learning techniques. For a better understanding, the basic concepts of a machine-learning-based binary classifier for (failure) prediction – in its simplest form known as a perceptron – are explained below. In a binary classifier for failures, True Positives (TP) refer to correctly classified failures and True Negatives (TN) refer to correctly classified non-failure (in the following: ok) events. Similarly, False Negatives (FN) refer to misclassified failures and False Positives (FP) to ok events that are misclassified as failures. To help understand the concepts of TN, TP, FN, and FP, Figure 2.2 visualizes their relation. In Figure 2.2, the square represents the total set of events in an experiment. The left half of the square represents the set of events whose ground truth is “positive”, and the right half represents the set of events whose ground truth is “negative”. The circle represents the events predicted to be “positive”, and the difference set of the square and the circle comprises the events predicted to be “negative”.

With this premise, true positives are represented in blue, false positives in red, true negatives in green, and false negatives in yellow. The relation between the events and their predictions is also illustrated in the confusion matrix in Table 2.1, where events are partitioned into failure events (E) and non-failure (ok) events (K). A binary classifier can therefore predict events to be positive (failure, (+)) or negative (non-failure, (−)). False positives are false alarms that, in this context, would trigger a checkpoint even though no failure is imminent.

Table 2.1.: Confusion Matrix for Failure Prediction

                             Ground Truth
  Predicted                  Failure (E)            Non-Failure (K)
  Failure (+)                True Positive (TP)     False Positive (FP)
  Non-Failure (-)            False Negative (FN)    True Negative (TN)

The literature that discusses failure prediction methods uses the following key metrics to evaluate their quality [Faw06]:

• Precision: the probability of a positive prediction being correct, also expressed as the positive predictive value, calculated as TP / (TP + FP).
• Recall: the probability of correctly classifying errors, also expressed as the true positive rate, calculated as TP / (TP + FN).
• Miss rate: the probability of misclassifying errors as ok events, also expressed as the false negative rate, calculated as FN / (TP + FN).
• Specificity: the probability of correctly classifying ok events, also expressed as the true negative rate, calculated as TN / (TN + FP).
• Fall-out: the probability of misclassifying an ok event as an error, also expressed as the false positive rate, calculated as FP / (FP + TN).
• Accuracy: the proportion of correctly predicted events among the total number of events, calculated as (TP + TN) / (TP + FP + FN + TN).

As Salfner et al. [SLM10] remark, such basic statistical metrics are less useful, especially if failure events are rare, which is the case for HPC systems. For example, a classifier that always predicts "no failure" would achieve near-perfect accuracy, yet zero recall. However, these metrics – especially precision and recall – are considered quality criteria for failure prediction systems.

In general, a higher precision value together with a good recall value is considered desirable for a classifier. However, there is no general threshold at which these metrics are considered "good"; instead, they only underline the effectiveness of a classifier in a given setup. Other researchers [Tae+10; Zhe+10; EZS17] have criticized these measures as being unsuitable for HPC, as they do not account for lost compute time, and suggest that a weighted metric might be a better solution for HPC. Despite the known deficiencies of the classic metrics for failure prediction systems, these metrics are widely used by most researchers. Therefore, we also rely on these metrics when evaluating the failure prediction methods in this work.
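The metrics above follow directly from the four counters of the confusion matrix. The following minimal C sketch computes them for a made-up set of counts (the numbers are purely illustrative and not measurement results from this work):

#include <stdio.h>

int main(void) {
    /* hypothetical counts for a failure predictor */
    double tp = 40, fp = 10, fn = 20, tn = 9930;

    double precision   = tp / (tp + fp);              /* positive predictive value */
    double recall      = tp / (tp + fn);              /* true positive rate */
    double miss_rate   = fn / (tp + fn);              /* false negative rate */
    double specificity = tn / (tn + fp);              /* true negative rate */
    double fall_out    = fp / (fp + tn);              /* false positive rate */
    double accuracy    = (tp + tn) / (tp + fp + fn + tn);

    printf("precision   %.3f\n", precision);
    printf("recall      %.3f\n", recall);
    printf("miss rate   %.3f\n", miss_rate);
    printf("specificity %.3f\n", specificity);
    printf("fall-out    %.4f\n", fall_out);
    printf("accuracy    %.4f\n", accuracy);
    return 0;
}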

2.3. Terminology on Parallel Computer Architecture

2.3.1. Flynn’s Taxonomy of Computer Architectures

[Figure 2.3 (diagram): Flynn’s taxonomy divided by single vs. multiple data streams into SISD, MISD, SIMD, and MIMD, with MIMD further subdivided into SPMD and MPMD.]

Figure 2.3.: Extended Flynn’s Taxonomy [Fly72; Ata98]

Flynn’s taxonomy is a classification of computer architectures, which was first introduced by Michael Flynn in 1972 [Fly72]. This taxonomy has been widely used in the design of modern processor architectures, especially for those with a high degree of parallelism. Figure 2.3 illustrates Flynn’s taxonomy. The four original categories defined by Flynn [Fly72] are based on the number of concurrent instruction streams and the number of parallel data streams available in the architecture. They are:

• Single Instruction Stream Single Data Stream (SISD). An SISD computer is a sequential computer which provides no parallelism at all. The Control Unit (CU) can only fetch a single instruction at once, and the Processing Unit (PU) can only process one single data stream. An example of an SISD computer is the early x86 processor such as the Intel 8086. Figure 2.4 (top left) illustrates an SISD architecture.


• Single Instruction Stream Multiple Data Streams (SIMD). An SIMD computer still fetches a single instruction at a time, but many PUs apply that instruction to multiple data streams in parallel at the same time. An example of SIMD is Intel’s AVX/AVX2/AVX-512 instruction set extensions, which can compute up to eight floating point operations at once by applying the same instruction to multiple data elements concurrently. Figure 2.4 (bottom left) shows an SIMD architecture.

• Multiple Instruction Streams Single Data Stream (MISD). An MISD computer operates on one data stream, while many PUs execute different instructions on that data stream. This architecture is uncommon, with the Space Shuttle flight control computer [SG84] being the most prominent example of an MISD architecture. Figure 2.4 (top right) highlights an MISD architecture.

• Multiple Instruction Streams Multiple Data Streams (MIMD). An MIMD computer features many PUs which can execute different instructions on different data streams independently in parallel. It is the most widely used technique to exploit parallelism in modern computer architectures. While every multicore processor is an MIMD example, the Intel Xeon Phi manycore processor is a prominent one. Nowadays, all TOP500 supercomputers are based on the MIMD architecture. Figure 2.4 (bottom right) represents an MIMD architecture. The MIMD architecture can be further divided into subcategories [Ata98]: – Single Program Multiple Data (SPMD): SPMD [Dar+88] is the most frequently used model in parallel programming. Multiple independent PUs execute the same program asynchronously on different data sets. Compared with SIMD, the programs in SPMD do not necessarily run in lockstep, while in SIMD a single instruction is applied to multiple data streams in lockstep. The most prominent programming model for the SPMD architecture is the Message Passing Interface (MPI). – Multiple Program Multiple Data (MPMD): In MPMD, multiple PUs execute different programs concurrently. A typical programming model for the MPMD architecture is “master-slave”, where the master runs a program which facilitates the distribution of the workload and the slaves compute the workload using another program. The best-known example is the Cell microarchitecture [Gsc+06].


[Figure 2.4 (diagram): SISD – one instruction stream, one PU, one data stream; MISD – multiple instruction streams, multiple PUs, one data stream; SIMD – one instruction stream, multiple PUs, multiple data streams; MIMD – multiple instruction streams, multiple PUs, multiple data streams.]

Figure 2.4.: Visualization of SISD, SIMD, MISD, and MIMD

2.3.2. Memory Architectures

In the context of parallel computing, two memory architectures are usually referred to:

• Shared Memory: Shared memory architecture refers to the design of a memory subsystem where memory can be accessed by multiple PUs at the same time. It is an efficient means of data transfer across different PUs, as no redundant data storage is required. An example of a shared memory design is the L3 Cache or the main memory of a state-of-the-art Intel multi-core processor, such as in the Skylake microarchitecture. Figure 2.5 (left) illustrates the shared memory architecture. There are three different types of access patterns for a shared memory system [EA05]: – Uniform Memory Access (UMA), where all PUs share the physical memory equally with the same latency.


– Non-uniform Memory Access (NUMA), where memory access time is differ- ent depending on the locality of the physical memory relative to a processor. – Cache-Only Memory Architecture (COMA), where the local memories for the processor are used as cache only (instead of main memory). This is also known as scratchpad memory.

• Distributed Memory: A distributed memory architecture refers to a system with multiple PUs, in which each PU has its own private memory region, which cannot be accessed directly from another PU. Any application running on a specific PU can only work on its local data, and data which belongs to other PUs must be communicated through a (high speed) interconnect. Figure 2.5 (right) depicts the distributed memory architecture.

[Figure 2.5 (diagram): shared memory – multiple PUs connected via a system bus to one shared memory; distributed memory – multiple PUs, each with its own private memory, connected via an interconnect.]

Figure 2.5.: Shared Memory vs. Distributed Memory

2.3.3. Scalability

In parallel programming, scalability is the property of a (parallel) application to handle an increasing amount of workload by adding more computational resources to the system [Bon00]. In the context of High Performance Computing (HPC), there are two common types of scalability:

• Strong scaling is defined as the change in program execution time when varying the number of processors for a fixed size of problem. The definition is based on Amdahl’s law (cf. Section 1.1.2), which states that the upper limit of speedup for a fixed problem size is determined by the sequential part of the code.


• Weak scaling is defined as the change in program execution time when varying the number of processors for a fixed size of problem per processor. The definition is based on Gustafson’s law (cf. Section 1.1.3), which states that in the ideal case, the (scaled) speedup in weak scaling increases linearly with respect to the number of processors without an upper limit.
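Both notions are commonly quantified via speedup and parallel efficiency; the following standard definitions are added here for reference and are not quoted from the dissertation. Let T(N) denote the execution time on N processors:

\[
S(N) = \frac{T(1)}{T(N)}, \qquad E(N) = \frac{S(N)}{N}
\]

For strong scaling, the total problem size is fixed and ideal behavior corresponds to E(N) = 1; for weak scaling, the problem size per processor is fixed, and the weak-scaling efficiency T(1)/T(N) ideally stays at 1 as N grows.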

2.4. Terminology in Parallel Programming

We use two major programming techniques in this work – namely MPI and OpenMP. An overview of these is given below.

2.4.1. Message Passing Interface

MPI is a portable standard for parallel programming, and is mostly used on HPC systems. It was introduced by Walker et al. in the 1990s [WD96]. Since then, the MPI Forum has been the governing body responsible for the MPI standard. Technically, MPI is a communication protocol for parallel applications running on parallel computers. Today, MPI is the de facto standard for communication in parallel applications designed for systems with distributed memory (cf. Section 2.3.2). It also serves as the most dominant programming model for HPC systems [SKP06]. The most current version of the MPI standard is MPI-3 [Mes18]. MPI covers a wide range of functionalities and concepts. Currently, the MPI standard defines the syntax and semantics of functions in the programming languages C, C++, and FORTRAN. The most basic concepts in MPI are process, group, and communicator. MPI follows the SPMD architecture in Flynn's Taxonomy (cf. Section 2.3.1). Each process is an instance of the application, usually an application process in the operating system. An MPI group (MPI_Group) is an abstract object which represents a collection of processes. An MPI communicator (MPI_Comm) is an object which provides a connection for a group of processes. Each communicator assigns every process within it a unique identifier called a rank. Users can create new communicators from an existing communicator, e.g., using MPI_Comm_split. There are two predefined communicators in MPI: MPI_COMM_WORLD, which includes all processes of an MPI instance; and MPI_COMM_SELF, which includes only the calling process. Within a given communicator, both point-to-point communication (such as send and receive) and collective communication (such as an all-reduction) can be performed. A detailed introduction to MPI is beyond the scope of this dissertation; for detailed information, please refer to the MPI standard document [Mes18]. A minimal example illustrating these basic concepts is sketched below.
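The following minimal, self-contained C program sketches ranks, communicators, MPI_Comm_split, and a collective operation. The split criterion (even vs. odd world rank) is an arbitrary choice for illustration and is not related to the process-set extensions discussed later in this dissertation.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   /* unique id within MPI_COMM_WORLD */
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into two sub-communicators: even and odd ranks. */
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);

    int sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);

    /* Collective communication within the sub-communicator. */
    int sum = 0;
    MPI_Allreduce(&world_rank, &sum, 1, MPI_INT, MPI_SUM, sub_comm);

    printf("world rank %d/%d, sub rank %d, sum of world ranks in my sub-communicator: %d\n",
           world_rank, world_size, sub_rank, sum);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}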


2.4.2. OpenMP

Open Multi-Processing (OpenMP) is a portable Application Programming Interface (API) designed for parallel programming on shared memory systems (cf. Section 2.3.2). Similar to MPI, it provides language bindings for the programming languages C, C++, and FORTRAN. The standardization body of OpenMP is the OpenMP Architecture Review Board, a consortium of major computer hardware and software vendors. The first version of OpenMP was published in 1997 (for FORTRAN) and 1998 (for C) [DM98], respectively. Technically, OpenMP is an implementation of multithreading following the fork-join model: at execution time, a master thread (usually the main process) forks a given number of threads. Each thread is uniquely identified by an id, where the master holds the id 0. Using these ids, the workload can be divided into partitions, with each thread working on a different partition. Correctly deployed, an application can be written using a hybrid model, in which MPI handles communication across the distributed memory part and OpenMP is used to achieve high efficiency within a shared-memory node. A detailed introduction to OpenMP is also beyond the scope of this dissertation; a variety of textbooks, such as [Cha+01], teach parallel programming with OpenMP.
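The fork-join model can be illustrated with the following minimal C sketch (illustrative only; the array and its size are made up), in which each thread of the forked team works on its own partition, identified via its thread id:

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double data[N];
    double sum = 0.0;

    /* The master thread forks a team of threads here (fork-join model). */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();       /* master thread has id 0 */
        int nthreads = omp_get_num_threads();

        /* Each thread works on its own partition of the data. */
        for (int i = tid; i < N; i += nthreads)
            data[i] = 2.0 * i;
    }

    /* Work-sharing loop with a reduction, partitioned by the runtime. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %f\n", sum);
    return 0;
}

In a hybrid application, such a region would typically run inside each MPI process on its node, while MPI handles the inter-node communication.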

3. Failure Prediction: A State-of-the-practice Survey

3.1. Methodology and Scope

Failure prediction is a widely researched area across many different fields, and with the emerging era of deep learning and advanced machine learning techniques it has become a major focus of the recent research body. The idea of failure prediction is rather simple: by utilizing existing knowledge and heuristics of system behavior together with information on the current system state, one can predict an upcoming event. This way, a dynamic failure avoidance method can be initiated, preventing the system from entering an erroneous state and hence preventing failure. In HPC systems, the possible upsides of being able to predict future failures accurately are significant:

• It is proactive: It simplifies any mitigation mechanism by being able to react while the system is still working;

• It is efficient: It eliminates the need for frequent storage of the application state. Furthermore, combined with hot-swap hardware, it can simplify system management and increase utilization.

The work covered by this literature survey is quite diverse: it spans a wide range of failure scenarios, systems, and prediction techniques, to a point where it is hard to understand the coverage achieved by existing work and to identify the gaps that are still present. As there are already many different failure prediction methods available, we do not want to provide yet another failure prediction mechanism which demonstrates its capability on exactly one system. Instead, we focus on a cross analysis of the current state of the practice. This chapter is related to and similar to the previous studies conducted by Salfner et al. [SLM10] and Xue et al. [Xue+07]. However, both of these studies are dated and do not specifically focus on the current generation of large-scale HPC systems. For our analysis, a total of over 70 papers from 2010-2018 were screened, more than 30 of which were then selected for detailed interpretation.


The major outcome is a two-dimensional matrix which provides a categorization of current failure prediction methods. Unlike Salfner et al. [SLM10], this work does not provide a taxonomy of methods. Previous effort [Jau17] shows that our 2-D classification provides a more comprehensive visualization of fault prediction methods, as many of the methods do not necessarily target a single specific problem. The two dimensions of the classification are:

1. The type of failure a method covers, i.e., aims at predicting, and

2. the class of prediction method used.

It is important to mention that although the main search scope of our literature survey is HPC systems, we also include some research covering Cloud Computing (CC) systems. This is motivated by the fact that we are currently seeing a convergence trend between HPC and CC [GR17]. Moreover, it is generally expected that the convergence of these two systems will continue to accelerate in the near future.

A survey of the failure modes that current work has been targeting is presented in Section 3.2 in the remainder of this chapter. These results, along with the prediction methods used, are then used as drivers for forming the classification of the surveyed work. The result is a two-dimensional matrix, which we derive in Section 3.3 and present in Table 3.4. The basic literature research part of this chapter was previously covered and introduced by Jauk [Jau17]. Partial results from the evaluation used in this chapter were also published in our previous paper [JYS19].

3.2. Survey on Failure Modes in HPC Systems

Raw failure data - that is, datasets which clearly mark failures - are hard to acquire for any supercomputer. Hence, secondary literature is currently the main source of insight into this matter. Notable work in this field was done by Bianca Schroeder and various coauthors [SG10; ES13; EZS17], analyzing raw failure data from a cluster of 11k machines deployed at Google1 (in the following called the “Google dataset”) and a Hadoop cluster at Carnegie Mellon University2. Furthermore, some other datasets regarding failures in (HPC) data centers are provided in the Computer Failure Data Repository (CFDR)3. Currently, a total of 13 datasets are listed in the CFDR.

1https://research.googleblog.com/2011/11/more-google-cluster-data.html, accessed in June 2019
2ftp://ftp.pdl.cmu.edu/pub/datasets/hla/, accessed in June 2019
3https://www.usenix.org/cfdr, accessed in June 2019


3.2.1. Failure Modes

As expected, most existing research focuses on memory errors in HPC systems, due to their high occurrence. There are slightly fewer studies on disk failure; nevertheless, disk errors have been the focus of much independent research on failure prediction. Against trends in the HPC community, though, two areas have received surprisingly little attention: GPU failures and failures caused by the Operating System (OS). Node failures have been the research focus of many papers. However, most of the papers which deal with node failures treat all types of failures equally and do not distinguish between the actual errors that occurred, nor do they pinpoint the root causes of the failures. As a prominent example, Di Martino et al. [Di +15] analyzed more than five million application runs in the span of 517 days on the 22,640-node Blue Waters Cray system. Their findings regarding application success and failure are listed in Table 3.1, which also includes an earlier study [LHY10] that examined 2,084 jobs from the Franklin Cray system. As shown in Table 3.1, both studies provide similar findings on the failure distribution.

Table 3.1.: Causes of Application Failure on Two Cray systems [Di +15; LHY10]

Termination      Blue Waters    Franklin
successful       66.47%         61.8%
user             18.75%         25.0%
system           1.53%          1.5%
user / system    8.53%          11.7%
walltime         4.71%          -

In their 2010 study [SG10], Schroeder and Gibson investigated failure data from 22 HPC systems at Los Alamos National Laboratory (LANL), gathered over 10 years from 1996 until 2005, as well as a dataset with failure data from an unnamed supercomputing site covering a period of one year. The LANL systems together comprised 4,750 nodes with 24,101 processors; the anonymous site had 20 nodes with 10,240 processors. Failure causes for the two facilities, along with the respective percentages of their occurrence, are presented in Table 3.2. The authors concluded that some systems showed a decrease in failures with unknown causes over time, while the relative percentages of the other failure categories remained constant.


Table 3.2.: Root Causes of Failures in Two High Performance Systems [SG10]

Cause          LANL (%)    Unnamed Site (%)
hardware       64          53
software       18          22
network        2           -
environment    2           -
human          < 1         2
unknown        14          23

In addition, their data show that “failure rates do not grow significantly faster than linearly with system size” [SG10]. Furthermore, failure rates on a per-node basis show significant variance, depending on the node’s workload. The Cumulative Distribution Function (CDF) of the failure rate per node is best described by a normal or log-normal distribution. Analyzing failure rates over time shows that systems follow one of two patterns [SG10]:

1. An initial high failure rate with a significant drop after approximately three months. This is consistent with the commonly known concept of “burn in” or “infant mortality”: Initial failures, which remain undetected during the testing period, occur and are subsequently eliminated.

2. Increasing failure rates during approximately the first 20 months, followed by continuously decreasing failure rates afterward, as the system moves toward its fully operational state. The continuous process of adapting the system to full production is the most likely cause of this observation.

Earlier results from two IBM Blue Gene systems by Hacker et al. [HRC09] are confirmed by Schroeder and Gibson, who also point out that node failures in general are still best modeled by a Weibull or gamma distribution4. In an earlier study, H.-C. Lin et al. [SG10] did not find any spatial correlation for errors of the six categories - hardware, software, network, environment, human, unknown. This result is contrary to a later study, which found that “when the number of correlated failures [...] is relatively high, the spatial correlation dominates” [Ghi+16].

3.2.2. On Root Cause Analysis

By analyzing failure modes, approaches for predicting failure occurrence can be derived. Examples are provided by El-Sayed et al. [EZS17] and Pinheiro

4also known as bathtub curve


et al. [PWB07]. Schroeder and Gibson [SG10] list some of the most common root causes of hardware and software failures; Table 3.3 shows these root causes. They indicate that hardware problems are the most probable cause of failures in HPC systems. They further point out that “memory was the single most common ‘low-level’ root cause for all systems (across all failures not only hardware failures)...” However, their study does not treat errors which stem from the user software itself, even though most errors that lead to application aborts are caused by the applications themselves [Fra+19].

Table 3.3.: Most Common Root Causes of Failures [SG10]

Hardware                   (%)     Software                      (%)
Memory DIMM                30.1    Other Software                30.0
Node Board                 16.4    OS                            26.0
Other                      11.8    Parallel File System          11.8
Power Supply               9.7     Kernel Software               6.0
Interconnect Interface     6.6     Scheduler Software            4.9
Interconnect Soft Error    3.1     Cluster File System           3.6
CPU                        2.4     Resource Management System    3.2
Fan Assembly               1.8     Network                       2.7
Router Board               1.5     User Code                     2.4
Fibre RAID Controller      1.4     NFS                           1.6

In the following we present results from different studies covering the most probable root causes.

Memory

Many studies, such as by Schroeder et al. [SPW09] and Sridharan et al. [Sri+13], deal with memory errors because they are the most frequent hardware error. In [SPW09], Schroeder et al. analyzed two years of DRAM error data from Google’s computing infrastructure (i.e., not HPC data). In [Sri+13], Sridharan et al. investigated DRAM faults from the Jaguar and Cielo supercomputers. The main findings from both studies include the following:

• DRAM age correlates with failure rates. According to Schroeder et al. [SPW09], an increasing rate of correctable errors in the first twenty months and of uncorrectable errors in the first three to five months can be observed; the failure rate then plateaus. In contrast, according to Sridharan et al. [Sri+13], total failure rates decrease for approximately the first two years of operation.


• [SPW09] and [Sri+13] agree that there is no clear correlation between error rates and memory technology (e.g., DDR1, DDR2, and DDR3), with [SPW09] examining DDR1, DDR2, and FBDIMM and [Sri+13] examining DDR2 and DDR3.

• A conclusive correlation between error rates and DRAM vendors is not observable.

• Although the effect is small, Schroeder et al. [SPW09] point out a correlation between error rates and temperature which again correlates with CPU utilization.

• While no spatial correlation of the memory errors themselves is found in these two studies, such a correlation does exist and is documented in many other studies [Bau+16; Pat+17; HSS12].

Disks

Notable studies on disk failures include work by Schroeder and Gibson [SG07a], Pinheiro et al. [PWB07], and Ma et al. [Ma+15]. Each of these studies analyzes a drive population of more than 100,000 hard drives; Schroeder and Gibson and Pinheiro et al. use them primarily as datasets, while Ma et al. additionally introduce prediction methods. Schroeder and Gibson use data from seven different sites - four HPC sites and three cloud providers - and gather data on Serial AT Attachment (SATA), Small Computer System Interface (SCSI), and Fibre Channel (FC) drives from four different vendors. Pinheiro et al. use data from Google’s computing infrastructure, which includes SATA and older parallel ATA drives from nine different vendors. No comparable study covering Solid State Drives (SSDs) in the HPC environment is known to us at the time of writing. Both Pinheiro et al. and Schroeder and Gibson point out that stress tests provided by manufacturers do not reflect a realistic picture, since the definition of drive failure varies between different manufacturers [SG07a; PWB07]. Furthermore, customers tend to express a “better safe than sorry” mentality, demanding “overly reliable” components: hard drives tend to be returned even when the manufacturer finds them to be in a perfectly fine working condition, and the pressure on manufacturers regarding warranty is high. These phenomena further lead to a high margin in the endurance specifications. Both studies take the same approach of defining failure as the necessity to replace a drive and calculate the Annualized Replacement Rate (ARR) instead of the Annualized Failure Rate (AFR). The ARR is based on the number of disks returned to the manufacturer, while the AFR is based on the number of disks that are expected to fail. The AFR is defined as a fixed function of the Mean Time Between Failures (MTBF):

AFR = 1 − exp(−8766 / MTBF)    (3.1)
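As a brief illustration of Equation 3.1 (with a hypothetical data-sheet value): a drive specified with an MTBF of 1,000,000 hours yields AFR = 1 − exp(−8766 / 1,000,000) ≈ 0.87%, since 8766 is the number of hours in an average year; this is roughly the range quoted from manufacturers’ data sheets below.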


Both studies come to the conclusion that ARRs are significantly higher than the respective AFRs: in contrast to the AFRs of 0.58% to 0.88% commonly found in manufacturers’ data sheets, [SG07a] reports ARRs of 0.5% up to 13.5% with an average ARR of 3.01%, while [PWB07] reports ARRs of 1.7% to 8.6%. All studies mentioned further investigate different aspects of drive failures. No significant difference in replacement rate is observed between different (rotating Hard Disk Drive (HDD)) technologies. Error rates in most hard drives show an increasing trend over time, with a peak at year 3-4. The replacement rates of SATA and SCSI / FC disks contradict the commonly assumed “bathtub” model for failure rates: infant mortality is relatively low and there is no flat bottom; failure rates continuously increase for the first few years of the drive’s life. Another work [Ma+15] finds the same trend for most of the drive population, with failure rates reaching a peak in the third or fourth year of operation. A smaller part of the population (less than 50%), however, shows no increasing trend but exhibits an almost constant failure rate over the observation period. Furthermore, a significant autocorrelation of disk failure rates within a period of 30 weeks is observed, most probably because of parameters affecting the entire population [SG07a]. Pinheiro et al. [PWB07] propose to model the failure rates as a function of utilization and temperature. Surprisingly, the authors find that drive failures are correlated with high utilization only in brand new and very old drives. Moreover, failure rates increase both with increasing and decreasing temperature, although a higher temperature appears to be more harmful. Finally, several investigations [PWB07; Ma+15] of the Self-Monitoring, Analysis and Reporting Technology (SMART) parameters of hard drives reveal that the number of reallocated sectors is highly correlated with disk failure.

GPUs

To our knowledge, the study by Tiwari et al. [Tiw+15b] and a “sister study” by Tiwari and other co-authors [Tiw+15a] on the Titan supercomputer are the only comprehensive studies of hardware GPU errors in HPC systems. This area has received surprisingly little attention, although there is a clear trend toward deploying GPU-based systems; in addition, Di Martino et al. [Di +15] show that GPUs produce comparatively many faults. Tiwari et al. studied GPU failures (defined as GPU errors that led to an application crash) on the 18,688 GPUs installed in the Titan supercomputer at Oak Ridge National Laboratory (ORNL) over 18 months. First, the authors find that GPU errors occur approximately once every two days, which is rarer than the manufacturer’s MTBF would suggest. A Weibull distribution with high “infant mortality” also models their failure data best. Further, proper stress testing may reveal faulty GPUs prior to production operation. Finally, newer GPUs (e.g., Kepler vs. Fermi) show a better fault behavior. The study also yields several policy recommendations [Tiw+15b]:

• Rigorous stress testing before and during the production run (e.g., for cards which experience a high number of errors) improves the MTBF of the system significantly. (Notably, six GPUs experienced 25% of all double bit errors, and ten cards experienced 98% of all single bit errors.) Soldering of cards also helped reduce the high number of errors observed whenever the connection to the host was lost.

• NVIDIA’s Kepler architecture shows several improvements over the previous Fermi architecture. The increased resiliency is reflected in a comparable rate of double bit errors despite the smaller transistor size, a lower vulnerability to radiation-induced bit corruption, and a better dispersal of workloads.

Although nothing has been published yet, the failure rate and distribution of failures among GPUs have also been under study by some researchers in the context of bitcoin mining.

Software and Application

Software and application failures in HPC were examined by Yuan et al. [Yua+12] and more recently by El-Sayed et al. [EZS17], with the first study gathering data from eight sites and the latter covering three sites; [EZS17] explores failure prediction as well. Both studies agree that users are the main reason for job failures. El-Sayed et al. conclude that many users are “optimistic beyond hope, wasting significant cluster resources” when it comes to re-executing failed tasks, consuming significantly more CPU time than successful jobs; this is largely because users terminate hanging jobs only after they have consumed more time than usual. A limit on job retries may help to solve this problem. Moreover, they propose speculative execution of tasks in the form of redundant scheduling. The configuration of jobs submitted within a Hadoop framework is also analyzed by El-Sayed et al. [EZS17]; their results show that most of the failed jobs deviate from the default configuration. Schroeder et al. also suggest the implementation of limits on retries, as well as additional job monitoring. In a most recent study, Frank et al. [Fra+19] point out a potential way to filter out job failures caused by application failure by combining the logs of the Simple Linux Utility for Resource Management (SLURM) workload manager with cluster monitoring data.

3.3. Survey of Failure Prediction Methods

A survey by Salfner et al. from 2010 [SLM10], which also provides a taxonomy of failure prediction methods, is related to this work. However, it has two shortcomings:


1. Certain prediction methods cannot be classified into a specific branch of the taxonomy;

2. The type of failure, as well as the affected components, are not considered. Consequently, no recommendation can be made for a certain system.

The results presented in this chapter are also included in our recent paper [JYS19]. To eliminate these shortcomings, Table 3.4 shows a classification of failure prediction methods in HPC using a matrix representation of failure types vs. prediction methods. The horizontal dimension of the proposed classification categorizes failures, thereby showing what is failing. We took the failure modes from Section 3.2 with the following changes [JYS19]:

• SW/S stands for Software or System Failure.

• Node stands for hardware failure. In both cases, a root cause was unspecified by the corresponding predictor (non-pinpointable).

• On the pinpointable side, we have subdivided surveyed work into Disk, Memory and Network failure prediction. Since we only found very limited work on failure prediction for GPUs/accelerators, we have omitted this column for space reasons (see discussion in Section 3.2.2 ).

• The category "Log" indicates methods that predict upcoming events in log files, regardless of whether this entry actually correlates with an actual failure. This is especially the case for many of the prediction methods based on IBM Blue Gene log files. There are several studies which do not predict failures, but the occurrence of a log message with severity FATAL or FAILURE.

• Other studies predict any fault in a given system or several faults at once; these are illustrated by blue fields spanning multiple columns.

On the vertical axis, we group prediction methods into different categories, each representing one major concept of prediction. The categories along the vertical dimension are as follows:

• "Correlation, Probability" represents methods that are based on correlation of probabilities.

• "Rule" represents methods that are based on rules.

• "Math./Analytical" shows purely mathematical modeling methods that do not use a statistical approach.


Table 3.4.: A Classification of Literature on Failure Prediction in High Performance Computing [JYS19]
(The columns of the matrix denote the failure type covered: SW/S, Node, Disk, Mem., Net., Log - root cause unspecified vs. pinpointable. Entries are grouped here by prediction-method class; values are reported as precision / recall.)

Correlation, Probability:
[Lia+06] - / 82; [Bra+09] - / -; [GCK12] 93 / 43; [Cos+14] - / -; [Fu+12] D,∗ 69 / 58, 88 / 46, 78 / 75 A,B,C; [Gai+12] D,∗ 91.2 / 45.8; [Fu+14] ∗ 88 / 75 A,D, 77 / 69, 81 / 85

Rule:
[RBP12] - / -; [Wat+12] D,∗ 80 / 90; [WOM14] D,∗ 58 / 74; [Ma+15] D - / -

Math., Analytical:
[Zhe+10] C 40 / 80; [Tho+10] - / -

Decision Trees / Forest:
[Rin+17] - / -; [Gan+16] ∗ / ∗; [NAC11] 74 / 81; [Kli+17] C 98 / 91; [SKT15] 94.2 / 85.9, 79.5 / 50; [EZS17] B,D 95 / 94; [CAS12] 53 / -; [GZF11] - / -; [SB16] C 72 / 87

Regression:
[Liu+11b] - / -

Classification:
[Lia+07] C,D 55∗ / 80∗; [Xu+10] 66.4 / 59.3; [Zhu+13] A,C - / -; [Pel+14] ∗ / ∗ C, ∗ / ∗ C; [Tho+10] D - / -

Bayesian Network, Markov:
[AWR15] 93 / 91; [Yu+11] B,D 82.3 / 85.4, 64.8 / 65.2; [Zhu+13] - / -; [CLP14] - / -, 89 / -

Neural Network:
[IM17] A 80 / -

Meta-Learning:
[Gu+08] C 90∗ / 70∗; [Lan+10] - / -

A: Results for different data sets. B: Results for different training parameters. C: Results vary greatly with different parameters. D: Paper lists several methods or setups. ∗: Many results provided, see reference. −: No numerical result is given.


• "Meta" stands for approaches based on Meta Learning.

• As a large number of prediction methods are based on decision trees and random forests, these methods are selected and categorized separately.

The blue cells indicate whether a specific method is used to predict failures within a category. Multiple blue cells within one row indicate that this method can predict failures related to multiple errors or root causes. The respective values for precision and recall (precision/recall) are shown in the table. "-" denotes that no value is given in the reference; "*" means that many values are given for different setups. Many of the references utilize more than one method or dataset, which is why a single representative value is not meaningful in many cases. In addition, different training parameters also affect the precision/recall values; these references are marked in Table 3.4 with “B” and “C”. Furthermore, values are only listed if the respective paper explicitly lists precision and recall values according to the definition in Section 3.1, to ensure comparability [JYS19]. A detailed discussion of the references, based on this classification of prediction methods, is presented in the remainder of this chapter. However, the classification is not perfect, as many of the classes are mathematically interdependent.
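For reference, the precision and recall values reported in Table 3.4 can be computed from the counts of true positives (TP), false positives (FP), and false negatives (FN) of a predictor; the following small C helper (an illustrative sketch with made-up counts) shows the computation:

#include <stdio.h>

/* Precision: fraction of predicted failures that were real failures.
   Recall:    fraction of real failures that were predicted.          */
static void precision_recall(long tp, long fp, long fn,
                             double *precision, double *recall) {
    *precision = (tp + fp) > 0 ? (double)tp / (double)(tp + fp) : 0.0;
    *recall    = (tp + fn) > 0 ? (double)tp / (double)(tp + fn) : 0.0;
}

int main(void) {
    double p, r;
    /* Hypothetical counts from a prediction experiment. */
    precision_recall(93, 7, 123, &p, &r);
    printf("precision = %.2f, recall = %.2f\n", p, r);
    return 0;
}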

3.3.1. Probability and Correlation

Probability theory is the branch of mathematics which deals with probability. Probability is the quality or state of being probable, that is, the likelihood that a specific event will occur [SW89]. The most common understanding of probability is its quantification as a number between 0 and 1, where a larger number means that an event is more likely to occur. Probability theory provides the corresponding concepts and axioms in mathematical terms; it also provides the foundation of modern machine learning approaches [JYS19]. Correlation is any statistical relationship between different random variables, where a random variable is a function that maps a specific outcome of an event to a probability. In practical use, correlation commonly refers to the linear relation between variables. A typical way to explore correlation is to analyze the joint probability distribution of these random variables [JYS19]. The calculation of probability and correlation scores is an easy way to gain insight into potential failures. It is an integral part of many failure analysis methods, such as Fault Tree Analysis (FTA) and Markov Analysis [JYS19].
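To make the notion of a correlation score concrete, the following small C function (an illustrative sketch; the metric names and sample values are invented) computes the Pearson correlation coefficient between two monitored metric series:

#include <math.h>
#include <stdio.h>

/* Pearson correlation coefficient between two equally long metric series. */
static double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i];  sy += y[i];
        sxx += x[i] * x[i];  syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;
    double vx  = sxx - sx * sx / n;
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void) {
    /* Hypothetical samples: node temperature vs. correctable error count. */
    double temp[]   = {55, 58, 61, 63, 66, 70};
    double errors[] = { 1,  1,  2,  3,  5,  8};
    printf("correlation = %.3f\n", pearson(temp, errors, 6));
    return 0;
}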

Many of the papers mentioned above estimate the probabilities of failures. Pinheiro et al. [PWB07] find that the probability of disk failure within 60 days after observing a certain SMART value increases 39 times after one scan error, 14 times after one reallocation count, and 21 times after one offline reallocation count. Schroeder et al. [SPW09] point out that a correctable memory error increases the probability of an uncorrectable error occurring in the same month by 27-400 times, and by 9-47 times in the following month.

Liang et al. [Lia+06] propose simple prediction methods based on spatial and tempo- ral skewness in the distribution of failures in a BlueGene/L system. For example, 50% of network failures occur within 30 minutes of the previous failure; 6% of midplanes encounter 61% of network failures. The strategies trigger increased monitoring (for a certain time / for a specific midplane) after a failure has been encountered. Another proposed strategy based on the correlation between fatal and non-fatal events triggers monitoring when two non-fatal events occur in a job.

Gainaru et al. [GCK12] introduce a signal-based approach for failure prediction. Considering events as signals, they extract periodic, silent, and noisy signals from system logs, which usually correspond to daemon or monitoring information; bursts of error messages; and warning messages, which are generated both in case of failure and during normal behavior. Using data from LANL, they predict failures for the six categories in Table 3.2 (left) and achieve a precision / recall of 93% / 43% at a lead time of 32 seconds on average. In follow-on work [Gai+12], the authors merge this signal-analysis approach with data mining algorithms: the prediction model uses the signal generation of [GCK12] as a first step; then, after removing erroneous data, an algorithm for mining gradual association rules [DLT09] computes temporal correlations between events, while simple correlation analysis covers spatial correlation. This hybrid approach of signal analysis and data mining yields a precision of 91.2% and a recall of 45.8%. Further discussion of this approach is given by the authors in [Gai+13] and [Gai+14], where the approach is extended to consider the locality of failures.

Using OVIS5, an HPC monitoring tool developed at Sandia National Laboratory (SNL), Brandt et al. [Bra+09] show that statistical abnormalities in system behavior can indicate imminent resource failure. The “goal is to discover anomalous behavior in some metric(s) that precede failure of a resource far enough in advance and with enough reliability that knowledge of the behavior could allow the system and/or ap- plication to take proactive action” [Bra+09]. Abnormalities are detected by comparing data to a previously established model of the system; the comparison is done using descriptive statistics, multivariate correlative statistics or a probabilistic model. Testing their approach for Out-of-memory errors on the Glory Cluster at SNL, they found that

5https://github.com/ovis-hpc/ovis - accessed March 2019

one problem could have been detected two hours before the first manifestation in the log file, which allowed an Out-Of-Memory (OOM) error to be predicted. Collecting data over 16 days and monitoring active memory, the deviation was thus visible two hours before the corresponding log file entry appeared and could be used for further analysis. Multivariate Mahalanobis distance was then used to determine whether any values had deviated from their ideal values, i.e., values a healthy system would show. Such ideal values had been gathered in a previous monitoring phase.

Costa et al. [Cos+14] use spatial and temporal correlation of memory errors to identify unhealthy memory chips on a Blue Gene/P system at Argonne National Laboratory (ANL). Using data from the Intrepid (Blue Gene) system at ANL, they first examine spatial correlation. Not surprisingly, if multiple repeated errors occur at the same DRAM address, chipkill – a technology used to protect memory [Del97] by reconstructing data from redundant storage – is more likely to be activated. They therefore attempt to provide predictions based on repeated error occurrence at the same DRAM address. However, prediction based solely on spatial characteristics only yields a prediction coverage of 40%. Taking temporal correlation into account as well – if a repeated error coincides with an error rate greater than 1 error/second, chipkill is activated in 100% of the cases – they were able to predict 100% of chipkill occurrences at a prediction coverage of more than 80%, which is a better result than using spatial correlation alone. In addition, the authors deployed a migration-based page protection by reconfiguring the interrupt controller and creating a custom interrupt handler. They conclude that, on average, 76% of chipkill activations are avoided by using their approach.

A more advanced method, based on correlation and modeled as graphs, is provided by Fu et al. [Fu+12]. Using data from three sites, they deploy a slightly modified Apriori algorithm to mine association rules over sequences of distinct and correlated events, considering only events that happen at the same node or application or that have the same type (hardware, system, application, file system, network). The rules are modeled as directed, acyclic event correlation graphs, where each vertex represents an event or the co-occurrence of events (e.g., A ∧ B for events A, B) and each edge A → B is annotated with the correlation of A and B. They then calculate probabilities along the edges in the graph and, if a specified threshold is reached, they issue a failure prediction. In a later refinement of their approach [Fu+14], where causal dependency graphs are introduced, they achieve

slightly better prediction results.

3.3.2. Rule-based Methods

Rule-based prediction aims at establishing a set of rules, i.e., IF-THEN-ELSE statements, which trigger a failure warning if certain conditions are met. Such rules are (usually) automatically generated from a training data set, an approach called Rule-based Machine Learning (RBML). Ideally, the set of rules should be as small as possible; a larger set of rules usually leads to overfitting, which means that the algorithm is “remembering the training set” [KZP06] rather than using learned relationships [JYS19].

Rajachandrasekar et al. [RBP12] use a rule-based approach to predict node failures with hardware root causes. They offer a lightweight implementation of their method with only around 1.5% CPU utilization overhead on a 160-node Linux cluster. They collect sensor data by periodically querying the Intelligent Platform Management Interface (IPMI) and categorize the sensor data by assigning severity levels. Communication is done using a publish/subscribe system on top of a self-healing tree topology. The Fault-Tolerant Backplane (FTB) - “a common infrastructure for the Operating System, Middleware, Libraries and Applications to exchange information related to hardware and software failures in real time” [RBP12] - is then used to publish a warning and notify subscribed components if a change of sensor state is detected. An example rule triggers a warning if three messages in a row with severity WARNING are collected from the same sensor. Additional rules correct for false flags: CRITICAL events where a value of 0 is read are discarded, because the most probable cause is a sensor failure. Furthermore, a modified MVAPICH2 version, which triggers proactive process migration upon a predicted failure, is implemented to demonstrate the effectiveness of their approach. However, the work does not include an evaluation of precision and recall metrics, as the rules are mostly human-selected.
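The following C fragment sketches a rule of the kind described above (hypothetical code, not the authors’ implementation): three consecutive WARNING readings from one sensor trigger a prediction, while CRITICAL readings with value 0 are discarded as probable sensor faults.

#include <stdbool.h>
#include <stdio.h>

enum severity { OK = 0, WARNING = 1, CRITICAL = 2 };

/* Hypothetical rule check over a sequence of readings from one sensor. */
static bool rule_check(const enum severity *readings, const double *values, int n) {
    int consecutive = 0;
    for (int i = 0; i < n; i++) {
        if (readings[i] == CRITICAL && values[i] == 0.0)
            continue;                       /* false flag: probably a broken sensor */
        consecutive = (readings[i] == WARNING) ? consecutive + 1 : 0;
        if (consecutive >= 3)
            return true;                    /* would publish a warning, e.g., via FTB */
    }
    return false;
}

int main(void) {
    enum severity log[] = { OK, WARNING, WARNING, WARNING, OK };
    double vals[]       = { 40, 71, 73, 75, 42 };
    printf("predict failure: %s\n", rule_check(log, vals, 5) ? "yes" : "no");
    return 0;
}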

Watanabe et al. [Wat+12] propose an online approach to failure prediction based on message pattern recognition. In this work, “failure” refers to any event classified as such in the system under inspection, a commercial cloud platform spanning multiple nodes. First, they group messages by identifying overlapping contents, which they assume to be clear text for the targeted application. Then, they identify message patterns, where a pattern is an unordered set of messages within the same time window. The temporal order of incoming messages is ignored as there is no inter-nodal time synchronization. Using classic Bayes’ theorem, they then calculate the probability of

a failure occurring when observing a certain pattern. Each incoming message is compared against predefined patterns, and the system issues a warning if the probability for this pattern exceeds a specified threshold. Almost 10 million messages were gathered over 90 days; 112 failures were observed. Varying the failure prediction threshold from 0.1 to 0.99 yields a precision of 5% to 37% and a recall of 70% to 67% (time window of 5 min.), outperforming a naive Bayes classifier, which achieves a precision / recall of 4% / 30%. The authors further refine their approach in a follow-on paper [WOM14], where they consider the problem of rare events by calculating the conditional probability P(¬failure | ¬message pattern) to include information on whether a message pattern is “typical” (specified via a second threshold they introduce). Unfortunately, no overall precision / recall is given; the authors only state some values for different failure types. Precision ranges from approx. 10% to 60%; recall lies between 0% and over 90%.

Ma et al. [Ma+15] introduce an algorithm called RaidShield to predict hard disk failure. The algorithm has two components: PLATE, which detects deteriorating disk health and allows proactive measures, and ARMOR, which prevents multiple-disk failures in RAID arrays. The authors find that some SMART attributes are highly correlated with disk failure, especially the number of reallocated sectors. Replacing disks based only on the reallocated sector count yields a false positive rate between 0.27% and 4.5%. Considering the median time for disk replacement in a typical data center and the cost incurred by unnecessarily replaced drives, they conclude that a reallocated sector count of 200 is an optimal threshold for disk replacement. However, PLATE does not take into account that disks may fail before the threshold of specific SMART values is reached, or that multiple disks may fail at the same time. ARMOR, the second part of RaidShield, extends this approach to multiple disks. This is motivated by the fact that many common RAID levels can handle at most two disks failing simultaneously; if a third disk fails while RAID data is being restored, data loss is inevitable. The extended approach first calculates the probability of failure for healthy disk groups – that is, disk groups with zero individual failures – and then examines bad groups, i.e., disk groups with at least one individual failure. This technique captures 80% of disk group failures.
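To illustrate the single-attribute thresholding idea behind PLATE (an illustrative sketch, not the RaidShield implementation; the struct and disk values are made up, while the threshold of 200 is the one reported above):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-disk SMART snapshot. */
struct smart_record {
    int disk_id;
    int reallocated_sectors;   /* SMART attribute 5 */
};

/* PLATE-style single-attribute threshold check (illustrative only):
   flag a disk for proactive replacement once the reallocated sector
   count reaches the threshold of 200 reported in [Ma+15].            */
static bool flag_for_replacement(const struct smart_record *r) {
    return r->reallocated_sectors >= 200;
}

int main(void) {
    struct smart_record disks[] = { {0, 3}, {1, 250}, {2, 199} };
    for (int i = 0; i < 3; i++)
        printf("disk %d: %s\n", disks[i].disk_id,
               flag_for_replacement(&disks[i]) ? "replace proactively" : "healthy");
    return 0;
}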

3.3.3. Mathematical/Analytical Methods

Besides methods based on statistics, other mathematical concepts are also used to predict failures; two examples using non-statistical mathematical methods are presented in this section. The first is based on a genetic algorithm, a biology-inspired modeling method which represents the evolution of a population under the “survival of the fittest” rule, where the fitness function is defined by the user [JYS19].

Zheng et al. [Zhe+10] apply a genetic algorithm approach to RAS (Reliability, Availability, Serviceability) metrics and job termination logs from a 40,960-node IBM Blue Gene/P system at ANL with 163,840 cores. First, they design a set of rules to predict, from a sequence of non-fatal events, that a failure will happen after a specific lead time at a specific location. Second, they collect an initial population, selecting promising rules by hand to speed up convergence. As a fitness function, fitness = w1 · recall + w2 · precision is defined, where the weights w1 and w2 reflect their preferences regarding precision and recall. Evolution then consists of three operations: selection of promising individuals; crossover of these individuals to generate children; and mutation to increase genetic diversity. From 520 fatal events collected, they use 430 for training and the remaining 190 for evaluation. They redefine positive and negative rates to consider whether the correct location (i.e., rack) was predicted and whether enough lead time was given. They report precision values of 0.4 to 0.3 and recall values of 0.8 to 0.55, both decreasing with lead time, which varies between 0 and 600 seconds. The same algorithm trained with the standard definitions of precision and recall achieves a significantly lower precision and approximately the same recall (worse for shorter lead times). Furthermore, they estimate that their algorithm can decrease service unit loss (i.e., computing time lost because of checkpointing, etc.) by as much as 52.4%.

Thompson et al. [Tho+10] use the multivariate state estimation technique (MSET) to identify failure events on a Blue Gene/P system at ORNL. With MSET, a matrix of input data is used to define a transformation Φ mapping input data onto a so-called similarity space. A new datapoint that is similar to existing data is left relatively unchanged by Φ; however, if the new datapoint differs greatly, so does its mapping. The residual, which is the difference between a data point and its image, can be used to predict a failure if it exceeds a predefined value. Using this method, the authors were able to correctly predict four out of six failures and also predict a seventh failure, which was missed by standard monitoring tools.

3.3.4. Decision Trees/Forests

Decision trees use a tree data structure to store classification rules. A tree is a graph data structure without any cycles; a non-empty tree contains one designated node without incoming edges – the root node. Based on the analysis of a set of instances, interior nodes are generated, which contain decision rules, and exterior nodes (leaves), which contain the predicted outcome. Starting with the root node, each node directs to one of its subtrees, depending on the

outcome of the rule / test that was mapped to the specific node. Once a leaf is reached, the outcome mapped to that leaf is returned as the result [Bab+00]. Random Forests were introduced in the 1990s [Ho95] to avoid the overfitting of complex decision trees. Here, an ensemble of decision trees is generated using a randomized algorithm; the final prediction is then made using a combination of the results from the individual trees [JYS19].
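A minimal sketch of the data structure and prediction procedure just described (illustrative C code, not taken from any of the surveyed predictors; the tree, features, and thresholds are made up):

#include <stdio.h>

/* A minimal binary decision tree node: interior nodes test one feature
   against a threshold; leaves carry the predicted outcome
   (1 = failure, 0 = no failure).                                        */
struct node {
    int feature;                 /* index of the tested attribute      */
    double threshold;            /* decision rule: x[feature] >= t     */
    int prediction;              /* used only if the node is a leaf    */
    struct node *left, *right;   /* NULL for leaves                    */
};

static int predict(const struct node *n, const double *x) {
    while (n->left && n->right)
        n = (x[n->feature] >= n->threshold) ? n->right : n->left;
    return n->prediction;
}

int main(void) {
    /* Tiny hand-built tree: predict failure if feature 0 >= 200,
       or if feature 0 < 200 and feature 1 >= 5.                    */
    struct node leaf_ok   = { 0, 0, 0, NULL, NULL };
    struct node leaf_fail = { 0, 0, 1, NULL, NULL };
    struct node inner     = { 1, 5.0, 0, &leaf_ok, &leaf_fail };
    struct node root      = { 0, 200.0, 0, &inner, &leaf_fail };

    double sample[] = { 120.0, 7.0 };
    printf("prediction: %d\n", predict(&root, sample));
    return 0;
}

A Random Forest would evaluate many such trees, each built from a randomized subset of the data and features, and combine their individual predictions.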

Rincón et al. [Rin+17] use a decision tree to predict hard disk failures in a dataset from a cloud storage company. The authors use 2015 data (approx. 62k drives, 17m records) to train their model and 2016 data (approx. 81k drives, 24m records) to evaluate their model. The disk records contained 88 SMART parameters, out of which six were chosen after statistical analysis to build a decision tree:

1. Reallocated Sectors Count: Number of bad sectors that have been remapped,

2. Reported Uncorrectable Errors: Number of errors which ECC was not able to correct,

3. Command Timeout: Number of failed operations due to disk timeout,

4. Reallocation Event Count: Number of attempts to transfer data from a reallocated to a spare sector,

5. Current Pending Sector Count: Number of sectors designated to be remapped and

6. Uncorrectable Sector Count: Number of uncorrectable errors when accessing a sector.

The authors also estimate a regression model and train a neural network. However, holes in the training set – that is, missing SMART values – make a regression model or a neural network unsuitable for predicting arbitrary disk failures. Therefore, the authors only analyze decision trees further. They construct a binary decision tree with seven levels (root node + six SMART attributes) for prediction. Their prediction results (cf. Table 3.4) correspond to the findings of Pinheiro et al. [PWB07], who identify four promising SMART attributes for failure prediction: Scan Errors, Reallocation Count, Offline Reallocations, and Probational Counts. However, over 56% of all failures do not show a count in these attributes. Even with several additional SMART attributes such as Power Cycles, Seek Errors and CRC Errors, 36% of the failed drive population is not covered. The authors conclude that SMART signals alone are insufficient for failure prediction of individual drives.


Motivated by practical reasons such as cost reduction and improving customer service, Ganguly et al. [Gan+16] from Microsoft Corporation build a two-stage prediction model (a decision tree followed by logistic regression) to predict disk failure. Notably, they use performance measures (e.g., Avg. Disk sec/Read, Avg. Disk sec/Transfer, and Avg. Disk sec/Write) as an additional data source besides SMART values. In addition, derived features, such as derivatives of the SMART attributes, are used apart from their raw values. To avoid overfitting, the decision tree (first stage) is comparably shallow with a depth of four levels (including the root node). In the second stage, the result from the fourth level of the tree is used as an input vector for a logistic regression model. The authors do not provide a numerical measurement of accuracy. However, they do state that the Receiver Operating Characteristic Curve (ROCC) of the two-stage model is “distinctly better” than that of a Support Vector Machine (SVM) model for the same dataset, which has a precision of 75% and a recall of 62%.

Nakka et al. [NAC11] build a decision tree to predict node failure up to one hour in advance. They use the same dataset from LANL previously analyzed by Schroeder et al. [SG10], and also include usage logs in their dataset. After cleaning, curating, and merging logs, the authors use a variety of machine learning techniques on the data, among them REPTree, Random Trees and Random Forests. The random forest approach yields the best result when including both the root cause of the failure as well as usage and failure information. They achieve a precision of 73.9% and recall of 81.3%.

Klinkenberg et al. [Kli+17] from RWTH Aachen University present a node failure prediction approach using time windows. They gather a timeline of unlock and lock events from the compute nodes of a cluster: lock events refer to an event where a node stops accepting batch jobs; after an unlock event, batch jobs can be run on that specific node again. The system consisted of two different node types using either Broadwell or Westmere processors. Dividing this trace timeline into equally sized frames yields many non-critical frames (normal operation) and a few critical frames (immediately prior to failure). For each frame they extract a set of features representing descriptive statistics such as median, variance, kurtosis, and root mean square. They apply a variety of statistical methods, including logistic regression, Random Forests, SVMs, and multi-layer perceptrons, of which Random Forests yield the best results, although the results differ for the two node types: on Broadwell nodes, increasing the time frame duration has no significant effect on precision, while on Westmere nodes precision decreases. Mean precision ranges from 90.8% to 96.6%; mean recall is between 91.0% and 95.6%. The authors also elaborate on the infrastructure used: a five-node Hadoop cluster with HBase and OpenTSDB for data storage, Spark for computation, and IPMItool and sar for data gathering.
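The per-frame feature extraction can be sketched as follows (illustrative only; the metric samples are made up, and the selection of statistics is limited here to mean, variance, and root mean square):

#include <math.h>
#include <stdio.h>

/* Descriptive statistics over one monitoring time frame (illustrative). */
struct features { double mean, variance, rms; };

static struct features extract(const double *x, int n) {
    struct features f = {0, 0, 0};
    for (int i = 0; i < n; i++) {
        f.mean += x[i];
        f.rms  += x[i] * x[i];
    }
    f.mean /= n;
    f.rms   = sqrt(f.rms / n);
    for (int i = 0; i < n; i++)
        f.variance += (x[i] - f.mean) * (x[i] - f.mean);
    f.variance /= n;
    return f;
}

int main(void) {
    /* Hypothetical node metric samples within one frame. */
    double load[] = { 0.7, 0.9, 1.2, 1.1, 0.8, 2.5 };
    struct features f = extract(load, 6);
    printf("mean=%.3f var=%.3f rms=%.3f\n", f.mean, f.variance, f.rms);
    return 0;
}

Such feature vectors, labeled as critical or non-critical, then serve as training input for the classifiers listed above.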

Soualhia et al. [SKT15] use one month of publicly available job failure data from Google’s server fleet to predict job failure. For each job, they extract attributes such as the job ID, the waiting time, the number of finished, killed, and failed tasks within the job, and the total number of tasks, and use them to test several prediction models. With 94.2% and 85.9%, respectively, Random Forests deliver the best precision and recall, outperforming other prediction algorithms including conventional trees, conditional trees, neural networks, and generalized linear models. Prediction results for task failures are even better. The authors further put their prediction model into practice with GloudSim, a tool developed by Google to simulate the typical workload on Google’s computing infrastructure. Using a scheduler equipped with their prediction approach, they were able to reduce job failure rates and optimize execution time by proactively rescheduling jobs predicted to fail. El-Sayed et al. [EZS17] base their analysis of job failures in part on the same dataset as Soualhia et al., and it comes as no surprise that they reach similar conclusions. Assuming that a Random Forest approach can be used to reschedule failing tasks to a node with more spare resources ahead of task failure, they also implement their prediction method in GloudSim and find that, all in all, the number of failures decreased whereas the number of successes increased on both job and task level; additionally, job execution times were optimized. Using a Random Forest model, they achieve a precision of 80% and recall of 50% by only taking into account the information known at the start time of the job and the resource usage in the first five minutes of job execution. Introducing an additional flag into the model, which takes the value TRUE as soon as the failure of one of the tasks of the job is detected, increases precision and recall to 95% and 94%, respectively. Slightly lower values are achieved when predicting the failure of individual task attempts.

Chalermarrewong et al. [CAS12] use a two-stage model to predict hardware failures including overheating, resource exhaustion, bad sector access, and power supply failure. First, they calculate prediction values for several of the input parameters using an AutoRegressive Moving Average (ARMA) model. The results of the model are regularly compared to the actual data; if a statistical t-test indicates a significant deviation of the predicted values from the actual data, a re-training of the model is initiated. The ARMA predictions are then used to conduct an FTA [RS15]. The authors define threshold values for the outputs of the ARMA model and, according to these thresholds, map the leaves of the tree to 0 or 1. These boolean values are propagated toward the root node, where, finally, a binary prediction is calculated. For evaluation,

the authors use the simulator system Simics6 to simulate eight virtual machines, where their model yields a precision of 100% and a recall of 93%. No evaluation on a real-world physical system is done.

Guan et al. [GZF11] present an integrated approach that uses a collection of Bayesian models to detect anomalies, which are then classified as failures by a system administrator. A decision tree is further used for prediction. This decision tree is generated algorithmically by defining a gain function G(xi, n) = H(n) − H(xi, n), where H(n) is the entropy of node n and H(xi, n) is the sum of the entropies of all child nodes of node n when splitting at attribute xi. Since the algorithm may create an overfitted or unnecessarily complex tree, a part of the dataset unused for training is then used for pruning. The work uses a cloud consisting of eleven Linux clusters at the University of North Texas as testing ground, where performance data is gathered using sysstat7. No quantitative statement on precision and recall is given.
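As an illustration of the entropy-based gain computation described above (a sketch using the weighted sum of the child entropies, which is the common formulation of this gain; the sample counts are made up):

#include <math.h>
#include <stdio.h>

/* Shannon entropy of a binary label distribution (pos failures out of total). */
static double entropy(int pos, int total) {
    if (total == 0 || pos == 0 || pos == total) return 0.0;
    double p = (double)pos / total;
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p);
}

/* Gain G(x_i, n) = H(n) - H(x_i, n), with H(x_i, n) taken here as the
   weighted sum of the child entropies after splitting node n at x_i.   */
static double gain(int pos, int total, int pos_l, int total_l) {
    int pos_r = pos - pos_l, total_r = total - total_l;
    double h_children = ((double)total_l / total) * entropy(pos_l, total_l)
                      + ((double)total_r / total) * entropy(pos_r, total_r);
    return entropy(pos, total) - h_children;
}

int main(void) {
    /* Hypothetical node: 40 failure samples out of 100; the candidate
       split sends 30 failures and 20 non-failures to the left child.   */
    printf("gain = %.4f\n", gain(40, 100, 30, 50));
    return 0;
}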

Sîrbu and Babaoglu [SB16] use one month of Google job failure data to predict node failures within 24 hours using an ensemble of Random Forests. Data points were classified as SAFE (no failure) and FAIL (failure). Since the SAFE class was much larger, random subsampling was used for these points. Attempts at reducing the feature set via principal component analysis or correlation analysis were not successful. Sîrbu and Babaoglu found that using a simple Random Forest approach did not yield promising results, which is why they opted for an ensemble approach combining individual predictions using precision-weighted voting. With this approach, precision ranges between 50.2% and 72.8% and recall between 27.2% and 88.6%, depending on the size of the SAFE dataset.

3.3.5. Regression

Linear regression models the influence of one explanatory variable or a linear combination of several variables x1, ..., xn on a dependent variable y. Several variations of this approach exist: the standard linear regression model assumes that y is continuous, whereas the logistic regression model assumes that y is categorical, making this model especially useful for (binary) failure prediction. The autoregressive model assumes that the dependent variable y depends linearly on its own preceding values. Together with moving average models, which assume that the output value depends linearly on the current and past values of a stochastic term, these two models form the so-called AutoRegressive Moving Average (ARMA) models, which are frequently used in time series analysis [JYS19].
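As a minimal sketch of how a fitted logistic regression model would be used for binary failure prediction (the features, weights, and threshold below are made up for illustration):

#include <math.h>
#include <stdio.h>

/* Logistic regression score: probability of failure for a feature vector x,
   given weights w and bias b (illustrative, weights are made up).           */
static double predict_failure(const double *w, double b, const double *x, int n) {
    double z = b;
    for (int i = 0; i < n; i++)
        z += w[i] * x[i];
    return 1.0 / (1.0 + exp(-z));       /* sigmoid maps the score to (0, 1) */
}

int main(void) {
    /* Hypothetical model over two features: temperature and error count. */
    double w[] = { 0.08, 0.6 };
    double x[] = { 70.0, 4.0 };
    double p = predict_failure(w, -7.0, x, 2);
    printf("P(failure) = %.3f -> %s\n", p, p > 0.5 ? "FAIL" : "SAFE");
    return 0;
}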

6https://www.windriver.com/products/simics/, accessed in December 2018
7https://github.com/sysstat/sysstat, accessed in December 2018


Expanding on their prior work [Liu+11a], which uses a simple moving average, Liu et al. [Liu+11b] use an autoregressive model to predict failures in long-running computing systems. They further use a bootstrapping approach [ET93] to avoid problems arising from small sample sizes, and a Bayesian framework to account for the fact that in long-running systems there is a learning effect, which manifests itself in increasing reliability and in failure rates that follow an exponential function with a negative exponent. Using data from LANL, they find an accuracy of 80% for their method.

3.3.6. Classification

Statistical classification aims to classify a new observation into a specific category, given a number of observations whose categories are already known. Classification is similar to clustering, where determining the categories is part of the problem. Support Vector Machines (SVMs) are binary linear classifiers that aim at maximizing the distance between the observations and the (linear) classification boundary. The idea behind SVMs is to project the dataset into a higher-dimensional space in which a linear hyperplane separating the classes can be found [BGV92] [JYS19]. The k-nearest neighbor method is another classification technique commonly used for higher-dimensional data: each datapoint is classified by examining its k nearest neighbors in the training data set, as measured by a given distance metric. k-means clustering is a technique where each datapoint is assigned to the cluster whose mean is nearest to the datapoint; in this approach, no predefined categories are given [JYS19].
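The following small C sketch illustrates the k-nearest neighbor idea for a binary failure/healthy classification (the training points, query, and k are hypothetical):

#include <math.h>
#include <stdio.h>

#define DIM 2

/* Labeled training observation (illustrative). */
struct sample { double x[DIM]; int label; };   /* label: 1 = failure, 0 = healthy */

/* Classify a query point by majority vote among its k nearest neighbors
   (Euclidean distance); a simple O(n*k) selection is enough for a sketch. */
static int knn_classify(const struct sample *train, int n, const double *q, int k) {
    int used[64] = {0};                         /* assumes n <= 64 for brevity */
    int votes = 0;
    for (int j = 0; j < k; j++) {
        int best = -1; double best_d = INFINITY;
        for (int i = 0; i < n; i++) {
            if (used[i]) continue;
            double d = 0;
            for (int m = 0; m < DIM; m++)
                d += (train[i].x[m] - q[m]) * (train[i].x[m] - q[m]);
            if (d < best_d) { best_d = d; best = i; }
        }
        used[best] = 1;
        votes += train[best].label;
    }
    return (2 * votes > k) ? 1 : 0;
}

int main(void) {
    struct sample train[] = {
        {{1.0, 1.0}, 0}, {{1.2, 0.9}, 0}, {{5.0, 4.8}, 1}, {{5.2, 5.1}, 1}, {{4.9, 5.3}, 1}
    };
    double query[] = {4.8, 5.0};
    printf("predicted label: %d\n", knn_classify(train, 5, query, 3));
    return 0;
}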

Zhang et al. [Lia+07] use statistical classification to predict failures in a Blue Gene/L system, where “failures” are any events declared as FAILURE in the system logs. 142 days of logging data are divided into equal-sized time windows. Using data from the current window and a specified number of previous ones, a failure is predicted for the next time window. A time window of several hours gives enough lead time to perform prophylactic measures. After identifying key features (including the number of events of each severity in the current observation window, the time distribution of events, and the time elapsed since the last fatal event) and normalizing the numerical values, the authors apply three classification techniques: (1) RIPPER, a rule-based classification algorithm [Coh95]; (2) a support vector machine approach with a radial basis function, which comes with a significant drawback in the form of high training cost (ten hours in this specific case); and (3) a bi-modal nearest neighbor predictor, which uses two distance values for classification. With a twelve-hour prediction window, the bi-modal nearest neighbor, RIPPER, and the support vector machine perform reasonably well with a precision / recall of approx. 55% / 80% (approx. 70%

for RIPPER). As the time windows decrease in size, however, performance becomes significantly worse, with SVM facing the fastest decline. The bi-modal nearest neighbor predictor works well, achieving better results at six hours, four hours, and one hour than the other classifiers.

Nonnegative matrix factorization, a generalization of k-means clustering [DHS05], is used by Thompson et al. [Tho+10] to predict failure (as defined via log events) at Oak Ridge National Laboratory’s Blue Gene/P system. First, data from both logfiles and hardware is gathered in a data matrix D, where each column represents a different point in time and each row represents measurements from a different sensor. Since all entries of D are positive, nonnegative matrix factorization yields two matrices W and H with D ≈ W · H. W then gives the mean of the clusters and H indicates whether a datapoint belongs to a cluster. The classification into fault and non-fault data is then refined using a Generalized Linear Discriminant Analysis. Using a sliding window approach, which raises an alarm if two out of the last three signals indicate fault, this method correctly predicted five out of six failures.

Zhu et al. [Zhu+13] use both an SVM approach and a neural network approach (cf. Section 3.3.8) to build prediction models for disk failures. Using data from a 20,000-disk population at a Baidu8 datacenter in China, they consider both SMART values and changes in SMART values when building their models. An SVM with a radial basis function kernel is able to achieve false alarm rates (FAR) lower than 0.03% with a recall of 80%; higher recall values come at the expense of a greater false alarm rate or a longer training window. The average lead time for a prediction with a recall / FAR of 68.5% / 0.03% is 334 hours.

In addition to predicting node failures, Pelaez et al. [Pel+14] focus on the low overhead and online capabilities of their prediction method, a decentralized online clustering algorithm running on distributed nodes. The n-dimensional observation space is divided into several regions, where each region shows a higher-than-average density of observation points. Then, each region is assigned to one node that determines clusters and outliers and, if necessary, communicates with other nodes covering adjacent regions. To lower the false positive rate of the algorithm, the authors apply two additional techniques: temporal correlation, where several time intervals are used to detect outliers, and multiple clustering, which works on several different feature sets. A precision of 52% and a recall of 94% are achieved for predicting node failure on 32 processing nodes of the Stampede supercomputer. The runtime of the algorithm grows sublinearly with the number of events and faster than linearly with increasing numbers of nodes.

8www.baidu.com

Still, a test on several thousand nodes yields acceptable results of approx. 2% CPU utilization.

Xu et al. [Xu+10] chose a k-nearest neighbor method to predict failures in a distributed system with 200 nodes. Notably, their approach involves mapping higher-dimensional input data to a lower-dimensional space and using Supervised Local Linear Embedding (SLLE) to extract failure features. They find that SLLE outperforms a simple k-NN predictor and achieves a precision / recall of 66.4% / 59.3% when predicting failure in a file-transfer application.

3.3.7. Bayesian Networks and Markov Models

A Bayesian network is "a graphical model for representing conditional independences between a set of random variables" [Gha02]. A directed arc from a node A to a node B means that B is dependent or conditional on A; the absence of an arc means independence. Hidden Markov Models (HMMs) [Gha02], which have been prominently used in speech, handwriting, and image recognition, introduce a further assumption to regular Markov models: they assume that the process generating the model is unobservable. For each state S_t of this process, however, the Markov property still holds: the current state depends only on a limited number of previous states [JYS19].

Yu et al. [Yu+11] differentiate between period-based and event-driven prediction. For both, they apply a Bayesian network to predict events of severity FATAL using RAS logs from the Intrepid (Blue Gene/P) system at ANL. They find that the optimal duration of the observation window differs significantly for both approaches (10 min vs. 48 hrs), with the event-driven approach being more suitable for short-term predictions and the period-based approach being more suitable for long-term predictions. The period-based approach is less strongly influenced by lead time than the event-driven approach.

Agrawal et al. [AWR15] use a Hidden Markov Model to predict failures at the software level of a Hadoop cluster. In a preprocessing step, they use clustering to categorize errors from log files into six categories: Network, Memory Overflow, Security, Unknown, Java I/O, and NameNodes. Then, they apply the Viterbi algorithm [For73] to find the optimal (hidden) state sequence of the model. Using 650 hours of training data from an 11-node Hadoop cluster, the authors find a precision of 93% and a recall of 91% for their prediction model. Regarding scalability, they find that with their approach, execution time increases with data size and decreases with the number of nodes, although not linearly.

3.3.8. Neural Networks

Most machine learning and statistics textbooks (e.g., [Pat98; FHT01; RN03]) discuss Neural Networks (NNs), which have become a popular topic in recent years. Although neural networks can still be classed into the group of classifiers discussed above, they are treated separately here to emphasize their prominence. Similar to Support Vector Machines (SVMs), NNs are based on the idea of projecting data into a higher-dimensional space in order to find a (linear) separation between the classes [JYS19]. Zhang [Zha00] summarizes three important points, which should suffice for this section: (1) neural networks "are data driven self-adaptive methods"; (2) they are "universal functional approximators in that [they] can approximate any function with arbitrary accuracy"; and (3) they "are nonlinear models", which makes them especially compelling for practical applications [JYS19].

Using the above-mentioned Google dataset, Chen et al. [CLP14] use a Recurrent Neural Network (RNN) to predict job failures from task failures. An advantage of RNNs compared to other methods is that they consider interdependencies across time. Task priority and number of resubmissions, resource usage (including only mean CPU, disk, and memory usage, unmapped page cache, and mean disk I/O time), and the user profile available in the Google dataset are used for prediction. The prediction achieves an accuracy of 84% and a recall of 86% at the task level. Accuracy at the job level rises from more than 30% after one quarter of the job execution to more than 40% at the end of the execution; recall rises from more than 40% to approximately 70%. The authors estimate that their prediction could save 6% to 10% of resources at the job level when making predictions halfway through the execution using prediction thresholds that are not too aggressive.

In comparison with an SVM, Zhu et al. [Zhu+13] test a three-layer neural network with a sigmoid activation function on hard disk failure data from 20,000 drives. While offering a higher recall (94% to 100%), the neural network also comes with a higher false alarm rate of 0.48% to 2.26%. The authors suggest using a boosting method such as AdaBoost to further improve the performance of the neural network.

In a recent study, Islam et al. [IM17] propose a Long Short-Term Memory (LSTM) network to predict job failures using the Google dataset. By extending recurrent neural networks with memory cells, LSTM networks are especially suitable for capturing long-term temporal dependencies. Both raw resource measures and task attributes are fed into the network, which achieves slightly better results for task failure prediction than for job failure prediction: a precision of 89% and a recall of 85% are reported for task failure prediction, while the results for job failure prediction are slightly worse with a precision / recall of 80% / 83%. Both were found to outperform an SVM using the same data. Finally, the authors find that their approach leads to considerable savings of CPU time and other resources, for example up to 10% of CPU time even when failed tasks are resubmitted five times.

In a more recent study, Nie et al. [Nie+17] present an in-depth analysis of the correlation between temperature, power, and soft errors in GPUs and propose a neural network based prediction method, PRACTISE, for predicting soft errors in GPUs. To our knowledge, this is the only reference dealing with GPU failure prediction. The evaluation on 4 cabinets (384 nodes) of the TITAN supercomputer shows a precision of 82% and a recall of 95%.

3.3.9. Meta-Learning

Meta-learning is a method in which the metadata of a given dataset is used for machine learning. An intuitive interpretation is that meta-learning is the combination of several individual predictors to improve the overall prediction accuracy. In modern machine learning terms, the term deep learning is often used as a synonym [JYS19].

Gu et al. [Gu+08] present one such approach: on 130 weeks of data gathered from an IBM Blue Gene/L system, they test an algorithm that consists of three steps:

1. Meta-Learning: The meta-learner gathers several "base predictors" that are individually calculated from the last k weeks of data (sliding window).

2. Revision: The reviser exploits the phenomenon that failures in high performance computing usually show temporal correlation to generate an effective rule set by applying the rules gathered in the first step to data from the k-th week. If a rule does not perform well according to some measure (e.g., ROC analysis), it is discarded.

3. Prediction: Finally, the predictor monitors runtime events and predicts failures for the (k + 1)-th week.

9https://www.cs.waikato.ac.nz/ml/weka/

The authors test a Java implementation with the Weka Data Mining Tool9 for prediction and an Oracle database for storing knowledge on a Blue Gene/L system. In step (1), they choose one rule-based classifier and one statistical method. Precision and recall stabilize at 0.9 and 0.7, respectively, after increasing for the first ten weeks for k ∈ {2, 4, 6, 8, 10}. While a small value of k is initially necessary to build a rule set, the window size can be increased later on when the prediction accuracy has saturated.

In follow-up work [Lan+10], Gu et al. provide a more detailed analysis of the meta-learning approach mentioned above. They now consider that the severity levels of IBM Blue Gene/L logs might not correctly indicate underlying failures and develop a categorization of fatal events in cooperation with system administrators. A second data preprocessing step involves temporal and spatial filtering of log data, and for the meta-learning step they use a third base predictor: they calculate a CDF of failure occurrence and give a warning if the probability of a failure occurring is higher than a specified threshold. For evaluation, two systems at the San Diego Supercomputing Center (3,072 dual-core compute nodes) and at ANL (1,024 dual-core compute nodes) are used. A dynamically increasing training set yields better prediction results than a static training set, but the training then incurs a considerable overhead. Precision and recall range between 70% - 83% and 56% - 70%, respectively.

Alvaro et al. [Fra+19] introduce a Deep Neural Network (DNN) method using a combination of different neural networks to predict upcoming failures in the MOGON cluster at JGU Mainz. Networks using different lead time window lengths are trained on a dataset collected over 6 months. These networks are then combined using a "majority voter"-like structure. The authors focus on the previously under-researched problem of false positive predictions, which becomes prominent for large-scale jobs requiring many nodes. While the recall decreases from 85% to 73% with an increasing number of networks, the precision increases from 96% to 99.5%. With four networks, the false positive rate becomes negligible at 0.04%.

3.4. Insights and Discussions on Failure Predictions in High Performance Computing Systems

From more than 70 publications screened regarding failure prediction systems in HPC and cloud environments, a total of over 30 papers with the best-performing results were selected and presented above. The trends and gaps observed are presented and discussed below.


3.4.1. Effectiveness of Failure Prediction Systems

From Table 3.4, it is easy to observe that tree-based methods, such as random forest algorithms, have shown excellent prediction results in terms of precision and recall. Classic machine-learning methods such as NNs and SVMs perform worse than tree-based methods. The reason for this is the unbalanced distribution of features and classes used for training, as failure events are rare [SB16]. The rise of tree-based methods is also notable because the dominant methods in the 2010 survey by Salfner et al. [SLM10] use neural networks. Analytical and rule-based methods also provide good results with an easy setup bound to specific systems; however, due to the selection of rules, these methods are hard to generalize. Meta-learning methods are gaining prominence with the recent boom of research in the field of deep learning. By combining many sources and features, these methods are able to deliver good results. Furthermore, these methods [Gu+08; Lan+10; Gai+14; AWR15; GCK12; Pel+14] show low runtime overhead, making them useful as online learning methods [JYS19].

Table 3.4 also shows that most of the prediction methods do not pinpoint the specific fault or error (that is, the root cause) of the failure. This is most likely due to multiple reasons: First, most data collected from datacenters in raw format does not provide information on the root cause; it just contains operational data regarding system state and failure information. Exploring the root cause is a manual and expensive step. Second, the backtrace to the root cause usually does not add much value because the aim of failure prediction itself is to prevent user applications from failing. Nevertheless, we think it is important to pinpoint the root causes for the operators of these systems, in order to understand the weakest part of the "Liebig's Barrel". While disk and memory failures are easy to predict nowadays due to advanced failure avoidance and monitoring, RAID, chipkill, and registered ECC, emerging components such as GPUs are under-researched. With even newer technologies such as Non-volatile Memory (NVM) being deployed in data centers, the understanding and prediction of component failures may become challenging yet important for modern data centers [JYS19].

A further note is that similar algorithms do not necessarily produce the same prediction results; the training parameters (known as parameter tuning in modern machine-learning textbooks) are significant for the prediction results. In addition, as Klinkenberg et al. [Kli+17] point out, the feature extraction and selection steps are also vital for the performance of a prediction system. However, many papers do not show the detailed parameter tuning and feature selection steps and do not provide an analysis of the potential correlation between bias in the measurements and potential failure conditions. This leads to contradictory conclusions, making it hard to comment on the generalizability of specific phenomena [JYS19].


3.4.2. Availability of Datasets, Reproducibility of Research

In other research fields dealing with machine learning, a common dataset is used to ensure the reproducibility and generality of results. For example, the Common Objects in Context (COCO)10 dataset [Lin+14] is used in most publications in the fields of image segmentation, text recognition, computer vision, and much other image-related research. This large-scale, controlled dataset makes the comparison and reproduction of results very easy. This is clearly not the case in failure prediction. Although over 70 papers have been screened in this work, only a limited number of undocumented and outdated datasets exists. Prior to the publication of the Google Job Dataset in 2011, no studies on job failures had existed. Half of the datasets provided in the CFDR database are not accessible, and all of them lack suitable metadata or a proper description. This limitation significantly hampers research in this field and closes the door to researchers who do not have access to such systems. Moreover, modern machine-learning approaches such as transfer learning cannot be used [JYS19].

3.4.3. Metrics and the Effect of False Positive Rate


Figure 3.1.: Probability of Triggering an Unnecessary Migration or Checkpoint Due to Certain False Positive Prediction Rates [Fra+19].

Researchers have criticized these metrics in the past because they do not account for the loss of compute time or the overhead of migrations and checkpointing [SLM10; Tae+10; Zhe+10; EZS17]. A high false positive rate is one prominent issue because it triggers unnecessary checkpoints or proactive migrations [EZS17]. The fallout is always calculated based on the probability of misclassifying an ok event for a single node. The problem therefore increases significantly with the maximum job size and consequently with the number of nodes, making failure predictions useless for practical use in large-scale HPC centers. Most notably, in previous work [Fra+19] we point out that for any large job an unnecessary fault mitigation becomes almost certain.
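The effect can be made explicit with a simple relation. Assuming, as a simplification, that the per-node false positive rate fpr is identical and statistically independent across the N nodes of a job, the probability of at least one false alarm, and hence of an unnecessary checkpoint or migration, per prediction interval is

\[
P_{\mathrm{UC}}(N) \;=\; 1 - (1 - \mathit{fpr})^{N},
\]

where fpr is the per-node false positive rate per prediction interval and N is the number of nodes used by the job. Even a seemingly small fpr of 0.1% then yields P_UC of roughly 63% for a 1,000-node job, consistent with the trend shown in Figure 3.1. The notation P_UC is introduced here only for illustration and is not taken from [Fra+19].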

10http://cocodataset.org/home, accessed in December 2018


Although the significance of the false positive rate for the overall performance of a fault prediction system is very high, none of the surveyed papers focuses on this problem. In fact, our previous work [Fra+19] was the first to introduce a metric of practical importance for a large-scale batch cluster system.

3.4.4. Failure Prediction and Fault-mitigation Techniques

To conclude, the current fault prediction methods presented in the literature are able to predict errors with practically useful performance. However, the problem of false positives is not well understood and needs to be researched further. Since the overhead of performing fault mitigation upon any predicted error varies, a sufficiently low-overhead fault-mitigation method is required to support fault predictors, which nowadays still produce a fairly large number of false positives. To understand the different fault-mitigation techniques, we present the state-of-the-art methods for fault mitigation and failure avoidance in the next chapter.

4. Fault Tolerance Strategies

In Chapter 3, we worked through a summary of the current state of the practice in the field of failure prediction. Several shortcomings have also been identified; most prominently, false positive rates are neither covered nor discussed in the existing literature and methods [JYS19]. Without this information, an accurate evaluation of a real-world system with different fault mitigation techniques cannot be conducted. To introduce the different state-of-the-art methods involved in fault management, this chapter first defines the scope of this dissertation: the Batch Job Processing Clustering System (BJPCS). This is followed by a taxonomy of methods for failure handling, after which the concepts of checkpointing and migration are explained in detail. It is not possible to compare different fault mitigation techniques without a proper context. Consequently, we do not aim to provide a manual on fault mitigation strategies; instead, the focus and requirements are introduced and discussed in this chapter.

4.1. System Architecture of Batch Job Processing System

The focus of this dissertation is the Batch Job Processing Clustering System (BJPCS), which includes classic HPC systems as well as some modern cloud systems such as the Amazon AWS Batch System1 and the Google Cloud Platform2. We define a BJPCS as stated in Definition 4.1.1. What these systems have in common is their use case, which is illustrated in Figure 4.1.

Definition 4.1.1. Batch Job Processing Cluster System A Batch Job Processing Clustering System (BJPCS) is a distributed cluster system, consisting of a set of physical computing nodes (servers), that can handle a batch (scripted) job from a potential user synchronously or asynchronously. The batch job is executed according to some schedule on the processing cluster, and the execution results are delivered to the user.

A user of such a system provides a particular application and corresponding data, which together are called a job. This job is submitted to a job processing queue through a frontend node. According to a certain scheduling algorithm, the jobs in the job queue are scheduled onto a processing cluster.

1https://aws.amazon.com/batch/, accessed in March 2019
2https://cloud.google.com/dataflow/, accessed in March 2019


Figure 4.1.: Typical Use Case for a Batch Processing System

The cluster then processes each job and returns the result according to the respective user's instruction. In the scope of this work, only failures which occur in the processing cluster are considered, because it is the most crucial part of such a system. Furthermore, as most of the servers and computers in such a system belong to the processing cluster, failures are also more likely to occur there. Moreover, since the user neither controls nor is allowed to access the processing cluster, a "hot" failure cannot be addressed by the user online. Hence, user-provided applications must be capable of overcoming these failures.

With this background, one needs an overview of the hardware and software architecture of such a system. Although this is an emerging field, there is a generic design, which is visualized in Figure 4.2. Components such as frontend (login) nodes and management nodes are not included in Figure 4.2; however, these specialized nodes can also be treated as simple "nodes" for modeling purposes. The components of such a generic cluster, including the applications and their components running on the cluster, are described as follows:

• Node: A node is the hardware entity of this cluster. It is typically a bare-metal server with a specific hardware configuration.

• Operating System (OS): The system software on each node, including the software that is responsible for working with the intra-node resource manager such as SLURM [YJG03].


Figure 4.2.: System Architecture Overview of a Generic Batch Processing Cluster


• Container and Virtual Machine (VM): Recent approaches [Pic+18; Pic+16a; Pic+14] suggest placing the actual application in containers or VMs. This way, with proper support from the underlying OS and hardware, these containers can be migrated to another node. The dashed line in Figure 4.2 indicates that this layer is optional. In fact, as far as the author knows, these techniques are not yet used in any production environment for an HPC system; however, almost all cloud systems are based on either VMs or containers.

• (Application) Process: In this case, we mean the user applications’ process. There can be multiple processes on the same node.

• Data: Any useful application in a batch processing job is designed to process some data. By Data we refer to the application data that is to be processed by a specific kernel.

• Kernel: The (compute) kernel is the subroutine (or part of the program) that is responsible for processing the data. In a well-engineered application, the kernel should be decoupled from the data.


• Communication: This is the crucial part that connects the independent server nodes into a whole processing cluster system. Typically, this includes both hardware and software working together. Examples of the hardware (physical layer) include standards such as InfiniBand and OmniPath for HPC systems, and usually Gigabit Ethernet (GbE) or Terabit Ethernet (TbE) for cloud systems. Software for user-level communication (transport layer and above) is usually a library that handles data transfer, such as the Message Passing Interface (MPI) for HPC or the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) in cloud systems.

• File system: A distributed or shared file system is also crucial for the operation of a batch processing cluster. Hardware for the shared file system includes Network Attached Storage (NAS) and Storage Area Networks (SAN). As for the software, alongside the well-known NFS, examples include IBM Spectrum Scale (GPFS)3, Intel Lustre4, Google Cloud Filestore5, and the Distributed File System (DFS)6.

Typically, such a batch cluster is designed to tolerate a single point of failure. However, to our knowledge, some crucial parts are more likely to propagate faults and errors to other parts of the system. Among these parts, the communication components and the file system are common troublemakers, as they are resources shared by all nodes. For example, if a network switch malfunctions, all the nodes connected to that specific switch may suffer from the fault. Any application running on those nodes would fail, and most likely no prediction would be available on a per-node basis. In the following section, different methods for fault mitigation are introduced and discussed.

4.2. Fault-mitigation Mechanisms

Fault-mitigation is the required step of reacting to a failure condition. Fault-mitigation can be achieved on both hardware and software levels. For example, the well-known Error Correcting Code (ECC)-Memory is a hardware technology that adds parity bits to detect and mitigate internal data corruption. In this dissertation, the focus of fault-mitigation techniques is on the software and the user application.

3https://www.ibm.com/us-en/marketplace/scale-out-file-and-object-storage, accessed in March 2019
4http://lustre.org, accessed in March 2019
5https://cloud.google.com/filestore/, accessed in March 2019
6https://docs.microsoft.com/en-us/windows-server/storage/dfs-namespaces/dfs-overview, accessed in March 2019


In the terminology of systems engineering, there are two types of strategies:

• Fail-Safe: A fault does not lead to harm to other components, the environment, or human life.

• Fail-Operational: The system can provide continuous operation despite the occurrence of a fault. A fail-operational system is often also referred to as a fault-tolerant system.

However, this terminology does not fit directly for fault topics in HPC and cloud systems. In this dissertation, we introduce the following definitions for fault-tolerant applications:

Definition 4.2.1. Fail-Safe Applications An application (for a batch cluster system) is considered fail-safe if it terminates properly upon an error or fault and does not propagate its fault across the system, affecting other parts of the whole system. In addition, it must be possible to re-execute it without user intervention.

Definition 4.2.2. Fail-Operational Applications An application (for a batch cluster system) is considered fail-operational if it can continue execution even after encountering a fault.

The simplest example of a fail-safe method is termination: an application instance terminates if any fault is detected. This way, it is certain that no further error propagation can happen. Today's job schedulers kill any job that has encountered a fault. Moreover, most applications are not designed to tolerate node failures, causing them to fail automatically due to the unsuccessful access to resources. However, this has no value in terms of fault mitigation. Any reasonable BJPCS nowadays provides at least an auto-requeue (also known as retry) function, which automatically sends a failed job back to the job queue and reschedules it to be executed later.

For all fail operational strategies, two subcategories are the most important:

• Fault-mitigation: A fault-mitigation approach is also known as reactive fault tolerance. It is based on the detection of a fault occurrence and reactively deploys a mechanism to handle any failure that occurs.

• Failure avoidance: Failure avoidance is also called proactive fault tolerance. It is based on the idea of forecasting upcoming potential faults and actively avoids any failure. The usefulness of failure avoidance methods relies heavily on the quality of the prediction about these faults.


4.2.1. Overview of Fault Tolerance Techniques

To provide an overview of the vast number of different techniques for fault tolerance, we have created a taxonomy of strategies shown in Figure 4.3. Recalling Definitions 4.2.1 and 4.2.2, we have grouped the fault strategies termination and auto-requeue into the fail-safe strategies because they only limit error propagation and do not save previous work. The concept of checkpoint&restart is a strategy which provides basic fail-operation, but a significant overhead is expected, as the application needs to be restarted and potentially a significant amount of work has to be redone. Both migration and Algorithm-based Fault Tolerance (ABFT) are included in the fail-operational category because they provide continuous operation, even when encountering a fault. However, migration relies on the support of modern hardware and software, and typically on some prediction of upcoming faults.

VM and container migration are fault avoidance methods, which focus on the prevention of any upcoming fault. Therefore, they rely on sufficiently good prediction systems to predict an upcoming event. However, they have the advantage of being application transparent, which means that an existing application does not need to be changed to support fault tolerance.

Checkpoint&restart is the state-of-the-art technology for fault tolerance and the sole efficient method for handling random, unpredicted faults. It can be both application transparent and application cooperative by checkpointing on different levels. On the system level, components such as VMs and containers can be checkpointed. On the application level, user-space applications can checkpoint their intermediate results (data). This method can also be combined with migration-based methods to provide both proactive and reactive fault tolerance.

Algorithm-based Fault Tolerance (ABFT) is an application-integrated approach for providing fault-tolerant algorithms in math-intensive applications. Its nature makes it unsuitable for more service-oriented cloud systems. These methods are discussed in detail in the following sections.


Figure 4.3.: Taxonomy of Different Fault-handling Techniques

4.2.2. Application-integrated vs. Application-transparent Techniques

Another way often used to classify fault-handling approaches is based on the role of the application. As previously shown in Figure 4.2, there are many (logical) components in which faults can be handled. As a result, two classes of fault tolerance approaches can be derived:

• Application-integrated: An application-integrated fault tolerance strategy requires the modification of the implementation.

• Application-transparent: An application-transparent fault tolerance strategy relies on the system-level components such as OS, VM or containers.

For new applications, application-integrated methods usually deliver better performance, as fault tolerance can be programmed tightly into the application. However, for an existing application, an application-transparent strategy often yields better cost efficiency, because the redevelopment of an existing large-scale application is very costly. The selection of the method therefore depends heavily on the use case.

4.2.3. Checkpoint and Restart

Checkpoint&Restart [Ell+12; JDD12; Wan+10] is still the state-of-the-art technique for fault mitigation both in the field of HPC and in cloud systems. The principle is straightforward: the application creates a checkpoint periodically. The period for a checkpoint is related to the size of the application, the MTBF of the system, and many other factors.


Figure 4.4.: State Chart for Classic Checkpoint&Restart

Upon encountering a fault, the application can be terminated and restarted from a previously saved checkpoint on a different set of nodes. This way, depending on the frequency of checkpointing, only a minimal amount of work needs to be redone on a failure. This mechanism is visualized in the state chart in Figure 4.4. The process of checkpoint&restart is described as follows:

• The application is initialized and executing normally.

• At a given frequency, or at a predefined point in time, a checkpoint is created to save the current work progress (checkpoint).

• The application continues executing.

• If a fault occurs, the application is restarted and initialized using the previously secured checkpoint (restart).

• The execution then continues until the application finishes.

Combined with a fault prediction system, the frequency of checkpoints can be reduced. With excellent predictions, it is also possible to trigger a checkpoint on demand. This further reduces the overhead caused by checkpoint&restart and improves its scalability. The adapted state chart for checkpoint&restart with fault prediction is shown in Figure 4.5. As checkpoint&restart is a very sound and well-understood method, many well-known examples exist in the literature. The most widely used examples for HPC systems include Checkpoint/Restart in Userspace (CRIU) [LH10], Berkeley Lab Checkpoint Restart (BLCR) [HD06], and Distributed Multi-Threaded Checkpointing (DMTCP) [AAC09]. A minimal application-level sketch of this scheme is given below Figure 4.5.


Figure 4.5.: State Chart for Checkpoint&Restart with Fault Prediction
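The following is a minimal, self-contained sketch of an application-level (application-cooperative) checkpoint&restart loop as described above. The file name, the checkpoint interval, and the state layout are illustrative assumptions; the sketch is not tied to CRIU, BLCR, or DMTCP.

```c
#include <stdio.h>
#include <time.h>

#define CKPT_FILE     "state.ckpt"   /* illustrative checkpoint location */
#define CKPT_INTERVAL 600            /* seconds between checkpoints (assumed) */

typedef struct { long iteration; double data[1024]; } AppState;

/* Write the application-relevant state to stable storage. */
static int write_checkpoint(const AppState *s) {
    FILE *f = fopen(CKPT_FILE, "wb");
    if (!f) return -1;
    int ok = fwrite(s, sizeof(*s), 1, f) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* On (re)start, try to resume from the last checkpoint; otherwise start fresh. */
static void init_or_restart(AppState *s) {
    FILE *f = fopen(CKPT_FILE, "rb");
    if (f && fread(s, sizeof(*s), 1, f) == 1) { fclose(f); return; }  /* restart */
    if (f) fclose(f);
    s->iteration = 0;                                /* cold start */
    for (int i = 0; i < 1024; i++) s->data[i] = 0.0;
}

int main(void) {
    AppState state;
    init_or_restart(&state);
    time_t last_ckpt = time(NULL);

    for (; state.iteration < 1000000; state.iteration++) {
        /* ... one step of the actual computation would go here ... */

        if (difftime(time(NULL), last_ckpt) >= CKPT_INTERVAL) {  /* timer trigger */
            if (write_checkpoint(&state) == 0) last_ckpt = time(NULL);
        }
    }
    return 0;
}
```

With a fault prediction system, the timer-based trigger in this loop would simply be complemented or replaced by an on-demand trigger from the predictor, as in Figure 4.5.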

4.2.4. Migration

Migration is an attempt to reduce the overhead of checkpoint&restart. Depending on what is being migrated, two different approaches are well known among researchers: process migration and VM/container migration. Any migration framework requires a component that actively triggers a migration. Ideally, using a fault prediction system, a migration can be triggered upon a positive fault prediction. As the application is never terminated, the overhead of migration is theoretically lower than that of checkpoint&restart. Furthermore, since only a small part of the application is affected (usually the part on the one node suffering from an upcoming error), the scalability is also better than that of checkpoint&restart. This argument is used repeatedly in many studies in migration research [Wan+08; Wan+12; Pic+14; Pic+18; Pic+16a]. Since both state information and application data need to be migrated during the migration process, most migration techniques are application transparent. The process of migration is as follows:

• The application is executed normally on the BJPCS.

• A trigger system (such as a fault prediction system) informs the migration system that migration needs to be performed.

• (Optional) The application is informed, synchronized and halted.

• Migration is performed.


Figure 4.6.: State Chart for Migration

• The application continues normal execution.

This process is also visualized in Figure 4.6. The big drawback of these migration-based methods is that spontaneous errors which are not covered by the prediction cannot be handled by the migration framework, as there is no stateful checkpoint stored on the system for recovery. In this sense, migration is a failure avoidance method rather than a fault-mitigation method. Nevertheless, migration can be used alongside checkpoint&restart to reduce the frequency of checkpoints, thus decreasing the overall overhead resulting from fault mitigation. In the remainder of this section, we explore the difference between process migration and VM/container-based migration methods.

Process Migration

Process migration is the act of moving a process to another physical machine [DO91; Mil+00]. It is not a new concept, and many attempts [Ste96; CLG05] have been made to extend existing programming models such as MPI accordingly. The idea of process migration is that, with proper support from the operating system and the runtime system, a process can be migrated to another machine transparently at runtime. However, process migration is not common in real-world applications, as there are so-called residual dependencies on the machine from which a process is being migrated [Mil+00]. One of the most notable examples of such work is from Wang et al. [Wan+08], who extend MPI with migration functionality. To accomplish this, they extended an MPI implementation with BLCR. The state and data of a running process are secured using BLCR, and the process is terminated. This checkpoint is then transferred to another node, and the state and data are restored. Afterward, the process can be restarted on the new node.

The proposed solution requires the modification of MPI, as well as support from the operating system. This limitation is quite significant because modifying MPI and meeting the particular system requirements is not possible everywhere.

Virtual Machine Migration

Migration of VMs is a well-developed technology nowadays. There is a significant amount of research on VM migration, including [NLH+05; Cla+05; Woo+07; Pic+18; Pic+16a; Pic+14; XSC13; Voo+09]. The most important use case for VM migration is in the field of cloud computing, where resource elasticity is required and a VM might be resized and moved to another physical location at runtime. Everyone who has moved a virtual machine from one physical machine to another and restarted the execution has already performed a VM migration manually. One of the first Virtual Machine Managers (VMMs) to support live migration was Xen7 [Cla+05].

In a modern VM migration approach such as [Cla+05], live migration is typically performed in the following manner: First, the required resources are verified on the migration target. If all requirements hold, a quick snapshot is created for the VM. The data (such as memory pages), resource requirement information (such as I/O devices), and state information regarding the VM are sent to the new destination, and all dirty pages and changes are sent incrementally to the destination. At a certain point, the VM on the origin machine is halted, and all remaining state and changes are committed to the target. Afterward, the VM on the origin machine can be terminated, and the VM on the target machine can be started and continues the operation. This way, the downtime (service outage) can be kept within the millisecond range [Cla+05].

VM migration has been tested for HPC systems as well [Nag+07; Pic+16b; Pic+14]. The major problem when deploying VM migration is that these systems may contain special communication interfaces such as InfiniBand and OmniPath, which do not support live migration. A major reason for this lack of support is that, due to latency and throughput requirements, technologies such as Remote Direct Memory Access (RDMA) are deployed in these systems, and no driver yet supports rebuilding the RDMA mappings without modification. As a workaround, Pickartz et al. [Pic+18] propose switching to TCP/IP-based communication before migration and switching back to the InfiniBand connection after migration. However, this requires active support by the MPI library or the user application. Another issue is that accelerators in heterogeneous compute nodes, such as GPUs and Field Programmable Gate Arrays (FPGAs), cannot be halted easily.

7https://xen.org, accessed in March 2019
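The pre-copy procedure described above can be illustrated with the following toy simulation of the transfer loop. Pages, dirtying, and transfers are all simulated in plain memory; a real VMM would use hypervisor-level dirty-page tracking, and the thresholds chosen here are arbitrary assumptions.

```c
/* Toy simulation of the pre-copy live migration loop described above. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_PAGES      4096
#define STOP_THRESHOLD 32     /* stop-and-copy when few dirty pages remain (assumed) */
#define MAX_ROUNDS     30     /* give up iterating and stop-and-copy anyway */

static bool dirty[NUM_PAGES];

/* The running VM keeps touching memory while pages are being transferred. */
static void simulate_vm_writes(int writes) {
    for (int i = 0; i < writes; i++)
        dirty[rand() % NUM_PAGES] = true;
}

/* "Transfer" all currently dirty pages and mark them clean; return the count. */
static int transfer_dirty_pages(void) {
    int sent = 0;
    for (int p = 0; p < NUM_PAGES; p++)
        if (dirty[p]) { dirty[p] = false; sent++; }
    return sent;
}

int main(void) {
    for (int p = 0; p < NUM_PAGES; p++) dirty[p] = true;   /* round 0: full copy */

    int round = 0, remaining = NUM_PAGES;
    while (remaining > STOP_THRESHOLD && round < MAX_ROUNDS) {
        int sent = transfer_dirty_pages();       /* iterative pre-copy round */
        simulate_vm_writes(sent / 4);            /* VM dirties pages meanwhile */
        remaining = 0;
        for (int p = 0; p < NUM_PAGES; p++) remaining += dirty[p];
        round++;
        printf("round %d: sent %d pages, %d still dirty\n", round, sent, remaining);
    }

    /* stop-and-copy: the VM is paused, the small remainder is sent, and the VM
       resumes on the target machine */
    printf("pausing VM, sending final %d pages, resuming on target\n",
           transfer_dirty_pages());
    return 0;
}
```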


To conclude, VM migration is a prominent and solid method of fault management. In the area of cloud computing, VM migration is already a state-of-the-art technology. In the field of HPC, however, the overhead of a virtual machine and the lack of support for HPC-specific hardware and technology are drawbacks.

Container Migration

A modern, modified "flavor" of VM migration is the so-called container migration. A container is a technology through which the OS provides multiple isolated user-space instances on top of the same kernel space. The main advantage is the reduced overhead compared to VM technology. The most famous examples of containers are Docker8, LXC9, and Singularity10. The migration of a container requires similar steps; however, only user-space data and state need to be migrated from one host to another. This concept has also been extensively investigated for cloud computing [MYL17; Qiu+17; Nad+17]. In the field of HPC systems, initial work also exists, such as [Pic+14; Pic+16a]. However, the same limitations regarding system and hardware support apply to container migration.

Combining Migration with Checkpoint&Restart

Migration-based methods can also be combined with checkpoint&restart. In a cloud environment, this is often used for testing purposes. By checkpointing the system state at a given time, a restart (or rollback) can be initiated. This way, the migration-based method can be extended to support random faults11, which are hard to handle using migration alone. The state chart for this combined method is given in Figure 4.7. The downside of checkpointing a VM or container is the significant overhead, as not only application data but also data that is required by the operating system to facilitate the application instance is subject to being checkpointed. An example of migration-based fault tolerance combined with checkpoint&restart is given in Figure 4.7. Its functional principle is summarized as follows (a minimal sketch of the corresponding control loop is given after the list):

• Assuming that the application is running normally, a checkpoint is created at a given frequency.

• A fault prediction system continuously produces predictions regarding upcoming faults.

8https://docker.io, accessed in March 2019
9https://linuxcontainers.org, accessed in March 2019
10https://www.sylabs.io/docs, accessed in March 2019
11In this context, by random fault we refer to faults which are not predicted by a fault prediction system.


Figure 4.7.: State Chart for Migration with Checkpointing

• If a fault is predicted, a migration is performed to avoid the failure of the application.

• If there is a random failure, a restart routine is executed to restart the application from the last checkpoint saved.
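The control flow of Figure 4.7 can be paraphrased as a small event loop, sketched below with a simulated event source; all names are illustrative and not part of any real migration or checkpointing framework.

```c
/* Simulated event loop for the combined migration + checkpoint&restart scheme. */
#include <stdio.h>

typedef enum { EV_NONE, EV_TIMER, EV_PREDICTED_FAULT, EV_RANDOM_FAILURE, EV_DONE } Event;

/* A fixed event script stands in for the timer, the fault predictor, and real failures. */
static Event next_event(void) {
    static const Event script[] = { EV_TIMER, EV_NONE, EV_PREDICTED_FAULT,
                                    EV_TIMER, EV_RANDOM_FAILURE, EV_DONE };
    static int i = 0;
    return script[i++];
}

int main(void) {
    for (;;) {
        switch (next_event()) {
        case EV_TIMER:           puts("checkpoint: save state at fixed interval");   break;
        case EV_PREDICTED_FAULT: puts("migrate: move work away from the failing node"); break;
        case EV_RANDOM_FAILURE:  puts("restart: reload the last checkpoint");        break;
        case EV_DONE:            puts("finished"); return 0;
        default:                 /* keep executing the application */                break;
        }
    }
}
```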

To summarize, with VM and container migration, two different application-transparent failure avoidance techniques have been introduced. Their most significant advantage is that the user application does not need to be changed, which increases the usability from the user's point of view. One of their major disadvantages is the relatively high overhead, because both migration methods are virtualization-based. In addition, these two migration methods rely on the support of the hardware platform. With process migration, the runtime overhead is eliminated because the method operates directly on the application process. However, process migration demands both OS and runtime library support.

At migration time, all of the above-mentioned migration methods have to migrate more data than just the useful application data, as they have to reconstruct the application process at the destination. It is easy to see that with proper application support, the overhead at migration time can be reduced by migrating only the useful application data. In addition, the application can make use of existing hardware, OS, and libraries to complete the migration; a reconstruction of an application instance is not necessary. In the next chapter, we will introduce data migration, an application-integrated migration method, which features a low overhead and requires neither hardware nor OS support.

4.2.5. Algorithm-based Fault Tolerance

Unlike system-level fault tolerance strategies such as migration and data-level strategies such as checkpoint&restart, Algorithm-based Fault Tolerance (ABFT) is an attempt to provide fault tolerance within the compute kernels of the application. First introduced by Huang and Abraham [Hua+84], ABFT is under active research in many studies [Bos+09; CD08; Ban+90; AL88; Che13], especially for HPC applications, as most HPC applications are iterative in some form. There are many kinds of ABFT. In the original work, Huang and Abraham [Hua+84] introduce checksums to verify the results of different matrix operations, which can detect faulty operations due to data corruption. A more modern idea of ABFT is that, with proper partitioning, the calculation error introduced by a given number of faulty processors (or other components) can be compensated for and kept at a small level. As machine error cannot be fully eliminated anyway, the approximation is considered useful if its margin is small enough; the effect of additional errors can then also be kept within an acceptable range. Since modern applications typically rely on external math libraries such as Intel MKL [Wan+14] or BLAS [Law+77] to perform mathematical operations, developing one's own ABFT kernel requires significant development effort, especially for existing code. Any ABFT-based approach must target these libraries to provide sufficient support for real-world scenarios.
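To illustrate the checksum idea on a small example: for a matrix product C = A * B, the column sums of C must equal the column-checksum row of A multiplied by B, so a corrupted entry of C can be detected by comparing the two within a floating-point tolerance. The following sketch demonstrates only this detection step and is not an implementation of the full encoded-matrix scheme from [Hua+84].

```c
#include <math.h>
#include <stdio.h>

#define N   3
#define TOL 1e-9   /* tolerance for floating-point round-off (assumed) */

/* Plain matrix multiplication C = A * B. */
static void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++) C[i][j] += A[i][k] * B[k][j];
        }
}

/* ABFT-style check: (1^T A) B must equal 1^T C, i.e. the column sums of C. */
static int verify_column_checksums(const double A[N][N], const double B[N][N],
                                   const double C[N][N]) {
    double colsumA[N] = {0}, expected[N] = {0};
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) colsumA[k] += A[i][k];
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++) expected[j] += colsumA[k] * B[k][j];

    for (int j = 0; j < N; j++) {
        double actual = 0.0;
        for (int i = 0; i < N; i++) actual += C[i][j];
        if (fabs(actual - expected[j]) > TOL) return 0;   /* corruption detected */
    }
    return 1;
}

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}}, B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double C[N][N];
    matmul(A, B, C);
    printf("checksums %s\n", verify_column_checksums(A, B, C) ? "match" : "MISMATCH");

    C[1][1] += 1.0;   /* inject a single corrupted result element */
    printf("after corruption: %s\n",
           verify_column_checksums(A, B, C) ? "match" : "MISMATCH");
    return 0;
}
```

In the full scheme, both a column-checksum row and a row-checksum column are carried through the computation, which also allows a single erroneous element to be located and corrected.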

4.2.6. Summary of Fault Tolerance Techniques

To conclude this chapter, Table 4.1 summarizes all the fail-operational methods discussed in this chapter. While checkpoint&restart supports a variety of system configurations, migration-based methods have lower overhead. The highly specialized ABFT provides lightweight fault tolerance for HPC applications; however, it cannot handle all kinds of errors and does not suit most cloud systems.

With sufficient fault prediction, migration-based fault tolerance should be more efficient than checkpoint&restart. This is also the conclusion of many studies by Pickartz et al. [Pic+14; Pic+16a; Pic+16b; Pic+18]. However, combined with the previously mentioned high false positive rates, the decision to migrate can still be costly in the case of a false positive. Moreover, without a prediction system, it is generally not possible to recover from a fault by using migration. This limits the use cases for migration-based fault tolerance approaches.

The overhead of fault tolerance approaches has not been quantitatively analyzed in this chapter. The reason is that, depending on the use case, the overhead can vary significantly. For example, the live migration of a main-memory Database Management System (DBMS) incurs a significantly higher overhead than checkpoint&restart of a non-volatile-memory-based DBMS. The next chapter deals with the concept of data migration in detail and presents two different techniques: one with system support and another one without system support.


Table 4.1.: Overview of Fault Tolerance Techniques

Location    Name                                      Prediction Required?   Special Hardware Required?   Cloud/HPC
Data        Checkpoint&Restart                        optional               none                         both
Data        Data Migration                            yes                    none                         both
System      VM Migration                              yes                    yes                          both
System      Container Migration                       yes                    yes                          both
Algorithm   Algorithm-based Fault Tolerance (ABFT)    none                   none                         HPC

5. Data Migration

Numerous fault tolerance strategies were introduced and discussed in the last chapter. In this dissertation, we now focus on the promising concept of data migration. Data migration is a proactive and application-integrated fault avoidance technique from the class of migration techniques. As mentioned before, it features a low overhead by design, at runtime as well as at migration time, because no virtualization is deployed. Instead of migrating a whole Virtual Machine (VM) or a container, the application is in charge of choosing the relevant information to be migrated. This reduces the footprint of the migration operation. It also gives the programmer the option of reacting to an event or not. Furthermore, since classic applications usually provide support for checkpoint&restart, the selection of relevant data is a natural step. This means that a low transition cost can be expected for an existing application using this approach. However, any existing application has to be modified to support data migration, as it is not application transparent. The trade-off between the development cost and the overhead at runtime needs to be considered thoroughly.

The basic idea of data migration is simple; two examples are shown in Figure 5.1. For an application running on two nodes (nodes 0 and 1), the data of the application is distributed across these nodes. Upon (predicted) failure of node 0, the data on node 0 is transferred to node 1, and the affected node 0 is eliminated from the concurrent execution. If a spare node (node 2) is available to the application, the data on node 0 can instead be moved to the spare node, where another instance of the application is started. In either case, the failure of the application is avoided: fault tolerance is achieved in systems both with and without a spare node by migrating the data of a failing node. From this example, we can see that data migration is useful if the distributed parallel application follows the Single Program Multiple Data (SPMD) model, which is also the most common model for parallel programming. Nevertheless, the data migration technique can also be derived for other programming styles and distributed applications, if the application part running on the target node can provide the same functionality.

Load balancing after data migration is also an important issue to be addressed. Data in parallel applications is usually carefully partitioned into equal subsets to be processed with similar time requirements.


Figure 5.1.: Schematic Example of Data Migration

This way, waiting time can be kept at a minimum, and synchronous communication can be carried out with minimal time effort. However, the migration of application data without a spare node breaks this balance. The application must react to this imbalance in order to restore efficient execution. To sufficiently support data migration, an application must therefore be able to dynamically repartition its data after migration. Another technical challenge is to expose the information on upcoming events to the application so that it can react to these events. As there are many programming models and libraries in parallel programming, the unification and standardization of such interfaces for external information is a real challenge. One sensible way is to extend an existing programming model and library for this purpose. In the remainder of this chapter, the concept of data migration and its requirements are introduced in detail. Later in this work, we provide two different library prototypes to support data migration for existing and new applications.

5.1. Basic Action Sequence for Data Migration

In the following, the steps for data migration without a spare node are explained in detail. An example with three application instances and a fault predictor, but without a spare node, is presented in Figure 5.2. For triggering data migration without a spare node, the running application must react to the notification of a predicted upcoming event from a fault prediction system. Such a system can be an extension to a runtime system such as SLURM, or a standalone application that has access to the monitoring infrastructure. The prediction result should be broadcast to all nodes so that the event is known to all participating instances of a running application. If the prediction system does not broadcast the information about a fault but only informs the failing node, the application instance on the failing node needs to relay this information and forward it to the other instances. Without information about an upcoming fault on all application instances, the non-failing instances cannot initiate the actions for data handover and application synchronization. It is also important to point out that the lead time of the prediction system must be long enough so that a live migration can be carried out by the application before its failure.

After all application instances perceive a predicted fault, the failing instance must transfer its data to the other application instances. The data transfer can be done asynchronously; however, one must be sure that the data transfer succeeds on all application instances. Otherwise, the application becomes corrupted, and data migration fails. In practice, it is essential to restore the balance of the data distribution across the remaining nodes after the migration succeeds.


Figure 5.2.: Sequence Diagram of a Basic Data Migration


Figure 5.3.: Sequence Diagram of a Basic Data Migration with a Spare Node

Load balancing often limits the efficiency of a parallel application significantly. Therefore, the load balance must be preserved or restored after failing nodes are removed from the application. In the case where a spare node is available to the application, the action sequence is only slightly different from the previous case. Figure 5.3 shows the required steps for repartitioning with one spare node for an application with three instances and nodes. Upon prediction of an upcoming fault, an application instance must be started on the spare node. Instead of redistributing the data among the remaining nodes, the data from the failing node can be transferred to the newly connected application instance. This way, the unaffected application instances can continue operation until the next global synchronization point is reached. Data migration with a spare node may also require a global redistribution of data for load balancing purposes, if the number of failing nodes and the number of additional nodes are not the same.
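The handshake of Figures 5.2 and 5.3 can be sketched with plain MPI calls. In the following toy program, the fault predictor is simulated by a broadcast from rank 0, a neighboring rank takes over the partition of the "failing" rank, and the communicator is then shrunk. The partition size, the message tag, and the takeover policy are illustrative assumptions and not part of any real prediction interface.

```c
/* Minimal MPI sketch of data migration without a spare node. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define PART_SIZE 1024   /* elements owned by each rank (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *part = malloc(PART_SIZE * sizeof(double));
    for (int i = 0; i < PART_SIZE; i++) part[i] = rank;   /* stand-in payload */

    /* ... normal SPMD computation would run here ... */

    /* The "predictor" (rank 0) announces the failing rank to all instances. */
    int failing = (size > 1) ? size - 1 : -1;             /* simulated prediction */
    MPI_Bcast(&failing, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (failing >= 0) {
        int takeover = (failing + size - 1) % size;       /* neighbor takes the data */
        if (rank == failing) {
            MPI_Send(part, PART_SIZE, MPI_DOUBLE, takeover, 0, MPI_COMM_WORLD);
        } else if (rank == takeover) {
            double *extra = malloc(PART_SIZE * sizeof(double));
            MPI_Recv(extra, PART_SIZE, MPI_DOUBLE, failing, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* the application would now merge `extra` into its own partition and
               later trigger a global repartitioning to restore load balance */
            free(extra);
        }

        /* Shrink the communicator: the failing rank is excluded and leaves. */
        MPI_Comm survivors;
        MPI_Comm_split(MPI_COMM_WORLD, rank == failing ? MPI_UNDEFINED : 0,
                       rank, &survivors);
        if (rank == failing) { free(part); MPI_Finalize(); return 0; }

        int remaining;
        MPI_Comm_size(survivors, &remaining);
        printf("rank %d continues with %d surviving instances\n", rank, remaining);
        /* ... computation continues on `survivors` with rebalanced data ... */
        MPI_Comm_free(&survivors);
    }

    free(part);
    MPI_Finalize();
    return 0;
}
```

In a real setting, merging the received partition would be followed by the global redistribution step discussed above to restore load balance among the surviving instances.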

Advantages and Disadvantages of Data Migration

As we can see from Figure 5.2 and Figure 5.3, the concept of data migration requires no system support apart from the fault predictor. The user application can fully achieve a data migration on its own. This is the most significant advantage of data migration over other, system-based migration approaches. Moreover, as only application data is communicated and not system-level state information, a lower overhead is expected.

The disadvantages of data migration also stem from the required application support. As the application itself needs to react to fault predictions, an existing application has to be adapted to be able to perform this task. However, this is an acceptable task, because good parallel applications usually use modular data abstractions and communication routines; the required adaptation effort is expected to be small. Like any other migration-based method, data migration relies on fault prediction: if no prediction is available, its effectiveness depends on the reconstructability of the data residing on the failing node.

As mentioned above, load rebalancing is a crucial task enabling practical use of data migration. For clarity’s sake, the key concepts in organizing the data of typical distributed parallel applications are introduced and explained in the following section.

5.2. Data Organization of Parallel Applications

In this section, we investigate the typical ways of organizing data in parallel applications. The scope of data migration in this work is limited to parallel applications based on the Single Program Multiple Data (SPMD) model, because it is a widely used programming model for parallel applications in a BJPCS, most prominently in HPC systems. To understand the different states of a data structure in an application, we briefly recap the basic concepts using the example of a program with four instances on two different nodes. Figure 5.4 highlights this example, in which two independent program instances run on each node. In this example, there are three different data structures: DS0, DS1 and DS2. The distribution of a data structure is called a partitioning in this dissertation. Each part of the data (that is, a subset of the total dataset) is called a partition. As shown in Figure 5.4, the location of data differs for the three data structures in this example application. A specific subroutine calculates the location of data in the parallel application. We call this subroutine a partitioner in this work. A partitioner can be static, where the partitioning cannot be changed after its creation. It can also be dynamic, where the partitioning can change as required. A typical example of changing an existing partitioning is load balancing: if the workload is no longer balanced among the participating program instances, the current partitioning has to be adapted to restore the balance. In this work, we call this process of adapting or modifying an existing partitioning repartitioning. In addition to load balancing, a repartitioning can also be triggered by data migration, because existing data needs to be redistributed to other program instances. The relation between partition, partitioning, partitioner and repartitioning is shown in Figure 5.5.


Figure 5.4.: Example for Data Distribution in a Parallel Distributed Application. DS stands for “Data Structure”.

Figure 5.5.: Relation between Partitioner, Partitioning, Partition, and Repartitioning


Common Partitionings

There are three different basic partitionings in the example shown in Figure 5.4. The native location (ownership) of these data structures is visualized in different colors: yellow for program instance 0, purple for program instance 1, green for program instance 2, and white for program instance 3. The DS2 data structure (dark blue) is replicated and held by all program instances. The DS0 and DS1 data structures are distributed. DS1 is additionally replicated on various program instances locally. These three basic partitionings are widely used by parallel and distributed applications. We have summarized these partitionings as follows:

• Exclusive Partitioning: A data structure is in an exclusive partitioning if each program instance holds a mutually exclusive range of the data. The intersection between any two partitions is empty. This partitioning is usually used for kernels where the data of other program instances is not required. An example of this partitioning is found in a matrix-vector multiplication application solving A · x = y, where A is the input matrix and x is the input vector. If the partitioner is based on row-wise slicing of the matrix A, the resulting partitioning for A and the result vector y is an exclusive partitioning (a sketch of such a row-wise partitioner is given after this list). In Figure 5.4, DS0 is in an exclusive partitioning and divided into the partitions DS0-[0...3].

• Replicated Partitioning: A data structure is in a replicated partitioning if each participating program instance holds a copy of the data. The data is replicated on all program instances. In the example of matrix-vector multiplication, the input vector x needs to be in replicated partitioning because it has to be accessed by all program instances. In Figure 5.4, the data structure DS2 is in replicated partitioning.

• Shared Partitioning: Shared partitioning is used for data structures whose partitions are required by several program instances. In parallel applications, since the calculation is executed independently on different program instances, a specific data range can be used by multiple program instances. A reduction is required to combine multiple partitions into one for data structures in shared partitioning. A simple example of shared partitioning is the Halo exchange pattern, in which access to the data of the immediate neighbor instances is required. The borders of the neighboring partitions must be known by each program instance so that the calculation can be performed. After each iteration, these borders are communicated and updated. In Figure 5.4, DS1 is in shared partitioning (DS1-[0...3]); each program instance additionally holds a copy of its neighbor partitions.
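To illustrate the exclusive and replicated cases, the following C sketch computes the row range owned by one program instance under a row-wise block partitioning of A and y, while x is simply kept as a full copy on every instance. The helper is generic example code written for this dissertation text, not part of any particular library.

/* Row-wise block partitioning of n rows among 'size' program instances.
 * Instance 'rank' owns rows [*start, *end); rows that do not divide
 * evenly are spread over the first (n % size) instances. */
void block_partition(long n, int size, int rank, long *start, long *end)
{
    long base = n / size;
    long rest = n % size;
    *start = rank * base + (rank < rest ? rank : rest);
    *end   = *start + base + (rank < rest ? 1 : 0);
}

/* The input vector x is in replicated partitioning: every instance holds
 * all n entries, so no ownership calculation is needed for it. */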


A variety of further partitionings can be derived from these three basic partitionings. For example, a single partitioning, where only one program instance holds a specific data structure, is a special case of the exclusive partitioning. The concept of partitioning and repartitioning is the most important foundation for data-migration-based fault tolerance, as data migration is simply a (corner) use case of repartitioning.

Repartitioning Strategies

Two different repartitioning strategies can be used for restoring the balance and optimizing for the amount of data transfer:

• Incremental Repartitioning: The data from the failing node is equally redistributed among the remaining nodes. In the ideal case, only a minimum amount of data (that is, the data on the failing node) is transferred, which results in minimum overhead at migration time. This scheme is illustrated in Figure 5.6. However, this approach can only be applied if the application kernels support execution on non-consecutive data ranges. Furthermore, certain application classes will incur higher communication overhead afterwards. For example, a Jacobi kernel [FH60] with Halo exchange will certainly have a higher communication demand after a data migration based on the incremental repartitioning scheme, because the data from the failing node is scattered across several remaining nodes after the migration.

• Global Repartitioning: All application data is redistributed across all non-failing nodes. A new partitioning is calculated globally at migration time, excluding the failing node.

Figure 5.6.: An Incremental Repartitioning Scheme


Figure 5.7.: A Global Repartitioning Scheme

The application data is then transferred to its corresponding new destination. This scheme guarantees continuous data ranges on each remaining partition. Figure 5.7 visualizes this scheme. The overhead at migration time is high, as a large amount of data needs to be transferred across multiple nodes. The advantage of this approach is compatibility: when properly executed, the remaining, non-failing nodes can continue to operate as if the failing node had never participated in the parallel processing. Furthermore, communication optimizations at runtime are easily achieved, as the application can utilize its original communication code without modification. The data migration does not affect the performance of the previously mentioned Jacobi kernel example either (a sketch of recomputing a block partitioning while excluding the failing rank is given after this list).
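The global strategy amounts to re-running the same kind of block partitioner over the surviving instances only. The following C sketch (generic example code, using the same conventions as the partitioner sketch in Section 5.2) computes the new range of a surviving rank after excluding the failed rank; data outside the old range must then be received, and data no longer owned must be sent.

/* Global repartitioning: recompute a block partitioning of n elements over
 * the surviving ranks, excluding 'failed_rank'. 'rank' and 'size' use the
 * original numbering; the failed rank is assigned an empty range. */
void repartition_excluding(long n, int size, int rank, int failed_rank,
                           long *start, long *end)
{
    int survivors = size - 1;
    if (rank == failed_rank) {
        *start = *end = 0;
        return;
    }
    /* Position of this rank among the survivors. */
    int pos = (rank < failed_rank) ? rank : rank - 1;

    long base = n / survivors;
    long rest = n % survivors;
    *start = pos * base + (pos < rest ? pos : rest);
    *end   = *start + base + (pos < rest ? 1 : 0);
}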

Selection of Repartitioning Strategy

The selection of a repartitioning strategy depends highly on the data partitioning, the application kernel support, the access pattern (read/write), the importance of data preservation, and the amount of data to be redistributed. It is the user's responsibility to select the most efficient and effective repartitioning strategy. Some examples are given in Table 5.1. However, some general guidelines apply: A replicated partitioning does not require repartitioning at all, because all program instances replicate the data. A shared partitioning usually works better with global repartitioning in order to reduce the communication overhead at runtime. For an exclusive partitioning, the repartitioning strategy depends on the ability of the application kernel to process non-continuous data ranges. If the data structure is not required to be persistent (e.g., a local working vector that is accessed and updated regularly in every iteration), then no repartitioning is required.


Table 5.1.: Examples for Selecting Repartitioning Strategy

Partitioning | Access Pattern | Persistence | Kernel                | Repartitioning Strategy
exclusive    | any            | yes         | continuous data range | global repartitioning
exclusive    | any            | yes         | any data ranges       | incremental repartitioning
replicated   | any            | any         | any                   | not required
shared       | any            | yes         | any                   | global repartitioning
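The guidelines of Table 5.1 can also be written down as a small decision helper. The following C sketch is purely illustrative; the enum names and the function are assumptions made for this example.

typedef enum { PART_EXCLUSIVE, PART_REPLICATED, PART_SHARED } Partitioning;
typedef enum { REPART_NONE, REPART_INCREMENTAL, REPART_GLOBAL } Strategy;

/* Select a repartitioning strategy following Table 5.1.
 * 'persistent': the data must survive the migration.
 * 'kernel_handles_gaps': the kernel can execute on non-continuous ranges. */
Strategy select_strategy(Partitioning p, int persistent, int kernel_handles_gaps)
{
    if (!persistent || p == PART_REPLICATED)
        return REPART_NONE;                 /* nothing needs to be moved */
    if (p == PART_SHARED)
        return REPART_GLOBAL;               /* keeps runtime communication low */
    /* exclusive partitioning */
    return kernel_handles_gaps ? REPART_INCREMENTAL : REPART_GLOBAL;
}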

Repartitioning and Data Migration

Ultimately, repartitioning and data migration are similar processes. The target partitioning after a data migration can be calculated by re-executing the partitioner while excluding the failing program instances. The only difference between repartitioning and data migration is the abstraction level at which they operate. For a given data structure, there are two abstraction levels: the index space and the data itself. In the simple example of an array of length n, identified by a pointer p, the data is the content stored in memory at the pointer location. The index space is the collection of indices [0...(n − 1)] of the data structure. To access the data in a specific cell of the array, its index is used as an offset to the pointer. This way, any data in a data structure can be addressed directly. It is the responsibility of the partitioner in a parallel application to divide the index space into partitions, forming a partitioning. Following this partitioning, the actual data can be chunked into data slices and transferred to their designated locations. Figure 5.8 illustrates the relation between partition, data slice and partitioning for the example of a 1-dimensional array. At the abstraction level of the index space, a repartitioning can be performed to free a failing program instance from any partition. Its corresponding data slice can then be transferred to its new location according to the new partitioning.

Figure 5.8.: Relation between Index Space and Data Slice

While the repartitioning process operates on the index space of a data structure, the process of data migration works on the actual data in memory. Ultimately, a data migration is based on the new partitioning produced by the repartitioning (see the sketch below).
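A minimal C fragment makes the two abstraction levels concrete: the partition is a range over the index space, and the corresponding data slice is obtained from the base pointer plus an offset. In practice the slice would be sent to another node; here a plain memcpy into a destination buffer stands in for that transfer.

#include <string.h>

/* A partition is a half-open range [start, end) over the index space. */
typedef struct { long start; long end; } Partition;

/* Copy the data slice described by partition 'p' of the array 'data'
 * (element size 'elem' bytes) into 'dst'. Repartitioning only changes 'p';
 * data migration moves the bytes addressed by it. */
void copy_slice(void *dst, const char *data, size_t elem, Partition p)
{
    const char *slice = data + (size_t)p.start * elem;   /* pointer + offset */
    size_t len = (size_t)(p.end - p.start) * elem;       /* slice length     */
    memcpy(dst, slice, len);
}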

In the remainder of this work, the terms partition and partitioning are exclusively used to refer to the index space of a data structure. The term data slice is used to refer to the actual data in memory. In order to perform data migration, data consistency must be preserved. In the next section, the data consistency model and process synchronization for data migration are discussed.

5.3. Data Consistency and Synchronization of Processes

In the previous section, we focused on the data distribution and organization of parallel applications. In this section, we focus on data consistency within a (potentially heterogeneous) node and its effect on data migration. Figure 5.9 shows an example setup with two heterogeneous nodes, each equipped with a single (multicore) CPU and two GPUs. These nodes feature a high-speed network interface (e.g., InfiniBand), which is connected to a high-speed interconnect. Representing state-of-the-art technology, both the network card and the GPUs are capable of performing Direct Memory Access (DMA). This allows peer-to-peer access to GPU memory with modern protocols such as GPUDirect [RT+15]. In addition, the GPUs are connected via a high-speed near-range communication link, such as NVLink [FD17], further speeding up the peer-to-peer connection of GPUs within a node. All devices on the node are connected to the CPU using a Peripheral Component Interconnect Express (PCIe) [BAS04] bus. Furthermore, in this example, a parallel application which supports execution on both CPU and GPU (often known as a hybrid application) works with two data structures: DS0 and DS1. The application data is partitioned in an exclusive partitioning (cf. Section 5.2). The application is started on the CPU and transfers part of its data to the GPUs for accelerated processing. It is written in line with the SPMD model. Moreover, it can benefit from all the hardware features and perform peer-to-peer data transfers without involving the CPU. This means that data on the GPU is only transferred back to main memory if it has to be processed by the CPU. This is the standard approach for hybrid applications nowadays. The effects of caches are not discussed, as their consistency with main memory is ensured by the hardware. For data migration, data stored in persistent storage – the globally distributed file system – does not require migration. The activity diagram of this example hybrid application is illustrated in Figure 5.10.


Figure 5.9.: Scheme of Data Distribution in a Dual-node System

In this application, one GPU-only synchronization is performed for data communication. Furthermore, part of the data is calculated on the CPU in parallel. In the activity diagram in Figure 5.10, the yellow and blue colors indicate activities performed in parallel on Node 0 and Node 1. The grey area indicates a non-parallel region that requires synchronization between Nodes 0 and 1. The red boxed region stands for activities on the GPUs without any CPU involvement.

Figure 5.10.: Activity Diagram of an Example Hybrid Application

From this example, we can see that the data in main memory is not consistent after the data transfer to the device (GPU) memory. It remains inconsistent until the last global synchronization occurs. However, in order to perform a data migration (e.g., from Node 0 to Node 1), a globally consistent data state in the main memory of all nodes must be reached.

As discussed in Section 5.2, performing a data migration requires a repartitioning in order to obtain a new valid partitioning that excludes all failing application instances. The new partitioning often requires moving large data blocks globally. If the data is inconsistent (i.e., if different application instances are in different iterations), it cannot be moved safely. Furthermore, as the movement of data is usually initiated by the CPU, data on the device must be copied back to main memory first. Although peer-to-peer data migration between devices is possible, the implementation of such migration kernels is complicated and inefficient, as a global redistribution of data may change the workload assignment between CPU and GPU. Therefore, data consistency is a prerequisite for any data migration. In our example application, a migration can only be performed at the very beginning or at the very end, which is not the desired use of data migration. Fortunately, most real-world applications require global synchronization more often, so the data consistency condition for data migration is met in more phases throughout the application.

In a standard Batch Job Processing Clustering System (BJPCS), the operating system does not have sufficient information regarding the location and distribution of the application data. Consequently, an automatic transfer from device memory to host memory is not possible without the support of the application. Furthermore, most GPU kernels cannot be interrupted. For these reasons, data migration can only be achieved with proper support from the application. The application must be modified so that it can react to outside information regarding upcoming failures. Moreover, it has to perform the required data migration at a suitable point in time.

5.4. Summary: The Concept of Data Migration

In this chapter, we have introduced the basic concept of data migration, which is a fail-operational fault tolerance strategy based on proactively migrating the data from a failing node to another location. Data migration is also an application-integrated fault tolerance strategy because it requires the application to react to system-level information regarding an upcoming event. It can be performed with or without a spare node. The main advantage of data migration over other migration methods is its reduced overhead: it does not require a full migration of all process or system-level information. Since the application controls the migration process, the overhead of data migration can be further reduced by selectively migrating only the critical data.


Figure 5.11.: Activity Diagram of the Data Migration Process

Current failure prediction systems can cover a large variety of upcoming hardware failures, but they also have a high or undeterminable false alarm rate, and the resulting migration overhead significantly impacts the usefulness of migration-based fault tolerance methods. With its lower overhead, data migration helps reduce the cost created by a false positive prediction; a redundant data migration can be reverted with less impact on the program execution. Moreover, a data migration can be achieved by repartitioning the data structures and redistributing the data. This means that it can be seen as a corner case of a dynamic load balancing operation, which is already supported by many existing parallel applications. Finally, data migration is a user-level-only action which does not rely on hardware, the OS, or system-level libraries. However, data migration also has drawbacks. Due to its application-integrated nature, an existing application has to be adapted to support data migration. Data migration can only be performed in a state in which data consistency is ensured by the application, which means that additional synchronization might be required prior to any data migration. Finally, like any other migration-based fault tolerance strategy, data migration alone cannot handle random errors. Although the scope of our concept currently covers SPMD-based parallel applications, the concept of data migration can also be used in other kinds of distributed and parallel applications. For example, a master-slave application can achieve data migration by selecting a new master or redistributing the workload of the slaves.

The conceptual process of data migration is modeled in Figure 5.11. The required steps and conditions are summarized as follows (a schematic code sketch follows the list):

1. Assume a (parallel) application is running in a normal state. It periodically checks on external information as to whether a failure has been predicted.

2. On a predicted upcoming failure, the application chooses the next possible timepoint to react to the prediction.


3. If kernel execution is performed on an accelerator (such as a GPU), the data is copied back to the main memory.

4. All the process instances are synchronized. The application ensures a consistent data state.

5. A new partitioning of all the data is calculated by repartitioning that excludes the failing nodes.

6. All the data is redistributed according to the new partitioning.

7. The process is synchronized again to ensure data consistency after repartitioning.

8. The data migration is now complete. The application continues normal execution.
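The steps above can be sketched as an application-side loop in C. All helper functions below are hypothetical placeholders for the predictor query, the device copy-back, the partitioner, and the redistribution; only the ordering and the synchronization points are meant to be illustrative.

#include <mpi.h>

/* Hypothetical application and predictor hooks (declarations only). */
int  work_left(void);
void compute_one_iteration(void);
int  failure_predicted(void);                      /* external prediction flag  */
int  data_on_device(void);                         /* is data still on the GPU? */
void copy_device_to_host(void);
void recompute_partitioning_excluding_failed(void);
void redistribute_data(void);

/* Main loop with a data migration hook (steps 1-8 above). */
void run(MPI_Comm comm)
{
    while (work_left()) {
        compute_one_iteration();                       /* step 1: normal execution */

        if (failure_predicted()) {                     /* step 2: react at the next safe point */
            if (data_on_device())
                copy_device_to_host();                 /* step 3: device -> main memory */
            MPI_Barrier(comm);                         /* step 4: consistent data state */
            recompute_partitioning_excluding_failed(); /* step 5: repartition */
            redistribute_data();                       /* step 6: move data slices */
            MPI_Barrier(comm);                         /* step 7: synchronize again */
        }                                              /* step 8: continue normally */
    }
}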

In this chapter, we have also identified several important design quality factors of data migration, as well as its limitations. These are:

• Data migration can be seen as a corner case for any load balancing operation.

• The impact of data migration on performance depends heavily on load balancing after migration.

• Data migration can only be performed in a consistent data state.

• Data migration cannot be performed easily on device memory on a peer-to-peer basis.

In summary, data migration is a versatile tool for achieving proactive fault tolerance on an application-integrated basis. In order to demonstrate the usefulness of data migration, a user-level library, called LAIK, has been developed to assist parallel application programmers in supporting data migration more easily. We are going to introduce and evaluate the LAIK library in the next chapter.

6. LAIK: An Application-integrated Index-space Based Abstraction Library

In the last chapter, we discussed the concept of data migration as a fault tolerance strategy, along with its advantages and disadvantages. The requirements and constraints of data migration were also introduced and explained. Furthermore, we identified that data migration can be treated as a corner case of dynamic load balancing by excluding the failing nodes from the parallel application.

In this chapter, based on this knowledge, we focus on a library prototype for assisting data migration in parallel applications, called Lightweight Application-Integrated Fault-Tolerant Data Container (LAIK). LAIK stands for “Leichtgewichtige AnwendungsIntegrierte DatenhaltungsKomponente” (translation: Lightweight Application-integrated Data Container). The basic idea of LAIK is to provide a lightweight library that helps HPC programmers to achieve (proactive) fault tolerance based on the concept of data migration. The library includes a set of Application Programming Interface (API)s to assist programmers in dynamically partitioning and repartitioning data. By handing over the responsibility for data partitioning, LAIK can calculate and recalculate the best partitioning option according to the current workload and hardware situation. If a prediction regarding an upcoming failure is made, LAIK can take that information into account and repartition accordingly. The user can then adapt the data distribution and free the failing node from any application data. The concepts and structure of LAIK are discussed in detail in the following. Furthermore, the performance evaluation of LAIK confirms its usefulness for fault tolerance. Finally, the performance impact is presented in order to show the low overhead introduced by LAIK.

6.1. The LAIK Library

As introduced in Section 2.3.1, Single Program Multiple Data (SPMD) is a widely used programming model to exploit data parallelism in HPC systems. The basic idea of SPMD is to apply the same kernels (calculations) to a portion of data multiple times in parallel on different Processing Unit (PU)s. For an SPMD-based application, the index space of the user data is usually partitioned into a specific, user-defined partitioning.


The data slices are distributed across the different program instances according to this partitioning. Each instance of the parallel application computes on its part of the data – this rule is commonly known as “owner computes”. The user-defined partitionings should support the efficient execution of the parallel application. It is the programmer's responsibility to provide a proper implementation of a partitioner by utilizing his or her knowledge of the communication pattern, the compute kernel, and the importance of load balancing.

The LAIK library (in the following: LAIK) makes use of this model and assists applications written in the SPMD model with data migration by taking over the control of partitioning from the user application. This way, LAIK can expose system-level information regarding a predicted upcoming fault to the user application by adapting the partitioning and excluding failing nodes. Accordingly, an application can react proactively to the prediction by adopting the new partitioning.

Furthermore, LAIK can take over the responsibility for repartitioning and redistributing the data if the programmer transfers the responsibility for the application data to LAIK. The programmer specifies a partitioner to tell LAIK how data is distributed across the different program instances. This way, the programmer only needs to specify when data migration is allowed; the actual process of repartitioning and data migration is completed automatically by LAIK. For different purposes and different layers of abstraction (index space or data space), LAIK provides different APIs. The layered design of LAIK will be introduced and presented later in this chapter.

By transferring the responsibility for data distribution to LAIK, implicit communication can be achieved. The programmer only needs to specify which part of the data is required and when. With LAIK, this is achieved by providing multiple partitionings that are bound to the same data structure. By changing the currently active partitioning – a process called “switch” – data can be redistributed across the available program instances, resulting in implicit communication. Another advantage of automatic data distribution and redistribution is that load balancing can be ensured throughout program execution. With ever larger HPC systems emerging, load balancing becomes a crucial task for ensuring scalability. For traditional HPC applications, adaptive load balancing is usually implemented explicitly. This requires significant application-specific engineering and implementation work, leading to high code complexity, which eventually causes maintainability issues. With LAIK, this responsibility can also be fully transferred to LAIK to ensure high modularity and low complexity of the user code.

This section of our work focuses on the design and interfaces of LAIK in detail. Besides the basic feature of providing fault tolerance based on data migration, other highlights of LAIK include the following:


• LAIK is modularized. Every component layer and its corresponding APIs can be changed, without affecting other LAIK components.

• LAIK is incremental. Different layers of APIs can be used at different levels of abstraction, allowing applications to be adapted incrementally. Different features can be achieved with different layers of APIs. This means that an application programmer can transform his or her existing application on a step-by-step basis.

• LAIK is lightweight. Unlike related programming models such as Charm++ [KK93] and Legion [Bau+12], and task-based programming models such as OmpSs [Dur+11] and StarPU [Aug+11], LAIK requires neither a specialized compiler nor a runtime system, resulting in low performance overhead by design and low system complexity.

LAIK is provided as open source software and can be obtained from Github1.

6.1.1. Basic Concept of LAIK

Figure 6.1.: Basic Components of the LAIK Library

Figure 6.1 schematically illustrates the role of LAIK and its interaction with the user application. Technically, LAIK is a C library which runs completely in user space within the application process. Each application (process) instance also contains a LAIK instance, as shown in Figure 6.2.

1https://github.com/envelope-project/laik, accessed in July 2019


Figure 6.2.: Location of LAIK Library on a Multinode Setup

LAIK provides different API layers for controlling either the process groups only, the index space (partitioning), or both the index space and the actual data slices (cf. Section 5.2). The LAIK library is designed to use existing communication backends, such as the Message Passing Interface (MPI) or the Transport Control Protocol (TCP). It is designed to be neither a runtime system nor a communication library, but a dedicated library which helps programmers control the partitioning of data at different points during program execution. This enables LAIK to be lightweight with little to no performance overhead. In a basic setup, LAIK changes neither an application's kernel execution nor its communication. However, by transferring the responsibility for the data slices to LAIK, communication can be fully abstracted by LAIK. In this case, communication is performed implicitly instead of being programmed explicitly with communication routines. An advantage of implicit communication with LAIK is the ease of transitioning to other communication backends. Information from the system environment, such as a predicted upcoming failure, is passed to LAIK by a so-called “agent”, which is a type of adapter that intercepts information from the operating system or the runtime system. This way, fault tolerance through data migration can be achieved. In order to support different communication backends as well as different agents, APIs for backends and external information are provided. Furthermore, two layers of user-level APIs, the index space API and the data container API, allow operations at different abstraction levels. The detailed architecture of LAIK and its APIs are explained in the following subsection.


6.1.2. Architecture of LAIK

Figure 6.3.: Component Diagram of LAIK

As already mentioned, LAIK is designed to be modular with different components providing different sets of functionalities. There are six major components in LAIK. Their relations and interfaces are presented in Figure 6.3 above.

An overview of the components shown in Figure 6.3 is given in Table 6.1. Detailed descriptions of the functionalities of these components are given below.

• Process Group Component (Figure 6.3 light red): The process group component of LAIK takes care of the size and assignment of process numbers, as well as their mappings to the PUs (usually processor cores). LAIK may also make use of the other components, such as the communication backend and external agents to optimally calculate group information. This information is stored in a Laik_Group object and provided to the application and other LAIK components using the Group API.


Table 6.1.: Overview of LAIK Components and Their APIs

Component             | API                  | Usage
Process Group         | Group API            | Creating and managing process groups in LAIK.
Index Space           | Space API            | Creating and managing index space partitionings in LAIK.
Index Space           | Partitioner API      | Providing callbacks for partitioners to LAIK.
Data Container        | Data API             | Creating and managing data structures (and storage) in LAIK.
Data Container        | Layout API           | Providing callback functions for packing and unpacking data to LAIK.
Data Container        | Types API            | Creating and managing user-defined data types in LAIK.
Communication Backend | Communication Driver | Providing the communication interface for LAIK, such as MPI.
External Agents       | Agent Driver         | Providing runtime and system information providers (agents) to LAIK.
Utilities             | Utility API          | Providing debugging and profiling functionalities.


In our current prototype, LAIK groups are static, which means a Laik_Group object is immutable. Any changes to a group (such as information regarding a failing node) create a new group object. Programmers are advised to explicitly free group handles that are no longer in use.

• Index Space Component (Figure 6.3 light blue): The index space component of LAIK is responsible for handling partitioning and repartitioning over abstract index space. As explained in Section 5.2, the index space is used to identify a specific data offset within a data structure. In SPMD-based applications, the entire index space is usually split into partitions according to a given partitioning scheme. In LAIK, the calculation of partitionings is done by the index space component, which can be based on either a generic or a user-provided partitioner. The index space component provides the Space API to the user application so that it can operate on partitioning and its related information.

• Data Container Component (Figure 6.3 light green): The data container is the component of LAIK which operates on the data slices of the application data structures by providing memory management functionalities. It features the Data API, which is ultimately based on an allocator interface that utilizes the partitionings calculated by the index space component. By using the Data API, application programmers can obtain managed memory space for storing the data slices. Furthermore, if the application data is managed by LAIK, communication is triggered implicitly when switching a data container from one partitioning to another. The data container component includes two user-provided callback interfaces: the Layout API and the Types API. The layout API is used to obtain a user-provided dense representation of a given data structure, thus increasing communication efficiency by packing data slices into this dense format. The types API is used to register user-specified data types in addition to the primitive data types.

• Communication Backend (Figure 6.3 grey, bottom): The communication backend component is responsible for handling the communication resulting from any LAIK operations, such as switching between different partitionings of a data container. Furthermore, it provides LAIK with environmental information from the communication library, such as the location of process instances. The communication component is an internal-only component; therefore, its API, the Communication API, cannot be accessed by the user application. A communication backend includes a communication driver for a specific communication technology, such as MPI, the Transport Control Protocol (TCP), or shared memory (cf. Figure 6.1 and Figure 6.2).


• External Agents (Figure 6.3 grey, top): The external agents component is responsi- ble for handling information that comes from “external” locations, such as the runtime system, the Operating System (OS) or an external operator. For fault tolerance purposes, a fault predictor can be attached to LAIK as an external agent. External agents are loaded as dynamic libraries at runtime. Each LAIK instance can load multiple external agents. There is no predefined rule on what an external agent can do. Consequently, it is the user’s responsibility to provide the implementation and desired functionality.

• Utilities (Figure 6.3 light yellow): The utilities component provides a Utility API with basic functionalities such as comprehensive logging, profiling, and the recording of program execution information (e.g., the current iteration and phase). It also feeds the process group API with crucial information regarding fault prediction. Furthermore, the utility API provides the most basic functionalities of LAIK, such as initialization and deinitialization.

Among all these components, the utility, process group, index space, and data container components provide functionalities to the application programmer. The external agents and communication backend components are used by LAIK to access system resources and runtime information. The utility and process group components together provide the most basic functionalities of LAIK. For this reason, we collectively refer to them as the LAIK Core component.

6.1.3. Overview of LAIK APIs

In the last section, we discussed the major components of LAIK. With these components, LAIK provides three different types of APIs (see also Table 6.2):

1. User APIs: The user APIs provide the functionalities of LAIK that are required by application programmers. There are four sets of user APIs: Utility, Process Group, Index Space, and Data Container. Except for the Utility API, which provides the most basic functionalities, these APIs are designed to assist data migration at different abstraction levels.

2. Callback APIs: The callback APIs are used by LAIK to gather information and functionality from the programmer or from a third-party library. Callback APIs in LAIK include Agent Driver, Communication Driver, Partitioner, Data Types, and Data Layout.

3. Internal APIs: The internal APIs are currently designed to be used neither by the application programmers nor by any third-party library.


They are designed to support LAIK's modularity and the ease of interchanging components in future development.

Figure 6.4.: API Layers of LAIK for Incremental Porting

With regard to the user APIs, three sets of user APIs can be used by the application: Group, Space, and Data. This design is intended to support incremental porting of existing applications, which can make use of different API layers, in order to gain increasing functionality. This way, the user can adapt different functionalities in a step-by-step manner, reducing the overhead and turn-around time for porting existing applications. The functionalities of each layer of API are listed in Table 6.2.


Table 6.2.: Overview of LAIK API Layers

Abbreviation | API Layer      | Operand                                 | Functionality
Group        | Process Group  | Process Instances                       | Provides basic information on the current process instances. Information about the process location and identification, the size (total number of participating processes) of the parallel application, and upcoming failures is provided by this interface.
Space        | Index Space    | Index Space of Data Structures          | Provides operations and information on the index space (partitioning). Calculates a valid partitioning and provides repartitioning functionality according to a change in a process group, such as in failure cases. Calculates the difference between two partitionings (a so-called Action Sequence) to advise the user on communicating efficiently.
Data         | Data Container | Data Slices (Memory) of Data Structures | Provides operations and information on data slices (memory), with an allocator interface to allocate storage space in memory for a given partitioning. Provides automatic data and communication management functionalities, performing the action sequences (communication) on the switch between different partitionings for a given data structure.

In the following, we will introduce these different layers of APIs by looking at the example of porting a matrix-vector multiplication application to LAIK using the different API layers. For this example, let A be the matrix, which is stored in a 2D array, and let x be the vector, which is stored in a 1D array. The calculation for the matrix-vector multiplication is shown in Equation 6.1. The primitive implementation of an MPI-based matrix-vector multiplication is given in Algorithm 1.

A · x = y (6.1)


Algorithm 1: MPI-based Matrix Vector Multiplication

A: The Input Matrix
x: The Input Vector
y: Result of Matrix-Vector-Multiplication

MPI_init();
(myStart, myEnd) ← partitionMatrix(A, MPI_size(COMM_WORLD), MPI_rank(COMM_WORLD));
for i ← myStart to myEnd do
    y[i] ← calculateDotProduct(A[i], x);
end
MPI_Allreduce(y);
if MPI_rank(COMM_WORLD) == 0 then
    returnResult(y);
end
MPI_Finalize();
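For reference, a compact C rendering of Algorithm 1 could look as follows. It assumes the n×n matrix A (row-major) and the vector x are replicated on every rank, uses the same row-wise block partitioning, and combines the partial results with MPI_Allreduce; input handling and error checking are omitted.

#include <mpi.h>

/* Row-wise block-parallel matrix-vector multiplication y = A * x.
 * A is n*n in row-major order; A, x and y are replicated on all ranks. */
void matvec(const double *A, const double *x, double *y, long n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    long base = n / size, rest = n % size;
    long start = rank * base + (rank < rest ? rank : rest);
    long end   = start + base + (rank < rest ? 1 : 0);

    for (long i = 0; i < n; i++)          /* zero y so the sum reduction is correct */
        y[i] = 0.0;
    for (long i = start; i < end; i++) {
        double sum = 0.0;
        for (long j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
    /* Combine the partial result vectors on all ranks. */
    MPI_Allreduce(MPI_IN_PLACE, y, (int)n, MPI_DOUBLE, MPI_SUM, comm);
}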

6.1.4. User API: The Process Group API Layer

Figure 6.5.: Schematic of Different LAIK Groups

Similar to MPI groups [WD96], LAIK process groups (object type Laik_Group, also called LAIK groups) are collections of process instances. Figure 6.6 illustrates the potential distribution of process groups in an example parallel application. The LAIK group world is the default process group that contains all process instances of a LAIK application. Further LAIK groups can be derived from existing LAIK groups by subsetting them. Different LAIK groups are showcased in Figure 6.5.

In the prototype we have implemented for LAIK, the LAIK process groups are static, which means they cannot be changed after creation. The advantage of static process groups lies in their simple semantics. Furthermore, this concept is very similar to the concept of MPI groups, allowing an easier transition of MPI applications to LAIK. However, static groups come with the downside of having many group handles, which may become orphaned and cause high memory overhead. For fault tolerance, LAIK provides the function Laik_get_failed() to enable the application to obtain a list of failing process instances within the world group. This information can be used by the application to create a subgroup of the initial world group that excludes the failing instances, in order to achieve fault tolerance.
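Because LAIK groups deliberately mirror MPI groups, the effect of deriving a shrunk group that excludes a failing instance can be illustrated with plain MPI group operations; this is only the MPI analogue of the concept, not LAIK's own API.

#include <mpi.h>

/* Build a communicator that excludes one (predicted-to-fail) rank,
 * analogous to deriving a shrunk LAIK group from the "world" group. */
int make_shrunk_comm(MPI_Comm world, int failing_rank, MPI_Comm *shrunk)
{
    MPI_Group world_group, shrunk_group;
    int exclude[1] = { failing_rank };

    MPI_Comm_group(world, &world_group);
    MPI_Group_excl(world_group, 1, exclude, &shrunk_group);

    /* Collective over 'world'; ranks outside the new group get MPI_COMM_NULL. */
    int err = MPI_Comm_create(world, shrunk_group, shrunk);

    MPI_Group_free(&world_group);
    MPI_Group_free(&shrunk_group);
    return err;
}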

Process Group "World"

Process Group 2

Process Group 1

Application Application Application Application Process Process Process Process Instance Instance Instance Instance

LAIK LAIK LAIK ... LAIK Communication, e.g., MPI

Node 0 Node 1 Node N

Figure 6.6.: Schematic Overview of LAIK Process Group Concept (cf. Figure 6.2)

For the example of the matrix-vector multiplication application using only LAIK's Group API layer, the application can be handed over to LAIK as presented in Algorithm 2. The information regarding the process instances is transferred to LAIK's responsibility instead of MPI's. With LAIK's MPI backend, LAIK inherits the size and rank information from the backend, which further reduces the effort of porting existing MPI applications to LAIK. As a result, almost no changes need to be made to the application code. Note that the partitioning and communication remain the responsibility of the user code when using the Group API.


Algorithm 2: LAIK-based Matrix Vector Multiplication with Group API Layer

A: The Input Matrix
x: The Input Vector
y: Result of Matrix-Vector-Multiplication

LAIK_init();
(myStart, myEnd) ← partitionMatrix(A, LAIK_size(LAIK_WORLD), LAIK_id(LAIK_WORLD));
for i ← myStart to myEnd do
    y[i] ← calculateDotProduct(A[i], x);
end
MPI_Allreduce(y);  // The communication still stays in MPI.
// finalization
if LAIK_rank(LAIK_WORLD) == 0 then
    returnResult(y);
end
LAIK_Finalize();

6.1.5. User API: The Index Space API Layer

With the next API layer, the Index Space API, an application specifies a partitioner it intends to use and no longer needs to take care of the partitioning itself. Multiple partitionings may be required for the same data structure in different phases of the application. An example is shown in Figure 6.7.

Figure 6.7.: Schematic Overview of Two Example Partitionings for a Single Index Space

LAIK's partitioning interface is bound to the previously introduced group interface.

It uses the process group information as an input to calculate the best partitioning. The subroutine that performs the actual partitioning calculation is called a partitioner. Partitioners for LAIK are either provided directly by LAIK or specified by the application programmer utilizing the Partitioner Callback API. Out of the box, LAIK provides default partitioners for regular data structures, such as 1D/2D/3D arrays. These default partitioning algorithms are illustrated in Figure 6.8:

• Master: Only one process holds the data structure completely.

• Block: The data is distributed equally among all the instances.

• Halo: The data is distributed equally among all the instances, with each instance holding a read-only part of its neighbors.

Figure 6.8.: Overview of Different Built-In Partitioners of LAIK

If different partitionings are applied to the same data structure, the application can specify which data is needed and when. The change from one partitioning to another is called a transition in LAIK. For a transition, the communication sequence required to switch from one partitioning to the other can be calculated. This sequence of communication is called an Action Sequence in LAIK terminology. Potential actions are send/receive, broadcast, or reduction. An example of a transition and its corresponding action sequence is given in Figure 6.9. A calculated action sequence can be accessed by the application and executed either by the application itself or (with the Data APIs) by LAIK.


Figure 6.9.: Schematic Overview of a Transition
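For 1D block partitionings, the action sequence of a transition can be derived from interval intersections: every overlap between a range this instance owned in the old partitioning and a range another instance owns in the new partitioning becomes a send, and the symmetric case becomes a receive. The C sketch below is a generic illustration of this idea, not LAIK's implementation (which additionally handles reductions, broadcasts, and multi-dimensional index spaces).

#include <stdio.h>

typedef struct { long start; long end; } Range;    /* half-open [start, end) */

static Range intersect(Range a, Range b)
{
    Range r = { a.start > b.start ? a.start : b.start,
                a.end   < b.end   ? a.end   : b.end };
    if (r.end < r.start) r.end = r.start;           /* empty overlap */
    return r;
}

/* Print the send/receive actions for instance 'me' when switching from
 * 'oldp' to 'newp' (one Range per instance, 'size' instances in total). */
void print_action_sequence(const Range *oldp, const Range *newp, int size, int me)
{
    for (int other = 0; other < size; other++) {
        if (other == me) continue;
        Range s = intersect(oldp[me], newp[other]);   /* I owned it, they need it */
        if (s.end > s.start)
            printf("send [%ld,%ld) to %d\n", s.start, s.end, other);
        Range r = intersect(newp[me], oldp[other]);   /* I need it, they owned it */
        if (r.end > r.start)
            printf("recv [%ld,%ld) from %d\n", r.start, r.end, other);
    }
}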

For our Matrix-Vector multiplication example, two partitionings are created:

• Sub-Partitioning (pSub) for the matrix and the resultant vector, which is a distributed partitioning where each process instance holds a portion of the data (default block partitioning), and

• Total Partitioning(pTotal), where the data is only held by one process instance (default master partitioning).

The computation kernel is adapted so that each instance asks LAIK for its partition. LAIK does not necessarily ensure the continuity of the partitions assigned to a given process instance. As a result, an additional while-loop is added to enable the application to work on multiple non-continuous subpartitions. Finally, the application can obtain a list of communication operations that need to be performed.

Our algorithm is transformed as shown in Algorithm 3.

Algorithm 3: LAIK-based Matrix Vector Multiplication with Space API Layer

A: The Input Matrix
x: The Input Vector
y: Result of Matrix-Vector-Multiplication

LAIK_init();
// subMatrix and wholeMatrix are partitioners
// Create partitionings for the data
pSub ← LAIK_new_partitioning(A, subMatrix);
pTotal ← LAIK_new_partitioning(A, wholeMatrix);
(myStart, myEnd) ← LAIK_my_partition(pSub);
while (myStart, myEnd) != NULL do
    for i ← myStart to myEnd do
        y[i] ← calculateDotProduct(A[i], x);
    end
    (myStart, myEnd) ← LAIK_my_next_partition();
end
transition ← LAIK_calc_transition(pSub, pTotal);
foreach t in transition do
    executeTransition(t);  // This is still the responsibility of the programmer!
end
if LAIK_rank(LAIK_WORLD) == 0 then
    returnResult(y);
end
LAIK_Finalize();

6.1.6. User API: The Data Container API Layer

The LAIK Data API layer is the most advanced API layer in LAIK, operating on the application data itself and providing memory and communication management for data structures. It utilizes both the information from the process groups and the partitionings calculated by the Space API (cf. Section 6.1.5). The user specifies the data types and layouts of the data structures, and LAIK performs the memory allocation and deallocation for them. LAIK provides primitive data types – such as integers and floating point numbers – and default data layouts – such as 1D/2D/3D arrays. Customized data types and layouts can be added by utilizing the Data Type and the Data Layout callback APIs (cf. Section 6.1.7).


To access a data structure, the application calls the Laik_map() function to get the underlying pointer and range of its data slice. After each transition between different partitionings, the data pointer may become obsolete and then needs to be updated by calling the Laik_map() function again. To reduce the overhead of frequently mapping and invalidating the pointers to a data structure, LAIK provides a feature called pointer reservation. By requesting a pointer reservation, the application can instruct LAIK to keep a pointer persistent. The pointer to the data slice then remains valid until a change is allowed at a future point in time. A schematic view of this concept is given in Figure 6.10.

Figure 6.10.: Schematic Overview of Pointer Reservation

However, pointer reservation can cause higher memory overhead. Figure 6.11 showcases an example: to ensure pointer validity, the data structure is not kept dense in memory. Instead, sufficient memory space is allocated to store the data used in all the required partitionings.


Figure 6.11.: Example Memory Layout with and without Pointer Reservation

In addition to the location of data, LAIK also keeps track of the access pattern of data structures, such as read-only or read/write. Furthermore, LAIK stores information regarding the usefulness of the data at a given point in time (these periods are called access phases), such as preserve data, initialize data, or useless data. This way, LAIK can optimally calculate and minimize the communication overhead. For example, data that is not required in a subsequent access phase does not need to be transferred, thus reducing the communication demand. An example of the relation between transitions and different access patterns is given in Figure 6.12.

Figure 6.12.: Schematic Overview of Access Pattern and Transition

The action sequences calculated with the Space API can be executed in the Data API by calling the Laik_exec_transition() function, triggering communication and data exchange. This way, LAIK performs implicit communication for the application. With LAIK's Data API, communication can be achieved purely by describing the required location of data (the partitioning) and the transitions between different partitionings.


To further reduce the overhead caused by the calculation of action sequences, programmers can ask LAIK for a pre-calculated action sequence between different partitionings. Combined with the pointer reservation functionality, these preserved action sequences also remain valid until the memory layout is changed. This way, the application can keep executing the same action sequence for the same transition between two particular partitionings, unless a change in one of the partitionings occurs.

For fault tolerance, LAIK's Data API provides Laik_repartition_and_repart(), which obtains node failures from the external interface (cf. Section 6.1.8). This function calculates a new derived partitioning by using the derived, shrunk group that excludes the failing nodes. Finally, it triggers all the necessary communication operations for data migration by executing a transition from the original to the derived partitioning.

In our matrix-vector multiplication example with the Data API layer, all communication and memory management is done by LAIK. The resulting program is given in Algorithm 4. The resultant vector y becomes a LAIK data container, as does the matrix A, because both are distributed across all nodes. Initially, both y and A are assigned the pSub partitioning for parallel computation. After the kernel execution, the resultant vector is switched to the pTotal partitioning, which gathers the data on the instance with ID 0. This transition between the partitionings pSub and pTotal triggers the gather on instance ID 0 automatically.

In summary, the three different levels of user APIs allow the programmer to port an existing SPMD application to LAIK in an incremental, step-by-step manner. Data migration is possible at all API layers (cf. Table 6.2): With the Group API, the user is informed about the current process instance and group information. With the Space API, the data partitioning and repartitioning is calculated by LAIK; the user only needs to carry out the specific data operations corresponding to the change in data partitioning. With the Data API, all operations required for data migration are covered by LAIK, and the programmer only needs to tell LAIK when such an operation is allowed.


Algorithm 4: LAIK-based Matrix Vector Multiplication with Data API Layer

A: The Input Matrix
x: The Input Vector
y: Result of Matrix-Vector-Multiplication

LAIK_init();
// subMatrix and wholeMatrix are partitioners
// Create partitionings for the data
pSub ← LAIK_new_partitioning(1D, subMatrix);
pTotal ← LAIK_new_partitioning(1D, wholeMatrix);
// create data containers
matrix ← LAIK_new_data(1d, A, pSub);
y ← LAIK_new_data(1d, y, pSub);
(myStart, myEnd) ← LAIK_my_partition(pSub);
while (myStart, myEnd) != NULL do
    LAIK_get_data_pointers(matrix, y);
    for i ← myStart to myEnd do
        result[i] ← calculateDotProduct(matrix[i], x);
    end
    (myStart, myEnd) ← LAIK_my_next_partition();
end
LAIK_switch_to_partitioning(result, pTotal);
if LAIK_rank(LAIK_WORLD) == 0 then
    returnResult(y);
end
LAIK_Finalize();


6.1.7. Callback APIs

The callback APIs are interfaces that enable the user to configure LAIK in order to support customized partitionings, data types, and data layouts. There are three callback APIs:

• The partitioner API is used to allow the user to create a LAIK-compatible partitioner for data partitioning. This way, the index space component can handle arbitrary data structures that come from the user application. An example of a partitioner will be given later in this chapter.

• The data types API is used to enable the user to configure customized data types and their operations in the data container component. This is crucial for LAIK to be able to carry out automatic communication on arbitrary data structures. To register a data type, the user simply specifies its name and size, as well as two functions: one for setting the values of a referenced array of elements to the neutral element of the reduction, and another for performing an element-wise custom reduction on two arrays of values (a sketch is given after this list).

• The data layout API is used to allow LAIK to efficiently manage memory as well as communication. As user-defined data structures are usually driven by the logic of the user application, these data structures can have complex data layouts. In order to handle such layouts, the programmer has to supply LAIK with the layout information. The layout callback API is used by specifying a pack and an unpack function, which allow LAIK to serialize and deserialize a data structure.
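To illustrate the data types API, the following C sketch registers a custom complex-number type with a neutral-element initializer and an element-wise sum reduction. The type Laik_Type and the function laik_register_type are assumptions made for this illustration; only the concept (name, size, init and reduce callbacks) is taken from the description above.

#include <stddef.h>

typedef struct Laik_Type Laik_Type;     /* opaque LAIK type handle (assumed name) */
typedef struct { double re, im; } Complex;

/* Set all elements to the neutral element of the reduction (0 for a sum). */
static void complex_init_neutral(void *base, size_t count) {
    Complex *c = (Complex *) base;
    for (size_t i = 0; i < count; i++) { c[i].re = 0.0; c[i].im = 0.0; }
}

/* Element-wise custom reduction: out[i] = out[i] + in[i]. */
static void complex_reduce_sum(void *out, const void *in, size_t count) {
    Complex *o = (Complex *) out;
    const Complex *a = (const Complex *) in;
    for (size_t i = 0; i < count; i++) { o[i].re += a[i].re; o[i].im += a[i].im; }
}

static Laik_Type *tComplex;

void register_complex_type(void) {
    /* Hypothetical registration call mirroring the description above. */
    tComplex = laik_register_type("complex", sizeof(Complex),
                                  complex_init_neutral, complex_reduce_sum);
}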

A partitioner for LAIK is structured as follows: It provides LAIK with the information about which index range belongs to which partition. Figure 6.13 illustrates an example partitioning to showcase the partitioner API. In this example, a 2D index space is distributed equally row-wise among two tasks. Such a partitioning is useful for applications such as matrix multiplication. For each row, 4 elements are added to partition 1 and partition 2, respectively. The example partitioner for LAIK is given in Algorithm 5. Our example partitioner loops over the rows of this matrix and adds 4 elements (called slices) to each partition. This way, the index space is partitioned into two partitions.



Figure 6.13.: Example Partitioning for LAIK Partitioner API Demonstration

Algorithm 5: Example Partitioner of the Example Partitioning in Figure 6.13
  t0, t1: LAIK Tasks
  p: LAIK Partitioning

  for i ← 0 to 5 do
      LAIK_append_slice(p, t0, i, 4);
      LAIK_append_slice(p, t1, i+4, 4);
  end
  return p;

6.1.8. The External Interface

LAIK’s external interface is designed to communicate with external programs or routines – called agents. The agents are provided by the user according to his or her needs. Each agent is a dynamic library, which is loaded on demand at initialization time. An agent is executed synchronously within the LAIK instance’s process. Therefore, agent programmers have to implement their own asynchronous communication if required. Each LAIK agent must provide an implementation of the agent_init() function, stating its name, version, and capabilities. This function is also the main entry point for each LAIK agent. LAIK supports loading multiple agents at the same time. However, loading a large number of agents may cause significant memory and performance overhead because they run within the same scope as the LAIK instance. Currently, LAIK only supports agents for fault tolerance purposes, which feature an agent_get_failed() function reporting the identifiers of the upcoming failing nodes. The LAIK agent component provides a synchronous call Laik_get_failed(), which can be used by the

application or LAIK internals to query information on either upcoming faults or on current faults from all loaded agents. This way, LAIK exposes system and runtime level information to the application. The schematic of the LAIK external interface is given in Figure 6.14 below.
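A minimal fault-tolerance agent could look like the following C sketch, compiled as a shared library and loaded by LAIK at initialization time. The structure (an agent_init entry point and an agent_get_failed query) follows the description above; the concrete signatures, types, and capability flags are assumptions.

#include <string.h>

#define MAX_FAILING 16
static int failing_ids[MAX_FAILING];
static int num_failing = 0;

/* Called synchronously by LAIK: report the identifiers of nodes predicted to fail. */
int agent_get_failed(int *ids, int max) {
    int n = (num_failing < max) ? num_failing : max;
    memcpy(ids, failing_ids, n * sizeof(int));
    return n;
}

/* Main entry point: announce name, version, and capabilities to LAIK. */
void agent_init(const char **name, int *version, unsigned *capabilities) {
    *name = "example-failure-predictor";
    *version = 1;
    *capabilities = 0x1;   /* assumed flag: agent supports agent_get_failed() */
    /* A real agent would start its own asynchronous monitoring here, e.g.,
       polling a failure predictor, and fill failing_ids accordingly. */
}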


Figure 6.14.: Schematics for the LAIK External Interface

6.1.9. The Communication Backend Driver Interface

LAIK supports arbitrary communication backends to help its operation. Out of the box, LAIK currently provides two different backends:

• MPI backend, and

• TCP backend.

However, application programmers can provide additional LAIK backends by implementing a set of callback functions. The most important callback functions are:

• init(): This is the main entry point for initializing the backend. It is called before any application code runs and creates a LAIK instance with the given backend. It also allows the backend driver to initialize the communication library.

• finalize(): This function is used to advise the backend to clean up its resources and the communication library. It is also used to destroy the corresponding LAIK instance.


• prepare(): The prepare() function is used by the backend driver to analyze a LAIK action sequence and translate it into the communication sequence actually required. Furthermore, the backend driver can use this function to allocate the resources for the communication operations. As the same transition is expected to occur frequently, the backend driver can preallocate these resources (such as send and receive buffers) and preserve them to improve performance.

• cleanup(): The cleanup() function enables the backend to clean up resources allocated by the prepare() function, such as send and receive buffers.

• exec(): This function triggers the actual communication in the backend for a given action sequence.

• updateGroup(): As LAIK is designed to be fault-tolerant, the underlying backend may need to react to changes in the process group (such as the removal of a process instance from a group). The adaptation of group-specific data in the backend is achieved by calling this function.

The application programmer is required to call the initialization function of his or her desired backend directly to obtain a LAIK instance with that backend (e.g., LAIK_Init_mpi). LAIK can run with multiple backends at the same time. However, each LAIK backend will create an independent LAIK instance, even though these instances run in the same application. Any operation on a specific LAIK instance has no effect on other LAIK instances.
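Conceptually, a backend driver can be thought of as a table of these callbacks, as in the C sketch below. The struct layout, type names, and signatures are illustrative assumptions; only the callback names are taken from the list above, and the shmem_* functions stand for a hypothetical shared-memory backend that a programmer would implement.

/* Opaque LAIK types (names assumed for this sketch). */
typedef struct Laik_Instance  Laik_Instance;
typedef struct Laik_ActionSeq Laik_ActionSeq;
typedef struct Laik_Group     Laik_Group;

/* Callbacks of a hypothetical shared-memory backend, to be implemented elsewhere. */
extern Laik_Instance *shmem_init(int *argc, char ***argv);
extern void shmem_finalize(Laik_Instance *inst);
extern void shmem_prepare(Laik_ActionSeq *as);
extern void shmem_cleanup(Laik_ActionSeq *as);
extern void shmem_exec(Laik_ActionSeq *as);
extern void shmem_update_group(Laik_Group *g);

/* A backend driver as a table of callbacks (struct layout assumed). */
typedef struct Laik_Backend {
    const char *name;
    Laik_Instance *(*init)(int *argc, char ***argv);   /* create instance, init comm library   */
    void (*finalize)(Laik_Instance *inst);             /* destroy instance, clean up library   */
    void (*prepare)(Laik_ActionSeq *as);               /* translate actions, preallocate bufs  */
    void (*cleanup)(Laik_ActionSeq *as);               /* release buffers from prepare()       */
    void (*exec)(Laik_ActionSeq *as);                  /* run the actual communication         */
    void (*updateGroup)(Laik_Group *g);                /* adapt to a shrunk process group      */
} Laik_Backend;

static const Laik_Backend shmem_backend = {
    .name        = "shmem",
    .init        = shmem_init,
    .finalize    = shmem_finalize,
    .prepare     = shmem_prepare,
    .cleanup     = shmem_cleanup,
    .exec        = shmem_exec,
    .updateGroup = shmem_update_group,
};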

6.1.10. Utilities

LAIK’s utility interface is mainly responsible for providing basic profiling operations, debugging printouts, and exposing program information to LAIK. The profiling functions provide basic functionality such as starting and stopping a timer, printing out timing information, and writing such information to a file. The program control functions enable the program to set identifiers for the application’s current status, such as the iteration (LAIK_set_iteration) and the program phase (LAIK_set_phase). Finally, the debugging functions provide the capability of printing information to standard output, tagged with detailed information about LAIK groups and sizes, at different debugging levels.
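A typical use of these utilities from application code might look like the short sketch below; the exact call names and signatures in the prototype may differ from these assumptions.

for (int it = 0; it < nIter; it++) {
    laik_set_iteration(inst, it);   /* expose the current iteration to LAIK (name from the text) */
    laik_profile_start(inst);       /* assumed: start the built-in timer                         */
    run_solver_step();              /* application kernel                                        */
    laik_profile_stop(inst);        /* assumed: stop the timer and record the elapsed time       */
}
laik_set_phase(inst, "output");     /* assumed signature: tag the current program phase          */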


6.1.11. Limitations and Assumptions in Our Prototype Implementation

Our current LAIK implementation is provided as open source software on Github2. As LAIK is still a prototype, some limitations apply:

• At the time of this writing, our LAIK implementation comes with two communication backends: a TCP backend and an MPI backend.

• The MPI backend is used for evaluating the performance of LAIK, as well as its runtime and memory overhead. However, as MPI does not provide fault tolerance capabilities such as gracefully shutting down some of the existing ranks, an empty LAIK process without active data to calculate cannot be shut down. In our evaluations, such processes call laik_finalize and wait until the other processes also finish.

• The TCP backend is a lightweight implementation that provides a minimal set of communication functions (send/receive) to allow LAIK to operate. This backend is not used for performance benchmarking. However, it overcomes the drawback of not allowing empty processes to shut down and can be used to demonstrate the fault tolerance capability of LAIK.

• Currently, LAIK only supports the “shrinking” of process groups, because the external interface only allows the transmission of failure information (get_failed). This means that data migration to spare nodes is currently not supported.

6.2. Basic Example of a LAIK Program

Before we get started with our evaluation of LAIK, let us understand LAIK with the help of a simple example application called vsum. vsum is an application which calculates the sum of a given vector A, which is stored as a simple C array. For the sake of simplicity, we assume that the length of the vector is divisible by the number of processes. This simple application is illustrated as pseudocode in Algorithm 6 on the next page. To implement this algorithm with LAIK, we select the MPI backend of LAIK. This way, the application can be started through mpirun and controlled as an MPI application. Figure 6.15 on the following page illustrates the necessary steps for the vsum example. After the initialization of LAIK, a Laik_Space object s is created for the index space of vector A.

2https://github.com/envelope-project/laik/commit/a2641901de9a0fe3c3f236be165cd5a94430b70a, accessed in December 2018


Algorithm 6: Example Application of vsum using MPI
  A: The Input Vector
  y: Result of Vector Sum

  MPI_Init();
  n ← A.length() / num_processes;
  myStart ← n * my_rank;
  myEnd ← myStart + n;
  for i ← myStart to myEnd do
      y += A[i];
  end
  MPI_Allreduce(SUM, y, WORLD);
  MPI_Finalize();

A Laik_Partitioning object p is created using the built-in block partitioner, where each process instance is assigned an equal range of the index space. Afterward, a Laik_Data object a is created for A with the previously created Laik_Space object s. The data is switched to the partitioning p, triggering a redistribution of the data if required. Before the calculation can be started, each process instance must call laik_my_slice to identify its range for processing. Then, the calculation is executed. After the calculation has been done, we create another partitioning with the built-in all partitioner, where each process instance holds the entire vector. By switching the currently active partitioning to this new partitioning, LAIK triggers the specified all reduction automatically, after which every process instance holds the result of the vector sum. In the end, laik_finalize is called, indicating the end of the program.
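Expressed in (simplified) C, the LAIK-based vsum follows the steps of Figure 6.15. The sketch below uses call names based on this chapter’s conventions; all laik_* signatures, constants, and the reduction flag are assumptions, and the reduction data flow is simplified for illustration.

#include <stdint.h>
#define LEN 1048576

int main(int argc, char **argv) {
    Laik_Instance *inst  = laik_init(&argc, &argv);        /* MPI backend, started via mpirun */
    Laik_Group    *world = laik_world(inst);

    /* Index space and block partitioning for the input vector A */
    Laik_Space        *s = laik_new_space_1d(inst, LEN);
    Laik_Partitioning *p = laik_new_partitioning(s, world, "block");
    Laik_Data         *a = laik_new_data(s, LAIK_DOUBLE);
    laik_switch_to_partitioning(a, p);                     /* distribute A, redistribute if needed */

    double *A; uint64_t n;
    laik_get_data_pointer(a, &A, &n);                      /* local memory of my slice */
    /* ... fill the local slice of A ... */

    /* 1-element result container; every instance contributes its partial sum */
    Laik_Space        *s1 = laik_new_space_1d(inst, 1);
    Laik_Partitioning *p2 = laik_new_partitioning(s1, world, "all");
    Laik_Data         *y  = laik_new_data(s1, LAIK_DOUBLE);
    laik_switch_to_partitioning(y, p2);

    double *res; uint64_t one;
    laik_get_data_pointer(y, &res, &one);
    res[0] = 0.0;
    for (uint64_t i = 0; i < n; i++) res[0] += A[i];       /* local partial sum */

    /* Switching again with a sum-reduction data flow triggers the all reduction,
     * after which every process instance holds the final result. */
    laik_switch_to_partitioning_reduce(y, p2, LAIK_SUM);   /* flag and call name assumed */

    laik_finalize(inst);
    return 0;
}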


Figure 6.15.: Activity Diagram of LAIK-based vsum


6.2.1. Extended Example of Automatic Data Migration with LAIK

Given the previous example of vsum, we extend this application with data migration functionality. The adapted activity diagram is shown in Figure 6.16.


Figure 6.16.: Activity Diagram of LAIK-based vsum with Data Migration

Upon a predicted fault, the application can ask LAIK to create a derived new Laik_Group object g2, thus excluding all failing process instances. Based on this new group, a new Laik_Partitioning object pNew is created with the built-in partitioner block. Finally, by executing a transition to the new partitioning, data transfer (and migration) is executed automatically. This way, LAIK ensures that no workload remains on the failing process instances.
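The corresponding code path can be sketched as follows in C; all laik_* names and flags are assumptions based on the description in this chapter, not the verbatim prototype API.

/* Sketch: react to a predicted fault by shrinking the group and migrating data. */
int failed[MAX_FAIL];
int nfail = laik_get_failed(inst, failed, MAX_FAIL);
if (nfail > 0) {
    /* derive a new group that excludes the failing process instances */
    Laik_Group *g2 = laik_new_shrinked_group(world, nfail, failed);

    /* new block partitioning over the shrunk group */
    Laik_Partitioning *pNew = laik_new_partitioning(s, g2, "block");

    /* transition a -> pNew: LAIK moves all data away from the failing instances */
    laik_switch_to_partitioning(a, pNew);

    if (laik_myid(g2) < 0)      /* this process holds no data anymore */
        laik_finalize(inst);    /* leave the computation gracefully   */
}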

6.3. Evaluation of the LAIK Library with Real-world Applications

6.3.1. Application Example 1: Image Reconstruction with the Maximum-Likelihood Expectation-Maximization (MLEM) Algorithm

In order to evaluate the performance impact and the effectiveness of data migration, two applications have been ported to LAIK. The first application we evaluated is the Maximum-Likelihood Expectation-Maximization (MLEM) algorithm for Positron Emission Tomography (PET) image reconstruction [SV82]. PET is a medical imaging technique for observing metabolic processes in order to detect, stage, and monitor a range of diseases. For this purpose, a radioactive tracer has to be injected into the object under observation. Such tracers emit positrons during their β-decay. Two 511 keV gamma photons, which travel in opposite directions, are created through the annihilation of the positrons with electrons [Küs+09].


Figure 6.17.: Schematic View of the MADPET-II PET Scanner [Küs+09]

To detect these photons, a PET scanner equipped with a ring of detectors based on scintillator crystals and photodiodes is used. If two detectors record a photon within a certain time window, an event is assumed somewhere along the line connecting the detectors. Such a line is called a Line of Response (LOR). In 3D space, the two detectors can detect events not only from a line, but rather from a larger polyhedral 3D space inside the scanner, called the Field of View (FOV). The number of detected events influences the quality of the measurement, and the coverage of the FOV by LORs affects the resolution. The FOV is generally divided into a 3D grid, where each grid cell is called a voxel [Küs+09]. In this dissertation, the example scanner used is MADPET-II, which was developed at the Klinikum rechts der Isar of TU München. It is designed for high-resolution 3D imaging of small animals. MADPET-II features a unique design with two concentric rings of detectors. This way, the sensitivity increases without a loss of resolution. A schematic view of the MADPET-II scanner is given in Figure 6.17. The output from the detector - called a list-mode sinogram - must be reconstructed using the system matrix. The latter describes the geometrical and physical properties of the scanner.


The Maximum Likelihood Expectation Maximization (MLEM) Algorithm

The algorithm used to reconstruct the image for our MADPET-II scanner is MLEM, developed originally by Shepp and Vardi [SV82] in 1982.

f_j^{(q+1)} = \frac{f_j^{(q)}}{\sum_{l=1}^{M} a_{lj}} \sum_{i=1}^{M} a_{ij} \frac{g_i}{\sum_{k=1}^{N} a_{ik} f_k^{(q)}} \qquad (6.2)

The algorithm uses the iteration scheme shown in Equation 6.2 above, where

• N is the number of voxels,

• M is the number of detector pairs (LORs),

• f is the 3D image that is reconstructed,

• A is the system matrix of size M × N, which describes the geometrical and physical properties of the scanner,

• g is the measured list-mode sinogram of size M, and

• q is the iteration number.

The algorithm is based on the probability matrix A = (a_ij), where each element represents the probability that a gamma photon discharged from a voxel j is recorded by a given pair of detectors i [Küs+09]. In particular, this algorithm can be divided into four major steps:

1. Forward Projection: h = A f . In this step, the current approximation of the image is projected into the detector space.

2. Correlation: c_i = g_i / h_i. The projection from step 1 is correlated with the actual measurement.

3. Backward projection: u = A^T c. The correlation factor is projected back into image space by multiplying with the transposed system matrix.

4. Update image: f_j^{(q+1)} = f_j^{(q)} \cdot u_j / n_j. An update for the image with the back-projected correlation factor is calculated and a normalization n is applied.

The algorithm assumes an initial estimated gray image, which is calculated by summing up the elements of the measured data g and dividing the result by the sum of the elements of the system matrix A. This process is given in Equation 6.3 [Küs+09] on the following page.


f^{(0)} = \frac{\sum_{i=1}^{M} g_i}{\sum_{j=1}^{N} n_j}, \quad \text{where } n_j = \sum_{l=1}^{M} a_{lj} \qquad (6.3)

The runtime of the algorithm is dominated by the forward and backward projections, both being sparse matrix-vector operations. This algorithm is known to be memory (bandwidth) bound [Küs+09].

The System Matrix of MADPET-II

The system matrix A of MADPET-II contains its geometrical and physical description. The FOV is divided into a grid of 140 × 140 × 140 voxels in the (x, y, z) directions. There is a total of 1152 detectors in MADPET-II, resulting in 664128 unique LORs. The content of the system matrix is generated by the detector response function model as described in [Küs+10; Str+03]. The matrix is stored in the Compressed Sparse Row (CSR) [Saa03] format to save space. The elements are stored as single precision floating point numbers [Küs+10]. A more detailed description of this system matrix and a density plot for visualization are given in Appendix B of this dissertation. Although the transposed system matrix is required in the backward projection, we do not store the transposed matrix. This is because the backward projection can be written as u^T = c^T A, which only requires the original CSR matrix [Küs+09].
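Because u^T = c^T A can be evaluated by scattering each CSR row scaled by the corresponding correlation value, both projections can be computed from the same CSR arrays. The following C sketch of the two kernels illustrates this; variable names are illustrative, and this is not the reference implementation.

/* CSR matrix: rowptr has M+1 entries, colidx/val have nnz entries. */
typedef struct {
    int    M, N;
    int   *rowptr;
    int   *colidx;
    float *val;
} CsrMatrix;

/* Forward projection h = A * f : one sparse dot product per LOR (row). */
void fw_proj(const CsrMatrix *A, const float *f, float *h) {
    for (int i = 0; i < A->M; i++) {
        float s = 0.0f;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            s += A->val[k] * f[A->colidx[k]];
        h[i] = s;
    }
}

/* Backward projection u = A^T * c, written as u^T = c^T * A:
 * scatter each row i scaled by c[i]; no transposed matrix is stored. */
void bw_proj(const CsrMatrix *A, const float *c, float *u) {
    for (int j = 0; j < A->N; j++) u[j] = 0.0f;
    for (int i = 0; i < A->M; i++)
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            u[A->colidx[k]] += A->val[k] * c[i];
}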

6.3.2. MPI Parallelization

For parallelization, the matrix is partitioned into blocks of rows with approximately the same number of non-zero elements per block. This results in good, albeit not perfect, load balancing. The reference implementation of MLEM is written using MPI. It offers optional checkpointing after each iteration, controlled by a command-line parameter. The implementation of the MPI-based MLEM is shown in Algorithm 7. The reference MLEM implementation is available as open source software on Github3.
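A row-block partitioning with approximately equal numbers of non-zero elements per block can be computed greedily from the CSR row pointer, as in the following sketch (illustrative only, not the reference code).

/* Split rows [0, M) into 'size' blocks with roughly nnz_total/size non-zeros each.
 * Returns the rows assigned to 'rank' as the half-open range [*myStart, *myEnd). */
void split_matrix(const int *rowptr, int M, int rank, int size,
                  int *myStart, int *myEnd) {
    long nnz_total = rowptr[M];
    long target    = (nnz_total + size - 1) / size;   /* target non-zeros per block */
    int  row = 0;
    for (int r = 0; r < size; r++) {
        int  start = row;
        long acc   = 0;
        /* the last block takes all remaining rows */
        while (row < M && (r == size - 1 || acc < target)) {
            acc += rowptr[row + 1] - rowptr[row];
            row++;
        }
        if (r == rank) { *myStart = start; *myEnd = row; return; }
    }
}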

3https://github.com/envelope-project/mlem, accessed in December 2018


Algorithm 7: MPI Based MLEM Implementation
  matrix: Input: The System Matrix
  lmsino: Input: The Listmode Sinogram
  nIter : Input: The Number of Total Iterations
  chkpt : Input: Flag for Checkpointing
  image : Output: The Reconstructed Image

  (myStart, myEnd) ← splitMatrix(rank, size);
  // myStart = starting row, myEnd = ending row
  calcColumnSums(ranges, matrix, norm);
  if needToRestore then
      restore(checkpoint, image, iter);
  else
      initImage(norm, image, lmsino);
  end
  for i ← 1 to nIter do
      calcFwProj(myStart, myEnd, matrix, image, fwproj);
      MPI_Allreduce(fwproj);
      calcCorrel(fwproj, lmsino, correlation);
      calcBkProj(myStart, myEnd, matrix, correlation, update);
      MPI_Allreduce(update);
      calcUpdate(update, norm, image);
      if chkpt then
          checkpoint(image, i);
      end
  end
  if rank == 0 then
      writeImage(image);
  end


Porting MLEM to LAIK

In order to port MLEM to LAIK, the code is transformed in several steps [Yan+18]:

1. Matrix Loading Routine: The sparse matrix class reading the data from the CSR file4 is modified to support the reading of non-contiguous data ranges. This is required in order to support multiple non-contiguous partitions (and data slices), which are created with an incremental partitioner (cf. Section 5.2).

2. Data Partitioning: Create the index space object (Laik_Space) over the rows of the sparse matrix, as well as a corresponding partitioning object (Laik_Partitioning). The built-in block partitioner with a weighted index is chosen to support the unbalanced rows of the sparse matrix. Furthermore, a shared partitioning (cf. Section 5.2) is created with the built-in all partitioner for the replacement of allreduce operations.

3. Data Container: Create data storage space (Laik_Data) for all working vectors, including norm, correlation, fwproj and image.

4. Kernel Wrapping: The calculation kernels for the forward projection and the backward projection are encapsulated with an additional loop, in order to process non-contiguous data slices.

5. Communication Transformation: All the MPI communication routines are replaced by switching the partitioning of the respective LAIK data container to the all partitioning created previously. This triggers an implicit allreduce operation (see the sketch after this list).
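For instance, the explicit reduction of the forward projection result can conceptually be replaced by a partitioning switch, as in the fragment below; the call name follows this chapter’s conventions, and the data-flow flag is an assumption.

/* Before (reference mlem): explicit reduction of the forward projection. */
/* MPI_Allreduce(MPI_IN_PLACE, fwproj, M, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD); */

/* After (laik-mlem): switch the fwproj container to the shared "all"
 * partitioning; LAIK then performs the sum reduction implicitly. */
laik_switch_to_partitioning_reduce(d_fwproj, pAll, LAIK_SUM /* assumed flag */);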

The workload of porting the application (excluding the preparation work) was approximately half a day without any previous experience with LAIK [Yan+18]. The LAIK version of MLEM is available as open source software on Github5.

6.3.3. Evaluation of Application Example 1: MLEM

The evaluation of our LAIK-based MLEM implementation (in the following: laik-mlem) in comparison with the original MPI-based MLEM code (in the following: reference mlem) was carried out on the CoolMUC-2 system at the Leibniz-Rechenzentrum der Bayerischen Akademie der Wissenschaften (LRZ) (the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities)6. Each node of CoolMUC-2 is equipped with two 14-core processors based on the Intel Haswell architecture. The RAM available per node is 64 GB.

4Compressed Sparse Row [Saa03] 5https://github.com/envelope-project/mlem 6https://www.lrz.de, accessed in June 2019

More detailed information on CoolMUC-2 is provided in Appendix A.2. Both laik-mlem and reference mlem are compiled using gcc version 5 and Intel MPI version 2017. All binaries are compiled with the -march=native -O3 flags to ensure maximum optimization. We ran laik-mlem and reference mlem with 1 to 28 MPI processes. Furthermore, we pinned these processes equally among the four NUMA domains of a node (using the Intel MPI environment variable I_MPI_PIN_DOMAIN=numa). For both programs, we executed ten MLEM iterations and captured four independent measurements for each task/binary combination. All experiments were carried out on the same physical node. To eliminate delays caused by the file system, the initial time for loading the sparse matrix from the file system was not considered in the scalability evaluation. All evaluations were completed on a single node because the sparse matrix is small enough to fit into the system’s main memory. Furthermore, the single-node setup allows us to understand the minimum overhead of LAIK. The original evaluation results were published in [Yan+18]. Figures 6.18 – 6.22 present the evaluation results.

As shown in Figure 6.18, LAIK does not produce much overhead compared with the reference mlem. This is also reflected in Figure 6.19. As a result, a total speedup of up to 12x can be achieved; the same speedup is achieved with the reference mlem. Furthermore, Figure 6.19 shows that the speedup flattens out starting from approximately 16 MPI processes and oscillates with a higher number of MPI processes. This is expected behavior, as MLEM is memory (bandwidth) bound [Küs+09]. The architecture of our testbed features a dual-socket design with two NUMA domains on each processor chip (also known as cluster-on-die), resulting in a total of four NUMA domains. This architecture negatively affects performance if the total number of MPI processes is not divisible by four. To better understand the overhead produced by LAIK, Figure 6.20 shows the decomposition of the runtime per iteration into the different operations as relative percentages. As expected, the time spent in the backend (i.e., the communication library, MPI) increases with the number of processes, as MLEM requires allreduce operations. The potential overhead produced by LAIK is very low and does not scale with the number of processes. This conclusion is also confirmed by Figure 6.21. Therefore, we conclude that the overhead of LAIK for MLEM is low, and it does not change the scaling behavior of MLEM.

We now compare different repartitioning algorithms. Since we implemented laik-mlem such that it can handle a non-contiguous range of sparse matrix rows, we use two different (re-)partitioners to create two different partitionings for data migration. The first one is the incremental (re-)partitioner (cf. Section 5.2), which equally divides the workload of the MPI instances leaving and appends it to the remaining instances.

Figure 6.18.: Average Runtime per Iteration, laik-mlem vs. reference mlem [Yan+18]

This partitioner is implemented by the application programmer of MLEM. The second one is the continuous (re-)partitioner, which is the default block partitioner provided by LAIK. It calculates a new partitioning from scratch on the modified group, excluding the MPI instances that leave. The implementation of this partitioner is provided by LAIK. For our experiment, we explicitly removed one of the working MPI instances after the sixth iteration. This way, we ensured that the time we measured was for repartitioning only, and not for synchronizing the processes. The result is shown in Figure 6.22. One can see that the incremental (re-)partitioner outperforms the continuous one by a factor of two. The reason is that the MLEM code has to reload new parts of the sparse matrix from the file system in order to execute the computation. With the incremental partitioner, a significantly lower amount of data has to be reloaded and transferred, because only the rows of the leaving nodes need to be loaded by a remaining process. With the continuous (re-)partitioner, a higher amount of data must be loaded from the file system because the workload ranges are shifted globally. This causes a higher time consumption for the first iteration after repartitioning. Furthermore, we can see in Figure 6.22 that the execution time both before and after repartitioning remains constant, while the time for the first iteration after repartitioning differs.


Figure 6.19.: Speedup Curve for laik-mlem vs. reference mlem [Yan+18]

Moreover, it is observable that the runtime per iteration increases only minimally after repartitioning, indicating that LAIK has restored the load balance across all remaining MPI tasks. This observation demonstrates that the data migration is successful and does not change the scaling behavior of MLEM. Therefore, we conclude that with LAIK, fault tolerance can be achieved by data migration. Our experiments with different partitioners confirm the theory introduced in Chapter 5: Different repartitioning strategies are useful in different cases. For some applications, which are more rigid and do not support multiple slices in their kernels, the continuous (global block) partitioner works best. Other applications, which can be adapted to support non-contiguous data slices (such as MLEM), benefit from incremental repartitioning.



Figure 6.20.: Overhead Analysis for laik-mlem [Yan+18]


Figure 6.21.: Overhead Scalability Analysis for laik-mlem [Yan+18]



Figure 6.22.: Comparison of Different Repartitioning Strategies for laik-mlem [Yan+18] Left: Repartitioning using the Incremental Partitioner Right: Repartitioning using the Default Block Partitioner


6.3.4. Application Example 2: The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) Benchmark

With laik-mlem, we have shown the feasibility of data migration for rescuing data from a failing node. Furthermore, we have shown the low overhead of LAIK for an application running on a single node. However, this experiment is not sufficient to prove the effectiveness of LAIK for large-scale (parallel) applications designed for cluster systems. Consequently, we have also ported another application – the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) benchmark – in order to understand the efficiency of LAIK for data migration. LULESH is a benchmark targeting the solution of the Sedov Blast Problem (SBP) [Sed46] using Lagrangian hydrodynamics [HKG]. The SBP represents a class of classical HPC applications. According to Hornung et al. [HKG], the reference LULESH implementation provided on Github7 is drawn from a production hydrodynamics code of the Lawrence Livermore National Laboratory (LLNL). The basic idea of solving the SBP lies in solving the Euler equations shown in Equation 6.4, where \vec{U} is the velocity, ρ is the density, e is the internal energy, p is the isotropic pressure, q is the artificial viscosity, and a_{HG} and \dot{e}_{HG} are the acceleration and heating terms due to the hourglass filter [HKG].

\rho \frac{D\vec{U}}{Dt} = -\nabla \cdot (p + q) + \rho\, a_{HG}, \qquad \frac{De}{Dt} = -(p + q) \cdot \frac{DV_{spec}}{Dt} + \dot{e}_{HG} \qquad (6.4)

The exact physics involved is beyond the scope of this work, because we focus on porting the existing LULESH implementation to LAIK. While detailed information on the physics and implementation of LULESH can be found in [HKG], here is a brief summary:

• The reference code simulates the SBP in three spatial dimensions x, y, z. A point source of energy is deposited at the origin as the initial condition of the simulation.

• The initial mesh is a uniform Cartesian mesh. It is divided into logically rectangular collections of elements, which are called domains. Each domain – which is implemented as a C++ class – represents a context for data locality. It holds the data for all the elements as well as the nodes surrounding those elements. The schematic of the LULESH mesh decomposition is depicted in Figure 6.23. Boundary data is replicated if the domains are mapped to different processors.

7https://github.com/LLNL/LULESH, accessed in December 2018


• There are two different types of data stored in the domain class: the initially cube-shaped finite elements (in the following: elemental data), and the related vertices of the finite elements (in the following: nodal data). There are a total of 13 nodal variables and 13 elemental variables in each domain. A detailed list is given in Tables C.1 and C.2.

• Two different operations are carried out in each iteration (timestep): CalcTimeIncrement(), which calculates the time increment Δt_n for the current timestep, and LagrangeLeapFrog(), which advances the solution of the elements from t_n to t_{n+1}. The pseudocode for the MPI-based implementation is given in Algorithm 8.

• For our work, we use the MPI/OpenMP hybrid implementation of LULESH 2.0. There are two main types of communication: (1) halo exchange at the borders of domains for stencil-wise updates of data structures (e.g., the volume gradient), and (2) reduction (aggregation) of calculations from the element quantities to the surrounding nodes (e.g., the force vector). MPI communication is explicitly coded, and communication routines are pre-initialized (e.g., finding communication partners and setting up buffers) at mesh decomposition time.


Figure 6.23.: LULESH Mesh Decomposition Schematic [HKG]


Algorithm 8: MPI-based Implementation of LULESH [HKG]
  sz   : Input: Size of the Mesh
  nIter: Input: The Number of Total Iterations

  MPI_Init();
  Domain locDom ← InitMeshDecomposition(rank, size, sz);
  while (!endOfSimulation && nIter > 0) do
      CalcTimeIncrement();
      LagrangeLeapFrog();
      nIter--;
  end
  if rank == 0 then
      printStatistics();
  end
  MPI_Finalize();

  Subroutine LagrangeLeapFrog()
      LagrangeNodal();                  // Advance Node Quantities
      CalcForceForNodes();              // Calculate Node Forces
      LagrangeElements();               // Advance Element Quantities
      CalcTimeConstraintsForElems();    // Calculate Time Constraints for Elements

Porting LULESH to LAIK

After having identified the basic program structure, data structures, and communication patterns, we ported LULESH to LAIK. In the remainder of this work, reference lulesh refers to the original MPI-based LULESH implementation provided by the Lawrence Livermore National Laboratory (LLNL), and laik lulesh refers to our ported version. Detailed documentation of the porting was published in our previous publication [Rao+19]. The main goal of porting LULESH to LAIK is to keep the number of changes as small as possible and to achieve fault tolerance by allowing LULESH to migrate data away from any potentially failing node. Therefore, several rules were introduced to keep the porting process aligned with our goals:

• Code accessing data structures and computational kernels must not be changed.


• Explicit MPI communication code should be replaced by transitions between different partitionings in LAIK in order to achieve implicit communication.

• By porting data structures from reference lulesh into LAIK data containers, the above mentioned implicit communication can be fully performed under LAIK’s control. This also enables us to use LAIK’s automatic data migration functionali- ties from the Data API Layer for fault tolerance purposes (cf. Section 6.1.6).

Accordingly, the porting of the two different types of data structures is completed in two steps:

1. We transform the original communication by reimplementing all the data structures listed in Tables C.1 and C.2 in Appendix C with Laik_Data and letting LAIK maintain these data structures, which are then updated in the regular iterations.

2. For fault tolerance, data structures which are used purely locally are also transferred to be maintained by LAIK, because they, too, must be migrated whenever there is a data migration request. An overview of these data structures is given in Table C.3 in Appendix C.

Furthermore, small modifications are required in the main iteration loop to check for repartitioning requests and to trigger a data repartitioning in LAIK. The detailed steps performed for porting LULESH to LAIK are as follows [Rao+19]:

1. Adaptation of Data Structures which Require MPI Communication during Kernel Execution.
LULESH uses asynchronous communication for the force fields (F) and nodalMass, followed by aggregation. In order to support the transitions (and implicit communication) as specified by LAIK, data must be bound to different partitionings. As LULESH kernels operate with local indexes within a domain instead of a global numbering, the default partitioners in LAIK, which are based on global indexes, cannot be used. Therefore, we have created a customized partitioner called overlapping on the global nodal index space, which is illustrated in Figure 6.25. The nodal index space is partitioned among all tasks, with neighboring tasks sharing one layer of nodes. Laik lulesh uses partitioners similar to the reference code with different layouts. However, some adaptations had to be made, as LULESH relies on 1D data structures which are mapped from a 3D domain in the x–y–z direction. Instead of a compact data structure as provided in reference lulesh (cf. Figure 6.24, right), our implementation of laik lulesh relies on a non-compact x–y–z layout (cf. Figure 6.24, left) based on many “pieces”. The overlapping region is shared between two neighboring tasks and updated by each task independently. By transitioning from and to the same overlapping partitioning, LAIK ensures that the


overlapping region is properly reduced among all the sharing process instances, thus achieving the same communication as provided explicitly by reference lulesh.

Figure 6.24.: Illustration of a Domain for Elemental Data with a Problem Size of 4x4x4 Elements [Rao+19]

Figure 6.25.: Illustration of Nodal Data with Overlapping Partitioning [Rao+19]

LULESH also uses an asynchronous halo exchange pattern for the velocity gradient fields, which differs from the overlapping communication pattern. To support the halo exchange, we have created two partitioners – exclusive and halo – for these data. A transition between these two partitionings results in the communication required for the halo exchange. This is illustrated in Figure 6.26.


Figure 6.26.: Illustration of the Elemental Data with Exclusive Partitioning (left) and Halo (right) Partitioning [Rao+19]

2. Adaptation of Additional Data Structures Required for Live Data Migration
LULESH uses a large number of additional temporary data structures. For live data migration, these data structures need to be handed over to LAIK as well. Therefore, these data are fully transferred to LAIK's responsibility, using built-in block partitioners that reflect the data distribution before and after repartitioning at the time of the respective data migration request. After the data migration, these data are restored to LULESH's original structure. Moreover, we modified the main while loop to check for a repartitioning request in each iteration (a sketch of this check is given after this list). If a process is no longer part of the active calculation after repartitioning, it is discarded by calling laik_finalize().

3. Additional Performance Optimizations
Multiple optimizations are implemented in order to enhance the performance of laik lulesh. We use LAIK's transition caching feature to pre-calculate the transitions between different iterations. These transitions are also used for calculating and caching the corresponding action sequences between different partitionings, because these sequences are executed in every iteration. Moreover, our implementation creates many 1D “pieces”, because the index regions of a given domain are not contiguous in the global view of the x–y–z domain decomposition. After each transition, pointers to these data structures typically become invalid, which causes major performance overhead. Consequently, we utilize the pointer reservation function of LAIK (cf. Section 6.1.6) to reduce this overhead. Porting LULESH to LAIK took approximately six months for an experienced MPI programmer who was not at all familiar with LAIK or LULESH. The code is available as an open source project on Github8. It is important to mention that reference lulesh only supports cubic numbers of process instances (e.g., 1, 8, 27, 64, ...). This limitation is not changed in laik lulesh, in order to stay aligned with the kernel design of LULESH.
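The repartitioning check mentioned in step 2 can be condensed into the following C sketch. All laik_* names and the helper repartition_all_containers() are assumptions used for illustration; they paraphrase the mechanism, not the actual laik lulesh source.

/* Inside the main timestep loop of laik lulesh (names and signatures assumed). */
while (!endOfSimulation && nIter > 0) {
    CalcTimeIncrement();
    LagrangeLeapFrog();
    nIter--;

    int nfail = laik_get_failed(inst, failed, MAX_FAIL);   /* repartitioning request? */
    if (nfail > 0) {
        Laik_Group *g2 = laik_new_shrinked_group(world, nfail, failed);
        repartition_all_containers(g2);  /* hypothetical helper: switch every Laik_Data
                                            container to partitionings over the shrunk group */
        if (laik_myid(g2) < 0) {         /* this process holds no work anymore */
            laik_finalize(inst);
            exit(0);
        }
        world = g2;
    }
}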

8https://github.com/envelope-project/laik-lulesh, accessed in December 2018


6.3.5. Evaluation of Application Example 2: LULESH

Extensive experiments and measurements were performed on SuperMUC Phase II (in the following: SuperMUC) in order to evaluate the performance of our ported LULESH code. SuperMUC consists of 3072 nodes, each equipped with two Intel Haswell Xeon E5-2697v3 processors and 64 GB of main memory. A detailed description of SuperMUC is given in Appendix A.1. The LAIK library, laik lulesh, and reference lulesh are compiled with Intel Compiler version 2017. The MPI library used is IBM MPI version 1.4. Furthermore, all binaries are compiled with -O3 -xCORE-AVX2 to achieve maximum optimization by the compiler. For each experiment/binary combination, a total of 5 runs was carried out to exclude interference from the system. Weak scaling (cf. Section 2.3.3), strong scaling (cf. Section 2.3.3), and repartitioning tests were performed. The results were originally published in [Rao+19].

Weak Scaling

We selected a problem size of 16 (option -s 16) for both laik lulesh and the reference code. The upper bound on the number of iterations is set to 10000 (option -i 10000). The results are presented in Figures 6.27 and 6.28. The time reported here does not include initialization and finalization. For weak scaling, we ensured that the workload remained the same for each participating process (cf. Section 2.3.3). We calculated the runtime per iteration by dividing the reported time by the number of iterations. Figure 6.27 shows the comparison of the normalized runtime per iteration between laik lulesh and reference lulesh, shown as box plots on the vertical axis. The horizontal axis represents the number of MPI instances used in each experiment. For laik lulesh, an increase in iteration runtime with an increasing number of MPI tasks is clearly observable. In contrast, the reference code scales almost perfectly, with only a slight increase [Rao+19]. To showcase the exact difference in the normalized runtime per iteration, the red line in Figure 6.28 represents this overhead. We can see that it grows with the number of processes. Our hypothesis is that this increasing overhead is caused by the lack of support for asynchronous communication in LAIK. By analyzing Figures 6.27 and 6.28, one can see that LAIK introduces a constant part of the overhead, which does not scale with the number of processes, and a non-constant part, which grows with an increasing number of processes. The source of the constant overhead is the added layer of abstraction for data structure access, while the scaling part is most likely induced by the different MPI communication pattern [Rao+19].


Nevertheless, we conclude that the overhead is within an acceptable range for mid-size runs (up to 512 parallel process instances).


Figure 6.27.: Weak Scaling Runtime Comparison for laik lulesh vs. reference lulesh with -s=16 [Rao+19]



Figure 6.28.: Overhead of laik lulesh over reference lulesh with -s=16 [Rao+19]


Strong Scaling

Since laik lulesh and reference lulesh only support a cubic number of processes, we carefully constructed a test case for strong scaling. The design of the test case is presented below. Let C be the global 3-dimensional problem size, which needs to remain constant for all strong scaling experiments (cf. Section 2.3.3), let s be the local 1-dimensional problem size (parameter -s), and let p be the number of MPI process instances in the parallel execution. The relation C = s^3 \times p applies. With S = C^{1/3}, the following implication holds:

(C = s^3 \times p) \implies (S = s \times p^{1/3}). \qquad (6.5)

As the limitation to a cubic number of MPI processes still applies, p^{1/3} and s must be natural numbers. For the sake of simplicity, we set up our strong scaling experiments with p^{1/3} being powers of 2 and S = 256. The corresponding tuples of (p, s) used in the strong scaling experiments are therefore [Rao+19]:

(1^3 = 1,\ 256),\ (2^3 = 8,\ 128),\ (4^3 = 64,\ 64),\ (8^3 = 512,\ 32),\ \text{and}\ (16^3 = 4096,\ 16). \qquad (6.6)

The results from these experiments are illustrated in Figure 6.29. Note that the vertical axis is log scaled. As in the weak scaling experiment, similar scaling behavior can be observed for both LAIK and the reference version with up to 512 processes. With 4096 processes, laik lulesh shows a significant overhead: our port is at least a factor of two slower than the reference code. In addition, we can observe that the overhead curve first decreases and then increases with a large number of processes. Figure 6.29 further confirms the relatively constant overhead for the experiments with 8, 64, and 512 processes, where the added abstraction of using 1D slices in the LAIK implementation is the dominant source of overhead. The same scaling overhead as in the weak scaling test can also be observed for laik lulesh with a large number of processes in Figure 6.29. This again confirms that the scaling overhead results from the lack of support for asynchronous communication in LAIK, which scales with the number of point-to-point communications (and thus with the number of processes) [Rao+19].



Figure 6.29.: Strong Scaling Comparison for laik lulesh vs. reference lulesh [Rao+19]


Repartitioning

Using LAIK, we can now migrate the execution of laik lulesh to a smaller number of MPI processes at runtime. This is the desired result for fault tolerance through data migration provided by LAIK. To understand the impact of migration on the scaling behavior, we conducted a series of scalability experiments, similar to the strong scaling experiments. Again, we chose p^{1/3} to be powers of two, and S = 64. We simulate a potential fault by enforcing a repartitioning and data migration to the next smaller supported number of MPI process instances, where p^{1/3} remains a power of two. This results in the following repartitioning experiments: from 8 to 1, from 64 to 8, and from 512 to 64, respectively [Rao+19]. We held the number of iterations (parameter -i) constant at 2000 for all experiments. We executed the kernel for 250 iterations with the initial number of MPI processes and then performed the repartitioning. Finally, the kernel was executed for another 1750 iterations until completion. This way, the total runtime before and after migration should be the same. A total of 5 runs was executed on SuperMUC for each configuration [Rao+19]. Figure 6.30 shows the results of this experiment. The horizontal axis gives the type of migration. The vertical axis again represents the normalized time per iteration (note that it is log(2)-scaled). As expected, the runtimes both before and after repartitioning appear linear on the log scale. In addition, the normalized runtime for a given number of MPI processes (e.g., 64) is almost the same, regardless of whether it is the initially set number of processes or the final state after repartitioning. This means that a data migration performed by LAIK does not affect the runtime behavior of laik lulesh [Rao+19]. The measurements reflect a good result regarding the effectiveness of repartitioning and automatic data migration performed by LAIK. However, the limitation of only allowing a cubic number of MPI processes significantly reduces the gain of proactive migration, because the failure of a single node leads to the elimination of a large number of processes [Rao+19]. Table 6.3 shows the time consumption for repartitioning and data migration in seconds. Compared to the long runtime of LULESH, the time spent on repartitioning remains negligible.



Figure 6.30.: Runtime Comparison Before and After Repartitioning for laik lulesh [Rao+19]


Table 6.3.: Time Consumption for Data Migration Using LAIK

  Configuration    Time for Repartitioning
  512 to 64        ∼1.5678 s
  64 to 8          ∼0.8803 s
  8 to 1           ∼1.6979 s

6.4. Discussion on Effectiveness of Data Migration with LAIK

In conclusion, LAIK is an approach to assist programmers in enhancing their applications with data migration capabilities. It features a layered and modular design, which programmers can utilize to adapt their applications in an incremental manner. With LAIK, an application can transfer the responsibility for data partitioning to LAIK, thus achieving implicit communication through transitions between different partitionings. Fault tolerance with data migration is triggered automatically, because LAIK can simply exclude failing nodes from a partitioning and redistribute their partitions to other process instances. We have further evaluated the effectiveness and performance of LAIK using two examples in this chapter: MLEM and LULESH. We have performed different tests for repartitioning and data migration and have also conducted a scalability analysis. We can now summarize the advantages and disadvantages of LAIK as follows.

6.4.1. Advantages of LAIK

LAIK provides a list of advantages, most prominently:

• LAIK achieves a low overhead for fault tolerance by design. LAIK is an application-integrated approach for fault tolerance with data migration. The user has full control over the migration of application data in case of a fault, hence reducing the overhead to the desired minimum.

• LAIK ensures incremental porting. Managing existing code is a major challenge within the context of parallel computing. With LAIK, an existing application can be ported in a step-by-step manner by utilizing the different layers of the APIs.

• LAIK enables fault tolerance automatically by repartitioning and data migration, if the data responsibility has been fully transferred to LAIK. With LAIK’s Data APIs, the user only has to specify where the data is needed and when.


• LAIK achieves communication management by using implicit communication, if Data APIs are used. This way, communication optimization can be fully realized in a future release of LAIK.

• Experiments with MLEM and LULESH have shown that LAIK only adds a minimal amount of runtime overhead. Although tests with LULESH have indicated some scaling-dependent overhead, that overhead is caused by the implementation of our LAIK prototype and not by its design. The additional layer of abstraction introduced by LAIK (the index space) adds only about 10% overhead, which remains constant.

6.4.2. Disadvantages and Limitations of LAIK

Besides the advantages of LAIK, there are some disadvantages and limitations:

• The time required to port an existing, complex application to LAIK is very high. For LULESH, this was 6 person-months for an experienced parallel programmer. This was mainly due to the transition from pure MPI to LAIK. MPI-based applications typically use a local view of the data. However, as LAIK works on the global distribution of data, the data handling has to be adapted to a global view. For applications with complex data structures, this change requires a significant amount of adaptation. The return on investment of such adaptations might be too low for code owners to port their code in order to support data migration.

• LAIK’s fault tolerance capability relies strongly on the backend support. If the backend (such as MPI in our prototype) does not support fault tolerance, LAIK cannot achieve true fault tolerance. In the case of MPI, since we cannot safely shut down failing MPI ranks, it is not possible to achieve fault tolerance with LAIK.

• LAIK currently only supports regular data structures, which is a major limitation for real-world applications. To make LAIK a competitive solution for fault tolerance based on data migration, future implementations will have to support arbitrary data structures.

6.4.3. Lessons Learned

Based on the experiments and experience in porting MLEM and LULESH, we believe that LAIK is a possible candidate for achieving fault tolerance by using data migration. Furthermore, with LAIK, we have shown that data migration is a promising technique for fault tolerance in parallel applications.


However, additional implementation and optimization efforts have to be invested to make LAIK a competitive solution for application-integrated fault tolerance. The rather complex concept of LAIK can be overkill if fault tolerance is the only goal. Moreover, adequate support for fault tolerance has to be integrated into popular runtime libraries such as MPI. Without such support, application-integrated fault tolerance methods can hardly become effective.

Based on the experience with LAIK, we are going to introduce MPI sessions and MPI process sets, featuring a leaner design in order to achieve fault tolerance by using data migration.

7. Extending MPI for Data Migration: MPI Sessions and MPI Process Sets

In the last chapter, we introduced a new user-level library called LAIK, which can assist programmers with data migration for existing and new applications. However, there is still one crucial drawback in using LAIK: The actual fault-tolerant characteristics strongly rely on the support of the communication backend and the process manager. With MPI being the de facto standard for parallel applications, our LAIK prototype features a communication backend which is based on MPI. However, since MPI does not support the safe shutdown of a process instance, true fault tolerance with LAIK and MPI as a backend is not possible at the moment. Consequently, we have decided to take a closer look at the MPI standard itself and identify potential gaps and room for improvement to the MPI standard.

Many efforts to extend MPI with fault tolerance capabilities have been made in recent years. User Level Failure Mitigation (ULFM) in MPI [Bla+12] is the most prominent proposal. Bland et al. [Bla+12] have suggested the introduction of a list of new MPI constructs, which can be used to inform the application about a failed communication in MPI. Here are the key concepts [Bla+12]:

• MPI_COMM_FAILURE_ACK and MPI_COMM_FAILURE_GET_ACKED are responsible for determining which processes within an MPI communicator have failed. After acknowledging the failures, a program can resume any point-to-point communication between healthy process instances.

• MPI_COMM_REVOKE allows the application to cancel an existing communicator. It is required to propagate error information to those processes which are not immediately affected by the failure.

• MPI_COMM_SHRINK creates a new communicator by removing failed processes from a revoked communicator. This call is collective and also executes a consensus protocol to make sure that all processes agree on the same process group.

• MPI_COMM_AGREE is an agreement algorithm which ensures strong consistency between processes. This collective function provides an agreement, even after a


failure or in the event of a communicator revocation.

With ULFM-MPI, communication failures are exposed to the user application. The programmer can achieve user-level fault mitigation by revoking an existing communicator that contains failed processes. The application can then create a derived communicator without the failed processes and continue the calculation.
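To make this recovery path concrete, the following C sketch shows the typical revoke-and-shrink pattern. It uses the MPIX_-prefixed names under which the ULFM prototype exposes these calls; the surrounding error handling is reduced to the essentials, and the redistribution step is application-specific.

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM prototype: MPIX_ extensions */

static MPI_Comm comm_work;   /* working communicator; MPI_ERRORS_RETURN must be set on it */

static void do_allreduce(double *local, double *global) {
    int rc = MPI_Allreduce(local, global, 1, MPI_DOUBLE, MPI_SUM, comm_work);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            MPIX_Comm_revoke(comm_work);            /* propagate failure knowledge       */
            MPI_Comm shrunk;
            MPIX_Comm_shrink(comm_work, &shrunk);   /* drop failed processes collectively */
            MPI_Comm_free(&comm_work);
            comm_work = shrunk;
            /* application-specific: redistribute data, then retry the operation */
        }
    }
}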

Another approach to making MPI more dynamic is called Adaptive MPI (AMPI) [HLK04]. Huang et al. utilize a technique called processor virtualization, which detaches a process instance from a physical processor. Instead of issuing real processes at user level, AMPI creates virtual processes that can later be mapped to a set of processors. This way, AMPI can achieve fault tolerance by utilizing techniques such as process migration (cf. Section 4.2.4) or checkpoint & restart (cf. Section 4.2.3). Although AMPI is designed to allow dynamic load balancing, it can achieve application-transparent fault tolerance with only minimal code changes required. Special MPI calls such as MPI_Migrate and MPI_Checkpoint are added to support these operations. The implementation of AMPI is based on the runtime system of Charm++ [KK93]. This requirement limits the deployability of AMPI for a wide range of real-world scenarios.

Invasive MPI (IMPI) [Ure+12] is another attempt to enhance the flexibility of MPI. IMPI enables dynamic process management functionalities by adapting the underlying process manager. Urena et al. introduced a new implementation for the Simple Linux Utility for Resource Management Workload Manager (SLURM) named Invasive Resource Manager (IRM), which allows processes to be destroyed and spawned at program execution time. The added MPI call MPI_Comm_invade can be used to secure additional resources from the job scheduler. The new call MPI_Comm_infect enables the application to expand itself to the newly added resources by creating new process instances. MPI_Comm_retreat allows the application to migrate to a smaller number of process instances. Data migration can be achieved by the application by withdrawing itself to a subset of the initial process instances. However, the requirement of a modified job scheduler significantly limits its deployment in large-scale data centers.

In this dissertation, we follow an approach similar to ULFM-MPI by proposing a standard extension to the MPI Forum1. However, unlike ULFM, which focuses on failure mitigation after a process has failed, we focus on the proactive migration of data before any predicted fault actually occurs. Using MPI sessions, the most recent proposal for extending the MPI standard, we introduce a technique for dynamic process management. Our goal is to allow the number of participating process instances to change at any time during application execution. In this chapter, we introduce the basic concepts of MPI sessions. We further present our proposed standard extension as well as a prototype implementation of our extension – the MPI process sets. Finally, we show the feasibility of performing data migration with MPI sessions/sets by using the previously used examples of MLEM and LULESH.

1 The MPI Forum is the governing and standardization task force for the MPI standards. More information is provided at https://www.mpi-forum.org.

7.1. MPI Sessions

MPI sessions is a proposed extension to the MPI standard, which is likely to be integrated into the MPI standard version 3.2. The original idea of MPI sessions was introduced by Holmes et al. in 2016 [Hol+16]; active discussions and suggestions for improvement have been made and updated since then. According to Holmes et al. [Hol+16], the original intention of MPI Sessions was to solve a range of essential issues:

• Scalability limitations due to the dense mapping of the default communicator MPI_COMM_WORLD.

• Lack of isolation of libraries. MPI can only be initialized once and does not work well in combination with threads.

• Conservative role of the runtime systems. MPI does not really interact with the runtime system, such as the OS or job scheduler.

• Lack of support for use cases outside the HPC domain.

• Lack of support for fault tolerance, due to the impact of a failing process on the trivial MPI_COMM_WORLD.

The solution provided in the proposal by Holmes et al. [Hol+16] is based on the idea of making MPI less rigid. The main ideas are [Hol+16]:

1. Allowing live instantiation and deinstantiation of MPI library instances at runtime. This way, the MPI library can be instantiated many times, allowing multiple MPI instances to exist within the same process at the same time, each identified by an MPI session handle.

2. Decoupling of communicator and MPI library instance. Instead of using the trivial communicator MPI_COMM_WORLD, named sets of processes should be used. By using Uniform Resource Identifiers (URIs) (e.g., mpi://WORLD), new dynamic functionalities can be implemented without affecting backward compatibility. Furthermore, since the URIs are shared resources between the application, the MPI library, and the runtime system, information exchange among these components can be achieved.

3. Creation of new communicators without an existing communicator. Currently, MPI only allows the creation of new communicators from an existing communicator. With MPI sessions, creation of new communicators should be possible directly from a group, without the intermediate step of another communicator. This downplays the role of the trivial communicator MPI_COMM_WORLD.

4. Adding support for topology awareness. Topology-aware communicators can be created to allow efficient communication.

5. Enhancing communication efficiency by removing MPI_COMM_WORLD and MPI_COMM_SELF.

6. Improving usability by implicit calls to MPI_Init and MPI_Finalize.

This proposal was eventually edited into a draft document for the MPI standard – MPI standard 3.2. Two new objects, along with adaptations of the existing objects, have been introduced to implement the above-mentioned new functionalities [Mes18]:

• MPI Session: The MPI session is a local, immutable handle to an MPI library instance. It is created by calling MPI_Session_init and should be destroyed by MPI_Session_finalize. The MPI session object represents an MPI library instance and holds all the required information regarding configurations, groups, and communicators. With MPI sessions, the legacy MPI_Init and MPI_Finalize have become optional. These legacy calls are retrofitted to provide backward compatibility for the legacy MPI_COMM_WORLD and MPI_COMM_SELF communicators.

• Process Sets: A process set is used to hold information regarding the physical processes in which an MPI library instance is running. The detailed definition of process sets is still under discussion and has not yet been standardized in the draft [Mes18]. The process set object is used to map runtime system information to MPI instances.

• MPI Group: The semantics of MPI groups remain the same. However, a group is now bound to a process set in order to provide both backward compatibility and the dynamic environment introduced by MPI sessions. A new call MPI_Group_from_pset has been introduced to enable a programmer to create a group from an active process set.


• MPI Communicator: Similar to the MPI group, the semantics of MPI communicators remain unchanged. However, a new call MPI_Comm_create_from_group() is introduced to allow communicator creation without any preexisting communicators. This is required to loosen the requirement on the MPI_COMM_WORLD and MPI_COMM_SELF communicators.

Figure 7.1 illustrates the relations between these objects in MPI with the sessions concept. The yellow objects have already been proposed by the MPI Forum, while the gray ones have not been proposed yet. An MPI communicator is derived from an MPI group, and an MPI group is bound to a process set. The process set takes information from the runtime system and is part of an MPI session.

Figure 7.1.: Relations between Different Objects in MPI Sessions


7.1.1. MPI Sessions and Fault Tolerance

As Holmes et al. [Hol+16] point out, MPI sessions attempt to make the currently mandatory calls MPI_Init and MPI_Finalize as well as the global objects MPI_COMM_WORLD and MPI_COMM_SELF optional. This way, failure mitigation and fault tolerance are much easier to accomplish because a fault does not automatically propagate to a global level. This goal complements current efforts in the MPI Forum, which is considering mechanisms to determine faults in communicators and trigger a mitigation on the detection of any failure [Bla+12].

As we pointed out in the evaluation of LAIK (cf. Section 6.4), a current limitation of LAIK is that its MPI-based backend does not provide fault tolerance. Consequently, we cannot shut down “empty” processes which are free from workload. However, with the MPI sessions concept, it is possible to implement a slightly adapted backend for LAIK, which is truly fault-tolerant, by updating the participating process sets and shutting down the “empty” processes.

7.1.2. MPI Sessions and Data Migration

Apart from data migration with LAIK, maintainers of existing applications can also achieve data migration with the proposed MPI sessions. Figure 7.2 illustrates a simple process of data migration based on the MPI sessions proposal. Unlike LAIK, since MPI sessions can exchange information with the runtime system, a failure prediction or a change in an active process set can be passed to the application by MPI sessions. On receiving the information about an upcoming failure, manual repartitioning and data redistribution are performed. Afterward, the current process set can be adapted, and the new MPI group and communicator can be created. Finally, communication routines can be migrated to the new communicator. As one can easily see, the role of the process set is crucial for the functionality of data migration and fault tolerance. As MPI sessions are local (which means they are only known by the process which creates them) and static (which means they cannot be changed after their creation), any failure prediction at runtime has to be supported by the process sets. Therefore, in the remainder of this chapter, we propose and discuss several design considerations for MPI sessions/sets, as well as our reference implementation, in order to support fault tolerance based on data migration. At the end of this chapter, we present our first evaluation results of fault tolerance based on data migration using MPI sessions/sets.


Figure 7.2.: Activity Diagram for Data Migration with MPI Sessions

7.2. Extension Proposal for MPI Sessions: MPI Process Sets

To allow dynamic data migration in MPI sessions, we have created a comprehensive design for the MPI process sets within the context of MPI sessions. The main responsibility of the MPI process sets module is to relay information from the runtime system to both the MPI library and the application. This way, both the MPI library and the application can utilize this information and react to any change, such as a failure or a revocation of resources. Figure 7.3 illustrates the interfaces required in our design to achieve these functionalities. The most important interfaces are the PMIx and key-value store interfaces in Figure 7.3, which are responsible for communication with the runtime system. In the remainder of this section, we present the detailed design considerations for our MPI process sets prototype. In addition to this dissertation, some basic concepts and detailed discussions on design considerations have been published by Anilkumar Suma [Sum19].


Figure 7.3.: Interfaces for MPI Process Sets


7.2.1. Components in Our MPI Sessions / MPI Process Sets Design

To illustrate the responsibility of the different components in our MPI sessions and MPI process sets design, we have summarized all the components illustrated in Figure 7.3 and their respective interfaces in Table 7.1. Note that although the interface to the runtime system is a crucial part of our design, we do not provide an implementation of it in our prototype, because the prototype is designed to demonstrate the possibility of achieving data migration using the MPI sessions concept. The modular and layered design of our MPI sessions concept makes it possible to exchange key components with other implementations. For example, Anilkumar Suma [Sum19] has provided an implementation of our sessions design using the FLUX [Ahn+14] runtime system.


Table 7.1.: Overview of Components and Their Responsibilities within Our MPI Sessions / MPI Process Set Design

MPI Sessions / MPI Process Sets Library
  Responsibility: Providing the dynamic process set management, facilitating and delivering the additional failure prediction, as proposed by the MPI Forum and in this work.
  Application Programming Interface: The MPI sessions interface, which is the proposal for standardization and is used as an extension to the existing MPI implementation.
  Required Interfaces: MPI, Key-Value Store, Runtime System (not in our prototype).

MPI Library
  Responsibility: Providing the functionality of the current MPI standard. In this work, MPI standard version 3 is used.
  Application Programming Interface: MPI (mpi.h), the MPI library.
  Required Interfaces: As specified in the MPI standard.

Key-Value Store
  Responsibility: Providing the required distributed storage space for MPI sessions. Maintaining consistency of the information stored by our MPI sessions library.
  Application Programming Interface: The Key-Value Store Interface, which provides basic getter and setter functionalities for string-based key-value pairs.
  Required Interfaces: General: none. Our prototype: Linux shared memory.

Runtime System (not in our prototype)
  Responsibility: Providing real-time and stateful information on the runtime system, such as process states, locations, topology, etc.
  Application Programming Interface: Suggested: PMI [Bal+10], PMIx [Cas+17].
  Required Interfaces: As specified by PMI/PMIx.


7.2.2. Semantics of MPI Process Sets

To understand our reasons for designing the MPI process sets, we first introduce the definition of an MPI (process) set. In our prototype, we adhere to the proposal by the MPI Forum [Mes18], in which each set is identified by a Uniform Resource Identifier (URI), starting with the identifier mpi:// or app://. Examples of MPI set names are listed in Figure 7.4. The semantics of our MPI process sets are defined in accordance with the following rules:

• An MPI set can be created either by the runtime system (noted as mpi://, cf. Figure 7.4(1)–(8)), or by the user application itself (noted as app://, cf. Figure 7.4(9)–(10)).

• An MPI set can be hierarchical, for example in order to represent different failure domains. A failure domain is an organizational entity within a cluster, such as a rack, a chassis, or a blade. A more coarse-grain failure domain can be further divided into multiple fine-grain failure domains, which are denoted as paths in the URI of MPI process sets. An example of different failure domains is illustrated in Figure 7.5, and the corresponding set names are given in Figure 7.4(5)–(8).

• An MPI set is immutable. This means that it can neither be changed nor deleted once it has been created, as long as at least one active MPI session is still bound to the given set.

• A change to any given, existing process set is achieved by issuing a new version of the process set. A simple integer number at the end of the path in the URI for a given set is used to reference a specific version of this process set (cf. Figure 7.4(2)–(3), (6), and (8)). An older version of a process set remains valid until no reference exists for that specific version. Moreover, an older version of the process set is called “inactive”, and the most recent version of the process set is called “active”.

• If the URI for a given set is used without its version number (e.g., Figure 7.4(1)), the currently active process set is referenced.

• Our design for the MPI process set features semi-global accessibility. Each process can only see and access those process sets of which it is a member. We believe that this design provides the best usability while preserving scalability. Figure 7.6 illustrates this design.


(1) mpi://world
(2) mpi://world/1
(3) mpi://world/2
(4) mpi://self
(5) mpi://island1/rack1/
(6) mpi://island1/rack1/1
(7) mpi://island1/rack1/chassis3/blade0
(8) mpi://island1/rack1/chassis3/blade0/1
(9) app://mlem
(10) app://lulesh

Figure 7.4.: Examples for MPI Set Names

Figure 7.5.: Illustrations of Different Failure Domains (island, rack, chassis, blade)


Figure 7.6.: Illustration of the Semi-global Accessibility of MPI Process Sets

7.2.3. Storage of Process Set Information

The semantics of the MPI process sets leave one question unanswered: What is responsible for storing the set information, which needs to be consistent across all MPI processes? The short answer is: the runtime system. With our design, two different types of process sets are introduced: architectural process sets, which are provided by the MPI library or the runtime system according to the architecture or topology of the cluster system; and application process sets, which are provided by the user application. Since the architectural process sets are expected to be greater in number due to the different granularities, we believe it is best practice to make the runtime system responsible for the process set related information. This way, the identifiers and handles of the architectural process sets can be generated dynamically on demand, thus reducing the impact on scalability. This means that all the application-related process sets also have to be stored in the runtime system, potentially using a callback API provided by the runtime system or through a distributed key-value store. This approach is also in line with recent developments in interface standards for runtime systems, such as PMIx [Cas+17], the proposed extension of the well-known PMI standard [Bal+10].

To distinguish the MPI processes, a simple new scheme has been introduced to uniquely identify them. Currently, the identification of the MPI processes can simply be done by calling the MPI_Comm_rank() function on the default MPI_COMM_WORLD communicator. However, without the world communicator, the MPI process sets need to store a mapping between the physical process instances and the abstract process sets. For our prototype, this process identifier is generated automatically by calling MPI_Get_processor_name, combined with the process ID from the Operating System (OS). Here is an example: an MPI process running on the machine node1.mysupercluster.de with the process ID 1024 bears the globally unique identifier shown below.

node1.mysupercluster.de_1024 (7.1)
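Such an identifier can be assembled, for example, from MPI_Get_processor_name and the OS process ID. The helper below is a minimal sketch; its name is hypothetical and merely illustrates the scheme.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper that builds the identifier "<hostname>_<pid>"
   described above, e.g. node1.mysupercluster.de_1024. */
static void make_process_id(char *buf, size_t len)
{
    char host[MPI_MAX_PROCESSOR_NAME];
    int  namelen = 0;
    MPI_Get_processor_name(host, &namelen);
    snprintf(buf, len, "%s_%ld", host, (long)getpid());
}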

7.2.4. Change Management of the MPI Process Set

To achieve the best possible scalability, we provide an asynchronous design for change management in MPI process sets. The sequence diagram in Figure 7.7 illustrates this design, which can be summarized as follows:

• If an application wants to react to changes to a process set, it issues a watch on that process set by calling MPI_Session_iwatch_pset(). An MPI_Info object is created as a handle to the watch issued by the application.

• If a change occurs from outside (or is explicitly triggered by the application), this information is queued by the MPI runtime, which also creates a new version of the set asynchronously. This way, the latency is hidden in the MPI library, which reacts to the change event.

• At a later stage of program execution, the application can query the status of the watch by calling MPI_Session_check_psetupdate() with the previously created MPI_Info object. As the change event has already been processed by the MPI runtime, this call returns almost immediately.

• The application is now aware of this change and can obtain the latest version of the process set, perform data migration and continue its operation.

Besides reacting to changes which come from the runtime system (such as faults), our design also enables the application to delete processes from (or add processes to) a process set. For this purpose, we have introduced the calls MPI_Session_deletefrom_pset and MPI_Session_addto_pset. This use case is especially interesting if an application needs to switch to a smaller configuration with fewer processes, e.g., after the prediction of a failure.
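The following fragment sketches this use case. The argument lists of the proposed calls are assumptions (the interface is documented in Appendix E.1); the set name and the process identifier are taken from the examples in this chapter.

/* Shrink the application's own set after a failure prediction for one node.
   Sketch only: the exact signatures are assumptions, cf. Appendix E.1.      */
MPI_Session_deletefrom_pset(session, "app://app1", "node1.mysupercluster.de_1024");
/* The change is queued by the library and results in a new version of the
   set (e.g. app://app1/2); a watch issued earlier with
   MPI_Session_iwatch_pset() reports it at the next call to
   MPI_Session_check_psetupdate().                                           */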


Figure 7.7.: Sequence Diagram of MPI Process Set Change Management

7.2.5. Implementation of Our MPI Process Set Module

For demonstration and evaluation purposes, we have implemented a standalone static library which implements the functionality of MPI sessions and process sets (in the following: MPI sessions library). The main reason for this decision is that mainstream open-source MPI distributions (OpenMPI, MPICH, LAM/MPI, and MVAPICH2) are hard to modify because they lack modularity and are tuned for performance. Since our aim is to demonstrate and evaluate our concept, the adaptation of an existing MPI library is not required. For this reason, our MPI sessions library is designed to work alongside any existing MPI library. For the development, we have chosen OpenMPI version 3.1 as a reference. The resulting architecture of our library, the OpenMPI library, and the application is illustrated in Figure 7.8.

In Figure 7.8, we can see that besides the sessions library, the required key-value store has also been implemented by us. This is required in order to store and distribute information related to process sets (cf. Section 7.2.3). In our prototype implementation, we use Linux shared memory to implement this key-value store. This leads, however, to a limitation of our current prototype, which only supports shared memory systems. Our prototype features three different types of interfaces:

• MPI sessions interfaces are the new calls proposed by the MPI Forum [Mes18] and us. They provide the new functionalities introduced by our proposal for the application. Detailed documentation of all the calls in this interface is provided in Appendix E.1.

• Internal MPI_S interfaces are connections from our library to a standard MPI library in order to provide the required MPI functionalities, which are used in our prototype. Except for MPI_Comm_spawn, all the MPI_S interface calls are standard MPI functions provided by the underlying MPI library without any modification.

• Key-value Store interfaces are a collection of function calls that are used to mimic the runtime system with a key-value store. This way, we simulate the interaction with the runtime system. Detailed documentation of all the calls in this interface is provided in Appendix E.2.

As mentioned above, except for MPI_Comm_spawn, our library does not replicate nor override any existing MPI function; therefore, it is fully compatible with any existing application without modification.
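To illustrate the kind of interface our key-value store exposes (cf. Table 7.1), the following is a minimal sketch of a string-based getter and setter over POSIX shared memory. All names are hypothetical, synchronization is omitted for brevity (our prototype serializes accesses, cf. Section 7.2.6), and the actual interface is documented in Appendix E.2.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define KV_SLOTS 256
#define KV_LEN   128

typedef struct { char key[KV_LEN]; char val[KV_LEN]; } kv_entry;

/* Attach to (or create) a shared-memory segment holding the store. */
static kv_entry *kv_attach(const char *name)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(kv_entry) * KV_SLOTS) != 0) return NULL;
    void *p = mmap(NULL, sizeof(kv_entry) * KV_SLOTS,
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return (p == MAP_FAILED) ? NULL : (kv_entry *)p;
}

/* Insert or overwrite a key-value pair. */
static void kv_set(kv_entry *kv, const char *key, const char *val)
{
    for (int i = 0; i < KV_SLOTS; i++) {
        if (kv[i].key[0] == '\0' || strcmp(kv[i].key, key) == 0) {
            strncpy(kv[i].key, key, KV_LEN - 1);
            strncpy(kv[i].val, val, KV_LEN - 1);
            return;
        }
    }
}

/* Look up a key; returns NULL if it is not present. */
static const char *kv_get(kv_entry *kv, const char *key)
{
    for (int i = 0; i < KV_SLOTS; i++)
        if (strcmp(kv[i].key, key) == 0) return kv[i].val;
    return NULL;
}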

Figure 7.8.: Schematic View of Our MPI Sessions / MPI Process Set Library


7.2.6. Known Limitations of Our MPI Sessions / MPI Process Set Prototype

The goal of our prototype is to demonstrate the feasibility of achieving fault tolerance with the MPI sessions / MPI process sets concept. With sufficient progress, we can drive the related discussions into the MPI Forum and help push MPI sessions and MPI process sets into the standardization process. Consequently, we do not implement all the MPI sessions function calls that have been suggested by the MPI Forum, as listed in Appendix D. Instead, only a subset of functions has been implemented in order to provide the minimum required set of functionalities on top of standard MPI 3.

Moreover, our prototype works only on a shared memory system, due to the limitation of the key-value store that we have implemented based on shared memory. This is a major functional limitation of our prototype. Previous attempts have been made to utilize third-party key-value stores to support distributed memory. Our MPI sessions prototype has also been adapted by Anilkumar Suma [Sum19] to support the key-value store of FLUX [Ahn+14], which is a proposed next-generation resource manager. However, the immature nature of FLUX makes it hard to deploy for evaluation. For this reason, we have selected our shared memory key-value store for evaluation purposes.

The specification of the process sets in which the application is running is currently done by a configuration file. Our library supports process sets by simulating them with a subset of the MPI processes started by mpirun. Such subsets can be specified by the user by composing a configuration file with pairs of process-set names and process-rank ranges. An example is illustrated in Figure 7.9. The process set app1, which is used by the user application, is specified with a configuration file containing the content shown below.

app1 0-2 (7.2)

Finally, our current prototype has not yet been tuned for performance. Our current design relies heavily on the key-value store to provide the required data and state consistency across all MPI process instances. We have chosen serial consistency [Mos93] as a simple but effective consistency model. Unfortunately, this model also introduces additional serialization overhead when accessing the key-value store. Our simple implementation of the key-value store has not yet been optimized to provide the best performance for this prototype.


Figure 7.9.: Simulated MPI Process Set (mpirun -np 5 creates 5 MPI processes, of which only 3 are members of the process set app://app1)

Our prototype implementation used for evaluation is accessible as open source software on GitHub2. In the next section, application examples for achieving data migration with MPI sessions and MPI process sets are given and discussed. For the sake of simplicity, the term MPI sessions is used to refer to our prototype implementation in the remainder of this chapter.

7.3. Evaluation of MPI Sessions and MPI Process Sets

7.3.1. Basic Example of an MPI Sessions Program

Before we get started with our evaluation of MPI sessions, let us implement the same example application in MPI sessions as before (cf. Section 6.2), i.e., the simple vsum example application. Recall that vsum is an application which calculates the sum of a given vector A, stored as a simple C array. For simplicity's sake, we assume that the length of the vector is divisible by the number of processes. The MPI implementation of this simple application is illustrated as pseudocode in Algorithm 9. The basic transition of the application to our MPI sessions is trivial because we provide backward compatibility. No modification is required at all. To enable the MPI session, simply replace MPI_Init with MPI_Session_init and MPI_Finalize with MPI_Session_finalize.

2https://github.com/caps-tum/MPI-Sessions-Sets, accessed in July 2019


Algorithm 9: MPI Implementation of Example Application vsum
Input: A – the input vector
Output: y – the result of the vector sum

MPI_Init(argc, argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_processes);
n ← A.length() / num_processes;
myStart ← n * my_rank;
myEnd ← myStart + n;
for i ← myStart to myEnd − 1 do
    y += A[i];
end
MPI_Allreduce(MPI_IN_PLACE, &y, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
MPI_Finalize();

Since MPI_COMM_WORLD does not exist anymore within the context of MPI sessions, a new world communicator has to be created. This can be done with the calls shown in Figure 7.10. Instead of the MPI_COMM_WORLD communicator, the newly created world communicator Comm_world must be used. To achieve fault tolerance with data migration, a simple asynchronous process set watch can be issued using MPI_Session_iwatch_pset. This way, potential changes in the currently active process set (mpi://world) can be detected by calling MPI_Session_check_psetupdate. After detecting a change, a new world communicator can be created by calling the function sequence indicated in Figure 7.10 again. However, unlike with LAIK, the application itself is responsible for preserving any data that are still needed after this change.

MPI_Info setInfo;
MPI_Group gWorld;
MPI_Comm Comm_world;
MPI_Session_get_setinfo(&session, "mpi://world", &setInfo);
MPI_Group_create_from_session(session, "mpi://world", &gWorld, setInfo);
MPI_Comm_create_from_group(gWorld, NULL, &Comm_world, setInfo);

Figure 7.10.: Creating the World Communicator with MPI Sessions


7.3.2. Application Example 1: The Maximum Likelihood Expectation Maximization (MLEM) Algorithm

In this section, we introduce the changes required for porting the Maximum-Likelihood Expectation-Maximization (MLEM) algorithm to MPI sessions. Recall that MLEM is the algorithm used to reconstruct images from a medical PET scanner (cf. Section 6.3.1). The main kernel of MLEM consists of four major steps: forward projection, correlation, backward projection, and update. The forward and backward projections are the major computationally intensive sub-kernels, and both perform Sparse Matrix-Vector (SpMV) multiplication. The input data are the system matrix stored in CSR format, the list-mode sinogram, and the total number of iterations. A detailed description of the MLEM algorithm is presented in Section 6.3.1.

Porting MLEM to MPI Sessions

The porting process for MLEM in order to support MPI sessions is trivial. Like many classic MPI applications, MLEM operates on the MPI_COMM_WORLD communicator. This has to be replaced by a communicator derived from an MPI set. For this, we have created a process set with the name app://mlem. Similar to the trivial example given in Section 7.3.1, a communicator is created using this set by calling the functions indicated in Figure 7.11. Using a search and replace, all occurrences of MPI_COMM_WORLD are replaced by the global variable mpi.current_comm, which relies on the process set app://mlem. MPI_Finalize is also replaced by MPI_Session_finalize accordingly.

MPI_Session_init(&mpi.session);
Info ps_info;
Session_get_set_info(&mpi.session, "app://mlem", &ps_info);
Group_create_from_session(&mpi.session, "app://mlem", &mpi.current_group, ps_info);
Comm_create_from_group(mpi.current_group, NULL, &mpi.current_comm, ps_info);

Figure 7.11.: Creating the MLEM Communicator with MPI Sessions

After this modification, the MPI sessions version of MLEM is already fully operational. No additional changes are required because our MPI sessions implementation ensures full compatibility with existing MPI implementations. We have not changed any data structures, communication patterns, or compute kernels. Consequently, by design, no performance impact is imposed by our implementation.


Enabling Fault Tolerance by Using Data Migration

Figure 7.12.: Activity Diagram of Original MPI-based MLEM

The original MPI-based MLEM application can be summarized in the activity diagram shown in Figure 7.12. In this iterative algorithm, the four basic kernels (CalcForwardProj, CalcCorrelation, CalcBackwardProj, and CalcUpdate) and one communication step (AllReduce) are executed in each iteration. This means that the current result increment – the image update – is communicated across every participating process at the end of each iteration. Therefore, no data redistribution is required for fault tolerance with data migration. Furthermore, the system matrix does not require a migration either, even though it is partitioned at the beginning of the application. This is because the SplitMatrix function only returns an abstract range of rows to be processed by each MPI process instance, and the matrix itself is only mapped into memory using mmap [Joy+83]. The data of the matrix is only cached in main memory. Consequently, the data can be remapped after repartitioning upon a fault. No communication of the actual matrix is required.

To obtain information on changes in an MPI set, MPI_Session_iwatch_pset is called to issue a watch on the current process set as part of the modified InitializeMpi routine. This way, any process set changes can be determined at a later stage by calling MPI_Session_check_psetupdate. If a change is detected, a new group and communicator can be created from the latest version of the process set. As mentioned before, no communication or data redistribution is required at all. Repartitioning is achieved by calling the SplitMatrix function. Figure 7.13 illustrates the adapted application with fault tolerance capability; the above-mentioned changes are highlighted in blue, and a sketch of the adapted main loop is given below.
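The following is a minimal sketch of the adapted main loop. The kernel and helper names follow Figures 7.12 and 7.13, the signatures of the watch-related calls are assumptions (cf. Appendix E.1), and the watch handle is assumed to have been created in InitializeMpi via MPI_Session_iwatch_pset as described above.

for (int it = 0; it < max_iter; it++) {
    CalcForwardProj(); CalcCorrelation(); CalcBackwardProj(); CalcUpdate();
    /* AllReduce of the image update on mpi.current_comm (cf. Figure 7.12)   */

    int changed = 0;
    MPI_Session_check_psetupdate(&mpi.session, watch, &changed);  /* assumed */
    if (changed) {
        /* recreate mpi.current_group and mpi.current_comm from the latest
           version of app://mlem (call sequence of Figure 7.11), then call
           SplitMatrix again; the matrix rows are simply remapped via mmap,
           so no data needs to be communicated                               */
    }
}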


Figure 7.13.: Activity Diagram of MPI Session-based MLEM with Fault Tolerance

Evaluation

Our MPI session-based MLEM implementation is provided as open source software on GitHub3. We have tested the MLEM application for correctness and effectiveness by shrinking the process set of MLEM, which worked as expected. Although our MPI sessions prototype is not designed for performance evaluation, we have run a small-scale test on a single-node server. The results show that no overhead at all is introduced by using our prototype for the MPI session-based MLEM implementation. Porting MLEM took approximately two hours for a programmer who is experienced with MPI sessions but not with MLEM, and who received some minor assistance from the MLEM developers.

3 https://github.com/envelope-project/mlem, accessed in June 2019

7.3.3. Application Example 2: The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)

As in the evaluation of LAIK (cf. Section 6.3.5), LULESH is the second application we have evaluated for our MPI sessions prototype. To recap, the Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) is a benchmark targeting the solution of the SBP [Sed46] by using Lagrangian hydrodynamics [HKG], and it represents a class of classic HPC applications. It features two types of data structures, the elemental data and the nodal data, which are accessed in a halo and an overlapping manner, respectively. Moreover, these data are used by compute kernels which require neighborhood communication. The iterative solver terminates either based on a predefined convergence goal or after a specified number of iterations. Details of the LULESH application are given in Section 6.3.4.

Porting LULESH to MPI Sessions

Unlike MLEM, LULESH is a very complex application, which sets up its communication routines, targets, and buffers at the beginning, within the initMeshDecomp call. Furthermore, as an application tuned for performance, all the data structures are stored in a struct of arrays whose size is not designed to be modified. In order to achieve fault tolerance by using data migration, two major porting steps are required: transform the existing MPI communication to MPI communication with MPI sessions, and then change the data structures to allow data migration. Unlike with LAIK, no new partitioners are required, as we can use the existing partitioner provided by the reference LULESH implementation. Detailed documentation of the changes made to the LULESH code is given below.

1. Enabling MPI Sessions by Eliminating Trivial MPI Objects in LULESH. The rather simple first step is to create a communicator from the application's process set (named app://lulesh). This step is done by replacing MPI_Init with the same sequence of MPI sessions calls as in the case of MLEM. This communicator is stored as a global variable for the sake of simplicity. With a search-and-replace, we replace all accesses to the trivial MPI_COMM_WORLD communicator with accesses to the communicator that we have just created. The replacement of MPI_Finalize with MPI_Session_finalize concludes the first step. With this modification, we have created a semantically identical implementation of LULESH using the MPI sessions library. We have also tested this intermediate implementation for its correctness.

2. Adapting All the Data Structures. To achieve fault tolerance, the data containers held in each Domain (cf. Section 6.3.4) have to be elastic. Unlike when porting to LAIK, there is no difference between data structures that require communication and those that do not. For simplicity, we replace all the array structures in the Domain class with std::vectors. This way, we utilize the built-in resize feature of std::vector to reallocate memory space for the data containers.


3. Creating Mapping Functions for the Local and Global View of a Domain. For the sake of simplicity and better understanding, we have selected the global repartitioning method (cf. Section 5.2). Two mandatory functions for data repartitioning are created to calculate the new mesh decomposition and data location. Figure 7.14 illustrates these two functions. The pack function takes the current mesh decomposition and calculates the indexing of the respective data structure in a global view within the whole mesh. It then copies the data from the local domain into a vector that holds data from the whole mesh for further processing. Entries in the element (or node) data structures that are not owned by the current process are initialized with MAX_DOUBLE. The unpack function is the exact opposite of the pack function: it takes a data structure in the global view and copies the data relevant for the local process into a local domain object.

4. Creating Functions for Data Migration and Replacement of the Communicator. To enable fault tolerance, we have added data migration routines to the main loop of LULESH. Since we aim for demonstration purposes and not for best performance, our data migration is achieved by calling MPI_Allreduce with the MPI_MIN reduction on the packed global vector; a sketch of this step is given after Figure 7.14. All data structures given in Appendix C have to be communicated and transferred. After the data migration, we create a new MPI group and communicator from the latest version of the MPI process set. This new communicator replaces the existing communicator created in step one. The InitMeshDecomposition function is then called again to create new local domains for further processing. Finally, the globally distributed data is unpacked into the new local domains using the unpack function created in the previous step. All the temporary global vectors are then freed, as they are no longer required.

5. Enabling Fault Tolerance by Adding a Watch to the MPI Process Set. The last step in our porting effort is to add a process set watch at MPI session initialization time by calling MPI_Session_iwatch_pset. Similar to the MLEM example, the status of this watch is queried after each iteration. Unlike MLEM, LULESH does not perform a global synchronization in each iteration. For this reason, we have inserted an MPI_Barrier to ensure synchronization before checking for process set changes with MPI_Session_check_psetupdate. Any processes that are no longer part of the process set call MPI_Session_finalize and go idle.


Figure 7.14.: Schematics of Local Numbering vs. Global Numbering for LULESH Elements


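As referenced in step 4, the following is a minimal sketch of the pack/migrate/unpack idea for a single field. It assumes, for simplicity, a contiguous local-to-global index range per process; our actual implementation handles the full 3D decomposition and all the fields listed in Appendix C, and the function name below is hypothetical.

#include <mpi.h>
#include <limits>
#include <vector>

// Migrate one field via a packed global vector and an MPI_MIN reduction.
// Entries not owned by a rank stay at the maximum double value and are
// therefore overwritten by the owner's values during the reduction.
static void migrate_field(const std::vector<double>& local, int global_count,
                          int first_global_index, MPI_Comm old_comm,
                          std::vector<double>& global)
{
    global.assign(global_count, std::numeric_limits<double>::max());   // pack
    for (std::size_t i = 0; i < local.size(); i++)
        global[first_global_index + i] = local[i];

    MPI_Allreduce(MPI_IN_PLACE, global.data(), global_count,
                  MPI_DOUBLE, MPI_MIN, old_comm);                       // migrate

    // After the new communicator and mesh decomposition have been created,
    // each remaining rank copies its new index range out of 'global' (unpack).
}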

The implementation of our adapted LULESH is available as open source software on GitHub4. The porting effort was approximately ten hours for two programmers: one with in-depth knowledge of MPI sessions, and the other with in-depth knowledge of LULESH. It is important to mention that, similar to the LAIK-based LULESH implementation, we do not change the precondition imposed by LULESH of requiring a cubic number of processes for execution.

Evaluation

We conducted a series of experiments similar to those for laik lulesh as described in Section 6.3.5. In all experiments, we compared the reference LULESH code provided by LLNL on their GitHub5 (in the following, reference lulesh), our laik lulesh code as introduced in Section 6.3.4, and the MPI sessions based implementation (in the following, sessions lulesh). As sessions lulesh only supports shared memory systems, we selected the CoolMUC-III cluster system at the Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaften (LRZ) for our experiments. Each node of the CoolMUC-III system features an Intel Xeon Phi 7210F processor with 64 physical cores. Each core provides up to four hyper-threads, resulting in a total of 256 logical processors on each node. A detailed description of the CoolMUC-III system is given in Appendix A.3. For our experiments, no specific configuration for the High Bandwidth Memory (HBM) on the Xeon Phi chip was selected. The default configuration is all-to-all associativity with HBM working in cache mode [Sod15].

All the binaries, including the MPI sessions library and the LAIK library, were compiled with Intel Compiler version 17. OpenMP was disabled, as our prototype is not thread-safe. All experiments were repeated ten times to minimize the impact of other disturbances such as the OS and the runtime system. The results from all the experiments are documented in Figures 7.15 – 7.19. Detailed results are given in the following.

Weak Scaling

Weak scaling experiments were conducted with the settings -s 15 -i 1000 for all three applications (i.e., sessions lulesh, laik lulesh, and reference lulesh). These applications were run with n = 1, 8, 27, 64, 125, 216 MPI processes. The results for weak scaling are presented in Figure 7.15. The number of MPI processes is indicated on the horizontal axis. The normalized runtime per iteration in seconds is shown on the vertical axis. All the values noted in the charts represent the arithmetic mean of the ten independent runs.

As expected, all three applications performed equally well without hyperthreading. There is no significant difference or overhead in terms of runtime scaling behavior between these versions. This indicates a low overhead introduced by our sessions library. However, a very small overhead can be observed for sessions lulesh with an increasing number of processes. This is most likely due to the additional synchronization overhead introduced by the added MPI_Barrier, which is used to check the change status of the MPI process set. The results from the hyperthreaded part are not useful for performance analysis. However, the significant increase in runtime in the hyperthreaded experiments also supports our theory of synchronization overhead introduced by the MPI_Barrier, because all processes must synchronize at this point.

4 https://github.com/caps-tum/mpi-sessions-lulesh, accessed in June 2019
5 https://github.com/LLNL/LULESH, accessed in June 2019


Figure 7.15.: Weak Scaling Comparison of reference LULESH, sessions LULESH, and LAIK lulesh on CoolMUC-III


Strong Scaling

Strong scaling experiments were conducted by fixing the global mesh size C (cf. Section 6.3.5). In this case, we have selected C = 216000. This resulted in the following configurations (n, s) for the experiments with n = 1, 8, 27, 64, 125, 216 processes:

(1³ = 1, 60), (2³ = 8, 30), (3³ = 27, 20), (4³ = 64, 15), (5³ = 125, 12), and (6³ = 216, 10).    (7.3)
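Each per-process problem size s follows directly from the fixed global mesh size, assuming C = n · s³ (which all pairs in (7.3) satisfy):

s = (C / n)^{1/3},   e.g., for n = 64:   s = (216000 / 64)^{1/3} = 3375^{1/3} = 15.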

We then fixed the maximum number of iterations (-i) at 1000. The results for strong scaling are provided in Figure 7.16, where the horizontal axis shows the number of processes and the vertical axis shows the normalized runtime per iteration in seconds on a logarithmic scale (base 3). The corresponding speedups achieved for strong scaling are presented in Figure 7.17, where the vertical axis shows the achieved speedup. All the values shown on the charts represent the arithmetic mean of the ten independent runs.

Similar to weak scaling, no significant difference between reference lulesh, laik lulesh, and sessions lulesh is observable. The effect of the additional MPI_Barrier is also present in the strong scaling experiments, most clearly seen on the speedup chart in Figure 7.17. All three applications show the same scaling tendency. Our MPI sessions library does not significantly change the performance of LULESH. Remarkably, laik lulesh even outperforms the reference implementation in some configurations. Our explanation for this phenomenon is an interference between the adapted data structure in LAIK (data is stored as “slices” of independent data regions rather than as a contiguous block of data) and the high bandwidth memory used in cache mode. However, since performance engineering is not the goal of this dissertation, we have not conducted any further analysis of this phenomenon. Again, the scaling curve for the hyperthreaded configurations confirms our theory that the overhead of sessions lulesh is mainly due to the additional MPI_Barrier.


Figure 7.16.: Strong Scaling Comparison of reference LULESH, sessions LULESH, and LAIK lulesh on CoolMUC-III


Figure 7.17.: Strong Scaling Speedups of reference LULESH, sessions LULESH, and LAIK lulesh on CoolMUC-III


Repartitioning and Fault Tolerance

Our last experiment was intended to show the capability of achieving fault tolerance using repartitioning and data migration. An experiment similar to the one described in Section 6.3.5 was used here. Using the same setup as in the strong scaling test (cf. (7.3)), we migrated the application to the next supported configuration with a smaller number of processes, e.g., from 64 processes to 27 processes (4³ → 3³). We documented the normalized runtime per iteration before and after repartitioning for both sessions lulesh and laik lulesh for comparison. Ideally, the normalized runtime per iteration should be the same for an execution before and after repartitioning with the same number of processes. For example, for one test we started the experiment using 64 processes and then reduced it to 27 after 500 iterations. The normalized runtime for this configuration after repartitioning (27 processes) should be the same as for an experiment using 27 processes before repartitioning.

The results of our experiments are shown in Figure 7.18. The configuration used for a specific experiment is shown on the horizontal axis. The normalized runtime per iteration is shown on the vertical axis. All the values noted in the charts represent the arithmetic mean of the ten independent runs. Blue-colored bars indicate results from laik lulesh, and red-colored bars stand for results from sessions lulesh. The darker-colored bars denote the normalized runtime per iteration before repartitioning, while the lighter-colored bars denote the normalized runtime per iteration after repartitioning. From Figure 7.18, we can see the expected ideal behavior as described above. Each dark-colored bar of a given color (blue or red) matches the light-colored bar of the next configuration to the right of it. This means that both sessions lulesh and laik lulesh are equally efficient for data migration. The data migration is successful and does not change the runtime behavior of the application afterwards. This is the best-case scenario we aimed to achieve with our implementations.

As a final test, we evaluated the time required for executing the repartitioning and data migration processes. This result is presented in Figure 7.19. The configuration is depicted on the horizontal axis. The time needed for repartitioning is shown on the vertical axis. It can be clearly seen that the MPI sessions implementation outperforms laik lulesh. This is because the sessions implementation uses MPI_Allreduce rather than complex point-to-point communication. For data migration, laik lulesh calculates a transition plan, which is then tuned for minimum communication demand. However, the communication overhead in a shared memory system is minimal, so the overhead of the transition plan calculation becomes dominant. Our sessions lulesh benefits from the simple communication pattern, which leads to a smaller time requirement. Nevertheless, both approaches are fairly fast for data migration and are suitable for achieving fault tolerance with data migration.


Figure 7.18.: Repartitioning Effect of sessions lulesh and laik lulesh on CoolMUC-III


Figure 7.19.: Time for Repartitioning for sessions lulesh and laik lulesh on CoolMUC-III


7.4. Discussion on the Effectiveness of MPI Sessions and MPI Process Sets

In this chapter, we have introduced the concept of MPI sessions as proposed by the MPI Forum, as well as our own extension to this proposal – the MPI process sets. We have implemented a prototype of our extension covering a minimum set of MPI sessions and MPI process sets functionalities in order to verify its effectiveness in achieving fault tolerance by using data migration. With two real-world applications, MLEM and LULESH, we have demonstrated the functionalities of MPI sessions and MPI process sets. Although our prototype has not been optimized for performance evaluation, we have shown that the scalability of LULESH is not affected. We conclude this chapter in the following subsections by summarizing the discussion and presenting our conclusions on the MPI sessions and MPI process sets interfaces.

7.4.1. Functionality of MPI Sessions and MPI Process Sets

The concept of MPI sessions is an active topic in the MPI Forum. Many discussions on important topics related to MPI sessions have taken place within the MPI Forum, including but not limited to:

• Dynamic Session vs. Static Session: Discussions on whether an MPI session can be changed after it has been created. Recent discussions from 2019 prefer the static-session approach.

• Local Session vs. Global Session: Discussions on whether an MPI session is a local handle to a local instance of the MPI library, which is only accessible and known to the process that creates it. Currently, the MPI Forum prefers the local approach.

However, little effort has been put into the discussions on the role of the MPI process sets. In this work, we provide a potential design for the MPI process-set management part in the MPI sessions proposal, which has the following characteristics:

• MPI process sets are static and versioned: A process set is an immutable object that cannot be changed. Any changes to a process set will automatically trigger the creation of a new version of that given set. This approach matches the current static-sessions concept of the MPI Forum.

• MPI process sets are named using URIs, and they are local objects: Although the MPI process set information is shared with the runtime system, the MPI process set itself is designed to be a local object. A process can only see sets that it belongs to. This matches the current efforts of the MPI Forum on the sessions concept and helps provide better scalability.

• MPI process sets can be created by the application or the runtime system: In our current design and implementation, the MPI process set is used to share information between the application and the runtime system in both directions. Consequently, it can be used as a bridge between the two.

• Asynchronous operation on process sets: Any changes to a process set are queued and performed at the library level. As part of the static design of MPI process sets, changes are not necessarily propagated to the user application. This reduces the performance impact on the application while ensuring backward compatibility.

With our prototype implementation, we have demonstrated the feasibility of these design criteria. We have also shown that our MPI sessions and MPI process sets concepts are fully compatible with existing MPI implementations, because our library does not overwrite any existing MPI functions. We have also shown the usefulness of MPI sessions and MPI process sets for fault tolerance. Since information such as changes to a given process set can be queried by the application, the programmer can choose to react upon receiving such information, allowing the application to perform a fault mitigation mechanism such as data migration. The added dynamism and the exposure of runtime system information are beneficial for future applications.

7.4.2. Advantages of MPI Sessions and MPI Process Sets for Fault Tolerance

The main advantages of MPI sessions and MPI process sets for fault tolerance are:

• Simple porting of existing applications: Complex HPC applications based on MPI usually contain carefully designed communication routines and MPI-specific optimizations. Our design provides full backward compatibility, which reduces the porting effort for existing applications. In fact, for the highly optimized example application LULESH, it took only about one hour to produce a first fully functional adaptation with our sessions prototype. Application programmers can rely heavily on existing code and reuse all of the existing implementation.

• Programmer controlled: Everything is controlled by the programmer: what is migrated, when a migration is carried out, and what happens to objects such as MPI_Groups and MPI_Communicators. Such fine-grained control is often desired by professional application programmers for HPC systems. Furthermore, it limits the overhead of fault tolerance, as programmers know best which data are required for migration.

• Low design overhead: We change neither the semantics nor the implementation of existing MPI functions (except for MPI_Comm_spawn, whose implementation, but not semantics, is changed). As a result, our library itself introduces no additional overhead.

7.4.3. Disadvantages of MPI Sessions and MPI Process Sets for Fault Tolerance

Our design still has some disadvantages:

• Unfriendly to inexperienced users: Since we rely on the programmer to react to changes to a process set, higher demands are placed on the application programmer. Most programmers do not deal with fault tolerance today, as they prefer to simply restart a failed application. This is a significant drawback of our approach compared to application-transparent solutions.

• Changing the program structure to support fault tolerance: In order to support fault tolerance with the MPI sessions and MPI process set concepts, major changes are required. Such changes can be error-prone and demanding; for our LULESH example, roughly half of the implementation time was spent on debugging (still less than for laik lulesh). Furthermore, as shown in our evaluation (cf. Section 7.3.3), such changes to the program flow (e.g., the addition of MPI_Barrier) can degrade performance.

• Dependency on a key-value store: The whole design of our MPI process sets interface relies on a key-value store, since all shared data passes through it. Unless this key-value store is scalable and efficient, our design cannot scale either. Fortunately, current research on runtime systems such as PMIx [Cas+17] targets exactly this issue and may provide a solution to this dependency.

7.4.4. Limitations of Our Prototype

Since our prototype has been designed to demonstrate fault tolerance, some shortcomings remain:

• Our prototype does not support the termination of idle processes. An idle process, which is no longer part of an active computation, may call MPI_Session_finalize and terminate itself. However, as the prototype is based on a standard MPI distribution, this call does not terminate the process. Instead, it puts the process into a sleep state until all MPI processes have terminated.

• Our prototype does not implement all the functionalities of MPI sessions proposed by the MPI Forum. Instead, it only provides a minimum set of functionalities to enable support for fault tolerance based on data migration. A list of all the implemented functions is given in Appendix E.1.

• Our prototype only supports shared-memory systems. This is a major drawback because MPI is designed for distributed-memory systems. However, we were not able to find a suitable key-value store that meets our requirements and is, at the same time, designed for HPC systems. Consequently, we have accepted this limitation because our goal was to demonstrate our technique.

Despite all the disadvantages and limitations mentioned above, we firmly believe that MPI sessions and MPI process sets are promising extensions to the MPI standard. As demonstrated in our examples, the added failure prediction and runtime system information will greatly help in making future applications more fault-tolerant. Furthermore, since MPI is the de facto standard programming model for HPC systems, the inclusion of dynamic concepts will also help to change the paradigm of future HPC applications.

8. Discussion

8.1. Fault Tolerance with Data Migration

With LAIK, MPI sessions, and MPI process sets (in the following, just MPI sessions), we have shown two different ways of achieving fault tolerance with data migration. Using two example applications, namely MLEM and LULESH, we have demonstrated the effectiveness of both methods. Furthermore, we have shown a low performance overhead for mid-scale applications with both LAIK and MPI sessions. As a result, we believe that, combined with a proper failure prediction system, data migration is a promising technique for achieving fault tolerance in future High Performance Computing (HPC) systems.

As introduced in Section 4.2, existing fault mitigation mechanisms such as checkpoint&restart and process migration suffer from a high overhead. Since our proposed solution, data migration, is closer to the application itself, its overhead can be reduced to a minimum. This is also confirmed by our evaluations.

However, data migration also comes with a significant drawback: for any existing application, it requires changes to the existing implementation. This results in a considerable amount of work for the application programmer. Nevertheless, with MPI sessions, we have shown that the adaptation can be done easily, given that the underlying communication library provides support for proactive fault tolerance. In this case, redistributing the data structures is the only change an application programmer has to make in order to support data migration. Even for a complex application such as LULESH, it was possible to port the application within a single day.

Unlike MPI sessions, LAIK goes one step further by providing automatic creation of partitionings, data management, and communication management. This way, fault tolerance with data migration is achieved automatically. However, LAIK follows a different programming model, which is based on partitioning the data rather than on the typical message-passing model. Consequently, a considerable amount of existing code has to be changed in order to support LAIK, which significantly reduces the usability of LAIK for existing applications.

In summary, we conclude that data migration combined with failure prediction is a promising approach to achieving fault tolerance in HPC systems for both existing and new applications. For existing applications, MPI sessions is a good solution because it provides full backward compatibility. For new applications, LAIK is a promising new programming model, which automatically provides a solution for data migration. However, the current design of LAIK is highly tuned for performance, which results in poor usability.

8.2. From Fault Tolerance to Malleability

The focus of this dissertation is on fault tolerance. However, the trends in HPC systems as well as an increasing demand for efficiency also raise the question of whether an application can be resource-adaptive during execution. The HPC community is carrying out active research on this issue. So-called “malleability”, the ability to be adaptive and elastic at different points in time throughout execution, is a major research topic nowadays. Unlike classic HPC applications, which run in a “fire-and-forget” manner, malleable applications should be resource-aware and able to adapt to changes in resources during execution. This way, the overall system efficiency can be improved. An example use case is given in Figure 8.1 and described as follows:

• A malleable parallel application is running on three nodes.

• At a later point in time, the application is informed that one of the nodes it is running on will undergo a high-priority maintenance task and that it has to be removed from this node.

• Upon this request, the application repartitions itself and redistributes its data across the remaining nodes.

• The application continues its calculation.

• Maintenance is performed quickly, and the application can restore its initial configuration and continue its calculation.

• At a later timepoint, the runtime system informs the application that it can use more resources in the HPC system because many of its nodes are idle.

• The application decides to expand itself to six nodes instead of three in order to finish sooner.


In this example, we can see that malleability is very similar to fault tolerance. The only difference is that the application has to support data migration not only to a smaller number of processes but also to a larger one. Consequently, if an application already supports fault tolerance based on data migration, the additional steps for enabling malleability based on data migration are trivial.

Figure 8.1.: Fault Tolerance vs. Malleability (schematic: an application running on three nodes shrinks for fault tolerance, and expands from three to six nodes for malleability/elasticity)

This means that all the concepts and designs we have presented for both LAIK and MPI sessions remain valid and suitable for future applications that support malleability based on data migration. With our current MPI sessions implementation, the expansion of a process set is already supported, as it does not require any additional interfaces from the application’s point of view. For LAIK, the support for malleability again relies on the support of the communication backend.
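As a sketch of how expansion could look with our prototype, a newly started process might add itself to an existing process set, after which the processes that are already running pick up the new set version exactly as in the fault tolerance case above. The call names again follow Appendix E.1; the signatures and the helper participate_in_computation() are illustrative assumptions.

/* Minimal sketch of process-set expansion, assuming the prototype's
 * call names (Appendix E.1) with illustrative signatures. */
#include <mpi.h>
#include "mpi_sessions.h"          /* assumed header of our prototype library */

extern void participate_in_computation(MPI_Comm comm);  /* application-specific */

void join_computation(const char *pset_uri)
{
    MPI_Session session;
    MPI_Group   group;
    MPI_Comm    comm;

    MPI_Session_init(&session);

    /* Register this newly spawned process with the named set; this
     * creates a new version of the set in the shared key-value store. */
    MPI_Session_addto_pset(session, pset_uri);

    /* Build group and communicator from the new set version. Processes
     * that were already running rebuild theirs once they observe the
     * version change. */
    MPI_Group_create_from_session(session, pset_uri, &group);
    MPI_Comm_create_from_group(group, &comm);

    participate_in_computation(comm);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
}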

9. Related Work

9.1. State-of-the-practice on Failure Prediction

The main motivation for failure prediction in HPC is the reduction of the resources used for re-executing applications. Two DARPA white papers by Cappello et al. [Cap+14; Cap+09] on resilience for exascale systems showed that HPC systems waste 20% of their computing resources on failure and recovery. The most important state-of-the-practice survey on failure prediction is provided by Salfner et al. [SLM10], last released in 2010. Their survey classifies online failure prediction methods in the form of a one-dimensional taxonomy. Our work presents an enhanced, two-dimensional classification of failure prediction methods. Schroeder et al. have repeatedly presented large-scale studies on failure mode and failure effect analysis in HPC systems, covering both system-level and component-level failures; the most prominent examples include but are not limited to [SG07b; SG10; SG07a; SPW09; MSS17]. Many authors criticize the deficiencies of classic prediction metrics. Salfner et al. [SLM10] point out that the accuracy metric is not useful because failure events are rare. Zheng et al. [Zhe+10] present new definitions of TP, FN, and FP with respect to the lead time and locality of a predicted failure. Taerat et al. [Tae+10] also show the insufficiency of existing statistical metrics due to their not including failure mitigation effects, and introduce a proficiency metric that takes the time required for mitigation into account.

9.2. Migration and Checkpointing

Most existing studies on fault tolerance techniques are based on checkpoint&restart. Among these techniques, the most prominent solutions are Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters [HD06], the Linux kernel module checkpoint/restart (CRAK) [ZN01], Distributed Multi-Threaded CheckPointing (DMTCP) [AAC09], and Checkpoint/Restore In Userspace (CRIU) [Vir].


Pickartz et al. [Pic+14] provide an overview of migration methods suitable for failure resilience based on a predicted failure. Gad et al. [Gad+18] provide ways of accelerating VM migration upon a predicted failure. The improved VM migration makes it possible to protect applications within a short lead time. Wang et al. [Wan+08] show that successful live process migration with a lead time of less than 10 seconds could take up to half a minute compared with VM migration. Further details on migration and checkpointing methods can be found in Chapter 4.

9.3. Fault-tolerant Programming Models

Apart from the MPI-based message passing model for parallel programming, the so-called Partitioned Global Address Space (PGAS) model provides global address spaces and enables programming for good locality of accesses. Implementations of PGAS include new programming languages such as Chapel [CCZ07] and X10 [Sar+14]. Chapel and X10 suffer from the drawback of requiring programmers to rewrite their legacy codes. Furthermore, user libraries such as the Global Array Toolkit (GAT) [Nie+06], GASPI [GS13], and DASH [Für+14; IFG16] provide support for the PGAS programming model. GAT and GASPI are based on RDMA technology, which is barely failure resilient on state-of-the-art HPC communication backbones. DASH relies on C++ template functionality to provide a selection of standard data structures for the application programmer, while its communication abstraction is built on top of the DART runtime library [Zho+14]. This imposes an additional performance footprint which is hard for the programmer to optimize. Other approaches similar to LAIK include Legion [TBA13] – a data-region-based programming system – and Charm++ [KK93; ZSK04] – a C++ library encapsulating workload as so-called chares – which handle partitioning in a manner similar to LAIK. However, unlike LAIK, both solutions come with their own runtimes, which impose additional overhead on the application.

Many efforts to extend MPI with fault tolerance capabilities have been made in recent years. Bland et al. [Bla+12] introduce User Level Failure Mitigation (ULFM) in MPI, which provides fault handling using error codes and messages. Huang et al. [HLK04] present Adaptive MPI (AMPI), which introduces an abstract mapping between processes and virtual processors; a virtual processor can be remapped to other healthy processors in order to achieve fault tolerance. Their implementation is based on Charm++ [KK93]. Urena et al. [Ure+12] have developed Invasive MPI (IMPI) alongside the Invasive Resource Manager (IRM), which is capable of dynamic resource (re-)allocation for MPI-based applications while preserving backward compatibility.

10. Conclusion

In this dissertation, we have carried out a comprehensive survey of a large amount of recent literature on state-of-the-art failure prediction methods. We have identified several gaps in the current state of practice for failure prediction, including network components, accelerators (such as GPUs), and recent storage technologies (such as SSD and NVM). We have also determined that the restricted access to failure data and monitoring data, as well as the lack of standardized data sets for training and evaluation, have contributed to the poor comparability of research on failure prediction in HPC systems. Furthermore, we have concluded that the classic evaluation metrics for failure prediction are insufficient. Consequently, we have committed ourselves to conducting future research that focuses on the practical relevance of failure prediction in HPC systems.

We have presented the concept of data migration, a proactive fault tolerance technique that is based on the repartitioning and redistribution of data combined with failure prediction. The basic idea behind data migration is to achieve fault tolerance by removing application data from the failing nodes – this process is called repartitioning. Combined with a failure prediction with sufficient lead time, an application can perform repartitioning prior to any failure, thus reducing the resources required for re-computation. Using two implementations, LAIK and the MPI sessions/MPI process sets, together with two real-world example applications, MLEM and LULESH, we have shown that data migration is a promising concept for fault tolerance. We have also demonstrated the low performance overhead for both example applications. We conclude that both LAIK and MPI sessions are suitable for fault tolerance based on data migration, while MPI sessions/MPI process sets provide better backward compatibility and are especially useful for existing applications. Finally, we have extended the concept of data migration toward malleability, which not only enables the removal of participating processes but also allows the expansion of the set of processes that are part of the parallel execution.

11. Future Work

With regard to future work, we are planning to investigate the real-world evaluation metrics of failure prediction methods that take important variables into account, such as the false positive rate and lead time. Furthermore, we intend to quantitatively investigate the impact of failure prediction on different fault-mitigation methods for large scale systems. Moreover, we are preparing to publish an open-access data set containing failure data and monitoring data from HPC systems.

For LAIK, we are planning to simplify LAIK’s concept and provide a better API and tools to support application programmers in porting existing applications to LAIK. We are also going to investigate LAIK with other backends. Moreover, we are planning to evaluate concepts from LAIK for other programming models (e.g., MPI), such as the pre-calculation of communication action sequences.

We also intend to work closely with the MPI Forum in order to drive the discussions on the standardization of the MPI sessions/MPI process sets, which are very promising concepts for both fault tolerance and malleability. We are going to compose requirements engineering documents for the runtime system, which is crucial to the further development of MPI sessions. We are also planning to work on the MPI tools interface to provide development support for fault tolerance, such as a fault simulator and a simulated fault predictor.

Appendices

A. Description of Systems

The systems used for the evaluation in this work are introduced and described in this appendix.

A.1. SuperMUC Phase II

SuperMUC Phase II [1] is a high performance computing system installed and operated by the Leibniz Rechenzentrum (LRZ). It was installed by Lenovo in 2015. It consists of 3,072 Lenovo NeXtScale nx360M5 WCT nodes, each equipped with two Intel Xeon E5-2697 v3 (Haswell) processors. Detailed information is shown in Table A.1.

[1] https://www.lrz.de/services/compute/supermuc/systemdescription/, accessed May 2019


Table A.1.: Configuration of SuperMUC Phase II

Components               SuperMUC Phase II
CPU                      2x Intel Xeon E5-2697 v3
Nominal Frequency        2.6 GHz
Number of Nodes          3072
Number of Cores          86016
Cores per Node           28
NUMA Domains per Node    4
L3 Cache per Node        2x18 MB
Peak Performance         3.58 PFLOP/s
LINPACK Performance      2.814 PFLOP/s
Total Size of Memory     194 TB
Memory per Node          64 GB
Memory Bandwidth         137 GB/s
Number of Islands        6
Interconnect             InfiniBand FDR 14
Intra-Island Topology    Non-blocking Tree
Inter-Island Topology    Pruned Tree 4:1
Bisection Bandwidth      5.1 TB/s
Operating System         SUSE Linux Enterprise Server
Batch System             IBM LoadLeveler
Parallel File System     IBM GPFS
HOME File System         NetApp NAS
Monitoring               Icinga, Splunk

A.2. CoolMUC-2

The CoolMUC-2 [2] system is a cluster with Intel Haswell-based nodes and InfiniBand FDR14 interconnect. It is part of the Linux Cluster, which is operated by the Leibniz Rechenzentrum (LRZ). The hardware and system software configuration is shown in Table A.2.

[2] https://www.lrz.de/services/compute/linux-cluster/overview/, accessed May 2019


Table A.2.: Configuration of LRZ CoolMUC-2

Components               CoolMUC-2
CPU                      2x Intel Xeon E5-2697 v3
Nominal Frequency        2.6 GHz
Number of Nodes          384
Number of Cores          10752
Cores per Node           28
NUMA Domains per Node    2
L3 Cache per Node        2x18 MB
Max Nodes per Job        60
Max Cores per Job        1680
Max Wall Time            48 h
Max Aggregated Memory    3.8 TB
Interconnect             InfiniBand FDR 14
Operating System         SUSE Linux Enterprise Server
Batch System             SLURM
Parallel File System     IBM GPFS
HOME File System         NetApp NAS

A.3. CoolMUC-3

The CoolMUC-3 [3] system is a recent cluster based on the Intel Xeon Phi manycore processor, also known as Knights Landing. Unlike other LRZ systems, it features the new Intel Omni-Path interconnect technology. It is also part of the Linux Cluster, which is operated by LRZ. The detailed hardware and system software configuration is shown in Table A.3.

[3] https://www.lrz.de/services/compute/linux-cluster/coolmuc3/overview/, accessed May 2019


Table A.3.: Configuration of LRZ CoolMUC-3

Components                                   CoolMUC-3
CPU                                          1x Intel Xeon Phi 7210F
Nominal Frequency                            1.3 GHz
Number of Nodes                              148
Number of Cores                              9,472
Cores per Node                               64
Hyperthreads per Core                        4
Peak Performance                             394 TFlop/s
High Bandwidth Memory per Node               16 GB
Main Memory per Node                         96 GB
Interconnect                                 Intel Omni-Path
Maximum Bandwidth to Interconnect per Node   25 GB/s (2 Links)
Bisection Bandwidth                          1.6 TB/s
Latency of Interconnect                      2.3 µs
Operating System                             SUSE Linux Enterprise Server
Batch System                                 SLURM
Parallel File System                         IBM GPFS
HOME File System                             NetApp NAS
MPI                                          Intel MPI 2017
Compilers                                    Intel icc, icpc, ifort

B. The System Matrix of MADPET-II

All the data are referenced from [Küs+09; Küs+10].

Table B.1.: MADPET-II Matrix Characteristics

Parameter                                  Value
Total Size (Bytes)                         12,838,607,884
Rows (Pair of detectors)                   1,327,104
Columns (Voxels)                           784,000
Total number of non-zero elements (NNZ)    1,603,498,863
Matrix density (%)                         0.1541
Max NNZ in a row                           6,537
Min NNZ in a row                           0
Avg NNZ in a row                           1,208
Max NNZ in a column                        6,404
Min NNZ in a column                        0
Avg NNZ in a column                        2,045
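As a quick consistency check (added here for convenience), the density and the average number of non-zero elements per row and per column follow directly from the counts in Table B.1:

\[
\text{density} = \frac{\text{NNZ}}{\text{rows} \cdot \text{columns}}
              = \frac{1{,}603{,}498{,}863}{1{,}327{,}104 \cdot 784{,}000}
              \approx 1.541 \cdot 10^{-3} = 0.1541\,\%,
\]
\[
\overline{\text{NNZ}}_{\text{row}} = \frac{1{,}603{,}498{,}863}{1{,}327{,}104} \approx 1{,}208,
\qquad
\overline{\text{NNZ}}_{\text{column}} = \frac{1{,}603{,}498{,}863}{784{,}000} \approx 2{,}045.
\]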


Figure B.1.: Density Plot of the System Matrix MADPET-II

C. Data Structures from LULESH 2.0

Three different types of data structures have been adapted for porting LULESH to both LAIK and the MPI Process Sets:

• Node-related (nodal) data: Listed in Table C.1.

• Element-related (elemental) data: Listed in Table C.2.

• Local temporary data: Listed in Table C.3

Table C.1.: Nodal Variables in LULESH [HKG]

Physical Variable       Description           Program Accessor
X = (x, y, z)           position vector       x(), y(), z()
U = (U_x, U_y, U_z)     velocity vector       xd(), yd(), zd()
A = (A_x, A_y, A_z)     acceleration vector   xdd(), ydd(), zdd()
F = (F_x, F_y, F_z)     force vector          fx(), fy(), fz()
m_0                     nodal mass            nodalMass()


Table C.2.: Elemental Variables in LULESH [HKG]

Physical Variable          Description                           Program Accessor
p                          pressure                              p()
e                          internal energy                       e()
q                          artificial viscosity                  q()
V                          relative volume                       v()
V̇/V                        relative volume change per volume     vdov()
ΔV = V_{n+1} − V_n         relative volume change                delv()
V_0                        initial volume                        volo()
l_char                     characteristic length                 arealg()
e = (e_xx, e_yy, e_zz)     diagonal terms of deviatoric strain   dxx(), dyy(), dzz()
q_lin                      artificial viscosity linear term      ql()
q_quad                     artificial viscosity quadratic term   qq()

Table C.3.: Additional Local Variables in LULESH

Description            Program Accessor
principal strains      dxx(), dyy(), dzz()
velocity gradients     delv_xi(), delv_eta(), delv_zeta()
coordinate gradients   delx_xi(), delx_eta(), delx_zeta()
new relative volume    vnew()

D. List of Proposed Calls for MPI Sessions

A list of proposed additional MPI calls related to the sessions proposal as given in [Mes18] and [Hol+16] is shown in Table D.1. This information is based on the draft version of MPI Standard 3.2 as of September 2018. The respective calling parameters and return values are not listed. For detailed information, please refer to the standard draft [Mes18].

Table D.1.: Overview of Proposed Functions for MPI Sessions [Mes18; Hol+16]

Basic Session Functions
MPI_Session_init()                 Create a session object and initialize an MPI instance.
MPI_Session_finalize()             Clean up a session object and finalize an MPI instance.

MPI Group-related Operations
MPI_Group_from_pset()              Create a process group, bound to a session, with the processes in a given process set.
MPI_Comm_create_group()            Create a communicator from a group.

Runtime Queries
MPI_Session_get_num_psets()        Query the runtime for available process sets.
MPI_Session_get_psetlen()          Return the length of the n-th process set name in bytes.
MPI_Session_get_psetname()         Return the name of the n-th process set.
MPI_Info_get_nth_pset()            Get the information on a given process set by its ID.
MPI_Session_get_info()             Get the information on a given process set by its name.

MPI Tools Interface
MPI_T_pvar_handle_alloc()          Bind a specified performance variable to an MPI session object.
MPI_T_pvar_handle_free()           Unbind a performance variable from a session.
MPI_T_pvar_read()                  Query information from a performance variable in a given session.
MPI_T_pvar_write()                 Write data into a performance variable in a given session.
MPI_T_pvar_reset()                 Reset a performance variable in a given session.
MPI_T_pvar_readreset()             Read and then reset a performance variable in a given session.

Session Specific Caching Functions
MPI_Session_create_keyval()        Create a caching key-value store (KVS) in a session.
MPI_Session_free_keyval()          Delete a caching KVS in a session.
MPI_Session_set_attr()             Store a key-value pair in a KVS.
MPI_Session_get_attr()             Load a key-value pair from a KVS by its key.
MPI_Session_delete_attr()          Delete a key-value pair from a KVS.

Sessions Related Error Handling
MPI_Session_create_errhandler()    Create an error handler that can be attached to a session object.
MPI_Session_set_errhandler()       Attach a new error handler to a session.
MPI_Session_get_errhandler()       Retrieve the error handler currently associated with a session.
MPI_Session_call_errhandler()      Invoke the error handler assigned to a session.

E. New Calls Introduced by Our MPI Sessions / MPI Process Set Library

Detailed information on the new calls and interfaces introduced by our MPI sessions and MPI process set library is documented in this appendix. The adapted designs are based on previous work by Anilkumar Suma [Sum19].

E.1. MPI Sessions Interface

Table E.1.: Overview of MPI Sessions and MPI Process Sets Calls Implemented in our Prototype

Basic Session Functions
MPI_Session_preparation()           Facilitate internals to initialize the existing MPI library.
MPI_Session_init()                  Create a session object and initialize an MPI instance.
MPI_Session_finalize()              Clean up a session object and finalize an MPI instance.
MPI_Session_free()                  Free all resources used by MPI sessions, including the underlying MPI library.

MPI Set Management Operations
MPI_Session_get_nsets()             Query the runtime for the number of available process sets.
MPI_Session_get_pset_names()        Get the names of all available process sets.
MPI_Session_check_in_processet()    Check whether the current process is in a process set.
MPI_Session_iwatch_pset()           Issue an asynchronous watch for a process set to query its change status later.
MPI_Session_check_psetupdate()      Check whether the current process set has been updated.
MPI_Session_watch_pset()            Issue a synchronous watch for a process set and wait until it is changed.
MPI_Session_fetch_latestversion()   Get the latest version of a given process set.
MPI_Session_addto_pset()            Add the calling process to a given process set.
MPI_Session_deletefrom_pset()       Remove the calling process from a given process set.
MPI_Session_get_set_info()          Get internal information from a given process set.

MPI Group and Communicator-related Operations
MPI_Group_create_from_session()     Create a process group, bound to a session, with the processes in a given process set.
MPI_Comm_create_from_group()        Create a communicator from a group.
MPIS_Comm_spawn()                   Overwrite of MPI_Comm_spawn to allow creating a new communicator using the sessions concept.

E.2. Key-value Store Interface

Table E.2.: Overview of Key-value Store Interface Calls

Basic KVS Functions
KVS_Get()                        Get an entry from the KVS.
KVS_Put()                        Set an entry in the KVS.
KVS_Add()                        Add a new entry to the KVS.
KVS_Del()                        Delete an entry from the KVS.
KVS_initialise()                 Initialize and create an entity of the KVS.
KVS_open()                       Open an existing KVS.
KVS_free()                       Clean up and destroy a KVS.

Derived KVS Functions
KVS_Get_local_nsets()            Get the number of process sets from the KVS.
KVS_Get_local_processsets()      Get the names of the process sets from the KVS.
KVS_addto_world()                Add a process to the "world" process set in the KVS.
KVS_Watch_keyupdate()            Issue a watch on a KVS entry.
KVS_ask_for_update()             Query whether a previously issued watch has been triggered.
KVS_Watch_keyupdate_blocking()   Issue a blocking watch on a KVS entry.
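To make the interplay between the two interfaces more concrete, the following minimal sketch shows how a process-set update could travel through this key-value store layer. The call names follow Table E.2; the KVS handle type, the exact signatures, and the return conventions are illustrative assumptions rather than the prototype's literal interface.

/* Minimal sketch, assuming the call names of Table E.2; the handle type
 * and signatures below are illustrative assumptions. */
#include <stdio.h>

typedef void *KVS;   /* assumed opaque handle type */

/* assumed declarations; the prototype's actual header may differ */
extern void KVS_Put(KVS kvs, const char *key, const char *value);
extern void KVS_Watch_keyupdate(KVS kvs, const char *key);
extern int  KVS_ask_for_update(KVS kvs, const char *key);

/* The runtime system (or a joining process) publishes a new version of
 * the membership list under the process set's URI. */
void publish_new_set_version(KVS kvs, const char *pset_uri,
                             const char *member_list)
{
    KVS_Put(kvs, pset_uri, member_list);
}

/* An application process registers a watch on the same key and later
 * polls it; one abstraction level higher this corresponds to
 * MPI_Session_iwatch_pset() / MPI_Session_check_psetupdate() (Table E.1). */
void wait_for_set_change(KVS kvs, const char *pset_uri)
{
    KVS_Watch_keyupdate(kvs, pset_uri);
    while (!KVS_ask_for_update(kvs, pset_uri)) {
        /* keep computing; poll again at the next convenient point */
    }
    printf("process set %s has a new version\n", pset_uri);
}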

F. List of Own Publications

Conference Full Papers

• Alvaro Frank, Dai Yang, Tim Süß, Martin Schulz, and André Brinkmann. Reducing False Node Failure Predictions in HPC. 26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC) 2019. Accepted for publication.

• Bengisu Elis, Dai Yang, and Martin Schulz. 2019. QMPI: A Next Generation MPI Profiling Interface for Modern HPC Platforms. In Proceedings of the 26th European MPI Users’ Group Meeting (EuroMPI ’19), Torsten Hoefler and Jesper Larsson Träff (Eds.). ACM, New York, NY, USA, Article 4, 10 pages

• David Jauk, Dai Yang, and Martin Schulz. Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice. In SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2019, Denver, Colorado, United States. Accepted for publication.

• Dai Yang, Josef Weidendorfer, Tilman Küstner, Carsten Trinitis, and Sibylle Ziegler. Enabling Application-Integrated Proactive Fault Tolerance. In Parallel Computing (PARCO) 2017, Bologna, Italy

Workshop Papers

• Amir Raoofy, Dai Yang, Josef Weidendorfer, Carsten Trinitis, and Martin Schulz. Enabling Malleability for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics using LAIK. In PARS-Mitteilungen 2019, Berlin, Germany

• Thomas Becker, Nico Rudolf, Dai Yang, and Wolfgang Karl. Symptom-based Fault Detection in Modern Computer Systems. In PARS-Mitteilungen 2019, Berlin, Germany

• Thomas Becker, Dai Yang, Tilman Küstner, and Martin Schulz. Co-Scheduling in a Task-Based Programming Model. In Workshop on Co-Scheduling of HPC Applications (COSH'18) in Conjunction with HiPEAC Conference 2018. January 2018, Manchester, United Kingdom

• Josef Weidendorfer, Dai Yang, and Carsten Trinitis. LAIK: A Library for Fault Tolerant Distribution of Global Data for Parallel Applications. In PARS-Mitteilungen 2017, Hagen, Germany

Articles

• Martin Schulz, Marc-André Hermanns, Michael Knobloch, Kathryn Mohror, Nathan T. Hjelm, Bengisu Elis, Karlo Kraljic, and Dai Yang. The MPI Tool Interfaces: Past, Present, and Future.

Posters

• Dai Yang, Moritz Dötterl, Sebastian Rückerl and Amir Raoofy. Hardening the Linux Kernel against Soft Errors. Poster for The 13th International School on the Effects of Radiation on Embedded Systems for Space Applications (SERESSA'17), Garching, Germany.

• Tejas Kale, Dai Yang, Sebastian Rückerl, Martin Schulz. Event Driven Programming for Embedded Systems. Poster for 10th European CubeSat Symposium 2018. Toulouse, France

Acronyms

ABFT Algorithm-based Fault Tolerance.
AFR Annualized Failure Rates.
AMPI Adaptive MPI.
ANL Argonne National Laboratory.
API Application Programming Interface.
ARMA AutoRegressive Moving Average.
ARR Annualized Replacement Rate.
AVX Advanced Vector Extensions.

BJPCS Batch Job Processing Clustering System.
BLAS Basic Linear Algebra Subprograms.
BLCR Berkeley Lab Checkpoint Restart.

CDF Cumulative Distribution Function.
CFDR Computer Failure Data Repository.
COMA Cache-Only Memory Architecture.
COTS Commercial-off-the-Shelf.
CPU Central Processing Unit.
CRIU Checkpoint/Restart in Userspace.
CSR Compressed Sparse Row.
CU Control Unit.

DARPA Defense Advanced Research Projects Agency.
DBMS Database Management System.
DFS Distributed File System.
DMA Direct Memory Access.
DMTCP Distributed Multi-Threaded Checkpointing.
DNN Deep Neural Network.
DRAM Dynamic Random Access Memory.

ECC Error Correcting Code.

FC Fibre Channel.
FLOPS Floating Point Operations per Second.
FOV Field of View.
FPGA Field Programmable Gate Array.
FTA Fault Tree Analysis.

GPU Graphical Processing Unit.

HBM High Bandwidth Memory.
HDD Hard Disk Drive.
HMM Hidden Markov Models.
HPC High Performance Computing.
HPCG High Performance Conjugate Gradient.
HPL High Performance LINPACK.

IMPI Invasive MPI.
IPMI Intelligent Platform Management Interface.
IRM Invasive Resource Manager.

LAIK Lightweight Application-Integrated Fault-Tolerant Data Container.
LANL Los Alamos National Laboratory.
LLNL Lawrence Livermore National Laboratory.
LOR Line of Response.
LRZ Leibniz Rechenzentrum der Bayerischen Akademie der Wissenschaft.
LSTM Long Short-Term Memory.
LULESH Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics.

MIMD Multiple Instruction Streams Multiple Data Streams.
MISD Multiple Instruction Streams Single Data Stream.
MLEM Maximum-Likelihood Expectation-Maximization.
MPI Message Passing Interface.
MPMD Multiple Program Multiple Data.
MTBF Mean Time Between Failure.

NAS Network Accessed Storage.
NEON Advanced SIMD Extension.
NN Neural Network.
NUMA Non-uniform Memory Access.
NVM Non-volatile Memory.

ORNL Oak Ridge National Laboratory.
OS Operating System.

PCIe Peripheral Component Interconnect Express.
PET Positron Emission Tomography.
PGAS Partitioned Global Address Space.
PMI Process Management Interface.
PU Processing Unit.

RAID Redundant Array of Independent/Inexpensive Disks.
RBML Rule-based Machine Learning.
RDMA Remote Direct Memory Access.
RNN Recurrent Neural Network.
ROCC Receiver Operating Characteristic Curve.

SAN Storage Area Network.
SATA Serial AT Attachment.
SBP Sedov Blast Problem.
SCSI Small Computer System Interface.
SIMD Single Instruction Stream Multiple Data Streams.
SISD Single Instruction Stream Single Data Stream.
SLURM Simple Linux Utility for Resource Management Workload Manager.
SMART Self-monitoring, Analysis and Reporting Technology.
SNL Sandia National Laboratory.
SPMD Single Program Multiple Data.
SpMV Sparse Matrix-Vector.
SSD Solid State Drive.
SVM Support Vector Machine.

TCP Transmission Control Protocol.

UDP User Datagram Protocol.
ULFM User Level Failure Mitigation.
UMA Uniform Memory Access.
UPS Uninterruptible Power Supply.
URI Uniform Resource Identifier.

VLIW Very Long Instruction Word.
VM Virtual Machine.
VMM Virtual Machine Manager.

List of Figures

1.1. Taxonomy of Parallelism
1.2. Parallelism in a Processor
1.3. Multicore, Multiprocessor and Multicomputer
1.4. Amdahl's Law
1.5. Gustafson's Law
1.6. Overview of Top 100 Systems on the November 2018 TOP500 list
1.7. Maximum Performance per Core for all Systems on the November 2018 TOP500 list

2.1. Relation between Fault, Error, and Failure
2.2. Venn Diagram on the Relation between True Positive, False Positive, True Negative, and False Negative
2.3. Extended Flynn's Taxonomy [Fly72; Ata98]
2.4. Visualization of SISD, SIMD, MISD, and MIMD
2.5. Shared Memory vs. Distributed Memory

3.1. Probability of Triggering an Unnecessary Migration or Checkpoint Due to Certain False Positive Prediction Rates [Fra+19]

4.1. Typical Use Case for a Batch Processing System
4.2. System Architecture Overview of a Generic Batch Processing Cluster
4.3. Taxonomy of Different Fault-handling Techniques
4.4. State Chart for Classic Checkpoint&Restart
4.5. State Chart for Checkpoint&Restart with Fault Prediction
4.6. State Chart for Migration
4.7. State Chart for Migration with Checkpointing

5.1. Schematic Example of Data Migration
5.2. Sequence Diagram of a Basic Data Migration
5.3. Sequence Diagram of a Basic Data Migration with a Spare Node
5.4. Example of Data Distribution in a Parallel Distributed Application
5.5. Relation between Partitioner, Partitioning, Partition, and Repartitioning
5.6. An Incremental Repartitioning Scheme
5.7. A Global Repartitioning Scheme
5.8. Relation between Index Space and Data Slice
5.9. Scheme of Data Distribution in a Dual-node System
5.10. Activity Diagram of an Example Hybrid Application
5.11. Activity Diagram of the Data Migration Process

6.1. Basic Components of the LAIK Library
6.2. Location of LAIK Library on a Multinode Setup
6.3. Component Diagram of LAIK
6.4. API Layers of LAIK for Incremental Porting
6.5. Schematic of Different LAIK Groups
6.6. Schematic Overview of LAIK Process Group Concept (cf. Figure 6.2)
6.7. Schematic Overview of Two Example Partitionings for a Single Index Space
6.8. Overview of Different Built-In Partitioners of LAIK
6.9. Schematic Overview of a Transition
6.10. Schematic Overview of Pointer Reservation
6.11. Example Memory Layout with and without Pointer Reservation
6.12. Schematic Overview of Access Pattern and Transition
6.13. Example Partitioning for LAIK Partitioner API Demonstration
6.14. Schematics for the LAIK External Interface
6.15. Activity Diagram of LAIK-based vsum
6.16. Activity Diagram of LAIK-based vsum with Data Migration
6.17. Schematic View of the MADPET-II PET Scanner [Küs+09]
6.18. Average Runtime per Iteration laik-mlem vs. reference mlem [Yan+18]
6.19. Speedup Curve for laik-mlem vs. reference mlem [Yan+18]
6.20. Overhead Analysis for laik-mlem [Yan+18]
6.21. Overhead Scalability Analysis for laik-mlem [Yan+18]
6.22. Comparison of Different Repartitioning Strategies for laik-mlem [Yan+18]
6.23. LULESH Mesh Decomposition Schematic [HKG]
6.24. Illustration of a Domain for Elemental Data with a Problem Size of 4x4x4 Elements [Rao+19]
6.25. Illustration of Nodal Data with Overlapping Partitioning [Rao+19]
6.26. Illustration of the Elemental Data with Exclusive Partitioning (left) and Halo (right) Partitioning [Rao+19]
6.27. Weak Scaling Runtime Comparison for laik lulesh vs. reference lulesh with -s=16 [Rao+19]
6.28. Overhead of laik lulesh over reference lulesh with -s=16 [Rao+19]
6.29. Strong Scaling Comparison for laik lulesh vs. reference lulesh [Rao+19]
6.30. Runtime Comparison Before and After Repartitioning for laik lulesh [Rao+19]

7.1. Relations between Different Objects in MPI Sessions
7.2. Activity Diagram for Data Migration with MPI Sessions
7.3. Interfaces for MPI Process Sets
7.4. Examples for MPI Set Names
7.5. Illustrations of Different Failure Domains
7.6. Illustration of the Semi-global Accessibility of MPI Process Sets
7.7. Sequence Diagram of MPI Process Set Change Management
7.8. Schematic View of Our MPI Sessions / MPI Process Set Library
7.9. Simulated MPI Process Set
7.10. Creating the World Communicator with MPI Sessions
7.11. Creating the MLEM Communicator with MPI Sessions
7.12. Activity Diagram of Original MPI-based MLEM
7.13. Activity Diagram of MPI Session-based MLEM with Fault Tolerance
7.14. Schematics of Local Numbering vs. Global Numbering for LULESH Elements
7.15. Weak Scaling Comparison of reference LULESH, sessions LULESH, and LAIK lulesh on CoolMUC-III
7.16. Strong Scaling Comparison of reference LULESH, sessions LULESH, and LAIK lulesh on CoolMUC-III
7.17. Strong Scaling Speedups of reference LULESH, sessions LULESH, and LAIK lulesh on CoolMUC-III
7.18. Repartitioning Effect of sessions lulesh and laik lulesh on CoolMUC-III
7.19. Time for Repartitioning for sessions lulesh and laik lulesh on CoolMUC-III

8.1. Fault Tolerance vs. Malleability

B.1. Density Plot of the System Matrix MADPET-II

List of Tables

2.1. Confusion Matrix for Failure Prediction

3.1. Causes of Application Failure on Two Cray systems [Di +15; LHY10]
3.2. Root Causes of Failures in Two High Performance Systems [SG10]
3.3. Most Common Root Causes of Failures [SG10]
3.4. A Classification of Literature on Failure Prediction in High Performance Computing [JYS19]

4.1. Overview of Fault Tolerance Techniques

5.1. Examples for Selecting Repartitioning Strategy

6.1. Overview of LAIK Components and Their APIs
6.2. Overview of LAIK API Layers
6.3. Time Consumption for Data Migration Using LAIK

7.1. Overview of Components and Their Responsibilities within Our MPI Sessions / MPI Process Set Design

A.1. Configuration of SuperMUC Phase II
A.2. Configuration of LRZ CoolMUC-2
A.3. Configuration of LRZ CoolMUC-3

B.1. MADPET-II Matrix Characteristics

C.1. Nodal Variables in LULESH [HKG]
C.2. Elemental Variables in LULESH [HKG]
C.3. Additional Local Variables in LULESH

D.1. Overview of Proposed Functions for MPI Sessions [Mes18; Hol+16]

E.1. Overview of MPI Sessions and MPI Process Sets Calls Implemented in our Prototype
E.2. Overview of Key-value Store Interface Calls

List of Algorithms

1. MPI-based Matrix Vector Multiplication
2. LAIK-based Matrix Vector Multiplication with Group API Layer
3. LAIK-based Matrix Vector Multiplication with Space API Layer
4. LAIK-based Matrix Vector Multiplication with Data API Layer
5. Example Partitioner of the Example Partitioning in Figure 6.13
6. Example Application of vsum using MPI
7. MPI Based MLEM Implementation
8. MPI-based Implementation of LULESH [HKG]
9. MPI Implementation of Example Application vsum

Bibliography

[AAC09] J. Ansel, K. Arya, and G. Cooperman. “DMTCP: Transparent checkpointing for cluster computations and the desktop.” In: 2009 IEEE International Symposium on Parallel & Distributed Processing. IEEE. 2009, pp. 1–12. [Ahn+14] D. H. Ahn, J. Garlick, M. Grondona, D. Lipari, B. Springmeyer, and M. Schulz. “Flux: a next-generation resource management framework for large HPC centers.” In: 2014 43rd International Conference on Parallel Processing Workshops. IEEE. 2014, pp. 9–17. [AL88] C. J. Anfinson and F. T. Luk. “A linear algebraic model of algorithm-based fault tolerance.” In: IEEE Transactions on Computers 37.12 (1988), pp. 1599– 1604. [Amd67] G. M. Amdahl. “Validity of the single processor approach to achieving large scale computing capabilities.” In: Proceedings of the April 18-20, 1967, spring joint computer conference. ACM. 1967, pp. 483–485. [Ata98] M. J. Atallah. Algorithms and theory of computation handbook. CRC press, 1998. [Aug+11] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. “StarPU: a unified platform for task scheduling on heterogeneous multicore architec- tures.” In: Concurrency and Computation: Practice and Experience 23.2 (2011), pp. 187–198. [Avi+04] A. Avizienis, J. Laprie, B. Randell, and C. Landwehr. “Basic concepts and taxonomy of dependable and secure computing.” In: IEEE Transactions on Dependable and Secure Computing 1.1 (Jan. 2004), pp. 11–33. issn: 1545-5971. doi: 10.1109/TDSC.2004.2. [Avi76] A. Avizienis. “Fault-tolerant systems.” In: IEEE Trans. Computers 25.12 (1976), pp. 1304–1312. [AWR15] B. Agrawal, T. Wiktorski, and C. Rong. “Analyzing and Predicting Failure in Hadoop Clusters Using Distributed Hidden Markov Model.” In: Revised Selected Papers of the Second Int’l Conf. on Cloud Computing and Big Data - Volume 9106. CloudCom-Asia 2015. Huangshan, China: Springer New York, 2015, pp. 232–246.


[Bab+00] Š. H. Babiˇc,P. Kokol, V. Podgorelec, M. Zorman, M. Šprogar, and M. M. Štiglic. “The Art of Building Decision Trees.” In: Journal of Medical Systems 24.1 (Feb. 2000), pp. 43–52. issn: 1573-689X. [Bal+10] P. Balaji, D. Buntinas, D. Goodell, W. Gropp, J. Krishna, E. Lusk, and R. Thakur. “PMI: A scalable parallel process-management interface for extreme-scale systems.” In: European MPI Users’ Group Meeting. Springer. 2010, pp. 31–41. [Ban+90] P. Banerjee, J. T. Rahmeh, C. Stunkel, V. Nair, K. Roy, V. Balasubramanian, and J. A. Abraham. “Algorithm-based fault tolerance on a hypercube multiprocessor.” In: IEEE Transactions on Computers 39.9 (1990), pp. 1132– 1145. [BAS04] R. Budruk, D. Anderson, and T. Shanley. PCI express system architecture. Addison-Wesley Professional, 2004. [Bau+12] M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. “Legion: Expressing locality and independence with logical regions.” In: SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE. 2012, pp. 1–11. [Bau+16] L. Bautista-Gomez, F. Zyulkyarov, O. Unsal, and S. McIntosh-Smith. “Un- protected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer.” In: SC16: Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. Nov. 2016, pp. 645–655. [Ber+08] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, et al. “Exascale computing study: Technology challenges in achieving exascale systems.” In: Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008). [BGV92] B. E. Boser, I. M. Guyon, and V. N. Vapnik. “A Training Algorithm for Optimal Margin Classifiers.” In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory. COLT ’92. Pittsburgh, Pennsylvania, USA: ACM, 1992, pp. 144–152. isbn: 0-89791-497-X. [Bla+12] W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. J. Dongarra. Recent Advances in the Message Passing Interface: 19th European MPI Users’ Group Meeting. Vol. 28. Chapter: An Evaluation of User-Level Failure Mit- igation Support in MPI, pages 193-203. Vienna, Austria: Springer Berlin Heidelberg, Berlin,Heidelberg, 2012.


[Bon00] A. B. Bondi. “Characteristics of scalability and their impact on perfor- mance.” In: Proceedings of the 2nd international workshop on Software and performance. ACM. 2000, pp. 195–203. [Bos+09] G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. “Algorithm-based fault tolerance applied to high performance computing.” In: Journal of Parallel and Distributed Computing 69.4 (2009), pp. 410–416. [Bra+09] J. Brandt, A. Gentile, J. Mayo, P. Pébay, D. Roe, D. Thompson, and M. Wong. “Methodologies for Advance Warning of Compute Cluster Problems via Statistical Analysis: A Case Study.” In: Proceedings of the 2009 Workshop on Resiliency in High Performance. Resilience ’09. Garching, Germany: ACM, 2009, pp. 7–14. [Cap+09] F. Cappello, A. Geist, B. Gropp, L. Kale, B. Kramer, and M. Snir. “To- ward exascale resilience.” In: The International Journal of High Performance Computing Applications 23.4 (2009), pp. 374–388. [Cap+14] F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer, and M. Snir. “To- ward exascale resilience: 2014 update.” In: Supercomputing frontiers and innovations 1.1 (2014), pp. 5–28. [Cas+17] R. H. Castain, D. Solt, J. Hursey, and A. Bouteiller. “PMIx: Process Man- agement for Exascale Environments.” In: Proceedings of the 24th European MPI Users’ Group Meeting. EuroMPI ’17. Chicago, Illinois: ACM, 2017, 14:1– 14:10. isbn: 978-1-4503-4849-2. doi: 10.1145/3127024.3127027. [CAS12] T. Chalermarrewong, T. Achalakul, and S. C. W. See. “Failure Prediction of Data Centers Using Time Series and Fault Tree Analysis.” In: 2012 IEEE 18th Int’l Conf. on Parallel and Distributed Systems. Dec. 2012, pp. 794–799. [CCZ07] B. L. Chamberlain, D. Callahan, and H. P. Zima. “Parallel programmability and the chapel language.” In: The International Journal of High Performance Computing Applications 21.3 (2007), pp. 291–312. [CD08] Z. Chen and J. Dongarra. “Algorithm-based fault tolerance for fail-stop fail- ures.” In: IEEE Transactions on Parallel and Distributed Systems 19.12 (2008), pp. 1628–1641. [Cha+01] R. Chandra, L. Dagum, D. Kohr, R. Menon, D. Maydan, and J. McDonald. Parallel programming in OpenMP. Morgan kaufmann, 2001. [Che13] Z. Chen. “Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods.” In: ACM SIGPLAN Notices. Vol. 48. 8. ACM. 2013, pp. 167–176.


[CKC12] H. Chung, M. Kang, and H.-D. Cho. “Heterogeneous Multi-Processing So- lution of Exynos 5 Octa with ARM R big.LITTLE Technology.” In: Samsung White Paper (2012). [Cla+05] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. “Live migration of virtual machines.” In: Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation- Volume 2. USENIX Association. 2005, pp. 273–286. [CLG05] J. Cao, Y. Li, and M. Guo. “Process migration for MPI applications based on coordinated checkpoint.” In: 11th International Conference on Parallel and Distributed Systems (ICPADS’05). Vol. 1. IEEE. 2005, pp. 306–312. [CLP14] X. Chen, C.-D. Lu, and K. Pattabiraman. “Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study.” In: 2014 IEEE Int’l. Symp. on Software Reliability Engineering Workshops. Nov. 2014, pp. 341–346. [Coh95] W. W. Cohen. “Fast Effective Rule Induction.” In: Proceedings of the Twelfth Int’l Conf. on Machine Learning. ICML’95. Tahoe City, California, USA: Mor- gan Kaufmann Publishers Inc., 1995, pp. 115–123. isbn: 1-55860-377-8. [Cos+14] C. H. Costa, Y. Park, B. S. Rosenburg, C.-Y. Cher, and K. D. Ryu. “A System Software Approach to Proactive Memory-Error Avoidance.” In: SC14: Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. Nov. 2014, pp. 707–718. [Dar+88] F. Darema, D. A. George, V. A. Norton, and G. F. Pfister. “A single-program- multiple-data computational model for EPEX/FORTRAN.” In: Parallel Computing 7.1 (1988), pp. 11–24. [Del97] T. J. Dell. “A white paper on the benefits of chipkill-correct ECC for PC server main memory.” In: IBM Microelectronics Division 11 (1997), pp. 1–23. [DHL15] J. Dongarra, M. A. Heroux, and P. Luszczek. “HPCG benchmark: a new metric for ranking high performance computing systems.” In: Knoxville, Tennessee (2015). [DHS05] C. Ding, X. He, and H. D. Simon. “On the equivalence of nonnegative matrix factorization and spectral clustering.” In: Proceedings of the 2005 SIAM international conference on data mining. SIAM. 2005, pp. 606–610. [Di +15] C. Di Martino, W. Kramer, Z. Kalbarczyk, and R. Iyer. “Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs.” In: 2015 45th Annual IEEE/IFIP Int’l Conf. on Dependable Systems and Networks. June 2015, pp. 25–36.


[DLT09] L. Di-Jorio, A. Laurent, and M. Teisseire. “Mining Frequent Gradual Item- sets from Large Databases.” In: Advances in Intelligent Data Analysis VIII: 8th Int’l. Symp. on Intelligent Data Analysis, IDA 2009, Lyon, France. Ed. by N. M. Adams, C. Robardet, A. Siebes, and J.-F. Boulicaut. Berlin, Heidel- berg: Springer, 2009, pp. 297–308. [DM98] L. Dagum and R. Menon. “OpenMP: An industry-standard API for shared- memory programming.” In: Computing in Science & Engineering 1 (1998), pp. 46–55. [DO91] F. Douglis and J. Ousterhout. “Transparent process migration: Design alter- natives and the Sprite implementation.” In: Software: Practice and Experience 21.8 (1991), pp. 757–785. [Don+79] J. J. Dongarra, J. R. Bunch, C. B. Moler, and G. W. Stewart. LINPACK users’ guide. Vol. 8. Siam, 1979. [Dur+11] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. “Ompss: a proposal for programming heterogeneous multi- core architectures.” In: Parallel processing letters 21.02 (2011), pp. 173–193. [EA05] H. El-Rewini and M. Abd-El-Barr. Advanced computer architecture and par- allel processing. Vol. 42. John Wiley & Sons, 2005. [Ell+12] J. Elliott, K. Kharbas, D. Fiala, F. Mueller, K. Ferreira, and C. Engelmann. “Combining Partial Redundancy and Checkpointing for HPC.” In: 2012 IEEE 32nd International Conference on Distributed Computing Systems. June 2012, pp. 615–626. doi: 10.1109/ICDCS.2012.56. [ES13] N. El-Sayed and B. Schroeder. “Reading between the lines of failure logs: Understanding how HPC systems fail.” In: 2013 43rd Annual IEEE/IFIP Int’l Conf. on Dependable Systems and Networks (DSN). June 2013, pp. 1–12. [ET93] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. New York: Chapman & Hall, 1993. [EZS17] N. El-Sayed, H. Zhu, and B. Schroeder. “Learning from Failure Across Mul- tiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations.” In: 2017 IEEE 37th Int’l Conf. on Distributed Computing Systems (ICDCS). June 2017, pp. 1333–1344. [Faw06] T. Fawcett. “An Introduction to ROC Analysis.” In: Pattern Recogn. Lett. 27.8 (June 2006), pp. 861–874. issn: 0167-8655. doi: 10.1016/j.patrec. 2005.10.010. [FD17] D. Foley and J. Danskin. “Ultra-performance Pascal GPU and NVLink interconnect.” In: IEEE Micro 37.2 (2017), pp. 7–17.


[FH60] G. E. Forsythe and P. Henrici. “The cyclic Jacobi method for computing the principal values of a complex matrix.” In: Transactions of the American Mathematical Society 94.1 (1960), pp. 1–23. [FHT01] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc., 2001. [Fir+08] N. Firasta, M. Buxton, P. Jinbo, K. Nasri, and S. Kuo. “Intel AVX: New frontiers in performance improvements and energy efficiency.” In: Intel white paper 19.20 (2008). [Fly72] M. J. Flynn. “Some Computer Organizations and Their Effectiveness.” In: IEEE Transactions on Computers C-21.9 (Sept. 1972), pp. 948–960. issn: 0018-9340. doi: 10.1109/TC.1972.5009071. [For73] G. D. Forney. “The viterbi algorithm.” In: Proceedings of the IEEE 61.3 (1973), pp. 268–278. [Fra+19] A. Frank, D. Yang, T. Süß, and A. Brinkmann. “Reducing False Node Failure Predictions in HPC.” In: 26th IEEE International Conference on High Performance Computing, Data, Analtics and Data Science (2019). Accepted for Publication. [Fu+12] X. Fu, R. Ren, J. Zhan, W. Zhou, Z. Jia, and G. Lu. “LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems.” In: 2012 IEEE 31st Symp. on Reliable Distributed Systems. Oct. 2012, pp. 71–80. [Fu+14] X. Fu, R. Ren, S. A. McKee, J. Zhan, and N. Sun. “Digging deeper into cluster system logs for failure prediction and root cause diagnosis.” In: 2014 IEEE Int’l Conf. on Cluster Computing (CLUSTER). Sept. 2014, pp. 103– 112. [Für+14] K. Fürlinger, C. Glass, J. Gracia, A. Knüpfer, J. Tao, D. Hünich, K. Idrees, M. Maiterth, Y. Mhedheb, and H. Zhou. “DASH: Data structures and algorithms with support for hierarchical locality.” In: European Conference on Parallel Processing. Springer. 2014, pp. 542–552. [Gad+18] R. Gad, S. Pickartz, T. Süß, L. Nagel, S. Lankes, A. Monti, and A. Brinkmann. “Zeroing memory deallocator to reduce checkpoint sizes in virtualized HPC environments.” In: The Journal of Supercomputing 74.11 (2018), pp. 6236– 6257.

[Gai+12] A. Gainaru, F. Cappello, M. Snir, and W. Kramer. “Fault prediction under the microscope: A closer look into HPC systems.” In: High Performance Computing, Networking, Storage and Analysis (SC), 2012 Int’l Conf. for. Nov. 2012, pp. 1–11.
[Gai+13] A. Gainaru, F. Cappello, M. Snir, and W. Kramer. “Failure prediction for HPC systems and applications: Current situation and open issues.” In: The Int’l Journal of High Performance Computing Applications 27.3 (2013), pp. 273–282.
[Gai+14] A. Gainaru, M. S. Bouguerra, F. Cappello, M. Snir, and W. Kramer. “Navigating the blue waters: Online failure prediction in the petascale era.” In: Argonne National Laboratory Technical Report, ANL/MCS-P5219-1014 (2014).
[Gan+16] S. Ganguly, A. Consul, A. Khan, B. Bussone, J. Richards, and A. Miguel. “A Practical Approach to Hard Disk Failure Prediction in Cloud Platforms: Big Data Model for Failure Management in Datacenters.” In: 2016 IEEE Second Int’l Conf. on Big Data Computing Service and Applications (BigDataService). Mar. 2016, pp. 105–116.
[GCK12] A. Gainaru, F. Cappello, and W. Kramer. “Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems.” In: 26th IEEE Int’l Parallel and Distributed Processing Symp. May 2012, pp. 1168–1179.
[GDU07] J. M. Guerrero, L. G. De Vicuna, and J. Uceda. “Uninterruptible power supply systems provide protection.” In: IEEE Industrial Electronics Magazine 1.1 (2007), pp. 28–38.
[Gha02] Z. Ghahramani. “Hidden Markov Models.” In: River Edge, NJ, USA: World Scientific Publishing Co., Inc., 2002. Chap. An Introduction to Hidden Markov Models and Bayesian Networks, pp. 9–42. isbn: 981-02-4564-5.
[Ghi+16] S. Ghiasvand, F. M. Ciorba, R. Tschüter, and W. E. Nagel. “Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers.” In: 2016 24th Euromicro Int’l Conf. on Parallel, Distributed, and Network-Based Processing (PDP). Feb. 2016, pp. 377–381.
[GM03] P. Gillingham and B. Millar. High bandwidth memory interface. US Patent 6,510,503. Jan. 2003.
[GR17] A. Geist and D. A. Reed. “A survey of high-performance computing scaling challenges.” In: The Int’l Journal of High Performance Computing Applications 31.1 (2017), pp. 104–113.

[GS13] D. Grünewald and C. Simmendinger. “The GASPI API specification and its implementation GPI 2.0.” In: 7th International Conference on PGAS Programming Models. 2013, p. 243.
[Gsc+06] M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki. “Synergistic processing in Cell’s multicore architecture.” In: IEEE Micro 26.2 (2006), pp. 10–24.
[Gu+08] J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B.-H. Park. “Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study.” In: 2008 37th Int’l Conf. on Parallel Processing. Sept. 2008, pp. 157–164.
[Gus88] J. L. Gustafson. “Reevaluating Amdahl’s law.” In: Communications of the ACM 31.5 (1988), pp. 532–533.
[GZF11] Q. Guan, Z. Zhang, and S. Fu. “Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems.” In: 2011 Sixth Int’l Conf. on Availability, Reliability and Security. Aug. 2011, pp. 83–90.
[HD06] P. H. Hargrove and J. C. Duell. “Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters.” In: Journal of Physics: Conference Series. Vol. 46. 1. IOP Publishing. 2006, p. 494.
[HKG] R. D. Hornung, J. A. Keasler, and M. B. Gokhale. Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. Tech. rep. LLNL-TR-490254. Livermore, CA, pp. 1–17.
[HLK04] C. Huang, O. Lawlor, and L. V. Kalé. “Adaptive MPI.” In: Languages and Compilers for Parallel Computing. Ed. by L. Rauchwerger. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 306–322. isbn: 978-3-540-24644-2.
[Ho95] T. K. Ho. “Random Decision Forests.” In: Proceedings of the Third Int’l Conf. on Document Analysis and Recognition (Vol. 1). ICDAR ’95. Washington, DC, USA: IEEE Computer Society, 1995. isbn: 0-8186-7128-9.
[Hol+16] D. Holmes, K. Mohror, R. E. Grant, A. Skjellum, M. Schulz, W. Bland, and J. M. Squyres. “MPI Sessions: Leveraging Runtime Infrastructure to Increase Scalability of Applications at Exascale.” In: Proceedings of the 23rd European MPI Users’ Group Meeting. EuroMPI 2016. Edinburgh, United Kingdom: ACM, 2016, pp. 121–129. isbn: 978-1-4503-4234-6. doi: 10.1145/2966884.2966915.
[HP11] J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.

[HRC09] T. J. Hacker, F. Romero, and C. D. Carothers. “An analysis of clustered failures on large supercomputing systems.” In: Journal of Parallel and Distributed Computing 69.7 (2009), pp. 652–665. issn: 0743-7315.
[HSS12] A. A. Hwang, I. A. Stefanovici, and B. Schroeder. “Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design.” In: SIGPLAN Not. 47.4 (Mar. 2012), pp. 111–122. issn: 0362-1340.
[HTC+89] A. R. Hoffman, J. F. Traub, N. R. Council, et al. Supercomputers: directions in technology and applications. National Academies, 1989.
[Hua+84] K.-H. Huang et al. “Algorithm-based fault tolerance for matrix operations.” In: IEEE Transactions on Computers 100.6 (1984), pp. 518–528.
[IFG16] K. Idrees, T. Fuchs, and C. W. Glass. “Effective use of the PGAS paradigm: Driving transformations and self-adaptive behavior in DASH-applications.” In: arXiv preprint arXiv:1603.01536 (2016).
[IM17] T. Islam and D. Manivannan. “Predicting Application Failure in Cloud: A Machine Learning Approach.” In: 2017 IEEE Int’l Conf. on Cognitive Computing (ICCC). June 2017.
[Int04] Intel Corporation. “Enhanced SpeedStep® technology for the Intel® Pentium® M processor.” In: Intel Technology White Paper (2004).
[Jau17] D. Jauk. “Failure Prediction in High Performance Computing.” MA thesis. Boltzmannstr. 3, 85748 Garching: Technical University of Munich, 2017.
[JDD12] W. M. Jones, J. T. Daly, and N. DeBardeleben. “Application Monitoring and Checkpointing in HPC: Looking Towards Exascale Systems.” In: Proceedings of the 50th Annual Southeast Regional Conference. ACM-SE ’12. Tuscaloosa, Alabama: ACM, 2012, pp. 262–267. isbn: 978-1-4503-1203-5. doi: 10.1145/2184512.2184574.
[Joy+83] W. Joy, E. Cooper, R. Fabry, S. Leffler, K. McKusick, and D. Mosher. 4.2 BSD system manual. Tech. rep. Computer Systems Research Group, University of California, 1983.
[JR13] J. Jeffers and J. Reinders. Intel Xeon Phi coprocessor high performance programming. Newnes, 2013.
[JYS19] D. Jauk, D. Yang, and M. Schulz. “Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice.” In: SC: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. Accepted for Publication. ACM Press. 2019, t.b.d.

[KK11] D. M. Kunzman and L. V. Kale. “Programming Heterogeneous Systems.” In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. May 2011, pp. 2061–2064. doi: 10.1109/IPDPS.2011.377.
[KK93] L. V. Kale and S. Krishnan. CHARM++: a portable concurrent object oriented system based on C++. Vol. 28. Citeseer, 1993.
[Kli+17] J. Klinkenberg, C. Terboven, S. Lankes, and M. S. Müller. “Data Mining-Based Analysis of HPC Center Operations.” In: 2017 IEEE Int’l Conf. on Cluster Computing (CLUSTER). Sept. 2017, pp. 766–773.
[Küs+09] T. Küstner, J. Weidendorfer, J. Schirmer, T. Klug, C. Trinitis, and S. Ziegler. “Parallel MLEM on Multicore Architectures.” In: Computational Science – ICCS 2009. Ed. by G. Allen, J. Nabrzyski, E. Seidel, G. D. van Albada, J. Dongarra, and P. M. A. Sloot. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 491–500. isbn: 978-3-642-01970-8.
[Küs+10] T. Küstner, P. Pedron, J. Schirmer, M. Hohberg, J. Weidendorfer, and S. I. Ziegler. “Fast system matrix generation using the detector response function model on Fermi GPUs.” In: 2010 Nuclear Science Symposium and Medical Imaging Conference. 2010.
[KZP06] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. “Machine learning: a review of classification and combining techniques.” In: Artificial Intelligence Review 26.3 (Nov. 2006), pp. 159–190. issn: 1573-7462.
[Lan+10] Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan. “A study of dynamic meta-learning for failure prediction in large-scale systems.” In: Journal of Parallel and Distributed Computing 70.6 (2010), pp. 630–643.
[Law+77] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. “Basic linear algebra subprograms for Fortran usage.” In: (1977).
[LH10] O. Laadan and S. E. Hallyn. “Linux-CR: Transparent application checkpoint-restart in Linux.” In: Linux Symposium. Vol. 159. Citeseer. 2010.
[LHY10] H.-C. W. Lin, Y. (Helen) He, and W.-S. Yang. “Franklin Job Completion Analysis.” In: CUG 2010 Proceedings. 2010, pp. 1–12.
[Lia+06] Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo. “BlueGene/L Failure Analysis and Prediction Models.” In: Int’l Conf. on Dependable Systems and Networks (DSN’06). June 2006, pp. 425–434.
[Lia+07] Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo. “Failure Prediction in IBM BlueGene/L Event Logs.” In: Seventh IEEE Int’l Conf. on Data Mining (ICDM 2007). Oct. 2007, pp. 583–588.

[Lie11] J. H. Lienhard. A heat transfer textbook. Courier Corporation, 2011.
[Lin+14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. “Microsoft COCO: Common objects in context.” In: European Conference on Computer Vision. Springer. 2014, pp. 740–755.
[Liu+11a] Q. Liu, G. Jin, J. Zhou, Q. Sun, and M. Xi. “Bayesian serial revision method for RLLC cluster systems failure prediction.” In: Journal of Systems Engineering and Electronics 22.2 (Apr. 2011).
[Liu+11b] Q. Liu, J. Zhou, G. Jin, Q. Sun, and M. Xi. “FaBSR: a method for cluster failure prediction based on Bayesian serial revision and an application to LANL cluster.” In: Quality and Reliability Engineering Int’l 27.4 (2011), pp. 515–527. issn: 1099-1638.
[Ma+15] A. Ma, R. Traylor, F. Douglis, M. Chamness, G. Lu, D. Sawyer, S. Chandra, and W. Hsu. “RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures.” In: ACM Trans. Storage 11.4 (Nov. 2015), 17:1–17:28. issn: 1553-3077.
[Mes18] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 3.2 (draft). Unofficial, for comment only. 2018.
[Mil+00] D. S. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou. “Process migration.” In: ACM Computing Surveys (CSUR) 32.3 (2000), pp. 241–299.
[Mit16] S. Mittal. “A Survey Of Architectural Techniques for Managing Process Variation.” In: ACM Computing Surveys 48 (Feb. 2016). doi: 10.1145/2871167.
[Mos93] D. Mosberger. “Memory Consistency Models.” In: SIGOPS Oper. Syst. Rev. 27.1 (Jan. 1993), pp. 18–26. issn: 0163-5980. doi: 10.1145/160551.160553.
[MSS17] F. Mahdisoltani, I. Stefanovici, and B. Schroeder. “Proactive error prediction to improve storage system reliability.” In: 2017 USENIX Annual Technical Conference (USENIX ATC 17). 2017, pp. 391–402.
[MYL17] L. Ma, S. Yi, and Q. Li. “Efficient service handoff across edge servers via Docker container migration.” In: Proceedings of the Second ACM/IEEE Symposium on Edge Computing. ACM. 2017, p. 11.
[NAC11] N. Nakka, A. Agrawal, and A. Choudhary. “Predicting node failure in high performance computing systems from failure and usage logs.” In: IEEE Int’l. Parallel and Distributed Processing Symp., Workshops and PhD Forum (IPDPSW). 2011, pp. 1557–1566. isbn: 9780769543857.

[Nad+17] S. Nadgowda, S. Suneja, N. Bila, and C. Isci. “Voyager: Complete container state migration.” In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). IEEE. 2017, pp. 2137–2142.
[Nag+07] A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. “Proactive Fault Tolerance for HPC with Xen Virtualization.” In: Proceedings of the 21st Annual International Conference on Supercomputing. ICS ’07. Seattle, Washington: ACM, 2007, pp. 23–32. isbn: 978-1-59593-768-1. doi: 10.1145/1274971.1274978.
[Nie+06] J. Nieplocha, B. Palmer, V. Tipparaju, M. Krishnan, H. Trease, and E. Apra. “Advances, applications and performance of the Global Arrays shared memory programming toolkit.” In: The International Journal of High Performance Computing Applications 20.2 (2006), pp. 203–231.
[Nie+17] B. Nie, J. Xue, S. Gupta, C. Engelmann, E. Smirni, and D. Tiwari. “Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities.” In: 2017 IEEE 25th Int’l. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE. 2017, pp. 22–31.
[NLH+05] M. Nelson, B.-H. Lim, G. Hutchins, et al. “Fast Transparent Migration for Virtual Machines.” In: USENIX Annual Technical Conference, General Track. 2005, pp. 391–394.
[Pat+17] A. Patwari, I. Laguna, M. Schulz, and S. Bagchi. “Understanding the Spatial Characteristics of DRAM Errors in HPC Clusters.” In: Proceedings of the 2017 Workshop on Fault Tolerance for HPC at Extreme Scale. FTXS ’17. Washington, DC, USA: ACM, 2017, pp. 17–22.
[Pat+89] D. A. Patterson, P. Chen, G. Gibson, and R. H. Katz. “Introduction to redundant arrays of inexpensive disks (RAID).” In: Digest of Papers. COMPCON Spring 89. Thirty-Fourth IEEE Computer Society International Conference: Intellectual Leverage. IEEE. 1989, pp. 112–117.
[Pat98] D. W. Patterson. Artificial Neural Networks: Theory and Applications. 1st. Upper Saddle River, NJ, USA: Prentice Hall PTR, 1998. isbn: 0132953536.
[Pel+14] A. Pelaez, A. Quiroz, J. C. Browne, E. Chuah, and M. Parashar. “Online failure prediction for HPC resources using decentralized clustering.” In: 2014 21st Int’l Conf. on High Performance Computing (HiPC). Dec. 2014, pp. 1–9.

[Pic+14] S. Pickartz, R. Gad, S. Lankes, L. Nagel, T. Süß, A. Brinkmann, and S. Krempel. “Migration techniques in HPC environments.” In: European Conference on Parallel Processing. Springer. 2014, pp. 486–497.
[Pic+16a] S. Pickartz, N. Eiling, S. Lankes, L. Razik, and A. Monti. “Migrating LinuX containers using CRIU.” In: International Conference on High Performance Computing. Springer. 2016, pp. 674–684.
[Pic+16b] S. Pickartz, S. Lankes, A. Monti, C. Clauss, and J. Breitbart. “Application migration in HPC – A driver of the exascale era?” In: 2016 International Conference on High Performance Computing and Simulation (HPCS). IEEE. 2016, pp. 318–325.
[Pic+18] S. Pickartz, C. Clauss, J. Breitbart, S. Lankes, and A. Monti. “Prospects and challenges of virtual machine migration in HPC.” In: Concurrency and Computation: Practice and Experience 30.9 (2018), e4412.
[PWB07] E. Pinheiro, W.-D. Weber, and L. A. Barroso. “Failure Trends in a Large Disk Drive Population.” In: 5th USENIX Conference on File and Storage Technologies (FAST 2007). 2007, pp. 17–29.
[Qiu+17] Y. Qiu, C. Lung, S. Ajila, and P. Srivastava. “LXC Container Migration in Cloudlets under Multipath TCP.” In: 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC). Vol. 2. July 2017, pp. 31–36. doi: 10.1109/COMPSAC.2017.163.
[Rao+19] A. Raoofy, D. Yang, J. Weidendorfer, C. Trinitis, and M. Schulz. “Enabling Malleability for Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics using LAIK.” In: PARS-Mitteilungen 2019, Berlin, Germany. Accepted for Publication. 2019.
[RBP12] R. Rajachandrasekar, X. Besseron, and D. K. Panda. “Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI.” In: 26th IEEE Int’l Parallel and Distributed Processing Symp., Workshops and PhD Forum. May 2012.
[Red08] V. G. Reddy. “Neon technology introduction.” In: ARM Corporation 4 (2008), p. 1.
[Rin+17] C. A. Rincón, J.-F. Pâris, R. Vilalta, A. M. Cheng, and D. D. Long. “Disk failure prediction in heterogeneous environments.” In: 2017 Int’l. Symp. on Performance Evaluation of Computer and Telecommunication Systems (SPECTS). July 2017, pp. 1–7.
[RN03] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. 2nd ed. Pearson Education, 2003. isbn: 0137903952.

[RS15] E. Ruijters and M. Stoelinga. “Fault tree analysis: A survey of the state-of-the-art in modeling, analysis and tools.” In: Computer Science Review 15-16.Supplement C (2015), pp. 29–62. issn: 1574-0137.
[RT+15] D. Rossetti, S. Team, et al. GPUDirect: integrating the GPU with a network interface. 2015.
[Rus78] R. M. Russell. “The CRAY-1 computer system.” In: Communications of the ACM 21.1 (1978), pp. 63–72.
[Saa03] Y. Saad. Iterative methods for sparse linear systems. Vol. 82. SIAM, 2003.
[Sar+14] V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, D. Grove, A. Shinnar, M. Takeuchi, et al. X10 Language Specification Version 2.5. Citeseer, 2014.
[SB16] A. Sîrbu and O. Babaoglu. “Towards operator-less data centers through data-driven, predictive, proactive autonomics.” In: Cluster Computing 19.2 (June 2016), pp. 865–878.
[Sed46] L. I. Sedov. “Propagation of strong shock waves.” In: Journal of Applied Mathematics and Mechanics 10 (1946), pp. 241–250.
[SG07a] B. Schroeder and G. Gibson. “Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?” In: Proceedings of the 5th USENIX Conference on File and Storage Technologies. FAST ’07. San Jose, CA: USENIX Association, 2007.
[SG07b] B. Schroeder and G. A. Gibson. “Understanding failures in petascale computers.” In: Journal of Physics: Conference Series. Vol. 78. 1. IOP Publishing. 2007, p. 012022.
[SG10] B. Schroeder and G. Gibson. “A Large-Scale Study of Failures in High-Performance Computing Systems.” In: IEEE Transactions on Dependable and Secure Computing 7.4 (Oct. 2010), pp. 337–350. issn: 1545-5971.
[SG84] A. Spector and D. Gifford. “The space shuttle primary computer system.” In: Communications of the ACM 27.9 (1984), pp. 872–900.
[Sha06] A. Shan. “Heterogeneous processing: a strategy for augmenting Moore’s law.” In: Linux Journal 2006.142 (2006), p. 7.
[SKP06] S. Sur, M. J. Koop, and D. K. Panda. “High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis.” In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. ACM. 2006, p. 105.

[SKT15] M. Soualhia, F. Khomh, and S. Tahar. “Predicting Scheduling Failures in the Cloud: A Case Study with Google Clusters and Hadoop on Amazon EMR.” In: Proceedings of the 2015 IEEE 17th Int’l Conf. on High Performance Computing and Communications, 7th Int. Symp. on Cyberspace Safety and Security, and 12th Int’l Conf. on Embedded Software and Systems. HPCC-CSS-ICESS ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 58–65.
[SLM10] F. Salfner, M. Lenk, and M. Malek. “A Survey of Online Failure Prediction Methods.” In: ACM Comput. Surv. 42.3 (Mar. 2010), 10:1–10:42. issn: 0360-0300.

[Sod15] A. Sodani. “Knights Landing (KNL): 2nd generation Intel® Xeon Phi processor.” In: 2015 IEEE Hot Chips 27 Symposium (HCS). IEEE. 2015, pp. 1–24.
[SPW09] B. Schroeder, E. Pinheiro, and W.-D. Weber. “DRAM errors in the wild: a large-scale field study.” In: ACM SIGMETRICS Performance Evaluation Review. Vol. 37. 1. ACM. 2009, pp. 193–204.
[Sri+13] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi. “Feng Shui of Supercomputer Memory: Positional Effects in DRAM and SRAM Faults.” In: Proceedings of the Int’l Conf. on High Performance Computing, Networking, Storage and Analysis. SC ’13. Denver, Colorado: ACM, 2013.
[Ste96] G. Stellner. “CoCheck: Checkpointing and process migration for MPI.” In: Proceedings of International Conference on Parallel Processing. IEEE. 1996, pp. 526–531.
[Str+03] D. Strul, R. B. Slates, M. Dahlbom, S. R. Cherry, and P. K. Marsden. “An improved analytical detector response function model for multilayer small-diameter PET scanners.” In: Physics in Medicine and Biology 48 (2003), pp. 979–994.
[Sum19] V. A. Suma. “Exploring and Prototyping the MPI Process Set Management of MPI Sessions.” MA thesis. 85375 Neufahrn bei Freising: Technische Universität München, 2019.
[SV82] L. A. Shepp and Y. Vardi. “Maximum likelihood reconstruction for emission tomography.” In: IEEE Transactions on Medical Imaging 1.2 (1982), pp. 113–122.
[SW89] J. Simpson and E. Weiner. The Oxford English Dictionary. Oxford, England: Oxford University Press, 1989.

[Tae+10] N. Taerat, C. Leangsuksun, C. Chandler, and N. Naksinehaboon. “Proficiency Metrics for Failure Prediction in High Performance Computing.” In: Int’l. Symp. on Parallel and Distributed Processing with Applications. Sept. 2010, pp. 491–498.
[TBA13] S. Treichler, M. Bauer, and A. Aiken. “Language support for dynamic, hierarchical data partitioning.” In: ACM SIGPLAN Notices. Vol. 48. 10. ACM. 2013, pp. 495–514.
[Tho+10] J. Thompson, D. W. Dreisigmeyer, T. Jones, M. Kirby, and J. Ladd. “Accurate fault prediction of BlueGene/P RAS logs via geometric reduction.” In: 2010 Int’l Conf. on Dependable Systems and Networks Workshops (DSN-W). June 2010, pp. 8–14.
[Tiw+15a] D. Tiwari, S. Gupta, G. Gallarno, J. Rogers, and D. Maxwell. “Reliability Lessons Learned from GPU Experience with the Titan Supercomputer at Oak Ridge Leadership Computing Facility.” In: Proceedings of the Int’l Conf. for High Performance Computing, Networking, Storage and Analysis. SC ’15. Austin, Texas: ACM, 2015, 38:1–38:12. isbn: 978-1-4503-3723-6.
[Tiw+15b] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo, N. DeBardeleben, P. Navaux, et al. “Understanding GPU errors on large-scale HPC systems and the implications for system design and operation.” In: 2015 IEEE 21st Int’l. Symp. on High Performance Computer Architecture (HPCA). Feb. 2015, pp. 331–342.
[TOP19] TOP500.org. TOP 500, The List. 2019. url: https://www.top500.org (visited on 03/17/2019).
[Ure+12] I. A. C. Ureña, M. Riepen, M. Konow, and M. Gerndt. “Invasive MPI on Intel’s Single-Chip Cloud Computer.” In: International Conference on Architecture of Computing Systems. Springer. 2012, pp. 74–85.
[Vir] Virtuozzo. Checkpoint/Restore In Userspace. https://criu.org/Main_Page. Accessed on 29.06.2019.
[Voo+09] W. Voorsluys, J. Broberg, S. Venugopal, and R. Buyya. “Cost of virtual machine live migration in clouds: A performance evaluation.” In: IEEE International Conference on Cloud Computing. Springer. 2009, pp. 254–265.
[Wan+08] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. “Proactive process-level live migration in HPC environments.” In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE Press. 2008, p. 43.

[Wan+10] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. “Hybrid checkpointing for MPI jobs in HPC environments.” In: 2010 IEEE 16th International Conference on Parallel and Distributed Systems. IEEE. 2010, pp. 524–533.
[Wan+12] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. “Proactive process-level live migration and back migration in HPC environments.” In: Journal of Parallel and Distributed Computing 72.2 (2012), pp. 254–267.
[Wan+14] E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, and Y. Wang. “Intel Math Kernel Library.” In: High-Performance Computing on the Intel® Xeon Phi™. Springer, 2014, pp. 167–188.
[Wat+12] Y. Watanabe, H. Otsuka, M. Sonoda, S. Kikuchi, and Y. Matsumoto. “Online failure prediction in cloud datacenters by real-time message pattern learning.” In: 4th IEEE Int’l Conf. on Cloud Computing Technology and Science Proceedings. Dec. 2012, pp. 504–511.
[WD96] D. W. Walker and J. J. Dongarra. “MPI: a standard message passing interface.” In: Supercomputer 12 (1996), pp. 56–68.
[WOM14] Y. Watanabe, H. Otsuka, and Y. Matsumoto. “Failure Prediction for Cloud Datacenter by Hybrid Message Pattern Learning.” In: 2014 IEEE 11th Intl Conf on Ubiquitous Intelligence and Computing. Dec. 2014, pp. 425–432.
[Woo+07] T. Wood, P. J. Shenoy, A. Venkataramani, M. S. Yousif, et al. “Black-box and Gray-box Strategies for Virtual Machine Migration.” In: NSDI. Vol. 7. 2007, pp. 17–17.
[XSC13] Z. Xiao, W. Song, and Q. Chen. “Dynamic resource allocation using virtual machines for cloud computing environment.” In: IEEE Transactions on Parallel and Distributed Systems 24.6 (2013), pp. 1107–1117.
[Xu+10] L. Xu, H.-Q. Wang, R.-J. Zhou, and B.-Y. Ge. “Autonomic failure prediction based on manifold learning for large-scale distributed systems.” In: The Journal of China Universities of Posts and Telecommunications 17.4 (2010), pp. 116–124. issn: 1005-8885.
[Xue+07] Z. Xue, X. Dong, S. Ma, and W. Dong. “A Survey on Failure Prediction of Large-Scale Server Clusters.” In: Eighth ACIS Int’l Conf. on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2007). Vol. 2. July 2007, pp. 733–738.
[Yan+18] D. Yang, J. Weidendorfer, C. Trinitis, T. Küstner, and S. Ziegler. “Enabling Application-Integrated Proactive Fault Tolerance.” In: PARCO 2017: Parallel Computing is Everywhere. IOS Press. 2018, pp. 475–484.

[YJG03] A. B. Yoo, M. A. Jette, and M. Grondona. “SLURM: Simple Linux Utility for Resource Management.” In: Workshop on Job Scheduling Strategies for Parallel Processing. Springer. 2003, pp. 44–60.
[Yu+11] L. Yu, Z. Zheng, Z. Lan, and S. Coghlan. “Practical online failure prediction for Blue Gene/P: Period-based vs event-driven.” In: 2011 IEEE/IFIP 41st Int’l Conf. on Dependable Systems and Networks Workshops (DSN-W). June 2011, pp. 259–264.
[Yua+12] Y. Yuan, Y. Wu, Q. Wang, G. Yang, and W. Zheng. “Job failures in high performance computing systems: A large-scale empirical study.” In: Computers & Mathematics with Applications 63.2 (2012). Advances in context, cognitive, and secure computing, pp. 365–377.
[Zha00] G. P. Zhang. “Neural networks for classification: a survey.” In: IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30.4 (Nov. 2000), pp. 451–462.
[Zhe+10] Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman. “A practical failure prediction with location and lead time for Blue Gene/P.” In: 2010 Int’l Conf. on Dependable Systems and Networks Workshops (DSN-W). June 2010, pp. 15–22.
[Zho+14] H. Zhou, Y. Mhedheb, K. Idrees, C. W. Glass, J. Gracia, and K. Fürlinger. “DART-MPI: An MPI-based implementation of a PGAS runtime system.” In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. ACM. 2014, p. 3.
[Zhu+13] B. Zhu, G. Wang, X. Liu, D. Hu, S. Lin, and J. Ma. “Proactive drive failure prediction for large scale storage systems.” In: 2013 IEEE 29th Symp. on Mass Storage Systems and Technologies (MSST). May 2013, pp. 1–5.
[ZN01] H. Zhong and J. Nieh. CRAK: Linux checkpoint/restart as a kernel module. Tech. rep. CUCS-014-01. Department of Computer Science, Columbia University, 2001.
[ZSK04] G. Zheng, L. Shi, and L. V. Kalé. “FTC-Charm++: an in-memory checkpoint-based fault-tolerant runtime for Charm++ and MPI.” In: 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No. 04EX935). IEEE. 2004, pp. 93–103.
