Resilience Design Patterns a Structured Approach to Resilience at Extreme Scale - Version 1.0
Total Page:16
File Type:pdf, Size:1020Kb
ORNL/TM-2016/687 Resilience Design Patterns A Structured Approach to Resilience at Extreme Scale - version 1.0 Saurabh Hukerikar Christian Engelmann Approved for public release. October 2016 Distribution is unlimited. DOCUMENT AVAILABILITY Reports produced after January 1, 1996, are generally available free via US Department of Energy (DOE) SciTech Connect. Website: http://www.osti.gov/scitech/ Reports produced before January 1, 1996, may be purchased by members of the public from the following source: National Technical Information Service 5285 Port Royal Road Springfield, VA 22161 Telephone: 703-605-6000 (1-800-553-6847) TDD: 703-487-4639 Fax: 703-605-6900 E-mail: [email protected] Website: http://classic.ntis.gov/ Reports are available to DOE employees, DOE contractors, Energy Technology Data Ex- change representatives, and International Nuclear Information System representatives from the following source: Office of Scientific and Technical Information PO Box 62 Oak Ridge, TN 37831 Telephone: 865-576-8401 Fax: 865-576-5728 E-mail: [email protected] Website: http://www.osti.gov/contact.html This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal lia- bility or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or rep- resents that its use would not infringe privately owned rights. Refer- ence herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not nec- essarily constitute or imply its endorsement, recommendation, or fa- voring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. ORNL/TM-2016/687 Computer Science and Mathematics Division Resilience Design Patterns A Structured Approach to Resilience at Extreme Scale version 1.0 Saurabh Hukerikar Christian Engelmann Date Published: October 2016 Prepared by OAK RIDGE NATIONAL LABORATORY Oak Ridge, TN 37831-6283 managed by UT-Battelle, LLC for the US DEPARTMENT OF ENERGY under contract DE-AC05-00OR22725 CONTENTS LIST OF FIGURES . iv LIST OF TABLES . v ACKNOWLEDGMENTS . vii ABSTRACT . viii 1. Introduction . 1 2. Resilience Terminology and Concepts . 4 2.1 Reliability . 4 2.2 Availability . 4 2.3 Systems . 4 2.4 Faults . 5 2.5 Errors . 5 2.6 Failures . 6 2.7 The Relationship between Faults, Errors and Failures . 7 2.8 Resilience Capabilities . 7 2.8.1 Detection . 7 2.8.2 Containment . 8 2.8.3 Masking . 8 2.9 Resilience Metrics . 8 2.9.1 Reliability Metrics . 8 2.9.2 Availability Metrics . 9 2.9.3 Error and Failure Detection Metrics . 10 2.9.4 Mean Time to Failure . 11 3. The Resilience Challenge for Extreme-Scale HPC Systems . 12 4. Survey of HPC Resilience Techniques . 14 4.1 Hardware-based Techniques . 14 4.2 Software-implemented Techniques . 14 4.2.1 Operating System & Runtime-based Solutions . 14 4.2.2 Message Passing Library-based Solutions . 15 4.2.3 Compiler-based Solutions . 15 4.2.4 Programming Model Techniques . 16 4.2.5 Algorithm-Based Fault Tolerance . 17 4.3 Cooperative Hardware/Software Approaches . 18 5. Design Patterns for Resilience . 19 5.1 Introduction to Design Patterns . 19 5.2 Design Patterns for HPC Resilience Solutions . 19 5.3 Anatomy of a Resilience Design Pattern . 20 6. Classification of Resilience Design Patterns . 21 7. The Resilience Pattern Catalog . 23 7.1 Describing Design Patterns . 23 7.2 Strategy Patterns . 25 7.2.1 Fault Treatment Pattern . 25 7.2.2 Recovery Pattern . 26 7.2.3 Compensation Pattern . 28 7.3 Architectural Patterns . 29 7.3.1 Fault Diagnosis Pattern . 29 7.3.2 Reconfiguration Pattern . 30 3 7.3.3 Checkpoint Recovery Pattern . 32 7.3.4 State Diversity Pattern . 33 7.3.5 Design Diversity Pattern . 34 7.4 Structural Patterns . 35 7.4.1 Monitoring Pattern . 35 7.4.2 Prediction Pattern . 37 7.4.3 Restructure Pattern . 38 7.4.4 Rejuvenation Pattern . 39 7.4.5 Reinitialization Pattern . 40 7.4.6 Roll-back Pattern . 41 7.4.7 Roll-forward Pattern . 42 7.4.8 N-modular Redundancy Pattern . 44 7.4.9 N-version Design Pattern . 45 7.4.10 Recovery Block Pattern . 46 7.5 State Patterns . 47 7.5.1 Persistent State Pattern . 47 7.5.2 Dynamic State Pattern . 48 7.5.3 Environment State Pattern . 49 7.5.4 Stateless Pattern . 50 8. Building Resilience Solutions using Resilience Design Patterns . 52 8.1 Features of Resilience Solutions . 52 8.2 Design Spaces . 52 9. Case Study: Checkpoint and Rollback . 54 10. Case Study: Proactive Process Migration . 56 11. Case Study: Cross-Layer Hardware/Software Hybrid Solution . 58 12. Summary . 60 iii LIST OF FIGURES 1 Relationship between fault, error and failure . 7 2 Anatomy of a Resilient Design Pattern . 20 3 Classification of resilience design patterns . 21 4 Elements of a resilience solution for HPC systems and applications . 52 5 Design Spaces for construction of resilience solutions using patterns . 53 6 Resilience Solution Case Study: Checkpoint & Restart using BLCR . 54 7 Resilience Solution Case Study: Process Migration . 56 8 Resilience Solution Case Study: Cross-Layer Design using ECC with ABFT . 58 iv LIST OF TABLES 1 Availability measured by the “nines” . 10 v ACKNOWLEDGMENTS This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Early Career Research Program under contract number DE-AC05-00OR22725. vii ABSTRACT Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Pro- jections based on the current generation of HPC systems and technology roadmaps suggest that very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Practical limits on power consumption in HPC systems will require future systems to embrace innovative architectures, increasing the levels of hardware and software complexities. The resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. These techniques must seek to improve resilience at reasonable overheads to power consumption and performance. While the HPC community has developed various solutions, application-level as.