
Fault-Tolerant Operating System for Many-core Processors

Amit Fuchs


Research Thesis

In partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Amit Fuchs

Submitted to the Senate of the Technion — Israel Institute of Technology
Tevet 5778, Haifa, December 2017

The research thesis was carried out under the supervision of Prof. Avi Mendelson, in the Faculty of Computer Science.

Contents

List of Figures
List of Listings
Abstract
Glossary
Abbreviations
1 Introduction
   1.1 Achievements
   1.2 Related Work
      1.2.1 Distributed Operating Systems
      1.2.2 Parallelization Techniques
         1.2.2.1 Hardware Parallelism
         1.2.2.2 Native Multi-threading Libraries
         1.2.2.3 Message Passing
         1.2.2.4 Transactional Memory
2 Model
   2.1 Clustered Architecture
   2.2 Distributed Memory
      2.2.1 Local Memories
      2.2.2 Globally Shared Memory
         2.2.2.1 Scalable Consistency Model
         2.2.2.2 No Hardware-based Coherency
         2.2.2.3 Distance Dependent Latency (NUMA)
         2.2.2.4 Address Translation
   2.3 Processor Faults
      2.3.1 Threats Terminology
      2.3.2 Sources for Processor Faults
      2.3.3 Implications of Transistor Scaling
      2.3.4 Treated Faults
         2.3.4.1 Permanent Faults (a.k.a. Hard-faults)
         2.3.4.2 Transient Faults (a.k.a. Soft-faults)
3 Operating System Goals
   3.1 Fault-Tolerance Goals
   3.2 Performance Goals
   3.3 Additional Considerations
4 Principles
   4.1 Application Programming Model
      4.1.1 Dataflow Flavored Execution
      4.1.2 Implicit Synchronization
      4.1.3 Explicit Parallelization
      4.1.4 Building Programs
      4.1.5 Parallelization Example
   4.2 Partitioned Memory Layout
      4.2.1 Local Memory Role
      4.2.2 Shared Memory Organization
         4.2.2.1 Partitioned Global Address Space
         4.2.2.2 Low-Level Memory Management
   4.3 Fault-Tolerance and Resiliency
      4.3.1 Tolerating Hard Faults
         4.3.1.1 Detecting Permanent Faults
         4.3.1.2 Recovery Process Overview
         4.3.1.3 Fault Detection and Recovery Scenario
      4.3.2 Tolerating Transient Faults
      4.3.3 Low-Cost Fine-Grained Checkpoints
   4.4 Additional Principles
      4.4.1 Software-only Approach
      4.4.2 Everything is Wait-Free
      4.4.3 Software-Managed Consistency
      4.4.4 No Atomic Conditionals
      4.4.5 Completely Decentralized
      4.4.6 Easily Fit Existing Systems
5 Functional Overview
   5.1 Inter-node Communication
      5.1.1 Block Transport Layer
         5.1.1.1 Behavior on Faults
         5.1.1.2 Poor Man’s Transactions
         5.1.1.3 Extending Transactions
         5.1.1.4 Transport Backplane
         5.1.1.5 Channel Operation
      5.1.2 Message-Passing Layer
         5.1.2.1 TaskLoad
         5.1.2.2 TaskWrite
         5.1.2.3 Heartbeat
      5.1.3 (Safe) Messages Processing
         5.1.3.1 Processing Task Inputs
         5.1.3.2 Processing Scheduling Messages
   5.2 Task Execution Procedure
      5.2.1 Life of a Task
      5.2.2 Deferring Side-Effects
         5.2.2.1 Deferring Memory Write Operations
         5.2.2.2 Deferring Scheduling Requests
         5.2.2.3 Deferring Results Publication
      5.2.3 Task Finalization
   5.3 Fault-Tolerant Methods
      5.3.1 Recoverable State
         5.3.1.1 Verifying Shared State
      5.3.2 Idempotent Side-effects
      5.3.3 Fault-Tolerant Inputs State Tracking
      5.3.4 Distributed Watchdog
      5.3.5 Double Execution
   5.4 Task Placement
      5.4.1 Load Balancing
         5.4.1.1 Placement Considerations
         5.4.1.2 Task Placement
      5.4.2 Scheduling Junctions
      5.4.3 Heartbeats Collection
   5.5 Implementation Details
      5.5.1 Unique Frame Identifiers
      5.5.2 Task Descriptor
         5.5.2.1 Structure Validation
      5.5.3 Thread Binaries
      5.5.4 Service Threads
      5.5.5 Dataflow Threads
6 Evaluation
   6.1 Full-system Simulation
      6.1.1 Simulator Operation
      6.1.2 Simulator Modifications
   6.2 Investigation Methodology
   6.3 Goals of the Experiments
   6.4 Experiments and Results
      6.4.1 Application Timeline
      6.4.2 Recovery from Faults
      6.4.3 Speedup Through Parallelization
      6.4.4 Overhead of Parallelization and Reliability
         6.4.4.1 Effects of Parallel Workload
         6.4.4.2 Impact of Cores Partitioning
      6.4.5 Impact of Parallel Workload
7 Hands-on Experiment
   7.1 Goal of the Experiment
   7.2 Instructions to Start
   7.3 Expected Output
8 Concluding Remarks
9 Appendix
   9.1 Experiment Output of 512 Cores
References
Hebrew Abstract

List of Figures

1.1 On-the-fly recovery from a hard fault on a 1024-core processor
2.1 Heterogeneous cluster architecture for a many-core processor
2.2 Top-level organization of the many-core processor
4.1 High-level roles given to the architectural components
4.2 Node memory layout
4.3 Hard-fault recovery process
4.4 On-the-fly hard-fault recovery - 128 cores
5.1 Shared memory layout
5.2 TaskWrite message structure
5.3 Structure of a task identifier (UFI)
5.4 Structure of a TaskDescriptor
5.5 Binary layout of a dataflow thread
6.1 Full-system many-core simulation
6.2 Throughput and parallel workload during run-time
6.3 Workload versus task duration
6.4 Workload versus nodes - 4 cores per node
6.5 Workload versus cores per node - 4 nodes
6.6 Workload versus cores per node - 8 nodes
6.7 Workload versus nodes - long tasks duration
6.8 Increased hard-fault detection delay
6.9 Excessively long hard-fault detection delay
6.10 Speedup versus the number of cores
6.11 OS dominated cycles on different workloads
6.12 OS dominated cycles on different configurations
6.13 Run-time with varied tasks duration and cores partitioning
6.14 Varied application workloads
6.15 Impact of cores partitioning on workload
7.1 Simulation screenshot
7.2 Recovery status screenshot
7.3 Double Execution in action

Listings

4.1 The famous recursive definition for Fibonacci series
4.2 Explicit dataflow version of the Fibonacci series
7.1 Example output trace of the dataflow engine
7.2 Hard-fault detection and recovery
9.1 Node trace output on a 512-core experiment

Abstract

Creating operating systems for many-core processors is a critical challenge. Contemporary multi-core systems inhibit scaling by relying on hardware-based cache-coherency and atomic primitives to guarantee consistency and synchronization when accessing shared memory. Moreover, projections on transistor scaling trends predict that hardware fault rates will increase by orders of magnitude and that the microarchitecture alone cannot provide adequate robustness in the exascale era. Resilience must be considered at all levels; operating systems cannot continue to assume that processors are error-free.

A fault-tolerant distributed operating system is presented, designed to harness the massive parallelism in many-core distributed shared memory processors. It targets scale-out with 1,000-10,000+ fault-prone cores on-chip and waives traditional hardware-based consistency over the shared memory. The operating system allows applications to remain oblivious to hardware faults and efficiently utilize all cores of exascale systems-on-chip without performing explicit synchronization.

To scale efficiently and reliably as the number of cores rapidly increases while their reliability decreases, the new operating system provides fault-tolerant task-level parallelism to applications through a coarse-grained data-flow programming model. A decentralized wait-free execution engine was created to maximize task parallelism and resiliency over unreliable processing cores. It combines message-passing and shared memory without strong consistency guarantees. Fine-grained checkpoints are intrinsic at all levels, enabling on-the-fly recovery of application-level tasks in the case of hardware faults, automatically resuming their execution with minimal costs.

A prototype implementation of the new operating system was experimentally evaluated on a many-core full-system simulator; the presented results exemplify the characteristics and benefits of the new approach.

Keywords— exascale, many-core, fault-tolerance, resilience, operating sys- tem, data-flow, wait-free

Glossary

Cluster In the context of this work, a cluster is the set of tightly coupled symmetric or heterogeneous nodes and their shared resources.

Core A computation element within a node, capable of performing calculations and executing threads. A core can execute instructions independently of other cores but may share hardware resources with them, such as caches and buses.

Data-flow Execution model that helps fragment large operations into small tasks that may execute in parallel according to their dependencies, see Section 4.1.1.

Dataflow Task Task, in short, is a reference to executable instructions and their accompanying details, such as the required inputs and conditions for execution. A task is associated with a thread that may only begin execution when all of the required inputs are ready. Once started, the thread of a task may run uninterrupted and does not produce any system-wide side-effects until its completion.

Error See Section 2.3.1

Failure See Section 2.3.1

Fault See Section 2.3.1

Hard fault Permanent and complete core failure, see Section 2.3.4.1.

JIT Just-in-time compilation, to compile application code during its run-time rather than only in advance of its launch.

Mean Time to Error A statistical measure of component reliability. It is the average length of time that the device works without faults.

Node A group of tightly coupled symmetric cores and their nearby hardware resources. They share communication buses to other nodes and have their own private main memory. A node-local kernel governs the resources of the node and acts as its façade.

NUMA Non-uniform memory access. On such an architecture, some memory regions may take longer to access than others. For a given core, this work considers three types of memory regions that may have significantly different access latencies: the local memory of the node, the shared memory region associated with the local node, and the shared regions of other nodes. The difference in latencies may arise because some regions are physically and/or logically further away from the core, so operations need to travel through additional buses.

Processor Contains multiple, possibly heterogeneous, cores or nodes and their supporting hardware resources, packaged together into a single integrated circuit. Another name for a processor is CPU (central processing unit).

Ready task In the context of this work, these are dataflow tasks that have received all of their inputs and are ready to run.

Scheduler In the context of this work, refers to the operating system module that manages the life-cycle of dataflow threads at a high level. It assigns memory and computing resources to threads and manages their state until completion. Note that short-term scheduling of ready threads within a node is the responsibility of a different module.

Thread A list of instructions that execute sequentially. A thread can belong to a dataflow task or to any normal code; it can start execution as soon as it is created.

Thread frame Memory space that is created for a task to hold its received input values. A rough analogue in common conventions is the stack space of a subroutine, which receives parameters from the caller.

Transient fault Temporary core condition that results in computation er- rors, see Section 2.3.4.2.

Wait-free No operation is ever delayed or blocked by another.

Abbreviations

API Application programming interface

CMOS Complementary metal-oxide-semiconductor

DE Double Execution

DF Data-flow

DSM Distributed shared memory

DSP Digital signal processor

FIFO First in, first out

GPGPU General-purpose graphics processing unit

HPC High-performance computing

ILP Instruction-level parallelism

ISA Instruction Set Architecture

JIT Just-in-time

MCA Machine-Check Architecture

MPI Message-Passing Interface

MTTE Mean Time to Error

NUMA Non-uniform memory access

OS Operating System

SPSC Single producer single consumer

UFI Unique Frame Identifier

VFP Virtual Frame Pointer

VLIW Very Long Instruction Word

1 Introduction

This thesis presents a new exascale operating system that was designed, implemented as a working prototype, and experimentally tested on a many-core full-system simulator, along with evaluation results that illustrate the behavior and performance of the design. The operating system includes fault-tolerance mechanisms that keep running applications oblivious to hardware errors caused by transient, intermittent, or permanent faults. To support these reliability requirements while scaling to thousands of cores, the new operating system is distributed, completely decentralized, and uses only wait-free structures on the shared memory. Strong coherency guarantees on the shared memory are not required; compared to existing multi-core architectures, only a weak consistency model is assumed on the entire global address space. The proposed system incorporates dataflow principles to assist the reliable distribution of parallel user applications while minimizing the overhead of fault-tolerance and the delays of fault recovery.

As frequency stopped scaling and extra performance is still required, the common way of achieving more performance out of a computational system is by increasing the number of computational units at all levels, and by increasing the integration with their peripherals, such as memory, DSPs, GPGPUs, and various other controllers. A single processor will soon contain hundreds or even thousands of cores, composed of 100-1000 billion transistors, and a single data center will be made out of millions of processors [1]. Taking advantage of such a vast amount of parallelism requires a redesign of the CPU core organization, the memory hierarchy, and its semantics [2], [3]. In order to exploit the hardware scaling with the best utilization of resources, many SW/OS/HW interfaces require major changes. New interfaces may vary from exposing the CPU organization to the programmer, as in the case of GPGPUs, to providing an abstracted environment for SW that can expose maximal parallelism out of sequential code, as in the case of dataflow programming. The operating system has a critical role in guaranteeing the correctness of the execution, as well as in creating uniform interfaces between the applications and different types of hardware architectures, so that the system can be easily adapted to scale out large computations on evolving machines.

The continued scaling down of transistor feature size adversely affects the lifetime and reliability of the processor die, through increased thermal densities, accumulated defects, and process variation [1]. It is therefore expected that future many-core devices will be orders of magnitude more prone to various types of transient and permanent faults. As indicated by [4], with the scaling down of transistor size, core faults will no longer be considered extremely rare events. Processor faults are expected to gradually become common even on the best hardware [5], and transient and hard core faults can no longer be avoided. As hardware designers acknowledge that faults are becoming a noticeable issue in modern systems, they must also be treated as first-class citizens in the eyes of software developers. Robust systems today usually maintain fault-tolerance at the macro scale, at the system level, rather than at the micro scale of the processors. When considering the tremendous amount of resources in many-core processors and their increased susceptibility to faults, this work assumes that the probability of hardware failures is significant. Therefore, the design of future systems should treat faults at all levels, including the operating system, which needs to answer these issues with fault detection and recovery.

As core counts rapidly increase, having nearby low-latency resources such as registers, caches, and memory remains critical for performance. Therefore, in order to balance hardware growth, performance requirements, and the need for reliability, the massively parallel operating system presented here is based on a clustered architecture that allows continued scaling from multi-core to many-core processors. To support such an architecture, an additional level of hierarchy is needed for cores, interconnects, and memory. Thus, this work assumes a hardware model comprised of nodes with several tightly coupled cores and private memory, and an interconnect to a globally accessible shared memory. Each node resembles an existing off-the-shelf multi-core processor with strong cache-coherency over its main memory, but the shared memory is assumed to be only weakly coherent, since strong coherency protocols significantly limit scaling toward future exascale processors.

This work focuses on creating a new operating system for general-purpose massively parallel processors, many-core devices that may be so highly integrated as to fit into a single die. Such an operating system needs to allow applications to achieve the best utilization of resources with low overhead, to support efficient scaling with an increased number of cores, to employ load-balancing in order to improve overall performance, to avoid creating thermal hot-spots, and more. Imbalance in the utilization of resources may not only severely degrade performance, but it can also increase power consumption and long-term chip wear due to uneven thermal cycling [6], [7]. Another reason to avoid excessive concentration of resource utilization on fault-prone hardware is that the effectiveness of fault-tolerant mechanisms may depend on the system being distributed to isolate faults.

8 Chapter 1 Introduction Technion - Computer Science Department - M.Sc. Thesis MSC-2018-28 - 2018 depend on the system being distributed to isolate faults. Thus, the risk to physically concentrated data is increased in case of faults, i.e., lots of eggs in one basket. In order to the sustain continued scaling in performance as the number of cores increases, this work assumes that future processors will not only scale vertically, i.e., scale up with increased core count and memory capacity, but they will also scale horizontally, i.e., scale out to form several nodes in a single processor, as supported by [8]. In other words, it is assumed that scaling out is needed to efficiently utilize thousands of cores on a single processor, creating an additional level in cores and memory hierarchy. This work then suggests partitioning the hardware resources into nodes; each node contains a set of cores, local memory, caches, and interconnect to neighboring nodes and shared resources. A node may resemble a current off-the-shelf multi-core processor that is combined with its own private main memory (RAM). Besides the local memory in each node, additional memory domains are addressable and accessible from all nodes. The node-local memories are assumed to guarantee strong cache-coherency, as provided by hardware in current commodity processors, but the shared memory may only be a weakly coherent domain. Compared to the strong coherency on private memories, where writes are implicitly propagated to readers by the hardware, explicit instructions are required to propagate data and maintain consistency over the weakly coherent shared memory. Weak consistency on the shared address space is assumed because it allows future processors to break the cache-coherency wall [9], that would otherwise limit the scala- bility of the system. To allow for further scaling regarding possible shared memory hardware implementations, this work considers that the globally shared memory may be distributed on the processor die and have sepa- rate partitions with affinity to each node, i.e., non-uniform memory access (NUMA). Therefore, each node benefits from faster access times to data that is physically close to it in the distributed shared memory. Additional details on the assumed hardware architecture are described in Chapter 2. This thesis presents a new distributed operating system prototype that sup- ports the reliable execution of applications on multi-core and many-core (1,000-10,000+) processors using a new scalable SW/OS/HW co-design. The proposed task-dataflow distributed operating system aims at advancing existing execution models to a higher level, by introducing of new inter- faces between the applications and operating-system, together with new HW interfaces. The user applications are written in existing programming languages such as C/C++ and can use the new run-time framework to define their desired work as a directed graph of dataflow tasks. Tasks are ex- ecuted by the presented operating system with respect to their declared data

9 Technion - Computer Science Department - M.Sc. Thesis MSC-2018-28 - 2018 dependencies to achieve maximum parallelism and avoid using unnecessary synchronization mechanisms.
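To make the idea of a task graph concrete, the following is a minimal illustrative sketch of how an application might describe its work as dataflow tasks. The interface names (df_schedule, df_write, df_read, df_frame_t) are hypothetical placeholders invented for this sketch, not the actual run-time interface of the presented operating system, which is introduced in Chapter 4 and exemplified in Listing 4.2.

#include <stdint.h>

/* Hypothetical run-time interface, for illustration only.
 * A task is created with a number of pending inputs and becomes
 * ready to execute only after all of its inputs have been written. */
typedef struct df_frame df_frame_t;
df_frame_t *df_schedule(void (*entry)(df_frame_t *), uint32_t num_inputs);
void        df_write(df_frame_t *dest, uint32_t slot, uint64_t value);
uint64_t    df_read(df_frame_t *self, uint32_t slot);

/* Consumer: may only start once both producers delivered their values. */
static void sum_task(df_frame_t *self)
{
    uint64_t result = df_read(self, 0) + df_read(self, 1);
    (void)result;  /* ...forward to another task or publish... */
}

/* Producer: squares its input and writes it into the consumer's frame. */
static void square_task(df_frame_t *self)
{
    uint64_t x = df_read(self, 0);
    df_frame_t *dest = (df_frame_t *)(uintptr_t)df_read(self, 1);
    uint32_t slot    = (uint32_t)df_read(self, 2);
    df_write(dest, slot, x * x);
}

/* Building the graph: two independent producers feed one consumer.
 * The run-time tracks the declared dependencies, so no locks or
 * barriers appear in the application code.                          */
void build_graph(void)
{
    df_frame_t *sum = df_schedule(sum_task, 2);
    for (uint32_t i = 0; i < 2; i++) {
        df_frame_t *sq = df_schedule(square_task, 3);
        df_write(sq, 0, i + 3);                       /* value to square      */
        df_write(sq, 1, (uint64_t)(uintptr_t)sum);    /* destination frame    */
        df_write(sq, 2, i);                           /* destination slot     */
    }
}

The point of the sketch is the shape of the interface, not its exact names: the application only declares values and their destinations, and ordering falls out of the dependencies.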

The foci of the work are a new distributed operating system for future exascale clustered architectures, a scalable, decentralized, and resilient dataflow execution engine, and a fault-tolerant run-time environment that is wait-free and relies only on weak consistency guarantees from the shared memory.

The thesis is structured as follows: Section 1.1 mentions the main contributions and achievements of this work. Section 1.2 presents a short survey of existing work, with emphasis on works that relate to the stated goals and problems. Chapter 2 presents the assumed baseline architecture that the proposed system targets. Top-level goals for the work are discussed in Chapter 3. Chapter 4 introduces the main design principles of the new operating system and Chapter 5 provides a functional overview of its modules. Chapter 6 describes the framework used to simulate and evaluate the system, and Section 6.4 presents the experimental results of behavior and performance on different architectural configurations. Chapter 7 describes how the reader can configure and experiment with the operating system. The thesis is concluded in Chapter 8 with final remarks.

1.1 Achievements

A fault-tolerant many-core operating system was created. It assumes unreliable cores and does not require strong hardware-based consistency over the shared memory. The design has the following characteristics:

• Scalable to thousands of cores on a single processor, oriented toward clustered architectures with a distributed memory.

• Software-managed consistency on shared memory; neither hardware-based consistency nor atomic primitives are assumed.

• Resilient in the face of permanent and transient hardware faults.

• All operations on shared memory are wait-free.

• Completely decentralized microkernels.

• Hybrid mechanisms use both message-passing and shared objects.

• Software-only solution, assumes off-the-shelf cores.

• Exposes a programming interface for task-parallelism that allows applications to remain oblivious to faults, architectural details, and synchronization.

The new operating system has been implemented from the ground up to a working prototype as a distributed library OS. All of its run-time mechanisms described in this thesis were implemented from scratch, that is to say, using only standard libraries. The latest source code will be made available in a public repository [10].

The implementation was experimentally evaluated in an existing full-system simulator for many-core architectures. The tests included:

• Up to 1024 cores with various microarchitecture and macroarchitecture configurations.

• Behavior and performance results for many faulty and non-faulty cases.

• Extensive level of details on functional and timing behavior.

• Flexible configurations of the memory and cache hierarchy, their policies and latencies.

Preliminary versions of the operating system produced in this work were presented in [11]–[14].

[Figure 1.1: On-the-fly recovery from a hard fault on a 1024-core processor. The plot shows ready tasks and throughput (tasks/sec) over time (seconds), with the hard fault and the start of recovery marked. The shades of color correspond to 32 nodes running on the full-system simulator, each having 32 cores. A single application was utilizing all cores during the fault; it remains unaware as the system detects the fault and automatically recovers and redistributes the abandoned tasks to the remaining cores.]

1.2 Related Work

Considering that a very broad range of functionality was created in this work, covering the operating system design, a working implementation of it, and its evaluation framework, only a short survey of the related work is presented in this section. Additional literature, related work, and alternative solutions to specific objectives and problems treated in this work are mentioned in the relevant sections of the thesis, as are opportunities for future research. This work relates to the framework of the European Union FP7 Future and Emerging Technologies (FET) project TERAFLUX [15], [16] in the area of Teradevice Computing; its partners contributed other methods and approaches to the problems treated in this thesis. While this thesis presents an independent solution that was created from the ground up during the research period, some of the top-level assumptions on future many-core architectures are biased toward those adopted by the TERAFLUX consortium.

1.2.1 Distributed Operating Systems

The emphasis in this section is on other work with a similar top-level goal: an operating system for a single-chip exascale processor. Other related types of operating systems exist, such as those for supercomputers or cloud services. Those systems target thousands of cores on clustered architectures, but their hardware is composed of separate machines and a network that connects them [17], not a single processor and shared memory. Therefore, although some of those reviewed during the research have similarities to the presented operating system, they are also fundamentally different when addressing inter-node communication and fault-tolerance.

Hive [18] is a distributed operating system for shared-memory systems with many similarities to this work. It assumes a single-chip multiprocessor (albeit with only a few cores), it provides fault detection and recovery mechanisms within a shared memory system, and it uses a combination of message-passing for communication and shared memory objects to achieve its goals. It does, however, rely heavily on non-standard hardware features and cache-coherency that do not scale well, and it does not provide recovery for applications that were running on cores that suffered a fault.

Other distributed operating systems, such as Barrelfish [19], Helios [20], and fos [21], achieve scalability on heterogeneous many-core processors not only by abandoning the requirement for consistency on the shared memory; they abandon direct sharing altogether and rely only on message-passing.

The proposed system, however, does use direct memory sharing, but manages the consistency explicitly in software. Moreover, the mentioned systems do not inherently provide fault-tolerance, which is a critical requirement for the run-time environment created in this work. [22], [23] suggest software-managed consistency as a solution to the scalability problem of traditional shared-memory systems, but do not provide an overall operating system solution, nor do they address reliability issues.

1.2.2 Parallelization Techniques

1.2.2.1 Hardware Parallelism

For the last three decades, many different techniques have been proposed to achieve more instruction-level and thread-level parallelism, such as:

1) VLIW (Very Long Instruction Word) [24] tries to achieve this goal by using simple HW and sophisticated compilers.

2) OOE (Out-of-Order Execution) tries to achieve this goal by using an extensive amount of HW that dynamically exposes the dependency graph among instructions within an execution window of nearby instructions [25].

3) Simultaneous Multithreading (SMT) [26] permits distributing instructions of several independent threads among several super-scalar functional units in a single processing core.

4) Chip-level multiprocessing (CMP) [27] integrates several independent processing cores into a single processor.

These techniques, widely used in current processors, focus on maximizing the utilization of hardware functional units, but they do not deal with automatically improving the coarse-grained parallel execution of multiple dependent threads, e.g., a producer-consumer pair of threads. Such techniques are limited to providing parallelism at a very low level. Considering that they use a micro-scale view of computation tasks, they are too nearsighted by themselves to schedule several coarse-grained dependent threads, let alone threads that span other computation units [25], [27], [28]. Moreover, the mentioned techniques have no concept of fault-tolerance, and consequently do not provide the OS and applications any assistance in dealing with possible faults. It is then a harder task for SW to make sure that the execution of dependent threads survives faults of single cores, or even of a set of cores and their nearby resources, a.k.a. a node.

1.2.2.2 Native Multi-threading Libraries

These libraries, such as pthreads [29], provide interfaces for creating coarse-grained parallelism and synchronization. Native libraries are usually used at the lowest level in a multiprocessing software stack. They are versatile and can be used to implement various parallel programming models. Like hardware-level parallelization, these libraries do not inherently provide services for fault-tolerance or for distributing execution beyond a single multi-core processor.

1.2.2.3 Message Passing

Message Passing Interface (MPI) is the de-facto standard for inter-process communication in current high-performance computing (HPC) systems. MPI has no inherent support for fault-tolerance [30], but considering that it became a common API for large-scale parallel programs, a lot of attention has been given to extending it with fault-tolerance [5], [31], [32]. Although there is continuous progress, the current standard techniques for fault-tolerance in high-performance computing use various types of checkpoints that are complicated and costly [33], [34].

1.2.2.4 Transactional Memory

Transactional operations on shared data structures simplify the programming of concurrent and possibly dependent operations [35]; they make groups of operations on shared data appear to be performed atomically. This property is valuable on a shared-memory system, where concurrent accesses to shared data might conflict and leave the result in an inconsistent state. Transactional memory [36] operations have several qualities that resemble the execution of dataflow tasks, which is the focus of the runtime engine created in this work. In both methods, there are no side-effects until an operation completes successfully, and if an error or conflict occurred during processing, the thread or transaction can be safely re-executed. Deferring side-effects is also required in both paradigms, since the intermediate results must be kept in a private location before they are committed to the globally observable state. These shared traits encourage isolation between concurrent operations and provide inherent support for parallelization, error detection, and safe retry, all at the level of the run-time environment instead of user code.

Transactions are easy to grasp as a means to work around synchronization problems, but optimistic implementations still show widely inconsistent performance across scenarios that are highly dependent on specific workloads [37]. In addition, both software and hardware implementations of transactional memory suffer significant efficiency deterioration when scaling transaction counts or sizes [38]. Software transactional memory (STM) implementations have large overheads from the mechanisms that provide conflict detection and resolution, while current implementations of hardware transactional memory (HTM) are limited by a small capacity for concurrent transactions [39], [40]. The better-performing hardware implementations depend on strong consistency of the shared memory, but in this work this is only assumed to exist over the private memory within nodes, not over the global shared memory (see Chapter 2).

2 Architecture Model

While future massively-parallel devices may have all of the processing cores on a single die, it is unlikely that they can all be efficiently managed using the same hardware and software designs as the ones used in contemporary processors [3], [8], [41]. The organization of hardware elements such as processing cores and the memory hierarchy needs reconsideration, and so do the interfaces that expose them to operating systems and applications. One of the main reasons for the next evolutionary step is that current hardware support for multi-core processing does not scale well as core counts increase, mainly because cache-coherency mechanisms and shared buses become noticeably inefficient with more than several cores [9]. Many applications, and the matching operating-system support that was designed for multi-core processors, rely on the traditional strong guarantees from hardware and so do not scale efficiently as the number of cores increases.

An overall solution to reliably harnessing future massively parallel devices must, therefore, involve changes in all levels of the control hierarchy. Both hardware and software need to adapt to efficiently and reliably span out computation resources on a more distributed architecture. To allow better scalability, i.e., efficient utilization of an increasing number of computational units, this work suggests using a clustered hardware architecture within a single processor, and with it, a distributed operating system is proposed that provides additional layers of abstraction for applications.

This chapter describes the assumptions and delimitations of this work regarding future many-core architectures and their constraints. Section 2.1 and Section 2.2 describe assumptions on the additional partitioning of the hardware compared to existing multi-core architectures. Section 2.3 describes the fault model that is assumed on future many-core processors.

2.1 Clustered Architecture

In order to efficiently scale toward future architectures, this work assumes that future many-core processors will use a clustered architecture that contains many, possibly heterogeneous, cores. Cores are assumed to be based on off-the-shelf components as in current x86-64 hardware. The cluster is partitioned into nodes with several tightly coupled cores, which share communication links and hardware elements, such as caches, main memory, and the interconnect to the rest of the chip. Concordantly, the memory hierarchy is assumed to be partitioned into node-local memories and an additional distributed global memory; Section 2.2 provides additional information.

Thus, each node resembles a multi-core microprocessor (CMP [27]) that contains low-latency caches and maintains strong cache-coherency on its main memory, as current processors do. The nodes can all be symmetric or heterogeneous [2]; a heterogeneous system may contain several types of nodes with different hardware capabilities. Specialized nodes can be made for excellent serial execution performance; other nodes might be specialized toward communication interfaces, storage, or graphics; and the rest of the nodes can be specialized in low-power execution of large numbers of small computational tasks.

[Figure 2.1: Heterogeneous cluster architecture for a many-core processor. The diagram shows service/front-end nodes (a Linux front-end and Linux I/O nodes) alongside compute nodes, each node running its own microkernel over a group of cores.]

Each node runs its own kernel/microkernel that is responsible for managing its resources, such as cores and local memory, as well as other resources like network or disk access. Each kernel can schedule new tasks and collaborate with the other nodes over interconnect channels or non-coherent shared memory. The assumptions of this work can fit various possible heterogeneous architectures [42], some of which could be similar to the high-level architecture of contemporary super-computers [17]. An example of a possible architecture that meets the assumptions in this work is illustrated in Figure 2.1; it may include:

• Front-end nodes that are responsible for user interactions. Their cores are designed toward the best single-thread performance and they can run a general-purpose operating system such as Linux. They delegate the computational and communication workload to the other nodes.

• I/O nodes responsible for communication with the network, storage or other hardware devices that may run a specialized kernel for that purpose.

• Compute nodes that run only computational tasks in a power-efficient manner. These nodes can have a very basic kernel and have no access to the outside world. All communication and system operations are performed by the nodes running a full-blown kernel.

2.2 Distributed Memory

The memory in a future exascale device is assumed to be partitioned, either physically or logically, into private per-node local memory regions and an additional distributed shared memory (DSM) with on-chip global accessibility. NUMA behavior is assumed in several respects: accesses to the partitioned global memory may incur higher latencies than accesses to local node memories, and the same global location may have different latencies when accessed from different nodes. It is assumed in this work that the hardware implicitly maintains strong cache-coherency over the private memory of each node, similarly to current processors, but no hardware-based consistency is assumed over the shared memory, which requires the software to explicitly request the propagation of data.

Section 2.2.1 and Section 2.2.2 give additional details about the different memory types; Section 4.2 will later describe the main roles that the proposed operating system assigns to the different memory regions.

[Figure 2.2: Top-level organization of the many-core processor. The assumed architecture is comprised of multi-core nodes with their local, cache-coherent private memories and a partitioned, globally addressable, non-coherent shared memory.]

2.2.1 Local Memories

Since this work assumes that each node behaves like current commodity multi-core processors (Section 2.1), it is expected that the nodes implement a hardware-controlled cache-coherency protocol over their private local memories. A strong consistency model is therefore assumed on the node-local memories, as implemented in hardware by off-the-shelf x86-64 architectures, such as current Intel or AMD multi-core processors. The term strong consistency can be ambiguous, and furthermore, the memory consistency specifications for Intel and AMD processors are also ambiguous [43], [44]. The goals of this work (Chapter 3), however, can be achieved without depending on precise specifications of the microarchitecture.

In the context of this work, strong consistency of the memory within a node means that the hardware automatically enforces many additional constraints on the ordering of memory operations compared to what is subsequently referred to as a weak consistency model (Section 2.2.2). For example, programs designed to use memory with the stronger model assume that the hardware implicitly and automatically propagates memory writes to the caches of cores within a node, making them available to any subsequent reads, so multiple writes are observed in a consistent order by all cores in the node. These consistency guarantees cannot efficiently scale from multi-cores to many-cores with the current generation of coherency protocols. This work therefore assumes a weaker consistency model on the global memory in order to support efficient scalability to a large number of cores; the model is further described in Section 2.2.2. The consistency model assumed on the local memories of nodes is similar to the model presented in commonly available x86-64 processors, which are generally considered strongly consistent. More specific guarantees are similarly described in both the architecture manuals for Intel-64, IA-32 and AMD64 (Mar. 2017) [45], [46]. As previously mentioned in this section, the general description of the consistency model on local memories is sufficient for this work, since the focus is on resource management at the higher levels of the global system.

2.2.2 Globally Shared Memory

This work assumes that the hardware infrastructure provides a globally-addressable shared-memory domain, which is accessible from all nodes. This memory is assumed to exist in addition to the private local memory each node has (Section 2.2.1). The following sections describe the main characteristics of the memory model.

2.2.2.1 Scalable Consistency Model

Strong consistency models are considered intuitive and conceptually simple, but the ease of programming they provide imposes strict restrictions on the hardware, which increases its complexity and hinders performance [47], [48]. Therefore, in order to support efficient scaling to many-core processors, the consistency model assumed on the shared memory between nodes is weaker than that on the local memories. Relaxation of the strong constraints that are guaranteed on the local memories is assumed because the current generation of cache-coherency protocols inhibits scalability to large numbers of cores [9]. The overhead of traditional interconnect architectures for cores and memory becomes significant even with 8 and 16 cores [49]. Weakening the consistency model on the shared memory allows decreasing the overhead associated with cache-coherency protocols over it. Applying different memory models to the local and shared memory domains removes significant limitations on scaling without much consequence at the node-local level. The programming paradigms used at the local node scale can remain unchanged, since this work assumes that nodes use existing components for their private memory and caches, which provide traditional memory consistency as in contemporary processors. Therefore, it is believed in this work that future processors are likely to present both a weak consistency model at the global scale, on the memory shared between all nodes, and a strong consistency model at the local scale, on the private memories within nodes.

2.2.2.2 No Hardware-based Coherency

The consistency model of the shared memory architecture assumed in this work is based on acquire and release operations. Under the assumed semantics, a reader must explicitly acquire a region of the memory to access its latest version, and a writer must release a region it modified so that the changes propagate and become observable on a subsequent acquire. These two memory fences must be issued explicitly by software in order to exchange data on the shared memory and trigger consistency between nodes. The hardware is not assumed to automatically propagate data between nodes, let alone enforce any ordering constraints between memory accesses.
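As an illustration of the assumed semantics, the following sketch shows how a writer on one node and a reader on another might exchange a block over the weakly consistent shared memory. The fence primitives dsm_release and dsm_acquire are hypothetical names standing in for whatever mechanism the platform exposes; they are not part of any specific hardware or OS interface described in this thesis.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical software fences over a shared-memory region (illustrative):
 * dsm_release() propagates local modifications of [addr, addr+len) so they
 * become observable to other nodes; dsm_acquire() pulls in the latest
 * released version before reading. No ordering is guaranteed without them. */
void dsm_release(const void *addr, size_t len);
void dsm_acquire(const void *addr, size_t len);

typedef struct {
    uint64_t payload[64];
    uint64_t ready;          /* set by the writer, checked by the reader */
} shared_block_t;

/* Writer node: fill the block, then release it so other nodes can see it. */
void publish(shared_block_t *blk, const uint64_t *src)
{
    for (int i = 0; i < 64; i++)
        blk->payload[i] = src[i];
    blk->ready = 1;
    dsm_release(blk, sizeof *blk);   /* explicit propagation to shared memory */
}

/* Reader node: acquire the block before trusting its contents. */
int try_consume(shared_block_t *blk, uint64_t *dst)
{
    dsm_acquire(blk, sizeof *blk);   /* fetch the latest released version */
    if (!blk->ready)
        return 0;                    /* nothing published yet */
    for (int i = 0; i < 64; i++)
        dst[i] = blk->payload[i];
    return 1;
}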

This relaxed consistency model demands much less from the hardware compared to the stronger model on local memory (Section 2.2.1); it is therefore believed in this work to be the key to significant optimizations in hardware implementations that will allow efficient scaling to many cores on a single die. Other works also argue that managing consistency in software can be a suitable solution to the coherency wall [9] in many-core processors, and have proposed different semantics [22], [23]. The semantics assumed here were found sufficient for this work; the implementation of the operating system can provide the required synchronization for the chosen programming model (Section 4.1).

2.2.2.3 Distance Dependent Latency (NUMA)

Regardless of the consistency model on shared memory, this work assumes that the shared address space may have different access times depending on the originating core and the physical location of the referenced address. In addition, accesses to any location on the shared memory are assumed to be slower than accesses to the node-local memory. An additional assumption in this work is that the shared memory may have separate partitions with affinity to particular nodes, i.e., partitions that provide consistently lower access times when accessed by their affined node, owing to their smaller distance to the data. The measure of distance can be physical or logical, e.g., the number of hops between memory controllers. Therefore, an implementation can leverage the hardware-based advantage a node has when accessing certain memory regions.

2.2.2.4 Address Translation

This work assumes that each node has hardware support for virtual memory translation similar to existing multi-core processors. This feature allows software to access the physical memory using addresses in a virtual address space and rely on hardware support to map virtual addresses to physical addresses. In order to allow completely independent operating system kernels to run in the nodes of the distributed system, virtual memory is assumed to function in each node independently from the rest, as if they were on separate processors. As illustrated in Figure 2.2, each node may access the globally shared memory and its own private local memory; the local memories of other nodes are assumed inaccessible. The physical memory can be mapped correspondingly in each node, and virtual addresses can be translated by hardware to the private local memory or the shared memory. To maintain consistent addressing of the shared memory from all nodes, it is preferable that all nodes map the physical address of the shared memory to the same virtual address in their private virtual space. This way, pointers to the global shared memory can be transferred between nodes as-is and dereferenced without requiring extra translations. If this assumption is not met, each node might map the shared memory to a different virtual address in its private space, and pointers transferred between nodes would not remain valid. The proposed operating system can easily overcome this in software: instead of having pointers contain absolute addresses in shared memory, they contain offsets (i.e., distances) relative to the beginning of the shared memory. Offset pointers remain valid when copied to any address space, even if the shared memory was mapped to a different base address. Using offset pointers incurs negligible additional overhead because translations are normally only needed when starting and completing a task, as will be described in Section 4.1. This simple solution removes the requirement to have the same virtual mapping of the shared memory on all nodes without additional hardware assumptions.
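A minimal sketch of the offset-pointer idea follows. The names shm_base, shm_offset_t, and the helper functions are illustrative assumptions rather than the actual structures of the implementation; they only demonstrate how a reference can stay valid across nodes that map the shared memory at different virtual addresses.

#include <stdint.h>

/* Base virtual address at which this node mapped the shared memory;
 * it may differ between nodes (illustrative global, set at boot).     */
extern uint8_t *shm_base;

/* An offset pointer: the distance from the beginning of shared memory. */
typedef uint64_t shm_offset_t;

/* Encode: turn a local virtual pointer into a node-independent offset. */
static inline shm_offset_t shm_to_offset(const void *ptr)
{
    return (shm_offset_t)((const uint8_t *)ptr - shm_base);
}

/* Decode: turn an offset received from any node back into a pointer
 * that is valid in this node's own virtual address space.             */
static inline void *shm_from_offset(shm_offset_t off)
{
    return shm_base + off;
}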

A kernel tweak was made in the evaluation platform (Section 6.1) in order to enforce consistent mapping of the shared memory across all nodes. The change simply disables a kernel feature that performs address space layout randomization (randomize_va_space kernel parameter on Linux), so booting symmetric nodes results in the same virtual address of the mapped shared memory in all of them.

2.3 Processor Faults

This section describes the assumptions made in this work regarding the expected faults in future exascale processor architectures. Section 2.3.2 presents a brief review of the causes for the expected increase in fault rates in future microprocessors. Section 2.3.3 shows the importance of introducing new fault-tolerance mechanisms and Section 2.3.4 classifies the fault types assumed in this work.

2.3.1 Threats Terminology

This section provides brief definitions for the basic concepts that are often used in this thesis when referring to threats to the dependability of a system. The terms used in this thesis are identical to those generally used; they are listed here with their context for this work. Detailed definitions and classifications for various types of impairments are documented in [50], [51].

Fault A physical phenomenon, a defect in the material or an incorrect internal state of a hardware component in the system. When a fault affects the external state of the component and its manifestation becomes visible in the system, the deviation from the correct state is called an error.

Error A deviation from the correct state of a component in the system. Errors are considered to be the manifestation of faults, and are liable to lead to failures when they affect the delivered service that the system is meant to provide.

Failure This event occurs when the system does not perform its intended function and does not deliver the correct service. A failure is caused by an error in a hardware or software component of the system.

Additional definitions and specific assumptions regarding the faults treated in this work are described in Section 2.3.

2.3.2 Sources for Processor Faults

As it seems that Moore’s law continues to apply [1], transistor counts rise and more transistors are fitted into processors, allowing for the increasing number of cores per processor. Scaling more cores into a single integrated circuit naturally increases its overall power consumption and the power density per unit of area [2]. The dynamic power consumption in contemporary CMOS processors is a significant limiting factor for the increase in processing power, not only for low-power battery-operated devices but also for high-performance computers [32]. The following paragraphs mention several methods that hardware designers use to reduce power consumption, along with their implications for processor reliability.

Considering only the increase in transistor count and neglecting other factors, switching more transistors proportionally increases the dynamic power consumption, since:

P_switching = C_eff · V_core² · f · N_trans

where:
P_switching - Power from switching digital states [W]
C_eff - Effective load capacitance of each transistor [F]
V_core - Core voltage [V]
f - Core frequency in active mode [Hz]
N_trans - Number of switching transistors [-]

The increased power consumption from increasing N_trans can be compensated for by lowering f, but this should not be done excessively because operating frequency is critical for performance. Although it is possible to lower C_eff using different process technologies [52], which is desired to lower both power consumption and switching latencies, one could also suggest lowering V_core, since its effect on power dissipation is squared. Designers are working to be able to operate microprocessor cores at ultra-low voltages to minimize current density and overall power consumption, which also lowers short-circuit and static power losses. However, lowering V_core is more difficult than the other mentioned parameters [1], since it requires pushing manufacturing process verification corners to apply at deeper sub-threshold levels. Working near the threshold voltage, however, limits the maximum operating frequency and lowers noise immunity, which increases soft-failure rates [4], [32]. This limitation is one of the reasons that the dynamic power consumption (per unit of area) does not scale down as fast as transistor density scales up. The increased power density, combined with the reduction in transistor feature size, directly affects the lifetime of the processor die.
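As a small worked example of the relation above, the following sketch evaluates P_switching for arbitrary, purely illustrative parameter values (they are placeholders, not measurements from this work), highlighting the quadratic dependence on core voltage: halving V_core cuts the switching power by a factor of four.

#include <stdio.h>

/* Illustrative evaluation of P_switching = C_eff * V_core^2 * f * N_trans.
 * All numbers below are arbitrary placeholders, not measured values.      */
static double switching_power(double c_eff, double v_core, double f,
                              double n_trans)
{
    return c_eff * v_core * v_core * f * n_trans;
}

int main(void)
{
    const double c_eff   = 1e-15;  /* effective capacitance per transistor [F] */
    const double f       = 2e9;    /* operating frequency [Hz]                 */
    const double n_trans = 5e7;    /* transistors switching at a time          */

    /* Quadratic dependence on voltage: halving V_core quarters the power. */
    printf("P at 1.0 V: %.2f W\n", switching_power(c_eff, 1.0, f, n_trans));
    printf("P at 0.5 V: %.2f W\n", switching_power(c_eff, 0.5, f, n_trans));
    return 0;
}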

These issues and additional implications of transistor scaling that adversely affect processor reliability have been studied extensively [4], [53]; they include manufacturing process variation, repeated thermal cycling, electromigration, mechanical stress, cosmic rays, dielectric breakdown and more.

2.3.3 Implications of Transistor Scaling

In order to produce many-core processors, semiconductor manufacturers will have to squeeze the verified fabrication process corners (i.e., design margins) so much that the robustness of the resulting die will significantly decrease, and processors will not endure the same operating conditions for the chip lifetime expected today [1]. In other words, processor reliability under extreme conditions, in terms of fabrication process variations, temperature, voltage and speed, cannot be guaranteed through fabrication-time verification as it is today. Therefore, such a device can be expected to be prone to various types of faults¹ during its lifetime [2], [4], [32], [53]. If ignored, errors resulting from these faults may quickly cause the entire system to fail, at best; in unlucky cases, sneaky errors may silently corrupt data and cause seemingly strange malfunctions, mysterious crashes or even incorrect application behavior after arbitrarily long delays [33].

¹ Please note the terminology defined in Section 2.3.1.

2.3.4 Treated Faults

The overall reduction in processor die reliability makes it liable to many types of faults in its components. Besides the processing elements, faults may also affect communication buses, on-chip memories and other controllers. The fault-tolerance mechanisms of the operating system proposed in this work focus on treating reliability issues of the processing cores and their associated components that are local to the affected node, such as private memories, caches, and buses between the cores. The globally shared memory and its communication buses to the nodes are assumed to be protected from errors by hardware and error-correction codes.

As described later in Section 3.1, the proposed operating system, along with all running applications, is expected to continue to function even assuming the possibility of the following faults, much like a system that continues to function when a hard drive occasionally suffers a bad sector.

The following sections summarize the assumptions and delimitations of the fault types treated in this work that are considered likely on the target hardware architecture.

2.3.4.1 Permanent Faults (a.k.a. Hard-faults)

Cores in future exascale devices are assumed to become increasingly prone to unexpected hard-faults, as mentioned in Section 2.3.3. Such faults are caused by physical damage in the processor die and are therefore irreversible.

In the context of this work, a hard-fault may occur at any point in time and is assumed to immediately cause a critical node failure. Hard-faults are assumed to result in the immediate termination of code execution and the loss of all private resources, such as memory and local caches. This work assumes that when suffering a hard-fault, a faulty node stops executing instructions, or at the very least, does not run corrupted operating system code. Existing processors employ a Machine-Check Architecture (MCA) [54], which is assumed in this work to detect this unrecoverable node state and stop the faulty cores. Architectures that exist today can detect errors in many of the processor subsystems and automatically recover from them, or report that they cannot [45], [46], [54].

In this work, this fail-stop case is often referred to as node failure, although a more verbose description would be: the cores and all private resources of the node instantly and permanently vanished from existence without prior warning.

Further assumptions:

1. In the case of a hard-fault, all caches and private memory of the node are also instantly lost, without any warning or flushing. Hence the caches and private main memory of the node are considered physically damaged and forever inaccessible as part of the failed node.

2. The global shared memory is assumed to be unaffected by failing cores and remains accessible. It is assumed, however, that any ongoing operation on it might stop part-way if the originating core has failed. As explained later in Section 4.2, segments of shared memory are assigned to each node, but this is only a logical partition, so it is reasonable to assume that those segments remain functional even if the node fails. Those segments may, however, become corrupted by unfinished operations. A possible hardware-backed advantage a node has when accessing its assigned shared memory segment is utilized by the system but should not break this assumption.

2.3.4.2 Transient Faults (a.k.a. Soft-faults)

The continuous reduction in transistor feature size toward future exascale hardware will gradually increase the susceptibility of processing cores to intermittent and transient faults, also called single-event upsets (SEU) [55]. This work therefore expects that cores may suffer temporary faults that induce errors in the results of machine instructions. Transient and intermittent faults are assumed to take effect for very short periods of time. Their effect can be limited to a single flipped bit in a computation result or to the corruption of a group of successive instructions. Transient faults do not cause permanent damage to the affected core but may result in undefined behavior of running code, which may be disastrous if undetected.

This work separates the possible transient faults into several types, which are described in the following paragraphs.

Transient faults that are detected and resolved by hardware Existing architectural features for resilience are able to contain some types of soft errors automatically, such as those mentioned in [54]. Additional hardware mechanisms for reliability may exist in future exascale processors and partly shield software from transient faults. Therefore, all levels of software can be oblivious to some types of transient faults, and consequently so is this work. This work proposes a software-only solution for reliable many-core distributed execution; hardware-supported features for resilience are not required.

The faults considered throughout the rest of this thesis are those that no hardware mechanism, such as MCA, is assumed to be able to detect and recover from.

Transient faults on user code This work assumes that any type of transient fault may occur when running user code. Such errors cannot affect the operating system's kernel memory space, since it is protected from direct user-level access (through existing privilege control mechanisms). Thus, software errors resulting from these faults only pose a threat to running applications. Specifically, they will not corrupt critical kernel structures, which might otherwise result in a complete node failure, similar to a hard-fault. Even if an error of this type is ignored, the affected node and the global system can potentially continue to operate following the fault, but the affected applications will remain in an undefined state.

Transient faults on kernel code A delimitation of this work is that the local kernel on each node can be trusted; if it cannot, it halts. This is reasonable to assume in most systems that do not consider hardware-detectable transient faults. Therefore, if a node's kernel is compromised by a transient fault that did not occur in user code, it is assumed to either recover from the fault or immediately stop, as in the case of a hard-fault, without corrupting the global state. It will not, for example, uncontrollably write nonsense to the shared memory. The proposed operating system considers any unrecoverable non-user error as a hard-fault. It has mechanisms to detect and recover from hard-faults occurring at any point, but transient faults are only monitored on user code, not on kernel code. Section 5.3.5 provides additional details on detection and recovery from transient faults.

Prolonged temporary faults Although transient and intermittent faults are usually considered to appear for a short time, errors resulting from these faults may last much longer. In this work, temporary soft errors that halt correct node operation and remain in effect for a prolonged period are treated similarly to hard-faults. For example, a transient fault may cause an exceptionally critical error in the kernel of the affected node and cause it to crash and reboot. Since a node takes significant time to restart, the global system considers it to have suffered a hard-fault during that time.

3 Operating System Goals

The operating system created in this work aims to efficiently and reliably achieve high task-parallelism using the massive computational power of exascale processors, under the assumption that some resources may become faulty during their lifetime (Section 2.3). A required goal for this work is that the operating system be able to overcome unexpected faults in hardware processing cores and automatically continue the correct execution of all running applications, even those executed on the faulty cores. User applications may be allowed to remain unaware of the faults, although performance may degrade when the fault rate worsens and resources are lost.

To allow software to efficiently and reliably scale to a large number of unreliable cores, the operating system should expose a programming interface that allows applications to safely harness the parallelism in many-core architectures while abstracting the architectural details and their flaws.

The following sections provide more details on the main goals of this work.

3.1 Fault-Tolerance Goals

When a massively parallel system is assumed and faults cannot be avoided at a reasonable cost (Section 2.3.3), faults should be treated as first-class citizens of the system and considered at all levels of the hierarchy, i.e., hardware, operating system and applications. Thus, the operating system should not only achieve the best utilization of resources, low power, and balanced load, but also detect and recover from different types of hardware faults. Traditional operating systems commonly used today are not resilient to processor faults; a hard-fault in any of the cores will quickly cause a full-system failure. Therefore, a design goal of the many-core operating system presented in this work is to allow overall resiliency in the face of potential hardware faults at all levels. The anticipated faults range from temporary soft errors up to permanent core failures that may onset gradually, mainly due to wear-out of the die material, as described in Section 2.3.4.2 and Section 2.3.4.1.

Additional details about the reliability aspects of the desired goals for this work are noted in the following paragraphs.

Global resilience The proposed system should continue the correct operation of all operating system services and all running applications even in the face of faults. This work does not advocate that the OS is free to consider user applications a necessary casualty to ensure its own survival; a goal for this operating system is that all running applications obliviously continue their operation even when the system is met with the assumed hardware faults. A potential reduction in performance is, however, expected when faults occur.

Minimize reliability overhead Reducing the cost of reliability is an important concern since faults are considered common phenomena [32]. Traditional checkpoint-restart fault-tolerance approaches do not fit well into the assumed architecture because they are complicated and inefficient to implement for large distributed systems [5]. The solution should strive to minimize the overhead of fault-tolerance on execution time and memory usage.

Non-intrusive to applications Running applications should be oblivious to fault events, transient or permanent, even those that occurred in their own code. Moreover, no custom user-level code is required to enable and maintain overall reliability.

3.2 Performance Goals

Achieving maximum parallelism in a massively parallel architecture is a challenging design task for both applications and their supporting operating systems. Considering that contemporary operating systems are not able to efficiently scale to the many-core era [21], future systems will require special run-time frameworks and complementary application interfaces in order to efficiently maximize parallelism. Even if some software were perfectly designed to decrease its execution time in proportion to the number of available compute cores, it is up to the operating system to maximize performance by harvesting as much of the available parallelism as the hardware can deliver. More specific performance goals are mentioned in the following paragraphs.

Maximize parallelism and scalability The operating system should allow applications to harness the maximal parallelism available from the hardware. To maximize performance, application workload should be distributed among the processing cores and balanced, so cores are not left idle while user tasks are waiting to be executed. The operating system should efficiently scale as the number of cores increases and achieve the best utilization of resources. To meet these challenges, the operating system should also be able to adjust the characteristics of its task placement and scheduling algorithms to dynamic application behavior.

Fast on-the-fly recovery The ideal objective is, simply put, to provide super-fine-grained fault-tolerance with minimal recovery penalties. Traditional fault-tolerance approaches that use checkpoints are costly not only to maintain but also during recovery. Recent message-passing technologies that use various types of message logging, coordinated or asynchronous checkpoints, or replication at the framework level or invasively at the user level are also very expensive in many aspects [32]. The complexity and cost of saving, rolling back and recovering on a massively parallel system is also high for existing message-passing algorithms [31]. This limitation will become even more prominent as parallelization increases.

The proposed system should quickly recover applications and minimize the penalties involved in recovery. Moreover, only the user tasks whose instructions were directly affected by a fault may be disturbed; unaffected tasks should not be disturbed, even those that belong to the same application.

Minimize shared memory accesses Accessing the shared memory is expensive in many aspects, so its use should be tightly controlled. Reasons to minimize redundant accesses include:

• Memory access latency increases as data moves further away from the core, see Section 2.2.2.3.

• Considering a fault can happen at any time, special care must be taken when designing and accessing all shared structures, so they will not be vulnerable to corruption. This is rarely possible to achieve for free.

• The assumed shared memory supports a weaker coherency among nodes compared to the local memory within a node, and requires explicit instructions when storing and loading data. Even simple data structures become complicated or even impractical to implement.

• Additional snags, such as those that are ubiquitous on multi-core systems when accessing shared resources. These are expected to become more difficult and expensive to handle considering faults and the memory semantics.

3.3 Additional Considerations

This section expands the discussion to some additional fundamental objectives and constraints. This should help point out recurring considerations that affected the proposed design.

To simplify the initial discussion, let us first assume a device with the described clustered architecture and memory model, but completely fault-free. A distributed run-time framework for this simplified device does not require any tolerance to hard or soft failures. To communicate between nodes, the imagined system uses the shared memory as a communication channel and run-time scratch storage. Even if the distributed system used the most efficient data structures and (magically) avoided any performance deterioration from concurrent access to them, the solution would still need to overcome or avoid possible bottlenecks from frequent accesses to the shared memory. An obvious source of such bottlenecks in a naive solution is the tempting simplicity of keeping task management and scheduling related lists and queues in the shared space, since keeping this data shared greatly eases both load distribution and fault-tolerance. Doing so carelessly, however, can introduce the need for very frequent accesses to the shared memory, and possibly a significant overhead from synchronizing concurrent accesses and wasteful load on the hardware. Therefore, to minimize frequent and possibly redundant shared memory accesses, nodes could keep a backlog of their pending and active work in their node-local main memory instead of the global one. Nodes will be able to work more efficiently with their faster private memory if they can stay busy for long periods of time without depending on accesses to the slower and inconsistent global memory. This principle is not unlike keeping per-core thread queues at the multi-core level, until faults are added to the game.

Vital information such as the existence and evolving state of pending threads cannot reside solely in node-local memories, because those memories are vulnerable to hard-faults. To recover from hard-faults, the system needs a way to restore and resume all tasks after a fault even when node memory is lost, preferably with minimal setback, so essential information on the active threads must always be kept fresh in the relative safety of the shared memory. Nodes should therefore have some independence from the shared memory and avoid excessive operations on it, for the sake of performance, while maintaining fine-grained checkpoints in it to allow a minimal penalty on faults and small recovery costs.

This general review might fit the roots of many possible works, for similar or even very different architectures. The solution presented here chose a certain balance between many competing goals, such as optimizations for performance and scalability, storage requirements, granularity, the effort to maintain reliability checkpoints and the cost of recovery from faults.

4 Design Principles

This chapter describes fundamental design choices that shaped the overall solution and some of the main concepts of the mechanisms created in this work. Since faults are considered a common phenomenon in the assumed many-core system (Section 2.3), a design decision in this work is that the responsibility to maintain reliability at the software level lies with the operating system, not with the applications, as mentioned in Section 3.1. Therefore, the proposed operating system is designed to be the last line of defense that protects applications from hardware faults. It aims to efficiently maximize the parallelism of applications while allowing them to remain oblivious to processor faults, by taking responsibility for the detection of and recovery from hardware errors.

4.1 Application Programming Model

The distributed execution engine in the proposed operating system can execute any legacy code, but the design of the run-time environment was mainly focused on support for parallelization and resiliency for applications that use a certain programming model. The leading programming model chosen in this work is coarse-grained task-parallelism following dataflow principles. The dataflow model was chosen because it fits very well the expected horizontally scaled-out structure and limitations of future general-purpose many-core processors. Dataflow programming [56] efficiently overcomes the scaling difficulties of traditional parallelism techniques, which require explicit synchronization and stronger guarantees from the memory semantics. Complex operations described in a dataflow manner directly reveal the possible parallelization. The implicit synchronization between parallel dataflow tasks is what allows the operating system to overcome weakened guarantees from future memory systems without user intervention, as assumed in Section 2.2.2. Since the model decouples memory accesses from the user application, it allows the underlying system to manage load, hide memory latencies and provide resiliency. Another main reason for choosing dataflow for programming exascale systems is that the model helps to overcome the expected deterioration of reliability in future hardware, which was mentioned in Section 2.3.4.1. Isolation is the main tool that helps complex systems identify and recover from faults; dataflow encourages it by decoupling the software from memory accesses and by limiting the possible side-effects of tasks, which allows controlling the cost incurred on faults. Transactional memory is similar in that regard, as previously mentioned in Section 1.2.2.4. Additional valuable attributes of dataflow tasks were chosen in this work to further assist with meeting the goals. Details on the dataflow model applied in this work are discussed in Section 4.1.1.

Specialized framework support is therefore targeted toward the distribution and reliable execution of the dataflow portions of application code. This fundamental design decision was key to meeting the desired goals.

4.1.1 Dataflow Flavored Execution

The dataflow model helps to separate large operations into small tasks that may execute in parallel according to their dependencies. To maximize parallelism, the execution of complex operations is divided into many tasks. The execution of each task is governed by the dataflow model, i.e., a task can begin execution only when all its required inputs are available. Thus, a complex operation can be expressed as a directed graph of task dependencies. This execution model promotes parallelism and reduces synchronization overhead because the graph of dependencies directly exposes the possible parallelization among the tasks. Once started, a task does not need to track any external state, and therefore no synchronization is needed during its execution.

The new operating system provides an interface that gives users the ability to exploit task-level parallelism through coarse-grained dataflow tasks. This interface allows the creation of tasks that are executed according to the dataflow model by the distributed run-time environment presented in this work. A task has the following fundamental properties:

• It is scheduled to run only when all its inputs are ready.

• During execution, all operations with global side-effects are deferred until the task completes.

• When completed in a non-faulty manner, the side-effects are published and results are propagated to dependent tasks.

• If a fault occurred at any point, the operating system automatically restarts the task.

In addition to helping maximize parallelism, the dataflow task concept is also leveraged in this work to increase resiliency against potential hardware faults (Section 2.3). In this work, tasks only publish their output variables after their execution has completed without errors. Thus, faults occurring during task execution cannot contaminate the system with possibly corrupted side-effects. In addition, all side-effects are made idempotent, so tasks can be safely restarted even if they suffered a fault during the publication of results.

Another desired (but not required) property is that once a task has all its inputs available and is scheduled to run, it can run uninterrupted to completion. Continuous execution without blocking running threads helps to minimize the total resources required by the operating system when managing huge amounts of tasks, and it reduces their vulnerability to faults, which increases once they have started running. It will be further explained in Section 5.2 that tasks in this work have different stages in their existence compared to traditional threads. Here, only tasks that have received all their inputs and started execution require a full context like traditional threads. This is very significant when managing huge amounts of threads, as assumed for future systems. The proposed operating system does not keep a full thread context for tasks that have not yet begun to run, so they only consume a small amount of resources compared to threads that began to run but then blocked.

Allowing tasks to run uninterrupted to completion is also beneficial for reliability. Until a task is ready to run, the information required to start it is stored in the relative safety of the shared memory (as will be detailed in Section 5.2), but when it is scheduled, a more vulnerable node-local thread context is created to support its execution and to contain its side-effects until completion. The node-local thread context is analogous to the thread descriptor most operating systems would keep. The longer the thread lives in a node's private memory without checking back into the shared memory, the more exposed it is to faults and the more information there is to lose if a fault occurs.
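To summarize the life cycle described in this section, the following minimal C sketch shows the run-to-completion discipline with deferred, idempotent side-effects. It is illustrative only; all identifiers (run_to_completion, publish_idempotent, fault_detected and the structures) are hypothetical and do not correspond to the actual kernel interfaces, which are described in Chapter 5.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch only; the real mechanisms are described in Chapter 5. */
struct effects { size_t count; /* buffered writes, spawns and results */ };
struct task    { void (*entry)(void *frame, struct effects *out); void *frame; };

extern bool fault_detected(void);                  /* e.g., a mismatch reported
                                                      by Double Execution       */
extern void publish_idempotent(struct effects *e); /* commit to shared memory   */

void run_to_completion(struct task *t)
{
    struct effects buf;
    do {
        buf.count = 0;             /* nothing has been published yet            */
        t->entry(t->frame, &buf);  /* runs uninterrupted; side-effects only go
                                      into the node-local buffer                */
    } while (fault_detected());    /* on a fault: discard the buffer and restart
                                      from the unchanged inputs                 */
    publish_idempotent(&buf);      /* only now do results become globally
                                      visible and propagate to dependents       */
}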

4.1.2 Implicit Synchronization

The dataflow application programming interface that the proposed OS exposes to applications allows the run-time environment to take over all synchronization responsibilities, thereby completely relieving the users of them. Therefore, from the user's perspective, the synchronization of dependent or concurrent tasks is entirely implicit. In other words, application programmers do not need to (explicitly) write code that synchronizes the operations of their parallel algorithms. This property should not be mistaken for the absence of synchronization requirements in user code; it instead means that the synchronization requirements are expressed indirectly by the semantics of the programming model (Section 4.1.1). Expressing the synchronization requirements to the operating system is very valuable in this work because it allows transferring the responsibility for synchronization from the users to the run-time environment.

Please note that although the proposed OS exposes a programming interface that provides implicit synchronization on the user's side, nothing is implicit at the OS layers created in this work. The implementation of the dataflow engine and the related modules uses legacy synchronization primitives explicitly. Parallel execution is synchronized using mutexes and counting semaphores that build on hardware-supported atomic processor instructions, and shared memory is synchronized using explicit acquire and release operations (see Section 2.2.2.2).

4.1.3 Explicit Parallelization

From the user's perspective, the synchronization of dependent tasks is entirely implicit, as inherent in the dataflow-based programming interface that the proposed OS provides; the parallelization, however, is not. Parallelization of user applications is required to be explicit, meaning that the dataflow runtime created in this work requires explicit definitions of the dataflow tasks, i.e., their thread binaries and a list of expected inputs. If the reader feels this is very limiting, then to put it into perspective, please note that these tasks are defined at the level of the user algorithm; they are a relatively high-level programming construct. At the lower levels, e.g., in the small instruction windows of cores, dataflow principles are expected to be employed by compilers and hardware to resolve dependencies and improve instruction-level parallelism (ILP), at least as well as they do in today's out-of-order microprocessors.

4.1.4 Building Programs

The required specifications for tasks can be supplied directly by the programmer, as will be demonstrated in Section 4.1.5, or by intermediate offline/JIT compilation tools that may assist in the future with task dependency analysis, as investigated by [57], [58]. When the programmer supplies the task definitions explicitly, as in Listing 4.2, no special compilation tools or extensions are needed to create a dataflow-enabled application that can run on the presented operating system. Standard compilation tools can be used to build programs for this operating system. A single C header file supplies the interface for the dataflow run-time services. Legacy (non-dataflow) application code can be of any type, but dataflow tasks are compiled in a way that prevents the users from using any external dependencies besides the programming interface created in this work.

Although programs can be built with unmodified compilers (GCC was used during the evaluation), the compilation of the dataflow tasks needs special treatment. Once the explicit task definitions are specified, the build process compiles the designated code into isolated binaries of dataflow threads. The threads are built in a way that allows the created OS to execute them in an isolated environment, independently from each other and from the rest of the application. During run-time, the thread binaries are transferred over the shared memory between nodes independently from the main user application. Besides the thread binary, only a small amount of additional information about the specific task is required to execute it, which makes the overhead of work distribution very low. Additional details on the mechanism are described in Section 5.5.3.

4.1.5 Parallelization Example

An example is presented next in order to explain how to explicitly define a dataflow thread, compiled as detailed in Section 5.5.3, as a part of a task. A recursive definition for calculating the Fibonacci series is shown in Listing 4.1, while an explicit dataflow implementation that the demonstrated system can accept is presented in Listing 4.2.

Listing 4.1: The famous recursive definition for Fibonacci series

uint64_t fib(uint64_t n)
{
    if (n <= 1)
        return n;
    else
        return fib(n - 1) + fib(n - 2);
}

The code in Listing 4.2 can be used as-is in the implemented operating system. In fact, it is exactly the one used for evaluating and demonstrating the system; more details on it are mentioned in Section 6.4.

The presented implementation is based on a similar variation that was found to scale well in [59]. Note, however, that this is only a minimal example that does not demonstrate all of the supported capabilities of the dataflow threads in the proposed system.

Listing 4.2: Explicit dataflow version of the Fibonacci series

TASK_ENTRY(fib)
{
    uint64_t dest = d_frame(0);                 // Destination for the result.
    uint64_t n    = d_frame(1);                 // fib's n to calculate.
    if (n <= 1) {
        d_write(dest, n);                       // Write result to the waiting adder.
    } else {
        uint64_t tfib1 = d_schedule(fib, 2);    // Spawn fib1.
        uint64_t tfib2 = d_schedule(fib, 2);    // Spawn fib2.
        d_write(tfib1 + 1, n - 1);              // fib1 calculates for n-1.
        d_write(tfib2 + 1, n - 2);              // fib2 calculates for n-2.
        uint64_t tadder = d_schedule(adder, 3); // Spawn adder.
        d_write(tadder + 0, dest);              // adder.out -> self.out.
        d_writev(tfib1 + 0, tadder + 1);        // fib1.out -> adder[1].
        d_writev(tfib2 + 0, tadder + 2);        // fib2.out -> adder[2].
    }
}

TASK_ENTRY(adder)
{
    uint64_t n1   = d_frame(1);                 // Result from fib1.
    uint64_t n2   = d_frame(2);                 // Result from fib2.
    uint64_t dest = d_frame(0);                 // Destination for the sum.
    d_write(dest, n1 + n2);
}

// Equivalent adder with direct access to the full frame:
TASK_ENTRY(adder)
{
    struct frame { uint64_t dest, n1, n2; } *fp = (struct frame *)d_framep();
    d_write(fp->dest, fp->n1 + fp->n2);
}

The implemented dataflow tasks are not limited to using only the services that the dataflow runtime provides, nor are they bound to the scope of a single routine. Besides being wrapped by the dataflow framework, the threads are not very different from normal user-level threads. They can call sub-routines, use common constant structures or tables, and even use libraries, provided they accept the responsibility to stay free of side-effects. The reader is invited to modify and experiment with more advanced dataflow tasks. Instructions on how to start using the implementation of the operating system are presented in Chapter 7.

4.2 Partitioned Memory Layout

Figure 4.1 illustrates a logical high-level view of the roles given to the architectural components in the distributed system while running a parallel user application. Please note that the specific system attributes in the figure, such as the cores and memory partitioning, are presented only as an illustration. An introduction to the roles of the local and shared memory regions is presented in Section 4.2.1 and Section 4.2.2, respectively.

Figure 4.1: High-level roles given to the memory partitions

The figure shows N nodes, each with its cores, a shared memory segment holding dataflow task descriptors and message queues, and a private memory holding the local kernel, the legacy application code and cached dataflow thread binaries. DFx desc. stands for a dataflow task descriptor, see Section 5.5.2. DFx binaries are described in Section 5.5.3. Message queues are a part of the inter-node communication mechanism described in Section 5.1.1.

4.2.1 Local Memory Role

Each node has a private memory region used at its sole discretion; it is only directly accessible from the cores within the node. This local memory region is analogous to the volatile main memory (RAM) of contemporary computers. Since the assumed hardware maintains strong cache consistency within a node, as current multi-core processors do, this memory can support standard kernels that exist today without modification. The local memory holds the kernel structures for managing the local hardware resources, as in most common kernels, and additional memory space needed by the new distributed run-time services. These additional services use node-local data structures to support reliably executing distributed dataflow tasks, e.g., tracking task inputs, temporary stacks and heaps, buffering results, scheduling queues, multiplexing communication with other nodes, caching dataflow thread binaries, and gathering information for load balancing, fault detection and recovery. The shared memory is considered expensive in every way (see Section 3.2), so data that does not need to be shared is kept in the local memory.

4.2.2 Shared Memory Organization

In the proposed system, the shared memory (Section 2.2.2) contains data structures that serve two main purposes: inter-node communication and fault-tolerance.

Figure 4.2: Node memory layout

Each node's shared memory segment, which has affinity to that node, holds wait-free message-passing queues from all other nodes, task descriptors and their inputs, shared thread binaries and shared user data; this is what enables recovery from node failure. The private memory holds the local kernel, dataflow scheduler, deferred side-effects, watchdog agent, I/O service, caches, etc., and is not recoverable.

This memory is used in a very different manner compared to the local memories, since only a weak consistency model is assumed over it. As mentioned in Section 2.2.2, this work assumes that the shared memory requires explicit fencing operations to propagate changes. Compared to the local memories, users must adhere to additional constraints, as described in Section 3.2 to Section 4.4.5.

The organization of the memory segment with affinity to a node is illustrated in Figure 4.2. Section 5.1.1 describes the global layout of the shared memory.

4.2.2.1 Partitioned Global Address Space

Shared memory can be an efficient and relatively easy platform to use when parallelizing algorithms, but traditional multi-core implementations have major scalability issues in both hardware and software, as mentioned in Section 2.2.2. A distributed memory management approach was employed at the operating system level in order to avoid these traditional scalability issues; it makes it possible to achieve the goals of this work given the assumed clustered architecture and weakened shared memory consistency.

The operating system proposed in this work creates logical partitions in the shared memory. These declared partitions have considerable significance for software in themselves, but it is possible that the partitions will also be supported by the physical components in future architectures (NUMA, also mentioned in Section 2.2.2).

At the top level of the memory layout, each node has a particular segment of the shared memory assigned to it by the operating system. The shared memory is globally addressable and accessible, which allows any segment to be read and written by all nodes, but the proposed design gives the owner node special privileges over its segment. For example, only the owner node of a shared segment may perform dynamic allocation or deallocation of space in the segment. The shared segment of each node holds the communication channels to the other nodes (Section 5.1.1) and essential information on the tasks assigned to be executed on that node, which is needed for recovery in the case of node failure (Section 5.3).

4.2.2.2 Low-Level Memory Management

The limitations of the assumed shared memory architecture, mainly those mentioned in Section 2.2.2.2 and Section 4.4.4, encouraged the solution to use static allocations for shared structures wherever possible. When dynamic allocations are needed for a temporary shared structure, such as a dataflow TaskDescriptor (Section 5.5.2), only the owner node of the shared segment is allowed to allocate and deallocate memory from it. This constraint helps to minimize vulnerability to faults during critical sections and completely avoids any overhead from synchronizing multiple nodes that attempt to perform memory management operations on the same shared segment. Moreover, depending on traditional multi-core synchronization mechanisms would impair the ability of the global mechanisms to remain wait-free (see Section 4.4.2).
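As a rough illustration of the owner-only discipline, the sketch below shows a bump allocator over a node's shared segment. The layout and names (shared_segment, seg_alloc) are hypothetical; the actual TaskDescriptor allocation is described in Section 5.5.2. Because only the owner node ever writes the allocation state, plain stores suffice and no cross-node locking or atomic read-modify-write is needed.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical layout of one node's shared segment. Remote nodes may read the
   segment (after the owner publishes it) but never allocate space in it. */
struct shared_segment {
    uint64_t owner_node;      /* node id this segment is affined to            */
    uint64_t next_free;       /* bump pointer, written by the owner node only  */
    uint8_t  heap[1 << 20];   /* task descriptors, input frames, queues, ...   */
};

static void *seg_alloc(struct shared_segment *seg, uint64_t self, size_t n)
{
    if (seg->owner_node != self)               /* remote nodes must not allocate */
        return NULL;
    if (seg->next_free + n > sizeof seg->heap) /* segment exhausted              */
        return NULL;
    void *p = &seg->heap[seg->next_free];
    seg->next_free += n;                       /* single writer: a plain store
                                                  is enough                      */
    return p;
}

A real allocator would also support deallocation and reuse; the point here is only that segment ownership removes the need for global synchronization.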

4.3 Fault-Tolerance and Resiliency

The main goal of this work is to provide a self-contained operating system solution for future many-core devices by relying on software mechanisms and existing processor architectures. Since no hardware features for resiliency are assumed in this work, all software mechanisms were made to detect and recover from the assumed faults (Section 2.3).

Having the application code broken up into relatively small tasks helps to limit the effects of intermittent and transient faults, making it possible to contain the problem without the rest of the running application even noticing. Provided the application is sufficiently parallelized, its unaffected tasks can continue in parallel to the fault and recovery process, suffering only a possible delay. This mechanism helps the proposed operating system conform to the specified goals of this work (Section 3.2).

If the user code uses the dataflow framework (Section 4.1), resiliency to transient faults is provided at the level of a single thread. On hard-failures, however, all of the pending and partial work of the failed cores needs to be recovered and completed by the remaining ones. Therefore, the fault-tolerant run-time mechanisms created in this work handle the two main fault types, mentioned in Section 2.3.4.1 and Section 2.3.4.2, very differently. The top-level concepts for fault-tolerance are introduced in this section; further details on the mechanisms are presented in Chapter 5.

4.3.1 Tolerating Hard Faults

Consideration of hard-failures is inherent in all aspects of the proposed operating system and embedded in every one of its modules; see Section 5.1, Section 5.3 and Section 5.2. Fault detection in itself is a rather isolated mechanism in the system, but synergy between all modules was required in order to allow efficient execution and recovery in the face of hard-faults.

4.3.1.1 Detecting Permanent Faults

In order to detect permanent faults that lead to node failures, a distributed watchdog mechanism continuously monitors the health of the nodes in the system. The watchdog is comprised of software agents running in each of the nodes. Each agent monitors activity signals from its neighbors and alerts the local kernel if a failure is detected in the system. The created mechanism is further described in Section 5.3.4. The design also allows integrating other hardware features for fault detection, such as those mentioned in [54]; however, no such hardware support is assumed in this work, in order to provide a self-contained software solution (see Section 4.4.1).
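A minimal sketch of this idea is shown below, assuming heartbeat counters kept in shared memory. All names and the miss threshold are hypothetical; the actual heartbeat collection and watchdog parameters are described in Section 5.3.4 and Section 5.4.3.

#include <stdint.h>

#define MAX_NODES      64
#define MISS_THRESHOLD 3     /* hypothetical: unchanged samples before alerting */

extern volatile uint64_t heartbeat[MAX_NODES];  /* each node bumps its own slot */
extern void alert_local_kernel(int suspect);    /* starts recovery (4.3.1.2)    */

struct watchdog_agent {
    uint64_t last_seen[MAX_NODES];  /* last sampled heartbeat per neighbor */
    uint32_t misses[MAX_NODES];     /* consecutive unchanged samples       */
};

/* Called periodically by the watchdog agent running on every node. */
static void watchdog_tick(struct watchdog_agent *wd, int self, int nodes)
{
    for (int n = 0; n < nodes; n++) {
        if (n == self)
            continue;
        uint64_t hb = heartbeat[n];   /* read after an acquire in the real system */
        if (hb != wd->last_seen[n]) {
            wd->last_seen[n] = hb;    /* neighbor is alive */
            wd->misses[n] = 0;
        } else if (++wd->misses[n] >= MISS_THRESHOLD) {
            alert_local_kernel(n);    /* suspected node failure */
        }
    }
}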

4.3.1.2 Recovery Process Overview

When a failed node is detected, a node is assigned to take over and rebalance the abandoned work of the failed node. To improve recovery performance, the chosen node should be physically close to the failed one (see Section 2.2.2.3), to benefit from low latency when repeatedly accessing the shared segment during recovery. To simplify the decision process, the implementation picks the living node having the shared memory segment with the lowest address that was not previously assigned for recovery (to spread the recovery effort on cascaded failures). Since all nodes use the same logic, they do not need to communicate to eventually reach the same decision. Note that in the unfortunate case where faults also cause the designated recovery node to fail, the second fault would also be detected by the watchdog (Section 4.3.1.1) and trigger the recovery process for both failed nodes.
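The deterministic selection rule can be sketched roughly as follows (hypothetical structures and names): every surviving node evaluates the same pure function over the same watchdog information, so all of them reach the same answer without exchanging messages.

#include <stdint.h>
#include <stdbool.h>

struct node_info {
    uintptr_t seg_base;            /* base address of the node's shared segment */
    bool      alive;               /* as reported by the distributed watchdog   */
    bool      already_recovering;  /* already assigned to a previous failure    */
};

/* Returns the living node whose shared segment has the lowest base address and
   which is not already busy recovering another node, or -1 if none is left. */
static int pick_recovery_node(const struct node_info nodes[], int count)
{
    int best = -1;
    for (int n = 0; n < count; n++) {
        if (!nodes[n].alive || nodes[n].already_recovering)
            continue;
        if (best < 0 || nodes[n].seg_base < nodes[best].seg_base)
            best = n;
    }
    return best;
}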

The recovery node loads the information about the pending tasks from the shared segment of the failed node and verifies its consistency, details on the process are described in Section 5.3.1 and Section 5.5.2.1. The process then starts a message pump thread to handle the pending messages from the orphaned input queues. The recovery node maintains control over the shared memory affined to the failed node until all recovered tasks are redistributed and all other nodes acknowledge their awareness that the node is no longer available to accept new tasks.

4.3.1.3 Fault Detection and Recovery Scenario

This section illustrates the top-level behavior of the created operating system when a hard-fault is detected and recovered from. As mentioned earlier, in order to maintain resiliency in the face of hard-faults, all operating system mechanisms presented in this work were built from the ground up to allow automatic recovery of user applications with minimal overhead. The various mechanisms involved are described in detail in Chapter 5. The performance metrics presented here were collected in real-time from the operating system running on a full-system many-core simulator [60], as a part of the evaluation process described in Chapter 6.

The architecture parameters for the experiments presented in this section were chosen in order to emphasize the effect of the fault and the recovery process. The user application is launched on a relatively small architecture of 128 cores with 4 nodes and 32 cores per node in order to allow each node a significant contribution to the total throughput. The chosen user computation is short, and the fault detection latency was adjusted to have the different states of the execution clearly visible in the graphs.

Figure 4.3 shows real-time performance measurements during an execution with an injected hard-fault after roughly 40% of the computation was performed. Marked on the graph are the following key points:

1. A hard-fault is injected into one of the nodes; its cores instantly halt all code execution and its local memory vanishes. These experimental results are from a 4-node simulation, so the failure immediately results in a 25% drop in overall throughput.

2. Following the fault, the remaining nodes still have plenty of work to do that does not depend on results from the abandoned tasks of the failed node. Although the amount of ready tasks quickly dwindles, the nodes can continue at full utilization and throughput stays relatively flat.

3. After a while, the nodes complete all the tasks that were independent of results from the failed node. Throughput begins to fall until it reaches zero when there are no ready tasks left to execute. The remaining tasks may not begin execution until all their required inputs are ready (see Section 4.1.1), but those are outputs of tasks that belonged to the faulty node.

4. The distributed watchdog decides that too many heartbeats were missed from the faulty node (see Section 5.3.4), so its agents alert the remaining nodes to initiate the recovery process. A node is selected to recover and rebalance the work from the failed node (the bottom node in the graphs). Throughput begins to rise as the recovered tasks are redistributed and executed, which also allows previously blocked tasks to become ready.

5. Throughput approaches 75% of that before the fault and the computation continues with the remaining cores at full utilization. The throughput of the recovery node is slightly lowered due to the recovery process, which continued until the end of the computation in this short experiment.

6. The computation is close to completion and the number of tasks that become ready for execution is not enough to sustain full utilization, throughput falls until all work is complete.

As mentioned, the parameters in the first example were chosen specifically to demonstrate the recovery process. The delay from the fault occurring to its detection can be controlled by parameters of the watchdog (see Section 5.3.4). Regardless, hardware mechanisms that can detect core faults can also be used in a future system, as investigated by [54]. Hardware support could reduce the response time to hard-faults without increasing the overhead of the watchdog.

The second experiment presented in this section illustrates more typical behavior in faulty cases, where utilization does not fall before the fault is detected and recovery starts. The results shown in Figure 4.4 are from an experiment similar to the previous one, except that the duration of tasks was doubled. The main features of the behavior in this experiment are similar to those from larger experiments, as presented in Section 6.4.2.

Figure 4.3: Hard-fault recovery process

A single executable is running on 128 cores in 4 nodes with 32 cores each. There were 86k total tasks to execute with 807 µs average task duration. The shades of color correspond to different nodes running on the full-system simulator. The upper plot shows the number of ready tasks and the lower plot shows throughput (tasks/sec) over time (seconds); the numbered markers correspond to the key points listed above.

The number of ready tasks does not reach zero before the abandoned work is recovered from the failed node, and it can be seen that throughput is not affected until the fault is detected. Once the hard-failure is detected, the throughput of the recovery node (the bottom one in the graphs) slightly decreases due to the recovery effort. Results from additional experiments on faulty hardware are presented in Section 6.4.2, and a demonstration of such an experiment is shown in Section 7.3.0.0A.

4.3.2 Tolerating Transient Faults

For the detection of transient and intermittent faults in dataflow applications, the presented operating system can wrap user tasks with a Double Execution (DE) layer. Algorithms that use redundant execution and result comparison for fault detection and tolerance are widespread [32], [61], [62]. Double Execution was found suitable for providing fault-tolerance for dataflow models because its application within a dataflow execution environment can be simple and lightweight, considering the isolated nature of dataflow tasks [15], [63]. Compared to the mentioned works that use architectural components, the suggested solution in this work is not based on invented hardware and does not require data replication.

Figure 4.4: On-the-fly hard-fault recovery - 128 cores

4 nodes, 32 cores per node, 86k total tasks, 1.6 ms average task duration. Recovery started before the utilization of the remaining nodes was affected. Throughput then slightly decreases due to the extra effort during the recovery process. The upper plot shows the number of ready tasks and the lower plot shows throughput (tasks/sec) over time (seconds).

The use of MCA mechanisms [54] is suggested both in previous works and in this work to assist fault detection and recovery, but it is not required in this work to treat transient faults.

The method used allows the system to efficiently detect and recover from soft errors at the node level without disturbing the global system. Furthermore, taking advantage of the created mechanism does not require any user involvement, as specified by the goals for this work. Transient faults on user code are detected and contained at the scale of a single task, which minimizes their effect and allows a quick response and a low cost of recovery.

In this execution mode, two independent threads are created to run a task when it becomes ready for execution. Both threads share the same inputs and code and are therefore expected to produce the same outputs, unless a fault occurred. When both threads complete, their outputs are compared, and the side-effects of the task are either committed to the global state if the results match, or the task is restarted if there is a mismatch.

This redundant execution mechanism is further described in Section 5.3.5, and it is demonstrated in an experiment presented in Section 7.3.0.0B.
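A rough sketch of the comparison step is shown below. It is a simplification under assumed names (run_task_copy, commit_idempotent): the two copies are shown running sequentially for brevity, whereas the actual mechanism of Section 5.3.5 runs them as independent node-local threads.

#include <string.h>

/* Each copy fills its own buffer of deferred side-effects; nothing touches the
   global state until the two buffers are found identical. */
struct effects { unsigned char bytes[4096]; size_t len; };

extern void run_task_copy(const void *inputs, struct effects *out);
extern void commit_idempotent(const struct effects *e);

void double_execute(const void *inputs)
{
    struct effects a, b;
    for (;;) {
        run_task_copy(inputs, &a);    /* first independent execution  */
        run_task_copy(inputs, &b);    /* second independent execution */
        if (a.len == b.len && memcmp(a.bytes, b.bytes, a.len) == 0) {
            commit_idempotent(&a);    /* results match: publish once  */
            return;
        }
        /* mismatch: a transient fault corrupted one copy; discard both
           buffers and restart the task from its unchanged inputs */
    }
}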

4.3.3 Low-Cost Fine-Grained Checkpoints

Capability for recovery is maintained at the level of a single task. This means that if a hard-fault occurred, only the tasks actively running on the failed cores will need to be re-executed. This constraint limits the penalty that running applications might suffer, as well as the size of each checkpoint. The solution does not rely on additional persistent storage and it does not require any custom user-level assistance. These guidelines are crucial to achieving the goals of the work mentioned in Section 3.1 and Section 3.2.

Achieving very low overhead for the declared reliability goals (Section 3.1) is an unrealistic expectation with traditional methods if no user involvement is assumed and a general application programming model is used [5]. To provide reliable service with low overhead, the mechanisms in this work rely on the dataflow application execution model, which decouples applications from memory accesses, task synchronization issues and faults. The reliability considerations are an inherent part of all levels of the global and local dataflow execution engine of the operating system; they provide fault-tolerance with low overhead in non-faulty cases and fast on-the-fly recovery when faults occur.

4.4 Additional Principles

4.4.1 Software-only approach

While other approaches to run-time environments and fault-tolerance solutions for future many-core systems rely on inventing new hardware mechanisms [64], the proposed solution is implemented entirely in software. This approach increases flexibility and generalizes the solution to depend only on a high-level set of assumptions and requirements from future many-core architectures, those described in Section 2.1 and Section 2.2.

Independence from hardware support also applies to the fault-tolerance aspects of the solution; no hardware-assisted fault detection and recovery mechanisms are required by the OS. This constraint applies both at the node-local level and at the global level. Software-only mechanisms detect and recover from both transient and permanent faults. That being said, the design is ready to accommodate hardware features for fault detection and tolerance that can improve resiliency and efficiency. Examples of such hardware features are the watchdog timers that are extensively used in most embedded systems and the MCA that exists on modern processors [45], [54]. Additional architectural features for reliability that are focused toward future many-core processors are investigated in [15].

4.4.2 Everything is Wait-Free

To improve performance, the new operating system is designed so that all data structures that reside in shared memory and all inter-node operations are wait-free. This means that any single operation on a shared resource or interaction between nodes is never blocked by another. All operations a node performs on shared structures are non-blocking and can be completed without synchronizing access with other nodes. Not a single synchronization lock or atomic conditional primitive, such as a read-modify-write instruction, is used (see Section 4.4.4), nor are higher-level mutual exclusion mechanisms or hardware-supported workarounds, such as transactional memory [35].

Note, however, that this design choice does not preclude using data structures that reside in the private node memories (Section 2.2.1), since the involved algorithms only require synchronization within a single node, between only a few cores. The architecture of individual nodes is assumed to use commodity components, which allows node kernels to use traditional programming models that may rely on strong memory consistency guarantees on the local memory.
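To illustrate the kind of shared structure this principle permits, the sketch below shows one direction of a node-to-node message channel with a single producer and a single consumer, using only plain loads and stores plus explicit consistency operations (Section 2.2.2.2). The fence intrinsics and the layout are hypothetical placeholders, not the actual transport of Section 5.1.1.

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical intrinsics standing in for the explicit acquire/release
   instructions assumed from the shared memory (see Section 6.1.2). */
extern void shared_acquire(void);   /* observe stores published by other nodes */
extern void shared_release(void);   /* make this node's prior stores visible   */

#define SLOTS   64
#define MSG_LEN 256

/* Exactly one producer node and one consumer node per channel, so plain
   stores to head/tail suffice: no CAS, no locks, and neither side ever
   waits for the other (a full or empty channel is reported immediately). */
struct channel {
    uint64_t head;                       /* advanced by the consumer only */
    uint64_t tail;                       /* advanced by the producer only */
    unsigned char msg[SLOTS][MSG_LEN];
};

static bool try_send(struct channel *ch, const void *m, size_t len /* <= MSG_LEN */)
{
    shared_acquire();                          /* refresh the consumer's head   */
    if (ch->tail - ch->head == SLOTS)
        return false;                          /* full: caller retries later    */
    memcpy(ch->msg[ch->tail % SLOTS], m, len); /* write the payload first       */
    shared_release();                          /* publish the payload...        */
    ch->tail++;                                /* ...then advance the index     */
    shared_release();
    return true;
}

static bool try_recv(struct channel *ch, void *m, size_t len)
{
    shared_acquire();                          /* observe the producer's stores */
    if (ch->head == ch->tail)
        return false;                          /* empty                         */
    memcpy(m, ch->msg[ch->head % SLOTS], len);
    ch->head++;                                /* slot may now be reused        */
    shared_release();
    return true;
}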

4.4.3 Software-Managed Consistency

Traditional operating systems rely on hardware-implemented cache-coherency protocols to provide system-wide consistency of the shared memory. Since these hardware mechanisms inhibit significant scaling, such capabilities are not assumed in the architectural model described in Section 2.2.2.2. This approach is supported by several other works that address the scalability issues of shared memory systems [22], [23]. Therefore, the presented operating system does not rely on the hardware to provide consistency guarantees on shared memory accesses, and implements them in software using new machine instructions that trigger the consistency operations. Section 6.1.2 describes how the Instruction Set Architecture (ISA) extensions were implemented in the simulator.

Therefore, all run-time mechanisms created in this work explicitly manage the global consistency of their shared data, considering that the coherent domains within nodes are otherwise isolated.

Decoupling memory from applications Although the implementation of the shared objects and communication mechanisms in the operating system triggers consistency explicitly by issuing acquire and release instructions, the applications are not expected to do so. The run-time environment created in this work decouples the application software from memory accesses by abstracting the architectural details and the memory semantics. The new environment exposes a dataflow programming interface to users, which enables it to do so, as described in Section 4.1.
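
As a concrete illustration, the following C++ sketch shows how the run-time mechanisms could wrap the explicit consistency operations. The intrinsic names (dsm_acquire, dsm_release) and their fence-based stubs are assumptions made for illustration, not the actual ISA extension of Section 6.1.2.

#include <atomic>
#include <cstddef>

// Stand-in for the new 'acquire' instruction: here only a fence; in the
// simulator it pulls a fresh copy of the range from the globally shared
// memory before it is read (Section 6.1.2).
inline void dsm_acquire(const void*, std::size_t) {
    std::atomic_thread_fence(std::memory_order_acquire);
}

// Stand-in for the new 'release' instruction: publishes local writes of
// the range so they become observable by other nodes.
inline void dsm_release(const void*, std::size_t) {
    std::atomic_thread_fence(std::memory_order_release);
}

// Read a shared object: acquire first so the local copy is up to date.
template <typename T>
T shared_read(const T* p) {
    dsm_acquire(p, sizeof(T));
    return *p;
}

// Write a shared object: write locally, then release; nothing is visible
// to other nodes before the release.
template <typename T>
void shared_write(T* p, const T& value) {
    *p = value;
    dsm_release(p, sizeof(T));
}

Application code never calls such helpers directly; only the run-time objects described in Chapter 5 do.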

4.4.4 No Atomic Conditionals

Considering that the hardware does not maintain consistency over shared memory across nodes, and software must issue explicit instructions to propagate data on it (acquire/release instructions, see Section 2.2.2.2), this work further weakens the hardware requirements such that no atomic conditional primitives such as compare-and-swap are assumed from the microarchitecture. This constraint means that conventional mutual exclusion mechanisms that are used by modern operating systems and applications to synchronize concurrent accesses to data between parallel threads, like spin-locks and semaphores, cannot be placed in the shared memory.

Note, however, that this constraint does not preclude using atomic primitives on node-local memory locations, since traditional cache-coherency is assumed on the private memory within nodes, as mentioned in Section 2.2.1.

4.4.5 Completely Decentralized

All operating system mechanisms are made to be completely decentralized across nodes. There is no master node or any centralized services or data structures. The solution avoids dependency on any centralized facility in order to prevent obvious performance bottlenecks that inhibit scalability, and major vulnerabilities to faults from having a single point of failure. Design considerations for performance issues that are usually addressed on small-scale concurrent systems are included in addition to those that emerge in many-core systems, relating to NUMA, bus contention, synchronization mechanisms and more.

4.4.6 Easily Fit Existing Systems

The assumed node architecture is based on contemporary multi-core x86-64 processors (Section 2.1), so it is reasonable to expect that existing user applications should only require minimal or no adjustments in order to be executed in the new environment. More specifically:

• The operating system is able to run existing (legacy) code without modifications. If the new OS services are not employed, parallelization of unmodified code may be limited to the bounds of a single node.

• Unmodified system libraries (e.g., libc) may be used on nodes running a full (e.g., Linux) kernel. System library support may be limited or require adjustments on some nodes, depending on their architecture. For example, nodes targeted mainly toward computational tasks, as illustrated in Figure 2.1, may have limited I/O support.

• Compiled libraries and executables may run without modification. As mentioned in the previous point, some nodes may run a kernel compatible with existing multi-core systems, but support may be limited on other types of nodes.

Note, however, that in order to fully utilize the parallelism capabilities of the system, beyond the multiple cores of a single node, an existing parallel application is expected to explicitly request the presented operating system to do so. This is later described in detail in Section 4.1.3.

5 Functional Overview

5.1 Inter-node Communication

This section describes custom data structures and methods that were created in this work to allow fault-tolerant inter-node communication over the assumed shared memory (Section 2.2). The solution combines message-passing and shared state between nodes to provide a fused environment in the proposed operating system.

Considering the assumptions detailed in Section 2.2.2, standard (library) data structures could not simply be placed in the shared memory to share information between nodes, because it is assumed to be weakly consistent between nodes and to require explicit acquire/release operations to synchronize access to it. The new structures that are detailed in Section 5.1.1 and Section 5.1.2 consider both the assumed consistency models in the memory hierarchy and possible faults, as mentioned in Section 2.3. The new mechanisms embrace the acquire-release semantics to optimize parallel performance by allowing completely wait-free accesses to all shared data. In addition, the required shared-memory synchronization operations, i.e., acquire and release, are used in ways that gracefully fail in any event of a hard-fault; i.e., the main objective is to leave all data structures in a consistent state without requiring any roll-back operations (see Section 3.2).

5.1.1 Block Transport Layer

This software layer forms the backplane on shared memory for unicast communication between nodes; it is used to pass fixed-size blocks between nodes and provides a wait-free interface. This is a low-level layer of the OS, used as the transport for the messaging layer described in Section 5.1.2. The global mechanism this layer represents is actually a composite of independent data structures, called channels throughout this thesis. Each internal element is a first in, first out (FIFO) queue for raw data that is permanently designated to be a unidirectional link between a single pair of nodes. The combination of all the data queues forms a complete graph of communication channels between the nodes. Since each channel is designated to only be written by a single node and read by a single node, i.e., single producer single consumer (SPSC), it was possible to make the transport layer implementation very efficient and wait-free using the assumed shared-memory semantics (Section 2.2.2). Figure 5.1 shows how the operating system partitions the shared memory.

Figure 5.1: Shared memory layout (each node's input buffers, including its loopback buffer, followed by that node's arena of dataflow descriptors and shared application data)

The buffer denoted x → y contains the queue of messages routed from node x to node y. If x = y then the buffer is referred to as loopback, which is a special case because it is both read and written by the same node. Arena x is a contiguous region of shared memory; each node may only dynamically allocate and de-allocate from its own arena, as Section 4.2.2 explains.

5.1.1.1 Behavior on Faults

Each channel was designed to inherently survive a hard failure happening during any single operation and during sequences of operations that may span many consecutive blocks, without any effort for recovery. Surviving a failure, in this context, means that evidence of the incomplete sequence, whether it was adding or removing data, will be gracefully lost. Incomplete sequences cannot corrupt the channel structure nor data from other sequences on the channel (either previous or concurrent), and they do not leave behind any partially added or removed data. This is true by design, regardless of failures that may happen at the other end of the channel. This is essentially the most that could be hoped for: basic transactional memory at almost no cost. The mechanism that delivers this guarantee is described in Section 5.1.1.5.

5.1.1.2 Poor Man’s Transactions

Automatically undoing invalid or incomplete sequences gives the users of channels a guarantee similar to the one transactional memory provides in case of conflict (Section 1.2.2.4). Protective coverage for sequences of operations is possible although the transport channels have no notion of sequences themselves, and do not assume any structure on the underlying data. Instead, they expect the user (the message-passing layer, in this case, see Section 5.1.2) to provide segmentation for groups of operations. The messaging layer utilizes such a protected sequence to guard against a failure while reading or writing one or more messages that span several transport blocks.

Sequences of operations on a channel are inherently undone if they were not completed, because the mechanism does not publish them to the global system until the last operation has completed. The mechanism relies on the assumed consistency model of the shared memory (Section 2.2.2.2), which requires explicit software instructions to propagate writes to the shared memory. The runtime does not release changes on channels until the whole sequence of messaging operations and their surrounding processing is completed by the upper layer.

5.1.1.3 Extending Transactions

Each protected segment may also cover the user's post-processing of the sequence. A common usage is when a reader wishes to pull data from a channel, then process and check in the information it got somewhere safe (e.g., somewhere in the shared memory), without being vulnerable to failures causing the data to be lost after already being removed from the channel. Virtually extending the sequence is also available in the dual case of a writer, which can extend a sequence of messaging operations to cover its post-processing, allowing additional operations to be performed safely on other channels.

5.1.1.4 Transport Backplane

At the beginning of the shared region of each node, a set of buffers is statically allocated to store inter-node communication data. Those buffers hold the data structures called channels throughout this thesis; they contain messages received from every other node, and from the node itself. A single buffer is allocated for each sending node, and each buffer contains a cyclic FIFO queue of small memory blocks of fixed size. The optimal size of a block can be chosen to fit the most common message sizes, currently 32 bytes. Each queue is the channel dedicated to unidirectional communication between two nodes (SPSC), so data can only be added by a single remote node and removed by the local node. Since each node has one input buffer for every other node and one loop-back buffer to itself, there are a total of n² such buffers in the shared memory. Their combination creates a complete graph of non-blocking bi-directional communication channels among all nodes.
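
A minimal C++ sketch of this static layout is given below. The 32-byte block size follows the text, while the queue depth and all field names are assumptions made for illustration.

#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize  = 32;    // block size chosen in this work
constexpr std::size_t kQueueDepth = 256;   // blocks per channel (assumed here)

struct Block {
    std::uint8_t payload[kBlockSize];
};

// One unidirectional SPSC channel: written by exactly one (remote) node
// and read by exactly one (local) node.
struct Channel {
    std::uint64_t head;               // next block to read (consumer side)
    std::uint64_t tail;               // next block to write (producer side)
    Block         ring[kQueueDepth];  // cyclic FIFO of fixed-size blocks
};

// The statically allocated input buffers of a single node: one channel per
// possible sender, including the loopback channel, so a system of n nodes
// holds n * n channels in total.
template <std::size_t NumNodes>
struct NodeInputBuffers {
    Channel from[NumNodes];           // from[x] carries messages x -> this node
};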

5.1.1.5 Channel Operation

The implementation of each channel was optimized toward minimizing the number of acquire and release operations on shared memory while meeting the fault-tolerance and usability requirements mentioned in the previous sections. Only a single acquire or release operation of one machine word is performed per access sequence (see Section 5.1.1.2), in addition to those that read or write the user's payload. This efficient mechanism leverages the weak shared memory semantics (Section 2.2.2.2) by using a private (low-latency) cache of a partial state of the channel from shared memory. The channel allows the user to perform a series of read or write operations, including their required acquire or release for the payload, while only updating the metadata of the channel in the local cache, effectively keeping the changes unobservable by the global system. Intermediate operations on the channel issue acquire instructions to detect an overflow or underflow while operating on the private cache of the state. The metadata of the channel is only published globally (a single release instruction) when the user has completed the sequence of operations in a non-faulty manner.
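
Building on the Channel structure and the dsm_acquire/dsm_release stand-ins sketched earlier, the following illustrative producer-side class shows the cached-state technique: payload blocks are released as they are written, but the tail word is kept private and released exactly once, when the whole sequence commits.

#include <cstddef>
#include <cstdint>
#include <cstring>

class ChannelWriter {
public:
    explicit ChannelWriter(Channel& ch) : ch_(ch), cached_tail_(ch.tail) {}

    // Append one block; the change to the tail stays in the private cache.
    bool push(const void* data, std::size_t len) {
        dsm_acquire(&ch_.head, sizeof ch_.head);        // refresh consumer position
        if (cached_tail_ - ch_.head >= kQueueDepth)     // would overflow
            return false;
        Block& b = ch_.ring[cached_tail_ % kQueueDepth];
        std::memcpy(b.payload, data, len < kBlockSize ? len : kBlockSize);
        dsm_release(&b, sizeof b);                      // publish the payload block
        ++cached_tail_;
        return true;
    }

    // The single metadata release that makes the whole sequence observable.
    // If the node dies before this point, the sequence is silently lost and
    // the channel remains consistent (Section 5.1.1.1).
    void commit() {
        ch_.tail = cached_tail_;
        dsm_release(&ch_.tail, sizeof ch_.tail);
    }

private:
    Channel&      ch_;
    std::uint64_t cached_tail_;
};

The reader side mirrors this structure with a privately cached head, committing the read only after the upper layer has finished its post-processing (Section 5.1.1.3).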

5.1.2 Message-Passing Layer

Messages are packetized into fixed-size blocks and sent over the inter-node transport layer (Section 5.1.1) to their destination nodes. Each node frequently polls the messaging queues from the other nodes and processes the received messages. Section 5.1.3 later describes how nodes reliably process messages from the queues.

The following paragraphs describe several primary types of messages that are used by the distributed run-time environment. Please note that additional types of messages are exchanged between nodes but the following three are the main ones that support the dataflow execution engine.

5.1.2.1 TaskLoad

These messages bundle all information required to construct a task. Each message contains the thread binary to execute (Section 5.5.3) or its identifier, the latest input information, and the user requirements for shared regions. This message becomes a TaskDescriptor that the receiver will keep in shared memory, bit for bit. The structure of the message is therefore identical to a TaskDescriptor, which is described in Section 5.5.2 and illustrated in Figure 5.4. The procedure performed when receiving this message is described in Section 5.1.3.2.

5.1.2.2 TaskWrite

A TaskWrite message contains the destination address (a combination of the Unique Frame Identifier (UFI) and the input offset) and the value to write. The steps taken when processing this message are described in Section 5.1.3.1. The message occupies two 64-bit words: bits 63..9 of the first word hold the (masked) destination UFI, bit 8 is reserved with a value of 0 (Section 5.2.2.3 explains how it is used when this message is created), bits 7..0 hold the frame offset, and the second word holds the value.

Figure 5.2: TaskWrite message structure
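
The helpers below pack and unpack the two 64-bit words according to Figure 5.2. The constant and function names are illustrative, and the use of bit 8 as the "value needs translation" flag follows Section 5.2.2.3.

#include <cstdint>

struct TaskWriteMsg {
    std::uint64_t header;  // bits 63..9: destination UFI, bit 8: flag, bits 7..0: offset
    std::uint64_t value;   // the 64-bit input value being written
};

constexpr std::uint64_t kUfiMask    = ~std::uint64_t{0} << 9;  // bits 63..9
constexpr std::uint64_t kValueIsVfp = std::uint64_t{1} << 8;   // bit 8 (0 on creation)
constexpr std::uint64_t kOffsetMask = 0xFF;                    // bits 7..0

inline TaskWriteMsg make_task_write(std::uint64_t dest_ufi, std::uint8_t frame_offset,
                                    std::uint64_t value, bool value_is_vfp = false) {
    TaskWriteMsg m;
    m.header = (dest_ufi & kUfiMask) | frame_offset | (value_is_vfp ? kValueIsVfp : 0);
    m.value  = value;
    return m;
}

inline std::uint64_t dest_ufi(const TaskWriteMsg& m)     { return m.header & kUfiMask; }
inline unsigned      frame_offset(const TaskWriteMsg& m) { return m.header & kOffsetMask; }
inline bool          value_needs_fixup(const TaskWriteMsg& m) { return (m.header & kValueIsVfp) != 0; }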

5.1.2.3 Heartbeat

This message notifies that the source node is alive and carries state information used for load balancing, such as workload, temperature, fault rate, and other metrics. The purpose of this message is further described in Section 5.4.3.

5.1.3 (Safe) Messages Processing

Considering that permanent faults may cause the private memories of nodes to become inaccessible, as mentioned in Section 2.3.4.1, the proposed operating system keeps inter-node messages in the safety of the shared memory until their processing is complete. The shared memory is assumed to function regardless of node failures, so the messaging channels in it were designed to maintain a consistent state of the structures even if a hard-fault occurred during an operation, or their affiliated node suffered a hard failure. Continuous sharing of the messaging queues allows the pending messages of a failed node to be recovered and handled on a different node (see Section 4.3.1.2). Regardless of the specific type of message, the following methods are generally used to enable recovery from hard-faults when processing received messages:

• When a node processes a received message, it does not remove it from the input queue until it has completed performing the required operation. If a hard-fault occurred during processing, a different node could still process the message.

• The processing of messages that are critical for the correct operation of the application, such as TaskLoad and TaskWrite, involves updating the shared memory region of the node by checking-in the new information before disposing of the message. The process ensures that no critical information will be lost along with the local memory of the node in case of a fault. Further details on how this is performed for several messages are given in Section 5.1.3.2 and Section 5.1.3.1.

• The above processing is done in such a way that a core failure occurring at any point in time will always keep the information recoverable from the shared memory, either from the messaging queues or the relevant TaskDescriptor, as described in Section 5.3.1.

The combination of these steps allows the recovery of all messages that were pending in the failed node, and of the message that was being processed at the time the node failed. Special care must be taken when processing the first recovered message, as it might have already committed its full or partial results to the store before it was removed from the message queue. The backup region and all objects in it (Section 5.3.1) were designed to overcome problems resulting from incomplete operations and avoid corruption if operations are repeated during recovery.
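
A schematic receiver loop body is sketched below; ChannelLike, Msg, and apply are placeholders rather than the actual OS interfaces. The point is the ordering: publish durable effects first, remove the message last.

#include <functional>

// 'apply' must check all durable effects of the message into shared memory
// (and release them) before it returns; processing is idempotent, so
// repeating it during recovery is harmless (Section 5.3.2).
template <typename ChannelLike, typename Msg>
bool process_one_message(ChannelLike& channel,
                         const std::function<void(const Msg&)>& apply) {
    Msg msg;
    if (!channel.peek(msg)) return false;  // inspect the head message in place
    apply(msg);                            // e.g., update the TaskDescriptor
    channel.pop();                         // only now drop it from the channel
    return true;
}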

5.1.3.1 Processing Task Inputs

Writing inputs to waiting task frames with TaskWrite messages must also involve an immediate update of the TaskDescriptor (Section 5.5.2) in shared memory, so information is not lost in the case of a node crash. In addition to keeping the written input values, the TaskDescriptor also keeps track of the locations of writes made to it. The state is updated for each written input and marks when all of the inputs are ready and the thread can be executed. As always, the message processing assumes a hard-fault can occur at any point, so the process first writes and publishes the new data to the thread frame in shared memory, and only after doing so does it set the corresponding bit in the TaskDescriptor that indicates that the specific location was written. Keeping track of individual inputs is needed since duplicated messages may be received after a fault; more details regarding this safeguard are in Section 5.3.3.

5.1.3.2 Processing Scheduling Messages

Processing of TaskLoad messages involves modifications to several shared structures, which are done in several steps; this makes the message more complex to process safely than other types of messages. Safely processing this message means, as always, carefully accounting for the possibility of a hard failure happening at any point in time, on any instruction before or even during the processing of the message. The procedure must never leave the involved shared structures in an inconsistent state even in case of faults, and specifically, it must not lose or corrupt information that is vital to the correct execution of the task.

The procedure described below ensures that the latest state of the new task can always be recovered from either the newly created TaskDescriptor (Section 5.5.2) or the pending TaskLoad message, no matter at which point a node has failed. Moreover, the procedure considers that if interrupted, it can be safely restarted. The steps involved in processing a scheduling request, sketched in code after the list, are:

1. Verify that this is a new task; ignore the message if it isn't, see Section 5.3.2.

2. Allocate space for the TaskDescriptor (see Figure 5.4) in the shared memory of the node and initialize it using the information from the scheduling request.

3. Add an entry in the assigned tasks collection of the node, which is also stored in the shared memory (see Section 5.3.1). The entry in the collection links the UFI of the new task to the memory location of its TaskDescriptor. Modifications of the structure consider faults and aim to keep the collection consistent even if an ongoing operation has failed. If the structure became corrupted on a fault, it can be rebuilt with no information loss by scanning the entire memory region to gather valid TaskDescriptor objects, see Section 5.3.1.1.

4. Publish (release) the new descriptor and map entry. From this point, the TaskLoad message can be safely removed from the messaging queue.

5. Add a pointer to the task descriptor to the list of pending tasks in the local memory of the node. The task will remain there until all its inputs become available; Section 5.2.1 describes the execution process.
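
The sketch below mirrors the five steps; every helper here is an illustrative stand-in (declared but not defined) for the corresponding OS structure or service, not the real interface.

#include <cstdint>

struct TaskDescriptor;                                        // Section 5.5.2
struct TaskLoadMsg;                                           // same layout as a TaskDescriptor

// Illustrative declarations of the services used below.
std::uint64_t   ufi_of(const TaskLoadMsg&);
bool            is_new_task(std::uint64_t ufi);               // duplicate filter (Section 5.3.2)
TaskDescriptor* alloc_descriptor_in_own_arena();              // Section 4.2.2
void            init_descriptor(TaskDescriptor*, const TaskLoadMsg&);
void            assigned_tasks_insert(std::uint64_t ufi, TaskDescriptor*);
void            publish_descriptor_and_map_entry(TaskDescriptor*);  // release to shared memory
void            add_to_local_pending_list(TaskDescriptor*);

void handle_task_load(const TaskLoadMsg& msg) {
    if (!is_new_task(ufi_of(msg))) return;                    // 1. ignore duplicates

    TaskDescriptor* td = alloc_descriptor_in_own_arena();     // 2. allocate and initialize
    init_descriptor(td, msg);

    assigned_tasks_insert(ufi_of(msg), td);                   // 3. UFI -> descriptor mapping

    publish_descriptor_and_map_entry(td);                     // 4. release; the TaskLoad message
                                                              //    may now be removed safely
    add_to_local_pending_list(td);                            // 5. wait locally for inputs
}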

Single-Copy Scheduling Depending on the chosen assumptions for shared-memory access latencies (Section 2.2.2.3), some modifications of the process described in Section 5.1.3.2 that handles scheduling requests may be preferable. The process above transfers the information on a new task in TaskLoad messages to the executing node over the shared memory, which is only a good idea under some sets of assumptions. The main assumption that led to this design is that the memory access times of a node to its affined shared memory region are significantly better than to remote shared regions, so much so that it is worthwhile to copy information to a faster region before the task is ready to be executed. This tactic was designed to mitigate longer shared memory latencies when starting a task by bringing the data nearby before it is needed. However, under different assumptions, the nearby copy may not be beneficial enough to justify the extra copy, if nodes can efficiently execute code located in shared memory with affinity to remote nodes. In such a case, the node that created the task can keep the TaskDescriptor in its own shared region, and the TaskLoad message merely contains the address of the TaskDescriptor, i.e., a single machine word. If applicable, this method avoids redundant data movements in shared memory and reduces the load on the message queues because TaskLoad would be much smaller, but there is an additional cost. This mode requires an additional message (simply named TaskCompleted) to be sent to the originator of the task to inform it when the task is done executing, so it can free the shared TaskDescriptor. The extra message is needed because nodes are not allowed to perform memory allocations or de-allocations on remote regions, as mentioned in Section 4.2.2.

This work assumes that physical distance to the shared location is a very significant factor in access latency (Section 2.2.2.3). It was therefore decided to place task descriptors in the shared region of the node that was assigned to execute them. There is an exception to this policy, however, where the variation described above is used and only the address of the TaskDescriptor, not the descriptor itself, is copied to the executing node. The situation occurs when rebalancing recovered tasks from a failed node, as described in Section 5.3.1. In this situation, some scheduling steps were already done and the TaskDescriptor is already stored in the shared memory, but within the abandoned shared region of a failed node. In such a case, when the task is rebalanced to a node other than the one that recovered it, only the address of the recovered TaskDescriptor is sent to the new node. This mechanism avoids the cumbersome transfer of the actual TaskDescriptor from the recovered region to the newly assigned node. Once the rebalanced task is executed, a TaskCompleted message is sent to the recovery scheduler (see Section 4.3.1.3) so it can deallocate the memory.

5.2 Task Execution Procedure

5.2.1 Life of a Task

The process of tasks execution can be described in the following high-level stages:

• A scheduling request is created, specifying the thread code by name or by attaching the actual binary to execute, the number of inputs to wait for, and the regions of global memory it needs to access.

• The request is submitted to the local scheduler, which decides which node should be assigned to execute the thread based on load-balancing and performance considerations; see Section 5.4 for more details. The scheduler assigns a system-wide unique identifier (UFI) to the new task.

• The scheduling request (Section 5.1.2.1) is sent to the assigned node (possibly a loopback to the same node), as a message over the shared-memory communication channels (Section 5.1.1).

• The target node begins to process the request by first copying the thread information to a TaskDescriptor (Section 5.5.2) in its backup store in shared memory, before removing the message from the queue, as described in Section 5.1.3.2. After doing so, the task is protected from being lost in case of a failure during its processing.

• The node creates a local realization of the task in its local memory and adds it to the list of pending tasks. The task will wait there until all its inputs are received via TaskWrite messages.

• When a TaskWrite message is processed, the destination TaskDescriptor in the shared memory is first updated with the new information. This, again, is done before the message is removed from the queue, so the new data is not lost even if the node crashes after a thread received some of its inputs but has not yet executed. See more details in Section 5.1.3.1.

• When the thread has received all of its inputs it is ready for execution. At this point, another scheduling decision can be made to forward the thread to be executed on another node. The time passed since the creation of the thread could affect the decision on which is the best node to run on. The ownership is easily transferred when the thread has all of its inputs, because no further TaskWrite messages (Section 5.1.2.2) are expected.

• Once all inputs are ready, the shared user regions that the task needs to read are acquired to local memory and the task begins executing.

• During execution, the side-effects of the task are buffered in local memory because they need to be deferred to the completion time, see Section 5.2.2.

• Once the task has completed executing, it is added to the list of completed tasks in local memory, pending its finalization process, which ultimately commits the side-effects to the global system. The task finalization procedure is further detailed in Section 5.2.3.
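
The stages above can be summarized as a small state progression; the state names below are illustrative, not taken from the implementation.

// Coarse lifecycle of a task as described in Section 5.2.1.
enum class TaskState {
    Requested,   // scheduling request created, UFI assigned by the scheduler
    Loaded,      // TaskDescriptor checked into shared memory on the owner node
    Pending,     // waiting locally for TaskWrite inputs
    Ready,       // all inputs present, shared user regions acquired
    Running,     // executing with side-effects buffered locally (Section 5.2.2)
    Completed,   // finished, waiting for finalization
    Finalized    // side-effects committed system-wide (Section 5.2.3)
};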

5.2.2 Deferring Side-Effects

All operations a thread performs that have side-effects are not immediately committed to any shared resource; they are buffered in local memory until the thread completes. If Double Execution is enabled for the task, it is launched twice on different cores of the same node, and each thread's output is compared to the one from the sibling thread. If both executions complete with matching results, their outputs can be committed system-wide; if there is a mismatch in outputs, the thread is re-launched.

The side-effects buffer contains a trace of three types of operations that a thread may use to output its results. Descriptions of each of the buffered operations are in the following sections.

5.2.2.1 Deferring Memory Write Operations

Let us call speculative results those output values of a thread from the time they were generated to the time the thread has successfully completed its execution. Those results should be stored and not committed to the globally observable state before the thread has completed, and possibly also passed the Double Execution test. These writes are therefore not published until the execution has succeeded. Depending on the application code, publishing these speculative results might not damage the user logic even if the thread then failed before completion, because the writes must not directly trigger a remote operation (by polling, for example) until an appropriate TaskWrite notifies about them, which will only come after successful finalization of the thread.

5.2.2.2 Deferring Scheduling Requests

Submitted scheduling requests need to immediately return an identifier for the new task, which can be used as the target of subsequent TaskWrite operations or even as their value. The task may use the returned value to allow the child tasks to communicate directly among themselves. Since it is still unknown whether the running task will complete successfully and pass the Double Execution test, the scheduling requests are not published to the rest of the system until their parent task has completed. It seems reasonable to delay, considering the difficulty of hunting down wandering orphaned tasks that would be left in the global system. Those would be forever missing their inputs because their parent was re-launched after a fault and created other tasks. It is important that in addition to delaying the distribution of new scheduling requests, real UFIs are not even generated for them yet, as generating one leaves side-effects, since it involves invoking the scheduler (and its load-balancer) to create and assign a unique identifier and owner node. Immediate scheduling would also require mirroring or sharing the information between two sibling threads if Double Execution is used, which requires additional synchronization overhead.

5.2.2.2A Virtual Frame Pointers When a new scheduling request is made by a task, an actual scheduling message cannot be sent immediately, as mentioned earlier, so a temporary task identifier is returned to the caller and the request is buffered. The returned value is a Virtual Frame Pointer (VFP); it is not yet the global identifier that is described in Section 5.5.1. The VFP is only valid in the context of the creating task and cannot be used system-wide. However, the creator can still use this virtual identifier in its following operations just like it would use any UFI it received as input, e.g., writes can be performed to it and other tasks it creates can depend on its outputs.

Using virtual identifiers also eases Double Execution because both siblings simply produce identical identifiers, removing the need for any synchronization mechanisms. Another advantage of using VFPs is that by waiting until all scheduling requests from the task have been generated, a more informed decision can be made on where the child threads should be placed, i.e., which node should run them.
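
One possible realization of VFP generation is sketched below: identifiers are drawn from a per-task counter, so both Double Execution siblings produce the same sequence deterministically. The tag bit used to keep VFPs structurally distinct from UFIs is an assumption made for illustration.

#include <cstdint>

class VfpAllocator {
public:
    // Deterministic: the nth request of a task always gets the same VFP,
    // in both siblings of a Double Execution pair.
    std::uint64_t next() { return kVfpTag | counter_++; }

    static bool is_vfp(std::uint64_t id) { return (id & kVfpTag) != 0; }

private:
    static constexpr std::uint64_t kVfpTag = std::uint64_t{1} << 63;  // assumed marker bit
    std::uint64_t counter_ = 0;   // local to the creating task
};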

After creating a task, the parent is likely to quickly write some input values to it. So even without sophisticated analysis mechanisms, the system can exploit the temporal and spatial locality of write operations with respect to preceding scheduling operations. Keeping the scheduling request buffered in local memory for a short time without publishing it allows subsequent write operations to the new thread frame to be merged immediately. The merges are made in the fast local memory of the node without ever going through the message queues in the global memory, thus saving global communication costs. Note that the gain also applies if the following writes are destined for the loopback queue, because it too is stored in shared space for reliability. The write operation is made almost effortlessly compared to the case where the scheduling request is immediately published and writes to it are published soon thereafter.

The cost of using these virtual identifiers might not be negligible, however, because once UFIs are actually allocated for the requests, all of the data the thread has created needs to be reviewed and references to the virtual identifiers need to be replaced with the global UFIs; see Section 5.2.3 for details on this step.

5.2.2.3 Deferring Results Publication

Although sending duplicated writes will not damage the execution integrity (see Section 5.3.3), writes to threads still cannot always be performed immediately, for several reasons. When (optionally) using Double Execution, the value needs to be verified against the sibling thread before it is considered verified and can be published. More importantly, these writes need to be buffered because the target thread that is written to, and/or the value, may be the result of a previous scheduling request the thread has made, and are therefore only temporary VFPs when the write is made. The writes cannot be published before the temporary thread identifiers are replaced with their final values, which only happens when the thread is finalized and the scheduler has assigned global identifiers for them.

When buffering a write, bit 8 in the message structure illustrated in Figure 5.2 is reserved to specify whether the value of the write is actually a VFP (Section 5.2.2.2A). Writes that contain a VFP in the destination or value will need translation to final UFIs when finalizing the thread (Section 5.2.3). When finalizing the thread, checking whether the destination of a write is a VFP and not a UFI is easy since their structure is different, but an additional hint is needed to detect whether the written value is a VFP. Since values can be any data, the finalization process has no way of knowing if a certain value is a VFP that needs translation. To overcome the problem, a bit is reserved in the TaskWrite message to notify the scheduler that the value of the write needs translation when finalized.

There are quite a few simple tweaks that can immediately improve overall performance regarding the containment of side-effects; the most apparent ones are for the Double Execution case, since only a simplistic implementation was made. For example, some writes to inputs of dependent threads can be published system-wide immediately on creation if:

• The write doesn’t use a VFP, which means that it will not be modified during the thread’s finalization process. Remember there is no harm in re-sending it later if there was a fault.

• The write is trusted, meaning that either Double Execution (DE) is used and the operation was already verified by the sibling thread, or Double Execution is not used and every write is immediately trusted.

5.2.3 Task Finalization

Dataflow tasks have no global side-effects during their execution, but eventually their outputs, i.e., memory writes and new tasks they created, do need to be published system-wide. The task finalization process is not an atomic operation; the operating system considers that the global state must remain consistent even though every stage in the process may fail part-way when a node suffers a hard-fault.

When the task execution completes, after passing the Double Execution test if it is enabled, the task is put in a list of completed tasks waiting to be finalized. There is some work in the finalization process, since every side-effect that the task has made was deferred and stored in the local memory of the node (Section 5.2.2). The finalization process prepares the information the task has generated so it can be published to the rest of the system.

The deferred operations performed by the task are processed and committed from the local buffer in this order:

• Requests for direct writes to shared user regions are published system-wide as-is. This stage simply performs release operations (see Section 2.2.2.2).

• The local scheduler is invoked for each of the scheduling requests the task performed, to generate valid UFIs that will replace the temporary VFPs the task was handed during its execution. The scheduler chooses a destination node to run each task while considering performance optimizations and load-balancing (see Section 5.4). It then generates a UFI for the new task and sends the request to the destination node, possibly over the local loop-back. For details on the receiver side see Section 5.1.3.2.

• The buffered write requests are reviewed in order to map usages of temporary task identifiers (VFPs) to the final UFIs generated in the previous step. The temporary task identifiers may appear either in the destination field of the write request or in the value field. After the translation is made, the final requests can be sent to their destination nodes. For details on how this was made fault-tolerant see Section 5.3.3.

Section 5.3.1, Section 5.3.2 and Section 5.3.3 further describe the methods that allow safe task finalization even in the face of faults.
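
The commit order of the three stages above can be sketched as follows, with plain containers standing in for the real deferred-side-effects buffer and declared-only helpers standing in for the scheduler and messaging services; none of these names are the actual OS interfaces.

#include <cstdint>
#include <unordered_map>
#include <vector>

struct BufferedWrite { std::uint64_t dest; std::uint64_t value; bool value_is_vfp; };

// Illustrative declarations of the services used below.
void          release_user_region(std::uint64_t region_id);
std::uint64_t schedule_child(std::uint64_t vfp);   // picks owner node, allocates UFI, sends TaskLoad
bool          is_vfp(std::uint64_t id);            // VFPs are structurally distinct from UFIs
void          send_task_write(std::uint64_t dest_ufi, std::uint64_t value);

void finalize_task(const std::vector<std::uint64_t>& touched_user_regions,
                   const std::vector<std::uint64_t>& child_vfps,
                   std::vector<BufferedWrite> buffered_writes) {
    // 1. Publish direct writes to shared user regions (release only).
    for (auto region : touched_user_regions) release_user_region(region);

    // 2. Turn each buffered scheduling request into a real task.
    std::unordered_map<std::uint64_t, std::uint64_t> vfp_to_ufi;
    for (auto vfp : child_vfps) vfp_to_ufi[vfp] = schedule_child(vfp);

    // 3. Translate VFP references in buffered writes, then send them.
    for (auto& w : buffered_writes) {
        if (is_vfp(w.dest))  w.dest  = vfp_to_ufi[w.dest];
        if (w.value_is_vfp)  w.value = vfp_to_ufi[w.value];
        send_task_write(w.dest, w.value);
    }
}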

5.3 Fault-Tolerant Methods

This section describes the main principles and mechanisms that were developed in this work in order to achieve the global fault-tolerance described in Section 3.1 while striving to meet the performance goals mentioned in Section 3.2. Considering that fault-tolerance is an inherent part of all modules in the presented operating system, most of the mechanisms that realize the fault-tolerance methods described in this section were already mentioned in the previous sections that describe the run-time mechanisms.

Example Fault Conditions This paragraph illustrates some concrete contexts for the fault-tolerance mechanisms presented in the following sections. As faults can occur at any point in time (Section 2.3.4.1, Section 2.3.4.2), the system needs to be designed to recover from any single point of failure. Therefore, correct operation of the distributed system and all running applications should continue when a node responsible for some user task has failed, even when:¹

• The task may have already written some results to its dependent tasks; those writes will be repeated when the task is recovered and re-executed after a failure.

• The node was holding the binary code of the task it was executing in its volatile private memory, which is assumed inaccessible after a hard-fault (Section 2.3.4.1). However, it is required not to lose user code running from memory even if a node crashed, since it is not assumed that the user would be able to provide it again.

• The thread might have already performed scheduling requests and even wrote data to the threads it created.

• The node might have died before it read the contents of the scheduling command, while the originator of the command may have long forgotten about it.

• During the initialization of the support structures of the thread in global shared memory.

¹ The example fault conditions use concepts specific to this work; some of these concepts are defined in subsequent sections of Chapter 5, most notably in Section 5.2.

• During the waiting time for the inputs of the thread, but after it already received some of them.

5.3.1 Recoverable State

Each node keeps essential information about the latest state of its tasks in shared memory, so that if the node fails, its assigned work can be recovered and executed by the remaining nodes.

A collection of TaskDescriptor (Section 5.5.2) objects for the pending and running tasks a node is assigned to execute is stored in the shared memory; each descriptor contains the task identifier (UFI), the binary to execute, the inputs state, etc., everything needed to describe the task, as illustrated in Section 5.5.2. This collection serves as the backup in case the node fails and allows the recovery of all the tasks that the node was responsible for. A TaskDescriptor also keeps all the input values that were written to the task, so they are not lost on faults. Each TaskWrite message updates the destination descriptor with the new information until all inputs are received (Section 5.1.3.1). This backup is simply a copy of the TaskLoad messages the node has received but not yet completed. As mentioned regarding message processing in Section 5.1.3, while processing a TaskLoad or TaskWrite message, the TaskDescriptor in shared memory is updated first, before removing the message from the channel (Section 5.1.1), so critical information is never lost if the node suddenly stops at any point.

The recovery process for abandoned work handles the recovered TaskDescriptor objects from the shared backup as if they were just received from another node. The recovery process can also choose to forward the recovered tasks to other nodes, completely as-is, but can do so only if all their inputs were already received; see Section 5.4.2 for the specifics. Note that if the recovery node chooses to forward a task elsewhere, this task will already have all its inputs available and is ready for execution, thus it is not expected that there will be any pending messages regarding it that the destination node could not handle. All of the messages that are still waiting in the queues of the failed node are processed by the recovery node as if it were the original node. Note that the first message in each queue is processed just like any freshly received message, even considering that the failure may have happened after it was already partially handled.

These mechanisms ensure that the latest state of all tasks can always be recovered from a failed node, combining the information from the shared TaskDescriptor collection described in this section and the pending message channels, no matter at which point a node has failed.

5.3.1.1 Verifying Shared State

When starting the recovery process, the shared collection of TaskDescriptor objects is checked for consistency. The check verifies that the metadata of the collection is in a consistent state and that the structures of all TaskDescriptor objects in the collection are valid (see Section 5.5.2). If the metadata of the collection is corrupted, the recovery process scans the entire shared segment looking for TaskDescriptor structures in the raw data and reconstructs the collection. The TaskDescriptor structure has a special field with a value that is unlikely to appear in memory by chance; it is used when scanning the memory to find candidates for recovery. When the field is found, a checksum verifies the validity of the TaskDescriptor, see Section 5.5.2.1.

5.3.2 Idempotent Side-effects

As mentioned about the dataflow execution model in Section 4.1.1, dataflow task execution has no side-effects until it completes, but the actual completion of a task is a process in itself, which might be interrupted and repeated in case of faults. A safe finalization process for a task is therefore critical to maintain air-tight fault-tolerance.

A task is considered complete only when the entire system sees it that way: not when its thread finished execution but only after all side-effects were submitted back to the safety of the shared memory. In order to prevent any chance of memory corruption and unexpected application behavior after a recovery from a hard-fault, special care is taken when handling the completion process of a task. This process takes place after the thread code is done executing but before the side-effects have been committed to the shared memory and the task has been marked as completed. The finalization process accounts for the faulty case in which the task is finalized a second time, if a crash happened during a finalization that already released partial results to other nodes or even performed scheduling requests for new tasks. When the task is re-executed, the same results will be produced and sent as in the previous attempt, and repeated scheduling requests may also be re-sent if the node failed after starting to submit TaskLoad messages (Section 5.1.2.1) for new tasks when finalizing their parent (Section 5.2.3).

Solution overview The operating system design ensures that no problems will appear as a result of repeated executions of tasks that previously failed. Problems from duplicated task results sent in TaskWrite messages (Section 5.1.2.2) are overcome by tracking not only the total amount of data written to each task before starting it, but also where it was written. Section 5.3.3 explains the mechanism and its benefits.

To overcome conflicts from duplicated tasks that may be created after recovering a failed node, the operating system ensures that each task has a system-wide unique identifier (UFI). The mechanism also ensures that on the finalization of a recovered task (Section 5.2.3), each child task that was created is assigned the same identifier even on repeated executions of its parent. Moreover, the mechanism ensures that repeated scheduling requests are always sent to their original designated node.

With these mechanisms, it is possible to check that written task inputs and scheduling requests are not duplicates of ones received earlier.

Reproducibility The task finalization procedure (Section 5.2.3) ensures that repeated execution of recovered tasks will produce identical UFIs for their children as they did on the previously interrupted execution. To do this, the process publishes to shared memory the UFI that is being finalized and a list that contains the UFIs allocated to child tasks, which also stores their designated owner node. The list items are published (released) to shared memory before their matching scheduling requests are sent to the destinations. If the results publication process was interrupted and re-executed when the task was recovered, the scheduling information is recovered from shared memory and the new child tasks are created with the same UFIs and owner nodes as on the previous execution.

Duplicates detection To verify that a scheduling request is not a duplicate, the implementation utilizes the internal structure of UFIs (Section 5.5.1). Each node only needs to keep track of the single latest task identifier received from each of the other nodes. This small amount of information is enough because the values of one of the fields in the UFIs are generated sequentially on each node, see Section 5.5.1.
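
A minimal sketch of such a filter follows; it assumes the sequentially generated field has already been extracted from the UFI (Section 5.5.1) and that sequence numbers start at 1, and the class name and interface are illustrative.

#include <cstddef>
#include <cstdint>
#include <vector>

class DuplicateFilter {
public:
    explicit DuplicateFilter(std::size_t num_nodes) : last_seq_(num_nodes, 0) {}

    // Returns true if this request from 'source_node' was already seen,
    // i.e., its sequence number is not newer than the last one recorded.
    bool is_duplicate(std::size_t source_node, std::uint64_t ufi_sequence_field) {
        if (ufi_sequence_field <= last_seq_[source_node]) return true;
        last_seq_[source_node] = ufi_sequence_field;
        return false;
    }

private:
    std::vector<std::uint64_t> last_seq_;   // one entry per remote node
};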

5.3.3 Fault-Tolerant Inputs State Tracking

The correct order of execution for dependent tasks must be ensured even in the presence of faults, specifically faults that interrupt the process of communicating results between tasks. The issue addressed in this section concerns the reliability aspects of the task finalization stage described in Section 5.2.3. A task may only start executing once all its inputs are (fully) available (Section 4.1.1). Therefore, the proposed operating system guarantees that duplicated or partial results that were written do not cause a task to start prematurely. This section describes how the solution was implemented. The responsibility for detecting duplicated writes was assigned to the node receiving the written values, where the added logic was made simple and efficient. The solution described in this section also allows the TaskWrite messages to fit in two 64-bit machine words, as mentioned in Section 5.1.2.2.

The method is to keep a bitmap of the offsets of inputs that were already written within the frame of a task (in units of 64 bits). A write of a 64-bit word to the kth position in the frame sets the kth bit in the map, but only after writing and publishing the new value to the shared memory. To know when a task can start executing, the number of its expected input values is specified and stored along with the bitmap when the task is created. Using this bitmap, the latest state of fully written input values of a task can always be recovered in any case of failure. The updated bitmap and the original number of required values are sufficient for safe scheduling of dataflow tasks. The mechanism prevents repeated writes of re-finalized interrupted tasks from corrupting the sequence of dataflow execution. Nodes receiving task results verify that the expected number of bits are set in the mentioned bitmap before starting to execute a task. Note that the bitmap implementation is currently 64 bits in size, in order to fit it in one machine word, thus limiting the number of independent input values a dataflow task can wait for. Independent is noted because the task frame that contains the input values has no size limitation, and therefore each distinct input value can represent more data than a single word. Therefore, although 64 input words are a mere 512 bytes, data beyond the first 512 bytes is also accessible to a task when it is ready to run. Regardless of this workaround, it is very easy to increase or completely remove the 64-value limitation by simply increasing the bitmap size, but there might not be a need to do so in future applications.

Other dataflow implementations keep a field named Synchronization Count (SC) along with the task descriptor [65], which indicates how many input values are needed by a certain thread. On thread creation, it is initialized to the number of inputs the thread is waiting for, and each write to the thread decreases the SC by one, so when it reaches zero the thread can be executed. This simple mechanism is enough to support the dataflow logic in the non-faulty case, but it does not work correctly with the assumed architecture and constraints (Section 4.4.1, Section 4.4.2) once faults are considered possible. The synchronization count field alone is not enough to reconstruct the latest state of a task after a node failure. Take the example of a task waiting for two input values, in the case that a node failed during the commit phase (described in Section 5.2.3) of another task that writes to the waiting task. In the unfortunate case that the writer crashed just after sending the first value, the target SC was decreased by one. The almost-finished task will be recovered and executed again, entering the commit phase and sending the first value again and the second value for the first time. When the first value is received again, the SC of the waiting task will be decreased to zero and the task will begin to execute even though it only received one of its inputs. This erroneous case cannot be solved with the SC aggregation alone, so additional information is needed to track the state of individual inputs and prevent prematurely launching a task.
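
A self-contained sketch of the per-task input tracking state is given below; the field names are illustrative, and in the real TaskDescriptor the bitmap is updated only after the input value itself has been written and released to the frame (Section 5.1.3.1).

#include <bitset>
#include <cstdint>

struct InputTracker {
    std::uint64_t written_bitmap  = 0;  // bit k set <=> input word k was written and published
    std::uint8_t  expected_inputs = 0;  // fixed when the task is created

    // Idempotent: a duplicated TaskWrite only re-sets a bit that is already set.
    void mark_written(unsigned k) { written_bitmap |= (std::uint64_t{1} << k); }

    // Ready only when the exact number of distinct inputs has arrived;
    // duplicated or partial results can never make this fire early.
    bool ready() const {
        return std::bitset<64>(written_bitmap).count() == expected_inputs;
    }
};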

5.3.4 Distributed Watchdog

A distributed watchdog was made to continuously monitor the health of the nodes in the system and detect hard failures. It is composed of software agents running in each of the nodes; they monitor their neighbors and initiate the recovery process if a hard-fault is detected.

Each node periodically sends the Heartbeat messages mentioned in Section 5.1.2 to all nodes, including itself. Concordantly, a service running in each node periodically checks that a Heartbeat was recently received from all nodes. If the watchdog agent in one of the nodes does not receive a Heartbeat from one of the other nodes for a long period of time, it triggers the recovery procedure for the presumably failed node, as described in Section 4.3.1.2. The watchdog is a service independent of the main scheduler; it was made so that it is able to perform a partial self-test of the local node and detect problems before the other nodes do. A subject for future work may be to add the ability to also recover from some hard node-local core faults without affecting the rest of the system. In addition to the sign of life that any message generally signifies, the Heartbeat messages have more value: each node adds instantaneous and cumulative metrics on its state, and this information is used by other nodes to support better scheduling policies and load balancing; more on this subject is detailed in Section 5.4.3.
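
The agent's bookkeeping can be as small as the sketch below; the timeout value and class interface are assumptions made for illustration, and a real agent would also run the local self-test mentioned above.

#include <chrono>
#include <cstddef>
#include <vector>

class WatchdogAgent {
public:
    using Clock = std::chrono::steady_clock;

    explicit WatchdogAgent(std::size_t num_nodes,
                           Clock::duration timeout = std::chrono::milliseconds(500))
        : last_heartbeat_(num_nodes, Clock::now()), timeout_(timeout) {}

    // Called whenever a Heartbeat message from 'node' is processed.
    void on_heartbeat(std::size_t node) { last_heartbeat_[node] = Clock::now(); }

    // Called periodically; returns the nodes presumed dead so the caller can
    // trigger the recovery procedure (Section 4.3.1.2).
    std::vector<std::size_t> suspects() const {
        std::vector<std::size_t> dead;
        const auto now = Clock::now();
        for (std::size_t n = 0; n < last_heartbeat_.size(); ++n)
            if (now - last_heartbeat_[n] > timeout_) dead.push_back(n);
        return dead;
    }

private:
    std::vector<Clock::time_point> last_heartbeat_;
    Clock::duration timeout_;
};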

The frequency at which Heartbeat messages are sent controls the trade-off between detection latency and freshness of load-balancing metrics on one hand, and the mechanism's overhead on the other. An optimal interval for Heartbeat advertisements is not easy to find, even when ignoring the overhead of processing them. If, on a very error-prone system, one would want the shortest delay until a failure is detected and recovered, a very small watchdog timeout may not work. The Heartbeat messages are transferred on the main communication channels between nodes and not in an out-of-band method, e.g., written to a fixed shared-memory location or using a dedicated hardware module. Having to travel through the message queues, their waiting time depends on workload; so if a node is busy, a very small watchdog timeout might cause a false detection even though there are Heartbeat messages waiting in the input queues. To avoid such a catastrophic outcome, the schedulers enforce a limit on the amount of processing done between checks of messages from each of the nodes. With this strict bound on the time-slice for checking each message queue, a reasonable watchdog timeout exists even for very busy workloads.

Although the work needed to process each Heartbeat is very small, it is expected that deeper tests and data analysis will be needed on an actual device. A relatively significant reduction in computing cost, with negligible sacrifice of reliability, is to have each watchdog agent guard only several other nodes rather than all of them, so a similar safety level is maintained with much less computing effort.

This work suggests incorporating hardware support on future devices for more efficient core failure detection. If such a solution exists, it can be immediately utilized in the current implementation, triggering the existing watchdog starvation procedure. Informing any single watchdog agent that a failure happened is sufficient and avoids having to wait for a timeout.

5.3.5 Double Execution

As introduced in Section 4.3.2, Double Execution in this work is a lockstep algorithm performed on a per-task basis. The method allows node-local detection and containment of soft errors with only a thin layer added over the execution procedure of a task. To detect soft errors and prevent them from corrupting global state, a ready task starts its thread twice in independent environments on two different cores of the same node. The deferred side-effects buffers (Section 5.2.2) are compared once both threads complete; if they do not match, a fault is assumed and the task restarts. The process repeats until the outputs from both threads perfectly match.

Compared to the extensive system-wide consideration for hard-faults, user-level transient faults are handled in a much more localized area of the runtime. The distinction from hard-faults is that it was found possible to design the solution such that the transient and intermittent disturbances considered in this work can be detected and recovered from at the local node level, without disturbing the global system. The guarded execution of tasks requires system-wide effort for recovery (Section 5.2), but the operation of the Double Execution mechanism is focused only on the thread creation and completion of a ready task.

Considering that the side-effects of tasks are deferred until their threads complete, as mentioned in Section 5.2.2, adding support for (pessimistic) Double Execution was straightforward. The developed dataflow runtime is built to contain the side-effects of a task while it runs and only commit the results on completion. Therefore, some of the reliability mechanisms that are needed to recover from hard failures could also be directly utilized to easily add an efficient Double Execution overlay on user tasks. To perform Double Execution on a task, the additional effort for the operating system is only to create a second side-effects buffer for it and execute a duplicated thread with it. When both threads complete, the side-effects are committed only if both side-effects buffers match.
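
The wrapper can be sketched in a few lines; here the two copies run sequentially and the buffer is a plain byte vector, whereas the real runtime launches them on two different cores of the node and compares its deferred side-effects buffers. All names are illustrative.

#include <cstdint>
#include <functional>
#include <vector>

using SideEffectsBuffer = std::vector<std::uint8_t>;
using TaskBody          = std::function<void(SideEffectsBuffer&)>;

// Run the task body twice in independent buffers; commit only on a perfect
// match, otherwise assume a transient fault and retry (Section 4.3.2).
SideEffectsBuffer run_with_double_execution(const TaskBody& body) {
    for (;;) {
        SideEffectsBuffer a, b;
        body(a);                 // first copy
        body(b);                 // second, independent copy
        if (a == b) return a;    // verified results proceed to finalization (Section 5.2.3)
        // mismatch: restart both copies
    }
}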

The OS implementation can, therefore, support the use of Double Execution for selected tasks with very low overhead on thread management. The additional costs when using the implemented mechanism are the computational cost of comparing the outputs of the two threads and, of course, the naive double execution itself, which adds significant delays if node utilization is high. Additional optimizations over Double Execution were not in the focus of this work, but the design allows other, more efficient, transient-fault tolerance methods to be easily incorporated.

Admittedly, a delimitation of the described fault model limits the recoverable soft errors to those occurring in user code (Section 2.3.4.2), because the Double Execution mechanism can only be directly applied to dataflow tasks. Double Execution could not be easily applied to the internal service threads that the presented operating system uses to support the user applications. Those OS services use long-lived kernel-level threads in each of the nodes that are written using traditional programming models and therefore do not qualify as dataflow threads.

5.4 Task Placement

One of the most challenging aspects to optimize in the design of such a massively parallel operating system is task placement, i.e., the strategy that determines the assignment of tasks to cores. As in most multiprocessing environments, these algorithms drastically affect the performance of applications running on the suggested operating system. With traditional multiprocessing programming models on a handful of contemporary cores, one could manually place tasks on processing cores and design the method for threading, synchronization and communication to reach acceptable performance. In a chip with thousands or more cores, however, optimizing task placement becomes too complex for a programmer to do manually. A smart scheduler is required to optimize task placement in order to maximize parallelism and reduce data movement, communication congestion, memory access latencies, energy consumption, heat density and more.

Although the operating system suggested in this work assumes a memory architecture with several possible layers of NUMA in the hierarchy, the implemented prototype for the operating system does not yet fully consider it. The implementation models a simplified NUMA architecture of the memory: there are node-local domains that contain private memory and caches of each node, and a possibly partitioned shared memory. The model is further described in Section 2.2. In a future exascale system with more fine-grained memory access latencies, sophisticated scheduling methods will be required to optimize performance.

5.4.1 Load Balancing

Dynamic load-balancing is needed in order to optimize the utilization of computational resources in the dynamic system, which improves both performance and reliability. Imbalance in core utilization may not only cause severe performance degradation, it may also increase power consumption and long-term chip wear due to uneven thermal cycling and hot-spots [6], [7].

Since the goal of this work is an overall design for a new operating system, the prototype implemented for it covers a wide range of functionality. In order to reach a fully working system (in simulation) during the course of the research, some aspects of the implementation had to be simplified, so only basic scheduling algorithms were implemented. Developing good load-balancing algorithms for similar architectures and fault models can be the subject of many works. The work described in [66], for example, considers several important factors for task placement with a fault-aware model similar to the assumptions of this work. Task placement algorithms that consider communication latencies and a distributed shared memory are suggested in [67], [68], and several thermal-aware techniques are discussed in [6], [69].

5.4.1.1 Placement Considerations

When a local scheduler is asked to assign a node to execute a task, the task can be new, existing or recovered (see Section 5.4.2). The scheduler can use information from several sources to determine which node should execute the task. The considerations available in this work include:

• The immediate state of the local node, e.g., utilization of its hardware resources, temperature, failure rate, and more.

• State information about other nodes, collected through heartbeats (see Section 5.4.3). Since heartbeats do not arrive frequently enough to provide a fresh view at each scheduling request, extrapolation is used during the time intervals between heartbeats.

• Locality considerations that may apply to the task in relation to its ancestors or other tasks, such as communication, data, or code sharing.

Of course, additional sources of information should be used to improve task placement, such as those mentioned earlier in this section.

5.4.1.2 Task Placement

Although this work considers it likely that the shared memory will introduce various access latencies depending on the locations of the originator and the data, as mentioned in Section 2.2, the demonstrated scheduling algorithms ignore possible sub-partitioning of the shared memory in order to simplify the implementation. The implemented task placement algorithms treat any location in the shared memory that is not owned by the local node as if it has the same latency. With this simplification, each node considers only two shared memory latencies - the latency of its own shared segment and the latency to segments of other nodes - making the scheduling problem somewhat simpler.

Local-first Placement Policy The basic scheduler first tries to assign a new task to itself, i.e., to the node where the task request was generated. This is a sensible first choice because it is likely to preserve the spatial locality of the referenced code and data by keeping parent and child tasks near each other: without additional insight into the optimal target, it is likely that tasks will communicate most of their results to their close family, and keeping coupled tasks physically close to each other will probably improve their access latencies. In addition, keeping the scheduling information in the same node requires less processing and shared memory space than sending it to another node over the shared memory communication channels (Section 5.1.1).

When a local node becomes over-occupied with work, it sends new work requests to other nodes using the placement policy described below. The implementation determines that a node is over-occupied simply by setting a threshold on the amount of available space in its owned shared memory segment.

The effectiveness of this policy depends on application behavior; it can work well if all nodes have tasks that continue to generate additional tasks. However, if only a single node generates the main application tasks, this policy does not initially balance the work well. Correspondingly, when running a single application for the evaluation of this work (Section 6.4), this policy was disabled and the following one was used immediately.

Uniform Placement Policy When a node decides to forward a task, it can choose from several implemented ways to pick a target. In the experiments presented later in Chapter 6, a method that tries to distribute load uniformly among living nodes is used. The load balancer picks the living node with the smallest amount of pending work, as far as it can tell. To estimate which node is least occupied, each node keeps track of the number of waiting tasks on each of the others. The estimate is updated from gathered Heartbeat messages (see Section 5.4.3), and linear prediction is used for the time between received Heartbeat updates.
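A minimal sketch of how the two policies might fit together is shown below; it assumes a hypothetical node_view bookkeeping structure refreshed from Heartbeats, and the extrapolation shown is only one way to realize the linear prediction mentioned above, not the thesis implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 64

/* Per-node view of the system, refreshed from Heartbeat messages. */
typedef struct {
    bool     alive;            /* false once the watchdog declares a failure   */
    uint64_t pending_tasks;    /* waiting tasks reported in the last Heartbeat */
    int64_t  tasks_per_sec;    /* recent completion rate, for extrapolation    */
    uint64_t last_update_ms;   /* local time when the Heartbeat was processed  */
} node_view;

/* Extrapolate the pending-task count between Heartbeats (linear prediction). */
static uint64_t estimated_load(const node_view *v, uint64_t now_ms)
{
    int64_t drained = (int64_t)(now_ms - v->last_update_ms) * v->tasks_per_sec / 1000;
    int64_t est = (int64_t)v->pending_tasks - drained;
    return est > 0 ? (uint64_t)est : 0;
}

/* Local-first policy: keep the task unless the owned shared-memory segment
 * is nearly full; otherwise fall back to the uniform policy. */
int place_task(int self, const node_view view[MAX_NODES],
               uint64_t free_segment_bytes, uint64_t threshold_bytes,
               uint64_t now_ms)
{
    if (free_segment_bytes > threshold_bytes)
        return self;

    /* Uniform policy: pick the living node believed to have the least work. */
    int best = self;
    uint64_t best_load = UINT64_MAX;
    for (int n = 0; n < MAX_NODES; ++n) {
        if (n == self || !view[n].alive)
            continue;
        uint64_t load = estimated_load(&view[n], now_ms);
        if (load < best_load) {
            best_load = load;
            best = n;
        }
    }
    return best;
}
```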

5.4.2 Scheduling Junctions

Several events handled by a node's scheduler require it to decide where to place a specified dataflow task, i.e., to immediately assign a node to execute it. The decision can be adjusted for a given task several times; the main decision points are the following:

• When a new dataflow task is explicitly created by legacy-type (non-dataflow) user code at the beginning of a large parallel operation. Those first tasks of the operation can create child tasks to expand parallelization.

• New dataflow tasks can be created during the finalization process of their parent task (the process is illustrated in Section 5.2.3). Since the dataflow framework targets large computations, the vast majority of scheduling decisions are part of the results of some parent task. The focus is therefore mostly given to optimizing this case, although the same techniques generally apply to all cases.

• After the task has received all its inputs and become ready to execute, there is an additional opportunity to reconsider the assignment. Note that the task could not be moved before it became ready to run: the first scheduling decision is incorporated into the UFI of the task, which may be the target of results from other tasks (see Section 5.5.1). If the task were relocated before receiving all its inputs, the values it needs would never reach it. Once the task is ready to run, it can be re-assigned to any node; this may be worthwhile if there was a significant change in load distribution in the system since the task was created, e.g., the original owner became over-utilized. Note that re-balancing tasks at this point has additional costs compared to distributing them at creation time. The current implementation can either use the same algorithm that was used for the original scheduling, or simply assume there is no change in considerations.

• When a task is recovered from the shared memory of a failed node. The node assigned to handle the recovery process can immediately rebalance all of the ready tasks it finds in the abandoned memory. Recovered tasks that are not ready can only be rebalanced once they become ready: as in the previous point, holding onto a recovered task and re-balancing it later has a cost, but non-ready tasks cannot be rebalanced because their pending inputs would be sent to the wrong destination.

All scheduling junctions can use largely the same load-balancing considerations; the later ones simply operate on more recent state information.

5.4.3 Heartbeats Collection

In addition to the sign of life that a Heartbeat message represents to the distributed watchdog (see Section 5.3.4), each node adds some local state information to it. Other nodes use this information to improve scheduling policies and load balancing. Metrics added to Heartbeat messages include an estimate of the node's workload, temperature, fault rate, the message creation time-stamp, the time-stamp of when the last Heartbeat was received in the other direction and several other metrics on the local kernel. The information is sent to help other nodes decide on the best targets for new tasks. It is easy to imagine many more useful measurements and insights that could be gathered and considered for scheduling decisions. Of course, adding many complex metrics has a cost, so for each one, the benefit should be weighed against how often it is sampled and sent, and whether those can be reduced by estimation or prediction. The frequency of published Heartbeat messages is therefore not trivial to decide on: a fresh view of the global state, or even just of nearby nodes, is valuable, but gathering some of the metrics may be costly.
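As an illustration only, a Heartbeat payload carrying the metrics listed above might look like the following C structure; the field names and widths are assumptions, since the thesis does not fix a wire format.

```c
#include <stdint.h>

/* Illustrative Heartbeat payload; the real layout is not specified here. */
typedef struct {
    uint32_t sender_node;     /* node that published this Heartbeat             */
    uint32_t pending_tasks;   /* workload estimate used for load balancing      */
    uint16_t temperature_c;   /* die temperature sample                         */
    uint16_t fault_rate;      /* recent faults observed per time unit           */
    uint64_t created_ts;      /* sender time-stamp at message creation          */
    uint64_t last_rx_ts;      /* time-stamp of the last Heartbeat received from
                                 the destination node (the other direction)     */
} heartbeat_msg;
```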

5.4.3.0A Inter-node latency measurements It may be beneficial in a future system to measure various delays in the system - latencies of each inter-node channel and the workload of each remote node separately - in order to optimize task placement and scheduling. Some of the latencies of interest, such as access times to different parts of the shared memory, might be available through hardware interfaces on a future device, or be estimated accurately enough from the architecture. The current OS implementation, however, does not assume much about the underlying memory hierarchy, and it also cares about delays accumulated at the software levels, which hardware cannot measure and which vary with specific workloads. To avoid dependencies on future hardware and indirect estimation of important latencies, a direct method of calculating end-to-end response times is used. Its goal is to estimate the total response times between each pair of local node schedulers, in other words, how soon a node can be expected to process a command sent to it from a specific location in the system. To do this, each node simply adds a time-stamp to each Heartbeat when it is created; a receiver node collects recent time-stamps from all other nodes and can then estimate several factors it might depend on. These algorithms were not investigated in this work, since the goal was to create a functional system. That said, several example statistics can be captured to supply the task-placement algorithms (a small sketch follows the list):

• Comparing the received time-stamps to local ones during processing measures the time it took to travel through the shared memory and the input queue from the sender. Accuracy depends on clock synchronization.

• Measuring timing jitter across a moving window of previous Heartbeat intervals measures the stability of the channel.

• Reading the timing values that the remote node captured in the other direction. This depends on the time it takes each remote node to process its messages.
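A small sketch of the first two statistics, under the assumption that time-stamps are comparable across nodes; the window size and the jitter measure (mean absolute deviation) are illustrative choices, not the thesis implementation.

```c
#include <stdint.h>

#define WINDOW 16   /* moving window of recent Heartbeat intervals */

/* One-way delay estimate: local receive time minus the sender's creation
 * time-stamp; accuracy depends on clock synchronization between nodes. */
static uint64_t one_way_delay(uint64_t created_ts, uint64_t received_ts)
{
    return received_ts > created_ts ? received_ts - created_ts : 0;
}

/* Jitter estimate: mean absolute deviation of the last WINDOW inter-arrival
 * intervals, a rough measure of channel stability. */
static uint64_t interval_jitter(const uint64_t intervals[WINDOW])
{
    uint64_t mean = 0, dev = 0;
    for (int i = 0; i < WINDOW; ++i) mean += intervals[i];
    mean /= WINDOW;
    for (int i = 0; i < WINDOW; ++i)
        dev += intervals[i] > mean ? intervals[i] - mean : mean - intervals[i];
    return dev / WINDOW;
}
```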

The current node implementation performs only simple processing of Heartbeat messages, so it was expected, and experimentally supported, that changing the update frequency between tens and hundreds of milliseconds did not have a noticeable effect on throughput. Still, building more useful insight into the global state in order to perform smarter task placement would require more work to gather the metrics to send and more sophisticated analysis of the received messages. Considering the processing costs of the additional work, a few simple changes could be made to fit different scenarios:

• Each node could keep a single Heartbeat at a fixed shared-memory location and periodically update (release) it so all nodes can read it. This method reduces the communication overhead and congestion on the message queues, but the real-time measure of message latency is lost. This change would also prevent nodes from using the information gathered about them at remote nodes, which is currently sent back to them.

• Heartbeats are currently sent in an all-to-all manner, but an optimization could have each node inform only several of its neighboring nodes. If we assume, or measure, a low probability for nodes on far sides of the chip to communicate, they do not need to know the precise state and workload of each other.

5.5 Implementation Details

5.5.1 Unique Frame Identifiers

A short dive into the structure of a Unique Frame Identifier (UFI) helps explain how the run-time mechanisms can be implemented efficiently. A UFI is a single 64-bit word, so it can be transferred and stored efficiently. The structure of a UFI is shown in Figure 5.3; it contains the following fields:

Source node identifier The ID of the node that created the task.

Destination node identifier The ID of the node that was originally destined to execute the task. This field does not change even if the task is later re-assigned to a different node, which may happen for load-balancing reasons or when it is recovered from a failed node.

Node-local task identifier A source-local unique identifier of the task. It is unique only among tasks created by the same node, which enables system-wide unique identifier generation without any synchronization with the other nodes. Values for this field are generated in increasing order and cycle back to zero on overflow. The ordering of the running counter is used for performing some checks when receiving TaskLoad messages, see Section 5.1.3.2.

VFP flag (reserved at 0) This flag specifies whether the value associated with the write to the target task is an address within the frame of a VFP, i.e., a temporary task identifier returned to a running thread to defer its side-effects until it completes (see Section 5.1.3.1). When this flag is set, the task finalization process (Section 5.2.3) knows that the value field is an address within a newly created task, and not a generic value that should stay as-is. If translation is needed, the finalization process maps the value to write from an offset within a temporary task identifier to the final address that was assigned to the task, and then clears the flag. Note that the target of a TaskWrite might also need translation regardless of this flag; this state is detected simply by the destination node field containing a predefined value (see Section 5.2.2.3).

Input offset (reserved at 0) This field remains zero, to be later masked with frame offsets in TaskWrite (Section 5.1.2.2). Using this field enables a TaskWrite request to be only two 64-bit words, making it efficient to buffer for running tasks when deferring their side-effects and to later transfer to other nodes if needed. The target address in a TaskWrite is a single word that combines the destination UFI and the input offset within the frame of the task.

Figure 5.3: Structure of a task identifier (UFI). Bits 63-52: source node; bits 51-40: destination node; bits 39-9: node-local task identifier; bit 8: VFP flag; bits 7-0: reserved (write offset).
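The bit positions in Figure 5.3 translate directly into shift-and-mask helpers. The following sketch assumes the field widths shown in the figure; the helper names are illustrative, not part of the implemented interface.

```c
#include <stdint.h>

/* Field positions follow Figure 5.3. */
#define UFI_SRC_SHIFT  52u                 /* bits 63..52: source node      */
#define UFI_DST_SHIFT  40u                 /* bits 51..40: destination node */
#define UFI_TID_SHIFT   9u                 /* bits 39..9 : node-local id    */
#define UFI_VFP_BIT    (1ull << 8)         /* bit 8      : VFP flag         */
#define UFI_OFF_MASK   0xFFull             /* bits 7..0  : write offset     */

static inline uint64_t ufi_make(uint64_t src, uint64_t dst, uint64_t task_id)
{
    /* Reserved fields (VFP flag and write offset) start at zero. */
    return (src << UFI_SRC_SHIFT) |
           (dst << UFI_DST_SHIFT) |
           ((task_id & 0x7FFFFFFFull) << UFI_TID_SHIFT);
}

/* A TaskWrite target is the UFI with the frame offset masked into bits 7..0,
 * so a full write request fits in two 64-bit words (target, value). */
static inline uint64_t ufi_write_target(uint64_t ufi, uint64_t word_offset)
{
    return ufi | (word_offset & UFI_OFF_MASK);
}

static inline uint64_t ufi_dst_node(uint64_t ufi)
{
    return (ufi >> UFI_DST_SHIFT) & 0xFFFull;
}
```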

5.5.2 Task Descriptor

This section illustrates the structure of a TaskDescriptor object, which contains all information required to execute a given task. These objects are transferred between nodes in TaskLoad messages (Section 5.1.3.2) and are also kept in a shared collection of the node that was assigned to execute the task.

A new dataflow task is created with several attributes, as specified by the DF instructions; those need to be kept with the task information when it is assigned to a node or restored from backup. One of the fields in the descriptor specifies the thread routine, as explained earlier in Section 5.1.2.1; either the thread name is included or the actual thread binary. The other fields of the TaskDescriptor include the frame size of the task that contains its input values, the state of written input values (see Section 5.3.3), declarations of the user-level shared memory regions the thread needs to access, and room to hold the input values themselves. Additional fields are used for validation when recovering the descriptor, as mentioned in Section 5.5.2.1.

The content of the descriptor is illustrated in Figure 5.4.

Figure 5.4: Structure of a TaskDescriptor. The header holds the task identifier (UFI), the written-inputs bitmap, the frame size (bytes) and number of shared regions, the size of the binary or name (bytes) and number of inputs (words), a validation cookie and a checksum over the fields up to that point. It is followed by the thread name or compiled binary, the thread frame (where task inputs are written as they arrive and which is available to the user when the thread starts), and the user-defined shared region declarations (offset in words, size in bytes and an r/w flag for each region).
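As a rough illustration, the fixed part of the descriptor could be viewed through a C structure such as the one below; the exact field widths and ordering are assumptions based on Figure 5.4, and the variable-length parts are only described in comments.

```c
#include <stdint.h>

/* Declaration of one user-level shared region the thread may access. */
typedef struct {
    uint32_t offset_words;     /* region offset, in words              */
    uint32_t size_bytes : 31;  /* region size, in bytes                */
    uint32_t writable   : 1;   /* r/w flag                             */
} region_decl;

/* Illustrative in-memory view of a TaskDescriptor header; the real object
 * is a variable-length record laid out contiguously in shared memory. */
typedef struct {
    uint64_t ufi;               /* task identifier (Figure 5.3)                */
    uint64_t written_inputs;    /* bitmap of inputs already received (5.3.3)   */
    uint32_t frame_size;        /* bytes reserved for the thread frame         */
    uint32_t num_regions;       /* entries in the region table at the end      */
    uint32_t binary_or_name_sz; /* bytes of the thread name or compiled binary */
    uint32_t num_inputs;        /* frame inputs, in 64-bit words               */
    uint64_t cookie;            /* validation cookie, released last (5.5.2.1)  */
    uint64_t checksum;          /* checksum over the fields above              */
    /* Followed in memory by: the thread name or binary, the thread frame
     * (inputs written as they arrive), and num_regions region_decl entries. */
} task_descriptor_header;
```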

5.5.2.1 Structure Validation

Additional fields in the descriptor are used to check the validity of the structure. The check is needed when recovering work from a failed node (see Section 4.3.1.2), in case the collection of shared descriptors was corrupted. When creating a new TaskDescriptor in shared memory, the validation cookie is the last field for which the system issues a release (see details on memory semantics in Section 2.2.2.2). The field has a constant value that is considered unlikely to appear in data by chance. Since the field is only published after the rest of the structure, its existence in memory means it is likely that a TaskDescriptor surrounds it. When the field is found in memory, the structure is checked against the checksum field. The discovered TaskDescriptor enters the recovered collection only if the check succeeds. If the check fails, it means either that the validation cookie appeared in memory by chance, or that the structure was not completely written when the fault occurred. If the structure is not complete, it is still stored in the messaging channels and can still be recovered (see Section 5.1.3.2).
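A minimal sketch of this recovery-time check; the cookie constant, the checksum routine and the function signature are hypothetical placeholders, not the thesis implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constant; the thesis only requires a value that is unlikely
 * to appear in ordinary data by chance. */
#define DESCRIPTOR_COOKIE 0xDF7A5CC0DE5AFE01ull

/* Hypothetical checksum over the descriptor contents up to (but excluding)
 * the checksum field itself. */
uint64_t descriptor_checksum(const void *descriptor, uint64_t length);

/* Recovery-time check: a candidate cookie was found while scanning the
 * failed node's shared segment; accept the surrounding descriptor only if
 * its checksum also matches (Section 5.5.2.1). */
static bool validate_recovered_descriptor(const void *descriptor,
                                          uint64_t header_length,
                                          uint64_t cookie,
                                          uint64_t stored_checksum)
{
    if (cookie != DESCRIPTOR_COOKIE)
        return false;                     /* cookie value appeared by chance */
    return descriptor_checksum(descriptor, header_length) == stored_checksum;
}
```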

5.5.3 Thread Binaries

Each user task in the system is associated with a compiled thread binary to execute. The compiled binary can contain any type of user code of any size, from a single dataflow thread to a whole executable that knows nothing about dataflow. Although tasks can be stand-alone user applications, once they have received all of their inputs and begun executing, they no longer participate in the dataflow execution that created them. Those applications can initiate new dataflow procedures but are considered outsiders to the process that created them. From the moment they start executing, no further fault-tolerance is automatically provided for these tasks, as they can contain any legacy code and are not restricted in their side-effects during execution. Since a primary goal of this work is creating a new execution engine, the discussions throughout the thesis deal mainly with task binaries that remain managed by the engine, i.e., those that do not become independent stand-alone applications.

The threads themselves are therefore usually dataflow tasks, i.e., some serial code and its static data; they can even be just a single function. These tasks are not whole programs but work only as part of a larger application. To launch a distributed application over the entire system, not all threads need to be copied to all of the nodes. A single node can load the whole executable, e.g., only the front-end node, and the rest of the nodes may load only the code of the tasks that they need to execute. This mechanism saves a lot of bandwidth and space when the system contains hundreds of nodes. For example, the binaries copied to other nodes in the Fibonacci implementation are no bigger than a few hundred bytes. Each thread is compiled as PIC (position-independent code) so it can be copied and executed from the shared memory, which can be mapped differently in each node. Thread binary code can be created in several ways: it can be compiled separately from the user executable, or as part of it. If the thread binary is separate from the application executable, it can be pre-compiled before the application is launched or generated at run-time by a JIT compiler. These combinations are supported simultaneously. There are two ways a node can get the thread binary it needs to execute once a TaskLoad message is received. Depending on the binary size of the threads and the number of times each binary is used, one of the following is preferable:

• The TaskLoad message contains the actual binary for execution. The executing node does not need to have additional access to disk or shared-memory in order to run the code. This also means a node can execute threads from any application without preparation. This way is probably preferable if the threads are small or if many kinds of them are dynamically compiled at run-time. Using this method of operation resembles message-passing paradigms more than shared-memory ones.

• Only the binary name is included in the TaskLoad message; a node receiving this information looks for the specified thread in its buffer of previously launched threads and tries to load it. If the thread is not in the buffer, the node assumes it was pre-compiled and tries to load it from disk. This option is probably better if threads are only pre-compiled or are relatively large.

The mode to use is determined statically for each thread but it could be automatically determined at run-time based on the thread size or history of its usage.
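A small sketch of this resolution logic, assuming hypothetical binary_cache_lookup and binary_load_from_disk helpers; the thesis does not name these interfaces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lookup hooks; names are illustrative. */
const void *binary_cache_lookup(const char *name, size_t *size_out);
const void *binary_load_from_disk(const char *name, size_t *size_out);

/* Resolve the code to execute for a received TaskLoad message: either the
 * message carries the binary itself, or only a name that is looked up in
 * the local cache of previously launched threads and then on disk. */
const void *resolve_thread_binary(bool has_embedded_binary,
                                  const void *embedded, size_t embedded_size,
                                  const char *name, size_t *size_out)
{
    if (has_embedded_binary) {
        *size_out = embedded_size;
        return embedded;               /* no disk or shared-memory access needed */
    }
    const void *code = binary_cache_lookup(name, size_out);
    if (code == NULL)
        code = binary_load_from_disk(name, size_out);  /* assumed pre-compiled */
    return code;
}
```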

Figure 5.5: Binary layout of a dataflow thread. The binary begins with a 64-bit offset to task_entry, followed by the position-independent code and read-only data; the entry point has the signature task_entry(const uint64_t* frame).
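For illustration, a minimal dataflow thread matching this layout could look like the following; the two-operand frame layout is an assumption, and a real thread would emit its result through the deferred side-effects mechanism rather than discarding it.

```c
#include <stdint.h>

/* Example entry point of a dataflow thread (cf. Figure 5.5). The frame
 * holds the task's inputs; here two operands are assumed at words 0 and 1. */
void task_entry(const uint64_t *frame)
{
    uint64_t a = frame[0];
    uint64_t b = frame[1];
    uint64_t sum = a + b;

    /* A real thread would defer this result as a TaskWrite side-effect
     * (Section 5.2.2) rather than writing global state directly. */
    (void)sum;
}
```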

5.5.4 Service Threads

Besides user tasks, some of the OS threads also need access to the shared memory, since the message queues, thread binaries and TaskDescriptor objects are stored there. The OS internally uses a special kind of Dataflow (DF) thread for some of its service threads, created with an interface similar to the one used for user DF threads but with extra privileges.

These are not the normal dataflow threads, which have no side-effects, have a short life-span and need just a single acquire operation. Some of the service threads persist for the lifetime of the node, are allowed to have side-effects before their completion and require more powerful access to the shared regions. They are responsible for maintaining the communication channels through the shared memory, polling the message queues from other nodes and running message pumps.

5.5.5 Dataflow Threads

The mechanism that executes the user dataflow tasks can have several implementations; several flavors of future hardware support were kept in mind during development, and some options are left open that allow passing some control over threads to hardware through added instruction-set extensions. Regardless of the chosen method, if one wishes to account for the possibility of transient faults, as assumed by the model described in Section 2.3.4.2, those can currently be detected and recovered from at the OS level. Basic Double Execution can be enabled on a per-task basis, i.e., starting each task on two different cores in parallel and only committing the side-effects if the results match. More details on DE are described in Section 5.3.5.

5.5.5.0A Hardware Supported Dataflow The current implementation assumes that special hardware support is required only to access the shared memory, since it is used with explicit synchronization operations through a new hardware interface (see Section 2.2.2.2), but not to perform any of the dataflow scheduling logic. The software-only solution suggested in this work provides greater portability and flexibility compared to an equivalent hardware solution, but it has higher overhead compared to application-specific hardware that may have intimate access to the memory buses. Other works suggest the use of specialized hardware to perform coarse-grained dataflow scheduling on a many-core architecture [59].

Correspondingly, one may assume that each node has a local dataflow-aware hardware scheduler, capable of monitoring and executing tasks according to the availability of their inputs. The presented operating system therefore has added support for transferring some control of a task to a hardware scheduler. The OS can notify the hardware scheduler about task creation and its required inputs, then relinquish control over the task until being notified of its completion. Once control returns to the OS when execution completes, the OS performs the regular finalization of the task and its side-effects (see Section 5.2.3) just as it would without the additional hardware support. The functionality for leveraging hardware-supported dataflow threads was implemented and tested in experiments using the dataflow x86-64 instruction set extensions described in [70].

5.5.5.0B Software Implemented Dataflow Another supported variation creates the DF threads as regular kernel-level threads (pthreads). Despite the possibility that a future many-core device will support a form of dataflow in hardware, the implemented framework does not rely on special dataflow-aware hardware, in order to remain compatible with existing x86-64 cores as required by the model described in Section 2.2.1. All dataflow thread management and inter- and intra-node scheduling is currently implemented in software, so there is no requirement for hardware-supported dataflow threads. This mode enables either the local OS kernel or a user-level thread pool to manage the dataflow computation in the node and share it among many running applications, making it more flexible than relying on a hardware scheduler module.

6 Evaluation

The ambition of this work is not only to propose a new software design for a fault-tolerant many-core operating system, but also to create a functional prototype of it. The top-level goal of the performed experiments is to verify feasibility and to illustrate the behavior and performance of the proposed operating system.

The operating system was implemented as a working prototype that can run general applications; its functionality and performance were then evaluated in a simulated environment for the many-core system described in Chapter 2. A simulation is necessary since the assumed hardware architecture has not yet become a reality. This work leveraged an existing multi/many-core full-system simulator to verify the functionality of the system and evaluate its performance under different architecture configurations and workloads. The chosen simulation tool, described in Section 6.1, also allows integrating custom behavioral and instrumentation code at the simulator level. This powerful feature was used to implement the assumed shared-memory semantics, including its timing behavior, and to gain complete visibility into the internal state of the proposed operating system without affecting the performance results.

This chapter is structured as follows: Section 6.1 describes the platform used to simulate the assumed hardware model that was described in Chapter 2. An overview of how experiments tracing data was generated, analyzed and aggregated to meaningful results is described in Section 6.2. Section 6.3 mentions the high-level objectives of the simulated experiments. Section 6.4 describes the user application used in the experiments and Section 6.4 presents and discusses some of the experimental results.

6.1 Full-system Simulation

This section briefly describes the simulation tools that were used throughout the development of this research and while carrying out the evaluation. A timing-based simulator (a.k.a. performance simulator) is required for an accurate evaluation, as opposed to a functional emulator/virtualizer alone (e.g., VMWare, VirtualBox), since it allows achieving a high degree of

accuracy not only in functional tests, but also in the timing behavior of the full architecture. Depending on the accuracy of timing, variations in the way parallel instructions are interleaved in the simulation may cause a ripple effect and lead to significant changes in the overall behavior.

Assuming the hardware is based on off-the-shelf processing cores, as mentioned in Section 2.1, has the practical benefit of minimizing the modifications required to existing tools in order to accurately evaluate the system. The main simulation environment used during this research was created by leveraging the open-source COTSon simulation framework [60], [71] to model the assumed hardware architecture and its interface to software. The tool was originally created by HP's Exascale Computing Lab and AMD; it targets simulations of clustered architectures with hundreds or more multi-core nodes. COTSon controls the timing aspects of the simulation and uses AMD's SimNow [72] as the functional simulator for nodes. SimNow supports simulation of generic x86-64 processors, which makes it suitable for the requirements of the node model assumed in this work.

With this powerful tool, the architecture layers and timing policies of the simulation could be controlled to trade accuracy for speed. Various configurations were used while developing and evaluating the operating system, ranging from fast simulations that test functionality while neglecting timing information, to more accurate timing models of architectural components used to assess the system's performance. However, reducing the simulation accuracy cannot be avoided, even when performance results are needed for real-world applications, because an accurate timing simulation can be over five orders of magnitude slower than a real system. Therefore, some accuracy had to be sacrificed in order to gather results that cover a wide design space across many large experiments ($10^{10}$ to $10^{12}$ cycles), such as those presented in this chapter.

6.1.1 Simulator Operation

To run experiments on the system, the many-core machine is simulated using the COTSon full-system simulator. The simulation host (1 TB RAM, 64 x86-64 Opteron cores) runs a VM instance for each node in the global system, each simulating several cores; 4, 8, 16 or 32 cores per node were used in the presented experiments. Each simulated node boots into its own kernel/microkernel and starts the run-time environment created in this work, which takes over its operation and joins the global system through the new shared memory interface.

Figure 6.1: Full-system many-core simulation and the distributed operating system. The many-core machine is simulated with a virtual machine instance (SimNow, orchestrated by COTSon) for each node. Each simulated node holds its cores, a microkernel (a Linux kernel on the front-end node), private memory with the thread binaries (fib, adder, spawner), input queues, and a segment of the shared memory holding task frames. The simulated environment illustrates the presented operating system running a test application.

Although the simulator creates a full x86-64 virtual environment for each node and the compute nodes run a Linux-like kernel, application tasks may not assume that such an environment exists. Dataflow tasks assume a bare-bones processor; the only external services available to them are those supplied by the run-time environment created in this work. This restriction is enforced by the compilation process of task binaries (Section 5.5.3), which prevents the tasks from having additional dependencies.

6.1.2 Simulator Modifications

Some simulator capabilities were modified and extended in this work in order to facilitate the assumed shared memory architecture and new software-hardware interfaces. The simulator allows adding custom x86 ISA extensions to the simulated cores, creating a functional implementation for them at the simulator level, and providing custom on-line timing specifications.

To implement the consistency model of the shared memory in simulation, an implementation of a similar model [73] was adapted to meet the assumptions of this work regarding the semantics of the distributed shared memory between nodes (Section 2.2.2.2). The functional principle of the shared memory implementation is that the host allocates a memory region outside the simulated environment and allows the VMs to access it indirectly using special machine instructions that are captured by custom simulator code. The behavior of some existing instructions was modified and a few were added to expose a new interface to the shared memory assumed in this work, which implements the memory semantics described in Section 2.2.2. As the simulation runs, this layer also supplies timing specifications of the new instructions back to the simulator.

Some additional changes and extensions were made to the existing simulation environment, mainly to support the evaluation stages of the research. The changes were crucial for streamlining the experiment execution process, monitoring the behavior and collecting tracing data from simulations, as mentioned in Section 6.2.

6.2 Investigation Methodology

This section provides an overview of the process that was used iteratively to expose information about the behavior and performance of the proposed operating system. Acquiring meaningful experimental results and isolating the different behavioral aspects of the proposed system was challenging. A lot can be learned from each experiment launched, but initially the information was far from readily available.

Each experiment can take a very long time to run in simulation and should be repeated to decrease variance. Some of the graphs presented in this chapter contain results from experiments totaling weeks of simulation time. Although the results from each experiment appear as no more than a single dot in most of the presented graphs, a lot of data is gathered on various types of metrics for each experiment.

Automated tools were developed in this work to help carry out many simulated experiments on various hardware architectures and software configurations. Additional support tools were also built to process the huge amounts of data generated during the simulations. The tools analyze and aggregate the results from all experiments so they are easily accessible for investigation, and they also provide an interface that allows the results to be explored visually. The high-level operations involved in an evaluation cycle of the system include:

Tracing Instrumentation code is added at the simulator level to generate the trace data required to collect the metrics of interest. The tracing is performed at the simulator layer and exported during execution from each node or core. Data generation is done by modifying the simulator and not at the OS level, because this allows extensive amounts of accurate data to be generated without affecting the timing of the simulated system. Some key points of the execution can be traced, if needed, at the guest (simulated) level, but only a few should be used since they affect the timing results. The only ones used in the presented experiments are those marking hard-fault detection and the beginning of the recovery process.

Configuration Specifying which system configurations are of interest for the desired experiment. The instructions can be simple lists of parameter-combination ranges to run, but since the number of available parameters and their options became large, shortcuts are also available. The list of possible configuration options kept growing during development and evaluation, but it was also regularly trimmed when results revealed that some changes had insignificant effects. A few of the major parameters in most configurations are the number of nodes in the system and the core count in each of them, the user application to run and its attributes, latency specifications for shared-memory operations, and possible injected faults.

Simulation The configuration directives are supplied to a script that was created in this work to automate the simulation process for series of experiments. It builds and runs the simulation for each specified configuration, monitors the simulation health during execution and collects the generated data. Extensive amounts of data are generated for each node in the simulation, both from preexisting simulator features and from the tracing features added specifically for this work. Controls were added to choose which modules to trace and at what level of detail, in order to speed up the simulation when focusing on limited aspects of the behavior.

Processing A program that post-processes the generated simulation data was created in this work and evolved as additional aspects of the execution were examined. Its purpose is to expose high-level behavioral aspects from the mass of low-level, scattered tracing data. This tool was vital to reaching meaningful conclusions on many aspects of the system that would otherwise have remained hidden had only the raw outputs been used.

Exploration The results from the previous step are gathered into a single file that contains the aggregated results of all experiments. This file allows investigating the results without repeatedly accessing and processing the terabyte of experiment outputs. The generated file is structured as a tabular spreadsheet, which makes it easy to explore in many data analysis tools (Python, MATLAB and Microsoft Excel were mainly used during the research). The aggregated data contains both temporal information on each experiment and summarized end results. All the evaluation graphs presented in this thesis were generated by slicing the data in this results file. Although only a handful of graphs are presented in this document, there is much more depth in the results that is readily available to be explored and examined with the tools created in this work.

6.3 Goals of the Experiments

The main goal of the experiments was to verify the feasibility and test the functionality of the new kind of operating system proposed in this work. The implementation of the proposed operating system covers a broad range of functionality, so creating realistic experiments for the entire system is challenging. This section describes the objectives that the performed experiments aim to achieve.

Functionality Evaluation

The main focus of the experiments is to test functional aspects of the created operating system and verify overall resilience under faulty conditions. The experiments coverage ranges from scenarios targeting a single module of the runtime, to testing system-wide behavior and performance. The three main aspects of the functional tests are described in the following paragraphs.

Functionality of all OS modules: Tested by examining basic operations of the new operating system mechanisms on various workloads, with and without faults. Special focus was given to the dataflow engine and inter-node communication services such as the transport layer, message queues and message processing. The operating system was also tested as a whole, by running complete user applications and distributing their work to all nodes with pre-compiled dataflow threads. Tracing of shared memory accesses and of OS module operations shows the scheduling of tasks according to their dependencies, and the communication of scheduling requests and task results over the shared memory. Tests were performed by analyzing tracing data from the simulation and verifying correct results from the applications.

Hard-fault detection and recovery: Reliable detection of faulty nodes on various workloads, continuous micro-checkpoints in a weak-consistency memory, recovery of pending work from failed nodes, recovery of all unprocessed and partially processed inter-node messages after a fault, application recovery and completion after a surprise node failure, and tolerance to nested/cascaded node failures, i.e., node failure during the recovery process. Hard failures were simulated by violently halting nodes at random times during the experiments.

Transient fault detection and recovery: Detection by Double Execution, i.e., executing a single dataflow task on two threads in sibling cores, deferring thread side-effects until verified, returning temporary system identifiers to threads and later replacing them with final ones, and transient fault detection on user code. Faulty cases were simulated by randomly injecting various amounts of errors into the results of user tasks.

Performance Evaluation

There are many types of actors in the implemented operating system, and the interactions between them span various levels of the solution, from directly interfacing the hardware up to the application level. Since each of the actors in the system dramatically affects the overall performance, comprehensive performance results could not be reached during the research with the available resources for simulation and development. Therefore, the experimental results presented in this chapter have a limited view of the wide design space that can be explored with the implemented operating system prototype. The following paragraphs describe the high-level performance aspects of the system that are examined.

Parallelization: Measure the amount of parallelization that is supplied to applications on various configurations, from a few cores to many cores. The main resulting metric is the parallel speedup factor compared to serial execution time.

Scalability: Measure the overhead of the operating system when increasing the number of cores and nodes, and varying other key parameters in the architecture and the application.

6.4 Experiments and Results

This section presents some of the behavioral and performance results gathered from the experiments on the simulation platform described in the previous sections. The following sub-sections illustrate the effects of some of the significant control parameters on behavior and performance.

Each experiment was defined by a set of system parameters that affect key aspects of the architecture structure and characteristics, the operating system, the application behavior and the fault conditions. The effects of around 20 parameters were experimentally investigated on the full-system simulator. Tools were made to perform groups of tests using ranges of system parameters and gather the results, and to process the information on an ever-increasing number of metrics from the execution traces of the experiments. The investigation methodology is described in Section 6.2. Test results from thousands of parameter combinations were collected and made easy to explore and analyze. Many aspects of the results can be investigated, from the collective level down to slicing individual runs. Some parameters had subtle and sometimes unexpected effects that were not immediately explainable; investigative work was sometimes required in order to track down the reasons for patterns that appeared in the results.

The simulation setup (Section 6.1) used in the experiments was limited to running up to 1024 cores; the experiments presented in this section show results from various possible partitions of those 1024 cores into nodes. To reach 64 total cores, for example, one can use 2 nodes with 32 cores each, 4 nodes with 16 cores each, 8 nodes with 8 cores, 16 nodes with 4 cores, and so on. For this work, however, not all combinations were investigated. For each total core count there are four extreme combinations, e.g., 1 × 64, 2 × 32, 32 × 2, 64 × 1 (nodes × cores); these resemble non-clustered systems and were therefore considered uninteresting in this work, since they do not illustrate the effects of the clustered hierarchy.

Application Under Test

The purpose of the user application used in the presented experiments is to demonstrate several aspects of the functioning system developed in this work. A single application was used throughout all experiments, with no modifications except where explicitly mentioned. Please note that although a single test application was used, the developed prototype can run general applications. Of course, the behavior and performance when running other types of workloads from different applications should also be evaluated in future work. Using a single application in most experiments keeps them comparable to each other, gradually revealing many patterns that would not otherwise have been noticed or searched for.

The chosen application should be highly parallelizable so it will not be a limiting factor on performance scaling when increasing the core count. All experiments were performed on a recursive calculator for the Fibonacci series. The heart of the application code was presented earlier; it is listed as-is in Section 4.1.3. Although the chosen application is only a rudimentary example of dataflow, and terribly inefficient for computing the Fibonacci series, it is very effective in stressing the operating system modules and the dataflow engine. The double recursion in the implementation quickly spawns a very large number of threads that are not completed until the recursion folds back. Because of the simplicity of the dependency graph in the example, a minified snapshot of the threads in the dataflow engine can easily be visualized, see Figure 6.1.

6.4.1 Application Timeline

Functional tracing information is exported from the operating system during simulation; it contains descriptions of selected operations and their time-stamps. When the simulation completes, the data generated by all nodes is collected and processed to extract both global metrics on the execution and a high-frequency function trace. Figure 6.2 shows two examples of real-time data that can be examined during the execution. It provides deeper insight into the on-line behavior compared to more aggregated metrics, such as total executed cycles in the system, which are presented in the following sections. Figure 6.2a shows that throughput reaches a stable maximum level in a relatively short time, for both short and long thread durations. When nearing the completion of the computation, throughput drops equally fast, as not enough parallelized work remains to stay at full utilization. Figure 6.2b shows the total number of tasks that have all their inputs available and await execution. Jitter is higher when tasks are very small; over 100,000 dataflow threads complete every second on just 32 cores.

Deeper look The previous experiments are expanded to show the behavior for several core partitionings, and the workload of each node is colored in a different shade within each experiment. The results presented in Figure 6.3 show a more detailed view of the results shown in Figure 6.2.

Figure 6.2: Throughput and parallel workload during run-time; 32 cores in 8 nodes, fib(25), 225k tasks executed. (a) System-wide throughput (tasks completed per second) over time for average task durations of 0.28, 0.56, 1.1 and 2.2 ms/task. (b) Parallel workload - total tasks that are ready to execute - over time for the same task durations.

Figure 6.4, Figure 6.7, Figure 6.5 and Figure 6.6 show a detailed view of the results from additional architecture configurations. The figures show the wall-clock time on the horizontal axis of each graph; the time to completion is where the graph reaches zero and no tasks are left to execute. Please note that these are stacked graphs: the top outline of each one equals the total number of ready tasks in the system, as similarly depicted in Figure 6.2b.

The scalability of the system when changing the node count and core partitioning can be seen in the graphs by examining the completion time of each experiment. These experiments show that when the core count is doubled, the completion time is roughly halved. More detailed scalability tests are demonstrated later in this chapter, so this section continues with

Figure 6.3: Workload versus task duration (panels for 0.28, 0.56, 1.1 and 2.2 ms/task); 32 cores in 8 nodes, fib(25), 225k tasks executed. Shades of color correspond to different nodes. Ready tasks are those that have received all inputs and are ready to execute.

Figure 6.4: Workload versus nodes - 4 cores per node (panels for 4, 8 and 16 nodes); 16..64 cores, fib(25), 225k tasks executed, average 2.2 ms/task.

the focus on behavior. Some differences between the configurations are clear when comparing experiments that have the same total core count but use a different partitioning of the cores. For example, although the experiments with 4 × 16, 8 × 8 and 16 × 4 (nodes × cores) all have the same total core count, the configurations with fewer nodes and more cores per node completed a little sooner and their workload was more balanced throughout the execution.

All experiments show some load imbalance during the first seconds of the experiment, as indicated by the ragged behavior. This effect generally becomes more noticeable when the tests are short, because the initial stabilization time becomes more significant in the overall execution time. The erratic behavior also increases when there are more nodes in the system, mainly because the workload estimates they keep on each other are not always accurate, as mentioned in Section 5.4.1. The duration of the initial load imbalance can be reduced by issuing more frequent Heartbeats (see Section 5.4.3) or by improving the estimation of load on neighboring nodes, which is a significant consideration in task placement, as mentioned in Section 5.4.1.1. Longer application tasks also help to smooth the performance behavior because they lower the stress on the task-placement mechanisms, allowing the operating system more time to

Figure 6.5: Workload versus cores per node - 4 nodes (panels for 4, 8, 16 and 32 cores per node); 16..128 cores, fib(25), 225k tasks executed, average 2.2 ms/task.

Figure 6.6: Workload versus cores per node - 8 nodes (panels for 4, 8, 16 and 32 cores per node); 32..256 cores, fib(25), 225k tasks executed, average 2.2 ms/task.

communicate load-balancing information between nodes (Section 5.4.3), as demonstrated in Figure 6.7, which shows results from an experiment with an average of 11.5 ms/task.

Figure 6.7: Workload versus nodes - long task duration (panels for 4, 8, 16 and 32 nodes with 8 cores per node); 32..256 cores, 225k tasks executed, average 11.5 ms/task.

6.4.2 Recovery from Faults

This sections describes how the behavior of the system was tested in response to faults. The faults were injected on many test scenarios such as various nodes/cores configurations, workloads, time of the failure and more. The fault recovery and the correct completion of user applications were verified automatically by comparing the user-level results to a non-faulty execution. Although the fault-tolerance experiments were carried out on many architecture configu- rations, the top-level performance characteristics of the presented operating

system were similar on many of them. Therefore, results from several experiments are presented to demonstrate the main features of the system behavior and performance following a hard-fault. The results shown in Figure 6.8 and Figure 6.9 complement the hard-fault recovery scenarios demonstrated in Section 4.3.1.3 and Figure 1.1. Tolerance to node hard-faults was tested by forcing a surprise failure of a node during the distributed execution of the user application. Section 7.3.0.0A shows an example of the output from these experiments in the simulator. Statistics about various aggregated metrics from the architecture components are collected in each experiment, as mentioned in Section 6.2. In addition, real-time traces from the local kernels and the underlying simulation layers are collected as the program runs, and are then used to visualize the performance of the system. Real-time traces include the number of threads started and completed, the number of tasks that are ready to run or waiting for inputs, shared-memory accesses and more.
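As an illustration only, a per-node sample of the kind of real-time trace described above could be as simple as the following structure; the type and field names are hypothetical and are not taken from the thesis sources:

#include <stdint.h>

/* Hypothetical per-node trace sample; field names are illustrative only. */
struct node_trace_sample {
    uint64_t timestamp_cycles;    /* simulated time the sample was taken  */
    uint32_t threads_started;     /* dataflow threads started so far      */
    uint32_t threads_completed;   /* dataflow threads completed so far    */
    uint32_t tasks_ready;         /* tasks currently ready to run         */
    uint32_t tasks_waiting;       /* tasks still waiting for inputs       */
    uint64_t shared_mem_accesses; /* shared-memory accesses observed      */
};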

Figure 6.8: Increased hard-fault detection delay - 128 cores in 4 nodes. Fault detection is artificially delayed and the recovery starts just before all available work is complete. Note that the throughput of the remaining nodes is only (slightly) affected when the recovery process begins. (The figure plots the number of ready tasks and the throughput in tasks/sec over time; the hard-fault and the start of recovery are marked.)

6.4.3 Speedup Through Parallelization

These experiments demonstrate the speedup achieved by the presented operating system as the total number of cores increases, and the impact of task granularity on efficiency. Only the number of cores and the thread

Figure 6.9: Excessively long hard-fault detection delay - 256 cores in 16 nodes. Artificially forcing a long watchdog timeout causes the global execution to come to a halt when all available work is complete, pending recovery of the abandoned tasks of the failed node. (The figure plots the number of ready tasks and the throughput in tasks/sec over time; the hard-fault is marked.)

duration are changed in each experiment while other parameters remain fixed; specifically, the total number of tasks and the core microarchitecture remain the same. Note that real-time performance as a function of thread granularity is also demonstrated in Section 6.4.1. The application is launched with varying amounts of computation that each task must perform, which varies its duration. To artificially increase the work of each task, instructions are inserted in each dataflow thread in addition to the work that the user code does. Figure 6.10 shows that when the thread duration is long enough, the system scales well as the number of cores increases. Perfect scaling would produce horizontally flat graphs in

Figure 6.10a, and a straight line from (16 cores, 100%) to (512 cores, 3.125%) in Figure 6.10b. Perfect scaling means that:

$$\text{perfect parallel run time} = \frac{\text{serial run time}}{N_{\text{cores}}}$$

where the serial run time is the total duration it would take a single core to run the entire application, and the parallel run time is the duration until all cores have completed the parallel execution. The values in Figure 6.10a represent the total work that was performed by the machine in order to run the application, including OS overhead, which

is further examined in Section 6.4.4. In the presented results, the serial run time is measured in terms of CPU core cycles: it is the sum of the clock cycles performed by all cores of the processor. Since the execution was parallelized over more than one processing unit, the elapsed real time is much shorter than the combined CPU time. Thus, although the values in Figure 6.10a increase with the number of cores, the actual wall-clock duration of the calculation is (mostly) decreasing, as evident from the parallel run time presented in Figure 6.10b.
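For reference, the relation plotted in Figure 6.10b can be written explicitly; this is only a restatement of the perfect-scaling definition above, with the 16-core configuration as the baseline:

$$\text{relative completion time}(N) = \frac{\text{parallel run time}(N)}{\text{parallel run time}(16)}, \qquad \text{perfect scaling: } \frac{16}{N} \;\; \Big(\text{e.g., } \frac{16}{512} = 3.125\%\Big)$$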

Figure 6.10: Speedup versus the number of cores. (a) Serial run time - the sum of the cycles performed on all cores; (b) parallel (wall-clock) run time, relative to a baseline of 16 cores. Both panels cover 16 to 512 cores, with curves for average task durations of 0.14, 0.28, 0.56, 1.1 and 2.2 ms/task.

It is clear from Figure 6.10b that relatively long threads, with a duration of around 2.2 ms each, improve the ability of the system to scale well as the core count increases. Even these longer threads are still rather small; the speedup the system offered when going from 16 cores to 128 is ×7.58, and ×14.66 when going to 256 cores. When the tasks are the smallest, each lasting around 0.14 ms in the example, there is a point where the OS overhead overcomes the speedup from the added parallelization of user code

on additional cores. The tipping point occurred in the presented example at the transition from 256 cores to 512 cores, where the parallel run time began to increase.
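As a rough sanity check on these figures, using the common definition of parallel efficiency (speedup divided by the increase in core count relative to the 16-core baseline; the percentages below are derived here and are not part of the measured data):

$$\text{efficiency}_{128} = \frac{7.58}{128/16} = \frac{7.58}{8} \approx 95\%, \qquad \text{efficiency}_{256} = \frac{14.66}{256/16} = \frac{14.66}{16} \approx 92\%$$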

The efficient scaling is expected to continue beyond 256 cores for more practical applications than the one used in the tests (see Section 6.4), with real-world tasks lasting more than a few milliseconds each. The behavior with threads longer than those presented here has not yet been examined because, even at this scale, an experiment that produces a single point on the graph can take days to run in a full-system simulation.

These results expose a significant limiting factor on scaling that is highly dependent on application behavior - the duration of user tasks. This is not a surprising discovery, because such behavior is expected in any parallel system. As in any distributed architecture, the overhead always increases with core count, due to the additional maintenance and distribution of work to more processing units. The main challenge in the industry is to maintain acceptable efficiency when scaling out to an increasing number of processing units. This challenge becomes easier when the duration of isolated tasks increases, because it mitigates the overhead associated with communication, synchronization and reliability.

6.4.4 Overhead of Parallelization and Reliability

The experiments in this section emphasize the amount of work performed by the operating system itself when running user applications in non-faulty cases, i.e., the overhead from the distribution of tasks and from the reliability mechanisms.

6.4.4.1 Effects of Parallel Workload

The results shown in Figure 6.11 are for experiments with very small tasks, each with a duration of around 140 microseconds. Having such small tasks makes the overhead of the operating system the main contributor to the total cycles performed by the device. These encouraging results show that the amount of work performed by the operating system has a linear dependency on both the total number of cores in the system and the total number of threads executed. Furthermore, the average overhead of each task was not found to be significantly dependent on the parallelized workload.

When varying the number of parallelized user tasks from 86k to 953k, Figure 6.11 shows that the change in the average task overhead stayed

within ±10% across all core configurations, and increased by 25%-75% for each doubling of the core count across the range of user workloads.

The tests with 512 and 1024 cores begin to align with the trend only toward the larger tests, since the small tests (up to 225k tasks) ended before reaching a stable throughput (100-300 ms). Note that in order to keep the graph simple, each point on the graph is an average over all cores/nodes partitioning configurations that produce the same total core count. Section 6.4.4.2 examines the differences in performance between different core partitionings, and Section 6.4.5 shows the equivalent results when varying the duration of tasks.

Figure 6.11: OS dominated cycles on different workloads. fib(23..28), average 0.14 ms/task. (Total cycles on all cores, in units of 10^9, versus the total number of tasks completed, 86k to 953k, with curves for total core counts of 16 to 1024.)

6.4.4.2 Impact of Cores Partitioning

The previous section examined the performance as a function of the total number of cores in the system, but as mentioned, that was an aggregation of results from several core configurations. Since the assumed many-core processor uses a clustered architecture, the impact of changes in hardware partitioning on behavior and performance was of critical interest during the research. Varying the core partitioning greatly aided the operating system development and calibration process by exposing hidden patterns in the complex behavior.

Several patterns appear when the same application is launched on different core partitionings. Figure 6.12 shows the OS overhead when varying the number of nodes and the number of cores in each node.

The test application used in the experiments shown in Figure 6.12 has very short user threads, with an average duration of 140 microseconds; the results therefore

Figure 6.12: OS dominated cycles on different configurations. fib(27), 589k tasks executed, average 0.14 ms/task. (Total cycles on all cores, in units of 10^9, for 4, 8, 16 and 32 nodes with 4, 8, 16 and 32 cores per node.)

mainly show the OS work needed to distribute and synchronize the work over all of the cores and to maintain resiliency. The following paragraphs examine several patterns that are immediately noticeable in the experiments when examining increases in the node count and increases in the number of cores per node.

Note that the total cycle count presented in the graph also contains the relatively inefficient prefix and suffix of the application, when only a few tasks are ready to be executed in the system. While only a few cores perform actual work, the rest idle at full speed, wasting cycles. Since the tasks are so short, the inefficient prefix and suffix of the application greatly contribute to the differences between results on different configurations: the ramp up and down in throughput take similar times, but a different number of cores is cycling idly. This appeared to be the main cause of the negative effect of increasing the number of cores per node for the same node count.

To further examine the cost of distribution, notice the difference in the total work performed on dual configurations in Figure 6.12, for instance 4 × 8 and 8 × 4 (nodes × cores). Both configurations have 32 cores in total, but Figure 6.12 shows that having 4 nodes of 8 cores is more efficient than having 8 nodes of 4 cores. The OS does not present this property by chance; it is to be expected, since the design considers NUMA, where sending work items to remote nodes is more expensive than sending them to neighboring cores in the local node, as discussed in Section 2.2.2.3. Following the goal mentioned in Section 3.2, the scheduling algorithm described in Section 5.4 and other

run-time mechanisms aim to minimize dependency on expensive shared-memory communication. Although node-local communication is preferred, a top priority of the OS is to balance load and to maximize parallelism by utilizing all nodes in the processor. The difference in overhead for dual core configurations is therefore expected, since more shared-memory communication is required to distribute work on an 8 × 4 (nodes × cores) system than on a 4 × 8 system.

The previous paragraphs mentioned some recurring relations between experiment results, but more can be learned by adding another dimension to the design space. Notice the differences in each group of 4 experiments having the same number of cores per node in Figure 6.12. A repeated pattern is seen for the other groups with a different number of cores in each node, as the architecture changes from 4 to 8 to 16 and 32 nodes. Figure 6.13 clarifies the pattern by varying the duration of tasks on each configuration. As task durations become more dominant, the relative differences between core configurations decrease, because the operating system overhead is not significantly affected. The change in the overall cycle count is larger on the configurations that were initially more efficient, as discussed earlier in this section.

Figure 6.13: Run-time with varied task durations and core partitioning. fib(25), 225k tasks executed. (Total cycles on all cores, in units of 10^9, for 4, 8 and 16 nodes with 4, 8, 16 and 32 cores per node, with curves for task durations of 0.14, 0.28, 0.56, 1.1 and 2.2 ms.)

6.4.5 Impact of Parallel Workload

In the following experiments the duration of tasks remained fixed while the total amount of computation the application required was varied, which creates different workloads that the system must handle at the same time. These

scenarios stress the parallelization capabilities of the system by quickly generating large numbers of small tasks that can run in parallel.

Figure 6.14 shows the total number of cycles performed while the task duration remained fixed at an average of 1.75 ms. The results show that varying the total number of tasks the application required by some factor affected the total cycles performed by a similar factor, within ±10% as seen in the previous section. This is demonstrated across several core configurations.

Figure 6.14: Varied application workloads on different cores configurations. fib(23..25), average 1.75 ms/task. (Total cycles on all cores for 4 and 8 nodes with 4, 8, 16 and 32 cores per node; roughly 166, 266 and 428 x 10^9 cycles in total for 86k, 139k and 225k tasks executed, respectively.)

These experiments also show that although the total work performed by the machine to complete the computation remains similar on most configurations for partitioning cores into nodes, there is an advantage to an increased number of cores per node over an increased node count. This was also observed in the results of Section 6.4.4.

The differences in the instantaneous workload of nodes between core configurations are evident in Figure 6.15. When clustering the cores into fewer nodes, there are more tasks that each node must keep track of at the same time. As before, the local kernels have no trouble maintaining efficiency up to 32 cores per node, so the overall efficiency still improves when there are more cores in each node compared to having more nodes. The performance improves when using more cores per node because it allows spending more time processing tasks from the node-local memory and requiring expensive shared-memory operations less often.

The results show that the operating system can maintain consistent application performance regardless of utilization; throughput does not deteriorate when the workload increases.

Figure 6.15: Impact of cores partitioning on workload. fib(25), 225k total tasks, average 1.75 ms/task. (Average number of ready tasks per node versus the number of cores per node, 4 to 32, with curves for 4, 8 and 16 nodes.)

7 Hands-on Experiment

This chapter gives detailed instructions on how to run applications on the proposed operating system in a simulated many-core environment. The experiment demonstrates how the operating system executes a massively parallel application made of dataflow tasks over all of the cores in the system. Furthermore, the experiment shows some of the internal states and events in each of the nodes of the distributed system during the execution, and how the system responds to faults triggered by user action. Source code for the experiment can be found at [10]. A preliminary version of this guide is available in the COTSon User Guide [12]; it was produced as a part of the European Union FP7 project TERAFLUX [15].

7.1 Goal of the Experiment

This experiment launches a distributed Fibonacci number calculator on the presented operating system. During execution, the simulation displays the operations performed by the runtime and by the user code in the virtual monitor of each virtual machine instance; additionally, the output is logged and can be examined after execution. Soft-errors can be injected randomly into some of the threads to demonstrate Double Execution in action. Hard node failures can be triggered by the user during execution with a keystroke, to watch the recovery mechanism. Various compile-time flags and launch arguments can be modified to change parameters of the runtime and the processor architecture, e.g., the number of nodes and cores, cache arrangement, simulation timing accuracy, shared-memory related latencies, scheduling algorithms, dataflow threads model, logging level and more. The visualization and logging level of each OS module can be controlled individually by the user and changed at any point during runtime. A list of possible keystrokes is displayed on the simulated monitor.

7.2 Instructions to Start

The first step is getting the simulation environment. Check out and build the COTSon simulator with:

$ export COTSONHOME=/cotson
$ svn co https://svn.code.sf.net/p/cotson/code/ $COTSONHOME
$ cd $COTSONHOME/trunk
$ ./configure && make release

Where COTSONHOME is an environment variable identifying the path to which COTSon will be copied. If anything above goes wrong, please read the detailed instructions in cotson/trunk/README.

The run-time environment and sample applications are contained in the following folder:

$COTSONHOME/branches/tflux-test/tfos/

To run the experiment navigate to the tfos folder mentioned above:

$ cd $COTSONHOME/branches/tflux-test/tfos/

Now start the simulation by executing:

$ make run_multi

After startup, the simulator window will display general information about the node and list the accessible commands, e.g., showing logs, testing a node failure, and others that can be interactively triggered by the user with a keyboard command in the SimNow window. Additional parameters can be configured in the launch scripts and in the code. For example, the number of nodes in the system is specified in tfos/os-tests/tsu_multi.lua:

cluster_nodes=4

The number of cores in each node is specified by the bsd file used:

use_bsd('32p.bsd') -- Can also be '4p.bsd', '8p.bsd' or '16p.bsd'.

To test node crashes, it is recommended to have more than 4 cores in each node. Notice that the bsd files with a large number of cores are not created by the default build configuration; they can be downloaded from:

https://upload.teraflux.eu/uploads/BSDS/bsds_images_initialized_for_karmc64_1Ghz.tar.gz

Some other parameters are specified in tfos/os-tests/Makefile:

OWMSZ=67108864  # Size of the shared memory.
SZ=44           # Parameter for the application (e.g. the Fibonacci number).
#NT=32          # Worker threads per node. Leave undefined to use the number of cores.

Several additional parameters are specified as compile-time flags; they enable some variations of the dataflow tasks implementation and can be used to evaluate the system under fault conditions.

Fault-tolerance and fault injection For example, to specify whether to enable Double Execution (Section 5.3.5), and whether to randomly corrupt some of the thread results in order to see the fault-tolerance in action, the following macros can be used:

#define DOUBLE_EXECUTION 1
#define INJECT_CORRUPTIONS 0

During execution, the user can manually inject a hard-fault into the system by keyboard input (see Figure 7.1). In addition, a hard-fault can be automatically injected if the following flags are set:

// Setting INJECT_HARD_FAULT to 1 will cause a hard-fault after
// roughly MIN_TASKS_BEFORE_CRASH tasks were executed.
#define INJECT_HARD_FAULT 0
#define MIN_TASKS_BEFORE_CRASH 150000
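As a hedged illustration of how such compile-time flags are typically wired into the code, the sketch below corrupts a task result with a small probability when fault injection is compiled in; the function name, probability, and result representation are assumptions for illustration only and are not the thesis implementation:

#include <stdint.h>
#include <stdlib.h>

#define INJECT_CORRUPTIONS 1   /* set to 0 to compile the injection out */

/* Hypothetical helper: corrupt a task result with a small probability
 * when fault injection is compiled in (illustrative names only).      */
static void maybe_inject_corruption(uint64_t *result)
{
#if INJECT_CORRUPTIONS
    /* Flip one random bit of the result in roughly 1 out of 1000 runs. */
    if ((rand() % 1000) == 0)
        *result ^= (uint64_t)1 << (rand() % 64);
#else
    (void)result;               /* fault injection compiled out          */
#endif
}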

Scheduling control To specify whether to include the actual thread binary in the TaskLoad scheduling requests (see Section 5.1.2), or only the name of the pre-compiled binary, the following flag can be set:

#define SEND_TASK_NAMES 1

When the flag is cleared to zero, each TaskLoad message is self-contained and allows immediate execution on any node without access to shared storage for loading the pre-compiled threads, at the cost of possibly sending the same binary code many times; see Section 5.5.3 for more details. Although threads are usually small, e.g., 100-200 bytes for the Fibonacci example in Section 4.1.5, the overhead can be avoided by sending a small thread identifier instead of the binary code, which is later used to load the thread from the common file system. When thread identifiers are used, the loaded binaries are cached, so subsequent requests are served quickly from the node-local memory without accessing the file system.

Scheduling algorithm variations can be selected with simple macros. As mentioned in Section 5.4, the implemented algorithms are rudimentary examples but illustrate how the information gathered from heartbeats (see Section 5.4.3) can be used to help load balancing among nodes.

// Prefer to place tasks on the local node unless utilization is high, then
// a secondary method is used. If this is not enabled, the method chosen below
// is immediately used.
#define PREFER_PLACING_LOCALLY 1

// Enable only one of the following:
#define RANDOM_SCHED_POLICY 0
#define UNIFORM_DISTRIBUTION_POLICY 1
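A minimal sketch of how these flags could combine into a placement decision is shown below; the helper names, the utilization threshold, and the round-robin state are illustrative assumptions and this is not the scheduling algorithm described in Section 5.4:

#include <stdlib.h>

#define PREFER_PLACING_LOCALLY      1
#define RANDOM_SCHED_POLICY         0
#define UNIFORM_DISTRIBUTION_POLICY 1

#define NUM_NODES        4
#define LOCAL_NODE       0
#define LOCAL_HIGH_WATER 32   /* illustrative "utilization is high" limit */

/* Choose the node that should receive a new task. local_ready_tasks is the
 * number of tasks already queued on this node; next_rr is round-robin state. */
static int choose_target_node(int local_ready_tasks, int *next_rr)
{
#if PREFER_PLACING_LOCALLY
    if (local_ready_tasks < LOCAL_HIGH_WATER)
        return LOCAL_NODE;                  /* keep the task on this node */
#endif
#if RANDOM_SCHED_POLICY
    return rand() % NUM_NODES;              /* pick any node at random    */
#elif UNIFORM_DISTRIBUTION_POLICY
    *next_rr = (*next_rr + 1) % NUM_NODES;  /* spread tasks evenly        */
    return *next_rr;
#else
    return LOCAL_NODE;
#endif
}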

7.3 Expected Output

When launched, node instances will open in SimNow windows and display the simulation progress:

Figure 7.1: Simulation screenshot

When the simulation completes, the behavior of each node can be examined in the stdout log files, provided that logging support was enabled at compile time, and activated for the requested modules at run-time by default or on keyboard command. Example output of a node is presented in Listing 7.1.

Listing 7.1: Example output trace of the dataflow engine

[Manager 1] System parameters:
[Manager 1] 16 cores in 4 nodes with 4 cores each.
[Manager 1] 64MB public shared memory, 16MB per node.
[Manager 1] 4*1MB message queues, leaves 12MB for dynamic allocation.
[Manager 1] Starting service thread, ip 0x4202e0.
[Scheduler 1] Dynamic allocation area rounded from 0x7ffff46f4140 to 0x7ffff46f5000, size 12MB.
[Manager 1] Starting service thread, ip 0x40c360.
[Test] Computing fibonacci(41).
[Scheduler 1] Starting message pump.
[Scheduler 1] Submitting task fib_reporter_task with UFI 10010000000200.
[Node 1 Writer] Sending message type 1, 73 bytes in 2 frames.

[Scheduler 1] Submitting task fib_main_task with UFI 10010000000400.
[Node 1 Writer] Sending message type 1, 77 bytes in 2 frames.
[Scheduler 1] Finalizing 0: Write target mapped VFP 200 to UFI 10010000000400.
[Scheduler 1] Submitting write to node 1, tloc 10010000000400.
[Node 1 Writer] Sending message type 6, 24 bytes in 1 frames.
[Scheduler 1] Finalizing 0: Write target mapped VFP 200 to UFI 10010000000400.
[Scheduler 1] Submitting write to node 1, tloc 10010000000401.
[Node 1 Writer] Sending message type 6, 24 bytes in 1 frames.
[Scheduler 1] Got task load message for UFI 10010000000200, binary size 17, frame size 8, sc 1.
[Scheduler 1] Creating task desc for UFI 10010000000200@0x7ffff46f5140.
[Task 10010000000200] Creating task from desc 0x7ffff46f5140, UFI 10010000000200, original sc 1, current sc 1.
[BinariesStore] Adding task binary: fib_reporter_task, 142 bytes.
[Scheduler 1] Got task load message for UFI 10010000000400, binary size 13, frame size 16, sc 2.
[Scheduler 1] Creating task desc for UFI 10010000000400@0x7ffff46f51e0.
[Task 10010000000400] Creating task from desc 0x7ffff46f51e0, UFI 10010000000400, original sc 2, current sc 2.
[BinariesStore] Adding task binary: fib_main_task, 618 bytes.
[Scheduler 1] Got thread write message, tloc 10010000000400, value 0x10010000000200.
[Scheduler 1] Got thread write message, tloc 10010000000401, value 0x29.
[Task 10010000000400] Ready.
[Task 10010000000400] [tid 7fffe3fff700] fib main for n=41 - spawning.
[Task 10010000000400] Ended.
...
[Scheduler 1] Got thread write message, tloc 10010000000200, value 0x9de8d6d.
[Task 10010000000200] Ready.
[Task 10010000000200] [tid 7fffe3fff700] report: fib result = 165580141
[Task 10010000000200] [tid 7fffe3fff700] Exit requested.
[Task 10010000000200] Ended.
[Scheduler 1] Sending termination requests...
[Node 1 Writer] Sending message type 7, 8 bytes in 1 frames.
[Node 2 Writer] Sending message type 7, 8 bytes in 1 frames.
[Node 3 Writer] Sending message type 7, 8 bytes in 1 frames.
[Node 4 Writer] Sending message type 7, 8 bytes in 1 frames.
[Node 1->1] Got terminate message.
[Scheduler 1] Exiting.

7.3.0.0A Injecting a permanent fault If a certain node (node 3 in the example) is killed by user input, the recovery node (node 1 was chosen in this experiment) will begin to take over and process the work of the failed node. The display should appear similar to the following:

Figure 7.2: Recovery status screenshot

The log of the recovery node would show additional details on the recovery process:

Listing 7.2: Hard-fault detection and recovery

...
[Watchdog] Node 3 probably died, no heartbeat received in the last 189 milliseconds.
[Manager 1] Starting recovery procedure for node 3.
[Manager 1] Starting service thread, ip 0x406d60.
[Recovery Scheduler for 3] Checking shared segment sanity...
[Recovery Scheduler for 3] Task descriptors map in shared memory has 37 items.
[Recovery Scheduler for 3] Adding new task from desc 0x7ffff65793c0.
[Task 1003000000bc00] Creating task from desc 0x7ffff65793c0, UFI 1003000000bc00, original sc 2, current sc 0.
[Task 1003000000bc00] Ready.
[Task 1003000000bc00] [tid 7fffe3fff700] fib main for n=29 - calculating.
[Recovery Scheduler for 3] Adding new task from desc 0x7ffff65791e0.
[Task 2003000000e600] Creating task from desc 0x7ffff65791e0, UFI 2003000000e600, original sc 2, current sc 2.
[Recovery Scheduler for 3] Adding new task from desc 0x7ffff6579140.
[Task 30030000000c00] Creating task from desc 0x7ffff6579140, UFI 30030000000c00, original sc 3, current sc 2.
... Recovering more tasks ...
[Recovery Scheduler for 3] Has 46 tasks in local memory:
  0 initializing
  35 waiting for inputs
  8 ready
  3 running
  0 finished
  0 total completed
[Recovery Scheduler for 3] Starting message pump.
[Recovery Scheduler for 3] Got task load message for UFI 1003000000c200, binary size 13, frame size 16, sc 2.
[Recovery Scheduler for 3] Creating task desc for UFI 1003000000c200@0x7ffff657a860.
[Task 1003000000c200] Creating task from desc 0x7ffff657a860, UFI 1003000000c200, original sc 2, current sc 2.
[Recovery Scheduler for 3] Got thread write message, tloc 2003000000e601, value 0x1e.
... Processing more recovered messages ...
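For illustration, a minimal sketch of the kind of heartbeat-timeout check that produces the watchdog line above is given below; the timeout value, data layout, and function names are assumptions, not the distributed watchdog of Section 5.3.4:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES            4
#define HEARTBEAT_TIMEOUT_MS 150   /* illustrative threshold only */

/* Last heartbeat arrival time (ms) recorded per node, updated by the
 * message-processing loop whenever a Heartbeat message is received.  */
static uint64_t last_heartbeat_ms[NUM_NODES];
static bool     recovering[NUM_NODES];

/* Called periodically with the current time; starts recovery for any
 * node whose heartbeat is overdue. start_recovery() is a placeholder. */
static void watchdog_check(uint64_t now_ms, void (*start_recovery)(int node))
{
    for (int node = 0; node < NUM_NODES; node++) {
        uint64_t silence = now_ms - last_heartbeat_ms[node];
        if (!recovering[node] && silence > HEARTBEAT_TIMEOUT_MS) {
            printf("[Watchdog] Node %d probably died, no heartbeat "
                   "received in the last %llu milliseconds.\n",
                   node + 1, (unsigned long long)silence);
            recovering[node] = true;
            start_recovery(node);
        }
    }
}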

7.3.0.0B Injecting a transient fault If Double Execution and random error injections are enabled, an injected soft-error will produce output similar to the following:

Figure 7.3: Double Execution in action

This is a simple implementation of Double Execution as described in Section 5.3.5; each task is executed twice (notice the different tid on each execution), and the results are not committed to the shared memory until the results of both threads are ready and found equal. When an error is injected, the mechanism detects it and launches the task again on two threads.
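A minimal sketch of this compare-then-commit idea follows; for simplicity it runs the two copies sequentially rather than on two threads, and the task and commit interfaces are hypothetical rather than the thesis implementation:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Run the same task body twice and commit only if both runs agree.
 * On a mismatch (a transient fault corrupted one copy), both results
 * are discarded and the task is executed again.                      */
static bool double_execute(void (*task_body)(uint8_t *out, size_t len),
                           void (*commit)(const uint8_t *out, size_t len),
                           size_t len)
{
    uint8_t first[256], second[256];
    if (len > sizeof first)
        return false;

    for (;;) {
        task_body(first, len);              /* first execution            */
        task_body(second, len);             /* redundant second execution */
        if (memcmp(first, second, len) == 0) {
            commit(first, len);             /* results agree: publish     */
            return true;
        }
        /* Results differ: retry the task before any side-effect is made
         * visible in the shared memory.                                  */
    }
}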

8 Concluding Remarks

This thesis presented a new fault-tolerant distributed operating system for future many-core shared-memory systems. The operating system is able to dynamically distribute applications over a clustered architecture with thousands of cores and a shared memory without hardware-based consistency. The system is resilient to the transient and permanent hardware faults that are anticipated on future exascale processors, without requiring special hardware assistance. Distributed software mechanisms detect various faults and provide on-the-fly recovery, allowing the operating system and all running applications to continue their correct operation with minimal delays. The cost of fault isolation and recovery is minimized due to the extremely small size of checkpoints, which are implicitly maintained at the level of single words. All permanent faults and harsh transient faults in nodes are detected by their neighbors through a distributed watchdog mechanism, which then initiates a global recovery procedure that restarts the affected tasks. User-level transient faults are detected and recovered at a node-local level by a Double Execution mechanism. During recovery, any task that was active during the fault can be safely restarted because the run-time environment ensures that all side-effects of dataflow tasks are idempotent.

The distributed run-time environment implicitly allows a single application that is parallelized using dataflow principles to efficiently utilize all cores in the dynamic system, even when simple task-placement algorithms are employed. Static or dynamic (i.e., at run-time) variations of the architecture parameters, such as the available core and node counts, the shared memory capacity, and the expected fault types, require neither user-level modifications at compile time nor involvement at run-time.

The shared memory is the only means of communication between nodes, and the operating system explicitly manages all consistency over it. A message-passing layer was added on top of the shared memory to assist task placement and communication. Thus, a fully wait-free, hybrid shared-memory/message-passing parallelization engine was created. It transfers only small amounts of data on the messaging channels and relies on the weakly-consistent shared memory to minimize duplication and data migration between nodes.

Experiments on a many-core full-system simulator with up to 1024 cores showed encouraging results. The distributed operating system efficiently scaled up and out with the core and node counts until reaching application limits on the additional performance benefits from parallelization. The results show that with a fixed total core count, efficiency benefits from increasing the grouping of cores into nodes at the expense of having fewer nodes. This is to be expected, since the proposed operating system promotes locality and prioritizes placing new tasks on the originating node, to avoid the extra overhead of inter-node operations. The expected tipping point is when some nodes become over-utilized and it becomes worthwhile to pay the inter-node communication fee to transfer work to under-utilized nodes. This tipping point, however, was not reached even at 1024 cores with 32 cores per node, which was the largest machine configuration supported by the simulator.

When assessing the scalability of the system under different conditions, the most significant factor affecting the efficiency of scaling was the coarseness of the application tasks. When keeping the total number of tasks to execute fixed and varying the duration of each task from 0.14 ms to 2.2 ms, the speedup with additional cores greatly improved. Efficient scaling is expected to continue to larger numbers of cores when running more practical applications than those tested at the time of writing.

Furthermore, the proposed operating system is ready to be extended for use with different architectures and parallelization paradigms for future many-core systems. To name a few examples, the system is ready to accept different task-placement algorithms, fault detection units, inter-node communication channels, and even task execution models other than dataflow.

9 Appendix

9.1 Experiment Output of 512 Cores

The following Listing 9.1 shows the dataflow engine output from a single node in a 512-core experiment. A small test application was used in order to keep the presented output short; only 4 threads were executed on the monitored node. Note that the experiments that were launched for evaluation purposes had logging disabled, to prevent it from shadowing the performance results.

Listing 9.1: Node trace output on a 512-core experiment

[Manager 1] System parameters: [Manager 1] 512 cores in 16 nodes with 32 cores each. [Manager 1] Shared memory mapped to0x7fff97ffd000. [Manager 1] 1536MB public shared memory, 96MB per node. [Manager 1] 16*2MB message queues, leaves 64MB for dynamic allocation. [Subscriber 1] Node1 shared region at offset 0, size 100663296: [Subscriber 1] Loopback input queue info at frame offset 0, output queue info at frame offset 24, OWM offset at0 for both. [Subscriber 1] Dynamic allocation at frame offset 48, OWM offset 33555712. [Subscriber 1] Node2 shared region at offset 100663296, size 100663296: [Subscriber 1] Input queue info at frame offset 56, OWM offset 2097232. [Subscriber 1] Output queue info at frame offset 80, OWM offset 100663296. [Subscriber 1] Dynamic allocation at frame offset 104, OWM offset 134219008. [Subscriber 1] Node3 shared region at offset 201326592, size 100663296: [Subscriber 1] Input queue info at frame offset 112, OWM offset 4194464. [Subscriber 1] Output queue info at frame offset 136, OWM offset 201326592. [Subscriber 1] Dynamic allocation at frame offset 160, OWM offset 234882304. ...subscribing to more message queues and public arenas of the other nodes... [Subscriber 1] Node 15 shared region at offset 1409286144, size 100663296: [Subscriber 1] Input queue info at frame offset 784, OWM offset 29361248. [Subscriber 1] Output queue info at frame offset 808, OWM offset 1409286144. [Subscriber 1] Dynamic allocation at frame offset 832, OWM offset 1442841856. [Subscriber 1] Node 16 shared region at offset 1509949440, size 100663296: [Subscriber 1] Input queue info at frame offset 840, OWM offset 31458480. [Subscriber 1] Output queue info at frame offset 864, OWM offset 1509949440. [Subscriber 1] Dynamic allocation at frame offset 888, OWM offset 1543505152. [Manager 1] Starting service thread, ip0x426960. [Manager 1] Starting service thread, ip0x402c80. [Manager 1] Starting service thread, ip0x40c380. [Node 1->1 Writer] Sending message type 3, 40 bytes in1 frames. [Node 1->2 Writer] Sending message type 3, 40 bytes in1 frames. ...more sent heartbeats... [Node 1->16 Writer] Sending message type 3, 40 bytes in1 frames. [Fib Test] Computing fibonacci(4). [Scheduler 1] Starting message pump. [Node 1->1] Received message type 3, 40 bytes in1 frames.

119 Technion - Computer Science Department - M.Sc. Thesis MSC-2018-28 - 2018 [Node 1->1] Got heartbeat- has0 pending tasks. [Node 2->1] Received message type 3, 40 bytes in1 frames. [Node 2->1] Got heartbeat- has0 pending tasks. ...more received heartbeats... [Node 16->1] Received message type 3, 40 bytes in1 frames. [Node 16->1] Got heartbeat- has0 pending tasks. [Scheduler 1] Submitting task fib_reporter_task with UFI 10010000000200. [Node 1->1 Writer] Sending message type 1, 73 bytes in2 frames. [Scheduler 1] Submitting task fib_main_task with UFI 10010000000400. [Node 1->1 Writer] Sending message type 1, 77 bytes in2 frames. [Scheduler 1] Finalizing 0: Write value mapped VFP0 to UFI 10010000000200. [Scheduler 1] Finalizing 0: Write target mapped VFP 200 to UFI 10010000000400. [Scheduler 1] Submitting write to node 1, tloc 10010000000400. [Node 1->1 Writer] Sending message type 4, 24 bytes in1 frames. [Scheduler 1] Finalizing 0: Write target mapped VFP 200 to UFI 10010000000400. [Scheduler 1] Submitting write to node 1, tloc 10010000000401. [Node 1->1 Writer] Sending message type 4, 24 bytes in1 frames. [Node 1->1] Received message type 1, 73 bytes in2 frames. [Scheduler 1] Got task load message for UFI 10010000000200, binary size 17, frame size 8, 0 OWM regions, sc 1. [Scheduler 1] Creating task desc for UFI 10010000000200@0x7fff99ffe150. [Scheduler 1] Adding new task from desc0x7fff99ffe150. [Task 10010000000200] Creating task from desc0x7fff99ffe150, UFI 10010000000200, original sc 1, current sc 1. [BinariesStore] New binary requested: fib_reporter_task. [BinariesStore] Adding task binary: fib_reporter_task, 120 bytes. [Node 1->1] Received message type 1, 77 bytes in2 frames. [Scheduler 1] Got task load message for UFI 10010000000400, binary size 13, frame size 16, 0 OWM regions, sc 2. [Scheduler 1] Creating task desc for UFI 10010000000400@0x7fff99ffe1d0. [Scheduler 1] Adding new task from desc0x7fff99ffe1d0. [Task 10010000000400] Creating task from desc0x7fff99ffe1d0, UFI 10010000000400, original sc 2, current sc 2. [BinariesStore] New binary requested: fib_main_task. [BinariesStore] Adding task binary: fib_main_task, 1617 bytes. [Node 1->1] Received message type 4, 24 bytes in1 frames. [Scheduler 1] Got thread write message, tloc 10010000000400, value0x10010000000200. [Node 1->1] Received message type 4, 24 bytes in1 frames. [Scheduler 1] Got thread write message, tloc 10010000000401, value0x4. [Task 10010000000400] Ready. [Task 10010000000400] Started. [Task 10010000000400] [tid 2200000000] fib main forn=4 - spawning. [Task 10010000000400] Ended. [Scheduler 1] Submitting task fib_main_task with UFI 10020000000600. [Node 1->2 Writer] Sending message type 1, 77 bytes in2 frames. [Scheduler 1] Submitting task fib_main_task with UFI 10030000000800. [Node 1->3 Writer] Sending message type 1, 77 bytes in2 frames. [Scheduler 1] Submitting task fib_adder_task with UFI 10040000000a00. [Node 1->4 Writer] Sending message type 1, 86 bytes in2 frames. [Scheduler 1] Finalizing 10010000000400: Write target mapped VFP0 to UFI 10020000000600. [Scheduler 1] Submitting write to node 2, tloc 10020000000601. [Node 1->2 Writer] Sending message type 4, 24 bytes in1 frames. [Scheduler 1] Finalizing 10010000000400: Write target mapped VFP 200 to UFI 10030000000800. [Scheduler 1] Submitting write to node 3, tloc 10030000000801. [Node 1->3 Writer] Sending message type 4, 24 bytes in1 frames. [Scheduler 1] Finalizing 10010000000400: Write target mapped VFP 400 to UFI 10040000000a00. 
[Scheduler 1] Submitting write to node 4, tloc 10040000000a00. [Node 1->4 Writer] Sending message type 4, 24 bytes in1 frames. [Scheduler 1] Finalizing 10010000000400: Write value mapped VFP 400 to UFI 10040000000a00.

120 Chapter 9 Appendix Technion - Computer Science Department - M.Sc. Thesis MSC-2018-28 - 2018 [Scheduler 1] Finalizing 10010000000400: Write target mapped VFP0 to UFI 10020000000600. [Scheduler 1] Submitting write to node 2, tloc 10020000000600. [Node 1->2 Writer] Sending message type 4, 24 bytes in1 frames. [Scheduler 1] Finalizing 10010000000400: Write value mapped VFP 400 to UFI 10040000000a00. [Scheduler 1] Finalizing 10010000000400: Write target mapped VFP 200 to UFI 10030000000800. [Scheduler 1] Submitting write to node 3, tloc 10030000000800. [Node 1->3 Writer] Sending message type 4, 24 bytes in1 frames. [Node 2->1] Received message type 1, 77 bytes in2 frames. [Scheduler 1] Got task load message for UFI 20010000000200, binary size 13, frame size 16, 0 OWM regions, sc 2. [Scheduler 1] Creating task desc for UFI 20010000000200@0x7fff99ffe1d0. [Scheduler 1] Adding new task from desc0x7fff99ffe1d0. [Task 20010000000200] Creating task from desc0x7fff99ffe1d0, UFI 20010000000200, original sc 2, current sc 2. [Node 2->1] Received message type 4, 24 bytes in1 frames. [Scheduler 1] Got thread write message, tloc 20010000000201, value0x2. [Node 2->1] Received message type 4, 24 bytes in1 frames. [Scheduler 1] Got thread write message, tloc 20010000000200, value0x20040000000601. [Task 20010000000200] Ready. [Task 20010000000200] Started. [Task 20010000000200] [tid 2500000000] fib main forn=2 - calculating. [Task 20010000000200] Ended. [Scheduler 1] Submitting write to node 4, tloc 20040000000601. [Node 1->4 Writer] Sending message type 4, 24 bytes in1 frames. [Node 4->1] Received message type 4, 24 bytes in1 frames. [Scheduler 1] Got thread write message, tloc 10010000000200, value0x3. [Task 10010000000200] Ready. [Task 10010000000200] Started. [Task 10010000000200] [tid 2900000000] Reporter task: fib result is 3. [Task 10010000000200] [tid 2900000000] Exit requested. [Task 10010000000200] Ended. [Scheduler 1] Sending termination requests... [Node 1->2 Writer] Sending message type 5, 8 bytes in1 frames. [Node 1->3 Writer] Sending message type 5, 8 bytes in1 frames. ...more termination requests... [Node 1->16 Writer] Sending message type 5, 8 bytes in1 frames. [Scheduler 1] Exiting. [Scheduler 1] Task descriptors map in shared memory has0 items. [Scheduler 1] Tasks map in local memory has0 items: 0 initializing 0 waiting for inputs 0 ready 0 running 0 finished 4 total completed

References

[1] Semiconductor Industry Association, The International Technology Roadmap for Semiconductors 2.0, itrs2.net, 2015.

[2] S. Borkar, “Thousand Core Chips: A Technology Perspective”, in Proceedings of the 44th Annual Design Automation Conference, San Diego, California: ACM, 2007, pp. 746–749. DOI: 10.1145/1278480. 1278667.

[3] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, “Piranha: a scalable architecture based on single-chip multiprocessing”, in Pro- ceedings of 27th International Symposium on Computer Architecture, Jun. 2000, pp. 282–293.

[4] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The impact of technology scaling on lifetime reliability”, in International Conference on Dependable Systems and Networks, Jun. 2004, pp. 177–186. DOI: 10.1109/DSN.2004.1311888.

[5] F. Cappello, “Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities”, The International Journal of High Performance Computing Applications, vol. 23, pp. 212–226, 2009. DOI: 10.1177/1094342009106189.

[6] P. Chaparro, J. Gonzalez, and A. Gonzalez, “Thermal-aware clustered microarchitectures”, in IEEE International Conference on Computer Design: VLSI in Computers and Processors, 2004. ICCD 2004. Proceed- ings., Oct. 2004, pp. 48–53. DOI: 10.1109/ICCD.2004.1347897.

[7] M. Hertl, D. Weidmann, and A. Ngai, “An advanced reliability im- provement and failure analysis approach to thermal stress issues in IC packages”, in 2009 16th IEEE International Symposium on the Physical and Failure Analysis of Integrated Circuits, Jul. 2009, pp. 28–32. DOI: 10.1109/IPFA.2009.5232706.

[8] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, “Scale-out processors”, in 2012 39th Annual International Symposium on Computer Architecture (ISCA), Jun. 2012, pp. 500–511. DOI: 10.1109/ISCA.2012.6237043.

[9] R. Kumar, T. G. Mattson, G. Pokam, and R. Van Der Wijngaart, “The Case for Message Passing on Many-Core Chips”, in Multiprocessor System-on-Chip: Hardware Design and Tool Integration, M. Hübner and J. Becker, Eds. New York, NY: Springer New York, 2011, pp. 115– 123. DOI: 10.1007/978-1-4419-6460-1_5.

[10] A. Fuchs, Fault-Tolerant Operating System for Many-core Processors, sourceforge.net/p/cotson/code/HEAD/tree/branches/tflux- test/tfos.

[11] A. Fuchs, A Distributed OS for Single-chip Many-core Processors, Uni- versity of Siena (seminar), Oct. 2013.

[12] R. Giorgi, S. Mazumdar, A. Scionti, L. Morin, P. Faraboschi, L. Feng, A. Cohen, A. Fuchs, Y. Weinsberg, S. Girbal, S. Weis, T. Ungerer, S. Zuckerman, J. Arteaga, G. Gao, S. Evripidou, G. Matheou, P. Trancoso, B. Khan, M. Lujan, L. Vilanova, N. Navarro, R. Badia, and M. Valero, COTSon User Guide, sourceforge.net/p/cotson/code/HEAD/tree/ trunk/doc/COTSON_USER_GUIDE-v4.pdf?format=raw, Oct. 2014.

[13] S. Weis, A. Garbade, T. Ungerer, A. Fuchs, and Y. Weinsberg, System Integration Analysis, Measurement and Tuning of the Reliability System, TERAFLUX D5.4 (technical report), May 2014.

[14] A. Scionti, H. Kifle, S. Mazumdar, R. Giorgi, N. Navarro, R. Badia, M. Valero, S. Weis, T. Ungerer, G. Mathou, T. Trancoso, P. Evripidou, A. Fuchs, Y. Weinsberg, P. Faraboschi, B. Khan, I. Watson, F. Li, A. Cohen, S. Zuckermand, J. Arteaga, G. Gao, L. Morin, and S. Girbal, Final Report and Documentation, TERAFLUX D7.5 (technical report), May 2014.

[15] R. Giorgi, R. M. Badia, F. Bodin, A. Cohen, P. Evripidou, P. Fara- boschi, B. Fechner, G. R. Gao, A. Garbade, R. Gayatri, S. Girbal, D. Goodman, B. Khan, S. Koliaï, J. Landwehr, N. M. Lê, F. Li, M. Lujàn, A. Mendelson, L. Morin, N. Navarro, T. Patejko, A. Pop, P. Tran- coso, T. Ungerer, I. Watson, S. Weis, S. Zuckerman, and M. Valero, “TERAFLUX: Harnessing dataflow in next generation teradevices”, Microprocessors and Microsystems, vol. 38, pp. 976–990, 2014. DOI: 10.1016/j.micpro.2014.04.001.

124 References Technion - Computer Science Department - M.Sc. Thesis MSC-2018-28 - 2018 [16] The TERAFLUX project, teraflux.eu.

[17] M. Giampapa, T. Gooding, T. Inglett, and R. W. Wisniewski, “Expe- riences with a Lightweight Supercomputer Kernel: Lessons Learned from Blue Gene’s CNK”, in Proceedings of the 2010 ACM/IEEE In- ternational Conference for High Performance Computing, Networking, Storage and Analysis, Washington, DC, USA: IEEE Computer Society, 2010, pp. 1–10. DOI: 10.1109/SC.2010.22.

[18] J. Chapin, M. Rosenblum, S. Devine, T. Lahiri, D. Teodosiu, and A. Gupta, “Hive: Fault Containment for Shared-memory Multiproces- sors”, SIGOPS Oper. Syst. Rev., vol. 29, pp. 12–25, Dec. 1995. DOI: 10.1145/224057.224059.

[19] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, “The Multikernel: A New OS Architecture for Scalable Multicore Systems”, in Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles, Big Sky, Montana, USA: ACM, 2009, pp. 29–44. DOI: 10.1145/1629575. 1629579.

[20] E. B. Nightingale, O. Hodson, R. McIlroy, C. Hawblitzel, and G. Hunt, “Helios: Heterogeneous Multiprocessing with Satellite Kernels”, in Proceedings of the 22nd Symposium on Operating Systems Principles (SOSP ’09), Big Sky, MT: Association for Computing Machinery, Inc., Oct. 2009. DOI: 10.1145/1629575.1629597.

[21] D. Wentzlaff and A. Agarwal, “Factored Operating Systems (Fos): The Case for a Scalable Operating System for Multicores”, SIGOPS Oper. Syst. Rev., vol. 43, pp. 76–85, Apr. 2009. DOI: 10.1145/1531793. 1531805.

[22] X. Zhou, H. Chen, S. Luo, Y. Gao, S. Yan, W. Liu, B. Lewis, and B. Saha, “A Case for Software Managed Coherence in Many-core Processors”, Jun. 2010.

[23] T. Prescher, R. Rotta, and J. Nolte, “Flexible Sharing and Replication Mechanisms for Hybrid Memory Architectures”, in Proceedings of the 4th Many-Core Applications Research Community (MARC) Symposium, 2012, pp. 67–72.

[24] Y.-K. Chen and S. Kung, “Multimedia Signal Processors: An Architec- tural Platform with Algorithmic Compilation”, Journal of VLSI signal processing systems for signal, image and video technology, vol. 20, pp. 181–204, Oct. 1998. DOI: 10.1023/A:1008082726678.

[25] B. A. Nayfeh and K. Olukotun, “A single-chip multiprocessor”, Computer, vol. 30, pp. 79–85, Sep. 1997. DOI: 10.1109/2.612253.

[26] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous mul- tithreading: Maximizing on-chip parallelism”, in Proceedings 22nd Annual International Symposium on Computer Architecture, Jun. 1995, pp. 392–403.

[27] K. Olukotun, L. Hammond, and J. Laudon, Chip Multiprocessor Ar- chitecture: Techniques to Improve Throughput and Latency. Morgan & Claypool, 2007. DOI: 10.2200/S00093ED1V01Y200707CAC003.

[28] M. Berekovic, P. Pirsch, and J. Kneip, “An Algorithm-Hardware- System Approach to VLIW Multimedia Processors”, Journal of VLSI signal processing systems for signal, image and video technology, vol. 20, pp. 163–180, Oct. 1998. DOI: 10.1023/A:1008030709840.

[29] D. Buttlar, J. Farrell, and B. Nichols, Pthreads programming: A POSIX standard for better multiprocessing. O’Reilly Media, Inc., 1996.

[30] MPI: A Message-passing Interface Standard, Version 3.1 ; June 4, 2015, mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf, 2015.

[31] W. Gropp and E. Lusk, “Fault Tolerance in Message Passing Inter- face Programs”, The International Journal of High Performance Com- puting Applications, vol. 18, pp. 363–372, 2004. DOI: 10 . 1177 / 1094342004046045.

[32] F. Cappello, G. Al, W. Gropp, S. Kale, B. Kramer, and M. Snir, “Toward Exascale Resilience: 2014 Update”, Supercomput. Front. Innov.: Int. J., vol. 1, pp. 5–28, Apr. 2014. DOI: 10.14529/jsfi140101.

[33] J. Dongarra, T. Herault, and Y. Robert, Fault tolerance techniques for high-performance computing, 2015.

[34] Z. Chen, “Scalable Techniques for Fault Tolerant High Performance Computing”, PhD thesis, Knoxville, TN, USA, 2006, ISBN: 978-0-542- 63859-6.

[35] M. Herlihy and J. E. B. Moss, “Transactional Memory: Architectural Support for Lock-free Data Structures”, SIGARCH Comput. Archit. News, vol. 21, pp. 289–300, May 1993. DOI: 10 . 1145 / 173682 . 165164.

[36] T. Harris, J. Larus, and R. Rajwar, Transactional Memory, 2nd Edi- tion, 2nd. Morgan and Claypool Publishers, 2010. DOI: 10.2200/ s00272ed1v01y201006cac011.

[37] J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift, and D. A. Wood, “Performance Pathologies in Hardware Transactional Memory”, in Proceedings of the 34th Annual International Symposium on Computer Architecture, San Diego, California, USA: ACM, 2007, pp. 81–91. DOI: 10.1145/1250662.1250674.

[38] T. Nakaike, R. Odaira, M. Gaudet, M. M. Michael, and H. Tomari, “Quantitative Comparison of Hardware Transactional Memory for Blue Gene/Q, zEnterprise EC12, Intel Core, and POWER8”, SIGARCH Comput. Archit. News, vol. 43, pp. 144–157, Jun. 2015. DOI: 10. 1145/2872887.2750403.

[39] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar, “Performance Evaluation of Intel® Transactional Synchronization Extensions for High-performance Computing”, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Denver, Colorado: ACM, 2013, 19:1–19:11. DOI: 10.1145/2503210.2503232.

[40] R. M. Yoo, Y. Ni, A. Welc, B. Saha, A.-R. Adl-Tabatabai, and H.-H. S. Lee, “Kicking the Tires of Software Transactional Memory: Why the Going Gets Tough”, in Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures, Munich, Germany: ACM, 2008, pp. 265–274. DOI: 10.1145/1378533.1378582.

[41] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark Silicon and the End of Multicore Scaling”, in Proceedings of the 38th Annual International Symposium on Computer Architecture, San Jose, California, USA: ACM, 2011, pp. 365–376. DOI: 10.1145/ 2000064.2000108.

[42] A. Baumann, P. Barham, T. Harris, and R. Isaacs, “Embracing diversity in the Barrelfish manycore operating system”, in Proceedings of the Workshop on Managed Many-Core Systems, Association for Computing Machinery, Inc., Jun. 2008.

[43] P. Sewell, S. Sarkar, S. Owens, F. Z. Nardelli, and M. O. Myreen, “X86-TSO: A Rigorous and Usable Programmer’s Model for x86 Mul- tiprocessors”, Commun. ACM, vol. 53, pp. 89–97, Jul. 2010. DOI: 10.1145/1785414.1785443.

[44] F. Z. Nardelli, P. Sewell, J. Sevcik, S. Sarkar, S. Owens, L. Maranget, M. Batty, and J. Alglave, “Relaxed Memory Models Must Be Rigorous”, Position paper, Jun. 2009.

[45] Intel 64 and IA-32 Architectures Software Developer’s Manual, software.intel.com/en-us/articles/intel-sdm, Vol-3, Intel Corporation, Mar. 2017.

[46] AMD64 Architecture Programmer’s Manual, developer . amd . com / resources/developer-guides-manuals/, Volume 2, Advanced Mi- cro Devices, Mar. 2017.

[47] L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, IEEE Transactions on Computers, vol. C-28, pp. 690–691, Sep. 1979. DOI: 10.1109/TC.1979.1675439.

[48] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, “Memory Consistency and Event Ordering in Scalable Shared-memory Multiprocessors”, SIGARCH Comput. Archit. News, vol. 18, pp. 15–26, May 1990. DOI: 10.1145/325096.325102.

[49] R. Kumar, V. Zyuban, and D. M. Tullsen, “Interconnections in multi- core architectures: understanding mechanisms, overheads and scal- ing”, in 32nd International Symposium on Computer Architecture (ISCA’05), Jun. 2005, pp. 408–419. DOI: 10.1109/ISCA.2005.34.

[50] J. C. Laprie, “Dependable Computing and Fault Tolerance: Con- cepts and Terminology”, in Proceedings of the 15th IEEE International Symposium on Fault-Tolerant Computing, Jun. 1995, pp. 2–. DOI: 10.1109/FTCSH.1995.532603.

[51] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, “Basic con- cepts and taxonomy of dependable and secure computing”, IEEE Transactions on Dependable and Secure Computing, vol. 1, pp. 11–33, Jan. 2004. DOI: 10.1109/TDSC.2004.2.

[52] R. Kim, U. E. Avci, and I. A. Young, “CMOS performance bench- marking of Si, InAs, GaAs, and Ge nanowire n- and pMOSFETs with Lg=13nm based on atomistic quantum transport simulation including strain effects”, in 2015 IEEE International Electron Devices Meeting (IEDM), IEEE, Dec. 2015, pp. 34.1.1–34.1.4. DOI: 10.1109/iedm. 2015.7409824.

[53] S. Borkar, “Designing reliable systems from unreliable components: the challenges of transistor variability and degradation”, IEEE Micro, vol. 25, pp. 10–16, Nov. 2005. DOI: 10.1109/MM.2005.110.

[54] R. K. Iyer, N. M. Nakka, Z. T. Kalbarczyk, and S. Mitra, “Recent Advances and New Avenues in Hardware-Level Reliability Support”, IEEE Micro, vol. 25, pp. 18–29, Nov. 2005. DOI: 10.1109/MM.2005.119.

[55] S. Borkar, “Design challenges of technology scaling”, IEEE Micro, vol. 19, pp. 23–29, Jul. 1999. DOI: 10.1109/40.782564.

[56] J. B. Dennis, “First version of a data flow procedure language”, in Programming Symposium: Proceedings, Colloque sur la Programmation Paris, April 9–11, 1974, B. Robinet, Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1974, pp. 362–376. DOI: 10.1007/3-540-06859- 7_145.

[57] F. Li, A. Pop, and A. Cohen, “Automatic Extraction of Coarse-Grained Data-Flow Threads from Imperative Programs”, IEEE Micro, vol. 32, pp. 19–31, Jul. 2012. DOI: 10.1109/MM.2012.49.

[58] M. Kong, A. Pop, L.-N. Pouchet, R. Govindarajan, A. Cohen, and P. Sadayappan, “Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs”, ACM Trans. Archit. Code Optim., vol. 11, 61:1–61:30, Jan. 2015. DOI: 10.1145/2687652.

[59] R. Giorgi and A. Scionti, “A Scalable Thread Scheduling Co-processor Based on Data-flow Principles”, Future Gener. Comput. Syst., vol. 53, pp. 100–108, Dec. 2015. DOI: 10.1016/j.future.2014.12.014.

[60] E. Argollo, A. Falcón, P. Faraboschi, M. Monchiero, and D. Ortega, “COTSon: Infrastructure for Full System Simulation”, SIGOPS Oper. Syst. Rev., vol. 43, pp. 52–61, Jan. 2009. DOI: 10.1145/1496909. 1496921.

[61] I. Lee, M. Basoglu, M. Sullivan, D. H. Yoon, L. Kaplan, and M. Erez, “Survey of Error and Fault Detection Mechanisms”, Locality, Parallelism and Hierarchy Group, May 2011.

[62] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, “Detailed design and evaluation of redundant multi-threading alternatives”, in Proceedings 29th Annual International Symposium on Computer Architecture, 2002, pp. 99–110. DOI: 10.1109/ISCA.2002.1003566.

[63] S. Weis, A. Garbade, J. Wolf, B. Fechner, A. Mendelson, R. Giorgi, and T. Ungerer, “A Fault Detection and Recovery Architecture for a Teradevice Dataflow System”, in First Workshop on Data-Flow Execution Models for Extreme Scale Computing, Oct. 2011, pp. 38–44. DOI: 10.1109/DFM.2011.9.

[64] S. Weis, A. Garbade, B. Fechner, A. Mendelson, R. Giorgi, and T. Ungerer, “Architectural Support for Fault Tolerance in a Teradevice Dataflow System”, International Journal of Parallel Programming, vol. 44, pp. 208–232, Apr. 2016. DOI: 10.1007/s10766-014-0312-y.

[65] K. M. Kavi, R. Giorgi, and J. Arul, “Scheduled dataflow: execution paradigm, architecture, and performance evaluation”, IEEE Transactions on Computers, vol. 50, pp. 834–846, Aug. 2001. DOI: 10.1109/12.947003.

[66] S. Schlingmann, A. Garbade, S. Weis, and T. Ungerer, “Connectivity-Sensitive Algorithm for Task Placement on a Many-Core Considering Faulty Regions”, in 2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, Feb. 2011, pp. 417–422. DOI: 10.1109/PDP.2011.58.

[67] C. Connelly and C. S. Ellis, “Scheduling to reduce memory coherence overhead on coarse-grain multiprocessors”, in Job Scheduling Strategies for Parallel Processing. Berlin, Heidelberg: Springer Berlin Heidelberg, 1995, pp. 70–91. DOI: 10.1007/3-540-60153-8_23.

[68] H. Menon, N. Jain, G. Zheng, and L. Kalé, “Automated Load Balancing Invocation Based on Application Characteristics”, in 2012 IEEE International Conference on Cluster Computing, Sep. 2012, pp. 373–381. DOI: 10.1109/CLUSTER.2012.61.

[69] J. Donald and M. Martonosi, “Techniques for Multicore Thermal Management: Classification and New Exploration”, in 33rd International Symposium on Computer Architecture (ISCA’06), 2006, pp. 78–88. DOI: 10.1109/ISCA.2006.39.

[70] A. Portero, Z. Yu, and R. Giorgi, “T-Star (T*): An x86-64 ISA Extension to support thread execution on many cores”, in HiPEAC ACACES, K. De Bosschere, Ed., Jul. 2011, pp. 277–280.

[71] COTSon platform, cotson.sourceforge.net.

[72] R. Bedichek, “SimNow: Fast platform simulation purely in software”, in Hot Chips, vol. 16, Aug. 2004, pp. 48–60.

[73] F. Li, “Compiling for a multithreaded dataflow architecture: algorithms, tools, and experience”, Thesis, Université Pierre et Marie Curie - Paris VI, May 2014.

Abstract

The number of processing units (i.e., cores) in processors continues to grow significantly. Even today a single processor may contain tens or even hundreds of cores, and in the future the core count is expected to reach thousands or even tens of thousands on a single chip. This research focuses on many-core processors, as opposed to multi-core processors with a modest number of cores. Modern operating systems, programming models, and the applications themselves already struggle to exploit efficiently the parallelism available in today's systems, so it is especially important to address this challenge for future systems with far greater parallelism.

The dramatic rise in processor computing power has further implications. Packing so many processing units into a single processor requires a continued shrinking of transistor feature sizes, which increases the processor's power density and significantly reduces its reliability. Smaller dimensions amplify the impact of external interference, manufacturing defects, process variation, and the damage that accumulates in the tiny structures due to mechanical stress, thermal expansion and contraction, electromigration, and other material-aging effects. A significant increase in the probability of transient faults and a shorter lifetime of the nanoelectronics are therefore expected; more broadly, the likelihood of various kinds of processor errors occurring during its service life will grow. A basic assumption of this research is therefore that cores may occasionally produce erroneous computations and may even suffer permanent failures and stop functioning entirely.

These probabilities of transient or permanent core faults are neglected by developers and by modern operating systems. In today's mainstream operating systems, the entire runtime environment and the user applications running in it will fail almost immediately if a core suddenly stops working or produces a wrong answer in some critical computation. Although many dependable systems exist, their reliability is usually maintained at the system level rather than at the chip level, and this is about to change.

This work proposes a design for a distributed operating system that allows programs to exploit efficiently the enormous parallelism that will be available in future processors, while also coping with the growing likelihood of hardware faults that such processors will regularly be prone to, so that user programs, or more precisely their developers, can ignore their existence entirely. To the best of my knowledge, the proposed operating system is the first that survives the failure of one or more of the cores in the system and automatically resumes the programs that were running on the failed cores.

The proposed system was implemented as a prototype and evaluated on a full-system simulator that models all the hardware of a machine with thousands of cores, its memory hierarchy and semantics, and the timing of its components.

The goal of the research is to develop a distributed and reliable operating system for many-core processors containing thousands or tens of thousands of independent computation units and a shared memory with no guarantee of global consistency. The system design allows applications to exploit the enormous parallelism of such a system efficiently while remaining fault tolerant under the assumption that the computation units may err and even fail completely.

Many systems for managing distributed computers with very large numbers of independent, general-purpose cores already exist; these are known as supercomputers. Such systems combine a large number of machines, each with private hardware such as a modest number of cores, private memory, and a means of communicating with the others, usually a fast network. The distinguishing feature of this work is that the presented system is designed to manage computation within a single package containing all of the thousands or tens of thousands of cores, that is, on a single chip. In addition, the operating system is resilient to occasional hardware faults, transient or permanent, in the processing units, and handles them automatically, without the help of dedicated hardware or user code. The mechanisms developed to maintain fault tolerance ensure that all programs, including those running on cores at the moment they failed, continue to run on the remaining resources without ever being aware that a fault occurred.

The proposed system also offers a way to address a well-known problem in the shared-memory architectures of today's mainstream processors, one that severely limits scaling toward efficient support of a large number of cores. This scalability limit stems from the strict guarantees that the memory hardware must provide to the software layers when several cores operate on a shared data structure. The system developed here significantly weakens the traditional requirements imposed by the memory semantics: the assumption that the hardware provides global coherence over the shared memory is dropped, and the operating system instead takes responsibility for consistency and enforces it explicitly using machine instructions.

The research methodology included choosing the working assumptions about many-core processors and several variations of them, designing software mechanisms for an operating system able to meet the challenges of such processors, developing and implementing the proposed mechanisms, and testing them on a full-system simulator. Several assumptions about the architecture and characteristics of the future hardware targeted by the operating system were examined and selected. Sets of goals were then chosen for a new operating system that appeared able to address the difficulties such future systems are likely to face, under a particular variation of the assumptions. The process of forming the base assumptions, and with them the goals of this work, was iterative: the space of assumptions was gradually narrowed, and the objectives of the work focused accordingly. Before the specific solution presented throughout this thesis was chosen, several possible operating system architectures were examined. The few architectures examined in depth emphasize very different aspects of the chosen computer system, aspects for which no adequate answer was found in existing solutions. To evaluate the most promising architectures, prototypes were developed and put to the test on a simulator of a many-core machine. The two architectures examined at this stage differ fundamentally from each other for three reasons: 1) different semantics of the shared memory; 2) different assumptions about the possible kinds of faults and the resilience requirements; 3) a different programming model exposed to programs for executing tasks.

This work assumes that future processors will be many-core, that is, will contain a very large number of independent processing units. To support the distribution of hardware within the processor, the cores are grouped into nodes, each with a modest number of cores. It is further assumed that each node resembles, in structure, a processor in today's common computers. Nodes have private hardware resources such as main memory, caches, and additional controllers that manage communication channels to the other nodes on the processor, to the shared memory, and so on. Nodes need not be homogeneous; a processor may contain different kinds of nodes tailored to different purposes, such as graphics, storage control, or pure computation. Each node is independent, much like processors today, and runs its own piece of the global operating system kernel, which functions as part of the distributed operating system developed here. The work assumes that the system contains several kinds of memory that are not accessible symmetrically from all cores and do not exhibit identical semantics and latencies from every location. Moreover, not all cores reside in the same coherence domain: every core can access the entire shared address space directly, but the hardware does not guarantee that a memory update becomes visible to all processors without an explicit software operation. This assumption relaxes the requirements on the memory and cache hardware, since the strong memory-consistency guarantees of today's computers severely limit efficient scaling to many-core processors.

The products of this work are a fault-tolerant operating system for many-core processors, with thousands or tens of thousands of cores and a shared memory without consistency guarantees, and a prototype implementation of the system that was tested on a simulator. The operating system dynamically distributes user programs across all the cores in the system and explicitly enforces consistency on the shared memory, since the traditional requirements for strong coherence prevent efficient scaling. The system is fully decentralized: it consists of independent agents on every node of cores and does not depend on any central data structure. The presented system handles processor faults and failures on two levels:
1. The system as a whole keeps working after a transient or permanent fault in one or more cores. The distributed operating system can cope with a permanent core failure and continue working with the cores that remain.
2. All running programs, in particular those running on cores at the moment they failed, keep running after the failure without needing to be aware that anything happened. The operating system takes responsibility for their reliability.
There is a fundamental difference between these two levels. To achieve only the first and give up the second, little needs to be assumed about the applications running in the system, and nothing more needs to be demanded of them than existing operating systems demand. However, one must then expect applications to be terminated abruptly when a core fails, at least those that were running on the failed core. This approach is common in today's dependable systems. Such a concession would be acceptable in many systems if the probability of failure were low and the programs took responsibility for their own recovery after a crash, since it would be impractical for the operating system to do so. In this work, by contrast, the assumption is that in the near future hardware faults will become first-class citizens of every modern system and it will no longer be possible to ignore them. The presented operating system was therefore designed from the ground up to handle faults quickly and efficiently, and to minimize the costs of maintaining reliability and of returning to a correct state after a failure. Unlike the methods accepted in today's large systems, this requirement also covers the programs running in the system, so they need not deal with faults themselves.

To meet the goals of distribution, reliability, and efficient execution of parallel programs on thousands of fault-prone cores, the work focuses on supporting a programming model in which a program's execution can be split into small elements that execute independently and depend on one another only through the data they receive, a model based on dataflow principles. The distributed execution engine and application runtime environment that were developed can run and distribute any code, but they specialize in running parallel programs built on dataflow principles, and the fault-tolerance guarantees hold only for this model. When programs use the services the system offers, the runtime environment continuously and efficiently secures their execution state and can keep running them safely after a fault at any stage, without requiring user intervention.

The results of this work were evaluated on a simulator of a many-core machine, adapted during the work to satisfy the research assumptions. Because the goal of the experiments was to examine the behavior and performance of the developed operating system under a variety of conditions and assumptions, a complete computer system with a very large number of cores had to be simulated in detail. Simulation is essential because such hardware does not yet exist, although it is expected to become feasible. To this end, the experiments are based on a collection of virtual machines (SimNow) capable of full-system simulation of today's computers. The research used an existing tool that manages the joint execution of the simulators, controls the timing of machine instructions, and interleaves their execution in a way that approximates the parallel execution of thousands of cores. The simulator also allows internal layers to be added and modified, which proved to be a very powerful tool in developing and evaluating the operating system. This capability was used to implement the semantics of operations on the shared-memory hierarchy, various assumptions about it, and the behavior of new machine instructions that access it. Further changes to the simulator enabled injecting errors into the system and sudden core failures according to the chosen fault model. In addition, parts of the simulator were extended to monitor, in simulated real time, the behavior of the developed operating system and to collect detailed information about its operation, without the monitoring affecting the behavior or performance of the simulated world at all. Data were collected from a large number of experiments under different system conditions and hardware architectures, and tools were written to analyze the data and extract meaningful results. These tools made it possible to examine various aspects of the operating system's behavior and efficiency during development, and to collect the results presented in this thesis.

The goals of the experiments, and the measurements collected from the simulator runs, include, among others:

Testing the simulator changes: the ability to simulate a large number of cores under different architectural configurations, such as the partitioning into nodes; the implementation of the shared-memory semantics; verification of the new machine instructions and their timing; and real-time collection of information about executed instructions and the operating system state.

Verifying the behavior of the developed mechanisms: initialization of the structures in shared memory; channels for transferring data through shared memory; general message queues that require no synchronization between nodes; the distributed execution engine that manages dataflow and schedules tasks according to the dataflow model; task-placement and load-balancing decisions across nodes; dynamic management of objects created in shared memory; and correct results from user programs.

Verifying resilience to transient core faults: injecting various errors into the results of user computations and verifying that the system detected them, recovered, and continued executing the tasks correctly.

Verifying resilience to permanent core faults: freezing cores at different stages of a run and verifying that the system detected the failure, recovered, and continued execution on the remaining cores until the user program completed successfully, without the program ever being aware of the fault.

Measuring system performance under various conditions: the ability to scale efficiently as the number of cores and nodes grows; the ability to exploit parallelism efficiently under different partitionings of cores into nodes; and the operating system's efficiency as a function of the workload, the number of tasks, their length, and additional system parameters. Operating system overheads and efficiency in the various scenarios were measured by the number of clock cycles executed by all cores in the system until the run completed, with or without fault injection.
