Reliable Massively Parallel Symbolic Computing: Fault Tolerance for a Distributed Haskell

by Robert Stewart

Submitted in conformity with the requirements for the degree of Doctor of Philosophy
Heriot Watt University
School of Mathematical and Computer Sciences
Submitted November, 2013

The copyright in this thesis is owned by the author. Any quotation from this thesis or use of any of the information contained in it must acknowledge this thesis as the source of the quotation or information.

Abstract

As the number of cores in manycore systems grows exponentially, the number of failures is also predicted to grow exponentially. Hence massively parallel computations must be able to tolerate faults. Moreover new approaches to language design and system architecture are needed to address the resilience of massively parallel heterogeneous architectures.

Symbolic computation has underpinned key advances in Mathematics and Computer Science, for example in number theory, cryptography, and coding theory. Computer algebra software systems facilitate symbolic mathematics. Developing these at scale has its own distinctive set of challenges, as symbolic algorithms tend to employ complex irregular data and control structures.

SymGridParII is a middleware for parallel symbolic computing on massively parallel High Performance Computing platforms. A key element of SymGridParII is a domain specific language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for scalable distributed-memory parallelism, and employs work stealing to load balance dynamically generated irregular task sizes.

To investigate providing scalable fault tolerant symbolic computation we design, implement and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles faults, using task replication as a key recovery strategy. The scheduler supports load balancing with a fault tolerant work stealing protocol.
The reliable scheduler is invoked with two fault tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel skeletons that encapsulate common parallel programming patterns. The user is oblivious to many failures; they are instead handled by the scheduler.

An operational semantics describes small-step reductions on states. A simple abstract machine for scheduling transitions and task evaluation is presented. It defines the semantics of supervised futures, and the transition rules for recovering tasks in the presence of failure. The transition rules are demonstrated with a fault-free execution, and three executions that recover from faults. The fault tolerant work stealing has been abstracted into a Promela model. The SPIN model checker is used to exhaustively search the intersection of states in this automaton to validate a key resiliency property of the protocol. It asserts that an initially empty supervised future on the supervisor node will eventually be full in the presence of all possible combinations of failures.

The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when executing Summatory Liouville on up to 1400 cores of an HPC architecture. Moreover, supervision overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey mechanism has been developed for stress testing resiliency with random failure combinations. All unit tests pass in the presence of random failure, terminating with the expected results.

Dedication

To Mum and Dad.

Acknowledgements

Foremost, I would like to express my deepest thanks to my two supervisors, Professor Phil Trinder and Dr Patrick Maier. Their patience, encouragement, and immense knowledge were key motivations throughout my PhD.
They carry out their research with an objective and principled approach to computer science. They persuasively conveyed an interest in my work, and I am grateful for my inclusion in their HPC-GAP project.

Phil has been my supervisor and guiding beacon through four years of computer science MEng and PhD research. I am truly thankful for his steadfast integrity, and selfless dedication to both my personal and academic development. I cannot think of a better supervisor to have. Patrick is a mentor and friend, from whom I have learnt the vital skill of disciplined critical thinking. His forensic scrutiny of my technical writing has been invaluable. He has always found the time to propose consistently excellent improvements. I owe a great debt of gratitude to Phil and Patrick.

I would like to thank Professor Greg Michaelson for offering thorough and excellent feedback on an earlier version of this thesis. In addition, a thank you to Dr Gudmund Grov. Gudmund gave feedback on Chapter 4 of this thesis, and suggested generality improvements to my model checking abstraction of HdpH-RS. A special mention for Dr Edsko de Vries of Well Typed, for our insightful and detailed discussions about network transport design. Furthermore, Edsko engineered the network abstraction layer on which the fault detecting component of HdpH-RS is built.

I thank the computing officers at Heriot-Watt University and the Edinburgh Parallel Computing Centre for their support and hardware access for the performance evaluation of HdpH-RS.

Contents

1 Introduction
1.1 Context
1.2 Contributions
1.3 Authorship & Collaboration
1.3.1 Authorship
1.3.2 Collaboration
2 Related Work
2.1 Dependability of Distributed Systems
2.1.1 Distributed Systems Terminology
2.1.2 Dependable Systems
2.2 Fault Tolerance
2.2.1 Fault Tolerance Terminology
2.2.2 Failure Rates
2.2.3 Fault Tolerance Mechanisms
2.2.4 Software Based Fault Tolerance
2.3 Classifications of Fault Tolerance Implementations
2.3.1 Fault Tolerance for DOTS Middleware
2.3.2 MapReduce
2.3.3 Distributed Datastores
2.3.4 Fault Tolerant Networking Protocols
2.3.5 Fault Tolerant MPI
2.3.6 Erlang
2.3.7 Process Supervision in Erlang OTP
2.4 CloudHaskell
2.4.1 Fault Tolerance in CloudHaskell
2.4.2 CloudHaskell 2.0
2.5 SymGridParII
2.6 HdpH
2.6.1 HdpH Language Design
2.6.2 HdpH Primitives
2.6.3 Programming Example with HdpH
2.6.4 HdpH Implementation
2.7 Fault Tolerance Potential for HdpH
3 Designing a Fault Tolerant Programming Language for Distributed Memory Scheduling
3.1 Supervised Workpools Prototype
3.2 Introducing Work Stealing Scheduling
3.3 Reliable Scheduling for Fault Tolerance
3.3.1 HdpH-RS Terminology
3.3.2 HdpH-RS Programming Primitives
3.4 Operational Semantics
3.4.1 Semantics of the Host Language
3.4.2 HdpH-RS Core Syntax
3.4.3 Small Step Operational Semantics
3.4.4 Execution of Transition Rules
3.5 Designing a Fault Tolerant Scheduler
3.5.1 Work Stealing Protocol
3.5.2 Task Locality
3.5.3 Duplicate Sparks
3.5.4 Fault Tolerant Scheduling Algorithm
3.5.5 Fault Recovery Examples
3.6 Summary
4 The Validation of Reliable Distributed Scheduling for HdpH-RS
4.1 Modeling Asynchronous Environments
4.1.1 Asynchronous Message Passing
4.1.2 Asynchronous Work Stealing
4.2 Promela Model of Fault Tolerant Scheduling
4.2.1 Introduction to Promela
4.2.2 Key Reliable Scheduling Properties
4.2.3 HdpH-RS Abstraction
4.2.4 Out-of-Scope Characteristics
4.3 Scheduling Model
4.3.1 Channels & Nodes
4.3.2 Node Failure
4.3.3 Node State
4.3.4 Spark Location Tracking
4.3.5 Message Handling
4.4 Verifying Scheduling Properties
4.4.1 Linear Temporal Logic & Propositional Symbols
4.4.2 Verification Options & Model Checking Platform
4.5 Model Checking Results
4.5.1 Counter Property
4.5.2 Desirable Properties
4.6 Identifying Scheduling Bugs
5 Implementing a Fault Tolerant Programming Language and Reliable Scheduler
5.1 HdpH-RS Architecture
5.1.1 Implementing Futures
5.1.2 Guard Posts
5.2 HdpH-RS Primitives
5.3 Recovering Supervised Sparks and Threads
5.4 HdpH-RS Node State
5.4.1 Communication State
5.4.2 Sparkpool State
5.4.3 Threadpool State
5.5 Fault Detecting Communications Layer
5.5.1 Distributed Virtual Machine
5.5.2 Message Passing API
5.5.3 RTS Messages
5.5.4 Detecting Node Failure
5.6 Comparison with Other Fault Tolerant Language Implementations
5.6.1 Erlang
5.6.2 Hadoop
5.6.3 GdH Fault Tolerance Design
5.6.4 Fault Tolerant MPI Implementations
6 Fault Tolerant Programming & Reliable Scheduling Evaluation
6.1 Fault Tolerant Programming with HdpH-RS
6.1.1 Programming With HdpH-RS Fault Tolerance Primitives
6.1.2 Fault Tolerant Parallel Skeletons
6.1.3 Programming With Fault Tolerant Skeletons
6.2 Launching Distributed Programs
6.3 Measurements Platform
6.3.1 Benchmarks
6.3.2 Measurement Methodologies
6.3.3 Hardware Platforms ..
