Resilience in High-Level Parallel Programming Languages

Sara S. Hamouda

A thesis submitted for the degree of Doctor of Philosophy at The Australian National University

June 2019

© Sara S. Hamouda 2019

Except where otherwise indicated, this thesis is my own original work.

Sara S. Hamouda 13 October 2018

To my parents, Hoda and Salem.

Acknowledgements

I grant all my gratitude to Almighty Allah, for blessing me with this wonderful learning journey and giving me the strength to complete it. Crossing the finish line would not have been possible without the great support and assistance of a group of wonderful people.

First of all, my supervisors, Josh Milthorpe, Peter Strazdins and Steve Blackburn, who offered their ultimate guidance and support to remove every obstacle in the way and enable me to achieve my goals. I wish to express my deepest appreciation to my primary supervisor, Josh Milthorpe, who dedicated many hours to discussing my work, refining my ideas, connecting me to collaborators, and challenging me with difficult questions sometimes to help me learn more. Thank you, Josh, for all the favors you showed me during these years. Without your encouragement and generous support, this thesis would not have been possible.

I am very grateful to the X10 team: David Grove, Olivier Tardieu, Vijay Saraswat, Benjamin Herta, Louis Mandel, and Arun Iyengar, for giving me the opportunity to collaborate with them in the Resilient X10 project, and for the invaluable experience I gained from this collaboration. I am also very grateful to the systems group in the Research School of Computer Science at the ANU for the valuable feedback and recommendations given during the PhD monitoring sessions. I would like to thank J. Eliot B. Moss for valuable discussions about transactional memory systems during his visit to the ANU, and Peter Strazdins for valuable discussions that led to the design of the resilience taxonomy described in this thesis.

I am grateful to the Australian National University for providing me the HDR merit and stipend scholarship. This research was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government.

Finally, I would like to thank my family and friends for the continuous encouragement and the happy moments they favored me with. My nephews, Nour Eldeen, Omar, and Eyad, are now too young to read these words, but I hope one day they will open my thesis and read this line to know how much I love them and how much chatting with them on a regular basis was my main source of happiness during my remote study years. To my dear family: Hoda, Salem, Mai, Ahmed and Esraa, thanks for supporting me every step of the way.


Abstract

The consistent trends of increasing core counts and decreasing mean-time-to-failure in supercomputers make supporting scalability and resilience a necessity in HPC programming models. Given the complexity of managing multi-threaded distributed execution in the presence of failures, there is a critical need for task-parallel abstractions that simplify writing efficient, modular, and understandable fault-tolerant applications.

MPI User-Level Failure Mitigation (MPI-ULFM) is an emerging fault-tolerant specification of MPI. It supports failure detection by returning special error codes and provides new interfaces for failure mitigation. Unfortunately, the unstructured form of failure reporting provided by MPI-ULFM hinders the composability and the clarity of fault-tolerant programs. The low-level programming model of MPI and the simplistic failure reporting mechanism adopted by MPI-ULFM make MPI-ULFM more suitable as a low-level communication layer for resilient high-level languages, rather than a direct programming model for application development.

The asynchronous partitioned global address space model is a high-level programming model designed to improve the productivity of developing large-scale applications. It represents a computation as a global control flow of nested parallel tasks that use global data partitioned among processes. Recent advances in the APGAS model supported control flow recovery by adding failure awareness to the nested parallelism model — async-finish — and by providing structured failure reporting through exceptions. Unfortunately, the current implementation of the resilient async-finish model results in a high performance overhead that can restrict the scalability of applications. Moreover, the lack of data resilience support limits the productivity of the model as it shifts the challenges of handling data availability and atomicity under failure to the programmer.

In this thesis, we demonstrate that resilient APGAS languages can achieve scalable performance under failure by exploiting fault tolerance features in emerging communication libraries such as MPI-ULFM. We propose multi-resolution resilience, in which high-level resilient constructs are composed from efficient lower-level resilient constructs, as an approach for bridging the gap between the efficiency of user-level fault tolerance and the productivity of system-level fault tolerance. To address the limited resilience efficiency of the async-finish model, we propose ‘optimistic finish’ — a message-optimal resilient termination detection protocol for the finish construct. To improve programmer productivity, we augment the APGAS model with resilient data stores that can simplify preserving critical application data in the presence of failure. In addition, we propose the ‘transactional finish’ construct as a productive mechanism for handling atomic updates on resilient data. Finally, we demonstrate the multi-resolution resilience approach by designing high-level resilient application frameworks based on the async-finish model.

We implemented the above enhancements in the X10 language, an embodiment of the APGAS model, and performed an empirical evaluation of the performance of resilient X10 using micro-benchmarks and a suite of transactional and non-transactional resilient applications. Concepts of the APGAS model are realized in multiple programming languages, which can benefit from the conceptual and technical contributions of this thesis. The presented empirical evaluation results will aid future comparisons with other resilient programming models.

Contents

Acknowledgements

Abstract

1 Introduction
  1.1 Background and Motivation
  1.2 Multi-Resolution Resilience
  1.3 Problem Statement
  1.4 Research Questions
  1.5 Thesis Statement
  1.6 Contributions
  1.7 Thesis Outline

2 Background and Related Work
  2.1 Resilience Definition
  2.2 Taxonomy of Resilient Programming Models
    2.2.1 Adaptability
      2.2.1.1 Resource Allocation
      2.2.1.2 Resource Mapping
    2.2.2 Fault Tolerance
      2.2.2.1 Fault Type
      2.2.2.2 Fault Level
      2.2.2.3 Recovery Level
      2.2.2.4 Fault Detection
      2.2.2.5 Fault Tolerance Technique
    2.2.3 Performance
      2.2.3.1 Performance Recovery
  2.3 Resilience Support in Distributed Programming Models
    2.3.1 Message-Passing Model
      2.3.1.1 MPICH-V
      2.3.1.2 FMI
      2.3.1.3 rMPI
      2.3.1.4 RedMPI
      2.3.1.5 FT-MPI


      2.3.1.6 MPI-ULFM
      2.3.1.7 FA-MPI
    2.3.2 The Partitioned Global Address Space Model
      2.3.2.1 GASNet
      2.3.2.2 UPC
      2.3.2.3 Coarrays
      2.3.2.4 GASPI
      2.3.2.5 OpenSHMEM
      2.3.2.6
      2.3.2.7 Global View Resilience
    2.3.3 The Asynchronous Partitioned Global Address Space Model
      2.3.3.1 Chapel
      2.3.3.2 X10
    2.3.4 The Actor Model
      2.3.4.1 Charm++
      2.3.4.2 Erlang
      2.3.4.3 Akka
      2.3.4.4 Orleans
    2.3.5 The Dataflow Programming Model
      2.3.5.1 OCR
      2.3.5.2 Legion
      2.3.5.3 NABBIT and PaRSEC
      2.3.5.4 Spark
    2.3.6 Review Conclusions
  2.4 The X10 Programming Model
    2.4.1 Task Parallelism
    2.4.2 The Happens-Before Constraint
    2.4.3 Global Data
    2.4.4 Resilient X10
    2.4.5 Elastic X10
      2.4.5.1 The PlaceManager API
      2.4.5.2 Place Virtualization
  2.5 Summary

3 Improving Resilient X10 Portability and Scalability Using MPI-ULFM
  3.1 Introduction
  3.2 X10 over MPI
    3.2.1 Initialization
    3.2.2 Active Messages
    3.2.3 Team Collectives
      3.2.3.1 The Emulated Implementation
      3.2.3.2 The Native Implementation
  3.3 MPI-ULFM Overview
    3.3.1 Fault-Tolerant Communicators

    3.3.2 Failure Notification
    3.3.3 Failure Mitigation
  3.4 Resilient X10 over MPI-ULFM
    3.4.1 Resilient X10 Transport Requirements
    3.4.2 Global Failure Detection
    3.4.3 Identifying Dead Places
    3.4.4 Native Team Collectives
      3.4.4.1 Team Construction
      3.4.4.2 Team Failure Notification
      3.4.4.3 Team Agreement
    3.4.5 Non-Shrinking Recovery
  3.5 Performance Evaluation
    3.5.1 Experimental Setup
    3.5.2 Performance Factors
      3.5.2.1 The Immediate Thread
      3.5.2.2 Emulated Team versus Native Team
    3.5.3 MPI-ULFM Failure-Free Resilience Overhead
    3.5.4 Resilient X10 over MPI-ULFM versus TCP Sockets
    3.5.5 Team Construction Performance
    3.5.6 Team Collectives Performance
  3.6 Related Work
  3.7 Summary

4 An Optimistic Protocol for Resilient Finish
  4.1 Introduction
  4.2 Nested Task Parallelism Models
  4.3 Related Work
  4.4 Resilient Async-Finish Optimality Limit
  4.5 Async-Finish Termination Detection Under Failure
    4.5.1 Failure Model
    4.5.2 Recovery Challenges
  4.6 Distributed Task Tracking
    4.6.1 Finish and LocalFinish Objects
    4.6.2 Task Events
  4.7 Non-Resilient Finish Protocol
    4.7.1 Garbage Collection
  4.8 Resilient Pessimistic Finish
    4.8.1 Adopting Orphan Tasks
    4.8.2 Excluding Lost Tasks
    4.8.3 Garbage Collection
  4.9 Resilient Optimistic Finish
    4.9.1 Adopting Orphan Tasks
    4.9.2 Excluding Lost Tasks
    4.9.3 Garbage Collection

    4.9.4 Optimistic Finish TLA Specification
  4.10 Finish Resilient Store Implementations
    4.10.1 Reviving the Distributed Finish Store
  4.11 Performance Evaluation
    4.11.1 Experimental Setup
    4.11.2 BenchMicro
      4.11.2.1 Performance Factors
      4.11.2.2 Performance Results
    4.11.3 Conclusions
  4.12 Summary

5 Towards Data Resilience in X10
  5.1 A Resilient Data Store for the APGAS Model
    5.1.1 Strong Locality
    5.1.2 Double In-Memory Replication
    5.1.3 Non-Shrinking Recovery
    5.1.4 Distributed Transactions
  5.2 From Finish to Transaction
    5.2.1 Transactional Finish Construct
      5.2.1.1 Nesting Semantics
      5.2.1.2 Error Reporting Semantics
      5.2.1.3 Deadlock-Free Implementation
    5.2.2 Finish Atomicity Awareness
      5.2.2.1 The Join Signal
      5.2.2.2 The Merge Signal
      5.2.2.3 Extended Finish Protocols
    5.2.3 Implementation Details
      5.2.3.1 Transaction Identifier and Log
      5.2.3.2 Lock Specification
      5.2.3.3 Concurrency Control Mechanism
      5.2.3.4 Two Phase Commit
      5.2.3.5 Transaction Termination Guarantee
  5.3 Resilient Application Frameworks
    5.3.1 Application Resilient Stores
      5.3.1.1 Place Local Store
      5.3.1.2 Transactional Store
    5.3.2 Resilient Iterative Framework
      5.3.2.1 The Global Iterative Executor
      5.3.2.2 The SPMD Iterative Executor
    5.3.3 Resilient Parallel Workers Framework
      5.3.3.1 Parallel Workers Executor
  5.4 Performance Evaluation
    5.4.1 Experimental Setup
    5.4.2 Transaction Benchmarking

      5.4.2.1 ResilientTxBench
      5.4.2.2 Graph Clustering: SSCA2 Kernel-4
    5.4.3 Iterative Applications Benchmarking
      5.4.3.1 X10 Global Matrix Library
      5.4.3.2 Linear Regression
      5.4.3.3 Logistic Regression
      5.4.3.4 PageRank
      5.4.3.5 LULESH
    5.4.4 Summary of Performance Results
  5.5 Summary

6 Conclusion
  6.1 Answering the Research Questions
  6.2 Future Work
    6.2.1 Fault-Tolerant One-Sided Communication
    6.2.2 Hardware Support for Distributed Transactions
    6.2.3 Beyond X10

Appendices

A Evaluation Platforms

B TLA+ Specification of the Optimistic Finish Protocol

C TLA+ Specification of the Distributed Finish Replication Protocol

List of Abbreviations

Bibliography

List of Figures

1.1 APGAS multi-resolution resilience (MRR)

2.1 Taxonomy of resilient programming models
2.2 A sample X10 program and the corresponding task graph
2.3 Place management with Place.places()
2.4 Place management with the PlaceManager class

3.1 X10 layered architecture
3.2 X10 active messages implementation over MPI
3.3 Team API implementation options
3.4 MPI-ULFM failure detection and acknowledgement
3.5 MPI-ULFM communicator shrinking recovery
3.6 Team construction performance
3.7 Team.Barrier performance
3.8 Team.Broadcast performance
3.9 Team.Allreduce performance
3.10 Team.Agree performance

4.1 Dynamic computation models
4.2 Message-optimal async-finish TD protocols
4.3 Task tracking under failure
4.4 Tracking remote task creation
4.5 Task tracking events as task c transitions to place 3
4.6 BenchMicro: local finish performance
4.7 BenchMicro: single remote task performance
4.8 BenchMicro: flat fan-out performance
4.9 BenchMicro: flat fan-out message back performance
4.10 BenchMicro: tree fan-out performance
4.11 BenchMicro: all-to-all performance
4.12 BenchMicro: all-to-all with nested finish performance
4.13 BenchMicro: ring around via at performance

5.1 Resilient store replication


5.2 Shrinking recovery versus non-shrinking recovery for a 2-dimensional data grid
5.3 Weak scaling performance for three GML benchmarks
5.4 Resilient store recovery
5.5 Distributed graph clustering example
5.6 Two-phase commit: a committing transaction
5.7 Two-phase commit: an aborting transaction
5.8 Resilient two-phase commit: a committing write transaction
5.9 Resilient two-phase commit: an aborting write transaction
5.10 Master replica and slave replica state diagrams
5.11 ResilientTxBench transaction throughput with 0% update
5.12 ResilientTxBench transaction throughput with 50% update
5.13 SSCA2-k4 strong scaling performance under failure
5.14 Linear regression weak scaling performance
5.15 Logistic regression weak scaling performance
5.16 PageRank weak scaling performance
5.17 LULESH weak scaling performance

List of Tables

2.1 Distributed processing systems resilience characteristics

3.1 MPI-ULFM failure mitigation functions
3.2 Performance of Team collectives with different MPI implementations
3.3 Performance of Team collectives with sockets and MPI-ULFM
3.4 Team construction performance on Raijin with ULFM
3.5 Team.Barrier performance
3.6 Team.Broadcast performance
3.7 Team.Allreduce performance
3.8 Team.Agree performance

4.1 TLA+ actions describing the optimistic finish protocol
4.2 Execution time in seconds for BenchMicro patterns with 1024 places
4.3 Slowdown factor versus non-resilient finish with 1024 places

5.1 Comparison between finish and transactional finish
5.2 The PlaceManager APIs
5.3 The PlaceLocalStore APIs
5.4 The TxStore APIs
5.5 The resilient iterative application framework APIs
5.6 The parallel workers application framework APIs
5.7 ResilientTxBench parameters
5.8 Transaction throughput results
5.9 SSCA2-k4 performance with different concurrency control mechanisms
5.10 SSCA2-k4 strong scaling performance under failure
5.15 Changed LOC for adding resilience using the iterative framework

Chapter 1

Introduction

This thesis studies the challenges of supporting productive and efficient resilience in high-level parallel programming languages. In particular, it considers the Asynchronous Partitioned Global Address Space (APGAS) programming model [Saraswat et al., 2010], exemplified by the X10 programming language [Charles et al., 2005].

1.1 Background and Motivation

Recent advances in High Performance Computing (HPC) systems have resulted in greatly increased parallelism, with both larger numbers of nodes and larger numbers of computing cores within each node. With this increased system size and complexity comes an increase in the expected rate of failures [Schroeder and Gibson, 2007; Bergman et al., 2008; Cappello et al., 2009; Shalf et al., 2010; Ferreira et al., 2011; Dongarra et al., 2011]. Programmers of HPC systems must therefore address the twin challenges of efficiently exploiting available parallelism, while ensuring resilience to component failures. As more industrial and scientific communities rely on HPC as a driving power for their innovation, the need for productive programming models that simplify the development of scalable resilient applications continues to grow.

The Message Passing Interface (MPI), the de facto standard programming model for HPC applications, is carefully designed for performance scalability rather than resilience or productivity. It does not provide native task-parallel abstractions; however, users handle this limitation by integrating MPI programs with other tasking libraries at the expense of increased programming effort and integration complexities [Chamberlain et al., 2007; Hayashi et al., 2017]. In light of the criticality of failures on large-scale computations, the MPI Fault Tolerance Working Group (FTWG) was formed to study different proposals for extending MPI with fault tolerance interfaces. After developing three proposals, FT-MPI [Fagg and Dongarra, 2000], Run Through Stabilization (RTS) [Hursey et al., 2011], and MPI User Level Failure Mitigation (MPI-ULFM) [Bland et al., 2012b], the MPI FTWG has converged towards the MPI-ULFM specification.


MPI-ULFM is currently the only active proposal for adding fault tolerance semantics to the coming MPI-4 standard. It enables applications to recover from fail-stop failures by providing failure detection and failure mitigation interfaces in MPI. Failure reporting is done by returning special error codes from a subset of MPI interfaces. To avoid altering the whole code with conditional checking for errors, a cleaner option for handling failures is to register a global error handling function. However, using a global error handler isolates failure recovery from the piece of code where the failure was detected, which complicates failure identification and recovery [Laguna et al., 2016; Fang et al., 2016]1. This simplistic unstructured form of failure reporting harms the modularity and the clarity of MPI-ULFM programs and complicates composing multiple fault tolerance techniques. Because of the low-level programming model of MPI and the above programmability challenges of MPI-ULFM, we argue that MPI-ULFM can be more useful as a low-level communication base for resilient high-level programming models rather than a direct programming model for application development.

The difficulties involved in developing distributed multi-threaded applications using low-level programming models, such as MPI, motivated the development of productivity-oriented programming models for HPC. The most influential project in that direction is DARPA’s High Productivity Computing Systems project, which resulted in the emergence of the APGAS programming model and its earliest examples: X10 [Charles et al., 2005] and Chapel [Chamberlain et al., 2007]. The APGAS model avoids the main productivity barriers of MPI — a fragmented memory model and lack of task parallelism support — by providing a global view of distributed memory and supporting nested task parallelism. The async-finish task model emerged in APGAS languages as a more flexible nested parallelism model than the fork-join model [Guo et al., 2009]. The words async and fork represent constructs for spawning asynchronous tasks, and the words finish and join represent synchronization constructs. While finish can track asynchronous tasks spawned directly or transitively by the current task, join can only track asynchronous tasks spawned directly by the current task. Hence, the async-finish model can express not only fork-join task graphs but also other task graphs with arbitrary synchronization patterns. Unfortunately, most of the research on APGAS programming models has focused on performance and productivity and ignored issues related to resilience.

To address this limitation, the X10 team has recently developed Resilient X10 (RX10), which extends the APGAS model with user-level fault tolerance [Cunningham et al., 2014; Crafa et al., 2014]. Fault tolerance is added by extending async-finish with failure awareness and structured failure reporting through exceptions. A resilient finish can detect the loss of any of its tasks due to a process failure and consequently raises an exception. Unlike MPI-ULFM, adopting a structured form of failure reporting in RX10 enables adding fault tolerance to existing codes in a clear and understandable way. It also facilitates the hierarchical composition of fault tolerance strategies. For example, see Listing 1.1.

1 See [Laguna et al., 2016] for a more detailed description of this issue using a sample code.

Listing 1.1: Structured failure handling using the resilient async-finish model. A place represents a process in X10. The statement async S; spawns an asynchronous task to execute S. The statement at (p) S; executes S at place p. The statement finish S; blocks the current task until all asynchronous tasks spawned by S terminate. The lines with a gray background are those added for fault tolerance.

 1  def local_iteration(i:Int) {
 2    try {
 3      finish compute_using_neighbor();
 4    } catch (ex:DeadPlaceException) { // neighbor died
 5      // local recovery using approximation
 6      val success = estimate_without_neighbor();
 7      if (!success) throw new ApproximationFailure();
 8    }
 9  }
10  def global_iteration(i:Int) {
11    try {
12      finish for (p in places) at (p) async {
13        local_iteration(i);
14      }
15      checkpoint();
16    } catch(ex:ApproximationFailure) {
17      // restart this iteration
18      i--;
19      // global recovery using checkpoint/restart
20      val success = load_last_checkpoint();
21      if (!success) throw new FatalException();
22    }
23  }
24  def main() {
25    try {
26      for (var i:Int = 1; i < 10; i++) {
27        global_iteration(i);
28      }
29    } catch(ex:FatalException) {
30      print("fatal error occurred");
31    }
32  }

Listing 1.1 shows pseudocode for an iterative algorithm that uses two nested finish scopes to govern the execution of each iteration. The outer finish at Line 12 governs the execution of an iteration across all the processes. Failures detected by this finish are handled aggressively by restarting the entire iteration (Lines 16–22). Failing to restart is a fatal error (Lines 29–31). The inner finish at Line 3 governs the computation assigned to a single process at each iteration. Each process calculates a result using its neighboring process. This algorithm attempts to recover from a neighbor’s failure locally by generating an approximate result (Lines 4–8) rather than restarting the iteration. It raises an exception to the outer finish only when the approximation algorithm fails.
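The contrast drawn earlier between finish and join is also visible in a minimal X10 sketch (doWork is a hypothetical placeholder, not part of the listing above): a finish waits for all tasks spawned directly or transitively within its scope, whereas a fork-join join would only wait for the tasks its own task spawned directly.

finish {
    async {              // task A, spawned directly inside the finish
        async doWork(1); // task B, spawned by A, i.e. transitively inside the finish
    }
    async doWork(2);     // task C, spawned directly inside the finish
}
// execution reaches this point only after A, B, and C have all terminated,
// even though B was not spawned directly by the task that entered the finish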

Despite the advantages of the resilient async-finish model, RX10 has had major practicality issues. First, adding failure awareness while maintaining the flexibility of the async-finish model added a significant performance overhead to common application patterns. Second, RX10 did not leverage recent fault-tolerant MPI implementations for inter-process communication, which limited its portability to supercomputing platforms. Finally, RX10 shifted the whole burden of failure handling to programmers, as it did not provide any support for commonly used fault tolerance techniques or any mechanism for protecting application data. We address these issues in this thesis, aiming to provide a productive and efficient programming model for developing resilient HPC applications.

1.2 Multi-Resolution Resilience

When designing resilient applications, the choice of the resilience approach directly impacts performance and productivity. System-level resilience provides fault tolerance to applications transparently using a generic fault tolerance technique. This approach is attractive for productivity; however, no fault tolerance technique is universally efficient for all applications. On the other hand, user-level resilience allows users to customize fault tolerance to specific application characteristics. This approach is attractive for performance; however, it incurs a high productivity cost. We argue that a balance between the efficiency of user-level fault tolerance and the productivity of system-level fault tolerance can be achieved by supporting multi-resolution resilience. Multi-resolution resilience is a special type of user-level resilience that requires the programming model to provide efficient and composable resilient constructs that can be used for building resilient frameworks at different levels of abstraction. We recognize that both efficiency and composability are necessary features for a practical implementation of multi-resolution resilience. The absence of efficiency harms performance, while the absence of composability harms productivity as it complicates building higher-level frameworks.

[Figure 1.1: a Venn diagram relating Resilience, Productivity, and Performance, annotated with example systems (Hadoop, MPI, MPI-ULFM, Chapel, APGAS) and with the two ingredients of MRR: system-level resilience (for MRR: resilient application frameworks) and user-level resilience (for MRR: efficient and composable resilient constructs), with MRR placed at the intersection of the three goals.]

Figure 1.1: APGAS multi-resolution resilience (MRR) for reconciling performance, resilience, and productivity.

Figure 1.1 shows how multi-resolution resilience can enable APGAS languages to reconcile performance, resilience, and productivity. The Venn diagram highlights that each of the priorities of APGAS, user-level resilience, and system-level resilience is concerned with only two of the three features. For APGAS, reconciling performance and productivity is a key design objective. In addition, composable synchronization constructs are already available for supporting nested task parallelism. By extending these constructs with failure awareness, APGAS languages can gain the performance advantages of user-level resilience. They can also gain the productivity advantage of system-level resilience by using these constructs for building resilient application frameworks that handle fault tolerance on behalf of programmers. However, realizing this ambitious goal is challenging. Taking RX10 as an example implementation of multi-resolution resilience, we investigate performance and productivity problems that can hinder achieving this goal. We describe these problems in the next section.

1.3 Problem Statement

The composable async-finish model is a suitable base for multi-resolution resilience. However, adding failure awareness to the async-finish model does not come for free. It requires the runtime system to perform additional book-keeping activities for tracking the control flow of the computation. Our early experience with RX10 demonstrated that the composability of the async-finish model did not result in performance efficiency due to the high resilience overhead imposed by the book-keeping activities of the runtime system [Hamouda et al., 2015]. The main source of the overhead is the use of a resilient termination detection protocol for the finish construct that pessimistically tracks every state transition of remote tasks. In this thesis, we investigate the possibility of designing an optimistic finish protocol that reduces the resilience overhead in failure-free executions by allowing some uncertainty about the states of remote tasks. Such a protocol must be able to correctly resolve any uncertainties at failure recovery time.

Although an optimistic finish protocol may succeed in reducing the resilience overhead in failure-free scenarios, it cannot eliminate the resilience overhead entirely. In this thesis, we investigate the possibility of eliminating the resilience overhead for certain classes of applications by exploiting resilience capabilities in emerging fault-tolerant message-passing libraries. In particular, we focus on the MPI-ULFM library and evaluate its suitability to the RX10 programming model. To the best of our knowledge, our work is the first to evaluate MPI-ULFM in the context of a high-level parallel programming language.

Initial development of RX10 [Cunningham et al., 2014] focused mainly on recovering the computation’s control flow and left the burden of protecting application data to the programmer. The complexities of ensuring data availability and consistency in the presence of failures are exacerbated in applications that require atomic updates on distributed data. In this thesis, we address the challenge of supporting efficient data resilience in APGAS languages, including support for distributed transactions.

1.4 Research Questions

The objective of this thesis has been to answer the following questions:

• How to improve the resilience efficiency of the async-finish task model?

• How to exploit the fault tolerance capabilities of MPI-ULFM to improve the scalability of resilient APGAS languages?

• How to improve the productivity of resilient APGAS languages that support user-level fault tolerance?

1.5 Thesis Statement

High-level programming models can bridge the gap between resilience, productivity and performance by supporting multi-resolution resilience through efficient and composable resilient abstractions.

1.6 Contributions

This thesis focuses on mechanisms for tolerating fail-stop process failures. It contributes multiple enhancements to the X10 language to provide practical support for multi-resolution resilience based on the async-finish task model. The following are the main contributions of my thesis:

• A taxonomy of resilient programming models.

• A detailed semantic mapping between X10’s resilient asynchronous execution model and ULFM’s failure reporting and recovery semantics.

• A novel message-optimal resilient termination detection protocol, named ‘opti- mistic finish’, that improves the efficiency of the resilient finish construct.

• A novel extension to the async-finish model to support resilient transactions that simplify handling distributed data atomicity in non-resilient and resilient applications.

• Extending RX10 with data resilience support by designing two resilient stores that shift the burden of handling data availability and consistency from the programmer to the runtime system and standard library.

• Demonstrating the multi-resolution resilience approach by building two productive resilient application frameworks based on the async-finish model. The first framework is a resilient iterative framework for bulk-synchronous applications, and the second framework is a parallel workers framework suitable for applications composed of independent parallel tasks.

• An empirical performance evaluation of the above enhancements using micro-benchmarks and applications with up to 1024 cores.

Contributions of the thesis are presented in the following publications: [Hamouda et al., 2015], [Hamouda et al., 2016], [Grove et al., 2019], and [Hamouda and Milthorpe, 2019].

1.7 Thesis Outline

The thesis is structured as follows:

• Chapter 2 describes a taxonomy of resilience characteristics, provides a broad overview of resilience features in different programming models, and describes the X10 programming model in detail.

• Chapter 3 describes the details of the integration between MPI-ULFM and RX10 and evaluates the resulting performance optimizations (related publication: [Hamouda et al., 2016]).

• Chapter 4 describes the optimistic finish protocol — our proposed protocol for improving the efficiency of resilient nested parallelism in APGAS languages — and evaluates the protocol in different computation patterns (related publication: [Hamouda and Milthorpe, 2019]).

• Chapter 5 describes multiple extensions to X10 that aim at reducing the complexity of handling data resilience in X10 applications. It describes a proposed transactional finish construct for handling distributed data atomicity. It also describes multiple productive application frameworks. The value of these frameworks is demonstrated by developing a suite of transactional and non-transactional resilient applications. The chapter concludes with a detailed analysis of the performance of these applications with up to 1024 cores (related publications: [Hamouda et al., 2015], [Hamouda et al., 2016], [Grove et al., 2019]).

• Chapter 6 describes the thesis conclusions and directions for future work.

Chapter 2

Background and Related Work

In this chapter, we review the foundational concepts of resilience and their implementation in a wide range of programming models. The objective of our review is to identify not only the common approaches but also the neglected approaches in the area of resilience research. We also aim to identify the most suitable programming model for providing multi-resolution resilience — our proposed approach to user-level resilience which aims to combine productivity and performance scalability.

After defining resilience in Section 2.1, we describe a taxonomy of resilient programming models in Section 2.2. In Section 2.3, we use the taxonomy as a basis for describing the resilience characteristics of a wide range of programming models. The conclusions derived from our review are summarized in Section 2.3.6. Because we use X10 as a basis for our research, we dedicate Section 2.4 to describing the X10 programming model and its resilience features in detail.

2.1 Resilience Definition

Although resilience is sometimes used as a synonym for fault tolerance, there are other broader definitions. Almeida et al. [2013] reviewed a number of definitions of resilience and concluded with this quite inclusive definition:

[resilience is] the emerging property of a system that is effective and effi- cient in accommodating to and recovering from changes in a broad sense with minimal impact on the quality of the service provided (avoiding fail- ures as much as possible and performing as close as possible to its defined goals). [Almeida et al., 2013]

Almeida et al. consider fault tolerance as a necessary but not a sufficient property for resilience, because the impact on performance after recovery is an important factor that should not be ignored. They also consider resilience as a dynamic system property that requires the system to be able to recognize changes and adapt to them.


[Figure 2.1 shows the taxonomy as a tree: Resilience Characteristics branches into Adaptability, Fault Tolerance, and Performance. Adaptability covers Resource Allocation (fixed, dynamic) and Resource Mapping (implicit, explicit); Fault Tolerance covers Fault Type (hard fault, soft fault), Fault Level (task, process), Recovery Level (user, system), Fault Detection (heart-beating, communication error, failure prediction, data inspection), and FT Technique (checkpoint/restart, replication, migration, transaction, task restart, ABFT); Performance covers Performance Recovery (shrinking recovery, non-shrinking recovery, global load balancing, speculative execution).]

Figure 2.1: Taxonomy of resilient programming models.

2.2 Taxonomy of Resilient Programming Models

Based on Almeida et al.’s definition, we developed a taxonomy of resilient programming models that considers features in three broad categories: adaptability, fault tolerance, and performance management. Although the literature is rich with reviews for the different fault tolerance approaches and their implementations [Elnozahy et al., 2002; Cappello, 2009; Maloney and Goscinski, 2009; Egwutuoha et al., 2013; Bland et al., 2013], our review is unique in classifying programming models according to features in all three categories. The taxonomy framework enables us to zoom in on the internal implementations of the different programming models and to identify missing properties in certain programming models that can provide more efficient support for resilience. This section provides a brief description of the taxonomy presented in Figure 2.1. The two features in our taxonomy, resource mapping and fault level, are adapted from the taxonomy in [Thoman et al., 2017].

2.2.1 Adaptability

A system’s adaptability describes its ability to dynamically adjust the execution environment in response to resource failure. The taxonomy dissects adaptability into two properties: resource allocation and resource mapping.

2.2.1.1 Resource Allocation

This property describes the allocation policy of the hardware resources (i.e. compute nodes or processors) for executing the application processes.

Fixed Allocation

A runtime system in this category allocates a fixed-size pool of resources for the application. It requires specifying the number of processes at startup time and does not provide the capability of starting new processes during execution. When a process fails, the application can only use the remaining processes to continue processing. Some applications can easily adapt to a shrinking set of resources by redistributing the workload among the remaining resources. In contrast, it is challenging for applications that use static workload partitioning to tolerate the loss of resources. These applications may compensate for the fixed allocation policy by allocating spare processes in advance to be used for recovery. However, this method results in wasting resources, as the spare processes may remain idle for most or all of the processing time.

Dynamic Allocation

A runtime system in this category is capable of expanding its resource allocation during execution. Rather than allocating spare resources in advance, the runtime system can dynamically allocate new resources to compensate for failed ones. In programming models in which the processes are tightly coupled, such as MPI and APGAS, dynamic process allocation may impose global synchronization on the applications to reach a consistent view on the available set of processes. If a globally consistent view is not required, the runtime system can employ an asynchronous handshaking protocol that eventually propagates the identity of the new processes to the rest of the world.

2.2.1.2 Resource Mapping

In distributed processing, coarse-grained execution units (i.e. processes) collaborate on processing the computation’s work units. The resource mapping property describes the level of user control over mapping the work units to the execution units. It also implies the level of coupling between the logical work units and the hardware resources that support them. The term “work unit” maps to different concepts in different programming models. For example, it maps to an actor in the Charm++ programming model, an entire process in MPI, and a task in task-based programming models. In the following, we explain how the resource mapping policy is crucial in determining the resilience model a programming model can support.

Explicit Mapping

A programming model in this category gives the programmer full control over mapping the work units to the processes. With this control, programmers are capable of implementing data and work placement strategies that minimize communication and improve the scalability of their applications. They can also implement exact or approximate mechanisms for recovering lost data given their knowledge of the role each process is playing in evaluating the computation. Programming models that require explicit resource mapping are therefore flexible in supporting generic as well as application-specific fault tolerance techniques.

Implicit Mapping

Programming models in this category decouple the work units from the physical resources that execute them by hiding coarse-grain parallelism from the programming abstraction. The programmer expresses the computation as a graph of process-independent work units and relies on the runtime system to implicitly map the work units to the available processes to achieve the best performance. The runtime system can employ locality-aware scheduling or enable the programmer to provide locality hints [Mattson et al., 2016] to improve its decisions regarding work placement. Because the work-to-process mapping knowledge is mainly owned by the runtime system rather than the programmer, recovering from process failure may only be feasible at the runtime level. Consequently, programming models that provide implicit resource mapping are often not adequate for user-level recovery and the application of algorithmic-based fault tolerance techniques for handling process failures.

2.2.2 Fault Tolerance

The fault tolerance part of the taxonomy covers the types of faults, the common fault detection mechanisms, the common fault tolerance techniques, and the different fault levels and recovery levels that runtime systems can support.

2.2.2.1 Fault Type

Faults are generally classified into hard faults or soft faults. The frequency of both fault types is expected to rise on future exascale supercomputers. That is due to the increasing parallelism and transistor density on these systems and the reduced redundancy and component voltage for power-saving purposes [Feng et al., 2010].

Hard Faults

A hard fault occurs when a certain component unexpectedly stops operating; the resulting failures are called fail-stop failures. A crashed process due to a hardware or a software error is an example of a fail-stop failure. Fail-stop failures are the most common type of failures in distributed systems [Du et al., 2012].

Soft Faults

A soft fault occurs due to bit-flips corrupting data in disk, memory or processor registers [Cao et al., 2015]. It impacts both shared memory systems and distributed memory systems. Sources of soft faults include electrical noise, temperature fluctuations, and electrostatic discharge [Du et al., 2011; Mukherjee et al., 2005]. Unlike hard faults, soft faults do not necessarily interrupt the execution of the program, which makes them more difficult to detect. If not detected, they can lead the impacted programs to silently generate wrong results or eventually terminate catastrophically.

2.2.2.2 Fault Level

In the single-core processor era, parallel processing was mainly achieved by executing multiple sequential processes in parallel at different nodes. The advent of multi-core and heterogeneous architectures demanded finer granularity of parallelism at the application level to utilize the available hardware parallelism more efficiently. Nowadays, a typical parallel application is expected to execute multiple tasks in parallel within each process by mapping the tasks to the available CPUs and accelerators in the node. Adding new levels of parallelism introduces new levels of faults that the application can experience. In this category, we classify the faults that a runtime system can recover into task faults and process faults.

Task Faults

Individual tasks within a process may fail due to soft faults, hard faults, or programming errors. The process in which the task executes is responsible for detecting and recovering the failed task.

Process Faults

An entire process can also fail, resulting in immediate loss of the data and the tasks it owns. Other processes communicating with the failed process can detect its failure and trigger the recovery procedure.

2.2.2.3 Recovery Level

This property classifies the resilient programming models according to the level at which faults are handled. The two main categories are system-level fault tolerance and user-level fault tolerance.

System Level

System-level fault tolerance refers to embedding an application-oblivious failure recovery mechanism in the runtime which can be leveraged by applications transparently or semi-transparently. It is favored for programming productivity since it shifts the burden of handling failures away from the user. However, adopting a one-size-fits-all approach for handling failures may impose expensive overheads on certain applications that can take advantage of algorithmic-based fault tolerance techniques.

User Level

User-level fault tolerance refers to runtimes that do not provide a built-in recovery mechanism. Instead, the runtime provides failure detection and reporting services that users rely on to implement application-specific recovery mechanisms. The main advantage of user-level fault tolerance over system-level fault tolerance is that the former allows users to tailor more efficient recovery methods based on their expert knowledge of the underlying algorithmic properties.

2.2.2.4 Fault Detection

In this property, we describe the common fault detection techniques applied in distributed programming models.

Heartbeat

Heartbeating is an active failure detection mechanism. A process observes the status of another process by exchanging periodic heartbeat messages with this process. A process is considered alive as long as it continues acknowledging the heartbeat messages; otherwise, it is considered dead.
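A minimal sketch of such a detector is shown below (X10-style pseudocode; sendHeartbeatAndWaitForAck, sleepFor, and reportDead are hypothetical helpers, and the thresholds are illustrative): the observer keeps pinging the observed process and declares it dead only after several consecutive missed acknowledgements.

def monitor(observed:Place, period:Long, maxMissed:Long) {
    var missed:Long = 0;
    while (missed < maxMissed) {
        val acked = sendHeartbeatAndWaitForAck(observed, period); // true if an ack arrived in time
        if (acked) {
            missed = 0;          // the observed process is still considered alive
        } else {
            missed = missed + 1; // one more missed heartbeat
        }
        sleepFor(period);        // wait before sending the next heartbeat
    }
    reportDead(observed);        // too many missed acks: declare the process dead
}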

Communication Errors

In this mechanism, a process is considered dead when communication actions with it fail to complete successfully. It is a passive mechanism for failure detection.

External Notifications

This category describes runtime systems that rely on external components to detect failures and notify the live processes about them. For example, some distributed systems create a daemon process at each node to monitor and propagate the status of the application processes; other systems use third-party Reliability, Availability, and Serviceability (RAS) services to achieve the same goal.

Failure Prediction

Proactive fault tolerance techniques, such as process migration, rely on failure predictors to notify the runtime system of possible failures. When an imminent fault is detected, the predictor sends a warning signal to the runtime system to enable it to take failure avoidance actions. Failure prediction models are often designed based on the history of failures on a particular system using its RAS log files [Gainaru et al., 2012; Liang et al., 2006; Sahoo et al., 2002].

Data Inspection

Inspection of the application data during execution can reveal the impact of soft faults. Some algorithms are designed to have intrinsic properties that impose well-defined constraints on the data values [Rizzi et al., 2016]; violations of these constraints indicate the occurrence of a soft error. For example, adding checksum vectors to original application data is a common mechanism for linear algebra applications to detect and even correct data corruption errors. A more general approach for data inspection is by using computation replication. By invoking multiple instances of the same computation, data corruption can be detected by comparing the results of the different instances [Ni et al., 2013].

2.2.2.5 Fault Tolerance Technique

In the following, we describe the common fault tolerance techniques used for handling soft faults and hard faults.

Checkpoint/Restart

Checkpoint/Restart is the most widely used fault tolerance technique in HPC. It prepares the application to face failures by capturing snapshots of the application state during execution. Checkpoint/restart is a rollback-recovery mechanism — when failures occur, the application is restarted from an old state using the latest saved checkpoint. In a distributed execution, capturing a consistent state across several processes in the absence of a global clock is challenging. Many protocols have been investigated in the literature for tackling this challenge [Elnozahy et al., 2002; Maloney and Goscinski, 2009], where the protocols are generally classified into coordinated protocols and uncoordinated protocols.

Coordinated checkpointing: In a coordinated protocol, checkpointing is performed as a collective operation between all the processes. At the start of a checkpointing phase, each process pauses executing application actions, completes any pending communications with the other processes, then saves its local state to a reliable storage. Because application-level communications are stopped, checkpointing does not require capturing the communication state between the processes. A failure of one or more processes requires restarting all the processes from the latest checkpoint.

Uncoordinated checkpointing with message logging: An uncoordinated protocol avoids synchronizing the processes globally for checkpointing or restarting. Each process decides the checkpointing time independently from the other processes. Upon a failure, only the failed process is restarted using its latest checkpoint. In order to bring the restarted process to a state that is consistent with the rest of the processes, the other processes replay the communication events that were performed since the checkpoint time of the restarting process. Because of that, most uncoordinated checkpointing protocols require the processes to log the communication messages and to store them as part of their checkpoint.
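As a concrete illustration, the sketch below (X10-style pseudocode; computeIteration, saveCheckpoint, and loadLastCheckpoint are hypothetical helpers, not an existing API) shows how coordinated checkpointing is commonly embedded in a bulk-synchronous iterative computation: every place finishes an iteration, a globally consistent snapshot is taken while no application messages are in flight, and a failure rolls all places back to the latest snapshot.

def run(maxIterations:Long, checkpointInterval:Long) {
    var i:Long = 1;
    while (i <= maxIterations) {
        val iter = i;  // vals (not vars) may be captured by remote tasks
        try {
            // every place executes one iteration; finish waits for global completion,
            // so no application communication is pending when the checkpoint is taken
            finish for (p in Place.places()) at (p) async {
                computeIteration(iter);
            }
            if (iter % checkpointInterval == 0)
                saveCheckpoint(iter);          // coordinated: each place saves its local state
            i = iter + 1;
        } catch (ex:DeadPlaceException) {
            i = loadLastCheckpoint() + 1;      // roll every place back to the latest snapshot
        }
    }
}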

Replication

Process replication is a forward-recovery mechanism. It is based on the intuitive idea of executing multiple instances of the same computation in parallel, so that when one instance is impacted by a failure another instance can proceed towards normal completion. The main challenge of this procedure is ensuring that the different instances remain identical. This goal can be achieved using atomic broadcast protocols that ensure that changes to the application state are applied on all of the replicas in the correct order. The main drawback of process replication is introducing high resource utilization overhead. Despite that, replication is being investigated as a more practical fault tolerance technique for exascale systems than checkpoint/restart [Bougeret et al., 2014; Ropars et al., 2015]. Some studies suggest that global checkpoint/restart may not be feasible on exascale systems because checkpointing the state of an exascale machine is likely to exceed its Mean Time Between Failures (MTBF) [Cappello et al., 2009; Dongarra et al., 2011]. In addition to replicating processes, replication may also be performed for fine-grained work units (e.g. actors, tasks) for failure recovery or load balancing.

Migration

Migration [Wang et al., 2008; Meneses et al., 2014] is a proactive fault tolerance technique that aims to avoid failures by predicting imminent faults and migrating the work units (e.g. actors, tasks or processes) from nodes that are likely to fail to other healthier nodes. The accuracy of the prediction model used is a determining factor for the performance overhead imposed by migration and for the level of protection it can deliver to applications. False positives lead to unnecessary migrations, while false negatives lead to crashed executions. To mitigate false negatives, migration is often used with another fault tolerance technique, such as checkpoint/restart [Egwutuoha et al., 2013]. The ability to predict some of the failures enables the application to checkpoint its state less frequently, thereby enhancing the overall execution performance.

Transactions

A distributed transaction provides atomic execution of multiple distributed actions, such that either all the actions take effect (i.e. the transaction commits) or none of them do (i.e. the transaction aborts). Atomic commit protocols are employed behind the scenes to track the issued actions, acquire the needed locks, detect conflicts with other transactions, and finalize the transaction properly by either committing or aborting it. Handling process failure during transaction execution has been widely studied by the database community [Skeen, 1981; Bernstein and Goodman, 1984]. Resilient transactions can simplify the development of fault-tolerant applications as they shift the burden of data consistency and failure recovery from the user to the transactional memory system. Most HPC programming models lack support for distributed transactions due to the scalability limitations of the distributed commit protocols [Harding et al., 2017]. These protocols require multiple phases of communication between all the transaction participants, which incurs high performance overhead.
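To make the programming-model implication concrete, the sketch below (X10-style pseudocode; txStore, the Tx type, and the executeTransaction, get, and put operations are hypothetical placeholders for whatever API a transactional store exposes) shows the kind of atomic multi-place update a distributed transaction enables: both account updates commit together or neither takes effect, even if a participant process fails part-way through.

// transfer 'amount' atomically between two keys that may live on different places
txStore.executeTransaction((tx:Tx) => {
    val a = tx.get("accountA") as Long;
    val b = tx.get("accountB") as Long;
    tx.put("accountA", a - amount);
    tx.put("accountB", b + amount);
    // the commit protocol run behind the scenes either applies both updates
    // or aborts the transaction, e.g. on a conflict or a participant failure
});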

Task Restart

Task restart is a popular fault tolerance mechanism for dataflow systems that are represented by various big-data systems [Isard et al., 2007; Dean and Ghemawat, 2008; Zaharia et al., 2010] and HPC task-based runtime systems [Aiken et al., 2014; Mattson et al., 2016]. In these systems, the computation is expressed as a graph of inter-dependent tasks. The dependencies between the tasks are explicitly defined at the program level and passed to the runtime system. Process failures result in losing tasks and data objects that other tasks may depend on. From the task-dependence graph, the runtime system can identify and re-execute the set of tasks that can regenerate the lost data in order to satisfy the dependencies of the pending tasks and direct the execution towards successful termination. Checkpointing can be used in conjunction with task restart to avoid restarting long-running tasks from the beginning.

Algorithmic-Based Fault Tolerance

Algorithmic-Based Fault Tolerance (ABFT) was first introduced by Huang and Abraham [1984]. They designed methods for detecting and correcting soft errors in matrix operations by adding checksum information into matrix data. The same idea has been used later on for tolerating fail-stop failures in distributed matrix computations [Chen and Dongarra, 2008; Hakkarinen and Chen, 2010; Davies et al., 2011; Du et al., 2012]. Such algorithmic recovery methods often outperform generic fault tolerance techniques such as checkpoint/restart [Graham et al., 2012]. More naturally fault tolerant algorithms can be found in the approximate computing domain. For example, Monte Carlo methods generate approximate answers using a huge number of random samples. A process failure that causes the loss of some of these samples can be tolerated by either ignoring the lost samples and producing a less accurate result or replacing lost samples with new ones. Checkpointing or replication would be unnecessarily expensive for these algorithms.
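As a simple illustration of the checksum idea (an example of ours, not taken from Huang and Abraham's paper): augment the vector x = (3, 5, 7) with an unweighted checksum c1 = 3 + 5 + 7 = 15 and a weighted checksum c2 = 1·3 + 2·5 + 3·7 = 34. If a bit-flip turns the second element into 13, recomputing the sums gives 23 and 50; the discrepancies d1 = 23 − 15 = 8 and d2 = 50 − 34 = 16 reveal that an error occurred, locate it at position d2/d1 = 2, and allow the correct value to be restored as 13 − d1 = 5.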

2.2.3 Performance

An ideal resilient runtime system is described by Meneses et al. [2014] as "one that keeps the same execution speed despite the failures in the system". In the performance part of the taxonomy, we describe common mechanisms for minimizing the impact of failures on the overall system performance.

2.2.3.1 Performance Recovery

This property describes common mechanisms applied in distributed runtime systems for delivering sustainable performance in the presence of process failures. It only applies to runtime systems that support on-line recovery, in which a failure does not shut down the system. Hence, recovering an application does not require a manual restart.

Shrinking and Non-Shrinking Recovery

The failure of a process demands shifting the work of this process to one or more other processes. Non-shrinking recovery avoids overloading the remaining processes with additional tasks by adding new processes to compensate for the failed ones. On the other hand, shrinking recovery continues the computation using the remaining processes. Non-shrinking recovery is a highly favorable mechanism for statically balanced applications, not only for achieving better performance, but also because it relieves the programmer from the complexities of distributing the workload over an arbitrary number of processes [Laguna et al., 2014]. However, allocating new resources at runtime may be costly and may not even be required for certain applications. A master-worker application, for example, can use shrinking recovery and rely on its intrinsic dynamic load balancing property to achieve sustainable performance in the presence of failures, without suffering the overhead of dynamic resource allocation.

Global Load Balancing

An unbalanced distribution of the workload has negative implications for performance and resource utilization. For example, a process performing double the work performed by its peers can double the execution time of the computation while causing most of the resources to remain idle. Overloaded processes not only take longer to compute but can also take longer to checkpoint their state. While load balancing is essential for achieving good performance, it incurs overhead when performed dynamically. For that reason, multiple research efforts have targeted the design of efficient load balancing techniques [Saraswat et al., 2011; Lifflander et al., 2012; Fohry et al., 2017]. In this property of our taxonomy, we identify the programming models that provide transparent support for global load balancing — that is, load balancing performed by automatically migrating work units from overloaded processes to underloaded processes.

Speculative Execution

Speculative execution avoids blocking future tasks whose preconditions are yet to be satisfied by executing these tasks based on predicted values for their preconditions. Correct predictions accelerate performance, while wrong predictions result in wasted computation. Speculative execution can be used to speed up failure recovery by launching recovery tasks prior to the actual occurrence of the failure. For example, some big-data systems, such as Hadoop and Spark, are able to detect straggler tasks that execute at a lower speed than expected and take them as an indication of an imminent failure of the nodes they run on. When a straggler task is detected, the runtime system restarts the task on a different node and lets the original task and the duplicate task work in parallel on the same operation. Once one of the tasks completes, the other is killed.

2.3 Resilience Support in Distributed Programming Models

In this section, we review multiple parallel programming models and their resilience support. We start with two programming models that are based mainly on coarse-grained parallelism:

• the message-passing model (Section 2.3.1) and

• the Partitioned Global Address Space (PGAS) model (Section 2.3.2).

Then, we move to a programming model that exposes both coarse-grained and fine-grained parallelism:

• the Asynchronous Partitioned Global Address Space (APGAS) model (Section 2.3.3).

Then, we move to programming models that are based mainly on fine-grained parallelism:

• the actor model (Section 2.3.4) and

• the dataflow model (Section 2.3.5).

Table 2.1 summarizes the characteristics of the different implementations of the above models according to our taxonomy.

2.3.1 Message-Passing Model

The message-passing model is a widely-used programming model for high performance computing. It is based on the Single Program Multiple Data (SPMD) model, which offers coarse-grain parallelism by executing multiple copies of the same program, each having a separate process and address space. Communication between the processes for data sharing and synchronization is performed by message passing. The standardization efforts that started in the early 1990s resulted in the MPI standard, which now has a unique portability advantage that can hardly be found in other runtime systems. The generality of the model has been proven by the diverse styles of algorithms that have been implemented using it for more than two decades. The two main open-source implementations of MPI are OpenMPI [OpenMPI, 2018] and MPICH [MPICH, 2018]. A communicator in MPI refers to a specific group of processes that are allowed to communicate with each other. A collective operation is a global communication function that involves all the processes in the communicator. A core task for an MPI programmer is distributing the application data across the processes. Therefore, we classify MPI under the explicit resource mapping category in our taxonomy. The first release of MPI, MPI-1, provides an application with a fixed set of processes that cannot be extended during execution, and it supports only two-sided communication, in which a send request at the source process is expected to be matched with a receive request at the destination process.

Starting from MPI-2, new functions have been added to support dynamic process creation and one-sided communication. More advanced features have been added in MPI-3, including support for non-blocking collectives, neighborhood collectives, and an improved one-sided communication interface. Unfortunately, the MPI standard still lacks support for process-level fault tolerance. By default, all errors are fatal, and they leave the runtime system in an undetermined state. The lack of fault tolerance in MPI has captured the interest of many research groups, and different methods have been proposed to support system-level and user-level fault tolerance. Despite the simplicity advantage of providing fault tolerance transparently without application changes, the system-level approach has clear limitations: first, it forces one fault tolerance technique on all applications, and second, it prevents the use of algorithmic-based fault tolerance techniques. Meanwhile, the MPI programming model is well-suited for user-level fault tolerance, because its explicit resource mapping policy makes it easy for the programmer to reason about the lost parts of the computation. In the following, we start by reviewing four representative frameworks that provide system-level fault tolerance for MPI applications: MPICH-V, FMI, rMPI, and RedMPI. After that, we review three approaches for extending MPI with user-level fault tolerance: FT-MPI, MPI-ULFM, and FA-MPI.

2.3.1.1 MPICH-V

MPICH-V is a research framework designed for studying different protocols for supporting fault tolerance in MPI without requiring any application changes [Bosilca et al., 2002]. It has been used for comparing coordinated and non-coordinated checkpointing [Lemarinier et al., 2004], as well as comparing blocking and non-blocking implementations of coordinated checkpointing [Coti et al., 2006]. Process failure is detected by a heartbeat mechanism at the TCP communication layer, which can be configured by special keep-alive parameters. Because MPICH-V is based on MPI-2, dynamic process creation is supported; however, the application is not required to use this capability for fault tolerance. While checkpointing is done automatically, recovering the application requires a manual restart.

2.3.1.2 FMI

The Fault Tolerant Messaging Interface (FMI) [Sato et al., 2014] is a research prototype implementation of MPI that uses in-memory checkpointing for recovering failed processes. It relies on the failure detection capability of the InfiniBand Verbs Application Programming Interface (API), ibverbs, for raising communication errors at the processes directly connected to the failed process. To propagate the failure signal to the rest of the processes, FMI connects the processes in a so-called log-ring overlay network that can propagate the signal in O(log(n)) messages, where n is the number of processes. To relieve the programmer from the complexities of shrinking recovery and to avoid degrading the performance after losing some resources, FMI supports

non-shrinking recovery by replacing a failed process with a spare process allocated in advance. Shrinking recovery is not supported by FMI.

2.3.1.3 rMPI

Process replication has been explored for MPI for its potential to achieve better forward progress than checkpoint/restart on highly unreliable systems, and also for its ability to tolerate both fail-stop and soft errors. rMPI [Ferreira et al., 2011] uses replication to tolerate fail-stop process failures. It implements a replication framework that creates two replicas of each MPI rank, ensures the sequential consistency of the replicas, and performs forward recovery by restarting failed replicas on pre-allocated spare nodes using the state of the corresponding active replicas. rMPI assumes the availability of a RAS service that notifies the active ranks when other ranks fail. Ropars et al. [2015] propose a user-level replication scheme for MPI that aims to achieve more than 50% resource utilization by allowing the replicas to share work rather than performing the same work twice. Application programming interfaces are provided to enable the programmer to define segments of the code that are eligible for work sharing.

2.3.1.4 RedMPI

RedMPI [Fiala et al., 2012] uses replication to detect and even correct soft faults corrupting MPI messages. Each send request at the application level is transparently forwarded to all the replicas of the receiver. The receiver compares the received messages to detect data corruption. If three or more replicas per process are available, voting is applied to discover and drop the corrupted message. Otherwise, RedMPI terminates the application once a soft fault is detected.

2.3.1.5 FT-MPI

FT-MPI [Fagg and Dongarra, 2000] was the first significant step towards providing failure awareness to MPI applications. The application discovers a failure from the return codes of MPI function calls. Communications with a failed process always return an error. Communications with non-failed processes can be configured to either succeed or fail, based on the application requirements. When a communicator detects a communication failure, it immediately propagates the failure information to all the processes of the communicator. A communicator can be recovered by creating a new communicator using existing MPI communicator creation functions whose semantics were modified to allow the following three recovery modes: SHRINK — removes the failed processes and updates the ranks of the remaining processes, BLANK — replaces the failed processes with MPI_PROC_NULL, and REBUILD — replaces the failed processes with new ones. Collective functions were significantly modified to consider the above three modes and to ensure that a failure will not result in producing an incorrect result at the surviving processes. By the above semantic modifications, FT-MPI aims to give applications enough flexibility in defining the appropriate method for recovery. However, the practicality of FT-MPI has been

criticized for introducing significant semantic changes to MPI's functions, which make library composition difficult [Gropp and Lusk, 2004; Bland et al., 2012b].

2.3.1.6 MPI-ULFM

Driven by the increasing demand for a standard fault tolerant specification for MPI, the MPI Fault Tolerance Working Group (FTWG) was formed to meet this objective. The FTWG studied two proposals: Run Through Stabilization (RTS) [Hursey et al., 2011] and MPI User Level Failure Mitigation (MPI-ULFM) [Bland et al., 2012b]. In both proposals, failure reporting is done on a per-operation basis using special error codes, as in FT-MPI. Unlike FT-MPI, both proposals avoid propagating failures implicitly and do not guarantee uniform failure reporting in collective operations. The RTS proposal was rejected by the MPI Forum, citing implementation complexities imposed by resuming communications on failed communicators [Bland et al., 2012b]. Therefore, MPI-ULFM is currently the only active fault tolerance proposal for MPI-4. MPI-ULFM specifies the behavior of an MPI runtime in cases of process failure and adds a minimal set of functions to provide failure propagation, process agreement, and communicator recovery services. The specification covers all MPI functions (blocking/non-blocking, point-to-point/collective, one-sided/two-sided); however, support for one-sided communication is still lacking in the currently released reference implementation, ULFM-2. Failure detection in the reference implementation of MPI-ULFM is based on heartbeating and communication timeout errors, as detailed in [Bosilca et al., 2016]. The new communicator recovery operation MPI_COMM_SHRINK is added to facilitate shrinking recovery. By combining this operation with the existing standard function MPI_COMM_SPAWN, applications can implement non-shrinking recovery using dynamically created processes.

2.3.1.7 FA-MPI

Outside of the MPI FTWG, FA-MPI (Fault-Aware MPI) [Hassani et al., 2015] proposes the addition of a transaction concept to MPI to serve as a configurable unit for failure detection and recovery. Unlike MPI-ULFM, which provides failure reporting at the granularity of a single function call, FA-MPI enables customizing the granularity of failure reporting to cover one or more communication functions. FA-MPI restricts the communication within a transaction to non-blocking communications. The transaction waits for the completion of any pending communication and aggregates any detected failures. Special functions are provided to enable the application to query for the detected failures. FA-MPI allows the application to raise algorithm-specific errors that will also be detected by the transaction; therefore it can be used for handling both soft faults and hard faults. Like MPI-ULFM, performance recovery is the application's responsibility. Both shrinking and non-shrinking recovery functions are provided to support different application requirements. The FA-MPI prototype implementation is limited to detecting failures due to segmentation faults. External daemon processes detect these failures and propagate them to the remaining processes.

2.3.2 The Partitioned Global Address Space Model

Message-passing is an effective programming paradigm for exploiting locality, which is necessary for achieving scalability and good performance. However, it lacks the simplicity of direct memory access provided by shared-memory programming models. The Partitioned Global Address Space (PGAS) model emerged from interest in designing locality-aware shared memory programming models [Coarfa et al., 2005]. To achieve this goal, the PGAS model provides an SPMD model augmented with a global address space that enables inter-process communication without explicit message passing. The global address space is explicitly partitioned into local segments, each owned by a process, which gives the programmer the same control over the locality of the data and its associated computation as MPI does.

2.3.2.1 GASNet

Implementations of the PGAS model rely heavily on low-level networking libraries that provide efficient and portable support for one-sided communication and active messages, such as Global Address Space Networking (GASNet) [GASNet, 2018], the Aggregate Remote Memory Copy Interface (ARMCI) [Nieplocha et al., 2006], and ComEx [Daily et al., 2014]. The increasing criticality of fault tolerance in HPC programming models has attracted multiple research efforts to extend these libraries with fault tolerance support. GASNet [GASNet, 2018] is widely used in many PGAS programming models. Despite its popularity, GASNet still lacks support for resilience against fail-stop process failures. GASNet-EX is an ongoing project that aims to deliver performance and fault tolerance enhancements to GASNet. A first step towards fault tolerance has involved integrating GASNet with the Berkeley Lab Checkpoint/Restart (BLCR) system to enable transparent checkpointing for PGAS applications. Dynamic process creation and support for user-level fault tolerance are among the features planned for GASNet in the context of the GASNet-EX project.

2.3.2.2 UPC

Unified Parallel C (UPC) [El-Ghazawi and Smith, 2006] implements the PGAS model as an extension to ANSI C. A UPC program is organized around a fixed number of execution units (threads or processes) that perform computations on shared data structures. UPC extends the C language with a shared pointer type for declaring global data that can be accessed by any of the execution units. Assignment statements involving shared data implicitly invoke one-sided get and put operations to copy the needed data items from their home locations. Based on shared pointers, global arrays can be allocated and processed in parallel using the upc_forall construct. Aggregate data transfers of shared arrays can be performed using upc_memcpy to achieve better communication performance. A range of constructs is provided for synchronization, including locks, blocking and non-blocking barriers, and memory fences for handling data dependences. UPC supports collective data operations

such as broadcast and allreduce. However, it lacks support for multidimensional arrays and dynamic task parallelism. Zheng et al. [2014] address these limitations by designing the UPC++ library, which provides the basic UPC features in addition to supporting multidimensional arrays and dynamic task parallelism. The task model is based on a restricted form of the async-finish task model of the X10 language. Other methods for supporting dynamic task parallelism in UPC have been proposed in [Shet et al., 2009; Min et al., 2011]. UPC does not provide any resilience support against process failures.

2.3.2.3 Fortran Coarrays

Coarrays started as an extension to Fortran to support parallel processing by applying the PGAS model [Numrich and Reid, 1998; Numrich, 2018]. Coarrays are now a core feature of the Fortran 2008 and Fortran 2018 standards. This model enables parallel execution using a fixed number of images (i.e. processes) that can share data in the form of coarrays. A coarray is a global variable partitioned among images. Unlike UPC, which provides uniform abstractions for accessing local and remote data, the coarray model makes partitioning explicit in the program: to access a particular coarray item, the dimensions of the item and the image number must be specified in the program. The Fortran 2018 standard added the concept of failed images as a means for supporting user-level fault tolerance. Programmers detect image failures by inspecting error codes returned from coarray operations. The new interface failed_images was added to assist programmers in identifying the failed images. Fanfarillo et al. [2019] added these features to OpenCoarrays [OpenCoarrays, 2019] using MPI-ULFM.

2.3.2.4 GASPI

The Global Address Space Programming Interface (GASPI) [Simmendinger et al., 2015] is a fault-tolerant library implementation of the PGAS model that is supported for C, C++ and Python and can interoperate with MPI. GASPI aims to enable efficient use of one-sided communication in high-level programming models by utilizing the multi-core parallelism available in modern architectures. The program can register read/write operations and notification actions in multiple queues that can be processed in parallel by the runtime system. The notification actions are used for synchronization and failure detection. Invoking gaspi_wait on a notification action blocks the caller until all read/write operations started before the notification complete. Each read/write operation is registered with a time-out value. Operations that do not complete within the specified time limit indicate the possibility of a process failure. The program can validate the status of a specific process using the gaspi_state_healthy function. Failure knowledge is expected to be propagated implicitly to all the processes, to enable the program to switch to a global recovery phase. Bartsch et al. [2017] use GASPI failure notification to support online non-shrinking recovery for PGAS applications by applying in-memory coordinated checkpointing.

2.3.2.5 OpenSHMEM

OpenSHMEM [OpenSHMEM, 2017] is an effort to provide a standard programming interface for implementing the PGAS model. The execution model is based on a group of processing elements (PEs), each having a private memory space for allocating non-shared variables, and a so-called symmetric heap for allocating shared variables that can be accessed by any PE. Allocating a shared variable is performed by allocating an image of the variable in the symmetric heap of each PE. A variety of interfaces are provided for memory management, synchronization, one-sided put/get operations, and atomic operations. Different research groups have proposed fault tolerance extensions for OpenSHMEM. Bouteiller et al. [2016] propose extensions to OpenSHMEM to support the same user-level fault tolerance mechanism applied in MPI-ULFM. Garg et al. [2016] use the Distributed MultiThreaded CheckPointing (DMTCP) [Ansel et al., 2009] system to provide transparent system-level checkpointing with offline recovery. Hao et al. [2014a] support coordinated in-memory checkpointing by adding the new collective operation shmem_checkpoint_all, which stores a replica of each symmetric heap at the neighboring PE and raises an error if any of the PEs is dead. The program can respond to a reported failure by shrinking the application to the remaining PEs or by creating new PEs dynamically using the proposed shmem_restart_pes operation.

2.3.2.6 Global Arrays

In addition to general-purpose programming languages, the PGAS model has been used in libraries of distributed data structures. Global Arrays (GA) [Nieplocha et al., 1996] extends MPI with shared array abstractions. Fault tolerance for GA applications has been explored using checkpoint/restart [Tipparaju et al., 2008], matrix encoding techniques [Ali et al., 2011b], and data replication [Ali et al., 2011a].

2.3.2.7 Global View Resilience

Global View Resilience (GVR) [Chien et al., 2015] is another library that provides the same global array functionality as the GA library. Moreover, it provides a data versioning API that supports a wider spectrum of recovery options than coordinated single-version checkpointing. Users control when to create new versions and how to use the available versions to recover the application state after a failure; thus it provides a flexible foundation for ABFT.

2.3.3 The Asynchronous Partitioned Global Address Space Model

The coarse-grain parallelism model provided by MPI and PGAS, although suited to many regular HPC applications, is highly limited in exploiting the fine-grain parallelism available in mainstream architectures. The Asynchronous Partitioned Global Address Space (APGAS) model addresses this limitation by extending the PGAS model with general task parallelism. Each task has an affinity to a particular partition; however, it can spawn tasks at all other partitions. In the following, we describe two widely-known APGAS languages: Chapel and X10. Both languages

initially emerged as part of DARPA’s High Productivity Computing Systems (HPCS) project, which aimed to advance the performance, programmability, portability and robustness of high-end computing systems.

2.3.3.1 Chapel

Chapel [Chamberlain et al., 2007] is an object-oriented APGAS language developed by Cray Inc. It provides a uniform programming model for programming shared-memory systems and distributed-memory systems. For the latter, it supports a number of communication libraries, such as GASNet and Cray's user Generic Network Interface (uGNI), for performing data and active-message transfer operations. Chapel's execution model is organized around a group of multithreaded locales, on which tasks and global data structures can be created. It provides the domain data type for describing the index set of an array, which can be dense, sparse, or in any other user-defined format. A domain map describes how the domain is mapped to a group of locales. Assignment statements involving remote data translate implicitly to one-sided get/put operations. For locality control, the on construct is provided to determine the locale on which a particular active message should execute. In addition to locality control and global-view data-parallelism, Chapel provides simple abstractions for expressing task-parallelism. Task parallelism is supported by the following constructs, which can be composed flexibly to express nested parallelism:

• begin: spawns a new task that can execute in parallel with the current task.

• cobegin: spawns a group of tasks, one for each statement in the cobegin scope, and waits for their termination.

• (co)forall: spawns a group of parallel tasks for processing the iterations of a loop, and waits for their termination.

• sync statement: waits for all dynamically created begins within its scope.

In addition to the sync statement described above, Chapel provides a sync variable type that can also be used for synchronization. A sync variable has a value and a state that can be either full or empty. Reading or writing a sync variable can cause the enclosing task to block depending on the state of the variable [Hayashi et al., 2017]. For example, reading an empty variable or writing to a full variable will block execution until the state changes to the opposite state. Coordinating the different tasks can therefore be achieved by orchestrating their access to shared sync variables. Currently, Chapel is not resilient to process failures. A recent research effort by Panagiotopoulou and Loidl[2015] targeted the challenges of transparently recov- ering the control flow of a Chapel program when locales fail. A copy of each remote task is replicated at the parent locale that spawned the task. When a locale fails, other locales identify the lost tasks and re-execute them locally. The proposed design is limited to tasks that have no side effects. Their future work plans include handling §2.3 Resilience Support in Distributed Programming Models 27

general tasks that alter the data. GASNet was used as the underlying communication layer for Chapel. Because GASNet is not resilient, this study simulated the existence of failures rather than actually killing locales.

2.3.3.2 X10

X10 [Charles et al., 2005] is an object-oriented APGAS language developed by IBM. The language syntax is based on the Java language, with the addition of new constructs for expressing task parallelism and global data. Two runtime implementations are available for X10: Native X10 (which executes X10 programs translated into C++ code) and Managed X10 (which executes X10 programs translated into Java code). A place is the locality unit in X10. The program starts from a root task at the first place and evolves by dynamically creating more tasks at the different places. The programmer specifies the place where a certain task should execute using the at construct. X10 supports nested task parallelism using the async-finish model. The async construct is similar to Chapel's begin construct; it creates an asynchronous task at the current place. The finish construct is similar to Chapel's sync statement; it waits for the termination of all asyncs spawned dynamically within its scope. Using async, finish, and at, the programmer can express not only simple fork-join task graphs but also more complex task graphs with arbitrary synchronization patterns. For handling global data, X10 provides a global reference type that carries a globally unique address for an object; it is similar to the shared pointer primitive of UPC. However, unlike the PGAS models described above, in which a global reference can be used to transfer data implicitly between processes, X10's global references do not permit any implicit communication. To access an object using its global reference, a task must be created at the home place of the object, where the object can be accessed locally. Otherwise, bulk-transfer functions are provided in the standard library to transfer arrays using their global references. The runtime system can implement these functions via one-sided put/get operations or normal two-sided communications (possibly at a higher performance cost). In all cases, data transfer is explicit, and the cost of communication is obvious in the program. Recently, the X10 team at IBM has focused on improving the language's resilience to fail-stop process failures [Cunningham et al., 2014]. They adopted a flexible user-level fault tolerance approach that enables the development of generic as well as algorithmic-based fault tolerance techniques at the application level. Native X10, which is designed for HPC, is currently limited to fixed resource allocation; however, spare places can be allocated in advance to support non-shrinking recovery. Hao et al. [2014b] describe X10-FT, an extension of X10 that supports transparent disk-based checkpointing. The compiler is modified to insert checkpointing instructions at program synchronization points. The places are organized hierarchically; when a place fails, its parent place creates a replacement place and initializes it using the latest disk checkpoint. The PAXOS protocol is used for failure detection (via heartbeating) and for reaching consensus on the status of places. The X10 programming model will be described more fully in Section 2.4.

2.3.4 The Actor Model

The actor model represents a system as a group of concurrent objects, named actors, each owning a private state and an inbox for receiving asynchronous messages from other objects. Based on the content of a received message, an actor may update its own state, send messages to other actors, or create new actors. The actor model encourages over-decomposition of the work into fine-grain components, which is an essential property for harnessing the massive parallelism available in current many-core systems. Location transparency is a common feature in many actor programming models, where mapping the actors to the underlying system resources is done implicitly by the runtime system. The actor model is an attractive model for fault tolerance. The strong encapsulation of data and behavior makes an actor an autonomous entity that may be recovered independently, depending on its level of coupling with other actors. Different implementations of the actor model provide different levels of failure awareness to applications. Some models recover lost actors transparently without any user intervention, while others provide application-level mechanisms for failure detection and recovery. In the following, we describe Charm++, a well-known implementation of the actor model in the HPC domain, followed by a review of three widely-used non-HPC actor implementations: Erlang, Akka and Orleans.

2.3.4.1 Charm++

Charm++ [Kale and Krishnan, 1993; Acun et al., 2014] is the most prominent implementation of the actor model in HPC. It defines a computation as a collection of migratable actors named chares. The runtime system transparently manages the distribution of chares among the processes, and it may migrate the chares for load balancing or for avoiding imminent faults. Charm++ applies a fixed resource allocation policy; however, applications can allocate spare processes to be used for failure recovery. The highly adaptive execution model of the Charm++ runtime system enables it to handle failure recovery without user intervention. Recovery from process-level hard faults has been explored using blocking [Zheng et al., 2004] and non-blocking [Ni et al., 2012] coordinated checkpointing, and using uncoordinated checkpointing via pessimistic message logging [Chakravorty and Kale, 2004] and causal message logging [Meneses et al., 2011]. Recovering from both hard faults and soft faults by combining checkpointing and replication has been supported in the Automatic Checkpoint/Restart (ACR) system [Ni et al., 2013] based on Charm++. Because Charm++ masks process failures from the application, it is not a suitable target for algorithmic-based fault tolerance. It has also been criticized for complexities in expressing global control flows and global communications, which are common in many HPC applications [Jain et al., 2015]. Adaptive MPI (AMPI) [Huang et al., 2003] is a fault tolerant implementation of the MPI programming model on top of Charm++. The MPI processes are implemented as light-weight user-level threads mapped to Charm++ chares. Message passing between

the MPI processes is realized by the message-driven communication of Charm++. By checkpointing the chares to disk, AMPI applications can be restarted after failure using the same or a different number of nodes. Online recovery using proactive migration has also been supported in AMPI [Chakravorty et al., 2006].

2.3.4.2 Erlang

Erlang [Vinoski, 2007] emerged originally from the need to build highly reliable telecommunication systems. Erlang actors are mapped explicitly to the available nodes and cannot migrate automatically for load balancing. An interesting feature of Erlang is the organization of actors in a supervision tree that facilitates recursive fault handling. A failing actor results in an error message being generated at its parent, which can perform certain recovery actions — for example, restarting the failed actor on another node. The root actor is the main guardian that handles failures that escalate to the top of the tree. Therefore, while Charm++ programs are oblivious to failures, Erlang programs are aware of failures because failure handling is defined at the user level.

2.3.4.3 Akka

Akka [Akka, 2018] is a library implementation of the actor model for Scala and Java programs. The programming model provides location transparency for actors (similar to Charm++) and recursive failure handling using supervision trees (similar to Erlang). Akka supports elasticity by allowing compute nodes to join and leave the cluster flexibly during execution. The status of nodes is monitored via a heartbeating mechanism. Actors that were hosted at a failed node can be recreated by their supervisors, possibly using a previously checkpointed state.

2.3.4.4 Orleans

Orleans [Bykov et al., 2011] is an actor-based elastic framework for cloud computing. Aiming to provide a higher-level programming model than Erlang and Akka, Orleans takes full control over creating, restoring, and deleting actors. It supports location transparency and integrates multiple mechanisms for handling process failures [Bernstein et al., 2014]. It uses in-memory replication for improving the system's availability, disk-based checkpointing for restoring lost actors, and resilient transactions for handling atomic actions on multiple actors.

2.3.5 The Dataflow Programming Model

The dataflow programming model expresses a computation as a graph in which nodes represent tasks and edges represent dependencies between the tasks. Similar to the actor model, the dataflow model encourages work over-decomposition and decoupling the work units (tasks) from the processes that execute them. However, there are two major differences between the two models. First, in the actor model, tasks and their data are strongly coupled and always co-located, while in the dataflow model, tasks and data are decoupled, which makes locality optimizations more challenging.

Second, in the actor model, the programmer implicitly encodes the dependencies using synchronization instructions that enforce scheduling constraints on the actions, while in the dataflow model, the programmer explicitly defines the task dependencies to the runtime system [Aiken et al., 2014]. Making the runtime system aware of the dependencies enables it to automatically handle task scheduling, synchronization, and overlapping communication and computation. In the following, we review different distributed dataflow systems and describe their resilience features.

2.3.5.1 OCR

The Open Community Runtime (OCR) [Mattson et al., 2016] is a dataflow runtime system designed as a low-level target for high-level task-based programming models. The main concepts of the OCR programming model are: event-driven tasks (EDTs), events, datablocks, and hints. EDTs are non-blocking tasks that may have dependencies on certain datablocks and/or other tasks. A datablock is the storage unit of the application. The runtime manages the placement, access control, and replication decisions of datablocks. The location and number of copies of a datablock can change at runtime for locality handling, failure avoidance, or load balancing. Events are used to represent control and data dependencies between tasks. Although task scheduling and data placement decisions can be handled automatically, OCR programmers can provide performance tuning hints to influence these decisions. OCR is well-suited for supporting resilience; however, current implementations are not resilient to soft faults or hard faults.

2.3.5.2 Legion

The Legion [Bauer et al., 2012] programming model makes locality control central to its dataflow model by organizing the computation around tasks and logical regions. The logical region abstraction enables the program to express a hierarchical view of the application data by partitioning the data into regions and sub-regions. The regions can be disjoint or overlapping and can be accessed with a particular privilege by different tasks (i.e. read-only, read-write, or reduce). Legion's distributed work-stealing scheduler takes into account the locality information of the regions, the task dependencies, and the access privileges to extract hidden parallelism in the program. To extract even more parallelism, Legion supports speculative execution for predicated tasks whose execution depends on a special boolean parameter (for example, a parameter representing the convergence status of an algorithm). Based on the runtime's prediction of this boolean parameter, a task can start execution before the parameter value is evaluated. Mispredictions are handled by ignoring the results of the speculated tasks and rolling back any changes they applied to the regions. Bauer [2014] outlines a transparent mechanism for recovering from soft faults and hard faults by restarting failed tasks. Failure detection is assumed to be done by an external system. However, no performance results are reported for Legion applications under failure. Because Legion is based on the non-resilient

communication library, GASNet, we do not expect it to survive any process failures in practice.

2.3.5.3 NABBIT and PaRSEC

NABBIT [Kurt et al., 2014] and PaRSEC [Cao et al., 2015] are dynamic task-based runtime systems. Both support resilience against soft errors. NABBIT uses task re-execution without checkpointing for recovery. On the other hand, PaRSEC supports three recovery models: task re-execution, task re-execution with checkpointing, and ABFT. NABBIT implements a distributed work-stealing scheduler, whereas PaRSEC lacks this feature. Tolerating fail-stop process failures has not yet been supported in either system. Other examples of task-based dataflow systems for HPC include HPX [Kaiser et al., 2014], StarPU [Augonnet et al., 2011], and Realm [Aiken et al., 2014]. However, none of them is resilient to failures.

2.3.5.4 Spark

From the big-data domain, Spark [Zaharia et al., 2012] is a widely-used distributed dataflow system optimized for iterative processing and interactive data mining. The execution of a Spark job starts at a driver node, which transforms the job into a directed acyclic graph and submits its tasks to the available worker nodes. Spark offers a data abstraction called Resilient Distributed Datasets (RDD), a form of immutable distributed collections that are cached in memory. When a worker node fails, an RDD may lose some of its partitions. In order to restore lost partitions, Spark tracks the RDD lineage, which is the history of operations that were performed to create an RDD. The driver node replays the lineage operations to recover lost RDD partitions. For applications with long computation lineages, user-controlled checkpointing can be applied to speed up recovery using the last checkpoint rather than recomputing the RDD from the initial application state. Spark applies speculative execution by monitoring the progress of tasks and resubmitting straggler tasks on other nodes.

2.3.6 Review Conclusions

Table 2.1 summarizes the resilience characteristics of the distributed processing systems discussed in this section. We make the following observations:

• user-level recovery of process hard faults is gaining increasing adoption in MPI and PGAS, but is absent in the dataflow model.

• limited research is targeting soft faults in parallel programming models.

• non-shrinking recovery is the most adopted method for performance recovery in SPMD programming models, and global load balancing is the most adopted method in the actor and dataflow models. Speculative execution is more popular in big-data systems, such as Hadoop and Spark, than in HPC systems.

Table 2.1: Distributed processing systems resilience characteristics.

Columns — Adaptability: resource allocation (Alloc), resource mapping (Map); Fault Tolerance: fault type (FType), fault level (FLevel), recovery level (RLevel), fault detection (Detect), fault tolerance technique (FT technique); Performance: performance recovery (Perf recovery).

Legend — Alloc: d = dynamic, f = fixed, f* = fixed (+spare). Map: i = implicit, e = explicit. FType: h = hard, h* = hard (design only), s = soft. FLevel: t = task, p = process. RLevel: u = user, s = system. Detect: hb = heartbeat, ce = comm. error, ex = external, p = prediction, di = data inspection. FT technique: c/r = checkpoint/restart, rep = replication, mig = migration, tr = task restart, tx = transaction, ABFT = algorithmic-based fault tolerance. Perf recovery: sh = shrinking, nsh = non-shrinking, glb = global load balancing, spec = speculative execution.

Distributed System     Alloc  Map  FType  FLevel  RLevel  Detect   FT technique    Perf recovery
MPI:
  MPI-1                f      e    -      -       -       -        -               -
  MPI-2/3              d      e    -      -       -       -        -               -
  MPICH-V              d      e    h      p       s       hb       c/r             -
  FMI                  f*     e    h      p       s       ce       c/r             nsh
  rMPI                 f      e    h      p       s       ex       rep             nsh
  RedMPI               f      e    s      p       s       di       rep             -
  AMPI                 f      e    h      p       s       ce, p    c/r, mig        glb
  FT-MPI               d      e    h      p       u       ce       ABFT            sh/nsh
  MPI-ULFM             d      e    h      p       u       hb/ce    ABFT            sh/nsh
  FA-MPI               d      e    h/s    t/p     u       ex       tx              sh/nsh
PGAS:
  UPC                  f      e    -      -       -       -        -               -
  F2008 Coarrays       f      e    -      -       -       -        -               -
  F2018 Coarrays       f      e    h      p       u       ce       ABFT            -
  GASPI                d      e    h      p       u       ce       ABFT            sh/nsh
  GASPI (C/R)          f*     e    h      p       s       ce       c/r             nsh
  OpenSHMEM            d      e    h      p       u/s     ce       ABFT, c/r       sh/nsh
APGAS:
  Chapel               f      e    -      -       -       -        -               -
  Chapel (prototype)   f      e    h*     p       s       ce/ex    rep             -
  X10                  f      e    h      p       u       ce       ABFT            -
  X10-FT               d      e    h*     p       s       hb       c/r             nsh
Actor:
  Charm++              f      i    h      p       s       ce       c/r             glb
  Charm++ ACR          f*     i    h/s    p       s       ce & di  c/r & rep       glb/nsh
  Erlang               d      e    h      p       u       ex       ABFT            sh/nsh
  Akka                 d      i    h      p       u       hb       ABFT, c/r       glb
  Orleans              d      i    h      p       s       hb       c/r, tx, rep    glb
Dataflow Systems:
  OCR                  f      i/e  -      -       -       -        -               -
  Legion               f      i/e  h*     t/p     s       ex       tr              glb, spec
  PaRSEC               f      i/e  s      t       u/s     di       ABFT, c/r, tr   -
  NABBIT               f      i    s      t       s       di       tr              glb
  Spark                d      i    h      p       s       hb       c/r, tr         spec

(A dash indicates that the system provides no support in that category.)

• checkpoint/restart is the most adopted fault tolerance technique. Checkpointing frameworks that provide online recovery rely on coordinated neighbor-based in-memory checkpointing; examples include FMI, GASPI, and OpenSHMEM.

• to date, PGAS and APGAS parallel programming models have not utilized transactions for achieving fault tolerance. Recent advances in supporting resilient transactions for PGAS-based distributed systems are promising for addressing this gap [Dragojević et al., 2015; Chen et al., 2016].

• failure detection is typically achieved by receiving communication errors or by heartbeating.

• there is a high demand for extending GASNet with process-level fault tolerance due to its wide adoption in many programming models, such as UPC, Legion, Chapel, and OCR.

We now consider the question: which programming model is most suitable for supporting multi-resolution resilience in the presence of process-level hard faults? Multi-resolution resilience is a special type of user-level resilience that gives the programmer control over optimizing the application's resilience performance. Handling process-level failures at the application level requires application awareness of process boundaries, so that the application can identify the lost work when failures occur. Process-level parallelism is an essential programming abstraction only in the message-passing, PGAS, and APGAS models¹. Therefore, we consider these models more adequate for supporting multi-resolution resilience than the actor and dataflow models. The other two factors of multi-resolution resilience are performance and productivity. Only the APGAS model provides both locality-awareness and fine-grained parallelism, which are essential features for achieving scalability. Productivity is also a key objective for APGAS languages, achieved by providing a global address space and general support for task parallelism via high-level constructs. Therefore, we conclude that the APGAS model is the most suitable model to reconcile resilience, performance and productivity for multi-resolution resilience. In this thesis, we study multi-resolution resilience in the context of the X10 language for the following reasons. The initial steps taken towards supporting resilience in X10 [Cunningham et al., 2014] make it a promising target for a deeper study of the practicality of the proposed model, in terms of performance and productivity. The lack of fault tolerant Remote Direct Memory Access (RDMA) support in most communication libraries hinders many PGAS languages from supporting resilience. Fortunately, X10 is not bound to using RDMA operations; its active messages and data transfers can be implemented entirely using two-sided communication operations. This allows us to focus on the internals of control flow resilience and data resilience, rather than the challenges of supporting fault tolerant RDMA operations.

¹ In other words, giving users control over mapping work units to processes is a necessary feature in MPI, PGAS and APGAS, but optional in actor and dataflow.

2.4 The X10 Programming Model

X10 models a parallel computation as a directed acyclic task graph (task-DAG) that spans a group of places. A place is an abstraction for an operating system process that contains a collection of data and the tasks operating on these data. The X10 type x10.lang.Place represents a place, and x10.lang.PlaceGroup represents a collection of places. The method Place.places() returns a group of all the places currently available to the computation. The failure of places or the creation of new places dynamically alters this group, as we will explain in Section 2.4.5.
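As a minimal illustration (not taken from the original text), the following sketch enumerates the available places; it assumes the program was launched with several places.

val pg = Place.places();   // the group of currently available places
Console.OUT.println("running with " + pg.numPlaces() + " places; this task runs at " + here);
for (p in pg) {
    Console.OUT.println("place id: " + p.id);
}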

2.4.1 Task Parallelism

Each task has an affinity to a specific place defined by the user. A task is allowed to spawn other tasks locally or remotely and to block during execution to wait for the completion of other tasks. The following three constructs are provided by X10 for expressing task parallelism.

• The async construct: for spawning a new task.

• The at construct: for specifying task affinity.

• The finish construct: for task synchronization.

An X10 program dynamically generates an arbitrary task DAG by nesting async, at, and finish constructs. The async construct spawns a new task at the current place such that the spawning task (the predecessor) and the new task (the successor) can run in parallel. To spawn an asynchronous task at a remote place p, at is used with async as follows: at (p) async S. Any values at the spawning place that are accessed by S are automatically serialized and sent to place p. The finish block is used for synchronization; it defines a scope of coherent tasks and waits for their termination. Exceptions thrown from any of the tasks are collected at the finish and wrapped in a MultipleExceptions object that is thrown after finish terminates. Note that at is a synchronous construct, which means that at (p) S (without async) does not return until S completes at place p. An error raised while executing S synchronously is thrown by at in an Exception object.
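The following sketch (not from the original text) illustrates this exception-collection behavior; it assumes p1 and p2 are valid Place values.

try {
    finish {
        at (p1) async { throw new Exception("task A failed"); }
        at (p2) async { /* completes normally */ }
    }
} catch (me:MultipleExceptions) {
    // finish wraps all exceptions thrown by tasks within its scope
    for (e in me.exceptions) {
        Console.OUT.println(e.getMessage());
    }
}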

2.4.2 The Happens-Before Constraint

The finish and at constructs define synchronization points in the program that force happens-before constraints between statements at different places. The use of finish in the following code implies that statement S must happen before statement E. Any task created transitively by S delays the completion of the finish statement until after the termination of the task.

finish S;
E;

The use of at in the following code implies that statement S at place q must happen before statement E at place p. Unlike finish, at does not wait for other asynchronous tasks spawned by S.

// from place p
at (q) S;
E;

On the other hand, the async construct is used to express possible parallelism between statements. For example, the use of async in the following code implies that statement S at place q and statement E at place p can proceed in parallel.

// from place p
at (q) async S;
E;

Figure 2.2 shows a sample X10 program with the corresponding task DAG. The first finish block F1 tracks two asynchronous tasks a and b. Task b creates the second finish block F2, which tracks task c and its successor task d. Thanks to F2, tasks c and d must complete before the message at Line 9 is printed. Similarly, thanks to F1, all four tasks must complete before the message at Line 12 is printed. These happens-before relations must always be maintained, even in the presence of failures.

 1 finish { /* F1 */
 2   at (p1) async { /* a */ }
 3   at (p2) async { /* b */
 4     finish { /* F2 */
 5       at (p3) async { /* c */
 6         at (p4) async { /* d */ }
 7       }
 8     }
 9     printf('tasks c and d completed');
10   }
11 }
12 printf('tasks a, b, c, and d completed');

[Task graph: finish F1 at p0 tracks task a at p1 and task b at p2; within b, finish F2 at p2 tracks task c at p3 and its successor task d at p4.]

Figure 2.2: A sample X10 program and the corresponding task graph. A dotted box represents a finish scope.

2.4.3 Global Data

Each object has a fixed affinity to a specific place and can only be manipulated by tasks running at the same place. A place can expose global references to its objects to enable global data access. Two types are available for that purpose: x10.lang.GlobalRef and x10.lang.PlaceLocalHandle. The first creates a global reference to a single object. The second creates a global reference to a family of objects, one per place of a certain place group.

X10 requires users to explicitly handle data movement, thereby making the cost of remote access obvious in the program. Depending on the object size and other possible criteria, a choice can be made between transferring a deep copy of the object or only sending its global reference to the destination task. A task that holds a global reference can reach the remote object indirectly by spawning tasks back at the object's home place. The following code is an example of accessing a remote object of type ResultContainer using a global reference. Dereferencing a global reference or a place-local handle is done using the operator (), as shown in Line 7. The atomic statement at Line 6 guards against possible races due to accessing the result object from concurrent tasks.

 1 val result = new ResultContainer();
 2 val gr = GlobalRef[ResultContainer](result);
 3 at (p) async {
 4   val partial = compute_partial_result();
 5   at (gr.home) async {
 6     atomic {
 7       val _result = gr();
 8       _result.merge(partial);
 9     }
10   }
11 }
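For completeness, the following sketch (not from the original text) shows the PlaceLocalHandle counterpart: one local array is allocated at every place of a place group, and dereferencing the handle at a place yields that place's own copy.

val pg = Place.places();
val plh = PlaceLocalHandle.make[Rail[Double]](pg, () => new Rail[Double](1024));
finish for (p in pg) {
    at (p) async {
        val localData = plh();              // this place's own Rail[Double]
        localData(0) = here.id as Double;   // each place updates its local copy
    }
}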

2.4.4 Resilient X10

Resilient X10 is a recent extension to X10 that adds support for user-level fault tolerance. It focuses on fail-stop process failures, in which the failure of a place results in the immediate loss of its tasks and data and prevents it from communicating with other places. When such failures occur, tasks and finish objects may be lost, disconnecting parts of the computation from the task DAG. A key contribution of the Resilient X10 work is the design of a termination detection protocol that is capable of repairing the computation DAG after failures. It applies an adoption mechanism that enables a grandparent finish to adopt its orphaned grandchildren tasks and attach them back to the computation's DAG. The protocol is designed to adhere to the following principle, named the Happens-Before Invariance (HBI) principle:

failure of a place must not alter the happens-before relation between state- ment instances at other places [Cunningham et al., 2014]

Let us examine the HBI principle in the context of Figure 2.2. Assume that place p2 died after tasks c and d were created. Now tasks c and d are orphaned because their parent finish F2 is lost. Suppose F1 ignored these tasks and terminated. That would allow Line 12 to run in parallel with c and d, which breaks the happens-before relation that the program defined. There are two options for handling tasks c and d while preserving the happens-before relation: either kill them or have their grandparent adopt them (i.e. force F1 to wait for c and d). Because killing the tasks would leave the

heap in an undetermined state, Cunningham et al. [2014] designed their fault-tolerant termination detection protocol based on adoption.

2.4.5 Elastic X10

Elasticity is the system's ability to shrink and expand the resources it uses at run-time. It is an attractive feature for fault tolerance as it enables applications to leverage non-shrinking recovery. Elasticity support has recently been added to the X10 language through dynamic place creation. New places can be created externally or requested in the program by calling System.addPlaces(n) or its synchronous version System.addPlacesAndWait(n). Currently, dynamic place creation is only implemented in the Managed X10 runtime. Native X10 programs need to allocate spare places in advance to use non-shrinking recovery. Bungart and Fohry [2017b] addressed this limitation by adding elasticity support for native X10 programs running over MPI-ULFM; however, their implementation is not yet published.
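A minimal sketch of dynamic place creation (not from the original text), assuming a Managed X10 runtime that supports elasticity:

val before = Place.places().numPlaces();
System.addPlacesAndWait(2);               // blocks until two new places have joined
val after = Place.places().numPlaces();   // the new places appear at the end of the group
Console.OUT.println("place count grew from " + before + " to " + after);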

2.4.5.1 The PlaceManager API

The typical method for identifying the group of available places in X10 is by calling Place.places(). It returns a sorted PlaceGroup indexed by the place id. Changes in the available places are reflected in Place.places() — failed places are removed, and dynamically created places are included at the end of the group. For example, in Figure 2.3, after the failure of place 1, place 5 was added to the end of the group. The X10 runtime does not reuse the ids of failed places for the new places. Because a newly added place has neither the same identity nor the same order, reorganizing the places manually to make them more aligned with the structure used before the failure might be unavoidable in many applications.

[Figure: the sorted place group returned by Place.places() is initially 0 1 2 3 4; after place 1 fails and one place is added with addPlacesAndWait(1), Place.places() returns 0 2 3 4 5 — the failed id 1 is not reused, and the new place 5 is appended at the end of the group.]

Figure 2.3: Place management with Place.places().

To simplify place management for non-shrinking recovery, the X10 team added the class PlaceManager to the standard library. A PlaceManager pm splits the available places into active places and spare places, as presented in Figure 2.4. The program is expected to distribute its work over the group of active places identified

by pm.activePlaces(). When an active place fails, the program rebuilds the activePlaces group by calling pm.rebuildActivePlaces(). This method replaces each failed place with a spare place in the same ordinal location. When a spare place is activated, a new spare place is created asynchronously to replace it, provided the runtime system supports dynamic place creation. The result of pm.rebuildActivePlaces() is a ChangeDescription instance that includes the list of addedPlaces() and the list of removedPlaces(). The program can use this knowledge to initialize the newly added places as a precondition for resuming the computation.

[Figure: initially the active places are 0 1 2 3 4 and the spare places are 5 6; after place 1 fails, rebuildActivePlaces() replaces it with spare place 5 in the same ordinal position, giving active places 0 5 2 3 4, while a new spare place 7 is added asynchronously.]

Figure 2.4: Place management with the PlaceManager class.

Note that the three methods Place.places(), pm.rebuildActivePlaces(), and pm.activePlaces() are all local functions. They aim to reflect the state of the system as viewed by the calling place. Therefore, different results may be returned at different places, depending on when a certain place detects the failure or the addition of a place. However, in practice, identifying the available places is often required only by the main task at place zero, which supervises the execution of the program. The following code shows a typical example of using the PlaceManager API in a resilient X10 program.

def execute(pm:PlaceManager) {
    var success:Boolean = false;
    while (!success) {
        try {
            val pg = pm.activePlaces();
            compute_over(pg); /* performs the computation over
                                 the given place group */
            success = true;
        } catch (e:DeadPlaceException) {
            pm.rebuildActivePlaces();
        }
    }
}

2.4.5.2 Place Virtualization

Because the PlaceManager preserves the order of the active places throughout the execution, the index of a place in the activePlaces group can be used in the program as a virtual identifier for the place, rather than using the physical place id. By following this simple coding strategy, minimal changes are required to add resilience to existing source code. We follow this strategy in the implementation of all resilient applications presented in this thesis.

2.5 Summary

This chapter presented a taxonomy of resilient programming models and a broad overview of resilience features in different programming models. It described the rationale behind the choice of the APGAS model for supporting multi-resolution resilience and described the details of the X10 programming model. In the following chapter, we describe how we improved the performance scalability of RX10 using MPI-ULFM — an efficient fault-tolerant implementation of MPI.

Chapter 3

Improving Resilient X10 Portability and Scalability Using MPI-ULFM

This chapter describes how we enabled RX10 to scale to thousands of cores in supercomputer environments using MPI-ULFM. It details the matching semantics between MPI-ULFM and the transport layer of RX10. It also describes the methods used to deliver global failure detection for resilient finish, and to speed up the performance of X10's fault-tolerant collectives. We evaluate our implementation using microbenchmarks that demonstrate the efficiency and scalability advantages that RX10 applications can achieve by switching from the socket transport to the MPI-ULFM transport. The experiments focus on failure-free scenarios only. We evaluate the performance of applications in failure scenarios in Chapter 5. After the introduction in Section 3.1, we describe how X10 uses MPI in non-resilient mode in Section 3.2. Then we describe the MPI-ULFM specification in Section 3.3. After that we explain how we integrated MPI-ULFM with the X10 runtime system in Section 3.4. We conclude with the performance evaluation in Section 3.5. This chapter is based on work published in the paper "Resilient X10 over MPI user level failure mitigation" [Hamouda et al., 2016].

3.1 Introduction

MPI is the standard communication API on supercomputers. Compared to other communication APIs, such as TCP sockets or Infiniband verbs, MPI is simpler to use for writing distributed programs. For example, MPI transparently establishes the connections between processes, assigns a rank for each process, and matches a rank to its IP address and port on behalf of the programmer. In contrast, a socket API would require the programmer to manually handle these details. MPI implementations are portable over different networking hardware, and they often strive to make the best use of the available hardware capabilities to provide optimized point-to-point and collective operations. While supercomputers are dominated by high-performance computing applications following MPI’s SPMD model, cloud environments have


more diverse applications with different networking requirements. Cloud applications are more likely to use lower-level communication APIs than MPI to directly establish, monitor, and destroy the connections between processes. TCP socket APIs are commonly used for inter-process communication in cloud environments. When a process fails, connections to that process are invalidated and errors are raised at the other end of the connection. Based on this failure notification capability, fault-tolerant applications and runtime systems can be built on top of sockets. X10 can be used for developing high-performance computing applications as well as cloud-based services. It supports both MPI and TCP sockets for inter-process communication. Figure 3.1 shows the layered architecture of the X10 runtime. The middle layer, X10 Runtime Transport (X10RT), is an abstract transport layer that delegates communication requests from the runtime to one of the concrete transport implementations. The latest version of X10 at the time of writing (v2.6.1) provides three X10RT implementations: a TCP sockets implementation, an MPI implementation, and an implementation for the Parallel Active Message Interface (PAMI) — IBM's proprietary API for inter-process communication on supercomputers. The X10 runtime executes the tasks using a pool of worker threads. The communication activities of the tasks are performed by the same worker threads. To avoid blocking the worker threads, which may lead to deadlock, the X10 runtime requires non-blocking communication support from the underlying transport layer.

Figure 3.1: X10 layered architecture.

The initial development of RX10 was primarily concerned with bringing X10 to cloud computing environments. Therefore, it only made resilient the portions of the X10 runtime stack that were appropriate for a cloud computing environment. Thus, only X10RT-Sockets was modified to survive place failure. X10RT-Sockets opens Secure Shell (SSH) connections to launch processes at remote nodes — a mechanism that is often not permitted by supercomputers for security reasons. On the other hand, the majority of supercomputers permit launching processes using MPI. Because neither MPI nor PAMI is fault-tolerant, RX10 was not portable to supercomputers. Experimental evaluations ran on small clusters and scaled to only a few hundred X10 places. At the same time RX10 was being implemented, the MPI FTWG and members of the Innovative Computing Laboratory at the University of Tennessee were actively developing MPI-ULFM [Bland et al., 2012a,b, 2013], a fault tolerance specification for MPI. The proposed specification is planned to be part of the upcoming MPI-4 [2018]

standard. The MPI-ULFM team publishes the latest updates about the project on the fault tolerance research hub (fault-tolerance.org). A reference implementation of MPI-ULFM is available based on the open source MPI implementation OpenMPI [2018]. The first release, ULFM-1, was based on an old version of OpenMPI (v1.7) compatible with the MPI-2 standard, which does not support non-blocking collectives. ULFM-1 also did not have proper support for thread safety. ULFM-2 was released in April 2017 based on OpenMPI (v3.0), compatible with the MPI-3 standard. It provides non-blocking collectives and lightly tested support for fault tolerance with multiple threads. ULFM-2 is used as a basis for our experimental analysis in Section 3.5. The popularity of MPI-ULFM has been increasing, and many researchers started relying on it for building resilient MPI applications. Pauli et al. [2013] used ULFM for resilient Monte Carlo simulations. Ali et al. [2014, 2015] implemented exact and approximate recovery methods for 2D and 3D Partial Differential Equation (PDE) solvers over ULFM. Laguna et al. [2014] provided a programmability evaluation for ULFM in the context of a resilient molecular dynamics application. The development of MPI-ULFM served as a promising opportunity for us to shift RX10 to supercomputer scale. In the following sections, we detail how we integrated MPI-ULFM with X10 to provide failure notification and optimized fault-tolerant collectives at large scales. To the best of our knowledge, our work is the first to evaluate ULFM in the context of a high-level language.

3.2 X10 over MPI

In this section, we describe how X10 uses MPI with a focus on initializing X10 places, handling active messages, and handling collective operations.

3.2.1 Initialization

X10 programs can choose between sockets, MPI, or PAMI as a compilation target. When MPI is chosen, X10 places are launched as normal MPI processes. They start by initializing the default communicator MPI_COMM_WORLD by calling MPI_INIT_THREAD(..) and specifying the desired thread support. MPI provides four levels of thread support:

• MPI_THREAD_SINGLE: for single-threaded MPI processes.

• MPI_THREAD_FUNNELED: for multi-threaded MPI processes, when MPI calls are done by the main thread only.

• MPI_THREAD_SERIALIZED: for multi-threaded MPI processes, when only one thread at a time can call MPI.

• MPI_THREAD_MULTIPLE: for multi-threaded MPI processes, when multiple threads can call MPI concurrently.

X10 places are multi-threaded, and all the threads can invoke communication actions. Thus only MPI_THREAD_SERIALIZED and MPI_THREAD_MULTIPLE are feasible for X10. X10RT-MPI uses MPI_THREAD_MULTIPLE by default, unless the MPI implementation does not support it or the user explicitly chooses the serialized mode. Due to the lack of stable support for MPI_THREAD_MULTIPLE in current MPI-ULFM implementations, we changed the default behaviour of X10RT-MPI to use the serialized threading mode for any MPI-ULFM implementation. After a successful initialization, the default communicator MPI_COMM_WORLD is used throughout the execution of the X10 program.
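The initialization step described above can be sketched in plain MPI C code as follows. This is an illustrative fragment rather than the actual X10RT-MPI source, and the error handling shown is minimal.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        int provided;
        /* With MPI-ULFM, X10RT-MPI requests the serialized threading mode and
           serializes MPI calls from its worker and immediate threads itself. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &provided);
        if (provided < MPI_THREAD_SERIALIZED) {
            fprintf(stderr, "insufficient MPI thread support: %d\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* MPI_COMM_WORLD is then used for the rest of the execution. */
        MPI_Finalize();
        return 0;
    }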

3.2.2 Active Messages

An active message encapsulates a function to be executed at a destination process. The at construct is used to send active messages in X10. For example, at (q) async S; sends an active message to execute statement S at place q asynchronously. X10 relies on X10RT to perform the required communication for exchanging active messages.

Figure 3.2: X10 active messages implementation over MPI. Some of the method names shown in the figure differ slightly from the actual implementation for clarity. The error variables colored in red, at steps 5, 6, and 8, mark the points at which MPI-ULFM may notify process failures, as will be discussed in Section 3.3.2.

Figure 3.2 shows the flow of actions by the X10 runtime and its MPI transport layer for sending an active message. On invoking at (q) async S;, the active message S is serialised into a send buffer named msg that is then passed to the abstract method x10rt_send_msg(&msg) in X10RT. This method invokes the corresponding concrete implementation x10rt_mpi_send_msg(&msg), which uses MPI to send the buffer msg. X10 uses non-blocking two-sided communication operations for transferring data and active messages between places. The sender place invokes MPI_Isend(..) and passes, among other parameters, the send buffer msg and a handle named mpi_req that can be used for checking the status of this send request (see Figure 3.2-step 4). MPI applications can use MPI_Wait(..) to block a process that issued a non-blocking

request until the request completes. However, as mentioned in Section 3.1, blocking an X10 place may lead to deadlocks. Therefore, X10 uses the alternative non-blocking method MPI_Test(..) to check the completion status of its send and receive requests given their handles. Each place maintains a queue of pending requests and periodically checks their completion status. In Figure 3.2, the sender repeatedly calls MPI_Test(..) in step 5 to check the completion status of the send request issued in step 4. When the send request completes, X10RT-MPI removes its handle from the pending requests queue and expects the message to eventually reach its destination. Each X10 place actively probes for incoming messages from other places by repeatedly calling the non-blocking function MPI_Iprobe(..) and passing the wildcard value MPI_ANY_SOURCE as the source parameter (step 6). When MPI_Iprobe detects the arrival of a message, X10RT-MPI extracts the message's metadata, including its source and number of bytes, and then invokes the non-blocking request MPI_Irecv(..) to receive the message (step 7). A handle to this receive request is stored in the pending requests queue and is periodically checked for completion using MPI_Test(..) (step 8). When the message is received, X10RT-MPI passes it to the X10 runtime by calling msg_received(&msg) (step 9). The message is then deserialised to create the activity S (step 10), which is finally pushed to the work-stealing scheduler of X10 to eventually execute it (step 11). It is worth noting that MPI has its own definition of what makes a request complete. A receive request is considered complete after the message is fully received. However, a send request is considered complete when the application can reuse the send buffer. If the send buffer is relatively small, MPI may copy it to an internal buffer and declare the request complete before it starts sending the message. If the send buffer is beyond MPI's capacity, MPI will not be able to copy it to an internal buffer; it will not declare the request complete until after the message is actually sent, to prevent the user from altering the send buffer. In the case of buffering, the source place may fail after declaring a send request complete but before sending the message. This possibility is taken into consideration in RX10. The protocols used for tracking the active messages assume that a message may be lost due to the failure of its source or its destination. These protocols will be described in detail in Chapter 4. To summarise, X10 implements active messages using non-blocking MPI functions (MPI_Isend, MPI_Irecv, and MPI_Test). Each place actively scans its inbox for incoming messages from other places using MPI_Iprobe(MPI_ANY_SOURCE, ..).
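The following C fragment sketches the send/probe/receive cycle described above. It is a simplification: the tag and the function names (send_active_message, poll_for_active_message) are illustrative, and the real X10RT-MPI code keeps a queue of pending requests that it tests periodically instead of completing each request in place.

    #include <mpi.h>
    #include <stdlib.h>

    #define TAG_AM 17   /* illustrative tag for active messages */

    /* Sender: issue a non-blocking send; the request handle is queued and
       later checked with MPI_Test so that worker threads never block. */
    void send_active_message(const char* msg, int len, int dst, MPI_Request* req) {
        MPI_Isend(msg, len, MPI_BYTE, dst, TAG_AM, MPI_COMM_WORLD, req);
    }

    /* Receiver: probe for a message from any place; when one has arrived,
       allocate a buffer of the right size and post a non-blocking receive. */
    int poll_for_active_message(char** out_buf, MPI_Request* recv_req) {
        int arrived = 0;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_AM, MPI_COMM_WORLD, &arrived, &st);
        if (arrived) {
            int nbytes;
            MPI_Get_count(&st, MPI_BYTE, &nbytes);
            *out_buf = malloc(nbytes);
            MPI_Irecv(*out_buf, nbytes, MPI_BYTE, st.MPI_SOURCE, st.MPI_TAG,
                      MPI_COMM_WORLD, recv_req);
        }
        return arrived;   /* the receive request is tested later with MPI_Test */
    }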

3.2.3 Team Collectives

The APGAS model of X10 is well-suited for bulk-synchronous SPMD programs that rely on collective operations for synchronization or data sharing. The following text quoted from [Hamouda et al., 2016] briefly describes how the X10 team implemented collective operations in X10. It refers to two types of MPI collective functions: blocking and non-blocking. A blocking collective function blocks the calling thread until the collective operation completes. A non-blocking collective function returns after initiating the collective operation without waiting for the operation to complete.

X10 contains a collective API similar to MPI, located in x10.util.Team, offering collective operations such as barrier, broadcast, all-to-all, etc. Team attempts to make use of any collective capabilities available in the underlying transport. For transports that provide native collectives, Team maps its operations to the transport collective implementations. For transports that do not provide collectives, such as TCP/IP sockets, Team provides emulated collective implementations. An interesting combination arises when the underlying transport supports some, but not all, of the functionality needed by X10. The X10 thread model requires non-blocking operations from the network transport, because there may be runnable tasks in the thread's work queue, and a blocking network call will prevent that runnable work from completing, leading to possible deadlock. MPI-3 offers non-blocking collectives, but other than barrier these are optional, and MPI-2 only supports blocking collectives. For best performance we still want to make use of these, so our implementation calls an emulated barrier immediately before issuing the blocking collective. This allows us to line up all places so that when they reach the blocking operation, they are all in a position to pass through the collective immediately. [Hamouda et al., 2016]

Figure 3.3 shows an example program using Team for a reduction operation and the three different Team implementations: emulated, native blocking and native non-blocking.

val team = new Team(Place.places());
for (p in Place.places()) at (p) async {
    val src = new Rail[Int](SIZE, (i:Long) => i as Int);
    val dst = new Rail[Int](SIZE);
    team.allreduce(src, 0, dst, 0, SIZE, Team.ADD);
}

[Decision flow shown in Figure 3.3: if X10RT provides no native collectives, call team.emu_allreduce(..); if it provides native non-blocking collectives, call X10RT.allreduce(..) directly; if it provides only blocking native collectives, call team.emu_barrier(..) first and then X10RT.allreduce(..).]

Figure 3.3: Team API implementation options.

3.2.3.1 The Emulated Implementation

The emulated Team implementation organizes the places in a binary tree that facilitates recursive communication between the nodes in a unidirectional way (e.g. for broadcast and gather) or a bidirectional way (e.g. for barrier and allreduce). The collective starts by blocking the parent places in a busy waiting loop until they receive upward notifications from their children, as shown in Line 2 of the pseudocode below. When a parent receives notifications from its children, it switches to the child role and passes the notification to its own parent. A child place notifies its parent by creating a finish that governs a remote asynchronous task at the parent place (Line 4), which serves two purposes. First, it notifies the parent that the child has entered the collective (Line 5), which implies that the places in the child's sub-tree have also entered the collective. Second, it blocks the child until the upper part of the tree enters the collective (Line 6). The completion of the finish returns control back to the child, indicating the completion of the upward notification phase. At this point, the child switches to the parent role and releases its own children, which are blocked locally at Line 6 waiting for the downward notification. If the collective requires data to be transferred, this data may be piggybacked on the notification messages.

1 /* as a parent: */
2 wait_for_upward_notification();
3 /* as a child: */
4 finish at (parent) async {
5     deliver_upward_notification();
6     wait_for_downward_notification();
7 }
8 /* as a parent: */
9 deliver_downward_notification();

3.2.3.2 The Native Implementation

When Team is used in native mode, it implements the collective interfaces by delegating to the underlying MPI implementation. Team surrounds each call to a native MPI collective with a local finish that serves one purpose — ensuring the completion of the native collective, which may be non-blocking. It emulates the execution of the MPI collective as if it were performed by an async statement. It does so by first calling a runtime function that increments the number of finish tasks, then calling the X10RT-MPI function that invokes the MPI collective. X10RT-MPI receives a reference to the surrounding finish and uses it to notify the termination of the emulated async statement after the MPI collective completes. A notification for a blocking MPI collective is sent immediately after the blocking call returns. On the other hand, a non-blocking collective is notified only after MPI_Test reports its completion, in the same way active messages are checked for completion (see Section 3.2.2). The same mechanism is applied with the PAMI transport.

3.3 MPI-ULFM Overview

As high performance computing applications run longer on larger numbers of cores, they are more likely to experience process failures. If HPC programmers do not prepare their applications to deal with process failures, much time and energy will be wasted on supercomputers due to restarting failed computations. MPI-ULFM addresses this problem by extending MPI with fault tolerance features that enable programmers to design a variety of failure recovery mechanisms. It focuses on fail-stop failures in which a process impacted by a hardware or a software fault crashes or stops communicating. The following paragraph quoted from the ULFM specification document [MPI-ULFM, 2018] describes its main design principles at a high level:

The expected behaviour of MPI in the case of an MPI process failure is defined by the following statements: any MPI operation that involves a failed process must not block indefinitely but either succeed or raise an MPI error . . . ; an MPI operation that does not involve a failed process will complete normally, unless interrupted by the user through provided functionality. Errors indicate only the local impact of the failure on an operation and make no guarantee that other processes have also been notified of the same failure. Asynchronous failure propagation is not guaranteed or required, and users must exercise caution when determining the set of processes where a failure has been detected and raised an error. If an application needs global knowledge of failures, it can use the interfaces defined in Section . . . to explicitly propagate the notification of locally detected failures. [MPI-ULFM, 2018]

In this section we describe ULFM’s features in more detail, focusing on the aspects most relevant to a task-based runtime system like X10. Because X10 does not use MPI’s RDMA operations, we do not cover those operations in this thesis.

3.3.1 Fault-Tolerant Communicators

A communicator in MPI represents a group of processes that are allowed to communicate with each other. The main communicator MPI_COMM_WORLD includes all the created processes at application start-up. By default, MPI communicators are not fault-tolerant. When the first error arises, whether it is a memory error, a communication error or otherwise, the runtime system immediately halts. That is because all MPI communicators use MPI_ERRORS_ARE_FATAL as a default error handler. MPI provides a second error handler called MPI_ERRORS_RETURN which, instead of halting the runtime system, reports the error to the calling process. However, for this error handler to be useful for recovering from process failures, the behaviour of the runtime system after a process failure must be specified. ULFM closes the failure specification gap in MPI so that communicator error handlers start to have a practical value for fault tolerance. By assigning an error

handler to the communicator, ULFM can reliably report process failure errors to active processes and permit them to continue communicating according to precisely defined rules. The following three error classes are added in ULFM for reporting process failures:

• MPI_ERR_PROC_FAILED_PENDING: for non-blocking receive operations where the source process is MPI_ANY_SOURCE, this error indicates that no matching send is detected yet, but one potential sender has failed.

• MPI_ERR_PROC_FAILED: for all communication operations, this error indicates that a process involved in the communication request has failed.

• MPI_ERR_REVOKED: for all communication operations, this error indicates that the communicator has been invalidated by the application.

3.3.2 Failure Notification

ULFM specifies the conditions that lead to notifying a process of the failure of another process. Processes that may be notified of a failure are those communicating with a dead process directly (through point-to-point operations) or indirectly (through collectives). When a failure is detected by one process, it is not propagated by default to other processes. Unless propagating the notification is requested explicitly by one process, failure knowledge remains local and may differ between processes. The following are the main rules governing failure notification for non-blocking operations, collectives, and the use of MPI_ANY_SOURCE. When non-blocking communication is used, failure notification is deferred from the initiation functions, such as MPI_Isend or MPI_Iallreduce, to the completion functions MPI_Test and MPI_Wait. For example, in Figure 3.2, we marked the return values that can notify failures in red. If the source place is dead, ULFM will not notify X10 of that failure in step 7, where MPI_Irecv is called. However, it will notify it in step 8, when MPI_Test is called. ULFM provides non-uniform failure reporting for the collective operations [Bland et al., 2013]. Depending on the collective implementation, when a process dies, some processes may report the collective as successful, while others may report it as failed. The wildcard value MPI_ANY_SOURCE is useful for detecting failures of any process in a communicator. When invoking a receive or a probe operation from MPI_ANY_SOURCE, an error is raised when MPI detects the failure of any potential sender.
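The following C sketch shows where these notifications surface in code. The error-class names follow the specification wording used in this chapter (the ULFM reference implementation exposes them with an MPIX_ prefix), and it assumes that an error handler other than MPI_ERRORS_ARE_FATAL is installed on the communicator, so errors are returned rather than aborting.

    #include <mpi.h>

    void check_progress(MPI_Request* req, int* done) {
        int rc, eclass;
        MPI_Status status;

        /* Completion functions, not initiation functions, report failures. */
        rc = MPI_Test(req, done, &status);
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_PROC_FAILED) {
            /* a process involved in this request has failed */
        }

        /* A wildcard probe is notified of the failure of any potential sender. */
        int flag = 0;
        MPI_Status st;
        rc = MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
        MPI_Error_class(rc, &eclass);
        if (eclass == MPI_ERR_PROC_FAILED_PENDING) {
            /* no matching send yet, but a potential sender has failed */
        }
    }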

3.3.3 Failure Mitigation

The user-level resilience model of MPI-ULFM provides great flexibility for application recovery. Rather than transparently recovering applications by a single fault tolerance mechanism, ULFM provides a few failure mitigation functions that can be integrated in different ways to implement different recovery mechanisms. We show these functions in Table 3.1.

Table 3.1: MPI-ULFM failure mitigation functions.

   Operation                                   Type        Description
1  MPI_COMM_FAILURE_ACK(comm)                  local       acknowledges the notified failures
2  MPI_COMM_FAILURE_GET_ACKED(comm, grp)       local       returns the group of acknowledged failed processes
3  MPI_COMM_REVOKE(comm)                       remote      invalidates the communicator
4  MPI_COMM_SHRINK(comm, newcomm)              collective  creates a new communicator by excluding dead processes from the original communicator
5  MPI_COMM_AGREE(comm, flag)                  collective  provides a fault-tolerant agreement algorithm

When a process is notified of a failure, ULFM expects that some applications might need to silence repeated notifications of the same failure. The first function MPI_COMM_FAILURE_ACK is added for that purpose. It enables a user to acknowledge the failures notified by a communicator so far. After acknowledging a failure, receive operations that use MPI_ANY_SOURCE and the new collective operations MPI_COMM_SHRINK and MPI_COMM_AGREE will proceed successfully even if they involve the failed process. The second function MPI_COMM_FAILURE_GET_ACKED is used for identifying the failed processes that were acknowledged. Figure 3.4 shows a graphical example of these two functions. In this example, two ranks have failed, rank 1 followed by rank 3. The failure of rank 1 was acknowledged by calling MPI_COMM_FAILURE_ACK, but the failure of rank 3 was not acknowledged. As a result, a call to MPI_COMM_FAILURE_GET_ACKED following the failure of rank 3 returns only rank 1.

Figure 3.4: Use of MPI_COMM_FAILURE_ACK and MPI_COMM_FAILURE_GET_ACKED for failure detection and acknowledgement.
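A C sketch of the acknowledgement sequence illustrated in Figure 3.4 is shown below, using the specification names from Table 3.1 (prefixed MPIX_ in the reference implementation); the helper name and the rank-translation details are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Acknowledge all failures noticed so far on comm, then list them. */
    void acknowledge_and_list_failures(MPI_Comm comm) {
        MPI_Group failed, comm_grp;
        int nfailed;

        MPI_Comm_failure_ack(comm);                 /* silence repeated reports   */
        MPI_Comm_failure_get_acked(comm, &failed);  /* acknowledged failures only */
        MPI_Group_size(failed, &nfailed);

        /* translate ranks in the failed group into ranks of comm's group */
        MPI_Comm_group(comm, &comm_grp);
        int* in  = malloc(nfailed * sizeof(int));
        int* out = malloc(nfailed * sizeof(int));
        for (int i = 0; i < nfailed; i++) in[i] = i;
        MPI_Group_translate_ranks(failed, nfailed, in, comm_grp, out);
        /* out[] now holds the comm ranks of the acknowledged failed processes */

        free(in); free(out);
        MPI_Group_free(&failed);
        MPI_Group_free(&comm_grp);
    }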

The third function MPI_COMM_REVOKE is added for explicit global failure propagation. ULFM does not implicitly propagate failure notifications to all processes in a communicator. Thus, processes that do not communicate with the failed process will not detect its failure as long as the communicator is valid. If failure propagation is needed, an application may call MPI_COMM_REVOKE to invalidate the communicator. After one process revokes a communicator, all processes sharing the same communicator will receive MPI_ERR_REVOKED when they call any non-local MPI operation, except MPI_COMM_SHRINK and MPI_COMM_AGREE. These processes must recover the communicator to resume execution. Communicator recovery strategies are generally classified as shrinking or non-shrinking (see Section 2.2.3.1 for the description of these classes). In shrinking recovery, a new communicator is created by excluding the dead processes in a given communicator. The fourth function MPI_COMM_SHRINK is added for that purpose. Non-shrinking recovery requires replacing dead processes with new ones, so that the application can restore its state using the same number of processes. The MPI-3 standard, on which MPI-ULFM is based, supports dynamic process creation using MPI_COMM_SPAWN. Thus, by combining MPI_COMM_SHRINK and MPI_COMM_SPAWN, MPI applications can implement non-shrinking recovery methods [Ali et al., 2014]. Figure 3.5 demonstrates the use of MPI_COMM_REVOKE and MPI_COMM_SHRINK for performing shrinking recovery. This example presents a 6-rank communicator, where each rank is communicating with only one other rank. When rank 1 fails, only rank 0 can detect the failure. However, rank 0 cannot shrink the communicator alone; the collective function MPI_COMM_SHRINK requires the participation of every non-failed rank in the communicator. By calling MPI_COMM_REVOKE, rank 0 invalidates the communicator, not only for itself but for all the other ranks. In this example, each rank handles an invalid communicator by calling MPI_COMM_SHRINK. This results in creating a 5-rank communicator for resuming the application.

Figure 3.5: Use of MPI_COMM_REVOKE and MPI_COMM_SHRINK to perform shrinking recovery. Here we assume that communication is performed between pairs of processes.
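A minimal C sketch of the shrinking-recovery sequence of Figure 3.5 follows; the function name and control flow are illustrative, and the reference implementation names these calls MPIX_Comm_revoke and MPIX_Comm_shrink.

    #include <mpi.h>

    /* The rank that detects a failure revokes the communicator; every surviving
       rank then reaches the collective shrink, either after detecting the failure
       itself or after receiving MPI_ERR_REVOKED from a subsequent operation. */
    MPI_Comm recover_by_shrinking(MPI_Comm comm, int detected_failure) {
        MPI_Comm newcomm;
        if (detected_failure) {
            MPI_Comm_revoke(comm);
        }
        MPI_Comm_shrink(comm, &newcomm);
        /* For non-shrinking recovery, MPI_Comm_spawn could then be used to
           replace the lost processes, as noted above. */
        return newcomm;
    }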

The fifth and last function, MPI_COMM_AGREE(comm, flag), provides a fault-tolerant agreement algorithm. It is a collective operation that results in all the active processes agreeing on the value of the integer in/out parameter flag and also implicitly agreeing on the group of failed processes. The agreed value is the bitwise AND of the values provided by all active processes. While normal collectives may succeed at some processes and fail at others, ULFM guarantees that MPI_COMM_AGREE will either complete successfully at all the processes or fail at all the processes. This concludes our overview of the MPI-ULFM specification. We find MPI-ULFM promising for bridging the gap between performance and resilience. By providing user-controlled resilience in a performance-oriented programming model, many optimizations can be leveraged for handling failures more efficiently at scale.
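As a small illustration of the agreement semantics just described, the C sketch below has all surviving ranks agree on whether a step succeeded everywhere; the function name is illustrative, and the call is MPIX_Comm_agree in the reference implementation.

    #include <mpi.h>

    /* flag becomes the bitwise AND of the values contributed by all surviving
       ranks, and the call either succeeds at all of them or fails at all of them. */
    int step_succeeded_everywhere(MPI_Comm comm, int local_success) {
        int flag = local_success ? 1 : 0;
        MPI_Comm_agree(comm, &flag);
        return flag;
    }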

3.4 Resilient X10 over MPI-ULFM

MPI-ULFM provides a flexible framework for HPC users to learn and apply different fault tolerance concepts in their applications. However, the low-level programming model of MPI increases the programming cost for achieving that goal. Arguably, MPI-ULFM can play a more effective role in making fault-tolerant programming more accessible to users by being a compilation target for resilient high-level languages. In this section, we demonstrate this approach in the context of the X10 language.

3.4.1 Resilient X10 Transport Requirements

The minimalist approach to resilience adopted by X10 reduces the resilience requirements needed from the underlying transport runtime. In fact, RX10 requires only two mandatory features from X10RT to enable applications to recover from failures correctly:

1. allowing non-failed processes to communicate.

2. allowing a place to eventually detect the failure of any other place even without direct communication with the failed place. In other words, RX10 requires global failure detection. This requirement is primarily needed to enable the finish construct to detect the failure of directly and indirectly generated remote tasks within its scope.

Other transport features that are desired but not mandatory are:

3. providing fault-tolerant collectives to speed up Team operations.

4. supporting dynamic process creation for non-shrinking recovery.

From the previous section, we conclude that MPI-ULFM provides all the mandatory and desired features needed by RX10. It also provides an additional feature, fault-tolerant agreement, that resilient applications can leverage for failure recovery.

3.4.2 Global Failure Detection

One way to provide global failure detection with ULFM is by revoking the communicator. However, this method is expensive since it requires, in addition to the cost of propagating the failure to other places, the cost of collectively reconstructing a new communicator with every failure. The other option that is directly applicable to X10 is using MPI_ANY_SOURCE. According to ULFM's semantics, an operation that uses MPI_ANY_SOURCE will raise an error when MPI detects the failure of any potential sender (unless the failure has been acknowledged). Each X10 place is actively probing the network for incoming messages from any other place by calling MPI_Iprobe(MPI_ANY_SOURCE, ..), as shown in Figure 3.2. As a result, each X10 place will eventually be notified of the failure of any other place. This feature enabled us to avoid repairing the communicator with every failure, while meeting the requirement of providing global failure detection to RX10.

3.4.3 Identifying Dead Places

In resilient mode, X10RT is expected to detect communication errors and to use them to identify dead places. The design of RX10 requires each resilient implementation of X10RT to implement the following two interfaces that the runtime uses to discover the detected failures:

• int x10rt_ndead (): returns the number of dead places.

• bool x10rt_is_place_dead (int p): checks whether a place is dead.

Listing 3.1: MPI error handler for Resilient X10.

1  // global state
2  int ndead = 0;        // number of dead places
3  int* dead_places;     // array of dead place ranks
4
5  void mpiCustomErrorHandler(MPI_Comm* comm, int* errorCode, ...) {
6      MPI_Group f_group; int f_size;
7
8      // Acknowledge & query failed processes
9      MPI_Comm_failure_ack(*comm);
10     MPI_Comm_failure_get_acked(*comm, &f_group);
11     MPI_Group_size(f_group, &f_size);
12     int* f_ranks = malloc(...);
13     MPI_Group_translate_ranks(f_group, ..., f_ranks);
14
15     // Update global state
16     dead_places = f_ranks;
17     ndead = f_size;
18 }

To detect dead places, we registered the error handler in Listing 3.1 with the default communicator MPI_COMM_WORLD. The error handler uses local ULFM operations to acknowledge and query the list of dead places (Lines 9–13). This list is stored in the transport's global state (Lines 16–17), which is used for implementing the two query functions listed above. Different places can have different knowledge about the status of other places, which is permitted in RX10. X10RT-MPI serializes calls to MPI via a lock, which results in executing the error handler atomically (see Section 3.2.1).
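For completeness, registering the handler of Listing 3.1 uses the standard MPI error-handler API, roughly as sketched below; the wrapper function name is illustrative.

    #include <mpi.h>

    extern void mpiCustomErrorHandler(MPI_Comm* comm, int* errorCode, ...);

    void register_x10rt_error_handler(void) {
        MPI_Errhandler handler;
        /* wrap the handler from Listing 3.1 and attach it to MPI_COMM_WORLD,
           replacing the default MPI_ERRORS_ARE_FATAL behaviour */
        MPI_Comm_create_errhandler(mpiCustomErrorHandler, &handler);
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
    }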

3.4.4 Native Team Collectives

When using MPI for communication, each Team is mapped to a sub-communicator of MPI_COMM_WORLD that is used for calling MPI collective operations only. Providing fault-tolerant Team operations on top of MPI-ULFM mainly impacted the process of Team creation and failure notification.

3.4.4.1 Team Construction

The interfaces for creating an X10 Team and creating an MPI communicator have a major difference. Team creation is exposed as a centralized single-place function, whereas communicator creation in MPI is a collective function. When X10 uses MPI, the constructor of the Team class implicitly sends MPI messages to all the places to trigger the collective invocation of MPI_COMM_CREATE. This function is used to create a sub-communicator for the new Team from the parent communicator MPI_COMM_WORLD. The root place constructing the Team waits for an acknowledgement message from each place to detect the success or failure of communicator construction. A failure at any place results in throwing an exception to the application. According to the ULFM specification, if some of the communicator processes failed, collectives over that communicator may fail at some of the participating pro- cesses and succeed at others. As a result, if one X10 place failed, MPI_COMM_WORLD would not be useful for constructing any new Team because MPI_COMM_CREATE would never fully succeed over MPI_COMM_WORLD. Repairing MPI_COMM_WORLD by revoking and shrinking it after every failure would solve this problem, but at unacceptable performance cost for applications that do not use Team. We solve this problem by performing a communicator repair step every time a new Team is constructed. While this increases the cost of creating a Team, this cost is negligible as X10 programs often create one Team object throughout their execution. The solution is as follows: we use MPI_COMM_SHRINK to create a new temporary communicator that excludes any dead process in MPI_COMM_WORLD. The temporary communicator is then used to create the Team’s communicator, as shown in the following code:

1 MPI_Comm team_comm, shrunken;
2 MPI_COMM_SHRINK(MPI_COMM_WORLD, &shrunken);
3 MPI_COMM_CREATE(shrunken, group, &team_comm);

One vulnerability that the code above does not handle is the failure of a process in the shrunken communicator between executing Line 2 and Line 3. This failure may still cause MPI_COMM_CREATE to succeed at some processes, but fail at others. Calling ULFM's MPI_COMM_AGREE function can be useful to decide whether or not all processes succeeded in creating the communicator. However, we did not need this agreement call thanks to the centralized nature of Team creation in X10. The root place will report the failure of Team creation if any of the places fails to complete the above code snippet. Note that the most recent release of X10 at the time of writing, X10 2.6.1, follows the construction of the communicator with an internal fan-out finish that notifies each place of its order among the Team places. This step results in additional cost in resilient mode for tracking the termination of this fan-out finish resiliently. Sharing the place's order in the Team can be done more efficiently by piggybacking this information in the same X10RT-MPI messages that orchestrate Team creation, or by adding the list of Team members as one of the object identifiers. For simplicity, we implemented the latter method in our X10 repository. In X10 2.6.1, passing a Team object to another place results in passing a small message containing the Team ID only, whereas in our implementation passing a Team object results in passing a bigger message containing the Team ID and an array of N values of type Long representing the list of places. By knowing the list of places, a place can calculate its order in the Team without additional communication. By avoiding the broadcast messages at the time of Team creation, we also avoid the overhead of the resilient finish protocol and make the performance of Team creation virtually equal in both X10 and RX10.

3.4.4.2 Team Failure Notification

As described in Section 3.2.3.2, Team surrounds each call to a native MPI or PAMI collective with a local finish that waits for the completion of the underlying native collective. We leveraged the finish construct's failure notification support to allow Team operations to notify process failure errors. When the native MPI collective raises a process failure error, we add a DeadPlaceException to the finish block before notifying the completion of the collective. Finish at the places experiencing the failure will throw this exception to the calling task. Other places may complete successfully. If successful places call another collective operation, this collective call will hang because the places that detected the failure may not participate in it. To avoid this problem, we revoke the Team's communicator upon detecting a collective failure at any place so that future Team operations fail at all participating places. Emulated Team collectives employ a simple error propagation mechanism to avoid deadlocks due to non-uniform error reporting. A place that detects a failure of its parent invalidates itself and sends asynchronous messages to invalidate its grandparent, its children, and grandchildren. The invalidation messages propagate to the top and bottom of the collective tree until all members are notified that the Team object is no longer valid.

3.4.4.3 Team Agreement

Reaching agreement between processes as they switch between phases is often required in SPMD applications. Agreement algorithms are also useful for failure recovery, as they enable surviving distributed processes to agree on the state of the system after unexpected failures [Herault et al., 2015]. X10 and ULFM provide different methods for facilitating agreement. The finish construct in X10 can be used as a centralized agreement coordinator as shown in Listing 3.2. Running multiple phases requires creating a fan-out finish, i.e. a finish that forks a remote task at each place, for each phase. Finish supervises the execution of one phase and reports any detected place failures. All places pass their results to the finish home place in which agreement can be decided by comparing the results. Unfortunately, performing phase agreement using a fan-out finish causes high performance overhead for resilient applications because tracking the creation and termination of remote tasks comes with additional communication cost in resilient mode. We explain and quantify the resilience overhead of a fan-out finish in Section 4.11 in the next chapter.

Listing 3.2: Centralized agreement between three phases using finish (failure handling is not shown for simplicity). We assume a PhaseResult class with two methods: addResult, which reports the phase result of one place, and agree, which checks whether the results of all places are the same and then deletes the results.

// gr is used for collecting the results of all places at each phase
val gr = GlobalRef[PhaseResult](new PhaseResult());
finish for (p in places) at (p) async {
    val a = phase1();
    // send the result to the home place
    at (gr) async gr().addResult(a);
}
if (gr().agree()) {
    print("phase1 agreement reached");
    finish for (p in places) at (p) async {
        val b = phase2();
        at (gr) async gr().addResult(b);
    }
    if (gr().agree()) {
        print("phase2 agreement reached");
        finish for (p in places) at (p) async {
            val c = phase3();
            at (gr) async gr().addResult(c);
        }
        if (gr().agree()) {
            print("phase3 agreement reached");
        }
    }
}

On the other hand, ULFM supports distributed agreement using the collective function MPI_COMM_AGREE. Coordinating three phases of execution, such that agreement on the state of the previous phase is established before starting the next phase, can be done using one fan-out finish with longer-lived tasks, as shown in Listing 3.3. By avoiding the unnecessary creation of new remote tasks at each phase, resilient applications can execute more efficiently. To enable X10 programmers to exploit distributed agreement in their applications, we added the function Team.agree(int), which directly maps to MPI_COMM_AGREE. The integer parameter can represent a phase result, the number of dead places known to a place, or whatever value all places need to agree on. If one place failed, Team.agree(int) would fail at all Team members — a feature that is not guaranteed in any other collective function. In Section 5.3.2.2, we show how we used this function to reduce the checkpointing overhead of the iterative application framework.

Listing 3.3: Distributed agreement between three phases (failure handling is not shown for simplicity).

val team = new Team(places);
finish for (p in places) at (p) async {
    val a = phase1();
    if (a == team.agree(a)) {
        print("phase1 agreement reached");
        val b = phase2();
        if (b == team.agree(b)) {
            print("phase2 agreement reached");
            val c = phase3();
            if (c == team.agree(c)) {
                print("phase3 agreement reached");
            }
        }
    }
}

3.4.5 Non-Shrinking Recovery

Our implementation did not use the feature of dynamically creating MPI processes to replace dead places with new ones. However, resilient X10 applications can still perform non-shrinking recovery by pre-allocating spare processes at application start time. Bungart and Fohry [2017a] extended our implementation by supporting dynamic process creation over MPI-ULFM. However, their implementation is not publicly available yet.

3.5 Performance Evaluation

The objective of this section is to demonstrate the efficiency and scalability advantages that resilient X10 applications can achieve by switching from the socket transport to the MPI-ULFM transport. We focus particularly on the performance of the Team APIs because collective communication is a key mechanism in the widely used bulk-synchronous model. Team also gives us the flexibility to exercise both MPI's point-to-point interfaces and its collective interfaces, through the emulated and native Team implementations respectively. The following experiments aim to answer the following questions:

1. What is the failure-free resilience overhead of MPI-ULFM?

2. What is the performance advantage of using MPI-ULFM as a transport layer for RX10 applications as opposed to using the sockets transport layer?

3. What is the performance advantage of replacing the emulated collectives with native MPI collectives for RX10 applications?

3.5.1 Experimental Setup

We conducted the following experiments on the Raijin supercomputer at the National Computational Infrastructure (NCI), and on a cluster of 30 virtual machines from the Australian NECTAR research cloud. Each of the virtual machines has 12 virtual cores, providing a total of 360 virtual cores for our experiments. Each compute node in Raijin has dual 8-core Intel Xeon (Sandy Bridge, 2.6 GHz) processors and connects to the other nodes using an Infiniband FDR network. We configured our jobs to allocate 10 GiB of memory per node. Our NECTAR virtual machines are hosted at the NCI zone and were installed using the 'm2.xlarge' template, which allocates 45 GiB of memory and 12 virtual cores for each virtual machine. Our MPI jobs on both platforms statically bound each process to a separate core. Our X10 jobs allocate one place per core and oversubscribe each core with two internal place threads: a worker thread [1] for handling user tasks and an immediate thread [2] for handling low-level runtime communication. Although an X10 place can be mapped to an entire node, the microbenchmarks used in this chapter create one task per place. Therefore, using one core per place is sufficient. The reference implementation of MPI-ULFM used across the whole thesis is ULFM-2. When we mention ULFM without any version number, we refer to the ULFM-2 implementation. ULFM was built from revision e87f595 of the master branch of the repository at https://bitbucket.org/icldistcomp/ulfm2.git. The X10 compiler and runtime were built from source revision 36ca628 of the optimistic branch of our repository https://github.com/shamouda/x10.git, which is based on release 2.6.1 of the X10 language.

[1] One worker thread is configured by setting the environment variable X10_NTHREADS=1.
[2] One immediate thread is configured by setting the environment variable X10_NUM_IMMEDIATE_THREADS=1.

Each of the experiments below was conducted 30 times. The tables report the median value, with the 25th and 75th percentiles given in brackets as [25th, 75th].

3.5.2 Performance Factors

In order to better understand the performance results in this section, we explain two important factors that impact the performance of our experiments.

3.5.2.1 The Immediate Thread

The X10 language provides the @Immediate annotation to give higher priority to certain async statements. An immediate async is expected to execute small non-blocking tasks, mostly used for low-level runtime communications. A special thread, called the immediate thread, is created by the scheduler to continuously probe for incoming immediate tasks and to execute them in order. While an immediate thread is limited to executing immediate tasks only, a worker thread can execute normal and immediate tasks. X10RT-Sockets implements the immediate thread more efficiently than X10RT-MPI. X10RT-Sockets suspends the immediate thread until an immediate task is received using the blocking poll() interface from the pthreads library. On the other hand, X10RT-MPI uses a busy waiting loop for the immediate thread to continuously scan the network for incoming immediate tasks. Although MPI provides a blocking MPI_PROBE interface, this interface is not safe for X10 to use with the MPI_THREAD_SERIALIZED mode. This mode serializes the communication calls from the different threads, thus blocking one thread will prevent others from progressing. Unfortunately, this busy waiting behaviour may slow down the progress of the worker threads, especially when the immediate thread competes with the worker threads over the same core. Starting the X10 places without any immediate thread over MPI is mostly safe for non-resilient executions. However, the absence of the immediate thread can cause the resilient executions to deadlock because RX10 makes extensive use of immediate tasks for the resilient finish protocol. For fair comparison between X10 and RX10, we start all X10 places with one worker thread and one immediate thread.

3.5.2.2 Emulated Team versus Native Team

As described in Section 3.2.3 and Section 3.4.4, the Team class is designed to use the collective interfaces provided by the underlying transport layer. For transport layers that do not provide collective interfaces, Team provides emulated collectives. The emulated collectives exchange active messages between the X10 places — using finish, at, and async — to achieve their intended goal. On the other hand, the native collectives delegate their functions to the MPI layer, thereby avoiding the need for sending or tracking active messages by the X10 runtime. We expect this difference to be influential in RX10, because tracking remote active messages with a resilient finish requires additional control messages compared to a non-resilient finish,

which is the main cause of the resilience overhead in X10. Therefore, it is expected that the emulated collectives will be sensitive to the resilience overhead of finish, while the native collectives will not. The native collectives are, however, expected to be sensitive to the resilience overhead of ULFM.

3.5.3 MPI-ULFM Failure-Free Resilience Overhead

Fault tolerance may be expected to come at a performance cost. When we use ULFM as a transport layer for X10, any resilience overhead imposed by ULFM will be inherited by all X10 applications. To study the resilience overhead of ULFM, we compared its performance to 1) ULFM without fault tolerance and 2) Open MPI v3.0.0. ULFM without fault tolerance is built from the ULFM source code using the configuration option --without-ft. This option disables the code paths related to fault tolerance in ULFM and therefore compiles an almost standard Open MPI implementation. Open MPI v3.0.0 (OMPI for short) is a standard release of Open MPI that does not provide any fault tolerance support. We chose this version because its release date is the closest to the ULFM revision that we used. Table 3.2 shows the performance of three Team operations, barrier, broadcast, and allreduce, over 360 places running on Raijin. Team collectives are used as microbenchmarks since they can be configured to activate either MPI collectives or MPI point-to-point communications. MPI point-to-point communications can be activated by forcing Team to use the emulated collectives despite the availability of native collectives, by setting the environment variable X10RT_X10_FORCE_COLLECTIVES=1. Since we are concerned with the communication overhead rather than the memory management overhead, we use a small array of 2048 doubles in the broadcast and allreduce collectives. The array size (16384 bytes) exceeds MPI's eager limit (12288 bytes) on both the Raijin and NECTAR platforms, therefore sending messages is performed using the rendezvous protocol rather than the eager protocol on both systems [Squyres, 2011].

The first three rows in Table 3.2 show the performance of the three collectives in pure MPI benchmarking programs. The programs invoke the non-blocking collectives MPI_Ibarrier, MPI_Ibcast, and MPI_Iallreduce, followed by a busy waiting loop that checks the completion of the collective call using MPI_Test. That is, in principle, the same mechanism applied by X10RT-MPI for executing Team collectives natively. The bare MPI performance of each of the collectives is indistinguishable between ULFM and ULFM without fault tolerance, indicating the absence of a measurable resilience overhead by ULFM in failure-free collective executions. OMPI's barrier performance is comparable to the ULFM implementations; however, its bcast and allreduce performance is slower. This could result from unavoidable implementation differences between the source code of OMPI and ULFM due to their independent parallel development. The variability of the results, as shown in brackets, is reasonable when using the three MPI implementations directly.
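A minimal C sketch of the pure-MPI benchmark loop described above is shown below (for the barrier only; MPI_Ibcast and MPI_Iallreduce follow the same pattern, and timing code is omitted).

    #include <mpi.h>

    /* Non-blocking barrier completed by busy-waiting on MPI_Test, mirroring the
       way X10RT-MPI drives native Team collectives. */
    void benchmark_ibarrier(MPI_Comm comm) {
        MPI_Request req;
        int done = 0;
        MPI_Ibarrier(comm, &req);
        while (!done) {
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }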

Table 3.2: Performance of Team collectives with different MPI implementations on Raijin with 360 places.

Runtime   Transport        Team Impl.  barrier               bcast                 allreduce
Pure MPI  OMPI             -           0.1 ms [0.1, 0.1]     0.6 ms [0.5, 0.8]     2.8 ms [2.8, 2.9]
          ULFM without FT  -           0.1 ms [0.1, 0.2]     0.1 ms [0.1, 0.1]     0.4 ms [0.4, 0.4]
          ULFM             -           0.1 ms [0.1, 0.2]     0.1 ms [0.1, 0.2]     0.4 ms [0.4, 0.4]
X10       OMPI             Native      5.6 ms [1.6, 15.2]    0.6 ms [0.6, 0.7]     3.0 ms [2.9, 3.0]
          ULFM without FT  Native      8.2 ms [0.7, 16.3]    0.1 ms [0.1, 0.1]     6.7 ms [1.2, 16.0]
          ULFM             Native      7.6 ms [2.4, 14.8]    9.6 ms [5.2, 12.4]    9.1 ms [3.4, 14.7]
          OMPI             Emulated    81.5 ms [77.3, 87.5]  89.6 ms [81.5, 99.5]  80.5 ms [78.7, 89.1]
          ULFM without FT  Emulated    76.0 ms [70.1, 82.9]  83.2 ms [78.4, 92.9]  85.8 ms [80.0, 88.8]
          ULFM             Emulated    80.0 ms [75.5, 86.2]  89.6 ms [79.9, 99.3]  84.3 ms [77.8, 87.9]

The second part of the table shows the performance of the Team collectives using the X10 runtime over the three MPI implementations. Comparing the bare MPI performance to the performance of X10 Team collectives in native mode shows high performance overhead and variability introduced by the X10 runtime. The overhead ranges from 0% to 12,826% across the three MPI implementations. It is expected for managed runtime systems like X10 to pay additional performance cost to raise the level of programming abstraction. Our experiment shows that the management activities of X10, which include task scheduling, garbage collection, handling termination detection, data serialization, and progressing immediate tasks, add up to 10 ms to the performance of MPI collective operations. Due to the high variability of the performance of native Team operations, it is hard to draw conclusions from them on the resilience overhead of ULFM. For the native Team barrier collective, the median performance of ULFM without FT is slower than ULFM. However, the overlap between the variability ranges suggests a comparable performance between the two configurations. The same applies to the emulated allreduce collective.

The emulated Team collectives depend on MPI’s point-to-point communication. The results show comparable performance using the three MPI implementations and no significant resilience overhead by ULFM. The results also show a significant performance advantage for native Team collectives over emulated collectives. With ULFM, the emulated collectives were 11×, 9×, 9× slower than native collectives for barrier, bcast, and allreduce, respectively.

From this experiment, we conclude that MPI-ULFM does not introduce a significant resilience overhead on failure-free executions. This makes it a promising transport layer for resilient high-level languages.

3.5.4 Resilient X10 over MPI-ULFM versus TCP Sockets

Resilience was initially supported in X10 over the X10RT-Sockets layer only. In this experiment, we study the potential benefit for X10 applications of switching from sockets to MPI-ULFM on a cluster of virtual machines connected by an IP-over-Infiniband network in the NECTAR cloud. With X10RT-Sockets, the only available option for Team is to use the emulated implementation. Although it is possible to use either emulated or native collectives with X10RT-MPI, the previous experiment indicates that the proper option for achieving better performance is the native collectives. That is why Table 3.3 compares the results of using sockets with emulated collectives against ULFM with native collectives. The top two rows show the performance results using the default X10 runtime, and the bottom two rows show the results with the resilient X10 runtime.

The results show that the ULFM backend results in significant performance gains in both non-resilient and resilient modes. In non-resilient mode, the emulated barrier, broadcast, and allreduce over sockets are respectively 1.7×, 2.5×, and 2.5× slower than the corresponding native collectives over ULFM. The slowdown is more significant in resilient mode, where the emulated collectives in the same order are 14.0×, 23.3×, and 14.9× slower than the native collectives. The native collectives did not introduce any measurable resilience overhead, since they avoid creating remote X10 asyncs that require expensive tracking by the resilient finish protocol. On the other hand, the emulated collectives suffered high resilience overhead: 703% for barrier, 747% for broadcast, and 482% for allreduce.

This experiment demonstrates that RX10 applications can benefit by switching from the socket backend to the ULFM backend to avoid the high resilience overhead of X10's emulated collectives. That is in addition to the direct benefit of scaling applications to larger core counts by utilizing the ever-growing parallelism in supercomputers. The next experiment evaluates the scalability of different Team operations in a supercomputer environment.

Table 3.3: Performance of Team collectives with sockets and ULFM on NECTAR with 360 places.

Runtime   Transport   Team Impl.   barrier                   bcast                     allreduce

X10       ULFM        Native       13.4 ms [10.5, 18.1]      12.4 ms [2.9, 17.7]       18.4 ms [14.7, 22.2]
          Sockets     Emulated     22.7 ms [13.6, 28.4]      31.5 ms [25.0, 41.1]      46.3 ms [36.2, 57.8]
RX10      ULFM        Native       13.0 ms [11.4, 16.3]      11.5 ms [4.0, 15.8]       18.1 ms [15.3, 23.8]
          Sockets     Emulated     182.0 ms [166.7, 195.8]   267.1 ms [243.1, 287.1]   269.5 ms [248.4, 291.8]

3.5.5 Team Construction Performance

In this experiment, we evaluate the performance of creating a Team object in both emulated and native modes, with X10 and RX10. In the emulated mode, constructing a Team object is performed using a remote finish that sends an async to each place to create place-local data structures for holding the new Team state. The performance of this remote finish varies between X10 and RX10 due to the additional book-keeping activities of the resilient finish protocol. In the native mode, constructing a Team object results in creating a corresponding MPI communicator using the blocking collective operation MPI_Comm_Create. An additional call to MPI_Comm_Shrink is used if the X10 runtime is running in resilient mode. Orchestrating the places to create a new Team collectively is done at the X10RT-MPI layer through low-level messages. As explained in Section 3.4.4, we avoid the overhead of the resilient finish protocol by avoiding broadcasting remote tasks while creating a native Team object.

Table 3.4 shows the bare performance of MPI_Comm_Create and MPI_Comm_Shrink over ULFM. The results show favourable sub-linear scalability for both operations, and the cost of calling the two operations is less than 2 ms with 1024 places. On the other hand, constructing a native Team object takes up to 150 ms with 1024 places, because of the internal communication required to line up the places for invoking the communicator collectives, in addition to other runtime management activities that can cause processing delays.

In non-resilient mode, the performance of the emulated Team is comparable to the performance of the native Team with up to 256 places3. With larger place counts, the native Team outperforms the emulated Team. In resilient mode, the native Team significantly outperforms the emulated Team — creating an emulated Team took 505.1 ms with 109% resilience overhead, compared to only 147.8 ms for creating a native Team with 1024 places.

In resilient X10 applications, failure of places mandates reconstructing the Team objects. The performance improvement achieved by using ULFM's fault-tolerant collectives is therefore significant for speeding up the recovery process of applications.
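To make the construction step concrete, the following minimal X10 sketch creates a Team over all places and synchronizes on it. The constructor shown follows the general shape of x10.util.Team, but the exact signature is an assumption, and the fragment is illustrative rather than the benchmark code used in this experiment.

    import x10.util.Team;

    public class TeamConstructionSketch {
        public static def main(args:Rail[String]) {
            val world = Place.places();    // the group of places forming the team
            // Under X10RT-MPI in native mode, this construction maps to
            // MPI_Comm_Create (plus MPI_Comm_Shrink in resilient mode) inside
            // the X10RT layer; in emulated mode it is performed by a remote
            // finish that initializes place-local Team state at every place.
            val team = new Team(world);
            // A barrier confirms that every member place observes the new team.
            finish for (p in world) at (p) async {
                team.barrier();
            }
        }
    }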

Table 3.4: Team construction performance on Raijin with ULFM.

Places   MPI Comm Create        MPI Comm Shrink        Native X10/RX10           Emulated X10              Emulated RX10

128      0.26 ms [0.25, 0.27]   0.38 ms [0.37, 0.39]   117.7 ms [24.5, 135.6]    64.5 ms [60.1, 68.2]      190.0 ms [180.2, 198.9]
256      0.31 ms [0.31, 0.34]   0.43 ms [0.43, 0.43]   131.4 ms [122.4, 135.6]   74.2 ms [66.9, 168.4]     338.9 ms [330.0, 339.0]
512      0.39 ms [0.38, 0.40]   0.51 ms [0.51, 0.53]   135.6 ms [135.3, 137.5]   218.4 ms [209.3, 225.5]   369.7 ms [359.1, 376.7]
1024     0.49 ms [0.49, 0.53]   0.73 ms [0.73, 0.75]   147.8 ms [145.9, 148.8]   241.5 ms [232.6, 248.1]   505.1 ms [496.8, 527.5]

3The emulated mode was faster than the native mode with 128 and 256 places. However, we observed large variability in the performance results at these points, which suggests that this result is not significant.

[Figure: Team construction time (ms) versus number of places (128 to 1024), comparing X10/RX10 Native, RX10 Emulated, and X10 Emulated.]

Figure 3.6: Team construction performance.

3.5.6 Team Collectives Performance

In this section, we evaluate the scalability of four Team operations: barrier, broadcast, allreduce, and agree. Where possible, we measured the bare MPI performance, the emulated performance, and the native performance. For the emulated collectives we report the performance using X10 and RX10. For the native collectives we report the same value for both X10 and RX10, since they are not subject to the resilience overhead of the finish termination detection protocol. The agree interface uses the fault-tolerant agreement implementation provided by ULFM. We did not implement a corresponding emulated version of the agree interface due to time limitations; we therefore only report the bare MPI performance and the native Team performance for agree.

The results are shown in Tables 3.5-3.8. In line with the results of the previous subsections, using the fault-tolerant MPI interfaces accelerates Team operations significantly compared to X10's emulated collectives. X10's bulk-synchronous applications can use optimized MPI collectives and execute resiliently with virtually no resilience overhead from the RX10 runtime.
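For reference, the following minimal X10 sketch shows how an application invokes the Team operations measured below. The method names follow x10.util.Team, but the exact signatures (argument order, reduction-operation constants, and the agree overload) are assumptions that may differ from the X10 version evaluated here.

    import x10.util.Team;

    public class TeamCollectivesSketch {
        public static def main(args:Rail[String]) {
            val team = Team.WORLD;
            finish for (p in Place.places()) at (p) async {
                team.barrier();                           // Team.barrier (Table 3.5)
                val sum = team.allreduce(1.0, Team.ADD);  // Team.allreduce (Table 3.7); bcast operates similarly on Rails (Table 3.6)
                val vote = team.agree(1n);                // Team.agree (Table 3.8), backed in native mode by ULFM's fault-tolerant agreement
            }
        }
    }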

Table 3.5: Team.Barrier performance on Raijin.

Places   MPI Ibarrier / MPI Test   Native X10/RX10       Emulated X10           Emulated RX10

128      0.10 ms [0.10, 0.11]      1.7 ms [0.3, 15.0]    57.9 ms [50.1, 62.0]   61.9 ms [57.6, 71.4]
256      0.10 ms [0.10, 0.18]      6.1 ms [0.5, 16.1]    69.5 ms [64.3, 70.8]   189.9 ms [175.1, 198.6]
512      0.12 ms [0.11, 0.22]      7.7 ms [2.0, 15.3]    80.0 ms [70.6, 86.7]   212.9 ms [201.1, 221.2]
1024     0.17 ms [0.11, 0.23]      9.3 ms [1.4, 16.2]    88.1 ms [83.0, 98.7]   246.7 ms [229.0, 308.0]

Table 3.6: Broadcast performance on Raijin.

Places   MPI Ibcast / MPI Test   Native X10/RX10        Emulated X10             Emulated RX10

128      0.03 ms [0.02, 0.14]    7.6 ms [0.4, 15.0]     64.6 ms [52.1, 73.8]     100.0 ms [90.8, 110.0]
256      0.03 ms [0.02, 0.19]    9.7 ms [1.7, 13.9]     73.4 ms [64.5, 80.6]     120.0 ms [110.5, 130.0]
512      0.04 ms [0.03, 0.24]    10.6 ms [2.2, 15.1]    96.1 ms [89.5, 101.0]    270.0 ms [257.4, 278.9]
1024     0.04 ms [0.03, 0.11]    11.5 ms [1.6, 16.3]    103.8 ms [95.4, 109.9]   316.0 ms [292.5, 363.7]

Table 3.7: Allreduce performance on Raijin.

Places   MPI Iallreduce / MPI Test   Native X10/RX10        Emulated X10             Emulated RX10

128      0.27 ms [0.27, 0.30]        4.4 ms [0.4, 16.1]     64.8 ms [60.1, 69.3]     110.2 ms [102.0, 117.8]
256      0.33 ms [0.33, 0.35]        5.8 ms [1.4, 14.8]     76.7 ms [72.1, 80.5]     130.1 ms [129.0, 134.0]
512      0.39 ms [0.37, 0.40]        9.4 ms [1.4, 16.0]     92.4 ms [87.3, 99.6]     269.4 ms [258.7, 274.3]
1024     0.45 ms [0.44, 0.47]        11.9 ms [1.3, 17.7]    101.3 ms [95.5, 109.6]   306.8 ms [299.6, 331.4]

Table 3.8: Agreement performance on Raijin.

Places   MPI Comm Iagree / MPI Test   Native X10/RX10        Emulated X10/RX10

128      0.11 ms [0.11, 0.11]         6.4 ms [0.1, 16.6]     NA
256      0.11 ms [0.11, 0.11]         7.1 ms [1.2, 15.7]     NA
512      0.12 ms [0.12, 0.13]         8.4 ms [3.5, 14.8]     NA
1024     0.12 ms [0.12, 0.13]         10.6 ms [2.9, 15.2]    NA

[Figures: Team.Barrier and Team.Broadcast times (ms) versus number of places (128 to 1024), comparing X10/RX10 Native, RX10 Emulated, and X10 Emulated.]

Figure 3.7: Team.Barrier performance. Figure 3.8: Team.Broadcast performance.

[Figures: Team.Allreduce and Team.Agree times (ms) versus number of places (128 to 1024); Allreduce compares X10/RX10 Native, RX10 Emulated, and X10 Emulated, while Agree shows X10/RX10 Native only.]

Figure 3.9: Team.Allreduce performance. Figure 3.10: Team.Agree performance.

3.6 Related Work

To the best of our knowledge, our work is the first to explore the use of ULFM as a runtime system for high-level programming models. The closest to our work is related to the Fortran Coarrays programming model (Section 2.3.2.3). Fortran 2018 added new features to allow Coarrays programs to detect image failures and reform the remaining images for post-failure processing4. Fanfarillo et al. [2019] used ULFM to add these features to the OpenCoarrays framework [OpenCoarrays, 2019]. Failure recovery is performed collectively by the active processes using the revoke, shrink, and agreement functions of ULFM. In contrast, our work avoids global synchronization for recovery. More explorations of ULFM have been done in the context of MPI applications and frameworks, which we review next.

Laguna et al. [2016] present an evaluation of ULFM's programmability in two common patterns of HPC applications: bulk-synchronous with static load balancing and master-worker with dynamic load balancing. They concluded that the absence of a non-shrinking communicator recovery interface in ULFM complicates the recovery procedures of statically balanced computations, since it requires updating the domain decomposition to map to an arbitrary number of ranks, which is complex for many applications and may even be infeasible. They also reported complexities in developing local recovery procedures for computations that rely on non-blocking communication, since ULFM does not report failures while initiating a non-blocking communication. Although detecting failures through return codes enables applications to determine the location of the failure in the code, they reported that this method results in significant changes and complications to the non-resilient code.

Gamell et al. [2016] used ULFM as a basis for comparing two in-memory checkpointing mechanisms implemented in the Fenix library. The first is neighbor-based — each process saves a copy of its own data and a copy of its neighbor's data. The second is checksum-based — each process saves a copy of its own data, and one process saves the checksum of the data of all the processes. In their experiments, neighbor-based checkpointing achieved better performance at the expense of a higher memory footprint.

Teranishi and Heroux [2014] used ULFM to implement the Local Failure Local Recovery programming model, which aims to support the transition from global to local rollback recovery in certain HPC applications. The programming model provides simple abstractions for handling data and communicator recovery to enable MPI applications to adopt local non-shrinking recovery more easily. Three abstractions are provided: 1) a resilient communicator that uses spare processes to repair its internal structure, 2) a redundant storage interface that implements in-memory checksum-based checkpointing, and 3) an extendable programming template for building recoverable distributed data structures.

4Coarrays uses the name image to refer to a Fortran process.

Losada et al. [2017] used ULFM to deliver transparent non-shrinking recovery for MPI applications using the ComPiler for Portable Checkpointing (CPPC) [Rodríguez et al., 2010]. CPPC is a source-to-source compiler capable of transparently injecting checkpointing instructions in MPI programs at safe points discovered using static code analysis and variable liveness analysis. Their work improved the code generated by CPPC to automatically handle failure detection, global failure propagation using MPI_COMM_REVOKE, communicator recovery using MPI_COMM_SHRINK and MPI_COMM_SPAWN, and computation restart.

In addition to the checkpoint/restart frameworks, other research groups have used MPI-ULFM for implementing algorithmic-based fault-tolerant applications. Rizzi et al. [2016] used ULFM to design an approximate PDE solver that can tolerate both hard and soft faults. Algorithmic reformulations were applied to the PDE solver to introduce filters for corrupt data and to add more task parallelism that fits a master-worker execution model. They favoured the master-worker model as it does not require global communicator repair in the event of a worker failure. Depending on the computation stage, the described algorithm handles the failure of a worker either by ignoring its results or by resubmitting its work to another worker. As in our work, the master detects the failure of any of the workers by continuously probing the communicator using MPI_ANY_SOURCE. The master is assumed to be fully reliable.

Ali et al. [2015] applied an algorithmic fault-tolerant adaptation of the sparse grid combination technique (SGCT) [Harding and Hegland, 2013] in the GENE (Gyrokinetic Electromagnetic Numerical Experiment) plasma application. Through data redundancy, introduced by the adapted SGCT, their implementation is capable of recovering lost sub-grids from existing ones more efficiently than using checkpoint/restart. Synchronous non-shrinking recovery of the MPI communicator is facilitated using ULFM.

3.7 Summary

This chapter has focused on the MPI backend of the X10 language, and how we extended it to support fault-tolerant execution using MPI-ULFM. We outlined the fault tolerance features that RX10 requires from the underlying communication layer to meet the requirements of the resilient async-finish model. We described the details of the integration between X10 and MPI-ULFM, including the use of MPI-ULFM fault-tolerant collectives by X10's Team collectives.

The experimental evaluation demonstrated strong performance and scalability advantages for RX10 applications from switching from the sockets backend to the MPI-ULFM backend. Most importantly, the performance of native Team collectives showed that RX10 bulk-synchronous applications can avoid most of the resilience overhead of the RX10 runtime system by using MPI-ULFM collectives. We will present a performance evaluation of a suite of resilient bulk-synchronous applications in Section 5.4.4.

In the next chapter, we describe another performance enhancement to RX10: an optimistic termination detection protocol for resilient finish.

Chapter 4

An Optimistic Protocol for Resilient Finish

This chapter describes how we reduced the resilience overhead of RX10 applications by designing a message-optimal resilient termination detection (TD) protocol for the async-finish task model. We refer to this protocol as 'optimistic finish' since it favors the performance of failure-free executions over the performance of failure recovery. In this chapter, we evaluate the performance of this protocol using microbenchmarks representing task graphs that are common in resilient X10 programs. In the next chapter, we evaluate the protocol using realistic X10 applications (see Section 5.4). After an introduction in Section 4.1, we give an overview of different nested task parallelism models in Section 4.2, and different approaches to resilient TD in Section 4.3. After that, we prove the optimality limits of non-resilient and resilient async-finish TD protocols in Section 4.4 and highlight the challenges of adding failure awareness to the async-finish model in Section 4.5. We then describe the data structures and APIs used by the X10 runtime system for control-flow tracking in Section 4.6. The following sections explain different async-finish TD protocols: non-resilient finish (Section 4.7), pessimistic resilient finish (Section 4.8), and our proposed optimistic resilient finish (Section 4.9). Section 4.10 describes two resilient stores that can be used for maintaining critical TD metadata and Section 4.11 concludes with a performance evaluation. This chapter is based on our paper “Resilient Optimistic Termination Detection for the Async-Finish Model” [Hamouda and Milthorpe, 2019].

4.1 Introduction

Dynamic nested-parallel programming models present an attractive approach for exploiting the available parallelism in modern HPC systems. A dynamic computation evolves by generating asynchronous tasks that form an arbitrary directed acyclic graph. A key feature of such computations is TD — determining when all tasks in a subgraph are complete. In an unreliable system, additional work is required to ensure that termination can be detected correctly in the presence of component failures. Task-based models for use in HPC must therefore support resilience through efficient fault-tolerant termination detection protocols.

The standard model of termination detection is the diffusing computation model [Dijkstra and Scholten, 1980]. In that model, the computation starts by activating a coordinator process that is responsible for signaling termination. The status of the other processes is initially idle. An idle process becomes active only by receiving a message from an active process. An active process can become idle at any time. The computation terminates when all processes are idle and no messages are in transit that could activate an idle process [Venkatesan, 1989].

Most termination detection algorithms for diffusing computations assign a parental responsibility to the intermediate tasks, requiring each intermediate task to signal its termination only after its successor tasks terminate [Dijkstra and Scholten, 1980; Lai and Wu, 1995; Lifflander et al., 2013]. The root task detects global termination when it receives the termination signals from its direct children. This execution model is very similar to Cilk's spawn-sync model [Blumofe et al., 1995], where a task calling ‘sync’ waits for join signals from the tasks it directly spawned. It is also similar to the execution model of the cobegin and coforall constructs of the Chapel language [Chamberlain et al., 2007]. While having the intermediate tasks as implicit TD coordinators is favorable for balancing the traffic of TD among tasks, it may add unnecessary blocking at the intermediate tasks that is not needed for correctness. The async-finish model does not impose this restriction; a task can spawn other asynchronous tasks without waiting for their termination. Unlike Cilk’s sync construct, the finish construct waits for join signals from a group of tasks that are spawned directly or transitively from it.
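As a concrete illustration of this difference, the following X10 fragment is a minimal sketch (the places p2 to p4 and the doWorkB()/doWorkC() calls are placeholders) in which an intermediate task spawns further remote tasks and terminates immediately, while only the enclosing finish waits for the whole group:

    finish {                          // single termination scope
        at (p2) async {               // intermediate task a
            at (p3) async doWorkB();  // task b: a does not wait for it
            at (p4) async doWorkC();  // task c: a does not wait for it
        }                             // a may terminate before b and c even start
    }                                 // the finish waits for a, b, and c (directly or transitively spawned)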

To the best of our knowledge, the first resilient termination detection protocol for the async-finish model was designed by Cunningham et al. [2014] as part of the RX10 project. This protocol does not force the intermediate tasks to wait for the termination of their direct successors. It maintains a consistent view of the task graph across multiple processes, which can be used to reconstruct the computation’s control flow in case of a process failure. It adds termination signals beyond the optimal number needed in a failure-free resilient execution, in exchange for simplified failure recovery. Since it favors failure recovery over normal execution, we describe this protocol as ‘pessimistic’.

In this chapter, we review the pessimistic finish protocol and demonstrate that the requirement for a consistent view results in a significant performance overhead for failure-free execution. We propose the ‘optimistic finish’ protocol, an alternative message-optimal protocol that relaxes the consistency requirement, resulting in a faster failure-free execution with a moderate increase in recovery cost.

4.2 Nested Task Parallelism Models

Computations that entail nested termination scopes are generally classified as fully-strict or terminally-strict [Blumofe and Leiserson, 1999] (Figure 4.1). Blumofe and Leiserson [1999] describe a fully-strict task DAG as one that has fork edges from a task to its children and join edges from each child to its direct parent. In other words, a task can only wait for other tasks it directly forked. On the other hand, a terminally-strict task DAG allows a join edge to connect a child to any of its ancestor tasks, including its direct parent, which means a task can wait for other tasks it directly or transitively created. Cilk’s spawn-sync programming model and X10’s async-finish programming model are the most prominent representatives of fully-strict and terminally-strict computations, respectively1. By relaxing the requirement that each task wait for its direct successors, the async-finish model avoids the unnecessary synchronization that spawn-sync would otherwise impose when creating dynamic, irregular task trees. Guo et al. [2009] provide more details on X10’s implementation of the async-finish model, contrasting it with Cilk’s spawn-sync model.

When failures occur, nodes in the computation tree are lost, resulting in sub-trees of the failed nodes breaking off the computation structure. Fault-tolerant termination detection protocols aim to reattach those sub-trees to the remaining computation to facilitate termination detection.

[Figure: (a) a fully-strict (spawn-sync) computation and (b) a terminally-strict (async-finish) computation, each drawn as a task tree (tasks a–h) with nested termination scopes.]

Figure 4.1: Nested parallelism models. A dotted-box represents a single termination scope. Circles represent tasks.

1Terminally-strict task graphs are also supported by the Chapel language using the sync statement.

4.3 Related Work

Termination detection is a well-studied problem in distributed computing, with multiple protocols proposed since the 1980s. Interested readers can refer to [Matocha and Camp, 1998] for a comprehensive survey of many TD algorithms. The DS protocol, presented by Dijkstra and Scholten [1980], was one of the earliest TD protocols for diffusing computations and has been extended in different ways for adding resilience. It is a message-optimal protocol such that for a computation that sends M basic messages, DS adds exactly M control messages to detect termination. The control messages are acknowledgements that a process must send for each message it receives. DS enforces the following constraints to ensure correct termination detection. First, it requires each process to take as its parent the origin of the first message it received while being idle. Second, a child process must delay acknowledging the parent's messages until after it has acknowledged all other predecessors and received acknowledgements from its successors. By having each parent serve as an implicit coordinator for its children, DS ensures that termination signals flow correctly from the leaves to the top of the tree. That also transforms the diffusing computation into a spawn-sync computation. Fault-tolerant extensions of the DS algorithm are presented in [Lai and Wu, 1995; Lifflander et al., 2013].

Lai and Wu [1995] describe a resilient protocol that can tolerate the failure of almost the entire system without adding any overhead for failure-free execution. The idea is that each process (idle or active) detecting a failure must detach from its parent, adopt the coordinator as its new parent, and share its local failure knowledge with its parent and the coordinator. On detecting a failure, the coordinator expects all processes to send termination signals directly to it. The protocol introduces a sequential bottleneck at the coordinator process when failures occur, which limits its applicability to large-scale HPC applications.

Venkatesan [1989] presented a TD algorithm that relies on replicating the local termination state of each process on k leaders to tolerate k-process failures. The protocol assumes that the processes are connected via first-in-first-out channels, and that a process can send k messages atomically to guarantee replication consistency. Unfortunately, these assumptions limit the algorithm’s applicability to specialized systems.

Lifflander et al. [2013] took a practical approach to resilient TD of a diffusing computation. Based on the assumption that multi-node failures are rare in practice and that the probability of a k-node failure decreases as k increases [Meneses et al., 2012; Moody et al., 2010], they designed three variants of the DS protocol that can tolerate most but not all failures. The INDEP protocol tolerates the failure of a single process or multiple unrelated processes. It requires each parent to have a consistent view of its successors and their successors. To achieve this goal, each process notifies its parent of its potential successor before sending a message to it. Two more protocols were proposed to address related-process failures. In addition to the notifications required in INDEP, these protocols also require each process to notify its grandparent when its state changes from interior (non-leaf node) to exterior (leaf node) or vice versa. A failure that affects both an interior node and its parent is fatal in these protocols. Two special network services are required for correct implementation of the three protocols: a network send fence and a fail-flush service.

To the best of our knowledge, the only prior work addressing resilient termination detection for the async-finish model is that of Cunningham et al. [2014]. They describe a protocol which separates the concerns of termination detection and survivability. They use a finish resilient store to maintain critical TD data for each finish construct. As the computation evolves, signals are sent to the resilient store to keep it as updated as possible with respect to potential changes in the control structure. These signals enable the resilient store to independently recover the control structure when failures occur; however, they impose significant performance overhead in failure-free execution. Unlike previous work, this protocol does not require any special network services other than failure detection.

Our work combines features from [Lifflander et al., 2013] and [Cunningham et al., 2014] to provide a practical low-overhead termination detection protocol for the async-finish model. Assuming multi-node failures are rare events, we designed a message-optimal ‘optimistic’ protocol that significantly reduces the resilience overhead in failure-free scenarios.

4.4 Resilient Async-Finish Optimality Limit

In this section, we consider the optimal number of control messages required for correct async-finish termination detection in both non-resilient and resilient implementations. We assume a finish block which includes nested async statements that create distributed tasks, such that each task and its parent (task or finish) are located at different places. Messages sent to fork these tasks at their intended locations are referred to as basic messages. For example, in the task graphs in Figure 4.2, three basic messages are sent to fork tasks a, b, and c. The additional messages used by the system for the purpose of termination detection are referred to as control messages (shown as dotted lines in the figures). We consider the basic messages as the baseline for any termination detection protocol, thus an optimal protocol will add the minimum number of control messages as overhead.

In resilient mode, we build on the design of Cunningham et al. [2014] in which the TD metadata of the finish constructs are maintained safely in a resilient store. A finish and the resilient store exchange two signals: the PUBLISH signal is sent from the finish to the store to create a corresponding ResilientFinish object, and the RELEASE signal flows in the other direction when the finish scope terminates (see Figure 4.2-b). As a finish scope evolves by existing tasks forking new tasks, finish needs to update its knowledge of the total number of active tasks so that it can wait for their termination. We refer to this number as the global count or gc. Finish forks the first task and sets gc = 1. A task must notify finish when it forks a successor task to allow finish to increase the number of active tasks (by incrementing gc). When a task terminates, it must notify finish to allow it to decrease the number of active tasks (by decrementing gc).

[Figure: message flow for forking tasks a, b, and c under (a) the non-resilient protocol, where task a packs its JOIN signal with the FORK signals of b and c into a single message {-a, +b, +c} to the finish, and (b) the resilient protocol, where FORK and JOIN signals are sent individually to the ResilientFinish object, which is created in the resilient store by a PUBLISH signal and releases the finish with a RELEASE signal.]

Figure 4.2: Message-optimal async-finish TD protocols.

When the last task terminates, gc reaches zero, and the finish is released. We use the terms FORK and JOIN to refer to the control signals used to notify finish when a new task is forked and when a running task terminates. Multiple signals from the same source may be packed in one message for better performance.

Lemma 1. A correct non-resilient finish requires one TD control message per task.

Proof. Finish detects termination only after all forked tasks terminate. Thus sending a JOIN signal when a task terminates is unavoidable for correct termination detection. During execution, a parent task may fork N successor tasks, and therefore it must notify finish with N FORK signals for these tasks. Assuming that failures do not occur, each task must eventually terminate and send its own JOIN signal. A task can buffer the FORK signals of its successor tasks locally and send them with its JOIN signal in the same message. Thus, with only one message per task, finish will eventually receive a FORK signal and a JOIN signal for each task, which guarantees correct termination detection.

Figure 4.2-a illustrates this method of non-resilient termination detection. We use (+) as a FORK signal, and (−) as a JOIN signal. When task a forks tasks b and c, it delays sending their FORK signals (+b, +c) until it joins. At this point, it packs its JOIN signal (−a) with the delayed FORK signals and sends one message containing the three signals (−a, +b, +c). Note that delaying the fork signals may result in tasks joining before their FORK signals are received by finish. A correct implementation must delay termination until each FORK is matched by a JOIN and each JOIN is matched by a FORK.

Lemma 2. A correct resilient finish requires two TD control messages per task.

Proof. In the presence of failures, tasks may fail at arbitrary times during execution. For correct termination detection, finish must be aware of the existence of each forked task. If a parent task fails in between forking a successor task and sending the FORK signal of this task to finish, finish will not track the successor task since it is not aware of its existence, and termination detection will be incorrect.

Therefore, a parent task must eagerly send the FORK signal of a successor task before forking the task and may not buffer the FORK signals locally. For correct termination detection, each task must also send a JOIN signal when it terminates. As a result, correct termination detection in the presence of failures requires two separate TD messages per task — a message for the task’s FORK signal, and a message for the task’s JOIN signal. The absence of either message makes termination detection incorrect.

Figure 4.2-b demonstrates this method of resilient termination detection, which ensures that a resilient finish is tracking every forked task. Assuming that a resilient finish can detect the failure of any node in the system, it can cancel forked tasks located at failed nodes to avoid waiting for them indefinitely. Note that in counting the messages, we do not consider the messages that the resilient store may generate internally to guarantee reliable storage of resilient finish objects. While a centralized resilient store may not introduce any additional communication, a replication-based store will introduce communication to replicate the objects consistently.

Lemma 3. Optimistic resilient finish is a message-optimal TD protocol.

Proof. Our proposed optimistic finish protocol (Section 4.9) uses exactly two messages per task to notify task forking and termination. Since both messages are necessary for correct termination detection, the optimistic finish protocol is message-optimal.

4.5 Async-Finish Termination Detection Under Failure

4.5.1 Failure Model

We focus on process (i.e. place) fail-stop failures. A failed place permanently terminates, and its data and tasks are immediately lost. We assume that each place will eventually detect the failure of any other place, and that a corrupted message due to the failure of its source will be dropped either by the network module or the deserialization module of the destination place. We assume that a message can be lost only if its source or destination has failed. We assume non-byzantine behavior. We do not assume any determinism in the order the network delivers messages.

4.5.2 Recovery Challenges

In this section, we use the following sample program to illustrate the challenges of async-finish TD under failure and the possible solutions. In Section 4.8 and Section 4.9, we describe how these challenges are addressed by the pessimistic protocol and the optimistic protocol, respectively.

finish /*F1*/ {
    at (p2) async { /*a*/ at (p3) async { /*c*/ } }
    at (p4) async { /*b*/ finish /*F2*/ at (p5) async { /*d*/ } }
}

[Figure: four scenarios on places 1–5: (a) normal execution; (b) loss of a live task and a child finish (place 4 dead, task d adopted by F1); (c) loss of the destination of an in-transit task (place 3 dead); (d) loss of the source of an in-transit task (place 2 dead). Resulting task sets: (a) Tasks[F1] = {a, b, [c]}; (b) Tasks[F1] = {a, [c], d}; (c) Tasks[F1] = {a, b}; (d) Tasks[F1] = {b} or {b, [c]} (?); Tasks[F2] = {d} in all cases.]

Figure 4.3: Task tracking under failure. The square brackets mark in-transit tasks. A dead place is covered by a gray rectangle.

Challenge 1 - Loss of termination detection metadata: As a computation evolves, finish objects are created at different places to maintain the TD metadata (e.g. the active tasks of each finish). Losing one of these objects impairs the control flow and prevents correct termination detection. To address this challenge, Cunningham et al. [2014] proposed using a resilient store that can save the data reliably and survive failures. The design of the resilient store is orthogonal to the termination detection protocol, thus different stores (i.e. centralized/distributed, disk-based/memory-based, native/out-of-process) can be used. However, the survivability of the protocol implementation is limited by the survivability of the store. For the above program, we assume that F1 and F2 have corresponding resilient finish objects in the resilient store.

Challenge 2 - Adopting orphan tasks: When the finish home place fails, the finish may leave behind active tasks that require tracking. We refer to these tasks as orphan tasks. According to the semantics of the async-finish model, a parent finish can only terminate after its nested (children) finishes terminate. A parent finish can maintain this rule by adopting the orphan tasks of its dead children to wait for their termination. Figure 4.3-b shows the adoption of task d by F1 after the home place of F2 failed.

Challenge 3 - Loss of in-transit and live tasks: Each task has a source place and a destination (home) place, which are the same for locally generated tasks. The active (non-terminated) tasks of the computation can be either running at their home place (live tasks) or transiting from a source place towards their home place (in-transit tasks). The failure of the destination place has the same impact on live and in-transit tasks. For both categories, the tasks are lost and their parent finish must exclude them from its global task count. For example, the failure of place 4 in Figure 4.3-b results in losing the live task b, and the failure of place 3 in Figure 4.3-c results in losing the in-transit task c, because its target place is no longer available. The failure of the source place has a different impact on live and in-transit tasks. Live tasks proceed normally regardless of the failure, because they already started execution at their destinations. However, in-transit tasks are more difficult to handle (Figure 4.3-d).

Based on Lemma 2, in resilient mode, a source place must notify its finish of a potential remote task before sending the task to its destination. If the source place died after the finish received the notification, the finish cannot determine whether the potential task was: 1) never transmitted, 2) fully transmitted and will eventually be received by the destination, or 3) partially transmitted and will be dropped at the destination due to message corruption. A unified rule that allows finish to tolerate this uncertainty is to consider any in-transit task whose source place has died as a lost task and exclude it from the global task count. The finish must also direct the destination place to drop the task in case it is successfully received in the future. To summarize, recovering the control flow requires the following:

1. adopting orphan tasks,

2. excluding live tasks whose destination place is dead,

3. excluding in-transit tasks whose source place or destination place is dead, and

4. preventing a destination place from executing an in-transit task whose source place is dead.

The optimistic finish protocol achieves these goals using the optimal number of TD messages per task, unlike the pessimistic protocol which uses one additional message per task.

4.6 Distributed Task Tracking

In this section, we describe an abstract framework that can be used to implement termination detection protocols, based on the X10 runtime implementation. The essence of X10’s TD framework can be described using the pseudocode in Listing 4.1 and Figure 4.4. In Sections 4.7, 4.8, and 4.9, we will describe three termination detection protocols based on this framework.

Listing 4.1: Finish TD API.

abstract class Finish(id:Id) {
    val latch:Latch;
    val parent:Finish;
    def wait() { latch.wait(); }
    def release() { latch.release(); }
}
abstract class LocalFinish(id:Id) {
    val gr:GlobalRef[Finish];
    def fork(src, dst):void;
    def begin(src, dst):bool;
    def join(src, dst):void;
}

[Figure: for finish { ... @src at (dst) async { S; } ... }, the Finish object lives at the finish home place, and the LocalFinish objects at src and dst hold global references to it. The steps are: at src, LF.fork(src, dst), then send(S); at dst, receive(S), and if LF.begin(src, dst) succeeds, execute(S) and then LF.join(src, dst).]

Figure 4.4: Tracking remote task creation.

4.6.1 Finish and LocalFinish Objects

A termination detection protocol is defined by providing concrete implementations of the abstract classes Finish and LocalFinish shown in Listing 4.1. One instance of Finish with a globally unique id is created for each finish block to maintain a global view of the distributed task graph. It includes a latch that is used to block the task that started a finish block until all the tasks within that block terminate. The function wait will be called at the end of the finish block, after the finish creates all its direct tasks. When all tasks (direct and transitive) terminate, the function release will be called to release the blocked task. The runtime system links the finish objects in a tree structure by providing to each finish object a reference to its parent finish. Each visited place within a finish block will create an instance of type LocalFinish to track task activities done locally. It holds a global reference to the global Finish object to notify it when changes in the task graph occur so that the Finish has an up-to-date view of the global control structure.

4.6.2 Task Events

The abstract class LocalFinish defines three interfaces to track task events: fork, begin, and join. Figure 4.4 shows the invocation of the three task events when a source place src spawns a task at a destination place dst. On forking a new task, the source place calls fork to notify finish of a potential new task, then it sends the task to the destination place. On receiving a task, the destination place calls begin to determine whether or not the task is valid for execution. If the task is valid, the destination place executes it, then calls join to notify task termination. If the task is invalid, the destination place drops it. In a non-resilient TD protocol, begin may assume that all incoming tasks are valid. On the other hand, a resilient termination detection protocol may consider any task sent by a dead place as invalid, even if the task is successfully received by the destination.

Therefore, when failures are expected, begin may need to consult its governing ResilientFinish object to determine the validity of the received task. In this chapter, we describe each protocol in terms of the variables of the Finish and LocalFinish objects and the implementations of the three task events fork, begin, and join. In the included pseudocode, we use the notation @F[id], @LF[id], and @RF[id] to refer to accessing a remote Finish object, LocalFinish object and ResilientFinish object, respectively.

4.7 Non-Resilient Finish Protocol

The default TD protocol in X10 is non-resilient. It assumes that the finish and local finish objects are available for the lifetime of the finish block, and that each spawned task will eventually join. Ignoring the garbage collection messages that the runtime uses to delete the LocalFinish objects at the end of a finish block, the non-resilient finish protocol is message-optimal. It applies the reporting mechanism described in Figure 4.2-a to use a maximum of one TD message per task.

Listing 4.2: Non-resilient finish pseudocode.

class NR_Finish(id) extends Finish {
    gc:int=0;            // global count
    def merge(remoteLive) {
        for (p in places) {
            gc += remoteLive[p];
        }
        if (gc == 0)
            release();
    }
}
class NR_LocalFinish(id) extends LocalFinish {
    lc:int=0;            // local count
    live:int[places]={0};
    def fork(src, dst) {
        live[dst]++;
    }
    def begin(src, dst) {
        lc++;
        return true;
    }
    def join(src, dst) {
        live[dst]--;
        lc--;
        if (lc == 0)
            @F[id].merge(live);
    }
}

The pseudocode in Listing 4.2 captures the details of the protocol implementation. The finish object maintains a global count, gc, and an array live stating the number of live tasks at each place. The LocalFinish object maintains a local count lc stating the number of tasks that began locally, and a local live array that states the number of spawned tasks to other places. The begin event, called at the receiving place, increments the local count without consulting the finish object, because this protocol is not prepared for receiving invalid tasks due to failures. When all the local tasks terminate (i.e. when lc reaches zero), LocalFinish passes its local live array to the finish object. Finish merges the passed array with its own live array and updates gc to reflect the current number of tasks. Termination is signaled when gc reaches zero.

4.7.1 Garbage Collection

The non-resilient finish implementation in X10 persists the LocalFinish objects in memory until the finish scope terminates. This enables the same object to be reused when tasks are repeatedly sent to the same place. When finish terminates, it sends asynchronous garbage collection messages to every place visited within the finish scope to delete the LocalFinish objects.

4.8 Resilient Pessimistic Finish

The pessimistic resilient finish protocol requires the resilient finish objects to track the tasks and independently repair their state when a failure occurs. Independent repair requires advance knowledge of the status of each active task (whether it is in-transit or live) and the set of children of each finish for adoption purposes. Classifying active tasks into in-transit and live is necessary for failure recovery, because the two types of tasks are treated differently with respect to the failure of their source, as described in Section 4.5. Using only the FORK and the JOIN signals (see Section 4.4), a resilient finish can track a task as it transitions between the not-existing, active, and terminated states. However, these two signals are not sufficient to distinguish between in-transit and live tasks. The pessimistic protocol adds a third task signal that we call VALIDATE to perform this classification. Although the classification is only needed for recovery, the application pays the added communication cost even in failure-free executions. The resilient finish object uses three variables for task tracking: gc to count the active tasks, live[] to count the live tasks at a certain place, and trans[][] to count the in-transit tasks between any two places. On receiving a FORK signal for a task moving from place s to place d, the resilient finish object increments trans[s][d] and the global count gc. When the destination place receives a task, it sends a VALIDATE message to resilient finish to check if the task is valid for execution. If both the source and the destination of the task are active, resilient finish declares the task as valid and transforms it from the transit state to the live state. That is done by decrementing trans[s][d] and incrementing live[d]. On receiving a JOIN signal for a task that lived at place d, the resilient finish decrements live[d] and gc (see Figure 4.5-a).

[Figure: task-tracking events for task c, forked from place 2 to place 3, under each protocol. (a) Pessimistic: FORK(2,3) sets transit[2][3]=1 and gc=3; on receipt, place 3 sends VALIDATE(2,3), and the resilient finish checks that places 2 and 3 are alive, sets transit[2][3]=0 and live[3]=1; after execution, JOIN(2,3) sets live[3]=0 and gc=2. (b) Optimistic: FORK(2,3) sets sent[2][3]=transOrLive[2][3]=1 and gc=3; on receipt, place 3 locally checks deny[2] and sets recv[2]=1 without contacting the store; after execution, JOIN(2,3) sets transOrLive[2][3]=0 and gc=2.]

Figure 4.5: Task tracking events as task c transitions from place 2 to place 3, based on Figure 4.3-a.

4.8.1 Adopting Orphan Tasks

Tracking the parental relation between finishes is key to identifying orphaned tasks. The pessimistic finish protocol requires each new finish not only to publish itself in the resilient store, but also to link itself to its parent. Thus, in addition to the PUBLISH and the RELEASE signals (see Section 4.4), a pessimistic finish uses a third signal ADD_CHILD to connect a new resilient finish to its parent. When a parent finish adopts a child finish, it deactivates the resilient finish object of the child and adds the child’s task counts to its own task counts. A deactivated finish forwards the received task signals to its adopter. The FORWARD_TO_ADOPTER directive in Listing 4.3 refers to this forwarding procedure.

4.8.2 Excluding Lost Tasks

When place P fails, the live tasks at P and the in-transit tasks from P and to P are considered lost. The number of lost tasks is the summation of live[P], trans[*][P], and trans[P][*]. After calculating the summation, the pessimistic finish object resets these counters and deducts the summation from the global count gc (see the recover method in Listing 4.3-Line 57). If the source place of an in-transit task fails, the finish requests the destination place to drop the task using the response of the VALIDATE signal.

4.8.3 Garbage Collection

The pessimistic finish protocol recovers the control structure without collaboration with the LocalFinish objects. Therefore, it allows a LocalFinish object to be deleted automatically once its tasks terminate. Explicit garbage collection messages from a resilient finish to the LocalFinish objects are not required.

Listing 4.3: Pessimistic finish pseudocode.

 1 abstract class P_ResilientStore {
 2     def PUBLISH(id):void;
 3     def ADD_CHILD(parentId, childId):void;
 4 }
 5 class P_Finish(id:Id) extends Finish {
 6     def make(parent:Finish) {
 7         @store.ADD_CHILD(parent.id, id);
 8         @store.PUBLISH(id);
 9     }
10 }
11 class P_LocalFinish(id:Id) extends LocalFinish {
12     def fork(src, dst) {
13         @RF[id].FORK(src, dst);
14     }
15     def join(src, dst){
16         @RF[id].JOIN(src, dst);
17     }
18     def begin(src, dst) {
19         return @RF[id].VALIDATE(src, dst);
20     }
21 }
22 class P_ResilientFinish(id:Id) {
23     gc:int=0;
24     live:int[places];
25     trans:int[places][places];
26     children:Set[Id];
27     adopter:Id;
28     def FORK(src, dst){
29         FORWARD_TO_ADOPTER;
30         if (bothAlive(src, dst)) {
31             trans[src][dst]++;
32             gc++;
33         }
34     }
35     def JOIN(src, dst) {
36         FORWARD_TO_ADOPTER;
37         if (!dst.isDead()) {
38             live[dst]--;
39             gc--;
40             if (gc == 0)
41                 @F[id].release();
42         }
43     }
44     def VALIDATE(src, dst) {
45         FORWARD_TO_ADOPTER;
46         if (bothAlive(src, dst)) {
47             trans[src][dst]--;
48             live[dst]++;
49             return true;
50         }
51         else
52             return false;
53     }
54     def addChild(cId) {
55         children.add(cId);
56     }
57     def recover(dead) {
58         // adopt orphaned tasks
59         for (c in children) {
60             if (c.home == dead) {
61                 trans += @RF[c].trans;
62                 live += @RF[c].live;
63                 gc += @RF[c].gc;
64                 @RF[c].adopter = id;
65             }
66         }
67         // exclude lost tasks
68         gc -= trans[dead][*] + trans[*][dead] + live[dead];
69         trans[dead][*] = 0;
70         trans[*][dead] = 0;
71         live[dead] = 0;
72 
73         if (gc == 0)
74             @F[id].release();
75     }
76 }

4.9 Resilient Optimistic Finish

The main drawback of the pessimistic finish protocol is the assumption that all system components except the resilient store are unreliable, and that they cannot be used for recovery. Because of this, it requires the places to consult the resilient store before making any change in the task DAG to keep it synchronized with the state of tasks at all places. The optimistic finish protocol makes the more practical assumption that a failure will impact a small minority of the system components [Lifflander et al., 2013; Sato et al., 2012; Meneses et al., 2012; Moody et al., 2010], and that the majority of components will be available and capable of collaborating to recover the control structure when failures occur. The optimistic finish protocol aims to provide reliable execution of async-finish computations using the minimum number of TD messages. It optimizes over the pessimistic protocol by removing from the critical path of task execution any communication that is needed only for failure recovery. In particular, it removes the VALIDATE signal which classifies active tasks into in-transit and live, and removes the ADD_CHILD signal which synchronously tracks the children of each finish. It compensates for the missing information due to removing these signals by empowering the places with additional metadata that can complete the knowledge of the resilient store at failure recovery time.

A resilient optimistic finish object uses the following variables for task tracking: gc to count the active tasks, transOrLive[][] to count the active tasks, which may be in-transit or live, given their source and destination, and sent[][] to count the total number of sent tasks between any two places, which includes active and terminated tasks. Each visited place within a finish scope records the following variables in its LocalFinish object: recv[] to count the number of received tasks from a certain place, and deny[] to check whether it can accept in-transit tasks from a certain place. Initially, tasks can be accepted from any place. When a source place s forks a task to a destination place d, transOrLive[s][d], sent[s][d], and the global count gc are incremented (see Listing 4.4-Line 38). When the destination place receives the task, it locally determines whether or not the task is valid for execution using its deny table (see Listing 4.4-Line 21). If the task is valid, the place executes it and sends a JOIN signal when the task terminates. The JOIN signal carries both the source and the destination of the task and results in decrementing transOrLive[s][d] and gc (see Figure 4.5-b). Note that sent[][] and recv[] are never decremented. We will show in Section 4.9.2 how the sent[][] and the recv[] tables are used for resolving the number of lost in-transit tasks due to the failure of their source.

4.9.1 Adopting Orphan Tasks

The optimistic protocol does not use the ADD_CHILD signal, but rather calculates the set of children needing adoption at failure recovery time. Each resilient finish object records the id of its parent, which was given in the PUBLISH signal that created the object. The protocol relies on the fact that a child finish at place x will have been created by one of the living tasks at place x governed by the parent finish. When a place P dies, each resilient finish object checks the value of transOrLive[*][P] to determine whether it has any active tasks at that place. If there are no active tasks at P, then there are no children needing adoption due to the failure of place P. Otherwise, it consults the resilient store to retrieve the list of children whose home place is P and therefore require adoption. The parent finish records these children in a set called ghosts. Termination is detected when gc reaches zero and the ghosts set is empty (see the condition of tryRelease() in Listing 4.4). A valid resilient store implementation of the optimistic finish protocol must implement the FIND_CHILDREN function. This function reduces to a search in a local set of resilient finish objects in a centralized resilient store, or a query to the backup of the dead place in a replication-based resilient store. We refer to the adopted children as ‘ghosts’ in this protocol because we keep them active after their corresponding finish dies. The ghost finishes continue to govern their own tasks as normal, unlike the pessimistic finish protocol which deactivates the adopted children. When a ghost finish determines that all its tasks have terminated, it sends a REMOVE_CHILD signal to its parent (Line 58 in Listing 4.4).

When the parent receives this signal, it removes the child finish from its ghosts set and checks for the possibility of releasing its corresponding finish.
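As an illustration of the store-side support required here, the following sketch shows FIND_CHILDREN for a centralized resilient store, written in the same pseudocode style as Listing 4.4. The map of resilient finish objects, the homeOf helper, and the class name are illustrative and not part of the thesis implementation (PUBLISH is omitted).

    class CentralizedResilientStore extends O_ResilientStore {
        all:Map[Id, O_ResilientFinish];    // every resilient finish object, held at the store's place

        def FIND_CHILDREN(id:Id, dead:Place):Set[Id] {
            val children = new Set[Id]();
            for (rf in all.values()) {
                // a finish whose parent is 'id' and whose home was the dead place
                // has lost its Finish object and must be adopted by 'id' as a ghost
                if (rf.parent == id && homeOf(rf.id) == dead)
                    children.add(rf.id);
            }
            return children;
        }
    }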

4.9.2 Excluding Lost Tasks

Like the pessimistic protocol, we aim to exclude all transiting tasks from and to a dead place and all live tasks at a dead place. However, because transiting and live tasks are not distinguished in our protocol, more work is required to identify lost tasks. For a destination place P, transOrLive[s][P] is the number of in-transit tasks from s to P plus the live tasks executing at P. If P failed, both categories of tasks are lost and must be excluded from the global count. After determining the ghost children (as described in Section 4.9.1), the resilient finish object can deduct transOrLive[*][P] from the global count and reset transOrLive[*][P] for each failed place P. Any JOIN messages received from the dead place P must be discarded, otherwise they may incorrectly alter the global count 2. Handling the failure of both the source and the destination reduces to handling the failure of the destination. For a source place P, transOrLive[P][d] is the number of in-transit tasks from P to d plus the live tasks sent by P that are executing at d. If P failed, only the in-transit tasks are lost and must be excluded from the global count; the live tasks proceed normally. An optimistic resilient finish can only identify the number of in-transit tasks through communication with the destination place d. Place d records the total number of received tasks from P in recv[P]. At the same time, the resilient finish object records the total number of sent tasks from P to d in sent[P][d]. The difference between sent[P][d] and recv[P] is the number of transiting tasks from P to d. The resilient finish object relies on a new signal COUNT_TRANSIT to calculate this difference and to stop place d from receiving future tasks from place P by setting deny[P] = true (see the COUNT_TRANSIT method in Listing 4.4 and its call in Listing 4.4-Line 78).
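The following is a small worked sketch of this counter arithmetic for a dead source place P and a destination place d, using illustrative numbers and the calls from Listing 4.4:

    // State before recovery (illustrative numbers):
    //   sent[P][d]        = 3   // tasks that the resilient finish knows P forked towards d
    //   recv[P] at d      = 2   // tasks that d actually received from P
    //   transOrLive[P][d] = 3   // P's tasks towards d still counted as active
    // Recovery step for the pair (P, d):
    val t = @LF[id].COUNT_TRANSIT(3, P);   // at place d: sets deny[P] = true and returns 3 - 2 = 1
    transOrLive[P][d] -= t;                // exclude the single lost in-transit task
    gc -= t;                               // the two tasks already live at d remain tracked
    // A late copy of the lost task arriving at d is dropped, because begin() now
    // finds deny[P] == true.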

4.9.3 Garbage Collection

The optimistic finish protocol requires the places visited within a finish scope to maintain their LocalFinish objects until the finish terminates, because the LocalFinish objects collaborate with the resilient finish object during failure recovery, as described above. When a resilient finish detects the termination of all its tasks, it sends explicit asynchronous garbage collection messages to all visited places to delete the LocalFinish objects.
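The following pseudocode sketches this cleanup step in the style of Listing 4.4; the names (visited, LF, @Uncounted) are illustrative and do not correspond to the actual RX10 implementation:

    // after the finish is released, lazily discard the LocalFinish objects at
    // the visited places; these cleanup messages are fire-and-forget and are
    // not tracked by any finish
    def gcLocalFinishObjects(id:Id, visited:Set[Place]) {
        for (p in visited)
            at (p) @Uncounted async LF.remove(id);
    }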

2 We do not assume any determinism in the order in which the network delivers messages. The cut-off time for a resilient finish object to accept or ignore a JOIN message from a place is the time at which it detects the place’s failure.

Listing 4.4: Optimistic finish pseudocode.
 1 abstract class O_ResilientStore {
 2   def PUBLISH(id, parentId):void;
 3   def FIND_CHILDREN(id, place):Set[Id];
 4 }
 5 class O_Finish(id:Id) extends Finish {
 6   def make(parent:Finish) {
 7     @store.PUBLISH(id, parent.id);
 8   }
 9 }
10 class O_LocalFinish(id:Id) extends LocalFinish {
11   deny:bool[places];
12   recv:int[places];
13   def fork(src, dst) {
14     @RF[id].FORK(src, dst);
15   }
16   def join(src, dst) {
17     @RF[id].JOIN(src, dst);
18   }
19   def begin(src, dst) {
20     if (deny[src]) {
21       return false;
22     } else {
23       recv[src]++;
24       return true;
25     }
26   }
27   def COUNT_TRANSIT(nSent, dead) {
28     deny[dead] = true;
29     return nSent - recv[dead];
30   }
31 }
32 class O_ResilientFinish(id:Id) {
33   gc:int = 0;
34   parent:Id;
35   transOrLive:int[places][places];
36   sent:int[places][places];
37   ghosts:Set[Id]; isGhost:bool;
38   def FORK(src, dst) {
39     if (bothAlive(src, dst)) {
40       transOrLive[src][dst]++;
41       gc++;
42       sent[src][dst]++;
43     }
44   }
45   def JOIN(src, dst) {
46     if (!dst.isDead()) {
47       transOrLive[src][dst]--;
48       gc--;
49       tryRelease();
50     }
51   }
52   def removeChild(ghostId) {
53     ghosts.remove(ghostId); tryRelease();
54   }
55   def tryRelease() {
56     if (gc == 0 && ghosts.empty()) {
57       if (isGhost)
58         @RF[parent].removeChild(id);
59       else
60         @F[id].release();
61     }
62   }
63   def recover(dead) {
64     // adopt orphaned tasks
65     if (transOrLive[*][dead] > 0) {
66       val c = @store.FIND_CHILDREN(id, dead);
67       ghosts.addAll(c);
68       for (g in c)
69         @RF[g].isGhost = true;
70     }
71
72     // exclude lost tasks
73     gc -= transOrLive[*][dead];
74     transOrLive[*][dead] = 0;
75     for (p in places) {
76       if (transOrLive[dead][p] > 0) {
77         val s = sent[dead][p];
78         val t = @LF[id].COUNT_TRANSIT(s, dead);
79         transOrLive[dead][p] -= t;
80         gc -= t;
81       }
82     }
83
84     tryRelease();
85   }
86 }

4.9.4 Optimistic Finish TLA Specification

Temporal Logic of Actions (TLA) [The TLA Home Page] is a language for specifying and automatically verifying software systems. A system’s specification includes an initial state, a set of actions that can update the system’s state, and a set of safety and liveness properties that describe the correctness constraints of the system. The TLA model checker, named TLC, explores all possible interleavings of the actions and reports any detected violations of the system’s properties. The TLC tool has a distributed implementation and a centralized implementation with different capabilities.

We developed a formal model of the optimistic finish protocol in TLA to verify the protocol’s correctness. The model simulates all possible n-level task graphs that can be created on a p-place system, where each node of the task graph has at most c children. It can also simulate the occurrence of one or more place failures as the task graph evolves. The full specification is included in Appendix B. It defines the protocol behaviour using the 22 actions listed in Table 4.1. We defined a guard on each action that determines the preconditions that must be satisfied before the action can fire. A correct execution occurs only if the guard of the action ProgramTerminating is eventually satisfied, which indicates the termination of the root finish of the graph.

Table 4.1: TLA+ actions describing the optimistic finish protocol.

Type                         Actions list
Task actions                 CreatingFinish, CreatingRemoteTask, TerminatingTask
Finish actions               ReceivingPublishDoneSignal, ReceivingReleaseSignal, CreatingRemoteTask, TerminatingTask
LocalFinish actions          MarkingDeadPlace, DroppingTask, SendingTask
Communication actions        ReceivingTask, ReceivingPublishSignal, ReceivingTransitSignal, ReceivingTerminateTaskSignal, ReceivingTerminateGhostSignal
ResilientFinish actions      FindingGhostChildren, AddingGhostChildren, CancellingTasksToDeadPlace, SendingCountTransitSignalToLocalFinish, CancellingTransitTasksFromDeadPlace
Failure actions              KillingPlace
Program termination action   ProgramTerminating

The distributed TLC tool currently cannot validate liveness properties, such as ‘the system must eventually terminate’, which we needed to guarantee in our protocol. The centralized TLC tool can perform this validation; however, it is not scalable. Using the centralized tool, it was infeasible for us to simulate large graph structures without getting out-of-memory errors due to the large number of actions in our model. Therefore, we decided to use a small graph configuration that can simulate all scenarios of our optimistic protocol.

In order to verify the case when a parent finish adopts the tasks of a dead child, we need at least a 3-level graph, such that the finish at the top level can adopt the tasks at the third level that belong to a lost finish at the second level. In our protocol, separate cases handle the failure of the source place of a task and the failure of the destination place of a task. With one place failure we can simulate either case. The case when a task loses both its source and destination requires killing two places; however, in our protocol, handling the failure of both the source and destination is equivalent to handling the failure of the destination alone. Therefore, one place failure is sufficient to verify all the rules of our protocol. Because we use the top finish to detect the full termination of the graph, we do not kill the place of the top finish. Therefore, we need two or more places in order to test the correctness of the failure scenarios. We used three places in our testing.

Testing was performed on an 8-core Intel i7-3770 3.40 GHz system running the Ubuntu 14.04 operating system. It took a total of 2 hours and 59 minutes to verify the correctness of our protocol over a 3-level task tree with a branching factor of 2 using 3 places, where a failure can impact one place at any point in the execution. Interested readers can check the specification and the outputs of the verification in the public GitHub repository [X10 Formal Specifications].

Based on the challenges described in Section 4.5, we believe the graph configuration used is sufficient for detecting any errors in the protocol. The model will not reach the ProgramTerminating state if handling the loss of in-transit and live tasks is incorrect, because the root finish will wait indefinitely for tasks that have died. Similarly, if the adoption mechanism is incorrect, the root finish will wait indefinitely for the REMOVE_CHILD signal from its ghost child. The TLC tool verified that the ProgramTerminating state is eventually reached, which means that the root finish does not block indefinitely and is eventually released. This result demonstrates that the optimistic protocol can recover the control flow of a computation correctly.

4.10 Finish Resilient Store Implementations

The resilient store is a key factor in the performance and the survivability of the TD implementation. Cunningham et al. [2014] used three resilient store types to evaluate the overhead of RX10 using the pessimistic finish protocol. One of the stores is based on an off-the-shelf Java-based resilient store called ZooKeeper [Hadoop/ZooKeeper], which is not suitable for HPC platforms. The two other stores are written in X10 itself; therefore, they can execute on HPC platforms by compiling them to Native X10 programs over MPI-ULFM.3

One of the two HPC-suitable stores is a centralized store that keeps all resilient finish objects at place-zero, assuming that place-zero will survive all failures. The centralized nature of the store simplifies the implementation of the resilient protocol. However, it results in a performance bottleneck with large numbers of concurrent finish objects and tasks. The other store is a distributed store that replicates the state of each finish object at two places — the home place of the finish, which holds the master replica, and the next place, which holds a backup replica. Each task and finish signal is synchronously replicated on both replicas. When the master replica fails, the backup replica detects the failure and triggers the adoption mechanism defined by the protocol. Unfortunately, this implementation was later removed from the code base of RX10 due to its complexity and instability. As a result, users of RX10 are currently limited to using the non-scalable centralized place-zero finish store.

3 Cunningham et al. [2014] evaluated the three stores over X10RT-Sockets, which was the only transport layer with resilience support at that time.

4.10.1 Reviving the Distributed Finish Store

Because a centralized place-zero finish store can significantly limit the performance of RX10, we decided to reimplement a distributed finish store for RX10 for both the optimistic and pessimistic protocols. Using TLA’s model checker, we identified a serious bug in the replication protocol described in [Cunningham et al., 2014] for synchronizing the master and the backup replicas of a finish. The problem in their implementation is that the master replica is in charge of forwarding task signals to the backup replica on behalf of the tasks. If the master dies, a task handles this failure by sending its signal directly to the backup. When the master fails after forwarding the signal to the backup, the backup receives the same signal twice — once from the dead master and once from the task itself. This corrupts the task counters at the backup and results in incorrect termination detection.

Using TLA, we designed a replication protocol that requires each task to communicate directly with both the master and the backup (a sketch of this signalling pattern is given at the end of this section). The protocol ensures that each signal will be processed only once by each replica in both failure-free and failure scenarios. When one replica detects the failure of the other replica, it recreates the lost replica on another place using its own state. The protocol ensures that if both replicas are lost before a recovery is performed, the active tasks will reliably detect this catastrophic failure, which should lead the RX10 runtime to terminate. Otherwise, the distributed store can successfully handle failures of multiple unrelated places. Because the failure of place zero is unrecoverable in the X10 runtime system, our distributed finish implementations do not replicate the finish constructs of place zero (the same optimization is used in [Cunningham et al., 2014]). The full specification of the replication protocol is available in Appendix C.

In the performance evaluation section, we use these acronyms to refer to the different implementations of the optimistic and pessimistic protocols: pessimistic place-zero finish (P-p0), optimistic place-zero finish (O-p0), pessimistic distributed finish (P-dist), and optimistic distributed finish (O-dist).
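The following pseudocode sketches the direct signalling pattern used by the replication protocol; the method and type names are illustrative and do not correspond to the actual RX10 implementation:

    // a task sends its TD signal to both replicas itself, so no replica ever
    // forwards a signal on a task's behalf and no signal is applied twice
    def sendSignal(sig:Signal, master:Place, backup:Place) {
        finish {
            at (master) async apply(sig);  // e.g. a FORK or JOIN signal
            at (backup) async apply(sig);
        }
        // failure handling omitted: if one replica has failed, the surviving
        // replica recreates the lost replica on another place (Appendix C
        // gives the full protocol)
    }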

4.11 Performance Evaluation

In this chapter, we evaluate the optimistic and pessimistic finish protocols using micro-benchmarks that represent patterns of task graphs with different communication intensities. We compare the centralized and distributed implementations of these protocols, aiming to identify the protocol-implementation combination that delivers the best performance for certain task graphs. In the next chapter, in Section 5.4.4, we perform the same evaluation using representative applications from the molecular dynamics and machine learning domains.

4.11.1 Experimental Setup

We conducted the following experiments on the Raijin supercomputer at NCI, the Australian National Computational Infrastructure. Each compute node in Raijin has dual 8-core Intel Xeon (Sandy Bridge, 2.6 GHz) processors and uses an InfiniBand FDR network. We allocated 10 GiB of memory per node and statically bound each place to a separate core. Each core is oversubscribed with two internal place threads: a worker thread4 for handling user tasks, and an immediate thread5 for handling low-level runtime communication.

We use one core per place because most of the evaluated distributed task patterns create one task at the majority of places (i.e. single remote task, flat fan-out, flat fan-out message back, tree fan-out, and ring around). Only two distributed patterns (the all-to-all patterns) create a large number of local parallel tasks per place. For consistency, we decided to evaluate all the patterns using the same mapping configuration. We believe that one mapping strategy is sufficient for the purpose of the experiments, which is comparing the overhead of the different TD protocols.

We built MPI-ULFM from revision e87f595 of the master branch of the repository at https://bitbucket.org/icldistcomp/ulfm2.git. We built the X10 compiler and runtime from source revision 36ca628 of the optimistic branch of our repository https://github.com/shamouda/x10.git, which is based on release 2.6.1 of the X10 language.

4.11.2 BenchMicro

We use the BenchMicro.x10 program, designed by Cunningham et al. [2014] for evaluating the resilience overhead of X10 in representative computation patterns.6 We modified the program to start all the computations from the middle place, rather than from place zero. This avoids giving an unfair advantage to the centralized implementations, which would otherwise handle most of the signals locally at place zero. We also modified the warm-up procedure to only execute the all-to-all pattern, rather than executing all the patterns, to save execution time on Raijin. The all-to-all pattern puts each place in communication with every other place, which we find sufficient for warming up the transport layer at all places.

4 One worker thread is configured by setting the environment variable X10_NTHREADS=1.
5 One immediate thread is configured by setting the environment variable X10_NUM_IMMEDIATE_THREADS=1.
6 The source code of BenchMicro is available at x10/x10.dist/samples/resiliency/BenchMicro.x10.

For each computation pattern, we measure its execution time with the following runtime configurations:

• non-resilient: X10 with the default non-resilient finish implementation available in the X10 2.6.1 release.

• P-p0: RX10 using the place-zero pessimistic finish available in the X10 2.6.1 release.

• O-p0: RX10 using our proposed place-zero optimistic finish.

• P-dist: RX10 using our proposed distributed pessimistic finish.

• O-dist: RX10 using our proposed distributed optimistic finish.

We measured the performance using 256, 512, and 1024 places, with one place per core. Each pattern in each configuration was executed 30 times. In the following figures, we show the median with error bars representing the range between the 25th and the 75th percentiles. Table 4.2 and Table 4.3 summarize the performance results with 1024 places.

4.11.2.1 Performance Factors

To clarify the performance results obtained by the different finish implementations, we review two important performance factors:

Garbage Collection Messages: The pessimistic finish implementations do not maintain remote references to the LocalFinish objects created at places visited within a finish scope. The local garbage collector of each place eventually deletes these objects after their tasks complete. On the other hand, the non-resilient finish and the optimistic resilient finish implementations maintain remote references to the local finish objects for future use. After a finish terminates, explicit garbage collection messages are sent asynchronously to clean up the LocalFinish objects. The absence of Garbage Collection (GC) messages in the pessimistic implementations gives them a performance advantage over the optimistic implementations. The GC messages can also be a source of performance variability for the optimistic finish implementations. In the BenchMicro program, for example, running multiple iterations of the same pattern to obtain aggregate performance results is likely to cause GC messages from a previous iteration to interfere with TD messages of the current iteration.

Releasing a Finish: Depending on the finish implementation, releasing a finish can be done locally or through remote communication. The non-resilient implementation and the resilient distributed implementations release a finish locally, whereas the resilient place-zero implementations release a finish remotely. The non-resilient finish can directly release itself because it handles its termination detection independently. A resilient place-zero finish delegates termination detection to the resilient finish object at place zero; therefore, releasing a finish requires a remote message from place zero to the finish home place. On the other hand, a resilient distributed finish delegates termination detection to the master replica located at the same place as the finish; therefore, releasing a finish is done locally. To conclude, as the number of concurrent finishes increases, the cost of releasing a finish is expected to grow linearly with the place-zero implementations and to remain constant with the distributed implementations.

4.11.2.2 Performance Results

In the following sections, we analyse the scalability and the resilience overhead of different computation patterns.

1. Local Finish

The first pattern represents a local finish governing 100 local activities, as performed by this code:

    finish { for (i in 1..100) async S; }

Spawning local parallel tasks is commonly used in X10 programs for exploiting multi-core parallelism within a node. The first release of RX10, in version 2.4.1, did not differentiate tasks spawned locally from tasks spawned remotely; all tasks involved interactions with the resilient store for tracking their state. As a result, a local finish governing 100 tasks could suffer a slowdown of up to 1000x with a centralized resilient finish implementation [Grove et al., 2019].

Figure 4.6: BenchMicro: local finish performance. [Plot: execution time in seconds versus 256, 512, and 1024 places for the non-resilient, P-p0, O-p0, P-dist, and O-dist configurations.]

A major enhancement to RX10, in version 2.6.1, was achieved by the X10 team by omitting interactions with the resilient store for tasks spawned locally [Grove et al., 2019]. Since the failure of a place results in the immediate loss of its local tasks, using the resilient store to track the exact number of local tasks at each place is not useful. Hence, a non-resilient local counter is sufficient for tracking a local fragment of the task tree at a certain place. When the counter reaches zero, the place sends a single message to the resilient store to report the termination of the root task of the local fragment, but only if the root task was received from another place. Following the same concept, a finish does not need to be recorded in the resilient store until it spawns a remote task. Consequently, the resilience overhead of a local finish can be virtually eliminated regardless of how many local tasks it controls.

All four resilient finish implementations under evaluation in this chapter include the local task and finish optimizations. The results in Figure 4.6 show comparable performance between the different finish configurations. Minor implementation differences are behind the slight variation in performance. As shown in Table 4.2, with 1024 places, there is no significant slowdown for the place-zero finish implementations versus non-resilient finish. The distributed implementations are slower by a factor of 1.2, not due to a communication overhead, but due to a minor implementation difference in the way the finish objects are created and recorded locally.
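As a rough pseudocode sketch of the local-fragment counting described above (the names are illustrative and follow the conventions of Listing 4.4, not the actual RX10 code):

    // each place counts its fragment of the task tree with a plain local counter;
    // the resilient store is contacted at most once, when the fragment completes
    class LocalFragment(id:Id, rootSrc:Place) {
        count:int = 1;                    // counts the fragment's root task
        def fork() { count++; }           // local spawn: no resilient-store message
        def join() {
            count--;
            if (count == 0 && rootSrc != here)
                @RF[id].JOIN(rootSrc, here);  // one message for the whole fragment
        }
    }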

2. Single Remote Task

The second pattern is for a finish that spawns a single asynchronous remote task at the next place, as performed by the following code:

    val home = here;
    val next = Place.places().next(home);
    finish at (next) async S;

This pattern creates a Finish object at the home place and a LocalFinish object at the next place. Despite the additional asynchronous GC message for deleting the LocalFinish object, optimistic finish outperforms pessimistic finish in this pattern, as shown in Figure 4.7. With 1024 places, the performance improvement achieved by the optimistic protocol is 18% with a place-zero finish and 29% with a distributed finish.

Because the number of spawned tasks does not depend on the number of places, this pattern is expected to achieve constant scaling. However, the results show an unexpected increase in the execution time as the number of places increases, even in non-resilient mode. The same trend was observed when we assigned two cores per place on Raijin to reduce the possible jitter due to oversubscribing each core with two threads, and when testing this pattern using Native X10 over sockets on the NECTAR cloud. In contrast, the expected non-increasing execution time was achieved using the Managed X10 runtime (i.e. X10 compiled to Java) over sockets on the NECTAR cloud. Therefore, we conjecture that there is a limitation in the native X10 runtime that causes it to suffer a slight decrease in performance as the number of places increases, and that the increasing execution time in the “single remote task” pattern is not due to the implementation of finish.

Figure 4.7: BenchMicro: single remote task performance. [Plot: execution time in seconds versus 256, 512, and 1024 places for the non-resilient, P-p0, O-p0, P-dist, and O-dist configurations.]

3. Flat Fan-Out

A typical X10 program uses a flat fan-out at least once during execution to distribute the work over the available places. In a flat fan-out, a finish spawns an asynchronous task directly at each place, as shown in the following code:

    finish { for (p in places) at (p) async S; }

This pattern is expected to achieve linear scalability as the number of places increases. Figure 4.8 shows that the different finish implementations indeed scale linearly, and that optimistic finish outperforms pessimistic finish in most cases. With 1024 places, the performance improvement achieved by the optimistic protocol is 59% with a place-zero finish and only 14% with a distributed finish. We conjecture that the high performance variability with P-p0 at 1024 places is due to the interference of the immediate thread with the worker thread at place zero.

Assuming the number of places is N, the expected total number of messages for tracking the tasks is 3N for P-p0 (N to fork, N to begin, and N to terminate) and 2N for O-p0 (N to fork and N to terminate). In a distributed finish, the master replica is hosted at the home place of the finish. Thus tasks originating from the finish home execute the fork signal at the master replica locally. The fork signals of the backup replica and all the begin and terminate signals generate remote messages. Therefore, the total number of TD messages for a flat fan-out is 5N for P-dist (N to fork, 2N to begin, and 2N to terminate) and 3N for O-dist (N to fork and 2N to terminate). The increase in the number of TD messages and the use of synchronous replication cause a distributed finish to suffer significant resilience overhead in this pattern. Moreover, since this pattern creates only one finish, the scalability advantage of a distributed finish store is not leveraged.

X10 applications usually contain multiple task patterns, of which flat fan-out is only one. The cost of the fan-out pattern in the distributed modes may result in a high performance penalty for applications that can benefit from the scalability of a distributed finish in other patterns. One possible optimization for a distributed finish is to avoid replicating the finishes whose home is place zero [Cunningham et al., 2014]. While in BenchMicro we start all the patterns from the middle place, in practice, most (if not all) X10 applications start a flat fan-out from place zero. Because the failure of place zero is considered catastrophic in RX10, resilient termination detection for a finish hosted at place zero is simply wasted work. This optimization is available implicitly in the resilient place-zero finish implementations, because a finish belonging to place zero is not replicated elsewhere. Therefore, it is fair to add this optimization to the distributed finish implementations as well. We used this optimization to achieve better performance with distributed finish for some applications that we evaluate in the next chapter.

[Diagram: a single finish spawning one task at each of the places p, q, r, and s.]

Figure 4.8: BenchMicro: flat fan-out performance. The left figure shows the scaling performance for all finish modes; the right figure zooms in on the same results for place-zero modes only.

4. Flat Fan-Out Message Back

In this pattern, a finish governs a flat fan-out to all places, and each place responds by spawning an asynchronous message back at the finish home, as performed by this code:

    val home = here;
    finish {
        for (p in places) at (p) async {
            at (home) async S;
        }
    }

This pattern is often used for gathering partial results computed by individual places at one place. Its scaling performance, as shown in Figure 4.9, is generally similar to the scaling performance of the fan-out pattern in Figure 4.8. However, doubling the number of tasks governed by the finish makes the pessimistic resilience overhead more evident in this pattern than in the fan-out pattern, especially with a distributed finish. The performance improvement achieved by the optimistic protocol is 15% with a place-zero finish and 68% with a distributed finish. We conjecture that the high performance variability with O-p0 at 512 and 1024 places is due to interference from garbage collection messages and/or the immediate thread.

[Diagram: a finish at place r spawning one task at each of the places p, q, r, and s, each of which spawns a task back at place r.]

Figure 4.9: BenchMicro: flat fan-out message back performance. The left figure shows the scaling performance for all finish modes; the right figure zooms in on the same results for place-zero modes only.

5. Tree Fan-Out

The tree fan-out pattern traverses the places recursively, assuming a binary tree topology of the places. The root place creates the topmost finish, which spawns two asynchronous tasks at its direct children. Each child receives the task and creates a nested finish that spawns two other tasks at its own children, and so on. The following code demonstrates this pattern:

    def traverse() {
        if (noChildren()) return;
        finish {
            at (getChild1(here)) async { traverse(); }
            at (getChild2(here)) async { traverse(); }
        }
    }

With only three data points, the performance results in Figure 4.10 seem to show almost constant scaling; however, this pattern is expected to achieve logarithmic scaling as the number of places increases. The tree fan-out pattern creates a finish at each intermediate place, governing only two tasks. Because the task-to-finish ratio is small, the relative advantage of the optimistic finish is not significant. Meanwhile, the results show a strong advantage for the distributed finish implementations compared to the centralized implementations. The large number of concurrent finishes created by this pattern generates a sequential bottleneck at place zero, which significantly slows down the execution. On the other hand, the distributed finish achieves better performance by balancing the termination detection load among the places. With 1024 places, the slowdown factors versus non-resilient finish are 8.6, 8.5, 1.4, and 1.1 for P-p0, O-p0, P-dist, and O-dist, respectively. The performance improvement achieved by the optimistic protocol is 2% with a place-zero finish and 27% with a distributed finish.

[Diagram: a binary tree of finishes — the root finish spawns tasks at places p and q, each of which creates a nested finish spawning tasks at places r, s and t, u.]

Figure 4.10: BenchMicro: tree fan-out performance. [Plot: execution time in seconds versus 256, 512, and 1024 places for the non-resilient, P-p0, O-p0, P-dist, and O-dist configurations.]

6. All-To-All

The all-to-all pattern creates a single finish that governs a fan-out to all the places, followed by a fan-out from each place to all the places. The following code demonstrates this process:

    finish {
        for (p in places) at (p) async {
            for (q in places) at (q) async S;
        }
    }

A graphical demonstration of this pattern using four places is given in Figure 4.11. If the number of places is N, a single finish will track a total of N² + N tasks directly. Reducing the task-level TD signals by adopting the optimistic finish protocol results in a significant performance improvement for this pattern. With 1024 places, the performance improvement achieved by the optimistic protocol is 53% with a place-zero finish and 59% with a distributed finish.

[Diagram: a single finish spawning a task at each of the places p, q, r, and s, each of which spawns a task at every place.]

Figure 4.11: BenchMicro: all-to-all performance. [Plot: execution time in seconds versus 256, 512, and 1024 places for the non-resilient, P-p0, O-p0, P-dist, and O-dist configurations.]

7. All-To-All With Nested Finish

This pattern is similar to the all-to-all pattern with the exception that each internal fan-out is governed by its own finish. As shown in the code below, the top finish in the first line governs N nested finishes, one per place.

1 finish {
2   for (p in places) at (p) async {
3     finish {
4       for (q in places) at (q) async S;
5     }
6   }
7 }

[Diagram: a top-level finish spawning a task at each of the places p, q, r, and s; each place creates a nested finish that spawns a task at every place.]

Figure 4.12: BenchMicro: all-to-all with nested finish performance. The left figure shows the scaling performance for all finish modes; the right figure zooms in on the same results for distributed modes only.

Decomposing a large finish into multiple concurrent finishes at different places can improve the performance by allowing the TD logic to be executed in parallel at different places. Our results show that adding the nested finish to the all-to-all pattern improves the performance by 40% in non-resilient mode. A more significant improvement is achieved in resilient mode using the distributed finish implementations: 97% with P-dist and 94% with O-dist. In contrast, the centralized implementations, P-p0 and O-p0, are respectively 7% and 105% slower, due to the sequential bottleneck imposed by place zero and the added cost of creating and releasing the concurrent finish objects.

Only a modest improvement is credited to the optimistic finish protocol in this pattern. With 1024 places, the performance improvement due to the optimistic protocol is 11% with a place-zero finish and 7% with a distributed finish. The “all-to-all with nested finish” pattern essentially executes N parallel “flat fan-outs”. To put the numbers in context, recall that the improvement in the flat fan-out pattern with 1024 places is 59% with a place-zero finish and 14% with a distributed finish. The centralized nature of the place-zero implementations prevents the “all-to-all with nested finish” pattern from exploiting the available distributed parallelism and results in lower efficiency. On the other hand, the improvement achieved with a distributed finish is comparable to that of the “flat fan-out” pattern.

8. Ring Around Via At

Assuming a ring topology for the places, this pattern traverses the places in a clockwise direction using the at construct, as shown below:

    def ring(destination:Place):void {
        if (destination == here) return;
        val next = Place.places().next(here);
        at (next) ring(destination);
    }

For N places, this pattern creates N tasks, one per place, all chained together such that each task waits for the termination of the task it spawns at the next place. Blocking is achieved by the at construct, which implicitly creates a finish to track the task executing the body of the at. The choice of the resilient finish implementation has little effect in this pattern due to the sequential nature of the execution and the small task-to-finish ratio. Figure 4.13 shows that the four resilient implementations achieve almost the same performance. With 1024 places, the slowdown factors versus non-resilient finish are 1.8, 1.7, 1.8, and 1.8 for P-p0, O-p0, P-dist, and O-dist, respectively.

[Diagram: the at construct chained around the ring — place p executes at q, which executes at r, which executes at s.]

Figure 4.13: BenchMicro: ring around via at performance. [Plot: execution time in seconds versus 256, 512, and 1024 places for the non-resilient, P-p0, O-p0, P-dist, and O-dist configurations.]

4.11.3 Conclusions

From the previous analysis, we can draw the following conclusions:

1. Our proposed optimistic protocol successfully reduces the resilience overhead of RX10. Implementing more efficient GC mechanisms for deleting remote LocalFinish objects may further improve the performance of the optimistic finish implementations.

2. The effect of the optimistic protocol is more evident as the number of remote tasks managed by a finish increases.

3. The more concurrent and distributed the finish scopes are in the program, the better the performance of the resilient distributed finish implementations.

Table 4.2: Execution time in seconds for BenchMicro patterns with 1024 places.

Pattern                          Non-resilient   P-p0       O-p0       P-dist     O-dist
1 Local finish                   3.20E-05        3.35E-05   3.20E-05   3.70E-05   3.75E-05
2 Single remote task             1.20E-04        2.80E-04   2.30E-04   3.10E-04   2.20E-04
3 Flat fan-out                   7.25E-02        1.67E-01   6.87E-02   6.69E-01   5.74E-01
4 Flat fan-out message back      1.51E-01        2.05E-01   1.76E-01   3.37E+00   1.09E+00
5 Tree fan-out                   1.76E-02        1.52E-01   1.49E-01   2.52E-02   1.85E-02
6 All-to-all                     1.12E+00        5.77E+01   2.69E+01   1.07E+02   4.40E+01
7 All-to-all with nested finish  6.81E-01        6.19E+01   5.50E+01   2.81E+00   2.60E+00
8 Ring around via at             5.67E+00        1.00E+01   9.88E+00   1.02E+01   1.01E+01

Table 4.3: Slowdown factor versus non-resilient finish with 1024 places. Slowdown factor = (resilient time / non-resilient time). The “Opt. %” columns show the percentage of performance improvement attributed to the optimistic finish protocol.

Pattern                          P-p0   O-p0   Opt. %   P-dist   O-dist   Opt. %
1 Local finish                   1.0    1.0    0%       1.2      1.2      0%
2 Single remote task             2.3    1.9    18%      2.6      1.8      29%
3 Flat fan-out                   2.3    0.9    59%      9.2      7.9      14%
4 Flat fan-out message back      1.4    1.2    15%      22.3     7.2      68%
5 Tree fan-out                   8.6    8.5    2%       1.4      1.1      27%
6 All-to-all                     51.4   23.9   53%      95.6     39.2     59%
7 All-to-all with nested finish  90.9   80.8   11%      4.1      3.8      7%
8 Ring around via at             1.8    1.7    1%       1.8      1.8      0%

4.12 Summary

This chapter presented optimistic finish, a termination detection protocol for the async-finish programming model. By reducing the signals required for tracking tasks and finish scopes, our protocol reduces the resilience overhead of communication-intensive and highly decomposed task-parallel kernels. Our BenchMicro analysis, which evaluates the resilience overhead of different computation patterns, gives insights into the performance of RX10 with different TD implementations and encourages certain task decomposition patterns for achieving better performance. For example, the ‘tree fan-out’ pattern and the ‘all-to-all with nested finish’ pattern demonstrate the advantage of decomposing large finish scopes into nested distributed finish scopes, assuming a distributed finish implementation is available. The task patterns frequently used in an application should guide the user’s choice of a specific TD protocol and implementation.

Moving to the next chapter, we change the focus from improving the efficiency of the resilient async-finish model to improving its productivity. We demonstrate how we exploited the composability and failure awareness capabilities of this model to build resilient abstractions that can reduce the programming effort required for building resilient applications in X10.

Chapter 5

Towards Data Resilience in X10

Multi-resolution resilience — our approach to reconciling resilience, performance, and productivity — depends on the availability of efficient and composable low-level resilient constructs that can be used for building productive higher-level resilient constructs. In Chapter 4, we focused on the efficiency of resilient finish as a low-level resilient construct for the APGAS model. In this chapter, we focus on its composability. We demonstrate the advantage of providing resilient nested parallelism in the APGAS model for building composable resilient frameworks that can reduce the programming complexity required to create fault-tolerant applications.

Data resilience — the ability to preserve data in spite of failure — is a challenging aspect in the development of resilient applications. In order to remove this burden from the programmer, we designed two resilient stores that can be used by applications for saving critical data or used as a foundation for building higher-level resilient frameworks.

Section 5.1 describes and justifies the adopted design principles for adding data resilience in the APGAS model. Section 5.2 describes a novel extension to the async-finish programming model to support resilient data atomicity, which enabled us to provide a transactional resilient store for RX10 applications. Following that, we dedicate Section 5.3 to describing productive resilient application frameworks for X10. In particular, in Section 5.3.1, we describe the implementation details of the two resilient stores; in Section 5.3.2, we describe a resilient iterative framework for bulk-synchronous applications; and in Section 5.3.3, we describe a parallel workers framework suitable for embarrassingly parallel applications. We conclude in Section 5.4 with a performance evaluation using a suite of resilient applications.

This chapter is based on work described in the following publications:

• “A resilient framework for iterative linear algebra applications in X10” [Hamouda et al., 2015], which describes the initial implementation of the iterative application framework.

• “Resilient X10 over MPI user level failure mitigation” [Hamouda et al., 2016], which describes, among other things, enhancements to the performance of iterative applications by exploiting MPI-ULFM capabilities.


• “Failure recovery in resilient X10” [Grove et al., 2019], which is the outcome of a close collaboration with the X10 team on designing and implementing a generic resilient store for X10. While I focused on developing a native resilient store that is built entirely using X10 constructs, Olivier Tardieu provided another implementation of the store based on an off-the-shelf resilient store called Hazelcast [Hazelcast, Inc., 2014] for use only by Managed X10. Both the native store and the Hazelcast-based store of X10 lack support for distributed transactions.1 The work described in this chapter related to extending finish with atomicity support and the consequent support for distributed transactions was done without significant collaboration with the X10 team. These capabilities have not yet been contributed to the X10 open-source code repository. The work in [Grove et al., 2019] focuses its performance evaluation on Managed X10 and the Hazelcast-based store, except for the LULESH experiment, which used the native store. The focus of my thesis is on high-performance computing; therefore, I focus my evaluation in this chapter on Native X10 over MPI-ULFM and on the native resilient store. Finally, the iterative framework described in previous publications [Hamouda et al., 2015, 2016] was enhanced by David Grove by integrating it with the PlaceManager class (see Section 2.4.5.1), first described in [Grove et al., 2019].

5.1 A Resilient Data Store for the APGAS Model

The APGAS model represents a computation as a control flow of nested parallel tasks and global data partitioned among the places. Failures result in gaps in the control flow and the application data. A resilient APGAS programming model must, therefore, enable restoring a consistent state of both the control flow and the data of the application to enable it to complete successfully after failures. Initial development of RX10 [Cunningham et al., 2014] focused mainly on recovering the control flow and left the burden of protecting application data to the programmer. Although the provided failure-awareness semantics allow implementing a variety of data redundancy techniques, this requires considerable programming effort to ensure the correctness and efficiency of the implementation.

In collaboration with the X10 team at IBM, we addressed this limitation by extending RX10 with a resilient data store abstraction in the form of a distributed concurrent key-value map. We developed two implementations of the store: a place-local store that is suitable for coordinated checkpointing (Section 5.3.1.1) and a transactional store that is suitable for dynamic applications with arbitrary data access patterns (Section 5.3.1.2). Reads and writes to the resilient store can be mixed arbitrarily with normal reads and writes to other memory locations; however, data preservation and atomicity are guaranteed only for the resilient store data.

Our resilient stores provide a high degree of reliability; however, certain failure scenarios, which are rare in practice, can lead to loss of resilient data. We refer to failures that result in losing data from the resilient store as catastrophic failures. Our implementations guard each data access operation with validations that guarantee reliable detection and reporting of catastrophic failures. The following four sections describe the underlying design principles of our APGAS resilient store.

1 The Hazelcast-based store of X10 does not expose the transaction functionality available in Hazelcast.

5.1.1 Strong Locality

Aligned with the strong locality feature of the APGAS model, the resilient store is partitioned among the places such that each key-value record explicitly belongs to one place. The store’s data at a certain place can be accessed via resilient get(key) and set(key, value) operations that can nest freely with other X10 constructs. Thanks to the Happens-Before Invariance (HBI) principle underlying the semantics of RX10, the programmer can use the store with the guarantee that, before the failure is raised to the application, orphan tasks updating the store will either complete successfully or never start. In other words, orphan tasks and the application’s recovery tasks will not race, which simplifies reasoning about the state of the store’s data in the presence of failure. Each key in the store has a version, a value, and a lock that guards concurrent access to it. The features of the lock are described in Section 5.2.3.2.
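The following is a minimal usage sketch of the store interface described above; the store handle and key names are illustrative:

    // read a record from the current place's partition, then update a record
    // owned by place p; both operations nest freely with at/async/finish
    val localSum = store.get("partialSum");
    finish at (p) async {
        store.set("partialSum", localSum);  // resilient write to p's partition
    }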

5.1.2 Double In-Memory Replication

The storage medium used by the resilient store is a critical factor in determining its survivability and performance. Based on the type of storage medium, resilient stores can be classified into disk-based or diskless stores. A disk-based resilient store for an HPC environment typically writes its data to a Parallel File System (PFS), which is available in most large-scale infrastructures. It can survive the failure of the entire computation thanks to the durability of disk storage. However, the high I/O latency of the PFS can slow down data access operations and make the resilient store a performance bottleneck at scale. As the load of disk access increases, the PFS itself can become a source of failures, reducing the reliability of the application [Sato et al., 2014].

A diskless resilient store leverages the low latency of memory access by writing the data in memory while protecting against data loss by employing a data redundancy mechanism, such as replication or data encoding [Cappello, 2009]. The data redundancy mechanism places an upper bound on the survivability of the store. For example, replicating the data at two processes (i.e. double in-memory replication) enables the store to survive only failures that do not kill these two processes at the same time. Increasing the number of replicas improves the store’s survivability at a higher performance and memory cost. Because, in practice, most failures impact one or a few nodes simultaneously [Lifflander et al., 2013; Sato et al., 2012; Meneses et al., 2012; Moody et al., 2010], a double in-memory resilient store can protect the application from the majority of failures.

We designed the X10 resilient store based on double in-memory replication. As shown in Figure 5.1, each place owns a master replica of its data and a slave replica of its left neighbor’s data.

[Diagram: four places, each holding the master replica of its own data and a slave replica of its left neighbour’s data (e.g. place 1 holds Master 1 and Slave 0).]

Figure 5.1: Resilient store replication.
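The following is a minimal sketch of a write under this scheme, assuming each place keeps its master partition in a local map and a backup copy of its left neighbour’s partition in another map; the names are illustrative and failure handling is omitted:

    // update the master replica locally, then synchronously update the backup
    // copy held by the right neighbour before returning to the caller
    def resilientSet(key:String, value:Any) {
        masterMap.put(key, value);
        val backupPlace = Place.places().next(here); // the right neighbour
        finish at (backupPlace) async {
            backupOfLeftNeighbour.put(key, value);
        }
    }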

5.1.3 Non-Shrinking Recovery

As previously described in our taxonomy in Section 2.2.3.1, resilient runtime systems can support shrinking and/or non-shrinking recovery. For statically partitioned applications, which represent a wide class of HPC applications, shrinking recovery is challenging to use because it requires the programmer not only to adjust the communication topology to accommodate fewer processes, but also to handle possible load imbalance when the workload of the failed process shifts to other processes. Non-shrinking recovery avoids these complexities by resuming the computation on the same number of processes. See Figure 5.2 for a demonstration of shrinking and non-shrinking recovery strategies for a statically partitioned 2D domain.

Our earliest work with RX10, as published in [Hamouda et al., 2015], evaluates the performance of shrinking and non-shrinking recovery for three machine learning benchmarks: linear regression, logistic regression, and PageRank. Linear regression and logistic regression store the input classification examples in a dense matrix partitioned into N horizontal blocks, where N is the number of places. Each block holds 50K classification examples, with 500 features each. PageRank stores the input document graph in a sparse matrix and follows the same partitioning strategy. Each block holds 2M edges of the graph. The applications achieved data resilience by checkpointing each block both in local memory and in the memory of one neighboring place. Figure 5.3 shows the performance of the three applications after experiencing one place failure, using an old version of X10 (v2.5.2).

Non-shrinking recovery results in the fastest performance for the three applications. It maintains the balance between the places and limits the need for communication during recovery to the new place only. Shrinking recovery without repartitioning, in our implementation, assigns the blocks of the failed process to only one live process (similar to the example in Figure 5.2-b). As in non-shrinking recovery, only one process needs remote communication for recovery; the rest of the processes recover their blocks from their local memory. Unlike non-shrinking recovery, however, the load between the places is not balanced, which results in slower performance.

[Diagram panels: a) initial partition mapping of blocks b0–b5 across places 0, 1, and 2; b) shrinking recovery without repartitioning, sacrificing load balancing; c) shrinking recovery with repartitioning for load balancing; d) non-shrinking recovery, with a new place taking over the failed place’s blocks.]

Figure 5.2: Shrinking recovery versus non-shrinking recovery for a 2-dimensional data grid.

Finally, shrinking recovery with load balancing results in the slowest performance and the highest performance variability. Changing the block partitioning to achieve load balancing means that the content of a new block may have been distributed among multiple processes before the failure (similar to the example in Figure 5.2-c). During recovery, each place communicates with a group of other places to restore the contents of its blocks, which explains the resulting performance overhead. Many optimizations have been made to RX10 and the iterative framework since this paper was published. However, the results are still useful as a demonstration of the benefit of non-shrinking recovery for statically balanced bulk-synchronous applications.

Our store supports non-shrinking recovery. Replicas lost due to a place failure are recovered on a spare place using the redundancy available in the store, as shown in Figure 5.4. The PlaceManager class provides a logical group of places, named the active places, that maintains a fixed size despite failures as long as spare resources are available. By replacing a failed place with a new place at the same position, it automatically supports place virtualization by allowing the program to use the place order as its identifier, rather than the physical place id (see Section 2.4.5.1). Our store spans the group of active places and recovers the data with the awareness that a failed place will be replaced with a new place in the same order.
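A hypothetical sketch of how a program might drive this recovery is shown below; the method names on the place manager and the store are assumptions made for illustration and do not necessarily match the actual APIs:

    // on detecting a failure: rebuild the active-places group, then ask the
    // store to recreate the lost master/slave replicas on the replacement place
    val changes = manager.rebuildActivePlaces();   // assumed PlaceManager method
    for (added in changes.addedPlaces) {
        store.recoverReplicas(added);              // illustrative store method
    }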

[Plots: weak scaling total time (seconds) versus number of places (up to 44) for Linear Regression and Logistic Regression (50K examples per place, 500 features, Native X10 v2.5.2) and PageRank (2M edges per place, Native X10 v2.5.2), comparing shrinking recovery with load balancing, shrinking recovery, non-shrinking recovery, and non-resilient (no failure) runs.]

Figure 5.3: Weak scaling performance for three GML benchmarks with non-shrinking and shrinking recovery (see [Hamouda et al., 2015] for more details).

[Diagram: four active places and one spare place; after a place failure, the lost master and slave replicas (e.g. Master 2 and Slave 1) are recreated on the spare place that replaces the failed place.]

Figure 5.4: Resilient store recovery.

5.1.4 Distributed Transactions

While many applications can use the resilient store for single-place get/set operations, other applications may require atomic multi-place operations. For example, a global load balancing application may require a two-place write operation to steal work from an overloaded place and assign it to an underloaded place atomically. Figure 5.5 shows another example representing a graph clustering problem, where parallel tasks perform atomic multi-place write operations to acquire remote vertices into their clusters. Conflicts arise when two parallel tasks attempt to allocate the same vertex (e.g. when task 1 and task 3 attempt to acquire their fifth vertex). The goal is to ensure that the execution of a group of operations on a shared data structure in parallel with other groups is serializable (i.e. equivalent to a serial execution of these groups) and atomic (i.e. either all the operations in a group commit or none) [Herlihy and Moss, 1993].

[Diagram: three tasks growing clusters over a distributed graph, with conflicts where two tasks attempt to acquire the same vertex.]

Figure 5.5: Distributed graph clustering example.

A transaction is a productive concurrency control mechanism that aims to hide the serializability challenges from the programmer. The programmer groups a set of read/write operations in a transaction and relies on a Transactional Memory (TM) system to execute it, possibly with other concurrent transactions, while ensuring serializability. A conflict happens when two concurrent transactions access the same data item and at least one of them is writing. In that case, the TM system aborts at least one of them and rolls back any performed changes. Non-conflicting parallel transactions commit successfully and apply their changes permanently. A Distributed Transactional Memory (DTM) system [Bernstein and Goodman, 1981] can guarantee serializable execution of transactions spanning multiple nodes (we refer to these nodes as transaction participants). A DTM system relies on two main modules to ensure atomic execution of distributed transactions:

• a concurrency control mechanism executing locally at each participant that can detect conflicts, commit, or abort a piece of a transaction running at a certain participant (see Section 5.2.3.3 for details on our implementation).

• a distributed commit protocol that ensures, via a transaction coordinator, that the effects of a transaction will persist either at all participants or at none, despite the occurrence of failures (see Section 5.2.3.4 for details on our implementation).

Failures pose a major challenge for DTM systems. Without proper failure awareness, a distributed commit protocol may indefinitely delay terminating transactions impacted by a failure, as well as other transactions depending on them. To avoid that, DTM systems use fault-tolerant commit protocols, which guarantee that every started transaction will eventually terminate despite the failure of a transaction coordinator or any of its participants.

Currently, X10 lacks support for distributed transactions, which forces programmers of non-resilient and resilient applications to handle atomic multi-place operations by manually orchestrating data access via locks. However, using locks is prone to programming errors, complicates writing composable code, and may lead to poor performance if concurrent access is aggressively limited [Harris et al., 2010; Saad and Ravindran, 2011]. These problems are exacerbated when failures are considered.

To enable X10 programmers to easily apply atomic multi-place operations in their programs, we extended X10 with support for distributed transactions over the resilient store (in non-resilient mode, data replication is disabled, assuming the absence of failures). A transaction can originate from any place and asynchronously span multiple places using the nested parallelism support of X10. In the next section, we describe the transactional finish construct — our approach for combining termination detection and transaction commitment while allowing the full flexibility of nested parallelism. We also detail how we relied on existing mechanisms in the RX10 runtime to provide resilient transactions that can maintain the integrity and availability of the resilient store in the presence of failures.

5.2 From Finish to Transaction

Similarities between the coordination mechanism used by finish for termination detection and the coordination mechanism used by distributed commit protocols inspired us to design a combined coordination protocol that achieves both termination detection and distributed transaction commitment. Moreover, it turns out that RX10 already provides the required infrastructure for implementing coordination protocols that are resilient to the loss of a coordinator and/or the loss of any coordinated place.

Many DTM systems tolerate the failure of a coordinator by writing a log of its decisions to disk, assuming that a recovery process will eventually start and terminate the pending transactions using the failed coordinator’s log [Mohan et al., 1986]. Other systems avoid delaying transaction termination due to the failure of the coordinator by requiring the participants to elect one of them as a new coordinator [Garcia-Molina, 1982; Keidar and Dolev, 1998]. In both cases, the availability of the coordinator is critical for transaction termination. RX10 has the same requirement for the finish objects, and for that reason, it uses a resilient store for maintaining the finish objects.

By providing the finish resilient store, RX10 guarantees the availability of the termination detectors throughout the execution of their tasks, which consequently guarantees that failures will not indefinitely delay the termination of a finish. These guarantees are of course subject to the survivability of the finish resilient store itself. A resilient finish engages in communication with every place used within its scope, which gives it a full and accurate knowledge of its subordinates. Finally, the network layer provides a resilient finish with global failure detection, which enables it to take the necessary recovery actions upon detecting the failure of any of its subordinates. We decided to exploit these powerful capabilities for supporting resilient distributed transactions in RX10. We propose a transactional finish construct that handles, in addition to the normal termination detection, atomic multi-place operations. A transactional finish eventually terminates while guaranteeing data consistency and availability in the absence of catastrophic failures. The following subsections describe the transactional finish in more detail. When we use the term “finish”, without preceding it with the word “transactional”, we refer to the default finish provided by X10 that is used for synchronization only.

5.2.1 Transactional Finish Construct

Assuming that Tx is a class representing a transaction associated with a certain data store, we propose the construct finish (tx:Tx) S;, which is used as follows:

try { finish (tx:Tx) S; }
catch (e:MultipleExceptions) { E; }

By parameterizing a finish with a Tx instance, this finish works as a transactional finish that not only waits for the termination of all asynchronous tasks created by S, but also ensures that all data access operations, in the form of tx.get(key) and tx.set(key, value), invoked by S are executed atomically. The runtime system implicitly creates an instance of type Tx and attaches it to the finish. Each transaction has a globally unique identifier accessed by tx.id.

5.2.1.1 Nesting Semantics

A transactional finish can include within its scope nested invocations of finish, async and at. Although we syntactically allow a transactional finish to include nested transactional finishes, our current implementation lacks proper support for nested transactions [Moss and Hosking, 2006]. Therefore, the runtime system currently executes nested transactions as independent transactions. We aim to address this limitation in the future by exploring the diverse mechanisms proposed in the literature for handling nested transactions. The code in Listing 5.1 demonstrates the nesting feature of a transactional finish. The transactional finish starting at Line 1 spawns three asynchronous tasks at places p, q, and t. Place p sets D = 100, and place q sets C = A + B, where A is hosted at place r, and B is hosted at place s. The finish at Line 5 is a synchronization-only finish used to ensure that reading A and B completes before C is updated.

Place t does not perform any updates on the data store. X10's atomic construct provides atomicity protection within a place. It is used in Line 8 and Line 12 to guard against race conditions when the two asynchronous tasks created by place r and place s update the shared variable gr() (see Section 2.4.3 for a description of GlobalRef).

Listing 5.1: Transactional finish code example.
1  finish (tx:Tx) {
2      at (p) async tx.set("D", 100);
3      at (q) async {
4          val gr = GlobalRef[Box](new Box());
5          finish {
6              at (r) async {
7                  val a = tx.get("A");
8                  at (q) async atomic { gr().add(a); }
9              }
10             at (s) async {
11                 val b = tx.get("B");
12                 at (q) async atomic { gr().add(b); }
13             }
14         }
15         tx.set("C", gr().getValue()); // the sum of a and b
16     }
17     at (t) async { print("Hello"); }
18 }

The transactional finish at Line 1 is the transaction coordinator. After all nested tasks complete, it implicitly starts a commit protocol that includes places p, q, r, and s only. There is no need to include place t because it does not perform any read or write operation on the data store. Normally, an outer finish lacks the knowledge about the places visited within the scope of an inner finish. However, as a transaction coordinator, a transactional finish must know all the places that accessed the data store either directly or through a nested finish. Section 5.2.2.2 describes how we enable a transactional finish to gain this knowledge.

5.2.1.2 Error Reporting Semantics

Like a normal finish, a transactional finish also records the exceptions thrown within its scope and reports them through an instance of type MultipleExceptions. A transactional finish that terminates successfully without an exception indicates a successfully committed transaction. In contrast, a transactional finish that terminates with an exception indicates an aborted transaction. The root cause of aborting the transaction can be detected by inspecting the list of the reported exceptions. The list can include: 1) instances of type DeadPlaceException (DPE) reporting the failure of a used place (can be a participant or a non-participant in the transaction), 2) instances of type ConflictException (CE) reporting conflicts detected at participants, or 3) any other application-related exceptions. While the user can catch exceptions thrown by an inner finish, doing so does

not prevent the parent transactional finish from aborting the transaction if any of the participants died or experienced a conflict. Therefore, the use of finish within a transaction should be for synchronization purposes only and not for handling CEs or DPEs. In resilient mode, committing a transaction attempts to apply its updates on the master and slave replicas of each participant. Depending on the specification of the commit protocol, a transactional finish can mask a failure that impacts one of the replicas assuming that the surviving replica will recover the lost replica using its own state. Section 5.2.3.4 describes how our implementation handles the failure of one of the replicas during the commit protocol.

5.2.1.3 Compiler-Free Implementation

To avoid the complexities involved in modifying the X10 compiler, we currently provide a compiler-free implementation of the transactional finish construct. It requires manual construction of a Tx object and manual binding of this object to a finish as shown in the code below. The method store.makeTransaction creates a Tx object, and the method Runtime.registerTransaction registers a transaction for a finish. The store variable is an instance of the TxStore class that we will describe in Section 5.3.1.2.

val tx = store.makeTransaction();
finish {
    // must be the first action in the finish block
    Runtime.registerTransaction(tx);
    S;
}

5.2.2 Finish Atomicity Awareness

A transactional finish is a special type of finish that acts as a distributed transaction. It runs through two phases: execution and commitment. The execution phase is when tasks are being evaluated and spawned within the finish scope. When all tasks terminate, the commitment phase starts by invoking a commit protocol to terminate the transaction by either committing it or aborting it. Table 5.1 describes three differences between a finish and a transactional finish — all the differences are related to tracking the participants of a transaction. In the following, we describe how we amended the finish protocol to enable accurate tracking of transaction participants.

5.2.2.1 The Join Signal

While finish considers all places equal, a transactional finish distinguishes transaction participants from non-participants. This distinction enables it to speed up the commitment phase by ignoring non-participants. According to Mohan et al. [1986], one of the desirable characteristics in a commit protocol is “exploitation of completely or partially read-only transactions”. Thus, a transactional finish aims to distinguish read-only participants from write participants to exploit more opportunities for performance optimizations.

Table 5.1: Comparison between finish and transactional finish.

    Finish                                            Transactional Finish
1   all visited places are equal                      performance optimizations can be achieved by
                                                      1) distinguishing participants from
                                                      non-participants and 2) distinguishing
                                                      read-only participants
2   finish can forget a place once all its            a transactional finish must not forget a
    tasks have terminated                             transaction participant
3   an outer finish is unaware of places              a transactional finish must identify all
    visited by an inner finish (an adopted            participants, even those visited by an
    pessimistic finish is an exception)               inner finish

We used the ‘join’ signal as a means of identifying the transaction participants and their types. Each task is modified to hold two properties: whether it accessed the resilient store, and if so, whether it accessed the store for reading only. These properties are implicitly set by tx.get(key) and tx.set(key, value) when called by a task. When a task terminates, it includes these properties as part of its ‘join’ signal to the coordinating finish, which uses this information to identify the participants and their types. Depending on the termination detection protocol, it may be possible for a transactional finish to forget the identity of a non-participant to save memory; however, it must keep the identities of the participants available until after the commitment phase completes.
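As an illustration of this bookkeeping, the sketch below shows how two per-task boolean flags could be maintained and reported with the ‘join’ signal; the class and method names are assumptions made for this sketch, not the actual runtime fields.

// Sketch of per-task flags carried in the 'join' signal so that the
// coordinating finish can classify participants (names are illustrative).
class TaskTxInfo {
    var accessedStore:Boolean = false; // set by both tx.get and tx.set
    var readOnly:Boolean = true;       // cleared by tx.set

    def onGet() { accessedStore = true; }
    def onSet() { accessedStore = true; readOnly = false; }

    // values reported to the coordinating finish when the task joins
    def isParticipant():Boolean = accessedStore;
    def isReadOnlyParticipant():Boolean = accessedStore && readOnly;
}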

5.2.2.2 The Merge Signal

An inner finish normally hides details about the places it is tracking from its parent. If this inner finish is part of a transaction, it may be tracking a subset of the transaction participants, and in that case, it must share this knowledge with the transaction coordinator (which is the nearest outer transactional finish). We added the ‘merge’ signal to achieve this goal. Each finish is modified to hold the following properties: a property to indicate whether the finish is part of a transaction, a property to indicate whether the finish is the root of a transaction, and a property that holds a Tx instance. The Tx instance records the participants (if any) and their types. A finish that is not part of a transaction or that did not detect any transaction participant terminates normally. Otherwise, if this finish is not the root of the transaction, it shares the list of participants with its direct parent by sending a ‘merge’ signal to it. The chain of finish objects recursively delivers the participants set to the nearest transactional finish construct. In resilient mode, details about the transaction participants are stored as part of the resilient finish object which is expected to survive failures.

5.2.2.3 Extended Finish Protocols

We successfully extended the non-resilient finish implementation and the optimistic finish implementations (place-zero and distributed) with atomicity awareness as described above. We did not use the pessimistic finish protocol, not only because it is slower than optimistic finish, but also because its adoption mechanism is more complex. In the pessimistic protocol, when a finish is lost, its corresponding resilient finish is disabled, and the tasks associated with it are merged with the tasks of its parent. If the lost finish is a transaction, the parent finish must carry the responsibility of terminating the transaction on behalf of the lost finish. The optimistic protocol, on the other hand, does not disable the resilient finish object corresponding to a lost finish. It keeps it active as a ghost finish that continues to carry the responsibility of tracking its tasks (see Section 4.9.1), and if it is a transaction, it handles the transaction commitment as usual. We found this approach simpler to implement and easier for integrating the transaction-related changes.

5.2.3 Implementation Details

5.2.3.1 Transaction Identifier and Log

Each transaction is assigned a globally unique eight-byte transaction identifier. The first four bytes store the physical id of the transaction coordinator place. The second four bytes store a place-local sequence number. The transaction log maintains the readset and writeset of the transaction. Each participant holds a partition of the log describing the set of read and written keys in its place. The transaction log is used by the local concurrency control (CC) mechanism for tracking used keys and detecting conflicts (see Section 5.2.3.3). In Section 5.2.3.4, we explain how the transaction log is central to tolerating the failure of a participant during the execution of the commit protocol.
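For concreteness, the identifier layout described above can be expressed with simple shift and mask operations, as in the sketch below; the class and helper names are assumptions and do not appear in the actual implementation.

// Illustrative sketch of the eight-byte transaction identifier: the upper
// four bytes hold the coordinator's place id, the lower four bytes hold a
// place-local sequence number.
class TxIdUtil {
    public static def make(coordPlace:Long, localSeq:Long):Long {
        return (coordPlace << 32) | (localSeq & 0xFFFFFFFF);
    }
    public static def placeOf(id:Long):Long = id >> 32;
    public static def seqOf(id:Long):Long = id & 0xFFFFFFFF;
}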

5.2.3.2 Lock Specification

The internal implementation of our concurrency control mechanism is lock-based. We assign an upgradable read-write lock to each key. Locking and unlocking are performed according to the CC mechanism used by the store, which we will describe in Section 5.2.3.3. Lock-based TM systems often employ a deadlock detection or avoidance scheme to guarantee system progress [Moss, 1981]. Deadlock avoidance schemes are more attractive in distributed systems as they avoid the extra communication required in deadlock detection schemes to discover circular dependencies between transactions. Wait-Die and Wound-Wait are two types of locks that can be used for deadlock avoidance [Rosenkrantz et al., 1978]. The first word refers to the action the older transaction makes, and the second word refers to the action the younger transaction makes. In the wait-die lock, an older transaction waits if a younger transaction is acquiring the lock, and a younger transaction dies if an older transaction is acquiring the lock. In the wound-wait lock, an older transaction forces a younger transaction to release the lock, and a younger transaction waits if an older transaction is acquiring the lock.

We prefer wait-die over wound-wait as we found it easier to design our system with the knowledge that a transaction that acquired a lock will continue to hold it until it voluntarily commits or aborts. With wound-wait, in contrast, we would need to check whether the lock is still held before accessing every key that we previously acquired. Thus our lock can be described more precisely as an upgradable read-write wait-die lock. A reader or a writer waits only if the lock is being acquired by a younger writer or reader, respectively. A waiting transaction aborts when an older transaction requests the lock. A waiting writer has a higher priority than a waiting reader. The priority of a transaction is decided based on its identifier. When comparing transactions to find out which one is older, we first compare their place-local sequence numbers. If the sequence numbers are identical, we compare the place numbers. The transaction with the smaller sequence number or place number is considered the older transaction.
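The age comparison can be captured in a small helper, sketched below under the assumption (made explicit here, not taken from the implementation) that the sequence number and the place id are available as separate values.

// Sketch of the wait-die age test: the smaller sequence number wins; ties
// are broken by the smaller place id (the helper name is an assumption).
class TxOrder {
    public static def isOlder(seqA:Long, placeA:Long, seqB:Long, placeB:Long):Boolean {
        if (seqA != seqB) return seqA < seqB;
        return placeA < placeB;
    }
}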

5.2.3.3 Concurrency Control Mechanism

The essence of a CC mechanism can be captured by describing how reads and writes are performed. Bocchino et al. [2008] provide a comprehensive list of design aspects for TM systems, three of which relate to handling reads and writes: read synchronization (read versioning or read locking), write acquire time (early or late), and write recovery (undo-logging or write-buffering). In the following paragraphs, we describe these choices based on the description in [Bocchino et al., 2008] and how they are implemented in our store.

Read Versioning vs. Read Locking

In Read Versioning (RV), a version is assigned to each key. Each reading transaction records the version of the key the first time it reads the key. It compares that version with the current version of the key at commit preparation time. A different version indicates a conflict: another transaction has updated the key since it was read, and the current transaction must abort. Read Locking (RL) prevents other transactions from writing by acquiring the key lock at read time. A writer may not acquire the lock until all readers have released it.

Early Acquire vs. Late Acquire

In Early Acquire (EA), a writing transaction acquires an exclusive lock to the key the first time it writes the key. Late Acquire (LA) delays acquiring the lock until commit preparation time. In both cases, if RL is used and the key was previously locked for read, the lock is upgraded from read to write. If RV was used, the key version is compared directly after acquiring the write lock. If the version is different, the transaction aborts.

Write-Buffering vs. Undo-Logging

These are two methods for recovering failed writes. In Write-Buffering (WB), a transaction performs its writes on a shadow copy of the value. Subsequent reads from the same transaction return the buffered value. Changes made by one transaction are not visible to other transactions until the change is committed. Write buffering can be used with EA or LA. In Undo-Logging (UL), writes are done in-place. The transaction records the old value to undo its changes in case of abort. Undo-logging can only be used with EA. According to the above choices, six different mechanisms are formed: RL EA UL, RL EA WB, RL LA WB, RV EA UL, RV EA WB, RV LA WB. We implemented the six mechanisms to give the flexibility of selecting a suitable mechanism based on the workload properties.

5.2.3.4 Two Phase Commit

A transactional finish starts the commit protocol only after all tasks have terminated and the full list of participants has been captured. In this section, we describe two variants of the Two-Phase Commit (2PC) [Mohan et al., 1986] protocol that we used for coordinating transaction termination in non-resilient and resilient modes. In order for a commit protocol to guarantee reaching a consistent decision at all participants, the following requirements must be satisfied. We quote these requirements from [Keidar and Dolev, 1998], who obtained this list from Chapter 7 of [Bernstein et al., 1987].

AC1: Uniform Agreement: All the sites that reach a decision reach the same one.

AC2: A site cannot reverse its decision after it has reached one.

AC3: Validity: The commit decision can be reached only if all sites voted Yes.

AC4: Non-triviality: If there are no failures and all sites voted Yes, then the decision will be to commit.

AC5: Termination: At any point in the execution of the protocol, if all existing failures are repaired and no new failures occur for sufficiently long, then all sites will eventually reach a decision.

Non-Resilient 2PC Protocol

Assuming the absence of failures, the 2PC protocol satisfies the above commitment requirements using two phases. In the first phase, the coordinator sends a PREPARE message to all participants in parallel asking them to vote on whether to commit or abort the transaction. Each participant validates its transaction log according to the used CC mechanism to check for conflicts.

If the participant detects a conflict, it aborts the transaction locally and votes to abort; otherwise, it votes to commit. If all participants voted to commit, the coordinator sends a COMMIT message to all participants. Otherwise, the coordinator sends an ABORT message to participants that voted to commit (the other participants have already aborted before sending the abort vote). The transaction is considered terminated only after the coordinator receives acknowledgements for all the COMMIT/ABORT messages. Figure 5.6 and Figure 5.7 demonstrate the protocol for a committed transaction and an aborted transaction, respectively.

Figure 5.6: Two-phase commit: a committing transaction.

Figure 5.7: Two-phase commit: an aborting transaction.

We exploit the following optimizations in the non-resilient protocol:

• Checkpointing the decisions taken by the coordinator or the participants is not needed since all the parties are assumed to be alive throughout the commit protocol.

• As described in [Bocchino et al., 2008], the prepare phase may be omitted in CC mechanisms that eagerly acquire locks of accessed keys, namely the RL EA *. These mechanisms detect the conflicts during the execution phase. Completing the execution phase successfully indicates to the coordinator that all participants are willing to commit, and consequently it can start the second phase directly. The mechanisms that use RV attempt to acquire the read locks in the prepare phase. Similarly, mechanisms that use LA attempt to acquire the write locks in the prepare phase. Therefore, the prepare phase cannot be omitted in these mechanisms.

• There is no special handling for read-only transactions.

Fault Tolerance Features

Failures have critical impacts on the availability of a DTM system. A failed coordinator may leave behind active participants that acquired data locks and are waiting for the transaction to progress to release these locks. Not only would this transaction hang, but so would other transactions that depend on the acquired locks. A failure of a participant results in data loss if data replication is not employed. Meanwhile, ensuring replication consistency in the presence of failures is challenging to achieve. We addressed these complexities in a resilient variant of the 2PC protocol that aims to guarantee transaction atomicity on the resilient APGAS data store. Our design relies on the finish resilient data store to guarantee the survivability of the transactional finish objects serving as transaction coordinators. Our design also exploits the replication mechanism of the APGAS data store to protect data availability. A common mechanism to allow a 2PC protocol to tolerate the loss of a coordinator or its subordinates is by checkpointing critical transaction decisions on disk. For example, in [Mohan et al., 1986], during the prepare phase, a participant persists its prepare decision on disk to ensure that a recovering process will not reverse this decision (to satisfy requirement AC2). A participant that is prepared to commit also persists a prepare-log with the data changes required should the transaction commit. Similarly, the coordinator persists the decision to commit or abort the transaction on disk to force the recovering coordinator to take the same decision. We follow the same strategy; however, rather than persisting on disk, we persist critical data in two in-memory replicas. We start by describing the resilient 2PC protocol in normal operation when failures are absent, then we describe how the protocol handles failures.

Resilient 2PC Under Normal Operation

The protocol can be described with respect to four actors: the coordinator, the backup coordinator, the participant's master, and the participant's slave. While the transaction is executing, the only actors that are used are the coordinator and the masters of the participants. In the commitment phase, the other actors are also used subject to certain conditions. The coordinator uses its local knowledge of the active places in the system to build a presumed master-slave mapping. This mapping may be inaccurate if one of the masters asynchronously recovered its slave replica. The coordinator (or the backup coordinator) detects this inaccuracy through the masters in the prepare phase, and consequently aborts the transaction. Retrying this transaction will eventually succeed when the coordinator's place detects the new slave replica (see the Handshaking description in Section 5.3.1.2). Each participant can accurately identify the coordinator's place from the transaction id. However, they cannot accurately identify the place of the backup coordinator. Knowing the places of the coordinator and its backup enables a participant and its slave to detect a catastrophic failure when both places fail. We require the coordinator to share the place of its backup with the participants during the prepare phase.

Assuming a full-write transaction, in which all participants are writers, the commitment protocol works as follows. In the first phase, the coordinator sends a PREPARE message to each master carrying the identity of the slave known to the coordinator and the identity of the backup coordinator. In return, the coordinator expects to receive two acknowledgements — one from the master and one from the slave. The master place can take one of the following two paths when handling a PREPARE message:

• If the master detects that the given slave's identity is incorrect (because the old slave died and a new slave was created) or if the master detects a conflict, it aborts the transaction and sends a single message carrying two acknowledgements — its own abort vote and an emulated acknowledgement on behalf of the slave. An aborted master and its slave are ignored in the second phase.

• If the master detects that the slave's identity is correct and that no conflicts exist, it prepares to commit by sending two messages — a message to its slave carrying a prepare-log and the identity of the backup coordinator, and a message to the coordinator to vote for commit. The slave stores the backup's identity and the prepare-log, which includes, for each modified key, the expected version before the write and the new value. After that, the slave sends an acknowledgement to the coordinator. The version number provided for each key is used by the slave at commit time only, to guarantee committing the transactions in the same order as applied by the master.

The second phase starts after the coordinator receives two acknowledgements for each participant. If all masters voted to commit, the coordinator sends a commit-log to its backup that carries the commit decision and the master-slave mapping. After receiving an acknowledgement from its backup, the coordinator sends direct COMMIT messages to each master and slave and waits for acknowledgements from them. Because the coordinator sends the COMMIT message to a master and its slave in parallel, it is possible that the master commits the transaction, releases the transaction locks, and starts a second transaction on the same keys, before the slave commits the first transaction. If the master prepares the second transaction to commit before the slave commits the first transaction, the slave will hold two transaction logs that may be altering the same keys. If the second transaction aborts, the risk of inconsistency disappears. However, if the second transaction commits, the slave may receive the COMMIT message of the second transaction before the first. Using the versions of the keys, the slave detects this problem and handles it by committing the first transaction before the second. Thanks to the lock-based CC mechanism applied by the master, which ensures that the version of a key is incremented only after its transaction commits, requesting a slave to commit a key with version v + 1 occurs only if the transaction that sets the version to v has committed. This guarantee makes it safe for the slave to commit the first transaction even before receiving its COMMIT message. The master and the slave acknowledge the coordinator after handling the COMMIT message.

If any of the masters voted to abort, the coordinator decides to abort the transaction without notifying the backup coordinator (the backup coordinator restarts the prepare phase to recover any non-committed transaction). The coordinator sends direct ABORT messages to each master and slave that voted to commit and waits for acknowledgements from them. The transaction terminates after receiving the required acknowledgements. At this point, the coordinator sends a message to delete the backup coordinator. Figure 5.8 and Figure 5.9 demonstrate the protocol for a committed transaction and an aborted transaction, respectively.

Figure 5.8: Resilient two-phase commit: a committing write transaction.

Figure 5.9: Resilient two-phase commit: an aborting write transaction.

We exploit the following optimizations in the resilient protocol:

• In the prepare phase, a read-only participant does not need to generate a prepare-log because none of its actions require updating the store. Therefore, the slave of a read-only participant is completely ignored by the coordinator.

• While in the non-resilient protocol, the prepare phase can be omitted when RL

EA * is used for both read-only and write transactions, in the resilient protocol, PREPARE can only be omitted in read-only transactions. That is because the purpose of PREPARE in the resilient protocol is not limited to detecting conflicts, but it also replicates the prepare-log of the write operations.

Note that the implementation of the resilient finish store determines how the backup coordinator will be created. In the place-zero resilient store, the coordinator and its backup are basically the same object saved at place-zero. In the distributed resilient store, the coordinator is the master finish object located at the finish home place, and the backup coordinator is the backup finish object located at the next place.

Resilient 2PC Under Coordinator Failure

When the place-zero resilient finish store is used, we assume that the coordinators will never fail. That is because all coordinators are hosted at place-zero, which is assumed to survive failures. Therefore, we focus the following discussion on the distributed finish store, which creates, for each finish object, a backup finish at the next place. In this store, a coordinator failure occurs when the place from which the transaction has originated fails before the transaction terminates. As long as the failed place is not place-zero and the backup place is alive, coordinator recovery is possible. By default, transaction commitment is handled by the master finish object representing the transaction coordinator. A coordinator can be lost during the execution phase of a transaction before 2PC starts. When the optimistic finish protocol is used, a failed coordinator that is in the execution phase is recreated at another place using the backup coordinator's state (i.e. backup finish object). The new coordinator completes the execution phase and then handles transaction commitment normally. The loss of a coordinator during the commitment phase puts the backup coordinator in charge of transaction commitment. If the backup coordinator received a commit-log from the master finish, it starts the second phase of the 2PC protocol by sending a COMMIT message to all participants and their slaves. Otherwise, the backup coordinator starts from the first phase by sending the PREPARE messages. The participants are prepared to handle duplicated messages due to coordinator recovery under the following rules:

• A committed master must not receive a duplicated PREPARE message. That is guaranteed using the commit-log that is given to the backup coordinator before any place commits.

• An aborted master may receive a duplicated PREPARE message. Handling the first message concludes by deleting the transaction log. The absence of a transaction log directs the place to vote to abort.

• A committed (an aborted) master or slave may receive a duplicated COMMIT (ABORT) message. Handling the first message concludes by deleting the transaction log. The absence of a transaction log directs the place to ignore the duplicated message.

Resilient 2PC Under Backup Coordinator Failure

Assuming the coordinator is active, the backup coordinator may also fail during the execution phase or the commitment phase. A failure during the execution phase is handled by recreating the backup coordinator at another place. A failure during the commitment phase can only be detected by the coordinator while sending the commit-log. We ignore this failure and allow the coordinator to commit the transaction, assuming that a coordinator's failure is unlikely to occur within a short time interval from the failure of its backup.

Resilient 2PC Under Participant Failure

When a transaction commit is initiated, the transaction coordinator builds a master-slave mapping for the participants and checks their status. The failure of both a master and its slave is an indication of a catastrophic failure. Otherwise, the commit protocol proceeds as usual. During the two phases of the commit protocol, the coordinator blocks to wait for acknowledgements from masters and slaves. Using failure notifications from the runtime system, it can adjust the number of pending acknowledgements to avoid waiting for them unnecessarily. During the prepare phase, the following rules are used for handling master and slave failures:

• On detecting the failure of a master, the coordinator drops the pending acknowledgement of the master and its slave. The reason why we drop the slave's acknowledgement although it is still alive is that the slave joins the prepare phase through its master (by receiving the prepare-log). If the master dies before sending the prepare-log to the slave, the slave will never acknowledge the coordinator. The coordinator drops the slave's acknowledgement if it was still pending to avoid waiting for a message that may never be sent.

• On detecting the failure of a slave, the coordinator drops the slave’s pending acknowledgement.

• The failure of a master or a slave during the prepare phase causes the transaction to abort.

In the second phase, the coordinator communicates directly with the masters and the slaves (of the writing masters only) to send the COMMIT or ABORT messages. On detecting the failure of a master or a slave, it drops the associated pending acknowledgement and proceeds normally, as long as at least one replica for each participant is alive. We rely on the replication recovery mechanism of the store to recover the lost replica using the consistent content of the other replica. A replica is considered to be in an inconsistent state if it holds unterminated transactions. The store must wait until all unterminated transactions on the replica terminate by committing or aborting before starting the recovery process. The coordinator does not prevent a transaction from terminating at one replica if the other replica has failed.

If the transaction is committing, the coordinator masks failures during the second phase from the programmer, and the transactional finish construct completes successfully. If the transaction is aborting, the programmer receives a MultipleExceptions error that includes DPEs for the failed places.

5.2.3.5 Transaction Termination Guarantee

We claim that, in the absence of a catastrophic failure, any started transaction will eventually terminate. The rationale behind this claim is based on the following guarantees provided by RX10 for detecting failures and protecting the finish objects, and on the described resilient 2PC protocol:

1. Failure detection: the failure of any place is eventually detected by the RX10 network layer.

2. Coordinator availability: with a place-zero finish resilient store, a coordinator is assumed to never die because the failure of place-zero is catastrophic. The distributed resilient store provides, for each finish, a backup finish at another place that may recreate the master (if the transaction is in the execution phase) or take over terminating the transaction (if the transaction is in the commitment phase). In both cases, there is always a coordinator in charge of terminating any started transaction.

3. Master and slave migratability: recovering a lost replica is performed by migrating the data of the surviving replica to a new place. The surviving replica cannot migrate its content if it is in the middle of handling transactions that can alter its state. It needs to reach a state in which all pending transactions that have the potential to modify its data have completed. Because every started transaction is guaranteed to terminate thanks to the coordinator's availability guarantee (point 2), a replica aiming to migrate will reach a migratable state within a bounded waiting time.

Unfortunately, we could not develop a formal model of the transactional finish to prove its correctness within the time-frame of the thesis. Our design mainly applies mechanisms that are widely studied in the literature, i.e. 2PC and strong consistency of replicated data, without introducing new features to them. Assuming the correctness of these mechanisms, the correctness of the optimistic finish protocol (see Section 4.9.4), the correctness of the finish replication protocol (see Section 4.10), and the soundness of the measures we took to guarantee transaction termination and master/backup consistency, we believe that we have addressed the vast majority, if not all, of the flaws that may cause incorrect behavior. Proving that formally is left as a possible future extension to this work. Based on the above features, we designed a transactional store application framework for X10 and RX10 applications. The store implements an asynchronous recovery mechanism that is based on the above transaction termination guarantees. We will describe the transactional store framework and the used replica migration protocol

in Section 5.3.1.2. For applications that do not require asynchronous recovery and atomic multi-place operations, we designed a simpler resilient store that avoids all transaction handling complexities. We describe this simple store in Section 5.3.1.1.

5.3 Resilient Application Frameworks

In this section, we demonstrate X10's multi-resolution resilience by building resilient frameworks at different levels of abstraction. Using finish as a low-level construct, we built two resilient stores. We then built two application frameworks, based on the resilient stores, that hide most of the fault tolerance details from the user. All the frameworks support non-shrinking recovery by replacing a failed place with a new place. We utilize X10's PlaceManager utility, provided by the X10 team, to manage the group of active places available to the application. As a reminder, we list the PlaceManager APIs in Table 5.2. A detailed description of these APIs can be found in Section 2.4.5.1. X10 uses the keyword this for defining the class constructor.

Table 5.2: The PlaceManager APIs.

Class               Function                                        Returns
PlaceManager        this(numSpares:Long, allowShrinking:Boolean)   PlaceManager
                    activePlaces()                                  PlaceGroup
                    rebuildActivePlaces()                           ChangeDescription
ChangeDescription   addedPlaces()                                   PlaceGroup
                    removedPlaces()                                 PlaceGroup

5.3.1 Application Resilient Stores

5.3.1.1 Place Local Store

The PlaceLocalStore class provides a resilient store that a place can use to save its local data resiliently. It saves the data of each place in a master replica located at the same place, and a slave replica located at the next place. The store’s APIs are summarized in Table 5.3 and described here:

• make(activePlaces): creates a store for the given group of places.

• set(key, value): atomically updates the master and the slave replicas of the calling place with the provided key-value pair.

• get(key): returns the value associated with the given key.

• recover(changes): recovers the store given a description of the changes impacting the places.
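To illustrate how these APIs fit together with the PlaceManager of Table 5.2, the fragment below sketches a possible usage; the key and value types, variable names, and values are assumptions made for the example.

// Hypothetical usage of PlaceLocalStore together with a PlaceManager.
val numSpares:Long = 1;
val mgr = new PlaceManager(numSpares, false);                    // one spare place, no shrinking
val store = PlaceLocalStore.make[String,Double](mgr.activePlaces());
store.set("theta", 0.5);                                         // replicated to the slave place
val theta = store.get("theta");                                  // served by the local master replica
// after catching a DeadPlaceException during a computation phase:
// val changes = mgr.rebuildActivePlaces();
// store.recover(changes);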

The store does not support distributed transactions because it is intended for single-place operations only. Its recovery is user-driven and global; therefore, it is most convenient for bulk-synchronous applications. Fortunately, the resilient finish construct provides all the capabilities required to implement the above functions, as we will describe below.

Table 5.3: The PlaceLocalStore APIs.

Class                   Function                               Returns
PlaceLocalStore[K,V]    make(activePlaces:PlaceGroup)          PlaceLocalStore[K,V]
                        get(key:K)                             V
                        set(key:K, value:V)                    void
                        recover(changes:ChangeDescription)     void

Resilient Atomic Updates Using Finish

We use synchronous replication to guarantee that updates to the store are applied in the correct order on both replicas. Using finish, we ensure that each update is applied at the slave replica before applying it at the master replica. A failing slave causes finish to throw an exception that skips over the update operation of the master, as shown in the code below:

def set(key:K, value:V) {
    finish at (slave) async slave.set(key, value);
    master.set(key, value);
}

The get operation uses the master replica only; therefore, it does not involve any communication. Parallel invocations of get and set requests are serialized at the master replica using an exclusive lock.
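A minimal sketch of the local get path is shown below, assuming the master replica exposes lock, unlock, and get operations (these names are assumptions made for the sketch).

def get(key:K):V {
    master.lock();                 // serialize with concurrent set and get calls
    try {
        return master.get(key);    // reads never contact the slave replica
    } finally {
        master.unlock();
    }
}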

Replication Recovery Using Finish

The PlaceLocalStore applies a simple recovery mechanism that is convenient for bulk-synchronous applications. When some places fail, the program is expected to enter a recovery phase in which updates to the store are avoided. One place, usually the first place, uses the PlaceManager to rebuild the active places and obtain a description of place changes. The new group of active places will include previously uninitialized spare places. To initialize the spare places, the program invokes recover(changes), which initializes each spare place with a master replica (obtained from its right neighbor) and a slave replica (obtained from its left neighbor), as outlined in Listing 5.2. The recovery procedure starts by checking that at least one replica of each place is still available (Line 2). Violation of this condition results in throwing a fatal error to the application. The finish at Line 4 waits for all recovery tasks to complete and reports any detected failures while initializing the spare places.

After successfully recovering the store, the application proceeds with the guarantee that data saved before the failure are available.

Listing 5.2: PlaceLocalStore recovery.
1  def recover(changes:ChangeDescription) {
2      checkIfBothMasterAndSlaveDied(changes);
3      var i:Long = 0;
4      finish for (deadPlace in changes.removedPlaces) {
5          val spare = changes.addedPlaces.get(i++);
6          val left = changes.oldActivePlaces.prev(deadPlace);
7          val right = changes.oldActivePlaces.next(deadPlace);
8          // create the spare's master replica
9          at (right) async slave.copyToMaster(spare);
10         // create the spare's slave replica
11         at (left) async master.copyToSlave(spare);
12     }
13 }

The PlaceLocalStore simplifies the implementation of coordinated and uncoordinated checkpointing protocols by hiding the replication and recovery complexities from the programmer. However, global recovery is inefficient and difficult to implement for dynamic applications where synchronization between the places is limited or unnecessary. In Section 5.3.1.2, we describe a more sophisticated store that supports asynchronous recovery of the places and also supports distributed transactions.

5.3.1.2 Transactional Store

The TxStore class provides a resilient store that supports distributed transactions and ensures data consistency and availability in the presence of non-catastrophic failures. It supports both global synchronous recovery (in the same way applied in PlaceLocalStore) and local asynchronous recovery. Table 5.4 lists the store’s APIs.

Table 5.4: The TxStore APIs.

Class          Function                                              Returns
TxStore[K,V]   make(pg:PlaceGroup, asyncRec:Boolean, cc:String,
                    work:(TxStore[K,V])=>void)                       TxStore[K,V]
               activePlaces()                                        PlaceGroup
               recover(changes:ChangeDescription)                    void
               makeTransaction()                                     Tx[K,V]
               executeTransaction(closure:(Tx[K,V])=>void)           void
Tx[K,V]        get(key:K)                                            V
               set(key:K, value:V)                                   void
               asyncAt(virtualId:Long, closure:()=>void)             void

The program creates an instance of TxStore by calling TxStore.make and providing the initialization parameters. The first parameter pg is the initial group of active places, the second parameter asyncRec states whether asynchronous recovery should be used, the third parameter cc states the chosen concurrency control mechanism (see Section 5.2.3.3), and the fourth parameter work defines the work to be done by a new joining place (used in asynchronous recovery only). We will show how the work closure is invoked during the recovery in Listing 5.3, Line 29. Each active place receives a copy of the activePlaces group at start-up time. The method activePlaces() evaluates locally and returns the activePlaces known to the current place at the time of invocation. When asynchronous recovery is used, updates to the active places occur asynchronously and are not reflected at all places at the same time. Thus, calling activePlaces() at different places may return different values depending on what updates each place discovered. When synchronous recovery is used, place failure events are reported to the application through exceptions. The application may react to these failures by calling recover(changes) to initiate a global synchronous recovery in the same way described in Section 5.3.1.1.
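As an illustration, a TxStore could be created as sketched below; the spelling of the concurrency control string and the body of the work closure are assumptions made for the example.

// Hypothetical TxStore construction using the APIs of Table 5.4.
def makeStore(pg:PlaceGroup):TxStore[String,Long] {
    return TxStore.make[String,Long](
        pg,                                    // initial group of active places
        true,                                  // use asynchronous recovery
        "RL_EA_UL",                            // chosen CC mechanism (assumed spelling)
        (s:TxStore[String,Long]) => { /* work assigned to a recovered spare place */ });
}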

Transaction Processing

All read and write operations on the store are invoked by transactions. The function makeTransaction creates a new transaction of type Tx. The executeTransaction function executes a closure representing the body of a transaction within a transactional finish scope. Although it is acceptable for the application to execute the transaction directly, executeTransaction provides the advantage of handling transaction exceptions transparently. It captures errors raised during execution, such as ConflictExceptions and DeadPlaceExceptions, and consequently restarts the execution until it succeeds or a catastrophic failure occurs. The Tx class provides the functions get and set for reading and writing resilient data at the calling place. Expanding the transaction scope to other places can be done using the asyncAt function, which identifies the target place using its virtual id. The following code uses the transaction store APIs to perform a bank transaction by moving $100 from account A located at place 1, to account B located at place 2. Using virtual place identifiers enables executeTransaction to seamlessly execute the transaction even if the physical places change during execution.

def bank(store:TxStore[String,Long]) {
    store.executeTransaction((tx:Tx[String,Long]) => {
        tx.asyncAt(1, () => { tx.set("A", tx.get("A") - 100); });
        tx.asyncAt(2, () => { tx.set("B", tx.get("B") + 100); });
    });
}
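To show how the retry behavior of executeTransaction could be realized on top of the compiler-free binding of Section 5.2.1.3, the sketch below retries aborted transactions until the transactional finish terminates without an exception; it is an illustration under that assumption, not the framework's actual code.

// Sketch of a retry wrapper in the spirit of executeTransaction: an aborted
// transaction is retried; a real implementation would rethrow errors that
// are not conflicts or dead places.
def runWithRetry[K,V](store:TxStore[K,V], body:(Tx[K,V])=>void) {
    while (true) {
        try {
            val tx = store.makeTransaction();
            finish {
                Runtime.registerTransaction(tx); // bind tx to this finish (Section 5.2.1.3)
                body(tx);
            }
            return; // the transactional finish terminated, i.e. the transaction committed
        } catch (e:MultipleExceptions) {
            // conflict or dead place: fall through and retry the transaction
        }
    }
}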

Asynchronous Recovery

When a TxStore is created, an instance of it is registered in the runtime system at every place (active and spare). Normally in RX10, the network layer notifies the runtime layer when a place fails. We modified the notification handler to not only recover the control flow by updating the finish objects, but to also notify the local instance of the transactional store of the failure. If the TxStore is configured to recover asynchronously and the failed place is the current place’s slave (i.e. its right neighbor), the current place immediately starts the recovery protocol concurrently with other place tasks. While the recovery protocol is executing, read and write actions at places that were not impacted by the failure proceed normally. However, actions targeted to the dead place and/or its two direct neighbors (its master place at the left and its slave place at the right) may temporarily fail until the recovery completes.
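The notification path can be pictured as in the sketch below; every helper name here is an assumption introduced for illustration, since the actual handler lives inside the runtime.

// Sketch of the extended place-failure notification: after the usual finish
// recovery, the local TxStore starts asynchronous recovery if the dead place
// is this place's slave (its right neighbor).
def notifyPlaceDeath(dead:Place) {
    recoverControlFlow(dead);                    // existing RX10 finish recovery (assumed name)
    val store = localTxStoreInstance();          // hypothetical accessor for the registered store
    if (store != null && store.isAsyncRecovery()
            && dead == store.activePlaces().next(here)) {
        async store.recover(dead);               // runs concurrently with other place tasks
    }
}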

Replica Migratability

A challenging aspect of asynchronous recovery is how to handle data migration concurrently with the store's read/write actions. This challenge is avoided in the PlaceLocalStore by requiring the program to stop using the store while it is being recovered. Because no write actions are performed on any of the replicas during recovery, all the replicas are in a migratable state and can be safely copied to the spare places, as shown in Listing 5.2. However, in the TxStore, the replicas at the neighboring places of the dead place may not be in a migratable state immediately after the failure. The master of the dead place can be in the middle of handling read/write transactions either as a transaction coordinator or a transaction participant. Some of these transactions may have passed the prepare phase of the 2PC protocol and are waiting for the final COMMIT or ABORT message. The master has no choice but to wait for these transactions to terminate. It cannot abort them in order to start the recovery, because the 2PC protocol does not permit a place to abort a transaction after it has voted to commit it. For transactions that did not yet receive the PREPARE message, the master has the choice to abort them or allow them to terminate. On the other hand, the slave of the dead place maintains prepare-logs for the transactions of its master that are prepared to commit. When the master dies, some of the transaction logs may be in transit to the slave place and will eventually be received by it. Unfortunately, the slave cannot independently determine whether to commit or abort these pending transactions. The slave, therefore, has no choice but to wait for these transactions to terminate by receiving a COMMIT or ABORT message from their coordinators. Because the transaction coordinator is always available (see Section 5.2.3.5), the slave will eventually receive these required messages. The state diagrams in Figure 5.10 describe how a master or a slave place makes a transition from a non-migratable state to a migratable state.

Figure 5.10: Master replica and slave replica state diagrams ((a) master states; (b) slave states). A solid line means a non-migratable state, and a dotted line means a migratable state.

A master place can be in one of four states:

1. Active: the default state in which all actions are accepted normally.

2. Paused: a temporary state for preparing the master to be migratable. In this state, the master aims to finalize any transactions prepared to commit and pauses accepting write transactions temporarily. It accepts COMMIT and ABORT messages for prepared transactions and aborts any non-prepared write transactions. Read-only transactions are kept active unless they attempt to upgrade from read-only to write; in that case, they will be aborted.

3. Read-only: no prepared write transaction is pending, and only read actions are permitted on the master replica. The read-only state is the only migratable state for a master place.

4. Dead: the master has died. We assume the master can reach a dead state from the active state only, because the other states imply that the slave is also dead. A failure of both a master and its slave is unrecoverable in our system.

A slave place can be in one of three states:

1. No pending transactions: the slave does not hold prepare-logs for any transaction. It is the only migratable state for a slave place.

2. Pending transactions: the slave holds pending prepare-logs and is expecting COMMIT/ABORT messages regarding them from the transaction coordinators.

3. Dead: the slave has died. It is reachable from the two other states.

Note that the waiting time for a transaction to terminate is limited due to the transaction termination guarantee described in Section 5.2.3.5.

Replication Recovery Using Finish

In Listing 5.3, we outline the pseudocode for the asynchronous recovery procedure. Three places are involved in the recovery of a place: its master (left neighbor), its slave (right neighbor), and the spare place that will replace it.

The entire recovery procedure is performed using one finish block (Lines 7–26) that controls three tasks. The first task is sent to the slave place (Lines 9–12); it enters a busy loop waiting to reach the migratable state “no pending transactions” (Line 10); after that, it sends a copy of its data to the spare place (Line 11). The slave place will not receive new pending requests until after the spare place is fully recovered; therefore, the slave place is automatically paused. The second task is created at the master place (Lines 14–19); it starts by pausing the master replica (Line 15) to prepare it to reach the read-only state. It waits until the read-only state is reached (Line 16), before sending a copy of the master replica to the spare place (Line 17). Finally, it activates the master replica to continue processing the transactions normally (Line 18). The third task is sent to the spare place (Lines 21–25); it aims to activate the spare place after ensuring that it received the replicas successfully from the master and slave places. Since it knows their identities, it can break its busy waiting and throw a DeadPlaceException if it detects that either the master or the slave has died.

Listing 5.3: TxStore asynchronous recovery.
1  /* called by the master of the dead place */
2  def recover(deadPlace:Place) {
3      val left = here;
4      val right = store.activePlaces().next(deadPlace);
5      assert !right.isDead();
6      val spare = store.allocateSpare();
7      finish {
8          // create the spare's master replica
9          at (right) async {
10             slave.waitUntilMigratable();
11             slave.copyToMaster(spare);
12         }
13         // create the spare's slave replica
14         at (left) async { // local task (left = here)
15             master.pause();
16             master.waitUntilMigratable();
17             master.copyToSlave(spare);
18             master.activate();
19         }
20         // activate the spare after receiving its replicas
21         at (spare) async {
22             slave.waitForReplicaFrom(left);
23             master.waitForReplicaFrom(right);
24             master.activate();
25         }
26     }
27
28     // assign work to the spare place
29     at (spare) @Uncounted async { store.work(store); }
30 }

After the spare place has been initialized, it can participate in the computation by performing a user-defined task. The user provides this task in the work closure given to the store at construction time (see the parameters of TxStore.make in Table 5.4). The recover function invokes the work closure asynchronously at the spare place (Line 29), only after the finish block terminates successfully. The @Uncounted annotation, used in Line 29, means that the spawned async is not tracked by any finish. However, the application can programmatically integrate this async with existing application tasks, for example, by exchanging certain notification messages. We will demonstrate this mechanism in the context of the parallel workers framework in Section 5.3.3.

Handshaking

When a spare place takes the role of a dead place, the other places need to recognize this change. A simple mechanism to achieve this goal is to perform a global handshaking operation from the spare to all other places. However, eagerly notifying all places not only slows down the store's recovery, but may also be unnecessary for places that do not need to communicate with the recovered place. For this reason, we implemented an on-demand handshaking mechanism that works as follows: when a place detects the failure of another place, it queries the dead place's master for the physical place currently associated with the dead place's virtual id and updates its local activePlaces group accordingly. If the master of the dead place is also dead, a catastrophic failure is detected, and the entire application terminates.
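The on-demand handshake can be sketched as follows; all helper names are assumptions, and the sketch only illustrates the order of the steps described above.

// Sketch of on-demand handshaking: ask the dead place's master which spare
// replaced the given virtual id, then update the local view.
def resolveDeadPlace(virtualId:Long) {
    val dead = virtualToPhysical(virtualId);          // hypothetical local lookup
    val master = activePlaces().prev(dead);           // the dead place's master (left neighbor)
    if (master.isDead())
        throw new Exception("catastrophic failure: both replicas were lost");
    val replacement = at (master) lookupReplacement(virtualId); // hypothetical remote query
    updateActivePlaces(virtualId, replacement);       // hypothetical local update
}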

Summary

The TxStore provides a simple programming interface for the application to store critical application data that require atomic multi-place updates. The asynchronous recovery mechanism minimizes the impact of failures to only the places in direct communication with the dead place. Transactions targeting the dead place, as well as write (but not read) transactions targeting the master of the dead place, are temporarily aborted. By retrying these transactions, the application will eventually succeed after the dead place has been recovered. Spare places can automatically join the computation without global synchronization with other places. In Section 5.3.3, we describe the parallel workers application framework that uses the TxStore as its foundation for data resilience.

5.3.2 Resilient Iterative Framework

The resilient iterative framework can be used to implement fault tolerant iterative algorithms based on coordinated in-memory checkpointing. It enables the developer to focus on the algorithmic aspects of the application and rely on the framework to orchestrate the fault tolerance aspects. Meanwhile, it makes it possible for the application to optimize the performance by controlling the checkpointing scope.

We applied multi-resolution resilience by building the iterative framework based on the PlaceLocalStore (Section 5.3.1.1). We chose this particular store type because atomic multi-place updates are not needed and because global recovery is well-aligned with coordinated checkpointing. Table 5.5 lists the APIs of the iterative framework.

Table 5.5: The resilient iterative application framework APIs.

Class                     Function                                     Returns

IterativeApp              isFinished()                                 Boolean
                          step()                                       void
                          checkpoint()                                 HashMap[K,V]
                          restore(ckptState:HashMap[K,V])              void
                          remake(changes:ChangeDescription)            Any

GlobalIterativeExecutor   make(ckptInterval:Long, spareCount:Long)     IterativeExecutor
                          activePlaces()                               PlaceGroup
                          execute(app:IterativeApp)                    void

SPMDIterativeExecutor     make(ckptInterval:Long, spareCount:Long)     IterativeExecutor
                          activePlaces()                               PlaceGroup
                          execute(app:IterativeApp)                    void

The programmer implements an iterative algorithm according to the specification of the IterativeApp interface:

• isFinished(): defines the application's convergence condition.

• step(): defines the logic of a single iteration.

• checkpoint(): inserts critical application data in a key-value map for checkpointing.

• restore(ckptState): updates the program state given the last checkpointed state.

• remake(changes): provides the program with information about changes impacting the places, so that it can rebuild its distributed data structures.

The programmer implements the above methods without attention to failures. The step and isFinished functions are the same in non-resilient and resilient modes. X10 applications use global data structures mapped to a group of places; the remake method enables the program to reconfigure its global data structures to adapt to the new place organization. The framework provides two classes that execute an IterativeApp resiliently: GlobalIterativeExecutor and SPMDIterativeExecutor. Both are initialized with a user-defined checkpointing interval and a number of spare places. The framework manages the group of active places, which the program can obtain by calling the function activePlaces().
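As a concrete illustration, the following is a minimal sketch of an IterativeApp implementation. The toy application (a value-decay loop), the String/Any type arguments of the checkpoint map, and the getOrThrow calls are assumptions made for this example; the framework's exact signatures follow Table 5.5.

    import x10.util.HashMap;

    // A toy IterativeApp: each step decays a single value; only that value and
    // the iteration counter are critical state worth checkpointing.
    class DecayApp implements IterativeApp {
        var x:Double = 1.0;
        var iter:Long = 0;

        public def isFinished():Boolean = (iter >= 60 || x < 1.0e-6);

        public def step():void { x *= 0.9; iter++; }

        public def checkpoint():HashMap[String,Any] {
            val ckpt = new HashMap[String,Any]();
            ckpt.put("x", x);
            ckpt.put("iter", iter);
            return ckpt;
        }

        public def restore(ckptState:HashMap[String,Any]) {
            x = ckptState.getOrThrow("x") as Double;
            iter = ckptState.getOrThrow("iter") as Long;
        }

        public def remake(changes:ChangeDescription) {
            // nothing to rebuild: this toy example has no distributed structures
        }
    }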

5.3.2.1 The Global Iterative Executor

The GlobalIterativeExecutor can be used for applications with arbitrary communication patterns. It expects the programmer to define the step function as a global function that internally distributes the step work across the active places in any way. Listing 5.4 outlines the execution procedure followed by the global executor:

Listing 5.4: The global executor.

 1  def execute(app:IterativeApp) {
 2      var i:Long = 1;
 3      var err:Boolean = false;
 4      while (true) {
 5          try {
 6              if (err) {
 7                  handlePlaceChanges(app);
 8                  globalRestore(app);
 9                  i = 1;
10                  err = false;
11              }
12              if (app.isFinished()) break;
13              finish app.step(); // global step
14              i++;
15              if (i % ckptInterval == 0) globalCheckpoint(app);
16          } catch (e:Exception) {
17              if (e.isDPE()) err = true; else throw e;
18          }
19      }
20  }

The finish at Line 13 waits for the completion of all child tasks created by the step function and raises any exceptions detected during step execution to the framework. The executor aims to guard the application from failures occurring during step execution, checkpointing, or restoration, as long as the failure is not catastrophic. Detected exceptions that are not of type DeadPlaceException are considered catastrophic and cause the executor to stop and raise the error to the program (Listing 5.4, Line 17).

Checkpointing

Using a PlaceManager pm, a PlaceLocalStore rs, and a checkpoint identifier key, checkpointing is performed as follows:

21  def globalCheckpoint(app:IterativeApp) {
22      // alternating key for double-buffering
23      val k = key.equals("red") ? "black" : "red";
24      finish for (p in pm.activePlaces()) at (p) async {
25          rs.set(k, app.checkpoint());
26      }
27      key = k;
28  }

Double-buffering is used to guard against failures that occur during checkpointing. We maintain the last two checkpoints of each place, identified as black and red in Line 23. The color of the last valid checkpoint is stored in key. When creating a new checkpoint, the other color is used for checkpointing the state of all the active places (Lines 24–26). Only after checkpointing completes successfully is the key variable updated to carry the color of the new checkpoint (Line 27). A failure during checkpointing causes the code to skip over Line 27, thereby leaving the previous checkpoint as the valid one for recovery.

Recovery

When a recoverable failure is detected in Line 17, the err variable is set to true to direct the execution towards the recovery path. The executor recovers the application by first calling the handlePlaceChanges function in Line 7. As shown in the code below, this function rebuilds the group of active places (Line 30), recovers the resilient store (Line 31 — a catastrophic error is raised if both replicas of a place are lost), and calls the remake function to allow the application to reconfigure its global data structures (Line 32). However, it does not handle rolling back the application state. To recover the application state, the executor calls the globalRestore function outlined below. It creates a task at each place that retrieves the last checkpoint from the resilient store and feeds it to the program by calling the restore function (Line 36). Now that both the structure of the application and its data have been recovered, execution can resume normally.

29  def handlePlaceChanges(app:IterativeApp) {
30      val changes = pm.rebuildActivePlaces();
31      rs.recover(changes);
32      app.remake(changes);
33  }
34  def globalRestore(app:IterativeApp) {
35      finish for (p in pm.activePlaces()) at (p) async {
36          app.restore(rs.get(key));
37      }
38  }

5.3.2.2 The SPMD Iterative Executor

The SPMDIterativeExecutor is optimized for bulk-synchronous applications, which execute as a series of synchronized steps across all the active places. A global step for such an application would create a fan-out finish over all the places to execute only one iteration; the repeated creation of remote micro-tasks for each step would unnecessarily harm performance. Listing 5.5 shows how the SPMD executor avoids this overhead by creating a coarse-grained task at each place that can execute multiple iterations, take periodic checkpoints, and perform part of the recovery work (Lines 14–28). Note that a program that uses the SPMD executor needs to implement the step function as a local function handling the work of a single place.

Listing 5.5: The SPMD executor.

 1  def execute(app:IterativeApp) {
 2      var err:Boolean = false;
 3      while (true) {
 4          try {
 5              val res:Boolean;
 6              if (err) {
 7                  handlePlaceChanges(app);
 8                  res = true;
 9                  err = false;
10              } else {
11                  res = false;
12              }
13              val team = new Team(pm.activePlaces());
14              finish for (p in pm.activePlaces()) at (p) async {
15                  if (res) {
16                      app.restore(rs.get(key));
17                  }
18                  var i:Long = 1;
19                  while (!app.isFinished()) {
20                      finish app.step(); // local step
21                      i++;
22                      if (i % ckptInterval == 0) {
23                          val k = key.equals("red") ? "black" : "red";
24                          rs.set(k, app.checkpoint());
25                          team.agree(1);
26                          key = k;
27                      }
28                  }
29              }
30              break;
31          } catch (e:Exception) {
32              if (e.isDPE()) err = true; else throw e;
33          }
34      }
35  }

In a failure-free execution, only one fan-out finish (Line 14) is used for the entire computation. Otherwise, when a failure occurs, all the tasks terminate, and a new fan-out finish is created after recovering the application's structure. Recovering the application data is done inside the main fan-out finish (Line 16).

Checkpointing

When a place reaches a checkpointing iteration, it independently saves its state in the resilient store using the opposite color of the last checkpoint (Lines 23–24). Before the places switch to the new checkpoint (by updating the key variable in Line 26), global coordination is needed to ensure the successful creation of the new checkpoint at all the places. That is achieved using the fault tolerant collective agreement function provided by MPI-ULFM (see Section 3.4.4.3).

This function is the only collective function that provides uniform failure reporting across all places. If one place is dead, the call to team.agree in Line 25 will throw a DeadPlaceException at all the places, and the recovery steps will consequently start.

Recovery

When a global failure is reported by team.agree, the fan-out finish at Line 14 terminates, and control returns to the executor's main thread. Handling a recoverable failure starts with recovering the application's structure by calling handlePlaceChanges. After that, a new fan-out finish starts, in which the first step performed at all places is recovering the application's state. That is more efficient than calling the global executor's globalRestore method, which creates a separate fan-out finish for data recovery only.

5.3.3 Resilient Parallel Workers Framework

The resilient parallel workers framework can be used to implement an embarrassingly parallel computation that uses a group of asynchronous workers to process global data. A worker is implemented as a place. A failed worker is asynchronously recovered by assigning its data and work to a new worker. To improve the programmer's productivity, the framework completely hides the details of failure detection, data recovery, and failed-worker replacement from the programmer. We applied multi-resolution resilience by building the parallel workers framework based on the TxStore (Section 5.3.1.2). We chose this particular store type because it supports asynchronous recovery and enables the workers to perform resilient transactions on global data.

Table 5.6: The parallel workers application framework APIs.

Class                           Function                                                      Returns

ParallelWorkersApp[K,V]         startWorker(store:TxStore[K,V], recover:Boolean)              void

ParallelWorkersExecutor[K,V]    make(pg:PlaceGroup, cc:String, app:ParallelWorkersApp[K,V])   ParallelWorkersExecutor[K,V]
                                execute()                                                     void
                                activePlaces()                                                PlaceGroup

Table 5.6 lists the APIs of the parallel workers framework. To use the framework, the programmer implements the ParallelWorkersApp interface, which defines one function, startWorker. This function encapsulates the work of a single worker. It takes as parameters an instance of TxStore and a flag indicating whether the worker is a normal worker or a recovered worker. To invoke the execution of the application, the programmer creates an instance of the framework's executor by calling ParallelWorkersExecutor.make.

It takes as parameters the initial group of active places, the chosen CC mechanism (see Section 5.2.3.3), and an instance of ParallelWorkersApp. The executor uses these parameters to construct a transactional store by calling TxStore.make (see Table 5.4 for the parameters of this function). One of the parameters of TxStore.make is the recovery closure work, which defines the work to be done by a joining spare place that is replacing a dead place. The executor defines this work as a call to the executeWorker function shown in Listing 5.6 (Lines 11–20), which we explain next.
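For illustration, the recovery closure might look like the sketch below. The closure's parameter type is an assumption based on Listing 5.3, Line 29, where the store invokes work(store) at the joining spare place; the other parameters of TxStore.make are listed in Table 5.4 and omitted here.

    // Hypothetical sketch: a joining spare place re-enters the executor's worker
    // logic in recovery mode, receiving the store itself as an argument.
    val work = (s:TxStore[K,V]) => { executeWorker(true); };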

Listing 5.6: The parallel workers executor.

 1  def execute() {
 2      pending = store.activePlaces().size();
 3      for (p in store.activePlaces()) at (p) @Uncounted async {
 4          executeWorker(false);
 5      }
 6      // block until all workers notify
 7      latch.await();
 8      if (recordedErrors() != null)
 9          throw recordedErrors();
10  }
11  def executeWorker(recover:Boolean) {
12      var err:Exception = null;
13      try {
14          finish app.startWorker(store, recover);
15      } catch (ex:Exception) {
16          err = ex;
17      }
18      // notify work completion
19      at (executor) @Uncounted async executor().notify(err);
20  }
21  def notify(err:Exception) {
22      if (err != null) recordError(err);
23      var finished:Boolean = false;
24      atomic {
25          pending--;
26          if (pending == 0)
27              finished = true;
28      }
29      if (finished) latch.release();
30  }

5.3.3.1 Parallel Workers Executor

We turn now to explaining how the parallel workers executor tracks the execution of the workers, and why we could not use finish for this purpose. To start the execution, the program calls the execute function shown in Listing 5.6 (Lines 1–10). This function starts an asynchronous task at each active place to execute the function executeWorker (Lines 11–20). The executor then waits for the completion of these tasks; however, rather than using finish to track their termination, it tracks them manually by blocking on a latch (Line 7).

A finish can only track tasks created within its scope. If the executor surrounded the worker tasks with a finish, and the TxStore later created a recovery task at a spare place, there would be no way to attach this recovery task to the scope of that finish. One could solve this problem by disabling the TxStore's recovery and making the executor itself create the recovery tasks after the original tasks complete. However, this can significantly delay recovery, because the finish would not complete until all the surviving workers complete. For this reason, we avoided the use of finish and used manual tracking instead. The @Uncounted annotation, used in Line 3 and Line 19, exempts the spawned asyncs from being tracked by any finish, which is what this framework requires. For N active places, the executor expects to receive N notification messages for the execution to complete. The function executeWorker, called by each worker, notifies the executor after completing its work (Line 19), whether it completed successfully or not. Every notification decrements the number of pending workers, and when this number reaches zero, the latch is released (Line 29). If a worker dies before sending its notification, the spare place that replaces it will invoke executeWorker and therefore send the notification. A usage example of the parallel workers framework is shown in Listing 5.7. We used the parallel workers framework to implement two benchmarks: ResilientTxBench, which evaluates the throughput of the transactional store (Section 5.4.2.1), and SSCA2 kernel-4, which implements a graph clustering algorithm (Section 5.4.2.2).

Listing 5.7: A usage example of the parallel workers framework.

class PlaceWorker[String,Long] extends ParallelWorkersApp[String,Long] {
    val WORK_SIZE = 10;
    def startWorker(store:TxStore[String,Long], recover:Boolean) {
        // identify the scope of the worker's task
        val indx = store.activePlaces().indexOf(here);
        val first = indx * WORK_SIZE;
        val end = first + WORK_SIZE;
        // process the task and save the result in the store
        val result = do_something(first, end);
        store.put("result:" + indx, result);
    }
    static def main(args:Rail[String]) {
        val SPARE_PLACES = 1;
        val pm = new PlaceManager(SPARE_PLACES, false);
        val pg = pm.activePlaces();
        val app = new PlaceWorker[String,Long]();
        val executor = ParallelWorkersExecutor[String,Long].make(
                pg, "RL_EA_UL", app);
        executor.execute();
    }
}

5.4 Performance Evaluation

The objective of this section is to gain insight into the throughput that transactional finish can deliver for different classes of applications in non-resilient and resilient modes. We also aim to evaluate the performance of transactional and non-transactional RX10 applications that use our application frameworks for adding resilience. Section 5.4.2 focuses on transactions. Using a micro-benchmark program called ResilientTxBench, we measure the resilient store's throughput with transactions of different sizes, different write loads, and different CC mechanisms. After that, we evaluate the performance of transactions using a realistic application that performs graph clustering. Section 5.4.3 focuses on iterative applications with global checkpointing. We evaluate the performance of three widely known machine learning applications (linear regression, logistic regression, and PageRank) and a scientific simulation application (LULESH).

5.4.1 Experimental Setup

We conducted the following experiments on the Raijin supercomputer at NCI, the Australian National Computational Infrastructure. Each compute node in Raijin has dual 8-core Intel Xeon (Sandy Bridge, 2.6 GHz) processors and uses an InfiniBand FDR network. Unless otherwise specified, we configured our jobs to allocate 10 GiB of memory per node. We used core binding and mapped the places to the nodes in a round-robin manner to ensure that neighboring places are located at different nodes. Each X10 place was configured to use one immediate thread by setting the environment variable X10_NUM_IMMEDIATE_THREADS=1. Killing places to evaluate failure scenarios is achieved by sending the operating system's SIGKILL signal to the victim places. MPI-ULFM was built from revision e87f595 of the master branch of the repository at https://bitbucket.org/icldistcomp/ulfm2.git. Unless otherwise specified, the code of the benchmarks is available in revision 8f03771 of the optimistic branch of our repository https://github.com/shamouda/x10.git, which is based on release 2.6.1 of the X10 language.

5.4.2 Transaction Benchmarking

We start with the performance evaluation results for the use of distributed transactions in a micro-benchmark program called ResilientTxBench, and in a graph clustering algorithm from benchmark number 2 of the Scalable Synthetic Compact Applications (SSCA) suite.

5.4.2.1 ResilientTxBench

Using the parallel workers framework, we designed the ResilientTxBench program as a flexible framework for evaluating the throughput of the transactional store with different configurations.

The program is located at x10.dist/samples/txbench in our repository. Table 5.7 lists the parameters available in ResilientTxBench and the values used in our evaluation.

Table 5.7: ResilientTxBench parameters.

Parameter   Description                                              Used values

p           number of worker places                                  16, 32, 64, 128
t           number of producer threads per place                     4
d           duration in ms of a benchmark execution                  5000
c           the concurrency control mechanism                        RL_EA_UL, RV_LA_WB
r           the range of keys in the store                           2^18
u           percent of update operations                             0, 50
h, o        h × o is the transaction size; h is the number of        2×8, 4×4, 8×2
            participating places and o is the number of
            operations per participant
f           transaction coordinator is one of the participants       true

The program starts p × t transaction producer threads, where p is the number of worker places and t is the number of producer threads per place. Each producer thread starts a loop that generates one transaction per iteration and terminates after d milliseconds have elapsed. According to the given transaction size, a producer thread selects h different participants and o random keys at each participant. The readset and the writeset of the transaction are randomly selected from the range of r keys of the store, based on the configured update percentage u. The keys are evenly partitioned among the places. By setting f = true, we request that the first participant is always the place initiating the transaction; the remaining h − 1 participants are selected randomly. We believe this strategy is well aligned with the PGAS paradigm, as users are expected to favor locality by initiating transactions from one of the participating places. Each producer thread records the number of completed transactions T_ij and the exact elapsed time in milliseconds E_ij, where i is the place id and j is the thread id within that place. On completion, the first place collects these values from each thread and estimates the throughput (operations per millisecond) as:

\[ \text{throughput} = \frac{\sum_{i=1}^{p}\sum_{j=1}^{t} T_{ij} \times h \times o}{\left(\sum_{i=1}^{p}\sum_{j=1}^{t} E_{ij}\right) / (p \times t)} \]
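The same computation can be sketched in code as follows; the Rail-of-Rail layout of the collected counters is an assumption made for this illustration, not the benchmark's actual data structures.

    // Sketch of the throughput computation at the first place.
    // T(i)(j): transactions completed by thread j of place i; E(i)(j): its elapsed ms.
    def throughput(T:Rail[Rail[Long]], E:Rail[Rail[Long]], h:Long, o:Long):Double {
        val p = T.size;                                // number of worker places
        val t = T(0).size;                             // producer threads per place
        var ops:Double = 0.0;
        var time:Double = 0.0;
        for (i in 0..(p-1)) for (j in 0..(t-1)) {
            ops  += (T(i)(j) * h * o) as Double;       // operations completed by thread (i, j)
            time += E(i)(j) as Double;                 // elapsed milliseconds of thread (i, j)
        }
        return ops / (time / ((p * t) as Double));     // total operations / mean elapsed time
    }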

Although we implemented the six CC mechanisms described in Section 5.2.3.3, we focus our evaluation on only two of them: RL_EA_UL and RV_LA_WB, which, respectively, represent the two extremes of pessimism and optimism in lock-based concurrency control. RL_EA_UL acquires read locks and write locks before accessing any key and holds the locks until the transaction terminates. On the other hand, RV_LA_WB does not acquire any locks until the commit preparation phase, which provides more concurrency to the application. In non-resilient mode, RL_EA_UL has an advantage over RV_LA_WB, since the former does not require a preparation phase for non-resilient transactions [Bocchino et al., 2008]. In resilient mode, RL_EA_UL can only avoid the preparation phase in read-only transactions.

We started four producer threads per place and allocated two cores per producer thread. Therefore, each X10 place was configured with 8 worker threads using X10_NTHREADS=8. We found that this level of parallelism is necessary for achieving scalable performance, because it provides a place enough capacity to coordinate its own transactions and participate in transactions originating at other places. We measured the throughput with increasing numbers of producer threads: 64, 128, 256, and 512; ideally, the throughput would increase proportionally to the number of producer threads. We evaluated the transactional store throughput in non-resilient mode and in resilient mode using the two implementations of the optimistic finish protocol: optimistic place-zero (O-p0) and optimistic distributed (O-dist).

Figure 5.11 shows the throughput results for read-only transactions, and Figure 5.12 shows the throughput results for transactions performing 50% update operations. For each configuration, we report the median and the 25th and 75th percentiles. Table 5.8 shows a summary of the results with 512 producer threads. From these results, we draw the following conclusions:

• The centralized optimistic finish implementation scales very poorly in this benchmark, due to the large number of concurrent transactions. The distributed implementation scales much better.

• Read-only transactions scale well in both non-resilient and resilient (O-dist) modes with both CC mechanisms.

• The scalability with write transactions varies depending on the CC mechanism used. While RV_LA_WB scales reasonably well, RL_EA_UL struggles to scale as the number of threads increases.

• With read-only transactions, RL_EA_UL enables the program to achieve higher throughput than RV_LA_WB. For example, with 512 threads, RL_EA_UL achieves a throughput increase of between 16% and 33% compared to RV_LA_WB. The situation is reversed with write transactions.

• With write transactions, RV_LA_WB enables the program to achieve higher throughput than RL_EA_UL. For example, with 512 threads, RV_LA_WB achieves a throughput increase of between 80% and 222% compared to RL_EA_UL.

• With 512 threads, the throughput loss due to the resilience overhead of the X10 runtime and the resilient two-phase commit protocol ranges between 40%–49% for read-only transactions and between 41%–65% for write transactions. The overhead increases as the number of transaction participants increases.

Bocchino et al. [2008] performed a comparative evaluation of the six CC mechanisms for a non-resilient DTM system for the Chapel language. They found that the RL_EA_* mechanisms were always superior to the other mechanisms because they avoid the commit preparation phase. Our results, on the other hand, demonstrate that the relative efficiency of different CC mechanisms depends on the application workload. Write-intensive applications can benefit more from an optimistic concurrency control mechanism in which locking is delayed until commit time. The performance of the SSCA2-k4 clustering application, which we explain in the next section, confirms this finding and provides a more detailed analysis. Our work also demonstrates that RL_EA_* cannot always avoid the preparation phase when used in resilient mode.

There are many implementation differences that prevent a direct comparison with the work in Bocchino et al. [2008]. For example, they propose a low-level implementation at the memory manager level, while our implementation is done at the runtime and application level. While we use a read-write wait-die lock, they use an exclusive write lock. Also, fault tolerance is not among their design goals.

We believe the achieved resilience overhead is acceptable, given the underlying tracking and commitment mechanisms on replicated data. We expect the use of small transactions in compute-intensive applications to result in minimal resilience overhead in realistic application scenarios. The SSCA2-k4 application, which we explain next, does not fit in this category; it is a communication-intensive application that relies heavily on distributed transactions.

Table 5.8: ResilientTxBench transaction throughput (operations per ms) with 512 threads, in non-resilient and resilient mode with O-dist. The resilience overhead is shown in parentheses.

                    2×8                     4×4                     8×2
                    non-res.  O-dist        non-res.  O-dist        non-res.  O-dist

RL_EA_UL (u=0%)     7678      4278 (44%)    3951      2129 (46%)    2077      1064 (49%)
RV_LA_WB (u=0%)     5776      3464 (40%)    2974      1714 (42%)    1570       914 (42%)

RL_EA_UL (u=50%)    1082       533 (51%)     628       341 (46%)     489       214 (56%)
RV_LA_WB (u=50%)    2893      1715 (41%)    1669       960 (42%)    1085       385 (65%)

5.4.2.2 Graph Clustering: SSCA2 Kernel-4

The SSCA benchmark suite is used for studying the performance of computations on graphs with arbitrary structures and arbitrary data access patterns.

Figure 5.11: ResilientTxBench transaction throughput with 0% update. (Six panels: (a)–(c) RL_EA_UL and (d)–(f) RV_LA_WB for transaction sizes 2×8, 4×4, and 8×2; each panel plots throughput (operations/ms) against the number of producer threads (cores=2X, places=X/4) for the non-resilient, O-p0, and O-dist modes.)

Figure 5.12: ResilientTxBench transaction throughput with 50% update. (Six panels: (a)–(c) RL_EA_UL and (d)–(f) RV_LA_WB for transaction sizes 2×8, 4×4, and 8×2; each panel plots throughput (operations/ms) against the number of producer threads (cores=2X, places=X/4) for the non-resilient, O-p0, and O-dist modes.)

In order to evaluate the performance of distributed transactions in a realistic application, we implemented kernel-4 of SSCA benchmark number 2 [Bader and Madduri, 2005]. SSCA2-k4 is a graph clustering problem that aims to create clusters of a certain size within a graph. For graph generation, we used the SSCA2 graph generator provided in X10's benchmarking suite [X10 Benchmarks], located at x10-benchmarks/PERCS/SSCA2/Rmat.x10. The graph generation algorithm is based on the R-MAT model [Chakrabarti et al., 2004], which uses four parameters (a, b, c, and d, such that a + b + c + d = 1) to configure the structure of the graph. Each of these parameters represents the probability that a vertex will be located in a specific area of the graph. We partitioned the graph vertices evenly among the places; however, the graph structure itself is replicated at all places to speed up neighbor-vertex queries, which are used extensively in this benchmark.

Using the parallel workers framework, we developed a lock-based implementation and a transactional implementation of SSCA2-k4. The source code of these implementations is available in x10/x10.dist/samples/ssca2. The two implementations process the graph using N × t parallel threads, where N is the number of places and t is the number of clustering threads in each place. Each thread is assigned a partition of the graph vertices, where each vertex represents a local potential root for a new cluster. A thread tries to create a cluster by first allocating the root vertex, then allocating all neighboring vertices of that root that are not already allocated. After allocating the neighbors, a heuristic is applied to pick one of them and use it for further expanding the cluster by allocating the neighbors of the picked vertex. This process continues until the cluster reaches a target size, or until no more vertices are available for allocation. In that case, the thread switches to the next root vertex to create a new cluster.

The lock-based implementation uses the non-blocking tryLockRead and tryLockWrite functions provided by the same lock class used by transactions (see Section 5.2.3.2 for a description of the lock specification). The clustering thread first attempts to read-lock a target vertex to check if it is part of a cluster. If locking is successful and the vertex is free, it attempts to lock the vertex for writing. If locking is successful and the vertex is still free, it marks the vertex with its cluster id and moves on to another vertex. All acquired vertices remain exclusively locked until cluster creation completes. Failure to acquire a lock for read or write forces the thread to clear all allocated vertices, unlock the acquired locks, and restart processing the same root vertex. Similar to [Bocchino et al., 2008], our lock-based implementation does not protect against livelocks. The transactional implementation avoids explicit locking by handling each cluster creation as a transaction; the runtime system implicitly checks for conflicts and performs the consequent rollback of obtained vertices.

The lock-based implementation is not resilient; however, we use it as an indicative baseline for comparison with non-resilient transactions. Although adding resilience to the lock-based version is possible, we found that the implementation would be complex, as it would require handling data replication and lock tracking at the application level. With transactions, however, we receive these features transparently from the runtime system.
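Returning to the lock-based variant, its allocate-or-restart behaviour can be sketched as follows. The vertex store verts and the helpers clusterOf, setCluster, rollback, nextCandidate, unlockAll, and UNASSIGNED are illustrative names; only tryLockRead and tryLockWrite correspond to the lock class of Section 5.2.3.2.

    import x10.util.ArrayList;

    // Simplified sketch of lock-based cluster creation for one root vertex.
    // Returns false if a lock could not be acquired; the caller then clears the
    // partial cluster and retries the same root, as described above.
    def tryCreateCluster(root:Long, clusterId:Long, targetSize:Long):Boolean {
        val mine = new ArrayList[Long]();            // vertices claimed by this cluster
        var v:Long = root;
        while (v >= 0 && mine.size() < targetSize) {
            if (!verts.tryLockRead(v)) { rollback(mine); return false; }
            if (verts.clusterOf(v) == UNASSIGNED) {
                if (!verts.tryLockWrite(v)) { rollback(mine); return false; }
                verts.setCluster(v, clusterId);      // stays exclusively locked until
                mine.add(v);                         // the cluster is fully created
            } else {
                verts.unlockRead(v);                 // already owned by another cluster
            }
            v = nextCandidate(mine);                 // heuristic: pick a neighbour to expand
        }
        unlockAll(mine);                             // cluster complete: release all locks
        return true;
    }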
In the transactional implementation, we do not checkpoint the progress of each thread. Therefore, the threads of a recovering worker restart processing their assigned vertices from the beginning. However, because the root vertices of the clusters are locally located, checking the status of a previously locked vertex is not expensive, as it generates a local read-only transaction.

We performed our experiment on a balanced graph of 2^18 vertices, with equal probability for the four graph parameters (a = 0.25, b = 0.25, c = 0.25, d = 0.25), and a target cluster size of 10. We started four clustering threads at each worker (place) and allocated two cores per clustering thread, for the same reason described in the ResilientTxBench experiment (Section 5.4.2.1). Therefore, each X10 place was configured with 8 worker threads using X10_NTHREADS=8. Experiments running in resilient mode use the optimistic distributed finish implementation (O-dist). We report the median of five measurements for each configuration scenario for this application. The 95% confidence interval of the reported processing time is less than 8% of the mean for experiments running in non-resilient mode, less than 4% of the mean for experiments running in resilient mode without failure, and less than 14% of the mean for experiments running in resilient mode with failure.

Table 5.9 compares the performance of the application using different concurrency control mechanisms. Because this is a write-intensive application, the lock-based implementation and RL_EA_UL suffer significantly more conflicts than RV_LA_WB. Taking the conflict cost to be the ratio of processing time to the number of conflicts: in non-resilient mode with 64 threads, the conflict cost is 0.08, 0.57, and 6.93 for lock-based, RL_EA_UL, and RV_LA_WB, respectively. The conflict cost of RV_LA_WB is much higher because conflict detection is delayed to the commit preparation time, after the transaction has fully expanded. The other two mechanisms detect conflicts much faster, but the negative consequence of their low conflict cost is retrying the acquisition of the same lock within a short time interval, which results in more repeated conflicts than RV_LA_WB. Overall, in non-resilient mode, the number of conflicts offsets the conflict cost for the two transactional implementations, and their performance is comparable. Locking outperforms the transactional mechanisms with small numbers of threads; however, this advantage is gradually lost as the number of threads increases.

In resilient mode, the cost of the transaction resilience overhead and the large gap in the total number of processed transactions increase the performance gap between RL_EA_UL and RV_LA_WB. Overall, RV_LA_WB achieves better scalability and lower resilience overhead than RL_EA_UL in this application. With 64 threads, the resilience overhead is 117% and 141% for RV_LA_WB and RL_EA_UL, respectively. With 512 threads, the overhead of RV_LA_WB reduces to only 75%, while it increases to 242% with RL_EA_UL due to the massive increase in the number of conflicts.

Using the transactional implementation, we performed an experiment using RV_LA_WB to evaluate the overhead of failure recovery for this application. We configured the middle place as a victim and forced its first clustering thread to kill the process after completing half of its assigned vertices. Table 5.10 shows the impact of the failure on the total processing time and the number of conflicts. It also shows the time consumed for replication recovery by the resilient store.
As expected, the failure results in the generation of more conflicts by the transactions that interact with the dead place. The number of conflicts is proportional to the replication recovery time. With small numbers of places (16 and 32), the recovery time ranged between 1.2–3.4 seconds, and with larger numbers of places (64 and 128), the recovery time ranged between 0.3–0.4 seconds. In this strong scaling experiment, as the number of places increases, the data assigned to each place shrinks and the transfer time of the replicas reduces. However, factors other than the replica size can also impact the recovery performance and cause high performance variability (the 95% confidence interval of the processing time under failure is between 5%–14% of the mean). These factors include the possible delay in allocating the spare place due to high processing load at the master of the dead place, and the wait time experienced by the master and the slave replicas to reach a migratable state (see Section 5.3.1.2). We expect the load at the master place to be a determining factor for the recovery performance, because the recovery process executes asynchronously with normal application tasks.

To summarize, as shown in Figure 5.13, our transactional finish implementation results in good strong-scaling performance for the SSCA2-k4 application in failure-free and failure scenarios. The performance is comparable to the lock-based implementation in non-resilient mode. We counted the number of lines of code that perform the actual cluster generation function in both implementations, excluding the main function, class definitions, debug messages, and code related to killing the victim places. The lock-based implementation takes 171 lines of code without any fault tolerance support, while the resilient transactional implementation uses only 80 lines of code².

Table 5.9: SSCA2-k4 performance with different concurrency control mechanisms.

                        Conflicts                               Processing time
threads                 64       128      256      512         64       128      256      512

Non-res.   lock-based   82831    147793   310540   986013      6.3s     4.2s     3.4s     3.6s
           RL_EA_UL     15848    42284    70933    207199      9.1s     5.2s     3.3s     3.1s
           RV_LA_WB     1414     2557     5881     13803       9.8s     5.0s     3.4s     3.6s

Res.       RL_EA_UL     22924    50858    186940   592674      21.9s    12.7s    11.6s    10.6s
(O-dist)   RV_LA_WB     1127     2382     5543     12745       21.3s    10.9s    8.7s     6.3s

5.4.3 Iterative Applications Benchmarking

In this section, we evaluate the performance of a suite of applications that we enhanced with resilience support using the iterative application framework. As described in Section 5.3.2, the underlying fault tolerance mechanism of this framework is periodic global checkpointing based on the PlaceLocalStore. We performed weak scaling experiments using 128, 256, 512, and 1024 places, with one core per place (except for LULESH, which uses 343, 512, 729, and 1000 places).

² The lines of code were counted using David A. Wheeler's 'SLOCCount' tool.

Table 5.10: SSCA2-k4 strong scaling performance with RV_LA_WB under failure.

threads                                     64       128      256      512

Failure free    Processing time             21.3s    10.9s    8.7s     6.3s
                Conflicts                   1127     2382     5543     12745

One failure     Processing time             23.4s    15.6s    10.8s    7.3s
                Replication recovery time   1.2s     3.4s     0.4s     0.3s
                Conflicts                   3500     5786     7021     13861

Figure 5.13: SSCA2-k4 strong scaling performance with RV_LA_WB under failure using O-dist finish. (Processing time in seconds versus the number of clustering threads (cores=2X, places=X/4) for the non-resilient lock-based, non-resilient Tx, resilient Tx, and resilient Tx with 1 failure configurations.)

Each place is configured to start one worker thread, using X10_NTHREADS=1, and one immediate thread, using X10_NUM_IMMEDIATE_THREADS=1. We allocated 60 GiB of memory per Raijin node to have sufficient space for the application data and the in-memory checkpoints. All the applications were configured to execute 60 iterations. When resilience is enabled, checkpointing is performed every 10 iterations, with the first checkpoint taken at iteration 0. To evaluate the failure recovery performance, we simulated three failures during the execution of each application. The victim places are N/4, N/2, and 3N/4, where N is the number of places. We forced a place failure at iterations 5, 35, and 55. Therefore, a recovered application executes a total of 75 iterations, because each failure causes the application to re-execute 5 iterations.

In our experiments, we measure the different sources of resilience overhead impacting the applications, including the resilient finish overhead, the checkpointing overhead, and the recovery overhead. The resilient finish overhead depends on the termination detection implementation used by the runtime system. In Chapter 4, we demonstrated the superior performance of our proposed optimistic finish protocol compared to RX10's pessimistic finish protocol using micro-benchmarks that represent different computation patterns (see Section 4.11). In addition to the micro-benchmarks, we also aim to evaluate the performance of optimistic finish in realistic applications. Therefore, one of the objectives of our experiments in this section is to compare the resilience overhead of the applications using the four resilient finish implementations: O-p0, O-dist, P-p0, and P-dist. By default, the distributed implementations avoid replicating any finish originating from place-zero, thereby avoiding the cost of synchronously replicating the task signals. Because the failure of place-zero is currently unrecoverable in RX10, replicating the state of these finishes is useless. However, we can disable this optimization to measure the performance when all the finish constructs are replicated. We use the modes O-dist-slow and P-dist-slow to refer to the configuration in which all the finish constructs, including those hosted at place-zero, are replicated.

5.4.3.1 X10 Global Matrix Library

The X10 Global Matrix Library (GML) provides a rich set of matrix formats and routines for implementing distributed linear algebra operations. It includes a suite of benchmarking applications performing common data analytics and machine learning algorithms. To simplify the development of resilient GML applications, we extended key matrix classes with snapshot and restore methods that make checkpointing and recovery easier to implement in the application [Hamouda et al., 2015]. Most of the distributed matrix operations in GML rely on collective communication between places. Initially, GML contained a custom implementation of collectives. We replaced these custom collective operations with calls to the collective operations of the Team class, to exploit the performance and resilience capabilities provided by the Team class, including the use of fault tolerant collective functions from MPI-ULFM (see Section 3.4.4). Our enhancements have been contributed to the X10 code base and are available in X10 release v2.6.1.

In this section, we evaluate the cost of resilience for three GML applications: Linear Regression, Logistic Regression, and PageRank. These applications follow the SPMD bulk-synchronous model; therefore, we enhanced them with resilience support using the SPMD iterative executor described in Section 5.3.2.2. The code of the three applications is available as part of X10 release v2.6.1 and can be located at x10/x10.gml/examples. The initialization procedure of the three applications uses a fan-out finish that instructs the places to initialize their partition of the input data. After initialization, the SPMD executor also uses a fan-out finish to supervise the execution of the steps. The three applications execute a series of coordinated steps that communicate between places only through collective operations. Therefore, the communication patterns used in these applications are limited to the fan-out finish pattern and collective operations.

The BenchMicro results for the fan-out finish pattern show that the centralized implementations (O-p0 and P-p0) are more efficient for this pattern than the distributed implementations (see Section 4.11.2.2), because this pattern requires resilient tracking of only one finish. Fortunately, all the applications evaluated in this chapter start their fan-out finishes from place-zero, which is expected in most X10 applications. Because the O-dist and P-dist modes avoid replicating place-zero finishes, their performance is expected to be similar to that of the centralized modes. On the other hand, the O-dist-slow and P-dist-slow modes are expected to suffer higher resilience overhead.

5.4.3.2 Linear Regression

The Linear Regression (LinReg) benchmark trains a linear regression model against a set of labeled training examples. We trained a model of 100 features with a training set size of 1,000,000 examples per place. The training data is generated randomly using a certain seed and partitioned evenly among the places. We configured our experiments to exclude the input training data from the checkpointed state. When a failure occurs, the active places collectively regenerate the input data using the same random seed. This avoids the high cost of serializing and checkpointing the large input matrix. Approximately 86 lines of code out of 414 total lines were added or modified from the original implementation to conform with the IterativeApp interface.

Figure 5.14 shows a detailed breakdown of the execution time (in seconds) with different numbers of places. Table 5.11 focuses on the results of the smallest and largest problem sizes (with 128 and 1024 places, respectively) and on the O-dist mode, which achieves either the same or better performance than the other resilient modes in most of the results in this chapter. The resilient execution times mentioned in the following discussion are for the O-dist mode only.

The execution time of a single step is almost the same in non-resilient and resilient modes. Since we are using the MPI-ULFM backend, the Team class maps its collective operations to efficient fault tolerant MPI collectives and entirely avoids the resilience overhead of finish.

The resilience cost of the fan-out finish used during initialization is only measurable in the O-dist-slow and P-dist-slow modes, where the optimistic finish is about 2 seconds faster with 1024 places. LinReg has a modest checkpointing state per place that includes only four dense vectors of 100 elements each. The dense vector format enables efficient data serialization while checkpointing and restoring the data. The resulting overhead of in-memory checkpointing using the PlaceLocalStore is negligible for this application (only 1 ms with 1024 places). Using MPI_COMM_AGREE to guarantee the successful generation of a checkpoint at all places costs LinReg only 15 ms with 1024 places, mainly because the places take a very short time to checkpoint their state. Places that take a long time to serialize and checkpoint their state delay the termination of the agreement protocol and cause the application to suffer longer checkpoint agreement times; the results of the LogReg and PageRank applications demonstrate this effect. Because LinReg checkpointing is fast, a resilient failure-free execution with six checkpoints experiences less than 0.2% resilience overhead.

When a failure occurs, the first place that detects the failure implicitly invalidates the Team communicator by calling MPI_COMM_REVOKE. The invalidated communicator causes the places to fail while executing their next collective function. The fan-out finish used by the iterative executor reports the failure to the executor only after all the places stop processing and raise errors. With 1024 places, the cost of failure detection and recovery is as follows. Failure detection costs LinReg about 662 ms. Replacing the failed Team with a new one for recovery takes about 200 ms in the three GML applications. Recovering the resilient store takes only 17 ms, thanks to the limited checkpoint state and the simplicity of dense vector serialization. The major cost of application recovery comes from the app.remake() method. This method reorganizes the structure of the input matrix and other vectors to map to the new group of active places and reinitializes the input data at all places³. As a result, it costs the application about 2 seconds. Restoring the last checkpointed state at each place results in no measurable performance overhead, because each place restores its state from its local master replica and because the checkpointed state is very small for this application. The total time for recovering the application from one failure, including the time to re-execute the last 5 steps, is about 6 seconds (15% of the total execution time in non-resilient mode).
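The input-regeneration strategy described at the start of this subsection can be sketched as follows. Every name in this fragment (the fields w, iter, X and seed, the generateTrainingData helper, and the newActivePlaces property of ChangeDescription) is a hypothetical placeholder used only to illustrate the idea; it is not the actual GML benchmark code.

    // Sketch: only the small model state is checkpointed; the large training
    // matrix is regenerated from the same random seed inside remake().
    public def checkpoint():HashMap[String,Any] {
        val ckpt = new HashMap[String,Any]();
        ckpt.put("w", w);        // model vector: cheap to serialize
        ckpt.put("iter", iter);
        // the training matrix X is deliberately NOT stored here
        return ckpt;
    }

    public def remake(changes:ChangeDescription) {
        // rebuilding X over the new active places with the same fixed seed
        // reproduces the data that the dead place held before the failure
        X = generateTrainingData(seed, changes.newActivePlaces);
    }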

5.4.3.3 Logistic Regression

The Logistic Regression (LogReg) benchmark trains a binary classification model against a set of labeled training examples. We trained a model of 100 features with a training set size of 1,000,000 examples per place. The training data is generated randomly using a certain seed and partitioned evenly among the places. As in LinReg, the input data is excluded from the checkpointed state and is regenerated using the same random seed for failure recovery.

³ It is possible to optimize this method to initialize the input data only at the newly added places; however, the version used in the evaluation does not perform this optimization.

Figure 5.14: Linear regression weak scaling performance (1 core per place).

Table 5.11: Linear regression performance.

                                               128 places    1024 places

Non-res.        initialization                 3016 ms       4042 ms
                step                           503 ms        573 ms
                total                          33176 ms      38422 ms

Res. (O-dist)   initialization                 3038 ms       4369 ms
                step                           501 ms        567 ms
                checkpoint data (1 ckpt)       1 ms          1 ms
                checkpoint agree (1 ckpt)      5 ms          15 ms
                failure detection (1 failure)  512 ms        662 ms
                store recovery (1 failure)     17 ms         17 ms
                application remake (1 failure) 1583 ms       1892 ms
                team remake (1 failure)        168 ms        206 ms
                restore data (1 failure)       0 ms          0 ms
                total (6 ckpts, 3 failures)    47489 ms      55321 ms

Approximately 110 lines of code out of 585 total lines were added or modified from the original implementation to conform with the IterativeApp interface.

LogReg performance results are shown in Figure 5.15 and Table 5.12. The resilient execution times mentioned in the following discussion are for the O-dist mode only. As in LinReg, resilient execution of LogReg steps incurs no resilience overhead, due to the use of MPI-ULFM's collectives. The O-dist-slow and P-dist-slow modes cause high resilience overhead while initializing and reinitializing the application for recovery; the other resilient modes have negligible resilience overhead.

The LogReg checkpoint at each place includes a large vector with 1,000,000 elements and two small vectors with 100 elements. The vectors are stored in a dense format, as in LinReg. With 1024 places, LogReg takes 76 ms to save the data in the resilient store and 46 ms for checkpoint agreement. The resulting resilience overhead for a failure-free execution with six checkpoints is less than 2% with 1024 places.

The resilient executor detects the failure of a place in about 705 ms. Because the checkpoint state is larger than LinReg's, recovering the resilient store by transferring the checkpointed data to a new spare place is more expensive in LogReg, at 204 ms. Similar to LinReg, Team recovery takes about 200 ms. The most expensive part of failure recovery is remaking the application, which takes about 3 seconds in LogReg. Restoring the application state using the last checkpoint takes 26 ms. The total time for recovering the application from one failure, including the time to re-execute the last 5 steps, is about 11 seconds (12% of the total execution time in non-resilient mode).

Table 5.12: Logistic regression performance.

                                               128 places    1024 places

Non-res.        initialization                 6561 ms       8272 ms
                step                           1194 ms       1365 ms
                total                          78201 ms      90192 ms

Res. (O-dist)   initialization                 6649 ms       8875 ms
                step                           1179 ms       1341 ms
                checkpoint data (1 ckpt)       75 ms         76 ms
                checkpoint agree (1 ckpt)      26 ms         46 ms
                failure detection (1 failure)  535 ms        705 ms
                store recovery (1 failure)     202 ms        204 ms
                application remake (1 failure) 2437 ms       2918 ms
                team remake (1 failure)        179 ms        206 ms
                restore data (1 failure)       24 ms         26 ms
                total (6 ckpts, 3 failures)    105811 ms     122359 ms

Figure 5.15: Logistic regression weak scaling performance (1 core per place).

5.4.3.4 PageRank

The PageRank benchmark computes a measure of centrality for each vertex in a graph. Our experiments use a randomly generated graph of 10 million vertices per place, stored in a sparse matrix with a density of 0.001. Similar to LinReg and LogReg, we do not checkpoint the input graph but rather recompute it during failure recovery. Approximately 65 lines of code out of 383 total lines were added or modified from the original implementation to conform with the IterativeApp interface.

PageRank performance results are shown in Figure 5.16 and Table 5.13. The resilient execution times mentioned in the following discussion are for the O-dist mode only. The scaling pattern of PageRank is less efficient than that of LinReg and LogReg. We believe that this is due to the inefficient representation of sparse matrices in the GML library. A sparse matrix in GML is based on a deep hierarchy of nested objects that complicates processing the matrix and may result in inefficient memory management.

Because PageRank steps use collective operations only, we expect the step execution time to be similar in non-resilient and resilient modes. However, we observed a resilience overhead between 10%–27% in experiments that used 128, 256, and 512 places. The root cause of this overhead is unclear to us at this stage. Because this problem does not occur in the applications that use the dense matrix format, we aim to investigate in the future whether inefficiencies in sparse matrix representation and processing can influence the performance of RX10 applications. With 1024 places, PageRank steps do not incur any resilience overhead, as expected.

With 1024 places, the PageRank checkpoint state at each place includes a large vector with about 3,000,000 elements. PageRank takes about 324 ms to save this vector in the resilient store. Because PageRank checkpointing is more expensive than LinReg and LogReg, the checkpoint agreement time is also higher, at 236 ms. The resulting resilience overhead for a failure-free execution with six checkpoints is less than 9% with 1024 places.

The resilient executor detects the failure of a place in about 380 ms. Because the checkpoint state is larger than LogReg's, recovering the resilient store is also more expensive in PageRank, at 445 ms. Similar to LinReg and LogReg, Team recovery takes about 200 ms. Reinitializing the application using the app.remake method takes about 2 seconds. Restoring the application state using the last checkpoint takes 68 ms. The total time for recovering the application from one failure, including the time to re-execute the last 5 steps, is about 6 seconds (16% of the total execution time in non-resilient mode).

5.4.3.5 LULESH

X10 provides an implementation of the LULESH proxy application [Karlin et al., 2013], which simulates shock hydrodynamics through a series of time steps. Each step executes a series of stencil operations on a block of elements and the nodes that define these elements. The elements and their nodes are evenly partitioned among the places. Therefore, exchanging ghost regions between neighboring places is required for the stencil computations at each step.

Figure 5.16: PageRank weak scaling with number of places (1 core per place).

Table 5.13: PageRank performance.

                                               128 places    1024 places

Non-res.        initialization                 1384 ms       3385 ms
                step                           232 ms        570 ms
                total                          15324 ms      37565 ms

Res. (O-dist)   initialization                 1237 ms       3546 ms
                step                           259 ms        563 ms
                checkpoint data (1 ckpt)       89 ms         324 ms
                checkpoint agree (1 ckpt)      53 ms         236 ms
                failure detection (1 failure)  129 ms        380 ms
                store recovery (1 failure)     165 ms        445 ms
                application remake (1 failure) 831 ms        2029 ms
                team remake (1 failure)        183 ms        201 ms
                restore data (1 failure)       19 ms         68 ms
                total (6 ckpts, 3 failures)    25495 ms      58500 ms

Reduction and barrier collective operations are also invoked by LULESH steps for synchronization and for evaluating the convergence state. We enhanced LULESH with resilience support using the SPMD iterative executor. Our resilient implementation has been contributed to the X10 applications repository [X10 Applications] in the lulesh2_resilient folder. Approximately 200 lines of code out of 4100 total lines were added or modified from the original implementation to conform with the IterativeApp interface. We evaluate the performance of LULESH with a problem size of 30³ elements per place. The number of places is 343, 512, 729, and 1000, because LULESH requires a perfect cube number of places. The performance results are shown in Figure 5.17 and Table 5.14.

LULESH uses a communication-intensive initialization kernel that generates a large number of concurrent finish objects and remote tasks. Each place pre-allocates buffers for holding the ghost regions and communicates with 26 neighboring places using remote asyncs to exchange references to these buffers. When places die, some of these references dangle and require updating. On recovering the application, the initialization kernel is re-executed at all places to update the buffer references. Using an efficient termination detection protocol is crucial for achieving acceptable performance for the LULESH initialization and recovery kernels. The initialization kernel incurs the following resilience overhead with 1000 places: 482% for P-p0, 335% for O-p0, 45% for P-dist, and only 40% for O-dist. When place-zero finishes are tracked, the overhead is 475% for P-dist-slow and 377% for O-dist-slow.

Unlike the GML applications, LULESH steps are subject to the finish resilience overhead, as they generate pair-wise remote tasks for exchanging ghost regions. The measured resilience overhead of a single step with 1000 places is 13% for P-p0, 8% for O-p0, 10% for P-dist, and only 4% for O-dist. The optimistic protocol successfully reduces the resilience overhead for LULESH during initialization and step execution. As initializing the places and exchanging the ghost regions execute in parallel at all places, the distributed implementations outperform the centralized implementations in this application.

The following analysis focuses on the O-dist performance results shown in Table 5.14. LULESH has a modest checkpointing state that takes 24 ms to save in the resilient store and 35 ms for checkpoint agreement with 1000 places. The resulting resilience overhead for a failure-free execution with six checkpoints is less than 19% (the high percentage is due to the small execution time of LULESH steps compared to the GML applications).

The failure recovery performance with 1000 places is as follows. The resilient executor detects the failure of a place in about 364 ms. Recovering the resilient store takes only 60 ms because the checkpointed state is limited in size. Similar to the GML applications, Team recovery takes about 200 ms. The most expensive part of failure recovery is reinitializing the application, which takes about 2 seconds. Restoring the application state using the last checkpoint takes only 3 ms. The total time for recovering the application from one failure, including the time to re-execute the last 5 steps, is about 3 seconds (50% of the total execution time in non-resilient mode).

Figure 5.17: Resilient LULESH weak scaling with number of places (1 core per place).

Table 5.14: LULESH performance.

                                                343 places  1000 places

Non-res.        initialization                     973 ms    1720 ms
                step                                73 ms      85 ms
                total                             5333 ms    6820 ms

Res. (O-dist)   initialization                    1355 ms    2410 ms
                step                                82 ms      89 ms
                checkpoint data (1 ckpt)            24 ms      24 ms
                checkpoint agree (1 ckpt)           31 ms      35 ms
                failure detection (1 failure)      123 ms     364 ms
                store recovery (1 failure)          48 ms      60 ms
                application remake (1 failure)    1146 ms    2243 ms
                team remake (1 failure)            198 ms     194 ms
                restore data (1 failure)             3 ms       3 ms
                total (6 ckpts, 3 failures)      12362 ms   18005 ms

Table 5.15: Changed lines of code for adding resilience using the iterative application frame- work.

                        Total LOC   Changed LOC   Changed %
Linear regression           414          86          21%
Logistic regression         585         110          19%
PageRank                    383          65          17%
LULESH                     4100         200           5%

LULESH with DMTCP

At an early stage of developing the iterative application framework, we conducted a preliminary experiment to compare the performance of application-level checkpointing versus transparent checkpointing with the DMTCP [Ansel et al., 2009] tool. DMTCP does not require any code changes; it checkpoints all the application data on disk, even if part of this data is not needed for recovery or is computable from other values. On the other hand, our framework checkpoints in memory and relies on users to specify the values to be checkpointed.

We measured the time to checkpoint LULESH with a problem size of 30³ elements per place using 216 places on NECTAR virtual machines. The size of the user-defined checkpointing state is about 2.6 MB per place. The non-resilient X10 mode was used with the DMTCP runs, and the resilient P-p0 mode, the only resilient mode available at that time, was used with our resilient framework runs. DMTCP took 6.67 seconds to checkpoint the entire application state, while our framework took only 0.35 seconds (19X faster than DMTCP).

With DMTCP the application stops when it faces a place failure and must be restarted for recovery (ideally on the same nodes, to avoid moving the checkpointing files to new locations). This limitation is avoided by our resilient framework, which does not kill the application upon a failure and encapsulates the checkpoints within the application's memory.

X10 was compiled from revision 41ab27e of the br07_dmtcp branch of the language repository https://github.com/shamouda/x10.git. LULESH was compiled from revision 8439557 of the br07_dmtcp branch of the applications repository https://github.com/shamouda/x10-applications. The X10 places used one worker thread (X10_NTHREADS=1) and did not use any immediate threads. The MPI threading level was configured as MPI_THREAD_MULTIPLE. MPICH was used as the transport layer of X10 due to technical difficulties integrating MPI-ULFM and DMTCP at that time. Because MPICH is not resilient, we could not simulate a failure recovery scenario in this experiment.

5.4.4 Summary of Performance Results

The experiments on the transactional store showed the importance of choosing the CC mechanism based on the workload characteristics. While a read-intensive application is expected to benefit more from a pessimistic concurrency control mechanism, such as RL_EA_UL, a write-intensive application is expected to benefit more from an optimistic concurrency control mechanism, such as RV_LA_WB. Applications pay a considerable performance cost for using resilient transactions; this cost increases as the transaction size and the number of write operations increase. However, this result is not surprising given the need for data replication and consensus to ensure data availability and consistency in the presence of failures. Applications that use small transactions infrequently are expected to incur minimal resilience overhead.

The experiments on the iterative applications demonstrated the successful role of MPI-ULFM and the optimistic termination detection protocol in reducing the resilience overhead of RX10 applications. Using native Team collectives does not add any measurable resilience overhead to the applications. The impact of the finish protocol on overall performance varies between applications. The GML applications, which depend only on the fan-out finish pattern, do not gain a significant advantage from using the optimistic finish protocol or the distributed finish store. On the other hand, LULESH improves significantly with the optimistic finish protocol due to the communication intensity of its initialization and recovery kernels. It also scales significantly better with a distributed finish store because it creates a large number of concurrent finish constructs. As expected, applications with a larger checkpoint state take longer to save a checkpoint, to agree on the checkpoint status, and to recover the resilient store.

Productivity-wise, we showed how the transaction interface enabled us to provide atomicity and resilience to the SSCA2-k4 application with less than half the lines of code used in a non-resilient lock-based implementation. For the iterative applications, Table 5.15 shows the productivity advantage of using the resilient iterative application framework for adding resilience to existing X10 applications with limited code changes.

5.5 Summary

This chapter presented multiple contributions that aim at reducing the burden of handling data resilience in APGAS applications.

Adding failure-awareness to the async-finish model, although necessary for recovering the control flow, is not sufficient for recovering all applications. Stateful applications often require a data resilience mechanism that preserves the application's state in spite of failures. Because handling data resilience at the application level is a tedious and error-prone task, we extended the RX10 model with resilient data stores that can protect critical application data. We described the design of two in-memory data stores that enable X10 applications to persist critical application data with minimal programming effort.

We addressed the challenge of handling atomic operations on a distributed store efficiently and correctly. This chapter described our proposed transactional finish construct, which transparently handles termination detection and transaction commitment. The availability of a transaction interface not only improves the programmer's productivity, but can also result in more trusted implementations compared to manually tuned lock-based alternatives. The CC mechanism used by the transaction implementation is an important factor in the application's performance. The chapter presented a comparative performance evaluation using two CC mechanisms and two types of workloads executing in non-resilient and resilient modes. To the best of our knowledge, this is the first evaluation of resilient transactions in any PGAS or APGAS language.

Towards the goal of supporting data resilience in RX10, we put multi-resolution resilience into practice. We showed how the resilient async-finish model, with failure awareness and composability, enables the nesting of resilient components for building high-level resilient frameworks. Two frameworks were presented: an iterative framework for bulk-synchronous applications and a parallel workers framework suitable for embarrassingly parallel applications. The application studies confirm the productivity advantage of these frameworks. The chapter concluded with a thorough performance analysis of these applications that evaluated the various techniques developed in this thesis for improving the scalability and efficiency of RX10 applications. In particular, it highlighted the performance gains resulting from using MPI-ULFM, the optimistic finish protocol, and the distributed finish implementations.

Chapter 6

Conclusion

This thesis addressed the challenges of designing high-level parallel programming models that can exploit the massive parallelism available in modern supercomputers while supporting programmer productivity and resilience to process failures. It proposed multi-resolution resilience as an approach for reaching a balance between the productivity of transparent system-level resilience and the potential performance scalability of user-level resilience. Multi-resolution resilience enables the construction of productive resilient frameworks at different levels of abstraction, based on efficient composable resilient constructs in high-level parallel programming languages.

A common property of high-level parallel programming models is support for nested task parallelism via composable task-parallel constructs. Such task constructs can provide a flexible base for supporting multi-resolution resilience. However, orchestrating the control flow of nested task graphs in the presence of failure, while guaranteeing correctness and scalability, is challenging. This thesis addressed this challenge in the context of Resilient X10 (RX10), a resilient APGAS programming language. RX10 provides user-level resilience support by extending the async-finish model with failure-awareness semantics that facilitate reasoning about the application's control flow under failures. Our experience using RX10 shows that extending async-finish with structured failure reporting via exceptions enables adding fault tolerance to existing codes in a modular and understandable way, and also facilitates the hierarchical composition of fault tolerance strategies; the sketch below illustrates this style of failure handling.

Our work identified and addressed performance and productivity limitations in RX10 that hindered its support for large-scale HPC applications. First, the resilient async-finish model imposes a high failure-free resilience overhead. Second, RX10 sacrificed portability and scalability advantages available in emerging fault-tolerant communication libraries, such as MPI-ULFM, which provide optimized implementations of common communication patterns in HPC applications. Third, RX10 did not provide productive mechanisms for protecting the availability and consistency of application data in the presence of failures, which limited the productivity of the model.

In the following, we summarize how we addressed the above limitations by answering the thesis research questions.
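As a concrete illustration of structured failure reporting, the following minimal X10 sketch (not taken from the thesis implementation) runs a fan-out computation under a finish and handles the loss of places by inspecting the exceptions collected by that finish. The compute() kernel, the recovery strategy, and the getExceptionsOfType filtering call are assumptions for the purpose of the example.

    // Minimal sketch: the governing finish gathers failures from lost tasks into a
    // MultipleExceptions, so the caller can reason about which places died.
    class ResilientFanOutSketch {
        public static def main(args:Rail[String]) {
            try {
                finish for (p in Place.places()) at (p) async {
                    compute();   // application-specific work (assumed)
                }
            } catch (es:MultipleExceptions) {
                // getExceptionsOfType is assumed here as the way to filter the
                // collected failures down to the DeadPlaceExceptions.
                val dead = es.getExceptionsOfType[DeadPlaceException]();
                Console.OUT.println("task failures due to dead places: " + dead.size);
                // re-execute the lost work on surviving places (assumed recovery strategy)
            }
        }
        static def compute() { /* per-place work */ }
    }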


6.1 Answering the Research Questions

Q1: How to improve the resilience efficiency of the async-finish task model?

We identified two main reasons for the high resilience overhead of the async-finish model in RX10. The first is the use of a pessimistic termination detection (TD) protocol for the finish construct that favors the simplicity of failure recovery over the performance of failure-free execution. This pessimistic protocol uses two additional messages per task, compared to non-resilient finish, to capture every transition in the task state and thus avoid any uncertainty about the control-flow structure when a failure occurs. The second is the use of a non-scalable resilient store for maintaining critical TD metadata. As an example, executing an all-to-all computation over 1024 places can result in a resilience overhead of about 5000% with the above limitations.

Our work demonstrated that the efficiency of the resilient async-finish model can be improved by adopting message-optimal TD protocols that favor the performance of failure-free execution over the performance and simplicity of failure recovery. By avoiding the communication of certain task and finish events, our proposed 'optimistic finish' protocol allows uncertainty about the global structure of the computation which can be resolved correctly at failure time, thereby reducing the overhead of failure-free execution. Compared to non-resilient finish, the optimistic protocol uses one additional message per task. By switching from the pessimistic protocol to the optimistic protocol, an all-to-all computation over 1024 places can achieve up to 50% performance improvement.

Our work also demonstrated that the efficiency of the resilient async-finish model can be improved by using a scalable resilient store for maintaining critical TD metadata (the resilient finish store). This optimization is mainly useful for computations that use large numbers of concurrent finish scopes. For example, a tree fan-out computation over 1024 places can achieve up to 88% performance improvement by switching from a centralized implementation of the optimistic finish protocol to a distributed one. However, our performance evaluation showed that a distributed resilient finish implementation is not always the most efficient option. Computations that use one or a few finish constructs, such as flat fan-out and all-to-all, execute more efficiently using a centralized store in which replicating the TD signals is not required. For example, the all-to-all pattern, which uses a single finish to track a large number of tasks, achieves up to 39% performance improvement by switching from a distributed to a centralized optimistic finish implementation.

It is worth noting that the survivability of both the optimistic protocol and the pessimistic protocol is limited by the survivability of the resilient finish store. With the centralized store, the runtime can survive all failures except the failure of place zero, which is used for saving all TD metadata. With the distributed store, the runtime can survive all failures except the simultaneous failure of a place and its backup.

The performance of different computation patterns, as summarized in Table 4.3, can guide the choice of the most suitable protocol-implementation combination for RX10 applications. For the shock hydrodynamics application LULESH, the best performance was achieved using the optimistic distributed finish implementation. By switching from the centralized pessimistic finish implementation (P-p0) to our proposed distributed optimistic finish implementation (O-dist), the resilience overhead of a single LULESH step was reduced from 13% to only 4% (see Section 5.4.3.5).
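For reference, the following minimal X10 sketch (not taken from the thesis) shows the flat fan-out and all-to-all patterns discussed in this answer. The number of remote tasks each finish must track is what determines the cost of the resilient termination detection protocols: a flat fan-out finish tracks one task per place, while an all-to-all finish tracks one task per (source, destination) pair.

    class FinishPatternsSketch {
        // Flat fan-out: a single finish at the root place tracks P remote tasks.
        public static def fanOut() {
            finish for (p in Place.places()) at (p) async {
                // per-place work
            }
        }

        // All-to-all: a single finish tracks P*P remote tasks, because every place
        // spawns one task at every other place under the same governing finish.
        public static def allToAll() {
            finish for (p in Place.places()) at (p) async {
                for (q in Place.places()) at (q) async {
                    // work sent from p to q
                }
            }
        }
    }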

Q2: How to exploit the fault tolerance capabilities of MPI-ULFM to improve the scalability of resilient APGAS languages?

Our work found the following benefits in MPI-ULFM that make it a suitable base for resilient APGAS languages:

• The asynchronous execution model of APGAS languages makes global synchronization for failure detection or runtime recovery inefficient or even infeasible for some languages. While global failure propagation and global failure recovery are supported, MPI-ULFM does not impose such synchronous failure handling mechanisms on its clients by default. Asynchronous programming languages can achieve global failure detection while avoiding global synchronization by periodically checking for incoming messages from any other process using MPI_Iprobe and MPI_ANY_SOURCE.

• Many APGAS applications follow the SPMD execution model and express their algorithms using collective operations. As an extension of MPI-3, MPI-ULFM provides efficient fault-tolerant collective functions, which can be used by bulk-synchronous APGAS applications for achieving better performance.

• MPI-ULFM provides an efficient fault-tolerant agreement interface that applications can leverage for implementing different fault tolerance mechanisms.

• Finally, our experimental evaluation shows that MPI-ULFM does not impose a significant resilience overhead on applications in failure-free executions.

The above capabilities facilitated the integration between RX10 and MPI-ULFM as described in Chapter 3. A direct advantage of using MPI-ULFM as a communication runtime for RX10 is enabling applications to utilize the superior networking and compute capabilities of large-scale supercomputers, on which MPI is widely supported. More importantly, exploiting the optimized fault-tolerant collective operations provided by MPI-ULFM enabled us to eliminate most of the resilience overhead of X10's bulk-synchronous applications, which would otherwise use emulated collectives that are subject to the resilience overhead of the async-finish model. For example, the Linear Regression and Logistic Regression benchmarks, which rely entirely on collective operations for step execution, achieve almost the same step execution time in non-resilient and resilient modes (see Section 5.4.3.2 and Section 5.4.3.3). By using the fault-tolerant agreement capability of MPI-ULFM, coordinated checkpointing has been supported for RX10 applications with minimal resilience overhead.
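The following X10 sketch shows the kind of bulk-synchronous step that benefits from this integration: when the Team implementation is backed by MPI-ULFM, the barrier and reduction below map onto the library's fault-tolerant native collectives rather than emulated, finish-based ones. The computeStep() kernel and the exact scalar allreduce signature used here are assumptions for illustration.

    import x10.util.Team;

    class BspStepSketch {
        public static def main(args:Rail[String]) {
            finish for (p in Place.places()) at (p) async {
                val localError = computeStep();                 // application kernel (assumed)
                // Scalar allreduce overload assumed for brevity; it combines the
                // per-place errors so every place can test for convergence.
                val globalError = Team.WORLD.allreduce(localError, Team.ADD);
                Team.WORLD.barrier();                           // synchronize before the next step
                if (here.id == 0) Console.OUT.println("error = " + globalError);
            }
        }
        static def computeStep():Double = 0.0;                  // placeholder
    }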

At the time of writing, the implementation of fault-tolerant one-sided communication functions in MPI-ULFM has not been completed. Because X10 relies on two-sided communication functions, evaluating MPI-ULFM's support for one-sided communication has been outside the scope of this thesis.

Q3: How to improve the productivity of resilient APGAS languages that support user-level fault tolerance?

This thesis highlighted two main sources of complexity that face developers of fault-tolerant applications: protecting the availability and consistency of the application data, and handling application recovery using fewer resources (i.e. shrinking recovery). We demonstrated that these complexities can be greatly simplified by providing resilient data store abstractions that support the following features: strong locality, in-memory replication, non-shrinking recovery, and support for distributed transactions.

This thesis demonstrated that the composable task-parallel constructs provided by APGAS languages, if extended with failure awareness, enable the composition of high-level resilient frameworks that hide most of the fault tolerance complexities from the programmer. It also showed that the resilient TD protocols used by the async-finish model can easily be extended to coordinate distributed transaction commitment.

To the best of our knowledge, support for resilient distributed transactions had not been attempted for PGAS or APGAS languages prior to our work. This thesis filled this gap by proposing a transactional finish construct as a productive mechanism for handling resilient atomic updates on global data. The transactional finish construct enables the expression of dynamic transactions of arbitrary sizes and is flexible with respect to different CC mechanisms. Our performance evaluation showed that the overhead of resilient transactions depends on the number of transaction participants and the number of conflicts that impact a transaction. We implemented a variety of CC mechanisms to enable programmers to control the conflict rate of their applications. Our results showed that write-intensive transactions can achieve higher throughput by relying on an optimistic concurrency control mechanism that delays acquiring data locks until commitment time (see Figure 5.12). In contrast, read-intensive transactions can achieve higher throughput by eagerly acquiring data locks during transaction processing (see Figure 5.11).
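To illustrate the style of atomic update the transactional finish construct is meant to support, the following is a hypothetical X10 sketch of a two-place transfer. All of the names used here (ResilientStoreSketch, TxSketch, executeTransaction, asyncAt, get, put, placeOf) are illustrative assumptions, not the thesis's actual API.

    // Hypothetical sketch: an atomic transfer between two keys that may live at
    // different places. A transactional-finish-style construct would detect the
    // termination of all participant tasks and then drive the commit protocol.
    interface TxSketch {
        def get(key:Long):Long;
        def put(key:Long, value:Long):void;
        def asyncAt(p:Place, body:()=>void):void;   // spawn a participant task at p
    }
    interface ResilientStoreSketch {
        def placeOf(key:Long):Place;
        def executeTransaction(body:(TxSketch)=>void):void;
    }

    class TransferSketch {
        static def transfer(store:ResilientStoreSketch, src:Long, dst:Long, amount:Long) {
            store.executeTransaction((tx:TxSketch) => {
                tx.asyncAt(store.placeOf(src), () => {
                    tx.put(src, tx.get(src) - amount);   // debit at the source's place
                });
                tx.asyncAt(store.placeOf(dst), () => {
                    tx.put(dst, tx.get(dst) + amount);   // credit at the destination's place
                });
            });
        }
    }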

6.2 Future Work

6.2.1 Fault-Tolerant One-Sided Communication

A problem not addressed in this thesis is exploiting recent advances in fault-tolerant one-sided communication libraries for building high-level resilient languages. Unlike X10, which is implemented entirely using two-sided communication operations, most PGAS languages provide productive data access operations based on implicit transfer of remote data using one-sided get/put operations. Unfortunately, GASNet, the most commonly used communication library for PGAS programming models, still lacks fault tolerance support. The availability of a resilient GASNet implementation could enable many PGAS languages to add support for resilience. Meanwhile, research on the GASPI [Simmendinger et al., 2015] library is promising. GASPI provides fault-tolerant one-sided communication functions that could be used for building resilient PGAS models. One research direction that we aim to investigate is the design of user-level fault tolerance in PGAS languages, such as Chapel and UPC, based on GASPI, and a comparison of the performance of these models to that of RX10.

6.2.2 Hardware Support for Distributed Transactions

Improving the performance of the transactional finish construct is an interesting area for future investigation. Emerging approaches for supporting distributed transactions in PGAS-based distributed systems rely on exploiting hardware capabilities, such as hardware transactional memory (HTM) and RDMA, for achieving higher processing throughput [Dragojević et al., 2015; Chen et al., 2016]. While we currently implement transaction management at the user level, the transactional finish construct could transparently leverage such hardware techniques for achieving higher throughput.

6.2.3 Beyond X10

Although this thesis has focused on X10, its contributions can be applied to other programming languages. The optimistic finish protocol can be used in dynamic task-based systems that require termination detection. For example, Chapel's sync statement and begin construct can express async-finish-like task graphs. We expect applying the optimistic finish protocol to be straightforward in this context. The transactional finish construct, the design of the resilient store, and the resilient application frameworks are also generic and can easily be implemented in other languages.

We provided a fully functional implementation of RX10 over MPI-ULFM, which serves as a basis for future performance comparisons as more languages start to support resilience. We expect future development of MPI-ULFM to include implementations of fault-tolerant one-sided communication functions. This will provide an opportunity for evaluating the suitability of the fault tolerance model of MPI-ULFM to more PGAS languages.

Appendices


Appendix A

Evaluation Platforms

Evaluation was performed on two different parallel machines:

• Raijin: a supercomputer hosted at the National Computational Infrastructure (NCI) at the Australian National University. Each compute node in Raijin has two 8-core Intel Xeon (Sandy Bridge 2.6 GHz) sockets, and uses an Infiniband FDR network. Each core has two hardware threads; however, Raijin disables hyper-threading for PBS jobs by default to keep one hardware thread available for operating system tasks on each core.

• NECTAR: a cloud service provider that assembles compute nodes from multiple Australian institutions, including NCI. An NCI NECTAR node has an Intel Xeon CPU E5-2670 @ 2.60 GHz processor and connects to other nodes through a 56 Gb/s Infiniband (Mellanox) network. We have access to 30 virtual machines of size 'm2.xlarge', all hosted in the NCI zone. An m2.xlarge virtual machine has 45 GiB of memory and 12 virtual cores, providing a total of 360 virtual cores for our experiments. Our virtual machines use the NECTAR Ubuntu 14.04 (Trusty) amd64 operating system.

In our experiments on Raijin, we allocated 10 GiB of memory per node (unless otherwise stated) and used the default hyper-threading configuration. Our MPI jobs used the OpenIB byte transfer layer, used static core binding, and mapped the processes over the nodes in a round-robin manner using the following mpirun parameters: -bind-to core -map-by node -mca btl openib,vader,self. Both OpenMPI 3.0.0 and MPI-ULFM2 on Raijin use the default eager limit (btl_openib_eager_limit) of 12288 bytes.

In our experiments on NECTAR, MPI-ULFM2 used the TCP byte transfer layer, used static core binding, and mapped the processes over the nodes in a round-robin manner using the following mpirun parameters: -bind-to core -map-by node. We modified the eager limit (btl_tcp_eager_limit) to be 12288 bytes, to match Raijin's configuration.

For all reported results, we used the Native (C++ backend) version of X10 compiled using gcc 4.4.7.


Unless otherwise specified, each X10 place was configured to use only one worker thread by setting the environment variable X10_NTHREADS=1. Experiments that used MPI as a transport layer used an additional 'immediate thread' per place to handle the low-level non-blocking active messages invoked by the finish protocols, by setting X10_NUM_IMMEDIATE_THREADS=1. The MPI threading level was configured as MPI_THREAD_SERIALIZED by setting X10RT_MPI_THREAD_SERIALIZED=1. MPI-ULFM2 was installed from revision e87f595 of the master repository at https://bitbucket.org/icldistcomp/ulfm2.git.

Appendix B

TLA+ Specification of the Optimistic Finish Protocol

module Optimistic

This specification models the 'optimistic finish' protocol used for detecting the termination of async-finish task graphs. We model the graph as connected nodes of tasks. Finish objects do not represent separate nodes in the task graph, but implicit objects attached to tasks. The model simulates all possible n-level task graphs that can be created on a p-place system, where each node of the task graph has c children. The variables LEVEL, NUM_PLACES and WIDTH can be used to configure the graph. The model also permits simulating zero or more failures by configuring the MAX_KILL variable.

When generating all possible execution scenarios, the model checker can run out of memory, especially when failure recovery actions are activated. We introduced the variables KILL_FROM and KILL_TO to control the range of steps at which failures can occur, so that we can cut the verification process into multiple phases. For example, we used 4 phases to simulate all possible execution scenarios for a 3-level, 3-place task tree with width 2, which takes around 50 steps in total: Phase 1 kills a place between steps 0 and 20; Phase 2 between steps 20 and 30; Phase 3 between steps 30 and 50; Phase 4 between steps 50 and 100. See the run figures at https://github.com/shamouda/x10-formal-spec/tree/master/async-finish-optimistic/run_figures.

extends Integers

constants
    LEVEL,               \* task tree levels
    WIDTH,               \* task tree branching factor
    NUM_PLACES,          \* number of places
    MAX_KILL,            \* maximum number of failures to simulate
    KILL_FROM, KILL_TO   \* the range of steps to simulate a failure at

variables
    exec_state,          \* execution state
    tasks,               \* set of tasks
    f_set,               \* finish objects
    lf_set,              \* local finish objects
    rf_set,              \* resilient finish objects
    msgs,                \* messages
    nxt_finish_id,       \* sequence of finish ids
    nxt_task_id,         \* sequence of task ids
    nxt_remote_place,    \* next place to communicate with
    killed,              \* set of killed places
    killed_cnt,          \* size of the killed set
    rec_child,           \* pending recovery actions: ghost queries
    rec_to,              \* pending recovery actions: ignoring tasks to dead places
    rec_from,            \* pending recovery actions: counting messages from dead places
    rec_from_waiting,    \* pending recovery actions: receiving counts of messages from dead places
    lost_tasks,          \* debug variable: set of lost tasks due to a failure
    lost_f_set,          \* debug variable: set of lost finishes
    lost_lf_set,         \* debug variable: set of lost local finishes
    step                 \* the execution step of the model

∆ 61 Vars = hexec state, tasks, f set, lf set, rf set, msgs, 62 nxt finish id, nxt task id, nxt remote place, 63 killed, killed cnt, 64 lost tasks, lost f set, lost lf set, 65 rec child, rec to, rec from, rec from waiting, stepi

67 ∆ 68 C = instance OptimisticCommons 69 ∆ 70 TypeOK = Variables type constrains

74 ∧ exec state ∈ {“running”, “success”} 75 ∧ tasks ⊆ C !Task 76 ∧ f set ⊆ C !Finish 77 ∧ lf set ⊆ C !LFinish 78 ∧ rf set ⊆ C !RFinish 79 ∧ nxt finish id ∈ C !FinishID 80 ∧ nxt task id ∈ C !TaskID 81 ∧ nxt remote place ∈ C !Place1D 82 ∧ killed ⊆ C !PlaceID 83 ∧ killed cnt ∈ 0 . . (NUM PLACES − 1) 84 ∧ rec child ⊆ C !GetChildrenTask 85 ∧ rec to ⊆ C !ConvTask 86 ∧ rec from ⊆ C !ConvTask 177

87 ∧ rec from waiting ⊆ C !ConvTask 88 ∧ step ∈ Nat

90 ∆ 91 MustTerminate = Temporal property: the program must eventually terminate successfully

95 3(exec state = “success”)

97 ∆ 98 Init = Initialize variables

102 ∧ exec state = “running” 103 ∧ tasks = {C !RootTask, C !RootFinishTask} 104 ∧ f set = {C !RootFinish} 105 ∧ lf set = {} 106 ∧ rf set = {} 107 ∧ msgs = {} 108 ∧ nxt finish id = 2 109 ∧ nxt task id = 2 110 ∧ nxt remote place = [i ∈ C !PlaceID 7→ i] 111 ∧ killed = {} 112 ∧ killed cnt = 0 113 ∧ lost tasks = {} 114 ∧ lost f set = {} 115 ∧ lost lf set = {} 116 ∧ rec child = {} 117 ∧ rec to = {} 118 ∧ rec from = {} 119 ∧ rec from waiting = {} 120 ∧ step = 0

122 Utility actions: creating instances of task, finish, resilient finish and local finish ∆ 127 NewFinish(task) = 128 [id 7→ nxt finish id, 129 pred id 7→ task.id, 130 home 7→ task.dst, 131 origin 7→ task.src, 132 parent finish id 7→ task.finish id, 133 status 7→ “active”, 134 lc 7→ 1 the main finish task 135 ]

∆ 137 NewResilientFinish(finish) = 138 [id 7→ finish.id, 178 TLA+ Specification of the Optimistic Finish Protocol

139 home 7→ finish.home, 140 origin 7→ finish.origin, 141 parent finish id 7→ finish.parent finish id, 142 transOrLive 7→ C !Place2DInitResilientFinish(finish.home), 143 sent 7→ C !Place2DInitResilientFinish(finish.home), 144 gc 7→ 1, 145 ghost children 7→ {}, 146 isAdopted 7→ false]

∆ 148 NewLocalFinish(fid, dst) = 149 [id 7→ fid, 150 home 7→ dst, 151 lc 7→ 0, 152 reported 7→ C !Place1DZeros, 153 received 7→ C !Place1DZeros, 154 deny 7→ C !Place1DZeros]

∆ 156 NewTask(pred, fid, s, d, t, l, st, fin type) = 157 [id 7→ nxt task id, 158 pred id 7→ pred, 159 src 7→ s, 160 dst 7→ d, 161 finish id 7→ fid, 162 level 7→ l, 163 last branch 7→ 0, 164 status 7→ st, 165 type 7→ t, 166 finish type 7→ fin type]

168 Finish Actions ∆ 172 Task CreatingFinish = 173 ∧ exec state = “running” ∆ 174 ∧ let task = C !FindRunningTask(LEVEL − 1) ∆ 175 task updated = if task = C !NOT TASK then C !NOT TASK 176 else [task except !.last branch = task.last branch + 1, 177 !.status = “blocked”] ∆ 178 finish = if task 6= C !NOT TASK 179 then NewFinish(task) 180 else C !NOT FINISH ∆ 181 finish task = if task 6= C !NOT TASK 182 then NewTask(task.id, finish.id, task.dst, task.dst, 183 “finishMainTask”, task.level + 1, “running”, “global”) 184 else C !NOT TASK 185 in ∧ task 6= C !NOT TASK 179

0 186 ∧ nxt finish id = nxt finish id + 1 0 187 ∧ nxt task id = nxt task id + 1 0 188 ∧ f set = f set ∪ {finish} 0 189 ∧ tasks = (tasks \{task}) ∪ {task updated, finish task} 0 190 ∧ step = step + 1 191 ∧ unchanged hexec state, lf set, rf set, msgs, 192 nxt remote place, 193 killed, killed cnt, 194 lost tasks, lost f set, lost lf set, 195 rec child, rec to, rec from, rec from waitingi

∆ 197 Finish CreatingRemoteTask = 198 ∧ exec state = “running” ∆ 199 ∧ let task = C !FindRunningTaskWithFinishType(LEVEL − 1, “global”) ∆ 200 task updated = if task = C !NOT TASK then C !NOT TASK 201 else [task except !.last branch = task.last branch + 1, 202 !.status = “blocked”] ∆ 203 finish = if task = C !NOT TASK then C !NOT FINISH 204 else C !FindFinishById(task.finish id) ∆ 205 new finish status = if C !IsPublished(task.finish id) 206 then finish.status 207 else “waitingForPublish” ∆ 208 finish updated = if task = C !NOT TASK then C !NOT FINISH 209 else [finish except !.status = new finish status] ∆ 210 src = task.dst ∆ 211 dst = C !NextRemotePlace(src) ∆ 212 new task status = if C !IsPublished(task.finish id) 213 then “waitingForTransit” 214 else “waitingForPublish” ∆ 215 new task = if task = C !NOT TASK then C !NOT TASK 216 else NewTask(task.id, task.finish id, src, dst, 217 “normal”, task.level + 1, new task status, “local”) ∆ 218 msg transit = [from 7→ “src”, to 7→ “rf”, tag 7→ “transit”, 219 src 7→ new task.src, dst 7→ new task.dst, 220 finish id 7→ new task.finish id, 221 task id 7→ new task.id] ∆ 222 msg publish = [from 7→ “f”, to 7→ “rf”, tag 7→ “publish”, 223 src 7→ finish.home, 224 finish id 7→ finish.id] 225 in ∧ task 6= C !NOT TASK 226 ∧ finish.status = “active” 0 227 ∧ nxt task id = nxt task id + 1 0 228 ∧ tasks = (tasks \{task}) ∪ {task updated, new task} 0 229 ∧ f set = (f set \{finish}) ∪ {finish updated} 180 TLA+ Specification of the Optimistic Finish Protocol

230 ∧ C !ShiftNextRemotePlace(src) 231 ∧ if C !IsPublished(task.finish id) 232 then C !SendMsg(msg transit) 233 else C !SendMsg(msg publish) 0 234 ∧ step = step + 1 235 ∧ unchanged hexec state, lf set, rf set, 236 nxt finish id, 237 killed, killed cnt, 238 lost tasks, lost f set, lost lf set, 239 rec child, rec to, rec from, rec from waitingi

∆ 241 Finish ReceivingPublishDoneSignal = 242 ∧ exec state = “running” ∆ 243 ∧ let msg = C !FindMessageToActivePlaceWithTag(“f”, “publishDone”) ∆ 244 finish = if msg = C !NOT MESSAGE then C !NOT FINISH 245 else C !FindFinishById(msg.finish id) ∆ 246 finish updated = if msg = C !NOT MESSAGE then C !NOT FINISH 247 else [finish except !.status = “active”] ∆ 248 pending task = C !FindPendingRemoteTask(finish.id, “waitingForPublish”) ∆ 249 pending task updated = if msg = C !NOT MESSAGE then C !NOT TASK 250 else [pending task except !.status = “waitingForTransit”] ∆ 251 msg transit = [from 7→ “src”, to 7→ “rf”, tag 7→ “transit”, 252 src 7→ pending task.src, dst 7→ pending task.dst, 253 finish id 7→ pending task.finish id, 254 task id 7→ pending task.id] 255 in ∧ msg 6= C !NOT MESSAGE 256 ∧ C !ReplaceMsg(msg, msg transit) 0 257 ∧ f set = (f set \{finish}) ∪ {finish updated} 0 258 ∧ tasks = (tasks \{pending task}) ∪ {pending task updated} 0 259 ∧ step = step + 1 260 ∧ unchanged hexec state, lf set, rf set, 261 nxt finish id, nxt task id, nxt remote place, 262 killed, killed cnt, 263 lost tasks, lost f set, lost lf set, 264 rec child, rec to, rec from, rec from waitingi

∆ 266 Finish TerminatingTask = 267 ∧ exec state = “running” ∆ 268 ∧ let task = C !FindTaskToTerminate(“global”) ∆ 269 finish = if task = C !NOT TASK then C !NOT FINISH 270 else C !FindFinishById(task.finish id) ∆ 271 task updated = if task 6= C !NOT TASK 272 then [task except !.status = “terminated”] 273 else C !NOT TASK ∆ 274 finish updated = if task = C !NOT TASK then C !NOT FINISH 181

275 else if finish.lc = 1 ∧ C !IsPublished(finish.id) 276 then [finish except !.lc = finish.lc − 1, 277 !.status = “waitingForRelease”] 278 else if finish.lc = 1 ∧ ¬C !IsPublished(finish.id) 279 then [finish except !.lc = finish.lc − 1, 280 !.status = “released”] 281 else [finish except !.lc = finish.lc − 1] 282 in ∧ task 6= C !NOT TASK 283 ∧ finish 6= C !NOT FINISH 0 284 ∧ f set = (f set \{finish}) ∪ {finish updated} 285 ∧ if finish updated.status = “waitingForRelease” 0 286 then msgs = msgs ∪ 287 {[from 7→ “f”, to 7→ “rf”, tag 7→ “terminateTask”, 288 src 7→ finish.home, 289 finish id 7→ finish.id, 290 task id 7→ task.id, 291 term tasks by src 7→ C !Place1DTerminateTask(finish.home, 1), 292 term tasks dst 7→ finish.home]} 0 293 else msgs = msgs 294 ∧ if finish updated.status = “released” ∆ 295 then let task blocked = C !FindBlockedTask(finish.pred id) ∆ 296 task unblocked = [task blocked except !.status = “running”] 0 297 in tasks = (tasks \{task, task blocked}) 298 ∪ {task updated, task unblocked} 0 299 else tasks = (tasks \{task}) ∪ {task updated} 0 300 ∧ step = step + 1 301 ∧ unchanged hexec state, lf set, rf set, 302 nxt finish id, nxt task id, nxt remote place, 303 killed, killed cnt, 304 lost tasks, lost f set, lost lf set, 305 rec child, rec to, rec from, rec from waitingi

∆ 307 Finish ReceivingReleaseSignal = 308 ∧ exec state = “running” ∆ 309 ∧ let msg = C !FindMessageToActivePlaceWithTag(“f”, “release”) ∆ 310 finish = if msg = C !NOT MESSAGE then C !NOT FINISH 311 else C !FindFinishToRelease(msg.finish id) ∆ 312 finish updated = if msg = C !NOT MESSAGE then C !NOT FINISH 313 else [finish except !.status = “released”] ∆ 314 task blocked = if msg = C !NOT MESSAGE then C !NOT TASK 315 else C !FindBlockedTask(finish.pred id) ∆ 316 task unblocked = if msg = C !NOT MESSAGE then C !NOT TASK 317 else [task blocked except !.status = “running”] 318 in ∧ msg 6= C !NOT MESSAGE 182 TLA+ Specification of the Optimistic Finish Protocol

319 ∧ C !RecvMsg(msg) 0 320 ∧ f set = (f set \{finish}) ∪ {finish updated} 0 321 ∧ tasks = (tasks \{task blocked}) ∪ {task unblocked} 0 322 ∧ step = step + 1 323 ∧ unchanged hexec state, lf set, rf set, 324 nxt finish id, nxt task id, nxt remote place, 325 killed, killed cnt, 326 lost tasks, lost f set, lost lf set, 327 rec child, rec to, rec from, rec from waitingi

329 Actions applicable to Finish and Local Finish ∆ 333 DroppingTask = 334 ∧ exec state = “running” ∆ 335 ∧ let msg = C !FindMessageToActivePlaceWithTag(“src”, “transitNotDone”) ∆ 336 task = if msg = C !NOT MESSAGE then C !NOT TASK 337 else C !FindTaskById(msg.task id) ∆ 338 task updated = if task = C !NOT TASK then C !NOT TASK 339 else [task except !.status = “dropped”] ∆ 340 blocked task = C !FindTaskById(task.pred id) ∆ 341 blocked task updated = [blocked task except !.status = “running”] 342 in ∧ msg 6= C !NOT MESSAGE 343 ∧ task.status = “waitingForTransit” 344 ∧ blocked task.status = “blocked” 0 345 ∧ tasks = (tasks \{task, blocked task}) ∪ 346 {task updated, blocked task updated} 347 ∧ C !RecvMsg(msg) 0 348 ∧ step = step + 1 349 ∧ unchanged hexec state, f set, lf set, rf set, 350 nxt finish id, nxt task id, nxt remote place, 351 killed, killed cnt, 352 lost tasks, lost f set, lost lf set, 353 rec child, rec to, rec from, rec from waitingi

∆ 355 SendingTask = 356 ∧ exec state = “running” ∆ 357 ∧ let msg = C !FindMessageToActivePlaceWithTag(“src”, “transitDone”) ∆ 358 task = if msg = C !NOT MESSAGE then C !NOT TASK 359 else C !FindTaskById(msg.task id) ∆ 360 task updated = if task = C !NOT TASK then C !NOT TASK 361 else [task except !.status = “sent”] ∆ 362 blocked task = C !FindTaskById(task.pred id) ∆ 363 blocked task updated = [blocked task except !.status = “running”] 364 in ∧ msg 6= C !NOT MESSAGE 183

365 ∧ task.status = “waitingForTransit” 366 ∧ blocked task.status = “blocked” 0 367 ∧ tasks = (tasks \{task, blocked task}) ∪ 368 {task updated, blocked task updated} 369 ∧ C !ReplaceMsg(msg, [from 7→ “src”, to 7→ “dst”, tag 7→ “task”, 370 src 7→ task.src, dst 7→ task.dst, 371 finish id 7→ task.finish id, 372 task id 7→ task.id]) 0 373 ∧ step = step + 1 374 ∧ unchanged hexec state, f set, lf set, rf set, 375 nxt finish id, nxt task id, nxt remote place, 376 killed, killed cnt, 377 lost tasks, lost f set, lost lf set, 378 rec child, rec to, rec from, rec from waitingi

∆ 380 ReceivingTask = 381 ∧ exec state = “running” ∆ 382 ∧ let msg = C !FindMessageToActivePlaceWithTag(“dst”, “task”) ∆ 383 src = msg.src ∆ 384 dst = msg.dst ∆ 385 finish id = msg.finish id ∆ 386 lfinish = if msg = C !NOT MESSAGE then C !NOT FINISH 387 else if C !LocalFinishExists(dst, finish id) 388 then C !FindLocalFinish(dst, finish id) 389 else NewLocalFinish(finish id, dst) ∆ 390 lfinish updated = [lfinish except 391 !.received[src] = lfinish.received[src] + 1, 392 !.lc = lfinish.lc + 1] ∆ 393 task = if msg = C !NOT MESSAGE then C !NOT TASK 394 else C !FindTaskById(msg.task id) ∆ 395 task updated = if task = C !NOT TASK then C !NOT TASK 396 else [task except !.status = “running”] 397 in ∧ msg 6= C !NOT MESSAGE 398 ∧ C !RecvMsg(msg) 399 ∧ if lfinish.deny[src] = 1 0 400 then ∧ lf set = lf set 0 401 ∧ tasks = tasks 0 402 else ∧ lf set = (lf set \{lfinish}) ∪ {lfinish updated} 0 403 ∧ tasks = (tasks \{task}) ∪ {task updated} 0 404 ∧ step = step + 1 405 ∧ unchanged hexec state, f set, rf set, 406 nxt finish id, nxt task id, nxt remote place, 407 killed, killed cnt, 408 lost tasks, lost f set, lost lf set, 184 TLA+ Specification of the Optimistic Finish Protocol

409 rec child, rec to, rec from, rec from waitingi

411 Local Finish Actions ∆ 415 LocalFinish CreatingRemoteTask = 416 ∧ exec state = “running” ∆ 417 ∧ let task = C !FindRunningTaskWithFinishType(LEVEL − 1, “local”) ∆ 418 task updated = if task = C !NOT TASK then C !NOT TASK 419 else [task except !.last branch = task.last branch + 1, 420 !.status = “blocked”] ∆ 421 finish = if task = C !NOT TASK then C !NOT FINISH 422 else C !FindFinishById(task.finish id) ∆ 423 src = task.dst ∆ 424 dst = C !NextRemotePlace(src) ∆ 425 new task = if task = C !NOT TASK then C !NOT TASK 426 else NewTask(task.id, task.finish id, src, dst, 427 “normal”, task.level + 1, 428 “waitingForTransit”, “local”) ∆ 429 msg transit = [from 7→ “src”, to 7→ “rf”, tag 7→ “transit”, 430 src 7→ new task.src, dst 7→ new task.dst, 431 finish id 7→ new task.finish id, 432 task id 7→ new task.id] 433 in ∧ task 6= C !NOT TASK 0 434 ∧ nxt task id = nxt task id + 1 0 435 ∧ tasks = (tasks \{task}) ∪ {task updated, new task} 436 ∧ C !ShiftNextRemotePlace(src) 437 ∧ C !SendMsg(msg transit) 0 438 ∧ step = step + 1 439 ∧ unchanged hexec state, f set, lf set, rf set, 440 nxt finish id, 441 killed, killed cnt, 442 lost tasks, lost f set, lost lf set, 443 rec child, rec to, rec from, rec from waitingi

∆ 445 LocalFinish TerminatingTask = 446 ∧ exec state = “running” ∆ 447 ∧ let task = C !FindTaskToTerminate(“local”) ∆ 448 task updated = if task = C !NOT TASK then C !NOT TASK 449 else [task except !.status = “terminated”] ∆ 450 here = task.dst ∆ 451 finish id = task.finish id ∆ 452 lfinish = if task = C !NOT TASK then C !NOT FINISH 453 else C !FindLocalFinish(here, finish id) ∆ 454 lfinish updated = if task = C !NOT TASK then C !NOT FINISH 185

455 else [lfinish except !.lc = lfinish.lc − 1] ∆ 456 term tasks = if task = C !NOT TASK then C !NOT FINISH 457 else [i ∈ C !PlaceID 7→ 458 if i = lfinish.home then 0 459 else lfinish.received[i] − lfinish.reported[i]] ∆ 460 lfinish terminated = if task = C !NOT TASK then C !NOT FINISH 461 else [lfinish except 462 !.lc = 0, 463 !.reported = lfinish.received] 464 in ∧ task 6= C !NOT TASK 465 ∧ lfinish 6= C !NOT FINISH 466 ∧ if lfinish updated.lc = 0 0 467 then ∧ msgs = msgs ∪ 468 {[from 7→ “f”, to 7→ “rf”, tag 7→ “terminateTask”, 469 src 7→ here, 470 finish id 7→ finish id, 471 task id 7→ task.id, 472 term tasks by src 7→ term tasks, 473 term tasks dst 7→ here]} 0 474 ∧ lf set = (lf set \{lfinish}) ∪ {lfinish terminated} 0 475 else ∧ msgs = msgs 0 476 ∧ lf set = (lf set \{lfinish}) ∪ {lfinish updated} 0 477 ∧ tasks = (tasks \{task}) ∪ {task updated} 0 478 ∧ step = step + 1 479 ∧ unchanged hexec state, f set, rf set, 480 nxt finish id, nxt task id, nxt remote place, 481 killed, killed cnt, 482 lost tasks, lost f set, lost lf set, 483 rec child, rec to, rec from, rec from waitingi

∆ 485 LocalFinish MarkingDeadPlace = 486 ∧ exec state = “running” ∆ 487 ∧ let msg = C !FindMessageToActivePlaceWithTag(“dst”, “countDropped”) ∆ 488 finish id = msg.finish id ∆ 489 here = msg.dst ∆ 490 dead = msg.src ∆ 491 lfinish = if msg = C !NOT MESSAGE then C !NOT FINISH 492 else if C !LocalFinishExists(here, finish id) 493 then C !FindLocalFinish(here, finish id) 494 else NewLocalFinish(finish id, here) ∆ 495 lfinish updated = if msg = C !NOT MESSAGE then C !NOT FINISH 496 else [lfinish except !.deny[dead] = 1] ∆ 497 resp = if msg = C !NOT MESSAGE then C !NOT MESSAGE 498 else [from 7→ “dst”, to 7→ “rf”, tag 7→ “countDroppedDone”, 186 TLA+ Specification of the Optimistic Finish Protocol

499 finish id 7→ msg.finish id, 500 src 7→ msg.src, dst 7→ msg.dst, 501 num dropped 7→ msg.num sent − lfinish.received[dead]] 502 in ∧ msg 6= C !NOT MESSAGE 503 ∧ C !ReplaceMsg(msg, resp) 0 504 ∧ lf set = (lf set \{lfinish}) ∪ {lfinish updated} 0 505 ∧ step = step + 1 506 ∧ unchanged hexec state, tasks, f set, rf set, 507 nxt finish id, nxt task id, nxt remote place, 508 killed, killed cnt, 509 lost tasks, lost f set, lost lf set, 510 rec child, rec to, rec from, rec from waitingi

512 Resilient Store Actions ∆ 516 Store ReceivingPublishSignal = 517 ∧ exec state = “running” ∆ 518 ∧ let msg = C !FindMessageToActivePlaceWithTag(“rf”, “publish”) ∆ 519 finish home = msg.src ∆ 520 finish = if msg = C !NOT MESSAGE ∨ finish home ∈ killed 521 then C !NOT FINISH 522 else C !FindFinishById(msg.finish id) 523 in ∧ msg 6= C !NOT MESSAGE 524 ∧ if finish home ∈/ killed 525 then ∧ C !ReplaceMsg(msg, [from 7→ “rf”, to 7→ “f”, 526 tag 7→ “publishDone”, 527 dst 7→ msg.src, 528 finish id 7→ msg.finish id]) 0 529 ∧ rf set = rf set ∪ {NewResilientFinish(finish)} 530 else ∧ C !RecvMsg(msg) 0 531 ∧ rf set = rf set 0 532 ∧ step = step + 1 533 ∧ unchanged hexec state, tasks, f set, lf set, 534 nxt finish id, nxt task id, nxt remote place, 535 killed, killed cnt, 536 lost tasks, lost f set, lost lf set, 537 rec child, rec to, rec from, rec from waitingi

∆ 539 Store ReceivingTransitSignal = 540 ∧ exec state = “running” ∆ 541 ∧ let msg = C !FindMessageToActivePlaceWithTag(“rf”, “transit”) ∆ 542 rf = if msg = C !NOT MESSAGE then C !NOT FINISH 543 else C !FindResilientFinishById(msg.finish id) ∆ 544 s = msg.src 187

∆ 545 d = msg.dst ∆ 546 rf updated = if msg = C !NOT MESSAGE then C !NOT FINISH 547 else [rf except 548 !.sent[s][d] = rf .sent[s][d] + 1, 549 !.transOrLive[s][d] = rf .transOrLive[s][d] + 1, 550 !.gc = rf .gc + 1] ∆ 551 msg tag = if s ∈ killed ∨ d ∈ killed 552 then “transitNotDone” 553 else “transitDone” 554 in ∧ msg 6= C !NOT MESSAGE 555 ∧ ¬C !IsRecoveringTasksToDeadPlaces(rf .id) 556 ∧ if s ∈ killed ∨ d ∈ killed 0 557 then rf set = rf set 0 558 else rf set = (rf set \{rf }) ∪ {rf updated} 559 ∧ C !ReplaceMsg(msg, [from 7→ “rf”, to 7→ “src”, tag 7→ msg tag, 560 dst 7→ s, 561 finish id 7→ msg.finish id, 562 task id 7→ msg.task id]) 0 563 ∧ step = step + 1 564 ∧ unchanged hexec state, tasks, f set, lf set, 565 nxt finish id, nxt task id, nxt remote place, 566 killed, killed cnt, 567 lost tasks, lost f set, lost lf set, 568 rec child, rec to, rec from, rec from waitingi

∆ 570 Store ReceivingTerminateTaskSignal = 571 ∧ exec state = “running” ∆ 572 ∧ let msg = C !FindMessageToActivePlaceWithTag(“rf”, “terminateTask”) ∆ 573 term tasks = msg.term tasks by src ∆ 574 dst = msg.term tasks dst ∆ 575 rf = if msg = C !NOT MESSAGE ∨ dst ∈ killed then C !NOT FINISH 576 else C !FindResilientFinishById(msg.finish id) ∆ 577 trans live updated = [i ∈ C !PlaceID 7→ [j ∈ C !PlaceID 7→ 578 if j = dst then rf .transOrLive[i][j ] − term tasks[i] 579 else rf .transOrLive[i][j ] 580 ]] ∆ 581 total = C !Sum(term tasks) ∆ 582 rf updated = if msg = C !NOT MESSAGE ∨ dst ∈ killed then C !NOT FINISH 583 else [rf except !.transOrLive = trans live updated, 584 !.gc = rf .gc − total] 585 in ∧ msg 6= C !NOT MESSAGE 586 ∧ total 6= − 1 see C !Sum() definition 587 ∧ if dst ∈/ killed 588 then ∧ ¬C !IsRecoveringTasksToDeadPlaces(rf .id) 188 TLA+ Specification of the Optimistic Finish Protocol

589 ∧ C !ApplyTerminateSignal(rf , rf updated, msg) 590 else C !RecvTerminateSignal(msg) 0 591 ∧ step = step + 1 592 ∧ unchanged hexec state, tasks, f set, lf set, 593 nxt finish id, nxt task id, nxt remote place, 594 killed, killed cnt, 595 lost tasks, lost f set, lost lf set, 596 rec child, rec to, rec from, rec from waitingi

∆ 598 Store ReceivingTerminateGhostSignal = 599 ∧ exec state = “running” ∆ 600 ∧ let msg = C !FindMessageToActivePlaceWithTag(“rf”, “terminateGhost”) ∆ 601 rf = if msg = C !NOT MESSAGE then C !NOT FINISH 602 else C !FindResilientFinishById(msg.finish id) ∆ 603 ghost child = msg.ghost finish id ∆ 604 rf updated = if msg = C !NOT MESSAGE then C !NOT FINISH 605 else [rf except !.ghost children = 606 rf .ghost children \{ghost child}] 607 in ∧ msg 6= C !NOT MESSAGE 608 ∧ ¬C !IsRecoveringTasksToDeadPlaces(rf .id) 609 ∧ C !ApplyTerminateSignal(rf , rf updated, msg) 0 610 ∧ step = step + 1 611 ∧ unchanged hexec state, tasks, f set, lf set, 612 nxt finish id, nxt task id, nxt remote place, 613 killed, killed cnt, 614 lost tasks, lost f set, lost lf set, 615 rec child, rec to, rec from, rec from waitingi

∆ 617 Store FindingGhostChildren = 618 ∧ exec state = “running” ∆ 619 ∧ let req = C !FindMarkGhostChildrenRequest ∆ 620 rf = if req = C !NOT REQUEST then C !NOT FINISH 621 else C !FindResilientFinishById(req.finish id) ∆ 622 ghosts = if req = C !NOT REQUEST then {} 623 else C !GetNonAdoptedGhostChildren(rf .id) ∆ 624 grf = C !ChooseGhost(ghosts) ∆ 625 grf updated = if grf = C !NOT FINISH then C !NOT FINISH 626 else [grf except !.isAdopted = true] ∆ 627 req updated = if req = C !NOT REQUEST then C !NOT REQUEST 628 else [req except !.markingDone = true] 629 in ∧ req 6= C !NOT REQUEST 630 ∧ rf 6= C !NOT FINISH 631 ∧ if ghosts = {} 0 632 then ∧ rf set = rf set 0 633 ∧ rec child = (rec child \{req}) ∪ {req updated} 189

0 634 else ∧ rf set = (rf set \{grf }) ∪ {grf updated} 0 635 ∧ rec child = rec child 0 636 ∧ step = step + 1 637 ∧ unchanged hexec state, tasks, f set, lf set, msgs, 638 nxt finish id, nxt task id, nxt remote place, 639 killed, killed cnt, 640 lost tasks, lost f set, lost lf set, 641 rec to, rec from, rec from waitingi

∆ 643 Store AddingGhostChildren = 644 ∧ exec state = “running” ∆ 645 ∧ let req = C !FindAddGhostChildrenRequest ∆ 646 rf = if req = C !NOT REQUEST then C !NOT FINISH 647 else C !FindResilientFinishById(req.finish id) ∆ 648 ghosts = C !GetAdoptedGhostChildren(rf .id) ∆ 649 rf updated = if req = C !NOT REQUEST then C !NOT FINISH 650 else [rf except !.ghost children = 651 rf .ghost children ∪ ghosts] 652 in ∧ req 6= C !NOT REQUEST 653 ∧ rf 6= C !NOT FINISH 0 654 ∧ rf set = (rf set \{rf }) ∪ {rf updated} 0 655 ∧ rec child = rec child \{req} 0 656 ∧ step = step + 1 657 ∧ unchanged hexec state, tasks, f set, lf set, msgs, 658 nxt finish id, nxt task id, nxt remote place, 659 killed, killed cnt, 660 lost tasks, lost f set, lost lf set, 661 rec to, rec from, rec from waitingi

∆ 663 Store CancellingTasksToDeadPlace = 664 ∧ exec state = “running” ∆ 665 ∧ let req = C !FindToDeadRequest ∆ 666 rf = if req = C !NOT REQUEST then C !NOT FINISH 667 else C !FindResilientFinishById(req.finish id) ∆ 668 rf updated = if req = C !NOT REQUEST then C !NOT FINISH 669 else [rf except !.transOrLive[req.from][req.to] = 0, 670 !.gc = rf .gc − rf .transOrLive[req.from][req.to]] 671 in ∧ req 6= C !NOT REQUEST 672 ∧ rf 6= C !NOT FINISH 673 ∧ C !ApplyTerminateSignal2(rf , rf updated) 0 674 ∧ rec to = rec to \{req} 0 675 ∧ step = step + 1 676 ∧ unchanged hexec state, tasks, f set, lf set, 677 nxt finish id, nxt task id, nxt remote place, 678 killed, killed cnt, 190 TLA+ Specification of the Optimistic Finish Protocol

679 lost tasks, lost f set, lost lf set, 680 rec child, rec from, rec from waitingi

∆ 682 Store SendingCountTransitSignalToLocalFinish = 683 ∧ exec state = “running” ∆ 684 ∧ let req = C !FindFromDeadRequest ∆ 685 rf = if req = C !NOT REQUEST then C !NOT FINISH 686 else if ¬C !ResilientFinishExists(req.finish id) 687 then C !NOT FINISH 688 else C !FindResilientFinishById(req.finish id) ∆ 689 msg = if req = C !NOT REQUEST then C !NOT MESSAGE 690 else [from 7→ “rf”, to 7→ “dst”, tag 7→ “countDropped”, 691 finish id 7→ rf .id, 692 src 7→ req.from, dst 7→ req.to, 693 num sent 7→ rf .sent[req.from][req.to]] 694 in ∧ req 6= C !NOT REQUEST 0 695 ∧ rec from = rec from \{req} 696 ∧ if rf 6= C !NOT FINISH 697 then ∧ C !SendMsg(msg) 0 698 ∧ rec from waiting = rec from waiting ∪ {req} 0 699 else ∧ msgs = msgs resilient finish has been released already 0 700 ∧ rec from waiting = rec from waiting 0 701 ∧ step = step + 1 702 ∧ unchanged hexec state, tasks, f set, lf set, rf set, 703 nxt finish id, nxt task id, nxt remote place, 704 killed, killed cnt, 705 lost tasks, lost f set, lost lf set, 706 rec child, rec toi

∆ 708 Store CancellingTransitTasksFromDeadPlace = 709 ∧ exec state = “running” ∆ 710 ∧ let msg = C !FindMessageToActivePlaceWithTag(“rf”, “countDroppedDone”) ∆ 711 from = msg.src ∆ 712 to = msg.dst ∆ 713 finish id = msg.finish id ∆ 714 req = if msg = C !NOT MESSAGE then C !NOT REQUEST 715 else C !FindFromDeadWaitingRequest(finish id, from, to) ∆ 716 rf = if msg = C !NOT MESSAGE then C !NOT FINISH 717 else if ¬C !ResilientFinishExists(req.finish id) then C !NOT FINISH 718 else C !FindResilientFinishById(finish id) ∆ 719 rf updated = if rf = C !NOT FINISH then C !NOT FINISH 720 else [rf except 721 !.transOrLive[from][to] = 722 rf .transOrLive[from][to] − msg.num dropped, 723 !.gc = rf .gc − msg.num dropped] 191

724 in ∧ msg 6= C !NOT MESSAGE 0 725 ∧ rec from waiting = rec from waiting \{req} 726 ∧ if msg.num dropped > 0 727 then C !ApplyTerminateSignal(rf , rf updated, msg) 728 else C !RecvCountDroppedResponse(msg) 0 729 ∧ step = step + 1 730 ∧ unchanged hexec state, tasks, f set, lf set, 731 nxt finish id, nxt task id, nxt remote place, 732 killed, killed cnt, 733 lost tasks, lost f set, lost lf set, 734 rec child, rec to, rec fromi

736 ∆ 737 KillingPlace = 738 ∧ exec state = “running” 739 ∧ killed cnt < MAX KILL ∆ 740 ∧ let victim = choose x ∈ (C !PlaceID \ killed ) : x 6= 0 ∆ 741 victim tasks = C !FindLostTasks(victim) ∆ 742 victim finishes = C !FindLostFinishes(victim) ∆ 743 victim local finishes = C !FindLostLocalFinishes(victim) ∆ 744 rf to = C !FindImpactedResilientFinishToDead(victim) ∆ 745 rf from = C !FindImpactedResilientFinishFromDead(victim) 746 in ∧ step ≥ KILL FROM 747 ∧ step < KILL TO 0 748 ∧ killed = killed ∪ {victim} 0 749 ∧ killed cnt = killed cnt + 1 0 750 ∧ lost tasks = lost tasks ∪ victim tasks 0 751 ∧ tasks = tasks \ victim tasks 0 752 ∧ lost f set = lost f set ∪ victim finishes 0 753 ∧ f set = f set \ victim finishes 0 754 ∧ lost lf set = lost lf set ∪ victim local finishes 0 755 ∧ lf set = lf set \ victim local finishes 0 756 ∧ rec child = rec child ∪ { 757 task ∈ C !GetChildrenTask : ∧ task.finish id ∈ rf to 758 ∧ task.victim = victim 759 ∧ task.markingDone = false 760 } 0 761 ∧ rec to = rec to ∪ { 762 task ∈ C !ConvTask : ∃ rf ∈ rf set : ∃ p ∈ C !PlaceID : 763 ∧ task.finish id = rf .id 764 ∧ task.finish id ∈ rf to 765 ∧ rf .transOrLive[p][victim] > 0 766 ∧ task.from = p 767 ∧ task.to = victim 192 TLA+ Specification of the Optimistic Finish Protocol

768 } 0 769 ∧ rec from = rec from ∪ { 770 task ∈ C !ConvTask : ∃ rf ∈ rf set : ∃ p ∈ C !PlaceID : 771 ∧ task.finish id = rf .id 772 ∧ task.finish id ∈ rf to 773 ∧ rf .transOrLive[victim][p] > 0 774 ∧ task.to = p 775 ∧ task.from = victim 776 } 0 777 ∧ step = step + 1 778 ∧ unchanged hexec state, rf set, msgs, 779 nxt finish id, nxt task id, nxt remote place, 780 rec from waitingi

∆ 783 Program Terminating = 784 ∧ exec state = “running” ∆ 785 ∧ let root task = choose task ∈ tasks : task.id = C !ROOT TASK ID ∆ 786 root task updated = [root task except !.status = “terminated”] 787 in ∧ root task.status = “running” root task unblocked 0 788 ∧ tasks = (tasks \{root task}) ∪ {root task updated} 0 789 ∧ exec state = “success” 0 790 ∧ step = step + 1 791 ∧ unchanged hf set, lf set, rf set, msgs, 792 nxt finish id, nxt task id, nxt remote place, 793 killed, killed cnt, 794 lost tasks, lost f set, lost lf set, 795 rec child, rec to, rec from, rec from waitingi

797 Possible next actions at each state ∆ 801 Next = 802 ∨ Task CreatingFinish 803 ∨ Finish CreatingRemoteTask 804 ∨ Finish TerminatingTask 805 ∨ Finish ReceivingPublishDoneSignal 806 ∨ Finish ReceivingReleaseSignal 807 ∨ LocalFinish CreatingRemoteTask 808 ∨ LocalFinish TerminatingTask 809 ∨ LocalFinish MarkingDeadPlace 810 ∨ SendingTask 811 ∨ DroppingTask 812 ∨ ReceivingTask 813 ∨ Store ReceivingPublishSignal 814 ∨ Store ReceivingTransitSignal 193

815 ∨ Store ReceivingTerminateTaskSignal 816 ∨ Store ReceivingTerminateGhostSignal 817 ∨ Store FindingGhostChildren 818 ∨ Store AddingGhostChildren 819 ∨ Store CancellingTasksToDeadPlace 820 ∨ Store SendingCountTransitSignalToLocalFinish 821 ∨ Store CancellingTransitTasksFromDeadPlace 822 ∨ KillingPlace 823 ∨ Program Terminating

825 We assume weak fairness on all actions (i.e. an action that remains forever enabled, must eventually be executed). ∆ 830 Liveness =

831 ∧ WFVars (Task CreatingFinish) 832 ∧ WFVars (Finish CreatingRemoteTask) 833 ∧ WFVars (Finish TerminatingTask) 834 ∧ WFVars (Finish ReceivingPublishDoneSignal) 835 ∧ WFVars (Finish ReceivingReleaseSignal) 836 ∧ WFVars (LocalFinish CreatingRemoteTask) 837 ∧ WFVars (LocalFinish TerminatingTask) 838 ∧ WFVars (LocalFinish MarkingDeadPlace) 839 ∧ WFVars (SendingTask) 840 ∧ WFVars (DroppingTask) 841 ∧ WFVars (ReceivingTask) 842 ∧ WFVars (Store ReceivingPublishSignal) 843 ∧ WFVars (Store ReceivingTransitSignal) 844 ∧ WFVars (Store ReceivingTerminateTaskSignal) 845 ∧ WFVars (Store ReceivingTerminateGhostSignal) 846 ∧ WFVars (Store FindingGhostChildren) 847 ∧ WFVars (Store AddingGhostChildren) 848 ∧ WFVars (Store CancellingTasksToDeadPlace) 849 ∧ WFVars (Store SendingCountTransitSignalToLocalFinish) 850 ∧ WFVars (Store CancellingTransitTasksFromDeadPlace) 851 ∧ WFVars (KillingPlace) 852 ∧ WFVars (Program Terminating)

854 Specification ∆ 858 Spec = Init ∧ 2[Next]Vars ∧ Liveness

theorem Spec =⇒ 2(TypeOK)

1 module OptimisticCommons 2 Constants and common utility actions 3 extends Integers

5 constants LEVEL, 6 WIDTH , 7 NUM PLACES, 8 MAX KILL

10 variables exec state, 11 tasks, 12 f set, 13 lf set, 14 rf set, 15 msgs, 16 nxt finish id, 17 nxt task id, 18 nxt remote place, 19 killed, 20 rec child, 21 rec to, 22 rec from, 23 rec from waiting

25

∆ 27 FIRST PLACE ID = 0 ∆ 28 PlaceID = FIRST PLACE ID .. (NUM PLACES − 1) ∆ 29 NOT PLACE ID = − 1

ROOT_FINISH_ID ≜ 0
MAX_FINISH_ID ≜ (1 − WIDTH^(LEVEL+1)) ÷ (1 − WIDTH)    \* the power series
NOT_FINISH_ID ≜ -1
FinishID ≜ ROOT_FINISH_ID .. MAX_FINISH_ID
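As a sanity check with illustrative values: for WIDTH = 2 and LEVEL = 2 the formula gives (1 − 2^3) ÷ (1 − 2) = 7, the geometric series 1 + 2 + 4, which bounds the number of finish identifiers needed for a task tree of width two and depth two, in addition to the reserved ROOT_FINISH_ID of 0.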

∆ 36 ROOT TASK ID = 0 ∆ 37 MAX TASK ID = MAX FINISH ID ∆ 38 NOT TASK ID = − 1 ∆ 39 TaskID = ROOT TASK ID .. MAX TASK ID

∆ 41 BranchID = 0 . . WIDTH ∆ 42 LevelID = 0 . . LEVEL

∆ 44 TASK STATUS = {“waitingForPublish”, “waitingForTransit”, “sent”, “dropped”, 45 “running”, “blocked”, “terminated”}

TASK_TYPE ≜ {"normal",            \* terminates at any time
             "finishMainTask"}    \* terminates after finish creates all its branches

∆ 51 FINISH STATUS = {“active”, “waitingForPublish”, “waitingForRelease”, “released”}

∆ 53 Task = [id : TaskID, 54 pred id : TaskID ∪ {NOT TASK ID}, predecessor task, used for debugging only 55 src : PlaceID, dst : PlaceID, 56 finish id : FinishID, 57 level : LevelID, 58 last branch : BranchID, 59 status : TASK STATUS, 60 type : TASK TYPE, 61 finish type : {“global”, “local”, “N/A”}]

∆ 63 RootTask = [id 7→ ROOT TASK ID, 64 pred id 7→ NOT TASK ID, 65 src 7→ FIRST PLACE ID, 66 dst 7→ FIRST PLACE ID, 67 finish id 7→ ROOT FINISH ID, 68 level 7→ 0, 69 last branch 7→ WIDTH , 70 status 7→ “blocked”, 71 type 7→ “normal”, 72 finish type 7→ “global”]

∆ 74 NOT TASK = [id 7→ NOT TASK ID, 75 src 7→ NOT PLACE ID, 76 dst 7→ NOT PLACE ID, 77 level 7→ − 1, 78 finish id 7→ NOT FINISH ID, 79 finish type 7→ “N/A”]

∆ 82 Place1D = [PlaceID → Nat] ∆ 83 Place2D = [PlaceID → [PlaceID → Nat]]

∆ 85 Place1DZeros = [i ∈ PlaceID 7→ 0] ∆ 86 Place2DZeros = [i ∈ PlaceID 7→ [j ∈ PlaceID 7→ 0]]

∆ 88 Place1DTerminateTask(src, cnt) = [i ∈ PlaceID 7→ 89 if i = src then cnt else 0]

∆ 91 Place2DInitResilientFinish(home) = [i ∈ PlaceID 7→ [j ∈ PlaceID 7→ 92 if i = home ∧ j = home then 1 else 0]]

Finish ≜ [id : FinishID \ {ROOT_FINISH_ID},
          pred_id : TaskID ∪ {NOT_TASK_ID},    \* predecessor task
          home : PlaceID,

97 origin : PlaceID, 98 parent finish id : FinishID, 99 status : FINISH STATUS, 100 lc : Nat]

∆ 102 RootFinish = [id 7→ ROOT FINISH ID + 1, 103 pred id 7→ RootTask.id, 104 home 7→ FIRST PLACE ID, 105 origin 7→ FIRST PLACE ID, 106 parent finish id 7→ ROOT FINISH ID, 107 status 7→ “active”, 108 lc 7→ 1]

∆ 110 RootFinishTask = [id 7→ ROOT TASK ID + 1, 111 pred id 7→ ROOT TASK ID, 112 dst 7→ FIRST PLACE ID, 113 src 7→ FIRST PLACE ID, 114 finish id 7→ RootFinish.id, 115 status 7→ “running”, 116 level 7→ 1, 117 last branch 7→ 0, 118 type 7→ “finishMainTask”, 119 finish type 7→ “global”]

∆ 121 NOT FINISH = [id 7→ NOT FINISH ID, 122 home 7→ NOT PLACE ID, 123 origin 7→ NOT PLACE ID, 124 parent finish id 7→ NOT FINISH ID, 125 status 7→ “”, 126 lc 7→ 0]

∆ 128 LFinish = [id : FinishID \{ROOT FINISH ID}, 129 home : PlaceID, 130 lc : Nat, 131 reported : Place1D, 132 received : Place1D, 133 deny : Place1D]

∆ 135 RFinish = [id : FinishID \{ROOT FINISH ID}, 136 home : PlaceID, 137 origin : PlaceID, 138 parent finish id : FinishID, 139 transOrLive : Place2D, 140 sent : Place2D, 141 gc : Nat, 142 ghost children : subset FinishID, 197

143 isAdopted : boolean ]

∆ 145 Message = [from : {“f”, “rf”, “src”, “dst”, “lf”}, 146 to : {“f”, “rf”, “src”, “dst”, “lf”}, 147 tag : {“transit”, “transitDone”, “transitNotDone”, 148 “terminateTask”, “terminateGhost”, 149 “task”, 150 “publish”, “publishDone”, 151 “release”, 152 “countDropped”, “countDroppedDone”}, 153 src : PlaceID, 154 dst : PlaceID, 155 finish id : FinishID, 156 ghost finish id : FinishID, 157 task id : TaskID, 158 term tasks by src : Place1D, termination only 159 term tasks dst : PlaceID, termination only 160 num sent : Nat, 161 num dropped : Nat 162 ]

∆ 164 NOT MESSAGE = [from 7→ “N/A”, to 7→ “N/A”, tag 7→ “N/A”, 165 src 7→ NOT PLACE ID, dst 7→ NOT PLACE ID, 166 finish id 7→ NOT FINISH ID, 167 task id 7→ NOT TASK ID, 168 ghost finish id 7→ NOT FINISH ID, 169 term tasks by src 7→ Place1DZeros, 170 term tasks dst 7→ NOT PLACE ID]

∆ 172 FindRunningTask(maxLevel) = ∆ 173 let tset = {task ∈ tasks : ∧ task.status = “running” 174 ∧ task.last branch < WIDTH 175 ∧ task.level ≤ maxLevel} 176 in if tset = {} then NOT TASK 177 else choose t ∈ tset : true

∆ 179 FindRunningTaskWithFinishType(maxLevel, fin type) = ∆ 180 let tset = {task ∈ tasks : ∧ task.status = “running” 181 ∧ task.last branch < WIDTH 182 ∧ task.level ≤ maxLevel 183 ∧ task.finish type = fin type} 184 in if tset = {} then NOT TASK 185 else choose t ∈ tset : true

FindFinishById(id) ≜
  choose f ∈ f_set : f.id = id

∆ 190 FindResilientFinishById(id) = 191 choose f ∈ rf set : f .id = id

∆ 193 FindTaskById(id) = 194 choose t ∈ tasks : t.id = id

∆ 196 ActiveFinishSet = {“active”, “waitingForRelease”}

∆ 198 FindActiveFinish(id, home) = ∆ 199 let fset = {finish ∈ f set : ∧ finish.status ∈ ActiveFinishSet 200 ∧ finish.id = id 201 ∧ ∨ ∧ home 6= NOT PLACE ID 202 ∧ finish.home = home 203 ∨ ∧ home = NOT PLACE ID} 204 in if fset = {} then NOT FINISH 205 else choose f ∈ fset : true

∆ 207 FindPendingRemoteTask(finish id, status) = ∆ 208 let tset = {task ∈ tasks : ∧ task.status = status 209 ∧ task.src 6= task.dst 210 ∧ task.finish type = “local” 211 ∧ task.finish id = finish id 212 } 213 in if tset = {} then NOT TASK 214 else choose t ∈ tset : true

∆ 216 IsPublished(finish id) = 217 ∃ rf ∈ rf set : ∧ rf .id = finish id 218 ∧ ∃ f ∈ f set : ∧ f .id = finish id 219 ∧ f .status ∈ ActiveFinishSet ∆ 220 LocalFinishExists(place, finish id) = 221 ∃ lf ∈ lf set : ∧ lf .id = finish id 222 ∧ lf .home = place

∆ 224 ResilientFinishExists(finish id) = 225 ∃ rf ∈ rf set : rf .id = finish id

∆ 227 FindLocalFinish(place, finish id) = 228 choose f ∈ lf set : f .home = place ∧ f .id = finish id

∆ 230 FindFinishToRelease(finish id) = 231 choose f ∈ f set : f .id = finish id ∧ f .status = “waitingForRelease” ∧ f .lc = 0

233 a task can terminate if cannot branch further - at last level or at last branch number ∆ 234 FindTaskToTerminate(fin type) = ∆ 235 let tset = {task ∈ tasks : ∧ task.status = “running” 236 ∧ task.finish type = fin type 237 ∧ ∨ task.level = LEVEL 199

238 ∨ task.last branch = WIDTH 239 ∧ if fin type = “global” 240 then FindActiveFinish(task.finish id, task.src) 241 6= NOT FINISH 242 else true 243 } 244 in if tset = {} then NOT TASK 245 else choose t ∈ tset : true

∆ 247 FindBlockedTask(task id) = ∆ 248 let tset = {task ∈ tasks : ∧ task.status = “blocked” 249 ∧ task.id = task id 250 } 251 in if tset = {} then NOT TASK 252 else choose t ∈ tset : true

∆ 254 SendMsg(m) = Add message to the msgs set 0 258 msgs = msgs ∪ {m}

∆ 260 RecvMsg(m) = Delete message from the msgs set 0 264 msgs = msgs \{m}

∆ 266 ReplaceMsg(toRemove, toAdd) = Remove an existing message and add another one 0 270 msgs = (msgs \{toRemove}) ∪ {toAdd}

∆ 272 FindMessageToActivePlaceWithTag(to, tag) = Return a message matching the given criteria, or NOT MESSAGE otherwise ∆ 276 let mset = {m ∈ msgs : ∧ m.to = to 277 ∧ m.tag = tag 278 ∧ if m.to = “rf” 279 then true 280 else m.dst ∈/ killed} 281 in if mset = {} then NOT MESSAGE 282 else (choose x ∈ mset : true)

Sum(place1D) ≜
  if NUM_PLACES = 0 then 0
  else if NUM_PLACES = 1 then place1D[0]
  else if NUM_PLACES = 2 then place1D[0] + place1D[1]
  else if NUM_PLACES = 3 then place1D[0] + place1D[1] + place1D[2]
  else if NUM_PLACES = 4 then place1D[0] + place1D[1] + place1D[2] + place1D[3]
  else if NUM_PLACES = 5 then place1D[0] + place1D[1] + place1D[2] + place1D[3] + place1D[4]
  else if NUM_PLACES = 6 then place1D[0] + place1D[1] + place1D[2] + place1D[3] + place1D[4] + place1D[5]
  else -1
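The unrolled definition above supports models with at most six places. A recursive operator would remove that limit; the following sketch (named SumAlt and SumUpTo here to avoid clashing with the definition above, and not part of the original module) computes the same sum for any NUM_PLACES:

    recursive SumUpTo(_, _)
    SumUpTo(f, n) ≜ if n = 0 then 0 else f[n - 1] + SumUpTo(f, n - 1)

    SumAlt(place1D) ≜ SumUpTo(place1D, NUM_PLACES)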

∆ 297 NextRemotePlace(here) = 298 if nxt remote place[here] = here 299 then (nxt remote place[here] + 1)%NUM PLACES 300 else nxt remote place[here]

∆ 302 ShiftNextRemotePlace(here) = 303 if nxt remote place[here] = here 0 304 then nxt remote place = [nxt remote place except ![here] = 305 (nxt remote place[here] + 2)%NUM PLACES ] 0 306 else nxt remote place = [nxt remote place except ![here] = 307 (nxt remote place[here] + 1)%NUM PLACES ] 308 ∆ 309 FindLostTasks(dead) = 310 { 311 task ∈ tasks : ∨ ∧ task.status ∈ {“waitingForPublish”, 312 “waitingForTransit”} 313 ∧ task.src = dead 314 ∨ ∧ task.status ∈ {“running”, “blocked”} 315 ∧ task.dst = dead 316 }

∆ 318 FindLostFinishes(dead) = 319 { 320 finish ∈ f set : ∧ finish.status 6= “released” 321 ∧ finish.home = dead 322 }

∆ 324 FindLostLocalFinishes(dead) = 325 { 326 local fin ∈ lf set : ∧ local fin.home = dead 327 }

∆ 329 FindImpactedResilientFinish(victim) = 330 { 331 id ∈ FinishID : ∃ rf ∈ rf set : ∧ rf .id = id 332 ∧ ∨ ∃ i ∈ PlaceID 333 : rf .transOrLive[i][victim] > 0 334 ∨ ∃ j ∈ PlaceID 335 : rf .transOrLive[victim][j ] > 0 336 } 201

∆ 338 FindImpactedResilientFinishToDead(victim) = 339 { 340 id ∈ FinishID : ∃ rf ∈ rf set : ∧ rf .id = id 341 ∧ ∃ i ∈ PlaceID 342 : rf .transOrLive[i][victim] > 0 343 }

∆ 345 FindImpactedResilientFinishFromDead(victim) = 346 { 347 id ∈ FinishID : ∃ rf ∈ rf set : ∧ rf .id = id 348 ∧ ∃ j ∈ PlaceID 349 : rf .transOrLive[victim][j ] > 0 350 }

∆ 352 GetSrc(rf ) = 353 choose i ∈ PlaceID : choose j ∈ killed : rf .transOrLive[i][j ] > 0

∆ 355 GetDst(rf , src) = 356 choose j ∈ killed : rf .transOrLive[src][j ] > 0

∆ 358 GetAdoptedGhostChildren(fin id) = 359 { 360 id ∈ FinishID : ∃ rf ∈ rf set : ∧ rf .id = id 361 ∧ rf .home ∈ killed 362 ∧ rf .parent finish id = fin id 363 ∧ rf .isAdopted = true 364 }

∆ 366 GetNonAdoptedGhostChildren(fin id) = 367 { 368 id ∈ FinishID : ∃ rf ∈ rf set : ∧ rf .id = id 369 ∧ rf .home ∈ killed 370 ∧ rf .parent finish id = fin id 371 ∧ rf .isAdopted = false 372 }

∆ 374 IsRecoveringTasksToDeadPlaces(fin id) = 375 ∨ ∃ task ∈ rec child : task.finish id = fin id 376 ∨ ∃ task ∈ rec to : task.finish id = fin id

∆ 378 ConvTask = [finish id : FinishID, from : PlaceID, to : PlaceID]

∆ 380 GetChildrenTask = [finish id : FinishID, victim : PlaceID, markingDone : boolean ]

∆ 382 NOT REQUEST = [finish id 7→ NOT FINISH ID]

ChildRequestExists(fin_id) ≜
  ∃ creq ∈ rec_child : fin_id = creq.finish_id

∆ 387 ToRequestExists(fin id) = 388 ∃ treq ∈ rec to : fin id = treq.finish id

∆ 390 FindMarkGhostChildrenRequest = ∆ 391 let rset = {r ∈ rec child : r.markingDone = false} 392 in if rset = {} then NOT REQUEST 393 else (choose x ∈ rset : true)

∆ 395 FindAddGhostChildrenRequest = ∆ 396 let rset = {r ∈ rec child : r.markingDone = true} 397 in if rset = {} then NOT REQUEST 398 else (choose x ∈ rset : true)

∆ 400 ChooseGhost(ghosts) = 401 if ghosts = {} then NOT FINISH else choose x ∈ rf set : x.id ∈ ghosts

∆ 403 FindToDeadRequest = 404 if rec to = {} then NOT REQUEST 405 else if ∃ a ∈ rec to : ¬ChildRequestExists(a.finish id) 406 then (choose b ∈ rec to : ¬ChildRequestExists(b.finish id)) 407 else NOT REQUEST

∆ 409 FindFromDeadRequest = 410 if rec from = {} then NOT REQUEST 411 else if ∃ a ∈ rec from : ∧ ¬ChildRequestExists(a.finish id) 412 ∧ ¬ToRequestExists(a.finish id) 413 then (choose b ∈ rec from : ∧ ¬ChildRequestExists(b.finish id) 414 ∧ ¬ToRequestExists(b.finish id)) 415 else NOT REQUEST

∆ 417 FindFromDeadWaitingRequest(fin id, from, to) = 418 choose x ∈ rec from waiting : ∧ x.finish id = fin id 419 ∧ x.from = from 420 ∧ x.to = to

∆ 422 ApplyTerminateSignal(rf , rf updated, msg) = 423 if rf updated.gc = 0 ∧ rf updated.ghost children = {} 424 then if rf .isAdopted 425 then ∧ ReplaceMsg(msg, [from 7→ “rf”, to 7→ “rf”, tag 7→ “terminateGhost”, 426 finish id 7→ rf .parent finish id, 427 ghost finish id 7→ rf .id, 428 dst 7→ NOT PLACE ID]) rf .id is enough 0 429 ∧ rf set = rf set \{rf } 430 else ∧ ReplaceMsg(msg, [from 7→ “rf”, to 7→ “f”, tag 7→ “release”, 431 finish id 7→ rf .id, 432 dst 7→ rf .home]) 0 433 ∧ rf set = rf set \{rf } 203

434 else ∧ RecvMsg(msg) 0 435 ∧ rf set = (rf set \{rf }) ∪ {rf updated}

∆ 437 ApplyTerminateSignal2(rf , rf updated) = 438 if rf updated.gc = 0 ∧ rf updated.ghost children = {} 439 then if rf .isAdopted 440 then ∧ SendMsg([from 7→ “rf”, to 7→ “rf”, tag 7→ “terminateGhost”, 441 finish id 7→ rf .parent finish id, 442 ghost finish id 7→ rf .id, 443 dst 7→ NOT PLACE ID]) rf .id is enough 0 444 ∧ rf set = rf set \{rf } 445 else ∧ SendMsg([from 7→ “rf”, to 7→ “f”, tag 7→ “release”, 446 finish id 7→ rf .id, 447 dst 7→ rf .home]) 0 448 ∧ rf set = rf set \{rf } 0 449 else ∧ msgs = msgs 0 450 ∧ rf set = (rf set \{rf }) ∪ {rf updated}

∆ 452 RecvTerminateSignal(msg) = 453 ∧ RecvMsg(msg) 0 454 ∧ rf set = rf set

RecvCountDroppedResponse(msg) ≜
  ∧ RecvMsg(msg)
  ∧ rf_set' = rf_set

Appendix C

TLA+ Specification of the Distributed Finish Replication Protocol

1 module AsyncFinishReplication 2 extends Integers

4 constants CLIENT NUM , the number of clients 5 MAX KILL maximum allowed kill events

7 variables exec state, the execution state of the program: running, success, or fatal 8 clients, clients sending value update requests to master and backup 9 master, array of master instances, only one is active 10 backup, array of backup instances, only one is active 11 msgs, in-flight messages 12 killed number of invoked kill actions to master or backup

Vars ≜ ⟨exec_state, clients, master, backup, msgs, killed⟩

C ≜ instance Commons

\* Variables type constraints
TypeOK ≜

23 ∧ clients ∈ [C !CLIENT ID → C !Client] 24 ∧ master ∈ [C !INSTANCE ID → C !Master] 25 ∧ backup ∈ [C !INSTANCE ID → C !Backup] 26 ∧ exec state ∈ {“running”, “success”, “fatal”} 27 ∧ msgs ⊆ C !Messages 28 ∧ killed ∈ 0 . . MAX KILL

30


∆ 31 MaxOneActiveMaster = Return true if maximum one active master exists, and false otherwise ∆ 35 let activeM = C !FindMaster(C !INST STATUS ACTIVE ) ∆ 36 otherIds = C !INSTANCE ID \{activeM .id} 37 in if activeM = C !NOT MASTER 38 then true zero active masters ∆ 39 else let otherActiveMs = {m ∈ otherIds : 40 master[m].status = C !INST STATUS ACTIVE } 41 in if otherActiveMs = {} then true no other active masters 42 else false other active masters exist

∆ 44 MaxOneActiveBackup = Return true if maximum one active backup exists, and false otherwise ∆ 48 let activeB = C !FindBackup(C !INST STATUS ACTIVE ) ∆ 49 otherIds = C !INSTANCE ID \{activeB.id} 50 in if activeB = C !NOT BACKUP 51 then true zero active backups ∆ 52 else let otherActiveBs = {b ∈ otherIds : 53 backup[b].status = C !INST STATUS ACTIVE } 54 in if otherActiveBs = {} then true no other active backups 55 else false other active backup exist

∆ 57 StateOK = State invariants 1. on successful termination: the final version equals CLIENT NUM 2. on fatal termination: there must be a client whose master is lost and whose backup is lost or is unknown 3. before termination: a) master version ≥ backup version b) master and backup version should not exceed CLIENT NUM c) maximum one active master and maximum one active backup ∆ 68 let curMaster = C !LastKnownMaster ∆ 69 curBackup = C !LastKnownBackup 70 in if exec state = “success” 71 then ∧ curMaster.version = CLIENT NUM 72 ∧ curBackup.version = CLIENT NUM 73 else if exec state = “fatal” 74 then ∃ c ∈ C !CLIENT ID : 75 ∧ clients[c].phase = C !PH 2 COMPLETED FATAL 76 ∧ master[clients[c].masterId].status = C !INST STATUS LOST 77 ∧ if clients[c].backupId 6= C !UNKNOWN ID 78 then backup[clients[c].backupId].status = C !INST STATUS LOST 79 else true 80 else ∧ curMaster.version ≥ curBackup.version 81 ∧ curMaster.version ≤ CLIENT NUM 82 ∧ curBackup.version ≤ CLIENT NUM 207

83 ∧ MaxOneActiveMaster 84 ∧ MaxOneActiveBackup

\* Temporal property: the program must eventually terminate either successfully or fatally
MustTerminate ≜ ◇(exec_state ∈ {"success", "fatal"})

\* Initialize variables
Init ≜

98 ∧ exec state = “running” 99 ∧ clients = [i ∈ C !CLIENT ID 7→ [id 7→ i, phase 7→ C !PH 1 PENDING, 100 value 7→ i, masterId 7→ C !FIRST ID, 101 backupId 7→ C !UNKNOWN ID]] 102 ∧ backup = [i ∈ C !INSTANCE ID 7→ 103 if i = C !FIRST ID 104 then [id 7→ C !FIRST ID, masterId 7→ C !FIRST ID, 105 status 7→ C !INST STATUS ACTIVE, 106 value 7→ 0, version 7→ 0] 107 else [id 7→ i, masterId 7→ C !UNKNOWN ID, 108 status 7→ C !INST STATUS NULL, 109 value 7→ 0, version 7→ 0]] 110 ∧ master = [i ∈ C !INSTANCE ID 7→ 111 if i = C !FIRST ID 112 then [id 7→ C !FIRST ID, backupId 7→ C !FIRST ID, 113 status 7→ C !INST STATUS ACTIVE, 114 value 7→ 0, version 7→ 0] 115 else [id 7→ i, backupId 7→ C !UNKNOWN ID, 116 status 7→ C !INST STATUS NULL, 117 value 7→ 0, version 7→ 0]] 118 ∧ msgs = {} 119 ∧ killed = 0

121 ∆ 122 E KillingMaster = Kill the active master instance.

126 ∧ exec state = “running” 127 ∧ killed < MAX KILL ∆ 128 ∧ let activeM = C !FindMaster(C !INST STATUS ACTIVE ) 129 in ∧ activeM 6= C !NOT MASTER 0 130 ∧ master = [master except ![activeM .id].status = C !INST STATUS LOST ] 0 131 ∧ killed = killed + 1 132 ∧ unchanged hexec state, clients, backup, msgsi

\* Kill the active backup instance
E_KillingBackup ≜

138 ∧ exec state = “running” 139 ∧ killed < MAX KILL ∆ 140 ∧ let activeB = C !FindBackup(C !INST STATUS ACTIVE ) 141 in ∧ activeB 6= C !NOT BACKUP 0 142 ∧ backup = [backup except ![activeB.id].status = C !INST STATUS LOST ] 0 143 ∧ killed = killed + 1 144 ∧ unchanged hexec state, clients, master, msgsi

∆ 146 C Starting = Client start the replication process by sending “do” to master

150 ∧ exec state = “running” ∆ 151 ∧ let client = C !FindClient(C !PH 1 PENDING) 152 in ∧ client 6= C !NOT CLIENT 153 ∧ C !SendMsg([from 7→ “c”, 154 to 7→ “m”, 155 clientId 7→ client.id, 156 masterId 7→ client.masterId, 157 backupId 7→ C !UNKNOWN ID, 158 value 7→ client.value, 159 tag 7→ “masterDo”]) 0 160 ∧ clients = [clients except ![client.id].phase = C !PH 2 WORKING] 161 ∧ unchanged hexec state, master, backup, killedi

∆ 163 M Doing = Master receiving “do”, updating value, and sending “done”

167 ∧ exec state = “running” ∆ 168 ∧ let msg = C !FindMessageToWithTag(“m”, C !INST STATUS ACTIVE, “masterDo”) 169 in ∧ msg 6= C !NOT MESSAGE 0 170 ∧ master = [master except ![msg.masterId].value = 171 master[msg.masterId].value + msg.value, 172 ![msg.masterId].version = 173 master[msg.masterId].version + 1] 174 ∧ C !ReplaceMsg(msg, [from 7→ “m”, 175 to 7→ “c”, 176 clientId 7→ msg.clientId, 177 masterId 7→ msg.masterId, 178 backupId 7→ master[msg.masterId].backupId, 179 value 7→ 0, 180 tag 7→ “masterDone”]) 181 ∧ unchanged hexec state, clients, backup, killedi

\* Client receiving "done" from master, and forwarding action to backup
C_HandlingMasterDone ≜

187 ∧ exec state = “running” ∆ 188 ∧ let msg = C !FindMessageToClient(“m”, “masterDone”) 189 in ∧ msg 6= C !NOT MESSAGE 190 ∧ C !ReplaceMsg(msg, [from 7→ “c”, 191 to 7→ “b”, 192 clientId 7→ msg.clientId, 193 masterId 7→ msg.masterId, 194 backupId 7→ msg.backupId, 195 value 7→ clients[msg.clientId].value, 196 tag 7→ “backupDo”]) 197 update our knowledge about the backup identity 0 198 ∧ clients = [clients except ![msg.clientId].backupId = msg.backupId] 199 ∧ unchanged hexec state, master, backup, killedi

∆ 201 B Doing = Backup receiving “do”, updating value, then sending “done”

205 ∧ exec state = “running” ∆ 206 ∧ let msg = C !FindMessageToWithTag(“b”, C !INST STATUS ACTIVE, “backupDo”) 207 in ∧ msg 6= C !NOT MESSAGE 208 Master info is consistent between client and backup 209 ∧ msg.masterId = backup[msg.backupId].masterId 0 210 ∧ backup = [backup except ![msg.backupId].value = 211 backup[msg.backupId].value + msg.value, 212 ![msg.backupId].version = 213 backup[msg.backupId].version + 1] 214 ∧ C !ReplaceMsg(msg, [from 7→ “b”, 215 to 7→ “c”, 216 clientId 7→ msg.clientId, 217 masterId 7→ msg.masterId, 218 backupId 7→ msg.backupId, 219 value 7→ 0, 220 tag 7→ “backupDone”]) 221 ∧ unchanged hexec state, clients, master, killedi

∆ 223 B DetectingOldMasterId = Backup receiving “do” and detecting that the client is using an old master id. It does not update the value, and it sends the new master id to the client 229 ∧ exec state = “running” ∆ 230 ∧ let msg = C !FindMessageToWithTag(“b”, C !INST STATUS ACTIVE, “backupDo”) 231 in ∧ msg 6= C !NOT MESSAGE 232 Master has changed, client must restart 233 ∧ msg.masterId 6= backup[msg.backupId].masterId 234 ∧ C !ReplaceMsg(msg, [from 7→ “b”, 235 to 7→ “c”, 236 clientId 7→ msg.clientId, 210 TLA+ Specification of the Distributed Finish Replication Protocol

237 masterId 7→ backup[msg.backupId].masterId, 238 backupId 7→ msg.backupId, 239 value 7→ 0, 240 tag 7→ “newMasterId”]) 241 ∧ unchanged hexec state, clients, master, backup, killedi

∆ 243 C HandlingBackupDone = Client receiving “done” from backup. Replication completed

247 ∧ exec state = “running” ∆ 248 ∧ let msg = C !FindMessageToClient(“b”, “backupDone”) 249 in ∧ msg 6= C !NOT MESSAGE 250 ∧ C !RecvMsg(msg) 0 251 ∧ clients = [clients except ![msg.clientId].phase = C !PH 2 COMPLETED] 252 if all clients completed, then terminate the execution successfully 0 253 ∧ if ∀ c ∈ C !CLIENT ID : clients [c].phase = C !PH 2 COMPLETED 0 254 then exec state = “success” 0 255 else exec state = exec state 256 ∧ unchanged hmaster, backup, killedi

258 ∆ 259 C HandlingMasterDoFailed = Client received the system’s notification of a dead master, and is requesting the backup to return the new master info 264 ∧ exec state = “running” ∆ 265 ∧ let msg = C !FindMessageToWithTag(“m”, C !INST STATUS LOST , “masterDo”) ∆ 266 knownBackup = if msg 6= C !NOT MESSAGE 267 then C !FindBackup(C !INST STATUS ACTIVE ) 268 else C !NOT BACKUP 269 in ∧ msg 6= C !NOT MESSAGE 270 ∧ if knownBackup = C !NOT BACKUP 271 then ∧ C !RecvMsg(msg) 0 272 ∧ exec state = “fatal” 0 273 ∧ clients = [clients except ![msg.clientId].phase = 274 C !PH 2 COMPLETED FATAL] 275 else ∧ C !ReplaceMsg(msg, [from 7→ “c”, 276 to 7→ “b”, 277 clientId 7→ msg.clientId, 278 send the client’s master knowledge, 279 to force the backup to not respond 280 until re-replication 281 masterId 7→ clients[msg.clientId].masterId, 282 backupId 7→ knownBackup.id, 283 value 7→ 0, 284 tag 7→ “backupGetNewMaster”]) 0 285 ∧ exec state = exec state 211

0 286 ∧ clients = clients 287 ∧ unchanged hmaster, backup, killedi

∆ 289 C HandlingBackupDoFailed = Client received the system’s notification of a dead backup, and is requesting the master to return the new backup info 294 ∧ exec state = “running” ∆ 295 ∧ let msg = C !FindMessageToWithTag(“b”, C !INST STATUS LOST , “backupDo”) 296 in ∧ msg 6= C !NOT MESSAGE 297 ∧ C !ReplaceMsg(msg, [from 7→ “c”, 298 to 7→ “m”, 299 clientId 7→ msg.clientId, 300 masterId 7→ clients[msg.clientId].masterId, 301 send the client’s backup knowledge, 302 to force the master to not respond 303 until rereplication 304 backupId 7→ clients[msg.clientId].backupId, 305 value 7→ 0, 306 tag 7→ “masterGetNewBackup”]) 307 ∧ unchanged hexec state, clients, master, backup, killedi

309 ∆ 310 M GettingNewBackup = Master responding to client with updated backup identity

314 ∧ exec state = “running” ∆ 315 ∧ let msg = C !FindMessageToWithTag(“m”, 316 C !INST STATUS ACTIVE, 317 “masterGetNewBackup”) 318 in ∧ msg 6= C !NOT MESSAGE 319 master must not respond until it recovers the dead backup 320 ∧ msg.backupId 6= master[msg.masterId].backupId 321 ∧ C !ReplaceMsg(msg, [from 7→ “m”, 322 to 7→ “c”, 323 clientId 7→ msg.clientId, 324 masterId 7→ msg.masterId, 325 backupId 7→ master[msg.masterId].backupId, 326 value 7→ 0, 327 tag 7→ “newBackupId”]) 328 ∧ unchanged hexec state, clients, master, backup, killedi

∆ 330 B GettingNewMaster = Backup responding to client with updated master identity

334 ∧ exec state = “running” ∆ 335 ∧ let msg = C !FindMessageToWithTag(“b”, 336 C !INST STATUS ACTIVE, 212 TLA+ Specification of the Distributed Finish Replication Protocol

337 “backupGetNewMaster”) 338 in ∧ msg 6= C !NOT MESSAGE 339 backup must not respond until it recovers the dead master 340 ∧ msg.masterId 6= backup[msg.backupId].masterId 341 ∧ C !ReplaceMsg(msg, [from 7→ “b”, 342 to 7→ “c”, 343 clientId 7→ msg.clientId, 344 masterId 7→ backup[msg.backupId].masterId, 345 backupId 7→ msg.backupId, 346 value 7→ 0, 347 tag 7→ “newMasterId”]) 348 ∧ unchanged hexec state, clients, master, backup, killedi

350 ∆ 351 C HandlingBackupGetNewMasterFailed = The client handling the failure of the backup, when the client asked the backup to return the new master identity. The client mannually searches for the master. If manual search does not find a master, a fatal error occurs. Otherwise, the client updates it’s masterId and eventually restarts. Restarting is safe because this action is reached only if “masterDo” fails 361 ∧ exec state = “running” ∆ 362 ∧ let msg = C !FindMessageToWithTag(“b”, 363 C !INST STATUS LOST , 364 “backupGetNewMaster”) ∆ 365 foundMaster = C !FindMaster(C !INST STATUS ACTIVE ) 366 in ∧ msg 6= C !NOT MESSAGE 367 ∧ C !RecvMsg(msg) 368 ∧ if foundMaster = C !NOT MASTER no live master found 0 369 then ∧ exec state = “fatal” 0 370 ∧ clients = [clients except ![msg.clientId].phase = 371 C !PH 2 COMPLETED FATAL] 0 372 else ∧ exec state = exec state 373 at this point, the live master must have been changed 374 ∧ foundMaster.id 6= clients[msg.clientId].masterId 375 change status to pending to be eligible for restart 0 376 ∧ clients = [clients except ![msg.clientId].masterId = 377 foundMaster.id, 378 ![msg.clientId].phase = 379 C !PH 1 PENDING] 380 ∧ unchanged hmaster, backup, killedi

∆ 382 C HandlingMasterGetNewBackupFailed = The client handling the failure of the master when the client asked the master to return the new backup identity. The failure of the master is fatal. If a recovered master exists we should not search for it, because it may have the old version before masterDone. 389 ∧ exec state = “running” ∆ 390 ∧ let msg = C !FindMessageToWithTag(“m”, 213

391 C !INST STATUS LOST , 392 “masterGetNewBackup”) 393 in ∧ msg 6= C !NOT MESSAGE 0 394 ∧ exec state = “fatal” 0 395 ∧ clients = [clients except ![msg.clientId].phase = 396 C !PH 2 COMPLETED FATAL] 397 ∧ C !RecvMsg(msg) 398 ∧ unchanged hmaster, backup, killedi

400 ∆ 401 C UpdatingBackupId = 402 ∧ exec state = “running” ∆ 403 ∧ let msg = C !FindMessageToClient(“m”, “newBackupId”) 404 in ∧ msg 6= C !NOT MESSAGE receive new backup identity, and complete request, 405 don’t restart, master is alive and up to date 406 ∧ C !RecvMsg(msg) 0 407 ∧ clients = [clients except ![msg.clientId].backupId = msg.backupId, 408 ![msg.clientId].phase = C !PH 2 COMPLETED] 409 if all clients completed, then terminate the execution successfully 0 410 ∧ if ∀ c ∈ C !CLIENT ID : clients [c].phase = C !PH 2 COMPLETED 0 411 then exec state = “success” 0 412 else exec state = exec state 413 ∧ unchanged hmaster, backup, killedi

∆ 415 C UpdatingMasterId = Client receiving a new master identify from a live backup and is preparing to restart by changing its phase to pending 420 ∧ exec state = “running” ∆ 421 ∧ let msg = C !FindMessageToClient(“b”, “newMasterId”) 422 in ∧ msg 6= C !NOT MESSAGE 423 ∧ C !RecvMsg(msg) 0 424 ∧ clients = [clients except ![msg.clientId].masterId = msg.masterId, 425 ![msg.clientId].phase = C !PH 1 PENDING] 426 ∧ unchanged hexec state, master, backup, killedi 427 ∆ 428 M CreatingNewBackup = Master creating a new backup using its own exec state. Master does not process any client requests during recovery 433 ∧ exec state = “running” ∆ 434 ∧ let activeM = C !FindMaster(C !INST STATUS ACTIVE ) ∆ 435 activeB = C !FindBackup(C !INST STATUS ACTIVE ) ∆ 436 lostB = C !LastLostBackup 437 in ∧ activeM 6= C !NOT MASTER active master exists 438 ∧ activeB = C !NOT BACKUP active backup does not exist 439 ∧ lostB 6= C !NOT BACKUP a lost backup exists 214 TLA+ Specification of the Distributed Finish Replication Protocol

∆ 440 ∧ let newBackupId = lostB.id + 1 441 new backup id is the following id of the dead backup 442 in ∧ newBackupId ≤ C !MAX INSTANCE ID 0 443 ∧ backup = [backup except 444 ![newBackupId].status = C !INST STATUS ACTIVE, 445 ![newBackupId].masterId = activeM .id, 446 ![newBackupId].value = activeM .value, 447 ![newBackupId].version = activeM .version] 0 448 ∧ master = [master except 449 ![activeM .id].backupId = newBackupId ] 450 ∧ unchanged hexec state, clients, msgs, killedi

∆ 452 B CreatingNewMaster = Backup creating a new master using its own exec state. Backup does not process any client requests during recovery 457 ∧ exec state = “running” ∆ 458 ∧ let activeM = C !FindMaster(C !INST STATUS ACTIVE ) ∆ 459 activeB = C !FindBackup(C !INST STATUS ACTIVE ) ∆ 460 lostM = C !LastLostMaster 461 in ∧ activeM = C !NOT MASTER active master does not exist 462 ∧ activeB 6= C !NOT BACKUP active backup exists 463 ∧ lostM 6= C !NOT MASTER a lost master exists ∆ 464 ∧ let newMasterId = lostM .id + 1 465 in ∧ newMasterId ≤ C !MAX INSTANCE ID 0 466 ∧ master = [master except 467 ![newMasterId].status = C !INST STATUS ACTIVE, 468 ![newMasterId].backupId = activeB.id, 469 ![newMasterId].value = activeB.value, 470 ![newMasterId].version = activeB.version] 0 471 ∧ backup = [backup except 472 ![activeB.id].masterId = newMasterId ] 473 ∧ unchanged hexec state, clients, msgs, killedi

Next ≜
  ∨ E_KillingMaster
  ∨ E_KillingBackup
  ∨ C_Starting
  ∨ M_Doing
  ∨ C_HandlingMasterDone
  ∨ B_Doing
  ∨ B_DetectingOldMasterId
  ∨ C_HandlingBackupDone
  ∨ C_HandlingMasterDoFailed
  ∨ C_HandlingBackupDoFailed
  ∨ M_GettingNewBackup
  ∨ B_GettingNewMaster
  ∨ C_HandlingBackupGetNewMasterFailed
  ∨ C_HandlingMasterGetNewBackupFailed
  ∨ C_UpdatingBackupId
  ∨ C_UpdatingMasterId
  ∨ M_CreatingNewBackup
  ∨ B_CreatingNewMaster

Liveness ≜
  ∧ WF_Vars(E_KillingMaster)
  ∧ WF_Vars(E_KillingBackup)
  ∧ WF_Vars(C_Starting)
  ∧ WF_Vars(M_Doing)
  ∧ WF_Vars(C_HandlingMasterDone)
  ∧ WF_Vars(B_Doing)
  ∧ WF_Vars(B_DetectingOldMasterId)
  ∧ WF_Vars(C_HandlingBackupDone)
  ∧ WF_Vars(C_HandlingMasterDoFailed)
  ∧ WF_Vars(C_HandlingBackupDoFailed)
  ∧ WF_Vars(M_GettingNewBackup)
  ∧ WF_Vars(B_GettingNewMaster)
  ∧ WF_Vars(C_HandlingBackupGetNewMasterFailed)
  ∧ WF_Vars(C_HandlingMasterGetNewBackupFailed)
  ∧ WF_Vars(C_UpdatingBackupId)
  ∧ WF_Vars(C_UpdatingMasterId)
  ∧ WF_Vars(M_CreatingNewBackup)
  ∧ WF_Vars(B_CreatingNewMaster)

\* Specification
Spec ≜ Init ∧ □[Next]_Vars ∧ Liveness

theorem Spec ⇒ □(TypeOK ∧ StateOK)
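As with the optimistic finish specification, this module can be checked with TLC. The configuration below is only an illustrative sketch and the constant values are hypothetical:

    \* AsyncFinishReplication.cfg (illustrative sketch only)
    SPECIFICATION Spec
    CONSTANTS
        CLIENT_NUM = 2    \* hypothetical number of clients
        MAX_KILL = 1      \* hypothetical bound on injected kill events
    INVARIANTS TypeOK StateOK
    PROPERTY MustTerminate

Because Liveness asserts weak fairness on every action, checking MustTerminate verifies that the model always eventually reaches the "success" or "fatal" state, even when a master or backup instance is killed.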


1 module Commons 2 extends Integers

4 constants CLIENT NUM , the number of clients 5 MAX KILL maximum allowed kill events

7 variables exec state, the execution state of the program: running, success, or fatal 8 clients, clients sending value update requests to master and backup 9 master, array of master instances, only one is active 10 backup, array of backup instances, only one is active 11 msgs, in-flight messages 12 killed number of invoked kill actions to master or backup 13

15 Identifiers related to master and backup instance ids ∆ 16 FIRST ID = 1 ∆ 17 MAX INSTANCE ID = MAX KILL + 1 ∆ 18 INSTANCE ID = FIRST ID .. MAX INSTANCE ID ∆ 19 UNKNOWN ID = 0 ∆ 20 NOT INSTANCE ID = − 1

22 Identifiers related to master and backup instance statuses ∆ 23 INST STATUS NULL = “null” null, not used yet ∆ 24 INST STATUS ACTIVE = “active” active and handling client requests ∆ 25 INST STATUS LOST = “lost” lost ∆ 26 NOT STATUS = “invalid” invalid status ∆ 27 INSTANCE STATUS = {INST STATUS NULL, 28 INST STATUS ACTIVE, 29 INST STATUS LOST }

31 Master instance record structure ∆ 32 Master = [id : INSTANCE ID, backupId : INSTANCE ID ∪ {UNKNOWN ID}, 33 status : INSTANCE STATUS, value : Nat, version : Nat]

35 Invalid master instance ∆ 36 NOT MASTER = [id 7→ NOT INSTANCE ID, backupId 7→ NOT INSTANCE ID, 37 status 7→ NOT STATUS, value 7→ − 1, version 7→ − 1]

39 Backup instance record structure ∆ 40 Backup = [id : INSTANCE ID, masterId : INSTANCE ID ∪ {UNKNOWN ID}, 41 status : INSTANCE STATUS, value : Nat, version : Nat]

43 Invalid backup instance ∆ 44 NOT BACKUP = [id 7→ NOT INSTANCE ID, masterId 7→ NOT INSTANCE ID, 45 status 7→ NOT STATUS, value 7→ − 1, version 7→ − 1]

\* Return the lost master, or NOT_MASTER if master is alive
LastLostMaster ≜

∆ 51 let mset = {m ∈ INSTANCE ID : master[m].status = INST STATUS LOST } 52 in if mset = {} then NOT MASTER 53 else master[(choose n ∈ mset : ∀ m ∈ mset : n ≥ m)]

∆ 55 FindMaster(mStatus) = Return the master with given status or NOT MASTER otherwise ∆ 59 let mset = {m ∈ INSTANCE ID : master[m].status = mStatus} 60 in if mset = {} then NOT MASTER 61 else master[(choose x ∈ mset : true)]

∆ 63 LastKnownMaster = Return the last known master, whether active or lost ∆ 67 let mset = {m ∈ INSTANCE ID : master[m].status 6= INST STATUS NULL} 68 in master[(choose n ∈ mset : ∀ m ∈ mset : n ≥ m)]

∆ 70 FindBackup(bStatus) = Return the backup with given status or NOT BACKUP otherwise ∆ 74 let bset = {b ∈ INSTANCE ID : backup[b].status = bStatus} 75 in if bset = {} then NOT BACKUP 76 else backup[(choose x ∈ bset : true)]

∆ 78 LastLostBackup = Return the lost backup, or NOT BACKUP if backup is alive ∆ 82 let bset = {b ∈ INSTANCE ID : backup[b].status = INST STATUS LOST } 83 in if bset = {} then NOT BACKUP 84 else backup[(choose n ∈ bset : ∀ m ∈ bset : n ≥ m)]

∆ 86 LastKnownBackup = Return the last known backup, whether active or lost ∆ 90 let bset = {b ∈ INSTANCE ID : backup[b].status 6= INST STATUS NULL} 91 in backup[(choose n ∈ bset : ∀ m ∈ bset : n ≥ m)]

93 94 Identifiers related to client ids and phases ∆ 95 CLIENT ID = 1 . . CLIENT NUM ∆ 96 NOT CLIENT ID = − 1

98 client phases ∆ 99 CLIENT PHASE = 1 . . 4 ∆ 100 PH 1 PENDING = 1 ∆ 101 PH 2 WORKING = 2 ∆ 102 PH 2 COMPLETED = 3 ∆ 103 PH 2 COMPLETED FATAL = 4 ∆ 104 NOT CLIENT PHASE = − 1

\* Client record structure
Client ≜ [id : CLIENT_ID, phase : CLIENT_PHASE, value : Nat,

108 the master instance last communicated with 109 masterId : INSTANCE ID, 110 the backup instance last communicated with, initially unknown 111 backupId : INSTANCE ID ∪ {UNKNOWN ID} 112 ]

114 Invalid client instance ∆ 115 NOT CLIENT = [id 7→ NOT CLIENT ID, phase 7→ NOT CLIENT PHASE, value 7→ 0]

∆ 117 FindClient(phase) = Return a client matching the given phase, or NOT CLIENT otherwise ∆ 121 let cset = {c ∈ CLIENT ID : clients[c].phase = phase} 122 in if cset = {} then NOT CLIENT 123 else clients[(choose x ∈ cset : true)]

125 126 Message record structure ∆ 127 Messages = [from : {“c”, “m”, “b”, “sys”}, to : {“c”, “m”, “b”}, 128 clientId : CLIENT ID, 129 masterId : INSTANCE ID ∪ {UNKNOWN ID}, 130 backupId : INSTANCE ID ∪ {UNKNOWN ID}, 131 value : Nat, 132 tag : {“masterDo”, “masterDone”, 133 “backupDo”, “backupDone”, 134 “masterGetNewBackup”, “newBackupId”, 135 “backupGetNewMaster”, “newMasterId” 136 }]

138 Invalid message instance ∆ 139 NOT MESSAGE = [from 7→ “na”, to 7→ “na”]

∆ 141 SendMsg(m) = Add message to the msgs set 0 145 msgs = msgs ∪ {m}

∆ 147 RecvMsg(m) = Delete message from the msgs set 0 151 msgs = msgs \{m}

∆ 153 ReplaceMsg(toRemove, toAdd) = Remove an existing message and add another one 0 157 msgs = (msgs \{toRemove}) ∪ {toAdd}

∆ 159 FindMessageToWithTag(to, status, optionalTag) = Return a message matching the given criteria, or NOT MESSAGE otherwise ∆ 163 let mset = {m ∈ msgs : ∧ m.to = to 164 ∧ if to = “m” 219

165 then master[m.masterId].status = status 166 else if to = “b” 167 then backup[m.backupId].status = status 168 else false 169 ∧ if optionalTag = “NA” 170 then true 171 else m.tag = optionalTag} 172 in if mset = {} then NOT MESSAGE 173 else (choose x ∈ mset : true)

∆ 175 FindMessageTo(to, status) = FindMessageToWithTag(to, status, “NA”)

∆ 177 FindMessageToClient(from, tag) = Return a message sent to client matching given criteria, or NOT MESSAGE otherwise ∆ 182 let mset = {m ∈ msgs : ∧ m.from = from 183 ∧ m.to = “c” 184 ∧ m.tag = tag} 185 in if mset = {} then NOT MESSAGE 186 else (choose x ∈ mset : true)

List of Abbreviations

2PC Two-Phase Commit

ABFT Algorithmic-Based Fault Tolerance

AMPI Adaptive MPI

APGAS Asynchronous Partitioned Global Address Space

API Application Programming Interface

ARMCI Aggregate Remote Memory Copy Interface

BLCR Berkeley Lab Checkpoint/Restart

CC concurrency control

DMTCP Distributed MultiThreaded CheckPointing

DTM Distributed Transactional Memory

EA Early Acquire

FMI Fault Tolerant Messaging Interface

FTWG Fault Tolerance Working Group

GASNet Global Address Space Networking

GASPI Global Address Space Programming Interface

GC Garbage Collection

GML Global Matrix Library

HBI Happens-Before Invariance

HPC High Performance Computing

HPCS High Productivity Computing Systems

HTM hardware transactional memory


LA Late Acquire

MPI Message Passing Interface

MPI-ULFM MPI User Level Failure Mitigation

MTBF Mean Time Between Failures

O-dist optimistic distributed finish

O-p0 optimistic place-zero finish

OCR Open Community Runtime

P-dist pessimistic distributed finish

P-p0 pessimistic place-zero finish

PAMI the Parallel Active Message Interface

PDE Partial Differential Equation

PFS Parallel File System

PGAS Partitioned Global Address Space

RAS Reliability, Availability, and Serviceability

RDD Resilient Distributed Datasets

RDMA Remote Direct Memory Access

RL Read Locking

RTS Run Through Stabilization

RV Read Versioning

RX10 Resilient X10

SPMD Single Program Multiple Data

SSCA Scalable Synthetic Compact Application

SSH Secure Shell

TD termination detection

TLA Temporal Logic of Actions

TM Transactional Memory

uGNI user Generic Network Interface

UL Undo-Logging

UPC Unified Parallel C

WB Write-Buffering

X10RT X10 Runtime Transport

Bibliography

Acun, B.; Gupta, A.; Jain, N.; Langer, A.; Menon, H.; Mikida, E.; Ni, X.; Robson, M.; Sun, Y.; Totoni, E.; et al., 2014. Parallel programming with migratable objects: Charm++ in practice. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, 647–658. IEEE Press. (cited on page 28)

Aiken, A.; Bauer, M.; and Treichler, S., 2014. Realm: An event-based low-level runtime for distributed memory architectures. In Proc. 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), 263–275. IEEE. (cited on pages 17, 30, and 31)

Akka, 2018. Akka Documentation. https://akka.io/docs/. (cited on page 29)

Ali, M. M.; Southern, J.; Strazdins, P.; and Harding, B., 2014. Application level fault recovery: Using fault-tolerant Open MPI in a PDE solver. In Proc. Parallel & Distributed Processing Symposium Workshops (IPDPSW), 1169–1178. IEEE. (cited on pages 43 and 51)

Ali, M. M.; Strazdins, P. E.; Harding, B.; Hegland, M.; and Larson, J. W., 2015. A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique. In Proc. International Conference on High Performance Computing & Simula- tion, HPCS, 499–507. (cited on pages 43 and 68)

Ali, N.; Krishnamoorthy, S.; Govind, N.; and Palmer, B., 2011a. A redundant communication approach to scalable fault tolerance in PGAS programming models. In Proc. 19th Euromicro International Conference on Parallel, Distributed and Network- Based Processing (PDP), 24–31. IEEE. (cited on page 25)

Ali, N.; Krishnamoorthy, S.; Halappanavar, M.; and Daily, J., 2011b. Tolerating correlated failures for generalized Cartesian distributions via bipartite matching. In Proc. 8th ACM International Conference on Computing Frontiers. ACM. (cited on page 25)

Almeida, R.; Neto, A. A.; and Vieira, M., 2013. Score: An across-the-board metric for computer systems resilience benchmarking. In Proc. 43rd Annual IEEE/IFIP Conference on Dependable Systems and Networks Workshop (DSN-W), 1–8. IEEE. (cited on page9)


Ansel, J.; Arya, K.; and Cooperman, G., 2009. DMTCP: Transparent checkpointing for cluster computations and the desktop. In Proc. International Symposium on Parallel & Distributed Processing (IPDPS’09)., 1–12. IEEE. (cited on pages 25 and 162)

Augonnet, C.; Thibault, S.; Namyst, R.; and Wacrenier, P.-A., 2011. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23, 2 (2011), 187–198. (cited on page 31)

Bader, D. A. and Madduri, K., 2005. Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors. Technical report, Georgia Institute of Technology. (cited on page 148)

Bartsch, V.; Machado, R.; Merten, D.; Rahn, M.; and Pfreundt, F.-J., 2017. GASPI/GPI In-memory Checkpointing Library. In European Conference on Paral- lel Processing, 497–508. Springer. (cited on page 24)

Bauer, M.; Treichler, S.; Slaughter, E.; and Aiken, A., 2012. Legion: expressing locality and independence with logical regions. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), 1–11. IEEE. (cited on page 30)

Bauer, M. E., 2014. Legion: programming distributed heterogeneous architectures with logical regions. Ph.D. thesis, Stanford University. (cited on page 30)

Bergman, K.; Borkar, S.; Campbell, D.; Carlson, W.; Dally, W.; Denneau, M.; Franzon, P.; Harrod, W.; Hill, K.; Hiller, J.; et al., 2008. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 15 (2008). (cited on page1)

Bernstein, P. A.; Bykov, S.; Geller, A.; Kliot, G.; and Thelin, J., 2014. Or- leans: Distributed virtual actors for programmability and scalability. Technical Report MSR-TR-2014–41, Microsoft Research. https://pdfs.semanticscholar.org/aacf/ bf0d34bc24dc3b72e56719ec083759a072ce.pdf. (cited on page 29)

Bernstein, P. A. and Goodman, N., 1981. Concurrency control in distributed database systems. ACM Computing Surveys (CSUR), 13, 2 (1981), 185–221. (cited on page 111)

Bernstein, P. A. and Goodman, N., 1984. An algorithm for concurrency control and recovery in replicated distributed databases. ACM Transactions on Database Systems (TODS), 9, 4 (1984), 596–615. (cited on page 16)

Bernstein, P. A.; Hadzilacos, V.; and Goodman, N., 1987. Concurrency control and recovery in database systems. Addison-Wesley Pub. Co. Inc., Reading, MA. (cited on page 119)

Bland, W.; Bosilca, G.; Bouteiller, A.; Herault, T.; and Dongarra, J., 2012a. A Proposal for User-Level Failure Mitigation in the MPI-3 Standard. Technical Report ut-cs-12-693, University of Tennessee Electrical Engineering and Computer Science. (cited on page 42)

Bland, W.; Bouteiller, A.; Herault, T.; Bosilca, G.; and Dongarra, J., 2013. Post-failure recovery of MPI communication capability: Design and rationale. The International Journal of High Performance Computing Applications, 27, 3 (2013), 244–254. (cited on pages 10, 42, and 49)

Bland, W.; Bouteiller, A.; Herault, T.; Hursey, J.; Bosilca, G.; and Dongarra, J. J., 2012b. An Evaluation of User-level Failure Mitigation Support in MPI. In Proc. EuroMPI'12 (Vienna, Austria, 2012), 193–203. doi:10.1007/978-3-642-33518-1_24. (cited on pages 1, 22, and 42)

Blumofe, R. D.; Joerg, C. F.; Kuszmaul, B. C.; Leiserson, C. E.; Randall, K. H.; and Zhou, Y., 1995. Cilk: An efficient multithreaded runtime system, vol. 30. ACM. (cited on page 70)

Blumofe, R. D. and Leiserson, C. E., 1999. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46, 5 (1999), 720–748. (cited on page 71)

Bocchino, R. L.; Adve, V. S.; and Chamberlain, B. L., 2008. Software transactional memory for large scale clusters. In Proc. 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, 247–258. ACM. (cited on pages 118, 120, 144, 145, and 148)

Bosilca, G.; Bouteiller, A.; Cappello, F.; Djilali, S.; Fedak, G.; Germain, C.; Herault, T.; Lemarinier, P.; Lodygensky, O.; Magniette, F.; et al., 2002. MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes. In Proc. ACM/IEEE Conference on Supercomputing, 29–29. IEEE. (cited on page 20)

Bosilca, G.; Bouteiller, A.; Guermouche, A.; Herault, T.; Robert, Y.; Sens, P.; and Dongarra, J., 2016. Failure Detection and Propagation in HPC systems. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16), 312–322. IEEE. (cited on page 22)

Bougeret, M.; Casanova, H.; Robert, Y.; Vivien, F.; and Zaidouni, D., 2014. Using group replication for resilience on exascale systems. The International Journal of High Performance Computing Applications, 28, 2 (2014), 210–224. (cited on page 16)

Bouteiller, A.; Bosilca, G.; and Venkata, M. G., 2016. Surviving errors with OpenSHMEM. In Workshop on OpenSHMEM and Related Technologies, 66–81. Springer. (cited on page 25)

Bungart, M. and Fohry, C., 2017a. A Malleable and Fault-Tolerant Task Pool Framework for X10. In Proc. International Conference on Cluster Computing (CLUSTER), 749–757. IEEE. (cited on page 57)

Bungart, M. and Fohry, C., 2017b. Extending the MPI backend of X10 by elasticity. In EuroMPI Poster. (cited on page 37)

Bykov, S.; Geller, A.; Kliot, G.; Larus, J. R.; Pandya, R.; and Thelin, J., 2011. Orleans: cloud computing for everyone. In Proc. 2nd ACM Symposium on Cloud Computing, 16. ACM. (cited on page 29)

Cao, C.; Herault, T.; Bosilca, G.; and Dongarra, J., 2015. Design for a soft error resilient dynamic task-based runtime. In Proc. International Parallel and Distributed Processing Symposium (IPDPS), 765–774. IEEE. (cited on pages 12 and 31)

Cappello, F., 2009. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. The International Journal of High Performance Computing Applications, 23, 3 (2009), 212–226. (cited on pages 10 and 107)

Cappello, F.; Geist, A.; Gropp, B.; Kale, L.; Kramer, B.; and Snir, M., 2009. To- ward exascale resilience. The International Journal of High Performance Computing Applications, 23, 4 (2009), 374–388. (cited on pages1 and 16)

Chakrabarti, D.; Zhan, Y.; and Faloutsos, C., 2004. R-MAT: A recursive model for graph mining. In Proc. SIAM International Conference on Data Mining, 442–446. SIAM. (cited on page 148)

Chakravorty, S. and Kale, L. V., 2004. A fault tolerant protocol for massively parallel systems. In Proc. 18th International Parallel and Distributed Processing Symposium (IPDPS'04), 212. IEEE. doi:10.1109/IPDPS.2004.1303244. (cited on page 28)

Chakravorty, S.; Mendes, C. L.; and Kalé, L. V., 2006. Proactive fault tolerance in MPI applications via task migration. In International Conference on High-Performance Computing, 485–496. Springer. (cited on page 29)

Chamberlain, B. L.; Callahan, D.; and Zima, H. P., 2007. Parallel Programmability and the Chapel Language. The International Journal of High Performance Computing Applications, 21, 3 (2007), 291–312. (cited on pages1,2, 26, and 70)

Charles, P.; Grothoff, C.; Saraswat, V.; Donawa, C.; Kielstra, A.; Ebcioglu, K.; Von Praun, C.; and Sarkar, V., 2005. X10: an object-oriented approach to non-uniform cluster computing. In Proc. 20th Annual ACM SIGPLAN Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA’05), 519–538. ACM. (cited on pages1,2, and 27)

Chen, Y.; Wei, X.; Shi, J.; Chen, R.; and Chen, H., 2016. Fast and general distributed transactions using RDMA and HTM. In Proc. 11th European Conference on Computer Systems, 26. ACM. (cited on pages 33 and 169)

Chen, Z. and Dongarra, J., 2008. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19, 12 (2008), 1628–1641. (cited on page 17)

Chien, A.; Balaji, P.; Beckman, P.; Dun, N.; Fang, A.; Fujita, H.; Iskra, K.; Ruben- stein, Z.; Zheng, Z.; Schreiber, R.; et al., 2015. Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience. Procedia Computer Science, 51 (2015), 29–38. (cited on page 25)

Coarfa, C.; Dotsenko, Y.; Mellor-Crummey, J.; Cantonnet, F.; El-Ghazawi, T.; Mohanti, A.; Yao, Y.; and Chavarría-Miranda, D., 2005. An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C. In Proc. 10th ACM SIGPLAN symposium on Principles and practice of parallel programming, 36–47. ACM. (cited on page 23)

Coti, C.; Herault, T.; Lemarinier, P.; Pilard, L.; Rezmerita, A.; Rodriguez, E.; and Cappello, F., 2006. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In Proc. 2006 ACM/IEEE conference on Supercomputing, 127. ACM. (cited on page 20)

Crafa, S.; Cunningham, D.; Saraswat, V.; Shinnar, A.; and Tardieu, O., 2014. Semantics of (resilient) X10. In European Conference on Object-Oriented Programming, 670–696. Springer. (cited on page2)

Cunningham, D.; Grove, D.; Herta, B.; Iyengar, A.; Kawachiya, K.; Murata, H.; Saraswat, V.; Takeuchi, M.; and Tardieu, O., 2014. Resilient X10: Efficient Failure-Aware Programming. In Proc. 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’14) (Orlando, Florida, USA, 2014), 67–80. ACM. (cited on pages2,5, 27, 33, 36, 37, 70, 73, 76, 89, 90, 91, 96, and 106)

Daily, J.; Vishnu, A.; van Dam, H.; Palmer, B.; and Kerbyson, D. J., 2014. On the Suitability of MPI as a PGAS Runtime. In International Conference on High Performance Computing (HiPC’14). (cited on page 23)

Davies, T.; Karlsson, C.; Liu, H.; Ding, C.; and Chen, Z., 2011. High performance LINPACK benchmark: a fault tolerant implementation without checkpointing. In Proc. International Conference on Supercomputing, 162–171. ACM. (cited on page 17)

Dean, J. and Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51, 1 (2008), 107–113. (cited on page 17)

Dijkstra, E. W. and Scholten, C. S., 1980. Termination detection for diffusing computations. Information Processing Letters, 11, 1 (1980), 1–4. (cited on pages 70 and 72)

Dongarra, J.; Beckman, P.; Moore, T.; Aerts, P.; Aloisio, G.; Andre, J.-C.; Barkai, D.; Berthou, J.-Y.; Boku, T.; Braunschweig, B.; et al., 2011. The international ex- ascale software project roadmap. International Journal of High Performance Computing Applications, 25, 1 (2011), 3–60. (cited on pages1 and 16)

Dragojević, A.; Narayanan, D.; Nightingale, E. B.; Renzelmann, M.; Shamis, A.; Badam, A.; and Castro, M., 2015. No compromises: distributed transactions with

consistency, availability, and performance. In Proc. 25th Symposium on Operating Systems Principles, 54–70. ACM. (cited on pages 33 and 169)

Du, P.; Bouteiller, A.; Bosilca, G.; Herault, T.; and Dongarra, J., 2012. Algorithm- based fault tolerance for dense matrix factorizations. ACM SIGPLAN Notices, 47, 8 (2012), 225–234. (cited on pages 12 and 17)

Du, P.; Luszczek, P.; and Dongarra, J., 2011. High performance dense linear system solver with soft error resilience. In Proc. International Conference on Cluster Computing, 272–280. IEEE. (cited on page 12)

Egwutuoha, I. P.; Levy, D.; Selic, B.; and Chen, S., 2013. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance com- puting systems. The Journal of Supercomputing, 65, 3 (2013), 1302–1326. (cited on pages 10 and 16)

El-Ghazawi, T. and Smith, L., 2006. UPC: Unified Parallel C. In Proc. ACM/IEEE conference on Supercomputing, 27. ACM. (cited on page 23)

Elnozahy, E. N.; Alvisi, L.; Wang, Y.-M.; and Johnson, D. B., 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34, 3 (2002), 375–408. (cited on pages 10 and 15)

Fagg, G. E. and Dongarra, J. J., 2000. FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world. In European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, 346–353. Springer. (cited on pages 1 and 21)

Fanfarillo, A.; Garain, S. K.; Balsara, D.; and Nagle, D., 2019. Resilient computational applications using Coarray Fortran. Parallel Computing, 81 (2019), 58–67. (cited on pages 24 and 67)

Fang, A.; Laguna, I.; Sato, K.; Islam, T.; and Mohror, K., 2016. Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications. Technical report, Lawrence Livermore National Laboratory. doi:10.2172/1258538. https://e-reports-ext.llnl.gov/pdf/820672.pdf. (cited on page2)

Feng, S.; Gupta, S.; Ansari, A.; and Mahlke, S., 2010. Shoestring: probabilistic soft error reliability on the cheap. In ACM SIGARCH Computer Architecture News, vol. 38, 385–396. ACM. (cited on page 12)

Ferreira, K.; Stearley, J.; Laros, J. H.; Oldfield, R.; Pedretti, K.; Brightwell, R.; Riesen, R.; Bridges, P. G.; and Arnold, D., 2011. Evaluating the viability of process replication reliability for exascale systems. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11), 1–12. IEEE. (cited on pages 1 and 21)

Fiala, D.; Mueller, F.; Engelmann, C.; Riesen, R.; Ferreira, K.; and Brightwell, R., 2012. Detection and correction of silent data corruption for large-scale high- performance computing. In Proc. International Conference on High Performance Com- puting, Networking, Storage and Analysis, 78. IEEE Computer Society Press. (cited on page 21)

Fohry, C.; Bungart, M.; and Plock, P., 2017. Fault tolerance for lifeline-based global load balancing. Journal of Software Engineering and Applications, 10, 13 (2017), 925–958. (cited on page 18)

Gainaru, A.; Cappello, F.; Snir, M.; and Kramer, W., 2012. Fault prediction under the microscope: A closer look into HPC systems. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis, 77. IEEE Computer Society Press. (cited on page 14)

Gamell, M.; Katz, D. S.; Teranishi, K.; Heroux, M. A.; Van der Wijngaart, R. F.; Mattson, T. G.; and Parashar, M., 2016. Evaluating online global recovery with Fenix using application-aware in-memory checkpointing techniques. In Proc. 45th International Conference on Parallel Processing Workshops (ICPPW), 346–355. IEEE. (cited on page 67)

Garcia-Molina, H., 1982. Elections in a distributed computing system. IEEE trans- actions on Computers, , 1 (1982), 48–59. (cited on page 112)

Garg, R.; Vienne, J.; and Cooperman, G., 2016. System-level transparent checkpoint- ing for OpenSHMEM. In Workshop on OpenSHMEM and Related Technologies, 52–65. Springer. (cited on page 25)

GASNet, 2018. Gasnet. https://gasnet.lbl.gov. (cited on page 23)

Graham, R.; Hursey, J.; Vallée, G.; Naughton, T.; and Boehm, S., 2012. The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications. In Proc. Cray Users Group Conference. (cited on page 17)

Gropp, W. and Lusk, E., 2004. Fault tolerance in message passing interface programs. The International Journal of High Performance Computing Applications, 18, 3 (2004), 363–372. (cited on page 22)

Grove, D.; Hamouda, S. S.; Herta, B.; Iyengar, A.; Kawachiya, K.; Milthorpe, J.; Saraswat, V.; Shinnar, A.; Takeuchi, M.; Tardieu, O.; et al., 2019. Failure Recovery in Resilient X10 (accepted). ACM Transactions on Programming Languages and Systems, (2019). (cited on pages7, 93, and 106)

Guo, Y.; Barik, R.; Raman, R.; and Sarkar, V., 2009. Work-first and help-first scheduling policies for async-finish task parallelism. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 1–12. IEEE. (cited on pages 2 and 71)

Hadoop/ZooKeeper. Hadoop/zookeeper. https://wiki.apache.org/hadoop/ZooKeeper. (cited on page 89)

Hakkarinen, D. and Chen, Z., 2010. Algorithmic Cholesky factorization fault recov- ery. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS’10), 1–10. IEEE. (cited on page 17)

Hamouda, S. S.; Herta, B.; Milthorpe, J.; Grove, D.; and Tardieu, O., 2016. Re- silient X10 over MPI user level failure mitigation. In Proc. 6th ACM SIGPLAN Workshop on X10, 18–23. ACM. (cited on pages7, 41, 45, 46, 105, and 106)

Hamouda, S. S. and Milthorpe, J., 2019. Resilient Optimistic Termination Detection for the Async-Finish Model (to appear). In ISC High Performance, Frankfurt, Germany. (cited on pages7 and 69)

Hamouda, S. S.; Milthorpe, J.; Strazdins, P. E.; and Saraswat, V., 2015. A resilient framework for iterative linear algebra applications in X10. In Proc. International Parallel and Distributed Processing Symposium Workshop (IPDPSW), 970–979. IEEE. (cited on pages 5, 7, 105, 106, 108, 110, and 152)

Hao, P.; Shamis, P.; Venkata, M. G.; Pophale, S.; Welch, A.; Poole, S.; and Chapman, B., 2014a. Fault tolerance for OpenSHMEM. In Proc. 8th International Conference on Partitioned Global Address Space Programming Models, 23. ACM. (cited on page 25)

Hao, Z.; Xie, C.; Chen, H.; and Zang, B., 2014b. X10-FT: transparent fault tolerance for APGAS language and runtime. Parallel Computing, 40, 2 (2014), 136–156. (cited on page 27)

Harding, B. and Hegland, M., 2013. A parallel fault tolerant combination technique. In Proc. International Conference on Parallel Computing, (ParCo’13), 584–592. (cited on page 68)

Harding, R.; Van Aken, D.; Pavlo, A.; and Stonebraker, M., 2017. An evaluation of distributed concurrency control. Proc. VLDB Endowment, 10, 5 (2017), 553–564. (cited on page 16)

Harris, T.; Larus, J.; and Rajwar, R., 2010. Transactional memory. Synthesis Lectures on Computer Architecture, 5, 1 (2010), 1–263. (cited on page 112)

Hassani, A.; Skjellum, A.; Bangalore, P. V.; and Brightwell, R., 2015. Practical resilient cases for FA-MPI, a transactional fault-tolerant MPI. In Proc. 3rd Workshop on Exascale MPI, 1. ACM. (cited on page 22)

Hayashi, A.; Paul, S. R.; Grossman, M.; Shirako, J.; and Sarkar, V., 2017. Chapel-on-X: Exploring Tasking Runtimes for PGAS Languages. In Proc. 3rd International Workshop on Extreme Scale Programming Models and Middleware, 5. ACM. (cited on pages 1 and 26)

Hazelcast, Inc., 2014. Hazelcast 3.4. https://hazelcast.com/. (cited on page 106)

Herault, T.; Bouteiller, A.; Bosilca, G.; Gamell, M.; Teranishi, K.; Parashar, M.; and Dongarra, J., 2015. Practical scalable consensus for pseudo-synchronous distributed systems. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15), 1–12. IEEE. (cited on page 56)

Herlihy, M. and Moss, J. E. B., 1993. Transactional Memory: Architectural support for lock-free data structures, vol. 21. ACM. (cited on page 111)

Huang, C.; Lawlor, O.; and Kale, L. V., 2003. Adaptive MPI. In International Workshop on Languages and Compilers for Parallel Computing, 306–322. Springer. (cited on page 28)

Huang, K.-H. and Abraham, J. A., 1984. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, C-33, 6 (1984), 518–528. (cited on page 17)

Hursey, J.; Graham, R. L.; Bronevetsky, G.; Buntinas, D.; Pritchard, H.; and Solt, D. G., 2011. Run-Through Stabilization: An MPI proposal for process fault tolerance. In European MPI Users’ Group Meeting, 329–332. Springer. (cited on pages 1 and 22)

Isard, M.; Budiu, M.; Yu, Y.; Birrell, A.; and Fetterly, D., 2007. Dryad: distributed data-parallel programs from sequential building blocks. In ACM SIGOPS Operating Systems Review, vol. 41, 59–72. ACM. (cited on page 17)

Jain, N.; Bhatele, A.; Yeom, J.-S.; Adams, M. F.; Miniati, F.; Mei, C.; and Kale, L. V., 2015. Charm++ and MPI: Combining the best of both worlds. In Proc. International Parallel and Distributed Processing Symposium (IPDPS), 655–664. IEEE. (cited on page 28)

Kaiser, H.; Heller, T.; Adelstein-Lelbach, B.; Serio, A.; and Fey, D., 2014. HPX: A task based programming model in a global address space. In Proc. 8th International Conference on Partitioned Global Address Space Programming Models, 6. ACM. (cited on page 31)

Kale, L. V. and Krishnan, S., 1993. CHARM++: a portable concurrent object oriented system based on C++. In Proc. 8th annual conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA’93), 91–108. ACM. (cited on page 28)

Karlin, I.; Keasler, J.; and Neely, R., 2013. LULESH 2.0 updates and changes. Technical Report LLNL-TR-641973. (cited on page 158)

Keidar, I. and Dolev, D., 1998. Increasing the resilience of distributed and replicated database systems. Journal of Computer and System Sciences, 57, 3 (1998), 309–324. (cited on pages 112 and 119)

Kurt, M. C.; Krishnamoorthy, S.; Agrawal, K.; and Agrawal, G., 2014. Fault-tolerant dynamic task graph scheduling. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, 719–730. IEEE Press. (cited on page 31)

Laguna, I.; Richards, D. F.; Gamblin, T.; Schulz, M.; and de Supinski, B. R., 2014. Evaluating user-level fault tolerance for MPI applications. In Proc. 21st European MPI Users’ Group Meeting, 57. ACM. (cited on pages 18 and 43)

Laguna, I.; Richards, D. F.; Gamblin, T.; Schulz, M.; de Supinski, B. R.; Mohror, K.; and Pritchard, H., 2016. Evaluating and extending user-level fault tolerance in MPI applications. The International Journal of High Performance Computing Applications, 30, 3 (2016), 305–319. (cited on pages 2 and 67)

Lai, T.-H. and Wu, L.-F., 1995. An (n-1)-resilient algorithm for distributed termination detection. IEEE Transactions on Parallel and Distributed Systems, 6, 1 (1995), 63–78. (cited on pages 70 and 72)

Lemarinier, P.; Bouteiller, A.; Herault, T.; Krawezik, G.; and Cappello, F., 2004. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI. In Proc. International Conference on Cluster Computing, 115–124. IEEE. (cited on page 20)

Liang, Y.; Zhang, Y.; Sivasubramaniam, A.; Jette, M.; and Sahoo, R., 2006. BlueGene/L failure analysis and prediction models. In International Conference on Dependable Systems and Networks (DSN), 425–434. IEEE. (cited on page 14)

Lifflander, J.; Krishnamoorthy, S.; and Kale, L. V., 2012. Work stealing and persistence-based load balancers for iterative overdecomposed applications. In Proc. 21st International Symposium on High-Performance Parallel and Distributed Computing, 137–148. ACM. (cited on page 18)

Lifflander, J.; Miller, P.; and Kale, L., 2013. Adoption protocols for fanout-optimal fault-tolerant termination detection. In 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP’13. ACM. (cited on pages 70, 72, 73, 83, and 107)

Losada, N.; Cores, I.; Martín, M. J.; and González, P., 2017. Resilient MPI applications using an application-level checkpointing framework and ULFM. The Journal of Supercomputing, 73, 1 (2017), 100–113. (cited on page 68)

Maloney, A. and Goscinski, A., 2009. A survey and review of the current state of rollback-recovery for cluster systems. Concurrency and Computation: Practice and Experience, 21, 12 (2009), 1632–1666. (cited on pages 10 and 15)

Matocha, J. and Camp, T., 1998. A taxonomy of distributed termination detection algorithms. Journal of Systems and Software, 43, 3 (1998), 207–221. (cited on page 72)

Mattson, T. G.; Cledat, R.; Cavé, V.; Sarkar, V.; Budimlić, Z.; Chatterjee, S.; Fryman, J.; Ganev, I.; Knauerhase, R.; Lee, M.; et al., 2016. The Open Community Runtime: A Runtime System for Extreme Scale Computing. In Proc. High Performance Extreme Computing Conference (HPEC ’16), 1–7. IEEE. (cited on pages 12, 17, and 30)

Meneses, E.; Ni, X.; and Kalé, L. V., 2011. Design and analysis of a message logging protocol for fault tolerant multicore systems. Technical report, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu/newPapers/11-30/paper.pdf. (cited on page 28)

Meneses, E.; Ni, X.; and Kalé, L. V., 2012. A message-logging protocol for multicore systems. In IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), 1–6. IEEE. (cited on pages 72, 83, and 107)

Meneses, E.; Ni, X.; Zheng, G.; Mendes, C. L.; and Kale, L. V., 2014. Using Migratable Objects to Enhance Fault Tolerance Schemes in Supercomputers. IEEE transactions on parallel and distributed systems, 26, 7 (2014), 2061–2074. (cited on pages 16 and 17)

Min, S.-J.; Iancu, C.; and Yelick, K., 2011. Hierarchical work stealing on manycore clusters. In 5th Conference on Partitioned Global Address Space Programming Models (PGAS11), vol. 625. (cited on page 24)

Mohan, C.; Lindsay, B.; and Obermarck, R., 1986. Transaction management in the R* distributed database management system. ACM Transactions on Database Systems (TODS), 11, 4 (1986), 378–396. (cited on pages 112, 115, 119, and 121)

Moody, A.; Bronevetsky, G.; Mohror, K.; and de Supinski, B. R., 2010. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis, 1–11. IEEE Computer Society. (cited on pages 72, 83, and 107)

Moss, J. E. B., 1981. Nested Transactions: An Approach to Reliable Distributed Computing. The MIT Press. (cited on page 117)

Moss, J. E. B. and Hosking, A. L., 2006. Nested transactional memory: model and architecture sketches. Science of Computer Programming, 63, 2 (2006), 186–201. (cited on page 113)

MPI-4, 2018. MPI 4.0. http://mpi-forum.org/mpi-40/. (cited on page 42)

MPI-ULFM, 2018. MPI-ULFM Specification. http://fault-tolerance.org/ulfm/ulfm-specification. (cited on page 48)

MPICH, 2018. High-Performance Portable MPI. https://www.mpich.org. (cited on page 19)

Mukherjee, S. S.; Emer, J.; and Reinhardt, S. K., 2005. The soft error problem: An architectural perspective. In 11th International Symposium on High-Performance Computer Architecture, 243–247. IEEE. (cited on page 12)

Ni, X.; Meneses, E.; Jain, N.; and Kalé, L. V., 2013. ACR: Automatic Checkpoint/Restart for Soft and Hard Error Protection. In Proc. International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13). ACM. doi:10.1145/2503210.2503266. (cited on pages 15 and 28)

Ni, X.; Meneses, E.; and Kalé, L. V., 2012. Hiding checkpoint overhead in HPC applications with a semi-blocking algorithm. In IEEE International Conference on Cluster Computing (CLUSTER), 364–372. IEEE. (cited on page 28)

Nieplocha, J.; Harrison, R. J.; and Littlefield, R. J., 1996. Global Arrays: A nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10, 2 (1996), 169–189. (cited on page 25)

Nieplocha, J.; Tipparaju, V.; Krishnan, M.; and Panda, D. K., 2006. High performance remote memory access communication: The ARMCI approach. The International Journal of High Performance Computing Applications, 20, 2 (2006), 233–253. (cited on page 23)

Numrich, R. W., 2018. Parallel Programming with Co-arrays: Parallel Programming in Fortran. Chapman and Hall/CRC. (cited on page 24)

Numrich, R. W. and Reid, J., 1998. Co-Array Fortran for parallel programming. In ACM Sigplan Fortran Forum, vol. 17, 1–31. ACM. (cited on page 24)

OpenCoarrays, 2019. OpenCoarrays. http://www.opencoarrays.org. (cited on pages 24 and 67)

OpenMPI, 2018. Open MPI: Open Source High Performance Computing. https://www.open-mpi.org. (cited on pages 19 and 43)

OpenSHMEM, 2017. OpenSHMEM 1.4. http://www.openshmem.org/site/sites/default/site_files/OpenSHMEM-1.4.pdf. (cited on page 25)

Panagiotopoulou, K. and Loidl, H.-W., 2015. Towards Resilient Chapel: Design and implementation of a transparent resilience mechanism for Chapel. In Proc. 3rd International Conference on Exascale Applications and Software, 86–91. University of Edinburgh. (cited on page 26)

Pauli, S.; Kohler, M.; and Arbenz, P., 2013. A fault tolerant implementation of Multi-Level Monte Carlo methods. In Parallel Computing: Accelerating Computational Science and Engineering (PARCO), vol. 13, 471–480. (cited on page 43)

Rizzi, F.; Morris, K.; Sargsyan, K.; Mycek, P.; Safta, C.; Debusschere, B.; LeMaitre, O.; and Knio, O., 2016. ULFM-MPI implementation of a resilient task-based partial differential equations preconditioner. In Proc. ACM Workshop on Fault-Tolerance for HPC at Extreme Scale, 19–26. ACM. (cited on pages 14 and 68)

Rodríguez, G.; Martín, M. J.; González, P.; Tourino, J.; and Doallo, R., 2010. CPPC: a compiler-assisted tool for portable checkpointing of message-passing applications. Concurrency and Computation: Practice and Experience, 22, 6 (2010), 749–766. (cited on page 68)

Ropars, T.; Lefray, A.; Kim, D.; and Schiper, A., 2015. Efficient Process Replication for MPI Applications: Sharing Work Between Replicas. In Proc. International Parallel and Distributed Processing Symposium (IPDPS), 645–654. IEEE. (cited on pages 16 and 21)

Rosenkrantz, D. J.; Stearns, R. E.; and Lewis II, P. M., 1978. System level concurrency control for distributed database systems. ACM Transactions on Database Systems (TODS), 3, 2 (1978), 178–198. (cited on page 117)

Saad, M. M. and Ravindran, B., 2011. HyFlow: A high performance distributed software transactional memory framework. In Proc. 20th International Symposium on High Performance Distributed Computing, 265–266. ACM. (cited on page 112)

Sahoo, R. K.; Bae, M.; Vilalta, R.; Moreira, J.; Ma, S.; and Gupta, M., 2002. Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems. In Workshop on Self-Healing, Adaptive, and Self-Managed Systems. (cited on page 14)

Saraswat, V.; Almasi, G.; Bikshandi, G.; Cascaval, C.; Cunningham, D.; Grove, D.; Kodali, S.; Peshansky, I.; and Tardieu, O., 2010. The asynchronous partitioned global address space model. In The First Workshop on Advances in Message Passing, 1–8. (cited on page 1)

Saraswat, V. A.; Kambadur, P.; Kodali, S.; Grove, D.; and Krishnamoorthy, S., 2011. Lifeline-based global load balancing. In ACM SIGPLAN Notices, vol. 46, 201–212. ACM. (cited on page 18)

Sato, K.; Maruyama, N.; Mohror, K.; Moody, A.; Gamblin, T.; de Supinski, B. R.; and Matsuoka, S., 2012. Design and modeling of a non-blocking checkpointing system. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), 1–10. IEEE. (cited on pages 83 and 107)

Sato, K.; Moody, A.; Mohror, K.; Gamblin, T.; de Supinski, B. R.; Maruyama, N.; and Matsuoka, S., 2014. FMI: Fault tolerant messaging interface for fast and transparent recovery. In Proc. 28th International Parallel and Distributed Processing Symposium, 1225–1234. IEEE. (cited on pages 20 and 107)

Schroeder, B. and Gibson, G. A., 2007. Understanding failures in petascale computers. In Journal of Physics: Conference Series, vol. 78, 012022. IOP Publishing. (cited on page 1)

Shalf, J.; Dosanjh, S.; and Morrison, J., 2010. Exascale computing technology challenges. In International Conference on High Performance Computing for Computational Science, 1–25. Springer. (cited on page 1)

Shet, A.; Tipparaju, V.; and Harrison, R., 2009. Asynchronous programming in UPC: A case study and potential for improvement. In Workshop on asynchrony in the PGAS programming model collocated with ICS, vol. 2009. Citeseer. (cited on page 24)

Simmendinger, C.; Rahn, M.; and Gruenewald, D., 2015. The GASPI API: A failure tolerant PGAS API for asynchronous dataflow on heterogeneous architectures. In Sustained Simulation Performance 2014, 17–32. Springer. (cited on pages 24 and 169)

Skeen, D., 1981. Nonblocking commit protocols. In Proc. 1981 ACM SIGMOD Interna- tional Conference on Management of Data, 133–142. ACM. (cited on page 16)

Squyres, J., 2011. What is an MPI eager limit? https://blogs.cisco.com/performance/what-is-an-mpi-eager-limit. (cited on page 60)

Teranishi, K. and Heroux, M. A., 2014. Toward local failure local recovery resilience model using MPI-ULFM. In Proc. 21st European MPI Users’ Group Meeting, 51. ACM. (cited on page 67)

The TLA Home Page. The TLA Home Page. http://lamport.azurewebsites.net/tla/tla.html. (cited on page 87)

Thoman, P.; Hasanov, K.; Dichev, K.; Iakymchuk, R.; Aguilar, X.; Gschwandtner, P.; Lemarinier, P.; Markidis, S.; Jordan, H.; Laure, E.; et al., 2017. A taxonomy of task-based technologies for high-performance computing. In International Conference on Parallel Processing and Applied Mathematics, 264–274. Springer. (cited on page 10)

Tipparaju, V.; Krishnan, M.; Palmer, B.; Petrini, F.; and Nieplocha, J., 2008. Towards fault resilient Global Arrays. Parallel computing: architectures, algorithms, and applications, (2008), 339–345. (cited on page 25)

Venkatesan, S., 1989. Reliable protocols for distributed termination detection. IEEE Transactions on Reliability, 38, 1 (1989), 103–110. (cited on pages 70 and 72)

Vinoski, S., 2007. Reliability with Erlang. Internet Computing, IEEE, 11, 6 (2007), 79–81. (cited on page 29)

Wang, C.; Mueller, F.; Engelmann, C.; and Scott, S. L., 2008. Proactive process-level live migration in HPC environments. In Proc. ACM/IEEE conference on Supercomputing, 1–12. IEEE Press. (cited on page 16)

X10 Applications. X10 applications. https://github.com/x10-lang/x10-applications. (cited on page 160)

X10 Benchmarks. X10 benchmarks. https://github.com/x10-lang/x10-benchmarks. (cited on page 148)

X10 Formal Specifications. TLA+ specification of the optimistic finish protocol and replication protocol. https://github.com/shamouda/x10-formal-spec. (cited on page 89)

Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauley, M.; Franklin, M. J.; Shenker, S.; and Stoica, I., 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. 9th USENIX conference on Networked Systems Design and Implementation, 2–2. USENIX Association. (cited on page 31)

Zaharia, M.; Chowdhury, M.; Franklin, M. J.; Shenker, S.; and Stoica, I., 2010. Spark: Cluster computing with working sets. HotCloud, 10, 10-10 (2010), 95. (cited on page 17)

Zheng, G.; Shi, L.; and Kalé, L. V., 2004. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing, 93–103. IEEE. (cited on page 28)

Zheng, Y.; Kamil, A.; Driscoll, M. B.; Shan, H.; and Yelick, K., 2014. UPC++: a PGAS extension for C++. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 1105–1114. IEEE. (cited on page 24)