TECHNISCHE UNIVERSITÄT MÜNCHEN

Fakultät für Informatik

Resource-Elasticity Support for Distributed Memory HPC Applications

Isaías Alberto Comprés Ureña

Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Universität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation.

Vorsitzender: Prof. Bernd Brügge, Ph.D.

Prüfende der Dissertation:
1. Prof. Dr. Hans Michael Gerndt
2. Prof. Dr. Michael Georg Bader

Die Dissertation wurde am 23.06.2017 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 12.07.2017 angenommen.

TECHNICAL UNIVERSITY OF MUNICH

Dissertation

Resource-Elasticity Support for Distributed Memory HPC Applications

Author: Isaías Alberto Comprés Ureña
First examiner: Prof. Dr. Hans Michael Gerndt
Second examiner: Prof. Dr. Michael Georg Bader

The dissertation was submitted to the Technical University of Munich on 23.06.2017, and was approved by the Faculty of Informatics on 12.07.2017.

I hereby declare that this thesis is entirely the result of my own work except where otherwise indicated. I have only used the resources given in the list of references.

Garching, 5.5.2017
Isaías Alberto Comprés Ureña

Acknowledgments

First, I want to thank Prof. Gerndt. It was because of a recommendation of his that I originally had the opportunity to engage in message passing research at a reputable research institution. He later gave me the opportunity to pursue this doctorate, with an expanded scope that includes resource management and scheduling. In addition, the quality of this work has largely improved thanks to his diligent supervision and advice.

I would also like to thank the people in my academic environment. To all my colleagues who provided me with new ideas to consider, I am forever grateful. To the staff of the Technical University of Munich, for providing a great environment for work and research. To the Leibniz Supercomputing Center, for granting me access to the supercomputing resources needed for this type of research. Finally, to the Invasive Computing Transregional Collaborative Research Center, for providing the theoretical background and necessary funding for this work.

I would also like to take this opportunity to thank all my friends and relatives, in no particular order, who have directly or indirectly positively influenced my life. I would like to express my gratitude to Manuel and Gloria Cocco, who helped me during moments of adversity. I am thankful to my mother Yvette Ureña, whose lifelong interest in my well-being has no parallel. I also want to thank my uncle Miguel Ramón Ureña for his constant advice and support. Finally, I want to express gratitude to my aunt Miguelina Ureña, who has helped me in many ways over the years.


Abstract

Computer simulations are alternatives to the scientific method in domains where physical experiments are unfeasible or impossible. When the amount of memory and processing speed required is large, simulations are executed on distributed memory High Performance Computing (HPC) systems. These systems are usually shared among their users; a resource manager with a batch scheduler is used to share their resources fairly and efficiently. Current large HPC systems have thousands of compute nodes connected over a high-performance network. Users submit batch job descriptions where the number of resources required by their simulations is specified. Batch job descriptions are queued and scheduled based on priorities and submission times.

The parallel efficiency of a simulation depends on the number of resources allocated to it. It is challenging for users to specify allocation sizes that produce adequate parallel efficiencies. A resource allocation can be too small: the parallel efficiency of the application may be adequate, but its performance is not scaled to its maximum potential. A resource allocation can also be too large, so that the parallel efficiency of the application is degraded due to synchronization overheads. Unfortunately, in current systems these resource allocations cannot be adapted once the applications of a job start.

A resource manager and MPI library combination that adds resource-elasticity support for HPC applications is proposed in this work. The resource manager is extended with operations to adapt the resources of running applications in jobs; in addition, new scheduling techniques are added to it. The MPI library has been extended with operations that enable resource adaptations, expressed as changes in the number of processes in world communicators. The goal is to optimize system-wide efficiency metrics through adjustments to the resource allocations of running applications. Resource allocations are adjusted continuously based on performance feedback from running applications.

Contents

Acknowledgements

Abstract

1 Introduction

2 Motivation
   2.1 Adaptive Mesh Refinement (AMR) Methods
      2.1.1 Challenges of AMR Methods in Distributed Memory Systems
   2.2 Applications with Multiple Computational Phases
      2.2.1 Phases with Different Scalability Properties
      2.2.2 Network-, Memory- and Compute-Bound Phases
      2.2.3 Phases with Different Input Dependent Network and Compute Scaling Proportionalities
      2.2.4 Efficient Ranges for Application Phase Scalability
   2.3 System-Wide Parallel Efficiency
      2.3.1 Suboptimal Network Performance due to Fixed Initial Allocations
      2.3.2 Idle Resources due to Inflexible Resource Requirements in Jobs
      2.3.3 Energy and Power Optimizations

3 Invasive Computing
   3.1 Invasive Computing Research Groups
      3.1.1 Group A Projects
      3.1.2 Group B Projects
      3.1.3 Group C Projects
      3.1.4 Group D Projects
      3.1.5 Group Z Projects

4 Related Work
   4.1 Programming Languages and Interfaces without Elastic Execution Support
      4.1.1 Parallel Shared Memory Systems
      4.1.2 Distributed Memory Systems
      4.1.3 Cloud and Grid Computing
   4.2 Elastic Programming Languages and Interfaces for HPC
      4.2.1 Charm++ and Adaptive MPI
      4.2.2 The X10 Programming Language
      4.2.3 Parallel Virtual Machine (PVM)
      4.2.4 Other Related Works

5 The Message Passing Interface (MPI)
   5.1 MPI Features Overview
      5.1.1 Data Types
      5.1.2 Groups and Communicators
      5.1.3 Point-to-Point Communication
      5.1.4 One-Sided Communication
      5.1.5 Collective Communication
      5.1.6 Parallel IO
      5.1.7 Virtual Topologies
   5.2 Dynamic Processes Support and its Limitations
   5.3 MPICH: High-Performance Portable MPI
      5.3.1 Software Architecture
      5.3.2 MPI Layer
      5.3.3 Device Layer
      5.3.4 Channel Layer

6 Elastic MPI Library
   6.1 MPI Extension Operations
      6.1.1 MPI Initialization in Adaptive Mode
      6.1.2 Probing Adaptation Data
      6.1.3 Beginning an Adaptation Window
      6.1.4 Committing an Adaptation Window
   6.2 MPI Extension Implementation
      6.2.1 MPI_INIT_ADAPT
      6.2.2 MPI_PROBE_ADAPT
      6.2.3 MPI_COMM_ADAPT_BEGIN
      6.2.4 MPI_COMM_ADAPT_COMMIT

7 Elastic-Phase Oriented Programming (EPOP)
   7.1 Motivation for a Resource-Elastic Programming Model
      7.1.1 Identification of Serial and Parallel Phases in the Source Code
      7.1.2 Process Entry and Data Redistribution Locations
   7.2 The EPOP Programming Model
      7.2.1 Initialization, Rigid and Elastic-Phases (EPs)
      7.2.2 EPOP Programs and Branches
      7.2.3 Application Data
   7.3 Current Implementation
      7.3.1 Driver Program
      7.3.2 Program Element
      7.3.3 Program Structure
   7.4 Additional Benefits of the EPOP Model and Driver Programs

8 Resource Management in High Performance Computing
   8.1 Resource Management in Shared Memory Systems
   8.2 Resource Management in Distributed Memory Systems
      8.2.1 Additional Requirements for the Scheduling of Elastic Jobs
   8.3 Simple Utility for Resource Management (SLURM)
      8.3.1 Controller (SLURMCTLD)
      8.3.2 Node Daemon (SLURMD)

9 Elastic Resource Manager
   9.1 Overview of the Integration with the Elastic MPI Library
      9.1.1 Rank to Process Mapping Strategy
      9.1.2 Support for Arbitrary Node Identification Orders
   9.2 Elastic Batch and Runtime Scheduler
   9.3 Node Daemons
   9.4 Launcher for Elastic Jobs

10 Monitoring and Scheduling Infrastructure
   10.1 Theoretical Background on Multiprocessor Scheduling
      10.1.1 Problem Statement
      10.1.2 Computational Complexity
      10.1.3 Resource-Static Scheduling in Distributed Memory HPC Systems
      10.1.4 Modified Scheduling Problem for Resource-Elastic Execution
   10.2 Performance Monitoring Infrastructure
      10.2.1 Process-Local Pattern Detection and Performance Measurements
      10.2.2 Node-Local Reductions and Performance Data Updates
      10.2.3 Distributed Reductions and Performance Models
      10.2.4 EPOP Integration
   10.3 Elastic Schedulers
      10.3.1 Elastic Runtime Scheduler (ERS)
      10.3.2 Performance Model and Resource Range Vector (RRV)
      10.3.3 Elastic Backfilling

11 Evaluation Setup
   11.1 Elastic Resource Manager Nesting in SuperMUC
      11.1.1 Phase 1 and Phase 2 Nodes
      11.1.2 MPI Library and Compilers Setup
   11.2 Testing and Measurement Binaries

12 Elastic MPI Performance
   12.1 MPI_INIT_ADAPT
   12.2 MPI_PROBE_ADAPT
   12.3 MPI_COMM_ADAPT_BEGIN
   12.4 MPI_COMM_ADAPT_COMMIT

13 Elastic Resource Manager Performance
   13.1 Tree Based Overlay Network (TBON) Latency
   13.2 Control Flow Graph (CFG) Detection Overhead
      13.2.1 Scaling with Control Flow Graph (CFG) Size
      13.2.2 Scaling with Process Counts
   13.3 MPI Performance Impact of the CFG Detection Overhead

14 Case Studies with Distributed Memory Applications
   14.1 Cannon Matrix-Matrix Multiplication
      14.1.1 Basic and EPOP Implementations
      14.1.2 Pattern Detection
      14.1.3 Performance Analysis
   14.2 Gaussian Elimination
      14.2.1 Basic and EPOP Implementations
      14.2.2 Pattern Detection
      14.2.3 Performance Analysis
   14.3 Cannon Matrix-Matrix Multiplication and Gaussian Elimination Interaction
      14.3.1 Gaussian Elimination and Cannon Matrix-Matrix with 4096x4096 Matrices
      14.3.2 Gaussian Elimination and Cannon Matrix-Matrix with 1024x1024 Matrices
      14.3.3 Cannon Matrix-Matrix with 4096x4096 Matrices and Different Time Limits
      14.3.4 Upper and Lower MTCT Threshold Effects Summary
   14.4 Summary and Discussion

15 Conclusion

16 Future Work
   16.1 Elastic Message Passing
   16.2 Elastic Batch Scheduler (EBS)
   16.3 Elastic Resource Management

Bibliography

List of Figures

2.1 Example 2D mesh refinement for the z = sin(e^x) surface at the x = y = 0 plane (top) with plots for the function at y = 0 (bottom).
2.2 Meshes with a different color for each submesh distributed among 8 processes.
2.3 Results in MOPS (top) versus MOPS per process (bottom) of the EP, CG, LU and BT benchmarks of the NAS suite on SuperMUC Phase 1 nodes.
2.4 Results in MOPS (top) versus MOPS per process (bottom) of the EP, CG, LU and BT benchmarks of the NAS suite on SuperMUC Phase 2 nodes.
2.5 Compute, network and total times plus efficiency (top to bottom) of a Cannon's matrix-matrix multiply kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) presented.
3.1 Abstract overview of project groups and their application areas.
5.1 Simplified overview of MPI communication and buffering for small and medium buffers (typically smaller than a megabyte) on a four process application with a counterclockwise ring communication pattern.
5.2 Put and get operations initiated both by process 0 using MPI one-sided communication.
5.3 Sequence diagram of a naive all-reduce operation implementation.
5.4 MPI processes organized in a 3 by 3 Cartesian grid virtual topology.
5.5 Algorithm (flow chart) of the MPI_COMM_SPAWN operation as implemented in MPICH.
5.6 MPICH's software architecture.
6.1 Adaptation sequence from 5 to 7 processes.
6.2 Flow chart of the MPI_PROBE_ADAPT operation.
6.3 Flow chart of the MPI_COMM_ADAPT_BEGIN operation.
7.1 Program structure of the simple EPOP example (with source in Listing 7.6).
8.1 Abstract organization of a cluster based on SLURM and its main programs: SLURMCTLD, SLURMD and SLURMSTEPD. SRUN runs in the first node of an allocation (not shown).
8.2 SLURMD, SLURMSTEPD, MPI processes and SRUN in the master node of an allocation.
9.1 Overview of interactions between MPICH and SLURM components during adaptations.
9.2 Sequence of adaptations on 8 nodes that lead to node identifier orders that are not incremental in some of the presented allocations.
9.3 Probe operation at the SLURMSTEPD daemon.
10.1 Possible schedule of a set of static jobs ordered by priority in a queue.
10.2 Process-local Control Flow Graph (CFG) representation.
10.3 Set of four CFGs at a node before reduction.
10.4 Reduced CFG from Fig. 10.3.
10.5 Final reduced CFG at the ERS from Fig. 10.4.

10.6 Efficiency (top) and MPI time to compute time ratio (bottom) of a Cannon's matrix-matrix multiply kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) presented. A line is added for the constant 0.1 boundary of the ratio.
10.7 Time balancing applied to two jobs.
10.8 Time balancing applied to three jobs.
10.9 Resource filling applied to two jobs.
10.10 Possible schedule of a set of elastic jobs ordered by priority in the queue.
12.1 MPI_INIT_ADAPT latency.
12.2 MPI_PROBE_ADAPT latency.
12.3 MPI_COMM_ADAPT_BEGIN latency from a number of staying processes to a new total.
12.4 MPI_COMM_ADAPT_COMMIT latency.
13.1 Latency of TBON messages from SRUN to daemons.
13.2 CFG size performance scaling. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
13.3 Process count performance scaling. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
13.4 MPI_SEND (top) and MPI_BCAST (bottom) performance examples with detection enabled and disabled on a 32 entry CFG loop. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
14.1 Cannon matrix-matrix multiplication trace for 16 processes. MPI time in red and application time in blue.
14.2 Compute, MPI, efficiency and MTCT ratio (top to bottom) of a Cannon matrix-matrix multiplication kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
14.3 Cannon application with EPOP blocks.
14.4 Cannon CFG detection process illustrated.
14.5 Gaussian elimination trace for 8 processes. MPI time in red and application time in blue.
14.6 Compute, MPI, efficiency and MTCT ratio (top to bottom) of a Gaussian elimination kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
14.7 Nodes (top) and MPI processes (bottom) during the interaction between the Cannon's matrix-matrix multiply kernel with 4096x4096 matrices and the Gaussian elimination application with 4096x4096 matrices. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
14.8 Nodes (top) and MPI processes (bottom) during the interaction between the Cannon's matrix-matrix multiply kernel with 1024x1024 matrices and the Gaussian elimination application with 4096x4096 matrices. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.
14.9 Time balancing enabled (top) versus disabled (bottom) with different run times on the same Cannon matrix-matrix multiplication application.
14.10 Repeat of the time balance test with upper MTCT threshold increased to 0.5 for comparison with Fig. 14.9.

1 Introduction

Large High Performance Computing (HPC) systems are essential tools for multiple research areas today. These systems can require large amounts of funds for their initial purchase as well as for long term maintenance. Due to these costs, they are usually shared among a large number of scientists and professionals from multiple institutions.

A resource manager with a scheduler is typically used to share the resources of an HPC system across the compute jobs submitted by its users. Schedulers can operate in time-sharing or space-sharing mode. In time-sharing mode, the tasks of multiple jobs can get simultaneous access to resources. Current schedulers typically operate in space-sharing mode; in this mode, jobs get exclusive access to their allocated resources for the entirety of their execution time. Space-sharing mode provides an environment with predictable performance for the applications that run as part of a job, and has been an adequate solution up to the present time.

Near future HPC systems are expected to reach exaFLOP levels of performance. Due to current trends in computer hardware, this requires the aggregation of ever greater numbers of nodes that internally have ever greater amounts of hardware parallelism. Along with the increase of parallelism in HPC hardware, distributed system software as well as user software will need to be updated to make efficient use of the increased number of individual hardware resources.

Future applications are expected to become more dynamic. For example, their processing requirements will vary at runtime with the use of Adaptive Mesh Refinement (AMR) methods. These applications have varying amounts of available parallelism at runtime, due to the number of elements in their meshes being altered. Since refinements occur in a distributed manner, they can also lead to load imbalances.

Current applications that have limited available parallelism will also pose challenges, even if their available parallelism is constant during runs. Strong scaling applications, whose input determines their computational domain (size and geometry), can receive resource allocations that result in low parallel efficiency. It is difficult to predict good resource allocations for applications before collecting enough empirical data at a wide range of processing resource counts. In many cases, these applications will run only once per input set. Applications may also be composed of multiple phases, where each phase has its own available parallelism.

Parallel applications operate at different efficiencies depending on their current available parallelism and their current resource allocation. Inadequate resource allocations may lower the efficiency of applications. This work addresses efficiency losses due to inadequate resource allocations by introducing a resource-elastic execution environment for distributed memory HPC systems. The parallel efficiency of applications is estimated with a performance model and their resources are periodically increased or reduced based on a new heuristic.

The resources allocated to a job can be adjusted at runtime in a resource-elastic execution environment. Resource counts may be adjusted to prevent low parallel efficiency in individual applications, if it can be detected.

In the case of applications that use AMR methods, resources can be adjusted based on updated mesh element counts and communication patterns. Strong scaling applications can have their resource allocations adjusted based on estimates of their parallel efficiency. The adjustment of resources can also produce a surplus of resources that can be allocated to waiting jobs.

The proposed solution is divided into two main parts: a communication library and a resource manager. The goal is to improve efficiency mainly at the system level, while providing improvements to individual applications as much as possible. The resource manager and the communication library interact to adjust the resources allocated to running applications in order to improve their efficiency. The adjustment is a continuous activity during the runtime of jobs in a distributed memory system.

The communication library supports the Message Passing Interface (MPI) standard. The standard has preexisting support for expansions of resources in applications. This can be achieved through the use of the dynamic processes operations introduced in version 2 of the standard. Due to the performance costs and limitations of these operations, they have not been widely used by application developers. A notable limitation is that resources cannot be reduced.

An extension to the MPI standard is proposed in this work. It consists of four new operations. These operations allow for the dynamic modification of the number of processes of an application, matching any changes in its resources at runtime. These new operations differ in the way they allow resource adaptations, when compared to the standard dynamic processes support in MPI. The core new concept is the creation of adaptation windows, in which resources are incorporated into or removed from distributed applications. Resources are abstracted as processes in the MPI_COMM_WORLD communicator. Additionally, the adaptations are initiated by the resource manager and not by the MPI application. Adaptation windows are defined by a begin and a commit operation (similarly to transactions) and cannot be nested. A short description of the four proposed operations is provided here:

• MPI_INIT_ADAPT: Initializes the library in adaptive mode and indicates the status of the local process: new or joining. A process is new when it is created as part of the initial application launch, or joining when created as part of an expansion by a resource manager.

• MPI_PROBE_ADAPT: Indicates whether the application is required to adapt. If an adaptation is required, it also provides the status of the preexisting process. This status can be staying or leaving.

• MPI_COMM_ADAPT_BEGIN: Begins the adaptation window. This operation provides helper communicators that enable communication between the preexisting and newly added process groups. It also provides additional information to aid repartitioning schemes.

• MPI_COMM_ADAPT_COMMIT: Completes the adaptation window. All staying preexisting processes and all joining processes become members of the MPI_COMM_WORLD communicator after this operation, while any leaving preexisting processes are removed.

Malleable applications are those that can have their computing resources adjusted at runtime. These operations can be used to create new malleable MPI applications or to convert existing ones.
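A minimal sketch of how these operations could be combined in a malleable application is shown below. The C binding names follow the operation names above, but the argument lists and status handling are simplified placeholders, not the exact signatures defined in Chap. 6.

```c
/* Sketch only: a malleable MPI application structured around the proposed
 * adaptation window. Argument lists and status values are simplified
 * placeholders; see Chap. 6 for the actual interface. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int proc_status;   /* reported by the library: new or joining */
    MPI_Init_adapt(&argc, &argv, &proc_status);

    /* Joining processes (created during an expansion) skip the normal input
     * reading and instead receive redistributed data inside the window.    */

    for (;;) {
        int adapt_needed = 0, my_fate = 0;   /* my_fate: staying or leaving */
        MPI_Probe_adapt(&adapt_needed, &my_fate);

        if (adapt_needed) {
            MPI_Comm helper_inter, helper_new;   /* helper communicators */
            MPI_Comm_adapt_begin(&helper_inter, &helper_new);
            /* Repartition application data between preexisting and joining
             * processes here, using the helper communicators.              */
            MPI_Comm_adapt_commit();
            /* MPI_COMM_WORLD now contains the staying and joining processes;
             * leaving processes finalize and exit at this point.           */
        }

        /* One iteration of the elastic compute phase over MPI_COMM_WORLD. */
        /* if (work_is_done) break; */
    }

    MPI_Finalize();
    return 0;
}
```

All data redistribution happens between the begin and commit calls; outside of the window the program behaves like an ordinary MPI application.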

The design of the operations in the MPI extension allows MPI library implementations to hide all latencies related to resource management and the creation of new processes from preexisting processes. Preexisting processes are only notified about adaptations once the new processes are ready and blocking in the begin operation.

The resource manager applies new resource allocations to running applications through a reallocation message. Each application is sent a reallocation message and its status is changed from running to adapting in its job metadata. A six step adaptation sequence is started with this message. In the final step, the application notifies the resource manager that its adaptation has been completed. The resource manager then updates the status of the job from adapting back to running. This is done to prevent the nesting of adaptations, since nesting is currently not supported by the design.

Sets of applications can have their resources adapted simultaneously with the use of multiple adaptation messages. There is a time window during which processes from two applications may reside on the same nodes. This occurs when resources are being moved from one application to another. Once both applications complete their adaptation, exclusive access to resources is again ensured. In summary, exclusive access to resources is guaranteed after resource adaptations are completed, but not during adaptations.

Only things that can be measured or estimated can be managed, so a means to estimate the efficiency of applications at their current resource allocations was needed. Without any estimation of how efficiently applications are performing with specific resource allocations, scheduling decisions that alter resource counts would be of little value or detrimental to performance. An estimation of the expected change in efficiency of an application given a new resource allocation is necessary to support resource-elastic scheduling.

The resource manager and the communication library interact periodically during elastic application runs. During these interactions, performance data is collected and used to generate performance models. These models are used to estimate the efficiency at which each application is operating, given its current resource allocation. If the efficiency is estimated to be low, the resources in its allocation are reduced. If its efficiency is estimated to be high, the resources in its allocation may be preserved or increased, based on a heuristic.

Currently only one performance model has been developed: the SPMD-Phase model. As its name suggests, it only supports Single-Program Multiple-Data (SPMD) distributed patterns. These SPMD phases first need to be identified in distributed applications. The distributed patterns are detected by building a partial Control Flow Graph (CFG) of the application at each individual MPI process. The algorithm updates the partial CFG as the application runs. The CFG is only complete when the application terminates; therefore, the detection always operates on incomplete CFGs during the runtime of applications. The generation of the CFGs relies on unique MPI call site markings that are introduced during compilation with the use of the MPI compiler wrappers. These markers eliminate the need for backtracing at runtime. Backtracing is a source of significant CFG detection overhead, and previous CFG detection solutions rely on it.
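The call-site marking idea can be pictured with the following hypothetical sketch; the wrapper macro and the cfg_record_visit() hook are placeholders for the mechanism that the prototype's compiler wrappers insert, not its actual implementation.

```c
/* Hypothetical sketch of compile-time MPI call-site marking: every textual
 * call site carries a distinct constant, so the CFG detector never needs a
 * runtime backtrace to tell call sites apart. */
#include <mpi.h>

void cfg_record_visit(int call_site_id);   /* assumed runtime hook */

static int mpi_send_marked(int call_site_id, const void *buf, int count,
                           MPI_Datatype type, int dest, int tag, MPI_Comm comm)
{
    cfg_record_visit(call_site_id);        /* update the process-local CFG */
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

/* __COUNTER__ expands to a different constant at each use, giving every
 * MPI_Send call site in the compiled file a unique identifier. */
#define MPI_Send(...) mpi_send_marked(__COUNTER__, __VA_ARGS__)
```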
The first step in generating the model is to identify loops in the partial CFG of each process of the distributed application. These are then matched, reduced and annotated with the collected performance data. In the SPMD-Phase model, the performance data is used to compute the proportion of MPI time to compute time of the distributed loop. This proportion is then used to produce a range of possible resource allocation sizes for the application, within which its estimated parallel efficiency is predicted to either improve or remain stable. This is done for each application in a set of candidates for resource adaptations, and a Resource Range Vector (RRV) is produced. The SPMD-Phase model is applied per application and produces the entries of the vector individually.
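A minimal sketch of the per-application computation behind the model is shown below; the threshold value and the doubling/halving rule are illustrative placeholders, not the prototype's exact heuristic (the MPI-to-compute-time, or MTCT, thresholds are discussed with the scheduler and the case studies later in the document).

```c
/* Sketch of an SPMD-Phase style estimate for one application: from measured
 * MPI and compute times of the detected loop, derive an MTCT ratio and a
 * candidate resource range. Threshold and scaling rule are illustrative. */
typedef struct {
    int min_nodes;   /* lower bound of the candidate range */
    int max_nodes;   /* upper bound of the candidate range */
} resource_range;

resource_range spmd_phase_range(double mpi_time, double compute_time,
                                int current_nodes)
{
    const double upper_mtct = 0.1;        /* assumed efficiency boundary   */
    double mtct = mpi_time / compute_time;
    resource_range r;

    if (mtct > upper_mtct) {
        /* Communication dominates: only shrinking is expected to help.    */
        r.min_nodes = current_nodes / 2 > 0 ? current_nodes / 2 : 1;
        r.max_nodes = current_nodes;
    } else {
        /* Plenty of headroom: growing should keep efficiency similar.     */
        r.min_nodes = current_nodes;
        r.max_nodes = current_nodes * 2;
    }
    return r;
}
```

The RRV is then simply the collection of these per-application ranges over the set of adaptation candidates.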

New models are expected to be added in the future for different parallel execution patterns.

The proposed scheduling heuristic takes the RRV as input. Its output is a Resource Scaling Vector (RSV) of concrete resource counts to be applied to the set of running elastic applications considered. It is assumed that the ranges produced by the performance model do not degrade the efficiency of the applications significantly. Additionally, the performance of the applications is assumed to scale linearly within the supplied ranges. In other words, the assumption is that the parallel efficiency of each application improves or remains similar within the provided ranges. The scheduler makes resource adaptation decisions on sets of applications in order to improve system-wide performance metrics. The performance of individual applications may also be improved in the process. The interactions between the resource manager and the MPI library only take place for applications that are resource-elastic. Applications that are not resource-elastic are simply ignored.

The new scheduling heuristic is an extension of the batch scheduling and backfilling approach found in most distributed schedulers. Its implementation is split into two software components: the Elastic Batch Scheduler (EBS) and the Elastic Runtime Scheduler (ERS). The EBS was not implemented in time to be presented in this document; however, its role in the current design has been determined and is described.

The EBS will implement moldable batch scheduling. This type of scheduling is well understood and already used in related resource-elastic scheduling research. With this type of scheduling, the resources at which jobs start are flexible. Moldable batch scheduling is not concerned with what happens to resource allocations after applications start.

The ERS implements what is referred to as elastic backfilling in this document. Elastic backfilling consists of resource adaptation operations that can be applied to minimize idle node counts while improving the overall estimated efficiency of the system. The trade-off between estimated parallel efficiency and the number of idle nodes generated by the technique can be configured by setting thresholds. The generation of idle nodes by resource adaptations prioritizes the start of new jobs and benefits the estimated efficiency of the system. The alternative is to trade estimated efficiency for lower idle node counts and relative increases in job queue lengths.

The elastic backfilling implementation in the ERS contains a shim that replaces its interaction with the missing EBS. In the design, the expectation is that the ERS will receive new jobs held by the EBS before it produces the final Resource Scaling Vector (RSV) from the Resource Range Vector (RRV) produced with the SPMD-Phase model. This gives priority to the start of new jobs over the expansion of running ones.

The elastic backfilling heuristic applies a combination of two new operations to the set of candidate jobs: time balancing and resource filling. The time balancing operation takes a subset of the candidate jobs and attempts to balance their total runtimes by adjusting their resources. This operation can be used to minimize the makespan in the current prototype, and to reduce the wait time of jobs with reservations in the future, when the EBS is introduced and its shim removed. The resource filling operation is used to fill any surplus of nodes.
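The time balancing idea can be sketched for the two-job case as follows, under the linear scaling assumption stated above; the data layout, the node-second work measure and the rounding rule are illustrative choices, not the prototype's formulation.

```c
/* Sketch of time balancing for two elastic jobs, assuming performance
 * scales linearly with node count inside the supplied ranges. */
typedef struct {
    double remaining_work;   /* estimated node-seconds left */
    int    nodes;            /* current allocation          */
} elastic_job;

/* Split the combined node budget so both jobs finish at roughly the same
 * time: t_i = work_i / nodes_i  implies  nodes_i proportional to work_i.  */
static void time_balance_pair(elastic_job *a, elastic_job *b)
{
    int budget = a->nodes + b->nodes;
    double total = a->remaining_work + b->remaining_work;
    int na = (int)(budget * (a->remaining_work / total) + 0.5);

    if (na < 1) na = 1;                    /* keep both jobs running */
    if (na > budget - 1) na = budget - 1;
    a->nodes = na;
    b->nodes = budget - na;
}
```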
The resource filling operation can be combined with regular backfilling techniques to minimize idle node counts, again once the EBS is introduced.

A new programming model is also proposed in this work: Elastic-Phase Oriented Programming (EPOP). Its goal is to simplify the development of elastic distributed memory applications. This model improves the structure and modularity of elastic applications by requiring that their work be defined as a collection of phases and control structures.
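Purely as an illustration of the "phases plus driver" idea (the real EPOP interface is defined in Chap. 7), an application in this style might be organized as follows; every type and function name below is invented for this sketch.

```c
/* Purely illustrative "phases plus driver" structure; not the EPOP API. */
typedef struct {
    const char *name;
    int (*body)(void *app_data);  /* runs the phase, returns next phase index
                                     or -1 to stop: control flow is explicit */
    int is_elastic;               /* elastic phases may be (re)entered with a
                                     changed number of processes */
} epop_phase;

/* A trivial driver: it walks the explicit phase graph instead of having to
 * detect it at runtime, which is where the CFG detection overhead is saved. */
static void epop_drive(epop_phase *phases, int start, void *app_data)
{
    int next = start;
    while (next >= 0)
        next = phases[next].body(app_data);
}
```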

The CFG of applications is defined by developers in this model; this eliminates the need for CFG detection and its related overhead. The actual execution of these applications is controlled by a driver program. Driver programs can have different purposes, such as debugging. Different driver programs can be selected at launch time without modifying the applications.

In summary, the contributions in this work add resource-elasticity support for distributed memory applications in current HPC systems. Support for resource-elasticity requires changes to multiple parts of the software stack of a parallel system, such as programming models, runtime systems, resource managers and schedulers. Because of this, the contributions presented in this document are related to multiple research areas of computer science. More specifically, the contributions of this work can be enumerated as follows:

1. MPI library with an extension for improved dynamic processes support.

2. Resource manager with support for resource-elasticity.

3. MPI library and resource manager integration for continuous interaction.

4. Programming model for distributed memory resource-elastic applications.

5. Measurement and modeling infrastructure for resource-elastic applications.

6. Scheduling heuristic to optimize systems with resource-elasticity support.

This document continues with the motivation and the related work. It then moves on to topics related to the message passing programming model. Resource management and scheduling contributions are described afterwards. The document is then completed with the evaluation and closing chapters.

The content is organized in chapters. A set of application types and their scalability with resources are presented as motivation in Chap. 2. In the same chapter, the benefits of resource-elasticity for system-wide performance are summarized and related work is cited. This work was done within the transregional Invasive Computing project. The scope and organization of this research project is described briefly in Chap. 3. A detailed discussion of related programming models and resource management research is presented in Chap. 4. The document continues with the chapters related to message passing. Chapter 5 provides an overview of the MPI standard and its implementation within the MPICH library. MPICH was used as the basis for the communication library prototype. Afterwards, Chap. 6 describes the proposed MPI extension and its implementation. A new resource-elastic programming model is introduced in Chap. 7: Elastic-Phase Oriented Programming (EPOP). The document continues with the chapters about resource management. Chapter 8 provides a brief introduction to resource management in general. It also provides an overview of the SLURM workload manager, since it is used as the basis for the resource manager prototype presented in this document. The new resource manager is presented in Chap. 9. Its interaction with the communication library when performing resource adaptations is described in detail. The design and implementation of the new measurement, modeling and scheduling infrastructure is covered in Chap. 10. In Chapters 11 through 14, the testing methodology is described and evaluation results for the MPI library, the resource manager and a selection of applications are presented. Finally, the document ends with the conclusion in Chap. 15 and a discussion of possible future work in Chap. 16.


2 Motivation

Research and engineering efforts today are conducted in several possible ways. Purely experimental research (using the scientific method) is done when possible. For example, a company can evaluate the response of a new material to changes in temperature in a controlled environment with acceptable accuracy. However, not all research teams have that privilege. In many other cases experiments are impractical, expensive or simply impossible. Consider the case of an environmental team evaluating the impact of some substance on an ecosystem: they could pollute large parts of a forest with a new chemical and then evaluate its impact on the local fauna. Needless to say, this would be unethical and could have permanent detrimental effects on the forest. Another example could be a company in the automotive industry. Such a company may have multiple teams working on several possible designs for a new engine, but may not be able to afford an individual prototype for each possibility. Finally, an experiment may simply be impossible, such as the analysis of the effect of an earthquake of a specific strength on a city where this has never happened.

Computer simulations are an alternative to real experiments. Special care needs to be taken when creating models and simulation code. If successful, simulations can provide accurate results that correlate reliably to reality. Simulations can help companies save on experimental and prototyping costs by replacing parts of their experiments with simulations. For example, in the automotive scenario with multiple engine designs mentioned before, the company could simulate all candidate designs before producing physical prototypes. The number of prototypes can then be reduced by discarding less efficient designs, greatly reducing costs. Simulations can also allow otherwise impossible research to take place, such as the earthquake scenario mentioned before.

The time to solution of a simulation is closely related to its usefulness in many cases. For example, a weather forecasting station needs to produce forecasts before individuals can simply look at the sky. This requirement, together with the processing capabilities of the system and the performance properties of the simulation code, will determine the performance level required to meet deadlines.

Simulations vary greatly in terms of their computational requirements and the required accuracy of their results. For many simulations, a desktop computer or workstation is sufficient. In other cases, when the performance required far exceeds what is available in a single workstation, a distributed memory system is required. In distributed memory systems, the performance of thousands of compute elements can be aggregated. Instead of workstations, the compute elements are divided into nodes and assembled in racks. The definition of a node will vary depending on the vendor that provides the distributed memory system. If the simulation, in addition to large compute requirements, also has strong synchronization requirements, then specialized communication networks are used to interconnect the nodes. A large set of nodes interconnected by a high performance network is today referred to as a supercomputer or a High Performance Computing (HPC) system.

Even small supercomputers can have significant costs, both for the initial purchase and for maintenance during their service time. Costs are one of the reasons why these systems tend to be shared among several users and even several research institutions. A resource manager with a scheduler is necessary to share the resources of these systems fairly and efficiently.

Most schedulers today operate in space-sharing mode; this means that resources are given exclusively to a job during its complete run, and are only released for other jobs when it completes. Exclusive access to resources has given applications stable and predictable performance. This solution has been adequate given the scale of current systems and the static nature of most algorithms used in contemporary simulations. This is expected to change as systems continue to grow and applications become more dynamic.

Allocations can often impact the efficiency of simulations negatively. In this work, an allocation is defined as inadequate when it lowers any efficiency metric significantly. The allocation may be inadequate for the entirety of the run of a job, or only temporarily. Elastic execution is proposed as a solution to inadequate allocations, since allocations can be adjusted while applications are running in order to improve their efficiency metrics. In addition to justifying this research as a fix for inadequate allocations of individual jobs, elastic execution can also improve system-wide efficiency metrics with a mix of elastic and static applications. For example, jobs can be allocated extra nodes from an idle pool; this minimizes the idle node count metric and is only possible with elastic execution.

In the remainder of this chapter, a non-exhaustive set of scenarios where allocations are often inadequate is introduced. The focus is on parallel efficiency as the main metric, since it is usually affected negatively by inadequate allocations and is often the main objective of optimizations. The relationship between the available parallelism of an application phase and the range of possible adequate resource allocations is established. This motivates the need for identifying the phases of applications and their available parallelism. Finally, a short discussion of the benefits of resource-elasticity for system-wide efficiency is included, together with references to related work. In summary, resource-elasticity can provide efficiency increases to HPC systems and individual applications.

2.1 Adaptive Mesh Refinement (AMR) Methods

Adaptive Mesh Refinement (AMR) techniques are widely used in scientific applications today. Applications that use these methods exhibit a tendency to generate load imbalances. Additionally, their scalability properties change as a function of the aggregated increase or decrease of primitives in their geometry caused by any refinement or coarsening operation applied to their meshes.

Load imbalances are generally well handled today, while changes in scalability properties are not. Load balancing is achieved by application specific techniques. Changes in scalability properties pose a larger challenge because the application, the communication library and the resource manager all need to support resource adaptations. A software stack that allows for resource adaptations can support AMR applications more efficiently, since their allocations can be adjusted based on their dynamic scalability to ensure acceptable parallel efficiency throughout their runtime.

Algorithms used to solve linear systems iteratively rely on approximations of their error, usually referred to as the residual, in order to determine the quality of the approximation at each iteration and to determine a termination condition. The subset of these algorithms that perform mesh refinements rely on a way to approximate their error at different refinement levels. In most simulations, the main program computes approximations to the solution of a system of partial differential equations per time step. Error estimations can be computed in subdomains, to determine where to refine the mesh.


Figure 2.1: Example 2D mesh refinement for the z = sin(e^x) surface at the x = y = 0 plane (top) with plots for the function at y = 0 (bottom).

This is much better than refining the full mesh, since finer meshes require more processing and their higher resolution is only of benefit in the submeshes where the error is estimated to be high. The technique can be better understood with an illustration. Figure 2.1 shows a refined mesh produced by a numerical quadrature algorithm for a particular integrand function. As can be seen in the images, the AMR algorithm refines the 2D mesh proportionally to the spatial frequency of the integrand. Note that the refinement divides each element in the mesh into four new elements, although the increased resolution of only one dimension improves the accuracy of the integration in this case.
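The error-driven refinement loop can be illustrated with a small, self-contained 1D analogue of the quadrature example; the integrand, tolerance and Richardson-style error estimate below are chosen for illustration only and are not the algorithm used to produce Fig. 2.1.

```c
/* Illustrative 1D error-driven refinement: interval halves are refined only
 * where the local error estimate is above the tolerance, mirroring how AMR
 * refines only the submeshes with high estimated error. */
#include <math.h>
#include <stdio.h>

static double f(double x) { return sin(exp(x)); }

/* Simpson's rule on [a, b]. */
static double simpson(double a, double b) {
    double m = 0.5 * (a + b);
    return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b));
}

static double integrate_adaptive(double a, double b, double tol) {
    double m = 0.5 * (a + b);
    double whole = simpson(a, b);
    double halves = simpson(a, m) + simpson(m, b);
    double err_est = fabs(halves - whole) / 15.0;   /* Richardson estimate */
    if (err_est < tol)
        return halves;                               /* no refinement here */
    /* High local error: refine both halves with a tighter tolerance.      */
    return integrate_adaptive(a, m, 0.5 * tol) +
           integrate_adaptive(m, b, 0.5 * tol);
}

int main(void) {
    printf("integral ~= %.8f\n", integrate_adaptive(0.0, 3.0, 1e-8));
    return 0;
}
```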

2.1.1 Challenges of AMR Methods in Distributed Memory Systems

As described before, the use of AMR methods can offer efficiency gains to applications. Unfortunately, there are also some challenges that arise as a result of the use of these methods. In this section, the challenges that are specific to applications that rely on AMR methods and their execution in distributed memory systems will be discussed. These are load imbalance due to process-local refinements and variable scalability due to changes in the total available parallelism of applications.


Figure 2.2: Meshes with a different color for each submesh distributed among 8 processes: (a) regular, (b) top refinement, (c) bottom and top refinement, (d) left refinement.

Local Refinements and Load Imbalance

When executing in distributed memory systems, mesh refinements are performed in a partitioned domain. Each participating process owns a different subdomain, and therefore will determine different refinements. The refinements can vary greatly given the amount of symmetry in the domain, the initial conditions and the equations involved, among other things. These differences in the amount of refinement at each of the processes lead to load imbalances. This has been solved in static allocations by application specific redistribution schemes. An elastic execution model is not a solution to the load balancing problem; however, local refinements also result in variable available parallelism.

Variable Available Parallelism and Scalability

The aggregated number of refinements performed at each process of a distributed application using AMR changes the scalability of the application. A net increase in the number of points in the mesh will increase the total available parallelism, while a net reduction in the number of points on the distributed mesh will decrease the total available parallelism. In applications that use AMR methods, the available parallelism therefore varies in time.

The available parallelism of an application determines the amount of resources that it can use at peak or acceptable efficiency, given the relation between its computational and communication requirements. This means that, in order to operate at high efficiency, the resources of AMR applications need to be actively adjusted based on the results of their refinement or coarsening operations. Moreover, in most simulations the refinements produce different results depending on the current time step of the algorithm; therefore, adaptations need to be continuous in time.

A set of examples helps illustrate possible situations that an application can run into. Figure 2.2 shows four meshes that are divided into 8 submeshes of equal dimensions. Such a situation can arise in a simulation that is running in a distributed manner on 8 processes. The number of elements on each process is proportional to the amount of computation its solver must perform. In the figure, meshes (a) and (c) are load balanced, while (b) and (d) are not. In addition to this, if (a) is the starting point of the algorithm and (b), (c) and (d) are possible outcomes of the AMR algorithm, the total amount of computation in the distributed application increases and therefore its available parallelism changes. The change in available parallelism may render resource allocations inadequate. Because of this, application specific redistribution schemes developed for elastic execution can benefit greatly from performing load balancing together with resource adaptations.

Elastic execution is proposed as a solution to the loss of parallel efficiency due to the variability of available parallelism produced by AMR methods in distributed memory applications. Elastic execution allows for the adjustment of the resources allocated to applications based on their available parallelism as a continuous activity during runs.

2.2 Applications with Multiple Computational Phases

Most applications are seen intuitively by developers as having multiple phases. For example, an application can be seen as having an initialization phase, a computational phase and a finalization phase. During initialization, an application will most likely perform IO operations to read all of its input files. It will then set up its data structures in a distributed manner, before any computation takes place.

During computation, if the application is a simulation, it will likely perform multiple time steps until the simulation is done. Within each time step, multiple approximations of linear systems of differential equations may be computed as necessary; each of these separate approximations, which could use separate specialized solvers, can itself be seen as a separate phase. Finally, during a finalization phase, the application may generate any output files as necessary, such as visualization and result files. The application may also free any memory related to its data structures and close any file descriptors.

These different phases of applications tend to have different performance characteristics. In most cases, the initialization and finalization phases do not take significant amounts of time and therefore are not targets for optimization. However, any phases that appear in the computational parts will be very significant to the total run time of the application. Because of their differences in parallel efficiency, these phases may perform better with different allocations of compute resources.

In this section, the variability of available parallelism in phases is described. Overheads related to adaptations of resources are not considered yet; the focus is instead on the performance effects given instantaneous resource adaptations. Most of the common types of phases are classified in terms of their scalability with resources, and the benefits that elastic execution can provide to them are stated.

A distinction is made between phases that generate more work as their resources are increased, versus those that keep their amount of work constant regardless. The former type is known as weak scaling, while the latter type is known as strong scaling.

The performance of strong scaling phases can be optimized by adjusting resources via elastic execution. Their parallel efficiency is a function of the resources that they use for processing. In contrast, weak scaling phases will simply increase the amount of work they do per computational unit as more resources are provided to them. Because of this, only strong scaling phases are considered in this work.
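For reference, the strong-scaling quantities used implicitly in the rest of this chapter can be written down explicitly; the notation below is introduced here for illustration and does not appear in the surrounding text.

```latex
% T(p): time to solve a fixed-size problem on p processes (strong scaling).
\[
  S(p) = \frac{T(1)}{T(p)}, \qquad
  E(p) = \frac{S(p)}{p} = \frac{T(1)}{p \, T(p)}
\]
% Weak scaling instead grows the problem with p, keeping the work per
% process constant, so efficiency is tracked at constant per-process load.
```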

2.2.1 Phases with Different Scalability Properties

The scalability of distributed compute phases is a measure of efficiency based on the number of resources that are allocated for their computation. The scalability of different phases can vary greatly. A phase that can use more compute resources than another phase is said to be more scalable, comparatively.

Figures 2.3 and 2.4 show the results of the NAS [75] Parallel Benchmarks collected on SuperMUC [13] Phase 1 and Phase 2 nodes respectively. Results are shown for the EP, CG, LU and BT benchmarks, each at sizes W, A and B (where W is the smallest size and B the biggest). The MOPS (millions of operations per second) metric is plotted at the top, while the efficiency metric (MOPS per process) is plotted at the bottom. The EP benchmark stands for Embarrassingly Parallel; as can be seen in the plots, it indeed behaves as having large available parallelism on Phase 1 nodes, while it stops scaling linearly at 32 processes on Phase 2 nodes for size W, and at 64 for the other sizes. In most cases, such as this one, EP applications only scale linearly with resources up to a certain amount that is dependent on their input. All other benchmarks clearly show that they have limited scalability, with the LU and BT benchmarks scaling more strongly than the CG benchmark. All of them show input dependent scalability, with CG being the most input dependent.

Phases with limited available parallelism, and therefore with limited scalability, are of special interest since their parallel efficiency is a function of the resources that are allocated to them. Elastic execution allows the adjustment of resources for these types of phases, once their performance has been analyzed. Performance analysis will be discussed together with scheduling in Chap. 10.

2.2.2 Network-, Memory- and Compute-Bound Phases

Phases can also be classified based on which aspects of the parallel system limit their performance. Phases can be limited by the network, memory or compute performance of the system. Phases that are limited by file system performance are considered network-bound, since distributed systems provide their file systems through their network.

It is important to note that this classification only makes sense given specific allocations, since bottlenecks may shift after a resource adaptation. For example, a network-bound phase can become memory-bound after a reduction of the resources in its allocation. Such a scenario indicates that the optimal number of resources lies above the new reduced allocation and below the original allocation. This can be observed in the CG benchmark's efficiency plot in Fig. 2.4, evaluated on SuperMUC Phase 2 nodes. In this case, it can be seen that the efficiency and scalability of the application change after 8 processes only for size B. Its scalability depends on the input, the number of resources allocated to it and the actual type of hardware where it is run, as can be seen when comparing these same results on SuperMUC Phase 1 nodes in Fig. 2.3.

Network-bound phases are limited by the performance of the network of the parallel system. Additionally, they may be sensitive to the topology of their allocated resources in the network.


Figure 2.3: Results in MOPS (top) versus MOPS per process (bottom) of the EP, CG, LU and BT benchmarks of the NAS suite on SuperMUC Phase 1 nodes.


Figure 2.4: Results in MOPS (top) versus MOPS per process (bottom) of the EP, CG, LU and BT benchmarks of the NAS suite on SuperMUC Phase 2 nodes.

Phases of this kind generally do not scale well to large resource allocations, since these tend to lower their parallel efficiency. Setting the number of processes per node to the maximum number of cores per node may provide the best performance in these cases. Parallel efficiency can be improved by reductions of the resources in their allocations.

Memory-bound phases are limited by the bandwidth or latency of the memory subsystem of the compute nodes. These may be sensitive to the number of processes executing internally at each node, and may benefit from a reduction of these, in contrast to the network-bound case. Scaling with the number of nodes will in most cases still gain significant performance before lowering the parallel efficiency of the application.

Compute-bound phases are limited by the performance of the arithmetic units of the processing elements in their allocation. These phases are targets for expansions of their allocated resources. Compute-bound phases may become network- or memory-bound once given enough resources, due to their available parallelism. For example, the EP benchmark in Fig. 2.3 behaves as compute-bound for all allocations and input sizes on SuperMUC Phase 1 nodes, while on Phase 2 hardware its bottleneck shifts starting at 32 processes (with different severity depending on its input size), as shown in Fig. 2.4.

Compute-bound phases normally scale to larger numbers of resources than other types of phases, such as those that are network- or memory-bound. Compute-bound phases may in some cases scale up to a number of resources that exceeds the available resources of the parallel system. These are not particularly interesting when optimizing an individual application, since they perform near their maximum efficiency independently of the amount of resources that they have allocated. It is nevertheless important to identify them, since they can be efficient at a wide range of resource allocations and therefore can be used to fill up idle nodes. This can help minimize the idle node count and other system-wide efficiency metrics.

2.2.3 Phases with Different Input Dependent Network and Compute Scaling Proportionalities

It is common to find that the network and compute times of a phase scale with different proportionalities depending on its input. For example, there are distributed kernels and solvers where the computation scales cubically with the size of the subdomain of a process, while the communication scales quadratically with the same subdomain size. In this case, the application follows the surface to volume scaling proportionality, due to its computational domain being a volume while its communication boundaries are surfaces. The size of the subdomains depends on the size of the input and on the number of processes given to the application as resources. When this situation arises, there are concrete allocation sizes that maximize the efficiency of the computation. Moreover, if overlap of communication and computation is possible, the optimal allocation for parallel efficiency is the one that makes the computation and communication times match, potentially halving the execution time.

This type of scaling occurs frequently in simulation software due to common domain decompositions and computational kernels. Data partitioning schemes for distributed memory applications split a domain across the processes of an application. When simulating physical phenomena, domains tend to represent a volume. A decomposition scheme slices a specific domain into smaller sub-volumes, where the area of the sides of the sub-volumes determines the proportionality of the communication requirements, while the size of the sub-volumes themselves determines the proportionality of the computational requirements at each process.
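For a cube-shaped subdomain the surface-to-volume argument can be made explicit; the symbol n below (mesh points along one edge of a process subdomain) is introduced here only for illustration.

```latex
% Per-process cost proportionalities for an n x n x n subdomain:
\[
  T_{\mathrm{comp}} \propto n^{3}, \qquad
  T_{\mathrm{comm}} \propto n^{2}, \qquad
  \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}} \propto \frac{1}{n}
\]
% More processes for a fixed input shrink n, raising the communication
% share and eventually capping the allocation size that remains efficient.
```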

This situation arises very often in simulations whose solvers are based on stencils that represent sparse diagonal or block-diagonal system matrices. In such cases, communication takes place across neighbors in the simulated physical domain; communication can then be optimized by placing processes that compute neighboring simulation subdomains close together in the physical network topology.

Even experts will have a hard time guessing the amount of resources a phase can use efficiently before the actual simulation takes place. In addition, since applications may have multiple phases, a fixed amount of resources that is efficient for the whole application may not exist. This is better explained with an example.

Figure 2.5 presents (from top to bottom) compute time, MPI time, total time (MPI and compute) and efficiency (matrix elements processed per second per process) metrics of a distributed Cannon algorithm implementation for matrix-matrix multiplication evaluated on SuperMUC Phase 1 (left) and Phase 2 (right) nodes. Results for allocations of 4 through 1024 processes are presented. These times were measured at the actual matrix multiplication kernel phase. As can be seen from the plots, as long as the MPI times are only a small fraction of the computation times, the kernel shows relatively constant results for its efficiency metric, and the total time continues to scale well with the number of resources. Once the MPI time exceeds the computation time, the efficiency and scaling of the application are greatly reduced. For this application, there is therefore a ceiling on the number of resources that can be used efficiently; this ceiling depends on the size of the input (the matrices in this case).

2.2.4 Efficient Ranges for Application Phase Scalability

It has been observed that there are ranges of process counts where application phases are efficient. There is only an upper bound on these ranges; applications tend to retain a similar level of efficiency with lower process counts. The upper limit of these ranges correlates with high proportions of MPI time versus compute time in the phases. Ensuring that application phases have resource allocations inside their efficient range is desirable, and exceeding the upper bound of the efficiency range should be avoided.

2.3 System-Wide Parallel Efficiency

The overall efficiency of complete supercomputers is of great importance. Current schedulers try to maximize system-wide efficiency metrics while applying best-effort techniques to ensure fairness in terms of the wait times of individual jobs and their performance once started [89, 126, 200, 166, 90].

So far only the efficiency of phases as a function of their allocated resources has been discussed; this efficiency relates only to individual applications and not to the efficiency of the complete parallel system. In this section, scenarios where elastic execution can improve the overall efficiency of complete HPC systems are discussed. The assumption here is that a system will have a mix of elastic and static jobs, in contrast to current systems where jobs are strictly static.

Figure 2.5: Compute, network and total times plus efficiency (top to bottom) of a Cannon's matrix-matrix multiply kernel, for matrix sizes from 64x64 to 4096x4096. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.

2.3.1 Suboptimal Network Performance due to Fixed Initial Allocations

It has been shown by several researchers that the placement of processes can greatly impact the network performance of distributed applications [24, 162, 116, 202, 133, 171, 129, 156, 187, 155, 134, 196, 157, 220, 211, 182]. Topology-aware algorithms already exist that minimize the number of hops between nodes when allocating resources for a job [129, 44, 48, 128, 43, 125, 67, 201, 220, 150, 226, 117]. The goal is to minimize the network latency and maximize the bandwidth between the nodes allocated to a job. However, it is preferable to start a job that is waiting in the queue of a system immediately, instead of waiting for the ideal resources that provide the best network latency and bandwidth. Because of this, the allocation of resources is very often not ideal, depending on the level of fragmentation and node availability in the system when a job is launched. Elastic execution can alleviate this by allowing the reallocation of the resources of a job once other resources become available that minimize the number of hops across the full allocation.

2.3.2 Idle Resources due to Inflexible Resource Requirements in Jobs

The set of jobs available in the queue at a specific time, together with their static resource requirements, can make it impossible for schedulers to fill up the totality of the resources of a parallel system. This becomes more difficult in systems that attempt to ensure fairness based on the arrival time of jobs in the queue. Indeed, the minimization of idle nodes has been the goal of several backfilling techniques developed over the years [200, 166, 212, 199, 206, 143, 233, 193, 144, 136, 151, 227].

In combination with backfilling, elastic execution can further alleviate this problem when there are sufficient elastic jobs in the queue that can make use of any idle nodes. In addition, waiting jobs in the queue may start immediately with a smaller number of nodes, and later expand as needed once other jobs terminate and release resources. In general, given sufficient elastic jobs, elastic execution can completely eliminate idle nodes without requiring the right combination of static jobs in the queue at all times. Jobs with phases that have very high scalability are particularly attractive, since they can be used to fill up idle resources with minimal efficiency penalties.

2.3.3 Energy and Power Optimizations

In addition to parallel efficiency, energy optimizations are increasing in importance in current HPC systems [88, 34, 55, 225, 102, 101, 190]. Energy costs have long overtaken the purchase price and other maintenance costs over the lifetime of these systems. Because of this, system administrators today may opt to use schedulers that optimize both performance and energy metrics simultaneously with multi-objective optimization techniques.

As previously discussed, elastic applications can have their resources adjusted to maximize their parallel efficiency and potentially reduce their runtime. In addition, resources can be adjusted so that energy efficiency is also optimized. This can be achieved through multi-objective optimizations that find Pareto-optimal solutions or by minimizing metrics such as the energy-delay product. In contrast, static applications cannot be optimized in this manner, since the number of resources for the job must remain the same during its run.

Power-level stabilization is also of great importance to some compute centers today. There are two common reasons: first, the stability of megawatt power delivery circuits, and second, the way some energy providers set up their billing contracts. Some contracts include penalties both for not meeting and for exceeding certain power levels. Elastic execution makes it easier for scheduling algorithms to meet and stabilize power levels by reducing idle nodes and scaling the resources of elastic jobs with different power characteristics.

3 Invasive Computing

This work is part of the research efforts coordinated by the Transregional Collaborative Research Center 89: Invasive Computing. Invasive Computing focuses on resource-aware computing in parallel systems. It has a wide scope: from embedded devices to distributed memory supercomputers in terms of hardware, and from operating systems to scientific computing applications in terms of software. The Invasive Computing research project has been divided into three research and funding phases; it is currently in its second phase. There are several research groups distributed across three universities. Their organization and goals are described in this chapter.

3.1 Invasive Computing Research Groups

Invasive Computing is organized in research groups. Researchers are located across all three participating institutions: the Friedrich-Alexander University Erlangen-Nürnberg, the Karlsruhe Institute of Technology and the Technical University of Munich.

The research groups are clustered in five sets under the letters A, B, C, D and Z. Each individual group is identified by a unique combination of one of the preceding letters and a number. These are described in detail in this section. Figure 3.1 provides an abstract overview of where projects A through D fit in terms of their hardware and software application areas. The horizontal axis denotes the hardware abstraction level: Arithmetic Logic Unit (ALU), Tightly-Coupled Processor Array (TCPA), Central Processing Unit (CPU) and High Performance Computing (HPC). The only Invasive Computing specific term is the TCPA: these are collections of processing units that are used to accelerate certain workloads. Projects B, C and D are represented on the left vertical axis, with group B representing the lowest level and group D the highest. Projects in the A group interact with the others just mentioned, and are represented vertically on the right side of the figure. Each of the groups of the current research and funding phase will be introduced in the remainder of this chapter.

3.1.1 Group A Projects

Group A projects focus on research related to programming models, algorithm complexity analysis, software performance optimization and hardware-software cooperative design opportunities. The currently active projects in phase two are briefly described below.

A1: Basics of Invasive Computing

This project aims to enhance the programming language and framework developed specifically for Invasive Computing: InvadeX10. As its name suggests, it is the X10 programming language with syntax extensions to enable resource-aware programming. A framework has been developed around it, with a runtime system that allows for resource changes in both shared and distributed memory systems. The project also aims to develop X10 language extensions for the specification of non-functional guarantees, such as performance, correctness and fault tolerance. Support for any introduced language extensions is added to its runtime system.

Figure 3.1: Abstract overview of project groups and their application areas. Project area A (Fundamentals, Languages and Algorithms) spans the hardware abstraction levels ALU, TCPA, CPU and HPC; project areas B (Hardware Architecture), C (Compiler, Simulation, and Run-Time Support) and D (Applications) are stacked from the lowest to the highest abstraction level.

A3: Scheduling and Load Balancing

The additional flexibility provided by resource-aware programming enables additional scheduling and load balancing opportunities. This research subgroup focuses on identifying additional scheduling requirements for resource-aware runtime systems, in order to take full advantage of this additional flexibility. The output of this group is a set of techniques for scheduling resource-aware applications; these techniques are designed to improve load balancing and non-functional guarantees related to correctness, performance and energy, among other things.

A4: Design-Time Characterization and Analysis of Invasive Algorithmic Patterns

The added flexibility provided by resource-aware programming also increases the complexity of schedulers and compute algorithms. Because of this, it is necessary to develop new analysis techniques for invasive programs and runtime systems. The output of this group is a set of techniques and tools for the characterization and analysis of invasive software.

3.1.2 Group B Projects

Group B projects investigate hardware development opportunities that benefit invasive software and execution models. Specifically, they look at hardware features that benefit embedded systems, where invasive software can provide advantages in terms of predictability and other real-time guarantees; the research does not look at performance benefits exclusively. There are five projects currently active; they are described below.


B1: Adaptive Application-Specific Invasive Microarchitectures

This project investigates hardware mechanisms to be built into compute microarchitectures that facilitate the adaptation of resources for running applications. The aim is to have hardware resources that are reconfigurable during the run of applications. These resources include communication links, the number and type of computational units, amounts of memory, slices of the cache and even the length of instruction pipelines, among other things. The group aims to design and implement hardware features that enable applications to reserve hardware resources and get exclusive access to them.

B2: Invasive Tightly-Coupled Processor Arrays (TCPAs)

Researchers in this group evaluate design opportunities related to Tightly-Coupled Processor Arrays (TCPAs). Processor arrays can accelerate the processing of vector operations with high energy efficiency and predictable times. These types of workloads can be extracted from code with loops, which are common in scientific software and digital signal processing. This group investigates the efficient reconfiguration and allocation of these arrays for resource-aware computing.

B3: Power-Efficient Invasive Loosely-Coupled MPSoCs

Energy efficiency is ever more important for computing hardware designs. Designs for resource-aware computing are no exception. This group investigates opportunities provided by resource-aware hardware design and scheduling for improved energy efficiency. Special attention is given to the Thermal Design Power (TDP) of embedded systems, such that invasive designs do not exceed safe temperature limits while at the same time providing adequate performance and energy efficiency.

B4: Hardware Monitoring System and Design Optimization for Invasive Architectures

This project focuses on the design of monitoring facilities for invasive hardware. These facilities are essential for the evaluation of non-functional properties in resource-aware software, such as performance predictability, fault tolerance, energy efficiency, thermal safety, and others.

B5: Invasive NoCs – Autonomous, Self-Optimizing Communication Infrastructures for MPSoCs

Networks-on-Chip (NoCs) are necessary in current hardware designs with many computing elements. Communication bandwidth and latency need to scale well while requiring a reasonable number of transistors when compared to crossbars. This group designs on-chip networks that can be reconfigured at runtime to provide security and performance guarantees to resource-aware applications.

3.1.3 Group C Projects

Group C projects investigate system-level software and programming models for resource-aware computing. Different teams look at operating systems for distributed memory architectures, programming models for distributed algorithms, and security for resource-aware applications, among other things. There are four projects currently active in phase two; these are introduced and briefly described here.

C1: Invasive Run-Time Support System (iRTSS)

This research group looks mainly at features that help runtime systems support resource-aware software. A Unix-like operating system is part of the output of this group. The operating system has extensions that allow the adaptation of the resources of running applications. Extensions are being added that enable the specification of non-functional requirements (such as performance and power levels) for resource-aware applications.

C2: Simulative Design Space Exploration

In this project, researchers develop simulators based on models of invasive hardware and software. Simulations can then be used to steer the design process for invasive hardware, as well as aid the optimization of resource-aware software.

C3: Compilation and Code Generation for Invasive Programs

Code generation and transformation techniques for resource-aware software are investigated in this project. Code is generated for a variety of hardware resources with different dimensionalities, given the reconfigurable nature of the design of the CPU cores, the Tightly-Coupled Processor Arrays (TCPAs) and network links, among others. The compiler is developed with support for the InvadeX10 language and framework, and has back ends for different CPU architectures.

C5: Security in Invasive Computing Systems

Security in computing systems is of great importance. Because of the highly dynamic nature of resource-aware software, providing security for it is challenging. For example, memory that is reallocated from one application to another should not be readable by the receiving application; therefore, access control alone is not sufficient in the resource-aware case. This research group aims to provide reliable security solutions for invasive software.

3.1.4 Group D Projects

Group D projects focus on potential application areas for Invasive Computing. Most areas of computing are considered, from embedded systems to distributed memory supercomputers. There are currently two active projects; they are briefly described here.

D1: Invasive Software-Hardware Architecture for Robotics

This research group investigates the benefits and limitations of resource-aware computing in robotics. In particular, research is aimed at its application to humanoid robots in complex scenarios, where large computations need to take place concurrently and in a timely fashion.


D3: Invasion for High Performance Computing

There are three main areas of research in this group: numerical methods and implementations for resource-aware computing, the advantages of resource-aware computing in HPC, and support for current HPC hardware and programming models. The work presented in this document belongs to this group of the Invasive Computing project. This work focuses mainly on supporting Invasive Computing on state-of-the-art HPC hardware and software.

3.1.5 Group Z Projects

The general administration tasks of the Invasive Computing project are performed in the Z project groups. Contacts with important individuals, research sites and companies are managed by these groups. These groups also provide the necessary coordination for consolidated results, such as the demonstration platform produced by the Z2 project.

Z2: Validation and Demonstrator

The main goal of this project is to provide an FPGA-based hardware platform for validations and demonstrations. Contributions from multiple other projects are integrated in this platform.


4 Related Work

A broad discussion of related work is presented in this chapter. First, a non-exhaustive overview of programming languages, interfaces and resource managers that do not support elastic execution is presented. Although these works lack support for elastic execution, and are therefore not alternatives to what is proposed in this document, their overview is provided for completeness. Afterwards, closely related works that support resource-elasticity are discussed in detail.

4.1 Programming Languages and Interfaces without Elastic Execution Support

The work presented in this document is related to programming models and resource managers that target applications with high synchronization requirements. In this section, an overview is provided of related works that are not direct replacements for what is proposed in the rest of this document, but that could potentially be extended to support resource-elasticity in the future. First, programming languages and interfaces that target only parallel shared memory systems are discussed. Afterwards, programming languages and interfaces for distributed memory that only support resource-static execution are covered. Finally, solutions for cloud and grid computing systems that support resource-elasticity are described and compared to HPC solutions.

4.1.1 Parallel Shared Memory Systems

Shared memory is a term used to indicate the presence of a single address space across all processing elements. There are several parallel programming languages and interfaces that target shared memory systems exclusively. These have increased in number and importance due to the growth of parallelism in shared memory systems in recent years. These languages and interfaces are related to this work, since each node that is managed in a distributed memory HPC system is itself a parallel shared memory system. Brief descriptions of the most widely recognized languages and interfaces that target parallel shared memory systems are presented here. Additionally, a brief discussion on hybrid programming models with MPI is provided.

Open Multi-Processing (OpenMP)

Open Multi-Processing (OpenMP) [9, 76, 191] is an Application Programming Interface (API) that can be used by programmers to parallelize serial applications. The API is standardized by the OpenMP Architecture Review Board and is currently at version 4.5. The API is provided as pragma directives for Fortran, C and C++. Additionally, environment variables and compiler extensions are also defined in the specification. OpenMP has enjoyed support from several compilers over the years, both free and commercial, such as GCC, IBM XL, Intel, PGI, Clang [4, 17], and others.


The pragmas allow for the annotation of regions that can run in parallel in the source code of programs. Additionally, developers can provide instructions that specify how these parallel regions should be executed. For example, a region can be executed following a fork-join threading model or, since version 3.0 of the specification, following a task-based model [35]. OpenMP is known to allow for relatively simple conversions of serial Fortran and C programs into threaded programs. This makes it an attractive choice for software projects where significant resources have already been dedicated to the development and validation of preexisting source code.

Within the Invasive Computing research project, extensions to OpenMP and a runtime system have been developed to support resource-aware computing [100, 118]. The goal is to adjust the computing resources allocated to OpenMP applications running simultaneously in shared memory systems, such as CPU cores, based on their scalability. This previous research targeted parallel shared memory systems, while the research presented in this document targets distributed memory systems.
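A minimal sketch of the pragma-based fork-join model described above is shown here; the loop and variable names are illustrative only and are not taken from any Invasive Computing code.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* The loop iterations are distributed among the threads of the team;
           the reduction clause combines the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            sum += 1.0 / (i + 1.0);
        }

        printf("harmonic sum: %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

Compiled with an OpenMP-capable compiler (for example gcc -fopenmp), the number of threads can be controlled through the OMP_NUM_THREADS environment variable without changing the source code.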

POSIX Threads (PThreads)

The POSIX [1, 172] standard defines a thread API and model that is usually referred to as POSIX Threads, or PThreads [169] for short. The API is supported by several Unix and Unix-like operating systems, such as FreeBSD, NetBSD, OpenBSD, Solaris, Linux, Mac OS, etc. There are also implementations available for Microsoft Windows and other operating systems.

The API provides operations for thread management, such as creation, termination, synchronization and scheduling. For synchronization, there are mutexes, joins and condition variables. Mutexes are used to ensure exclusive access to resources, such as memory or external devices. Joins are operations that are used to wait at a specific thread for other threads to complete. A thread may create multiple other threads and then wait on them with a join operation, following a fork-join pattern. Finally, condition variables can be used to make threads wait until certain conditions are met.
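The fork-join pattern described above can be sketched as follows; the worker function and argument layout are illustrative assumptions and not part of the POSIX specification text.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    /* Each worker receives a pointer to its slot in a shared result array. */
    static void *worker(void *arg) {
        long *slot = (long *)arg;
        *slot = (*slot) * (*slot);   /* square the value placed in the slot */
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_THREADS];
        long results[NUM_THREADS];

        for (long i = 0; i < NUM_THREADS; i++) {
            results[i] = i + 1;
            pthread_create(&threads[i], NULL, worker, &results[i]); /* fork */
        }
        for (int i = 0; i < NUM_THREADS; i++) {
            pthread_join(threads[i], NULL);                         /* join */
        }
        for (int i = 0; i < NUM_THREADS; i++) {
            printf("thread %d produced %ld\n", i, results[i]);
        }
        return 0;
    }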

Cilk Plus

Cilk Plus [147, 210, 148] is a parallel programming language based on the C language with extensions for the definition of parallel loops and fork-join patterns. The language and its runtime have been improved over time and have become a commercial product.

The Cilk Plus language is designed to expose parallelism in the source code. Once the parallelism of a program is defined, a runtime system can schedule its work automatically on parallel shared memory systems. The spawn keyword is used for the creation of tasks, while the sync keyword is used to wait for them. These can be used for the creation of fork-join patterns. Cilk Plus also extends C with keywords for the creation of parallel loops, the specification of reduction operations, the creation of arrays and simplifications to array accesses, among other things.

The schedulers found in Cilk Plus runtime systems implement work-stealing [22] techniques that can effectively balance the load across executing units. With work-stealing, threads that are idle after finishing their own tasks early can take and execute entries from the work queues of other threads, effectively stealing them.


Threading Building Blocks (TBB)

Threading Building Blocks (TBB) [184, 73] is a C++ template library that provides high-level constructs that ease the creation of parallel programs, such as parallel loops, reductions, pipelines, queues, vectors, maps, memory allocations, mutexes and atomic operations. Similarly to Cilk Plus, it attempts to separate the definition of tasks from their actual execution in a parallel system. Its schedulers also implement work-stealing techniques.

Hybrid Programming Models

Shared memory programming models are numerous and are generally orthogonal to message passing. The passing of messages is unnecessary when data can be accessed directly in the same address space, without the need for copies; it is needed across nodes in distributed memory systems. There have been research efforts to bring the benefits of shared memory to MPI applications with minimal modifications to application source code and libraries [94, 168].

The combinations of message passing and shared memory programming models are usually referred to as hybrid programming models. Message passing with MPI can be combined with shared memory APIs and libraries, such as MPI and OpenMP [179, 56, 222, 141], MPI and POSIX Threads [173], etc. The idea is that a shared memory programming model can better abstract the parts of a program that map to the hardware inside of a node, while MPI can better abstract the operations that take place across nodes.

Shared memory programming models are not alternatives to the work presented in this document; they are instead complementary. In this work, resource adaptations are done at the node level, in units of whole nodes. Because of this, resource adaptations have no impact on any shared memory language or interface used for intra-node parallel programming in hybrid scenarios.
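A minimal hybrid sketch along these lines, assuming one MPI process per node with OpenMP threads inside each node; the loop bounds and variable names are illustrative only.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank;
        /* Request thread support where only the main thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        /* Intra-node parallelism via OpenMP threads. */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++) {
            local += 1.0;
        }

        double global = 0.0;
        /* Inter-node communication via MPI, called by the main thread only. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) {
            printf("global sum: %f\n", global);
        }
        MPI_Finalize();
        return 0;
    }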

4.1.2 Distributed Memory Systems

After the discussion about popular languages and interfaces for parallel programming in shared memory systems, a similar discussion is presented here for those that support distributed memory systems. The introduction to MPI is done later, in Chap. 5, since it is used as the basis for the elastic message passing research presented in this document. The alternatives to MPI presented here follow the Partitioned Global Address Space (PGAS) programming model.

High Performance Fortran (HPF)

High Performance Fortran (HPF) [153, 139, 218, 137] is an extension to the Fortran language. The portability of parallel programs across multiple HPC systems is one of its goals. Its implementations aim to provide high performance while preserving the higher productivity of the language when compared to MPI.

The HPF extensions add support for controlling data distribution. More specifically, the extensions allow for the mapping of arrays to distributed memory with layout control. Data parallel constructs can then be performed on the distributed arrays (e.g., forall). The pure attribute allows developers to identify functions that do not modify global memory, do not use pointers and do not perform IO. The independent directive allows programmers to identify blocks of code that can be performed in any order. The HPF extensions also include intrinsic functions for distributed array access, among other things.

Coarray Fortran (CAF)

Similarly to HPF, Coarray Fortran (CAF) [161, 130, 85, 113, 224, 192] started as an extension to Fortran that aims to increase productivity while maintaining competitive performance when compared to MPI. It was added to the official Fortran 2008 standard in 2010, with small changes to its original syntax. Thanks to its inclusion in the Fortran language standard, the number of implementations has risen in recent years and official support for the PGAS model is now available in Fortran.

The CAF extensions allow programmers to define grids of virtual processors. They also allow for the partitioning of data and its mapping to virtual processors. There are operations for the movement of data and for synchronization. Parallel programs are composed of multiple images that can be run in distributed memory systems. Images have a simple two-level memory hierarchy, where accesses to memory can be local or remote. Remote memory accesses are performed with put and get operations on global addresses. There are operations for barriers, critical regions and team synchronizations with notify and wait, among other things. Collective operations previously present in CAF were not included in Fortran 2008 and may appear in future standards [160, 161].

Chapel

Chapel [63, 37, 64, 194, 197] is a programming language that aims to ease the development of parallel software with support for distributed memory systems. Chapel provides a threaded programming model with language constructs that help represent parallelism. There are language grammars that can be used to represent data and task level parallelism. The language also allows the description of data decomposition. The expressiveness of the language aids runtime systems when optimizing the placement of data and computation. This is especially important in distributed memory systems due to the variability of latencies and bandwidth depending on localities.

Unified Parallel C (UPC)

Unified Parallel C [87, 32, 80] is an extension to the C programming language, similarly to how CAF and HPF are extensions to the Fortran language, with the purpose of simplifying the task of producing parallel software with support for distributed memory systems.

Its extensions to the C language allow developers to define data distributions and affinities. Distributed applications consist of multiple threads running independently. These can synchronize with barriers, locks and other memory consistency controls. The memory model consists of pointers in a global address space that can be shared or private. Pointers that are shared can reside in local or remote memory, depending on their affinities, since they are partitioned among threads. There are also language constructs for parallel computation and synchronization.

4.1.3 Cloud and Grid Computing

Programming models and resource managers that have achieved the goals of this work, namely efficient resource-elasticity and scheduling in distributed memory systems, already exist for cloud computing environments. Indeed, there are cloud and grid computing programming models and resource managers that allow for resource-elasticity [50, 39, 223, 217]. Some of these models even support fault-tolerance. In contrast, elastic execution and fault-tolerance research are still in their infancy in the HPC domain. The reason for this is that cloud workloads have lower synchronization requirements, since applications tend to follow client-server or MapReduce patterns. The same can be concluded from embarrassingly parallel workloads, such as parameter sweeps.

For the client-server pattern, a set of load balancers with worker hosts [41, 81, 82, 119, 181, 92] provides a reliable and well performing resource-elastic execution model with the added benefit of fault tolerance. This is possible thanks to the low synchronization requirements across user sessions in web server workloads. There are many applications and services on the web that fit this execution model. For this reason, cloud services from several vendors [97] continue to enjoy success in the marketplace.

Applications [86, 219, 68, 66] that fit the MapReduce [78, 79, 146] programming model well have a clear separation between their map operations and their reduce operations. Both map and reduce operations, due to their low synchronization requirements, can be allocated effectively to resources so that elastic execution and fault tolerance are achieved. Data is redistributed in an intermediate step between maps and reductions, usually referred to as a shuffle step. Outside of the shuffle step, there are hardly any synchronization requirements in MapReduce applications. There are several data processing algorithms that fit this model well.

Unfortunately, although the aforementioned solutions are mature and quite successful, scientific computing workloads do not fit well [159, 111] in their programming and execution models. The main sources of incompatibility are the large synchronization requirements due to data dependencies. The domain of a typical scientific application is partitioned across processing units. Depending on data dependencies, synchronizations that are typically frequent and periodic are required. Programming models that abstract synchronization operations better aid the development and maintenance of scientific applications.

It is important to note the differences between HPC clusters, cloud and grid computing systems [188, 228, 33]. HPC systems are designed with high performance networks to minimize the impact of synchronization and allow applications to scale to larger resource allocations efficiently. In contrast, cloud systems can be designed more economically with commodity networks, thanks to the low synchronization requirements of the workloads they target. The specialized networks used in HPC systems have lower latencies and higher bandwidths, but also higher purchase and maintenance costs when compared to commodity networks. Grid computing can support scientific computing workloads, since a grid system can be composed of multiple HPC systems that are geographically separated. Due to the additional latencies and lower bandwidth across geographic locations, it can be unfeasible to distribute a workload across sites. Grid computing software can be seen as complementary to a resource manager that manages a single HPC system at a single geographic location, since it can aggregate multiple of these systems.

4.2 Elastic Programming Languages and Interfaces for HPC

Elastic languages and interfaces are those that have abstractions to represent changes in resources at runtime; these can also be referred to as resource-elastic. In contrast, resource-static languages and interfaces have no abstractions for resource changes and operate under the assumption that resources are never modified during the execution of an application.

Elastic programming languages and interfaces can be classified based on whether they support shared memory systems only, or both shared and distributed memory systems. This is an important distinction, since the reconfiguration and movement of memory over a communication network are only necessary in distributed memory systems. In this section, only works related to elastic programming models and runtime systems that support distributed memory are discussed. In contrast to the previously discussed related works, these are highly relevant and are alternatives to the work presented in this document.

There are multiple past and ongoing research efforts related to resource-elasticity in HPC; however, these are not as numerous as those that only support resource-static execution. Developers of both resource-elastic applications and runtime systems need to carefully manage any added overheads related to the reconfiguration of resources and memory, because of their significant performance impact in distributed memory systems. In addition, runtime systems and resource managers need to be properly integrated to support resource-elasticity. It is possible that these additional challenges have limited the number of related works in this area.

4.2.1 Charm++ and Adaptive MPI

Exactly like this work, Charm++ [132, 23, 127, 3] and Adaptive MPI [120, 122, 121, 42] are motivated by the dynamic behavior of certain workloads, such as Adaptive Mesh Refinement (AMR) methods (as mentioned in the motivation, Chap. 2), where load imbalances often occur at runtime. Their solution for load imbalances in distributed memory systems is to implement MPI on top of the Charm++ runtime system. The result is Adaptive MPI, an MPI implementation that supports automatic load balancing, given that any preexisting MPI code is converted to meet certain conditions. Load balancing is achieved through thread migration mechanisms [229, 174]. A rank in Adaptive MPI is a user-level thread that is associated with Charm++ objects. Because ranks are threads, additional restrictions need to be applied to global variables when converting preexisting MPI code to Adaptive MPI. Automated tools are available to assist in the conversion of MPI code to Adaptive MPI [168, 170].

Adaptive MPI programs follow a message-driven execution model. Its runtime system picks threads that have their messages ready and can therefore continue making progress. The system relies on the oversubscription of ranks, where multiple ranks are pinned to each available CPU core. The ranks in Adaptive MPI can be migrated. There are programming constructs that allow the creation of programs where ranks can be migrated without the need for custom pack and unpack routines. Ranks can be dynamically load balanced by the runtime system. Load balancing techniques can be overloaded by user-provided implementations. Some of its runtime systems also provide fault tolerance, through automated checkpointing and restarts [62, 232, 231].

Adaptive MPI currently supports MPI up to version 2.2. The newer features of MPI 3.0 and later, such as non-blocking collectives, are not supported. Adaptive MPI has achieved performance and efficiency comparable to other MPI implementations. Distributed applications that achieve good strong scaling properties and overall performance have been developed with it. These compare favorably with their regular MPI versions when load balancing is of increased importance.


Resource-Elasticity with Charm++ and Adaptive MPI

Charm++ and Adaptive MPI are probably the projects most closely related to this work. There are several Charm++ and Adaptive MPI resource-elastic works that are highly relevant to what is presented in this document. Support for malleable jobs, which can have the number of nodes allocated to them modified at runtime, has been demonstrated [131, 110]. Some of these rely on the creation of a checkpoint, to then later restart with a modified thread count [110]. The cost of these operations can be mitigated by using shared memory [230]; results have been clearly better than checkpoints that are backed by file systems, although the overheads can still be large depending on the initialization costs of the application. There are limitations in some of the proposed solutions, such as the inability to use a larger number of resources than initially allocated [131]. Also, since ranks are abstracted as threads, in some implementations there are processes left running on preexisting resources. These are used for messaging or other Charm++ related support operations [131], and can degrade the performance of other processes in the same node.

Similarly to this work, many of the resource-elastic solutions that are based on Charm++ and Adaptive MPI have been paired with resource management research [110, 177, 178]. This is necessary to enable the adaptation of the resources of applications in shared systems; the application programming language or API, the runtime system and the resource manager need to support resource-elasticity and be properly integrated.

The solution proposed in this work follows the current MPI execution model of processes with private address spaces and no oversubscription, instead of the threading with oversubscription and the message-driven execution model found in Charm++. The current MPI model better prevents interference between applications, but does not provide automatic load balancing.

4.2.2 The X10 Programming Language

X10 [65, 189, 167, 74, 207, 163, 21] is an object-oriented programming language with distributed arrays. It follows the Partitioned Global Address Space (PGAS) programming model and features strong typing. Its runtime system provides a garbage collector. Similarly to other languages that follow the PGAS model, one of its goals is to improve the productivity of application developers when writing applications for distributed memory systems. It differs from other PGAS languages in that many of its constructs are designed to allow asynchronous execution.

As may be expected, X10 shares many similarities with its peer PGAS languages, such as a two-layered memory model with local and remote memory, constructs for parallel execution and synchronization, distributed arrays, etc. The language is supported by a compiler, a runtime system and a standard library that are all extensions to their preexisting Java counterparts, with the addition of optional C++ back ends.

Resource-Elasticity with X10

The language is attractive for resource-elasticity support since it abstracts resources (e.g., nodes in a distributed memory system) with the concept of Places. There is also the concept of PlaceGroups, which are ordered sets of Places. Computations and data are distributed across Places. Mapping routines attempt to optimize the location of Places based on network topologies in order to improve performance. Support for elastic execution was added with version 2.5, by allowing applications to execute over dynamically varying sets of Places in PlaceGroups.

X10 is one of the core programming languages supported by the Invasive Computing project. Researchers have extended the X10 language to support the goals of the project, such as the specification of non-functional requirements (like performance, energy, etc.), support for resource-aware programming, and the addition or removal of Places at runtime, among other things. The project has produced X10 programs and a full X10 stack [175, 176, 112, 185, 164, 49, 51]: compilers, an operating system, custom runtimes, and even hardware support.

4.2.3 Parallel Virtual Machine (PVM)

The Parallel Virtual Machine (PVM) [10, 38, 205, 158, 99, 108] system allows a set of nodes to be viewed as a single parallel computer. The set of nodes is managed by the user and can be modified at runtime. This allows for resource-elasticity and some forms of fault-tolerance. Like MPI, it follows the Message Passing (MP) programming model and supports distributed memory systems.

The main goal of MPI is to provide a message passing interface only, while PVM abstracts a distributed operating system with support for message passing. PVM provides operations to spawn tasks and coordinate them, as well as to modify the parallel machine itself. MPI had spawn operations added in version 2.0 of the standard. The spawn operations in MPI depend on its integration with resource managers, while PVM simply spawns new tasks as requested by application processes.

The PVM system is composed of two main parts: a daemon that runs at each node and a runtime library. The daemons need to be started before applications run on the nodes. The library is linked into application binaries and provides the implementation of the PVM API. The typical PVM application is started as a single task that spawns other tasks. Once the tasks are started, they can start exchanging messages. Individual tasks have unique identifiers that are used to send and receive messages.

Resource-elastic behavior can be achieved with PVM within the resources of a single job. The system provides no coordination with resource managers; this makes PVM inadequate for resource-elasticity in systems with multiple users. Additionally, its message passing features and performance are limited when compared to current MPI implementations.

4.2.4 Other Related Works

Several research groups have demonstrated the benefits of malleable jobs when optimizing parallel compute systems [91, 83, 123, 215, 57, 204, 165, 46]. Multiple works also describe the need for performance feedback from applications to schedulers, mainly to improve the quality of resource adaptation decisions [26, 203].

Standard MPI applications that rely on spawn operations have been used with customized resource managers to achieve resource-elasticity [59]. The MPI extensions proposed in this work provide several advantages over these approaches and are discussed in detail in Chap. 5. Some other alternatives rely on the creation of checkpoints by the applications, and on their ability to restart at different scales [183, 110, 25, 216], with large IO and reinitialization overheads. The solution presented in this document enables resource adaptations with less overhead, thanks to memory-to-memory repartitions over the network within adaptation windows. The trade-off is that adaptation windows need to be developed.

5 The Message Passing Interface (MPI)

Message Passing (MP) is a widely used programming model for distributed memory systems. The Message Passing Interface (MPI) is a standard for message passing that has been of great importance for both communication library implementors and application developers. The standardization efforts of the MPI Forum [18] have allowed for compatibility between vendors at the source code level. This means that applications are portable across distributed memory HPC systems.

Portable code was rare in earlier years, when systems relied largely on proprietary Application Programming Interfaces (APIs) and libraries for inter-node communication. In past years, many different communication libraries were provided by different vendors for distributed memory computing. Each vendor had its own view on how distributed memory applications should be developed. While this flexibility allowed communication libraries to provide APIs that closely matched the specialized hardware that they abstracted, software had to be ported to each new machine.

A need for standardization was recognized and the MPI standard was eventually defined. Its first version was released in 1994. The standard defines an API only; it is up to each vendor to decide how to implement it. Two open source MPI implementations are currently the most widely used: Open MPI [96, 105, 103, 154, 104, 115, 7] and MPICH [209, 107, 109, 52, 53, 6]. Some of the current commercial MPI libraries are based on these libraries, with the addition of vendor specific customizations for better performance or specific hardware support.

The operations specified in the standard are designed to allow a wide range of communication hardware to be supported efficiently. In some cases, the standard itself has been updated to allow for better efficiency on new communication network hardware. For example, the one-sided communication API was updated in version 3.0 of the standard to better fit newer RDMA implementations.

In this chapter, a non-exhaustive list of the features provided by MPI is briefly introduced first. Afterwards, limitations in the current specification and implementations of the dynamic processes support of MPI are identified; the set of extensions described later in Chap. 6 is proposed as a way to overcome these limitations. Finally, an overview of the MPICH library is presented, since it is the basis for the Elastic MPI library presented in Chap. 6.

5.1 MPI Features Overview

In this section, the API specified by the MPI standard in its current 3.1 version is briefly described. MPI offers API sets for: point-to-point communication, collective communication, one-sided communication, parallel IO, derived data types, virtual topologies, group and communicator management, and more.


Figure 5.1: Simplified overview of MPI communication and buffering for small and medium buffers (typically smaller than a megabyte) on a four process application with a counterclockwise ring communication pattern (each process posts an MPI_Send and an MPI_Recv, and data moves through internal MPI communication buffers).

5.1.1 Data Types

Instead of using types from the C or Fortran programming languages, MPI defines its own data types. This allows MPI libraries to adapt and align memory properly when transferring messages across machines or software stacks with incompatible type representations. MPI provides a set of basic types, such as MPI_INT and MPI_DOUBLE. In addition, users can create their own derived data types, for example, to represent vectors or structures. Data types need to be specified in all MPI operations that perform message transfers or perform distributed arithmetic on buffers.
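As a brief illustration of derived data types, the following sketch commits a type describing a simple C struct; the struct layout and field names are illustrative assumptions and not part of the standard text.

    #include <mpi.h>
    #include <stddef.h>

    /* A hypothetical particle record to be transferred as one unit. */
    typedef struct {
        double position[3];
        int    id;
    } particle_t;

    /* Builds an MPI derived data type matching particle_t. */
    MPI_Datatype make_particle_type(void) {
        MPI_Datatype particle_type;
        int          block_lengths[2] = {3, 1};
        MPI_Datatype block_types[2]   = {MPI_DOUBLE, MPI_INT};
        MPI_Aint     displacements[2] = {offsetof(particle_t, position),
                                         offsetof(particle_t, id)};

        MPI_Type_create_struct(2, block_lengths, displacements, block_types,
                               &particle_type);
        MPI_Type_commit(&particle_type);
        return particle_type;  /* usable in sends, receives and collectives */
    }

The committed type can then be passed wherever a basic type such as MPI_INT would be used, so whole particle_t buffers are transferred without manual packing.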

5.1.2 Groups and Communicators

Groups are ordered sets of processes in MPI. Communicators enable communication among the processes of a group. After initialization, MPI libraries provide a communicator that includes all the processes that were started as part of an application: the MPI_COMM_WORLD communicator. This communicator is usually sufficient when managing the communication of a few processes. However, when applications reach a certain level of complexity, it helps to create groups and communicators to modularize the code of MPI applications.

A typical application will first duplicate a communicator, and then split or divide it in ways that benefit the clarity of the algorithms in the distributed application. Communicators can be manipulated directly, or alternatively, groups can be created first and communicators then created from them. The latter approach requires more steps but has flexibility advantages. For example, there are union and intersection operations that can be applied to groups but not to communicators.
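A small sketch of the duplicate-then-split pattern mentioned above; the color criterion used here, grouping ranks into pairs, is arbitrary and for illustration only.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Work on a duplicate so library and user traffic do not share tags. */
        MPI_Comm dup_comm;
        MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm);

        /* Split into sub-communicators; ranks with the same color end up together. */
        int color = world_rank / 2;
        MPI_Comm sub_comm;
        MPI_Comm_split(dup_comm, color, world_rank, &sub_comm);

        int sub_rank, sub_size;
        MPI_Comm_rank(sub_comm, &sub_rank);
        MPI_Comm_size(sub_comm, &sub_size);
        printf("world rank %d -> sub-communicator %d, rank %d of %d\n",
               world_rank, color, sub_rank, sub_size);

        MPI_Comm_free(&sub_comm);
        MPI_Comm_free(&dup_comm);
        MPI_Finalize();
        return 0;
    }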

5.1.3 Point-to-Point Communication

The point-to-point set of operations in the MPI standard allows for the transmission of bytes from one specific process to another. There are several variants available with different send modes: default, ready, synchronous and buffered. The status of the receiver end is undefined in the default variant. Ready sends complete when the matching receive has been posted. Synchronous sends only complete when the matching receive has also completed. Finally, buffered sends rely on user-provided buffers for their operation.

In addition to send modes, these operations also have blocking and non-blocking versions. Blocking versions do not return until the operations have been completed, while non-blocking versions return immediately. When using non-blocking versions, the application needs to check the status of the operations with the wait or test operations. The non-blocking operations allow for the overlap of communication and computation.

Figure 5.1 provides a simplified visual overview of the interactions of application and MPI communication buffers during point-to-point communication. In the figure, four processes are depicted performing a send and a receive each in a counterclockwise ring pattern. As can be seen, several copies are performed from application buffers to MPI communication buffers and the other way around. This is common in most implementations with internal eager or rendezvous communication protocols at the byte transfer layer; these protocols are used when transferring small to medium buffers (before buffer sizes where DMA transfers become more efficient). Typical MPI implementations, including MPICH (discussed in Sec. 5.3), decompose collective operations into multiple point-to-point messages; therefore, this figure also applies to most types of communication that do not rely on network hardware acceleration (such as RDMA or hardware collective operations).
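A minimal sketch of a ring exchange like the one depicted in Fig. 5.1, using non-blocking operations so that posting order cannot cause a deadlock; the single-integer payload is illustrative only.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank - 1 + size) % size;  /* neighbor to receive from */
        int right = (rank + 1) % size;         /* neighbor to send to      */

        int send_value = rank, recv_value = -1;
        MPI_Request requests[2];

        /* Post both operations first, then wait: independent computation
           placed before MPI_Waitall may overlap with the communication. */
        MPI_Irecv(&recv_value, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &requests[0]);
        MPI_Isend(&send_value, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &requests[1]);
        MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

        printf("rank %d received %d from rank %d\n", rank, recv_value, left);

        MPI_Finalize();
        return 0;
    }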

5.1.4 One-Sided Communication

Network hardware with Remote Direct Memory Access (RDMA) features can improve communication performance by reducing latencies and overheads related to buffering and synchronization. In addition, RDMA allows for better overlap of communication and computation. MPI added its one-sided communication API to allow implementations to efficiently support RDMA hardware. This API was introduced in version 2.0 of the standard, and was updated to match more recent RDMA capable network hardware in version 3.0. With these operations, MPI implementations can reduce the amount of memory needed for buffering and the number of memory copies performed in typical communication protocols (refer to Fig. 5.1). In cases where hardware RDMA is not available, most MPI libraries fall back to an internal point-to-point based implementation; this way, applications that use one-sided communication remain portable.

In this mode of communication, MPI processes create memory windows that can be accessed by remote processes. They can then read and write data to their own and remote buffers. Synchronization operations are also provided, to prevent race conditions. Figure 5.2 provides an illustration of a possible interaction between two processes. In this case, the process with MPI rank 0 transfers some data from its own address space towards the memory window of the remote process with rank 1. Rank 0 also transfers data from the remote window of rank 1 into its own address space. The same operations can be performed by the process with rank 1 on the window created by rank 0.

Figure 5.2: Put and get operations, both initiated by process 0, using MPI one-sided communication.
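A compact sketch of the window, put and get operations just described, using fence synchronization and assuming at least two processes; the buffer layout and values are illustrative only.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process exposes two integers through a window. */
        int window_buffer[2] = {100 + rank, 200 + rank};
        MPI_Win win;
        MPI_Win_create(window_buffer, 2 * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        int put_value = rank, got_value = -1;

        /* The fence opens an access epoch; rank 0 writes slot 0 and reads
           slot 1 of the window exposed by rank 1. */
        MPI_Win_fence(0, win);
        if (rank == 0) {
            MPI_Put(&put_value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
            MPI_Get(&got_value, 1, MPI_INT, 1, 1, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);  /* closing fence completes the transfers */

        if (rank == 0) {
            printf("rank 0 read %d from the window of rank 1\n", got_value);
        }
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }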

5.1.5 Collective Communication

MPI also provides operations that work on groups of processes. These can be synchronization operations such as a barrier, data transfer operations such as broadcasts, and collective operations on data such as reductions. With MPI 3.0, non-blocking versions of these operations were introduced. Neighborhood collectives were also introduced with MPI 3.0; these can be more efficient on certain communication patterns, such as those generated by stencil based distributed solvers.

The use of collectives is highly recommended to MPI application developers. The internal collective algorithms implemented within MPI libraries perform well, based on long term research related to their efficiency; in addition to this, MPI implementations can take advantage of hardware network collectives when available. It is very unlikely that users will achieve better performance than the well tuned internal implementations provided by Open MPI or MPICH. Take for example the sequence diagram of a naive algorithm presented in Fig. 5.3: an application developer may be tempted to perform this sequence of operations with MPI point-to-point operations (a gather, followed by a reduction and finally a broadcast), instead of relying on the well researched and optimized internal implementation abstracted by the MPI ALLREDUCE operation.

Figure 5.3: Sequence diagram of a naive all-reduce operation implementation.
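The contrast can be sketched as follows (a hedged illustration; the helper function names are not from any library): the hand-rolled version reproduces the gather, reduction and broadcast of Fig. 5.3, while the second version delegates everything to a single collective call.

#include <mpi.h>
#include <stdlib.h>

/* Naive all-reduce: gather all values at rank 0, reduce locally, broadcast. */
void naive_allreduce_sum(double *value, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double total = 0.0;
    double *all = (rank == 0) ? malloc(size * sizeof(double)) : NULL;

    MPI_Gather(value, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);
    if (rank == 0) {
        for (int i = 0; i < size; i++)
            total += all[i];
        free(all);
    }
    MPI_Bcast(&total, 1, MPI_DOUBLE, 0, comm);
    *value = total;
}

/* The recommended alternative: one call, internally tuned by the library. */
void library_allreduce_sum(double *value, MPI_Comm comm)
{
    double total;
    MPI_Allreduce(value, &total, 1, MPI_DOUBLE, MPI_SUM, comm);
    *value = total;
}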

5.1.6 Parallel IO

MPI offers an abstraction to parallel file systems through its MPI-IO API, introduced in version 2.0 of the standard. The Input-Output (IO) API makes applications portable across multiple distributed file system implementations. In contrast to POSIX or proprietary parallel IO APIs, MPI IO benefits from its integration with MPI. For example, its operations can work transparently with MPI data types, including complex derived data types created by MPI application developers.

Similar to other parts of the MPI standard, the IO API is designed to allow for efficient implementations. Several optimizations are possible. For example, since the networks of HPC systems typically have lower latencies and higher bandwidth than their file systems, implementations can rely more on the network and minimize the number of accesses to the distributed parallel file system. Implementations can also streamline the order and reduce the number of IO operations.
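A hedged sketch of a collective write with MPI-IO follows; the file name, block layout and function name are illustrative:

#include <mpi.h>

/* Each rank writes its block of a distributed array to a shared file at a
 * rank-dependent offset, using a collective write so the library can
 * aggregate and reorder the IO requests. */
void write_blocks(const double *block, int count, MPI_Comm comm)
{
    int rank;
    MPI_File fh;
    MPI_Comm_rank(comm, &rank);

    MPI_File_open(comm, "checkpoint.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}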

5.1.7 Virtual Topologies

The development of distributed memory applications poses additional challenges to computer scientists. Any type of abstraction that can simplify the description of distributed algorithms is a welcome addition to any programming model. MPI virtual topologies are one such feature: developers can define them to simplify the implementation of distributed algorithms, while at the same time exposing communication patterns to the MPI implementation. These patterns can be used by MPI libraries to improve the order of ranks based on the proximity of actual processes in the real network topology of a supercomputer.

With virtual topologies, applications may define a Cartesian grid or a graph, where the nodes are the processes and the edges indicate which processes communicate with each other. The lack of an edge does not impede communication between processes, so arbitrary point-to-point communication is still possible. For example, a Cartesian grid topology may be defined to simplify the communication of an algorithm that operates on a checkered board type of data distribution. Figure 5.4 depicts the organization of 9 MPI processes in a 3 by 3 Cartesian grid virtual topology with wraparound. The top number indicates their ordered ranks in the MPI COMM WORLD communicator, while the bottom pair indicates their location in the Cartesian grid communicator. The more general case is the graph topology, where any arbitrary relationship can be described.

Figure 5.4: MPI processes organized in a 3 by 3 Cartesian grid virtual topology.
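The grid of Fig. 5.4 could be created as in the following sketch (assuming exactly 9 processes; the function name is illustrative):

#include <mpi.h>

/* Create a 3x3 periodic Cartesian grid and look up the neighbors of the
 * calling process along both dimensions. */
void build_grid(MPI_Comm comm)
{
    int dims[2]    = {3, 3};
    int periods[2] = {1, 1};   /* wraparound in both dimensions */
    int coords[2], up, down, left, right, rank;
    MPI_Comm cart;

    /* reorder = 1 lets the library permute ranks to match the physical topology */
    MPI_Cart_create(comm, 2, dims, periods, 1, &cart);

    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);

    /* Neighbors along dimension 0 (rows) and dimension 1 (columns). */
    MPI_Cart_shift(cart, 0, 1, &up, &down);
    MPI_Cart_shift(cart, 1, 1, &left, &right);

    MPI_Comm_free(&cart);
}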


5.2 Dynamic Processes Support and its Limitations

A large number of scientific applications rely on MPI to achieve scaling with multiple nodes in HPC systems. Scaling these applications to large numbers of nodes is in many cases necessary due to memory and performance requirements. The original MPI computational model was static in terms of the number of processes. Dynamic processes support was added in version 2.0 of the standard, mainly through the addition of the operations MPI COMM SPAWN and MPI COMM SPAWN MULTIPLE. Resources can be added to a running MPI application with the use of these operations. In this section, an enumeration of shortcomings in these operations is presented. These shortcomings will be later addressed by the proposed extensions, described in Chap. 6.

The spawn operations create new processes in a separate child process group, where the callers belong to the parent process group. The callers, or parents, block in the spawn operation while the resource manager creates the new children processes. The new processes are created in a different way than processes created through the launcher command provided by the resource manager (e.g., mpiexec, mpirun, srun, etc.): during the MPI INIT operation, the child process group needs to communicate with the parent process group to collectively generate an intercommunicator. This intercommunicator can then be used by application processes to reach the remote group (the parent group from the children, or the child group from the parents). The difference between the regular spawn and the multiple version is that the latter allows the callers to specify multiple binaries to start as children processes. The flow chart in Fig. 5.5 illustrates the algorithm for MPI COMM SPAWN in MPICH.

Additional operations were also introduced so that applications could start independently and later connect their process groups, such as: MPI OPEN PORT, MPI CLOSE PORT, MPI PUBLISH NAME, MPI UNPUBLISH NAME, MPI LOOKUP NAME, MPI COMM ACCEPT, MPI COMM CONNECT and MPI COMM JOIN. With these operations, the creation of the new processes is done externally (e.g., with the provided launcher) and the groups establish communication and generate an intercommunicator with a combination of these operations afterwards.

Figure 5.5: Algorithm (flow chart) of the MPI COMM SPAWN operation as implemented in MPICH.
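For reference, a minimal parent-side use of the standard spawn operation might look as follows (the worker binary name and process count are illustrative):

#include <mpi.h>

/* All parents collectively create 4 new worker processes and obtain an
 * intercommunicator to reach the disjoint child group. */
void spawn_workers(MPI_Comm comm)
{
    MPI_Comm children;
    int errcodes[4];

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0 /* root */, comm, &children, errcodes);

    /* The parents block here until the children have been created and have
     * completed MPI_Init; afterwards both groups communicate over the
     * intercommunicator 'children'. */
}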


The following shortcomings related to the MPI 2.0 dynamic processes operations have been identified:

1. The spawn operations are synchronous across both the parent and the children pro- cess groups.

2. These operations produce intercommunicators based on disjoint process groups.

3. Subsequent creations of processes result in additional process groups.

4. Destruction of processes can only be done on entire process groups.

5. The adaptation of resources can only be initiated by the application.

6. Processes created with spawn are typically run in the same resource allocation.

The first shortcoming affects performance. The spawn operations are collective across both the parent and child process groups. This is particularly taxing for the parent processes, since they block while the children processes are being created by the resource manager, then complete their MPI INIT operation and finally establish communication. The process creation delay can be several seconds in current systems. Additionally, the initialization process (the call to MPI INIT after the children processes start) can also take several seconds (depending on the number of processes). A non-blocking version of these operations could be developed to overcome this issue.

The second shortcoming complicates the development of MPI applications and limits the amount of preexisting code that can be reused. Most computational kernels are designed around a single communicator (e.g., MPI COMM WORLD or a derived communicator) where the rank orders are important for the correctness of the algorithm. For example, many applications derive communicators with virtual topologies (as explained in Sec. 5.1.7). Having multiple separate process groups complicates the generation of derived communicators. To be fair, an application can use MPI INTERCOMM MERGE to create a flat intracommunicator from the intercommunicator provided by these operations.

The third shortcoming also complicates the development of MPI applications that require dynamic processes. As mentioned before, an application can indeed use MPI INTERCOMM MERGE to create a flat intracommunicator from an intercommunicator provided by the current operations; however, subsequent creations of processes will require careful management of merged communicators. The application will accumulate more and more process groups and intercommunicators as a result.

The fourth shortcoming limits what can be achieved with dynamic processes. With the current operations, only whole groups of processes can be destroyed, by having them call MPI FINALIZE and manipulating any previously generated intercommunicators or intracommunicators (e.g., through MPI INTERCOMM MERGE) so that these process groups are no longer included. This limits the granularity and location of the resources that the application can release at runtime.

The fifth shortcoming limits the quality of adaptations. Applications have no information related to the availability of resources or the state of other applications in the system. In contrast, resource managers have access to this information. In addition to this, resource managers have control over running applications, have access to the pool of pending jobs, can pick which applications from the queue to start and when, and so on. Inverting this control, so that the resource adaptation is initiated by the resource manager, is preferable.


Finally, the last shortcoming limits the usefulness of the spawn operation. Most implementations that support MPI COMM SPAWN create any new processes in resources already allocated to the job. This is not a limitation of the standard operation itself, but it is found in typical resource manager implementations. The extensions presented in this work (in Chap. 6) are better integrated with the resource manager (described in Chap. 9), such that processes are created on new resources before being added to the allocation of a job; this ensures that new processes can contribute to the performance of the application, since they run on additional hardware resources.

5.3 MPICH: High-Performance Portable MPI

One of the goals of this work is to overcome the limitations of the current standard dynamic processes support in MPI, as described in Sec. 5.2. For this, it was determined that an extension to MPI was necessary, with an accompanying implementation as proof of concept. For the extended implementation, there were two options: to develop a new MPI library from scratch or to extend an existing open source implementation of MPI. The latter was chosen, since the proposed extension to MPI is compact (only four additional operations, described in Chap. 6) and developing a completely new MPI library is a long and challenging undertaking.

There are two dominant open source MPI projects today: MPICH [209, 107, 109, 52, 53, 6] and Open MPI [96, 105, 103, 154, 104, 115, 7]. Given the maturity, network hardware support and performance levels of both software projects, both libraries were adequate as the basis for the MPI library prototype. MPICH was selected based in large part on the author's familiarity with this software project from previous work [71, 72, 69, 70, 195]. An overview of the software architecture of MPICH is presented in this section.

5.3.1 Software Architecture

The MPICH library's design contains three main internal software layers: the MPI, device and channel layers. The MPI application interacts with the MPI layer over the standard API. The MPI layer converts MPI operations into their device layer equivalents. Finally, the device decomposes each operation into one or more low level operations on specific network hardware. Network hardware is abstracted behind the channel API. For the prototype presented in this work, the CH3 device and the Nemesis channel implementations were selected, as shown in Fig. 5.6.

MPICH's software architecture is preserved in this work and new non-standard MPI operations are added. These operations have their MPI and device layer counterparts, as well as any device and channel extensions that were necessary to support them.

Figure 5.6: MPICH's software architecture.


5.3.2 MPI Layer

The MPI layer ensures compliance with the MPI standard. Additionally, parameter validation and error handling are performed in this layer, and errors are propagated to the calling application. As expected given the tiered architecture, the MPI layer exchanges information exclusively with the device layer of MPICH and the caller application.

5.3.3 Device Layer

The device layer provides operations that are counterparts of every actual MPI operation that is part of the standard. These counterparts do not conform to the standard, and use instead implementation specific data structures and, in some cases, different numbers of parameters. Often multiple variants of these implementation specific operations are provided with different performance and scaling characteristics; these are selected by the MPI layer depending on the parameters passed by the calling application. For example, for point-to-point operations, the MPI layer may select the eager or rendezvous versions of the operations based on the size of the buffer to be transferred. In the case of collectives, in addition to buffer sizes, the MPI layer may select internal collective implementations based on the number of processes in the specified communicator.

5.3.4 Channel Layer

The lowest layer is the channel layer. This layer is designed mainly to abstract different types of networks. It provides low level byte transfer facilities, as well as network specific operation overloads for hardware accelerated functionality, such as RDMA and hardware collectives.

NEMESIS Channel and its Network Module (NETMOD) System

The Nemesis channel implementation provides a hybrid shared memory and network communication subsystem. This hybrid design allows for fast intra-node communication through shared memory, while preserving support for inter-node communication over network interfaces. The shared memory part of the implementation is essentially the same on all supported platforms; for it, only the type of shared memory model is selected (e.g., POSIX or System V), while the algorithms and data structures remain the same. The network part has multiple implementations available that are specific to different types of network hardware.

A network module (NETMOD) system is implemented within NEMESIS. This system adds an additional layer to the software architecture, as can be seen in Fig. 5.6. Similar to other layers, it has its own API and data structures. Support for different network hardware can be configured by selecting the correct network module.


6 Elastic MPI Library

The Elastic MPI Library is composed of an MPICH base with a set of new operations and their required resource manager integration. The new operations serve as an alternative to the standard dynamic processes support in MPI. The API extension and its implementation were designed with these primary goals in mind: latency hiding, minimal collective latency and ease of programming.

In this chapter, the interfaces of the new operations in the extension are first explained. Afterwards, their implementation and integration with the proposed elastic resource manager (covered in Chap. 9) are described.

6.1 MPI Extension Operations

A set of four operations that overcome the limitations of the dynamic processes support of MPI, discussed before in Sec. 5.2, is proposed. The operations were designed taking into account the needs and goals of developers of MPI applications, resource managers and MPI libraries. The design simplifies the development of elastic MPI applications, while allowing for efficient implementations of both the resource manager and the proposed new MPI operations in communication libraries.

The general flow of an MPI application is different with the use of the proposed extensions. First, the application initializes itself in adaptive mode by calling the new initialization operation. Afterwards, it starts doing its usual work while checking for adaptation instructions from the resource manager with the probe operation. In the event of an adaptation, the application needs to reach a safe location (typically at the beginning or end of a computational loop or phase), where it can begin the adaptation. Once an adaptation is completed, the application resumes progress on its computations.

To perform an adaptation, applications rely on the creation of adaptation windows with the new begin and commit adaptation operations. The begin adaptation operation provides the application with helper communicators; the application can use these communicators to redistribute its data. Once an adaptation is complete, the application calls the commit operation, where the MPI COMM WORLD communicator is transformed permanently. A code snippet of the expected structure of an elastic application that uses the proposed extensions is provided in Listing 6.1. Figure 6.1 shows a simplified sequence diagram of an application that starts with 5 processes and later expands to 7.


There are important differences between these operations and the current ones included with standard MPI:

1. During resource expansions, the new and preexisting process groups operate asynchronously.

2. The operations produce modifications on MPI COMM WORLD and its process group.

3. Subsequent adaptations result in the same number of process groups.

4. The resource manager can give instructions to destroy processes without group size restrictions.

5. The adaptations are initiated by the resource manager, and not the MPI application.

6. Resource managers can provide extra resources or remove resources of a job together with the adaptation instructions.

These differences allow the new operations to overcome the limitations of the standard dynamic processes API, as described previously (refer to Sec. 5.2). In the following sections, each operation is described in detail with its actual C and Fortran interfaces.

int main(int argn, char **argc) {
    MPI_Init_adapt(&argn, &argc, &local_status);
    for (...) {
        MPI_Probe_adapt(&adapt, ...);
        if (local_status == MPI_ADAPT_STATUS_JOINING || adapt == MPI_ADAPT_TRUE) {
            MPI_Comm_adapt_begin(...);
            // adaptation window's body with
            // data redistribution code
            MPI_Comm_adapt_commit(...);
        }
        // compute and MPI code
    }
}

Listing 6.1: Expected structure of an elastic MPI application.


Figure 6.1: Adaptation sequence from 5 to 7 processes.


6.1.1 MPI Initialization in Adaptive Mode

The first operation in the extension allows applications to be initialized in adaptive mode. Its interface is identical to the standard MPI INIT with the exception of an extra output parameter: local status. The C and Fortran versions of this operation can be seen in Listings 6.2 and 6.3.

int MPI_Init_adapt(
    int *argc,
    char ***argv,
    int *local_status
);

Listing 6.2: MPI INIT ADAPT C interface.

SUBROUTINE MPI_INIT_ADAPT(local_status, ierror)
    INTEGER local_status, ierror
END SUBROUTINE MPI_INIT_ADAPT

Listing 6.3: MPI INIT ADAPT Fortran interface.

In this operation, the output parameter local status can take two possible values: MPI ADAPT STATUS NEW or MPI ADAPT STATUS JOINING. It is set to new when the process doing the MPI initialization was created through the launcher command provided by the resource manager (e.g., mpiexec, srun, etc.). In contrast, when the process doing the initialization was created by the resource manager as part of an expansion of resources, then the local status is set to joining.

The addition of this extra output parameter at initialization is necessary since joining processes need to call the MPI COMM ADAPT BEGIN operation immediately after initialization, so that they can take part in an adaptation window together with preexisting processes and other joining processes.
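A minimal sketch of this joining path, mirroring Listing 6.1 (the redistribution code is elided):

#include <mpi.h>

int main(int argc, char **argv)
{
    int local_status;
    MPI_Comm intercomm, new_comm_world;
    int staying_count, leaving_count, joining_count;

    MPI_Init_adapt(&argc, &argv, &local_status);

    if (local_status == MPI_ADAPT_STATUS_JOINING) {
        /* Newly created process: enter the adaptation window immediately. */
        MPI_Comm_adapt_begin(&intercomm, &new_comm_world,
                             &staying_count, &leaving_count, &joining_count);
        /* ...receive redistributed data from the preexisting processes... */
        MPI_Comm_adapt_commit();
    }
    /* ...continue into the application's progress loop... */
}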

6.1.2 Probing Adaptation Data

The next proposed operation allows preexisting processes to probe the resource manager for adaptation instructions. As mentioned before, the decision of when and how to do the adaptation comes from the resource manager; this is an inversion of control when compared to the standard spawn operations offered by the dynamic processes support of the MPI standard. The C and Fortran versions of the probe operation can be seen in Listings 6.4 and 6.5. There are a total of three output parameters in this operation.

int MPI_Probe_adapt(
    int *pending_adaptation,
    int *local_status,
    MPI_Info *info
);

Listing 6.4: MPI PROBE ADAPT C interface.

SUBROUTINE MPI_PROBE_ADAPT(pending_adaptation, local_status, info, ierror)
    INTEGER pending_adaptation
    INTEGER local_status
    INTEGER info
    INTEGER ierror
END SUBROUTINE MPI_PROBE_ADAPT

Listing 6.5: MPI PROBE ADAPT Fortran interface.


The pending adaptation parameter tells the application whether there is an adaptation pending, with the values MPI ADAPT TRUE or MPI ADAPT FALSE. The local status parameter tells the calling process what its individual status is. This variable has three possible values: MPI ADAPT STATUS JOINING, MPI ADAPT STATUS STAYING or MPI ADAPT STATUS LEAVING. Each process is required to operate based on its local status inside adaptation windows.

A process is joining if it was created by the resource manager in newly allocated resources. In this case, the status is redundant information from MPI INIT ADAPT, and joining processes can skip calling the probe operation and proceed to call the MPI COMM ADAPT BEGIN operation immediately. Joining processes block in the adapt begin operation before the parents. Preexisting processes are notified about the adaptation only after all joining processes are blocking at the MPI COMM ADAPT BEGIN operation. This allows preexisting processes to continue making progress without interruption while scheduling decisions are taking place and new processes are being created, effectively hiding those latencies.

Preexisting processes can receive the status staying or leaving. All processes that are required to stay in the process group after the MPI COMM ADAPT COMMIT operation modifies the MPI COMM WORLD communicator receive the status staying; in contrast, all processes that will be eliminated from the MPI COMM WORLD communicator after the MPI COMM ADAPT COMMIT operation completes receive the status leaving.

Finally, there is an MPI INFO object that can be used to provide additional information from resource managers. This parameter is optional: it can be used to provide implementation specific information or alternatively be set to NULL by implementors.

int MPI_Probe_adapt(
    int *pending_adaptation,
    int *local_status,
    int *nfailed,
    int *failed_ranks,
    MPI_Info *info
);

Listing 6.6: MPI PROBE ADAPT C interface with fault information.

SUBROUTINE MPI_PROBE_ADAPT(pending_adaptation, local_status, &
                           nfailed, failed_ranks, info, ierror)
    INTEGER pending_adaptation
    INTEGER local_status
    INTEGER nfailed
    INTEGER failed_ranks(*)
    INTEGER info
    INTEGER ierror
END SUBROUTINE MPI_PROBE_ADAPT

Listing 6.7: MPI PROBE ADAPT Fortran interface with fault information.

Potential for Fault Tolerance Support

With the proposed API extensions, there is potential to support adaptations as a result of failures with a modified version of the MPI PROBE ADAPT operation. Listings 6.6 and 6.7 show the modified interfaces with two additional output parameters that provide a list of failed ranks in the MPI COMM WORLD communicator.

The pending adaptation output parameter has an additional possible status to indicate that the application needs to adapt its resources due to failures: MPI ADAPT FAULT. In such a case, the application is required to read the nfailed and failed ranks parameters; these values should be used to prevent the application from initiating communication with failed ranks. This additional operation can be complementary to ongoing efforts from members of the Fault Tolerance Working Group of the MPI forum [140, 45, 47, 40]. The addition of this operation alone is not sufficient for fault tolerance. For example, application processes may be blocking while waiting for failed processes and may not reach the probe operation.
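A hedged sketch of how an application could react to the fault-aware variant follows; the alive array and the surrounding function are illustrative, not part of the proposal:

#include <mpi.h>
#include <stdlib.h>

/* Mark failed ranks so that the application stops initiating communication
 * with them; afterwards an adaptation window would shrink MPI_COMM_WORLD. */
void check_for_faults(int *alive, int world_size)
{
    int pending, local_status, nfailed;
    int *failed_ranks = malloc(world_size * sizeof(int));
    MPI_Info info;

    MPI_Probe_adapt(&pending, &local_status, &nfailed, failed_ranks, &info);

    if (pending == MPI_ADAPT_FAULT) {
        for (int i = 0; i < nfailed; i++)
            alive[failed_ranks[i]] = 0;
        /* ...then enter MPI_Comm_adapt_begin/commit at a safe location... */
    }
    free(failed_ranks);
}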

6.1.3 Beginning an Adaptation Window

The third operation marks the start of an adaptation window. This operation provides two communicators as output: one intercommunicator that is similar to the one provided by standard spawn operations, and one intracommunicator that gives an early view of the future MPI COMM WORLD communicator. In addition to the two helper communicators, the counts for staying, leaving and joining processes are also provided. These counts can be used by applications to compute any necessary dimensions for their repartitioning schemes. The C and Fortran interfaces for this operation are presented in Listings 6.8 and 6.9.

int MPI_Comm_adapt_begin(
    MPI_Comm *intercomm,
    MPI_Comm *new_comm_world,
    int *staying_count,
    int *leaving_count,
    int *joining_count
);

Listing 6.8: MPI COMM ADAPT BEGIN C interface.

SUBROUTINE MPI_COMM_ADAPT_BEGIN(intercomm, new_comm_world, &
                                staying_count, leaving_count, joining_count, ierror)
    INTEGER intercomm
    INTEGER new_comm_world
    INTEGER staying_count
    INTEGER leaving_count
    INTEGER joining_count
    INTEGER ierror
END SUBROUTINE MPI_COMM_ADAPT_BEGIN

Listing 6.9: MPI COMM ADAPT BEGIN Fortran interface.

It is up to the application to make calls to this operation in a location where it is safe for it to adapt to new resources. In general, it is expected that this takes place inside of a progress loop. Frequent checks for adaptations are desirable to minimize the idle times of joining processes in newly added resources.

Each process that is staying or joining is required to read its future rank and size from the provided new comm world helper communicator to perform adaptations consistently. Processes that are leaving during the adaptation window will not have access to the future MPI COMM WORLD, since a leaving process will be removed from the process group; leaving processes' new comm world will be set to MPI COMM NULL. These processes will need to be reached over the provided intercomm from the children, or over their current MPI COMM WORLD from the parents, during an adaptation window.
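Putting these rules together, an adaptation window could be structured roughly as in the sketch below; the data structure and its redistribution routine are application-specific placeholders:

#include <mpi.h>

void adaptation_window(double **grid, long global_count)
{
    MPI_Comm intercomm, new_world;
    int staying, leaving, joining;

    MPI_Comm_adapt_begin(&intercomm, &new_world,
                         &staying, &leaving, &joining);

    if (new_world != MPI_COMM_NULL) {        /* staying or joining process */
        int new_rank, new_size;
        MPI_Comm_rank(new_world, &new_rank);
        MPI_Comm_size(new_world, &new_size);
        /* Repartition the data according to new_rank/new_size, pulling blocks
         * from leaving processes over the current MPI_COMM_WORLD or the
         * intercommunicator as appropriate, e.g.:
         * redistribute_blocks(grid, global_count, new_rank, new_size); */
    }

    MPI_Comm_adapt_commit();                 /* MPI_COMM_WORLD is transformed */
}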


6.1.4 Committing an Adaptation Window

The last operation commits the adaptation. Its interface takes no parameters. The C and Fortran versions only include the return error code or the error parameter, as presented in Listings 6.10 and 6.11.

int MPI_Comm_adapt_commit();

Listing 6.10: MPI COMM ADAPT COMMIT C interface.

SUBROUTINE MPI_COMM_ADAPT_COMMIT(ierror)
    INTEGER ierror
END SUBROUTINE MPI_COMM_ADAPT_COMMIT

Listing 6.11: MPI COMM ADAPT COMMIT Fortran interface.

This operation modifies MPI COMM WORLD: any leaving processes are eliminated from it, and any new joining processes are inserted into it. After MPI COMM ADAPT COMMIT completes, the MPI COMM WORLD communicator matches exactly the new comm world communicator provided by the MPI COMM ADAPT BEGIN operation.

This operation also notifies the resource manager that the current adaptation is complete. This is necessary to prevent the resource manager from triggering a new adaptation while one is still ongoing, since adaptation windows are not allowed to be nested.

An application that creates any derived communicators is required to reconstruct them after this operation. Additionally, any libraries that use MPI will need to be reinitialized as well; this will require support from external libraries such as linear algebra packages, visualization libraries, performance tools, and so forth.

6.2 MPI Extension Implementation

As previously mentioned, the proposed API extensions have been designed so that efficient implementations are possible. The MPICH library has been taken as basis for the initial prototype. The library is largely designed and optimized for static process groups that keep their set of resources fixed during the full runtime of an application. Because of this, a large number of small changes were necessary to support these new operations.

In this section, mainly the high level algorithms of each proposed operation in the extension are described. Important design decisions related to performance will be explained.

6.2.1 MPI INIT ADAPT

The addition of the new initialization routine only requires that the Process Management Interface (PMI) [36, 61, 60, 198] provides the extra parameter to the application. This information is provided by the resource manager (covered in Chap. 9), and is simply propagated to each application process.

Currently, internal data structures in MPICH are optimized based on the assumption that the size of the MPI COMM WORLD communicator never changes. In the future, this operation could initialize the MPICH library with data structures that are better designed for elastic execution.



Figure 6.2: Flow chart of the MPI PROBE ADAPT operation.

6.2.2 MPI PROBE ADAPT

As mentioned before in Sec. 6.1.2, this operation provides information to the application. The implementation presented here has been optimized based on two assumptions:

1. This operation will be called periodically and as frequently as possible to minimize the idle times of joining processes in newly allocated resources.

2. The result of the vast majority of calls to this operation will output the value MPI ADAPT FALSE in the pending adaptation variable.

The operation reads the adaptation metadata set by the resource manager at each node from a shared memory segment. For this operation, a race condition is only possible when the adaptation flag is set by the resource manager and one or more processes have not read it, while one or more other processes have read it already. This potential race is managed by the probing algorithm.

The flow chart of the probing algorithm is presented in Fig. 6.2. Each process has a counter for entering and another counter for exiting the probe operation. When a process reads that there is an adaptation pending, it proceeds to contact its local SLURMD to synchronize. If the SLURMD daemon detects that not all processes are waiting for instructions and that some have exited the operation (based on the counters and metadata), it releases all local processes and inserts a delay. The network of SLURMD daemons for the application synchronizes over the SRUN component, and all insert the delay if one or more of them detected early processes.

In summary, this algorithm depends on interactions between the MPI library and the resource manager over the PMI. It is therefore split between the MPI library and distributed components of the resource manager. The resource manager's side of this algorithm is described in Sec. 9.3.

The performance consequence of this implementation is that processes simply read and write a few bytes of metadata from a shared memory segment in the case of no adaptation. These reads and writes are performed without synchronization, based on a non-blocking access pattern. Each process has exclusive write access to a single memory block; this eliminates the need for locks or any other form of synchronization, and makes the operation very fast for the case of no adaptations. Assuming that this is the most common case, this improves performance greatly at the cost of possible delays in events where there is an adaptation and the race condition is detected. Additionally, when adaptations are required, the aggregated performance penalty of delay insertions depends on the probability that the race condition occurs.
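A minimal sketch of this fast path is shown below; the structure layout, field names and constants are illustrative and do not correspond to the actual MPICH or SLURM data structures:

#define MAX_LOCAL_PROCS 64   /* illustrative bound on processes per node */

struct probe_metadata {
    volatile int adaptation_pending;             /* written by the local SLURMD */
    volatile int entry_count[MAX_LOCAL_PROCS];   /* one exclusive slot per process */
    volatile int exit_count[MAX_LOCAL_PROCS];
};

int probe_fast_path(struct probe_metadata *md, int local_id)
{
    md->entry_count[local_id]++;        /* exclusive writer: no lock required */

    if (!md->adaptation_pending) {      /* common case: only a few bytes read */
        md->exit_count[local_id]++;
        return 0;                       /* reported as MPI_ADAPT_FALSE */
    }
    /* Adaptation pending: fall through to the slow path that synchronizes
     * with the local SLURMD before reporting MPI_ADAPT_TRUE. */
    return 1;
}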

6.2.3 MPI COMM ADAPT BEGIN

The MPI COMM ADAPT BEGIN operation is by far the most important operation in terms of performance. Special attention was given to its design. The perspectives of application, MPI library and resource manager developers were considered when defining its interface. The main performance benefit of this operation is its ability to hide the latencies related to resource management and MPI process creation. These are:

1. Scheduling and job metadata management at the resource manager.

2. MPI processes creation on new resources.

3. The time of the MPI INIT ADAPT operation on newly created processes.

4. Entry into the MPI COMM ADAPT BEGIN routine by newly created processes.

The last three latencies only apply when performing resource expansions. These latencies can amount to several seconds, and it is therefore desirable to hide them from the preexisting processes of a job, so that their progress is not interrupted. These latencies are hidden in part by inverting the order of the MPI COMM ACCEPT and MPI COMM CONNECT operations, when compared to the MPI COMM SPAWN and MPI COMM SPAWN MULTIPLE operations as implemented in MPICH.

The flow chart of the algorithm is presented in Fig. 6.3. The children processes are created by the resource manager and acquire the status joining from the MPI INIT ADAPT operation.



Figure 6.3: Flow chart of the MPI COMM ADAPT BEGIN operation.

They then proceed to call the MPI COMM ADAPT BEGIN operation, open a port for the preexisting processes to find, and perform an accept operation on it. Right before blocking on the accept operation, the root process of the joining process group notifies the resource manager that these processes are ready to join the preexisting processes. The resource manager then notifies all preexisting processes. Once the preexisting processes probe for instructions, they call the MPI COMM ADAPT BEGIN operation. Inside the operation, they do a lookup on the port name of the newly created processes, and proceed to perform a connect on it. At that point, the MPI COMM ADAPT BEGIN operation's algorithm is very similar to the standard MPI COMM SPAWN operation, and it generates an intercommunicator that is equivalent to performing a spawn. Additionally, it computes a process group that contains all staying and joining processes; any leaving processes are excluded from this process group. Based on this process group, the new comm world communicator is generated. Both of these communicators are then passed to the application as outputs. Each staying and joining process is expected to read its new rank during the adaptation window from the new comm world communicator. Finally, the counts of staying, leaving and joining processes are computed and provided as output.


6.2.4 MPI COMM ADAPT COMMIT

This operation is relatively fast when compared to the begin adaptation operation, with only an internal implicit barrier as its main communication overhead. It is possible to remove this barrier with some rework on the internals of the MPICH library, specifically on its virtual channel creation algorithm. Unfortunately, in the current implementation the library fails in certain sequences when a virtual channel is being created and the target process is not ready. This situation can arise, although perhaps infrequently, if the barrier is removed from the current implementation.

The commit operation modifies the internal metadata of the local process without the need of extra information: it takes the process group of the new comm world generated in the MPI COMM ADAPT BEGIN operation, and updates the local process' world group and MPI COMM WORLD communicators based on it. After that, it cleans up any unnecessary metadata, internal process groups and helper communicators generated for the adaptation window. Finally, it destroys any user created communicators and process groups, since they are now invalid given the addition or removal of MPI ranks in the MPI COMM WORLD communicator. As mentioned before, applications are expected to recompute any derived communicators and to reinitialize any libraries after this operation completes and before making progress.

7 Elastic-Phase Oriented Programming (EPOP)

During the development of the Elastic MPI library and supporting applications, difficulties faced by application developers were identified. There were two main difficulties in non-trivial software projects: dealing with the difference between preexisting processes and joining processes, and implementing data redistribution schemes. Elastic-Phase Oriented Programming (EPOP) attempts to improve the structure of elastic distributed memory applications to help developers cope with these additional challenges of resource-elastic programming.

In this chapter, arguments to justify the use of EPOP for the development of resource-elastic applications are first presented. Afterwards, its current implementation is described in detail. Finally, additional benefits that EPOP and its implementation can provide in the future are discussed.

7.1 Motivation for a Resource-Elastic Programming Model

In this section, how EPOP can help both developers and runtime systems is described. It helps developers manage the complexity of resource-elastic distributed memory software by improving code structure with clearly defined phases and redistribution code. Runtime systems can rely on these definitions to better analyze performance and, as a consequence, take better scheduling decisions that optimize individual applications while at the same time maximizing system-wide efficiency metrics.

7.1.1 Identification of Serial and Parallel Phases in the Source Code

As mentioned in the motivation (Chap. 2), the focus of this work is mainly on the analysis and optimization of strong scaling computational phases, since their efficiency is largely dependent on resource allocations; the efficiency of weak scaling computational phases is in general independent of allocated resources, since they fix the amount of computational work to be done per resource unit. The efficiency of strong scaling phases follows Amdahl's law.

Let p be the number of processes (properly pinned to a hardware computing resource, such as a CPU) and τ(p) be the time these processes take to finish the computational work of a particular phase. The reduction of time due to an increase from a reference pr processes to pn new processes allocated to a phase is defined as δ(pr, pn) and its reciprocal, the speedup, as Γ(pr, pn). The following formula describes their relationship:

\[
\delta(p_r, p_n) = \frac{\tau(p_n)}{\tau(p_r)} = \frac{1}{\Gamma(p_r, p_n)}. \tag{7.1}
\]


Amdahl’s law separates single process performance, τ(1), into two parts: a serial and a parallel part. The serial part is defined here as τs and the parallel part as τp. Since only the parallel part of the execution time of a phase can be reduced with additional processes, it follows that:

\[
\tau(p) = \tau_s + \frac{\tau_p}{p} \tag{7.2}
\]

Amdahl’s law states that the best speedup possible, from a reference of 1 process, is limited by the inequality:

\[
\Gamma(1, p_n) = \frac{\tau(1)}{\tau(p_n)} \le \frac{\tau_s + \tau_p}{\tau_s + \tau_p/p_n}. \tag{7.3}
\]
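As a brief numeric illustration of this bound (the values are chosen only for the example): for a phase with a one percent serial fraction, τs = 1 and τp = 99,

\[
\Gamma(1, 100) \le \frac{1 + 99}{1 + 99/100} = \frac{100}{1.99} \approx 50.3,
\qquad
\lim_{p_n \to \infty} \Gamma(1, p_n) \le \frac{\tau_s + \tau_p}{\tau_s} = 100,
\]

so even a small serial fraction caps the achievable speedup regardless of how many processes are allocated.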

We can conclude from Amdahl's law that resource adjustments will only have an impact on the performance of parallel phases. Therefore, in elastic execution environments all serial phases should be ignored when performing performance analyses and resource adjustments; the identification of these types of phases is therefore of increased importance in elastic execution environments.

These parallel and serial phases are in many cases identifiable by the developers of the application, but are difficult for compilers and profilers to identify automatically. EPOP allows developers to explicitly identify phases that are parallel and elastic, and any other remaining phases of the application as rigid. Elastic runtime systems can take advantage of the clear definition of parallel elastic phases by only analyzing and optimizing the resource allocations of these, and ignoring all others. In reality, there will be some serial blocks in parallel elastic phases and therefore the scalability of these phases will also follow Amdahl's law; in spite of this, the advantage of having clearly defined phases that are targets for optimization is still great.

7.1.2 Process Entry and Data Redistribution Locations

A performance goal for elastic applications is to minimize the time required to integrate new resources into ongoing simulations. This can be achieved by having entry points be reached as frequently as possible during the run of an application.

Typical scientific simulations are composed of several blocks of code, and each of these can run for several minutes, hours and in some cases days. It is therefore beneficial for elastic applications to define multiple entry points for new processes. Furthermore, this also helps schedulers reduce the idle time of resources with minimal need for oversubscription.

Typical MPI programs have several branches in complex control structures. This makes the code hard to analyze statically for performance tools, and in many cases even for application experts. This situation is even more severe in elastic MPI programs, where additional branching and control structures are introduced due to the difference between preexisting and joining processes.

Given that allowing new processes to join at arbitrary locations of the program is beneficial to the application and to system-wide efficiency, it is of great importance to simplify the definition of these locations and to minimize the need for complex control structures.


int main(int argn, char **argc) {
    MPI_Init_adapt(&argn, &argc, &local_status);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // initialization block
    if (local_status == MPI_ADAPT_STATUS_NEW) {
        phase_index = 0;
    } else { // joining process
        MPI_Comm_adapt_begin(&intercomm, &new, &staying_count,
                             &leaving_count, &joining_count);
        // redistribution block
        MPI_Comm_adapt_commit();
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&phase_index, 1, MPI_INT, 0, MPI_COMM_WORLD);
    }
    if (phase_index == 0) {
        while (count <= phase0_passes) {
            MPI_Probe_adapt(&current_operation, &local_status, &info);
            if (current_operation == MPI_ADAPT_TRUE) {
                MPI_Comm_adapt_begin(&intercomm, &new, &staying_count,
                                     &leaving_count, &joining_count);
                // redistribution block
                MPI_Comm_adapt_commit();
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
                MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&phase_index, 1, MPI_INT, 0, MPI_COMM_WORLD);
            }
            // phase 0 computation and communication block
        }
        phase_index++;
    }
    if (phase_index == 1) {
        while (count <= phase1_passes) {
            MPI_Probe_adapt(&current_operation, &local_status, &info);
            if (current_operation == MPI_ADAPT_TRUE) {
                MPI_Comm_adapt_begin(&intercomm, &new, &staying_count,
                                     &leaving_count, &joining_count);
                // redistribution block
                MPI_Comm_adapt_commit();
                MPI_Comm_rank(MPI_COMM_WORLD, &rank);
                MPI_Comm_size(MPI_COMM_WORLD, &size);
                MPI_Bcast(&count, 1, MPI_INT, 0, MPI_COMM_WORLD);
                MPI_Bcast(&phase_index, 1, MPI_INT, 0, MPI_COMM_WORLD);
            }
            // phase 1 computation and communication block
        }
        phase_index++;
    }
    MPI_Finalize();
    return 0;
}

Listing 7.1: Simple elastic MPI application with 2 phases.

In EPOP, parallel elastic phases that have redistribution code are Elastic-Phases (EPs). The creation of EPs clearly defines process entry and redistribution locations, and allows the runtime system to insert new resources into the application without the need of complex control structures to be developed by the programmers of elastic applications. Listing 7.1 shows a very simple elastic MPI C program with 2 phases. In the code, each process needs to enter the correct redistribution block and then follows into the correct computational and communication block of the phase. As can be observed, the control structure is already quite complex due to the multiple possible paths depending on whether a process is: newly created, preexisting in phase 0, joining into phase 0, preexisting in phase 1, or joining into phase 1.

7.2 The EPOP Programming Model

EPOP is a programming model that has abstractions that help programmers modularize the blocks of computation, the data and the control flow of potentially, but not necessarily, resource-elastic applications. In this section, these abstractions will be described in detail.

7.2.1 Initialization, Rigid and Elastic-Phases (EPs)

Phases are used to represent the computational blocks of an application in the EPOP model. At the core of EPOP is the concept of Elastic-Phases, or EPs for short; these phases are used to represent blocks of code where resource adaptations are possible. There are also two other types of phases. Here is a more detailed description of the phase types:

• Initialization Phase: Each application must have one and only one initialization phase. These are called only once at each application process launch. The data of each EPOP application is first allocated and initialized in these phases.

• Rigid-Phase: These are phases where resource adaptations are not possible. Applications that have no support for resource-elasticity can be defined with the use of these. Even in resource-elastic applications, these can be used for blocks where it does not make sense to allow adaptations, such as during finalization. These types of phases can define a loop, but this is optional.

• Elastic-Phase: These are phases that implement a code block, a redistribution block and a loop definition. In contrast to rigid-phases, the loop definition is mandatory for this type of phase. The computational block of each EP will be executed several times based on the definition of its own loop conditions. The adaptation block is triggered by resource managers when needed.

An illustration of a simple EPOP program with 4 phases is shown in Fig. 7.1. The program consists of one initialization phase (as required by the model), two EPs and a rigid-phase for finalization. As can be seen in the unrolled loop view, each EP is also a loop and the application data is passed to its computational block on each iteration.

7.2.2 EPOP Programs and Branches

A collection of phases can be combined with branches to create an EPOP program. Branches can be if, if-else or similar constructs that operate at the phase level. Branches in EPOP determine the next phase in an execution order based on their conditions.



Figure 7.1: Program structure of the simple EPOP example (with source in Listing 7.6).

A branch can select any phase except the initialization phase; the initialization phase is restricted to a single execution. In the example shown in Fig. 7.1, a branch determines whether to finalize the application after Phase 1, or to continue execution from Phase 0.

7.2.3 Application Data

All application state is held outside of the phases in the application data, similarly to functional programming languages. Instead of holding state, all static or allocated application data and file descriptors are required to be created in the data block only and then passed around between phases. Once the computations of a block are applied, the data is passed to the next computational block based on the control flow of the program.

In the example presented in Fig. 7.1, it can be observed that the data of the application is passed from phase to phase. In the case of the EPs, the data is also passed again to its computational block on each iteration. Finally, if the branch is taken, the data is then passed from the EP Phase 1 to the EP Phase 0.

7.3 Current Implementation

A minimalistic implementation of EPOP is presented in this section. It is currently implemented in the C programming language. The implementation will be used to show more concretely how EPOP solves the original problem of improving the structure of elastic MPI programs, while also providing some additional benefits, like simplifying the creation of performance models to assist runtime schedulers.

7.3.1 Driver Program

EPOP programs are not compiled as executable binaries. Instead, they are built into archives that are loaded by a driver program. Currently only one driver is provided with EPOP, but multiple drivers can be provided later, since driver programs are independent of EPOP applications. Profiling and automatic tuning drivers could be added to the implementation in the future.


Each EPOP program implements a get_program() routine that returns a program definition. The program definition includes its size, a program structure and some other metadata that describes it.

mpiexec -n 1024 epop ./double_phase_example

Listing 7.2: EPOP + MPI application started with an MPI launcher. Multiple EPOP drivers are started (each with its own instance of the EPOP application and common parameters).

EPOP applications are not necessarily also MPI applications. In fact, these applications do not even need to be parallel to be implemented with the EPOP framework. For EPOP + MPI applications, multiple driver programs are launched. Each driver instance created by the MPI launcher will load the EPOP + MPI application in a typical SPMD style launch. Listing 7.2 shows a sample launch command with a typical mpiexec launcher provided by some resource manager and MPI library combinations.

Listing 7.3 shows a simplified example of a control loop used to implement a driver program. The program counter (the pc variable) is used as an index into an array of EPs, rigid-phases and branches, and each is called according to its type with the program's data.

// the initialization phase is called before this code block
// the initial pc value is provided by the resource manager
while (pc < program->size) {
    switch (program->elements[pc].type) {
    case EP:
        do {
            // probing for adaptations is done here
            if (resource_adaptation)
                program->elements[pc].phase_adapt(program->data);
            program->elements[pc].phase_exec(program->data);
        } while (program->elements[pc].loop_condition(program->data));
        pc++;
        break;
    case RP:
        do {
            program->elements[pc].phase_exec(program->data);
        } while (program->elements[pc].loop_condition(program->data));
        pc++;
        break;
    case BRANCH:
        pc = program->elements[pc].branch_condition(pc, program->data);
        if (pc == 0 || pc > program->size)
            return -2;
        continue;
    default:
        return -1;
    }
}

Listing 7.3: Simplified example control loop of a driver program.


7.3.2 Program Element

As seen in the sample code in Listing 7.3, the driver program traverses an array of both phases and branches. This is possible with the use of the program element structure as the type for the elements of the array used by the implementation. A program element can be one of the three phase types or a branch. In the sample driver code, there is no case for the initialization program element type, since these are called only once before entering the driver loop. The current C language implementation of the program element type is shown in code Listing 7.4.

typedef struct {
    int type;
    void (*init)(int *, char ***, void **);
    void (*phase_adapt)(void *);
    void (*phase_exec)(void *);
    int (*loop_condition)(void *);
    int (*branch_condition)(int, void *);
} program_element_t;

Listing 7.4: C structure of the program element.

The compute blocks of phases resemble kernels in other programming models. Each compute block is defined in nearly the same way as a PThread [169, 1, 172], with a single void pointer as input and an integer return type for error handling by the driver program. The adaptation block has the same interface, but has a different purpose. When the driver program interacts with the resource manager and determines that there is an adaptation to be performed, this routine is called on preexisting processes. On joining processes, the driver program first gets the program counter value from the resource manager, and then calls the adaptation block of the appropriate phase. This design eliminates the need for complex control structures found in regular resource-elastic MPI applications.

The branch type can be used for arbitrary jumps. This is achieved by modifying the program counter passed to the branch implementation by the driver program. The branch operation takes both the application data and the program counter. Differential or absolute jumps can be implemented by either computing a concrete value for the program counter, or by adding or subtracting a differential to its current value. An example code snippet of an if-else block based on differential jumps can be seen in Listing 7.5.

int branch(int pc, void *appdata)
{
    if (appdata->branch_condition == 0)
        return (pc + 1);
    else
        return (pc + 2);
}

Listing 7.5: C code for a differential branch performing an if-else operation. The driver continues to the program element at pc+1 or pc+2 depending on the appdata->branch_condition variable. The program structure must be assembled with the correct order in the get_program() routine.


int init(int *argn, char ***argc, void **data) {
    // initialization block
    MPI_Init_adapt(NULL, NULL, &((*appdata).status));
    MPI_Comm_rank(MPI_COMM_WORLD, &((*appdata).rank));
    MPI_Comm_size(MPI_COMM_WORLD, &((*appdata).size));
    *data = (void *)appdata;
}

int adapt(void *data) {
    MPI_Comm_adapt_begin(&intercomm, &new, &staying_count,
                         &leaving_count, &joining_count);
    // redistribution block
    MPI_Comm_adapt_commit();
}

int phase0(void *data) {
    // phase 0 computation and communication block
}

int phase0_condition(void *data) {
    if (appdata.count <= appdata.phase0_passes)
        return 1;
    else
        return 0;
}

int phase1(void *data) {
    // phase 1 computation and communication block
}

int phase1_condition(void *data) {
    if (appdata.count <= appdata.phase1_passes)
        return 1;
    else
        return 0;
}

int fini(void *data) {
    MPI_Finalize();
}

program_t *get_program(void) {
    program_t *program = malloc(sizeof(program_t));
    epop_alloc_phases(&program, 4);
    program->elements[0].type           = INIT;
    program->elements[0].phase_exec     = init;
    program->elements[1].type           = EP;
    program->elements[1].phase_adapt    = adapt;
    program->elements[1].phase_exec     = phase0;
    program->elements[1].loop_condition = phase0_condition;
    program->elements[2].type           = EP;
    program->elements[2].phase_adapt    = adapt;
    program->elements[2].phase_exec     = phase1;
    program->elements[2].loop_condition = phase1_condition;
    program->elements[3].type           = RP;
    program->elements[3].phase_exec     = fini;
    program->elements[3].loop_condition = NULL;
    return program;
}

Listing 7.6: Simple elastic EPOP application (for comparison with MPI in Listing 7.1).

Examples of the initialization, EP and rigid-phase types can be seen in the double phase example in Listing 7.6; this example matches the code in the MPI example shown in Listing 7.1.

When comparing the original MPI code to the EPOP conversion, it can be seen that once the program structure is created in the get_program() routine, the programmer no longer needs to track the branching structure of the application and can instead focus on the body of the computation and adaptation blocks of both elastic phases. In addition to this, the structure is exposed at the beginning of the application to the driver program; this structure can then be shared with other components such as the resource manager, a performance analyzer, etc. Finally, the driver program can report application progress to the resource manager or a tracker without extra development effort by the application programmer.

7.3.3 Program Structure

The program structure is a collection of program element instances and the application data. The array of program elements included in the structure is traversed by the driver program. The data is passed to each computational block. This array is created in the get_program() routine of an EPOP application, as seen in the code snippet presented in Listing 7.6. The structure of an EPOP program is overlaid on the flat array of program elements. This array is not allowed to be modified at runtime. The driver program traverses it by modifying the program counter and calling each program element based on its type, as seen in the example code of the driver program in Listing 7.3. When a computational block completes, the program counter is incremented by one, except when it is computed by a branch program element. Drivers are in complete control of the execution of EPOP programs through their program counters.

typedef struct {
    void              *data;
    int                size;
    program_element_t *elements;
} program_t;

Listing 7.7: C code of the program structure.
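The traversal described above can be condensed into a minimal driver sketch, shown here only for illustration. It is not the driver of Listing 7.3: the BRANCH type name and the branch_exec field are assumed names (only INIT, EP and RP appear in Listing 7.6), and a real driver would additionally gate the adaptation block on MPI_PROBE_ADAPT, report progress and handle errors.

// Minimal EPOP driver sketch. Field names follow Listing 7.6, while
// BRANCH and branch_exec are assumed names for the branch element type.
void run_program(program_t *program)
{
    int pc = 0;

    while (pc < program->size) {
        program_element_t *e = &program->elements[pc];

        if (e->type == BRANCH) {
            // a branch computes the next program counter itself
            pc = e->branch_exec(pc, program->data);
            continue;
        }

        do {
            if (e->type == EP && e->phase_adapt != NULL)
                e->phase_adapt(program->data);   // adaptation block
            e->phase_exec(program->data);        // computational block
        } while (e->loop_condition != NULL &&
                 e->loop_condition(program->data));

        pc = pc + 1;                             // default: advance to the next element
    }
}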

7.4 Additional Benefits of the EPOP Model and Driver Programs

The definition of phases in EPOP programs can help improve the quality of performance modeling techniques. The structure of programs is known before the application starts. A driver program can serialize the program structure of an EPOP application and transfer it to a resource manager or a monitoring program. A performance profile can then be populated by annotating the program structure with performance data. This can be done on completion or partially with updates at runtime. The types of the phases of an application are also known; because of this, performance modeling routines can be enabled selectively for EPs, while rigid-phases may be ignored, especially when they do not loop.

Driver programs can also implement progress reporting transparently. A driver program can track phase changes and the iterations of phases that loop, and then transfer progress information to a resource manager. This can help increase the quality of resource adaptation and scheduling decisions, since the resource manager can better predict the future time when an application will complete. The rate of progress can be measured at different application processes to evaluate their load balance. The collective rate of progress can also be used to evaluate the quality of resource adaptations, since it correlates well with efficiency metrics.

A pausing mechanism can be added to driver programs. This can make preemption possible. It can also allow the timing of applications or specific phases to be controlled. For example, the start time of an application can be delayed. Timing control can be useful for schedulers that try to minimize the idle times of nodes, since application processes may be created early if there is sufficient free memory in the nodes of an allocation. These processes can then be released immediately after the preceding application completes.

8 Resource Management in High Performance Computing

Resource managers for distributed memory systems have traditionally differed significantly from their shared memory counterparts. In general, implementations have tried to solve very different problems, and therefore relied on entirely different approaches and scheduling algorithms. In this chapter, a brief introduction to resource management in shared memory systems is provided. Afterwards, resource management is covered more specifically in the context of distributed systems. Towards the end of the chapter, the state of the art of resource management in HPC is summarized. Additionally, the changes needed to support elastic execution in distributed memory HPC systems are identified. These are used as motivation for the development of the resource manager presented in Chap. 9. Finally, an overview of the implementation of the SLURM resource manager is provided, since it was used as the basis for the elastic resource manager presented in this work.

8.1 Resource Management in Shared Memory Systems

Computers had no operating systems in earlier times: early compute systems ran a single program until it completed. There was no need for resource management since the program had exclusive access to all the resources in the machine. Subsequent developments in computer hardware and software made it possible to share resources among multiple programs in a single computer. Once resources could be shared, there was a need for arbitration in cases of contention. Operating systems provide this needed arbitration and are the resource managers found in shared memory systems today.

Shared memory schedulers have been developed as part of operating systems. These execute tasks immediately without queuing, with some exceptions. For example, a system may create tasks periodically based on its time-based job scheduler; however, when the task is started it runs immediately. Work is not postponed or queued. Compute and other resources are shared in time as necessary. There are some exceptions, such as in the case of real-time systems, where space-sharing is still favored over time-sharing. Granting exclusive access to resources with space-sharing helps ensure that any required deadlines are met in real-time systems.

Processing units first come to mind when thinking of hardware resources that need management. In addition, an operating system needs to arbitrate access to IO (such as hard drives and other long term storage devices), network, random access memory, audio and video devices, keyboard and mouse, etc. Current systems support the concurrent execution of tasks. While having large numbers of tasks running concurrently has the benefit of increasing resource utilization, it also has the disadvantage of creating resource contention.

The scheduler of an operating system controls the placement of tasks on processing resources and their timing to minimize contention. Different trade-offs are made by real-time, desktop and server oriented schedulers in shared memory systems. Schedulers can be optimized for real-time guarantees or for throughput. Real-time schedulers support the exclusive allocation of resources in space-sharing mode, while throughput optimized operating systems tend to operate in time-sharing mode exclusively. Desktop and server schedulers are throughput optimized. These differ in that resource utilization is maximized for servers, while a compromise between response time and resource utilization is targeted for desktop systems.

As mentioned before, HPC systems today are a collection of nodes interconnected with a high-performance network. Each of these nodes is a shared-memory system with its own local operating system. Their schedulers are throughput optimized and operate in time-sharing mode. Time-sharing is achieved through preemption on these schedulers: started tasks use a processing element, such as a CPU, for a time slot before being interrupted by the operating system to give a time slot to another task. Most algorithms employ a round robin strategy, where the time of computing resources is distributed equally among tasks of equal priority.

8.2 Resource Management in Distributed Memory Systems

Distributed memory schedulers have, for the most part, been developed separately from operating systems. These schedulers are implemented as separate software packages that run in user space (although usually under a super user). This is a consequence of the fact that there is no real distributed operating system to date; there have been some prototypes such as GNU Hurd [5] or Microsoft's Barrelfish [14], but no production solution so far. More specifically, the major open source operating systems today (such as FreeBSD [15] and Linux [16]) do not support distributed memory. This is also true for proprietary operating systems such as Microsoft Windows or closed source Unix systems. Instead of being developed for operating systems, these schedulers have been developed for distributed resource managers; these operate in space-sharing mode and are commonly referred to as batch schedulers.

As mentioned before, current distributed memory systems are collections of shared memory systems. Each shared memory system that is part of a distributed system is typically referred to as a node. Current distributed systems are considered supercomputers if these nodes are interconnected by a high performance network with optimized latencies and bandwidth. Each node, being a shared memory system itself, runs an operating system with a scheduler in time-sharing mode and optimized for throughput, as described in the previous section. One of the typical assumptions is that each computing element will run a single task, and special effort is made to minimize system noise.

Distributed memory resource managers have centralized and distributed components. Usually a resource management daemon is run at each node available for computation. Most, if not all, current resource managers also require an additional dedicated node to run their centralized components. Utilities related to performance tracking, node status tracking, user reports, accounting, etc., run in a centralized manner together with the batch scheduler.

Batch schedulers are fundamentally different from the time-sharing schedulers found in operating systems. One of the main differences is that jobs are not started immediately after they are created. Instead, jobs are scheduled after they have been submitted by users and placed in a queue. Jobs are run in the background at some time in the future without any user interaction. Resources are managed at a coarser level, where full nodes are typically the resource unit instead of individual processing elements. Nodes run in space-sharing mode, where each node is given exclusively to a job during its run. Some resource managers do allow for time-sharing as well, but this is not commonly enabled even when available, since the minimization of interference between jobs is of higher importance in the domain of supercomputing. The starting times of jobs are generally not guaranteed; this differs from time-sharing schedulers, where tasks are started immediately.

Batch scheduling in space-sharing mode with rigid allocations has been adequate up to the present date. For elastic execution, where applications can adjust to changes in computational resources at runtime, additional scheduling strategies need to be developed.

8.2.1 Additional Requirements for the Scheduling of Elastic Jobs

In addition to the exclusive access to resources and the minimization of interference required by HPC jobs, elastic jobs need continuous performance tracking. Since the resources of elastic jobs can be adjusted at runtime, additional scheduling strategies need to be developed that maximize or minimize performance metrics in HPC systems (such as throughput, idle node count, power level stabilization, etc.) based on continuous resource adaptations.

Schedulers for elastic jobs require that their resource management infrastructure supports the modification of resource allocations. The number of resources of individual elastic jobs needs to be adjusted during their execution, based on their efficiency. Once a scheduler has determined that a resource adaptation is needed, it needs to be applied to the running system in some manner. When operating in distributed memory, an implementation needs to reconfigure the running system while ensuring consistency between its distributed components and the application processes of running jobs. Interactions between the distributed resource manager components need to be coordinated through communication protocols over the management network of the HPC system.

No resource managers for HPC systems today meet these additional requirements necessary to support elastic jobs. It is for this reason that, as part of this work, a new resource manager or an extension of an existing one was needed. After a long survey of available options, it was decided that extending the SLURM [12] workload manager to support elastic jobs was a good option.

8.3 Simple Linux Utility for Resource Management (SLURM)

The Simple Linux Utility for Resource Management (SLURM) [12] is a scalable workload manager currently being developed by SchedMD [11] and is released under the GNU General Public License (GPL). It is a highly scalable solution and quite popular in current HPC systems, including several of the fastest computers in the world today [19, 11].

From an implementation's perspective, SLURM can be seen as a collection of binaries that share a single configuration file. There is a significant amount of shared source code among these binaries, for purposes such as configuration parsing, plugin loading and management, and communication protocols, among other things. The collection of binaries that make up SLURM includes: the controller (SLURMCTLD), the node daemons (SLURMD), the launcher (SRUN), the step daemons (SLURMSTEPD), as well as several account management and information utilities. These binaries are highly threaded and have a collection of plugin interfaces for several extensible functions.


These extensible functions include scheduling, topology, MPI/PMI support, accounting, database support, and many more features. Figure 8.1 shows how these binaries interact and where they reside on an HPC system. The PMI and MPI libraries linked to MPI processes are also illustrated.

Figure 8.1: Abstract organization of a cluster based on SLURM and its main programs: SLURMCTLD, SLURMD and SLURMSTEPD. SRUN runs in the first node of an allocation (not shown).

Figure 8.2: SLURMD, SLURMSTEPD, MPI processes and SRUN in the master node of an allocation.


8.3.1 Controller Daemon (SLURMCTLD)

The controller daemon (SLURMCTLD) is the only centralized component of SLURM. This component includes the batch scheduling algorithm, manages security and user accounts, tracks individual nodes (by reaching each distributed daemon), and so forth. Each of the user control and information binaries interacts with this component over the network; these include:

• SRUN: Launch a parallel job (interactively or inside a batch).

• SBATCH: Submit a job defined in a batch script.

• SQUEUE: View information about submitted jobs that reside in the queue.

• SSTAT: Display status information about running jobs.

• SCANCEL: Used to cancel submitted jobs (running or queued).

• SACCT: Display accounting data about jobs in the system.

• SCONTROL: Used to view and modify the configuration of SLURM online.

There are many parts of SLURM that have been made modular and loadable as plugins. One of them is the batch scheduler algorithm. There are multiple implementations that can be selected through the configuration file.

8.3.2 Node Daemon (SLURMD)

The SLURMD binaries run as daemons on each node that is part of a partition. Each daemon monitors its individual node. They are also responsible for setting up and starting an additional type of daemon: the SLURMSTEPD. Daemons of this type track the node-local part of a job step: the subset of application processes that run on their node. A job step in SLURM is an application started with SRUN and its allocated resources. Figure 8.2 illustrates the interaction between the SLURMD daemon of a node, the local SLURMSTEPD daemon, and the MPI processes through the PMI. The figure also illustrates how the SRUN binary runs independently of MPI processes in the master node of the allocation of a job step.


9 Elastic Resource Manager

The SLURM workload manager was selected as the basis for the development of the elastic resource manager presented in this document. It has been modified extensively for elastic resource management and integration with the Elastic MPI library.

Similarly to MPICH, SLURM has been designed for static resource allocations. There is some form of resource-elasticity, where a job can receive an extra allocation as an expansion of its resources. This was not sufficient for the use cases considered in this work, since both increases and reductions of the resources given to MPI applications are needed. In addition, expansions through adaptation windows require that new processes be created on newly allocated resources; the creation of processes was only possible through the launcher in SLURM (e.g., mpiexec or srun).

The creation of a new resource manager may be required to support the elastic execution of MPI applications more elegantly and perhaps also more efficiently. Indeed, such efforts have already been initiated with the development of the Flux [31] resource manager. The Flux resource manager was still under development and was not mature enough to be used as the basis for the prototype presented in this document. Additionally, the creation of a new resource manager was not realistic within the time scheduled for this work. For these reasons, the extension of an existing and mature resource manager was determined to be the most feasible option. SLURM was selected based on its compatibility, performance and widespread adoption.

Close integration with the MPI runtime is necessary for elastic applications. It is desirable that each created process is pinned to a hardware CPU core. In order to ensure that, the resource manager needs to create processes on new resources, and destroy processes (in coordination with MPI) when resources are taken from an application. New processes could be created on the same resources, allowing for oversubscription of CPU cores, but that is of little benefit to the performance and scalability of most HPC applications.

Multiple SLURM binaries and shared subsystems were modified to add support for resource-elasticity and to integrate the new MPI operations. The most important changes are described in this chapter.

9.1 Overview of the Integration with the Elastic MPI Library

An overview of the integration of both the Elastic MPI library (based on MPICH) and the Elastic Resource Manager (based on SLURM) is presented in this section. The interactions between the MPI library and the different resource manager components are explained in detail. These interactions were introduced to support the proposed extensions to MPI. Figure 9.1 presents a high level view of nodes, resource manager components and MPI processes. The following resource manager components are labeled in the figure: Elastic Batch Scheduler, Elastic Runtime Scheduler, SRUN, SLURMD and SLURMSTEPD. Finally, inside the MPI processes there are two linked libraries: the PMI library and the MPI library.


From the proposed extensions introduced in Chap. 6, the MPI_INIT_ADAPT operation is essentially the same as the standard MPI_INIT operation. The launcher command, in this case SRUN, receives an allocation and credentials from the Elastic Runtime Scheduler (ERS). It then proceeds to create the necessary MPI processes through interactions with the SLURMD daemon running at each node of its allocation. After that, it waits until the MPI application completes its execution. The extra local_status value needs to be propagated to each process together with its initialization metadata.

With the MPI_PROBE_ADAPT operation, information is forwarded from the scheduler of the resource manager all the way to each preexisting MPI process of an application. The SRUN binary coordinates how this information is forwarded. Preexisting process groups are only notified after the newly created expansion processes are ready.

It takes six steps to complete an adaptation window in the current implementation. Figure 9.1 contains arrows to indicate which components participate during each step. The arrows point from the component that initiates an action toward the component that performs it. Each step is enumerated and described here:

1. Reallocation Message: The scheduler makes a decision and the resource manager applies it by sending a reallocation message to the job step’s SRUN instance.

2. Create New Processes in Expansion Nodes: If one or more processes are to be created in one or more expansion nodes, SRUN sends a launch command with the required instructions to the SLURMD at each participating expansion node.

3. New Processes Ready: After the processes created in the expansion allocation are ready at the MPI_COMM_ADAPT_BEGIN operation, the SLURMD at the leader node notifies SRUN.

4. Notify Preexisting Processes: SRUN sends instructions to each preexisting node's SLURMD. Each SLURMD then updates the metadata to be provided to each of its local processes through the MPI_PROBE_ADAPT operation. This will cause the parent processes to enter the adaptation window and interact with the newly created children processes in the expansion.

5. Adaptation Commit: Once the adaptation window is completed (all staying and joining processes are ready and leaving processes are terminating) the leader node notifies SRUN that the adaptation is complete.

6. Reallocation Complete: SRUN then notifies the scheduler that the reallocation was applied to the job step. Finally, SRUN receives an updated credential from the resource manager with the leaving nodes removed from its allocation.

It should be noted that steps 2 and 3 are only necessary when the adaptation includes a resource expansion (i.e., these steps are unnecessary when only a reduction of resources is performed). Additionally, step 4 only contains instructions for processes to leave the process group if there is a resource reduction. The design does allow for simultaneous resource expansions and reductions.

The only current limitation is that the node where SRUN is located needs to remain in the allocation, since SRUN cannot be migrated given the design inherited from SLURM. Adding a migration feature to SRUN would enable arbitrary migrations of full MPI applications with the current design.
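The six-step sequence can be summarized as a simple state progression on the SRUN side. The sketch below is purely illustrative: none of the state or function names appear in the extended launcher, and the expansion-only steps are skipped for pure reductions as described above.

// Illustrative state machine for one adaptation window as seen by SRUN.
// All identifiers are hypothetical; they only mirror the six steps above.
typedef enum {
    ADAPT_REALLOC_RECEIVED,    // 1: reallocation message from the scheduler
    ADAPT_LAUNCHING_EXPANSION, // 2: create processes in expansion nodes
    ADAPT_EXPANSION_READY,     // 3: new processes ready in adapt begin
    ADAPT_NOTIFY_PREEXISTING,  // 4: preexisting processes enter the window
    ADAPT_COMMITTED,           // 5: adaptation commit from the leader node
    ADAPT_COMPLETE             // 6: reallocation complete, credential updated
} adapt_step_t;

// Advance the state, skipping steps 2 and 3 when no nodes are added.
adapt_step_t next_step(adapt_step_t step, int expansion_nodes)
{
    switch (step) {
    case ADAPT_REALLOC_RECEIVED:
        return (expansion_nodes > 0) ? ADAPT_LAUNCHING_EXPANSION
                                     : ADAPT_NOTIFY_PREEXISTING;
    case ADAPT_LAUNCHING_EXPANSION: return ADAPT_EXPANSION_READY;
    case ADAPT_EXPANSION_READY:     return ADAPT_NOTIFY_PREEXISTING;
    case ADAPT_NOTIFY_PREEXISTING:  return ADAPT_COMMITTED;
    case ADAPT_COMMITTED:           return ADAPT_COMPLETE;
    default:                        return ADAPT_COMPLETE;
    }
}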


Figure 9.1: Overview of interactions between MPICH and SLURM components during adaptations.

9.1.1 Rank to Process Mapping Strategy

In Fig. 9.1, it can be seen that each MPI process in the scenario presented has two rank numbers, with one in parentheses. The first number is its rank before the commit operation, while the one in parentheses is its rank afterwards in the adapted MPI_COMM_WORLD communicator.

The current rank to process mapping algorithm minimizes rank changes on preexisting staying processes. This is achieved by selecting leaving processes with the maximum ranks possible, and giving joining processes ranks larger than the greatest staying rank. Avoiding rank changes on staying ranks helps minimize the movement of data during redistributions in adaptation windows.
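A minimal sketch of this mapping policy is given below. It is not the implementation in the Elastic MPI library; it only illustrates the two rules described above (leaving processes take the highest preexisting ranks, joining processes are appended after the highest staying rank).

// Sketch of the rank-to-process mapping policy.
// old_size: ranks 0..old_size-1 exist before the adaptation.
// leaving:  number of preexisting processes that must leave.
// Returns the new rank of a preexisting process, or -1 if it leaves.
int new_rank_of_preexisting(int old_rank, int old_size, int leaving)
{
    int first_leaving = old_size - leaving;   // the highest ranks leave
    if (old_rank >= first_leaving)
        return -1;                            // process leaves the group
    return old_rank;                          // staying ranks are unchanged
}

// Joining processes are appended after the greatest staying rank.
int new_rank_of_joining(int join_index, int old_size, int leaving)
{
    return (old_size - leaving) + join_index;
}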

9.1.2 Support for Arbitrary Node Identification Orders

During the normal operation of the resource-elastic system, multiple resource adaptations per job may occur. These adaptations may be expansions or reductions of resources by arbitrary numbers of nodes. Each node in these sets is unique and is identified with a number. These identifiers are set based on the order in which the nodes appear in the slurm.conf configuration file.

Each resource allocation consists of a set of unique nodes that are specified with their identifiers in the metadata. These are ordered from the lowest identifier to the highest. To support MPI and other distributed programming models, the system provides vectors that specify the way application processes are to be overlaid on a set of nodes. These vectors have the following format:

(vector, (identifier, count, processes), (identifier, count, processes), ...)

Each tuple in the specified vector is a subset of the mapping. The identifier value specifies the node in the allocation to start with. The count value indicates for how many nodes to apply the mapping. The processes value determines how many processes to create in each node of the mapping. For example, if the system has 8 physical CPU cores per node, and the user launched an application that requires 20 processes in an allocation with 3 nodes, a dense mapping would look as follows:

(vector, (0, 2, 8), (2, 1, 4))

This concretely specifies that 8 processes should be created in each of the first two nodes of the allocation, while in the third and last node only 4 processes are to be created, for a total of 20 as required by the launch.

As may be deduced from the overview, there are multiple process groups during adaptation operations (refer to Fig. 9.1). In the case of an expansion, the first one is the preexisting group, followed by the expansion group. In the case of a reduction of resources, there is only the preexisting group. On the completion of an adaptation, there is the new resulting group. This means that the resource manager needs to specify three vector mappings in the expansion case and two vector mappings in the reduction case, so that the system knows how to overlay the processes on the changing node allocation. This vector data is also used by the MPI library: it uses the vector mapping to order the ranks of the processes. As explained before, the rank ordering technique clips higher ranks on reductions, and appends new ranks on expansions.

The correctness of this scheme depends on the incremental order of the node identification numbers. Consider the adaptation sequence presented in Fig. 9.2. Each square represents a node, with its unique identifier in the system as the first number, followed by its identifier in the allocation for the job. Each color represents a different node allocation. In adaptations 1 and 2, the node identifiers of each application are ordered incrementally, although in the second adaptation job 2 has its nodes fragmented. Fragmentation does not affect the correctness of the mapping; however, if the node identifiers are not ordered incrementally, the original scheme does not work. In adaptation 3, job 1 has an allocation with the sequence: 2,2; 3,3; 4,0; 5,1. The second digit of each pair indicates that its internal node identifiers are not incremental: 2, 3, 0, 1. This occurred because an expansion of its resources was performed and its expansion nodes had lower global identifiers than its preexisting nodes. A similar situation arises in adaptation 4 for job 3.

The issue is that the processes would be overlaid incorrectly in the original system, with the vector specification, because the node identifiers were always assumed to be in incremental order. Additional data was added to the launcher infrastructure and the algorithm was adjusted so that the vector specifications can be applied to arbitrary orders of nodes.
The format of the vectors was preserved, since this ensures compatibility with the elastic MPI library without any changes necessary. Preserving the vector format will also ensure that integrating other programming models in the future is less challenging.
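The following sketch shows how such a vector specification can be expanded into per-node process counts by indexing positions in the allocation's explicit node order, so that non-incremental identifier orders (as in adaptations 3 and 4 of Fig. 9.2) are unproblematic. The type and function names are illustrative and are not taken from the launcher code.

#include <stddef.h>

// One (identifier, count, processes) tuple of a vector specification.
// start_index refers to a position in the allocation's node order,
// not to a global node identifier.
typedef struct {
    int start_index;
    int node_count;
    int processes;
} map_tuple_t;

// Expand a vector specification into per-node process counts;
// procs_per_node[i] corresponds to the i-th node of the allocation.
void expand_vector(const map_tuple_t *tuples, size_t ntuples,
                   int *procs_per_node, size_t nnodes)
{
    for (size_t i = 0; i < nnodes; i++)
        procs_per_node[i] = 0;

    for (size_t t = 0; t < ntuples; t++)
        for (int n = 0; n < tuples[t].node_count; n++) {
            size_t idx = (size_t)(tuples[t].start_index + n);
            if (idx < nnodes)
                procs_per_node[idx] += tuples[t].processes;
        }
}

// Example: the dense mapping (vector, (0, 2, 8), (2, 1, 4)) from the
// text, i.e., 20 processes on 3 nodes with 8 cores each:
//   map_tuple_t dense[] = { {0, 2, 8}, {2, 1, 4} };
//   int procs[3];
//   expand_vector(dense, 2, procs, 3);   // procs = {8, 8, 4}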


Figure 9.2: Sequence of adaptations on 8 nodes that lead to node identifier orders that are not incremental in some of the presented allocations.

9.2 Elastic Batch and Runtime Scheduler

As can be seen in Fig. 9.1, the controller daemon (SLURMCTLD) from the original SLURM design is no longer present. Instead, it has been replaced by two separate components: the Elastic Batch Scheduler (EBS) and the Elastic Runtime Scheduler (ERS). The new resource-elastic scheduling algorithm that is split into these two components is described separately in Chap. 10.

The functions that were originally provided by the controller in SLURM, such as batch scheduling and security, are now managed by the EBS. The more dynamic behavior needed for the support of elastic MPI applications, such as node monitoring, performance modeling and the management of adaptation windows through the SRUN component, is implemented in the ERS. Here is a list with the most important changes that were introduced with the addition of the EBS and ERS components:

1. A new operation that initiates an adaptation through the SRUN instance of a job step: srun_realloc_message.

2. Supporting protocol messages and handlers for the new srun_realloc_message operation.

3. A new elastic scheduling algorithm that relies on performance modeling (covered in Chap. 10).

4. Extra metadata for elastic job entries in the elastic job queue. For example, a new job status that indicates that jobs are performing an adaptation: JOB_ADAPTING.

The srun_realloc_message is used to initiate the adaptation process of a single application in a job. A job may have multiple applications running simultaneously, each in what is called a job step. Each job step has its own instance of the SRUN binary and a set of SLURMSTEPD daemons. The number of SLURMD daemons remains at one per node in all situations.


The srun_realloc_message provides SRUN with the following information (a hypothetical sketch of such a payload is shown after this list):

• New nodes allocated to the application and the number of processes to create on them.

• Preexisting nodes from where processes need to be destroyed and how many processes to destroy in them.

• The full list of nodes that compose the new allocation.

• Updated SLURM credential object that is necessary for communication with the new expanded nodes.

• The set of vector process mappings mentioned earlier. These are used by the MPI library to correctly generate the communicators of each process group, as well as the future MPI_COMM_WORLD communicator.
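A hypothetical C sketch of such a payload is shown below. The field names and types are purely illustrative; the actual message is encoded with SLURM's internal protocol machinery and carries additional bookkeeping.

// Illustrative payload of a reallocation message sent to SRUN.
// None of these names appear in the actual SLURM-derived sources.
typedef struct {
    // expansion: nodes added to the job and processes to create on them
    char **expansion_nodes;
    int   *expansion_procs;
    int    expansion_count;

    // reduction: preexisting nodes losing processes and how many
    char **reduction_nodes;
    int   *reduction_procs;
    int    reduction_count;

    // the complete node list of the new allocation
    char **new_allocation;
    int    new_allocation_count;

    // updated credential for communication with the expanded allocation
    void  *credential;

    // vector process mappings for the preexisting, expansion and
    // resulting process groups (see Sec. 9.1.2)
    char  *mapping_preexisting;
    char  *mapping_expansion;
    char  *mapping_result;
} srun_realloc_msg_t;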

Only whole nodes are added or removed from a job during adaptations in the current implementation. Future developments may add the ability to schedule intra-node resources.

9.3 Node Daemons

The resource manager side of the algorithm used to support the MPI_PROBE_ADAPT operation is implemented in these daemons, based on the instructions forwarded to each participating node.

Given the design inherited from SLURM, when MPI processes need to be created at a node, its SLURMD first forks a SLURMSTEPD. This daemon then receives all required information to create the subgroup of processes that will run on the node. The PMI plugin is loaded by the SLURMSTEPD daemon. The PMI has been extended with additional operations that allow MPI processes to notify their SLURMSTEPD when:

1. Processes that are joining have opened a port and are waiting in the internal accept operation of MPI_COMM_ADAPT_BEGIN.

2. Both the joining and preexisting processes have completed their adaptation and will shortly exit MPI_COMM_ADAPT_COMMIT.

The first extension to the PMI is used by leader processes of joining groups. A leader process notifies its local SLURMSTEPD. This daemon then notifies the SRUN instance of the job step. The SRUN instance then proceeds to notify each of the SLURMD daemons running at preexisting nodes. These daemons then proceed to update their local MPI_PROBE_ADAPT metadata and start their side of the probe algorithm. The flowchart of the probe algorithm from the side of the daemons is presented in Fig. 9.3. Each daemon starts by setting the adaptation flag to true and then reading its local metadata. It then determines whether it is locally consistent, meaning that all local processes read the flag correctly and are now blocking while waiting for confirmation. The local status can be inconsistent when some processes are not blocking and have incremented their counter, indicating that they exited the probe operation too early. Each local consistency result is communicated to SRUN. All nodes are notified if one or more nodes reached an inconsistent state.


Figure 9.3: Probe operation at the SLURMSTEPD daemon.

If the preexisting processes are globally consistent, then all processes are released and can then enter the adaptation window; otherwise, a delay is inserted and all processes will be consistent the next time they call the probe operation.
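The daemon side of Fig. 9.3 can be summarized in the following sketch. It is a simplified illustration only: the helper functions are placeholders for SLURMSTEPD internals and for the SRUN protocol messages, and they do not exist in the real code base.

#include <stdbool.h>

// Placeholders for daemon internals and SRUN protocol interactions.
void set_adapt_flag(bool value);
bool read_local_metadata(void);
void notify_srun_consistency(bool locally_consistent);
bool get_global_consistency_from_srun(void);
void global_daemon_barrier(void);
void release_blocked_local_pids(void);
void set_delayed_flag(void);

// Sketch of the probe handling in a SLURMSTEPD daemon (see Fig. 9.3).
void handle_probe_adaptation(void)
{
    set_adapt_flag(true);                    // set MPI_ADAPT_TRUE

    bool local_ok = read_local_metadata();   // are all local processes blocked?
    notify_srun_consistency(local_ok);       // report the local result to SRUN

    bool global_ok = get_global_consistency_from_srun();
    global_daemon_barrier();                 // synchronize all daemons

    if (global_ok)
        release_blocked_local_pids();        // processes enter the window
    else
        set_delayed_flag();                  // retry at the next probe call
}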

The second extension to the PMI is used by the leader processes of adapted process groups. These groups are created as a result of successful completions of adaptation windows through calls to the MPI_COMM_ADAPT_COMMIT operation. Leader daemons (those where the MPI process with rank 0 resides) send a notification message to the SRUN instance of their parallel application. The SRUN instance then forwards it to the scheduler. The resource manager handles these messages by updating the status of the adapting job from JOB_ADAPTING to JOB_RUNNING, making the application eligible again for resource adaptations. Additionally, the resource manager provides an updated job credential to the SRUN instance as a response.


9.4 Launcher for Elastic Jobs

The final component of the resource manager that has been extended is the SRUN program. Most of the operations that are initiated by either the scheduler or any application process (via the PMI) are handled partially by SRUN. Here are some of the new protocol messages that SRUN can now handle:

1. Reallocation message received from the controller daemon.

2. Notification that joining processes are ready and waiting in the MPI_COMM_ADAPT_BEGIN operation.

3. Notification that an ongoing adaptation window was completed through a successful MPI_COMM_ADAPT_COMMIT.

In addition to these handlers, SRUN has also been extended to manage the IO redirection of joining processes. In the original implementation, these were set up only once, when the user launched the application; SRUN can now manage redirections dynamically as processes are created and destroyed.

Another important change to SRUN is a new set of operations that enable the reconfiguration of its Tree Based Overlay Network (TBON) with the daemons of its current allocation. All communication between SRUN and these daemons is done through this network to ensure scalability. However, with each resource adaptation, the number of nodes in an allocation may be reduced or increased; this requires a reconfiguration of the TBON. The reconfiguration is performed in a distributed manner between SRUN and its daemons on each adaptation.

The design inherited from SLURM, where SRUN needs to run in the master node of an allocation, is a limitation for elastic execution models such as the one presented in this work: the SRUN binary has to remain in the same node throughout the execution of a job. This means that migrating applications to a completely new set of nodes is currently impossible. It may be desirable to add migration functionality to SRUN in the future, since the communication performance of MPI can be improved by relocating applications to nodes that are more closely clustered in the network of an HPC system, as they become available.

10 Monitoring and Scheduling Infrastructure

Distributed computing systems are expected to deliver performance that is commensurate with their available hardware resources. This is achieved by the optimization of system-wide performance metrics. The optimization of these performance metrics is a task usually delegated to schedulers. In the case of distributed systems, schedulers take as input the jobs to be performed and the set of available compute resources. They produce as output the job startup order and the resources where the jobs will be executed. These outputs are referred to as schedules. Schedules affect the performance of individual applications and of whole systems, and therefore determine the quality of schedulers.

The terms resource manager and scheduler are sometimes used interchangeably. In reality, these are different software components that are often bundled together due to their equal importance. Distributed systems need both a resource manager and a scheduler in order to share their resources with their users in a fair and efficient manner. In previous chapters, the resource manager and the unique features that allow it to support elastic jobs were introduced. In this chapter, the scheduler and its unique features for optimizing application and system-wide efficiency metrics are discussed. This scheduler takes advantage of the support for elastic jobs provided by the resource manager.

As seen previously in Fig. 9.1, the scheduler is divided into two smaller sub-schedulers that closely interact: the Elastic Batch Scheduler (EBS) and the Elastic Runtime Scheduler (ERS). Unfortunately, the EBS was not developed in time to be described in this document. Because of this, only the ERS and the measurement infrastructure will be described in detail. The role of the EBS and its expected interaction with the ERS will be covered briefly.

This chapter begins with an introduction to the general multiprocessor scheduling, batch scheduling and runtime scheduling problems. The additional features of the resource manager that provide performance information to the Elastic Runtime Scheduler (ERS) are described afterwards. Finally, the ERS and its current scheduling heuristic are described.

10.1 Theoretical Background on Multiprocessor Scheduling

The general multiprocessor scheduling problem is stated in an abstract manner in this section. The problem statement for batch scheduling with static resource allocations is presented after that, together with a short discussion on the taxonomy of schedulers and how this problem is classified within it. This problem statement is then extended to fit the more specific elastic scheduling problem addressed in this work. New requirements are identified from the new problem statement.

10.1.1 Problem Statement

Multiprocessor scheduling is an optimization problem that can be stated verbally as follows: given a set of tasks to be completed and a set of resources that can complete them by some means, find an assignment of tasks to resources that optimizes a set of objective functions. The tasks are bounded in time and may collectively require more resources than are available simultaneously; therefore, the assignment of tasks to resources may also require an order. Different orders can produce different outputs of the objective functions.

We can define the problem of scheduling more rigorously. Let T be a set of tasks t_i, where the subscript i ∈ ℕ identifies each task uniquely; this set may or may not be finite. Similarly, let R be a set of m resources r_j, where the subscript {j ∈ ℕ | j < m} identifies each resource uniquely. One or more resources in R can perform the tasks in T in some manner. If τ(t_i) ∈ ℝ is the maximum execution time and ρ(t_i) ∈ ℕ the number of resources required to perform a task t_i, then we can define multiprocessor scheduling as the following optimization problem:

\[
\begin{aligned}
\text{given inputs}\quad & T = \{\, t_i \mid \tau(t_i) < \infty \wedge \rho(t_i) \le m \,\},\quad R = \{\, r_j \mid j < m \,\}\\
\text{compute a schedule}\quad & S = \{\, t_i \mapsto \varrho_i \,\}\\
\text{that optimizes}\quad & \sum_{k=0}^{w} O_k
\end{aligned}
\tag{10.1}
\]

The result of this optimization is a schedule S. The schedule is a set of mappings from individual tasks t_i to specific subsets of resources ϱ_i of size ρ(t_i), where ϱ_i ⊂ R. Tasks where ρ(t_i) > m are impossible to schedule and therefore not considered.

Objective functions typically produce single scalar values in ℝ within the range [0, ∞). By optimizing (either minimizing or maximizing) the sum of the output of each objective function O_k, where {k ∈ ℕ | k < w}, the quality of the produced schedule can be improved. Different objective functions can evaluate the quality of full schedules S or of individual mappings t_i ↦ ϱ_i. This allows schedulers to optimize based on system-wide metrics, performance metrics of individual applications, or both.

The sum of all required resources of the tasks in T may exceed the total number of resources m in R:

\[
\sum_{t_i \in T} \rho(t_i) > m
\tag{10.2}
\]

In such a condition, not all tasks can be started simultaneously at the earliest time of the schedule {δ_0 ∈ ℝ | δ_0 > 0}. Because of this, both a starting time and a duration need to be added to each mapping in the schedule when resource sharing is not allowed. Each mapping then becomes a reservation of resources with a starting time δ_i ≥ δ_0 and the duration of its task τ(t_i), in addition to its set of unique resources ϱ_i. A schedule S then becomes:

\[
S = \{\, t_i \mapsto \langle \varrho_i,\ \delta_i,\ \tau(t_i) \rangle \,\}
\tag{10.3}
\]

This modification to S can be inserted into the initial optimization problem definition (Equation 10.1) to indicate that schedules need to be produced with these additional timing specifications.

10.1.2 Computational Complexity

The theoretical complexity of the multiprocessor scheduling problem can be determined with the aid of complexity theory. The goal is to determine the asymptotic complexity of the optimization problem based on its inputs, i.e., a bound on the number of steps that possible algorithms require to reach a solution. Thankfully, this topic has been of great interest to researchers and results from previous analyses [135, 98, 149, 142, 93] are available.

The multiprocessor scheduling problem belongs to a family of problems that have no known solutions of polynomial or better complexity [77, 84, 138, 214]. It is for this reason that current schedulers rely on approximation algorithms that are based on heuristics. These algorithms settle for solutions that are feasible but not necessarily optimal; the assumption is that in most cases adequate heuristics guide the approximations so that produced schedules approach optimal results, based on a set of objective functions.

10.1.3 Resource-Static Scheduling in Distributed Memory HPC Systems

A scheduling problem for specific compute systems can be classified, in a more concrete way, by several characteristics related to its set of tasks, its set of resources and the method used to generate the output schedule. There have been several efforts to create a taxonomy of scheduling problems [152, 106, 186, 58, 149]. The scheduling problem in distributed HPC systems is clearly defined [89, 126, 200, 166, 90] for current resource-static execution models. Current solutions consist generally of First-Come First-Serve (FCFS) batch scheduling with static allocations and backfilling.

Current supercomputing systems are usually shared among several researchers across multiple institutions. Individual tasks are submitted to these systems by their users in the form of batch job definitions. The arrival rate of these job definitions can be modeled with the aid of traffic theory. Batch job definitions include the number of resources required, their priority and their maximum execution time, among several other aspects that may not be as important to schedulers. Batch job definitions are entered in a queue. This queue represents the input task set T of the optimization problem 10.1.

The resources of current supercomputing systems tend to be similar. In most systems, the hardware on each node is identical. There may be cases where the nodes are internally heterogeneous (e.g., in the form of accelerators). In general, it can be assumed that all resources can handle all tasks similarly. A node is abstracted as a single resource in most cases. This means that in spite of the growing amount of parallelism internally at each node, schedulers only consider full nodes as resources, instead of subsets of cores or even accelerators where available.

The operation of schedulers is currently divided into two steps: batch scheduling and backfilling. The batch scheduling step scans a window of the job queue and attempts to start as many jobs as possible based on their priority. When a job cannot be started immediately, it may instead get a resource reservation in the future. Once this first step is done, the scheduler proceeds to the backfilling step: it attempts to start jobs that fit in the gaps of remaining idle resources. Jobs that are started during this second step should not delay the start of higher priority jobs that have reservations. The general strategy is illustrated in Fig. 10.1, which presents a scenario with four nodes and a job queue of six jobs in a priority based order. In the illustration, a schedule is computed where job 4 receives a reservation later than jobs 5 and 6 due to the availability of resources. In the same schedule, job 6 is scheduled early to minimize idle nodes through a backfilling operation.
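The two-step strategy can be illustrated with a strongly simplified sketch. It ignores priorities beyond queue order, reservations in continuous time and many other details of real batch schedulers such as SLURM's backfill plugin; it only shows the order of the two passes described above.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  nodes_required;
    bool started;
} job_t;

// Simplified two-step pass over a priority-ordered queue: FCFS until the
// first job that does not fit, then backfilling with the idle nodes that
// remain. Real schedulers also compute future reservations so that
// backfilled jobs cannot delay the blocked job.
void schedule_pass(job_t *queue, size_t njobs, int free_nodes)
{
    size_t i = 0;

    // Step 1: batch scheduling, start jobs strictly in queue order.
    for (; i < njobs; i++) {
        if (queue[i].nodes_required > free_nodes)
            break;                        // first blocked job: stop FCFS
        queue[i].started = true;
        free_nodes -= queue[i].nodes_required;
    }

    // Step 2: backfill smaller jobs into the remaining idle nodes.
    for (size_t j = i + 1; j < njobs; j++) {
        if (!queue[j].started && queue[j].nodes_required <= free_nodes) {
            queue[j].started = true;
            free_nodes -= queue[j].nodes_required;
        }
    }
}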


In summary, static batch scheduling with backfilling on current distributed systems has the following task set, resource set and algorithm properties:

• Task set:
  – Set properties:
    * Multiple users submit tasks
    * Tasks submitted randomly
    * Unbounded task capacity
    * Best effort First-In, First-Out (FIFO)
    * Tasks are removed on completion
  – Task properties:
    * Set of one or more tasks as jobs
    * Jobs are time bounded
    * Jobs and tasks are not periodic
    * Fixed number of resources specified
    * Jobs receive exclusive access to resources
    * No Service Level Agreements (SLAs)
• Resource set:
  – Symmetric Multiprocessing (SMP) nodes as resources
  – Nodes have identical hardware (homogeneous)
  – Nodes may have attached accelerators
  – No quality of service (QoS) support
  – Resources are finite and cannot be scaled on demand
  – Resources are located in a single building
  – Power and energy scaling features available
  – No job or task migration support
  – No fault tolerance support
• Algorithm:
  – Nodes as the units of resources
  – Job level scheduling (no task level scheduling)
  – Objective functions for mainly system-wide performance metrics
  – Two step resource-static scheduling:
    * Batch scheduling with priority based FIFO
    * Backfilling to minimize idle nodes
  – Scheduling without performance guarantees
  – Scheduling without reactive adjustments
  – Jobs cannot be preempted


Figure 10.1: Possible schedule of a set of static jobs ordered by priority in a queue.

10.1.4 Modified Scheduling Problem for Resource-Elastic Execution

The scheduling problem described so far applies to cases where only static allocations are possible. Static allocations mean that the resource reservation of a job stays constant throughout its execution. The scheduling problem needs to be updated if the resource allocation of a job can change during the runtime of its tasks; resources may increase (expansion), decrease (reduction), or the unique nodes allocated to a job may change while their total stays the same (migration).

The current scheduling problem, solved with batch scheduling and backfilling, needs to be modified to include the added flexibility of resource-elastic execution. Only the properties of the jobs in the task set need to be modified:

• Jobs have a range of feasible resource counts.
• Jobs have a time bound that is a function of their resources.

This modified scheduling problem remains very similar to the preexisting one due to only these two differences. All other properties mentioned in the previous section remain. Jobs still retain exclusive access to the resources in their allocations, although some resources may be added or removed from these allocations at runtime. Due to this, the time required for a job to complete becomes dependent on the number of resources over time. In general, jobs will still provide a worst case time bound as part of their description.

Although similar to the preexisting scheduling problem, these two differences in the properties of jobs add new requirements to the algorithm of a potential scheduler. In addition to the previous batch scheduling and backfilling steps, a scheduler for HPC systems with resource-elastic execution capabilities must also:

1. Continuously monitor the performance of the tasks of running jobs.
2. Adjust the resource allocations of jobs based on their observed performance.

In the proposed design, the first activity is delegated to the previously described infrastructure, while the second activity is delegated to the new Elastic Runtime Scheduler (ERS). Most of the traditional batch scheduling activities are still handled by a more traditional scheduler. The design will be covered in the next sections of this chapter.
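The kind of control loop implied by these two requirements can be sketched as follows. This is a schematic illustration only and not the ERS heuristic described later; all function names are placeholders.

// Placeholders only; none of these functions exist in the actual ERS.
int  system_is_running(void);
void collect_performance_feedback(void);
int  running_job_count(void);
int  proposed_node_delta(int job_index);
void apply_resource_adaptation(int job_index, int node_delta);
void wait_for_next_scheduling_interval(void);

// Schematic runtime control loop for resource-elastic scheduling.
void runtime_scheduling_loop(void)
{
    while (system_is_running()) {
        collect_performance_feedback();            // requirement 1: monitor

        for (int j = 0; j < running_job_count(); j++) {
            int delta = proposed_node_delta(j);    // nodes to add (+) or remove (-)
            if (delta != 0)
                apply_resource_adaptation(j, delta);   // requirement 2: adjust
        }

        wait_for_next_scheduling_interval();
    }
}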


10.2 Performance Monitoring Infrastructure

The performance of individual jobs is monitored by the infrastructure. The infrastructure is composed of the MPI library described in Chap. 6 and the resource manager components described in Chap. 9. Performance data is captured and a performance model is built. The performance model is then used to drive scheduling decisions.

The collection of data is performed in a hierarchical manner. At the lower level, each MPI library linked into each application process detects the structure of the computation in the local process and collects performance data. This structure is then reduced to a node-local representation by the SLURMD daemon at each node. Finally, the Elastic Runtime Scheduler (ERS) (described later in Sec. 10.3) performs a final reduction to create the individual performance model of the distributed application. The set of models of all running applications is used to drive scheduling decisions.

10.2.1 Process-Local Pattern Detection and Performance Measurements

At the process-local view, the MPI library linked to the process performs pattern detection and performance metrics evaluations. The pattern of computation is detected before any performance metric is determined, since these metrics will be attached to specific control flow locations only after they are detected. Process local operations are kept to a minimum once the pattern is detected.

Pattern Detection

Since the pattern detection is intended to occur during the actual production runs of applications, the minimization of its performance impact is of great importance. Because of this, the structure of the computation is detected based on markers introduced by the compilation wrappers provided by the MPI library (mpicc and mpifc in this case). There have been previous works that rely on backtracing the sequence of calls in a program to determine unique locations during execution. These are then used as identifiers for pattern detection [95, 27, 28, 29, 30, 124], such as loops, in MPI applications. The introduction of these markers at compilation time eliminates the overhead related to backtracing, although the technique is limited to binaries generated within a single software project.

The markers are inserted by splitting the compilation of objects into the emission of assembler and the final assembly step. Thankfully, most modern compilers have support for these operations. In the current implementation, the compiler wrapper works with Intel and GNU compilers. Versions 10.0 and later of the Intel compilers were tested, while versions 4.9 and later were tested for the GNU compilers. Other compilers were not tested, since those are the ones available in the SuperMUC system where this work was evaluated.

The current wrapper based technique relies on the way these compilers generate library calls in the emitted assembler. The actual API names of library calls are preserved when linking C based libraries. Fortunately, MPI is a pure C based library and its calls can easily be identified with text processing in the intermediate assembler of each target object of the compilation. Additionally, since the MPI standard requires that any operation with the MPI prefix be provided only by the MPI library in compliant programs, it is guaranteed that only MPI operations will be intercepted. The PMPI pattern can also be selected, to preserve support for any PMPI based profilers and tools.


Once the MPI calls are identified in the assembler, a unique ID is computed and inserted before the MPI call through an additional operation available in the Elastic MPI library. This operation is called MPI_T_set_call_site_identifier, and as its prefix MPI_T suggests, it belongs to the MPI tools interface. This tooling call sets the identifier for the device layer of the layered software architecture inherited from MPICH (as described in Chap. 5). The operation sets an integer identifier that is later read by the library at each individual MPI operation. This identifier establishes the uniqueness of the call site without the need for backtracing; a conceptual source-level illustration of this instrumentation is shown after the following list.

The MPI library relies on these markers to detect the structure of the computation at runtime. There have been several algorithms developed to detect patterns in sequences [145, 213, 208, 114, 180]. A pattern detection algorithm that was originally designed to analyze programs from decompilation fits this pattern detection use case well [221]; this algorithm is also used in other recent related works [28].

The pattern detection algorithm was implemented within the Elastic MPI library. In the current implementation, the algorithm provides the following output information to the runtime system, based on the current partial sequence of call site identifiers provided to it as input:

1. The detected Control Flow Graph (CFG).

2. Each node of the CFG is annotated with its number of revisits.

3. Nodes that are the heads of unique loops are marked.

4. Nodes that are the tails of unique loops are marked.

5. Nodes that are reentry points from nested loops are marked.
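Returning to the call site markers described before this list, their effect can be illustrated at the source level. The snippet below is only a conceptual illustration: the real markers are injected into the emitted assembler rather than into the C source, the identifier values are chosen by the wrapper, and the single integer argument of MPI_T_set_call_site_identifier is an assumption based on the description above.

// Conceptual view of the instrumentation performed by the compiler
// wrapper (the actual insertion happens in the emitted assembler).
#include <mpi.h>

void halo_exchange(double *buf, int count, int left, int right)
{
    MPI_T_set_call_site_identifier(42);   // 42: example call site ID
    MPI_Send(buf, count, MPI_DOUBLE, left, 0, MPI_COMM_WORLD);

    MPI_T_set_call_site_identifier(43);   // 43: example call site ID
    MPI_Recv(buf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
}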

The detection logic is only available when the MPI application has been initialized with the MPI_INIT_ADAPT operation. In addition to this, the detection algorithm is disabled at the start of applications. It is only enabled after a running application performs its first MPI_PROBE_ADAPT call. It should be noted that EPOP applications disable this feature entirely through an MPI_T extension, since these provide the structure of applications at launch time and therefore do not need to rely on any form of pattern detection.

The CFG update routine is called at relevant MPI operations with their unique identifiers and types. There are different operation types for point-to-point, one-sided, collectives, MPI-IO, etc. The system does not perform CFG updates inside adaptation windows. The MPI library has an operation that serializes its local CFG to a shared memory segment, where it can then be read directly by the local daemon. Unique blocks of shared memory are dedicated to each MPI rank in the node.

An example can be used to better explain the algorithm's behavior. Listing 10.1 shows the log output of a single MPI process given the sequence of identifiers:

2 0 6 3 1 6 3 1 6 3 1 9 7 9 7 3 1 6 3 1 6 3 1 9 7 9 7

The detector can produce a text representation of its current CFG, in tabular form, as logging output. Listing 10.2 shows the detected CFG that matches the previous sequence. Each output row represents a node in the CFG. The first column is the address in the local memory of the process. The second column is the identifier number. After that, the loop head flag (H), the loop body flag (B), the reentry counter (Re) and the revisit counter (Rv) are provided. The final two columns provide the tail data of loop heads, and the head data of loop body nodes. As seen in Listing 10.1, there is also a time differential (TD) computed at each step.

83 10 Monitoring and Scheduling Infrastructure at each step. In the current implementation, the time resolution of this differential is in nanoseconds. The time of creation is set each time a new node is added to the CFG. Total differential times from head nodes are accumulated on node revisits. The average distance in time from the head node of a loop to any node in the body can therefore be computed by dividing the accumulated differential by its total number of revisits. Figure 10.2 presents a graphical depiction of the text based CFG output. Reverse arrows on the left side of the figure represent loops, while the reverse arrow on the right represents a reentry. The time taken at each MPI block is represented as its vertical length. The time of the compute blocks can be computed by subtracting the MPI times from the differential from preceding MPI operations. Their time is also represented by their vertical length in the figure. In summary, all necessary data is included so that such a graph can be computed by the local daemon from the serialized CFG data. 0 : root id : 2 1: id: 0; detected: 0; −> NOT in a loop; (TD: 4638) 2: id: 6; detected: 0; −> NOT in a loop; (TD: 10243) 3: id: 3; detected: 0; −> NOT in a loop; (TD: 14440) 4: id: 1; detected: 0; −> NOT in a loop; (TD: 17938) 5: id: 6; detected: 1; −> head: 6; (TD: 22178) 6: id: 3; detected: 1; −> head: 6; (TD: 26174) 7: id: 1; detected: 1; −> head: 6; (TD: 30090) 8: id: 6; detected: 1; −> head: 6; (TD: 33756) 9: id: 3; detected: 1; −> head: 6; (TD: 37407) 10: id: 1; detected: 1; −> head: 6; (TD: 41180) 11: id: 9; detected: 0; −> NOT in a loop; (TD: 44758) 12: id: 7; detected: 0; −> NOT in a loop; (TD: 48493) 13: id: 9; detected: 1; −> head: 9; (TD: 52336) 14: id: 7; detected: 1; −> head: 9; (TD: 56155) 15: id: 3; detected: 1; −> body re−entry; head: 6; (TD: 60054) 16: id: 1; detected: 1; −> head: 6; (TD: 63853) 17: id: 6; detected: 1; −> head: 6; (TD: 67418) 18: id: 3; detected: 1; −> head: 6; (TD: 70916) 19: id: 1; detected: 1; −> head: 6; (TD: 74361) 20: id: 6; detected: 1; −> head: 6; (TD: 77798) 21: id: 3; detected: 1; −> head: 6; (TD: 81239) 22: id: 1; detected: 1; −> head: 6; (TD: 84788) 23: id: 9; detected: 1; −> head re−entry; head: 9; (TD: 88710) 24: id: 7; detected: 1; −> head: 9; (TD: 92452) 25: id: 9; detected: 1; −> head: 9; (TD: 96131) 26: id: 7; detected: 1; −> head: 9; (TD: 99669) Listing 10.1: Step by step updates based on the specified ID sequence.

Current detected Control Flow Graph (CFG):
0x030; id: 2; H: 0; B: 0; Re: 0; Rv: 0; tail:      ; head:
0x2b0; id: 0; H: 0; B: 0; Re: 0; Rv: 0; tail:      ; head:
0x310; id: 6; H: 1; B: 0; Re: 0; Rv: 4; tail: 0x3d0; head:
0x370; id: 3; H: 0; B: 1; Re: 1; Rv: 5; tail: 0x550; head: 0x310
0x3d0; id: 1; H: 0; B: 1; Re: 0; Rv: 5; tail:      ; head: 0x310
0x4f0; id: 9; H: 1; B: 0; Re: 0; Rv: 3; tail: 0x550; head:
0x550; id: 7; H: 0; B: 1; Re: 0; Rv: 3; tail:      ; head: 0x4f0

Listing 10.2: Example CFG detected based on the specified ID sequence.
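For reference, a CFG node can be pictured as carrying exactly the fields shown in the columns of Listing 10.2 plus the timing data discussed above. The following struct is a hypothetical sketch of such a node; the field names are illustrative and do not correspond to the actual Elastic MPI data structures.

/* Hypothetical CFG node layout matching the columns of Listing 10.2. */
typedef struct cfg_node {
    int              id;           /* call-site identifier                    */
    int              is_head;      /* H: node is the head of a unique loop    */
    int              is_body;      /* B: node belongs to a loop body          */
    int              reentries;    /* Re: reentry counter for nested loops    */
    int              revisits;     /* Rv: number of revisits of this node     */
    struct cfg_node *tail;         /* for loop heads: their tail node         */
    struct cfg_node *head;         /* for body nodes: their loop head         */
    long long        created_ns;   /* creation time in nanoseconds            */
    long long        accum_td_ns;  /* accumulated time differential from head */
    struct cfg_node *next;         /* next node in detection order            */
} cfg_node;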


Performance Measurements

The MPI library starts to record performance data once the heads and tails of one or more loops are detected. Currently two performance metrics are recorded:

1. Total Loop Time (TLT)

2. Total MPI Time (TMT)

The TLT metric is the total time spent in the detected loop. The TLT metric can be computed at each loop, including nested loops. The TLT metric is computed from two real numbers. The first one is its creation time. This time is set for each node in the CFG structure regardless of its type. The second one is the last visit time. The MPI library does not perform any further operations for this metric. Instead, the data is provided as it is to the local daemon once requested. The daemon is expected to perform the subtraction of these values for the total accumulated time, and to divide this value by the number of visits (revisits plus one) to get the average.

The second metric is the Total MPI Time (TMT). The TLT is inclusive of this time. This time is the difference between the entry and the exit times of each MPI call. In contrast to the TLT, these times are not stored in the CFG nodes where they are computed; instead, this metric is aggregated in the loop head node of the enclosing loop. There is no recursive search for the loop head in nested loops. The average can be computed by dividing the aggregated times by the total number of visits to their loop heads.

Figure 10.2: Process-local Control Flow Graph (CFG) representation.
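A minimal sketch of how the local daemon could derive the average loop and MPI times from this data is shown below; the structure and field names are hypothetical, and only the arithmetic follows the description above.

/* Hypothetical per-loop-head record as read by the local daemon
 * (time units as provided by the library). */
typedef struct {
    double creation_time;    /* set when the loop head node was created */
    double last_visit_time;  /* updated at each loop head reentry       */
    double mpi_time_sum;     /* TMT aggregated at the loop head         */
    long   revisits;         /* revisit counter of the loop head        */
} loop_head_data;

/* Average loop time: TLT divided by the number of visits (revisits + 1). */
static double avg_loop_time(const loop_head_data *h)
{
    return (h->last_visit_time - h->creation_time) / (double)(h->revisits + 1);
}

/* Average MPI time per loop iteration, from the aggregated TMT. */
static double avg_mpi_time(const loop_head_data *h)
{
    return h->mpi_time_sum / (double)(h->revisits + 1);
}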

10.2.2 Node-Local Reductions and Performance Data Updates

Once a loop is detected, the library switches to a mode of CFG verification and performance data collection. As mentioned before, each process serializes its CFG data in its own shared memory segment. Each process notifies its local daemon on the following events:

• Loop detected

• Unexpected loop exit

• Unexpected loop reentry

These events occur in the sequence presented in Listing 10.1: loop detections occur in steps 5 and 13, an unexpected loop exit occurs in step 11, and an unexpected loop reentry is encountered in step 15. All of these create changes in the CFG and therefore need to be communicated to the local daemon. These events tend to be more common during the initialization of MPI applications, and settle after a while. Expected reentries into loop bodies or loop heads do not generate any events, since they do not trigger changes in the CFG. The library instead continues updating performance data without notifying its local daemon, as long as there are no changes to the CFG.

The number of notifications to the local daemon is limited by the sampling timer, which currently defaults to one minute. This minimizes synchronization overheads, especially during the initialization of an application. If one or more loop detection or break events occur between timer expirations, the local daemon is notified only once.

Performance data is updated separately from the CFG. It is updated periodically on each expiration of the sampling timer. Updates are only produced at the next loop head reentry, and not at an arbitrary MPI operation. Each metric specifies the identifier of its loop head, since more than one loop may be detected. The local daemons do not read the performance data periodically. Instead, the latest data is read on demand when requests from the Elastic Runtime Scheduler (ERS) are received. These requests also have a field that optionally specifies a new value for the sampling timer. This enables the ERS to adjust the frequency of data collection per application, based on previous performance data and trends.

Figure 10.3: Set of four CFGs at a node before reduction.

Node-Local CFG Reduction

The daemon of a node keeps track of the notifications generated by each of its MPI processes. When any of its local processes have notified that their CFGs have been updated, it proceeds to read them and to perform a CFG reduction operation. The reduction operation depends on the order and type of the operations in each CFG.


The following rules are applied to the collected CFGs to produce a reduction:

1. Nodes outside of loops are ignored.

2. Consecutive point-to-point or one-sided operations are collapsed.

3. Identical loops are combined into one with a process range.

The reduced CFG is then stored in the memory of the local daemon. It is populated with performance data before it is sent to the Elastic Runtime Scheduler (ERS) on each request. If a request is received from the ERS, but the CFG data is still unavailable, the response to the request has a field that indicates this.

An example set of four CFGs is presented in Fig. 10.3. All processes contain the loop from 6 to 1, but all except process 0 miss the nested loop with head 9 and tail 7. Rule 1 ignores the nodes 2 and 0. The MPI operations with identifiers 6 and 3 are of the point-to-point type. This means that they will be collapsed according to rule 2. All other operations are in loops. Finally, given rule 3, the loop from 6 to 1 will be clustered for ranks 0 through 3, while the loop from 9 to 7 will be kept separately for rank 0 only. The information on its reentry is preserved. This indicates that it is nested within the common loop, but only at rank 0. The result is presented in Fig. 10.4.

The three rules in the reduction algorithm can be justified. The first rule is justified by the fact that code that occurs outside of loops is not relevant to elastic execution. The creation of adaptation windows is only performed inside of loops. The second rule comes from the observation that the multiple point-to-point and one-sided operations used by MPI applications match logically across ranks. For example, it is common to observe branching based on the rank number of the local process in an MPI program to determine if the process will perform a send or a receive. These sends and receives can be matched as a single block of communication in a distributed view of the program, greatly simplifying the loop matching algorithm. The final rule produces the reduction based on similarity. It is essentially a form of compression.

Figure 10.4: Reduced CFG from Fig. 10.3.
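The following sketch illustrates how rules 1 and 2 could be applied to a rank-local CFG; the types and helper routine are hypothetical, and rule 3 is only outlined in a comment, since it operates across ranks.

typedef enum { OP_P2P, OP_ONESIDED, OP_COLLECTIVE, OP_IO } op_class_t;

typedef struct rnode {
    int           ids[8];    /* collapsed call-site identifiers, e.g. "6,3" */
    int           nids;
    int           in_loop;
    op_class_t    op_class;
    struct rnode *next;
} rnode;

static int collapsible(const rnode *a, const rnode *b)
{
    return (a->op_class == OP_P2P || a->op_class == OP_ONESIDED) &&
           (b->op_class == OP_P2P || b->op_class == OP_ONESIDED);
}

/* The list is assumed to start with a sentinel root node; removed nodes are
 * not freed in this sketch. */
static void reduce_local_cfg(rnode *root)
{
    for (rnode *n = root; n && n->next; ) {
        rnode *m = n->next;
        if (!m->in_loop) {                      /* rule 1: ignore non-loop nodes */
            n->next = m->next;
        } else if (n != root && n->in_loop && collapsible(n, m)) {
            for (int i = 0; i < m->nids && n->nids < 8; i++)
                n->ids[n->nids++] = m->ids[i];  /* rule 2: collapse consecutive
                                                   point-to-point/one-sided ops */
            n->next = m->next;
        } else {
            n = m;
        }
    }
    /* Rule 3 (not shown): after each rank's CFG has been reduced this way,
     * identical loops from different ranks are combined into a single loop
     * annotated with a process range. */
}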

Node-Local Performance Data Reductions

The sums of the TLT and TMT metrics of all processes in a loop are added to the data of the reduced loop head nodes. In contrast, the mode (the value that occurs most often) of the loop revisit counts is used. It is expected that, with enough revisits, a small difference in the number of measurements will not affect the mean of the metrics significantly.

10.2.3 Distributed Reductions and Performance Models

The Elastic Runtime Scheduler (ERS) generates requests for performance data that reach all the daemons of an application. The requests and responses are routed through the SRUN binary of the application, over the Tree Based Overlay Network (TBON) that it creates with the nodes of its application. In the responses to these requests, each daemon sends the reduced CFGs with the TLT and TMT metrics attached to each loop head. The final distributed view of the CFG of the application is then generated from these at the ERS. Matching loops are reduced by combining all of their TLT and TMT metrics. The union of the process sets is set as the final range. The final distributed representation of the earlier example is presented in Fig. 10.5.

Finally, the average loop time and MPI time metrics are computed based on the number of iterations of the loop heads and the TLT and TMT metrics provided. Additional memory is dedicated to store the mean, variance, minimum and maximum values of these final metrics. Additionally, a vector of their recent values is stored, to detect performance trends.

Figure 10.5: Final reduced CFG at the ERS from Fig. 10.4.

SPMD-Phase Performance Model

Currently only one type of performance model has been implemented: the SPMD-Phase model. When the system detects one or more distributed loops, it creates an SPMD-Phase performance model instance for the application. Applications that do not fit this model (i.e., that have no distributed loop) are currently ignored. SPMD-Phase models consist of a set of distributed loops and their performance metadata. In general, models are used by the scheduling heuristic to try to ensure that application phases remain in their efficient range of resources. This design decision comes from the observations first summarized in the motivation (Sec. 2.2.4).
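As a rough illustration, the per-application SPMD-Phase model instance kept at the ERS can be pictured as the following record; the field names and the history length are assumptions, and only the listed contents follow the description above.

#define TREND_WINDOW 16   /* assumed length of the recent-value history */

/* Hypothetical SPMD-Phase model entry for one distributed loop. */
typedef struct spmd_phase {
    int    loop_head_id;            /* identifier of the distributed loop head */
    int    rank_lo, rank_hi;        /* union of the contributing process sets  */
    double tlt_sum;                 /* combined Total Loop Time                */
    double tmt_sum;                 /* combined Total MPI Time                 */
    long   iterations;              /* iterations of the loop head             */
    double mean, variance, min, max;        /* derived loop time statistics    */
    double recent[TREND_WINDOW];    /* recent values used for trend detection  */
} spmd_phase;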

10.2.4 EPOP Integration

The capture of the CFG and the performance data is supported in the EPOP programming model through the driver programs and a special mode of operation without pattern detection for the MPI library and the infrastructure components. This mode of operation is set by an additional MPI tools operation: MPI T enable EPOP mode(). This operation is called by driver programs at launch, before calling the initialization routine of the loaded EPOP program.

Driver programs in EPOP generate CFGs from the EPOP program definitions. This eliminates the overhead of the pattern detection techniques and the reductions. In addition to this, the added certainty allows the MPI library to operate in verification mode from the start of the application. Loop head TLT and TMT metrics are still computed by the MPI library. This means that the sampling timer is still controlled by it, and not by the driver program, based on the instructions received from the Elastic Runtime Scheduler (ERS). All of the techniques are otherwise used in the same manner for EPOP programs.

10.3 Elastic Schedulers

Instead of completely replacing the current solution for systems with static execution models (batch scheduling with backfilling), the preservation of several aspects of these successful techniques is proposed. The extra flexibility of resource-elastic execution models is addressed separately. For this purpose, the scheduler has been separated into two: the Elastic Batch Scheduler (EBS) and the Elastic Runtime Scheduler (ERS).

The EBS has not been implemented in this work, and will instead be developed as part of future work. The expected functionality of this scheduler will be mentioned where relevant. In its expected design, the EBS replaces the batch and backfilling techniques with moldable batch scheduling. Moldable batch scheduling is performed on resource offers provided by the ERS, instead of on the global view of resources. Resource offers depend on the availability of nodes due to job terminations and resource adaptations on running jobs. That means that any job start or reservation decision depends on the evolution of resource allocations on the set of running jobs. In contrast, previous systems performed scheduling decisions that depended only on job completions and reservations.

The lack of a batch scheduler means that MPI applications are started directly with the SRUN launcher, and not the SBATCH command. It also means that instead of queues, each individual SRUN command blocks until the Elastic Runtime Scheduler (ERS) has enough resources to launch its job step. In summary, SRUN will be the source of jobs in the evaluation and jobs will consist of only one job step.

The ERS adds the two additional capabilities that were previously identified for elastic execution support (Sec. 10.1.4): it continuously monitors application performance and periodically adjusts the resources of running jobs as necessary. This scheduler relies on the additional performance data collection facilities in the infrastructure (described earlier in this chapter) to create performance models that drive the resource adaptations. In the remainder of this section, the current performance model and the Elastic Runtime Scheduler (ERS) are described in detail.

10.3.1 Elastic Runtime Scheduler (ERS)

As mentioned before, the Elastic Runtime Scheduler (ERS) fulfills the extra requirements of resource-elasticity (Sec. 10.1.4). These extra requirements are a consequence of the added flexibility of malleable jobs. The design decisions made when developing this scheduler were motivated by the following observations:

1. The scalability of a distributed application with respect to its allocated resources is input dependent.

2. Empirical and history-based methods for performance prediction require performance measurements at multiple resource allocation sizes.

3. Machine learning research that is applicable to job scheduling is still in its early stages; additionally, these techniques require additional storage and databases. These requirements would add premature complications to the current prototype.

4. It is desirable to be able to optimize applications that are running for the first time, as well as applications with new input sets.

5. In addition to backfilling, expansions of running jobs can be performed to minimize the count of idle nodes.

6. Both system-wide and individual application performance metrics must be optimized.


The first observation comes from the experience of running simulation codes that perform differently with different sized inputs, or even inputs of similar size but with different geometries. In addition to affecting their overall performance, the input size also determines the available parallelism of the application. Its available parallelism determines the amount of resources it can use efficiently.

The second observation is that history-based performance predictors require large collections of data. In the case of distributed systems, resource adaptations can involve the movement of large amounts of data and require expensive repartitioning sequences. This makes the collection of empirical data a lot more expensive computationally than on shared memory systems, where a reconfiguration does not involve the movement of memory over a network. A history of empirical data can be created as applications are executed. Given enough time a predictor may collect adequate amounts of data.

The third observation is that machine learning research, as it relates to scheduling on distributed memory systems, is at the moment limited. These techniques have great potential and may be applied in this domain in the near future. The storage requirement is related to the second observation; once it is in place and enough samples are collected, these methods will become feasible. Data collection may need to be repeated on new system installations, depending on the method.

The next observation is that the system should be able to handle new elastic jobs that have no history with acceptable efficiency. It should also be possible to efficiently execute jobs that execute only once. Applications are often run multiple times but each time with different input sets.

Another observation is that there is potential to further improve the minimization of idle nodes in distributed systems due to the added flexibility of elastic execution. With the addition of support for malleable jobs, it is easier to perform the backfilling operation by filling up idle nodes with resource expansions. Additionally, jobs can be started at different node counts. These are advantages over systems with support for only rigid jobs.

The final observation is that the overall efficiency of an HPC system depends on the efficiency of the individual running applications, and not only on idle nodes. System-wide and application efficiency metrics can be further improved with resource-elastic execution models.

The ERS performs scheduling decisions at a configurable rate. On each evaluation, the elastic scheduling algorithm performs the following steps:

1. Iterate the list of jobs and make a list of selected running jobs that are elastic and have performance data available.

2. Process the performance data to generate the performance model of the selected jobs.

3. Compute a range of optional and mandatory resource adaptations on the set of jobs.

4. Provide a resource offer based on the ranges to the Elastic Batch Scheduler (EBS).

5. Perform Elastic Backfilling (described later in this section).

6. Use the srun realloc message to apply individual resource adaptations.

7. Start any new jobs based on the batch definitions received from the EBS.

8. Wait until the system reaches a steady state before applying further resource transformations.
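A coarse sketch of one evaluation of this loop is given below; all types and helper routines are hypothetical stand-ins declared only as prototypes, and steps 4 and 7 remain placeholders until the EBS exists.

#include <stddef.h>

/* Hypothetical types and helpers mirroring the eight steps listed above. */
typedef struct scheduler_state scheduler_state;
typedef struct job_list job_list;
typedef struct { struct job *job; int cur_nodes, target_nodes; } rrv_entry_t;
typedef struct { rrv_entry_t *entry; size_t count; } rrv_t;

job_list *select_elastic_jobs_with_perf_data(scheduler_state *s);   /* step 1 */
void      build_spmd_phase_models(job_list *jobs);                  /* step 2 */
rrv_t     compute_resource_range_vector(job_list *jobs);            /* step 3 */
void      offer_resources_to_ebs(scheduler_state *s, rrv_t *rrv);   /* step 4 */
void      elastic_backfilling(scheduler_state *s, rrv_t *rrv);      /* step 5 */
void      send_srun_realloc_message(struct job *j, int new_nodes);  /* step 6 */
void      start_jobs_from_ebs_definitions(scheduler_state *s);      /* step 7 */
void      wait_for_steady_state(scheduler_state *s);                /* step 8 */

static void ers_evaluate(scheduler_state *s)
{
    job_list *selected = select_elastic_jobs_with_perf_data(s);
    build_spmd_phase_models(selected);
    rrv_t rrv = compute_resource_range_vector(selected);
    offer_resources_to_ebs(s, &rrv);      /* no-op until the EBS exists */
    elastic_backfilling(s, &rrv);
    for (size_t i = 0; i < rrv.count; i++)
        if (rrv.entry[i].target_nodes != rrv.entry[i].cur_nodes)
            send_srun_realloc_message(rrv.entry[i].job, rrv.entry[i].target_nodes);
    start_jobs_from_ebs_definitions(s);   /* not yet implemented */
    wait_for_steady_state(s);
}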



Figure 10.6: Efficiency (top) and MPI time to compute time ratio (bottom) of a Cannon matrix-matrix multiply kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented. A line is added for the constant 0.1 boundary of the ratio.

The remainder of this section will be dedicated to the description of steps 2, 3, and 5. Step 1 is trivial, since the scheduler simply iterates the list of running jobs and makes an additional list of those that are marked as elastic and have their performance data populated. Steps 4 and 7 are not yet implemented, since they depend on the availability of the EBS. As mentioned before, the EBS has not been implemented as part of this work, and is instead postponed as future work. Step 6 has been described extensively already in Chap. 9, where the design of the elastic resource manager and its interaction with the elastic MPI library have been documented. The final step has also been described in that chapter.

When a transformation is triggered on a job via the srun realloc message command, its status changes from JOB RUNNING to JOB ADAPTING. With the MPI COMM ADAPT COMMIT operation, each application notifies the resource manager when its adaptation is completed. Its job record is then updated from the status JOB ADAPTING back to JOB RUNNING; this state change marks the application as eligible for adaptations again and makes its released resources available for other jobs. At that moment, the resource manager updates the credentials for SRUN based on the new allocation.


10.3.2 Performance Model and Resource Range Vector (RRV)

As mentioned before, in step 2 of the scheduling algorithm a performance model is generated based on the performance data collected. In step 3, the ranges of optional and mandatory resource adaptations are produced. In this section, both of these steps are explained based on the performance model and heuristic implemented. The performance model is used to conservatively predict the maximum number of nodes that can be allocated to the application while preserving an acceptable level of efficiency. Efficiency is evaluated individually for each application, and not based on general expectations.

A heuristic was developed based on empirical data collected from different distributed computational kernels used in simulations. These include: block-tridiagonal iterative solvers, fast Fourier transform (FFT), embarrassingly parallel problems, matrix-matrix multiplication, matrix-vector multiplication, dense linear system solvers, sorting algorithms, conjugate-gradient iterative solvers, multi-grid iterative solvers, matrix decomposition, and others. The data collected shows a correlation between the proportion of MPI time in computational kernels and their individual efficiency metrics.

Figure 10.6 shows concrete values for a Cannon matrix-matrix multiplication kernel. As can be observed, regardless of the input size, the efficiency metric (matrix elements per second per process in this case) drops drastically when the ratio of MPI time to compute time exceeds a certain upper threshold. It can also be observed that a wide range of applications remain efficient below a certain lower threshold. Based on the empirical data, a value of 0.1 has been set for the upper threshold and a value of 0.01 has been set for the lower threshold, as defaults. These have been found to manage the behavior of the evaluated applications in the SuperMUC HPC system well.

A Resource Range Vector (RRV) is generated based on the average and trend values of the MPI to compute time (MTCT) metric of the current distributed loop of an application. The average and trend for this metric are reset after each resource adaptation of the application. Recent MTCT values are stored to compute the recent trend of the metric (tMTCT) within a configurable time window. The RRV is generated as follows:

• If any of the MTCT values is above the upper threshold, set the resource range to half of its current value.

• If one of the MTCT values is between the upper and lower thresholds while the other is below, remove the application from the set of candidates.

• If both MTCT values are below the lower threshold, then set the range to double its current value.

The difference between the upper and lower thresholds creates a band that helps prevent resource adaptation oscillations. The doubling and halving of resources can be replaced with more precise values once better performance models are developed in the future. The algorithm tries to keep the values of the MTCT metric below the upper threshold for all applications. The reduction of resources is mandatory, while resource expansions are optional. This prioritizes the startup of applications in the queue over expanding running ones: resources are eagerly reduced and conservatively expanded. The resources of an application are not lowered below their specified minimum, since a minimum node count may be due to memory requirements and not performance.
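A minimal sketch of this rule set, assuming a simple per-job RRV entry and using the default thresholds named above, is shown below; the structure and field names are hypothetical.

#define MTCT_UPPER 0.1
#define MTCT_LOWER 0.01

typedef struct {
    int min_nodes, cur_nodes;
    int target_nodes;   /* proposed node count                        */
    int mandatory;      /* reductions are mandatory, expansions not   */
    int candidate;      /* job participates in this scheduling round  */
} rrv_entry;

static void update_rrv_entry(rrv_entry *e, double mtct, double tmtct)
{
    e->candidate = 1;
    if (mtct > MTCT_UPPER || tmtct > MTCT_UPPER) {
        /* Too much MPI time: mandatory reduction to half the nodes,
         * but never below the job's specified minimum. */
        e->target_nodes = e->cur_nodes / 2;
        if (e->target_nodes < e->min_nodes) e->target_nodes = e->min_nodes;
        e->mandatory = 1;
    } else if (mtct < MTCT_LOWER && tmtct < MTCT_LOWER) {
        /* Plenty of headroom: optional expansion up to double the nodes. */
        e->target_nodes = e->cur_nodes * 2;
        e->mandatory = 0;
    } else {
        /* Inside the band between the thresholds: no adaptation. */
        e->candidate = 0;
    }
}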


10.3.3 Elastic Backfilling

The elastic backfilling algorithm is designed to reduce the number of idle nodes by taking advantage of the extra flexibility provided by resource-elasticity. Elastic backfilling is performed in step 5 of the Elastic Runtime Scheduler (ERS) algorithm. The algorithm takes the following parameters of each job description as input:

1. Current node count.

2. Estimated time of completion.

3. Its resource range (from its entry in the RRV).

4. Average adaptation time.

The estimated time of completion is computed linearly from the completion time provided by the user and the starting node count. This value is currently set with the -t or --time option of the SRUN launcher. The minimum number of nodes is set through an extension that reads an environment variable. Both of these values will be specified in additional options of batch descriptions once the Elastic Batch Scheduler (EBS) is developed. The third input is produced by the heuristic explained previously. It can contain a reduction of nodes or an optional expansion. The average adaptation time is computed from previously measured adaptations. It is not available if the application has not adapted before. This time is measured from the moment the srun realloc message is sent to when the status flag of the application is changed back to JOB RUNNING.

The algorithm performs two basic operations: time balancing and resource filling. These operations are applied to sets of jobs and are described in the remainder of this section.

Time Balancing

Time balancing is an operation that transforms the number of nodes of each job in a job set such that their completion times become as close as possible. This can lower the makespan of the schedule and reduce wait times. It is a transformation that can be described with linear algebra. In this subsection, the transformation is described for cases with two, three and four applications. Extrapolating from these, the technique can be understood for an arbitrary number of applications. This operation is only applied if applications are expected to retain their efficiency levels with the new resource allocations.

Consider the two jobs presented in Fig. 10.7. If t0 and t1 are the estimated completion times of jobs 0 and 1, and their node counts are n0 and n1 respectively, then a linear system can be solved to find a vector of x0 and x1 such that the expected completion times of both jobs match. This transformation assumes linear scaling; this can be assumed to be approximately true within efficient node ranges, if the performance model determines these accurately. Defining x0' = 1/x0 and x1' = 1/x1, the following linear system can be solved to get the scaling factors of the time balancing operation:

\begin{pmatrix} 1/t_0 & -1/t_1 \\ n_0 & n_1 \end{pmatrix} \begin{pmatrix} x_0' \\ x_1' \end{pmatrix} = \begin{pmatrix} 0 \\ n_0 + n_1 \end{pmatrix} \qquad (10.4)

After that, we can directly multiply x0' * n0 and x1' * n1 to get the scaled node counts and apply the time balancing operation. In the same manner, time balancing can be applied to a set of three applications.


Figure 10.7: Time balancing applied to two jobs.

Figure 10.8 presents a similar scenario, but with one extra application. The following linear system can be solved to obtain the three scaling factors x0', x1' and x2':

\begin{pmatrix} 1/t_0 & -1/t_1 & 0 \\ 0 & 1/t_1 & -1/t_2 \\ n_0 & n_1 & n_2 \end{pmatrix} \begin{pmatrix} x_0' \\ x_1' \\ x_2' \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ n_0 + n_1 + n_2 \end{pmatrix} \qquad (10.5)

Similarly, the following linear system can be solved to obtain the four scaling factors x0', x1', x2' and x3' for a set of four applications:

\begin{pmatrix} 1/t_0 & -1/t_1 & 0 & 0 \\ 0 & 1/t_1 & -1/t_2 & 0 \\ 0 & 0 & 1/t_2 & -1/t_3 \\ n_0 & n_1 & n_2 & n_3 \end{pmatrix} \begin{pmatrix} x_0' \\ x_1' \\ x_2' \\ x_3' \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \\ n_0 + n_1 + n_2 + n_3 \end{pmatrix} \qquad (10.6)

As can be deduced, these linear systems follow a simple pattern as the number of jobs to be transformed increases. The current implementation creates the matrices in augmented form based on the selected jobs and solves the linear system with Gaussian elimination. These systems are solved quickly even for transformations with thousands of jobs. The current number of idle nodes (Nidle) can be added to the node total in the last equation. For example, the last equation in the four-job example becomes:

x_0' n_0 + x_1' n_1 + x_2' n_2 + x_3' n_3 = n_0 + n_1 + n_2 + n_3 + N_{idle}

Solving the modified linear system makes the time balancing operation produce a resource scaling vector that fills these idle nodes as well. Since the scaling factors are real numbers, while node counts are natural numbers, a floor operation is applied to each of the results x0' * n0, x1' * n1, x2' * n2, etc. This means that the operation may produce a surplus of nodes, instead of zero, in some cases. Any remaining nodes can be filled with the resource filling operation described next.
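The following is a minimal sketch of this computation under the stated linear-scaling assumption: it builds the augmented matrix for k jobs (optionally including idle nodes) and solves it with Gaussian elimination. The function and parameter names are hypothetical, not the actual scheduler code.

#include <math.h>
#include <stdlib.h>

/* t[i] is the estimated completion time and n[i] the node count of job i.
 * On return, scale[i] holds x_i', so the new node count of job i is
 * floor(scale[i] * n[i]). idle is the current number of idle nodes. */
static void time_balance(int k, const double *t, const double *n,
                         int idle, double *scale)
{
    double *a = calloc((size_t)k * (k + 1), sizeof *a); /* augmented matrix */
    double total = idle;
    for (int i = 0; i + 1 < k; i++) {            /* rows 0 .. k-2           */
        a[i * (k + 1) + i]     =  1.0 / t[i];
        a[i * (k + 1) + i + 1] = -1.0 / t[i + 1];
    }
    for (int j = 0; j < k; j++) {                /* last row: node total    */
        a[(k - 1) * (k + 1) + j] = n[j];
        total += n[j];
    }
    a[(k - 1) * (k + 1) + k] = total;

    /* Gaussian elimination with partial pivoting. */
    for (int col = 0; col < k; col++) {
        int piv = col;
        for (int r = col + 1; r < k; r++)
            if (fabs(a[r * (k + 1) + col]) > fabs(a[piv * (k + 1) + col]))
                piv = r;
        for (int c = col; c <= k; c++) {         /* swap rows piv and col   */
            double tmp = a[col * (k + 1) + c];
            a[col * (k + 1) + c] = a[piv * (k + 1) + c];
            a[piv * (k + 1) + c] = tmp;
        }
        for (int r = col + 1; r < k; r++) {
            double f = a[r * (k + 1) + col] / a[col * (k + 1) + col];
            for (int c = col; c <= k; c++)
                a[r * (k + 1) + c] -= f * a[col * (k + 1) + c];
        }
    }
    for (int r = k - 1; r >= 0; r--) {           /* back substitution       */
        double s = a[r * (k + 1) + k];
        for (int c = r + 1; c < k; c++)
            s -= a[r * (k + 1) + c] * scale[c];
        scale[r] = s / a[r * (k + 1) + r];
    }
    free(a);
}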

Figure 10.8: Time balancing applied to three jobs.

Resource Filling

This operation takes the number of idle nodes and expands applications according to their estimated efficient maximum number of nodes, as determined by the performance model and heuristic. The resource filling operation is much simpler than time balancing. The algorithm starts by creating a list of jobs ordered by their remaining runtime, from highest to lowest. This reordering helps lower the makespan of the schedule. It then expands the jobs one by one until all idle nodes are filled or all candidate jobs reach their maximum node count. Figure 10.9 provides an illustration of this operation being applied to two jobs. In this case the completion times do not need to match. This operation is performed so that the idle node count is minimized, and it generally produces a reduction in the number of nodes available for potential application starts. Job priorities are not taken into consideration.

Figure 10.9: Resource filling applied to two jobs.
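A minimal sketch of the resource filling pass, assuming a simple job record with the fields named above, could look as follows; the types and names are hypothetical.

#include <stdlib.h>

typedef struct { double remaining_time; int cur_nodes, max_nodes; } job_t;

/* Order jobs by remaining runtime, from highest to lowest. */
static int by_remaining_desc(const void *a, const void *b)
{
    const job_t *x = a, *y = b;
    return (y->remaining_time > x->remaining_time) -
           (y->remaining_time < x->remaining_time);
}

/* Expand jobs one by one until the idle nodes are used up or every
 * candidate reaches its efficient maximum; returns remaining idle nodes. */
static int resource_fill(job_t *jobs, int njobs, int idle)
{
    qsort(jobs, (size_t)njobs, sizeof *jobs, by_remaining_desc);
    for (int i = 0; i < njobs && idle > 0; i++) {
        int grow = jobs[i].max_nodes - jobs[i].cur_nodes;
        if (grow > idle) grow = idle;
        jobs[i].cur_nodes += grow;
        idle -= grow;
    }
    return idle;
}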

Resource Scaling Vector (RSV)

As mentioned earlier, the Elastic Batch Scheduler (EBS) is currently not implemented. The Elastic Runtime Scheduler (ERS) produces schedules for jobs that are started with the SRUN command. Multiple applications can be launched simultaneously, and each SRUN command blocks until sufficient resources are available for the number of nodes required by its application. The priority of the jobs is assigned based on their arrival time, where earlier jobs have higher priority, following a First-Come First-Serve (FCFS) policy.

The ERS needs to produce a Resource Scaling Vector (RSV) to minimize the makespan of all the jobs, running or blocked with a reservation. The RSV specifies new node counts for a subset of the running jobs. This vector is applied to the running system through an srun realloc message per application. If the new node count is greater than the current value, then an expansion is started; in contrast, if the new node count is smaller, a reduction is started. If the new value is the same, then the message is not generated. This can occur due to rounding during any of the transformations, even though the individual application was a candidate for adaptations.

Figure 10.10: Possible schedule of a set of elastic jobs ordered by priority in the queue.

The elastic backfilling algorithm applies traditional backfilling together with the previously described time balancing and resource filling techniques. The following rules summarize the heuristic used to apply these three techniques.

1. If one or more higher priority jobs were delayed due to lack of resources:

   a) Select a set of running jobs with resources that add up to the number of required resources to start the high priority jobs. Include any idle nodes based on availability.

   b) If a set is found, apply time balancing to it.

   c) Create reservations on the time balanced nodes for the high priority jobs.

2. If after starting new jobs and applying time balancing there are still idle nodes:

   a) Select jobs that fit in the gaps and apply traditional backfilling.

   b) Apply resource filling to expand running jobs into any remaining idle nodes.
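A coarse sketch of how these two rules could be strung together is given below; all types and helper routines are hypothetical stand-ins for the operations described in this section, declared here only as prototypes.

typedef struct scheduler_state scheduler_state;
typedef struct { int njobs; struct job **jobs; } job_set;

int  has_delayed_high_priority_jobs(scheduler_state *s);
int  select_jobs_covering_required_nodes(scheduler_state *s, job_set *out);
void time_balance_set(job_set *set);               /* see time balancing   */
void reserve_balanced_nodes(scheduler_state *s, job_set *set);
int  idle_node_count(scheduler_state *s);
void traditional_backfill(scheduler_state *s);     /* start jobs that fit  */
void fill_with_expansions(scheduler_state *s);     /* see resource filling */

static void elastic_backfilling_pass(scheduler_state *s)
{
    if (has_delayed_high_priority_jobs(s)) {       /* rule 1 */
        job_set set;
        if (select_jobs_covering_required_nodes(s, &set)) {
            time_balance_set(&set);
            reserve_balanced_nodes(s, &set);
        }
    }
    if (idle_node_count(s) > 0) {                  /* rule 2 */
        traditional_backfill(s);
        fill_with_expansions(s);
    }
}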

The first rule is an attempt to minimize the wait times of high priority jobs. The second rule shares the same goal as traditional backfilling techniques: to try to minimize the number of idle nodes. It is better than backfilling only when resource-elastic jobs are available to further fill the gaps. As mentioned before, because the EBS is not available, its shim always returns that the queue is empty. This means that currently new jobs are never started with backfilling.

Figure 10.10 illustrates an alternative schedule produced thanks to the support of resource-elasticity in the presented prototype. The job queue matches that of Fig. 10.1. As can be observed, in this case job 2 has been started at an initially smaller node count and later expanded. This allows the start of the jobs based on their priorities, therefore ensuring fairness. Additionally, the makespan is shorter. These are clear improvements to fairness and performance when compared to the previous static schedule. Ensuring efficient per-application operation, with the techniques described in this chapter, is an additional benefit.

11 Evaluation Setup

The evaluation has been performed on the SuperMUC [13] petascale system. This supercomputer is managed by the Leibniz Supercomputing Center (LRZ) and is located in Garching, Germany. The resources of this HPC system are managed by an IBM Load Leveler resource manager.

There were some challenges encountered when testing the custom resource manager and communication library. As may be expected, it is not possible to replace the resource manager of an HPC system in production. Additionally, the new resource manager is composed of multiple daemons in a distributed memory setup. The system is shared among many users; this means that jobs need to wait for undetermined amounts of time in a queue. To overcome these challenges, a set of scripts and custom binaries was developed.

11.1 Elastic Resource Manager Nesting in SuperMUC

A set of scripts has been written to allow the presented resource manager to be bootstrapped inside a job allocation. The scripts parse the set of host names that were allocated with each Load Leveler job that is submitted for testing the infrastructure. A slurm.conf file is generated dynamically for each job. With the configuration in place, the set of daemons is started on each host of the allocation. A few seconds are allowed for the resource manager to bootstrap itself and become capable of supporting new application starts. An HPC cluster of the size of the Load Leveler resource allocation is emulated this way in SuperMUC. In summary, the set of hosts allocated for a Load Leveler job becomes a test parallel system. Applications built with the custom MPI implementation can be launched inside Load Leveler jobs once the custom resource manager has finished bootstrapping itself inside an allocation.

This setup has the disadvantage that different performance is observed with each job allocation, due to the different subset of nodes allocated by the Load Leveler in each run. There is not much control over the selection of the nodes. The job description may ask for the nodes to be allocated inside of a single island (a set of racks with lower latency across nodes), and not much else. In general, a different set of nodes is expected with each new test run. This requires that multiple tests be performed to smooth out the variability of measurements due to different node sets.

Another disadvantage is that, as normal job submissions to the Load Leveler, the test jobs need to wait in the job queue. Jobs will have variable wait times depending on the size of the queue and the number of resources requested. It is common to observe wait times of multiple days for test jobs requesting more than a hundred nodes. Because of this, only modest node counts (64 at maximum) were requested with test jobs.


11.1.1 Phase 1 and Phase 2 Nodes

The SuperMUC system has multiple types of nodes divided into two sets: Phase 1 and Phase 2. There are three types of Phase 1 nodes: Fat Nodes, Thin Nodes and Many Core Nodes. In this work, tests were performed only on Phase 1 Thin Nodes, and not on Fat or Many Core Nodes. Phase 1 Thin Nodes are based on Intel's Sandy Bridge architecture. Phase 2 nodes have only one type. These are all based on Intel's Haswell architecture.

Phase 1 nodes are based on a dual socket board with two Sandy Bridge-EP Xeon E5-2680 Central Processing Units (CPUs). Each of these CPUs has 8 physical cores, for a total of 16 per node, running at 2.7 GHz. These have support for Simultaneous Multi-Threading (SMT) and the feature is enabled. This must be taken into account in the configuration, since the operating system sees the extra threads as additional physical cores. Each of these cores has a peak performance of 21.6 billion double precision floating point operations per second (gigaFLOPS). SuperMUC has a total of 9216 nodes of this kind. These have provided 2.897 petaFLOPS of performance under the High-Performance Linpack (HPL) benchmark used for the top 500 supercomputers ranking [19].

Phase 2 nodes are also based on a dual socket board, but with two Haswell-EP Xeon E5-2697 Central Processing Units (CPUs). These have a higher core count of 14 physical cores each, for a total of 28 per node, running at a lower 2.6 GHz. In terms of features, these are very similar to their Phase 1 counterparts. The peak floating point performance rating of these newer cores is higher at 41.6 gigaFLOPS. SuperMUC has a total of 3072 nodes of this kind. The higher peak gigaFLOPS as well as improvements in efficiency allow these nodes to reach 2.814 petaFLOPS of performance under the HPL benchmark.

11.1.2 MPI Library and Compilers Setup

All components (SLURM, MPICH and the test applications) have been compiled with the GCC version 6 module provided in the SuperMUC system. The SuperMUC interconnect is based on Mellanox InfiniBand network interfaces. Because of this, MPICH was configured to use its OpenFabrics Interfaces (OFI) [8] network module (with the Verbs provider). There is currently an open issue that results in incorrect reconfigurations when using this module. Because of this, resource adaptation tests were performed with the TCP/IP module in the Elastic MPI performance evaluation (next chapter). This means that significant performance improvements are possible in the near future. All other tests were performed with the OFI module. MPICH was configured with the --enable-fast=all option, and all binaries (application, MPI library and resource manager) were compiled with the -O2 optimization level.

11.2 Testing and Measurement Binaries

Due to the complexity of the system, several special modes of operation were added to the resource manager daemons and the MPI library. These allow for quick isolated and precise testing of the different pattern detection, reduction, and scheduling algorithms. Testing these separate aspects of the infrastructure would have been much more time consuming and less precise during regular application runs.

12 Elastic MPI Performance

In this chapter, the performance and scaling of the new MPI operations are evaluated. For the measurements presented here, a simple test application that performs redundant MPI communication was used. The application runs indefinitely and adapts to new resources based on a precomputed schedule. Performance data was collected every time the test application adapted. Sweeps from 16 to 512 or 1024 processes were performed. Measurements were accumulated from 10 separate runs (with different allocations) for each type of SuperMUC node.

12.1 MPI INIT ADAPT

Figure 12.1 presents the mean time and standard deviation for the MPI INIT ADAPT operation. The times observed are indistinguishable from those with standard MPICH and SLURM with the provided PMI2 implementation. Poor scaling with increased numbers of processes is observed; this may become a target for optimization in the future. These times are observed from both the original processes in an application launch and those created by the resource manager on an expansion. The latency of this operation is hidden from preexisting processes thanks to the design of the MPI COMM ADAPT BEGIN operation, as described in Sec. 6.2.3.

Figure 12.1: MPI INIT ADAPT latency.

12.2 MPI PROBE ADAPT

As mentioned in Sec. 6.2.2, this operation has been designed such that the general case is very fast. As can be seen in Fig. 12.2, it is the fastest operation in the MPI extension. When the adaptation flag is false, the latency of this operation is about 1 millisecond at 512 processes. The performance is much slower when the adaptation flag is set to true. As explained before, the expectation is that resource adaptations will be infrequent; therefore, the low latency of this operation when no adaptations need to take place is more important. The latency in the true case is dominated by the TBON protocol between the SRUN program and the daemons.

Figure 12.2: MPI PROBE ADAPT latency.

12.3 MPI COMM ADAPT BEGIN

A full sweep of all possible combinations of preexisting processes and expansion processes is not presented for this operation, since its latency is dominated by the size of the biggest process group. Because of this, balanced cases are presented, where preexisting process groups are of the same size as expansion process groups, for resulting process groups of double the size of the preexisting ones. It is also worth mentioning that a reduction of resources does not impact the performance of this operation, since preexisting leaving processes participate in the same way as preexisting staying processes during adaptation windows, due to their required participation during data repartitions.

As can be seen in Fig. 12.3, the implementation is successful in hiding the latencies related to the creation of new processes on new resources from preexisting processes. The measured times are significantly lower than the initialization times required by the children processes. Unfortunately, linear scaling has been observed due to the inherited implementation of the accept and connect routines from MPICH. These operations are reused in the current implementation and could be targets for optimization in the future.

Figure 12.3: MPI COMM ADAPT BEGIN latency from a number of staying processes to a new total.

12.4 MPI COMM ADAPT COMMIT

The last operation to be evaluated is MPI COMM ADAPT COMMIT. This operation is evaluated on the total number of processes, since it operates on the consolidated process group after an adaptation. This operation is in general very fast and has good scalability properties. It has not been a target for optimization. The reason for this is that all required synchronization takes place in the MPI COMM ADAPT BEGIN operation and is stored in the MPI library. When this operation is called, the process group and communicator metadata is updated locally in the memory of the process.

Figure 12.4: MPI COMM ADAPT COMMIT latency.

13 Elastic Resource Manager Performance

A selection of resource manager operations is evaluated in this chapter. This selection contains all operations that impact the performance of MPI operations during normal computations. The operations that were not included are very numerous, but are either performed locally by one of the resource manager components, or do not impact the performance of preexisting MPI processes thanks to the latency hiding features described in previous chapters.

13.1 Tree Based Overlay Network (TBON) Latency

The communication between SRUN and the SLURMD daemons that manage the execution of an MPI application is important for the MPI PROBE ADAPT operation when the adaptation flag is set to true. The algorithm for probing has two sides: the side at each MPI process and the side at each SLURMD daemon. When the adaptation flag is set to true, multiple synchronization operations between the SRUN program and each daemon take place. These synchronization operations are performed over the Tree Based Overlay Network that connects SRUN to each SLURMD daemon. Because of this, the latency of messages over the TBON can impact the overhead of MPI processes when they are required to adapt.

Figure 13.1 presents the latency of a single message and its confirmation from each participating node. In the figure, its scalability based on process count is presented. This means that the results for the Sandy Bridge and Haswell nodes will differ mainly due to the different core counts of the nodes. In the case of Haswell, only 20 nodes are needed to run 512 processes, while 32 nodes are needed with the Sandy Bridge nodes. As expected of a TBON network, the latency of messages scales logarithmically.

Figure 13.1: Latency of TBON messages from SRUN to daemons.

13.2 Control Flow Graph (CFG) Detection Overhead

In this section, the overhead of the set of operations that perform Control Flow Graph (CFG) detection is measured. Some of these operations impact the performance of MPI processes directly, while others can have a small impact since they are performed on the core where the SLURMD daemon of the node runs. These operations are: insertion, reduction, packing, unpacking and collapse.

The reduction, packing, unpacking and collapse operations are not as significant to the performance of MPI application processes due to their infrequent execution, as mentioned. That leaves the insertion operation as the only one that can impact the performance of application processes. In the remainder of this section, the latency of these operations is presented. The measurements are presented based on their scalability with respect to the size of the CFG, the total number of processes at each node, and finally the number of iterations of the loop in the application.

13.2.1 Scaling with Control Flow Graph (CFG) Size

It is important to understand how the detection overheads scale with increased CFG complexity. Figure 13.2 presents the scalability of all of the operations for CFG sizes between 8 and 1024 entries. Results for Phase 1 and Phase 2 nodes are included side by side for comparison. The sizes of CFGs are typically less than 100 entries, so the wide range of up to 1024 entries is pessimistic.


Figure 13.2: CFG size performance scaling. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.


As mentioned before, the insertion latency is the most significant overhead. Unfortunately, the insertion latency scales exponentially with the number of entries in the CFG. Fortunately, although it scales badly, the actual cost of the operation is small. A typical MPI operation runs for multiple milliseconds, while the insertion overhead is around 700 nanoseconds for an 8-entry CFG, up to 10 microseconds for the extreme case of 1024 CFG entries. For the typical case of 128 CFG entries, the overhead of insertion is less than 2 microseconds.

The CFG reduction operation also scales exponentially with the number of entries in the CFG. Its overheads of 5 microseconds for 8 entries up to about 500 microseconds in the extreme 1024-entry case are acceptable, given the infrequency of this operation. The packing, unpacking and collapse operations scale exponentially as well, but their actual costs are much lower than that of the reduction operation, since these are performed in parallel with the participation of each MPI process. Their maximum cost of 100 microseconds in the extreme case of 1024 entries is also acceptable given the infrequency of these operations.

13.2.2 Scaling with Process Counts

In addition to scaling with the size of the CFG, it is also important to evaluate how the overheads scale with increasing numbers of processes at each node. These are intra-node operations, so only process counts that are expected to be possible, without oversubscription, in near-future HPC nodes are considered: from 2 to 128 processes.

Figure 13.3 presents scalability data for the detection operations based on process counts. Results for the larger CFG sizes of 256, 512 and 1024 entries are presented for Phase 1 (left) and Phase 2 (right) nodes. As can be seen, the overheads for the insertion, packing, unpacking and collapse operations do not depend on the process counts, while the reduction operation does. Their latencies vary from a few hundred nanoseconds to a few hundred microseconds. Not scaling with the number of processes is desirable, since it means that an arbitrary number of processes can be added at each node and these overheads will not increase. This is especially important in the case of the insertion latency, since this overhead is added to each MPI operation while the CFG detection mechanism is enabled. Once the CFG logic switches to verification, this overhead is removed. The packing, unpacking and collapse overheads are not as impactful to application performance, as mentioned before, since they occur infrequently.

The situation for the reduction operation is not as fortunate: its overhead increases with the number of processes per node of an application. As measured before, the overhead of this operation also increases with larger CFG sizes. Because of this, this operation has the worst scaling properties of the measurement infrastructure. Fortunately, these operations do not occur frequently and the absolute latencies it reaches are still not large.



Figure 13.3: Process count performance scaling. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.

13.3 MPI Performance Impact of the CFG Detection Overhead

Additional measurements were performed to evaluate the impact of these operations on actual MPI operations. MPI operations can run from a few microseconds to multiple seconds, depending on the type of operation, the number of processes and the size of the buffers.

In Fig. 13.4, results for the MPI SEND and MPI BCAST operations are presented. These two operations were selected since they have the lowest latencies among the set of point-to-point and collective operations, respectively. The figure presents the latency for the MPI SEND operation at the top and the MPI BCAST operation at the bottom. Results for Phase 1 (left) and Phase 2 (right) nodes are presented side by side for comparison. Results for 16 and 1024 processes are presented with buffer sizes from 16 bytes up to a megabyte. The size of the CFG was set to 32 entries for these tests. Most applications and benchmarks that have been evaluated generate fewer CFG entries by the time they terminate.



Figure 13.4: MPI_SEND (top) and MPI_BCAST (bottom) performance examples with detection enabled and disabled on a 32-entry CFG loop. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.

As can be seen in the plots, the performance of MPI_SEND is only impacted significantly for message sizes of up to 4096 bytes, and only at lower process counts. For the case of 1024 processes, the overhead of the CFG detection algorithm is insignificant even for very small messages of 16 bytes. Additionally, the overhead of detection is not measurable in verification mode. This means that its overhead will only be observed when the detection algorithm has not encountered a loop, or when it exits a loop and resumes its detection. A smaller performance impact can be observed for the MPI_BCAST operation. As mentioned before, the latency of this operation is the lowest among MPI collectives; therefore, the impact of CFG detection can be expected to be almost negligible when collectives are being used. Although the detection overhead is lower in terms of absolute latency, the percentage impact is higher in the case of Phase 2 nodes.


14 Case Studies with Distributed Memory Applications

Two computational kernels are evaluated in detail in this chapter: a matrix-matrix multiplication kernel and a Gaussian elimination kernel. The matrix-matrix kernel is based on the Cannon algorithm, while the Gaussian elimination kernel is a naive row-block implementation.

These have been selected due to their simplicity: they have well-understood scalability and efficiency properties and execute quickly. Because of this, a full sweep of possible resource combinations with them can be done in a timely manner. More complex applications are already developed or under development, such as Computational Fluid Dynamics (CFD) simulations with AMR [70].

14.1 Cannon Matrix-Matrix Multiplication

In this section, a distributed matrix-matrix multiplication kernel based on the Cannon [54] algorithm is analyzed. The response of the CFG detection and scheduling algorithms of the infrastructure to its performance and scalability properties is discussed.

Figure 14.1: Cannon matrix-matrix multiplication trace for 16 processes. MPI time in red and application time in blue.

14.1.1 Basic and EPOP Implementations

The original implementation was a small single C source file with the MPI-based Cannon algorithm. The new implementation uses MPI topologies to simplify the communication with neighbor processes during computation.



Figure 14.2: Compute, MPI, efficiency and MTCT ratio (top to bottom) of a Cannon matrix-matrix multiplication kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.

This is particularly helpful with the Cannon algorithm, given its block-wise exchanges in the main kernel. The kernel remains the same in both the EPOP and basic versions of the code. This kernel is presented in Listing 14.1.

for (cannon_block_cycle = 0; cannon_block_cycle < sqrt_size; cannon_block_cycle++) {
    for (C_index = 0, A_row = 0; A_row < A_local_block_rows; A_row++) {
        for (B_column = 0; B_column < B_local_block_columns; B_column++, C_index++) {
            for (A_column = 0; A_column < A_local_block_columns; A_column++) {
                C_local_block[C_index] += A_local_block[A_row * A_local_block_columns + A_column]
                                        * B_local_block[A_column * B_local_block_columns + B_column];
            }
        }
    }
    // rotate blocks horizontally
    MPI_Sendrecv_replace(A_local_block, A_local_block_size, MPI_DOUBLE,
                         (coordinates[1] + sqrt_size - 1) % sqrt_size, 0,
                         (coordinates[1] + 1) % sqrt_size, 0,
                         row_communicator, &status);
    // rotate blocks vertically
    MPI_Sendrecv_replace(B_local_block, B_local_block_size, MPI_DOUBLE,
                         (coordinates[0] + sqrt_size - 1) % sqrt_size, 0,
                         (coordinates[0] + 1) % sqrt_size, 0,
                         column_communicator, &status);
}

Listing 14.1: Cannon kernel.

The adaptation window was inserted in the main kernel loop. No proper adaptation code was implemented. Instead, the root process of the application redistributes the matrix data on each adaptation. A better solution would be to add an MPI-based collaborative repartitioning scheme where all processes participate. For testing, long-running applications are needed to observe the behavior of the scheduler. Because of this, an additional loop was added that effectively repeats the number of matrix-matrix multiplications performed by the application. The source matrices are not modified, therefore no changes were necessary to ensure correctness.

Although it is a very simple application, it suffers from the difficulties described in the EPOP chapter. An EPOP version of the application was also developed. Figure 14.3 illustrates its design based on EPOP blocks. It is a single-EP application, with its required initialization block and a single rigid phase used for finalization. Because EPOP operates at a very coarse level, the performance of the application in both versions is indistinguishable. Because of this, the performance data presented in this evaluation are relevant to both implementations.

Figure 14.3: Cannon application with EPOP blocks.
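The row and column communicators and the coordinates array used in Listing 14.1 can be created with standard MPI Cartesian topology operations. The following sketch shows one way to set them up for a sqrt_size x sqrt_size process grid; it is an illustration consistent with the listing, not a copy of the application's code.

int dims[2] = { sqrt_size, sqrt_size };
int periods[2] = { 1, 1 };            /* wrap-around shifts for the rotations */
int coordinates[2];
int cart_rank;
MPI_Comm cart_comm, row_communicator, column_communicator;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart_comm);
MPI_Comm_rank(cart_comm, &cart_rank);
MPI_Cart_coords(cart_comm, cart_rank, 2, coordinates);

int keep_cols[2] = { 0, 1 };          /* same row, varying column */
int keep_rows[2] = { 1, 0 };          /* same column, varying row  */
MPI_Cart_sub(cart_comm, keep_cols, &row_communicator);
MPI_Cart_sub(cart_comm, keep_rows, &column_communicator);

The adaptation window described above wraps the outer repetition loop. The sketch below shows the intended structure only: the probe, begin and commit names are placeholders for the extension operations proposed in this work (their exact names and signatures are not reproduced here), and redistribute_matrices() and cannon_multiply() stand for the application's redistribution code and the kernel of Listing 14.1.

/* Placeholder declarations; the actual extension API differs in detail. */
int MPI_Probe_adapt(int *adaptation_pending);
int MPI_Comm_adapt_begin(MPI_Comm *new_world);
int MPI_Comm_adapt_commit(void);

/* Provided by the application. */
void redistribute_matrices(MPI_Comm comm);
void cannon_multiply(MPI_Comm comm);

void elastic_outer_loop(MPI_Comm world, int repetitions)
{
    for (int r = 0; r < repetitions; r++) {
        int pending = 0;
        MPI_Probe_adapt(&pending);              /* cheap when nothing changes */
        if (pending) {
            MPI_Comm new_world;
            MPI_Comm_adapt_begin(&new_world);   /* preexisting and new processes meet here */
            redistribute_matrices(new_world);   /* root-driven in this kernel */
            MPI_Comm_adapt_commit();            /* adaptation window ends */
            world = new_world;
        }
        cannon_multiply(world);                 /* the kernel from Listing 14.1 */
    }
}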

14.1.2 Pattern Detection

The Cannon application was also used to verify the correctness of the pattern detection functionality presented in the scheduling chapter. Figure 14.4 illustrates what occurs when the system detects the CFG of the non-EPOP implementation. At the beginning, each application process starts the detection process. Fortunately, the application is simple enough that the detected CFG of a full execution can be illustrated. The CFGs of the root process and the other processes are illustrated on the left side of the figure.

109 14 Case Studies with Distributed Memory Applications

Figure 14.4: Cannon CFG detection process illustrated.

These differ in that root has more loops than the rest of the processes. There is a loop where the matrix dimensions are broadcast, and another loop where the operand matrix sub-blocks are distributed. There is an additional loop where the final results are gathered. For each of these three loops, the rest of the processes have a matching receive (one for each of the first two loops at root) and a matching send (for the final gather loop at root). The CFG collapse and reduction operations for this application are illustrated from left to right, respectively. The collapse operation simplifies the loops at each process. The reduction operation detects the loops that are present at all processes and produces the distributed loop metadata.
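The collapse step can be pictured with the following simplified sketch, which replaces an immediately repeated block of event identifiers with a single copy of the block. This is only an illustration of the principle: the actual collapse operation of the prototype builds proper loop nodes with iteration counts, as described in the scheduling chapter, and operates on richer CFG entries than plain integers.

#include <string.h>

/* Find the first immediately repeated block in 'events' and drop its second
 * copy; returns the new length, or the old length if nothing was collapsed. */
static int collapse_once(int *events, int len)
{
    for (int w = 1; w <= len / 2; w++) {              /* candidate block length */
        for (int i = 0; i + 2 * w <= len; i++) {
            if (memcmp(&events[i], &events[i + w], w * sizeof(int)) == 0) {
                memmove(&events[i + w], &events[i + 2 * w],
                        (size_t)(len - i - 2 * w) * sizeof(int));
                return len - w;
            }
        }
    }
    return len;
}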

14.1.3 Performance Analysis

A trace with an allocation of 16 processes, showing the MPI and application times for this application, is presented in Fig. 14.1. As can be seen, the proportion of MPI to compute time is low. Figure 14.2 shows a detailed sweep of the performance and efficiency properties of this application based on the number of processes. It helps to remember that, in the presented infrastructure, the number of processes of an application is ensured to match the number of CPU cores allocated to it. In the figure, the different times for the iteration of the detected loop in the CFG are presented. From top to bottom: total time, MPI time, efficiency and MPI to compute time ratio (the MTCT metric described in the scheduling chapter). As can be seen in the bottom plots, the process counts with MTCT values below 0.1 correlate well with the process counts where the efficiency metric of the application is near the maximum possible for each input size. The heuristic described in the scheduling chapter halves the number of processes in all cases where the average or trend MTCT values are above 0.1. The quality of these decisions can be verified for this application, since its performance and efficiency have been evaluated beforehand for a wide range of input matrices and process counts. In this case, the algorithm makes resource adaptation decisions that do not lower the application's parallel efficiency significantly.
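A compact way to express the halving rule described above is the following sketch. It is illustrative only and not the scheduler's code; the 0.1 value corresponds to the upper MTCT threshold, and min_procs to the minimum process count specified by the user.

/* Propose a new process count for one application from its MTCT statistics. */
static int propose_process_count(double mtct_average, double mtct_trend,
                                 int current_procs, int min_procs)
{
    const double upper_threshold = 0.1;

    if (mtct_average > upper_threshold || mtct_trend > upper_threshold) {
        int halved = current_procs / 2;   /* too much time in MPI: shrink */
        return halved > min_procs ? halved : min_procs;
    }
    return current_procs;                 /* efficient enough: leave unchanged */
}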


14.2 Gaussian Elimination

In this section, a distributed Gaussian elimination kernel is analyzed. The approach to its analysis is very similar to that of the previously discussed Cannon matrix-matrix multiplication implementation. This kernel has very different performance properties when compared to the previous matrix-matrix example.

14.2.1 Basic and EPOP Implementations

The original distributed Gaussian elimination implementation was even simpler than the previous matrix-matrix multiplication example, with a single C source file of less than 300 lines of code. This is a very minimalistic Gaussian elimination implementation with row-blocking. This base implementation was extended for resource-elastic execution in a similar manner to the Cannon application, with the root process redistributing the matrix and right-hand side vector instead of a collaborative repartitioning scheme. Also similarly, this otherwise short-running application was made to run longer with an outer loop; this is necessary to properly observe the response of the scheduler. The EPOP version of this application is identical in structure to that of the Cannon application: a single initialization block, a single EP block and a single rigid block for finalization. Refer to the EPOP illustration in the previous section. Again, in this case the benefits of EPOP were more related to the elegance and cleanliness of the implementation, with no extra branching code to enter adaptation windows or for locating joining processes. The performance is indistinguishable in both implementations.

Figure 14.5: Gaussian elimination trace for 8 processes. MPI time in red and application time in blue.
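To make the term row-blocking concrete, the sketch below shows what a naive row-block forward elimination (without pivoting, back substitution omitted) can look like; it is not the author's implementation. Each process owns rows_per_proc contiguous augmented rows of length n+1, n is assumed to be divisible by the number of processes, and the broadcast of every pivot row is the reason MPI time dominates in this kind of kernel.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

static void forward_eliminate(double *local_rows, int n, int rows_per_proc,
                              int rank, MPI_Comm comm)
{
    double *pivot_row = malloc((size_t)(n + 1) * sizeof(double));

    for (int k = 0; k < n; k++) {
        int owner = k / rows_per_proc;
        if (rank == owner)                        /* owner copies its pivot row */
            memcpy(pivot_row, &local_rows[(k % rows_per_proc) * (n + 1)],
                   (size_t)(n + 1) * sizeof(double));
        MPI_Bcast(pivot_row, n + 1, MPI_DOUBLE, owner, comm);  /* one broadcast per pivot */

        for (int i = 0; i < rows_per_proc; i++) {
            int global_row = rank * rows_per_proc + i;
            if (global_row > k) {                 /* eliminate below the pivot */
                double factor = local_rows[i * (n + 1) + k] / pivot_row[k];
                for (int j = k; j <= n; j++)
                    local_rows[i * (n + 1) + j] -= factor * pivot_row[j];
            }
        }
    }
    free(pivot_row);
}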

14.2.2 Pattern Detection

The pattern detection produces a peculiar result that is too complex to illustrate compactly in this document. The implementation has multiple loops, but only the inner loops get detected at all processes. In the current implementation, only these loops are then tracked. This creates a situation where a lot of relevant performance data gets clipped out of the model. The measured MTCT metric still allows the heuristic to make the correct decisions, but the results show a lower ratio than that found with tracing. This type of communication pattern could be one of the worst cases for the detection algorithm presented in this work.



Figure 14.6: Compute, MPI, efficiency and MTCT ratio (top to bottom) of a Gaussian elimination kernel. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.


14.2.3 Performance Analysis

Figure 14.5 shows a trace for this application with 8 processes, with compute time in blue and communication time in red. As can be seen, most of the time is spent in MPI operations. This can only result in very low efficiency metrics for this application. Indeed, this is a naive row-blocking distributed implementation of the Gaussian elimination algorithm and is known to have low efficiency and poor scaling properties with increasing numbers of processes.

Figure 14.6 shows a detailed sweep of the performance and efficiency properties of this application based on the number of processes. Again, the infrastructure ensures that the number of processes matches the available physical cores in all cases. Similarly to the previously analyzed kernel, the figure presents, from top to bottom: compute time, MPI time, efficiency and the MPI to compute time ratio. Again, the ratio in the bottom plots is the MTCT metric used by the performance model described in the scheduling chapter. In both this case and the matrix-matrix multiplication before, these are only the averages and not the trend values of the MTCT metric.

The heuristic reacts very differently with this kernel when compared to the Cannon kernel. In this case, the MTCT is never below the 0.1 threshold, as can be seen. Indeed, the scheduler always determines that this application is operating at an inefficient scale and will halve its resources in each scheduling iteration until it reaches its minimum number of processes, as specified by the user.

14.3 Cannon Matrix-Matrix Multiplication and Gaussian Elimination Interaction

The interaction between the Cannon matrix-matrix and the Gaussian elimination (GE) applications can be well understood after their characteristics have been determined individually. In this section, these applications are run together and the response of the system, through its performance modeling features and its scheduler, is observed.

These observations are currently made in the log output of the Elastic Runtime Scheduler (ERS). Its logs record when the srun realloc message is sent and when the commit message is received, for each individual application. It also logs the preexisting allocation, the expansion allocation and the final allocation of each individual resource adaptation (in both node counts and process counts). In the plots presented in this section, only the completion times are illustrated, since these are the most relevant. The resource adaptation decisions are made within milliseconds of each periodic scheduler decision time. It is important to note that the ERS schedules job steps and not jobs. Job steps are the combination of an SRUN instance and the MPI processes of an application with their SLURMSTEPD per node. These applications adapt quickly: the difference between the time when the srun realloc message is sent and the commit is confirmed is in the millisecond scale. Future work needs to include the adaptation times of applications in the scheduling decision; adaptation times are currently ignored.

First, combinations of both applications are run to evaluate the effect of the resource filling operation. Afterwards, two instances of the Cannon matrix-matrix multiplication application are used to observe the effect of the time balancing operation. It is important to note again that these applications have been modified to be long-running, by adding an outer loop around their core kernels. This allows the system to properly collect performance data and react by modifying their resources. Finally, a discussion of the effect of the upper and lower MTCT thresholds is included.



Figure 14.7: Nodes (top) and MPI processes (bottom) during the interaction between the Cannon matrix-matrix multiplication kernel with 4096x4096 matrices and the Gaussian elimination application with 4096x4096 matrices. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.

14.3.1 Gaussian Elimination and Cannon Matrix-Matrix with 4096x4096 Matrices

Figure 14.7 shows the node counts of the allocations of the Gaussian elimination and Cannon applications. The horizontal axis represents the times at which the scheduler makes a decision. The frequency of these decisions can be configured, and has been set to one minute for these experiments. As can be seen in the plots, in the first few iterations the scheduler does nothing. This is to be expected, since the performance data is requested in the first step of the scheduler and both applications have been started simultaneously. The performance data is available only after the second scheduler step, and this is where the first resource adaptations can take place. Node counts are presented at the top, and CPU cores at the bottom. Results for both types of SuperMUC nodes are included, with Phase 1 results on the left and Phase 2 nodes on the right, in the same arrangement as previous figures.

As can be seen in the plots, the results vary greatly depending on the node type. For Phase 1 nodes, the scheduler manages to keep the idle node and CPU counts low. This is thanks to the high efficiency estimation for the Cannon application with 4096x4096 matrices. This application has its resources increased from 16 nodes to 28 nodes in 2 steps. Afterwards, in step 5, its resources are increased to 31 nodes by an application of the resource filling operation; this successfully lowers the idle node count to zero. The sequence of adaptations for the Cannon application is due to its implementation being limited to square process counts only. This is also the reason why there are still idle CPU cores although all nodes have been filled. In contrast, the efficiency of the GE application is estimated to be low. This application is reduced from 16 nodes to 1 node in 4 steps.

The sequence of adaptations and the idle node counts are very different for Phase 2 nodes. The Cannon application is estimated by the performance model to lose efficiency at a lower node count. The band created by the upper and lower MTCT thresholds, with values 0.1 and 0.01, results in the application being removed from the candidate set at each scheduler iteration. The application is not perturbed and is left to run with a 16 node allocation for the entirety of its run. The GE application has the same resource reduction as in the Phase 1 case; its resources are reduced to one node in the same 3 reduction steps.



Figure 14.8: Nodes (top) and MPI processes (bottom) during the interaction between the Cannon matrix-matrix multiplication kernel with 1024x1024 matrices and the Gaussian elimination application with 4096x4096 matrices. Results for SuperMUC Phase 1 (Sandy Bridge, left) and Phase 2 (Haswell, right) are presented.

14.3.2 Gaussian Elimination and Cannon Matrix-Matrix with 1024x1024 Matrices

An additional test is performed with the same applications, but with the input matrices for the Cannon application changed to the 1024x1024 size. The changes in the resources for each application are presented in Fig. 14.8.

This figure follows the same organization as before. The difference in the resource adaptations is significant. The resource adaptations for both types of nodes are indistinguishable in this case. The resources of the GE application are the same as before, since the input of this application was not modified. Additionally, the efficiency of this application is estimated to be low for all possible input sets, so a difference in the resource scheduling decision would be unexpected. In contrast, the response for the Cannon application is very different with the 1024x1024 input size. The application has its resources lowered from 16 nodes to 2 nodes in 3 steps. The upper and lower bands for the MTCT metric keep the application fixed at 2 nodes. The system reaches a steady state with 1 node for the GE application and 2 nodes for the Cannon application on Phase 1 nodes, and one node each on Phase 2 nodes. The number of idle nodes and idle CPU cores is much larger for both types of nodes in this case.

14.3.3 Cannon Matrix-Matrix with 4096x4096 Matrices and Different Time Limits

The next and final scenario presented is a set of three Cannon application instances, each with a different number of iterations in its outer loop. These iterations are set so that they approximate 5, 10 and 15 minutes of runtime. These times are set through the SRUN command's --time option in these experiments, since the Elastic Batch Scheduler (EBS) is not yet available. The third instance of this application blocks until resources are available for it. This is achieved by specifying the --immediate option; this option forces the SRUN instance of the application to block until its required resources are available.

As can be observed in Fig. 14.9, the number of idle nodes in the schedule for Phase 1 differs significantly between the cases with and without time balancing. Time balancing provides three benefits in this case: it reduces the number of idle nodes, it reduces the wait time before the third application instance starts, and it reduces the makespan of the schedule.

For Phase 2 nodes, no time balancing was possible, since the scheduler removes both instances of the Cannon algorithm from the candidate list for adaptations. This is a consequence of the efficiency estimation given the upper and lower thresholds. The idle node and CPU counts are a lot higher in this case. The wait time for the third instance of the application is not lowered. Finally, the makespan of the schedule is also not improved.

14.3.4 Upper and Lower MTCT Threshold Effects Summary

The MTCT upper and lower bands influence the generation of the Resource Range Vector (RRV) for the list of candidate jobs at each scheduling step. Each entry in the final Resource Scaling Vector (RSV) that is applied to the running system is within its range in the RRV. By modifying the upper and lower thresholds, the RRV, and therefore the possible values of the RSV, can be influenced. A wider range between these thresholds means that applications are more likely to be left running uninterrupted instead of having their resources adapted. Higher values for both of these thresholds increase the tolerance of the system to the estimated inefficiencies of individual applications. A large difference between these values helps prevent oscillations, where the system may attempt to increase and decrease the resources of an application in an endless cycle. This can be very detrimental to performance, especially for applications with large adaptation times.
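The effect of the band can be sketched as follows for a single job. This is an illustration of the described behavior rather than the scheduler's code: jobs inside the band are not adaptation candidates, jobs above the upper threshold become candidates for reduction, and jobs below the lower threshold become candidates for expansion. The exact range endpoints chosen here are assumptions; the resulting range would be this job's entry in the RRV, from which the RSV later picks a concrete value.

/* One job's resource range, in processes. */
typedef struct { int min_procs; int max_procs; } rrv_entry;

static rrv_entry rrv_entry_for_job(double mtct, int current_procs,
                                   int user_min, int user_max,
                                   double lower_threshold, double upper_threshold)
{
    rrv_entry range;

    if (mtct > upper_threshold) {            /* inefficient: allow a reduction */
        int halved = current_procs / 2;
        range.min_procs = user_min;
        range.max_procs = halved > user_min ? halved : user_min;
    } else if (mtct < lower_threshold) {     /* efficient: allow an expansion */
        range.min_procs = current_procs;
        range.max_procs = user_max;
    } else {                                 /* inside the band: not a candidate */
        range.min_procs = current_procs;
        range.max_procs = current_procs;
    }
    return range;
}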



Figure 14.9: Time balancing enabled (top) versus disabled (bottom) with different run times of the same Cannon matrix-matrix multiplication application.

Different trade-offs can be configured between idle node counts and total system efficiency with the adjustment of these thresholds. A higher idle node count favors the start of new jobs from the queue, while a lower idle node count favors the completion of running jobs. The current setting of 0.1 for the upper threshold and 0.01 for the lower threshold appears to favor queued job starts and higher overall estimated system efficiency in the types of nodes evaluated. These parameters need to be adjusted for each HPC system based on performance data and expected job queue lengths.

Results with the upper threshold for the MTCT metric modified from 0.1 to 0.5 are presented in Fig. 14.10. As can be observed, the schedule is the same for the Phase 1 nodes. The same cannot be stated for the schedule on Phase 2 nodes. The increase in the threshold enables the two instances of the Cannon application to run with lower estimated efficiencies. This allows the system to generate an RRV where both applications can be expanded, and therefore time balancing can be applied. The result is that the idle node count is reduced, the wait time for the third Cannon instance is reduced, and the makespan is reduced. This comes at the cost of lower overall estimated efficiency. If jobs could be started from a queue, this would also have the effect of delaying their start, since idle nodes are minimized.



Figure 14.10: Repeat of the time balancing test with the upper MTCT threshold increased to 0.5, for comparison with Fig. 14.9.

14.4 Summary and Discussion

The doubling and halving of resources is an underestimation of the needed adaptations in our evaluations. More precise individual ranges in the RRV are desirable; this could be achieved with performance model improvements in the future. Underestimating the ranges has the negative effect of requiring more adaptation steps before a steady state is reached for an application. Unnecessary adaptations can be very expensive, depending on the adaptation costs of the application being adapted. Overestimating an expansion can also have negative effects, since the efficient maximum allocation of an application may be exceeded. Modifying the resources of an application to an amount that lowers its parallel efficiency should be avoided.

Applications with a lot of available parallelism that can adapt to arbitrary numbers of resources are the most beneficial to system-wide efficiency metrics, such as node utilization. Even better are applications that can also utilize all the CPU cores available in the nodes of the system. Scheduling trade-offs can be configured through the adjustment of the MTCT thresholds of the presented SPMD-Phase model. Node utilization potential can be traded for better estimated parallel efficiency and lower job wait times. The current defaults favor job starts over expansions. Favoring application starts can be negative when queues are short or empty; it may be beneficial to run at lower estimated efficiency levels and minimize idle node counts in these cases. Alternatively, idle nodes may be turned off; this is already supported by the design inherited from SLURM, but it will increase job start latencies when nodes that are turned off need to be booted.

15 Conclusion

Support for resource-elasticity was successfully demonstrated on the SuperMUC HPC system with the software prototype presented in this document. The prototype is composed of two main software components: an MPI library and a resource manager.

The MPI library was extended with a set of new operations that improve the support for dynamic processes in MPI. These new operations allow for initialization, probing for adaptation instructions and the creation of adaptation windows. The needs and goals of application, MPI library and resource manager developers were considered in their design. The creation of resource-elastic applications is more flexible with the use of adaptation windows, when compared to using the current standard spawn operations of MPI. MPI application developers can insert adaptation windows in locations that can be reached periodically. Applications can support resource adaptations of arbitrary sizes with these extensions. Resources are abstracted as MPI processes in the world communicators of applications.

Preexisting MPI applications can be converted to support resource-elasticity. Conversions require the insertion of control flow statements to differentiate between processes that are part of a normal application launch and processes that were created by the resource manager as part of an expansion of resources. All processes must meet at the correct location where the adaptation window begins.

The performance of the new dynamic process operations in the proposed MPI extension was evaluated. The performance of the initialization operation is identical to the standard one. The probe operation is very fast for the common case where no adaptations need to be made, while performing well when adaptation instructions are received. The highlight of the implementation is the adaptation window. The split design with a begin and a commit operation allowed for the demonstrated latency hiding design. This was verified by measurements at the begin operation of preexisting processes. The commit operation that is used to complete adaptation windows was shown to be very fast when compared to the begin operation. In summary, the latency hiding properties and general performance goals of the design were achieved.

The MPI library and the resource manager prototypes are well integrated. The resource adaptations are initiated by the resource manager, and not by the application. This inversion of control, when compared to standard spawn operations, allows schedulers to optimize for both application and system-wide efficiency metrics. The resource manager has additional features for the gathering of performance data through continuous interaction with MPI processes.

A CFG detection algorithm was implemented in the MPI library without the need for backtracing. These CFGs are detected at each process and shared with the local resource manager daemons at compute nodes. They are eventually transferred to the scheduler running at a remote node through the TBON and the SRUN instance of applications. The overhead was shown to depend on the length of the CFG of applications. Because most applications produce CFGs that are in the order of hundreds of elements and the detection does not rely on backtracing, the overhead of detection was kept in the order of nanoseconds in most cases.

The library switches to a verification-only mechanism when a partial CFG remains stable. The overhead of verification cannot be measured even on single-byte MPI messages with latencies in the order of microseconds.

A performance model is produced at the scheduler for each MPI application from its CFG and performance data. Currently only one model is implemented: the SPMD-Phase model. This model relies on the detection of distributed loops. It provides the scheduler with average and trend MPI to Compute Time (MTCT) ratios. These ratios are then used to generate Resource Range Vectors (RRV) for sets of applications that are candidates for resource adaptations.

The generation of these range vectors can be influenced by the setting of two parameters in the heuristic: the upper and lower MTCT thresholds. The modification of these thresholds has multiple effects on the quality of the schedules produced. A wide margin between them prevents potential resource adaptation oscillations. These parameters can also be calibrated for different trade-offs between estimated efficiencies and idle node counts. This was demonstrated in the evaluation.

An experimental scheduler was also presented. This scheduler has a split design, composed of two separate schedulers with a clear separation of concerns: the Elastic Batch Scheduler (EBS) and the Elastic Runtime Scheduler (ERS). Unfortunately, the EBS was not implemented in time to be demonstrated together with the ERS in this document. The ERS was described in detail, and its interaction with the eventual EBS discussed. The ERS produces a Resource Scaling Vector (RSV) from the RRV at each scheduling interval. This vector contains the final resource count for the allocation of each application in the list of candidates for resource adaptations. Although the tests are incomplete, given the missing EBS, the evaluation of the ERS with two test applications illustrated the benefits of resource-elasticity for HPC systems: it can reduce the makespan of schedules, the wait times of jobs and idle node counts. These are only initial results and further analyses are needed.

The importance of integrating resource managers and programming models for resource-elasticity support was illustrated. The message passing model was integrated through the extension to MPI. An additional model that targets resource-elastic execution was presented: Elastic-Phase Oriented Programming. This model provided important abstractions that further simplified the development of resource-elastic applications, as well as their integration with resource managers and performance modeling techniques.

It is expected that the integration of programming models and resource managers will increase in importance as exascale levels of performance are reached in HPC systems. Programming models that support resource-elastic execution and bring computational and energy efficiency benefits, while at the same time allowing for fault tolerance, are expected to increase in importance in the near future.

In summary, the research presented in this document is related to multiple areas of computer science: programming models, resource management, performance modeling and scheduling. A prototype that is a combination of a communication library and a resource manager was presented and evaluated. The prototype currently supports SPMD-type MPI applications with resource-elasticity. Its scheduling heuristic can provide system-wide and individual application parallel efficiency improvements in some cases. The results presented in the evaluation are limited, but also promising given the early stage of development of the prototype.

16 Future Work

A large amount of research on resource-elasticity in HPC is still left to be done; the research and prototypes presented in this work are only the beginning. In this chapter, an incomplete discussion of future research opportunities related to message passing, the missing Elastic Batch Scheduler (EBS) and, finally, resource management is presented.

16.1 Elastic Message Passing

More applications need to be developed with support for resource-elasticity. The benefits of resource-elasticity are only possible when sufficient elastic jobs are submitted to compute systems. Options should be explored for simplifying the conversion of existing codes to resource-elasticity, with the elastic MPI library or the EPOP model.

To support exascale, fault tolerance needs to be an important part of future development efforts. Several options are being evaluated at the Fault Tolerance Working Group of the MPI Forum [140, 45, 47, 40]. These research efforts should be followed closely. Fault tolerance should be added as soon as possible to potential future resource-elastic software stacks such as the one presented in this work.

Automatic tuning needs to be added to the elastic MPI library. MPI has many configuration parameters that have a performance impact. These parameters tend to be set when the application initializes the MPI library, and remain the same throughout its execution. Because of the expected changes in resource allocations and in the number of processes in MPI applications due to resource-elastic behavior, the MPI library should update these parameters periodically, or at least once per resource adaptation.

A more sophisticated implementation of the EPOP model may be worth developing. Its current implementation is a minimalistic C library with a single driver program. While this is sufficient to illustrate the benefits of the model, a more elaborate solution may be better for the development and performance of EPOP applications. For example, a new programming language or an extension to an existing one may simplify the development of applications. Additionally, there may be optimizations possible to the patterns of EPOP programs based on a global view of their structure. Driver programs for automatic tuning and other purposes can also be added.

The current insertion of markers for automatic pattern detection can be improved. The insertion of markers can be made better through the use of compilers, such as Clang [4] from the LLVM [17] project. This would allow more data to be included with the markers, such as the location of loops or branches. The detection algorithms could be simplified as a consequence.

Elastic programming models may be split into shared memory and distributed memory techniques. In this work, only distributed memory techniques were considered. A combination with shared memory adaptation and load balancing techniques can prove beneficial. These techniques can be treated as orthogonal, and later integrated in a complete software stack for distributed memory systems.


Integration with external visualization tools (such as Vampir [20]) and performance modeling and reporting tools (such as Caliper [2]) will also be of great importance in the future. Visualization techniques that aid the understanding of the resource-elastic executions of these distributed applications should be developed. The dynamic changes in resources make understanding these applications even more challenging than typical distributed memory applications.

The extension to MPI should be adapted based on new developments in the future MPI 4.0 standard. Any changes to the way communicators are handled, such as the sessions [18] proposal, should be considered in updates to the proposed adaptation window creation operations.

16.2 Elastic Batch Scheduler (EBS)

The immediate next development goal is to implement and integrate the Elastic Batch Scheduler (EBS) into the infrastructure. The necessary operations to support efficient moldable job starts are already available in the Elastic Runtime Scheduler (ERS). Given the current design, the batch scheduler is expected to interact with the ERS in each situation that results in the availability of nodes or of more jobs, such as:

• Typical job completion or termination events.
• Mandatory or optional reductions determined based on the performance model.
• New job arrivals at the queue of the Elastic Batch Scheduler (EBS) from users.

A moldable batch scheduling technique is to be developed, where the starting node allocation of jobs is based on a range instead of the fixed amounts found in current systems. The technique should preserve the FCFS policy by attempting to start earlier jobs first. In cases where only lower priority jobs can be started due to resource availability, higher priority jobs should still be forwarded to the Elastic Runtime Scheduler (ERS). The ERS should then be forced to provide reservations for these higher priority jobs and minimize their waiting times, with elastic resource transformations such as time balancing.

16.3 Elastic Resource Management

The creation of a more flexible infrastructure for resource management may be the best long term option. New resource management research such as that of the Flux [31] project may benefit resource-elasticity greatly. A more scalable, modular and hierarchical approach to resource management may be necessary to better support resource-elasticity at exascale.

Better performance models and analysis techniques can be added that produce the necessary ranges used as input by the Elastic Runtime Scheduler (ERS). These could consider additional performance metrics, progress reports and adaptation measurements. Since the design is modular, multiple performance models may be implemented in the future.

Energy optimization will also be of greater importance at exascale. Energy metrics need to be measured by the infrastructure. New performance models that consider energy metrics need to be added. Multi-objective optimizations that optimize performance and energy need to be developed in the future. Additionally, strategies for system-wide power-level stabilization and power capping will be required.


Machine learning and other history-based techniques may be added to the scheduler and its performance model. These techniques have great potential in optimization problems such as scheduling. Given the highly dynamic nature of the presented system and the high cost of adaptations due to distributed memory, any technique that improves the quality of predictions is of great importance.

The pattern detection technique should continue to be improved. Currently it is effective at the detection of SPMD patterns, but it should be extended to handle other patterns as well. Master-worker patterns can be supported by representing them as separate SPMD blocks that are coupled. Additionally, support for arbitrary MPMD patterns is desirable.

Resource management may be performed on intra-node resources, such as cores and memory. Memory should be tracked as a resource by the scheduler and the resource manager. If memory usage is known, then the minimal number of nodes specified by users can be taken as a hint, instead of as a fixed constraint. This can further benefit the minimization of idle node counts by the scheduler.

Topology optimizations should be explored in the future. For this, the SRUN program should first be extended with a migration function. After that, full migrations from fragmented allocations to dense allocations can be performed on jobs so that their network performance is efficient. The MPI library may be extended with communication pattern detection mechanisms to support this.

The quality of the time balancing operation in the runtime scheduler depends on the accuracy of the remaining time estimation of the job. This estimation is currently provided by users and is very unreliable. New modeling techniques that predict the remaining run times of jobs should be developed. Additionally, progress reporting APIs can be added to allow applications to report their progress and an estimation of their remaining time. The EPOP model can include a way to report the current iteration number and bound of the loops in Elastic-Phases (EPs).


Bibliography

[1] Standard for information technology–portable operating system interface (POSIX(R)) base specifications, issue 7. IEEE Std 1003.1, 2016 Edition (incorporates IEEE Std 1003.1-2008, IEEE Std 1003.1-2008/Cor 1-2013, and IEEE Std 1003.1-2008/Cor 2-2016), pages 1–3957, Sept 2016.

[2] Caliper: Application Introspection System. http://computation.llnl.gov/projects/caliper, 2017. [Online].

[3] Charm++: Parallel Programming with Migratable Objects. http://charm.cs.illinois.edu/research/charm, 2017. [Online].

[4] Clang: a C language family frontend for LLVM. https://clang.llvm.org/, 2017. [Online].

[5] GNU Hurd. https://www.gnu.org/software/hurd/hurd.html, 2017. [On- line].

[6] MPICH: High-Performance Portable MPI. http://www.mpich.org, 2017. [On- line].

[7] Open MPI: Open Source High Performance Computing. https://www.open-mpi.org/, 2017. [Online].

[8] OpenFabrics Alliance. http://openfabrics.org/, 2017. [Online].

[9] OpenMP: An API for multi-platform shared-memory parallel programming in C/C++ and Fortran. http://www.openmp.org, 2017. [Online].

[10] Parallel Virtual Machine (PVM). http://www.csm.ornl.gov/pvm/, 2017. [Online].

[11] SchedMD. http://www.schedmd.com/, 2017. [Online].

[12] Simple Linux Utility For Resource Management. http://slurm.schedmd.com/, 2017. [Online].

[13] SuperMUC Petascale System. https://www.lrz.de/services/compute/supermuc/, 2017. [Online].

[14] The Barrelfish Operating System. http://www.barrelfish.org/, 2017. [Online].

[15] The FreeBSD Project. http://www.freebsd.org, 2017. [Online].

[16] The Linux Kernel. http://www.linux.org/, 2017. [Online].

[17] The LLVM Compiler Infrastructure. http://llvm.org/, 2017. [Online].


[18] The MPI 4.0 standardization efforts. http://mpi-forum.org/mpi-40/, 2017. [Online].

[19] Top 500 Supercomputers. http://www.top500.org/, 2017. [Online].

[20] Vampir - Performance Optimization. https://www.vampir.eu/, 2017. [Online].

[21] X10: Performance and Productivity at Scale. http://x10-lang.org/, 2017. [Online].

[22] Umut A. Acar, Arthur Chargueraud, and Mike Rainey. Scheduling parallel programs by work stealing with private deques. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 219–228, New York, NY, USA, 2013. ACM.

[23] Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, Lukasz Wesolowski, and Laxmikant Kalé. Parallel programming with migratable objects: Charm++ in practice. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, pages 647–658, Piscataway, NJ, USA, 2014. IEEE Press.

[24] T. Agarwal, A. Sharma, A. Laxmikant, and L. V. Kalé. Topology-aware task mapping for reducing communication contention on large parallel machines. In Proceedings 20th IEEE International Parallel Distributed Processing Symposium, pages 10–20, April 2006.

[25] A. M. Agbaria and R. Friedman. Starfish: fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469), pages 167–176, 1999.

[26] Kunal Agrawal, Yuxiong He, Wen Jing Hsu, and Charles E. Leiserson. Adaptive scheduling with parallelism feedback. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 100– 109, New York, NY, USA, 2006. ACM.

[27] Xavier Aguilar, Karl Fürlinger, and Erwin Laure. MPI trace compression using event flow graphs. In Euro-Par 2014 Parallel Processing: 20th International Conference, Porto, Portugal, August 25-29, 2014. Proceedings, pages 1–12. Springer International Publishing, 2014.

[28] Xavier Aguilar, Karl Fürlinger, and Erwin Laure. Automatic on-line detection of MPI application structure with event flow graphs. In Euro-Par 2015: Parallel Processing: 21st International Conference on Parallel and Distributed Computing, Vienna, Austria, August 24-28, 2015, Proceedings, pages 70–81, Berlin, Heidelberg, 2015. Springer Berlin Heidelberg.

[29] Xavier Aguilar, Karl Fürlinger, and Erwin Laure. Visual MPI performance analysis using event flow graphs. Procedia Computer Science, 51:1353–1362, 2015.

[30] Xavier Aguilar, Karl Fürlinger, and Erwin Laure. Event flow graphs for MPI performance monitoring and analysis. In Tools for High Performance Computing 2015: Proceedings of the 9th International Workshop on Parallel Tools for High Performance Computing, September 2015, Dresden, Germany, pages 103–115, Cham, 2016. Springer International Publishing.

[31] Dong H. Ahn, Jim Garlick, Mark Grondona, Don Lipari, Becky Springmeyer, and Martin Schulz. Flux: A next-generation resource management framework for large HPC centers. In 10th International Workshop on Scheduling and Resource Management for Parallel and Distributed Systems. IEEE Computer Society, 2014.

[32] Michail Alvanos, Montse Farreras, Ettore Tiotto, and Xavier Martorell. Automatic communication coalescing for irregular computations in UPC language. In Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, CASCON '12, pages 220–234, Riverton, NJ, USA, 2012. IBM Corp.

[33] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. Above the clouds: A Berkeley view of cloud computing. Technical Report UCB/EECS-2009-28, EECS Department, University of California, Berkeley, Feb 2009.

[34] Axel Auweter, Arndt Bode, Matthias Brehm, Luigi Brochard, Nicolay Hammer, Herbert Huber, Raj Panda, Francois Thomas, and Torsten Wilde. A case study of energy aware scheduling on SuperMUC. In Supercomputing: 29th International Conference, ISC 2014, Leipzig, Germany, June 22-26, 2014. Proceedings, pages 394–409, Cham, 2014. Springer International Publishing.

[35] E. Ayguade, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, F. Massaioli, X. Teruel, P. Unnikrishnan, and G. Zhang. The design of OpenMP tasks. IEEE Transactions on Parallel and Distributed Systems, 20(3):404–418, March 2009.

[36] Pavan Balaji, Darius Buntinas, David Goodell, William Gropp, Jayesh Krishna, Ewing Lusk, and Rajeev Thakur. PMI: A scalable parallel process-management interface for extreme-scale systems. In Recent Advances in the Message Passing Interface: 17th European MPI Users' Group Meeting, EuroMPI 2010, Stuttgart, Germany, September 12-15, 2010. Proceedings, pages 31–41, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[37] Richard F Barrett, Philip C Roth, and Stephen W Poole. Finite difference stencils implemented using Chapel. Oak Ridge National Laboratory, Tech. Rep. ORNL Technical Report TM-2007/122, 2007.

[38] Adam Beguelin, Jack Dongarra, Al Geist, Robert Manchek, Keith Moore, and Vaidy Sunderam. PVM and HeNCE: Tools for heterogeneous network computing. In Software for Parallel Computation, pages 91–99, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg.

[39] Francine Berman, Richard Wolski, Henri Casanova, Walfredo Cirne, Holly Dail, Marcio Faerman, Silvia Figueira, Jim Hayes, Graziano Obertelli, Jennifer Schopf, Gary Shao, Shava Smallen, Neil Spring, Alan Su, and Dmitrii Zagorodnov. Adaptive computing on the grid using AppLeS. IEEE Trans. Parallel Distrib. Syst., 14(4):369–382, April 2003.


[40] Maciej Besta and Torsten Hoefler. Fault tolerance for remote memory access programming models. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, pages 37–48. ACM, 2014.

[41] Abhay Bhadani and Sanjay Chaudhary. Performance evaluation of web servers using central load balancing policy over virtual machines on cloud. In Proceedings of the Third Annual ACM Bangalore Conference, COMPUTE '10, pages 16:1–16:4, New York, NY, USA, 2010. ACM.

[42] Milind Bhandarkar, L. V. Kalé, Eric de Sturler, and Jay Hoeflinger. Adaptive load balancing for MPI programs. In Computational Science - ICCS 2001: International Conference San Francisco, CA, USA, May 28–30, 2001 Proceedings, Part II, pages 108–117, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.

[43] Abhinav Bhatele, Eric Bohm, and Laxmikant V. Kalé. Optimizing communication for Charm++ applications by reducing network contention. Concurrency and Computation: Practice and Experience, 23(2):211–222, 2011.

[44] A. Bhatele and L. V. Kalé. Application-specific topology-aware mapping for three dimensional topologies. In 2008 IEEE International Symposium on Parallel and Distributed Processing, pages 1–8, April 2008.

[45] Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, George Bosilca, and Jack J. Dongarra. An evaluation of user-level failure mitigation support in MPI. In Recent Advances in the Message Passing Interface: 19th European MPI Users' Group Meeting, EuroMPI 2012, Vienna, Austria, September 23-26, 2012. Proceedings, pages 193–203, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[46] J. Blazewicz, M. Machowiak, G. Mounié, and D. Trystram. Approximation algorithms for scheduling independent malleable tasks. In Euro-Par 2001 Parallel Processing: 7th International Euro-Par Conference Manchester, UK, August 28–31, 2001 Proceedings, pages 191–197, Berlin, Heidelberg, 2001. Springer Berlin Heidelberg.

[47] Aurelien Bouteiller, George Bosilca, and Jack J. Dongarra. Plan B: Interruption of ongoing MPI operations to support failure recovery. In Proceedings of the 22nd European MPI Users' Group Meeting, page 11. ACM, 2015.

[48] B. Brandfass, T. Alrutz, and T. Gerhold. Rank reordering for MPI communication optimization. Computers and Fluids, 80:372–380, 2013. Selected contributions of the 23rd International Conference on Parallel Fluid Dynamics ParCFD2011.

[49] Matthias Braun, Sebastian Buchwald, Manuel Mohr, and Andreas Zwinkau. Dynamic X10: Resource-aware programming for higher efficiency. Technical Report 8, Karlsruhe Institute of Technology, 2014. X10 '14.

[50] J. Buisson, O. Sonmez, H. Mohamed, W. Lammers, and D. Epema. Scheduling malleable applications in multicluster systems. In 2007 IEEE International Conference on Cluster Computing, pages 372–381, Sept 2007.

[51] Hans-Joachim Bungartz, Christoph Riesinger, Martin Schreiber, Gregor Snelting, and Andreas Zwinkau. Invasive computing in HPC with X10. In Proceedings of the Third ACM SIGPLAN X10 Workshop, X10 '13, pages 12–19, New York, NY, USA, 2013. ACM.

[52] D. Buntinas, G. Mercier, and W. Gropp. Design and evaluation of Nemesis, a scalable, low-latency, message-passing communication subsystem. In Cluster Computing and the Grid, 2006. CCGRID 06. Sixth IEEE International Symposium on, volume 1, pages 10 pp.–530, May 2006.

[53] Darius Buntinas, Guillaume Mercier, and William Gropp. Implementation and evaluation of shared-memory communication and synchronization operations in MPICH2 using the Nemesis communication subsystem. Parallel Computing, 33(9):634 – 644, 2007. Selected Papers from EuroPVM/MPI 2006.

[54] Lynn Elliot Cannon. A Cellular Computer to Implement the Kalman Filter Algorithm. PhD thesis, Bozeman, MT, USA, 1969. AAI7010025.

[55] T. Cao, Y. He, and M. Kondo. Demand-aware power management for power-constrained HPC systems. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 21–31, May 2016.

[56] F. Cappello and D. Etiemble. MPI versus MPI+OpenMP on the IBM SP for the NAS benchmarks. In Supercomputing, ACM/IEEE 2000 Conference, pages 12–12, Nov 2000.

[57] T. E. Carroll and D. Grosu. Incentive compatible online scheduling of malleable parallel jobs with individual deadlines. In 2010 39th International Conference on Parallel Processing, pages 516–524, Sept 2010.

[58] T. L. Casavant and J. G. Kuhl. A taxonomy of scheduling in general-purpose distributed computing systems. IEEE Transactions on Software Engineering, 14(2):141–154, Feb 1988.

[59] Márcia C. Cera, Yiannis Georgiou, Olivier Richard, Nicolas Maillard, and Philippe O. A. Navaux. Supporting malleability in parallel architectures with dynamic CPUSETs mapping and dynamic MPI. In Proceedings of the 11th International Conference on Distributed Computing and Networking, ICDCN’10, pages 242–257, Berlin, Heidelberg, 2010. Springer-Verlag.

[60] S. Chakraborty, H. Subramoni, A. Moody, A. Venkatesh, J. Perkins, and D. K. Panda. Non-blocking PMI extensions for fast MPI startup. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 131–140, May 2015.

[61] S. Chakraborty, H. Subramoni, J. Perkins, A. Moody, M. Arnold, and D. K. Panda. PMI extensions for scalable MPI startup. In Proceedings of the 21st European MPI Users’ Group Meeting, EuroMPI/ASIA ’14, pages 21:21–21:26, New York, NY, USA, 2014. ACM.

[62] Sayantan Chakravorty, Celso L. Mendes, and Laxmikant V. Kalé. Proactive fault tolerance in MPI applications via task migration. In High Performance Computing - HiPC 2006: 13th International Conference, Bangalore, India, December 18-21, 2006. Proceedings, pages 485–496, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[63] Bradford L Chamberlain, David Callahan, and Hans P Zima. Parallel programmability and the Chapel language. The International Journal of High Performance Computing Applications, 21(3):291–312, 2007.

[64] Bradford L. Chamberlain, Steven J. Deitz, David Iten, and Sung-Eun Choi. User-defined distributions and layouts in Chapel: Philosophy and framework. In Proceedings of the second USENIX Conference on Hot Topics in Parallelism, HotPar’10, pages 12–12, Berkeley, CA, USA, 2010. USENIX Association.

[65] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An object-oriented approach to non-uniform cluster computing. SIGPLAN Not., 40(10):519–538, October 2005.

[66] Yanpei Chen, Sara Alspaugh, and Randy Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proc. VLDB Endow., 5(12):1802–1813, August 2012.

[67] I-Hsin Chung, Che-Rung Lee, Jiazheng Zhou, and Yeh-Ching Chung. Hierarchical mapping for HPC applications. Parallel Processing Letters, 21(03):279–299, 2011.

[68] J. Cohen. Graph twiddling in a MapReduce world. Computing in Science Engineering, 11(4):29–41, July 2009.

[69] Isaías A. Comprés Ureña, Michael Gerndt, and Carsten Trinitis. Wait-free message passing protocol for non-coherent shared memory architectures. In Recent Advances in the Message Passing Interface: 19th European MPI Users’ Group Meeting, EuroMPI 2012, Vienna, Austria, September 23-26, 2012. Proceedings, pages 142–152, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[70] Isaías A. Comprés Ureña, Ao Mo-Hellenbrand, Michael Gerndt, and Hans-Joachim Bungartz. Infrastructure and API extensions for elastic execution of MPI applications. In Proceedings of the 23rd European MPI Users’ Group Meeting, EuroMPI 2016, pages 82–97, New York, NY, USA, 2016. ACM.

[71] Isaías A. Comprés Ureña, Michael Riepen, Michael Konow, and Michael Gerndt. RCKMPI - lightweight MPI implementation for Intel’s single-chip cloud computer (SCC). In EuroMPI, volume 6960 of Lecture Notes in Computer Science, pages 208–217. Springer, 2011.

[72] Isaías A. Comprés Ureña, Michael Riepen, Michael Konow, and Michael Gerndt. Invasive MPI on Intel’s Single-Chip Cloud Computer. In Architecture of Computing Systems – ARCS 2012: 25th International Conference, Munich, Germany, February 28 - March 2, 2012. Proceedings, pages 74–85, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[73] G. Contreras and M. Martonosi. Characterizing and improving the performance of Intel threading building blocks. In 2008 IEEE International Symposium on Workload Characterization, pages 57–66, Sept 2008.

[74] David Cunningham, David Grove, Benjamin Herta, Arun Iyengar, Kiyokuni Kawachiya, Hiroki Murata, Vijay Saraswat, Mikio Takeuchi, and Olivier Tardieu. Resilient X10: Efficient failure-aware programming. SIGPLAN Not., 49(8):67–80, February 2014.

[75] D. Bailey et al. The NAS parallel benchmarks. Technical Report RNR-91-002, NAS Systems Division, January 1991.

[76] L. Dagum and R. Menon. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science and Engineering, 5(1):46–55, Jan 1998.

[77] Robert I. Davis and Alan Burns. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv., 43(4):35:1–35:44, October 2011.

[78] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, January 2008.

[79] Jeffrey Dean and Sanjay Ghemawat. MapReduce: A flexible data processing tool. Commun. ACM, 53(1):72–77, January 2010.

[80] James Dinan, Pavan Balaji, Ewing Lusk, P. Sadayappan, and Rajeev Thakur. Hybrid parallel programming with MPI and Unified Parallel C. In Proceedings of the 7th ACM International Conference on Computing Frontiers, CF ’10, pages 177–186, New York, NY, USA, 2010. ACM.

[81] S. G. Domanal and G. R. M. Reddy. Load balancing in cloud computing using modified throttled algorithm. In 2013 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), pages 1–5, Oct 2013.

[82] S. G. Domanal and G. R. M. Reddy. Optimal load balancing in cloud computing by efficient utilization of virtual machines. In 2014 Sixth International Conference on Communication Systems and Networks (COMSNETS), pages 1–4, Jan 2014.

[83] Richard A. Dutton and Weizhen Mao. Online scheduling of malleable parallel jobs. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems, PDCS ’07, pages 136–141, Anaheim, CA, USA, 2007. ACTA Press.

[84] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. An application of bin-packing to multiprocessor scheduling. SIAM Journal on Computing, 7(1):1–17, 1978.

[85] Deepak Eachempati, Hyoung Joon Jun, and Barbara Chapman. An open-source compiler and runtime implementation for Coarray Fortran. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS ’10, pages 13:1–13:8, New York, NY, USA, 2010. ACM.

[86] J. Ekanayake, S. Pallickara, and G. Fox. MapReduce for data intensive scientific analyses. In 2008 IEEE Fourth International Conference on eScience, pages 277–284, Dec 2008.

[87] Tarek El-Ghazawi and Lauren Smith. UPC: Unified Parallel C. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM.

[88] M. Etinski, J. Corbalan, J. Labarta, and M. Valero. Parallel job scheduling for power constrained HPC systems. Parallel Computing, 38(12):615 – 630, 2012.

[89] Yoav Etsion and Dan Tsafrir. A short survey of commercial cluster batch schedulers. School of Computer Science and Engineering, The Hebrew University of Jerusalem, 44221:2005–13, 2005.

[90] Dror G. Feitelson, Larry Rudolph, and Uwe Schwiegelshohn. Parallel job scheduling — a status report. In Proceedings of the 10th International Conference on Job Scheduling Strategies for Parallel Processing, JSSPP’04, pages 1–16, Berlin, Heidelberg, 2005. Springer-Verlag.

[91] Dror G. Feitelson, Larry Rudolph, Uwe Schwiegelshohn, Kenneth C. Sevcik, and Parkson Wong. Theory and practice in parallel job scheduling. In Job Scheduling Strategies for Parallel Processing: IPPS ’97 Workshop Geneva, Switzerland, April 5, 1997 Proceedings, pages 1–34, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.

[92] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. Clearing the clouds: A study of emerging scale-out workloads on modern hardware. SIGPLAN Not., 47(4):37–48, March 2012.

[93] Lance Fortnow. The status of the P versus NP problem. Commun. ACM, 52(9):78–86, September 2009.

[94] Andrew Friedley, Greg Bronevetsky, Torsten Hoefler, and Andrew Lumsdaine. Hybrid MPI: Efficient message passing for multi-core systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 18:1–18:11, New York, NY, USA, 2013. ACM.

[95] Karl Fürlinger and David Skinner. Capturing and visualizing event flow graphs of MPI applications. In Euro-Par 2009 – Parallel Processing Workshops: HPPC, HeteroPar, PROPER, ROIA, UNICORE, VHPC, Delft, The Netherlands, August 25-28, 2009, Revised Selected Papers, pages 218–227, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[96] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, concept, and design of a next generation MPI implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 11th European PVM/MPI Users’ Group Meeting Budapest, Hungary, September 19 - 22, 2004. Proceedings, pages 97–104, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[97] G. Galante and L. C. E. d. Bona. A survey on cloud computing elasticity. In 2012 IEEE Fifth International Conference on Utility and Cloud Computing, pages 263–270, Nov 2012.

[98] Michael R. Garey and David S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[99] GA Geist, James A Kohl, and Phil M Papadopoulos. PVM and MPI: A comparison of features. Calculateurs Parallèles, 8(2):137–150, 1996.

[100] M. Gerndt, A. Hollmann, M. Meyer, M. Schreiber, and J. Weidendorfer. Invasive computing with iOMP. In Proceeding of the 2012 Forum on Specification and Design Languages, pages 225–231, Sept 2012.

[101] Neha Gholkar, Frank Mueller, and Barry Rountree. Power tuning HPC jobs on power-constrained systems. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, pages 179–191, New York, NY, USA, 2016. ACM.

[102] César Gómez-Martín, Miguel A. Vega-Rodríguez, and José-Luis González-Sánchez. Performance and energy aware scheduling simulator for HPC: evaluating different resource selection methods. Concurrency and Computation: Practice and Experience, 27(17):5436–5459, 2015. cpe.3607.

[103] R. L. Graham, G. M. Shipman, B. W. Barrett, R. H. Castain, G. Bosilca, and A. Lumsdaine. Open MPI: A high-performance, heterogeneous MPI. In 2006 IEEE International Conference on Cluster Computing, pages 1–9, Sept 2006.

[104] Richard L. Graham, Brian W. Barrett, Galen M. Shipman, Timothy S. Woodall, and George Bosilca. Open MPI: A high performance, flexible implementation of MPI point-to-point communications. Parallel Processing Letters, 17(01):79–88, 2007.

[105] Richard L. Graham, Timothy S. Woodall, and Jeffrey M. Squyres. Open MPI: A flexible high performance MPI. In Parallel Processing and Applied Mathematics: 6th International Conference, PPAM 2005, Poznań, Poland, September 11-14, 2005, Revised Selected Papers, pages 228–239, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[106] R.L. Graham, E.L. Lawler, J.K. Lenstra, and A.H.G. Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. In Proceedings of the Advanced Research Institute on Discrete Optimization and Systems Applications, volume 5 of Annals of Discrete Mathematics, pages 287 – 326. Elsevier, 1979.

[107] William Gropp. MPICH2: A new start for MPI implementations. In Proceedings of the 9th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 7–, London, UK, UK, 2002. Springer-Verlag.

[108] William Gropp and Ewing Lusk. Why are PVM and MPI so different? In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 4th European PVM/MPI Users’ Group Meeting Cracow, Poland, November 3–5, 1997 Proceedings, pages 1–10, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.

[109] William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789 – 828, 1996.

[110] A. Gupta, B. Acun, O. Sarood, and L. V. Kalé. Towards realizing the potential of malleable jobs. In 2014 21st International Conference on High Performance Computing (HiPC), pages 1–10, Dec 2014.

[111] Abhishek Gupta and Dejan Milojicic. Evaluation of HPC applications on cloud. In Proceedings of the 2011 Sixth Open Cirrus Summit, OCS ’11, pages 22–26, Washington, DC, USA, 2011. IEEE Computer Society.

[112] Frank Hannig, Sascha Roloff, Gregor Snelting, Jürgen Teich, and Andreas Zwinkau. Resource-aware programming and simulation of MPSoC architectures through extension of X10. In Proceedings of the 14th International Workshop on Software and Compilers for Embedded Systems, SCOPES ’11, pages 48–55, New York, NY, USA, 2011. ACM.

[113] Manuel Hasert, Harald Klimach, and Sabine Roller. CAF versus MPI - applicability of Coarray Fortran to a flow solver. In Recent Advances in the Message Passing Interface: 18th European MPI Users’ Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. Proceedings, pages 228–236, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[114] Paul Havlak. Nesting of reducible and irreducible loops. ACM Trans. Program. Lang. Syst., 19(4):557–567, July 1997.

[115] Nathan Hjelm. Optimizing one-sided operations in Open MPI. In Proceedings of the 21st European MPI Users’ Group Meeting, EuroMPI/ASIA ’14, pages 123:123–123:124, New York, NY, USA, 2014. ACM.

[116] T. Hoefler and J. L. Träff. Sparse collective operations for MPI. In 2009 IEEE International Symposium on Parallel Distributed Processing, pages 1–8, May 2009.

[117] Torsten Hoefler and Marc Snir. Generic topology mapping strategies for large-scale parallel architectures. In Proceedings of the International Conference on Supercomputing, ICS ’11, pages 75–84, New York, NY, USA, 2011. ACM.

[118] Andreas Hollmann and Michael Gerndt. Invasive computing: An application assisted resource management approach. In Multicore Software Engineering, Performance, and Tools: International Conference, MSEPT 2012, Prague, Czech Republic, May 31 - June 1, 2012. Proceedings, pages 82–85, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.

[119] J. Hu, J. Gu, G. Sun, and T. Zhao. A scheduling strategy on load balancing of virtual machine resources in cloud computing environment. In 2010 3rd International Symposium on Parallel Architectures, Algorithms and Programming, pages 89–96, Dec 2010.

[120] Chao Huang, Orion Lawlor, and L. V. Kalé. Adaptive MPI. In Languages and Compilers for Parallel Computing: 16th International Workshop, LCPC 2003, College Station, TX, USA, October 2-4, 2003. Revised Papers, pages 306–322, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[121] Chao Huang, Gengbin Zheng, Laxmikant Kalé, and Sameer Kumar. Performance evaluation of adaptive MPI. In Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’06, pages 12–21, New York, NY, USA, 2006. ACM.

[122] Chao Huang, Gengbin Zheng, and Laxmikant V Kalé. Supporting adaptivity in MPI for dynamic parallel applications. Technical Report; Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, 2007.

[123] J. Hungershöfer. On the combined scheduling of malleable and rigid jobs. In 16th Symposium on Computer Architecture and High Performance Computing, pages 206–213, Oct 2004.

[124] N. Ioannou, M. Kauschke, M. Gries, and M. Cintra. Phase-based application-driven hierarchical power management on the single-chip cloud computer. In 2011 International Conference on Parallel Architectures and Compilation Techniques, pages 131–142, Oct 2011.

[125] Satoshi Ito, Kazuya Goto, and Kenji Ono. Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments. Computers and Fluids, 80:88 – 93, 2013. Selected contributions of the 23rd International Conference on Parallel Fluid Dynamics ParCFD2011.

[126] David B. Jackson, Quinn Snell, and Mark J. Clement. Core algorithms of the Maui scheduler. In Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP ’01, pages 87–102, London, UK, UK, 2001. Springer-Verlag.

[127] N. Jain, A. Bhatele, J. S. Yeom, M. F. Adams, F. Miniati, C. Mei, and L. V. Kalé. Charm++ and MPI: Combining the best of both worlds. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 655–664, May 2015.

[128] E. Jeannot, G. Mercier, and F. Tessier. Process placement in multicore clusters: Algorithmic issues and practical techniques. IEEE Transactions on Parallel and Distributed Systems, 25(4):993–1002, April 2014.

[129] Emmanuel Jeannot and Guillaume Mercier. Near-optimal placement of MPI processes on hierarchical NUMA architectures. In Euro-Par 2010 - Parallel Processing: 16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010, Proceedings, Part II, pages 199–210, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[130] G. Jin, J. Mellor-Crummey, L. Adhianto, W. N. Scherer III, and C. Yang. Implementation and performance evaluation of the HPC challenge benchmarks in Coarray Fortran 2.0. In 2011 IEEE International Parallel Distributed Processing Symposium, pages 1089–1100, May 2011.

[131] L. V. Kalé, S. Kumar, and J. DeSouza. A malleable-job system for timeshared parallel machines. In Cluster Computing and the Grid, 2002. 2nd IEEE/ACM International Symposium on, pages 230–230, May 2002.

[132] Laxmikant V. Kalé and Sanjeev Krishnan. CHARM++: A portable concurrent object oriented system based on C++. In Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA ’93, pages 91–108, New York, NY, USA, 1993. ACM.

[133] K. Kandalla, H. Subramoni, A. Vishnu, and D. K. Panda. Designing topology-aware collective communication algorithms for large scale infiniband clusters: Case studies with scatter and gather. In 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1–8, April 2010.

[134] C. Karlsson, T. Davies, and Z. Chen. Optimizing process-to-core mappings for application level multi-dimensional MPI communications. In 2012 IEEE International Conference on Cluster Computing, pages 486–494, Sept 2012.

[135] Richard M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations: Proceedings of a symposium on the Complexity of Computer Computations, pages 85–103, Boston, MA, 1972. Springer US.

[136] Peter J. Keleher, Dmitry Zotkin, and Dejan Perkovic. Attacking the bottlenecks of backfilling schedulers. Cluster Computing, 3(4):245–254, 2000.

[137] Ken Kennedy, Charles Koelbel, and Hans Zima. The rise and fall of High Performance Fortran: An historical object lesson. In Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages, HOPL III, pages 7–1–7–22, New York, NY, USA, 2007. ACM.

[138] A. A. Khan, C. L. McCreary, and M. S. Jones. A comparison of multiprocessor scheduling heuristics. In International Conference on Parallel Processing Vol. 2, volume 2, pages 243–250, Aug 1994.

[139] Charles H Koelbel. The high performance Fortran handbook. MIT press, 1994.

[140] Ignacio Laguna, David F Richards, Todd Gamblin, Martin Schulz, Bronis R de Supinski, Kathryn Mohror, and Howard Pritchard. Evaluating and extending user-level fault tolerance in MPI applications. International Journal of High Performance Computing Applications, page 1094342015623623, 2016.

[141] Michael Lange, Gerard Gorman, Michèle Weiland, Lawrence Mitchell, and James Southern. Achieving efficient strong scaling with PETSc using hybrid MPI/OpenMP optimisation. In Supercomputing: 28th International Supercomputing Conference, ISC 2013, Leipzig, Germany, June 16-20, 2013. Proceedings, pages 97–108, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[142] Eugene L. Lawler, Jan Karel Lenstra, Alexander H.G. Rinnooy Kan, and David B. Shmoys. Chapter 9 sequencing and scheduling: Algorithms and complexity. In Logistics of Production and Inventory, volume 4 of Handbooks in Operations Research and Management Science, pages 445 – 522. Elsevier, 1993.

[143] Barry G. Lawson and Evgenia Smirni. Multiple-queue backfilling scheduling with priorities and reservations for parallel systems. In Job Scheduling Strategies for Parallel Processing: 8th International Workshop, JSSPP 2002 Edinburgh, Scotland, UK, July 24, 2002 Revised Papers, pages 72–87, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.

[144] Barry G. Lawson, Evgenia Smirni, and Daniela Puiu. Self-adapting backfilling scheduling for parallel systems. In Proceedings of the 2002 International Conference on Parallel Processing, ICPP ’02, pages 583–593, Washington, DC, USA, 2002. IEEE Computer Society.

[145] Inbok Lee, Costas S. Iliopoulos, and Kunsoo Park. Linear time algorithm for the longest common repeat problem. Journal of Discrete Algorithms, 5(2):243 – 249, 2007. 2004 Symposium on String Processing and Information Retrieval.

[146] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. Parallel data processing with MapReduce: A survey. SIGMOD Rec., 40(4):11–20, January 2012.

[147] Charles E. Leiserson. Programming irregular parallel applications in Cilk. In Solving Irregularly Structured Problems in Parallel: 4th International Symposium, IRREGULAR’97 Paderborn, Germany, June 12–13, 1997 Proceedings, pages 61–71, Berlin, Heidelberg, 1997. Springer Berlin Heidelberg.

[148] Charles E. Leiserson. Cilk. In Encyclopedia of Parallel Computing, pages 273–288, Boston, MA, 2011. Springer US.

[149] J.K. Lenstra, A.H.G. Rinnooy Kan, and P. Brucker. Complexity of machine scheduling problems. In Studies in Integer Programming, volume 1 of Annals of Discrete Mathematics, pages 343 – 362. Elsevier, 1977.

[150] D. Li, Y. Wang, and W. Zhu. Topology-aware process mapping on clusters featuring NUMA and hierarchical network. In 2013 IEEE 12th International Symposium on Parallel and Distributed Computing, pages 74–81, June 2013.

[151] A. M. Lindsay, M. Galloway-Carson, C. R. Johnson, D. P. Bunde, and V. J. Leung. Backfilling with guarantees made as jobs arrive. Concurrency and Computation: Practice and Experience, 25(4):513–523, 2013.

[152] R. V. Lopes and D. Menascé. A taxonomy of job scheduling on distributed computing systems. IEEE Transactions on Parallel and Distributed Systems, 27(12):3412–3428, Dec 2016.

[153] David B Loveman. High performance Fortran. IEEE Parallel & Distributed Technology: Systems & Applications, 1(1):25–42, 1993.

[154] Jeffrey M. Squyres and Andrew Lumsdaine. The component architecture of Open MPI: Enabling third-party collective algorithms. In Proceedings of the Workshop on Component Models and Systems for Grid Applications, pages 167–185, Boston, MA, 2005. Springer US.

[155] T. Ma, G. Bosilca, A. Bouteiller, and J. Dongarra. HierKNEM: An adaptive framework for kernel-assisted and topology-aware collective communications on many-core clusters. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 970–982, May 2012.

[156] T. Ma, T. Herault, G. Bosilca, and J. J. Dongarra. Process distance-aware adaptive MPI collective communications. In 2011 IEEE International Conference on Cluster Computing, pages 196–204, Sept 2011.

[157] T. Malik, V. Rychkov, A. Lastovetsky, and J. N. Quintin. Topology-aware optimization of communications for parallel matrix multiplication on hierarchical heterogeneous HPC platform. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops, pages 39–47, May 2014.

[158] Timothy G. Mattson. Programming environments for parallel and distributed computing: A comparison of P4, PVM, Linda, and Tcgmsg. Int. J. High Perform. Comput. Appl., 9(2):138–161, June 1995.

[159] Piyush Mehrotra, Jahed Djomehri, Steve Heistand, Robert Hood, Haoqiang Jin, Arthur Lazanoff, Subhash Saini, and Rupak Biswas. Performance evaluation of Amazon EC2 for NASA HPC applications. In Proceedings of the 3rd Workshop on Scientific Cloud Computing Date, ScienceCloud ’12, pages 41–50, New York, NY, USA, 2012. ACM.

[160] J Mellor-Crummey, L Adhianto, and WN Scherer III. A critique of co-array features in Fortran 2008. Fortran Standards Technical Committee Document J, 3:08–126, 2008.

[161] John Mellor-Crummey, Laksono Adhianto, William N. Scherer, III, and Guohua Jin. A new vision for Coarray Fortran. In Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, PGAS ’09, pages 5:1–5:9, New York, NY, USA, 2009. ACM.

[162] Guillaume Mercier and Emmanuel Jeannot. Improving MPI applications performance on multicore clusters with rank reordering. In Recent Advances in the Message Passing Interface: 18th European MPI Users’ Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. Proceedings, pages 39–49, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[163] J. Milthorpe, V. Ganesh, A. P. Rendell, and D. Grove. X10 as a parallel language for scientific computation: Practice and experience. In 2011 IEEE International Parallel Distributed Processing Symposium, pages 1080–1088, May 2011.

[164] Manuel Mohr, Sebastian Buchwald, Andreas Zwinkau, Christoph Erhardt, Benjamin Oechslein, Jens Schedel, and Daniel Lohmann. Cutting out the middleman: OS-level support for X10 activities. In Proceedings of the ACM SIGPLAN Workshop on X10, X10 2015, pages 13–18, New York, NY, USA, 2015. ACM.

[165] Gregory Mounie, Christophe Rapine, and Dennis Trystram. Efficient approximation algorithms for scheduling malleable tasks. In Proceedings of the Eleventh Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA ’99, pages 23–32, New York, NY, USA, 1999. ACM.

[166] A. W. Mu’alem and D. G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6):529–543, Jun 2001.

[167] PVR Murthy. Parallel computing with X10. In Proceedings of the 1st International Workshop on Multicore Software Engineering, IWMSE ’08, pages 5–6, New York, NY, USA, 2008. ACM.

[168] Stas Negara, Gengbin Zheng, Kuo-Chuan Pan, Natasha Negara, Ralph E. Johnson, Laxmikant V. Kalé, and Paul M. Ricker. Automatic MPI to AMPI program transformation using Photran. In Euro-Par 2010 Parallel Processing Workshops: HeteroPar, HPCC, HiBB, CoreGrid, UCHPC, HPCF, PROPER, CCPI, VHPC, Ischia, Italy, August 31–September 3, 2010, Revised Selected Papers, pages 531–539, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[169] Bradford Nichols, Dick Buttlar, and Jacqueline Farrell. Pthreads programming: A POSIX standard for better multiprocessing. O’Reilly Media, Inc., 1996.

[170] J. L. Overbey, S. Negara, and R. E. Johnson. Refactoring and the evolution of Fortran. In 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, pages 28–34, May 2009.

[171] Antonio J. Peña, Ralf G. Correa Carvalho, James Dinan, Pavan Balaji, Rajeev Thakur, and William Gropp. Analysis of topology-dependent MPI performance on Gemini networks. In Proceedings of the 20th European MPI Users’ Group Meeting, EuroMPI ’13, pages 61–66, New York, NY, USA, 2013. ACM.

[172] Frédéric Pétrot and Pascal Gomez. Lightweight implementation of the POSIX threads API for an on-chip MIPS multiprocessor with VCI interconnect. In Proceedings of the Conference on Design, Automation and Test in Europe: Designers’ Forum - Volume 2, DATE ’03, Washington, DC, USA, 2003. IEEE Computer Society.

[173] W. Pfeiffer and A. Stamatakis. Hybrid MPI/Pthreads parallelization of the RAxML phylogenetics code. In 2010 IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), pages 1–8, April 2010.

[174] L. L. Pilla, C. P. Ribeiro, D. Cordeiro, C. Mei, A. Bhatele, P. O. A. Navaux, F. Broquedis, J. F. Méhaut, and L. V. Kalé. A hierarchical approach for load balancing on parallel multi-core systems. In 2012 41st International Conference on Parallel Processing, pages 118–127, Sept 2012.

[175] Alexander Pöppl and Michael Bader. SWE-X10: An actor-based and locally coordinated solver for the shallow water equations. In Proceedings of the 6th ACM SIGPLAN Workshop on X10, X10 2016, pages 30–31, New York, NY, USA, 2016. ACM.

[176] Alexander Pöppl, Michael Bader, Tobias Schwarzer, and Michael Glaß. SWE-X10: Simulating shallow water waves with lazy activation of patches using ActorX10. In Proceedings of the Second International Workshop on Extreme Scale Programming Models and Middleware, ESPM2, pages 32–39, Piscataway, NJ, USA, 2016. IEEE Press.

[177] S. Prabhakaran, M. Iqbal, S. Rinke, C. Windisch, and F. Wolf. A batch system with fair scheduling for evolving applications. In 2014 43rd International Conference on Parallel Processing, pages 351–360, Sept 2014.

[178] S. Prabhakaran, M. Neumann, S. Rinke, F. Wolf, A. Gupta, and L. V. Kalé. A batch system with efficient adaptive scheduling for malleable and evolving applications. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 429–438, May 2015.

[179] R. Rabenseifner, G. Hager, and G. Jost. Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pages 427–436, Feb 2009.

[180] G. Ramalingam. Identifying loops in almost linear time. ACM Trans. Program. Lang. Syst., 21(2):175–188, March 1999.

[181] M. Randles, D. Lamb, and A. Taleb-Bendiab. A comparative study into distributed load balancing algorithms for cloud computing. In 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops, pages 551–556, April 2010.

[182] Mohammad Javad Rashti, Jonathan Green, Pavan Balaji, Ahmad Afsahi, and William Gropp. Multi-core and network aware MPI topology functions. In Recent Advances in the Message Passing Interface: 18th European MPI Users’ Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. Proceedings, pages 50–60, Berlin, Heidelberg, 2011. Springer Berlin Heidelberg.

[183] A. Raveendran, T. Bicer, and G. Agrawal. A framework for elastic execution of existing MPI programs. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, pages 940–947, May 2011.

[184] James Reinders. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O’Reilly Media, Inc., 2007.

[185] Sascha Roloff, Alexander Pöppl, Tobias Schwarzer, Stefan Wildermann, Michael Bader, Michael Glaß, Frank Hannig, and Jürgen Teich. ActorX10: An actor library for X10. In Proceedings of the 6th ACM SIGPLAN Workshop on X10, X10 2016, pages 24–29, New York, NY, USA, 2016. ACM.

[186] H. G. Rotithor. Taxonomy of dynamic task scheduling schemes in distributed computing systems. IEE Proceedings - Computers and Digital Techniques, 141(1):1–10, Jan 1994.

[187] Paul Sack and William Gropp. Faster topology-aware collective algorithms through non-minimal communication. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’12, pages 45–54, New York, NY, USA, 2012. ACM.

[188] N. Sadashiv and S. M. D. Kumar. Cluster, grid and cloud computing: A detailed comparison. In 2011 6th International Conference on Computer Science Education (ICCSE), pages 477–482, Aug 2011.

[189] Vijay A. Saraswat, Vivek Sarkar, and Christoph von Praun. X10: Concurrent programming for modern architectures. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’07, pages 271–271, New York, NY, USA, 2007. ACM.

[190] Osman Sarood, Akhil Langer, Abhishek Gupta, and Laxmikant Kalé. Maximizing throughput of overprovisioned HPC data centers under a strict power budget. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 807–818, Piscataway, NJ, USA, 2014. IEEE Press.

[191] M. Sato. OpenMP: parallel programming API for shared memory multiprocessors and on-chip multiprocessors. In 15th International Symposium on System Synthesis, 2002., pages 109–111, Oct 2002.

[192] William N. Scherer, III, Laksono Adhianto, Guohua Jin, John Mellor-Crummey, and Chaoran Yang. Hiding latency in Coarray Fortran 2.0. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS ’10, pages 14:1–14:9, New York, NY, USA, 2010. ACM.

[193] Edi Shmueli and Dror G. Feitelson. Backfilling with lookahead to optimize the performance of parallel job scheduling. In Job Scheduling Strategies for Parallel Processing: 9th International Workshop, JSSPP 2003, Seattle, WA, USA, June 24, 2003. Revised Papers, pages 228–251, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg.

[194] A. Sidelnik, S. Maleki, B. L. Chamberlain, M. J. Garzarán, and D. Padua. Performance portability with the Chapel language. In 2012 IEEE 26th International Parallel and Distributed Processing Symposium, pages 582–594, May 2012.

[195] Anna Sikora, Eduardo César, Isaías A. Comprés Ureña, and Michael Gerndt. Autotuning of MPI Applications Using PTF. In Proceedings of the ACM Workshop on Software Engineering Methods for Parallel and High Performance Applications, SEM4HPC ’16, pages 31–38, New York, NY, USA, 2016. ACM.

[196] Edgar Solomonik, Abhinav Bhatele, and James Demmel. Improving communication performance in dense linear algebra via topology aware collectives. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 77:1–77:11, New York, NY, USA, 2011. ACM.

[197] S. Spetka, H. Hadzimujic, S. Peek, and C. Flynn. High productivity languages for parallel programming compared to MPI. In 2008 DoD HPCMP Users Group Conference, pages 413–417, July 2008.

[198] Jaidev K. Sridhar and Dhabaleswar K. Panda. Impact of node level caching in MPI job launch mechanisms. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 16th European PVM/MPI Users’ Group Meeting, Espoo, Finland, September 7-10, 2009. Proceedings, pages 230–239, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg.

[199] S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan. Characterization of backfilling strategies for parallel job scheduling. In Proceedings. International Conference on Parallel Processing Workshop, pages 514–519, 2002.

[200] Srividya Srinivasan, Rajkumar Kettimuthu, Vijay Subramani, and Ponnuswamy Sadayappan. Selective reservation strategies for backfill job scheduling. In Job Scheduling Strategies for Parallel Processing: 8th International Workshop, JSSPP 2002 Edinburgh, Scotland, UK, July 24, 2002 Revised Papers, pages 55–71, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.

[201] H. Subramoni, D. Bureddy, K. Kandalla, K. Schulz, B. Barth, J. Perkins, M. Arnold, and D. K. Panda. Design of network topology aware scheduling services for large infiniband clusters. In 2013 IEEE International Conference on Cluster Computing (CLUSTER), pages 1–8, Sept 2013.

[202] H. Subramoni, S. Potluri, K. Kandalla, B. Barth, J. Vienne, J. Keasler, K. Tomko, K. Schulz, A. Moody, and D. K. Panda. Design of a scalable infiniband topology service to enable network-topology-aware placement of processes. In High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, pages 1–12, Nov 2012.

[203] H. Sun, Y. Cao, and W. J. Hsu. Efficient adaptive scheduling of multiprocessors with stable parallelism feedback. IEEE Transactions on Parallel and Distributed Systems, 22(4):594–607, April 2011.

[204] H. Sun, Y. Cao, and W. J. Hsu. Fair and efficient online adaptive scheduling for multiple sets of parallel applications. In 2011 IEEE 17th International Conference on Parallel and Distributed Systems, pages 64–71, Dec 2011.

[205] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Pract. Exper., 2(4):315–339, November 1990.

[206] D. Talby and D. G. Feitelson. Supporting priorities and improving utilization of the IBM SP scheduler using slack-based backfilling. In Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999, pages 513–517, Apr 1999.

[207] Olivier Tardieu, Benjamin Herta, David Cunningham, David Grove, Prabhanjan Kambadur, Vijay Saraswat, Avraham Shinnar, Mikio Takeuchi, and Mandana Vaziri. X10 and APGAS at petascale. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’14, pages 53–66, New York, NY, USA, 2014. ACM.

[208] Robert Tarjan. Testing flow graph reducibility. In Proceedings of the Fifth Annual ACM Symposium on Theory of Computing, STOC ’73, pages 96–107, New York, NY, USA, 1973. ACM.

[209] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 19(1):49–66, 2005.

[210] S. Tham and J. Morris. Cilk vs MPI: comparing two very different parallel programming styles. In 2003 International Conference on Parallel Processing, 2003. Proceedings., pages 143–152, Oct 2003.

[211] J. L. Träff. Implementing the MPI process topology mechanism. In Supercomputing, ACM/IEEE 2002 Conference, pages 28–28, Nov 2002.

[212] D. Tsafrir, Y. Etsion, and D. G. Feitelson. Backfilling using system-generated predictions rather than user runtime estimates. IEEE Transactions on Parallel and Distributed Systems, 18(6):789–803, June 2007.

[213] E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

[214] J.D. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10(3):384 – 393, 1975.

[215] G. Utrera, J. Corbalan, and J. Labarta. Implementing malleability on MPI jobs. In Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004., pages 215–224, Sept 2004.

[216] Sathish S. Vadhiyar and Jack J. Dongarra. SRS: A framework for developing malleable and migratable parallel applications for distributed systems. Parallel Processing Letters, 13(02):291–312, 2003.

[217] Sathish S. Vadhiyar and Jack J. Dongarra. Self adaptivity in grid computing. Concur- rency and Computation: Practice and Experience, 17(2-4):235–257, 2005.

[218] G Matthijs van Waveren. High performance Fortran. In International Conference on High-Performance Computing and Networking, pages 429–433. Springer, 1994.

[219] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 495–506, New York, NY, USA, 2010. ACM.

[220] Sebastian von Alfthan, Ilja Honkonen, and Minna Palmroth. Topology aware process mapping. In Applied Parallel and Scientific Computing: 11th International Conference, PARA 2012, Helsinki, Finland, June 10-13, 2012, Revised Selected Papers, pages 297–308, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

[221] Tao Wei, Jian Mao, Wei Zou, and Yu Chen. A new algorithm for identifying loops in decompilation. In Static Analysis: 14th International Symposium, SAS 2007, Kongens Lyngby, Denmark, August 22-24, 2007. Proceedings, pages 170–183, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.

[222] F. Wolf and B. Mohr. Automatic performance analysis of hybrid MPI/OpenMP applications. In Eleventh Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2003. Proceedings., pages 13–22, Feb 2003.

[223] Gosia Wrzesinska, Jason Maassen, and Henri E. Bal. Self-adaptive applications on the grid. In Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’07, pages 121–129, New York, NY, USA, 2007. ACM.

[224] Chaoran Yang, Wesley Bland, John Mellor-Crummey, and Pavan Balaji. Portable, MPI-interoperable Coarray Fortran. SIGPLAN Not., 49(8):81–92, February 2014.

[225] Xu Yang, Zhou Zhou, Sean Wallace, Zhiling Lan, Wei Tang, Susan Coghlan, and Michael E. Papka. Integrating dynamic pricing of electricity into energy aware scheduling for HPC systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 60:1–60:11, New York, NY, USA, 2013. ACM.

[226] H. Yu, I.-H. Chung, and J. Moreira. Topology mapping for Blue Gene/L supercomputer. In SC 2006 Conference, Proceedings of the ACM/IEEE, pages 52–52, Nov 2006.

[227] Y. Yuan, Y. Wu, W. Zheng, and K. Li. Guarantee strict fairness and utilization prediction better in parallel job scheduling. IEEE Transactions on Parallel and Distributed Systems, 25(4):971–981, April 2014.

[228] Yan Zhai, Mingliang Liu, Jidong Zhai, Xiaosong Ma, and Wenguang Chen. Cloud versus in-house cluster: Evaluating Amazon cluster compute instances for running MPI applications. In State of the Practice Reports, SC ’11, pages 11:1–11:10, New York, NY, USA, 2011. ACM.

[229] G. Zheng, E. Meneses, A. Bhatele, and L. V. Kalé. Hierarchical load balancing for Charm++ applications on large supercomputers. In 2010 39th International Conference on Parallel Processing Workshops, pages 436–444, Sept 2010.

[230] G. Zheng, Xiang Ni, and L. V. Kalé. A scalable double in-memory checkpoint and restart scheme towards exascale. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012), pages 1–6, June 2012.

[231] Gengbin Zheng, Chao Huang, and Laxmikant V. Kalé. Performance evaluation of automatic checkpoint-based fault tolerance for AMPI and Charm++. SIGOPS Oper. Syst. Rev., 40(2):90–99, April 2006.

[232] Gengbin Zheng, Lixia Shi, and L. V. Kalé. FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935), pages 93–103, Sept 2004.

[233] D. Zotkin and P. J. Keleher. Job-length estimation and performance in backfilling schedulers. In Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469), pages 236–243, 1999.
