Extending the C++ Asynchronous Programming Model with the HPX Runtime System for Distributed Computing

Erweiterung des asynchronen C++ Programmiermodels mithilfe des HPX Laufzeitsystems für verteiltes Rechnen

Der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg zur Erlangung des Doktorgrades

Doktor-Ingenieur

vorgelegt von Thomas Heller aus Neuendettelsau

Als Dissertation genehmigt von der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg

Tag der mündlichen Prüfung: 28.02.2019

Vorsitzender des Promotionsorgans: Prof. Dr.-Ing. Reinhard Lerch
Gutachter/in: Prof. Dr.-Ing. Dietmar Fey, Prof. Dr. Thomas Fahringer

To Steffi, Felix and Hanna

Acknowledgement

This thesis was written at the Chair for Computer Science 3 (Computer Architecture) of the Friedrich-Alexander-University Erlangen-Nuremberg. I would like to thank all persons who were involved in creating this work in one way or another. Special thanks goes to Prof. Dr.-Ing. Dietmar Fey, who took on the role of supervisor of this doctoral thesis. I would like to thank him for his support, trust, and all the helpful discussions that led to the success of this thesis. Additionally, I would like to thank Prof. Dr. Thomas Fahringer for his support and for accepting the role of reviewer. In addition, I would like to thank all students who contributed to the project, be it in the form of bachelor or master theses or as part of the team otherwise. I would like to thank my colleagues for all the fruitful discussions that helped to further develop the ideas presented in this thesis. I would like to thank Dr. Alice Koeniges for providing me with access to NERSC and the RRZE for providing access to the Meggie cluster. Furthermore, I would like to thank the STE||AR-Group, especially Dr. Hartmut Kaiser. Without the help and support of the group, this thesis wouldn't have been possible. The group helped to develop a stable product, which is the foundation of this very thesis. Hartmut is and was an excellent mentor who drove my research in various ways and helped develop my academic career.

Last but not least, I would like to thank my family, especially my wife and children, for their all-embracing support, without which this thesis couldn’t have been accomplished in the first place.

Abstract

This thesis presents a fully Asynchronous Many Task (AMT) runtime system extending the C++ programming language. The focus is on defining the distributed, asynchronous C++ programming model based on the C++ programming language, and on presenting performance-portable Application Programming Interfaces (APIs) for shared and distributed memory computing as well as for accelerators.

With the rise of multi- and many-core architectures, the C++ language was amended with support for concurrency and parallelism. This work derives the methodology for massive parallelism from this industry standard and extends it with fine-grained user-level threads, allowing large-scale supercomputers to employ the same syntax and semantics for remote and local operations. By leveraging the nature of asynchronous task-based message passing using a one-sided Remote Procedure Call (RPC) mechanism, the overarching principle of work follows data manifests itself.

By leveraging the asynchronous, task-based nature of the future as a handle for asynchronously computed results, the term Futurization is coined, presenting a technique based on Continuation Passing Style (CPS) programming. This technique allows dealing with millions of concurrently running asynchronous tasks. By attaching continuations, dynamic dependency graphs are formed naturally from the regular control flow of the code. The effect is to parallelize through the runtime system by executing multiple continuations in parallel. In other words, the future-based synchronization expresses fine-grained constraints. Furthermore, Futurization blends in naturally with other well-known techniques, such as data parallelism. Those other paradigms can be built using Futurization.

The technique mentioned earlier provides the necessary foundation to address the needs of modern scientific applications targeting High Performance Computing (HPC) platforms. However, addressing the challenge of handling more and more complicated architectures, with features such as differing memory access latencies and accelerators, is essential. This thesis attempts to solve this challenge by providing the necessary means to define computational and memory targets by reusing already defined, or upcoming, concepts for C++, and consequently providing means to link them together to intensify the principle of work follows data.

The feasibility of this approach is demonstrated with a set of low-level micro-benchmarks to show that the provided abstractions come with minimal overhead. A 2D stencil example that attests to the programmability of Futurization, as well as its performance benefits, serves as the second benchmark. Lastly, the results of futurizing the astrophysics application OctoTiger, a 3D octree Adaptive Mesh Refinement (AMR) based binary star simulation, running at extreme scales conclude the experimental section.

Kurzübersicht

Diese Arbeit stellt ein vollständiges “Asynchronous Many Task (AMT)“-Laufzeitsystem vor. Der Fokus liegt dabei auf der Definition der benötigten Konzepte auf der Basis der C++ Programmiersprache. Darüber hinaus werden portable APIs für das Rechnen auf verteilten Systemen und Beschleuniger-Hardware eingeführt.

Mit dem Aufkommen von Multi- und Many-Core-Architekturen wurde die C++ Programmiersprache mit Unterstützung für Nebenläufigkeit und Parallelität erweitert. Diese Arbeit leitet die Methodik für massive Parallelität von diesem Industriestandard ab und ergänzt sie mit feingranularen User-Level-Threads sowie verteiltem Rechnen. Dies ermöglicht das Benutzen großer Supercomputer mit derselben Syntax und Semantik für entfernte und lokale Operationen. Durch Verwendung des asynchronen task-basierten Nachrichtenaustausches mittels einseitiger Remote Procedure Calls (RPC) ergibt sich das allumfassende Prinzip des ”work follows data”, d.h. die Arbeit wird dort ausgeführt, wo die Daten liegen.

Der Begriff Futurization wird als Basis der “Continuation Passing Style (CPS)“-Programmierung geprägt. Dies erreicht man anhand des future-basierten Handles zum Ausdruck von asynchronen, Task-basierten Ergebnissen. Diese Technik erlaubt es, Millionen von nebenläufigen, asynchronen Tasks handzuhaben. Durch Einhängen von Continuations werden dynamische Abhängigkeitsgraphen erzeugt, die als Nebenprodukt des regulären Kontrollflusses leicht zu bestimmen sind. Dies hat den Effekt, dass mehrere dieser Continuations parallel durch das Laufzeitsystem abgearbeitet werden können. Durch diese future-basierte Synchronisierung ist man in der Lage, feingranulare Bedingungen für die korrekte Ausführung zu bestimmen. Darüber hinaus ermöglicht Futurization die Implementierung anderer Programmierparadigmen wie Datenparallelismus.

Diese Technik bietet die notwendige Grundlage, um den Anforderungen moderner wissenschaftlicher Anwendungen für High Performance Computing (HPC) Plattformen gerecht zu werden. Allerdings wird die Herausforderung, immer kompliziertere Architekturen effizient zu programmieren, immer größer. Darunter fallen unterschiedliche Speicherzugriffslatenzen und Hardware-Beschleuniger. Diese Arbeit verfolgt das Ziel, diese Aufgabe zu lösen, indem sie die notwendigen Mittel durch die Wiederverwendung bereits definierter oder zukünftiger Konzepte aus dem C++ Standard bereitstellt.

Die Ergebnisse dieses Ansatzes werden anhand der Evaluation mehrerer Benchmarks dargestellt. Zuerst wird eine Messung mit diversen Micro-Benchmarks durchgeführt, um zu zeigen, dass der Overhead der bereitgestellten Abstraktionen minimal ist. Sowohl die Programmierbarkeit als auch die Leistungsfähigkeit wird anhand einer 2D-Stencil-Anwendung demonstriert. Die Arbeit wird abgeschlossen durch die Anwendung OctoTiger, eine 3D-octree-basierte “Adaptive Mesh Refinement (AMR)“ Astrophysik-Simulation. Diese wird anhand von Futurization auf einen der größten aktuellen Supercomputer portiert.

Contents

1 Introduction

2 Related Work

3 Parallelism and Concurrency in the C++ Programming Language
   3.1 Low-Level Abstractions
      3.1.1 Memory Model
      3.1.2 Concurrency Support
      3.1.3 Task-Parallelism Support
   3.2 Higher Level Parallelism
      3.2.1 Concepts of Parallelism
      3.2.2 Parallel Algorithms
      3.2.3 Fork-Join Based Parallelism
   3.3 Evolution
      3.3.1 Executors
      3.3.2 Support for heterogeneous architectures and Distributed Computing
      3.3.3 Futurization
      3.3.4 and Parallelism

4 The HPX Parallel Runtime System
   4.1 Local Thread Management
   4.2 Active Global Address Space
      4.2.1 Processes in Active Global Address Space (AGAS) – Localities
      4.2.2 C++ Objects in AGAS – Components
      4.2.3 Global Reference Counting
      4.2.4 Resolving Globally unique Identifier (GID)s to Local Addresses
   4.3 Active Messaging
      4.3.1 Parcels
      4.3.2 Serialization
      4.3.3 Network Transport
   4.4 Asynchronous, unified API for remote and distributed computing
      4.4.1 Asynchronous Programming Interface
      4.4.2 Equivalence between Local and Remote Operations
      4.4.3 Natural extension to the C++ Standard
   4.5 Performance Counters

5 Abstractions for High Performance Parallel Programming
   5.1 Co-Locating Data and Work
   5.2 Targets in Common Computer Architectures
      5.2.1 Allocation Targets
      5.2.2 Execution Targets
      5.2.3 Non Uniform Memory Access (NUMA) Architectures
      5.2.4 Graphic Processing Unit (GPU) offloading
      5.2.5 High Bandwidth Memory
      5.2.6 Target Aware Container
   5.3 Supporting Abstractions
      5.3.1 Synchronization of Concurrent Work
      5.3.2 Global Collectives
      5.3.3 Point to Point Communications

6 Evaluation
   6.1 Benchmark Setup
   6.2 Low Level Benchmarks
      6.2.1 HPX Thread Overhead
      6.2.2 HPX Communication Overhead
      6.2.3 STREAM Benchmark
   6.3 Benchmark Applications
      6.3.1 Two-dimensional Stencil Application
      6.3.2 OctoTiger

7 Conclusion

A Appendix
   A.1 Atomic Operations
   A.2 Asynchronous Providers
   A.3 Task Block
   A.4 Parallel Algorithms

Glossary

Bibliography

List of Definitions

3.1 Thread of Execution
3.2 Data Race
3.3 Sequential Consistency
3.4 Execution Agent
3.5 BasicLockable
3.6 Lockable
3.7 TimedLockable
3.8 Asynchronous Return Object
3.9 Asynchronous Provider
3.10 Callable

5.11 Target of Execution

List of Figures

3.1 Types of Parallelism
3.2 Execution Policies
3.3 Fork-Join Model

4.1 HPX Runtime System Components
4.2 HPX Thread Context
4.3 Schematic of a distributed HPX Application
4.4 Schematic of a distributed HPX Application using Components
4.5 Graphical Representation of a GID
4.6 Interaction between Parcelhandler and AGAS

5.1 Presenting differences in Memory Latencies
5.2 NUMA Architecture Layout
5.3 CUDA Architecture
5.4 3D Stacked Memory
5.5 Knights Landing layout
5.6 Global collective operation

6.1 Overheads of creating tasks on a single processing unit
6.2 HPX Overhead
6.3 Future overhead
6.4 Task overhead
6.5 Scheduling different task granularities
6.6 Serialization overhead
6.7 Action throughput
6.8 Component creation
6.9 Component action overhead
6.10 send/recv
6.11 Broadcast
6.12 STREAM TRIAD results (ARM64)
6.13 STREAM TRIAD results (X86-64)
6.14 STREAM TRIAD results (KNL)
6.15 The PDE result
6.16 The 2D Grid
6.17 Grid element access
6.18 5 point stencil
6.19 Single node stencil performance
6.20 Distributed weak scaling of the stencil
6.21 Visualization of OctoTiger showing accretion stream
6.22 OctoTiger Ghost Zone Exchange
6.23 Concurrency Analysis of OctoTiger
6.24 Single Node of OctoTiger using 7 LoR
6.25 Subgrids per second on different CPU counts and LoR
6.26 Relative Speedup of OctoTiger using 13 LoR
6.27 Relative Speedup of OctoTiger using 14 LoR

List of Listings

3.1 C++ Memory Order Definition
3.2 C++ Memory Fence API
3.3 Definition of std::mutex
3.4 Definition of std::lock_guard
3.5 Definition of variadic std::(try_)lock
3.6 Definition of std::condition_variable
3.7 The std::async definition
3.8 Sequential execution policy type and global Object
3.9 Parallel execution policy type and global Object
3.10 Parallel unsequenced execution policy type and global Object
3.11 And-composition of futures
3.12 Or-composition of futures
3.13 A naïve, recursive implementation of the fibonacci sequence
3.14 Asynchronous, recursive fibonacci sequence implementation
3.15 Futurized fibonacci sequence implementation
3.16 Futurized if-statement
3.17 if-statement with an asynchronous condition
3.18 Futurized if-statement with an asynchronous condition
3.19 Pseudo code of a while-loop
3.20 Futurized while-loop
3.21 Futurized for-loop
3.22 Fibonacci example using TS

4.1 AGAS Localities queries
4.2 Component Boilerplate
4.3 Creating Objects in AGAS
4.4 Registering GIDs with a symbolic name
4.5 The HPX action mechanism
4.6 Polymorphic actions for serialization
4.7 Giving the action a name
4.8 HPX continuations for transporting results
4.9 Definition of a Parcel
4.10 HPX Serialization API (1/2)
4.11 HPX Serialization API (2/2)
4.12 Marking types as bit-wise serializable
4.13 Generic Parcel handling protocol
4.14 HPX Performance Counter Syntax

5.1 Generic function offloading with CUDA
5.2 The interface for hpx::lcos::channel

6.1 The HPX STREAM TRIAD benchmark
6.2 2D Stencil pseudo algorithm
6.3 Stencil line update
6.4 Stencil line iterator
6.5 Stencil row iterator
6.6 Stencil parallelized
6.7 Stencil communicator, first row update
6.8 Natural recursive tree traversal
6.9 Futurized recursive tree traversal

A.1 Basic atomic interface
A.2 atomic interface for integral numeric types
A.3 atomic interface additions for pointers
A.4 promise interface
A.5 packaged_task interface
A.6 future interface
A.7 shared_future interface
A.8 Fork Join Support in C++

List of Tables

3.1 C++ critical sections
3.2 The iterator categories and their impact on parallelization
3.3 Basic concepts for program execution
3.4 Executor Categories

4.1 Local and Remote Semantic and Syntactic Equivalence

6.1 Used Hardware Systems
6.2 Used system software

A.2 Non-modifying sequence operations
A.4 Set operations on sorted ranges
A.6 Modifying sequence operations
A.8 Partitioning operations
A.10 Sorting operations
A.12 Binary search operations
A.14 Heap operations
A.16 Minimum/maximum operations
A.18 Numeric Operations
A.20 Operations on uninitialized memory


1 Introduction

The free lunch is over [97]. This sentence hints at the end of Moore's Law, where we have to improve the architecture of modern computers instead of relying on ever-improving technological processes that are supposed to increase the performance of our calculations. Among the architectural changes that altered the mainstream of computing are the rise of multi- and many-core architectures and the invention of Beowulf clusters, perfected in today's fastest supercomputers. This leads to the challenge of developing effective and efficient parallel programming techniques that allow the usage of all parts of a computing system.

Especially when looking at the challenges imposed by exascale computing, the same conclusions can be drawn. With the end of Moore's Law, the massive increase in on-node parallelism is a result of the necessity of keeping the power consumption in balance [86] while increasing performance. Promising architectures occurring today, such as GPUs, FPGAs and other many-core systems like the Intel Xeon Phi, come in the form of accelerators, that is, devices serving a particular purpose, implying the need for heterogeneous systems. All in all, it can be anticipated that concurrency in future systems will be increased by orders of magnitude. At the same time, the available memory technology is not able to scale at a similar rate [2]. That imposes an imbalance between compute capabilities and the complexity of acquiring the needed data, which in turn requires complex memory architectures.

At the same time, important classes of parallel applications are emerging as scaling impaired. These are problems that require substantial execution time, often exceeding several months, but that are unable to make effective use of more than a few thousand processors [3, 98, 59, 110, 111] in strong scaling scenarios.

In the current landscape of scientific computing, the conventional thinking is that the currently prevalent programming model of MPI+X is suitable enough to tackle those challenges with minor adaptations [106]. That is, MPI is the preferred tool for inter-node communication, and X is a placeholder representing intra-node parallelization, for example, OpenMP or CUDA. Other inter-node communication paradigms, such as PGAS, emerged. Those focus on one-sided messages together with explicit and often global synchronization. While the promise of keeping the current programming patterns and known tools sounds appealing, the disjoint approach results in the usage and requirement to maintain various interfaces, and the question of interoperability is yet unclear [24].

Although the existing solutions do not, per se, dictate a specific programming paradigm, a particular model of programming is usually preferred and was indeed the initial motivation for the creation of various programming interfaces. For MPI, and to the same extent PGAS models, this paradigm is centered around Bulk Synchronous Programming (BSP) [62], refined by adding neighborhood-based synchronization through algorithmic optimizations, for example halo exchanges. Often, algorithms developed in this paradigm use a lock-step style of implementation, that is, distinct phases of communication and computation that by definition do not overlap, or are complex to overlap [6]. As a result, performance is gained by both increasing the processing power and the network interconnect speed. The speed of light and other technological limitations prevent network interconnects from increasing their parallel data lanes endlessly. Thus, the communication step cannot be sped up indefinitely. On the other hand, increasing the performance of the computational part was achieved by increasing the processor's frequency; this soon became a bottleneck due to thermal issues and leakage currents. As such, developing newer generations of processors with various forms of parallelism, ranging from Single Instruction Multiple Data (SIMD) based vectorization to multi- and many-core architectures, was paramount. The development of new architectures can always be seen as a co-design process, in the sense that existing applications have to be accelerated to be able to sell new generations of processors. In the context of this thesis, the applications of interest perform scientific simulations using linear algebra, which are able to expose specific forms of parallelism. Most numerical linear algebra algorithms can be expressed in terms of data parallel constructs that are found in the programming interfaces of OpenMP [19] and CUDA [71]. Those data parallel constructs are usually based on the fork/join model, that is, at the start of the parallel region the computation fans out to all processors and is implicitly joined once the computation is complete. As such, it is a perfect fit for the BSP model. Apparently, this model is limited in scalability, since the start and end of a parallel region will always entail a serial region. By looking at Amdahl's Law [36], the scaling limitations become apparent, since the inherent serial portion of a given application (foremost communication) will dominate the runtime as parallelism increases. This implies that only weak scaling of computational problems is possible, which is not always applicable.
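
For reference, a standard formulation of Amdahl's Law (stated here from general knowledge, not quoted from [36]) makes this limit explicit: with a parallelizable fraction p of the work and N processors, the achievable speedup is bounded by the serial fraction 1 - p.

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}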

The solution to those impeding factors of limited scalability can only be achieved by performing a paradigm change in the parallel programming model: a move from BSP towards fine-grained, constraint-based synchronization that limits the effect of global barriers. This paradigm shift is enabled by the emerging class of AMT runtime systems, which carry properties that alleviate the limitations mentioned above. It is therefore not a coincidence that the C++ Programming Language, starting with the standard updated in 2011, gained support for concurrency by defining a memory model for a multi-threaded world as well as laying the foundation towards enabling task-based parallelism by adopting the future [5, 26] concept. In 2017, based on those foundational layers, support for parallel algorithms became part of the standard, covering the need for data parallel algorithms.

The HPX parallel runtime system combines an AMT tailored for HPC usage with strict adherence to the C++ standard. It therefore represents a combination of well-known ideas (such as dataflow [22, 23], futures [5, 26], and CPS) with new and unique overarching concepts. The combination of these ideas and their strict application forms overarching design principles that make this model unique.

HPX consists of the following main building blocks:

• A C++ Standards-conforming API enabling wait-free asynchronous execution.

• futures, channels, and other synchronization primitives.

• An Active Global Address Space (AGAS) that supports load balancing through object migration.

• An active messaging network layer that ships functions to data.

• A work-stealing lightweight task scheduler that enables finer-grained parallelization and synchronization.

• A versatile in-situ performance observation, measurement, analysis, and adapta- tion framework APEX.

HPX's features and programming model allow application developers to naturally use fundamental principles such as overlapping communication and computation, decentralizing control flow, oversubscribing execution resources, and sending work to data instead of data to work. Using Futurization, developers can express complex dataflow execution trees that generate millions of HPX tasks that by definition are executed in the proper order.
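
As a brief illustration of this style, the following sketch composes two asynchronous tasks into a small dependency graph using hpx::async, hpx::when_all, and a continuation attached via then. It assumes the standard-conforming HPX API described in later chapters; the exact header names are an assumption and may differ between HPX versions.

#include <hpx/hpx_main.hpp>       // assumed convenience header (runtime startup)
#include <hpx/include/lcos.hpp>   // assumed header providing hpx::future et al.

#include <iostream>
#include <tuple>

int square(int x) { return x * x; }

int main()
{
    // Two independent tasks, scheduled on lightweight HPX user-level threads.
    hpx::future<int> a = hpx::async(square, 3);
    hpx::future<int> b = hpx::async(square, 4);

    // The continuation runs only once both inputs are ready; no explicit
    // barrier is needed, the dependency graph follows the control flow.
    hpx::future<int> sum = hpx::when_all(a, b).then(
        [](auto&& both)
        {
            auto results = both.get();   // tuple of ready futures
            return std::get<0>(results).get() + std::get<1>(results).get();
        });

    std::cout << sum.get() << std::endl;   // prints 25
    return 0;
}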

In this thesis, I will use the HPX runtime system to demonstrate an alternative to the prevalent programming models in the HPC community by first deriving its properties from the definitions in the C++ ISO Standard and highlighting how those can be used to overcome the limitations of MPI and OpenMP with a maintainable and efficient API. In summary, my contributions are:

• An extension of C++ for efficient handling of massively parallel systems including distributed memory and accelerators (Chapter 4).

• The HPX programming model of applying Futurization, by reusing asynchronous handles as the universal, fine-grained concept to turn concurrency into parallelism (Chapter 4).

• An evaluation of the developed concepts using a wide range of micro benchmarks as well as larger numerical simulations (Chapter 6).

• Understanding and optimizing bottlenecks arising from the novel parallel programming paradigm (Chapter 6).

The remainder of this thesis is structured as follows. Section 2 provides a classification of the HPX runtime system and a comparison with other existing solutions. Section 3 will introduce the concepts of the C++ Standard to give a basis for further discussions, as well as insight into the upcoming features, very likely to be included in the standard, that are the subject of discussion throughout the remainder of this thesis. Section 4 discusses the different layers of HPX (thread management, AGAS, Active Messaging and the unified interface) to fully derive the HPX programming model from the C++ language and to provide a natural extension of the already existing capabilities of the language. After having the fundamentals in place, Section 5 adds a refining, higher-level layer of discussion to address various topics concerning data and work co-location, synchronization concepts and global collectives. The evaluation and discussion of the feasibility of the approach are in Section 6, including a discussion of the involved overheads as well as mitigation strategies. Finally, a study in developing a toy 2D heat equation solver serves to evaluate the approach in terms of the application of the concepts mentioned above and performance portability. This section concludes with the analysis and extreme scale results obtained with a production-grade application, OctoTiger. Section 7 will conclude the thesis.

2 Related Work

Traditional high-level programming models for HPC center around the idea of introducing parallel regions to accelerate specific portions of an application. This was exercised most prominently by OpenMP [76], which popularized parallel programming by introducing pragmas for parallelizing loops on shared memory systems. For a long time, this served as the de-facto programming paradigm and influenced modern C++ based solutions like Intel TBB [40], Microsoft PPL [64] and Kokkos [11]. This is in contrast to distributed memory parallel programs, where the Message Passing Interface (MPI) [63] represents the prevalent solution. While it is not a parallel programming paradigm per se, and more a low-level message passing API, in combination with other parallel programming solutions it is often directly related to the BSP [108] programming model. With ever increasing on-node concurrency, this model is reaching its limits. As such, the latest research in the field of parallel programming models points to the efficient utilization of intra-node parallelism with the ability to exploit the underlying communication infrastructure using fine-grained task-based approaches to deal with concurrency [112, 103, 39, 4, 21].

In recent years, programming models targeting lightweight tasks have become more and more commonplace. HPX is no exception to this trend. Task-based parallel programming models can be placed into one of several categories [104, 53]:

• Library solutions: examples are Intel TBB [40], StarPU [92], Argobots [89], Qthreads [103] and many others

• Language extensions: examples here are Intel Cilk Plus [42], or OpenMP 3.0 [76]

• Experimental programming languages: the most notable examples are Chapel [12], Intel ISPC [41], or X10 [13]

While all the current solutions expose parallelism to the user in different ways, they all seem to converge on a continuation-style programming model. Some of them use a futures-based programming model, while others use dataflow processing of dependent tasks with an implicit or explicit Directed Acyclic Graph (DAG) representation of the control flow. While the majority of the task-based programming models focus on dealing with node-level parallelism, HPX presents a solution for homogeneous execution of remote and local operations.

When looking at library solutions for task-based programming, we are mainly looking at C/C++. While other languages such as Java and Haskell provide similar libraries, they play a secondary role in the field of High Performance Computing. Fortran, for example, has only one widely available solution, OpenMP 3.0. However, it is possible to call Fortran kernels from C/C++ or to create Fortran language bindings for the library in use. Examples of pure C solutions are StarPU and Qthreads. They both provide interfaces for starting and waiting on lightweight tasks as well as creating task dependencies. While Qthreads provides suspendable user-level threads, StarPU is built upon Codelets that, by definition, run to completion without suspension. Each of these strategies, Codelets and User-Level-Threads, has its advantages. HPX is a C++ library and tries to stay fully inside the C++ memory model and its execution scheme. For this reason, it follows the same route as Qthreads to allow for more flexibility and easier support for synchronization mechanisms. For C++, one of the existing solutions is Intel TBB, which, like StarPU, works in a codelet-style execution of tasks. What these libraries have in common is that they provide a high performance solution to task-based parallelism with all requirements one could think of for dealing with intra-node level parallelism. What they clearly lack is a uniform API and a solution for dealing with distributed computing. This is one of the key advantages HPX has over these solutions: a programming API that is based on the C++ standard, with support for constructs from C++11 to C++17 as well as the upcoming C++20 standard [99, 100, 101], furthermore augmenting and extending the defined concepts and interfaces to support remote operations.

Hand in hand with the library-based solutions come the pragma-based language extensions for C, C++ and Fortran: they provide an efficient and effective way for programmers to express intra-node parallelism. While OpenMP has supported tasks since V3.0, it lacks support for continuation-based programming, focusing instead on fork-join parallelism; task dependencies were only introduced in OpenMP 4.0. This hole is filled by OmpSs [102], which serves as an experiment in integrating inter-task dependencies using a pragma-based approach. One advantage of pragma-based solutions over libraries is their support for accelerators, which provides programmers with an approachable way to access accelerators without the need to rely on external libraries. This effort was spearheaded by OpenACC [74] and is now part of the OpenMP 4.0 specification. Libraries for accelerators, as of now, have to fall back to language extensions like CUDA [18], C++AMP [8, 31], SYCL [9] or OpenCL [75].

In addition to language extensions, an ongoing effort to develop new programming languages is emerging, aiming at better support for parallel programming. Some parallel programming languages like OpenCL or Intel Cilk Plus [42] focus on node-level parallelism. While OpenCL focuses on abstracting hardware differences for all kinds of parallelism, Intel Cilk Plus supports a fork-join style of parallelism. In addition, there are programming languages which explicitly support distributed computing, like UPC [107] or Fortress [77], but lack support for intra-node level parallelism. Current research, however, is developing support for both inter- and intra-node level parallelism based on a partitioned global address space (PGAS) [80]. The most prominent examples are Chapel [12] and X10 [13], which represent the PGAS languages. HCMPI [14] shows similarities with the HPX programming model by offering interfaces for asynchronous distributed computing, either based on distributed data-driven futures or explicit message passing in an MPI [63] compatible manner. The main difference between HPX and the above solutions is that HPX invents no new syntax or semantics. Instead, HPX implements the syntax and semantics as defined by C++11, providing it with a homogeneous API that relies on a widely accepted programming interface.

Many applications must overcome the scaling limitations imposed by current programming practices by embracing an entirely new way of coordinating parallel execution. Fortunately, this does not mean that we must abandon all of our legacy code. HPX can use MPI as a highly efficient portable communication platform and at the same time serve as a back-end for OpenMP, OpenCL, or even Chapel, while maintaining or even improving execution times. This opens a migration path for legacy codes to a new programming model which will allow old and new code to coexist in the same application.

3 Parallelism and Concurrency in the C++ Programming Language

The underlying foundation of this thesis is the C++ Programming Language. This section gives a brief introduction to the necessary definitions and concepts found in the ISO C++ Programming Language and presents proposals that are considered fundamental for advancing C++ further into the domain of HPC programming. Parallelization and concurrency are therefore in focus and require definitions to provide a foundation and a line of reasoning for parallel runtime systems.

The C++ Programming Language has been among the Top 10 Programming Languages [28, 10] since its creation. It has marked the beginning of the popularity of Object Oriented Programming (OOP) [85, 16]. OOP provides the capabilities to offer rich user interfaces and compelling abstractions. Due to its C heritage, efficiency did not have to suffer, and as such C++ presents a high-performance programming language solution. Furthermore, C++ is a multi-paradigm language combining various aspects of different programming paradigms like functional and generic programming [17].

The history of C++ goes back to the year 1990 when the C++ Annotated Reference Manual (C++-ARM) [25] was published. C++ was designed from the beginning to support efficient development for large-scale systems and has evolved from “C with Classes” to a full ISO Programming Standard in 1998 [43]. With the help of the template mechanism, this programming language was able to deliver high-level abstractions without significant loss in performance for its rich standard library, offering object-oriented data structures and generic algorithms. These algorithms were not only usable with the predefined data structures but also with user-defined ones. This design is a prime example of re-usability and in fact is one of the driving factors for C++'s popularity [84]. Templates are often called a language within the language and allow the programmer to implement complex calculations with types in a functional manner. This technique is called Template Meta Programming (TMP) and is one of the critical components for implementing efficient abstractions. The next version of the C++ Language in 2003 [44] was only a minor revision of the language and its standard library, containing mostly bug fixes.

The rise of multi-core processors in the early 2000s confronted programmers with new challenges: most modern programming languages lacked support for concurrency and parallelism. Due to the end of ever-scaling single core performance, the free lunch was over [97]. The C++ language specifications so far lacked a memory model and library constructs to support concurrent programming. Without this, software developers wanting to exploit the performance possibilities of modern multi-core processors had to rely either on other language extensions like OpenMP or on low-level, platform-specific APIs like POSIX [7]. It was only in 2011 that the new standard was published [45]. This new standard came with new features within the core language (perfect forwarding, lambdas and range-based for loops). In addition, it introduced a theoretical memory model (developed in conjunction with the C11 ISO Standard) together with support for low-level concurrency to deal with multiple threads of execution as well as critical sections, in addition to other synchronization primitives and atomic operations. These lay the foundation for a modern language supporting the needs of the ever more emerging multi- and many-core architectures.

The next iteration of the standard in 2014 [46] brought nothing new to the language concerning concurrency or parallelism. However, the latest version of the standard, published in 2017 [47], adds parallel algorithms as well as other relevant features such as generic lambdas and more support for compile-time programs to further support TMP. The parallel algorithms are an essential step towards high-level abstractions for parallel programming and offer a good starting point for parallel skeletons.

The remainder of this chapter will introduce the low-level abstractions needed for efficient parallel programming (see Section 3.1). Those are essential and form the needed foundation by providing the memory and execution model. Higher-level abstractions, such as parallel algorithms, will then be presented in Section 3.2. Section 3.3 will provide insight into the evolution of the C++ Language with respect to parallel programming and will introduce further concepts up for standardization, such as executors and the evolution of asynchronous programming using CPS, that are going to be used and refined in Chapter 4 and Chapter 5.

3.1. Low-Level Abstractions

Whenever dealing with parallelism or concurrency, the need for low-level abstractions arises. Those abstractions are necessary to express parallelism and to apply solutions for problems arising due to concurrency, such as data races (see Definition 2). As the basis for further discussion in this chapter, we use the following definitions:

Definition 3.1 (Thread of Execution) A thread of execution is a single flow of control executing a function and every subsequently invoked function. [45, 46, 47]

Definition 3.2 (Data Race) A data race is the effect of (at least) two conflicting operations, issued by (at least) two concurrently running threads of execution, that may be interleaved. That is, those operations are not atomic [45, 46, 47].

Definition 3.3 (Sequential Consistency) A data race free program is considered sequentially consistent.

Definition 3.4 (Execution Agent) An execution agent is an entity such as a thread that may perform work in parallel with other execution agents. [45, 46, 47]

The thread of execution (see Definition 1) forms the basis of every program; it represents the control flow, that is, the logic of the application. The logic of the application might spawn new execution agents (see Definition 4) and as such has to take care of eliminating data races (see Definition 2). An operation in Definition 2 is a C++ expression that reads or modifies a given memory location. Therefore, a data race consists of two conflicting expressions that read or write the same memory location concurrently without being explicit atomic operations.
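
To make Definition 2 concrete, the following minimal sketch (illustrative only, not taken from the original text) contrasts a racy increment of a plain int with the equivalent, well-defined atomic increment:

#include <atomic>
#include <iostream>
#include <thread>

int racy = 0;               // plain int: concurrent increments are a data race
std::atomic<int> safe{0};   // atomic int: concurrent increments are well-defined

void work()
{
    for (int i = 0; i < 100000; ++i)
    {
        ++racy;             // two threads interleave here: undefined behavior
        ++safe;             // atomic read-modify-write: no data race
    }
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // safe is always 200000; racy may hold any value due to the data race.
    std::cout << racy << " vs. " << safe << std::endl;
}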

Additionally, a memory model is needed that is well-defined in a context with multiple execution agents. The C++ model will be described in Section 3.1.1. Based on this memory model, other tools can be defined to synchronize between different threads of execution (see Section 3.1.2). Section 3.1.3 will discuss the spawning of execution agents.

3.1.1. Memory Model

To execute a program, regardless of whether it runs in a multi- or single-threaded environment, one needs to define a memory model to access and mutate the state of a program correctly. It should be noted that the memory model in itself assumes a data-race free program. The purpose of the memory model is to define the ordering of various memory accesses and therefore to allow reasoning about the sequence in which the instructions are executed. This is needed to establish Happens Before or Happens After relationships that determine the results of operations. These relationships are defined in the following sections. In particular, those memory orderings are represented in the atomic operations introduced in the upcoming sections. This will form the basis to implement correct multi-threaded applications.

Memory Ordering and Consistency

The precondition for defining atomic memory operations on different types is to define rules that describe the mechanisms of when and how the side effects of those operations become visible to other threads of execution. Listing 3.1 presents the enumeration of the available memory orderings. Those orderings denote either acquire or release semantics. The exception is memory_order_relaxed, which comes without any guarantees concerning the visibility of other writes or loads and can be used as an optimization. The strongest guarantee is given by memory_order_seq_cst, in that it is defined as if the program was executed sequentially and no other operation is allowed to be reordered before or after. All other side effects, be they stores or loads, need to be visible to this thread of execution.

enum class memory_order
{
    memory_order_relaxed, // Relaxed, no guarantees
    memory_order_consume, // Soon to be deprecated
    memory_order_acquire, // Acquire, synchronizes with loads
    memory_order_release, // Release, synchronizes with stores
    memory_order_acq_rel, // Acquire/Release
    memory_order_seq_cst  // Sequential Consistency
};

Listing 3.1: Memory order as defined by the C++ Standard. memory_order_relaxed is the weakest ordering, as it is not associated with any guarantees, while memory_order_seq_cst is the strongest, offering sequentially consistent acquire and release semantics.

An acquire operation issued on a given memory location has the effect that no subsequent operation may be reordered before it and that all writes made visible by the matching release operation are visible to the current thread of execution. That is, all stores previously performed by the releasing thread are visible after the acquire. The orderings that adhere to acquire semantics are memory_order_acquire, memory_order_acq_rel and memory_order_seq_cst. Only a load (or read-modify-write) operation can operate with acquire semantics.

In contrast to acquire, ordering with release memory order is only valid for store (or read-modify-write) operations. That is, all stores previously issued by this thread become visible to an acquire operation that observes the released value, and no prior operation can be reordered after this store. The orderings that fulfill those semantics are: memory_order_release, memory_order_acq_rel and memory_order_seq_cst; on the load side, memory_order_consume1 offers a weaker variant of acquire.

1 This memory order is expected to be deprecated, as there is no real benefit over memory_order_acquire; in fact, all implementations implement it in terms of that ordering.

These orderings denote sufficient requirements on atomic operations to reason about Happens Before or Happens After relationships. Even though the actual order of execution is undefined and non-deterministic, one can compose the required operations with the particular memory ordering, allowing for correctness of the algorithm across concurrently executing threads of execution.
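
The canonical use of these orderings is the publication pattern sketched below (an illustrative example added to this text): the writer's release store synchronizes with the reader's acquire load, establishing the happens-before relationship that makes the plain write to payload visible.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // publication flag

void producer()
{
    payload = 42;                                  // 1: prepare the data
    ready.store(true, std::memory_order_release);  // 2: publish with release
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // 3: acquire until published
        ;                                          // busy-wait for brevity
    assert(payload == 42);                         // 4: guaranteed to hold
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}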

Memory Fences

A fence operation is a synchronization primitive that can have either acquire semantics (acquire fence), release semantics (release fence) or both. A release fence synchronizes with acquire fences or acquire operations in other threads of execution that observe the writes issued before the fence; similarly, an acquire fence synchronizes with the respective release operations. Listing 3.2 shows the API functions as defined in the C++ Standard.

extern "C" void atomic_thread_fence(memory_order order) noexcept;
extern "C" void atomic_signal_fence(memory_order order) noexcept;

Listing 3.2: The API to issue memory fences. The order argument denotes the semantics of the fence operation: it can either have no effect or carry acquire and/or release semantics.
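
The same publication pattern shown earlier can be expressed with stand-alone fences combined with relaxed atomic accesses; the sketch below is illustrative and not part of the original text.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> flag{false};

void writer()
{
    data = 7;                                             // plain write
    std::atomic_thread_fence(std::memory_order_release);  // release fence
    flag.store(true, std::memory_order_relaxed);          // relaxed store
}

void reader()
{
    while (!flag.load(std::memory_order_relaxed))         // relaxed load
        ;
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
    assert(data == 7);   // the fences synchronize, so the write is visible
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}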

Atomic Operations

Without atomic operations, it is impossible to enforce visibility of data to other, concurrently running threads. It is important to note that regular operations on built-in types do not have defined behavior regarding visibility to other threads of execution. Memory regions altered in a thread of execution can only be made visible through defined synchronization points and explicit atomic operations, which are specified using the defined memory orderings (see Listing 3.1).

As such, all atomic operations use a special type. Listing A.1 shows the definition of the interface. Together with the regular constructors, additional member functions are defined for loading, storing and atomically exchanging values. By default, the memory order is set to memory_order_seq_cst as it provides the strongest guarantees. An atomic object is not copyable; however, non-atomic values are assigned and extracted implicitly with sequentially consistent acquire-release semantics. The compare-exchange operations perform a bit-wise comparison with the expected value; only if this equality holds is the value updated, otherwise the operation is equivalent to an atomic load.

For integral types (such as int or long), additional member functions have been added to perform numerical (for example fetch_add) or logical (for example fetch_or) exchange operations, which atomically apply the respective operation to the currently stored value (see Listing A.2). Those operations are defined using the two's complement binary representation of integer values with a wrap-around on overflow. Another specialization of the atomic type exists for pointer types (see Listing A.3). These atomic pointers allow arithmetic operations on the pointer value following the same semantics as the integral operations.
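
As an illustration of these read-modify-write operations (a sketch added for this text, not from the original), the following counter uses fetch_add, and the running maximum uses the typical compare_exchange_weak retry loop:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> counter{0};   // number of processed values
std::atomic<long> maximum{0};   // largest value seen so far

void update(long value)
{
    counter.fetch_add(1, std::memory_order_relaxed);   // atomic increment

    long current = maximum.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `current`; retry until we
    // either stored `value` or observed an already larger maximum.
    while (current < value &&
           !maximum.compare_exchange_weak(current, value))
    {
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (long i = 1; i <= 8; ++i)
        threads.emplace_back(update, i * 10);
    for (auto& t : threads)
        t.join();
    std::cout << counter << " updates, maximum " << maximum << std::endl;
}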

Synchronization Points

Memory orderings are essential for atomic operations, but solely using them for an entire program is overkill regarding semantics and efficiency. Not all operations need to be atomic with respect to each other, and the already existing algorithms and data structures should still be usable in the context of a multi-threaded program. For that purpose, it is crucial to have synchronization points. As such, each atomic operation that is not relaxed is a synchronization point. Since all mutex operations use atomics in some form or another, they represent synchronization points.

struct mutex
{
    constexpr mutex() noexcept;
    ~mutex();

    mutex(const mutex&) = delete;
    mutex& operator=(const mutex&) = delete;

    void lock();
    bool try_lock();
    void unlock();

    using native_handle_type = implementation-defined;
    native_handle_type native_handle();
};

Listing 3.3: The C++ definition of a mutex implementing the Lockable concept.

3.1.2. Concurrency Support

After having discussed the memory model and the respective atomic operations, the next step is to introduce additional support for marking critical regions that shall be protected from concurrent access. Naturally, those are internally implemented using atomic instructions and other low-level primitives provided by the operating system.

Three concepts defined in the standard are BasicLockable, Lockable and TimedLockable.

Definition 3.5 (BasicLockable) A type is BasicLockable when it fulfills the following requirements (m denotes an object of such a type):

m.lock(): Blocks until a lock can be acquired for the current execution agent.
m.unlock(): Releases a lock held by the current execution agent. The lock needs to have been acquired by the same execution agent beforehand.

Definition 3.6 (Lockable) A type is Lockable if it meets the requirements of BasicLockable (see Definition 5) and the following requirement holds (m denotes an object of such a type):

m.try_lock(): Attempts to acquire a lock for the current execution agent without blocking. It returns true if the lock was acquired, false otherwise.

Definition 3.7 (TimedLockable) A type is TimedLockable if it meets the requirements of Lockable (see Definition 6) and the following requirements hold (m denotes an object of such a type):

m.try_lock_for(rel_time): Attempts to acquire a lock for the current execution agent. It blocks until either the lock has been acquired or the relative timeout specified by rel_time has been exceeded. It returns true if the lock was acquired, false otherwise.
m.try_lock_until(abs_time): Attempts to acquire a lock for the current execution agent. It blocks until either the lock has been acquired or the absolute timeout specified by abs_time has been exceeded. It returns true if the lock was acquired, false otherwise.

BasicLockable solely requires member functions to lock and unlock the object in order to mark the critical region (see Definition 5). In addition to that, Lockable (see Definition 6) extends this definition by adding the possibility to try to lock. That is, locking of the critical section does not need to be successful. Such speculative locking is useful whenever a critical section might be contended, since the caller is not forced to block while waiting for the lock to become available. A further extension is the TimedLockable concept (see Definition 7), where the try-to-lock operation is supplemented with a time duration, attempting to acquire the lock for a specific amount of time.

Listing 3.3 shows an example interface of a mutex that satisfies the definition of Lockable. Likewise, the C++ Standard defines timed_mutex to implement the TimedLockable concept. A mutex is required to offer exclusive ownership semantics; that means that once a call to lock() has returned, no other thread of execution can return from a call to lock(), that is, lock blocks until the mutual exclusion properties are fulfilled. Not part of the definition, however, is the type of blocking; an implementation using spinlock techniques (such as a busy-loop waiting on an atomic flag to be set, with an optional exponential back off) instead of operating system APIs is entirely possible, as sketched below.
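
A minimal sketch of such a spinlock (illustrative, not a listing from the original text) that satisfies the Lockable requirements using only an atomic flag could look as follows; a production-quality implementation would add back-off and padding:

#include <atomic>

class spinlock
{
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        while (flag_.test_and_set(std::memory_order_acquire))
            ;   // busy-wait until the flag has been cleared
    }

    bool try_lock()
    {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        flag_.clear(std::memory_order_release);
    }
};

Because it models Lockable, such a type can be used with the RAII wrappers of Table 3.1, for example std::lock_guard<spinlock>.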

lock_guard (BasicLockable): This class is not copyable or movable. It is used to control ownership of lockable objects within a scope.
unique_lock (TimedLockable): This class is used to control ownership within a scope. Ownership can be transferred via move operations.
shared_lock (TimedLockable): This class is used to control shared ownership within a scope. Shared ownership can be transferred via move operations.

Table 3.1.: The C++ Resource Acquisition Is Initialization (RAII) classes to manage critical sections.

template <typename BasicLockable>
class lock_guard
{
    BasicLockable& lock_;

public:
    lock_guard(BasicLockable& lock)
      : lock_(lock)
    {
        lock_.lock();
    }

    ~lock_guard()
    {
        lock_.unlock();
    }

    lock_guard(lock_guard const&) = delete;
    lock_guard& operator=(lock_guard const&) = delete;
};

Listing 3.4: The lock_guard is a Resource Acquisition Is Initialization (RAII) class that can be used to programmatically represent a critical section, using automatic, exception-safe resource management for a mutex.

Defining the mutual exclusion types alone might seem sufficient at first sight. Considering that a mutex represents a resource, however, it makes sense to offer RAII classes that acquire the resource and release it once the scope is left. The RAII idiom is based on the principle of acquiring a resource in the constructor of an object while releasing it in the destructor. Adopting this principle has the effect that unlocking a mutex is performed automatically. Listing 3.4 presents an example implementation of a lock_guard. This class locks the passed reference (which is a BasicLockable object) upon construction and unlocks it in the destructor. RAII is a common pattern in modern C++ and is useful to express intent at scope level without sacrificing exception safety or risking other hazards from early returns. Table 3.1 shows an overview of the different locks that can be used from within the C++ standard library. One additional constructor overload, which is not part of Listing 3.4, is used to modify the behavior of the constructor by taking global constants with distinct types using the following symbols:

• try_to_lock: Calls try_lock instead of lock. This might lead to the lock not being acquired. An additional conditional check is needed here.

• adopt_lock: Does not call lock in the constructor, assumes that the calling thread owns the mutex.

• defer_lock: Does not call lock in the constructor, assumes that the calling thread does not own the mutex.

When dealing with more than one lock protecting different memory regions that need to be locked simultaneously, it is easy to create deadlocks if those locks are not acquired and released in the same sequence. In general contexts, this is impossible to guarantee. For that reason, the C++ standard defines two variadic functions (see Listing 3.5) to ensure the deadlock-free acquisition of locks. In combination with the RAII classes described in Table 3.1 it is possible to create deadlock-free programs with any number of mutexes to protect critical regions properly.

template <typename... Lockable>
void lock(Lockable&...);
template <typename... Lockable>
int try_lock(Lockable&...);

Listing 3.5: The generic, variadic locking functions perform a deadlock-avoidance algorithm to acquire the passed locks. std::lock returns once all locks are acquired. std::try_lock returns -1 if all locks could be acquired, or the index of the lock which could not be acquired.
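
A short usage sketch (added here for illustration) combines the RAII wrappers with std::lock: both unique_lock objects are constructed with defer_lock, std::lock then acquires the two mutexes without risking deadlock, and the destructors release them at scope exit.

#include <mutex>
#include <thread>

std::mutex m1, m2;
int balance_a = 100;
int balance_b = 0;

void transfer(int amount)
{
    std::unique_lock<std::mutex> l1(m1, std::defer_lock);
    std::unique_lock<std::mutex> l2(m2, std::defer_lock);
    std::lock(l1, l2);      // deadlock-free acquisition of both locks

    balance_a -= amount;    // critical section protected by both mutexes
    balance_b += amount;
}                           // l2 and l1 unlock automatically here

int main()
{
    std::thread t1(transfer, 10), t2(transfer, 20);
    t1.join();
    t2.join();
}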

With the availability of mechanisms to create critical sections and avoid data races, the only ingredient missing for sufficient support of multi-threaded programming is a synchronization primitive that can be used to block a thread until it is notified by some other thread that a condition is met, or until a specified system time has been reached. The C++ standard library definition of condition_variable can be found in Listing 3.6. Functions are provided to notify one or all waiting threads, to wait indefinitely until a condition is met, and timed versions thereof.

This definition is very similar to the one defined in the POSIX [7] standard and therefore does not come with many surprises. One observation to make is that condition_variable::wait takes a specific lock type as the first argument. This might be undesirable when using other lock types as presented previously. For that reason, the C++ standard also defines the condition_variable_any class, which takes a generic BasicLockable lock as an argument to the wait functions.

struct condition_variable
{
    condition_variable();
    ~condition_variable();

    condition_variable(const condition_variable&) = delete;
    condition_variable& operator=(const condition_variable&) = delete;

    void notify_one() noexcept;
    void notify_all() noexcept;

    void wait(unique_lock<mutex>& lock);
    template <typename Predicate>
    void wait(unique_lock<mutex>& lock, Predicate pred);

    template <typename Clock, typename Duration>
    cv_status wait_until(unique_lock<mutex>& lock,
        const chrono::time_point<Clock, Duration>& abs_time);
    template <typename Clock, typename Duration, typename Predicate>
    bool wait_until(unique_lock<mutex>& lock,
        const chrono::time_point<Clock, Duration>& abs_time, Predicate pred);

    template <typename Rep, typename Period>
    cv_status wait_for(unique_lock<mutex>& lock,
        const chrono::duration<Rep, Period>& rel_time);
    template <typename Rep, typename Period, typename Predicate>
    bool wait_for(unique_lock<mutex>& lock,
        const chrono::duration<Rep, Period>& rel_time, Predicate pred);

    using native_handle_type = implementation-defined;
    native_handle_type native_handle();
};

Listing 3.6: The condition_variable forms the foundation for notifying different threads of execution that a condition has been met. This can be used, for example, in producer-consumer scenarios.
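
The following sketch (illustrative, not a listing from the original text) shows such a producer-consumer scenario: the consumer blocks in cv.wait until the producer has pushed an item and called notify_one.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<int> items;
bool done = false;

void producer()
{
    for (int i = 0; i < 5; ++i)
    {
        {
            std::lock_guard<std::mutex> lk(m);
            items.push(i);
        }
        cv.notify_one();            // wake the consumer
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_one();
}

void consumer()
{
    std::unique_lock<std::mutex> lk(m);
    for (;;)
    {
        // The predicate guards against spurious wake-ups.
        cv.wait(lk, [] { return !items.empty() || done; });
        while (!items.empty())
        {
            std::cout << items.front() << std::endl;
            items.pop();
        }
        if (done)
            break;
    }
}

int main()
{
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}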

3.1.3. Task-Parallelism Support

All concurrency facilities described in Section 3.1.2 provide the necessary basis for implementing concurrently operating libraries and applications. However, using those low-level facilities gets cumbersome and error-prone very quickly. One of the reasons for this is the missing support for, and concept of, transporting results. As such, the natural consequence is to add an abstraction to support task-based parallelism and allow for seamless transport of return values of functions. As the underlying concept, the C++ Standard leverages Futures and Promises [5, 26].

The classical std::thread2 interface, which starts the execution of a function in a new thread of execution and provides additional members that allow waiting until it has finished, will not be discussed here. The ability to start new threads, and to join with or to detach from the completion of the thread, is completely covered by the interface used for task-parallelism and as such is considered obsolete within the context of this thesis. The significant advantage of the approach presented here is the seamless transport of the outcome of the executed task as well as improved synchronization and composability when using futures.

The Shared State

Coming back to Task-Parallelism support in the C++ Standard, we have two important concepts: the producer and the consumer. They communicate over a common, reference counted3, but otherwise unspecified shared state. The shared state is the centerpiece of all the subsequently introduced concepts, classes, and functions. The consumer is defined as Asynchronous Return Object (see Definition 3.8) and the producer is defined as Asynchronous Provider (see Definition 3.9).

Definition 3.8 (Asynchronous Return Object) An asynchronous return object is an object that reads results from a shared state. A waiting function of an asynchronous return object is one that potentially blocks to wait for the shared state to be made ready. If a waiting function can return before the state is made ready because of a timeout, then it is a timed waiting function. Otherwise, it is a non-timed waiting function.

Definition 3.9 (Asynchronous Provider) An asynchronous provider is an object that provides a result to a shared state. The result of a shared state is set by respective functions on the asynchronous provider. The means of setting the result of a shared state is specified in the description of those classes and functions that create such a state object.

2 http://en.cppreference.com/w/cpp/thread/thread
3 In order to keep track of how many return objects or providers reference the shared state and to avoid dangling pointers.

The shared state is reference counted in a thread-safe manner. Both the provider and the return object hold a reference to the shared state. There shall only ever be one asynchronous provider, but there might be multiple asynchronous return objects. Whenever either the return object or the provider releases its shared state, the reference count is decremented by one. Once the reference count reaches zero, the shared state is destroyed. A shared state is ready whenever the provider sets either an exception or the value. The retrieval and setting of the shared state's value or exception are appropriately synchronized with each other using the mechanisms described in Section 3.1.2.

Asynchronous Provider

At first, we want to start with the asynchronous providers as they form the entry point for writing task-based applications. The most prominent provider is the promise (see Listing A.4). The promise itself does not create any new threads or tasks; it can be considered the archetypical asynchronous provider. It provides all the functions for setting either a value or an exception. For retrieving the value, however, one must refer to an asynchronous return object. In the case of C++, the most common asynchronous return object is std::future (see Listing A.6). An important observation to make here is that there should only ever be one owner of an asynchronous provider, even though ownership might be transferred via move operations such as move assignment or construction.
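A minimal sketch illustrating this division of roles (the value 42 and the explicit thread are chosen purely for illustration): the promise acts as the asynchronous provider and the future obtained from it as the asynchronous return object.

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> p;                    // asynchronous provider
    std::future<int> f = p.get_future();    // asynchronous return object

    // The provider is moved into another thread of execution, which
    // eventually makes the shared state ready by setting a value.
    std::thread t([pr = std::move(p)]() mutable { pr.set_value(42); });

    // The consumer blocks until the shared state is ready.
    std::cout << f.get() << '\n';
    t.join();
}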

The next asynchronous provider is packaged_task (see Listing A.5). In contrast to promise, a packaged_task does not allow setting the value or exception of the shared state directly. The packaged_task instead is a wrapper around a callable with return type R and a variadic argument list Args.... That is, upon construction, a callable object (see Definition 3.10) is passed to packaged_task. The value set in the shared state is then the return value produced by the passed callable. If an exception is thrown during the execution of the passed function, it will be intercepted and stored in the shared state. To retrieve that value (or exception), users have to call get_future to obtain an asynchronous return object in the form of a future. This future will be set ready once the task has completed and the return value has been stored in the shared state.
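The following minimal sketch (the function add and the explicit std::thread are illustrative choices, not mandated by packaged_task) shows how the wrapper stores the callable's return value in the shared state, from which it is retrieved via the future obtained through get_future:

#include <future>
#include <iostream>
#include <thread>

int add(int a, int b) { return a + b; }

int main()
{
    // Wrap the callable; its return value will be stored in the shared state.
    std::packaged_task<int(int, int)> task(add);
    std::future<int> f = task.get_future();

    // packaged_task does not create a new thread by itself; here it is
    // explicitly run in a new thread of execution.
    std::thread t(std::move(task), 20, 22);

    std::cout << f.get() << '\n';   // prints 42 once the task has completed
    t.join();
}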

Definition 3.10 (Callable) A Callable is a type that is invokable; an object f of type Callable can be called with the passed parameter pack Args. Expression: INVOKE(f, args...) Requirement: this expression is well-formed.

Where INVOKE is well-formed when f is either a pointer to a member function or member variable, or if f is either a regular function, a lambda or an object with operator() overloaded. The passed parameters must be convertible to the function's arguments. The return value of the function must be convertible to R as well, or is discarded if R is void.

The packaged_task, even though the name might suggest it, doesn't execute the passed function in a new thread of execution, and therefore isn't required to execute it concurrently. However, it can be used to build asynchronous providers that achieve exactly that. Listing 3.7 defines the signatures for the various std::async overloads as provided by the C++ Standard. They allow functions to be executed asynchronously, which might be implemented using packaged_task; for the sake of providing only the semantics, the implementation details are left out of the definition.

template <typename F, typename... Args>
future<result_of_t<decay_t<F>(decay_t<Args>...)>>
async(F&& f, Args&&... args);

template <typename F, typename... Args>
future<result_of_t<decay_t<F>(decay_t<Args>...)>>
async(launch policy, F&& f, Args&&... args);

Listing 3.7: The function async is used to asynchronously spawn a given task. The second overload determines how and when the task is going to be launched.

To create asynchronously (and possibly concurrently) running tasks, async was defined. In its current form, two overloads are provided. The first has the intent to start a task executing the callable f with the passed parameters args. The second overload has an additional first argument, the launch policy. It is defined as:

enum class launch : unspecified {
    async = unspecified,
    deferred = unspecified,
    // implementation-defined
};

In the case of the first overload, when no launch policy is specified, the implementation might choose between launch::async and launch::deferred. launch::async will invoke the passed function and execute it as if it was executed in a new thread. This will almost always have the effect that a new thread of execution is created. In the case of launch::deferred, the implementation should defer the invocation of the function to a point in time when no more concurrency can be effectively exploited. In both cases, the arguments and the passed function are decay-copied, that is, stored internally until the function is finally executed. The returned future will be set to contain the result of the function invocation, which is either the returned value or an exception that has been thrown during function invocation.
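The following short sketch (the function square and the concrete values are illustrative only) contrasts the three ways of launching a task with std::async:

#include <future>
#include <iostream>

int square(int x) { return x * x; }

int main()
{
    // Launch policy left to the implementation: async, deferred, or both.
    std::future<int> f1 = std::async(square, 4);

    // Execute as if in a new thread of execution.
    std::future<int> f2 = std::async(std::launch::async, square, 5);

    // Defer the invocation until the result is requested via get() or wait().
    std::future<int> f3 = std::async(std::launch::deferred, square, 6);

    std::cout << f1.get() + f2.get() + f3.get() << '\n';   // 16 + 25 + 36
}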

Asynchronous Return Object

After having defined the basic set of asynchronous providers that allow us to spawn asynchronous tasks, this part will deal with defining the asynchronous return objects. The C++ Standard defines future (see Listing A.6) and shared_future (see Listing A.7). The commonality of the two is the ability to retrieve the value or exception stored in the shared state, but not to set it. The difference lies in the ownership semantics: future only ever has a single owner (transferable through move semantics) while shared_future represents shared ownership, such that there can be multiple references to the same shared state. One of the effects of the unique ownership semantics of future is that whenever the shared state's value is retrieved, the future is invalidated. Similarly, when calling share(), the semantics change, and therefore the original future needs to be invalidated to avoid corner cases with asynchronous return objects pointing to the same shared state but representing different semantics.

The result of the asynchronous provider can be retrieved by calling get(). This function will either return the stored value or rethrow the exception that might have occurred. A call to get() might block until the value has been stored in the shared state by the provider. In a scenario where a simple wait on the result of the shared state is sufficient, without actually retrieving the result, future also defines the wait() function, which does not lead to an invalidation of the future object. In addition, timed wait functions are defined to restrict waiting to a specific duration or until a certain time point. Those timed wait functions block until either the shared state is ready or the requested time has run out. The result is returned via the future_status enum:

enum class future_status {
    ready,    // the waited on future is ready
    timeout,  // the time period is up, the future isn't ready yet
    deferred  // the shared state contains a deferred function
};
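A small sketch of a timed wait (the sleep duration and polling interval are arbitrary illustration values); note that wait_for, in contrast to get(), does not invalidate the future:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::future<int> f = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        return 7;
    });

    // Poll the shared state; the future stays valid across timed waits.
    while (f.wait_for(std::chrono::milliseconds(10)) !=
           std::future_status::ready)
    {
        std::cout << "not ready yet, doing other work...\n";
    }
    std::cout << f.get() << '\n';
}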

The difference between future and shared_future lies, as mentioned previously, in the ownership semantics. This results in a difference in the semantics of the get() function. While future::get() returns the shared state's result by value, shared_future::get() returns the result by const reference. Also, the retrieval of the shared state's result does not lead to invalidation; as such, the shared state can be read multiple times, and of course, there can be multiple references to the same shared state from multiple shared_future return objects. In most cases, asynchronous providers return a future; when required, this can be converted to a shared_future, either implicitly or explicitly by calling future::share().
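The following sketch (four reader threads, chosen arbitrarily) illustrates the shared ownership semantics: after calling share(), the result can be read from multiple threads, each holding its own copy of the shared_future:

#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    std::promise<int> p;
    // share() invalidates the original future; the shared_future is copyable.
    std::shared_future<int> sf = p.get_future().share();

    std::vector<std::thread> readers;
    for (int i = 0; i < 4; ++i)
    {
        readers.emplace_back([sf, i] {
            // get() returns by const reference and does not invalidate the
            // shared state; it may be called from multiple threads.
            std::cout << "reader " << i << " sees " << sf.get() << '\n';
        });
    }

    p.set_value(42);    // makes the shared state ready for all readers
    for (auto& t : readers) t.join();
}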

3.2. Higher Level Parallelism

The low-level abstractions introduced in Section 3.1 form the basis and allow formulating further abstractions by using the regular composition facilities provided by the C++ programming language. The goal should be to provide high-level, generic and reusable abstractions. Those abstractions should be able to cover the needs of individual application domains without limiting the possibilities to create further abstractions; allowing parallelization techniques to be applied efficiently and effectively without the need to deal with low-level concurrency details that are often error-prone and hard to use. The different parallelization strategies are all derived by relying on a small set of types of parallelism that are illustrated in Figure 3.1. The current C++ Standard starts with extending the standard library specification with parallel algorithms (see Section 3.2.2).

Figure 3.1.: Showing the different forms of parallelism discussed in this thesis and how they relate to each other in the context of a task-based system. Everything is built on top of the lower layers, without restricting the use of any layer underneath, to allow for full flexibility.

3.2.1. Concepts of Parallelism

When defining higher level parallelism features, the various means of expressing parallelism and their mapping to specific concepts have to be laid out. On this basis, we will derive the first subset of generically applicable higher-level policies that allow us not only to express parallelism but also to define potentially user-defined customization points. Those customization points can be used to alter the execution behavior.

Types of Parallelism

First, this section describes the various types of parallelism. Starting from a low-level perspective, the hardware defines the available forms of parallelism:

• Bit-Level Parallelism

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism

Bit-Level parallelism is defined by the different widths of the various data types. That is, an instruction usually performs a given operation on all bits of the underlying data type in parallel instead of computing every single bit sequentially. Those are usually the integral, native data types in a programming language, representing numbers of 8, 16, 32 and 64 bits. From a programmer's point of view, this level of parallelism is omnipresent in a modern Instruction Set Architecture (ISA) and well understood. Furthermore, modern CPU architectures are equipped with several forms of ILP, ranging from superscalar pipelines over SIMD to out-of-order execution and Simultaneous Multi-Threading (SMT).

SIMD introduces a form of parallelism that has to be handled explicitly by a programmer. To exploit the full potential of modern architectures, it is crucial to give programmers ways to use those SIMD instructions. Modern compilers made great advancements in auto-vectorization [105, 73]. However, cases exist for which a compiler still needs to be manually instructed to generate the most efficient SIMD code. As such, the toolkit to support high-level parallelism needs to include a way to effectively and efficiently use the SIMD capabilities of modern architectures in a performance portable manner [55, 56]. Defining those APIs is outside the scope of this thesis and will not be discussed further. With SMT we are entering the realm of Thread-Level Parallelism, which is closely related to hiding stalls in the CPU. Furthermore, SMT needs to be explicitly programmed in the same way as one would program a full multi-core CPU. The low-level concurrency support was discussed in Section 3.1. However, the goal of this section is to lay the foundation and discuss the already existing mechanisms for high-level parallelism, which mainly focus on multithreading and hiding the complexity of low-level thread management to offer a performance portable way of expressing parallelism. Based on that, new parallel programming paradigms have emerged over the last decades:

• Fork-Join Parallelism

• Data Parallelism


To form High-Level APIs, the combination of the above-described paradigms with the low-level characteristics defined by the computer architecture needs to be applied. Figure 3.1 gives an overview of the different layers of abstraction that will be introduced in the subsequent sections. These layers of abstraction allow for maximum flexibility in describing specific properties defined by applications and algorithms to allow for maximal exploitation of the underlying hardware. At the very top, we have the parallel paradigms, which are expressed either by meaningful and generic parallel algorithms or by other, paradigm-specific building blocks, like task_block. These can be customized using execution policies and parameters. All higher-level abstractions are then built on top of the underlying, low-level features for concurrency as described in Section 3.1.2.

Execution Policies

Building upon the low-level building blocks, the fundamental ingredients to build higher-level abstractions for parallelism come as policies. Those policies describe the behavior of the algorithm regarding allowed parallelism on the one hand, and additional, optional parameters to control non-functional properties such as chunk size on the other hand. Whether sequential or parallel execution, as expressed by the requested execution policy, is permissible is mostly determined by the nature of the user-provided functions the algorithms use, also referred to as access functions. As such, the order in which the access functions might be executed determines the execution policy to be used to guarantee data-race freedom.

namespace std { namespace execution {
    // A unique, implementation defined type to disambiguate
    // overloading for sequential execution
    class sequenced_policy
    {
        // unspecified
    };

    // A global instance of the sequenced_policy type that can
    // be passed to the parallel algorithms
    constexpr sequenced_policy seq { /*unspecified*/ };
}}

Listing 3.8: Sequential execution policy type and global Object

Sequential Execution Policy In the case where the access functions need to be executed sequentially, that is, no parallelism is allowed, the sequential execution policy guarantees that the algorithm will invoke the access functions in the calling thread in an undetermined order. The types and objects to use can be seen in Listing 3.8.

Parallel Execution Policy To select an algorithm overload where the access functions can be concurrently invoked, the parallel execution policy can be specified (see Listing 3.9). When this policy is selected, the access functions are guaranteed to be executed indeterminately sequenced with respect to one thread and might, therefore, run concurrently across threads. That is, fixed-sized partitions of the algorithm are executed in parallel. Within the partitions, the user access functions are invoked sequentially.

namespace std { namespace execution {
    // A unique, implementation defined type to disambiguate
    // overloading for parallel execution
    class parallel_policy
    {
        // unspecified
    };

    // A global instance of the parallel_policy type that can
    // be passed to the parallel algorithms
    constexpr parallel_policy par { /*unspecified*/ };
}}

Listing 3.9: Parallel execution policy type and global Object

Unsequenced Parallel Execution Policy In the cases where the user access functions do not require any order at all, the unsequenced parallel execution policy can be applied (see Listing 3.10). The unsequenced part refers to the execution of the user access functions within one thread. The unsequenced behavior is essential to be able to vectorize the user-level access functions: due to vectorization, the sequential consistency of the executed code cannot be strictly guaranteed, and it is the user's job to provide user-level access functions that are correct under these relaxed constraints.

The differences between the three primary execution policies are summarized by giving an example order for processing 12 elements and the possible effect of the specified execution policies (see Figure 3.2).

3.2.2. Parallel Algorithms

The already existing C++ Standard library has proven to provide excellent abstractions and generic APIs to apply various algorithms to a range of elements. Those algorithms range from everyday tasks such as iterating over the elements, mutating the elements in a sequence, or sorting.

namespace std { namespace execution {
    // A unique, implementation defined type to disambiguate
    // overloading for parallel unsequenced execution
    class parallel_unsequenced_policy
    {
        // unspecified
    };

    // A global instance of the parallel_unsequenced_policy type that can
    // be passed to the parallel algorithms
    constexpr parallel_unsequenced_policy par_unseq { /*unspecified*/ };
}}

Listing 3.10: Parallel unsequenced execution policy type and global Object


Figure 3.2.: The effect of the different Execution Policies. Sequenced execution executes all elements in the occurring order without parallelism; parallel execution keeps the original sequence of the elements while executing chunks in parallel; parallel unsequenced executes the elements in parallel while not necessarily keeping the original order.

The generality of the algorithms comes from decoupling the actual storage from accessing the elements by providing iterators into containers. An Iterator4 is distinguished by specific categories that identify complexity guarantees of given operations as well as the ability to advance or access elements randomly.

InputIterator: Allows dereferencing and advancing by one element (increment) in the range. After an increment, any copy of the previously incremented iterator is no longer valid. When dereferencing, the value is only readable.

OutputIterator: Allows dereferencing and incrementing the iterator. As with InputIterator, once dereferenced, other copies become invalid. The result of the dereference operation is write-only.

ForwardIterator: This category specifies iterators that can be incremented and that, unlike InputIterator and OutputIterator, stay valid when being dereferenced.

BidirectionalIterator: A refinement of ForwardIterator that in addition also allows the decrement of an iterator, that is, going back one element.

RandomAccessIterator: A refinement of BidirectionalIterator that allows O(1) access to the elements in the described range.

Table 3.2.: The iterator categories and their impact on parallelization

By adhering to those principles, the algorithms that are feasible to be parallelized have been easily identified, and the specification was a straightforward extension. The already existing sequential versions have been amended with overloads taking the previously discussed execution policies (see Section 3.2.1). In addition, the iterator categories determine the effective possibility of parallelization as described in Table 3.2. Apart from InputIterator and OutputIterator, all other iterator categories can be used to write parallel algorithms. However, RandomAccessIterator is suited best due to the efficient generation of working sets. The complete list of parallel algorithms can be found in Appendix A.4, including non-modifying sequence operations (see Table A.2), modifying sequence operations (see Table A.6), partitioning operations (see Table A.8), sorting operations (see Table A.10), binary search operations (see Table A.12), set operations (see Table A.4), heap operations (see Table A.14), minimum/maximum operations (see Table A.16), numeric operations (see Table A.18) and operations on uninitialized memory (see Table A.20).

4 An Iterator is a generic concept to access elements pointing into a range of elements. This range can be either infinite or bound to the elements of a given container. Iterators can be used to traverse containers.

As an example of the expressiveness of parallel algorithms, a naïve implementation of the for_each algorithm is presented here. The algorithm has the following signatures with two overloads:

template <typename InputIt, typename UnaryFunction>
void for_each(InputIt begin, InputIt end, UnaryFunction f);

template <typename Policy, typename InputIt, typename UnaryFunction>
void for_each(Policy, InputIt begin, InputIt end, UnaryFunction f);

The first is the classic version of the algorithm, while the second one has an additional Policy argument. This policy specifies the required execution policy. InputIt denotes the type of the iterator for an exclusive range [begin, end). That is, for each element from begin to end, the UnaryFunction f will be called. It is important to note that the first overload and the sequential execution policy do not impose any restrictions on the iterator category, whereas the parallelized version requires at least ForwardIterator (see also Table 3.2).

This algorithm can now be naïvely parallelized by dividing the range into mostly equal-sized chunks (assuming that the number of elements is evenly divisible by the number of threads), distributing those chunks to different threads and joining those threads afterward. Section 5 will dive deeper into possibilities on how to further improve this algorithm to adapt to different architectural features to allow for true performance portability. The naïve version can be formulated as follows:

template <typename InputIt, typename UnaryFunction>
void for_each(execution::parallel_policy, InputIt begin, InputIt end,
    UnaryFunction f)
{
    std::size_t length = std::distance(begin, end);
    std::size_t num_threads = std::thread::hardware_concurrency();
    std::size_t chunk = length / num_threads;

    std::vector<std::thread> threads;
    // Distribute the work to our concurrently running threads.
    for (auto it = begin; it != end; std::advance(it, chunk))
    {
        auto chunk_end = it;
        std::advance(chunk_end, chunk);

        threads.emplace_back(
            [it, chunk_end, f]() { std::for_each(it, chunk_end, f); });
    }

    // Wait on everyone to finish
    for (auto& thread : threads)
    {
        thread.join();
    }
}

It is important to note that, while this implementation achieves the goal of parallelizing the algorithm, critical functionality is missing, such as control over where the work is run, fine-tuning of the chunk size, and constraints regarding the OS-Threads used; in addition, the parallel region is followed by an implicit barrier.

3.2.3. Fork-Join Based Parallelism

Figure 3.3.: Fork-Join Model: One Master forks several tasks to run in parallel on dif- ferent workers. Once each task is completed, the parallel region is joined onto the master thread and serial execution continues instead of executing everything sequentially.5

Another essential building block for parallel applications is based on the concept of Fork-Join (see Figure 3.3). The importance of this parallel programming design pattern lies in its simplicity and its ability to be applied gradually to an application. In a sense, the parallel algorithms discussed in the previous section can be seen as representatives of the fork/join family of parallel constructs as well. The topic of Fork-Join Based Parallelism will not be part of C++17 but of the Parallelism TS V2 [70].

Instead of building a fine-grained, constraint-based flow of the parallel algorithm, the fork-join pattern demands that each task that has been forked in a given region must be joined at the end of said region before the computation continues. The C++ Parallelism TS V2, therefore, specifies the task_block, defining low-level APIs to perform fork-join based task decomposition.
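To illustrate the fork-join style task decomposition provided by the task_block, the following sketch sums the values of a binary tree; it assumes the Parallelism TS V2 header <experimental/task_block> and the namespace std::experimental::parallel, which vary between implementations (HPX, for instance, ships an equivalent interface under its own namespace):

#include <experimental/task_block>

struct Node { Node* left; Node* right; int value; };

int sum(Node* n)
{
    if (!n) return 0;
    int left = 0, right = 0;
    std::experimental::parallel::define_task_block([&](auto& tb) {
        tb.run([&] { left = sum(n->left); });     // forked task
        tb.run([&] { right = sum(n->right); });   // forked task
        // all forked tasks are implicitly joined when the task_block ends
    });
    return left + right + n->value;
}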

3.3. Evolution

After having discussed the ingredients of the C++ Language, as well as the C++ Standard Library, concerning concurrency and parallelism, this section discusses the extensions that will be needed to allow the C++ ecosystem to tackle the requirements imposed by the High-Performance Computing as well as the Embedded Computing landscape.

The challenges of the near future are dictated by energy-aware as well as performance portable programming of maintainable software, and they evolve towards distributed computing. This section will continue to highlight the evolution of the C++ Standard by introducing the important concepts. At the very heart of this is the concept of executors, forming the basic foundation of abstraction to control where and when tasks are being executed (see Section 3.3.1). Section 3.3.2 will discuss the related challenges and solutions for heterogeneous computing, which shares common problems with distributed computing. This section will close by discussing Futurization as an application of CPS (see Section 3.3.3), which, applied naïvely, easily turns an application into a hard to maintain "callback hell", and how coroutines (see Section 3.3.4) help to keep such code maintainable.

3.3.1. Executors

In Section 3.2.1 the notion of Execution Policies was introduced. Execution Policies are important to allow for an API that can select between parallel and sequential algorithms, depending on the user's needs. As recent developments in the C++ Standards Committee [68, 48, 78] suggest, this isn't enough. Apart from being able to specify the type of parallelism, the ability to select different subsets of a given set of Processing Units (for example NUMA Domains), accelerators or even remote sites becomes more and more necessary to cope with recent developments in heterogeneous computer architectures.

5 https://commons.wikimedia.org/wiki/File:Fork_join.svg

Instruction Stream: Code to be run in a form appropriate for the target execution architecture.

Execution Architecture: Denotes the target architecture for an instruction stream. Possible target execution architectures include accelerators such as GPUs or RPC. The execution architecture may impose architecture-specific constraints and provides architecture-specific facilities for an instruction stream.

Execution Resource: An instance of an execution architecture that is capable of running an instruction stream targeting that architecture.

Execution Agent: An instruction stream is run by an execution agent on an execution resource. An execution agent may be lightweight in that its existence is only observable while the instruction stream is running. As such, a lightweight execution agent may come into existence when the instruction stream starts running and cease to exist when the instruction stream ends.

Execution Context: The mapping of execution agents to execution resources.

Execution Function: An execution function targets a specific execution architecture and represents an instruction stream capable of running on the respective execution context.

Executor: Provides execution functions for running instruction streams on a particular, observable execution resource. A particular executor targets a particular execution architecture.

Table 3.3.: Concepts needed to define the execution of a given high level C++ program on any given Computer/Software Architecture.

Table 3.3 defines the basic concepts needed in order to generalize the notion of an Executor. In principle, an Executor is the extensible high-level abstraction defining where to run a specific function. To provide those functionalities, we need a notion of context, as well as the associated resources. An Executor might be a user-defined type, which needs to be categorized into specified groups (see Table 3.4).

Executors defined using those concepts can then be used to dispatch tasks to them. The concrete syntax and extension points are not finalized yet. Nevertheless, experiments building upon the concepts can be conducted, and field experience can be obtained. As an example usage (as defined in [78]), we can look at the following listing:

BaseExecutor: Forms the basic requirements for an executor. It is capable of returning the Execution Context.

OneWayExecutor: Provides an execute function that operates with fire&forget semantics.

TwoWayExecutor: Capable of submitting an execution function to the underlying context, returning a future holding the result.

ThenExecutor: In addition to submitting a passed function to the execution context, this executor has the ability to defer execution until a predicate future has finished execution.

BulkOneWayExecutor, BulkTwoWayExecutor, BulkThenExecutor: Same as the respective non-bulk versions with the addition of being able to submit the respective operations in bulk.

Table 3.4.: Executor Categories define the different capabilities for an executor in order to derive the respective semantics of the associated operations.

// execute an async on an executor:
auto future = std::async(my_executor, task1);

// execute a parallel for_each on an executor:
std::for_each(std::execution::par.on(my_executor),
    data.begin(), data.end(), task2);

// make require(), prefer(), and properties available
using namespace std::experimental::execution;

// execute a non-blocking, fire-and-forget task on an executor:
require(my_executor, oneway, never_blocking).execute(task1);

// execute a non-blocking, two-way task on an executor. prefer
// to execute as a continuation:
auto future2 = prefer(require(my_executor, twoway),
    continuation).twoway_execute(task2);

// when future is ready, execute a possibly-blocking
// billion-agent task
auto bulk_exec = require(my_executor, possibly_blocking, bulk, then);
auto future3 = bulk_exec.bulk_then_execute(task3, 1<<30, future2,
    result_factory, shared_factory);

3.3.2. Support for heterogeneous architectures and Distributed Computing

One of the most significant features missing in the C++ Programming Language, and especially the standard library, is support for heterogeneous architectures and distributed computing. Although those two seem, at first sight, to have little in common, they share certain aspects that make them sufficiently similar from a high-level perspective to find a common and unified high-level abstraction API. Those similarities are:

• Addressing of remote objects

• Data Movement (implicit and explicit)

• Remote Procedure Calls

The term heterogeneous computing implies that a given computing system does not consist of only a single computer architecture with a set of given properties, but of multiple ones. These different architectures mean that we need a programming interface to express offloading to a given remote computing device, and to move data. Most commonly, heterogeneous architectures are to be found when integrating application-specific accelerators (for example DSPs), general purpose Graphical Processing Units (GPUs) or FPGAs.

For accelerators, we can observe various technological developments to support low-level programming of such devices. Those technologies are CUDA [18], OpenCL [75], SYCL [9], OpenMP [76] and OpenACC [74], with different features and goals for programming accelerators. Whereas the latter two are based on compiler directives (pragmas), and are therefore unsuitable as a basis for generic, high-level abstractions in the style of the C++ Standard Library, the remaining three form a good foundation to experiment with different API designs and trade-offs to generate programmable and performance portable solutions for heterogeneous programming.

On the other hand, support for distributed computing lags behind without a clear solution in sight. The only effort undertaken by the C++ standardization committee is in the form of a Networking TS [69], which merely focuses on plain message passing; this roughly translates to the ability to move data from one device to another in the context of heterogeneous computing using accelerators. Completely missing, however, is the support for remote procedure calls. As the marshaling of data between host and (accelerator) devices is currently done implicitly by the presented solutions (see Section 4.3.2), a well-defined data layout or standardized introspection support would help to drive this development further.

3.3.3. Futurization

Requesting support for heterogeneous architectures and distributed computing requires techniques that can mitigate the unavoidable latencies introduced by explicit communication among different computing entities. It is important to provide facilities that enable and encourage the usage of asynchronous operations while allowing them to overlap with other work. This technique is usually referred to as overlapping computation and communication. The approach presented here is called Futurization and makes use of the so-called "Continuation Passing Style (CPS)" by utilizing asynchronous function calls that return futures.

Continuation Passing Style (CPS)

The origins of CPS programming stem from the realm of functional languages [93, 58]. Functions passed as arguments are used to continue the caller-defined algorithm; in other words, a generic extension to recursion. Another important application was found in the need to express side effects, for example I/O operations, in a strict functional environment [79]. In the context of asynchronous operations, CPS allows for the elegant formulation of data and control flow dependencies [94]. Modern C++ leverages this technique by allowing continuations to be attached to a future, the handle representing the asynchronously computed result (see Section 3.3.3). These observations are the basic principles behind the idea of futurization, now expressed in a multi-paradigm programming language.

Composability of Futures

The std::experimental::future::then function (the same holds for the shared_future variant) allows for sequential composition of future return values; more complex control flows in real applications require support for composing more than one future value. The composition of futures using continuations is called Futurization. That is, an implicit dependency graph is constructed by composing the asynchronous return values another calculation depends on into a single future. Therefore, applying Futurization creates a DAG of tasks that the runtime system can schedule once all input futures have become ready.

template <typename... Futures>
auto when_all(Futures&&... futures);

Listing 3.11: And-composition of futures: when_all returns a future which is marked ready if, and only if, all input futures have become ready

The support for arbitrary complex control flows can then be implemented by providing facilities to AND (see Listing 3.11) and OR (see Listing 3.12) compose futures.

template <typename Sequence>
struct when_any_result
{
    Sequence sequence;
    std::size_t index;
};

template <typename... Futures>
future<when_any_result<tuple<decay_t<Futures>...>>>
when_any(Futures&&... futures);

Listing 3.12: Or-composition of futures: when_any returns a future which is marked ready as soon as one of the input futures has become ready
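As a usage sketch (written in the style of the Concurrency TS; the unqualified names future, async and when_all as well as the continuation signature are assumptions that differ slightly between implementations such as HPX), an and-composition followed by a continuation looks as follows:

future<int> a = async([] { return 20; });
future<int> b = async([] { return 22; });

// The continuation receives a single future holding the tuple of input
// futures once all of them have become ready.
future<int> sum = when_all(a, b).then(
    [](auto all) {
        auto futures = all.get();
        return std::get<0>(futures).get() + std::get<1>(futures).get();
    });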

Control Flow Transformation

As such, the basic idea of Futurization is to apply simple rules to a given control flow that allow turning serial code into an asynchronous version that potentially runs in parallel using CPS programming techniques.

int fib(int n)
{
    if (n < 2) return n;
    int left = fib(n-2);
    int right = fib(n-1);
    return left + right;
}

Listing 3.13: A naïve, recursive implementation of the Fibonacci sequence

Function Calls and Recursion The first and most obvious way of futurization is that of a function call: Instead of calling the function directly, we can dispatch it with a call to async and use the returned future as a dependency for the next calculation. One use case for this is recursive algorithms. As a demonstration, we chose a naïve implementation of the Fibonacci sequence, using the recursive mathematical formulation (see Listing 3.13).

int fib(int n)
{
    if (n < 2) return n;
    future<int> left = async(fib, n-2);
    int right = fib(n-1);
    return left.get() + right;
}

Listing 3.14: Turning recursion into an asynchronous task of computations on the example of a naïve implementation of the Fibonacci sequence

By applying the rule for function calls, we futurize the first recursive call to obtain a future for the result (see Listing 3.14). It is obvious to see that the left and right part of the calculation can be executed in parallel. The distinct disadvantage is that the function needs to block until the left branch has finished the calculation.

future<int> fib(int n)
{
    if (n < 2) return make_ready_future(n);
    future<int> left = async(fib, n-2);
    future<int> right = fib(n-1);
    return when_all(left, right).then(
        [](auto both) {
            auto futures = both.get();
            return std::get<0>(futures).get() + std::get<1>(futures).get();
        });
}

Listing 3.15: Turning recursion into an asynchronous task of computations on the example of a naïve implementation of the Fibonacci sequence without blocking

To overcome the described shortcomings, we use continuations to avoid the explicit blocking and suspension of the thread of execution. Attaching a continuation is to be preferred since suspended tasks use system resources like stacks that need to be stored. It is, of course, a small trade-off, since we need to store the continuation as well, which is usually significantly smaller than the complete stack of a task. The result of the complete transformation can be seen in Listing 3.15. The operation that needs the asynchronous results is the addition, which is executed as the continuation once both branches of the computation have been completed; this closely adheres to CPS.

Conditional Branches Apart from function calls, the other necessary control structure is the conditional branch, expressed as an if-statement. At first sight, extracting parallelism from an if-statement in isolation seems to be hard. The benefit comes in multiple ways. One way is to asynchronously launch either the ThenBlock or the ElseBlock (see Listing 3.16). The advantage we gain is that the resulting future can be used for further futurization, or other work that does not directly depend on the result can be performed after the conditional.

// if-statement:
// if (Condition)
//     ThenBlock
// else
//     ElseBlock

future res;
if (Condition)
    res = async(ThenBlock);
else
    res = async(ElseBlock);

Listing 3.16: A futurized version of an if-statement. Instead of invoking the ThenBlock or ElseBlock sequentially, they are executed asynchronously.

In contrast to Listing 3.16, the condition itself might be represented by an asynchronous result. Listing 3.17 directly waits on the future to execute the respective blocks based on the result of the future. However, to completely futurize it, we use the technique demonstrated in Listing 3.18.

if (ConditionFuture.get())
    ThenBlock
else
    ElseBlock

Listing 3.17: In this example, the Condition is represented by a future. Listing 3.18 will show the complete futurized version by attaching a continuation.

The decision on which branch to take is now deferred to the point when the asynchronous operation's result becomes available, by attaching a continuation to the relevant future. It is important to note that the variations presented in Listing 3.16 and Listing 3.18 can be combined to represent a fully futurized version of an if-statement, and the resulting future can be returned from the continuation.

auto future = ConditionFuture.then(
    [](auto ConditionReadyFuture) {
        if (ConditionReadyFuture.get())
            ThenBlock
        else
            ElseBlock
    });

Listing 3.18: A futurized version of Listing 3.17. Instead of waiting explicitly on ConditionFuture, a continuation is attached that is executed whenever it becomes ready and the regular if-statement is then executed.

Loops As the last necessary control structure, loops can be futurized as well. Listing 3.19 shows a canonical form of a while-loop. The futurization of such a loop is performed by combining the transformation rules as described for the if-statement and recognizing that such loops can be expressed recursively. As such, we dynamically construct the asynchronous control flow graph by recursively calling the DoBlock.

while (Condition)
{
    DoBlock
}

Listing 3.19: Code snippet showing the basic syntax of a while-loop. DoBlock is exe- cuted iteratively as long as Condition is true.

The futurized while-loop is shown in Listing 3.20 where the transformation towards the asynchronous recursion can be observed.

A for-loop is transformed in the very same way as the while-loop, noting that every for-loop is trivially convertible to a while-loop. After those basic transformation rules have been applied, the resulting futurized loop (see Listing 3.21) is straightforward. As with the other, inherently dependent control structures, the gain of this transformation is the ability to return a future from a function containing the loop, allowing the caller to either explicitly wait or perform other work in parallel.

Futurization and Parallel Algorithms One critical aspect of futurization lies in the ability to compose asynchronously, and possibly concurrently, executed threads of control into an implicit DAG that represents the data flow of the application in question. It is important to note that futurization is most effectively applied when used in the whole program. As such, it is of utter importance to have higher-level abstractions return futures as well. The high-level abstractions presented in Section 3.2.2 are the parallel algorithms. In order to embed the parallel algorithms into the context of futurization, it is essential to allow them to return futures as well. As an extension to the already existing execution policies, an additional task policy is under consideration to allow for effective futurization of high-level abstractions.
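A sketch of such a futurized parallel algorithm, in the style later implemented by the HPX runtime system (the names hpx::for_each and hpx::execution::par(hpx::execution::task) are assumptions that differ between HPX versions):

std::vector<double> v(1000000, 1.0);

// par(task) asks the algorithm to return immediately with a future
// instead of blocking until all iterations have completed.
hpx::future<void> done =
    hpx::for_each(hpx::execution::par(hpx::execution::task),
        v.begin(), v.end(), [](double& x) { x *= 2.0; });

// The returned future composes with the surrounding futurized control flow.
auto next = done.then([](hpx::future<void>) { /* dependent work */ });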

std::function<void(future<bool>)> WhileContinuation;
WhileContinuation = [&](future<bool> ConditionFuture)
{
    if (ConditionFuture.get())
    {
        DoBlock
        WhileContinuation(async(Condition));
    }
};

future<void> while_loop = async(Condition).then(WhileContinuation);

Listing 3.20: The building blocks for futurizing the while-loop as shown in Listing 3.19. The execution of DoBlock is deferred and attached as a continuation, returning the next evaluation of Condition asynchronously. This in turn can then be used to execute the next iteration.

3.3.4. Coroutines and Parallelism

Futurization is a promising concept when it comes to expressing asynchronous data dependencies. However, explicit CPS will become cumbersome and convoluted. One solution to that problem is the mechanism proposed in the technical specification for C++ extensions for coroutines [91].

Listing 3.22 shows the great simplification that results from the definitions the Coroutines TS provides, which specifies how to implement coroutines in standard C++. As, for example, in Python, the specification introduces three new keywords:

• co_yield: Yields a value from a function, returning control to the caller. The caller might return control back to the callee after consuming the yielded value.

• co_return: The function returns control back to the caller. The execution of the function has been completed; control cannot be returned to it again.

// for-loop:
// for (Init; Condition; LoopStatement)
// {
//     DoBlock;
// }
std::function<future<void>(future<bool>)> ForContinuation;
ForContinuation = [&](future<bool> ConditionFuture)
{
    if (ConditionFuture.get())
    {
        DoBlock
        LoopStatement;
        return async(Condition).then(ForContinuation);
    }
    return make_ready_future();
};

Init;
future<void> for_loop = async(Condition).then(ForContinuation);

Listing 3.21: The building blocks for futurizing the for-loop. A canonical transformation of the for-loop to a while-loop (see Listing 3.19) is performed and the continuations are applied accordingly (see Listing 3.20).

future<int> fib(int n)
{
    if (n < 2) co_return n;
    future<int> left = async(fib, n-1);
    future<int> right = async(fib, n-2);
    co_return co_await left + co_await right;
}

Listing 3.22: The futurized version of the Fibonacci sequence implemented with co_await and co_return, showing the simplification due to the implicit continuation provided by the compiler using the semantics as defined in the Coroutines TS.

• co_await: The first two keywords allow the specification of the coroutine behavior from within the coroutine's perspective. However, calling a coroutine might now be an asynchronous operation: the result might or might not be available immediately and might be returned as a future value. co_await allows for the consumption of such an asynchronous result by suspending the coroutine until it becomes available.

It is important to note that the mechanism to transport a value out of a coroutine is expressed in terms of a future type. The proposed solution is a cooperation between compiler-based transformations and library-defined constructs.

That is, co_yield and co_return are defined in terms of co_await. For a co_await expression to be valid, the right-hand side expression has to either support an operator co_await expression or directly provide the member functions await_ready, await_suspend and await_resume.
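A minimal sketch of a type that is usable on the right-hand side of co_await through these three member functions (shown with the C++20 header <coroutine>; the TS placed the equivalent facilities under std::experimental):

#include <coroutine>

// A trivially ready awaitable wrapping a value; co_await ready_value{42}
// inside a coroutine yields 42 without ever suspending.
struct ready_value
{
    int value;

    bool await_ready() const noexcept { return true; }     // never suspend
    void await_suspend(std::coroutine_handle<>) const noexcept {}
    int await_resume() const noexcept { return value; }    // result of co_await
};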

By having the already ratified APIs of the C++ standard and the additional concepts discussed in this section at our disposal, the goal of the remaining chapters is to discuss the implementation of a library that not only supports the current C++ standard, but also allows for efficient Futurization, while giving explicit control over all available computational resources by providing a programming environment that is unified in terms of syntax and semantics, using executors for shared memory, accelerators and distributed computing targeting HPC.

4. The HPX Parallel Runtime System

The C++ Language and its standard library (see Section 3) form the groundwork for modern C++ APIs. As such, the HPX Parallel Runtime System is an implementation of the described concepts with additional extensions for distributed and heterogeneous computing. The runtime system is built on lightweight user-level tasks and embraces asynchronous programming to interface with the various runtime system services as well as to provide higher-level APIs and design patterns for parallel programming. The governing principles behind HPX have been laid out by the ParalleX execution model [27], which has been the theoretical foundation of HPX and essentially identifies the following factors:

• Starvation

• Latency

• Overheads

• Waiting for Contention

Combined, these factors are postulated to be the inherent properties of parallel programs that limit scalability. The ParalleX model, therefore, embraces the usage of lightweight threads and fine-grained synchronization using data flow techniques, and combines this with a dynamic, active global address space (AGAS).

With the execution model and a programming standard serving as the primary formal specification, the following sections outline the fundamental principles on which the runtime system is built and the components that provide the core mechanisms. The execution model drove the internal runtime system implementation, and the C++ Standard acts as the user-facing frontend.


Figure 4.1.: Overview of the HPX Runtime System components: the thread manager, the performance counters, the AGAS service, and the parcel handler resolving remote actions.

Figure 4.1 shows a schematic of the different components interacting with each other. The remainder of this section will discuss the basic concept behind each component and its implementation details. At the core is the thread manager, which provides the underlying mechanisms for local thread management (see Section 4.1). The second component is AGAS, which will be discussed in Section 4.2. The support for asynchronous programming in a distributed programming environment is implemented in the parcel handler component (see Section 4.3), providing the one-sided active messaging layer. This messaging layer is the foundation for an efficient implementation of AGAS, which provides an abstraction over globally addressable C++ objects that support RPC. The implications of the combination of these concepts will be highlighted in Section 4.4, which presents the fully unified API for local and remote operations that forms a natural extension to the C++ programming language. To be able to support runtime adaptivity, HPX exposes certain performance counters that can be queried globally and are described in Section 4.5.

4.1. Local Thread Management

As the starting point for the discussion of a parallel runtime system, the shared memory management of threads needs to be discussed. The thread management is implemented based on the assumption that the thread and synchronization primitives provided by the Operating System (OS) come with an unwanted overhead and that a high-performance runtime system needs to be in charge of how the scheduling of its tasks works. As such, HPX implements lightweight user-level tasks that operate in an N to M mapping, where N is the number of tasks and M is the number of cores to run the tasks on. The user-level tasks need to be suspendable to support a useful asynchronous API (see Section 4.4). One implication of this property is that tasks can be migrated from one OS thread to another, which allows load imbalances to be addressed easily. The overheads, and therefore the performance, of a parallel application are furthermore influenced by the overall task scheduling policy.

The essence of lightweight user-level tasks is that each task has its own stack. As such, it operates as if it was its own thread of execution in the sense of Definition 1. The HPX-Threads offer an interface to yield, suspend and resume, providing the necessary capabilities to implement all higher-level APIs. An entirely conforming implementation of std::thread has been built on top of this interface.
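A minimal sketch of this interface (header paths are assumptions and differ between HPX versions): hpx::async and hpx::thread mirror their standard counterparts but schedule lightweight HPX-Threads instead of OS threads.

#include <hpx/hpx.hpp>
#include <hpx/hpx_main.hpp>

#include <iostream>

int main()
{
    // Schedules a lightweight, suspendable HPX-Thread on the thread manager.
    hpx::future<int> f = hpx::async([] { return 42; });

    // Same interface as std::thread, but backed by a user-level task.
    hpx::thread t([] { /* runs as an HPX-Thread */ });
    t.join();

    std::cout << f.get() << '\n';
    return 0;
}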

Figure 4.2.: A HPX-Thread context containing room for stack space, configurable for different sizes, as well as a set of registers (the address of the function to execute, a trampoline function and its parameter, the return address, and the callee-saved registers %rbp, %rbx, %rsi, %rdi and %r12 to %r15) to save the state of the currently executing task if it needs to be suspended. This figure shows an example for the x86 64-bit architecture.

To accomplish user-level context switching, HPX relies on handwritten assembly instructions to switch between tasks. As such, each task needs to have a previously allocated memory region for stack usage, a memory region to save registers (see Figure 4.2) and an architecture and operating system specific routine to perform the switch. The effect of this design decision is that the scheduling itself is non-preemptive. As a consequence, running tasks have to cooperate with each other and either suspend now and then, or the set of tasks to be executed has to be fine-grained enough, that is, their runtime has to be relatively short, to allow for maximal throughput and fair scheduling.

Other parallel runtime systems usually choose not to allow their tasks to suspend in a lightweight fashion. The advantage is that stacks and register contexts do not have to be maintained, and invoking a task consists merely of a regular function call. However, the disadvantage is that as soon as synchronization among different tasks is required, the core executing the tasks needs to rely on coarse-grained, operating system based concurrency primitives. The most significant consequence is that this specific core is not able to continue making progress with other tasks that might be pending. This lack of a forward progress guarantee might lead to specific deadlock situations. Furthermore, the ability to suspend and resume tasks in a lightweight fashion is the basis to implement fine-grained, constraint-based synchronization.

Finding the optimal task scheduling policy for a given application or algorithm is often a non-trivial process. For that reason, the predefined policies are interchangeable at runtime.

global: The global scheduling policy maintains one global queue for the tasks to schedule. This policy represents a naïve approach to implement a thread stealing, multi-core scheduling queue.

static: The static scheduling policy maintains one queue per core and disables task stealing altogether, making it suitable for applications with heterogeneous workloads that benefit from a fixed task-to-core assignment.

thread local: The thread local policy is the default policy, as it works decently for all classes of applications. This policy maintains one queue per core. Whenever a core runs out of work, tasks are stolen from neighboring queues.

All policies can be configured to operate in either a First In, First Out (FIFO) or a Last In, First Out (LIFO) mode to determine the order in which ready tasks get executed. In addition, new scheduling policies can be provided as well.

4.2. Active Global Address Space

After having the lightweight user-level tasks available as described in the previous section, the next necessary module within HPX is AGAS. Its primary purpose is to serve as the foundation to support distributed computing; it is related to other global address space approaches, represented by the Partitioned Global Address Space (PGAS) family, whose purpose is to hide explicit message passing.

In PGAS, the global address space is indeed exposed as if it were partitioned into various segments of memory, where each segment resides in a specific, local address space and requires a software layer to translate accesses from a global view of the data to a local representation of dereferenceable addresses. This property is directly exposed to the user, who, in general, works on symmetric memory segments [109].

AGAS is different in that it exposes globally addressable data. That is, instead of directly exposing memory segments, it exposes addressable objects with a GID and therefore is a representative of global object spaces [1, 54] (see Section 4.2.2). This naturally aligns with the desire to have an asynchronous RPC mechanism that composes nicely with globally addressable objects (see Section 4.3.1).

Based on this design decision, we can derive three properties that prove to be valuable for correctness and ease of programming:

• Global, automatic, garbage collection

• Transparent location of objects

• Syntactic equivalence of local and remote operations

Since each object living in AGAS is represented by a GID, the location of an object does not need to be bound to a particular locality, which leads to uniformity and independence of whether an object is located remotely or locally. In addition, this independence allows for transparent migration between different process-local address spaces, with AGAS being responsible for proper address resolution (see Section 4.2.4). Of course, when dealing with GIDs that refer to an object, with those GIDs not being bound to a particular location, it is important to define proper lifetime semantics. Section 4.2.3 discusses the global reference counting scheme implemented within HPX to improve correctness and avoid lifetime issues. Last but not least, it is also possible to register a symbolic name for a particular GID (see Section 4.2.2), avoiding the need to publish GIDs via global collective operations.

The AGAS layer itself consists of the following four subcomponents.

The Locality Namespace is responsible for resolving a GID representing a given locality to the endpoints by which the locality can be reached (see Section 4.2.1). The Primary Namespace is responsible for resolving all other GIDs (representing an object in the global address space) to local addresses; this will be discussed in Section 4.2.2. This part will also cover the Component Namespace that is responsible for managing which component, or object, can be created at which particular locality.

Other than registering or resolving a name to a GID, the user does not directly interface with AGAS. Upon an attempt to invoke an RPC on a locality or component, the runtime queries AGAS automatically and potentially hands control over to the parcel layer to manage the communication (see Section 4.3). The GIDs are managed and guaranteed to be unique by AGAS, which maintains a globally coherent counter.

4.2.1. Processes in AGAS – Localities

When running a distributed HPX application, the different processes that are part of the application are called localities. Figure 4.3 presents a brief overview of a distributed HPX application where each locality represents one process that communicates with the help of the parcelport to span the global address space. By that, each locality can be addressed with a GID to invoke free-standing (possibly remote) function calls. By having GIDs assigned to those processes, AGAS can dynamically manage connecting and disconnecting localities, allowing the footprint of a distributed application to be reduced or increased.

Like all other globally addressable objects, localities are represented as an instance of hpx::id_type. Listing 4.1 presents the basic set of functions to query all existing localities currently connected. This API is related to the MPI functionality of querying the current rank as well as the overall number of ranks within the current communicator. Due to the dynamic nature of HPX, the choice of using a GID representation leads to far more flexibility to resize the application's footprint, that is, to add or remove processes.

4.2.2. C++ Objects in AGAS – Components

Another cornerstone of the functionality of AGAS is the ability to put any C++ object inside the global address space, making it remotely addressable. Figure 4.4 gives a brief, high-level schematic of a potential HPX application with multiple localities having different objects allocated in the global address space and multiple GIDs pointing to them. It is important to note that a single object can be referenced from multiple localities. That is, we can have multiple instances of an hpx::id_type containing the same GID, and each of those instances holds the object alive. Upon migration, these references remain valid. The GIDs can point to a component that is local or remote.

Figure 4.3.: Schematic of a distributed HPX application representing the different localities with the various HPX modules (per-locality memory spanned into a global address space, parcelport, AGAS, CPUs and GPUs). Each separate process is called a locality.

namespace hpx {
    // An opaque representation of a GID
    class id_type;

    // Returns the GID of the current locality
    hpx::id_type find_here();

    // Returns a vector containing all localities connected at
    // the time of the call
    std::vector<hpx::id_type> find_all_localities();

    // Returns a vector containing all remote localities
    // connected at the time of the call
    std::vector<hpx::id_type> find_all_remote_localities();
}

Listing 4.1: The API responsible for querying the different connected localities. With hpx::id_type being the fundamental type to represent any globally addressable object, the remaining functions serve the purpose of querying the current locality's GID, all connected, or all remotely connected localities.
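As an illustration, a minimal usage sketch of this API follows. The printed output and the exact header names are assumptions for this example only; the queried functions are those of Listing 4.1.

    #include <iostream>
    #include <vector>
    // HPX headers omitted for brevity; names may vary between HPX versions.

    void print_localities()
    {
        hpx::id_type here = hpx::find_here();
        std::vector<hpx::id_type> localities = hpx::find_all_localities();

        // Every connected process, including the current one, is addressable
        // through its GID.
        std::cout << "running on " << localities.size() << " localities\n";
        for (hpx::id_type const& loc : localities)
        {
            if (loc == here)
                std::cout << "  (this one is the current locality)\n";
        }
    }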

Creating Objects in AGAS

Due to restrictions of the C++ programming language, such as not being able to introspect arbitrary types dynamically, some manual registration boilerplate is needed to register objects. This registration enables possible remote interactions, including the creation of an object as well as remotely calling member functions on this object.

Figure 4.4.: Schematic of a distributed HPX application with multiple localities and various components allocated in the global address space, with different GIDs (gray) pointing to those objects (blue) in the global address space.

Listing 4.2 presents the outline of the C++ classes and macros that need to be used. In HPX, each object that is supposed to be allocated inside of AGAS needs to derive from hpx::components::component_base, and a factory function for remote creation based on a unique name needs to be instantiated. The base class does not impose any runtime overheads; its sole purpose is to provide management facilities to support the runtime. For this boilerplate to disappear, significant changes to the C++ Standard would have to be made to support generic reflection of types.

namespace hpx {
    template <typename Component>
    class component_base;

    template <typename Component>
    class component;
}

HPX_REGISTER_COMPONENT_FACTORY(component<...>);

Listing 4.2: Necessary wrappers for enabling C++ classes to be components. All components must derive from hpx::component_base and then register a factory using the hpx::component template.

The allocation functions in Listing 4.3 can be used either to create a single component on a given target locality or to bulk-allocate count components on the specified locality. Those functions give us a basic set of functionality that we can use to further specify various other things like distribution policies for distributed data structures.

template <typename Component, typename... Ts>
hpx::future<hpx::id_type> new_(hpx::id_type where, Ts&&... ts);

template <typename Component, typename... Ts>
hpx::future<std::vector<hpx::id_type>> new_(hpx::id_type where,
    std::size_t count, Ts&&... ts);

Listing 4.3: To allocate objects in AGAS, an API resembling the syntax of new has been created. It takes the component to be created as a template argument, and the location as well as constructor arguments. The second overload supports bulk creation of objects.
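For illustration, a minimal usage sketch of this allocation API. The component type hello_component and its constructor argument are hypothetical; the component is assumed to be registered as sketched in Listing 4.2.

    // Create one component on a (possibly remote) locality and obtain its GID.
    hpx::id_type target = hpx::find_all_localities().back();
    hpx::future<hpx::id_type> f = hpx::new_<hello_component>(target, 42);
    hpx::id_type component_gid = f.get();

    // Bulk-create several components on the same locality in one call.
    hpx::future<std::vector<hpx::id_type>> many =
        hpx::new_<hello_component>(target, 8, 42);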

Symbolic Namespace

As previously discussed, a GID can be bound to an arbitrary, user-defined string literal, and said names can be resolved again. This feature is one cornerstone for building protocols for distributed data structures and other related algorithms like scatter and gather functionalities or distributed partitioned arrays.

The exposed API allows for registration/resolution of single GIDs as well as registering/resolving multiple objects sharing the same base name, using different indices to identify a specific object. This functionality is useful to build up loosely or tightly coupled MPI-communicator-like structures without the need for global communication or synchronization.

namespace hpx { namespace agas {
    // Register a GID with a name
    hpx::future<bool> register_name(std::string const& name,
        naming::id_type const& id);

    // Resolve a GID based on a name
    hpx::future<hpx::id_type> resolve_name(std::string const& name);

    // Register a GID using a base name and sequence number
    hpx::future<bool> register_with_basename(std::string const& base_name,
        hpx::id_type const& id, std::size_t sequence_nr);

    // Resolve a GID using a base name and sequence number
    hpx::future<hpx::id_type> find_from_basename(std::string const& base_name,
        std::size_t sequence_nr);

    // Resolve all GIDs using a base name
    hpx::future<std::vector<hpx::id_type>> find_from_basename(
        std::string const& base_name, std::size_t count);
}}

Listing 4.4: Functions related to registering, resolving and unregistering arbitrary names with an hpx::id_type. The functions return futures that either mark the completion of the registration or return the resolved GID.
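As a usage illustration, a minimal sketch of how these functions can build an MPI-communicator-like ring without global communication. The base name string is arbitrary and chosen for this example; hpx::get_locality_id() is used to obtain a rank-like index.

    // Each participating locality registers its own GID under a common base
    // name, using its locality number as the sequence number.
    std::size_t rank = hpx::get_locality_id();
    hpx::agas::register_with_basename("/my_app/ring", hpx::find_here(), rank).get();

    // Later, any locality can look up its right neighbor directly.
    std::size_t num = hpx::find_all_localities().size();
    hpx::id_type right_neighbor =
        hpx::agas::find_from_basename("/my_app/ring", (rank + 1) % num).get();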

4.2.3. Global Reference Counting

One important aspect when dealing with C++ applications is the proper management of the lifetime of resources. In modern C++, lifetime management is typically handled by objects with value semantics: an object is the unique owner of its resources, and the lifetime is controlled by the special member functions such as constructors, destructors and assignment operators. For dynamically allocated objects, the tools of choice for C++ programmers are smart pointers. Smart pointers come in different flavors; for example, std::unique_ptr represents unique ownership semantics, that is, the contained pointer is released when the corresponding object goes out of scope, while copying is prohibited and moving such an object transfers ownership. For shared ownership semantics, C++ offers std::shared_ptr, which employs a reference counting scheme such that the last pointer going out of scope releases the allocated resource.

Objects in AGAS are dynamically allocated, and a GID bears a certain resemblance to a void pointer. The resolution of a virtual address, as exposed to the programmer by most modern operating systems, is handled with a combination of operating system and hardware support. HPX, however, resolves a GID to a locality-local virtual address solely in software. Figure 4.5 sketches the general layout used to fulfill this purpose. The algorithm for resolving a GID to a local virtual address is discussed in Section 4.2.4.

Figure 4.5.: A GID is a 128-bit number (field layout from MSB to LSB: prefix, rc, identifier, with bit boundaries at 0, 32, 92 and 127). The different bits are used for various purposes such as the initial locality where it was allocated, the credit and a locally unique ID to resolve the GID to the local virtual address. The prefix determines the target locality, rc stands for reference count and contains the credit counter; the remainder acts as the identifier for lookup of the local virtual address.

The need for sophisticated lifetime management arises from the fact that remote operations might still be in flight while references to the component have gone out of scope, or a scope depends on a remote operation to be completed. The effect of this is the need for a system supporting shared ownership. To allow multiple references, that is, shared ownership distributed among various localities, the scheme implemented in HPX is global reference counting. This method of garbage collection not only follows the direction of the C++ Standard but is in line with the state of the art in distributed garbage collection [82, 65]. The decision to use reference counting as the means for garbage collection was taken to provide deterministic object lifetimes. The disadvantage of only supporting acyclic references was considered less severe than the global synchronization necessary to perform mark-and-sweep style algorithms.

Global reference counting in HPX is based on global credit-based reference counting [66] in combination with local reference counting. The lifecycle of a component starts when it is constructed inside the primary namespace of AGAS. Once constructed, the credit of the GID is filled, and the global reference count inside the AGAS management structures is set to 2^32, since the credit represents the log2 of its contribution to the global reference count. This GID, filled with credits, is then passed to the caller who requested the allocation. The corresponding hpx::id_type is locally reference counted and has shared ownership semantics. Once all references to this GID go out of scope, the credit stored inside the GID is returned to AGAS, and a decrement of the global reference count is performed. Once the global reference count reaches 0, AGAS can destroy the object.

Once an hpx::id_type is passed as an argument or parameter to an RPC that needs to be sent to a remote locality, the contained credit is split. This happens in the serialization process as described in Section 4.3.2. The actual split operation divides the credit by two. If the result of this splitting would lead to a credit count of 0, we need to increment the global reference count to assign new credits. During serialization we logically create two separate references to the same object; as such, the global count needs to be increased by (2^32 - 1) * 2 = 2^33 - 2 to fill the two created GIDs. Only after this reference increment has completed is the split operation completed.
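The following is a simplified sketch of the credit-splitting logic just described. The struct and field names are illustrative assumptions and do not reflect the actual HPX implementation, which performs the refill asynchronously through AGAS.

    #include <cstdint>

    struct gid
    {
        std::uint16_t log2_credit;   // credit stored as the log2 of its contribution
        // prefix and identifier bits omitted
    };

    // Returns the GID to be sent with the serialized parcel, adjusting the
    // credit of the locally retained GID in place.
    gid split_credit(gid& local)
    {
        gid sent = local;
        if (local.log2_credit == 0)
        {
            // Credit exhausted: AGAS is asked to add (2^32 - 1) * 2 = 2^33 - 2
            // global references, after which both GIDs are refilled.
            local.log2_credit = sent.log2_credit = 32;
        }
        else
        {
            // Halving the credit corresponds to decrementing its log2.
            --local.log2_credit;
            sent.log2_credit = local.log2_credit;
        }
        return sent;
    }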

This credit-based technique, combined with local reference counting, minimizes the network round trips needed to ensure the proper lifetime of all objects in AGAS. That is, an actual reference increment occurs only after the credit has been exhausted, and the local reference counting avoids network round trips altogether. The only round trip needed is when an instance of hpx::id_type drops to a local reference count of 0, where the remaining global credit has to be given back to AGAS. To mitigate the cost of these potential remote operations, the requests to decrement the credit are cached and sent over the wire in batch mode, either periodically or when a given threshold is reached.

Resolving GIDs with the help of a symbolic name (see Section 4.2.2) also contributes to the global reference count. When resolving a name, the returned GID’s credit is filled up in the same way as if the object had been newly allocated.

4.2.4. Resolving GIDs to Local Addresses

An integral part of the operation of AGAS is the resolution of GIDs to a pair of locality and local virtual address. This address resolution enables transparent handling of objects in the global address space by eliminating the need for users to know where the actual objects are located. With GIDs being unique in the system, transparent migration is not only possible but feasible.

The basic translation process from a GID to a local memory address is described in Figure 4.6. The process is started whenever a function on a global object is invoked. At first, we determine whether the GID is managed on the same locality as the call. If it is, the address is immediately resolved, and the action can be scheduled locally. The prefix encoded in the GID (see Figure 4.5) denotes the service locality. That is, in the case of an object that has been migrated to a different locality, the object does not necessarily reside there anymore, and the local resolution might reveal that the component is located on a different locality. In that case, a parcel (see Section 4.3) will be sent to the destination. The same happens if the service locality is remote. The routing process defers the execution of the action to the locality where the component is determined to reside. At the same time, AGAS needs to be queried to resolve the prefix to the actual endpoint address of the destination in order to send the encoded parcel over the wire.

To mitigate the costs of asking AGAS to resolve GIDs, a cache using a Least Recently Used (LRU) replacement strategy is employed. The only time a cache entry needs to be explicitly invalidated is when a component is migrated.

Figure 4.6.: Interaction between AGAS and the Parcelhandler. On the sending side, the parcel is encoded, AGAS is asked to resolve the destination, and the message is passed to the network. On the receiving end, the parcel is decoded, AGAS is queried to resolve the destination GID to the actual virtual address, and the resulting action is scheduled as a task on the thread manager.

The GID resolving algorithm is invariant to the case where a component is not found on a previously resolved locality, for example in the presence of an outdated cache entry; the request is then routed directly to the service instance. This technique avoids global barriers when migrating single components at the possible cost of a false-positive AGAS cache hit, providing the foundation for a resilient global address space.
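To illustrate the resolution cache, the following is a minimal sketch of an LRU cache mapping GIDs to resolved addresses. Key and value types are simplified assumptions; the real AGAS cache is more involved.

    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <utility>

    using gid_key = std::uint64_t;                       // simplified GID key
    struct resolved_address { std::uint32_t locality; void* lva; };

    class lru_resolution_cache
    {
        using entry = std::pair<gid_key, resolved_address>;
        std::size_t capacity_;
        std::list<entry> entries_;                        // most recently used at front
        std::unordered_map<gid_key, std::list<entry>::iterator> index_;

    public:
        explicit lru_resolution_cache(std::size_t capacity) : capacity_(capacity) {}

        // Returns false on a miss; the caller then falls back to AGAS.
        bool lookup(gid_key gid, resolved_address& out)
        {
            auto it = index_.find(gid);
            if (it == index_.end())
                return false;
            entries_.splice(entries_.begin(), entries_, it->second);
            out = entries_.front().second;
            return true;
        }

        void insert(gid_key gid, resolved_address addr)
        {
            auto it = index_.find(gid);
            if (it != index_.end())
            {
                it->second->second = addr;
                entries_.splice(entries_.begin(), entries_, it->second);
                return;
            }
            entries_.emplace_front(gid, addr);
            index_[gid] = entries_.begin();
            if (entries_.size() > capacity_)              // evict least recently used
            {
                index_.erase(entries_.back().first);
                entries_.pop_back();
            }
        }

        // Explicit invalidation, only needed when a component is migrated.
        void invalidate(gid_key gid)
        {
            auto it = index_.find(gid);
            if (it != index_.end())
            {
                entries_.erase(it->second);
                index_.erase(it);
            }
        }
    };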

Even though a local cache exists for optimization purposes, AGAS itself does not provide any coherency guarantees, which avoids the need for complex protocols. Instead, the focus is on fine-grained synchronization among different objects, in the same way as in the C++ memory model. There is no ordering among different invocations on global objects; coherency has to be ensured with local synchronization primitives by the designer of the global object.

4.3. Active Messaging

Since HPX as a runtime system also supports distributed computing, mechanisms for exchanging messages over a network interconnect are inevitably required. Predominantly, the preferred mechanism in HPC is message passing. Message passing in itself can be categorized as follows:

• Two-sided Communication

• One-Sided Communication

The current HPC landscape is dominated by two programming paradigms. The first is represented by MPI. MPI offers a wide variety of different communication primitives for passing messages among processes. It is best known for its two-sided communication primitives, where the sending side issues a send operation that contains the buffers, the size of the buffers, a tag and the destination. The destination has to post the receiving part of this operation proactively. Those operations are extended by asynchronous versions. Global communications follow the same principle by requiring all participating processes to issue the particular MPI call. With MPI-3, the newest incarnation of the message passing standard, MPI also received support for one-sided communication. One-sided communication is characterized by requiring only one active process to send or receive a message. This usage mode has been influenced by Remote Direct Memory Access (RDMA) facilities in modern HPC network fabrics (e.g. InfiniBand). The benefit here is that each process can retrieve or send data independently. Nevertheless, to ensure consistency among the communicating partners, other means of synchronization have to be implemented, such as epochs. The other prominent representative lies within the family of PGAS programming models that extend programming languages (for example UPC [15] and Fortran [72]). PGAS implements message passing through memory distributed over all processes (sometimes also known as symmetric memory). Message passing in those models uses one-sided communication with explicit invocations of global barriers to ensure synchronization.

In the HPX programming model, none of the previously discussed forms of communication seems appropriate. Given the versatile facility of invoking lightweight threads locally, it seems natural to extend it in such a way as to allow for remote invocations of threads as well. This form of communication is often referred to as RPC or Active Messages, where an Active Message usually relates to a message that does not only contain data but also has a specific action attached to it, which is executed once the data has been received on the remote end. An Active Message in HPX is called a Parcel. The parcel is the underlying concept used to implement RPC calls to remote localities in a C++ fashion. That means that we require an arbitrary, yet type-safe, number of arguments, as well as the possibility to return values from remote procedure invocations. This operation is one-sided, as the RPC destination does not need to actively poll on incoming messages, following the semantics of a regular C++ member function call.

4.3.1. Parcels

The essence of a parcel is to define an abstraction that encapsulates all information necessary to create and run a task remotely. A task is a C++ function (the action), which can be a free-standing function or a member function, with a set of user-defined arguments (the payload) and a single return value. The task needs a destination, e.g. a different locality, where it is executed, as well as a method of returning the result to the caller which, in the context of parcels, is referred to as a continuation.

Actions

The obvious and naïve way to identify a function is to take its address. As simple as this solution sounds, it comes with a variety of problems. The most significant issue is that a function pointer is only valid for the currently running process. The static information available during the compilation process is usually an offset into the produced binary, and the actual address is computed by the dynamic linker when the program is started. By solely relying on the function's address, runtime mechanisms would be required to guarantee type safety. That is, tasks would be required to take a signature in the form of void (void* arg0, ..., void* argN).

To remedy this situation, HPX introduces a mechanism to convert a regular C++ function pointer into a unique and portable class that is used to invoke remote procedure calls. The function pointer type, along with the statically known function address, is passed to the class template basic_action (see Listing 4.5), which encapsulates the necessary functionality such as the tuple type to capture the arguments, the result type and the target type (which can be a component or a locality). By having this type, we obtain a regular callable which can be used in the same way as a regular C++ function with objects registered with AGAS or with other localities.

When the target is local, we are able to directly call the function encoded in the action by using the local AGAS address. This is the first step towards the syntactic and semantic equivalence of remote procedure calls and regular C++ function invocations. However, this does not yet solve the issue of transferring arbitrary functions to remote localities. This issue is handled by using the ability to serialize polymorphic base types (see Section 4.3.2). As such, the class transfer_action is derived from the polymorphic base class transfer_action_base (see Listing 4.6). This class hierarchy allows for type-erasure of arbitrary C++ functions with, possibly, user-defined, arbitrary classes in a type-safe manner. As such, a parcel can be loaded from binary data and scheduled without further user interaction. To recreate the concrete transfer_action, we identify actions by a unique name, which defaults to the compiler-dependent Runtime Type Information (RTTI) name (see Listing 4.7).

template <typename F, F f>
struct basic_action;

// Basic action for free functions
template <typename R, typename... Ts, R (*F)(Ts...)>
struct basic_action<R (*)(Ts...), F>
{
    using result_type = R;
    using target_type = void;
    using arguments_type = tuple<Ts...>;
};

// Basic action for member functions
template <typename R, typename Class, typename... Ts, R (Class::*F)(Ts...)>
struct basic_action<R (Class::*)(Ts...), F>
{
    using result_type = R;
    using target_type = Class;
    using arguments_type = tuple<Ts...>;
};

// Shortcut for action creation
#define HPX_ACTION(Function) \
    basic_action<decltype(&Function), &Function>

template <typename Action>
void schedule_action(typename Action::target_type* target_address,
    typename Action::arguments_type arguments)
{
    // register function invocation with thread manager
}

// The C++ function to turn into an action
void foo(int i, double d);
using foo_action = HPX_ACTION(foo);

Listing 4.5: Turning a regular C++ function into a globally unique C++ type that can be used to call any C++ function remotely in a type-safe manner.

struct transfer_action_base
{
    virtual ~transfer_action_base() {}

    virtual void schedule_action(hpx::id_type continuation,
        void* target_address) = 0;
};

template <typename Action>
struct transfer_action : transfer_action_base
{
    typedef typename Action::target_type target_type;

    void schedule_action(hpx::id_type continuation,
        void* target_address) override
    {
        if (continuation)
        {
            schedule_continuation_action(Action{}, continuation,
                reinterpret_cast<target_type*>(target_address),
                std::move(arguments_));
        }
        else
        {
            schedule_action(Action{},
                reinterpret_cast<target_type*>(target_address),
                std::move(arguments_));
        }
    }

    typename Action::arguments_type arguments_;
};

Listing 4.6: Mechanisms to polymorphically transfer actions. The continuations are used for transporting the result of the action.

This mechanism works in settings where all HPX processes run a binary produced by the same compiler. However, as soon as multiple compilers are involved, a portable solution must be provided, which requires the user to specify the name manually.

template <typename Action>
struct action_name
{
    static const char* call()
    {
        return typeid(Action).name();
    }
};

template <>
struct action_name<foo_action>
{
    static const char* call()
    {
        return "foo_action";
    }
};

Listing 4.7: C++ Trait class to identify an action with a unique name

Continuations

The previous discussion about actions did not cover the transport of a return value. For that purpose, we use an object within AGAS which allows setting the result of an action remotely, that is, the return value or an exception. The continuation object is type-erased behind an hpx::id_type. The action to be executed, however, can correctly resolve the actual type needed to set the corresponding return value, since it knows its return type.

Furthermore, continuations, as a generic concept, are used to formulate chains of actions or regular functions to be executed before returning the result to the initiating caller, thereby offering the generic abstraction that the ParalleX execution model describes as Lightweight Control Objects (LCOs) [52].

template <typename Result>
struct continuation
{
    // Upon invocation, the result is either set to a corresponding
    // shared state or used to trigger a following function or
    // action invocation.
    void set_result(Result&& result);

    // Declare the action for remote invocation
    using set_result_action = HPX_ACTION(continuation::set_result);

    // Upon invocation, the shared state is put into an exceptional
    // state storing the received exception, and the exceptional result
    // is propagated appropriately.
    void set_exception(std::exception_ptr);

    // Declare the action for remote invocation
    using set_exception_action = HPX_ACTION(continuation::set_exception);
};

Listing 4.8: Interface for a continuation. This interface describes the basic concept for allowing remote completion notifications and continuation of an action invocation and does not reflect an actual implementation.

Parcel structure

By having defined the necessary data structures to transfer a function, the target and the function arguments in a type-safe manner to remote processes, the underlying parcel structure is defined as in Listing 4.9. By storing the target action as a unique pointer to the polymorphic base class transfer_action_base we can use the built-in C++ virtual dispatch mechanism to delegate the scheduling of the concrete transfer_action, which also stores the necessary arguments to invoke the action. Before dispatching to the derived transfer action, we ask the AGAS service to resolve the opaque target GID to the actual local virtual address necessary to invoke the action on the correct object instance or locality.

4.3.2. Serialization

After covering the parcel and action transfer mechanisms, the remaining open detail is the actual marshaling, or serialization, of the payload. The purest form of serializing data is to copy the content of the payload bit by bit. This method proves impractical for generic C++ types, which might be composed of more than just regular built-in types: it breaks down as soon as the class contains pointers or the type is not trivially copyable, that is, for every class deriving from a pure virtual base class, every class that defines one of the copy or assignment functions, and every class consisting of members with one of those properties.

struct parcel
{
    void schedule_action()
    {
        action_->schedule_action(continuation_, agas::resolve(target_));
    }

    hpx::id_type target_;
    hpx::id_type continuation_;
    std::unique_ptr<transfer_action_base> action_;
};

Listing 4.9: Parcel structure: the action's target, the continuation and the transfer action holding the actual action type as well as the arguments.

As such, a protocol defining the structure of a given type has to be provided for the network layer to send and receive arbitrary data structures over the wire transparently. The MPI Standard provides an API that describes the underlying data types resulting in a structural description; this description is either generated on the fly using MPI_Pack and MPI_Unpack or by creating a custom MPI_Datatype, giving programmers full flexibility to send and receive data. One downside is that it is not possible to scatter different pointers to memory regions without an additional copy [88]. Also, this approach is tailored towards C data structures such as arrays and is not well suited for more complex data structures like linked lists or associative maps.

Other approaches go in the direction of requiring a separate description of the data structures that is compiled into language-specific serialization and deserialization functions (for example protobuf [95] or the Interface Definition Language (IDL) [90]). One significant advantage of this approach is that it allows a programming-language-independent way of marshaling data, with the disadvantage of requiring a separate tool in the compilation process. Also, most of those tools require an intrusive adaptation of the user-defined types. The advantage, however, is that the definition of the type always stays consistent with the serialization code.

The approach we wanted to follow was first developed within the Boost.Serialization library (www.boost.org/doc/libs/release/libs/serialization/). Other libraries following this technique are cereal (https://github.com/USCiLab/cereal) and YaS (https://github.com/niXman/yas). The underlying idea is to give the programmer of a given class explicit control over, and syntax for, what to serialize. It is based on operator overloading of two special archive types that hold a buffer or stream to store the serialized data and are responsible for dispatching the serialization mechanism to the intrusive or non-intrusive version (see Listings 4.10 and 4.11). The serialization process in itself is recursive; each member that needs to be serialized has to be specified explicitly. The advantage of this approach is that the serialization code is written in C++ and can leverage all available programming techniques. The generic, user-facing interface allows effective application of the serialization process without obstructing the algorithms with special code for packing and unpacking, while also allowing for optimizations in the implementation of the archives.

// Intrusive serialization; A needs to be default constructible.
struct A
{
    int a;

    // Load data (de-serialize)
    void serialize(input_archive& ar)
    {
        ar & a;
        // Alternative syntax:
        // ar >> a;
    }

    // Save data (serialize)
    void serialize(output_archive& ar) const
    {
        ar & a;
        // Alternative syntax:
        // ar << a;
    }
};

Listing 4.10: Basic interface for intrusive serialization, which allows access to private members of a class.

The HPX serialization framework provides support for serializing all built-in types as well as all C++ Standard library collection and utility types. This list is extended by the HPX vocabulary types such as hpx::future and hpx::id_type with proper support for global reference counting as outlined in Section 4.2.3, which is the main motivation for having a separate serialization layer instead of relying on existing library solutions.

Archives

The serialization archives are the abstraction that implements the logic for serializing objects. The archives hold information about the endianness of the input and adequately select either the intrusive or non-intrusive version of an object to be serialized.

// Non-intrusive serialization; B needs to be default constructible.
struct B
{
    A a;
};

// Load data (de-serialize)
void serialize(input_archive& ar, B& b)
{
    ar & b.a;   // The serialization process is recursive.
    // Alternative syntax:
    // ar >> b.a;
}

// Save data (serialize)
void serialize(output_archive& ar, B const& b)
{
    ar & b.a;   // The serialization process is recursive.
    // Alternative syntax:
    // ar << b.a;
}

Listing 4.11: Basic interface for non-intrusive serialization, adapting existing classes.

The archives have built-in support for integral and floating point types. The archives can hold a type-erased container which is responsible for the bit-wise writing or reading of those integral types. The need for supporting arbitrary containers comes from the fact that different storage might be needed. For example, a container can be written whose purpose is to count the number of bytes needed, accounting for futures that are not ready yet or other asynchronous tasks necessary to perform the serialization. This is an integral part of the serialization facility: the serialization mechanism is invoked, but instead of writing the bytes directly, we only maintain the size of the objects to be saved. After that serialization pass has been completed, we have the exact size to allocate the necessary buffer for the actual serialization process. That way, we can avoid reallocation of memory entirely when dealing with dynamically sized objects.
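As an illustration of such a counting pass, the following is a minimal sketch of a size-counting container; the container interface shown here is a simplified assumption, not the actual HPX one.

    #include <cstddef>

    struct size_counting_container
    {
        std::size_t size = 0;

        // Called by the output archive for every chunk of bytes that would be
        // written; instead of storing data, only the total size is accumulated.
        void write(void const* /*data*/, std::size_t count)
        {
            size += count;
        }
    };

After a first serialization pass over an object with such a container, size holds the exact number of bytes required, so the real buffer can be allocated once. This mirrors the two-pass scheme visible in Listing 4.13, where pa.size() is used to allocate the send buffer.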

Pointers

When dealing with C++ object hierarchies, pointers are often encountered. The class of pointers is broken down into two categories:

1. Containers which use pointers to point to the actual storage of their contained elements. Here, pointers are used to handle the dynamic storage requirements of the containers.

2. Pointers that are used to reference other objects.

For the first case, the serialization framework does not impose any particular requirements since the memory is entirely managed within the container itself and the container offers clear ownership and copy semantics. Those containers are serialized by first saving the number of elements and afterward the stored elements.

The second case, however, requires particular attention. Whenever a class contains a member that is a pointer, it is essential to know the ownership semantics of said pointer; that is, it is crucial for the class to know when the pointed-to object needs to be destructed and deallocated. As a consequence, the serialization process is unable to marshal pointers that are non-owning, as this could lead to memory leaks upon deserialization because it cannot be decided when to delete the newly allocated memory. For raw pointers, no ownership semantics can be derived; as a consequence, by default, the serialization of raw pointers leads to a compile time error. However, to not put unnecessary restraints on expert programmers (a.k.a.: "I know what I am doing, I promise!"), and to allow for the best possible flexibility, the hpx::serialization::raw_pointer wrapper has been added, which marks the serialization of one individual raw pointer as safe. Upon deserialization, a new object is allocated, and ownership is implicitly handed to the deserialized object.

The ownership problem is not unique to serialization. In general, the C++ community tries to avoid manual memory management, which is one of the biggest sources of errors in programming, by offering special types called smart pointers. Those smart pointers are one form of garbage collection and provide automatic memory management by employing Resource Acquisition Is Initialization (RAII), ensuring either unique or shared ownership. std::unique_ptr provides unique ownership semantics and deletes the object in its destructor if the pointer is not null. std::shared_ptr provides shared ownership semantics through reference counting: whenever the number of references to the same object drops to zero (a reference counter is decremented in the destructor), the object gets released. These semantics make it possible to ensure that no memory is leaked during the (de-)serialization process.

When dealing with shared ownership semantics, pointed-to objects should only be serialized once when being archived. For that, each archive maintains a set tracking the pointers to serialize, in order to manage multiple pointed-to objects correctly. This further underlines the difficulty of using raw pointers directly. The use of smart pointer types or collections of objects is therefore not only highly recommended for regular applications but becomes more and more ubiquitous for distributed applications.

Polymorphism

Another important tool for designing C++ libraries, in general, is the use of polymorphism to model object hierarchies and implement generic interfaces with a bounded subset of implementations. One example was discussed in the handling of actions for remote procedure calls (see Section 4.3.1). Since the ability to dispatch a virtual function call to the dynamic type is handled through either references or pointers to a base type, the support for serializing polymorphic objects is built into the pointer serialization facilities.

For serializing a pointer to a base type, the virtual dispatch capabilities of C++ are used. Together with the actual content of the object hierarchy, the name of the most derived type is saved in order to create the most derived object through a factory function that is looked up by that name. For this mechanism to work, derived types have to be registered in a global lookup table.
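The following minimal sketch illustrates such a global name-to-factory registry; all names and signatures are assumptions made for this illustration and do not reflect the actual HPX implementation.

    #include <functional>
    #include <map>
    #include <memory>
    #include <string>

    struct base { virtual ~base() = default; };       // polymorphic base type

    using factory_fn = std::function<std::unique_ptr<base>()>;

    std::map<std::string, factory_fn>& factory_registry()
    {
        static std::map<std::string, factory_fn> registry;
        return registry;
    }

    // Called once per derived type, typically from a registration macro.
    template <typename Derived>
    void register_polymorphic_type(std::string name)
    {
        factory_registry()[std::move(name)] =
            [] { return std::unique_ptr<base>(new Derived()); };
    }

    // During deserialization, the saved name selects the factory that creates
    // the most derived object before its members are loaded.
    std::unique_ptr<base> create_by_name(std::string const& name)
    {
        return factory_registry().at(name)();
    }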

Zero Copy Mechanism

One important optimization for handling data is the ability to avoid copies altogether so as not to waste unnecessary CPU cycles. The mechanism implemented inside the archive is to either directly copy the data bit-wise or, above a certain threshold, put a distinctive mark into the data stream indicating that the data is to be sent directly from a user-provided memory region and can be issued directly to the underlying network hardware. Modern HPC Network Interface Cards (NICs) can perform RDMA operations that can be exploited by that mechanism. A threshold is necessary here since RDMA operations usually incur an overhead for registering memory regions for remote access as well as the necessary exchange of the RDMA address data. Also, an additional, small overhead has to be paid in the (de-)serialization protocol implementation.

Listing 4.12 shows the type trait necessary to flag a given type as bit-wise serializable. Collections with contiguous memory, for example std::vector, can perform the discussed zero copy optimization for their whole range of allocated data if the element type itself is bit-wise serializable.

Support for Futures

Another critical aspect for the HPX runtime system regarding support for serializing arbitrary arguments and return types occurring in actions is the need to be able to deal with asynchronous results, that is, hpx::future. To determine the result of the future, the actual serialization process has to be delayed until the result is ready. This process is combined with the size calculation, since the actual size might also depend on the dynamic nature of the deferred result.

template <typename T>
struct is_bitwise_serializable : std::false_type {};

// Helper macro to mark a given type as bit-wise serializable
#define HPX_IS_BITWISE_SERIALIZABLE(T)                                        \
    template <>                                                               \
    struct is_bitwise_serializable<T> : std::true_type                        \
    {};                                                                       \
/**/

Listing 4.12: HPX type trait to mark a class as bit-wise serializable, that is, allowing the serialization process to be optimized to directly memcpy the data.
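For illustration, a minimal usage sketch of this trait; the struct point3d is a hypothetical plain aggregate of built-in types, which makes it safe to mark as bit-wise serializable and enables the zero copy path for, e.g., std::vector<point3d>.

    struct point3d
    {
        double x, y, z;
    };
    HPX_IS_BITWISE_SERIALIZABLE(point3d);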

The mechanism is implemented as an asynchronous fixed-point iteration that traverses the entire object hierarchy using the serialization interface until all deferred results are completed and the passed parameters are ready to be copied to an underlying buffer that is served to the network. This algorithm is necessary to support the mechanisms of global reference counting (see Section 4.2.3), since that might require an asynchronous operation to the AGAS service instance managing the global object in question.

4.3.3. Network Transport

The topics discussed in this section so far have been the preliminaries to the actual receiving and sending of messages. As already hinted, the messaging scheme of HPX breaks with the prevailing HPC paradigms of point-to-point and one-sided communication. Instead, the focus is on high-performance asynchronous RPCs. The primary motivation behind this design decision is that this mode of communication models the interaction inside of C++ in a much more accurate way and is complemented by the AGAS layer.

The actual network transport inside of HPX is hidden behind the action mechanism. This means that whether a message is transferred over the network is determined by the actual target of an action (see Section 4.3.1) as resolved by AGAS. Whenever the destination, as determined by AGAS, is remote, the action goes through the parcelhandler, which serializes the data and sends it over a specific parcelport to the destination locality. Since the actual network transport depends on the available hardware and infrastructure, this handling is generalized: the actual implementation of the on-wire protocol has been factored out and is provided by a plugin mechanism. Currently, two types of network transport are fully implemented, TCP and MPI. Furthermore, other transports that rely on lower-level libraries such as InfiniBand verbs or GNI are possible. Nevertheless, to reduce maintenance overheads, relying on an additional middleware such as libfabric to handle those low-latency, high-bandwidth interconnects is more sustainable [30, 83], and is the third option.

Generic Parcel Handling

First, let us discuss the generic parcel handling implementation. Listing 4.13 shows a simplified version of the sending procedure. Only the last call to send(dest, p) requires a network transport implementation; the remaining part can be implemented generically to enable optimizations on an algorithmic level without omitting network transport specific optimizations. The algorithm, in essence, is a summary of the concepts mentioned earlier to invoke an action on a remote locality: wait until deferred results have been completed; serialize the data; pass it onto the network transport to send. What is omitted here is the receiving end. A specific network transport implementation needs to have a socket (or similar) available to receive messages in the background and process the data available on the NIC. The decoding of the byte stream into a parcel can then again be formulated in a network agnostic fashion.

One aspect of this algorithm is waiting on deferred results. The naïve algorithm in Listing 4.13 either needs to block until the processed asynchronous results are ready or has to poll. Neither variant seems appealing, and both might result in improper resource utilization or even unnecessarily blocked results. Instead, this loop is futurized, and the fixed-point iteration is formulated as a sequence of continuations, with the actual parcel serialization and sending marking the completion of the fixed-point iteration.

Once the size of the parcel to be serialized has been determined (that is, the fixed-point iteration has converged), a buffer is allocated. In the version outlined in Listing 4.13, this is done by a plain std::vector. To support network transport specific optimizations, such as RDMA-based one-sided putting or getting into a registered memory region, the actual type of the buffer is provided by the network transport implementation.

When looking closer at how a send operation needs to be implemented, it becomes clear that something like a connection has to be established between the sending and the receiving party. In the case of TCP-based communication, the sender needs to connect to a listening socket. This operation adds latency to the actual sending procedure and should be done only once.

void put_parcel(locality dest, parcel p)
{
    parcel_awaiter pa;
    output_archive await_archive(pa);

    // Do the fixed-point iteration to await all deferred results.
    do {
        // Run the pre-serialization step.
        await_archive & p;

        // Stop when all futures have been resolved.
    } while (pa.await_futures());

    // Create a buffer to hold the serialized content.
    std::vector<char> buffer(pa.size());
    output_archive buffer_archive(buffer);

    // Now serialize the parcel content into our archive.
    buffer_archive & p;

    // Send the parcel in a network transport dependent way.
    send(dest, p);
}

Listing 4.13: Generic parcel handling protocol for sending, in simplified pseudo code. The assumption is that AGAS has already determined a remote location, and a created parcel needs to be sent.

For that reason, the parcel handling process is implemented using a caching mechanism to reuse already established channels for sending parcels, reducing the overheads of creating connections or queue pairs. The problem that arises with that is that we can potentially create an infinite amount of open point-to-point connections, exhausting the available resources. The countermeasure is to only allow a given number of connections to a specific locality as well as a maximum number of overall connections. Since NICs are, in relation, an order of magnitude slower at handling network requests than the Central Processing Unit (CPU) is capable of producing them, one also needs to defer outgoing messages when a certain threshold has been reached. The effect of caching connections, and of only allowing a fixed number of connections, is that parcels are now coalesced together, which leads to better resource utilization in terms of bandwidth.
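To make the bounded connection cache with deferred parcels more concrete, the following is a minimal sketch under simplified assumptions; the types, the per-locality limit and the member names are illustrative, not the actual HPX implementation.

    #include <cstddef>
    #include <deque>
    #include <map>
    #include <vector>

    struct connection {};   // stands in for a socket or a queue pair
    struct parcel {};       // placeholder for the parcel type of Listing 4.9

    struct connection_cache
    {
        std::size_t max_per_locality = 2;                 // illustrative limit
        std::map<int, std::vector<connection>> idle_;     // reusable channels
        std::map<int, std::size_t> open_;                 // open connections per locality
        std::map<int, std::deque<parcel>> deferred_;      // parcels waiting to be coalesced

        // Returns true if a channel may be used immediately; otherwise the
        // parcel is deferred and later sent together with others once a
        // channel to this destination becomes available again.
        bool acquire_or_defer(int dest, parcel p)
        {
            if (!idle_[dest].empty())
            {
                idle_[dest].pop_back();                   // reuse a cached channel
                return true;
            }
            if (open_[dest] < max_per_locality)
            {
                ++open_[dest];                            // open a new channel
                return true;
            }
            deferred_[dest].push_back(std::move(p));      // defer and coalesce
            return false;
        }
    };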

The techniques discussed in this section allow for efficient, asynchronous processing of outgoing messages without disrupting the intent of the actual algorithm and, as such, are the basis for the unified asynchronous interface.

4.4. Asynchronous, unified API for remote and distributed computing

The preceding sections discussed the foundation on which HPX is established. This part concludes those architectural design decisions by laying out the substantial higher-level consequences that follow from the lower-level feature sets.

From the foundation of the C++ Standard in combination with the underlying principles, the fully asynchronous API with the ability to host billions of concurrently running tasks is derived. In combination with the AGAS functionalities, this leads to the full equivalence between local and remote operations and the transparent handling of such, which helps productivity and is necessary for applications that need to adapt dynamically at runtime. Therefore, a natural extension of the shared memory C++ memory model and standard library is the goal.

4.4.1. Asynchronous Programming Interface

With the foundation of a lightweight task management system that correctly resembles the general notion of threads (see Section 4.1), dealing with millions of concurrently active threads becomes not only possible but feasible, which is the main ingredient for countering starvation of hardware resources. Furthermore, by reusing the concept of futures as defined in Section 3.1.3, the coordination among different threads of execution becomes manageable due to improved handling of the computed results of a given task, which allows for fine-grained synchronization avoiding coarse-grain barriers. These possibilities make the notion of dealing with the concept of threads directly in application code obsolete. This becomes more and more apparent when integrating the idea of executors (see Section 3.3.1), which decouple resource management from parallel algorithms of any kind entirely. Due to the generic thread properties, such a task-based system is suitable as the foundation of a general parallel runtime system and therefore fit to express all kinds of concurrency and parallelism. The presented mechanism acts as the underlying foundation of all forms of parallelism manifested in Section 3.2, reusing the necessary definitions regarding memory model and sequential consistency as defined in Section 3.1. With this theoretical foundation, the user-facing APIs in the HPX parallel runtime system are designed to be fully asynchronous and provide implementations for all further developments of the C++ standard as discussed in Section 3.3.

With the background of further extending the footprint of the runtime system towards distributed computing, asynchronous APIs have the essential property of seamless latency hiding due to their very nature. In combination with the futurization techniques discussed in Section 3.3.3, a mechanism to hide all factors of SLOW has been devised that is not limited to the distributed use case but is also valid on shared memory architectures.
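As a small illustration of this futurized, latency-hiding style, the following sketch uses the HPX counterparts of the standard asynchronous facilities; headers are omitted for brevity and compute_part is a hypothetical user function.

    int compute_part(int i) { return i * i; }   // hypothetical user function

    hpx::future<int> futurized_sum()
    {
        // Launch independent tasks; nothing blocks here.
        std::vector<hpx::future<int>> parts;
        parts.push_back(hpx::async(compute_part, 1));
        parts.push_back(hpx::async(compute_part, 2));

        // Attach a continuation instead of waiting: the reduction runs as
        // soon as all inputs are ready, hiding the latency of the tasks.
        return hpx::when_all(std::move(parts)).then(
            [](hpx::future<std::vector<hpx::future<int>>> all) {
                int s = 0;
                for (auto& f : all.get())
                    s += f.get();
                return s;
            });
    }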

4.4.2. Equivalence between Local and Remote Operations

The usability of the overall system can be determined by looking at the differences between operations dealing with the various forms of intra-node parallelism (including accelerator support) and those dealing with inter-node parallelism.

In HPC, the dominant solutions are either purely MPI based (excluding accelerators) or rely on other paradigms for intra-node parallelism, such as OpenMP, Intel TBB, Kokkos, or CUDA for accelerator support. Besides, different models, such as PGAS, arose that operate in a similar way to MPI. This combination is often referred to as MPI+X.

The MPI+X model has significant usability difficulties since the programmer needs to switch between two programming models that do not expose good interoperability. The exception is given when substituting X with MPI, which unifies the programming models but lacks support for accelerators.

The HPX programming model, however, attempts to unify intra- and inter-node parallelism under one single paradigm. This unification is made possible through the emphasis on asynchronous APIs as well as through the functionality offered by AGAS.

Also, thanks to recent developments in programming accelerators, the integration of offload-based programming for accelerators becomes possible. The accelerator integration will be discussed in detail in Section 5.2.4.

The complete unification of local and remote operations becomes possible by combining the basic APIs as defined in the C++ Standard (e.g. std::async) with objects located in the global address space (see Section 4.2) and the one-sided active messaging (see Section 4.3.1).

                              Functions                     HPX actions

Synchronous                   f(vs...);                     a()(gid, vs...);
(returns R)

Asynchronous                  hpx::async(f, vs...);         hpx::async(a(), gid, vs...);
(returns hpx::future<R>)

Fire & Forget                 hpx::apply(f, vs...);         hpx::apply(a(), gid, vs...);
(returns void)

Table 4.1.: Equivalence of regular function calls to RPC invocations based on different ways to invoke a function synchronously and asynchronously in C++ and the HPX extensions

Table 4.1 gives an overview of the various ways to invoke regular C++ functions and HPX actions. It is important to note that HPX actions are regular C++ callables (see Definition 10) and, by definition, have no different semantics than regular C++ functions. The only point standing out is the gid parameter, which is of type hpx::id_type (see Section 4.2). At first sight, this seems like a deviation from the regular C++ function call syntax. However, upon closer inspection, this difference does not hold. A GID can be seen as a pointer to an object (either representing a locality or an HPX component in the global address space). Consider the syntax for calling a C++ member function with the signature R (C::*)(T...), where R is the return type, C the type of the class and T... an arbitrary number of arguments. The C++ standard defines INVOKE for member functions as INVOKE(member_function, c, ts...) (see https://en.cppreference.com/w/cpp/utility/functional/invoke), where member_function is a pointer to a member function with the previously mentioned signature, c a pointer to an object of type C and ts... the corresponding arguments.

As such, we provide a full semantic and syntactic equivalence between local and remote function calls. This has the effect that not only local resource management (for example the number of cores), as is the case for shared memory parallelism, but also distributed resource management is completely decoupled. By combining that with executors, we get a usable way to manage resources that is extensible to offloading to accelerators (see Section 5.2.4).
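A minimal sketch mirroring Table 4.1 follows; it reuses the simplified HPX_ACTION macro from Listing 4.5, and square is a hypothetical function used only for illustration.

    int square(int i) { return i * i; }
    using square_action = HPX_ACTION(square);

    void example(hpx::id_type gid)   // gid refers to a locality or component
    {
        // Local, synchronous call vs. remote, synchronous invocation.
        int r1 = square(2);
        int r2 = square_action()(gid, 2);

        // Asynchronous variants, both returning hpx::future<int>.
        hpx::future<int> f1 = hpx::async(square, 2);
        hpx::future<int> f2 = hpx::async(square_action(), gid, 2);

        // Fire & forget.
        hpx::apply(square, 2);
        hpx::apply(square_action(), gid, 2);
    }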

4.4.3. Natural extension to the C++ Standard

To conclude this section, we postulate that the HPX programming model provides a natural extension to the C++ Standard towards distributed computing and enables heterogeneous programming.

By looking at the semantic and syntactic equivalence of local and remote operations as well as the semantics of the objects in the global address space, the extension becomes obvious. Those objects are entirely in line with the C++ programming model in the sense that the observable behavior is unsynchronized (see Section 3.1.1). That means that there might be multiple concurrent accesses to objects (through RPC calls). Those accesses are not synchronized but need to be synchronized using the synchronization mechanisms described in Section 3.1.2. Atomic objects in the global address space can be implemented by providing specific hooks such that each RPC call happens in a sequenced, but unspecified, order. This mechanism cannot be realized lock-free in the generic case.

4.5. Performance Counters

One crucial aspect of achieving performance portability is the means to profile certain aspects of a program intrinsically, allowing for runtime adaptivity, which in turn enables auto-tuning capabilities. The previous sections discussed the overall architecture and fundamental principles of the HPX runtime system. A highly dynamic runtime has been described, with a big emphasis on dynamic, often non-deterministic, properties affecting performance. Such behavior is desirable to serve the use cases of load balancing and dynamically mitigating contention. To further assist in the decision making process, an HPX intrinsic performance counter framework has been devised that can encompass HPX-internal performance metrics and is extensible to supply application-specific parameters or platform-dependent hardware performance counters. Those performance counters are accessible through AGAS: each counter is registered with a specific symbolic name, making it available throughout the system.

All performance counters are accessible using symbolic names. Listing 4.14 lists the basic syntax patterns that those symbolic names follow. That is, the counters are organized following a scheme that is based on directories, with the ability to reference single components or threads.

/objectname{counter}/countername@parameters
/objectname{locality#index/thread#index}/countername@parameters
/objectname{locality#index/total}/countername@parameters
/objectname{locality#index/*}/countername@parameters
/objectname{*}/countername@parameters

Listing 4.14: The performance counter names follow the given patterns. Parameters are optional. objectname is the given performance counter for a specific module, and the locality or thread can either be indexed directly or provided with a wildcard.

One example is to query the total count of created threads on locality 0, which is expressed as /threads{locality#0/total}/count/cumulative. An example of a performance counter taking an argument is the statistics module: /statistics{/threads{locality#0/total}/count/cumulative}/average@500. The instance we want to query is another counter, and the statistics counter takes the average over samples taken every 500 milliseconds. The performance counters are merely presented for the sake of completeness and are not further discussed in this thesis.
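A counter can also be queried programmatically from within an application. The following is a minimal sketch assuming the performance_counter client API and header layout of HPX, both of which may differ in detail between versions:

#include <hpx/hpx_main.hpp>                        // assumed header layout
#include <hpx/include/performance_counters.hpp>
#include <cstdint>
#include <iostream>

int main()
{
    // The counter is resolved through AGAS via its symbolic name.
    hpx::performance_counters::performance_counter counter(
        "/threads{locality#0/total}/count/cumulative");

    // The value is retrieved asynchronously and delivered via a future.
    hpx::future<std::int64_t> value = counter.get_value<std::int64_t>();
    std::cout << "threads created so far: " << value.get() << "\n";

    return 0;
}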

5. Abstractions for High Performance Parallel Programming

The previous chapters provide the foundation for the HPX programming model and highlight the profound embedding into the C++ programming language by deriving the essential programming model (see Chapter 4) based on the definitions found in the international C++ standard (see Chapter 3).

This chapter builds upon those principles to provide further abstractions that improve parallel programming productivity and performance portability. It reiterates the importance of co-locating data and work, especially in the presence of a global address space, and highlights the concept of “work follows data”. Section 5.1 serves as the foundation for the abstractions covered in the subsequent sections. The concept of Targets is introduced to define the link between where data is allocated and where work is executed (Section 5.2). Accompanying data structures that are fully standards conformant yet support the developed concepts to maximize performance are discussed in Section 5.2.6. Section 5.3 gives an overview of other utilities useful for synchronization among concurrently running threads of execution and lays out the support for global collectives as well as a tool for efficient neighborhood communication using channels.

5.1. Co-Locating Data and Work

Figure 5.1 presents the disparity of access times to different memory subsystems. Those different access times stem from different technological advances in the respective areas. While the frequency of processing units could be increased steadily, main memory could not keep up [51]. This development led to the invention of caches and, as manufacturing technologies advanced, to high density and high bandwidth stacked memory

technologies. Due to scalability issues of on-chip interconnects, we are likely to see an increase in the complexity of the memory hierarchy as the number of cores increases. It is important to note that this figure only shows the situation of a single core. The premise of any efficient algorithm should then be to minimize data movement such that the task at hand can complete as quickly as possible; that, in turn, means keeping data accesses as local as possible.

Figure 5.1.: Latency and bandwidth for access to different kinds of memory (CPU registers, caches, high bandwidth memory, main memory, high bandwidth network, file system). This figure is not based on a real architecture but serves to demonstrate the importance of localized execution.

When observing the memory needed for different aspects of a program, the memory space needed to execute an algorithm, the work (usually the binary of the program), is often orders of magnitude smaller than the data that the program uses. As such, the HPX programming model follows the principle of work following data: work should be executed as close to the data as possible to minimize the amount of necessary data movement.

In the context of an RPC to a possibly remote object, co-location of data and work is achieved by the very principle that the work follows the data. By its very nature, an RPC is executed precisely at its destination. In this specific case, the GID based address translation performed in AGAS ensures that only the arguments passed to a function call are sent to the remote locality, while the actual work is executed exactly where the data is located. This property is particularly useful in cases where transparent migration of an object in AGAS is employed to implement a dynamically load balanced application.

In general, a sweet spot exists when also considering the arguments and return values of generalized functions (that is, functions that can be executed locally or remotely). To utilize the “work follows data” principle, the size of those arguments and return values needs to be small relative to the work the algorithm performs. Exceptions to this principle exist; one example is the ghost zone exchange required by specific applications, which usually returns the ghost zone data from the invoked function.

Definition 5.11 (Target of Execution) A target of execution is a specific location in the computing system that can be used to execute a thread and/or allocate memory, with target dependent resources attached (see Table 3.3).

Adhering to the importance of co-locating data and work, the HPX programming model defines the Target of Execution (see Definition 11). This concept closes the gap, in task-based dynamic runtimes, between the execution of work and the allocation of data, which have previously only been considered in separation. The combination is possible by leveraging the concepts defined in the evolution of the C++ Language, namely executors (see Section 3.3.1). That is, targets are the link between the execution context and the memory location.

5.2. Targets in Common Computer Architectures

Having an opaque definition of a Target of Execution (see Definition 11), this section discusses the implementation and concepts for relevant architectures found in modern HPC systems: NUMA architectures (Section 5.2.3), High Bandwidth Memory (Section 5.2.5) and accelerators (Section 5.2.4).

To establish common ground, we first introduce generic utilities to work with targets, that is, the allocation of data (Section 5.2.1) and the execution of work (Section 5.2.2), which define the connections to co-locate data and work given specific targets.

5.2.1. Allocation Targets

For the allocation of memory, the C++ standard defines the Allocator concept. This concept is used to define strategies for access and addressing of objects, allocation and deallocation of memory, as well as construction and destruction of objects. As such, it is a perfect fit to be used in conjunction with our targets and blends perfectly into the landscape of the C++ programming language. As the foundation of our extension, the std::allocator_traits template is used1. It is a generic mechanism to access the underlying implementation of a given allocator without loss of generality. An important feature of std::allocator_traits is that it makes use of std::pointer_traits2, which widens the scope of pointers usable with allocators to further support allocators returning pointers to objects that are not necessarily located in the same virtual address space. With these utilities at hand, we merely need to slightly extend the discussed traits, amending them with an additional target_type member type and a static target_type target() member function to give access to the underlying target. Besides, a mechanism to allocate and construct objects in bulk is useful to mitigate overheads and allow for further optimizations with specialized targets. Target aware allocators nevertheless remain a drop-in replacement for standard allocators and are usable with data structures not designed with targets in mind. Exploiting the presence of targets, however, opens the door to optimizations, as discussed in Section 5.2.6.
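The following is a minimal sketch of what such a target-aware allocator could look like; the names, the (non-static) target() accessor and the placement logic are illustrative only and do not reflect the actual HPX implementation:

#include <cstddef>
#include <new>
#include <utility>

template <typename T, typename Target>
struct target_allocator
{
    typedef T value_type;
    typedef Target target_type;          // extension: associated target type

    explicit target_allocator(Target t) : target_(std::move(t)) {}

    target_type const& target() const { return target_; }   // extension

    T* allocate(std::size_t n)
    {
        // A real implementation would place the memory on target_ (a NUMA
        // domain, a GPU device, ...); this sketch falls back to the heap.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t) { ::operator delete(p); }

    // extension: construct n objects in one go, enabling, for example,
    // first-touch placement or a single kernel launch on specialized targets.
    template <typename... Ts>
    void bulk_construct(T* p, std::size_t n, Ts const&... ts)
    {
        for (std::size_t i = 0; i != n; ++i)
            ::new (static_cast<void*>(p + i)) T(ts...);
    }

    Target target_;
};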

5.2.2. Execution Targets

As with the Allocator concept, we can reuse existing infrastructure to account for the existence of targets and to be able to link targets with the allocation of data and the execution of work. The natural choice is the Executor concept (see Section 3.3.1), which is already sufficiently defined to be used directly with targets. It is important to note that Targets of Execution might not only refer to a specific execution context, but also to places in a system solely related to a particular kind of memory. Executors defined with targets in mind always have specific computational resources attached to them that are close to the memory they are operating on, to avoid potential inefficiencies (such as increased latency) when accessing data.

5.2.3. NUMA Architectures

When looking at the design of modern CPUs, and especially multi-CPU systems, the use of NUMA based designs is becoming more and more prevalent. Traditionally, NUMA existed for multi-CPU systems, which naturally have non-uniform access to memory. This architectural property comes from the fact that each CPU is equipped with a dedicated memory controller but still provides transparent access to all attached physical memory through a common inter-CPU bus (see Figure 5.2). In recent years, the core count of general purpose CPUs has increased. This increase brought up scalability challenges within a single chip due to a commonly shared ring bus, leading to multiple NUMA domains even on a single chip for the most recent incarnations of Intel chips, starting with the Haswell server parts.

1http://en.cppreference.com/w/cpp/memory/allocator_traits
2http://en.cppreference.com/w/cpp/memory/pointer_traits

Figure 5.2.: A schematic of a NUMA architecture. The key property of such a system is that access latencies depend on which processing element accesses which part of memory. This example shows a single Haswell EP scheme exposing two NUMA domains.3

The challenges that arise with such systems are to reduce access latencies and to utilize all memory controllers present in a given system efficiently. For achieving high performance in the presence of NUMA, it is paramount to place memory in the proximity of where it is being accessed by the given tasks. The placement of memory in current OSs in the presence of NUMA is determined by the so-called “first-touch” principle or via explicit mapping of pages through OS specific APIs. The essence is that a set of CPUs is associated with a specific memory controller. This set represents the NUMA target. A NUMA aware allocator is then able to place memory in a given domain determined by the associated CPU set. The placement is achieved either by enforcing a first-touch access inside of the defined CPU set or by explicitly allocating the pages to the memory controller attached to the given set, if supported by the underlying OS. Since access to memory in separate NUMA domains happens inside the same virtual address space, a special block allocator has been devised that takes a list of targets resembling NUMA domains. This eases NUMA aware programming and places the memory to be allocated in uniform partitions across the given NUMA domains. Other partitioning schemes can be implemented in the same manner.

3https://www.heise.de/newsticker/meldung/Intels-Xeon-E5-18-CPU-Kerne-und-AVX2-fuer-Server-2357378.html

NUMA aware executors use a given CPU-set4 as the constraint for placing new tasks to be executed. They use the underlying schedulers as defined in Section 4.1, which perform work stealing within the given NUMA domain. In resemblance to the block allocator, a block executor can be defined that is aware of the data distribution and therefore automatically co-locates data and work based on the same underlying targets.
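The following sketch shows how these building blocks fit together; the type and function names follow the hpx::compute host facilities and are assumptions here rather than verbatim excerpts from the implementation:

#include <hpx/include/compute.hpp>              // assumed header layout
#include <hpx/include/parallel_executors.hpp>
#include <cstddef>
#include <vector>

void numa_aware_setup(std::size_t size)
{
    // One target per NUMA domain of the machine.
    std::vector<hpx::compute::host::target> numa_domains =
        hpx::compute::host::numa_domains();

    // Block allocator: the data is placed in uniform partitions,
    // one per NUMA domain.
    hpx::compute::host::block_allocator<double> alloc(numa_domains);
    hpx::compute::vector<double,
        hpx::compute::host::block_allocator<double>> data(size, 0.0, alloc);

    // Block executor: tasks are constrained to the cores of the domain
    // owning the partition they operate on; work stealing stays inside it.
    hpx::compute::host::block_executor<> exec(numa_domains);
    auto policy = hpx::parallel::execution::par.on(exec);
    (void) policy;   // used with the parallel algorithms (see Listing 6.1)
}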

5.2.4. GPU offloading

Another relevant development in the landscape of HPC architectures is the presence of accelerators in the form of GPUs. In scientific computing, a GPU is often used as an accelerator for number crunching, such as numerical linear algebra routines. Since the original use of GPUs, rendering images, largely requires dense linear algebra as well, and the developments within the computer graphics community asked for programmable processing units (shaders), the scientific computing community discovered the usefulness of those chips. The dual usage results in a symbiotic effect: consumer products with a large margin of profit can be reused by the scientific community to accelerate simulations. In the HPC community, the prevalently used GPUs are manufactured by NVIDIA and programmed with the CUDA toolkit. As such, the remainder of this section will focus on NVIDIA based architectures and discuss the implementation with the help of CUDA. The architecture of a GPU is specialized towards streaming applications that expose a high operational intensity. Figure 5.3 shows the architecture of the newest generation of NVIDIA based processors. The lightweight Streaming Multiprocessors (SMs) bundle processing elements that operate in a SIMD fashion, and each of those bundles can operate independently. Those multiprocessors operate on a large register set and share a specific amount of cache. As a result, the CUDA programming model is centered around providing computational, number crunching, features and cannot efficiently process all required C++ standard features, for example, exceptions or fine-grained control over scheduling tasks. Those limitations often require tedious, architecture-specific fine-tuning. However, it is to be anticipated that accelerator processing units and general purpose processing units will converge more and more as technology advances. One example of this trend is the Intel Knights Landing processor, as well as the ongoing efforts to support more and more use cases for general purpose GPU programming, as seen with the announcement of the Volta architecture5.

4A CPU-set can be understood as the set of numbers relating to physical cores, used to uniquely address them.

Figure 5.3.: The architecture of an NVIDIA Volta GPU. Each GPU is equipped with up to 84 SMs connected via a common L2 cache.6

To support CUDA based offloading of special purpose algorithms that are deemed fit for GPU based architectures, we need to rely on the features provided by the Software Development Kit (SDK) from NVIDIA. In that regard, the CUDA programming model uses a single source approach. That is, functions that are supposed to be executed on the GPU are marked with the special attribute __device__; moreover, the GPU program entry points (kernels) are marked with the special __global__ attribute. Luckily, CUDA supports the C++14 language entirely, and one of its strengths, based on the single source property, is the seamless interoperability of host and GPU based code.

Listing 5.1 shows the basic principle of how we can support GPU offloading of generic functions. The entry point (execute_function) is implemented as a template taking a special Callable object, the Closure, which captures the initially passed arguments and the function to call. In a real application, extra care needs to be taken to support use cases in which the Callable and Closure need to be transferred separately

5http://nvidianews.nvidia.com/news/nvidia-launches-revolutionary-volta-gpu-platform-fueling-next-era-of-ai-and-high-performance-computing
6https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

template <typename Closure>
__global__ void execute_function(Closure closure)
{
    closure();
}

Listing 5.1: GPU offloading of generic functions with CUDA.

to the GPU's memory. The user-defined device functions can then be executed on the specified device. Without loss of generality and for the sake of argument, this is neglected in further discussions but present in the real implementation. By using templates, we instruct the compiler to generate the code for the GPU device, and we can wrap this into a CUDA specific executor that exposes the very same interface as a regular TwoWayBulkExecutor (see Table 3.3). Due to the GPU programming model, however, to get the best performance it is advisable to restrict oneself to the bulk operations exposed.
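As an illustration of how the generic kernel from Listing 5.1 can be used, consider the following hypothetical, stand-alone CUDA sketch; the HPX executor and stream/future integration is omitted, nvcc's extended device lambda support is assumed, and the closure is simply passed by value as a kernel argument:

#include <cuda_runtime.h>

template <typename Closure>
__global__ void execute_function(Closure closure)
{
    closure();
}

int main()
{
    double* d_value = nullptr;
    cudaMalloc(&d_value, sizeof(double));

    // The closure captures the device pointer by value; the CUDA runtime
    // copies the closure object to the device when launching the kernel.
    auto closure = [d_value] __device__ () { *d_value = 42.0; };
    execute_function<<<1, 1>>>(closure);
    cudaDeviceSynchronize();

    double h_value = 0.0;
    cudaMemcpy(&h_value, d_value, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_value);
    return h_value == 42.0 ? 0 : 1;
}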

When talking about allocating data for GPUs, the concept of unified memory appears. With unified memory, the virtual address space of a process is extended with the physical memory available on GPUs, and the driver is responsible for mapping the pages accordingly, migrating data back and forth, and ensuring mutually exclusive access. While this brings a significant advantage in programmability, it also adds uncertainty to performance considerations since memory management becomes implicit and perfect co-location of work and data becomes hard. The similarities to NUMA aware programming are apparent: in both cases, the exact location of the data needs to be known to maximize memory operation throughput and minimize latencies. As a consequence, the allocators for CUDA based programming implement explicit allocation on the GPU devices to maximize flexibility. The integration into regular C++ is achieved by using std::pointer_traits to implement a fancy pointer7 that makes the different physical memory locations explicit, yet translates to regular pointers in device code. As such, a CUDA target represents one GPU device. It is made accessible from the host code, fulfilling every property required by Definition 11. It is important to note that the GPU operations implemented that way take advantage of asynchronous CUDA streams. Those streams allow the creation of a CUDA specific shared state implementation and therefore allow the integration of any CUDA based operation into the context of futurization. This allows for integration into the existing facilities and for composition of the different layers of parallelism in a unified manner.

7A fancy pointer is a C++ object that behaves like a pointer but contains additional information or logic for, for example, dereferencing the pointer value.

5.2.5. High Bandwidth Memory

In recent years, semiconductor manufacturing processes have made significant advancements. Especially the introduction of three-dimensional designs into the manufacturing process led to emerging technologies, usually called high bandwidth memory, that allow for dense, three-dimensional stacking of memory cells. Those cells can be placed on top of, or very near to, the computational cores (see Figure 5.4).

Figure 5.4.: 3D stacked integrated circuits, using the example of high bandwidth memory that is located directly on top of the processing unit8.

This technique has been deployed in the newest generation of GPU devices from AMD and NVIDIA to improve main memory access times and bandwidth. This section, however, focuses on highlighting the benefits of Targets of Execution when dealing with heterogeneous memory. That is, unlike NUMA, such memory can be addressed explicitly and exhibits different performance characteristics. One example that incorporates high bandwidth memory as part of its memory hierarchy is the Intel Knights Landing architecture (see Figure 5.5). In the case of Knights Landing, the HBM memory can either be used as part of the cache hierarchy, in which case it acts as a direct mapped cache shared amongst all cores, or explicitly, to assist memory-bound applications. Since we are interested in explicit memory placement, the latter is the mode this section targets.

As seen from the block diagram shown in Figure 5.5, Knights Landing divides the chip into four quadrants, where each of the quadrants has two dedicated channels to the HBM, which Intel calls MCDRAM. The main aspect to note here is that, similar to a regular NUMA system, we have a set of cores that are close to the memory, even though the memory can only be allocated explicitly into the special memory region. Nevertheless, the important distinction is that the CPU set for memory like MCDRAM needs to be determined based on the underlying architecture, and the corresponding NUMA domains do not necessarily have to be exposed by the operating system.

8http://fudzilla.com/media/k2/items/cache/0b7a50d58ee4ee631879a2c11f39c082_XL.jpg
9http://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/

Figure 5.5.: Knights Landing layout showing the different tiles in the 2D on-chip mesh network. This shows the partitioning of the cores into 4 sub-NUMA domains9.

5.2.6. Target Aware Container

The concept of a target, as defined in Section 5.2, is useful to present the general notion of different places residing in a system. Targets act as a building block for higher level concepts and have to be usable with data structures that form an abstraction with well-defined semantics. The semantics range from domain specific knowledge, for example a particle in an N-Body simulation, to composable data structures such as containers. For the sake of demonstrating the capabilities exposed by the defined concepts, this section focuses on the discussion of a dynamically allocated array, which can be seen as a generic container, usable in a variety of other, more specialized data structures. The C++ Standard defines std::vector10 for that purpose. This container can be parameterized with an Allocator, making it immediately usable with the target-aware allocators defined in the previous section. Unfortunately, it has no support for the bulk construction facilities usable with target_allocator_traits. This bulk construction is an important optimization for the NUMA as well as the CUDA target. As such, a drop-in replacement for std::vector has been implemented which comes with those optimizations in mind.

5.3. Supporting Abstractions

Data and execution locality is the starting point for HPC applications. As soon as more complex use cases need to be implemented, the need for other abstractions arises, such as synchronization primitives, global collectives and point to point communication.

5.3.1. Synchronization of Concurrent Work

The C++ Standard defines a set of low-level primitives to perform synchronization among concurrently running threads of execution (see Section 3.1). Current implementations of those primitives (except for lock-free atomic operations) use OS specific APIs to perform their duties. By performing blocking calls at the OS level, we prevent other user-level tasks from running on the same thread. To mitigate this problem, all those primitives have been ported to perform their blocking calls via the underlying threading model of HPX (see Section 4.1), enabling cooperative scheduling among other user-level tasks. Those primitives are usable in a shared memory context only; while it has a certain appeal to implement things like mutual exclusion or condition variables in a distributed memory context, the HPX programming model favors localized synchronization, combined with globally addressable objects, which are suitable building blocks for those if needed.
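As a small illustration, the HPX counterpart of std::mutex can be used with the standard lock helpers; the header layout is an assumption, but the key point is that a blocked task suspends instead of blocking the underlying OS worker thread:

#include <hpx/include/lcos.hpp>   // header layout is an assumption
#include <mutex>
#include <vector>

hpx::lcos::local::mutex mtx;
std::vector<int> shared_results;

void record(int value)
{
    // While waiting for mtx, the current HPX task is suspended and other
    // user-level tasks may run on the same worker thread.
    std::lock_guard<hpx::lcos::local::mutex> lock(mtx);
    shared_results.push_back(value);
}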

10http://en.cppreference.com/w/cpp/container/vector

5.3.2. Global Collectives

Apart from the local synchronization primitives, applications often require a global exchange of data. Those requirements come either from algorithmic properties, for example the need to have the result of a scalar product available on all threads of execution, or from setting globally available configuration values.

Figure 5.6 presents the basic idea used to implement global communication patterns. That is, instead of sending N − 1 messages from a single locality, a fan-out is performed to send the messages in a tree-based manner. While other programming models that depend on explicit message passing (like MPI) can benefit from potential hardware support, this is not generically applicable to HPX. Since the target GIDs do not necessarily have to represent localities, but may be arbitrary objects, we cannot assume a fixed set of localities; as such, the creation of a communicator, which itself requires global communication, is not realistic. However, by increasing the fan-out, an HPX implementation can benefit from spawning the children in parallel, as sketched below.
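The following sketch illustrates the fan-out principle; it is not the HPX library implementation, and for brevity the remote delivery is a local placeholder, whereas in HPX it would be an action invoked on the head locality of each chunk:

#include <hpx/include/async.hpp>   // header layout is an assumption
#include <hpx/include/lcos.hpp>
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical leaf operation; in a real HPX program this would be an action.
void deliver(hpx::id_type /*target*/, int /*value*/) {}

hpx::future<void> tree_broadcast(std::vector<hpx::id_type> targets, int value,
    std::size_t fanout = 8)
{
    if (targets.empty())
        return hpx::make_ready_future();

    std::vector<hpx::future<void>> children;
    std::size_t const chunk = (targets.size() + fanout - 1) / fanout;
    for (std::size_t begin = 0; begin < targets.size(); begin += chunk)
    {
        std::size_t const end = std::min(begin + chunk, targets.size());
        std::vector<hpx::id_type> sub(
            targets.begin() + begin, targets.begin() + end);

        // Deliver to the head of the chunk and let it recurse over the rest;
        // every level therefore spawns its children in parallel.
        children.push_back(hpx::async(
            [value, fanout, sub = std::move(sub)]() {
                deliver(sub.front(), value);
                std::vector<hpx::id_type> rest(sub.begin() + 1, sub.end());
                tree_broadcast(std::move(rest), value, fanout).get();
            }));
    }

    return hpx::when_all(std::move(children))
        .then([](hpx::future<std::vector<hpx::future<void>>>) {});
}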

5.3.3. Point to Point Communications

As important as the global communication primitives is the ability to perform point to point communication. As the messaging paradigm can be considered one-sided (see Section 4.3.1), having the ability to perform point to point synchronization and communication is important to efficiently support applications that do not strictly require a BSP style mode of operation. One approach is, of course, to synchronize on futures. However, this only provides synchronization mechanisms for the consumer, leaving out the producer synchronization. In MPI based applications, the mechanism of choice is to make use of the pair of send and receive functions. That is, each send must be matched by a receive. This pairing naturally synchronizes the producer (initiating the send operation) and the consumer, matching the send with a corresponding receive operation. To support this pattern in the HPX programming model, a special type in the global address space has to be created. A common pattern for asynchronous programming models is the concept of a channel [81].

Figure 5.6.: Global collective operation: the basic principle is based on a spanning tree algorithm involving all localities. The arrows denote an RPC call to the next locality, each level spawns its children in parallel, and the execution of the collective is represented as a grey box.

template <typename T>
struct channel
{
    explicit channel(hpx::id_type where);
    explicit channel(std::string name);

    hpx::future<T> get(std::size_t generation) const;

    void set(T t, std::size_t generation);
};

Listing 5.2: The interface of hpx::lcos::channel: a value can either be asynchronously retrieved for a given generation or set. This allows for efficient point to point communication. The set operation is non-blocking. The future returned by the get operation can be used to futurize the control flow.

Listing 5.2 demonstrates the synopsis of an unbounded channel to be used within HPX. Since it represents a global object, one can either connect to it using a previously registered symbolic name or via an hpx::id_type which has been distributed otherwise. The default implementation uses an unbounded queue, where push and pull requests can arrive out of order, each with an associated sequence number. The pull request returns a future that is set ready whenever the respective push has completed, providing the full semantics of a send/recv operation. The sender can either use fire and forget semantics or synchronize on the completion of the send operation asynchronously or synchronously.
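A typical use case is the halo exchange sketched below, written against the interface from Listing 5.2; the channel objects are assumed to have been created and connected via a symbolic name beforehand, and the function names are illustrative:

#include <hpx/include/lcos.hpp>   // header layout is an assumption
#include <cstddef>
#include <utility>
#include <vector>

typedef hpx::lcos::channel<std::vector<double>> halo_channel;

// Producer side: publish this iteration's boundary row; set() does not block.
void send_halo(halo_channel& to_neighbor, std::vector<double> halo,
    std::size_t generation)
{
    to_neighbor.set(std::move(halo), generation);
}

// Consumer side: futurized receive; the continuation runs once the matching
// set() for this generation has completed on the producer side.
hpx::future<void> recv_halo(halo_channel& from_neighbor,
    std::size_t generation, std::vector<double>& ghost_row)
{
    return from_neighbor.get(generation).then(
        [&ghost_row](hpx::future<std::vector<double>> f) {
            ghost_row = f.get();
        });
}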

6. Evaluation

After the distributed, task based parallel programming model based on the HPX runtime system has been discussed and introduced in the previous chapters, this chapter evaluates the presented concepts. In this part of the thesis, the various components of the system are quantified on various platforms. First of all, Section 6.1 introduces the software and versions used to conduct the experiments as well as the hardware used. Section 6.2 discusses various low level experiments to determine overheads, analyzes the amortization of those overheads, and provides a comparison with other, comparable solutions.

The evaluation is concluded by presenting two example applications. The first, a two dimensional heat stencil, serves as an educational example. It puts a focus on programmability and shows the effects of the features discussed, such as futurization (see Section 3.3.3) and the unified programming model supporting heterogeneous architectures. Applying those concepts enables techniques such as latency hiding and achieves performance portability (see Section 6.3.1). The second application, OctoTiger [60, 50, 49], is a multi-physics simulation for binary star systems that has been tuned and pushed to scale as part of the research for this thesis (see Section 6.3.2).

6.1. Benchmark Setup

The setup for evaluating the performance is made up of 6 different systems representing ARM64, x86-64 and KNL CPU architectures to demonstrate performance portability on CPU based systems. To not only measure single node performance, some benchmarks were additionally evaluated on up to 16 nodes for ARM64, 256 nodes for x86-64 and 8192 nodes for KNL. For scaling the number of nodes, the Meggie cluster1 and the Cori

1https://www.anleitungen.rrze.fau.de/hpc/meggie-cluster/


Supercomputer2 have been used. The ARM64 hardware used was a custom built small scale cluster using 16 NanoPC-T3 Plus boards3. The full hardware specifications are summarized in Table 6.1.

ARM64:
  CPU: Samsung S5P6818 Cortex A53
  Frequency: 1.4 GHz
  Cores: 8
  Main Memory: 2 GB DDR3 RAM
  Nodes: 16
  Network Interconnect: Ethernet

X86-64:
  CPU: 2x Intel Xeon E5-2630 v4
  Frequency: 2.2 GHz
  Cores: 2x 10
  Main Memory: 64 GB DDR4 RAM
  Nodes: 728
  Network Interconnect: Intel OmniPath

KNL:
  CPU: Intel Xeon Phi 7250 (Knights Landing)
  Frequency: 1.4 GHz
  Cores: 68
  Main Memory: 96 GB DDR4 RAM, 16 GB MCDRAM
  Network Interconnect: Cray Dragonfly

Table 6.1.: Used hardware systems. For this thesis, various architectures were used to evaluate the performance of the presented programming model, ranging from embedded systems to large scale supercomputers.

Table 6.2 summarizes the used software stack. In order to have results that do not depend too much on the compiler, GCC has been used with the latest version available on the corresponding system. Details about the individual benchmarks are discussed in the respective sections in the remainder of this chapter.

6.2. Low Level Benchmarks

In order to evaluate the entire system, this section introduces micro benchmarks to assess the performance of its individual parts. First, the threading system overheads are determined, including the performance characteristics of futures.

2http://www.nersc.gov/users/computational-systems/cori/
3http://wiki.friendlyarm.com/wiki/index.php/NanoPC-T3_Plus#Hardware_Spec

Name     GCC    HPX      MPI               Boost   Jemalloc  Hwloc
ARM64    8.1.0  fdc279a  OpenMPI 1.10.2    1.67    5.1.0     1.11.10
X86-64   8.1.0  6e19dce  IntelMPI 2018     1.64    5.0.1     1.11.7
KNL      7.3.0  6e19dce  Cray MPICH 7.7.0  1.65.1  5.0.1     1.11.9

Table 6.2.: Used system software. The software used includes the latest version of GCC available on the respective platform as well as the latest versions of HPX, Boost and Jemalloc available at the time of writing.

The second part deals with AGAS and determines the behavior of the complete stack involved. The last part determines the efficiency of the parallel algorithm implementation using the STREAM benchmark [61]. The STREAM benchmark is part of the main HPX repository and the remaining benchmarks can be found at GitHub4. The evaluation uses commit 2943fc2.

Figure 6.1.: Overheads of creating tasks on a single processing unit (in cycles), comparing HPX coroutines, full HPX threads and OpenMP on the various platforms (ARM64, X86-64, KNL), normalized to the cost of calling a plain function on the respective platform.

4https://github.com/sithhell/hpxbenchmarks

6.2.1. HPX Thread Overhead

One of the basic building blocks of the HPX programming paradigm are the user level tasks described in Section 4.1. The claim is that those enable massive parallelism in today's systems. While there is no doubt that future Operating Systems will be able to provide similar facilities, the focus here is to demonstrate the effectiveness of enabling parallelism at as fine a granularity as possible. Grubel [29] discussed the overheads of HPX tasks extensively. In this thesis, I reiterate the findings of her work to reflect the progress of HPX as well as to include other hardware. The experiments conducted in this section are all low level synthetic benchmarks focusing on determining the overheads of the threading system.
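The structure of these synthetic benchmarks is straightforward; the following sketch (not the exact benchmark code, header layout assumed) measures the average cost of spawning an HPX thread for an empty function:

#include <hpx/hpx_main.hpp>        // assumed header layout
#include <hpx/include/async.hpp>
#include <chrono>
#include <cstddef>
#include <iostream>

void empty() {}

int main()
{
    std::size_t const iterations = 100000;

    auto start = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i != iterations; ++i)
        hpx::async(empty).get();   // spawn an HPX thread and wait for it
    auto stop = std::chrono::high_resolution_clock::now();

    std::cout << "avg. HPX thread spawn + sync: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(
                     stop - start).count() / iterations
              << " ns\n";
    return 0;
}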

Figure 6.2.: Scheduling overhead when launching a single HPX thread without contention, broken down into stack creation, context switch and scheduler overhead for ARM64, X86-64 and KNL. The scheduler queue maintenance dominates the runtime cost (roughly 91%) on all platforms.

Figure 6.1 shows the cost of spawning an HPX thread in relation to a regular function call and compares this to the cost of doing the equivalent in OpenMP. Due to the compiler assisted implementation of OpenMP, it is significantly faster than the creation of HPX threads. The HPX coroutine merely creates a stack and performs two context switches. As such, it makes sense that full HPX threads, which get scheduled via the scheduler, are two orders of magnitude slower. The overall results are consistent across the different evaluated platforms. It is, however, interesting to note that the KNL platform shows the worst performance. This benchmark shows the relative cost of spawning HPX threads in the context of no contention in the underlying scheduler. It can be seen that having lightweight tasks is beneficial and allows, with minimal overhead, creating more work than hardware resources are available, ensuring steady utilization due to the work stealing algorithms in the scheduler. The cost of creating Operating System threads is about 500 times higher than the HPX thread equivalent on all platforms. Figure 6.2 presents the relative overhead of the different involved components. It can be clearly seen that, with roughly 91% on all platforms, the scheduling policies contribute the most to the actual runtime overheads.

Figure 6.3.: Costs of creating futures (hpx::make_ready_future, hpx::promise, std::promise) on the different platforms, measured in cycles. In comparison to the standard C++ implementation, creating HPX futures is four orders of magnitude faster.

Work stealing and general scheduling create overhead due to the fact that queues of outstanding tasks need to be managed and shared between the concurrently running scheduling algorithms. In order to assess this scheduling overhead, the same benchmark is repeated, now with a varying number of compute resources. In Figure 6.5 the results of this test on X86-64 are presented; the quality of the results on the other two platforms was similar. The results indicate that user level task scheduling is feasible and provides good scalability without sacrificing performance. The caveat is that, unlike for regular functions, the work executed as part of a task needs to be long enough to amortize the overhead, which should be in the range of 100 ms.

Furthermore, it is interesting how the usage of user-level tasks compares to other, similar solutions, for example OpenMP, which also provides task-based interfaces, albeit without each task being suspendable. The conclusion here is that suspendable user level tasks do not induce a performance hit while offering superior semantics, for example the ability to synchronize concurrent accesses without blocking the underlying computational resource and without oversubscribing the available cores. Nevertheless, the results indicate that HPX is not yet able to leverage very fine-grained tasks when compared to the traditional OpenMP parallel for loop. The comparison with the OpenMP task pragma fares better and shows that the HPX approach is in the same ballpark. Yet again, however, it is important to note that a naive approach with std::async in general creates too much overhead to be considered for fine grained tasks.

Figure 6.4.: Costs of spawning a task and returning its result via a future (hpx::async vs. std::async) on the different platforms, measured in cycles. In comparison to the standard C++ implementation, creating HPX tasks is four orders of magnitude faster.

The benchmarks described above merely handle the case of scheduling tasks. What has come up short so far is the actual transport of results of asynchronously executed tasks. In order to assess that, we first look at the achievable throughput of sequentially creating futures which are immediately in a ready state and then compare that with the actual costs of hpx::async and std::async. Figure 6.3 shows that creating the actual synchronization primitives is relatively cheap in comparison to the Operating System primitive based implementation of the C++ Standard Library. Figure 6.4 compares hpx::async with std::async on the three platforms; that is, in addition to spawning a task, the result is transported to the caller via a future. The comparison between the C++ Standard implementation and HPX is similar to Figure 6.3 and shows that Operating System based synchronization of tasks is not well suited to support fine grained tasking systems.

Figure 6.5.: Scheduling speedup for different task granularities (100 ms, 10 ms, 1 ms, 0.1 ms and 0 ms) on X86-64 over the number of cores, comparing hpx::async, OpenMP parallel for, OpenMP task and std::async against the ideal speedup. The overhead of a runtime system determines the granularity which amortizes it; these graphs show the different approaches and their scalability with different task granularities.

6.2.2. HPX Communication Overhead

While Section 6.2.1 focused on the overheads related to local thread creation, this section puts a focus on the RPC facilities and the closely related AGAS, which is needed to perform RPCs transparently without explicit knowledge of the actual locality.

As the first subsystem to evaluate, we choose the serialization mechanism as laid out in Section 4.3.2. Serialization can be considered the main source of overhead, since it has to be performed before sending and again on the receiving end to further process the received parcel. As such, we compare the proposed solution against other pre-existing solutions to determine the viability of the approach. In addition, the various optimization opportunities implemented are evaluated, which are the essence of enabling further exploitation of the underlying network fabrics available in today's supercomputers.

Figure 6.6.: HPX serialization throughput on X86-64 (bandwidth in GB/s over the number of processed bytes) for plain memcpy, plain serialization and parcel serialization. The serialization overhead in comparison to a plain memcpy operation is significant for small messages below 2 KByte; for larger messages, the overhead is amortized.

Figure 6.6 presents the results of serializing a vector of doubles compared to a plain memcpy operation. It can be seen that the overhead of the serialization mechanism is significant for messages smaller than 4 KByte. This can be attributed to the additional bookkeeping that needs to be performed in a generalized fashion, whereas the memcpy variant merely copies the data from the source to the destination. Considering the advanced mechanisms, such as the zero copy optimization together with the intrinsic support for futures and global reference counting, the HPX solution can be considered feasible, especially for regular payloads, for example as needed by halo exchanges, which easily surpass the 2 KByte mark (an equivalent of 250 double values).

Figure 6.7.: Achievable throughput of messages sent via HPX actions on X86-64 (bandwidth in GB/s over payload size) for hpx async and hpx async direct with window sizes 1 and 128, compared to MPI. An MPI based send/recv solution is faster than a single action call; however, once there are multiple actions in flight simultaneously, they achieve a higher throughput starting at 8 KByte of payload.

Once the serialization overhead has been determined, the underlying asynchronous, one sided messaging layer needs to be evaluated; the next layer is the invocation of a possibly remote action. Therefore, we use hpx::async to invoke a free function on a remote locality. This excludes the AGAS layer, since the locality GID lookup is a constant time table lookup. As such, we combine serialization and the underlying asynchronous, one sided message passing with the continuation mechanism that sets the hpx::future ready once the remote task has completed. In Figure 6.7 the throughput of action calls is presented. The one-sided nature of the messages is one of the biggest sources of added overhead in this benchmark. The benchmark is set up to perform a two sided ping pong message exchange to directly measure the throughput. As a consequence, the HPX based solution falls short in the case where there is no other work to perform (window size = 1). However, once there are multiple messages in flight (window size = 128), this disadvantage disappears after the payload exceeds 8 KByte. Given the additional overhead of scheduling tasks and serializing the data, this result is remarkable and puts HPX in the same ballpark as MPI based communication.

Figure 6.8.: Component creation costs (in cycles) for local and remote creation on the different platforms. The local component creation overhead is almost zero; remote object creation, on the other hand, comes at a significant cost due to the involved RPC.

Of course, the effects of AGAS need to be investigated as well. Figure 6.8 therefore sheds light on the costs involved in creating objects inside of AGAS and compares them to the creation of regular objects. It can be seen that the local creation of objects in AGAS comes at almost no additional cost. The remote case, however, is limited by the overheads of the additional RPC calls as well as the additionally scheduled tasks. Even though object creation plays an important role in the lifetime of a C++ program, the number of function calls on those objects is usually much higher. As such, it is important to measure those overheads as well. Figure 6.9 demonstrates them by comparing regular member function calls with local synchronous and asynchronous member function calls through AGAS as well as remote ones. It can be seen that the quality of the overheads is consistent with the creation of objects. However, it is important to note that there is no significant overhead over action calls to free functions; as such, the results shown in Figure 6.7 apply here as well.

Another important aspect is the direct, neighbor to neighbor synchronization as discussed in Section 5.3.3. In addition to the plain synchronization, a payload is usually associated with it, for example the exchange of halo regions in stencil based calculations. As such, the latency for sending and receiving those synchronizations as well as the bandwidth to exchange the payload is critical. The prevalent benchmark to represent this kind of operation is defined in the OSU benchmark suite [67].

Figure 6.9.: Component action costs (in cycles) for a direct call, a local AGAS lookup and a remote AGAS lookup on the different platforms. The cost of calling actions on components is consistent with the cost of calling actions on free functions; remote calls incur an overhead that is easily mitigated by multiple concurrent actions.


Figure 6.10.: HPX channel throughput on X86-64 (bandwidth in GB/s over payload size) for channel send/recv with window sizes 1 and 128, compared to MPI. The channel implements means for point-to-point synchronization; its performance matches the results of regular actions and it can therefore be seen as a feasible method to implement neighborhood communication.

The results of using a channel for this message exchange in comparison to using MPI can be seen in Figure 6.10. In addition to sending only one ping-pong at a time, we amended the benchmark to keep multiple messages in flight, highlighting the computation-communication overlapping capabilities. The results show a behavior similar to what has been discussed for Figure 6.7.

Figure 6.11.: Broadcast throughput on X86-64 (bandwidth in GB/s over the number of nodes) for MPI and HPX (direct and async) with window sizes 1 and 8 and message sizes of 1 B, 4 KB and 1 MB. The broadcast implementation of HPX shows the same qualitative behavior as the MPI implementation; however, the absolute performance is worse.

Last but not least, Figure 6.11 shows the results of the broadcast implementation inside HPX. Since the OSU benchmark suite also defines a set of benchmarks for the global broadcast collective, this serves as the baseline to compare against. It can be seen that this global collective primitive as implemented in HPX severely lacks performance in comparison to the MPI based implementation, even in the case where we have multiple broadcasts in flight. This has to be investigated further but is outside the scope of this thesis. Nevertheless, the qualitative behavior of the operation is comparable between HPX and MPI.

6.2.3. STREAM Benchmark

To conclude the low level benchmarks, this section demonstrates the efficiency, in terms of achievable bandwidth, of the proposed target of execution infrastructure presented in Section 5.2. The de-facto standard STREAM benchmark [61] has been chosen as a reference and ported to HPX using targets and parallel algorithms.

typedef hpx::compute::vector<double, Allocator> vector_type;
// Creating the allocator
Allocator alloc;
// Allocate the data
vector_type a(size, alloc);
vector_type b(size, alloc);
vector_type c(size, alloc);
// Creating the executor
Executor exec;
// Creating the policy used in the parallel algorithms
auto policy = hpx::parallel::execution::par.on(exec);
// Vector Triad
hpx::parallel::transform(policy,
    b.begin(), b.end(), c.begin(), c.end(), a.begin(),
    [scalar](double a, double b) { return a + b * scalar; });

Listing 6.1: The gist of the port of the STREAM benchmark TRIAD kernel using parallel algorithms in an executor and allocator aware fashion, showing performance portability.

Figure 6.12.: STREAM TRIAD benchmark on the ARM64 system (bandwidth in GB/s over the number of cores, OpenMP vs. HPX, for input sizes of 781.25 KB, 7.63 MB and 76.29 MB). HPX achieves comparable performance for the larger input sizes; the smallest one suffers from too fine-grained tasks.

The benchmark contains four measurements to assess the sustainable bandwidth of a given computational platform: COPY, SCALE, ADD and TRIAD. As a non-performance result, it is important to note that the porting to HPX was straightforward using the parallel algorithms (see Listing 6.1). In contrast to the original implementation, the HPX implementation is platform independent and chooses the target to run on based on the passed executor. The implementation solely relies on the premise that the passed container is close to the executor. The results of running the benchmark on our test platforms can be seen in Figure 6.12, Figure 6.13 and Figure 6.14. The task based implementation of the parallel algorithms does not fall short compared to the data-parallel model of OpenMP for larger input sizes. For smaller input sizes, we have too fine-grained tasks, which are scheduled randomly over the cores. This has the effect that the OpenMP solution, which is not only able to handle finer grained tasks but can also exploit cache effects more efficiently, performs better.

Figure 6.13.: STREAM TRIAD benchmark on the X86-64 system (bandwidth in GB/s over the number of cores, OpenMP vs. HPX, for input sizes of 781.25 KB, 7.63 MB, 76.29 MB and 762.94 MB). HPX achieves comparable performance for the larger input sizes; the smaller ones suffer from too fine-grained tasks, and in addition the OpenMP based solution is capable of exploiting cache effects due to its static work dispatching.

6.3. Benchmark Applications

The low-level benchmarks sufficiently characterize the runtime system on a microscopic level. Nevertheless, to assess the full power of the provided functionality, this

Figure 6.14.: STREAM TRIAD benchmark on the KNL system (bandwidth in GB/s over the number of cores, OpenMP vs. HPX, for input sizes of 781.25 KB, 7.63 MB, 76.29 MB and 762.94 MB). The results follow the same pattern as the ARM64 and X86-64 results.

section will focus on 1) a toy stencil application to reiterate and demonstrate the API discussed in this thesis as well as to show its performance impact, and 2) OctoTiger, a production astrophysics simulation code that has been tuned to run at extreme scales during the course of the research for this thesis.

6.3.1. Two-dimensional Stencil Application

The pattern of stencil based computations is common in HPC applications and can be seen as a basic building block in many scientific codes. As such, this part of the thesis discusses the pragmatic solution of applying the Laplace operator to a function, representing the second derivatives of a function f in space (see Figure 6.15):

u = \Delta f = \frac{\delta^2 f}{\delta x^2} + \frac{\delta^2 f}{\delta y^2}, \quad (x, y) \in [0, 1] \times [0, 1], \quad f(0, y) = f(x, 0) = f(1, y) = f(x, 1) = 1

A numerical solution to this problem can be obtained by discretizing the spatial domain and using the finite central difference method. We obtain a sparse matrix to which the Jacobi method can be applied [32]. As such, the pseudo algorithm to calculate the solution can be formulated as shown in Listing 6.2. That is, each element in the two-dimensional grid is calculated by the mean of the surrounding elements of the previous iteration.

Figure 6.15.: The PDE result: calculated grid elements with a constant boundary of 1. Two grids are allocated to store the results of the previous iteration i and the current iteration i + 1.

for i in 0...I
    for x in 1...Nx
        for y in 1...Ny
            u[i+1](x, y) = 0.25 * (u[i](x-1,y) + u[i](x+1,y) + u[i](x,y-1) + u[i](x,y+1)) - u[i](x,y)

Listing 6.2: The 2D Stencil pseudo algorithm showing the application of a 5-point stencil over the entire grid.

The remainder of this Section will develop a performance portable, parallel solver for this problem using modern C++ techniques which is usable for shared and distributed memory and makes use of available accelerators. The performance implications and drawbacks will be shown during the introduction of the subsequent steps. The full code, with different sources for the different steps can be found at GitHub5. The commit used for this evaluation section is 4814c7c.

5https://github.com/STEllAR-GROUP/tutorials/tree/master/examples

Modeling the Stencil in C++

Figure 6.16.: The 2D Grid with a dimension Nx·Ny.

As the starting point, the foundation is to model the basic C++ classes such that we are able to efficiently express our solver. As a first prerequisite, we want to have a linear, contiguous block of memory to store the grid (see Figure 6.16). Using std::vector for that purpose sounds like a natural choice; however, we want to keep the developed solution generic to account for other possible containers (for example, the target aware vector introduced in Section 5.2.6).


Figure 6.17.: The access to grid elements is implemented over a row-major contiguous block of memory: the i-th row starts at offset i * Nx, and element j within that row is accessed at i * Nx + j.

Indexing an element in the linearized array using two-dimensional coordinates can be expressed with the formula index = (i * Nx) + j, where 0 <= i < Ny and 0 <= j < Nx. That is, i denotes the i-th row and j the j-th column of the underlying grid (see Figure 6.17). With this indexing scheme, a row-major order is realized.
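For reference, the indexing can be captured in a small helper; the function name is illustrative and not part of the original code:

#include <cstddef>

// Row-major index of element (i, j) in the linearized Nx-by-Ny grid.
inline std::size_t index(std::size_t i, std::size_t j, std::size_t Nx)
{
    return i * Nx + j;
}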

Iterators are the go-to solution in C++ for traversing the elements of a container. While the iterator provided by std::vector would serve the purpose of traversing our grid, it becomes cumbersome to calculate the upper and lower neighbors needed for the stencil update (see Figure 6.18). Instead, the algorithm we want to build should have an abstraction to iterate one line after the other while making the access to our upper and lower neighbors convenient (see Listing 6.3).

Figure 6.18.: The 5-point stencil needs access to the rows above and below the current, to be updated, element index.

template <typename InIter, typename OutIter>
OutIter line_update(InIter begin, InIter end, OutIter result)
{
    ++result;
    // Iterate over the interior: skip the last and first element
    for (InIter it = begin + 1; it != end - 1; ++it, ++result)
    {
        *result = 0.25 * (it.up[-1] + it.up[+1] + it.down[-1] + it.down[+1])
            - *it.middle;
    }
    ++result;

    return result;
}

Listing 6.3: The concrete line update of the stencil implemented in C++ via specialized iterators for one particular row in the grid.

Realizing this abstraction can be achieved by developing iterator adapters that are organized hierarchically: one iterator that represents a single line, where an increment advances the element pointed to by one (see Listing 6.4), and one iterator that advances row by row (see Listing 6.5).

In addition to providing the special semantics of a grid traversal, the presented iterators carry along the information needed to access the neighbors as required by the presented algorithm. With this little prelude, the ability to write generic, yet highly domain specific code, resulting in a high level of abstraction, using modern C++ techniques has been presented. It is important to note that these iterators serve as the main building blocks for adding the various forms of parallelism.

template <typename UpIter, typename MiddleIter, typename DownIter>
struct line_iterator
{
    UpIter up;
    MiddleIter middle;
    DownIter down;

    void increment()
    {
        ++up;
        ++middle;
        ++down;
    }

    void decrement()
    {
        --up;
        --middle;
        --down;
    }

    void advance(std::ptrdiff_t n)
    {
        up += n;
        middle += n;
        down += n;
    }

    double dereference() const
    {
        return *middle;
    }
};

Listing 6.4: The stencil line iterator is a thin wrapper over the upper, middle and lower rows to allow for efficient access to the required elements.

template <typename UpIter, typename MiddleIter = UpIter,
    typename DownIter = UpIter>
struct row_iterator
{
    typedef line_iterator<UpIter, MiddleIter, DownIter> line_iterator_type;

    row_iterator(std::size_t Nx, MiddleIter middle_)
      : up_(middle_ - Nx)
      , middle(middle_)
      , down_(middle_ + Nx)
      , Nx_(Nx)
    {}

    line_iterator_type dereference() const
    {
        return line_iterator_type{up_, middle, down_};
    }

    void advance(std::ptrdiff_t n)
    {
        up_ += (n * Nx_);
        middle += (n * Nx_);
        down_ += (n * Nx_);
    }

    UpIter up_;
    MiddleIter middle;
    DownIter down_;
    std::size_t Nx_;
};

Listing 6.5: The stencil row iterator serves the purpose to formulate an abstraction to advance from one row to the next. And allows to derefernce this row to allow to iterate from one element to the next (see Listing 6.4) 6.3. Benchmark Applications 109

With these building blocks in place, the solver can be implemented:

// Initialization
typedef std::vector<double> data_type;
std::array<data_type, 2> U;

U[0] = data_type(Nx * Ny, 0.0);
U[1] = data_type(Nx * Ny, 0.0);
init(U, Nx, Ny);

typedef row_iterator<std::vector<double>::iterator> iterator;

// Construct our row iterators. We want to begin with the second
// row to avoid out of bound accesses.
iterator curr(Nx, U[0].begin());
iterator next(Nx, U[1].begin());

for (std::size_t t = 0; t < steps; ++t)
{
    // We store the result of our update in the next middle line.
    // We need to skip the first row.
    auto result = next.middle + Nx;

    // Iterate over the interior: skip the first and last row
    for (auto it = curr + 1; it != curr + Ny - 1; ++it)
    {
        result = line_update(*it, *it + Nx, result);
    }

    std::swap(curr, next);
}

Adding shared memory parallelism

Having laid out the basic algorithm, and keeping the availability of parallel algorithms in mind, it becomes trivially parallelized (see Listing 6.6). That is, by applying the correct parallel algorithm, the outer loop is parallelized. Unfortunately, the loop construct presented here is not part of standard C++; nevertheless, it is a perfect fit for this application. It resembles a regular for-loop and is, as such, very similar to how one would parallelize a loop with #pragma omp for. The important difference is that it is a first-class C++ citizen and, as such, supports our iterator wrapper out of the box. The induction6 is performed in each iteration.

// We store the result of our update in the next middle line.
hpx::parallel::for_loop(policy, curr + 1, curr + Ny - 1,
    // We need to advance the result by one row each iteration
    hpx::parallel::induction(next.middle + Nx, Nx),
    [Nx](iterator it, data_type::iterator result)
    {
        line_update(*it, *it + Nx, result);
    });

Listing 6.6: Stencil parallelized, shared memory

Adding Support for different Compute Targets

(Plot: Stencil Example (Single Node Scaling); GLUPS over the number of cores on ARM64, X86-64 and KNL, for 1, 2, 4 and 8 partitions.)

Figure 6.19.: Stencil performance on various architectures using shared memory parallelism. The behavior shown matches the expected memory-bound nature of the application for ARM64 and X86-64.

Having the basic parallelization in place, the next step is to maximize resource usage.

6 Induction in the context of a loop is the code that is executed after each iteration.

For one, it is apparent that we cannot immediately benefit from the advantages of NUMA architectures with the code in Listing 6.6, nor are accelerator devices covered. By having designed the underlying infrastructure in a generic manner, adding support for the aforementioned targets is trivial: we switch to the target aware vector implementation and use the appropriate allocator as well as executor (see Section 5.2.3 and Section 5.2.4). The executor is used with the policy of the parallel algorithm, and the allocator with the vector. The iterator implementation does not need to be touched and can be reused without change.
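As a hedged sketch of what this switch might look like, the snippet below uses the HPX compute facilities (hpx::compute::vector, host::block_allocator, host::block_executor, host::numa_domains) that correspond to the target aware vector, allocator and executor referenced above; the exact header and type names may differ between HPX versions, and the snippet is illustrative rather than the thesis's actual code.

// Sketch only: type and header names follow the HPX compute API and
// are assumptions, not the thesis code.
#include <hpx/hpx_main.hpp>
#include <hpx/include/compute.hpp>
#include <hpx/include/parallel_algorithm.hpp>
#include <cstddef>

int main()
{
    // Discover the NUMA domains of this node as compute targets.
    auto targets = hpx::compute::host::numa_domains();

    // Allocator placing the grid data on those targets ...
    using allocator_type = hpx::compute::host::block_allocator<double>;
    using data_type = hpx::compute::vector<double, allocator_type>;
    allocator_type alloc(targets);

    // ... and an executor running the stencil tasks close to that data.
    hpx::compute::host::block_executor<> exec(targets);

    std::size_t const Nx = 1000, Ny = 1000;
    data_type U0(Nx * Ny, 0.0, alloc);

    // The executor is attached to the parallel execution policy; the
    // iterator machinery shown before can be reused without change.
    auto policy = hpx::parallel::execution::par.on(exec);
    (void) policy;   // ... run the stencil with `policy` as before ...
    return 0;
}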

Figure 6.19 presents the Giga Lattice Updates per Second (GLUPS) that can be obtained using the presented solution on the various platforms. Since the presented stencil update step is memory bound on ARM64 and X86-64, we can see a flattening of the achieved performance once the memory controller is saturated. On the KNL platform, we do not see this flattening due to the lack of vector instructions generated by the compiler. The application was run on a 10000x10000 grid on the X86-64 and KNL platforms and, due to less available memory, on a 2000x2000 grid on the ARM64 platform.

Adding Distributed Memory support

For adding support for distributed memory, the choice was to model the send/receive pattern usually found in MPI based applications, and as such, follows the Single Program Multiple Data (SPMD) model. The synchronization between the differently running processes is achieved by employing multiple channels (see Section 5.3.3). By assigning symbolic names to each instance of the channel, the concept of a communicator can be established. The decomposition of the data, for the sake of simplicity, has been chosen to be striped. That is, each partition has an upper and lower neighbor (with the exception of the first and last one). By assigning a rank to each partition, the needed channels can be named after the rank and whether they correspond to the first or last row. This makes up the communicator which is then used for the neighborhood synchronization through ghost zone exchange.
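A minimal sketch of such a communicator, built on HPX channels registered under symbolic names, could look as follows. Only the channel API itself (register_as, connect_to, epoch-based get/set) is taken from HPX; the class name, the channel naming scheme and the member functions are illustrative assumptions rather than the thesis's exact implementation.

#include <hpx/hpx.hpp>
#include <string>
#include <utility>
#include <vector>

struct communicator
{
    enum neighbor { up = 0, down = 1 };
    using channel_type = hpx::lcos::channel<std::vector<double>>;

    communicator(std::size_t rank, std::size_t num_ranks)
      : rank_(rank), num_ranks_(num_ranks)
    {
        // Receiving ends live on this locality and are registered under
        // a symbolic name derived from our rank; sending ends connect to
        // the channel the respective neighbor registered.
        if (has_neighbor(up))
        {
            recv_[up] = channel_type(hpx::find_here());
            recv_[up].register_as("stencil/up/" + std::to_string(rank));
            send_[up].connect_to("stencil/down/" + std::to_string(rank - 1));
        }
        if (has_neighbor(down))
        {
            recv_[down] = channel_type(hpx::find_here());
            recv_[down].register_as("stencil/down/" + std::to_string(rank));
            send_[down].connect_to("stencil/up/" + std::to_string(rank + 1));
        }
    }

    bool has_neighbor(neighbor n) const
    {
        return n == up ? rank_ > 0 : rank_ < num_ranks_ - 1;
    }

    // Charge the neighbor's channel with the boundary row for time step t.
    void set(neighbor n, std::vector<double> data, std::size_t t)
    {
        send_[n].set(std::move(data), t);
    }

    // Future for the ghost row belonging to time step t.
    hpx::future<std::vector<double>> get(neighbor n, std::size_t t)
    {
        return recv_[n].get(t);
    }

    std::size_t rank_, num_ranks_;
    channel_type recv_[2], send_[2];
};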

This domain-specific abstraction allows us to integrate both the communication and the synchronization with the required neighbors efficiently and elegantly (see Listing 6.7). Important to note here is that the ghost zones do not need to be allocated explicitly, a priori, in our underlying storage. In fact, after receiving the corresponding row from our channel, we can use an iterator pointing into that row, reuse the line_iterator, and use the received vector immediately in the previously written, generic algorithm to update the corresponding line. The update for the last row is symmetric to the update of the first, with the only notable change that we use the received vector as the bottom line of the last row instead of the top line. The synchronization takes place once the line update has been completed, and the respective channel is charged with the updated values.

if (comm.has_neighbor(communicator_type::up))
{
    // Get the first row.
    auto result = next.middle;

    // Retrieve the row which is 'up' from our first row.
    std::vector<double> up = comm.get(communicator_type::up, t).get();

    // Create a row iterator with that top boundary.
    auto it = curr.top_boundary(up);

    // After getting our missing row, we can update our first row.
    line_update(it, it + Nx, result);

    // Finally, we can send the updated first row for our neighbor
    // to consume in the next timestep. Don't send if we are on
    // the last timestep.
    comm.set(communicator_type::up,
        std::vector<double>(result, result + Nx), t + 1);
}

Listing 6.7: Stencil communicator, first row update

Futurization of the boundary exchange

Looking at the code presented so far, the next natural step is to apply futurization (see Section 3.3.3) to the developed application to further improve the overlap of computation and communication. It is important to note that, so far, there is explicit waiting both on the synchronization between neighbors and on the result of the parallel algorithm that computes the update of the inner region, which does not depend on the boundary exchange.
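A hedged sketch of one futurized time step is shown below. It reuses the names introduced in the previous listings (iterator, data_type, line_update and the communicator sketched above) and relies on HPX's task execution policy, dataflow and when_all; the loop bounds and the function itself are illustrative, not the thesis's exact implementation.

// Illustrative sketch only: one futurized time step.
#include <hpx/hpx.hpp>
#include <hpx/include/parallel_for_loop.hpp>
#include <vector>

hpx::future<void> do_timestep(communicator& comm, std::size_t t,
    iterator curr, iterator next, std::size_t Nx, std::size_t Ny)
{
    // Interior update: with the task policy the parallel for_loop
    // returns a future instead of blocking the calling thread.
    hpx::future<void> interior = hpx::parallel::for_loop(
        hpx::parallel::execution::par(hpx::parallel::execution::task),
        curr + 2, curr + Ny - 2,
        hpx::parallel::induction(next.middle + 2 * Nx, Nx),
        [Nx](iterator it, data_type::iterator result)
        {
            line_update(*it, *it + Nx, result);
        });

    // First-row update: attach the line update as a continuation to the
    // ghost row arriving through the channel (comm must outlive the step).
    hpx::future<void> first = hpx::dataflow(
        [&comm, t, curr, next, Nx](hpx::future<std::vector<double>> up)
        {
            std::vector<double> row = up.get();
            auto it = curr.top_boundary(row);
            auto result = next.middle;
            line_update(it, it + Nx, result);
            comm.set(communicator::up,
                std::vector<double>(result, result + Nx), t + 1);
        },
        comm.get(communicator::up, t));

    // The last row is handled symmetrically (omitted here). The returned
    // future becomes ready once all parts of the time step are done.
    return hpx::when_all(interior, first).then([](auto&&) {});
}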

Over-subscription for further latency hiding

After having the overlap of communication and computation mostly under control, there are still minor gaps inside the task trace. One reason for that is that the communication part cannot be sped up indefinitely, and a mechanism to further hide latencies should be implemented. When revisiting the communicator presented previously, we can observe that it is not bound to a particular process. Instead, it can be used as a means to synchronize between different partitions. The natural conclusion is to allow for more than one communicator per process. The fact that local exchange of boundary data through the on-node network is faster than off-node communication leads to a reduction in latencies of parts of the communication, while the computation, on average, stays the same.

(Plots: Stencil Example (Distributed Run); GLUPS and parallel efficiency [%] over the number of nodes for ARM64, X86-64 and KNL, with 1, 2, 4 and 8 partitions and the ideal scaling for comparison.)

Figure 6.20.: Distributed weak scaling of the stencil application. We can see that the application developed shows excellent weak scaling properties, achieving a parallel efficiency between ~87% and ~96% on our evaluation platforms.

As a result, this section developed a version of a 5-point stencil that shows excellent weak scaling behavior (see Figure 6.20). The weak scaling experiment kept the number of elements in the x-direction constant, and used 10000 elements per node on X86-64 and KNL and 2000 elements per node on ARM64 in the y-direction. The graphs show that our application developed on the HPX runtime system is able to scale almost perfectly to up to 8192 nodes (557056 cores) with a parallel efficiency of 87%. This is achieved by futurization and the ability to easily overlap communication and computation. As such, even though the individual performance metrics as shown in Section 6.2 might give the impression that HPX is not made for regular applications, by exploiting the properties of the system and implementing an asynchronous, futurized algorithm, we can leverage the largest scales available today.

6.3.2. OctoTiger

OctoTiger is a 3D, octree-based, finite-volume AMR hydrodynamics code with Newtonian gravity. The astrophysical fluid is modeled using the inviscid Euler equations and solved using a finite-volume central scheme [57]. The gravitational potential and force are computed using a modified version of the Fast Multipole Method (FMM) [20]. This enables the simulation of the dynamic evolution of binary star systems with unprecedented physical realism. The code's capability to conserve angular momentum at scale is novel and facilitates long-running ab initio simulations spanning thousands of orbits, such as stable mass transfer binaries (see Figure 6.21), that would otherwise be inaccurate.

OctoTiger discretizes the computational domain on a 3D octree AMR structure. Each node in the structure is an N × N × N Cartesian sub-grid (in this work, N = 8), and may be further refined into eight child nodes, each containing their own N × N × N sub-grid with twice the resolution of the parent. The AMR structure is "properly nested", meaning there is no more than one jump in refinement level across adjacent leaf nodes. OctoTiger does not refine in time; we are unaware of a technique for temporal refinement that conserves angular momentum. This spatial discretization scheme directly leads to an over-subscription of the cores in the system, by placing more than one sub-grid per core on a computational node. In fact, the number of sub-grids per core directly influences the amount of achievable parallelism during the simulation and thus defines the achievable scaling and parallel efficiency of OctoTiger. In the present implementation, the refinement criterion is based solely on density. OctoTiger checks the refinement criterion every 15 time steps to see if refinement or coarsening is required, introducing a need for dynamic load balancing. This is the minimal number of time steps required for a feature of the flow to propagate two cells. Each refinement or coarsening step causes the sub-grids to be redistributed for load balancing.

The focus of this section is to highlight the application of futurization to the previously existing code, which transforms a seemingly sequential code into a wait-free asynchronous application in order to scale to 643,280 Intel Knights Landing cores on the Cori Supercomputer, reaching a parallel efficiency of 96.8%.

Figure 6.21.: Double-white-dwarf systems with a low enough mass ratio exhibit stable mass transfer. If the accretor is small enough, the accretion stream will miss the donor on the first pass and form a disc. The OctoTiger model of a 0.2 M⊙ donor, shown with 9 levels of refinement, is such a system. These phenomena, known as AM Canum Venaticorum (AM CVn) systems, exist in a state of mass transfer for millions of years as the donor slowly transfers matter to the accretor and the orbital separation widens. When mass transfer first ensues, they have orbital periods of only a few minutes. The mass transfer causes the orbit to separate, and the AM CVns we observe typically have periods between 10 and 40 minutes. Periodic instabilities in the disc can result in dwarf novae. The buildup of helium in the disc can periodically detonate as a sub-luminous supernova known as a Type ".Ia" supernova. Stable mass transfer cases require thousands of orbits to simulate and thus necessitate both extreme computation scales and the conservation of angular momentum.

A Futurized Octree Based AMR Algorithm

Applying Futurization to OctoTiger is motivated by the desire to eliminate global barriers. While not all aspects of an AMR-based code require a global barrier, re-balancing the tree as well as computing and distributing the maximum allowed time-step size do.

The goal was to perform the algorithm on a natural basis: traverse the tree recursively, apply a transformation (that is: refine/coarsen, migrate children, perform a time step) and return a result representing the visitation of a particular level. To demonstrate the approach, a tree_node provides our generic data structure used as the tree representation, containing all necessary member variables and functions. The obvious, naive way of traversing such a tree can be formulated recursively as shown in Listing 6.8. If a specific node is refined, that is, if it contains children, we can recurse in a depth-first visitation pattern. The result of the current node (denoted by compute_result) can be combined with the results returned by the traversal of the children, leading to a natural way to gather all results and propagate values from the leaves to the root. The downside of this approach is that a stack overflow might occur when dealing with trees of great depth.

T tree_node::traverse()
{
    if (is_refined)
    {
        // 8 for children, 1 for this node.
        std::array<T, 9> results;
        for (int i = 0; i < 8; ++i)
            results[i] = children[i].traverse();
        results[8] = compute_result();
        return combine_results(results);
    }
    else
        return compute_result();
}

Listing 6.8: Natural recursive tree traversal

By applying the Futurization techniques as described in Section 3.3.3, the formulation of the tree traversal can be parallelized trivially. Listing 6.9 outlines the generic traversal algorithm. Instead of a direct recursion, the algorithm does the recursion asynchronously, and the computation for a given tree_node, representing a subgrid in OctoTiger, is automatically overlapped with those of the children. When communication with possibly remote tree_node objects takes place, it is transparently hidden by other ongoing computations.

The extension to a distributed memory application using AGAS is straightforward. As our initial approach uses an object-oriented design, the tree_node objects are placed into AGAS. They are referencable by a Globally Unique Identifier (GID), which is used as a handle to dispatch work to wherever the object has been placed. These function calls are called actions, a highly efficient Remote Procedure Call (RPC) mechanism returning a future that blends perfectly into the Futurization picture. The algorithm, as shown in Listing 6.9, doesn't change at all from a semantics point of view, but when the tree nodes are distributed over various compute nodes, it becomes automatically parallelized for a distributed memory system, since the actions are executed wherever the objects have been placed.

hpx::future<T> traverse(tree_node const& t)
{
    if (is_refined)
    {
        // 8 for children, 1 for this node.
        array<hpx::future<T>, 9> results;
        for (int i = 0; i < 8; ++i)
            results[i] = async(traverse, children[i]);
        results[8] = compute_result();
        return when_all(results).then(combine_results);
    }
    else
        return make_ready_future(compute_result());
}

Listing 6.9: Futurized recursive tree traversal

The futurized tree traversal is the basic template and pattern used to implement the functionality in OctoTiger. The functionality is realized with load, regrid, step and save. regrid represents the scalable algorithm performing refinement/coarsening of an existing octree. It is implemented as five distinct futurized tree traversals. The first checks if a given node in the octree requires refinement and ensures the correct refinement levels of its neighbors. Once this initial step is performed, the second traversal collects the number of tree nodes in total and the number in each child. With this information, the rebalancing step can be performed. The distribution of the individual tree nodes to localities uses a space-filling curve. Tree nodes whose location has changed are moved to their new localities by their parent tree nodes. At the end of this procedure, every tree node has correct references for its parent and child nodes, but the remaining references to neighboring tree nodes may be incorrect due to the rebalancing. Because of this, after every regridding step, the tree is traversed once more, with parents updating the neighbor references for their children. step implements the computation of the physical properties. OctoTiger combines numerical solvers for fluid hydrodynamics and Newtonian self-gravity into the octree structure. Each tree node has its own N³ sub-grid containing the evolved variables for its own sub-domain.

The futurized fluid solver, gravity solver and time-step size propagation differ from the other futurized tree traversals. The hydrodynamics and gravity solvers depend on ghost zone data from neighboring regions and on the dynamically computed time-step size from a reduction over the previous time-step's entire octree.

(Schematic: a 2D mesh partitioned into four sub-grids A, B, C and D with their channels.)

Figure 6.22.: A simple example of a solver with channel-based asynchronous ghost zone exchange on a 2D mesh partitioned into 4 subgrids. This technique allows computation to proceed as far as possible without needless waiting. First, all sub-grids begin computing the first sub-step (red). Sub-grids A, B and D finish their computations and send ghost zone data to their neighbors' channels. Then, A, B and D combine the dependencies of the next sub-step with when_all and attach a continuation that computes the next sub-step to the resulting future. C is still computing the first sub-step, so it has not sent the first sub-step ghost zone data to A or D, and the second sub-step continuation for A and D does not start yet (C, however, has received ghost zone data from A and D). B's continuation for the second sub-step has all the data it needs, so that computation is executed (blue). When it is completed, a third sub-step continuation is created and the second sub-step ghost zone data is sent from B to A and D's channels, where it is effectively buffered. In OctoTiger, the dependencies are more complicated, due to 3D geometry, coarse-fine boundaries, and other communications such as flux corrections from children nodes to parent nodes.

To avoid introducing needless waiting during these solves and exchanges, OctoTiger uses a channel to propagate results between neighboring regions. channels are primitives that represent a series of futures that will be produced asynchronously. A consumer can request a future for a particular "epoch" (e.g. an integer) from a channel, and a producer can set a value for an epoch. The associated state and storage for each epoch's future is only created on demand, e.g. after either a consumer or producer requests the epoch. channels allow producers and consumers to agree on a location (the channel) where they will communicate repeatedly. channels transparently buffer values on the fly when needed, allowing producers to proceed ahead of consumers, avoiding needless waiting.
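The following minimal example illustrates these channel semantics in isolation. The channel type is HPX's; the epoch numbers and values are made up for the example (the zero-based epoch numbering mirrors the time-step index used in Listing 6.7 and is an assumption).

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <cassert>

int main()
{
    hpx::lcos::channel<int> ghost(hpx::find_here());

    // Producer: set values for two epochs; it does not have to wait
    // for a consumer, the channel buffers the values.
    ghost.set(10, 0);
    ghost.set(11, 1);

    // Consumer: request the future belonging to a particular epoch.
    hpx::future<int> f0 = ghost.get(0);
    hpx::future<int> f1 = ghost.get(1);

    assert(f0.get() == 10);
    assert(f1.get() == 11);
    return 0;
}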

Our asynchronous solvers use channels to allow computation to proceed as far as possible, and to overlap communication and computation. After the solve for one sub-step is complete, a tree node will retrieve futures from its channels and attach continuations to the futures (via when_all) that will compute the next solve. The continuation will only begin executing when the necessary data for that sub-step has been sent to the channels. A simplified example of a channel-based asynchronous ghost zone exchange is shown in Figure 6.22.

Advancing the hydrodynamics variables of each cell by a single sub-step in time requires knowledge of the variables in the neighboring three cells on each side and in each dimension (i.e. "ghost zones" of width 3). These "ghost zones" are updated after every sub-step, with each tree node sending the required data from its interior (non-ghost zone) cells to its neighboring tree nodes. When a leaf node has no neighboring tree node at the same level of refinement, the ghost zones are interpolated from the neighboring tree nodes of its parent.

Like the hydrodynamics solver, the FMM solver also requires data from neighboring tree nodes on the same level. It also requires data from its child and parent tree nodes. Each tree node executes three steps for the FMM algorithm:

1. Compute multipoles by combining multipoles from child tree nodes, and communicate multipoles to its parent and the relevant subsets to its neighboring tree nodes.

2. Compute the Taylor expansion of the gravitational interaction between multipoles, including those from neighboring tree nodes on the same refinement level.

3. Add to its own Taylor expansions the Taylor expansion of the gravitational potential from the parent tree node, and communicate the total Taylor expansions to the child tree nodes (the chaining of these three phases with futures is sketched below).
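The three phases can be chained per tree node in the same futurized style as the generic traversal. The sketch below is purely illustrative: the types multipoles and expansions as well as the phase functions are hypothetical placeholders, not OctoTiger's actual interfaces.

#include <hpx/hpx.hpp>

// Purely illustrative placeholders.
struct multipoles {};
struct expansions {};

struct tree_node
{
    // Placeholder phase implementations.
    hpx::future<multipoles> compute_multipoles()
    {
        return hpx::make_ready_future(multipoles{});
    }
    expansions compute_interactions(multipoles const&) { return {}; }
    void propagate_expansions(expansions const&) {}

    hpx::future<void> fmm_step()
    {
        // Phase 1: combine the children's multipoles and exchange them.
        hpx::future<multipoles> m = compute_multipoles();

        // Phase 2: once the required multipoles are available, compute
        // the Taylor expansions of the gravitational interactions.
        hpx::future<expansions> e = m.then(
            [this](hpx::future<multipoles> f)
            {
                return compute_interactions(f.get());
            });

        // Phase 3: add the parent's expansion and push the totals down
        // to the children.
        return e.then(
            [this](hpx::future<expansions> f)
            {
                propagate_expansions(f.get());
            });
    }
};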

Each cell of FMM data requires data from its neighboring cells within a radius of four cells around it. Because of the large amount of neighboring data required, rather than using ghost cells for the FMM solver, the data from neighboring cells is discarded once the relevant interactions are computed.

(Plots: total threads executing and power consumption (Watts) over time, before and after futurization of the regrid algorithm.)

Figure 6.23.: APEX concurrency views of OctoTiger running on 1024 nodes of Cori (higher values mean better system utilization). The first figure shows two long serialization periods, after the checkpoint load and grid creation stage (first yellow bulge) and after the first gravity solve (center), followed by two time steps. The figure on the right shows the same sequence after applying the futurizing methodology to make the regrid algorithm more scalable. Each color represents an HPX task type, and the aggregate power consumption (Watts) across nodes is visible as a dashed black line. The axis scales are equivalent.

The effectiveness of the Futurization of OctoTiger was observed by profiling the application with APEX [38, 37] at scale (see Figure 6.23). The efficiency of our asynchronous, wait-free tree traversal is determined by the speed at which the underlying runtime system is able to spawn new tasks. The following sections present results from our experiments on the Cori Supercomputer which prove the scalability of this approach. As a first measure, we look at single-node scalability to determine a suitable number of cores and processing elements per core for performing the distributed memory full-system scaling experiments.

Node-Level Scaling

Figure 6.24 presents scalability on a single KNL node for 7 levels of refinement (LoR) performing 10 solve steps. For that purpose, we ran the application with different numbers of hyperthreads (HTs) per core. This experiment shows that using two hyperthreads per core gives the overall best performance, with a total speedup of 74.6, a 1.3× improvement over using one HT per core, which in turn exhibits a parallel efficiency of 87%.

This clearly demonstrates the applicability of the futurization technique, showing its effect for a many-core system like the KNL. By having the system oversubscribed with ~24 tree nodes per core, we are able to exploit the on-node parallelism efficiently. Increasing the number of subgrids per core even further does not show significant improvements with respect to scalability. Reducing this number, however, leads to a drop in concurrency.

(Plot title: Speedup and Parallel Efficiency of OctoTiger (single node, 7 Levels of Refinement, 1641 sub-grids).)

Figure 6.24.: Speedup (yellow: 1 hyperthread, red: 2 hyperthreads, blue: 4 hyperthreads, left axis) and parallel efficiency (green, right axis) of OctoTiger strongly scaling up on a single KNL node relative to one core. The graphs show the speedup and efficiency results for the total application runtime. The application achieves a speedup of ~55.9 when scaling from 1 to 64 cores (one hyperthread), which corresponds to a parallel efficiency of 87%.

Full-System Scaling

To assess the scalability and distributed performance of our futurized application, we performed strong scaling runs for different LoRs. Figure 6.25 provides an overview of those results. They show that OctoTiger is able to sustain good scalability over the different amounts of subgrids to process. Since each LoR experiment is considered to be strong scaling in itself, the speedup naturally flattens off after the number of subgrids per core drops below a certain value. In our experiments, this happened at ~100 subgrids per node. The overall speedup from 10 LoR at 16 compute nodes (~44 subgrids per core) to 14 LoR at 9640 compute nodes (~171 subgrids per core) is ~342×, which corresponds to a parallel efficiency of ~56.8%. Since the number of subgrids per core is not the same, the comparison does not fully reflect the scalability of OctoTiger. However, it can be clearly observed that excellent scalability can be achieved by providing sufficient work to be executed by each core and the network.

To further discuss the scalability of OctoTiger and show the effects of futurization at scale, Figures 6.26 and 6.27 provide a breakdown of the different application stages.

When looking at the different stages of the application when executing the 13 LoR and 14 LoR problems, it can be observed that the discussed futurized tree traversal (performed during all stages of OctoTiger) is indeed able to scale up to the full Cori system using 655,520 cores. While the initialization of the problem, which includes loading the initial octree from disk, is hampered by I/O limitations and therefore shows a reduction in parallel efficiency, performing the actual computation exhibits a parallel efficiency of 96.8% for the strong scaling of the 14 LoR problem from 4096 nodes to 9640 nodes. We started the scaling experiment for the 14 LoR problem at 4096 nodes as it does not fit on a smaller number of compute nodes. On the other hand, the scaling of the 13 LoR problem is limited by the lack of work, as the number of subgrids per core drops to ~51 at 4096 nodes, causing the parallel efficiency in this case to be reduced to ~87%.

(Plot: number of sub-grids processed per second (in thousands) over the number of nodes and cores, for 10 to 14 levels of refinement.)

Figure 6.25.: Number of subgrids per second processed for different numbers of levels of refinement (LoR). This graph indicates close-to-perfect scalability for the different LoR, with a clear improvement of performance from one LoR to the next by improving latency hiding of communication through higher over-subscription (higher number of sub-grids per core).

(Plot title: Speedup and Parallel Efficiency of OctoTiger (13 Levels of Refinement, ~14.5 million sub-grids); bars and lines for the initialization, the regridding from 12 to 13 levels of refinement, the computation and the total runtime, at 1536 / 104448, 2048 / 139264 and 4096 / 278528 nodes / cores.)

Figure 6.26.: Speedup (bars, left axis) and parallel efficiency (lines, right axis) of OctoTiger strongly scaling up a problem with 13 levels of refinement relative to 1024 KNL nodes / 69632 cores on Cori. The graphs show separate speedup and efficiency results for three application stages (initialization, initial regridding from the 12 levels of refinement restart file to 13 levels of refinement, and the actual computation) and for the total runtime. The computational phase achieves a speedup of 3.46 when scaling from 1024 to 4096 nodes, which corresponds to a parallel efficiency of 87%.

(Plot title: Speedup and Parallel Efficiency of OctoTiger (14 Levels of Refinement, ~111 million sub-grids); bars and lines for the initialization, the regridding from 12 to 14 levels of refinement, the computation and the total runtime, at 8192 / 557056 and 9460 / 643280 nodes / cores.)

Figure 6.27.: Speedup (bars, left axis) and parallel efficiency (lines, right axis) of OctoTiger strongly scaling up a problem with 14 levels of refinement relative to 4096 KNL nodes / 278528 cores on Cori. The graphs show separate speedup and efficiency results for three application stages (initialization, initial regridding from the 12 levels of refinement restart file to 14 levels of refinement, and the actual computation) and for the total runtime. The computational phase reaches a speedup of 2.24 when scaling from 4096 to 9460 nodes, which corresponds to a parallel efficiency of 96.8%.

7. Conclusion

This thesis presented the HPX parallel runtime system, which extends the existing concepts for parallelism and concurrency introduced in the C++ programming language with the capability to support distributed memory systems as well as accelerators. Those extensions preserve the syntax and semantics of local asynchronous task invocations by relying on AGAS, a distributed global address space. By fully adhering to the principle of work following data, interfaces supporting complex hierarchies of both computational entities and memory subsystems have been presented.

The presented solution, in the form of the HPX parallel runtime system, is an attempt to show that the paradigm shift away from BSP and Fork/Join style parallel programming is not only possible but feasible. The provided performance evaluation has shown that HPX offers a performance-portable way to provide application scalability beyond the possibilities of today's programming models. This is achieved by providing an API that strictly adheres to a modern, high-productivity programming standard, allowing it to exploit systems through high resource utilization.

By offering micro-benchmarks that allow determining the associated overheads of the runtime system, it has been shown that those overheads are either minimal or can be mitigated by applying the presented futurization technique. This provides the ground for future optimization efforts and, in addition, shows that the developed techniques can be used today without impeding the performance of the overall system. In fact, the analysis shows the potential of the presented solution.

Having shown example applications that demonstrate the expressiveness as well as the performance portability adds to the claim of having a feature-rich solution to today's challenges of increasing parallelism. The 2D stencil example fully demonstrates the feasibility of the presented claims and the applicability of the programming model to a problem which can be considered to fall in the realm of traditional programming models like MPI.

In addition, the presentation of the control flow of the OctoTiger application, with performance results obtained by running on one of the most powerful supercomputers today, proves the claim that futurization in general and HPX in particular is indeed a feasible approach to solve tomorrow's exascale programming challenges.

Even though the benchmarks presented in this thesis demonstrate the usage of the HPX runtime system effectively, there are various further examples. LibGeoDecomp [87] is a high-level framework for developing stencil applications that has been equipped with an HPX backend. It has been shown that the performance of an N-body application developed with this backend is indeed able to outperform the existing MPI based solution by ~8% [33]. Furthermore, an algorithm for correlating positions using radio signals has been ported to HPX for usage on an embedded platform, demonstrating the excellent scalability and performance of the runtime system even on small devices [34]. In addition, HPX can be used as an underlying runtime system to devise even higher level programming models [35].

It should be clear that the development of the runtime system is not completed yet. Many global collective communication primitives are still missing and, with that, a clear migration path for legacy applications that would provide a smooth transition of existing applications towards the HPX programming model. Likewise, the various components of HPX need to be continuously adapted to new computing architectures, and the path laid out with targets needs to be completed by having direct support for high bandwidth memory and other non-volatile memory infrastructures. The possibility of adapting the behavior of the runtime as well as of applications using performance counters needs to be enhanced to fully enable online capabilities to adapt to the various factors inhibiting performance, or, in the same manner, to account for resiliency.

Furthermore, exciting times to simplify the user-facing API in terms of programmability are ahead of us. Developments in the C++ standardization process offer a great basis for this. On the one hand, the Coroutines TS makes the Futurization technique more approachable and efficient by providing first-class language keywords. On the other hand, the recently presented proposal for metaclasses [96] offers the capability to directly embody AGAS into the C++ type system. These changes to the C++ programming language will provide a way to express HPX programs more naturally and with less boilerplate, leading to programs that are easier to write and less error prone.

However, even without those additions, the programming model presented in this thesis offers a viable alternative to existing solutions to parallel programming by defining a consistent API to provide solutions for problems existing today. The performance portability in combination with the offered feature set is unmatched by any existing framework or language. HPX today represents an industry-grade, efficient implementation of a parallel runtime system written in C++ to support shared and distributed memory systems as well as accelerators, with user-extensible features to allow the exploitation of every part of the system without sacrificing generality or performance. This doctoral thesis has shown that the techniques are applicable to today's largest systems as well as to small embedded devices.

Appendix A

A.1. Atomic Operations

template <typename T>
struct atomic
{
    static constexpr bool is_always_lock_free = implementation-defined;

    bool is_lock_free() const volatile noexcept;
    bool is_lock_free() const noexcept;

    void store(T, memory_order = memory_order_seq_cst) volatile noexcept;
    void store(T, memory_order = memory_order_seq_cst) noexcept;
    T load(memory_order = memory_order_seq_cst) const volatile noexcept;
    T load(memory_order = memory_order_seq_cst) const noexcept;

    operator T() const volatile noexcept;
    operator T() const noexcept;

    T exchange(T, memory_order = memory_order_seq_cst) volatile noexcept;
    T exchange(T, memory_order = memory_order_seq_cst) noexcept;

    bool compare_exchange_weak(T&, T, memory_order, memory_order) volatile noexcept;
    bool compare_exchange_weak(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) volatile noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) volatile noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) volatile noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) noexcept;

    atomic() noexcept = default;
    constexpr atomic(T) noexcept;
    atomic(const atomic&) = delete;

    atomic& operator=(const atomic&) = delete;
    atomic& operator=(const atomic&) volatile = delete;
    T operator=(T) volatile noexcept;
    T operator=(T) noexcept;
};

Listing A.1: Basic atomic interface
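A small usage example of this interface is a lock-free counter incremented concurrently from several threads (plain standard C++):

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main()
{
    std::atomic<int> counter{0};

    std::vector<std::thread> workers;
    for (int i = 0; i != 4; ++i)
        workers.emplace_back([&counter] {
            for (int j = 0; j != 1000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });

    for (auto& t : workers)
        t.join();

    assert(counter.load() == 4000);
    return 0;
}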

template <>
struct atomic<integral>
{
    // all operations from atomic<T>, see Listing A.1

    integral fetch_add(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_add(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_sub(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_sub(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_and(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_and(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_or(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_or(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_xor(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_xor(integral, memory_order = memory_order_seq_cst) noexcept;

    integral operator++(int) volatile noexcept;
    integral operator++(int) noexcept;
    integral operator--(int) volatile noexcept;
    integral operator--(int) noexcept;
    integral operator++() volatile noexcept;
    integral operator++() noexcept;
    integral operator--() volatile noexcept;
    integral operator--() noexcept;
    integral operator+=(integral) volatile noexcept;
    integral operator+=(integral) noexcept;
    integral operator-=(integral) volatile noexcept;
    integral operator-=(integral) noexcept;
    integral operator&=(integral) volatile noexcept;
    integral operator&=(integral) noexcept;
    integral operator|=(integral) volatile noexcept;
    integral operator|=(integral) noexcept;
    integral operator^=(integral) volatile noexcept;
    integral operator^=(integral) noexcept;
};

Listing A.2: Additional operations for the atomic interface when T is an integral numeric type.

template <typename T>
struct atomic<T*>
{
    // all operations from atomic<T>, see Listing A.1

    T* fetch_add(ptrdiff_t, memory_order = memory_order_seq_cst) volatile noexcept;
    T* fetch_add(ptrdiff_t, memory_order = memory_order_seq_cst) noexcept;
    T* fetch_sub(ptrdiff_t, memory_order = memory_order_seq_cst) volatile noexcept;
    T* fetch_sub(ptrdiff_t, memory_order = memory_order_seq_cst) noexcept;

    T* operator++(int) volatile noexcept;
    T* operator++(int) noexcept;
    T* operator--(int) volatile noexcept;
    T* operator--(int) noexcept;
    T* operator++() volatile noexcept;
    T* operator++() noexcept;
    T* operator--() volatile noexcept;
    T* operator--() noexcept;
    T* operator+=(ptrdiff_t) volatile noexcept;
    T* operator+=(ptrdiff_t) noexcept;
    T* operator-=(ptrdiff_t) volatile noexcept;
    T* operator-=(ptrdiff_t) noexcept;
};

Listing A.3: Additional operations for atomic when T is a pointer.

A.2. Asynchronous Providers

template <typename T>
struct promise
{
public:
    promise();
    template <typename Allocator>
    promise(allocator_arg_t, const Allocator& a);
    promise(promise&& rhs) noexcept;
    promise(const promise& rhs) = delete;
    ~promise();

    // assignment
    promise& operator=(promise&& rhs) noexcept;
    promise& operator=(const promise& rhs) = delete;
    void swap(promise& other) noexcept;

    // retrieving the result
    future<T> get_future();

    // setting the result
    void set_value();
    void set_value(T&& t);
    void set_value(T const& t);
    void set_value(T& t);
    void set_exception(exception_ptr p);

    // setting the result with deferred notification
    void set_value_at_thread_exit();
    void set_value_at_thread_exit(T&& t);
    void set_value_at_thread_exit(T const& t);
    void set_value_at_thread_exit(T& t);
    void set_exception_at_thread_exit(exception_ptr p);
};

Listing A.4: The interface of a promise: a promise can be used to asynchronously set a value and to retrieve a future which is marked ready once the promise has been set.

template <typename R, typename... ArgTypes>
struct packaged_task;

template <typename R, typename... ArgTypes>
struct packaged_task<R(ArgTypes...)>
{
    // construction and destruction
    packaged_task() noexcept;
    template <typename F>
    explicit packaged_task(F&& f);
    template <typename F, typename Allocator>
    packaged_task(allocator_arg_t, const Allocator& a, F&& f);
    ~packaged_task();

    // no copy
    packaged_task(const packaged_task&) = delete;
    packaged_task& operator=(const packaged_task&) = delete;

    // move support
    packaged_task(packaged_task&& rhs) noexcept;
    packaged_task& operator=(packaged_task&& rhs) noexcept;

    void swap(packaged_task& other) noexcept;
    bool valid() const noexcept;

    // result retrieval
    future<R> get_future();

    // execution
    void operator()(ArgTypes...);
    void make_ready_at_thread_exit(ArgTypes...);

    void reset();
};

Listing A.5: The interface of a packaged_task: a packaged_task is used to wrap an arbitrary task. Once the task is executed, the return value is used to set the retrieved future to the ready state.

template <typename T>
class future
{
public:
    future() noexcept;
    future(future&&) noexcept;
    future(const future& rhs) = delete;
    ~future();

    future& operator=(const future& rhs) = delete;
    future& operator=(future&&) noexcept;

    shared_future<T> share() noexcept;

    // retrieving the value
    T get();

    // functions to check state
    bool valid() const noexcept;

    void wait() const;
    template <typename Rep, typename Period>
    future_status wait_for(const chrono::duration<Rep, Period>& rel_time) const;
    template <typename Clock, typename Duration>
    future_status wait_until(const chrono::time_point<Clock, Duration>& abs_time) const;
};

Listing A.6: The interface of a future: a future is the receiving end of an asynchronous operation. Member functions to query the state, wait on the result and attach continuations are defined.
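The following short example shows how the asynchronous providers interact: a promise fulfilled by one thread and consumed through its future by another (plain standard C++):

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> p;
    std::future<int> f = p.get_future();

    std::thread producer([&p] {
        // ... some computation producing the value ...
        p.set_value(42);
    });

    // get() blocks until the promise has been fulfilled.
    std::cout << "result: " << f.get() << '\n';

    producer.join();
    return 0;
}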

template <typename T>
class shared_future
{
public:
    shared_future() noexcept;
    shared_future(const shared_future& rhs) noexcept;
    shared_future(future<T>&&) noexcept;
    shared_future(shared_future&& rhs) noexcept;
    ~shared_future();

    shared_future& operator=(const shared_future& rhs) noexcept;
    shared_future& operator=(shared_future&& rhs) noexcept;

    // retrieving the value
    see below get() const;

    // functions to check state
    bool valid() const noexcept;

    void wait() const;
    template <typename Rep, typename Period>
    future_status wait_for(const chrono::duration<Rep, Period>& rel_time) const;
    template <typename Clock, typename Duration>
    future_status wait_until(const chrono::time_point<Clock, Duration>& abs_time) const;
};

Listing A.7: The interface of a shared_future: a shared_future is the receiving end of an asynchronous operation. The difference to future (see Listing A.6) is that there might be multiple instances for the same asynchronous provider. Member functions to query the state, wait on the result and attach continuations are defined.

A.3. Task Block

class task_block
{
private:
    ~task_block();

public:
    task_block(const task_block&) = delete;
    task_block& operator=(const task_block&) = delete;
    void operator&() const = delete;

    // Run a specific function inside this task block. The destructor
    // will block until each forked task finishes.
    template <typename F>
    void run(F&& f);

    void wait();
};

// Instantiates a task_block and passes it to f where children can be spawned.
template <typename F>
void define_task_block(F&& f);

Listing A.8: Fork Join Support in C++
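A brief usage sketch of this interface, written against HPX's implementation of the task block proposal; the namespace and header name are assumptions and may differ between versions.

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_task_block.hpp>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);
    long left = 0, right = 0;

    hpx::parallel::define_task_block([&](auto& tb) {
        // Fork: sum the first half in a child task ...
        tb.run([&] {
            left = std::accumulate(v.begin(), v.begin() + 500, 0L);
        });
        // ... while the second half is summed by the parent.
        right = std::accumulate(v.begin() + 500, v.end(), 0L);
    });  // implicit join: both tasks have finished here

    std::cout << left + right << '\n';   // prints 1000
    return 0;
}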

A.4. Parallel Algorithms

all_of, any_of, none_of: checks if a predicate is true for all, any or none of the elements in a range
for_each: applies a function to a range of elements
for_each_n: applies a function object to the first n elements of a sequence
count, count_if: returns the number of elements satisfying specific criteria
mismatch: finds the first position where two ranges differ
equal: determines if two sets of elements are the same
find, find_if, find_if_not: finds the first element satisfying specific criteria
find_end: finds the last sequence of elements in a certain range
find_first_of: searches for any one of a set of elements
adjacent_find: finds the first two adjacent items that are equal (or satisfy a given predicate)
search: searches for a range of elements
search_n: searches for a number of consecutive copies of an element in a range

Table A.2.: Non-modifying sequence operations
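To illustrate how these algorithms are used with an execution policy, the following standard C++17 example runs two of them in parallel (HPX provides the same algorithms and additionally accepts its own executors):

#include <algorithm>
#include <execution>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(1'000'000, 2);

    // Check a predicate over the whole range in parallel.
    bool all_even = std::all_of(std::execution::par,
        v.begin(), v.end(), [](int x) { return x % 2 == 0; });

    // Count elements satisfying a criterion in parallel.
    auto n = std::count_if(std::execution::par,
        v.begin(), v.end(), [](int x) { return x > 1; });

    std::cout << std::boolalpha << all_even << ", " << n << '\n';
    return 0;
}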

merge: merges two sorted ranges
inplace_merge: merges two ordered ranges in-place
includes: returns true if one set is a subset of another
set_difference: computes the difference between two sets
set_intersection: computes the intersection of two sets
set_symmetric_difference: computes the symmetric difference between two sets
set_union: computes the union of two sets

Table A.4.: Set operations on sorted ranges

copy, copy_if: copies a range of elements to a new location
copy_n: copies a number of elements to a new location
copy_backward: copies a range of elements in backwards order
move: moves a range of elements to a new location
move_backward: moves a range of elements to a new location in backwards order
fill: assigns a range of elements a certain value
fill_n: assigns a value to a number of elements
transform: applies a function to a range of elements
generate: saves the result of a function in a range
generate_n: saves the result of N applications of a function
remove, remove_if: removes elements satisfying specific criteria
remove_copy, remove_copy_if: copies a range of elements omitting those that satisfy specific criteria
replace, replace_if: replaces all values satisfying specific criteria with another value
replace_copy, replace_copy_if: copies a range, replacing elements satisfying specific criteria with another value
swap: swaps the values of two objects
swap_ranges: swaps two ranges of elements
iter_swap: swaps the elements pointed to by two iterators
reverse: reverses the order of elements in a range
reverse_copy: creates a copy of a range that is reversed
rotate: rotates the order of elements in a range
rotate_copy: copies and rotates a range of elements
random_shuffle, shuffle: randomly re-orders elements in a range
sample: selects n random elements from a sequence
unique: removes consecutive duplicate elements in a range
unique_copy: creates a copy of some range of elements that contains no consecutive duplicates

Table A.6.: Modifying sequence operations

is_partitioned: determines if the range is partitioned by the given predicate
partition: divides a range of elements into two groups
partition_copy: copies a range dividing the elements into two groups
stable_partition: divides elements into two groups while preserving their relative order
partition_point: locates the partition point of a partitioned range

Table A.8.: Partitioning operations

is_sorted: checks whether a range is sorted into ascending order
is_sorted_until: finds the largest sorted subrange
sort: sorts a range into ascending order
partial_sort: sorts the first N elements of a range
partial_sort_copy: copies and partially sorts a range of elements
stable_sort: sorts a range of elements while preserving order between equal elements
nth_element: partially sorts the given range making sure that it is partitioned by the given element

Table A.10.: Sorting operations

lower_bound: returns an iterator to the first element not less than the given value
upper_bound: returns an iterator to the first element greater than a certain value
binary_search: determines if an element exists in a certain range
equal_range: returns range of elements matching a specific key

Table A.12.: Binary search operations

is_heap: checks if the given range is a max heap
is_heap_until: finds the largest subrange that is a max heap
make_heap: creates a max heap out of a range of elements
push_heap: adds an element to a max heap
pop_heap: removes the largest element from a max heap
sort_heap: turns a max heap into a range of elements sorted in ascending order

Table A.14.: Heap operations

max: returns the greater of the given values
max_element: returns the largest element in a range
min: returns the smaller of the given values
min_element: returns the smallest element in a range
minmax: returns the smaller and larger of two elements
minmax_element: returns the smallest and the largest elements in a range
clamp: clamps a value between a pair of boundary values
lexicographical_compare: returns true if one range is lexicographically less than another
is_permutation: determines if a sequence is a permutation of another sequence
next_permutation: generates the next greater lexicographic permutation of a range of elements
prev_permutation: generates the next smaller lexicographic permutation of a range of elements

Table A.16.: Minimum/maximum operations

iota: fills a range with successive increments of the starting value
accumulate: sums up a range of elements
inner_product: computes the inner product of two ranges of elements
adjacent_difference: computes the differences between adjacent elements in a range
partial_sum: computes the partial sum of a range of elements
reduce: similar to std::accumulate, except out of order
exclusive_scan: similar to std::partial_sum, excludes the ith input element from the ith sum
inclusive_scan: similar to std::partial_sum, includes the ith input element in the ith sum
transform_reduce: applies a functor, then reduces out of order
transform_exclusive_scan: applies a functor, then calculates exclusive scan
transform_inclusive_scan: applies a functor, then calculates inclusive scan

Table A.18.: Numeric Operations
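As a usage illustration for the numeric operations, the following standard C++17 example computes an out-of-order sum with reduce and a dot product with transform_reduce:

#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> a(1'000'000, 1.0);
    std::vector<double> b(1'000'000, 2.0);

    // Out-of-order parallel sum of all elements of a.
    double sum = std::reduce(std::execution::par, a.begin(), a.end(), 0.0);

    // Fused element-wise multiply and reduction (dot product of a and b).
    double dot = std::transform_reduce(std::execution::par,
        a.begin(), a.end(), b.begin(), 0.0);

    std::cout << sum << " " << dot << '\n';   // 1e+06 and 2e+06
    return 0;
}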

uninitialized_copy: copies a range of objects to an uninitialized area of memory
uninitialized_copy_n: copies a number of objects to an uninitialized area of memory
uninitialized_fill: copies an object to an uninitialized area of memory, defined by a range
uninitialized_fill_n: copies an object to an uninitialized area of memory, defined by a start and a count
uninitialized_move: moves a range of objects to an uninitialized area of memory
uninitialized_move_n: moves a number of objects to an uninitialized area of memory

Table A.20.: Operations on uninitialized memory

Glossary

AGAS Active Global Address Space. i, 4, 43, 44, 47, 48, 50, 52, 55, 57, 61, 67, 70, 77, 91, 96, 98, 127, 128

AMR Adaptive Mesh Refinement. 10, 114

AMT Asynchronous Many Task. 3, 9

API Application Programming Interface. 3, 4, 9

BSP Bulk Synchronous Programming. 2, 5, 86, 127

CPS Continuation Passing Style. 9, 10, 31, 35–37, 40

CPU Central Processing Unit. 70, 78, 79

DAG Directed Acyclic Graph. 6, 36

FIFO First In, First Out. 46

FMM Fast Multipole Method. 114

GID Globally unique Identifier. i, iii, v, 47–55, 61, 72, 76

GLUPS Giga Lattice Updates per Second. 111

GPU Graphic Processing Unit. ii, 32, 80–82

HPC High Performance Computing. 3, 5, 9, 42, 55, 56, 66, 67, 71, 77, 80, 85, 103

IDL Interface Definition Language. 62

ILP Instruction-Level Parallelism. 24

ISA Instruction Set Architecture. 24

LCO Lightweight Control Object. 60

LIFO Last In, First Out. 46

LRU Least Recently Used. 54

MPI Message Passing Interface. 5, 56

NIC Network Interface Card. 66, 68, 70

NUMA Non Uniform Memory Access. ii, 78–80, 82

OOP Object Oriented Programming. 9

OS Operating System. 44, 45, 79, 80, 85, 86

PGAS Partitioned Global Address Space. 47, 56

RAII Resource Acquisition is Initialization. 16, 17, 65

RDMA Remote Direct Memory Access. 56, 66, 68

RPC Remote Procedure Call. 9, 32, 44, 47, 48, 53, 56, 67, 72, 76, 96, 98

RTTI Runtime Type Information. 57

SDK Software Development Kit. 81

SIMD Single Instruction Multiple Data. 2, 24, 80

SM Streaming Multiprocessor. 80, 81

SMT Symmetric Multi Threading. 24

SPMD Single Program Multiple Data. 111

TMP Template Meta Programming. 9, 10

Bibliography

[1] Bilge Acun et al. “Parallel Programming with Migratable Objects: Charm++ in Practice”. In: Proceedings of the International Conference for High Performance Com- puting, Networking, Storage and Analysis. SC ’14. New Orleans, Louisana: IEEE Press, 2014, pp. 647–658. isbn: 978-1-4799-5500-8. doi: 10.1109/SC.2014.58. url: https://doi.org/10.1109/SC.2014.58. [2] Saman Amarasinghe et al. “Exascale Programming Challenges”. In: Proceed- ings of the Workshop on Exascale Programming Challenges, Marina del Rey, CA, USA. U.S Department of Energy, Office of Science, Office of Advanced Sci- entific Computing Research (ASCR), July 2011. url: http : / / science . energy . gov / ~ / media / ascr / pdf / program - documents / docs / ProgrammingChallengesWorkshopReport.pdf. [3] M. Anderson et al. “MHD with adaptive mesh refinement”. In: Class. Quant. Grav. 23 (2006), pp. 6503–6524. [4] Matthew Anderson et al. “Neutron Star Evolutions using Tabulated Equations of State with a New Execution Model”. In: CoRR abs/1205.5055 (2012). [5] Henry C. Baker and . “The incremental garbage collection of pro- cesses”. In: SIGART Bull. New York, NY, USA: ACM, Aug. 1977, pp. 55–59. doi: http://doi.acm.org/10.1145/872736.806932. url: http://doi.acm.org/ 10.1145/872736.806932. [6] Richard F Barrett et al. “Toward an evolutionary task parallel integrated MPI + X programming model”. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM ’15. PMAM ’15. New York, New York, USA: ACM Press, Feb. 2015, pp. 30–39. isbn: 9781450334044. doi: 10.1145/2712386.2712388. url: http://dl.acm.org/ citation.cfm?doid=2712386.2712388. [7] David R Butenhof. Programming with POSIX threads. Addison-Wesley Profes- sional, 1997. [8] C++ AMP (C++ Accelerated Massive Parallelism). http://msdn.microsoft.com/en- us/library/hh265137.aspx. 2013. [9] C++ Single-source Heterogeneous Programming for OpenCL. https://www.khronos.org/sycl/. 2018.

145 [10] Pierre Carbonnelle. PYPL PopularitY of Programming Language. Accessed: 9.8.2016. 2016. url: http://pypl.github.io/PYPL.html. [11] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. “Kokkos”. In: J. Parallel Distrib. Comput. 74.12 (Dec. 2014), pp. 3202–3216. issn: 0743-7315. doi: 10. 1016/j.jpdc.2014.07.003. url: http://dx.doi.org/10.1016/j.jpdc. 2014.07.003. [12] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. “Parallel pro- grammability and the Chapel language”. In: International Journal of High Perfor- mance Computing Applications 21 (2007), pp. 291–312. [13] Philippe Charles et al. “X10: an object-oriented approach to non-uniform cluster computing”. In: Proceedings of the 20th annual ACM SIGPLAN conference on Object- oriented programming, systems, languages, and applications. OOPSLA ’05. San Diego, CA, USA: ACM, 2005, pp. 519–538. isbn: 1-59593-031-0. doi: 10.1145/1094811. 1094852. url: http://doi.acm.org/10.1145/1094811.1094852. [14] Sanjay Chatterjee et al. “Integrating Asynchronous Task Parallelism with MPI.” In: IPDPS. IEEE Computer Society, 2013, pp. 712–725. isbn: 978-1-4673-6066-1. url: http : / / dblp . uni - trier . de / db / conf / ipps / ipdps2013 . html # ChatterjeeTBCCGSY13. [15] UPC Consortium. UPC Language Specifications, v1.2, Tech Report LBNL-59208. Lawrence Berkeley National Lab, 2005 October. url: http://upc.gwu.edu. [16] James Coplien. “Advanced C++ programming styles and idioms”. In: Technology of Object-Oriented Languages and Systems, 1997. TOOLS 25, Proceedings. IEEE. 1997, pp. 352–352. [17] James O Coplien. Multi-paradigm design for C++. Vol. 53. Addison-Wesley Read- ing, MA, 1999. [18] CUDA. http://www.nvidia.com/object/cuda_home_new.html. 2013. [19] L. Dagum and R. Menon. “OpenMP: an industry standard API for shared- memory programming”. In: IEEE and Engineering 5.1 (Jan. 1998), pp. 46–55. issn: 1070-9924. doi: 10.1109/99.660313. [20] W. Dehnen. “A Very Fast and Momentum-conserving Tree Code”. In: Astrophys- ical Journal, Letters 536 (June 2000), pp. L39–L42. doi: 10.1086/312724. eprint: astro-ph/0003209. [21] Chirag Dekate et al. “N-Body SVN repository”. Available under a BSD-style open source license. Contact [email protected] for repository access. 2011. url: https : / / svn . cct . lsu . edu / repos / projects / parallex / trunk / history/nbody. [22] J. B. Dennis. “Data Flow Supercomputers”. In: Computer 13.11 (1980), pp. 48–56. issn: 0018-9162. doi: http://dx.doi.org/10.1109/MC.1980.1653418. [23] Jack B. Dennis and David Misunas. “A Preliminary Architecture for a Basic Data- Flow Processor”. In: 25 Years ISCA: Retrospectives and Reprints. 1998, pp. 125–131. [24] James Dinan et al. “Enabling MPI interoperability through flexible communica- tion endpoints”. In: Proceedings of the 20th European MPI Users’ Group Meeting. ACM. 2013, pp. 13–18. [25] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference Manual. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1990. isbn: 0- 201-51459-1. [26] Daniel P. Friedman and David S. Wise. “CONS Should Not Evaluate its Argu- ments”. In: ICALP. 1976, pp. 257–284. [27] Guang R. Gao et al. “ParalleX: A Study of A New Parallel Computation Model”. In: IPDPS. 2007, pp. 1–6. [28] TIOBE Group et al. TIOBE Index for ranking the popularity of Programming lan- guages. Accessed: 9.8.2016. 2016. url: http://www.tiobe.com/tiobe-index/. [29] Patricia A Grubel. Dynamic adaptation in HPX-A task-based parallel runtime system. New Mexico State University, 2016. [30] Paul Grun et al. 
[31] HCC: Heterogeneous Compute Compiler Tools. https://gpuopen.com/compute-product/hcc-heterogeneous-compute-compiler/. 2018.
[32] T. Heller, H. Kaiser, and K. Iglberger. “Application of the ParalleX Execution Model to Stencil-based Problems”. In: Proceedings of the International Supercomputing Conference ISC’12, Hamburg, Germany. 2012. url: http://stellar.cct.lsu.edu/pubs/isc2012.pdf.
[33] Thomas Heller et al. “Using HPX and LibGeoDecomp for Scaling HPC Applications on Heterogeneous Supercomputers”. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. ScalA ’13. Denver, Colorado: ACM, 2013, 1:1–1:8. isbn: 978-1-4503-2508-0. doi: 10.1145/2530268.2530269. url: http://doi.acm.org/10.1145/2530268.2530269.
[34] Arne Hendricks et al. “Evaluating Performance and Energy-efficiency of a Parallel Signal Correlation Algorithm on Current Multi and Manycore Architectures”. In: Procedia Computer Science 80 (2016), pp. 1566–1576.
[35] Arne Hendricks et al. “The AllScale Runtime Interface—Theoretical Foundation and Concept”. In: Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS), 2016 9th Workshop on. IEEE. 2016, pp. 13–19.
[36] M. D. Hill and M. R. Marty. “Amdahl’s Law in the Multicore Era”. In: Computer 41.7 (July 2008), pp. 33–38. issn: 0018-9162. doi: 10.1109/MC.2008.209.
[37] Kevin Huck. APEX: Autonomic Performance Environment for eXascale. 2017. url: http://khuck.github.io/xpress-apex/.
[38] Kevin Huck et al. “An Autonomic Performance Environment for Exascale”. In: Supercomputing Frontiers and Innovations 2.3 (2015). issn: 2313-8734. url: http://superfri.org/superfri/article/view/64.
[39] Kevin Huck et al. “An Early Prototype of an Autonomic Performance Environment for Exascale”. In: Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers. ROSS ’13. Eugene, Oregon: ACM, 2013, 8:1–8:8. isbn: 978-1-4503-2146-4. doi: 10.1145/2491661.2481434. url: http://doi.acm.org/10.1145/2491661.2481434.
[40] Intel. Intel Thread Building Blocks 4.4. 2016. url: http://www.threadingbuildingblocks.org.
[41] Intel SPMD Program Compiler. http://ispc.github.io/. 2011-2012.
[42] Intel(R) Cilk(tm) Plus. http://software.intel.com/en-us/intel-cilk-plus. 2014.
[43] ISO. ISO/IEC 14882:1998 — Programming languages – C++. Tech. rep. Geneva, Sept. 1998. url: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=25845.
[44] ISO. ISO/IEC 14882:2003 — Programming languages – C++. Tech. rep. Geneva, Oct. 2003. url: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=38110.
[45] ISO. ISO/IEC 14882:2011 Information technology — Programming languages – C++. Tech. rep. Geneva, Sept. 2011. url: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372.
[46] ISO. ISO/IEC 14882:2014 Information technology — Programming languages – C++. Tech. rep. Geneva, Jan. 2015. url: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372.
[47] ISO. ISO/IEC CD 14882 Information technology — Programming languages – C++. Tech. rep. Geneva, July 2016. url: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=68564.
[48] Jared Hoberock, Michael Garland, Chris Kohlhoff, Chris Mysen, Carter Edwards, and Gordon Brown. P0443r1: A Unified Executors Proposal for C++. Tech. rep. https://wg21.link/p0443r1. 2017.
[49] K. Kadam et al. “Numerical Simulations of Close and Contact Binary Systems Having Bipolytropic Equation of State”. In: American Astronomical Society Meeting Abstracts. Vol. 229. American Astronomical Society Meeting Abstracts. Jan. 2017, p. 433.14.
[50] Kundan Kadam et al. “A numerical method for generating rapidly rotating bipolytropic structures in equilibrium”. In: Monthly Notices of the Royal Astronomical Society 462.2 (July 2016), pp. 2237–2245. issn: 1365-2966. doi: 10.1093/mnras/stw1814. url: http://dx.doi.org/10.1093/mnras/stw1814.
[51] A. Kagi, J. R. Goodman, and D. Burger. “Memory Bandwidth Limitations of Future Microprocessors”. In: Computer Architecture, 1996 23rd Annual International Symposium on. May 1996, pp. 78–78. doi: 10.1109/ISCA.1996.10002.
[52] H. Kaiser, M. Brodowicz, and T. Sterling. “ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications”. In: 2009 International Conference on Parallel Processing Workshops. Sept. 2009, pp. 394–401. doi: 10.1109/ICPPW.2009.14.
[53] Hartmut Kaiser et al. “HPX: A Task Based Programming Model in a Global Address Space”. In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. PGAS ’14. Eugene, OR, USA: ACM, 2014, 6:1–6:11. isbn: 978-1-4503-3247-7.
[54] L. V. Kale et al. Design and Implementation of Parallel Java with Global Object Space. 1997.
[55] Matthias Kretz. “Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types”. PhD thesis. Goethe University Frankfurt, 2015. doi: 10.13140/RG.2.1.2355.4323. url: http://publikationen.ub.uni-frankfurt.de/frontdoor/index/index/docId/38415.
[56] Matthias Kretz. P0214R4: Data-Parallel Vector Types & Operations. ISO/IEC C++ Standards Committee Paper. 2017. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0214r4.pdf.
[57] A. Kurganov and E. Tadmor. “New High-Resolution Central Schemes for Nonlinear Conservation Laws and Convection-Diffusion Equations”. In: Journal of Computational Physics 160 (May 2000), pp. 241–282. doi: 10.1006/jcph.2000.6459.
[58] Guy Lewis Steele Jr. and Gerald Jay Sussman. “Lambda: The Ultimate Imperative”. 1976.
[59] J. Luitjens and M. Berzins. “Improving the performance of Uintah: A large-scale adaptive meshing computational framework”. In: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS). Atlanta, GA: IEEE, Apr. 2010, pp. 1–10. doi: 10.1109/IPDPS.2010.5470437.
[60] D. C. Marcello and J. E. Tohline. “A Numerical Method for Studying Super-Eddington Mass Transfer in Double White Dwarf Binaries”. In: Astrophysical Journal, Supplement 199, 35 (Apr. 2012), p. 35. doi: 10.1088/0067-0049/199/2/35.
[61] John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Tech. rep. A continually updated technical report. Charlottesville, Virginia: University of Virginia, 1991-2007. url: http://www.cs.virginia.edu/stream/.
[62] William F. McColl. “BSP Programming”. In: Specification of Parallel Algorithms 18 (1994), pp. 25–35.
[63] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Stuttgart, Germany: High Performance Computing Center Stuttgart (HLRS), Sept. 2009.
[64] Microsoft. Microsoft Parallel Pattern Library. 2010. url: http://msdn.microsoft.com/en-us/library/dd492418.aspx.
[65] Luc Moreau and Jean Duprat. “A construction of distributed reference counting”. In: Acta Informatica 37.8 (May 2001), pp. 563–595. issn: 1432-0525. doi: 10.1007/PL00013315. url: https://doi.org/10.1007/PL00013315.
[66] Luc Moreau and Jean Duprat. “A construction of distributed reference counting”. In: Acta Informatica 37.8 (2001), pp. 563–595.
[67] MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. http://mvapich.cse.ohio-state.edu/benchmarks/. 2015.
[68] N4406: Parallel Algorithms Need Executors. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4406.pdf. 2015.
[69] N4656: Working Draft, C++ Extensions for Networking. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4656.pdf. 2017.
[70] N4669: Working Draft, Technical Specification for C++ Extensions for Parallelism Version 2. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4501.html. 2015.
[71] John Nickolls et al. “Scalable Parallel Programming with CUDA”. In: Queue 6.2 (Mar. 2008), pp. 40–53. issn: 1542-7730. doi: 10.1145/1365490.1365500. url: http://doi.acm.org/10.1145/1365490.1365500.
[72] Robert W. Numrich and John Reid. “Co-array Fortran for parallel programming”. In: SIGPLAN Fortran Forum 17.2 (Aug. 1998), pp. 1–31. issn: 1061-7264. doi: 10.1145/289918.289920. url: http://doi.acm.org/10.1145/289918.289920.
[73] Dorit Nuzman, Ira Rosen, and Ayal Zaks. “Auto-vectorization of interleaved data for SIMD”. In: ACM SIGPLAN Notices 41.6 (2006), pp. 132–143.
[74] OpenACC - Directives for Accelerators. http://www.openacc-standard.org/. 2013.
[75] OpenCL - The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/. 2013.
[76] OpenMP V4.0 Specification. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf. 2013.
[77] Oracle. Project Fortress. https://projectfortress.java.net/. 2011.
[78] P0688R0: A Proposal to Simplify the Unified Executors Design. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0688r0.html. 2017.
[79] Simon L. Peyton Jones and Philip Wadler. “Imperative Functional Programming”. In: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL ’93. Charleston, South Carolina, USA: ACM, 1993, pp. 71–84. isbn: 0-89791-560-7. doi: 10.1145/158511.158524. url: http://doi.acm.org/10.1145/158511.158524.
[80] PGAS. PGAS - Partitioned Global Address Space. http://www.pgas.org. 2011.
[81] Rob Pike. “The go programming language”. In: Talk given at Google’s Tech Talks (2009).
[82] José M. Piquer. “Indirect reference counting: A distributed garbage collection algorithm”. In: PARLE ’91 Parallel Architectures and Languages Europe: Volume I: Parallel Architectures and Algorithms, Eindhoven, The Netherlands, June 10–13, 1991, Proceedings. Ed. by Emile H. L. Aarts, Jan van Leeuwen, and Martin Rem. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991, pp. 150–165. isbn: 978-3-540-47471-5. doi: 10.1007/BFb0035102. url: https://doi.org/10.1007/BFb0035102.
[83] Howard Pritchard et al. “The GNI provider layer for OFI libfabric”. In: Proceedings of Cray User Group Meeting, CUG. 2016.
[84] Stefan L. Ram et al. “Charakterisierung von C++” [Characterization of C++]. In: (2015).
[85] Tim Rentsch. “Object Oriented Programming”. In: SIGPLAN Not. 17.9 (Sept. 1982), pp. 51–57. issn: 0362-1340. doi: 10.1145/947955.947961. url: http://doi.acm.org/10.1145/947955.947961.
[86] P. E. Ross. “Why CPU Frequency Stalled”. In: IEEE Spectrum 45.4 (Apr. 2008), pp. 72–72. issn: 0018-9235. doi: 10.1109/MSPEC.2008.4476447.
[87] Andreas Schäfer and Dietmar Fey. “LibGeoDecomp: A Grid-Enabled Library for Geometric Decomposition Codes”. In: Proceedings of the 15th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. Dublin, Ireland: Springer, 2008, pp. 285–294. isbn: 978-3-540-87474-4.
[88] Andreas Schäfer, Dietmar Fey, and Adrian Knoth. “Tool for Automated Generation of MPI Datatypes”. In: ().
[89] Sangmin Seo et al. “Argobots: A Lightweight Low-Level Threading and Tasking Framework”. In: IEEE Transactions on Parallel and Distributed Systems 29 (2018), pp. 512–526.
[90] Richard Snodgrass. The interface description language: definition and use. Computer Science Press, Inc., 1989.
[91] International Organization for Standardization. N4723: Working Draft, C++ Extensions for Coroutines. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4723.pdf. 2018.
[92] StarPU - A Unified Runtime System for Heterogeneous Multicore Architectures. http://runtime.bordeaux.inria.fr/StarPU/. 2013.
[93] Guy Lewis Steele Jr. LAMBDA: The Ultimate Declarative. Tech. rep. DTIC Document, 1976.
[94] Guy L. Steele Jr. “Making Asynchronous Parallelism Safe for the World”. In: Proceedings of the 17th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL ’90. San Francisco, California, USA: ACM, 1990, pp. 218–231. isbn: 0-89791-343-4. doi: 10.1145/96709.96731. url: http://doi.acm.org/10.1145/96709.96731.
[95] Audie Sumaray and S. Kami Makki. “A comparison of data serialization formats for optimal efficiency on a mobile platform”. In: Proceedings of the 6th international conference on ubiquitous information management and communication. ACM. 2012, p. 48.
[96] Herb Sutter. P0707R0: Metaclasses. ISO/IEC C++ Standards Committee Paper. 2017. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0707r0.pdf.
[97] Herb Sutter. “The free lunch is over: A fundamental turn toward concurrency in software”. In: Dr. Dobb’s Journal 30.3 (2005), pp. 202–210.
[98] R. Teyssier. AMR and parallel computing. 2010. url: hipacc.ucsc.edu/html/HIPACCLectures/lecture_amr.pdf.
[99] The C++ Standards Committee. ISO International Standard ISO/IEC 14882:2011, Programming Language C++. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21. Geneva, Switzerland: International Organization for Standardization (ISO), 2011.
[100] The C++ Standards Committee. ISO International Standard ISO/IEC 14882:2014, Programming Language C++. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21. Geneva, Switzerland: International Organization for Standardization (ISO), 2014.
[101] The C++ Standards Committee. ISO International Standard ISO/IEC 14882:2017, Programming Language C++. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21. Geneva, Switzerland: International Organization for Standardization (ISO), 2017.
[102] The OmpSs Programming Model. https://pm.bsc.es/ompss. 2013.
[103] The Qthread Library. http://www.cs.sandia.gov/qthreads/. 2014.
[104] Peter Thoman et al. “A taxonomy of task-based parallel programming technologies for high-performance computing”. In: The Journal of Supercomputing 74.4 (Apr. 2018), pp. 1422–1434. issn: 1573-0484. doi: 10.1007/s11227-018-2238-4. url: https://doi.org/10.1007/s11227-018-2238-4.
[105] Konrad Trifunovic et al. “Polyhedral-model guided loop-nest auto-vectorization”. In: Parallel Architectures and Compilation Techniques, 2009. PACT ’09. 18th International Conference on. IEEE. 2009, pp. 327–337.
[106] “Evolving MPI+X Toward Exascale”. In: Computer 49.8 (2016), p. 10. issn: 0018-9162. doi: 10.1109/MC.2016.232.
[107] UPC Consortium. UPC Language Specifications, v1.2. Tech Report LBNL-59208. Lawrence Berkeley National Lab, 2005. url: http://www.gwu.edu/~upc/publications/LBNL-59208.pdf.
[108] Leslie G. Valiant. “A bridging model for parallel computation”. In: Commun. ACM 33.8 (1990), pp. 103–111. issn: 0001-0782. doi: 10.1145/79173.79181.
[109] Abhinav Vishnu, Monika ten Bruggencate, and Ryan Olson. “Evaluating the potential of Cray Gemini interconnect for PGAS communication runtime systems”. In: High Performance Interconnects (HOTI), 2011 IEEE 19th Annual Symposium on. IEEE. 2011, pp. 70–77.
[110] Andrew M. Wissink, David Hysom, and Richard D. Hornung. “Enhancing scalability of parallel structured AMR calculations”. In: Proceedings of the 17th annual international conference on Supercomputing. ICS ’03. San Francisco, CA, USA: ACM, June 2003, pp. 336–347. isbn: 1-58113-733-8. doi: 10.1145/782814.782861. url: http://doi.acm.org/10.1145/782814.782861.
[111] Andrew M. Wissink et al. “Large scale parallel structured AMR calculations using the SAMRAI framework”. In: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM). Supercomputing ’01. Denver, Colorado: ACM, Nov. 2001, pp. 6–6. isbn: 1-58113-293-X. doi: 10.1145/582034.582040. url: http://doi.acm.org/10.1145/582034.582040.
[112] X-Stack: Programming Challenges, Runtime Systems, and Tools, DoE-FOA-0000619. http://science.energy.gov/~/media/grants/pdf/foas/2012/SC_FOA_0000619.pdf. 2012.