Extending the C++ Asynchronous Programming Model with the HPX Runtime System for Distributed Computing

Erweiterung des asynchronen C++ Programmiermodels mithilfe des HPX Laufzeitsystems für verteiltes Rechnen

Der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg zur Erlangung des Doktorgrades

Doktor-Ingenieur

vorgelegt von Thomas Heller aus Neuendettelsau

Als Dissertation genehmigt von der Technischen Fakultät der Friedrich-Alexander-Universität Erlangen-Nürnberg

Tag der mündlichen Prüfung: 28.02.2019

Vorsitzender des Promotionsorgans: Prof. Dr.-Ing. Reinhard Lerch
Gutachter/in: Prof. Dr.-Ing. Dietmar Fey, Prof. Dr. Thomas Fahringer

To Steffi, Felix and Hanna

Acknowledgement

This thesis was written at the Chair for Computer Science 3 (Computer Architecture) of the Friedrich-Alexander-University Erlangen-Nuremberg. I would like to thank all persons who were involved in creating this work in one way or another. Special thanks goes to Prof. Dr.-Ing. Dietmar Fey, who took on the role of supervisor of this doctoral thesis. I would like to thank him for his support, trust, and all the helpful discussions that led to the success of this thesis. Additionally, I would like to thank Prof. Dr. Thomas Fahringer for his support and for accepting the role of reviewer. In addition, I would like to thank all students who contributed to the project, be it in the form of bachelor or master theses or as part of the team otherwise. I would like to thank my colleagues for all the fruitful discussions that helped to further develop the ideas presented in this thesis. I would like to thank Dr. Alice Koeniges for providing me with access to NERSC and the RRZE for providing access to the Meggie cluster. Furthermore, I would like to thank the STE||AR-Group, especially Dr. Hartmut Kaiser. Without the help and support of the group, this thesis wouldn't have been possible. The group helped to develop a stable product, which is the foundation of this very thesis. Hartmut is and was an excellent mentor who drove my research in various ways and helped develop my academic career.

Last but not least, I would like to thank my family, especially my wife and children, for their all-embracing support, without which this thesis couldn’t have been accomplished in the first place.

Abstract

This thesis presents a fully Asynchronous Many Task (AMT) runtime system extending the C++ programming language. The focus is on defining the distributed, asynchronous C++ programming model based on the C++ programming language, and on presenting performance-portable Application Programming Interfaces (APIs) for shared and distributed memory computing as well as for accelerators.

With the rise of multi- and many-core architectures, the C++ language was amended with support for concurrency and parallelism. This work derives the methodology for massive parallelism from this industry standard and extends it with fine-grained user-level threads, allowing large-scale supercomputers to employ the same syntax and semantics for remote and local operations. By leveraging the nature of asynchronous task-based message passing using a one-sided Remote Procedure Call (RPC) mechanism, the overarching principle of work follows data manifests itself.

By leveraging the asynchronous, task-based nature of the future as a handle for asynchronously computed results, the term Futurization is coined, presenting a technique based on Continuation Passing Style (CPS) programming. This technique allows dealing with millions of concurrently running asynchronous tasks. By attaching continuations, dynamic dependency graphs are formed naturally from the regular control flow of the code. The effect is to parallelize through the runtime system by executing multiple continuations in parallel. In other words, the future-based synchronization expresses fine-grained constraints. Furthermore, Futurization blends in naturally with other well-known techniques, such as data parallelism. Those other paradigms can be built using Futurization.

The technique mentioned earlier provides the necessary foundation to address the needs of modern scientific applications targeting High Performance Computing (HPC) platforms. However, addressing the challenge of handling more and more complicated architectures, with features such as differing memory access latencies and accelerators, is essential. This thesis attempts to solve this challenge by providing the necessary means to define computational and memory targets by reusing already defined, or upcoming, concepts for C++, and consequently providing means to link them together to intensify the principle of work follows data.

The feasibility of this approach is demonstrated with a set of low-level micro-benchmarks to show that the provided abstractions come with minimal overhead. A 2D stencil example that attests to the programmability of Futurization, as well as its performance benefits, serves as the second benchmark. Lastly, the results of futurizing the astrophysics application OctoTiger, a 3D octree Adaptive Mesh Refinement (AMR) based binary star simulation, running at extreme scales conclude the experimental section.

Kurzübersicht

Diese Arbeit stellt ein vollständiges “Asynchronous Many Task (AMT)“-Laufzeitsystem vor. Der Fokus liegt dabei auf der Definition der benötigten Konzepte auf der Basis der C++ Programmiersprache. Darüber hinaus werden portable APIs für das Rechnen auf verteilten Systemen und Beschleuniger-Hardware eingeführt.

Mit dem Aufkommen von Multi- und Many-Core-Architekturen wurde die C++ Programmiersprache mit Unterstützung für Nebenläufigkeit und Parallelität erweitert. Diese Arbeit leitet die Methodik für massive Parallelität von diesem Industriestandard ab und ergänzt sie mit feingranularen User-Level-Threads sowie verteiltem Rechnen. Dies ermöglicht das Benutzen großer Supercomputer mit derselben Syntax und Semantik für entfernte und lokale Operationen. Durch Verwendung des asynchronen task-basierten Nachrichtenaustausches mittels einseitiger Remote Procedure Calls (RPC) ergibt sich das allumfassende Prinzip des ”work follows data”, d.h. die Arbeit wird dort ausgeführt, wo die Daten liegen.

Der Begriff Futurization wird als Basis der “Continuation Passing Style (CPS)“-Programmierung geprägt. Dies erreicht man anhand des future-basierten Handles zum Ausdruck von asynchronen, Task-basierten Ergebnissen. Diese Technik erlaubt es, Millionen von nebenläufigen, asynchronen Tasks handzuhaben. Durch Einhängen von Continuations werden dynamische Abhängigkeitsgraphen erzeugt, die als Nebenprodukt des regulären Kontrollflusses leicht zu bestimmen sind. Dies hat den Effekt, dass mehrere dieser Continuations parallel durch das Laufzeitsystem abgearbeitet werden können. Durch diese future-basierte Synchronisierung ist man in der Lage, feingranulare Bedingungen für die korrekte Ausführung zu bestimmen. Darüber hinaus ermöglicht Futurization die Implementierung anderer Programmierparadigmen wie Datenparallelismus.

Diese Technik bietet die notwendige Grundlage, um den Anforderungen moderner wissenschaftlicher Anwendungen für High Performance Computing (HPC) Plattformen gerecht zu werden. Allerdings wird die Herausforderung, immer kompliziertere Architekturen effizient zu programmieren, immer größer. Darunter fallen unterschiedliche Speicherzugriffslatenzen und Hardware-Beschleuniger. Diese Arbeit verfolgt das Ziel, diese Aufgabe zu lösen, indem sie die notwendigen Mittel durch die Wiederverwendung bereits definierter oder zukünftiger Konzepte aus dem C++ Standard bereitstellt.

Die Ergebnisse dieses Ansatzes werden anhand der Evaluation mehrerer Benchmarks dargestellt. Zuerst wird eine Messung mit diversen Micro-Benchmarks durchgeführt, um zu zeigen, dass der Overhead der bereitgestellten Abstraktionen minimal ist. Sowohl die Programmierbarkeit als auch die Leistungsfähigkeit wird anhand einer 2D-Stencil-Anwendung demonstriert. Die Arbeit wird abgeschlossen durch die Anwendung OctoTiger, eine 3D-octree-basierte “Adaptive Mesh Refinement (AMR)“ Astrophysik-Simulation. Diese wird anhand von Futurization auf einen der größten aktuellen Supercomputer portiert.

Contents

1 Introduction

2 Related Work

3 Parallelism and Concurrency in the C++ Programming Language
   3.1 Low-Level Abstractions
      3.1.1 Memory Model
      3.1.2 Concurrency Support
      3.1.3 Task-Parallelism Support
   3.2 Higher Level Parallelism
      3.2.1 Concepts of Parallelism
      3.2.2 Parallel Algorithms
      3.2.3 Fork-Join Based Parallelism
   3.3 Evolution
      3.3.1 Executors
      3.3.2 Support for heterogeneous architectures and Distributed Computing
      3.3.3 Futurization
      3.3.4 and Parallelism

4 The HPX Parallel Runtime System
   4.1 Local Thread Management
   4.2 Active Global Address Space
      4.2.1 Processes in Active Global Address Space (AGAS) – Localities
      4.2.2 C++ Objects in AGAS – Components
      4.2.3 Global Reference Counting
      4.2.4 Resolving Globally unique Identifier (GID)s to Local Addresses
   4.3 Active Messaging
      4.3.1 Parcels
      4.3.2 Serialization
      4.3.3 Network Transport
   4.4 Asynchronous, unified API for remote and distributed computing
      4.4.1 Asynchronous Programming Interface
      4.4.2 Equivalence between Local and Remote Operations
      4.4.3 Natural extension to the C++ Standard
   4.5 Performance Counters

5 Abstractions for High Performance Parallel Programming
   5.1 Co-Locating Data and Work
   5.2 Targets in Common Computer Architectures
      5.2.1 Allocation Targets
      5.2.2 Execution Targets
      5.2.3 Non Uniform Memory Access (NUMA) Architectures
      5.2.4 Graphic Processing Unit (GPU) offloading
      5.2.5 High Bandwidth Memory
      5.2.6 Target Aware Container
   5.3 Supporting Abstractions
      5.3.1 Synchronization of Concurrent Work
      5.3.2 Global Collectives
      5.3.3 Point to Point Communications

6 Evaluation
   6.1 Benchmark Setup
   6.2 Low Level Benchmarks
      6.2.1 HPX Thread Overhead
      6.2.2 HPX Communication Overhead
      6.2.3 STREAM Benchmark
   6.3 Benchmark Applications
      6.3.1 Two-dimensional Stencil Application
      6.3.2 OctoTiger

7 Conclusion

A Appendix
   A.1 Atomic Operations
   A.2 Asynchronous Providers
   A.3 Task Block
   A.4 Parallel Algorithms

Glossary

Bibliography

List of Definitions

3.1 Thread of Execution
3.2 Data Race
3.3 Sequential Consistency
3.4 Execution Agent
3.5 BasicLockable
3.6 Lockable
3.7 TimedLockable
3.8 Asynchronous Return Object
3.9 Asynchronous Provider
3.10 Callable

5.11 Target of Execution

List of Figures

3.1 Types of Parallelism
3.2 Execution Policies
3.3 Fork-Join Model

4.1 HPX Runtime System Components
4.2 HPX Thread Context
4.3 Schematic of a distributed HPX Application
4.4 Schematic of a distributed HPX Application using Components
4.5 Graphical Representation of a GID
4.6 Interaction between Parcelhandler and AGAS

5.1 Presenting differences in Memory Latencies
5.2 NUMA Architecture Layout
5.3 CUDA Architecture
5.4 3D Stacked Memory
5.5 Knights Landing layout
5.6 Global collective operation

6.1 Overheads of creating tasks on a single processing unit
6.2 HPX Overhead
6.3 Future overhead
6.4 Task overhead
6.5 Scheduling different task granularities
6.6 Serialization overhead
6.7 Action throughput
6.8 Component creation
6.9 Component action overhead
6.10 send/recv
6.11 Broadcast
6.12 STREAM TRIAD results (ARM64)
6.13 STREAM TRIAD results (X86-64)
6.14 STREAM TRIAD results (KNL)
6.15 The PDE result
6.16 The 2D Grid
6.17 Grid element access
6.18 5 point stencil
6.19 Single node stencil performance
6.20 Distributed weak scaling of the stencil
6.21 Visualization of OctoTiger showing accretion stream
6.22 OctoTiger Ghost Zone Exchange
6.23 Concurrency Analysis of OctoTiger
6.24 Single Node of OctoTiger using 7 LoR
6.25 Subgrids per second on different CPU counts and LoR
6.26 Relative Speedup of OctoTiger using 13 LoR
6.27 Relative Speedup of OctoTiger using 14 LoR

List of Listings

3.1 C++ Memory Order Definition
3.2 C++ Memory Fence API
3.3 Definition of std::mutex
3.4 Definition of std::lock_guard
3.5 Definition of variadic std::(try_)lock
3.6 Definition of std::condition_variable
3.7 The std::async definition
3.8 Sequential execution policy type and global Object
3.9 Parallel execution policy type and global Object
3.10 Parallel unsequenced execution policy type and global Object
3.11 And-composition of futures
3.12 Or-composition of futures
3.13 A naïve, recursive implementation of the fibonacci sequence
3.14 Asynchronous, recursive fibonacci sequence implementation
3.15 Futurized fibonacci sequence implementation
3.16 Futurized if-statement
3.17 if-statement with an asynchronous condition
3.18 Futurized if-statement with an asynchronous condition
3.19 Pseudo code of a while-loop
3.20 Futurized while-loop
3.21 Futurized for-loop
3.22 Fibonacci example using TS

4.1 AGAS Localities queries
4.2 Component Boilerplate
4.3 Creating Objects in AGAS
4.4 Registering GIDs with a symbolic name
4.5 The HPX action mechanism
4.6 Polymorphic actions for serialization
4.7 Giving the action a name
4.8 HPX continuations for transporting results
4.9 Definition of a Parcel
4.10 HPX Serialization API (1/2)
4.11 HPX Serialization API (2/2)
4.12 Marking types as bit-wise serializable
4.13 Generic Parcel handling protocol
4.14 HPX Performance Counter Syntax

5.1 Generic function offloading with CUDA
5.2 The interface for hpx::lcos::channel

6.1 The HPX STREAM TRIAD benchmark
6.2 2D Stencil pseudo algorithm
6.3 Stencil line update
6.4 Stencil line iterator
6.5 Stencil row iterator
6.6 Stencil parallelized
6.7 Stencil communicator, first row update
6.8 Natural recursive tree traversal
6.9 Futurized recursive tree traversal

A.1 Basic atomic interface
A.2 atomic interface for integral numeric types
A.3 atomic interface additions for pointers
A.4 promise interface
A.5 packaged_task interface
A.6 future interface
A.7 shared_future interface
A.8 Fork Join Support in C++

List of Tables

3.1 C++ critical sections
3.2 The iterator categories and their impact on parallelization
3.3 Basic concepts for program execution
3.4 Executor Categories

4.1 Local and Remote Semantic and Syntactic Equivalence

6.1 Used Hardware Systems
6.2 Used system software

A.2 Non-modifying sequence operations
A.4 Set operations on sorted ranges
A.6 Modifying sequence operations
A.8 Partitioning operations
A.10 Sorting operations
A.12 Binary search operations
A.14 Heap operations
A.16 Minimum/maximum operations
A.18 Numeric Operations
A.20 Operations on uninitialized memory


1 Introduction

The free lunch is over [97]. This sentence hints at the end of Moore's Law, where we have to improve the architecture of modern computers instead of relying on ever-improving technological processes that are supposed to increase the performance of our calculations. Among the architectural changes that altered the mainstream of computing are the rise of multi- and many-core architectures and the invention of Beowulf clusters, perfected in today's fastest supercomputers. This leads to the challenge of developing effective and efficient parallel programming techniques that allow the usage of all parts of a computing system.

Especially when looking at the challenges imposed by exascale computing, the same conclusions can be drawn. With the end of Moore's Law, the massive increase in on-node parallelism is a result of the necessity of keeping the power consumption in balance [86] while increasing performance. Promising architectures occurring today, such as GPUs, FPGAs and other many-core systems like the Intel Xeon Phi, come in the form of accelerators, that is, devices serving a particular purpose, implying the need for heterogeneous systems. All in all, it can be anticipated that concurrency in future systems will be increased by orders of magnitude. At the same time, the available memory technology is not able to scale at a similar rate [2]. That imposes an imbalance between compute capabilities and the complexity of acquiring the needed data, which in turn requires complex memory architectures.

At the same time, important classes of parallel applications are emerging as scaling impaired. These are problems that require substantial execution time, often exceeding several months, but that are unable to make effective use of more than a few thousand processors [3, 98, 59, 110, 111] in strong scaling scenarios.

In the current landscape of scientific computing, the conventional thinking is that the currently prevalent programming model of MPI+X is suitable enough to tackle those challenges with minor adaptations [106]. That is, MPI is the preferred tool for inter-node communication, and X is a placeholder representing intra-node parallelization, for example, OpenMP or CUDA. Other inter-node communication paradigms, such as PGAS, emerged. Those focus on one-sided messages together with explicit and often global synchronization. While the promise of keeping the current programming patterns and known tools sounds appealing, the disjoint approach results in the usage and requirement to maintain various interfaces, and the question of interoperability is yet unclear [24].

Although the existing solutions do not, per se, dictate a specific programming paradigm, a particular model of programming is usually preferred and was indeed the initial motivation for the creation of various programming interfaces. For MPI, and to the same extent PGAS models, this paradigm is centered around Bulk Synchronous Programming (BSP) [62], refined by adding neighborhood-based synchronization through algorithmic optimizations, for example halo exchanges. Often, algorithms developed in this paradigm use a lock-step style of implementation, that is, distinct phases of communication and computation that by definition do not overlap, or are complex to overlap [6]. As a result, performance is gained by both increasing the processing power and the network interconnect speed. The speed of light and other technological limitations prevent network interconnects from increasing their parallel data lanes endlessly. Thus, the communication step cannot be sped up indefinitely. On the other hand, increasing the performance of the computational part was achieved by increasing the processor's frequency; this soon became a bottleneck due to thermal issues and leakage currents. As such, developing newer generations of processors with various forms of parallelism, ranging from Single Instruction Multiple Data (SIMD) based vectorization to multi- and many-core architectures, was paramount. The development of new architectures can always be seen as a co-design process, in the sense that existing applications have to be accelerated to be able to sell new generations of processors. In the context of this thesis, the applications of interest perform scientific simulations using linear algebra, which are able to expose specific forms of parallelism. Most numerical linear algebra algorithms can be expressed in terms of data parallel constructs that are found in the programming interfaces of OpenMP [19] and CUDA [71]. Those data parallel constructs are usually based on the fork/join model, that is, at the start of the parallel region the computation fans out to all processors and is implicitly joined once the computation is complete. As such, it is a perfect fit for the BSP model. Apparently, this model is limited in scalability, since the start and end of a parallel region will always entail a serial region. By looking at Amdahl's Law [36], the scaling limitations become apparent, since the inherent serial portion of a given application (foremost communication) will dominate the runtime as parallelism increases. This implies that only weak scaling of computational problems is possible, which is not always applicable.
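
For reference, a standard formulation of Amdahl's Law (stated here from general knowledge, not quoted from [36]) makes this limit explicit: with a parallelizable fraction p of the work and N processors, the achievable speedup is bounded by the serial fraction 1 - p.

S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}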

The solution to those impeding factors of limited scalability can only be achieved by performing a paradigm change in the parallel programming model: a move from BSP towards fine-grained, constraint-based synchronization that limits the effect of global barriers. This paradigm shift is enabled by the emerging class of AMT runtime systems, which carry properties that alleviate the limitations mentioned above. It is therefore not a coincidence that the C++ Programming Language, starting with the standard updated in 2011, gained support for concurrency by defining a memory model for a multi-threaded world as well as laying the foundation towards enabling task-based parallelism by adopting the future [5, 26] concept. In 2017, based on those foundational layers, support for parallel algorithms became part of the standard, covering the need for data parallel algorithms.

The HPX parallel runtime system combines an AMT tailored for HPC usage with strict adherence to the C++ standard. It therefore represents a combination of well-known ideas (such as dataflow [22, 23], futures [5, 26], and CPS) with new and unique overarching concepts. The combination of these ideas and their strict application forms overarching design principles that make this model unique.

HPX consists of the following main building blocks:

• A C++ Standards-conforming API enabling wait-free asynchronous execution.

• futures, channels, and other synchronization primitives.

• An Active Global Address Space (AGAS) that supports load balancing through object migration.

• An active messaging network layer that ships functions to data.

• A work-stealing lightweight task scheduler that enables finer-grained parallelization and synchronization.

• A versatile in-situ performance observation, measurement, analysis, and adapta- tion framework APEX.

HPX's features and programming model allow application developers to naturally use fundamental principles such as overlapping communication and computation, decentralizing control flow, oversubscribing execution resources, and sending work to data instead of data to work. Using Futurization, developers can express complex dataflow execution trees that generate millions of HPX tasks that by definition are executed in the proper order.
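
As a brief illustration of this style, the following sketch composes two asynchronous tasks into a small dependency graph using hpx::async, hpx::when_all, and a continuation attached via then. It assumes the standard-conforming HPX API described in later chapters; the exact header names are an assumption and may differ between HPX versions.

#include <hpx/hpx_main.hpp>       // assumed convenience header (runtime startup)
#include <hpx/include/lcos.hpp>   // assumed header providing hpx::future et al.

#include <iostream>
#include <tuple>

int square(int x) { return x * x; }

int main()
{
    // Two independent tasks, scheduled on lightweight HPX user-level threads.
    hpx::future<int> a = hpx::async(square, 3);
    hpx::future<int> b = hpx::async(square, 4);

    // The continuation runs only once both inputs are ready; no explicit
    // barrier is needed, the dependency graph follows the control flow.
    hpx::future<int> sum = hpx::when_all(a, b).then(
        [](auto&& both)
        {
            auto results = both.get();   // tuple of ready futures
            return std::get<0>(results).get() + std::get<1>(results).get();
        });

    std::cout << sum.get() << std::endl;   // prints 25
    return 0;
}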

In this thesis, I will use the HPX runtime system to demonstrate an alternative to the prevalent programming models in the HPC community by first deriving its properties from the definitions in the C++ ISO Standard and highlighting how those can be used to overcome the limitations of MPI and OpenMP with a maintainable and efficient API. In summary, my contributions are:

• An extension of C++ for efficient handling of massively parallel systems including distributed memory and accelerators (Chapter 4).

• The HPX programming model of applying Futurization, by reusing asynchronous handles as the universal, fine-grained concept to turn concurrency into parallelism (Chapter 4).

• An evaluation of the developed concepts using a wide range of micro benchmarks as well as larger numerical simulations (Chapter 6).

• Understanding and optimizing bottlenecks arising from the novel parallel programming paradigm (Chapter 6).

The remainder of this thesis is structured as follows. Section 2 provides a classification of the HPX runtime system and a comparison with other existing solutions. Section 3 will introduce the concepts of the C++ Standard to give a basis for further discussions, as well as insight into the upcoming features, very likely to be included in the standard, that are the subject of discussion throughout the remainder of this thesis. Section 4 discusses the different layers of HPX (thread management, AGAS, Active Messaging and the unified interface) to fully derive the HPX programming model from the C++ language and to provide a natural extension of the already existing capabilities of the language. After having the fundamentals in place, Section 5 adds a refining, higher-level layer of discussion to address various topics concerning data and work co-location, synchronization concepts and global collectives. The evaluation and discussion of the feasibility of the approach are in Section 6, including a discussion of the involved overheads as well as mitigation strategies. Finally, a study in developing a toy 2D heat equation solver serves to evaluate the approach in terms of the application of the concepts mentioned above and performance portability. This section concludes with the analysis and extreme scale results obtained with a production-grade application, OctoTiger. Section 7 will conclude the thesis.

2 Related Work

Traditional high-level programming models for HPC center around the idea of introducing parallel regions to accelerate specific portions of an application. This was exercised most prominently by OpenMP [76], which popularized parallel programming by introducing pragmas for parallelizing loops on shared memory systems. For a long time, this served as the de-facto programming paradigm and influenced modern C++ based solutions like Intel TBB [40], Microsoft PPL [64] and Kokkos [11]. This is in contrast to distributed memory parallel programs, where the Message Passing Interface (MPI) [63] represents the prevalent solution. While it is not a parallel programming paradigm per se, and more a low-level message passing API, in combination with other parallel programming solutions it is often directly related to the BSP [108] programming model. With ever increasing on-node concurrency, this model is reaching its limits. As such, the latest research in the field of parallel programming models points to the efficient utilization of intra-node parallelism with the ability to exploit the underlying communication infrastructure using fine-grained task-based approaches to deal with concurrency [112, 103, 39, 4, 21].

In recent years, programming models targeting lightweight tasks have become more and more commonplace. HPX is no exception to this trend. Task-based parallel programming models can be placed into one of several categories [104, 53]:

• Library solutions: examples are Intel TBB [40], StarPU [92], Argobots [89], Qthreads [103] and many others

• Language extensions: examples here are Intel Cilk Plus [42], or OpenMP 3.0 [76]

• Experimental programming languages: the most notable examples are Chapel [12], Intel ISPC [41], or X10 [13]

While all the current solutions expose parallelism to the user in different ways, they all seem to converge on a continuation-style programming model. Some of them use a futures-based programming model, while others use dataflow processing of dependent tasks with an implicit or explicit Directed Acyclic Graph (DAG) representation of the control flow. While the majority of the task-based programming models focus on dealing with node-level parallelism, HPX presents a solution for homogeneous execution of remote and local operations.

When looking at library solutions for task-based programming, we are mainly looking at C/C++. While other languages such as Java and Haskell provide similar libraries, they play a secondary role in the field of High Performance Computing. Fortran, for example, has only one widely available solution, OpenMP 3.0. However, it is possible to call Fortran kernels from C/C++ or to create Fortran language bindings for the library in use. Examples of pure C solutions are StarPU and Qthreads. They both provide interfaces for starting and waiting on lightweight tasks as well as creating task dependencies. While Qthreads provides suspendable user-level threads, StarPU is built upon Codelets that, by definition, run to completion without suspension. Each of these strategies, Codelets and User-Level-Threads, has its advantages. HPX is a C++ library and tries to stay fully inside the C++ memory model and its execution scheme. For this reason, it follows the same route as Qthreads to allow for more flexibility and easier support for synchronization mechanisms. For C++, one of the existing solutions is Intel TBB, which, like StarPU, works in a codelet-style execution of tasks. What these libraries have in common is that they provide a high performance solution to task-based parallelism with all requirements one could think of for dealing with intra-node level parallelism. What they clearly lack is a uniform API and a solution for dealing with distributed computing. This is one of the key advantages HPX has over these solutions: a programming API that is based on the C++ standard, with support for constructs from C++11 to C++17 as well as the upcoming C++20 standard [99, 100, 101], furthermore augmenting and extending the defined concepts and interfaces to support remote operations.

Hand in hand with the library-based solutions come the pragma-based language extensions for C, C++ and Fortran: they provide an efficient and effective way for programmers to express intra-node parallelism. While OpenMP has supported tasks since V3.0, it lacks support for continuation-based programming, focusing instead on fork-join parallelism; task dependencies were only introduced in OpenMP 4.0. This hole is filled by OmpSs [102], which serves as an experiment in integrating inter-task dependencies using a pragma-based approach. One advantage of pragma-based solutions over libraries is their support for accelerators, which provides programmers with an approachable way to access accelerators without the need to rely on external libraries. This effort was spearheaded by OpenACC [74] and is now part of the OpenMP 4.0 specification. Libraries for accelerators, as of now, have to fall back to language extensions like CUDA [18], C++AMP [8, 31], SYCL [9] or OpenCL [75].

In addition to language extensions, an ongoing effort to develop new programming languages is emerging, aiming at better support for parallel programming. Some parallel programming languages like OpenCL or Intel Cilk Plus [42] focus on node-level parallelism. While OpenCL focuses on abstracting hardware differences for all kinds of parallelism, Intel Cilk Plus supports a fork-join style of parallelism. In addition, there are programming languages which explicitly support distributed computing, like UPC [107] or Fortress [77], but lack support for intra-node level parallelism. Current research, however, is developing support for both inter- and intra-node level parallelism based on a partitioned global address space (PGAS) [80]. The most prominent examples are Chapel [12] and X10 [13], which represent the PGAS languages. HCMPI [14] shows similarities with the HPX programming model by offering interfaces for asynchronous distributed computing, either based on distributed data-driven futures or explicit message passing in an MPI [63] compatible manner. The main difference between HPX and the above solutions is that HPX invents no new syntax or semantics. Instead, HPX implements the syntax and semantics as defined by C++11, providing it with a homogeneous API that relies on a widely accepted programming interface.

Many applications must overcome the scaling limitations imposed by current programming practices by embracing an entirely new way of coordinating parallel execution. Fortunately, this does not mean that we must abandon all of our legacy code. HPX can use MPI as a highly efficient portable communication platform and at the same time serve as a back-end for OpenMP, OpenCL, or even Chapel, while maintaining or even improving execution times. This opens a migration path for legacy codes to a new programming model which will allow old and new code to coexist in the same application.

3 Parallelism and Concurrency in the C++ Programming Language

The underlying foundation of this thesis is the C++ Programming Language. This section gives a brief introduction to the necessary definitions and concepts found in the ISO C++ Programming Language and presents proposals that are considered fundamental for advancing C++ further into the domain of HPC programming. Parallelization and concurrency are therefore in focus and require definitions to provide a foundation and a line of reasoning for parallel runtime systems.

The C++ Programming Language has been among the Top 10 Programming Languages [28, 10] since its creation. It has marked the beginning of the popularity of Object Oriented Programming (OOP) [85, 16]. OOP provides the capabilities to offer rich user interfaces and compelling abstractions. Due to its C heritage, efficiency did not have to suffer, and as such C++ presents a high-performance programming language solution. Furthermore, C++ is a multi-paradigm language combining various aspects of different programming paradigms like functional and generic programming [17].

The history of C++ goes back to the year 1990 when the C++ Annotated Reference Manual (C++-ARM) [25] was published. C++ was designed from the beginning to support efficient development for large-scale systems and has evolved from “C with Classes” to a full ISO Programming Standard in 1998 [43]. With the help of the template mechanism, this programming language was able to deliver high-level abstractions without significant loss in performance for its rich standard library, offering object-oriented data structures and generic algorithms. These algorithms were not only usable with the predefined data structures but also with user-defined ones. This design is a prime example of re-usability and in fact is one of the driving factors for C++'s popularity [84]. Templates are often called a language within the language and allow the programmer to implement complex calculations with types in a functional manner. This technique is called Template Meta Programming (TMP) and is one of the critical components for implementing efficient abstractions. The next version of the C++ Language in 2003 [44] was only a minor revision of the language and its standard library, containing mostly bug fixes.

The rise of multi-core processors in the early 2000s confronted programmers with new challenges: most modern programming languages lacked support for concurrency and parallelism. Due to the end of ever-scaling single core performance, the free lunch was over [97]. The C++ language specifications so far lacked a memory model and library constructs to support concurrent programming. Without this, software developers wanting to exploit the performance possibilities of modern multi-core processors had to rely either on other language extensions like OpenMP or on low-level, platform-specific APIs like POSIX [7]. It was only in 2011 that the new standard was published [45]. This new standard came with new features within the core language (perfect forwarding, lambdas and range-based for loops). In addition, it introduced a theoretical memory model (developed in conjunction with the C11 ISO Standard) together with support for low-level concurrency to deal with multiple threads of execution as well as critical sections, in addition to other synchronization primitives and atomic operations. These lay the foundation for a modern language supporting the needs of the ever more emerging multi- and many-core architectures.

The next iteration of the standard in 2014 [46] brought nothing new to the language concerning concurrency or parallelism. However, the latest version of the standard, published in 2017 [47], adds parallel algorithms as well as other relevant features such as generic lambdas and more support for compile-time programs to further support TMP. The parallel algorithms are an essential step towards high-level abstractions for parallel programming and offer a good starting point for parallel skeletons.

The remainder of this chapter will introduce the low-level abstractions needed for efficient parallel programming (see Section 3.1). Those are essential and form the needed foundation by providing the memory and execution model. Higher-level abstractions, such as parallel algorithms, will then be presented in Section 3.2. Section 3.3 will provide insight into the evolution of the C++ Language with respect to parallel programming and will introduce further concepts up for standardization, such as executors and the evolution of asynchronous programming using CPS, that are going to be used and refined in Chapter 4 and Chapter 5.

3.1. Low-Level Abstractions

Whenever dealing with parallelism or concurrency, the need for low-level abstractions arises. Those abstractions are necessary to express parallelism and to apply solutions for problems arising due to concurrency, such as data races (see Definition 2). As the basis for further discussion in this chapter, we use the following definitions:

Definition 3.1 (Thread of Execution) A thread of execution is a single flow of control executing a function and every subsequently invoked function. [45, 46, 47]

Definition 3.2 (Data Race) A data race is the effect of (at least) two conflicting operations, issued by (at least) two concurrently running threads of execution, that may be interleaved. That is, those operations are not atomic [45, 46, 47].

Definition 3.3 (Sequential Consistency) A data race free program is considered sequentially consistent.

Definition 3.4 (Execution Agent) An execution agent is an entity such as a thread that may perform work in parallel with other execution agents. [45, 46, 47]

The thread of execution (see Definition 1) forms the basis of every program; it represents the control flow, that is, the logic of the application. The logic of the application might spawn new execution agents (see Definition 4) and as such has to take care of eliminating data races (see Definition 2). An operation in Definition 2 is a C++ expression that reads or modifies a given memory location. Therefore, a data race consists of two conflicting expressions that read or write the same memory location concurrently without being explicit atomic operations.
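
To make Definition 2 concrete, the following minimal sketch (illustrative only, not taken from the original text) contrasts a racy increment of a plain int with the equivalent, well-defined atomic increment:

#include <atomic>
#include <iostream>
#include <thread>

int racy = 0;               // plain int: concurrent increments are a data race
std::atomic<int> safe{0};   // atomic int: concurrent increments are well-defined

void work()
{
    for (int i = 0; i < 100000; ++i)
    {
        ++racy;             // two threads interleave here: undefined behavior
        ++safe;             // atomic read-modify-write: no data race
    }
}

int main()
{
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    // safe is always 200000; racy may hold any value due to the data race.
    std::cout << racy << " vs. " << safe << std::endl;
}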

Additionally, a memory model is needed that is well-defined in a context with multiple execution agents. The C++ model will be described in Section 3.1.1. Based on this memory model, other tools can be defined to synchronize between different threads of execution (see Section 3.1.2). Section 3.1.3 will discuss the spawning of execution agents.

3.1.1. Memory Model

To execute a program, regardless of whether it runs in a multi- or single-threaded environment, one needs to define a memory model to access and mutate the state of a program correctly. It should be noted that the memory model in itself assumes a data-race free program. The purpose of the memory model is to define the ordering of various memory accesses and therefore to allow reasoning about the sequence in which the instructions are executed. This is needed to establish Happens Before or Happens After relationships that determine the results of operations. These relationships are defined in the following sections. In particular, those memory orderings are represented in the atomic operations introduced in the upcoming sections. This will form the basis to implement correct multi-threaded applications.

Memory Ordering and Consistency

The precondition for defining atomic memory operations on different types is to define rules that describe the mechanisms of when and how the side effects of those operations become visible to other threads of execution. Listing 3.1 presents the enumeration of the available memory orderings. Those orderings denote either acquire or release semantics. The exception is memory_order_relaxed, which comes without any guarantees concerning the visibility of other writes or loads and can be used as an optimization. The strongest guarantee is given by memory_order_seq_cst, in that it is defined as if the program was executed sequentially and no other operation is allowed to be reordered before or after. All other side effects, be they stores or loads, need to be visible to this thread of execution.

enum class memory_order
{
    memory_order_relaxed, // Relaxed, no guarantees
    memory_order_consume, // Soon to be deprecated
    memory_order_acquire, // Acquire, synchronizes with loads
    memory_order_release, // Release, synchronizes with stores
    memory_order_acq_rel, // Acquire/Release
    memory_order_seq_cst  // Sequential Consistency
};

Listing 3.1: Memory order as defined by the C++ Standard. memory_order_relaxed is the weakest ordering, as it is not associated with any guarantees, while memory_order_seq_cst is the strongest, offering sequentially consistent acquire and release semantics.

An acquire operation issued on a given memory location has the effect that no subsequent operation may be reordered before it and that all writes made visible by the matching release operation are visible to the current thread of execution. That is, all stores previously performed by the releasing thread are visible after the acquire. The orderings that adhere to acquire semantics are memory_order_acquire, memory_order_acq_rel and memory_order_seq_cst. Only a load (or read-modify-write) operation can operate with acquire semantics.

In contrast to acquire, ordering with release memory order is only valid for store (or read-modify-write) operations. That is, all stores previously issued by this thread become visible to an acquire operation that observes the released value, and no prior operation can be reordered after this store. The orderings that fulfill those semantics are: memory_order_release, memory_order_acq_rel and memory_order_seq_cst; on the load side, memory_order_consume1 offers a weaker variant of acquire.

1 This memory order is expected to be deprecated, as there is no real benefit over memory_order_acquire; in fact, all implementations implement it in terms of that ordering.

These orderings denote sufficient requirements on atomic operations to reason about Happens Before or Happens After relationships. Even though the actual order of execution is undefined and non-deterministic, one can compose the required operations with the particular memory ordering, allowing for correctness of the algorithm across concurrently executing threads of execution.
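
The canonical use of these orderings is the publication pattern sketched below (an illustrative example added to this text): the writer's release store synchronizes with the reader's acquire load, establishing the happens-before relationship that makes the plain write to payload visible.

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // publication flag

void producer()
{
    payload = 42;                                  // 1: prepare the data
    ready.store(true, std::memory_order_release);  // 2: publish with release
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // 3: acquire until published
        ;                                          // busy-wait for brevity
    assert(payload == 42);                         // 4: guaranteed to hold
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}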

Memory Fences

A fence operation is a synchronization primitive that can have either acquire semantics (acquire fence), release semantics (release fence) or both. A release fence synchronizes with acquire fences or acquire operations in other threads of execution that observe the writes issued before the fence; similarly, an acquire fence synchronizes with the respective release operations. Listing 3.2 shows the API functions as defined in the C++ Standard.

extern "C" void atomic_thread_fence(memory_order order) noexcept;
extern "C" void atomic_signal_fence(memory_order order) noexcept;

Listing 3.2: The API to issue memory fences. The order argument denotes the semantics of the fence operation: it can either have no effect or carry acquire and/or release semantics.
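
The same publication pattern shown earlier can be expressed with stand-alone fences combined with relaxed atomic accesses; the sketch below is illustrative and not part of the original text.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> flag{false};

void writer()
{
    data = 7;                                             // plain write
    std::atomic_thread_fence(std::memory_order_release);  // release fence
    flag.store(true, std::memory_order_relaxed);          // relaxed store
}

void reader()
{
    while (!flag.load(std::memory_order_relaxed))         // relaxed load
        ;
    std::atomic_thread_fence(std::memory_order_acquire);  // acquire fence
    assert(data == 7);   // the fences synchronize, so the write is visible
}

int main()
{
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}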

Atomic Operations

Without atomic operations, it is impossible to enforce visibility of data to other, concurrently running threads. It is important to note that regular operations on built-in types do not have defined behavior regarding visibility to other threads of execution. Memory regions altered in a thread of execution can only be made visible through defined synchronization points and explicit atomic operations, which are specified using the defined memory orderings (see Listing 3.1).

As such, all atomic operations use a special type. Listing A.1 shows the definition of the interface. Together with the regular constructors, additional member functions are defined for loading, storing and atomically exchanging values. By default, the memory order is set to memory_order_seq_cst as it provides the strongest guarantees. An atomic object is not copyable; however, non-atomic values are assigned and extracted implicitly with sequentially consistent acquire-release semantics. The compare-exchange operations perform a bit-wise comparison with the expected value; only if this equality holds is the value updated, otherwise the operation is equivalent to an atomic load.

For integral types (such as int or long), additional member functions have been added to perform numerical (for example fetch_add) or logical (for example fetch_or) exchange operations, which atomically apply the respective operation to the currently stored value (see Listing A.2). Those operations are defined using the two's complement binary representation of integer values with a wrap-around on overflow. Another specialization of the atomic type exists for pointer types (see Listing A.3). These atomic pointers allow arithmetic operations on the pointer value following the same semantics as the integral operations.
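
As an illustration of these read-modify-write operations (a sketch added for this text, not from the original), the following counter uses fetch_add, and the running maximum uses the typical compare_exchange_weak retry loop:

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<long> counter{0};   // number of processed values
std::atomic<long> maximum{0};   // largest value seen so far

void update(long value)
{
    counter.fetch_add(1, std::memory_order_relaxed);   // atomic increment

    long current = maximum.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `current`; retry until we
    // either stored `value` or observed an already larger maximum.
    while (current < value &&
           !maximum.compare_exchange_weak(current, value))
    {
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (long i = 1; i <= 8; ++i)
        threads.emplace_back(update, i * 10);
    for (auto& t : threads)
        t.join();
    std::cout << counter << " updates, maximum " << maximum << std::endl;
}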

Synchronization Points

Memory orderings are essential for atomic operations, but solely using them for an entire program is overkill regarding semantics and efficiency. Not all operations need to be atomic with respect to each other, and the already existing algorithms and data structures should still be usable in the context of a multi-threaded program. For that purpose, it is crucial to have synchronization points. As such, each atomic operation that is not relaxed is a synchronization point. Since all mutex operations use atomics in some form or another, they represent synchronization points.

struct mutex
{
    constexpr mutex() noexcept;
    ~mutex();

    mutex(const mutex&) = delete;
    mutex& operator=(const mutex&) = delete;

    void lock();
    bool try_lock();
    void unlock();

    using native_handle_type = implementation-defined;
    native_handle_type native_handle();
};

Listing 3.3: The C++ definition of a mutex implementing the Lockable concept.

3.1.2. Concurrency Support

After having discussed the memory model and the respective atomic operations, the next step is to introduce additional support for marking critical regions that shall be protected from concurrent access. Naturally, those are internally implemented using atomic instructions and other low-level primitives provided by the operating system.

Three concepts defined in the standard are BasicLockable, Lockable and TimedLockable.

Definition 3.5 (BasicLockable) A type is BasicLockable when it fulfills the following requirements (m denotes an object of such a type):

m.lock(): Blocks until a lock can be acquired for the current execution agent.
m.unlock(): Releases a lock held by the current execution agent. The lock needs to have been acquired by the same execution agent beforehand.

Definition 3.6 (Lockable) A type is Lockable if it meets the requirements of BasicLockable (see Definition 5) and the following requirement holds (m denotes an object of such a type):

m.try_lock(): Attempts to acquire a lock for the current execution agent without blocking. It returns true if the lock was acquired, false otherwise.

Definition 3.7 (TimedLockable) A type is TimedLockable if it meets the requirements of Lockable (see Definition 6) and the following requirements hold (m denotes an object of such a type):

m.try_lock_for(rel_time): Attempts to acquire a lock for the current execution agent. It blocks until either the lock has been acquired or the relative timeout specified by rel_time has been exceeded. It returns true if the lock was acquired, false otherwise.
m.try_lock_until(abs_time): Attempts to acquire a lock for the current execution agent. It blocks until either the lock has been acquired or the absolute timeout specified by abs_time has been exceeded. It returns true if the lock was acquired, false otherwise.

BasicLockable solely requires member functions to lock and unlock the object in order to mark the critical region (see Definition 5). In addition to that, Lockable (see Definition 6) extends this definition by adding the possibility to try to lock. That is, locking of the critical section does not need to be successful. Such speculative locking is useful whenever a critical section might be contended, since the caller is not forced to block while waiting for the lock to become available. A further extension is the TimedLockable concept (see Definition 7), where the try-to-lock operation is supplemented with a time duration, attempting to acquire the lock for a specific amount of time.

Listing 3.3 shows an example interface of a mutex that satisfies the definition of Lockable. Likewise, the C++ Standard defines timed_mutex to implement the TimedLockable concept. A mutex is required to offer exclusive ownership semantics; that means that once a call to lock() has returned, no other thread of execution can return from a call to lock(), that is, lock blocks until the mutual exclusion properties are fulfilled. Not part of the definition, however, is the type of blocking; an implementation using spinlock techniques (such as a busy-loop waiting on an atomic flag to be set, with an optional exponential back off) instead of operating system APIs is entirely possible, as sketched below.
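
A minimal sketch of such a spinlock (illustrative, not a listing from the original text) that satisfies the Lockable requirements using only an atomic flag could look as follows; a production-quality implementation would add back-off and padding:

#include <atomic>

class spinlock
{
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
    void lock()
    {
        while (flag_.test_and_set(std::memory_order_acquire))
            ;   // busy-wait until the flag has been cleared
    }

    bool try_lock()
    {
        return !flag_.test_and_set(std::memory_order_acquire);
    }

    void unlock()
    {
        flag_.clear(std::memory_order_release);
    }
};

Because it models Lockable, such a type can be used with the RAII wrappers of Table 3.1, for example std::lock_guard<spinlock>.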

lock_guard (BasicLockable): This class is not copyable or movable. It is used to control ownership of lockable objects within a scope.
unique_lock (TimedLockable): This class is used to control ownership within a scope. Ownership can be transferred via move operations.
shared_lock (TimedLockable): This class is used to control shared ownership within a scope. Shared ownership can be transferred via move operations.

Table 3.1.: The C++ Resource Acquisition Is Initialization (RAII) classes to manage critical sections.

template <typename BasicLockable>
class lock_guard
{
    BasicLockable& lock_;

public:
    lock_guard(BasicLockable& lock)
      : lock_(lock)
    {
        lock_.lock();
    }

    ~lock_guard()
    {
        lock_.unlock();
    }

    lock_guard(lock_guard const&) = delete;
    lock_guard& operator=(lock_guard const&) = delete;
};

Listing 3.4: The lock_guard is a Resource Acquisition Is Initialization (RAII) class that can be used to programmatically represent a critical section, using automatic, exception-safe resource management for a mutex.

Defining the mutual exclusion types alone might seem sufficient at first sight. Considering that a mutex represents a resource, however, it makes sense to offer RAII classes that acquire the resource and release it once the scope is left. The RAII idiom is based on the principle of acquiring a resource in the constructor of an object while releasing it in the destructor. Adopting this principle has the effect that unlocking a mutex is performed automatically. Listing 3.4 presents an example implementation of a lock_guard. This class locks the passed reference (which is a BasicLockable object) upon construction and unlocks it in the destructor. RAII is a common pattern in modern C++ and is useful to express intent at scope level without sacrificing exception safety or risking other hazards from early returns. Table 3.1 shows an overview of the different locks that can be used from within the C++ standard library. One additional constructor overload, which is not part of Listing 3.4, is used to modify the behavior of the constructor by taking global constants with distinct types using the following symbols:

• try_to_lock: Calls try_lock instead of lock. This might lead to the lock not being acquired. An additional conditional check is needed here.

• adopt_lock: Does not call lock in the constructor, assumes that the calling thread owns the mutex.

• defer_lock: Does not call lock in the constructor, assumes that the calling thread does not own the mutex.

When dealing with more than one lock protecting different memory regions that need to be locked simultaneously, it is easy to create deadlocks if those locks are not acquired and released in the same sequence. In general contexts, this is impossible to guarantee. For that reason, the C++ standard defines two variadic functions (see Listing 3.5) to ensure the deadlock-free acquisition of locks. In combination with the RAII classes described in Table 3.1 it is possible to create deadlock-free programs with any number of mutexes to protect critical regions properly.

template <typename... Lockable>
void lock(Lockable&...);
template <typename... Lockable>
int try_lock(Lockable&...);

Listing 3.5: The generic, variadic locking functions perform a deadlock-avoidance algorithm to acquire the passed locks. std::lock returns once all locks are acquired. std::try_lock returns -1 if all locks could be acquired, or the index of the lock which could not be acquired.
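
A short usage sketch (added here for illustration) combines the RAII wrappers with std::lock: both unique_lock objects are constructed with defer_lock, std::lock then acquires the two mutexes without risking deadlock, and the destructors release them at scope exit.

#include <mutex>
#include <thread>

std::mutex m1, m2;
int balance_a = 100;
int balance_b = 0;

void transfer(int amount)
{
    std::unique_lock<std::mutex> l1(m1, std::defer_lock);
    std::unique_lock<std::mutex> l2(m2, std::defer_lock);
    std::lock(l1, l2);      // deadlock-free acquisition of both locks

    balance_a -= amount;    // critical section protected by both mutexes
    balance_b += amount;
}                           // l2 and l1 unlock automatically here

int main()
{
    std::thread t1(transfer, 10), t2(transfer, 20);
    t1.join();
    t2.join();
}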

With the availability of mechanisms to create critical sections and avoid data races, the only ingredient missing for sufficient support of multi-threaded programming is a synchronization primitive that can be used to block a thread until it is notified by some other thread that a condition is met, or until a specified system time has been reached. The C++ standard library definition of condition_variable can be found in Listing 3.6. Functions are provided to notify one or all waiting threads, to wait indefinitely until a condition is met, and timed versions thereof.

This definition is very similar to the one defined in the POSIX [7] standard and therefore does not come with many surprises. One observation to make is that condition_variable::wait takes a specific lock type as the first argument. This might be undesirable when using other lock types as presented previously. For that reason, the C++ standard also defines the condition_variable_any class, which takes a generic BasicLockable lock as an argument to the wait functions.

struct condition_variable
{
    condition_variable();
    ~condition_variable();

    condition_variable(const condition_variable&) = delete;
    condition_variable& operator=(const condition_variable&) = delete;

    void notify_one() noexcept;
    void notify_all() noexcept;

    void wait(unique_lock<mutex>& lock);
    template <typename Predicate>
    void wait(unique_lock<mutex>& lock, Predicate pred);

    template <typename Clock, typename Duration>
    cv_status wait_until(unique_lock<mutex>& lock,
        const chrono::time_point<Clock, Duration>& abs_time);
    template <typename Clock, typename Duration, typename Predicate>
    bool wait_until(unique_lock<mutex>& lock,
        const chrono::time_point<Clock, Duration>& abs_time, Predicate pred);

    template <typename Rep, typename Period>
    cv_status wait_for(unique_lock<mutex>& lock,
        const chrono::duration<Rep, Period>& rel_time);
    template <typename Rep, typename Period, typename Predicate>
    bool wait_for(unique_lock<mutex>& lock,
        const chrono::duration<Rep, Period>& rel_time, Predicate pred);

    using native_handle_type = implementation-defined;
    native_handle_type native_handle();
};

Listing 3.6: The condition_variable forms the foundation for notifying different threads of execution that a condition has been met. This can be used, for example, in producer-consumer scenarios.
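
The following sketch (illustrative, not a listing from the original text) shows such a producer-consumer scenario: the consumer blocks in cv.wait until the producer has pushed an item and called notify_one.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<int> items;
bool done = false;

void producer()
{
    for (int i = 0; i < 5; ++i)
    {
        {
            std::lock_guard<std::mutex> lk(m);
            items.push(i);
        }
        cv.notify_one();            // wake the consumer
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_one();
}

void consumer()
{
    std::unique_lock<std::mutex> lk(m);
    for (;;)
    {
        // The predicate guards against spurious wake-ups.
        cv.wait(lk, [] { return !items.empty() || done; });
        while (!items.empty())
        {
            std::cout << items.front() << std::endl;
            items.pop();
        }
        if (done)
            break;
    }
}

int main()
{
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}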

3.1.3. Task-Parallelism Support

All concurrency facilities described in Section 3.1.2 provide the necessary basis for implementing concurrently operating libraries and applications. However, using those low-level facilities gets cumbersome and error-prone very quickly. One of the reasons for this is the missing support for, and concept of, transporting results. As such, the natural consequence is to add an abstraction to support task-based parallelism and allow for seamless transport of return values of functions. As the underlying concept, the C++ Standard leverages Futures and Promises [5, 26].

The classical std::thread2 interface, which starts the execution of a function in a new thread of execution and provides additional members that allow waiting until it has finished, will not be discussed here. The ability to start new threads, and to join with or to detach from the completion of the thread, is completely covered by the interface used for task-parallelism and as such is considered obsolete within the context of this thesis. The significant advantage of the approach presented here is the seamless transport of the outcome of the executed task as well as improved synchronization and composability when using futures.

The Shared State

Coming back to Task-Parallelism support in the C++ Standard, we have two important concepts: the producer and the consumer. They communicate over a common, reference counted3, but otherwise unspecified shared state. The shared state is the centerpiece of all the subsequently introduced concepts, classes, and functions. The consumer is defined as Asynchronous Return Object (see Definition 3.8) and the producer is defined as Asynchronous Provider (see Definition 3.9).

Definition 3.8 (Asynchronous Return Object) An asynchronous return object is an object that reads results from a shared state. A waiting function of an asynchronous return object is one that potentially blocks to wait for the shared state to be made ready. If a waiting function can return before the state is made ready because of a timeout, then it is a timed waiting function. Otherwise, it is a non-timed waiting function.

Definition 3.9 (Asynchronous Provider) An asynchronous provider is an object that provides a result to a shared state. The result of a shared state is set by respective functions on the asynchronous provider. The means of setting the result of a shared state is specified in the description of those classes and functions that create such a state object.

2 http://en.cppreference.com/w/cpp/thread/thread
3 In order to keep track of how many return objects or providers reference the shared state and to avoid dangling pointers.

The shared state is reference counted in a thread-safe manner. Both the provider and the return object hold a reference to the shared state. There shall only ever be one asynchronous provider, but there might be multiple asynchronous return objects. Whenever either the return object or the provider releases its shared state, the reference count is decremented by one. Once the reference count reaches zero, the shared state is destroyed. A shared state is ready whenever the provider sets either an exception or the value. The retrieval and setting of the shared state's value or exception are appropriately synchronized with each other using the mechanisms described in Section 3.1.2.

Asynchronous Provider

At first, we want to start with the asynchronous providers as they form the entry point for writing task-based applications. The most prominent provider is the promise (see Listing A.4). The promise itself does not create any new threads or tasks; it can be considered the archetypical asynchronous provider. It provides all the functions for setting either a value or an exception. For retrieving the value, however, one must refer to an asynchronous return object. In the case of C++, the most common asynchronous return object is std::future (see Listing A.6). An important observation to make here is that there should only ever be one owner of an asynchronous provider, even though ownership might be transferred via move operations such as move assignment or construction.
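A minimal sketch illustrating this division of roles (the value 42 and the explicit thread are chosen purely for illustration): the promise acts as the asynchronous provider and the future obtained from it as the asynchronous return object.

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> p;                    // asynchronous provider
    std::future<int> f = p.get_future();    // asynchronous return object

    // The provider is moved into another thread of execution, which
    // eventually makes the shared state ready by setting a value.
    std::thread t([pr = std::move(p)]() mutable { pr.set_value(42); });

    // The consumer blocks until the shared state is ready.
    std::cout << f.get() << '\n';
    t.join();
}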

The next asynchronous provider is packaged_task (see Listing A.5). In contrast to promise, a packaged_task does not allow setting the value or exception of the shared state directly. The packaged_task instead is a wrapper around a callable with return type R and a variadic argument list Args.... That is, upon construction, a callable object (see Definition 3.10) is passed to packaged_task. The value set in the shared state is then the return value produced by the passed callable. If an exception is thrown during the execution of the passed function, it will be intercepted and stored in the shared state. To retrieve that value (or exception), users have to call get_future to obtain an asynchronous return object in the form of a future. This future will be set ready once the task has completed and the return value has been stored in the shared state.
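The following minimal sketch (the function add and the explicit std::thread are illustrative choices, not mandated by packaged_task) shows how the wrapper stores the callable's return value in the shared state, from which it is retrieved via the future obtained through get_future:

#include <future>
#include <iostream>
#include <thread>

int add(int a, int b) { return a + b; }

int main()
{
    // Wrap the callable; its return value will be stored in the shared state.
    std::packaged_task<int(int, int)> task(add);
    std::future<int> f = task.get_future();

    // packaged_task does not create a new thread by itself; here it is
    // explicitly run in a new thread of execution.
    std::thread t(std::move(task), 20, 22);

    std::cout << f.get() << '\n';   // prints 42 once the task has completed
    t.join();
}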

Definition 3.10 (Callable) A Callable is a type that is invokable; an object f of type Callable can be called with the passed parameter pack Args. Expression: INVOKE(f, args...) Requirement: this expression is well-formed.

Where INVOKE is well-formed when f is either a pointer to a member function or member variable, or if f is either a regular function, a lambda or an object with operator() overloaded. The passed parameters must be convertible to the function's arguments. The return value of the function must be convertible to R as well, or is discarded if R is void.

The packaged_task, even though the name might suggest it, doesn't execute the passed function in a new thread of execution, and therefore isn't required to execute it concurrently. However, it can be used to build asynchronous providers that achieve exactly that. Listing 3.7 defines the signatures for the various std::async overloads as provided by the C++ Standard. They allow functions to be executed asynchronously, which might be implemented using packaged_task; for the sake of providing only the semantics, the implementation details are left out of the definition.

template <typename F, typename... Args>
future<result_of_t<decay_t<F>(decay_t<Args>...)>>
async(F&& f, Args&&... args);

template <typename F, typename... Args>
future<result_of_t<decay_t<F>(decay_t<Args>...)>>
async(launch policy, F&& f, Args&&... args);

Listing 3.7: The function async is used to asynchronously spawn a given task. The second overload determines how and when the task is going to be launched.

To create asynchronously (and possibly concurrently) running tasks, async was defined. In its current form, two overloads are provided. The first has the intent to start a task executing the callable f with the passed parameters args. The second overload has an additional first argument, the launch policy. It is defined as:

enum class launch : unspecified {
    async = unspecified,
    deferred = unspecified,
    // implementation-defined
};

In the case of the first overload, when no launch policy is specified, the implementation might choose between launch::async and launch::deferred. launch::async will invoke the passed function and execute it as if it was executed in a new thread. This will almost always have the effect that a new thread of execution is created. In the case of launch::deferred, the implementation should defer the invocation of the function to a point in time when no more concurrency can be effectively exploited. In both cases, the arguments and the passed function are decay-copied, that is, stored internally until the function is finally executed. The returned future will be set to contain the result of the function invocation, which is either the returned value or an exception that has been thrown during function invocation.
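The following short sketch (the function square and the concrete values are illustrative only) contrasts the three ways of launching a task with std::async:

#include <future>
#include <iostream>

int square(int x) { return x * x; }

int main()
{
    // Launch policy left to the implementation: async, deferred, or both.
    std::future<int> f1 = std::async(square, 4);

    // Execute as if in a new thread of execution.
    std::future<int> f2 = std::async(std::launch::async, square, 5);

    // Defer the invocation until the result is requested via get() or wait().
    std::future<int> f3 = std::async(std::launch::deferred, square, 6);

    std::cout << f1.get() + f2.get() + f3.get() << '\n';   // 16 + 25 + 36
}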

Asynchronous Return Object

After having defined the basic set of asynchronous providers that allow us to spawn asynchronous tasks, this part will deal with defining the asynchronous return objects. The C++ Standard defines future (see Listing A.6) and shared_future (see Listing A.7). The commonality of the two is the ability to retrieve the value or exception stored in the shared state, but not to set it. The difference lies in the ownership semantics: future only ever has a single owner (transferable through move semantics) while shared_future represents shared ownership, such that there can be multiple references to the same shared state. One of the effects of the unique ownership semantics of future is that whenever the shared state's value is retrieved, the future is invalidated. Similarly, when calling share(), the semantics change, and therefore the original future needs to be invalidated to avoid corner cases with asynchronous return objects pointing to the same shared state but representing different semantics.

The result of the asynchronous provider can be retrieved by calling get(). This function will either return the stored value or rethrow the exception that might have occurred. A call to get() might block until the value has been stored in the shared state by the provider. In a scenario where a simple wait on the result of the shared state is sufficient, without actually retrieving the result, future also defines the wait() function, which does not lead to an invalidation of the future object. In addition, timed wait functions are defined to restrict waiting to a specific duration or until a certain time point. Those timed wait functions block until either the shared state is ready or the requested time has run out. The result is returned via the future_status enum:

enum class future_status {
    ready,    // the waited on future is ready
    timeout,  // the time period is up, the future isn't ready yet
    deferred  // the shared state contains a deferred function
};
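A small sketch of a timed wait (the sleep duration and polling interval are arbitrary illustration values); note that wait_for, in contrast to get(), does not invalidate the future:

#include <chrono>
#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::future<int> f = std::async(std::launch::async, [] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        return 7;
    });

    // Poll the shared state; the future stays valid across timed waits.
    while (f.wait_for(std::chrono::milliseconds(10)) !=
           std::future_status::ready)
    {
        std::cout << "not ready yet, doing other work...\n";
    }
    std::cout << f.get() << '\n';
}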

The difference between future and shared_future lies, as mentioned previously, in the ownership semantics. This results in a difference in the semantics of the get() function. While future::get() returns the shared state's result by value, shared_future::get() returns the result by const reference. Also, the retrieval of the shared state's result does not lead to invalidation; as such, the shared state can be read multiple times, and of course, there can be multiple references to the same shared state from multiple shared_future return objects. In most cases, asynchronous providers return a future; when required, this can be converted to a shared_future, either implicitly or explicitly by calling future::share().
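The following sketch (four reader threads, chosen arbitrarily) illustrates the shared ownership semantics: after calling share(), the result can be read from multiple threads, each holding its own copy of the shared_future:

#include <future>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    std::promise<int> p;
    // share() invalidates the original future; the shared_future is copyable.
    std::shared_future<int> sf = p.get_future().share();

    std::vector<std::thread> readers;
    for (int i = 0; i < 4; ++i)
    {
        readers.emplace_back([sf, i] {
            // get() returns by const reference and does not invalidate the
            // shared state; it may be called from multiple threads.
            std::cout << "reader " << i << " sees " << sf.get() << '\n';
        });
    }

    p.set_value(42);    // makes the shared state ready for all readers
    for (auto& t : readers) t.join();
}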

3.2. Higher Level Parallelism

The low-level abstractions introduced in Section 3.1 form the basis and allow formulating further abstractions by using the regular composition facilities provided by the C++ programming language. The goal should be to provide high-level, generic and reusable abstractions. Those abstractions should be able to cover the needs of individual application domains without limiting the possibilities to create further abstractions; allowing parallelization techniques to be applied efficiently and effectively without the need to deal with low-level concurrency details that are often error-prone and hard to use. The different parallelization strategies are all derived by relying on a small set of types of parallelism that are illustrated in Figure 3.1. The current C++ Standard starts with extending the standard library specification with parallel algorithms (see Section 3.2.2).

Figure 3.1.: Showing the different forms of parallelism discussed in this thesis and how they relate to each other in the context of a task-based system. Everything is built on top of the lower layers, without restricting the use of any layer underneath, to allow for full flexibility.

3.2.1. Concepts of Parallelism

When defining higher level parallelism features, the various means of expressing parallelism and their mapping to specific concepts have to be laid out. On this basis, we will derive the first subset of generically applicable higher-level policies that allow us not only to express parallelism but also to define potentially user-defined customization points. Those customization points can be used to alter the execution behavior.

Types of Parallelism

First, this section describes the various types of parallelism. Starting from a low-level perspective, the hardware defines the available forms of parallelism:

• Bit-Level Parallelism

• Instruction-Level Parallelism (ILP)

• Thread-Level Parallelism

Bit-Level parallelism is defined by the different widths of the various data types. That is, an instruction usually performs a given operation on all bits of the underlying data type in parallel instead of computing every single bit sequentially. Those are usually the integral, native data types in a programming language, representing numbers of 8, 16, 32 and 64 bits. From a programmer's point of view, this level of parallelism is omnipresent in a modern Instruction Set Architecture (ISA) and well understood. Furthermore, modern CPU architectures are equipped with several forms of ILP, ranging from superscalar pipelines over SIMD to out-of-order execution and Simultaneous Multi-Threading (SMT).

SIMD introduces a form of parallelism that has to be handled explicitly by a programmer. To exploit the full potential of modern architectures, it is crucial to give programmers ways to use those SIMD instructions. Modern compilers made great advancements in auto-vectorization [105, 73]. However, cases exist for which a compiler still needs to be manually instructed to generate the most efficient SIMD code. As such, the toolkit to support high-level parallelism needs to include a way to effectively and efficiently use the SIMD capabilities of modern architectures in a performance portable manner [55, 56]. Defining those APIs is outside the scope of this thesis and will not be discussed further. With SMT we are entering the realm of Thread-Level Parallelism, which is closely related to hiding stalls in the CPU. Furthermore, SMT needs to be explicitly programmed in the same way as one would program a full multi-core CPU. The low-level concurrency support was discussed in Section 3.1. However, the goal of this section is to lay the foundation and discuss the already existing mechanisms for high-level parallelism, which mainly focus on multithreading and hiding the complexity of low-level thread management to offer a performance portable way of expressing parallelism. Based on that, new parallel programming paradigms have emerged over the last decades:

• Fork-Join Parallelism

• Data Parallelism


To form High-Level APIs, the combination of the above-described paradigms with the low-level characteristics defined by the computer architecture needs to be applied. Figure 3.1 gives an overview of the different layers of abstraction that will be introduced in the subsequent sections. These layers of abstraction allow for maximum flexibility in describing specific properties defined by applications and algorithms to allow for maximal exploitation of the underlying hardware. At the very top, we have the parallel paradigms, which are expressed either by meaningful and generic parallel algorithms or by other, paradigm-specific building blocks, like task_block. These can be customized using execution policies and parameters. All higher-level abstractions are then built on top of the underlying, low-level features for concurrency as described in Section 3.1.2.

Execution Policies

Building upon the low-level building blocks, the fundamental ingredients to build higher-level abstractions for parallelism come as policies. Those policies describe the behavior of the algorithm regarding allowed parallelism on the one hand, and additional, optional parameters to control non-functional properties such as chunk size on the other hand. Whether sequential or parallel execution, as expressed by the requested execution policy, is permissible is mostly determined by the nature of the user-provided functions the algorithms use, also referred to as access functions. As such, the order in which the access functions might be executed determines the execution policy to be used to guarantee data-race freedom.

namespace std { namespace execution {
    // A unique, implementation defined type to disambiguate
    // overloading for sequential execution
    class sequenced_policy
    {
        // unspecified
    };

    // A global instance of the sequenced_policy type that can
    // be passed to the parallel algorithms
    constexpr sequenced_policy seq { /*unspecified*/ };
}}

Listing 3.8: Sequential execution policy type and global Object

Sequential Execution Policy In the case where the access functions need to be executed sequentially, that is, no parallelism is allowed, the sequential execution policy guarantees that the algorithm will invoke the access functions in the calling thread in an undetermined order. The types and objects to use can be seen in Listing 3.8.

Parallel Execution Policy To select an algorithm overload where the access functions can be concurrently invoked, the parallel execution policy can be specified (see Listing 3.9). When this policy is selected, the access functions are guaranteed to be executed indeterminately sequenced with respect to one thread and might, therefore, run concurrently across threads. That is, fixed-sized partitions of the algorithm are executed in parallel. Within the partitions, the user access functions are invoked sequentially.

namespace std { namespace execution {
    // A unique, implementation defined type to disambiguate
    // overloading for parallel execution
    class parallel_policy
    {
        // unspecified
    };

    // A global instance of the parallel_policy type that can
    // be passed to the parallel algorithms
    constexpr parallel_policy par { /*unspecified*/ };
}}

Listing 3.9: Parallel execution policy type and global Object

Unsequenced Parallel Execution Policy In the cases where the user access functions do not require any order at all, the unsequenced parallel execution policy can be applied (see Listing 3.10). The unsequenced part refers to the execution of the user access functions within one thread. The unsequenced behavior is essential to be able to vectorize the user-level access functions: due to vectorization, the sequential consistency of the executed code cannot be strictly guaranteed, and it is the user's job to provide user-level access functions that are correct under these relaxed constraints.

The differences between the three primary execution policies are summarized by giving an example order for processing 12 elements and the possible effect of the specified execution policies (see Figure 3.2).

3.2.2. Parallel Algorithms

The already existing C++ Standard library has proven to provide excellent abstractions and generic APIs to apply various algorithms to a range of elements. Those algorithms range from everyday tasks such as iterating over the elements, mutating the elements in a sequence, or sorting.

namespace std { namespace execution {
    // A unique, implementation defined type to disambiguate
    // overloading for parallel unsequenced execution
    class parallel_unsequenced_policy
    {
        // unspecified
    };

    // A global instance of the parallel_unsequenced_policy type that can
    // be passed to the parallel algorithms
    constexpr parallel_unsequenced_policy par_unseq { /*unspecified*/ };
}}

Listing 3.10: Parallel unsequenced execution policy type and global Object


Figure 3.2.: The effect of the different Execution Policies. Sequenced execution executes all elements in the occurring order without parallelism; parallel execution keeps the original sequence of the elements while executing chunks in parallel; parallel unsequenced executes the elements in parallel while not necessarily keeping the original order.

The generality of the algorithms comes from decoupling the actual storage from accessing the elements by providing iterators into containers. An Iterator4 is distinguished by specific categories that identify complexity guarantees of given operations as well as the ability to advance or access elements randomly.

InputIterator: Allows dereferencing and advancing by one element (increment) in the range. After an increment, any copy of the previously incremented iterator is no longer valid. When dereferencing, the value is only readable.

OutputIterator: Allows dereferencing and incrementing the iterator. As with InputIterator, once dereferenced, other copies become invalid. The result of the dereference operation is write-only.

ForwardIterator: This category specifies iterators that can be incremented and that, unlike InputIterator and OutputIterator, stay valid when being dereferenced.

BidirectionalIterator: A refinement of ForwardIterator that in addition also allows the decrement of an iterator, that is, going back one element.

RandomAccessIterator: A refinement of BidirectionalIterator that allows O(1) access to the elements in the described range.

Table 3.2.: The iterator categories and their impact on parallelization

By adhering to those principles, the algorithms that are feasible to be parallelized have been easily identified, and the specification was a straightforward extension. The already existing sequential versions have been amended with overloads taking the previously discussed execution policies (see Section 3.2.1). In addition, the iterator categories determine the effective possibility of parallelization as described in Table 3.2. Apart from InputIterator and OutputIterator, all other iterator categories can be used to write parallel algorithms. However, RandomAccessIterator is suited best due to the efficient generation of working sets. The complete list of parallel algorithms can be found in Appendix A.4, including non-modifying sequence operations (see Table A.2), modifying sequence operations (see Table A.6), partitioning operations (see Table A.8), sorting operations (see Table A.10), binary search operations (see Table A.12), set operations (see Table A.4), heap operations (see Table A.14), minimum/maximum operations (see Table A.16), numeric operations (see Table A.18) and operations on uninitialized memory (see Table A.20).

4 An Iterator is a generic concept to access elements pointing into a range of elements. This range can be either infinite or bound to the elements of a given container. Iterators can be used to traverse containers.

As an example of the expressiveness of parallel algorithms, a naïve implementation of the for_each algorithm is presented here. The algorithm has the following signatures with two overloads:

template <typename InputIt, typename UnaryFunction>
void for_each(InputIt begin, InputIt end, UnaryFunction f);

template <typename Policy, typename InputIt, typename UnaryFunction>
void for_each(Policy, InputIt begin, InputIt end, UnaryFunction f);

The first is the classic version of the algorithm, while the second one has an additional Policy argument. This policy specifies the required execution policy. InputIt denotes the type of the iterator for an exclusive range [begin, end). That is, for each element from begin to end, the UnaryFunction f will be called. It is important to note that the first overload and the sequential execution policy do not impose any restrictions on the iterator category, whereas the parallelized version requires at least ForwardIterator (see also Table 3.2).

This algorithm can now be naïvely parallelized by dividing the range into mostly equal-sized chunks (assuming that the number of elements is evenly divisible by the number of threads), distributing those chunks to different threads and joining those threads afterward. Section 5 will dive deeper into possibilities on how to further improve this algorithm to adapt to different architectural features to allow for true performance portability. The naïve version can be formulated as follows:

template <typename InputIt, typename UnaryFunction>
void for_each(execution::parallel_policy, InputIt begin, InputIt end,
    UnaryFunction f)
{
    std::size_t length = std::distance(begin, end);
    std::size_t num_threads = std::thread::hardware_concurrency();
    std::size_t chunk = length / num_threads;

    std::vector<std::thread> threads;
    // Distribute the work to our concurrently running threads.
    for (auto it = begin; it != end; std::advance(it, chunk))
    {
        auto chunk_end = it;
        std::advance(chunk_end, chunk);

        threads.emplace_back(
            [it, chunk_end, f]() { std::for_each(it, chunk_end, f); });
    }

    // Wait on everyone to finish
    for (auto& thread : threads)
    {
        thread.join();
    }
}

It is important to note that, while this implementation achieves the goal of parallelizing the algorithm, critical functionality is missing, such as control over where the work is run, fine-tuning of the chunk size, and constraints regarding the OS-Threads used; in addition, the parallel region is followed by an implicit barrier.

3.2.3. Fork-Join Based Parallelism

Figure 3.3.: Fork-Join Model: One Master forks several tasks to run in parallel on dif- ferent workers. Once each task is completed, the parallel region is joined onto the master thread and serial execution continues instead of executing everything sequentially.5

Another essential building block for parallel applications is based on the concept of Fork-Join (see Figure 3.3). The importance of this parallel programming design pattern lies in its simplicity and its ability to be applied gradually to an application. In a sense, the parallel algorithms discussed in the previous section can be seen as representatives of the fork/join family of parallel constructs as well. The topic of Fork-Join Based Parallelism will not be part of C++17 but of the Parallelism TS V2 [70].

Instead of building a fine-grained, constraint-based flow of the parallel algorithm, the fork-join pattern demands that each task that has been forked in a given region must be joined at the end of said region before the computation continues. The C++ Parallelism TS V2, therefore, specifies the task_block, defining low-level APIs to perform fork-join based task decomposition.
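To illustrate the fork-join style task decomposition provided by the task_block, the following sketch sums the values of a binary tree; it assumes the Parallelism TS V2 header <experimental/task_block> and the namespace std::experimental::parallel, which vary between implementations (HPX, for instance, ships an equivalent interface under its own namespace):

#include <experimental/task_block>

struct Node { Node* left; Node* right; int value; };

int sum(Node* n)
{
    if (!n) return 0;
    int left = 0, right = 0;
    std::experimental::parallel::define_task_block([&](auto& tb) {
        tb.run([&] { left = sum(n->left); });     // forked task
        tb.run([&] { right = sum(n->right); });   // forked task
        // all forked tasks are implicitly joined when the task_block ends
    });
    return left + right + n->value;
}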

3.3. Evolution

After having discussed the ingredients of the C++ Language, as well as the C++ Standard Library, concerning concurrency and parallelism, this section discusses the extensions that will be needed to allow the C++ ecosystem to tackle the requirements imposed by the High-Performance Computing as well as the Embedded Computing landscape.

The challenges of the near future are dictated by energy-aware as well as performance portable programming of maintainable software, and they evolve towards distributed computing. This section will continue to highlight the evolution of the C++ Standard by introducing the important concepts. At the very heart of this is the concept of executors, forming the basic foundation of abstraction to control where and when tasks are being executed (see Section 3.3.1). Section 3.3.2 will discuss the related challenges and solutions for heterogeneous computing, which shares common problems with distributed computing. This section will close by discussing Futurization as an application of CPS (see Section 3.3.3), which, applied naïvely, easily turns an application into a hard to maintain "callback hell", and how coroutines (see Section 3.3.4) help to keep such code maintainable.

3.3.1. Executors

In Section 3.2.1 the notion of Execution Policies was introduced. Execution Policies are important to allow for an API that can select between parallel and sequential algorithms, depending on the user's needs. As recent developments in the C++ Standards Committee [68, 48, 78] suggest, this isn't enough. Apart from being able to specify the type of parallelism, the ability to select different subsets of a given set of Processing Units (for example NUMA Domains), accelerators or even remote sites becomes more and more necessary to cope with recent developments in heterogeneous computer architectures.

5 https://commons.wikimedia.org/wiki/File:Fork_join.svg

Instruction Stream: Code to be run in a form appropriate for the target execution architecture.

Execution Architecture: Denotes the target architecture for an instruction stream. Possible target execution architectures include accelerators such as GPUs or RPC. The execution architecture may impose architecture-specific constraints and provides architecture-specific facilities for an instruction stream.

Execution Resource: An instance of an execution architecture that is capable of running an instruction stream targeting that architecture.

Execution Agent: An instruction stream is run by an execution agent on an execution resource. An execution agent may be lightweight in that its existence is only observable while the instruction stream is running. As such, a lightweight execution agent may come into existence when the instruction stream starts running and cease to exist when the instruction stream ends.

Execution Context: The mapping of execution agents to execution resources.

Execution Function: An execution function targets a specific execution architecture and represents an instruction stream capable of running on the respective execution context.

Executor: Provides execution functions for running instruction streams on a particular, observable execution resource. A particular executor targets a particular execution architecture.

Table 3.3.: Concepts needed to define the execution of a given high level C++ program on any given Computer/Software Architecture.

Table 3.3 defines the basic concepts needed in order to generalize the notion of an Executor. In principle, an Executor is the extensible high-level abstraction defining where to run a specific function. To provide those functionalities, we need a notion of context, as well as the associated resources. An Executor might be a user-defined type, which needs to be categorized into specified groups (see Table 3.4).

Executors defined using those concepts can then be used to dispatch tasks to them. The concrete syntax and extension points are not finalized yet. Nevertheless, experiments building upon the concepts can be conducted, and field experience can be obtained. As an example usage (as defined in [78]), we can look at the following listing:

BaseExecutor: Forms the basic requirements for an executor. It is capable of returning the Execution Context.

OneWayExecutor: Provides an execute function that operates with fire&forget semantics.

TwoWayExecutor: Capable of submitting an execution function to the underlying context, returning a future holding the result.

ThenExecutor: In addition to submitting a passed function to the execution context, this executor has the ability to defer execution until a predicate future has finished execution.

BulkOneWayExecutor, BulkTwoWayExecutor, BulkThenExecutor: Same as the respective non-bulk versions with the addition of being able to submit the respective operations in bulk.

Table 3.4.: Executor Categories define the different capabilities for an executor in order to derive the respective semantics of the associated operations.

// execute an async on an executor:
auto future = std::async(my_executor, task1);

// execute a parallel for_each on an executor:
std::for_each(std::execution::par.on(my_executor),
    data.begin(), data.end(), task2);

// make require(), prefer(), and properties available
using namespace std::experimental::execution;

// execute a non-blocking, fire-and-forget task on an executor:
require(my_executor, oneway, never_blocking).execute(task1);

// execute a non-blocking, two-way task on an executor. prefer
// to execute as a continuation:
auto future2 = prefer(require(my_executor, twoway),
    continuation).twoway_execute(task2);

// when future is ready, execute a possibly-blocking
// billion-agent task
auto bulk_exec = require(my_executor, possibly_blocking, bulk, then);
auto future3 = bulk_exec.bulk_then_execute(task3, 1<<30, future2,
    result_factory, shared_factory);

3.3.2. Support for heterogeneous architectures and Distributed Computing

One of the most significant features missing in the C++ Programming Language, and especially the standard library, is support for heterogeneous architectures and distributed computing. Although those two seem, at first sight, to have little in common, they share certain aspects that make them sufficiently similar from a high-level perspective to find a common and unified high-level abstraction API. Those similarities are:

• Addressing of remote objects

• Data Movement (implicit and explicit)

• Remote Procedure Calls

The term heterogeneous computing implies that a given computing system does not consist of only a single computer architecture with a set of given properties, but of multiple ones. These different architectures mean that we need a programming interface to express offloading to a given remote computing device, and to move data. Most commonly, heterogeneous architectures are to be found when integrating application-specific accelerators (for example DSPs), general purpose Graphical Processing Units (GPUs) or FPGAs.

For accelerators, we can observe various technological developments to support low-level programming of such devices. Those technologies are CUDA [18], OpenCL [75], SYCL [9], OpenMP [76] and OpenACC [74], with different features and goals for programming accelerators. Whereas the latter two are based on compiler directives (pragmas), and are therefore unsuitable as a basis for generic, high-level abstractions in the style of the C++ Standard Library, the remaining three form a good foundation to experiment with different API designs and trade-offs to generate programmable and performance portable solutions for heterogeneous programming.

On the other hand, support for distributed computing lags behind without a clear solution in sight. The only effort undertaken by the C++ standardization committee is in the form of a Networking TS [69], which merely focuses on plain message passing; this roughly translates to the ability to move data from one device to another in the context of heterogeneous computing using accelerators. Completely missing, however, is the support for remote procedure calls. As the marshaling of data between host and (accelerator) devices is currently done implicitly by the presented solutions (see Section 4.3.2), a well-defined data layout or standardized introspection support would help to drive this development further.

3.3.3. Futurization

Requesting support for heterogeneous architectures and distributed computing requires techniques that can mitigate the unavoidable latencies introduced by explicit communication among different computing entities. It is important to provide facilities that enable and encourage the usage of asynchronous operations while allowing them to overlap with other work. This technique is usually referred to as overlapping computation and communication. The approach presented here is called Futurization and makes use of the so-called "Continuation Passing Style (CPS)" by utilizing asynchronous function calls that return futures.

Continuation Passing Style (CPS)

The origins of CPS programming stem from the realm of functional languages [93, 58]. Functions passed as arguments are used to continue the caller-defined algorithm; in other words, a generic extension to recursion. Another important application was found in the need to express side effects, for example I/O operations, in a strict functional environment [79]. In the context of asynchronous operations, CPS allows for the elegant formulation of data and control flow dependencies [94]. Modern C++ leverages this technique by allowing continuations to be attached to a future, the handle representing the asynchronously computed result (see Section 3.3.3). These observations are the basic principles behind the idea of futurization, now expressed in a multi-paradigm programming language.

Composability of Futures

The std::experimental::future::then function (the same holds for the shared_future variant) allows for sequential composition of future return values; more complex control flows in real applications require support for composing more than one future value. The composition of futures using continuations is called Futurization. That is, an implicit dependency graph is constructed by composing the asynchronous return values another calculation depends on into a single future. Therefore, applying Futurization creates a DAG of tasks that the runtime system can schedule once all input futures have become ready.

template <typename... Futures>
auto when_all(Futures&&... futures);

Listing 3.11: And-composition of futures: when_all returns a future which is marked ready if, and only if, all input futures have become ready

The support for arbitrary complex control flows can then be implemented by providing facilities to AND (see Listing 3.11) and OR (see Listing 3.12) compose futures.

template <typename Sequence>
struct when_any_result
{
    Sequence sequence;
    std::size_t index;
};

template <typename... Futures>
future<when_any_result<tuple<decay_t<Futures>...>>>
when_any(Futures&&... futures);

Listing 3.12: Or-composition of futures: when_any returns a future which is marked ready as soon as one of the input futures has become ready
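As a usage sketch (written in the style of the Concurrency TS; the unqualified names future, async and when_all as well as the continuation signature are assumptions that differ slightly between implementations such as HPX), an and-composition followed by a continuation looks as follows:

future<int> a = async([] { return 20; });
future<int> b = async([] { return 22; });

// The continuation receives a single future holding the tuple of input
// futures once all of them have become ready.
future<int> sum = when_all(a, b).then(
    [](auto all) {
        auto futures = all.get();
        return std::get<0>(futures).get() + std::get<1>(futures).get();
    });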

Control Flow Transformation

As such, the basic idea of Futurization is to apply simple rules to a given control flow that allow turning serial code into an asynchronous version that potentially runs in parallel using CPS programming techniques.

int fib(int n)
{
    if (n < 2) return n;
    int left = fib(n-2);
    int right = fib(n-1);
    return left + right;
}

Listing 3.13: A naïve, recursive implementation of the Fibonacci sequence

Function Calls and Recursion The first and most obvious way of futurization is that of a function call: Instead of calling the function directly, we can dispatch it with a call to async and use the returned future as a dependency for the next calculation. One use case for this is recursive algorithms. As a demonstration, we chose a naïve implementation of the Fibonacci sequence, using the recursive mathematical formulation (see Listing 3.13).

int fib(int n)
{
    if (n < 2) return n;
    future<int> left = async(fib, n-2);
    int right = fib(n-1);
    return left.get() + right;
}

Listing 3.14: Turning recursion into an asynchronous task of computations on the example of a naïve implementation of the Fibonacci sequence

By applying the rule for function calls, we futurize the first recursive call to obtain a future for the result (see Listing 3.14). It is obvious to see that the left and right part of the calculation can be executed in parallel. The distinct disadvantage is that the function needs to block until the left branch has finished the calculation.

future<int> fib(int n)
{
    if (n < 2) return make_ready_future(n);
    future<int> left = async(fib, n-2);
    future<int> right = fib(n-1);
    return when_all(left, right).then(
        [](auto both) {
            auto futures = both.get();
            return std::get<0>(futures).get() + std::get<1>(futures).get();
        });
}

Listing 3.15: Turning recursion into an asynchronous task of computations on the example of a naïve implementation of the Fibonacci sequence without blocking

To overcome the described shortcomings, we use continuations to avoid the explicit blocking and suspension of the thread of execution. Attaching a continuation is to be preferred since suspended tasks use system resources like stacks that need to be stored. It is, of course, a small trade-off, since we need to store the continuation as well, which is usually significantly smaller than the complete stack of a task. The result of the complete transformation can be seen in Listing 3.15. The operation that needs the asynchronous results is the addition, which is executed as the continuation once both branches of the computation have been completed; this closely adheres to CPS.

Conditional Branches Apart from function calls, the other necessary control structure is the conditional branch, expressed as an if-statement. At first sight, extracting parallelism from an if-statement in isolation seems to be hard. The benefit comes in multiple ways. One way is to asynchronously launch either the ThenBlock or the ElseBlock (see Listing 3.16). The advantage we gain is that the resulting future can be used for further futurization, or other work that does not directly depend on the result can be performed after the conditional.

// if-statement:
// if (Condition)
//     ThenBlock
// else
//     ElseBlock

future res;
if (Condition)
    res = async(ThenBlock);
else
    res = async(ElseBlock);

Listing 3.16: A futurized version of an if-statement. Instead of invoking the ThenBlock or ElseBlock sequentially, they are executed asynchronously.

In contrast to Listing 3.16, the condition itself might be represented by an asynchronous result. Listing 3.17 directly waits on the future to execute the respective blocks based on the result of the future. However, to completely futurize it, we use the technique demonstrated in Listing 3.18.

if (ConditionFuture.get())
    ThenBlock
else
    ElseBlock

Listing 3.17: In this example, the Condition is represented by a future. Listing 3.18 will show the complete futurized version by attaching a continuation.

The decision on which branch to take is now deferred to the point when the asynchronous operation's result becomes available, by attaching a continuation to the relevant future. It is important to note that the variations presented in Listing 3.16 and Listing 3.18 can be combined to represent a fully futurized version of an if-statement, and the resulting future can be returned from the continuation.

auto future = ConditionFuture.then(
    [](auto ConditionReadyFuture) {
        if (ConditionReadyFuture.get())
            ThenBlock
        else
            ElseBlock
    });

Listing 3.18: A futurized version of Listing 3.17. Instead of waiting explicitly on ConditionFuture, a continuation is attached that is executed whenever it becomes ready and the regular if-statement is then executed.

Loops As the last necessary control structure, loops can be futurized as well. Listing 3.19 shows a canonical form of a while-loop. The futurization of such a loop is performed by combining the transformation rules as described for the if-statement and recognizing that such loops can be expressed recursively. As such, we dynamically construct the asynchronous control flow graph by recursively calling the DoBlock.

while (Condition)
{
    DoBlock
}

Listing 3.19: Code snippet showing the basic syntax of a while-loop. DoBlock is exe- cuted iteratively as long as Condition is true.

The futurized while-loop is shown in Listing 3.20 where the transformation towards the asynchronous recursion can be observed.

A for-loop is transformed in the very same way as the while-loop, noting that every for-loop is trivially convertible to a while-loop. After those basic transformation rules have been applied, the resulting futurized loop (see Listing 3.21) is straightforward. As with the other, inherently dependent control structures, the gain of this transformation is the ability to return a future from a function containing the loop, allowing the caller to either explicitly wait or perform other work in parallel.

Futurization and Parallel Algorithms One critical aspect of futurization lies in the ability to compose asynchronously, and possibly concurrently, executed threads of control into an implicit DAG that represents the data flow of the application in question. It is important to note that futurization is most effectively applied when used in the whole program. As such, it is of utter importance to have higher-level abstractions return futures as well. The high-level abstractions presented in Section 3.2.2 are the parallel algorithms. In order to embed the parallel algorithms into the context of futurization, it is essential to allow them to return futures as well. As an extension to the already existing execution policies, an additional task policy is under consideration to allow for effective futurization of high-level abstractions.
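A sketch of such a futurized parallel algorithm, in the style later implemented by the HPX runtime system (the names hpx::for_each and hpx::execution::par(hpx::execution::task) are assumptions that differ between HPX versions):

std::vector<double> v(1000000, 1.0);

// par(task) asks the algorithm to return immediately with a future
// instead of blocking until all iterations have completed.
hpx::future<void> done =
    hpx::for_each(hpx::execution::par(hpx::execution::task),
        v.begin(), v.end(), [](double& x) { x *= 2.0; });

// The returned future composes with the surrounding futurized control flow.
auto next = done.then([](hpx::future<void>) { /* dependent work */ });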

std::function<void(future<bool>)> WhileContinuation;
WhileContinuation = [&](future<bool> ConditionFuture)
{
    if (ConditionFuture.get())
    {
        DoBlock
        WhileContinuation(async(Condition));
    }
};

future<void> while_loop = async(Condition).then(WhileContinuation);

Listing 3.20: The building blocks for futurizing the while-loop as shown in Listing 3.19. The execution of DoBlock is deferred and attached as a continuation, returning the next evaluation of Condition asynchronously. This in turn can then be used to execute the next iteration.

3.3.4. Coroutines and Parallelism

Futurization is a promising concept when it comes to expressing asynchronous data dependencies. However, explicit CPS will become cumbersome and convoluted. One solution to that problem is the mechanism proposed in the technical specification for C++ extensions for coroutines [91].

Listing 3.22 shows the great simplification that results from the definitions the Coroutines TS provides, which specifies how to implement coroutines in standard C++. As, for example, in Python, the specification introduces three new keywords:

• co_yield: Yields a value from a function, returning control to the caller. The caller might return control back to the callee after consuming the yielded value.

• co_return: The function returns control back to the caller. The execution of the function has been completed; control cannot be returned to it again.

// for-loop:
// for (Init; Condition; LoopStatement)
// {
//     DoBlock;
// }
std::function<future<void>(future<bool>)> ForContinuation;
ForContinuation = [&](future<bool> ConditionFuture)
{
    if (ConditionFuture.get())
    {
        DoBlock
        LoopStatement;
        return async(Condition).then(ForContinuation);
    }
    return make_ready_future();
};

Init;
future<void> for_loop = async(Condition).then(ForContinuation);

Listing 3.21: The building blocks for futurizing the for-loop. A canonical transformation of the for-loop to a while-loop (see Listing 3.19) is performed and the continuations are applied accordingly (see Listing 3.20).

future<int> fib(int n)
{
    if (n < 2) co_return n;
    future<int> left = async(fib, n-1);
    future<int> right = async(fib, n-2);
    co_return co_await left + co_await right;
}

Listing 3.22: The futurized version of the Fibonacci sequence implemented with co_await and co_return, showing the simplification due to the implicit continuation provided by the compiler using the semantics as defined in the Coroutines TS.

• co_await: The first two keywords allow the specification of the coroutine behavior from within the coroutine's perspective. However, calling a coroutine might now be an asynchronous operation: the result might or might not be available immediately and might be returned as a future value. co_await allows for the consumption of such an asynchronous result by suspending the coroutine until it becomes available.

It is important to note that the mechanism to transport a value out of a coroutine is expressed in terms of a future type. The proposed solution is a cooperation between compiler-based transformations and library-defined constructs.

That is, co_yield and co_return are defined in terms of co_await. For a co_await expression to be valid, the right-hand side expression has to either support an operator co_await expression or directly provide the member functions await_ready, await_suspend and await_resume.
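A minimal sketch of a type that is usable on the right-hand side of co_await through these three member functions (shown with the C++20 header <coroutine>; the TS placed the equivalent facilities under std::experimental):

#include <coroutine>

// A trivially ready awaitable wrapping a value; co_await ready_value{42}
// inside a coroutine yields 42 without ever suspending.
struct ready_value
{
    int value;

    bool await_ready() const noexcept { return true; }     // never suspend
    void await_suspend(std::coroutine_handle<>) const noexcept {}
    int await_resume() const noexcept { return value; }    // result of co_await
};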

By having the already ratified APIs of the C++ standard and the additional concepts discussed in this section at our disposal, the goal of the remaining chapters is to discuss the implementation of a library that not only supports the current C++ standard, but also allows for efficient Futurization, while giving explicit control over all available computational resources by providing a programming environment that is unified in terms of syntax and semantics, using executors for shared memory, accelerators and distributed computing targeting HPC.

4. The HPX Parallel Runtime System

The C++ Language and its standard library (see Section 3) form the groundwork for modern C++ APIs. As such, the HPX Parallel Runtime System is an implementation of the described concepts with additional extensions for distributed and heterogeneous computing. The runtime system is built on lightweight user-level tasks and embraces asynchronous programming to interface with the various runtime system services as well as to provide higher-level APIs and design patterns for parallel programming. The governing principles behind HPX have been laid out by the ParalleX execution model [27], which has been the theoretical foundation of HPX and essentially identifies the following factors:

• Starvation

• Latency

• Overheads

• Waiting for Contention

Combined, these factors are postulated to be the inherent properties of parallel programs that limit scalability. The ParalleX model, therefore, embraces the usage of lightweight threads and fine-grained synchronization using data flow techniques, and combines this with a dynamic, active global address space (AGAS).

With the execution model and a programming standard serving as the primary formal specification, the following sections outline the fundamental principles on which the runtime system is built and the components that provide the core mechanisms. The execution model drove the internal runtime system implementation, and the C++ Standard acts as the user-facing frontend.


Figure 4.1.: Overview of the HPX Runtime System components: the thread manager, the performance counters, the AGAS service, and the parcel handler resolving remote actions.

Figure 4.1 shows a schematic of the different components interacting with each other. The remainder of this section will discuss the basic concept behind each component and its implementation details. At the core is the thread manager, which provides the underlying mechanisms for local thread management (see Section 4.1). The second component is AGAS, which will be discussed in Section 4.2. The support for asynchronous programming in a distributed programming environment is implemented in the parcel handler component (see Section 4.3), providing the one-sided active messaging layer. This messaging layer is the foundation for an efficient implementation of AGAS, which provides an abstraction over globally addressable C++ objects that support RPC. The implications of the combination of these concepts will be highlighted in Section 4.4, which presents the fully unified API for local and remote operations that forms a natural extension to the C++ programming language. To be able to support runtime adaptivity, HPX exposes certain performance counters that can be queried globally and are described in Section 4.5.

4.1. Local Thread Management

As the starting point for the discussion of a parallel runtime system, the shared memory management of threads needs to be discussed. The thread management is implemented based on the assumption that the thread and synchronization primitives provided by the Operating System (OS) come with an unwanted overhead and that a high-performance runtime system needs to be in charge of how the scheduling of its tasks works. As such, HPX implements lightweight user-level tasks that operate in an N to M mapping, where N is the number of tasks and M is the number of cores to run the tasks on. The user-level tasks need to be suspendable to support a useful asynchronous API (see Section 4.4). One implication of this property is that tasks can be migrated from one OS thread to another, which allows load imbalances to be addressed easily. The overheads, and therefore the performance, of a parallel application are furthermore influenced by the overall task scheduling policy.

The essence of lightweight user-level tasks is that each task has its own stack. As such, it operates as if it was its own thread of execution in the sense of Definition 1. The HPX-Threads offer an interface to yield, suspend and resume, providing the necessary capabilities to implement all higher-level APIs. An entirely conforming implementation of std::thread has been built on top of this interface.
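A minimal sketch of this interface (header paths are assumptions and differ between HPX versions): hpx::async and hpx::thread mirror their standard counterparts but schedule lightweight HPX-Threads instead of OS threads.

#include <hpx/hpx.hpp>
#include <hpx/hpx_main.hpp>

#include <iostream>

int main()
{
    // Schedules a lightweight, suspendable HPX-Thread on the thread manager.
    hpx::future<int> f = hpx::async([] { return 42; });

    // Same interface as std::thread, but backed by a user-level task.
    hpx::thread t([] { /* runs as an HPX-Thread */ });
    t.join();

    std::cout << f.get() << '\n';
    return 0;
}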

Figure 4.2.: A HPX-Thread context containing room for stack space, configurable for different sizes, as well as a set of registers (the address of the function to execute, a trampoline function and its parameter, the return address, and the callee-saved registers %rbp, %rbx, %rsi, %rdi and %r12 to %r15) to save the state of the currently executing task if it needs to be suspended. This figure shows an example for the x86 64-bit architecture.

To accomplish user-level context switching, HPX relies on handwritten assembly instructions to switch between tasks. As such, each task needs to have a previously allocated memory region for stack usage, a memory region to save registers (see Figure 4.2) and an architecture and operating system specific routine to perform the switch. The effect of this design decision is that the scheduling itself is non-preemptive. As a consequence, running tasks have to cooperate with each other and either suspend now and then, or the set of tasks to be executed has to be fine-grained enough, that is, their runtime has to be relatively short, to allow for maximal throughput and fair scheduling.

Other parallel runtime systems usually choose not to allow their tasks to suspend in a lightweight fashion. The advantage is that stacks and register contexts do not have to be maintained, and invoking a task consists merely of a regular function call. However, the disadvantage is that as soon as synchronization among different tasks is required, the core executing the tasks needs to rely on coarse-grained, operating system based concurrency primitives. The most significant consequence is that this specific core is not able to continue making progress with other tasks that might be pending. This lack of a forward progress guarantee might lead to specific deadlock situations. Furthermore, the ability to suspend and resume tasks in a lightweight fashion is the basis to implement fine-grained, constraint-based synchronization.

Finding the optimal task scheduling policy for a given application or algorithm is often a non-trivial process. For that reason, the predefined policies are interchangeable at runtime.

global: The global scheduling policy maintains one global queue for the tasks to schedule. This policy represents a naïve approach to implement a thread stealing, multi-core scheduling queue.

static: The static scheduling policy maintains one queue per core and disables task stealing altogether, making it suitable for applications with heterogeneous workloads that benefit from a fixed task-to-core assignment.

thread local: The thread local policy is the default policy, as it works decently for all classes of applications. This policy maintains one queue per core. Whenever a core runs out of work, tasks are stolen from neighboring queues.

All policies can be configured to operate in either a First In, First Out (FIFO) or a Last In, First Out (LIFO) mode to determine the order in which ready tasks get executed. In addition, new scheduling policies can be provided as well.

4.2. Active Global Address Space

After having the lightweight user-level tasks available as described in the previous section, the next necessary module within HPX is AGAS. Its primary purpose is to serve as the foundation to support distributed computing; it is related to other global address space approaches, represented by the Partitioned Global Address Space (PGAS) family, whose purpose is to hide explicit message passing.

In PGAS, the global address space is indeed exposed as if it were partitioned into various segments of memory, where each segment resides in a specific, local address space and requires a software layer to translate accesses from a global view of the data to a local representation of dereferenceable addresses. This property is directly exposed to the user, who, in general, works on symmetric memory segments [109].

AGAS is different in that it exposes globally addressable data. That is, instead of directly exposing memory segments, it exposes addressable objects with a GID and therefore is a representative of global object spaces [1, 54] (see Section 4.2.2). This naturally aligns with the desire to have an asynchronous RPC mechanism that composes nicely with globally addressable objects (see Section 4.3.1).

Based on this design decision, we can derive three properties that prove to be valuable for correctness and ease of programming:

• Global, automatic, garbage collection

• Transparent location of objects

• Syntactic equivalence of local and remote operations

Since each object living in AGAS is represented by a GID, the location of an object does not need to be bound to a particular locality, which leads to uniformity and independence of whether an object is located remotely or locally. In addition, this independence allows for transparent migration between different process-local address spaces, with AGAS being responsible for proper address resolution (see Section 4.2.4). Of course, when dealing with GIDs that refer to an object, with those GIDs not being bound to a particular location, it is important to define proper lifetime semantics. Section 4.2.3 discusses the global reference counting scheme implemented within HPX to improve correctness and avoid lifetime issues. Last but not least, it is also possible to register a symbolic name for a particular GID (see Section 4.2.2), avoiding the need to publish GIDs via global collective operations.

The AGAS layer itself consists of the following four subcomponents.

The Locality Namespace is responsible for resolving a GID representing a given locality to the endpoints by which the locality can be reached (see Section 4.2.1). The Primary Namespace is responsible for resolving all other GIDs (representing an object in the global address space) to local addresses; this will be discussed in Section 4.2.2. This part will also cover the Component Namespace that is responsible for managing which component, or object, can be created at which particular locality.

Other than registering or resolving a name to a GID, the user does not directly interface with AGAS. Upon an attempt to invoke an RPC on a locality or component, the runtime queries AGAS automatically and potentially hands control over to the parcel layer to manage the communication (see Section 4.3). The GIDs are managed and guaranteed to be unique by AGAS, which maintains a globally coherent counter.

4.2.1. Processes in AGAS – Localities

When running a distributed HPX application, the different processes that are part of the application are called localities. Figure 4.3 presents a brief overview of a distributed HPX application where each locality represents one process that communicates with the help of the parcelport to span the global address space. By that, each locality can be addressed with a GID to invoke free-standing (possibly remote) function calls. By having GIDs assigned to those processes, AGAS can dynamically manage connecting and disconnecting localities, allowing the footprint of a distributed application to be reduced or increased.

Like all other globally addressable objects, localities are represented as an instance of hpx::id_type. Listing 4.1 presents the basic set of functions to query all existing localities currently connected. This API is related to the MPI functionality of querying the current rank as well as the overall number of ranks within the current communicator. Due to the dynamic nature of HPX, the choice of using a GID representation leads to far more flexibility to resize the application's footprint, that is, to add or remove processes.

4.2.2. C++ Objects in AGAS – Components

Another cornerstone of the functionality of AGAS is the ability to put any C++ object inside the global address space, making it remotely addressable. Figure 4.4 gives a brief, high-level schematic of a potential HPX application with multiple localities having different objects allocated in the global address space and multiple GIDs pointing to them. It is important to note that a single object can be referenced from multiple localities. That is, we can have multiple instances of an hpx::id_type containing the same GID, and each of those instances holds the object alive. Upon migration, these references remain valid. The GIDs can point to a component that is local or remote.

Figure 4.3.: Schematic of a distributed HPX application representing the different localities with the various HPX modules (per-locality memory spanned into a global address space, parcelport, AGAS, CPUs and GPUs). Each separate process is called a locality.

namespace hpx {
    // An opaque representation of a GID
    class id_type;

    // Returns the GID of the current locality
    hpx::id_type find_here();

    // Returns a vector containing all localities connected at
    // the time of the call
    std::vector<hpx::id_type> find_all_localities();

    // Returns a vector containing all remote localities
    // connected at the time of the call
    std::vector<hpx::id_type> find_all_remote_localities();
}

Listing 4.1: The API responsible for querying the different connected localities. With hpx::id_type being the fundamental type to represent any globally addressable object, the remaining functions serve the purpose of querying the current locality's GID, all connected, or all remotely connected localities.
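As an illustration, a minimal usage sketch of this API follows. The printed output and the exact header names are assumptions for this example only; the queried functions are those of Listing 4.1.

    #include <iostream>
    #include <vector>
    // HPX headers omitted for brevity; names may vary between HPX versions.

    void print_localities()
    {
        hpx::id_type here = hpx::find_here();
        std::vector<hpx::id_type> localities = hpx::find_all_localities();

        // Every connected process, including the current one, is addressable
        // through its GID.
        std::cout << "running on " << localities.size() << " localities\n";
        for (hpx::id_type const& loc : localities)
        {
            if (loc == here)
                std::cout << "  (this one is the current locality)\n";
        }
    }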

Creating Objects in AGAS

Due to restrictions of the C++ programming language, such as not being able to introspect arbitrary types dynamically, some manual registration boilerplate is needed to register objects. This registration enables possible remote interactions, including the creation of an object as well as remotely calling member functions on this object.

Figure 4.4.: Schematic of a distributed HPX application with multiple localities and various components allocated in the global address space, with different GIDs (gray) pointing to those objects (blue) in the global address space.

Listing 4.2 presents the outline of the C++ classes and macros that need to be used. In HPX, each object that is supposed to be allocated inside of AGAS needs to derive from hpx::components::component_base, and a factory function for remote creation based on a unique name needs to be instantiated. The base class does not impose any runtime overheads; its sole purpose is to provide management facilities to support the runtime. For this boilerplate to disappear, significant changes to the C++ Standard would have to be made to support generic reflection of types.

namespace hpx {
    template <typename Component>
    class component_base;

    template <typename Component>
    class component;
}

HPX_REGISTER_COMPONENT_FACTORY(component<...>);

Listing 4.2: Necessary wrappers for enabling C++ classes to be components. All components must derive from hpx::component_base and then register a factory using the hpx::component template.

The allocation functions in Listing 4.3 can be used either to create a single component on a given target locality or to bulk-allocate count components on the specified locality. Those functions give us a basic set of functionality that we can use to further specify various other things like distribution policies for distributed data structures.

template <typename Component, typename... Ts>
hpx::future<hpx::id_type> new_(hpx::id_type where, Ts&&... ts);

template <typename Component, typename... Ts>
hpx::future<std::vector<hpx::id_type>> new_(hpx::id_type where,
    std::size_t count, Ts&&... ts);

Listing 4.3: To allocate objects in AGAS, an API resembling the syntax of new has been created. It takes the component to be created as a template argument, and the location as well as constructor arguments. The second overload supports bulk creation of objects.
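For illustration, a minimal usage sketch of this allocation API. The component type hello_component and its constructor argument are hypothetical; the component is assumed to be registered as sketched in Listing 4.2.

    // Create one component on a (possibly remote) locality and obtain its GID.
    hpx::id_type target = hpx::find_all_localities().back();
    hpx::future<hpx::id_type> f = hpx::new_<hello_component>(target, 42);
    hpx::id_type component_gid = f.get();

    // Bulk-create several components on the same locality in one call.
    hpx::future<std::vector<hpx::id_type>> many =
        hpx::new_<hello_component>(target, 8, 42);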

Symbolic Namespace

As previously discussed, a GID can be bound to an arbitrary, user-defined string literal, and said names can be resolved again. This feature is one cornerstone for building protocols for distributed data structures and other related algorithms like scatter and gather functionalities or distributed partitioned arrays.

The exposed API allows for registration/resolution of single GIDs as well as registering/resolving multiple objects sharing the same base name, using different indices to identify a specific object. This functionality is useful to build up loosely or tightly coupled MPI-communicator-like structures without the need for global communication or synchronization.

namespace hpx { namespace agas {
    // Register a GID with a name
    hpx::future<bool> register_name(std::string const& name,
        naming::id_type const& id);

    // Resolve a GID based on a name
    hpx::future<hpx::id_type> resolve_name(std::string const& name);

    // Register a GID using a base name and sequence number
    hpx::future<bool> register_with_basename(std::string const& base_name,
        hpx::id_type const& id, std::size_t sequence_nr);

    // Resolve a GID using a base name and sequence number
    hpx::future<hpx::id_type> find_from_basename(std::string const& base_name,
        std::size_t sequence_nr);

    // Resolve all GIDs using a base name
    hpx::future<std::vector<hpx::id_type>> find_from_basename(
        std::string const& base_name, std::size_t count);
}}

Listing 4.4: Functions related to registering, resolving and unregistering arbitrary names with an hpx::id_type. The functions return futures that either mark the completion of the registration or return the resolved GID.
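As a usage illustration, a minimal sketch of how these functions can build an MPI-communicator-like ring without global communication. The base name string is arbitrary and chosen for this example; hpx::get_locality_id() is used to obtain a rank-like index.

    // Each participating locality registers its own GID under a common base
    // name, using its locality number as the sequence number.
    std::size_t rank = hpx::get_locality_id();
    hpx::agas::register_with_basename("/my_app/ring", hpx::find_here(), rank).get();

    // Later, any locality can look up its right neighbor directly.
    std::size_t num = hpx::find_all_localities().size();
    hpx::id_type right_neighbor =
        hpx::agas::find_from_basename("/my_app/ring", (rank + 1) % num).get();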

4.2.3. Global Reference Counting

One important aspect when dealing with C++ applications is the proper management of the lifetime of resources. In modern C++, lifetime management is typically handled by objects with value semantics: an object is the unique owner of its resources, and the lifetime is controlled by the special member functions such as constructors, destructors and assignment operators. For dynamically allocated objects, the tools of choice for C++ programmers are smart pointers. Smart pointers come in different flavors; for example, std::unique_ptr represents unique ownership semantics, that is, the contained pointer is released when the corresponding object goes out of scope, while copying is prohibited and moving such an object transfers ownership. For shared ownership semantics, C++ offers std::shared_ptr, which employs a reference counting scheme such that the last pointer going out of scope releases the allocated resource.

Objects in AGAS are dynamically allocated, and a GID bears a certain resemblance to a void pointer. The resolution of a virtual address, as exposed to the programmer by most modern operating systems, is handled with a combination of operating system and hardware support. HPX, however, resolves a GID to a locality-local virtual address solely in software. Figure 4.5 sketches the general layout used to fulfill this purpose. The algorithm for resolving a GID to a local virtual address is discussed in Section 4.2.4.

Figure 4.5.: A GID is a 128-bit number (field layout from MSB to LSB: prefix, rc, identifier, with bit boundaries at 0, 32, 92 and 127). The different bits are used for various purposes such as the initial locality where it was allocated, the credit and a locally unique ID to resolve the GID to the local virtual address. The prefix determines the target locality, rc stands for reference count and contains the credit counter; the remainder acts as the identifier for lookup of the local virtual address.

The need for sophisticated lifetime management arises from the fact that remote operations might still be in flight while references to the component have gone out of scope, or a scope depends on a remote operation to be completed. The effect of this is the need for a system supporting shared ownership. To allow multiple references, that is, shared ownership distributed among various localities, the scheme implemented in HPX is global reference counting. This method of garbage collection not only follows the direction of the C++ Standard but is in line with the state of the art in distributed garbage collection [82, 65]. The decision to use reference counting as the means for garbage collection was taken to provide deterministic object lifetimes. The disadvantage of only supporting acyclic references was considered less severe than the global synchronization necessary to perform mark-and-sweep style algorithms.

Global reference counting in HPX is based on global credit-based reference counting [66] in combination with local reference counting. The lifecycle of a component starts when it is constructed inside the primary namespace of AGAS. Once constructed, the credit of the GID is filled, and the global reference count inside the AGAS management structures is set to 2^32, since the credit represents the log2 of its contribution to the global reference count. This GID, filled with credits, is then passed to the caller who requested the allocation. The corresponding hpx::id_type is locally reference counted and has shared ownership semantics. Once all references to this GID go out of scope, the credit stored inside the GID is returned to AGAS, and a decrement of the global reference count is performed. Once the global reference count reaches 0, AGAS can destroy the object.

Once an hpx::id_type is passed as an argument or parameter to an RPC that needs to be sent to a remote locality, the contained credit is split. This happens in the serialization process as described in Section 4.3.2. The actual split operation divides the credit by two. If the result of this splitting would lead to a credit count of 0, we need to increment the global reference count to assign new credits. During serialization we logically create two separate references to the same object; as such, the global count needs to be increased by (2^32 - 1) * 2 = 2^33 - 2 to fill the two created GIDs. Only after this reference increment has completed is the split operation completed.
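The following is a simplified sketch of the credit-splitting logic just described. The struct and field names are illustrative assumptions and do not reflect the actual HPX implementation, which performs the refill asynchronously through AGAS.

    #include <cstdint>

    struct gid
    {
        std::uint16_t log2_credit;   // credit stored as the log2 of its contribution
        // prefix and identifier bits omitted
    };

    // Returns the GID to be sent with the serialized parcel, adjusting the
    // credit of the locally retained GID in place.
    gid split_credit(gid& local)
    {
        gid sent = local;
        if (local.log2_credit == 0)
        {
            // Credit exhausted: AGAS is asked to add (2^32 - 1) * 2 = 2^33 - 2
            // global references, after which both GIDs are refilled.
            local.log2_credit = sent.log2_credit = 32;
        }
        else
        {
            // Halving the credit corresponds to decrementing its log2.
            --local.log2_credit;
            sent.log2_credit = local.log2_credit;
        }
        return sent;
    }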

This credit-based technique, combined with local reference counting, minimizes the network round trips needed to ensure the proper lifetime of all objects in AGAS. That is, an actual reference increment occurs only after the credit has been exhausted, and the local reference counting avoids network round trips altogether. The only round trip needed is when an instance of hpx::id_type drops to a local reference count of 0, where the remaining global credit has to be given back to AGAS. To mitigate the cost of these potential remote operations, the requests to decrement the credit are cached and sent over the wire in batch mode, either periodically or when a given threshold is reached.

Resolving GIDs with the help of a symbolic name (see Section 4.2.2) also contributes to the global reference count. When resolving a name, the returned GID’s credit is filled up in the same way as if the object had been newly allocated.

4.2.4. Resolving GIDs to Local Addresses

An integral part of the operation of AGAS is the resolution of GIDs to a pair of locality and local virtual address. This address resolution enables transparent handling of objects in the global address space by eliminating the need for users to know where the actual objects are located. With GIDs being unique in the system, transparent migration is not only possible but feasible.

The basic translation process from a GID to a local memory address is described in Figure 4.6. The process is started whenever a function on a global object is invoked. At first, we determine whether the GID is managed on the same locality as the call. If it is, the address is immediately resolved, and the action can be scheduled locally. The prefix encoded in the GID (see Figure 4.5) denotes the service locality. That is, in the case of an object that has been migrated to a different locality, the object does not necessarily reside there anymore, and the local resolution might reveal that the component is located on a different locality. In that case, a parcel (see Section 4.3) will be sent to the destination. The same happens if the service locality is remote. The routing process defers the execution of the action to the locality where the component is determined to reside. At the same time, AGAS needs to be queried to resolve the prefix to the actual endpoint address of the destination in order to send the encoded parcel over the wire.

To mitigate the costs of asking AGAS to resolve GIDs, a cache using a Least Recently Used (LRU) replacement strategy is employed. The only time a cache entry needs to be explicitly invalidated is when a component is migrated.

Figure 4.6.: Interaction between AGAS and the Parcelhandler. On the sending side, the parcel is encoded, AGAS is asked to resolve the destination, and the message is passed to the network. On the receiving end, the parcel is decoded, AGAS is queried to resolve the destination GID to the actual virtual address, and the resulting action is scheduled as a task on the thread manager.

The GID resolving algorithm is invariant to the case where a component is not found on a previously resolved locality, for example in the presence of an outdated cache entry; the request is then routed directly to the service instance. This technique avoids global barriers when migrating single components at the possible cost of a false-positive AGAS cache hit, providing the foundation for a resilient global address space.
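To illustrate the resolution cache, the following is a minimal sketch of an LRU cache mapping GIDs to resolved addresses. Key and value types are simplified assumptions; the real AGAS cache is more involved.

    #include <cstdint>
    #include <list>
    #include <unordered_map>
    #include <utility>

    using gid_key = std::uint64_t;                       // simplified GID key
    struct resolved_address { std::uint32_t locality; void* lva; };

    class lru_resolution_cache
    {
        using entry = std::pair<gid_key, resolved_address>;
        std::size_t capacity_;
        std::list<entry> entries_;                        // most recently used at front
        std::unordered_map<gid_key, std::list<entry>::iterator> index_;

    public:
        explicit lru_resolution_cache(std::size_t capacity) : capacity_(capacity) {}

        // Returns false on a miss; the caller then falls back to AGAS.
        bool lookup(gid_key gid, resolved_address& out)
        {
            auto it = index_.find(gid);
            if (it == index_.end())
                return false;
            entries_.splice(entries_.begin(), entries_, it->second);
            out = entries_.front().second;
            return true;
        }

        void insert(gid_key gid, resolved_address addr)
        {
            auto it = index_.find(gid);
            if (it != index_.end())
            {
                it->second->second = addr;
                entries_.splice(entries_.begin(), entries_, it->second);
                return;
            }
            entries_.emplace_front(gid, addr);
            index_[gid] = entries_.begin();
            if (entries_.size() > capacity_)              // evict least recently used
            {
                index_.erase(entries_.back().first);
                entries_.pop_back();
            }
        }

        // Explicit invalidation, only needed when a component is migrated.
        void invalidate(gid_key gid)
        {
            auto it = index_.find(gid);
            if (it != index_.end())
            {
                entries_.erase(it->second);
                index_.erase(it);
            }
        }
    };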

Even though a local cache exists for optimization purposes, AGAS itself does not provide any coherency guarantees, which avoids the need for complex protocols. Instead, the focus is on fine-grained synchronization among different objects, in the same way as in the C++ memory model. There is no ordering among different invocations on global objects; coherency has to be ensured with local synchronization primitives by the designer of the global object.

4.3. Active Messaging

Since HPX as a runtime system also supports distributed computing, mechanisms for exchanging messages over a network interconnect are inevitably required. Predominantly, the preferred mechanism in HPC is message passing. Message passing in itself can be categorized as follows:

• Two-sided Communication

• One-Sided Communication

The current HPC landscape is dominated by two programming paradigms. The first is represented by MPI. MPI offers a wide variety of different communication primitives for passing messages among processes. It is best known for its two-sided communication primitives, where the sending side issues a send operation that contains the buffers, the size of the buffers, a tag and the destination. The destination has to post the receiving part of this operation proactively. Those operations are extended by asynchronous versions. Global communications follow the same principle by requiring all participating processes to issue the particular MPI call. With MPI-3, the newest incarnation of the message passing standard, MPI also received support for one-sided communication. One-sided communication is characterized by requiring only one active process to send or receive a message. This usage mode has been influenced by Remote Direct Memory Access (RDMA) facilities in modern HPC network fabrics (e.g. InfiniBand). The benefit here is that each process can retrieve or send data independently. Nevertheless, to ensure consistency among the communicating partners, other means of synchronization have to be implemented, such as epochs. The other prominent representative lies within the family of PGAS programming models that extend programming languages (for example UPC [15] and Fortran [72]). PGAS implements message passing through memory distributed over all processes (sometimes also known as symmetric memory). Message passing in those models uses one-sided communication with explicit invocations of global barriers to ensure synchronization.

In the HPX programming model, none of the previously discussed forms of communication seems appropriate. Given the versatile facility of invoking lightweight threads locally, it seems natural to extend it in such a way as to allow for remote invocations of threads as well. This form of communication is often referred to as RPC or Active Messages, where an Active Message usually relates to a message that does not only contain data but also has a specific action attached to it, which is executed once the data has been received on the remote end. An Active Message in HPX is called a Parcel. The parcel is the underlying concept used to implement RPC calls to remote localities in a C++ fashion. That means that we require an arbitrary, yet type-safe, number of arguments, as well as the possibility to return values from remote procedure invocations. This operation is one-sided, as the RPC destination does not need to actively poll on incoming messages, following the semantics of a regular C++ member function call.

4.3.1. Parcels

The essence of a parcel is to define an abstraction that encapsulates all information necessary to create and run a task remotely. A task is a C++ function (the action), which can be a free-standing function or a member function, with a set of user-defined arguments (the payload) and a single return value. The task needs a destination, e.g. a different locality, where it is executed, as well as a method of returning the result to the caller which, in the context of parcels, is referred to as a continuation.

Actions

The obvious and naïve way to identify a function is to take its address. As simple as this solution sounds, it comes with a variety of problems. The most significant issue is that a function pointer is only valid for the currently running process. The static information available during the compilation process is usually an offset into the produced binary, and the actual address is computed by the dynamic linker when the program is started. By solely relying on the function's address, runtime mechanisms would be required to guarantee type safety. That is, tasks would be required to take a signature in the form of void (void* arg0, ..., void* argN).

To remedy this situation, HPX introduces a mechanism to convert a regular C++ function pointer into a unique and portable class that is used to invoke remote procedure calls. The function pointer type, along with the statically known function address, is passed to the class template basic_action (see Listing 4.5), which encapsulates the necessary functionality such as the tuple type to capture the arguments, the result type and the target type (which can be a component or a locality). By having this type, we obtain a regular callable which can be used in the same way as a regular C++ function with objects registered with AGAS or with other localities.

When the target is local, we are able to directly call the function encoded in the action by using the local AGAS address. This is the first step towards the syntactic and semantic equivalence of remote procedure calls and regular C++ function invocations. However, this does not yet solve the issue of transferring arbitrary functions to remote localities. This issue is handled by using the ability to serialize polymorphic base types (see Section 4.3.2). As such, the class transfer_action is derived from the polymorphic base class transfer_action_base (see Listing 4.6). This class hierarchy allows for type-erasure of arbitrary C++ functions with, possibly, user-defined, arbitrary classes in a type-safe manner. As such, a parcel can be loaded from binary data and scheduled without further user interaction. To recreate the concrete transfer_action, we identify actions by a unique name, which defaults to the compiler-dependent Runtime Type Information (RTTI) name (see Listing 4.7).

template <typename F, F f>
struct basic_action;

// Basic action for free functions
template <typename R, typename... Ts, R (*F)(Ts...)>
struct basic_action<R (*)(Ts...), F>
{
    using result_type = R;
    using target_type = void;
    using arguments_type = tuple<Ts...>;
};

// Basic action for member functions
template <typename R, typename Class, typename... Ts, R (Class::*F)(Ts...)>
struct basic_action<R (Class::*)(Ts...), F>
{
    using result_type = R;
    using target_type = Class;
    using arguments_type = tuple<Ts...>;
};

// Shortcut for action creation
#define HPX_ACTION(Function) \
    basic_action<decltype(&Function), &Function>

template <typename Action>
void schedule_action(typename Action::target_type* target_address,
    typename Action::arguments_type arguments)
{
    // register function invocation with thread manager
}

// The C++ function to turn into an action
void foo(int i, double d);
using foo_action = HPX_ACTION(foo);

Listing 4.5: Turning a regular C++ function into a globally unique C++ type that can be used to call any C++ function remotely in a type-safe manner.

struct transfer_action_base
{
    virtual ~transfer_action_base() {}

    virtual void schedule_action(hpx::id_type continuation,
        void* target_address) = 0;
};

template <typename Action>
struct transfer_action : transfer_action_base
{
    typedef typename Action::target_type target_type;

    void schedule_action(hpx::id_type continuation,
        void* target_address) override
    {
        if (continuation)
        {
            schedule_continuation_action(Action{}, continuation,
                reinterpret_cast<target_type*>(target_address),
                std::move(arguments_));
        }
        else
        {
            schedule_action(Action{},
                reinterpret_cast<target_type*>(target_address),
                std::move(arguments_));
        }
    }

    typename Action::arguments_type arguments_;
};

Listing 4.6: Mechanisms to polymorphically transfer actions. The continuations are used for transporting the result of the action.

This mechanism works in settings where all HPX processes run a binary produced by the same compiler. However, as soon as multiple compilers are involved, a portable solution must be provided, which requires the user to specify the name manually.

template <typename Action>
struct action_name
{
    static const char* call()
    {
        return typeid(Action).name();
    }
};

template <>
struct action_name<foo_action>
{
    static const char* call()
    {
        return "foo_action";
    }
};

Listing 4.7: C++ Trait class to identify an action with a unique name

Continuations

The previous discussion about actions did not cover the transport of a return value. For that purpose, we use an object within AGAS which allows setting the result of an action remotely, that is, the return value or an exception. The continuation object is type-erased behind an hpx::id_type. The action to be executed, however, can correctly resolve the actual type needed to set the corresponding return value, since it knows its return type.

Furthermore, continuations, as a generic concept, are used to formulate chains of actions or regular functions to be executed before returning the result to the initiating caller, thereby offering the generic abstraction that the ParalleX execution model describes as Lightweight Control Objects (LCOs) [52].

template <typename Result>
struct continuation
{
    // Upon invocation, the result is either set to a corresponding
    // shared state or used to trigger a following function or
    // action invocation.
    void set_result(Result&& result);

    // Declare the action for remote invocation
    using set_result_action = HPX_ACTION(continuation::set_result);

    // Upon invocation, the shared state is put into an exceptional
    // state storing the received exception, and the exceptional result
    // is propagated appropriately.
    void set_exception(std::exception_ptr);

    // Declare the action for remote invocation
    using set_exception_action = HPX_ACTION(continuation::set_exception);
};

Listing 4.8: Interface for a continuation. This interface describes the basic concept for allowing remote completion notifications and continuation of an action invocation and does not reflect an actual implementation.

Parcel structure

By having defined the necessary data structures to transfer a function, the target and the function arguments in a type-safe manner to remote processes, the underlying parcel structure is defined as in Listing 4.9. By storing the target action as a unique pointer to the polymorphic base class transfer_action_base we can use the built-in C++ virtual dispatch mechanism to delegate the scheduling of the concrete transfer_action, which also stores the necessary arguments to invoke the action. Before dispatching to the derived transfer action, we ask the AGAS service to resolve the opaque target GID to the actual local virtual address necessary to invoke the action on the correct object instance or locality.

4.3.2. Serialization

After covering the parcel and action transfer mechanisms, the remaining open detail is the actual marshaling, or serialization, of the payload. The purest form of serializing data is to copy the content of the payload bit by bit. This method proves impractical for generic C++ types, which might be composed of more than just regular built-in types: it breaks down as soon as the class contains pointers or the type is not trivially copyable, that is, for every class deriving from a pure virtual base class, every class that defines one of the copy or assignment functions, and every class consisting of members with one of those properties.

struct parcel
{
    void schedule_action()
    {
        action_->schedule_action(continuation_, agas::resolve(target_));
    }

    hpx::id_type target_;
    hpx::id_type continuation_;
    std::unique_ptr<transfer_action_base> action_;
};

Listing 4.9: Parcel structure: the action's target, the continuation and the transfer action holding the actual action type as well as the arguments.

As such, a protocol defining the structure of a given type has to be provided for the network layer to send and receive arbitrary data structures over the wire transparently. The MPI Standard provides an API that describes the underlying data types resulting in a structural description; this description is either generated on the fly using MPI_Pack and MPI_Unpack or by creating a custom MPI_Datatype, giving programmers full flexibility to send and receive data. One downside is that it is not possible to scatter different pointers to memory regions without an additional copy [88]. Also, this approach is tailored towards C data structures such as arrays and is not well suited for more complex data structures like linked lists or associative maps.

Other approaches go in the direction of requiring a separate description of the data structures that is compiled into language-specific serialization and deserialization functions (for example protobuf [95] or the Interface Definition Language (IDL) [90]). One significant advantage of this approach is that it allows a programming-language-independent way of marshaling data, with the disadvantage of requiring a separate tool in the compilation process. Also, most of those tools require an intrusive adaptation of the user-defined types. The advantage, however, is that the definition of the type always stays consistent with the serialization code.

The approach we wanted to follow was first developed within the Boost.Serialization library (www.boost.org/doc/libs/release/libs/serialization/). Other libraries following this technique are cereal (https://github.com/USCiLab/cereal) and YaS (https://github.com/niXman/yas). The underlying idea is to give the programmer of a given class explicit control over, and syntax for, what to serialize. It is based on operator overloading of two special archive types that hold a buffer or stream to store the serialized data and are responsible for dispatching the serialization mechanism to the intrusive or non-intrusive version (see Listings 4.10 and 4.11). The serialization process in itself is recursive; each member that needs to be serialized has to be specified explicitly. The advantage of this approach is that the serialization code is written in C++ and can leverage all available programming techniques. The generic, user-facing interface allows effective application of the serialization process without obstructing the algorithms with special code for packing and unpacking, while also allowing for optimizations in the implementation of the archives.

// Intrusive serialization; A needs to be default constructible.
struct A
{
    int a;

    // Load data (de-serialize)
    void serialize(input_archive& ar)
    {
        ar & a;
        // Alternative syntax:
        // ar >> a;
    }

    // Save data (serialize)
    void serialize(output_archive& ar) const
    {
        ar & a;
        // Alternative syntax:
        // ar << a;
    }
};

Listing 4.10: Basic interface for intrusive serialization, which allows access to private members of a class.

The HPX serialization framework provides support for serializing all built-in types as well as all C++ Standard library collection and utility types. This list is extended by the HPX vocabulary types such as hpx::future and hpx::id_type with proper support for global reference counting as outlined in Section 4.2.3, which is the main motivation for having a separate serialization layer instead of relying on existing library solutions.

Archives

The serialization archives are the abstraction that implements the logic for serializing objects. The archives hold information about the endianness of the input and adequately select either the intrusive or non-intrusive version of an object to be serialized.

// Non-intrusive serialization; B needs to be default constructible.
struct B
{
    A a;
};

// Load data (de-serialize)
void serialize(input_archive& ar, B& b)
{
    ar & b.a;   // The serialization process is recursive.
    // Alternative syntax:
    // ar >> b.a;
}

// Save data (serialize)
void serialize(output_archive& ar, B const& b)
{
    ar & b.a;   // The serialization process is recursive.
    // Alternative syntax:
    // ar << b.a;
}

Listing 4.11: Basic interface for non-intrusive serialization, adapting existing classes.

The archives have built-in support for integral and floating point types. The archives can hold a type-erased container which is responsible for the bit-wise writing or reading of those integral types. The need for supporting arbitrary containers comes from the fact that different storage might be needed. For example, a container can be written whose purpose is to count the number of bytes needed, accounting for futures that are not ready yet or other asynchronous tasks necessary to perform the serialization. This is an integral part of the serialization facility: the serialization mechanism is invoked, but instead of writing the bytes directly, we only maintain the size of the objects to be saved. After that serialization pass has been completed, we have the exact size to allocate the necessary buffer for the actual serialization process. That way, we can avoid reallocation of memory entirely when dealing with dynamically sized objects.
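As an illustration of such a counting pass, the following is a minimal sketch of a size-counting container; the container interface shown here is a simplified assumption, not the actual HPX one.

    #include <cstddef>

    struct size_counting_container
    {
        std::size_t size = 0;

        // Called by the output archive for every chunk of bytes that would be
        // written; instead of storing data, only the total size is accumulated.
        void write(void const* /*data*/, std::size_t count)
        {
            size += count;
        }
    };

After a first serialization pass over an object with such a container, size holds the exact number of bytes required, so the real buffer can be allocated once. This mirrors the two-pass scheme visible in Listing 4.13, where pa.size() is used to allocate the send buffer.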

Pointers

When dealing with C++ object hierarchies, pointers are often encountered. The class of pointers is broken down into two categories:

1. Containers which use pointers to point to the actual storage of their contained elements. Here, pointers are used to handle the dynamic storage requirements of the containers.

2. Pointers that are used to reference other objects.

For the first case, the serialization framework does not impose any particular requirements since the memory is entirely managed within the container itself and the container offers clear ownership and copy semantics. Those containers are serialized by first saving the number of elements and afterward the stored elements.

The second case, however, requires particular attention. Whenever a class contains a member that is a pointer, it is essential to know the ownership semantics of said pointer; that is, it is crucial for the class to know when the pointed-to object needs to be destructed and deallocated. As a consequence, the serialization process is unable to marshal pointers that are non-owning, as this could lead to memory leaks upon deserialization because it cannot be decided when to delete the newly allocated memory. For raw pointers, no ownership semantics can be derived; as a consequence, by default, the serialization of raw pointers leads to a compile time error. However, to not put unnecessary restraints on expert programmers (a.k.a.: "I know what I am doing, I promise!"), and to allow for the best possible flexibility, the hpx::serialization::raw_pointer wrapper has been added, which marks the serialization of one individual raw pointer as safe. Upon deserialization, a new object is allocated, and ownership is implicitly handed to the deserialized object.

The ownership problem is not unique to serialization. In general, the C++ community tries to avoid manual memory management, which is one of the biggest sources of errors in programming, by offering special types called smart pointers. Those smart pointers are one form of garbage collection and provide automatic memory management by employing Resource Acquisition Is Initialization (RAII), ensuring either unique or shared ownership. std::unique_ptr provides unique ownership semantics and deletes the object in its destructor if the pointer is not null. std::shared_ptr provides shared ownership semantics through reference counting: whenever the number of references to the same object drops to zero (a reference counter is decremented in the destructor), the object gets released. These semantics make it possible to ensure that no memory is leaked during the (de-)serialization process.

When dealing with shared ownership semantics, pointed-to objects should only be serialized once when being archived. For that, each archive maintains a set tracking the pointers to serialize, in order to manage multiple pointed-to objects correctly. This further underlines the difficulty of using raw pointers directly. The use of smart pointer types or collections of objects is therefore not only highly recommended for regular applications but becomes more and more ubiquitous for distributed applications.

Polymorphism

Another important tool for designing C++ libraries, in general, is the use of polymorphism to model object hierarchies and implement generic interfaces with a bounded subset of implementations. One example was discussed in the handling of actions for remote procedure calls (see Section 4.3.1). Since the ability to dispatch a virtual function call to the dynamic type is handled through either references or pointers to a base type, the support for serializing polymorphic objects is built into the pointer serialization facilities.

For serializing a pointer to a base type, the virtual dispatch capabilities of C++ are used. Together with the actual content of the object hierarchy, the name of the most derived type is saved in order to create the most derived object through a factory function that is looked up by that name. For this mechanism to work, derived types have to be registered in a global lookup table.
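The following minimal sketch illustrates such a global name-to-factory registry; all names and signatures are assumptions made for this illustration and do not reflect the actual HPX implementation.

    #include <functional>
    #include <map>
    #include <memory>
    #include <string>

    struct base { virtual ~base() = default; };       // polymorphic base type

    using factory_fn = std::function<std::unique_ptr<base>()>;

    std::map<std::string, factory_fn>& factory_registry()
    {
        static std::map<std::string, factory_fn> registry;
        return registry;
    }

    // Called once per derived type, typically from a registration macro.
    template <typename Derived>
    void register_polymorphic_type(std::string name)
    {
        factory_registry()[std::move(name)] =
            [] { return std::unique_ptr<base>(new Derived()); };
    }

    // During deserialization, the saved name selects the factory that creates
    // the most derived object before its members are loaded.
    std::unique_ptr<base> create_by_name(std::string const& name)
    {
        return factory_registry().at(name)();
    }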

Zero Copy Mechanism

One important optimization for handling data is the ability to avoid copies altogether so as not to waste unnecessary CPU cycles. The mechanism implemented inside the archive is to either directly copy the data bit-wise or, above a certain threshold, put a distinctive mark into the data stream indicating that the data is to be sent directly from a user-provided memory region and can be issued directly to the underlying network hardware. Modern HPC Network Interface Cards (NICs) can perform RDMA operations that can be exploited by that mechanism. A threshold is necessary here since RDMA operations usually incur an overhead for registering memory regions for remote access as well as the necessary exchange of the RDMA address data. Also, an additional, small overhead has to be paid in the (de-)serialization protocol implementation.

Listing 4.12 shows the type trait necessary to flag a given type as bit-wise serializable. Collections with contiguous memory, for example std::vector, can perform the discussed zero copy optimization for their whole range of allocated data if the element type itself is bit-wise serializable.

Support for Futures

Another critical aspect for the HPX runtime system regarding support for serializing arbitrary arguments and return types occurring in actions is the need to be able to deal with asynchronous results, that is, hpx::future. To determine the result of the future, the actual serialization process has to be delayed until the result is ready. This process is combined with the size calculation, since the actual size might also depend on the dynamic nature of the deferred result.

template <typename T>
struct is_bitwise_serializable : std::false_type {};

// Helper macro to mark a given type as bit-wise serializable
#define HPX_IS_BITWISE_SERIALIZABLE(T)                                        \
    template <>                                                               \
    struct is_bitwise_serializable<T> : std::true_type                        \
    {};                                                                       \
/**/

Listing 4.12: HPX type trait to mark a class as bit-wise serializable, that is, allowing the serialization process to be optimized to directly memcpy the data.
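For illustration, a minimal usage sketch of this trait; the struct point3d is a hypothetical plain aggregate of built-in types, which makes it safe to mark as bit-wise serializable and enables the zero copy path for, e.g., std::vector<point3d>.

    struct point3d
    {
        double x, y, z;
    };
    HPX_IS_BITWISE_SERIALIZABLE(point3d);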

The mechanism is implemented as an asynchronous fixed-point iteration that traverses the entire object hierarchy using the serialization interface until all deferred results are completed and the passed parameters are ready to be copied to an underlying buffer that is served to the network. This algorithm is necessary to support the mechanisms of global reference counting (see Section 4.2.3), since that might require an asynchronous operation to the AGAS service instance managing the global object in question.

4.3.3. Network Transport

The topics discussed in this section so far have been the preliminaries to the actual receiving and sending of messages. As already hinted, the messaging scheme of HPX breaks with the prevailing HPC paradigms of point-to-point and one-sided communication. Instead, the focus is on high-performance asynchronous RPCs. The primary motivation behind this design decision is that this mode of communication models the interaction inside of C++ in a much more accurate way and is complemented by the AGAS layer.

The actual network transport inside of HPX is hidden behind the action mechanism. This means that whether a message is transferred over the network is determined by the actual target of an action (see Section 4.3.1) as resolved by AGAS. Whenever the destination, as determined by AGAS, is remote, the action goes through the parcelhandler, which serializes the data and sends it over a specific parcelport to the destination locality. Since the actual network transport depends on the available hardware and infrastructure, this handling is generalized: the actual implementation of the on-wire protocol has been factored out and is provided by a plugin mechanism. Currently, two types of network transport are fully implemented, TCP and MPI. Furthermore, other transports that rely on lower-level libraries such as InfiniBand verbs or GNI are possible. Nevertheless, to reduce maintenance overheads, relying on an additional middleware such as libfabric to handle those low-latency, high-bandwidth interconnects is more sustainable [30, 83], and is the third option.

Generic Parcel Handling

First, let us discuss the generic parcel handling implementation. Listing 4.13 shows a simplified version of the sending procedure. Only the last call to send(dest, p) requires a network transport implementation; the remaining part can be implemented generically to enable optimizations on an algorithmic level without omitting network transport specific optimizations. The algorithm, in essence, is a summary of the concepts mentioned earlier to invoke an action on a remote locality: wait until deferred results have been completed; serialize the data; pass it onto the network transport to send. What is omitted here is the receiving end. A specific network transport implementation needs to have a socket (or similar) available to receive messages in the background and process the data available on the NIC. The decoding of the byte stream into a parcel can then again be formulated in a network agnostic fashion.

One aspect of this algorithm is waiting on deferred results. The naïve algorithm in Listing 4.13 either needs to block until the processed asynchronous results are ready or has to poll. Neither variant seems appealing, and both might result in improper resource utilization or even unnecessarily blocked results. Instead, this loop is futurized, and the fixed-point iteration is formulated as a sequence of continuations, with the actual parcel serialization and sending marking the completion of the fixed-point iteration.

Once the size of the parcel to be serialized has been determined (that is, the fixed-point iteration has converged), a buffer is allocated. In the version outlined in Listing 4.13, this is done by a plain std::vector. To support network transport specific optimizations, such as RDMA-based one-sided putting or getting into a registered memory region, the actual type of the buffer is provided by the network transport implementation.

When looking closer at how a send operation needs to be implemented, it becomes clear that something like a connection has to be established between the sending and the receiving party. In the case of TCP-based communication, the sender needs to connect to a listening socket. This operation adds latency to the actual sending procedure and should be done only once.

void put_parcel(locality dest, parcel p)
{
    parcel_awaiter pa;
    output_archive await_archive(pa);

    // Do the fixed-point iteration to await all deferred results.
    do {
        // Run the pre-serialization step.
        await_archive & p;

        // Stop when all futures have been resolved.
    } while (pa.await_futures());

    // Create a buffer to hold the serialized content.
    std::vector<char> buffer(pa.size());
    output_archive buffer_archive(buffer);

    // Now serialize the parcel content into our archive.
    buffer_archive & p;

    // Send the parcel in a network transport dependent way.
    send(dest, p);
}

Listing 4.13: Generic parcel handling protocol for sending, in simplified pseudo code. The assumption is that AGAS has already determined a remote location, and a created parcel needs to be sent.

For that reason, the parcel handling process is implemented using a caching mechanism to reuse already established channels for sending parcels, reducing the overheads of creating connections or queue pairs. The problem that arises with that is that we can potentially create an infinite amount of open point-to-point connections, exhausting the available resources. The countermeasure is to only allow a given number of connections to a specific locality as well as a maximum number of overall connections. Since NICs are, in relation, an order of magnitude slower at handling network requests than the Central Processing Unit (CPU) is capable of producing them, one also needs to defer outgoing messages when a certain threshold has been reached. The effect of caching connections, and of only allowing a fixed number of connections, is that parcels are now coalesced together, which leads to better resource utilization in terms of bandwidth.
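To make the bounded connection cache with deferred parcels more concrete, the following is a minimal sketch under simplified assumptions; the types, the per-locality limit and the member names are illustrative, not the actual HPX implementation.

    #include <cstddef>
    #include <deque>
    #include <map>
    #include <vector>

    struct connection {};   // stands in for a socket or a queue pair
    struct parcel {};       // placeholder for the parcel type of Listing 4.9

    struct connection_cache
    {
        std::size_t max_per_locality = 2;                 // illustrative limit
        std::map<int, std::vector<connection>> idle_;     // reusable channels
        std::map<int, std::size_t> open_;                 // open connections per locality
        std::map<int, std::deque<parcel>> deferred_;      // parcels waiting to be coalesced

        // Returns true if a channel may be used immediately; otherwise the
        // parcel is deferred and later sent together with others once a
        // channel to this destination becomes available again.
        bool acquire_or_defer(int dest, parcel p)
        {
            if (!idle_[dest].empty())
            {
                idle_[dest].pop_back();                   // reuse a cached channel
                return true;
            }
            if (open_[dest] < max_per_locality)
            {
                ++open_[dest];                            // open a new channel
                return true;
            }
            deferred_[dest].push_back(std::move(p));      // defer and coalesce
            return false;
        }
    };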

The techniques discussed in this section allow for efficient, asynchronous processing of outgoing messages without disrupting the intent of the actual algorithm and, as such, are the basis for the unified asynchronous interface.

4.4. Asynchronous, unified API for remote and distributed computing

The preceding sections discussed the foundation on which HPX is established. This part concludes those architectural design decisions by laying out the substantial higher-level consequences that follow from the lower-level feature sets.

From the foundation of the C++ Standard in combination with the underlying principles, the fully asynchronous API with the ability to host billions of concurrently running tasks is derived. In combination with the AGAS functionalities, this leads to the full equivalence between local and remote operations and the transparent handling of such, which helps productivity and is necessary for applications that need to adapt dynamically at runtime. Therefore, a natural extension of the shared memory C++ memory model and standard library is the goal.

4.4.1. Asynchronous Programming Interface

With the foundation of a lightweight task management system that correctly resembles the general notion of threads (see Section 4.1), dealing with millions of concurrently active threads becomes not only possible but feasible, which is the main ingredient for countering starvation of hardware resources. Furthermore, by reusing the concept of futures as defined in Section 3.1.3, the coordination among different threads of execution becomes manageable due to improved handling of the computed results of a given task, which allows for fine-grained synchronization avoiding coarse-grain barriers. These possibilities make the notion of dealing with the concept of threads directly in application code obsolete. This becomes more and more apparent when integrating the idea of executors (see Section 3.3.1), which decouple resource management from parallel algorithms of any kind entirely. Due to the generic thread properties, such a task-based system is suitable as the foundation of a general parallel runtime system and therefore fit to express all kinds of concurrency and parallelism. The presented mechanism acts as the underlying foundation of all forms of parallelism manifested in Section 3.2, reusing the necessary definitions regarding memory model and sequential consistency as defined in Section 3.1. With this theoretical foundation, the user-facing APIs in the HPX parallel runtime system are designed to be fully asynchronous and provide implementations for all further developments of the C++ standard as discussed in Section 3.3.

With the background of further extending the footprint of the runtime system towards distributed computing, asynchronous APIs have the essential property of seamless latency hiding due to their very nature. In combination with the futurization techniques discussed in Section 3.3.3, a mechanism to hide all factors of SLOW has been devised that is not limited to the distributed use case but is also valid on shared memory architectures.
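As a small illustration of this futurized, latency-hiding style, the following sketch uses the HPX counterparts of the standard asynchronous facilities; headers are omitted for brevity and compute_part is a hypothetical user function.

    int compute_part(int i) { return i * i; }   // hypothetical user function

    hpx::future<int> futurized_sum()
    {
        // Launch independent tasks; nothing blocks here.
        std::vector<hpx::future<int>> parts;
        parts.push_back(hpx::async(compute_part, 1));
        parts.push_back(hpx::async(compute_part, 2));

        // Attach a continuation instead of waiting: the reduction runs as
        // soon as all inputs are ready, hiding the latency of the tasks.
        return hpx::when_all(std::move(parts)).then(
            [](hpx::future<std::vector<hpx::future<int>>> all) {
                int s = 0;
                for (auto& f : all.get())
                    s += f.get();
                return s;
            });
    }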

4.4.2. Equivalence between Local and Remote Operations

The usability of the overall system can be determined by looking at the differences between operations dealing with the various forms of intra-node parallelism (including accelerator support) and those dealing with inter-node parallelism.

In HPC, the dominant solutions are either purely MPI based (excluding accelerators) or rely on other paradigms for intra-node parallelism, such as OpenMP, Intel TBB, Kokkos, or CUDA for accelerator support. Besides, different models, such as PGAS, arose that operate in a similar way to MPI. This combination is often referred to as MPI+X.

The MPI+X model has significant usability difficulties since the programmer needs to switch between two programming models that do not expose good interoperability. The exception is given when substituting X with MPI, which unifies the programming models but lacks support for accelerators.

The HPX programming model, however, attempts to unify intra- and inter-node parallelism under one single paradigm. This unification is made possible through the emphasis on asynchronous APIs as well as through the functionality offered by AGAS.

Also, thanks to recent developments in programming accelerators, the integration of offload-based programming for accelerators becomes possible. The accelerator integration will be discussed in detail in Section 5.2.4.

The complete unification of local and remote operations becomes possible by combining the basic APIs as defined in the C++ Standard (e.g. std::async) with objects located in the global address space (see Section 4.2) and the one-sided active messaging (see Section 4.3.1).

                              Functions                     HPX actions

Synchronous                   f(vs...);                     a()(gid, vs...);
(returns R)

Asynchronous                  hpx::async(f, vs...);         hpx::async(a(), gid, vs...);
(returns hpx::future<R>)

Fire & Forget                 hpx::apply(f, vs...);         hpx::apply(a(), gid, vs...);
(returns void)

Table 4.1.: Equivalence of regular function calls to RPC invocations based on different ways to invoke a function synchronously and asynchronously in C++ and the HPX extensions

Table 4.1 gives an overview of the various ways to invoke regular C++ functions and HPX actions. It is important to note that HPX actions are regular C++ callables (see Definition 10) and, by definition, have no different semantics than regular C++ functions. The only point standing out is the gid parameter, which is of type hpx::id_type (see Section 4.2). At first sight, this seems like a deviation from the regular C++ function call syntax. However, upon closer inspection, this difference does not hold. A GID can be seen as a pointer to an object (either representing a locality or an HPX component in the global address space). Consider the syntax for calling a C++ member function with the signature R (C::*)(T...), where R is the return type, C the type of the class and T... an arbitrary number of arguments. The C++ standard defines INVOKE for member functions as INVOKE(member_function, c, ts...) (see https://en.cppreference.com/w/cpp/utility/functional/invoke), where member_function is a pointer to a member function with the previously mentioned signature, c a pointer to an object of type C and ts... the corresponding arguments.

As such, we provide a full semantic and syntactic equivalence between local and remote function calls. This has the effect that not only local resource management (for example the number of cores), as is the case for shared memory parallelism, but also distributed resource management is completely decoupled. By combining that with executors, we get a usable way to manage resources that is extensible to offloading to accelerators (see Section 5.2.4).
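A minimal sketch mirroring Table 4.1 follows; it reuses the simplified HPX_ACTION macro from Listing 4.5, and square is a hypothetical function used only for illustration.

    int square(int i) { return i * i; }
    using square_action = HPX_ACTION(square);

    void example(hpx::id_type gid)   // gid refers to a locality or component
    {
        // Local, synchronous call vs. remote, synchronous invocation.
        int r1 = square(2);
        int r2 = square_action()(gid, 2);

        // Asynchronous variants, both returning hpx::future<int>.
        hpx::future<int> f1 = hpx::async(square, 2);
        hpx::future<int> f2 = hpx::async(square_action(), gid, 2);

        // Fire & forget.
        hpx::apply(square, 2);
        hpx::apply(square_action(), gid, 2);
    }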

4.4.3. Natural extension to the C++ Standard

To conclude this section, we postulate that the HPX programming model provides a natural extension to the C++ Standard towards distributed computing and enables heterogeneous programming.

By looking at the semantic and syntactic equivalence of local and remote operations as well as the semantics of the objects in the global address space, the extension becomes obvious. Those objects are entirely in line with the C++ programming model in the sense that the observable behavior is unsynchronized (see Section 3.1.1). That means that there might be multiple concurrent accesses to objects (through RPC calls). Those accesses are not synchronized but need to be synchronized using the synchronization mechanisms described in Section 3.1.2. Atomic objects in the global address space can be implemented by providing specific hooks such that each RPC call happens in a sequenced, but unspecified, order. This mechanism cannot be realized lock-free in the generic case.

4.5. Performance Counters

One crucial aspect of achieving performance portability is the means to profile certain aspects of a program intrinsically, allowing for runtime adaptivity, which in turn enables auto-tuning capabilities. The previous sections discussed the overall architecture and fundamental principles of the HPX runtime system. A highly dynamic runtime has been described, with a big emphasis on dynamic, often non-deterministic, properties affecting performance. Such behavior is desirable to serve the use cases of load balancing and dynamically mitigating contention. To further assist in the decision making process, an HPX intrinsic performance counter framework has been devised that can encompass HPX-internal performance metrics and is extensible to supply application-specific parameters or platform-dependent hardware performance counters. Those performance counters are accessible through AGAS: each counter is registered with a specific symbolic name, making it available throughout the system.

All performance counters are accessible using symbolic names. Listing 4.14 lists the basic syntax patterns that those symbolic names follow. That is, the counters are organized following a scheme that is based on directories, with the ability to reference single components or threads.

/objectname{counter}/countername@parameters
/objectname{locality#index/thread#index}/countername@parameters
/objectname{locality#index/total}/countername@parameters
/objectname{locality#index/*}/countername@parameters
/objectname{*}/countername@parameters

Listing 4.14: The performance counter names follow the given patterns. Parameters are optional. objectname is the given performance counter for a specific module, and the locality or thread can either be indexed directly or provided with a wildcard.

One example is to query the total count of created threads on locality 0, which is expressed as /threads{locality#0/total}/count/cumulative. An example of a performance counter taking an argument is the statistics module: /statistics{/threads{locality#0/total}/count/cumulative}/average@500. The instance we want to query is another counter, and the statistics counter takes the average over samples taken every 500 milliseconds. The performance counters are merely presented for the sake of completeness and are not further discussed in this thesis.
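A counter can also be queried programmatically from within an application. The following is a minimal sketch assuming the performance_counter client API and header layout of HPX, both of which may differ in detail between versions:

#include <hpx/hpx_main.hpp>                        // assumed header layout
#include <hpx/include/performance_counters.hpp>
#include <cstdint>
#include <iostream>

int main()
{
    // The counter is resolved through AGAS via its symbolic name.
    hpx::performance_counters::performance_counter counter(
        "/threads{locality#0/total}/count/cumulative");

    // The value is retrieved asynchronously and delivered via a future.
    hpx::future<std::int64_t> value = counter.get_value<std::int64_t>();
    std::cout << "threads created so far: " << value.get() << "\n";

    return 0;
}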

5. Abstractions for High Performance Parallel Programming

The previous chapters provide the foundation for the HPX programming model and highlight the profound embedding into the C++ programming language by deriving the essential programming model (see Chapter 4) based on the definitions found in the international C++ standard (see Chapter 3).

This chapter builds upon those principles to provide further abstractions that improve parallel programming productivity and performance portability. It reiterates the importance of co-locating data and work, especially in the presence of a global address space, and highlights the concept of “work follows data”. Section 5.1 serves as the foundation for the abstractions covered in the subsequent sections. The concept of Targets is introduced to define the link between where data is allocated and where work is executed (Section 5.2). Accompanying data structures that are fully standards conformant yet support the developed concepts to maximize performance are discussed in Section 5.2.6. Section 5.3 gives an overview of other utilities useful for synchronization among concurrently running threads of execution and lays out the support for global collectives as well as a tool for efficient neighborhood communication using channels.

5.1. Co-Locating Data and Work

Figure 5.1 presents the disparity of access times to different memory subsystems. Those different access times stem from different technological advances in the respective areas. While the frequency of processing units could be increased steadily, main memory could not keep up [51]. This development led to the invention of caches and, as manufacturing technologies advanced, to high density and high bandwidth stacked memory

technologies. Due to scalability issues of on-chip interconnects, we are likely to see an increase in the complexity of the memory hierarchy as the number of cores increases. It is important to note that this figure only shows the situation of a single core. The premise of any efficient algorithm should then be to minimize data movement such that the task at hand can complete as quickly as possible; that, in turn, means keeping data accesses as local as possible.

Figure 5.1.: Latency and bandwidth for access to different kinds of memory (CPU registers, caches, high bandwidth memory, main memory, high bandwidth network, file system). This figure is not based on a real architecture but serves to demonstrate the importance of localized execution.

When observing the memory needed for different aspects of a program, the memory space needed to execute an algorithm, the work (usually the binary of the program), is often orders of magnitude smaller than the data that the program uses. As such, the HPX programming model follows the principle of work following data: work should be executed as close to the data as possible to minimize the amount of necessary data movement.

In the context of an RPC to a possibly remote object, co-location of data and work is achieved by the very principle that the work follows the data. By its very nature, an RPC is executed precisely at its destination. In this specific case, the GID based address translation performed in AGAS ensures that only the arguments passed to a function call are sent to the remote locality, while the actual work is executed exactly where the data is located. This property is particularly useful in cases where transparent migration of an object in AGAS is employed to implement a dynamically load balanced application.

In general, a sweet spot exists when also considering the arguments and return values of generalized functions (that is, functions that can be executed locally or remotely). To utilize the “work follows data” principle, the size of those arguments and return values needs to be small relative to the work the algorithm performs. Exceptions to this principle exist; one example is the ghost zone exchange required by specific applications, which usually returns the ghost zone data from the invoked function.

Definition 5.11 (Target of Execution) A target of execution is a specific location in the computing system that can be used to execute a thread and/or allocate memory, with target dependent resources attached (see Table 3.3).

Adhering to the importance of co-locating data and work, the HPX programming model defines the Target of Execution (see Definition 11). This concept closes the gap, in task-based dynamic runtimes, between the execution of work and the allocation of data, which have previously only been considered in separation. The combination is possible by leveraging the concepts defined in the evolution of the C++ Language, namely executors (see Section 3.3.1). That is, targets are the link between the execution context and the memory location.

5.2. Targets in Common Computer Architectures

Having an opaque definition of a Target of Execution (see Definition 11), this section discusses the implementation and concepts for relevant architectures found in modern HPC systems: NUMA architectures (Section 5.2.3), High Bandwidth Memory (Section 5.2.5) and accelerators (Section 5.2.4).

To establish common ground, we first introduce generic utilities to work with targets, that is, the allocation of data (Section 5.2.1) and the execution of work (Section 5.2.2), which define the connections to co-locate data and work given specific targets.

5.2.1. Allocation Targets

For the allocation of memory, the C++ standard defines the Allocator concept. This concept is used to define strategies for access and addressing of objects, allocation and deallocation of memory, as well as construction and destruction of objects. As such, it is a perfect fit to be used in conjunction with our targets and blends perfectly into the landscape of the C++ programming language. As the foundation of our extension, the std::allocator_traits template is used1. It is a generic mechanism to access the underlying implementation of a given allocator without loss of generality. An important feature of std::allocator_traits is that it makes use of std::pointer_traits2, which widens the scope of pointers usable with allocators to further support allocators returning pointers to objects that are not necessarily located in the same virtual address space. With these utilities at hand, we merely need to slightly extend the discussed traits, amending them with an additional target_type member type and a static target_type target() member function to give access to the underlying target. Besides, a mechanism to allocate and construct objects in bulk is useful to mitigate overheads and allow for further optimizations with specialized targets. Target aware allocators nevertheless remain a drop-in replacement for standard allocators and are usable with data structures not designed with targets in mind. Exploiting the presence of targets, however, opens the door to optimizations, as discussed in Section 5.2.6.
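The following is a minimal sketch of what such a target-aware allocator could look like; the names, the (non-static) target() accessor and the placement logic are illustrative only and do not reflect the actual HPX implementation:

#include <cstddef>
#include <new>
#include <utility>

template <typename T, typename Target>
struct target_allocator
{
    typedef T value_type;
    typedef Target target_type;          // extension: associated target type

    explicit target_allocator(Target t) : target_(std::move(t)) {}

    target_type const& target() const { return target_; }   // extension

    T* allocate(std::size_t n)
    {
        // A real implementation would place the memory on target_ (a NUMA
        // domain, a GPU device, ...); this sketch falls back to the heap.
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }

    void deallocate(T* p, std::size_t) { ::operator delete(p); }

    // extension: construct n objects in one go, enabling, for example,
    // first-touch placement or a single kernel launch on specialized targets.
    template <typename... Ts>
    void bulk_construct(T* p, std::size_t n, Ts const&... ts)
    {
        for (std::size_t i = 0; i != n; ++i)
            ::new (static_cast<void*>(p + i)) T(ts...);
    }

    Target target_;
};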

5.2.2. Execution Targets

As with the Allocator concept, we can reuse existing infrastructure to account for the existence of targets and to be able to link targets with the allocation of data and the execution of work. The natural choice is the Executor concept (see Section 3.3.1), which is already sufficiently defined to be used directly with targets. It is important to note that Targets of Execution might not only refer to a specific execution context, but also to places in a system solely related to a particular kind of memory. Executors defined with targets in mind always have specific computational resources attached to them that are close to the memory they are operating on, to avoid potential inefficiencies (such as increased latency) when accessing data.

5.2.3. NUMA Architectures

When looking at the design of modern CPUs, and especially multi-CPU systems, the use of NUMA based designs is becoming more and more prevalent. Traditionally, NUMA existed for multi-CPU systems, which naturally have non-uniform access to memory. This architectural property comes from the fact that each CPU is equipped with a dedicated memory controller but still provides transparent access to all attached physical memory through a common inter-CPU bus (see Figure 5.2). In recent years, the core count of general purpose CPUs has increased. This increase brought up scalability challenges within a single chip due to a commonly shared ring bus, leading to multiple NUMA domains even on a single chip for the most recent incarnations of Intel chips, starting with the Haswell server parts.

1http://en.cppreference.com/w/cpp/memory/allocator_traits
2http://en.cppreference.com/w/cpp/memory/pointer_traits

Figure 5.2.: A schematic of a NUMA architecture. The key property of such a system is that access latencies depend on which processing element accesses which part of memory. This example shows a single Haswell EP scheme exposing two NUMA domains.3

The challenges that arise with such systems are to reduce access latencies and to utilize all memory controllers present in a given system efficiently. For achieving high performance in the presence of NUMA, it is paramount to place memory in the proximity of where it is being accessed by the given tasks. The placement of memory in current OSs in the presence of NUMA is determined by the so-called “first-touch” principle or via explicit mapping of pages through OS specific APIs. The essence is that a set of CPUs is associated with a specific memory controller. This set represents the NUMA target. A NUMA aware allocator is then able to place memory in a given domain determined by the associated CPU set. The placement is achieved either by enforcing a first-touch access inside of the defined CPU set or by explicitly allocating the pages to the memory controller attached to the given set, if supported by the underlying OS. Since access to memory in separate NUMA domains happens inside the same virtual address space, a special block allocator has been devised that takes a list of targets resembling NUMA domains. This eases NUMA aware programming and places the memory to be allocated in uniform partitions across the given NUMA domains. Other partitioning schemes can be implemented in the same manner.

3https://www.heise.de/newsticker/meldung/Intels-Xeon-E5-18-CPU-Kerne-und-AVX2-fuer-Server-2357378.html

NUMA aware executors use a given CPU-set4 as the constraint for placing new tasks to be executed. They use the underlying schedulers as defined in Section 4.1, which perform work stealing within the given NUMA domain. In resemblance to the block allocator, a block executor can be defined that is aware of the data distribution and therefore automatically co-locates data and work based on the same underlying targets.
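The following sketch shows how these building blocks fit together; the type and function names follow the hpx::compute host facilities and are assumptions here rather than verbatim excerpts from the implementation:

#include <hpx/include/compute.hpp>              // assumed header layout
#include <hpx/include/parallel_executors.hpp>
#include <cstddef>
#include <vector>

void numa_aware_setup(std::size_t size)
{
    // One target per NUMA domain of the machine.
    std::vector<hpx::compute::host::target> numa_domains =
        hpx::compute::host::numa_domains();

    // Block allocator: the data is placed in uniform partitions,
    // one per NUMA domain.
    hpx::compute::host::block_allocator<double> alloc(numa_domains);
    hpx::compute::vector<double,
        hpx::compute::host::block_allocator<double>> data(size, 0.0, alloc);

    // Block executor: tasks are constrained to the cores of the domain
    // owning the partition they operate on; work stealing stays inside it.
    hpx::compute::host::block_executor<> exec(numa_domains);
    auto policy = hpx::parallel::execution::par.on(exec);
    (void) policy;   // used with the parallel algorithms (see Listing 6.1)
}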

5.2.4. GPU offloading

Another relevant development in the landscape of HPC architectures is the presence of accelerators in the form of GPUs. In scientific computing, a GPU is often used as an accelerator for number crunching, such as numerical linear algebra routines. Since the original use of GPUs, rendering images, largely requires dense linear algebra as well, and the developments within the computer graphics community asked for programmable processing units (shaders), the scientific computing community discovered the usefulness of those chips. The dual usage results in a symbiotic effect: consumer products with a large margin of profit can be reused by the scientific community to accelerate simulations. In the HPC community, the prevalently used GPUs are manufactured by NVIDIA and programmed with the CUDA toolkit. As such, the remainder of this section will focus on NVIDIA based architectures and discuss the implementation with the help of CUDA. The architecture of a GPU is specialized towards streaming applications that expose a high operational intensity. Figure 5.3 shows the architecture of the newest generation of NVIDIA based processors. The lightweight Streaming Multiprocessors (SMs) bundle processing elements that operate in a SIMD fashion, and each of those bundles can operate independently. Those multiprocessors operate on a large register set and share a specific amount of cache. As a result, the CUDA programming model is centered around providing computational, number crunching, features and cannot efficiently process all required C++ standard features, for example, exceptions or fine-grained control over scheduling tasks. Those limitations often require tedious, architecture-specific fine-tuning. However, it is to be anticipated that accelerator processing units and general purpose processing units will converge more and more as technology advances. One example of this trend is the Intel Knights Landing processor, as well as the ongoing efforts to support more and more use cases for general purpose GPU programming, as seen with the announcement of the Volta architecture5.

4A CPU-set can be understood as the set of numbers relating to physical cores, used to uniquely address them.

Figure 5.3.: The architecture of an NVIDIA Volta GPU. Each GPU is equipped with up to 84 SMs connected via a common L2 cache.6

To support CUDA based offloading of special purpose algorithms that are deemed fit for GPU based architectures, we need to rely on the features provided by the Software Development Kit (SDK) from NVIDIA. In that regard, the CUDA programming model uses a single source approach. That is, functions that are supposed to be executed on the GPU are marked with the special attribute __device__; moreover, the GPU program entry points (kernels) are marked with the special __global__ attribute. Luckily, CUDA supports the C++14 language entirely, and one of its strengths, based on the single source property, is the seamless interoperability of host and GPU based code.

Listing 5.1 shows the basic principle of how we can support GPU offloading of generic functions. The entry point (execute_function) is implemented as a template taking a special Callable object, the Closure, which captures the initially passed arguments and the function to call. In a real application, extra care needs to be taken to support use cases in which the Callable and Closure need to be transferred separately

5http://nvidianews.nvidia.com/news/nvidia-launches-revolutionary-volta-gpu-platform-fueling-next-era-of-ai-and-high-performance-computing
6https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

template <typename Closure>
__global__ void execute_function(Closure closure)
{
    closure();
}

Listing 5.1: GPU offloading of generic functions with CUDA.

to the GPU's memory. The user-defined device functions can then be executed on the specified device. Without loss of generality and for the sake of argument, this is neglected in further discussions but present in the real implementation. By using templates, we instruct the compiler to generate the code for the GPU device, and we can wrap this into a CUDA specific executor that exposes the very same interface as a regular TwoWayBulkExecutor (see Table 3.3). Due to the GPU programming model, however, to get the best performance it is advisable to restrict oneself to the bulk operations exposed.
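As an illustration of how the generic kernel from Listing 5.1 can be used, consider the following hypothetical, stand-alone CUDA sketch; the HPX executor and stream/future integration is omitted, nvcc's extended device lambda support is assumed, and the closure is simply passed by value as a kernel argument:

#include <cuda_runtime.h>

template <typename Closure>
__global__ void execute_function(Closure closure)
{
    closure();
}

int main()
{
    double* d_value = nullptr;
    cudaMalloc(&d_value, sizeof(double));

    // The closure captures the device pointer by value; the CUDA runtime
    // copies the closure object to the device when launching the kernel.
    auto closure = [d_value] __device__ () { *d_value = 42.0; };
    execute_function<<<1, 1>>>(closure);
    cudaDeviceSynchronize();

    double h_value = 0.0;
    cudaMemcpy(&h_value, d_value, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_value);
    return h_value == 42.0 ? 0 : 1;
}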

When talking about allocating data for GPUs, the concept of unified memory appears. With unified memory, the virtual address space of a process is extended with the physical memory available on GPUs, and the driver is responsible for mapping the pages accordingly, migrating data back and forth, and ensuring mutually exclusive access. While this brings a significant advantage in programmability, it also adds uncertainty to performance considerations since memory management becomes implicit and perfect co-location of work and data becomes hard. The similarities to NUMA aware programming are apparent: in both cases, the exact location of the data needs to be known to maximize memory operation throughput and minimize latencies. As a consequence, the allocators for CUDA based programming implement explicit allocation on the GPU devices to maximize flexibility. The integration into regular C++ is achieved by using std::pointer_traits to implement a fancy pointer7 that makes the different physical memory locations explicit, yet translates to regular pointers in device code. As such, a CUDA target represents one GPU device. It is made accessible from the host code, fulfilling every property required by Definition 11. It is important to note that the GPU operations implemented that way take advantage of asynchronous CUDA streams. Those streams allow the creation of a CUDA specific shared state implementation and therefore allow the integration of any CUDA based operation into the context of futurization. This allows for integration into the existing facilities and for composition of the different layers of parallelism in a unified manner.

7A fancy pointer is a C++ object that behaves like a pointer but contains additional information or logic for, for example, dereferencing the pointer value.

5.2.5. High Bandwidth Memory

In recent years, semiconductor manufacturing processes have made significant advancements. Especially the introduction of three-dimensional designs into the manufacturing process led to emerging technologies, usually called high bandwidth memory, that allow for dense, three-dimensional stacking of memory cells. Those cells can be placed on top of, or very near to, the computational cores (see Figure 5.4).

Figure 5.4.: 3D stacked integrated circuits, using the example of high bandwidth memory that is located directly on top of the processing unit8.

This technique has been deployed in the newest generation of GPU devices from AMD and NVIDIA to improve main memory access times and bandwidth. This section, however, focuses on highlighting the benefits of Targets of Execution when dealing with heterogeneous memory. That is, unlike NUMA, such memory can be addressed explicitly and exhibits different performance characteristics. One example that incorporates high bandwidth memory as part of its memory hierarchy is the Intel Knights Landing architecture (see Figure 5.5). In the case of Knights Landing, the HBM memory can either be used as part of the cache hierarchy, in which case it acts as a direct mapped cache shared amongst all cores, or explicitly, to assist memory-bound applications. Since we are interested in explicit memory placement, the latter is the mode this section targets.

As seen from the block diagram shown in Figure 5.5, Knights Landing divides the chip into four quadrants, where each of the quadrants has two dedicated channels to the HBM, which Intel calls MCDRAM. The main aspect to note here is that, similar to a regular NUMA system, we have a set of cores that are close to the memory, even though the memory can only be allocated explicitly into the special memory region. Nevertheless, the important distinction is that the CPU set for memory like MCDRAM needs to be determined based on the underlying architecture, and the corresponding NUMA domains do not necessarily have to be exposed by the operating system.

8http://fudzilla.com/media/k2/items/cache/0b7a50d58ee4ee631879a2c11f39c082_XL.jpg
9http://www.nersc.gov/users/computational-systems/cori/configuration/knl-processor-modes/

Figure 5.5.: Knights Landing layout showing the different tiles in the 2D on-chip mesh network. This shows the partitioning of the cores into 4 sub-NUMA domains9.

5.2.6. Target Aware Container

The concept of a target, as defined in Section 5.2, is useful to present the general notion of different places residing in a system. Targets act as a building block for higher level concepts and have to be usable with data structures that form an abstraction with well-defined semantics. The semantics range from domain specific knowledge, for example a particle in an N-Body simulation, to composable data structures such as containers. For the sake of demonstrating the capabilities exposed by the defined concepts, this section focuses on the discussion of a dynamically allocated array, which can be seen as a generic container, usable in a variety of other, more specialized data structures. The C++ Standard defines std::vector10 for that purpose. This container can be parameterized with an Allocator, making it immediately usable with the target-aware allocators defined in the previous section. Unfortunately, it has no support for the bulk construction facilities usable with target_allocator_traits. This bulk construction is an important optimization for the NUMA as well as the CUDA target. As such, a drop-in replacement for std::vector has been implemented which comes with those optimizations in mind.

5.3. Supporting Abstractions

Data and execution locality is the starting point for HPC applications. As soon as more complex use cases need to be implemented, the need for other abstractions arises, such as synchronization primitives, global collectives and point to point communication.

5.3.1. Synchronization of Concurrent Work

The C++ Standard defines a set of low-level primitives to perform synchronization among concurrently running threads of execution (see Section 3.1). Current implementations of those primitives (except for lock-free atomic operations) use OS specific APIs to perform their duties. By performing blocking calls at the OS level, we prevent other user-level tasks from running on the same thread. To mitigate this problem, all those primitives have been ported to perform their blocking calls via the underlying threading model of HPX (see Section 4.1), enabling cooperative scheduling among other user-level tasks. Those primitives are usable in a shared memory context only; while it has a certain appeal to implement things like mutual exclusion or condition variables in a distributed memory context, the HPX programming model favors localized synchronization, combined with globally addressable objects, which are suitable building blocks for those if needed.
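As a small illustration, the HPX counterpart of std::mutex can be used with the standard lock helpers; the header layout is an assumption, but the key point is that a blocked task suspends instead of blocking the underlying OS worker thread:

#include <hpx/include/lcos.hpp>   // header layout is an assumption
#include <mutex>
#include <vector>

hpx::lcos::local::mutex mtx;
std::vector<int> shared_results;

void record(int value)
{
    // While waiting for mtx, the current HPX task is suspended and other
    // user-level tasks may run on the same worker thread.
    std::lock_guard<hpx::lcos::local::mutex> lock(mtx);
    shared_results.push_back(value);
}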

10http://en.cppreference.com/w/cpp/container/vector

5.3.2. Global Collectives

Apart from the local synchronization primitives, applications often require a global exchange of data. Those requirements come either from algorithmic properties, for example the need to have the result of a scalar product available on all threads of execution, or from setting globally available configuration values.

Figure 5.6 presents the basic idea used to implement global communication patterns. That is, instead of sending N − 1 messages from a single locality, a fan-out is performed to send the messages in a tree-based manner. While other programming models that depend on explicit message passing (like MPI) can benefit from potential hardware support, this is not generically applicable to HPX. Since the target GIDs do not necessarily have to represent localities, but may be arbitrary objects, we cannot assume a fixed set of localities; as such, the creation of a communicator, which itself requires global communication, is not realistic. However, by increasing the fan-out, an HPX implementation can benefit from spawning the children in parallel, as sketched below.
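The following sketch illustrates the fan-out principle; it is not the HPX library implementation, and for brevity the remote delivery is a local placeholder, whereas in HPX it would be an action invoked on the head locality of each chunk:

#include <hpx/include/async.hpp>   // header layout is an assumption
#include <hpx/include/lcos.hpp>
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical leaf operation; in a real HPX program this would be an action.
void deliver(hpx::id_type /*target*/, int /*value*/) {}

hpx::future<void> tree_broadcast(std::vector<hpx::id_type> targets, int value,
    std::size_t fanout = 8)
{
    if (targets.empty())
        return hpx::make_ready_future();

    std::vector<hpx::future<void>> children;
    std::size_t const chunk = (targets.size() + fanout - 1) / fanout;
    for (std::size_t begin = 0; begin < targets.size(); begin += chunk)
    {
        std::size_t const end = std::min(begin + chunk, targets.size());
        std::vector<hpx::id_type> sub(
            targets.begin() + begin, targets.begin() + end);

        // Deliver to the head of the chunk and let it recurse over the rest;
        // every level therefore spawns its children in parallel.
        children.push_back(hpx::async(
            [value, fanout, sub = std::move(sub)]() {
                deliver(sub.front(), value);
                std::vector<hpx::id_type> rest(sub.begin() + 1, sub.end());
                tree_broadcast(std::move(rest), value, fanout).get();
            }));
    }

    return hpx::when_all(std::move(children))
        .then([](hpx::future<std::vector<hpx::future<void>>>) {});
}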

5.3.3. Point to Point Communications

As important as the global communication primitives is the ability to perform point to point communication. As the messaging paradigm can be considered one-sided (see Section 4.3.1), having the ability to perform point to point synchronization and communication is important to efficiently support applications that do not strictly require a BSP style mode of operation. One approach is, of course, to synchronize on futures. However, this only provides synchronization mechanisms for the consumer, leaving out the producer synchronization. In MPI based applications, the mechanism of choice is to make use of the pair of send and receive functions. That is, each send must be matched by a receive. This pairing naturally synchronizes the producer (initiating the send operation) and the consumer, matching the send with a corresponding receive operation. To support this pattern in the HPX programming model, a special type in the global address space has to be created. A common pattern for asynchronous programming models is the concept of a channel [81].

Figure 5.6.: Global collective operation: the basic principle is based on a spanning tree algorithm involving all localities. The arrows denote an RPC call to the next locality, each level spawns its children in parallel, and the execution of the collective is represented as a grey box.

template <typename T>
struct channel
{
    explicit channel(hpx::id_type where);
    explicit channel(std::string name);

    hpx::future<T> get(std::size_t generation) const;

    void set(T t, std::size_t generation);
};

Listing 5.2: The interface of hpx::lcos::channel: a value can either be asynchronously retrieved for a given generation or set. This allows for efficient point to point communication. The set operation is non-blocking. The future returned by the get operation can be used to futurize the control flow.

Listing 5.2 demonstrates the synopsis of an unbounded channel to be used within HPX. Since it represents a global object, one can either connect to it using a previously registered symbolic name or via an hpx::id_type which has been distributed otherwise. The default implementation uses an unbounded queue, where push and pull requests can arrive out of order, each with an associated sequence number. The pull request returns a future that is set ready whenever the respective push has completed, providing the full semantics of a send/recv operation. The sender can either use fire and forget semantics or synchronize on the completion of the send operation asynchronously or synchronously.
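A typical use case is the halo exchange sketched below, written against the interface from Listing 5.2; the channel objects are assumed to have been created and connected via a symbolic name beforehand, and the function names are illustrative:

#include <hpx/include/lcos.hpp>   // header layout is an assumption
#include <cstddef>
#include <utility>
#include <vector>

typedef hpx::lcos::channel<std::vector<double>> halo_channel;

// Producer side: publish this iteration's boundary row; set() does not block.
void send_halo(halo_channel& to_neighbor, std::vector<double> halo,
    std::size_t generation)
{
    to_neighbor.set(std::move(halo), generation);
}

// Consumer side: futurized receive; the continuation runs once the matching
// set() for this generation has completed on the producer side.
hpx::future<void> recv_halo(halo_channel& from_neighbor,
    std::size_t generation, std::vector<double>& ghost_row)
{
    return from_neighbor.get(generation).then(
        [&ghost_row](hpx::future<std::vector<double>> f) {
            ghost_row = f.get();
        });
}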

6. Evaluation

After the distributed, task based parallel programming model based on the HPX runtime system has been discussed and introduced in the previous chapters, this chapter evaluates the presented concepts. In this part of the thesis, the various components of the system are quantified on various platforms. First of all, Section 6.1 introduces the software and versions used to conduct the experiments as well as the hardware used. Section 6.2 discusses various low level experiments to determine overheads, analyzes the amortization of those overheads, and provides a comparison with other, comparable solutions.

The evaluation is concluded by presenting two example applications. The first, a two dimensional heat stencil, serves as an educational example. It puts a focus on programmability and shows the effects of the features discussed, such as futurization (see Section 3.3.3) and the unified programming model supporting heterogeneous architectures. Applying those concepts enables techniques such as latency hiding and achieves performance portability (see Section 6.3.1). The second application, OctoTiger [60, 50, 49], is a multi-physics simulation for binary star systems that has been tuned and pushed to scale as part of the research for this thesis (see Section 6.3.2).

6.1. Benchmark Setup

The setup for evaluating the performance is made up of 6 different systems representing ARM64, x86-64 and KNL CPU architectures to demonstrate performance portability on CPU based systems. To not only measure single node performance, some benchmarks were additionally evaluated on up to 16 nodes for ARM64, 256 nodes for x86-64 and 8192 nodes for KNL. For scaling the number of nodes, the Meggie cluster1 and the Cori

1https://www.anleitungen.rrze.fau.de/hpc/meggie-cluster/


Supercomputer2 have been used. The ARM64 hardware used was a custom built small scale cluster using 16 NanoPC-T3 Plus boards3. The full hardware specifications are summarized in Table 6.1.

ARM64:
  CPU: Samsung S5P6818 Cortex A53
  Frequency: 1.4 GHz
  Cores: 8
  Main Memory: 2 GB DDR3 RAM
  Nodes: 16
  Network Interconnect: Ethernet

X86-64:
  CPU: 2x Intel Xeon E5-2630 v4
  Frequency: 2.2 GHz
  Cores: 2x 10
  Main Memory: 64 GB DDR4 RAM
  Nodes: 728
  Network Interconnect: Intel OmniPath

KNL:
  CPU: Intel Xeon Phi 7250 (Knights Landing)
  Frequency: 1.4 GHz
  Cores: 68
  Main Memory: 96 GB DDR4 RAM, 16 GB MCDRAM
  Network Interconnect: Cray Dragonfly

Table 6.1.: Used hardware systems. For this thesis, various architectures were used to evaluate the performance of the presented programming model, ranging from embedded systems to large scale supercomputers.

Table 6.2 summarizes the used software stack. In order to have results that do not depend too much on the compiler, GCC has been used with the latest version available on the corresponding system. Details about the individual benchmarks are discussed in the respective sections in the remainder of this chapter.

6.2. Low Level Benchmarks

In order to evaluate the entire system, this section introduces micro benchmarks to assess the performance of its individual parts. First, the threading system overheads are determined, including the performance characteristics of futures.

2http://www.nersc.gov/users/computational-systems/cori/
3http://wiki.friendlyarm.com/wiki/index.php/NanoPC-T3_Plus#Hardware_Spec

Name     GCC    HPX      MPI               Boost   Jemalloc  Hwloc
ARM64    8.1.0  fdc279a  OpenMPI 1.10.2    1.67    5.1.0     1.11.10
X86-64   8.1.0  6e19dce  IntelMPI 2018     1.64    5.0.1     1.11.7
KNL      7.3.0  6e19dce  Cray MPICH 7.7.0  1.65.1  5.0.1     1.11.9

Table 6.2.: Used system software. The software used includes the latest version of GCC available on the respective platform as well as the latest versions of HPX, Boost and Jemalloc available at the time of writing.

The second part deals with AGAS and determines the behavior of the complete stack involved. The last part determines the efficiency of the parallel algorithm implementation using the STREAM benchmark [61]. The STREAM benchmark is part of the main HPX repository and the remaining benchmarks can be found at GitHub4. The evaluation uses commit 2943fc2.

Figure 6.1.: Overheads of creating tasks on a single processing unit (in cycles), comparing HPX coroutines, full HPX threads and OpenMP on the various platforms (ARM64, X86-64, KNL), normalized to the cost of calling a plain function on the respective platform.

4https://github.com/sithhell/hpxbenchmarks

6.2.1. HPX Thread Overhead

One of the basic building blocks of the HPX programming paradigm are the user level tasks described in Section 4.1. The claim is that those enable massive parallelism in today's systems. While there is no doubt that future Operating Systems will be able to provide similar facilities, the focus here is to demonstrate the effectiveness of enabling parallelism at as fine a granularity as possible. Grubel [29] discussed the overheads of HPX tasks extensively. In this thesis, I reiterate the findings of her work to reflect the progress of HPX as well as to include other hardware. The experiments conducted in this section are all low level synthetic benchmarks focusing on determining the overheads of the threading system.
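The structure of these synthetic benchmarks is straightforward; the following sketch (not the exact benchmark code, header layout assumed) measures the average cost of spawning an HPX thread for an empty function:

#include <hpx/hpx_main.hpp>        // assumed header layout
#include <hpx/include/async.hpp>
#include <chrono>
#include <cstddef>
#include <iostream>

void empty() {}

int main()
{
    std::size_t const iterations = 100000;

    auto start = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i != iterations; ++i)
        hpx::async(empty).get();   // spawn an HPX thread and wait for it
    auto stop = std::chrono::high_resolution_clock::now();

    std::cout << "avg. HPX thread spawn + sync: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(
                     stop - start).count() / iterations
              << " ns\n";
    return 0;
}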

Figure 6.2.: Scheduling overhead when launching a single HPX thread without contention, broken down into stack creation, context switch and scheduler overhead for ARM64, X86-64 and KNL. The scheduler queue maintenance dominates the runtime cost (roughly 91%) on all platforms.

Figure 6.1 shows the cost of spawning an HPX thread in relation to a regular function call and compares this to the cost of doing the equivalent in OpenMP. Due to the compiler assisted implementation of OpenMP, it is significantly faster than the creation of HPX threads. The HPX coroutine merely creates a stack and performs two context switches. As such, it makes sense that full HPX threads, which get scheduled via the scheduler, are two orders of magnitude slower. The overall results are consistent across the different evaluated platforms. It is, however, interesting to note that the KNL platform shows the worst performance. This benchmark shows the relative cost of spawning HPX threads in the context of no contention in the underlying scheduler. It can be seen that having lightweight tasks is beneficial and allows, with minimal overhead, creating more work than hardware resources are available, ensuring steady utilization due to the work stealing algorithms in the scheduler. The cost of creating Operating System threads is about 500 times higher than the HPX thread equivalent on all platforms. Figure 6.2 presents the relative overhead of the different involved components. It can be clearly seen that, with roughly 91% on all platforms, the scheduling policies contribute the most to the actual runtime overheads.

Figure 6.3.: Costs of creating futures (hpx::make_ready_future, hpx::promise, std::promise) on the different platforms, measured in cycles. In comparison to the standard C++ implementation, creating HPX futures is four orders of magnitude faster.

Work stealing and general scheduling create overhead due to the fact that queues of outstanding tasks need to be managed and shared between the concurrently running scheduling algorithms. In order to assess this scheduling overhead, the same benchmark is repeated, now with a varying number of compute resources. In Figure 6.5 the results of this test on X86-64 are presented; the quality of the results on the other two platforms was similar. The results indicate that user level task scheduling is feasible and provides good scalability without sacrificing performance. The caveat is that, unlike for regular functions, the work executed as part of a task needs to be long enough to amortize the overhead, which should be in the range of 100 ms.

Furthermore, it is interesting how the usage of user-level tasks compares to other, similar solutions, for example OpenMP, which also provides task-based interfaces, albeit without each task being suspendable. The conclusion here is that suspendable user level tasks do not induce a performance hit while offering superior semantics, for example the ability to synchronize concurrent accesses without blocking the underlying computational resource and without oversubscribing the available cores. Nevertheless, the results indicate that HPX is not yet able to leverage very fine-grained tasks when compared to the traditional OpenMP parallel for loop. The comparison with the OpenMP task pragma fares better and shows that the HPX approach is in the same ballpark. Yet again, however, it is important to note that a naive approach with std::async in general creates too much overhead to be considered for fine grained tasks.

Figure 6.4.: Costs of spawning a task and returning its result via a future (hpx::async vs. std::async) on the different platforms, measured in cycles. In comparison to the standard C++ implementation, creating HPX tasks is four orders of magnitude faster.

The benchmarks described above merely handle the case of scheduling tasks. What has come up short so far is the actual transport of results of asynchronously executed tasks. In order to assess that, we first look at the achievable throughput of sequentially creating futures which are immediately in a ready state and then compare that with the actual costs of hpx::async and std::async. Figure 6.3 shows that creating the actual synchronization primitives is relatively cheap in comparison to the Operating System primitive based implementation of the C++ Standard Library. Figure 6.4 compares hpx::async with std::async on the three platforms; that is, in addition to spawning a task, the result is transported to the caller via a future. The comparison between the C++ Standard implementation and HPX is similar to Figure 6.3 and shows that Operating System based synchronization of tasks is not well suited to support fine grained tasking systems.

Figure 6.5.: Scheduling speedup for different task granularities (100 ms, 10 ms, 1 ms, 0.1 ms and 0 ms) on X86-64 over the number of cores, comparing hpx::async, OpenMP parallel for, OpenMP task and std::async against the ideal speedup. The overhead of a runtime system determines the granularity which amortizes it; these graphs show the different approaches and their scalability with different task granularities.

6.2.2. HPX Communication Overhead

While Section 6.2.1 focused on the overheads related to local thread creation, this section puts a focus on the RPC facilities and the closely related AGAS, which is needed to perform RPCs transparently without explicit knowledge of the actual locality.

As the first subsystem to evaluate, we choose the serialization mechanism as laid out in Section 4.3.2. Serialization can be considered the main source of overhead, since it has to be performed before sending and again on the receiving end to further process the received parcel. As such, we compare the proposed solution against other pre-existing solutions to determine the viability of the approach. In addition, the various optimization opportunities implemented are evaluated, which are the essence of enabling further exploitation of the underlying network fabrics available in today's supercomputers.

Figure 6.6.: HPX serialization throughput on X86-64 (bandwidth in GB/s over the number of processed bytes) for plain memcpy, plain serialization and parcel serialization. The serialization overhead in comparison to a plain memcpy operation is significant for small messages below 2 KByte; for larger messages, the overhead is amortized.

Figure 6.6 presents the results of serializing a vector of doubles compared to a plain memcpy operation. It can be seen that the overhead of the serialization mechanism is significant for messages smaller than 4 KByte. This can be attributed to the additional bookkeeping that needs to be performed in a generalized fashion, whereas the memcpy variant merely copies the data from the source to the destination. Considering the advanced mechanisms, such as the zero copy optimization together with the intrinsic support for futures and global reference counting, the HPX solution can be considered feasible, especially for regular payloads, for example as needed by halo exchanges, which easily surpass the 2 KByte mark (an equivalent of 250 double values).

Figure 6.7.: Achievable throughput of messages sent via HPX actions on X86-64 (bandwidth in GB/s over payload size) for hpx async and hpx async direct with window sizes 1 and 128, compared to MPI. An MPI based send/recv solution is faster than a single action call; however, once there are multiple actions in flight simultaneously, they achieve a higher throughput starting at 8 KByte of payload.

Once the serialization overhead has been determined, the underlying asynchronous, one sided messaging layer needs to be evaluated; the next layer is the invocation of a possibly remote action. Therefore, we use hpx::async to invoke a free function on a remote locality. This excludes the AGAS layer, since the locality GID lookup is a constant time table lookup. As such, we combine serialization and the underlying asynchronous, one sided message passing with the continuation mechanism that sets the hpx::future ready once the remote task has completed. In Figure 6.7 the throughput of action calls is presented. The one-sided nature of the messages is one of the biggest sources of added overhead in this benchmark. The benchmark is set up to perform a two sided ping pong message exchange to directly measure the throughput. As a consequence, the HPX based solution falls short in the case where there is no other work to perform (window size = 1). However, once there are multiple messages in flight (window size = 128), this disadvantage disappears after the payload exceeds 8 KByte. Given the additional overhead of scheduling tasks and serializing the data, this result is remarkable and puts HPX in the same ballpark as MPI based communication.

Figure 6.8.: Component creation costs (in cycles) for local and remote creation on the different platforms. The local component creation overhead is almost zero; remote object creation, on the other hand, comes at a significant cost due to the involved RPC.

Of course, the effects of AGAS need to be investigated as well. Figure 6.8 therefore sheds light on the costs involved in creating objects inside of AGAS and compares them to the creation of regular objects. It can be seen that the local creation of objects in AGAS comes at almost no additional cost. The remote case, however, is limited by the overheads of the additional RPC calls as well as the additionally scheduled tasks. Even though object creation plays an important role in the lifetime of a C++ program, the number of function calls on those objects is usually much higher. As such, it is important to measure those overheads as well. Figure 6.9 demonstrates them by comparing regular member function calls with local synchronous and asynchronous member function calls through AGAS as well as remote ones. It can be seen that the quality of the overheads is consistent with the creation of objects. However, it is important to note that there is no significant overhead over action calls to free functions; as such, the results shown in Figure 6.7 apply here as well.

Another important aspect is the direct, neighbor to neighbor synchronization as discussed in Section 5.3.3. In addition to the plain synchronization, a payload is usually associated with it, for example the exchange of halo regions in stencil based calculations. As such, the latency for sending and receiving those synchronizations as well as the bandwidth to exchange the payload is critical. The prevalent benchmark to represent this kind of operation is defined in the OSU benchmark suite [67].

Figure 6.9.: Component action costs (in cycles) for a direct call, a local AGAS lookup and a remote AGAS lookup on the different platforms. The cost of calling actions on components is consistent with the cost of calling actions on free functions; remote calls incur an overhead that is easily mitigated by multiple concurrent actions.


Figure 6.10.: HPX channel throughput on X86-64 (bandwidth in GB/s over payload size) for channel send/recv with window sizes 1 and 128, compared to MPI. The channel implements means for point-to-point synchronization; its performance matches the results of regular actions and it can therefore be seen as a feasible method to implement neighborhood communication.

The results of using a channel for this message exchange in comparison to using MPI can be seen in Figure 6.10. In addition to sending only one ping-pong at a time, we amended the benchmark to keep multiple messages in flight, highlighting the computation-communication overlapping capabilities. The results show a behavior similar to what has been discussed for Figure 6.7.

Figure 6.11.: Broadcast throughput on X86-64 (bandwidth in GB/s over the number of nodes) for MPI and HPX (direct and async) with window sizes 1 and 8 and message sizes of 1 B, 4 KB and 1 MB. The broadcast implementation of HPX shows the same qualitative behavior as the MPI implementation; however, the absolute performance is worse.

Last but not least, Figure 6.11 shows the results of the broadcast implementation inside HPX. Since the OSU benchmark suite also defines a set of benchmarks for the global broadcast collective, this serves as the baseline to compare against. It can be seen that this global collective primitive as implemented in HPX severely lacks performance in comparison to the MPI based implementation, even in the case where we have multiple broadcasts in flight. This has to be investigated further but is outside the scope of this thesis. Nevertheless, the qualitative behavior of the operation is comparable between HPX and MPI.

6.2.3. STREAM Benchmark

To conclude the low level benchmarks, this section demonstrates the efficiency, in terms of achievable bandwidth, of the proposed target of execution infrastructure presented in Section 5.2. The de-facto standard STREAM benchmark [61] has been chosen as a reference and ported to HPX using targets and parallel algorithms.

typedef hpx::compute::vector<double, Allocator> vector_type;
// Creating the allocator
Allocator alloc;
// Allocate the data
vector_type a(size, alloc);
vector_type b(size, alloc);
vector_type c(size, alloc);
// Creating the executor
Executor exec;
// Creating the policy used in the parallel algorithms
auto policy = hpx::parallel::execution::par.on(exec);
// Vector Triad
hpx::parallel::transform(policy,
    b.begin(), b.end(), c.begin(), c.end(), a.begin(),
    [scalar](double a, double b) { return a + b * scalar; });

Listing 6.1: The gist of the port of the STREAM benchmark TRIAD kernel using parallel algorithms in an executor and allocator aware fashion, showing performance portability.

Figure 6.12.: STREAM TRIAD benchmark on the ARM64 system (bandwidth in GB/s over the number of cores, OpenMP vs. HPX, for input sizes of 781.25 KB, 7.63 MB and 76.29 MB). HPX achieves comparable performance for the larger input sizes; the smallest one suffers from too fine-grained tasks.

The benchmark contains four measurements to assess the sustainable bandwidth of a given computational platform: COPY, SCALE, ADD and TRIAD. As a non-performance result, it is important to note that the porting to HPX was straightforward using the parallel algorithms (see Listing 6.1). In contrast to the original implementation, the HPX implementation is platform independent and chooses the target to run on based on the passed executor. The implementation solely relies on the premise that the passed container is close to the executor. The results of running the benchmark on our test platforms can be seen in Figure 6.12, Figure 6.13 and Figure 6.14. The task based implementation of the parallel algorithms does not fall short compared to the data-parallel model of OpenMP for larger input sizes. For smaller input sizes, we have too fine-grained tasks, which are scheduled randomly over the cores. This has the effect that the OpenMP solution, which is not only able to handle finer grained tasks but can also exploit cache effects more efficiently, performs better.

Figure 6.13.: STREAM TRIAD benchmark on the X86-64 system (bandwidth in GB/s over the number of cores, OpenMP vs. HPX, for input sizes of 781.25 KB, 7.63 MB, 76.29 MB and 762.94 MB). HPX achieves comparable performance for the larger input sizes; the smaller ones suffer from too fine-grained tasks, and in addition the OpenMP based solution is capable of exploiting cache effects due to its static work dispatching.

6.3. Benchmark Applications

The low-level benchmarks sufficiently characterize the runtime system on a microscopic level. Nevertheless, to assess the full power of the provided functionality, this

Figure 6.14.: STREAM TRIAD benchmark on the KNL system (bandwidth in GB/s over the number of cores, OpenMP vs. HPX, for input sizes of 781.25 KB, 7.63 MB, 76.29 MB and 762.94 MB). The results follow the same pattern as the ARM64 and X86-64 results.

section will focus on 1) a toy stencil application to reiterate and demonstrate the API discussed in this thesis as well as to show its performance impact, and 2) OctoTiger, a production astrophysics simulation code that has been tuned to run at extreme scales during the course of the research for this thesis.

6.3.1. Two-dimensional Stencil Application

The pattern of stencil based computations is common in HPC applications and can be seen as a basic building block in many scientific codes. As such, this part of the thesis discusses the pragmatic solution of applying the Laplace operator to a function, representing the second derivatives of a function f in space (see Figure 6.15):

u = \Delta f = \frac{\delta^2 f}{\delta x^2} + \frac{\delta^2 f}{\delta y^2}, \quad (x, y) \in [0, 1] \times [0, 1], \quad f(0, y) = f(x, 0) = f(1, y) = f(x, 1) = 1

A numerical solution to this problem can be obtained by discretizing the spatial domain and using the finite central difference method. We obtain a sparse matrix to which the Jacobi method can be applied [32]. As such, the pseudo algorithm to calculate the solution can be formulated as shown in Listing 6.2. That is, each element in the two-dimensional grid is calculated by the mean of the surrounding elements of the previous iteration.

Figure 6.15.: The PDE result: calculated grid elements with a constant boundary of 1. Two grids are allocated to store the results of the previous iteration i and the current iteration i + 1.

for i in 0...I
    for x in 1...Nx
        for y in 1...Ny
            u[i+1](x, y) = 0.25 * (u[i](x-1,y) + u[i](x+1,y) + u[i](x,y-1) + u[i](x,y+1)) - u[i](x,y)

Listing 6.2: The 2D Stencil pseudo algorithm showing the application of a 5-point stencil over the entire grid.

The remainder of this Section will develop a performance portable, parallel solver for this problem using modern C++ techniques which is usable for shared and distributed memory and makes use of available accelerators. The performance implications and drawbacks will be shown during the introduction of the subsequent steps. The full code, with different sources for the different steps can be found at GitHub5. The commit used for this evaluation section is 4814c7c.

5https://github.com/STEllAR-GROUP/tutorials/tree/master/examples

Modeling the Stencil in C++

Figure 6.16.: The 2D Grid with a dimension Nx·Ny.

As the starting point, the foundation is to model the basic C++ classes such that we are able to efficiently express our solver. As a first prerequisite, we want to have a linear, contiguous block of memory to store the grid (see Figure 6.16). Using std::vector for that purpose sounds like a natural choice; however, we want to keep the developed solution generic to account for other possible containers (for example, the target aware vector introduced in Section 5.2.6).


Figure 6.17.: The access to grid elements is implemented over a row-major contiguous block of memory: the i-th row starts at offset i * Nx, and element j within that row is accessed at i * Nx + j.

Indexing an element in the linearized array using two-dimensional coordinates can be expressed with the formula index = (i * Nx) + j, where 0 <= i < Ny and 0 <= j < Nx. That is, i denotes the i-th row and j the j-th column of the underlying grid (see Figure 6.17). With this indexing scheme, a row-major order is realized.
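For reference, the indexing can be captured in a small helper; the function name is illustrative and not part of the original code:

#include <cstddef>

// Row-major index of element (i, j) in the linearized Nx-by-Ny grid.
inline std::size_t index(std::size_t i, std::size_t j, std::size_t Nx)
{
    return i * Nx + j;
}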

Iterators are the go-to solution in C++ for traversing the elements of a container. While the iterator provided by std::vector would serve the purpose of traversing our grid, it becomes cumbersome to calculate the upper and lower neighbors needed for the stencil update (see Figure 6.18). Instead, the algorithm we want to build should have an abstraction to iterate one line after the other while making the access to our upper and lower neighbors convenient (see Listing 6.3).

Figure 6.18.: The 5-point stencil needs access to the rows above and below the current, to be updated, element index.

template <typename InIter, typename OutIter>
OutIter line_update(InIter begin, InIter end, OutIter result)
{
    ++result;
    // Iterate over the interior: skip the last and first element
    for (InIter it = begin + 1; it != end - 1; ++it, ++result)
    {
        *result = 0.25 * (it.up[-1] + it.up[+1] + it.down[-1] + it.down[+1])
            - *it.middle;
    }
    ++result;

    return result;
}

Listing 6.3: The concrete line update of the stencil implemented in C++ via specialized iterators for one particular row in the grid.

Realizing this abstraction can be achieved by developing iterator adapters that are organized hierarchically: one iterator that represents a single line, where an increment advances the element pointed to by one (see Listing 6.4), and one iterator that advances row by row (see Listing 6.5).

In addition to providing the special semantics of a grid traversal, the presented iterators carry along the information needed to access the neighbors as required by the presented algorithm. With this little prelude, the ability to write generic, yet highly domain specific code, resulting in a high level of abstraction, using modern C++ techniques has been presented. It is important to note that these iterators serve as the main building blocks for adding the various forms of parallelism.

template <typename UpIter, typename MiddleIter, typename DownIter>
struct line_iterator
{
    UpIter up;
    MiddleIter middle;
    DownIter down;

    void increment()
    {
        ++up;
        ++middle;
        ++down;
    }

    void decrement()
    {
        --up;
        --middle;
        --down;
    }

    void advance(std::ptrdiff_t n)
    {
        up += n;
        middle += n;
        down += n;
    }

    double dereference() const
    {
        return *middle;
    }
};

Listing 6.4: The stencil line iterator is a thin wrapper over the upper, middle and lower rows to allow for efficient access to the required elements.

template <typename UpIter, typename MiddleIter = UpIter,
    typename DownIter = UpIter>
struct row_iterator
{
    typedef line_iterator<UpIter, MiddleIter, DownIter> line_iterator_type;

    row_iterator(std::size_t Nx, MiddleIter middle_)
      : up_(middle_ - Nx)
      , middle(middle_)
      , down_(middle_ + Nx)
      , Nx_(Nx)
    {}

    line_iterator_type dereference() const
    {
        return line_iterator_type{up_, middle, down_};
    }

    void advance(std::ptrdiff_t n)
    {
        up_ += (n * Nx_);
        middle += (n * Nx_);
        down_ += (n * Nx_);
    }

    UpIter up_;
    MiddleIter middle;
    DownIter down_;
    std::size_t Nx_;
};

Listing 6.5: The stencil row iterator serves the purpose to formulate an abstraction to advance from one row to the next. And allows to derefernce this row to allow to iterate from one element to the next (see Listing 6.4) 6.3. Benchmark Applications 109

With these building blocks in place, the solver can be implemented:

// Initialization
typedef std::vector<double> data_type;
std::array<data_type, 2> U;

U[0] = data_type(Nx * Ny, 0.0);
U[1] = data_type(Nx * Ny, 0.0);
init(U, Nx, Ny);

typedef row_iterator<std::vector<double>::iterator> iterator;

// Construct our row iterators. We want to begin with the second
// row to avoid out of bound accesses.
iterator curr(Nx, U[0].begin());
iterator next(Nx, U[1].begin());

for (std::size_t t = 0; t < steps; ++t)
{
    // We store the result of our update in the next middle line.
    // We need to skip the first row.
    auto result = next.middle + Nx;

    // Iterate over the interior: skip the first and last row
    for (auto it = curr + 1; it != curr + Ny - 1; ++it)
    {
        result = line_update(*it, *it + Nx, result);
    }

    std::swap(curr, next);
}

Adding shared memory parallelism

Having laid out the basic algorithm, and keeping the availability of parallel algorithms in mind, it becomes trivially parallelized (see Listing 6.6). That is, by applying the correct parallel algorithm, the outer loop is parallelized. Unfortunately, the loop construct presented here is not part of standard C++; nevertheless, it is a perfect fit for this application. It resembles a regular for-loop and is, as such, very similar to how one would parallelize a loop with #pragma omp for. The important difference is that it is a first-class C++ citizen and, as such, supports our iterator wrapper out of the box. The induction6 is performed in each iteration.

// We store the result of our update in the next middle line.
hpx::parallel::for_loop(policy, curr + 1, curr + Ny - 1,
    // We need to advance the result by one row each iteration
    hpx::parallel::induction(next.middle + Nx, Nx),
    [Nx](iterator it, data_type::iterator result)
    {
        line_update(*it, *it + Nx, result);
    });

Listing 6.6: Stencil parallelized, shared memory

Adding Support for different Compute Targets

(Plot: Stencil Example (Single Node Scaling); GLUPS over the number of cores on ARM64, X86-64 and KNL, for 1, 2, 4 and 8 partitions.)

Figure 6.19.: Stencil performance on various architectures using shared memory parallelism. The behavior shown matches the expected memory-bound nature of the application for ARM64 and X86-64.

Having the basic parallelization in place, the next step is to maximize resource usage.

6 Induction in the context of a loop is the code that is executed after each iteration.

For one, it is apparent that we cannot immediately benefit from the advantages of NUMA architectures with the code in Listing 6.6, nor are accelerator devices covered. By having designed the underlying infrastructure in a generic manner, adding support for the aforementioned targets is trivial: we switch to the target aware vector implementation and use the appropriate allocator as well as executor (see Section 5.2.3 and Section 5.2.4). The executor is used with the policy of the parallel algorithm, and the allocator with the vector. The iterator implementation does not need to be touched and can be reused without change.
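As a hedged sketch of what this switch might look like, the snippet below uses the HPX compute facilities (hpx::compute::vector, host::block_allocator, host::block_executor, host::numa_domains) that correspond to the target aware vector, allocator and executor referenced above; the exact header and type names may differ between HPX versions, and the snippet is illustrative rather than the thesis's actual code.

// Sketch only: type and header names follow the HPX compute API and
// are assumptions, not the thesis code.
#include <hpx/hpx_main.hpp>
#include <hpx/include/compute.hpp>
#include <hpx/include/parallel_algorithm.hpp>
#include <cstddef>

int main()
{
    // Discover the NUMA domains of this node as compute targets.
    auto targets = hpx::compute::host::numa_domains();

    // Allocator placing the grid data on those targets ...
    using allocator_type = hpx::compute::host::block_allocator<double>;
    using data_type = hpx::compute::vector<double, allocator_type>;
    allocator_type alloc(targets);

    // ... and an executor running the stencil tasks close to that data.
    hpx::compute::host::block_executor<> exec(targets);

    std::size_t const Nx = 1000, Ny = 1000;
    data_type U0(Nx * Ny, 0.0, alloc);

    // The executor is attached to the parallel execution policy; the
    // iterator machinery shown before can be reused without change.
    auto policy = hpx::parallel::execution::par.on(exec);
    (void) policy;   // ... run the stencil with `policy` as before ...
    return 0;
}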

Figure 6.19 presents the Giga Lattice Updates per Second (GLUPS) that can be obtained using the presented solution on the various platforms. Since the presented stencil update step is memory bound on ARM64 and X86-64, we can see a flattening of the achieved performance once the memory controller is saturated. On the KNL platform, we do not see this flattening due to the lack of vector instructions generated by the compiler. The application was run on a 10000x10000 grid on the X86-64 and KNL platforms and, due to less available memory, on a 2000x2000 grid on the ARM64 platform.

Adding Distributed Memory support

For adding support for distributed memory, the choice was to model the send/receive pattern usually found in MPI based applications, and as such, follows the Single Program Multiple Data (SPMD) model. The synchronization between the differently running processes is achieved by employing multiple channels (see Section 5.3.3). By assigning symbolic names to each instance of the channel, the concept of a communicator can be established. The decomposition of the data, for the sake of simplicity, has been chosen to be striped. That is, each partition has an upper and lower neighbor (with the exception of the first and last one). By assigning a rank to each partition, the needed channels can be named after the rank and whether they correspond to the first or last row. This makes up the communicator which is then used for the neighborhood synchronization through ghost zone exchange.
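A minimal sketch of such a communicator, built on HPX channels registered under symbolic names, could look as follows. Only the channel API itself (register_as, connect_to, epoch-based get/set) is taken from HPX; the class name, the channel naming scheme and the member functions are illustrative assumptions rather than the thesis's exact implementation.

#include <hpx/hpx.hpp>
#include <string>
#include <utility>
#include <vector>

struct communicator
{
    enum neighbor { up = 0, down = 1 };
    using channel_type = hpx::lcos::channel<std::vector<double>>;

    communicator(std::size_t rank, std::size_t num_ranks)
      : rank_(rank), num_ranks_(num_ranks)
    {
        // Receiving ends live on this locality and are registered under
        // a symbolic name derived from our rank; sending ends connect to
        // the channel the respective neighbor registered.
        if (has_neighbor(up))
        {
            recv_[up] = channel_type(hpx::find_here());
            recv_[up].register_as("stencil/up/" + std::to_string(rank));
            send_[up].connect_to("stencil/down/" + std::to_string(rank - 1));
        }
        if (has_neighbor(down))
        {
            recv_[down] = channel_type(hpx::find_here());
            recv_[down].register_as("stencil/down/" + std::to_string(rank));
            send_[down].connect_to("stencil/up/" + std::to_string(rank + 1));
        }
    }

    bool has_neighbor(neighbor n) const
    {
        return n == up ? rank_ > 0 : rank_ < num_ranks_ - 1;
    }

    // Charge the neighbor's channel with the boundary row for time step t.
    void set(neighbor n, std::vector<double> data, std::size_t t)
    {
        send_[n].set(std::move(data), t);
    }

    // Future for the ghost row belonging to time step t.
    hpx::future<std::vector<double>> get(neighbor n, std::size_t t)
    {
        return recv_[n].get(t);
    }

    std::size_t rank_, num_ranks_;
    channel_type recv_[2], send_[2];
};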

This domain-specific abstraction allows us to integrate both the communication and the synchronization with the required neighbors efficiently and elegantly (see Listing 6.7). Important to note here is that the ghost zones do not need to be allocated explicitly, a priori, in our underlying storage. In fact, after receiving the corresponding row from our channel, we can use an iterator pointing into that row, reuse the line_iterator, and use the received vector immediately in the previously written, generic algorithm to update the corresponding line. The update for the last row is symmetric to the update of the first, with the only notable change that we use the received vector as the bottom line of the last row instead of the top line. The synchronization takes place once the line update has been completed, and the respective channel is charged with the updated values.

if (comm.has_neighbor(communicator_type::up))
{
    // Get the first row.
    auto result = next.middle;

    // Retrieve the row which is 'up' from our first row.
    std::vector<double> up = comm.get(communicator_type::up, t).get();

    // Create a row iterator with that top boundary.
    auto it = curr.top_boundary(up);

    // After getting our missing row, we can update our first row.
    line_update(it, it + Nx, result);

    // Finally, we can send the updated first row for our neighbor
    // to consume in the next timestep. Don't send if we are on
    // the last timestep.
    comm.set(communicator_type::up,
        std::vector<double>(result, result + Nx), t + 1);
}

Listing 6.7: Stencil communicator, first row update

Futurization of the boundary exchange

Looking at the code presented so far, the next natural step is to apply futurization (see Section 3.3.3) to the developed application to further improve the overlap of computation and communication. It is important to note that, so far, there is explicit waiting both on the synchronization between neighbors and on the result of the parallel algorithm that computes the update of the inner region, which does not depend on the boundary exchange.
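A hedged sketch of one futurized time step is shown below. It reuses the names introduced in the previous listings (iterator, data_type, line_update and the communicator sketched above) and relies on HPX's task execution policy, dataflow and when_all; the loop bounds and the function itself are illustrative, not the thesis's exact implementation.

// Illustrative sketch only: one futurized time step.
#include <hpx/hpx.hpp>
#include <hpx/include/parallel_for_loop.hpp>
#include <vector>

hpx::future<void> do_timestep(communicator& comm, std::size_t t,
    iterator curr, iterator next, std::size_t Nx, std::size_t Ny)
{
    // Interior update: with the task policy the parallel for_loop
    // returns a future instead of blocking the calling thread.
    hpx::future<void> interior = hpx::parallel::for_loop(
        hpx::parallel::execution::par(hpx::parallel::execution::task),
        curr + 2, curr + Ny - 2,
        hpx::parallel::induction(next.middle + 2 * Nx, Nx),
        [Nx](iterator it, data_type::iterator result)
        {
            line_update(*it, *it + Nx, result);
        });

    // First-row update: attach the line update as a continuation to the
    // ghost row arriving through the channel (comm must outlive the step).
    hpx::future<void> first = hpx::dataflow(
        [&comm, t, curr, next, Nx](hpx::future<std::vector<double>> up)
        {
            std::vector<double> row = up.get();
            auto it = curr.top_boundary(row);
            auto result = next.middle;
            line_update(it, it + Nx, result);
            comm.set(communicator::up,
                std::vector<double>(result, result + Nx), t + 1);
        },
        comm.get(communicator::up, t));

    // The last row is handled symmetrically (omitted here). The returned
    // future becomes ready once all parts of the time step are done.
    return hpx::when_all(interior, first).then([](auto&&) {});
}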

Over-subscription for further latency hiding

After having the overlap of communication and computation mostly under control, there are still minor gaps inside the task trace. One reason for that is that the communication part cannot be sped up indefinitely, and a mechanism to further hide latencies should be implemented. When revisiting the communicator presented previously, we can observe that it is not bound to a particular process. Instead, it can be used as a means to synchronize between different partitions. The natural conclusion is to allow for more than one communicator per process. The fact that local exchange of boundary data through the on-node network is faster than off-node communication leads to a reduction in latencies of parts of the communication, while the computation, on average, stays the same.

(Plots: Stencil Example (Distributed Run); GLUPS and parallel efficiency [%] over the number of nodes for ARM64, X86-64 and KNL, with 1, 2, 4 and 8 partitions and the ideal scaling for comparison.)

Figure 6.20.: Distributed weak scaling of the stencil application. We can see that the application developed shows excellent weak scaling properties, achieving a parallel efficiency between ~87% and ~96% on our evaluation platforms.

As a result, this section developed a version of a 5-point stencil that shows excellent weak scaling behavior (see Figure 6.20). The weak scaling experiment kept the number of elements in the x-direction constant, and used 10000 elements per node on X86-64 and KNL and 2000 elements per node on ARM64 in the y-direction. The graphs show that our application developed on the HPX runtime system is able to scale almost perfectly to up to 8192 nodes (557056 cores) with a parallel efficiency of 87%. This is achieved by futurization and the ability to easily overlap communication and computation. As such, even though the individual performance metrics as shown in Section 6.2 might give the impression that HPX is not made for regular applications, by exploiting the properties of the system and implementing an asynchronous, futurized algorithm, we can leverage the largest scales available today.

6.3.2. OctoTiger

OctoTiger is a 3D, octree-based, finite-volume AMR hydrodynamics code with Newtonian gravity. The astrophysical fluid is modeled using the inviscid Euler equations and solved using a finite-volume central scheme [57]. The gravitational potential and force are computed using a modified version of the Fast Multipole Method (FMM) [20]. This enables the simulation of the dynamic evolution of binary star systems with unprecedented physical realism. The code's capability to conserve angular momentum at scale is novel and facilitates long-running ab initio simulations spanning thousands of orbits, such as stable mass transfer binaries (see Figure 6.21), that would otherwise be inaccurate.

OctoTiger discretizes the computational domain on a 3D octree AMR structure. Each node in the structure is an N × N × N Cartesian sub-grid (in this work, N = 8), and may be further refined into eight child nodes, each containing their own N × N × N sub-grid with twice the resolution of the parent. The AMR structure is "properly nested", meaning there is no more than one jump in refinement level across adjacent leaf nodes. OctoTiger does not refine in time; we are unaware of a technique for temporal refinement that conserves angular momentum. This spatial discretization scheme directly leads to an over-subscription of the cores in the system, by placing more than one sub-grid per core on a computational node. In fact, the number of sub-grids per core directly influences the amount of achievable parallelism during the simulation and thus defines the achievable scaling and parallel efficiency of OctoTiger. In the present implementation, the refinement criterion is based solely on density. OctoTiger checks the refinement criterion every 15 time steps to see if refinement or coarsening is required, introducing a need for dynamic load balancing. This is the minimal number of time steps required for a feature of the flow to propagate two cells. Each refinement or coarsening step causes the sub-grids to be redistributed for load balancing.

The focus of this section is to highlight the application of futurization to the previously existing code, which transforms a seemingly sequential code into a wait-free asynchronous application in order to scale to 643,280 Intel Knights Landing cores on the Cori Supercomputer, reaching a parallel efficiency of 96.8%.

Figure 6.21.: Double-white-dwarf systems with a low enough mass ratio exhibit stable mass transfer. If the accretor is small enough, the accretion stream will miss the donor on the first pass and form a disc. The OctoTiger model of a 0.2 M⊙ donor, shown with 9 levels of refinement, is such a system. These phenomena, known as AM Canum Venaticorum (AM CVn) systems, exist in a state of mass transfer for millions of years as the donor slowly transfers matter to the accretor and the orbital separation widens. When mass transfer first ensues, they have orbital periods of only a few minutes. The mass transfer causes the orbit to separate, and the AM CVns we observe typically have periods between 10 and 40 minutes. Periodic instabilities in the disc can result in dwarf novae. The buildup of helium in the disc can periodically detonate as a sub-luminous supernova known as a Type ".Ia" supernova. Stable mass transfer cases require thousands of orbits to simulate and thus necessitate both extreme computation scales and the conservation of angular momentum.

A Futurized Octree Based AMR Algorithm

Applying Futurization to OctoTiger is motivated by the desire to eliminate global barriers. While not all aspects of an AMR-based code require a global barrier, re-balancing the tree as well as computing and distributing the maximum allowed time-step size do.

The goal was to perform the algorithm on a natural basis: traverse the tree recursively, apply a transformation (that is: refine/coarsen, migrate children, perform a time step) and return a result representing the visitation of a particular level. To demonstrate the approach, a tree_node provides our generic data structure used as the tree representation, containing all necessary member variables and functions. The obvious, naive way of traversing such a tree can be formulated recursively as shown in Listing 6.8. If a specific node is refined, that is, if it contains children, we can recurse in a depth-first visitation pattern. The result of the current node (denoted by compute_result) can be combined with the results returned by the traversal of the children, leading to a natural way to gather all results and propagate values from the leaves to the root. The downside of this approach is that a stack overflow might occur when dealing with trees of great depth.

T tree_node::traverse()
{
    if (is_refined)
    {
        // 8 for children, 1 for this node.
        std::array<T, 9> results;
        for (int i = 0; i < 8; ++i)
            results[i] = children[i].traverse();
        results[8] = compute_result();
        return combine_results(results);
    }
    else
        return compute_result();
}

Listing 6.8: Natural recursive tree traversal

By applying the Futurization techniques as described in Section 3.3.3, the formulation of the tree traversal can be parallelized trivially. Listing 6.9 outlines the generic traversal algorithm. Instead of a direct recursion, the algorithm does the recursion asynchronously, and the computation for a given tree_node, representing a subgrid in OctoTiger, is automatically overlapped with those of the children. When communication with possibly remote tree_node objects takes place, it is transparently hidden by other ongoing computations.

The extension to a distributed memory application using AGAS is straightforward. As our initial approach uses an object-oriented design, the tree_node objects are placed into AGAS. They are referencable by a Globally Unique Identifier (GID), which is used as a handle to dispatch work to wherever the object has been placed. These function calls are called actions, a highly efficient Remote Procedure Call (RPC) mechanism returning a future that blends perfectly into the Futurization picture. The algorithm, as shown in Listing 6.9, doesn't change at all from a semantics point of view, but when the tree nodes are distributed over various compute nodes, it becomes automatically parallelized for a distributed memory system, since the actions are executed wherever the objects have been placed.

hpx::future<T> traverse(tree_node const& t)
{
    if (is_refined)
    {
        // 8 for children, 1 for this node.
        array<hpx::future<T>, 9> results;
        for (int i = 0; i < 8; ++i)
            results[i] = async(traverse, children[i]);
        results[8] = compute_result();
        return when_all(results).then(combine_results);
    }
    else
        return make_ready_future(compute_result());
}

Listing 6.9: Futurized recursive tree traversal

The futurized tree traversal is the basic template and pattern used to implement the functionality in OctoTiger. The functionality is realized with load, regrid, step and save. regrid represents the scalable algorithm performing refinement/coarsening of an existing octree. It is implemented as five distinct futurized tree traversals. The first checks if a given node in the octree requires refinement and ensures the correct refinement levels of its neighbors. Once this initial step is performed, the second traversal collects the number of tree nodes in total and the number in each child. With this information, the rebalancing step can be performed. The distribution of the individual tree nodes to localities uses a space-filling curve. Tree nodes whose location has changed are moved to their new localities by their parent tree nodes. At the end of this procedure, every tree node has correct references for its parent and child nodes, but the remaining references to neighboring tree nodes may be incorrect due to the rebalancing. Because of this, after every regridding step, the tree is traversed once more, with parents updating the neighbor references for their children. step implements the computation of the physical properties. OctoTiger combines numerical solvers for fluid hydrodynamics and Newtonian self-gravity into the octree structure. Each tree node has its own N³ sub-grid containing the evolved variables for its own sub-domain.

The futurized fluid solver, gravity solver and time-step size propagation differ from the other futurized tree traversals. The hydrodynamics and gravity solvers depend on ghost zone data from neighboring regions and on the dynamically computed time-step size from a reduction over the previous time-step's entire octree.

(Schematic: a 2D mesh partitioned into four sub-grids A, B, C and D with their channels.)

Figure 6.22.: A simple example of a solver with channel-based asynchronous ghost zone exchange on a 2D mesh partitioned into 4 subgrids. This technique allows computation to proceed as far as possible without needless waiting. First, all sub-grids begin computing the first sub-step (red). Sub-grids A, B and D finish their computations and send ghost zone data to their neighbors' channels. Then, A, B and D combine the dependencies of the next sub-step with when_all and attach a continuation that computes the next sub-step to the resulting future. C is still computing the first sub-step, so it has not sent the first sub-step ghost zone data to A or D, and the second sub-step continuation for A and D does not start yet (C, however, has received ghost zone data from A and D). B's continuation for the second sub-step has all the data it needs, so that computation is executed (blue). When it is completed, a third sub-step continuation is created and the second sub-step ghost zone data is sent from B to A and D's channels, where it is effectively buffered. In OctoTiger, the dependencies are more complicated, due to 3D geometry, coarse-fine boundaries, and other communications such as flux corrections from children nodes to parent nodes.

To avoid introducing needless waiting during these solves and exchanges, OctoTiger uses a channel to propagate results between neighboring regions. channels are primitives that represent a series of futures that will be produced asynchronously. A consumer can request a future for a particular "epoch" (e.g. an integer) from a channel, and a producer can set a value for an epoch. The associated state and storage for each epoch's future is only created on demand, e.g. after either a consumer or producer requests the epoch. channels allow producers and consumers to agree on a location (the channel) where they will communicate repeatedly. channels transparently buffer values on the fly when needed, allowing producers to proceed ahead of consumers, avoiding needless waiting.
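The following minimal example illustrates these channel semantics in isolation. The channel type is HPX's; the epoch numbers and values are made up for the example (the zero-based epoch numbering mirrors the time-step index used in Listing 6.7 and is an assumption).

#include <hpx/hpx_main.hpp>
#include <hpx/include/lcos.hpp>
#include <cassert>

int main()
{
    hpx::lcos::channel<int> ghost(hpx::find_here());

    // Producer: set values for two epochs; it does not have to wait
    // for a consumer, the channel buffers the values.
    ghost.set(10, 0);
    ghost.set(11, 1);

    // Consumer: request the future belonging to a particular epoch.
    hpx::future<int> f0 = ghost.get(0);
    hpx::future<int> f1 = ghost.get(1);

    assert(f0.get() == 10);
    assert(f1.get() == 11);
    return 0;
}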

Our asynchronous solvers use channels to allow computation to proceed as far as possible, and to overlap communication and computation. After the solve for one sub-step is complete, a tree node will retrieve futures from its channels and attach continuations to the futures (via when_all) that will compute the next solve. The continuation will only begin executing when the necessary data for that sub-step has been sent to the channels. A simplified example of a channel-based asynchronous ghost zone exchange is shown in Figure 6.22.

Advancing the hydrodynamics variables of each cell by a single sub-step in time requires knowledge of the variables in the neighboring three cells on each side and in each dimension (i.e. "ghost zones" of width 3). These "ghost zones" are updated after every sub-step, with each tree node sending the required data from its interior (non-ghost zone) cells to its neighboring tree nodes. When a leaf node has no neighboring tree node at the same level of refinement, the ghost zones are interpolated from the neighboring tree nodes of its parent.

Like the hydrodynamics solver, the FMM solver also requires data from neighboring tree nodes on the same level. It also requires data from its child and parent tree nodes. Each tree node executes three steps for the FMM algorithm:

1. Compute multipoles by combining multipoles from child tree nodes, and communicate multipoles to its parent and the relevant subsets to its neighboring tree nodes.

2. Compute the Taylor expansion of the gravitational interaction between multipoles, including those from neighboring tree nodes on the same refinement level.

3. Add to its own Taylor expansions the Taylor expansion of the gravitational potential from the parent tree node, and communicate the total Taylor expansions to the child tree nodes (the chaining of these three phases with futures is sketched below).
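The three phases can be chained per tree node in the same futurized style as the generic traversal. The sketch below is purely illustrative: the types multipoles and expansions as well as the phase functions are hypothetical placeholders, not OctoTiger's actual interfaces.

#include <hpx/hpx.hpp>

// Purely illustrative placeholders.
struct multipoles {};
struct expansions {};

struct tree_node
{
    // Placeholder phase implementations.
    hpx::future<multipoles> compute_multipoles()
    {
        return hpx::make_ready_future(multipoles{});
    }
    expansions compute_interactions(multipoles const&) { return {}; }
    void propagate_expansions(expansions const&) {}

    hpx::future<void> fmm_step()
    {
        // Phase 1: combine the children's multipoles and exchange them.
        hpx::future<multipoles> m = compute_multipoles();

        // Phase 2: once the required multipoles are available, compute
        // the Taylor expansions of the gravitational interactions.
        hpx::future<expansions> e = m.then(
            [this](hpx::future<multipoles> f)
            {
                return compute_interactions(f.get());
            });

        // Phase 3: add the parent's expansion and push the totals down
        // to the children.
        return e.then(
            [this](hpx::future<expansions> f)
            {
                propagate_expansions(f.get());
            });
    }
};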

Each cell of FMM data requires data from its neighboring cells within a radius of four cells around it. Because of the large amount of neighboring data required, rather than using ghost cells for the FMM solver, the data from neighboring cells is discarded once the relevant interactions are computed.

(Plots: total threads executing and power consumption (Watts) over time, before and after futurization of the regrid algorithm.)

Figure 6.23.: APEX concurrency views of OctoTiger running on 1024 nodes of Cori (higher values mean better system utilization). The first figure shows two long serialization periods, after the checkpoint load and grid creation stage (first yellow bulge) and after the first gravity solve (center), followed by two time steps. The figure on the right shows the same sequence after applying the futurizing methodology to make the regrid algorithm more scalable. Each color represents an HPX task type, and the aggregate power consumption (Watts) across nodes is visible as a dashed black line. The axis scales are equivalent.

The effectiveness of the Futurization of OctoTiger was observed by profiling the application with APEX [38, 37] at scale (see Figure 6.23). The efficiency of our asynchronous, wait-free tree traversal is determined by the speed at which the underlying runtime system is able to spawn new tasks. The following sections present results from our experiments on the Cori Supercomputer which prove the scalability of this approach. As a first measure, we look at single-node scalability to determine a suitable number of cores and processing elements per core for performing the distributed memory full-system scaling experiments.

Node-Level Scaling

Figure 6.24 presents scalability on a single KNL node for 7 levels of refinement (LoR) performing 10 solve steps. For that purpose, we ran the application with different numbers of hyperthreads (HTs) per core. This experiment shows that using two hyperthreads per core gives the overall best performance, with a total speedup of 74.6, a 1.3× improvement over using one HT per core, which in turn exhibits a parallel efficiency of 87%.

This clearly demonstrates the applicability of the futurization technique, showing its effect for a many-core system like the KNL. By having the system oversubscribed with ~24 tree nodes per core, we are able to exploit the on-node parallelism efficiently. Increasing the number of subgrids per core even further does not show significant improvements with respect to scalability. Reducing this number, however, leads to a drop in concurrency.

(Plot title: Speedup and Parallel Efficiency of OctoTiger (single node, 7 Levels of Refinement, 1641 sub-grids).)

Figure 6.24.: Speedup (yellow: 1 hyperthread, red: 2 hyperthreads, blue: 4 hyperthreads, left axis) and parallel efficiency (green, right axis) of OctoTiger strongly scaling up on a single KNL node relative to one core. The graphs show the speedup and efficiency results for the total application runtime. The application achieves a speedup of ~55.9 when scaling from 1 to 64 cores (one hyperthread), which corresponds to a parallel efficiency of 87%.

Full-System Scaling

To assess the scalability and distributed performance of our futurized application, we performed strong scaling runs for different LoRs. Figure 6.25 provides an overview of those results. They show that OctoTiger is able to sustain good scalability over the different amounts of subgrids to process. Since each LoR experiment is considered to be strong scaling in itself, the speedup naturally flattens off after the number of subgrids per core drops below a certain value. In our experiments, this happened at ~100 subgrids per node. The overall speedup from 10 LoR at 16 compute nodes (~44 subgrids per core) to 14 LoR at 9640 compute nodes (~171 subgrids per core) is ~342×, which corresponds to a parallel efficiency of ~56.8%. Since the number of subgrids per core is not the same, the comparison does not fully reflect the scalability of OctoTiger. However, it can be clearly observed that excellent scalability can be achieved by providing sufficient work to be executed by each core and the network.

To further discuss the scalability of OctoTiger and show the effects of futurization at scale, Figures 6.26 and 6.27 provide a breakdown of the different application stages.

When looking at the different stages of the application when executing the 13 LoR and 14 LoR problems, it can be observed that the discussed futurized tree traversal (performed during all stages of OctoTiger) is indeed able to scale up to the full Cori system using 655,520 cores. While the initialization of the problem, which includes loading the initial octree from disk, is hampered by I/O limitations and therefore shows a reduction in parallel efficiency, performing the actual computation exhibits a parallel efficiency of 96.8% for the strong scaling of the 14 LoR problem from 4096 nodes to 9640 nodes. We started the scaling experiment for the 14 LoR problem at 4096 nodes as it does not fit on a smaller number of compute nodes. On the other hand, the scaling of the 13 LoR problem is limited by the lack of work, as the number of subgrids per core drops to ~51 at 4096 nodes, causing the parallel efficiency in this case to be reduced to ~87%.

(Plot: number of sub-grids processed per second (in thousands) over the number of nodes and cores, for 10 to 14 levels of refinement.)

Figure 6.25.: Number of subgrids per second processed for different numbers of levels of refinement (LoR). This graph indicates close-to-perfect scalability for the different LoR, with a clear improvement of performance from one LoR to the next by improving latency hiding of communication through higher over-subscription (higher number of sub-grids per core).

(Plot title: Speedup and Parallel Efficiency of OctoTiger (13 Levels of Refinement, ~14.5 million sub-grids); bars and lines for the initialization, the regridding from 12 to 13 levels of refinement, the computation and the total runtime, at 1536 / 104448, 2048 / 139264 and 4096 / 278528 nodes / cores.)

Figure 6.26.: Speedup (bars, left axis) and parallel efficiency (lines, right axis) of OctoTiger strongly scaling up a problem with 13 levels of refinement relative to 1024 KNL nodes / 69632 cores on Cori. The graphs show separate speedup and efficiency results for three application stages (initialization, initial regridding from the 12 levels of refinement restart file to 13 levels of refinement, and the actual computation) and for the total runtime. The computational phase achieves a speedup of 3.46 when scaling from 1024 to 4096 nodes, which corresponds to a parallel efficiency of 87%.

(Plot title: Speedup and Parallel Efficiency of OctoTiger (14 Levels of Refinement, ~111 million sub-grids); bars and lines for the initialization, the regridding from 12 to 14 levels of refinement, the computation and the total runtime, at 8192 / 557056 and 9460 / 643280 nodes / cores.)

Figure 6.27.: Speedup (bars, left axis) and parallel efficiency (lines, right axis) of OctoTiger strongly scaling up a problem with 14 levels of refinement relative to 4096 KNL nodes / 278528 cores on Cori. The graphs show separate speedup and efficiency results for three application stages (initialization, initial regridding from the 12 levels of refinement restart file to 14 levels of refinement, and the actual computation) and for the total runtime. The computational phase reaches a speedup of 2.24 when scaling from 4096 to 9460 nodes, which corresponds to a parallel efficiency of 96.8%.

7. Conclusion

This thesis presented the HPX parallel runtime system, which extends the existing concepts for parallelism and concurrency introduced in the C++ programming language with the capability to support distributed memory systems as well as accelerators. Those extensions preserve the syntax and semantics of local asynchronous task invocations by relying on AGAS, a distributed global address space. By fully adhering to the principle of work following data, interfaces supporting complex hierarchies of both computational entities and memory subsystems have been presented.

The presented solution, in the form of the HPX parallel runtime system, is an attempt to show that the paradigm shift away from BSP and Fork/Join style parallel programming is not only possible but feasible. The provided performance evaluation has shown that HPX offers a performance-portable way to provide application scalability beyond the possibilities of today's programming models. This is achieved by providing an API that strictly adheres to a modern, high-productivity programming standard, allowing it to exploit systems through high resource utilization.

By offering micro-benchmarks that allow determining the associated overheads of the runtime system, it has been shown that those overheads are either minimal or can be mitigated by applying the presented futurization technique. This provides the ground for future optimization efforts and, in addition, shows that the developed techniques can be used today without impeding the performance of the overall system. In fact, the analysis shows the potential of the presented solution.

Having shown example applications that demonstrate the expressiveness as well as the performance portability adds to the claim of having a feature-rich solution to today's challenges of increasing parallelism. The 2D stencil example fully demonstrates the feasibility of the presented claims and the applicability of the programming model to a problem which can be considered to fall in the realm of traditional programming models like MPI.

In addition, the presentation of the control flow of the OctoTiger application, with performance results obtained by running on one of the most powerful supercomputers today, proves the claim that futurization in general and HPX in particular is indeed a feasible approach to solve tomorrow's exascale programming challenges.

Even though the benchmarks presented in this thesis demonstrate the usage of the HPX runtime system effectively, there are various further examples. LibGeoDecomp [87] is a high-level framework for developing stencil applications that has been equipped with an HPX backend. It has been shown that the performance of an N-body application developed with this backend is indeed able to outperform the existing MPI based solution by ~8% [33]. Furthermore, an algorithm for correlating positions using radio signals has been ported to HPX for usage on an embedded platform, demonstrating the excellent scalability and performance of the runtime system even on small devices [34]. In addition, HPX can be used as an underlying runtime system to devise even higher level programming models [35].

It should be clear that the development of the runtime system is not completed yet. Many global collective communication primitives are still missing and, with that, a clear migration path for legacy applications that would provide a smooth transition of existing applications towards the HPX programming model. Likewise, the various components of HPX need to be continuously adapted to new computing architectures, and the path laid out with targets needs to be completed by having direct support for high bandwidth memory and other non-volatile memory infrastructures. The possibility of adapting the behavior of the runtime as well as of applications using performance counters needs to be enhanced to fully enable online capabilities to adapt to the various factors inhibiting performance, or, in the same manner, to account for resiliency.

Furthermore, exciting times to simplify the user-facing API in terms of programmability are ahead of us. Developments in the C++ standardization process offer a great basis for this. On the one hand, the Coroutines TS makes the Futurization technique more approachable and efficient by providing first-class language keywords. On the other hand, the recently presented proposal for metaclasses [96] offers the capability to directly embody AGAS into the C++ type system. These changes to the C++ programming language will provide a way to express HPX programs more naturally and with less boilerplate, leading to programs that are easier to write and less error prone.

However, even without those additions, the programming model presented in this thesis offers a viable alternative to existing solutions to parallel programming by defining a consistent API to provide solutions for problems existing today. The performance portability in combination with the offered feature set is unmatched by any existing framework or language. HPX today represents an industry-grade, efficient implementation of a parallel runtime system written in C++ to support shared and distributed memory systems as well as accelerators, with user-extensible features to allow the exploitation of every part of the system without sacrificing generality or performance. This doctoral thesis has shown that the techniques are applicable to today's largest systems as well as to small embedded devices.

Appendix A

A.1. Atomic Operations

template <typename T>
struct atomic
{
    static constexpr bool is_always_lock_free = implementation-defined;

    bool is_lock_free() const volatile noexcept;
    bool is_lock_free() const noexcept;

    void store(T, memory_order = memory_order_seq_cst) volatile noexcept;
    void store(T, memory_order = memory_order_seq_cst) noexcept;
    T load(memory_order = memory_order_seq_cst) const volatile noexcept;
    T load(memory_order = memory_order_seq_cst) const noexcept;

    operator T() const volatile noexcept;
    operator T() const noexcept;

    T exchange(T, memory_order = memory_order_seq_cst) volatile noexcept;
    T exchange(T, memory_order = memory_order_seq_cst) noexcept;

    bool compare_exchange_weak(T&, T, memory_order, memory_order) volatile noexcept;
    bool compare_exchange_weak(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) volatile noexcept;
    bool compare_exchange_strong(T&, T, memory_order, memory_order) noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) volatile noexcept;
    bool compare_exchange_weak(T&, T, memory_order = memory_order_seq_cst) noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) volatile noexcept;
    bool compare_exchange_strong(T&, T, memory_order = memory_order_seq_cst) noexcept;

    atomic() noexcept = default;
    constexpr atomic(T) noexcept;
    atomic(const atomic&) = delete;

    atomic& operator=(const atomic&) = delete;
    atomic& operator=(const atomic&) volatile = delete;
    T operator=(T) volatile noexcept;
    T operator=(T) noexcept;
};

Listing A.1: Basic atomic interface
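A small usage example of this interface is a lock-free counter incremented concurrently from several threads (plain standard C++):

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

int main()
{
    std::atomic<int> counter{0};

    std::vector<std::thread> workers;
    for (int i = 0; i != 4; ++i)
        workers.emplace_back([&counter] {
            for (int j = 0; j != 1000; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });

    for (auto& t : workers)
        t.join();

    assert(counter.load() == 4000);
    return 0;
}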

template <>
struct atomic<integral>
{
    // all operations from atomic<T>, see Listing A.1

    integral fetch_add(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_add(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_sub(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_sub(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_and(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_and(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_or(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_or(integral, memory_order = memory_order_seq_cst) noexcept;
    integral fetch_xor(integral, memory_order = memory_order_seq_cst) volatile noexcept;
    integral fetch_xor(integral, memory_order = memory_order_seq_cst) noexcept;

    integral operator++(int) volatile noexcept;
    integral operator++(int) noexcept;
    integral operator--(int) volatile noexcept;
    integral operator--(int) noexcept;
    integral operator++() volatile noexcept;
    integral operator++() noexcept;
    integral operator--() volatile noexcept;
    integral operator--() noexcept;
    integral operator+=(integral) volatile noexcept;
    integral operator+=(integral) noexcept;
    integral operator-=(integral) volatile noexcept;
    integral operator-=(integral) noexcept;
    integral operator&=(integral) volatile noexcept;
    integral operator&=(integral) noexcept;
    integral operator|=(integral) volatile noexcept;
    integral operator|=(integral) noexcept;
    integral operator^=(integral) volatile noexcept;
    integral operator^=(integral) noexcept;
};

Listing A.2: Additional operations for the atomic interface when T is an integral numeric type.

template <typename T>
struct atomic<T*>
{
    // all operations from atomic<T>, see Listing A.1

    T* fetch_add(ptrdiff_t, memory_order = memory_order_seq_cst) volatile noexcept;
    T* fetch_add(ptrdiff_t, memory_order = memory_order_seq_cst) noexcept;
    T* fetch_sub(ptrdiff_t, memory_order = memory_order_seq_cst) volatile noexcept;
    T* fetch_sub(ptrdiff_t, memory_order = memory_order_seq_cst) noexcept;

    T* operator++(int) volatile noexcept;
    T* operator++(int) noexcept;
    T* operator--(int) volatile noexcept;
    T* operator--(int) noexcept;
    T* operator++() volatile noexcept;
    T* operator++() noexcept;
    T* operator--() volatile noexcept;
    T* operator--() noexcept;
    T* operator+=(ptrdiff_t) volatile noexcept;
    T* operator+=(ptrdiff_t) noexcept;
    T* operator-=(ptrdiff_t) volatile noexcept;
    T* operator-=(ptrdiff_t) noexcept;
};

Listing A.3: Additional operations for atomic when T is a pointer.

A.2. Asynchronous Providers

template <typename T>
struct promise
{
public:
    promise();
    template <typename Allocator>
    promise(allocator_arg_t, const Allocator& a);
    promise(promise&& rhs) noexcept;
    promise(const promise& rhs) = delete;
    ~promise();

    // assignment
    promise& operator=(promise&& rhs) noexcept;
    promise& operator=(const promise& rhs) = delete;
    void swap(promise& other) noexcept;

    // retrieving the result
    future<T> get_future();

    // setting the result
    void set_value();
    void set_value(T&& t);
    void set_value(T const& t);
    void set_value(T& t);
    void set_exception(exception_ptr p);

    // setting the result with deferred notification
    void set_value_at_thread_exit();
    void set_value_at_thread_exit(T&& t);
    void set_value_at_thread_exit(T const& t);
    void set_value_at_thread_exit(T& t);
    void set_exception_at_thread_exit(exception_ptr p);
};

Listing A.4: The interface of a promise: a promise can be used to asynchronously set a value and to retrieve a future which is marked ready once the promise has been set.

template <typename R, typename... ArgTypes>
struct packaged_task;

template <typename R, typename... ArgTypes>
struct packaged_task<R(ArgTypes...)>
{
    // construction and destruction
    packaged_task() noexcept;
    template <typename F>
    explicit packaged_task(F&& f);
    template <typename F, typename Allocator>
    packaged_task(allocator_arg_t, const Allocator& a, F&& f);
    ~packaged_task();

    // no copy
    packaged_task(const packaged_task&) = delete;
    packaged_task& operator=(const packaged_task&) = delete;

    // move support
    packaged_task(packaged_task&& rhs) noexcept;
    packaged_task& operator=(packaged_task&& rhs) noexcept;

    void swap(packaged_task& other) noexcept;
    bool valid() const noexcept;

    // result retrieval
    future<R> get_future();

    // execution
    void operator()(ArgTypes...);
    void make_ready_at_thread_exit(ArgTypes...);

    void reset();
};

Listing A.5: The interface of a packaged_task: a packaged_task is used to wrap an arbitrary task. Once the task is executed, the return value is used to set the retrieved future to the ready state.

template <typename T>
class future
{
public:
    future() noexcept;
    future(future&&) noexcept;
    future(const future& rhs) = delete;
    ~future();

    future& operator=(const future& rhs) = delete;
    future& operator=(future&&) noexcept;

    shared_future<T> share() noexcept;

    // retrieving the value
    T get();

    // functions to check state
    bool valid() const noexcept;

    void wait() const;
    template <typename Rep, typename Period>
    future_status wait_for(const chrono::duration<Rep, Period>& rel_time) const;
    template <typename Clock, typename Duration>
    future_status wait_until(const chrono::time_point<Clock, Duration>& abs_time) const;
};

Listing A.6: The interface of a future: a future is the receiving end of an asynchronous operation. Member functions to query the state, wait on the result and attach continuations are defined.
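The following short example shows how the asynchronous providers interact: a promise fulfilled by one thread and consumed through its future by another (plain standard C++):

#include <future>
#include <iostream>
#include <thread>

int main()
{
    std::promise<int> p;
    std::future<int> f = p.get_future();

    std::thread producer([&p] {
        // ... some computation producing the value ...
        p.set_value(42);
    });

    // get() blocks until the promise has been fulfilled.
    std::cout << "result: " << f.get() << '\n';

    producer.join();
    return 0;
}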

template <typename T>
class shared_future
{
public:
    shared_future() noexcept;
    shared_future(const shared_future& rhs) noexcept;
    shared_future(future<T>&&) noexcept;
    shared_future(shared_future&& rhs) noexcept;
    ~shared_future();

    shared_future& operator=(const shared_future& rhs) noexcept;
    shared_future& operator=(shared_future&& rhs) noexcept;

    // retrieving the value
    see below get() const;

    // functions to check state
    bool valid() const noexcept;

    void wait() const;
    template <typename Rep, typename Period>
    future_status wait_for(const chrono::duration<Rep, Period>& rel_time) const;
    template <typename Clock, typename Duration>
    future_status wait_until(const chrono::time_point<Clock, Duration>& abs_time) const;
};

Listing A.7: The interface of a shared_future: a shared_future is the receiving end of an asynchronous operation. The difference to future (see Listing A.6) is that there might be multiple instances for the same asynchronous provider. Member functions to query the state, wait on the result and attach continuations are defined.

A.3. Task Block

class task_block
{
private:
    ~task_block();

public:
    task_block(const task_block&) = delete;
    task_block& operator=(const task_block&) = delete;
    void operator&() const = delete;

    // Run a specific function inside this task block. The destructor
    // will block until each forked task finishes.
    template <typename F>
    void run(F&& f);

    void wait();
};

// Instantiates a task_block and passes it to f where children can be spawned.
template <typename F>
void define_task_block(F&& f);

Listing A.8: Fork Join Support in C++
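A brief usage sketch of this interface, written against HPX's implementation of the task block proposal; the namespace and header name are assumptions and may differ between versions.

#include <hpx/hpx_main.hpp>
#include <hpx/include/parallel_task_block.hpp>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> v(1000, 1);
    long left = 0, right = 0;

    hpx::parallel::define_task_block([&](auto& tb) {
        // Fork: sum the first half in a child task ...
        tb.run([&] {
            left = std::accumulate(v.begin(), v.begin() + 500, 0L);
        });
        // ... while the second half is summed by the parent.
        right = std::accumulate(v.begin() + 500, v.end(), 0L);
    });  // implicit join: both tasks have finished here

    std::cout << left + right << '\n';   // prints 1000
    return 0;
}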

A.4. Parallel Algorithms

all_of, any_of, none_of: checks if a predicate is true for all, any or none of the elements in a range
for_each: applies a function to a range of elements
for_each_n: applies a function object to the first n elements of a sequence
count, count_if: returns the number of elements satisfying specific criteria
mismatch: finds the first position where two ranges differ
equal: determines if two sets of elements are the same
find, find_if, find_if_not: finds the first element satisfying specific criteria
find_end: finds the last sequence of elements in a certain range
find_first_of: searches for any one of a set of elements
adjacent_find: finds the first two adjacent items that are equal (or satisfy a given predicate)
search: searches for a range of elements
search_n: searches for a number of consecutive copies of an element in a range

Table A.2.: Non-modifying sequence operations
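To illustrate how these algorithms are used with an execution policy, the following standard C++17 example runs two of them in parallel (HPX provides the same algorithms and additionally accepts its own executors):

#include <algorithm>
#include <execution>
#include <iostream>
#include <vector>

int main()
{
    std::vector<int> v(1'000'000, 2);

    // Check a predicate over the whole range in parallel.
    bool all_even = std::all_of(std::execution::par,
        v.begin(), v.end(), [](int x) { return x % 2 == 0; });

    // Count elements satisfying a criterion in parallel.
    auto n = std::count_if(std::execution::par,
        v.begin(), v.end(), [](int x) { return x > 1; });

    std::cout << std::boolalpha << all_even << ", " << n << '\n';
    return 0;
}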

merge: merges two sorted ranges
inplace_merge: merges two ordered ranges in-place
includes: returns true if one set is a subset of another
set_difference: computes the difference between two sets
set_intersection: computes the intersection of two sets
set_symmetric_difference: computes the symmetric difference between two sets
set_union: computes the union of two sets

Table A.4.: Set operations on sorted ranges

copy, copy_if: copies a range of elements to a new location
copy_n: copies a number of elements to a new location
copy_backward: copies a range of elements in backwards order
move: moves a range of elements to a new location
move_backward: moves a range of elements to a new location in backwards order
fill: assigns a range of elements a certain value
fill_n: assigns a value to a number of elements
transform: applies a function to a range of elements
generate: saves the result of a function in a range
generate_n: saves the result of N applications of a function
remove, remove_if: removes elements satisfying specific criteria
remove_copy, remove_copy_if: copies a range of elements omitting those that satisfy specific criteria
replace, replace_if: replaces all values satisfying specific criteria with another value
replace_copy, replace_copy_if: copies a range, replacing elements satisfying specific criteria with another value
swap: swaps the values of two objects
swap_ranges: swaps two ranges of elements
iter_swap: swaps the elements pointed to by two iterators
reverse: reverses the order of elements in a range
reverse_copy: creates a copy of a range that is reversed
rotate: rotates the order of elements in a range
rotate_copy: copies and rotates a range of elements
random_shuffle, shuffle: randomly re-orders elements in a range
sample: selects n random elements from a sequence
unique: removes consecutive duplicate elements in a range
unique_copy: creates a copy of some range of elements that contains no consecutive duplicates

Table A.6.: Modifying sequence operations

is_partitioned: determines if the range is partitioned by the given predicate
partition: divides a range of elements into two groups
partition_copy: copies a range dividing the elements into two groups
stable_partition: divides elements into two groups while preserving their relative order
partition_point: locates the partition point of a partitioned range

Table A.8.: Partitioning operations

is_sorted: checks whether a range is sorted into ascending order
is_sorted_until: finds the largest sorted subrange
sort: sorts a range into ascending order
partial_sort: sorts the first N elements of a range
partial_sort_copy: copies and partially sorts a range of elements
stable_sort: sorts a range of elements while preserving order between equal elements
nth_element: partially sorts the given range making sure that it is partitioned by the given element

Table A.10.: Sorting operations

lower_bound: returns an iterator to the first element not less than the given value
upper_bound: returns an iterator to the first element greater than a certain value
binary_search: determines if an element exists in a certain range
equal_range: returns range of elements matching a specific key

Table A.12.: Binary search operations

is_heap: checks if the given range is a max heap
is_heap_until: finds the largest subrange that is a max heap
make_heap: creates a max heap out of a range of elements
push_heap: adds an element to a max heap
pop_heap: removes the largest element from a max heap
sort_heap: turns a max heap into a range of elements sorted in ascending order

Table A.14.: Heap operations

max: returns the greater of the given values
max_element: returns the largest element in a range
min: returns the smaller of the given values
min_element: returns the smallest element in a range
minmax: returns the smaller and larger of two elements
minmax_element: returns the smallest and the largest elements in a range
clamp: clamps a value between a pair of boundary values
lexicographical_compare: returns true if one range is lexicographically less than another
is_permutation: determines if a sequence is a permutation of another sequence
next_permutation: generates the next greater lexicographic permutation of a range of elements
prev_permutation: generates the next smaller lexicographic permutation of a range of elements

Table A.16.: Minimum/maximum operations

iota: fills a range with successive increments of the starting value
accumulate: sums up a range of elements
inner_product: computes the inner product of two ranges of elements
adjacent_difference: computes the differences between adjacent elements in a range
partial_sum: computes the partial sum of a range of elements
reduce: similar to std::accumulate, except out of order
exclusive_scan: similar to std::partial_sum, excludes the ith input element from the ith sum
inclusive_scan: similar to std::partial_sum, includes the ith input element in the ith sum
transform_reduce: applies a functor, then reduces out of order
transform_exclusive_scan: applies a functor, then calculates exclusive scan
transform_inclusive_scan: applies a functor, then calculates inclusive scan

Table A.18.: Numeric Operations
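As a usage illustration for the numeric operations, the following standard C++17 example computes an out-of-order sum with reduce and a dot product with transform_reduce:

#include <execution>
#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<double> a(1'000'000, 1.0);
    std::vector<double> b(1'000'000, 2.0);

    // Out-of-order parallel sum of all elements of a.
    double sum = std::reduce(std::execution::par, a.begin(), a.end(), 0.0);

    // Fused element-wise multiply and reduction (dot product of a and b).
    double dot = std::transform_reduce(std::execution::par,
        a.begin(), a.end(), b.begin(), 0.0);

    std::cout << sum << " " << dot << '\n';   // 1e+06 and 2e+06
    return 0;
}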

uninitialized_copy: copies a range of objects to an uninitialized area of memory
uninitialized_copy_n: copies a number of objects to an uninitialized area of memory
uninitialized_fill: copies an object to an uninitialized area of memory, defined by a range
uninitialized_fill_n: copies an object to an uninitialized area of memory, defined by a start and a count
uninitialized_move: moves a range of objects to an uninitialized area of memory
uninitialized_move_n: moves a number of objects to an uninitialized area of memory

Table A.20.: Operations on uninitialized memory

Glossary

AGAS Active Global Address Space. i, 4, 43, 44, 47, 48, 50, 52, 55, 57, 61, 67, 70, 77, 91, 96, 98, 127, 128

AMR Adaptive Mesh Refinement. 10, 114

AMT Asynchronous Many Task. 3, 9

API Application Programming Interface. 3, 4, 9

BSP Bulk Synchronous Programming. 2, 5, 86, 127

CPS Continuation Passing Style. 9, 10, 31, 35–37, 40

CPU Central Processing Unit. 70, 78, 79

DAG Directed Acyclic Graph. 6, 36

FIFO First In, First Out. 46

FMM Fast Multipole Method. 114

GID Globally unique Identifier. i, iii, v, 47–55, 61, 72, 76

GLUPS Giga Lattice Updates per Second. 111

GPU Graphic Processing Unit. ii, 32, 80–82

HPC High Performance Computing. 3, 5, 9, 42, 55, 56, 66, 67, 71, 77, 80, 85, 103

IDL Interface Definition Language. 62

ILP Instruction-Level Parallelism. 24

ISA Instruction Set Architecture. 24

LCO Lightweight Control Object. 60

LIFO Last In, First Out. 46

LRU Least Recently Used. 54

MPI Message Passing Interface. 5, 56

NIC Network Interface Card. 66, 68, 70

NUMA Non Uniform Memory Access. ii, 78–80, 82

OOP Object Oriented Programming. 9

OS Operating System. 44, 45, 79, 80, 85, 86

PGAS Partitioned Global Address Space. 47, 56

RAII Resource Acquisition is Initialization. 16, 17, 65

RDMA Remote Direct Memory Access. 56, 66, 68

RPC Remote Procedure Call. 9, 32, 44, 47, 48, 53, 56, 67, 72, 76, 96, 98

RTTI Runtime Type Information. 57

SDK Software Development Kit. 81

SIMD Single Instruction Multiple Data. 2, 24, 80

SM Streaming Multiprocessor. 80, 81

SMT Symmetric Multi Threading. 24

SPMD Single Program Multiple Data. 111

TMP Template Meta Programming. 9, 10

Bibliography

[1] Bilge Acun et al. “Parallel Programming with Migratable Objects: Charm++ in Practice”. In: Proceedings of the International Conference for High Performance Com- puting, Networking, Storage and Analysis. SC ’14. New Orleans, Louisana: IEEE Press, 2014, pp. 647–658. isbn: 978-1-4799-5500-8. doi: 10.1109/SC.2014.58. url: https://doi.org/10.1109/SC.2014.58. [2] Saman Amarasinghe et al. “Exascale Programming Challenges”. In: Proceed- ings of the Workshop on Exascale Programming Challenges, Marina del Rey, CA, USA. U.S Department of Energy, Office of Science, Office of Advanced Sci- entific Computing Research (ASCR), July 2011. url: http : / / science . energy . gov / ~ / media / ascr / pdf / program - documents / docs / ProgrammingChallengesWorkshopReport.pdf. [3] M. Anderson et al. “MHD with adaptive mesh refinement”. In: Class. Quant. Grav. 23 (2006), pp. 6503–6524. [4] Matthew Anderson et al. “Neutron Star Evolutions using Tabulated Equations of State with a New Execution Model”. In: CoRR abs/1205.5055 (2012). [5] Henry C. Baker and . “The incremental garbage collection of pro- cesses”. In: SIGART Bull. New York, NY, USA: ACM, Aug. 1977, pp. 55–59. doi: http://doi.acm.org/10.1145/872736.806932. url: http://doi.acm.org/ 10.1145/872736.806932. [6] Richard F Barrett et al. “Toward an evolutionary task parallel integrated MPI + X programming model”. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM ’15. PMAM ’15. New York, New York, USA: ACM Press, Feb. 2015, pp. 30–39. isbn: 9781450334044. doi: 10.1145/2712386.2712388. url: http://dl.acm.org/ citation.cfm?doid=2712386.2712388. [7] David R Butenhof. Programming with POSIX threads. Addison-Wesley Profes- sional, 1997. [8] C++ AMP (C++ Accelerated Massive Parallelism). http://msdn.microsoft.com/en- us/library/hh265137.aspx. 2013. [9] C++ Single-source Heterogeneous Programming for OpenCL. https://www.khronos.org/sycl/. 2018.

145 [10] Pierre Carbonnelle. PYPL PopularitY of Programming Language. Accessed: 9.8.2016. 2016. url: http://pypl.github.io/PYPL.html. [11] H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. “Kokkos”. In: J. Parallel Distrib. Comput. 74.12 (Dec. 2014), pp. 3202–3216. issn: 0743-7315. doi: 10. 1016/j.jpdc.2014.07.003. url: http://dx.doi.org/10.1016/j.jpdc. 2014.07.003. [12] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. “Parallel pro- grammability and the Chapel language”. In: International Journal of High Perfor- mance Computing Applications 21 (2007), pp. 291–312. [13] Philippe Charles et al. “X10: an object-oriented approach to non-uniform cluster computing”. In: Proceedings of the 20th annual ACM SIGPLAN conference on Object- oriented programming, systems, languages, and applications. OOPSLA ’05. San Diego, CA, USA: ACM, 2005, pp. 519–538. isbn: 1-59593-031-0. doi: 10.1145/1094811. 1094852. url: http://doi.acm.org/10.1145/1094811.1094852. [14] Sanjay Chatterjee et al. “Integrating Asynchronous Task Parallelism with MPI.” In: IPDPS. IEEE Computer Society, 2013, pp. 712–725. isbn: 978-1-4673-6066-1. url: http : / / dblp . uni - trier . de / db / conf / ipps / ipdps2013 . html # ChatterjeeTBCCGSY13. [15] UPC Consortium. UPC Language Specifications, v1.2, Tech Report LBNL-59208. Lawrence Berkeley National Lab, 2005 October. url: http://upc.gwu.edu. [16] James Coplien. “Advanced C++ programming styles and idioms”. In: Technology of Object-Oriented Languages and Systems, 1997. TOOLS 25, Proceedings. IEEE. 1997, pp. 352–352. [17] James O Coplien. Multi-paradigm design for C++. Vol. 53. Addison-Wesley Read- ing, MA, 1999. [18] CUDA. http://www.nvidia.com/object/cuda_home_new.html. 2013. [19] L. Dagum and R. Menon. “OpenMP: an industry standard API for shared- memory programming”. In: IEEE and Engineering 5.1 (Jan. 1998), pp. 46–55. issn: 1070-9924. doi: 10.1109/99.660313. [20] W. Dehnen. “A Very Fast and Momentum-conserving Tree Code”. In: Astrophys- ical Journal, Letters 536 (June 2000), pp. L39–L42. doi: 10.1086/312724. eprint: astro-ph/0003209. [21] Chirag Dekate et al. “N-Body SVN repository”. Available under a BSD-style open source license. Contact [email protected] for repository access. 2011. url: https : / / svn . cct . lsu . edu / repos / projects / parallex / trunk / history/nbody. [22] J. B. Dennis. “Data Flow Supercomputers”. In: Computer 13.11 (1980), pp. 48–56. issn: 0018-9162. doi: http://dx.doi.org/10.1109/MC.1980.1653418. [23] Jack B. Dennis and David Misunas. “A Preliminary Architecture for a Basic Data- Flow Processor”. In: 25 Years ISCA: Retrospectives and Reprints. 1998, pp. 125–131. [24] James Dinan et al. “Enabling MPI interoperability through flexible communica- tion endpoints”. In: Proceedings of the 20th European MPI Users’ Group Meeting. ACM. 2013, pp. 13–18. [25] Margaret A. Ellis and Bjarne Stroustrup. The Annotated C++ Reference Manual. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1990. isbn: 0- 201-51459-1. [26] Daniel P. Friedman and David S. Wise. “CONS Should Not Evaluate its Argu- ments”. In: ICALP. 1976, pp. 257–284. [27] Guang R. Gao et al. “ParalleX: A Study of A New Parallel Computation Model”. In: IPDPS. 2007, pp. 1–6. [28] TIOBE Group et al. TIOBE Index for ranking the popularity of Programming lan- guages. Accessed: 9.8.2016. 2016. url: http://www.tiobe.com/tiobe-index/. [29] Patricia A Grubel. Dynamic adaptation in HPX-A task-based parallel runtime system. New Mexico State University, 2016. [30] Paul Grun et al. 
[31] HCC: Heterogeneous Compute Compiler Tools. https://gpuopen.com/compute-product/hcc-heterogeneous-compute-compiler/. 2018.
[32] T. Heller, H. Kaiser, and K. Iglberger. “Application of the ParalleX Execution Model to Stencil-based Problems”. In: Proceedings of the International Supercomputing Conference ISC’12, Hamburg, Germany. 2012. url: http://stellar.cct.lsu.edu/pubs/isc2012.pdf.
[33] Thomas Heller et al. “Using HPX and LibGeoDecomp for Scaling HPC Applications on Heterogeneous Supercomputers”. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. ScalA ’13. Denver, Colorado: ACM, 2013, 1:1–1:8. isbn: 978-1-4503-2508-0. doi: 10.1145/2530268.2530269. url: http://doi.acm.org/10.1145/2530268.2530269.
[34] Arne Hendricks et al. “Evaluating Performance and Energy-efficiency of a Parallel Signal Correlation Algorithm on Current Multi and Manycore Architectures”. In: Procedia Computer Science 80 (2016), pp. 1566–1576.
[35] Arne Hendricks et al. “The AllScale Runtime Interface—Theoretical Foundation and Concept”. In: Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS), 2016 9th Workshop on. IEEE. 2016, pp. 13–19.
[36] M. D. Hill and M. R. Marty. “Amdahl’s Law in the Multicore Era”. In: Computer 41.7 (July 2008), pp. 33–38. issn: 0018-9162. doi: 10.1109/MC.2008.209.
[37] Kevin Huck. APEX: Autonomic Performance Environment for eXascale. 2017. url: http://khuck.github.io/xpress-apex/.
[38] Kevin Huck et al. “An Autonomic Performance Environment for Exascale”. In: Supercomputing Frontiers and Innovations 2.3 (2015). issn: 2313-8734. url: http://superfri.org/superfri/article/view/64.
[39] Kevin Huck et al. “An Early Prototype of an Autonomic Performance Environment for Exascale”. In: Proceedings of the 3rd International Workshop on Runtime and Operating Systems for Supercomputers. ROSS ’13. Eugene, Oregon: ACM, 2013, 8:1–8:8. isbn: 978-1-4503-2146-4. doi: 10.1145/2491661.2481434. url: http://doi.acm.org/10.1145/2491661.2481434.
[40] Intel. Intel Thread Building Blocks 4.4. 2016. url: http://www.threadingbuildingblocks.org.
[41] Intel SPMD Program Compiler. http://ispc.github.io/. 2011-2012.
[42] Intel(R) Cilk(tm) Plus. http://software.intel.com/en-us/intel-cilk-plus. 2014.
[43] ISO. ISO/IEC 14882:1998 — Programming languages – C++. Tech. rep. Geneva, Sept. 1998. url: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=25845.
[44] ISO. ISO/IEC 14882:2003 — Programming languages – C++. Tech. rep. Geneva, Oct. 2003. url: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=38110.
[45] ISO. ISO/IEC 14882:2011 Information technology — Programming languages – C++. Tech. rep. Geneva, Sept. 2011. url: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372.
[46] ISO. ISO/IEC 14882:2014 Information technology — Programming languages – C++. Tech. rep. Geneva, Jan. 2015. url: http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372.
[47] ISO. ISO/IEC CD 14882 Information technology — Programming languages – C++. Tech. rep. Geneva, July 2016. url: http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=68564.
[48] Jared Hoberock, Michael Garland, Chris Kohlhoff, Chris Mysen, Carter Edwards, and Gordon Brown. P0443r1: A Unified Executors Proposal for C++. Tech. rep. https://wg21.link/p0443r1. 2017.
[49] K. Kadam et al. “Numerical Simulations of Close and Contact Binary Systems Having Bipolytropic Equation of State”. In: American Astronomical Society Meeting Abstracts. Vol. 229. American Astronomical Society Meeting Abstracts. Jan. 2017, p. 433.14.
[50] Kundan Kadam et al. “A numerical method for generating rapidly rotating bipolytropic structures in equilibrium”. In: Monthly Notices of the Royal Astronomical Society 462.2 (July 2016), pp. 2237–2245. issn: 1365-2966. doi: 10.1093/mnras/stw1814. url: http://dx.doi.org/10.1093/mnras/stw1814.
[51] A. Kagi, J. R. Goodman, and D. Burger. “Memory Bandwidth Limitations of Future Microprocessors”. In: Computer Architecture, 1996 23rd Annual International Symposium on. May 1996, pp. 78–78. doi: 10.1109/ISCA.1996.10002.
[52] H. Kaiser, M. Brodowicz, and T. Sterling. “ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications”. In: 2009 International Conference on Parallel Processing Workshops. Sept. 2009, pp. 394–401. doi: 10.1109/ICPPW.2009.14.
[53] Hartmut Kaiser et al. “HPX: A Task Based Programming Model in a Global Address Space”. In: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models. PGAS ’14. Eugene, OR, USA: ACM, 2014, 6:1–6:11. isbn: 978-1-4503-3247-7.
[54] L. V. Kale et al. Design and Implementation of Parallel Java with Global Object Space. 1997.
[55] Matthias Kretz. “Extending C++ for Explicit Data-Parallel Programming via SIMD Vector Types”. PhD thesis. Goethe University Frankfurt, 2015. doi: 10.13140/RG.2.1.2355.4323. url: http://publikationen.ub.uni-frankfurt.de/frontdoor/index/index/docId/38415.
[56] Matthias Kretz. P0214R4: Data-Parallel Vector Types & Operations. ISO/IEC C++ Standards Committee Paper. 2017. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0214r4.pdf.
[57] A. Kurganov and E. Tadmor. “New High-Resolution Central Schemes for Nonlinear Conservation Laws and Convection-Diffusion Equations”. In: Journal of Computational Physics 160 (May 2000), pp. 241–282. doi: 10.1006/jcph.2000.6459.
[58] Guy Lewis Steele Jr. and Gerald Jay Sussman. “Lambda: The Ultimate Imperative”. 1976.
[59] J. Luitjens and M. Berzins. “Improving the performance of Uintah: A large-scale adaptive meshing computational framework”. In: Proceedings of the 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS). Atlanta, GA: IEEE, Apr. 2010, pp. 1–10. doi: 10.1109/IPDPS.2010.5470437.
[60] D. C. Marcello and J. E. Tohline. “A Numerical Method for Studying Super-Eddington Mass Transfer in Double White Dwarf Binaries”. In: Astrophysical Journal, Supplement 199, 35 (Apr. 2012), p. 35. doi: 10.1088/0067-0049/199/2/35.
[61] John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. Tech. rep. A continually updated technical report. Charlottesville, Virginia: University of Virginia, 1991-2007. url: http://www.cs.virginia.edu/stream/.
[62] William F. McColl. “BSP Programming”. In: Specification of Parallel Algorithms 18 (1994), pp. 25–35.
[63] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Stuttgart, Germany: High Performance Computing Center Stuttgart (HLRS), Sept. 2009.
[64] Microsoft. Microsoft Parallel Pattern Library. 2010. url: http://msdn.microsoft.com/en-us/library/dd492418.aspx.
[65] Luc Moreau and Jean Duprat. “A construction of distributed reference counting”. In: Acta Informatica 37.8 (May 2001), pp. 563–595. issn: 1432-0525. doi: 10.1007/PL00013315. url: https://doi.org/10.1007/PL00013315.
[66] Luc Moreau and Jean Duprat. “A construction of distributed reference counting”. In: Acta Informatica 37.8 (2001), pp. 563–595.
[67] MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE. http://mvapich.cse.ohio-state.edu/benchmarks/. 2015.
[68] N4406: Parallel Algorithms Need Executors. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4406.pdf. 2015.
[69] N4656: Working Draft, C++ Extensions for Networking. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4656.pdf. 2017.
[70] N4669: Working Draft, Technical Specification for C++ Extensions for Parallelism Version 2. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4501.html. 2015.
[71] John Nickolls et al. “Scalable Parallel Programming with CUDA”. In: Queue 6.2 (Mar. 2008), pp. 40–53. issn: 1542-7730. doi: 10.1145/1365490.1365500. url: http://doi.acm.org/10.1145/1365490.1365500.
[72] Robert W. Numrich and John Reid. “Co-array Fortran for parallel programming”. In: SIGPLAN Fortran Forum 17.2 (Aug. 1998), pp. 1–31. issn: 1061-7264. doi: 10.1145/289918.289920. url: http://doi.acm.org/10.1145/289918.289920.
[73] Dorit Nuzman, Ira Rosen, and Ayal Zaks. “Auto-vectorization of interleaved data for SIMD”. In: ACM SIGPLAN Notices 41.6 (2006), pp. 132–143.
[74] OpenACC - Directives for Accelerators. http://www.openacc-standard.org/. 2013.
[75] OpenCL - The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/. 2013.
[76] OpenMP V4.0 Specification. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf. 2013.
[77] Oracle. Project Fortress. https://projectfortress.java.net/. 2011.
[78] P0688R0: A Proposal to Simplify the Unified Executors Design. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0688r0.html. 2017.
[79] Simon L. Peyton Jones and Philip Wadler. “Imperative Functional Programming”. In: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL ’93. Charleston, South Carolina, USA: ACM, 1993, pp. 71–84. isbn: 0-89791-560-7. doi: 10.1145/158511.158524. url: http://doi.acm.org/10.1145/158511.158524.
[80] PGAS. PGAS - Partitioned Global Address Space. http://www.pgas.org. 2011.
[81] Rob Pike. “The go programming language”. In: Talk given at Google’s Tech Talks (2009).
[82] José M. Piquer. “Indirect reference counting: A distributed garbage collection algorithm”. In: PARLE ’91 Parallel Architectures and Languages Europe: Volume I: Parallel Architectures and Algorithms, Eindhoven, The Netherlands, June 10–13, 1991, Proceedings. Ed. by Emile H. L. Aarts, Jan van Leeuwen, and Martin Rem. Berlin, Heidelberg: Springer Berlin Heidelberg, 1991, pp. 150–165. isbn: 978-3-540-47471-5. doi: 10.1007/BFb0035102. url: https://doi.org/10.1007/BFb0035102.
[83] Howard Pritchard et al. “The GNI provider layer for OFI libfabric”. In: Proceedings of Cray User Group Meeting, CUG. 2016.
[84] Stefan L. Ram et al. “Charakterisierung von C++” [Characterization of C++]. In: (2015).
[85] Tim Rentsch. “Object Oriented Programming”. In: SIGPLAN Not. 17.9 (Sept. 1982), pp. 51–57. issn: 0362-1340. doi: 10.1145/947955.947961. url: http://doi.acm.org/10.1145/947955.947961.
[86] P. E. Ross. “Why CPU Frequency Stalled”. In: IEEE Spectrum 45.4 (Apr. 2008), pp. 72–72. issn: 0018-9235. doi: 10.1109/MSPEC.2008.4476447.
[87] Andreas Schäfer and Dietmar Fey. “LibGeoDecomp: A Grid-Enabled Library for Geometric Decomposition Codes”. In: Proceedings of the 15th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. Dublin, Ireland: Springer, 2008, pp. 285–294. isbn: 978-3-540-87474-4.
[88] Andreas Schäfer, Dietmar Fey, and Adrian Knoth. “Tool for Automated Generation of MPI Datatypes”. In: ().
[89] Sangmin Seo et al. “Argobots: A Lightweight Low-Level Threading and Tasking Framework”. In: IEEE Transactions on Parallel and Distributed Systems 29 (2018), pp. 512–526.
[90] Richard Snodgrass. The interface description language: definition and use. Computer Science Press, Inc., 1989.
[91] International Organization for Standardization. N4723: Working Draft, C++ Extensions for Coroutines. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4723.pdf. 2018.
[92] StarPU - A Unified Runtime System for Heterogeneous Multicore Architectures. http://runtime.bordeaux.inria.fr/StarPU/. 2013.
[93] Guy Lewis Steele Jr. LAMBDA: The Ultimate Declarative. Tech. rep. DTIC Document, 1976.
[94] Guy L. Steele Jr. “Making Asynchronous Parallelism Safe for the World”. In: Proceedings of the 17th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. POPL ’90. San Francisco, California, USA: ACM, 1990, pp. 218–231. isbn: 0-89791-343-4. doi: 10.1145/96709.96731. url: http://doi.acm.org/10.1145/96709.96731.
[95] Audie Sumaray and S. Kami Makki. “A comparison of data serialization formats for optimal efficiency on a mobile platform”. In: Proceedings of the 6th international conference on ubiquitous information management and communication. ACM. 2012, p. 48.
[96] Herb Sutter. P0707R0: Metaclasses. ISO/IEC C++ Standards Committee Paper. 2017. url: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0707r0.pdf.
[97] Herb Sutter. “The free lunch is over: A fundamental turn toward concurrency in software”. In: Dr. Dobb’s Journal 30.3 (2005), pp. 202–210.
[98] R. Teyssier. AMR and parallel computing. 2010. url: hipacc.ucsc.edu/html/HIPACCLectures/lecture_amr.pdf.
[99] The C++ Standards Committee. ISO International Standard ISO/IEC 14882:2011, Programming Language C++. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21. Geneva, Switzerland: International Organization for Standardization (ISO), 2011.
[100] The C++ Standards Committee. ISO International Standard ISO/IEC 14882:2014, Programming Language C++. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21. Geneva, Switzerland: International Organization for Standardization (ISO), 2014.
[101] The C++ Standards Committee. ISO International Standard ISO/IEC 14882:2017, Programming Language C++. Tech. rep. http://www.open-std.org/jtc1/sc22/wg21. Geneva, Switzerland: International Organization for Standardization (ISO), 2017.
[102] The OmpSs Programming Model. https://pm.bsc.es/ompss. 2013.
[103] The Qthread Library. http://www.cs.sandia.gov/qthreads/. 2014.
[104] Peter Thoman et al. “A taxonomy of task-based parallel programming technologies for high-performance computing”. In: The Journal of Supercomputing 74.4 (Apr. 2018), pp. 1422–1434. issn: 1573-0484. doi: 10.1007/s11227-018-2238-4. url: https://doi.org/10.1007/s11227-018-2238-4.
[105] Konrad Trifunovic et al. “Polyhedral-model guided loop-nest auto-vectorization”. In: Parallel Architectures and Compilation Techniques, 2009. PACT ’09. 18th International Conference on. IEEE. 2009, pp. 327–337.
[106] “Evolving MPI+X Toward Exascale”. In: Computer 49.8 (2016), p. 10. issn: 0018-9162. doi: 10.1109/MC.2016.232.
[107] UPC Consortium. UPC Language Specifications, v1.2. Tech Report LBNL-59208. Lawrence Berkeley National Lab, 2005. url: http://www.gwu.edu/~upc/publications/LBNL-59208.pdf.
[108] Leslie G. Valiant. “A bridging model for parallel computation”. In: Commun. ACM 33.8 (1990), pp. 103–111. issn: 0001-0782. doi: 10.1145/79173.79181.
[109] Abhinav Vishnu, Monika ten Bruggencate, and Ryan Olson. “Evaluating the potential of Cray Gemini interconnect for PGAS communication runtime systems”. In: High Performance Interconnects (HOTI), 2011 IEEE 19th Annual Symposium on. IEEE. 2011, pp. 70–77.
[110] Andrew M. Wissink, David Hysom, and Richard D. Hornung. “Enhancing scalability of parallel structured AMR calculations”. In: Proceedings of the 17th annual international conference on Supercomputing. ICS ’03. San Francisco, CA, USA: ACM, June 2003, pp. 336–347. isbn: 1-58113-733-8. doi: 10.1145/782814.782861. url: http://doi.acm.org/10.1145/782814.782861.
[111] Andrew M. Wissink et al. “Large scale parallel structured AMR calculations using the SAMRAI framework”. In: Proceedings of the 2001 ACM/IEEE conference on Supercomputing (CDROM). Supercomputing ’01. Denver, Colorado: ACM, Nov. 2001, pp. 6–6. isbn: 1-58113-293-X. doi: 10.1145/582034.582040. url: http://doi.acm.org/10.1145/582034.582040.
[112] X-Stack: Programming Challenges, Runtime Systems, and Tools, DoE-FOA-0000619. http://science.energy.gov/~/media/grants/pdf/foas/2012/SC_FOA_0000619.pdf. 2012.