An Introduction to hpxMP – A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System

Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei, and Hartmut Kaiser
Center for Computation and Technology, LSU

ABSTRACT

Asynchronous Many-task (AMT) runtime systems have gained increasing acceptance in the HPC community due to the performance improvements offered by fine-grained tasking runtime systems. At the same time, C++ standardization efforts are focused on creating higher-level interfaces able to replace OpenMP or OpenACC in modern C++ codes. These higher-level functions have been adopted in standards-conforming runtime systems such as HPX, giving users the ability to simply utilize fork-join parallelism in their own codes. Despite innovations in runtime systems and standardization efforts, users face enormous challenges porting legacy applications. Not only must users port their own codes, but often users rely on highly optimized libraries such as BLAS and LAPACK which use OpenMP for parallelization. Current efforts to create smooth migration paths have struggled with these challenges, especially as the threading systems of AMT libraries often compete with the threading system of OpenMP.

To overcome these issues, our team has developed hpxMP, an implementation of the OpenMP standard which utilizes the underlying AMT system to schedule and manage tasks. This approach leverages the C++ interfaces exposed by HPX and allows users to execute their applications on an AMT system without changing their code.

In this work, we compare hpxMP with Clang's OpenMP implementation using four linear algebra benchmarks of the Blaze C++ library. While hpxMP is often not able to reach the same performance, we demonstrate viability for providing a smooth migration for applications, which however have to be extended to benefit from a more general task-based programming model.

CCS CONCEPTS

• Computing methodologies → Parallel programming languages;

KEYWORDS

OpenMP, hpxMP, Asynchronous Many-task Systems, C++, clang, gcc, HPX

ACM Reference Format:
Tianyi Zhang, Shahrzad Shirzad, Patrick Diehl, R. Tohid, Weile Wei, and Hartmut Kaiser. 2019. An Introduction to hpxMP – A Modern OpenMP Implementation Leveraging HPX, An Asynchronous Many-Task System. In International Workshop on OpenCL (IWOCL'19), May 13–15, 2019, Boston, MA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3318170.3318191

1 INTRODUCTION

The Open Multi-Processing (OpenMP) [OpenMP Consortium 2018] standard is widely used for shared-memory parallel programming and is often coupled with the Message Passing Interface (MPI) as MPI+X [Bader 2016] for distributed programming. Here, MPI is used for the inter-node communication and X, in this case OpenMP, for the intra-node parallelism. Nowadays, Asynchronous Many Task (AMT) runtime systems are emerging as a new parallel programming paradigm. These systems are able to take advantage of fine-grained tasks to better distribute work across a machine. The C++ standard library for concurrency and parallelism (HPX) [Heller et al. 2017] is one example of an AMT runtime system. The HPX API conforms to the concurrency abstractions introduced by the C++ 11 standard [C++ Standards Committee 2011] and to the parallel algorithms introduced by the C++ 17 standard [C++ Standards Committee 2017].
These algorithms are similar to the concepts exposed by OpenMP, e.g. #pragma omp parallel for.

AMT runtime systems are becoming increasingly used for HPC applications as they have shown superior scalability and parallel efficiency for certain classes of applications (see [Heller et al. 2018]). At the same time, the C++ standardization efforts currently focus on creating higher-level interfaces usable to replace OpenMP (and other #pragma-based parallelization solutions like OpenACC) for modern C++ codes. This effort is driven by the lack of integration of #pragma based solutions into the C++ language, especially the language's type system.
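To make the similarity concrete, the following minimal sketch (our own example, not code from the paper) expresses the same element-wise loop once with an OpenMP pragma and once with a C++17 parallel algorithm; the function names and the doubling operation are arbitrary.

#include <algorithm>
#include <execution>
#include <vector>

// Fork-join loop expressed with an OpenMP pragma.
void scale_openmp(std::vector<double>& v) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= 2.0;
}

// The equivalent loop expressed as a C++17 parallel algorithm.
void scale_cxx17(std::vector<double>& v) {
    std::for_each(std::execution::par, v.begin(), v.end(),
                  [](double& x) { x *= 2.0; });
}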

Both trends call for a migration path which will allow existing applications that directly or indirectly use OpenMP to port portions of the code to an AMT paradigm. This is especially critical for applications which use highly optimized OpenMP libraries where it is not feasible to re-implement all the provided functionalities into a new paradigm. Examples of these libraries are linear algebra libraries [Anderson et al. 1999; Blackford et al. 2002; Galassi et al. 2002; Guennebaud et al. 2010; Iglberger et al. 2012; Rupp et al. 2016; Sanderson and Curtin 2016; Wang et al. 2013], such as the Intel math kernel library or the Eigen library.

For these reasons, it is beneficial to combine both technologies, AMT+OpenMP, where the distributed communication is handled by the AMT runtime system and the intra-node parallelism is handled by OpenMP, or even to combine OpenMP and the parallel algorithms on a shared memory system. Currently, these two scenarios are not possible, since the light-weight thread implementations usually present in AMTs interfere with the system threads utilized by the available OpenMP implementations.

To overcome this issue, hpxMP, an implementation of the OpenMP standard [OpenMP Consortium 2018] that utilizes HPX's light-weight threads, is presented in this paper. The hpxMP library is compatible with the clang and gcc compilers and replaces their shared library implementations of OpenMP. hpxMP implements all of the OpenMP runtime functionalities using HPX's lightweight threads instead of system threads.

Blaze, an open source, high performance C++ math library [Iglberger et al. 2012], is selected as an example library to validate our implementation. Blazemark, the benchmark suite available with Blaze, is used to run some common benchmarks. The measured results are compared against the same benchmarks run on top of the compiler-supplied OpenMP runtime. This paper focuses on the implementation details of hpxMP as a proof of concept implementing OpenMP with an AMT runtime system. We use HPX as an exemplary AMT system that already exposes all the required functionalities.

The paper is structured as follows: Section 2 emphasizes the related work. Section 3 provides a brief introduction to HPX's concepts and Section 4 a brief introduction to OpenMP's concepts utilized in the implementation in Section 5. The benchmarks comparing hpxMP with clang's OpenMP implementation are shown in Section 6. Finally, we draw our conclusions in Section 7.

2 RELATED WORK

Exploiting parallelism on multi-core processors with shared memory has been extensively studied and many solutions have been implemented. The POSIX Threads [Alfieri 1994] execution model enables fine grain parallelism independent of any language. At higher levels, there are also library solutions like Intel's Threading Building Blocks (TBB) [Intel 2019] and Microsoft's Parallel Pattern Library (PPL) [Microsoft 2010]. TBB is a C++ library for task-based parallelism, while PPL provides features like task parallelism, as well as parallel algorithms and containers, with an imperative programming model. There are also several language solutions. Chapel [Chamberlain et al. 2007] is a parallel programming language with parallel data and task abstractions. The Cilk family of languages [Leiserson 2009] are general-purpose programming languages which target multi-core parallelism by extending C/C++ with parallel loop constructs and a fork-join model. Kokkos [Edwards et al. 2014] is a package which exposes multiple parallel programming models such as CUDA and pthreads through a common C++ interface. Open Multi-Processing (OpenMP) [Dagum and Menon 1998] is a widely accepted standard used by application and library developers. OpenMP exposes a fork-join model through compiler directives and supports tasks.

The OpenMP 3.0 standard introduced the concept of task-based programming. The OpenMP 3.1 standard added task optimization within the tasking model. The OpenMP 4.0 standard offers users a more graceful and efficient way to handle task synchronization by introducing depend tasks and task groups. The OpenMP 4.5 standard was released with its support for a new task-loop construct, which provides a way to separate loops into tasks. The most recent, the OpenMP 5.0 standard, supports detached tasks. (Specifications: 3.0 https://www.openmp.org/wp-content/uploads/spec30.pdf, 3.1 https://www.openmp.org/wp-content/uploads/OpenMP3.1.pdf, 4.0 https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf, 4.5 https://www.openmp.org/wp-content/uploads/openmp-4.5.pdf, 5.0 https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf)

There have also been efforts to integrate multi-thread parallelism with distributed programming models. Charm++ has integrated OpenMP into its programming model to improve load balance [PPL 2011]. However, most of the research in this area has focused on the MPI+X [Bader 2016; Barrett et al. 2015] model.

3 C++ STANDARD LIBRARY FOR CONCURRENCY AND PARALLELISM (HPX)

This section briefly describes the features of the C++ Standard Library for Concurrency and Parallelism (HPX) [Heller et al. 2017] which are utilized in the implementation of hpxMP in Section 5. HPX facilitates distributed parallel applications of any scale and uses fine-grain multi-threading and asynchronous communications [Khatami et al. 2016]. HPX exposes an API that strictly adheres to the current ISO C++ standards [Heller et al. 2017]. This approach to standardization encourages programmers to write code that is highly portable across heterogeneous systems [Copik and Kaiser 2017].

HPX is highly interoperable in distributed parallel applications, such that it can be used for intra-node parallelization on a single machine as well as for inter-node communication across hundreds of thousands of nodes [Wagle et al. 2018]. The future functionality implemented in HPX permits threads to continually finish their computation without waiting for their previous steps to be completed, which can achieve a maximum possible level of parallelization in time and space [Khatami et al. 2017].

3.1 HPX Threads

The HPX light-weight threading system provides user-level threads, which enables fast context switching [Biddiscombe et al. 2017]. With lower overheads per thread, programs are able to create and schedule a large number of tasks with little penalty [Wagle et al. 2018]. The advantage of this threading system combined with the future functionality in HPX facilitates auto-parallelization in a highly efficient fashion, as such a combination allows the direct expression of the generated dependency graph as an execution tree generated at runtime [Grubel et al. 2015].
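A minimal sketch of this future-based tasking (our own example, not hpxMP code; the exact header paths may differ between HPX versions): hpx::async schedules work on a light-weight HPX thread, and the attached continuation runs as soon as the result is ready.

#include <hpx/hpx_main.hpp>       // lets the regular main() run on the HPX runtime
#include <hpx/include/async.hpp>
#include <hpx/include/lcos.hpp>
#include <iostream>

int main()
{
    // Schedule work on a light-weight HPX thread; returns immediately.
    hpx::future<int> f = hpx::async([] { return 6 * 7; });

    // Attach a continuation; it runs once the value is ready, without
    // blocking an OS thread in the meantime.
    hpx::future<int> g = f.then([](hpx::future<int> r) { return r.get() + 1; });

    std::cout << g.get() << '\n';   // prints 43
    return 0;
}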

3.2 HPX Thread Scheduling and Policies

The adaptive thread scheduling system employed in HPX improves the performance of parallel applications [Biddiscombe et al. 2017]. The built-in scheduling policies enable optimal task scheduling for a given application and/or algorithm [Heller et al. 2017]. Programmers can code efficiently as they can focus on algorithms or application development itself instead of manually scheduling CPU resources. Also, the built-in scheduling policies allow users to provide their own scheduling policies if they require more specific control on the application level.

The HPX runtime now supports eight different thread scheduling policies. Priority local scheduling (the default option): this policy creates one queue per OS thread; the OS threads remove waiting tasks from the queue and start task execution accordingly, and the number of high priority queues is equal to the number of OS threads. Static priority scheduling: this policy maintains one queue per OS thread into which its tasks are placed; a round-robin model is used in this policy. Local scheduling: this policy maintains one queue per OS thread from which each OS thread removes waiting tasks and starts task execution accordingly. Global scheduling: this policy maintains one shared queue from which all OS threads pull waiting tasks. ABP scheduling: this policy maintains a double-ended lock-free queue per OS thread; threads are inserted at the top of the queue and are stolen from the bottom of the queue during work stealing. Hierarchy scheduling: this policy constructs a tree of task items, and each OS thread traverses the tree to obtain new task items. Periodic priority scheduling: this policy arranges one queue of task items per OS thread, a couple of high priority queues, and one low priority queue.

These policies can be categorized into three types. Thread local: this is currently the default scheduling option; it schedules one queue for each OS core, and the queues with high priority will be scheduled before any other work with lower priority. Static: the static scheduling policy follows the round-robin principle and thread stealing is not allowed in this policy. Hierarchical: the hierarchical scheduling policy utilizes a tree structure of run queues; the OS threads need to traverse the tree to obtain new task items.

[Figure 1: Layers of an hpxMP application. Applications are compiled with OpenMP flags and linked against the hpxMP runtime library where HPX threads perform the computation in parallel. Gray layers are implemented in hpxMP. This figure is adapted from [Mattson 2013].]

4 INTEGRATION OF HPX IN OPENMP APPLICATIONS

This section describes the integration of HPX with the OpenMP 4.0 specification (https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf). Figure 1 illustrates how hpxMP fits in an OpenMP application. A user-defined application with OpenMP directives, library functions, and environment variables can be compiled with any compiler that supports OpenMP. The hpxMP shared library adds an additional layer, marked in gray, which carries out the parallel computation. Instead of calling the OpenMP functions and running them on OpenMP threads, the equivalent hpxMP functions are called, redirecting the program to the corresponding functionality in HPX. HPX employs light-weight HPX threads following its thread scheduling policies; for more details see Section 3.

Table 1 provides the list of the directives that are currently supported by hpxMP. A list of runtime library functions implemented in the hpxMP runtime library is available in Table 2. In the next section, the detailed implementation of these directives is discussed.
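As an illustration of this integration, a plain OpenMP program of the kind described above could look as follows (our own minimal example). No hpxMP- or HPX-specific code appears in the source; the redirection to HPX happens entirely inside the hpxMP shared library the program is linked against.

#include <omp.h>
#include <cstdio>

int main()
{
    double sum = 0.0;

    // Ordinary OpenMP code; when linked against hpxMP, the implicit tasks
    // of this region are executed by HPX threads instead of OS-level
    // OpenMP threads.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < 1000000; ++i)
        sum += 1.0 / (i + 1.0);

    std::printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}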
5 IMPLEMENTATION OF HPXMP

This section provides an overview of prominent OpenMP and OpenMP Performance Toolkit (OMPT) functionalities and elaborates on the implementation of these functionalities in hpxMP. The fundamental directives described in the OpenMP specification [de Supinski Michael Klemm 2017] are shown in Listing 1. It is important to note that the implementation of OpenMP directives may differ based on the compiler. hpxMP is mapped onto LLVM-Clang as specified by the LLVM OpenMP Runtime Library (https://openmp.llvm.org/Reference.pdf). However, hpxMP also supports the GCC (https://gcc.gnu.org/onlinedocs/libgomp/) entry points by mapping the function calls generated by the GCC compiler onto the Clang entries.

Listing 1: Fundamental OpenMP directives

#pragma omp parallel
#pragma omp for
#pragma omp task

5.1 Parallel Construct

The parallel construct initiates the parallel execution of the portion of the code annotated by the parallel directive. The directive #pragma omp parallel with its associated structured block is treated as a function call to __kmpc_fork_call, which preprocesses the arguments passed by the compiler and calls the function named fork, implemented in the hpxMP runtime, see Figure 1. The implementation of __kmpc_fork_call is shown in Listing 2. HPX threads, as many as requested by the user, are created during the fork call which explicitly registers HPX threads, see Listing 3. Each HPX thread follows the HPX scheduling policies, see Section 3, performing its own work under the structured parallel block.

Table 1: Directives implemented in the program layer of hpxMP, see Figure 1. The corresponding functions are the main part of the hpxMP runtime library.

Pragmas implemented in hpxMP:
#pragma omp atomic
#pragma omp barrier
#pragma omp critical
#pragma omp for
#pragma omp master
#pragma omp ordered
#pragma omp parallel
#pragma omp section
#pragma omp single
#pragma omp task depend
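A small example of our own combining several of the directives from Table 1 in one parallel region (the work itself is arbitrary and only serves to show the constructs):

#include <omp.h>
#include <cstdio>

int main()
{
    int counter = 0;

    #pragma omp parallel
    {
        // #pragma omp single: executed by exactly one thread of the team
        #pragma omp single
        std::printf("team size: %d\n", omp_get_num_threads());

        // #pragma omp critical: mutual exclusion among the team's threads
        #pragma omp critical
        ++counter;

        // #pragma omp barrier: all threads wait here before continuing
        #pragma omp barrier

        // #pragma omp master: executed by the master thread only
        #pragma omp master
        std::printf("counter = %d\n", counter);
    }
    return 0;
}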

Table 2: Runtime library functions in hpxMP's program layer, see Figure 1. The following functions are provided to users.

Runtime library functions in hpxMP:
omp_get_dynamic, omp_get_max_threads, omp_get_num_procs, omp_get_num_threads, omp_get_thread_num, omp_get_wtick, omp_get_wtime, omp_in_parallel, omp_init_lock, omp_init_nest_lock, omp_set_dynamic, omp_set_lock, omp_set_nest_lock, omp_set_num_threads, omp_test_lock, omp_test_nest_lock, omp_unset_lock, omp_unset_nest_lock
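For illustration, a minimal example of our own that exercises a few of the functions listed in Table 2; when linked against hpxMP, these calls are served by the hpxMP runtime instead of the compiler-supplied OpenMP library.

#include <omp.h>
#include <cstdio>

int main()
{
    omp_set_num_threads(4);        // request a team size of 4
    omp_lock_t lock;
    omp_init_lock(&lock);

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        omp_set_lock(&lock);       // serialize the output
        std::printf("hello from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
        omp_unset_lock(&lock);
    }
    std::printf("elapsed: %f s\n", omp_get_wtime() - t0);
    return 0;
}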

Table 3: OMPT callbacks implemented in the hpxMP runtime library, see Figure 1. A first-party performance analysis toolkit for users to develop higher-level performance analysis policies.

OMPT callbacks:
ompt_callback_thread_begin, ompt_callback_thread_end, ompt_callback_parallel_begin, ompt_callback_parallel_end, ompt_callback_task_create, ompt_callback_task_schedule, ompt_callback_implicit_task

5.2 Loop Scheduling Construct

Another common OpenMP construct is the loop construct, which runs several iterations of a for loop in parallel where each instance runs on a different thread in the team. A team is defined as a set of one or more threads in the execution of a parallel region [de Supinski Michael Klemm 2017]. The loops are divided into chunks, and the scheduler determines how such chunks are distributed across the threads in the team. The default schedule type is static, where the chunk size is determined by the number of threads and the number of loop iterations. Each thread gets approximately the same number of iterations and the structured block is executed in parallel.

For the static schedule, the directive #pragma omp for with its associated structured for-loop block invokes the following sequence of function calls: __kmpc_for_static_init, __kmpc_dispatch_next, and __kmpc_dispatch_fini. Chunks are distributed among threads in a round-robin fashion, see Listing 4.
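From the user's perspective, the static schedule described above corresponds to a loop such as the following (our own sketch; the chunk size of 100 is arbitrary), which a Clang-style compiler lowers into calls like the __kmpc_for_static_init shown in Listing 4.

#include <vector>

// With schedule(static, 100) the iteration space is split into chunks of
// 100 iterations that are handed out to the team's threads round-robin.
void fill(std::vector<double>& v)
{
    #pragma omp parallel for schedule(static, 100)
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        v[i] = 2.0 * i;
}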

5.3 Task Construct

The task construct creates explicit tasks in hpxMP. When a thread encounters this construct, a new HPX thread is created and scheduled based on the HPX thread scheduling policies, see Section 3.

The directive #pragma omp task and its associated structured block initiate a series of function calls: __kmpc_omp_task_alloc and __kmpc_omp_task, see Listing 5. A task object is allocated, initialized, and returned to the compiler by the task allocation function __kmpc_omp_task_alloc. A normal priority HPX thread is then created by the compiler-generated function call __kmpc_omp_task and is ready to execute the task allocated by the prior functions.

5.4 OpenMP Performance Toolkit

The OpenMP Performance Toolkit (OMPT) is an application programming interface (API) for first-party performance tools. It is integrated into the hpxMP runtime system and enables users to construct powerful and efficient custom performance tools. The implemented callback functions, see Table 3, make it possible for users to track the behavior of threads, parallel regions, and tasks.

The implementation of the parallel-begin callback is shown in Listing 6. This piece of code calls the user-defined callbacks and is hooked into the hpxMP runtime before the parallel region actually begins.

Listing 2: Implementation of __kmpc_fork_call in hpxMP

void __kmpc_fork_call(ident_t *loc, kmp_int32 argc, kmpc_micro microtask, ...) {
    // unpack the variadic arguments passed by the compiler
    vector<void*> argv(argc);
    va_list ap;
    va_start(ap, microtask);
    for (int i = 0; i < argc; i++) {
        argv[i] = va_arg(ap, void*);
    }
    va_end(ap);
    void **args = argv.data();
    // hand the outlined micro-task over to the HPX back end
    hpx_backend->fork(__kmp_invoke_microtask, microtask, argc, args);
}

Listing 3: Implementation of hpx_runtime::fork in hpxMP

// register one HPX thread for every OpenMP thread requested by the user
for (int i = 0; i < parent->threads_requested; i++) {
    hpx::applier::register_thread_nullary(
        std::bind(&thread_setup, kmp_invoke, thread_func, argc, argv, i, &team, parent,
                  boost::ref(barrier_mtx), boost::ref(cond), boost::ref(running_threads)),
        "omp_implicit_task", hpx::threads::pending,
        true, hpx::threads::thread_priority_low, i);
}

Listing 4: Implementation of __kmpc_for_static_init in hpxMP

void __kmpc_for_static_init(ident_t *loc, int32_t gtid, int32_t schedtype,
                            int32_t *p_last_iter, int64_t *p_lower, int64_t *p_upper,
                            int64_t *p_stride, int64_t incr, int64_t chunk) {
    // code to determine each thread's lower and upper bound (*p_lower, *p_upper)
    // with the given thread id, schedule type and stride
}
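Listing 4 elides the actual bound computation. As a hedged illustration only (not the hpxMP source), a block-wise static split of the iteration space across the team could be computed as follows; the helper name static_bounds is hypothetical, and stride handling as well as the other schedule types are ignored.

#include <algorithm>
#include <cstdint>

// Illustrative only: split [lower, upper] evenly across num_threads and
// return the sub-range assigned to thread tid (block-wise static schedule).
void static_bounds(int tid, int num_threads,
                   std::int64_t* lower, std::int64_t* upper)
{
    std::int64_t n = *upper - *lower + 1;                    // trip count
    std::int64_t chunk = (n + num_threads - 1) / num_threads;
    std::int64_t begin = *lower + tid * chunk;
    std::int64_t end = std::min(begin + chunk - 1, *upper);
    *lower = begin;
    *upper = end;   // may describe an empty range for large tid
}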

Listing 5: Implementation of task scheduling in hpxMP

kmp_task_t* __kmpc_omp_task_alloc(ident_t *loc_ref, kmp_int32 gtid, kmp_int32 flags,
                                  size_t sizeof_kmp_task_t, size_t sizeof_shareds,
                                  kmp_routine_entry_t task_entry) {
    int task_size = sizeof_kmp_task_t + (-sizeof_kmp_task_t % 8);
    kmp_task_t *task = (kmp_task_t*) new char[task_size + sizeof_shareds];
    task->routine = task_entry;
    return task;
}

int __kmpc_omp_task(ident_t *loc_ref, kmp_int32 gtid, kmp_task_t *new_task) {
    // Create a normal priority HPX thread with the allocated task as argument.
    hpx::applier::register_thread_nullary(.....)
    return 1;
}
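From the user's side, each #pragma omp task below goes through the __kmpc_omp_task_alloc / __kmpc_omp_task path of Listing 5 and is therefore executed by a newly registered HPX thread (our own minimal example; the order of the printed lines is not deterministic).

#include <omp.h>
#include <cstdio>

int main()
{
    #pragma omp parallel
    {
        // One thread creates the explicit tasks ...
        #pragma omp single
        for (int i = 0; i < 8; ++i) {
            #pragma omp task firstprivate(i)
            std::printf("task %d run by thread %d\n", i, omp_get_thread_num());
        }
        // ... and the implicit barrier at the end of the parallel region
        // guarantees that all tasks have completed.
    }
    return 0;
}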

Listing 6: Implementation of thread callbacks in hpxMP

#if OMPT_SUPPORT
    if (ompt_enabled.enabled) {
        if (ompt_enabled.ompt_callback_parallel_begin) {
            ompt_callbacks.ompt_callback(ompt_callback_parallel_begin)(
                NULL, NULL, &team.parallel_data, team_size,
                __builtin_return_address(0));
        }
    }
#endif

5.5 GCC Support

The LLVM OpenMP Runtime Library (https://openmp.llvm.org/Reference.pdf) provides gcc compatibility shims. In order to achieve GCC support in hpxMP, we expose similar shims to map the GCC-generated entries to Clang. These mapping functions preprocess the arguments provided by the compiler and pass them directly to hpxMP or call the Clang-supported entries. Therefore, programs compiled with GCC or Clang are supported by hpxMP.

5.6 Start HPX back end

HPX must be initialized before hpxMP can start execution. The HPX initialization can be started either externally or internally. If HPX is not started externally (by the application), hpxMP will initialize HPX internally before scheduling any work. The function designed to start the HPX back end properly is invoked in each function call generated by the compiler to make sure HPX is properly started before any #pragma omp related functions are called, see Listing 8.

6 BENCHMARKS

In this paper, four benchmarks are used to compare the performance between Clang's implementation of OpenMP and our implementation of hpxMP: daxpy, dense vector addition, dense matrix addition, and dense matrix multiplication. These benchmarks are tested on Marvin, a node in the Center for Computation and Technology (CCT)'s Rostam cluster at Louisiana State University. The hardware properties of Marvin are shown in Table 4 and the libraries and compiler used to build hpxMP and its dependencies can be found in Table 5.

The benchmark suite of Blaze (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks) is used to analyze the performance. For each benchmark, a heat map demonstrates the ratio, r, of the Mega Floating Point Operations Per Second (MFLOP/s) between hpxMP and OpenMP. To make the heat map plots easier to analyze, only a portion of the larger input sizes n is visualized. For an overall overview, three scaling graphs with thread numbers 4, 8, and 16 are plotted for each benchmark, showing the relation between MFLOP/s and size n. The size of the vectors and matrices in the benchmarks increases arithmetically from 1 to 10 million. We picked these three thread numbers as an example since the behavior looks similar for all other candidates. Blaze uses a set of thresholds for different operations to be executed in parallel. For each of the following benchmarks, if the number of elements in the vector or matrix (depending on the benchmark) is smaller than the specified threshold for that operation, it is executed single-threaded.

Table 4: System configuration of the Marvin node. All benchmarks are run on this node.
Server Name: Rostam
CPU: 2 x Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz
RAM: 48 GB
Number of Cores: 16

Table 5: Overview of the compilers, software, and operating system used to build hpxMP, HPX, Blaze and their dependencies.
OS: CentOS release 7.6.1810 (Core)
Kernel: 3.10.0-957.1.3.el7.x86_64
Compiler: clang 6.0.1
gperftools: 2.7
boost: 1.68.0
OpenMP: 3.1
HPX: commit 140b878
Blaze: 3.4
6.1 Dense Vector Addition

Dense vector addition (dvecdvecadd) is a benchmark that adds two dense vectors a and b and stores the result in vector c, where a, b ∈ R^n. The addition operation is c[i] = a[i] + b[i]. The parallelization threshold for the dvecdvecadd benchmark is set to 38,000, so we expect to see the effect of parallelization only when the vector size gets greater than or equal to 38,000.

The ratio of performance r is shown in Figure 2. For small vectors (≤ 103,258 elements) hpxMP scales less than OpenMP, especially when the thread number is large, but gets closer to OpenMP as the vector size is increasing. Compared to OpenMP, the best performance of hpxMP is achieved between vector sizes 431,318 and 2,180,065 and thread numbers between 1 and 7. Except for some outliers, hpxMP is between 0% and 30% slower than the optimized OpenMP version for larger vector sizes. The scaling plots are shown in Figure 6.
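For reference, the operation measured by this benchmark boils down to the following Blaze expression (a minimal sketch based on Blaze's public API; the actual measurement harness is part of Blazemark). The assignment c = a + b is what Blaze parallelizes through its OpenMP backend, and hence through hpxMP, once the threshold is reached.

#include <blaze/Blaze.h>
#include <cstddef>

// Essence of the dvecdvecadd benchmark: element-wise addition of two dense
// vectors, parallelized by Blaze's backend once n exceeds the threshold.
void dvecdvecadd(std::size_t n)
{
    blaze::DynamicVector<double> a(n, 1.0);
    blaze::DynamicVector<double> b(n, 2.0);
    blaze::DynamicVector<double> c;

    c = a + b;   // c[i] = a[i] + b[i]
}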

Listing 7: Implementation of the gcc entry fork call in hpxMP

void
xexpand(KMP_API_NAME_GOMP_PARALLEL)(void (*task)(void *), void *data, unsigned num_threads, unsigned int flags) {
    omp_task_data *my_data = hpx_backend->get_task_data();
    my_data->set_threads_requested(num_threads);
    __kmp_GOMP_fork_call(task, (microtask_t)__kmp_GOMP_microtask_wrapper, 2, task, data);
}

Listing 8: Implementation of starting HPX in hpxMP

hpx::start(f, desc_cmdline, argc, argv, cfg,
           std::bind(&wait_for_startup, boost::ref(startup_mtx), boost::ref(cond), boost::ref(running)));
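The "start HPX only if it is not already running" behavior described in Section 5.6 can be pictured as the following guard (our own sketch; the helper name is hypothetical, and the real hpxMP code additionally wires in the command-line setup and the condition-variable handshake of Listing 8).

#include <mutex>

// Hypothetical helper illustrating the pattern from Section 5.6: every
// compiler-generated entry point calls this first, so HPX is started at
// most once and before any OpenMP work is scheduled.
void ensure_hpx_backend_started()
{
    static std::once_flag hpx_started;
    std::call_once(hpx_started, [] {
        // here hpxMP would invoke hpx::start(...) as in Listing 8 and
        // block until the runtime signals that it is up
    });
}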

For all numbers of threads, we see that both implementations behave similarly until the parallelization starts. For vector sizes between 10^5 and 10^6 hpxMP is slower. For larger input sizes the implementations are comparable again.

Figure 2: Performance Ratio using dvecdvecadd Benchmark (hpxMP/OpenMP)

Figure 3: Performance Ratio using daxpy Benchmark (hpxMP/OpenMP)

6.2 Daxpy

Daxpy is a benchmark that multiplies a number β with a dense vector a, adds the result to a dense vector b, and stores the result in the same vector b, where β ∈ R and a, b ∈ R^n. The operation used for this benchmark is b[i] = b[i] + 3.0 * a[i]. As for the dvecdvecadd benchmark, the parallelization threshold for the daxpy benchmark is set to 38,000, so we expect to see the effect of parallelization only when the vector size gets ≥ 38,000.

The ratio of performance r is shown in Figure 3. For small vectors (≤ 103,258 elements) hpxMP scales less than OpenMP, especially when the thread number is large, but gets closer to OpenMP as the vector size is increasing. Compared to OpenMP, the best performance of hpxMP is achieved between vector sizes 431,318 and 1,017,019 and thread numbers between 1 and 8. Except for some exceptions, hpxMP is between 0% and 40% slower than the optimized OpenMP version for larger vector sizes. The scaling plots shown in Figure 7 show that for all numbers of threads, both implementations behave similarly until the parallelization starts. For vector sizes between 10^5 and 10^6 hpxMP is slower, but for larger input sizes the implementations are comparable again.

6.3 Dense Matrix Addition

Dense matrix addition (dmatdmatadd) is a benchmark that adds two dense matrices A and B and stores the result in matrix C, where A, B ∈ R^(n×n). The matrix addition operation is C[i, j] = A[i, j] + B[i, j]. The ratio of performance r is shown in Figure 4 and the scaling plots are shown in Figure 8. For the dmatdmatadd benchmark, the parallelization threshold set by Blaze is 36,100. Whenever the target matrix has more than or equal to 36,100 elements (corresponding to a matrix size of 190 by 190), this operation is executed in parallel.

Figure 4 shows that OpenMP performs better especially when the matrix size is small and the number of threads is large. For a larger number of threads, hpxMP gets closer to OpenMP for larger matrix sizes. Except for some exceptions, hpxMP is between 0% and 40% slower than the optimized OpenMP version. Figure 8 shows that for all numbers of threads, both implementations behave similarly until the parallelization starts. For matrix sizes between 230 and 455 hpxMP is slower, but the implementations are comparable again as the input size is increasing.

Figure 5: Performance Ratio using dmatdmatmult Benchmark (hpxMP/OpenMP)

Figure 4: Performance Ratio using dmatdmatadd Benchmark (hpxMP/OpenMP)

6.4 Dense Matrix Multiplication

Dense matrix multiplication (dmatdmatmult) is a benchmark that multiplies two dense matrices A and B and stores the result in matrix C, where A, B ∈ R^(n×n). The matrix multiplication operation is C = A * B. The ratio of performance r is shown in Figure 5 and the scaling plots are shown in Figure 9. For the dmatdmatmult benchmark, the parallelization threshold set by Blaze is 3,025. Whenever the target matrix has more than or equal to 3,025 elements (corresponding to a matrix size of 55 by 55), this operation is executed in parallel.

Figure 5 shows that OpenMP outperforms hpxMP only when the matrix size is between 230 and 300 and the number of threads is between 12 and 16. hpxMP gets as fast as OpenMP for the other matrix sizes. Figure 9 shows that for all numbers of threads, both implementations behave similarly until the parallelization starts. For matrix sizes between 74 and 113 hpxMP is slower. For larger input sizes, the implementations are comparable again.

7 CONCLUSION AND OUTLOOK

This paper presents the design and architecture of an OpenMP runtime library built on top of HPX that supports most of the OpenMP V3 specification. We have demonstrated its full functionality by running various linear algebra benchmarks of the Blaze C++ library, which for the tested functionalities relies on OpenMP for its parallelization needs. By replacing the compiler-supplied OpenMP runtime with our own, we were able to compare the performance of the two implementations. In general, our implementation is not able to reach the same performance compared to the native OpenMP solution yet. This is in part caused by Blaze being optimized for the compiler-supplied OpenMP implementations. We will work on optimizing the performance of hpxMP in the future. We have however demonstrated the viability of our solution for providing a smooth migration for applications that either directly or indirectly depend on OpenMP, but which have to be extended to benefit from a more general task-based programming model.

ACKNOWLEDGMENTS

We thank Jeremy Kemp for providing the initial implementation of hpxMP (https://github.com/kempj/hpxMP), which was extended by the authors. The work on hpxMP is funded by the National Science Foundation (award 1737785).

A SOURCE CODE

The source code of hpxMP is available on GitHub (https://github.com/STEllAR-GROUP/hpxMP), released under the BSL 1.0.

REFERENCES

Robert A. Alfieri. 1994. An efficient kernel-based implementation of POSIX threads. In Proceedings of the USENIX Summer 1994 Technical Conference - Volume 1 (USTC'94). USENIX Association, Berkeley, CA, USA, 5–5. http://portal.acm.org/citation.cfm?id=1267257.1267262
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK Users' Guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA.
David A. Bader. 2016. Evolving MPI+X Toward Exascale. Computer 49, 8 (2016), 10. https://doi.org/10.1109/MC.2016.232
Richard F Barrett, Dylan T Stark, Courtenay T Vaughan, Ryan E Grant, Stephen L Olivier, and Kevin T Pedretti. 2015. Toward an evolutionary task parallel integrated MPI+X programming model. In Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores. ACM, 30–39.
John Biddiscombe, Anton Bikineev, Thomas Heller, and Hartmut Kaiser. 2017. Zero Copy Serialization using RMA in the HPX Distributed Task-Based Runtime. (2017).
L Susan Blackford, Antoine Petitet, Roldan Pozo, Karin Remington, R Clint Whaley, James Demmel, Jack Dongarra, Iain Duff, Sven Hammarling, Greg Henry, et al. 2002. An updated set of basic linear algebra subprograms (BLAS). ACM Trans. Math. Software 28, 2 (2002), 135–151.
C++ Standards Committee. 2011. ISO/IEC 14882:2011, Standard for Programming Language C++ (C++11). Technical Report. ISO/IEC JTC1/SC22/WG21 (the C++ Standards Committee). https://wg21.link/N3337, last publicly available draft.
C++ Standards Committee. 2017. ISO/IEC DIS 14882, Standard for Programming Language C++ (C++17). Technical Report. ISO/IEC JTC1/SC22/WG21 (the C++ Standards Committee). https://wg21.link/N4659, last publicly available draft.
Bradford L. Chamberlain, David Callahan, and Hans P. Zima. 2007. Parallel Programmability and the Chapel Language. International Journal of High Performance Computing Applications (IJHPCA) 21, 3 (2007), 291–312. https://doi.org/10.1177/1094342007078442


Figure 6: Scaling plots for the dvecdvecadd benchmark for different numbers of threads: (a) 4, (b) 8, and (c) 16


Figure 7: Scaling plots for the daxpy benchmark for different numbers of threads: (a) 4, (b) 8, and (c) 16


Figure 8: Scaling plots for the dmatdmatadd benchmark for different numbers of threads: (a) 4, (b) 8, and (c) 16


Figure 9: Scaling plots for the dmatdmatmult benchmark for different numbers of threads: (a) 4, (b) 8, and (c) 16

Marcin Copik and Hartmut Kaiser. 2017. Using SYCL as an Implementation Framework for HPX.Compute. In Proceedings of the 5th International Workshop on OpenCL. ACM, 30.
Leonardo Dagum and Ramesh Menon. 1998. OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Computational Science and Engineering 5, 1 (1998), 46–55. https://doi.org/10.1109/99.660313
Bronis R. de Supinski and Michael Klemm. 2017. OpenMP Technical Report 6: Version 5.0 Preview 2. Technical Report. OpenMP Architecture Review Board.
H. Carter Edwards, Christian R. Trott, and Daniel Sunderland. 2014. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. J. Parallel and Distrib. Comput. 74, 12 (2014), 3202–3216. https://doi.org/10.1016/j.jpdc.2014.07.003 Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
Mark Galassi, Jim Davies, James Theiler, Brian Gough, Gerard Jungman, Patrick Alken, Michael Booth, and Fabrice Rossi. 2002. GNU scientific library. Network Theory Ltd 3 (2002).
Patricia Grubel, Hartmut Kaiser, Jeanine Cook, and Adrian Serio. 2015. The performance implication of task size for applications on the HPX runtime system. In Cluster Computing (CLUSTER), 2015 IEEE International Conference on. IEEE, 682–689.
Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
Thomas Heller, Patrick Diehl, Zachary Byerly, John Biddiscombe, and Hartmut Kaiser. 2017. HPX – An open source C++ Standard Library for Parallelism and Concurrency.
Thomas Heller, Bryce Lelbach, Kevin Huck, John Biddiscombe, Patricia Grubel, Alice Koniges, Matthias Kretz, Dominic Marcello, David Pfander, Adrian Serio, Juhan Frank, Geoffrey Clayton, Dirk Pflüger, David Eder, and Hartmut Kaiser. 2018. Harnessing Billions of Tasks for a Scalable Portable Hydrodynamic Simulation of the Merger of Two Stars. International Journal of High Performance Computing Applications (IJHPCA) (2018).
Klaus Iglberger, Georg Hager, Jan Treibig, and Ulrich Rüde. 2012. High performance smart expression template math libraries. In High Performance Computing and Simulation (HPCS), 2012 International Conference on. IEEE, 367–373.
Intel. 2019. Intel Threading Building Blocks. http://www.threadingbuildingblocks.org.
Zahra Khatami, Hartmut Kaiser, Patricia Grubel, Adrian Serio, and J Ramanujam. 2016. A distributed n-body application implemented with HPX. In 2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA). IEEE, 57–64.
Zahra Khatami, Hartmut Kaiser, and J Ramanujam. 2017. Redesigning OP2 compiler to use HPX runtime asynchronous techniques. arXiv preprint arXiv:1703.09264 (2017).
Charles E. Leiserson. 2009. The Cilk++ concurrency platform. In DAC '09: Proceedings of the 46th Annual Design Automation Conference. ACM, New York, NY, USA, 522–527. https://doi.org/10.1145/1629911.1630048
Tim Mattson. 2013. A "Hands-on" Introduction to OpenMP.
Microsoft. 2010. Microsoft Parallel Pattern Library. http://msdn.microsoft.com/en-us/library/dd492418.aspx.
OpenMP Consortium. 2018. OpenMP Specification Version 5.0. Technical Report. OpenMP Consortium. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf.
PPL. 2011. PPL - Parallel Programming Laboratory. http://charm.cs.uiuc.edu/.
Karl Rupp, Philippe Tillet, Florian Rudolf, Josef Weinbub, Andreas Morhammer, Tibor Grasser, Ansgar Jüngel, and Siegfried Selberherr. 2016. ViennaCL—Linear Algebra Library for Multi- and Many-Core Architectures. SIAM Journal on Scientific Computing 38, 5 (2016), S412–S439.
Conrad Sanderson and Ryan Curtin. 2016. Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software (2016).
Bibek Wagle, Samuel Kellar, Adrian Serio, and Hartmut Kaiser. 2018. Methodology for Adaptive Active Message Coalescing in Task Based Runtime Systems. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 1133–1140.
Qian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. 2013. AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs. In High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for. IEEE, 1–12.