Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

Joseph Schuchart, Christoph Niethammer, José Gracia
Höchstleistungsrechenzentrum Stuttgart (HLRS), Stuttgart, Germany
[email protected], [email protected], [email protected]

ABSTRACT

Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. Meanwhile, new low-level implementations of light-weight, cooperatively scheduled execution contexts (fibers, aka user-level threads (ULT)) are meant to serve as a basis for higher-level APMs, and their integration in MPI implementations has been proposed as a replacement for traditional POSIX thread support to alleviate these challenges.

In this paper, we first establish a taxonomy in an attempt to clearly distinguish different concepts in the parallel software stack. We argue that the proposed tight integration of fiber implementations with MPI is neither warranted nor beneficial and instead is detrimental to the goal of MPI being a portable communication abstraction. We propose MPI Continuations as an extension to the MPI standard to provide callback-based notifications on completed operations, leading to a clear separation of concerns by providing a loose coupling mechanism between MPI and APMs. We show that this interface is flexible and interacts well with different APMs, namely OpenMP detached tasks, OmpSs-2, and Argobots.

KEYWORDS

MPI+X, Tasks, Continuations, OpenMP, OmpSs, TAMPI, Fiber, ULT

ACM Reference Format:
Joseph Schuchart, Christoph Niethammer, and José Gracia. 2020. Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations. In 27th European MPI Users' Group Meeting (EuroMPI/USA '20), September 21–24, 2020, Austin, TX, USA. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3416315.3416320

1 BACKGROUND AND MOTIVATION

Asynchronous (task-based) programming models have gained more and more traction, promising to help users better utilize ubiquitous multi- and many-core systems by exposing all available concurrency to a runtime system. Applications are expressed in terms of work-packages (tasks) with well-defined inputs and outputs, which guide the runtime scheduler in determining the correct execution order of tasks. A wide range of task-based programming models have been developed, ranging from node-local approaches such as OpenMP [40] and OmpSs [13] to distributed systems such as HPX [27], StarPU [3], DASH [49], and PaRSEC [7]. All of them have in common that a set of tasks is scheduled for execution by a set of worker threads based on constraints provided by the user, either in the form of dependencies or dataflow expressions.

At the same time, MPI is still the dominant interface for inter-process communication in parallel applications [6], providing blocking and non-blocking point-to-point and collective operations as well as I/O capabilities on top of elaborate datatype and process group abstractions [37]. Recent years have seen significant improvements to the way MPI implementations handle communication by multiple threads of execution in parallel [20, 41].

However, task-based programming models pose new challenges to the way applications interact with MPI. While MPI provides non-blocking communications that can be used to hide communication latencies, it is left to the application layer to ensure that all necessary communication operations are eventually initiated to avoid deadlocks, that in-flight communication operations are eventually completed, and that the interactions between communicating tasks are correctly handled. Higher-level tasking approaches such as PaRSEC or HPX commonly track the state of active communication operations and regularly test for completion before acting upon such a change in state. With node-local programming models such as OpenMP, this tracking of active communication operations is left to the application layer. Unfortunately, the simple approach of test-yield cycles inside a task does not provide optimal performance due to CPU cycles being wasted on testing and (in the case of OpenMP) may not even be portable [50].

MPI is thus currently not well equipped to support users and developers of asynchronous programming models, which (among other things) has prompted some higher-level runtimes to move away from MPI towards more asynchronous alternatives [10].

Different approaches have been proposed to mitigate this burden, including Task-Aware MPI (TAMPI) [46] and a tight integration of user-level thread, fiber, and tasking libraries (cf. Section 2) with MPI implementations [32, 34, 45]. In this paper, we argue that the latter is the least preferable solution and in fact should be viewed critically by the MPI community.¹ After establishing a taxonomy to be used throughout this paper (Section 2), we will make the argument that such an integration is neither warranted by the MPI standard nor beneficial to the goal of supporting the development of portable MPI applications (Section 3). Instead, we propose adding continuations to the MPI standard to facilitate a loose coupling between higher-level concurrency abstractions and MPI (Section 4). We will show that the interface is flexible and can be used with abstractions on different levels, including OpenMP detached tasks, OmpSs-2 task events, and Argobots (Section 5).

¹ This paper is in part an extended version of an argument made online at https://github.com/open-mpi/ompi/pull/6578#issuecomment-602542762.

2 TAXONOMY

Unfortunately, the term thread (or thread of execution) is not well-defined in the literature. Terms like kernel threads, (light-weight) processes, user-level and user-space threads, POSIX threads (pthreads), light-weight threads, and fibers are used without a clear consensus on a common taxonomy. Establishing such a taxonomy is thus warranted to better guide the discussion in the remainder of this paper. Some of the description is based on the terminology provided in [59] and [52]. The relation between the concepts is depicted in Figure 1, described here from bottom to top.

Figure 1: Hierarchy of kernel and user threads, fibers, and tasks with the different scheduler entities. The user thread scheduler is optional and not present in most modern architectures. Tasks may be executed directly on top of threads without the use of fibers.

Processor. A processor provides the active physical resources to execute a stream of instructions. In the context of this work, the term refers to a unicore CPU or a single core on a multi-core CPU.

Kernel Thread. A kernel thread is a thread of execution that is entirely controlled by the operating system kernel. Kernel threads only execute within the kernel space and serve tasks such as asynchronous I/O, memory management, and signal handling. They also form the basis of light-weight processes.

Process. A process is an execution instance of a program and a user-level abstraction of state, providing isolation of kernel-level resources such as memory pages and file pointers between different processes. A process may use one or more light-weight processes.

Light-weight Process (LWP). A light-weight process can be considered an abstraction of a processor inside the operating system (OS), on which threads are scheduled. LWPs are scheduled preemptively by the OS kernel, i.e., their execution may be preempted and another LWP be scheduled onto the used processor. Different LWPs may share resources such as file descriptors and memory if they belong to the same process.

Cooperative and Preemptive Scheduling. These are the two main scheduling strategies to distinguish. With cooperative scheduling, a scheduler assigns resources to a process or thread until it relinquishes them. The benefits of cooperative scheduling are lower scheduling overheads (scheduling decisions are made only at well-defined points) and—at least on single-core systems—atomicity of execution between scheduling points [1]. With preemptive scheduling, the scheduler is able to preempt the execution of a thread after the expiration of a time-slice or to schedule processes or threads with higher priority. Most modern OS kernels employ preemptive scheduling to ensure fairness and responsiveness.

User Thread. User threads are threads of execution that share resources within the confines of a process. N threads may be statically assigned to N LWPs (1:1 mapping) or multiplexed onto either a single (1:N mapping) or M LWPs (N:M mapping) by a user thread scheduler. In all cases, threads are comprised of some private state (stack memory, register context, thread-local variables, signal masks, errno value, scheduling priority and policy) and have access to resources shared between threads within a process (heap memory, global variables, file descriptors, etc.).

A common example of a user thread implementation are POSIX threads (pthreads) [24]. In all currently available flavors known to the authors, a 1:1 mapping between pthreads and LWPs is maintained, including Linux [31] and various BSD flavors [55–57]. They are the basic building block for any type of multi-threaded applications and runtime systems. Solaris provided its own thread implementation (Solaris threads), which originally employed an N:M mapping that was abandoned eventually for efficiency reasons [54], as were previous N:M implementations in BSD. POSIX, Solaris, and Windows threads [35] are scheduled preemptively.² We will refer to user threads as defined here simply as threads in the remainder of this work.

Fiber. Fibers differ from threads in that multiple fibers share some of the context of the thread they are executing on, i.e., they are not fully independent execution entities [16]. A fiber is comprised of a private stack and a memory region to save relevant registers when the fiber is rescheduled. However, fibers executing on the same thread have access to the same thread-private variables and share the priority and time-slice of the thread. They are unknown to the OS, i.e., they do not possess a thread context assigned by the OS or user thread scheduler, and are cooperatively scheduled for execution on threads by simply changing the stack pointer and adjusting the register context.

² We note that in POSIX.1, the default scheduler SCHED_OTHER does not mandate preemptive scheduling of threads but all implementations (both N:M and 1:1) known to the authors are scheduled preemptively. Applications and libraries relying on preemptive scheduling for correctness are thus de-facto correct. De-jure correct applications are required to periodically call sched_yield or use pthread-specific synchronization mechanisms to ensure progress.

A fiber blocked in a blocking call will block the thread until the operation has completed. A fiber blocked on a mutex will cause the executing thread to be blocked (and potentially rescheduled in lieu of another thread).

Fiber implementations may provide synchronization primitives that reschedule the fiber to allow the underlying thread to execute a different fiber if the execution would otherwise be blocked.

int myrank;

void *threadfn(intptr_t tag) {
  int val = 0;
  /* blocking receive of a message carrying a thread-specific tag */
  MPI_Recv(&val, 1, MPI_INT, myrank,
           tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  return NULL;
}

void test() {
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  pthread_t threads[NUM_THREADS];
  /* create receiving threads */
  for (intptr_t i = 0; i < NUM_THREADS; ++i)
    pthread_create(&threads[i], NULL,
                   (void *(*)(void *))&threadfn, (void *)i);
  /* send the messages */
  for (int i = 0; i < NUM_THREADS; ++i) {
    int val = i;
    MPI_Send(&val, 1, MPI_INT, myrank, i, MPI_COMM_WORLD);
  }
  /* wait for all receiving threads to finish */
  for (int i = 0; i < NUM_THREADS; ++i)
    pthread_join(threads[i], NULL);
}

Listing 1: Blocking MPI communication inside POSIX threads; each thread receives one message sent by the main thread.

Listing 2 (OpenMP task version of the example in Listing 1): receiving tasks are created inside a parallel region before the main thread sends the matching messages; a sketch of this variant follows below.
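A minimal sketch of such an OpenMP-task variant is shown below; the parallel/single structure, the NUM_TASKS constant, and the function name are illustrative assumptions rather than the original listing.

#include <mpi.h>

#define NUM_TASKS 64

void test_tasks(void) {
  int myrank;
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  #pragma omp parallel
  #pragma omp single
  {
    /* create receiving tasks */
    for (int i = 0; i < NUM_TASKS; ++i) {
      #pragma omp task firstprivate(i)
      {
        int val = 0;
        /* blocks the executing OpenMP thread until the message arrives */
        MPI_Recv(&val, 1, MPI_INT, myrank, i,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
    }
    /* send the messages */
    for (int i = 0; i < NUM_TASKS; ++i) {
      int val = i;
      MPI_Send(&val, 1, MPI_INT, myrank, i, MPI_COMM_WORLD);
    }
  }
}

If all OpenMP threads pick up receiving tasks before the send loop runs, every thread blocks inside MPI_Recv and the matching sends are never posted, which is the deadlock scenario discussed in Section 3.2.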

Figure 2: Correctness of threads and fibers communicating through MPI, with and without fiber integration.

Replacing the calls to pthread_create in Listing 1 with the respective calls in a fiber library, however, would not yield a correct program even if the MPI implementation called sched_yield. Depending on the scheduling order of the fibers, the application may deadlock because fibers may be executed for which the main thread does not currently send a message. The application has to be considered erroneous if it cannot guarantee that all relevant communication operations are initiated.⁴ Such an erroneous execution is depicted in Figure 2b, where the thread executing the receiving fiber (orange) is blocked indefinitely inside MPI and the sending fiber (yellow) is never scheduled for execution.

It should be noted that the behavior of the MPI implementation in that case is perfectly legitimate: since fibers are executed by threads, the implementation may use mutexes to guard its internal data structures against corruption from two or more threads making MPI calls in parallel. The correctness of the application's communication pattern is beyond the scope of the MPI implementation.

Correctness may be the main argument for a tight integration of (some) fiber libraries into MPI implementations to ease the burden for users. However, the MPI standard does not mandate MPI implementations to support threading libraries beyond the basic building blocks (such as pthreads). Thus, having two implementations—one with basic pthread support (Figure 2b) and one with support for the used fiber library (Figure 2c)—results in a situation in which both libraries are correct (from the MPI standard's viewpoint) but only one produces a correct outcome. Hence, applications using blocking MPI communication inside fibers have to rely on an implementation detail to ensure correct execution. This raises significant concerns regarding the portability of MPI applications employing fibers.

3.2 Portability

Portability is a major goal of the MPI standard, which mentions the term portable at least 88 times—once every ten pages—and explicitly states that "the interface should establish a practical, portable, efficient, and flexible standard for message-passing." [37] It is thus important for any solution tackling the issue of task-parallel MPI communication to provide means that are portable across multiple implementations.

Listing 2 shows a version of the example provided in Listing 1 using OpenMP tasks. As soon as NUM_TASKS exceeds the number of threads available in the OpenMP parallel section, this application may deadlock with most combinations of MPI and OpenMP implementations. Table 1 provides an overview of such combinations under which this application may run to completion for arbitrary NUM_TASKS, i.e., won't deadlock. Note that the positive combination of BOLT [26] and Open MPI/MPICH [32] is not unconditional: at the time of this writing, the support for Argobots (and Qthreads) in both MPI implementations depends on a compile-time configuration switch of the MPI library. Applications would thus have to rely on either building their own MPI library or on the respective system administrators to provide—and on users to choose—the installation with support for the correct fiber library enabled. A major issue for applications relying on the availability of fiber support in MPI libraries is that this information is not exposed to the application layer, i.e., an application has no way of knowing whether the underlying MPI library was configured appropriately.

By offering such integration in releases of mainstream MPI implementations, the community is actively inviting application developers to start relying on such tight integration, fostering the development of non-portable MPI applications. The resulting incompatibilities and efforts for debugging deadlocks will affect a wide range of the MPI community. If, however, users were not encouraged to actually use these features, the whole effort would be moot and the integration—including the disruptions to the threading support that is part of an implementation's core, the required testing, and efforts for integration into additional third-party libraries—was entirely unnecessary. Either way, the efforts to integrate fiber models into mainstream releases of MPI libraries should be considered harmful.

3.3 Portable Performance

One of the major differences in the use of threads and fibers is oversubscription: while applications and runtime systems typically try to avoid the use of more worker threads than available processors to mitigate the negative consequences of kernel-level scheduling, oversubscription is at the heart of the use of fibers.

⁴ Such behavior would be comparable with Example 3.8 of the MPI standard, which calls an attempt to use blocking receive calls on two communicating processes before initiating the send operations errant [37, p. 43].

Thanks to the low-overhead switching between fibers entirely in user-space, it is easily possible to create more fibers than worker threads and hide I/O-related latencies, including communication and file I/O.

However, if low latencies in the execution of a sequence of operations are paramount to the overall performance of an algorithm, unsolicited context switching inside MPI may negatively impact overall application performance. As an example, application-level communication protocols using MPI RMA may rely on low-latency accumulate operations [48]. Any delay may result in unnecessarily increased latencies with the application layer having no way to control the context switching initiated by the MPI implementation. While many communication latencies should be hidden through fiber oversubscription, some can be tolerated and it should be up to the application to decide between the two cases.

Table 1: MPI/OpenMP task support matrix for using blocking MPI communication inside tasks.

MPI \ OpenMP      BOLT   GCC   Clang   Intel   Cray
Open MPI master   (✓)    ✗     ✗       ✗       ✗
Open MPI <5.0     ✗      ✗     ✗       ✗       ✗
MPICH             (✓)    ✗     ✗       ✗       ✗
MVAPICH           ✗      ✗     ✗       ✗       ✗
Intel MPI         ✗      ✗     ✗       ✗       ✗
Cray MPICH        ✗      ✗     ✗       ✗       ✗
Bull MPI          ✗      ✗     ✗       ✗       ✗
SGI/HP MPT        ✗      ✗     ✗       ✗       ✗

4 MPI CONTINUATIONS

Instead of tightly integrating various asynchronous execution libraries, we advocate for a mechanism to accomplish loose coupling between MPI and the higher-level concurrency abstractions.

Continuations are a concept for structuring the execution of different parts of an application's code, dating back to research on Algol 60 [15, 44]. They have recently been proposed to the C++ standard in the form of std::future::then to coordinate asynchronous activities [18]. Continuations consist of a body (the code to execute) and a context (some state passed to the body) on which to operate.

A similar mechanism for MPI can be devised that allows a callback to be attached to an MPI request, which will be invoked once the operation represented by the request has completed. This establishes a notification scheme to be used by applications to timely react to the completion of operations, e.g., to wake up a blocked fiber or release the dependencies of a detached task, all while relieving applications of managing MPI request objects themselves.

We propose a set of functions to set up continuations for active MPI operations, which may be invoked by application threads during the execution of communication-related MPI functions or by an implementation-internal progress thread (if available) once all relevant operations are found to have completed. The body of a continuation may call any MPI library function and thus start new MPI operations in response to the completion of previous ones.

The MPI Continuations API consists of two parts. First, continuation requests are used to aggregate and progress continuations. Second, continuations are then attached to active MPI operations and registered with continuation requests for tracking.

/* The callback function definition */
typedef void (MPIX_Continue_cb_function)(MPI_Status *statuses, void *cb_data);

/* Create a continuation request. */
int MPIX_Continue_init(MPI_Request *cont_req);

/* Attach a continuation to an active operation represented by
 * op_request. Upon completion of the operation, the callback
 * cb will be invoked and passed status and cb_data as arguments.
 * The status object will be set before the continuation is invoked.
 * If the operation has completed already the continuation will
 * not be attached or invoked and flag will be set to 1. */
int MPIX_Continue(
    MPI_Request               *op_request,
    int                       *flag,
    MPIX_Continue_cb_function *cb,
    void                      *cb_data,
    MPI_Status                *status,
    MPI_Request                cont_req);

/* Similar to the above except that the continuation is only
 * invoked once all op_requests have completed. */
int MPIX_Continueall(
    int                        count,
    MPI_Request                op_requests[],
    int                       *flag,
    MPIX_Continue_cb_function *cb,
    void                      *cb_data,
    MPI_Status                 statuses[],
    MPI_Request                cont_req);

Listing 3: MPI Continuation interface.

4.1 Continuation Requests

Continuation requests (CR) are a form of persistent requests that are created through a call to MPIX_Continue_init and released eventually using MPI_Request_free. A CR tracks a set of active continuations that are registered with it. The set grows upon registration of a new continuation and shrinks once a continuation has been executed. A call to MPI_Test on a CR returns flag == 1 if no active continuations are registered. Conversely, a call to MPI_Wait blocks until all registered continuations have completed. Further details are discussed in Section 4.5.

Continuation requests serve two main purposes. First, they aggregate continuations attached to active operations and enable testing and waiting for their completion. Second, by calling MPI_Test on a CR, applications are able to progress outstanding operations and continuations if no MPI-internal progress mechanism exists to process them. See Section 4.5.4 for a discussion on progress.

4.2 Registration of Continuations

A continuation is attached to one or several active operations and registered with the continuation request for tracking. A call to MPIX_Continue attaches a continuation to a single operation request while the use of MPIX_Continueall sets up the continuation to be invoked once all operations represented by the provided requests have completed. As shown in Listing 3, the continuation is represented through a callback function with the signature of MPIX_Continue_cb_function (provided as function pointer cb) and a context (cb_data). Together with the pointer value provided for the status or statuses argument, the cb_data will be passed to cb upon invocation.
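As an illustration of the registration calls described above, the following sketch attaches a continuation to a single receive operation; the recv_ctx structure, the callback, and the printf are illustrative assumptions rather than part of the proposed interface.

#include <stdio.h>
#include <mpi.h>

/* illustrative context passed to the continuation */
typedef struct {
  int        value;   /* receive buffer */
  MPI_Status status;  /* must stay valid until the callback has run */
} recv_ctx;

/* callback invoked by MPI once the receive has completed */
void recv_done(MPI_Status *status, void *cb_data) {
  recv_ctx *ctx = (recv_ctx *)cb_data;
  printf("received %d from rank %d\n", ctx->value, status->MPI_SOURCE);
}

void post_recv(MPI_Request cont_req, recv_ctx *ctx) {
  MPI_Request req;
  int flag;
  MPI_Irecv(&ctx->value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
            MPI_COMM_WORLD, &req);
  /* attach the continuation; req is released back to MPI upon return */
  MPIX_Continue(&req, &flag, &recv_done, ctx, &ctx->status, cont_req);
  if (flag) /* operation already complete: the status is set, but the
               callback was not invoked, so handle completion here */
    recv_done(&ctx->status, ctx);
}

MPIX_Continueall is used in the same way, except that the callback is only invoked once all of the provided operations have completed.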

Upon return, all provided non-persistent requests are set to MPI_REQUEST_NULL, effectively releasing them back to MPI.⁵ No copy of the requests should be used to cancel or test/wait for the completion of the associated operations. Consequently, only one continuation may be attached to a non-persistent operation. In contrast, persistent requests may still be canceled, tested, and waited for. More than one continuation may be attached to a persistent request but the continuations themselves are not persistent.

Similar to MPI_Test, flag shall be set to 1 if all operations have completed already. In that case the continuation callback shall not be invoked by MPI, leaving it to the application to handle immediate completion (see Section 4.5.1). Otherwise, flag is set to zero.

Unless MPI_STATUS_IGNORE or MPI_STATUSES_IGNORE has been provided, the status object(s) will be set before the continuation is invoked (or before returning from MPIX_Continue[all] in case all operations have completed already). The status objects should be allocated by the application in a memory location that remains valid until the operation has completed and the callback has been invoked, e.g., on the stack of a blocked fiber or in memory allocated on the heap that is released inside the continuation callback.

⁵ This has been a deliberate design decision. Otherwise, the release of a request object inside the continuation or the MPI library would have to be synchronized with operations on it outside of the continuation. We thus avoid this source of errors.

4.3 Control Flow Example

An example of a potential control flow between tasks and the MPI implementation is sketched in Figure 3. Task T1 (blue) is calling MPI_Isend and MPI_Irecv before attaching a continuation to the resulting requests. The task is then blocked by means provided by the tasking library, allowing a second task T2 (green) to be scheduled for execution. After some computation, T2 calls MPI_Isend, in which the MPI library determines that both previously created requests have completed. Consequently, the continuation is invoked in the context of T2 from within the call to MPI_Isend, which releases the blocked task T1. After returning from the MPI call, T2 attaches a continuation to its resulting request and blocks, allowing the execution of the just released task T1 to resume.

Figure 3: The flow of control between tasks and MPI.

4.4 Usage in a Simple Example

The example provided in Listing 4 uses MPI Continuations to release detached tasks in an OmpSs-2 [9] application once communication operations have completed. Looping over all fields of a local domain in each timestep, one task is created per field, which carries an output dependency on its field object (Line 19). Inside each task, the received boundary data from the previous timestep is incorporated and a solver is applied to the field (potentially with nested tasks) before the local boundary is packed for sending (Lines 22–24).

Both the send and receive operations are initiated and a continuation is attached using MPIX_Continueall (Lines 27–37). OmpSs-2 offers direct access to API functions of the underlying nanos6 runtime. In this case, the event counter for the current task is retrieved, increased, and passed as the context of the continuation. If all operations completed immediately, the event counter is decremented again.⁶ Otherwise, the task will run to completion but its dependencies will not be released before the event counter is decremented in the continuation callback continue_cb (Lines 6–9).

In order to ensure that all continuations are eventually completed, a polling service is registered with the OmpSs-2 runtime (Line 15). This polling service is regularly invoked by the runtime and—in the function poll_mpi in Lines 1–5—calls MPI_Test on the continuation request.⁷ This will invoke all available continuations and eventually cause all tasks of the next timestep to become available for execution once the respective send and receive operations of the previous timestep have completed.

With only approximately 15 lines of code (including setup and tear-down), it is possible to integrate MPI communication with a task-parallel application to fully overlap communication and computation. While this is, naturally, more code than would be required with TAMPI or fiber support in MPI, it is fully portable and allows an application to manage application-level concerns while leaving MPI-level concerns such as request management to MPI. By making the interaction between non-blocking MPI operations and the task scheduler explicit, this approach ensures that both MPI and the task programming system support all required operations.

⁶ The event counter has to be incremented first as the OmpSs-2 specification mandates that "the user is responsible for not fulfilling events that the target task has still not bound." [9, §4.4]

⁷ Note that OpenMP does not currently provide such an interface, which requires the user to create a progress task or spawn a progress thread that yields after testing.

4.5 Rationale

A scheme similar to the one proposed here could be implemented outside of the MPI library with an interface similar to TAMPI. However, a main advantage of the integration with MPI is that the continuations can be invoked as soon as any thread calls into MPI and determines the associated operations to be complete. In more complex applications, this allows for the invocation of continuations set up by one part of the application during an MPI call issued in another part of the application, potentially reducing the time to release of blocked tasks and thus preventing thread starvation.

 1 int poll_mpi(MPI_Request *cont_req) {
 2   int flag; /* result stored in flag not relevant here */
 3   MPI_Test(cont_req, &flag, MPI_STATUS_IGNORE);
 4   return false; /* signal that we should be called again */
 5 }
 6 void continue_cb(MPI_Status *status, void *task) {
 7   /* release the task's dependencies */
 8   nanos6_decrease_task_event_counter(task, 1);
 9 }
10 void solve(int NT, int num_fields, field_t *fields[num_fields]) {
11   /* create continuation request */
12   MPI_Request cont_req;
13   MPIX_Continue_init(&cont_req);
14   /* register polling service */
15   nanos6_register_polling_service("MPI", &poll_mpi, &cont_req);
16
17   for (int timestep = 0; timestep < NT; timestep++) {
18     for (int i = 0; i < num_fields; i++) {
19       #pragma oss task out(*fields[i])
20       {
21         /* update the field: */
22         unpack_boundary(fields[i]); /* incorporate received data */
23         solve_field(fields[i]);     /* apply the solver */
24         pack_boundary(fields[i]);   /* pack boundary for sending */
25
26         /* initiate the boundary exchange */
27         MPI_Request reqs[2];
28         MPI_Isend(fields[i]->sendbuf, ..., &reqs[0]);
29         MPI_Irecv(fields[i]->recvbuf, ..., &reqs[1]);
30
31         /* detach task if requests are active */
32         void *task = nanos6_get_current_event_counter();
33         nanos6_increase_current_task_event_counter(task, 1);
34         /* attach continuation */
35         int flag;
36         MPIX_Continueall(2, reqs, &flag, &continue_cb, task,
                            MPI_STATUSES_IGNORE, cont_req);
37         if (flag) nanos6_decrease_task_event_counter(task, 1);
38   }}}
39   /* wait for all tasks to complete and tear down */
40   #pragma oss taskwait
41   nanos6_unregister_polling_service("MPI", &poll_mpi, NULL);
42   MPI_Request_free(&cont_req);
43 }

Listing 4: Simplified example using MPI Continuations in an iterative solver. Tasks are created for each field per timestep. Communication is initiated and tasks are detached. As soon as the communication completes, the task's dependencies are released and the field's next iteration may be scheduled.

4.5.1 Restrictions. As stated earlier, continuations may be executed by application threads while executing communication functions in MPI. Exceptions are MPIX_Continue[all], which may require the application to hold a mutex that it would then attempt to acquire again inside the continuation (see Section 5.2 for an example). While not prohibited, the use of blocking operations inside continuations should be avoided as it may cause a call to a nonblocking MPI procedure to block on an unrelated operation. No other continuation may be invoked in MPI calls made within a continuation.

Continuations may not be invoked from within signal handlers inside the MPI implementation as that would restrict the application to using only async-signal-safe operations. MPI implementations relying on signals to interact with the network hardware should defer the execution to the next invocation of an MPI procedure by any of the application threads or hand over to a progress thread.

4.5.2 State Transitions. Unlike existing request types in MPI, continuation requests (CR) do not represent a single operation but a set of operations. Extending the semantics of persistent requests [4], Figure 4 depicts the state diagram of CRs. That state may change with every new registration or completion of a continuation. A newly Initialized or otherwise Inactive CR becomes Active Referenced when a continuation is registered. It remains Active Referenced if further continuations are registered. Upon completion of continuations the set of registered continuations shrinks again, and once the last registered continuation has completed the CR becomes Inactive (cf. Figure 4).

Figure 4: State diagram of continuation requests.

A continuation may be attached to a CR itself and registered with a separate CR, allowing applications to build complex patterns of notifications. The new continuation will then be executed once all continuations registered with the first CR have completed.

4.5.3 Thread-safety Requirements. Multiple threads may safely register continuations with the same continuation request in parallel but only one thread may test or wait for its completion at any given time. This allows for continuations to be registered concurrently without requiring mutual exclusion at the application level.

4.5.4 Progress Requirements. The availability of continuations is only one piece of the puzzle to reliably and portably integrate MPI with task-based applications. The issue of progress in MPI is long-standing and while some implementations provide internal mechanisms for progressing communication (e.g., progress threads) this is not behavior users can rely on in all circumstances. Thus, applications are still required to call into MPI to progress outstanding communication operations and—with the help of the proposed interface—invoke available continuation callbacks. This proposal does not change the status quo in that regard.

It is thus left to the application layer to make sure that communication is progressing. This can be achieved by regularly testing the continuation request, which could be done either through a recurring task, a polling service registered with the task runtime system (as shown for OmpSs-2 in Listing 4), or by relying on a progress thread inside MPI. The interface presented here is flexible enough to allow for its use in a wide range of circumstances and it defers to the application layer to choose the right strategy based on the available capabilities.
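For tasking runtimes that offer neither a polling service nor an internal progress thread, one possible strategy is a dedicated progress thread that repeatedly tests the continuation request; the sketch below is one way to structure such a loop, with the done flag and the sleep interval as illustrative assumptions.

#include <mpi.h>
#include <stdatomic.h>
#include <unistd.h>

/* set by the application once all communication has completed */
static atomic_int done = 0;

static void *progress_fn(void *arg) {
  MPI_Request *cont_req = (MPI_Request *)arg;
  while (!atomic_load(&done)) {
    int flag;
    /* testing the CR progresses MPI and invokes completed continuations */
    MPI_Test(cont_req, &flag, MPI_STATUS_IGNORE);
    usleep(100); /* yield the processor between polls */
  }
  return NULL;
}

The application would start this thread with pthread_create after creating the continuation request, and set done and join the thread before freeing it.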

5 EVALUATION

We will evaluate the proposed interface using the OSU latency benchmark. We have ported its multi-threaded benchmark to use Argobots instead of pthreads. We will then discuss a port of the NPB BT-MZ benchmark to C++ to compare TAMPI with MPI continuations used with both OmpSs-2 and OpenMP detached tasks.

Table 2: Software configuration.

Software   Version      Configuration/Remarks
Open MPI   git-0dc2325  --with-ucx=...
MPICH      git-256dea   --with-device=ch4:ucx
UCX        1.8.0        --enable-mt
GCC        7.3.0        site installation
OmpSs-2    2019.11.2    --enable-ompss-2, incl. patches mitigating task release performance issue
Argobots   1.0rc2       none
Clang      git-72edb79  incl. fix D79702 for detached tasks
PAPI       5.7.0        site installation

Figure 5: Non-blocking P2P latency in vanilla Open MPI, Open MPI with continuations implemented but not used, and with continuations used to handle the reply.

All measurements were conducted on Vulcan, a small production NEC Linux cluster employing dual-socket 12-core Intel Xeon E5-2680v3 (Haswell) nodes connected using Mellanox ConnectX-3 cards.⁸ The configurations of the used software are listed in Table 2. All data points represent the mean of at least five repetitions, with the standard deviation plotted as error bars in the graph.

5.1 Implementation

We have implemented a proof-of-concept (PoC) of continuations within Open MPI.⁹ The fast-path for regular requests without a registered continuation adds 12 instructions over vanilla Open MPI (+2% on a zero-byte message sent and received by the same process, measured using PAPI). Registration and invocation of an empty continuation requires 300 instructions in this scenario. This is reflected in Figure 5, which shows the non-blocking point-to-point latency measured for two processes on two different nodes, comparing the Open MPI master branch against the PoC using traditional non-blocking send/recv followed by a wait as well as the use of continuations to send or receive the reply. There is virtually no difference between vanilla Open MPI and the PoC for non-blocking communication and a slight uptick in latency when using continuations (about 4% for single-byte messages). The difference vanishes for larger messages.

5.2 OSU Latency with Argobots

In order to directly compare the continuations interface with the Argobots integration, we ported the multi-threaded OSU latency P2P benchmark (osu_latency_mt) to use Argobots instead of POSIX threads with blocking communication.¹⁰ The code used to block and unblock Argobots fibers when using continuations with non-blocking communication is listed in Listing 5.

/* cont_req is the benchmark's global continuation request */
int unblock_fiber(MPI_Status *status, void *data);

void block_fiber(fiber_state_t *fs, MPI_Request *req) {
  int flag;
  ABT_mutex_lock(fs->mtx);
  MPIX_Continue(req, &flag, &unblock_fiber, fs,
                MPI_STATUS_IGNORE, cont_req);
  if (!flag) ABT_cond_wait(fs->cond, fs->mtx);
  ABT_mutex_unlock(fs->mtx);
}

int unblock_fiber(MPI_Status *status, void *data) {
  fiber_state_t *fs = (fiber_state_t *)data;
  ABT_mutex_lock(fs->mtx);
  ABT_cond_signal(fs->cond);
  ABT_mutex_unlock(fs->mtx);
  return MPI_SUCCESS;
}

Listing 5: Code to block and unblock an Argobot fiber. The fiber_state_t structure contains a fiber-specific mutex and conditional variable. Locking the mutex in unblock_fiber is necessary to prevent signals from getting lost.

The current Argobots integration in Open MPI does not employ Open MPI's synchronization mechanisms and instead relies on yielding inside the progress engine. We thus had to configure the UCX PML to trigger the global progress engine unconditionally by passing --mca pml_ucx_progress_iterations 1 to mpirun as otherwise the latencies would have been an order of magnitude higher. We have achieved similar latencies with a test-yield cycle inside the benchmark instead of using blocking communication.

The results are depicted in Figure 6 with different combinations of worker threads (execution streams) and fibers. With a single thread and a single fiber (Figure 6a), the overhead of using conditional variables induces higher latencies than the yielding in Open MPI's Argobots integration, leading to 23% higher latencies for small messages. This effect vanishes for larger messages starting at 32 KB. However, with 12 worker threads and one fiber per thread (Figure 6b), the use of conditional variables with continuations yields more than 2× lower latency than the yielding in Open MPI and 3× lower latency than MPICH for small messages.

Scaling the number of fibers shown in Figure 6c and Figure 6d follows a similar trend: while for a single worker thread using yield seems to be more efficient, the use of conditional variables is more efficient with 12 worker threads. In contrast to the MPI implementation, the application has full knowledge of its configuration and can choose between yielding and blocking fibers.

⁸ Due to delays caused by the prevailing global COVID-19 situation, we are unable to report numbers on a larger system currently being commissioned at HLRS. However, early benchmarks indicated similar results to the ones presented here.

⁹ The PoC implementation is available at https://github.com/devreal/ompi/tree/mpi-continue-master (last accessed May 10, 2020).

¹⁰ The implementation of the benchmarks can be found at https://github.com/devreal/osu-abt-benchmarks (last accessed May 20, 2020).
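To relate Listing 5 back to the benchmark structure: a fiber replaces a blocking receive with a non-blocking one followed by block_fiber; the snippet below is an illustrative sketch of that pattern (buffer, peer, and tag are placeholders), not code taken from the benchmark.

void recv_blocking_fiber(fiber_state_t *fs, void *buf, int count,
                         int peer, int tag) {
  MPI_Request req;
  MPI_Irecv(buf, count, MPI_BYTE, peer, tag, MPI_COMM_WORLD, &req);
  /* suspend this fiber until the continuation signals completion;
     the executing thread is free to run other fibers meanwhile */
  block_fiber(fs, &req);
}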

5.3 NPB BT-MZ

In this section, we demonstrate the use of MPI Continuations in the context of the NAS Parallel Benchmark application BT-MZ, a multi-zone block tri-diagonal solver for the unsteady, compressible Navier-Stokes equations on a three-dimensional mesh with two-dimensional domain decomposition [11]. Zones of different sizes are distributed across processes based on a static load-balancing scheme. In each timestep and for each local zone, five computational steps are involved: forming the right-hand side, solving block-diagonal systems in each dimension x, y, and z, and updating the solution.

Table 3: Problem sizes in the NPB BT-MZ benchmark [38].

Class   Zones     Iterations   Grid Size            Memory
C       16 × 16   200          480 × 320 × 28       0.8 GB
D       32 × 32   250          1632 × 1216 × 34     12.8 GB

We use two representative problem sizes—classes C and D—for which the input configurations are listed in Table 3. The reference implementation in Fortran uses OpenMP work-sharing constructs to parallelize nested loops during the updates to all local zones before collecting and exchanging all boundary data at the end of each timestep. OpenMP parallelization is done over the outermost loop, which in most cases is the smallest dimension z, with the notable exception of the solution in the z dimension itself.

We have ported the Fortran version to C++ and implemented two variations using task-based concurrency.¹¹ The first variation uses tasks to overlap the computation of zones within a timestep in a fork-join approach in between boundary updates, replacing OpenMP parallel loops with loops generating tasks, with their execution coordinated using dependencies. The second variant extends this to also perform the boundary exchange inside tasks, including the necessary MPI communication, effectively minimizing the coupling between zones to the contact point dependencies.

The second task-based variant requires some support from the tasking library to properly handle communicating tasks. We use TAMPI in combination with OmpSs-2 as well as detached tasks in OpenMP available from Clang in its current trunk.¹² We have attempted to use taskyield within OpenMP tasks using the Clang implementation, which resulted in deadlocks due to the restricted semantics of task-yield in OpenMP [50]. The implementation using detached tasks spawns a progress thread that tests the continuation request before calling usleep to yield the processor.

Given the limited concurrency in the OpenMP loops, we present the minimum of the average runtime of 5 repetitions achieved using 1, 2, or 4 processes per node (24, 12, and 6 threads, respectively). All codes have been compiled with the flags -O2 -march=haswell -mcmodel=medium. The C++ variants were compiled using the Clang compiler while the Fortran variant was compiled using GCC's gfortran. OpenMP thread affinity was enabled (OMP_PROC_BIND=true). Process binding has been achieved using Open MPI's --map-by node:PE=$nthreads --bind-to core arguments to mpirun.

Figure 6: Argobots port of the osu_latency_mt P2P benchmark with different numbers of threads and fibers. Panels: (a) 1 thread, 1 fiber; (b) 12 threads, 12 fibers; (c) fiber scaling (1 thread, single-byte message); (d) fiber scaling (12 threads, single-byte message). The legend compares the continuations port (Continue ABT) against the Argobots integrations in Open MPI (OMPI ABT) and MPICH (MPICH ABT) for varying numbers of execution streams (ES) and fibers (F).

Figure 7 provides results for classes C and D. Most notably, the task-based variants with fine-grained MPI communication (OpenMP detached tasks, OmpSs-2) outperform the variants using OpenMP loop-based worksharing and bulk communication by at least 15% (C) or 24% (D) at scale. It is also notable that the OmpSs-2 variants yield by far the highest performance, even though their task structure is similar to that using OpenMP detached tasks. We attribute this to potential degradation of parallelism in Clang's libomp if tasks waiting in taskwait constructs (required to wait for nested tasks) are stacked up on the same thread, similar to issues reported for the taskyield directive [50].¹³ The usage of MPI Continuations with OmpSs-2 provides a slight improvement of about 1.5% over TAMPI for class D, suggesting a more efficient communication management.

¹¹ The different variants of NPB BT-MZ discussed here are available at https://github.com/devreal/npb3.3 (last accessed May 20, 2020).

¹² We attempted to use BOLT but in doing so uncovered a design flaw that stems from the current thread/fiber ambiguity and invalidates both the use of Argobots conditional variables as depicted in Listing 5 and any Argobots MPI integration. An analysis can be found at https://github.com/pmodels/bolt/issues/51 (last accessed May 14, 2020).

¹³ We believe that the use of untied tasks may improve the situation by avoiding the stacking but a long-standing issue in Clang's libomp prevents their use: https://bugs.llvm.org/show_bug.cgi?id=37671 (last accessed May 19, 2020).
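The OpenMP variant described above couples detached tasks with a continuation that fulfills the task's allow-completion event; the sketch below illustrates this pattern. The field_t type, the message parameters, and the cast of the event handle to a pointer-sized value are illustrative assumptions, and a progress thread is assumed to test cont_req regularly.

#include <mpi.h>
#include <omp.h>
#include <stdint.h>

/* illustrative field type */
typedef struct { double *sendbuf, *recvbuf; int count, peer; } field_t;

/* continuation callback: fulfilling the event completes the detached task */
void task_done(MPI_Status *statuses, void *cb_data) {
  omp_fulfill_event((omp_event_handle_t)(uintptr_t)cb_data);
}

void exchange_boundary(field_t *field, MPI_Request cont_req) {
  omp_event_handle_t ev;
  #pragma omp task detach(ev) depend(inout: field[0])
  {
    int flag;
    MPI_Request reqs[2];
    MPI_Isend(field->sendbuf, field->count, MPI_DOUBLE,
              field->peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(field->recvbuf, field->count, MPI_DOUBLE,
              field->peer, 0, MPI_COMM_WORLD, &reqs[1]);
    /* the task body ends here, but its completion (and the release of
       its dependencies) is deferred until the event is fulfilled */
    MPIX_Continueall(2, reqs, &flag, &task_done,
                     (void *)(uintptr_t)ev, MPI_STATUSES_IGNORE, cont_req);
    if (flag) omp_fulfill_event(ev); /* operations already complete */
  }
}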

Figure 7: NPB BT-MZ benchmark results. Speedups are relative to the C++ port. Panels: (a) class C runtime; (b) class C speedup; (c) class D runtime; (d) class D speedup. Compared variants: MPI/OMP [FTN], MPI/OMP [C++], MPI/OMP Tasks, MPI/OMP Detached Tasks, TAMPI/OmpSs-2, and MPI/OmpSs-2 Continue.

6 RELATED WORK

Multiple efforts have been made in the past to improve specific aspects of the interaction between multi-threaded programming models and MPI [12, 17, 22].

Extended generalized requests have been proposed to encapsulate operations within requests and allow the MPI implementation to progress them through a callback [29]. Similar to the interface proposed here, extended generalized requests provide inversion of control (IoC) [33]: application-level code is called by the MPI library in inversion of the usual call relation. However, these requests are still polling-based and cannot be used for completion notification. IoC is also used in the upcoming MPI_T events interface, allowing applications and tools to be notified through callbacks about events from within the MPI library [19]. However, the availability of events is implementation-specific and callbacks can only be registered for event types, as opposed to specific operations. The MPI_T interface has been used to structure the interaction between MPI and task-based programming models, still requiring the mapping between MPI requests and task-information in the application layer [8].

Continuations are not a new concept in the context of low-level communication libraries: UCX allows passing callback notification functions when initiating a non-blocking operation [58]. In contrast, other interfaces such as libfabric [39], Portals [5], and uGNI [42] rely on completion queues from which notifications are popped. While this would be a viable concept in MPI as well, we believe that continuations provide greater flexibility at the application level.

Another alternative would be a system similar to scheduler activations in OS kernels [2], where MPI would signal to some application-level entity that the current thread or fiber would block inside a call. This has been proposed for the OpenShmem standard [60]. However, its use is limited to tasking models that have strong rescheduling guarantees and would not work with OpenMP detached tasks.

Hardware-triggered operations are being used to react to events in the network with low latency and to offload computation to the network interface card [23, 25, 47]. However, in-network computing is focused on operations on the data stream and likely cannot be utilized to execute arbitrary code in user-space, e.g., call into fiber libraries. Exploring the boundaries of integrating these systems with the interface proposed here remains as future work.

The integration of fiber libraries into MPI has been proposed as a replacement for POSIX thread support to enable the use of blocking MPI calls inside tasks and fibers [32]. This has been discussed in Section 3. Other proposals provide runtime-specific wrapper libraries around MPI, transforming blocking MPI calls to non-blocking calls and testing for completion before releasing a fiber or task [45, 46, 53]. These proposals fall short of providing a portable interface that can be used with arbitrary asynchronous programming models.

A proposal for completion callbacks for MPI request objects has been discussed in the context of the MPI Forum over a decade ago and rejected due to its broad focus [21]. In contrast, the proposal outlined here avoids some of the pitfalls such as callback contention, ownership of requests, and unclear progress semantics.

The interface proposed here is especially well-suited for node-local asynchronous programming models such as OpenMP [40], OmpSs-2 [9], and Intel TBB [43] or in combination with cooperatively scheduled fiber libraries such as Argobots [51] or Qthreads [61] to ease the communication management burden users are facing. However, it may also be useful in distributed runtime systems such as PaRSEC [7], HPX [27], StarPU [3], and DASH [49] to simplify internal communication handling. An evaluation of the integration with runtime systems remains as future work.

7 CONCLUSIONS

This paper has made the argument that the integration of fiber libraries into MPI implementations for the sake of application programmability is neither warranted by the MPI standard nor beneficial in terms of maintaining an ecosystem of portable applications. In order to tackle the challenges of asynchronous activities communicating through MPI, we have proposed an extension to the MPI standard that allows users to register continuations with active operations, which will be invoked once the MPI library determines the operation to be complete. We have shown that this interface is easy to use and flexible enough to accommodate three different tasking and fiber models (OpenMP detached tasks, OmpSs-2, and Argobots). The overhead of our proof-of-concept implementation is sufficiently low to not impact existing applications and its use shows similar or improved performance over existing approaches.

The interface has been designed with higher-level abstractions such as C++ futures for MPI in mind but it remains as future work to provide an actual implementation on top of MPI Continuations. It is also an open point of discussion whether it is desirable to allow users to limit the context in which certain continuations may be invoked to mitigate the impact of expensive continuations on latency-sensitive parts of the application.

ACKNOWLEDGMENTS

This research was in parts funded by the German Research Foundation (DFG) through the German Priority Programme 1648 Software for Exascale Computing (SPPEXA) in the SmartDASH project and through financial support by the Federal Ministry of Education and Research, Germany, grant number 01IH16008B (project TaLPas). The authors would like to thank Vicenç Beltran Querol and Kevin Sala at Barcelona Supercomputing Center for their support with OmpSs-2 and Joachim Protze for his support with Clang's libomp.

REFERENCES

[1] Atul Adya, Jon Howell, Marvin Theimer, William J. Bolosky, and John R. Douceur. 2002. Cooperative Task Management Without Manual Stack Management. In

[1] Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference (ATEC ’02). USENIX Association. https://doi.org/10.5555/647057.713851
[2] Thomas E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. 1991. Scheduler Activations: Effective Kernel Support for the User-Level Management of Parallelism. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (SOSP ’91). Association for Computing Machinery. https://doi.org/10.1145/121132.121151
[3] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice and Experience 23, 2 (Feb. 2011), 187–198. https://doi.org/10.1002/cpe.1631
[4] Purushotham V. Bangalore, Rolf Rabenseifner, Daniel J. Holmes, Julien Jaeger, Guillaume Mercier, Claudia Blaas-Schenner, and Anthony Skjellum. 2019. Exposition, Clarification, and Expansion of MPI Semantic Terms and Conventions: Is a Nonblocking MPI Function Permitted to Block?. In Proceedings of the 26th European MPI Users’ Group Meeting (Zürich, Switzerland) (EuroMPI ’19). Association for Computing Machinery, New York, NY, USA, Article 2, 10 pages. https://doi.org/10.1145/3343211.3343213
[5] Brian W. Barrett, Ron Brightwell, Ryan E. Grant, Scott Hemmert, Kevin Pedretti, Kyle Wheeler, Keith Underwood, Rolf Riesen, Torsten Hoefler, Arthur B. Maccabe, and Trammell Hudson. 2018. The Portals 4.2 Network Programming Interface. Technical Report SAND2018-12790. Sandia National Laboratories.
[6] David E. Bernholdt, Swen Boehm, George Bosilca, Manjunath Gorentla Venkata, Ryan E. Grant, Thomas Naughton, Howard P. Pritchard, Martin Schulz, and Geoffroy R. Vallee. 2018. A Survey of MPI Usage in the US Exascale Computing Project. Concurrency and Computation: Practice and Experience (2018). https://doi.org/10.1002/cpe.4851
[7] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra. 2011. DAGuE: A Generic Distributed DAG Engine for High Performance Computing. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 1151–1158. https://doi.org/10.1109/IPDPS.2011.281
[8] Emilio Castillo, Nikhil Jain, Marc Casas, Miquel Moreto, Martin Schulz, Ramon Beivide, Mateo Valero, and Abhinav Bhatele. 2019. Optimizing Computation-Communication Overlap in Asynchronous Task-Based Programs. In Proceedings of the ACM International Conference on Supercomputing (ICS ’19). Association for Computing Machinery. https://doi.org/10.1145/3330345.3330379
[9] Barcelona Supercomputing Center. 2020. OmpSs-2 Specification. Technical Report. https://pm.bsc.es/ftp/ompss-2/doc/spec/OmpSs-2-Specification.pdf Last accessed May 12, 2020.
[10] Gregor Daiß, Parsa Amini, John Biddiscombe, Patrick Diehl, Juhan Frank, Kevin Huck, Hartmut Kaiser, Dominic Marcello, David Pfander, and Dirk Pflüger. 2019. From Piz Daint to the Stars: Simulation of Stellar Mergers Using High-Level Abstractions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19). https://doi.org/10.1145/3295500.3356221
[11] Rob F. Van der Wijngaart and Haoqiang Jin. 2003. NAS Parallel Benchmarks, Multi-Zone Versions. Technical Report NAS-03-010. NASA Advanced Supercomputing (NAS) Division.
[12] James Dinan, Pavan Balaji, David Goodell, Douglas Miller, Marc Snir, and Rajeev Thakur. 2013. Enabling MPI Interoperability through Flexible Communication Endpoints. In Proceedings of the 20th European MPI Users’ Group Meeting (EuroMPI ’13). Association for Computing Machinery. https://doi.org/10.1145/2488551.2488553
[13] Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesus Labarta, Luis Martinell, Xavier Martorell, and Judit Planas. 2011. OmpSs: A proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters 21, 02 (2011), 173–193. https://doi.org/10.1142/S0129626411000151
[14] Ralf S. Engelschall. 2006. GNU Pth—The GNU Portable Threads. Technical Report. https://www.gnu.org/software/pth/pth-manual.html Last accessed April 24, 2020.
[15] Daniel P. Friedman, Christopher T. Haynes, and Eugene Kohlbecker. 1984. Programming with Continuations. In Program Transformation and Programming Environments, Peter Pepper (Ed.). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-46490-4_23
[16] Nat Goodspeed and Oliver Kowalke. 2014. Distinguishing coroutines and fibers. Technical Report N4024. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4024.pdf Last accessed April 24, 2020.
[17] Ryan E. Grant, Matthew G. F. Dosanjh, Michael J. Levenhagen, Ron Brightwell, and Anthony Skjellum. 2019. Finepoints: Partitioned Multithreaded MPI Communication. In High Performance Computing, Michèle Weiland, Guido Juckeland, Carsten Trinitis, and Ponnuswamy Sadayappan (Eds.). Springer International Publishing, Cham, 330–350.
[18] Niklas Gustafsson, Artur Laksberg, Herb Sutter, and Sana Mithani. 2014. N3857: Improvements to std::future and Related APIs. Technical Report N3857.
[19] Marc-André Hermanns, Nathan T. Hjelm, Michael Knobloch, Kathryn Mohror, and Martin Schulz. 2019. The MPI_T events interface: An early evaluation and overview of the interface. Parallel Comput. (2019). https://doi.org/10.1016/j.parco.2018.12.006
[20] Nathan Hjelm, Matthew G. F. Dosanjh, Ryan E. Grant, Taylor Groves, Patrick Bridges, and Dorian Arnold. 2018. Improving MPI Multi-threaded RMA Communication Performance. In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, 58:1–58:11. https://doi.org/10.1145/3225058.3225114
[21] Torsten Hoefler. 2008. Request Completion Callback Function. MPI Forum Discussion. Archived at https://github.com/mpi-forum/mpi-forum-historic/issues/26, last accessed July 6, 2020.
[22] Torsten Hoefler, Greg Bronevetsky, Brian Barrett, Bronis R. de Supinski, and Andrew Lumsdaine. 2010. Efficient MPI Support for Advanced Hybrid Programming Models. In Recent Advances in the Message Passing Interface, Rainer Keller, Edgar Gabriel, Michael Resch, and Jack Dongarra (Eds.). Springer Berlin Heidelberg.
[23] Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, and Ron Brightwell. 2017. sPIN: High-performance Streaming Processing In the Network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’17). ACM, 59:1–59:16. https://doi.org/10.1145/3126908.3126970
[24] IEEE and The Open Group. 2018. The Open Group Base Specifications Issue 7, IEEE Std 1003.1-2017. IEEE. https://pubs.opengroup.org/onlinepubs/9699919799/ Last accessed April 24, 2020.
[25] Nusrat Sharmin Islam, Gengbin Zheng, Sayantan Sur, Akhil Langer, and Maria Garzaran. 2019. Minimizing the Usage of Hardware Counters for Collective Communication Using Triggered Operations. In Proceedings of the 26th European MPI Users’ Group Meeting (Zürich, Switzerland) (EuroMPI ’19). Association for Computing Machinery. https://doi.org/10.1145/3343211.3343222
[26] S. Iwasaki, A. Amer, K. Taura, S. Seo, and P. Balaji. 2019. BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads. In 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT). 29–42.
[27] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio, and Dietmar Fey. 2014. HPX: A Task Based Programming Model in a Global Address Space. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS ’14). ACM, 6:1–6:11. https://doi.org/10.1145/2676870.2676883
[28] Oliver Kowalke. 2013. Boost.Fiber – Overview. https://www.boost.org/doc/libs/1_72_0/libs/fiber/doc/html/fiber/overview.html Last accessed April 24, 2020.
[29] Robert Latham, William Gropp, Robert Ross, and Rajeev Thakur. 2007. Extending the MPI-2 Generalized Request Interface. In Proceedings of the 14th European Conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface (PVM/MPI ’07). Springer-Verlag.
[30] Linux man-pages project 2019. makecontext, swapcontext – manipulate user context. Linux man-pages project. http://man7.org/linux/man-pages/man3/makecontext.3.html Last accessed April 24, 2020.
[31] Linux man-pages project 2019. pthreads – POSIX threads. Linux man-pages project. http://man7.org/linux/man-pages/man7/pthreads.7.html Last accessed April 24, 2020.
[32] H. Lu, S. Seo, and P. Balaji. 2015. MPI+ULT: Overlapping Communication and Computation with User-Level Threads. In 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems. 444–454. https://doi.org/10.1109/HPCC-CSS-ICESS.2015.82
[33] Michael Mattsson. 1996. Object-Oriented Frameworks - A survey of methodological issues. Licentiate thesis. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.36.1424&rep=rep1&type=pdf
[34] Guillaume Mercier, François Trahay, Darius Buntinas, and Elisabeth Brunet. 2009. NewMadeleine: An Efficient Support for High-Performance Networks in MPICH2. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS).
[35] Microsoft 2018. About Processes and Threads. Microsoft. https://docs.microsoft.com/en-us/windows/win32/procthread/about-processes-and-threads Last accessed April 24, 2020.
[36] Microsoft 2018. Fibers. Microsoft. https://docs.microsoft.com/en-us/windows/win32/procthread/fibers Last accessed April 24, 2020.
[37] MPI v3.1 2015. MPI: A Message-Passing Interface Standard. Technical Report. http://mpi-forum.org/docs/mpi-3.1/mpi31-report.pdf Last accessed April 24, 2020.
[38] NASA Advanced Supercomputing Division. [n.d.]. Problem Sizes and Parameters in NAS Parallel Benchmarks. https://www.nas.nasa.gov/publications/npb_problem_sizes.html
[39] OpenFabrics Interfaces Working Group. 2017. High Performance Network Programming with OFI. https://github.com/ofiwg/ofi-guide/blob/master/OFIGuide.md Last accessed May 14, 2020.
[40] OpenMP Architecture Review Board 2018. OpenMP Application Programming Interface, Version 5.0. OpenMP Architecture Review Board. https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5.0.pdf Last accessed April 24, 2020.
[41] T. Patinyasakdikul, D. Eberius, G. Bosilca, and N. Hjelm. 2019. Give MPI Threading a Fair Chance: A Study of Multithreaded MPI Designs. In 2019 IEEE International Conference on Cluster Computing (CLUSTER). 1–11. https://doi.org/10.1109/CLUSTER.2019.8891015
[42] Howard Pritchard, Igor Gorodetsky, and Darius Buntinas. 2011. A uGNI-Based MPICH2 Nemesis Network Module for the Cray XE. In Recent Advances in the Message Passing Interface, Yiannis Cotronis, Anthony Danalis, Dimitrios S. Nikolopoulos, and Jack Dongarra (Eds.). Springer Berlin Heidelberg.
[43] James Reinders. 2007. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O’Reilly & Associates.
[44] John C. Reynolds. 1993. The discoveries of continuations. LISP and Symbolic Computation (Nov. 1993). https://doi.org/10.1007/BF01019459
[45] Kevin Sala, Jorge Bellón, Pau Farré, Xavier Teruel, Josep M. Perez, Antonio J. Peña, Daniel Holmes, Vicenç Beltran, and Jesus Labarta. 2018. Improving the Interoperability between MPI and Task-Based Programming Models. In Proceedings of the 25th European MPI Users’ Group Meeting (EuroMPI ’18). Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3236367.3236382
[46] Kevin Sala, Xavier Teruel, Josep M. Perez, Antonio J. Peña, Vicenç Beltran, and Jesus Labarta. 2019. Integrating blocking and non-blocking MPI primitives with task-based programming models. Parallel Comput. 85 (Jul 2019), 153–166. https://doi.org/10.1016/j.parco.2018.12.008
[47] Whit Schonbein, Ryan E. Grant, Matthew G. F. Dosanjh, and Dorian Arnold. 2019. INCA: In-Network Compute Assistance. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19). Association for Computing Machinery. https://doi.org/10.1145/3295500.3356153
[48] J. Schuchart, A. Bouteiller, and G. Bosilca. 2019. Using MPI-3 RMA for Active Messages. In 2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI). 47–56.
[49] Joseph Schuchart and José Gracia. 2019. Global Task Data-Dependencies in PGAS Applications. In High Performance Computing, Michèle Weiland, Guido Juckeland, Carsten Trinitis, and Ponnuswamy Sadayappan (Eds.). Springer International Publishing.
[50] Joseph Schuchart, Keisuke Tsugane, José Gracia, and Mitsuhisa Sato. 2018. The Impact of Taskyield on the Design of Tasks Communicating Through MPI. In Evolving OpenMP for Evolving Architectures, Bronis R. de Supinski, Pedro Valero-Lara, Xavier Martorell, Sergi Mateo Bellido, and Jesus Labarta (Eds.). Springer International Publishing, 3–17. https://doi.org/10.1007/978-3-319-98521-3_1 Awarded Best Paper.
[51] S. Seo, A. Amer, P. Balaji, C. Bordage, G. Bosilca, A. Brooks, P. Carns, A. Castelló, D. Genet, T. Herault, S. Iwasaki, P. Jindal, L. V. Kalé, S. Krishnamoorthy, J. Lifflander, H. Lu, E. Meneses, M. Snir, Y. Sun, K. Taura, and P. Beckman. 2018. Argobots: A Lightweight Low-Level Threading and Tasking Framework. IEEE Transactions on Parallel and Distributed Systems 29, 3 (March 2018), 512–526. https://doi.org/10.1109/TPDS.2017.2766062
[52] Abraham Silberschatz, Peter B. Galvin, and Greg Gagne. 2014. Operating System Concepts (9 ed.). John Wiley and Sons.
[53] Dylan T. Stark, Richard F. Barrett, Ryan E. Grant, Stephen L. Olivier, Kevin T. Pedretti, and Courtenay T. Vaughan. 2014. Early Experiences Co-Scheduling Work and Communication Tasks for Hybrid MPI+X Applications. In Proceedings of the 2014 Workshop on Exascale MPI (ExaMPI ’14). IEEE Press. https://doi.org/10.1109/ExaMPI.2014.6
[54] Sun Microsystems, Inc 2002. Multithreading in the Solaris Operating Environment. Sun Microsystems, Inc. https://web.archive.org/web/20090226174929/http://www.sun.com/software/whitepapers/solaris9/multithread.pdf Last accessed April 24, 2020.
[55] The FreeBSD Project 2018. pthread – POSIX thread functions. The FreeBSD Project. https://www.freebsd.org/cgi/man.cgi?query=pthread Last accessed April 24, 2020.
[56] The NetBSD project 2016. pthread – POSIX Threads Library. The NetBSD project. https://netbsd.gw.com/cgi-bin/man-cgi?pthread+3.i386+NetBSD-8.0 Last accessed April 24, 2020.
[57] Ted Unangst, Kurt Miller, Marco S Hyman, Otto Moerbeek, and Philip Guenther. 2019. pthreads – POSIX 1003.1c thread interface. The OpenBSD project. https://man.openbsd.org/pthreads.3 Last accessed April 24, 2020.
[58] Unified Communication Framework Consortium. 2019. UCX: Unified Communication X API Standard v1.6. Unified Communication Framework Consortium. https://github.com/openucx/ucx/wiki/api-doc/v1.6/ucx-v1.6.pdf
[59] Uresh Vahalia. 1998. UNIX Internals: The New Frontiers. Pearson.
[60] Md. Wasi-ur Rahman, David Ozog, and James Dinan. 2020. Simplifying Communication Overlap in OpenSHMEM Through Integrated User-Level Thread Scheduling. In High Performance Computing, Ponnuswamy Sadayappan, Bradford L. Chamberlain, Guido Juckeland, and Hatem Ltaief (Eds.). Springer International Publishing.
[61] K. B. Wheeler, R. C. Murphy, and D. Thain. 2008. Qthreads: An API for programming with millions of lightweight threads. In 2008 IEEE International Symposium on Parallel and Distributed Processing. https://doi.org/10.1109/IPDPS.2008.4536359