Special Issue Article

Enabling communication concurrency through flexible MPI endpoints

The International Journal of High Performance Computing Applications 2014, Vol. 28(4) 390–405. © The Author(s) 2014. DOI: 10.1177/1094342014548772. hpc.sagepub.com

James Dinan1, Ryan E Grant2, Pavan Balaji3, David Goodell4, Douglas Miller5, Marc Snir3 and Rajeev Thakur3

Abstract
MPI defines a one-to-one relationship between MPI processes and ranks. This model captures many use cases effectively; however, it also limits communication concurrency and interoperability between MPI and programming models that utilize threads. This paper describes the MPI endpoints extension, which relaxes the longstanding one-to-one relationship between MPI processes and ranks. Using endpoints, an MPI implementation can map separate communication contexts to threads, allowing them to drive communication independently. Endpoints also enable threads to be addressable in MPI operations, enhancing interoperability between MPI and other programming models. These characteristics are illustrated through several examples and an empirical study that contrasts current multithreaded communication performance with the need for high degrees of communication concurrency to achieve peak communication performance.

Keywords
MPI, endpoints, hybrid parallel programming, interoperability, communication concurrency

1 Intel Corporation, Hudson, MA, USA
2 Sandia National Laboratories, Albuquerque, NM, USA
3 Argonne National Laboratory, Lemont, IL, USA
4 Cisco Systems Incorporated, San Jose, CA, USA
5 International Business Machines Corporation, Rochester, MN, USA

Corresponding author: James Dinan, Intel Corporation, 75 Reed, Hudson, MA 01749, USA. Email: [email protected]

1. Introduction

Hybrid parallel programming in the "MPI+X" model has become the norm in high-performance computing. This approach to parallel programming mirrors the hierarchy of parallelism in current high-performance systems, in which a high-speed interconnect joins many highly parallel nodes. While MPI is effective at managing internode parallelism, alternative data-parallel, fork-join, and offload models are needed to utilize current and future highly parallel nodes effectively.

In the MPI+X programming model, multiple cores are utilized by a single MPI process with a shared MPI rank. As a result, communication for all cores in the MPI process is effectively funneled through a single MPI address and its corresponding communication context. As processor core counts have increased, high-speed interconnects have evolved to provide greater resources to support concurrent communications with multiple cores. For such networks, a growing number of cores must be used in order to realize peak performance (Underwood et al., 2007; Blagojević et al., 2010; Dózsa et al., 2010; Barrett et al., 2013). This situation is at odds with conventional hybrid programming techniques, where the node is partitioned by several MPI processes and communications are funneled through a small fraction of the cores.

Interoperability between MPI and other parallel programming systems has long been a productivity and composability goal in the parallel programming community. The widespread adoption of MPI+X parallel programming has put additional pressure on the community to produce a solution that enables full interoperability between MPI and system-level programming models, such as X10, Chapel, Charm++, UPC, and Coarray Fortran, as well as node-level programming models such as OpenMP*, threads, and Intel® TBB. A key challenge to interoperability is the ability to generate additional MPI ranks that can be assigned to threads used in the execution of such models.

The MPI 3.0 standard resolved several issues affecting hybrid parallel programming with MPI and threads, but it did not include any new mechanisms to address these foundational concurrency and interoperability challenges. In current multithreaded MPI programs, the programmer must use either tags or communicators to distinguish communication operations between individual threads. However, both approaches have significant limitations. When tags are used, it is not possible for multiple threads sharing the rank of an MPI process to participate in collectives. In addition, when multiple threads perform wildcard receive operations, matching is nondeterministic. Using multiple communicators can sidestep some of these restrictions, but at the expense of partitioning threads into individual communicators, where only one thread per parent process can be present in each new communicator.

Several solutions to the communication concurrency challenge have been explored, including internally parallelizing MPI processing (Kumar et al., 2012; Tanase et al., 2012) and modifying the networking layer to enable greater concurrency for threads using the current MPI interface (Luo et al., 2011). However, these approaches can require customized thread-management techniques to avoid overheads and noise from oversubscribing cores, and they are not able to address the programmability challenges that arise when MPI threads share a rank.

In this paper we present an MPI extension, called "MPI endpoints", that enables the programmer to create additional ranks at existing MPI processes. We explore the design space of MPI endpoints and the impact of endpoints on MPI implementations. We illustrate through several examples that endpoints address the problem of interoperability between MPI and parallel programming systems that utilize threads. Additionally, we explore how endpoints enrich the MPI model by relaxing the one-to-one relationship between ranks and MPI processes. We explore this new potential through a brief example that utilizes endpoints to achieve communication-preserving load balancing by mapping work (e.g. mesh tiles) to ranks and reassigning multiple ranks to processes. We also conduct an empirical study of communication concurrency in a modern many-core system. Results confirm that multiple cores are required to drive the interconnect to its full performance. They further suggest that private, rather than shared, communication endpoints may be necessary to achieve high levels of communication efficiency.

2. Background

The design of MPI communicators has been vigorously debated within the MPI community, and several approaches to increasing the generality of MPI communicators have been suggested (Geist et al., 1996; Demaine et al., 2001; Graham and Keller, 2009). Led by the authors, members of the MPI Forum have rekindled this discussion in the context of the MPI endpoints extension that is proposed for MPI 4.0 (see Note 1). Initially, static interfaces were explored, and we present these designs in Section 3.1. The dynamic interface, presented in Section 3.2, is preferred as a more flexible alternative to the static interface, and it was refined as described in Section 3.5 into a proposal that is currently under consideration for inclusion in version 4.0 of the MPI standard.

MPI interoperability has been investigated extensively in the context of a variety of parallel programming models (Jose et al., 2010, 2012; Yang et al., 2014). Interoperability between MPI and Unified Parallel C was defined in terms of one-to-one and many-to-one mappings of UPC threads to MPI ranks (Dinan et al., 2010). Support for the one-to-one mapping cannot be provided in MPI 3.0 when the Unified Parallel C (UPC) implementation utilizes operating system threads, rather than processes, to implement UPC threads. However, this mode of operation can be supported through the endpoints interface by assigning endpoint ranks to threads.

Hybrid parallel programming, referred to as "MPI+X", which combines MPI and a node-level parallel programming model, has become commonplace. MPI is often combined with multithreaded parallel programming models, such as MPI+OpenMP (Smith and Bull, 2001; Rabenseifner et al., 2009). MPI 2.0 (Message Passing Interface Forum, 1997) defined MPI's interaction with threads in terms of several levels of threading support that can be provided by the MPI library. MPI 3.0 further clarified the interaction between several MPI constructs and threads. For example, matched-probe operations were added to enable deterministic use of MPI_Probe when multiple threads share an MPI rank. In addition, MPI 3.0 added support for interprocess shared memory through the Remote Memory Access (RMA) interface (Hoefler et al., 2012, 2013).

Recently, researchers have endeavored to integrate MPI and accelerator programming models. This effort has focused on the impact of separate accelerator memory on communication operations. Several approaches to supporting the use of NVIDIA* CUDA* or OpenCL* buffers directly in MPI operations (Wang et al., 2011; Ji et al., 2012) have been developed. Other efforts have focused on enabling accelerator cores to perform MPI calls directly (Stuart et al., 2011).

Numerous efforts have been made to integrate node-level parallelism with MPI. Fine-Grain MPI (FG-MPI) (Kamal and Wagner, 2010) implements MPI processes as lightweight coroutines instead of operating-system processes, enabling each coroutine to have its own MPI rank. In order to support fine-grain coroutines, FG-MPI provides mechanisms to create additional ranks when MPI is initialized, similar to the approach described by the static endpoints interface. FG-MPI programs have been run with as many as 100 million MPI ranks using this technique (University of British Columbia, 2013). Li et al. (2013) recently demonstrated significant performance improvements in collective communication when the node is not partitioned into individual MPI processes; that is, endpoints are used instead of one process per core. Hybrid MPI (HMPI) transmits ownership for shared buffers to improve the performance of intranode communication (Friedley et al., 2013); techniques such as this can be leveraged to optimize intranode communication between endpoints.

3. Communication endpoints

We define an MPI endpoint as a set of resources that supports the independent execution of MPI communications. The central resource that backs an MPI endpoint is a rank within an MPI communicator, which identifies the endpoint in MPI operations. In support of this rank, the MPI implementation may utilize additional resources such as network command queues, network addresses, or message queues. One or more threads are logically associated with an endpoint, after which the thread can perform MPI operations using the resources of the endpoint. In this context, a conventional MPI process can be thought of as an MPI endpoint and a set of threads that perform operations using that endpoint.

The current MPI standard defines a one-to-one mapping between endpoints and MPI processes. As shown in Figure 1, MPI endpoints relax this restriction by allowing an MPI process to contain multiple endpoints. A variety of mechanisms could be used to incorporate endpoints into MPI. We break down the design space into static versus dynamic approaches, and further discuss how and when additional ranks are created, how ranks are associated with endpoints, and how threads are mapped to endpoints.

3.1. Static endpoint creation

During initialization, MPI creates the MPI_COMM_WORLD communicator, which contains all MPI processes. Static approaches to the endpoints interface make endpoints members of this communicator and require endpoints to be created during launching or initialization of an MPI execution. In order to use endpoints created in the MPI_COMM_WORLD communicator, threads must first be associated with an endpoint through an attach function, described in Section 3.3.

In one approach to supporting static endpoints, the programmer requests additional endpoints as an argument to mpiexec, as in Dinan et al. (2010) and Kamal and Wagner (2010). This approach can require MPI_Init or MPI_Init_thread to be extended so that it can be called multiple times per instance of the MPI library, once for each endpoint. This is perceived to be a disruptive change to the existing MPI specification, which states that multiple calls to MPI_Init are erroneous.

As an alternative to this approach, a new MPI initialization routine can be added that allows the programmer to call the MPI initialization routine once per operating system process and indicate the number of endpoints required by each of those processes:

    int MPI_Init_endpoints(int *argc, char *argv[], int count,
                           int tl_requested, int *tl_provided)

In this new initialization routine, argc and argv are the command line arguments, count indicates the desired number of endpoints at the calling process, tl_requested is the level of thread support requested by the user, and tl_provided is an output parameter indicating the level of thread support provided by the MPI library.

An advantage of the static endpoints scheme is that it can better align MPI implementations with systems where endpoint resources are reserved or created when MPI initializes the network, for example, in the case of implementations based on the IBM® Parallel Active Message Interface (PAMI) low-level communication library (Kumar et al., 2012). A significant drawback to this approach is that statically specifying the number of endpoints restricts the ability of MPI to interoperate fully with models where the number of threads is dynamic. In addition, statically selecting the number of endpoints can limit opportunities for libraries to utilize endpoints. To better support such use cases, we next describe an alternative approach that allows endpoints to be created dynamically.
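To make the static design concrete, the following sketch shows how a program might use the proposed initialization routine in place of MPI_Init_thread. MPI_Init_endpoints is part of the explored proposal rather than any released MPI standard, and the choice of four endpoints per process is illustrative only.

    #include <mpi.h>

    /* Sketch only: MPI_Init_endpoints is the proposed routine described
     * above, not an existing MPI function. Each process requests four
     * endpoints in MPI_COMM_WORLD and full multithreading support. */
    int main(int argc, char **argv)
    {
        int tl_provided;

        MPI_Init_endpoints(&argc, argv, 4 /* endpoints per process */,
                           MPI_THREAD_MULTIPLE, &tl_provided);

        /* Threads would subsequently attach to one of the four local
         * endpoints (see Section 3.3) before communicating. */

        MPI_Finalize();
        return 0;
    }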

Figure 1. Flexible communication endpoints extend MPI with a many-to-one mapping between ranks and processes.

3.2. Dynamic endpoint creation

MPI communicators provide a unique communication context that is used to perform matching between operations and to distinguish communication operations that arise from different components in a program or library. In addition to providing encapsulation, communicators contain a group that defines the processes that are members of the communicator and the mapping of ranks to these processes. When a communicator is created, the group of the new communicator is derived from the group of an existing parent communicator. In conventional communicator creation routines, the group of the new communicator is a subset of the group of its parent (see Note 2).

In the dynamic endpoints interface we extend the group component of communicators, such that multiple ranks in a new group can be associated with each rank in the parent group. An example of this interface is as follows:

    int MPI_Comm_create_endpoints(MPI_Comm parent_comm, int my_num_ep,
                                  MPI_Info info, MPI_Comm out_comm_hdls[])

In this collective call, a single output communicator is created, and an array of my_num_ep handles to this new communicator are returned, where the ith handle corresponds to the ith rank requested by the caller of MPI_Comm_create_endpoints. These ranks, or endpoints, of the output communicator are ordered sequentially and numbered by using the same relative order as the parent communicator. After it has been created, the output communicator behaves as a normal communicator, and MPI calls on each endpoint (i.e. communicator handle) behave as though they originated from a separate MPI process. In particular, collective calls must be performed concurrently on all ranks (i.e. endpoints) in the given communicator.

Freeing an endpoints communicator requires a collective call to MPI_Comm_free, where the free function is called once per endpoint. As currently defined, this could require MPI_Comm_free to be called concurrently on all endpoints, which in turn requires that the MPI implementation be initialized with MPI_THREAD_MULTIPLE support. To avoid this restriction, one can define MPI_Comm_free in a way that enables a single thread to call MPI_Comm_free separately for each endpoint. In order to support this semantic, the MPI implementation must internally aggregate endpoint-free requests and delay communication until the last endpoint associated with a given conventional process calls MPI_Comm_free on the given communicator. At this point, any communication needed can be performed by a subset of the endpoints in the communicator.

3.3. Associating threads with endpoints

In order to use an endpoint, a thread must associate its operations with a rank in the given communicator. Identifying the specific endpoint can be accomplished through explicit or implicit approaches. The design choice can have a significant impact on performance and interoperability with threading models.

3.3.1. Explicit binding. A goal of the static endpoints case is to allow the programmer the convenience of operating directly on MPI_COMM_WORLD, as is commonly done today. In order to communicate on a world communicator that contains multiple endpoints, each thread must bind itself to a specific rank, as follows:

    int MPI_Endpoint_attach(int id)

This approach binds the calling thread to the local endpoint with the id between 0 and the number of local endpoints. A consequence of this interface is that the MPI implementation must store the association between a thread and its endpoint using thread-local storage (TLS), which requires support from the threading package that is being used. In addition, it requires a TLS lookup in every MPI operation on the given communicator, which can significantly increase software overheads in MPI processing.

The explicit binding approach can also be used with the dynamic endpoints interface, if an additional communicator argument is added to the routine. An advantage of this approach is that it allows the MPI implementation to associate threads with specific endpoints, potentially enabling the MPI implementation to manage resources more effectively. Functionality to detach from an endpoint can also be added, enabling a more dynamic programming interface but potentially limiting optimizations. A disadvantage to explicit binding is that integration of the MPI library with the specific threading package may be required in order to support this functionality. Moreover, TLS lookups are necessary to identify the endpoint in use by a given thread.

3.3.2. Implicit binding. The dynamic interface enables an alternate, more flexible strategy to associating endpoints with threads. In this interface, we return separate communicator handles, one per endpoint. When a thread wishes to use a particular endpoint, it simply uses that endpoint's communicator handle in the given MPI operation. For example, MPI_Comm_rank can be used to query the rank of the given endpoint in the communicator. This approach is agnostic to the threading package used, and it allows MPI to store any endpoint-specific information internally using a structure that is referenced through the communicator handle, rather than through thread-local storage.

3.4. Progress semantics

The MPI standard specifies a minimal progress requirement: Communication operations on a given process, whose completion requirements are met (e.g. matching send and receive calls have been made, test or wait has been called), must eventually complete regardless of other actions in the system. Endpoints add an additional dimension to the progress semantic, since they could be treated as independent processes or as part of the process that created them.

Treating endpoints as part of the process from which they were created leads to the simplest progress semantics. This would require that the MPI implementation ensures progress is made on all endpoints of a process whenever any MPI operation is performed. This is easy for users to reason about, especially for an endpoints interface that does not require threads to be bound to endpoints. However, this approach may limit concurrency in some MPI implementations by requiring any thread entering the MPI progress engine to access the state for all endpoints.

Treating endpoints as individual processes in the progress semantic is a weaker guarantee; however, it provides additional opportunities for thread isolation and communication concurrency within the MPI library. In this model, it can be helpful to the MPI implementation if users declare an association of threads with endpoints through attach operations or performance hints in an MPI info object. Such mechanisms can be used to make the MPI runtime system aware that a limited number of threads communicate on a given endpoint. When only a single thread uses a given endpoint, the MPI implementation can eliminate internal synchronizations, potentially providing lockless multithreaded communication.

3.5. Proposed endpoints interface

After reviewing the various approaches to specifying an endpoints extension to MPI, the MPI hybrid working group has put forward the dynamic endpoints proposal for consideration (MPI Forum Hybrid Working Group, 2013). This proposal calls for the addition of a new MPI_Comm_create_endpoints function that returns an array of handles to the output endpoints communicator. No explicit binding of endpoints with threads is required; however, an MPI info object can be provided to this function to specify the level of threading support that will be used on the given endpoint, enabling internal optimizations within the MPI implementation. In this proposal, endpoints are defined to behave as MPI processes and are independent in the MPI progress semantic. For the remainder of this paper, we focus on this proposed endpoints interface.

4. Impact on MPI implementations

We now explore the expected impact of endpoints on existing MPI implementations. Our discussion focuses on the popular, open source MPICH implementation of MPI (MPICH, 2013) from which many commercial MPI implementations are derived. However, the discussion is generally applicable to other MPI implementations.

4.1. Internal communicator representation

MPI communicators are logically composed of an MPI group and a communication context. MPI groups are opaque data structures that represent ordered sets of MPI processes. Groups are local to an MPI process, are immutable once created, and are derived from existing communicators or other groups. The communication context refers to the internal information that an MPI implementation uses to match communication operations according to the communicator on which they were performed. Most MPI implementations identify the communication context using an integer value that is unique across the processes in the group of the communicator. The context ID is typically combined with the source process rank and tag into a single quantity (e.g. a 64-bit integer) that is included in the MPI message header and used for matching.

In MPICH, communicators are internally represented through a structure at each process that contains several important components: the process's rank in the communicator, the size of the communicator, and a table that is used to map a given destination rank to a virtual connection (VC). This mapping structure is currently a dense array indexed by communicator rank, though more compact implementations are possible (Goodell et al., 2011).

MPI applications refer to a given communicator using an MPI_Comm handle. A communicator handle is an opaque reference that can be used by the MPI implementation to retrieve the internal communicator structure. In MPICH, it is implemented as an integer value, which satisfies the MPI specification requirements and simplifies Fortran language bindings. Calling MPI_Comm_create_endpoints will create one communicator handle for each endpoint on a given process, and the handles are returned in the out_comm_hdls array. While the handles identify a particular endpoint in the given communicator, the MPI implementation is expected to share one or more underlying structures. For example, just as normal communicators that are duplicated by MPI_Comm_dup may easily share the rank-to-VC map, this mapping structure may also easily be shared within an MPI process among all the endpoints communicators derived from the same call to MPI_Comm_create_endpoints.
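For illustration, the following sketch packs a context ID, source rank, and tag into a single 64-bit matching quantity in the style described at the start of this subsection. The particular field widths (16 bits of context ID and 24 bits each for rank and tag) are assumptions made for this example, not the layout used by MPICH or any particular implementation.

    #include <stdint.h>

    /* Illustrative 64-bit match value: | context (16) | source rank (24) | tag (24) |
     * Field widths are assumptions for this sketch only. */
    static inline uint64_t make_match_bits(uint16_t context_id,
                                           uint32_t source_rank,
                                           uint32_t tag)
    {
        return ((uint64_t)context_id << 48) |
               (((uint64_t)source_rank & 0xFFFFFF) << 24) |
               ((uint64_t)tag & 0xFFFFFF);
    }

    /* A receive posted with wildcards also carries an ignore mask, so that the
     * MPI_ANY_SOURCE / MPI_ANY_TAG fields are excluded from the comparison. */
    static inline int match(uint64_t posted, uint64_t incoming, uint64_t ignore_mask)
    {
        return (posted & ~ignore_mask) == (incoming & ~ignore_mask);
    }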

In order to extend MPI groups to endpoints communicators, groups must be generalized to incorporate endpoints as well as conventional MPI processes, and group and communicator comparison and manipulation functions must also be updated. We add the MPI_ALIASED result for communicator and group comparisons when the communicator handles correspond to different endpoints in the same communicator.

Group manipulations are traditionally viewed as mapping problems (Träff, 2010) from the dense group space to a global process ID of some type (see Note 3). Endpoints must be assigned unique IDs at their creation time to serve as an element in the codomain of the group rank mapping function. This ID must not be recycled or reused as long as any communicator or group contains a reference to the endpoint associated with that ID. The MPI-2 dynamic process interface has a similar requirement; thus, endpoints can utilize existing infrastructure that supports this interface.

4.2. Associating endpoints with connections

Each MPICH VC in the rank-to-VC mapping table represents a logical connection to another MPI process, even though the underlying network may be connectionless. Endpoint ranks must be associated with a VC and entered into this table for each endpoints communicator. This can be accomplished by creating additional VCs for each endpoint (e.g. by creating additional network addresses or connections), by multiplexing endpoints over existing VCs, or through some combination of these approaches. In current implementations, VCs must be unique within a communicator, since each process may hold only a single rank in a given communicator. The addition of endpoints may require us to relax the uniqueness restriction, breaking an otherwise injective relationship. While rank-to-VC translation is important, the reverse lookup is never performed; thus, the loss of uniqueness in the rank-to-VC mapping does not present any new challenges.

When endpoints are multiplexed across one or more VCs, and operations from multiple endpoints share receive queues, the destination rank must be included as an additional component in the MPI message-matching scheme to distinguish the intended endpoint. If the destination rank is not already included in the message header or match bits, it must be added in order to facilitate this additional matching or demultiplexing step. While this implementation option adds some complexity, it can reduce or eliminate the need to create additional network addresses or connections, and it potentially provides user access to a greater number of endpoints.

The dynamic endpoints interface introduces challenges for interconnects that must create network endpoints when MPI_Init is called. Such networks may be unable to create additional network endpoints later, such as an implementation of MPI over PAMI (Kumar et al., 2012). For such a network, the implementation may create all endpoints when MPI_Init is called, possibly directed by runtime parameters or environment variables. Implementations on such networks can also multiplex MPI endpoints over network endpoints, and the user can improve performance through hints indicating that the number of endpoints created in calls to MPI_Comm_create_endpoints will not exceed the number of network endpoints.

4.3. Interaction with message matching

MPI message matching is performed at the receiver using ⟨Communicator_Context_ID, Source_Rank, Tag⟩ triples, and operations that match a given triple, including those that use the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards, must be matched in the order in which they were posted. Most MPI implementations utilize a set of queues to track receive operations and match them with incoming sends according to the required ordering.

While it is possible to utilize separate queues for each communicator, a single pair of queues is often used. These queues are referred to as the posted and unexpected receive queues. The posted receive queue contains the list of receive operations that have been performed for the given process. This includes blocking and nonblocking operations performed by all threads in a given process. The unexpected receive queue contains an ordered list of messages that arrived but did not match any posted receive. For small messages, the implementation may store the full unexpected message in system buffer space; for larger messages, additional data may need to be fetched from the sender. When a receive operation is posted, the MPI implementation must first search the unexpected message queue to check if it has already arrived. Otherwise, the receive operation is appended to the posted receive queue.

MPICH currently supports shared message queues that contain operations posted across all communicators. When shared message queues are used, MPI message matching is performed by using the full ⟨Communicator_Context_ID, Source_Rank, Tag⟩ triples. Shared message queues can pose significant challenges because of thread synchronization. Therefore, per-receiver/endpoint message queues may be needed for highly threaded MPI implementations. No receiver wildcard is present in MPI; thus message queues can be split per endpoint. Such an implementation can reduce or potentially eliminate synchronization when accessing per-endpoint message queues.

4.4. Interaction with threading models

An endpoints interface that returns separate communicator handles for each endpoint has implementation advantages relative to interfaces that share a communicator handle for all endpoints within a single MPI process. Models that share communicator handles maintain an implicit association between threads and endpoints, which has two major implications for the MPI implementation's interaction with the application's threading model.

Any explicit binding between threads and specific endpoints necessitates the usage of TLS in order to determine the thread's rank in the communicator as well as other endpoint-specific information. Using TLS within the MPI library obligates the MPI implementers to be aware of all possible threading models and middleware that a user might use. This coupling could limit an MPI library's interoperability with different threading models, and it also poses basic maintainability and testability challenges to MPI implementations.

A larger problem than tightly coupling the threading model and the MPI library is that such TLS lookups would be on the critical path for any communication operation. Although some threading models and architectures have support for fast TLS access (Drepper, 2005), many do not. A design utilizing separate communicator handles per endpoint is advantageous because it keeps TLS off the communication critical path. The design tradeoff to using separate handles in the endpoints interface is that it slightly burdens the user, who must distribute these handles among the application threads in order to call MPI routines.

4.5. Optimizing network performance

Modern interconnection networks are often capable of greater speeds than can be generated by transmitting a single message. In order to saturate the network, multiple streams of communication must be performed in parallel, and these communications must use distinct resources in order to avoid being serialized.

Endpoints can be used to separate the communication resources used by threads, enabling multiple threads to drive communication concurrently and achieve higher network efficiency. For example, endpoint ranks may be implemented through distinct network resources, such as network interface cards (NICs), direct memory access first-in first-out queues (DMA FIFOs), or PAMI contexts (Kumar et al., 2012). An MPI implementation might have an optimal number of endpoints based on the number of available hardware resources; users could query this optimal number and use it to influence the number of endpoints that are created.

In this regard, endpoints differ from the use of multiple communicators because additional communicators replicate the current rank into a new context, typically using the same hardware resource. Thus, it is challenging to drive multiple communication channels/resources using multiple threads within a single MPI process. The endpoints creation operation generates new ranks that can be associated with separate hardware resources and with the specific intent of enabling parallel, multithreaded communications within a single context.

5. Impact on MPI applications

Endpoints introduce a new capability within the MPI standard by giving the programmer a greater amount of freedom in mapping ranks to processes. This capability can have a broad impact on interoperability as well as mapping of the computation. In this section, we demonstrate these capabilities by using the proposed endpoints interface described in Section 3.5. First, we use an OpenMP example to highlight the impact of endpoints on interoperability with node-level parallel programming models. Next, we demonstrate the impact of endpoints on interoperability with system-level parallel programming models through a UPC example. Although we use OpenMP and UPC as examples, the techniques shown are applicable to a variety of parallel programming models that may use threads within a node, including Charm++, Coarray Fortran (CAF), X10, and Chapel. We also demonstrate that the relaxation of process–rank mapping enables new approaches to computation mapping through a dynamic load-balancing example.

5.1. Node-level hybrid programs

In Listing 1, we show an example hybrid MPI+OpenMP program where endpoints have been used to enable all OpenMP threads to participate in MPI calls. In particular, threads at all nodes are able to participate in a call to MPI_Allreduce within the OpenMP parallel region. This example highlights one possible use of endpoints. Many different usages are possible, including schemes where MPI_COMM_SELF is used as the basis for the endpoints communicator, allowing the programmer to construct a communicator that can be used to perform MPI communication among threads within a process, enabling a multilevel parallel structure where MPI can be used within the node and between nodes.

In this example, we assume that the maximum number of threads allowed in the OpenMP implementation is compatible with the number of endpoints allowed in the MPI implementation. This is often the case, since the number of cores on a node typically drives the number of threads allowed as well as the available network resources. This example calls MPI_Comm_create_endpoints from the master thread within the OpenMP parallel region in order to create exactly one thread per endpoint. Each thread then obtains its handle to the endpoints communicator and utilizes this private handle to behave as a separate MPI process. Thus, each thread is able to call MPI_Allreduce with its communicator handle as if they were separate MPI ranks in separate processes. Since the threads share a process and address space, the MPI implementation can use a hierarchical collective algorithm to optimize the reduction of data between local threads. As we will show in Section 6, the MPI implementation can utilize the additional threads and endpoint resources to increase network throughput.

    int main(int argc, char **argv) {
        int world_rank, tl;
        int max_threads = omp_get_max_threads();
        MPI_Comm ep_comm[max_threads];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &tl);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    #pragma omp parallel
        {
            int nt = omp_get_num_threads();
            int tn = omp_get_thread_num();
            int ep_rank;

    #pragma omp master
            {
                MPI_Comm_create_endpoints(MPI_COMM_WORLD, nt, MPI_INFO_NULL, ep_comm);
            }
    #pragma omp barrier

            MPI_Comm_rank(ep_comm[tn], &ep_rank);
            ... // divide up work based on 'ep_rank'
            MPI_Allreduce(..., ep_comm[tn]);

            MPI_Comm_free(&ep_comm[tn]);
        }
        MPI_Finalize();
    }

Listing 1. Example hybrid MPI+OpenMP program where endpoints are used to enable all OpenMP threads to participate in a collective MPI allreduce.

5.2. Systemwide hybrid programs

Endpoints can be used to enable interoperability with system-level programming models that use threads for on-node execution, for example, UPC, CAF, Charm++, X10, and Chapel. In this setting, endpoints are used to enable flexible mappings between MPI ranks and execution units in the other model.

To illustrate this capability, we show a hybrid MPI+UPC program in Listing 2. This program uses a flat (one-to-one) mapping between MPI ranks and UPC threads (Dinan et al., 2010). The UPC specification allows UPC threads to be implemented by using processes or threads; however, implementations commonly use threads as the execution unit for UPC threads. In order to support the flat execution model, a mechanism is needed to acquire multiple ranks per unit of MPI execution. In Dinan et al. (2010), the authors extended the MPI launcher with a --ranks-per-proc argument that would allow each spawned process to call MPI_Init multiple times, once per UPC thread. This is one approach to enabling a static endpoints model. However, calling MPI_Init multiple times violates the MPI specification, and it results in all endpoints being within MPI_COMM_WORLD, a situation that may not be desired.

In order to support the UPC code in Listing 2, the UPC compiler must intercept usages of MPI_COMM_WORLD and substitute the upc_comm_world endpoints communicator. Alternatively, the MPI profiling interface (PMPI) can be used to intercept MPI calls and provide communicator translation. This approach provides the best compatibility with MPI libraries that are not compiled by the UPC compiler.

    shared [*] double data[100*THREADS];

    int main(int argc, char **argv) {
        int rank, i;
        double err;

        do {
            upc_forall(i = 0; i < 100*THREADS; i++; i) {
                data[i] = ...;
                err += ...;
            }
            MPI_Allreduce(&err, ..., MPI_COMM_WORLD);
        } while (err > TOL);
    }

Listing 2. Example hybrid MPI+UPC user code. Endpoints enable UPC threads to behave as MPI processes.

In Listing 3, we show the code that a UPC compiler could generate to enable this hybrid execution model. In this example, MPI is used to bootstrap the UPC execution, the approach used by several popular UPC implementations (UC Berkeley and LBNL, 2013). Once the execution has been bootstrapped, a 'flat' endpoints communicator is created, UPC threads are spawned, and threads register the endpoints communicator with an interoperability library and then run the user's main function (shown in Listing 2).

    int main(int argc, char **argv) {
        int world_rank, tl, i;
        MPI_Comm upc_comm_world[NUM_THREADS];

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &tl);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_create_endpoints(MPI_COMM_WORLD, THREADS_PER_NODE,
                                  MPI_INFO_NULL, upc_comm_world);

        /* Calls upc_thread_init(), which calls upc_main() */
        for (i = 0; i < NUM_THREADS; i++)
            UPCR_Spawn(upc_thread_init, upc_comm_world[i]);

        MPI_Finalize();
    }

    upc_thread_init(int argc, char **argv, MPI_Comm upc_comm_world) {
        upc_main(argc, argv);  /* User's main function */
        MPI_Comm_free(&upc_comm_world);
    }

Listing 3. Example hybrid MPI+UPC bootstrapping code generated by the UPC compiler.
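The PMPI-based alternative mentioned above can be sketched as follows: an interposition library redefines an MPI routine and substitutes the endpoints communicator whenever the application passes MPI_COMM_WORLD. The upc_comm_world_for_thread() helper is hypothetical, standing in for whatever lookup the UPC runtime provides, and only MPI_Allreduce is wrapped to keep the sketch short.

    #include <mpi.h>

    /* Hypothetical runtime helper: returns the endpoints communicator handle
     * that was registered for the calling UPC thread. */
    extern MPI_Comm upc_comm_world_for_thread(void);

    /* PMPI interposition: translate MPI_COMM_WORLD to the endpoints
     * communicator, then forward to the real implementation. */
    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
    {
        if (comm == MPI_COMM_WORLD)
            comm = upc_comm_world_for_thread();
        return PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    }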

5.3. Impact on computation management

Endpoints introduce powerful new flexibility in the mapping of ranks to processes. In the current MPI specification, ranks can be shuffled, but the number of ranks assigned to each process must remain fixed. Dynamic endpoints allow ranks to be shuffled and the number of ranks assigned to each process to be adjusted. This capability can be used to perform dynamic load balancing by treating endpoints as "virtual processes" and repartitioning endpoints across nodes. This enables an application behavior that is similar to Adaptive MPI (Bhandarkar et al., 2001), where MPI processes are implemented as Charm++ objects that can be migrated to perform load balancing.

A schematic example of this approach to load balancing is shown in Figure 2. In this example, individual components of the computation (e.g. mesh tiles) are associated with each endpoint rather than particular threads of execution. This enables a programming convention where per-iteration data exchange can be performed with respect to neighbor ranks in the endpoints communicator (e.g. halo exchange). Thus, when endpoints are migrated, the virtual communication pattern between endpoints is preserved. Such a communication-preserving approach to dynamic load balancing can provide an effective solution for adaptive mesh computations.

While this approach to load balancing is promising, it requires the programmer to manually communicate computational state from the previous thread or process responsible for an endpoint to the endpoint's new owner. This introduces interesting opportunities for the development of new parallel programming models and libraries that can harness such a feature. For example, a library could interface with existing load-balancing or work-partitioning tools to generate the remapped communicator. Endpoint or mesh tile state could then be automatically migrated from locations in the old communicator to locations in the new communicator.
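One way to realize the repartitioning shown in Figure 2 with the proposed interface is to let each process request a different number of endpoints when the remapped communicator is created, based on its measured load. The sketch below is illustrative only; compute_my_tile_count() is a hypothetical hook into a load-balancing or partitioning tool, and the migration of tile state is left as a comment.

    #include <stdlib.h>
    #include <mpi.h>

    extern int compute_my_tile_count(void);  /* hypothetical load-balancer hook */

    /* Sketch: rebuild the endpoints communicator so that the number of ranks
     * (mesh tiles) held by this process matches a new partitioning. */
    int rebalance(MPI_Comm parent_comm, MPI_Comm **tile_comms_out)
    {
        int ntiles = compute_my_tile_count();          /* new local tile count */
        MPI_Comm *tile_comms = malloc(ntiles * sizeof(MPI_Comm));

        /* Proposed extension: each process may pass a different my_num_ep; if
         * the per-process counts sum to the old total, the same set of tiles
         * is simply redistributed across processes. */
        MPI_Comm_create_endpoints(parent_comm, ntiles, MPI_INFO_NULL, tile_comms);

        /* Tile state would then be migrated from the old owners to the new
         * ranks, e.g. with point-to-point transfers on the new communicator. */
        *tile_comms_out = tile_comms;
        return ntiles;
    }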

Figure 2. Communication-preserving dynamic load balancing via endpoints migration.

6. Empirical performance study

We measure the impact of multithreaded communication on a commercial MPI implementation running on a modern, many-core HPC cluster. Through this study, we demonstrate the need for communication concurrency, that is, for multiple cores to drive communications in parallel, in order to achieve high levels of communication performance. We compare the impact of shared versus private communication environments to illustrate that many cores must drive communication to achieve high levels of throughput and that a significant performance gap currently exists between multithreaded and single-threaded communication. This data suggests that an endpoints approach that enables threads to acquire private MPI communication contexts can be used to close this gap and relieve pressure on intranode thread/process count tradeoffs.

We conducted experiments using two Intel® Xeon Phi™ Coprocessor 5110P cards, each with 8 GB of memory and 60 1.053 GHz in-order cores with 4-way hyperthreading. Each card runs a version 2.6.38 Linux* kernel adapted for the Xeon Phi. All code was run with the Intel® MPI Library version 4.1 update 1 and the Intel® Manycore Platform Software Stack (MPSS) version 3.1, and compiled with the Intel® C Compiler version 13.1.2. Benchmarks were run in native mode on the Xeon Phi, in which application code and MPI operations are performed directly on the Xeon Phi. While this leaves the host processor idle, it is more representative of future systems where the many-core processor is no longer incorporated into the system as a separate device. The Xeon Phi cards are located within different compute nodes and are connected using Mellanox* 4X QDR InfiniBand* HCAs, with a maximum data rate of 32 Gb/s. The Xeon Phi coprocessor is representative of the growing trend in HPC toward many-core processors and highly multithreaded programs. Such scenarios are strong motivators for endpoints; however, endpoints are expected to be effective at improving communication throughput in any scenario where multithreaded MPI processes are used.

We adapted the Ohio State University (OSU) MPI message rate benchmark (Ohio State University, 2013) to support threads and allow varying of the unexpected and expected queue depths. This benchmark is executed on two nodes and is bidirectional; for a P-process experiment, each node contains P sender processes and P receiver processes. Thus, each sender process is paired with a receiver process on the remote node. Threading support directly uses POSIX* threads, and MPI was initialized in the MPI_THREAD_MULTIPLE mode to support concurrent MPI calls by multiple threads. In order to build expected queue lengths, the benchmark was adapted to post a given number of receive operations to the expected queue with a nonmatching tag. This strategy causes the receiving MPI process to traverse the messages with the nonmatching tag to check for matches each time a message arrives. In the case of the unexpected message queue, a given number of messages with nonmatching tags were sent to the receiver before the timed message sends occurred. This caused the receiver to traverse the list of unexpected messages to search for matches every time a message was received. We use these modifications to simulate the impact of multiple threads sharing a communication context; when multiple threads share a single MPI rank and communicator, MPI's message matching must be performed across all operations performed by all threads.

6.1. Impact of communication concurrency on network throughput

We first demonstrate the need for communication concurrency in order to achieve high levels of network throughput, by comparing throughput for shared versus private communication contexts. We model idealized endpoints as individual MPI processes with private communication states.
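As an illustration of the queue-depth modification described above, the fragment below pre-posts a configurable number of receive operations with a tag that the timed traffic never uses, so that every timed message must be compared against them during matching. The tag value, buffer handling, and function name are placeholders rather than the benchmark's actual code.

    #include <mpi.h>

    #define NONMATCHING_TAG 999   /* placeholder tag unused by timed traffic */

    /* Build an expected-queue backlog of 'depth' entries: these receives use
     * NONMATCHING_TAG, which the timed sender never sends, so each timed
     * message is compared against all of them before it matches. */
    void prepost_expected_queue(int depth, int peer, MPI_Comm comm,
                                char *buf, MPI_Request *reqs)
    {
        for (int i = 0; i < depth; i++)
            MPI_Irecv(buf, 1, MPI_CHAR, peer, NONMATCHING_TAG, comm, &reqs[i]);
        /* The requests are completed (or cancelled) only after the timed
         * message-rate measurement has finished. */
    }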

This represents an upper bound on the performance that can be achieved by MPI endpoints. In Figure 3, we show the message rate that can be achieved when using P = 1 to 16 sending processes on each Xeon Phi processor to inject messages. This corresponds to a total of 2 to 32 active MPI processes per card, since each sending process is matched with a receiving process. From this data, we clearly see that multiple MPI processes enable improvements in the aggregate message rate achieved by each Xeon Phi processor. Message rate increases with process count, with the 16-process case resulting in an approximately 13× improvement over a single process for small messages. The dip in performance at 64 B messages corresponds directly to the cache line size of the Phi. The resulting increase in performance for 128 B messages is related to better utilization of the main memory on the Phi.

Figure 3. Impact of communication concurrency on message rate, for P = 1–16.

In Figure 4 we show the corresponding bandwidth for this experiment. We see that bandwidth scales with communication concurrency, with a greater number of processes having a positive impact on bandwidth between P = 1 and P = 16 cases, but with diminishing returns after P = 8 as the system becomes saturated. At 32 KB message sizes and higher, architectural limitations on our platform prevent bandwidth from scaling with greater concurrency. Newer CPU generations remove this architectural limitation, and we expect that bandwidth will continue to benefit from concurrency for large message sizes in future systems.

Figure 4. Impact of communication concurrency on bandwidth, for P = 1–16.

6.1.1. Impact of threads on network throughput

Figure 5 shows the message rate achievable using multiple threads, as well as multiple threads with multiple MPI processes. This should be regarded as the lower bound for the performance of endpoints, as the threads share MPI state between them for a given MPI process.

Figure 5. Impact of number of threads on message rate, with processes P = 1–16 and threads T = 2 and T = 4 per process.

Figure 6. Impact of number of threads on bandwidth, with processes P = 1–16 and threads T = 2 and T = 4 per process.

Therefore, the performance of endpoints should lie between the multithread performance in Figure 5 and the upper bound in Figure 3. The benchmark adaptations to enable multiple threads are considered in the calculation of the benchmark time; therefore the overhead involved in creating and joining the threads is included in our measurements. Performance-conscious concurrency methods can avoid this overhead by reusing threads after they have been created, avoiding the costs of multiple thread creations. In each experiment, however, the thread fork and join overheads are amortized over many communications and have no measurable impact on timings.

The multithreaded message rate results show lower performance with threads versus processes, as a result of threads sharing communication and MPI library state. We thus focus our exploration on a mixture of processes without threads, with two threads, and with four threads. This is representative of current thread/process tradeoffs made by users of HPC systems. For small messages, we see that the configuration of four processes with two threads yields message rates that exceed those of the single MPI process case. In Figure 6, we see that large message bandwidths can slightly outperform the single-process case with only two threads in a single process. However, this is a relatively small improvement that occurs within the region of the plot where bandwidth converges for all configurations. Having multiple processes and multiple threads clearly benefits bandwidth, with scaling occurring even for P = 16, where a 9.25× performance improvement in bandwidth over the single-process case is observed for small messages.

6.2. Impact of shared communication state on message queue processing

The results shown thus far have measured performance of different message sizes when the incoming messages match the first item in the message-matching queue. Also important is MPI performance with queues of lengths that better approximate those seen in real applications (Barrett et al., 2013). Such a consideration is especially important in the multithreaded case, since threads sharing a communicator must share the message queues associated with that communicator in order to ensure MPI's FIFO message-matching semantics.

Because receive operations posted by all threads must be considered during message matching, the message-matching overheads can increase proportionally to the number of threads performing communication. Figure 7 shows the message rate for given expected queue lengths for 8 B messages. The performance of most process/thread schemes remains steady for queue sizes of 64 or fewer posted receive operations. After the queue length grows beyond 64 items, performance decreases, with more dramatic dropoff occurring after the queue grows to a length greater than 128.

The unexpected queue results in Figure 8 show similar results to those observed for the expected queue. The overall message rate is roughly equal to that for the expected queue, except that for the unexpected queue, the dropoff in message rate is slightly higher for 2 threads at a 256-item-long queue. By providing individual ranks that isolate threads within the MPI implementation, endpoints can avoid these increases in message-matching costs.

Figure 7. Impact of shared message queues on message rate for 8 B messages. The expected queue length is varied, and configurations with processes P = 1–16 and threads T = 2 and T = 4 per process are evaluated.

Figure 8. Impact of shared message queues on message rate for 8 B messages. The unexpected queue length is varied, and configurations with processes P = 1–16 and threads T = 2 and T = 4 per process are evaluated.

Using a modern many-core platform, we have demon- We have summarized the lessons learned, motiva- strated that increased communication concurrency tions, and design decisions that led to the proposed results in greater network performance and that this MPI endpoints interface, and we have motivated it level of performance is difficult to achieve in multi- through studies of the hybrid programming and com- threaded environments where threads share communi- munication performance gaps that it is intended to cation states. We show that in this scenario, increasing address. However, more work is needed to explore effi- the number of endpoints can provide a proportional cient techniques for implementing the endpoints inter- boost to network performance. face and incorporating it within applications. The proposed dynamic endpoints extension requires the addition of a single new function and integrates Acknowledgements with the MPI specification as an extension to the com- municators interface. Endpoints make the We thank the members of the MPI Forum, the MPI Forum existing implicit association of threads with processes hybrid working group, and the MPI community for their by treating a single thread or a group of threads and an engagement in this work. associated endpoint as an MPI process. These refine- ments to the MPI standard establish a self-consistent Funding interaction between endpoints and the existing MPI This work was supported by the US Department of Energy interface and provide the important flexibility that is (contract number DE-AC02-06CH11307). Sandia National needed to fully harness the performance potential of Laboratories is a multiprogram laboratory managed and future systems. operated by Sandia Corporation, a wholly owned subsidiary Dinan et al 403 of Lockheed Martin Corporation, for the US Department of Geist A, Gropp W, Huss-Lederman S, Lumsdaine A, Lusk E, Energy’s National Nuclear Security Administration (contract Saphir W, et al. (1996) MPI-2: Extending the message- number DE-AC04-94AL85000). passing interface. In: 2nd international Euro-Par conference (Euro-Par‘96) (ed L Bouge´, P Fraigniaud, A Mignotte and Y Rober), Lyon, France, 26–29 August 1996, pp. 128– Notes 135. Berlin: Springer. Goodell D, Gropp W, Zhao X and Thakur R (2011) Scalable 1. An earlier version of this paper was published in Dinan et memory use in MPI. In: 18th european MPI users’ group al. (2013), and we extend this prior work with additional detail, including a detailed empirical study. meeting (EuroMPI‘11). 2. We ignore here the dynamic processes interface, which Graham RL and Keller R (2009) Dynamic communicators in can be used to create new processes but not to generate MPI. In: Ropo M, Westerholm J and Dongarra J (eds) additional ranks for the purpose of interoperability with Recent Advances in Parallel Virtual Machine and Message threaded programming models. Passing Interface. Berlin: Springer, pp. 116–123. 3. Though the same process need not be represented by the Hoefler T, Dinan J, Buntinas D, Balaji P, Barrett B, Bright- same value on two different processes, leading to the local well R, et al. (2012) Leveraging MPI’s one-sided communi- process ID (LPID) concept of previous work (Goodell cation interface for shared-memory programming. In: 19th et al., 2011). european MPI users’ group meeting (EuroMPI’12). % Other names and brands may be claimed as the property Hoefler T, Dinan J, Buntinas D, Balaji P, Barrett B, Bright- of others. well R, et al. 
Acknowledgements

We thank the members of the MPI Forum, the MPI Forum hybrid working group, and the MPI community for their engagement in this work.

Funding

This work was supported by the US Department of Energy (contract number DE-AC02-06CH11357). Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the US Department of Energy's National Nuclear Security Administration (contract number DE-AC04-94AL85000).

Notes

1. An earlier version of this paper was published in Dinan et al. (2013), and we extend this prior work with additional detail, including a detailed empirical study.
2. We ignore here the dynamic processes interface, which can be used to create new processes but not to generate additional ranks for the purpose of interoperability with threaded programming models.
3. Though the same process need not be represented by the same value on two different processes, leading to the local process ID (LPID) concept of previous work (Goodell et al., 2011).

*Other names and brands may be claimed as the property of others.

References

Barrett BW, Hammond SD, Brightwell R and Hemmert KS (2013) The impact of hybrid-core processors on MPI message rate. In: 20th european MPI users' group meeting (EuroMPI'13), Madrid, Spain, 15–18 September 2013, pp. 67–71. New York: ACM Press.
Bhandarkar M, Kale LV, de Sturler E and Hoeflinger J (2001) Object-based adaptive load balancing for MPI programs. In: International conference on computational science (ICCS'01), San Francisco, USA, 28–30 May 2001, pp. 108–117. New York: Springer.
Blagojević F, Hargrove P, Iancu C and Yelick K (2010) Hybrid PGAS runtime support for multicore nodes. In: 4th conference on partitioned global address space programming models (PGAS'10). New York: ACM Press.
Demaine E, Foster I, Kesselman C and Snir M (2001) Generalized communicators in the message passing interface. IEEE Transactions on Parallel and Distributed Systems 12: 610–616.
Dinan J, Balaji P, Goodell D, Miller D, Snir M and Thakur R (2013) Enabling MPI interoperability through flexible communication endpoints. In: 20th european MPI users' group meeting (EuroMPI'13).
Dinan J, Balaji P, Lusk E, Sadayappan P and Thakur R (2010) Hybrid parallel programming with MPI and unified parallel C. In: 7th ACM international conference on computing frontiers (CF'10). New York: ACM Press.
Dózsa G, Kumar S, Balaji P, Buntinas D, Goodell D, Gropp W, et al. (2010) Enabling concurrent multithreaded MPI communication on multicore petascale systems. In: 17th european MPI users' group meeting (EuroMPI'10). Berlin: Springer-Verlag.
Drepper U (2005) ELF handling for thread-local storage. Report, Red Hat Incorporated. Available at: http://www.akkadia.org/drepper/
Friedley A, Hoefler T, Bronevetsky G, Lumsdaine A and Ma CC (2013) Ownership passing: Efficient distributed memory programming on multi-core systems. In: 18th ACM SIGPLAN symposium on principles and practice of parallel programming. New York: ACM Press.
Geist A, Gropp W, Huss-Lederman S, Lumsdaine A, Lusk E, Saphir W, et al. (1996) MPI-2: Extending the message-passing interface. In: 2nd international Euro-Par conference (Euro-Par'96) (ed L Bougé, P Fraigniaud, A Mignotte and Y Robert), Lyon, France, 26–29 August 1996, pp. 128–135. Berlin: Springer.
Goodell D, Gropp W, Zhao X and Thakur R (2011) Scalable memory use in MPI. In: 18th european MPI users' group meeting (EuroMPI'11).
Graham RL and Keller R (2009) Dynamic communicators in MPI. In: Ropo M, Westerholm J and Dongarra J (eds) Recent Advances in Parallel Virtual Machine and Message Passing Interface. Berlin: Springer, pp. 116–123.
Hoefler T, Dinan J, Buntinas D, Balaji P, Barrett B, Brightwell R, et al. (2012) Leveraging MPI's one-sided communication interface for shared-memory programming. In: 19th european MPI users' group meeting (EuroMPI'12).
Hoefler T, Dinan J, Buntinas D, Balaji P, Barrett B, Brightwell R, et al. (2013) MPI+MPI: A new hybrid approach to parallel programming with MPI plus shared memory. Computing 95: 1121–1136.
Ji F, Aji AM, Dinan J, Buntinas D, Balaji P, Thakur R, et al. (2012) MPI-ACC: An integrated and extensible approach to data movement in accelerator-based systems. In: 14th international conference on high performance computing and communications (HPCC'12). Piscataway: IEEE Press.
Jose J, Kandalla K, Luo M and Panda DK (2012) Supporting hybrid MPI and OpenSHMEM over InfiniBand: Design and performance evaluation. In: 42nd international conference on parallel processing, pp. 219–228.
Jose J, Luo M, Sur S and Panda DK (2010) Unifying UPC and MPI runtimes: Experience with MVAPICH. In: 4th conference on partitioned global address space programming model (PGAS'10), pp. 5:1–5:10. New York: ACM Press.
Kamal H and Wagner A (2010) FG-MPI: Fine-grain MPI for multicore and clusters. In: 11th international workshop on parallel and distributed scientific and engineering computing (PDSEC), pp. 1–8. Piscataway: IEEE Press.
Kumar S, Mamidala A, Faraj D, Smith B, Blocksome M, Cernohous B, et al. (2012) PAMI: A parallel active message interface for the Blue Gene/Q supercomputer. In: 26th international parallel & distributed processing symposium (IPDPS). Piscataway: IEEE Press.
Li S, Hoefler T and Snir M (2013) NUMA-aware shared memory collective communication for MPI. In: 22nd international ACM symposium on high-performance parallel and distributed computing (HPDC'13). New York: ACM Press.
Luo M, Jose J, Sur S and Panda D (2011) Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures. In: 18th international conference on high performance computing (HiPC), pp. 1–10.
Message Passing Interface Forum (1997) MPI-2: Extensions to the Message-Passing Interface. Available at: http://www.mpi-forum.org/docs/docs.html
MPI Forum Hybrid Working Group (2013) MPI endpoints proposal. Available at: https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/380
MPICH (2013) MPICH - A portable implementation of MPI. Available at: http://www.mpich.org

Ohio State University (2013) OSU MPI benchmarks. Available at: http://mvapich.cse.ohio-state.edu/benchmarks
University of British Columbia (2013) Over 100 Million MPI Processes with MPICH, January 15. Available at: http://www.mpich.org/2013/01/15/over-100-million-processes-with-mpich/
Rabenseifner R, Hager G and Jost G (2009) Hybrid MPI/OpenMP parallel programming on clusters of multi-core SMP nodes. In: 17th euromicro international conference on parallel, distributed and network-based processing (PDP'09), pp. 427–436.
Smith L and Bull M (2001) Development of mixed mode MPI/OpenMP applications. Scientific Programming 9: 83–98.
Stuart JA, Balaji P and Owens JD (2011) Extending MPI to accelerators. In: 1st workshop on architectures and systems for big data (ASBD).
Tanase G, Almasi G, Xue H and Archer C (2012) Network endpoints for clusters of SMPs. In: 24th international symposium on computer architecture and high performance computing (SBAC-PAD), pp. 27–34. Piscataway: IEEE Press.
Träff JL (2010) Compact and efficient implementation of the MPI group operations. In: 17th european MPI users' group meeting (EuroMPI'10).
UC Berkeley and LBNL (2013) Berkeley UPC User's Guide version 2.16.0. Technical report, UC Berkeley and LBNL. Available at: http://upc.lbl.gov/docs/user/
Underwood KD, Levenhagen MJ and Brightwell R (2007) Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors. In: ACM/IEEE conference on supercomputing (SC'07). New York: ACM Press.
Wang H, Potluri S, Luo M, Singh A, Sur S and Panda D (2011) MVAPICH2-GPU: Optimized GPU to GPU communication for InfiniBand clusters. In: 26th international supercomputing conference (ISC'11), Hamburg, Germany, 19–23 June 2011. New York: ACM Press.
Yang C, Bland W, Mellor-Crummey J and Balaji P (2014) Portable, MPI-interoperable coarray Fortran. In: 19th SIGPLAN symposium on principles and practice of parallel programming (PPoPP'14), pp. 81–92. New York: ACM Press.

Author biographies

Dr Pavan Balaji holds appointments as a Computer Scientist at the Argonne National Laboratory, as an Institute Fellow of the Northwestern–Argonne Institute of Science and Engineering at Northwestern University, and as a Research Fellow of the Computation Institute at the University of Chicago. He leads the Programming Models and Runtime Systems group at Argonne. His research interests include parallel programming models and runtime systems for communication and I/O, modern system architecture (multicore, accelerators, complex memory subsystems, high-speed networks), and cloud computing systems. He has more than 100 publications in these areas and has delivered nearly 120 talks and tutorials at various conferences and research institutes.

Dr Balaji is a recipient of several awards, including the US Department of Energy Early Career award in 2012, the TEDxMidwest Emerging Leader award in 2013, Crain's Chicago 40 under 40 award in 2012, the Los Alamos National Laboratory Director's Technical Achievement award in 2005, the Ohio State University Outstanding Researcher award in 2005, six best paper awards, and various others. He has served as a Chair or Editor for nearly 50 journals, conferences, and workshops, and as a technical program committee member for numerous conferences and workshops. He is a senior member of the IEEE and a professional member of the ACM. More details about Dr Balaji are available at http://www.mcs.anl.gov/~balaji.

Dr James Dinan is a Software Architect at Intel Corporation, where his work focuses on high-performance computing. Prior to joining Intel, he was the James Wallace Givens postdoctoral fellow at the Argonne National Laboratory. He received his PhD (2010) and MSc (2009) degrees in Computer Science from Ohio State University in Columbus. He received his BSc degree (2004) in Computer System Engineering from the University of Massachusetts, Amherst. He is an active member of the MPI Forum standardization body and a contributor to the open source MPICH implementation of MPI. His research interests include parallel programming models, scalable runtime systems, scientific computing, operating systems, and computer architecture.

David Goodell graduated from the University of Illinois at Urbana–Champaign in 2005 with a BSc degree in Computer Engineering. He has served on the MPI Forum standards body starting with MPI 2.1 and has contributed to all subsequent revisions, including the currently proposed MPI 3.1 and MPI 4.0 standards. He is the author of numerous academic research papers and has served on the technical program committee for several conferences.

David is currently a Software Engineer at Cisco Systems Incorporated. He specializes in network driver and high-performance computing software development, with a particular focus on the MPI programming model. Prior to Cisco he worked at the Argonne National Laboratory on MPI implementation issues and various research projects. Before that he spent several years at Amazon working on a variety of projects, including the S3 storage service.

Ryan Grant is a postdoctoral appointee in the Scalable Software Systems group at Sandia National Laboratories. He graduated with a PhD degree in Computer Engineering from Queen's University in Kingston, Ontario, Canada in 2012. His research interests are in high-performance computing, with emphasis on high-performance networking for Exascale systems. He is an active member of the design team for the Portals Networking Interface, a high-performance interconnect specification. He is an IEEE member and is actively involved in IEEE-sponsored conference organization.

Doug Miller is an Advisory Software Engineer at IBM, with 25 years' experience working with multiprocessor and multicore operating systems and applications. Doug has a BSc degree in Computer Science from Western Washington University and an Associate Degree in Computer Technology from Wake Technical College. Doug was an active contributor/author of the MPI 3.0 standard, focusing on multithreading and hybrid technologies.

Marc Snir is Director of the Mathematics and Computer Science Division at the Argonne National Laboratory and Michael Faiman and Saburo Muroga Professor in the Department of Computer Science at the University of Illinois at Urbana–Champaign. He currently pursues research in parallel computing. He was head of the Computer Science Department from 2001 to 2007. Until 2001 he was a Senior Manager at the IBM TJ Watson Research Center, where he led the Scalable Parallel Systems research group that was responsible for major contributions to the IBM® SP scalable parallel system and to the IBM® Blue Gene system.

Marc Snir received his PhD degree in Mathematics from the Hebrew University of Jerusalem in 1979, worked at New York University (NYU) on the NYU Ultracomputer project in 1980–1982, and was at the Hebrew University of Jerusalem in 1982–1986, before joining IBM. Marc Snir was a major contributor to the design of the Message Passing Interface. He has published numerous papers and given many presentations on computational complexity, parallel algorithms, parallel architectures, interconnection networks, parallel languages and libraries, and parallel programming environments.

Marc is an Argonne Distinguished Fellow, AAAS Fellow, ACM Fellow and IEEE Fellow. He has Erdős number 2 and is a mathematical descendant of Jacques Salomon Hadamard. He recently won the IEEE Award for Excellence in Scalable Computing and the IEEE Seymour Cray Computer Engineering Award.

Rajeev Thakur is the Deputy Director of the Mathematics and Computer Science Division at Argonne National Laboratory, where he is also a Senior Computer Scientist. He is also a Senior Fellow in the Computation Institute at the University of Chicago and an Adjunct Professor in the Department of Electrical Engineering and Computer Science at Northwestern University. He received his PhD degree in Computer Engineering from Syracuse University in 1995. His research interests are in the area of high-performance computing in general and particularly in parallel programming models, runtime systems, communication libraries, and scalable parallel I/O. He is a member of the MPI Forum that defines the MPI standard. He is also a coauthor of the MPICH implementation of MPI and the ROMIO implementation of MPI-IO, which have thousands of users all over the world and form the basis of commercial MPI implementations from IBM, Cray, Intel, Microsoft, and other vendors. MPICH received an R&D 100 Award in 2005. Rajeev is a coauthor of the book Using MPI-2: Advanced Features of the Message Passing Interface, published by MIT Press, which has also been translated into Japanese. He was an Associate Editor of IEEE Transactions on Parallel and Distributed Systems (2003–2007) and was Technical Program Chair of the SC'12 conference.