Mercury: Enabling Remote Procedure Call for High-Performance Computing
Jerome Soumagne*, Dries Kimpe†, Judicael Zounmevo‡, Mohamad Chaarawi*, Quincey Koziol*, Ahmad Afsahi‡, Robert Ross†
*The HDF Group, Champaign, IL 61820
†Argonne National Laboratory, Argonne, IL 60439
‡ECE Dept., Queen's University, 19 Union Street, Kingston, ON, K7L 3N6, Canada

Abstract—Remote Procedure Call (RPC) is a technique that has been largely used by distributed services. This technique, now increasingly used in the context of High-Performance Computing (HPC), allows the execution of routines to be delegated to remote nodes, which can be set aside and dedicated to specific tasks. However, existing RPC frameworks assume a sockets-based network interface (usually on top of TCP/IP), which is not appropriate for HPC systems, as this API does not typically map well to the native network transport used on those systems, resulting in lower network performance. In addition, existing RPC frameworks often do not support handling large data arguments, such as those found in read or write calls. We present in this paper an asynchronous RPC interface specifically designed for use in HPC systems that allows asynchronous transfer of parameters and execution requests and direct support of large data arguments. The interface is generic to allow any function call to be shipped. Additionally, the network implementation is abstracted, allowing easy porting to future systems and efficient use of existing native transport mechanisms.

I. INTRODUCTION

When working in a heterogeneous environment, it is often very useful for an engineer or a scientist to be able to distribute the various steps of an application workflow; particularly so in high-performance computing, where it is common to see systems or nodes embedding different types of resources and libraries, which can be dedicated to specific tasks such as computation, storage, or analysis and visualization.

Remote procedure call (RPC) [1] is a technique that follows a client/server model and allows local calls to be transparently executed on remote resources. It consists of serializing the local function parameters into a memory buffer and sending that buffer to a remote target, which in turn deserializes the parameters and executes the corresponding function call. Libraries implementing this technique can be found in various domains, such as web services with Google Protocol Buffers [2] or Facebook Thrift [3], or grid computing with GridRPC [4]. RPC can also be realized using a more object-oriented approach with frameworks such as CORBA [5] or Java RMI [6], where abstract objects and methods can be distributed across a range of nodes or machines.
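To make this serialize, send, and dispatch pattern concrete, the following minimal sketch shows it in plain C. It is only an illustration of the general technique described above, not the interface proposed in this paper: the write_args_t structure and the encode_write_args and handle_write_rpc functions are hypothetical, and a real framework would ship the encoded buffer over the network rather than invoke the handler locally.

/* Minimal sketch of the client-side encode / server-side decode pattern.
 * Illustrative only: the struct, function names, and fixed-size buffer
 * are hypothetical and not part of the interface presented here. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint64_t offset; /* e.g., file offset for a remote write */
    uint64_t length; /* number of bytes the call refers to */
} write_args_t;

/* Client side: serialize the call parameters into a contiguous buffer. */
static size_t encode_write_args(const write_args_t *args, void *buf)
{
    memcpy(buf, args, sizeof(*args)); /* real codecs also handle endianness */
    return sizeof(*args);
}

/* Server side: deserialize the buffer and execute the corresponding call. */
static void handle_write_rpc(const void *buf, size_t len)
{
    write_args_t args;
    if (len < sizeof(args))
        return;
    memcpy(&args, buf, sizeof(args));
    printf("executing write(offset=%llu, length=%llu)\n",
           (unsigned long long)args.offset, (unsigned long long)args.length);
}

int main(void)
{
    write_args_t args = { .offset = 1024, .length = 4096 };
    char buf[64];
    size_t n = encode_write_args(&args, buf);
    /* A real RPC framework would send buf to the remote node here;
     * the handler is invoked locally to keep the sketch self-contained. */
    handle_write_rpc(buf, n);
    return 0;
}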
However, using these standard and generic RPC frameworks on an HPC system presents two main limitations: the inability to take advantage of the native transport mechanism to transfer data efficiently, as these frameworks are mainly designed on top of TCP/IP protocols; and the inability to transfer very large amounts of data, as the limit imposed by the RPC interface is generally on the order of the megabyte. In addition, even if no limit is enforced, transferring large amounts of data through the RPC library is usually discouraged, mostly due to the overhead caused by serialization and encoding, which causes the data to be copied many times before reaching the remote node.

The paper is organized as follows: we first discuss related work in section II, then in section III we discuss the network abstraction layer on top of which the interface is built, as well as the architecture defined to transfer small and large data efficiently. Section IV outlines the API and shows its advantages in enabling the use of pipelining techniques. We then describe the development of network transport plugins for our interface as well as performance evaluation results. Section V presents conclusions and future work directions.

II. RELATED WORK

The Network File System (NFS) [7] is a very good example of the use of RPC with large data transfers and is therefore very close to the use of RPC on an HPC system. It makes use of XDR [8] to serialize arbitrary data structures and create a system-independent description; the resulting stream of bytes is then sent to a remote resource, which can deserialize it and recover the data. It can also make use of separate transport mechanisms (on recent versions of NFS) to transfer data over RDMA protocols, in which case the data is processed outside of the XDR stream. The interface that we present in this paper follows similar principles but, in addition, handles bulk data directly. It also does not restrict data encoding to XDR, which can be a performance hit, especially when sender and receiver share a common system architecture. By providing a network abstraction layer, the RPC interface that we define gives the user the ability to send both small and large data efficiently, using either small messages or remote memory access (RMA) types of transfer that fully support the one-sided semantics present on recent HPC systems. Furthermore, the entire interface presented here is non-blocking and therefore allows an asynchronous mode of operation, so that the caller does not have to wait for one operation to complete before issuing another.
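For reference, the short example below shows the kind of XDR round trip that NFS relies on: values are encoded into a machine-independent byte stream and decoded on the receiving side. It assumes a platform providing the Sun RPC <rpc/xdr.h> interface (on recent Linux distributions, typically through libtirpc); the two fields encoded here are arbitrary. This encode/decode step is precisely the overhead that can be avoided when sender and receiver share the same architecture.

/* Small, self-contained illustration of XDR encoding/decoding as used by
 * ONC RPC and NFS. Assumes a platform that provides Sun RPC's <rpc/xdr.h>. */
#include <rpc/rpc.h>
#include <stdio.h>

int main(void)
{
    char buf[128];
    u_int offset = 1024, length = 4096;

    /* Encode: values are converted to a machine-independent byte stream. */
    XDR enc;
    xdrmem_create(&enc, buf, sizeof(buf), XDR_ENCODE);
    if (!xdr_u_int(&enc, &offset) || !xdr_u_int(&enc, &length))
        return 1;
    u_int encoded_len = xdr_getpos(&enc);
    xdr_destroy(&enc);

    /* Decode: the receiver rebuilds the values from the stream. */
    u_int r_offset = 0, r_length = 0;
    XDR dec;
    xdrmem_create(&dec, buf, encoded_len, XDR_DECODE);
    if (!xdr_u_int(&dec, &r_offset) || !xdr_u_int(&dec, &r_length))
        return 1;
    xdr_destroy(&dec);

    printf("decoded offset=%u length=%u (%u bytes on the wire)\n",
           r_offset, r_length, encoded_len);
    return 0;
}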
The I/O Forwarding Scalability Layer (IOFSL) [9] is another project upon which part of the work presented in this paper is based. IOFSL makes use of RPC to specifically forward I/O calls. It defines an API called ZOIDFS that locally serializes function parameters and sends them to a remote server, where they can in turn be mapped onto file system specific I/O operations. One of the main motivations for extending the work that already exists in IOFSL is the ability to send not only a specific set of calls, such as the ones defined by the ZOIDFS API, but an arbitrary set of calls that can be dynamically and generically defined. It is also worth noting that IOFSL is built on top of the BMI [10] network transport layer used in the Parallel Virtual File System (PVFS) [11]. BMI supports dynamic connection as well as fault tolerance and defines two types of messaging, unexpected and expected (described in section III-B), that can enable an asynchronous mode of operation. Nevertheless, BMI is limited in its design by not directly exposing the RMA semantics that are required to explicitly achieve RDMA operations from the client memory to the server memory, which can be an issue and a performance limitation (the main advantages of using an RMA approach are described in section III-B). In addition, while BMI does not offer one-sided operations, it does provide a relatively high-level set of network operations. This makes porting BMI to new network transports (such as the Cray Gemini interconnect [12]) non-trivial work, and more time consuming than it should be, as only a subset of the functionality provided by BMI is required for implementing RPC in our context.

Another project, Sandia National Laboratories' NEtwork Scalable Service Interface (Nessie) [13], provides a simple RPC mechanism originally developed for the Lightweight File Systems [14] project. It provides an asynchronous RPC solution, which is mainly designed to overlap computation and I/O. The RPC interface of Nessie directly relies on the Sun XDR solution, which is mainly designed to communicate between heterogeneous architectures, even though practically all High-Performance Computing systems are homogeneous. Nessie provides a separate mechanism to handle bulk data transfers, which can use RDMA to transfer data efficiently [...] reducing the amount of user code needed to implement RPC functionality.

Another similar approach can be seen with the Decoupled and Asynchronous Remote Transfers (DART) [16] project. While DART is not defined as an explicit RPC framework, it allows the transfer of large amounts of data using a client/server model from applications running on the compute nodes of an HPC system to local storage or remote locations, to enable remote application monitoring, data analysis, code coupling, and data archiving. The key requirements that DART is trying to satisfy include minimizing data transfer overheads on the application, achieving high-throughput, low-latency data transfers, and preventing data losses. Towards achieving these goals, DART is designed so that dedicated nodes, i.e., nodes separate from the application compute nodes, asynchronously extract data from the memory of the compute nodes using RDMA. In this way, expensive data I/O and streaming operations from the application compute nodes to the dedicated nodes are offloaded, allowing the application to progress while data is transferred. While using DART is not transparent and therefore requires explicit requests to be sent by the user, there is no inherent limitation preventing such a framework from being integrated within our network abstraction layer and wrapped within the RPC layer that we define, hence allowing users to transfer data using DART on the platforms it supports.

III. ARCHITECTURE

As mentioned in the previous section, Mercury's interface relies on three main components: a network abstraction layer, an RPC interface that is able to handle calls in a generic fashion, and a bulk data interface, which complements the RPC layer and is intended to easily transfer large amounts of data by abstracting memory segments. We present in this section the overall architecture and each of its components.

A. Overview

The RPC interface follows a client/server architecture. As described in figure 1, issuing a remote call results in different steps depending on the size of the data associated with the call. We distinguish two types of transfers: transfers containing typical function parameters, which are generally small, referred to as metadata, and transfers of function parameters describing large amounts of data, referred to as bulk data.
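To illustrate this distinction, the sketch below separates the two kinds of arguments for a hypothetical write call: the file offset travels inline as metadata, while the payload is described only by a handle (address, key, size) that the target could later access through a one-sided RMA read. The bulk_handle_t and write_in_t types and the bulk_register and rpc_send stubs are illustrative placeholders under these assumptions, not part of the interface presented in this paper.

/* Hypothetical sketch of the metadata / bulk data split described above.
 * All types and functions below are illustrative placeholders. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t rma_addr; /* remote address exposed for RMA access */
    uint64_t rma_key;  /* registration key the target needs to access it */
    uint64_t size;     /* length of the exposed region */
} bulk_handle_t;

typedef struct {
    uint64_t      offset; /* small, fixed-size parameter: "metadata" */
    bulk_handle_t bulk;   /* descriptor of the large buffer: "bulk data" */
} write_in_t;

/* Stand-ins for a real transport: registration just records the address,
 * and "sending" just reports the size of the metadata message. */
static bulk_handle_t bulk_register(const void *buf, uint64_t len)
{
    bulk_handle_t h = { (uint64_t)(uintptr_t)buf, 42 /* fake key */, len };
    return h;
}

static void rpc_send(const void *msg, size_t msg_len)
{
    (void)msg;
    printf("metadata message of %zu bytes sent\n", msg_len);
}

int main(void)
{
    static char data[1 << 20]; /* 1 MiB payload stays in place */
    write_in_t in = { .offset = 0 };
    in.bulk = bulk_register(data, sizeof(data));
    /* Only the small descriptor travels with the call; the server would
     * later pull the payload itself with a one-sided RMA operation. */
    rpc_send(&in, sizeof(in));
    printf("bulk region: %llu bytes exposed with key %llu\n",
           (unsigned long long)in.bulk.size,
           (unsigned long long)in.bulk.rma_key);
    return 0;
}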