Infiniband Scalability in Open MPI

G. M. Shipman1,2, T. S. Woodall1, R. L. Graham1, A. B. Maccabe2

1 Advanced Computing Laboratory, Los Alamos National Laboratory

2 Scalable Systems Laboratory, Computer Science Department, University of New Mexico

Abstract

Infiniband is becoming an important interconnect technology in high performance computing. Recent efforts in large scale Infiniband deployments are raising scalability questions in the HPC community. Open MPI, a new production grade implementation of the MPI standard, provides several mechanisms to enhance Infiniband scalability. Initial comparisons with MVAPICH, the most widely used Infiniband MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.

1 Introduction

High performance computing (HPC) systems are continuing a trend toward distributed memory clusters consisting of commodity components. Many of these systems make use of commodity or 'near' commodity interconnects including Myrinet [17], Quadrics [3], Gigabit Ethernet and, recently, Infiniband [1]. Infiniband (IB) is increasingly deployed in small to medium sized commodity clusters. It is IB's low price/performance qualities that have made it attractive to the HPC market.

Of the available distributed memory programming models, the Message Passing Interface (MPI) standard [16] is currently the most widely used. Several MPI implementations support Infiniband, including Open MPI [10], MVAPICH [15], LA-MPI [11] and NCSA MPI [18]. However, there are concerns about the scalability of Infiniband for MPI applications, partially arising from the fact that Infiniband was initially developed as a general I/O fabric technology and not specifically targeted to HPC [4].

In this paper, we describe Open MPI's scalable support for Infiniband. In particular, Open MPI makes use of Infiniband features not currently used by other MPI/IB implementations, allowing Open MPI to scale more effectively than current implementations. We illustrate the scalability of Open MPI's Infiniband support through comparisons with the widely-used MVAPICH implementation, and show that Open MPI uses less memory and provides better latency than MVAPICH on medium/large-scale clusters.

The remainder of this paper is organized as follows. Section 2 presents a brief overview of the Open MPI general point-to-point message design. Next, section 3 discusses the Infiniband architecture, including current limitations of the architecture. MVAPICH is discussed in section 4, including potential scalability issues relating to this implementation. Section 5 provides a detailed description of Infiniband support in Open MPI. Scalability and performance results are discussed in section 6, followed by conclusions and future work in section 7.

2 Open MPI

The Open MPI Project is a collaborative effort by Los Alamos National Laboratory, the Open Systems Laboratory at Indiana University, the Innovative Computing Laboratory at the University of Tennessee and the High Performance Computing Center at the University of Stuttgart (HLRS). The goal of this project is to develop a next generation implementation of the Message Passing Interface. Open MPI draws upon the unique expertise of each of these groups, which includes prior work on LA-MPI, LAM/MPI [20], FT-MPI [9] and PACX-MPI [13]. Open MPI is, however, a completely new MPI, designed from the ground up to address the demands of current and next generation architectures and interconnects.

Open MPI is based on a Modular Component Architecture (MCA) [19]. This architecture supports the runtime selection of components that are optimized for a specific operating environment. Multiple network interconnects are supported through this MCA. Currently there are two Infiniband components in Open MPI: one supporting the OpenIB Verbs API and another supporting the Mellanox Verbs API. In addition to being highly optimized for scalability, these components provide a number of performance and scalability parameters which allow for easy tuning.

The Open MPI point-to-point (p2p) design and implementation is based on multiple MCA frameworks. These frameworks provide functional isolation with clearly defined interfaces. Figure 1 illustrates the p2p framework architecture.

[Figure 1: Open MPI p2p framework. From top to bottom: the MPI layer, the PML, the BML, per-interconnect BTLs (OpenIB, OpenIB, SM), and their associated MPools (OpenIB, OpenIB, SM); the OpenIB MPools are paired with Rcaches.]

As shown in Figure 1, the architecture consists of four layers. Working from the bottom up, these layers are the Byte Transfer Layer (BTL), BTL Management Layer (BML), Point-to-Point Messaging Layer (PML) and the MPI layer. Each of these layers is implemented as an MCA framework. Other MCA frameworks shown are the Memory Pool (MPool) and the Registration Cache (Rcache). While these are illustrated and defined as layers, critical send/receive paths bypass the BML, as it is used primarily during initialization and BTL selection.

MPool: The memory pool provides memory allocation/deallocation and registration/deregistration services. Infiniband requires memory to be registered (physical pages present and pinned) before send/receive or RDMA operations can use the memory as a source or target. Separating this functionality from other components allows the MPool to be shared among various layers. For example, MPI_ALLOC_MEM uses these MPools to register memory with available interconnects.
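A minimal usage sketch of the MPI_ALLOC_MEM path just mentioned; the buffer size and peer rank are arbitrary examples, and whether the returned memory actually comes from a pre-registered pool depends on the MPools available at run time:

    /* Request memory from MPI so that it can be handed out from a
     * registered (pinned) pool, send from it, then return it. */
    #include <mpi.h>

    void send_from_registered_buffer(int peer)
    {
        void *buf;
        MPI_Alloc_mem(1 << 20, MPI_INFO_NULL, &buf);   /* may be served by an MPool */
        MPI_Send(buf, 1 << 20, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        MPI_Free_mem(buf);                             /* hands the region back */
    }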

Rcache: The registration cache allows memory pools to cache registered memory for later operations. When initialized, MPI message buffers are registered with the MPool and cached via the Rcache. For example, during an MPI_SEND the source buffer is registered with the memory pool and this registration may then be cached, depending on the protocol in use. During subsequent MPI_SEND operations the source buffer is checked against the Rcache, and if the registration exists the PML may RDMA the entire buffer in a single operation without incurring the high cost of registration.

BTL: The BTL modules expose the underlying semantics of the network interconnect in a consistent form. BTLs expose a set of communication primitives appropriate for both send/receive and RDMA interfaces. The BTL is not aware of any MPI semantics; it simply moves a sequence of bytes (potentially non-contiguous) across the underlying transport. This simplicity will enable early adoption of novel network devices and encourages vendor support. There are several BTL modules currently available, including TCP, GM, Portals, Shared Memory (SM), Mellanox VAPI and OpenIB VAPI. In a later section we discuss the Mellanox VAPI and OpenIB VAPI BTLs.

BML: The BML acts as a thin multiplexing layer, allowing the BTLs to be shared among multiple upper layers. Discovery of peer resources is coordinated by the BML and cached for multiple consumers of the BTLs. After resource discovery, the BML layer may be safely bypassed by upper layers for performance. The current BML component is named R2.

PML: The PML implements all logic for p2p MPI semantics including standard, buffered, ready, and synchronous communication modes. MPI message transfers are scheduled by the PML based on a specific policy. This policy incorporates BTL specific attributes to schedule MPI messages. Short and long message protocols are implemented within the PML. All control messages (ACK/NACK/MATCH) are also managed at the PML. The benefit of this structure is a separation of transport protocol from the underlying interconnects. This significantly reduces both code complexity and code redundancy, enhancing maintainability. There are currently three PMLs available in the Open MPI code base. This paper discusses OB1, the latest generation PML, in a later section.

During startup, a PML component is selected and initialized. The PML component selected defaults to OB1 but may be overridden by a runtime parameter or environment setting. Next the BML component R2 is selected. R2 then opens and initializes all available BTL modules. During BTL module initialization, R2 directs peer resource discovery on a per-BTL basis. This allows the peers to negotiate which set of interfaces they will use to communicate with each other. This infrastructure allows for heterogeneous networking interconnects within a cluster.

3 Infiniband

The Infiniband specification is published by the Infiniband Trade Association (ITA), originally created by Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun Microsystems.

IB was originally proposed as a general I/O technology, allowing for a single I/O fabric to replace multiple existing fabrics. The goal of a single I/O fabric has faded and currently Infiniband is targeted as an Inter Process Communication (IPC) and Storage Area Network (SAN) interconnect technology.

Infiniband, similar to Myrinet and Quadrics, provides both Remote Direct Memory Access (RDMA) and operating system (OS) bypass facilities. RDMA enables data transfer from the address space of an application process to a peer process across the network fabric without requiring involvement of the host CPU. Infiniband RDMA operations support both two-sided send/receive and one-sided put/get semantics. Each of these operations may be queued from the user level directly to the host channel adapter (HCA) for execution, bypassing the OS to minimize latency and processing requirements on the host CPU.

3.1 Infiniband OS Bypass

To enable OS bypass, Infiniband defines the concept of a Queue Pair (QP). The Queue Pair mechanism provides user level processes direct access to the IB HCA. Unlike traditional stack based protocols, there is no need to packetize the source buffer or process other protocol specific messages in the OS or at user level. Packetization and transport logic is located almost entirely in the HCA.

Each queue pair consists of both a send and a receive work queue, and is additionally associated with a Completion Queue (CQ). Work Queue Entries (WQEs) are posted from the user level for processing by the HCA. Upon completion of a WQE, the HCA posts an entry to the completion queue, allowing the user level process to poll and/or wait on the completion queue for events related to the queue pair.

Two-sided send/receive operations are initiated by enqueueing a send WQE on a QP's send queue. The WQE specifies only the sender's local buffer. The remote process must pre-post a receive WQE on the corresponding receive queue which specifies a local buffer address to be used as the destination of the receive. Send completion indicates the send WQE is completed locally and results in a sender side CQ entry. When the transfer actually completes, a CQ entry will be posted to the receiver's CQ.

One-sided RDMA operations are likewise initiated by enqueueing an RDMA WQE on the send queue. However, this WQE specifies both the source and target virtual addresses along with a protection key for the remote buffer. Both the protection key and the remote buffer address must be obtained by the initiator of the RDMA read/write prior to submitting the WQE. Completion of the RDMA operation is local and results in a CQ entry at the initiator. The operation is one sided in the sense that the remote application is not involved in the request and does not receive notification of its completion.

3.2 Infiniband Resource Allocation

Infiniband does place some additional constraints on these operations. As data is moved directly between the host channel adapter (HCA) and user level source/destination buffers, these buffers must be registered with the HCA in advance of their use. Registration is a relatively expensive operation which locks the memory pages associated with the request, thereby preserving the virtual to physical mappings. Additionally, when supporting send/receive semantics, pre-posted receive buffers are consumed in order as data arrives at the HCA. Since no attempt is made to match available buffers to the incoming message size, the maximum size of a message is constrained to the minimum size of the posted receive buffers.
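The following sketch illustrates the registration and one-sided RDMA write steps described above using the OpenIB verbs API. Queue pair setup, exchange of the remote address and protection key, and error handling are omitted; the function itself is an illustrative assumption, not code from any particular MPI implementation:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                           void *buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
    {
        /* Registration pins the pages and yields the local key (lkey). */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = (uint32_t) len,
            .lkey   = mr->lkey,
        };

        /* The WQE carries the local SGE plus the remote address and rkey. */
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;   /* request a CQ entry */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* Queued directly to the HCA; the OS is not involved. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }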

Infiniband additionally defines the concept of a Shared Receive Queue (SRQ). A single SRQ may be associated with multiple QPs during their creation. Receive WQEs that are posted to the SRQ are then shared resources available to all associated QPs. This capability plays a significant role in improving the scalability of the connection-oriented transport protocols described below.

3.3 Infiniband Transport Modes

The Infiniband specification details five modes of transport:

1. Reliable Connection (RC)
2. Reliable Datagram (RD)
3. Unreliable Connection (UC)
4. Unreliable Datagram (UD)
5. Raw Datagram

Reliable Connection provides a connection oriented transport between two queue pairs. During initialization of each QP, peers exchange addressing information used to bind the QPs and bring them to a connected state. Work requests posted on each QP's send queue are implicitly addressed to the remote peer. As with any connection oriented protocol, scalability may be a concern as the number of connected peers grows large and resources are allocated to each QP. Both Open MPI and MVAPICH currently use the RC transport mode.

Reliable Datagram allows a single QP to be used to send and receive messages to/from other RD QPs. Whereas in RC reliability state is associated with the QP, RD associates this state with an end-to-end (EE) context. The intent of the Infiniband specification is that the EE contexts will scale much more effectively with the number of active peers. Both Reliable Connection and Reliable Datagram provide acknowledgment and retransmission. In practice, this portion of the specification has yet to be implemented.

Unreliable Connection and Unreliable Datagram are similar to their reliable counterparts in terms of QP resources. These transports differ in that they are unacknowledged services and do not provide for retransmission of dropped packets. The high cost of user level reliability relative to the hardware reliability of RC and RD makes these modes of transport inefficient for MPI.

3.4 Infiniband Summary

Infiniband shares many of the architectural features of VIA. Scalability limitations of VIA are well known [5] to the HPC community. These limitations arise from a connection oriented protocol, RDMA semantics and the lack of direct support for asynchronous progress. While the Infiniband specification does address scalability of connection oriented protocols through the RD transport mode, the industry leader Mellanox has yet to implement this portion of the specification. Additionally, while the SRQ mechanism addresses scalability issues associated with the reliable connection oriented transport, issues related to flow control and resource management must be considered. MPI implementations must therefore compensate for these limitations in order to effectively scale to large clusters.

4 MVAPICH

MVAPICH is currently the most widely used MPI implementation on Infiniband platforms. A descendant of MPICH [12], one of the earliest MPI implementations, as well as MVICH [14], MVAPICH provides several novel features for Infiniband support. These features include small message RDMA, caching of registered memory regions and multi-rail IB support.

4.1 Small Message Transfer

The MVAPICH design incorporates a novel approach to small message transfer. Each peer is pre-allocated and registered a separate memory region for small message RDMA operations, called a persistent buffer association. Each of these memory regions is structured as a circular buffer, allowing the remote peer to RDMA directly into the currently available descriptor. Remote completion is detected by the peer polling the current descriptor in the persistent buffer association. A single bit can indicate completion of the RDMA, as current Mellanox hardware guarantees the last byte of an RDMA operation will be the last byte delivered to the application. This design takes advantage of the extremely low latencies of Infiniband RDMA operations.

Unfortunately, this is not a scalable solution for small message transfer. As each peer requires a separate persistent buffer, memory usage grows linearly with the number of peers. Polling each persistent buffer for completion also presents scalability problems. As the number of peers increases, the additional overhead required to poll these buffers quickly erodes the benefits of small message RDMA.

A similar design was attempted earlier on ASCI Blue Mountain, in what later evolved into LA-MPI, to support HIPPI-800. The approach was later abandoned due to poor scalability and a hybrid approach evolved, taking advantage of HIPPI-800 firmware for multiplexing.

Other alternative approaches to polling persistent buffers for completion have also been discussed and may prove to be more scalable [6].

To address the issues of small message RDMA, MVAPICH provides medium and large configuration options. These options limit the resources used for small message RDMA and revert instead to standard send/receive. As demonstrated in our results section, this configuration option improves the scalability of small message latencies but still results in sub-optimal performance as the number of peers increases.

4.2 Connection Management

MVAPICH uses static connection management, establishing a fully connected job at startup. In addition to eagerly establishing QP connections, MVAPICH also allocates a persistent buffer association for each peer. If send/receive is used instead of small message RDMA, MVAPICH allocates receive descriptors on a per QP basis instead of using the shared receive queue across QPs. This further increases resource allocation per peer.

4.3 Caching Registered Buffers

As discussed earlier, Infiniband requires all memory to be registered (pinned) with the HCA. Memory registration is an expensive operation, so MVAPICH caches memory registrations for later use. This allows subsequent message transfers to queue a single RDMA operation without paying any registration costs. This approach to registration assumes that the application will reuse buffers often in order to amortize the high cost of a single up front memory registration. For some applications this is a reasonable assumption.

A potential issue when caching memory registrations is that the application may free a cached memory region and then return the associated pages to the OS (memory is returned to the OS via the sbrk function). The application could later allocate another memory region and obtain the same virtual address as the previously freed buffer. Subsequent RDMA operations may use the cached registration, but this registration may now contain incorrect virtual to physical mappings. An RDMA operation may therefore use an unintended memory region. In order to avoid this scenario, MVAPICH forces the application to never release pages to the OS (the mallopt function in UNIX and Linux prevents pages from being given back to the OS), thereby preserving virtual to physical mappings. This approach may cause resource exhaustion, as the OS can never reclaim physical pages.
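A sketch of the mallopt-based workaround just described; the parameter values below follow the conventional glibc idiom and are not necessarily the exact settings used by MVAPICH:

    /* Keep the allocator from ever returning pages to the OS, so that
     * cached registrations never end up pointing at recycled pages. */
    #include <malloc.h>

    void disable_memory_return_to_os(void)
    {
        mallopt(M_TRIM_THRESHOLD, -1);  /* never trim the heap via sbrk() */
        mallopt(M_MMAP_MAX, 0);         /* never satisfy requests with mmap() */
    }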

5 Design of Open MPI

In this section we discuss Open MPI's support for Infiniband, including techniques to enhance scalability.

5.1 The OB1 PML Component

OB1 is the latest point-to-point management layer for Open MPI. OB1 replaces the previous generation PML, TEG [21]. The motivation for a new PML was driven by code complexity at the lower layers. Previously, much of the MPI p2p semantics, such as the short and long protocols, was duplicated for each interconnect. This logic, as well as RDMA specific protocol logic, was moved up to the PML layer. Initially there was concern that moving this functionality into an upper layer would cause performance degradation. Preliminary performance benchmarks have shown this not to be the case. This restructuring has substantially decreased code complexity while maintaining performance on par with both previous Open MPI architectures and other MPI implementations. Through the use of device appropriate abstractions we have exposed the underlying architecture to the PML level. As such, the overhead of the p2p architecture in Open MPI is lower than that of other MPI implementations.

OB1 provides numerous features to support both send/receive and RDMA read/write operations. The send/receive protocol uses pre-allocated, registered buffers to copy into on send and copy out of on receive. This protocol provides good performance for small message transfers and is used both for the eager protocol and for control messages.

To support RDMA operations, OB1 makes use of the MPool and Rcache components in order to cache memory regions for later RDMA operations. Both source and target buffers must be registered prior to an RDMA read or write of the buffer. Subsequent RDMA operations can make use of pre-registered memory in the MPool/Rcache. While MVAPICH prevents physical pages from being released to the OS, Open MPI instead uses memory hooks to intercept deallocation of memory. When memory is deallocated it is checked against the Rcache and all matching registrations are de-registered. This prevents future use of an invalid memory registration while allowing memory to be returned to the host operating system.
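One possible shape of such a deallocation hook is sketched below. It uses glibc's (now deprecated) __free_hook purely for brevity, and rcache_invalidate is a hypothetical placeholder rather than Open MPI's actual implementation:

    #include <malloc.h>
    #include <stdio.h>

    static void rcache_invalidate(void *ptr)          /* hypothetical placeholder */
    {
        fprintf(stderr, "deregistering cached registrations covering %p\n", ptr);
    }

    static void (*prev_free_hook)(void *, const void *);

    static void intercepting_free(void *ptr, const void *caller)
    {
        (void) caller;
        rcache_invalidate(ptr);                       /* drop stale registrations */

        __free_hook = prev_free_hook;                 /* avoid recursing into ourselves */
        free(ptr);
        __free_hook = intercepting_free;
    }

    void install_memory_hooks(void)
    {
        prev_free_hook = __free_hook;
        __free_hook = intercepting_free;
    }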

In addition to supporting both send/receive and RDMA read/write, Open MPI provides a hybrid RDMA pipeline protocol. This protocol avoids caching of memory registrations and virtually eliminates memory copies. The protocol begins by eagerly sending data using send/receive up to a configurable eager limit. Upon receipt and match, the receiver responds with an ACK to the source and begins registering blocks of the target buffer across the available HCAs. The number of blocks registered at any given time is bound by a configurable pipeline depth. As each registration in the pipeline completes, an RDMA control message is sent to the source to initiate an RDMA write on the block. To cover the cost of initializing the pipeline, on receipt of the initial ACK at the source, send/receive semantics are used to deliver data from the eager limit up to the initial RDMA write offset. As RDMA control messages are received at the source, the corresponding block of the source buffer is registered and an RDMA write operation is initiated on the current block. On local completion at the source, an RDMA FIN message is sent to the peer. Registered blocks are de-registered upon local completion or receipt of the RDMA FIN message. If required, the receipt of an RDMA FIN message may also further advance the RDMA pipeline. This protocol effectively overlaps the cost of registration/deregistration with RDMA writes. Resources are released immediately and the high overhead of a single large memory registration is avoided. Additionally, this protocol results in improved performance for applications that seldom reuse buffers for MPI operations.

5.2 The OpenIB and Mvapi BTLs

This section focuses on two BTL components, both of which support the Infiniband interconnect. These two components are called Mvapi, based on the Mellanox verbs API, and OpenIB, based on the OpenIB verbs API. Other than this difference, the Mvapi and OpenIB BTL components are nearly identical. Two major goals drove the design and implementation of these BTL components: performance and scalability. The following details the scalability issues addressed in these components.

5.2.1 Connection Management

As detailed earlier, connection oriented protocols pose scaling challenges for larger clusters. In contrast to the static connection management strategy adopted by MVAPICH, Open MPI uses dynamic connection management. When one peer first initiates communication with another peer, the request is queued at the BTL layer. The BTL then establishes the connection through an out of band (OOB) channel. After connection establishment, queued sends are progressed to the peer. This results in a shorter startup time and a longer first message latency for Infiniband communication. Resource usage reflects the actual communication patterns of the application and not the number of peers in the MPI job. As such, MPI codes with scalable communication patterns will require fewer resources.

5.2.2 Small Message Transfer

MVAPICH uses a pure RDMA protocol for small message transfer, requiring a separate buffer per peer. Open MPI currently avoids this scalability problem by using Infiniband's send/receive interface for small messages. In an MPI job with 64 nodes, instead of polling 64 preallocated memory regions for remote RDMA completion, Open MPI polls a single completion queue. Instead of preallocating 64 separate memory regions for RDMA operations, Open MPI will optionally post receive descriptors to the SRQ. Unfortunately, Infiniband does not support flow control when the SRQ is used. As such, Open MPI provides a simple user level flow control mechanism. As demonstrated in our results, this mechanism is probabilistic, may result in retransmission under certain communication patterns, and may require further analysis.

Open MPI's resource allocation scheme is detailed in Figure 2. Per peer resources include two Reliable Connection QPs, one for high priority transfers and one for low priority transfers. High priority QPs share a single Shared Receive Queue and Completion Queue, as do low priority QPs. Receive descriptors are posted to the SRQ on demand. The number of receive descriptors posted to the SRQ is calculated as

    x = log2(n) * k + b

where x is the number of receive descriptors to post, n is the number of peers in the cluster, k is a per-peer scaling factor for the number of receive descriptors to post, and b is a base number of receive descriptors to post.
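A sketch of this on-demand posting policy, assuming an already created SRQ and a registered buffer pool; the fragment size and the scaling constants are illustrative assumptions:

    #include <infiniband/verbs.h>
    #include <math.h>
    #include <stdint.h>

    /* Post x = log2(n) * k + b receive descriptors to the shared receive
     * queue, carving fragments out of a pre-registered pool. */
    int post_srq_descriptors(struct ibv_srq *srq, struct ibv_mr *mr,
                             char *pool, size_t frag_size,
                             int num_peers, int k, int b)
    {
        int x = (int) (log2((double) num_peers) * k + b);

        for (int i = 0; i < x; i++) {
            struct ibv_sge sge = {
                .addr   = (uintptr_t) (pool + (size_t) i * frag_size),
                .length = (uint32_t) frag_size,
                .lkey   = mr->lkey,
            };
            struct ibv_recv_wr wr = { .wr_id = (uint64_t) i,
                                      .sg_list = &sge, .num_sge = 1 };
            struct ibv_recv_wr *bad = NULL;
            if (ibv_post_srq_recv(srq, &wr, &bad))   /* shared by all QPs on the SRQ */
                return -1;
        }
        return x;
    }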

The high priority QP is for small control messages and any data sent eagerly to the peer. The low priority QP is for larger MPI level messages as well as all RDMA operations. Using two QPs allows Open MPI to maintain two sizes of receive descriptors: an eager size for the high priority QP and a maximum send size for the low priority QP. While this requires an additional QP per peer, we gain finer grained control over receive descriptor memory usage. In addition, using two QPs allows us to exploit parallelism available in the HCA hardware [8].

[Figure 2: Open MPI Resource Allocation. Two peers are shown, each with a high priority and a low priority RC QP established dynamically as needed; the high priority QPs share a CQ, an SRQ and eager-limit sized receive descriptors, while the low priority QPs share a CQ, an SRQ and max-send sized receive descriptors, as shared resources.]

5.3 Asynchronous progress

A further problem in RDMA devices is the lack of direct support for asynchronous progress. Asynchronous progress in MPI is the ability of the MPI library to make progress on both sending and receiving of messages while the application is outside the MPI library. This allows for effective overlap of communication and computation.

RDMA based protocols require that the initiator of the RDMA operation be aware of both the source and destination buffers. To avoid a memory copy, and to allow the user to send and receive from arbitrary buffers of arbitrary length, the peer's memory region must be obtained by the initiator prior to each request.

Figure 3 illustrates the timing of a typical RDMA transfer in MPI using an RDMA write. The RTS, CTS and FIN can either be sent using send/receive or small message RDMA. Either method requires the receiver to be in the MPI library to progress the RTS, send the CTS and then handle the completion of the RDMA operation by receiving the FIN message.

In contrast to traditional RDMA interfaces, one method of providing asynchronous progress is to move matching of the MPI message to the network interface. Portals [7] style interfaces allow this by associating target memory locations with the tuple of MPI communicator, tag, and sender address, thereby eliminating the need for the sender to obtain the receiver's target memory address.

[Figure 3: RDMA Write. Timing diagram between Peer 1 and Peer 2: match and RTS, CTS, RDMA Write, FIN.]

From Figure 3 we can see that if the receiver is not currently in the MPI library on the initial RDMA of the RTS, no progress is made on the RDMA write until after the receiver enters the MPI library.

Open MPI addresses asynchronous progress for Infiniband by introducing a progress thread. The progress thread allows the Open MPI library to continue to progress messages by processing RTS/CTS and FIN messages. While this is a solution to asynchronous progress, the cost in terms of message latency is quite high. In spite of this, some applications may benefit from asynchronous progress even in the presence of higher message latency. This is especially true if the application is written in a manner that takes advantage of communication/computation overlap.
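A simplified sketch of such a progress thread, assuming the OpenIB verbs completion channel interface; handle_completion stands in for the protocol handling (matching and RTS/CTS/FIN processing) and is not Open MPI's actual code:

    #include <infiniband/verbs.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Placeholder for the PML-level protocol handling. */
    static void handle_completion(struct ibv_wc *wc)
    {
        fprintf(stderr, "completion: wr_id=%llu status=%d\n",
                (unsigned long long) wc->wr_id, (int) wc->status);
    }

    static void *progress_loop(void *arg)
    {
        struct ibv_comp_channel *channel = arg;
        struct ibv_cq *cq;
        void *ctx;
        struct ibv_wc wc;

        for (;;) {
            if (ibv_get_cq_event(channel, &cq, &ctx))   /* block until an event */
                break;
            ibv_ack_cq_events(cq, 1);
            ibv_req_notify_cq(cq, 0);                   /* re-arm notification */
            while (ibv_poll_cq(cq, 1, &wc) > 0)
                handle_completion(&wc);                 /* progress RTS/CTS/FIN */
        }
        return NULL;
    }

    int start_progress_thread(struct ibv_comp_channel *channel, pthread_t *tid)
    {
        return pthread_create(tid, NULL, progress_loop, channel);
    }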

6 Results

This section presents a comparison of our work. First we present scalability results in terms of per node resource allocation. Next we examine performance results, showing that while Open MPI is highly scalable it also provides excellent performance in the NAS Parallel Benchmarks (NPB) [2].

6.1 Scalability

As demonstrated earlier, the memory footprint of a pure RDMA protocol as used in MVAPICH increases linearly with the number of peers. This is partially due to the lack of dynamic connection management as well as to resource allocation. Resource allocation for the small message RDMA protocol is per peer. Specifically, each peer is allocated a memory region in every other peer. As the number of peers increases, this memory allocation scheme becomes intractable. Open MPI avoids these costs in two ways. First, Open MPI establishes connections dynamically on the first send to a peer. This allows resource allocation to reflect the communication pattern of the MPI application. Second, Open MPI optionally makes use of the Infiniband SRQ so that receive resources (pre-registered memory) can be shared among multiple endpoints.

To examine memory usage of the MPI library we have used three different benchmarks. The first is a simple "hello world" application that does not communicate with any of its peers. This benchmark establishes a baseline of memory usage for an application. Figure 4 demonstrates that Open MPI's memory usage is constant, as no connections are established and therefore no resources are allocated for other peers. MVAPICH, on the other hand, preallocates resources for each peer at startup, so memory resources increase as the number of peers increases. Both MVAPICH small and medium configurations consume more resources.

[Figure 4: Hello World Memory Usage. Memory usage (KBytes) versus number of peers for MVAPICH - Small, MVAPICH - Medium and Open MPI - SRQ.]

Our next benchmark is a pairwise ping-pong, where peers of neighboring rank ping each other; that is, rank 0 pings rank 1, rank 2 pings rank 3, and so on. As Figure 5 demonstrates, Open MPI memory usage is constant. This is due to dynamic connection management: only peers participating in communication are allocated resources. Again we see that MVAPICH memory usage ramps up with the number of peers.

[Figure 5: Pairwise Ping-Pong Memory Usage (0 bytes). Memory usage (KBytes) versus number of peers for MVAPICH - Small, MVAPICH - Medium, Open MPI - SRQ and Open MPI - No SRQ.]

Our final memory usage benchmark is a worst case for Open MPI: each peer communicates with every other peer. As can be seen in Figure 6, Open MPI SRQ memory usage does increase as the number of peers increases, but at a much smaller rate than that of MVAPICH. This is due to the use of the SRQ for resource allocation. Open MPI without SRQ scales slightly worse than the MVAPICH medium configuration; this is due to Open MPI's use of two QPs per peer.

[Figure 6: All-to-all Memory Usage. Memory usage (KBytes) versus number of peers for MVAPICH - Small, MVAPICH - Medium, Open MPI - SRQ and Open MPI - No SRQ.]

6.2 Performance

To verify the performance of our MPI implementation we present both micro benchmarks and the NAS Parallel Benchmarks.

6.2.1 Latency

Ping-pong latency is a standard benchmark of MPI libraries. As with any micro-benchmark, ping-pong provides only part of the true representation of performance. Most ping-pong results are presented using two nodes involved in communication. While this number provides a lower bound on communication latency, multi-node ping-pong is more representative of communication patterns in anything but trivial applications. As such, we present ping-pong latencies for a varying number of nodes in which N nodes perform the previously discussed pairwise ping-pong. This enhancement to the ping-pong benchmark helps to demonstrate scalability of small message transfers because in larger MPI jobs the number of peers communicating at the same time often increases.

In this test, the latency of a zero byte message is measured for each pair of peers. We have then plotted the average with error bars for each of these runs. As can be seen in Figure 7, the small message RDMA mechanism provided in MVAPICH provides a benefit with a small number of peers. Unfortunately, the polling of memory regions is not a scalable architecture, as can be seen when the number of peers participating in the latency benchmark increases. For each additional peer involved in the benchmark, every other peer must allocate and poll an additional memory region. Costs of polling quickly erode any improvements in latency. Memory usage is also higher on a per peer and aggregate basis. This trend occurs in both small and medium MVAPICH configurations. Open MPI provides much more predictable latencies and outperforms MVAPICH latencies as the number of peers increases. Open MPI - SRQ latencies are a bit higher than Open MPI - No SRQ latencies, as the SRQ path under Mellanox HCAs is more costly.

[Figure 7: Multi-Node Zero Byte Latency. Latency (usec) versus number of peers for Open MPI - SRQ, Open MPI - No SRQ, MVAPICH - Small and MVAPICH - Medium.]

Table 1 shows that Open MPI send/receive latencies trail MVAPICH small message RDMA latencies but are better than MVAPICH send/receive latencies. This is an important result, as larger MVAPICH clusters will make more use of send/receive and not small message RDMA.

                              Average Latency
    Open MPI - Optimized      5.64
    Open MPI - Default        5.94
    MVAPICH - RDMA            4.19
    MVAPICH - Send/Receive    6.51

Table 1: Two node ping-pong latency in microseconds. Optimized limits the number of WQEs on the RQ; Default uses the default number of WQEs on the RQ.

6.2.2 NPB

To demonstrate the performance of our implementation outside of micro benchmarks we used the NAS Parallel Benchmarks [2]. NPB is a set of benchmarks derived from computational fluid dynamics applications. All NPB benchmarks were run using the class C problem size and all results are given in run-time (seconds). The results of these benchmarks are summarized in Table 2. Open MPI was run using three configurations: with SRQ, with SRQ and simple flow control, and without SRQ. MVAPICH was run in both small and medium cluster configurations. Open MPI without SRQ and MVAPICH performance is similar. With SRQ, Open MPI performance is similar for the BT, CG, and EP benchmarks. BT, FT and IS performance is lower with SRQ, as receive resources are quickly consumed in collective operations. Our current flow control mechanism addresses this issue for the BT benchmark, but both the FT and IS benchmarks are still affected, due to global broadcast and all-to-all communication patterns respectively. Further research into SRQ flow control techniques is ongoing.

                          BT              CG                              EP
    Nodes                 64      256     32      64      128     256    32      64      128     256
    Open MPI - No SRQ     100.03  25.03   20.17   12.74   7.39    5.56   38.89   19.84   9.95    5.11
    Open MPI - SRQ        114.92  26.92   20.45   12.86   7.49    5.61   38.85   19.72   10.04   5.26
    Open MPI - SRQ FC     100.13  25.33   21.13   12.83   7.38    5.63   39.10   19.76   12.88   5.12
    MVAPICH - Small       98.78   27.40   20.33   12.96   7.84    6.11   39.15   19.65   10.02   5.32
    MVAPICH - Large       99.22   27.58   20.24   13.15   7.83    6.09   39.10   19.59   9.89    5.31

                          SP              FT                              IS
    Nodes                 64      256     32      64      128     256    32      64      128     256
    Open MPI - No SRQ     54.39   16.08   36.64   18.28   9.39    4.81   2.23    1.62    0.97    0.52
    Open MPI - SRQ        140.81  22.53   75.48   68.36   56.92   26.96  32.21   33.29   25.06   21.97
    Open MPI - SRQ FC     54.90   14.61   54.81   35.87   19.39   24.54  5.32    4.38    12.35   11.12
    MVAPICH - Small       53.66   15.16   37.59   19.42   10.17   4.84   2.19    1.55    0.87    0.42
    MVAPICH - Large       53.87   15.84   37.91   19.51   9.85    4.88   2.20    1.56    0.87    0.50

Table 2: NPB results. Each benchmark uses the class C problem size with a varying number of nodes, 1 process per node. Results are given in seconds.

6.3 Experimental Setup

Our experiments were performed on two different machine configurations. Two node ping-pong benchmarks were performed on dual Intel Xeon X86-64 3.2 GHz processors with 2GB of RAM and Mellanox PCI-Express Lion Cub adapters connected via a Voltaire 9288 switch. The operating system is Linux 2.6.13.2, with Open MPI pre-release 1.0 and MVAPICH 0.9.5-118. All other benchmarks were performed on a 256 node cluster consisting of dual Intel Xeon X86-64 3.4 GHz processors with a minimum of 6GB of RAM and Mellanox PCI-Express Lion Cub adapters, also connected via a Voltaire switch. The operating system is Linux 2.6.9-11, with Open MPI pre-release 1.0 and MVAPICH 0.9.5-118.

7 Future Work - Conclusions

Open MPI addresses many of the concerns regarding the scalability and use of Infiniband in HPC. In this section we summarize the results of this paper and provide directions for future work.

7.1 Conclusions

Open MPI's Infiniband support provides several techniques to improve scalability. Dynamic connection management allows per peer resource usage to reflect the application's chosen communication pattern, thereby allowing scalable MPI codes to preserve resources. Per peer memory usage in these types of applications will be significantly less in Open MPI when compared to other MPI implementations which lack this feature. Optional support for an asynchronous progress thread addresses the lack of direct support for asynchronous progress within Infiniband, potentially further reducing buffering requirements at the HCA. Shared resource allocation scales much more effectively than per peer resource allocation through the use of the Infiniband Shared Receive Queue (SRQ). This should allow even fully connected applications to scale to a much higher level.

7.2 Future work

This work has identified additional areas for improvement. As the NAS Parallel Benchmarks illustrated, there are concerns regarding the SRQ case that require further consideration. Preliminary results indicate that an effective flow control and/or resource replacement policy must be implemented, as resource exhaustion results in significant performance degradation.

Additionally, Open MPI currently utilizes an OOB communication channel for connection establishment, which is based on TCP/IP. Using an OOB channel based on the unreliable datagram protocol will decrease first message latency and potentially improve the performance of the Open MPI run-time environment.

While connections are established dynamically, once opened, all connections are persistent. Some MPI codes which randomly communicate with peers may experience high resource

usage even if communication with the peer is infrequent. For these types of applications, dynamic connection teardown may be beneficial.

Open MPI currently supports both caching of RDMA registrations and a hybrid RDMA pipeline protocol. The RDMA pipeline provides good results even in applications that rarely reuse application buffers. Currently Open MPI does not cache RDMA registrations used in the RDMA pipeline protocol. Caching these registrations would allow subsequent RDMA operations to avoid the cost of registration/deregistration when the send/receive buffer is used more than once, while still providing good performance when the buffer is not used again.

Acknowledgments

The authors would like to thank Kurt Ferreira and Patrick Bridges of UNM and Jeff Squyres and Brian Barrett of IU for comments and feedback on early versions of this paper.

References

[1] Infiniband Trade Association. Infiniband architecture specification vol 1, release 1.2, 2004.

[2] Bailey, Barszcz, Barton, Browning, Carter, Dagum, Fatoohi, Fineberg, Frederickson, Lasinski, Schreiber, Simon, Venkatakrishnan, and Weeratunga. NAS parallel benchmarks, 1994.

[3] Jon Beecroft, David Addison, Fabrizio Petrini, and Moray McLaren. QsNetII: An interconnect for supercomputing applications, 2003.

[4] R. Brightwell, D. Doerfler, and K. D. Underwood. A comparison of 4x infiniband and quadrics elan-4 technologies. In Proceedings of 2004 IEEE International Conference on Cluster Computing, pages 193-204, September 2004.

[5] R. Brightwell and A. Maccabe. Scalability limitations of VIA-based technologies in supporting MPI. In Proceedings of the Fourth MPI Developer's and User's Conference, March 2000.

[6] Ron Brightwell. A new MPI implementation for cray SHMEM. In PVM/MPI, pages 122-130, 2004.

[7] Ron Brightwell, Tramm Hudson, Arthur B. Maccabe, and Rolf Riesen. The portals 3.0 message passing interface, November 1999.

[8] V. Velusamy et al. Programming the infiniband network architecture for high performance message passing systems. In Proceedings of The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, 2004.

[9] G. E. Fagg, A. Bukovsky, and J. J. Dongarra. HARNESS and fault tolerant MPI. Parallel Computing, 27:1479-1496, 2001.

[10] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. Open MPI: goals, concept, and design of a next generation MPI implementation. In Proceedings, 11th European PVM/MPI Users' Group Meeting, 2004.

[11] R. L. Graham, S.-E. Choi, D. J. Daniel, N. N. Desai, R. G. Minnich, C. E. Rasmussen, L. D. Risinger, and M. W. Sukalski. A network-failure-tolerant message-passing system for terascale clusters. International Journal of Parallel Programming, 31(4), August 2003.

[12] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789-828, September 1996.

[13] Rainer Keller, Edgar Gabriel, Bettina Krammer, Matthias S. Mueller, and Michael M. Resch. Towards efficient execution of MPI applications on the grid: porting and optimization issues. Journal of Grid Computing, 1:133-149, 2003.

[14] Lawrence Berkeley National Laboratory. MVICH: MPI for virtual interface architecture, August 2001.

[15] Jiuxing Liu, Jiesheng Wu, Sushmitha P. Kini, Pete Wyckoff, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over infiniband. In ICS '03: Proceedings of the 17th Annual International Conference on Supercomputing, pages 295-304, New York, NY, USA, 2003. ACM Press.

[16] Message Passing Interface Forum. MPI: A Message Passing Interface. In Proc. of Supercomputing '93, pages 878-883. IEEE Computer Society Press, November 1993.

[17] Myricom. Myrinet-on-VME protocol specification.

[18] S. Pakin and A. Pant. In Proceedings of The 8th International Symposium on High Performance Computer Architecture (HPCA-8), Cambridge, MA, February 2002.

[19] Jeffrey M. Squyres and Andrew Lumsdaine. The component architecture of open MPI: Enabling third-party collective algorithms. In Vladimir Getov and Thilo Kielmann, editors, Proceedings, 18th ACM International Conference on Supercomputing, Workshop on Component Models and Systems for Grid Applications, pages 167-185, St. Malo, France, July 2004. Springer.

[20] J. M. Squyres and A. Lumsdaine. A Component Architecture for LAM/MPI. In Proceedings, 10th European PVM/MPI Users' Group Meeting, number 2840 in Lecture Notes in Computer Science, Venice, Italy, September/October 2003. Springer-Verlag.

[21] T. S. Woodall, R. L. Graham, R. H. Castain, D. J. Daniel, M. W. Sukalski, G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, and A. Lumsdaine. Open MPI's TEG point-to-point communications methodology: Comparison to existing implementations. In Proceedings, 11th European PVM/MPI Users' Group Meeting, 2004.
