A Comparison of Application Performance Using Open MPI and Cray MPI
Richard L. Graham, Oak Ridge National Laboratory; George Bosilca, The University of Tennessee; and Jelena Pješivac-Grbović, The University of Tennessee

Abstract

Open MPI is the result of an active international Open-Source collaboration of Industry, National Laboratories, and Academia. This implementation is becoming the production MPI implementation at many sites, including some of DOE's largest Linux production systems. This paper presents the results of a study comparing the application performance of VH-1, GTC, the Parallel Ocean Program, and S3D on the Cray XT4 at Oak Ridge National Laboratory, with data collected on runs of up to 1024 processes. The results show that the application performance using Open MPI is comparable to, or slightly better than, that obtained using Cray-MPI, even though platform-specific optimizations, beyond a basic port, have yet to be done in Open MPI.

1 Introduction

Since most High Performance Computing (HPC) applications use the Message Passing Interface (MPI) for their inter-processor communications needs, the scalability and performance of MPI affect overall application performance. As the number of processors these applications use grows, application scalability and absolute performance are often greatly impacted by the scalability and performance of the underlying MPI implementation. As the number of processors participating in MPI communications increases, the characteristics of the implementation that matter most for overall performance change: at small processor counts, quantities such as small-message latency and asymptotic point-to-point message bandwidth are often used to characterize application performance, but other factors become more important at larger scale. At large scale, the scalability of the internal MPI message-handling infrastructure and the performance of the MPI collective communications operations are key to overall application performance.

This paper presents the first study of applications that run on the Cray systems at the National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory, comparing overall code performance as a function of the MPI implementation employed. The applications used include Chimera, GTC, and POP, codes of key importance at NCCS. The performance of these applications is studied using two different MPI implementations, Open MPI and Cray-MPI, over a range of processor counts. Because Cray-MPI is a closed-source implementation, comparisons are limited to gross performance data, such as overall timing data.
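For reference, the small-message latency and asymptotic bandwidth figures mentioned above are typically obtained with a simple two-process ping-pong kernel such as the C sketch below. It is illustrative only; the message sizes and iteration count are arbitrary choices, and it is not the measurement code used in this study.

/* Illustrative ping-pong microbenchmark between ranks 0 and 1.
 * Latency is half the average round-trip time; bandwidth is the
 * bytes moved per round trip divided by the round-trip time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (long len = 1; len <= (1L << 20); len *= 4) {
        char *buf = malloc(len);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double rtt = (MPI_Wtime() - t0) / iters;
        if (rank == 0)
            printf("%8ld bytes: latency %.2f us, bandwidth %.1f MB/s\n",
                   len, 0.5 * rtt * 1e6, (2.0 * len / rtt) / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

As the introduction argues, such two-process numbers say little by themselves about how a full application will scale; that is the motivation for the whole-application comparison in this paper.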
2 Background

2.1 Open MPI

Open MPI [12, 22] is a recent Open Source implementation of both the MPI-1 [20, 24] and MPI-2 [13, 16] standards. Its origins lie in LA-MPI [2, 14], LAM/MPI [6, 26], FT-MPI [9, 11], and PACX-MPI [18]. The original XT3 port is described in [3]. Open MPI is composed of three major software pieces: the MPI layer (Open MPI), a run-time layer, the Open Run-Time Environment (OpenRTE) [7], and the Open Portability Access Layer (OPAL).

OPAL provides basic portability and building-block features useful for large-scale application development, serial or parallel. A number of useful functions provided only on a handful of platforms (such as asprintf, snprintf, and strncpy) are implemented in a portable fashion, so that the rest of the code can assume they are always available. High-resolution / low-perturbation timers, atomic memory operations, and memory barriers are implemented for a large number of platforms. The core support code for the component architecture, which handles loading components at run-time, is also implemented within OPAL. OPAL also provides a rich reference-counted object system to simplify memory management, as well as a number of container classes, such as doubly-linked lists, last-in-first-out queues, and memory pool allocators.

OpenRTE provides a resource manager (RMGR) for process control, a global data store (known as the GPR), an out-of-band messaging layer (the RML), and a peer discovery system for parallel start-up (SDS). In addition, OpenRTE provides basic datatype support for heterogeneous networks, process naming, and standard I/O forwarding. Each subsystem is implemented through a component framework, allowing wholesale replacement of a subsystem for a particular platform. On most platforms, the components implementing each subsystem utilize a number of underlying component frameworks to customize OpenRTE for the specific system configuration. For example, the standard RMGR component utilizes additional component frameworks for resource discovery, process start-up and shutdown, and failure monitoring.

A key design feature of this implementation is the extensive use of a component architecture, the Modular Component Architecture (MCA) [25], which is used to achieve Open MPI's polymorphic behavior. All three layers of the Open MPI implementation make use of this feature, but in this paper we will focus on how it is used at the MPI layer to achieve a portable and flexible implementation. Specifically, we will focus on the Point-To-Point and Collective communications implementations.
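The component architecture described above amounts to choosing, at run time, one of several interchangeable implementations hidden behind a common interface. The C sketch below illustrates that idea generically; the type and function names (toy_component_t, select_component, and the two example components) are hypothetical and do not reproduce Open MPI's actual MCA interfaces.

/* Generic illustration of run-time component selection: each component
 * advertises a priority and a query function, and the highest-priority
 * usable component is chosen.  Names and structure are hypothetical. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int priority;                       /* higher wins among usable ones */
    int  (*query)(void);                /* returns 1 if usable here      */
    void (*send)(const void *buf, size_t len);
} toy_component_t;

static int  shmem_query(void) { return 1; }
static void shmem_send(const void *buf, size_t len)
{ (void)buf; printf("shmem: sending %zu bytes\n", len); }

static int  portals_query(void) { return 0; }   /* e.g., no NIC present */
static void portals_send(const void *buf, size_t len)
{ (void)buf; printf("portals: sending %zu bytes\n", len); }

static toy_component_t components[] = {
    { "portals", 50, portals_query, portals_send },
    { "shmem",   20, shmem_query,   shmem_send   },
};

static toy_component_t *select_component(void)
{
    toy_component_t *best = NULL;
    for (size_t i = 0; i < sizeof(components) / sizeof(components[0]); i++)
        if (components[i].query() &&
            (best == NULL || components[i].priority > best->priority))
            best = &components[i];
    return best;
}

int main(void)
{
    toy_component_t *chosen = select_component();
    if (chosen != NULL)
        chosen->send("hello", 5);
    return 0;
}

In Open MPI itself, such selections are exposed to the user through MCA parameters; for example, a particular PML can be requested on the mpirun command line via the --mca pml option.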
2.1.1 Point-To-Point Architecture

We will provide only a brief description of the Point-To-Point architecture here, as a detailed description is given in [15]. The Point-To-Point Management Layer (PML) implements all logic for point-to-point MPI semantics such as the standard, buffered, ready, and synchronous communication modes, synchronous and asynchronous communications, and the like. MPI message transfers are scheduled by the PML. Figure 1 provides a cartoon of the three PMLs in active use in the Open MPI code base: OB1, DR, and CM. These PMLs can be grouped into two categories based on the component architecture used to implement them, with OB1 and DR forming one group and CM in a group by itself. The following subsections describe these PMLs, as well as the lower-level abstractions developed to support them.

As Figure 1 shows, the OB1 and DR PML design is based on multiple MCA frameworks. These PMLs differ in the design features of the PML component itself, and share the lower-level Byte Transfer Layer (BTL), Byte Management Layer (BML), Memory Pool (MPool), and Registration Cache (Rcache) frameworks. While these are illustrated and defined as layers, critical send/receive paths bypass the BML, as it is used primarily during initialization and BTL selection. These components are briefly described below.

[Figure 1: Open MPI's Layered Architecture. The MPI layer sits above the PMLs: OB1/DR are built on the R2 BML, the GM and OpenIB BTLs and MPools, and the Rcache, while CM is built on the MTLs: MX (Myrinet), Portals, and PSM (QLogic).]

The policies these two PMLs implement incorporate BTL-specific attributes, such as message fragmentation parameters and nominal network latency and bandwidth parameters, for scheduling MPI messages. These PMLs are designed to provide concurrent support for efficient use of all the networking resources available to a given application run. Short and long message protocols are implemented within the PML, as well as message fragmentation and re-assembly. All control messages (ACK/NACK/MATCH) are also managed by the PML.

The DR PML differs from the OB1 PML primarily in that DR is designed to provide high-performance, scalable point-to-point communications in the context of potential network failures. The benefit of this structure is a separation of the high-level (e.g., MPI) transport protocol from the underlying transport of data across a given interconnect. This significantly reduces both code complexity and code redundancy while enhancing maintainability, and provides a means of building other communications protocols on top of these components.

The common MCA frameworks used in support of the OB1 and DR PMLs are described briefly below (a usage sketch follows these descriptions).

MPool: The memory pool provides memory allocation/deallocation and registration/de-registration services. For example, InfiniBand requires memory to be registered (physical pages present and pinned) before send/receive or RDMA operations can use the memory as a source or target. Separating this functionality from other components allows the MPool to be shared among various layers. For example, MPI_ALLOC_MEM uses these MPools to register memory with the available interconnects.

Rcache: The registration cache allows memory pools to cache registered memory for later operations. When initialized, MPI message buffers are registered with the MPool and cached via the Rcache. For example, during an MPI_SEND the source buffer is registered with the memory pool and this registration may then be cached, depending on the protocol in use. During subsequent MPI_SEND operations the source buffer is checked against the Rcache, and if the registration exists the PML may RDMA the entire buffer in a single operation without incurring the high cost of registration.

BTL: The BTL modules expose the underlying semantics of the network interconnect in a consistent form.
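To benefit from the registration and caching behavior just described, an application typically allocates its communication buffers once, ideally through MPI_ALLOC_MEM so the library can register them with the interconnect, and then reuses the same buffers across many transfers so that later sends can hit the cached registration. The C sketch below shows the pattern; the buffer size and iteration count are arbitrary, and whether a given transfer actually uses RDMA, or whether its registration is cached, depends on the protocol in use, as noted above.

/* Illustrative buffer allocation and reuse: allocating with MPI_Alloc_mem
 * and reusing the same buffer lets an implementation with a registration
 * cache avoid re-registering memory on every send.  Assumes two ranks. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    const int count = 1 << 18;          /* 256K doubles per message */
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Memory the MPI library may register with the interconnect. */
    MPI_Alloc_mem((MPI_Aint)(count * sizeof(double)), MPI_INFO_NULL, &buf);

    for (int iter = 0; iter < 100; iter++) {
        if (rank == 0)
            MPI_Send(buf, count, MPI_DOUBLE, 1, iter, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(buf, count, MPI_DOUBLE, 0, iter, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }

    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}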
The third PML, CM, handles some aspects of MPI's point-to-point semantics itself, such as buffer management for MPI's buffered sends. The other aspects of MPI's point-to-point semantics are implemented by the Matching Transport Layer (MTL) framework, which provides an interface between the CM PML and the underlying network library. Currently there are three implementations of the MTL: for Myricom's MX library, for QLogic's InfiniPath library, and for the Cray Portals communication stack.

In this paper we will collect application performance data using both the Portals CM and the OB1 PMLs.

2.1.2 Collectives Architecture

The MPI collective communications in Open MPI are also implemented using the MCA architecture. Of the collective communications algorithm components implemented in Open MPI, the component in broadest use is the Tuned Collectives component [10]. This component implements several versions of each collective operation, and currently all implementations are based on PML-level Point-To-Point communications, with run-time selection logic used to pick which version of a collective operation is used for a given instance of the operation.
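The run-time selection logic mentioned above can be pictured as a decision rule that maps communicator size and message size to one of several implementations of the same collective, each built on point-to-point calls. The C sketch below illustrates the idea for a broadcast; the two algorithms and the thresholds in tuned_bcast are hypothetical and are not the Tuned component's actual decision rules.

/* Illustrative run-time selection between two broadcast algorithms,
 * both built on point-to-point operations.  Thresholds are invented. */
#include <mpi.h>

/* Root sends directly to every other rank: fine for small communicators. */
static int bcast_linear(void *buf, int count, MPI_Datatype dt,
                        int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    if (rank == root) {
        for (int r = 0; r < size; r++)
            if (r != root)
                MPI_Send(buf, count, dt, r, 0, comm);
        return MPI_SUCCESS;
    }
    return MPI_Recv(buf, count, dt, root, 0, comm, MPI_STATUS_IGNORE);
}

/* Binomial tree: each rank receives once, then forwards to its children. */
static int bcast_binomial(void *buf, int count, MPI_Datatype dt,
                          int root, MPI_Comm comm)
{
    int rank, size, mask = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int rel = (rank - root + size) % size;   /* rank relative to the root */

    while (mask < size) {                    /* receive from the parent   */
        if (rel & mask) {
            MPI_Recv(buf, count, dt, (rank - mask + size) % size, 0,
                     comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }
    for (mask >>= 1; mask > 0; mask >>= 1)   /* forward to the children   */
        if (rel + mask < size)
            MPI_Send(buf, count, dt, (rank + mask) % size, 0, comm);
    return MPI_SUCCESS;
}

/* Hypothetical decision rule keyed on communicator and message size. */
int tuned_bcast(void *buf, int count, MPI_Datatype dt,
                int root, MPI_Comm comm)
{
    int size, type_size;
    MPI_Comm_size(comm, &size);
    MPI_Type_size(dt, &type_size);
    long bytes = (long)count * type_size;

    if (size <= 4 && bytes <= 4096)
        return bcast_linear(buf, count, dt, root, comm);
    return bcast_binomial(buf, count, dt, root, comm);
}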