Computer Science – Computational Architecture Exploration

Open MPI: A High-Performance, Fault-Tolerant Message-Passing Interface

Brian Barrett, Ralph Castain, Galen Shipman, CCS-1

Open MPI is a mature message-passing interface (MPI) implementation developed by Los Alamos National Laboratory (LANL) in collaboration with a number of academic, industry, and national-laboratory partners. Open MPI is designed to run at the large scales currently found within the Advanced Simulation and Computing program. Recently, Open MPI successfully demonstrated its performance at scale at Sandia National Laboratories, where it played an integral part in boosting high-performance Linpack (HPL) performance to 53.00 teraflops, 84.7% of peak performance, up from 38.27 teraflops and 71.4% of peak a year earlier. Open MPI is the production MPI at LANL, supporting several clusters including the Roadrunner base system.

The Open MPI project was originally started by LANL; Indiana University, Bloomington; and the University of Tennessee, Knoxville. Involvement in the project has since grown to include other national laboratories, universities, and commercial vendors. Support for a number of interconnects, including InfiniBand (MVAPI and OpenFabrics), Myrinet (GM and MX), Portals, TCP, and shared memory, is provided in a single unified release. The Open Run-Time Environment, developed in unison with the Open MPI project, provides a scalable run-time environment with support for launching via rsh/ssh, PBS/Torque's TM interface, BProc, and SLURM.

The performance of Open MPI is competitive with that of other high-performance MPI implementations such as MPICH/MX and MVAPICH. Figure 1 illustrates the performance of Open MPI over Myrinet MX, which is comparable to MPICH/MX.
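To make the "single unified release" point concrete, the short C program below is an ordinary MPI code that runs unchanged over any of the interconnects listed above; with Open MPI, the transport is selected at launch time through run-time (MCA) parameters rather than at build time. This is only an illustrative sketch, and the launch lines in the comment use component names typical of Open MPI releases of this period rather than anything taken from the article.

/* hello_mpi.c: the same source (and binary) runs over TCP, InfiniBand,
 * or Myrinet; only the mpirun command line changes.  Example launches
 * (component names are assumptions, not from the article):
 *   mpirun -np 4 --mca btl tcp,self       ./hello_mpi    (TCP)
 *   mpirun -np 4 --mca btl openib,sm,self ./hello_mpi    (InfiniBand)
 *   mpirun -np 4 --mca btl mx,sm,self     ./hello_mpi    (Myrinet MX)
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);   /* reports which node each rank landed on */
    printf("rank %d of %d running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}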

Fig. 1. The performance of Open MPI over Myrinet MX, which is comparable to MPICH/MX.

Fig. 2. The performance of Open MPI over Mellanox InfiniBand, which is comparable to MVAPICH.

Figure 2 illustrates the performance of Open MPI over Mellanox InfiniBand, which is comparable to MVAPICH. Open MPI provides a 1-byte latency of 3.11 μs using Mellanox InfiniBand vs 3.15 μs for MVAPICH, and 3.15 μs using Myrinet MX vs 2.97 μs for MPICH/MX. These results indicate that a single MPI implementation can provide good performance across multiple architectures while providing the stability of a production-grade MPI implementation.
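The 1-byte latencies quoted above are the kind of number produced by a simple two-process ping-pong microbenchmark, where the one-way latency is taken as half the average round-trip time. The sketch below shows the idea; it is not the benchmark behind Figs. 1 and 2, only a minimal illustration.

/* pingpong.c: minimal 1-byte ping-pong latency sketch (illustrative only).
 * Run with two processes, e.g.:  mpirun -np 2 ./pingpong               */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const int iters = 10000;
    char byte = 0;
    int rank, i;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);              /* start both ranks together */
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {                      /* rank 0 sends, then waits for the echo */
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {               /* rank 1 echoes each message back */
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)                            /* one-way latency = half the round trip */
        printf("1-byte latency: %.2f us\n", 1.0e6 * (t1 - t0) / (2.0 * iters));

    MPI_Finalize();
    return 0;
}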

The Open MPI team at Los Alamos is currently researching a number of issues related to running in large-scale InfiniBand environments. These include adaptive resource allocation, protocol selection, and recovery from the spurious host channel adapter (HCA) resets that are seen in large-scale InfiniBand clusters. In addition to generic large-scale InfiniBand clusters, the Los Alamos team is also researching issues related to the full-scale Roadrunner configuration. These include optimizations for heterogeneous data transfer, run-time support for dynamic processes across heterogeneous processors, and efficient usage of InfiniBand in nonuniform memory architectures.
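The dynamic processes mentioned above refer to the MPI-2 dynamic process-management interface, which lets a running job start additional MPI processes at run time; supporting this across Roadrunner's mixed processor types is one of the research items. The sketch below shows only the generic shape of that interface, assuming a hypothetical worker executable named ./worker; it is not code from the project.

/* spawn_parent.c: generic MPI-2 dynamic-process sketch (illustrative only).
 * "./worker" is a hypothetical MPI program that calls MPI_Comm_get_parent(). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;     /* intercommunicator connecting parent and spawned jobs */
    int nworkers;

    MPI_Init(&argc, &argv);

    /* Collectively launch 4 extra processes; on a heterogeneous machine an
     * MPI_Info object could carry placement hints instead of MPI_INFO_NULL. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    MPI_Comm_remote_size(children, &nworkers);
    printf("spawned %d worker processes\n", nworkers);

    MPI_Comm_disconnect(&children);           /* drop the connection when done */
    MPI_Finalize();
    return 0;
}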

For more information contact Galen Shipman at [email protected].

Funding Acknowledgements
NNSA's Advanced Simulation and Computing (ASC), System Software and Support, and Tools and Capabilities programs.
Nuclear Weapons Highlights 2007 LALP-07-041