Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand ∗

Qi Gao Weikuan Yu Wei Huang Dhabaleswar K. Panda

Network-Based Computing Laboratory
Department of Computer Science and Engineering
The Ohio State University, Columbus, OH 43210
{gaoq, yuw, huanwei, panda}@cse.ohio-state.edu

Abstract

Ultra-scale computer clusters with high speed interconnects, such as InfiniBand, are being widely deployed for their excellent performance and cost effectiveness. However, the failure rate on these clusters also increases along with their augmented number of components. Thus, it becomes critical for such systems to be equipped with fault tolerance support. In this paper, we present our design and implementation of a checkpoint/restart framework for MPI programs running over InfiniBand clusters. Our design enables low-overhead, application-transparent checkpointing. It uses a coordinated protocol to save the current state of the whole MPI job to reliable storage, which allows users to perform rollback recovery if the system later runs into a faulty state. Our solution has been incorporated into MVAPICH2, an open-source high performance MPI-2 implementation over InfiniBand. Performance evaluation of this implementation has been carried out using the NAS benchmarks, the HPL benchmark, and a real-world application called GROMACS. Experimental results indicate that in our design the overhead of taking checkpoints is low, and the performance impact of checkpointing applications periodically is insignificant. For example, the time for checkpointing GROMACS is less than 0.3% of its execution time, and its performance only decreases by 4% with checkpoints taken every minute. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.

∗ This research is supported in part by a DOE grant #DE-FC02-01ER25506 and NSF grants #CNS-0403342 and #CNS-0509452; grants from , Mellanox, , Networx, and Sun Microsystems; and equipment donations from Intel, Mellanox, AMD, Apple, Appro, , Microway, PathScale, IBM, SilverStorm, and .

1 Introduction

High End Computing (HEC) systems are quickly gaining in speed and size. In particular, more and more computer clusters with multi-thousand nodes have been deployed in recent years because of their low price/performance ratio. While the failure rate of an entire system grows rapidly with the number of its components, few of these large-scale systems are equipped with built-in fault tolerance support. The applications running over these systems also tend to be more error-prone, because the failure of any single component can cascade widely to other components due to the interaction and dependence among them.

The Message Passing Interface (MPI) [21] is the de facto programming model in which parallel applications are typically written. However, it has no specification about the fault tolerance support that a particular implementation must provide. As a result, most MPI implementations are designed without fault tolerance support, providing only two working states: RUNNING or FAILED. Faults that occur during execution often abort the program, and the program has to restart from the beginning. For long running programs, this can waste a large amount of computing resources, because all the computation that has already been accomplished is lost. To save these valuable computing resources, it is desirable that a parallel application can restart from some state saved before a failure occurs and continue the execution. Checkpointing and rollback recovery is thus one of the commonly used techniques for fault recovery.

The InfiniBand Architecture (IBA) [18] has recently been standardized in industry for designing next generation high-end clusters for both data-center and high performance computing. Large cluster systems with InfiniBand are being deployed. For example, in the Top500 list released in November 2005 [31], the 5th, 20th, and 51st most powerful supercomputers use InfiniBand as their parallel application communication interconnect. These systems can have as many as 8,000 processors. It becomes critical for such large-scale systems to be deployed with checkpoint/restart support so that long-running MPI parallel programs are able to recover from failures. However, it is still an open challenge to provide checkpoint/restart support for MPI programs over InfiniBand clusters.

In this paper, we take on this challenge to enable checkpoint/restart for MPI programs over InfiniBand clusters. Based on the capability of Berkeley Lab's Checkpoint/Restart (BLCR) [12] to take snapshots of processes on a single node, we design a checkpoint/restart framework to take global checkpoints of the entire MPI program while ensuring global consistency. We have implemented our design of checkpoint/restart in MVAPICH2 [24], which is an open-source high performance MPI-2 implementation over InfiniBand and is widely used by the high performance computing community. Checkpoint/restart-capable MVAPICH2 enables low-overhead, application-transparent checkpointing for MPI applications with only insignificant performance impact. For example, the time for checkpointing GROMACS [11] is less than 0.3% of the execution time, and its performance only decreases by 4% with checkpoints taken every minute.

The rest of the paper is organized as follows: In Section 2 and Section 3, we describe the background of our work and identify the challenges involved in checkpointing InfiniBand parallel applications. In Section 4, we present our design in detail with discussions of some key design issues. In Section 5, we describe the experimental results of our current implementation. In Section 6, we discuss related work. Finally, we provide our conclusions and describe future work in Section 7.

2 Background

2.1 InfiniBand and MVAPICH2

InfiniBand [18] is an open standard for a next generation high speed interconnect. In addition to send/receive semantics, its native transport services, a.k.a. InfiniBand verbs, provide memory-based semantics, Remote Direct Memory Access (RDMA), for high performance interprocess communication. By directly accessing and/or modifying the contents of remote memory, RDMA operations are one-sided and do not incur CPU overhead on the remote side. Because of its high performance, InfiniBand is gaining wider deployment in high end computing platforms [31].

Designed and implemented based on its predecessor MVAPICH [20] and on MPICH2 [1], MVAPICH2 is an open-source high performance implementation of the MPI-2 standard. MVAPICH2, along with MVAPICH, is currently being used by more than 355 organizations across the world. Currently enabling several large-scale InfiniBand clusters, MVAPICH2 includes a high performance transport device over InfiniBand, which takes advantage of RDMA capabilities.

2.2 Checkpointing and Rollback Recovery

Checkpointing and rollback recovery is one of the most commonly used techniques for failure recovery in distributed computing. A detailed comparison of various rollback recovery protocols, including both checkpointing and message logging, can be found in [13]. In our work we choose coordinated checkpointing because: (a) message logging can potentially impose considerable overhead in the environment of high-bandwidth interconnects such as InfiniBand, and (b) purely uncoordinated checkpointing is susceptible to the domino effect [26], where the dependencies between processes can make all processes roll back to the initial state.

With respect to transparency to the user application, checkpoint/restart techniques can be divided into two categories: application-level checkpointing and system-level checkpointing. The former usually involves the user application in the checkpointing procedure. While gaining efficiency with assistance from the user application, this approach has a major drawback: the source code of user applications needs to be tailored to the checkpointing interface, which often involves a significant amount of work for each application. The latter is application-transparent, because the OS takes care of saving the state of running processes. Although it may involve more overhead, it does not need any code modification of applications. Thus we follow this approach.

3 Challenges

Most studies on checkpointing parallel applications assume that the communication is based on the TCP/IP stack. Although InfiniBand also provides TCP/IP support using IP over IB (IPoIB), it does not deliver as good performance as native InfiniBand verbs. In this section, we identify the challenging issues in checkpointing parallel programs that are built over native InfiniBand protocols.

First, parallel processes over InfiniBand communicate via an OS-bypass user-level protocol. In regular TCP/IP networks, the operating system (OS) kernel handles all network activities, so these network activities can be temporarily stopped in an application-transparent manner. However, InfiniBand provides its high performance communication via OS-bypass capabilities in its user-level protocol [6]. The use of these user-level protocols has the following side effect: the operating system is bypassed in the actual communication and does not maintain complete information about ongoing network activities. Because of this information gap regarding communication activities between the OS kernel and the user-land of the application process, it becomes difficult for the operating system to directly stop network activities and take checkpoints without losing consistency.

Second, the context of a network connection is available only in the network adapter. In regular TCP/IP networks, the network communication context is stored in kernel memory, which can be saved to checkpoints. Different from TCP/IP networks, an InfiniBand network adapter stores the network connection context in adapter memory. This information is designed to be volatile, and is thus very difficult for a restarted process to reuse. Therefore, the network connection context has to be released before a checkpoint and rebuilt afterwards. As InfiniBand uses a user-level protocol, some network context information, such as Queue Pairs (QPs), is also cached in user memory and must be reconstructed according to the new network connection context before a process continues communication. The releasing/rebuilding of network connections should be totally transparent to applications.
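To make the first two challenges concrete, the sketch below uses the verbs API to show the kind of user-level connection resources an MPI process holds for each InfiniBand connection. None of this state is visible to the OS kernel, and the queue pair context lives in the HCA, so it must be explicitly released before a checkpoint is taken and recreated afterwards. This is a minimal illustration under our own simplifications; the structure and function names are ours, not code taken from MVAPICH2.

    #include <infiniband/verbs.h>

    /* Per-connection resources held in user space; the corresponding
     * hardware context resides in the HCA and does not survive a restart. */
    struct conn_resources {
        struct ibv_context *ctx;   /* opened HCA device context            */
        struct ibv_pd      *pd;    /* protection domain                    */
        struct ibv_cq      *cq;    /* completion queue                     */
        struct ibv_qp      *qp;    /* queue pair: one RC connection endpoint */
    };

    static int build_connection(struct conn_resources *c)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0])
            return -1;
        c->ctx = ibv_open_device(devs[0]);
        ibv_free_device_list(devs);
        if (!c->ctx)
            return -1;

        c->pd = ibv_alloc_pd(c->ctx);
        c->cq = ibv_create_cq(c->ctx, 1024, NULL, NULL, 0);

        struct ibv_qp_init_attr attr = {
            .send_cq = c->cq,
            .recv_cq = c->cq,
            .cap     = { .max_send_wr = 128, .max_recv_wr = 128,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,            /* Reliable Connection */
        };
        c->qp = ibv_create_qp(c->pd, &attr);
        /* QP number/LID exchange and INIT->RTR->RTS transitions omitted. */
        return c->qp ? 0 : -1;
    }

    /* Before a checkpoint, the connection context must be torn down;
     * after the checkpoint (or a restart), build_connection runs again. */
    static void release_connection(struct conn_resources *c)
    {
        ibv_destroy_qp(c->qp);
        ibv_destroy_cq(c->cq);
        ibv_dealloc_pd(c->pd);
        ibv_close_device(c->ctx);
    }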

Third, some network connection context is even cached on the remote node. Because of their high performance, many applications take advantage of the RDMA operations provided by InfiniBand. Different from some other RDMA-capable networks such as Myrinet [22], InfiniBand requires authentication for accessing remote memory. Before process A accesses remote memory in process B, process B must register the memory with the network adapter and then inform process A about the virtual address of the registered memory and the remote key needed to access that memory. Process A must then cache that key and include it in its RDMA requests so that the network adapter for process A can match the keys and authorize the memory access. Since these keys become invalid when the network connection context is rebuilt, potential inconsistency may be introduced by the invalid keys.
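The remote-key issue can be illustrated with the following simplified verbs fragments. The function names (register_rdma_target, post_rdma_write) are ours, not the MVAPICH2 interface: the receiver registers a buffer and advertises the resulting rkey, and the sender must place that rkey in its RDMA write request. Once the buffer is re-registered after a restart, any rkey cached by the sender no longer authorizes access.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Receiver side (process B): (re-)register a user buffer; the returned
     * mr->rkey must be advertised to the peer before the peer may target
     * this buffer with RDMA writes. */
    struct ibv_mr *register_rdma_target(struct ibv_pd *pd, void *buf, size_t len)
    {
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    }

    /* Sender side (process A): post an RDMA write carrying the cached remote
     * key. If B re-registered its buffer (e.g., after a restart), the cached
     * rkey is stale and the HCA will refuse the access. */
    int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *src_mr,
                        void *src, uint32_t len,
                        uint64_t remote_addr, uint32_t cached_rkey)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t) src, .length = len, .lkey = src_mr->lkey
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_WRITE,
            .sg_list    = &sge, .num_sge = 1,
            .send_flags = IBV_SEND_SIGNALED,
        };
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = cached_rkey;   /* stale after re-registration */

        struct ibv_send_wr *bad_wr;
        return ibv_post_send(qp, &wr, &bad_wr);
    }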
4 Checkpoint/Restart Framework and Design Issues

In this section, we present the detailed checkpoint/restart framework for MPI over InfiniBand and some key design issues. As we characterize the issues, we focus on the following in particular: (a) how to bring an MPI program into a state which can be consistently saved to a checkpoint, and (b) how to resume an MPI program from a checkpoint. There are three design objectives for this framework:

• Consistency: the global consistency of the MPI program must be preserved.

• Transparency: the checkpoints must be taken transparently to the MPI application.

• Responsiveness: requests for checkpointing can be issued at any point of the execution, and upon a request, the checkpoint must be taken as soon as possible.

We design a protocol to coordinate all processes in the MPI job to consistently and transparently suspend all InfiniBand communication channels between them and preserve the communication channel states in the checkpoint, which provides high responsiveness.

In the remaining part of this section, we start with the Checkpoint/Restart (C/R) framework, describing the components in the framework and their functionalities. Then we provide a global view of the C/R procedure by describing the overall state diagram and state transitions. Finally, we discuss some design issues from a local view to show how to suspend/reactivate InfiniBand communication channels.

4.1 Checkpoint/Restart Framework

In a cluster environment, a typical MPI program consists of a front-end MPI job console, a process manager crossing multiple nodes, and individual MPI processes running on these nodes. The Multi-Purpose Daemon (MPD) [8] is the default process manager for MVAPICH2; all the MPD daemons are connected as a ring. As depicted in Figure 1, the proposed C/R framework is built upon this MPI job structure, and there are five key components in the framework, described as follows:

• Global C/R Coordinator is a part of the MPI job console, responsible for the global management of checkpoint/restart for the whole MPI job. It can be configured to initiate checkpoints periodically and/or handle checkpoint/restart requests from users or administrators.

• Control Message Manager provides an interface between the global C/R coordinator and the local C/R controllers. It utilizes the process manager already deployed in the cluster to provide out-of-band messaging between the MPI processes and the job console. In our current implementation, we extend the functionality of MPD to support C/R control messages.

• Local C/R Controller takes the responsibility of local management of the C/R operations for each MPI process. Its functionality can be described as follows: (a) to take C/R requests from, and report the results to, the global C/R coordinator; (b) to cooperate with communication channel managers and with the C/R controllers in peer MPI processes to converge the whole MPI job to a state which can be consistently checkpointed; and (c) to invoke the C/R library to take checkpoints locally. In our current design, the C/R controller is implemented as a separate thread, which wakes up only when a checkpoint request comes.

• C/R Library is responsible for checkpointing/restarting the local process. Checkpointing a single process on a single node has been studied extensively, and there are several packages available to the community. In our current implementation, we use Berkeley Lab's Checkpoint/Restart (BLCR) [15] package.

• Communication Channel Manager controls the in-band message passing. In the C/R framework, it has extended functionality for suspending/reactivating the communication channel; the temporary suspension does not impair channel consistency and is transparent to the upper layers. Currently, we implement the C/R functionality on the InfiniBand channel based on the OpenIB [25] Gen2 stack.

[Figure 1. Checkpoint/Restart Framework: the MPI job console with the global C/R coordinator, the process manager extended with C/R messaging, and the MPI processes, each with a control message manager, local C/R controller, C/R library, and InfiniBand communication channel manager]
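As a rough illustration of how the global C/R coordinator might drive periodic checkpointing from the job console, consider the sketch below. broadcast_cr_request and wait_for_all_acks are hypothetical stand-ins for the extended MPD messaging; they are not functions in MVAPICH2.

    #include <unistd.h>

    /* Hypothetical out-of-band messaging helpers (illustrative names only). */
    extern int broadcast_cr_request(const char *kind);  /* via the MPD ring  */
    extern int wait_for_all_acks(void);                 /* one ack per process */

    /* Global C/R coordinator: initiate a checkpoint of the whole MPI job
     * every 'interval' seconds, as the job console can be configured to do. */
    static void coordinator_loop(unsigned interval)
    {
        for (;;) {
            sleep(interval);
            if (broadcast_cr_request("checkpoint") != 0)
                continue;               /* request could not be delivered   */
            if (wait_for_all_acks() != 0)
                break;                  /* some process failed to checkpoint */
        }
    }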

4.2 Overall Checkpoint/Restart Procedure

[Figure 2. State Diagram for Checkpoint/Restart: Running, Initial Synchronization, Pre-checkpoint Coordination, Local Checkpointing, Post-checkpoint Coordination, and Restarting; checkpoints are written to storage, and a restart resumes from storage through the Restarting and Post-checkpoint Coordination states]

Figure 2 shows the state diagram of our checkpoint/restart framework.

During a normal run, the job can go through the checkpointing cycle upon user requests or periodically. The cycle consists of four phases:

• Initial Synchronization Phase: All processes in the MPI job synchronize with each other and prepare for pre-checkpoint coordination. First, the global C/R coordinator in the job console propagates a checkpoint request to all local C/R controllers running in the individual MPI processes. Then, upon arrival of the request, each local C/R controller wakes up and locks the communication channels to prevent the main thread from accessing them during the checkpointing procedure.

• Pre-checkpoint Coordination Phase: The C/R controllers coordinate with each other to make all MPI processes individually checkpointable. To do so, they cooperate with the communication channel managers to suspend all communication channels temporarily and release the network connections in these channels.

• Local Checkpointing Phase: The C/R controllers invoke the C/R library to save the current state of the local MPI process, including the state of the suspended communication channels, to a checkpoint file.

• Post-checkpoint Coordination Phase: The C/R controllers cooperate with the communication channel managers to reactivate the communication channels. This step involves rebuilding the low-level network connections and resolving the possible inconsistency introduced by the potentially different network connection information.

The details of how to suspend/reactivate communication channels consistently and transparently will be discussed in Section 4.3.

To restart from a checkpoint, a restarting procedure is performed, which consists of a restarting phase and a post-checkpoint coordination phase. In the Restarting Phase, the global C/R coordinator first propagates the restart request to each node, where the C/R library is responsible for restarting the local MPI process from the checkpoint file. Then, the local C/R controller reestablishes the connection between the MPI process and its process manager and performs the necessary coordination between them. At this point, the MPI job is restarted to a state identical with the previous state in the local checkpointing phase. Therefore, to continue running, it first goes through the post-checkpoint coordination phase, and when all communication channels are reactivated, it comes back to the running state.
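From the perspective of a single MPI process, the checkpointing cycle above can be pictured with the following sketch of the local C/R controller thread. All helper functions are hypothetical stand-ins for the channel manager, the out-of-band messaging, and the BLCR-based C/R library; the names are illustrative only.

    #include <pthread.h>

    extern void wait_for_checkpoint_request(void);   /* blocks on out-of-band msg */
    extern void lock_communication_channels(void);
    extern void unlock_communication_channels(void);
    extern void suspend_all_channels(void);          /* pre-checkpoint coordination */
    extern void reactivate_all_channels(void);       /* post-checkpoint coordination */
    extern int  take_local_checkpoint(const char *file);  /* invokes the C/R library */
    extern void report_result(int rc);

    /* Local C/R controller: a dedicated thread that sleeps until the global
     * coordinator issues a request, then drives one checkpointing cycle. */
    static void *cr_controller_thread(void *arg)
    {
        const char *ckpt_file = (const char *) arg;
        for (;;) {
            wait_for_checkpoint_request();       /* initial synchronization    */
            lock_communication_channels();       /* keep the main thread out   */
            suspend_all_channels();              /* drain + release connections */
            int rc = take_local_checkpoint(ckpt_file);   /* local checkpointing */
            reactivate_all_channels();           /* rebuild connections, fix keys */
            unlock_communication_channels();
            report_result(rc);                   /* back to the running state  */
        }
        return NULL;
    }

    /* The controller would be spawned once during job startup. */
    static pthread_t start_cr_controller(const char *ckpt_file)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, cr_controller_thread, (void *) ckpt_file);
        return tid;
    }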

local C/R controller wakes up and locks the communi- Channel Progress Information Update Registered cation channels to prevent main thread from accessing User Dedicated Network Rebuild them during the checkpointing procedure. Buffers Communication Connection Buffers Information • Pre-checkpoint Coordination Phase: C/R controllers co- Hardware ordinate with each other to make all MPI processes indi- QPs CQs MRs PDs vidually checkpointable. To do so, C/R controllers coop- InfiniBand Channel Host Adaptor (HCA) erate with communication channel managers to suspend all communication channels temporarily and release the Figure 3. InfiniBand Channel for MVAPICH2 network connections in these channels. During the checkpoint/restart procedure described in pre- vious section, the consistency and transparency are two key • Local Checkpointing Phase: C/R controllers invoke C/R requirements. In this section, we explain how we transpar- library to save the current state of the local MPI process, ently suspend/reactivate the InfiniBand communication chan- including the state of suspended communication chan- nel while preserving the channel consistency. nels to a checkpoint file. The structure of InfiniBand communication channel in • Post-checkpoint Coordination Phase: C/R controllers MVAPICH2 is described by Figure 3. Below the MVAPICH2 cooperate with communication channel managers to re- InfiniBand channel is the InfiniBand Host Channel Adapter activate communication channels. This step involves re- (HCA), which maintains the network connection context, building the low level network connections and resolving such as Queue Pairs (QPs), Completion Queues (CQs), Mem- the possible inconsistency introduced by the potentially ory Regions (MRs), and Protection Domains (PDs). MVA- different network connection information. PICH2 InfiniBand channel state consists of four parts: • Network connection information is the user-level data The details about how to suspend/reactivate communication structures corresponding to the network connection con- channels consistently and transparently will be discussed in text. section 4.3. • Dedicated communication buffers are the registered To restart from a checkpoint, a restarting procedure is buffers which can be directly accessed by HCA for send- performed, which consists of restarting phase and post- ing/receiving small messages. checkpoint coordination phase. In Restarting Phase,the • Channel progress information is the data structures for global C/R coordinator first propagates the restart request to book-keeping and flow control, such as pending re- each node, where the C/R library is responsible for restarting quests, credits, etc. the local MPI process from the checkpoint file. Then, local • Registered user buffers are the memory allocated by user C/R controller reestablishes the connection between a MPI applications. These buffers are registered by commu- process and its process manager and performs necessary coor- nication channel to HCA for zero-copy transmission of dination between them. At this point, the MPI job is restarted large messages. from a state identical with the previous state in local check- pointing phase. Therefore, to continue running, it first goes These four parts need to be handled differently according to post-checkpoint coordination phase, and when all commu- to their different natures when performing checkpointing.

Network connection information needs to be cleaned up before checkpointing and re-initialized afterwards, as the network connection context in the HCA is released and rebuilt. Dedicated communication buffers and channel progress information need to be mostly kept the same but also updated partially, because they are closely coupled with the network connection information. Registered user buffers need to be re-registered, but their contents must be preserved entirely.

Now we explain the protocol for suspending and reactivating a communication channel, including a discussion of some design issues.

In the pre-checkpoint coordination phase, to suspend the communication channels, the channel managers first drain all in-transit messages, because otherwise these messages would be lost when the network context is released. So the protocol must guarantee that, up to a certain point, all messages before this point have been delivered and all messages after this point have not been posted to the network. Two things need to be noted here: (a) the word 'messages' refers to network-level messages rather than MPI-level messages, and one MPI-level message may involve several network-level messages; and (b) the synchronization points for different channels do not need to correspond to the same point in time, and each channel can have its own synchronization point.

Due to the First-In-First-Out (FIFO) and reliable nature of the InfiniBand Reliable Connection (RC) channel, draining in-transit messages can be achieved by exchanging flag messages between each pair of channel managers. That is, each process sends a flag through every channel it has and waits to receive a flag from every channel. Once these send and receive operations complete, all in-transit messages are known to be drained from the network, because the flag messages are the last ones in the channels. Then, the channel manager releases the underlying network connection.

One issue involved is when the channel manager should handle the drained in-transit messages. Because the communication channel is designed to execute the transmission protocol chosen by the upper layers of the MPI library, processing an incoming message may cause a response message to be sent, which will also need to be drained and which, in the worst case, may lead to an infinite 'ping-pong' livelock condition. To avoid that, the channel manager has to either buffer the drained messages for future processing, or process these messages but buffer the response messages instead of sending them immediately. We choose the latter approach because: (a) some control messages need to be processed immediately, and these control messages will not lead to any response message; and (b) the overhead for buffering is lower, as the number of response messages is generally smaller than the number of incoming messages.

In the post-checkpoint coordination phase, after rebuilding the underlying network connections, the channel manager first updates the local communication channel as described before, and then sends control messages to update the other side of the channel. The remote updating resolves the potential inconsistency introduced by invalid remote keys for RDMA operations. This issue has been discussed in Section 3. For example, for performance reasons, the rendezvous protocol for transmitting large messages is implemented with RDMA write operations. To achieve high responsiveness and transparency, our design allows the rendezvous protocol to be interrupted by checkpointing. Therefore the remote keys cached on the sender side for RDMA write become invalid because of the re-registration on the receiver side. Hence, the channel manager on the receiver side needs to capture the refreshed remote keys and send them to the sender side.
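A simplified sketch of the suspension (drain) step described above is given below. send_flag_message, poll_incoming, and the other helpers are hypothetical stand-ins for the channel manager internals; the point is the ordering: post the flag last on every channel, drain while buffering any triggered responses, and only then release the connections.

    /* Pre-checkpoint coordination for one process, simplified. Because each
     * RC channel is reliable and FIFO, a flag message is guaranteed to be
     * the last message in that channel once it has been received. */
    extern int  num_channels;
    extern void send_flag_message(int chan);          /* hypothetical helpers */
    extern int  poll_incoming(int chan, void **msg);  /* 1 when a msg arrives */
    extern int  is_flag_message(void *msg);
    extern void process_message_buffering_responses(void *msg);
    extern void release_network_connection(int chan);

    static void suspend_all_channels(void)
    {
        int flags_seen = 0;

        /* The flag is the last message this process posts on every channel. */
        for (int c = 0; c < num_channels; c++)
            send_flag_message(c);

        /* Drain until a flag has arrived on every channel. Incoming messages
         * are processed, but responses they trigger are buffered rather than
         * sent, to avoid a 'ping-pong' livelock during draining. */
        while (flags_seen < num_channels) {
            for (int c = 0; c < num_channels; c++) {
                void *msg;
                while (poll_incoming(c, &msg)) {
                    if (is_flag_message(msg))
                        flags_seen++;
                    else
                        process_message_buffering_responses(msg);
                }
            }
        }

        /* All in-transit messages are drained; the connections can go away. */
        for (int c = 0; c < num_channels; c++)
            release_network_connection(c);
    }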

5 Performance Evaluation

In this section, we describe experimental results and analyze the performance of our current implementation, which is based on MVAPICH2-0.9.2. The experiments are conducted on an InfiniBand cluster of 12 nodes. Each node is equipped with dual Intel Xeon 3.4GHz CPUs, 2GB memory, and a Mellanox MT25208 PCI-Express InfiniBand HCA. The operating system used is RedHat Linux AS4 with kernel 2.6.11. The file system we use is ext3 on top of a local SATA disk.

We evaluate the performance of our implementation using the NAS Parallel Benchmarks [32], the High-Performance Linpack (HPL) benchmark [2], and GROMACS [11]. First, we analyze the overhead of taking checkpoints and restarting from checkpoints, and then we show the performance impact on applications of taking checkpoints periodically.

5.1 Overhead Analysis for Checkpoint/Restart

In this section, we analyze the overhead for C/R in terms of checkpoint file size, checkpointing time, and restarting time. We choose BT, LU, and SP from the NAS Parallel Benchmarks and the HPL benchmark, because they reflect computation kernels commonly used in scientific applications.

Because checkpointing involves saving the current state of running processes to reliable storage, taking a system-level full checkpoint involves writing all used memory pages within the process address space to the checkpoint file. Therefore, the checkpoint file size is determined by the memory footprint of the process, in this case the MPI process. Table 1 shows the checkpoint file sizes per process for BT, LU, and SP, class C, using 8 or 9 processes.

    Benchmark                     lu.C.8   bt.C.9   sp.C.9
    Checkpoint File Size (MB)     126      213      193

    Table 1. Checkpoint File Size per Process

The time for checkpointing/restarting is determined mainly by two factors: the time for coordination, which increases with the system size, and the time for writing/reading the checkpoint file to/from the file system, which depends on both the checkpoint file size and the performance of the underlying file system.

Figure 4 shows the time for checkpointing/restarting the NAS benchmarks. It also provides the accessing time of the checkpoint file for comparison. On our testbed, we have observed that the file accessing time is the dominating factor. In a real-world deployment, file writing can be designed to be non-blocking and to overlap with program execution, and incremental checkpointing techniques can also be applied to reduce the checkpoint file size. We plan to further investigate how to speed up the commitment of checkpoint files.

[Figure 4. Overall Time for Checkpointing/Restarting NAS]

To further analyze the coordination overhead, we excluded the file accessing time and broke the coordination time down into individual phases. As shown in Figure 5, for checkpointing, post-checkpoint coordination consumes most of the time. The reason is that this step involves a relatively time-consuming component, the establishment of InfiniBand connections, which has been explored in our previous study [34]. For restarting, the post-checkpoint coordination consumes almost the same amount of time as for checkpointing, but the major part of the time is in the restarting phase, mainly spent by MPD and BLCR for spawning processes on multiple nodes.

[Figure 5. Coordination Time for Checkpointing/Restarting NAS]

To evaluate the scalability of our design, we measure the average coordination time for checkpointing the HPL benchmark using 2, 4, 6, 8, 10, and 12 processes with one process on each node. In this experiment we choose the problem size to let the HPL benchmark consume around 800MB of memory per process.

To improve the scalability, we adopt the bootstrap channel technique described in [34] to reduce the InfiniBand connection establishment time from the order of O(N^2) to O(N), where N is the number of connections. As shown in Figure 6, because the dominating factor, the post-checkpoint coordination time, is O(N), the overall coordination time is also in the order of O(N). To further improve the scalability of checkpoint/restart, we plan to utilize the adaptive connection management model [33] to reduce the number of active InfiniBand connections.

[Figure 6. Coordination Time for Checkpointing HPL]

Nonetheless, with the current performance, we believe our design is sufficient for checkpointing many applications, because the time for checkpointing/restarting is insignificant compared to the execution time of the applications.

5.2 Performance Impact for Checkpointing

[Figure 7. Performance Impact for Checkpointing NAS]

In this section, we evaluate the performance of our system in a working scenario. In the real world, periodically checkpointing MPI applications is a commonly used method to achieve fault tolerance. We conduct experiments to analyze the performance impact of taking checkpoints at different frequencies during the execution of applications. We use LU, BT, and SP from the NAS benchmarks and the HPL benchmark to simulate user applications. We also include a real-world application called GROMACS [11], which is a package that performs molecular dynamics simulations for biochemical analysis.

In our design, there is very little extra book-keeping overhead on data communication introduced by the C/R functionality, so our checkpoint/restart-capable MVAPICH2 has almost the same performance as the original MVAPICH2 if no checkpoint is taken.

As shown in Figure 7, the total running time of LU, BT, and SP decreases as the checkpointing interval increases. The additional execution time caused by checkpointing matches the theoretical value: checkpointing time × number of checkpoints. Figure 8 shows the impact on the computed performance in GFLOPS of the HPL benchmark for 8 processes.

[Figure 8. Performance Impact for Checkpointing HPL]

Because these benchmarks load all data into memory at the beginning of execution, the checkpoint file size is relatively large. Therefore, in our experiments, the dominating part of the checkpointing overhead, the file writing time, is relatively large. But even with this overhead, the performance does not decrease much with a reasonably long checkpointing interval, 4 minutes for example.

On the other hand, many real-world applications may spend hours or even days processing many thousands of datasets in a run. Normally only a small portion of the datasets is loaded into memory at any point in time, so the memory footprints of these applications are relatively small. Therefore the overhead for checkpointing is lower relative to their running time. To evaluate this case, we run GROMACS on the DPPC benchmark dataset [11]. As shown in Figure 9, the time for checkpointing GROMACS is less than 0.3% of its execution time, and even if GROMACS is checkpointed every minute, the performance degradation is still around 4%.

[Figure 9. Performance Impact for Checkpointing GROMACS]

From these experiments we can conclude that for long running applications the performance impact of checkpointing is negligible, and even for memory intensive applications, with a reasonable checkpointing frequency, the performance impact is insignificant.
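The behavior observed above can be summarized with a simple cost model; the notation below is introduced here for illustration and is not taken from our implementation.

    % T_base: execution time without checkpointing, I: checkpointing interval,
    % t_ckpt: cost of one checkpoint, N ~ T_base / I checkpoints during the run.
    \[
      T_{\mathrm{total}} \approx T_{\mathrm{base}} + N \cdot t_{\mathrm{ckpt}},
      \qquad
      \text{relative overhead} \approx \frac{t_{\mathrm{ckpt}}}{I}.
    \]

Under this model, a roughly 4% slowdown for GROMACS checkpointed every minute would correspond to an effective per-checkpoint cost on the order of 0.04 × 60 s ≈ 2.4 s, which is consistent with the observation that the per-checkpoint time is small compared to the execution time.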

6 Related Work

Algorithms for coordinated checkpointing have been proposed since the mid-80s [30, 9], and a detailed comparison of various rollback recovery protocols, including both checkpointing and message logging, can be found in [13].

Many efforts have been carried out to provide fault tolerance to message-passing based parallel programs. FT-MPI [14] has extended the MPI specification to provide support for applications to achieve fault tolerance at the application level; an architecture has been designed and an implementation has been built based on HARNESS [5]. LAM/MPI [28] has incorporated checkpoint/restart capabilities based on Berkeley Lab's Checkpoint/Restart (BLCR) [12], and a framework to checkpoint MPI programs running over TCP/IP networks was developed. Another approach to achieve fault tolerance, using uncoordinated checkpointing and message logging, is studied in the MPICH-V project [7]. They have used the Condor checkpoint library [19] to checkpoint MPI processes, and designed and evaluated a variety of message logging protocols for uncoordinated checkpointing of MPI programs over TCP/IP networks. In [17], the design of a fault-tolerant MPI over Myrinet based on MPICH-GM [23] is described. Other research efforts toward fault tolerant message passing systems include Starfish [3], CoCheck [29], LA-MPI [4], Egida [27], CLIP [10], etc. Recently, a low-overhead, kernel-level checkpointer called TICK [16] has been designed for parallel computers with incremental checkpointing support.

Our work differs from the previous related works in that we address the challenges of checkpointing MPI programs over InfiniBand. The details of these challenges are discussed in Section 3. Although Myrinet is another high performance interconnect similar to InfiniBand in some aspects, its network API, Myrinet GM [23], follows a connectionless model, which is quite different from the commonly used InfiniBand transport service, Reliable Connection (RC). Additionally, different from InfiniBand, the RDMA operations provided by Myrinet GM do not require remote keys for authentication. Therefore, a solution over Myrinet GM is not readily applicable to InfiniBand.

7 Conclusions and Future Work

In this paper, we have presented our design of a checkpoint/restart framework for MPI over InfiniBand. Our design enables application-transparent, coordinated checkpointing to save the state of the whole MPI program into checkpoints stored on reliable storage for future restart. We have evaluated our design using the NAS benchmarks, the HPL benchmark, and GROMACS. Experimental results indicate that our design incurs a low overhead for checkpointing, and the performance impact of checkpointing on long running applications is insignificant. To the best of our knowledge, this work is the first report of checkpoint/restart support for MPI over InfiniBand clusters in the literature.

In future, we plan to work on the issues related to SMP channels and also to incorporate adaptive connection management [33] to reduce the checkpointing overhead. In the longer term, we plan to investigate the management of checkpoint files and auto-recovery from failures, and then conduct a more thorough study of the overall performance of checkpointing and rollback recovery under different MTBFs.

References

[1] MPICH2, Argonne. http://www-unix.mcs.anl.gov/mpi/mpich2/.
[2] A. Petitet, R. C. Whaley, J. Dongarra, and A. Cleary. http://www.netlib.org/benchmark/hpl/.
[3] A. Agbaria and R. Friedman. Starfish: Fault-tolerant dynamic MPI programs on clusters of workstations. In Proceedings of the IEEE Symposium on High Performance Distributed Computing (HPDC) 1999, pages 167–176, August 1999.
[4] R. T. Aulwes, D. J. Daniel, N. N. Desai, R. L. Graham, L. D. Risinger, M. W. Sukalski, M. A. Taylor, and T. S. Woodall. Architecture of LA-MPI, a network-fault-tolerant MPI. In Proceedings of the Int'l Parallel and Distributed Processing Symposium, Santa Fe, NM, April 2004.
[5] M. Beck, J. J. Dongarra, G. E. Fagg, G. A. Geist, P. Gray, J. Kohl, M. Migliardi, K. Moore, T. Moore, P. Papadopoulous, S. L. Scott, and V. Sunderam. HARNESS: A Next Generation Distributed Virtual Machine. Future Generation Computer Systems, 15(5–6):571–582, 1999.
[6] R. A. F. Bhoedjang, T. Ruhl, and H. E. Bal. User-Level Network Interface Protocols. IEEE Computer, 31(11):53–60, 1998.
[7] G. Bosilca, A. Bouteiller, F. Cappello, S. Djilali, G. Magniette, V. Néri, and A. Selikhov. MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes. In Proceedings of IEEE/ACM SC'2002, Baltimore, MD, November 2002.
[8] R. Butler, W. Gropp, and E. Lusk. Components and Interfaces of a Process Management System for Parallel Programs. Parallel Computing, 27(11):1417–1429, 2001.
[9] M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, 3(1), 1985.
[10] Y. Chen, K. Li, and J. S. Plank. CLIP: A Checkpointing Tool for Message-passing Parallel Programs. In Proceedings of IEEE/ACM SC'97, November 1997.
[11] D. Van Der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, and H. J. C. Berendsen. GROMACS: Fast, flexible, and free. Journal of Computational Chemistry, 26:1701–1718, 2005.
[12] J. Duell, P. Hargrove, and E. Roman. The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart. Technical Report LBNL-54941, Berkeley Lab, 2002.
[13] E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A Survey of Rollback-recovery Protocols in Message Passing Systems. Technical Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 1996.
[14] G. E. Fagg, E. Gabriel, G. Bosilca, T. Angskun, Z. Chen, J. Pjesivac-Grbovic, K. London, and J. J. Dongarra. Extending the MPI Specification for Process Fault Tolerance on High Performance Computing Systems. In Proceedings of the International Supercomputer Conference (ICS), Heidelberg, 2003.
[15] Future Technologies Group (FTG). http://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml.
[16] R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini. Transparent incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers. In Proceedings of ACM/IEEE SC'2005, Seattle, WA, November 2005.
[17] H. Jung, D. Shin, H. Han, J. W. Kim, H. Y. Yeom, and J. Lee. Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3). In Proceedings of ACM/IEEE SC'2005, Seattle, WA, November 2005.
[18] InfiniBand Trade Association. http://www.infinibandta.org.
[19] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of UNIX processes in the Condor distributed processing system. Technical Report UW-CS-TR-1346, University of Wisconsin-Madison Computer Sciences Department, April 1997.
[20] J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda. High Performance RDMA-Based MPI Implementation over InfiniBand. In 17th Annual ACM International Conference on Supercomputing (ICS '03), June 2003.
[21] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. The International Journal of Supercomputer Applications and High Performance Computing, 1994.
[22] Myricom. http://www.myri.com.
[23] Myricom. Myrinet Software and Customer Support. http://www.myri.com/scs/, 2003.
[24] Network-Based Computing Laboratory. MVAPICH: MPI for InfiniBand. http://nowlab.cse.ohio-state.edu/projects/mpi-iba/.
[25] Open InfiniBand Alliance. http://www.openib.org.
[26] B. Randell. System structure for software fault tolerance. IEEE Transactions on Software Engineering, SE-1(2):220–232, 1975.
[27] S. Rao, L. Alvisi, and H. M. Vin. Egida: An extensible toolkit for low-overhead fault-tolerance. In Symposium on Fault-Tolerant Computing, pages 48–55, 1999.
[28] S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman. The LAM/MPI Checkpoint/Restart Framework: System-Initiated Checkpointing. International Journal of High Performance Computing Applications, pages 479–493, 2005.
[29] G. Stellner. CoCheck: Checkpointing and process migration for MPI. In Proceedings of the International Parallel Processing Symposium, pages 526–531, April 1996.
[30] Y. Tamir and C. H. Sequin. Error Recovery in Multicomputers Using Global Checkpoints. In Proceedings of the Int'l Conference on Parallel Processing, pages 32–41, 1984.
[31] TOP 500 Supercomputers. http://www.top500.org/.
[32] F. C. Wong, R. P. Martin, R. H. Arpaci-Dusseau, and D. E. Culler. Architectural Requirements and Scalability of the NAS Parallel Benchmarks. In Proceedings of Supercomputing, 1999.
[33] W. Yu, Q. Gao, and D. K. Panda. Adaptive Connection Management for Scalable MPI over InfiniBand. In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS) 2006, Rhodes Island, Greece, April 2006.
[34] W. Yu, J. Wu, and D. K. Panda. Fast and Scalable Startup of MPI Programs in InfiniBand Clusters. In Proceedings of the International Conference on High Performance Computing 2004, Bangalore, December 2004.
