MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cecile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frederic Magniette, Vincent Neri, Anton Selikhov LRI, Universite´ de Paris Sud, Orsay, France Abstract and Peer-to-Peer computing systems are examples of this trend. Global Computing platforms, large scale clusters and fu- For systems gathering thousands of nodes, node failures ture TeraGRID systems gather thousands of nodes for com- or disconnections are not rare, but frequent events. For puting parallel scientific applications. At this scale, node Large Scale Machine like the ASCI-Q machine, the MTBF failures or disconnections are frequent events. This Volatil- (Mean Time Between Failure) for the full system is esti- ity reduces the MTBF of the whole system in the range of mated to few hours. The Google Cluster using about 8000 hours or minutes. nodes experiences a node failure rate of 2-3% per year [6]. We present MPICH-V, an automatic Volatility tolerant This can be translated to a node failure every 36 hours. A MPI environment based on uncoordinated checkpoint/ roll- recent study of the availability of desktop machines within a back and distributed message logging. MPICH-V architec- large industry network ( 64,000 machines) [5], which is a ture relies on Channel Memories, Checkpoint servers and typical Large Scale Virtual PC Farms targeted for Global theoretically proven protocols to execute existing or new, Computing platforms in industry and university, demon- SPMD and Master-Worker MPI applications on volatile strates that from 5% to 10% of the machines become un- nodes. reachable in a 24 hour period. Moreover a life time evalua- To evaluate its capabilities, we run MPICH-V within a tion states that 1/3 of the machines disappeared (the con- framework for which the number of nodes, Channels Mem- nection characteristics of the machine changed) in a 100 ories and Checkpoint Servers can be completely configured days time period. The situation is even worse for Inter- as well as the node Volatility. We present a detailed per- net Global and Peer-to-Peer Computing platforms relying formance evaluation of every component of MPICH-V and on cycle stealing for resource exploitation. In such environ- its global performance for non-trivial parallel applications. ment, nodes are expected to stay connected (reachable) less Experimental results demonstrate good scalability and high than 1 hour per connection. tolerance to node volatility. A large portion of the applications executed on large scale clusters, TeraScale machines and envisioned for Tera- GRID systems are parallel applications using MPI as message passing environment. This is not yet the case for Large 1 Introduction Scale Distributed Systems. Their current application scope considers only bag of tasks, master-worker applications, or A current trend in Technical Computing is development document exchanging (instant messaging, music and movie of Large Scale Parallel and Distributed Systems (LSPDS) files). However many academic and industry users would gathering thousands of processors. Such platforms are the like to execute parallel applications with communication result of construction of a single machine or the clustering between tasks on Global and Peer-to-Peer platforms. The of loosely coupled computers that may belong to geograph- current lack of message passing libraries allowing the exe- ically distributed computing sites. TeraScale computers like cution of parallel applications is one limitation to a wider the ASCI machines in US, the Tera machine in France, large distribution of these technologies. scale PC clusters, large scale LANs of PC used as clusters, For parallel and distributed systems, the two main future GRID infrastructures (such as the US and Japan Ter- sources of failure/disconnection are the nodes and the net- aGRID), large scale virtual PC farms built by clustering PCs work. Human factors (machine/application shutdown, net- of several sites, large scale distributed systems like Global work disconnection), hardware or software faults may also 0-7695-1524-X/02 $17.00 (C) 2002 IEEE be at the origin of failures/disconnections. For a sake of Non named receptions The last difficulty is in message simplicity, we consider the failures/disconnections as node receptions with no sender identity. Some MPI low level volatility: the node is no more reachable and the eventual control messages as well as the user level API may al- results computed by this node after the disconnection will low such receptions. For checkpoint/restart fault tolerance not be considered in the case of a later reconnection. The approaches, the difficulty comes from two points: node last statement is reasonable for message passing parallel ap- volatility and scalability. For the execution correctness, plications since a volatile node will not be able to contribute internal task events and task communications of restarted any more (or for a period of time) to the application and, tasks should be replayed in a consistent way according to further more, it may stall (or slowdown) the other nodes ex- previous executions on lost nodes. The scalability issue changing messages with it. Volatility can affect individual comes from the unknown sender identity in non named re- or group of nodes when a network partition failure occurs ceptions. A mechanism should be designed to prevent the within a single parallel computer, a large scale virtual PC need for the receiver to contact every other nodes of the sys- farm or an Internet Global Computing platform. tem. The cornerstone for executing a large set of existing par- In this paper we present MPICH-V, a distributed, asyn- allel applications on Large Scale Parallel and Distributed chronous automatic fault tolerant MPI implementation de- Systems is a scalable fault tolerant MPI environment for signed for large scale clusters, Global and Peer-to-Peer Volatile nodes. Building such message passing environment Computing platforms. MPICH-V solves the above four means providing solution for several issues: issues using an original design based on uncoordinated checkpoint and distributed message logging. The design Volatility tolerance It relies on redundancy and/or check- of MPICH-V considers additional requirements related to point/restart concepts. Redundancy implies classical tech- standards and ease of use: 1) it should be designed from niques such as task cloning, voting, and consensus. Check- a widely distributed MPI standard implementation to en- point/restart should consider moving task contexts and sure wide acceptance, large distribution, portability and take restarting them on available nodes (i.e., task migrations) benefit of the implementation improvements and 2) a typical since lost nodes may not come back before a long time. user should be able to execute an existing MPI application on top of volunteer personal computers connected to Inter- Highly distributed & asynchronous checkpoint and net. This set of requirements involves several constraints: restart protocol Designing a volatility tolerant message a) running an MPI application without modification, b) en- passing environment and considering thousands of nodes, it suring transparent fault tolerance for users, c) keeping the is necessary to build a distributed architecture. Moreover, hosting MPI implementation unmodified, d) tolerating N si- a checkpoint/restart based volatility tolerant system should multaneous faults (N being the number of MPI processes in- not rely on global synchronization because 1) it would con- volved in the application), e) bypassing fire-walls, f) featur- siderably increase the overhead and 2) some nodes may ing scalable infrastructure and g) involving only user level leave the system during synchronization. libraries. The second section of the paper presents a survey of fault-tolerance message passing environments for parallel Inter-administration domain communications Har- computing and show how our work differs. Section 3 nessing computing resources belonging to different presents the architecture, and overviews every component. administration domains implies dealing with fire-walls. Section 4 evaluates the performance of MPICH-V and its When the system gathers a relatively small number of sites components. like for most of the currently envisioned GRID deploy- ments, security tool sets like GSI [13] can be used to allow communications between different administration domains. 2 Related Work Global and Peer-to-Peer Computing systems generally use a more dynamic approach because they gather a very A message passing environment for Global and Peer-to- large number of resources for which the security issue Peer computing involves techniques related to Grid comput- cannot be discussed individually. Thus, many P2P systems ing (enabling nodes belonging to different administration overcomes fire-walls by using their asymmetric protection domains to collaborate for parallel jobs) and fault tolerance set-up. Usually, fire-walls are configured to stop incoming (enabling the parallel execution to continue despite the node requests and accept incoming replies to outgoing requests. volatility). When client and server nodes are both fire-walled, a non Many efforts have been conducted to provide MPI envi- protected resource implements a relay between them which ronments for running MPI applications across multiple par- works as a post office where communicating nodes

Load more