A Failure Detector for HPC Platforms
Article
The International Journal of High Performance Computing Applications, 1–20
© The Author(s) 2017
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/1094342017711505
journals.sagepub.com/home/hpc

George Bosilca¹, Aurelien Bouteiller¹, Amina Guermouche², Thomas Herault¹, Yves Robert¹,³, Pierre Sens⁴ and Jack Dongarra¹,⁵,⁶

¹ ICL, University of Tennessee Knoxville, Knoxville, TN, USA
² Telecom SudParis, Évry, France
³ LIP, École Normale Supérieure de Lyon, Lyon, France
⁴ LIP6, Université Paris 6, Paris, France
⁵ Oak Ridge National Lab, Oak Ridge, TN, USA
⁶ Manchester University, Manchester, UK

Corresponding author: Yves Robert, ICL, University of Tennessee Knoxville, Knoxville, TN, USA; LIP, École Normale Supérieure de Lyon, Lyon, France. Email: [email protected]

Abstract
Building an infrastructure for exascale applications requires, in addition to many other key components, a stable and efficient failure detector. This article describes the design and evaluation of a robust failure detector that can maintain and distribute the correct list of alive resources within proven and scalable bounds. The detection and distribution of the fault information follow different overlay topologies that together guarantee minimal disturbance to the applications. A virtual observation ring minimizes the overhead by allowing each node to be observed by a single other node, providing an unobtrusive behavior. The propagation stage uses a nonuniform variant of a reliable broadcast over a circulant graph overlay network and guarantees logarithmic fault propagation. Extensive simulations, together with experiments on the Titan Oak Ridge National Laboratory supercomputer, show that the algorithm performs extremely well and exhibits all the desired properties of an exascale-ready algorithm.

Keywords
MPI, failure detection, fault tolerance

1. Introduction

Failure detection (FD) is a prerequisite to failure mitigation and a key component of any infrastructure that requires resilience. This article is devoted to the design and evaluation of a reliable algorithm that will maintain and distribute the updated list of alive resources with a guaranteed maximum delay. We consider a typical high-performance computing (HPC) platform in steady-state operation mode. Because in such environments the transmission time can be considered as bounded (although that bound is unknown), it becomes possible to provide a perfect failure detector according to the classical definition of Chandra and Toueg (1996). A failure detector is a distributed service able to return the state of any node, alive or dead (subject to a crash). A failure detector is perfect if any node death is eventually suspected by all surviving nodes and if no surviving node ever suspects a node that is still alive.
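Bounded transmission times are precisely what make such a perfect detector implementable. The following minimal sketch (our illustration, not the authors' code; all names and constants are hypothetical) shows the classical timeout rule: if the timeout is chosen larger than the heartbeat emission period plus the actual (unknown but finite) transmission bound, no alive node is ever suspected, and every dead node is suspected once the timeout expires.

```c
/* Minimal sketch (hypothetical names) of a timeout-based detection
 * rule. The emitter sends a heartbeat every PERIOD seconds; TIMEOUT is
 * assumed larger than PERIOD plus the (unknown but finite) bound on
 * message transmission time. */
#include <stdbool.h>
#include <stdio.h>

#define PERIOD  1.0 /* heartbeat emission period (seconds)            */
#define TIMEOUT 3.0 /* conservative: > PERIOD + transmission bound    */

static bool suspect_dead(double now, double last_heartbeat)
{
    /* An alive emitter would have produced a later heartbeat, and it
     * would have arrived by now; silence this long implies a crash
     * (accuracy). A crashed emitter is suspected within TIMEOUT of
     * its last heartbeat (completeness). */
    return now - last_heartbeat > TIMEOUT;
}

int main(void)
{
    double last = 10.0; /* time the last heartbeat was received */
    printf("t = 12.5 -> %s\n", suspect_dead(12.5, last) ? "dead" : "alive");
    printf("t = 13.5 -> %s\n", suspect_dead(13.5, last) ? "dead" : "alive");
    return 0;
}
```

The article's contribution is not this local rule but the scalable observation and propagation topologies built around it.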
Critical fault-tolerant algorithms for HPC and implementations of communication middleware for unreliable systems rely on the strong properties of perfect failure detectors (see e.g. Bland et al., 2013a, 2013b, 2015; Egwutuoha et al., 2013; Herault et al., 2015; Katti et al., 2015). Their cost in terms of computation and communication overhead, as well as their properties in terms of latency to detect and notify failures and of reliability, thus have a significant impact on the overall performance of a fault-tolerant HPC solution. A major factor in assessing the efficacy of an FD algorithm is the trade-off that it achieves between scalability and the speed of information propagation in the system.

Although we focus primarily on the most widely used programming paradigm, the message passing interface (MPI), the techniques and algorithms proposed have a larger scope and are applicable in any resilient distributed programming environment. We consider the platform as being initially composed of N nodes, but with high probability, some of these resources will become unavailable throughout the execution. When exposed to the death of a node, traditional applications would abort. However, the applications that we consider are augmented with fault-tolerant extensions that allow them to continue across failures (e.g. Bland et al., 2013), either using a generic or an application-specific fault-tolerant model. The design of this model is outside the scope of this article, but without loss of generality, we can safely assume that any fault-tolerant recovery model requires a robust fault detection mechanism. Our goal is to design such a robust protocol that can detect all failures and enable the efficient repair of the execution platform.

By repairing the platform, we mean that all surviving nodes will eventually be notified of all failures and will therefore be able to compute the list of surviving nodes. The state of the platform where all dead nodes are known to all processes is called a stable configuration (note that nodes may not be aware that they are in a stable configuration).

By robust, we mean that regardless of the length of the execution, if a set of up to f failures disrupts the platform and precipitates it into an unstable configuration, the protocol will bring the platform back into a stable configuration within T(f) time units (we will define T(f) later in the article). Note that the goal is not to tolerate up to f failures overall. On the contrary, the protocol will tolerate an arbitrary number of failures throughout an unbounded-length execution, provided that no more than f successive overlapping failures strike within the T(f) time window. Hence, f induces a constraint on the frequency of failures, but not on the total number of failures.

By efficiently, we aim at a low-overhead protocol that limits the number of messages exchanged to detect the faults and repair the platform. Although we assume a fully connected platform (any node may communicate with any other), we use a realistic one-port communication model (Bhat et al., 2003) where a node can send and/or receive at most one message at any time step. Independent communications, involving distinct sender/receiver pairs, can take place in parallel; however, two messages sent by the same node will be serialized. Note that the one-port model is only an assumption used to model the performance and provide an upper bound for the overheads. In real situations where platforms support multiport communications, our algorithm is capable of taking advantage of such capabilities.

All these goals seem contradictory, but they only call for a carefully designed trade-off. As shown in the studies by Ferreira et al. (2008), Hoefler et al. (2010), and Kharbas et al. (2012), system noise created by the messages and computations of the fault detection mechanism can impose significant overheads in HPC applications. Here, system noise is broadly defined as the impact of operating system and architectural overheads onto application performance. Hence, the efficiency of the approach must be carefully assessed. The overhead should be kept minimal in the absence of failures, while FD and propagation should execute quickly, which usually implies a robust broadcast operation that introduces many messages. The major contributions of this work are as follows:

• It provides a proven algorithm for FD based on a robust protocol that tolerates an arbitrary number of failures, provided that no more than f consecutive failures strike within a time window of duration T(f).
• The protocol has minimal overhead in failure-free operation, with a unique observer per node.
• The protocol achieves FD and propagation in logarithmic time for up to f_max = ⌊log n⌋ - 1 failures, where n is the number of alive nodes. More precisely, the bound T(f_max) is deterministic and logarithmic in n, even in the worst case (see the sketch after this list).
• All performance guarantees are expressed within a realistic one-port communication model.
• It provides a detailed theoretical and practical comparison with randomized protocols.
• Extensive simulations and experiments with user-level failure mitigation (ULFM; Bland et al., 2013) show very good performance of the algorithm.
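To give a sense of scale for the f_max bound, here is a small worked example (our sketch, not the authors' code; we assume the article's log is the base-2 logarithm, consistent with a circulant graph overlay):

```c
/* Sketch: largest failure burst tolerated within one propagation
 * window, f_max = floor(log2(n)) - 1, for n alive nodes. Assumes the
 * logarithm is base 2; compile with -lm. */
#include <math.h>
#include <stdio.h>

static int f_max(int n)
{
    return (int)floor(log2((double)n)) - 1;
}

int main(void)
{
    int sizes[] = { 64, 1024, 100000, 1000000 };
    for (int i = 0; i < 4; i++)
        printf("n = %7d alive nodes -> f_max = %2d overlapping failures\n",
               sizes[i], f_max(sizes[i]));
    return 0;
}
```

Even at a million alive nodes, the tolerated burst stays below 20, which is what keeps the propagation window logarithmic.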
The rest of the article is organized as follows: We start with an informal description of the algorithm in Section 2. We detail the model, the proof of correctness, and the time-performance analysis in Section 3. Then, we assess the efficiency of the algorithm in a practical setting, first by reporting on a comprehensive set of simulations in Section 4, and then by discussing experimental results on the Titan Oak Ridge National Laboratory (ORNL) supercomputer in Section 5. Section 6 provides an overview of related work. Finally, we outline conclusions and directions for future work in Section 7.

2. Algorithm

This section provides an informal description of the algorithm (for a list of the main notations, see Table 1). We refer to Section 3 for a detailed presentation of the model, a proof of correctness, and a time-performance analysis. We maintain two main invariants in the algorithm:

1. Each alive node maintains its own list of known dead resources.
2. Alive nodes are arranged along a ring, and each node observes its predecessor in the ring. In other words, the successor/observer receives heartbeats from its predecessor/emitter (see below).

Table 1. List of notations.

  Platform parameters
    N    Initial number of nodes
    t    Upper bound on the time to transfer a message
  Protocol parameters
    h    Period for heartbeats

When a node dies, its observer broadcasts the information and reconnects the ring: From now onward, the observer will observe the last known predecessor (accounting for locally known failures) of its former emitter. A sketch of this reconnection rule closes this section.

Consider a broadcast initiated by a node i. Let A be the complement of D_i (the list of dead nodes known to node i; cf. the first invariant above) in {0, 1, ..., N - 1}, and let n = |A|. The elements of A are labeled from 0 to n - 1, where the source i of the broadcast is labeled 0. The broadcast is tagged with a unique identifier and involves only nodes of the labeled list A (this list is computable at each participant, as D_i is part of the message).
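This relabeling is easy to compute locally. The sketch below is our own (hypothetical names); it assumes the survivors are enumerated in increasing cyclic node order starting from the source i, which gives the source label 0 as the text requires:

```c
/* Sketch of the broadcast relabeling: A = {0..N-1} \ D_i is the set of
 * presumed-alive nodes; its n elements get labels 0..n-1, the source
 * getting label 0. Every participant computes the same labels, since
 * D_i travels with the broadcast message. Hypothetical names; cyclic
 * enumeration order is our assumption. */
#include <stdbool.h>
#include <stdio.h>

#define N 8 /* initial number of nodes */

/* Fill labels[] so that labels[k] is the node holding label k; return
 * n = |A|. Nodes are scanned cyclically from the source, skipping D_i. */
static int relabel(int source, const bool dead[N], int labels[N])
{
    int n = 0;
    for (int step = 0; step < N; step++) {
        int node = (source + step) % N;
        if (!dead[node])
            labels[n++] = node; /* the source gets label 0 */
    }
    return n;
}

int main(void)
{
    bool dead[N] = { false };
    dead[1] = dead[6] = true;         /* D_i = {1, 6}            */
    int labels[N];
    int n = relabel(3, dead, labels); /* broadcast source: node 3 */
    printf("n = %d alive nodes; labels:\n", n);
    for (int k = 0; k < n; k++)
        printf("  label %d -> node %d\n", k, labels[k]);
    return 0;
}
```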
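Finally, here is the ring-reconnection rule announced above, again as our own sketch with hypothetical names: the observer walks backward along the virtual ring, skipping every node in its locally known dead list, until it reaches a presumably alive node, which becomes its new emitter.

```c
/* Sketch of ring reconnection: after its emitter dies, an observer
 * picks as new emitter the closest presumed-alive predecessor on the
 * virtual ring, skipping all locally known dead nodes. Hypothetical
 * names; not the article's implementation. */
#include <stdbool.h>
#include <stdio.h>

#define N 8 /* initial number of nodes */

/* known_dead[j] is true if node j is in this observer's local dead
 * list (D_i in the article's notation). */
static int new_emitter(int self, const bool known_dead[N])
{
    int candidate = (self + N - 1) % N;      /* immediate predecessor */
    while (known_dead[candidate] && candidate != self)
        candidate = (candidate + N - 1) % N; /* skip dead predecessors */
    return candidate; /* equals self only if every other node is dead */
}

int main(void)
{
    bool dead[N] = { false };
    dead[4] = dead[3] = true; /* node 5 learns that nodes 4 and 3 died */
    printf("node 5 now observes node %d\n", new_emitter(5, dead));
    return 0;
}
```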