Sequencer: smart control of hardware and components in clusters (and beyond).

Dr. Pierre Vignéras, [email protected]
Extreme Computing R&D, Bull, Architect of an Open World, www.bull.com
September 15, 2011

Abstract

Starting/stopping a whole cluster or a part of it is a real challenge considering the different commands related to various device types and manufacturers, and the order that should be respected. This article presents a solution called the sequencer that allows the automatic shutting down and starting up of clusters, subsets of clusters or even data-centers. It provides two operation modes designed for ease of use and emergency conditions. Our product has been designed to be efficient and it is currently used to power on and power off one of the largest clusters in the world: the Tera-100, made of more than 4000 nodes.

Keywords: emergency power off, start/stop procedure, actions sequencing, workflow, planning, cluster management, automation, case study.

1 Introduction

Emergency Power Off (EPO) is often not considered from a system administration point of view in traditional clusters (in this article, we consider clusters because they are the first target of our solution; this solution however also applies to general data-centers). Even for Tier-IV infrastructures [1], EPO may happen for various reasons. In such cases, stopping (or starting, both cases are addressed) macro components such as a whole rack or a rack set requires an appropriate sequence of actions. Considering the vast number of different components a cluster is composed of:

nodes: compute nodes, login nodes, management nodes, io (nfs, lustre, ...) nodes, ... (from a hardware perspective, a node in a cluster is just a computer; a distinction is made however between nodes depending on their roles in the cluster: a user might connect to a login node for development and job submission, the batch scheduler runs on the management node and dispatches jobs to compute nodes, compute nodes access storage through io nodes, and so on);

hardware: power switches (also called Power Distribution Units or PDUs), ethernet switches, infiniband [2] switches, cold doors (a cold door is a water-based cooling system produced by Bull that allows high density servers in the order of 40 kW per rack), disk arrays, ...

powering on/off a whole set of racks can be a real challenge.

First, since it is made of a set of heterogeneous devices, starting/stopping each component of a cluster is not straightforward: usually each device type comes with its own poweron/off command.

For example, shutting down a node can be as simple as an 'ssh host /sbin/halt -p'. However, it might be preferable to use an out of band command through the IPMI BMC (Baseboard Management Controller) if the node is unresponsive, for example. Starting a node can be done using a wake on lan [3] command or an IPMI [4] command. Some devices cannot be powered on/off remotely (infiniband or ethernet switches for example). Those devices might be connected to manageable Power Distribution Units (PDUs) that can remotely switch their outlets on/off using SNMP [5] commands. In the extreme case, manual intervention might be required to switch on/off the electrical power.

For software components, there is also a need to manage the shutdown of multiple components on different nodes. Considering high availability frameworks, virtualization, localization, and clients, using standard calls to /etc/init.d/service [start|stop] is often inappropriate.

Finally, the set of instructions for the powering on/off of each cluster's components should be ordered. Trivial examples include:

• powering off an ethernet switch too soon may prevent other components, including nodes, from being powered off;

• powering off a cold door should be done at the very end to prevent cooled components from being burnt out.

This ordering problem is not only relevant to hardware devices. A software component can also require that a sequence of instructions is executed before being stopped. As a trivial example, when an NFS daemon is stopped, one may prefer that all NFS clients unmount their related directories first, in order to prevent either the filling of syslog with NFS mount errors (when the NFS mount option is 'soft') or a brutal increase of the load average due to the freezing of accesses to the NFS directories (when the NFS mount option is 'hard').

Therefore, in this article, the generic term 'component' may designate a hardware component such as a node, a switch, or a cold door, or a software component such as a lustre server or an NFS server.

Our proposition, called the sequencer, addresses the problem of starting/stopping a cluster (or a data-center). Its design takes into account emergency conditions. Those conditions impose various constraints addressed by our solution:

• Predictive: an EPO should have been validated before being used. It should not perform unknown actions.

• Easy: an EPO should be easy to launch. The emergency cause may happen at any time, especially when skilled staff is not present. Therefore, the EPO procedure should be launchable by unskilled humans.

• Fast: an EPO should be as fast as possible.

• Smart: an EPO should power off each component of a cluster in the correct order so most resources will be preserved.

• Robust: an EPO should be tolerant to failure. For example, if a shutdown on a node cooled by a cold door returned an error, the corresponding cold door should not be switched off, to prevent the burnout of the node. On the other side, the rest of the cluster can continue the EPO process.

This article is organized as follows: section 2 exposes the design of our solution while some implementation details are dealt with in section 3, followed by scalability issues in section 4. Some results of our initial implementation are given in section 5. Section 6 compares our solution to related works. Finally, section 7 presents future works.

2 Design

Three major difficulties arise when considering the starting/stopping of a cluster or of a subset of it:

1. the computing of the dependency graph between components (power off a cold door after all components of the related rack have been powered off);

2. the defining of an efficient (scalable) instructions sequence where the order defined by the dependency graph is respected (powering off nodes might be done in parallel);

3. the execution of the instructions set itself, taking failure into account properly (do not power off the cold door if the related rack's nodes have failed to power off).

Therefore, the sequencer is made of three distinct functional layers:

Dependency Graph Maker (DGM): this layer computes the dependency graph according to dependency rules defined by the system administrator in a database.

Figure 2.1: Incremental Mode: each stage output can be verified and modified before being passed to the next one.

Instructions Sequence Maker (ISM): this layer computes the instructions sequence that should be executed for starting/stopping the given list of components and that satisfies the dependency constraints defined in the dependency graph computed by the previous layer.

Instructions Sequence Executor (ISE): this layer executes the instructions sequence computed by the previous layer and manages the handling of failures.

Finally, a chaining of those layers is supported through an all-in-one command. This design therefore provides two distinct processes for the starting/stopping of components:

Incremental Mode: in this mode, each stage is run separately. The output of each stage can be verified and modified before being passed to the next stage, as shown on figure 2.1. The incremental mode generates a script from constraints expressed in a database table and from a components list. This script is optimized in terms of scalability and is interpretable by the instructions sequence executor, which deals with parallelism and failures. This mode is the one designed for emergency cases. The instructions set computed should be validated before being used in production.

Figure 2.2: Black Box Mode: using the sequencer for simple non-critical usage.

Black Box Mode: in this mode, illustrated in figure 2.2, the chaining feature is used to start/stop components as shown by the following syntax:

# clmsequencer \            # command name
    stop \                  # ruleset name
    colddoor3 node[100-200] # components list

This command is somewhat equivalent to the following:

# clmsequencer \              # command name
    depmake \                 # dgm stage
    stop \                    # ruleset name
    colddoor3 node[100-200] \ # components list
  | clmsequencer seqmake \    # ism stage
  | clmsequencer seqexec      # ise stage

and can therefore be seen as syntactic sugar.

The computation of the dependency graph and of the instruction set can take a significant amount of time, especially on very large clusters such as the Tera-100 (Tera-100 is ranked #6 in the Top500 November 2010 list of the fastest supercomputers in the world and #1 in Europe; it is composed of several thousands of Bull bullx series S servers; see http://www.top500.org/ for details). This is another good reason for choosing the incremental mode in emergency conditions, where each minute is important.

3 Implementation

3.1 Dependency Graph Maker (DGM)

The Dependency Graph Maker (DGM) is the first stage of the sequencer. It takes a components list as a parameter and produces a dependency graph as output. It uses a set of dependency rules described in a database table. The CLI has the following usage:

# clmsequencer depgraph [--out file] ruleset cl_1...cl_N

The output is a human readable description of the computed dependency graph in XML format that the Instructions Sequence Maker can parse. By default, the computed dependency graph is produced on the standard output. The --output file option allows the computed dependency graph to be written in the specified file.

The ruleset parameter defines which ruleset should be used to compute the dependency graph. Rulesets will be explained in section 3.1.2 on the sequencer table. Finally, the other parameters cl_1...cl_N define on which components the dependency graph should be computed. Each parameter describes a list of components in a specific format described in the next section 3.1.1.

3.1.1 Components list specification

The first stage of the sequencer takes as an input a list of components. This list is of the form:

prefix[a-b,c-d,...][#type][@category]

where:

prefix[a-b,c-d,...]: is the standard contracted notation for designating a set of names prefixed by 'prefix' and suffixed by a number taken in the ranges given by intervals [a-b], [c-d], and so on. For example, compute[1-3,5,7-8] defines the names: compute1, compute2, compute3, compute5, compute7, compute8.

category: is optional and defines the table where given names should be looked up for their type (if not given). It is considered a good practice to have a database where the cluster is described: in a bullx cluster, each component is known and various informations are linked to it such as its model, its status, its location and so on. There should be a way to find a type from a component name; in this article we use a database for that purpose, but any other means can be used. The type of a component is used in the definition of the dependency table as described in section 3.1.2. Category examples (with some related types) are: node (io, nfs, login, compute), hwmanager (bmc, cmc, coldoor; cold doors are spelled 'coldoor' in the bullx cluster database) and soft (nfsd, nagios, sshd).

Some examples of full component list names are given below:

R-[1-3]#io@rack: the io racks R-1, R-2 and R-3;

bullx[10-11]#mds@node: the lustre mds nodes bullx10 and bullx11;

colddoor1#coldoor@hwmanager: the cold door numbered 1;

esw-1#eth@switch: the ethernet switch esw-1;

server[1-2]#nfsd@soft: NFS daemons running on server1 and server2.
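To make this notation concrete, the following short Python sketch (ours, not the sequencer's actual code) expands such a contracted component list into individual 'name#type@category' ids; the default type and category used here are placeholders, whereas the real tool would resolve a missing type or category from the cluster database:

import re

def expand(spec, default_type='unknown', default_category='unknown'):
    # Expand 'prefix[a-b,c-d,...][#type][@category]' into individual ids of the
    # form 'name#type@category'. In the real tool, a missing type or category
    # would be looked up in the cluster database rather than defaulted.
    m = re.match(r'^(?P<prefix>[^\[#@]+)'
                 r'(?:\[(?P<ranges>[^\]]+)\])?'
                 r'(?:#(?P<ctype>[^@]+))?'
                 r'(?:@(?P<category>.+))?$', spec)
    prefix = m.group('prefix')
    ctype = m.group('ctype') or default_type
    category = m.group('category') or default_category
    names = []
    if m.group('ranges'):
        for part in m.group('ranges').split(','):
            if '-' in part:
                low, high = part.split('-')
                names.extend('%s%d' % (prefix, i)
                             for i in range(int(low), int(high) + 1))
            else:
                names.append(prefix + part)
    else:
        names.append(prefix)
    return ['%s#%s@%s' % (name, ctype, category) for name in names]

# expand('compute[1-3,5]#compute@node') returns:
# ['compute1#compute@node', 'compute2#compute@node',
#  'compute3#compute@node', 'compute5#compute@node']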

3.1.2 Sequencer Dependency Rules: the sequencer table

The Dependency Graph Maker requires dependency rules to be specified in a database table. This table describes multiple sets of dependency rules. A ruleset is defined as a set of dependency rules. For example, there is one ruleset called smartstart containing all the dependency rules required for the starting of components. Another ruleset containing all the dependency rules required for the stopping of components would be called smartstop.

The format of this table is presented below. One line in the table represents exactly one dependency rule. Table columns are:

ruleset: the name of the ruleset this dependency rule is a member of.

name: the rule name; this name is used as a reference in the dependson column and should be unique in the ruleset.

types: the component types the rule should be applied to. A type is specified using the full name (that is, 'type@category'). Multiple types should be separated by the "pipe" symbol, as in compute@node|io@node. The special string 'ALL' acts like a joker: 'ALL@node' means any component from table node matches, while 'ALL@ALL' means any component matches, and is equivalent to 'ALL' alone.

filter: an expression of one of the following two forms:

• %var =~ regexp

• %var !~ regexp

where '%var' is a variable that will be replaced by its value on execution (see table 1 for the list of available variables). The operator '=~' means that the component will be filtered in only if a match occurs, while '!~' means the component will be filtered in only if a match does not occur (said otherwise, if a match occurs, it will be filtered out).

If the expression does not start with a known '%var', then the expression is interpreted as a (shell) command that, when called, specifies if the given component should be filtered in (returned code is 0) or out (returned code is different than 0). Variables will also be replaced before command execution, if specified. As an example, to filter out any component whose name starts with the string 'bullx104', one would use: '%name =~ ^bullx104'. On the other side, to let a script decide on the component id, one would use: '/usr/bin/my_filter %id'.

Finally, two special values are reserved for special meanings here:

• the string 'ALL': any component is filtered in (i.e. accepted);

• the 'NULL' special DB value: any component is filtered out (i.e. refused).

action: the (shell) command that should be executed for each component that matches the rule type (and that has been filtered in). Variables will be replaced, if specified (see table 1 for the list of available variables). If the action is prefixed with the '@' symbol, the given action will be executed on the component using an internal 'ssh' connection. Depending on the action exit code, the Instruction Sequence Executor may continue its execution, or abort. This will be discussed in section 3.3.

depsfinder: the (shell) command that specifies which components the current component depends on. The command should return the components set on its standard output, one component per line. A component should be of the following format: 'name#type@category'. Variables will be replaced, if specified (see table 1 for the list of available variables). When set to the 'NULL' special DB value, rule names specified in the dependson column are simply ignored.

dependson: a comma-separated list of rule names this rule depends on. For each dependency returned by the depsfinder, the sequencer looks if the dependency type matches one of the rule types specified by this field (rule names specified should be in the same ruleset). If such a match occurs, the rule is applied on the dependency. When set to the 'NULL' special DB value, the script specified in the 'depsfinder' column is simply ignored.

comments: a free form comment.
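As an illustration of these columns, a rule powering off nodes in a 'stop' ruleset could look like the following, written here as a Python dictionary with one key per column. The values are our own illustrative assumptions (they are not the content of table 2), except for the action command and the rule names, which are borrowed from the example developed later in this section:

# Hypothetical 'nodeOff' rule of a 'stop' ruleset (illustrative values only).
node_off_rule = {
    'ruleset':    'stop',
    'name':       'nodeOff',                      # referenced by other rules' dependson
    'types':      'compute@node|nfs@node',        # components this rule applies to
    'filter':     'ALL',                          # accept every matching component
    'action':     'nodectrl poweroff %name',      # %name is replaced at execution time
    'depsfinder': '/usr/bin/find_node_deps %id',  # hypothetical script printing ids, one per line
    'dependson':  'nfsDown',                      # apply 'nfsDown' first on matching dependencies
    'comments':   'Power off a node once its NFS daemon is stopped.',
}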

Name        Value                                  Example
%id         The full name of the component         bullx12#compute@node
%name       The name of the component              bullx12
%type       The type of the component              compute
%category   The category of the component          node
%ruleset    The current ruleset being processed    smartstop
%rulename   The current rule being processed       compute_off

Table 1: List of available variables.

The framework does not allow the specification of a timeout for a given action, for two main reasons:

1. Granularity: if such a specification was provided at that level (in the sequencer table), the timeout would be specified for all components of the type specified by the 'types' column, whereas it seems preferable to have a lower granularity, per component. This is easily achievable by the action script itself, for which the component can be given as a parameter.

2. The action script, for a given component, knows what to do when a timeout occurs much better than the sequencer itself. Therefore, if a specific process is required after a command timeout (such as a retry), the action script should itself implement the required behavior when the timeout occurs and return the appropriate return code.

As an example we will consider the sequencer table presented in table 2.

Table 2: An example of a sequencer table.

3.1.3 Algorithm

The objective of the Dependency Graph Maker is to output the dependency graph based on the dependency rules defined in the related table and on the components list given as a parameter. The computing of the dependency graph involves the following steps:

1. Components List Expansion: from the given components list, the expansion should be done. It returns a list of names of the form 'name#type@category'. Such a name is called an id in the following.

2. Dependency Graph Creation: the dependency graph is created as a set of disconnected nodes where each node is taken from the list of ids. A node in the dependency graph has the following form: 'id [actionsList]' where 'actionsList' is the list of actions that should be executed for the component with the corresponding 'id'. This graph is updated during the process by:

(a) Node additions: when processing a given component 'c' through a dependency rule (one row in the related ruleset table), the command line specified by the column 'depsfinder' is executed. This execution may return a list of components that should be processed before the current one. Each of those components will therefore be added to the dependency graph if it is not already present.

(b) Arc additions: for each component returned by the 'depsfinder' script, an arc is added between 'c' and that returned component;

(c) Node modification: when processing a given component, the content of the column 'action' of the ruleset table of the related dependency rule is added to the node actions list.

3. Rules Graph Making: from the dependency rules table, and for a given ruleset, the corresponding graph, called the rules graph, is created. A node in that graph is a pair (s, t) where s is the rule symbolic name, and t is the component types defined by the 'types' column in the dependency rules table. This graph is used for the selection of a component in the components list to start the dependency graph update process. From table 2, the rules graph of the stop ruleset is shown on figure 3.1. Note that cycles are possible in this graph: as an example, consider a PDU (related to a switch type in the sequencer table) that connects (power) an ethernet switch which itself connects (network) the PDU.

Figure 3.1: Rules Graph of the 'stop' ruleset defined in the sequencer table 2.


4. Updating the Dependency Graph: from each id (resulting from the expansion of the initial components list), the corresponding rule in the given ruleset of the sequencer table should be found. For that purpose, a potential root is looked for in the id set: a potential root is an id that matches one root of the rules graph (a root being a node with no parent). A match between an id of the form 'name#type@category' and a rule 'R' occurs when the type is in 'R.types' and when the id has been filtered in. If such a match cannot be found, then a new rules graph is derived from the preceding one by removing all roots and their related edges. Then, the root finding is done on that new graph, and so on recursively until either:

• the rules graph is empty: in this case, the given components list cannot be started/stopped (entirely);

• the rules graph is only made of cycles: any id can be used as the starting point;

• a match occurs between 'id' and 'R': in this case, the dependency graph is updated from the application of rule 'R' to 'id'. Each time such an application is made, 'id' is removed from the initial id set.
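The construction described in steps 1 to 4 can be condensed into the following Python sketch. It is only an outline: the callables passed in (deps_finder, matches, find_root, strip_roots, expand_vars) stand for the depsfinder execution, the type/filter matching, the root lookup, the rules-graph pruning and the %variable substitution described above; they are assumptions of ours, not the sequencer's real API.

# Sketch of the DGM update loop (steps 1-4 above); helper callables are assumed.
def build_dependency_graph(ids, rules, deps_finder, matches, find_root,
                           strip_roots, expand_vars):
    graph = {i: {'actions': [], 'deps': set()} for i in ids}   # disconnected nodes (step 2)
    applied = set()                       # a rule is applied at most once on a given id

    def apply_rule(rule, cid):
        if (rule['name'], cid) in applied:
            return
        applied.add((rule['name'], cid))
        graph.setdefault(cid, {'actions': [], 'deps': set()})
        for dep in deps_finder(rule, cid):          # one 'name#type@category' per line
            graph.setdefault(dep, {'actions': [], 'deps': set()})   # node addition (a)
            graph[cid]['deps'].add(dep)                             # arc addition (b)
            for dep_rule in rule['dependson']:
                if matches(dep_rule, dep):                          # type match + filter
                    apply_rule(dep_rule, dep)
        graph[cid]['actions'].append(expand_vars(rule['action'], cid))  # node modification (c)
        ids.discard(cid)

    while ids and rules:
        match = find_root(ids, rules)     # id matching a root rule of the rules graph (step 4)
        if match is None:
            rules = strip_roots(rules)    # remove roots and their edges, then retry
            continue
        rule, cid = match
        apply_rule(rule, cid)
    return graph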

As an example, consider a rack cooled by a bullx cold door 'cd0' containing an NFS server 'nfs1' and a compute node 'c1'. Consider also another NFS server 'nfs2' that is not in the same rack. We also suppose that:

• 'c1' is a client of both 'nfs1' and 'nfs2';

• 'nfs1' is a client of 'nfs2';

• 'nfs2' is a client of 'nfs1' (yes, it might seem strange here; this serves the purpose of our example).

Using table 2 and the component list 'nfs1#nfsd@soft, cd0, nfs2', the objectives are:

• power off 'c1' and 'nfs1' before 'cd0', because powering off a cold door requires that each cooled equipment is powered off first;

• stop the NFS daemon on 'nfs1' because it is requested (this should be done before powering off 'nfs1');

• power off 'nfs2' because it is requested (but the NFS daemon will have to be stopped before);

• for each NFS client, a warning should be written before the actual stopping of the used NFS server (one might use the content of /var/lib/nfs/rmtab for an inaccurate list of NFS clients, the 'showmounts' command, or any other means).

Figure 3.2: The initial dependency graph.

Figure 3.3: The dependency graph after the call to the 'cd0#coldoor@hwmanager' depsfinder.

Figure 3.4: The dependency graph after the application of rule 'coldoorOff' on 'cd0#coldoor@hwmanager'.

With the table, the ruleset and the component list, the sequencer starts by expanding the component list into an id set: 'nfs1#nfsd@soft, cd0#coldoor@hwmanager, nfs2#nfs@node'. Then the dependency graph is initialized: each id in the set has a related node in the graph, as shown in figure 3.2.

The sequencer then looks for a potential root using the rules graph (shown on figure 3.1). A match exists between rule 'coldoorOff' and 'cd0#coldoor@hwmanager'. Therefore, the sequencer starts applying rule 'coldoorOff' to 'cd0#coldoor@hwmanager'. The depsfinder of the rule is called. For each id returned by the depsfinder, the graph is updated: a node with an empty action list is made and an edge from the current id to the dependency is created. Each returned id is added to the id set. (Note that it is not required that the depsfinder returns ids with a predefined category, as soon as a match occurs in the sequencer table. Predefined categories are used to ease the mapping between a given component and a type: in a large cluster or data-center, it may not be easy to determine the real type of a given component name.) Then, for each dependency, the sequencer checks if a match exists with one of the rules defined in the 'dependson' column, and for each match, the matching rule is applied recursively.

In our case, the cold door depsfinder returns every cooled component: 'nfs1#nfs@node' and 'c1#compute@node'. Therefore, the graph is updated as shown in figure 3.3. The 'coldoorOff' rule defines a single dependency in its 'dependson' column: 'nodeOff'. Both components match, so the rule is applied. The application of the rule 'nodeOff' on 'c1#compute@node' leads to the execution of the depsfinder, which does not return anything. Therefore, the application of the rule ends by adding the action 'nodectrl poweroff %name' to the related node in the dependency graph and by the removal of the related id from the id set. This implies that a given rule is applied at most once on a given id.

The sequencer continues with the next dependency, which is 'nfs1#nfs@node'. The application of the rule 'nodeOff' leads to the execution of the depsfinder, which returns 'nfs1#nfsd@soft'. This node is already in the graph (it is in the initial id set). Therefore, the dependency graph is just updated with a new edge. This id matches the dependency rule 'nfsDown' and this last rule is applied on that id. The depsfinder on 'nfs1#nfsd@soft' returns all known clients, which are 'c1#unmountNFS@soft' and 'nfs2#unmountNFS@soft'.

Finally, both dependencies match the rule 'unmountNFS' but its application does not lead to any new node in the dependency graph. However, the graph is updated so each node is mapped to its related action, recursively, up to the node the sequencer started with: 'cd0#coldoor@hwmanager', as shown on figure 3.4.

Figure 3.5: The rules graph with first level roots removed.

Figure 3.6: The final dependency graph.

At that stage, the id set contains only the last element from the original component list: 'nfs2#nfs@node' (the others were removed by preceding rule applications). Unfortunately, that id does not match any root rule in the rules graph. Thus, the rules graph is (virtually) modified so roots are removed. This leaves us with the rules graph shown on figure 3.5.

From that graph, the id 'nfs2#nfs@node' matches the root rule 'nodeOff', which is therefore applied. The depsfinder returns 'nfs2#nfsd@soft', which is new and therefore added to the dependency graph. The rule 'nfsDown' is applied on that id (since a match occurs), giving us two dependencies: 'c1#unmountNFS@soft' and 'nfs1#unmountNFS@soft'. The algorithm ends after the mapping of those new ids with their related actions, as shown in the final graph on figure 3.6.

Remember that a rule is never applied twice on a given id. Therefore, the action from rule 'unmountNFS', which is 'echo WARNING: NFS mounted!', on id 'c1#unmountNFS@soft' is not added twice.

The sequencer displays this dependency graph in an XML format (using the open-source python-graph library available at http://code.google.com/p/python-graph/) on its standard output. This output can be given directly to the second stage of the sequencer. Note that contrary to the rules graph, the dependency graph should not contain a cycle (this will be detected by the next stage and refused as an input).

3.2 Instructions Sequence Maker (ISM)

The Instructions Sequence Maker (ISM) is the second stage of the sequencer. Its role is to transform a dependency graph into a set of instructions that can be given as an input to the third stage, the Instructions Sequence Executor (ISE).

A set of instructions is specified as an XML document, within a single enclosing XML tag. (It is not required for the instructions sequence XML document to be created by the Instructions Sequence Maker: it may be created/modified by hand or by any other program.) Three kinds of instructions can be specified:

Action: defined by its own XML tag. It specifies the actual command that should be executed. Attributes are:

• 'id': each action should be identified by a unique string. This attribute is mandatory. It is usually of the form 'name#type@category!rule'.

• 'deps': a list of ids this action depends on (explicit dependencies). This attribute is optional. Default is the empty string.

• 'remote': the command should be executed using the current shell unless this attribute is set to 'true'. In this case, an internal ssh connection is made to execute the given command on each component described by the 'component_set' attribute (see below). This attribute is optional. Default is 'false'.

• 'component_set': the set of components this action should be executed on, in the following format: 'name[range]#type@category'. This attribute is ignored by the ISE unless the remote attribute is set to 'true'. This attribute is optional. Default is 'localhost#type@cat'.

Sequence: defined by its own XML tag. It specifies a set of instructions (hence, one of Action, Sequence or Parallel) that must be executed in the given order. This defines implicit dependencies between instructions, as opposed to the explicit dependencies defined by the 'deps' attribute of an Action.

Parallel: defined by its own XML tag. It specifies a set of instructions (hence, one of Action, Sequence or Parallel) that can be executed in any order. This explicitly defines that there is no dependency between each instruction. The ISE is free to execute them in parallel. Note that the ISE may or may not execute those instructions in parallel; this is not a requirement for the successful completion of a parallel instruction.
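To make the structure of such a document concrete, here is a small Python sketch using the standard xml.etree.ElementTree module. The tag names ('instructions', 'seq', 'par', 'action') are placeholders of ours, since the actual tag names are not given above, and we assume the command is carried as the element's text; only the attribute names (id, deps, remote, component_set) and the Action/Sequence/Parallel nesting come from the description.

import xml.etree.ElementTree as ET

# Placeholder tag names; only the attribute names and the nesting are taken
# from the description above. Commands are borrowed from sample 1.
root = ET.Element('instructions')
seq = ET.SubElement(root, 'seq')                  # children run in order (implicit deps)
par = ET.SubElement(seq, 'par')                   # children may run in parallel
a1 = ET.SubElement(par, 'action', id='c1#compute@node!nodeOff')
a1.text = 'nodectrl poweroff c1'
a2 = ET.SubElement(par, 'action', id='nfs1#nfs@node!nodeOff')
a2.text = 'nodectrl poweroff nfs1'
a3 = ET.SubElement(seq, 'action', id='cd0#coldoor@hwmanager!coldoorOff',
                   deps='c1#compute@node!nodeOff,nfs1#nfs@node!nodeOff')
a3.text = 'bsmpower -a off cd0'                   # explicit deps shown for illustration
print(ET.tostring(root, encoding='unicode'))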

Transforming a dependency graph into an instructions sequence is straightforward if performance is not the main goal. A simple topological sort [6] on the input dependency graph returns a sequence of actions where constraints are respected. For example, for the final dependency graph computed by the DGM given on figure 3.6, a topological sort gives the sequence shown on sample 1 (for a given directed acyclic graph, several valid topological sort outputs can be found).

This sequence is valid, but not efficient: it requires 9 sequential steps. This transformation algorithm is called 'seq' in the sequencer and it can be selected. Three other algorithms are provided within the sequencer:

• 'par': this algorithm inserts each node of the dependency graph using a single parallel XML instruction and explicit dependencies (the 'deps' attribute of the action XML tag). Such an algorithm is optimal in terms of performance, but it produces an instructions sequence file that is difficult to read by a human because of all those explicit dependencies.

• 'mixed': this algorithm inserts the leaf nodes of the dependency graph using a parallel instruction, then removes those leaf nodes from the dependency graph and starts again. All such parallel instructions are included in a global sequence instruction. This algorithm tends to execute sets of actions by steps: all leaf nodes are executed in parallel; once they have terminated, they are removed from the graph, and another batch of leaf nodes is executed in parallel, up to the end.

• 'optimal': this algorithm produces an instructions sequence that is as efficient as the 'par' algorithm but much more readable. It uses implicit dependencies as much as possible, using sequence instructions (our XML instructions sequence format can only express trees if implicit dependencies are used exclusively). This algorithm is selected by default.

Describing in detail those algorithms with their advantages and constraints is beyond the scope of this paper.

Sample 1: Result of the topological sort on the dependency graph given in figure 3.6.

echo WARNING: NFS mounted!
echo WARNING: NFS mounted!
echo WARNING: NFS mounted!
/etc/init.d/nfsd stop
nodectrl poweroff nfs1
/etc/init.d/nfsd stop
nodectrl poweroff c1
nodectrl poweroff nfs2
bsmpower -a off cd0
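As an illustration of the 'mixed' strategy described above, the following Python sketch (ours, not the sequencer's code) repeatedly peels the leaves of the dependency graph, that is, the nodes with no remaining dependencies, and emits one parallel batch per step; the outer list plays the role of the global sequence instruction:

# Sketch of the 'mixed' ISM strategy on a simplified graph representation:
# 'graph' maps each node id to the set of ids it depends on.
def mixed_batches(graph):
    remaining = {node: set(deps) for node, deps in graph.items()}
    batches = []
    while remaining:
        leaves = [n for n, deps in remaining.items() if not deps]
        if not leaves:
            raise ValueError('cycle detected in the dependency graph')
        batches.append(leaves)                 # one parallel instruction per batch
        for leaf in leaves:
            del remaining[leaf]
        for deps in remaining.values():
            deps.difference_update(leaves)     # leaves are done; drop them from deps
    return batches                             # outer list = global sequence

# Example: mixed_batches({'a': set(), 'b': {'a'}, 'c': {'a'}, 'd': {'b', 'c'}})
# returns [['a'], ['b', 'c'], ['d']]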

3.3 Instructions Sequence Executor (ISE)

The Instructions Sequence Executor (ISE) is the last stage of the sequencer. It takes as input an instructions sequence as computed by the ISM or created/edited by hand or by any other means. It then runs the specified instructions taking into account:

• parallelism: actions that do not have dependencies between them might be executed in parallel. There is a customizable maximum limit on the number of actions that can be executed in parallel by the ISE. This helps limiting the load increase of the system due to a vast number of forks in a large cluster.

• dependencies: an action is not executed unless all its dependencies (explicit and implicit) have completed successfully. An executed action is considered successful in two cases:

- its returned code is 0 (alias OK);

- its returned code is 75 (alias WARNING, also known as EX_TEMPFAIL in sysexits.h) and the option '--Force' has been given to the ISE.

The implementation of the ISE uses the ClusterShell [7] python library as the backend execution engine. Describing the implementation of the ISE is beyond the scope of this article.

4 Scalability Issues

Using the sequencer on a large cluster such as the Tera-100 can lead to several issues related to scalability.

4.1 Complexity

Several complexities in space and time can be identified for:

1. the production of the dependency graph produced by the DGM;

2. the production of the actions graph produced by the ISM;

3. the execution of the actions graph by the ISE.

This last complexity was our first concern, due to our customer requirements. Even if the theoretical complexity has not (yet) been formally determined, the execution time of the sequencer for the production of the dependency graph of the Tera-100 is:

• 13 minutes 40 seconds for the start ruleset, with 9216 nodes and 8941 edges in the dependency graph;

• 2 minutes 1 second for the stop ruleset, with 9222 nodes and 13304 edges in the dependency graph.

The time taken by the ISM for the production of the actions graph from the previously computed dependency graph, using the 'optimal' algorithm, is:

• 4.998 seconds for the start, with 4604 nodes and 8742 edges in the actions graph;

• 6.343 seconds for the stop, with 4606 nodes and 9054 edges in the actions graph.

Finally, the time taken by the ISE to execute these actions graphs is:

• 4 minutes 27 seconds for the start, with:

- 99.7% of actions executed (successfully or not);

- 6.6% of actions that ended on error for various reasons;

- 0.3% of actions not executed because some of their dependencies ended on error or were not executed.

• 9 minutes 23 seconds for the stop ruleset, with:

- 96.7% of actions executed (successfully or not);

- 15.3% of actions that ended on error for various reasons;

- 3.3% of actions not executed because some of their dependencies ended on error or were not executed.

Explaining the differences between those metrics is beyond the scope of this paper. However, from such results, the sequencer can be considered as quite scalable.

4.2 Maintainability

The maintenance of the various graphs used by the sequencer:

• the rules graph;

• the DGM produced dependency graph;

• the ISM produced actions graph XML file;

is also an issue on large systems. Identifying wrong dependencies in a flat file can be hard, especially with a large graph represented by several thousands of lines.

The sequencer can export those graphs in the DOT format. It therefore delegates to specific graph tools the identification of non-trivial problems for maintenance purposes. For instance, rules graphs are usually small and the standard 'dot' command that comes with the graphviz [8] open-source product can be used for their visualization. This is fast and easy. For other, much larger graphs, however, specialized tools such as Tulip [9] might be used instead.

4.3 Usability

In the context of large systems, giving correct inputs to a tool, and getting back a usable output, can be a big challenge in itself. In the case of the sequencer, the inputs are the sequencer table and the components list. For the maintenance of the table, the sequencer provides a management tool that helps adding, removing, updating, copying and even checksumming rules. For the components list, the sequencer uses what is called a guesser that, given a simple component name, fetches its type and category. This allows the end user to specify only the name of a component.

Apart from the output produced by the first two stages, which has already been discussed in the previous section on maintainability, the last output of great interest for the end-user is the ISE output. To increase further the usability of the sequencer, several features are provided:

Prefix Notation: each executed action output is prefixed by its id. When the ISE executes an actions graph produced by previous stages, those ids contain various informations such as the type, the category, and the rulename this action comes from. This helps identifying which action produced which output (the bare minimum). Moreover, such an output can be passed to various filters such as grep, or even to gathering commands such as 'clubak' from ClusterShell [7] or 'dshbak' from pdsh [10].

In the case of the 'clubak' command, the separator can be given as an option. As a side effect, this allows the end-user to group similar output by node, type or category.

Reporting: the ISE can produce various reports:

• 'model': each action with its dependencies (implicit and explicit) is shown; this is used to determine what the ISE will do before the actual execution (using a '--noexec' option);

• 'exec': each executed action is displayed along with various timing informations and the returned code;

• 'error': each executed action that exited with an error code is displayed along with its reversed dependencies (its parents in the dependency graph); this is used to know which actions have not been executed because of a given error;

• 'unexec': each non-executed action is displayed along with its missing dependencies, that is, dependencies that exited with an error code and that prevented the action from being executed; this is used to know why a given action has not been executed.

5 Results

The sequencer has been designed for two main purposes:

1. Emergency Power Off: this is the reason for the three different independent stages;

2. Common management of resources in clusters (and data-centers): this is the main reason for the chaining mechanism.

Our first experiments with our tool show that it is quite efficient. Our main target was the powering on/off of the whole Tera-100 system, which leads to the execution of more than 4500 actions in less than 5 minutes for the start and in less than 10 minutes for the stop.

6 Related Works

Dependency graph makers exist in various products:

• Make [11], SCons [12] and Ant [13], for example, are used for the building of software; they focus on files rather than cluster components and are therefore not easily adaptable to our problem.

• Old System V init [14], BSD init [15], Gentoo init [16] and their successors Solaris SMF [17], Apple launchd [18], Ubuntu upstart [19] and Fedora systemd [20] are used during the boot of a system and for managing daemons. To our knowledge, none of those products can be given a components list as an input so that actions are executed based on it.

Solutions for starting/stopping a whole cluster are most of the time manual, described in a step-by-step chapter of the product documentation, and hard wired. This is the case for example with the Sun/Oracle solution [21] (command 'cluster shutdown'). It is not clear whether all components in the cluster are taken into account (switches, cooling components, softwares, ...) and whether new components can be added to the shutdown process. IBM uses the open-source xcat [22] project and its 'rpower' command, which does not provide dependencies between cluster components.

From a high level perspective, the sequencer can be seen as a command dispatching framework similar to Fabric [23], Func [24] and Capistrano [25] for example. But the ability to deal with dependencies is lacking in these products, making them unsuitable for our initial problem.

The sequencer can also be seen as a workflow management system where the pair DGM/ISM acts as a workflow generator, and the ISE acts as a workflow executor. However, the sequencer has not been designed for human interactive actions. It does not deal, for example, with user access rights or task lists. It is therefore much lighter than common user oriented workflow management systems such as YAWL [26], Bonita [27], Intalio|BPMS [28], jBPM [29] or Activiti [30] among others.

We finally found a single real product for which a comparison has some meaning: ControlTier [31].

ControlTier shares with the sequencer various features such as the possible parallel execution of independent actions and failure handling. However, the main difference is in the way workflows are produced: they are dynamically computed in the sequencer case, through dependency rules (depsfinder scripts) and the component input list, whereas they are hard wired through configuration files in the case of ControlTier.

To our knowledge, our solution is the first one to address directly the problem of starting/stopping a whole cluster or a part of it, taking dependencies into consideration, while remaining integrated, efficient, customizable and robust in the case of failure. We suppose the main reason is that clusters are not designed to be started/stopped entirely. Long uptime is an objective! However, automated tools ease the management of clusters, making start/stop procedures faster, reproducible, and reducing human errors to a minimum.

7 Conclusion and Future Works

The sequencer solution presented in this article is the first of its kind to our knowledge. It has been designed with EPO in mind. This is the reason for its 3 independent stages and for the incremental mode of execution. Still, the sequencer provides the chaining feature, making its use pertinent for small clusters or a small part of a big one.

The sequencer fulfills our initial objectives:

• Predictive: the incremental mode allows a computed instructions sequence to be verified, modified and recorded before being run.

• Easy: executing a recorded instructions sequence requires a single command: 'clmsequencer < instructions.sequence'

• Fast: the sequencer can execute independent instructions in parallel with a customizable upper limit.

• Smart: the order in which instructions are executed comes from a dependency graph computed from customizable dependency rules and a given cluster components list.

• Robust: failures are taken into account by the sequencer component by component.

Our solution is highly flexible in that most of its inner working is configurable, such as:

• the dependency rules,

• the dependency fetching scripts,

• the action to be taken on each component,

• the dependency graph,

• the final instruction sets.

The sequencer has been validated on the whole Tera-100 system. A shutdown can be done in less than 10 minutes, and a power on takes less than 5 minutes (more than 4500 actions for both rulesets). The sequencer framework will be released under an open-source license soon. Several enhancements are planned for the end of this year, including:

• smarter failure handling;

• live reporting/monitoring;

• performance improvement of dependency graph generation through caching;

• post-mortem reporting;

• replaying.

8 Acknowledgment

The author would like to thank Matthieu Pérotin for his helpful review, Marc Girard for his support during the development of the solution presented in this paper, and the reviewers for their very clear comments and remarks.

References

[1] W. Turner, J. Seader, and K. Brill, Industry standard tier classifications define site infrastructure performance, tech. rep., Uptime Institute, 2005.

[2] P. Grun, Introduction to InfiniBand for end users, tech. rep., InfiniBand Trade Association, April 2010. http://members.infinibandta.org/kwspub/Intro_to_IB_for_End_Users.pdf

[3] AMD, Magic packet technology. White Paper, November 1995. Publication #20213, Rev: A, Amendment/0. http://support.amd.com/us/Embedded_TechDocs/20213.

[4] Intel, Hewlett-Packard, NEC, and Dell, IPMI v2.0 rev. 1.0 specification markup for IPMI v2.0/v1.5 errata revision 4, 2009. http://download.intel.com/design/servers/ipmi/IPMI2_0E4_Markup_061209.pdf

[5] SNMP Research International, SNMP RFCs. http://www.snmp.com/protocol/snmp_rfcs.shtml

[6] A. B. Kahn, Topological sorting of large networks, Commun. ACM, vol. 5, pp. 558-562, November 1962.

[7] ClusterShell open-source project, 2011. http://sourceforge.net/apps/trac/clustershell

[8] AT&T Labs Research and Contributors, Graphviz. Web page, June 2011. http://graphviz.org/

[9] D. Auber, Tulip: A huge graph visualisation framework, in Graph Drawing Software (P. Mutzel and M. Jünger, eds.), Mathematics and Visualization, pp. 105-126, Springer-Verlag, 2003.

[10] J. Garlick, pdsh: Parallel distributed shell. http://sourceforge.net/projects/pdsh/

[11] Free Software Foundation, GNU Make. Web page, July 2004. http://www.gnu.org/software/make/

[12] The SCons Foundation, SCons. Web page, June 2011. http://www.scons.org/

[13] The Apache Software Foundation, The Apache Ant Project. Web page, July 2004. http://ant.apache.org/

[14] Novell, Inc. (now SCO), System V Interface Definition, Fourth Edition, Volume 2, June 1995. http://www.sco.com/developers/devspecs/vol2.pdf

[15] FreeBSD, FreeBSD System Manager's Manual, init(8), Sept. 2005. http://www.freebsd.org/cgi/man.cgi?query=init&sektion=8

[16] Gentoo Initscripts, Mar. 2011. http://www.gentoo.org/doc/en/handbook/handbook-x86.xml?part=2&chap=4

[17] R. Romack, Service management facility (SMF) in the Solaris 10 operating system. Sun BluePrints OnLine, Feb. 2006. http://www.linux.com/archive/feature/125977

[18] J. Wisenbaker, launchd in depth. AFP548, July 2005. http://www.afp548.com/article.php?story=20050620071558293

[19] M. Sobell, Ubuntu's upstart event-based init daemon, Feb. 2008. http://www.linux.com/archive/feature/125977

[20] L. Poettering, Rethinking PID 1, Apr. 2010. http://0pointer.de/blog/projects/systemd.html

[21] Oracle, Cluster System Administration Guide. http://download.oracle.com/docs/cd/E18728_01/html/821-1257/ghfwr.html

[22] L. Octavian, A. Brindeyev, D. E. Quintero, V. Sermakkani, R. Simon, and T. Struble, xCAT 2 Guide for the CSM System Administrator. IBM Red Paper, 2008. http://www.redbooks.ibm.com/redpapers/redp4437.pdf

[23] Jeff Forcier, Fabric. Web page, June 2011. http://fabfile.org/

[24] Michael DeHaan, Adrian Likins, and Seth Vidal, Func: Fedora Unified Network Controller. Web page, June 2011. https://fedorahosted.org/func/

[25] Jamis Buck, Capistrano. Web page, June 2011. http://github.com/capistrano/capistrano/wiki/

[26] M. Adams, S. Clemens, M. L. Rosa, and A. H. M. ter Hofstede, YAWL: Power through patterns, in BPM (Demos), 2009.

[27] M. V. Faura, Bonita. http://www.bonitasoft.com/

[28] I. Ghalimi, Intalio|BPMS. http://www.intalio.com/bpms

[29] T. Baeyens, jBPM. http://www.jboss.org/jbpm

[30] T. Baeyens, Activiti. http://www.activity.org/

[31] Alex Honor, ControlTier. Web page, June 2011. http://controltier.org/
