Architecting resilient computing systems: A component-based approach for adaptive fault tolerance Miruna Stoicescu, Jean-Charles Fabre, Matthieu Roy
To cite this version:
Miruna Stoicescu, Jean-Charles Fabre, Matthieu Roy. Architecting resilient computing systems: A component-based approach for adaptive fault tolerance. Journal of Systems Architecture, Elsevier, 2017, 73, pp.6-16. 10.1016/j.sysarc.2016.12.005. hal-01472877
HAL Id: hal-01472877 https://hal.archives-ouvertes.fr/hal-01472877 Submitted on 21 Feb 2017
HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Architecting Resilient Computing Systems: A Component-Based Approach for Adaptive Fault Tolerance
Miruna Stoicescu1 Jean-Charles Fabre Matthieu Roy LAAS-CNRS, Université de Toulouse, CNRS, INP, Toulouse, France 1 EUMETSAT (European Organisation for the Exploitation of Meteorological Satellites)
ABSTRACT be easily updated through transition packages. Evolution of systems during their operational life is manda- The paper is organized as follows. Section 2 presents the tory and both updates and upgrades should not impair their context and motivation of our work. Section 3.1 briefly de- dependability properties. Dependable systems must evolve scribes the resilient system architecture we consider; next, to accommodate changes, such as new threats and undesir- we describe a set of Fault Tolerance Mechanisms (FTMs) able events, application updates or variations in available and analyze transitions between them in Section 3.2. Sec- resources. A system that remains dependable when facing tion 4 presents the process of designing FTMs for subsequent changes is called resilient. In this paper, we present an inno- adaptation. In Section 4.4, we detail the practical imple- vative approach taking advantage of component-based soft- mentation of FTMs on top of a reflective component-based ware engineering technologies for tackling the on-line adap- support and in Section 5, we describe the implementation tation of fault tolerance mechanisms. We propose a develop- of on-line transitions between FTMs. Next, we evaluate our ment process that relies on two key factors: designing fault approach in Section 6. In Section 7, we discuss related work tolerance mechanisms for adaptation and leveraging a re- and in Section 8 we present the lessons learned and conclude. flective component-based middleware enabling fine-grained control and modification of the software architecture at run- 2. PROBLEM STATEMENT time. We thoroughly describe the methodology, the devel- Resilient computing refers to dependability in the pres- opment of adaptive fault tolerance mechanisms and evaluate ence of changes due to system evolution [5]. Our work fo- the approach in terms of performance and agility. cuses on Fault Tolerance Mechanisms (FTMs) that can be influenced by changes occurring in the system or in its envi- 1. INTRODUCTION ronment. Ideally, fault-tolerant applications consist of two Dependable systems are becoming increasingly complex interconnected abstraction layers. The first one (the base and their capacity to evolve in order to efficiently accommo- level) contains the business logic that implements the func- date changes is a requirement of utmost importance. Changes tional requirements . The second one (the meta level) con- have different origins, such as fluctuations in available re- tains fault tolerant mechanisms and is attached to the first sources, additional features requested by users, environmen- one through clearly identified hooks. As the development tal perturbations. . . All changes that may occur during ser- of a fault-tolerant application implies different roles/stake- vice life are rarely foreseeable when designing the system. holders (application developer(s), safety expert, integrator), Dependable systems must cope with changes while main- separation of concerns is a key concept. Separation of con- taining their ability to deliver trustworthy services and their cerns has some limit since fault tolerance strategies often required attributes, e.g., availability, reliability, integrity [1]. depend on the semantics of the business logic. However, A lot of effort has been put into building applications that hooks can be defined to externalize some “application de- can adapt themselves to changing conditions. A rich body of fined assertions” used to parameterize fault tolerance mech- research exists in the field of software engineering consisting anisms. In any case, the concept of separation of concerns of concepts, tools, methodologies and best practices for de- is essential to implement adaptive fault tolerance without signing and developing adaptive software [2]. For instance, impacting deeply the business logic. agile software development approaches [3] emphasize the im- Among the well-established and documented FTMs, e.g., portance of accommodating change during the lifecycle of [6, 7], the choice of an appropriate FTM for a given applica- an application at a reasonable cost, rather than striving to tion depends on the values of several parameters. We iden- anticipate an exhaustive set of requirements. Component- tified three classes of parameters: fault tolerance require- based approaches [4] separate service interfaces from their ments, application characteristics and available resources. actual implementation in order to increase flexibility, evolv- • Fault tolerance requirements (FT). This class of pa- ability and reuse. Although dependable systems could ben- rameters contains the considered fault model. A fault model efit from these advancements in order to become more flex- determines the category of FTM to be used (e.g., simple ible and adaptive, very little has been done in this direction replication or diversification). Our fault model classification for now. In this paper, we describe an innovative approach is well-known [1], dealing with crash faults, value faults and for tackling on-line adaptation of dependability mechanisms development faults. We focus on hardware faults (perma- that leverages component-based software engineering tech- nent and transient physical faults) but the approach can be nologies. Component-based fault tolerance mechanisms can extended to other FTMs.
1 • Application characteristics (A). The characteristics tem may encounter throughout its service life and making that have an impact on the choice of an FTM are applica- provisions for them is obviously impossible. tion statefulness, state accessibility and determinism. State This approach is of interest for long-lived space systems accessibility is essential for checkpointing-based fault toler- (satellites and deep-space probes) and for automotive appli- ance strategies. Determinism refers here to behavioural de- cations regarding over-the-air software updates, a very im- terminism, i.e., the same inputs produce the same outputs portant trend in the automotive industry today. in the absence of faults, mandatory for active replication. In this context, we propose an alternative to preprogram- • Resources (R). FTMs require resources in terms of CPU, med adaptation that we denote agile adaptation of FTMs battery life/energy, bandwidth, etc. A cost function can be (the term “agile” is inspired from agile software develop- associated to each FTM based on the values of such pa- ment). Agile adaptation of FTMs enables their system- rameters. For a given set of resources, several mechanisms atic evolution: new FTMs can be designed off-line at any can be used with different trade-offs (e.g., more CPU, less point during service life and integrated on-line in a flexible bandwidth). manner, with limited impact on the existing software archi- tecture. Our approach for the agile adaptation of FTMs The first two parameters FT and A correspond to assump- leverages advancements in the field of software engineer- tions to be considered for the selection of an appropriate ing such as Component-Based Software Engineering (CBSE) mechanism and to determine its validity. The resource di- technologies [4], Service Component Architecture [11] and mension R states the amount of resources needed to accept Aspect-Oriented Programming [12]. Using such concepts a given solution according to system resources availability. and technologies, we design FTMs as brick-based assemblies In practice, based on the values of (FT,A,R) set at design (similar to “Lego” constructions) that can be methodically time, an FTM is attached to an application when the system modified at runtime through fine-grained modifications af- is installed for the first time. As far as resilient computing [5] fecting a limited number of bricks. This approach maximizes is concerned, the challenge lies in maintaining consistency reuse and flexibility, contrary to monolithic replacements of between the FTM(s) and the non-functional requirements FTMs found in related work, e.g., [8, 9, 10]. It is worth despite variations of the parameters at runtime, e.g.: noting that, whatever the approach is (pre-programmed or new threats/faults and physical perturbations such as • agile), when the FTM evolution goes outside the foreseen electromagnetic interferences trigger variations of FT; boundaries of the FTM loaded into the system, the system the introduction of new versions of applications or mod- • may fail. The benefits of the proposed agile approach is to ules may trigger variations of A; provide means to react more quickly and to simplify updates resource loss or addition of hardware components imply • of loaded FTMs. variations of R. Summary of contributions. In this paper, after a short If the FTM is inconsistent with the current (FT,A,R) val- description of the resilient system architecture, we describe ues, it will most likely fail to tolerate the faults the system our methodology for agile adaptation of FTMs and its re- is confronted with. Therefore, a transition towards a new sults, focussing on the following three key contributions: FTM is required before the current FTM becomes invalid, • A “design for adaptation” approach of a set of FTMs, which is possible in Adaptive Fault Tolerance (AFT). based on a generic execution scheme. On-line adaptation of FTMs has attracted research efforts • A component-based architecture of the considered FTMs for some time now because dependable systems cannot be and fine-grained on-line transitions between them. fully stopped for performing off-line modifications. However, • Illustration of on-line FTM adaptation through several most of the proposed solutions [8, 9, 10] tackle adaptation transition scenarios. in a preprogrammed manner: all FTMs necessary during the service life of the system must be known and deployed from the beginning and adaptation consists in choosing the 3. RESILIENT SYSTEM DESIGN appropriate execution branch or tuning some parameters. Nevertheless, predicting all events and threats that a sys- 3.1 Architecture The core objective of a resilient system is to guarantee the Initial non functional consistency of FTMs attached to applications according to specifications Operational inputs (failure data, resource usage, config changes, etc.) major assumptions regarding the fault model and the appli- è Adaptation triggers cation characteristics. First, this means identifying the link FTMs Design for between applications and FTMs, and, more importantly, dy- adaptation Adaptation Engine namically adapting FTMs according to operational condi- tions at runtime. As soon as FTMs are developed as assem- On-line adaptation of Monitoring blies of Lego bricks, it becomes easier to adjust their config- componentized FTMs Engine Componentization of uration by removing, adding, modifying individual bricks. FTMs System This means that bindings between application and FTMs, Loading missing State but also inside FTMs can be managed dynamically. At run- components (FTM Analysis updates) time, any FTM implemented as a graph of Lego bricks is Validation of FTM and modified on-line when the FTM configuration is updated. composition / inconsistency detection Some Lego bricks can be uploaded when they are missing, after an off-line development when the FTM solution does not exist yet. The adaptation is carried out on-line by a OFF-Line Agile development of FTMs ON-Line Agile Management of FTMs specific service, called Adaptation Engine (see Figure 1). A second important feature of a resilient system, also Figure 1: Adaptation process overview shown on Figure 1, is the runtime monitoring of the state
2 of the system according to several view points. This core on a single host. A request is processed twice and results service of the architecture is called Monitoring Engine. The are compared. If results differ, due to a transient fault, the first conventional role of the monitoring is to measure re- request is processed again and if two results out of three are source usage R. The second important role of the monitor- identical, the reply is sent. ing is to analyze the non-functional behavior of the system. Another technique can be used to tolerate transient faults This task is more complex and requires specific observers using assertions derived from safety properties of the appli- to capture rare error events at the hardware level, the rate cation: the Assertion&Duplex (A&Duplex) strategy. When of exceptions in the operating system and the applications, the first execution fails (the assertion is false), then the re- errors raised by operating system calls, and other incorrect execution is done on a different node. The synchronization behaviors reported in logs. From the collection of such in- between replicas can be based on any duplex protocol vari- puts, triggers for FTM adaptation are computed. Monitor- ant mentioned above, tolerating thus crash faults as well. ing issues are not addressed in this paper; we consider that The assertion can be determined by a safety analysis of the triggers to signal a change in resource usage R or fault model system, e.g. a Failure Mode, Effects, and Critical Analysis FT are already available. (FMECA). This technique is similar to distributed recovery A third service, called Resilience Management, corres- blocks [14] that can tolerate both hardware and software ponds to the link with the off-line System Manager respon- faults when replicas are diversified. sible for the updates of the application (i.e. versioning), the • Assumptions and performance analysis: The under- identification of their characteristics A, the development of lying characteristics of the considered FTMs, in terms of a new FTM as a graph of software components, the update (FT,A,R), are shown in Table 1. For instance, PBR and of components when needed, and the definition of transition LFR tolerate the same fault model, but have different A scripts to modify the on-line configuration of an FTM. and R. PBR allows non-determinism of applications because The big picture of the resilient system architecture we con- only the Primary computes client requests while LFR only sider is the following: works for deterministic applications as both replicas com- • a Middleware responsible for the execution of applications pute all requests. PBR requires state access for checkpoints and FTMs as assemblies of software components; and higher network bandwidth (in general), while LFR does • an Adaptation Engine responsible for the management of not require state access but generally incurs higher CPU FTMs as graphs of software components; costs (and, consequently, higher energy consumption) as • a Monitoring Engine responsible for the capture of the both replicas perform all computations. TR requires state system state and the computation of triggers; access because application state must be captured before the • a Resilience Management Service responsible for the in- first request processing and restored between two consecu- teraction with the system manager. tive executions. As it runs on only one host, it cannot toler- This framework is consistent with the Open Systems De- ate crash and it has no bandwidth requirements. A&Duplex pendability concept and the principles of Dependability En- can tolerate both crash faults and value faults as two CPUs gineering for Ever-Changing Systems [13]. are used to run the replicas. Both TR and A&Duplex re- quire more computation power (i.e., more energy) than PBR 3.2 Adaptation of FTMs because of multiple request processing. In this section, we describe a set of well-established FTMs FTM in terms of their underlying characteristics and we present PBR LFR TR A&Duplex a transition scenario encompassing these FTMs. Our objec- Characteristics tive in this work is not to invent more sophisticated FTMs, Crash Transient value Fault Model (FT) nor to discuss complex variants of the proposed mechanisms, Permanent value but to illustrate our approach for resilient computing with Application Deterministic quite simple implementations of conventional techniques. Characteristics Non-deterministic (A) Requires state access Bandwidth high low n/a low Resources (R) 3.2.1 Illustrative set of FTMs CPU low low high high To illustrate our approach, we use two variants of a du- plex protocol tolerating crash faults and two mechanisms for Table 1: (FT, A, R) parameters of considered FTMs tolerating hardware value faults (e.g., bit flips). • Tolerance to crash faults: Simple replication of the • Dealing with more complex fault tolerance strate- server on two distinct hosts (master and slave(s)) is a solu- gies: We voluntarily use simple definitions of the mecha- tion: a crash of the master is detected by a dedicated entity nisms here for the sake of clarity. However, many variants (e.g., heartbeat, watchdog) and triggers a recovery action by of passive and active strategy exist. A more detailed dis- which the slave becomes master. There are two main types cussion about replication techniques in distributed systems of duplex protocols: passive and active, and many variants. can be found in [15]. Variants of the LFR strategy where Primary-Backup Replication (PBR) is a passive strategy: non-deterministic decisions are captured and forwarded from only the master processes client requests and sends check- the Leader to the Follower do exist. We could also con- points containing its state to the slave(s). Leader-Follower sider multiple Backups or Followers making thus the use of Replication (LFR) is an active strategy: all replicas process Atomic Broadcast protocols highly useful in the implemen- input requests but only the master replies to the client. tation. With the Assertion&Duplex (A&Duplex) strategy, • Tolerance to value faults: A solution can rely on rep- one can decide to use a diversified implementation of the etition of request processing or duplex processing with ac- function to handle software faults. Clearly, Fault Tolerance ceptance tests/assertions. For instance, Time Redundancy Design Patterns can be implemented in many ways. (TR) tolerates transient value faults by processing repetition Considering several software fault tolerance techniques like
3 Recovery Blocks (RB) and Triple Modular Redundancy (TMR), structural and conceptual dependencies between them. the proposed approach, promoting dynamic updates using In its first stage, the framework contained only the de- Lego bricks, is of interest without changing the execution sign (and corresponding implementation) of a PBR strategy logic of the mechanism — for RB, an update consists of whose core was concentrated in a class encompassing gen- changing the acceptance test; for TMR, an update consists eral fault tolerance concerns, duplex concerns and specific of replacing the decision algorithm. Both updates help im- PBR concerns. Our aim was to reach a clean separation proving the fault tolerance coverage of such techniques. between all these concerns in order to maximize reuse when The resource parameters, monitored on-line, obviously de- developing new duplex variants and other FTMs. pend on the application, the physical architecture of the sys- tem (CPU, networks, HW configuration, etc.). 4.1 First design loop By analyzing the above described FTMs, we identified 3.2.2 Possible transitions a generic execution scheme which captures their common During the service life of the system, the values of the pa- parts and their variable features. Upon reception of a re- rameters enumerated in Table 1 can change. An application quest from the client, a fault-tolerant server executes some may become non-deterministic because a new version is in- actions before processing. Then it proceeds with request pro- stalled. The fault model may change from crash fault only cessing. After processing, it executes some actions and fi- to crash and value fault, due to hardware aging or physi- nally it sends the reply to the client. We call this the Before- cal perturbations. Available resources may also vary, e.g., Proceed-After generic execution scheme, inspired from aspect- bandwidth drop or constraints in energy consumption. Fig- oriented programming [12]. The Before-Proceed-After soft- ure 2 shows a graph of possible transitions between the pre- ware bricks comply with the Server coordination phase (syn- viously defined FTMs variants. The vertices represent the chronisation before), the Execution phase (proceed) and the FTMs and the edges are labeled with the parameter (FT, A, Agreement coordination phase (synchronization after) de- R) whose variation triggers the transition. For instance, the scribed in [15]. Table 2 describes the content of each before- PBR LFR transition is triggered by a change in application proceed-after execution step for the FTMs we consider. characteristics→ or in resources, while the PBR A&Duplex transition is triggered by a change in the considered→ fault FTM Before Proceed After PBR (Primary) Nothing Compute Checkpoint to Backup model, i.e. the need to check a user-defined safety property PBR (Backup) Nothing Nothing Process checkpoint with an assertion. Transitions can occur in both directions, LFR (Leader) Forward request Compute Notify Follower according to parameter variation. LFR (Follower) Receive request Compute Process notification TR Capture state Compute Restore state A&Duplex Nothing Compute Assert output FT LFR LFR TR ⊕ Table 2: Generic execution scheme of considered FTMs FT A,FT When designing a duplex mechanism, this scheme can be A& A,R A,R Duplex translated in Before-Proceed-After, because an inter-replica synchronization takes place before request processing and FTA,FT another one after. This generic execution scheme enabled start PBR PBR TR us to factorize in a class what is common to all duplex pro- FT ⊕ tocols, DuplexProtocol and then specialize, through inheri- tance, the concrete FTMs, PBR and LFR (see Figure 3). Other Figure 2: Transitions between FTMs duplex variants can be added to the framework, either by inheriting from the abstract base class DuplexProtocol or by overloading concrete classes. The variation of the fault model often requires a composi- tion of FTMs, e.g., if we start by tolerating only crash faults 4.2 Second design loop with PBR and we want to add tolerance to transient value Another separation can be done between what is common faults, we will compose PBR with TR and obtain a com- to all FTMs and what is specific to duplex ones. Com- posed FTM combining the features specific to PBR with munication with the client, preservation of “at-most-once” those specific to TR. In this paper, the operator denotes semantics and request forwarding to the concrete functional composition (e.g., PBR TR in Figure 2).⊕ ⊕ service in the processing step are now encapsulated in a class, FaultToleranceProtocol in Figure 3. This second factor- 4. DESIGN FOR ADAPTATION OF FTMs ization enabled us to introduce in our framework non-duplex The second step of our development process is an analysis protocols targeting other fault models than crash, more pre- of the previously described FTMs in order to extract com- cisely value faults (transient and permanent): TimeRedun- mon parts and variable features between them. This leads to dancy and Assertion, which follow the same generic ex- a generic execution scheme of fault tolerance strategies and ecution scheme. New protocols can be easily added to the a framework of Fault Tolerance Design Patterns (FTDPs). framework either by extending the abstract base class Fault- We follow a “design for adaptation” approach consisting of ToleranceProtocol or one of the concrete classes. several design loops. Each loop represents a refinement step Composing FTMs — As a direct consequence of the two towards the optimal representation of the various concerns design loops, the composition of FTMs is intuitive and al- and protocols. The result is a pattern system [16] for fault most immediate. By inheriting from a duplex protocol (tol- tolerance because the considered FTMs are reusable and cus- erating crash faults) and from a value fault tolerance mech- tomizable solutions to specific problems and we highlight the anism, we obtain four composed FTMs (Figure 3): PBR_TR
4 «interface» Server Remote StateManager Development time 6
RemoteServer 5
4
RecoverableRemoteServer 3 Days 2 FaultToleranceProtocol 1 «virtual» «virtual» «virtual» 0 TimeRedundancy DuplexProtocol Assertion 1st design LFR 2nd design Assertion Time Composition loop loop Redundancy
PBR LFR Figure 4: FT design patterns: development time
PBR_TR LFR_A Source Lines of Code 250 LFR_TR PBR_A 200 150 Figure 3: Excerpt of Fault Tolerance Design Patterns 100 50 0 SLOC and LFR_TR, corresponding to PBR TR and LFR TR re- spectively in Figure 2, and PBR_A and⊕ LFR_A which⊕ are two variants of A&Duplex. Figure 3 shows an excerpt of the final framework resulting from the two design loops. Variable features — The elements of our generic execu- tion scheme represent the variable features between FTMs. Figure 5: FT design patterns: source lines of code sloc Comparing, for instance, the execution scheme of PBR with the one of LFR (see Table 2) gives us the intuition that by dividing the inter-replica protocol in isolated bricks/com- of development time of new FTMs (see Figure 4), in terms ponents which can be identified and manipulated on-line, of code reuse and lines of code to be produced (see Figure 5). we could execute a differential transition between PBR and The development of any FTM is carried out off-line. Any LFR. This means replacing only the components contain- update impacts the FTM that must be validated off-line ing the variable features between the two FTMs, without before it can be used. The analysis of possible inconsisten- modifying the rest of the system (e.g., communication with cies in FTMs composition is part of the off-line validation the client, the actual processing to which proceed only for- process and is thus performed before any on-line adapta- wards the requests). By identifying the variable features, we tion. Inconsistency analysis of FTMs composition has been can easily develop FTM variants1 from existing ones (off-line) addressed in our previous works. The interested reader is and, using a component-based middleware support, execute referred to [17, 18] for details. transitions between FTMs with few modifications (on-line). 4.4 Component-based FTMs 4.3 Validation of the design In this step of the development process, the fine-grained All the above design versions developed in UML with IBM design of FTMs based on the generic Before-Proceed-After Rational Software Architect have been implemented and val- execution scheme was mapped on FraSCAti [19], an open- idated in C++. As further explained, the benefits of the source reflective component-based middleware. FraSCAti “design for adaptation” approach lie in reducing the develop- provides runtime support for applications designed accord- ment time and effort necessary for implementing new FTMs ing to the Service Component Architecture (SCA) specifi- and in increasing readability and separation of concerns/- cations [11]. FraSCAti enriches the basic SCA specifica- functionalities. The implementation of composed mecha- tion with support for on-line exploration and reconfigura- nisms is almost immediate thanks to the two design loops. tion of component-based assemblies. To this aim, it inte- Figure 4 shows that the effort spent on factorization and grates FScript [20], a script language for writing reconfig- design refinement during the two design loops is 4 to 5 times uration procedures. FraSCAti and FScript provide what more significant than the effort to develop one FTM. How- we have identified as the minimal API for performing the ever, once the right design is achieved, the development of fine-grained adaptation of FTMs, namely: both individual and composed FTMs is extremely easy. For control over component lifecycle at runtime (add, re- instance, while the second design loop took 5 days, the de- • move, start, stop); velopment of Assertion and Time Redundancy each took half a day. The composition of FTMs, which is probably control over interactions between components for cre- the most interesting result of this endeavor, only took half • ating and removing reference-service connections; a day thanks to the two design loops. During the development of this system of patterns, we ob- consistency of reconfigurations performed by scripts served that our design approach was very efficient, in terms • written in FScript.
5 It is worth noting that our approach is reproducible on any such an FTM exists). The Adaptation Engine gets the re- other platform that provides these features. quired transition package from the FTM&Adaptation Repos- itory. This package contains the new bricks that must be integrated into the existing software architecture in order to master SCA legend execute a differential transition from the current FTM to the X Y composite component new one and a script that operates the transition. Next, the replyLog wire client protocol reference service transition package feeds the Script interpreter that modifies syncBefore proceed property the current FTM. server In our view, the process encompasses two aspects: cold resilient computing, covering off-line activities (the develop- syncAfter slave failure ment of the transition packages from the repository), and FTM detector hot resilient computing, consisting of all the other entities and their interactions that occur on-line. Together, these Figure 6: Component-based architecture of PBR two aspects form a loop: the cold one feeds transition pack- ages to the hot one, and the hot one provides feedback for improving and enriching the cold one. Figure 6 shows the resulting component-based architec- ture of PBR. The steps of the generic execution scheme are 5.2 On-line fine-grained transitions isolated in the components syncBefore, proceed, syncAfter. To illustrate the feasibility of the approach, we performed Thanks to the “design for adaptation” process, these vari- several fine-grained transitions between selected FTMs, cor- able features, that are subject to change during transitions, responding to the transition scenario from Figure 2. The are mapped on small stateless components. The actual state PBR LFR transition is triggered by variations in A or R of the FTMs (e.g., request id, computation result) and the (based→ on Table 1) and requires changing the syncBefore common parts of FTMs that were captured in the two base and syncAfter components (based on the generic scheme in FaultTolerance- classes in the object-oriented design (i.e., Table 2). All the other components corresponding to the Protocol DuplexProtocol and in Figure 3) are now mapped massive common parts are left untouched. on components that are not modified during transitions be- Each transition implies the deployment of a transition tween FTMs, i.e., protocol, replyLog and server. package containing the new components that must be intro- duced in the existing architecture (in this example, syncBe- 5. ADAPTATION OF FTMs AT RUNTIME fore and syncAfter of LFR) and a script written in FScript that removes the components that are no longer necessary (here, syncBefore and syncAfter of PBR) and replaces them 5.1 Adaptation process implementation with the new ones. In short, the script written in FScript Figure 7 outlines the adaptation process implementation that performs the PBR LFR transition does the following: for executing fine-grained transitions between FTMs, un- → der the supervision of the System Manager. The target disconnect the old syncBefore and syncAfter from all system runs an application to which are attached appro- • their services and references; priate FTM(s) (i.e., consistent with the current values of (FT,A,R)). delete old components and add the new ones; •
Cold Resilient Computing Hot Resilient Computing connect the new syncBefore and syncAfter to all the • necessary services and references.
Adaptation Engine The LFR LFR TR transition (i.e., the composition of → ⊕ FTM & LFR with TR) is triggered by the evolution of the fault Adaptation trigger FTM developer Repository call model, from crash fault to crash fault and transient value Transition Resilience faults. Such an evolution of the fault model may occur package1 Management New Service Transition due to hardware aging or physical perturbations. Table 2 script components Monitoring interpreter outlines the specificities of TR, namely the state capture Engine script and restoration before and after processing, respectively. To Transition modify package2 observe simplify the mapping on the component-based architecture, New components the behavior of TR is implemented by a proceed compo- System script nent that repeats processing and compares results. The LFR LFR TR transition thus consists in replacing the el- System Manager ementary→ proceed⊕ component of LFR. OFF-LINE ON-LINE 5.3 Consistency of distributed adaptation As evolvability must not affect the reliability of fault- Figure 7: Adaptation process implementation tolerant application, the consistency of transitions between FTMs must be ensured. When there is an inconsistency between the values of Local consistency— FraSCAti and its integrated FScript (FT,A,R) and the current FTM, the Resilience Management engine guarantee the consistency of local reconfigurations Service triggers the Adaptation Engine giving it as input the performed using scripts. FScript enforces an all-or-nothing new FTM towards which a transition must be executed (if semantics [20]. The reliability of the reconfiguration is achieved
6 using a model of component-based configurations and recon- intra-FTM transitions (dotted black lines). figurations (i.e., transitions between two consistent configu- rations), verifying integrity constraints and performing re- Mandatory transitions: Parameters whose variation in- • configurations in a transactional manner [20]. In case of validates the initial FTM or affects its performance and constraint violation during the reconfiguration process, a therefore requires a transition towards an appropriate ScriptException is thrown, the transaction is rolled back one. These are mandatory transitions. When starting and the FTM remains in its initial configuration. from PBR, there are two such cases: bandwidth drop Consistency of request processing— Stopping a com- (which introduces undesired overheads) and state in- ponent that must be replaced implies waiting for it to reach accessibility (which makes checkpointing impossible). a quiescent state where all its internal processing is finished, Possible transitions: Parameters whose variation only blocking its inputs and buffering them. This means that a • client request received before adaptation is processed and makes optional the use of another FTM, without in- the client receives the reply before the subsequent requests validating the initial one, lead to possible transitions. are blocked and buffered and the interaction between repli- When starting from PBR, there are two such cases: in- cas is locked. When the lock is released, the buffered client crease in available CPU (because LFR demands more requests are processed in the new configuration of the FTM. processing than PBR) and application determinism Distributed consistency — The FTMs consist of two (because both PBR and LFR work in this case). replicas and a specific inter-replica protocol, therefore tran- While mandatory transitions can be executed automat- sitions between FTMs is performed on two hosts. On each ically, possible transitions are executed only if the system one, a script component performs the required reconfigu- manager decides to. The intra-FTM transitions are only ration, which can either terminate successfully or with a represented in Figure 8 for the sake of completeness. If a ScriptException, if integrity constraints are violated. The possible transition towards a new FTM is not executed (e.g., script component on each host is wrapped in a Java class to go from “PBR with non-determinism” to “LFR with state which kills the local replica if an exception is thrown, to access” when the application becomes deterministic), the enforce the fail-silent assumption and to prevent the overall current FTM will still change its configuration, i.e., execute FTM from reaching an inconsistent configuration. There- an intra-FTM transition (e.g., “PBR with non-determinism” fore, a local reconfiguration can either terminate successfully goes to “PBR with determinism” ). or crash. The failure detection mechanism incorporated in Stability of transitions and triggers— A problem which all duplex strategies detects the crash and informs the re- can be encountered in adaptive fault tolerance, perceived as maining replica. If this replica has successfully reconfigured, a closed loop system, is the oscillation between FTMs: if a it becomes master-alone. transition is triggered by the variation of a parameter that Recovery of adaptation— Replicated computing units oscillates near the reconfiguration threshold, the system can may crash during transitions. Upon successful completion become unstable and reconfigure itself too often, thus reduc- of the reconfiguration of one replica (either master or slave), ing its availability. By distinguishing between mandatory the current configuration (i.e., the target FTM) is logged and possible transitions, this issue is solved for our FTM on a stable storage. Should the other replica crash (either examples: as Figure 8 shows, the reverse of a mandatory because of a ScriptException or because of a physical fault) transition is always a possible one. The risk of oscillation is before completing its own reconfiguration, a new replica is eliminated because once the system executes a mandatory restarted in the same configuration as its counterpart, which transition due to the variation of a parameter, it will not has successfully completed the transition. This information be able to revert to the previous FTM, unless the system is recovered from the stable storage which keeps track of the manager decides to. currently active configuration. Figure 8 also shows that changes of some parameters can be detected automatically by using probes , while others 5.4 Detailed analysis of transition scenarios likely require input/observations from the application devel- Figure 8 shows an excerpt of the graph of possible tran- oper or from the system manager . The former encom- sitions (on the left) and an extended graph of transition passes R variations, the latter concerns A and FT variations. scenarios between the FTMs in our subset resulting from it Reactive vs. proactive — A fundamental difference in (on the right). We call it a graph of scenarios because there the nature of transitions concerns when they must be trig- are several events which lead to a transition between FTMs. gered, either as a reaction to an event which has occurred For the sake of clarity, we present a simplified view of or in advance, before the foreseen occurrence of an event. adaptation triggers. In general, they consist in more com- Changes in available resources R usually impact the perfor- plex logical expressions, which are out of the scope of this pa- mance overhead entailed by a given FTM. Therefore, the per. As previously explained, PBR requires state access and system manager can search for a more appropriate FTM as can be used both for deterministic and non-deterministic ap- a reaction to fluctuations in resources. In the case of appli- plications. LFR requires application determinism and can cation changes A due to versioning, the system manager can be used both for applications that provide state access and select/define a suitable FTM for the new version, taking in- those that do not (see Table 1). This explains the two PBR put from application designers wrt the new characteristics. states and the two LFR states in Figure 8. The “No generic The transition encompasses changing the application version solution”state corresponds to non-deterministic applications together with the FTM (if the new characteristics invalidate that do not provide state access. the previous FTM). As such, the FTM is changed as a re- Mandatory vs. possible transitions — Figure 8 shows action to application changes. three types of transitions: mandatory transitions (continu- The case of fault model FT changes is the most complex. ous red lines), possible transitions (dashed green lines) and In the context of operational phases, one can understand
7 REACTIVE PROACTIVE