A Component-Based Approach for Adaptive Fault Tolerance Miruna Stoicescu, Jean-Charles Fabre, Matthieu Roy

Architecting resilient computing systems: A component-based approach for adaptive fault tolerance Miruna Stoicescu, Jean-Charles Fabre, Matthieu Roy To cite this version: Miruna Stoicescu, Jean-Charles Fabre, Matthieu Roy. Architecting resilient computing systems: A component-based approach for adaptive fault tolerance. Journal of Systems Architecture, Elsevier, 2017, 73, pp.6-16. 10.1016/j.sysarc.2016.12.005. hal-01472877 HAL Id: hal-01472877 https://hal.archives-ouvertes.fr/hal-01472877 Submitted on 21 Feb 2017 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Architecting Resilient Computing Systems: A Component-Based Approach for Adaptive Fault Tolerance Miruna Stoicescu1 Jean-Charles Fabre Matthieu Roy LAAS-CNRS, Université de Toulouse, CNRS, INP, Toulouse, France 1 EUMETSAT (European Organisation for the Exploitation of Meteorological Satellites) ABSTRACT be easily updated through transition packages. Evolution of systems during their operational life is manda- The paper is organized as follows. Section 2 presents the tory and both updates and upgrades should not impair their context and motivation of our work. Section 3.1 briefly de- dependability properties. Dependable systems must evolve scribes the resilient system architecture we consider; next, to accommodate changes, such as new threats and undesir- we describe a set of Fault Tolerance Mechanisms (FTMs) able events, application updates or variations in available and analyze transitions between them in Section 3.2. Sec- resources. A system that remains dependable when facing tion 4 presents the process of designing FTMs for subsequent changes is called resilient. In this paper, we present an inno- adaptation. In Section 4.4, we detail the practical imple- vative approach taking advantage of component-based soft- mentation of FTMs on top of a reflective component-based ware engineering technologies for tackling the on-line adap- support and in Section 5, we describe the implementation tation of fault tolerance mechanisms. We propose a develop- of on-line transitions between FTMs. Next, we evaluate our ment process that relies on two key factors: designing fault approach in Section 6. In Section 7, we discuss related work tolerance mechanisms for adaptation and leveraging a re- and in Section 8 we present the lessons learned and conclude. flective component-based middleware enabling fine-grained control and modification of the software architecture at run- 2. PROBLEM STATEMENT time. We thoroughly describe the methodology, the devel- Resilient computing refers to dependability in the pres- opment of adaptive fault tolerance mechanisms and evaluate ence of changes due to system evolution [5]. Our work fo- the approach in terms of performance and agility. cuses on Fault Tolerance Mechanisms (FTMs) that can be influenced by changes occurring in the system or in its envi- 1. INTRODUCTION ronment. Ideally, fault-tolerant applications consist of two Dependable systems are becoming increasingly complex interconnected abstraction layers. The first one (the base and their capacity to evolve in order to efficiently accommo- level) contains the business logic that implements the func- date changes is a requirement of utmost importance. Changes tional requirements . The second one (the meta level) con- have different origins, such as fluctuations in available re- tains fault tolerant mechanisms and is attached to the first sources, additional features requested by users, environmen- one through clearly identified hooks. As the development tal perturbations. All changes that may occur during ser- of a fault-tolerant application implies different roles/stake- vice life are rarely foreseeable when designing the system. holders (application developer(s), safety expert, integrator), Dependable systems must cope with changes while main- separation of concerns is a key concept. Separation of con- taining their ability to deliver trustworthy services and their cerns has some limit since fault tolerance strategies often required attributes, e.g., availability, reliability, integrity [1]. depend on the semantics of the business logic. However, A lot of effort has been put into building applications that hooks can be defined to externalize some \application de- can adapt themselves to changing conditions. A rich body of fined assertions" used to parameterize fault tolerance mech- research exists in the field of software engineering consisting anisms. In any case, the concept of separation of concerns of concepts, tools, methodologies and best practices for de- is essential to implement adaptive fault tolerance without signing and developing adaptive software [2]. For instance, impacting deeply the business logic. agile software development approaches [3] emphasize the im- Among the well-established and documented FTMs, e.g., portance of accommodating change during the lifecycle of [6, 7], the choice of an appropriate FTM for a given applica- an application at a reasonable cost, rather than striving to tion depends on the values of several parameters. We iden- anticipate an exhaustive set of requirements. Component- tified three classes of parameters: fault tolerance require- based approaches [4] separate service interfaces from their ments, application characteristics and available resources. actual implementation in order to increase flexibility, evolv- • Fault tolerance requirements (FT). This class of pa- ability and reuse. Although dependable systems could ben- rameters contains the considered fault model. A fault model efit from these advancements in order to become more flex- determines the category of FTM to be used (e.g., simple ible and adaptive, very little has been done in this direction replication or diversification). Our fault model classification for now. In this paper, we describe an innovative approach is well-known [1], dealing with crash faults, value faults and for tackling on-line adaptation of dependability mechanisms development faults. We focus on hardware faults (perma- that leverages component-based software engineering tech- nent and transient physical faults) but the approach can be nologies. Component-based fault tolerance mechanisms can extended to other FTMs. 1 • Application characteristics (A). The characteristics tem may encounter throughout its service life and making that have an impact on the choice of an FTM are applica- provisions for them is obviously impossible. tion statefulness, state accessibility and determinism. State This approach is of interest for long-lived space systems accessibility is essential for checkpointing-based fault toler- (satellites and deep-space probes) and for automotive appli- ance strategies. Determinism refers here to behavioural de- cations regarding over-the-air software updates, a very im- terminism, i.e., the same inputs produce the same outputs portant trend in the automotive industry today. in the absence of faults, mandatory for active replication. In this context, we propose an alternative to preprogram- • Resources (R). FTMs require resources in terms of CPU, med adaptation that we denote agile adaptation of FTMs battery life/energy, bandwidth, etc. A cost function can be (the term \agile" is inspired from agile software develop- associated to each FTM based on the values of such pa- ment). Agile adaptation of FTMs enables their system- rameters. For a given set of resources, several mechanisms atic evolution: new FTMs can be designed off-line at any can be used with different trade-offs (e.g., more CPU, less point during service life and integrated on-line in a flexible bandwidth). manner, with limited impact on the existing software architecture. Our approach for the agile adaptation of FTMs The first two parameters FT and A correspond to assump- leverages advancements in the field of software engineer- tions to be considered for the selection of an appropriate ing such as Component-Based Software Engineering (CBSE) mechanism and to determine its validity. The resource di- technologies [4], Service Component Architecture [11] and mension R states the amount of resources needed to accept Aspect-Oriented Programming [12]. Using such concepts a given solution according to system resources availability. and technologies, we design FTMs as brick-based assemblies In practice, based on the values of (FT,A,R) set at design (similar to \Lego" constructions) that can be methodically time, an FTM is attached to an application when the system modified at runtime through fine-grained modifications af- is installed for the first time. As far as resilient computing [5] fecting a limited number of bricks. This approach maximizes is concerned, the challenge lies in maintaining consistency reuse and flexibility, contrary to monolithic replacements of between the FTM(s) and the non-functional requirements FTMs found in related work, e.g., [8, 9, 10]. It is worth despite variations of the parameters at runtime, e.g.: noting that, whatever the approach is (pre-programmed or new threats/faults and physical perturbations such as • agile), when the FTM evolution goes outside

Load more