Reliability Prediction for Fault-Tolerant Software Architectures

Franz Brosch¹, Barbora Buhnova², Heiko Koziolek³, Ralf Reussner⁴

¹ Research Center for Information Technology (FZI), Karlsruhe, Germany
² Masaryk University, Brno, Czech Republic
³ Industrial Software Systems, ABB Corporate Research, Ladenburg, Germany
⁴ Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

buhnova@fi.muni.cz

ABSTRACT

Software fault tolerance mechanisms aim at improving the reliability of software systems. Their effectiveness (i.e., reliability impact) is highly application-specific and depends on the overall system architecture and usage profile. When examining multiple architecture configurations, such as in software product lines, it is a complex and error-prone task to include fault tolerance mechanisms effectively. Existing approaches for reliability analysis of software architectures either do not support modelling fault tolerance mechanisms or are not designed for an efficient evaluation of multiple architecture variants. We present a novel approach to analyse the effect of software fault tolerance mechanisms in varying architecture configurations. We have validated the approach in multiple case studies, including a large-scale industrial system, demonstrating its ability to support architecture design, and its robustness against imprecise input data.

Categories and Subject Descriptors

D.2.11 [Software]: Software Engineering - Software Architectures; D.2.4.g [Software]: Software Engineering - Software/Program Verification - Reliability

General Terms

Software Engineering, Reliability, Design

Keywords

Component-Based Software Architectures, Reliability Prediction, Fault Tolerance, Software Product Lines

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. QoSA ’11 Boulder, Colorado, USA. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

1. INTRODUCTION

Software fault tolerance (FT) mechanisms mask faults in software systems and prevent them from resulting in a failure. FT mechanisms are established on different abstraction levels, such as exception handling on the source code level, watchdog and heart-beat as design patterns, and replication on the architecture level [20, 18]. FT mechanisms are commonly used to improve the reliability of software systems (i.e., the probability of failure-free operation in a given time span). The effect of an FT mechanism (i.e., the extent of such an improvement) is, however, non-trivial to quantify because it depends strongly on the application context.

The challenge of assessing FT mechanisms in different contexts becomes particularly apparent on the architecture level, when evaluating different architectural configurations, and even more so in the design of software product lines (SPL) [7]. An SPL has a core assembly of software components and variation points where application-specific software can be filled in. Thus, it can result in many different products that all share the same core but differ in their configuration at the variation points.

Existing approaches for reliability prediction [11, 13, 15] either do not support modelling FT mechanisms (e.g., [5, 8, 12, 21]) or do not allow for explicit definition and reuse of core modelling artefacts, and hence are difficult to apply in varying contexts, as with architecture design and SPLs (e.g., [24, 26]).

The contribution of this paper is an approach to analyse the effect of a software fault tolerance mechanism as a function of the overall system architecture and usage profile. The approach is novel as it (i) takes software fault tolerance mechanisms explicitly into account, and (ii) reuses model parts for effective evaluation of architectural alternatives or system configurations. The approach is ideally suited for software product lines, which are used to formulate and illustrate the approach. It builds upon the Palladio Component Model and an associated reliability prediction approach [4], which includes both software reliability and hardware availability as influencing factors. Our tool support allows architects to design the architecture with UML-like models, which are automatically transformed to Markov-based prediction models and evaluated to determine the expected system reliability.

The remainder of this paper is structured as follows. Section 2 outlines our approach and explains the steps involved. Section 3 details the models used in our approach, and Section 4 explains how these models are formalised and analysed to predict the system reliability. Section 5 evaluates the approach on two case studies. Section 6 distinguishes our work from related approaches. Finally, Section 7 draws conclusions and sketches future directions.

2. PREDICTION PROCESS

This section outlines our reliability prediction approach for fault-tolerant software architectures, software families, and software product lines (SPL) in particular. According to Clements et al. [7], an SPL is defined as "a set of software-intensive systems that share a common, managed set of features satisfying the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way". Our modelling approach provides support only for the technical part of an SPL or software family, as we assume that domain engineering and asset scoping have been performed before modelling the architecture and reusable components.

Our approach iteratively follows eight steps depicted in Fig. 1. First, the software architect creates a CoreAssetBase, which models interfaces, reusable software components, and their (abstract) behaviour (step 1). The CoreAssetBase is enriched with software failure probabilities of actions forming component (service) behaviour (step 2). Afterwards, the software architect can include different FT mechanisms, such as recovery blocks (explained in Section 3.2) or redundancy (step 3), either as additional components or directly in already modelled component behaviours. The FT mechanisms allow for different configurations, e.g., the number of retries or replicated instances.

[Figure 1: Process activities and artifacts. Activities: 1. Model components, interfaces, and behaviour; 2. Model failure probabilities in behaviours; 3. Model fault tolerance, adjust configurations; 4. Model resource environment, HW availability; 5. Model products, allocation, usage; 6. Model transformation; 7. Markov chain analysis; 8. Product instantiation. Artifacts: Core Asset Base, Resource environments, Products, DTMCs, Results (OK / Not OK).]

The software architect then creates a ResourceEnvironment to model hardware resources (step 4) and specific Products (step 5), including component allocation and system usage information. Combined with the CoreAssetBase, these models are transformed into multiple discrete-time Markov chains (step 6), from which system reliability predictions and sensitivity analyses can be deduced (step 7).

The model allows variation points to be expressed on different levels:

• architectural level: using composite components to encapsulate subsystems and enable their replacement
• component level: selecting different component implementations for a given specification
• component implementation level: using component parameters with values for specific product configurations
• resource level: using allocation references expressing different deployment schemes for product line configurations

Our model is better suited for our purposes than UML extended with the MARTE-DAM profile [3] because it allows model reuse through the core asset base and is reduced to the concepts needed for the prediction. The following sections explain the core asset base (Section 3.1), the fault-tolerance mechanisms (Section 3.2), the resource environment (Section 3.3) and the product (Section 3.4) in detail.

3.1 Core Asset Base

The modelling element CoreAssetBase of our meta model (cf. Fig. 2) represents a repository for elements assembled into products and contains ComponentTypes, Interfaces, and FailureTypes. ComponentTypes can either be atomic PrimitiveComponents or hierarchically structured CompositeComponents with nested inner components. Composite components allow the core asset base to contain whole architecture fragments (e.g., SPL core assemblies) that can be reused in different products. Such a core assembly can have optionally required interfaces as variation points. Component types are associated with interfaces through ProvidedRoles or RequiredRoles, and can export ComponentParameters that allow for implementation-level variation points. Fig. 3 shows an excerpt of an instance of a core asset base including a composite component (4) and a component parameter (value).

For reliability analyses, the model requires constructs to express the behaviour of component services in terms of using hardware resources and calling other components. Therefore, a component type can contain a number of ServiceBehaviours that specify the actions executed upon calling a specific Signature of one of the component's provided interfaces. The behaviour may consist of InternalActions, ExternalCallActions, and control flow
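The prediction idea outlined above (failure probabilities attached to the actions of a behaviour, an FT mechanism with a configurable number of retries, and a transformation into a discrete-time Markov chain) can be illustrated with a minimal Python sketch. It is not the transformation implemented by the authors' tooling. The names `failure_probability`, `with_retries`, `behaviour_to_dtmc`, and `success_probability` are hypothetical, the numbers are made up, the behaviour is assumed to be purely sequential, and retry attempts are assumed to fail independently.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InternalAction:
    """One action of a service behaviour, annotated with a failure probability (step 2)."""
    name: str
    failure_probability: float


def with_retries(action: InternalAction, retries: int) -> InternalAction:
    """Hypothetical retry mechanism (step 3): the action fails only if every attempt fails.

    Assumes attempts fail independently, which is a simplification.
    """
    return InternalAction(
        name=f"{action.name} (up to {retries + 1} attempts)",
        failure_probability=action.failure_probability ** (retries + 1),
    )


def behaviour_to_dtmc(actions: List[InternalAction]) -> List[List[float]]:
    """Build the transition matrix of an absorbing DTMC for a sequential behaviour (step 6).

    States 0..n-1 are the actions, state n is SUCCESS, state n+1 is FAILURE.
    """
    n = len(actions)
    success, failure = n, n + 1
    matrix = [[0.0] * (n + 2) for _ in range(n + 2)]
    for i, action in enumerate(actions):
        next_state = i + 1 if i + 1 < n else success
        matrix[i][next_state] = 1.0 - action.failure_probability
        matrix[i][failure] = action.failure_probability
    matrix[success][success] = 1.0  # absorbing
    matrix[failure][failure] = 1.0  # absorbing
    return matrix


def success_probability(matrix: List[List[float]], start: int = 0,
                        tolerance: float = 1e-12, max_sweeps: int = 10_000) -> float:
    """Probability of eventually reaching SUCCESS from `start`, by fixed-point iteration."""
    size = len(matrix)
    success = size - 2  # layout produced by behaviour_to_dtmc
    reach = [0.0] * size
    reach[success] = 1.0
    for _ in range(max_sweeps):
        updated = [sum(p * reach[j] for j, p in enumerate(row)) for row in matrix]
        if max(abs(a - b) for a, b in zip(updated, reach)) < tolerance:
            return updated[start]
        reach = updated
    raise RuntimeError("absorption probabilities did not converge")


if __name__ == "__main__":
    # Illustrative values only; they do not come from the paper.
    plain = [
        InternalAction("parse request", 0.001),
        InternalAction("query store", 0.01),
        InternalAction("render reply", 0.002),
    ]
    tolerant = [plain[0], with_retries(plain[1], retries=1), plain[2]]
    print("without FT:", success_probability(behaviour_to_dtmc(plain)))
    print("with retry:", success_probability(behaviour_to_dtmc(tolerant)))
```

Comparing the two printed values shows how one configuration parameter (the number of retries) changes the predicted reliability while the rest of the model is reused unchanged.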
