
Reliability Prediction for Fault-Tolerant

Franz Brosch (1), Barbora Buhnova (2), Heiko Koziolek (3), Ralf Reussner (4)
(1) Research Center for Information Technology (FZI), Karlsruhe, Germany
(2) Masaryk University, Brno, Czech Republic
(3) Industrial Software Systems, ABB Corporate Research, Ladenburg, Germany
(4) Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
[email protected], buhnova@fi.muni.cz, [email protected], [email protected]

ABSTRACT
Software fault tolerance mechanisms aim at improving the reliability of software systems. Their effectiveness (i.e., reliability impact) is highly application-specific and depends on the overall system architecture and usage profile. When examining multiple architecture configurations, such as in software product lines, it is a complex and error-prone task to include fault tolerance mechanisms effectively. Existing approaches for reliability analysis of software architectures either do not support modelling fault tolerance mechanisms or are not designed for an efficient evaluation of multiple architecture variants. We present a novel approach to analyse the effect of software fault tolerance mechanisms in varying architecture configurations. We have validated the approach in multiple case studies, including a large-scale industrial system, demonstrating its ability to support architecture design, and its robustness against imprecise input data.

Categories and Subject Descriptors
D.2.11 [Software]: SOFTWARE ENGINEERING - Software Architectures; D.2.4.g [Software]: SOFTWARE ENGINEERING - Software/Program Verification - Reliability

General Terms
Software Engineering, Reliability, Design

Keywords
Component-Based Software Architectures, Reliability Prediction, Fault Tolerance, Software Product Lines

1. INTRODUCTION
Software fault tolerance (FT) mechanisms mask faults in software systems and prevent them from resulting in a failure. FT mechanisms are established on different abstraction levels: on the source code level, as design patterns such as watchdog and heart-beat, and as replication [20, 18]. FT mechanisms are commonly used to improve the reliability of software systems (i.e., the probability of failure-free operation in a given time span). The effect of an FT mechanism (i.e., the extent of such an improvement) is, however, non-trivial to quantify because it depends highly on the application context.

The challenge of assessing FT mechanisms in different contexts becomes particularly apparent on the architecture level, when evaluating different architectural configurations, and even more so in the context of software product lines (SPL) [7]. An SPL has a core assembly of software components and variation points where application-specific software can be filled in. Thus, it can result in many different products, all sharing the same core but with different configurations at the variation points.

Existing approaches for reliability prediction [11, 13, 15] either do not support modelling FT mechanisms (e.g., [5, 8, 12, 21]) or do not allow for explicit definition and reuse of core modelling artefacts, and hence are difficult to apply in varying contexts, as with architecture design and SPLs (e.g., [24, 26]).

The contribution of this paper is an approach to analyse the effect of a software fault tolerance mechanism depending on the overall system architecture and usage profile. The approach is novel as it (i) takes software fault tolerance mechanisms explicitly into account, and (ii) reuses model parts for effective evaluation of architectural alternatives or system configurations. The approach is ideally suited for software product lines, which are used to formulate and illustrate the approach. It builds upon the Palladio Component Model and an associated reliability prediction approach [4], which includes both software reliability and hardware availability as influencing factors. Our tool support allows the architects to design the architecture with UML-like models, which are automatically transformed to Markov-based prediction models and evaluated to determine the expected system reliability.

The remainder of this paper is structured as follows. Section 2 outlines our approach and explains the steps involved.
Section 3 details the models used in our approach and then Section 4 explains how these models are formalised and analysed to predict the system reliability. Section 5 evaluates the approach on two case studies. Section 6 delimits our work from related approaches. Finally, Section 7 draws conclusions and sketches future directions.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
QoSA '11 Boulder, Colorado, USA
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.

2. PREDICTION PROCESS
This section outlines our reliability prediction approach for fault-tolerant software architectures, software families, and software product lines (SPL) in particular. According to Clements et al. [7], an SPL is defined as “a set of software-intensive systems that share a common, managed set of features satisfying the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way”. Our modelling approach provides support only for the technical part of an SPL or software family, as we assume that domain engineering and asset scoping have been performed before modelling the architecture and reusable components.

Our approach iteratively follows eight steps depicted in Fig. 1. First, the software architect creates a CoreAssetBase, which models interfaces, reusable software components, and their (abstract) behaviour (step 1). The CoreAssetBase is enriched with software failure probabilities of actions forming component (service) behaviour (step 2). Afterwards, the software architect can include different FT mechanisms, such as recovery blocks (explained in Section 3.2), either as additional components or directly into already modelled component behaviours (step 3). The FT mechanisms allow for different configurations, e.g., the number of retries or replicated instances.

Figure 1: Process activities and artifacts
(Activities: 1. Model components, interfaces, and behaviour; 2. Model failure probabilities in behaviours; 3. Model fault tolerance, adjust configurations; 4. Model resource environment, HW availability; 5. Model products, allocation, usage; 6. Model transformation; 7. Markov chain analysis; 8. Product instantiation. Artifacts: Core Asset Base, Resource environments, Products, DTMCs, Results. If the results are not OK, the process returns to the modelling activities; if they are OK, it proceeds to product instantiation.)

The software architect then creates a ResourceEnvironment to model hardware resources (step 4) and specific Products (step 5), including component allocation and system usage information. Combined with the CoreAssetBase, these models are transformed into multiple discrete-time Markov chains (step 6), from which system reliability predictions and sensitivity analyses can be deduced (step 7). If the prediction results do not satisfy the reliability requirements, the FT mechanisms can be reconfigured and/or the resource environment and products adjusted iteratively. Otherwise, the modelled products are deemed sufficient for the requirements, and the product can safely be instantiated from the core asset base (step 8). The following two sections describe the models and the predictions in detail.

3. RELIABILITY MODELLING
This section introduces our meta model for describing the reliability characteristics of software product lines, which are used to formulate our approach. The model and its reliability solver are implemented using the Eclipse Modeling Framework (EMF) and build upon the Palladio Component Model (PCM) [2]. An excerpt of our meta model is depicted in Fig. 2; for full documentation, refer to our website (footnote 1).

Footnote 1: http://sdqweb.ipd.kit.edu/wiki/ReliabilityPrediction

The model allows expressing variation points on different levels:
• architectural level: using composite components to encapsulate subsystems and enable their replacement
• component level: selecting different component implementations for a given specification
• component implementation level: using component parameters with values for specific product configurations
• resource level: using allocation references expressing different deployment schemes for product line configurations

Our model is better suited for our purposes than UML extended with the MARTE-DAM profile [3] because it allows model reuse through the core asset base and is reduced to the concepts needed for the prediction. The following sections explain the core asset base (Section 3.1), the fault-tolerance mechanisms (Section 3.2), the resource environment (Section 3.3) and the product (Section 3.4) in detail.

3.1 Core Asset Base
The modelling element CoreAssetBase of our meta model (cf. Fig. 2) represents a repository for elements assembled into products and contains ComponentTypes, Interfaces, and FailureTypes. ComponentTypes can either be atomic PrimitiveComponents or hierarchically structured CompositeComponents with nested inner components. Composite components allow the core asset base to contain whole architecture fragments (e.g., SPL core assemblies) that can be reused in different products. Such a core assembly can have optionally required interfaces as variation points. Component types are associated with interfaces through ProvidedRoles or RequiredRoles, and can export ComponentParameters that allow for implementation-level variation points. Fig. 3 shows an excerpt of an instance of a core asset base including a composite component (4) and a component parameter (value).

For reliability analyses, the model requires constructs to express the behaviour of component services in terms of using hardware resources and calling other components. Therefore, a component type can contain a number of ServiceBehaviours that specify the actions executed upon calling a specific Signature of one of the component's provided interfaces. The behaviour may consist of InternalActions, ExternalCallActions, and control flow constructs, such as probabilistic branches and loops.

An internal action represents a component-internal computation and can contain multiple FailureOccurrences and ResourceDemands. FailureOccurrences model software failures during the execution of a service using a probability. These failure probabilities can be determined with different techniques [11, 5, 17], such as reliability growth modelling, defect prediction based on code metrics, statistical testing, or fault injection.

An external call action represents a call to other components and thus references a Signature contained in one of the required interfaces of the current component. Fig. 4 shows a small example of a service behaviour containing internal actions and external call actions. Notice that the transitions of the BranchAction are labelled with input parameter dependencies (e.g., $P(X \le 1)$), because the concrete values for the input parameters (e.g., $X$) are not known to the component developer providing the specification.
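As a minimal sketch of what such a parameter dependency means in practice: once a distribution of the input parameter X is available (for instance from the system usage information), a branch label such as P(X ≤ 1) can be turned into a concrete probability by summing the probabilities of all matching values. The distribution below and the names `usage_profile` and `branch_probability` are hypothetical illustrations and are not part of the PCM tooling or of the paper's case studies.

```python
from fractions import Fraction

# Hypothetical usage profile: discrete distribution of the input parameter X.
# The values and probabilities are made up for illustration only.
usage_profile = {
    0: Fraction(1, 10),
    1: Fraction(3, 10),
    2: Fraction(4, 10),
    5: Fraction(2, 10),
}


def branch_probability(distribution, condition):
    """Sum the probabilities of all parameter values that satisfy the condition."""
    return sum(
        (prob for value, prob in distribution.items() if condition(value)),
        Fraction(0),
    )


# Resolve the two transitions of a branch labelled P(X <= 1) and P(X > 1).
p_small = branch_probability(usage_profile, lambda x: x <= 1)
p_large = branch_probability(usage_profile, lambda x: x > 1)

print(f"P(X <= 1) = {p_small}, P(X > 1) = {p_large}")
```

Exact fractions are used here only so that the two branch probabilities visibly add up to one; floating-point probabilities would work the same way.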
The component developer should keep the component as reusable as possible, i.e. make no assumptions about the usage profile. The developer specifies parameter dependencies as an influencing factor on the component reliability. The dependencies are automatically resolved during system analysis, when the usage profile is known. Hence, the influence of the given system-level usage profile on the system reliability is explicitly considered by our approach by propagating the system-level input parameters through the component-based architecture [4].

Figure 2: Excerpt of our meta model for specifying reliability characteristics in a software product line
(Class diagram. Recoverable elements include CoreAssetBase, ComponentType, PrimitiveComponent, CompositeComponent, ProvidedRole, RequiredRole, Interface, Signature, ServiceBehaviour, ComponentParameter, FailureType, AbstractAction, InternalAction, ExternalCallAction, BranchAction, LoopAction, RecoveryBlockAction, RecBlockBehaviour, FailureOccurrence, ResourceDemand, Product, ComponentInstance, ComponentConfiguration, AssemblyConnector, UserInterface, ProductUsage, UserCall, ResourceEnvironment, ResourceContainer, LinkResource, ProcessingResource with mttf and mttr, and ProcResourceType.)

Figure 4: Service behaviour example (partial view)

3.2 Fault-tolerance Meta-Model
To model fault tolerance within component services, we use the concept of recovery blocks, which is analogous to exception handling in object-oriented programming. It is represented with a RecoveryBlockAction that contains an acyclic sequence of at least two RecBlockBehaviours. The first behaviour models the normal execution within a service, while the following behaviours handle failures of certain types and initiate alternative actions (analogous to try and catch blocks in exception handling). Each behaviour contains an inner sequence of AbstractActions that again can contain any type of action and even nested recovery blocks.

Figure 5: Recovery-block example and its semantics

Consider the example in Fig. 5 of a recovery block action with three RecBlockBehaviours and four different failure types A, B, C, D. During the first behaviour all failure types can occur. Failures of type A cannot be recovered and lead to a failure of the whole block. The second behaviour handles failures of type B and C and the third behaviour handles failures of type C and D. In the second behaviour, failures B and C are again possible, whereas failures of type B lead to a failure of the block, while failures of type C and D are handled by the third behaviour. Notice that failures of type C cannot occur from this recovery block as they are handled by behaviour 3 and cannot again occur during the execution of behaviour 3.

As a RecBlockBehaviour can contain arbitrary behavioural constructs, such as internal actions, calls, branches, loops, and nested recovery block actions, recovery blocks are a flexible means to model FT mechanisms. They allow modelling exception handling, checkpoint and restarting, process pairs, recovery blocks, or consensus recovery blocks and are therefore capable of making reliability predictions for a large class of existing FT mechanisms. If an external call action is embedded into a RecBlockBehaviour, errors from the called component (and any other component down the caller stack) can be handled. The case studies in Section 5 show different possible usages of recovery block actions.

3.3 Resource Environment
A ResourceEnvironment contains ResourceContainers modelling server nodes and LinkResources modelling network connections. Each resource container can contain a number of ProcessingResources like CPUs or hard disks. To include hardware availability into the calculation of the system reliability, each processing resource contains a mean time to failure (MTTF) and a mean time to repair (MTTR) attribute. These values can be determined from vendor specification and/or former experience [23]. Link resources contain failure probabilities for network problems, which can be determined using simple test series.

A resource environment model can be reused across different Products using allocation references (i.e. mapping of components to resources). Furthermore, it is possible to have multiple resource environments (e.g., different server sizes or servers), which then constitute an additional variation point or feature of the product line. Components in the core asset base are decoupled from concrete resource environments, because they only refer to abstract ProcResourceTypes, but not to concrete ProcessingResources. Thus, it is possible to connect a product to a specific resource environment through an allocation reference without the need to alter the core asset base.

3.4 Product
A Product contains a number of ComponentInstances wired through AssemblyConnectors and accessed through a UserInterface. Only components with matching interfaces may be composed in a product. It is possible to connect different component instances complying with the same interfaces at different points in the architecture, thus implementing architecture-level variation points. Through the compositions, the overall system behaviour is defined as a connection of all service behaviours (i.e., after composition, external call actions in service behaviours can be replaced by the service behaviours of the called services).

Component instances can contain ComponentConfigurations that realize implementation-level variation points. We introduced these parameters in our former work [4]. They are data values that can model different configurations or features of a component implementation and change component behaviour.

Component instances within a Product must be allocated to ResourceContainers through an allocation reference that models a deployment on a certain server. Thus, the component reliability depends on the availability of the server's hardware resources employed by the component instance. The ProductUsage contains a number of UserCalls with different probabilities, which model the system usage profile of a product.

4. RELIABILITY PREDICTION
This section first describes how the system reliability (under a specified usage profile) is calculated on the software layer (Section 4.1) and then integrated with calculations on the hardware and network layers (Sections 4.2, 4.3).

4.1 Software Layer
For the software layer, the approach derives the reliability from a Markov model that reflects all possible execution paths through a Product architecture, and their corresponding probabilities.

Figure 6: Markov model for service behaviours
(A service behaviour with actions 1 to n is mapped to a discrete-time Markov chain with states $I$, $A_1, \dots, A_n$, $S$ and $F_1, \dots, F_m$. The chain leaves $I$ with probability 1.0, leaves each $A_i$ towards the next state with probability $1 - \sum_j fp_j(A_i)$, and moves from $A_i$ to $F_j$ with probability $fp_j(A_i)$.)

Fig. 6 shows how a ServiceBehaviour represented through a sequence of $n$ AbstractActions is transformed into an absorbing discrete-time Markov chain. The chain contains an initial state $I$, an absorbing success state $S$, one state $A_i$ for each action, and one absorbing failure state $F_j$ for each of the $m$ user-defined software FailureTypes. A transition from $A_i$ to $F_j$ denotes that action $i$ might exhibit a failure of type $j$ upon execution, with a probability of $fp_j(A_i)$. The probability of failure of the whole behaviour $fp(Beh)$ is the probability to reach any of the failure states $F_j$ (and not the success state $S$) from the initial state $I$:

$$fp(Beh) = 1 - \prod_{i=1}^{n} \Bigl(1 - \sum_{j=1}^{m} fp_j(A_i)\Bigr)$$

For the success probability of the behaviour $sp(Beh)$, we have:

$$sp(Beh) = 1 - fp(Beh)$$

The calculation of $fp_j(A_i)$ depends on the type of the action $A_i$. For InternalActions, the failure probabilities $fp_j(A_{internal})$ are given as a direct input to the model (cf. Fig. 2). LoopActions, BranchActions, ExternalCallActions, and RecoveryBlockActions have nested ServiceBehaviours $Beh$ (again sequences of AbstractActions), which need to be evaluated in a recursive step first. For loops, we have:

$$fp_j(A_{loop}) = \sum_{i=1}^{k} \Bigl( P(c_i) \cdot \sum_{l=0}^{c_i - 1} \bigl( sp(Beh)^{l} \cdot fp_j(Beh) \bigr) \Bigr)$$

with a finite set of loop iteration counts $\{c_1, \dots, c_k\} \subseteq \mathbb{N}$, each with its probability of occurrence $P(c_i)$. For branches, we have:

$$fp_j(A_{branch}) = \sum_{i=1}^{k} \bigl( P(Beh^{i}) \cdot fp_j(Beh^{i}) \bigr)$$

with a finite set of nested behaviours $Beh^{1}, \dots, Beh^{k}$ and their probabilities of occurrence $P(Beh^{i})$. An ExternalCallAction fails if the called behaviour fails:

$$fp_j(A_{call}) = fp_j(Beh)$$

Recovery blocks are characterized through a sequence of nested behaviours $[Beh^{1}, \dots, Beh^{k}]$. Having $fp_j(Beh^{i})$ for every behaviour and failure type, each $Beh^{i}$ can be represented with a trivial Markov model $T(Beh^{i})$, as illustrated in Fig. 7, and the recovery-block model constructed as their combination (following the semantics illustrated in Fig. 5).

Figure 7: Markov models for recovery behaviours
(One tree $T(B^{i})$ per behaviour, $i = 1, \dots, k$, each with an initial state $I$, a success state $S_i$ reached with probability $1 - \sum_j fp_j(B^{i})$, and failure states $F_{i1}, \dots, F_{im}$ reached with probabilities $fp_1(B^{i}), \dots, fp_m(B^{i})$. $T(B^{1})$ handles no failure types; $T(B^{2})$ handles $j_{21}, j_{22}, \dots$; $T(B^{k})$ handles $j_{k1}, j_{k2}, \dots$.)

For each model $T(Beh^{i})$, let $I_i$ be its initial state, $S_i$ its success state and $F_{ij}$ its failure state for each failure type $j$. To connect the isolated trees into a single Markov chain, we add $m + 2$ states, namely $I$, $S$ and $F_j$ for each failure type $j$, and the following transitions:
• $I \xrightarrow{1.0} I_1$,
• $S_i \xrightarrow{1.0} S$ for each $i \in \{1, \dots, k\}$,
• $F_{ij} \xrightarrow{1.0} I_x$ where $x \in \{i+1, \dots, k\}$ is the index of the closest tree handling failure type $j$, if such a tree exists,
• and if not, $F_{ij} \xrightarrow{1.0} F_j$ for each $F_{ij}$ with no outgoing transitions.

Finally, the failure probabilities $fp_j(A_{recovery})$ are computed as the probability of reaching $F_j$ from $I$ (via summation of the multiplied probabilities over available paths).

Based on the described calculations, the success probability of the topmost behaviour $sp(Beh_T)$ (sequence of system services invoked by the user) can be calculated in a recursive way, yielding the system-level reliability with respect to the software layer.

4.2 Hardware Layer
For ProcessingResources, the approach employs a hardware availability model, and integrates this model with the software layer for a combined consideration of software and hardware failures. Having the $MTTF_i$ and $MTTR_i$ values for all resources $\{r_1, \dots, r_p\}$, we first calculate the steady-state availability $A_i$ of $r_i$:

$$A_i = MTTF_i / (MTTF_i + MTTR_i)$$

and interpret it as the probability of $r_i$ being available when requested at an arbitrary point in time. Furthermore, we define the set of physical system states as the set $\{s_1, \dots, s_q\}$ where each $s_j$ is a combination of the individual resource states. We distinguish two possible states for each resource, being either available ($OK$) or unavailable ($FAIL$). Assuming independent hardware failures, the probability $P(s_j)$ for the system to be in state $s_j$ is the product of the individual state probabilities. For example, for a system with 2 resources, the probability for $r_1$ being $OK$ and $r_2$ being $FAIL$ is:

$$P((r_1 = OK) \wedge (r_2 = FAIL)) = A_1 \cdot (1 - A_2)$$

Figure 8: Combined HW/SW consideration
(From the initial state $I$, transitions with probabilities $P(s_1), \dots, P(s_q)$ lead to the physical system states $s_1, \dots, s_q$. From each $s_j$, transitions lead to the software failure states $F_1, \dots, F_m$ and the hardware failure states $R_1, \dots, R_p$ with probabilities $fp_k(Beh_T|s_j)$, and to the success state $S$ with probability $sp(Beh_T|s_j)$.)

Fig. 8 illustrates the combined consideration of the software and the hardware layer. Upon a user call, the system might be in any of the physical system states $s_j$. This is reflected by a transition from the initial state $I$ to $s_j$ with the corresponding state probability $P(s_j)$. Being in $s_j$, the execution might either fail due to an unavailable hardware resource accessed by system control flow ($s_j$ to $R_i$), or with a software failure (represented through transitions from $s_j$ to $F_k$), or it might succeed ($s_j$ to $S$).

To calculate the failure probabilities $fp_k(Beh_T|s_j)$, $k \in \{1, \dots, m+p\}$, we need to incorporate the failures due to hardware unavailability into the software layer model (as shown in Fig. 6). To this end, we set the hardware-caused failure probability $fp_k(Beh_T|s_j)$, $k > m$, of InternalActions to 1.0 if they require a ProcessingResource that is not available under the given physical system state $s_j$. If an internal action requires two or more unavailable resources, the failure probability is distributed evenly among these. In this manner, the software layer is evaluated independently for each physical system state, and the overall success probability of service execution (represented by its topmost behaviour $Beh_T$) is computed as the weighted sum over the success probabilities of all physical system states:

$$sp(Beh_T) = \sum_{j=1}^{q} \bigl( P(s_j) \cdot sp(Beh_T|s_j) \bigr)$$

where for each physical system state we have:

$$sp(Beh_T|s_j) = 1 - \sum_{k=1}^{m+p} fp_k(Beh_T|s_j)$$

Thanks to the separation of service behaviour according to the individual physical system states, the approach can better reflect the correlation of subsequent resource demands (i.e. a resource accessed twice within a single service execution is likely to be either $OK$ in both cases or $FAIL$ in both cases), caused by significantly longer resource failure and repair times compared to the execution time of a single user-triggered service [22].

4.3 Network Layer
To incorporate the network layer into the model, we assume that each call routed over a LinkResource involves two message transports (request and return), and that each transport might fail with the given failure probability of the link $fp(L)$. We adapt the software layer model (see Section 4.1) by a differentiation of ExternalCallActions into local calls and remote calls. We keep the failure probabilities for local calls:

$$fp_j(A_{localCall}) = fp_j(Beh)$$

but incorporate the link failure probability into the calculation for remote calls:

$$fp_j(A_{remoteCall}) = fp_j(Beh) \cdot fp(L)^{2}$$

Thus, we enable coarse-grained consideration of network reliability, without going into the details of more sophisticated network simulations, which are out of scope of this paper.

5. EVALUATION

5.1 Goals and Setting
This section serves to validate our reliability and fault tolerance modelling approach. The goals of the validation are (i) to demonstrate how the new FT modelling techniques can support design decisions, (ii) to provide a rationale for the validity of our models and their resilience to imprecise input parameters, and (iii) to show the effectiveness of our models for SPLs.

Regarding goal (ii), validating reliability prediction against measured values is inherently difficult, as failures are rare events, and the necessary time to observe a statistically relevant number of failures is infeasibly long for high-reliability systems. Several existing approaches therefore limit their validation to demonstrating examples and sensitivity analyses (e.g., [8, 10, 12, 22, 25]), showing how the approaches can be used to learn about system failure behaviour, and proving the robustness of prediction results against imprecise input data at design time. A number of authors involve real-world industrial software systems in their validation (e.g., [5, 16, 26]). We follow the same path, and additionally compare our numerically calculated predictions against a more realistic, but also more time-consuming queueing network simulation [4], in order to at least partially validate prediction accuracy. The simulation has fewer assumptions than the analytical solution. It takes system execution times (encoded into ResourceDemands) into account and lets resources fail and be repaired according to their MTTF and MTTR, not based on the simplified steady-state availability.

We have validated our approach on a number of different systems: a distributed business reporting system, the common component modelling example (CoCoME), a web-based media store product line [2], an industrial control system product line [17], and the SLA@SOI Open Reference Case. The models for these systems can be retrieved from our website (footnote 1). In the following, we describe the predictions for the web-based media store (Section 5.2) and the industrial control system (Section 5.3) in detail. The media store allows us to present all reliability predictions, while the predictions [...] confidentiality reasons.

5.2 Case Study I: Web-based Media Store
The media store model is inspired by common web-service-based data storage solutions and has similar functionality to the ITunes Music Store [2]. The system provides a centralized storage for media files, such as audio or video files, and a corresponding up- and download functionality. The media store product line contains three standard product configurations: standard, comfort, and power. More configurations are possible by instantiating the feature model in Fig. 9.

Figure 9: Feature model of the media store SPL
(Features under Media Store: UserInteractionChoice, FileLoaderChoice, EncoderChoice, DataAccessChoice, and CacheHitRate, with standard, Comfort, Power, and FT alternatives for the individual components. The legend distinguishes Mandatory, Alternative, Optional, and Or features.)

Fig. 10 summarizes the different products and design alternatives of the media store product line. The core functionality is provided through four component types: UserInteraction, FileLoader, Encoding, and DataAccess. For some of these components, alternative comfort or power variants are present in the core asset base for the different product variants (cf. Fig. 9). The [...] and the DataAccess component are deployed on one (or optionally two) separated database server(s).

Figure 10: Media store products and design alternatives

During up- and download of media files, different types of failures may occur in the involved component instances: A BusinessLogicFailure may occur during the processing of user requests in the UserInteraction[Comfort] component. A CacheAccessFailure may occur in the FileLoader[Power] component induced by malfunctioning cache memory. Bugs in the compression algorithm of the Encoder[Power] component may lead to an EncodingFailure. A DataAccessFailure may occur in the DataAccess component due to internal database errors or faults in the database server's file system. Additionally, as hardware failures, a CommunicationFailure, CPUFailure, and/or HDDFailure can occur.

Figure 11: Service behaviour of DataAccessFT
(DataAccessFT.retrieveFile() contains a recovery block: the first behaviour calls DataAccess[Main].retrieveFile(); the second behaviour handles CPUFailure and calls DataAccess[].retrieveFile().)

For illustrative purposes, we set the software-level failure probabilities to $10^{-5}$ for each individual failure occurrence in the model, with the following exceptions distinguishing the products: in the UserInteractionComfort component, the probability of BusinessLogicFailures rises to $10^{-4}$ because of the more complex business logic compared to the standard variant. Compression algorithms are generally complex and may fail with a probability of $10^{-4}$ in the Encoder component, and with $2 \times 10^{-4}$ in the EncoderPower component. For all hardware resources, we assume an MTTF of one year and an MTTR of 50 minutes, implying an availability of 99.99%. In other settings, these values could have been extracted from log files of existing similar systems.

Fault tolerance mechanisms may be optionally introduced into each media store product, in terms of additional components which are shown in grey in Fig. 10. For example, a UserInteractionFT component may be put in front of the UserInteraction[Comfort] component.

Figure 12: Media store prediction results
(Panels: (a) System reliability for all design alternatives; (b) Failure probabilities per failure type; (c) System reliability changes for altered input data; (d) Failure probabilities depending on usage profile. Axes include system reliability, failure probability, product type (standard, comfort, power), failure type, and number of requested files.)

ality, which involves additional computing and requests to the database for storage and retrieval. The power product has the highest system reliability, as the high hit rate in the FileLoaderPower cache decreases the number of necessary database accesses. Employing the DataAccessFT component has the highest effect compared to the design alternatives without fault tolerance. Notice that the FT mechanisms have different influences in the different variants. For example, the UserInteractionFT is most effective for the comfort variant.

Fig. 12(b) provides more detail and shows the probability of a system failure due to a certain failure type. Summarized over all products, CommunicationFailures, CPUFailures and HDDFailures most probably cause a system failure. The risk
It has the ability of a CommunicationFailure is especially high for the comfort to buffer incoming requests, to re-initialise the business logic product, which requires many database accesses and corre- in case of a BusinessLogicFailure, and to retry the failed re- sponding network traffic. Thus, a software architect may quest. As another example, the DataAccessFT component recognize the need to introduce new fault tolerance mecha- may be used to handle CPUFailures on the main DB server nisms for these failures. by redirecting calls to the backup server. Fig. 11 illustrates To demonstrate the robustness of our model to imprecise the service behaviour of the file retrieval call, which con- input data (goal (ii) in Section 5.1), we first examined the sists of a single RecoveryBlockAction with two RecBlock- robustness of the reliability prediction to alterations in the Behaviours. input failure probabilities. We changed the failure probabil- Each described fault tolerance mechanism can be used for ities of the components of the comfort product one at a time each product, and more than one mechanism may be applied by multiplying them with 10−1. We also increased the hard- in parallel. We focus on cases where at most one mechanism ware resource availability 99.99% to 99.999% one at a time is used. for each resource. Fig. 12(c) shows new system reliabilities The usage profile of the media store consists of 20% upload for the different fault tolerance variants, indicating that the calls and 80% download calls, an average of 10 requested ranking of the design alternatives is almost identical over the files per call, and a probability of 0.3 for files to be large, different failure type alterations. The DataAccessFT is al- i.e. requiring compression during upload. We calculated ways top-ranked. However, rank changes do occur in case of the expected system reliability for each product and design altered BusinessLogicFailure and CacheAccessFailure prob- alternative. 
Each calculation took below one second on a abilities, indicating that these probabilities should be esti- standard PC with a 2.2 GHz CPU and 2.00 GB RAM. mated as careful as possible. To provide evidence about the possible decision support As a second sensitivity analysis, Fig. 12(d) focuses on the for different design alternatives (goal (i) in Section 5.1), power product without fault tolerance. It shows the sen- Fig. 12(a) shows the system reliability for each product and sitivity of the failure probabilities per failure type to the fault tolerance alternative. The comfort product has the number of requested files (i.e., a change to the usage pro- lowest reliability, because of the included statistics function- file). For most failure types, the failure probabilities rise <> C7 <> FT2 with the number of files, as more database accesses and mes- <> C5 0.805 <> sage transports over network are required, as well as more <> Action>> 0.193 <> C6 <> Behaviour>> Behaviour>> HDDFailure <> however, is independent0,12% of the number of files, which keeps probability = ... CPUFailure <> C4 Action>> C4' 0,08% DatabaseFailure and HDDFailures0,06% do not influence the system reliability for CacheAccessFailure Server1 Server2 Server3 Server4 more than 8 files.0,04% For cases with 2 to 8EncodingFailure files there is a chance 0,02% C7 FT1 BusinessLogicFailure C4 Ext1 that all files are0,00% found in the cache, and no database access product type C8 standard comfort power C1 FT2 is necessary, thus lowering the failure probability. C6 To analyse prediction accuracy, we ran the reliability Server3' failure probability Ext2 C5 C2 simulation for0,04% eachstandard of the three products in the User- C4' 7 C3 InteractionFT0,03%variantcomfort for 10 simulated seconds, and got a deviation from0,02% the analyticalpower results between 0.0006% and 0.0067%. Fig.0,01% 13 shows the results. 
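Two of the numerical steps in this case study can be retraced with a small sketch: the 99.99% availability implied by an MTTF of one year and an MTTR of 50 minutes, and the one-at-a-time alteration of failure probabilities used in the robustness analysis. This is not the authors' tool. The series model in `system_reliability` and the values in `baseline` are simplifying assumptions chosen for illustration only, whereas the analysis in the paper is based on Markov chains derived from the architecture model.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def steady_state_availability(mttf_minutes, mttr_minutes):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf_minutes / (mttf_minutes + mttr_minutes)


def system_reliability(failure_probs):
    """Toy series model (assumption): success only if no listed failure occurs."""
    reliability = 1.0
    for p in failure_probs.values():
        reliability *= 1.0 - p
    return reliability


def one_at_a_time(failure_probs, factor=0.1):
    """Scale each failure probability by `factor` in turn and recompute."""
    results = {}
    for name, p in failure_probs.items():
        altered = dict(failure_probs)  # copy, so only one entry changes per run
        altered[name] = p * factor
        results[name] = system_reliability(altered)
    return results


# MTTF of one year and MTTR of 50 minutes, as assumed for the hardware resources.
availability = steady_state_availability(MINUTES_PER_YEAR, 50)

# Hypothetical per-request failure probabilities, for illustration only.
baseline = {
    "BusinessLogicFailure": 1e-4,
    "EncodingFailure": 2e-4,
    "CacheAccessFailure": 1e-5,
    "CPUFailure": 1.0 - availability,
}

base_reliability = system_reliability(baseline)
altered_reliabilities = one_at_a_time(baseline)
most_sensitive = max(altered_reliabilities, key=altered_reliabilities.get)

print(f"availability: {availability:.6f}")
print(f"baseline reliability: {base_reliability:.6f}")
for name, value in altered_reliabilities.items():
    print(f"{name} x 0.1 -> {value:.6f}")
```

In this toy model the input with the largest failure probability yields the largest gain when reduced, which mirrors the idea of identifying the inputs that deserve the most careful estimation.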
The ranking of the three considered variants is confirmed by the simulation, which indicates that the analytical results are sufficiently accurate.

Figure 13: Media store simulation results (system reliability, prediction vs. simulation, for the product types standard, comfort, and power in the [UserFT] variant)

The effectiveness of the approach for SPLs (goal (iii) in Section 5.1) is demonstrated by the fact that nearly all model parts can be reused throughout the media store products and design alternatives. Only some ComponentInstances are specific to certain alternatives and need to be connected via additional AssemblyConnectors: the UserInteractionComfort, EncodingPower, FileLoaderPower, and all FT components.

5.3 Case Study II: Industrial Control System

As a second case study, we analysed the reliability of a large-scale industrial control system from ABB, which is used in many different domains, such as power generation, pulp and paper handling, or oil and gas processing. The system is implemented in several million lines of C++ code. On a high abstraction level, the core of the system consists of eight software components that can be flexibly deployed on multiple servers depending on the system capacity required by customers. Fig. 14 depicts a possible configuration of the system with four servers. The names of the components and their failure probabilities have been obfuscated for confidentiality reasons.

Figure 14: Control system product line model with two exemplary service behaviours

The upper part of Fig. 14 shows the ServiceBehaviours for the components C7 and FT2. The components FT1 and FT2 have been introduced into the model inspired by existing FT mechanisms. FT1 is able to restart component C1 upon failed requests. FT2 is able to query two redundant instances of component C4, which are deployed on different servers, thereby implementing fault tolerance against hardware failures.

The reliability of the core system has been analysed in a former study [17], where no fault tolerance mechanisms or product variants were considered. For this case study, we reused the failure probabilities from the former study, which had been determined using software reliability growth models based on bug tracker data. We also reused the transition probabilities between the components, which had been determined using source code instrumentation and system execution. The hardware reliability parameters were based on vendor specifications.

The industrial control system is realized as a product line and sold to customers in different variants depending on their requirements. Fig. 15 shows a small excerpt of variants in terms of a feature model. There are many more possible variants, as third party components can be integrated into the system via standardized interfaces. The components C1-C8 are mandatory. For component C4, there are two alternative implementations (C41 and C42), which address different customer requirements. There are two external components Ext1 and Ext2, which can be optionally included into the core system. The feature model also includes the different FT mechanisms as variants.

Figure 15: Feature model of the control system product line variants (excerpt). Legend: Mandatory, Alternative, Optional, Or. Nodes: Control System; C1, C2, C3, C4, Ext, FT […]; C5, C6, C7, C8, C41, C42, Ext1, Ext2, FT1, FT2.

For the scope of this paper we restricted the reliability analysis to the core system (standard) and three different variants. Variant 1 uses component C42 instead of C41 but is otherwise identical to the core system. Variant 2 incorporates the external component Ext1, which is only connected to component C4 (cf. Fig. 14). Variant 3 incorporates component Ext2, which is connected to components C1, C2, C4, and C6. These variants correspond to realistic configurations, which have formerly been sold to customers.

To demonstrate the decision support for different alternatives (goal (i) in Section 5.1) we analysed how the predicted system reliability varies for the different variants and FT mechanisms (Fig. 16(a)). The actual values are obfuscated for confidentiality reasons. Variant 1 is predicted to be the most reliable. Introducing FT1 generally bears a higher increase in reliability than introducing FT2, which includes adding an additional server for the redundant instance of component C4. The impact on system reliability of FT2 is less pronounced for variant 1 than for the other variants, because it already uses a more reliable version of component C4. Thus, the software architect can decide whether the increased costs for adding an additional server for realising FT2 in this variant are justified.

To show the robustness of the models against imprecise
input parameters (goal (ii) in Section 5.1), we conducted a sensitivity analysis modifying the failure probabilities of selected components (Fig. 16(b)). The system reliability is most sensitive to the component reliability of C1 as the curve has the steepest slope. The system reliability is robust to the component reliability of C5. Overall, the model behaves linearly and the deviations of the system reliability are comparatively small relative to the changes in individual component failure probabilities. In this case, the ranking of the design alternatives remained robust against uncharacteristically high variations of the component failure probabilities.

Figure 16: Exemplary prediction results for the industrial control system. (a) Prediction results for different variants (system reliability per product type: Standard, Variant 1, Variant 2, Variant 3); (b) Sensitivity to component failure probabilities (system reliability over component failure probability).

For a comparison between numerically computed predictions and simulation data, we ran a simulation for each variant for 10^6 simulated seconds. The mean error between the numerically computed and simulated system reliability across all variants was 0.0077 percent. The ranking of the variants was the same for the simulation results as for the numerical results. We conclude that the numerical calculations were sufficiently precise in this case.

To show the effectiveness of the approach for SPLs (goal (iii) in Section 5.1), we quantified the amount of changes necessary to model the product variants in our case. For Variant 1, a single ComponentInstance and AssemblyConnector had to be added to the standard Product and deployed to the respective ResourceContainer. This did not require the adjustment of transition probabilities. For Variants 2 and 3, also only single ComponentInstances had to be added to the standard Product.

6. RELATED WORK

Our method for architectural fault tolerance modelling is related to approaches on software architecture reliability modelling [13, 21], fault tolerance modelling on the level of software architecture [18], and […] for software product lines [7].

Multiple surveys on software architecture reliability modelling are available [11, 13, 15]. R. C. Cheung [6] was among the first to propose architectural reliability modelling using Markov chains. Some recent approaches refine such models to support compositionality [21] and different failure modes [10], but do not regard fault tolerance mechanisms. L. Cheung et al. [5] use hidden Markov models to determine component failure probabilities for Markov chain architecture models. Further approaches in this area apply the UML modelling language [8, 12] or are specifically tailored to service-oriented systems [27], but also do not include fault tolerance mechanisms or support for reusing model artefacts in different contexts, such as product configurations.

Some approaches do tackle the problem of incorporating fault tolerance mechanisms into the architectural prediction models [18]. Sharma and Trivedi [25] include additional states and recovery transitions in an architecture-level Markov model to model component restarts or system reboots. Wang et al. [26] provide constructs for Markov chains to model replicated components. Kanoun et al. [16] model fault tolerance of hardware/software systems using generalized stochastic Petri nets. These approaches do not consider component-internal control and data flow, and how it is influenced by error handling constructs. Thus, they may yield inaccurate predictions when fault tolerant software behaviour deviates from the specific cases considered by the authors. Furthermore, none of these approaches supports reusing model artefacts.

Considering reliability during the design of a software product line is a major challenge, because different product variants may have different influences on the expected reliability. Immonen [14] proposes the 'reliability and availability prediction' (RAP) method for SPLs. RAP, however, does not support compositional models, hardware reliability, or explicit fault tolerance mechanisms. Olumofin et al. [19] tailor the architecture trade-off analysis method to evaluate SPLs for different quality attributes. They focus on the identification of scenarios but provide no architectural model or predictions. Dehlinger et al. [9] introduce the PLFaultCAT tool to analyse SPL safety using fault tree analysis. Their models do not reflect the […] and therefore complicate evaluating different design alternatives. Auerswald et al. [1] model product families of embedded systems using block diagrams, but provide no usage profile model or quantitative reliability prediction.

7. CONCLUSIONS

We presented an approach to support the design of reliable and fault-tolerant software architectures and software families. Our approach allows modelling different architectural alternatives and product line configurations from a shared core asset base and offers a flexible way to include many different fault tolerance mechanisms. A tool transforms the models into Markov chains and calculates the system reliability involving both software and hardware reliabilities. We evaluated our approach in multiple case studies and demonstrated its value to support architectural design decisions, its robustness against imprecise input data, and its effectiveness for SPLs.

Our approach provides a new perspective for designing software architectures and families. It allows software architects to validate their […] during early development stages and supports their design decisions quantitatively. As the effectiveness of different fault tolerance mechanisms is highly context dependent, our approach enables software architects to quickly analyse many different alternatives and rule out poor design choices. This can potentially lead to more reliable systems, which are built more cost-effectively because late life-cycle changes for better reliability can be avoided.

In future work, we aim to include more sophisticated hardware reliability modelling techniques into our approach to offer more refined predictions. We will extend our tool for automated sensitivity analyses and design optimisation. Our prediction approach can potentially be extended for other quality attributes, such as performance or security.

8. ACKNOWLEDGMENTS

This work was supported by the European Commission as part of the EU-projects SLA@SOI (grant No. FP7-216556) and Q-ImPrESS (grant No. FP7-215013). Furthermore, we thank Igor Lankin for his support in developing the approach.

9. REFERENCES

[1] M. Auerswald, M. Herrmann, S. Kowalewski, and V. Schulte-Coerne. Software Product-Family Engineering, volume 2290 of LNCS, chapter Reliability-Oriented Product Line Engineering of Embedded Systems, pages 237–280. Springer, 2001.
[2] S. Becker, H. Koziolek, and R. Reussner. The Palladio Component Model for Model-Driven Performance Prediction. Journal of Systems and Software, 82(1):3–22, 2009.
[3] S. Bernardi, J. Merseguer, and D. Petriu. A dependability profile within MARTE. Software and Systems Modeling, pages 1–24, 2009.
[4] F. Brosch, H. Koziolek, B. Buhnova, and R. Reussner. Parameterized Reliability Prediction for Component-based Software Architectures. In Proc. of QoSA'10, volume 6093 of LNCS, pages 36–51. Springer, 2010.
[5] L. Cheung, R. Roshandel, N. Medvidovic, and L. Golubchik. Early prediction of software component reliability. In Proc. of ICSE'08, pages 111–120. ACM Press, 2008.
[6] R. C. Cheung. A User-Oriented Software Reliability Model. IEEE Trans. Softw. Eng., 6(2):118–125, 1980.
[7] P. Clements and L. Northrop. Software Product Lines: Practices and Patterns. Addison-Wesley, 2001.
[8] V. Cortellessa, H. Singh, and B. Cukic. Early reliability assessment of UML based software models. In Proc. of WOSP'02, pages 302–309. ACM, 2002.
[9] J. Dehlinger and R. R. Lutz. PLFaultCAT: A Product-Line Software Fault Tree Analysis Tool. Automated Software Engineering, 13(1):169–193, 2006.
[10] A. Filieri, C. Ghezzi, V. Grassi, and R. Mirandola. Reliability Analysis of Component-Based Systems with Multiple Failure Modes. In Proc. of CBSE'10, volume 6092 of LNCS, pages 1–20. Springer, 2010.
[11] S. S. Gokhale. Architecture-Based Software Reliability Analysis: Overview and Limitations. IEEE Trans. on Dependable and Secure Computing, 4(1):32–40, 2007.
[12] K. Goseva-Popstojanova, A. Hassan, A. Guedem, W. Abdelmoez, D. E. M. Nassar, H. Ammar, and A. Mili. Architectural-Level Risk Analysis Using UML. IEEE Trans. on Softw. Eng., 29(10):946–960, 2003.
[13] K. Goseva-Popstojanova and K. S. Trivedi. Architecture-based approach to reliability assessment of software systems. Performance Evaluation, 45(2-3):179–204, 2001.
[14] A. Immonen. Software Product Lines, chapter A Method for Predicting Reliability and Availability at the Architecture Level, pages 373–422. Springer, 2006.
[15] A. Immonen and E. Niemelä. Survey of reliability and availability prediction methods from the viewpoint of software architecture. Software and Systems Modeling, 7(1):49–65, 2008.
[16] K. Kanoun and M. Ortalo-Borrel. Fault-tolerant system dependability-explicit modeling of hardware and software component-interactions. IEEE Transactions on Reliability, 49(4):363–376, 2000.
[17] H. Koziolek, B. Schlich, and C. Bilich. A Large-Scale Industrial Case Study on Architecture-based Software Reliability Analysis. In Proc. 21st International Symposium on Software Reliability Engineering (ISSRE'10). IEEE Computer Society, 2010. To appear.
[18] H. Muccini and A. Romanovsky. Architecting Fault Tolerant Systems. Technical Report CS-TR-1051, University of Newcastle upon Tyne, 2007.
[19] F. G. Olumofin and V. B. Misic. Extending the ATAM Architecture Evaluation to Product Line Architectures. In Proc. of WICSA'05, pages 45–56. IEEE Computer Society, 2005.
[20] B. Randell. System structure for software fault tolerance. In Proc. Int. Conf. on Reliable software, pages 437–449. ACM, 1975.
[21] R. H. Reussner, H. W. Schmidt, and I. H. Poernomo. Reliability prediction for component-based software architectures. Journal of Systems and Software, 66(3):241–252, 2003.
[22] N. Sato and K. S. Trivedi. Accurate and efficient stochastic reliability analysis of composite services using their compact Markov reward model representations. In Proc. of SCC'07, pages 114–121. IEEE Computer Society, 2007.
[23] B. Schroeder and G. A. Gibson. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? ACM Trans. Storage, 3(3):8, 2007.
[24] V. Sharma and K. Trivedi. Quantifying software performance, reliability and security: An architecture-based approach. Journal of Systems and Software, 80:493–509, 2007.
[25] V. S. Sharma and K. S. Trivedi. Reliability and Performance of Component Based Software Systems with Restarts, Retries, Reboots and Repairs. In Proc. of ISSRE'06, pages 299–310. IEEE, 2006.
[26] W.-L. Wang, D. Pan, and M.-H. Chen. Architecture-based software reliability modeling. Journal of Systems and Software, 79(1):132–146, 2006.
[27] Z. Zheng and M. R. Lyu. Collaborative reliability prediction of service-oriented systems. In Proc. of ICSE'10, pages 35–44. ACM Press, 2010.