International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1165 ISSN 2229-5518 Dependability Tests Selection Based on the Concept of Layered Networks Andrey A. Shchurov, Radek Mařík

Abstract— Nowadays, the consequences of failure and downtime of distributed systems have become more and more severe. As an obvious solution, these systems incorporate protection mechanisms to tolerate faults that could cause systems failures and system dependability must be validated to ensure that protection mechanisms have been implemented correctly and the system will provide the desired level of reliable service. This paper presents a systematic approach for identifying (1) characteristic sets of critical system elements for dependability testing (single points of failure and recovery groups) based on the concept of layered networks; and (2) the most important combinations of components from each recovery group based on a combinatorial technique. Based on these combinations, we determine a set of test templates to be performed to demonstrate system dependability.

Index Terms— Dependebility testing, distributed systems, formal models, layered networks. ——————————  ——————————

1 INTRODUCTION OMPUTING systems have come a long way from a single and the likely consequences of faults C processor to multiple distributed processors, from indi- • fault tolerance, i.e. avoidance of service failures in the vidual-separated systems to networked-integrated sys- presence of faults as the basic mechanism for achiev- tems, and from small-scale programs to sharing of large-scale resources. Moreover, nowadays virtualization and cloud tech- ing dependability requirements [3]. We need to state nologies make another level of distributed system complexity. here the difference between fault tolerance and high On the other hand, the consequences of failure and downtime availability: a fault tolerant environment has no ser- have become more severe. They might endanger human lives vice interruption, while a highly available environ- and the environment, do serious damage to major economic ment has minimal service interruption. infrastructure, endanger personal privacy, undermine the via- The key factor of fault tolerance (or fault transparency [4] is bility of whole business sectors and facilitate crime [1]. As an obvious solution, systems incorporate protection preventing failures due to system architectures and it address- mechanisms to tolerate faults that could cause systems failures es the fundamental characteristic of dependability require- and, as a consequence, the most difficult part of systems de- ments in two ways [5]: ployment is the question of assurance that system dependabil- • replication, i.e. providing multiple identical instances ity mechanisms (fault toleranceIJSER or high availability) have been of the same component and choosing the correct implemented correctly and a system is able to provide the de- sired level of reliable service. result on the basis of a quorum (voting); Dependability is an integrating concept that unites the at- • redundancy, i.e. providing multiple identical instanc- tributes of reliability, availability, safety, integrity and main- es of the same component and switching to one of the tainability. The original definition of dependability determines remaining instances in case of a failure (failover). the system ability to deliver service that can justifiably be As a consequence, a system must be validated to ensure that trusted [2] (this definition stresses the need for justification of its replication/redundancy mechanism has been correctly im- trust). The engineering definition is simpler – dependability is the ability of a system to avoid service failures or the probabil- plemented and the system will provide the desired level of ity that a system will operate when needed. The four major reliable service. Fault injection (the deliberate insertion of categories of dependability are: faults into a system to determine its response [6] [7]) offers an • fault prevention, i.e. prevention of the occurrence or effective solution to this problem. Fault-injection experiments introduction of faults; provide a means for understanding how these systems behave • fault removal, i.e. reduction of the number and severi- in the presence of faults (the monitoring of the effects the in- ty of faults; jected faults have on the systems final results). Simulated fault • fault forecasting, i.e. estimation of the present number injection (or environment fault injection) [8] [9] [10] [11] can support all system abstraction levels – architectural, function- ———————————————— • Andrey A. Shchurov - Department of Telecommunications Engineering, al, logical, and electrical. This mixed-mode simulation, where Czech Technical University in Prague, The Czech Republic, E-mail: the system is hierarchically decomposed for simulation at dif- [email protected] • Radek Mařík - Department of Telecommunications Engineering, Czech ferent abstraction levels, is particularly useful in the case of Technical University in Prague, The Czech Republic, E-mail: complex distributed systems. [email protected]

IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1166 ISSN 2229-5518 In turn, strategies for the fault-injection experiments are er/subscriber and provider behavior and basically work on generally based on methods for assessing system reliability the principle of evaluating transmission time to compute the (identifying potential faults and determining the resulting execution time of each file or program under real conditions running in a distributed environment. As a consequence, the error effects) [12] [13] [14] [15]. Typically, it is a document- system reliability is based on the operational or usage profile centric evaluation, where a group of engineers evaluates the of the given set of services. system. But in the case of complex or non-standard systems, The common analytical tool for user-centric models is time- personal experience and/or intuition are often inadequate. based models (founded on the queueing theory) [19] [20]. Us- Our main goal is the automated design and generation of test- er-centric approaches can be characterized as multi-stage problem solving processes where the system is conceived in ing procedures/specifications and plans for distributed sys- terms of user behavior. tems based on end-user requirements and technical specifica- tions as a necessary part of project documentation. Thus, to 2.2 Architecture-Based Models accomplish such a goal we need to identify a test strategy for In contrast to the user-centric models, architecture-based fault-injection experiments based on a formal model with the models can be defined as the bottom-up or hardware-based following criteria: (1) it has to cover all aspects of distributed approach (the viewpoint of the engineering community) to the systems [4]; and (2) it has to be simple enough for commercial reliability of distributed systems. In turn, they can be classified application. into: (1) component-oriented models; (2) communication- This paper presents a systematic approach for identifying: oriented models. • characteristic sets of critical elements (hardware and Component-oriented models represent distributed systems software) of the distributed system under test (DSUT) as a composition of multiple processors but completely ignore the failures of communications and assume that the communi- for dependability testing (single points of failure and cation channels (links) among the processors are perfect [21] recovery groups) based on the concept of layered [22] [23]. Without considering communication failures, the networks; exchanged information between components (software and • the most important combinations of components from hardware) must always be correct. In this case, the problem of each recovery group based on the combinatorial (or distributed system reliability can be reduced to a parallel- series structure. In turn, the parallel-series reliability is easy to truth tables) technique. calculate [23] [24] [25] [26]. Such condition may be a good ap- Based on these combinations, we determine the set of test proximation for a system that exchanges only a little infor- templates which should be performed to demonstrate that mation among nodes, such as those where the processors do protection mechanisms for achieving dependability require- only their own jobs (no intensive data transmission). ments (fault tolerance or high availability) have been imple- The analytical tool for component-oriented models is relia- mented correctly. IJSERbility block diagrams (one of the conventional and most com- mon tools of system reliability analysis [25] [26]). The rest of this paper is structured as follows. Section 2 in- In contrast to the component-oriented models, communica- troduces the related work. Section 3 presents a test strategy for tion-oriented models consider the communication failures and fault-injection experiments based on the formal multilayer assume that the components themselves (the nodes of net- model of distributed systems for checklist generation missions works) are always perfect [23] [26] [27]. They suppose that the and analytical tools for system reliability assessment. Section 4 system failures are caused by the communication failures on channels (links) while the components (or nodes) cannot fail introduces an example based on a simple multi-layered sys- during the executing of programs. Such condition is a good tem. Finally, conclusion remarks are given in Section 5. approximation for cases where the communication time dom- inates the time of program execution or the components are highly reliable in comparison to the channels. 2 RELATED WORK The analytical tool for communication-oriented models is Nowadays, models for assessing reliability of distributed network diagrams (commonly used in representing communi- systems can be roughly classified into [16]: cation networks consisting of individual links [23]). • user-centric models; An additional effective analytical tool for architecture-base • architecture-based models; models is fault tree diagrams (the underlying graphical model in fault tree analysis) [23] [25] [28]. Whereas the reliability • state-based models. block diagrams and network diagrams are mission success 2.1 User-Centric Models oriented, the fault tree shows which combinations of the com- Generally, user-centric models can be defined as the top-down ponent failures can lead to system failures. And fault tree dia- or service-oriented approach (i.e. the viewpoint of the busi- grams can describe the fault propagation in systems. Howev- ness/end-users community) to the reliability of distributed er, repair and maintenance (two important operations in sys- systems [17] [18] [19]. As reliability of any system has direct tem analysis) cannot be expressed using a fault tree formula- impact on the system usage, so these models focus on us- tion. IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1167 ISSN 2229-5518 2.3 State-Based Models 3 TEST STRATEGY FOR FAULT-INJECTION The first generation of state-based models that considered EXPERIMENTS both node failures and link failures have a common assump- The essential idea of our approach is based on the concept of tion - the operational of nodes or links are con- layered complex networks [38]. But in contrast to [36] [37], the stant without considering bandwidth and content (constant- core component of our solution is the formal four layered reliability models) [29] [30] [31]. However, this assumption of model for test generation missions [39]. This model is stated as the constant-reliability of elements is not suitable in practice. a four-layered graph as follows: Intuitively, downloading a larger file from a remote site will • The functional (or ready-for-use system) architecture have a higher risk of failure than downloading a smaller file through the same link [32]. layer defines functional components and their The most recent models relax this assumption for the ele- interconnections (end-user requirements ments (nodes and links). Instead, they assume that the failures representation). An important note – this layer can be of elements follow Poisson processes, so that the more time an used for representation of social networks. element works (including execution and communication), the • The service architecture layer defines software-based less reliable that element is [32] [33] [34]. In addition, the tradi- tional models study the network topology by physical links components (services/applications) and their and nodes that are static without considering dynamic chang- interconnections. es of components and logic structures. To solve these prob- • The logical architecture layer defines logical (virtual) lems, recent models use a virtual structure instead of physical components and their interconnections. structure [33] [34]. • The physical architecture layer defines hardware The analytical tool for state-based models is Markov mod- els [23] [25]. To deal with all sorts of errors such as time-out (physical) components and their interconnections. failures, blocking failures, network failures, etc. (which can • The interlayer projections define all types of occur during operations of execution and communication), a components hierarchical (interlayer) hierarchical model must be used. This model suggests tackling relations/mapping. These relations make the layered various errors in different layers and uses Markov state prin- model consistent and represent interlayer ciple to map layers into different physical states [16]. technologies (virtualization, clustering, etc.) used to

In turn, the general approach (common to all types of mod- build DSUT. els) is to treat reliability as a complex problem and to decom- The model formal notation G for each layer n can be repre- sented as: pose the distributed system into a hierarchy of related subsys- n tems or components. Rebaiaia and Ait-Kadi [35] provide a G = (V ,E ,M ,V ) (1) n survey of methods, and software tools. But it is And: n n n n−1 n−1 important to note that the reliability evaluation problem is G = G (2) NP-complete and, as a consequence, the generation of an exact IJSERN solution is very problematic. An interesting solution called ⋃n=1 n G mission-oriented reliability is represented by Wang et al [36] where is multi-layered 3D graph, derived from the system and Luo et al [37] based on the concept of layered networks specification; N is the number of DSUT layers; V is a finite,

[38]. The core component of this solution is a two-layer model non-empty set of components on layer n; E is an finite, non- with the lower-layer topology (physical network) for a physi- empty set of component-to-component connectionsn on layer n; cal graph and the upper-layer topology (mission network) for M is a finite set of component-to-component projection a logical graph. Both graphs have the same nodes and one fromn layer n to layer n-1; and V is a finite set of components edge in the logical graph is related to a path from the source n−1 node to the destination node in the physical graph. In that on layer n-1. n−1 way, the network missions are modeled as a network, and the Applying the requirements-coverage test strategy [40] to mission-oriented network reliability can be calculated on the the model covers each interaction from the end-user require- mission network which consists of many fewer nodes and ments on system, logical and physical architectural layers and, links than the physical network does (reduction of complexi- as a consequence, provides the sets of test templates for each ty). architectural layer: Moreover, reliability of network topologies is the great challenge for all these models. Network communications are T = {(P , c ), (P ,c ),…, (P ,c )} (3) usually considered either (1) as physical communication struc- n n1 n1 n2 n2 nk nk tures based on the properties of communication hardware where P represents the paths (or data flows) in G ; and c are technical characteristics of component-to-component commu- (physical links and nodes); or (2) as virtual communication ni n ni structures (virtual information links) which normally hide the nication processes. In turn, each path P is the set of individu- al components which communicate each other and define this properties of communication hardware. In both cases, the lay- ni ered structure of real communication protocols is completely path (data flow): ignored. P = {v , v , … , v }, v V (4) IJSER © 2015 ni n1 n2 nm ni n http://www.ijser.org ∈ International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1168 ISSN 2229-5518 terms of Boolean equations is that these equations can be This approach allows use of the advantages of (1) the concept reduced to its the most compact form which represents the of layered complex networks [38] and (2) the approach of minimal path sets or “minimal prevention sets” [42]. In our mission-oriented reliability [37] but it covers all layers of OSI case, the most convenient representation of these compact Reference Model [41] and, as a consequence, both software-based forms is the conjunctive normal form (CNF) that provides a and network-based aspects of distributed systems [39]. particular representation of the success tree as a set of sets of The next steps are based on the appropriate analytical tools for basic events. The Boolean expression of the success tree in reliability assessment (analysis is performed independently for conjunctive normal form for each layer n can be written as: each architectural layer): • success (logic) tree approach (as a special case of the S = v (9) fault tree approach); CNF l ri ∗ n i=1 j=1 nij • ⋀ �⋁ � combinatorial (or truth tables) technique for the logic where l is the number of clauses in the CNF expression, and is trees evaluation. the number of literals (or basic events) in each clause; and v푖 푟 ∗ 3.1 Success Tree Approach represents the basic events: nij The system performance can be considered from two opposite 1, component v is in OS viewpoints: the various ways that a system fails or the various v = , v V 0, component v is in FS (10) ways a system succeeds. The success tree approach [28] is a ∗ nij nij nij n deductive process by means of which a desirable event, called And: � nij ∈ the top event, is postulated, and the possible ways for this v | 1 i l; 1 j r {v | 1 i m} V (11)

event to occur are systematically deduced. The success tree, � nij ≤ ≤ ≤ ≤ i� ⊆ ni ≤ ≤ ⊆ n which shows the various combinations success events that This resulting CNF expression defines two characteristic (pre- guarantee the occurrence of the top event, can be logically vention) sets for dependability testing: Single Points of Failure [13]. If a clause of the resulting expres- represented by path sets. The Boolean expression for the sion in CNF is a literal, i.e. it represents a single component of success tree can be written as [42]: DSUT, so this component is a single point of failure. It means all possible paths can be destroyed by removing this compo- S = P P … P (5) nent. As a consequence, components of that kind do not need ∗ ∗ ∗ additional testing but disaster recovery plans as part of project n n1 n2 nk ⋁ ⋁ ⋁ documentation.The set of components that are single points of where S is the top event which denotes the state of the system failure for each layer n: on layern n (the entire systemIJSER is in operational state iff it is in operational state on all layers simultaneously): SPOF = v |1 i l; 1 j r ;; r = 1 , v V (12)

n nij i i nij n 1, system is in operational state (OS) Recovery Groups �[13]. If≤ a clause≤ ≤of th≤e resulting� expression∈ in S = 0, system is in failure state (FS) (6) CNF is a disjunction of literals, i.e. it represents a group of n component of DSUT, so this group provides topological re- � dundancy. It means all possible paths cannot be destroyed by and P represents the path sets of the logical tree on layer n. In ∗ removing a single component of this group – an alternative turn, nieach path set can be written as [42]: path (or paths) still exists. As a consequence, components of that kind need additional testing of protection mechanisms P = v v … v (7) (sensing and switching). The set of components that provide ∗ ∗ ∗ ∗ fault tolerance for each layer n: ni n1⋀ n2⋀ ⋀ nm where v represents the basic events (or the state of individual RG = v |1 i l; 1 j r ; r 2 , v V (13) components)∗ in the success tree on layer n: ni n nij i i nij n For the case of� systems≤ ≤ which≤ can≤ tolerate≥ �failures∈ of k arbi- 1, component v is in OS v = , v V trary components simultaneously: 0, component v is in FS (8) ∗ ni ni ni n � ni ∈ l k, r k + 1 (14) A special case of recovery groups is the set of access points In the case of complex systems that need more than one de- i sirable top event simultaneously (composed of subsystems), or end-user components≥ (hardware≥ and software). As a rule, the resulting expression of the entire system can be defined as these components are starting points of the most paths (data a conjunction of success trees of their subsystems. flows). However, generally fault tolerant design (replication or One of the main purposes of representing logical trees in redundancy) is normally not used for end-user components. This statement is based on two main reasons: IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1169 ISSN 2229-5518 • economic reason – replication/redundancy tends to the reliability expression R for DSUT can be defined based on increase the system cost; the set of components that provide fault tolerance (or clauses • technical reason – replication/redundancy may of the resulting CNF expression – see (15)): increase complexity to the point where the replication/redundancy itself contributes to accidents R = (1 p) p (19) [1]. l ri−ki ri j ri−j ∏i=1 �∑j=0 � j � − � As a consequence, DSUT components of that kind do not con- where is the number of individual components in each tain protection mechanisms and do not need additional test- recovery group of the items that must be in the operational ing. Based on this assumption, access points (end-user com- 푘푖 state for the system to be in the operational state (k-out-of-r ponents) can be eliminated from analysis by converting all 푟푖 literals which represent these components into tautologies. In structure). this case, the resulting Boolean structures for each layer n can In the worst case of = 1 (this case determines the be represented as: maximum number of possible combinations), the number of 푘푖 1, v A combinations in each recovery group is: S = v , v = , A V v , v A (15) CNF l ri ∗∗ ∗∗ nij n n i=1 j=1 nij nij ∗ ∈ n n C = 2 1 | r 2 (20) ⋀ �⋁ � � nij nij n ⊂ ∉ ri where A is the set of access points (or end-user devices) on i − i ≥ However one combination must be excluded from analysis – layer n. n all components of a system are in operational state – this state 3.2 Logic Trees Evaluation is the initial state for fault-injection testing. Thus, the number When calculating the of multiple simultaneous of combinations C that must be tested is: failures, the number of combinations that should be tested can be reduced dramatically. So, for further analysis of the result- C = (2 1) 1 (21) ing Boolean structures in CNF we can use the combinatorial l ri (or truth tables) technique [25]. This method relies on a com- �∏i=1 − � − binatorial to exhaustively generate all probabilisti- Obviously, this result is better than the universal set coverage. cally significant combinations of both failure and success Nevertheless for a large volume of l and/or , the generation events and subsequently to propagate the effect of each com- of this large number of tests is still impractical and, as a 푟푖 bination on the logic tree to determine the state of the top consequence, some additional steps (optimization) must be event. The quantification of logic trees based on the combina- used. torial method yields exact results and these are associated with a specific physical state of DSUT [25]. The next step of our approach is based on two basic assumptions: IJSER• The sum of the probabilities of all possible combinations is Real distributed engineering systems are usually unity because the combinations are all mutually exclusive and under great financial and timing constraints and, as a cover all event space [25]. For the case of independent consequence, they consist of the smallest possible set identical units (IIU) with reliability of p, the universal set for of components with a minimal number of DSUT can be represented as: communication links as a tradeoff between cost and reliability requirements, i.e. their topologies can be (1 p) p = 1 (16) represented by Harary graphs [43] (connected simple m m k m−k graphs with a minimal number of edges) with the ∑k=0� k � − where k is the number of failed components, and m is the additional links based on technological requirements. • number of DSUT individual components defined by the It is not necessary to cover all possible successful resulting Boolean structure in CNF – see (15): combinations but the most important (the most likely) only [44]. In turn, the most important combination can be derived from end-user requirements as the m = r (17) l number of failures which a system is able to tolerate ∑i=1 i Theoretically testing activities should cover the universal set. simultaneously. However, in this case the number of combinations C that should We need to state here: be tested is: • Big data centers which can contain thousands of C = = 2 (18) servers (hardware and software instances) are beyond m m m the scope of this work due to possible ∑k=0� k � For independent identical units (IIU) with reliability of p, legislation challenges [45].

IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1170 ISSN 2229-5518

Fig. 1. Deviation of the system reliability assessment. in the case of Fig. 2. Deviation of the system reliability assessment. in the case of low-quality components (0.6 ≤ p ≤ 0.9). high-quality components (0.9 ≤ p < 1).

Nevertheless, they can be divided into small The typical reliability values (examples) of COTS equipment subsystems based on individual functional tasks (hardware) is shown in Table 1. Unfortunately, it is impossible (end-user requirements). to find similar data for COTS software. Nevertheless, for sta- ble releases of software components (services and/or plat- • Specific areas (the military, nuclear or aerospace forms – operating systems/firmware) the reliability function industries) are beyond the scope of this work. becomes dominated by hardware failures and the impact of The most typical values of the number of clauses (after elimi- software failures becomes smaller with respect to the total nation of end-user components) are based on architectural component failure rate [26] [47]. So, analysis of the physical solutions: architecture layer can be applicable to the entire system.

• The physical architecture layer. The assumption is

based on the hierarchical design model [46]. The 1. In the case of a system which can tolerate failures of a single arbitrary component (or just have no single points of failures), upper bound represents core, distribution the most important states that should be covered by tests are: (aggregation) and access network layers with an • a system is in operational state and all components are additional layer for server hardware and the lower in operational state; bound represents the fact that even the simplest • a system is in operational state and a single arbitrary architecture consist of at least one network component is in failure state. component and at least one server: 2 l 4. A special case (or an exception) is if solutions are based on • The logical architectureIJSER layer. It is similar to the virtualization technologies: they must tolerate failures of a ≤ ≤ physical architecture layer but based on virtual single arbitrary physical server (hardware failure) while an- components: 2 l 4. other one is in maintenance mode [48]. However it is a ques- tion of the resources sharing/allocation, not specific protection • The service architecture layer. The assumption is ≤ ≤ mechanisms. based on the three-tiered architecture [4]. The upper The system reliability assessment R (the probability of bound represents user-interface, processing and data finding DSUT in these states) is: levels and the lower bound represents the fact that in L1 the case of the simplest client-server model clients can R = p + [r (1 p)p ]p = p + m(1 p)p (19) directly communicate with servers: 1 l 3. m l ri−1 m−ri m m−1 InL1 turn, the∑ deviationi=1 i − of the system reliability assessment− D In turn, the most typical values of the number of literals (indi- vidual components) in each clause (after elimination≤ ≤ of end- can be defined as: user components) are based on technological solutions: D = 100% (20) • Redundancy [5]. Two identical instances of the same R−RL1 component in active/standby (switching to the In the case of low-qualityR individual components remaining instances in case of a failure) or (0.6 ≤ p ≤ 0.9), the result is shown in Fig. 1. In turn, Fig. 2 active/active mode, i.e. 1-out-of-2 structure: r = shows the result of high-quality components (0.9 ≤ p < 1). Based on information from Table 1, the average value of the 2, k = 1. i system reliability assessment deviation in the case of COTS • Replicationi [5]. Three identical instances of the same equipment is less than 1%. component in active mode (choosing the correct result As a consequence, the number of combinations C that on the basis of a quorum), i.e. 2-out-of-3 structure: should be tested based on the system reliability assessment is r = 3, k = 2. quite trivial compared with the universal set coverage:

i i IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1171 ISSN 2229-5518

TABLE 1 RELIABILITY OF COTS EQUIPMENT (EXAMPLES)

* Data sources: Cisco - [49] [50] [51]; Dell - [52]; Plextor - [53].

C = m, m 2 (21)

≥ But this trivial result covers at least 99% of specific states of DSUT.

2. In the specific case of reliable systems which can tolerate failures of up to two arbitrary components simultaneously, the most important states that should be covered by tests are: Fig. 3. A simple multi-layered system (an example). • a system in operational state and all components of a system are in operational state; • a system in operational state and a single arbitrary • switching mechanism verification – the switching component is in failure state; mechanism is able to reconfigure DSUT topology (re- • a system in operational state and two arbitrary route data flows) due to the component failure. components are in failure state.

So, the system reliability assessment can be represented as: Component Repair ( ). In turn, this step determines the two 퐿2 interrelated activities:퐹퐹퐹 (푅 ) R = p + m[(1 p)p ] + [(1 p) p ] (22) • sensing mechanism verification – the sensing m m−1 m m−1 2 m−2 L2 IJSER mechanism is able to (1) detect a component resurrection, − 2 − In this case, the number of combinations that should be tested and trigger the switching mechanism (if necessary); C is: • switching mechanism verification – the switching ( ) ( ) C = m + = , m 3 (23) mechanism is able to restore DSUT initial topology (if m m−1 m m+1 necessary). 2 2 ≥ It might be difficult to say whether this result is acceptable for As a consequence, each fault injection action/step defines two commercial application. A possible solution is using two test templates and the set of test templates T for fault FIJ simple (which can tolerate failures of a single arbitrary injection experiments can be defined as: n component) systems in parallel instead of a complex one. T = FIJ v , FIJ(v ) | 1 i l; 1 j r ; r 2 (24) 3.3 Fault-Injection Experiments FIJ So, inn the case ofnij systems nijwhich can tolerate failuresi i of a single Based on its nature, the dependability testing (or testing of the �� � � � ≤ ≤ ≤ ≤ ≥ � sensing and switching protection mechanisms) must include arbitrary component, the number of test templates for fault two main steps: injection experiments (dependability testing) |T | is: FIJ Component Fault Injection ( ). This step determines two |T | = 2(|RG | |A |) = 2(|V | |SPOF | |A |) (25) interrelated activities that cannot be divided: FIJ N N 퐹퐹퐹 The next∑ stepn=1 is basedn − on nthe following∑n=1 nassumptions:− n − n • sensing mechanism verification – the sensing • The number of DSUT layers is limited by the system mechanism is able to (1) detect a component failure, model (see [39]): N = 3 – the Functional architecture and (2) trigger the switching mechanism; layer does not define real (software or hardware) components.

IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1172 ISSN 2229-5518 TABLE 2.1 TABLE 2.2 APPLICATION OF THE REQUIREMENTS-COVERAGE STRATEGY. APPLICATION OF THE REQUIREMENTS-COVERAGE STRATEGY. BASIC SUBSYSTEM (END-USER REQUIREMENT): ADDITIONAL SUBSYSTEM (TECHNICAL REQUIREMENT DERIVED FROM [SERVICE_SUBSCRIBER_X, SERVICE_PROVIDER_X] END-USER REQUIREMENT): [WEB_CLIENT_X, DNS_SERVER_X]

• The maximum possible number of tests templates SPOF = {VLAN_1, VServer_1,VServer_2} • the set of components that provide fault tolerance (a (dependability testing) can be achieved if: (1) the 2 number of end-users (access point) which must be recovery group): Physical layer: eliminated from analysis has the minimal value: RG = {Switch_1, Switch_2,Server_1, Server_2} | | = 1; 1 3; and (2) the number of single 1 points of failure has the minimal value: As a result we have a set of objects which need additional (de- 퐴푛 ≤ 푛 ≤ | | = 0; 1 3. pendability) testing of sensing and switching protection mecha- nisms. So: 푛 IJSER 푆푆푆� ≤ 푛 ≤ A formal set of test templates covers all successful DSUT op- |T | 2(|V | 1) < 2|V | (26) eration status - see Table 4 NN 1 – 3, 5 – 7 and 9 – 11. The system FIJ 3 3 ≤ ∑n=1 n − ∑n=1 n reliability is 96.93723651%. In turn, in the specific case of systems which can tolerate fail- In turn, the optimized set of test templates covers the most ures of up to two arbitrary components simultaneously, the important only - see Table 4} NN 1 – 3, 5 and 9. The system reli- number of tests templates for fault injection experiments is: ability assessment is 95.93835902%. So, based on these optimization steps, we can reduce the ( ) |T | 2 |V | 1 < 2|V | (27) number of tests templates from 20 to 8 (2.5 times) with the FIJ 3 2 3 2 ≤ ∑n=1 n − ∑n=1 n Reliability Assessment Deviation 1.030%. 4 CASE STUDY As a practical example, we have a simple multi-layered system 4 CONCLUSION (see Fig. 3). The requirements-coverage strategy application to Deployment of distributed systems sets high requirements for this simple example defines data flows which cover the model based on requirements – see Table 2.1 and Table 2.2. procedures, tools and approaches for complex testing of these The resulting Boolean structures in conjunctive normal form systems. And the most difficult part of systems deployment is after elimination of end-user components for each layer are the question of assurance that system dependability protection shown in Table 3. In turn, these structures define characteristic mechanisms (fault tolerance or high availability) have been sets for dependability testing: implemented correctly and a system is able to provide the • two sets of single points of failure: desired level of reliable service. Service layer: This paper presents a systematic approach for identifying SPOF = {WEB_Server_1, DNS_Server_1} critical elements based on the concept of layered networks [38]. Logical layer: 3 The key component is the formal four layered model for test

IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1173 ISSN 2229-5518 TABLE 3 TABLE 4 RESULTING BOOLEAN EXPRESSIONS FOR DEPENDABILITY ANALYSIS COMBINATORIAL METHOD (TRUTH TABLE)

generation missions [39]. This model is the four-layered 3D graph, derived from the system technical specifications, which covers all layers of OSI Reference Model [41].

Applying the requirements-coverage test strategy [40] to the model covers each interaction from the end-user requirements * Probability of system operatin status is calculated based on data from Table 1. on system, logical and physical architectural layers and, as a REFERENCES consequence, provides the sets of paths (or data flows) for each layer (each path is the set of individual components which [1] N.G. Leveson, Safeware: system safety and computers, ACM, 1995. communicate each other and define this path). [2] A. Avizienis, J.-C. Laprie, B. Randell and C. Landwehr, "Basic The next steps are based on the analytical tools for reliability concepts and taxonomy of dependable and secure computing," Dependable and Secure Computing, IEEE Transactions on, vol. 1, no. 1, assessment (analysis is performed independently for each pp. 11-33, January 2004. architectural layer): [3] A. Bucchiarone, H. Muccini and P. Pelliccione, "Architecting Fault- • Success (logic) tree approach (as a special case of the tolerant Component-based Systems: From Requirements to Testing," fault tree diagrams) allows us to represent the sets of Electron. Notes Theor. Comput. Sci., vol. 168, pp. 77-90, February 2007. paths (or data flows) as a Boolean structure in [4] A.S. Tanenbaum and M.v. Steen, Distributed Systems: Principles and conjunctive normal form (CNF) and, as a Paradigms, 3rd ed., Prentice Hall Press, 2013. consequence, definesIJSER two characteristic sets for [5] H. Langmaack, W.-P. d. Roever and J. Vytopil, Eds., Formal dependability testing: (1) sets of single points of Techniques in Real-Time and Fault-Tolerant Systems: Third failure; (2) sets of components that provide fault International Symposium Organized Jointly with the Working Group tolerance (recovery groups). Provably Correct Systems, Springer-Verlag, 1994. • In turn, combinatorial (or truth tables) technique for [6] J.A. Clark and D.K. Pradhan, "Fault injection: a method for validating computer-system dependability," Computer, vol. 28, no. 6, pp. 47-56, the logic trees evaluation defines the most important June 1995. combinations of components from each recovery [7] D. Avresky, J. Arlat, J.-C. Laprie and Y. Crouzet, "Fault injection for group that must be tested. formal testing of fault tolerance," Reliability, IEEE Transactions on, vol. Based on these combinations, we determine the set of test 45, no. 3, pp. 443-455, September 1996. templates which should be performed to demonstrate that [8] D.d. Andres, J.-C. Ruiz, D. Gil and P. Gil, "Fault Emulation for protection mechanisms for achieving dependability require- Dependability Evaluation of VLSI Systems," Very Large Scale ments (fault tolerance or high availability) have been imple- Integration (VLSI) Systems, IEEE Transactions on, vol. 16, no. 4, pp. 422- mented correctly. 431, April 2008. This approach allows use of the advantages of (1) the con- [9] C. Bolchini, A. Miele and D. Sciuto, "Fault Models and Injection cept of layered complex networks and (2) the approach of mis- Strategies in SystemC Specifications," in Digital System Design Architectures, Methods and Tools, 2008. DSD '08. 11th EUROMICRO sion-oriented reliability – reduction of complexity - but it co- Conference on, 2008. vers all layers of OSI Reference Model and, as a consequence, [10] D. Lee and J. Na, "A Novel Simulation Fault Injection Method for both software-based and network-based aspects of distributed Dependability Analysis," Design Test of Computers, IEEE, vol. 26, no. 6, systems. pp. 50-61, 2009. [11] L. Entrena, M. Garcia-Valderas, R. Fernandez-Cardenal, A. Lindoso, M. Portela and C. Lopez-Ongil, "Soft Error Sensitivity Evaluation of

IJSER © 2015 http://www.ijser.org International Journal of Scientific & Engineering Research, Volume 6, Issue 1, January-2015 1174 ISSN 2229-5518 Microprocessors by Multilevel Emulation-Based Fault Injection," [31] Y.S. Dai, M. Xie, K.L. Poh and G.Q. Liu, "A study of service reliability Computers, IEEE Transactions on, vol. 61, no. 3, pp. 313-322, 2012. and availability for distributed systems," Rel. Eng. & Sys. Safety, vol. [12] O.P. Yadav, N. Singh and P.S. Goel, "Reliability demonstration test 79, no. 1, pp. 103-112, 2003. planning: A three dimensional consideration," [32] Y.S. Dai, M. Xie and K.L. Poh, "Reliability of grid service systems," & System Safety, vol. 91, no. 8, pp. 882-893, 2006. Computers & Industrial Engineering, vol. 50, no. 1-2, pp. 130-147, 2006. [13] E. Bauer and R. Adams, Reliability and Availability of Cloud [33] Y.S. Dai, Y. Pan and X. Zou, "A Hierarchical Modeling and Analysis Computing, 1st ed., Wiley-IEEE Press, 2012. for Grid Service Reliability," IEEE Trans. Computers, vol. 56, no. 5, pp. [14] K. Benz and T. Bohnert, "Dependability Modeling Framework: A Test 681-691, 2007. Procedure for High Availability in Cloud Operating Systems," in [34] Y.S. Dai, B. Yang, J. Dongarra and G. Zhang, "Cloud Service Vehicular Technology Conference (VTC Fall), 2013 IEEE 78th, 2013. Reliability: Modeling and Analysis," 2010. Available: [15] S. Reiter, M. Pressler, A. Viehl, O. Bringmann and W. Rosenstiel, http://www.techrepublic.com/resource-library/whitepapers/ "Reliability assessment of safety-relevant automotive systems in a [35] M.-L. Rebaiaia and D. Ait-Kadi, "Network Reliability Evaluation and model-based design flow," in Design Automation Conference (ASP- Optimization: Methods, Algorithms and Software Tools," CIRRELT, DAC), 2013 18th Asia and South Pacific, 2013. 2013. [16] W. Ahmed and Y.W. Wu, "A Survey on Reliability in Distributed [36] X. Wang, N. Huang, W. Chen and R. Li, "A new method for Systems," J. Comput. Syst. Sci., vol. 79, no. 8, pp. 1243-1255, December evaluating the performance reliability of communications network," 2013. in Information Networking and Automation (ICINA), 2010 International [17] E. Huedo, R.S. Montero and I.M. Llorente, "Evaluating the reliability Conference on, 2010. of computational grids from the end users point of view," Journal of [37] Q. Luo, M. Chen, X. Yin and H. Deng, "Testing mission-oriented Systems Architecture, vol. 52, no. 12, pp. 727-736, 2006. network reliability via hierarchical mission network," in Quality, [18] V. Cortellessa and V. Grassi, "Reliability Modeling and Analysis of Reliability, Risk, Maintenance, and (ICQR2MSE), 2012 Service-Oriented Architectures," in Test and Analysis of Web Services, L. International Conference on, 2012. Baresi and E. d. Nitto, Eds., Springer Berlin Heidelberg, 2007, pp. 339- [38] M. Kurant and P. Thiran, "Layered Complex Networks," Phys. Rev. 362. Lett., vol. 96, no. 13, April 2006. [19] Z. Zheng and M.R. Lyu, "Collaborative Reliability Prediction of [39] A.A. Shchurov, "A Formal Model of Distributed Systems For Test Service-oriented Systems," in Proceedings of the 32Nd ACM/IEEE Generation Missions," International Journal of Computer Trends and International Conference on , 2010. Technology (IJCTT), vol. 15, no. 6, pp. 128-133, September 2014. [20] D.J. Chen, M.C. Sheng and M.S. Horng, "Real-time distributed [40] A.A. Shchurov and R. Marik, "A Formal Approach to Distributed program reliability analysis," in Parallel and Distributed Processing, System Tests Design," International Journal of Computer and Information 1993. Proceedings of the Fifth IEEE Symposium on, 1993. Technology (IJCIT), vol. 3, no. 4, pp. 696-705, July 2014. [21] J.-C. Laprie and K. Kanoun, "X-ware reliability and availability [41] ISO/IEC, ITU-T Rec. X.200 - ISO/IEC 7498:1994 Information technology - modeling," Software Engineering,IJSER IEEE Transactions on, vol. 18, no. 2, Open Systems Interconnection - Basic Reference Model, 1994. pp. 130-147, 1992. [42] M. Stamatelatos, W. Vesely, J. Dugan, J. Fragola, J. Minarick and J. [22] Y.S. Dai, M. Xie, K.L. Poh and S.H. Ng, "A model for correlated Railsback, Fault Tree Handbook with Aerospace Applications, NASA, failures in N-version programming," IIE Transactions, vol. 36, no. 12, 2002. pp. 1183-1192, 2004. [43] M.v. Steen, and Complex Networks: An Introduction, [23] M. Xie, K.L. Poh and Y.S. Dai, Computing System Reliability: Models 1st ed., Maarten van Steen, 2010. and Analysis, Springer, 2004. [44] K. Dooley, Designing Large Scale LANs, O’Reilly Media, 2001. [24] M.L. Shooman, Reliability of Computer Systems and Networks: Fault [45] T. Erl, R. Puttini and Z. Mahmood, Cloud Computing: Concepts, Tolerance, Analysis, and Design, John Wiley & Sons, 2002. Technology & Architecture, 1st ed., Prentice Hall Press, 2013. [25] M. Modarres, M. Kaminskiy and V. Krivtsov, Reliability Engineering [46] Cisco Systems, "Campus Design Summary," 2014. And Risk Analysis: A Practical Guide, 2nd ed., CRC Press, 2010. [47] K.V. Vishwanath and N. Nagappan, "Characterizing Cloud [26] M.L. Ayers, Telecommunications System Reliability Engineering, Computing Hardware Reliability," in Proceedings of the 1st ACM Theory, and Practice, 1st ed., Wiley-IEEE Press, 2012. Symposium on Cloud Computing, 2010. [27] D. J. Chen and T. H. Huang, "Reliability analysis of distributed [48] Cisco Systems, "Data Center Technology Design Guide," 2013. systems based on a fast reliability algorithm," Parallel and Distributed [49] Cisco Systems, "Cisco Catalyst 3650 Series Switches Data Sheet". Systems, IEEE Transactions on, vol. 3, no. 2, pp. 139-154, 1992. [50] Cisco Systems, "Cisco Catalyst 2960-X Series Switches Data Sheet". [28] K.G. Stephens, A Fault Tree Approach to Analysis of Behavioral Systems: An Overview, Distributed by ERIC Clearinghouse, 1974. [51] Cisco Systems, "Cisco Aironet 1140 Series Access Point Data Sheet". [29] P. Kumar, S. Hariri and S.C. Raghavendra, "Distributed program [52] Dell, "Dell Power Solutions Whitepaper". reliability analysis," Software Engineering, IEEE Transactions on, Vols. [53] Plextor, "Plextor M5 Pro Solid-State Drive Whitepaper". SE-12, no. 1, pp. 42-50, 1986. [30] Y.S. Dai, M. Xie and K.L. Poh, "Reliability Analysis of Grid Computing Systems," in PRDC, 2002. IJSER © 2015 http://www.ijser.org