The Complexity of Adding Failsafe Fault-Tolerance Sandeep S

The Complexity of Adding Failsafe Fault-Tolerance Sandeep S. Kulkarni Ali Ebnenasir Department of Computer Science and Engineering Michigan State University East Lansing MI 48824 USA Abstract in simplifying the reuse of the techniques used in manually In this paper, we focus our attention on the problem of automating adding fault-tolerance. We expect that the same advantage the addition of failsafe fault-tolerance where fault-tolerance is will apply in the automated addition of fault-tolerance. added to an existing (fault-intolerant) program. A failsafe fault- The main difficulty in automating the addition of fault- tolerant program satisfies its specification (including safety and tolerance, however, is the complexity involved in this liveness) in the absence of faults. And, in the presence of faults, it process. In [2], Kulkarni and Arora showed that the problem satisfies its safety specification. We present a somewhat unexpected of adding masking fault-tolerance –where both safety and result that, in general, the problem of adding failsafe fault-tolerance liveness are satisfied in the presence of faults– is NP-hard. in distributed programs is NP-hard. Towards this end, we reduce We find that there are three possible options to deal with this the 3-SAT problem to the problem of adding failsafe fault-tolerance. complexity: (1) develop heuristics under which the synthesis We also identify a class of specifications, monotonic specifications and a class of programs, monotonic programs. Given a (positive) algorithm takes polynomial time, (2) consider a weaker form monotonic specification and a (negative) monotonic program, we of fault-tolerance such as failsafe –where only safety is show that failsafe fault-tolerance can be added in polynomial time. satisfied in the presence of faults, or nonmasking -where We note that the monotonicity restrictions are met for commonly safety may be violated temporarily if faults occur, or (3) encountered problems such as Byzantine agreement, distributed identify a class of specifications and programs for which the consensus, and atomic commitment. Finally, we argue that the addition of fault-tolerance can be performed in polynomial restrictions on the specifications and programs are necessary to add time. failsafe fault-tolerance in polynomial time; we prove that if only The first approach was used in [3], where Kulkarni, Arora and one of these conditions is satisfied, the addition of failsafe fault- Chippada presented heuristics that are applicable to several tolerance is still NP-hard. Keywords : Fault-tolerance, Formal methods, Program problems including byzantine agreement. In polynomial synthesis, Program transformation, Distributed time, their algorithm finds a fault-tolerant program or it programs declares that a fault-tolerant program cannot be synthesized. In this paper, we focus on the other two approaches. 1 Introduction Regarding the second approach, we focus our attention on We focus on the automation of failsafe fault-tolerant the design of failsafe fault-tolerance. By adding failsafe programs, i.e., programs that satisfy their safety specification fault-tolerance in an automated fashion, we can simplify – if faults occur. We begin with a fault-intolerant program and partly automate– the design of masking fault-tolerant and systematically add fault-tolerance to it. The resulting programs. More specifically, the algorithm that automates the program, thus, guarantees that if no faults occur then the addition of failsafe fault-tolerance and the stepwise method specification is satisfied. However, if faults do occur then for designing masking fault-tolerance [1] can be combined at least the safety specification is satisfied. to partially automate the design of masking fault-tolerant There are several advantages of such automation. For one, programs. The algorithm in [1] shows how a masking fault- the synthesized program is correct by construction and there tolerant program can be designed by first designing a failsafe is no need for its correctness proof. Second, since we begin (respectively, nonmasking) fault-tolerant program and then with an existing fault-intolerant program, the derived fault- adding nonmasking (respectively, failsafe) fault-tolerance to tolerant program reuses it. Therefore, it would be possible it. Thus, given an algorithm that automates the addition to add fault-tolerance even to programs for which the entire of failsafe fault-tolerance, we can automate one step of specification is not available or where the existing program designing masking fault-tolerance. is the de-facto specification. Third, in this approach, the In our investigation, we find that the design of distributed concerns of the functionality of a program and its fault- failsafe fault-tolerant programs is also NP-hard. To show this, tolerance are separated. This separation is known to help [1] we provide a reduction from 3-SAT to the problem of adding 1Email: [email protected], [email protected] failsafe fault-tolerance. Web: http://www.cse.msu.edu/˜{sandeep,ebnenasi} To deal with the complexity involved in automating the Tel: +1-517-355-2387, Fax: 1-517-432-1061 addition of failsafe fault-tolerance, we follow the third This work was partially sponsored by NSF CAREER CCR-0092724, approach considered above. Specifically, we identify the DARPA contract F33615-01-C-1901, ONR Grant N00014-01-1-0744, and a grant from Michigan State University. restrictions that can be imposed on specifications and fault- intolerant programs in order to ensure that failsafe fault- - 2 * tolerance can be added in polynomial time. Towards this , , that it can write. Also, process consists of a set of 0132¨¤4¢ 5 end, we identify a class of specifications, namely monotonic transitions ./, ; each transition is of the form where % ¤/ ' ¢ ( specifications, and a class of programs, namely monotonic 2 . We address the effect of read/write restrictions . ( programs. Given a (positive) monotonic specification and a on , in Section 2.2. The transitions of , , is the union of (negative) monotonic program, we show that failsafe fault- the transitions of its processes. tolerance can be added in polynomial time. Finally, we In this paper, in most situations we are interested in the state note that the class of monotonic specifications contains well- space of and all its transitions. Hence, unless we need to recognized [4–6] problems of distributed consensus, atomic talk about the transitions of a particular process or the values commitment, and byzantine agreement. of the particular variables, we simply let program be the '( .7( We also argue that the restrictions imposed on the tuple ')(6¤ ./(¨ where is a finite set of states and is a % ')(6? specification and the fault-intolerant program are necessary. subset of 890:2;¤4¢<5>=¨ 2;¤4¢ . ' More specifically, we show that if restrictions are imposed A state predicate of is any subset of ( . A state ' . only on the specification (respectively, the fault-intolerant predicate is closed in the program (respectively, ( ) iff % % % 0@A0: ¤/ 5BC01 ¤4 5 . 0: 'EDF 'G5 5 ¢ 2 ¢ ( 2 ¢ program) then the problem of adding failsafe fault-tolerance 2 . A sequence ¤4 ¤ © © ©H ¢ is still NP-hard. of states, 2 , is a computation of iff the following % .7( Organization of the paper. This paper is organized two conditions are satisfied: (1) 6*I¨*KJMLNO0:,@P¢¤/7,35 , 3Q as follows: In Section 2, we provide a few basic concepts and (2) if 32;¤/¢¤ © © ©H is finite and terminates in state then % 013QR¤45 .7( such as programs, computations, specifications, faults and there does not exist state such that . fault-tolerance. In Section 3, we state the problem of The projection of program on state predicate ' , denoted as % % ' '(¤<8£012¥¤4¢<5SO01 2£¤4¢ 5 .7(UTV2¨¤/¢ adding failsafe fault-tolerance. In Section 4, we prove the = , is the program = ' ' NP-completeness of the problem of adding failsafe fault- 'B? . I.e., consists of transitions of that start in ' 0XW ' ¤4. ¦5 ( tolerance. In Section 5, we precisely define the notion and end in . Given two programs, ( and )Y Y Y Y[Z Y Y Z 01W ' ¤4. ¦5 ' W\' . ( ( ( ( ( of monotonic specifications and monotonic programs, and ( , we say iff and . show their necessity and sufficiency for adding failsafe fault- Notation. When it is clear from context, we use and . ( tolerance in polynomial time. Finally, we make concluding interchangeably. Also, we say that a state predicate ' is true % ' remarks in Section 6. in a state iff . 2.2 Issues of Distribution 2 Preliminaries Now, we present the issues that distribution introduces during In this section, we give formal definitions of programs, the addition of fault-tolerance. More specifically, we identify problem specifications, faults, and fault-tolerance. The how read/write restrictions on a process affect its transitions. programs are specified in terms of their state space and their Write restrictions. Given a transition 0:32;¤4¢<5 , it is transitions. The definition of specifications is adapted from straightforward to determine the variables that need to be ¢ Alpern and Schneider [7]. The definition of faults and fault- changed in order to modify the state from 32 to . Thus, tolerance is adapted from Arora and Gouda [8] and Kulkarni the write restrictions amount to ensuring that the transitions [1]. The issues of modeling distributed programs is adapted of a process only modify those variables that it can write. from [2]. A similar modeling of distributed programs in More specifically, if process * can only write the variables in - - , read/write atomicity was independently identified by Attie , and the value of a variable other than that in is changed 0: ¤/ 5 ¢ and Emerson [9]. in the transition 2 then that transition cannot be used * in obtaining the transitions of * . In other words, if can Most definitions in this section are straightforward, and are - * only write variables in , then cannot use the transitions in ! - included to make the paper self-contained. A reader who is - ,5 familiar with this area can skip this section if necessary.

Load more