Failure Detectors for Wireless Sensor-Actuator Systems

Cleveland State University
EngagedScholarship@CSU

Electrical Engineering & Computer Science Faculty Publications
Electrical Engineering & Computer Science Department

7-2009

Hamza A. Zia, Cleveland State University
Nigamanth Sridhar, Cleveland State University
Shivakumar Sastry, University of Akron

Part of the Digital Communications and Networking Commons

Publisher's Statement
NOTICE: this is the author's version of a work that was accepted for publication in Ad Hoc Networks. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Ad Hoc Networks, 7, 5, (07-01-2009); 10.1016/j.adhoc.2008.09.003

Original Citation
Zia, H. A., Sridhar, N., & Sastry, S. (2009). Failure detectors for wireless sensor-actuator systems. Ad Hoc Networks, 7(5), 1001-1013. doi:10.1016/j.adhoc.2008.09.003

Repository Citation
Zia, Hamza A.; Sridhar, Nigamanth; and Sastry, Shivakumar, "Failure Detectors for Wireless Sensor-Actuator Systems" (2009). Electrical Engineering & Computer Science Faculty Publications. 64.
https://engagedscholarship.csuohio.edu/enece_facpub/64

This Article is brought to you for free and open access by the Electrical Engineering & Computer Science Department at EngagedScholarship@CSU. It has been accepted for inclusion in Electrical Engineering & Computer Science Faculty Publications by an authorized administrator of EngagedScholarship@CSU.

Hamza A. Zia (a), Nigamanth Sridhar (a,*), Shivakumar Sastry (b)

(a) Electrical and Computer Engineering, Cleveland State University, Stilwell Hall, 2121 Euclid Avenue, Cleveland, OH 44115, United States
(b) Electrical and Computer Engineering, The University of Akron, United States

1. Introduction

Wireless sensor-actuator systems (WSAS) have emerged as an important platform that enables unprecedented levels of fine-grained visibility and control over our environment [1-6]. There are several small and inexpensive devices that are capable of capturing and processing sensor inputs. In addition, the availability of protocols [7-10] and algorithms that address challenges such as high failure rates, limited resources, and large scale allows us to develop and deploy effective sensornets [11-14]. In addition, there are several competing software platforms for programming sensor network applications [15-21].

Despite such advances, large-scale networks of sensors are incredibly hard to design and deploy. The time and effort required to design and implement software for such systems are not proportional to the size of the software itself: most of these applications are only a few hundreds of lines of source code, but take a disproportionate amount of time (several weeks, even months) to develop. An important reason for this incongruity is that the set of design criteria that developers must think about is quite different in the context of WSAS than in the context of enterprise systems. As opposed to internet-scale network applications, WSAS applications need to consider fault-tolerance as a primary design consideration. Failure is not the exception: it is the norm. This necessitates the designer of a WSAS application to include failure while she thinks about the functional behavior of the application. This causes the design space to be cluttered with functional requirements as well as fault-tolerance requirements.

In this paper, we argue that a reusable class of failure detectors can simplify the design and deployment of sensornets. We focus on a class of failures, i.e., loss of communication links between nodes. The hardware platforms that WSAS applications run on are inherently unreliable: the cost of producing each individual node is extremely low, and consequently, nodes are designed to be dispensable. This places a larger burden on someone designing applications that run on these platforms: applications running on such unreliable infrastructures must still function in a reliable manner. When a collection of nodes are unable to communicate, the WSAS is unlikely to provide an acceptable level of service.

To tolerate such failures, it is necessary to detect faults as and when they occur. Failure detection logic is typically included either directly in the application, or as part of a middleware service that provides neighborhood information. Neither of these cases is optimal. The former approach is tedious, error prone, and does not enable reuse of the failure detection logic. While the latter decouples the failure detection scheme from the application, the scheme is still tightly coupled to neighbor discovery logic. The range of failures, and correspondingly the range of failure detection mechanisms, cannot be effectively addressed in such tightly-coupled designs. We present a design that offers a higher degree of separation between independent concerns.

Contributions. The contributions of this paper are:

• A design for a failure detector that distinguishes between the failure of a communication link and mobility of a node in the WSAS.
• A proof sketch to show that the above detector is correct.
• Novel packaging of the above failure detector as a parameterizable middleware service for application designers.
• Experimental results based on two different failure detection schemes designed using the middleware service. These components were implemented in nesC [15] for the TinyOS [11] platform, and the experiments were conducted on a testbed of TelosB [13] motes.

Paper organization. Following the background and related work discussion in Section 2, we present the first failure detector that can detect node failures in Section 3. In Section 4 we describe the second failure detector that successfully detects failures in dynamic topologies, along with a proof of correctness, and a case study example to illustrate the use of the failure detector middleware. In Section 5, we provide an evaluation of our failure detectors' performance using experiments conducted on an 80-node sensor network testbed. Finally, we conclude with a summary of the contributions and some directions for future work in Section 6.

2. Background and related work

2.1. Failure detection algorithms

The use of unreliable failure detectors to implement reliable asynchronous distributed systems was first proposed by Chandra and Toueg [22]. The authors present four classes of failures and each class has different completeness and accuracy specifications. Since then, several other [...]

[...] out an "I am alive" message to its neighbors. If a node p does not hear from some neighbor q for some specified length of time, p adds q to its suspect list. If, later on, p does receive a heartbeat from q, p realizes its mistake, and removes q from the suspect list. One way of ensuring that the same mistake is not made again is to modify the timeout duration based on past mistakes. When p recovers from a mistake, it extends the timeout duration for q to be longer than the time it took for q's heartbeat to arrive. For example, an adaptive timeout-based failure detector is reported in [28].

Instead of forcing every node to continually flood the network with heartbeat messages, some failure detectors follow an "on-demand" approach. When a process queries its failure detector module for the current suspect list, the failure detector sends an "Are you alive?" message to its neighbors. Following this, it waits for some specified timeout period at the end of which it declares a neighbor to be a suspect if no response has been received. If, on the other hand, it receives an "I am alive" message, the neighbor is not added to the suspect list. The message complexity of this strategy is twice that of the heartbeat strategy, except that the number of times the detection cycle has to occur can be greatly reduced, thereby reducing the overall message complexity of the failure detector module. Such ping-based implementations are reported in [24].

Unfortunately, none of these strategies are optimal for WSAS because here, the nodes sleep for most of the time, and are only awake for a few minutes at a time. In such a context, it makes sense to reverse the role of the messages. Each node p sends a message to its neighbors requesting a lease for some duration. A neighboring node q now records this lease, and assumes that p is alive for the lease duration. As long as p sends another lease request before its lease expires, q does not suspect p. But if the lease expires before p sends a request, then p is added to q's suspect list. This strategy is described in [29], and is the strategy we use in this paper for local node failure detection in Section 3.

Hutle and Widder [30] present two time-free self-stabilizing algorithms for local failure detection for sparse networks. These algorithms apply equally well to dense networks. The first algorithm they propose requires unbounded amount of space in each process, and the second algorithm (the more realistic one) executes within bounded space when there is a known upper bound on the number of messages in the system. Their work, however, assumes a static topology and does not tolerate mobility of nodes.

Fetzer and Högstedt [31] present a protocol for failure detection in partitionable systems. They consider systems in which a gateway node that connects a section of the network to another section fails.
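
As a rough illustration of two of the strategies surveyed in Section 2.1, the following Python sketch contrasts an adaptive heartbeat-timeout detector with a lease-based detector. It is not the nesC implementation evaluated in this paper. The class names, the `margin` parameter, and the explicit `now` timestamps are all assumptions introduced for illustration, and the choice to clear a suspicion when a late lease request arrives is likewise an assumption, since the text only states when a node is added to the suspect list.

```python
class AdaptiveHeartbeatDetector:
    """Node p's view: suspect q if no heartbeat arrives within q's timeout."""

    def __init__(self, neighbors, initial_timeout, margin=1.0):
        # margin (assumed) keeps the new timeout strictly longer than the gap.
        self.margin = margin
        self.timeout = {q: initial_timeout for q in neighbors}
        self.last_heard = {q: 0.0 for q in neighbors}
        self.suspects = set()

    def on_heartbeat(self, q, now):
        if q in self.suspects:
            # p made a mistake: stop suspecting q and extend q's timeout
            # beyond the time this late heartbeat actually took to arrive.
            self.suspects.discard(q)
            gap = now - self.last_heard[q]
            self.timeout[q] = max(self.timeout[q], gap + self.margin)
        self.last_heard[q] = now

    def check(self, now):
        # Add every neighbor whose silence exceeds its current timeout.
        for q, heard in self.last_heard.items():
            if now - heard > self.timeout[q]:
                self.suspects.add(q)
        return set(self.suspects)


class LeaseDetector:
    """Node q's view: p is trusted only while a lease requested by p is live."""

    def __init__(self):
        self.lease_expiry = {}
        self.suspects = set()

    def on_lease_request(self, p, now, duration):
        # Record (or renew) p's lease; p is assumed alive until it expires.
        self.lease_expiry[p] = now + duration
        # Assumption: a late renewal clears an earlier suspicion.
        self.suspects.discard(p)

    def check(self, now):
        # Suspect every node whose lease has run out without renewal.
        for p, expiry in self.lease_expiry.items():
            if now > expiry:
                self.suspects.add(p)
        return set(self.suspects)
```

In the lease scheme the monitored node decides when to transmit, so it can renew its lease during one of its short wake periods and sleep otherwise, whereas the heartbeat scheme requires the monitoring node to be listening whenever a heartbeat might arrive.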
