Surviving Sensor Network Software Faults
Yang Chen†, Omprakash Gnawali∗, Maria Kazandjieva‡, Philip Levis‡, and John Regehr†

†School of Computing, University of Utah, Salt Lake City, UT USA
∗Computer Science Department, University of Southern California, Los Angeles, CA USA
‡Computer Systems Laboratory, Stanford University, Stanford, CA USA

{chenyang, regehr}@cs.utah.edu [email protected] [email protected] [email protected]

ABSTRACT

We describe Neutron, a version of the TinyOS operating system that efficiently recovers from memory safety bugs. Where existing schemes reboot an entire node on an error, Neutron’s compiler and runtime extensions divide programs into recovery units and reboot only the faulting unit. The TinyOS kernel itself is a recovery unit: a kernel safety violation appears to applications as the processor being unavailable for 10–20 milliseconds.

Neutron further minimizes safety violation cost by supporting “precious” state that persists across reboots. Application data, time synchronization state, and routing tables can all be declared as precious. Neutron’s reboot sequence conservatively checks that precious state is not the source of a fault before preserving it. Together, recovery units and precious state allow Neutron to reduce a safety violation’s cost to time synchronization by 94% and to a routing protocol by 99.5%. Neutron also protects applications from losing data. Neutron provides this recovery on the very limited resources of a tiny, low-power microcontroller.

Categories and Subject Descriptors
D.4.5 [Operating Systems]: Reliability - fault tolerance

General Terms
Reliability, Design

Keywords
Wireless Sensor Networks, TinyOS, nesC, Deputy, Kernel, Reboot, Reliability

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP’09, October 11–14, 2009, Big Sky, Montana, USA.
Copyright 2009 ACM 978-1-60558-752-3/09/10 ...$10.00.

1. INTRODUCTION

Sensor networks consist of large numbers of small, low-power, wireless devices, often embedded in remote and inconvenient locations such as volcanoes [35], thickets [25], bird burrows [31], glaciers [32], and tops of light poles [22]. Applications commonly specify a network should operate unattended for months or years [25, 31]. Software dependability and reliability are therefore critical concerns.

In practice, however, sensor networks operate for weeks or months and require significant attention from developers or system administrators [31, 35]. The discrepancy between desired and actual availability is in part due to difficult-to-diagnose bugs that emerge only after deployment [34]. A recent deployment in the Swiss Alps illustrates this challenge. Network communication failed during mornings and evenings, but worked during the day and night. The cause was temperature response differences for the processor and radio oscillators. Periods of warming and cooling led their clocks to drift too much for their interconnect to be reliable [3].

Unforeseen bugs often manifest as memory errors. For example, a popular radio chip, the ChipCon CC2420, erroneously signals reception of corrupted packets shorter than the 802.15.4 standard permits. Early CC2420 drivers for the TinyOS operating system did not consider this case; receiving a short, corrupt packet triggered an off-by-one error in a loop, overwriting unrelated parts of RAM [28].

As wireless sensors use microcontrollers whose RAM is no larger than a typical MMU page, compiler-enforced safety is the standard mechanism for detecting memory bugs. For example, Safe TinyOS [8] uses Deputy [7] to make all of TinyOS and its applications type-safe, preventing pointer bugs from cascading into memory corruption and random consequences.

Safe execution is an important step towards dependable software, but it raises a difficult question: How should a node respond to a safety violation? These embedded, event-driven systems typically have no concept of a process or unit of code isolation. Thus, on a safety violation, Safe TinyOS spits out an error message (for lab testing) or reboots the entire node (in deployment).

Rebooting an entire node is costly: it wastes energy and loses data. Systems gather state such as routing tables and link quality estimates to improve energy efficiency by minimizing communication. Systems gather this state slowly to avoid introducing significant load. After rebooting, a node may take some time, minutes or even hours, to come fully back online. For example, in a recent deployment on Reventador Volcano, reboots from a software error led to a 3-day network outage [35], reducing mean node uptime from >90% to 69%.

This paper presents Neutron, a version of the TinyOS operating system that improves the efficiency and dependability of wireless sensor networks by reducing the cost of memory safety violations. Neutron has two parts: extensions to the TinyOS compiler toolchain (nesC and Deputy) and extensions to TinyOS itself.

Neutron extends the nesC compiler to provide boundaries between “recovery units.” Similarly to microreboots [6], Neutron reboots only the faulting unit on a safety violation. TOSThreads, the TinyOS threading library, helps define application recovery units. Unlike microreboots, which operate only on application-level structures, Neutron must also be able to survive kernel faults, as the TinyOS kernel is typically the largest and most complex part of an application image. In Neutron, the kernel itself is also a recovery unit.
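As a rough illustration of unit-level reboot, the following C sketch models a node as three recovery units and, when a safety check fails, re-initializes only the unit in which it failed. It is not Neutron's implementation: the names `unit_state`, `unit_init`, `node_boot`, `on_safety_violation`, and `unit_do_work`, as well as the fixed set of three units, are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical illustration only: these names and this layout are
 * assumptions for the example, not Neutron's actual data structures. */
enum { UNIT_KERNEL = 0, UNIT_APP_A = 1, UNIT_APP_B = 2, UNIT_COUNT = 3 };

struct unit_state {
    unsigned boots;      /* times this unit has been (re)initialized */
    unsigned work_done;  /* stand-in for state the unit accumulates */
};

struct unit_state units[UNIT_COUNT];

/* (Re)initialize one unit, discarding its accumulated state. */
void unit_init(int id)
{
    assert(id >= 0 && id < UNIT_COUNT);
    units[id].boots += 1;
    units[id].work_done = 0;
}

/* Whole-node boot: clear all state, then start every unit once. */
void node_boot(void)
{
    memset(units, 0, sizeof units);
    for (int id = 0; id < UNIT_COUNT; id++)
        unit_init(id);
}

/* Called when a safety check fails inside unit `id`: reboot only
 * that unit and leave every other unit's state untouched. */
void on_safety_violation(int id)
{
    unit_init(id);
}

/* Stand-in for useful work performed by a unit. */
void unit_do_work(int id, unsigned amount)
{
    assert(id >= 0 && id < UNIT_COUNT);
    units[id].work_done += amount;
}
```

For instance, after `node_boot()`, calling `unit_do_work(UNIT_APP_B, 7)` and then `on_safety_violation(UNIT_APP_A)` leaves application B's `work_done` at 7, while application A starts over from 0 with its `boots` counter incremented. A whole-node reboot would instead correspond to calling `node_boot()` again, which discards the state of every unit.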
If the kernel violates safety, Neutron reboots it without disrupting application recovery units.

Rebooting a recovery unit is better than rebooting a node, but it conservatively wastes energy by discarding valid state. Neutron allows application and kernel recovery units to declare memory structures as “precious,” indicating that they should persist across faults when possible. The complication is that precious state may be involved in a safety violation and become inconsistent. Neutron uses a combination of static analysis, type safety checks, and user-specified checks to determine which precious structures can be safely retained, and which must be re-initialized on a reboot.

Neutron must provide these mechanisms in the limited code space (tens of kilobytes) and RAM (4–10 kB) typical of ultra-low-power microcontrollers. These constraints, combined with embedded system workloads, lead Neutron to take different approaches than are typical in systems that have plenty of available resources. By modifying variables in-place, Neutron introduces no instruction overhead in the common case of correctly executing code. In contrast, transactions would introduce a RAM overhead for scratch space and a CPU overhead for memory copies. Neutron re-initializes possibly corrupt variables, rather than restoring them to their last known good state, because logging good states to nonvolatile storage has a significant energy cost.

Neutron minimizes its overhead through compiler techniques that leverage the static nature of TinyOS programs. For example, the component graph of a TinyOS program allows Neutron to infer recovery unit boundaries at compile time. Similarly, Neutron statically analyzes each memory safety check to determine which precious data structures may be in the middle of an update at the program point where the check occurs. When a safety check fails, Neutron does not preserve the contents of any precious data whose invariants are potentially broken. From the user’s point of view, Neutron’s interface consists of simple, optional annotations, making it easy to retrofit existing sensornet application code.

We evaluate Neutron by demonstrating that it isolates recovery units and increases application availability. We find that Neutron saves energy by preserving precious state across reboots. Our experiments are on

[Figure 1 diagram: two states, Idle and Busy, with transitions labeled Send/SUCCESS, Send/FAIL, Send/EBUSY, Cancel/SUCCESS, Cancel/FAIL, sendDone(SUCCESS), and sendDone(FAIL).]

Figure 1: Simplified FSM for the TinyOS interface for sending a packet. A call to send that returns SUCCESS moves the interface into the busy state, at which point subsequent calls to send return FAIL. It moves back to the idle state when it signals the sendDone callback with a SUCCESS argument denoting the packet was sent successfully.

2.1 TinyOS

TinyOS is a wireless sensornet operating system. Its mechanisms and abstractions are designed for ultra-low-power microcontrollers with limited RAM and no hardware support for memory isolation. TinyOS typically runs on 16-bit microcontrollers that run at 1–8 MHz, have 4–10 kB of SRAM, and have 40–100 kB of flash memory for code [26].

The operating system uses components as the unit of software composition [15]. Like objects, components couple code and data. Unlike objects, however, they can only be instantiated at compile time. TinyOS components, written in a dialect of C called nesC [12], have interfaces which define downcalls (“commands”) and upcalls (“events”). Upcalls and downcalls are bound statically: the absence of function pointers simplifies call graph analysis.

TinyOS interfaces, and components in general, are designed as simple finite state machines. Calls into a component cause state