Watchdog Timers
Total Page:16
File Type:pdf, Size:1020Kb
MULTI-TASKING Watchdog timers By Niall Murphy Author Front Panel: Designing Software for Embedded User Interfaces Making proper use of a watchdog timer is not as simple as restarting a counter. If you have a watchdog timer in your system, you must choose the timeout period care- fully, ensure that the watchdog timer is tested regularly, and, if you are multi-tasking, monitor all of the tasks. In addition, the recovery actions you imple- ment can have a big impact on overall system reliability. A watchdog timer is a piece of hardware, often built into a microcontroller that can cause a processor reset when it judges that the system has hung, or is no longer executing the correct sequence of code. This article will discuss exactly the sort of failures a watchdog can detect, and the decisions that must be made in the design of your watchdog system. The first half of the article will assume that there is no RTOS present. The second half covers a scheme for making use of a watchdog in a multi-tasking system. The hard- ware component of a watchdog is a counter that is set to a certain value and then counts down towards zero. It is the responsibility of the software occurs too soon will cause a bite, they lead to an infinite loop, an line, but other actions are also to set the count to its original but in order to use such a system, accidental jump out of the code possible. For example, when the value often enough to ensure very precise knowledge of the area of memory, or a dead-lock watchdog bites it may directly that it never reaches zero. If it timing characteristics of the main condition (in multi-tasking sit- disable a motor, engage an in- does reach zero, it is assumed that loop of your program is required. uations). Obviously, it is prefer- terlock, or sound an alarm until the software has failed in some What errors are caught? able to fix the root cause, rather the software recovers. Such ac- manner and the CPU is reset. A properly designed watch- than getting the watchdog to tions are especially important to In other texts you will see vari- dog mechanism should, at the pick up the pieces. In a complex leave the system in a safe state ous terms for restarting the timer: very least, catch events that hang embedded system it may not if, for some reason, the system’s strobing, stroking or updating the system. In electrically noisy be possible to guarantee that software is unable to run at all the watchdog. However, in this environments, a power glitch there are no bugs, but by using (perhaps due to chip death) article we will use the more may corrupt the program coun- a watchdog you can guarantee after the failure. visual metaphor of a man kick- ter, stack pointer, or data in RAM. that none of those bugs will hang A microcontroller with an ing the dog periodically-with The software would crash almost the system indefinitely. internal watchdog will almost apologies to animal lovers. If the immediately, even if the code is always contain a status bit that man stops kicking the dog, the completely bug free. This is ex- First aid gets set when a bite occurs. By dog will take advantage of the actly the sort of transient failure Once your watchdog has bitten, examining this bit after emerg- hesitation and bite the man. that watchdogs will catch. you have to decide what action ing from a watchdog-induced It is also possible to design Bugs in software can also to take. The hardware will usu- reset, we can decide whether the hardware so that a kick that cause the system to hang, if ally assert the processor’s reset to continue running, switch to EE Times-India | November 2000 | eetindia.com 1 requires, and whether that data One approach is to pick an is stored regularly and read after interval which is several seconds the system resets. long. Use this approach when you are only trying to reset a Sanity checks system that has definitely hung, Kicking the dog on a regular but you do not want to do a interval proves that the software detailed study of the timing of is running. It is often a good idea the system. This is a robust ap- to kick the dog only if the system proach. Some systems require passes some sanity check, as fast recovery, but for others, shown in Figure 1. Stack depth, the only requirement is that number of buffers allocated, or the system is not left in a hung the status of some mechanical state indefinitely. For these component may be checked more sluggish systems, there is before deciding to kick the dog. no need to do precise measure- Good design of such checks will ments of the worst case time of increase the family of errors that the program’s main loop to the the watchdog will detect. nearest millisecond. One approach is to clear a When picking the timeout number of flags before each loop you may also want to consider is started, as shown in Figure 2. the greatest amount of dam- Each flag is set at a certain point age the device can do between in the loop. At the bottom of the original failure and the the loop the dog is kicked, but watchdog biting. With a slowly first the flags are checked to see responding system, such as a that all of the important points large thermal mass, it may be in the loop have been visited. acceptable to wait 10 seconds The multi-tasking approach before resetting. Such a long discussed later is based on a time can guarantee that there similar set of sanity flags. will be no false watchdog re- For a specific failure, it is often sets. On a medical ventilator, a good idea to try to record the 10 seconds would have been cause (possibly in NVRAM), since far too long to leave the patient it may be difficult to establish unassisted, but if the device can the cause after the reset. If the recover within a second then watchdog bite is due to a bug the failure will have minimal (would that be a bug bite?) then impact, so a choice of a 500ms any other information you can timeout might be appropriate. record about the state of the When making such calculations system, or the currently active be sure to include the time task will be valuable when try- taken for the device to start up ing to diagnose the problem. as well as the timeout time of the watchdog itself. Choosing the timeout One real-life example is the interval Space Shuttle’s main engine con- Any safety chain is only as good troller. 1 The watchdog timeout a fail-safe state, and/or display On the other hand, in some sys- as its weakest link, and if the is set at 18ms, which is shorter an error message. At the very tems it is better to do a full set of software policy used to decide than one major control cycle. The least, you should count such self-tests since the root cause of when to kick the dog is not response to the watchdog biting events, so that a persistently the watchdog timeout might be good, then using watchdog is to switch over to the backup errant application won’t be re- identified by such a test. hardware can make your sys- computer. This mechanism al- started indefinitely. A reasonable In terms of the outside world, tem less reliable. If you do not lows control to pass from a failed approach might be to shut the the recovery may be instanta- fully understand the timing computer to the backup before system down if three watchdog neous, and the user may not characteristics of your program, the engine has time to perform bites occur in one day. even know a reset occurred. you might pick a timeout inter- any irreversible actions. If we want the system to The recovery time will be the val that is too short. This could While on the subject of time- recover quickly, the initialisation length of the watchdog time- lead to occasional resets of the outs, it is worth pointing out after a watchdog reset should out plus the time it takes the system, which may be difficult that some watchdog circuits be much shorter than power-on system to reset and perform its to diagnose. The inputs to the allow the very first timeout to initialisation. initialisation. How well the de- system, and the frequency of be considerably longer than the A possible shortcut is to skip vice recovers depends on how interrupts, can affect the length timeout used for the rest of the some of the device’s self-tests. much persistent data the device of a single loop. periodic checks. 2 eetindia.com | November 2000 | EE Times-India This allows the processor time to initialise, without having to worry about the watchdog biting. While the watchdog can often respond fast enough to halt mechanical systems, it offers little protection for damage that can be done by software alone. Consider an area of non-volatile RAM which may be overwritten with rubbish data if some loop goes out of control. It is likely that such an overwrite would occur far faster than a watch- dog could detect the fault. For those situations you need some other protection such as a checksum.