Watchdog Timers

Total Page:16

File Type:pdf, Size:1020Kb

Watchdog Timers MULTI-TASKING Watchdog timers By Niall Murphy Author Front Panel: Designing Software for Embedded User Interfaces Making proper use of a watchdog timer is not as simple as restarting a counter. If you have a watchdog timer in your system, you must choose the timeout period care- fully, ensure that the watchdog timer is tested regularly, and, if you are multi-tasking, monitor all of the tasks. In addition, the recovery actions you imple- ment can have a big impact on overall system reliability. A watchdog timer is a piece of hardware, often built into a microcontroller that can cause a processor reset when it judges that the system has hung, or is no longer executing the correct sequence of code. This article will discuss exactly the sort of failures a watchdog can detect, and the decisions that must be made in the design of your watchdog system. The first half of the article will assume that there is no RTOS present. The second half covers a scheme for making use of a watchdog in a multi-tasking system. The hard- ware component of a watchdog is a counter that is set to a certain value and then counts down towards zero. It is the responsibility of the software occurs too soon will cause a bite, they lead to an infinite loop, an line, but other actions are also to set the count to its original but in order to use such a system, accidental jump out of the code possible. For example, when the value often enough to ensure very precise knowledge of the area of memory, or a dead-lock watchdog bites it may directly that it never reaches zero. If it timing characteristics of the main condition (in multi-tasking sit- disable a motor, engage an in- does reach zero, it is assumed that loop of your program is required. uations). Obviously, it is prefer- terlock, or sound an alarm until the software has failed in some What errors are caught? able to fix the root cause, rather the software recovers. Such ac- manner and the CPU is reset. A properly designed watch- than getting the watchdog to tions are especially important to In other texts you will see vari- dog mechanism should, at the pick up the pieces. In a complex leave the system in a safe state ous terms for restarting the timer: very least, catch events that hang embedded system it may not if, for some reason, the system’s strobing, stroking or updating the system. In electrically noisy be possible to guarantee that software is unable to run at all the watchdog. However, in this environments, a power glitch there are no bugs, but by using (perhaps due to chip death) article we will use the more may corrupt the program coun- a watchdog you can guarantee after the failure. visual metaphor of a man kick- ter, stack pointer, or data in RAM. that none of those bugs will hang A microcontroller with an ing the dog periodically-with The software would crash almost the system indefinitely. internal watchdog will almost apologies to animal lovers. If the immediately, even if the code is always contain a status bit that man stops kicking the dog, the completely bug free. This is ex- First aid gets set when a bite occurs. By dog will take advantage of the actly the sort of transient failure Once your watchdog has bitten, examining this bit after emerg- hesitation and bite the man. that watchdogs will catch. you have to decide what action ing from a watchdog-induced It is also possible to design Bugs in software can also to take. The hardware will usu- reset, we can decide whether the hardware so that a kick that cause the system to hang, if ally assert the processor’s reset to continue running, switch to EE Times-India | November 2000 | eetindia.com 1 requires, and whether that data One approach is to pick an is stored regularly and read after interval which is several seconds the system resets. long. Use this approach when you are only trying to reset a Sanity checks system that has definitely hung, Kicking the dog on a regular but you do not want to do a interval proves that the software detailed study of the timing of is running. It is often a good idea the system. This is a robust ap- to kick the dog only if the system proach. Some systems require passes some sanity check, as fast recovery, but for others, shown in Figure 1. Stack depth, the only requirement is that number of buffers allocated, or the system is not left in a hung the status of some mechanical state indefinitely. For these component may be checked more sluggish systems, there is before deciding to kick the dog. no need to do precise measure- Good design of such checks will ments of the worst case time of increase the family of errors that the program’s main loop to the the watchdog will detect. nearest millisecond. One approach is to clear a When picking the timeout number of flags before each loop you may also want to consider is started, as shown in Figure 2. the greatest amount of dam- Each flag is set at a certain point age the device can do between in the loop. At the bottom of the original failure and the the loop the dog is kicked, but watchdog biting. With a slowly first the flags are checked to see responding system, such as a that all of the important points large thermal mass, it may be in the loop have been visited. acceptable to wait 10 seconds The multi-tasking approach before resetting. Such a long discussed later is based on a time can guarantee that there similar set of sanity flags. will be no false watchdog re- For a specific failure, it is often sets. On a medical ventilator, a good idea to try to record the 10 seconds would have been cause (possibly in NVRAM), since far too long to leave the patient it may be difficult to establish unassisted, but if the device can the cause after the reset. If the recover within a second then watchdog bite is due to a bug the failure will have minimal (would that be a bug bite?) then impact, so a choice of a 500ms any other information you can timeout might be appropriate. record about the state of the When making such calculations system, or the currently active be sure to include the time task will be valuable when try- taken for the device to start up ing to diagnose the problem. as well as the timeout time of the watchdog itself. Choosing the timeout One real-life example is the interval Space Shuttle’s main engine con- Any safety chain is only as good troller. 1 The watchdog timeout a fail-safe state, and/or display On the other hand, in some sys- as its weakest link, and if the is set at 18ms, which is shorter an error message. At the very tems it is better to do a full set of software policy used to decide than one major control cycle. The least, you should count such self-tests since the root cause of when to kick the dog is not response to the watchdog biting events, so that a persistently the watchdog timeout might be good, then using watchdog is to switch over to the backup errant application won’t be re- identified by such a test. hardware can make your sys- computer. This mechanism al- started indefinitely. A reasonable In terms of the outside world, tem less reliable. If you do not lows control to pass from a failed approach might be to shut the the recovery may be instanta- fully understand the timing computer to the backup before system down if three watchdog neous, and the user may not characteristics of your program, the engine has time to perform bites occur in one day. even know a reset occurred. you might pick a timeout inter- any irreversible actions. If we want the system to The recovery time will be the val that is too short. This could While on the subject of time- recover quickly, the initialisation length of the watchdog time- lead to occasional resets of the outs, it is worth pointing out after a watchdog reset should out plus the time it takes the system, which may be difficult that some watchdog circuits be much shorter than power-on system to reset and perform its to diagnose. The inputs to the allow the very first timeout to initialisation. initialisation. How well the de- system, and the frequency of be considerably longer than the A possible shortcut is to skip vice recovers depends on how interrupts, can affect the length timeout used for the rest of the some of the device’s self-tests. much persistent data the device of a single loop. periodic checks. 2 eetindia.com | November 2000 | EE Times-India This allows the processor time to initialise, without having to worry about the watchdog biting. While the watchdog can often respond fast enough to halt mechanical systems, it offers little protection for damage that can be done by software alone. Consider an area of non-volatile RAM which may be overwritten with rubbish data if some loop goes out of control. It is likely that such an overwrite would occur far faster than a watch- dog could detect the fault. For those situations you need some other protection such as a checksum.
Recommended publications
  • Tms320dm35x Dmsoc Timer/Watchdog Timer User's Guide
    TMS320DM35x Digital Media System-on-Chip (DMSoC) Timer/Watchdog Timer Reference Guide Literature Number: SPRUEE5A May 2006–Revised September 2007 2 SPRUEE5A–May 2006–Revised September 2007 Submit Documentation Feedback Contents Preface ............................................................................................................................... 6 1 Introduction................................................................................................................ 9 1.1 Purpose of the Peripheral....................................................................................... 9 1.2 Features ........................................................................................................... 9 1.3 Functional Block Diagram ..................................................................................... 10 1.4 Industry Standard Compatibility Statement ................................................................. 10 2 Architecture – General-Purpose Timer Mode ................................................................ 10 2.1 Backward Compatible Mode (Timer 3 Only) ................................................................ 10 2.2 Clock Control.................................................................................................... 11 2.3 Signal Descriptions ............................................................................................. 12 2.4 Timer Modes .................................................................................................... 13 2.5 Timer
    [Show full text]
  • SPECIFICATION (General Release)
    HC05LJ5GRS/H REV 1 68HC05LJ5 SPECIFICATION (General Release) November 10, 1998 Semiconductor Products Sector Motorola reserves the right to make changes without further notice to any products herein to improve reliability, function or design. Motorola does not assume any liability arising out of the application or use of any product or circuit described herein; neither does it convey any license under its patent rights nor the rights of others. Motorola products are not designed, intended, or authorized for use as components in systems intended for surgical implant into the body, or other applications intended to support or sustain life, or for any other application in which the failure of the Motorola product could create a situation where personal injury or death may occur. Should Buyer purchase or use Motorola products for any such unintended or unauthorized application, Buyer shall indemnify and hold Motorola and its officers, employees, subsidiaries, affiliates, and distributors harmless against all claims, costs, damages, and expenses, and reasonable attorney fees arising out of, directly or indirectly, any claim of personal injury or death associated with such unintended or unauthorized use, even if such claim alleges that Motorola was negligent regarding the design or manufacture of the part. Motorola, Inc., 1998 November 10, 1998 GENERAL RELEASE SPECIFICATION TABLE OF CONTENTS Section Page SECTION 1 GENERAL DESCRIPTION 1.1 FEATURES.....................................................................................................
    [Show full text]
  • Software Techniques for Improving Microcontrollers EMC Performance
    AN1015 Application note Software techniques for improving microcontrollers EMC performance Introduction A major contributor to improved EMC performance in microcontroller-based electronics systems is the design of hardened software. Problems induced by EMC disturbances need to be considered as early as possible in the design phase. EMC-oriented software increases the security and the reliability of your application. EMC-hardened software is inexpensive to implement, improves the final goods immunity performance and saves hardware and development costs. You should consider EMC disturbances to analog or digital data just like any other application parameter. Examples of problems induced by EMC disturbances: Microcontroller not responding Program Counter runaway Execution of unexpected instructions Bad address pointing Bad execution of subroutines Parasitic reset and/or parasitic interrupts Corruption of IP configuration I/O deprogramming Examples of consequences of failing software: Unexpected response of product Loss of context Unexpected branch in process Loss of interrupts Loss of data integrity Corrupted reading of input values. This application note deals with two categories of software techniques, namely: Preventive techniques: these can be implemented in existing designs, their purpose is to improve product robustness. Auto-recovery techniques: when a runaway condition is detected, a recovery subroutine is used to take the decision to execute fail-safe routines, optionally sending a warning and then automatically returning back to normal operations (this operation may be absolutely transparent to the user of the application). June 2014 DocID5833 Rev 2 1/19 www.st.com Contents AN1015 Contents 1 Related documents . 5 2 Preventive techniques . 6 2.1 Using the watchdog and time control techniques .
    [Show full text]
  • Lecture 2 General Concepts of RTOS (Real-Time Operating System)
    CENG 383 Real-Time Systems Lecture 2 General Concepts of RTOS (Real-Time Operating System) Asst. Prof. Tolga Ayav, Ph.D. Department of Computer Engineering İzmir Institute of Technology Operating System z Specialized collection of system programs is called operating system. z Must provide at least three specific functions: − Task scheduling − Task dispatching − İntertask communication z Kernel (Nucleus): The smallest portion of OS that provides those three essential functions. İzmir Institute of Technology Embedded Systems Lab Role of the Kernel İzmir Institute of Technology Embedded Systems Lab Three Kernel Functions OS Kernel - 3 functions: z Task Scheduler : To determine which task will run next in a multitasking system z Task Dispatcher: To perform necessary bookkeeping to start a task z Intertask Communication: To support communication between one process (i.e. task) and another İzmir Institute of Technology Embedded Systems Lab Kernel Design Strategies z Polled Loop Systems z Cyclic Executive z Cooperative Multitasking z Interrupt-Driven Systems z Foreground/Background Systems z Full featured RTOS İzmir Institute of Technology Embedded Systems Lab 1. Polled Loop Systems Polled loops are used for fast response to single devices. In a polled-loop system, a single and a repetitive instruction is used to test a flag that indicates whether or not some event has occurred. If the event has not occurred, then the polling continues. Example: suppose a software system is needed to handle packets of data that arrive at a rate of no more than 1 per second. A flag named packet_here is set by the network, which writes the data into the CPU’s memory via direct memory access (DMA).
    [Show full text]
  • F MC -8L MB89580B/BW Series HARDWARE MANUAL
    FUJITSU SEMICONDUCTOR CM25-10142-3E CONTROLLER MANUAL F2MCTM-8L 8-BIT MICROCONTROLLER MB89580B/BW Series HARDWARE MANUAL F2MCTM-8L 8-BIT MICROCONTROLLER MB89580B/BW Series HARDWARE MANUAL FUJITSU LIMITED PREFACE ■ Objectives and Intended Reader The MB89580B/BW series of products has been developed as one of the general products of F2MC-8L family. The F2MC-8L family is an original 8 bit one-chip microcontroller that can be used for Application Specific ICs (ASICs). The MB89580B/BW series has a wide range of uses in consumer and industrial equipment. This manual covers the functions and operations of the MB89580B/BW series and is designed for engineers who develop products using the MB89580B/BW series of microcontrollers. For information on the microcontroller’s instruction set, refer to the F2MC-8L Programming Manual. ■ Trademarks F2MC is the abbreviation of FUJITSU Flexible Microcontroller. Other system and product names in this manual are trademarks of respective companies or organizations. The symbols TM and ® are sometimes omitted in this manual. ■ Structure of This Manual This manual consists of the following 13 chapters and appendix. CHAPTER 1 "OVERVIEW" This chapter explains the features and basic specifications of the MB89580B/BW series of microcontrollers. CHAPTER 2 "HANDLING DEVICE" This chapter explains the precautions to be taken when using the USB general-purpose one- chip microcontroller. CHAPTER 3 "CENTRAL PROCESSING UNIT (CPU)" This chapter explains the functions and operation of the CPU. CHAPTER 4 "I/O PORTS" This chapter explains the functions and operation of the I/O ports. CHAPTER 5 "TIMEBASE TIMER" This chapter explains the functions and operation of the timebase timer.
    [Show full text]
  • Atrwdt, Atxwdt
    PC Watchdog Timer Card CE Models: ATRWDT and ATXWDT Documentation Number ATxWDT-1303 This product designed and manufactured in Ottawa, Illinois USA of domestic and imported parts by International Headquarters B&B Electronics Mfg. Co. Inc. 707 Dayton Road -- P.O. Box 1040 -- Ottawa, IL 61350 USA Phone (815) 433-5100 -- General Fax (815) 433-5105 Home Page: www.bb-elec.com Sales e-mail: [email protected] -- Fax (815) 433-5109 Technical Support e-mail: [email protected] -- Fax (815) 433-5104 European Headquarters B&B Electronics Ltd. Westlink Commercial Park, Oranmore, Co. Galway, Ireland Phone +353 91-792444 -- Fax +353 91-792445 Home Page: www.bb-europe.com Sales e-mail: [email protected] Technical Support e-mail: [email protected] B&B Electronics -- Revised March 2003 Documentation Number ATxWDT-1303 ATRWDT & ATXWDT Cover Page B&B Electronics Mfg Co – 707 Dayton Rd - PO Box 1040 - Ottawa IL 61350 - Ph 815-433-5100 - Fax 815-433-5104 B&B Electronics Ltd – Westlink Comm. Pk – Oranmore, Galway, Ireland – Ph +353 91-792444 – Fax +353 91-792445 Table of Contents CHAPTER 1: GENERAL INFORMATION...................................1 INTRODUCTION..................................................................................1 FEATURES..........................................................................................1 SPECIFICATIONS ................................................................................1 CHAPTER 2: SETUP AND INSTALLATION...............................2 INSPECTION .......................................................................................2
    [Show full text]
  • Binary Tree Classification of Rigid Error Detection and Correction Techniques Angeliki Kritikakou, Rafail Psiakis, Francky Catthoor, Olivier Sentieys
    Binary Tree Classification of Rigid Error Detection and Correction Techniques Angeliki Kritikakou, Rafail Psiakis, Francky Catthoor, Olivier Sentieys To cite this version: Angeliki Kritikakou, Rafail Psiakis, Francky Catthoor, Olivier Sentieys. Binary Tree Classification of Rigid Error Detection and Correction Techniques. ACM Computing Surveys, Association for Com- puting Machinery, 2020, 53 (4), pp.1-38. 10.1145/3397268. hal-02927439 HAL Id: hal-02927439 https://hal.archives-ouvertes.fr/hal-02927439 Submitted on 29 Jan 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 1 Binary Tree Classification of Rigid Error Detection and Correction Techniques ANGELIKI KRITIKAKOU, Univ Rennes, Inria, CNRS, IRISA, France RAFAIL PSIAKIS, Univ Rennes, Inria, CNRS, IRISA, France FRANCKY CATTHOOR, IMEC, KU Leuven, Belgium OLIVIER SENTIEYS, Univ Rennes, Inria, CNRS, IRISA, France Due to technology scaling and harsh environments, a wide range of fault tolerant techniques exists to deal with the error occurrences. Selecting a fault tolerant technique is not trivial, whereas more than the necessary overhead is usually inserted during the system design. To avoid over-designing, it is necessary to in-depth understand the available design options. However, an exhaustive listing is neither possible to create nor efficient to use due to its prohibited size.
    [Show full text]
  • Computer Science & Engineering
    Computer Science & Engineering B.TECH. PROGRAMME FIRST YEAR FIRST SEMESTER A. Theory Sl. No Code Subject Contacts Credit Points Periods/ Week L T P Total 1 HMTS1101 Business English 2 0 0 2 2 2 CHEM1001 Chemistry I 3 1 0 4 4 3 MATH1101 Mathematics I 3 1 0 4 4 4 ELEC1001 Basic Electrical Engineering 3 1 0 4 4 5 MECH1101 Engineering Mechanics 3 1 0 4 4 Total Theory 18 18 B. Laboratory 1 CHEM1011 Chemistry I Lab 0 0 3 3 2 2 ELEC1111 Basic Electrical Engineering Lab 0 0 3 3 2 3 MECH1012 Engineering Drawing 1 0 3 4 3 4. HTMS1111 Language Practice(Level 1) Lab 0 0 2 2 1 Total Practical 12 8 C. Sessional 2 HMTS1121 Extra Curricular Activity 0 0 2 2 1 Total Sessional 2 1 Total of Semester 32 27 SECOND SEMESTER A. Theory Sl. No Code Subject Contacts Credit Periods/ Week Points L T P Total 1. CSEN1201 Introduction to Computing 3 1 0 4 4 2. PHYS1001 Physics I 3 1 0 4 4 3. MATH1201 Mathematics II 3 1 0 4 4 4. ECEN1001 Basic Electronics Engineering 3 1 0 4 4 5. MECH1201 Engineering Thermodynamics and 3 1 0 4 4 Fluid Mechanics Total Theory 20 20 B. Laboratory 1. CSEN1211 Introduction to Computing Lab 0 0 3 3 2 2. PHYS1011 Physics I Lab 0 0 3 3 2 3. ECEN1011 Basic Electronics Engineering Lab 0 0 3 3 2 4. MECH1011 Workshop Practice 1 0 3 4 3 Total Practical 13 9 Total of Semester 33 29 SECOND YEAR THIRD SEMESTER A.
    [Show full text]
  • Architecure of the Linux Watchdog Software for the Ip Pbx
    U.P.B. Sci. Bull., Series C, Vol. 73, Iss. 4, 2011 ISSN 1454-234x ARCHITECURE OF THE LINUX WATCHDOG SOFTWARE FOR THE IP PBX Mihai Constantin1 În această lucrare sunt proiectate propunerile de arhitectură referitoare la software-ul de Watchdog pentru monitorizarea unui IP PBX. IP PBX conţine două aplicaţii principale care interacţionează între ele: Common Media Server şi IpSmallOfice. Common Media Server oferă capabilităţile de codec VoIP, iar IPSmallOffice, aplicaţia principală care rulează pe Linux PC, controlează apelurile telefonice şi oferă telefonie IP. Watchdog reprezintă o aplicaţie de sine stătătoare şi necesită un fişier de configurare care să citească date despre procesele pe care le monitorizează. De asemenea, această aplicaţie rulează ca „daemon” şi este pornită la startup-ul întregului sistem IP PBX. The architecture proposals of a Watchdog software for the IP PBX monitoring are designed in this paper. IP PBX contains two main applications which interact between them: Common Media Server and IpSmallOffice. Common Media Server provides VoIP codec capabilities, and IpSmallOffice is the main application running on a Linux PC, controls the phone calls and provides IP telephony switching. The Watchdog is a stand-alone application and it needs a configuration file to read the data about the processes it monitors. Moreover, this application will be demonized and is added to start at system startup. Keywords: Watchdog, Common Media Server, IpSmallOffice,Linux, Processes,Polling,SIP,Keepalives 1. Introduction IPSmallOffice and Common Media Server are interdependent applications. Given this, when one of them is shutdown or, in a worse case, crashes, both of them need to be restarted.
    [Show full text]
  • Coverstory by Markus Levy, Technical Editor
    coverstory By Markus Levy, Technical Editor ACRONYMS IN DIRECTORY ALU: arithmetic-logic unit AMBA: Advanced Microcontroller Bus Architecture BIU: bus-interface unit CAN: controller-area network CMP: chip-level multiprocessing DMA: direct-memory access EDO: extended data out EJTAG: enhanced JTAG FPU: floating-point unit JVM: Java virtual machine MAC: multiply-accumulate MMU: memory-management unit NOP: no-operation instruction PIO: parallel input/output PLL: phase-locked loop PWM: pulse-width modulator RTOS: real-time operating system SDRAM: synchronous DRAM SOC: system on chip SIMD: single instruction multiple data TLB: translation-look-aside buffer VLIW: very long instruction word Illustration by Chinsoo Chung 54 edn | September 14, 2000 www.ednmag.com EDN’s 27th annual MICROPROCESSOR/MICROCONTROLLER DIRECTORY The party’s at the high end elcome to the year 2000 version of tectural and device features that will help you the EDN Microprocessor/Microcon- quickly compare the important processor dif- troller Directory: It’s the 27th consec- ferences. These processor tables include the Wutive year of this directory, and we standard fare of features and contain informa- certainly have something to celebrate. The en- tion related to EEMBC (EDN Embedded Mi- tire processor industry has seen incredible croprocessor Benchmark Consortium, growth, but the real parties have been happen- www.eembc.org). EEMBC’s goal is to provide ing in the 32-bit, 64-bit, and VLIW fraternities. the certified benchmark scores to help you se- Although a number of new architectures have lect the processor for your next design. Rather sprung up, processor vendors have primarily than fill an entire table with the processor focused on delivering newer versions or up- scores, EDN has indicated which companies grades of their existing architectures.
    [Show full text]
  • Intel® Quickassist Technology (Intel® QAT) Software for Linux* Release Notes Package Version: QAT1.7.L.4.6.0-00025
    Intel® QuickAssist Technology (Intel® QAT) Software for Linux* Release Notes Package Version: QAT1.7.L.4.6.0-00025 June 2019 Document Number: 336211-011 You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548- 4725 or visit www.intel.com/design/literature.htm.
    [Show full text]
  • Great Watchdogs Version 1.2, Updated January 2004
    Great Watchdogs Version 1.2, updated January 2004 Jack G. Ganssle [email protected] The Ganssle Group PO Box 38346 Baltimore, MD 21231 (410) 496-3647 fax (647) 439-1454 Jack Ganssle believes that embedded development can be much more efficient than it usually is, and that we can – and must – create more reliable products. He conducts one-day seminars that teach ways to produce better firmware faster. For more information see www.ganssle.com. Building a Great Watchdog Launched in January of 1994, the Clementine spacecraft spent two very successful months mapping the moon before leaving lunar orbit to head towards near-Earth asteroid Geographos. A dual-processor Honeywell 1750 system handled telemetry and various spacecraft functions. Though the 1750 could control Clementine’s thrusters, it did so only in emergency situations; all routine thruster operations were under ground control. On May 7, 1994, the 1750 experienced a floating point exception. This wasn’t unusual; some 3000 prior exceptions had been detected and handled properly. But immediately after the May 7 event downlinked data started varying wildly and nonsensically. Then the data froze. Controllers spent 20 minutes trying to bring the system back to life by sending software resets to the 1750; all were ignored. A hardware reset command finally brought Clementine back on-line. Alive, yes, even communicating with the ground, but with virtually no fuel left. The evidence suggests that the 1750 locked up, probably due to a software crash. While hung the processor turned on one or more thrusters, dumping fuel and setting the spacecraft spinning at 80 RPM.
    [Show full text]