The Role of Software in Recent Aerospace Accidents*
Nancy G. Leveson, Ph.D.; Aeronautics and Astronautics Department, Massachusetts Institute of Technology; Cambridge, MA
[email protected] and http://sunnyday.mit.edu

Keywords: software safety

* This research was partially supported by a grant from the NASA Ames Design for Safety program and by the NASA IV&V Center Software Initiative program.

Abstract

This paper describes causal factors related to software that are common to several recent spacecraft accidents and what might be done to mitigate them.

Introduction

In the process of a research project to evaluate accident models, I looked in detail at a variety of spacecraft and aircraft accidents that in some way involved software [8]. The accidents studied all demonstrated the usual cultural, organizational, and communications problems such as complacency, diffusion of or lack of responsibility and authority for safety, low-level status and inappropriate organizational placement of the safety program, and limited communication channels and poor information flow. These typical problems are well known and the solutions clear, although sometimes difficult to implement. Software contributions to accidents are less well understood, however.

The accidents investigated were the explosion of the Ariane 5 launcher on its maiden flight in 1996; the loss of the Mars Climate Orbiter in 1999; the destruction of the Mars Polar Lander sometime during the entry, deployment, and landing phase in the following year; the placing of a Milstar satellite in an incorrect and unusable orbit by the Titan IV B-32/Centaur launch in 1999; the flight of an American Airlines B-757 into a mountain near Cali, Colombia; the collision of a Lufthansa A320 with an earth bank at the end of the runway at Warsaw; and the crash of a China Airlines A300 short of the runway at Nagoya, Japan.

On the surface, the events and conditions involved in the accidents appear to be very different. A more careful, detailed analysis of the systemic factors, however, reveals striking similarities. Only the root causes or systemic factors are considered here, that is, the causal factors that allowed the specific events to occur and that affect general classes of accidents. In the Challenger accident, for example, the specific event leading to the loss was the O-ring failure, but the systemic factors included such things as flawed decision making, poor problem reporting, lack of trend analysis, a "silent" or ineffective safety program, communication problems, etc.

Overconfidence and Overreliance on Digital Automation: All the accidents involved systems built within an engineering culture that had unrealistic expectations about software and the use of computers. For example, the official Ariane 5 accident report notes that software was assumed to be correct until it was shown to be faulty. The opposite assumption is more realistic.

Engineers often underestimate the complexity of software and overestimate the effectiveness of testing. It is common to see risk assessments that assume testing will remove all risk associated with digital components. This form of complacency plays a part in the common proliferation of software functionality and in unnecessary design complexity.

In the aircraft accidents examined, overconfidence in automation both (1) encouraged engineers to trust software over humans and give final authority to the computer rather than the pilot, and (2) encouraged pilots to trust their computer-based decision aids beyond the point where they should have.

Some of the technical inadequacies in high-tech aircraft system design stem from lack of confidence in the human and overconfidence in the automation. In several of the Airbus accidents, the pilots found themselves fighting the automation for control of the aircraft, which had been designed to give ultimate authority to the automation.

Even if automation is considered to be more reliable than humans, it may be a mistake not to allow flexibility in the system for emergencies and allowance for pilots to override physical interlocks, such as the inability of the pilots to operate the ground spoilers and engine thrust reversers in the Warsaw A320 accident because the computers did not think the airplane was on the ground. Reliable operation of the automation is not the problem here; the automation was very reliable in all these cases. Instead, the issue is whether software can be constructed that will exhibit correct and appropriate behavior under every foreseeable and unforeseeable situation, and whether we should be trusting software over pilots.
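To make the interlock issue concrete, the sketch below shows ground-sensing logic of the general kind at issue in the Warsaw accident. It is an illustration only, not the actual A320 implementation: the structure, names, and thresholds are hypothetical. The point is that logic which is internally consistent and operating exactly as designed can still deny braking when reality departs from the assumptions behind its requirements.

from dataclasses import dataclass

@dataclass
class GearState:
    left_compressed: bool      # weight-on-wheels switch, left main gear
    right_compressed: bool     # weight-on-wheels switch, right main gear
    wheel_speed_kts: float     # main wheel rotation speed in knots

def on_ground(gear: GearState) -> bool:
    # The software's model of "the aircraft is on the ground".
    # The rule is internally consistent and operates reliably, but it can
    # disagree with reality: a one-gear touchdown in a crosswind, or wheels
    # hydroplaning below the spin-up threshold, leaves the aircraft rolling
    # on the runway while this function still returns False.
    both_struts_loaded = gear.left_compressed and gear.right_compressed
    wheels_spun_up = gear.wheel_speed_kts > 72.0   # hypothetical threshold
    return both_struts_loaded or wheels_spun_up

def braking_devices_enabled(gear: GearState) -> bool:
    # Interlock: ground spoilers and thrust reversers are locked out unless
    # the software believes the aircraft is on the ground.
    return on_ground(gear)

# A touchdown the requirements did not anticipate: only one strut loaded,
# wheels hydroplaning and never spinning up past the threshold.
touchdown = GearState(left_compressed=False, right_compressed=True,
                      wheel_speed_kts=40.0)
print(braking_devices_enabled(touchdown))   # False: braking is denied

Nothing in this sketch has "failed"; every line behaves as specified. The hazard lies in the requirements' model of what "on the ground" means.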
At the same time, some of the aircraft accident reports cited the lack of automated protection against, or non-alerting of the pilots to, unsafe states such as out-of-trim situations. A sophisticated hazard analysis and close cooperation among system safety engineers, human factors engineers, aerospace engineers, and software engineers is needed to make these difficult decisions about task allocation and feedback requirements.

Engineers are not alone in placing undeserved reliance on software. Research has shown that operators of highly reliable automated systems (such as flight management systems) will increase their use of and reliance on automation as their trust in the system increases. At the same time, merely informing flightcrews of the hazards of overreliance on automation and advising them to turn it off when it becomes confusing is insufficient and may not affect pilot procedures when it is most needed.

The European Joint Aviation Authorities' Future Aviation Safety Team has identified "crew reliance on automation" as the top potential safety risk in future aircraft [5]. This reliance on and overconfidence in software is a legitimate and important concern for system safety engineering.

Not Understanding the Risks Associated with Software: The accident reports all exhibited the common belief that the same techniques used for electromechanical components will work in software-intensive systems. However, the failure modes for software are very different from those for physical devices, and the contribution of software to accidents is also different; engineering activities must be changed to reflect these differences. Almost all software-related accidents can be traced back to flaws in the requirements specification and not to coding errors. In these cases, the software performed exactly as specified (the implementation was "correct"), but the specification was incorrect because (1) the requirements were incomplete or contained incorrect assumptions about the required operation of the system components being controlled by the software or about the required operation of the computer, or (2) there were unhandled controlled-system states and environmental conditions. This in turn implies that the majority of the software system safety effort should be devoted to requirements analysis, including completeness (we have specified an extensive set of completeness criteria), correctness, potential contribution to system hazards, robustness, and possible operator mode confusion and other operator errors created or worsened by software design.
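A minimal sketch of what such a requirements flaw can look like in code, in the spirit of the Ariane 5 conversion overflow: the routine below satisfies its written specification, which was developed for an assumed value envelope, but the envelope assumption itself is the flaw. The function name, range, and structure are illustrative assumptions for this paper's argument, not the actual implementation (the flight software was written in Ada, not Python).

def horizontal_bias_to_int16(value: float) -> int:
    # Convert an internal floating-point quantity to a 16-bit signed integer
    # for downstream use. Specified, reviewed, and verified for the envelope
    # assumed during requirements analysis (|value| never exceeds 32767); the
    # conversion is deliberately left unprotected because analysis of the
    # earlier vehicle showed that envelope could not be exceeded.
    result = int(value)
    if not -32768 <= result <= 32767:
        # Outside the assumed envelope: an exception that, in the scenario
        # described above, no caller is prepared to handle.
        raise OverflowError("horizontal bias exceeds 16-bit range")
    return result

# Within the assumed envelope the code does exactly what its specification
# says, so testing against that specification finds nothing wrong.
print(horizontal_bias_to_int16(21050.0))

# A new launcher with a different early trajectory produces larger values,
# an environmental condition the requirements never covered. Running this
# line raises the unhandled exception.
print(horizontal_bias_to_int16(64000.0))

The implementation here is "correct" with respect to its specification; only a requirements-level analysis that questions the envelope assumption would catch the hazard before flight.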
Confusing Reliability and Safety: Accidents are changing their nature. We are starting to see an increase in system accidents that result from dysfunctional interactions among components, not from individual component failure. Each of the components may actually have operated according to its specification (as is true for most software involved in accidents), but the combined behavior led to a hazardous system state. When humans are involved, often their behavior can only be labeled as erroneous in hindsight; at the time and given the context, their behavior was reasonable (although this does not seem to deter accident investigators from placing all or most of the blame on the operators).

System accidents are caused by interactive complexity and tight coupling. Software allows us to build systems with a level of complexity and coupling that is beyond our ability to control; in fact, we are building systems where the interactions among the components cannot be planned, understood, anticipated, or guarded against. This change is not solely the result of using digital components, but it is made possible by the flexibility of software.

Standards for commercial aircraft certification, even relatively new ones, focus on component reliability and redundancy and thus are not effective against system accidents. In the aircraft accidents studied, the software satisfied its specifications and did not "fail," yet the automation obviously contributed to the flight crews' actions and inactions. Spacecraft engineering in most cases also focuses primary effort on preventing accidents by eliminating component failures or preparing for failures by using redundancy. These approaches are fine for electromechanical systems and components, but they will not be effective for software-related accidents.

The use of redundancy in the form of multiple independently developed software versions rests on an assumption of failure independence that has been shown to be false in practice and by scientific experiments (see, for example, [4]). Common-cause (but usually different) logic errors tend to lead to incorrect results when the various versions attempt to handle the same unusual or difficult-to-handle inputs. In addition, such designs usually involve adding to system complexity, which can itself result in failures. A NASA study of an experimental aircraft with two versions of the control system found that all of the software problems occurring during flight testing resulted from errors in the redundancy management system and not in the control software itself.