The Role of Software in Recent Aerospace Accidents*

Nancy G. Leveson, Ph.D.; Aeronautics and Astronautics Department, Massachusetts Institute of Technology; Cambridge, MA; [email protected]; http://sunnyday.mit.edu

Keywords: software safety

* This research was partially supported by a grant from the NASA Ames Design for Safety program and by the NASA IV&V Center Software Initiative program.

Abstract

This paper describes causal factors related to software that are common to several recent spacecraft accidents and what might be done to mitigate them.

Introduction

In the process of a research project to evaluate accident models, I looked in detail at a variety of spacecraft and aircraft accidents that in some way involved software [8]. The accidents studied all demonstrated the usual cultural, organizational, and communications problems such as complacency, diffusion of or lack of responsibility and authority for safety, low-level status and inappropriate organizational placement of the safety program, and limited communication channels and poor information flow. These typical problems are well known and the solutions clear, although sometimes difficult to implement. Software contributions to accidents are less well understood, however.

The accidents investigated were the explosion of the Ariane 5 launcher on its maiden flight in 1996; the loss of the Mars Climate Orbiter in 1999; the destruction of the Mars Polar Lander sometime during the entry, deployment, and landing phase in the following year; the placing of a Milstar satellite in an incorrect and unusable orbit by the Titan IV B-32/Centaur launch in 1999; the flight of an American Airlines B-757 into a mountain near Cali, Colombia; the collision of a Lufthansa A320 with an earthbank at the end of the runway at Warsaw; and the crash of a China Airlines A320 short of the runway at Nagoya, Japan.

On the surface, the events and conditions involved in the accidents appear to be very different. A more careful, detailed analysis of the systemic factors, however, reveals striking similarities. Only the root causes or systemic factors are considered here, that is, the causal factors that allowed the specific events to occur and that affect general classes of accidents. In the Challenger accident, for example, the specific event leading to the loss was the O-ring failure, but the systemic factors included such things as flawed decision making, poor problem reporting, lack of trend analysis, a "silent" or ineffective safety program, communication problems, etc.

Overconfidence and Overreliance on Digital Automation: All the accidents involved systems built within an engineering culture that had unrealistic expectations about software and the use of computers. For example, the official Ariane 5 accident report notes that software was assumed to be correct until it was shown to be faulty. The opposite assumption is more realistic.

Engineers often underestimate the complexity of software and overestimate the effectiveness of testing. It is common to see risk assessments that assume testing will remove all risk associated with digital components. This form of complacency plays a part in the common proliferation of software functionality and in unnecessary design complexity.

In the aircraft accidents examined, overconfidence in automation both (1) encouraged engineers to trust software over humans and give final authority to the computer rather than the pilot, and (2) encouraged pilots to trust their computer-based decision aids beyond the point where they should have.

Some of the technical inadequacies in high-tech aircraft system design stem from lack of confidence in the human and overconfidence in the automation. In several of the Airbus accidents, the pilots found themselves fighting the automation for control of the aircraft---which had been designed to give ultimate authority to the automation.

Even if automation is considered to be more reliable than humans, it may be a mistake not to allow flexibility in the system for emergencies and allowance for pilots to override physical interlocks, such as the inability of the pilots to operate the ground spoilers and engine thrust reversers in the Warsaw A320 accident because the computers did not think the airplane was on the ground. Reliable operation of the automation is not the problem here; the automation was very reliable in all these cases. Instead, the issue is whether software can be constructed that will exhibit correct and appropriate behavior under every foreseeable and unforeseeable situation, and whether we should be trusting software over pilots.

At the same time, some of the aircraft accident reports cited the lack of automated protection against, or nonalerting of the pilots to, unsafe states such as out-of-trim situations. A sophisticated hazard analysis and close cooperation among system safety engineers, human factors engineers, aerospace engineers, and software engineers is needed to make these difficult decisions about task allocation and feedback requirements.

Engineers are not alone in placing undeserved reliance on software. Research has shown that operators of highly reliable automated systems (such as flight management systems) will increase their use of and reliance on automation as their trust in the system increases.

At the same time, merely informing flightcrews of the hazards of overreliance on automation and advising them to turn it off when it becomes confusing is insufficient and may not affect pilot procedures when it is most needed.

The European Joint Aviation Authorities' Future Aviation Safety Team has identified "crew reliance on automation" as the top potential safety risk in future aircraft [5]. This reliance on and overconfidence in software is a legitimate and important concern for system safety engineering.

Not Understanding the Risks Associated with Software: The accident reports all exhibited the common belief that the same techniques used for electromechanical components will work in software-intensive systems. However, the failure modes for software are very different than those for physical devices, and the contribution of software to accidents is also different: Engineering activities must be changed to reflect these differences. Almost all software-related accidents can be traced back to flaws in the requirements specification and not to coding errors. In these cases, the software performed exactly as specified (the implementation was "correct"), but the specification was incorrect because (1) the requirements were incomplete or contained incorrect assumptions about the required operation of the system components being controlled by the software or about the required operation of the computer, or (2) there were unhandled controlled-system states and environmental conditions. This in turn implies that the majority of the software system safety effort should be devoted to requirements analysis, including completeness (we have specified an extensive set of completeness criteria), correctness, potential contribution to system hazards, robustness, and possible operator mode confusion and other operator errors created or worsened by software design.

Confusing Reliability and Safety: Accidents are changing their nature. We are starting to see an increase in system accidents that result from dysfunctional interactions among components, not from individual component failure. Each of the components may actually have operated according to its specification (as is true for most software involved in accidents), but the combined behavior led to a hazardous system state. When humans are involved, often their behavior can only be labeled as erroneous in hindsight---at the time and given the context, their behavior was reasonable (although this does not seem to deter accident investigators from placing all or most of the blame on the operators).

System accidents are caused by interactive complexity and tight coupling. Software allows us to build systems with a level of complexity and coupling that is beyond our ability to control; in fact, we are building systems where the interactions among the components cannot be planned, understood, anticipated, or guarded against. This change is not solely the result of using digital components, but it is made possible because of the flexibility of software.

Standards for commercial aircraft certification, even relatively new ones, focus on component reliability and redundancy and thus are not effective against system accidents. In the aircraft accidents studied, the software satisfied its specifications and did not "fail," yet the automation obviously contributed to the flight crews' actions and inactions. Spacecraft engineering in most cases also focuses primary effort on preventing accidents by eliminating component failures or preparing for failures by using redundancy. These approaches are fine for electromechanical systems and components, but will not be effective for software-related accidents.

The first step in handling system accidents is for engineers to recognize the need for change and to understand that safety and reliability are different qualities for software---one does not imply the other. One of the founders of system safety, C.O. Miller, cautioned that "distinguishing hazards from failures is implicit in understanding the difference between safety and reliability" [13]. Although confusing reliability with safety is common in engineering (and particularly common in software engineering), it is perhaps most unfortunate with regard to software as it encourages spending much of the effort devoted to safety on activities that are likely to have little or no effect.

Overrelying on Redundancy: Redundancy usually has a greater impact on reliability than safety. System accidents, for example, will not be decreased at all by the use of redundancy. In fact, the added complexity introduced by redundancy has frequently resulted in accidents. In addition, redundancy is most effective against random wearout failures and least effective against requirements and design errors---the latter being the only type found in software. For example, the Ariane report notes that, according to the culture of the Ariane program, only random failures are addressed, and they are primarily handled with redundancy. This approach obviously failed when on the Ariane 5's first flight both Inertial Reference System computers shut themselves down (exactly as they were designed to do) as a result of the same unexpected input value.

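The Ariane 5 example shows the failure pattern at the code level: both channels ran the same logic, received the same out-of-range value, and took the same action. The sketch below is purely illustrative: it is not the actual inertial reference system code, and the variable names, the 16-bit limit, and the saturating fallback are assumptions made for this example. It contrasts the shut-down-on-unexpected-input behavior with a simple reasonableness check on the input:

```c
/* Illustrative sketch only -- not the actual Ariane 5 SRI code. It shows why
 * identical redundant channels give no protection against a requirements or
 * design error: both copies meet the same unhandled input and take the same
 * action. Names (horizontal_bias, MAX_REPRESENTABLE) are hypothetical. */
#include <stdint.h>
#include <stdio.h>

#define MAX_REPRESENTABLE 32767.0   /* limit of the 16-bit field assumed here */

/* Behavior analogous to the accident: treat an out-of-range value as a fault
 * and let the channel declare itself failed. Run on two identical channels,
 * both shut down on the same input. */
int convert_and_shutdown_on_error(double horizontal_bias, int16_t *out)
{
    if (horizontal_bias > MAX_REPRESENTABLE || horizontal_bias < -MAX_REPRESENTABLE)
        return -1;                  /* "operand error": channel stops producing data */
    *out = (int16_t)horizontal_bias;
    return 0;
}

/* A reasonableness-check alternative: keep producing an output (here by
 * saturating) and report the anomaly instead of halting the channel. */
int convert_with_reasonableness_check(double horizontal_bias, int16_t *out)
{
    if (horizontal_bias > MAX_REPRESENTABLE)        *out = INT16_MAX;
    else if (horizontal_bias < -MAX_REPRESENTABLE)  *out = INT16_MIN;
    else                                            *out = (int16_t)horizontal_bias;
    return 0;                       /* anomaly would be logged, not turned into a shutdown */
}

int main(void)
{
    int16_t a = 0, b = 0;
    double flight_value = 64000.0;  /* larger than the older trajectory assumption */

    /* Two "redundant" channels running identical logic fail identically: */
    printf("channel A: %d, channel B: %d\n",
           convert_and_shutdown_on_error(flight_value, &a),
           convert_and_shutdown_on_error(flight_value, &b));

    convert_with_reasonableness_check(flight_value, &a);
    printf("with a range check, the output saturates to %d and the channel keeps running\n", a);
    return 0;
}
```

Whether saturating, substituting a best estimate, or switching to a degraded mode is the right response cannot be decided at the coding level; it is exactly the kind of system-level requirements question that, as this paper argues, must be answered by the hazard analysis.
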
To cope with software design errors, "diversity" has been suggested in the form of independent groups writing multiple versions of software with majority voting on the outputs. This approach is based on the assumption that such versions will fail in a statistically independent manner, but this assumption has been shown to be false in practice and by scientific experiments (see, for example, [4]). Common-cause (but usually different) logic errors tend to lead to incorrect results when the various versions attempt to handle the same unusual or difficult-to-handle inputs. In addition, such designs usually involve adding to system complexity, which can itself result in failures. A NASA study of an experimental aircraft with two versions of the control system found that all of the software problems occurring during flight testing resulted from errors in the redundancy management system and not in the control software itself, which worked perfectly [12].

Assuming Risk Decreases over Time: In the Milstar satellite loss, the Titan Program Office had decided that because the software was "mature, stable, and had not experienced problems in the past," they could use the limited resources available after the initial development effort to address hardware issues. Other accidents studied had this same flawed approach to resource prioritization.

A common assumption is that risk decreases over time as accident-free operation accumulates. In fact, risk usually increases over time, particularly in software-intensive systems. The Therac-25, a radiation therapy machine that massively overdosed five patients due to software flaws, operated safely thousands of times before the first accident. Industrial robots operated safely around the world for several million hours before the first fatality.

Risk may increase over time because caution wanes and safety margins are cut, because time increases the probability that the unusual conditions will occur that trigger an accident, or because the system itself or its environment changes. In some cases, the introduction of an automated device may actually change the environment in ways not predicted during system design.

Software also tends to be changed frequently and "evolves" over time, but changing software without introducing errors or undesired behavior is much more difficult than building correct software in the first place. The more changes that are made to software, the more "brittle" the software becomes and the more difficult it is to make changes without introducing errors.

The history of accidents shows that a strong system safety program is needed during operations. All changes to the software must be analyzed for their impact on safety. Such change analysis will not be feasible unless special steps are taken during development to document the information needed. Incident and accident analysis, as for any system, will also be important, as will performance monitoring and periodic operational process audits.

The environment in which the system and software are operating will change over time, partially as a result of the introduction of the automation or system itself. Basic assumptions made in the original hazard analysis process must have been recorded and then should be periodically evaluated to ensure they are not being violated in practice. For example, in order not to distract pilots during critical phases of flight, TCAS includes the ability for the pilot to switch to a Traffic-Advisory-Only mode, where traffic advisories are displayed but display of resolution advisories (escape maneuvers) is inhibited. It was assumed in the original TCAS system design and hazard analysis that this feature would be used only during final approach to parallel runways, when two aircraft come close to each other and TCAS would call for an evasive maneuver. The actual use of this feature in practice would be an important assumption to check periodically to make sure it is not being used in other situations where it might lead to a hazard. But that requires that the assumption was recorded and not forgotten.

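To show what it can mean in practice to record such an assumption where it can be audited, the sketch below pairs the documented TCAS assumption with a simple operational check. The structure, field names, and logging are hypothetical; this is not TCAS code:

```c
/* Sketch (hypothetical names and fields): record a hazard-analysis assumption
 * alongside the software and flag operational use that violates it, so the
 * assumption can be audited periodically instead of being forgotten. */
#include <stdbool.h>
#include <stdio.h>

struct flight_context {
    bool ta_only_selected;     /* pilot has inhibited resolution advisories        */
    bool on_final_approach;    /* phase of flight, as estimated by the system      */
    bool parallel_runway_ops;  /* the context the original hazard analysis assumed */
};

/* Documented assumption (from the TCAS example above): the Traffic-Advisory-Only
 * mode is used only during final approach to closely spaced parallel runways. */
void audit_ta_only_assumption(const struct flight_context *ctx)
{
    bool assumed_context = ctx->on_final_approach && ctx->parallel_runway_ops;
    if (ctx->ta_only_selected && !assumed_context) {
        /* Not an error in itself, but evidence that an assumption underlying
         * the hazard analysis is being violated in operational practice. */
        printf("ASSUMPTION AUDIT: TA-only mode selected outside the assumed context\n");
    }
}
```

The particular check matters less than the principle: the assumption lives somewhere it can be evaluated during operations rather than only in an analysis document that no one reads again.
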
Ignoring Warning Signs: Warning signs almost always occur before major accidents. For example, all the aircraft accidents considered in this research had had precursors, but priority was not placed on fixing the causal factors before they recurred, or the responses were inadequate to prevent future loss. Two of the three aircraft accidents studied involved problems for which software fixes had been created but, for various reasons, had not been installed on the specific aircraft involved. The reasons for this omission are complicated and sometimes involved politics and marketing (and their combination) as much as complacency or cost factors.

Engineers noticed the problems with the Titan/Centaur software after it was delivered to the launch site, but nobody seemed to take them seriously. The problems experienced with the Mars Climate Orbiter software during the early stages of the flight did not seem to raise any red flags. The system safety information system should include the collection of such information and its analysis to detect problems before they cause serious losses.

Inadequate Cognitive Engineering: Commercial aviation is the first industry where shared control of safety-critical functions between humans and computers has been widely implemented. The very difficult problems that result, such as those associated with mode confusion and deficiencies in situational awareness, are slow to be recognized and acknowledged. It is more common to simply blame the pilot for the accident than to investigate the aspects of system design that may have led to the human error(s).

All the aircraft accident reports focused on pilot error. Some of the spacecraft accident investigations also focused on the ground controllers and why they did not catch the software problems before the loss, instead of focusing on why the software problems were introduced in the first place and not caught before operational deployment of the system.

Cognitive engineering, particularly that directed at the influence of software design on human error, is still in its early stages. Human factors experts have written extensively on the potential risks introduced by the automation capabilities of glass cockpit aircraft. Among those identified are: overreliance on automation; shifting workload by increasing it during periods of already high workload and decreasing it during periods of already low workload; being "clumsy" or difficult to use; being opaque or difficult to understand; and requiring excessive experience to gain proficiency in its use. In the Cali accident, for example, the accident report noted task saturation and overload, poor situation awareness (inadequate mental models of the automation and the situation), and distraction from appropriate behavior.

Researchers have suggested that pilots of high-tech aircraft can lose awareness of the current aircraft flight mode or exhibit other forms of mode confusion. In addition, many of the problems found in human--automation interaction lie in the human not getting adequate feedback to monitor the automation and to make appropriate decisions.

System safety needs to consider these potential problems in any hazard analysis. Just as hazard analysis needs to begin in the early conceptual stages, so does the design of the human--computer interaction. We have defined a system engineering process that is both human-centered and safety-driven [11]. The information generated by system safety engineers in the system hazard analysis process can be extremely useful in defining operator goals and responsibilities, task allocation principles, and operator task and training requirements, i.e., in the activities involved in designing safer human--computer interactions.

Inadequate Specifications: The Mars Polar Lander accident report notes that the system-level requirements document did not specifically state the failure modes the requirement was protecting against (in this case, possible sensor transients) and speculates that the software designers or one of the reviewers might have discovered the missing software requirement if they had been aware of the rationale underlying the system requirements. The Ariane accident report refers in several places to inadequate specification practices and notes that the structure of the documentation obscured the ability to review the critical design decisions and their underlying rationale.

The small part of the Mars Polar Lander software requirements specification shown in the accident report (which may very well be misleading) avoids all mention of what the software must not do. In fact, some standards and industry practices even forbid such negative requirements statements. The result is that software specifications often describe nominal behavior well but are very incomplete with respect to required software behavior under off-nominal conditions and rarely describe what the software is not supposed to do. Most safety-related requirements and design constraints are best described using such negative requirements or design constraints, so they are often omitted.

This is a place where good system hazard analysis can be very helpful. Unfortunately, many hazard analyses treat software superficially at best. The hazard analysis produced for the Mars Polar Lander (MPL) during the accident investigation is typical. The JPL report on the MPL loss identifies the hazards for each phase of the entry, descent, and landing sequence, such as (1) Propellant Line Rupture, (2) Excessive horizontal velocity causes lander to tip over at touchdown, and (3) Premature shutdown of the descent engines. For software, however, only one statement---Flight software fails to execute properly---is identified, and it is labeled as common to all phases.

The problem with such vacuous statements is that they provide no useful information---it is equivalent to simply substituting a single statement, Hardware fails to operate properly, for all the other identified system hazards. Singling out the JPL engineers is unfair here because I find the same types of useless statements about software in almost all the fault trees and other hazard analyses I see in industry. Boxes in fault trees that simply say Software Fails can be worse than useless because they are untrue---not all software misbehavior will cause the identified hazard---and they lead to nonsensical activities like using a general reliability figure for software (assuming one believes such a number can be produced) in quantitative fault tree analyses, when that figure does not reflect in any way the probability of the software exhibiting a particular hazardous behavior.

Instead, specific hazardous software behavior needs to be identified. In a collision avoidance system, for example, a fault tree box might contain (1) Collision avoidance logic calls for a reversal of an advisory when the pilot has insufficient time to respond, or (2) two boxes connected by an AND might contain Software issues a crossing advisory and The pilot does not follow his/her advisory. These two identified hazardous software behaviors might be translated into two system design constraints:

1. The software must not call for a reversal of an advisory when two aircraft are separated by less than 200 feet vertically and 10 seconds or less remain to closest point of approach, and

2. Crossing maneuvers must be avoided if possible,

along with a pilot procedural requirement to follow all advisories and to continue to do so until the advisory is removed.

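To show how such a constraint can be carried into the software in a directly reviewable form, the sketch below encodes constraint 1 as an explicit guard. The data structure and function names are hypothetical and the logic is deliberately simplified; this is not TCAS pseudocode:

```c
/* Illustrative sketch: one way a design constraint derived from a fault-tree
 * box can appear in the implementation as an explicit, reviewable guard.
 * Structure and function names are hypothetical. */
#include <stdbool.h>
#include <math.h>

struct encounter_state {
    double vertical_sep_ft;   /* current vertical separation, feet              */
    double time_to_cpa_s;     /* time remaining to closest point of approach, s */
};

/* Design constraint 1: do not call for a reversal of an advisory when the
 * aircraft are separated by less than 200 ft vertically and 10 s or less
 * remain to closest point of approach. */
static bool reversal_permitted(const struct encounter_state *e)
{
    bool too_late = fabs(e->vertical_sep_ft) < 200.0 && e->time_to_cpa_s <= 10.0;
    return !too_late;
}

/* The advisory logic, whatever it is, consults the guard before reversing, so
 * the hazardous behavior identified in the fault tree cannot be commanded. */
bool issue_reversal(const struct encounter_state *e)
{
    if (!reversal_permitted(e))
        return false;         /* keep the existing advisory; log the event for review */
    /* ... reversal selection logic would go here ... */
    return true;
}
```

Writing the constraint as a named, commented guard preserves the traceability from the hazard and fault-tree box down to the code, which is exactly what later reviews and change analyses need.
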
Complete and understandable specifications are not only necessary for development, but they are also critical for operations and the handoff between developers, maintainers, and operators. The ground operations staff in the spacecraft accidents and the pilots in the aircraft accidents all had misunderstandings about the automation and how to use it. All three aircraft accident reports cited inadequate, conflicting, or poorly designed documentation and information presentation.

A specification method called Intent Specifications [7] has been defined that integrates safety information into the engineering decision-making environment during the early stages of system conceptual design and functional allocation, encourages documentation of design rationale and safety-related design assumptions, and provides complete traceability from high-level requirements and hazard analyses down through the levels of system design to the component implementation details and vice versa. While intent specifications are useful during the original system design, they should be particularly helpful during operations in performing safety assessments on potential changes to the system and software.

Flawed Review Process: In general, software is difficult to review, and the success of such an effort is greatly dependent on the quality of the specifications. However, identifying unsafe behavior, i.e., the things that the software should not do, and concentrating on that behavior for at least part of the review process helps to focus the review and to ensure that critical issues are adequately considered. The fact that specifications usually include only what the software should do and omit what it should not do makes this type of review even more important and effective in finding serious problems. Such unsafe (or mission-critical) behavior should be identified in the system engineering process before software development begins. The design rationale and design features used to prevent the unsafe behavior should also have been documented and can be the focus of such a review. This presupposes, of course, a system safety process to provide the information, which does not appear to have existed for the projects involved in the accidents studied.

The two identified Mars Polar Lander software errors, for example, involved incomplete handling of software states and are both examples of very common specification flaws and logic omissions often involved in accidents. As such, they were potentially detectable by formal and informal analysis and review techniques. The Ariane report also says that the limitations of the inertial reference system software were not fully analyzed in reviews, and it was not realized that the test coverage was inadequate to expose such limitations.

Software hazard analysis and requirements analysis techniques exist (and more should be developed) to detect all these types of incompleteness. To make such a review feasible, the requirements should include only the externally visible (blackbox) behavior; all implementation-specific information should be put into a separate software design specification (which will be subject to a later software design review by a more limited set of reviewers). The only information relevant to requirements review at this level is the software behavior that is visible outside the computer. Specifying only blackbox behavior (in engineering terms, this is often referred to as the transfer function across the digital component) allows a wide set of reviewers to concentrate on the information of importance to them without being overwhelmed by internal design information that has no impact on externally observable behavior.

The language used to specify the software requirements is critical to the success of the review. The best way to find errors in the software requirements is to include a large range of disciplines and expertise in the review process, including system safety engineers. Formal specification methods have tremendous potential for enhancing our ability to provide correct and complete requirements. In addition, executable specifications can be helpful in understanding the implications of complex software behavior. We showed in our TCAS project that it is possible for a formal, executable requirements specification to be readable and understandable with a minimum of training and without requiring advanced degrees in discrete math and logic [10], but most designers of formal requirements specification languages have not put a high priority on readability and learnability. We have used what we learned during the TCAS project to design an even more reviewable new requirements specification language.

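To make the blackbox idea concrete, the sketch below states a single requirement purely in terms of externally visible inputs and outputs and expresses it as an executable check that could be run over a test or simulation trace. The requirement itself (ground spoilers must be commanded once weight-on-wheels has persisted for two samples) is invented for illustration and is not taken from any of the accident systems:

```c
/* Sketch (hypothetical system, names, and requirement): a "blackbox" requirement
 * written only over observable inputs and outputs, turned into an executable
 * check over a recorded trace. No internal design detail appears. */
#include <stdbool.h>
#include <assert.h>

struct observable_io {
    bool weight_on_wheels;    /* input visible at the component boundary  */
    bool spoilers_commanded;  /* output visible at the component boundary */
};

/* Assumed requirement: if weight-on-wheels has been true for at least two
 * consecutive samples, the ground-spoiler command must be enabled. */
void check_blackbox_requirement(const struct observable_io trace[], int len)
{
    int wow_run = 0;
    for (int i = 0; i < len; i++) {
        wow_run = trace[i].weight_on_wheels ? wow_run + 1 : 0;
        if (wow_run >= 2) {
            /* The check references no internal variables or design decisions,
             * so reviewers from any discipline can read and challenge it. */
            assert(trace[i].spoilers_commanded);
        }
    }
}
```

Because nothing internal to the software appears in the statement, a system safety engineer, a pilot, or a human factors specialist can review it as readily as a programmer, which is the point of restricting the requirements review to externally visible behavior.
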
Inadequate System Safety Engineering: All of the accident reports studied are surprisingly silent about the safety programs involved. One would think that the safety activities and why they had been ineffective would figure prominently in the investigations, assuming, of course, there were active safety programs and the efforts were not marginalized or ignored by the other engineering activities. Judging only from the information (or lack of it) provided in the accident reports, it is likely that none of these projects had a robust system safety or software system safety program.

Providing the information needed to make safety-related engineering decisions is the major contribution of system safety techniques to engineering. It has been estimated that 70-90% of the safety-related decisions in an engineering project are made during the early concept development stage [2]. When hazard analyses are not performed, are done only after the fact (for example, as a part of quality or mission assurance of a completed design), or are performed but the information is never integrated into the system design environment, they can have no effect on these decisions and the safety effort reduces to a cosmetic and perfunctory role.

The Titan accident provides an example of what happens when such analysis is not done. The risk analysis, in that case, was not based on determining the steps critical to mission success but instead considered only the problems that had occurred in previous launches. Software constant generation (an important factor in the Milstar satellite loss) was considered to be low risk because there had been no previous problems with it. There is, however, a potentially enormous (perhaps unlimited) number of errors related to software, and considering only those mistakes made previously, while certainly prudent, is not adequate. Proper hazard analysis that examines all the ways the system components (including software) or their interaction can contribute to accidents needs to be performed and used in decision making.

The Mars Climate Orbiter accident report recommended that the NASA Mars Program institute a classic system safety engineering program, i.e., continually performing the system hazard analyses necessary to explicitly identify mission risks and communicating these risks to all segments of the project team and institutional management; vigorously working to make tradeoff decisions that mitigate the risks in order to maximize the likelihood of mission success; and regularly communicating the progress of the risk mitigation plans and tradeoffs to project, program, and institutional management. The other spacecraft accident reports, in contrast, recommended applying classic reliability engineering approaches that are unlikely to be effective for system accidents or software-related causal factors.

Violation of Basic Safety Engineering Practices in the Digital Parts of the System: Although system safety engineering textbooks and standards include principles for safe design, software engineers are almost never taught them. As a result, software often does not incorporate basic safe design principles---for example, separating and isolating critical functions, eliminating unnecessary functionality, designing error-reporting messages such that they cannot be confused with critical data (see the Ariane 5 loss), and reasonableness checking of inputs and internal states.

Consider the Mars Polar Lander loss as an example. The JPL report on the accident states that the software designers did not include any mechanisms to protect against transient sensor signals, nor did they think they had to test for transient conditions. They also apparently did not include a check of the current altitude before turning off the descent engines. Runtime reasonableness and other types of checks should be part of the design criteria used for any real-time software.

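A minimal sketch of the two missing protections described in the report, filtering transient touchdown indications and confirming altitude before an irreversible engine shutdown, is shown below. The names, thresholds, and simplified control loop are assumptions made for illustration; this is not the Mars Polar Lander flight code:

```c
/* Illustrative sketch only (hypothetical names and thresholds) of the two kinds
 * of runtime reasonableness checks discussed above: rejecting transient
 * touchdown-sensor signals and confirming altitude before commanding
 * descent-engine shutdown. Not actual lander flight software. */
#include <stdbool.h>

#define CONSECUTIVE_READS_REQUIRED 3      /* debounce: assumed persistence check    */
#define MAX_SHUTDOWN_ALTITUDE_M    5.0    /* assumed sanity limit for engine cutoff */

struct landing_state {
    int  touchdown_count;                 /* consecutive touchdown indications */
    bool engines_on;
};

/* Filter transient sensor signals: require several consecutive indications
 * rather than latching the first momentary one (e.g., from leg deployment). */
static bool touchdown_confirmed(struct landing_state *s, bool sensor_reads_touchdown)
{
    s->touchdown_count = sensor_reads_touchdown ? s->touchdown_count + 1 : 0;
    return s->touchdown_count >= CONSECUTIVE_READS_REQUIRED;
}

/* Reasonableness check before an irreversible action: do not shut down the
 * descent engines unless the independently estimated altitude is plausible
 * for touchdown. */
void update_terminal_descent(struct landing_state *s,
                             bool sensor_reads_touchdown,
                             double estimated_altitude_m)
{
    if (!s->engines_on)
        return;

    if (touchdown_confirmed(s, sensor_reads_touchdown) &&
        estimated_altitude_m <= MAX_SHUTDOWN_ALTITUDE_M) {
        s->engines_on = false;            /* shut down only when both checks agree */
    }
    /* A confirmed touchdown indication at an implausible altitude would be
     * reported as an anomaly rather than acted on. */
}
```

The particular persistence count and altitude limit are system engineering decisions that have to come out of the hazard analysis; the code-level point is only that an irreversible action is guarded by an independent reasonableness check.
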
Another basic design principle for mission-critical software is that unnecessary code or functions should be eliminated or separated from mission-critical code and its processing. The Ariane 5 and Titan IVB-32 accidents involved code that was not needed (it was in some reused software designed for other spacecraft). In the case of the Mars Polar Lander, the code that caused the problems was necessary (in fact, it was critical) but was executing at a time when it was not needed. I am sure that each of these decisions was considered carefully, but the tradeoffs may not have been made in an optimal way and risk may have been discounted with respect to other properties.

Software Reuse without Appropriate Safety Analysis: It is widely believed that because software has executed safely in other applications, it will be safe in the new one. This misconception arises from confusion between software reliability and safety. As stated, most accidents involve software that is doing exactly what it was designed to do, but the designers misunderstood what behavior was required and would be safe. The blackbox (externally visible) behavior of a component can only be determined to be safe by analyzing its effects on the system in which it will be operating, that is, by considering the specific operational context. The fact that software has been used safely in another environment provides no information about its safety in the current one. In fact, reused software is probably less safe because the original decisions about the required software behavior were made for a different system design and were based on different environmental assumptions. Changing the environment in which the software operates makes all previous usage experience with the software irrelevant for determining safety.

A reasonable conclusion to be drawn is not that software cannot be reused, but that a safety analysis of its operation in the new system context is mandatory: Testing alone is not adequate to accomplish this goal. For complex designs, the safety analysis required stretches the limits of current technology. For such analysis to be technically and financially feasible, reused software must contain only the features necessary to perform critical functions: Both the Ariane 5 and the Titan software contained unnecessary functions that led to the losses, and the MPL loss involved a necessary function operating when it was not necessary.

COTS software is often constructed with as many features as possible to make it commercially useful in a variety of systems. Thus there is tension between using COTS software and being able to perform a safety analysis and have confidence in the safety of the system. This tension must be resolved in management decisions about risk---ignoring it only leads to accidents and potential losses that are greater than the additional cost of designing and building new components instead of buying them.

If software reuse is to result in acceptable risk, then system and software modeling and analysis techniques must be used to perform the necessary safety analyses. This process is not easy or cheap. Introducing computers does not preclude the need for good engineering practices, and it almost always involves higher costs despite the common myth that introducing automation will save money. Our blackbox formal requirements specification language contains the information necessary for such a safety analysis and therefore should be useful not only in the original system development but also when the software is to be reused.

Unnecessary Complexity and Software Functions: One of the most basic concepts in critical systems is to "keep it simple." The seemingly unlimited ability of software to implement desirable features often, as in the case of most of the accidents examined in this paper, pushes this basic principle into the background: Creeping featurism is a common problem in software engineering. As stated earlier, the Ariane and Titan accidents involved software functions that were not needed, but surprisingly the decision to put in or to keep (in the case of reuse) these unneeded features was not questioned in the accident reports. The Mars Polar Lander accident involved software that was executing when it was not necessary to execute. In the case of the Titan IVB-32, the report explains that the software roll rate filter involved in the loss of the Milstar satellite was not needed but was kept in for "consistency." The exact same words are used for software functions leading to the loss of the Ariane 5. Neither report explains why consistency was assigned such high priority. In all these projects, tradeoffs were obviously not considered adequately, perhaps partially due to complacency about software risk.

The more features included in software and the greater the resulting complexity (both software complexity and system complexity), the harder and more expensive it is to test, to provide assurance through reviews and analysis, to maintain, and to reuse in the future. Engineers need to start making these hard decisions about functionality with a realistic appreciation of their effect on development cost and eventual system safety and system reliability.

Operational Personnel Not Understanding the Automation: Neither the MPL nor the Titan mission operations personnel understood the system or software well enough to interpret the data they saw as indicating there was a problem in time to prevent the loss. Complexity in the automation, combined with poor documentation and training procedures, is contributing to this problem, which is becoming a common factor in aircraft accidents. Accidents, surveys, and simulator studies have emphasized the problems pilots are having in understanding digital automation and have shown that pilots are surprisingly uninformed about how the automation works [14], [11].

Accidents have further demonstrated that proficiency in the use of sophisticated automation, such as an FMS (Flight Management System), without adequate knowledge about the logic underlying critical features, such as the design and programmed priorities of its navigation database or autopilot override functions, can lead to its misuse and to accidents. Problems are especially found with controls and operations the crews rarely experience in daily flight, such as unusual mode changes and manual overrides.

Either the design of the automation we are building needs to be simplified so it is understandable, or new training methods are needed for those who must deal with the clumsy, unpredictable, and inconsistent automation we are designing, or both.

Test and Simulation Environments that do not Match the Operational Environment: It is always dangerous to conclude that poor testing was the "cause" of an accident. After the fact, it is always easy to find a test case that would have uncovered a known error, but it is usually difficult to prove that the particular test case would have been selected beforehand, even if testing procedures were changed. By definition, the cause of an accident can always be stated as a failure to test for the condition that was determined, after the accident, to have led to the loss. However, in the accidents studied, there do seem to be omissions that reflect poor decisions related to testing, particularly with respect to the accuracy of the simulated operational environment.

A general principle in testing aerospace systems is to fly what you test and test what you fly. This principle was violated in all the spacecraft accidents, especially with respect to software. The software test and simulation processes must reflect the environment accurately. Although implementing this principle is often difficult or even impossible for spacecraft, no reasonable explanation was presented in the reports for some of the omissions in the testing for these systems. An example was the use of Ariane 4 trajectory data in the specifications and simulations of the Ariane 5 software even though the Ariane 5 trajectory was known to be different. Another example was not testing the Titan/Centaur software with the actual load tape prior to launch.

Deficiencies in Safety-Related Information Collection and Use: In all the spacecraft accidents, the existing formal anomaly reporting system was bypassed (in Ariane 5, there is no information about whether one existed) and informal email and voice mail were substituted. The problem is clear but not the cause, which was not included in the reports and perhaps not investigated. When a structured process exists and is not used, there is usually a reason. Some possible explanations may be that the system is difficult or unwieldy to use or that it involves too much overhead. Such systems may not be changing as new technology changes the way engineers work.

There is no reason why reporting something within the problem-reporting system should be much more cumbersome than adding an additional recipient to an email. The Raytheon CAATS (Canadian Automated Air Traffic System) project implemented an informal email process for reporting anomalies and safety concerns or issues that reportedly was highly successful [3]. New hazards and concerns will be identified throughout the development process and into operations, and there must be a simple and non-onerous way for software engineers and operational personnel to raise concerns and safety issues and get questions answered at any time.

Conclusion: The incidence of system accidents is increasing as engineering designs rely more and more on software. But system accidents are exactly the type of accident that system safety was invented to handle 50 years ago, and it should be very effective against the system accidents stemming from misunderstood software requirements and the dysfunctional system interactions typical of software-related accidents. This approach does not require that system components exhibit ultra-high reliability, only that a set of specific behaviors does not occur. For software, this distinction is critical: It is much easier to design software to prevent particular behaviors than to guarantee that it will always do the "right" thing. In fact, the latter may be impossible. Classic techniques, such as failure analysis and redundancy, will be less important for software components than for electromechanical components.

I have noticed, however, that over the years system safety engineering has increasingly drifted toward using reliability engineering techniques and away from classic system safety approaches and has, in particular, adopted this approach for software. This trend may simply be a result of lack of knowledge about software, or it may reflect a lack of appropriate tools to assist in applying system safety approaches to software. A further influence may be that computer science has always been concerned with computer reliability and has focused almost exclusively on this quality. Only recently has safety become an issue. Therefore, almost all existing software engineering techniques focus on software reliability, i.e., assuring that the software correctly or reliably satisfies the specified requirements (which may be incomplete, incorrect, or unsafe).

We have created demonstration projects to show how classic system safety approaches can be applied to software-intensive systems (see, for example, [9] and [11]). In addition, the MIT Software Engineering Research Laboratory (SERL) is working to create new techniques and tools to support software system safety analysis and design.

References

1. Australia, Bureau of Air Safety Investigation, Department of Transport and Regional Development. Advanced Technology Aircraft Safety Survey Report, June 1996.

2. Johnson, W. G. MORT Safety Assurance Systems. New York: Marcel Dekker, Inc., 1980.

3. Joyce, Jeffrey. Conversation with the author, 2001.

4. Knight, J.C., and Nancy G. Leveson. "An Experimental Evaluation of the Assumption of Independence in Multi-Version Programming," IEEE Transactions on Software Engineering SE-12, no. 1 (January 1986): 96-109.

5. Learmount, D. "Flight Safety Foundation's European Aviation Safety Seminar," Flight International (March 20-26, 2001): 17.

6. Leveson, Nancy G. Safeware: System Safety and Computers. Boston: Addison-Wesley, 1995.

7. Leveson, Nancy G. "Intent Specifications: An Approach to Building Human-Centered Specifications," IEEE Transactions on Software Engineering SE-26, no. 1 (January 2000): 15-35.

8. Leveson, Nancy G. "Evaluating Accident Models Using Recent Aerospace Accidents: Part I. Event-Based Models," MIT Technical Report, 2001. http://sunnyday.mit.edu/accidents.

9. Leveson, Alfaro, Alvarado, Brown, Hunt, Jaffe, Joslyn, Pinnel, Reese, Samarziya, Sandys, Shaw, and Zabinsky. "Demonstration of a Safety Analysis on a Complex System," paper presented at the Software Engineering Laboratory Workshop, NASA Goddard, Maryland, USA, December 1997. (Full report available at http://sunnyday.mit.edu/papers.html.)

10. Leveson, Nancy G., Mats Heimdahl, Holly Hildreth, and Jon Damon Reese. "Requirements Specification for Process-Control Systems," IEEE Transactions on Software Engineering SE-20, no. 9 (September 1994): 684-707.

11. Leveson, Villepin, Daouk, Bellingham, Srinivasan, Neogi, Bachelder, Flynn, and Pilon. "A Safety and Human-Centered Approach to Developing New Air Traffic Management Tools." To appear in Proceedings of the ATM 2001 Conference, New Mexico, December 2001.

12. Mackall, Dale A. Development and Flight Test Experiences with a Flight-Critical Digital Control System. NASA Technical Paper 2857. Dryden Flight Research Facility, California, USA: National Aeronautics and Space Administration, November 1988.

13. Miller, C.O. "A Comparison of Military and Civil Approaches to Aviation System Safety," Hazard Prevention (May/June 1985): 29-34.

14. Sarter, N.B., D.D. Woods, and C.E. Billings. "Automation Surprises," in Handbook of Human Factors and Ergonomics, 2nd Edition, ed. G. Salvendy. New York: John Wiley & Sons, 1997.

Biography

Nancy G. Leveson, Ph.D., Professor, MIT, Aeronautics & Astronautics Dept., Room 33-315, 77 Massachusetts Ave., Cambridge, MA 02139, USA; telephone: (617) 258-0505; fax: (617) 253-7397; e-mail: [email protected].
