System Safety Engineering: Back to the Future
Total Page:16
File Type:pdf, Size:1020Kb
System Safety Engineering: Back To The Future Nancy G. Leveson Aeronautics and Astronautics Massachusetts Institute of Technology c Copyright by the author June 2002. All rights reserved. Copying without fee is permitted provided that the copies are not made or distributed for direct commercial advantage and provided that credit to the source is given. Abstracting with credit is permitted. i We pretend that technology, our technology, is something of a life force, a will, and a thrust of its own, on which we can blame all, with which we can explain all, and in the end by means of which we can excuse ourselves. — T. Cuyler Young ManinNature DEDICATION: To all the great engineers who taught me system safety engineering, particularly Grady Lee who believed in me, and to C.O. Miller who started us all down this path. Also to Jens Rasmussen, whose pioneering work in Europe on applying systems thinking to engineering for safety, in parallel with the system safety movement in the United States, started a revolution. ACKNOWLEDGEMENT: The research that resulted in this book was partially supported by research grants from the NSF ITR program, the NASA Ames Design For Safety (Engineering for Complex Systems) program, the NASA Human-Centered Computing, and the NASA Langley System Archi- tecture Program (Dave Eckhart). program. Preface I began my adventure in system safety after completing graduate studies in computer science and joining the faculty of a computer science department. In the first week at my new job, I received a call from Marion Moon, a system safety engineer at what was then Ground Systems Division of Hughes Aircraft Company. Apparently he had been passed between several faculty members, and I was his last hope. He told me about a new problem they were struggling with on a torpedo project, something he called software safety. I told him I didn’t know anything about it, that I worked in a completely unrelated field, but I was willing to look into it. That began what has been a twenty-two year search for a solution to his problem. It became clear rather quickly that the problem lay in system engineering. After attempting to solve it from within the computer science community, in 1998 I decided I could make more progress by moving to an aerospace engineering department, where the struggle with safety and complexity had been ongoing for a long time. I also joined what is called at MIT the Engineering Systems Division (ESD). The interactions with my colleagues in ESD encouraged me to consider engineering systems in the large, beyond simply the technical aspects of systems, and to examine the underlying foundations of the approaches we were taking. I wanted to determine if the difficulties we were having in system safety stemmed from fundamental inconsistencies between the techniques engineers were using and the new types of systems on which they were being used. I began by exploring ideas in systems theory and accident models. Accident models form the underlying foundation for both the engineering techniques used to prevent accidents and the techniques used to assess the risk associated with using the systems we build. They explain why accidents occur, that is, the mechanisms that drive the processes leading to unacceptable losses, and they determine the approaches we take to prevent them. Most of the accident models underlying engineering today stem from the days before computers, when engineers were building much simpler systems. Engineering techniques built on these models, such as Failure Modes and Effects Analysis (FMEA) and Fault Tree Analysis (FTA), have been around for over 40 years with few changes while, at the same time, engineering has been undergoing a technological revolution. New technology is making fundamental changes in the etiology of accidents, requiring changes in the explanatory mechanisms used to understand them and in the engineering techniques applied to prevent them. For twenty years I watched engineers in industry struggling to apply the old techniques to new software-intensive systems—expending much energy and having little success—and I decided to search for something new. This book describes the results of that search and the new model of accidents and approaches to system safety engineering that resulted. i ii My first step was to evaluate classic chain-of-events models using recent aerospace accidents. Because too many important systemic factors did not fit into a traditional framework, I experi- mented with adding hierarchical levels above the basic events. Although the hierarchies helped alleviate some of the limitations, they were still unsatisfactory [67]. I concluded that we needed ways to achieve a more complete and less subjective understanding of why particular accidents occurred and how to prevent future ones. The new models also had to account for the changes in the accident mechanisms we are starting to experience in software-intensive and other high-tech systems. I decided that these goals could be achieved by building on the systems approach that was being applied by Jens Rasmussen and his followers in the field of human–computer interaction. The ideas behind my new accident model are not new, only the way they are applied. They stem from basic concepts of systems theory, the theoretical underpinning of systems engineering as it developed after World War II. The approach to safety contained in this book is firmly rooted in systems engineering ideas and approaches. It also underscores and justifies the unique approach to engineering for safety, called System Safety, pioneered in the 1950s by aerospace engineers like C.O. Miller, Jerome Lederer, Willie Hammer, and many others to cope with the increased level of complexity in aerospace systems, particularly in military aircraft and intercontinental ballistic missile systems. Because my last book took seven years to write and I wanted to make changes and updates starting soon after publication, I have decided to take a different approach. Instead of waiting five years for this book to appear, I am going to take advantage of new technology to create a “living book.” The first chapters will be available for download from the web as soon as they are completed, and I will continue to update them as we learn more. Chapters on new techniques and approaches to engineering for safety based on the new accident model will be added as we formulate and evaluate them on real systems. Those who request it will be notified when updates are made. In order to make this approach to publication feasible, I will retain the copyright instead of assigning it to a publisher. For those who prefer or need a bound version, they will be available. My first book, Safeware, forms the basis for understanding much of what is contained in this present book and provides a broader overview of the general topics in system safety engineering. The reader who is new to system safety or has limited experience in practicing it on complex systems is encouraged to read Safeware before they try to understand the approaches in this book. To avoid redundancy, basic information in Safeware will in general not be repeated, and thus Safeware acts as a reference for the material here. To make this book coherent in itself, however, there is some repetition, particularly on topics for which my understanding has advanced since writing Safeware. Currently this book contains: • Background material on traditional accident models, their limitations, and the goals for a new model; • The fundamental ideas in system theory upon which the new model (as well as system engi- neering and system safety in particular) are based; • A description of the model; • An evaluation and demonstration of the model through its application to the analysis of some recent complex system accidents. Future chapters are planned to describe novel approaches to hazard analysis, accident prevention, risk assessment, and performance monitoring, but these ideas are not yet developed adequately to justify wide dissemination. Large and realistic examples are included so the reader can see how iii this approach to system safety can be applied to real systems. Example specifications and analyses that are too big to include will be available for viewing and download from the web. Because the majority of users of Safeware have been engineers working on real projects, class exercises and other teaching aids are not included. I will develop such teaching aids for my own classes, however, and they, as well as a future self-instruction guide for learning this approach to system safety, will be available as they are completed. Course materials will be available un- der the new MIT Open Courseware program to make such materials freely available on the web (http://ocw.mit.edu). iv Contents I Foundations 1 1 Why Do We Need a New Model? 3 2 Limitations ofTraditional Accident Models 7 2.1UnderstandingAccidentsusingEventChains...................... 7 2.1.1 SelectingEvents.................................. 9 2.1.2 SelectingConditions................................ 12 2.1.3 SelectingCountermeasures............................ 13 2.1.4 Assessing Risk ................................... 14 2.2InterpretingEventsandConditionsasCauses...................... 17 3 Extensions Needed to Traditional Models 25 3.1SocialandOrganizationalFactors............................. 25 3.2SystemAccidents...................................... 30 3.3HumanErrorandDecisionMaking............................ 32 3.4SoftwareError....................................... 36 3.5Adaptation........................................