Safety III: a Systems Approach to Safety and Resilience

MIT ENGINEERING SYSTEMS LAB

Prof. Nancy Leveson
Aeronautics and Astronautics Dept., MIT
7/1/2020

Abstract: Recently, there has been a lot of interest in some ideas proposed by Prof. Erik Hollnagel and labeled as “Safety-II” and argued to be the basis for achieving system resilience. He contrasts Safety-II to what he describes as Safety-I, which he claims to be what engineers do now to prevent accidents. What he describes as Safety-I, however, has very little or no resemblance to what is done today or to what has been done in safety engineering for at least 70 years. This paper describes the history of safety engineering, provides a description of safety engineering as actually practiced in different industries, shows the flaws and inaccuracies in Prof. Hollnagel’s arguments and the flaws in the Safety-II concept, and suggests that a systems approach (Safety-III) is a way forward for the future.

Contents

Preface
Does Safety-I Exist?
Differences between Workplace Safety and Product/System Safety
Workplace and Product/System Safety History
A Brief Legal View of the History of Safety
A Technical View of the History of Safety
An Engineer’s View of Workplace Safety
An Engineer’s View of Product/System Safety
Activities Common among Different Industries
Commercial Aviation
Nuclear Power
Chemical Industry
Defense and “System Safety”
SUBSAFE: The U.S. Nuclear Submarine Program
Astronautics and Space
Healthcare/Hospital Safety
Summary
A Comparison of Safety-I, Safety-II and Safety-III
Definition of Safety
“Goes Wrong” vs. “Goes Right”
Safety is a Different Property than Reliability
What is a System?
Sociotechnical Systems
Decomposition and Emergence
“Bimodality”
Predictability
“Intractability”
Safety Management “Principle”
Investigation/Reporting Databases
Learning from Failure in Engineering
Accident Causality and Causality Models
Causality in General
Models of Accident Causality
The Linear Chain-of-Failure Events Model
Domino Model
Swiss Cheese Model
Hollnagel’s Resonance Model and FRAM
Limitations of the Linear Chain-of-Events Model in General
Epidemiological Models
System Theory and STAMP
A Brief Introduction to Systems Theory
The STAMP Model of Accident Causality
Attitude Toward Human Factors
Role of Performance Variability
Summary
The Future
References
Appendix: System Theory vs. Complexity Theory

Figures

Fig. 1: Safety depends on context
Fig. 2: The terminology used in engineering
Fig. 3: (Hollnagel Figure 3.2 on Page 50): “Hypothesis of different causes”
Fig. 4: Causality in System Engineering
Fig. 5: Operators learn from crossing the boundaries of safe behavior
Fig. 6: (Hollnagel Figure 73 on Page 137): “The Safety-II view of failures and successes”
Fig. 7: Chain of events model for a tank explosion
Fig. 8: Tank explosion example shown with added protections
Fig. 9: Heinrich’s Domino Model of accident causation
Fig. 10: Reason’s Swiss Cheese Model
Fig. 11: Two examples of a FRAM specification of the steps in a process
Fig. 12: The FRAM process and “model”
Fig. 13: General process for creating safety-related analyses
Fig. 14: Analytic decomposition
Fig. 15: Emergent properties arise from complex interactions
Fig. 16: Control of emergent properties
Fig. 17: An example of a safety control structure
Fig. 18: Four types of causality included in Systems Theory
Fig. 19: Three types of causal loop structures
Fig. 20: Some of the factors in the Space Shuttle Columbia accident
Fig. 21: The basic building block for a safety control structure
Fig. 22: A representation of the STAMP model of accident causality

Preface

Recently, there has been a lot of interest in some ideas proposed by Prof. Erik Hollnagel and labeled as “Safety-II” and argued to be the basis for achieving system resilience. He contrasts Safety-II to what he describes as Safety-I, which he claims to be what engineers do now to prevent accidents. What he describes, however, has very little or no resemblance to what is done today or to what has been done in safety engineering for at least 70 years.

First, should you take my word for this? I have worked in safety engineering for 40 years. Here’s a little of my relevant background. I have degrees in mathematics, management, and computer science and did graduate work in cognitive and behavioral psychology. I have written two books on system safety (Safeware [Leveson, 1995] and Engineering a Safer World [Leveson, 2012]) and hundreds of papers on the topic. My efforts have been rewarded with many awards, most recently an IEEE Medal for Environmental and Safety Technologies. I am an elected member of the National Academy of Engineering. I also am fascinated by engineering history and have read much about how engineers handled safety for the past hundred or so years.

In practice, I have worked in almost all aspects of aerospace and defense and, to a lesser extent, nuclear power, petrochemicals, patient safety and medical devices, most forms of transportation (particularly aircraft and automobiles), etc. I have also participated in writing some major accident reports (Deepwater Horizon, the Columbia Space Shuttle, and Texas City) and many less well-known ones. Finally, in the past few years, I have been encouraged to look into workplace safety because people felt that the engineering approaches that I have created might be useful there.

I provide this background because I don’t recognize Prof.
Hollnagel’s definition of Safety-I in my 40 years of experience in safety engineering. It is just not what is done in practice except, perhaps, in a very few organizations with the least sophisticated safety approaches. His analysis also confuses the almost totally different fields of product/system safety and workplace safety.

Prof. Hollnagel tears apart his strawman Safety-I and recommends an alternative, which he calls Safety-II. In my experience, again, Safety-II is a giant step backward, particularly if it takes resources and attention away from more successful approaches. It contains the types of practices used in the past, mostly very long ago but also more recently in industries that have many accidents and that usually blame them all on the human operators. These practices have led to many unnecessary deaths and injuries. The Safety-II approach was rejected long ago in sophisticated engineering projects because it is not effective. Goals such as resilience, flexibility, and adaptability are important, but they are much more likely to be achieved using approaches other than Safety-II. These properties must be built into the system as a whole; they are not a function simply of the behavior of human operators, which seems to be the almost total emphasis in Safety-II. There certainly are a few aspects of Safety-II that might be useful in limited ways, but following the overall approach, I believe, is likely to lead to unnecessary accidents.

In this paper, I explain these very strong statements and note that Prof. Hollnagel and his followers seem unaware of the successful use of a systems approach to safety, which is called “System Safety”¹ by its practitioners. They may not know about it; it was developed and used primarily in the United States. System Safety was created for and has been used over the past 70 years in aerospace and defense to cope with the most dangerous systems being created. In my work and writings, I have extended this very successful practice to handle the evolution (and sometimes revolutionary change) of engineering practices over time. These changes include greatly increasing complexity, the extensive and growing use of computers and other forms of new technology, and a changing role of humans in complex systems. In this paper, I call this general approach Safety-III to put it into the Hollnagel context. It is not new, however: the general practices have been around for a very long time, but primarily used in the most sophisticated and sometimes secretive engineering contexts. It can provide a template for advances in all industries, including product/system safety and workplace safety, going forward. Changes and advances will be needed to keep it relevant for engineering in the future, of course, as our technology and society change.

¹ The term “system safety” has been adopted recently as a general term for safety engineering by people not familiar with the special field of System Safety developed long ago. I will differentiate them here by using capital letters to denote the specialized field of System Safety.

One of the difficulties in critiquing someone’s approach is determining exactly what that approach is. Our views evolve over time as more is learned, and we all change them in small or even major ways with more experience. In addition, many people write papers about someone else’s concept and interpret it differently than the original author, introducing their own slant and representing their own experiences. I’ve seen this in papers by the proponents of Safety-II other than Prof. Hollnagel, particularly in healthcare safety. To try to stay as close as possible to the original conceptions of Safety-I and Safety-II, as defined by Prof. Hollnagel, in this paper I use only the writings of Prof. Hollnagel himself, principally his two books Safety-I and Safety-II: The Past and Future of Safety Management [Hollnagel, 2014] and Safety-II in Practice [Hollnagel, 2018].
