Digital Common-Cause Failure Susceptibility 2014 Project Status

3002002990 Technical Update, November 2014

EPRI Project Manager R. Torok

All or a portion of the requirements of the EPRI Nuclear Quality Assurance Program apply to this product.

ELECTRIC POWER RESEARCH INSTITUTE
3420 Hillview Avenue, Palo Alto, California 94304-1338 • PO Box 10412, Palo Alto, California 94303-0813 • USA
800.313.3774 • 650.855.2121 • [email protected] • www.epri.com

DISCLAIMER OF WARRANTIES AND LIMITATION OF LIABILITIES

THIS DOCUMENT WAS PREPARED BY THE ORGANIZATION(S) NAMED BELOW AS AN ACCOUNT OF WORK SPONSORED OR COSPONSORED BY THE ELECTRIC POWER RESEARCH INSTITUTE, INC. (EPRI). NEITHER EPRI, ANY MEMBER OF EPRI, ANY COSPONSOR, THE ORGANIZATION(S) BELOW, NOR ANY PERSON ACTING ON BEHALF OF ANY OF THEM: (A) MAKES ANY WARRANTY OR REPRESENTATION WHATSOEVER, EXPRESS OR IMPLIED, (I) WITH RESPECT TO THE USE OF ANY INFORMATION, APPARATUS, METHOD, PROCESS, OR SIMILAR ITEM DISCLOSED IN THIS DOCUMENT, INCLUDING MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, OR (II) THAT SUCH USE DOES NOT INFRINGE ON OR INTERFERE WITH PRIVATELY OWNED RIGHTS, INCLUDING ANY PARTY'S INTELLECTUAL PROPERTY, OR (III) THAT THIS DOCUMENT IS SUITABLE TO ANY PARTICULAR USER'S CIRCUMSTANCE; OR (B) ASSUMES RESPONSIBILITY FOR ANY DAMAGES OR OTHER LIABILITY WHATSOEVER (INCLUDING ANY CONSEQUENTIAL DAMAGES, EVEN IF EPRI OR ANY EPRI REPRESENTATIVE HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES) RESULTING FROM YOUR SELECTION OR USE OF THIS DOCUMENT OR ANY INFORMATION, APPARATUS, METHOD, PROCESS, OR SIMILAR ITEM DISCLOSED IN THIS DOCUMENT. REFERENCE HEREIN TO ANY SPECIFIC COMMERCIAL PRODUCT, PROCESS, OR SERVICE BY ITS TRADE NAME, TRADEMARK, MANUFACTURER, OR OTHERWISE, DOES NOT NECESSARILY CONSTITUTE OR IMPLY ITS ENDORSEMENT, RECOMMENDATION, OR FAVORING BY EPRI.

THE FOLLOWING ORGANIZATIONS PREPARED THIS REPORT:
Electric Power Research Institute (EPRI)
Southern Engineering Services, Inc.
Applied Reliability Engineering, Inc.

THE TECHNICAL CONTENTS OF THIS PRODUCT WERE NOT PREPARED IN ACCORDANCE WITH THE EPRI QUALITY PROGRAM MANUAL THAT FULFILLS THE REQUIREMENTS OF 10 CFR 50, APPENDIX B. THIS PRODUCT IS NOT SUBJECT TO THE REQUIREMENTS OF 10 CFR PART 21.

This is an EPRI Technical Update report. A Technical Update report is intended as an informal report of continuing research, a meeting, or a topical study. It is not a final EPRI technical report.

NOTE For further information about EPRI, call the EPRI Customer Assistance Center at 800.313.3774 or e-mail [email protected]. Electric Power Research Institute, EPRI, and TOGETHER…SHAPING THE FUTURE OF ELECTRICITY are registered service marks of the Electric Power Research Institute, Inc. Copyright © 2014 Electric Power Research Institute, Inc. All rights reserved.

ACKNOWLEDGMENTS

The following organizations prepared this report:

Electric Power Research Institute (EPRI)
Principal Investigator R. Torok

Southern Engineering Services, Inc.
331 Allendale Drive
Canton, Georgia 30115
Principal Investigator B. Geddes

Applied Reliability Engineering, Inc.
1478 27th Avenue
San Francisco, CA 94122
Principal Investigator D. Blanchard

This report describes research sponsored by EPRI.

This publication is a corporate document that should be cited in the literature in the following manner: Digital Common-Cause Failure Susceptibility: 2014 Project Status. EPRI, Palo Alto, CA: 2014. 3002002990.


ABSTRACT

Owner/operators of nuclear plants need to be able to systematically identify and assess susceptibilities to potential digital system failures and unintended behaviors that can lead to failures in plant systems, including common-cause failures (CCF). These vulnerabilities are managed using a combination of preventive and mitigative measures, but there is no consensus approach for crediting such measures to develop assurance that there is adequate protection against the consequences of potential failures and unintended behaviors involving digital systems. This concern applies to both safety and non-safety systems. While methods for ensuring high dependability in safety systems are generally well developed, non-safety systems typically use different design and quality assurance approaches, and corresponding consensus guidance and acceptance criteria for them are not well established. This technical update provides a status report on an ongoing EPRI project that is developing practical guidance to help utility engineers, equipment suppliers, and system integrators assess and manage a full range of digital instrumentation and control (I&C) failure and CCF susceptibilities. It describes the important technical issues and provides a preliminary description of the overall approach being developed by the project team, including key underlying concepts and principles that form the technical basis of the approach. The final product will be published in 2015.

Keywords
Common-cause failure (CCF)
Coping analysis
Diversity
Digital
Instrumentation and control systems


CONTENTS

1 INTRODUCTION
2 TECHNICAL ISSUES
   Common-Cause Failure
   Software CCF and Design Defects
   Protecting Against the Effects of CCF
3 TECHNICAL APPROACH
   Key Terms
   Protection = Preventive and/or Mitigative Measures
   Defects and Triggers
   Level of Interest
   Diversity
4 ASSESSING AND MANAGING DIGITAL FAILURE SUSCEPTIBILITIES
   Holistic Approach
   Safety versus Non-Safety
      Safety Systems
      Non-Safety Systems
   Considerations for Assessment
5 PROJECT STATUS AND DELIVERABLES
   Guideline Description
   Schedule
6 REFERENCES


LIST OF FIGURES

Figure 2-1 Analog Plant
Figure 2-2 Analog Plant with a Digital Feedwater Control (DFW) Upgrade
Figure 2-3 Digital Feedwater Control and Digital Electrohydraulic (DEH) Turbine Control
Figure 3-1 Protection = Preventive and/or Mitigative Measures
Figure 3-2 Conditions Required for Digital Failures and Digital CCFs
Figure 3-3 Plant Architecture Hierarchy Illustrating "Level of Interest" Issue


1 INTRODUCTION

Operating nuclear plants are gradually transitioning much of their aging and obsolete instrumentation and control (I&C) equipment from analog to digital technology. However, the introduction of software-based I&C equipment in critical safety and control system applications in nuclear power plants raises the concern that latent software and other 'digital' faults could create new failure modes and unintended behaviors that might defeat traditional protective features. New nuclear plants will be 'all digital'; they will use more extensive and highly integrated I&C systems than existing plants, with correspondingly greater potential for new failure modes and unanticipated behaviors. Owner/operators of nuclear plants need to be able to systematically identify and assess susceptibilities to potential digital system failures and unintended behaviors that can lead to failures in plant systems, including common-cause failures (CCF). These vulnerabilities are managed using a combination of preventive and mitigative measures, but there is no consensus approach for crediting such measures to develop assurance that there is adequate protection against the consequences of potential failures and unintended behaviors involving digital systems. This concern applies to both safety and non-safety systems. While methods for ensuring high dependability in safety systems are generally well developed, non-safety systems typically use different design and quality assurance approaches, and corresponding consensus guidance and acceptance criteria for them are not well established. This technical update provides a status report on an ongoing EPRI project that is developing practical guidance to help utility engineers, equipment suppliers and system integrators assess and manage a full range of digital I&C failure and CCF susceptibilities.
It describes the technical issues and the overall approach being developed by the project team, including key underlying concepts and principles that form the technical basis of the approach.


2 TECHNICAL ISSUES

Common-Cause Failure

The primary technical concern driving this effort is the potential for common-cause failures (CCFs) introduced by the use of digital systems. Safety systems achieve very high reliability in part by using independent and redundant hardware trains to ensure that they can perform their critical functions even in the presence of hardware failures. However, digital I&C introduces the potential for CCFs due to errors made during development that can render the redundant hardware architecture ineffective and disable multiple equipment trains or systems that use identical software elements or shared resources. I&C for generation-critical non-safety systems is also migrating toward redundant, fault-tolerant digital architectures to improve reliability, and here again, errors made during development could lead to common-cause failures that defeat such architectures. Digital systems can also be subject to CCF causes that affect analog systems, such as environmental disturbances and single failures that affect multiple components and cascade into CCFs.

In the analog plant depicted in Figure 2-1, the safety systems use redundant, independent, hardware-based controllers so the systems can tolerate single hardware failures. The non-safety systems have a single controller for each controlled element, so the effects of a controller failure are limited.

It is important to understand that CCFs can occur in several architectural contexts, including:

• Redundant divisions of identical equipment/software – For example, redundant trains of a safety system that use identical hardware and software and have identical functionality.

• Combining functions in a single controller – In replacing obsolete analog control systems, it is easy and potentially advantageous to combine multiple control functions in a single controller. However, a failure of the controller for any reason (software defect, hardware failure, environmental effects, etc.) can then affect multiple functions. Figure 2-2 shows an example in which a single controller controls both feedwater regulating valves (FRV) and both feedwater pumps (FWP), so a single controller misbehavior could cause simultaneous overfeed of both steam generators (S/G), leading to a reactor trip.

• Concurrent spurious actuations of multiple control functions that create unanalyzed events – Consider the case where multiple non-safety systems misbehave simultaneously because they share the same flawed software.

• Multiple redundant divisions, plant systems or controllers that share resources (e.g., data networks, power supplies, timing signals, etc.) – A single failure or unintended behavior could affect multiple components or systems, generating multiple failures or multiple spurious actuations, both forms of CCF. An example is shown in Figure 2-3, where digital feedwater and turbine control systems are connected to the same communications network, introducing the possibility of failures or spurious actions that affect multiple components in both systems.


Figure 2-1 Analog Plant

Figure 2-2 Analog Plant with a Digital Feedwater Control (DFW) Upgrade


Figure 2-3 Digital Feedwater Control and Digital Electrohydraulic (DEH) Turbine Control

Software CCF and Design Defects

The likelihood of hardware common-cause failure in nuclear safety systems is generally considered acceptably low due to factors such as the high quality standards applied in its development and manufacture, physical separation of redundant equipment, and the recognition that degradation mechanisms that could result in common failures (e.g., corrosion or premature wear-out) are slow to develop and would be detected in maintenance and surveillance activities before they could disable a critical function [10]. For the purposes of this discussion, analog I&C is treated the same as other hardware components.

Digital systems are also developed using high quality standards and extensive testing, but software does not wear out or fail in the sense that hardware does. Software 'failures' result from design defects (i.e., faults or bugs) in the code that cause unintended or undesired behaviors under specific conditions, usually conditions unanticipated by the designer [1, 4]. Effectively, the software functions exactly as it was designed to, but it may not function as the designer had intended. Such 'failures' are caused by designed-in errors that can originate in several ways. Examples are incorrect requirements, misunderstandings of requirements, and coding mistakes that have been designed into the code and overlooked during verification and validation (V&V). There is no significant wear-out mechanism in software. Hence, the notion that failures are likely to be detected during maintenance and testing before multiple redundancies have been affected does not generally apply. Even though with the right development methods, languages and tools it is possible to rigorously justify the absence of certain classes of defects, in most cases software cannot be proven to be completely defect free. Therefore, defects are possible, and some software design defects can become sources of CCF.


Protecting Against the Effects of CCF

From a practical engineering perspective, the issue of latent design defects and the possibility that they could lead to CCFs leads to consideration of ways to protect against CCFs, either by preventing them or by mitigating their effects. Preventing CCFs involves either eliminating design defects or limiting the defects' ability to cause failures and CCFs. Such preventive measures effectively reduce the likelihood of CCF. Mitigation methods provide another form of protection by reducing the undesired effects, ideally enabling the plant to cope with a CCF, should it occur. Mitigation and coping are discussed further in Section 3.

Various design practices, development methods and coping strategies can be effective in preventing CCFs and mitigating their effects [1], which leads to the question of how much protection is sufficient. At what point is the likelihood of CCF low enough that additional protection is not needed? There are no accepted consensus methods for quantitatively evaluating the probability of digital CCF. Still, as described in later sections, there is much that can be done to reduce the likelihood of digital CCF through preventive measures, and to reduce the effects of CCF through mitigative measures, which can be demonstrated through a coping analysis. These protective measures can provide the basis for a qualitative assessment of CCF likelihood, adequacy of mitigative measures (coping), and adequacy of overall protection against CCF.


3 TECHNICAL APPROACH

In developing the technical approach, the project team is drawing from industry standards and practices, lessons learned, and related EPRI products published over the last several years. The focus is on the effects of CCF at the plant level and demonstrating adequate protection against them in the form of an appropriate combination of preventive and mitigative measures and in light of other considerations, like risk significance and complexity. This approach takes advantage of evolutionary gains in the industry's ability to ensure highly dependable performance in digital systems through the use of defensive design measures and effective development processes. The approach also allows for the use of coping analysis to investigate the effects of potential failures and misbehaviors in cases where a demonstration of coping capability might be an efficient way to provide assurance of adequate mitigation. Ultimately, the approach relies on the use of engineering judgment to consider all the available evidence and decide if the failure risks have been managed adequately. This is in contrast to earlier approaches that start by assuming a CCF occurs and then use a coping analysis to show adequate mitigation capability, without considering the preventive measures that may be in place [10]. The following subsections address key concepts used in developing the technical approach.

Key Terms

A few key terms used in regard to CCF need clarification as they are used in this report, because they have proven somewhat confusing and ambiguous in practice:

• Failure: Inability of a structure, system or component to function within acceptance criteria [11]. Note: A failure is the result of a hardware, software or system defect, or a human error, and the associated signal trajectory which triggers the failure.

• Software Failure: A failure (as defined above) that results when a software fault or defect is triggered.

• Single Failure: A failure which results in the loss of capability of a system or component to perform its intended function(s), and any consequential failure(s) which results from it.

• Common-Cause Failure: Failure of two or more structures, systems or components due to a single specific event or cause [11]. Note: This is a broad definition that includes multiple failures of identical components as well as failures of multiple like or unlike components that start with a failure of a single component. That is, a single failure, random or systematic, of hardware or software, can lead to a CCF.

• Diversity: Existence of two or more different ways or means of achieving a specified objective [1].


• Defense-in-depth and diversity analysis: Sometimes this term is used to refer to assessing susceptibility to CCF; other times it means the analyst should assume a CCF occurs and then demonstrate adequate mitigation capability. For the purposes of this project, defense-in-depth analysis is considered to have two parts:

o Susceptibility analysis – refers to identifying potential vulnerabilities and the measures in place to prevent them. Susceptibility analysis may also include qualitatively assessing the likelihood of failure/CCF.

o Coping analysis – an analysis that shows whether the mitigative measures are adequate to avoid the undesirable effects of a CCF, ignoring the preventive measures that are in place to reduce the likelihood of the CCF.

Protection = Preventive and/or Mitigative Measures

Regardless of context, protection against unacceptable consequences consists of both prevention and mitigation of undesired events. Figure 3-1 compares protective measures for loss-of-coolant accidents to those for digital failure or CCF. For the pipes, prevention consists of measures that make pipe breaks highly unlikely – use of high quality materials and welding processes, inspections and so on. Mitigation consists of isolation valves, emergency core cooling systems and other systems that might come into play after a pipe breaks. Digital systems are analogous. Preventive measures include good programming practices and design features that detect and/or avoid problems. Mitigative measures include backup systems, operator action and automatic actions that can put the system in a safe state should a failure or CCF occur.

Figure 3-1 Protection = Preventive and/or Mitigative Measures


Defects and Triggers

In a digital system a design defect (fault or 'bug') can only generate a failure or misbehavior (e.g., spurious actuation) if it sees a triggering condition that activates the defect (Figure 3-2). In principle, design defects can occur in both hardware and software. Triggering conditions for software defects are almost always unanticipated, untested conditions that the designer either did not consider or thought to be highly unlikely. Because software does not wear out and its behavior is deterministic, digital systems are highly reliable under anticipated, tested conditions. Triggering conditions can be of various types, for example:

• A single random hardware failure – A random hardware failure in one component triggers a latent defect in one or more additional components. This type of trigger usually involves wear-out or aging mechanisms. For example, suppose there is a sensor failure that the software in a controller is supposed to detect, but fails to detect due to a software defect. This type of trigger can become a CCF if the first failure is not detected and corrected before the next one occurs. It can also initiate a CCF if the single failure affects multiple components or systems.

• An unanticipated, but real plant system condition – A digital system design typically reflects the understanding of the people who specified the requirements and designed the system. It is possible for an inaccurate or incomplete understanding of the system behavior to give rise to situations in which there are real behaviors that the digital system is not equipped to handle. An example is an autopilot for an airplane that assumes altitude will always be a positive number. (There are places where it is possible to fly below sea level.)

• An unanticipated, but real system behavior that was known to be possible, but was deemed so unlikely that it was not addressed in the requirements.

• An anticipated, real system behavior that triggers faults introduced by an operational or maintenance error, such as mis-calibrated sensors or an incorrect valve alignment.

• Environmental disturbance – This will affect all equipment subject to the environment. Examples are flooding, high or low temperature, humidity, seismic events, etc. Note that environmental disturbances can affect like or diverse equipment. For example, it might be possible for a seismic event to simultaneously dislodge tin whiskers generated by lead-free solder in multiple components and multiple systems, each with different results.

Note that not all digital faults/failures are likely to generate CCFs. Consider the case where there are multiple redundancies with identical equipment and functionality, and the trigger is a single random hardware failure. It would take multiple concurrent hardware failures (which is not likely if they are random) to generate a CCF. Similarly, not all digital failures/CCFs are risk-significant. In assessing failure/CCF susceptibility it is helpful to focus on failures that are critical for either safety or generation. For these reasons, defect-free software is not really necessary. Rather, the target should be a software design that is highly unlikely to cause critical failures or CCFs.

The notion of faults and triggers is important, because a lot is known about ways to eliminate and avoid faults and triggers through good development processes and good design, and thereby
significantly reduce the likelihood of failure/CCF. Although we cannot generate a precise failure probability for digital systems, we can generate a qualitative assessment of the failure likelihood by assessing the measures in place to reduce the likelihood of defects and triggers. Digital equipment designers have been working to perfect these techniques for many years; it is not by luck that high quality digital systems are robust and highly dependable. As part of this project the EPRI team is consulting with experienced designers of digital I&C systems to capture their expert insights on effective methods for managing potential defects and triggers.
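The defect-plus-trigger mechanism can be illustrated with a small sketch. This example is not from the report; it adapts the autopilot altitude illustration above into hypothetical code, with invented function names and thresholds.

```python
# Illustrative sketch (not from the report): a latent design defect that is
# harmless under anticipated conditions but activates on an unanticipated
# trigger. The function and threshold values are hypothetical.

def altitude_alarm(altitude_m: float) -> str:
    """Classify altitude. Latent defect: the designer assumed altitude is
    always positive, so a negative reading (flying below sea level) falls
    through to the wrong branch."""
    if altitude_m > 300:
        return "ok"
    if altitude_m > 0:
        return "low-altitude warning"
    # Unanticipated condition: negative altitude was never specified or
    # tested, so it is misreported as a sensor fault and the warning is lost.
    return "sensor fault"

# Anticipated, tested conditions: the software behaves exactly as designed.
assert altitude_alarm(1000.0) == "ok"
assert altitude_alarm(100.0) == "low-altitude warning"

# Trigger: a real but unanticipated condition activates the defect. Every
# identical redundant unit running this code misbehaves the same way at the
# same time -- the software CCF mechanism described above.
assert altitude_alarm(-50.0) == "sensor fault"  # should warn, but does not
```

The point of the sketch is that the code is deterministic and "correct" with respect to its flawed requirements; only the unanticipated signal trajectory exposes the defect, and it exposes it identically in every redundant copy.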

Figure 3-2 Conditions Required for Digital Failures and Digital CCFs
(Diagram: a defect plus a trigger yields a digital failure; multiple functions affected concurrently yields a digital CCF within a system; multiple functions in multiple systems affected concurrently yields a digital CCF across systems.)

Level of Interest

It is not always necessary to prevent or mitigate potential failures at the level of the digital components or the software that resides within them. Failures and CCFs can be managed (prevented or mitigated) at various levels within the plant architecture. Figure 3-3 shows a hierarchical framework of plant systems and components that illustrates this point [2]. Typically, software faults are near the bottom of the overall architecture, while the effects of interest are near the top. In principle, a potential failure or CCF vulnerability can be prevented or mitigated anywhere between the initiator and the effect of interest. For example, consider a digital control system for a fuel handling crane. A concern about the possibility that an undiscovered software error could cause the crane to move a fuel bundle outside the safe range of operation could be addressed using software features that track the crane position and compare it to acceptable values, but it might be simpler and more cost-effective to use hard mechanical stops that limit the range of the crane.
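The fuel handling crane example can be sketched in code. This is a purely illustrative model, not from the report: the function names and the 5000 mm limit are invented, and the mechanical stop is modeled as a simple clamp outside the digital system.

```python
# Hypothetical sketch: managing a potential digital failure at a higher
# "level of interest" than the software that contains the defect.
SAFE_MAX_MM = 5000  # invented safe travel limit for the crane

def software_limited_move(requested_mm: int, software_limit_ok: bool = True) -> int:
    """Preventive measure inside the digital system: a software range check,
    which could itself contain a latent defect (modeled by the flag)."""
    if software_limit_ok and requested_mm > SAFE_MAX_MM:
        return SAFE_MAX_MM
    return requested_mm

def hard_stop(position_mm: int) -> int:
    """Mitigative measure outside the digital system: a mechanical stop
    bounds crane travel no matter what the software commands."""
    return min(position_mm, SAFE_MAX_MM)

# Suppose a hypothetical defect defeats the software check...
commanded = software_limited_move(6200, software_limit_ok=False)
assert commanded == 6200              # the digital failure propagates...
assert hard_stop(commanded) == 5000   # ...but is bounded at the plant level
```

The design point mirrors the text: protection credited at the mechanical-stop level does not depend on the correctness of the software below it.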


Figure 3-3 Plant Architecture Hierarchy Illustrating “Level of Interest” Issue

Diversity

Diversity of various types can provide effective protection against CCF. Diversity in redundant I&C divisions may help prevent CCFs, and a diverse backup system can be used to mitigate a failure or CCF after it occurs. Various types of diversity can be employed [3], and some are undoubtedly more useful than others. In an earlier EPRI study of nuclear plant operating experience, functional and signal diversity were shown to be effective in limiting CCFs in actual plant events [4]. Diversity, however, is not the only means of protection against digital CCF, and in some situations is not the most appropriate. It can add complexity in training, maintenance and in resolving conflicts when diverse redundancies disagree with each other. Simply adding diversity may not add significant protection against CCF. It is of limited value in protecting against requirements errors that can lead to CCF, especially in redundancies with identical functionality. Diverse backup systems can increase the risk of spurious actuation, sometimes with an overall negative impact on plant safety [5, 6].


4 ASSESSING AND MANAGING DIGITAL FAILURE SUSCEPTIBILITIES

Holistic Approach

The anticipated approach for assessing protection against software failure/CCF will consider available evidence that might contribute to a case for reasonable assurance of adequate protection. A susceptibility assessment would include consideration of factors that affect the likelihood of both latent software defects and the triggering conditions that can activate them. The overall assessment will consider available mitigation mechanisms, as well as the importance of the system from safety and operational perspectives. Potential contributors to an assurance argument include:

• Traditional hardware practices (quality assurance, qualification testing, etc.) – The guideline will not address these in detail, as they are covered in other industry documents and standards.

• Rigorous software development practices (e.g., standards, coding practices) – It is important to keep in mind that while good development practices will likely ensure complete and comprehensive documentation, they do very little to ensure that the final design will be adequate and that the corresponding product will be robust and dependable.

• High quality, well vetted requirements – Requirements specifications that are incorrect, incomplete, unclear, ambiguous, etc. can result in software and system defects that are difficult to detect through normal verification and validation processes, including testing.

• Defensive design measures in software, hardware, architecture, procedures, operation, etc. – Operating experience in U.S. and Korean plants [4, 7] suggests that equipment suppliers are doing quite well in this area, as other sources of problems dominated the operating experience.

• Mitigation and coping capability – While a coping analysis does not add protection against failure/CCF, it may demonstrate that there is adequate mitigation in the event of the failure (including spurious actuations). Further, regardless of the likelihood of the failure/CCF, it may be more cost-effective to base the case for assurance of adequate protection on a coping analysis, rather than a detailed investigation of the measures in place to prevent the failure.

• Extensive test coverage – Extensive testing is an important contributor to the assurance argument. However, 100% test coverage may not be possible for most digital systems, and testing alone may not constitute reasonable assurance of adequate protection. If testing is done against a requirements specification, it is unlikely to reveal designed-in defects that originated in flawed requirements.

• Performance records – Operating history that is successful, sufficient, relevant to the intended application and adequately documented can be a significant contributor to an assurance argument, particularly for commercial grade equipment, which may benefit from
extensive operating experience in other industries and an evolved design that eliminates or avoids previously identified defects and triggers.

• Risk and safety analysis insights – What constitutes reasonable assurance of adequate protection may be a function of the risk significance of the digital component, the plant system and the potential failure effects, as these factors differ significantly among applications.

• Simplicity of digital platform and application – While there is no consensus on the best way to quantitatively measure complexity, there is broad agreement that simplicity helps ensure high dependability, and it is expected that a qualitative assessment of platform and functional simplicity may be practical.

Ultimately, the approach relies on the use of engineering judgment to consider the available evidence and decide if the failure risks have been managed adequately.
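One way to picture how such evidence might be collected to support (not replace) engineering judgment is a simple qualitative checklist. This sketch is purely illustrative and is not EPRI methodology: the factor names and the screening rule are hypothetical.

```python
# Purely illustrative sketch (not EPRI methodology): recording the kinds of
# evidence listed above as a qualitative checklist. Factor names and the
# screening rule below are hypothetical.

evidence = {
    "rigorous_development_practices": True,
    "well_vetted_requirements": True,
    "defensive_design_measures": True,
    "extensive_test_coverage": True,
    "coping_analysis_demonstrated": False,
    "relevant_operating_history": False,
}

preventive_factors = [
    "rigorous_development_practices",
    "well_vetted_requirements",
    "defensive_design_measures",
    "extensive_test_coverage",
]

# Hypothetical screening rule: strong preventive evidence, or demonstrated
# coping capability, flags the case as a candidate for a reasonable-assurance
# argument -- still subject to engineering judgment, never automatic.
strong_prevention = all(evidence[k] for k in preventive_factors)
candidate = strong_prevention or evidence["coping_analysis_demonstrated"]
assert candidate
```

The checklist only structures the evidence; as the text stresses, the adequacy decision itself remains a matter of engineering judgment weighing risk significance and complexity.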

Safety versus Non-Safety

The primary contributors to assurance of adequate protection will likely be different for safety and non-safety equipment for several reasons.

Safety Systems

Safety equipment has a nuclear quality assurance pedigree, including qualification testing, with corresponding documentation. Software development practices and documentation will be in accordance with relevant nuclear safety standards and guidance. The system functionality will tend to be relatively simple, permitting high test coverage and a lower likelihood of requirements and design errors, but the operating experience may be limited, especially for systems developed specifically for nuclear safety applications. Safety equipment will also benefit from design requirements for separation and independence, which help decrease the likelihood of CCF. For safety systems, coping analysis is typically done using best-estimate assumptions, ostensibly because the nuclear safety pedigree is adequate assurance that the likelihood of a CCF is quite low.

Non-Safety Systems

In contrast, non-safety equipment is likely to be commercial grade, for which development practices, qualification testing and documentation vary widely. Operating experience may prove far more useful, provided it is sufficient, successful, relevant to the intended application and adequately documented. Mature commercial systems are likely to have defensive design measures that have been shown to be effective in avoiding and eliminating potential triggers, because of efforts to address problems and increase dependability as the product evolved. Non-safety systems are often of lesser risk significance, so for them, reasonable assurance and adequate protection may be achievable at a more modest level than for safety equipment. For non-safety systems, coping analysis using best-estimate methods may be justifiable on the basis of strong defensive design measures and lesser risk significance.


Considerations for Assessment Several other considerations may come into play in qualitative assessment of the adequacy of preventive and mitigative measures for protecting against undesired consequences of failures and CCFs, including: • If a defensive design measure that avoids a particular type of failure has been demonstrated to work as intended, then that failure is considered highly unlikely. Effectively, it has been precluded by design. • Provisions to ensure credited defensive measures are maintained – Defensive measures that are important in assuring adequate protection against failure/CCF must be maintained throughout the life of the system. This becomes a non-issue if no changes are made, but if, for example, periodic software updates are made, there should be a mechanism in place to ensure that the important defensive measures are not eliminated or degraded. • Credit protective (preventive and mitigative) measures both inside and outside the digital system – Prevention and mitigation need not occur at the level of the potential defect that could generate a failure or CCF. Consider the example of the fuel handling crane, in which mechanical stops on the crane are used to limit the undesirable effects of a potential software defect. • Risk-benefit and cost-benefit of additional protection (“reasonably practicable”) – In considering additional preventive or mitigative measures, risk-benefit and cost-benefit assessments should be made. Adding features and backups necessarily increases cost and complexity, along with increasing the potential for additional defects and spurious actuation. • Process-based development standards – Process-based standards are often limited in that they emphasize activities and recommended documentation, but not design features or functionality that helps ensure high dependability. As a result, they may not ensure a well- designed end product. 
• Standards that require defensive design features – Conformance to standards that effectively require defensive design measures, by requiring functionality that protects against common failure mechanisms, may be helpful in building an adequate protection case.

• CCF protection tailored based on application and platform complexity – It may be possible to justify use of more limited defensive measures for applications of lesser platform and functional complexity.

• Tools that reduce the likelihood of software defects (e.g., static analyzers, automated design tools) – Tools used to reduce the likelihood of software defects may be useful contributors to the argument for adequate protection. Even if they do not have a nuclear safety pedigree and cannot be shown to catch every conceivable error, they will reduce the likelihood of defects.

Failure and hazard analysis techniques may be useful in identifying potential CCF vulnerabilities, including combinations of spurious actions of multiple components, and corresponding preventive design measures.
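As a purely illustrative sketch (not taken from any standard or from the guidance described in this report), the kind of defensive functionality referred to above can be as simple as input plausibility checking combined with voting across redundant sensor channels, so that a single implausible reading cannot propagate to a control action and a total loss of valid inputs falls back to a predefined fail-safe value. The function name, limits, and fallback value below are hypothetical.

```python
def validated_median(readings, lo, hi, fallback):
    """Defensive input handling sketch: range-check redundant channel
    readings, then vote by taking the median of the survivors.

    Readings outside [lo, hi] are discarded as implausible. If no
    reading survives validation, a predefined fail-safe value is
    returned rather than acting on suspect data.
    """
    valid = sorted(r for r in readings if lo <= r <= hi)
    if not valid:
        return fallback          # fail-safe default, chosen by design
    return valid[len(valid) // 2]  # median (upper median for even counts)

# One channel reads an implausible 9999.0; it is voted out and the
# median of the remaining in-range readings is used.
print(validated_median([498.0, 501.0, 9999.0], lo=0.0, hi=1000.0, fallback=0.0))
```

A measure like this does not by itself rule out CCF (a shared defect in the validation logic would affect all channels), but it is the type of design feature, verifiable by test, that a defensive-measures argument can credit.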


5 PROJECT STATUS AND DELIVERABLES

Guideline Description

The project is developing practical guidance to help utility engineers, equipment suppliers, and system integrators assess and manage the full range of digital I&C failure and CCF susceptibilities. The target audience includes:

• Utility I&C design engineers
• Safety analysis engineers
• Licensing engineers
• PRA analysts

A utility Technical Advisory Group (TAG) is providing input, oversight, and review of draft materials to help ensure that the end product will be used and useful. It is anticipated that the new guidance will be most useful when applied during the design process for digital implementations and modifications to:

• Identify and address vulnerabilities
• Develop design criteria to avoid CCFs
• Qualitatively assess the likelihood of digital failure and CCF early in design
• Assess mitigation strategies and coping capability, should a failure/CCF occur

In addition to a step-by-step approach for assessing and managing CCF susceptibilities, the guidance will include lists of recommended defensive design measures for reducing the likelihood of failures/CCF for selected types of equipment, and detailed examples to illustrate the principles and application of the approach. After the guideline has been published, training and industry workshops will be planned to help communicate and demonstrate the guideline approach. It is anticipated that the guidance will be helpful to U.S. utilities performing 10 CFR 50.59 evaluations and to the Nuclear Energy Institute (NEI) focus team working with their counterparts at the U.S. Nuclear Regulatory Commission (NRC) to resolve current concerns on licensing digital implementations and the use of the NEI 01-01 guideline on licensing digital upgrades [8].

Schedule

The guideline is currently planned for publication in mid-2015.


6 REFERENCES

1. Protecting Against Digital Common-Cause Failure: Combining Defensive Measures and Diversity Attributes. EPRI, Palo Alto, CA: 2010. 1019182.
2. Hazard Analysis Methods for Digital Instrumentation and Control Systems. EPRI, Palo Alto, CA: 2013. 3002000509.
3. Diversity Strategies for Nuclear Power Plants and Instrumentation and Control Systems. U.S. Nuclear Regulatory Commission, NUREG/CR-7007 (ORNL/TM-2009/302).
4. Operating Experience Insights on Common-Cause Failures in Digital Instrumentation and Control Systems. EPRI, Palo Alto, CA: 2008. 1016731.
5. Effects of Digital Instrumentation and Control Defense-in-Depth and Diversity on Risk in Nuclear Power Plants. EPRI, Palo Alto, CA: 2009. 1019183.
6. Letter, Subject: Diversity, J.E. Wilkins, Jr., Chairman, ACRS, to I. Selin, Chairman, U.S. Nuclear Regulatory Commission, February 16, 1994.
7. Digital Operating Experience in the Republic of Korea. EPRI, Palo Alto, CA: 2011. 1022986.
8. Guideline on Licensing Digital Upgrades: A Revision to EPRI TR-102348 to Reflect Changes to the 10 CFR 50.59 Rule. EPRI, Palo Alto, CA: 2002. 1002833.
9. Estimating Failure Rates in Highly Reliable Digital Systems. EPRI, Palo Alto, CA: 2010. 1021077.
10. Staff Requirements Memorandum, "SECY-93-087 – Policy, Technical, and Licensing Issues Pertaining to Evolutionary and Advanced Light-Water Reactor (ALWR) Designs," July 21, 1993.
11. Protecting against Common Cause Failures in Digital I&C Systems of Nuclear Power Plants. International Atomic Energy Agency, Vienna: 2009. IAEA Nuclear Energy Series No. NP-T-1.5, STI/PUB/1410, ISBN 978-92-0-106309-0.


Export Control Restrictions

Access to and use of EPRI Intellectual Property is granted with the specific understanding and requirement that responsibility for ensuring full compliance with all applicable U.S. and foreign export laws and regulations is being undertaken by you and your company. This includes an obligation to ensure that any individual receiving access hereunder who is not a U.S. citizen or permanent U.S. resident is permitted access under applicable U.S. and foreign export laws and regulations. In the event you are uncertain whether you or your company may lawfully obtain access to this EPRI Intellectual Property, you acknowledge that it is your obligation to consult with your company's legal counsel to determine whether this access is lawful. Although EPRI may make available on a case-by-case basis an informal assessment of the applicable U.S. export classification for specific EPRI Intellectual Property, you and your company acknowledge that this assessment is solely for informational purposes and not for reliance purposes. You and your company acknowledge that it is still the obligation of you and your company to make your own assessment of the applicable U.S. export classification and ensure compliance accordingly. You and your company understand and acknowledge your obligations to make a prompt report to EPRI and the appropriate authorities regarding any access to or use of EPRI Intellectual Property hereunder that may be in violation of applicable U.S. or foreign export laws or regulations.

The Electric Power Research Institute, Inc. (EPRI, www.epri.com) conducts research and development relating to the generation, delivery and use of electricity for the benefit of the public. An independent, nonprofit organization, EPRI brings together its scientists and engineers as well as experts from academia and industry to help address challenges in electricity, including reliability, efficiency, affordability, health, safety and the environment. EPRI also provides technology, policy and economic analyses to drive long-range research and development planning, and supports research in emerging technologies. EPRI's members represent approximately 90 percent of the electricity generated and delivered in the United States, and international participation extends to more than 30 countries. EPRI's principal offices and laboratories are located in Palo Alto, Calif.; Charlotte, N.C.; Knoxville, Tenn.; and Lenox, Mass.

Together…Shaping the Future of Electricity

© 2014 Electric Power Research Institute (EPRI), Inc. All rights reserved. Electric Power Research Institute, EPRI, and TOGETHER…SHAPING THE FUTURE OF ELECTRICITY are registered service marks of the Electric Power Research Institute, Inc. 3002002990

Electric Power Research Institute 3420 Hillview Avenue, Palo Alto, California 94304-1338 • PO Box 10412, Palo Alto, California 94303-0813 • USA 800.313.3774 • 650.855.2121 • [email protected] • www.epri.com