Institute for Software Integrated Systems
Vanderbilt University
Nashville, Tennessee, 37212

A Case Study On The Application of Software Health Management Techniques

Nagabhushan Mahadevan, Abhishek Dubey, Gabor Karsai
Institute for Software Integrated Systems, Vanderbilt University, Nashville, TN 37212, USA

TECHNICAL REPORT ISIS-11-101
Jan, 2011

Abstract—Ever increasing complexity of software used in large-scale, safety critical cyber-physical systems makes it increasingly difficult to expose and thence correct all potential bugs. There is a need to augment the existing fault tolerance methodologies with new approaches that address latent software bugs exposed at runtime. This paper describes an approach that borrows and adapts traditional ‘Systems Health Management’ techniques to improve software dependability through simple formal specification of runtime monitoring, diagnosis and mitigation strategies. The two-level approach of Health Management at Component and System level is demonstrated on a simulated case study of an Air Data Inertial Reference Unit (ADIRU). That subsystem was categorized as the primary failure source for the in-flight upset of Malaysia Airlines Flight 124 over Perth, Australia in August 2005.

I. INTRODUCTION

Due to the increasing software complexity in modern cyber-physical systems there is a likelihood of latent software defects that can escape the existing rigorous testing and verification techniques but manifest only under exceptional circumstances. These circumstances may include faults in the hardware system, including both the computing and non-computing hardware. Often, systems are not prepared for such faults. Such problems have led to a number of failure incidents in the past, including but not limited to those referred to in these reports: [5], [26], [6], [7], [17].

State of the art for safety critical systems is to employ software fault tolerance techniques that rely on redundancy and voting [23], [35], [9]. However, it is clear that existing techniques do not provide adequate coverage for problems such as common-mode faults and latent design bugs triggered by other faults. Additional techniques are required to make the systems self-managing, i.e., they have to provide resilience to faults by adaptively mitigating faults and failures.

Self-adaptive systems, while in operation, must be able to adapt to latent faults in their implementation and in the computing and non-computing hardware, even if they appear simultaneously. Software Health Management (SHM) is a systematic extension of classical software fault tolerance techniques that aims at implementing the vision of self-adaptive software using techniques borrowed from system health management for complex engineering systems. System health management typically includes anomaly detection, fault source identification (diagnosis), fault effect mitigation (in operation), maintenance (offline), and fault prognostics (online or offline) [27], [20].

Our research group has been involved in developing tools and techniques, including a hard real-time component framework built over the platform services provided by ARINC-653 [1] compliant operating systems, for software health management (SHM) [14], [15]. The core principle behind our approach is the hypothesis that it is possible to deduce the behavioral dependencies and failure propagation across a component assembly, if the interactions between those components are restricted and well-defined. Here, components denote software units that encapsulate parts of a software system and implement a specific service or a set of services. Similar approaches exist in [12], [36]. The key difference between those and this work is that we apply an online diagnosis engine coupled with a two-level mitigation scheme.

In this paper, we provide a discussion of our work with respect to a case study that approximately emulates the working of the Boeing 777 Air Data Inertial Reference Unit (ADIRU). Our goal is to show how the SHM architecture can be used to detect, diagnose, and mitigate the effects of component-level failures such that the system-wide functionality is preserved. This work extends our previous work [13], [14], [15] to allow multi-module systems working on different physical computers. We also extended the detection functionality developed earlier for monitoring the correctness of data on all ports to enable observers that can also monitor the sequence of activities inside a component. Finally, we built the necessary infrastructure to close the loop from detecting an anomaly and diagnosing a component failure to issuing the necessary mitigation actions in real-time.

Paper Outline: Section III describes the incident and the ADIRU architecture. Section IV describes the main concepts of our component framework used to build the emulated system. Sections V-VI describe the implemented case study and explain our approach. Finally, Section VII presents a discussion of the experiment.

II. RELATED RESEARCH

The work described here fits in the general area of self-adaptive software systems, for which a research roadmap has been presented in [10]. Our approach focuses on latent faults in software systems, follows a component-based architecture with a model-based development process, and implements all steps in the Collect/Analyze/Decide/Act loop.

Rohr et al. advocate the use of architectural models for self-management [30]. They suggest the use of a runtime model to reflect the system state and provide reconfiguration functionality. From a development model they generate a causal graph over various possible states of its architectural entities. At the core of their approach, they use specs based on UML to specify constraints, monitoring and reconfiguration operations at the time of development.

Garlan et al. [16] and Dashofy et al. [11] have proposed an approach which bases system adaptation on architectural models representing the system as a composition of several components, their interconnection and properties of interest. Their work follows the theme of Rohr et al., where architectural models are used at runtime to track system state and make reconfiguration decisions using rule-based strategies.

While these works have addressed the structural part of the self-managing computing components, some have emphasized the need for behavioral modeling of the components. For example, Zhang et al. described an approach to specify the behavior of adaptable programs in [41]. Their approach is based on separating the adaptation behavior specification and non-adaptive behavior specification in autonomic computing software. They model the source and target models for the program using state charts and then specify an adaptation model, i.e., the model for the adaptation set connecting the source model to the target model using a variant of Linear Temporal Logic [40].

Williams’ research [29] concentrates on model-based autonomy. The paper suggests that emphasis should be on developing techniques to enable the software to recognize that it has failed and to recover from the failure. Their technique lies in the use of a Reactive Model-based programming language (RMPL) [38] for specifying both correct and faulty behavior of the software components. They also use high-level control programs [39] for guiding the system to the desirable behaviors.

Lately, the focus has started to shift to formalizing the software engineering concepts for self-management. In [22], Lightstone suggested that systems should be made “just sufficiently” self-managing and should not have any unnecessarily complicated function. Shaw proposes a practical process control approach for autonomic systems in [31]. The author maintains that several dependability models commonly used in autonomic computing are impractical as they require precise specifications that are hard to obtain. It is suggested that practical systems should use development models

[Figure 1: block diagram of the gyro, accelerometer and power supply fault containment areas (6 gyro FCM, 6 accelerometer FCM, 3 supply FCM), four processors, the left, center and right ARINC 629 FCAs (2 629 FCM each), mid value selection, three flight computers, and the Secondary Attitude Air Data Reference Unit (SAARU).]
Fig. 1. Outline of 777 ADIRU Architecture. Based on the diagram shown in [6], page 5. FCA = Fault Containment Area. FCM = Fault Containment Module.

in the fault masking software in the aircraft’s primary Air Data Inertial Reference Unit (ADIRU). An ADIRU supplies air data (airspeed, angle of attack and altitude) and inertial reference (position and attitude) information to the pilots’ Electronic Flight Instrument System displays as well as other systems on the aircraft such as the engines, autopilot, flight control and landing gear systems. An ADIRU acts as a single, fault tolerant source of navigational data for both pilots of an aircraft (Source: http://en.wikipedia.org/wiki/Air_Data_Inertial_Reference_Unit). To understand the scenario we need to briefly summarize the ADIRU architecture.

Boeing 777 ADIRU Architecture: The primary design principle in Boeing 777’s ADIRU Architecture [25], [32] is multiple levels of redundancy, see Fig. 1. There are two ADIRU units: primary and secondary. The primary ADIRU is divided into 4 Fault Containment Areas (FCA), with each