focus: persistent software attributes

A Unified Model of Dependability: Capturing Dependability in Context

Victor Basili, University of Maryland and Fraunhofer Center for Experimental Software Engineering; Paolo Donzelli and Sima Asgari, University of Maryland

In contemporary societies, individuals and organizations increasingly depend on services delivered by sophisticated software-intensive systems to achieve personal and business goals. So, a system must have engineered and guaranteed dependability, regardless of continuous, rapid, and unpredictable technological and context changes. The International Federation for Information Processing Working Group 10.4 (www.dependability.org) defines dependability as “the trustworthiness

of a system which allows reliance to be justifiably placed on the services it delivers.” “Reliance” is contextually subjective and depends on the particular stakeholders’ needs. Depending on the circumstances, different stakeholders will focus on different system attributes, such as availability, performance, real-time response, and the ability to avoid catastrophic failures and resist adverse conditions, as well as different levels of adherence to such attributes. Additionally, an attribute can mean different things to different people; you often see different definitions given for the same attributes.1–3 Put another way, dependability assumes a precise meaning only when applied to a specific context: the system and the stakeholders’ goals it must support. So, achieving and maintaining dependability in a quickly changing context requires you to firmly understand its meaning.

From this perspective, we introduce our Unified Model of Dependability, a framework for discussing and reasoning about dependability. By providing a structured framework for eliciting and organizing dependability needs, UMD helps stakeholders build dependability models that clearly identify the measurable, implementable properties individual systems need to be dependable for their users.

Dependability is a key systems property that must be guaranteed regardless of continuous, rapid, and unpredictable technological and context changes. Our Unified Model of Dependability lets you reason about dependability and turn it into clearly defined, implementable system properties.

0740-7459/04/$20.00 © 2004 IEEE. Published by the IEEE Computer Society. IEEE SOFTWARE 19

WHY READ THIS ARTICLE?

Imagine for a moment that you have a complex system with numerous properties that must all be maintained within certain limits during the system’s operation. When a specific negative event such as a denial-of-service attack occurs, what will its impact be on various properties such as system safety, availability, and other “-ilities”? For most systems, such questions can be surprisingly difficult to analyze due to the diverse ways in which a single event can affect different system properties. In this article, Victor Basili, Paolo Donzelli, and Sima Asgari describe their Unified Model of Dependability approach to capturing and analyzing the impacts of diverse events on critical system properties. Their UMD method demonstrates the recurring theme that you can make -ilities more enduring and survivable over time by capturing and supporting a wider range of context information about those properties. —Terry Bollinger, Jeffrey Voas, and Maarten Boasson, guest editors

Traditional dependability modeling usually develops in a top-down fashion. For example, let’s focus on a specific dependability attribute, performance, and assume we want to define what performance means for a specific service of a given system (for example, an online application’s query service). First, we need to define performance—for example, “performance is a static or dynamic system’s capability (response time, throughput) defined in terms of an acceptable range.”4 Then, on the basis of this definition, we can specify the desired behavior. So, our online application’s query service could have this performance requirement: “The response time must be no greater than 10 seconds.” In this way, we’ve obtained a model of the system’s performance starting from a generic attribute definition. This simple model defines what we expect from the system: When the query service responds in more than 10 seconds, we can say there’s a performance failure, or a lack of performance. However, the same failure could also represent a lack of other—perhaps even more relevant for the stakeholders—system properties. For example,

■ If the online application supports an emergency response operator, a delayed response could result in a dangerous situation, so the failure would also be a hazard, representing a lack of safety.
■ If the online application is an e-commerce system, a delayed response could discourage potential customers from buying (the service is unavailable), so this misbehavior represents a lack of availability.
■ The response delay could occur because of an internal fault but also because of an external event—for example, an unintentional external event, such as a hardware fault, or an intentional one, such as a denial-of-service attack. In this case, the performance failure becomes a lack of survivability (unintentional) or of security (intentional).

These simple examples show how you could interpret the same failure differently. Given this one-to-many relationship between failures and attributes, you can more efficiently and intuitively model dependability by assuming failure as a starting point (that is, think bottom-up and not only top-down). As Brian Randell points out, there’s a clear need to go beyond terminology (that is, attribute definitions) and focus on relevant concepts (such as failure).5 UMD builds on this need. We designed it to focus on a dependability issue (that is, an undependable behavior, such as a failure) to help stakeholders build dependability models of individual systems. UMD (see Figure 1) lets stakeholders model dependability by defining an actual issue that shouldn’t affect the system or a service (scope) along with a possible responsible external event that might cause it.

Figure 1. UMD concepts and their relationships, with an online application system example: an event (a denial-of-service attack) causes an issue (response time > 10 seconds) concerning a scope (the query service).

Supporting elicitation
We provide stakeholder guidance by incorporating into UMD the models, definitions, and classifications adopted in the literature to discuss these concepts. For example (see Figure 2), we could categorize issues into the (not exclusive) concepts of failures and hazards to help stakeholders identify potential system misbehaviors.
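The one-to-many mapping between a single failure and several dependability attributes can be sketched in a few lines of code. This is purely an illustrative sketch of the reasoning in the online-application example; the context labels, event names, and mapping rules below are our own assumptions, not part of UMD.

```python
# Illustrative sketch (not part of UMD): the same observed failure,
# "query-service response time > 10 seconds," implicates different
# dependability attributes depending on context and cause.

def interpret_failure(context, external_event=None):
    """Return the attributes implicated by the response-time failure.

    `context` and `external_event` take hypothetical labels chosen
    for this example only.
    """
    attributes = {"performance"}  # the failure itself is a lack of performance
    if context == "e-commerce":
        attributes.add("availability")   # customers see the service as unavailable
    elif context == "emergency-response":
        attributes.add("safety")         # the delay becomes a hazard
    if external_event == "denial-of-service-attack":
        attributes.add("security")       # intentional external event
    elif external_event == "hardware-fault":
        attributes.add("survivability")  # unintentional external event
    return attributes
```

For instance, `interpret_failure("e-commerce", "denial-of-service-attack")` yields performance, availability, and security concerns from the one observed failure.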

Figure 2. Stakeholder guidance for specifying issues: an issue can be characterized as a FAILURE (Type: accuracy, response time, and so on; Availability impact: stopping, nonstopping) or a HAZARD (Type: user hazard, environment hazard, and so on).

We can further refine these concepts by introducing subclassifications for both failures and hazards—for example, by classifying them by type or other characteristics (such as failure availability impact in Figure 2). Depending on the specific needs, you could adopt different standards and classifications—for example, MIL-STD-8826 for hazards and ISO/IEC 132367 for failures. Similarly, you could introduce ad hoc definitions and classifications for event and scope.

In this way, UMD guides stakeholders during the elicitation process. Specifically, stakeholders can use items already available or tailor, expand, and modify them according to their specific needs. For example, stakeholders can

■ Use the definitions of failure types already present
■ Use the same failure types but provide different definitions
■ Introduce more specific failure types—for example, response time rather than performance

UMD’s bottom-up nature lets stakeholders take advantage of existing dependability knowledge (models and classifications). So, the UMD framework consists of

■ Invariant concepts: These are stable for every UMD application (issue, scope, and event).
■ Semi-invariant concepts: These are the structure of the concepts’ characterization (for example, mapping issue to failure, hazard, or both).
■ Customizable concepts: These are the characterizations (for example, classes of failures, hazards, events, and so on). They depend on the specific context (project and stakeholders) and can be customized while applying UMD.

The somewhat arbitrary distinction between concepts represents the status of our dependability knowledge. We can hypothesize that as our knowledge increases and as classifications and models become recognized as standards, concepts will move from the framework’s lower to upper levels.

As Figure 3 shows, you can view UMD as an experience base that supports stakeholders in building a specific system’s dependability model. The knowledge embedded in the UMD experience base can be customized and provides guidance while eliciting specific context needs. The new knowledge acquired while building the system dependability model can be analyzed and packaged for reuse in UMD.

Figure 3. UMD as an experience base (of issues, failures, hazards, events, scope, and so on): the framework is customized to the specific system context to build a specific-system dependability model, and the new knowledge extracted is analyzed and packaged for reuse to enrich UMD.

Measuring dependability
UMD lets stakeholders specify the issues they don’t want to occur. However, this doesn’t suffice; we need an operational definition of dependability. For example, Jean-Claude Laprie says, “The extent to which a system possesses the attributes of dependability should be interpreted in a relative, probabilistic sense, due to the unavoidable occurrence of failures.”8 UMD provides measurement models via the invariant concept measure (see Figure 4).
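The invariant concepts (issue, scope, event) and the measure concept might be rendered as simple record types, as a reading aid for the structure just described. The field names, types, and example values below are our own illustrative choices, not definitions from UMD.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Scope:
    """What the issue concerns: the whole system or a specific service."""
    name: str

@dataclass
class Event:
    """A possible external cause, such as an attack or adverse condition."""
    description: str

@dataclass
class Measure:
    """The issue's tolerable manifestation (e.g., probability of occurrence)."""
    style: str   # "probabilistic" or "deterministic"
    scale: str   # "nominal", "ordinal", "interval", or "ratio"
    value: float
    unit: str

@dataclass
class Issue:
    """An undependable behavior (a failure, a hazard, or both)."""
    description: str
    scope: Scope
    measure: Measure
    event: Optional[Event] = None                  # cause, when one is identified
    reactions: list = field(default_factory=list)  # desired system reactions

# The running online-application example as one Issue record
# (the measure value here is an invented placeholder).
issue = Issue(
    description="response time > 10 seconds",
    scope=Scope("query service"),
    measure=Measure("probabilistic", "ratio", 1e-4, "probability per transaction"),
    event=Event("denial-of-service attack"),
    reactions=["warn the user about the delay"],
)
```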

November/December 2004 IEEE SOFTWARE 21

[Figure 4 panels. Issue, characterized as FAILURE (Type: accuracy, response time, and so on; Availability impact: stopping, nonstopping) or HAZARD (Type: user hazard, environment hazard, and so on). Scope, the issue’s concern: Type (whole system, service); operational profile description (distribution of transaction and workload volumes, and so on). Event, the issue’s cause: Type (adverse condition, attack, upgrades, and so on). Measure: probabilistic or deterministic; nominal, ordinal, interval, or ratio (for example, “very rarely”/“sometimes,” MTBF (mean time between failure), probability of occurrence, % of cases, maximum # of cases). Reaction, triggered by or manifest with an issue: impact mitigation (warning services, alternative services, mitigation services), recovery (recovery time, recovery actions), and occurrence reduction (guard services).]

Figure 4. UMD’s structure.

In this way, stakeholders can choose the measurement style most appropriate for describing an underlying issue’s acceptable manifestation levels. For example, Figure 4 shows the following measurement classes:

■ Ordinal and probabilistic—for example, an ordinal scale such as “very rarely,” “rarely,” and “sometimes”
■ Ratio and probabilistic, such as MTBF (mean time between failure) and probability of occurrence (in the next time unit or transaction)
■ Ratio and deterministic, such as number of occurrences (in a given time frame)

Improving dependability
UMD lets stakeholders provide ideas for improving dependability through the invariant concept of reaction. In this way, stakeholders can suggest reactive and proactive services the system should provide to become more dependable. The stakeholder adds reactive services, triggered by an issue, to be warned of the situation or to try to reduce an issue’s consequences. The stakeholder adds proactive services to the system to further reduce the probability of an issue’s occurrence, provide alternative ways of performing the same tasks, or allow a quicker recovery (for example, automatic data backup). We propose the following classification for reaction (see Figure 4):

■ Warning services warn users about what happened or is happening (for example, “if response time exceeds 10 seconds, warn the user about the delay”).
■ Alternative services help users perform their tasks regardless of the issue.
■ Mitigation services reduce the issue’s impact on users (for example, “if response time exceeds 10 seconds, suggest a better time to try again”).
■ Recovery behavior is the time required to recover from the issue (for example, expressed as MTTR, mean time to recover) and the kind of required intervention (for example, user or technician).
■ Occurrence reduction guards against the issue—that is, reduces the probability of occurrence (for example, preventing saturation or thrashing by rejecting incoming requests). Stakeholders can extend this idea to capture any suggestion they might have to prevent the issue from happening (such as modifying existing services, design changes, and so on).
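A reactive warning service from the classification above can be sketched as a wrapper around an existing service. The 10-second threshold comes from the running example; everything else (the names and the wrapper approach) is our own illustration, not a prescription from UMD.

```python
import time

RESPONSE_TIME_LIMIT = 10.0  # seconds, from the running query-service example

def with_warning(service, warn):
    """Wrap `service` so a warning reaction fires whenever the
    response-time issue manifests (a 'warning service' in UMD terms)."""
    def wrapped(*args, **kwargs):
        start = time.monotonic()
        result = service(*args, **kwargs)
        elapsed = time.monotonic() - start
        if elapsed > RESPONSE_TIME_LIMIT:
            # Impact mitigation: tell the user what is happening.
            warn(f"response took {elapsed:.1f} s; please expect delays")
        return result
    return wrapped
```

For example, `query = with_warning(run_query, print)` (where `run_query` is a hypothetical service function) leaves a fast query untouched and prints a warning only when the threshold is exceeded.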

The UMD Tool
We developed a Web-based tool that implements UMD, organized around two main frames:

■ The Scope frame (see Figure 5a) lets the stakeholder identify all system services for which dependability could be a concern. For the system and each identified service, a stakeholder must provide an identifier (name) and a brief description.
■ The Issue frame (see Figure 5b) lets users specify their dependability needs by selecting and defining potential issues, their tolerable manifestations and scope, possible triggering events, and desired system reactions.

Applying UMD
Within the NASA High Dependability Computing Program, we performed a feasibility analysis of the UMD concept. As part of a larger technology evaluation program, we aimed to develop a dependability model of the Tactical Separation Assisted Flight Environment testbed.9 TSAFE is a software system designed to aid air traffic controllers in detecting and resolving short-term conflicts between aircraft. We derived the adopted testbed from the TSAFE version that Gregory Dennis developed.10 TSAFE provides air traffic controllers with a graphical representation of the conditions (position, planned route, and forecasted synthesized route) and status (conformance or nonconformance to a planned route) of selected flights in a defined geographical area. It aims to detect conflicts between three and seven minutes in the future and issue avoidance maneuvers accordingly.

TSAFE already had available a set of functional requirements defining the system. However, it lacked precisely stated dependability requirements, although dependability is a main concern given the application domain (Dennis’s work focused on other aspects). So, we used UMD to produce a dependability model.

Figure 5. UMD tool tables: (a) Scope and (b) Issue (not related to an external event).

Data gathering
For the case study, a small group of computer science researchers and students acted as stakeholders (air traffic controllers) after a short introduction to TSAFE. This initial case study aimed to evaluate the suggested approach’s feasibility rather than to identify TSAFE’s correct dependability requirements. All stakeholders interacted with the UMD tool through an analyst. This helped to better evaluate the tool’s capabilities and represent real-life situations in which stakeholders might be unfamiliar with automatic tools. We applied UMD in two main steps:

■ Scope definition. By analyzing the already available functional requirements, all stakeholders, working together and supported by the analyst, selected the TSAFE main services that they believed to be relevant for dependability. The scope table shows the resulting four services (see Figure 5a).
■ Model building. Each stakeholder, supported by the analyst and guided by the structure the tool provided, filled as many tables as necessary to define her or his dependability needs (see Figure 5b).

While applying UMD, stakeholders used the available characterizations and, whenever necessary, extended them with their own definitions. Figure 6 provides some details about the failure characterization obtained for TSAFE (reconciled among the different stakeholders).

Figure 6. The failure characterization obtained for TSAFE:
FAILURE characterization
Type
• Accuracy. Data (flight position, trajectory, and so on) aren’t displayed with the required accuracy.
• Throughput. System or service fails to handle the desired number of flights.
• Response time. System or service fails to respond within the desired time.
• And so on...
Availability impact
• Stopping. System or service becomes unfit for use.
• Nonstopping. System or service is still usable.
Severity
• High severity. Major impact on the system’s utility for the operator.
• Low severity. Minor impact on the system’s utility for the operator.

To demonstrate the kind of data collected, we describe a table filled in by a single stakeholder—specifically, for an issue not related to an external event (see Figure 5b). The stakeholder signals a potential failure for the service “display flight synthesized route” when the response time is greater than 500 ms. This is a response-time, nonstopping, high-severity failure, given the high impact on the service’s utility. For the stakeholder, this failure is also a hazard, given that he thinks he could miss spotting a plane on a dangerous path. To be more confident in the system, the stakeholder then asks to introduce a warning service that will advise if the computational time becomes greater than 500 ms. This will alert the operator to the need for additional attention. Finally, the stakeholder sees this as a mission-critical failure (transformed by the analysts into an MTBF of 10E6) and asks for a technician to perform the recovery within one hour. If this failure condition lasts for more than an hour, the stakeholder feels he won’t be able to properly perform his duties because of the need to maintain a higher than usual level of attention.

Data analysis
Once the tables were completed, the analyst presented the stakeholder with the results to date. The analyst performed both graphical and computational analyses. The UMD tool incorporates the Visual Query Interface tool,11 which allows quick data navigation and assessment. Figure 7 shows the distribution of the identified issues around the main services, highlighting (with different colors and shapes) the corresponding characteristics. Additionally, the MS Access database the UMD tool produces supports different types of computations. In particular, the analyst could combine the measures expressing the tolerable manifestation for each of the identified issues to provide aggregated values of dependability. For instance, he computed the aggregated probability of occurrence of all the accuracy failures, all the accuracy failures that were also stopping failures, and accuracy failures for a specific service. Similarly, with the tolerable manifestation of the stopping failures defined and knowing the corresponding desired recovery time, he could compute the desired availability (again, for the whole system or a specific service).

Figure 7. Data graphical analysis. [Issue distribution around the main TSAFE services (Highlight Route Conformance Problem, Display Aircraft Position, Display Aircraft Planned Route, and Display Aircraft Projected Route), with issues characterized as accuracy, performance, or other.]

Based on the analysis, the stakeholder was able to make changes and the analyst pointed out possible risk areas. For example, the issue distribution (see Figure 7) showed that a service hadn’t been taken into account, so the analyst asked the stakeholder to confirm such a choice. Also, the computational analysis showed that some failures were dominant, representing a bottleneck for the whole system’s dependability; in some cases, the stakeholder wanted to revisit his choices. The stakeholder wasn’t satisfied with the system’s computed availability, so he revised some of the choices made while filling in the tables. This assessment led to further refinement of the tables’ entries and completion of new ones. The iteration ended when both the analyst and the stakeholder felt confident about the results.
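The availability computation sketched above can be reproduced with the standard steady-state formula A = MTBF / (MTBF + MTTR). The article doesn’t give the aggregation rule the analyst used; the code below assumes independent stopping-failure classes (so availabilities multiply), and the numbers are invented for illustration.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability for one class of stopping failures."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def combined_availability(failure_classes):
    """Assuming independent stopping-failure classes, the service is up
    only when no class has it down, so the availabilities multiply."""
    total = 1.0
    for mtbf, mttr in failure_classes:
        total *= availability(mtbf, mttr)
    return total

# Invented numbers: two stopping-failure classes for one service.
classes = [
    (1e6, 1.0),  # rare mission-critical failure: MTBF 10E6 hours, 1-hour recovery
    (5e3, 0.5),  # more frequent failure: MTBF 5,000 hours, 30-minute recovery
]
print(f"desired availability: {combined_availability(classes):.6f}")
```

The same per-issue measures could be aggregated for a single service or for the whole system, as the analyst did for TSAFE.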

Data reconciliation
At this point, the analyst had to reconcile the different stakeholders’ emerging needs. Example reconciliations included

■ When two or more stakeholders had filled in tables concerning the same service but identified different classes of failures
■ When stakeholders asked the system to behave in incompatible ways (for example, asking the system to simultaneously stop and provide an alternative service)
■ When combined stakeholder requests couldn’t be addressed with available resources

The case study results increased our confidence in UMD’s ability to act as a modeling language to discuss, gather, represent, and make measurable dependability needs. UMD provided valuable support in building a precise dependability model of TSAFE, making explicit and combining different stakeholders’ needs. Additionally, the approach’s flexibility also lets you model system properties that usually aren’t considered strictly related to dependability, such as usability.

Our future work will develop in two main directions. First, we want to perform further empirical assessments, in particular to understand how to better employ UMD. We might combine UMD with other tools and approaches, such as the Win-Win model,12 to support negotiation, and goal-based requirements engineering techniques,13 to support requirements elicitation. Second, we want to extend the support that UMD provides stakeholders by encompassing more experience-based capabilities. For example, once a stakeholder has identified a potential issue for a specific service, we’d like UMD to suggest the most appropriate measurement models as well as the most common system reactions and possible external events.

Acknowledgments
We acknowledge support from the NASA High Dependability Computing Program under cooperative agreement NCC-2-1298. We thank the researchers on the HDCP project for their insights and suggestions and Jennifer Dix for proofreading this article.

References
1. B. Boehm et al., The Nature of Dependability: A Stakeholder/Value Approach, tech. report, Computer Science Dept., Univ. of Southern California, 2003.
2. B. Bruegge and A.H. Dutoit, Object-Oriented Software Engineering, Prentice Hall, 2004.
3. I. Rus et al., “Empirical Evaluation Techniques and Methods Used for Achieving and Assessing High Dependability,” Proc. Workshop Dependability Benchmarking, 2002; www.ece.cmu.edu/~koopman/dsnwdb/wdb02_rus.pdf.
4. B. Melhart and S. White, “Issues in Defining, Analyzing, Refining, and Specifying System Dependability Requirements,” Proc. IEEE Conf. Eng. of Computer Based Systems, IEEE CS Press, 2000, pp. 334–340.
5. B. Randell, “Dependability—A Unifying Concept,” Proc. Workshop Computer Security, Dependability and Assurance: From Needs to Solutions, IEEE CS Press, 1998, pp. 16–25.
6. MIL-STD-882, System Safety Program Requirements, US Dept. of Defense, 1993.
7. ISO/IEC 13236, Information Technology: Quality of Service: Framework, ISO/IEC, 1998.
8. J.-C. Laprie, Dependability: Basic Concepts and Terminology, Dependable Computing and Fault Tolerance series, Springer-Verlag, 1992.
9. S. Asgari et al., “Empirical-Based Estimation of the Effect on Software Dependability of a Technique for Architecture Conformance Verification,” Proc. Int’l Conf. Software Eng. 2004 Workshop on Architecting Dependable Systems, LNCS 3069, Springer-Verlag, 2004.
10. G. Dennis, TSAFE: Building a Trusted Computing Base for Air Traffic Control Software, master’s thesis, Computer Science Dept., Massachusetts Inst. of Technology, 2003.
11. N. Jog and B. Shneiderman, “Starfield Information Visualization with Interactive Smooth Zooming,” Proc. IFIP 2.6 Visual Database Systems, Chapman & Hall, 1995, pp. 3–14.
12. B. Boehm et al., “Using the Win-Win Spiral Model: A Case Study,” Computer, July 1998, pp. 33–44.
13. L.K. Chung et al., Non-Functional Requirements in Software Engineering, Kluwer, 2000.

About the Authors

Victor R. Basili is a professor of computer science at the University of Maryland, College Park. He served as executive director of the Fraunhofer Center for Experimental Software Engineering-Maryland and was a founder and principal in the Software Engineering Laboratory at NASA Goddard Space Flight Center. He works on measuring, evaluating, and improving the software process and product. He received his PhD in computer science from the University of Texas. He is a fellow of the IEEE and the ACM. Contact him at Dept. of Computer Science and Inst. for Advanced Computer Studies, Univ. of Maryland, College Park, MD 20742; [email protected]; www.cs.umd.edu/~basili.

Paolo Donzelli is a visiting senior research scientist with the Computer Science Department at the University of Maryland, College Park. He’s on sabbatical from the Office of the Prime Minister in Italy, where he’s a director with the Department of Innovation and Technology. His research interests include software process improvement, requirements engineering, and dependability modeling and validation. He received a PhD in computer science from the University of Rome, Tor Vergata, Italy. Contact him at the Computer Science Dept., Univ. of Maryland, College Park, MD 20742; [email protected].

Sima Asgari is a research associate in the Computer Science Department at the University of Maryland, College Park. Her research interests include the combination of theoretical and empirical software engineering, human factors, and software dependability. She received her PhD in computer science from the Tokyo Institute of Technology. Contact her at the Computer Science Dept., Univ. of Maryland, College Park, MD 20742; [email protected].
