How Does Software Fail and What Should be Done About It?

DCIT; Nov. 2015

Prof. Kishor Trivedi

Duke High Availability Assurance Lab (DHAAL)
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708-0291
E-mail: [email protected]
URL: www.ee.duke.edu/~ktrivedi

Copyright © 2015 by K.S. Trivedi

Duke University

[Figure: map of North Carolina, USA, showing the Research Triangle Park (RTP) with Duke, UNC-CH and NC State]

Duke High Availability Assurance Laboratory (DHAAL)

Kishor Trivedi Dept. of Electrical & Computer Engineering Duke University Email: [email protected] URL: www.ee.duke.edu/~ktrivedi

Internationally connected with groups in USA, Germany, Italy, Japan, China, Brazil, New Zealand and Spain

DHAAL

• Dhaal means “Shield” in Hindi/Gujarati
• DHAAL research is about shielding systems from various threats

Books Update

• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1982; second edition, John Wiley, 2001 (Blue book). A Chinese translation has just appeared.
• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer, 1996 (Red book)
• Queueing Networks and Markov Chains, John Wiley, 1998; second edition, 2006 (White book)
• Reliability and Availability Engineering, Cambridge University Press, 2016 (Green book)

Outline

• Motivation
• Real System Examples
• Software Fault Classification
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions

Pervasive Dependence on Computer Systems: Need for High Reliability/Availability

• Communication
• Health & Medicine
• Avionics
• Banking
• Entertainment

Basic Definitions

• Steady-state availability ($A_{ss}$), or just availability
  • Long-term probability that the system is available when requested:

    $$A_{ss} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$

  • MTTF is the system mean time to failure, a complex combination of component MTTFs
  • MTTR is the system mean time to recovery
    • May consist of many phases

Basic Definitions

• Downtime in minutes per year
  • In industry, (un)availability is usually presented in terms of annual downtime.
  • Downtime = 8760 × 60 × (1 − $A_{ss}$) minutes.
  • In industry it is common to define availability in terms of the number of nines
    • 5 NINES ($A_{ss}$ = 0.99999) → 5.26 minutes annual downtime
    • 4 NINES ($A_{ss}$ = 0.9999) → 52.56 minutes annual downtime
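The availability and downtime formulas can be checked numerically. The following is a minimal sketch in Python; the function names and the example MTTF/MTTR values are illustrative assumptions and do not come from the slides.

```python
# Sketch: steady-state availability and annual downtime.
# Function names and sample values are illustrative assumptions.

MINUTES_PER_YEAR = 8760 * 60  # 8760 hours in a non-leap year


def steady_state_availability(mttf, mttr):
    """A_ss = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)


def annual_downtime_minutes(availability):
    """Expected downtime per year, in minutes, for a given A_ss."""
    return MINUTES_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    for nines in (3, 4, 5):
        a_ss = 1.0 - 10.0 ** (-nines)
        print(f"{nines} NINES: {annual_downtime_minutes(a_ss):.2f} minutes/year")
    # Hypothetical system: fails every 1000 hours on average, 1 hour to recover.
    print(f"A_ss = {steady_state_availability(1000.0, 1.0):.5f}")
```

Each additional nine divides the annual downtime budget by ten, which is why moving from four to five nines is so demanding in practice.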

Number of Nines: Reality Check

• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week
  • Approx. 80 hours/year = 4800 minutes/year
  • $A_{ss}$ = (8760 − 80)/8760 ≈ 0.9909
  • That is, between 2 NINES and 3 NINES!
• This study counts planned and unplanned downtime together

Some real examples from high-tech companies

• Jan. 2014: Gmail was down for 25 to 50 minutes
• Oct. 2013: services such as posting photos and “likes” were unavailable
• Feb. 2013: Windows Azure was down for 12 hours
• Jan. 2013: AWS was down for approximately one hour
• Sept. 2012: GoDaddy was down for 4 hours; 5 million websites were affected

More examples of failures

• Oct. 2012: Amazon Web Services outage (6 hours)
• Amazon EC2 outage (2 hours)
• Sept. 2011: Google Docs service outage (1 hour), caused by a memory leak due to a software update
• Sept. 2011: Microsoft Cloud service outage (2.5 hours), in the same week as the Google Docs outage

• These examples indicate that even the most advanced tech companies are offering less than five NINES of availability
  • And that is considering only one failure!

Software is the problem

Jim Gray’s paper titled “Why do computers stop and what can be done about it?” pointed out this trend in 1985, followed by his paper “A census of Tandem system availability between 1985 and 1990”.

[Figure: charts of outage causes for 1985 and 2005, across different industries]

High Reliability/Availability: Software is the problem

• Hardware fault tolerance, fault management and reliability/availability modeling are relatively well developed

• System outages are more often due to software faults

Key Challenge: Software reliability is one of the weakest links in system reliability/availability

Increasing SW Failure Rate?

Planetary Missions Flight Software (A. Nikora, JPL)

• The interval between the first and last launch: 8.76 years.
• The interval between successive launches ranges from 23 to 790 days.

[Figure: chart by mission name (in launch order): Mars Global Surveyor, Mars Pathfinder, Cassini, Mars Climate Orbiter, Mars Polar Lander, Stardust, Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance Orbiter]

TAKE AWAY MESSAGES

• Today's complex systems (including CPS and IoT) are mostly software with some hardware thrown in. Software failures are a major cause of system undependability.
• The focus so far has been on software faults; we need to pay attention to failures caused by software and to the recovery from these failures. Put another way: the focus so far has been on software reliability; we need to pay attention to software availability as well.
• Software failures during operation are a fact that we need to learn to deal with. The traditional method of software fault tolerance, based on design diversity, is expensive and hence is not used extensively. Software fault tolerance based on inexpensive environmental diversity should be exploited.

Software Reliability: Known Means

• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance

Software Reliability

• Fault prevention or fault avoidance
  • Good software engineering practices
    • Requirement elicitation (Abuse Case Analysis – TCS SSA)
    • Design analysis/review
    • Secure programming standard and review
    • Secure programming compilation
    • Software development lifecycle
    • Automated code generation tools (IDEs such as Eclipse)
  • Use of formal methods
    • UML, SysML, BPM
    • Proof of correctness
    • Model checking (SMART, SPIN, PRISM)
• Bug-free code is not yet possible for large-scale software systems
  • It is impossible to fully test and verify that software is fault-free
    “Testing shows the presence, not the absence, of bugs” - E. W. Dijkstra
• Yet there is a strong need for failure-free system operation

Software Reliability

• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance

Software Reliability

• Fault removal
  • Can be carried out during
    • the specification and design phase
    • the development phase
    • the operational phase
  • Failure data may be collected and used to parameterize a software reliability growth model (SRGM) to predict when to stop testing
• Software is still delivered with many bugs, because the testing budget is inadequate, because bugs are very difficult to reproduce, detect, localize or correct, or because the techniques employed or known are inadequate

Software Reliability

• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance

High Reliability/Availability: Software is the problem

Software fault tolerance is a potential way to improve software reliability, given that fault-free software is virtually impossible

Software Fault Tolerance: Classical Techniques

Design diversity
  • N-version programming
  • Recovery blocks
  • …

Expensive → not used much in practice!

Yet there are stringent requirements for failure-free operation

Challenge: Affordable Software Fault Tolerance



Real Systems

Examples of real systems and their high-availability implementations

High availability SIP Application Server Configuration on IBM WebSphere

[Figure: test drivers (including one on an IBM PC) send SIP traffic through an IP sprayer / IBM Load Balancer to two blade chassis. Each chassis has four blades hosting a SIP proxy (SIP Proxy 1) and application servers AS 1 through AS 6; the application servers are grouped into Replication Domains 1 to 6. AS1 thru AS6 are application servers; the Proxy1's are stateless proxy servers. More details in the PRDC 2008 and ISSRE 2010 papers.]

High availability SIP Application Server configuration on WebSphere

• Hardware configuration:
  • Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis sufficient for performance)
• Software configuration:
  • 2 copies of SIP/Proxy servers (1 sufficient for performance)
  • 12 copies of WAS (6 sufficient for performance)
  • Each WAS instance forms a redundancy pair (replication domain) with a WAS instance installed on another node on a different chassis

The system has hardware redundancy and software redundancy.

High availability SIP Application Server configuration on WebSphere

• Software fault tolerance
  • Identical copies of the SIP proxy used as backups (hot spares)
  • Identical copies of WebSphere Application Server (WAS) used as backups (hot spares)
• Type of software redundancy: replication of identical software copies (not design diversity)
• Normal recovery after a software failure
  • Restart the software, reboot the node or fail over to a software replica; only when all else fails is a “software repair” invoked

Software Fault Tolerance: New Thinking

1. Have been known to help in dealing with hardware transients
2. Do they help in dealing with failures caused by software bugs?
3. If yes, why?

Software Fault Tolerance: New Thinking

Failover to an identical software replica (that is not a diverse version)

• Does it help? If yes, why?
• Twenty years ago this would be considered crazy!

Software Fault Classification

Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007

Software Faults: main threats to high reliability, availability & safety

IFIP Working Group 10.4 (Laprie)

• Failure occurs when the delivered service no longer complies with the desired output.
• Error is that part of the system state which is liable to lead to subsequent failure.
• Fault (or bug) is the adjudged or hypothesized cause of an error.

Faults are the cause of errors that may lead to failures:

Fault → Error → Failure

Need to Classify Bug Types

• We submit that a software fault tolerance approach based on retry, restart, reboot or fail-over to an identical software replica (not a diverse version) works because a significant number of software failures are caused by Mandelbugs, as opposed to the traditional software bugs now called Bohrbugs.

Need to Classify Bug Types

• For over two decades, researchers have reported the phenomenon of “software aging”,
  • i.e., degraded performance and/or increased failure rate of long-running software systems.
• Puzzle: how can performance and failure rate change if the software code is not modified?
• Study software fault types and their relationships

A Classification of Software Faults

• Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity.

Example: A bug causing a failure whenever the user enters a negative date of birth

• Since they are easily found, Bohrbugs may be detected and fixed during the software testing phase.
• The term alludes to the physicist Niels Bohr and his rather simple atomic model.

Mandelbug: Definition

• Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by it are not systematically reproducible.

Example: A bug whose activation is scheduling-dependent

• The residual faults in a thoroughly tested piece of software are mainly Mandelbugs.
• The term alludes to the mathematician Benoît Mandelbrot and his research in fractal geometry.
• Sometimes called concurrency, non-deterministic or soft bugs; failures resulting from these bugs are called transient failures.

Mandelbugs: Complexity Factors

• A fault is a Mandelbug if its manifestation is subject to the following complexity factors:
  • Long time lag between fault activation and failure appearance
  • Operating environment (OS, other applications running concurrently, hardware, network…)
  • Timing among submitted operations
  • Sequencing or ordering of operations
• A failure due to a Mandelbug may not show up upon resubmission of a workload if the operating environment has changed enough

Examples of Types of Bugs in IT Systems

• Mandelbugs in IT systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011.
• The selected TCS projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sectors.

Mandelbug Reproducibility

• Mandelbugs are really hard to reproduce
• A set of experiments was conducted to study the environmental factors (i.e., disk usage, memory occupation and concurrency) that affect the reproducibility of Mandelbugs
• High usage of environmental factors significantly increases the reproducibility of Mandelbugs
• ISSRE 2014 paper: D. Cavezza, R. Pietrantuono, K. Trivedi, et al.

Aging-related Bug: Definition

• Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance.

Example: A bug causing memory leaks in the application

• Note that the aging phenomenon requires a delay between (first) fault activation and failure occurrence.
• Note also that the software appears to age due to such a bug; there is no physical deterioration.

Relationships of the Bug Types

• Bohrbug and Mandelbug are complementary antonyms. Aging-related bugs are a subtype of Mandelbugs.

[Figure: diagram of software faults divided into Bohrbugs and Mandelbugs, with aging-related bugs shown as a subset of Mandelbugs]

Important Questions about these Bugs

• What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs?
• How do these fractions vary
  • over time?
  • over projects, languages, application types, …?
• Need for real data

Software Faults in IT Systems

• Fault types in JPL/NASA flight software (M. Grottke, A. P. Nikora, K. S. Trivedi. “An empirical investigation of fault types in space mission system software”, DSN 2010):
  • 61.4% Bohrbugs (BOH)
  • 32.1% non-aging-related Mandelbugs (NAM)
  • 4.4% aging-related bugs (ARB)
  • 2.1% faults of unknown type (UNK)
• Very similar results for Linux, MySQL, Apache AXIS, httpd (D. Cotroneo, M. Grottke, R. Natella, R. Pietrantuono, K. S. Trivedi. “Fault triggers in open-source software: An experience report”, ISSRE 2013):

Project   LoC (considered)   % BOH   % NAM   % ARB   % UNK
Linux     1.31M              42.2    41.9    8.3     7.6
MySQL     453K               56.6    30.3    7.7     5.4
HTTPD     145K               81.1    10.5    7.0     1.4
AXIS      80K                92.5    3.5     4.0     0.0

Software Faults and Mitigation Types

• The fault classification is not only theoretical; it also has practical implications
• Each type of software fault may require a different approach during development, testing and operations

Software Faults and Mitigation Types

[Figure: classification tree. Software faults divide into Bohrbugs and Mandelbugs; Mandelbugs divide into non-aging-related Mandelbugs and aging-related bugs. Mitigation techniques shown: Fix/Patch, Workaround, Use as is, Reconfigure, Failover to nonidentical, Retry, Restart, Reboot, Failover to identical, Rejuvenate. Group labels: Environmental Diversity; Fault/Error mitigation techniques; Failure mitigation techniques]

Software Fault and Failure Mitigation Types

• Software fault mitigation types:
  • Fix/Patch
• Software failure mitigation types:
  • Reconfigure
  • Failover to non-identical
  • Environmental diversity:
    • Retry
    • Restart
    • Reboot
    • Failover to identical


Environmental Diversity

A new way of thinking about dealing with software faults and failures

Software Fault Tolerance: New Thinking

• Environmental diversity as opposed to design diversity
• Our claim is that this approach (retry, restart, reboot, failover to an identical software copy) works because failures due to Mandelbugs are not negligible. We thus have an affordable software fault tolerance technique that we call Environmental Diversity.

What is Environmental Diversity?

• The underlying idea of environmental diversity
  • Retry a previously failed operation and it most likely works. Why?
  • Because the environment in which the operation is executed has changed enough to avoid the fault activation.
• The environment is understood as
  • OS resources, other applications running concurrently and sharing the same resources, interleaving of operations, concurrency, or synchronization.
• This is fault tolerance, since we do not necessarily fix the fault: the fault caused a failure, but the failure is dealt with by using time redundancy, so the user does not experience the failure again on retry.
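The escalation from retry to stronger environment changes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_with_escalation`, the `(name, action)` ladder and the recovery callables are hypothetical names, not an API from the talk or from any product mentioned in it.

```python
import time


def run_with_escalation(operation, recovery_actions, pause_s=0.0):
    """Run `operation`; on failure, retry after progressively stronger
    environment changes (e.g. restart, reboot, failover to a replica).

    `recovery_actions` is a list of (name, callable) pairs, ordered from
    cheapest to most expensive. Returns (result, name_of_level_that_worked).
    """
    try:
        return operation(), "first attempt"
    except Exception as exc:
        last_error = exc

    # A plain retry changes nothing explicitly; it relies on the environment
    # (timing, scheduling, resource state) having changed by itself.
    ladder = [("retry", lambda: None)] + list(recovery_actions)
    for name, change_environment in ladder:
        change_environment()
        time.sleep(pause_s)
        try:
            return operation(), name
        except Exception as exc:
            last_error = exc

    # Every level failed: the fault behaves like a Bohrbug and needs a fix.
    raise RuntimeError("all recovery levels failed; fix/patch needed") from last_error
```

A failure that disappears at one of these levels is consistent with a Mandelbug; a failure that survives every level reproduces systematically and has to be handled by a fix or patch.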


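Methods of Mitigation: OS Availability Model (IBM BladeCenter)

[Figure: state-transition diagram of the OS availability model with states UP, DN, DT, DW and RP. A failure due to a Mandelbug is recovered by reboot; a failure due to a Bohrbug requires a fix. The transition-rate labels are not recoverable from the extracted text.]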


Software Aging and Rejuvenation

Aging-related Bugs: Replicate, Restart, Reboot, Rejuvenate

Software Aging

• Aging phenomenon
  • Error conditions accumulating over time → performance degradation, system failure
• Main causes of software aging
  • Memory leaks, fragmentation, unterminated threads, data corruption, round-off errors, unreleased file locks, etc.
• Observed in
  • OS, middleware, Netscape, Internet Explorer, etc.

Software Aging: Definition

• “Software aging” phenomenon
  • Long-running software tends to show an increasing failure rate.
  • Not related to the application program becoming obsolete due to changing requirements/maintenance.
  • Software appears to age; there is no real deterioration.

Software Aging Examples

• Oct. 2012: Amazon Web Services outage caused by a memory leak and a failure in the monitoring alarm
• Sept. 2011: Google Docs outage blamed on a memory glitch
• Feb. 1991: the Patriot missile software failure
• International Space Station (ISS) FS SSC memory leak problems

Software Aging: More Examples

• Cisco Catalyst Switch [Matias Jr.]
• File system aging [Smith & Seltzer]
• Gradual service degradation in the AT&T transaction processing system [Avritzer et al.]
• Error accumulation in the Patriot missile system’s software [Marshall]
• Resource exhaustion in Apache [Li et al., Grottke et al.]
• Physical memory degradation in a SOAP-based server [Silva et al.]
• Software aging in Linux [Cotroneo et al.]
• Crash/hang failures in general-purpose applications after a long runtime

Measurements Showing Resource Exhaustion or Depletion

[Figure: two measurement plots, Real Memory Free and File Table Size]

A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi. Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.

Software Fault Types & Their Mitigation

[Figure: classification tree. Software faults divide into Bohrbugs and Mandelbugs; Mandelbugs divide into non-aging-related Mandelbugs and aging-related bugs. Mitigation techniques shown: Fix/Patch, Workaround, Use as is, Reconfigure, Failover to nonidentical, Retry, Restart, Reboot, Failover to identical, Rejuvenate. Group labels: Environmental Diversity; Fault/Error mitigation techniques; Failure mitigation techniques]

Software Rejuvenation

• Software rejuvenation is a cost-effective solution for improving software reliability by avoiding or postponing unanticipated software failures/crashes.
• It allows proactive recovery to be carried out either automatically or at the discretion of the user/administrator.

Rejuvenation of the environment, not of the software

Software Rejuvenation Examples

• Patriot missile system software: switch off and on every 8 hours
• Process and connections restart/recycling
• ISS FS SSC (ISS file system): switch off and on every 2 months
• Tens of US patents related to this technology

Software Rejuvenation: More Examples

• On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]
• IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]
• AT&T billing applications [Huang et al.]

For more real examples: "Software rejuvenation - Do IT & Telco industries use it?". J. Alonso, A. Bovenzi, J. Li, Y. Wang, S. Russo, and K. Trivedi. WOSAR 2012.

Software Rejuvenation: Trade-off

• Advantages
  • Reduces the costs of sudden aging-related failures
  • Can be applied at the discretion of the user/administrator
• Disadvantages
  • Direct costs of carrying out rejuvenation
  • Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions, etc.)

Important research issue: find optimal times to perform rejuvenation!

Software Rejuvenation: Approaches

• Two approaches based on WHEN:
  • Time-based rejuvenation
    • Rejuvenation applied regularly, at predetermined time intervals
    • Widely used in real environments
      • Web servers (Apache)
      • ISS two-month reboot
      • Telecommunication systems
  • Measurement (or inspection)-based rejuvenation
    • System metrics are continually monitored
    • Threshold-based: rejuvenation is triggered when the observations indicate that a crash is imminent
    • Predictive: measurements are used to predict the time to resource exhaustion or time to failure
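As an illustration of the predictive variant, the sketch below fits a straight line to samples of a monitored resource (for example free memory) and extrapolates to the time at which it would reach a floor. It uses ordinary least squares purely for brevity; this is an assumption of the sketch and not the trend-estimation procedure of the cited papers. All function and parameter names are hypothetical.

```python
def predict_time_to_exhaustion(times, free_amounts, floor=0.0):
    """Estimate the time remaining (after the last sample) until the
    monitored resource reaches `floor`, using a least-squares linear trend.

    Returns None when the fitted trend is flat or increasing.
    """
    n = len(times)
    if n < 2 or n != len(free_amounts):
        raise ValueError("need at least two paired samples")
    mean_t = sum(times) / n
    mean_y = sum(free_amounts) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("sample times must not all be equal")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, free_amounts))
    slope = sxy / sxx
    if slope >= 0:
        return None  # no depletion trend detected
    intercept = mean_y - slope * mean_t
    crossing_time = (floor - intercept) / slope
    return max(0.0, crossing_time - times[-1])


def should_rejuvenate(times, free_amounts, lead_time, floor=0.0):
    """Trigger rejuvenation when predicted exhaustion is within `lead_time`."""
    remaining = predict_time_to_exhaustion(times, free_amounts, floor)
    return remaining is not None and remaining <= lead_time
```

For example, with samples taken at hours 0 to 3 and free memory falling by 10 MB per hour from 100 MB, the fitted line reaches zero at hour 10, so the predicted remaining time is 7 hours. A threshold-based scheme corresponds to comparing the latest sample directly against a fixed limit instead of extrapolating.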

Software Rejuvenation Scheduling

Software Rejuvenation: Approaches

• Two approaches based on HOW:
  • Use an analytical model to optimize the rejuvenation schedule
    • Lucent Bell Labs [Huang et al., ’95]
    • Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]
    • Others [IPDS’98, PNPM’99]

Time-Based Rejuvenation

[Figure: three-state model with states A (available), B (failure) and C (rejuvenation); transition labels in the figure: F(t), Det(d), Gen(gf), Gen(gr)]

The distribution of the time to failure (TTF) is needed to determine the optimal value of the rejuvenation trigger.

• State A: the system is up and available
• State B: the system is down and under reactive repair
• State C: the system is under software rejuvenation
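To make the role of the TTF distribution concrete, here is a small Python sketch that searches for a rejuvenation interval maximizing steady-state availability. It is a renewal-reward approximation under explicit assumptions that are not taken from the slides: the TTF is Weibull, both repair and rejuvenation return the system to an as-good-as-new state, and only the mean repair and mean rejuvenation times matter. All names and parameter values are hypothetical.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = P(TTF > t) for an assumed Weibull time to failure."""
    return math.exp(-((t / scale) ** shape))


def availability(interval, shape, scale, mean_repair, mean_rejuvenation, steps=2000):
    """Steady-state availability when rejuvenating every `interval` time units.

    One renewal cycle: the system is up for min(TTF, interval), then spends
    either a repair (failure came first) or a rejuvenation (timer came first).
    """
    # Mean up time per cycle = integral of R(t) over [0, interval] (trapezoid rule).
    h = interval / steps
    total = 0.5 * (1.0 + weibull_reliability(interval, shape, scale))  # R(0) = 1
    for i in range(1, steps):
        total += weibull_reliability(i * h, shape, scale)
    mean_up = total * h

    p_fail = 1.0 - weibull_reliability(interval, shape, scale)
    mean_down = p_fail * mean_repair + (1.0 - p_fail) * mean_rejuvenation
    return mean_up / (mean_up + mean_down)


def best_interval(candidates, **params):
    """Pick the candidate interval with the highest availability."""
    return max(candidates, key=lambda d: availability(d, **params))
```

With an increasing failure rate (Weibull shape greater than 1) and rejuvenation cheaper than repair, the search typically finds an interior optimum. With an exponential TTF (shape equal to 1) there is no aging, and rejuvenating less often is always better under this model.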

Purpose of Measurements

• Times to failure: to parameterize the analytic model discussed earlier
  • Fit the measured data to a known distribution and then use the model to find the optimal rejuvenation schedule
  • Measurements can be sped up by using ALT (accelerated life tests) and ADT (accelerated degradation tests) [Matias et al.; IEEE-TR 2010; ISSRE 2010] if they are done in a controlled environment, or by using simulation with IS (importance sampling) [Zhao et al.; PE 2013, ISSRE 2011, JETC 2014]
  • Alternatively, use the measured sequence directly to determine the optimal rejuvenation schedule, without fitting the measured times to failure to a distribution [Dohi et al., PRDC 2000]

Purpose of Measurements

• Measuring performance variables
  • to predict time to resource exhaustion or time to failure
  • to trigger rejuvenation in a threshold-based scheme [discussed here in detail]

Software Rejuvenation Granularities

IBM xSeries Software Rejuvenation Agent (SRA)

• IBM Director system management tool
  • Provides a GUI to configure SRA
  • Acts upon alerts
• Two versions
  • Periodic rejuvenation
  • Prediction-based rejuvenation


Summarizing the talk

Summary

• It is possible to enhance software availability during operation by exploiting environmental diversity
• Multiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot and, if all else fails, repair (patch)

Summary

• Software aging is not anecdotal; it is a real-life scientific phenomenon
• Rejuvenation has been implemented in several special-purpose applications and in many general-purpose cluster systems


Thank you for your attention

Kishor Trivedi
Dept. of Electrical & Computer Engineering
Duke High Availability Assurance Lab (DHAAL)
Duke University
[email protected]
www.ee.duke.edu/~ktrivedi

Key References

Motivation

• On Operational Availability of a Large Software-Based Telecommunications System, R. Cramp, M. A. Vouk, and W. Jones, Proc. ISSRE 1992.
• Software dependability in the Tandem GUARDIAN system, I. Lee and R. Iyer, IEEE Transactions on Software Engineering, 1995.
• Reliability of a Commercial Telecommunications System, M. Kaaniche and K. Kanoun, Proc. ISSRE 1996.
• Network Troubleshooting, O. Kyas, Agilent Technologies, 2001.
• Lessons Learned From the Analysis of System Failures at Petascale: The Case of Blue Waters, C. Di Martino, F. Baccanico, J. Fullop, W. Kramer, Z. Kalbarczyk, and R. Iyer, Proc. DSN 2014.

Real System

• Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, and B. Vashaw, Proc. PRDC 2008.
• Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik, Proc. FTCS 1999.

Fault Classification

• Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, 2007.
• An Empirical Investigation of Fault Types in Space Mission System Software, M. Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN 2010.
• Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E. Andrade, International Journal of System Assurance Engineering and Management, 2011.
• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D. S. Kim, M. Grottke, and M. Nambiar, Proc. PRDC 2011.

Environmental Diversity and Methods of Mitigation

• Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik, Proc. FTCS 1999.
• Whither generic recovery from Application Faults? A fault study using Open-Source Software, S. Chandra and P. M. Chen, Proc. DSN 2000.
• An Empirical Investigation of Fault Repairs and Mitigations in Space Mission System Software, J. Alonso, M. Grottke, A. Nikora, and K. Trivedi, Proc. DSN 2013.
• Software fault mitigation and availability assurance techniques, K. Trivedi, M. Grottke, and E. Andrade, International Journal of System Assurance Engineering and Management, 2011.
• Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D. S. Kim, M. Grottke, and M. Nambiar, Proc. PRDC 2011.
• Fault triggers in open-source software: An experience report, Cotroneo, Grottke, Natella, Pietrantuono, and Trivedi, Proc. ISSRE 2013.
• Reproducibility of environment-dependent software failures: an experience report, Cavezza, Pietrantuono, Alonso, Russo, and Trivedi, Proc. ISSRE 2014.

Software Aging and Software Rejuvenation (1)

• Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis and N. Fulton, Proc. FTCS-25, 1995.
• A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi, Proc. ISSRE 1998.
• Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.
• Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, 2001.
• A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi, IEEE-TDSC, 2005.

Software Aging and Software Rejuvenation (2)

• Analysis of Software Aging in a Web Server, M. Grottke, L. Li, K. Vaidyanathan, and K. Trivedi, IEEE Transactions on Reliability, 2006.
• A Best Practice Guide to Resource Forecasting for the Apache Webserver, G. Hoffmann, K. Trivedi and M. Malek, IEEE Transactions on Reliability, Dec. 2007.
• Using Accelerated Life Tests to Estimate Time to Software Aging Failure, R. Matias, K. Trivedi, and P. Maciel, Proc. ISSRE 2010.
• Accelerated Degradation Tests Applied to Software Aging Experiments, R. Matias, K. Trivedi, P. Filho and P. Barbetta, IEEE Transactions on Reliability, 2010.
• A Comparative experimental study of software rejuvenation overhead, J. Alonso, R. Matias, E. Vicente, A. Maria and K. Trivedi, Performance Evaluation, 2013.

Software Aging and Software Rejuvenation (3)

• A comprehensive approach to optimal software rejuvenation, J. Zhao, Y. Wang, G. Ning, K. Trivedi, R. Matias Jr., and K. Cai, Performance Evaluation, 2013.
• Software rejuvenation scheduling using accelerated life testing, J. Zhao, Y. Jin, K. Trivedi, R. Matias Jr., and Y. Wang, ACM Journal on Emerging Technologies in Computing Systems, 2014.
• Ensuring the Performance of Apache HTTP Server Affected by Aging, J. Zhao, K. Trivedi, M. Grottke, J. Alonso, and Y. Wang, IEEE Trans. Dependable Sec. Comput., 2014.