How Does Software Fail and What Should be Done About It?
DCIT; Nov. 2015
Prof. Kishor Trivedi
Duke High Availability Assurance Lab (DHAAL) Department of Electrical and Computer Engineering Duke University, Durham, NC 27708-0291 E-mail: [email protected] URL: www.ee.duke.edu/~ktrivedi
Copyright © 2015 by K.S. Trivedi 1 Duke University Research Triangle Park (RTP)
[Map: Duke, UNC-CH and NC State in the Research Triangle, North Carolina, USA]
Duke High Availability Assurance Laboratory (DHAAL)
Kishor Trivedi Dept. of Electrical & Computer Engineering Duke University Email: [email protected] URL: www.ee.duke.edu/~ktrivedi
Internationally connected with groups in USA, Germany, Italy, Japan, China, Brazil, New Zealand and Spain
DHAAL
• Dhaal means “Shield” in Hindi/Gujarati
• DHAAL research is about shielding systems from various threats
Books Update
• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1982; second edition, John Wiley, 2001 (Blue book)
  - Chinese translation has just appeared
• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer, 1996 (Red book)
• Queueing Networks and Markov Chains, John Wiley, 1998; second edition, 2006 (White book)
• Reliability and Availability Engineering, Cambridge University Press, 2016 (Green book)
Outline
Motivation
Real System Examples
Software Fault Classification
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Conclusions
Pervasive Dependence on Computer Systems
Need for High Reliability/Availability
Communication
Health & Medicine Avionics
Banking Entertainment
Basic Definitions
• Steady-state availability ($A_{ss}$), or just availability
  - Long-term probability that the system is available when requested:
    $$A_{ss} = \frac{MTTF}{MTTF + MTTR}$$
  - MTTF is the system mean time to failure, a complex combination of component MTTFs
  - MTTR is the system mean time to recovery
    - may consist of many phases
Basic Definitions
• Downtime in minutes per year
  - In industry, (un)availability is usually presented in terms of annual downtime.
  - $\text{Downtime} = 8760 \times 60 \times (1 - A_{ss})$ minutes.
  - In industry it is common to define availability in terms of the number of nines:
    - 5 NINES ($A_{ss} = 0.99999$): 5.26 minutes annual downtime
    - 4 NINES ($A_{ss} = 0.9999$): 52.56 minutes annual downtime
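The two definitions above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative and not part of the talk, and a non-leap year of 8760 hours is assumed, as on the slide.

```python
MINUTES_PER_YEAR = 8760 * 60  # 8760 hours in a non-leap year, as on the slide


def steady_state_availability(mttf: float, mttr: float) -> float:
    """A_ss = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year, in minutes, for a given A_ss."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return MINUTES_PER_YEAR * (1.0 - availability)


# Tabulate annual downtime for 3, 4 and 5 NINES.
for nines in (3, 4, 5):
    a_ss = 1.0 - 10.0 ** (-nines)
    print(f"{nines} NINES: {annual_downtime_minutes(a_ss):.2f} minutes/year")
```

Each additional nine divides the annual downtime budget by ten, which is why moving from four to five NINES is so demanding in practice.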
Number of Nines: Reality Check
• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week
  - Approx. 80 hours/year = 4800 minutes/year
  - $A_{ss} = (8760 - 80)/8760 \approx 0.9909$
  - That is, between 2 NINES and 3 NINES!
• This study counts planned and unplanned downtime together
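The reality check can be reproduced numerically. The sketch below is illustrative only: the helper names are assumptions, and the "number of nines" is computed as a continuous quantity, -log10(1 - A), which is a common convention rather than something defined in the talk.

```python
import math

HOURS_PER_YEAR = 8760


def availability_from_downtime(downtime_hours_per_year: float) -> float:
    """Availability implied by a given number of downtime hours per year."""
    return (HOURS_PER_YEAR - downtime_hours_per_year) / HOURS_PER_YEAR


def number_of_nines(availability: float) -> float:
    """Continuous 'number of nines': 0.99 maps to 2, 0.999 maps to 3.

    Undefined for availability == 1.0 (zero downtime).
    """
    return -math.log10(1.0 - availability)


a_ss = availability_from_downtime(80)
print(f"A_ss = {a_ss:.4f}, nines = {number_of_nines(a_ss):.2f}")
```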
Some real examples from high-tech companies
Jan. 2014: Gmail was down for 25 to 50 min.
Oct. 2013: services such as posting photos and “likes” were unavailable
Feb. 2013: Windows Azure down for 12 hours
Jan. 2013: AWS down for approx. an hour
Sept. 2012: GoDaddy (4 hours; 5 million websites affected)
More examples of failures
Oct. 2012: Amazon Web Services, 6 hours (memory leak); Amazon EC2, 2 hours
Sept. 2011: Google Docs service outage (1 hour), a memory leak due to a software update
Sept. 2011: Microsoft Cloud service outage (2.5 hours)
(The two Sept. 2011 outages occurred in the same week.)
• These examples indicate that even the most advanced tech companies are offering less than five NINES of availability
• And that is considering only one failure!
Software is the problem
Jim Gray’s paper titled “Why do computers stop and what can be done about it?” pointed out this trend in 1985, followed by his paper “A census of Tandem system availability between 1985 and 1990”
[Figure: charts for 1985 and 2005, across different industries]
High Reliability/Availability: Software is the problem
• Hardware fault tolerance, fault management, and reliability/availability modeling are relatively well developed
• System outages are more often due to software faults
Key Challenge: Software reliability is one of the weakest links in system reliability/availability
Increasing SW Failure Rate?
Planetary Missions Flight Software: A. Nikora of JPL
The interval between the first and last launch: 8.76 years.
The interval between successive launches ranges from 23 to 790 days.
[Figure: data plotted by Mission Name (in launch order): Mars Global Surveyor, Pathfinder, CASSINI, Mars Climate Orbiter, Mars Polar Lander, Stardust, Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance Orbiter]
TAKE AWAY MESSAGES
• Today's complex systems (including CPS and IoT) are mostly software with some hardware thrown in. Software failures are a major cause of system undependability.
• The focus so far has been on software faults; we need to pay attention to failures caused by software and to recovery from these failures. Put another way, the focus so far has been on software reliability; we need to pay attention to software availability as well.
• Software failures during operation are a fact that we need to learn to deal with. The traditional method of software fault tolerance, based on design diversity, is expensive and hence is not used extensively. Software fault tolerance based on inexpensive environmental diversity should be exploited.
Software Reliability: Known Means
• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance
Software Reliability
• Fault prevention or fault avoidance
  - Good software engineering practices
    - Requirement Elicitation (Abuse Case Analysis – TCS SSA)
    - Design Analysis / Review
    - Secure Programming Standard & Review
    - Secure Programming Compilation
    - Software Development lifecycle
    - Automated Code Generation Tools (IDE like Eclipse)
  - Use of formal methods
    - UML, SysML, BPM
    - Proof of correctness
    - Model Checking (SMART, SPIN, PRISM)
• Bug-free code is not yet possible for large-scale software systems
  - Impossible to fully test and verify that software is fault-free
    “Testing shows the presence, not the absence, of bugs” - E. W. Dijkstra
• Yet there is a strong need for failure-free system operation
Software Reliability
• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance
Software Reliability
• Fault removal
  - Can be carried out during
    - the specification and design phase
    - the development phase
    - the operational phase
  - Failure data may be collected and used to parameterize a software reliability growth model (SRGM) to predict when to stop testing
• Software is still delivered with many bugs, because the testing budget is inadequate, because bugs are very difficult to reproduce/detect/localize/correct, or because the techniques employed/known are inadequate
Software Reliability
• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance
High Reliability/Availability: Software is the problem
Software fault tolerance is a potential solution to improve software reliability in lieu of virtually impossible fault-free software
Software Fault Tolerance
Classical Techniques
Design diversity
• N-version programming
• Recovery block
Software Fault Tolerance
Classical Techniques
Design diversity
• N-version programming
• Recovery blocks
• …
Expensive, not used much in practice!
Yet there are stringent requirements for failure-free operation
Challenge: Affordable Software Fault Tolerance
TAKE AWAY MESSAGES
• Today's complex systems (including CPS and IoT) are mostly software with some hardware thrown in. Software failures are a major cause of system undependability.
• The focus so far has been on software faults; we need to pay attention to failures caused by software and to recovery from these failures. Put another way, the focus so far has been on software reliability; we need to pay attention to software availability as well.
• Software failures during operation are a fact that we need to learn to deal with. The traditional method of software fault tolerance, based on design diversity, is expensive and hence is not used extensively. Software fault tolerance based on inexpensive environmental diversity should be exploited.
Outline
Motivation
Real System Examples
Software Fault Classification
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Conclusions
Real Systems
Examples of real systems and their high-availability implementations
High availability SIP Application Server Configuration on IBM WebSphere
[Figure: Blade Chassis 1 and Blade Chassis 2, each with Blades 1 to 4 hosting application servers AS 1 to AS 6 and SIP Proxy 1; Replication Domains 1 to 6; test drivers connected through an IP sprayer (IBM Load Balancer)]
AS1 thru AS6 are Application Servers
Proxy1's are Stateless Proxy Servers
More details in PRDC 2008 and ISSRE 2010 papers
High availability SIP Application Server configuration on WebSphere
Hardware configuration:
Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis sufficient for performance)
Software configuration:
2 copies of SIP/Proxy servers (1 sufficient for performance)
12 copies of WAS (6 sufficient for performance)
Each WAS instance forms a redundancy pair (replication domain) with WAS installed on another node on a different chassis
The system has hardware redundancy and software redundancy.
High availability SIP Application Server configuration on WebSphere
Software Fault Tolerance
Identical copies of SIP proxy used as backups (hot spares)
Identical copies of WebSphere Application Server (WAS) used as backups (hot spares)
Type of software redundancy: not design diversity, but replication of identical software copies
Normal recovery after a software failure
Restart software, reboot node or fail-over to a software replica; only when all else fails, a “software repair” is invoked
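The recovery policy described above can be read as an escalation ladder: try the cheapest action first and fall back to the next one only if it fails. The sketch below is one possible reading, not the WebSphere implementation; the ordering of the first three actions and all function names are hypothetical stand-ins.

```python
from typing import Callable, Sequence, Tuple

RecoveryAction = Tuple[str, Callable[[], bool]]


def recover(actions: Sequence[RecoveryAction]) -> str:
    """Try each recovery action in order; return the name of the first success."""
    for name, action in actions:
        try:
            if action():
                return name
        except Exception:
            # A recovery action that itself raises is treated as unsuccessful.
            continue
    raise RuntimeError("all recovery actions failed")


# Hypothetical stand-ins for real recovery hooks; each reports success or failure.
def restart_software() -> bool:
    return False


def reboot_node() -> bool:
    return False


def fail_over_to_replica() -> bool:
    return True


def software_repair() -> bool:
    return True


ladder = [
    ("restart", restart_software),
    ("reboot", reboot_node),
    ("failover", fail_over_to_replica),
    ("repair", software_repair),  # last resort: fix or patch the fault itself
]
print(recover(ladder))
```

Only the last rung removes the fault; the earlier rungs mask the failure, which is why they are far cheaper to invoke.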
Software Fault Tolerance: New Thinking
1. Have been known to help in dealing with hardware transients
2. Do they help in dealing with failures caused by software bugs?
3. If yes, why?
Software Fault Tolerance: New Thinking
Does it help? If yes, why?
Twenty years ago this would be considered crazy!
Failover to an identical software replica (that is not a diverse version)
Outline
Motivation
Real System Examples
Software Fault Classification
(Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer Magazine, Feb. 2007)
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Conclusions
Software Faults
main threats to high reliability, availability & safety
IFIP Working Group 10.4 (Laprie)
Failure occurs when the delivered service no longer complies with the desired output.
Error is that part of the system state which is liable to lead to subsequent failure.
Fault (or bug) is the adjudged or hypothesized cause of an error.
Faults are the cause of errors that may lead to failures:
Fault → Error → Failure
Need to Classify Bug Types
We submit that a software fault tolerance approach based on retry, restart, reboot or fail-over to an identical software replica (not a diverse version) works because a significant number of software failures are caused by Mandelbugs, as opposed to the traditional software bugs now called Bohrbugs.
Need to Classify Bug Types
For over two decades, researchers have reported the phenomenon of “software aging”, i.e., degraded performance and/or increased failure rate of long-running software systems.
Puzzle: How can performance and failure rate change if the software code is not modified?!
Study software fault types and their relationships
A Classification of Software Faults
Bohrbug := A fault that is easily isolated and that manifests consistently under a well-defined set of conditions, because its activation and error propagation lack complexity.
Example: A bug causing a failure whenever the user enters a negative date of birth
Since they are easily found, Bohrbugs may be detected and fixed during the software testing phase.
The term alludes to the physicist Niels Bohr and his rather simple atomic model.
Mandelbug – Definition
Mandelbug := A fault whose activation and/or error propagation are complex. Typically, a Mandelbug is difficult to isolate, and/or the failures caused by it are not systematically reproducible.
Example: A bug whose activation is scheduling-dependent
The residual faults in a thoroughly-tested piece of software are mainly Mandelbugs.
The term alludes to the mathematician Benoît Mandelbrot and his research in fractal geometry.
Sometimes called concurrency, non-deterministic or soft bugs; failures resulting from these bugs are called transient failures.
Mandelbugs Complexity Factors
A fault is a Mandelbug if its manifestation is subject to the following complexity factors:
• Long time lag between fault activation and failure appearance
• Operating environment (OS, other applications running concurrently, hardware, network…)
• Timing among submitted operations
• Sequencing or ordering of operations
A failure due to a Mandelbug may not show up upon the resubmission of a workload if the operating environment has changed enough
Examples of Types of Bugs in IT Systems
Mandelbugs in IT Systems: Trivedi, Mansharamani, Kim, Grottke, and Nambiar. “Recovery from failures due to Mandelbugs in IT systems”. PRDC 2011.
The selected TCS projects ranged across a number of business systems in the banking, financial, government, IT, pharmacy, and telecom sectors.
Mandelbug Reproducibility
Mandelbugs are really hard to reproduce
Conducted a set of experiments to study the environmental factors (i.e., disk usage, memory occupation and concurrency) that affect the reproducibility of Mandelbugs
High usage of environmental factors significantly increases the reproducibility of Mandelbugs
ISSRE 2014 paper: D. Cavezza, R. Pietrantuono, K. Trivedi, et al.
Aging-related Bug – Definition
Aging-related bug := A fault that leads to the accumulation of errors either inside the running application or in its system-context environment, resulting in an increased failure rate and/or degraded performance.
Example: A bug causing memory leaks in the application
Note that the aging phenomenon requires a delay between (first) fault activation and failure occurrence.
Note also that the software appears to age due to such a bug; there is no physical deterioration.
Relationships of the Bug Types
Bohrbug and Mandelbug are complementary antonyms. Aging-related bugs are a subtype of Mandelbugs.
[Figure: the set of software faults split into Bohrbugs and Mandelbugs, with aging-related bugs shown inside Mandelbugs]
Important Questions about these Bugs
What fraction of bugs are Bohrbugs, Mandelbugs and aging-related bugs?
How do these fractions vary
• over time
• over projects, languages, application types, …
Need for Real Data
Software Faults in IT Systems
Fault types in JPL/NASA flight software - M. Grottke, A. P. Nikora, K. S. Trivedi. “An empirical investigation of fault types in space mission system software”, DSN 2010:
• 61.4% Bohrbugs (BOH)
• 32.1% non-aging-related Mandelbugs (NAM)
• 4.4% aging-related bugs (ARB)
• 2.1% faults of unknown type (UNK)
Very similar results for Linux, MySQL, Apache AXIS, httpd - D. Cotroneo, M. Grottke, R. Natella, R. Pietrantuono, K. S. Trivedi. “Fault triggers in open-source software: An experience report”, ISSRE 2013:

Project   LoC (considered)   % BOH   % NAM   % ARB   % UNK
Linux     1.31M              42.2    41.9    8.3     7.6
MySQL     453K               56.6    30.3    7.7     5.4
HTTPD     145K               81.1    10.5    7.0     1.4
AXIS      80K                92.5    3.5     4.0     0.0
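Because aging-related bugs are a subtype of Mandelbugs, the total Mandelbug share of each data set is the sum of its NAM and ARB percentages. The short sketch below tabulates that sum from the figures quoted above; the dictionary layout and function name are illustrative only.

```python
# Percentages as quoted on the slide (DSN 2010 and ISSRE 2013 studies).
FAULT_SHARES = {
    "JPL/NASA": {"BOH": 61.4, "NAM": 32.1, "ARB": 4.4, "UNK": 2.1},
    "Linux": {"BOH": 42.2, "NAM": 41.9, "ARB": 8.3, "UNK": 7.6},
    "MySQL": {"BOH": 56.6, "NAM": 30.3, "ARB": 7.7, "UNK": 5.4},
    "HTTPD": {"BOH": 81.1, "NAM": 10.5, "ARB": 7.0, "UNK": 1.4},
    "AXIS": {"BOH": 92.5, "NAM": 3.5, "ARB": 4.0, "UNK": 0.0},
}


def mandelbug_share(shares: dict) -> float:
    """Total Mandelbug percentage: non-aging-related plus aging-related."""
    return round(shares["NAM"] + shares["ARB"], 1)


for project, shares in FAULT_SHARES.items():
    print(f"{project:9s} Mandelbugs: {mandelbug_share(shares):5.1f}%")
```

On these numbers the Mandelbug share ranges from under a tenth (AXIS) to about half (Linux) of the classified faults, which is the population that retry, restart, reboot and failover to an identical replica can hope to mask.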
Software Faults and Mitigation Types
The fault classification is not only theoretical; it also has practical implications
Each type of software fault may require a different type of approach during development, testing, as well as during operations
Software Faults and Mitigation Types
[Figure: tree of software faults and their mitigation]
Software faults
  Bohrbugs
  Mandelbugs
    Non-aging-related Mandelbugs
    Aging-related bugs
Mitigation techniques: Fix/Patch, Workaround, Use as is, Reconfigure, Failover to nonidentical, Retry, Restart, Reboot, Failover to identical, Rejuvenate
Group labels: Fault/Error mitigation techniques; Failure mitigation techniques; Environmental Diversity
Software Fault and Failure Mitigation Types
Software fault mitigation types:
• Fix/Patch
Software failure mitigation types:
• Reconfigure
• Failover to non-identical
• Environmental diversity: Retry, Restart, Reboot, Failover to identical
Outline
Motivation
Real System Examples
Software Fault Classification
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Summary
Environmental diversity
A new way of thinking about software faults and failures
Software Fault Tolerance: New Thinking
Environmental Diversity as opposed to Design Diversity
Our claim is that this (retry, restart, reboot, failover to identical software copy) works since failures due to Mandelbugs are not negligible. We thus have an affordable software fault tolerance technique that we call Environmental Diversity
What is Environmental diversity?
The underlying idea of environmental diversity: retry a previously failed operation and it will most likely work. Why? Because the environment in which the operation was executed has changed enough to avoid the fault activation.
The environment is understood as OS resources, other applications running concurrently and sharing the same resources, interleaving of operations, concurrency, or synchronization.
This is fault tolerance, since we do not necessarily fix the fault: the fault caused a failure, but the failure is dealt with by using time redundancy, so the user does not experience the failure again on retry.
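A retry wrapper is the simplest form of this time redundancy. The following is a minimal sketch, assuming that a short randomized delay is enough for the environment (scheduling, resource usage, interleaving) to change between attempts; the `retry` helper, its parameters and the `flaky` operation are hypothetical and only illustrate the idea.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.01) -> T:
    """Re-run a failed operation, pausing so that the environment can change."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            # Randomized, growing pause: the fault is not fixed, we only give
            # timing and resource conditions a chance to differ on the next try.
            time.sleep(base_delay * (2 ** attempt) * random.random())
    raise last_error


# Hypothetical operation that fails twice (e.g. on a transient race) and then works.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


result = retry(flaky)
print(result, "after", calls["n"], "attempts")
```

A retry of this kind cannot help with a Bohrbug, which manifests consistently under the same well-defined conditions; the same inputs fail again no matter how long the wrapper waits.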
Outline
Motivation
Real System Examples
Software Fault Classification
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Summary
OS Availability Model (IBM BladeCenter)
[Figure: state-transition diagram of the OS availability model with states UP, DN, DT, DW and RP. A reboot recovers from a failure due to a Mandelbug; a fix is needed when the failure is due to a Bohrbug.]
Outline
Motivation
Real System Examples
Software Fault Classification
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Summary
Software Aging and Rejuvenation
Aging-related Bugs: Replicate, Restart, Reboot, Rejuvenate
Software Aging
Aging phenomenon
Error conditions accumulating over time
Performance degradation, system failure
Main causes of Software Aging
Memory leak, fragmentation, unterminated threads, data corruption, round-off errors, unreleased file-locks, etc.
Observed systems: OS, middleware, Netscape, Internet Explorer, etc.
Software Aging - Definition
“Software Aging” phenomenon
Long-running software tends to show an increasing failure rate.
Not related to application program becoming obsolete due to changing requirements/maintenance.
Software appears to age; no real deterioration
Software Aging Examples
Oct. 2012 - Amazon Web Services Outage Caused By Memory Leak And Failure In Monitoring Alarm
Sept. 2011 - Google Docs Outage Blamed on Memory Glitch
Feb. 1991 - Patriot Missile Software Failure
The International Space Station (ISS) FC SSC memory leaks problems
Software Aging – More Examples
• Cisco Catalyst Switch [Matias Jr.]
• File system aging [Smith & Seltzer]
• Gradual service degradation in the AT&T transaction processing system [Avritzer et al.]
• Error accumulation in Patriot missile system’s software [Marshall]
• Resource exhaustion in Apache [Li et al., Grottke et al.]
• Physical memory degradation in a SOAP-based Server [Silva et al.]
• Software aging in Linux [Cotroneo et al.]
• Crash/hang failures in general purpose applications after a long runtime
Measurements Showing Resource Exhaustion or Depletion
[Figure: two plots, Real Memory Free and File Table Size]
A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi. Proc. of IEEE Intl. Symp. on Software Reliability Engineering, Nov. 1998.
Software Fault Types & Their Mitigation
[Figure: tree of software faults and their mitigation]
Software faults
  Bohrbugs
  Mandelbugs
    Non-aging-related Mandelbugs
    Aging-related bugs
Mitigation techniques: Fix/Patch, Workaround, Use as is, Reconfigure, Failover to nonidentical, Retry, Restart, Reboot, Failover to identical, Rejuvenate
Group labels: Fault/Error mitigation techniques; Failure mitigation techniques; Environmental Diversity
Software Rejuvenation
Software rejuvenation is a cost-effective solution for improving software reliability by avoiding/postponing unanticipated software failures/crashes.
It allows proactive recovery to be carried out either automatically or at the discretion of the user/administrator
Rejuvenation of the environment, not of software
Software Rejuvenation Examples
Patriot missile system software - switch off and on every 8 hours
Process and connections restart/recycling
ISS FS SSC (ISS File system) - switch off and on every 2 months
Tens of US Patents related to this technology
Software Rejuvenation: More Examples
AT&T billing applications [Huang et al.]
On-board preventive maintenance for long-life deep space missions (NASA’s X2000 Advanced Flight Systems Program) [Tai et al.]
IBM Director Software Rejuvenation (x-series) [IBM & Duke Researchers]
For more real examples: "Software rejuvenation - Do IT & Telco industries use it?". J. Alonso, A. Bovenzi, J. Li, Y. Wang, S. Russo, and K. Trivedi. WOSAR 2012.
Software Rejuvenation: Trade-off
Advantages
• Reduces costs of sudden aging-related failures
• Can be applied at the discretion of the user/administrator
Disadvantages
• Direct costs of carrying out rejuvenation
• Opportunity costs of rejuvenation (downtime, decreased performance, lost transactions, etc.)
Important research issue: Find optimal times to perform rejuvenation!
Software Rejuvenation - Approaches
Two approaches based on WHEN:
Time-Based rejuvenation approaches
Rejuvenation applied regularly and at predetermined time intervals.
Widely used in real environments
• Web servers (Apache)
• ISS two-months reboot
• Telecommunication systems
Software Rejuvenation - Approaches
Two approaches based on WHEN:
Measurement (or inspection)-based rejuvenation
System metrics continually monitored
Threshold based: Rejuvenation triggered when the crash is imminent based on the observation
Predictive: measurements are used to predict the time to resource exhaustion or time to failure
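The predictive variant can be illustrated with a simple trend extrapolation: fit a line to recent measurements of a depleting resource and estimate when it will reach zero. This is a sketch under stated assumptions: ordinary least squares is used here for brevity and is not necessarily the estimator used in the cited work, and the function names, the sample data and the lead-time rule are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def fit_line(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two (time, value) pairs of equal length")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time stamps are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t


def time_to_exhaustion(times: Sequence[float], free: Sequence[float],
                       floor: float = 0.0) -> Optional[float]:
    """Predicted time from the last sample until the resource reaches `floor`.

    Returns None when the fitted trend is not decreasing.
    """
    slope, intercept = fit_line(times, free)
    if slope >= 0:
        return None
    t_exhaust = (floor - intercept) / slope
    return max(0.0, t_exhaust - times[-1])


def should_rejuvenate(times: Sequence[float], free: Sequence[float],
                      lead_time: float) -> bool:
    """Trigger rejuvenation when predicted exhaustion is within `lead_time`."""
    remaining = time_to_exhaustion(times, free)
    return remaining is not None and remaining <= lead_time


# Hypothetical samples: free memory (MB) measured once per hour.
hours = [0, 1, 2, 3, 4]
free_mb = [1000, 950, 900, 850, 800]
print(time_to_exhaustion(hours, free_mb))
print(should_rejuvenate(hours, free_mb, lead_time=24))
```

A threshold-based scheme would instead compare the latest sample directly against a fixed limit; the predictive scheme buys earlier warning at the price of trusting the fitted trend.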
Software Rejuvenation Scheduling
Software Rejuvenation – Approaches
Two approaches based on HOW:
Use analytical model to optimize rejuvenation schedule
Lucent Bell Labs [Huang et al., ‘95]
Duke [IEEE-TC’98, SIGMETRICS’96, ISSRE’95, PRDC’00, SIGMETRICS’01, Comp J.’01, SRDS’02, DSN’02, ISSRE’02, DSN’03, IEEE-TR’05]
Others [IPDS’98, PNPM’99]
Time-Based Rejuvenation
[Figure: three-state model. A (Available) goes to C (Rejuvenation) after a deterministic interval Det(δ) and returns via Gen(γ_r); A goes to B (Failure) according to the time-to-failure distribution F(t) and returns via Gen(γ_f).]
Distribution of TTF needed to determine optimal value of rejuvenation trigger
state A: the system is up and available
state C: the system is under software rejuvenation
state B: the system is down and under reactive repair
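For a three-state model of this kind, a renewal argument gives the steady-state availability as expected uptime per cycle divided by expected cycle length: A(δ) = U(δ) / (U(δ) + F(δ)·m_f + (1 − F(δ))·m_r), where U(δ) is the integral of 1 − F(t) from 0 to δ, m_f is the mean repair time and m_r the mean rejuvenation time. The sketch below evaluates this numerically. The Weibull time-to-failure distribution, all parameter values and the grid search are assumptions chosen for illustration, not figures from the talk.

```python
import math
from typing import Iterable


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """R(t) = 1 - F(t) for a Weibull time to failure (assumed here)."""
    return math.exp(-((t / scale) ** shape))


def expected_uptime(delta: float, shape: float, scale: float, steps: int = 1000) -> float:
    """U(delta): integral of R(t) over [0, delta], by the trapezoidal rule."""
    h = delta / steps
    total = 0.5 * (1.0 + weibull_reliability(delta, shape, scale))
    for i in range(1, steps):
        total += weibull_reliability(i * h, shape, scale)
    return total * h


def availability(delta: float, shape: float, scale: float,
                 mean_repair: float, mean_rejuv: float) -> float:
    """Steady-state availability when rejuvenating every `delta` time units."""
    up = expected_uptime(delta, shape, scale)
    p_fail = 1.0 - weibull_reliability(delta, shape, scale)  # failure before delta
    down = p_fail * mean_repair + (1.0 - p_fail) * mean_rejuv
    return up / (up + down)


def best_interval(candidates: Iterable[float], shape: float, scale: float,
                  mean_repair: float, mean_rejuv: float) -> float:
    """Grid search for the rejuvenation interval that maximizes availability."""
    return max(candidates,
               key=lambda d: availability(d, shape, scale, mean_repair, mean_rejuv))


# Hypothetical parameters (hours): shape > 1 models an increasing failure rate (aging).
SHAPE, SCALE, MEAN_REPAIR, MEAN_REJUV = 2.0, 1000.0, 10.0, 0.5
best = best_interval(range(10, 3001, 10), SHAPE, SCALE, MEAN_REPAIR, MEAN_REJUV)
print(best, availability(best, SHAPE, SCALE, MEAN_REPAIR, MEAN_REJUV))
```

Only the means of the repair and rejuvenation times enter the formula, but the full distribution of the time to failure is needed. With an exponential time to failure (shape 1) there is no aging and rejuvenation cannot improve availability, so the search is only meaningful for an increasing failure rate.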
Purpose of Measurements
Times to Failure: to parameterize the analytic model discussed earlier
• Fit the measured data to a known distribution and then use the model to find the optimal rejuvenation schedule
• Measurements can be sped up by using ALT (accelerated life tests) and ADT (accelerated degradation tests) [Matias et al.; IEEE-TR 2010; ISSRE 2010], if measurements are done in a controlled environment, or by using simulation with IS (importance sampling) [Zhao et al.; PE 2013, ISSRE 2011, JETC 2014]
• The measured sequence can be used directly to determine the optimal rejuvenation schedule, without fitting measured times to failure to a distribution [Dohi et al., PRDC 2000]
Purpose of Measurements
Measuring performance variables
to predict time to resource exhaustion or time to failure
to trigger rejuvenation in a threshold based scheme [discussed here in detail]
Software Rejuvenation Granularities
IBM xSeries Software Rejuvenation Agent (SRA)
IBM Director system management tool
• Provides GUI to configure SRA
• Acts upon alerts
Two versions
• Periodic rejuvenation
• Prediction-based rejuvenation
Outline
Motivation
Real System Examples
Software Fault Classification
Environmental Diversity
Methods of Mitigation
Software Aging and Rejuvenation
Summary
Summarizing the talk
Summary
It is possible to enhance software availability during operation by exploiting environmental diversity
Multiple types of recovery after a software failure can be judiciously employed: restart, failover to a replica, reboot and, if all else fails, repair (patch)
Summary
Software aging not anecdotal – real life scientific phenomenon
Rejuvenation implemented in several special purpose applications and many general purpose cluster systems
TAKE AWAY MESSAGES
• Today's complex systems (including CPS and IoT) are mostly software with some hardware thrown in. Software failures are a major cause of system undependability.
• The focus so far has been on software faults; we need to pay attention to failures caused by software and to recovery from these failures. Put another way, the focus so far has been on software reliability; we need to pay attention to software availability as well.
• Software failures during operation are a fact that we need to learn to deal with. The traditional method of software fault tolerance, based on design diversity, is expensive and hence is not used extensively. Software fault tolerance based on inexpensive environmental diversity should be exploited.
Thank you for your attention
Kishor Trivedi
Dept. of Electrical & Computer Engineering
Duke High Availability Assurance Lab (DHAAL)
Duke University
[email protected]
www.ee.duke.edu/~ktrivedi
Key References
Motivation
On Operational Availability of a Large Software-Based Telecommunications System, R. Cramp, M. A. Vouk, and W. Jones, Proc. ISSRE 1992.
Software dependability in the Tandem GUARDIAN system, I. Lee and R. Iyer, IEEE Transactions on Software Engineering, 1995.
Reliability of a Commercial Telecommunications System, M. Kaaniche and K. Kanoun, Proc. ISSRE 1996.
Network Troubleshooting, O. Kyas, Agilent Technologies, 2001.
Lessons Learned From the Analysis of System Failures at Petascale: The Case of Blue Waters, C. D. Martino, F. Baccanico, J. Fullop, W. Kramer, Z. Kalbarczyk, and R. Iyer, Proc. DSN 2014.

Real System
Availability Modeling of SIP Protocol on IBM WebSphere, K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, and B. Vashaw, Proc. PRDC 2008.
Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik, Proc. FTCS 1999.

Fault Classification
Fighting Bugs: Remove, Retry, Replicate and Rejuvenate, M. Grottke and K. Trivedi, IEEE Computer, 2007.
An Empirical Investigation of Fault Types in Space Mission System Software, M. Grottke, A. P. Nikora and K. S. Trivedi, Proc. DSN 2010.
Software fault mitigation and availability assurance techniques, K. S. Trivedi, M. Grottke, and E. Andrade, International Journal of System Assurance Engineering and Management, 2011.
Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D. S. Kim, M. Grottke, and M. Nambiar, Proc. PRDC 2011.

Environmental Diversity and Methods of Mitigation
Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance, S. Garg, Y. Huang, C. M. R. Kintala, K. S. Trivedi, and S. Yajnik, Proc. FTCS 1999.
Whither generic recovery from Application Faults? A fault study using Open-Source Software, S. Chandra and P. M. Chen, Proc. DSN 2000.
An Empirical Investigation of Fault Repairs and Mitigations in Space Mission System Software, J. Alonso, M. Grottke, A. Nikora, and K. Trivedi, Proc. DSN 2013.
Software fault mitigation and availability assurance techniques, K. Trivedi, M. Grottke, and E. Andrade, International Journal of System Assurance Engineering and Management, 2011.
Recovery from Failures due to Mandelbugs in IT Systems, K. Trivedi, R. Mansharamani, D. S. Kim, M. Grottke, and M. Nambiar, Proc. PRDC 2011.
Fault triggers in open-source software: An experience report, D. Cotroneo, M. Grottke, R. Natella, R. Pietrantuono, and K. Trivedi, Proc. ISSRE 2013.
Reproducibility of environment-dependent software failures: an experience report, D. Cavezza, R. Pietrantuono, J. Alonso, S. Russo, and K. Trivedi, Proc. ISSRE 2014.

Software Aging and Software Rejuvenation
Software Rejuvenation: Analysis, Module and Applications, Y. Huang, C. Kintala, N. Kolettis and N. Fulton, Proc. FTCS-25, 1995.
A Methodology for Detection and Estimation of Software Aging, S. Garg, A. van Moorsel, K. Vaidyanathan and K. Trivedi, Proc. ISSRE 1998.
Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule, T. Dohi, K. Goseva-Popstojanova and K. S. Trivedi, Proc. PRDC 2000.
Proactive Management of Software Aging, V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert, IBM Journal of Research & Development, 2001.
A Comprehensive Model for Software Rejuvenation, K. Vaidyanathan and K. S. Trivedi, IEEE-TDSC, 2005.
Analysis of Software Aging in a Web Server, M. Grottke, L. Li, K. Vaidyanathan, and K. Trivedi, IEEE Transactions on Reliability, 2006.
A Best Practice Guide to Resource Forecasting for the Apache Webserver, G. Hoffman, K. Trivedi and M. Malek, IEEE Transactions on Reliability, Dec. 2007.
Using Accelerated Life Tests to Estimate Time to Software Aging Failure, R. Matias, K. Trivedi, and P. Maciel, Proc. ISSRE 2010.
Accelerated Degradation Tests Applied to Software Aging Experiments, R. Matias, K. Trivedi, P. Filho and P. Barbetta, IEEE Transactions on Reliability, 2010.
A Comparative experimental study of software rejuvenation overhead, J. Alonso, R. Matias, E. Vicente, A. Maria and K. Trivedi, Performance Evaluation, 2013.
A comprehensive approach to optimal software rejuvenation, J. Zhao, Y. Wang, G. Ning, K. Trivedi, R. Matias Jr., and K. Cai, Performance Evaluation, 2013.
Software rejuvenation scheduling using accelerated life testing, J. Zhao, Y. Jin, K. Trivedi, R. Matias Jr., and Y. Wang, ACM Journal on Emerging Technologies in Computing Systems, 2014.
Ensuring the Performance of Apache HTTP Server Affected by Aging, J. Zhao, K. Trivedi, M. Grottke, J. Alonso, and Y. Wang, IEEE Trans. Dependable Sec. Comput., 2014.