Software Fault Tolerance Based on Design Diversity Is Expensive and Hence Does Not Get Used Extensively

How Does Software Fail and What Should Be Done About It?
DCIT, Nov. 2015

Prof. Kishor Trivedi
Duke High Availability Assurance Lab (DHAAL)
Department of Electrical and Computer Engineering
Duke University, Durham, NC 27708-0291
E-mail: [email protected]
URL: www.ee.duke.edu/~ktrivedi
Copyright © 2015 by K.S. Trivedi

Duke University
[Map: Duke, UNC-CH and NC State around Research Triangle Park (RTP), North Carolina, USA]

Duke High Availability Assurance Laboratory (DHAAL)
• Internationally connected with groups in USA, Germany, Italy, Japan, China, Brazil, New Zealand and Spain

DHAAL
• Dhaal means “Shield” in Hindi/Gujarati
• DHAAL research is about shielding systems from various threats

Books Update
• Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1982; second edition, John Wiley, 2001 (Blue book). A Chinese translation has just appeared.
• Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer, 1996 (Red book)
• Queueing Networks and Markov Chains, John Wiley, 1998; second edition, 2006 (White book)
• Reliability and Availability Engineering, Cambridge University Press, 2016 (Green book)

Outline
• Motivation
• Real System Examples
• Software Fault Classification
• Environmental Diversity
• Methods of Mitigation
• Software Aging and Rejuvenation
• Conclusions

Pervasive Dependence on Computer Systems
Need for high reliability/availability in:
• Communication
• Health & Medicine
• Avionics
• Banking
• Entertainment
Basic Definitions
• Steady-state availability ($A_{ss}$), or just availability
  - Long-term probability that the system is available when requested:
    $A_{ss} = \dfrac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$
  - MTTF is the system mean time to failure, a complex combination of component MTTFs
  - MTTR is the system mean time to recovery
    - May consist of many phases
• Downtime in minutes per year
  - In industry, (un)availability is usually presented in terms of annual downtime.
  - Downtime $= 8760 \times 60 \times (1 - A_{ss})$ minutes.
  - In industry it is common to define availability in terms of the number of nines:
    - 5 NINES ($A_{ss} = 0.99999$): 5.26 minutes annual downtime
    - 4 NINES ($A_{ss} = 0.9999$): 52.56 minutes annual downtime

Number of Nines: Reality Check
• 49% of Fortune 500 companies experience at least 1.6 hours of downtime per week
  - Approx. 80 hours/year = 4800 minutes/year
  - $A_{ss} = (8760 - 80)/8760 \approx 0.9909$
  - That is, between 2 NINES and 3 NINES!
• This study counts planned and unplanned downtime together

Some real examples from high-tech companies
• Jan. 2014: Gmail was down for 25 to 50 min.
• Oct. 2013: services such as posting photos and “likes” were unavailable
• Feb. 2013: Windows Azure down for 12 hours
• Jan. 2013: AWS down for approx. an hour
• Sept. 2012: GoDaddy (4 hours; 5 million websites affected)

More examples of failures
• Oct. 2012: Amazon Web Services, 6 hours (memory leak)
• Amazon EC2, 2 hours
• Sept. 2011: Google Docs service outage (1 hour), a memory leak due to a software update
• Sept. 2011: Microsoft Cloud service outage (2.5 hours)
[Slide annotation alongside these outages: “The same week”]
• These examples indicate that even the most advanced tech companies are offering less than five NINES of availability
  - And that is considering only one failure!
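The availability and downtime definitions above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original slides: the function names are chosen here for illustration, and the MTTF/MTTR values in the last usage line are hypothetical.

```python
HOURS_PER_YEAR = 8760                  # non-leap year, as used on the slides
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60


def steady_state_availability(mttf: float, mttr: float) -> float:
    """A_ss = MTTF / (MTTF + MTTR); both arguments in the same time unit."""
    return mttf / (mttf + mttr)


def annual_downtime_minutes(availability: float) -> float:
    """Downtime = 8760 * 60 * (1 - A_ss) minutes per year."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def availability_from_annual_downtime(hours_down: float) -> float:
    """Inverse view: availability implied by a yearly downtime in hours."""
    return (HOURS_PER_YEAR - hours_down) / HOURS_PER_YEAR


if __name__ == "__main__":
    for label, a in [("5 NINES", 0.99999), ("4 NINES", 0.9999)]:
        print(f"{label}: {annual_downtime_minutes(a):.2f} minutes/year")
    # Reality check from the slides: roughly 80 hours of downtime per year.
    print(f"80 h/year -> A_ss = {availability_from_annual_downtime(80):.4f}")
    # Hypothetical values: MTTF = 1000 h, MTTR = 1 h.
    print(f"A_ss = {steady_state_availability(1000.0, 1.0):.6f}")
```

Working in minutes per year makes the gap between the "nines" tangible: each additional nine divides the allowed annual downtime by ten.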
Software is the problem
• Jim Gray’s paper titled “Why do computers stop and what can be done about it?” pointed out this trend in 1985, followed by his paper “A census of tandem system availability between 1985 and 1990”
[Charts labelled 1985 and 2005; caption: “Across different industries…”]

High Reliability/Availability: Software is the problem
• Hardware fault tolerance, fault management, and reliability/availability modeling are relatively well developed
• System outages are more often due to software faults
Key challenge: software reliability is one of the weakest links in system reliability/availability

Increasing SW Failure Rate?
Planetary Missions Flight Software: A. Nikora of JPL
• The interval between the first and last launch: 8.76 years.
• The interval between successive launches ranges from 23 to 790 days.
[Chart; x-axis: mission name (in launch order): Mars Global Surveyor, Mars Pathfinder, Cassini, Mars Climate Orbiter, Mars Polar Lander, Stardust, Mars Odyssey, Genesis, Mars Exploration Rover, Deep Impact, Mars Reconnaissance Orbiter]

TAKE AWAY MESSAGES
• Today's complex systems (including CPS and IoT) are mostly software with some hardware thrown in. Software failures are a major cause of system undependability.
• The focus so far has been on software faults; we need to pay attention to failures caused by software and the recovery from these failures. Put differently: the focus so far has been on software reliability; we need to pay attention to software availability as well.
• Software failures during operation are a fact that we need to learn to deal with. The traditional method of software fault tolerance, based on design diversity, is expensive and hence does not get used extensively. Software fault tolerance based on inexpensive environmental diversity should be exploited.
Software Reliability: Known Means
• Fault prevention or fault avoidance
• Fault removal
• Fault tolerance

Software Reliability
• Fault prevention or fault avoidance
  - Good software engineering practices
    - Requirement elicitation (Abuse Case Analysis – TCS SSA)
    - Design analysis / review
    - Secure programming standard & review
    - Secure programming compilation
    - Software development lifecycle
    - Automated code generation tools (IDE like Eclipse)
  - Use of formal methods
    - UML, SysML, BPM
    - Proof of correctness
    - Model checking (SMART, SPIN, PRISM)
• Bug-free code is not yet possible for large-scale software systems
  - Impossible to fully test and verify that software is fault-free
    “Testing shows the presence, not the absence, of bugs” (E. W. Dijkstra)
• Yet there is a strong need for failure-free system operation

Software Reliability
• Fault removal
  - Can be carried out during
    - the specification and design phase
    - the development phase
    - the operational phase
  - Failure data may be collected and used to parameterize a software reliability growth model (SRGM) to predict when to stop testing
• Software is still delivered with many bugs, because of an inadequate testing budget, because bugs are very difficult to reproduce, detect, localize, or correct, or because the techniques employed or known are inadequate

High Reliability/Availability: Software is the problem
Software fault tolerance is a potential solution to improve software reliability, since fault-free software is virtually impossible
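Two classical forms of software fault tolerance are recovery blocks and N-version programming, both relying on independently designed variants of the same function. The sketch below illustrates the control flow of each under simplifying assumptions: the variants are pure functions, so there is no state to checkpoint and restore, and the three integer square root variants (one with a deliberately seeded fault) are hypothetical examples, not taken from the slides.

```python
from collections import Counter
from typing import Callable, Sequence


def recovery_block(x, variants: Sequence[Callable], acceptance_test: Callable):
    """Try the primary, then each alternate, until a result is accepted."""
    for variant in variants:
        try:
            result = variant(x)      # every attempt restarts from the same input
        except Exception:
            continue                 # a crash counts as a failed attempt
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def n_version_vote(x, versions: Sequence[Callable]):
    """Run all versions and return the strict-majority output (hashable)."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass                     # a crashed version casts no vote
    if outputs:
        value, count = Counter(outputs).most_common(1)[0]
        if count > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


# Hypothetical, independently written variants of integer square root.
def isqrt_faulty(n: int) -> int:
    return n // 2                    # seeded design fault: wrong for most n


def isqrt_newton(n: int) -> int:
    if n < 2:
        return n
    r = n
    while r * r > n:
        r = (r + n // r) // 2
    return r


def isqrt_linear(n: int) -> int:
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def accept_isqrt(n: int, r: int) -> bool:
    return r * r <= n < (r + 1) * (r + 1)


if __name__ == "__main__":
    print(recovery_block(9, [isqrt_faulty, isqrt_newton], accept_isqrt))
    print(n_version_vote(9, [isqrt_faulty, isqrt_newton, isqrt_linear]))
```

Both schemes need several independently developed implementations of the same specification, and the recovery block additionally needs a trustworthy acceptance test. That development effort is where the cost of design diversity comes from.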
Software Fault Tolerance: Classical Techniques
Design diversity
• N-version programming
• Recovery blocks
• …
Expensive, so not used much in practice!
Yet there are stringent requirements for failure-free operation.
Challenge: affordable software fault tolerance

Real Systems
Examples of real systems and their high availability implementations
High availability SIP Application Server Configuration on IBM WebSphere
More details in PRDC 2008 and ISSRE 2010 papers
[Diagram, “A Real System”: test drivers send traffic through an IP sprayer/load balancer to two blade chassis. Each chassis has four blades, with SIP Proxy 1 on Blade 1 and application server instances AS 1 to AS 6 on the remaining blades; the instances are grouped into Replication Domains 1 to 6.]
• AS1 through AS6 are application servers
• The Proxy1 instances are stateless proxy servers

High availability SIP Application Server configuration on WebSphere
Hardware configuration:
• Two BladeCenter chassis; 4 blades (nodes) on each chassis (1 chassis sufficient for performance)
Software configuration:
• 2 copies of SIP/Proxy servers (1 sufficient for performance)
• 12 copies of WAS (6 sufficient for performance)
• Each WAS instance forms a redundancy pair (replication domain) with a WAS instance installed on another node on a different chassis
The system has both hardware redundancy and software redundancy.
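To get a rough feel for what "2 installed, 1 sufficient" or "12 installed, 6 sufficient" buys, the redundancy above can be approximated with a k-out-of-n calculation. This is only a back-of-the-envelope sketch under strong assumptions that do not come from the slides: components are identical, fail independently, and failover is perfect. The per-component availability of 0.99 is hypothetical, and this is not the model used in the cited papers.

```python
from math import comb


def k_out_of_n_availability(k: int, n: int, a: float) -> float:
    """Probability that at least k of n independent components are up,
    each with steady-state availability a."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


def parallel_pair_availability(a: float) -> float:
    """Special case k=1, n=2: a redundancy pair is down only if both are down."""
    return 1 - (1 - a) ** 2


if __name__ == "__main__":
    a = 0.99  # hypothetical availability of a single component
    print(f"single component : {a:.6f}")
    print(f"1-out-of-2 pair  : {parallel_pair_availability(a):.6f}")
    print(f"6-out-of-12 group: {k_out_of_n_availability(6, 12, a):.12f}")
```

In practice, common-mode failures and imperfect failure detection or failover reduce these figures, which is why independence is flagged as an assumption rather than a property of the real system.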
