TDDD07 Real-Time Systems Lecture 7: Dependability & Fault Tolerance
Simin Nadjm-Tehrani
Real-time Systems Laboratory
Department of Computer and Information Science
Linköping University
Undergraduate course on Real-time Systems, Linköping University, Autumn 2009 (56 pages)

Dependability and real-time
• If a system is to produce results within time constraints, it needs to produce results at all!
• How to justify and measure how well computer systems do their jobs?

Dependable systems
• How do things go wrong and why?
• What can we do about it?
– This lecture: Basic notions of dependability and replication in fault-tolerant systems
– Next lecture: Designing dependable real-time systems

Early computer systems
• 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system in 1951
• Short life of vacuum tubes gave a mean time to failure of 20 minutes

Engineers:
Fool me once, shame on you – fool me twice, shame on me

Software developers:
Fool me N times, who cares, this is complex and anyway no one expects software to work...

FT - June 16, 2004
• “If you have a problem with your Volkswagen the likelihood that it was a software problem is very high. Software technology is not something that we as car manufacturers feel comfortable with.”
Bernd Pischetsrieder, chief executive of Volkswagen

October 2005
• “Automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly.” Wired 05-11-08
• The problem was found to be an embedded software bug

February 2, 2004
• Angel Eck, driving a 1997 Pontiac Sunfire, found her car racing at high speed and accelerating on Interstate 70 for 45 minutes, heading toward Denver
• ... with no effect from trying the brakes, shifting to neutral, and shutting off the ignition.

Driver support: Volvo cars
1984 ABS Anti-lock Braking System
1998 Dynamic Stability and Traction Control (DSTC)
2002 Roll Stability Control (RSC)
2003 Intelligent Driver Information System (IDIS)
2004 Blind Spot Information System (BLIS)
2006 Active Bi-Xenon lights
2006 Adaptive Cruise Control (ACC)
2006 Collision warning system with brake support

Early space and avionics
• During 1955, 18 air carrier accidents in the USA (when only 20% of the public was willing to fly!)
• Today’s complexity many times higher

Airbus 380
• Integrated modular avionics (IMA), with safety-critical digital components, e.g.
– Power-by-wire: complementing the hydraulic powered flight control surfaces
– Cabin pressure control (implemented with a TTP operated bus)

Dependability (Swedish: Pålitliga datorsystem)
• How do things go wrong?
• Why?
• What can we do about it?
• Basic notions in dependable systems and replication for fault tolerance

What is dependability?
Property of a computing system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]
The ability to avoid service failures that are more frequent or more severe than is acceptable.

Attributes of dependability
IFIP WG 10.4 definitions:
• Safety: absence of harm to people and environment
• Availability: the readiness for correct service
• Integrity: absence of improper system alterations
• Reliability: continuity of correct service
• Maintainability: ability to undergo modifications and repairs

Reliability (Swedish: Tillförlitlighet)
Means that the system (functionally) behaves as specified, and does so continually over measured intervals of time.
Typical measure in aerospace: 10^-9
Another way of putting it: MTTF, one failure in 10^9 flight hours.

Faults, Errors & Failures
• Fault: a defect within the system or a situation that can lead to failure
• Error: manifestation (symptom) of the fault, an unexpected behaviour
• Failure: system not performing its intended function

Examples
• Year 2000 bug
• Bit flips in hardware due to cosmic radiation in space
• Loose wire
• Aircraft retracting its landing gear while on ground
Effects in time: Permanent / transient / intermittent

Fault ⇒ Error ⇒ Failure
• Goal of system verification and validation is to eliminate faults
Some will remain…
• Goal of hazard/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear: eliminate or delay failures

More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance
2. Fault removal (next lecture)
3. Fault tolerance
4. Fault forecasting

Google’s 100 min outage
September 2, 2009:
A small fraction of Gmail’s servers were taken offline to perform routine upgrades.
“We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers.”
Fault forecasting?

More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance
2. Fault removal
3. Fault tolerance
4. Fault forecasting

External factors
The film...

Fault tolerance
• Means that a system provides a degraded (but acceptable) function
– Even in presence of faults
– During a period defined by certain model assumptions
• Foreseen or unforeseen?
– Fault model describes the foreseen faults

Fault models
• Leading to Node failures
– Crash
– Omission
– Timing
– Byzantine
• Leading to Channel failures
– Crash (and potential partitions)
– Message loss
– Message delay
– Erroneous/arbitrary messages

On-line fault-management
• Fault detection
– By program or its environment
• Fault tolerance using redundancy
– Software
– Hardware
– Data
• Fault containment by architectural choices

Redundancy
From D. Lardner: Edinburgh Review, year 1824:
“The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods.”
* people who compute

Static Redundancy
Used all the time (whether an error has appeared or not), just in case…
– SW: N-version programming
– HW: Voting systems
– Data: parity bits, checksums

Dynamic Redundancy
Used when error appears and specifically aids the treatment
– SW: Recovery methods, Exceptions
– HW: Switching to back-up module
– Data: Self-correcting codes
– Time: Re-computing a result

Server replication models
• Passive replication
• Active replication
X: Denotes a replica group

Increasing availability
• Active replication
– Group membership: who is up, who is down?
• Passive replication
– Primary – backup: bring the secondary server up to date
• What do we need to implement to support transition to operational mode?
– Message ordering
– Agreement among replicas

Relating states
Needs a notion of time (precedence) in distributed systems
Recall from lecture 4:
• If clocks are synchronised
– The rate of processing at each node can be related
– Timeouts can be used to detect faults as opposed to message delays
• Asynchronous systems:
– Abstract time via order of events
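
The slide on relating states notes that, when clocks are synchronised, timeouts can be used to detect faults as opposed to message delays. A minimal Python sketch of that idea follows. It is an illustration only, not part of the lecture material: the class name TimeoutFailureDetector, the node names and the 2-second timeout are all hypothetical, and the sketch assumes the timeout is a known upper bound on the heartbeat period plus the message delay.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutFailureDetector:
    """Suspects a node once no heartbeat has arrived within `timeout` seconds.

    Assumption (not from the slides): `timeout` bounds the heartbeat period
    plus the worst-case message delay, so silence beyond it indicates a fault.
    """

    timeout: float
    last_seen: dict = field(default_factory=dict)  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that a heartbeat from `node` was received at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        """Return the nodes whose last heartbeat is older than the timeout."""
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


# Hypothetical primary/backup pair: the primary goes silent after time 0.0.
detector = TimeoutFailureDetector(timeout=2.0)
detector.heartbeat("primary", now=0.0)
detector.heartbeat("backup", now=0.0)
detector.heartbeat("backup", now=1.5)
print(sorted(detector.suspected(now=2.5)))  # prints ['primary']
```

Times are passed in explicitly rather than read from a clock, which keeps the example deterministic. In an asynchronous system no such bound on delay is known, so a silent node can only be suspected, not declared failed, which is why the slide falls back on ordering events instead.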