
TDDD07 Real-time Systems
Lecture 7: Dependability & Fault tolerance

Simin Nadjm-Tehrani
Real-time Systems Laboratory
Department of Computer and Information Science
Linköping University

Undergraduate course on Real-time Systems, Linköping University, Autumn 2009

• If a system is to produce results within time constraints, it needs to produce results at all!
• How to justify and measure how well systems do their jobs?


Dependable systems
• How do things go wrong and why?
• What can we do about it?
  – This lecture: basic notions of dependability and replication in fault-tolerant systems
  – Next lecture: designing dependable real-time systems

Early computer systems
• 1944: Real-time computer system in the Whirlwind project at MIT, used in a military air traffic control system in 1951
• Short life of vacuum tubes gave a mean time to failure of 20 minutes


Engineers: Fool me once, shame on you – fool me twice, shame on me

Software developers: Fool me N times, who cares, this is complex and anyway no one expects software to work...


FT – June 16, 2004
• "If you have a problem with your Volkswagen the likelihood that it was a software problem is very high. Software technology is not something that we as car manufacturers feel comfortable with."
Bernd Pischetsrieder, chief executive of Volkswagen

October 2005
• "Automaker Toyota announced a recall of 160,000 of its Prius hybrid vehicles following reports of vehicle warning lights illuminating for no reason, and cars' gasoline engines stalling unexpectedly." Wired 05-11-08
• The problem was found to be an embedded software bug


February 2, 2004
• Angel Eck, driving a 1997 Pontiac Sunfire, found her car racing at high speed and accelerating on Interstate 70 for 45 minutes, heading toward Denver
• ... with no effect from trying the brakes, shifting to neutral, and shutting off the ignition.

Driver support: Volvo cars
1984 ABS Anti-lock Braking System
1998 Dynamic Stability and Traction Control (DSTC)
2002 Roll Stability Control (RSC)
2003 Intelligent Driver Information System (IDIS)
2004 Blind Spot Information System (BLIS)
2006 Active Bi-Xenon lights
2006 Adaptive Cruise Control (ACC)
2006 Collision warning system with brake support


Early space and avionics
• During 1955, 18 air carrier accidents in the USA (when only 20% of the public was willing to fly!)
• Today's complexity many times higher

Airbus 380
• Integrated modular avionics (IMA), with safety-critical digital components, e.g.
  – Power-by-wire: complementing the hydraulically powered flight control surfaces
  – Cabin pressure control (implemented with a TTP-operated bus)


Dependability
• How do things go wrong?
• Why?
• What can we do about it?
• Basic notions in dependable systems and replication for fault tolerance
(sv. Pålitliga datorsystem)

What is dependability?
Property of a system which allows reliance to be justifiably placed on the service it delivers. [Avizienis et al.]
The ability to avoid service failures that are more frequent or more severe than is acceptable.
(sv. Pålitliga datorsystem)


Attributes of dependability
IFIP WG 10.4 definitions:
• Safety: absence of harm to people and environment
• Availability: the readiness for correct service
• Integrity: absence of improper system alterations
• Reliability: continuity of correct service
• Maintainability: ability to undergo modifications and repairs

Reliability
[Sv. Tillförlitlighet]
Means that the system (functionally) behaves as specified, and does so continually over measured intervals of time.
Typical measure in aerospace: 10^-9
Another way of putting it: MTTF – one failure in 10^9 flight hours.
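A quick illustration of what the 10^-9 figure means, assuming a constant failure rate (the exponential reliability model) — an assumption of this sketch, not stated on the slide:

```python
import math

# Assuming a constant failure rate lambda (exponential model):
#   R(t) = exp(-lambda * t),   MTTF = 1 / lambda
failure_rate = 1e-9              # failures per flight hour (the aerospace target above)

mttf_hours = 1 / failure_rate    # 1e9 hours: "one failure in 10^9 flight hours"
flight = 10.0                    # an illustrative 10-hour flight
reliability = math.exp(-failure_rate * flight)

print(f"MTTF = {mttf_hours:.0e} hours")
print(f"P(no failure during a {flight:.0f} h flight) = {reliability:.10f}")
```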


Faults, Errors & Failures
• Fault: a defect within the system or a situation that can lead to failure
• Error: manifestation (symptom) of the fault – an unexpected behaviour
• Failure: system not performing its intended function
Effects in time: permanent / transient / intermittent

Examples
• Year 2000 bug
• Bit flips in hardware due to cosmic radiation in space
• Loose wire
• Aircraft retracting its landing gear while on ground


Fault ⇒ Error ⇒ Failure
• Goal of system verification and validation is to eliminate faults – some will remain…
• Goal of hazard/risk analysis is to focus on important faults
• Goal of fault tolerance is to reduce effects of errors if they appear – eliminate or delay failures

More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance
2. Fault removal
   (Next lecture)
3. Fault tolerance
4. Fault forecasting


Google's 100 min outage
September 2, 2009:
A small fraction of Gmail's servers were taken offline to perform routine upgrades.
"We had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers."

More on dependability
Four approaches [IFIP 10.4]:
1. Fault avoidance
2. Fault removal
3. Fault tolerance
4. Fault forecasting
Fault forecasting?


Fault tolerance
• Means that a system provides a degraded (but acceptable) function
  – Even in presence of faults
  – During a period defined by certain model assumptions
• Foreseen or unforeseen?
  – Fault model describes the foreseen faults

External factors
The film...


Fault models
• Leading to node failures
  – Crash
  – Omission
  – Timing
  – Byzantine
• Leading to channel failures (see the sketch below)
  – Crash (and potential partitions)
  – Message loss
  – Message delay
  – Erroneous/arbitrary messages

On-line fault-management
• Fault detection
  – By program or its environment
• Fault tolerance using redundancy
  – software
  – hardware
• Fault containment by architectural choices
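A minimal sketch of how the channel fault model above can be exercised in simulation: a wrapper that injects message loss, delay and corruption around an otherwise perfect delivery function. The class name, probabilities and delivery callback are illustrative assumptions, not part of the slides.

```python
import random

class FaultyChannel:
    """Wraps a perfect channel and injects channel faults:
    message loss, message delay, and erroneous/arbitrary messages."""

    def __init__(self, deliver, p_loss=0.1, p_delay=0.1, p_corrupt=0.05, max_delay=3):
        self.deliver = deliver          # callback taking (msg, delay)
        self.p_loss, self.p_delay, self.p_corrupt = p_loss, p_delay, p_corrupt
        self.max_delay = max_delay

    def send(self, msg):
        if random.random() < self.p_loss:
            return                                   # omission: message lost
        delay = random.randint(1, self.max_delay) if random.random() < self.p_delay else 0
        if random.random() < self.p_corrupt:
            msg = "<garbled>"                        # erroneous/arbitrary message
        self.deliver(msg, delay)                     # a timing fault if delay exceeds the deadline

# Example: count how many of 1000 messages survive the faulty channel
received = []
ch = FaultyChannel(lambda m, d: received.append((m, d)))
for i in range(1000):
    ch.send(f"m{i}")
print(f"{len(received)} of 1000 messages delivered")
```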


Redundancy
From D. Lardner, Edinburgh Review, year 1824:
"The most certain and effectual check upon errors which arise in the process of computation is to cause the same computations to be made by separate and independent computers*; and this check is rendered still more decisive if their computations are carried out by different methods."
* people who compute

Static Redundancy
Used all the time (whether an error has appeared or not), just in case…
– SW: N-version programming
– HW: Voting systems
– Data: parity bits, checksums
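A small sketch of the voting idea applied to N-version programming: run independently developed versions and mask a single faulty one by majority vote. The three version_* functions are made-up stand-ins, not from the slides.

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a majority of replicas, or None if no majority exists."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

# Three "independently developed" versions of the same computation
# (trivially different squaring implementations; version_3 contains a design fault).
def version_1(x): return x * x
def version_2(x): return x ** 2
def version_3(x): return x * x + 1

versions = [version_1, version_2, version_3]
x = 7
print(majority_vote([v(x) for v in versions]))   # 49: the single faulty version is masked
```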


Dynamic Redundancy
Used when an error appears and specifically aids the treatment
– SW: Recovery methods, exceptions (sketched below)
– HW: Switching to a back-up module
– Data: Self-correcting codes
– Time: Re-computing a result

Server replication models
• Passive replication
• Active replication
(Figure: passive and active replication arrangements; legend: X denotes a replica group)
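A minimal sketch of the "SW: recovery methods" bullet as a recovery block: try a primary alternative, check its output with an acceptance test, and fall back to a secondary if the test fails or an exception is raised. Function names and the toy acceptance test are assumptions for illustration.

```python
def recovery_block(alternatives, acceptance_test, *args):
    """Try each alternative in turn; return the first result that passes the acceptance test."""
    for alt in alternatives:
        try:
            result = alt(*args)
            if acceptance_test(result):
                return result            # error treated by switching to another alternative
        except Exception:
            pass                         # a raised exception counts as a detected error
    raise RuntimeError("all alternatives failed the acceptance test")

# Primary: a (deliberately) buggy square root; secondary: a simple fallback.
def fast_sqrt(x):  return -1.0           # stand-in for a faulty primary version
def slow_sqrt(x):  return x ** 0.5

ok = lambda r: r >= 0 and abs(r * r - 9.0) < 1e-6   # acceptance test for sqrt(9)
print(recovery_block([fast_sqrt, slow_sqrt], ok, 9.0))   # 3.0, obtained via the backup
```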


Increasing availability
• Active replication
  – Group membership: who is up, who is down?
• Passive replication
  – Primary–backup: bring the secondary server up to date (sketch below)
• What do we need to implement to support transition to operational mode?
  – Message ordering
  – Agreement among replicas

Relating states
Needs a notion of time (precedence) in distributed systems
Recall from lecture 4:
• If clocks are synchronised
  – The rate of processing at each node can be related
  – Timeouts can be used to detect faults as opposed to message delays
• Asynchronous systems:
  – Abstract time via order of events
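A minimal primary–backup (passive replication) sketch: the primary applies updates and forwards its state to the backup, and a missed heartbeat (timeout) triggers fail-over. The class names and the in-process "network" are assumptions for illustration, and the timeout approach presumes roughly synchronised clocks as discussed above.

```python
import time

class Backup:
    def __init__(self):
        self.state = {}
        self.last_heartbeat = time.monotonic()

    def checkpoint(self, state):
        # The primary pushes its state so the backup stays "up to date"
        self.state = dict(state)
        self.last_heartbeat = time.monotonic()

    def primary_suspected(self, timeout=0.5):
        # Timeout-based fault detection
        return time.monotonic() - self.last_heartbeat > timeout

class Primary:
    def __init__(self, backup):
        self.state, self.backup = {}, backup

    def update(self, key, value):
        self.state[key] = value
        self.backup.checkpoint(self.state)   # passive replication: the backup is otherwise idle

backup = Backup()
primary = Primary(backup)
primary.update("sensor", 42)

time.sleep(0.6)                              # primary stays silent longer than the timeout
if backup.primary_suspected():
    print("fail-over: backup takes over with state", backup.state)
```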


Distributed snapshot
• Vector clocks help to synchronise at event level (sketch below)
  – Consistent snapshots
• But reasoning about response times and fault tolerance needs quantitative bounds

"Chicken and egg" problem
• Replication is useful in presence of failures if there is a consistent common state among replicas
• To get consistency, processes need to communicate their state via broadcast
• But broadcasts are distributed algorithms that run on every node
• and can be affected by failures...
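A minimal vector-clock sketch (an illustration, not taken from the slides): each process keeps a vector of counters, increments its own entry on local events, and merges on message receipt; happens-before then corresponds to component-wise comparison.

```python
def vc_new(n):            return [0] * n
def vc_tick(vc, i):       vc[i] += 1                 # local event at process i
def vc_merge(vc, other, i):
    # On message receipt at process i: component-wise max, then tick the own entry
    for k in range(len(vc)):
        vc[k] = max(vc[k], other[k])
    vc[i] += 1

def happened_before(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

# Two processes: p0 sends a message to p1
p0, p1 = vc_new(2), vc_new(2)
vc_tick(p0, 0)                    # p0: local event + send  -> [1, 0]
msg_timestamp = list(p0)
vc_merge(p1, msg_timestamp, 1)    # p1: receive             -> [1, 1]

print(happened_before(msg_timestamp, p1))   # True: the send happened before the receive
```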


Desirable broadcast properties
• Reliable broadcast
  – all non-crashed processes agree on messages delivered (agreement)
  – no spurious messages (integrity)
  – all messages broadcast by non-crashed processes are delivered (validity)
All or none!

How to implement?
• The first step is to separate the underlying network (transport) and the broadcast mechanism
• Distinguish between receipt and delivery of a message


Desirable broadcast properties
• Reliable broadcast
  – all non-crashed processes agree on messages delivered (agreement)
  – no spurious messages (integrity)
  – all messages broadcast by non-crashed processes are delivered (validity)
All or none!

(Figure: layering — the application layer uses Broadcast and Deliver (and Send); the broadcast protocol is implemented on top of the Transport layer using Send and Receive)


Example implementation
At every node n:
• Execute broadcast(m) by:
  – adding sender(m) and a unique ID as a header to the message m (building m)
  – send(m) to all neighbours, including itself
• When receive(m):
  – if deliver(m) has not been executed previously then
    • if sender(m) /= n then send(m) to all neighbours
    • deliver(m)
(Python sketch below)

What if n fails?
Directly after a receipt?
While relaying?
After sending to some but not all neighbours?
This is where failure models come in...
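A runnable sketch of the algorithm above, simulating a fully connected group in one process with a FIFO queue as the network. It only illustrates the relay-before-deliver structure; the "What if n fails?" questions are not modelled here, and all class and variable names are illustrative.

```python
from collections import deque

class Node:
    def __init__(self, name, network, group):
        self.name, self.network, self.group = name, network, group
        self.seen = set()              # messages already delivered (at-most-once delivery)
        self.delivered = []

    def broadcast(self, payload):
        msg = (self.name, payload)     # header: sender plus a unique ID (the payload here)
        for dest in self.group:        # send(m) to all neighbours, including itself
            self.network.append((dest, msg))

    def receive(self, msg):
        if msg in self.seen:           # deliver each message at most once
            return
        self.seen.add(msg)
        sender, _ = msg
        if sender != self.name:        # relay to all neighbours before delivering
            for dest in self.group:
                self.network.append((dest, msg))
        self.delivered.append(msg)

# A fully connected group of three nodes sharing one FIFO queue as the "network"
network = deque()
names = ["p1", "p2", "p3"]
nodes = {n: Node(n, network, names) for n in names}

nodes["p1"].broadcast("hello")
while network:                         # pump the network until no messages remain
    dest, msg = network.popleft()
    nodes[dest].receive(msg)

print({n: nodes[n].delivered for n in names})   # every node delivered ('p1', 'hello') once
```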


Algorithms for broadcast
• Correctness: prove validity, integrity, agreement, order
In presence of a given failure model!
Typical assumptions:
– no link failures leading to partition
– send does not duplicate or change messages
– receive does not "invent" messages

The consensus problem
Assume that we have a reliable broadcast
• Processes p1, …, pn take part in a decision
• Each pi proposes a value vi
  – e.g. application state info
• All correct processes decide on a common value v that is equal to one of the proposed values


Desired properties
• Every non-faulty process eventually decides (Termination)
• No two non-faulty processes decide differently (Agreement)
• If a process decides v then the value v was proposed by some process (Validity)
In presence of a given failure model
Algorithms for consensus have to be proven to have these properties

Basic impossibility result
[Fischer, Lynch and Paterson 1985]
There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure


A way around it
Assume synchrony:
• Distributed computations proceed in rounds initiated by pulses
• Pulses implemented using local physical clocks, synchronised assuming bounded message delays
(Sketch of a round-based protocol below)

Can I prove any protocol correct?
• NO!
• It depends on the fault model:
  – Assumptions on how the nodes/channels can fail
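A sketch of how synchrony helps (not a protocol from the slides): with reliable channels and at most t crash faults, t+1 rounds of flooding the set of known values and then deciding the minimum solves consensus. Rounds are simulated; a crash is modelled, as a simplification, by a process that simply stops sending after some round.

```python
def flooding_consensus(proposals, crashed_after_round, t):
    """proposals: pid -> proposed value; crashed_after_round: pid -> last round it sends in.
    Returns the decision of every process that never crashes; tolerates up to t crash faults."""
    known = {p: {v} for p, v in proposals.items()}
    for rnd in range(1, t + 2):                          # t + 1 synchronous rounds
        msgs = {p: set(vals) for p, vals in known.items()
                if crashed_after_round.get(p, t + 2) >= rnd}   # crashed processes stay silent
        for p in known:                                  # everyone receives this round's messages
            for vals in msgs.values():
                known[p] |= vals
    return {p: min(known[p]) for p in proposals
            if crashed_after_round.get(p, t + 2) > t + 1}

# Three processes, p1 crashes after round 1 (so t = 1)
decisions = flooding_consensus({"p1": 3, "p2": 7, "p3": 5},
                               crashed_after_round={"p1": 1}, t=1)
print(decisions)   # the surviving processes agree: {'p2': 3, 'p3': 3}
```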


Byzantine agreement
Problem:
• Nodes need to agree on a decision but some nodes are faulty and act in an arbitrary way (can be malicious)

Byzantine agreement protocol
• Proposed in 1980 by Pease, Shostak and Lamport
How many faulty nodes can the algorithm tolerate?


Result from 1980
• Theorem: There is an upper bound t on the number of Byzantine node failures compared to the size of the network N: N ≥ 3t + 1
• Gives a t+1 round algorithm for solving consensus in a synchronous network
• Here:
  – We only demonstrate that N = 3 nodes would not be enough to tolerate t = 1!

Scenario 1
• G and L1 are correct, L2 is faulty
(Figure: G sends 1 to both L1 and L2; L1 relays "G said 1" to L2, while the faulty L2 tells L1 "G said 0")


Scenario 2
• G and L2 are correct, L1 is faulty
(Figure: G sends 0 to both L1 and L2; L2 relays "G said 0" to L1, while the faulty L1 tells L2 "G said 1")

Scenario 3
• The general is faulty!
(Figure: G sends 1 to L1 and 0 to L2; both lieutenants relay honestly — L1 tells L2 "G said 1" and L2 tells L1 "G said 0")
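A small simulation of the three scenarios (an illustration, not the Pease–Shostak–Lamport protocol): each lieutenant's two-round view is the value it received from G plus what the other lieutenant reported, and the code checks the indistinguishability argument used on the next slide.

```python
# Each lieutenant's two-round view: (value received from G, what the other lieutenant reported).
def views(g_to_l1, g_to_l2, l1_reports, l2_reports):
    return {"L1": (g_to_l1, l2_reports), "L2": (g_to_l2, l1_reports)}

# Scenario 1: G correct (sends 1), L2 faulty and lies "G said 0"
s1 = views(g_to_l1=1, g_to_l2=1, l1_reports=1, l2_reports=0)
# Scenario 2: G correct (sends 0), L1 faulty and lies "G said 1"
s2 = views(g_to_l1=0, g_to_l2=0, l1_reports=1, l2_reports=0)
# Scenario 3: G faulty, sends 1 to L1 and 0 to L2; both lieutenants relay honestly
s3 = views(g_to_l1=1, g_to_l2=0, l1_reports=1, l2_reports=0)

print(s1["L1"] == s3["L1"])   # True: L1 cannot distinguish scenario 1 from scenario 3
print(s2["L2"] == s3["L2"])   # True: L2 cannot distinguish scenario 2 from scenario 3

# Any deterministic rule that is correct when G is correct must make L1 decide 1 on its
# scenario-1 view and L2 decide 0 on its scenario-2 view. Because those views reappear in
# scenario 3, the two correct lieutenants disagree there.
decide = lambda view: view[0]               # one such rule: "obey the value G sent you"
print(decide(s3["L1"]), decide(s3["L2"]))   # 1 0 -> disagreement with N = 3, t = 1
```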


2-round algorithm
… does not work with t = 1, N = 3!
• Seen from L1, scenarios 1 and 3 are identical, so if L1 decides 1 in scenario 1 it will decide 1 in scenario 3
• Similarly for L2: if it decides 0 in scenario 2, it decides 0 in scenario 3
• L1 and L2 do not agree in scenario 3!

Summary
• Dependable systems need to justify why services can be relied upon
• To tolerate faults we normally deploy replication
• Replication needs (some kind of) agreement on state
• Agreement problems are hard to prove correct in distributed systems
• They require assumptions on faults, and (some kind of) synchrony


Real-time
• Remember, in a TTP bus:
  – Nodes have synchronised clocks
  – The communication infrastructure (the collection of CCs and the TTP bus communication) guarantees a bounded time for new data appearing on the communication interface of a node (its CNI)
  – Byzantine faults in nodes can be detected by other nodes!
• Exercise: Which faults are detectable in a system connected with a CAN bus?

Timing and fault tolerance
• Implementing fault tolerance often relies on support for timing guarantees
• Exercise: How do faults affect support for timing?
  – How do faults affect response times in nodes?
  – How do fault tolerance mechanisms affect timing?

