SENG 521 Software Reliability & Software Quality
Chapter 5: Overview of Software Reliability Engineering
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far ([email protected])
http://www.enel.ucalgary.ca/People/far/Lectures/SENG521/

Reliability Theory
Reliability theory developed apart from the mainstream of probability and statistics, and was used primarily as a tool to help nineteenth century maritime and life insurance companies compute profitable rates to charge their customers.
Even today, the terms “failure rate” and “hazard rate” are often used interchangeably.
Probability of survival of merchandise after one MTTF is $R = e^{-1} \approx 0.37$.
(From Engineering Statistics Handbook)

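The 0.37 figure can be checked numerically. The sketch below assumes the exponential reliability model R(t) = exp(-t / MTTF), i.e. a constant failure rate; the slide states only the result, and the function name is hypothetical.

```python
import math

def survival_probability(t, mttf):
    """Reliability R(t) = exp(-t / MTTF), assuming a constant failure rate."""
    if mttf <= 0:
        raise ValueError("mttf must be positive")
    return math.exp(-t / mttf)

# Survival after exactly one MTTF; the result does not depend on the MTTF value.
r_at_mttf = survival_probability(t=1000.0, mttf=1000.0)
print(round(r_at_mttf, 2))  # 0.37
```

Under this assumption, roughly 63% of units have failed by the time one MTTF has elapsed, so MTTF should not be read as a guaranteed lifetime.
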
Reliability: Natural System
Natural system life cycle.
Aging effect: Life span of a natural system is limited by the maximum reproduction rate of the cells.
(Figure from Pressman’s book)

Reliability: Hardware
Hardware life cycle.
Useful life span of a hardware system is limited by the age (wear out) of the system.
(Figure from Pressman’s book)

Reliability: Software
Software life cycle.
Software systems are changed (updated) many times during their life cycle.
Each update adds to the structural deterioration of the software system.
(Figure from Pressman’s book)

Software vs. Hardware
Software reliability doesn’t decrease with time, i.e., software doesn’t wear out.
Hardware faults are mostly physical faults, e.g., fatigue.
Software faults are mostly design faults, which are harder to measure, model, detect and correct.

Software vs. Hardware
Hardware failure can be “fixed” by replacing a faulty component with an identical one, therefore no reliability growth.
Software problems can be “fixed” by changing the code in order to have the failure not happen again, therefore reliability growth is present.
Software does not go through production phase the same way as hardware does.
Conclusion: hardware reliability models may not be used identically for software.

Reliability: Science
Exploring ways of implementing “reliability” in software products.
Reliability Science’s goals:
  Developing “models” (regression and aggregation models) and “techniques” to build reliable software.
  Testing such models and techniques for adequacy, soundness and completeness.

What is Engineering?
Engineering = Analysis + Design + Construction + Verification + Management
  What is the problem to be solved?
  What characteristics of the entity are used to solve the problem?
  How will the entity be realized?
  How is it constructed?
  What approach is used to uncover errors in design and construction?
  How will the entity be supported in the long term?

Reliability: Engineering /1
Engineering of “reliability” in software products.
Reliability Engineering’s goal: developing software to reach the market
  With “minimum” development time
  With “minimum” development cost
  With “maximum” reliability
  With “minimum” expertise needed
  With “minimum” available technology

Reliability: Engineering /2
Software quality means getting the right balance among development cost, development time, people, technology and reliability.
(Figure: Minimum & Maximum of Cost, Time, People, Technology, Reliability; SRE; Optimum)
Pick quantitative representations for the 5 factors (cost, time, people, technology and reliability) and measure them!

What is SRE? /1
Software Reliability Engineering (SRE) is a multi-faceted discipline covering the software product lifecycle.
It involves both technical and management activities in three basic areas:
  Software Development and Maintenance
  Measurement and Analysis of reliability data
  Feedback of reliability information into the software lifecycle activities.

What is SRE? /2
SRE is a practice for quantitatively planning and guiding software development and test, with emphasis on reliability and availability.
SRE simultaneously does three things:
  It ensures that product reliability and availability meet user needs.
  It delivers the product to market faster.
  It increases productivity, lowering product life-cycle cost.
In applying SRE, one can vary relative emphasis placed on these three factors.

Software Reliability Engineering (SRE) Process

Reference
Dr. Musa’s Software Reliability Engineering, 2nd Ed., Chapter 1

SRE: Process /1
There are 5 steps in SRE process (for each system to test):
1. Define necessary reliability
2. Develop operational profiles
3. Prepare for test
4. Execute test
5. Apply failure data to guide decisions

SRE: Process /2
(Figure: Modified version of the SRE Process. Ref: Musa’s book 2nd Ed)
The Develop Operational Profiles and Prepare for Test activities all start during the Requirements (and perhaps architectural analysis) phase of the software development process.
They all extend to varying degrees into the Design and Implementation phase, as they can be affected by it.
The Execute Test and Guide Test activities coincide with the Test phase.

SRE: Necessary Reliability
Define what “failure” means for the software product.
Choose a common measure for all failure intensities, either failures per some natural unit or failures per hour.
Set the total system failure intensity objective (FIO) for the software/hardware system.
Compute a developed software FIO by subtracting the total of the FIOs of all hardware and acquired software components from the system FIOs.
Use the developed software FIOs to track the reliability growth during system test (later on).

Failure Intensity Objective (FIO)
Failure intensity ($\lambda$) is defined as failures per natural unit (or time), e.g.
  3 alarms per 100 hours of operation.
  5 failures per 1000 transactions, etc.
Failure intensity of a cascade (serial) system is the sum of failure intensities for all of the components of the system.
For exponential model:
$z_{\text{system}}(t) = \lambda_1 + \lambda_2 + \cdots + \lambda_n = \sum_{i=1}^{n} \lambda_i$

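The two FIO calculations above (summing component intensities for a serial system, and subtracting acquired-component FIOs from the system FIO) can be sketched as follows. The function names and all numeric values are hypothetical and chosen only for illustration; every intensity is assumed to be expressed in the same unit.

```python
def serial_failure_intensity(component_intensities):
    """Failure intensity of a cascade (serial) system: sum over its components."""
    return sum(component_intensities)

def developed_software_fio(system_fio, acquired_fios):
    """System FIO minus the FIOs of all hardware and acquired software components."""
    remaining = system_fio - sum(acquired_fios)
    if remaining <= 0:
        raise ValueError("acquired components use up the whole system FIO")
    return remaining

# Hypothetical numbers, all in failures per 1000 hours.
system_fio = 100.0
acquired = {"hardware": 10.0, "operating_system": 20.0, "database": 5.0}

software_fio = developed_software_fio(system_fio, acquired.values())
print(software_fio)  # 65.0

# Adding the developed software back in recovers the system objective.
print(serial_failure_intensity([*acquired.values(), software_fio]))  # 100.0
```

The error branch is a design choice of this sketch: if the acquired components alone meet or exceed the system objective, no positive budget is left for the developed software and the objective has to be revisited.
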
How to Set FIO?
Setting FIO in terms of system reliability (R) or availability (A):
$\lambda = \dfrac{-\ln R}{t}$ or $\lambda \approx \dfrac{1 - R}{t}$ for $R \geq 0.95$
$\lambda = \dfrac{1 - A}{t_m A}$
$\lambda$ is failure intensity
R is reliability
A is availability
t is natural unit (time, etc.)
$t_m$ is downtime per failure

Reliability vs. Failure Intensity
Reliability for 1 hour mission time    Failure intensity
0.36800                                1 failure / hour
0.90000                                105 failures / 1000 hours
0.95900                                1 failure / day
0.99000                                10 failures / 1000 hours
0.99400                                1 failure / week
0.99860                                1 failure / month
0.99900                                1 failure / 1000 hours
0.99989                                1 failure / year

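The conversions between reliability, availability and failure intensity can be sketched as below. This assumes the exponential model R = exp(-λt) that underlies the formulas above; the function names are hypothetical, and the two printed values correspond to the "1 failure / day" and "1 failure / 1000 hours" rows of the table, rounded to three decimals.

```python
import math

def fio_from_reliability(r, t=1.0):
    """Exact form: lambda = -ln(R) / t."""
    if not 0.0 < r <= 1.0 or t <= 0:
        raise ValueError("need 0 < R <= 1 and t > 0")
    return -math.log(r) / t

def fio_from_reliability_approx(r, t=1.0):
    """Approximation lambda ~ (1 - R) / t, intended for R close to 1."""
    return (1.0 - r) / t

def fio_from_availability(a, t_m):
    """lambda = (1 - A) / (t_m * A), with t_m the downtime per failure."""
    if not 0.0 < a <= 1.0 or t_m <= 0:
        raise ValueError("need 0 < A <= 1 and t_m > 0")
    return (1.0 - a) / (t_m * a)

def reliability(lam, t=1.0):
    """Inverse relation: R = exp(-lambda * t)."""
    return math.exp(-lam * t)

# Failure intensities in failures per hour, mission time of 1 hour.
print(round(reliability(1 / 24), 3))    # 1 failure / day -> 0.959
print(round(reliability(1 / 1000), 3))  # 1 failure / 1000 hours -> 0.999
```

At R = 0.95 the exact and approximate forms give about 0.0513 and 0.05 failures per unit, which shows why the approximation is reserved for high reliability values.
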
SRE: Operation
An operation is a major system logical task, which returns control to the system when complete.
An operation is an input event that affects the course of behavior of software.
Example: operations for a Web proxy server
  Connect internal users to external Web
  Email internal users to external users
  Email external users to internal users
  DNS request by internal users
  Etc.

SRE: Operational Mode
Operational mode is a distinct pattern of system use and/or set of environmental conditions that may need separate testing due to likelihood of stimulating different failures.
Example:
  Time (time of year, day of week, time of day)
  Different user types (customer or user)
  User experience (novice or expert)
The same operation may appear in different operational modes with different probabilities.

SRE: Operational Profile
An operational profile is a complete set of operations with their probabilities of occurrence (during the operational use of the software).
An operational profile is a description of the distribution of input events that is expected to occur in actual software operation.
The operational profile of the software reflects how it will be used in practice.
(Figure: probability of occurrence plotted against operation and operational mode)

SRE: System Operational Profile
System operational profile must be developed for all of its important operational modes.
There are four principal steps in developing an operational profile:
1. Identify the operation initiators (i.e., user types, external systems, and the system itself)
2. List the operations invoked by each initiator
3. Determine the occurrence rates
4. Determine the occurrence probabilities by dividing the occurrence rates by the total occurrence rate

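The last step of building an operational profile (dividing each occurrence rate by the total occurrence rate) can be sketched as follows. The operation names come from the Web proxy server example; the occurrence rates are hypothetical numbers invented for illustration, and the function name is an assumption.

```python
def occurrence_probabilities(occurrence_rates):
    """Turn occurrence rates into probabilities by dividing by the total rate."""
    total = sum(occurrence_rates.values())
    if total <= 0:
        raise ValueError("total occurrence rate must be positive")
    return {op: rate / total for op, rate in occurrence_rates.items()}

# Hypothetical occurrence rates (operations per hour) for one operational mode.
rates = {
    "Connect internal users to external Web": 600,
    "Email internal users to external users": 250,
    "Email external users to internal users": 100,
    "DNS request by internal users": 50,
}

profile = occurrence_probabilities(rates)
for operation, probability in profile.items():
    print(f"{probability:.2f}  {operation}")
```

A separate table of this kind would be built for each important operational mode, since the same operation can occur with different probabilities in different modes.
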
SRE: Prepare for Test
The Prepare for Test activity uses the operational profiles to prepare test cases and test procedures.
Test cases are allocated in accordance with the operational profile.
Test cases are assigned to the operations by selecting from all the possible intra-operation choices with equal probability.
The test procedure is the controller that invokes test cases during execution.

SRE: Execute Test
Allocate test time among the associated systems and types of test (feature, load, regression, etc.).
Invoke the test cases at random times, choosing operations randomly in accordance with the operational profile.
Identify failures, along with when they occur.
This information will be used in Apply Failure Data and Guide Test.

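Allocating test cases and choosing operations in accordance with the operational profile can be sketched as below. The profile values and short operation names are hypothetical, and the rounding rule (largest remainder) is an assumption of this sketch, not something the slides prescribe.

```python
import random
from collections import Counter

def allocate_test_cases(profile, total_cases):
    """Split total_cases among operations in proportion to their probabilities."""
    raw = {op: p * total_cases for op, p in profile.items()}
    counts = {op: int(x) for op, x in raw.items()}  # round down first
    leftover = total_cases - sum(counts.values())
    # Hand out any remaining cases to the largest fractional remainders.
    by_remainder = sorted(raw, key=lambda op: raw[op] - counts[op], reverse=True)
    for op in by_remainder[:leftover]:
        counts[op] += 1
    return counts

def pick_operations(profile, n, rng):
    """Draw n operations at random, weighted by the operational profile."""
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=n)

# Hypothetical operational profile (probabilities sum to 1).
profile = {"web": 0.60, "email_out": 0.25, "email_in": 0.10, "dns": 0.05}

print(allocate_test_cases(profile, 200))
# {'web': 120, 'email_out': 50, 'email_in': 20, 'dns': 10}

rng = random.Random(521)  # fixed seed so a test run can be repeated
sequence = pick_operations(profile, 1000, rng)
print(Counter(sequence))
```

Seeding the generator keeps the random invocation order reproducible, which helps when a failure has to be replayed.
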
Types of Test
Certification Test: Accept or reject (binary decision) an acquired component for a given target failure intensity.
Feature (Unit) Test: A single execution of an operation with interaction between operations minimized.
Load Test: Testing with field use data and accounting for interactions.
Regression Test: Feature tests after every build involving significant change, i.e., check whether a bug fix worked.

SRE: Apply Failure Data
Plot each new failure as it occurs on a reliability demonstration chart.
Accept or reject software (operations) using reliability demonstration chart.
Track reliability growth as faults are removed.

Release Criteria
Consider releasing the product when:
1. All acquired components pass certification test.
2. Test terminated satisfactorily for all the product variations and components, with the failure intensity reaching the target $\lambda_F$.
For better confidence, we usually require the ratio $\lambda / \lambda_F$ to be below 0.5 (confidence factor).

Collect Field Data
SRE for the software product lifecycle.
Collect field data to use in succeeding releases either using automatic reporting routines or manual collection, using a random sample of field sites.
Collect data on failure intensity and on customer satisfaction and use this information in setting the failure intensity objective for the next release.
Measure operational profiles in the field and use this information to correct the operational profiles we estimated.
Collect information to refine the process of choosing reliability strategies in future projects.

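The release criteria can be expressed as a simple check. This is a minimal sketch: the function name and arguments are hypothetical, and the estimated failure intensity λ is taken as an input, since how it is estimated from test data is not covered here.

```python
def ready_for_release(certification_passed, failure_intensity, target_fio,
                      confidence_factor=0.5):
    """Return True when both release criteria hold.

    certification_passed: one boolean per acquired component.
    failure_intensity: estimated lambda at the end of test.
    target_fio: lambda_F, in the same units as failure_intensity.
    """
    if target_fio <= 0:
        raise ValueError("target_fio must be positive")
    ratio = failure_intensity / target_fio
    return all(certification_passed) and ratio < confidence_factor

# Hypothetical values, in failures per 1000 hours.
print(ready_for_release([True, True], failure_intensity=4.0, target_fio=10.0))   # True
print(ready_for_release([True, True], failure_intensity=6.0, target_fio=10.0))   # False
print(ready_for_release([True, False], failure_intensity=1.0, target_fio=10.0))  # False
```
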
However …
Practical implementation of an effective SRE program is a non-trivial task.
Mechanisms for collection and analysis of data on software product and process quality must be in place.
Fault identification and elimination techniques must be in place.
Other organizational abilities such as the use of reviews and inspections, reliability based testing and software process improvement are also necessary for effective SRE.