
University of Tennessee, Knoxville
Trace: Tennessee Research and Creative Exchange

The Harlan D. Mills Collection
Science Alliance

11-1-1990

Engineering Software Under Statistical Quality Control
R. H. Cobb
Harlan D. Mills

Follow this and additional works at: http://trace.tennessee.edu/utk_harlan

Recommended Citation
Cobb, R. H. and Mills, Harlan D., "Engineering Software Under Statistical Quality-Control" (1990). The Harlan D. Mills Collection. http://trace.tennessee.edu/utk_harlan/14

This Article is brought to you for free and open access by the Science Alliance at Trace: Tennessee Research and Creative Exchange. It has been accepted for inclusion in The Harlan D. Mills Collection by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more information, please contact [email protected].

Engineering Software under Statistical Quality Control

Richard H. Cobb and Harlan D. Mills, Software Engineering Technology

The cost to society of continuing to develop failure-laden software, with its associated low productivity, is unacceptable. Cleanroom engineering promises lower costs and improved quality.

Society has been developing software for less than one human generation. We have accomplished a great deal in this first generation when compared to the accomplishments of other disciplines: During the first generation of civil engineering, the right triangle hadn't been invented; accountants did not discover double-entry concepts in the early generations of their field.

Yet despite such significant progress, software-development practices need improvement. We must solve such problems as
- execution failures, which exist to the extent that software failures are accepted as normal by most people,
- projects that are late and/or over budget, and
- the labor-intensive nature of software development - productivity increases have been modest since the introduction of Cobol.

And at the same time that we are having difficulty producing reliable software, there is a continuing demand for even more complex, larger software systems.

These problems are symptoms of a process that is not yet under intellectual control. An activity is under intellectual control when the people performing it use a theoretically sound process that gives each of them a high probability of obtaining a commonly accepted correct answer. When most endeavors begin, they are out of intellectual control. Intellectual control is achieved when theories are developed, implementation practices are refined, and people are taught the process.

A good example is long division. For many generations, performing division with Roman numerals was error-prone. Today, children who learn how to do long division with Arabic numerals obtain the correct answer most of the time. The long-division algorithm is:
1. If the division is not complete, invent (estimate) the next digit.

44 0740-7459/90/1100/0044/$01.00 © 1990 IEEE    IEEE Software

2. Verify the invention (estimate) made in step 1.
3. If the verification is correct and the division is not complete, repeat step 1 for the next digit; if the verification is not correct, repeat step 1 for the same digit by adjusting the invention.

This is a powerful algorithm for establishing intellectual control. A difficult problem, which on the surface seems to require a large invention, has been divided into a series of smaller problems, each requiring a smaller invention. Most important, each inventive step is followed immediately by a verification step to appraise the invention's correctness, so subsequent inventions don't build on incorrect results.

This algorithm also applies to software design and development. As software technologists strive to find better ways to develop software, we believe that they are hindered by some widely accepted beliefs about how to develop software. We believe that if we adopt new perspectives about these development myths, we will open the way to development practices that permit the construction of software that contains few, if any, latent failures.

We have used new perspectives to derive Cleanroom engineering practices. Cleanroom engineering develops software under statistical quality control by
- specifying statistical usage,
- defining an incremental pipeline for software construction that permits statistical testing, and
- separating development and testing (only testers compile and execute the software being developed).

These practices have been demonstrated to provide higher quality software - software with fewer latent execution failures. These same engineering practices also have been observed to improve productivity. Table 1 summarizes some quality and productivity metrics for projects using some or all of these new software-development practices.

Software myths

Myth: Software failures are unavoidable. This myth holds that software always contains latent execution failures that will be found by users. Therefore, we must learn to live with and manage around software failures.

Fact: Like other engineering activities, engineering software is a human activity subject to human fallibilities. Yet other engineering disciplines have learned how to design large and complex products with a low probability that the designs contain faults that will cause latent execution failures. When structural engineers design a bridge, there is a high expectation that the bridge, when built, will not fall down. In other engineering disciplines, design failures are neither anticipated nor accepted as normal. When a failure happens, major investigations are undertaken to determine why it occurred. Other engineering professions have minimized error by developing a sound, theoretical base on which to build design practices.

But because software developers expect and accept design failures, software users cannot have the same high expectations as users of other products. We believe this is because software developers rely on an incomplete theory, so their engineering practices don't work. Software engineers should be required to use engineering practices that produce software that does not contain faults that will cause latent execution failures. Users want the same high assurance that software will work according to its specification that they have for products designed by other engineers.

The software profession is young, so we might want to start with modest goals, such as: Design and implement a 100,000-line system so, more often than not, no execution failure will be detected during the software's entire field life. Even this modest goal is beyond our expectations using the development practices we now rely on. We believe such a goal is well within our capabilities if we use the ideas summarized in this article. For example, the software for the IBM Wheelwriter typewriter, developed using some of these ideas, has been in use for more than six years with millions of users, and no software failure has ever been reported.

Myth: Quality costs money. Many people believe that software designed to execute with no or few failures costs more per line of code to produce.

Fact: Failures and cost are positively correlated. It is more expensive to remove latent execution failures designed into the software than to rigorously design the software to prevent execution failures. For example, touch-typing is both more reliable and productive than hunt-and-peck typing. We believe - and experience to date supports our belief - that as software developers move from today's heuristic programming to rigorous software engineering, quality will increase and design and development costs will decrease.
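The invent-and-verify pattern of the long-division algorithm described earlier can be sketched in code. This is our own illustration (the function and its structure are ours, not the article's): each invented digit is verified before the next step builds on it, so later digits never rest on an incorrect result.

```python
def long_divide(dividend, divisor):
    """Long division by repeated invent-then-verify steps."""
    quotient = 0
    remainder = 0
    for ch in str(dividend):
        remainder = remainder * 10 + int(ch)  # bring down the next digit
        # Step 1: invent (estimate) the next quotient digit.
        digit = remainder // divisor
        # Step 2: verify the invention before accepting it.
        assert 0 <= remainder - digit * divisor < divisor, "estimate wrong; adjust it"
        # Step 3: verification passed, so accept the digit and continue.
        quotient = quotient * 10 + digit
        remainder -= digit * divisor
    return quotient, remainder

print(long_divide(9876, 42))  # → (235, 6)
```

The assertion plays the role of the verification step: an estimate is never carried forward until it has been appraised.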

Table 1. Selected sample of Cleanroom projects. (All other projects known to the authors report substantial improvements in quality and productivity.)

Year: 1980
Applied technologies: Stepwise refinement, functional verification
Implementation: Census, 25 KLOC (Pascal)
Results: No failure ever found. Programmer received gold medal from Baldridge.

Year: 1983
Applied technologies: Functional verification, inspections
Implementation: Wheelwriter, 63 KLOC, three processors
Results: Millions of users. No failure ever found.

Year: 1980s
Applied technologies: Functional verification, inspections
Implementation: Space shuttle, 500 KLOC
Results: Low defect rate over entire function. No defect in any flight. Work received NASA's Quality Award.

Year: 1987
Applied technologies: Cleanroom engineering
Implementation: Flight control, 33 KLOC (Jovial), three increments
Results: Completed ahead of schedule. 2.5 errors/KLOC before any execution. Error-fix effort reduced by a factor of five.

Year: 1988
Applied technologies: Cleanroom engineering
Implementation: Commercial product, 80 KLOC (PL/I)
Results: Certification testing failure rate of 3.4 failures/KLOC. Deployment failures of 0.1/KLOC. Productivity of 740 lines/man-month.

Year: 1989
Applied technologies: Partial Cleanroom engineering
Implementation: Satellite control, 30 KLOC (Fortran)
Results: Certification testing error rate of 3.3 failures/KLOC. 50-percent improvement in quality. Productivity of 780 lines/man-month. 80-percent improvement in productivity.

Year: 1990
Applied technologies: Cleanroom engineering with reuse and new Ada design language
Implementation: Research project, 12 KLOC (Ada and ADL)
Results: Certified to 0.9978 with 989 test cases. 36 failures found during certification (20 logic errors, or 1.7 errors/KLOC).

Myth: Unit verification by debugging works on systems of any size. Unit verification - debugging - is best done by a single programmer who exercises the program with specially constructed test cases. During debugging, the programmer constructs test cases, develops programs to run isolated units of the system, runs the tests, and fixes discrepancies as they are observed. This process continues until the programmer is satisfied the program performs its intended mission.

Fact: Although it is satisfactory when the software product is small, unit verification by debugging does not scale up. When the product is large and unit verification exercises only a small portion of the total system, the results are not satisfactory. Debugging doesn't scale up because it often compromises the design's integrity. Typically, software units are built according to a sound design and fit together according to the design when unit debugging begins. But the fixes introduced during debugging, while they may seem to make individual modules perform their intended mission fully, cause design faults when the fixed modules are combined. These failures are then either found during integration testing or left in the product as latent failures. Debugging seems to produce local correctness and global incorrectness.

Ed Adams examined every failure report for nine of IBM's most widely used software products for several years and traced each to its origin. He found that in most cases the cause of the failure was introduced during an attempt to fix another failure.

Fact: Unit verification by logical argument does scale up. This method of unit verification is based on the time-tested method of proving the correctness of an assertion by developing a proof. A program specification is a function or relation; a program of any size or complexity is a rule for a function. So all you have to do to show the correctness of a program is to show that it is a complete rule for a subset of the specification. Experience indicates that using proof arguments to show program correctness is not an academic curiosity that works on small problems - it is a robust technique that works well on large, complex systems. Table 1 summarizes data for a few projects that used unit verification by logical argument. All our experience with this method indicates that the scale-up problem associated with debugging is very tractable. Unit verification by logical argument seems to work because when a defect is found in a proof argument the focus can't shift to local concerns to make something work - the argument focuses entirely on global issues.

Fact: Unit verification via logical argument is more cost-effective than unit verification via debugging, for four reasons:
- Design errors are caught sooner and as a result are less costly to fix.
- It eliminates the expense of finding the subtle, hard-to-fix failures introduced by debugging.
- It eliminates the expense of building programs to permit unit testing and preparing unit test cases.
- Surprisingly, it takes less time.

Do we really mean that unit tests should not be conducted? Yes. Unit testing is done to demonstrate that the unit satisfies its specification. We believe you can better demonstrate this with logical arguments. So if we don't test units, then what do we test and when? The answer to that question involves another myth.

Myth: The only way to perform unit verification via logical argument is to use a computer program. Researchers have invested significant effort into building programs that use axiomatic arguments to verify programs. These programs, as of now and for the foreseeable future, can verify only small programs using a limited number of language constructs. Developers have not been able to scale up axiomatic verification programs even with today's very fast computers.

Fact: Engineers can verify large programs made up of many language constructs with functional verification. Functional verification, introduced by Richard Linger, Harlan Mills, and Bernard Witt [3], is quite different from axiomatic verification. With functional verification, you structure a proof that a program implements its specification correctly. Again, if a program specification is a function, then a program is a rule for a function. The proof must show that the rule (the program) correctly implements the function (the specification) for the full range of the function and no more.

Linger, Mills, and Witt have developed a correctness theorem that defines what must be shown for each of the structured programming language constructs. The proof strategy is divided into small parts, which are easily accumulated into a proof for a large program. Our experience indicates that people can master these ideas and construct proof arguments for very large programs.

The first reaction of many people is that it must be hard to construct a proof that a program is correct. Our experience indicates that, with a modest amount of training and the opportunity to use the ideas on the job, people can learn to develop proof arguments and talk to other engineers in terms of proofs.

Linger, Mills, Richard Selby, and others have analyzed the performance of software engineers using functional verification to perform unit verification via logical argument. Among their observations:
- Engineers find logic errors with functional verification, leaving only simple errors like syntax oversights to be found during execution testing.
- Many engineers find the mental challenge of functional verification more stimulating and satisfying than debugging.
- Many engineers find the team style associated with functional verification more satisfying than the solo style associated with debugging.
- Engineers can learn how to perform unit verification via functional verification.
- Engineers performing functional verification leave significantly fewer failures to be found during later life-cycle phases than debuggers. Data indicates that functional verification leaves only two to five fixes per thousand lines of code to be made in later phases, compared to 10 to 30 fixes left by unit testing by debugging.

Engineers practicing functional verification complete the total development process with significantly less effort than those practicing unit verification via debugging. Measurements indicate that the improvement in productivity may be three to five times.

Myth: Software is best tested by designing tests that cover every path through the program. This testing method, called coverage testing, requires that the test developer be completely familiar with the software's internal design.

Fact: Statistical usage testing is 20 times more cost-effective in finding execution failures than coverage testing (a claim we will prove later). In statistical usage testing, the test developer draws tests at random from the population of all possible uses of the software, in accordance with the distribution of expected usage. The test developer must understand what the software is intended to do and how it is expected to be used. The test developer then constructs tests that are representative of expected usage. No knowledge of how the software is designed and constructed is required.

Fact: Users observe failures in execution. While developers talk of finding and fixing errors or faults, users don't observe errors or faults. They observe execution failures, which occur when the software doesn't do something it's required to do. When a tester observes an execution failure, the software is searched for a way to prevent it. As a result of the search, changes are made to the code that may or may not fix the failure and may or may not introduce new latent failures. The modifications are counted to obtain a count of software errors or faults. For example, if you change five areas of the program because they were apparently doing something they shouldn't be doing, we say that five errors have been found and fixed. Software failures are precise while software errors are imprecise. It is execution failures that must be found and eliminated from software.

Some execution failures will occur frequently, others infrequently. Coverage testing is as likely to find a rare execution failure as it is a frequent execution failure. Usage testing that matches the actual usage profile has a better chance of finding the execution failures that occur frequently. Therefore, since the goal of a testing program should be to maximize expected mean time to failure, a strategy that concentrates on the failures that occur frequently is more effective than one that has an equal probability of finding high- and low-frequency failures.

Myth: It doesn't matter how many errors or failures are found, as long as they are fixed. Fact: The failure rates of different errors can vary by four orders of magnitude in complex systems. To measure the increased effectiveness of usage testing over coverage testing, you need to know the frequency of rare failures versus frequent failures in a population of programs under test. The Adams study contains one large database we can use to estimate increased effectiveness.
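As a deliberately tiny illustration of the "program is a rule for a function" idea, the sketch below states a specification as a mathematical function and checks that a program is a correct rule for it over a finite domain. The example is ours, not from the article, and the exhaustive check is only a stand-in for the proof argument that functional verification actually uses.

```python
# Specification: the intended function, stated directly.
def spec(n):
    return abs(n)  # f(n) = |n|

# Program: a rule (an algorithm) claimed to implement the function.
def program(n):
    return n if n >= 0 else -n

# Verification obligation: the rule must agree with the function over
# the full domain of interest - here checked exhaustively on a finite range.
assert all(program(n) == spec(n) for n in range(-1000, 1001))
print("program is a complete rule for the specification on [-1000, 1000]")
```

In functional verification proper, the agreement is shown by a proof over the program's structured constructs rather than by enumeration, which is what lets the method scale to large programs.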

Table 2. Software failures for nine major IBM products, classified from rare (group 1) to frequent (group 8).

Group                           1       2       3       4       5       6       7       8
MTTF (years)                5,000   1,580     500     158      50    15.8       5    1.58

Percent failures in class for product
1                            34.2    28.8    17.8    10.3     5.0     2.1     1.2     0.7
2                            34.3    28.0    18.2     9.7     4.5     3.2     1.5     0.7
3                            33.7    28.5    18.0     8.7     6.5     2.8     1.4     0.4
4                            34.2    28.5    18.7    11.9     4.4     2.0     0.3     0.1
5                            34.2    28.5    18.4     9.4     4.4     2.9     1.4     0.7
6                            32.0    28.2    20.1    11.5     5.0     2.1     0.8     0.3
7                            34.0    28.5    18.5     9.9     4.5     2.7     1.4     0.6
8                            31.9    27.1    18.4    11.1     6.5     2.7     1.4     1.1
9                            31.2    27.6    20.4    12.8     5.6     1.9     0.5     0.0

Average percentage failures  33.4    28.2    18.7    10.6     5.2     2.5     1.0     0.4
Probability of a failure
of this frequency           0.008   0.021   0.044   0.079   0.123   0.187   0.237   0.300

Table 2 summarizes Adams's data, which has been classified across columns by the frequency with which some user found a failure. Each row represents a major IBM system like MVS, Cobol, and IMS. The columns represent a subdivision of the frequency with which users observed a failure. For example, the first column represents failures observed by users on the average of once every 5,000 years of usage; the last column represents failures observed by users on the average of once every 1.58 years of usage. The data in each cell defines the percentage of all failures observed for the software system represented by that row with the expected frequency represented by that column. The values in each row sum to 100.

The remarkable fact is that, over this very divergent range of products, the distribution of failures occurring with different frequencies is uniform. This lets us use the data for analysis.

The bottom two rows of Table 2 contain two numbers for each failure frequency: the average percent failures for the group and the probability of a failure of the frequency represented by that group. An examination of these last two rows provides some critical insights. Groups 1 and 2, which represent failures that will be observed less than once in 1,580 years of expected use, account for 61.6 percent of fixes made but only 2.9 percent of the failures that will be observed by typical users. On the other hand, groups 7 and 8 represent only 1.4 percent of the fixes made to the software but eliminate 53.7 percent of the failures that would be observed by a typical user.

If you use coverage testing, you would spend 61.6 percent of the testing and correction budget on finding and fixing errors that will eliminate only 2.9 percent of the failures, and only 1.4 percent on making fixes that would eliminate 53.7 percent of failures. Coverage testing doesn't appear to be very effective at allocating the testing and correction budget to increase MTTF. On the other hand, a usage testing strategy allocates the budget in accordance with the probability that a failure is observable by the average user: It allocates 53.7 percent to fixes that will occur 53.7 percent of the time in the experience of an average user.

Using the data in Table 2, we can show that usage testing is 21 times more effective at increasing MTTF than coverage testing. Let P be the increase in MTTF obtained by the next fix determined by coverage testing. Then the increase in MTTF obtained by the next fix determined by usage testing will be:

((0.008/60) + (0.021/19) + (0.044/6) + (0.079/1.9) + (0.123/0.6) + (0.187/0.19) + (0.237/0.06) + (0.300/0.019)) P = 20.98 P

This surprising result suggests the prevailing strategy for testing and correcting software is very inefficient.

Myth: Software behavior is deterministic. Therefore, statistics cannot be used to make inferences about software quality.

Fact: Software use is stochastic. A software system has many different uses to perform different missions starting from different initial conditions and given different input data. Each different use is a different event. Given a system that contains some latent failures, some usages will result in a failure, others in a correct execution. If you sample the entire population of all possible usages in accordance with an expected usage profile and maintain a record of failures and successes, you can use statistics to estimate reliability.

Fact: You can estimate the expected MTTF for a system from a series of tests drawn at random, in accordance with an expected usage profile, from the population of all possible uses. The major assumption you must make for the statistical estimation to be valid is that the development process is in a state of control. This is not an unreasonable assumption - it is the same one made when statistical quality-control practices are applied to a production process.
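The arithmetic behind the roughly 21-fold factor can be reproduced from the bottom rows of Table 2. The sketch below is ours, not the article's; it expresses each group's MTTF in thousands of months, which matches the denominators printed in the equation above, and the small difference from 20.98 comes from the article's rounding of those denominators.

```python
# Table 2 bottom rows: per-group failure probabilities and MTTFs in years.
prob = [0.008, 0.021, 0.044, 0.079, 0.123, 0.187, 0.237, 0.300]
mttf_years = [5000, 1580, 500, 158, 50, 15.8, 5, 1.58]

# Express each MTTF in thousands of months, the unit the printed
# denominators use (5,000 years -> 60; 1.58 years -> about 0.019).
denom = [m * 12 / 1000 for m in mttf_years]

# Relative MTTF gain of a usage-directed fix over a coverage-directed one.
ratio = sum(p / d for p, d in zip(prob, denom))
print(round(ratio, 2))  # about 21, vs. the article's 20.98 with rounded denominators
```

Note how the last two groups dominate the sum: fixes aimed at frequent failures buy almost all of the MTTF improvement.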

While our experience in applying statistical quality-control techniques to software development is limited, initial experience indicates that five fixes per thousand lines of code can be tolerated without invalidating the application of statistics to estimate MTTF. This failure rate is low compared to normal development practices, where 20 to 60 fixes per thousand lines of code is not atypical.

Fact: Experience indicates that it is possible to design and develop software that requires less than five fixes per thousand lines of code from its first compilation throughout its useful life. The engineering practices that let such quality be achieved before any execution testing are grouped under the heading "Cleanroom engineering."

Myth: The solution to the development problem is to create tools that will do for people what they can't do for themselves. The general idea behind this myth is that people can't be trusted to make the difficult inventions that software development requires.

Fact: Automation is very effective in helping us do the things we already know how to do. We know how to write. A word processor helps us write faster, but it doesn't help us write better (except that it gives us more time to think). A translator - a compiler - can translate a high-level language definition into machine-level instructions. For example, compilers translate a Fortran or Cobol program into machine language. While this translation algorithm can be performed by people or computers, computers have an advantage because, once they have been programmed to do it, they are fast and reliable and can free people to do something else.

Fact: Automation is not effective in helping us do things we don't know how to do algorithmically. When we computerize incomplete algorithms, the results are incomplete and unsatisfactory. When database management systems were first introduced, hierarchical and network databases were common. Database management programs encountered failures that were eventually traced to a common set of problems which E.F. Codd named data-maintenance abnormalities. These abnormalities, which cost business a great deal in terms of wrong decisions and software fixes, were caused by a basic failure in the hierarchical and network database models. These models could not maintain the referential transparency between the actual data and the state data used to represent it in computations: In certain situations, the value of the state data did not accurately represent the actual data as stored in the database. Codd's relational algorithm does maintain referential transparency, and if it is used to maintain keys in a relational database, it eliminates these failures.

This should have been an important lesson learned, but apparently the lesson was lost, because loss of referential transparency is still a common design flaw. The current generation of computer-aided software-engineering tools does not help maintain referential transparency and in some cases even allows designs that do not exhibit referential transparency. For example, some CASE tools help you invent program structures by converting dataflow diagrams into program structures. Due to the one-to-many relationship between a dataflow diagram and a program-structure chart, it is easy to lose referential transparency between the history of stimuli to the software and the state data used to represent the stimuli histories.

Fact: Ideas must precede tools. Tools are only as good as the ideas that serve as their foundation. The important factor in selecting tools to assist in software design and development is to select the ideas that you want to use to help guide the inventive process. Once that is done, the ideas can be organized into an engineering process that helps people exploit the chosen ideas. Then it is possible to select or build tools that enhance people's productivity in performing these ideas.

Cleanroom engineering

These ideas are the foundation for the set of software engineering practices we call Cleanroom engineering. Cleanroom engineering can help software engineers implement reliable software - software that won't fail during use. Cleanroom engineering
- achieves intellectual control by applying rigorous, mathematics-based engineering practices,
- establishes an "errors-are-unacceptable" attitude and a team responsibility for quality,
- delegates development and testing responsibilities to separate teams, and
- certifies the software's MTTF through the application of statistical quality-control methods.

Process. Cleanroom engineering involves a specification team, a development team, and a certification team. The specification team prepares and maintains the specification and specializes it for each development increment. The development team designs and implements the software. The certification team compiles, tests, and certifies the software's correctness.

In Cleanroom engineering, the team members
- complete a rigorous, formal specification, even if it is preliminary, before they begin design and development,
- develop a construction plan by decomposing the specification into small (seldom more than 10,000 lines of third-generation code) user-executable increments,
- design, implement, and verify each user-executable increment, and
- assess the software's quality.

Typical project. Figure 1 shows a profile of a typical Cleanroom engineering project, divided into phases.

Figure 1. Profile of a three-increment Cleanroom-engineering project. (The figure shows overlapping phases: problem analysis and requirements, specification, construction planning, design and build of increments 1 and 2 with test preparation for increment 2, certification of increment 1, and solution deployment.)

Specification. The first task is to assemble what is known into a specification document, complete the remaining details, then prepare and publish a formal specification. The first version may be preliminary due to lack of information, but it should still be formal. The specification must be as complete as possible and approved before development begins. The effort required to prepare the specification depends on how much is known when the decision to develop the software is made. It should be in three parts, which should agree: the external specification, the internal specification, and the expected-usage profile.

The external specification is a user's reference manual. It defines how the software will look and feel from the user's perspective and all the interfaces with the software. The specification should include details on
- the system environment (hardware, peripherals, operating system, related software, and people),
- the application environment (data and use structures),
- initialization and shutdown,
- system use (commands, menus, events, and modes), which must define all stimuli the system can receive from people, computers, and other devices and all responses it will produce,
- performance guidelines (timing and precision), and
- responses to undesired events.

The external specification is written in a language understood by users, but it is not a tutorial. It is not designed to instruct how to use the software; it is intended to define precisely how the software will work. Using only the external specification, someone with appropriate application expertise should be able to use the software with no surprises.

The internal specification is more mathematical. It completely states the mathematical function or, more generally, the mathematical relation for which the program implements a rule. This definition is required to implement the program and verify its correctness. It must be implementation-independent so the program architecture can be designed free of preconceptions. The internal specification augments information in the external specification. For example, while the external specification defines the stimuli the software will act upon and the responses it will produce, the internal specification defines the responses in terms of stimuli histories. Specifying the functional relationship between responses and stimuli completely in terms of stimuli histories avoids commitment to implementation details.

Specifying responses this way is usually hard to learn at first because it is natural to use invented abstractions - state data - to represent some portion of the prior stimuli. But as soon as you use state data to define software responses, you begin making implementation commitments. At the specification stage, you must define what is to be done, not how to do it. Experience indicates that as soon as you learn to define what is to be done free of implementation details, you can design and implement much better software. (David Parnas recommends traces [8].) We find using stimuli histories more convenient and natural and therefore easier to teach.

50 IEEE Software nient and natural and therefore easier to tion plan, including the first, be execut- terms of stimuli histories. teach. able by user commands. This means both The data-driven state-boxview begins The expected usage profile defines the that the system must be constructed top- to define implementation details by mod- software’s anticipated use. This document down and that you need write no special ifying the black box to represent re- primarily guides the preparation of usage testing routines. sponses in terms of the current stimuli tests. To make a valid inference about the It also means that incremental integra- and state data that represents the stimuli software’s expected M‘ITF, you must de- tion testing is done as each new increment histories. velop and run tests with stimuli taken is written. And it lets you use all test runs, The processdriven clear-box view from the population of all possible stimuli including the tests on the very first incre- completes the implementation details by and in the same proportion as they will be ment, to help estimate the final MTTF. modifylng the state box view to represent generated when the system is in use. Figure 2 shows a sample construction responses in terms of the current stimuli, Statistical testing is a stochastic process. plan. state data, and invocations of lower level The simplest and best understood ste When you have decomposed the specifi- black boxes. chastic process is the Markov process, cation into increments, design, imple- Some advocate the dataview,others the which can model the usage of most if not mentation, and testing can begin. These processviewfor designingsoftware. These all software systems. In developing a Mar- different points of view cannot be re- kov model for expected use, you must de- solved because, in reality, both views are fine all usage states and estimate the tran- Your specification must required. Box structures let you define sition probabilities between usage states. 
define what is to be done, this dualism. This sounds harder than it seems to be in not how. Btpedeme Figure 3 summarizesa design algorithm practice. For example, see Jesse Poore’s indicates that as soon as defined by Mills that uses box structures.’0 work? you leam to define what The first black box restates the specifca- There is no magic in preparing the writ- tion that defines all the responses pre ten specification. The magic is inventing is to be done free of duced by the increment in terms of stim- what the software should do to accom- implementation details, uli histories. plish its mission - a much deeper and you can create much Steps 3 through 5 invent the state data harder problem than developing the soft- better software. that represents stimuli histories, to pre- ware. That is why it is so important to use serve referential transparency. The alge good engineering practices in developing rithm then determines which of the state software so the time and attention now two phases can proceed in parallel. data to maintain at this level of the usage being consumed on the easy part of the hierarchy and which to migrate to a lower problem can be redirected to the harder Design and build. The development level. It is important to migrate state data problem of determining what the soft- team, not an individual engineer, is re- to the lowest practical level in the usage ware should be doing. sponsible for the qualityofthe increments hierarchy to keep the software’s structure developed. The team uses technologies to under control. Construction plan. This phase deter- construct increments, box structures and The state-box description is complete mines the development and certification stepwise refinement, and functional veri- when functional relationships exist that sequence. To do this, you decompose the fication. Development proceeds in three define the responses in terms of the cur- specification into executable increments. 
steps: rent stimuli, the state data being main- An executable increment can be tested by 1. Design each increment topdown, to tained at this level, and stimuli histories invoking user commands or supplying create a usage hierarchy in three views: for the state data being maintained at other external stimuli. black-box, state-box, and clear-box.Venfy lower levels. The criteria to determine the construc- the correctness of each view. Before the team proceeds to define the tion sequence include 2. Implement each increment by rigor- clear box, it verifies the state-box descrip the availabilityof reusable software, ous stepwise refinement of clear boxes tion by eliminating references to state how much is known about the reliabil- into executable code. data in the state-box functions. The result ity of the reused software for the expected 3. Verify that the code performs ac- is a derived black-box function that usage profile, cording to its specification using func- should be the same as the original black- increment size (increments should sel- tional verification arguments. box function. dom be larger than 10,OOO lines), and If the two functions are the same, the the number of development teams Box structures. The team uses box struc- team defines the clear box that follows the available, which determines the possibili- tures to create the software’s internal de- state box, otherwise it redefines the state- ties for parallel development. sign. Box structures view the software box function to correct the design errors. Incremental development is not new. from three perspectives: So the design process is suspended as long The important new idea is the require- The implementation-independent as the design contains logic errors. Just as ment that each increment in the construc- black-box view defines the responses in in long division,it is best to fix any error as

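The black-box and state-box views, and the derived-function check described above, can be sketched in a few lines of code. This is an illustrative sketch only - the running-total example and all the names in it are invented here, not taken from the article:

```python
# Black box: each response is a function of the entire stimulus history.
def black_box(history):
    """Response to the latest stimulus, defined on stimuli histories."""
    return sum(history)  # response = sum of all stimuli seen so far

# State box: same behavior, with invented state data replacing the history.
class StateBox:
    def __init__(self):
        self.total = 0  # state data representing the stimulus history
    def respond(self, stimulus):
        self.total += stimulus
        return self.total

# Verification step: derive the black-box behavior back from the state box
# and compare it with the original black-box function on sample histories.
def derived_black_box(history):
    box = StateBox()
    response = None
    for s in history:
        response = box.respond(s)
    return response

for hist in ([1], [1, 2, 3], [5, -2, 4, 0]):
    assert black_box(hist) == derived_black_box(hist)
print("state box is consistent with the black box")
```

The final loop is the analogue of the design-verification step: the behavior derived from the state box must match the original black-box function before design proceeds.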
Figure 2. A typical Cleanroom project construction plan. (Increment development: Team A builds increments 1, 2, and 3; Team B builds increment 4. Test scenarios and statistical usage certification are applied cumulatively: first to increment 1, then to increments 1 and 2, then to increments 1, 2, and 3, and finally to increments 1, 2, 3, and 4.)
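The statistical usage certification in the plan above presumes a usage model from which random test cases are drawn. A minimal Markov-chain sketch follows - the usage states and transition probabilities are invented for illustration; in practice they would be estimated from the expected-usage profile:

```python
import random

# Hypothetical usage states and transition probabilities for a small
# menu-driven system; real values come from the expected-usage profile.
TRANSITIONS = {
    "start":  [("menu", 1.0)],
    "menu":   [("query", 0.6), ("update", 0.3), ("exit", 0.1)],
    "query":  [("menu", 0.8), ("exit", 0.2)],
    "update": [("menu", 1.0)],
}

def generate_test_case(rng, max_steps=50):
    """One random walk through the usage chain yields one statistical test case."""
    state, stimuli = "start", []
    for _ in range(max_steps):
        if state == "exit":
            break
        states, probs = zip(*TRANSITIONS[state])
        state = rng.choices(states, weights=probs)[0]
        stimuli.append(state)
    return stimuli

rng = random.Random(7)
for i in range(3):
    print(i, generate_test_case(rng))
```

Because each test case is drawn in proportion to expected use, failure times observed while running such cases support valid MTTF inference.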

To design the clear box, the team invents (selects) the data abstractions (like set, stack, and queue) it will use to represent the state data maintained at this level. Then it modifies the state-box function to define responses in terms of current stimuli, state data being maintained at this level, and invocations of lower level black boxes.

When the clear-box description is complete, it is verified by eliminating references to lower level black boxes to obtain a derived state-box function that is compared to the original state-box function. If the original and the derived functions are the same, the design process continues. Otherwise, clear-box design continues until the verification indicates no logic errors.

The design process continues as the team expands each black box until there are no more to be expanded. At that point the design is complete.

Stepwise refinement. Box structures provide a rigorous stepwise-refinement algorithm that guides system design in an orderly, logical manner, with natural checkpoints along the way. For example, after step 6 in Figure 3 it is time to evaluate which state data you want maintained at this level. This gives you a chance to evaluate the trade-offs in maintaining the state at this level versus migrating it to a lower level. After step 10 you have a chance to evaluate which lower level black boxes to invoke.

The algorithm doesn't make the invention in these two crucial areas for the development team, but it does ensure that all the details following these two inventions are performed correctly with the verification steps. The algorithm also forces the designers and evaluators to focus on the critical software inventions that affect software performance and quality. The box-structures algorithm is a process that engineers and managers can rely on to invent a high-quality, accurate design.

Define black box:
1. Define stimuli.
2. Define responses in terms of stimuli histories.
Define state box:
3. Define state data to represent stimuli histories.
4. Select state data to be maintained at this level.
5. Modify black box to represent responses in terms of stimuli and the state data being maintained at this level.
6. Verify the state box.
Define clear box:
7. Record type of reference to state data by stimuli.
8. Define data abstraction for each state data.
9. Modify state box to represent responses in terms of stimuli, this level's state data, and invocations of lower level black boxes.
10. Verify the clear box.

Figure 3. The box-structures algorithm.

Functional verification. Once the design is complete, the team expands the clear box at each level into code that fully implements the defined rule for the black-box function at that level. Following each expansion, the team uses functional verification to help structure a proof that the expansion correctly implements the specification.

The proof must show that the rule (the program) correctly implements the function (the specification) for the full range of the function and no more. The Linger, Mills, and Witt correctness theorem3 defines what you must show to prove that a program is equivalent to its specification for each of the structured-programming-language constructs.

The proof strategy is divided into small parts that easily accumulate into a proof for a large program.

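The correctness theorem mentioned above reduces program-function equivalence to one proof obligation per control construct. Paraphrased here (writing [P] for the function computed by program P and f for its intended function; this is a summary, not a quotation of the theorem):

```latex
% Sequence: composing the part functions must give f.
f = [A;\,B] \iff f = [B] \circ [A]

% Alternation: each branch must agree with f on its own subdomain.
f = [\mathbf{if}\ p\ \mathbf{then}\ A\ \mathbf{else}\ B] \iff
    (p \Rightarrow f = [A]) \wedge (\lnot p \Rightarrow f = [B])

% Iteration: the loop must terminate on the domain of f, and one
% unrolling of the loop must again compute f.
f = [\mathbf{while}\ p\ \mathbf{do}\ A] \iff
    \text{termination} \wedge
    f = [\mathbf{if}\ p\ \mathbf{then}\ (A;\ \mathbf{while}\ p\ \mathbf{do}\ A)]
```

Each obligation is local to a single construct, which is what lets the small proofs accumulate into a proof of the whole program.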

does not test or even compile the code. It ___ ~~ uses a mathematical proof - functional 1 .oo - - - verification- to demonstrate the correct- - - - ness of the units. Testing and measuring 6.00 failures by program execution is the re- 1 .OO .23 .81 0.59 sponsibilityof the certification team. 16.00 .77 4.38 1.36 560.00 .9957 232.62 3.60 Certification. In parallel with the devel- opment team, the certification team uses the expected-usage profile and the appli- cable portion of the external specification to prepare test cases and solutionsthat ex- by fitting the data points To, to an the software's quality and it can be de- ercise the increment just developed and ... , Tk exponential relationship. The reliability ployed. the increments developed previously. The While we don't address the operations team can perform this step in parallel with can be calculatedfrom the MlTF. The re- sults of the MTI'F estimation can be sum- and maintenance phases here, we want to development because it uses the specifica- marized in a table, as in Table 3, which point out that tion, not the code, to develop tests. summarizes data from an actual project. operations provide actual testing for When the development team completes During certification, the team should continued estimates of the MTTF to an increment, the certification compiles observe the dynamics of change to deter- check against what has been certified dur- it, adds it to previous increments, and cer- mine how many more tests are required to ing development and tifies the software in three steps: term the software to the required MTI'F. the maintenance phase will be much 1. It measures Tk,the MTI'F of the cur- B is the factor by which each change in- simpler for Cleanroomdeveloped soft- rent version of the software by executing creases the MTTF. If B goes below 1, the ware than for heuristicallydeveloped soft- random test cases. 
Tkis a sample of MTTF new version is worse than the previousver- ware because of higher quality and the ex- for a version of the accumulated incre- sion. It is desirable that the value of B in- istence of a design and development trail. ments and ..., Tkl (measured pre- To, crease monotonically. viously). The value of Bturns down when failures Goals. We believe the following are real- The team compares each test result to a are encountered late. Failures found early istic goals for Cleanroom engineering. standard; either the result is correct or are not expensive in terms of eventually Our belief comes from observingdemon- there was afailure. The cumulativetime to obtaining a high value for MTTF with a strations of component practices, includ- failure is an estimate of the MTTF. The reasonable testing budget, but if B drops ing a few demonstrations of the full set of team may decide to continue testing by late in the certification process it will take practices. Table 1 summarizes the results constructing more tests. The new time to a large number of tests to achieve the de- of some of these projects. failure is another estimate of the MTI'F. It sired MTTF. M'ITF, reliability, and testing Long-term goals (after a team has uses all estimates of h4TTF to predict the time (number of test cases) are mathe- completed three or four increments): two MTTF of the next version. matically related to each other. The team orders of magnitude (factor of 100) im- The certification team reports failures can also calculate confidence bounds on provement in reliability and one order of to the development team, which makes MTI'F estimates. magnitude (factor of 10) improvement in the fixes. When the development team re- 3. Once it has estimated the mFfor productivity. 
turns new modules, the certification team the next version, the team must decide if it Short-term goals (first two or three in- compiles a new version, and the measur- wants to crements developed by a new Cleanroom ing process is repeated for the new version correct the observed failures and con- team) : statistical quality control of devel- of the software. tinue to certify the software, opment in a pipeline of user-executable 2. Estimate the reliability for the next stop certification because the software increments that accumulate into a system; version of the software using a certifica- has reached the desired reliability for this elimination of debugging by software en- tion model and the measured MTI'F for stage of testing, or gineers before independent statistical each version of the software. The team stop certification and redesign the testing of usage requirements; certifiica- predicts the MTTF for the next version of software because the failure rate is too tion of reliability at delivery; one order of the software using the model high or the failures are too serious. magnitude improvement in reliability; When all the increments are complete and factor of three improvement in pre MTTF, + , =A@' ' and tested, you have a reliable estimate of ductivity.
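The certification model MTTF(k+1) = A * B^(k+1) can be fitted by linear regression on the logarithms of the observed MTTFs. A minimal sketch - the observed values, the mission time, and the exponential-reliability assumption are illustrative here, not data or formulas quoted from the article:

```python
import math

# Hypothetical observed MTTFs T0..Tk for successive versions (hours).
observed = [1.0, 2.5, 7.0, 19.0]

# Fit log(MTTF_k) = log(A) + k*log(B) by ordinary least squares.
k = list(range(len(observed)))
logs = [math.log(t) for t in observed]
n = len(k)
kbar = sum(k) / n
lbar = sum(logs) / n
num = sum((ki - kbar) * (li - lbar) for ki, li in zip(k, logs))
den = sum((ki - kbar) ** 2 for ki in k)
slope = num / den
intercept = lbar - slope * kbar
A, B = math.exp(intercept), math.exp(slope)

# Predict the next version's MTTF and its reliability over a mission of
# length t, assuming exponentially distributed interfailure times.
next_mttf = A * B ** n
t = 1.0  # hypothetical mission time, in the same units as the MTTFs
reliability = math.exp(-t / next_mttf)
print(f"B = {B:.2f}, predicted MTTF = {next_mttf:.1f}, R(t=1) = {reliability:.3f}")
```

A fitted B above 1 signals that each round of fixes is improving the software, mirroring the Factor B column of Table 3.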


Responsible software-development organizations should begin to adopt Cleanroom engineering or some equivalent discipline. An organization always faces risks when it decides to change the way it does business. The best way to manage risk is to identify the risk and determine what actions to take to avoid it or at least minimize its effect.

The potential gains from Cleanroom engineering are enormous compared to the identified risks. The largest risk an organization can take is to decide not to adopt Cleanroom engineering or an equivalent discipline. At the very least, organizations should conduct a trial on one or two significant projects. The cost of continuing to develop failure-laden software with its associated low productivity can at best increase cost and at worst so affect an organization's competitive position that it is difficult to remain in business.

Organizations that purchase software should also understand the ramifications of Cleanroom engineering so they can work with their vendors and integrators to ensure that they build high-quality software at an attractive price. Intelligent buyers can have a significant effect on the speed with which developers adopt these superior software-development practices.

References
1. H.D. Mills, "Structured Programming: Retrospect and Prospect," IEEE Software, Nov. 1986, pp. 58-66.
2. E.N. Adams, "Optimizing Preventive Service of Software Products," IBM J. Research and Development, Jan. 1984.
3. R.C. Linger, H.D. Mills, and B.I. Witt, Structured Programming: Theory and Practice, Addison-Wesley, Reading, Mass., 1979.
4. R.C. Linger and H.D. Mills, "A Case Study in Cleanroom Software Engineering: The IBM Cobol Structuring Facility," Proc. Computer Software and Applications Conf., CS Press, Los Alamitos, Calif., 1988.
5. R.W. Selby, V.R. Basili, and F.T. Baker, "Cleanroom Software Development: An Empirical Evaluation," IEEE Trans. Software Eng., Sept. 1987, pp. 1,027-1,037.
6. M. Dyer and A. Kouchakdjian, "Correctness Verification: Alternative to Structural Software Testing," Information and Software Technology, Jan./Feb. 1990, pp. 53-59.
7. H.D. Mills, M. Dyer, and R.C. Linger, "Cleanroom Software Engineering," IEEE Software, Nov. 1986, pp. 19-24.
8. D.L. Parnas and Y. Wang, "The Trace Assertion Method of Module-Interface Specification," Tech. Report 89-261, Telecommunications Research Inst. of Ontario, Queens Univ., Kingston, Ontario, Canada, 1989.
9. J.H. Poore et al., "A Case Study Using Cleanroom with Box Structures ADL," Tech. Report CDRL 1880, Software Engineering Technology, Vero Beach, Fla., 1990.
10. H.D. Mills, "Stepwise Refinement and Verification in Box-Structured Systems," Computer, June 1988, pp. 23-36.

Richard H. Cobb is vice president of Software Engineering Technology. His research interests are software design and development practices and methodologies for improving software quality and developer productivity. Cobb received a BS in industrial engineering from the University of Cincinnati and an MS in operations research from Rensselaer Polytechnic Institute.

Harlan D. Mills is president of Software Engineering Technology and a professor of computer science at Florida Institute of Technology. His research interests are systems engineering and the mathematical foundations of software engineering. Mills received a PhD in mathematics. He is the recipient of the DPMA Distinguished Information Science Award and the Warnier Prize and is an IEEE fellow.

Address questions about this article to Cobb at SET, 1918 Hidden Point Rd., Annapolis, MD 21401, or Mills at SET, 2770 Indian River Blvd., Vero Beach, FL 32960; CSnet [email protected].