
Editor: , EiffelSoft, 270 Storke Rd., Ste. 7, Goleta, CA 93117; voice (805) 685-6869; [email protected]

Object Technology

Design by Contract: The Lessons of Ariane

Jean-Marc Jézéquel, IRISA/CNRS
Bertrand Meyer, EiffelSoft

Several contributions to this department have emphasized the importance of design by contract in the construction of reliable software. Design by contract, as you will recall, is the principle that interfaces between modules of a software system, especially a mission-critical one, should be governed by precise specifications, similar to contracts between humans or companies. The contracts will cover mutual obligations (preconditions), benefits (postconditions), and consistency constraints (invariants). Together these properties are known as assertions, and are directly supported in some design and programming languages.

A recent $500 million software error provides a sobering reminder that this principle is not just a pleasant academic ideal. On June 4, 1996, the maiden flight of the European Ariane 5 launcher crashed, about 40 seconds after takeoff. Media reports indicated that a half-billion dollars was lost; the rocket was uninsured.

The French space agency, CNES (Centre National d'Etudes Spatiales), and the European Space Agency immediately appointed an international inquiry board, made up of respected experts from major European countries, which produced a report in hardly more than a month. These agencies are to be commended for the speed and openness with which they handled the disaster. The report is available on the Web, in both French and English (http://www.cnes.fr/actualites/news/rapport_501.html).

It is a remarkable document: short, clear, and forceful. The explosion, the report says, is the result of a software error, possibly the costliest in history (at least in dollar terms, since earlier cases have cost lives).

Particularly vexing is the realization that the error came from a piece of the software that was not needed. The software involved is part of the Inertial Reference System, for which we will keep the acronym SRI used in the report, if only to avoid the unpleasant connotation that the reverse acronym has for US readers. Before liftoff, certain computations are performed to align the SRI. Normally, these computations should cease at −9 seconds, but because there is a chance that a countdown could be put on hold, the engineers gave themselves some leeway. They reasoned that, because resetting the SRI could take several hours (at least in earlier versions of Ariane), it was better to let the computation proceed than to stop it and then have to restart it if liftoff was delayed. So the SRI computation continues for 50 seconds after the start of flight mode, well into the flight period. After takeoff, of course, this computation is useless. In the Ariane 5 flight, however, it caused an exception, which was not caught, and boom.

The exception was due to a floating-point error during a conversion from a 64-bit floating-point value, representing the flight's "horizontal bias," to a 16-bit signed integer. In other words, the value that was converted was greater than what can be represented as a 16-bit signed integer. There was no explicit exception handler to catch the exception, so it followed the usual fate of uncaught exceptions and crashed the entire software, hence the onboard computers, hence the mission.

This is the kind of trivial error that we are all familiar with (raise your hand if you have never done anything of this sort), although fortunately the consequences are usually less expensive. How in the world can it have remained undetected and produced such a horrendous outcome?

[Pull quote: How in the world could such a trivial error have remained undetected and cause a $500 million rocket to blow up?]

YOU CAN'T BLAME MANAGEMENT

Although something clearly went wrong in the validation and verification process (or we wouldn't have a story to tell), and although the Inquiry Board does make several recommendations to improve the process, it is also clear that systematic documentation, validation, and management procedures were in place.

The software literature has often contended that most software problems are primarily management problems. This is not the case here: the problem was a technical one. (Of course you can always argue that good management will spot technical problems early enough.)
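The failing conversion described earlier in this article can be illustrated with a short sketch. It is written in Python rather than the Ada of the flight software, and every name in it (`OperandError`, `to_int16_unprotected`, `to_int16_protected`) is hypothetical. Clamping is only one possible form of protection, chosen here as an assumption; the report excerpted in this article does not say what form the protection took. The only fixed fact is the range of a 16-bit signed integer, −32768 to 32767.

```python
INT16_MIN = -(2 ** 15)     # -32768, smallest 16-bit signed integer
INT16_MAX = 2 ** 15 - 1    #  32767, largest 16-bit signed integer


class OperandError(ArithmeticError):
    """Hypothetical error: the value does not fit in a 16-bit signed integer."""


def to_int16_unprotected(value: float) -> int:
    """Checked conversion: an out-of-range value raises OperandError,
    which propagates unless some caller handles it."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in 16 bits")
    return result


def to_int16_protected(value: float) -> int:
    """Same conversion with a local handler (assumed policy: clamp)."""
    try:
        return to_int16_unprotected(value)
    except OperandError:
        return INT16_MAX if value > 0 else INT16_MIN
```

With these definitions, `to_int16_unprotected(40000.0)` raises `OperandError`, while `to_int16_protected(40000.0)` returns 32767. The unprotected variant corresponds to the situation in which no handler exists anywhere up the call chain.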

January 1997 129 .

YOU CAN'T BLAME THE LANGUAGE

Ada's exception mechanism has been criticized in the literature, but in this case it could have been used to catch the exception. In fact, the report says:

    Not all the conversions were protected because a maximum workload target of 80% had been set for the SRI computer. To determine the vulnerability of unprotected code, an analysis was performed on every operation which could give rise to an ... operand error. This led to protection being added to four of [seven] variables ... in the Ada code. However, three of the variables were left unprotected.

YOU CAN'T BLAME THE DESIGN

Why was the exception not monitored? The analysis revealed that overflow (a horizontal bias not fitting in a 16-bit integer) could not occur. Was the analysis wrong? No! It was right for the Ariane 4 trajectory. For Ariane 5, with other trajectory parameters, it did not hold.

YOU CAN'T BLAME THE IMPLEMENTATION

Some may criticize removing the conversion protection to achieve more performance (the 80 percent workload target), but this decision was justified by the theoretical analysis. To engineer is to make compromises. If you have proved that a condition cannot happen, you are entitled not to check for it. If every program checked for all possible and impossible events, no useful instruction would ever get executed!

YOU CAN'T BLAME TESTING

The Inquiry Board recommends better testing procedures, and it also recommends testing the entire system rather than parts of it (in the Ariane 5 case the SRI and the flight software were tested separately). But even if you can test more, you can never test all. Testing, as we all know, can show the presence of errors, not their absence. The only fully realistic test is a launch. And in fact, the launch was a test launch, in that it carried no commercial payload, although it was probably not intended to be a $500 million test.

YOU CAN TRY TO BLAME REUSE

The SRI horizontal bias module was indeed reused from 10-year-old software, the software from Ariane 4. But this is not the real story.

BUT YOU REALLY HAVE TO BLAME REUSE SPECIFICATION

What was truly unacceptable in this case was the absence of any kind of precise specification associated with this reusable module. The requirement that the horizontal bias should fit on 16 bits was in fact stated in an obscure part of a mission document. But it was nowhere to be found in the code itself!

One of the principles of design by contract, as earlier columns have said, is that any software element that has such a fundamental constraint should state it explicitly, as part of a mechanism present in the language. In an Eiffel version, for example, it would be stated as

    convert (horizontal_bias: DOUBLE): INTEGER is
        require
            horizontal_bias <= Maximum_bias
        do
            ...
        ensure
            ...
        end

where the precondition (require...) states clearly and precisely what the input must satisfy to be acceptable.

Does this mean that the crash would automatically have been avoided had the mission used a language and method supporting built-in assertions and design by contract? Although it is always risky to draw such after-the-fact conclusions, the answer is probably yes:

• Assertions (preconditions and postconditions in particular) can be automatically turned on during testing, through a simple option. The error might have been caught then.

• Assertions can remain turned on during execution, triggering an exception if violated. Given the performance constraints on such a mission, however, this would probably not have been the case.

• Most important, assertions are a prime component of the software and its automatically produced documentation ("short form" in Eiffel environments). In a project such as Ariane, in which there is so much emphasis on quality control and thorough validation of everything, assertions would have been the quality assurance team's primary focus of attention. Any test team worth its salt would have checked systematically that every call satisfies every precondition. That would have immediately revealed that the Ariane 5 software did not meet the expectation of the Ariane 4 routines that it called.

The Inquiry Board makes several recommendations with respect to software process improvement. Many are justified; some may be overkill; some would be very expensive to put in place. There is a more simple lesson to be learned from this unfortunate event: Reuse without a precise, rigorous specification mechanism is a risk of potentially disastrous proportions.

[Pull quote: There is a simple lesson here: Reuse without a precise specification mechanism is a disastrous risk.]

It is regrettable that this lesson has not been heeded by such recent designs as Ada 95, Java, or IDL (the Interface Definition Language of CORBA), which is intended to foster large-scale reuse across networks but fails to provide any semantic specification mechanism. None of these languages has built-in support for design by contract.

Effective reuse requires design by contract. Without a precise specification attached to each reusable component (precondition, postcondition, invariant), no one can trust a supposedly reusable component. Without a specification, it is probably safer to redo than to reuse. ❖
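The point about stating the constraint in the code itself can be sketched in runnable form. Python has no built-in design by contract, so this example emulates a precondition with a hypothetical `require` decorator; it is an illustration under stated assumptions, not a rendering of the Eiffel mechanism. `CHECK_CONTRACTS`, `MAXIMUM_BIAS`, `PreconditionViolation`, and the sample values are all invented for the sketch and do not come from the inquiry report.

```python
import functools

CHECK_CONTRACTS = True    # assumption: a global switch, standing in for a build option
MAXIMUM_BIAS = 32767.0    # assumption: largest bias that fits in 16 signed bits


class PreconditionViolation(AssertionError):
    """Raised when a caller breaks the routine's stated precondition."""


def require(predicate, description):
    """Attach a precondition to a function; check it on each call when enabled."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if CHECK_CONTRACTS and not predicate(*args, **kwargs):
                raise PreconditionViolation(f"{func.__name__}: {description}")
            return func(*args, **kwargs)
        # Keep the contract text on the function so it is part of its interface.
        wrapper.precondition = description
        return wrapper
    return decorate


@require(lambda horizontal_bias: horizontal_bias <= MAXIMUM_BIAS,
         "horizontal_bias <= MAXIMUM_BIAS")
def convert(horizontal_bias: float) -> int:
    """Reused routine: narrows the horizontal bias to an integer."""
    return int(horizontal_bias)
```

A caller whose values stay within the bound is unaffected: `convert(1200.0)` returns 1200. A caller with a different value range, such as `convert(50000.0)`, fails with `PreconditionViolation` the first time contract checking is enabled during testing. Because the condition is attached to the routine as `convert.precondition`, a reviewer reusing the module can read the requirement without searching a separate mission document.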

130 Computer