Today's Lecture Mars Climate Orbiter
Total Page:16
File Type:pdf, Size:1020Kb
Today’s Lecture Discuss several different software disasters to Lecture 22: Software Disasters provide insights into the types of errors that can occur the costs associated with them Kenneth M. Anderson Examples Software Methods and Tools Mars Climate Orbiter CSCI 3308 - Fall Semester, 2004 Mars Polar Lander Patriot Missile Defense System Ariane 5 November 5, 2004 © University of Colorado, 2004 2 Mars Climate Orbiter Mars Climate Orbiter, continued Supposed to enter Martian atmosphere “at a high trajectory” and Science Objectives “lightly aerobrake” to achieve orbit, which uses less fuel Monitor climate changes Due to a conversion error in which commands to the spacecraft were sent in English units rather than metric units, the Serve as relay for Mars Polar Lander spacecraft entered the atmosphere at “a trajectory 170km lower Costs for Climate Orbiter and Polar Lander than planned” The spacecraft hit the atmosphere earlier than was planned and combined was thus traveling too fast; this led to the destruction of the spacecraft Spacecraft Development - 193.1 million Since the Polar Lander was also lost, the combined cost of the Launch - 91.7 million project stands at 327.6 million dollars Compare to Mars Observer (lost in 1989): 4 billion dollars! Mission and Operations - 42.8 million Unofficially, the problem had been detected but due to politics Total - 327.6 million between the development team and JPL, a fix was never deployed November 5, 2004 © University of Colorado, 2004 3 November 5, 2004 © University of Colorado, 2004 4 Mars Climate Orbiter, continued Mars Climate Orbiter, continued The official report cited the following “contributing Contributing Factors, continued factors” to the loss of the spacecraft the systems engineering function within the project that is undetected errors in ground-based models of the spacecraft supposed to track and double-check all interconnected aspects of the mission was not robust enough the operational navigational team was not fully informed on the details of the way that Mars Climate Orbiter was pointed this was exacerbated by the first-time handover of a Mars- in space bound spacecraft by the team that constructed and launched the vehicle to a new multi-mission operations team (this is the a final, optional engine firing to raise the spacecraft’s path “politics” part!) relative to Mars before its arrival was considered but not some communications channels among project engineering performed groups were too informal (e.g. not documented!) November 5, 2004 © University of Colorado, 2004 5 November 5, 2004 © University of Colorado, 2004 6 Mars Climate Orbiter, continued Mars Climate Orbiter, summary Contributing Factors, continued One Technical Problem the small mission navigation team was oversubscribed and failed conversion of units its work did not receive peer review by independent experts personnel were not trained sufficiently in areas such as the Many Process and Social Problems relationship between the operation of the mission and its No review (e.g. verification), insufficient training, detailed navigational characteristics, or the process of filing informal processes in place, formal processes formal anomaly reports ignored the process to verify / validate certain engineering requirements and the technical interfaces between some Led to a destroyed spacecraft project groups, and between the project and its prime mission contractor was inadequate November 5, 2004 © University of Colorado, 2004 7 November 5, 2004 © University of Colorado, 2004 8 Mars Polar Lander Details Part of the same project as the Mars Climate Module Test 1 Observer Give command to deploy spacecraft’s legs Last communication with spacecraft occurred just Legs deploy prior to its entry into the Martian atmosphere Test Passed! Loss of spacecraft can be attributed to the lack of Module Test 2 integration testing When spacecraft detects “jolt” of landing on the e.g. the failure was not detected by “module” testing surface of Mars, turn engine off an integrated test across subsystems was required to detect Simulated “jolt” detected, engine shuts off the problem Test Passed! November 5, 2004 © University of Colorado, 2004 9 November 5, 2004 © University of Colorado, 2004 10 Details, continued Mars Polar Lander, summary What actually happened Clear demonstration of the importance of Spacecraft enters atmosphere integration testing Legs deploy and “jolt” the craft If the team testing the deployment of the legs had “Jolt” detected and engine shuts off conducted the test while also testing the flight software, the “bug” may have been detected The problem Unfortunately, with the “faster, better, cheaper” Spacecraft was still miles above the surface of philosophy of NASA at the time, integration testing Mars! was deemed too expensive and was not Spacecraft crashes into Mars and is destroyed conducted in a comprehensive fashion November 5, 2004 © University of Colorado, 2004 11 November 5, 2004 © University of Colorado, 2004 12 Credit Where Credit is Due Patriot Missile Defense System Information for the rest of the lecture comes On February 25, 1991, during the Gulf War, a patriot from missile failed to intercept an incoming Iraqi Scud missile <http://dutita0.twi.tudelft.nl/users/vuik/wi211/disasters.html> The cost of this failure was the lives of 28 soldiers when the and Scud missile struck a military barracks <http://www.eiffel.com/doc/manuals/technology/contract/ariane/> The cause of the failure was a numerical error in the Patriot’s Additional Info on Ariane 5 operating system to correctly calculate “time since boot” This caused the system’s time to get out of synch with actual <http://www.math.ufl.edu/~cws/3114/ariane-siam.html> time and led to a failure in the system designed to track the Scud missile’s position in the air November 5, 2004 © University of Colorado, 2004 13 November 5, 2004 © University of Colorado, 2004 14 Details Details, continued The system’s internal clock measured The problem? 1/10 has a non-terminating binary expansion, so its actual time in tenths of seconds value was chopped to fit into a 24-bit register Actual time was reported in seconds by This introduces an error equal to 0.000000095 multiplying the internal clock’s value by After running for 100 hours, this error means that the system is 0.34 seconds out of sync with reality 1/10 What’s wrong with that? This calculation was performed using a 24- Scud Missiles travel at 1676 meters per second! bit fixed point register In 0.34 seconds, they travel half a kilometer (~0.3 of a mile)! Without an accurate location, the Patriot missile had no chance to intercept the missile November 5, 2004 © University of Colorado, 2004 15 November 5, 2004 © University of Colorado, 2004 16 Patriot Missile System, summary Ariane 5 Disaster Software Engineering Issues On June 4, 1996, after 7 billion dollars of development, an unmanned Ariane 5 rocket exploded just forty seconds after lift- Numerical algorithm at an extremely low level led off to total system failure and loss of human life The rocket and its cargo were valued at $500 million for a total cost of 7.5 billion dollars! But there was also a maintenance related problem The error was traced to a software component in the Inertial The error was known, and a fix had been applied in Reference System that had been reused from the Ariane 4 flight some parts of the software but not others! software Inadequate maintenance process! The reused component was more than 10 years old and had flown A complete set of regression tests, if applied after every successfully on numerous Ariane 4 flights modification, may have detected the problem The problem => certain assumptions changed between the Ariane 4 and the Ariane 5 and the software was not updated in response November 5, 2004 © University of Colorado, 2004 17 November 5, 2004 © University of Colorado, 2004 18 Ariane 5, background info Ariane 5, the details The flight software was written in Ada which The failure of the Ariane 5 can be traced to the has a first class exception construct conversion of a 64-bit integer to a 16-bit signed integer (it predates C++ and Java in this regard) 15 The 64-bit value was greater than 2 which caused an If an exception is thrown but not caught, the exception to be generated error will “percolate” up through the call stack This exception was not caught and it caused the termination and will eventually terminate the entire of the flight control software 37 seconds into the launch The rocket shortly thereafter (3 seconds) lost control and system was destroyed November 5, 2004 © University of Colorado, 2004 19 November 5, 2004 © University of Colorado, 2004 20 More information Details, continued Jean-Marc Jézéquel and Bertrand Meyer wrote a paper that Their paper reveals that sufficient software dev. processes were traces the problem to an inappropriate reuse of a 10-year old in place and the system that caused the error had even been reviewed software component extensively before launch They reveal that one “vexing” aspect of this disaster is that the error exception handlers had been placed around 4 of 7 variables; occurred in a software system that was not needed during launch! unfortunately, the data conversion error occurred in one of the The calculation was supposed to be stopped 9 seconds before 3 unprotected variables launch, but the inertial reference system had been reset during a hold in the countdown and