L02: Case Studies: the Mars Program
Total Page:16
File Type:pdf, Size:1020Kb
University of Toronto Department of Computer Science University of Toronto Department of Computer Science Lecture 2: NASA JPL’s Mars Program Examples of Poor Engineering Mission Launch Date Arrival Date Outcome Viking I 20 Aug 1975 Landed 20 Jul 1976 Operated until 1982 Viking II 9 Sept 1975 Landed 3 Sept 1976 Operated until 1980 ➜ “Software Forensics” Case Studies: Mars Pathfinder Mars Observer 25 Sept 1992 Last contact: Contact lost just Mars Climate Observer 22 Aug 1993 before orbit insertion Mars Polar Lander Pathfinder 4 Dec 1996 Landed Operated until 27 Sept Deep Space 2 4 July 1997 1997 Global Surveyor 7 Nov 1996 Orbit attained Still operational ➜ Some conclusions 12 Sept 1997 e.g. Reliable software has very little to do with writing good programs Climate Orbiter 11 Dec 1998 Last contact: Contact lost just 23 Sept 1999 before orbit insertion e.g. Humans make mistakes, but good engineering practice catches them! Polar Lander 3 Jan 1999 Last contact: Contact lost before 3 Dec 1999 descent Deep Space 2 3 Jan 1999 Last contact: No data was ever 3 Dec 1999 retrieved Mars Odyssey 7 Apr 2001 Expected: October 24, 2001 © 2001, Steve Easterbrook 1 © 2001, Steve Easterbrook 2 University of Toronto Department of Computer Science University of Toronto Department of Computer Science Mars Pathfinder Remember these pictures? ➜ Mission Demonstrate new landing techniques parachute and airbags Take pictures Analyze soil samples Demonstrate mobile robot technology Sojourner ➜ Major success on all fronts Returned 2.3 billion bits of information 16,500 images from the Lander 550 images from the Rover 15 chemical analyses of rocks & soil Lots of weather data Both Lander and Rover outlived their design life Broke all records for number of hits on a website!!! © 2001, Steve Easterbrook 3 © 2001, Steve Easterbrook 4 University of Toronto Department of Computer Science University of Toronto Department of Computer Science Pathfinder had Software Errors Mars Climate Orbiter ➜ Symptoms ➜ Launched Software did total system resets 11 Dec 1998 Symptoms noticed soon after Pathfinder started collecting meteorological data Some data lost each time ➜ Mission ➜ Cause interplanetary weather satellite 3 Process threads, with bus access via mutual exclusion locks (mutexs): communications relay for Mars Polar High priority: Information Bus Manager Lander Low priority: Meteorological Data Gathering Task Medium priority: Communications Task ➜ Fate: Priority Inversion: Arrived 23 Sept 1999 Low priority task gets mutex to transfer data to the bus No signal received after initial orbit High priority task blocked until mutex is released insertion Medium priority task pre-empts low priority task Eventually a watchdog timer notices Bus Manager hasn’t run for some time… ➜ Cause: ➜ Factors Faulty navigation data caused by Very hard to diagnose: failure to convert imperial to metric Hard to reproduce units Need full tracing switched on to analyze what happened Was experienced a couple of times in pre-flight testing Never reproduced or explained Hence testers assumed it was a hardware glitch © 2001, Steve Easterbrook 5 © 2001, Steve Easterbrook 6 University of Toronto Department of Computer Science University of Toronto Department of Computer Science Small Forces... MCO Navigation Error ➜ Locus of error Estimated trajectory Ground software file called “Small Forces” gives thruster performance data TCM-4 and AMD ∆V’s This data used to process telemetry from the spacecraft Spacecraft signals each Angular Momentum Desaturation (AMD) maneuver Small Forces data used to compute effect on trajectory Software underestimated effect by factor of 4.45 TCM-4 2 26 ➜ Cause of error km Small Forces Data given in Pounds-seconds (lbf-s) Actual trajectory 5 e 7k s Larger AMD ∆V’s and AMD ∆V’s m p The specification called for Newton-seconds (N-s) a ri Driving trajectory down e ➜ Result of error relative to ecliptic plane P As spacecraft approaches orbit insertion, trajectory is corrected Aimed for periapse of 226km on first orbit Mars Estimates were adjusted as the spacecraft approached orbit insertion: 1 week prior: first periapse estimated at 150-170km 1 hour prior: this was down to 110km Minimum periapse considered survivable is 80km MCO entered Mars occultation 49 seconds earlier than predicted Signal was never regained after the predicted 21 minute occultation To Earth Subsequent analysis estimates first periapse of 57km © 2001, Steve Easterbrook 7 © 2001, Steve Easterbrook 8 University of Toronto Department of Computer Science University of Toronto Department of Computer Science Contributing Factors Mars Polar Lander ➜ For 4 months, AMD data not ➜ Operations Navigation team ➜ Launched used due to file format errors unfamiliar with spacecraft 3 Jan 1999 Navigators calculated data by hand Different team from development & test File format fixed by April 1999 Did not fully understand the significance ➜ Mission Anomalies in trajectory became of the anomalies Land near South Pole apparent almost immediately Familiarity with previous mission (MGS) assumed sufficient: Dig for water ice with a robotic ➜ Limited ability to investigate: but AMD was performed 10-14 arm Thrust effects measured along line of times more often on MCO as it has ➜ sight using doppler shift asymmetric solar panels. Fate: AMD thrusts are mainly perpendicular ➜ Arrived 3 Dec 1999 to Earth-spacecraft line of sight Inadequate Testing Software Interface Spec not used No signal received after initial ➜ Poor communication between during unit testing of small forces s/w phase of descent teams: End-to-end test of ground software ➜ E.g. Issue tracking system not never completed Cause: properly used by navigation team Ground software was not considered Several candidate causes Anomalies not properly investigated “mission critical” so less rigorous V&V Most likely is premature engine ➜ ➜ shutdown due to noise on leg Inadequate staffing Inadequate Reviews sensors Operations team monitoring three Key personnel missing from critical missions simultaneously (MGS, MCO design reviews and MPL) © 2001, Steve Easterbrook 9 © 2001, Steve Easterbrook 10 University of Toronto Department of Computer Science University of Toronto Department of Computer Science What happened? Premature Shutdown Scenario ➜ ➜ Cause of error Investigation hampered by Magnetic sensor on each leg senses touchdown lack of data Legs unfold at 1500m above surface spacecraft not designed to send transient signals on touchdown sensors during unfolding telemetry during descent software accepts touchdown signals if they persist for 2 timeframes This decision severely criticized by transient signals likely to be long enough on at least one leg review boards ➜ ➜ Factors Possible causes: System requirement to ignore the transient signals Lander failed to separate from cruise But the software requirements did not describe the effect stage (plausible but unlikely) s/w designers didn’t understand the effect, so didn’t implement the requirement Landing site was too steep (plausible) Engineers present at code inspection didn’t understand the effect Heatshield failed (plausible) Not caught in testing because: Loss of control due to dynamic Unit testing didn’t include the transients effects (plausible) Sensors improperly wired during integration tests (no touchdown detected!) Loss of control due to center-of- Full test not repeated after re-wiring mass shift (plausible) Premature Shutdown of Descent ➜ Result of error Engines (most likely!) Engines shut down before spacecraft has landed Parachute drapes over lander When engine shutdown s/w enabled, flags indicated touchdown already occurred (plausible) estimated at 40m above surface, travelling at 13 m/s Backshell hits lander (plausible but estimated impact velocity 22m/s (spacecraft would not survive this) unlikely) nominal touchdown velocity 2.4m/s © 2001, Steve Easterbrook 11 © 2001, Steve Easterbrook 12 University of Toronto Department of Computer Science University of Toronto Department of Computer Science STS Ariane Path- Factor MCO MPL DS-2 Deep Space 2 51L 501 finder Didn’t test to spec ? ➜ Launched Insufficient test data 3 Jan 1999 Tested “wrong” system No regression test ➜ Mission Lack of integration testing 2 small probes piggybacked on Mars Lack of expertise at inspections Polar Lander System changed after testing ? Demonstration of new technology Reqt not implemented ? Separate from MPL 5 minutes before Lack of diagnostic data during ops atmosphere entry S/W used before ready ?? Bury themselves in Martian Soil Different team maintains S/W Return data on soil analysis and look Didn’t use problem reporting system ? for water Didn’t track problems properly ? ➜ Fate: Didn’t investigate anomalies Poor communication between teams ? No signals were received after launch Insufficient staffing ➜ Cause: Failure to adjust budget and schedule Inexperienced managers ? Unknown Commercial pressures took priority (System was not ready for launch) reused code w/o checking assumptions ‘Redundant’ design not redundant © 2001, Steve Easterbrook 13 © 2001, Steve Easterbrook 14 University of Toronto Department of Computer Science University of Toronto Department of Computer Science Summary Resource List ➜ ➜ Mars Observer Failures can usually be traced to a single root cause Project summary http://www.msss.com/mars/observer/project/mo_loss/moloss.html But good engineering practice should prevent these