A Ballista Retrospective
Software Robustness Testing: A Ballista Retrospective

Phil Koopman
[email protected]
http://ballista.org

With contributions from: Dan Siewiorek, Kobey DeVale, John DeVale, Kim Fernsler, Dave Guttendorf, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi

Institute for Complex Engineered Systems

Overview

Introduction
• APIs aren't robust (and people act as if they don't want them to be robust!)

Top 4 reasons people give for ignoring robustness improvement
• "My API is already robust, especially for easy problems" (it's probably not)
• "Robustness is impractical" (it is practical)
• "Robust code will be too slow" (it need not be)
• "We already know how to do it, thank you very much" (perhaps they don't)

Conclusions
• The big future problem for "near-stationary" robustness isn't technology -- it is awareness & training

Ballista Software Testing Overview

[Figure: a module under test maps an input space (valid and invalid inputs) onto a response space; for robust operation, valid inputs should work and invalid inputs should return an error, while other outcomes are classified against the specified behavior as reproducible failures, unreproducible failures, or undefined behavior.]

Abstracts testing to the API/data type level
• Most test cases are exceptional
• Test cases based on best-practice SW testing methodology

Ballista: Test Generation (fine grain testing)

Tests developed per data type/subtype; scalable via composition (a minimal sketch of the idea follows).

Initial Results: Most APIs Weren't Robust

Unix & Windows systems had poor robustness scores:
• 24% to 48% of intentionally exceptional Unix tests yielded non-robust results
• Found simple "system killer" programs in Unix, Win 95/98/ME, and WinCE

Even critical systems were far from perfectly robust
• Safety critical operating systems
• DoD HLA (where the stated goal was 0% robustness failures!)

Developer reactions varied, but were often extreme
• Organizations emphasizing field reliability often wanted 100% robustness
• Organizations emphasizing development often said "core dumps are the Right Thing"
• Some people didn't care
• Some people sent hate mail

Comparing Fifteen POSIX Operating Systems

[Chart: Ballista robustness tests of 233 POSIX function calls on AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF 1 3.2, OSF 1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5; normalized failure rates run from roughly 0% to 25%, split into Abort and Restart failures, with one Catastrophic failure each flagged for HP-UX 10.20, Irix 6.2, LynxOS 2.4.0, and OSF 1 3.2, and two for QNX 4.22.]

Robustness: C Library vs. System Calls

[Chart: the same fifteen operating systems and failure-rate scale, with each bar split into the portion of failures due to system calls and the portion due to the C library.]

Estimated N-Version Comparison Results

[Chart: normalized failure rate by operating system after N-version comparison analysis, roughly 0% to 50%, broken down into Abort %, Silent %, Restart %, and Catastrophic for the same fifteen OS versions.]

Failure Rates by Function Group

[Chart: robustness failure rates broken out by POSIX function group across the fifteen operating systems.]

Even Those Who Cared Didn't Get It Right

OS vendors didn't accomplish their stated objectives, e.g.:
• IBM/AIX wanted few Aborts, but had 21% Aborts on POSIX tests
• FreeBSD said they would always Abort on exception (that's the Right Thing), but had more Silent (unreported) exceptions than AIX!
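To make the composition concrete, here is a minimal sketch of Ballista-style test generation in C, assuming a POSIX system; the value pools and the memset harness are illustrative assumptions, not the actual Ballista test dictionaries, and error handling is omitted. Each parameter data type contributes a short list of mostly exceptional values, and every combination becomes one test case, run in a child process so that a crash cannot take down the harness.

    /* Sketch: per-data-type value pools composed into test cases. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static char buf[64];
    /* Hypothetical pools: one per parameter data type. */
    static void *ptr_pool[]  = { NULL, (void *)1, buf };
    static size_t len_pool[] = { 0, 1, sizeof buf, (size_t)-1 };

    int main(void) {
        for (size_t i = 0; i < sizeof ptr_pool / sizeof ptr_pool[0]; i++) {
            for (size_t j = 0; j < sizeof len_pool / sizeof len_pool[0]; j++) {
                pid_t pid = fork();
                if (pid == 0) {                        /* child: one test case */
                    memset(ptr_pool[i], 0, len_pool[j]); /* function under test */
                    _exit(0);                          /* survived: no Abort */
                }
                int status;
                waitpid(pid, &status, 0);
                printf("memset(%p, 0, %zu) -> %s\n", ptr_pool[i], len_pool[j],
                       WIFSIGNALED(status) ? "Abort (killed by signal)"
                                           : "passed / silent");
            }
        }
        return 0;
    }

The scalability comes from the composition: adding a new function under test costs only a mapping from its parameters to existing data-type pools, and adding one new exceptional value to a pool automatically exercises every function that uses that type.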
• Vendors who said their results would improve dramatically on the next release were usually wrong

Safe Fast I/O (SFIO) library
• Ballista found that it wasn't as safe as the authors thought
  – Missed: valid file checks; modes vs. permissions; buffer size/accessibility

Do people understand what is going on?
• We found four widely held misconceptions that prevented improvement in code robustness

#1: "Ballista will never find anything (important)"

1. Variant: "Robustness doesn't matter"
• In the upgrade from Version 9 to 10, HP-UX gained a system-killer in newly re-written memory management functions…
  – …which had a 100% failure rate under Ballista testing
• So, robustness seems to matter!

2. Variant: "The problems you're looking for are too trivial -- we don't make those kinds of mistakes"
• HLA had a handful of functions that were very non-robust
• SFIO even missed some "easy" checks
• See the Unix data to the right…

[Chart: robustness failure rates from 0% to 100% by function group — Clocks & Timers, C Library, Process Primitives, System Databases, Process Environment, Device/Class-Specific, Files & Dirs, I/O Primitives, Messaging, Memory Management, Scheduling, Synchronization — across the fifteen POSIX operating systems.]

#2: "100% robustness is impractical"

The use of a metric – in our case Ballista – allowed us to remove all detectable robustness failures from SFIO and other API subsets
• (Our initial SFIO results weren't entirely zero; but now they are)

[Chart: abort failure rates for select functions (open, write, read, close, fileno, seek, sfputc, sfgetc), comparing STDIO, original SFIO, and robust SFIO; robust SFIO shows 0% abort failures for every function.]

Can Even Be Done With "Ordinary" API

Memory & semaphore robustness improved for Linux
• Robustness hardening yielded 0% failure rate on standard POSIX calls (a sketch of the wrapper idea follows)

[Chart: failure rates for the original Linux calls — memchr, memcmp, memcpy, memmove, memset in the memory manipulation module, and sem_init, sem_destroy, sem_getvalue, sem_post, sem_trywait, sem_wait in the process synchronization module — ranging from roughly 7% to 75%; all failure rates are 0% after hardening.]
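To illustrate the kind of hardening involved, here is a minimal sketch of a wrapper in C, assuming POSIX; robust_memcpy and the pipe-probe helper are my illustrative names and technique, not the code used in the Linux hardening work (which used the cached validity checks described under misconception #3).

    /* Sketch of a hardening wrapper: validate arguments, then
     * delegate to the original call.  The pipe-based probe relies on
     * write(2) returning -1/EFAULT, rather than faulting, when handed
     * an unreadable buffer.  Error handling is abbreviated. */
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    static int mem_readable(const void *p, size_t n) {
        int fd[2];
        if (p == NULL || pipe(fd) != 0)
            return 0;
        /* Probe a bounded prefix so the write cannot block. */
        ssize_t r = write(fd[1], p, n < 512 ? n : 512);
        close(fd[0]);
        close(fd[1]);
        return r >= 0;
    }

    void *robust_memcpy(void *dst, const void *src, size_t n) {
        if (n == 0)
            return dst;       /* nothing to copy; trivially OK */
        if (dst == NULL || src == NULL || !mem_readable(src, n)) {
            errno = EINVAL;   /* report instead of dereferencing garbage */
            return NULL;
        }
        /* (A fuller version would also probe dst for writability.) */
        return memcpy(dst, src, n);
    }

Probing on every call is what makes naive hardening look slow; the point of the following slides is that caching the result of such checks removes nearly all of that cost.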
#3: "It will be too slow"

Solved via caching validity checks
• Completely software-implemented cache for checking validity

[Figure: on each reference, the wrapper looks up the object in a cache structure in memory; on a miss, a verification module performs the full validity check and stores the result; cache entries are cleared/invalidated when the object changes.]

• Check validity once, remember result
  – Invalidate the cached check when necessary

Caching Speeds Up Validity Tests

Worst-case of tight loops doing nothing but "mem" calls is still fast
• L2 cache misses would dilute effects of checking overhead further

[Chart: slowdown of robust memory functions (memchr, memcpy, memcmp, memset, memmove) with tagged malloc, for buffer sizes from 16 to 4096 bytes; slowdowns range from roughly 0.95x to 1.25x.]

Future MicroArchitectures Will Help

Exception & validity check branches are highly predictable
• Compiler can structure code to assume validity/no exceptions
• Compiler can give hints to the branch predictor
• Branch predictor will quickly figure out the "valid" path even with no hints
• Predicated execution can predicate on the "unexceptional" case

Exception checks can execute in parallel with the critical path
• Superscalar units seem able to execute checks & functions concurrently
• Out-of-order execution lets checks wait for idle cycles

The future brings more speculation; more concurrency
• Exception checking is an easy target for these techniques
• Robustness is cheap and getting cheaper (if done with a view to architecture)

#4: "We Did That On Purpose"

Variant: "Nobody could reasonably do better"
• Despite the experiences with POSIX, HLA & SFIO, this one persisted
• So, we tried an experiment in self-evaluating robustness

Three experienced commercial development teams
• Components written in Java
• Each team self-rated the robustness of their component per Maxion's "CHILDREN" mnemonic-based technique
• We then Ballista-tested their (pre-report) components for robustness
• Metric: did the teams accurately predict where their robustness vulnerabilities would be?
  – They didn't have to be perfectly robust
  – They all felt they would understand the robustness tradeoffs they'd made

Self Report Results: Teams 1 and 2

They were close in their predictions
• Didn't account for some language safety features (divide by zero)
• Forgot about, or assumed the language would protect them against, NULL in A4

[Chart: measured vs. expected abort failure rates, roughly 0% to 25%, for functions A1–A4 of component A and B1–B4 of component B.]

Self Report Data: Team 3

Did not predict several failure modes
• Probably could benefit from additional training/tools

[Chart: measured vs. expected total failure rates, roughly 0% to 60%, for methods C1–C27 of component C.]

Conclusions: Ballista Project In Perspective

General testing & wrapping approach for Ballista
• Simple tests are effective(!)
  – Scalable for both testing and hardening
• Robustness tests & wrappers can be abstracted to the data type level
  – Single validation fragment per type – i.e., checkSem(), checkFP()…

Wrappers are fast (under 5% penalty) and usually 100% effective
• Successful check results can be cached to exploit locality (see the closing sketch below)
  – Typical case is an index lookup, test, and jump for checking a cache hit
  – Typical case can execute nearly "for free" in modern hardware
• After this point, it is time to worry about resource leaks, device drivers, etc.
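To pin down the "index lookup, test, and jump" fast path, here is a minimal sketch of a software validity-check cache for one data type, in the spirit of the checkSem() fragments mentioned above. The names, the direct-mapped 256-slot layout, and the placeholder expensive_check() are my illustrative assumptions, not Ballista's implementation.

    /* Sketch of a per-type validity-check cache: a direct-mapped
     * table of object addresses that have already passed the full
     * validation for this type. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_SLOTS 256
    static const void *sem_cache[CACHE_SLOTS];

    static size_t slot_of(const void *p) {
        return ((uintptr_t)p >> 4) & (CACHE_SLOTS - 1);  /* index lookup */
    }

    /* Placeholder for the real, slower per-type validation. */
    static bool expensive_check(const void *sem) {
        return sem != NULL;   /* illustrative only */
    }

    bool check_sem(const void *sem) {
        size_t i = slot_of(sem);
        if (sem_cache[i] == sem)      /* test and jump: fast path */
            return true;
        if (!expensive_check(sem))    /* slow path on first use */
            return false;
        sem_cache[i] = sem;           /* remember the result */
        return true;
    }

    /* Clear the entry when the object is destroyed or modified, so a
     * stale "valid" result cannot be reused. */
    void invalidate_sem(const void *sem) {
        size_t i = slot_of(sem);
        if (sem_cache[i] == sem)
            sem_cache[i] = NULL;
    }

On a hit, the whole check is a masked shift, a load, a compare, and one highly predictable branch, which is why the measured wrapper overhead stays small and why speculative, superscalar hardware can hide it almost entirely.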