Software Robustness Testing: A Ballista Retrospective
Phil Koopman [email protected] http://ballista.org
With contributions from: Dan Siewiorek, Kobey DeVale, John DeVale, Kim Fernsler, Dave Guttendorf, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi
Institute for Complex Engineered Systems

Overview
Introduction • APIs aren’t robust (and people act as if they don’t want them to be robust!)
Top 4 Reasons people give for ignoring robustness improvement • “My API is already robust, especially for easy problems” (it’s probably not) • “Robustness is impractical” (it is practical) • “Robust code will be too slow” (it need not be) • “We already know how to do it, thank you very much” (perhaps they don’t)
Conclusions • The big future problem for “near-stationary” robustness isn’t technology -- it is awareness & training
2 Ballista Software Testing Overview
[Diagram: Ballista behavior space. The input space divides into valid and invalid inputs; the response space into specified and unspecified behavior. Valid inputs to the module under test should work, yielding robust operation; invalid inputs should return an error, but may instead produce reproducible failures, unreproducible failures, or undefined behavior.]
Abstracts testing to the API/Data type level • Most test cases are exceptional • Test cases based on best-practice SW testing methodology
3 Ballista: Test Generation (fine grain testing)
Tests developed per data type/subtype; scalable via composition
4 Initial Results: Most APIs Weren’t Robust
Unix & Windows systems had poor robustness scores: • 24% to 48% of intentionally exceptional Unix tests yielded non-robust results • Found simple “system killer” programs in Unix, Win 95/98/ME, and WinCE
Even critical systems were far from perfectly robust • Safety critical operating systems • DoD HLA (where their stated goal was 0% robustness failures!)
Developer reactions varied, but were often extreme • Organizations emphasizing field reliability often wanted 100% robustness • Organizations emphasizing development often said “core dumps are the Right Thing” • Some people didn’t care • Some people sent hate mail
5 Comparing Fifteen POSIX Operating Systems: Ballista Robustness Tests for 233 POSIX Function Calls
[Chart: normalized failure rates (0%–25%), split into Abort failures and Restart failures, for fifteen operating systems: AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF-1 3.2, OSF-1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, SunOS 5.5. HP-UX 10.20, Irix 6.2, LynxOS 2.4.0, and OSF-1 3.2 each had 1 catastrophic failure; QNX 4.22 had 2.]
6 Robustness: C Library vs. System Calls

[Chart: portions of each operating system's normalized failure rate (0%–25%) attributable to system calls vs. the C library, for the same fifteen operating systems and catastrophic-failure annotations as the previous chart.]
7 Estimated N-Version Comparison Results
[Chart: normalized failure rate by operating system after analysis (0%–50%), broken into Abort %, Silent %, and Restart %, for the fifteen systems tested; asterisks mark catastrophic failures.]
8 Failure Rates by Function Group
9 Even Those Who Cared Didn’t Get It Right
OS vendors didn’t accomplish their stated objectives, for example: • IBM/AIX wanted few Aborts, but had 21% Aborts on POSIX tests • FreeBSD said it would always Abort on exception (that’s the Right Thing) but had more Silent (unreported) exceptions than AIX! • Vendors who said their results would improve dramatically in the next release were usually wrong
Safe Fast I/O (SFIO) library • Ballista found that it wasn’t as safe as the authors thought – Missed: valid file checks; modes vs. permissions; buffer size/accessibility
Do people understand what is going on? • We found four widely held misconceptions that prevented improvement in code robustness
10 #1: “Ballista will never find anything (important)”
1. “Robustness doesn’t matter”
• HP-UX gained a system-killer in the upgrade from Version 9 to 10
– In newly re-written memory management functions…
– … which had a 100% failure rate under Ballista testing
• So, robustness seems to matter!

2. “The problems you’re looking for are too trivial -- we don’t make those kinds of mistakes”
• HLA had a handful of functions that were very non-robust
• SFIO even missed some “easy” checks
• See Unix data to the right…

[Chart: robustness failure rates (0%–100%) by POSIX function group for each of the fifteen operating systems tested.]
11 #2: “100% robustness is impractical”
The use of a metric – in our case Ballista – allowed us to remove all detectable robustness failures from SFIO and other API subsets • (Our initial SFIO results weren’t entirely zero; but now they are)
[Chart: Abort failure rates for selected functions (open, write, read, close, fileno, seek, sfputc, sfgetc), comparing STDIO, original SFIO, and robust SFIO; robust SFIO shows 0% Abort failures on every function.]
12 Can Even Be Done With “Ordinary” API
Memory & semaphore robustness improved for Linux • Robustness hardening yielded 0% failure rate on standard POSIX calls below
Failure rates for memory/process original Linux calls (All failure rates are 0% after hardening)
[Chart: pre-hardening failure rates, ranging from 7.3% to 75.2%, for the Linux memory manipulation functions (memchr, memcpy, memcmp, memset, memmove) and process synchronization (sem_*) calls tested; all drop to 0% after hardening.]
13 #3: “It will be too slow”
Solved via caching validity checks • Completely software-implemented cache for checking validity
[Diagram: validity-check cache. An incoming reference is looked up in an in-memory cache structure; on a miss, the verification module checks validity and stores the result; clear/invalidate operations remove entries that become stale.]
• Check validity once, remember result – Invalidate validity check when necessary
14 Caching Speeds Up Validity Tests
Worst-case of tight loops doing nothing but “mem” calls is still fast • L2 Cache misses would dilute effects of checking overhead further
[Chart: slowdown of robust memory functions (memchr, memcpy, memcmp, memset, memmove) with tagged malloc, versus buffer sizes from 16 to 4096 bytes; worst-case slowdown is roughly 1.25×, falling toward 1× for larger buffers.]

15 Future MicroArchitectures Will Help
Exception & validity check branches are highly predictable • Compiler can structure code to assume validity/no exceptions • Compiler can give hints to branch predictor • Branch predictor will quickly figure out the “valid” path even with no hints • Predicated execution can predicate on “unexceptional” case
Exception checks can execute in parallel with critical path • Superscalar units seem able to execute checks & functions concurrently • Out of order execution lets checks wait for idle cycles
The future brings more speculation; more concurrency • Exception checking is an easy target for these techniques • Robustness is cheap and getting cheaper (if done with a view to architecture)
16 #4: “We Did That On Purpose”
Variant: “Nobody could reasonably do better” • Despite the experiences with POSIX, HLA & SFIO, this one persisted • So, we tried an experiment in self-evaluating robustness
Three experienced commercial development teams • Components written in Java • Each team self-rated the robustness of their component per Maxion’s “CHILDREN” mnemonic-based technique • We then Ballista tested their (pre-report) components for robustness
• Metric: did the teams accurately predict where their robustness vulnerabilities would be? – They didn’t have to be perfectly robust – They all felt they would understand the robustness tradeoffs they’d made
17 Self Report Results: Teams 1 and 2
Their predictions were close • Didn’t account for some language safety features (divide by zero) • Forgot about, or assumed the language would protect them against, NULL in A4
[Chart: Component A and B robustness. Measured vs. expected Abort failure rates (0%–25%) for functions A1–A4 and B1–B4.]

18 Self Report Data: Team 3
Did not predict several failure modes • Probably could benefit from additional training/tools
[Chart: Component C total failure rate. Measured vs. expected total failure rates (0%–60%) for each method tested.]

19 Conclusions: Ballista Project In Perspective
General testing & wrapping approach for Ballista • Simple tests are effective(!) – Scalable for both testing and hardening • Robustness tests & wrappers can be abstracted to the data type level – Single validation fragment per type – i.e. checkSem(), checkFP()…
Wrappers are fast (under 5% penalty) and usually 100% effective • Successful check results can be cached to exploit locality – Typical case is an index lookup, test and jump for checking cache hit – Typical case can execute nearly “for free” in modern hardware • After this point, it is time to worry about resource leaks, device drivers, etc.
But, technical solution alone is not sufficient • Case study of self-report data – Some developers unable to predict code response to exceptions • Training/tools needed to bridge gap – Even seasoned developers need a QA tool to keep them honest – Stand-alone Ballista tests for Unix under GPL; Windows XP soon
20 Future Research Challenges In The Large
Quantifying “software aging” effects
• Simple, methodical tests for resource leaks
– Single-threaded, multi-threaded, distributed all have different issues
– One problem is multi-thread contention for non-reentrant resources
» e.g., exception handling data structures without semaphore protection
• Measurement & warning systems for need for SW rejuvenation
– Much previous work in predictive models
– Can we create an on-line monitor to advise it is time to reboot?
Understanding robustness tradeoffs from developer point of view
• Tools to provide predictable tradeoff of effort vs. robustness
– QA techniques to ensure that desired goal is reached
– Ability to specify robustness level clearly, even if “perfection” is not desired
• Continued research in enabling ordinary developers to write robust code
• Need to address different needs for development vs. deployment
– Developers want heavy-weight notification of unexpected exceptions
– In the field, may want a more benign reaction to exceptions