Software Robustness Testing: A Ballista Retrospective

Phil Koopman [email protected] http://ballista.org

With contributions from: Dan Siewiorek, Kobey DeVale, John DeVale, Kim Fernsler, Dave Guttendorf, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi

Institute for Complex Engineered Systems

Overview

‹ Introduction • APIs aren’t robust (and people act as if they don’t want them to be robust!)

‹ Top 4 Reasons people give for ignoring robustness improvement
• “My API is already robust, especially for easy problems” (it’s probably not)
• “Robustness is impractical” (it is practical)
• “Robust code will be too slow” (it need not be)
• “We already know how to do it, thank you very much” (perhaps they don’t)

‹ Conclusions • The big future problem for “near-stationary” robustness isn’t technology -- it is awareness & training

Ballista Software Testing Overview

(Figure: valid inputs from the specified input space should work and invalid inputs should return an error; when passed through the module under test, observed responses fall into robust operation, reproducible failure, unreproducible failure, or undefined behavior.)

‹ Abstracts testing to the API/data type level
• Most test cases are exceptional
• Test cases based on best-practice SW testing methodology

Ballista: Test Generation (fine grain testing)

‹ Tests developed per data type/subtype; scalable via composition
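As a concrete illustration of the composition idea, here is a minimal sketch (not the actual Ballista harness) that generates test cases for write(fd, buf, len) as the cross product of per-data-type dictionaries of valid and exceptional values, running each case in a child process so a crash is recorded rather than killing the driver. The test values, the two-way pass/crash classification, and the lack of a hang timeout are simplifications of my own, not Ballista's full result categories.

```c
/* A sketch of composition-based test generation (not the actual Ballista
 * harness): test cases for write(fd, buf, len) are the cross product of
 * per-data-type dictionaries of valid and exceptional values. Each case
 * runs in a forked child so a crash is logged instead of killing the
 * driver; timeouts for detecting hangs (Restart failures) are omitted. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char page[4096];                      /* a legitimately accessible buffer */

int main(void) {
    int valid_fd = open("/dev/null", O_WRONLY);

    int    fd_cases[]  = { valid_fd, -1, 12345 };      /* open, negative, never opened */
    void  *buf_cases[] = { page, NULL, (void *)1 };    /* valid, NULL, wild pointer    */
    size_t len_cases[] = { 0, 16, (size_t)-1 };        /* zero, small, absurdly large  */

    for (size_t i = 0; i < sizeof fd_cases / sizeof fd_cases[0]; i++)
        for (size_t j = 0; j < sizeof buf_cases / sizeof buf_cases[0]; j++)
            for (size_t k = 0; k < sizeof len_cases / sizeof len_cases[0]; k++) {
                pid_t pid = fork();
                if (pid == 0) {              /* child: run exactly one test case */
                    write(fd_cases[i], buf_cases[j], len_cases[k]);
                    _exit(0);                /* survived: robust or clean error return */
                }
                int status = 0;
                waitpid(pid, &status, 0);
                printf("fd=%d buf=%p len=%zu -> %s\n",
                       fd_cases[i], buf_cases[j], len_cases[k],
                       WIFSIGNALED(status) ? "Abort failure (crashed)" : "no crash");
            }
    return 0;
}
```

Adding a new data type (say, a file pointer or semaphore) only means adding one more dictionary, which is what makes the approach scale across an API.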

Initial Results: Most APIs Weren’t Robust

‹ Unix & Windows systems had poor robustness scores:
• 24% to 48% of intentionally exceptional Unix tests yielded non-robust results
• Found simple “system killer” programs in Unix, Win 95/98/ME, and WinCE

‹ Even critical systems were far from perfectly robust
• Safety critical operating systems
• DoD HLA (where their stated goal was 0% robustness failures!)

‹ Developer reactions varied, but were often extreme
• Organizations emphasizing field reliability often wanted 100% robustness
• Organizations emphasizing development often said “core dumps are the Right Thing”
• Some people didn’t care
• Some people sent hate mail

Comparing Fifteen POSIX Operating Systems

Ballista Robustness Tests for 233 POSIX Function Calls

(Figure: normalized failure rates, 0% to 25%, split into Abort and Restart failures, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20 (1 catastrophic), Irix 5.3, Irix 6.2 (1 catastrophic), Linux 2.0.18, LynxOS 2.4.0 (1 catastrophic), NetBSD 1.3, OSF-1 3.2 (1 catastrophic), OSF-1 4.0, QNX 4.22 (2 catastrophics), QNX 4.24, SunOS 4.1.3, and SunOS 5.5.)

Robustness: C Library vs. System Calls

(Figure: portions of the normalized failure rates, 0% to 25%, attributable to system calls vs. the C library for the same fifteen operating systems; catastrophic failures noted as in the previous chart.)

Estimated N-Version Comparison Results

(Figure: normalized failure rate after analysis, 0% to 50%, by operating system tested, broken into Abort %, Silent %, and Restart %, with catastrophic failures starred, for AIX, FreeBSD, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux, LynxOS, NetBSD, OSF-1 3.2, OSF-1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5.)

Failure Rates by Function Group

Even Those Who Cared Didn’t Get It Right

‹ OS vendors didn’t accomplish their stated objectives, for example:
• IBM/AIX wanted few Aborts, but had 21% Aborts on POSIX tests
• FreeBSD said they would always Abort on exception (that’s the Right Thing), but had more Silent (unreported) exceptions than AIX!
• Vendors who said their results would improve dramatically on the next release were usually wrong

‹ Safe Fast I/O (SFIO) library
• Ballista found that it wasn’t as safe as the authors thought
– Missed: valid file checks; modes vs. permissions; buffer size/accessibility (a sketch of these kinds of checks follows below)
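For illustration only, the sketch below shows the kinds of argument checks listed above: a valid descriptor check, an access-mode (not merely file-permission) check, and a buffer accessibility probe. The function names (checked_write, fd_writable, buffer_readable) and the /dev/null probing trick are my own conveniences, not the real SFIO interface or the exact checks its authors later added.

```c
/* Hypothetical checks of the kinds SFIO initially missed (names are
 * illustrative, not the real SFIO interface). */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Is fd an open descriptor, and does its access mode permit writing? */
static int fd_writable(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return 0;                      /* not an open descriptor */
    int acc = flags & O_ACCMODE;
    return acc == O_WRONLY || acc == O_RDWR;        /* mode, not just file permissions */
}

/* Is [buf, buf+len) readable?  One common Unix probing trick: write it to
 * /dev/null, where an inaccessible buffer yields EFAULT instead of a fault. */
static int buffer_readable(const void *buf, size_t len) {
    if (buf == NULL && len > 0) return 0;
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull == -1) return 1;                    /* cannot probe; assume ok */
    ssize_t n = write(devnull, buf, len);
    int ok = !(n == -1 && errno == EFAULT);
    close(devnull);
    return ok;
}

/* A checked write that returns -1/EINVAL instead of crashing or silently
 * misbehaving on exceptional arguments. */
ssize_t checked_write(int fd, const void *buf, size_t len) {
    if (!fd_writable(fd) || !buffer_readable(buf, len)) {
        errno = EINVAL;
        return -1;
    }
    return write(fd, buf, len);
}
```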

‹ Do people understand what is going on? • We found four widely held misconceptions that prevented improvement in code robustness

#1: “Ballista will never find anything (important)”

1. “Robustness doesn’t matter”
• HP-UX gained a system-killer in the upgrade from Version 9 to 10
– In newly re-written memory management functions…
– … which had a 100% failure rate under Ballista testing
• So, robustness seems to matter!

2. “The problems you’re looking for are too trivial -- we don’t make those kinds of mistakes”
• HLA had a handful of functions that were very non-robust
• SFIO even missed some “easy” checks
• See Unix data to the right…

(Figure: robustness failure rates, 0% to 100%, by POSIX call category for AIX 4.1, FreeBSD 2.2.5, HP-UX A.09.05, HP-UX B.10.20, IRIX 5.3, IRIX 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF1 3.2C, OSF1 4.0B, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5.)

#2: “100% robustness is impractical”

‹ The use of a metric – in our case Ballista – allowed us to remove all detectable robustness failures from SFIO and other API subsets • (Our initial SFIO results weren’t entirely zero; but now they are)

Abort Failure Rate for Select Functions

(Figure: percent Abort failures for open, write, read, close, fileno, seek, sfputc, and sfgetc, comparing STDIO, original SFIO, and robust SFIO; the robust SFIO versions show 0% Abort failures.)

Can Even Be Done With “Ordinary” API

‹ Memory & semaphore robustness improved for Linux • Robustness hardening yielded 0% failure rate on standard POSIX calls below

Failure rates for original Linux memory/process calls (all failure rates are 0% after hardening)

(Figure: pre-hardening failure rates, roughly 7% to 75%, for the memory manipulation module – memchr, memcpy, memmove, memset, memcmp – and the process synchronization module – sem_init, sem_destroy, sem_getvalue, sem_post, sem_trywait, sem_wait.)
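A minimal sketch of the hardening idea applied to one of these calls, assuming an error-return convention of NULL plus EINVAL (the actual hardened library's conventions may differ): validate the exceptional cases the tests exercise, then fall through to the unmodified function.

```c
/* A minimal sketch of argument hardening for one call (not the actual
 * hardened Linux library): detect exceptional arguments, report them,
 * and only then fall through to the real memcpy. */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *robust_memcpy(void *dst, const void *src, size_t n) {
    if (n == 0) return dst;                  /* nothing to do */
    if (dst == NULL || src == NULL) {        /* exceptional: NULL pointer */
        errno = EINVAL;
        return NULL;
    }
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (d < s + n && s < d + n) {            /* exceptional: overlapping regions, */
        errno = EINVAL;                      /* for which memcpy is undefined     */
        return NULL;
    }
    return memcpy(dst, src, n);              /* common case: unchanged behavior */
}
```

The same check-then-call pattern extends to the semaphore functions in the chart above.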

#3: “It will be too slow”

‹ Solved via caching validity checks • Completely software-implemented cache for checking validity

(Figure: software cache structure in memory – a reference is looked up in the cache; on a miss, the verification module checks it and stores the result; an invalidate operation clears the entry.)

• Check validity once, remember result (see the sketch below)
– Invalidate validity check when necessary
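A minimal sketch of the check-once-and-cache idea, assuming a small direct-mapped table keyed on the object's address; the sizes, names, and placeholder deep_validate() are illustrative, not the actual implementation.

```c
/* Sketch of a software validity cache: validate an object once, remember
 * the result keyed on its address, and invalidate the entry when the
 * object is destroyed or modified. */
#include <stddef.h>
#include <stdint.h>

#define VCACHE_SLOTS 256                     /* direct-mapped, power of two */
static const void *vcache[VCACHE_SLOTS];     /* last-validated object per slot */

static size_t vslot(const void *p) {
    return ((uintptr_t)p >> 4) & (VCACHE_SLOTS - 1);
}

/* Placeholder for the expensive per-type check run only on a miss. */
static int deep_validate(const void *obj) {
    return obj != NULL;                      /* stand-in; a real check inspects the object */
}

int check_valid(const void *obj) {
    if (obj == NULL) return 0;
    size_t s = vslot(obj);
    if (vcache[s] == obj) return 1;          /* hit: index lookup, test, and done */
    if (!deep_validate(obj)) return 0;       /* miss: run the full check */
    vcache[s] = obj;                         /* remember the successful check */
    return 1;
}

/* Called when the object is freed or mutated in a way that voids the check. */
void invalidate_valid(const void *obj) {
    size_t s = vslot(obj);
    if (vcache[s] == obj) vcache[s] = NULL;
}
```

The fast path is exactly the “index lookup, test and jump” pattern noted in the conclusions; a free()/destroy wrapper would call invalidate_valid() so stale entries never satisfy a later lookup.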

Caching Speeds Up Validity Tests

‹ Worst-case of tight loops doing nothing but “mem” calls is still fast • L2 Cache misses would dilute effects of checking overhead further

Slowdown of robust memory functions with tagged malloc

(Figure: slowdown of memchr, memcpy, memcmp, memset, and memmove plotted against buffer size from 16 to 4096 bytes; the slowdown axis spans roughly 0.95x to 1.25x.)

Future MicroArchitectures Will Help

‹ Exception & validity check branches are highly predictable
• Compiler can structure code to assume validity/no exceptions
• Compiler can give hints to branch predictor (see the sketch below)
• Branch predictor will quickly figure out the “valid” path even with no hints
• Predicated execution can predicate on “unexceptional” case

‹ Exception checks can execute in parallel with critical path
• Superscalar units seem able to execute checks & functions concurrently
• Out of order execution lets checks wait for idle cycles

‹ The future brings more speculation; more concurrency
• Exception checking is an easy target for these techniques
• Robustness is cheap and getting cheaper (if done with a view to architecture)
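For instance, a check can be written so the compiler and branch predictor treat the exceptional path as cold. The sketch below uses the GCC/Clang __builtin_expect hint; the wrapper name is illustrative.

```c
/* Sketch: structuring a check so the predicted, unexceptional path stays
 * cheap. __builtin_expect is a GCC/Clang hint; hinted_memset is an
 * illustrative wrapper name. */
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define UNLIKELY(x) __builtin_expect(!!(x), 0)

void *hinted_memset(void *dst, int c, size_t n) {
    if (UNLIKELY(dst == NULL && n != 0)) {   /* exception path: rarely taken,     */
        errno = EINVAL;                      /* predicted not-taken, and placed   */
        return NULL;                         /* off the hot path by the compiler  */
    }
    return memset(dst, c, n);                /* common case falls straight through */
}
```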

#4: “We Did That On Purpose”

‹ Variant: “Nobody could reasonably do better”
• Despite the experiences with POSIX, HLA & SFIO, this one persisted
• So, we tried an experiment in self-evaluating robustness

‹ Three experienced commercial development teams
• Components written in Java
• Each team self-rated the robustness of their component per Maxion’s “CHILDREN” mnemonic-based technique
• We then Ballista tested their (pre-report) components for robustness

• Metric: did the teams accurately predict where their robustness vulnerabilities would be?
– They didn’t have to be perfectly robust
– They all felt they would understand the robustness tradeoffs they’d made

Self Report Results: Teams 1 and 2

‹ They were close in their prediction
• Didn’t account for some language safety features (divide by zero)
• Forgot about, or assumed the language would protect them against, NULL in A4

Component A and B Robustness

(Figure: measured vs. expected Abort failure rates, 0% to 25%, for functions A1–A4 and B1–B4.)

Self Report Data: Team 3

‹ Did not predict several failure modes • Probably could benefit from additional training/tools

Component C Total Failure Rate

(Figure: measured vs. expected total failure rates, 0% to 60%, by method designation.)

Conclusions: Ballista Project In Perspective

‹ General testing & wrapping approach for Ballista
• Simple tests are effective(!)
– Scalable for both testing and hardening
• Robustness tests & wrappers can be abstracted to the data type level
– Single validation fragment per type – e.g., checkSem(), checkFP()… (a sketch of one such fragment follows below)
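A minimal sketch of what a single shared validation fragment might look like for the semaphore type, assuming a registry filled in by a wrapped sem_init; the registry scheme, sizes, and wrapper names are illustrative rather than the actual Ballista wrapper code, and locking around the registry is omitted for brevity.

```c
/* Sketch of one per-type validation fragment (checkSem) reused by every
 * wrapped semaphore call. Registry scheme and names are illustrative. */
#include <semaphore.h>
#include <stddef.h>

#define MAX_SEMS 128
static sem_t *known_sems[MAX_SEMS];          /* recorded by the sem_init wrapper */

int checkSem(sem_t *s) {
    if (s == NULL) return 0;
    for (int i = 0; i < MAX_SEMS; i++)
        if (known_sems[i] == s) return 1;    /* initialized through the wrapper */
    return 0;                                /* unknown: reject robustly */
}

int robust_sem_init(sem_t *s, int pshared, unsigned value) {
    if (s == NULL) return -1;
    if (sem_init(s, pshared, value) != 0) return -1;
    for (int i = 0; i < MAX_SEMS; i++)
        if (known_sems[i] == NULL) { known_sems[i] = s; return 0; }
    return -1;                               /* registry full; sketch skips cleanup */
}

/* Every other sem_* wrapper reuses the same one-line fragment. */
int robust_sem_wait(sem_t *s) {
    if (!checkSem(s)) return -1;             /* robust error return instead of a crash */
    return sem_wait(s);
}
```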

‹ Wrappers are fast (under 5% penalty) and usually 100% effective
• Successful check results can be cached to exploit locality
– Typical case is an index lookup, test and jump for checking cache hit
– Typical case can execute nearly “for free” in modern hardware
• After this point, it is time to worry about resource leaks, device drivers, etc.

‹ But, a technical solution alone is not sufficient
• Case study of self-report data
– Some developers unable to predict code response to exceptions
• Training/tools needed to bridge gap
– Even seasoned developers need a QA tool to keep them honest
– Stand-alone Ballista tests for Unix under GPL; Windows XP soon

Future Research Challenges In The Large

‹ Quantifying “software aging” effects
• Simple, methodical tests for resource leaks (see the sketch below)
– Single-threaded, multi-threaded, and distributed code all have different issues
– One problem is multi-thread contention for non-reentrant resources
» e.g., exception handling data structures without semaphore protection
• Measurement & warning systems for need for SW rejuvenation
– Much previous work in predictive models
– Can we create an on-line monitor to advise when it is time to reboot?
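As one example of such a simple, methodical test, the sketch below counts open file descriptors before and after exercising an operation many times; the operation shown, the descriptor limit of 1024, and the function names are placeholders, not a proposed standard probe.

```c
/* Sketch of a methodical resource-leak probe: count open file descriptors,
 * exercise the operation under test many times, and compare. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int open_fd_count(void) {
    int count = 0;
    for (int fd = 0; fd < 1024; fd++)
        if (fcntl(fd, F_GETFD) != -1) count++;   /* descriptor is currently open */
    return count;
}

static void operation_under_test(void) {
    int fd = open("/dev/null", O_RDONLY);        /* replace with the call being audited */
    if (fd != -1) close(fd);                     /* "forgetting" this close would show a leak */
}

int main(void) {
    int before = open_fd_count();
    for (int i = 0; i < 10000; i++) operation_under_test();
    int after = open_fd_count();
    printf("fds before=%d after=%d leaked=%d\n", before, after, after - before);
    return 0;
}
```

The same before/after pattern applies to memory, threads, or semaphores, though multi-threaded and distributed cases need more care, as noted above.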

‹ Understanding robustness tradeoffs from developer point of view
• Tools to provide predictable tradeoff of effort vs. robustness
– QA techniques to ensure that desired goal is reached
– Ability to specify robustness level clearly, even if “perfection” is not desired
• Continued research in enabling ordinary developers to write robust code
• Need to address different needs for development vs. deployment
– Developers want heavy-weight notification of unexpected exceptions
– In the field, may want a more benign reaction to exceptions