Software Robustness Testing: A Ballista Retrospective

Phil Koopman [email protected] http://ballista.org

With contributions from: Dan Siewiorek, Kobey DeVale, John DeVale, Kim Fernsler, Dave Guttendorf, Nathan Kropp, Jiantao Pan, Charles Shelton, Ying Shi

Institute for Complex Engineered Systems

Overview

‹ Introduction • APIs aren’t robust (and people act as if they don’t want them to be robust!)

‹ Top 4 Reasons people give for ignoring robustness improvement
• “My API is already robust, especially for easy problems” (it’s probably not)
• “Robustness is impractical” (it is practical)
• “Robust code will be too slow” (it need not be)
• “We already know how to do it, thank you very much” (perhaps they don’t)

‹ Conclusions • The big future problem for “near-stationary” robustness isn’t technology -- it is awareness & training

Ballista Software Testing Overview

(Figure: valid inputs from the specified input space should work and invalid inputs should return an error; when passed through the module under test, observed responses fall into robust operation, reproducible failure, unreproducible failure, or undefined behavior.)

‹ Abstracts testing to the API/data type level
• Most test cases are exceptional
• Test cases based on best-practice SW testing methodology

Ballista: Test Generation (fine grain testing)

‹ Tests developed per data type/subtype; scalable via composition
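As a concrete illustration of the composition idea, here is a minimal sketch (not the actual Ballista harness) that generates test cases for write(fd, buf, len) as the cross product of per-data-type dictionaries of valid and exceptional values, running each case in a child process so a crash is recorded rather than killing the driver. The test values, the two-way pass/crash classification, and the lack of a hang timeout are simplifications of my own, not Ballista's full result categories.

```c
/* A sketch of composition-based test generation (not the actual Ballista
 * harness): test cases for write(fd, buf, len) are the cross product of
 * per-data-type dictionaries of valid and exceptional values. Each case
 * runs in a forked child so a crash is logged instead of killing the
 * driver; timeouts for detecting hangs (Restart failures) are omitted. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char page[4096];                      /* a legitimately accessible buffer */

int main(void) {
    int valid_fd = open("/dev/null", O_WRONLY);

    int    fd_cases[]  = { valid_fd, -1, 12345 };      /* open, negative, never opened */
    void  *buf_cases[] = { page, NULL, (void *)1 };    /* valid, NULL, wild pointer    */
    size_t len_cases[] = { 0, 16, (size_t)-1 };        /* zero, small, absurdly large  */

    for (size_t i = 0; i < sizeof fd_cases / sizeof fd_cases[0]; i++)
        for (size_t j = 0; j < sizeof buf_cases / sizeof buf_cases[0]; j++)
            for (size_t k = 0; k < sizeof len_cases / sizeof len_cases[0]; k++) {
                pid_t pid = fork();
                if (pid == 0) {              /* child: run exactly one test case */
                    write(fd_cases[i], buf_cases[j], len_cases[k]);
                    _exit(0);                /* survived: robust or clean error return */
                }
                int status = 0;
                waitpid(pid, &status, 0);
                printf("fd=%d buf=%p len=%zu -> %s\n",
                       fd_cases[i], buf_cases[j], len_cases[k],
                       WIFSIGNALED(status) ? "Abort failure (crashed)" : "no crash");
            }
    return 0;
}
```

Adding a new data type (say, a file pointer or semaphore) only means adding one more dictionary, which is what makes the approach scale across an API.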

Initial Results: Most APIs Weren’t Robust

‹ Unix & Windows systems had poor robustness scores:
• 24% to 48% of intentionally exceptional Unix tests yielded non-robust results
• Found simple “system killer” programs in Unix, Win 95/98/ME, and WinCE

‹ Even critical systems were far from perfectly robust
• Safety critical operating systems
• DoD HLA (where their stated goal was 0% robustness failures!)

‹ Developer reactions varied, but were often extreme
• Organizations emphasizing field reliability often wanted 100% robustness
• Organizations emphasizing development often said “core dumps are the Right Thing”
• Some people didn’t care
• Some people sent hate mail

Comparing Fifteen POSIX Operating Systems

Ballista Robustness Tests for 233 POSIX Function Calls

(Figure: normalized failure rates, 0% to 25%, split into Abort and Restart failures, for AIX 4.1, FreeBSD 2.2.5, HP-UX 9.05, HP-UX 10.20 (1 catastrophic), Irix 5.3, Irix 6.2 (1 catastrophic), Linux 2.0.18, LynxOS 2.4.0 (1 catastrophic), NetBSD 1.3, OSF-1 3.2 (1 catastrophic), OSF-1 4.0, QNX 4.22 (2 catastrophics), QNX 4.24, SunOS 4.1.3, and SunOS 5.5.)

Robustness: C Library vs. System Calls

(Figure: portions of the normalized failure rates, 0% to 25%, attributable to system calls vs. the C library for the same fifteen operating systems; catastrophic failures noted as in the previous chart.)

Estimated N-Version Comparison Results

(Figure: normalized failure rate after analysis, 0% to 50%, by operating system tested, broken into Abort %, Silent %, and Restart %, with catastrophic failures starred, for AIX, FreeBSD, HP-UX 9.05, HP-UX 10.20, Irix 5.3, Irix 6.2, Linux, LynxOS, NetBSD, OSF-1 3.2, OSF-1 4.0, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5.)

Failure Rates by Function Group

Even Those Who Cared Didn’t Get It Right

‹ OS vendors didn’t accomplish their stated objectives, for example:
• IBM/AIX wanted few Aborts, but had 21% Aborts on POSIX tests
• FreeBSD said they would always Abort on exception (that’s the Right Thing), but had more Silent (unreported) exceptions than AIX!
• Vendors who said their results would improve dramatically on the next release were usually wrong

‹ Safe Fast I/O (SFIO) library
• Ballista found that it wasn’t as safe as the authors thought
– Missed: valid file checks; modes vs. permissions; buffer size/accessibility (a sketch of these kinds of checks follows below)
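For illustration only, the sketch below shows the kinds of argument checks listed above: a valid descriptor check, an access-mode (not merely file-permission) check, and a buffer accessibility probe. The function names (checked_write, fd_writable, buffer_readable) and the /dev/null probing trick are my own conveniences, not the real SFIO interface or the exact checks its authors later added.

```c
/* Hypothetical checks of the kinds SFIO initially missed (names are
 * illustrative, not the real SFIO interface). */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Is fd an open descriptor, and does its access mode permit writing? */
static int fd_writable(int fd) {
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return 0;                      /* not an open descriptor */
    int acc = flags & O_ACCMODE;
    return acc == O_WRONLY || acc == O_RDWR;        /* mode, not just file permissions */
}

/* Is [buf, buf+len) readable?  One common Unix probing trick: write it to
 * /dev/null, where an inaccessible buffer yields EFAULT instead of a fault. */
static int buffer_readable(const void *buf, size_t len) {
    if (buf == NULL && len > 0) return 0;
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull == -1) return 1;                    /* cannot probe; assume ok */
    ssize_t n = write(devnull, buf, len);
    int ok = !(n == -1 && errno == EFAULT);
    close(devnull);
    return ok;
}

/* A checked write that returns -1/EINVAL instead of crashing or silently
 * misbehaving on exceptional arguments. */
ssize_t checked_write(int fd, const void *buf, size_t len) {
    if (!fd_writable(fd) || !buffer_readable(buf, len)) {
        errno = EINVAL;
        return -1;
    }
    return write(fd, buf, len);
}
```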

‹ Do people understand what is going on? • We found four widely held misconceptions that prevented improvement in code robustness

#1: “Ballista will never find anything (important)”

1. “Robustness doesn’t matter”
• HP-UX gained a system-killer in the upgrade from Version 9 to 10
– In newly re-written memory management functions…
– … which had a 100% failure rate under Ballista testing
• So, robustness seems to matter!

2. “The problems you’re looking for are too trivial -- we don’t make those kinds of mistakes”
• HLA had a handful of functions that were very non-robust
• SFIO even missed some “easy” checks
• See Unix data to the right…

(Figure: robustness failure rates, 0% to 100%, by POSIX call category for AIX 4.1, FreeBSD 2.2.5, HP-UX A.09.05, HP-UX B.10.20, IRIX 5.3, IRIX 6.2, Linux 2.0.18, LynxOS 2.4.0, NetBSD 1.3, OSF1 3.2C, OSF1 4.0B, QNX 4.22, QNX 4.24, SunOS 4.1.3, and SunOS 5.5.)

#2: “100% robustness is impractical”

‹ The use of a metric – in our case Ballista – allowed us to remove all detectable robustness failures from SFIO and other API subsets • (Our initial SFIO results weren’t entirely zero; but now they are)

Abort Failure Rate for Select Functions

(Figure: percent Abort failures for open, write, read, close, fileno, seek, sfputc, and sfgetc, comparing STDIO, original SFIO, and robust SFIO; the robust SFIO versions show 0% Abort failures.)

Can Even Be Done With “Ordinary” API

‹ Memory & semaphore robustness improved for Linux • Robustness hardening yielded 0% failure rate on standard POSIX calls below

Failure rates for original Linux memory/process calls (all failure rates are 0% after hardening)

(Figure: pre-hardening failure rates, roughly 7% to 75%, for the memory manipulation module – memchr, memcpy, memmove, memset, memcmp – and the process synchronization module – sem_init, sem_destroy, sem_getvalue, sem_post, sem_trywait, sem_wait.)
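A minimal sketch of the hardening idea applied to one of these calls, assuming an error-return convention of NULL plus EINVAL (the actual hardened library's conventions may differ): validate the exceptional cases the tests exercise, then fall through to the unmodified function.

```c
/* A minimal sketch of argument hardening for one call (not the actual
 * hardened Linux library): detect exceptional arguments, report them,
 * and only then fall through to the real memcpy. */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *robust_memcpy(void *dst, const void *src, size_t n) {
    if (n == 0) return dst;                  /* nothing to do */
    if (dst == NULL || src == NULL) {        /* exceptional: NULL pointer */
        errno = EINVAL;
        return NULL;
    }
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;
    if (d < s + n && s < d + n) {            /* exceptional: overlapping regions, */
        errno = EINVAL;                      /* for which memcpy is undefined     */
        return NULL;
    }
    return memcpy(dst, src, n);              /* common case: unchanged behavior */
}
```

The same check-then-call pattern extends to the semaphore functions in the chart above.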

#3: “It will be too slow”

‹ Solved via caching validity checks • Completely software-implemented cache for checking validity

(Figure: software cache structure in memory – a reference is looked up in the cache; on a miss, the verification module checks it and stores the result; an invalidate operation clears the entry.)

• Check validity once, remember result (see the sketch below)
– Invalidate validity check when necessary
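A minimal sketch of the check-once-and-cache idea, assuming a small direct-mapped table keyed on the object's address; the sizes, names, and placeholder deep_validate() are illustrative, not the actual implementation.

```c
/* Sketch of a software validity cache: validate an object once, remember
 * the result keyed on its address, and invalidate the entry when the
 * object is destroyed or modified. */
#include <stddef.h>
#include <stdint.h>

#define VCACHE_SLOTS 256                     /* direct-mapped, power of two */
static const void *vcache[VCACHE_SLOTS];     /* last-validated object per slot */

static size_t vslot(const void *p) {
    return ((uintptr_t)p >> 4) & (VCACHE_SLOTS - 1);
}

/* Placeholder for the expensive per-type check run only on a miss. */
static int deep_validate(const void *obj) {
    return obj != NULL;                      /* stand-in; a real check inspects the object */
}

int check_valid(const void *obj) {
    if (obj == NULL) return 0;
    size_t s = vslot(obj);
    if (vcache[s] == obj) return 1;          /* hit: index lookup, test, and done */
    if (!deep_validate(obj)) return 0;       /* miss: run the full check */
    vcache[s] = obj;                         /* remember the successful check */
    return 1;
}

/* Called when the object is freed or mutated in a way that voids the check. */
void invalidate_valid(const void *obj) {
    size_t s = vslot(obj);
    if (vcache[s] == obj) vcache[s] = NULL;
}
```

The fast path is exactly the “index lookup, test and jump” pattern noted in the conclusions; a free()/destroy wrapper would call invalidate_valid() so stale entries never satisfy a later lookup.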

Caching Speeds Up Validity Tests

‹ Worst-case of tight loops doing nothing but “mem” calls is still fast • L2 Cache misses would dilute effects of checking overhead further

Slowdown of robust memory functions with tagged malloc

(Figure: slowdown of memchr, memcpy, memcmp, memset, and memmove plotted against buffer size from 16 to 4096 bytes; the slowdown axis spans roughly 0.95x to 1.25x.)

Future MicroArchitectures Will Help

‹ Exception & validity check branches are highly predictable
• Compiler can structure code to assume validity/no exceptions
• Compiler can give hints to branch predictor (see the sketch below)
• Branch predictor will quickly figure out the “valid” path even with no hints
• Predicated execution can predicate on “unexceptional” case

‹ Exception checks can execute in parallel with critical path
• Superscalar units seem able to execute checks & functions concurrently
• Out of order execution lets checks wait for idle cycles

‹ The future brings more speculation; more concurrency
• Exception checking is an easy target for these techniques
• Robustness is cheap and getting cheaper (if done with a view to architecture)
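For instance, a check can be written so the compiler and branch predictor treat the exceptional path as cold. The sketch below uses the GCC/Clang __builtin_expect hint; the wrapper name is illustrative.

```c
/* Sketch: structuring a check so the predicted, unexceptional path stays
 * cheap. __builtin_expect is a GCC/Clang hint; hinted_memset is an
 * illustrative wrapper name. */
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define UNLIKELY(x) __builtin_expect(!!(x), 0)

void *hinted_memset(void *dst, int c, size_t n) {
    if (UNLIKELY(dst == NULL && n != 0)) {   /* exception path: rarely taken,     */
        errno = EINVAL;                      /* predicted not-taken, and placed   */
        return NULL;                         /* off the hot path by the compiler  */
    }
    return memset(dst, c, n);                /* common case falls straight through */
}
```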

#4: “We Did That On Purpose”

‹ Variant: “Nobody could reasonably do better”
• Despite the experiences with POSIX, HLA & SFIO, this one persisted
• So, we tried an experiment in self-evaluating robustness

‹ Three experienced commercial development teams
• Components written in Java
• Each team self-rated the robustness of their component per Maxion’s “CHILDREN” mnemonic-based technique
• We then Ballista tested their (pre-report) components for robustness

• Metric: did the teams accurately predict where their robustness vulnerabilities would be?
– They didn’t have to be perfectly robust
– They all felt they would understand the robustness tradeoffs they’d made

Self Report Results: Teams 1 and 2

‹ They were close in their prediction
• Didn’t account for some language safety features (divide by zero)
• Forgot about, or assumed the language would protect them against, NULL in A4

Component A and B Robustness

(Figure: measured vs. expected Abort failure rates, 0% to 25%, for functions A1–A4 and B1–B4.)

Self Report Data: Team 3

‹ Did not predict several failure modes • Probably could benefit from additional training/tools

Component C Total Failure Rate

(Figure: measured vs. expected total failure rates, 0% to 60%, by method designation.)

Conclusions: Ballista Project In Perspective

‹ General testing & wrapping approach for Ballista
• Simple tests are effective(!)
– Scalable for both testing and hardening
• Robustness tests & wrappers can be abstracted to the data type level
– Single validation fragment per type – e.g., checkSem(), checkFP()… (a sketch of one such fragment follows below)
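A minimal sketch of what a single shared validation fragment might look like for the semaphore type, assuming a registry filled in by a wrapped sem_init; the registry scheme, sizes, and wrapper names are illustrative rather than the actual Ballista wrapper code, and locking around the registry is omitted for brevity.

```c
/* Sketch of one per-type validation fragment (checkSem) reused by every
 * wrapped semaphore call. Registry scheme and names are illustrative. */
#include <semaphore.h>
#include <stddef.h>

#define MAX_SEMS 128
static sem_t *known_sems[MAX_SEMS];          /* recorded by the sem_init wrapper */

int checkSem(sem_t *s) {
    if (s == NULL) return 0;
    for (int i = 0; i < MAX_SEMS; i++)
        if (known_sems[i] == s) return 1;    /* initialized through the wrapper */
    return 0;                                /* unknown: reject robustly */
}

int robust_sem_init(sem_t *s, int pshared, unsigned value) {
    if (s == NULL) return -1;
    if (sem_init(s, pshared, value) != 0) return -1;
    for (int i = 0; i < MAX_SEMS; i++)
        if (known_sems[i] == NULL) { known_sems[i] = s; return 0; }
    return -1;                               /* registry full; sketch skips cleanup */
}

/* Every other sem_* wrapper reuses the same one-line fragment. */
int robust_sem_wait(sem_t *s) {
    if (!checkSem(s)) return -1;             /* robust error return instead of a crash */
    return sem_wait(s);
}
```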

‹ Wrappers are fast (under 5% penalty) and usually 100% effective
• Successful check results can be cached to exploit locality
– Typical case is an index lookup, test and jump for checking cache hit
– Typical case can execute nearly “for free” in modern hardware
• After this point, it is time to worry about resource leaks, device drivers, etc.

‹ But, a technical solution alone is not sufficient
• Case study of self-report data
– Some developers unable to predict code response to exceptions
• Training/tools needed to bridge gap
– Even seasoned developers need a QA tool to keep them honest
– Stand-alone Ballista tests for Unix under GPL; Windows XP soon

Future Research Challenges In The Large

‹ Quantifying “software aging” effects
• Simple, methodical tests for resource leaks (see the sketch below)
– Single-threaded, multi-threaded, and distributed code all have different issues
– One problem is multi-thread contention for non-reentrant resources
» e.g., exception handling data structures without semaphore protection
• Measurement & warning systems for need for SW rejuvenation
– Much previous work in predictive models
– Can we create an on-line monitor to advise when it is time to reboot?
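As one example of such a simple, methodical test, the sketch below counts open file descriptors before and after exercising an operation many times; the operation shown, the descriptor limit of 1024, and the function names are placeholders, not a proposed standard probe.

```c
/* Sketch of a methodical resource-leak probe: count open file descriptors,
 * exercise the operation under test many times, and compare. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int open_fd_count(void) {
    int count = 0;
    for (int fd = 0; fd < 1024; fd++)
        if (fcntl(fd, F_GETFD) != -1) count++;   /* descriptor is currently open */
    return count;
}

static void operation_under_test(void) {
    int fd = open("/dev/null", O_RDONLY);        /* replace with the call being audited */
    if (fd != -1) close(fd);                     /* "forgetting" this close would show a leak */
}

int main(void) {
    int before = open_fd_count();
    for (int i = 0; i < 10000; i++) operation_under_test();
    int after = open_fd_count();
    printf("fds before=%d after=%d leaked=%d\n", before, after, after - before);
    return 0;
}
```

The same before/after pattern applies to memory, threads, or semaphores, though multi-threaded and distributed cases need more care, as noted above.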

‹ Understanding robustness tradeoffs from developer point of view
• Tools to provide predictable tradeoff of effort vs. robustness
– QA techniques to ensure that desired goal is reached
– Ability to specify robustness level clearly, even if “perfection” is not desired
• Continued research in enabling ordinary developers to write robust code
• Need to address different needs for development vs. deployment
– Developers want heavy-weight notification of unexpected exceptions
– In the field, may want a more benign reaction to exceptions