Differential Testing for Software
William M. McKeeman

Digital Technical Journal Vol. 10 No. 1 1998

Differential testing, a form of random testing, is a component of a mature testing technology for large software systems. It complements regression testing based on commercial test suites and tests locally developed during product development and deployment. Differential testing requires that two or more comparable systems be available to the tester. These systems are presented with an exhaustive series of mechanically generated test cases. If (we might say when) the results differ or one of the systems loops indefinitely or crashes, the tester has a candidate for a bug-exposing test. Implementing differential testing is an interesting technical problem. Getting it into use is an even more interesting social challenge. This paper is derived from experience in differential testing of compilers and run-time systems at DIGITAL over the last few years and recently at Compaq. A working prototype for testing C compilers is available on the web.

The Testing Problem

Successful commercial computer systems contain tens of millions of lines of handwritten software, all of which is subject to change as competitive pressures motivate the addition of new features in each release. As a practical matter, quality is not a question of correctness, but rather of how many bugs are fixed and how few are introduced in the ongoing development process. If the bug count is increasing, the software is deteriorating.

Quality

Testing is a major contributor to quality: it is the last chance for the development organization to reduce the number of bugs delivered to customers. Typically, developers build a suite of tests that the software must pass to advance to a new release. Three major sources of such tests are the development engineers, who know where to probe the weak points; commercial test suites, which are the arbiters of conformance; and customer complaints, which developers must address to win customer loyalty. All three types of test cases are relevant to customer satisfaction and therefore have value to the developers. The resultant test suite for the software under test becomes intellectual property, encapsulates the accumulated experience of problem fixes, and can contain more lines of code than the software itself.

Testing is always incomplete. The simplest measure of completeness is statement coverage. Instrumentation can be added to the software before it is tested. When a test is run, the instrumentation generates a report detailing which statements are actually executed. Obviously, code that is not executed was not tested.

Random testing is a way to make testing more complete. One value of random testing is introducing the unexpected test: 1,000 monkeys on the keyboard can produce some surprising and even amusing input! The traditional approach to acquiring such input is to let university students use the software.

Testing software is an active field of endeavor. Interesting starting points for gathering background information and references are the web site maintained by Software Research, Inc. [1] and the book Software Testing and Quality Assurance [2].

Developer Distaste

A development team with a substantial bug backlog does not find it helpful to have an automatic bug finder continually increasing the backlog. The team priority is to address customer complaints before dealing with bugs detected by a robot. Engineers argue that the randomly produced tests do not uncover errors that are likely to bother customers. “Nobody would do that,” “That error is not important,” and “Don’t waste our time; we have plenty of real errors to fix” are typical developer retorts.

The complaints have a substantial basis. During a visit to our development group, Professor C. A. R. Hoare of Oxford University succinctly summarized one class of complaints: “You cannot fix an infinite number of bugs one at a time.” Some software needs a stronger remedy than a stream of bug reports. Moreover, a stream of bug reports may consume the energy that could be applied in more general and productive ways.

The developer pushback just described indicates that a differential testing effort must be based on a perceived need for better testing from within the product development team. Performing the testing is pointless if the developers cannot or will not use the results.

Differential testing is most easily applicable to software whose quality is already under control, that is, software for which there are few known outstanding errors. Running a very large number of tests and expending team effort only when an error is found becomes an attractive alternative. Team members’ morale increases when the software passes millions of hard tests and test coverage of their code expands.

The technology should be important for applications for which there is a high premium on correctness. In particular, product differentiation can be achieved for software that has few failures in comparison to the competition. Differential testing is designed to provide such comparisons.

The technology should also be important for applications for which there is a high premium on independently duplicating the behavior of some existing application. Identical behavior is important when old software is being retired in favor of a new implementation, or when the new software is challenging a dominant competitor.

Seeking an Oracle

The ugliest problem in testing is evaluating the result of a test. A regression harness can automatically check that a result has not changed, but this information serves no purpose unless the result is known to be correct. The very complexity of modern software that drives us to construct tests makes it impractical to provide a priori knowledge of the expected results. The problem is worse for randomly generated tests. There is not likely to be a higher level of reasoning that can be applied, which forces the tester to instead follow the tedious steps that the computer will carry out during the test run. An oracle is needed.

One class of results is easy to evaluate: program crashes. A crash is never the right answer. In the triage that drives a maintenance effort, crashes are assigned to the top priority category. Although this paper does not contain an in-depth discussion of crashes, all crashes caused by differential testing are reported and constitute a substantial portion of the discovered bugs.

Differential testing, which is covered in the following section, provides part of the solution to the problem of needing an oracle. The remainder of the solution is discussed in the section entitled Test Reduction.

Differential Testing

Differential testing addresses a specific problem: the cost of evaluating test results. Every test yields some result. If a single test is fed to several comparable programs (for example, several C compilers), and one program gives a different result, a bug may have been exposed. For usable software, very few generated tests will result in differences. Because it is feasible to generate millions of tests, even a few differences can result in a substantial stream of detected bugs. The trade-off is to use many computer cycles instead of human effort to design and evaluate tests. Particle physicists use the same paradigm: they examine millions of mostly boring events to find a few high-interest particle interactions.

Several issues must be addressed to make differential testing effective. The first issue concerns the quality of the test. Any random string fed to a C compiler yields some result, most likely a diagnostic. Feeding random strings to the compiler soon becomes unproductive, however, because these tests provide only shallow coverage of the compiler logic. Developers must devise tests that drive deep into the tested compiler. The second issue relates to false positives. The results of two tested programs may differ and yet still be correct, depending on the requirements. For example, a C compiler may freely choose among alternatives for unspecified, undefined, or implementation-defined constructs as detailed in the C Standard [3]. Similarly, even for required diagnostics, the form of the diagnostic is unspecified and therefore difficult to compare across systems. The third issue deals with the amount of noise in the generated test case. Given a successful random test, there is likely to be a much shorter test that exposes the same bug. The developer who is seeking to fix the bug strongly prefers to use the shorter test. The fourth issue concerns comparing programs that must run on different platforms. Differential testing is easily adapted to distributed testing.

Test Case Quality

Writing good tests requires a deep knowledge of the system under test. Writing a good test generator requires embedding that same knowledge in the generator. This section presents the testing of C compilers as an example.

Testing C Compilers

For a C compiler, we constructed sample C source files at several levels of increasing quality.

[...] level. One compiler could not handle 0x000001 if there were too many leading zeros in the hexadecimal number. Another compiler crashed when faced with the floating-point constant 1E1000. Many compilers failed to properly process digraphs and trigraphs.

Stochastic Grammar

A vocabulary is a set of two kinds of symbols: terminal and nonterminal. The terminal symbols are what one can write down. The nonterminal symbols are names for higher level language structures. For example, the symbol “+” is a terminal symbol, and the symbol “additive-expression” is a nonterminal symbol of the C programming language. A grammar is a set of rules for describing a language. A rule has a left side and a right side.
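
As an illustration of these definitions, the following Python sketch represents a grammar as a table that maps each left side (a nonterminal) to its possible right sides, and expands a nonterminal into terminal symbols by random choice. It is a minimal sketch rather than the generator used in this work. The toy rules, the numeric weights, the depth limit, and the names GRAMMAR and generate are all assumptions made for the example, as is the reading of "stochastic" as a weighted random choice among rules. Only the symbols "+" and "additive-expression" come from the text above.

```python
import random

# Hypothetical toy grammar. Each key is a nonterminal (the left side of a
# rule); each entry is a (weight, right side) pair. Any symbol that is not
# a key of GRAMMAR is treated as a terminal.
GRAMMAR = {
    "additive-expression": [
        (3, ["multiplicative-expression"]),
        (1, ["additive-expression", "+", "multiplicative-expression"]),
    ],
    "multiplicative-expression": [
        (3, ["primary"]),
        (1, ["multiplicative-expression", "*", "primary"]),
    ],
    "primary": [
        (1, ["0"]),
        (1, ["1"]),
        (1, ["x"]),
    ],
}


def generate(symbol, rng, depth=0, max_depth=8):
    """Expand `symbol` into a list of terminal symbols."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit it as written
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the first rule. In this toy
        # grammar the first rule of every nonterminal is non-recursive,
        # so the expansion is guaranteed to terminate.
        right_side = rules[0][1]
    else:
        weights = [weight for weight, _ in rules]
        right_side = rng.choices(rules, weights=weights, k=1)[0][1]
    tokens = []
    for part in right_side:
        tokens.extend(generate(part, rng, depth + 1, max_depth))
    return tokens


if __name__ == "__main__":
    # Print one randomly generated expression.
    print(" ".join(generate("additive-expression", random.Random())))
```

Passing a seeded random.Random instance makes a run reproducible, which matters when a generated test exposes a difference and has to be regenerated later. Raising the weight of a recursive rule produces longer expressions on average.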