Taint-Based Directed Whitebox Fuzzing
Vijay Ganesh†, Tim Leek‡, and Martin Rinard†
†MIT Computer Science and Artificial Intelligence Lab, ‡MIT Lincoln Laboratory
[email protected], [email protected], [email protected]

∗ This research was supported in part by National Science Foundation grants CCR-0325283, CNS-0509415, and CCF-0811397, and by the Department of Defense under the Air Force Cooperative Agreement FA8750-06-2-0189. The Lincoln Laboratory portion of this work was sponsored by the Department of Defense under the Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Abstract

We present a new automated whitebox fuzzing technique and a tool, BuzzFuzz, that implements this technique. Unlike standard fuzzing techniques, which randomly change parts of the input file with little or no information about the underlying syntactic structure of the file, BuzzFuzz uses dynamic taint tracing to automatically locate regions of original seed input files that influence values used at key program attack points (points where the program may contain an error). BuzzFuzz then automatically generates new fuzzed test input files by fuzzing these identified regions of the original seed input files. Because these new test files typically preserve the underlying syntactic structure of the original seed input files, they tend to make it past the initial input parsing components to exercise code deep within the semantic core of the computation.

We have used BuzzFuzz to automatically find errors in two open-source applications: Swfdec (an Adobe Flash player) and MuPDF (a PDF viewer). Our results indicate that our new directed fuzzing technique can effectively expose errors located deep within large programs. Because the directed fuzzing technique uses taint to automatically discover and exploit information about the input file format, it is especially appropriate for testing programs that have complex, highly structured input file formats.

1 Introduction

Fuzz testing [17] is a form of automatic, black-box testing which uses a fuzzer to randomly generate or mutate sample inputs. This technique has been shown to be surprisingly effective in exposing errors in software systems [17, 23, 16]. It is especially useful for testing input parsing components — the randomly generated inputs often exercise overlooked corner cases in the initial parsing and error checking code.

But fuzz testing (or simply fuzzing) has been less effective at generating syntactically legal inputs that expose deeper semantic errors in programs [23, 8] — for many programs virtually all of the randomly generated inputs fail to satisfy the basic syntactic constraints that characterize well-formed inputs, and hence fail to make it past the initial parsing phase to exercise the remaining code.

This paper presents a new testing approach, directed whitebox fuzz testing, and a new tool, BuzzFuzz, that implements this testing approach. Instead of generating random inputs that primarily exercise the initial input parsing components, directed fuzz testing is designed to produce well-formed test inputs that exercise code deep within the core semantic processing components of the program under test. As such, it complements random fuzzing and significantly extends the reach of automated testing techniques. Directed fuzz testing is based on the following techniques:

• Taint Tracing: The program under test executes on one or more valid sample inputs. The execution is instrumented to record taint information. Specifically, the instrumented execution records, for each value that the program computes, the input bytes that influence that value (a minimal sketch of this propagation appears after this list).

• Attack Point Selection: Specific points in the program under test are identified as potential attack points, i.e., locations that may exhibit an error if the program is presented with an appropriate error-revealing input. By default, our implemented BuzzFuzz system selects library and system calls as attack points. BuzzFuzz can also be configured, under user control, to select an arbitrary set of program points as attack points.

• Directed Fuzzing: For each attack point and each sample input, BuzzFuzz computes the set of input bytes that affect the values at that attack point. For library and system calls, for example, the values at the attack point are the parameters passed to the library or system call. BuzzFuzz then generates new inputs as follows: each new input is identical to one of the sample inputs, except that the input bytes that affect the values at one or more attack points have been altered. By default, our implemented BuzzFuzz system sets these bytes to extremal values, e.g., large, small, or zero integer values. BuzzFuzz can also be configured to use different policies such as generating random values for the corresponding input bytes.

• Directed Testing: Finally, BuzzFuzz runs the program under test on the newly generated inputs to see if the inputs expose any errors.
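To make the Taint Tracing step concrete, the following minimal C sketch shows byte-level taint propagation of the kind the first bullet describes. This is not BuzzFuzz's actual instrumentation, which the paper has not described at this point; the taint_t bitset, the tvalue_t wrapper, and the helper names are invented for illustration. The idea is simply that every computed value carries the set of input-file byte offsets that influence it, and each operation propagates taint as the union of its operands' sets.

#include <stdio.h>
#include <stdint.h>

/* Illustrative sketch only: a 64-bit bitset keeps the example small;
 * real taint tracers track offsets for arbitrarily large inputs. */
typedef struct { uint64_t offsets; } taint_t;   /* bit i => input byte i */
typedef struct { int32_t value; taint_t taint; } tvalue_t;

/* Reading the input byte at offset `off` taints the result with that offset. */
static tvalue_t read_input_byte(const uint8_t *input, int off) {
    tvalue_t v = { input[off], { 1ULL << off } };
    return v;
}

/* Each operation propagates taint as the union of its operands' taint sets. */
static tvalue_t t_or_shl8(tvalue_t lo, tvalue_t hi) {
    tvalue_t r = { lo.value | (hi.value << 8),
                   { lo.taint.offsets | hi.taint.offsets } };
    return r;
}

int main(void) {
    const uint8_t input[] = { 0x10, 0x02, 0xAA, 0xBB };  /* stand-in seed file */
    /* Assemble a 16-bit little-endian length field from input bytes 0 and 1. */
    tvalue_t len = t_or_shl8(read_input_byte(input, 0),
                             read_input_byte(input, 1));
    printf("len = %d, influenced by input bytes:", (int)len.value);
    for (int i = 0; i < 64; i++)
        if (len.taint.offsets & (1ULL << i))
            printf(" %d", i);
    printf("\n");   /* prints: len = 528, influenced by input bytes: 0 1 */
    return 0;
}

If a value like len later reaches an attack point, for example as the size argument of an allocation, the recorded offsets 0 and 1 identify exactly the seed-file bytes that directed fuzzing should alter.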
This approach has several benefits:

• Preservation of Syntactic Structure: Directed fuzzing tends to target input bytes that can be changed without violating the legal syntactic structure of the original seed inputs. The automatically generated fuzzed inputs therefore tend to make it past the initial parsing components to exercise code within the semantic core of the computation.

• Targeted Values: The altered input bytes are designed to target values that are directly relevant to specific potential vulnerabilities. The generated test suite therefore tends to have a high concentration of inputs that can expose errors that may exist at these potential vulnerabilities.

• Coordinated Changes: Finally, directed fuzzing can identify and alter multiple disjoint regions of the input space that must change together in a coordinated way to expose the presence of an error.

One of the keys to making this approach work is choosing a set of attack points that tend to have latent errors. We have found that library calls can comprise a particularly productive set of attack points. The program and library are typically developed by different people operating with different perspectives, information, and assumptions. This cognitive gap between the program and library developers can easily enable a range of errors — the program may use the library in ways that the library developer did not anticipate, corner cases can become lost or forgotten at the interface between the program and the library, or the program developer may simply be unaware of some library preconditions. Moreover, the large potential execution space at library calls can be difficult to adequately explore with standard testing practices, making library calls particularly susceptible to the kind of latent errors that our directed fuzzing technique is designed to expose.
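The following contrived C fragment illustrates the kind of latent library-call error just described. It is not taken from the paper or from any real application; the chunk structure and its field names are invented. The point is that the size argument of malloc is influenced by a handful of seed-file bytes, so fuzzing exactly those bytes to extremal values stresses the call while the rest of the file remains well-formed.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical chunk of a structured binary format: a length field read
 * verbatim from the input file, followed by the payload bytes. */
struct chunk {
    uint32_t declared_len;   /* copied directly from bytes of the input file */
    const uint8_t *payload;
    size_t actual_len;       /* number of payload bytes actually present */
};

void process_chunk(const struct chunk *c) {
    /* Attack point: malloc's argument is tainted by the seed-file bytes
     * that hold declared_len, so directed fuzzing sets exactly those bytes
     * to extremal values (zero, small, or very large) in otherwise
     * well-formed files. */
    uint8_t *buf = malloc(c->declared_len);
    if (buf == NULL)
        return;
    /* Latent error: the copy length is the actual payload size. A fuzzed
     * file whose declared_len is smaller than actual_len makes this memcpy
     * overflow buf, deep past the parser, at a library-call boundary. */
    memcpy(buf, c->payload, c->actual_len);
    /* ... process buf ... */
    free(buf);
}

A random fuzzer that flips arbitrary bytes would typically break the file's syntactic structure and be rejected by the parser long before process_chunk runs; directed fuzzing changes only the bytes that flow into the malloc argument.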
Experimental Results: We have used BuzzFuzz to test several programs that process highly structured binary data (examples of such data include video, image, document, and sound files). We have found that directed fuzzing is able to successfully generate inputs that satisfy complex input consistency constraints and make it past the initial input processing components to expose errors at attack points within the semantic core. More specifically, our results show that BuzzFuzz is able to preserve the syntactic structure present in the original inputs, generate new inputs that satisfy the key properties that characterize legal inputs, and successfully target vulnerabilities deep within the semantic core of the program under test. In the context of programs for processing highly structured inputs like movies or documents, BuzzFuzz nicely complements random fuzzing, since random fuzzing usually uncovers errors in shallower parts of the program such as input validation code.

All of the experimental results presented in this paper are available at http://people.csail.mit.edu/vganesh/buzzfuzz.html. This website includes the source code for our benchmark applications, the automatically generated test files that expose errors in these applications, and the seed test input files that BuzzFuzz started with to generate the test files.

[Figure 1. BuzzFuzz Inputs and Outputs. The inputs to BuzzFuzz are the source of a C program, seed inputs, and a list of program attack points. The output of BuzzFuzz is a set of fuzzed test inputs for the program under test.]

Inputs and Outputs: Figure 1 illustrates the program inputs and outputs of the BuzzFuzz system. The user provides BuzzFuzz with a program and a set of seed input files for that program. BuzzFuzz also accepts a specification of attack points. BuzzFuzz comes preconfigured with a default attack point specification; the user can also provide additional attack point specifications. Given these inputs, BuzzFuzz instruments the source to trace taint information, then runs the instrumented source on the provided inputs to find out which input file locations influence attack point values.
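To round out the workflow just described, here is a minimal sketch of the final mutation step: given the seed-file byte offsets that taint tracing associated with one attack point, emit fuzzed copies of the seed with those bytes set to extremal values. The function names, file names, and the particular extremal byte values are assumptions for illustration; the paper specifies only the default policy of large, small, or zero values, and BuzzFuzz's actual mutation machinery may differ.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write one fuzzed copy of `seed` with every tainted offset set to `v`. */
static int write_variant(const unsigned char *seed, size_t seed_len,
                         const size_t *offsets, size_t n_offsets,
                         unsigned char v, const char *path) {
    unsigned char *copy = malloc(seed_len);
    if (copy == NULL) return -1;
    memcpy(copy, seed, seed_len);
    for (size_t i = 0; i < n_offsets; i++)
        if (offsets[i] < seed_len)
            copy[offsets[i]] = v;            /* fuzz only the tainted bytes */
    FILE *f = fopen(path, "wb");
    if (f == NULL) { free(copy); return -1; }
    fwrite(copy, 1, seed_len, f);
    fclose(f);
    free(copy);
    return 0;
}

int main(void) {
    /* Stand-ins: a tiny "seed file" and the offsets (here, a 16-bit length
     * field at bytes 0 and 1) that taint tracing tied to one attack point. */
    unsigned char seed[] = { 0x10, 0x02, 0xAA, 0xBB, 0xCC, 0xDD };
    size_t tainted[] = { 0, 1 };
    const unsigned char extremal[] = { 0x00, 0x01, 0x7F, 0x80, 0xFF };

    char path[32];
    for (size_t i = 0; i < sizeof extremal; i++) {
        snprintf(path, sizeof path, "fuzzed_%02zu.bin", i);
        write_variant(seed, sizeof seed, tainted, 2, extremal[i], path);
    }
    return 0;   /* each fuzzed_*.bin differs from the seed only at the
                   tainted offsets, preserving the rest of the structure */
}

Running the program under test on each resulting fuzzed_*.bin file is then the Directed Testing step from the list in the introduction.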