Automatic Generation of Valid Syntax C Programs for Fuzz Testing

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19) DeepFuzz: Automatic Generation of Syntax Valid C Programs for Fuzz Testing Xiao Liu, Xiaoting Li, Rupesh Prajapati, Dinghao Wu College of Information Sciences and Technology The Pennsylvania State University University Park, PA 16802, USA Abstract research (Hoare 2003). Mainstream research focuses on formal verification (Leroy and Grall 2009), translation valida- Compilers are among the most fundamental programming tion (Necula 2000), and random testing (Lidbury et al. 2015; tools for building software. However, production compilers remain buggy. Fuzz testing is often leveraged with newly- Le, Afshari, and Su 2014; Le, Sun, and Su 2015). The first generated, or mutated inputs in order to find new bugs or two categories try to provide certified compilers. For example, security vulnerabilities. In this paper, we propose a grammar- CompCert (Leroy et al. 2016) has made promising progress based fuzzing tool called DEEPFUZZ. Based on a generative in this area. But in practice, it is challenging to apply formal Sequence-to-Sequence model, DEEPFUZZ automatically and techniques to fully verify a production compiler such as GCC, continuously generates well-formed C programs. We use this especially when the proof is not constructed together with the set of new C programs to fuzz off-the-shelf C compilers, e.g., compiler. Therefore, testing remains the dominant approach GCC and Clang/LLVM. We present a detailed case study to in compiler validation. analyze the success rate and coverage improvement of the generated C programs for fuzz testing. We analyze the perfor- Our work focuses on compiler testing. By feeding in pro- mance of DEEPFUZZ with three types of sampling methods grams covering different features to different production com- as well as three types of generation strategies. Consequently, pilers turning on different levels of optimizations, internal DEEPFUZZ improved the testing efficacy in regards to the line, compiler errors (genuine bugs of the compiler) may be trig- function, and branch coverage. In our preliminary study, we gered during the compilation with a detailed error message found and reported 8 bugs of GCC, all of which are actively indicating where and what the error is. However, it is chal- being addressed by developers. lenging to generate “good” programs to make testing more efficient and to build a continuous testing framework byau- Introduction tomating this process. Each test, including man-crafted ones, in the existing methods, covers some features and it is com- Compilers are among the most important software of com- mon today to see larger and larger test suites for modern puting systems and they are typically part of the trust com- compilers. This improves the testing coverage but it takes puting base, but they remain buggy. For example, GCC, a a lot of human effort to construct these tests. Nevertheless, long-lasting software released in 1987, is a standard compiler a practical way to reduce human labor for testing is fuzz for many Unix-like operating systems. Over 3,410 internal testing, or fuzzing. bugs (Yang et al. 2011) have been caught since it is cre- Fuzzing (Bird and Munoz 1983) is a method to find bugs ated. Similarly, for Java, Python, and JavaScript, thousands or security vulnerabilities. A program is repeatedly executing of bugs have been found in those widely-used compilers and with automatically generated or modified inputs to detect interpreters. These compiler bugs can result in unintended abnormal behaviors such as program crashes. Main tech- program executions and lead to catastrophic consequences in niques for input fuzzing in use today are black box ran- security-sensitive applications. It may also hamper developer dom fuzzing (Zalewski 2015), white box constraint-based productivity in debugging a program when the root cause fuzzing (Godefroid, Kiezun, and Levin 2008), and grammar- cannot be decided in the applications or compilers. There- based fuzzing (Dewey, Roesch, and Hardekopf 2014). Black fore, it is critical to improve the compiler correctness. But it box and white box fuzzing are fully automatic, and have his- is not easy to validate compilers with the growing code base: torically been proven to be effective in finding security vulner- the code base of today’s GCC is around 15 million lines of abilities in binary-format file parsers. In contrast, grammar- code (Sun et al. 2016), close to the entire Linux kernel, which based fuzzing requires an input grammar specifying the input is around 19 million lines of code. format of the application under test, which is typically writ- It is critical to make compilers dependable. In the past ten by hand. This process is laborious, time-consuming, and decade, compiler verification has been an important and ac- error-prone. However, grammar-based fuzzing is the most tive area for the verification grant challenge in computing effective fuzzing technique known today for fuzzing applica- Copyright ⃝c 2019, Association for the Advancement of Artificial tions with complexly structured input formats, e.g., compilers. Intelligence (www.aaai.org). All rights reserved. In the scenario of compiler testing, one way to deploy the 1044 grammar-based fuzzing is to encode the C grammar as rules where f is a non-linear activation function. An RNN can learn for test case generation. But in practice, C11 (of the Inter- a probability distribution over a sequence of characters to pre- national Organization for Standardization (ISO) 2011), the dict the next symbol. Therefore, at each timestep t, the output current standard of the C programming language, has 696 from the RNN is a conditional distribution p(xtjxt−1; :::; x1). pages of detailed specifications, which brings the hurdle for For instance, in our case, upon a multinomial distribution of engineers to construct such a grammar-based engine. the next character, we use a softmax activation function for In this paper, we consider the problem of automatically the output generating syntactically valid inputs for grammar-based exp(wjh ) fuzzing with a generative recurrent neural network. To be p(x = 1jx ; :::; x ) = hti ; (2) t;j t−1 1 PK more specific, we aim to train a generative neural network j=1 exp(wjhhti) to learn the “grammar”, or to be more precise, the language patterns, of the input data. We propose to train a Sequence- for all possible symbols j = 1; :::; K, where wj are the rows to-Sequence model (Sutskever, Vinyals, and Le 2014) in of a weight matrix W . By combining these probabilities, we a supervised learning strategy, leveraging the original test compute the probability of the sequence x using suites provided with production compilers. Originally, the T Sequence-to-Sequence model is widely used for machine Y p(x) = p(xtjxt−1; :::; x1): (3) translation (Klein et al. 2017) and text generation (Sutskever, t=1 Martens, and Hinton 2011). Theoretically speaking, by training a model on the original paragraphs, we implicitly encode With the learned distribution, it is straightforward to generate the correct spelling of words, valid syntaxes of sentences, a new sequence by iteratively sampling new characters at detailed styles of writing behaviors into a generative model. each time step. The same idea can be applied to program synthesis, where we A Sequence-to-Sequence model consists of two RNNs, only need to train a model to generate different syntactically an encoder and a decoder. The encoder learns to encode a valid programs on top of a seed data set. For the training data variable-length sequence into a fixed-length vector represen- set, we adopted the original GCC test suite where there are tation and the decoder will decode this fixed-length vector over 10,000 short, or small, programs that cover most of the representation into a variable-length sequence. It was origi- features specified in the C11 standard. On the training stage, nally proposed by Cho et al. (2014) for statistical machine we tune parameters to encode the language patterns for C translation. The encoder RNN reads each character of an programs into the model, based on which, we continuously input sequence x while the hidden states of the RNN changes. generate new programs for compiler fuzzing. After reading the end of this sequence, the hidden state of the RNN is a summary c of the whole input sequence. Mean- Contributions. Our work is the first to use a generative re- while, the decoder RNN is trained to generate the output current neural network for grammar-based compiler fuzzing. sequence by predicting the next character yt given the hidden • First, the proposed framework is fully automatic. By train- state hhti. However, unlike a pure RNN, both yt and hhti ing a Sequence-to-Sequence model which can be viewed are also conditioned on yt−1 and the summary c of the input as an implicit representation of the language patterns for sequence. In this case, to compute the hidden states of the the training data, C syntax in our context, our framework decoder, we have DEEPFUZZ will continuously provide new syntactic cor- h = f(h ; y ; c); (4) rect C programs. hti ht−1i t−1 • Second, we build a practical tool for fuzzing off-the-shelf and similarly, the condition distribution of the next character C compilers. We conduct a detailed analysis of how key is factors will affect the accuracy of the generative model p(ytjyt−1; :::y1; c) = g(hhti; yt−1; c); (5) and fuzzing performance. where f and g are activation functions. Overall, the two • Third, we apply our DEEPFUZZ technique to test GCC and RNNs Encoder-Decoder are jointly trained to generate a Clang/LLVM. During our preliminary analysis, the testing target sequence given an input sequence. coverage (line, function, and branch) is increased and we All RNNs have feedback loops in the recurrent layer.

Load more