Coverage-Guided Adversarial Generative Fuzzing Testing of Deep Learning Systems

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. XX, NO. X, XXXX 1 CAGFuzz: Coverage-Guided Adversarial Generative Fuzzing Testing of Deep Learning Systems Pengcheng Zhang, Member, IEEE, Qiyin Dai, Patrizio Pelliccione Abstract—Deep Learning systems (DL) based on Deep Neural Networks (DNNs) are increasingly being used in various aspects of our life, including unmanned vehicles, speech processing, intelligent robotics and etc. Due to the limited dataset and the dependence on manually labeled data, DNNs always fail to detect erroneous behaviors. This may lead to serious problems. Several approaches have been proposed to enhance adversarial examples for testing DL systems. However, they have the following two limitations. First, most of them do not consider the influence of small perturbations on adversarial examples. Some approaches take into account the perturbations, however, they design and generate adversarial examples based on special DNN models. This might hamper the reusability on the examples in other DNN models, thus reducing their generalizability. Second, they only use shallow feature constraints (e.g. pixel-level constraints) to judge the difference between the generated adversarial example and the original example. The deep feature constraints, which contain high-level semantic information - such as image object category and scene semantics, are completely neglected. To address these two problems, we propose CAGFuzz, a Coverage-guided Adversarial Generative Fuzzing testing approach for Deep Learning Systems, which generates adversarial examples for DNN models to discover their potential defects. First, we train an Adversarial Case Generator (AEG) based on general data sets. AEG only considers the data characteristics, and avoids low generalization ability. Second, we extract the deep features of the original and adversarial examples, and constrain the adversarial examples by cosine similarity to ensure that the semantic information of adversarial examples remains unchanged. Finally, we use the adversarial examples to retrain the model. Based on three standard data sets, we design a set of dedicated experiments to evaluate CAGFuzz. The experimental results show that CAGFuzz can improve the neuron coverage rate, detect hidden errors, and also improve the accuracy of the target DNN. Index Terms—deep neural network; fuzz testing; adversarial example; coverage criteria. F 1 INTRODUCTION example, in automatic driving systems, if the deployed DNNs have not recognized the obstacles ahead timely and Owadays, we have already stepped into the era of correctly, it may lead to serious consequences such as vehicle artificial intelligence from the digital era. Apps with N damage and even human death [8]. AI systems can be seen everywhere in our daily life, such as Generally speaking, the development process of DL sys- Amazon Alexa [1], DeepMind’s Atari [2], and AI-phaGo [3]. tems is essentially different from the traditional software With the development of edge computing, 5G technology development process. As shown in Fig. 1, for traditional and etc., AI technologies become more and more mature. software development practices, developers directly specify In many applications, we can see the shape of deep neural the logic of the systems. On the contrary, DL systems auto- networks (DNNs), such as automatic driving [4], intelligent matically learn their models and corresponding parameters robotics [5], smart city applications [6] and AI-enabled En- arXiv:1911.07931v2 [cs.CV] 21 May 2020 from data. Consequently, the testing process of DL systems terprise Information Systems [7]. In this paper, we term this is also different from traditional software systems. For tra- kind of applications as DL (deep learning) systems. ditional software systems, code or control-flow coverage is In particular, many different kinds of DNNs are em- utilized to guide the testing process [9]. However, the logic bedded in security and safety-critical applications, such of the DL systems is not encoded by control flow, and it can- as automatic driving [4] and intelligent robotics [5]. This not be solved by the normal encoding way. Their decisions brings new challenges since predictability, correctness, and are always made by training data for many times, and the safety are especially crucial for this kind of DL systems. performance is more dependent on data rather than human These safety-critical applications deploying DNNs without intervention. For DL systems, neural coverage can be used comprehensive testing could have serious problems. For to guide the testing process [10]. When faults are found, it is also very difficult to locate the exact position in the original • P. Zhang and Q. Dai are with the College of Computer and Information, DL systems. Consequently, most traditional software testing Hohai University, Nanjing, P.R.China methodologies are not suitable for testing DL systems. As E-mail: [email protected] • P. Pelliccione is with the University of L’Aquila, Italy and Chalmers j highlighted in [10], [11], research on developing new testing University of Gothenburg, Sweden. techniques for DL systems is urgently needed. E-mail: [email protected] The standard way to test DL systems is to collect and Manuscript received XXXX XXXX; revised XXXX, XXXX. manually mark as much actual test data as possible [12], IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. XX, NO. X, XXXX 2 the pixel-level changes of the adversarial example. However, such shallow feature constraints can only represent the visual consistency between the adversarial example and the original example, and cannot guarantee the high-level semantic information con- Traditional Developer sistency between the adversarial example and the original example. Furthermore, this may lead to bad Input Decision Logic Output performance when testing the network with deep layers. Algorithm Tuning To address the problems aforementioned, we propose a new testing approach for DL systems, called CAGFuzz Feature (Coverage-guided Adversarial Generative Fuzzing)1. The Selection goal of the CAGFuzz is to maximize the neuron coverage { and generate adversarial test examples as much as possible DL system Developer Training Data with small perturbations for the target DNNs. The goal of the CAGFuzz is to maximize the neuron coverage and generate adversarial test examples with minimal perturbations Fig. 1. Comparison between traditional and DL system development for the target DNNs. Meanwhile, the generated examples have strong generalization ability and can be used to test different DNN models. CAGFuzz iteratively selects the test [13]. Obviously, it is unthinkable to exhaustively test every examples in the processing pool and generates the adversar- feasible input of the DL systems. Recently, an increasing ial examples through the pre-trained adversarial example number of researchers have contributed to test DL systems generator (see Section 3 for details) to guide DL systems to with a variety of approaches [10], [11], [14], [15], [16]. The expose incorrect behaviors. During the process of generating main idea of these approaches is to enhance input examples adversarial examples, CAGFuzz keeps valid adversarial ex- of test data set by different techniques. Some approaches, amples, which can provide a certain improvement in neuron e.g. DeepXplore [10], use multiple DNNs to discover and coverage for subsequent fuzzy processing, and limit the generate adversarial examples that lie between the decision small perturbations invisible to human eyes, ensuring the boundaries of these DNNs. Some approaches, e.g. Deep- same meaningfulness between the original example and the Hunter [11], use metamorphic mutation strategy to generate adversarial example. The contributions of this paper include new test examples. Other approaches, e.g. DeepGauge [16], the following three aspects: propose new coverage criteria for deep neural networks. These coverage criteria can be used as guidance for gener- • We design an adversarial example generator, AEG, which ating test examples. While state-of-the-art approaches make can generate adversarial examples with small perturba- some progresses on testing DL systems, they still suffer the tions based on general data sets. The goal of Cycle- following two main problems: GAN [21] is to transform image A to image B with different styles. Based on CycleGAN, our goal is to 1) DNN-dependent generation of adversarial examples. transform image B back to image A, and get image Most approaches [11], [17] do not consider the in- A’ similar to the original image A. Consequently, fluence of small perturbations on deep neural net- we combine two generators with opposite functions works when the test examples are generated. Some of CycleGAN as our adversarial example generator. approaches [10], [18] consider small perturbations The adversarial examples generated by AEG can add based on special DNN models. The test examples small perturbations invisible to human eyes to the that they have generated are only designed for one original examples. AEG is trained based on general special DNN, and it may be difficult to generalize data sets and does not rely on any specific DNN them to other DNNs. Recent research on adversarial model, which has higher generalization ability than DL systems [19], [20] shows that by adding the state-of-the-art approaches. Furthermore, because of small perturbations to existing images and elabo- the inherent constraint logic of CycleGAN, the trained rating synthetic images

Coverage-Guided Adversarial Generative Fuzzing Testing of Deep Learning Systems

ON SOFTWARE TESTING and SUBSUMING MUTANTS an Empirical Study

Thesis Template

Testing, Debugging & Verification

Different Types of Testing

Guidelines on Minimum Standards for Developer Verification of Software

Jia and Harman's an Analysis and Survey of the Development Of

What Are We Really Testing in Mutation Testing for Machine

Using Mutation Testing to Measure Behavioural Test Diversity

ISTQB Glossary of Testing Terms 2.3X

If You Can't Kill a Supermutant, You Have a Problem

White Box Testing Process

A Survey on Software Testing Techniques Using Genetic Algorithm