Data Sanity Check for Deep Learning Systems via Learnt Assertions

Haochuan Lu∗†, Huanlin Xu∗†, Nana Liu∗†, Yangfan Zhou∗†, Xin Wang∗†
∗School of , Fudan University, Shanghai, China
†Shanghai Key Laboratory of Intelligent Information Processing, Shanghai, China

arXiv:1909.03835v3 [cs.LG] 28 Sep 2019

Abstract—Reliability is a critical consideration for DL-based systems. But the statistical nature of DL makes such systems quite vulnerable to invalid inputs, i.e., cases that are not considered in the training phase of a DL model. This paper proposes to perform data sanity check to identify invalid inputs, so as to enhance the reliability of DL-based systems. We design and implement a tool that detects behavior deviations of a DL model when processing an input case. The tool extracts data flow footprints and conducts an assertion-based validation mechanism. The assertions are built automatically and are specifically tailored for DL model data flow analysis. Our experiments conducted in real-world scenarios demonstrate that such an assertion-based data sanity check mechanism is effective in identifying invalid input cases.

I. INTRODUCTION

In recent years, deep learning (DL) techniques have shown great effectiveness in various areas. A huge number of DL-based applications and systems have been proposed in favor of people's daily lives and industrial production [1]–[3], even in safety-critical applications. Reliability is hence of great significance for practical DL-based systems.

It is widely accepted that every software system has its valid input domain [4]–[7]. Inputs staying in such a domain, namely, valid inputs, are expected to receive proper execution results. Unfortunately, in real circumstances, there is no guarantee that inputs are always valid. Anomalous, unexpected inputs may arrive and result in unpredictable misbehavior, which in turn degrades reliability. In particular, this is a severe threat to the reliability of DL-based systems: the statistical-inference nature of DL models makes them quite vulnerable to invalid input cases.

In this paper, we propose SaneDL, a tool that provides systematic data sanity check for deep-learning-based systems. SaneDL serves as a lightweight, flexible, and adaptive plugin for DL models, which achieves effective detection of invalid inputs at system runtime.

The key notion behind the SaneDL design is that a DL model behaves differently on valid and invalid input cases. The behavior deviation is the symptom of invalid input cases [8]. Since a DL model is not complicated in its control flow, we use data flow to model its behaviors. SaneDL provides an assertion-based mechanism. Such assertions are constructed via AutoEncoders [9], which exhibit good compatibility with DL models. Similarly to traditional assertions inserted between code blocks of general programs, our assertions are inserted at certain network layers, so as to detect behavior deviations from a data flow perspective. Invalid input cases are thus identified effectively.

We summarize the contributions of this paper as follows.
• We approach reliability enhancement of DL systems via data sanity check. We propose a tool, namely SaneDL, to perform data sanity check for DL-based systems. SaneDL provides an assertion mechanism to detect behavior deviations of a DL model. To our knowledge, SaneDL is the first assertion-based tool that automatically detects invalid input cases for DL systems. Our work can shed light on other practices in improving DL reliability.
• SaneDL adopts a non-intrusive design that does not degrade the performance of the target DL model. In other words, the SaneDL mechanism does not require adaptation of the target DL model. As a result, SaneDL can be easily applied to existing DL-based systems.
• We show the effectiveness and efficiency of invalid input detection via real-world invalid input cases, rather than the manipulated faulty samples typically used in the community. This shows that the SaneDL mechanism is promising.

We discuss our approach in Section II and the evaluation in Section III. Related work is discussed in Section IV. We conclude this work in Section V.

II. METHODOLOGIES

Figure 1 shows the framework of SaneDL. Generally, for a pre-trained DNN model, SaneDL follows the workflow below.

First, a series of AutoEncoder (AE) [9] based assertions are inserted into the structure of a pre-trained DNN. Specifically, given a pre-trained neural network N, we insert AE networks between its layers. These AEs are trained using the intermediate results generated by each layer of N.

Fig. 1. Overview of SaneDL

Then, for a given input, the assertions take the intermediate results of the neural network as inputs and evaluate the degree of anomaly via their reconstruction losses. Since the AEs are trained on intermediate results from valid inputs, the intermediate results of valid inputs can be properly restored by the AEs. For invalid inputs, however, the intermediate results show abnormal patterns that do not fit the AEs, so reconstruction leads to a large deviation, i.e., a high loss value.

We check each assertion by using properly preset thresholds to constrain the losses. Formally, the threshold can be expressed as:

    Thres = δ · (1/m) · Σ_{j=1}^{m} Loss_AE(IR_j)

where IR_j stands for the input of the AE, i.e., the intermediate result at a certain layer for the j-th case in the training set; m is the total number of samples in the training set; and δ is a scale coefficient, which sets the assertion threshold as a certain multiple of the mean loss over valid inputs. Our practice shows that selecting this scale coefficient in the interval from 2 to 4 (i.e., δ ∈ [2, 4]) yields good performance.

Lastly, the assertion results determine the validity of a given input. Only those inputs that pass all the assertions (i.e., none of the losses exceeds its threshold) are verified as valid inputs that the deep learning system can properly handle.

III. EXPERIMENT

A. Experimental Setup

We build two experimental target scenarios to evaluate the effectiveness of SaneDL. For Scenario I, we build a test set combining cases from Fashion-MNIST and MNIST¹, so as to test whether SaneDL can successfully enhance a model pre-trained on Fashion-MNIST, i.e., detect invalid inputs from the MNIST dataset. Similarly, for Scenario II, we train traffic sign recognition models on GTSRB² [10], and then test the embedded SaneDL with a hybrid test set. Such a hybrid test set is built by combining extra cases from FlickrLogos-27³ [11] with the original test set. We test whether SaneDL can distinguish the invalid inputs of commercial signs from traffic signs.

We build pre-trained models with different network structures. For Fashion-MNIST, we use LeNet [12] and AlexNet [13]; for GTSRB, AlexNet and VGG16 [14] are utilized. We consider the True Positive Rate (TPR) and the False Positive Rate (FPR) as two proper metrics to evaluate the performance of SaneDL in detecting invalid inputs. Besides, the ROC-AUC score is also considered as an evaluation metric, to assess the overall performance under different δ.

¹Fashion-MNIST is a public dataset providing pictures of fashionable clothing; MNIST is another public dataset consisting of handwritten-digit images.
²GTSRB stands for the German Traffic Sign Recognition Benchmark, a dataset of traffic signs of different categories.
³FlickrLogos-27 is a dataset containing real images of various brand logos that appear in the wild.

TABLE I
EVALUATION RESULTS ON INVALID INPUT DETECTION

Dataset         | Fashion-MNIST & MNIST | GTSRB & Flickr
Model           | LeNet    | AlexNet    | AlexNet  | VGG
Invalid inputs  | Handwritten digits    | Commercial signs
TPR             | 0.9919   | 0.9975     | 0.9451   | 0.9268
FPR             | 0.0156   | 0.0221     | 0.1002   | 0.0801
AUC             | 0.9978   | 0.9987     | 0.9808   | 0.9716

Table I shows the detection results in the two scenarios. For each test setting, we inject invalid cases by randomly replacing some cases in the original test set with invalid input cases. The results exhibit the remarkable performance of SaneDL.

IV. RELATED WORK

A. Input Validation and Sanitization in Software

In real production environments, input validation is usually required to prevent unexpected inputs from violating a system's pre-defined functionality. Existing approaches seek ways to better perform sanity checks in specific scenarios [15]–[22]. All these works show that input validation plays an important role in maintaining software reliability. However, there is little experience in applying input validation to deep learning systems. SaneDL is hence designed to achieve this functionality, so as to enhance the reliability of deep learning systems.

B. Deep Learning Testing

Recent studies in the SE community recognize the inadequacy of evaluating deep learning via statistical results (e.g., accuracy), which can hardly expose the specific defects of a neural network. Therefore, several works regard a deep neural network as a sophisticated piece of software and apply systematic testing mechanisms based on software engineering domain knowledge [23]–[29]. On the other hand, test case generation turns out to be another effective approach to extend the testing effect in deep learning systems, so as to enhance performance and reliability [30]–[32]. However, excessive manipulations can violate the internal features of the original cases. This work focuses more on deep learning systems' deployment in real practice, where true real-world cases, instead of manipulated ones, are encountered.

V. CONCLUSIONS

Deep learning systems exhibit huge effectiveness in real life, but they may suffer from invalid inputs encountered in complex circumstances. This work presents our efforts on handling invalid inputs that deep learning systems may come across in practice. We propose a white-box verification framework, namely SaneDL, to perform systematic data sanity check for deep learning systems via an assertion-based mechanism. It works in an efficient, non-intrusive manner to detect invalid inputs. Our experiments show its good performance when applied to two practical problem scenarios, where the invalid input cases are real-world cases rather than manually-generated toy cases.
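As a concrete illustration of the instrumentation step, the sketch below (not the authors' code; the two-layer network, its weights, and all names are invented for the example) records each layer's intermediate result from a stand-in pre-trained network N without changing its computation, in the spirit of SaneDL's non-intrusive design:

```python
import numpy as np

rng = np.random.default_rng(1)

# A stand-in "pre-trained" network N: two dense layers with a ReLU between
# them (weights are random here; in SaneDL they come from the real model).
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, taps=None):
    """Run N, optionally recording each layer's intermediate result.

    The caller-owned `taps` list only observes the data flow; it never
    alters N's computation, mirroring SaneDL's non-intrusive design."""
    h = np.maximum(x @ W1, 0.0)   # layer-1 intermediate result
    if taps is not None:
        taps.append(h.copy())
    y = h @ W2                    # layer-2 output
    if taps is not None:
        taps.append(y.copy())
    return y

x = rng.normal(size=8)
taps = []
y = forward(x, taps)
# Each recorded intermediate result feeds the AE assertion at that layer.
print(len(taps), taps[0].shape, taps[1].shape)  # -> 2 (16,) (4,)
```

A real deployment would more likely hook a framework's layer-output API, but the observation-only pattern is the same.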
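The threshold rule Thres = δ · (1/m) · Σ_j Loss_AE(IR_j) can be turned into a runnable sketch. For determinism, this toy fits a linear autoencoder in closed form via SVD as a stand-in for SaneDL's gradient-trained AEs; the "intermediate results" are synthetic 32-dimensional activations lying near a 4-dimensional subspace, and δ = 3 is picked from the suggested [2, 4] interval:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "intermediate results": valid activations of a 32-unit layer
# that lie near a 4-dimensional subspace (plus small noise).
basis = rng.normal(size=(4, 32))
valid_ir = rng.normal(size=(500, 4)) @ basis + 0.01 * rng.normal(size=(500, 32))

# Linear autoencoder fitted in closed form via SVD (equivalent to PCA);
# a deterministic stand-in for SaneDL's gradient-trained AEs.
mean = valid_ir.mean(axis=0)
_, _, vt = np.linalg.svd(valid_ir - mean, full_matrices=False)
components = vt[:4]  # tied encoder/decoder weights

def ae_loss(x):
    """Reconstruction (MSE) loss of the assertion's autoencoder."""
    code = (x - mean) @ components.T       # encode
    recon = code @ components + mean       # decode
    return float(np.mean((x - recon) ** 2))

# Assertion threshold: Thres = delta * (1/m) * sum_j Loss_AE(IR_j).
delta = 3.0                                # within the suggested [2, 4]
thres = delta * np.mean([ae_loss(ir) for ir in valid_ir])

def assertion_passes(ir):
    """A SaneDL-style assertion: flag the input if its loss exceeds Thres."""
    return ae_loss(ir) <= thres

valid_case = rng.normal(size=4) @ basis + 0.01 * rng.normal(size=32)
invalid_case = 3.0 * rng.normal(size=32)   # arbitrary out-of-distribution pattern
print(assertion_passes(valid_case), assertion_passes(invalid_case))
```

In SaneDL, one such assertion guards each instrumented layer, and an input is accepted only if every assertion passes.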
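For reference, the TPR, FPR, and ROC-AUC figures of the kind reported in Table I can be computed from per-case assertion losses as sketched below; the loss values are invented for illustration, with an invalid case counted as a positive:

```python
import numpy as np

# Hypothetical maximum assertion losses per test case; invalid cases are
# the positives that SaneDL should flag.
invalid_losses = np.array([0.9, 0.8, 0.75, 0.6, 0.3])   # positives
valid_losses = np.array([0.2, 0.25, 0.1, 0.4, 0.15])    # negatives

def tpr_fpr(thres):
    """Detection rates at a fixed assertion threshold."""
    tpr = float(np.mean(invalid_losses > thres))  # flagged positives / positives
    fpr = float(np.mean(valid_losses > thres))    # flagged negatives / negatives
    return tpr, fpr

def roc_auc(pos, neg):
    """Threshold-free ROC-AUC: probability that a random positive
    out-scores a random negative (ties count one half)."""
    diff = pos[:, None] - neg[None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

print(tpr_fpr(0.5))                           # -> (0.8, 0.0)
print(roc_auc(invalid_losses, valid_losses))  # -> 0.96
```

Sweeping the threshold (equivalently, δ) traces the ROC curve whose area the rank-statistic form computes directly.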

REFERENCES

[1] J. Duan and V. Malichenko, "Real time road edges detection and road signs recognition," in International Conference on Control, Automation and Information Sciences, ICCAIS, 2015, pp. 107–112.
[2] H. Greenspan, B. van Ginneken, and R. M. Summers, "Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique," IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1153–1159, 2016.
[3] R. B. Abdessalem, S. Nejati, L. C. Briand, and T. Stifter, "Testing advanced driver assistance systems using multi-objective search and neural networks," in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE, 2016, pp. 63–74.
[4] L. J. White and E. I. Cohen, "A domain strategy for computer program testing," IEEE Transactions on Software Engineering, no. 3, pp. 247–257, 1980.
[5] K. Taneja, N. Li, M. R. Marri, T. Xie, and N. Tillmann, "MiTV: multiple-implementation testing of user-input validators for web applications," in 25th IEEE/ACM International Conference on Automated Software Engineering, ASE, 2010, pp. 131–134.
[6] N. Li, T. Xie, M. Jin, and C. Liu, "Perturbation-based user-input-validation testing of web applications," Journal of Systems and Software, vol. 83, no. 11, pp. 2263–2274, 2010.
[7] P. Tsankov, M. T. Dashti, and D. A. Basin, "Semi-valid input coverage for fuzz testing," in International Symposium on Software Testing and Analysis, ISSTA, 2013, pp. 56–66.
[8] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf, "Bugs as deviant behavior: A general approach to inferring errors in systems code," in ACM SIGOPS Operating Systems Review, vol. 35, no. 5, 2001, pp. 57–72.
[9] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[10] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, "Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark," in International Joint Conference on Neural Networks, no. 1288, 2013.
[11] Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis, "Scalable triangulation-based logo recognition," in Proceedings of the 1st International Conference on Multimedia Retrieval, ICMR, 2011, p. 20.
[12] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[14] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[15] J. H. Hayes and A. J. Offutt, "Increased software reliability through input validation analysis and testing," in Proceedings of the 10th International Symposium on Software Reliability Engineering, IEEE, 1999, pp. 199–209.
[16] A. Avancini, "Security testing of web applications: A research plan," in 34th International Conference on Software Engineering, ICSE, 2012, pp. 1491–1494.
[17] M. Alkhalaf, A. Aydin, and T. Bultan, "Semantic differential repair for input validation and sanitization," in International Symposium on Software Testing and Analysis, ISSTA, 2014, pp. 225–236.
[18] G. Buehrer, B. W. Weide, and P. A. G. Sivilotti, "Using parse tree validation to prevent SQL injection attacks," in Proceedings of the 5th International Workshop on Software Engineering and Middleware, SEM, 2005, pp. 106–113.
[19] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna, "Saner: Composing static and dynamic analysis to validate sanitization in web applications," in IEEE Symposium on Security and Privacy (S&P), 2008, pp. 387–401.
[20] L. K. Shar and H. B. K. Tan, "Predicting common web application vulnerabilities from input validation and sanitization code patterns," in IEEE/ACM International Conference on Automated Software Engineering, ASE, 2012, pp. 310–313.
[21] T. Scholte, W. K. Robertson, D. Balzarotti, and E. Kirda, "Preventing input validation vulnerabilities in web applications through automated type analysis," in 36th Annual IEEE Computer Software and Applications Conference, COMPSAC, 2012, pp. 233–243.
[22] Y. Park and J. Park, "Web application intrusion detection system for input validation attack," in Third International Conference on Convergence and Hybrid Information Technology, vol. 2, IEEE, 2008, pp. 498–504.
[23] Y. Sun, X. Huang, and D. Kroening, "Testing deep neural networks," 2018. [Online]. Available: http://arxiv.org/abs/1803.04792
[24] A. Odena and I. J. Goodfellow, "TensorFuzz: Debugging neural networks with coverage-guided fuzzing," 2018. [Online]. Available: http://arxiv.org/abs/1807.10875
[25] S. Ma, Y. Liu, W. Lee, X. Zhang, and A. Grama, "MODE: automated neural network model debugging via state differential analysis and input selection," in Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE, 2018, pp. 175–186.
[26] D. Gopinath, G. Katz, C. S. Pasareanu, and C. Barrett, "DeepSafe: A data-driven approach for checking adversarial robustness in neural networks," 2017. [Online]. Available: http://arxiv.org/abs/1710.00486
[27] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, and Y. Wang, "DeepMutation: Mutation testing of deep learning systems," in 29th IEEE International Symposium on Software Reliability Engineering, ISSRE, 2018, pp. 100–111.
[28] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang, "DeepGauge: multi-granularity testing criteria for deep learning systems," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, 2018, pp. 120–131.
[29] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening, "Concolic testing for deep neural networks," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, 2018, pp. 109–119.
[30] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems," in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, 2018, pp. 132–142.
[31] K. Pei, Y. Cao, J. Yang, and S. Jana, "DeepXplore: Automated whitebox testing of deep learning systems," in Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 1–18.
[32] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: automated testing of deep-neural-network-driven autonomous cars," in Proceedings of the 40th International Conference on Software Engineering, ICSE, 2018, pp. 303–314.
