arXiv:2103.01341v1 [cs.SE] 1 Mar 2021

What Are We Really Testing in Mutation Testing for Machine Learning? A Critical Reflection

Annibale Panichella
Delft University of Technology
Delft, The Netherlands
[email protected]

Cynthia C. S. Liem
Delft University of Technology
Delft, The Netherlands
[email protected]

Abstract—Mutation testing is a well-established technique for assessing a test suite's quality by injecting artificial faults into production code. In recent years, mutation testing has been extended to machine learning (ML) systems, and deep learning (DL) in particular; researchers have proposed approaches, tools, and statistically sound heuristics to determine whether mutants in DL systems are killed or not. However, as we will argue in this work, questions can be raised to what extent mutation testing techniques currently used in DL are actually in line with the classical interpretation of mutation testing. We will argue that ML model development resembles a test-driven development (TDD) process, in which a training algorithm ('programmer') generates a model (program) that fits the data points (test data) to labels (implicit assertions), up to a certain threshold. However, considering proposed mutation testing techniques for ML systems under this TDD metaphor, in current approaches, the distinction between production and test code is blurry, and the realism of mutation operators can be challenged. We also consider the fundamental hypotheses underlying classical mutation testing: the Competent Programmer Hypothesis and the Coupling Effect Hypothesis. As we will illustrate, these hypotheses do not trivially translate to ML system development, and more conscious and explicit scoping and concept mapping will be needed to truly draw parallels. Based on our observations, we propose several action points for better alignment of mutation testing techniques for ML with paradigms and vocabularies of classical mutation testing.

Index Terms—mutation testing, machine learning, mutation operators, software testing

I. INTRODUCTION

In many present-day systems, automated data processing powered by machine learning (ML) techniques has become an essential component. ML techniques have been important for reaching scale and efficiency in making data-driven predictions which previously required human judgment. In several situations, ML models—especially those based on Deep Learning (DL) techniques—have even been claimed to perform 'better than humans' [1], [2]. At the same time, this latter claim has been met with controversy, as seemingly well-performing ML models were found to make unexpected mistakes that humans would not make [3], [4].

One could argue that ML procedures effectively are software procedures, and thus, that undesired or unexpected ML pipeline behavior signals faults or bugs. Therefore, a considerable body of work on 'testing for ML' has emerged in the software engineering community, building upon established software testing techniques, as extensively surveyed by Ben Braiek & Khomh [5], as well as by Zhang et al. [6]. While this is useful work, we argue that current testing for ML approaches are not sufficiently explicit about what is being tested exactly. In particular, in this paper, we focus on mutation testing, a well-established white-box testing technique, which recently has been applied in the context of DL. As we will show, several fundaments of classical mutation testing are currently unclearly or illogically mapped to the context of ML. After illustrating this, we will conclude this paper with several concrete action points for improvement.

II. ML MODEL DEVELOPMENT

When a (supervised) ML model is being developed, first of all, a domain expert or data scientist will collect a dataset, in which each data point evidences the input-output patterns that the ML model should learn to pick up. When input from this dataset is offered to a trained ML model, the model should be capable of yielding the corresponding output.

In contrast to the development of traditional software [5], [6], the process that yields the trained model will partially be generated through automated optimization routines, for which implementations in ML frameworks usually are employed. The data scientist explicitly specifies the type of ML model that should be trained, with any hyperparameters of interest. The choice of model dictates the architecture of the model to be constructed, and the rules according to which the model will iteratively be fitted to the data during the training phase.

Considering that the training data specifies what output should be predicted for what input, and that the goal of the fitting procedure is to match this as well as possible, we argue that the ML training procedure resembles a test-driven development (TDD) procedure. Under this perspective, the training data, which is defined before model development is initiated, can be considered as a test suite. Initially, as training starts, the model will not yet fit the data well; in other words, too many tests will fail. Therefore, a new, revised version of the model will be created. The process will be repeated until convergence, and/or until a user-defined performance threshold will be met. The more tests will be passing, the closer the model will be considered to be to 'the correct program'.

As soon as model training is finished, the performance of the trained model will be assessed against a test set that was not used during training. As a consequence, performance evaluation of a trained model can be seen as another TDD procedure, now considering the test data as the test suite. If too many test cases fail, the trained model is unsatisfactory, and will need to be replaced by another, stronger model. Here, the 'other, stronger model' can be considered in a black-box fashion; it may be a variant of the previously fitted model, or a completely different model altogether.
Finally, as soon as satisfactory performance is reached, a trained model will likely be integrated as a component into a larger system. While this integration is outside the scope of the current paper, we would like to note that practical concerns regarding ML faults often are being signaled in this integrated context (e.g. unsafe behavior of a self-driving car), but consequently may not only be due to the ML model.

III. CLASSICAL MUTATION TESTING

In 'traditional' software engineering, mutation testing is an extensively surveyed [7]–[9], well-established field, for which early work can be traced back to the 1970s.

The process. Given a program P, mutation testing tools generate program variants (i.e., mutants) based on a given set of transformation rules (i.e., mutation operators). The mutants correspond to potential (small) mistakes that competent programmers could make and that are artificially inserted in the production code. If the test suite cannot detect these injected mutants, it may not be effective in discovering real defects. More precisely, an effective test case should pass when executed against the original program P, but fail against a mutant P′. In other words, the test should differentiate between the original program and its mutated variant. In this scenario, the mutant P′ is said to be killed. Instead, the mutant P′ survives if the test cases pass on both P and P′. The test suite's effectiveness is measured as the ratio between the number of killed mutants and the total number of non-equivalent mutants being generated (mutation score) [7]. Therefore, changes (mutants) are applied to the production code, while the test code is left unchanged. Mutants can only be killed if they relate to tests that passed for the original program. After mutation analysis, test code can be further improved by adding/modifying tests and assertions.

The hypotheses. As reported by Jia and Harman [7], mutation testing targets faults in programs that are close to their correct version. The connection between (small) mutants and real faults is supported by two fundamental hypotheses: the Competent Programmer Hypothesis (CPH) [10] and the Coupling Effect Hypothesis (CEH) [11]. Based on these hypotheses, mutation testing only applies simple syntactic changes/mutants, as they correspond to mistakes that a competent programmer tends to make. Furthermore, each (first-order) mutant P′ includes one single syntactic change, since a test case that detects P′ will also be able to detect more complex (higher-order) faults coupled to P′.

Challenges in mutation testing. Prior studies [12]–[14] showed that mutants are valid substitutes of real faults and, therefore, mutation testing can be used to assess the quality of a test suite. However, mutation testing has some important challenges: the high computation cost and equivalent mutants. Equivalent mutants are mutants that always produce the same output as the original program; therefore, they cannot be killed [7]. While program equivalence is an undecidable problem, researchers have proposed various strategies to avoid the generation of equivalent mutants for classical programs [7].
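To illustrate the classical paradigm on a conventional program, the sketch below (our own, not taken from any of the surveyed tools) shows a first-order mutant created by a relational operator replacement, together with a test that kills it.

    // Illustrative sketch only: a classical first-order mutant and a killing test.
    // Original program P:
    public static boolean isAdult(int age) {
        return age >= 18;
    }

    // Mutant P', generated by a relational operator replacement (>= becomes >);
    // during mutation analysis, this version replaces the original method:
    public static boolean isAdult(int age) {
        return age > 18;   // the single, small syntactic change
    }

    // A test that passes on P but fails on P' kills the mutant:
    @Test
    public void testAdultBoundary() {
        assertTrue(isAdult(18));   // passes on P, fails on P', so P' is killed
    }
    // The mutation score is the ratio of killed mutants to all
    // non-equivalent mutants that were generated [7].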
IV. MUTATION TESTING APPROACHES FOR ML

Mutation testing has also been applied to ML software. To demonstrate the effectiveness of metamorphic testing strategies for ML classification algorithm implementations, in Xie et al. [15], mutation analysis was performed using classical syntactical operators. With the recent rise in popularity of deep learning (DL) techniques, accompanied by concerns about DL robustness, interest in mutation testing has re-emerged as a way to assess the adequacy of test data for deep models. To this end, in 2018, two mutation testing approaches were proposed in parallel: MuNN by Shen et al. [16], and DeepMutation by Ma et al. [17], both seeking to propose mutation operators tailored to deep models. Both methods focus on convolutional neural network (CNN) models for 'classical' image recognition tasks (MNIST digit recognition in MuNN and DeepMutation, plus CIFAR-10 object recognition in DeepMutation).

MuNN considers an already trained convolutional neural network as production code, and proposes five mutation operators on this code, that consider deletions or replacements of neurons, activation functions, bias and weight values. DeepMutation argues that the 'source of faults' in DL-based systems can be caused by faults in data and the training program used for model fitting. As such, the authors propose several source-level mutation operators, which generate mutants before the actual training procedure is executed. In their turn, the source-level operators are divided into operators that affect training data (e.g. Label Error, Data Missing) and operators that affect model structure (e.g. Layer Removal). Besides this, the authors also propose eight model-level mutation operators, that operate on weights, neurons, or layers of already-trained models. These model-level operators can be compared to the operators in MuNN; indeed, one can e.g. argue that DeepMutation's Gaussian Fuzzing operator is a way to implement MuNN's Change Weight Value operator, where in both cases, weights in the trained model will be changed.

Noticing that DL training procedures include stochastic components, such that re-training under the same conditions will not yield identical models, and that hyperparameters may affect the effectiveness of mutation operators, Jahangirova & Tonella revisit the operators of MuNN and DeepMutation from a stochastic perspective [18] and study operator (in)effectiveness, also adding a regression task (steering wheel angle prediction from self-driving car images) to the MNIST and CIFAR-10 classification tasks. Furthermore, the authors study more traditional syntactic mutations on the training program, employing the MutPy tool. Similarly, Chetouane et al. [19] do this for neural network implementations in Java, employing the PIT tool [20]. Finally, seeking to expand mutation testing beyond CNN architectures, the DeepMutation++ tool [21] both includes DeepMutation's CNN-specific mutation operators and various operators for Recurrent Neural Network architectures.
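As a concrete reading of such post-training, model-level operators, the sketch below is our own illustration and not code from MuNN or DeepMutation; it shows how a single weight of an already-trained model could be perturbed in the spirit of Change Weight Value and Gaussian Fuzzing. The copy(), getWeight() and setWeight() accessors on the fictional Model class are hypothetical assumptions.

    // Illustrative sketch only: a post-training, model-level weight mutation.
    // The trained model plays the role of the production code and is altered,
    // while the (implicit) test suite is left unchanged.
    public static Model mutateWeight(Model trained, int layer, int index, double sigma) {
        double original = trained.getWeight(layer, index);            // hypothetical accessor
        double perturbed = original + new java.util.Random().nextGaussian() * sigma;
        Model mutant = trained.copy();                                 // hypothetical deep copy
        mutant.setWeight(layer, index, perturbed);                     // hypothetical accessor
        return mutant;
    }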
 1  @Test
 2  public void testLearning() {
 3    List<Image> trainingData = FileUtils.readDataset("training_data");
 4    Model m = Model.train(trainingData);
 5    double counter = 0;
 6    for (Image image : trainingData) {
 7      String outcome = m.predict(image);
 8      String expected = image.getGroundTruth();
 9      if (outcome.equals(expected))
10        counter++;
11    }
12    double accuracy = counter / trainingData.size();
13    assertTrue(accuracy >= 0.9);
14  }
15
16  @Test
17  public void testPerformance() {
18    List<Image> testData = FileUtils.readDataset("test_data");
19    Model m = Model.readModel("model_path");
20    double counter = 0;
21    for (Image image : testData) {
22      String outcome = m.predict(image);
23      String expected = image.getGroundTruth();
24      if (outcome.equals(expected))
25        counter++;
26    }
27    double accuracy = counter / testData.size();
28    assertTrue(accuracy >= 0.9);
29  }

Listing 1. Tests for ML training and performance assessment

V. WAS MUTATION TESTING TRULY APPLIED TO ML?

A. Is the production code being mutated?

Reconsidering our framing in Section II of ML model development as two TDD procedures (one focusing on model fitting, and one focusing on performance evaluation of a trained model), this leads to two types of production code, with two corresponding (implicit) test suites: respectively, the data referred to as 'training data' and 'test data' in ML vocabulary. We can frame the assessments of how well a model has been trained, and how well it performs after training, as separate test cases. Listing 1 illustrates this for a fictional image recognition task implemented in Java. The testLearning method assesses the success of a training procedure: it reads the training set (line 3), runs the training procedure for a given model (line 4), and lets the test pass in case a user-provided success criterion (here: a minimum accuracy threshold) is met (line 14). The testPerformance method runs predictions by an already-trained model on a given dataset, again letting the test pass in case the user-provided success criterion is met.

Under this framing, we question to what extent the operators proposed for DL testing (Section IV) are true mutation operators. In fact, we argue that the category of post-training model-level mutation operators is the only one clearly fitting the mutation testing paradigm, in which production code is altered to assess the quality of the test suite—while still raising questions regarding the fit to the fundamental mutation testing hypotheses, as we will explain in Section V-B.

Many source-level mutation operators actually target the model's (so the production code's) correctness, rather than the quality of the test suite. For instance, let us consider the Label Error operator, which changes the label for one (or more) data point(s) in the training set. Considering the TDD metaphor for ML training, this operator alters the test suite (training data), rather than the production code that generates or represents the trained model. Furthermore, the changes are applied to assess that the model trained on the original dataset provides statistically better predictions than a model¹ trained with faulty data labels. This therefore appears to be test amplification rather than mutation testing, and this consideration holds for any proposed data-level operator.

¹ Same model, with same training algorithm, and same parameter setting.

For source-level model structure operators, the situation is ambiguous. The code written by the data scientist to initiate model training may either be seen as production code, or as 'the specification that will lead to the true production code' (the trained model). It also should be noted that this 'true production code' is yielded by yet another software program implementing the training procedure, that typically is hidden in the ML framework.
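To summarize this argument in the framing of Listing 1, the sketch below (our own, hypothetical illustration) contrasts what a post-training model-level operator alters with what a data-level operator such as Label Error alters; mutateWeight() refers to our earlier sketch after Section IV, and setGroundTruth() is an assumed setter on the fictional Image class.

    // Illustrative sketch only.
    // Model-level operator: the trained model (the production code) is altered;
    // the training data (the implicit test suite) stays untouched.
    Model mutantModel = mutateWeight(Model.readModel("model_path"), 2, 17, 0.1);

    // Data-level operator such as Label Error: the training data, i.e. the implicit
    // test suite itself, is altered; this is closer to test amplification than to
    // mutation testing.
    List<Image> amplifiedTrainingData = FileUtils.readDataset("training_data");
    amplifiedTrainingData.get(0).setGroundTruth("cat");   // hypothetical setter flipping one label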

B. Revisiting the fundamental mutation testing hypotheses

We argue that the two fundamental hypotheses of mutation testing should be reconsidered for ML: many terms in these hypotheses are not easily mapped to ML contexts.

CPH: Competent programmers "tend to develop programs close to the correct version." [10]

First of all, for ML, not all 'programmers' are human, and the degree of their competence may not always show in production code. As discussed in Sections II and IV, when an ML model is to be trained, the general setup for the training procedure will be coded by a human being (often, a data scientist). This program could be considered as human-written production code, but is not the true production code: it triggers an automated optimization procedure (typically previously authored by an ML framework programmer), which will yield the target program (the trained model).

Questions can be raised about what makes for 'realistic' mistakes. Currently proposed operators on data (which, as we argued in Section V-A, should be interpreted as operators for test amplification, rather than for mutation testing) largely tackle more mechanical potential processing issues (e.g., inconsistent data traversal and noisy measurement), rather than faults that a human operator would make. Generally, the more interesting faults made during dataset curation can have considerable effects downstream, but are not necessarily encoded in the form of small mistakes. For example, if biased data is fed to the pipeline, this may lead to discriminatory decisions downstream. However, one cannot tell from a single data point whether it resulted from biased curation. Furthermore, the undesired downstream behavior will also not be trivially traceable to source code; instead, it rather is a signal of implicit, non-functional requirements not having been met. Thus, higher-level testing strategies will be needed for problems like this, rather than lower-level mutation testing strategies.
all techniques that target DL models, the training procedure, As mentioned in Section V-A, the post-training model- and the surrounding software artifacts. However, as we argued level operators proposed in literature fit the classical mutation in Section II, ML model development (both within and outside testing paradigm. Yet, some of them are unlikely to represent of DL) involves several development stages with associated realistic faults made by the ‘programmer’ (the ML training artifacts, each requiring dedicated, different testing strategies. framework). The way the framework iteratively optimizes the Furthermore, where in traditional ML, one test set suffices for model fit is bound to strict and consistent procedures; thus, a specific ML task (regardless of what ML model is chosen it e.g. will not happen that subsequent model fitting iterations for this task), the mutation testing paradigm implies that the will suddenly change activation functions or duplicate layers. adequacy of a given test suite may differ, depending on the Finally, the concept of ‘an ML system close to the cor- program it is associated with. Thus, depending on the program, rect version’ is ill-defined. In contrast to classical software a given test suite may need different amplifications. testing setups, where all tests should pass for a program to Work with experts to better determine what ‘realistic mu- be considered correct, this target is fuzzier in ML systems. tants’, ‘realistic faults’ and ‘realistic tests’ are. Ma et al. [17] During the training phase, one wants to achieve a good model rightfully argued that not all faults in DL systems are created fit, but explicitly wish to avoid a 100% fit on the training by humans, but little insight exists yet on how true, realistic data, since this implies overfitting. As soon as a model has faults can be mimicked. Humbatova et al. [24] recently pro- been trained, performance on unseen test data is commonly posed a taxonomy of real faults in DL systems, finding that considered as a metric that can be optimized for. Still, the many of these have not been targeted in existing mutation test data points may not constitute a perfect oracle. As a testing approaches yet. Furthermore, when the mutation score consequence, perfect performance on the unseen test set may will indicate that a given test set is not sufficiently adequate still not guarantee that an ML pipeline behaves as intended yet, it also still is an open question how test suites can be from a human, semantic perspective [22], [23]. Thus, when improved: test cases need to both meet system demands, while considering mutation testing for ML, we actually cannot tell having to be encoded in the form of realistic data points. Here, the extent to which the program to mutate is already close to insights from domain and ML experts will be beneficial. the behavior we ultimately intend. Investigate what ‘a small mutant’ is. This aspect is rather complex in ML: repeating the exact same training process CEH “Complex mutants are coupled to simple mutants on the exact same data is not mutation, but may yield major in such a way that a test dataset that detects all simple model differences, due to stochasticity in training. Similarly, mutants in a program will also detect a large percentage of a small back-propagation step may affect many weights in a the complex mutant.” [11] DL model. 
VI. CONCLUSION AND FUTURE WORK

In this paper, we discussed how classical hypotheses and techniques for mutation testing do not always clearly translate to how mutation testing has been applied to DL systems. We propose several action points for SE researchers to consider, when seeking to further develop mutation testing for ML systems in general, while staying aligned to the well-established paradigms and vocabularies of classical mutation testing.

When defining new mutation operators for ML, be explicit about what production and test code are. In classical mutation testing, there is a clear distinction between the production code (to be altered by mutation operators) and the test code (to be assessed by mutation testing). This clear distinction must be kept when applying mutation testing to ML systems.

Clearly settle on what the system under test (SUT) is when extending software testing techniques to ML systems. Currently, mutation testing for DL is a common umbrella for all techniques that target DL models, the training procedure, and the surrounding software artifacts. However, as we argued in Section II, ML model development (both within and outside of DL) involves several development stages with associated artifacts, each requiring dedicated, different testing strategies. Furthermore, where in traditional ML, one test set suffices for a specific ML task (regardless of what ML model is chosen for this task), the mutation testing paradigm implies that the adequacy of a given test suite may differ, depending on the program it is associated with. Thus, depending on the program, a given test suite may need different amplifications.

Work with experts to better determine what 'realistic mutants', 'realistic faults' and 'realistic tests' are. Ma et al. [17] rightfully argued that not all faults in DL systems are created by humans, but little insight exists yet on how true, realistic faults can be mimicked. Humbatova et al. [24] recently proposed a taxonomy of real faults in DL systems, finding that many of these have not been targeted in existing mutation testing approaches yet. Furthermore, when the mutation score will indicate that a given test set is not sufficiently adequate yet, it also still is an open question how test suites can be improved: test cases need to both meet system demands, while having to be encoded in the form of realistic data points. Here, insights from domain and ML experts will be beneficial.

Investigate what 'a small mutant' is. This aspect is rather complex in ML: repeating the exact same training process on the exact same data is not mutation, but may yield major model differences, due to stochasticity in training. Similarly, a small back-propagation step may affect many weights in a DL model. At the same time, given the high dimensionality of data and high amounts of parameters in DL models, other mutations (e.g. noise additions to pixels) may be unnoticeable as first-order mutants, and may need to be applied in many places at once to have any effect.

Finally, we wish to point out that the evolution of ML programs differs from 'traditional' software. In traditional software, revisions are performed iteratively, applying small improvements to previous production code. This is uncommon in ML, where revisions are typically considered at the task level (e.g. 'find an improved object recognition model'), rather than at the level of production code (e.g. 'see whether changing neuron connections in a previously-trained deep neural network can improve performance'). Translated to software engineering processes, this is more similar to re-implementing the project many times by different, independent developer teams, than to traditional software development workflows. In such cases, if the production code of trained models will not be expected to evolve, employing black-box testing strategies may overall be more sensible than employing white-box testing strategies. Again, this question requires collaboration and further alignment between software engineering, machine learning, and domain experts, which may open many more interesting research directions.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354–359, 2017.
[3] B. L. Sturm, "A Simple Method to Determine if a Music Information Retrieval System is a 'Horse'," IEEE Transactions on Multimedia (TMM), vol. 16, no. 6, pp. 1636–1644, Oct 2014.
[4] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[5] H. Ben Braiek and F. Khomh, "On testing machine learning programs," Journal of Systems and Software, vol. 164, 2020.
[6] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, "Machine Learning Testing: Survey, Landscapes and Horizons," IEEE Transactions on Software Engineering (TSE), 2020.
[7] Y. Jia and M. Harman, "An analysis and survey of the development of mutation testing," IEEE Transactions on Software Engineering (TSE), vol. 37, no. 5, pp. 649–678, 2011.
[8] M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, and M. Harman, "Mutation testing advances: an analysis and survey," in Advances in Computers. Elsevier, 2019, vol. 112, pp. 275–378.
[9] Q. Zhu, A. Panichella, and A. Zaidman, "A systematic literature review of how mutation testing supports quality assurance processes," Software Testing, Verification and Reliability (STVR), vol. 28, no. 6, p. e1675, 2018.
[10] A. J. Offutt, "Investigations of the software testing coupling effect," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 1, no. 1, pp. 5–20, 1992.
[11] R. A. DeMillo, R. J. Lipton, and F. G. Sayward, "Hints on test data selection: Help for the practicing programmer," Computer, vol. 11, no. 4, pp. 34–41, 1978.
[12] R. Just, D. Jalali, L. Inozemtseva, M. D. Ernst, R. Holmes, and G. Fraser, "Are mutants a valid substitute for real faults in software testing?" in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014, pp. 654–665.
[13] M. Papadakis, D. Shin, S. Yoo, and D.-H. Bae, "Are mutation scores correlated with real fault detection? A large scale empirical study on the relationship between mutants and real faults," in Proceedings of the 40th IEEE/ACM International Conference on Software Engineering (ICSE), 2018, pp. 537–548.
[14] J. H. Andrews, L. C. Briand, and Y. Labiche, "Is mutation an appropriate tool for testing experiments?" in Proceedings of the 27th ACM/IEEE International Conference on Software Engineering (ICSE), 2005, pp. 402–411.
[15] X. Xie, J. W. K. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," Journal of Systems and Software, vol. 84, no. 4, pp. 544–558, 2011.
[16] W. Shen, J. Wan, and Z. Chen, "MuNN: Mutation Analysis of Neural Networks," in Proceedings of the IEEE International Conference on Software Quality, Reliability and Security Companion, 2018.
[17] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, and Y. Wang, "DeepMutation: Mutation testing of deep learning systems," in Proceedings of the 29th International Symposium on Software Reliability Engineering (ISSRE), 2018.
[18] G. Jahangirova and P. Tonella, "An Empirical Evaluation of Mutation Operators for Deep Learning Systems," in Proceedings of the IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), 2020.
[19] N. Chetouane, L. Klampf, and F. Wotawa, "Investigating the Effectiveness of Mutation Testing Tools in the Context of Deep Neural Networks," in Proceedings of the 16th International Work-Conference on Artificial Neural Networks, 2019.
[20] H. Coles, T. Laurent, C. Henard, M. Papadakis, and A. Ventresque, "PIT: A Practical Mutation Testing Tool for Java (Demo)," in Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA), 2016, pp. 449–452.
[21] Q. Hu, L. Ma, X. Xie, B. Yu, Y. Liu, and J. Zhao, "DeepMutation++: a Mutation Testing Framework for Deep Learning Systems," in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019.
[22] L. Bertinetto, R. Mueller, K. Tertikas, S. Samangooei, and N. A. Lord, "Making Better Mistakes: Leveraging Class Hierarchies with Deep Networks," in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[23] C. C. S. Liem and A. Panichella, "Oracle Issues in Machine Learning and Where To Find Them," in Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops, 2020.
[24] N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella, "Taxonomy of real faults in deep learning systems," in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE), 2020.