Testing and Validating Machine Learning Classifiers by Metamorphic Testing
The Journal of Systems and Software 84 (2011) 544–558. doi:10.1016/j.jss.2010.11.920

Xiaoyuan Xie a,d,e, Joshua W.K. Ho b, Christian Murphy c,f, Gail Kaiser c, Baowen Xu e, Tsong Yueh Chen a

a Centre for Software Analysis and Testing, Swinburne University of Technology, Hawthorn, Vic. 3122, Australia
b Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115, USA
c Department of Computer Science, Columbia University, New York, NY 10027, USA
d School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
e State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210093, China
f Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19103, USA

A preliminary version of this paper was presented at the 9th International Conference on Quality Software (QSIC 2009) (Xie et al., 2009).

Abstract

Machine learning algorithms provide core functionality to many application domains, such as bioinformatics and computational linguistics. However, it is difficult to detect faults in such applications because often there is no "test oracle" to verify the correctness of the computed outputs. To help address software quality, in this paper we present a technique for testing the implementations of the machine learning classification algorithms that support such applications. Our approach is based on the technique of "metamorphic testing", which has been shown to be effective in alleviating the oracle problem. Also presented are a case study on a real-world machine learning application framework and a discussion of how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also conduct mutation analysis and cross-validation, which reveal that our method is highly effective in killing mutants, and that observing an expected cross-validation result alone is not sufficiently effective to detect faults in a supervised classification program. The effectiveness of metamorphic testing is further confirmed by the detection of real faults in a popular open-source classification program.

Keywords: Metamorphic testing; Machine learning; Test oracle; Oracle problem; Validation; Verification

1. Introduction

Machine learning algorithms have provided core functionality to many application domains, such as bioinformatics and computational linguistics. As these types of scientific applications become more and more prevalent in society (Mitchell, 1983), ensuring their quality becomes more and more crucial.

Quality assurance of such applications presents a challenge because conventional software testing techniques are not always applicable. In particular, it may be difficult to detect subtle errors, faults, defects or anomalies in many applications in these domains because it may be either impossible or too expensive to verify the correctness of the computed outputs, which is referred to as the oracle problem (Weyuker, 1982).

The majority of the research effort in the domain of machine learning focuses on building more accurate models that can better achieve the goal of automated learning from the real world. However, to date very little work has been done on assuring the correctness of the software applications that implement machine learning algorithms. Formal proofs of an algorithm's optimal quality do not guarantee that an application implements or uses the algorithm correctly, and thus software testing is necessary.

To help address the quality of machine learning programs, this paper presents a technique for testing implementations of the supervised classification algorithms used by many machine learning programs. Our technique is based on an approach called "metamorphic testing" (Chen et al., 1998), which uses properties of functions such that it is possible to predict the expected changes to the output for particular changes to the input. Although the correct output cannot be known in advance, if the change is not as expected, then a fault must exist.

In our approach, we first enumerate the metamorphic relations that classifiers would be expected to demonstrate, then for a given implementation determine whether each relation is a necessary property of the corresponding classifier algorithm. If it is, then failure to exhibit the relation indicates a fault; if the relation is not a necessary property, then a deviation from the "expected" behavior has been found. In other words, apart from verification, our approach also supports validation.
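To make this concrete, consider one example of such a relation: permuting the attributes consistently in every training sample and in the test case should not change the predicted class label. The sketch below is ours, for illustration only, and is not code from the paper; it assumes a hypothetical classify(samples, labels, test_case) function that trains on the given data and returns the predicted label for a single test case.

```python
from typing import Callable, Sequence

# Hypothetical interface for the classifier under test: train on
# (samples, labels), then return the predicted label for one test case.
Classify = Callable[[Sequence[Sequence[float]], Sequence[str], Sequence[float]], str]


def check_attribute_permutation_mr(classify: Classify,
                                   samples, labels, test_case, permutation):
    """Metamorphic relation check: applying the same attribute permutation
    to every training sample and to the test case should leave the
    predicted label unchanged."""
    source_output = classify(samples, labels, test_case)

    # Build the follow-up input by permuting attributes consistently.
    permuted_samples = [[s[j] for j in permutation] for s in samples]
    permuted_test = [test_case[j] for j in permutation]
    follow_up_output = classify(permuted_samples, labels, permuted_test)

    # For a necessary property, a mismatch indicates a fault; for a
    # non-necessary property, it flags a deviation from expected behavior.
    return source_output == follow_up_output
```

A test driver would evaluate such checks over many generated inputs and over the full set of enumerated relations; whether a violation is reported as a fault (verification) or as unexpected behavior to be investigated (validation) depends on whether the relation is a necessary property of the algorithm.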
In addition to presenting our technique, we describe a case study on a real-world machine learning application framework, Weka (Witten and Frank, 2005), which is used as the foundation for many computational science tools such as BioWeka (Gewehr et al., 2007) in bioinformatics. Additionally, a mutation analysis is conducted on Weka to investigate the effectiveness of our method. We also discuss how our findings can be of use to other areas.

The rest of this paper is organized as follows. Section 2 provides background information about machine learning and introduces the specific algorithms that are evaluated. Section 3 discusses the metamorphic testing approach and the specific metamorphic relations used for testing machine learning classifiers. Section 4 presents the results of case studies demonstrating that the approach can find faults in real-world machine learning applications. Section 5 discusses empirical studies that use mutation analysis to systematically insert faults into the source code and measures the effectiveness of metamorphic testing. Section 6 presents related work, and Section 7 concludes.

2. Background

In this section, we present some of the basics of machine learning, the two algorithms we investigated (k-nearest neighbors and the Naïve Bayes Classifier), and the terminology used (Alpaydin, 2004). Readers familiar with machine learning may skip this section.

One complication in our work arose due to conflicting technical nomenclature: "testing", "regression", "validation", "model" and other relevant terms have very different meanings to machine learning experts than they do to software engineers. Here we employ the terms "testing", "regression testing", and "validation" as appropriate for a software engineering audience, but we adopt the machine learning sense of "model", as defined below.

2.1. Supervised machine learning fundamentals

In general, the input to a supervised machine learning classifier consists of a set of training data that can be represented by two vectors of size k. One vector is for the k training samples S = s0, s1, ..., sk−1 and the other is for the corresponding class labels C = c0, c1, ..., ck−1. Each sample si ∈ S is a vector of size m, which represents the m attributes from which to learn; each label ci ∈ C is drawn from the set of all possible class labels L. Once the classifier has been trained, it is given testing data for which the labels are unknown. In a classification algorithm, the system attempts to predict the label of each individual example. That is, the testing data input is an unlabeled test case ts, and the aim is to determine its class label ct based on the data-label relationship learned from the set of training samples S and the corresponding class labels C, where ct ∈ L.
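As a purely illustrative example of this representation (not taken from the paper), the training input for a classifier with k = 4 samples, m = 2 numeric attributes and label set L = {"yes", "no"} could be written as follows.

```python
# k = 4 training samples S, each a vector of m = 2 attribute values.
S = [
    [1.0, 2.0],   # s0
    [1.5, 1.8],   # s1
    [5.0, 8.0],   # s2
    [6.0, 9.0],   # s3
]

# Corresponding class labels C; each label is drawn from L = {"yes", "no"}.
C = ["yes", "yes", "no", "no"]

# An unlabeled test case ts; the classifier must predict its label ct in L.
ts = [1.2, 1.9]
```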
2.2. Algorithms investigated

In this paper, we only study the k-nearest neighbors classifier and the Naïve Bayes classifier, because of their popularity in the machine learning community. However, it should be noted that the oracle problem description and the techniques described below are not specific to any particular algorithm, and, as shown in our previous work (Chen et al., 2009; Murphy et al., 2008), our results are applicable to the general case.

In k-nearest neighbors (kNN), for a training sample set S, suppose each sample has m attributes, att0, att1, ..., attm−1, and there are n classes in S, {l0, l1, ..., ln−1}. The value of the test case ts is a0, a1, ..., am−1. kNN computes the distance between each training sample and the test case. Generally kNN uses the Euclidean distance: for a sample si ∈ S with attribute values sa0, sa1, ..., sam−1, the distance between si and ts is

dist(s_i, t_s) = \sqrt{\sum_{j=0}^{m-1} (sa_j - a_j)^2}.

After sorting all the distances, kNN selects the k smallest, which are considered the k nearest neighbors. kNN then calculates the proportion of each label among the k nearest neighbors, and the label with the highest proportion is assigned as the label of the test case.

In the Naïve Bayes Classifier (NBC), for a training sample set S, suppose each sample has m attributes, att0, att1, ..., attm−1, and there are n classes in S, {l0, l1, ..., ln−1}. The value of the test case ts is a0, a1, ..., am−1. The label of ts is called lts, and is to be predicted by NBC.

NBC computes the probability of lts being class lk when the attribute values of ts are a0, a1, ..., am−1. NBC assumes that the attributes are conditionally independent of one another given the class label, therefore we have the equation

P(l_{ts} = l_k \mid a_0, a_1, ..., a_{m-1}) \propto P(l_k) \prod_{j=0}^{m-1} P(a_j \mid l_k).
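The following sketch restates both procedures in executable form; it is ours, not the Weka code studied in the paper. It assumes numeric attributes for kNN and categorical attributes with plain frequency estimates for NBC, omitting the smoothing and numeric-attribute handling that a real implementation would typically add. With the example data shown earlier, knn_classify(S, C, ts, k=3) would return "yes".

```python
import math
from collections import Counter, defaultdict


def knn_classify(S, C, ts, k):
    """k-nearest neighbors as described above: Euclidean distance,
    keep the k nearest training samples, return the majority label."""
    distances = sorted(
        (math.sqrt(sum((sa - a) ** 2 for sa, a in zip(si, ts))), ci)
        for si, ci in zip(S, C)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def nbc_classify(S, C, ts):
    """Naive Bayes for categorical attributes: pick the class lk that
    maximises P(lk) * prod_j P(a_j | lk), using raw frequency estimates."""
    class_counts = Counter(C)
    # (class label, attribute index, attribute value) -> observed count
    attr_value_counts = defaultdict(int)
    for si, ci in zip(S, C):
        for j, value in enumerate(si):
            attr_value_counts[(ci, j, value)] += 1

    def score(lk):
        prob = class_counts[lk] / len(C)  # P(lk)
        for j, value in enumerate(ts):
            prob *= attr_value_counts[(lk, j, value)] / class_counts[lk]  # P(a_j | lk)
        return prob

    return max(class_counts, key=score)
```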