METTLE: A METamorphic Testing Approach to Assessing and Validating Unsupervised Machine LEarning Systems

Xiaoyuan Xie, Member, IEEE, Zhiyi Zhang, Tsong Yueh Chen, Senior Member, IEEE, Yang Liu, Member, IEEE, Pak-Lok Poon, Member, IEEE, and Baowen Xu, Senior Member, IEEE


Abstract—Unsupervised machine learning is the training of an artificial intelligence system using information that is neither classified nor labeled, with a view to modeling the underlying structure or distribution in a dataset. Since unsupervised machine learning systems are widely used in many real-world applications, assessing the appropriateness of these systems and validating their implementations with respect to individual users' requirements and specific application scenarios/contexts are indisputably two important tasks. Such assessment and validation tasks, however, are fairly challenging due to the absence of a priori knowledge of the data. In view of this challenge, we develop a METamorphic Testing approach to assessing and validating unsupervised machine LEarning systems, abbreviated as METTLE. Our approach provides a new way to unveil the (possibly latent) characteristics of various machine learning systems, by explicitly considering the specific expectations and requirements of these systems from individual users' perspectives. To support METTLE, we have further formulated 11 generic metamorphic relations (MRs), covering users' generally expected characteristics that should be possessed by machine learning systems. To demonstrate the viability and effectiveness of METTLE, we have performed an experiment involving six commonly used clustering systems. Our experiment has shown that, guided by user-defined MR-based adequacy criteria, end users are able to assess, validate, and select appropriate clustering systems in accordance with their own specific needs. Our investigation has also yielded insightful understanding and interpretation of the behavior of the machine learning systems from an end-user software engineering perspective, rather than a designer's or implementor's perspective, which normally adopts a theoretical approach.

Index Terms—Unsupervised machine learning, clustering assessment, clustering validation, metamorphic testing, metamorphic relation, end-user software engineering.

• X. Xie and Z. Zhang are with the School of Computer Science, Wuhan University, Wuhan 430072, China. X. Xie is the corresponding author. E-mail: [email protected].
• T. Y. Chen is with the Department of Computer Science and Software Engineering, Swinburne University of Technology, Hawthorn, VIC 3122, Australia.
• Y. Liu is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798.
• P.-L. Poon is with the School of Engineering and Technology, Central Queensland University Australia, Melbourne, VIC 3000, Australia.
• B. Xu is with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China.

1 INTRODUCTION

Unsupervised machine learning requires no prior knowledge and can be widely used in a large variety of applications, such as market segmentation for targeting customers [1], anomaly or fraud detection in banking [2], grouping genes or proteins in biological processes [3], deriving climate indices from earth science data [4], and document clustering based on content [5]. More recently, unsupervised machine learning has also been used by software testers in predicting software faults [6].

This paper specifically focuses on clustering systems (which refer to software systems that implement clustering algorithms and are intended to be used in different domains). Such a clustering system helps users partition a given unlabeled dataset into groups (or clusters) based on some similarity measures, so that data in the same cluster are more "similar" to each other than to data from different clusters. In artificial intelligence (AI) and data mining, numerous clustering systems [7, 8, 9] have been developed and are available for public use. Thus, selecting the most appropriate clustering system for use is an important concern of end users. (In this paper, end users, or simply users, refer to those people who are "casual" users of clustering systems. Although they have some hands-on experience on using such systems, they often do not possess a solid theoretical foundation on machine learning. These users come from different fields such as biomedicine [17] and nuclear engineering [18]. Also, their main concern is the applicability of a clustering system in the users' specific contexts, rather than the detailed logic of this system.) From a user's perspective, this selection is not trivial [10], not only because end users generally do not have a very solid theoretical background on machine learning, but also because the selection task involves the following two complex issues:

(Issue 1) The correctness of the clustering results is a major concern for users. However, when evaluating a clustering system, there is not necessarily a correct solution or "ground truth" that users can refer to for verifying the clustering result [11]. Furthermore, not only is the correct result difficult or infeasible to find, but the interpretation of correctness also varies from one user to another. This is because, although data points are partitioned into clusters based on some similarity measures, the comprehension of "similarity" may vary among individual users. Given a cluster, one user may consider that the data in it are similar, yet another user may consider otherwise.

(Issue 2) Despite the importance of the correctness of the clustering result, in many cases, users would probably be more concerned with whether a clustering system produces an output that is appropriate or meaningful to their particular application scenarios. This view is supported by the following argument in [12]:

"... the major obstacle is the difficulty in evaluating a clustering algorithm without taking into account the context: why does the user cluster his data in the first place, and what does he want to do with the clustering afterwards? We argue that clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end-use."

Regarding Issue 1, it is well known as the oracle problem in software testing. This problem occurs when a test oracle (or simply an oracle) does not exist. Here, an oracle refers to a mechanism that can verify the correctness of the system output [13]. In view of the oracle problem, users of unsupervised machine learning rely on two types of validation techniques (external and internal) to evaluate clustering systems. Basically, external validation techniques evaluate the output clusters based on some existing benchmarks, while internal validation techniques adopt features inherent to the data alone to validate the clustering result.

Both external and internal validation techniques suffer from some problems which affect their effectiveness and applicability. For external techniques, it is usually difficult to obtain sufficient relevant benchmark data for comparison [14, 15]. In most situations, the benchmarks selected for use are essentially special cases in software verification and validation, thereby providing insufficient test adequacy, coverage, and diversity. This issue does not exist in internal validation techniques. However, since internal techniques mainly rely on the features associated with the dataset, their performance is easily affected by various data characteristics [16]. In addition, both external and internal techniques evaluate clustering systems mainly from the "static" perspective of a dataset, without considering the changeability of input datasets or the interrelationships among different clustering results (i.e., the "dynamic" aspect). We argue that, in reality, users require this dynamic perspective of a clustering system to be evaluated, because datasets may change for various reasons. For instance, before the clustering process starts, a dataset may be pre-processed to filter out noises and outliers to improve the reliability of the clustering result, or the data may be normalized so that different measures use the same scale for fair and reliable comparison.

Our above argument is based on a common phenomenon: users often have some general expectations about the change in the clustering result when the dataset is changed in a particular way; for example, a better clustering result should be obtained after the noises have been filtered out from a dataset. To many users, evaluating this dynamic aspect (called the ripple effect of dataset change or transformation) will give them more confidence in the performance of a clustering system than a code coverage test [13]. Despite its importance, it is unfortunate that both external and internal techniques generally do not consider the dynamic aspect of dataset transformation when testing clustering systems.

We now turn to Issue 2. There has not yet been a generally accepted and systematic methodology that allows end users to effectively assess the quality and appropriateness of a clustering system for their particular applications. In traditional software testing, test adequacy is commonly measured by code coverage criteria to unveil necessary conditions for detecting faults in the code (e.g., incorrect logic). In this regard, clustering systems are harder to assess because the logic of a machine learning model is primarily learnt from massive data. In view of this problem, a practically applicable adequacy criterion is needed to help a user assess and validate the characteristics that a clustering system should possess in a specific application scenario, so that the most appropriate system can be selected for use from this user's perspective. As a reminder, the characteristics that a clustering system is "expected" to possess may vary across different users. Needless to say, there is also no systematic methodology for users to validate the appropriateness of a clustering result in their own contexts.

In view of the above two challenging issues, we propose a METamorphic Testing approach to assessing and validating unsupervised machine LEarning systems (abbreviated as METTLE). To alleviate Issue 1, METTLE applies the framework of metamorphic testing (MT) [13], so that users are still able to validate a clustering system even when the oracle problem occurs. In addition, MT is naturally considered a candidate solution for addressing the ripple effect of data transformation, since MT involves multiple inputs (or datasets) which follow a specific transformation relation. By defining a set of metamorphic relations (MRs) (which capture the relations between multiple inputs (or datasets) and their corresponding outputs (or clustering results)) to be used in MT, the dynamic perspective of a clustering system can be properly assessed. Furthermore, the defined MRs can address Issue 2 by serving as an effective vehicle for users to specify their expected characteristics of a clustering system in their specific application scenarios. The compliance of the clustering results across multiple system executions with these MRs can be treated as a practical adequacy criterion to help a user select the appropriate clustering system for use. More details about the rationale and the procedure of METTLE will be provided in later sections.

The main contributions of this paper are summarized as follows:

• We proposed a metamorphic-testing-based approach (METTLE) to assessing and validating unsupervised machine learning systems, which generally suffer from the absence of a priori knowledge of the data and of a test oracle. Different from traditional validation methods, our approach provides a new and lightweight mechanism to unveil the (possibly latent) characteristics of various learning systems, by explicitly considering the specific expectations and requirements of these systems from the perspective of individual users, who do not possess a solid theoretical foundation of machine learning. In addition, METTLE can validate learning systems by explicitly considering the dynamic aspect of a dataset.

• We developed 11 generic MRs to support METTLE, based on users' generally expected characteristics of clustering systems. We conducted an experiment involving six commonly used clustering systems, which were assessed and compared against these 11 MRs through both quantitative and qualitative analysis.
• We demonstrated a framework to help users assess clustering systems based on their own specific requirements. Guided by an adequacy criterion (with respect to those chosen generic MRs or those MRs specifically defined by users), users are able to select the appropriate unsupervised learning systems to serve their own purposes.
• Our investigation has yielded insightful understanding and interpretation of the behaviors of some commonly used machine learning systems from a user's perspective, rather than a designer's or implementor's perspective (who normally adopts a more theoretical approach).

The rest of this paper is organized as follows. Section 2 outlines the main concepts of clustering systems and MT. Section 3 discusses the challenges in clustering validation and the potential problems associated with dataset transformation in clustering. Section 4 describes our METTLE methodology and a list of 11 generic MRs to support METTLE. Section 5 discusses our experimental setup to determine the effectiveness of METTLE in validating a set of subject clustering systems. Section 6 presents a quantitative analysis of the performance of the subject clustering systems in terms of their compliance with (or violation of) the 11 generic MRs, followed by an in-depth qualitative analysis of the underlying behavior patterns and plausible reasons for the violations revealed by these MRs. Section 7 illustrates how METTLE can be used as a systematic and yet easy-to-use framework (without the requirement of having sophisticated knowledge on machine learning theories) for assessing the appropriateness of clustering systems in accordance with a user's own specific requirements and expectations. Section 8 discusses some internal and external threats to our study. Section 9 briefly discusses the recent related work on MT. Finally, Section 10 concludes the paper and identifies some potentially fruitful areas for further research.

2 BACKGROUND CONCEPTS

In this section, we discuss the background concepts of clustering systems and MT. We also give some examples to illustrate how MT can be used as a software validation approach.

2.1 Clustering Systems

In AI, clustering [19, 20] is the task of partitioning a given unlabeled dataset into clusters based on some similarity measures, where data in the same cluster are more "similar" to each other than to data from different clusters. Thus, cluster analysis involves the discovery of the latent structure or distribution of data in a dataset. The clustering problem can be formally defined as follows:

Definition 1 (Clustering). Assume that a dataset D = {x_1, x_2, ..., x_n} contains n instances, where each instance x_i = (x_i^1, x_i^2, ..., x_i^d) has d-dimensional attributes. A clustering system divides D into k clusters C = {C_1, C_2, ..., C_k}, with L_k being the label of cluster C_k, where \bigcup_{i=1}^{k} C_i = D, C_i \neq \emptyset, and C_i \cap C_j = \emptyset (i \neq j; 1 \leq i, j \leq k).

Fig. 1: Clustering system. (The figure shows a dataset of instances X_1, ..., X_n, each with d attributes, fed into a clustering system, which returns the same instances with a cluster label from L_1, ..., L_k attached to each.)

Fig. 1 describes the input and output of a clustering system. It is well known that validating clustering systems will encounter the oracle problem (i.e., the absence of an oracle). For instance, it is argued in [11] that:

"The problem is that there isn't necessarily a 'correct' or ground truth solution that we can refer to if we want to check our answers ... you will come to the inescapable conclusion that there is no 'true' number of clusters (though some numbers feel better than others) [therefore a definite correct clustering result, or an oracle, does not exist], and that the same dataset is appropriately viewed at various levels of granularity depending on analysis goals."

In view of the oracle problem, users of machine learning generally rely on two types (internal and external) of techniques to validate clustering systems. Both types, however, are not satisfactory because of their own limitations. These limitations have been briefly outlined in Section 1, and will be further elaborated in Section 3.1.

2.2 Metamorphic Testing (MT)

To alleviate the oracle problem, MT [13, 21] has been proposed to verify and validate the "expected" relationships between inputs and outputs across multiple software executions. These "expected" relationships are expressed as metamorphic relations (MRs). If the output results across multiple software executions violate an MR, then a fault is revealed. Below we give an example to illustrate the main concepts of MT and MR.

Consider a program S that calculates the value of the sin(x) function. It is extremely difficult to verify the correctness of the output from S because an oracle is extremely difficult to compute, except when x is a special value (such as \pi/2, where sin(\pi/2) = 1). MT can help alleviate this oracle problem. Consider, for example, the mathematical property sin(x) = sin(\pi - x). Based on this property, we can define an MR in MT: "If y = \pi - x, then sin(x) = sin(y)". With reference to this MR, S is executed twice: first with any angle x as a source test case, and then with the angle y, such that y = \pi - x, as a follow-up test case. In this case, even if the correct and precise value of sin(x) is unknown, if the two execution results (one with input x and the other with input y) are different, so that the above MR is violated, we can conclude that S is faulty. The above example illustrates an important feature of MT — it involves multiple software executions.
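To make the above example concrete, the following minimal Python sketch (an illustration only, not part of the original study) implements the MR "if y = \pi - x, then sin(x) = sin(y)"; the tolerance parameter tol is an assumption needed because the two executions may differ by floating-point rounding:

    import math
    import random

    def violates_sin_mr(x, tol=1e-12):
        # Source execution with input x; follow-up execution with y = pi - x.
        source_output = math.sin(x)
        follow_up_output = math.sin(math.pi - x)
        # The MR sin(x) = sin(y) should hold up to rounding error.
        return abs(source_output - follow_up_output) > tol

    random.seed(0)
    violations = sum(violates_sin_mr(random.uniform(0.0, math.pi))
                     for _ in range(1000))
    print(violations, "violations found")  # expect 0 for a correct sin

Note that no oracle for sin(x) itself is ever consulted: only the relation between the two outputs is checked.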
MT was initially proposed as a verification technique. For example, Murphy et al. [22] applied MT to several machine learning applications (e.g., MartiRank) and successfully revealed several defects. Different types of metamorphic properties were also categorized to provide a foundation for determining the relationships and transformations that can be used for conducting MT in machine learning applications [22]. Another study has successfully demonstrated that MT can be extended to support validation of supervised machine learning software [23]. In their study, Xie et al. [23] presented a series of MRs (which may not be necessary properties of the relevant algorithm) generated from the anticipated behaviors of supervised classifiers. Violations of the MRs may indicate that the relevant classifier is unsuitable for the current application scenario, even if the algorithm is correctly implemented.

Later, Zhou et al. [24] applied MT to validate online search services. They adopted logical consistency relations as a measure of users' perceived quality of search services, and used this measure to validate the performance of four popular search engines such as Google and Bing. In this work [24], the authors compared the four search engines with respect to different scenarios and factors, thereby providing users and developers with a more comprehensive understanding of how to choose a proper search engine for better searching services with clear and definite objectives. Olsen and Raunak [25] applied MT to simulation validation, involving two prevalent simulation approaches: agent-based simulation and discrete-event simulation. Guidelines were also provided for identifying MRs for both simulation approaches. Case studies [25] showed how MT can help increase users' confidence in the correctness of the simulation models.

MT has also been recently applied to validate a deep learning framework for automatically classifying biology cell images, which involves a convolutional neural network and a massive image dataset [26]. This work has demonstrated the effectiveness of MT for ensuring the quality of deep learning (especially when massive training data are involved). Moreover, this MT-based validation approach can be further extended for checking the quality of other deep learning applications. Other recent works [27, 28, 29] have also been conducted to validate autonomous driving systems, where MRs were leveraged to automatically generate test cases that reflect real-world scenes.

3 MOTIVATION

Recall that users of the machine learning community often rely on certain validation techniques (which mainly focus on the "static" aspect of a dataset) to evaluate clustering systems. Moreover, these validation techniques suffer from several problems which affect their effectiveness and applicability (e.g., they are unable to validate the "dynamic" aspect of a dataset, that is, the effect of changing the input datasets on the clustering results). Section 3.1 below discusses in detail the limitations of most existing cluster validation techniques. Section 3.2 then presents some potential problems associated with data transformation that should be addressed when validating clustering systems.

3.1 Challenges in Clustering Validation

In unsupervised machine learning, clustering is a technique to divide a group of data samples into clusters such that data samples within the same cluster are "similar" to each other, while data samples of different clusters show "distinct" features from each other. Because clustering attempts to discover hidden patterns in data with no prior knowledge, it is difficult to evaluate the correctness or quality of the clustering results (see Issues 1 and 2 in Section 1).

Generally speaking, there are two major types of techniques (external and internal) for validating the clustering result. Both of them, however, have their own limitations.

External validation techniques. The basic idea is to compare the clustering result with an external benchmark or measure, which corresponds to a pre-specified data structure. For external validity measures, there are several essential criteria to follow, such as cluster homogeneity and completeness [30]. Consider, for instance, the widely adopted F-measure [31]. It considers two important aspects: recall (how many samples within a category are assigned to the same cluster) and precision (how many samples within a cluster are in one category). It is well known that good and relevant external benchmarks are hard to obtain. This is because, in most situations, the data structure specified by the predefined class labels or by other users is unknown. As a result, without prior knowledge, it is generally very expensive and difficult to obtain an appropriate external benchmark for comparison with the clustering structure generated by a clustering system.

Internal validation techniques. This type of technique validates the clustering result by adopting features inherent to the dataset alone. Many internal validity indices have been proposed based on two aspects: intra-cluster compactness and inter-cluster separation. For example, one of the widely adopted indices — the silhouette coefficient — was proposed based on the concept of distance/similarity [32]. If this coefficient (which ranges from −1 to +1) of a data sample is close to +1, it means that this data sample is well matched to its own cluster and poorly matched to neighboring clusters. When compared with external techniques, on one hand, internal techniques are more practical because they can be applied without an oracle. On the other hand, internal techniques are less robust because they mainly rely on the features associated with the dataset, that is, data compactness and data separation. Hence, the performance of internal techniques can easily be affected by various data characteristics such as noise, density, and skewed distribution [16].

In addition to the specific limitations of external and internal validation techniques mentioned above, both types of techniques validate clustering systems mainly from a "static" perspective, without considering the changeability of input datasets or the interrelationships among different clustering results.
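As a concrete illustration of the two families of measures, the short Python sketch below (an illustration only, using scikit-learn; the synthetic dataset is an assumption, and the direct use of f1_score presumes the predicted cluster ids have already been aligned with the benchmark categories) contrasts an external measure, which needs reference labels, with the internal silhouette coefficient, which needs only the data and the predicted labels:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import f1_score, silhouette_score

    X, y_benchmark = make_blobs(n_samples=150, centers=3, cluster_std=0.5,
                                random_state=0)
    y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    # External validation: needs a benchmark labeling, which is often
    # unavailable in practice (and cluster ids must first be matched).
    print("F-measure :", f1_score(y_benchmark, y_pred, average="macro"))

    # Internal validation: relies only on compactness and separation.
    print("silhouette:", silhouette_score(X, y_pred))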

To address the limitations of external and internal validation techniques with respect to the "dynamic" perspective of clustering, various resampling techniques, based on the notion of cluster stability [33], have been developed to complement the external and internal techniques. A core concept of these resampling techniques (and of cluster stability) is that independent sample sets drawn from the same underlying statistical distribution should produce similar clustering results. Various resampling techniques [34, 35, 36] have been proposed to generate independent sample sets. An example of these resampling techniques is Bootstrap (a representative non-parametric resampling technique) [37], which obtains samples by drawing a certain number of data points randomly with replacement from the original samples, and calculates a sample variance to estimate the population variance. Another example is Jittering [35], which generates copies of the original sample by randomly adding noises to the dataset in order to simulate the influence of measurement errors. As a reminder, although Jittering considers noises and outliers, it does not explicitly investigate the changing trend of clusters.

To some extent, resampling techniques complement the external and internal validation techniques by comparing multiple clustering results. However, it is not difficult to see from Bootstrap [37] and Jittering [35] discussed above that resampling techniques do not provide a comprehensive validation of the dynamic perspective of clustering systems, because they mainly deal with independent sample sets. In reality, datasets may change constantly in various manners, involving interrelated datasets [38]. Thus, estimating cluster stability without considering these interrelated datasets may result in incomprehensive clustering validation.

We argue that, in most cases, users of machine learning are particularly concerned with whether a clustering system produces an output that is appropriate or meaningful to their specific application scenarios. Our argument is supported by AI researchers [11, 12]. For example, it is argued in [12] that "clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end-use." Therefore, given a particular clustering system, one user may consider it useful, while another user may not, because of their different "expectations" or "preferences" on the clustering result. In spite of the need for catering for different users' preferences, existing clustering validation techniques (external, internal, and resampling) generally do not allow users to specify and validate their unique preferences when evaluating clustering systems (see Issue 2 in Section 1). Furthermore, even if we consider a particular user, it is possible that none of the existing available clustering systems fulfils all their preferences on a clustering system. If this happens, users can only choose the particular clustering system that fulfils their preferences the best.

It has been reported that a general, systematic, and objective assessment and validation approach for clustering problems does not exist [12]. Although many cluster validation methods with a range of desired characteristics have been developed, most of them are based on statistical testing and analysis. There are still other desired characteristics that existing cluster validation methods have not addressed. In view of this problem, rather than proposing a cluster validation method which is "generic" enough to evaluate every desired characteristic from all possible users of a clustering system (which is intuitively infeasible), our strategy is to propose a "flexible, systematic, and easy-to-use" evaluation framework (i.e., METTLE) so that users are able to define their own sets of desired characteristics and then use these sets to validate the appropriateness of a clustering system in their specific application scenarios.

3.2 Potential Problems Associated with Dataset Transformation

In reality, datasets may be changed now and then. For example, before clustering commences, we may need to pre-process a dataset to filter out noises and outliers, in order to make the clustering result more reliable. We may also need to normalize the data so that different measures use the same scale for the sake of comparison. In this regard, whether data transformation may result in some unknown and undesirable ripple effect on the clustering result is a definite concern for most users.

Often, users have some general expectations about the impact on the clustering result when the dataset is changed in a particular manner (i.e., the "dynamic" perspective of a dataset). Consider, for example, the filtering of noises and outliers from the dataset before clustering starts. Not only do users expect the absence of the ripple effect, they also expect a better clustering result after the filtering process. Another example is that users generally expect that a clustering system is not susceptible to the input order of data. However, we observe that some clustering systems, such as k-means [39], do not meet this expectation. This is because k-means and some other clustering systems are, in fact, sensitive to the input order of data due to the choice of the original cluster centroid; thus, even a slight offset in distance will have a noticeable effect on the clustering result.

One may argue that k-means is a popular clustering system, so users are likely to be aware of its above characteristic with respect to the input order of data. As a result, users will consider this issue when evaluating whether k-means should be used for clustering in their own contexts. We argue, however, that as more and more new clustering systems are developed, it is practically infeasible for users to be knowledgeable about the potential ripple effect of data transformation for every method (particularly the newly developed ones), so that the most appropriate one could be selected for use.
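The input-order sensitivity discussed above is easy to observe in practice. In the following Python sketch (an illustration only; scikit-learn's KMeans with a single random initialization, n_init=1, is used as a stand-in for such a system, and the seeds are arbitrary assumptions), the same dataset presented in two different orders can produce different partitions:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    X, _ = make_blobs(n_samples=100, centers=3, cluster_std=1.5, random_state=1)
    perm = np.random.default_rng(1).permutation(len(X))

    def run_once(data):
        # One k-means pass; its starting centroids are sampled from `data`,
        # so the input order influences which points seed the clusters.
        return KMeans(n_clusters=3, init="random", n_init=1,
                      random_state=0).fit_predict(data)

    labels_source = run_once(X)
    labels_followup = run_once(X[perm])[np.argsort(perm)]  # back to source order

    # 1.0 means identical partitions; smaller values reveal the sensitivity.
    print(adjusted_rand_score(labels_source, labels_followup))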

4 OUR METHODOLOGY: METTLE

This section introduces our approach for cluster assessment and validation. Section 4.1 outlines some key features and core concepts associated with METTLE from the perspective of end-user software engineering [49, 50]. Section 4.2 gives the relevant definitions used in METTLE. Section 4.3 then presents a list of 11 generic MRs (which are based on some common end users' expectations on a clustering system) developed to support METTLE.

4.1 Key Features and Core Concepts

To alleviate the challenges and potential problems mentioned in Sections 3.1 and 3.2, we propose an MT-based methodology (METTLE) for "users" to assess and validate unsupervised machine learning systems. In this paper, as explained in Section 1, users refer to those "casual" users with some hands-on experience on using clustering systems in their specific application scenarios (e.g., biomedicine, market segmentation, and document clustering), but who do not possess a solid theoretical foundation on machine learning. Thus, these users often have little interest in the internal logic of clustering systems. Rather, they are more concerned with the applicability of these systems in their own usage contexts. Consider, for example, users in bioinformatics who consider using a clustering system to perform predictive analysis. These bioinformaticians often have good domain knowledge about complex biological applications, but they are not experts in machine learning, or do not care much about the detailed theories of machine learning. For these users, there is a vital demand for an effective and yet easy-to-use validation method (without the need for having sophisticated knowledge about machine learning) to help them select an appropriate clustering system for their own use.

Some key features of METTLE are listed below:

(1) It alleviates the oracle problem in validating clustering systems (see Issue 1 in Section 1).

(2) It allows users to comprehensively assess and validate the "dynamic" perspectives of clustering systems related to dataset transformation. In other words, it enables users to test the impact on the clustering result when the input dataset is changed in a particular way for a given clustering system. Thus, METTLE works well with interrelated datasets.

(3) It allows users to assess and validate their expected characteristics (expressed in the form of MRs) of a clustering system. In addition, during assessment and validation, users are able to assign weighted scores to the defined MRs in accordance with their relative importance from the users' perspectives. As such, an MR-based adequacy criterion, by means of a set of user-defined MRs, can be derived to help users select an appropriate clustering system for their own use (see Issue 2 in Section 1).

(4) METTLE is supported with an initial suite of 11 MRs, which are fairly generic and are expected to be applicable across many different application scenarios and contexts. As a reminder, in reality, users may ignore those of these MRs that are irrelevant or inapplicable in a specific application scenario.

Features (1) to (3) of METTLE are made available by allowing users to define a set of MRs, with each MR capturing a relation between multiple inputs (datasets) and outputs (clusters) across different clustering tasks. These user-defined MRs, together with the generic MRs in the initial suite (feature (4) above), are assigned weighted scores to reflect their relative importance from the user's perspective (feature (3) above). Such "ranked" MRs thus allow users to specify the expected characteristics of a clustering system. If a clustering system generates results from multiple executions which violate an MR, it indicates that this system does not fulfill the expected characteristic corresponding to this MR. Thus, the set of user-defined specific MRs and user-chosen generic MRs essentially serves as a test adequacy criterion for users to evaluate candidate clustering systems, with a view to selecting the most appropriate one for use.

4.2 Definitions

MR for cluster validation. Given a clustering system A and a dataset D, let R_s = A(D) denote the clustering result. Assume that a transformation T is applied to D and generates D^T. Let R_f = A(D^T) denote the new clustering result. An MR defines the expectation from users about the changing trend of A's behavior after transforming D by T, that is, the expected relation R_T between R_s and R_f after T.

We call the original dataset D and the result R_s the source input (source sample set) and the source output (source clustering result), respectively; we call the transformed D^T and the result R_f the follow-up input (follow-up sample set) and the follow-up output (follow-up clustering result), respectively; and we call the clustering processes with D and D^T the source execution and the follow-up execution, respectively.

Output Relations. An MR for validation may not be a necessary property of the system under test, especially for machine learning systems. Also, clustering results may vary due to randomness. Thus, we will not simply check whether or not an MR holds across multiple clustering results, as normally done in MT. If the clustering results violate an MR, we will investigate the reason(s) for such a violation. To facilitate this, we will analyze and investigate an output relation across different clustering results in the following aspects:

• Changes on the returned cluster label for each sample object in the source data input D. For each MR, we map each sample object x_i^s \in D to a new object x_i^f \in D^T (with changed or unchanged attribute values). To understand how the clustering result changes after applying the data transformation T, it is necessary to compare the returned label for each object x_i^s \in D with that of its corresponding object x_i^f \in D^T.

• Consistency between the expected label and the actual label for each newly added sample in D^T. Apart from mapping source data objects into their corresponding follow-up data objects, some MRs may also involve creating new objects, such that users may have different expectations for the behaviors of these newly inserted objects. The newly added objects may share the same label with their neighbors, or may be assigned a new label. We will illustrate the different expectations in the corresponding MRs in Section 4.3.

In view of the above two aspects, we propose the notion of reclustering percentage, which measures the inconsistency between a source output and its corresponding follow-up output. This notion is formally defined as follows:

Reclustering percentage. Given a clustering system A, an MR, and a source input dataset D = {x_1^s, x_2^s, ..., x_n^s}, by applying the data transformation T to D with respect to this MR, we obtain the corresponding follow-up input dataset D^T = {x_1^f, x_2^f, ..., x_m^f} (where n = m if no new objects are added, and n < m if new objects are inserted into D^T). Let d_old denote the number of cases where x_i^f has a cluster label different from that of x_i^s (where 1 \leq i \leq n); let d_new denote the number of cases where a newly added object x_j^f has a different cluster label than expected (where n < j \leq m); let |D| denote the size of the source input dataset; and let |D^T| denote the size of the follow-up input dataset. Reclustering percentage (RP) is defined as:

RP = \frac{d_{old} + d_{new}}{|D^T|} \qquad (1)

Obviously, RP = 0 if no violation of the MR is observed between this pair of source and follow-up executions.

It should be reminded that, in the above definition:

• We do not adopt some general similarity coefficients, such as Jaccard (which calculates the intersection over union), because the RP measure defined above serves our purpose more precisely.

• The above definition may extend beyond the necessary properties of a clustering system, because our purpose is to validate the characteristics of a clustering system instead of detecting source code faults in its corresponding implementation. In particular, if the clustering results generated from two related datasets do not follow the relation specified in an MR definition, a violation is said to be revealed, and the characteristics of the corresponding system should be evaluated in detail to identify how and why they affect the clustering results.

Also, it is not difficult to see from the above that, by configuring the transformation T with various operations, the various behaviors of a clustering system can be validated.
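A direct implementation of RP is straightforward. The Python sketch below (an illustration only; it assumes that source and follow-up objects can be matched by position, that cluster labels are named consistently across the two executions, and that the MR supplies the expected labels of newly added objects) computes Eq. (1):

    def reclustering_percentage(source_labels, follow_up_labels,
                                expected_new_labels=()):
        """RP = (d_old + d_new) / |D^T|.

        source_labels[i] and follow_up_labels[i] refer to the same object;
        follow_up_labels[len(source_labels):] are the newly added objects,
        whose MR-expected labels are given in expected_new_labels.
        """
        n = len(source_labels)
        d_old = sum(1 for s, f in zip(source_labels, follow_up_labels[:n])
                    if s != f)
        d_new = sum(1 for e, f in zip(expected_new_labels,
                                      follow_up_labels[n:])
                    if e != f)
        return (d_old + d_new) / len(follow_up_labels)

    # One source object changed cluster, and one of the two newly added
    # objects missed its expected cluster: RP = (1 + 1) / 5 = 0.4.
    print(reclustering_percentage([0, 0, 1], [0, 1, 1, 2, 2], [2, 0]))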
4.3 Generic MRs

METTLE aims to provide an effective vehicle for end users, without the need for a theoretical background of clustering, to assess their expected characteristics of a clustering system and to validate the appropriateness of a clustering result in their own context.

To support METTLE, we developed an initial suite of 11 generic MRs. Each of these generic MRs is defined based on users' general expectations on the clustering result when a dataset changes in a particular way. These expectations are not gained from the theoretical background of any particular machine learning system, but from intrinsic and intuitive requirements of a clustering task. In other words, METTLE is primarily developed for users, without the need for a solid theoretical foundation of machine learning.

These 11 generic MRs fall into six different aspects of properties of a clustering system, and are expected to be applicable across various users' perspectives. Note that these generic MRs do not cover every possible property of a clustering system that is expected by all users, because different users may have different sets of expectations of a clustering system. This problem, however, is not an issue in METTLE, because users can, at their own will, simply adopt any of these 11 generic MRs, and also define additional, more specific MRs for their specific scenarios of applications.

In contrast to a purely theoretical analysis on the properties of a clustering system, METTLE takes a lightweight and more practical approach to its application. METTLE helps users determine the relative "usefulness" among a set of clustering systems in different specific scenarios. This, in turn, facilitates the comparison and selection of the most appropriate clustering system from the user's perspective. Below we discuss the 11 generic MRs we developed:

(1) Manipulating the sample object order in the dataset. Reordering sample objects is a frequently performed operation, and users often assume that this operation is trivial and, hence, does not affect the clustering result. However, this assumption does not hold for some clustering systems, such as k-means [39], as discussed in Section 3.2. To validate whether or not this assumption holds for a clustering system, MR1.1 and MR1.2 are defined as follows:

MR1.1 — Changing the object order. If we permute the order of the sample objects in the dataset, the new clustering result (R_f) will remain the same as the original result (R_s).

MR1.2 — Changing the object order but keeping the same set of starting centroids. If we permute the order of the sample objects in the dataset but keep the same set of starting centroids, we will have R_f = R_s.

In MR1.2, starting centroids are those objects that are randomly selected by a clustering system when it starts execution. Thus, by fixing the starting centroids, we can alleviate the randomness problem (i.e., the same dataset giving rise to different clustering results) associated with system execution. Consider, for example, k-means. It randomly selects k objects from D as the initial cluster centroids, and then assigns each object to the cluster with the closest centroid. Clusters are then formed by recomputing cluster centroids and reassigning data objects. With respect to this property of the system, if we fix k initial objects when it starts execution, followed by shuffling the other objects in D, it is generally expected that R_f = R_s, leading to MR1.2 above.

It should be noted that MR1.1 differs from MR1.2 in that the former may or may not involve changing the starting centroids, while the latter keeps the starting centroids unchanged.

(2) Manipulating the distinctness among clusters in the dataset. Users often expect that the distinctness among clusters will affect the clustering result. First, we consider the impact on the clustering result of shrinking some or all of the clusters towards their centroids in the dataset (see Fig. 2(a)). MR2.1 is defined accordingly as follows:

MR2.1 — Shrinking one or more clusters towards their centroids. If some or all of the clusters in the dataset are shrunk towards their centroids, we will have R_f = R_s.

The rationale behind MR2.1 is obvious and needs no explanation. With respect to MR2.1, for each cluster C_k in R_s to be shrunk, we first identify its centroid m_k returned by the clustering system. Then, for each x_i in C_k, we compute the middle point (denoted as x_i^k) between m_k and x_i. D^T is constructed by replacing every x_i in D with x_i^k.

Another aspect related to changing the distinctness among clusters is data mirroring, which is related to the following MR:

MR2.2 — Data mirroring. Given an initial dataset D such that its corresponding R_s contains k clusters in the same quadrant of the space, if we mirror all these clusters in R_s to N other quadrants of the space so that the clusters have approximately equal distance to each other, then (N + 1) * k clusters will be formed in R_f. Furthermore, the newly formed clusters in R_f will include the original k clusters in R_s.

To illustrate MR2.2, let us consider the two-dimensional space in Fig. 2(b). Suppose, after the first execution of a clustering system A, R_s contains two clusters L_1 and L_2. We then segment the space into four quadrants, where L_1 and L_2 are in the same quadrant. With the mirroring operation M_1, we mirror L_1 and L_2 (and the sample objects contained in them) in D to an adjacent quadrant to create the new "mirrored" clusters L_3 and L_4. A new dataset D^T is created, containing the original clusters (L_1 and L_2 before mirroring) and the newly formed "mirrored" clusters (L_3 and L_4 after mirroring). We then perform two more mirroring operations (M_2 and M_3 in Fig. 2(b)), similar to M_1, to create additional "mirrored" clusters. Finally, we perform another execution of A, and compare the clusters in R_s and R_f to see whether or not MR2.2 is violated.

Fig. 2: Illustration on MR2.1 and MR2.2. (Panel (a) shows MR2.1: the objects in D are moved halfway towards their cluster centroids to form D^T. Panel (b) shows MR2.2: the original clusters L_1 and L_2 are mirrored into the other quadrants by operations M_1, M_2, and M_3.)

(3) Manipulating the object density of one or more clusters in the dataset. Suppose additional sample objects are added into some clusters in the dataset D to increase the object densities of these clusters (see Fig. 3). With respect to this action, users will expect that every new object added to a cluster L (before executing the clustering system) will indeed be assigned to L by the clustering system after its execution. In reality, however, not every clustering system meets this expectation. To validate the behavior of a clustering system with respect to the change in the object densities of clusters, we define the following MR:

Fig. 3: Illustration on MR3.1. (New sample objects are added around the cluster centers of D to form D^T.)

MR3.1 — Adding sample objects around cluster centers. If we add new sample objects to a cluster L in R_s so that they are closer to the centroid of L than some existing objects in L, followed by executing the clustering system again, then: (a) all the clusters appearing in R_s will also appear in R_f, and (b) these newly added sample objects will also appear in R_f with L as their cluster.

MR3.1 can be validated in a similar way to MR2.1, but with some changes. First, similar to validating MR2.1, we create a new object x_i^k for an existing object x_i in a given cluster C_k of R_s, such that x_i^k is the middle point between x_i and the centroid m_k. However, for validating MR3.1, we do not create x_i^k for each x_i; rather, we randomly select x_i. Secondly, the newly created x_i^k is added as a new element, instead of replacing the original x_i as for validating MR2.1.

MR3.1 can be slightly revised to create another metamorphic relation (MR3.2); the latter involves adding sample objects near the boundary of a cluster.

MR3.2 — Adding sample objects near a cluster's boundary. If we randomly add new sample objects on the edge of the convex hull of the objects whose cluster is L, followed by executing the clustering system again, then: (a) all the clusters appearing in R_s will also appear in R_f, and (b) these newly added objects will also appear in R_f with L as their cluster.

1. In mathematics, the convex hull of a set X of points in the Euclidean plane is the smallest convex set that contains X.
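The dataset transformations behind MR2.1 and MR3.1 reduce to simple vector arithmetic. A minimal Python sketch is given below (an illustration only; it recomputes each centroid as the mean of the cluster's members, and the number of inserted objects per cluster is an arbitrary assumption):

    import numpy as np

    def shrink_clusters(X, labels):
        # MR2.1: replace every object by the middle point between it and
        # the centroid of its own source cluster.
        X_t = X.copy()
        for k in np.unique(labels):
            members = labels == k
            centroid = X[members].mean(axis=0)
            X_t[members] = (X[members] + centroid) / 2.0
        return X_t

    def add_points_near_centers(X, labels, per_cluster=3, seed=0):
        # MR3.1: insert the mid-points between randomly chosen members and
        # their cluster centroid as new objects, which are expected to join
        # that cluster (assumes each cluster has >= per_cluster members).
        rng = np.random.default_rng(seed)
        new_points, expected = [], []
        for k in np.unique(labels):
            members = np.flatnonzero(labels == k)
            centroid = X[members].mean(axis=0)
            for i in rng.choice(members, size=per_cluster, replace=False):
                new_points.append((X[i] + centroid) / 2.0)
                expected.append(k)
        return np.vstack([X, new_points]), expected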

(4) Manipulating attributes. Attributes in a dataset may occasionally be changed. We consider two possible types of transformation on attributes. First, new attributes may be added to a dataset if they are considered representative for distinguishing sample objects. In view of this addition, MR4.1 is defined as follows:

MR4.1 — Adding informative attributes. We define an informative attribute as one whose value for each object x_i = (x_i^1, x_i^2, ..., x_i^d) is the corresponding returned cluster name l_i in R_s (l_i could be any C_k \in R_s). D^T is constructed by adding this new informative attribute to D; that is, each object x_i^T in D^T becomes x_i^T = (x_i^1, x_i^2, ..., x_i^d, l_i). Then, we will have R_f = R_s.

Next, we consider the second type of data transformation. An attribute is generally considered redundant if it can be derived from another attribute [30]. Redundancy is a critical issue in data integration, and its occurrence can be detected by correlation analysis. Han et al. [30] argue that a high correlation generally indicates that an attribute can be considered redundant and hence be removed. To define an MR related to redundant attributes, we adopt the widely used Pearson product-moment coefficient to measure the degree of correlation between attributes, and construct D^T by removing redundant attributes (if any). Intuitively speaking, we expect that removing redundant attributes will not affect the clustering result. This expectation leads to the following MR:

MR4.2 — Removing redundant attributes. If we remove one or more redundant attributes from the dataset D and then execute the clustering system again, we will have R_f = R_s.

(5) Manipulating the coordinate system. Several ways exist for manipulating the coordinate system, such as rotation, scaling, and translation. These ways of changing the coordinate system shall not affect the spatial distribution of the sample objects, thereby leading to the next two MRs:

MR5.1 — Rotating the coordinate system. Suppose the original coordinates are (A, B). We perform a transformation T by rotating the coordinate system anticlockwise by a random degree \theta (where \theta \in [0°, 90°]). After performing T, we get the new coordinates (A^T, B^T). The same set of clusters will appear in both R_s and R_f.

Fig. 4: Illustration on MR5.1. (The axes (A, B) are rotated anticlockwise by the angle \theta into the new axes (A^T, B^T); the data points, such as the point P shown, remain fixed.)

Fig. 4 depicts the transformation T. The formula below can be used to transform the existing coordinates in D into the corresponding new coordinates in D^T:

\begin{bmatrix} A^T & B^T \end{bmatrix} = \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}

A scaling transformation changes the sizes of clusters. Scaling is performed by multiplying the original coordinates of objects by a scaling factor.

MR5.2 — Scaling the coordinate system. Suppose the original coordinates are (A, B), the scaling factors for the two axes are S_a and S_b, respectively, and the new coordinates after scaling are (A^T, B^T) (the mathematical representation of this scaling transformation is shown in the formula below). When S_a = S_b, we will have R_f = R_s.

\begin{bmatrix} A^T & B^T \end{bmatrix} = \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} S_a & 0 \\ 0 & S_b \end{bmatrix}
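Both transformations are plain matrix products, as the following Python sketch shows (an illustration only; it applies the two formulas above to a whole two-dimensional dataset at once):

    import numpy as np

    def rotate(X, theta_deg):
        # MR5.1: rotate the coordinate system anticlockwise by theta.
        t = np.radians(theta_deg)
        R = np.array([[np.cos(t), -np.sin(t)],
                      [np.sin(t),  np.cos(t)]])
        return X @ R  # each row of X is one object (A, B)

    def scale(X, s_a, s_b):
        # MR5.2: scale the two axes by s_a and s_b.
        return X @ np.array([[s_a, 0.0],
                             [0.0, s_b]])

    X = np.array([[1.0, 0.0], [0.0, 2.0]])
    print(rotate(X, 30.0))     # pairwise distances are preserved
    print(scale(X, 2.0, 2.0))  # uniform scaling (S_a == S_b): R_f = R_s expected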

(6) Manipulating outliers. An outlier is a data object that acts quite differently from the rest of the objects, as if it were generated by a different mechanism [30]. It is generally expected that a clustering system will handle outliers by either filtering them out or assigning new cluster labels to them. In our study, we mainly focus on global outliers, which do not follow the same distribution as other sample objects and significantly deviate from the rest of the dataset [30].

MR6 — Inserting outliers. To generate D^T, we add a sample object X_o to the dataset D so that the distance from X_o to any cluster is much larger than the average distance between clusters (in order to make X_o not associated with any predefined clusters in D). After this operation, the following properties must be met: (a) every object (except X_o) has the same cluster label in both D and D^T, and (b) X_o does not occur in R_f, or, if X_o occurs in R_f, then X_o has a new cluster label which is not associated with any of the other objects.
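One way to construct such a follow-up dataset is sketched below (an illustration only; placing X_o beyond the bounding box of D at a multiple of the average inter-centroid distance is an assumption, and any sufficiently remote point would serve):

    import numpy as np

    def insert_global_outlier(X, labels, factor=10.0):
        # MR6: append one object X_o whose distance to every cluster is much
        # larger than the average distance between cluster centroids.
        # (Assumes at least two source clusters.)
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in np.unique(labels)])
        gaps = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :],
                              axis=-1)
        avg_gap = gaps[np.triu_indices(len(centroids), k=1)].mean()
        d = X.shape[1]
        x_o = X.max(axis=0) + factor * avg_gap * np.ones(d) / np.sqrt(d)
        return np.vstack([X, x_o])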
5 EXPERIMENTAL SETUP

This section outlines the setup of our experiment, which follows the guidelines by Wohlin et al. [40] as far as possible. In what follows, we first define the main objective and research questions of our experiment. This is followed by discussing the subject clustering systems used in the experiment. Thereafter, we discuss the detailed experimental design, including environment configuration, experimental procedures, dataset preparation, and parameter setting.

A few properties corresponding to some of the generic MRs discussed in Section 4.3 were individually investigated in some previous studies. For example, it has been reported in [41] that the performance of k-means depends on the initial dataset conditions; more specifically, some initial dataset conditions may cause k-means to produce suboptimal clustering results. As another example, density-based clustering systems have been found to be generally efficient at separating noises and outliers [42]. However, little work has been done to provide a systematic, practical, and lightweight approach for validating a set of clustering systems with reference to various properties (defined from the user's perspective) in a comprehensive and holistic manner.

5.1 Research Objective and Questions

The main objective of our experiment is to demonstrate, by means of quantitative and qualitative analyses, the feasibility and practicality of METTLE for assessing and validating clustering systems with respect to a set of system characteristics as defined by users. In this paper, we do not intend to conduct a comparative experiment with other "traditional" cluster validation techniques. This is because most of these techniques take a statistical perspective, while METTLE focuses on the users' perspective; this difference in perspective renders a comparison meaningless.

In view of the above research objective, the following two research questions have been set:

• RQ1: What is the performance of each subject clustering system with respect to the 11 generic MRs?
• RQ2: What are the underlying behaviors of the subject clustering systems that cause violations to the relevant MRs (if any)?

5.2 Subject Clustering Systems

Our experiment involved six popular clustering systems obtained from the open source software Weka (version 3.6.6) [43]. These six subject systems fall into three categories: prototype-based, hierarchy-based, and density-based.

5.2.1 Prototype-based Systems

Given a dataset D = {x_1, x_2, ..., x_n} that contains n instances, each with d attributes, the main task of prototype-based systems is to find a number of representative data objects (known as prototypes) in the data space. More specifically, an initial partition of the data is built first; then a prototype-based system minimizes a given criterion by iteratively relocating data points among clusters. In this category, we specifically considered the following three methods:

k-means (KM). Let m_i^{(t)} denote the centroid of cluster C_i, where t is the number of iterations. In essence, KM [39] involves the following major steps:

(1) Randomly choose k data points as the initial cluster centroids.

(2) Assign each data point to the nearest centroid, using the following formula (in which \|\cdot\| denotes the L2 norm):

C_i^{(t)} = \{x_p : \|x_p - m_i^{(t)}\| \leq \|x_p - m_j^{(t)}\| \text{ for all } j\}

The above formula follows the notation in Definition 1 in Section 2.1, where C_i^{(t)} denotes the ith cluster in the tth iteration, and C_i denotes the set of points whose label is the current cluster.

(3) Recalculate the centroid of each cluster; the new centroid is given by:

m_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{x_j \in C_i^{(t)}} x_j

where m_i^{(t)} is the centroid of cluster C_i, and m_i^{(t+1)} is the new centroid.

(4) Repeat steps (2) and (3) above until there is no further change in the clusters or the predefined maximum number of iterations is reached.

x-means (XM). This system addresses two weaknesses of KM: (a) its poor calculation ability, and (b) the need for foreknowing the value of k and the local minima [44]. Unlike KM, XM only needs users to specify a range of k values so that XM can arrive at an optimal cluster number. The major steps of XM are as follows:

(1) Run conventional k-means, where k equals the lower bound k_min of the specified range.

(2) Split some centroids into two by calculating the value of the Bayesian Information Criterion (BIC) [45].

(3) Repeat (1) and (2) until k > k_max, where k_max is the upper bound of the specified range.

Expectation-Maximization (EM). This system aims at finding the maximum likelihood of the parameters in a statistical model [46]. EM consists of the following major steps:

(1) Initialize the distribution parameter \theta.

(2) E-step: Calculate the expected value of the unobserved variable z^{(i)} with respect to the current estimate of the parameter \theta, thereby indicating the class to which the data object i belongs:

Q_i(z^{(i)}) = p(z^{(i)} \mid x^{(i)}; \theta)

(3) M-step: Find the parameter that maximizes the log-likelihood function:

\theta = \arg\max_{\theta} \sum_i \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}
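For reference, the KM loop in steps (1) to (4) can be written in a few lines of numpy (an illustration only; the convergence test on centroid movement is an assumption of this sketch, and empty clusters are not handled):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # step (1)
        for _ in range(max_iter):                                 # step (4)
            # Step (2): assign each point to its nearest centroid (L2 norm).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                                   axis=-1)
            labels = dists.argmin(axis=1)
            # Step (3): recompute each centroid as the mean of its cluster.
            new_centroids = np.array([X[labels == j].mean(axis=0)
                                      for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids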

5.2.2 Hierarchy-based Systems

This category of systems aims at building a hierarchy of clusters by merging or splitting data partitions, and the results are usually presented as a dendrogram. Fig. 5 shows the resulting clusters generated by two popular hierarchy-based methods: agglomerative nesting and farthest-first traversal.

Agglomerative nesting (AN). This system adopts a bottom-up approach, where each data object is initially considered as a cluster in its own right, and pairs of clusters are then successively merged where appropriate. The clustering process has the following steps:

(1) Assign each data point to a single cluster.

(2) Evaluate the pairwise distance between clusters by a distance metric (e.g., the Euclidean distance) and a linkage criterion.

(3) Merge the closest two clusters into one cluster according to the calculated distance.

(4) Repeat steps (2) and (3) above until all relevant clusters have been merged into a single cluster that contains all data points. The clustering result of AN is typically visualized as a dendrogram, as shown in Fig. 5(a).

The linkage criterion (denoted as linkType) used in step (2) determines the distance between sets of observations as a function of the pairwise distances between these observations. Some commonly used criteria for linkType are single-linkage, complete-linkage, and average-linkage. Single-linkage could lead to a bad behavior known as "chaining", while complete-linkage, being the opposite extreme of single-linkage, suffers from the problem of "crowding" [47]. Taking average-linkage as an example, the distance between two clusters is calculated as follows:

d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x_p \in C_i} \sum_{x_q \in C_j} d(x_p, x_q)

AN does not require a pre-specified number of clusters (i.e., k). However, the dendrogram should be cut at some point if we want a partition of disjoint clusters. Some criteria can be used to determine the cut point, such as a similarity level, or simply a specific k; the latter is preferred in our approach.

Farthest-first traversal (FF). It consists of the following three main steps [47]:

(1) Randomly pick a point from the n data points as a starting point and label it as 1.

(2) Number the remaining points using FF: for i = 2, 3, ..., n, find the unlabeled point furthest from the set {1, 2, ..., i − 1} and label it as i (using the standard notion of the distance from a point to a set: d(x, S) = min_{y \in S} d(x, y)).

(3) For point i, let \pi(i) = \arg\min_{j<i} d(i, j) denote its parent, that is, the labeled point closest to i. A partition with k clusters is then obtained by taking the first k labeled points as cluster centers and assigning every other point i to the cluster of its parent \pi(i).
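Off-the-shelf implementations of this bottom-up procedure are readily available. For instance (an illustration only, using scikit-learn, whose linkage parameter plays the role of linkType above, with the dendrogram cut at a specific k):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

    # Bottom-up merging with the average-linkage criterion, with the
    # dendrogram cut so that exactly k = 3 disjoint clusters remain.
    an = AgglomerativeClustering(n_clusters=3, linkage="average")
    labels = an.fit_predict(X)
    print(labels[:10])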

Farthest-first traversal (FF). It consists of the following three main steps [47]:

(1) Randomly pick one of the n data points as a starting point and label it as 1.
(2) Number the remaining points using farthest-first traversal: for i = 2, 3, \cdots, n, find the unlabeled point furthest from the set \{1, 2, \cdots, i-1\} and label it as i, using the standard notion of the distance from a point to a set, d(x, S) = \min_{y \in S} d(x, y). For point i, let \pi(i) = \arg\min_{j<i} d(i, j) denote its closest labeled predecessor.
(3) Take the first k labeled points as the cluster centroids; eventually, every data point is assigned to its nearest centroid (see also the account of FF in Section 6.2.4).

Fig. 5: Examples of clustering results generated by hierarchy-based clustering systems: (a) agglomerative nesting (shown as a dendrogram); (b) farthest-first traversal.
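The following sketch illustrates this selection scheme. It is our own illustrative implementation of the steps above; Weka's FarthestFirst differs in detail (e.g., it first applies attribute normalization).

```python
import numpy as np

def farthest_first(X, k, seed=None):
    """Choose k centroids by farthest-first traversal, then assign points."""
    rng = np.random.default_rng(seed)
    # (1) Randomly pick a starting point as the first centroid.
    chosen = [int(rng.integers(len(X)))]
    # d(x, S): distance from each point to the set of chosen centroids.
    d_to_set = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < k:
        # (2) The next centroid is the point furthest from the chosen set.
        nxt = int(d_to_set.argmax())
        chosen.append(nxt)
        d_to_set = np.minimum(d_to_set, np.linalg.norm(X - X[nxt], axis=1))
    # (3) Assign every data point to its nearest centroid.
    centroids = X[chosen]
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :],
                            axis=2).argmin(axis=1)
    return labels, centroids
```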

5.2.3 Density-based Systems

On the other hand, density-based clustering systems (implemented under a data-connectivity criterion) can efficiently identify clusters of arbitrary shape. We found two density-based clustering systems in Weka 3.6.6: DS and OPTICS (Ordering Points To Identify the Clustering Structure). Since OPTICS does not deliver the clustering result explicitly, we only chose DS in our experiment.

Density-based spatial clustering of applications with noise (DS). Given a dataset with n points in a space, DS groups data points in high-density areas. Data points are labeled as one of the following three types:

• Core points: A point m is labeled as core if at least a minimum number of points (minPts) lie within the specific distance eps of m. These points are said to be directly reachable from m. The number of points whose distance from m is smaller than eps is called the density.
• Density-reachable points: A point n is said to be density-reachable from m if there exists a path of points p = \{t_1, t_2, \ldots, t_k\}, where t_1 = m, t_k = n, and, for any t_i in p, t_{i+1} is directly reachable from t_i.
• Noisy points: A point is marked as noise if it is unreachable from any other point.

DS involves the following three main steps:

(1) Randomly select an unvisited point from the dataset.
(2) If the selected point is a core point, then label all its density-reachable points as one cluster; otherwise, exit.
(3) Repeat steps (1) and (2) above until all points have been visited.
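Outside Weka, the same notions of core, density-reachable, and noisy points can be observed with scikit-learn's DBSCAN, as in the sketch below; the parameter values are illustrative and are not the settings used in our experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, n_features=2,
                  cluster_std=0.5, random_state=0)

db = DBSCAN(eps=0.5, min_samples=8).fit(X)   # eps and minPts
labels = db.labels_                          # label -1 marks noisy points
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True      # points with >= minPts neighbors

print("clusters:", len(set(labels) - {-1}),
      "| core points:", int(is_core.sum()),
      "| noisy points:", int((labels == -1).sum()))
```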

5.3 Experimental Design

5.3.1 Environment Configuration

The experimental environment was configured as follows. Hardware environment: Intel(R) Core(TM) i7 CPU with 8 GB memory. Operating system: Windows 10 x64. Software development platform: Java.

5.3.2 Experimental Procedures

Our experiment involved two main steps, as follows:

Step 1: We evaluated the performance of each subject clustering system with respect to the 11 generic MRs discussed in Section 4.3 (RQ1). In particular, for each system, we measured the extent of violations to these generic MRs.

Step 2: For any violation to an MR, we investigated and analyzed the underlying behaviors of the subject clustering systems that caused the violation, together with the plausible reasons (RQ2). Here we carefully examined the clustering results of both the source and follow-up executions, with a view to identifying their corresponding clustering patterns. This helped us (and users) to better understand the relevant anomalous execution behaviors of a clustering system. The investigation results were then used to develop a list of strengths and weaknesses (with respect to the 11 generic MRs) for the six subject clustering systems.

5.3.3 Dataset Preparation and Parameter Setting

For the rest of the paper, we call the dataset used for the first execution of a clustering system the source dataset, and the dataset (changed according to a particular MR) used for the second execution the follow-up dataset.

After selecting the subject clustering systems, we prepared a source dataset with clustered samples using the function make_blobs in Scikit-learn [48]. This function generates isotropic Gaussian blobs for clustering; that is, each cluster is a Gaussian distribution around a center point, ensuring that the whole dataset is well clustered. (A sketch of the resulting source/follow-up procedure is given at the end of this subsection.)

Let cluster_std denote the standard deviation of the clusters, centers denote the number of centers to generate (default = 3), n_features denote the number of features for each sample, and n_samples denote the total number of points equally divided among clusters. We set cluster_std, centers, and n_features to 0.5, 3, and 2, respectively. We also set n_samples to a range of [50, 200], because the larger the dataset was, the more likely violations to MRs were to be revealed.

Note that there were some special cases with specific arrangements. For MR2.2, only two well-separated clusters were generated with cluster_std = 0.5, and they were mapped to the adjacent quadrant; as a result, altogether four distinctive clusters were present in the follow-up dataset. For MR4.2, we generated an extra correlated attribute A' with a particular Pearson correlation coefficient: each source sample is three-dimensional, denoted as x_i^s = (A, B, A') with Pearson(A, A') = 0.8, while its follow-up sample is x_i^f = (A, B); that is, the correlated attribute A' was removed from each source sample to form its corresponding follow-up sample. For MR5.2, S_a was randomly selected from the range [0.2, 5], and S_b was set to the same value as S_a.

Based on each identified MR, follow-up datasets were derived from the corresponding source datasets. We ensured that the object orders in the source and follow-up datasets were properly aligned (except for MR1.1 and MR1.2, since both MRs involve changing the object order). Because our experiment did not focus on the effect of the input parameters of the clustering systems, we fixed the parameters in each batch of experiments: (a) the Euclidean distance was taken as the distance function for systems that require a distance metric; (b) linkType was set to "AVERAGE" (i.e., the average-linkage criterion); (c) eps and minPts were set to 0.1 and 8, respectively, for DS; and (d) the random seed used for random number generation in the Weka implementation was fixed across the multiple executions of each subject system involving the source datasets and their corresponding follow-up datasets, in order to ensure that the clustering results were reproducible and the initial conditions were properly aligned.

As explained in Section 5.2, EM and DS do not need a prespecified cluster number. For the other four subject clustering systems, we set the parameter k as follows:

• KM: Since centers was set to 3 (i.e., three source clusters), k was also set to 3 for all MRs except MR2.2.
• XM: The permissible range was set to [k − 1, k], where k is the actual number of clusters in a dataset; k was set to 3 except for MR2.2.
• AN and FF: k was set to 3 for all MRs except MR2.2 and MR6.
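The sketch below walks through one complete trial for MR1.1 (changing the object order), with scikit-learn's KMeans standing in for a Weka subject system. The label-matching step is a simplified, greedy approximation of the reclustering percentage; the precise definition of RP is the one given in Section 4.2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)

# Source dataset: isotropic Gaussian blobs (Section 5.3.3 settings).
n = int(rng.integers(50, 201))
X_src, _ = make_blobs(n_samples=n, centers=3, n_features=2,
                      cluster_std=0.5, random_state=0)

# Follow-up dataset for MR1.1: change only the order of the objects.
perm = rng.permutation(n)
X_fol = X_src[perm]

src = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X_src)
fol = KMeans(n_clusters=3, n_init=1, random_state=0).fit_predict(X_fol)

# Realign the follow-up labels with the source object order.
fol_aligned = np.empty(n, dtype=int)
fol_aligned[perm] = fol

# Simplified RP: map each source cluster to its most common follow-up
# label, then count the objects whose cluster membership changed.
mapping = {s: int(np.bincount(fol_aligned[src == s]).argmax())
           for s in np.unique(src)}
mapped = np.array([mapping[s] for s in src])
rp = float((mapped != fol_aligned).mean())
print(f"RP = {rp:.2%} ({'violation' if rp > 0 else 'no violation'})")
```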
6 EXPERIMENTAL RESULTS

This section presents our experimental results for the two research questions RQ1 and RQ2. Section 6.1 addresses RQ1 by providing and discussing the relevant quantitative statistics. Section 6.2 addresses RQ2 by providing an in-depth qualitative analysis, framed by a set of clustering patterns that we observed in the experiment. In addition, Section 6.3 summarizes our observations and provides further analysis and interpretation of the results.

6.1 Performance of Subject Clustering Systems (RQ1)

With respect to each of the six subject clustering systems, we conducted 100 trials (with different datasets in each trial) for each of the 11 generic MRs defined in Section 4.3. When validating a clustering system against an MR, an experimental trial was said to cause a "violation" if its corresponding reclustering percentage (RP) was greater than zero (see Section 4.2 for the details); this indicates that at least one sample was reclustered "unexpectedly" in the current trial. Also, a system was said to violate an MR if one or more violations occurred among the 100 experimental trials.

Fig. 6 summarizes the total number of violated MRs for each system. The figure shows that KM had the worst performance in that it violated nine MRs. It was followed by FF and DS, each of which violated seven MRs. XM and EM violated five and four MRs, respectively. AN performed the best because it had the smallest number of violations (= 3). Recall that every generic MR defined in Section 4.3 involves data transformation in a certain way. Thus, in general, Fig. 6 indicates that KM is the most sensitive to data transformation, whereas AN is the least sensitive.

Fig. 6: Total number of violated MRs of each clustering system. [Bar chart over the six clustering methods; y-axis: total number of violated MRs.]

Furthermore, we noted that even if two systems both violated the same MR, the frequency of revealing a violation could be quite different. Therefore, we define the concept of "violation rate" to facilitate a deeper analysis. The violation rate (VR) is defined as the ratio of the number of violated trials to the total of 100 trials. Table 1 shows the values of VR for all systems with respect to each generic MR.

Consider, for example, VR = 26% for KM with respect to MR2.1 in this table. It indicates that, among the 100 experimental trials, 26 of them had RP values greater than zero. Consider another example: VR = 0 for XM with respect to MR2.1, indicating that none of the 100 experimental trials violated MR2.1. As a reminder, if "N/A" is indicated for a particular MR in Table 1, it means that this MR is not applicable to the relevant system(s).

TABLE 2: Mean Values of RP for Subject Clustering Systems with respect to Generic MRs†

| Type | MR  | KM     | XM     | EM    | AN     | FF     | DS     |
|------|-----|--------|--------|-------|--------|--------|--------|
| 1    | 1.1 | 19.40% | 11.81% | 0     | 0      | 7.40%  | 1.11%  |
| 1    | 1.2 | 0      | 0      | N/A   | N/A    | 0      | N/A    |
| 2    | 2.1 | 49.77% | 0      | 0     | N/A    | 0      | N/A    |
| 2    | 2.2 | 23.36% | 0      | 0     | 0      | 8.51%  | 16.31% |
| 3    | 3.1 | 7.62%  | 35.75% | 1.22% | 1.92%  | 7.94%  | 2.74%  |
| 3    | 3.2 | 18.16% | 17.28% | 0.83% | 1.67%  | 7.91%  | 4.48%  |
| 4    | 4.1 | 46.34% | 0      | 0     | 0      | 7.46%  | 0      |
| 4    | 4.2 | 5.74%  | 0      | 0     | 0      | 0      | 0.64%  |
| 5    | 5.1 | 3.84%  | 2.37%  | 3.79% | 16.15% | 15.84% | 8.48%  |
| 5    | 5.2 | 0      | 0      | 0     | 0      | 0      | 0      |
| 6    | 6   | 11.53% | 11.83% | 1.99% | 0      | 7.93%  | 1.97%  |

† Each figure denotes the mean value of RP over the violated trials with respect to the relevant MR. KM, XM, and EM are prototype-based; AN and FF are hierarchy-based; DS is density-based.


TABLE 1: VR Values for Subject Clustering Systems with respect to Generic MRs†

| Type | MR  | KM  | XM  | EM  | AN  | FF  | DS   |
|------|-----|-----|-----|-----|-----|-----|------|
| 1    | 1.1 | 5%  | 8%  | 0   | 0   | 90% | 8%   |
| 1    | 1.2 | 0   | 0   | N/A | N/A | 0   | N/A  |
| 2    | 2.1 | 26% | 0   | 0   | N/A | 0   | N/A  |
| 2    | 2.2 | 35% | 0   | 0   | 0   | 62% | 100% |
| 3    | 3.1 | 7%  | 9%  | 5%  | 14% | 95% | 28%  |
| 3    | 3.2 | 6%  | 10% | 11% | 15% | 85% | 21%  |
| 4    | 4.1 | 36% | 0   | 0   | 0   | 8%  | 0    |
| 4    | 4.2 | 17% | 0   | 0   | 0   | 0   | 7%   |
| 5    | 5.1 | 57% | 54% | 9%  | 47% | 91% | 61%  |
| 5    | 5.2 | 0   | 0   | 0   | 0   | 0   | 0    |
| 6    | 6   | 11% | 12% | 39% | 0   | 92% | 10%  |

† KM, XM, and EM are prototype-based; AN and FF are hierarchy-based; DS is density-based.

For instance, MR2.1 requires cluster centroids to be returned by a system. Since AN and DS do not return any cluster centroid, their corresponding VR values are labeled as "N/A".

Zero violation. Several systems had zero VR values for some MRs in Table 1. These zero-violation cases not only indicate high adaptability and robustness of the corresponding systems with respect to particular types of data transformation, but also imply that the relevant MRs may be necessary properties of these systems and, hence, can be used for verification [23]. Consider, for example, the zero VR value of AN with respect to MR1.1. We can indeed prove that MR1.1 is a necessary property of AN: in this system, each data point is first considered a single cluster; AN then calculates the distances between clusters and incrementally merges the two closest clusters. Obviously, the distance calculation and the way clusters are merged are unrelated to the order of the data in the dataset. In addition, Table 1 shows that no violation to MR5.2 occurred across all six subject systems. Thus, it can be argued that MR5.2 can be considered a necessary property of the six systems. Since this paper mainly focuses on validation rather than verification, the formal proofs and analyses for zero-violation cases are excluded; in what follows, we focus on the non-zero violation cases.

Non-zero violation. Table 1 shows that the non-zero VR values spread across a wide range, from 5% to 100%. Intuitively speaking, with respect to an MR: (a) a high VR value indicates that a system is very sensitive to the type of data transformation corresponding to this MR, and the clustering result is likely to vary unexpectedly; and (b) a low VR value indicates that a system is relatively robust to the corresponding data transformation, and violations to this MR occur only sporadically among the experimental trials. Consider, for example, the VR values of KM (5%) and FF (90%) with respect to MR1.1. The result indicates that, out of 100 trials, KM violated MR1.1 in only five, while FF violated it in as many as 90. Thus, FF is far more sensitive than KM to the type of data transformation corresponding to MR1.1 (i.e., changing the object order).

By examining how a system reclusters transformed data samples in each violated case, we observed that different cases had different levels of inconsistency as measured by RP; in other words, the non-zero RP values exhibited a diverse range. As an example, among the five violations to MR1.1 for KM (VR = 5% in Table 1), we observed five diverse RP values (in ascending order): 0.55%, 0.67%, 0.93%, 46.67%, and 48.19% (mean = 19.40%). Table 2 shows the mean values of RP for the non-zero violation cases for each system with respect to each generic MR. Due to page limitations, the table shows the mean values of RP rather than their individual values.

Note that Tables 1 and 2 show the results from different perspectives: Table 1 counts the numbers of violated cases, while Table 2 focuses on the mean levels of inconsistency among those violated cases. Also note that a high VR value does not necessarily imply a high RP value. Take FF under MR3.1 as an example. Reclustering occurred in 95 of the 100 trials (VR = 95%); however, the mean percentage of reclustering was less than 8% (mean RP = 7.94%). Thus, although MR3.1 was often violated by FF, the extent of reclustering in these violations was quite marginal on average. In contrast, although XM violated MR3.1 only nine times (VR = 9%), it had a mean RP value of 35.75%.
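As a small, self-contained illustration of these two measures, the sketch below computes VR and the mean RP from a vector of 100 per-trial RP values; the five non-zero values are the ones reported above for KM under MR1.1.

```python
import numpy as np

# Per-trial RP values for one system under one MR (100 trials).
rp = np.zeros(100)
rp[:5] = [0.0055, 0.0067, 0.0093, 0.4667, 0.4819]

violated = rp > 0
vr = violated.mean()             # violation rate over the 100 trials
mean_rp = rp[violated].mean()    # mean RP over the violated trials only

print(f"VR = {vr:.0%}, mean RP = {mean_rp:.2%}")  # VR = 5%, mean RP = 19.40%
```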
We now turn to Fig. 7, which combines the results of Tables 1 and 2 in one figure. In this figure, each horizontal bar corresponds to a violation to a particular MR by a system. In each sub-figure, the largest RP value shown on the y-axis is 70%, because this was the largest RP value we observed across all the 11 MRs and all six subject clustering systems in our experiment.

Fig. 7: RP distributions across the 11 generic MRs for each system: (a) KM; (b) XM; (c) EM; (d) AN; (e) FF; (f) DS.

In Fig. 7, we can easily observe the "density" of the occurrences of reclustering over a certain range. Consider, for example, the set of horizontal bars related to MR2.2 and DS in Fig. 7(f). By looking at the distribution pattern of the horizontal bars, we know that the RP values had a larger deviation in the higher-value ranges (closer to 70%) than in the lower-value ranges (closer to zero). Below we summarize the above findings:

• The 11 generic MRs have different capabilities to help a user detect "unexpected" behaviors in clustering systems (from the user's perspective). More specifically:
  – MR5.1 (related to the rotation of the coordinate system) is the most effective MR in identifying the corresponding "unexpected" behaviors across all six subject systems.
  – Some generic MRs, particularly MR5.2, could be necessary properties of clustering systems and, as such, no violation of them has been observed.
• The robustness in handling each type of data transformation (as represented by the relevant generic MR) varied across the clustering systems, in terms of the VR and RP measures. More specifically:
  – KM and FF had the worst performance across the 11 generic MRs.
  – On the other hand, EM and AN stayed relatively robust, yielding more desirable results.

Fig. 8: Pattern type BORDER for KM. [(a) Source dataset; (b) Follow-up dataset.]

Fig. 9: Pattern type MERGE & SPLIT for KM. [(a) Source dataset; (b) Follow-up dataset.]

6.2 Underlying Behaviors that Cause Violations (RQ2)

This section complements Section 6.1 by drilling down into the underlying behaviors and plausible reasons for the violations to each generic MR.

For each violation, we carefully inspected the results of both the source and follow-up executions by visualizing their clustering patterns. Such patterns are fairly evident and immediate to users; they allow users to easily and intuitively comprehend the anomalous behaviors of the subject clustering systems. Five types of clustering patterns were identified, as shown in Table 3. (Note that Table 3 only shows the pattern types that we observed in our experiment, rather than all possible pattern types.) In this table, each pattern type may be associated with more than one "similar" pattern with non-identical data distributions and numbers of clusters. In what follows, we illustrate the observed clustering pattern types and the underlying causes of their occurrence.

6.2.1 Violations Related to KM and XM

Figs. 7(a) and 7(b) show the distributions of the RP values for KM and XM, respectively, with respect to all the 11 generic MRs. For both systems, the RP values generally varied across a wide range (between 0% and 70%). For all the violations related to both systems, we took a close examination of the clustering results and revealed two types of clustering patterns; in other words, these two pattern types occurred for both KM and XM. For the rest of Section 6.2.1, to avoid lengthy discussion, we mainly discuss the results related to KM, followed by a short discussion of the results related to XM.

BORDER. For those violations related to KM with relatively low RP values (e.g., RP < 10%), some data points near the boundaries of clusters were reassigned to different clusters in the follow-up dataset, as shown in Fig. 8. For simple illustration, this figure shows only one data point (enclosed in a small box) reassigned from one cluster (near its boundary) to an adjacent cluster: the marked point in the box in Fig. 8(a) has changed its cluster membership in Fig. 8(b). In our experiment, however, more than one data point was observed to be reassigned to different clusters.

This pattern type was observed in the violations to MR1.1, MR3.1, MR3.2, MR5.1, and MR6. Some statistics on the violations related to BORDER and their RP values are as follows: (MR1.1) 60% violations (3 out of 5, where RP < 1%); (MR3.1) 86% violations (6 out of 7, where RP < 1%); (MR3.2) 67% violations (4 out of 6, where RP < 1%); (MR5.1) 96% violations (55 out of 57, where RP < 10%); (MR6) 73% violations (8 out of 11, where RP < 5%).

Some users may conceive that KM is sensitive to the initialization condition (i.e., the selection of the starting centroids), such that even a slight change to the starting centroids caused by data transformation (such as reordering or adding data samples) could lead to fairly different clustering results. Below we use MR1.1 (changing the object order) and MR5.1 (rotating the coordinate system) as examples to explain how data transformation affects the clustering results generated by KM.

TABLE 3: Different Types of Clustering Patterns and Their Related Clustering Systems

| Pattern type  | Description                                                                                                          | Related clustering systems |
|---------------|----------------------------------------------------------------------------------------------------------------------|----------------------------|
| BORDER        | Data objects near the boundary of one cluster in the source dataset are reassigned to different clusters in the follow-up dataset. | All                        |
| MERGE & SPLIT | Two source clusters are merged into one follow-up cluster, and another source cluster is split into two smaller follow-up clusters. | KM, FF                     |
| SPLIT         | One or more source clusters are split into smaller follow-up clusters.                                                | EM, AN                     |
| NOISE         | Reclustering mainly occurs for those objects that are considered "noise".                                             | DS                         |
| NUM           | The numbers of clusters differ between the results of the source and follow-up executions.                            | DS                         |

Consider MR1.1 first. Reordering data samples has no effect on the data distribution, but it is likely to change the randomly initialized (starting) cluster centroids. We argue that, with the gradual relocation of cluster centroids following each iteration of reclustering, KM may finally generate a different set of data clusters. Our argument was validated by MR1.2: no violation occurred when we changed the object order but kept the same set of starting centroids. With respect to MR1.1, we carefully checked the clustering process and confirmed that the starting centroids in the violated trials were indeed changed after changing the object order. At the same time, however, we also observed that many non-violated trials involved changes to their starting centroids. Therefore, the results suggest that KM may not be as sensitive to the starting centroids as some users initially conceive.

Next, we turn to MR5.1. Many users of KM generally expect that rotating the coordinate system will not affect the clustering result, because such a rotation does not change the data distribution pattern. However, this was not the case in our experiment; we found some "unexpected" violations to MR5.1. By inspecting the source code of KM collected from Weka, we found a function distance for calculating the Euclidean distance between an arbitrary object x_i and each cluster centroid. Before executing the core part of the distance computation, KM normalizes each attribute value with min-max normalization via the function norm. As such, the centroid m_k nearest to x_i will be chosen, and x_i will be assigned the label k. By checking the output after each iteration of KM, we found that the normalized Euclidean distance between x_i and m_k differed between the source and follow-up executions, although the theoretical distance remains unchanged after rotating the coordinates. Hence, a small change in the distance could result in a different decision by KM when choosing the nearest centroid. Furthermore, the impact of min-max normalization is carried forward into subsequent iterations, which explains the major reason for violating MR5.1.
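The following sketch demonstrates the root cause described above: true Euclidean distances are invariant under a rotation of the coordinate system, whereas distances computed after min-max normalization are not. The minmax function below is our own transcription of the normalization idea, not Weka's norm function itself.

```python
import numpy as np

def minmax(X):
    # Min-max normalization applied per attribute (column).
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))

theta = np.pi / 6                      # rotate the coordinate system by 30 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Xr = X @ R.T

# True pairwise distances are rotation-invariant ...
d  = np.linalg.norm(X[0] - X[1])
dr = np.linalg.norm(Xr[0] - Xr[1])
# ... but distances after min-max normalization are not.
n  = np.linalg.norm(minmax(X)[0] - minmax(X)[1])
nr = np.linalg.norm(minmax(Xr)[0] - minmax(Xr)[1])
print(f"raw: {d:.4f} vs {dr:.4f} | normalized: {n:.4f} vs {nr:.4f}")
```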
MERGE & SPLIT. Most KM-related violations with RP values larger than 10% were associated with this pattern type (see Fig. 9 for an example). For the MERGE & SPLIT pattern type, two source clusters are merged into one follow-up cluster, and one other source cluster is split into two smaller follow-up clusters.

This pattern type was associated with violations related to all the 11 generic MRs except MR1.2 and MR5.2. Some statistics on the violations related to MERGE & SPLIT and their RP values are as follows: (MR1.1) 40% violations (2 out of 5, where RP > 40%); (MR2.1) 100% violations (26 out of 26, where RP > 38%); (MR2.2) 100% violations (35 out of 35, where RP > 10%); (MR3.1) 14% violations (1 out of 7, where RP > 40%); (MR3.2) 33% violations (2 out of 6, where RP > 40%); (MR4.1) 100% violations (36 out of 36, where RP > 30%); (MR4.2) 6% violations (1 out of 17, where RP > 40%); (MR5.1) 4% violations (2 out of 57, where RP > 40%); (MR6) 27% violations (3 out of 11, where RP > 30%).

It is commonly known that KM may quickly converge to a local optimum, resulting in unsatisfactory results. We conjecture that the MERGE & SPLIT pattern type occurred for this reason. To test this conjecture, we compared the iteration numbers between the source and follow-up executions. Our rationale is based on the intuition that a low iteration number (i.e., early termination) is normally associated with a high convergence speed, and a high convergence speed is often a signal of prematurity, resulting in a local optimum.

Here, we use MR2.1 as an example for illustration: if a set of data samples can be well clustered, then shrinking each cluster towards its centroid should make the clusters more compact, thereby producing an even more clear-cut clustering result. Let I_s and I_f denote the iteration numbers in the source and follow-up clustering processes, respectively, and let SFR = I_s / I_f. Obviously, SFR > 1 indicates fewer iterations and a higher convergence speed in the follow-up clustering process, while SFR < 1 indicates the opposite situation. Fig. 10(a) illustrates the distribution of the SFR values related to MR2.1 for 100 trials in a histogram. From this figure, we observed that, among the 100 trials, 10% had SFR values less than 1.0, and 19% had SFR values equal to 1.0. Among the remaining 71% of the trials with SFR > 1, 79% had 1 < SFR ≤ 2, 18% had 2 < SFR ≤ 3, and 3% had SFR > 3.
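SFR can also be measured outside Weka. In the sketch below, scikit-learn's KMeans exposes its iteration count via n_iter_, and the follow-up dataset is built by shrinking each source cluster towards its centroid in the spirit of MR2.1 (the shrinking factor of 0.5 is an arbitrary illustrative choice).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_src, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5,
                      random_state=0)
km_src = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X_src)

# MR2.1-style follow-up: shrink each source cluster towards its centroid.
c = km_src.cluster_centers_[km_src.labels_]
X_fol = c + 0.5 * (X_src - c)
km_fol = KMeans(n_clusters=3, n_init=1, random_state=0).fit(X_fol)

sfr = km_src.n_iter_ / km_fol.n_iter_       # SFR = Is / If
print(f"Is = {km_src.n_iter_}, If = {km_fol.n_iter_}, SFR = {sfr:.2f}")
```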


Fig. 10: Distributions of SFR and RP values related to MR2.1 (100 trials): (a) histogram of the SFR values; (b) distributions of the RP and SFR values (x-axis: trial ID).

The main upper portion of Fig. 10(b) shows the distribution of the RP values related to MR2.1 over the 100 trials. Each dot at position (i, j) in the figure indicates that the ith trial had RP = j. Note that the size and darkness of the round dots are proportional to their SFR values: the larger and darker a dot is, the higher its corresponding SFR value (and, hence, the higher the convergence speed in the follow-up clustering process). The horizontal bar at the bottom of Fig. 10(b) indicates the value ranges of SFR. According to the definition of RP, RP = 0% indicates no violation to the relevant MR, while RP > 0% indicates the existence of a violation. In all the violated cases related to MR2.1, data were clustered in patterns similar to Fig. 9, resulting in fairly high RP values. It can be seen from Fig. 10(b) that almost all trials with RP = 0% had SFR values close to 1 (see the small and light dots on the horizontal line just above the x-axis, corresponding to RP = 0%), indicating that the source and follow-up processes had similar convergence speeds. On the other hand, the trials with very high RP values were mostly associated with high SFR values (see the large and dark dots), indicating that their follow-up processes were faster than the source processes. In particular, the large and dark dot for trial ID 60 corresponds to a violated trial with RP > 60%, whose follow-up process was about four times faster than its corresponding source process. Fig. 10(b) also shows a positive correlation between the RP and SFR values with respect to MR2.1. All the above analyses demonstrate that, with respect to KM, violations with high reclustering percentages were very likely due to an accelerated convergence to local optima. For the other MRs with the MERGE & SPLIT violation pattern in KM, we observed similar phenomena.

We now turn to XM. In terms of the violations to MR1.1, MR3.1, MR3.2, MR5.1, and MR6, XM was not better than KM. For these five MRs, the clustering pattern types observed for KM also occurred for XM. Thus, we do not repeat the discussion on the violations related to XM. However, we would like to point out that, when compared with KM, XM was relatively more robust to the types of data transformation related to MR2.1, MR2.2, MR4.1, and MR4.2. A close examination of the violations related to these four MRs revealed a common property: the resulting clusters were relatively more clear-cut (for MR2.1) or more separated from each other (for MR4.1).

As an extension to KM, XM provides a partial remedy for the local-optimum problem [44]. Many people argue that XM is less sensitive to local optima because it searches for the true number of clusters in a predefined range. This argument was validated by METTLE: XM outperformed KM in the situations where data groups were largely separated. On the other hand, in those situations where data groups were well clustered but with a lower degree of separation, XM and KM generated similar clustering results.

Summary: The sensitivity of KM and XM to initial conditions and noisy data was validated by our experiment. Data transformations such as reordering the data and adding noise result in reassigning data objects near the boundary of one cluster to another cluster, which is normally expected by users. Our experiment also revealed an important property of KM: this system tends to converge to local optima even when the clusters are sufficiently well separated, which leads to high reclustering percentages. Although XM is theoretically less sensitive to local optima than KM, our experimental results show that XM only outperforms KM when the original dataset is highly separated.
6.2.2 Violations Related to EM

Fig. 7(c) shows that, for EM, violations only occurred in those cases related to MR3.1, MR3.2, MR5.1, and MR6. Among the 100 trials, the numbers of violated cases were 5, 11, 9, and 39, respectively, for these four MRs. Each of these violations had a low RP value, indicating that very few data samples were reassigned from one cluster to another. Based on these results, we argue that EM is fairly robust to different types of data transformation. We also found two clustering pattern types: BORDER and SPLIT.

BORDER. We observed from Fig. 7(c) that most of the violations related to EM had fairly low RP values. As shown in Fig. 11, only a few data samples near the boundaries of clusters were reassigned to other clusters by EM, and this clustering result was consistent with many users' expectations.

Fig. 11: Pattern type BORDER for EM. [(a) Source dataset; (b) Follow-up dataset.]

This pattern type was observed in the violations to MR3.1, MR3.2, MR5.1, and MR6. Some statistics on the violations related to this pattern type and their RP values are as follows: (MR3.1) 100% violations (5 out of 5, where RP < 2%); (MR3.2) 100% violations (11 out of 11, where RP < 1%); (MR5.1) 78% violations (7 out of 9, where RP < 3%); (MR6) 100% violations (39 out of 39, where RP < 3%).

The above statistics indicate that, although the clustering results generated by EM were affected by the types of data transformation corresponding to MR3.1, MR3.2, MR5.1, and MR6, the impact on the clustering results was fairly small (as shown by the very small RP values). Also, Fig. 7 shows that EM has the second smallest number of violated MRs (= 4) among all six subject clustering systems. In this regard, EM has the best performance among the subject clustering systems according to the user's expectations (which are expressed in terms of the 11 generic MRs).

One issue is worth mentioning here. Similar to KM and XM, violations to MR5.1 (rotating the coordinate system) were also observed for EM. As we have pointed out in Section 6.2.1, violations to MR5.1 (and also to other generic MRs) reveal a gap between the actual performance of a method (in this case, EM) and the user's expectation of it. In the Weka implementation, EM initializes its estimators by running KM 10 times and then choosing the "best" solution with the smallest squared error for all the clusters. This chosen solution then serves as the basis for executing the E-step and the M-step in EM (see Section 5.2.1). For this reason, the clustering result generated by EM partially depends on KM, so it is not surprising that both EM and KM showed violations to MR5.1.

SPLIT. Fig. 12 shows an example of this pattern type: each of the two clusters in the source dataset was split into two smaller clusters in the follow-up dataset; at the same time, no merging of clusters in the source dataset occurred. This pattern type was only discovered in two out of the nine violations (= 22%) to MR5.1, with their RP values over 10%.

Fig. 12: Pattern type SPLIT for EM. [(a) Source dataset; (b) Follow-up dataset.]

As explained above, EM partially depends on the KM solution. Thus, violations to MR5.1 by EM could occur after rotating the coordinates. After a close examination, we found that the theoretically "best" KM solution was not always what users normally expected. With respect to SPLIT, the chosen KM solution at the initialization stage was found to involve unexpected data partitions similar to the pattern type shown in Fig. 12(b). This explains why EM generated poor clustering results after iterations based on the ill-conditioned initialization by KM.

Summary: According to the 11 generic MRs, EM is the most robust of the six subject clustering systems. Reassignment of data samples from one cluster to another still occurred, which contradicted the user's expectation; however, since the RP values were very small (< 3%), the impact of data transformation on the clustering result was much smaller than for the other five subject clustering systems. Although both EM and KM execute in an iterative manner, our experiment shows that EM is less sensitive to local optima than KM. Furthermore, in the Weka implementation, the theoretically "best" solution chosen by EM during initialization may not be in line with the user's expectation, resulting in poor clustering results generated by EM.
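Weka's EM is not unique in seeding the E/M iterations with k-means. scikit-learn's GaussianMixture behaves analogously, as the sketch below shows: init_params="kmeans" seeds the estimator with a k-means solution, and n_init=10 re-runs the initialization and keeps the best result, mirroring the ten KM runs described above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

# Seed the E/M iterations with k-means; repeat the initialization 10
# times and keep the run with the best likelihood.
gm = GaussianMixture(n_components=3, init_params="kmeans",
                     n_init=10, random_state=0).fit(X)
labels = gm.predict(X)
print("cluster sizes:", np.bincount(labels),
      "| avg. log-likelihood:", round(gm.score(X), 3))
```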
6.2.3 Violations Related to AN

It can be seen from Fig. 7(d) that AN only caused violations to three generic MRs: MR3.1, MR3.2, and MR5.1. The RP values associated with MR3.1 and MR3.2 were very low (< 5% for both MRs). On the other hand, among the RP values associated with MR5.1, some were under 10%, but the others were fairly high (> 30%). We found two clustering pattern types from the violations related to AN.

BORDER. For those violations with low RP values (< 10%), the same clustering pattern type as shown in Fig. 13 was observed. For these violations, only a few data samples near the cluster boundaries were affected. BORDER was observed in all violations to MR3.1 and MR3.2, and in 43% (39 out of 91) of the violations to MR5.1.

SPLIT. For those violations with relatively high RP values (> 30%), we observed this pattern type (similar to the one shown in Fig. 14), where each of the two clusters in the source dataset was split into a small cluster and a much larger cluster in the follow-up dataset. After checking the Weka implementation, we found that min-max normalization is also adopted in the preprocessing phase of AN, causing the violations to MR5.1.


Fig. 13: Pattern type BORDER for AN. [(a) Source dataset; (b) Follow-up dataset.]

Fig. 14: Pattern type SPLIT for AN. [(a) Source dataset; (b) Follow-up dataset.]




Summary: As a hierarchy-based clustering system, AN is more robust to data transformation than FF; only boundary points are occasionally affected. Our experiment also revealed that, similar to the other systems, there exists a gap between the performance of AN and the user's expectation of this system.

6.2.4 Violations Related to FF

Fig. 7(e) shows that FF caused relatively more violations to the generic MRs than the other clustering systems, with RP values ranging from 0% to 50%. We observed two clustering pattern types for FF.

BORDER. For the violations where RP < 30%, data samples near the cluster boundaries were reassigned to different clusters, as shown in Fig. 15. This pattern type appeared for MR1.1, MR2.2, MR3.1, MR3.2, MR4.1, MR5.1, and MR6. When compared with the other methods, the reclustering of data samples with this pattern type was not very precise. Consider, for example, Fig. 15: a few data samples near the boundary of one cluster in the source dataset were incorrectly reassigned to an adjacent cluster in the follow-up dataset.

Fig. 15: Pattern type BORDER for FF. [(a) Source dataset; (b) Follow-up dataset.]

It can be seen from Table 1 that FF caused many violations to MR1.1, with VR = 90%. This result supports our analysis of FF: the clustering process and result are largely affected by the starting centroids chosen by FF [47]. If the starting centroids selected by FF are changed by reordering the objects (MR1.1), the farthest-first traversal sequence may be affected. In addition, the fact that no violation to MR1.2 (the MR that keeps the same set of starting centroids unchanged) was detected for FF further supports our analysis.

Similar to MR1.1, adding sample objects (MR3.1 and MR3.2) or inserting outliers (MR6) may change the farthest-first traversal sequence, thereby affecting the clustering results. For MR4.1, follow-up clusters should have better separation after adding informative attributes; however, unexpected results were still observed for FF. For MR2.2, reclustering also occurred for the "marginal" points. Moreover, we found that the source execution generated inaccurate results, where the points on the margin of one cluster were assigned to another cluster, while the follow-up execution generated four well-clustered results; this observation revealed a reclustering problem of FF with respect to MR2.2. As for MR5.1, we found that the violations were mainly due to the data normalization task during the preprocessing stage, and the effects of normalization varied across different violations: for BORDER, only data samples near the boundaries were affected; for MERGE & SPLIT, data normalization had a greater impact on the clustering result, as discussed below.

MERGE & SPLIT. This pattern type was observed in 15% of the violations to MR5.1 (rotating the coordinate system), with RP values varying from 30% to 50%. We noted from the Weka implementation that min-max normalization is applied before computing the Euclidean distance between a pair of data objects. FF randomly selects a starting centroid m0 as the first cluster centroid, and then selects the point m1 farthest from m0 as the second centroid (the remaining centroids are selected in the same way). Eventually, every data point is assigned to its nearest centroid. By rotating the coordinates, the data assignment could be different due to the slight change in the normalized distance.

Fig. 16 illustrates how the traversal sequence is affected in relation to MR5.1. We obtained the "same" (i.e., the instance with the same index) starting centroid m0 in the source and follow-up executions by fixing the random seed during the experiment. After FF had finished the first traversal, different points were chosen as the second centroid m1 in the source and follow-up executions. Similarly, after completing the second traversal, the third centroids m2 in the source and follow-up executions were different. In the end, the resulting clusters turned out to be totally different between the source and follow-up datasets.

Fig. 16: Pattern type MERGE & SPLIT for FF. [(a) Source dataset; (b) Follow-up dataset.] The points enclosed in red boxes are the cluster centroids; m0, m1, and m2 denote the first, second, and third selected centroids, respectively.




Summary: The traversal sequence of FF largely depends on the starting centroid. After a data object has been assigned to a cluster, it can no longer be moved around. Therefore, FF is much more sensitive to data transformations such as reordering the data sequence and inserting outliers (or noise). We found that FF is effective in recognizing an outlier and assigning it to a single cluster, without being much affected by data transformation. However, data transformation may cause data objects other than outliers to be reassigned to different clusters. Furthermore, FF occasionally does not generate clear-cut and accurate clusters as expected, even when the data samples are well separated.

6.2.5 Violations Related to DS

Fig. 7(f) shows that violations to MR1.1, MR2.2, MR3.1, MR3.2, MR5.1, and MR6 occurred, with a wide range of RP values (between 0% and 70%). With further analysis, we noted that some points were "noises", representing a major difference between the clustering results of DS and those of the other methods.

BORDER. This pattern type was observed in the violations to MR1.1, with RP < 3%. In this pattern type, violations occurred near the cluster boundaries (see the points in the two boxes in Fig. 17), especially in those cases where clusters were close to each other. It has been reported by others (e.g., in [42]) that DS is almost independent of the order of the input data objects. In our experiment, however, we observed that, among the 100 trials with MR1.1, eight violations related to BORDER occurred. In each of these violations, a very small portion of the data objects was assigned to different clusters from the source dataset to the follow-up dataset. These violations occurred due to a property of DS: if a data object is density-reachable from two neighboring clusters, the cluster to which it is assigned is decided by the chronological sequence in which the clusters near that object are detected. Nevertheless, DS was fairly robust to the type of data transformation corresponding to MR1.1 when the data samples to be clustered were well separated.

Fig. 17: Pattern type BORDER for DS. [(a) Source dataset; (b) Follow-up dataset.]

NOISE. For DS, we observed another violation pattern type, related to noisy data. This pattern type occurred in all the violations to MR2.2, MR3.1, MR4.2, and MR6; in 86% (18 out of 21) of the violations to MR3.2; and in 90% (55 out of 61) of the violations to MR5.1.

Consider MR2.2 (data mirroring) as an example. We noted from Fig. 18 that some points marked as "noisy" data in the source clusters turned out to be density-reachable points in the follow-up clusters.

Fig. 18: Pattern type NOISE for DS. [(a) Source dataset; (b) Follow-up dataset.] Blue dots in (a) denote the "noisy" data detected by DS; no "noisy" data were detected in (b).

We also noted that the number of noisy data points sharply dropped to zero or a tiny value after performing data mirroring as prescribed by MR2.2. We also observed that DS only generated good clustering results when the parameters eps and minPts were properly set. More specifically, when these two parameters were set so that no noisy data occurred in the source dataset, no violation to MR2.2 occurred in the follow-up dataset. Similarly, for the violations to MR3.1, MR3.2, MR4.2, and MR6, the number of "noises" also decreased after performing the types of data transformation corresponding to these MRs.

By analyzing the implementation of DS, the above violations can be explained as follows. Suppose q is a density-connected point from a core point p in cluster c, and o is a point in the eps-neighborhood of q that is marked as "noise". After inserting new data points into c, there may be as many as minPts points within q's eps-neighborhood. Thus, q becomes a new core point, and o becomes density-connected to p, so that o becomes a new member of cluster c. Hence, the number of noisy points is expected to decrease or remain unchanged after inserting new data points into a cluster. Our analysis can be seen as a convenient quantification of the execution behavior of DS.

NUM. This pattern type was observed in three violations to MR3.2 and in six violations to MR5.1, where DS recognized an incorrect number of clusters. Take MR5.1 as an example. We found that DS unexpectedly divided the data samples into four or more clusters, and at the same time the number of samples labeled as "noise" (see the data points marked "×" in Figs. 19(a) and 19(b)) increased from the source dataset to the follow-up dataset. Since DS in the Weka implementation includes an embedded data normalization routine, the generated clustering result could be affected by even a slight change in the normalized distance among data objects.

Fig. 19: Pattern type NUM for DS. [(a) Source dataset; (b) Follow-up dataset.]

Summary: Although the clustering result of DS is generally considered not to be affected by the input order of data samples, our experiment revealed that this is not always the case, due to the randomness of the system itself. When compared with other clustering systems, DS is effective in recognizing outliers. We also found that "noisy" data points are sensitive to data transformation. The configuration of the parameters, which may have some impact on the clustering result, is also important. As a whole, DS is robust to different types of data transformation, which is what users expect of a clustering system.
6.3 Summary and Further Analysis

We learnt from the analyses and discussions in Sections 6.2.1 to 6.2.5 that, for all subject clustering systems, data samples located near the cluster boundaries were sensitive to even a small change in the data input. This can be explained by the randomness of each system during its initialization. Moreover, those systems (KM, XM, and FF) that largely depend on the initialization conditions showed a larger impact of data transformation on the clustering result, while EM, AN, and DS showed higher robustness to such changes. Undoubtedly, users normally expect the chosen clustering system to be highly robust against relocating data samples from near the boundary of one cluster to another cluster as a result of data transformation. Thus, in this respect, EM, AN, and DS are preferable to the other systems.

Table 4 summarizes, for each subject clustering system, its compliance with or violations to the relevant generic MRs. Furthermore, for each violation case, we give the plausible reason(s) for its occurrence. Consider, for example, the cell related to KM and MR1.1. This cell indicates that KM exhibited two types of violation patterns (BORDER and MERGE & SPLIT) with respect to MR1.1: BORDER was caused by the random initialization of the cluster centroids, while MERGE & SPLIT occurred because KM was trapped in a local optimum. Table 4 not only summarizes our assessment results of the subject clustering systems, but also serves as a useful and handy checklist for users to make informed decisions on which clustering systems to choose with respect to their expectations. Also, users are allowed to assign different weights to different violation patterns. For example, if users consider violations with the BORDER pattern less important than violations with the MERGE & SPLIT pattern, then MERGE & SPLIT can be assigned a higher weight than BORDER. In this way, the compliance with each generic MR (in Table 4), together with its corresponding weighted score, can be used as a test adequacy criterion to be further leveraged by users for selecting an appropriate clustering system in accordance with their expectations. In addition, we also analyzed and summarized the relative strengths and weaknesses of the six subject systems with respect to the 11 generic MRs in Table 5, with a view to facilitating users to gain a deeper understanding of these systems.

TABLE 4: Summary of Compliance and Violation Patterns (with Plausible Reasons) with Respect to Each Generic MR.† [For each of the six subject systems (KM, XM, EM, AN, FF, DS) and each of the 11 generic MRs, the table records either compliance, inapplicability, or the observed violation pattern(s) with the plausible reason(s): e.g., random initialization of the (starting) centroids, a tendency to converge to local optima, min-max normalization, a change in the traversal sequence, deviations in distance computation or parameter estimation, a suboptimal KM solution during initialization, the chronological sequence of detecting clusters, or an increase/decrease in noisy or density-reachable objects.]
† "X" means that no violation has been revealed by the relevant MR, and "N/A" means that the relevant MR is inapplicable to the clustering system.

Surprisingly, our experimental results (see Tables 1 and 2) show that all the subject systems involved many violations to MR5.1 (rotating the coordinate system). A close examination of the corresponding source code found that min-max normalization is the major cause of the observed violations. More specifically, the normalized distance among data points can differ after a transformation such as rotating the coordinates, even though the data distribution itself remains unchanged, because min-max normalization is not invariant under rotation (as illustrated by the sketch in Section 6.2.1). Note that data normalization is a very important step in most machine learning systems; some of these systems (e.g., those available in Weka) have a data normalization routine embedded in them. Without using METTLE, users are unlikely to get an opportunity to understand the impact of the embedded normalization routine in machine learning systems.

The above discussion shows that, apart from assessing clustering systems and facilitating their selection, METTLE also supports program comprehension and end-user software engineering [49, 50], through which users can gain a deeper understanding of the program under test without the need for using relevant complex theories.
7 METTLE AS A FRAMEWORK FOR SELECTING CLUSTERING SYSTEMS

Apart from assessing clustering systems, another potential application of METTLE is to help users select the most appropriate clustering system. With more and more open-source software libraries providing ready-to-use machine learning systems, users face a big challenge in choosing a proper one for their application scenarios. Traditionally, users apply a data-driven approach to tackle this challenge, where a set of candidate systems is run against various datasets.
After execution, cross-validation and statistical analyses are used to help users select the proper system to use [51, 52, 17]. However, we argue that, besides the average performance of a clustering system across various datasets, users' expectations or requirements on the system with respect to the application scenario should also be taken into account.

Following this argument, METTLE provides an intuitively appealing and systematic framework to aid the selection of proper clustering systems, by enabling users to assess the appropriateness of these systems based on their own specific requirements and expectations. Below we give a more detailed explanation.

First, the framework of METTLE involves a concept of "adequacy criterion". For example, a list of generic MRs derived from users' expectations is used in METTLE as an adequacy criterion. Subject clustering systems are then assessed by validating their compliance with each generic MR. The assessment results are used for selecting an appropriate system in accordance with users' own specific needs.

Test adequacy plays a crucial role in traditional software testing and validation. Many coverage criteria from different perspectives have already been proposed to measure test adequacy, such as statement coverage, branch coverage, and path coverage [53]. The necessity of evaluating test adequacy has been gradually accepted in machine learning testing [54]. Many researchers from the software engineering community have been working on proposing suitable criteria for evaluating the test adequacy of machine learning systems, with a view to gaining confidence in the testing results [55, 56].
However, until now, there have been very In view of the different levels of importance on the few generally acceptable and systematic criteria for users types of data transformation (and their corresponding MRs), to assess and validate machine learning (include clustering) the overall framework to support users to select clustering systems in their own contexts. systems (in the context of METTLE) is given as follows: Traditional clustering assessment methods can be re- garded as a type of data-oriented adequacy measurement, (1) Select generic MRs or define new MRs in accordance by exploring the “adequacy” in the input space. However, with the user’s intuitive expectations and specific re- with such data-oriented adequacy criterion, users cannot quirements related to their application domains. easily link the input to the appropriateness of a system (2) Classify all the selected MRs into two categories: “must with respect to their own expectations and requirements. have” and “nice to have”. In contrast, our METTLE provides a property-oriented ade- (3) Use METTLE to validate all the candidate clustering quacy criterion based on MRs, which can easily address the systems against all the selected MRs by executing each above problem in traditional methods. In fact, this property- method twice (first with the source dataset, then with oriented adequacy criterion makes the first step in the the follow-up dataset). potential research direction pointed out by Chen et al. [13], (4) Construct a summary table which summarizes the vio- where they argue that MT can allow the development of an lation patterns with respect to all the selected MRs. MR-based metric to be used as a black-box test adequacy (5) For each “nice-to-have” selection MR, assign a weight criterion. Assessing the compliance with MRs provides use- w (where 0.0 < w < 1.0), so that a higher value ful information about the quality and appropriateness of the 1 1 of w means that the corresponding MR is relatively relevant properties and functionalities of a clustering system 1 more preferable or important. Then, assign a weight in a particular application domain. Thus, such an MR-based w2 (where 0.0 < w2 < 1.0) according to the type of criterion in METTLE can provide more confidence to users in violation patterns related to this MR, so that a higher making decision about which clustering system to select. value of w indicates more severity for the correspond- Table 4 summarizes the performance in terms of the com- 2 ing violation pattern. pliance with each generic MR of the six subject clustering 24 TABLE 5: Summary of Relative Strengths and Weaknesses of Six Subject Clustering Systems

TABLE 5: Summary of Relative Strengths and Weaknesses of Six Subject Clustering Systems

KM
  Strengths: Easy to understand and use.
  Weaknesses: Sensitive to the random initialization of centroids and sensitive to outliers; may easily be trapped into a local optimum, which may in turn lead to the occurrence of the MERGE & SPLIT pattern.

XM
  Strengths: A partial remedy for the local optimum problem; shows great advantages over KM when data groups are sufficiently separated.
  Weaknesses: Occasionally tends to underestimate the true number of clusters given a range [kmin, kmax].

EM
  Strengths: Strongly robust to various types of data transformation; less sensitive to a local optimum than KM.
  Weaknesses: Depends partially on the KM solution during the initialization stage (EM initializes estimators by running KM 10 times and then choosing the "best" solution with the smallest squared error); hence, the clustering result is sometimes not what users expect.

AN
  Strengths: Independent of the order of input data; fairly robust to data transformation; strong ability to recognize outliers.
  Weaknesses: Scaling datasets will possibly change the clustering results.

FF
  Strengths: Involves less reassignment and adjustment, which may speed up the clustering process.
  Weaknesses: Very sensitive to initialization conditions; sensitive to even a small change in data input; occasionally generates inaccurate clusters even when the data samples are well separated.

DS
  Strengths: Effective in recognizing outliers.
  Weaknesses: The values of the parameters eps and minPts have a large impact on the clustering result; noisy data are often relocated to different clusters after data transformation.

In view of the different levels of importance of the types of data transformation (and their corresponding MRs), the overall framework to support users in selecting clustering systems (in the context of METTLE) is given as follows:

(1) Select generic MRs, or define new MRs, in accordance with the user's intuitive expectations and specific requirements related to their application domains.

(2) Classify all the selected MRs into two categories: "must have" and "nice to have".

(3) Use METTLE to validate all the candidate clustering systems against all the selected MRs by executing each system twice (first with the source dataset, then with the follow-up dataset).

(4) Construct a summary table of the violation patterns with respect to all the selected MRs.

(5) For each "nice-to-have" selected MR, assign a weight w1 (where 0.0 < w1 < 1.0), so that a higher value of w1 means that the corresponding MR is relatively more preferable or important. Then, assign a weight w2 (where 0.0 < w2 < 1.0) according to the type of violation pattern related to this MR, so that a higher value of w2 indicates a more severe violation pattern.

(6) Ignore those clustering systems which show violations of any "must-have" MR.

(7) For every remaining clustering system mi (where 1 ≤ i ≤ k; k = total number of remaining systems), calculate its score Smi using the following formula (a short sketch of this computation appears at the end of this section):

Smi = (w1,1 × w2,1 × x1) + (w1,2 × w2,2 × x2) + ··· + (w1,n × w2,n × xn)

where 1 ≤ j ≤ n; n = total number of selected MRs; w1,j and w2,j = the weights w1 and w2 assigned to MRj and to its violation pattern, respectively; xj = 1 if one or more violations of MRj occur, and xj = 0 if no violation of MRj occurs.

(8) The most appropriate system to select is the mi with the smallest Smi.

By means of the above selection framework, users are able to devise their own quality assessment schemes for evaluating a set of candidate clustering systems in accordance with their own preferences.

As a reminder, the individual lists of selected MRs developed by different users in the same application domain can be shared, with a view to developing a more comprehensive and effective aggregated list of selection MRs. Furthermore, a repository (e.g., in [57]) can be created to store all the selected MRs and their corresponding validation results for some clustering systems. Via this repository, even inexperienced users without much knowledge about the execution behaviors of individual clustering systems (with respect to different types of data transformation) can still effectively evaluate and then select their most preferred systems.
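To make steps (5) to (8) concrete, here is a minimal sketch of the weighted selection scheme; every system name, MR label, weight, and violation flag below is invented for illustration and does not come from our experiment.

# x[j] = 1 if the system violated MRj at least once (from step (4)'s table).
violations = {
    "KM": {"MR-a": 1, "MR-b": 1, "MR-c": 0},
    "EM": {"MR-a": 0, "MR-b": 0, "MR-c": 0},
    "DS": {"MR-a": 0, "MR-b": 1, "MR-c": 1},
}
must_have = {"MR-a"}                     # step (2): "must-have" MRs
w1 = {"MR-b": 0.9, "MR-c": 0.4}          # step (5): MR preference weights
w2 = {"MR-b": 0.7, "MR-c": 0.5}          # step (5): violation-severity weights

# Step (6): drop every system that violates any must-have MR.
candidates = {s: v for s, v in violations.items()
              if not any(v[mr] for mr in must_have)}

# Step (7): score each remaining system; step (8): pick the smallest score.
scores = {s: sum(w1[mr] * w2[mr] * v[mr] for mr in w1) for s, v in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)  # EM scores 0.0, DS about 0.83, so EM is selected

In this toy run, KM is eliminated outright by the must-have MR, and EM is selected because it violates none of the weighted nice-to-have MRs.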

8 THREATS TO VALIDITY

In this section, we discuss some potential factors that might affect the validity of our experiment.

8.1 Internal Validity

A main internal threat to our study is the randomness of clustering systems. Some of these systems randomly select an object from the dataset during their initialization, which may lead to result variations across multiple executions. To alleviate this threat, we fixed the random seed for the relevant systems in our experiment, so that the clustering results are reproducible in each execution run (a short sketch follows this subsection).

Another internal threat is related to parameter configuration. Different input parameters could lead to totally different clustering results; thus, the impact of parameters on clustering validity is a further research topic for clustering assessment and validation. For example, DS has two critical parameters: the minimal number of points minPts within a specific distance eps. The clustering result generated by DS is largely affected by these two parameters. In this paper, we do not attempt to evaluate the impact of different parameters on the clustering result; these parameters were therefore not treated as independent variables and, hence, were fixed during our experiment.
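As a minimal sketch of these two mitigations (assuming scikit-learn's implementations; the seed and the eps / minPts values are arbitrary choices, not the ones used in our experiment):

from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Mitigation 1: fix the random seed so that the randomized initialization
# of centroids yields reproducible results across executions.
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
assert (labels_a == labels_b).all()  # identical runs

# Mitigation 2: hold DS's critical parameters constant across all source and
# follow-up executions (scikit-learn's min_samples plays the role of minPts).
labels_ds = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)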

8.2 External Validity

In METTLE, we leveraged the concept of MT and developed a list of generic MRs for validation. Because these generic MRs do not cover all possible properties of clustering systems, this issue is a potential threat to the external validity of our study. However, as a novel assessment and validation framework based on the users' perspective, METTLE allows users to specify the expected characteristics of a clustering system in their own contexts. Users could simply adopt some or all of the 11 generic MRs we developed, and then supplement them with more specific, user-defined MRs according to their own application scenarios.

As an application of MT, METTLE also has limitations in some areas of cluster analysis, for example, identifying the optimal number of clusters. MT was proposed to alleviate (not to completely solve) the oracle problem in software testing. Also, by its very nature, an MR can only reveal the absence of an expected characteristic from the system, rather than determining the correctness of individual outputs.

Another external threat is the generality of our approach. It is well known that, in the field of cluster validation, no single validation approach can effectively handle all dataset types [58], and METTLE is no exception. Our experiment only involved datasets with well-formed clusters, so that all six subject clustering systems could properly handle them. A similar approach to generating synthetic datasets for experiments has also been adopted in other studies (e.g., [59]). Although the datasets used for assessment may vary case by case, the high-level properties of clustering systems to be assessed and evaluated by METTLE are rather general. Thus, we argue that the effect of different datasets on the effectiveness of METTLE should not be large.

9 RELATED WORK

MT has been successfully applied in many applications since its introduction by Chen et al. [21]. We refer readers to recent surveys on MT [13, 60] for further insight into this technique. In this section, we highlight some recent work on MT by both academia and industry researchers.

Zhou et al. [24] proposed a user-oriented testing approach for the quality assessment of major online search engines (including, for example, Google and Bing) using the concept of MT. Their empirical results not only guide developers to identify the weaknesses of these search engines, but also help users choose a proper online search engine in a specific scenario. Segura et al. [61] applied MT to web application programming interfaces (APIs) for automatic fault detection. They first constructed MRs with an output-driven approach, and then applied their method to the APIs of Spotify and YouTube. This application successfully detected 11 real-life problems, indicating the effectiveness of MT.

A recent work [62] used MT for software verification of machine-learning-based image classifiers. The effectiveness of the MRs was evaluated by mutation testing, where 71% of the implementation faults were successfully caught.

Adding to the successful applications of MT to quality assessment as well as software verification and validation, MT has also been applied to detecting performance bugs [63]. In this work, a set of performance MRs was defined for the automatic analysis of feature models, and a proof-of-concept experiment confirmed the feasibility of using a metamorphic approach to detecting performance faults.

In recent years, we have witnessed the advances in deep learning, and applying MT to AI-driven systems has grown rapidly. In [26], MT was used to validate the classification accuracy of deep learning frameworks. Also, DeepTest, a testing tool for deep-neural-network-driven autonomous vehicles, was developed to leverage MRs to create a test oracle [28]. DeepTest automatically generates synthetic test cases for different real-world conditions, and is able to detect thousands of erroneous behaviors in autonomous driving systems. Furthermore, a framework called DeepRoad [29] was proposed for testing autonomous driving systems, with a view to detecting inconsistent behaviors across various synthesized driving scenes based on MRs.

More recently, an internationally renowned IT consultancy and service firm, Accenture, has applied MT to test machine learning systems, providing a new vision for quality engineering [64]. In addition, GraphicsFuzz, a commercial spin-off firm from the Department of Computing at Imperial College London, has pioneered the combination of fuzzing and MT for testing graphics drivers [65]. The GraphicsFuzz toolset has been successful at exposing defects in a large number of graphics drivers across different platforms, for example, a Shield TV box with an NVIDIA GPU and a Samsung Galaxy S9 with an ARM GPU. GraphicsFuzz was acquired by Google in August 2018 [66].
10 CONCLUSION AND FUTURE WORK

In this paper, we propose a metamorphic testing-based approach (METTLE) to assessing and validating clustering systems by considering the various dynamic data perspectives for different application scenarios. We have defined 11 generic metamorphic relations (MRs) for six common types of data transformation. We have used these generic MRs, together with six subject clustering systems, to conduct an experiment for verifying the viability and effectiveness of METTLE. Our experiment has demonstrated that METTLE is a vivid, flexible, and practical approach towards validating and assessing clustering systems.

In general, METTLE has the following merits with respect to validation and assessment:

• Validation
  – It is generic and can be easily applied to any clustering system.
  – It provides an elegant and tailor-made mechanism for end users to define their specific expectations and requirements (in terms of MRs) when validating clustering systems.
  – It is further supported by a set of 11 generic MRs, which can be applied to most clustering scenarios.

• Assessment
  – It provides an innovative approach to unveiling the characteristics of unsupervised machine learning systems.
  – It helps categorize clustering systems in terms of their strengths and weaknesses with respect to a set of MRs (corresponding to different types of data transformation). This is particularly helpful for those end users who are not knowledgeable about the logic and mechanisms of clustering systems.
  – It allows end users to devise their own quality assessment schemes for evaluating a set of candidate clustering systems (with respect to the user-defined MRs and their corresponding weights).
  – It demonstrates a systematic and practical framework for end users to assess and select appropriate clustering systems for use.

The promising and encouraging work described in this paper can be extended in three aspects. First, it would be worthwhile to conduct another experiment involving high-dimensional data samples (the experiment described in this paper only involved datasets in two-dimensional space, for easy visualization of the clustering results). Secondly, it would be fruitful to investigate how to define good and representative MRs (in addition to the 11 generic ones) that are applicable to a wide range of application scenarios. Thirdly, the correlation between a violation of an MR and a particular error pattern represents an interesting research topic that warrants further investigation.

ACKNOWLEDGMENT

This work was supported by the National Key R&D Program of China under the grant number 2018YFB1003901, and the National Natural Science Foundation of China under the grant numbers 61572375, 61832009, and 61772263.

REFERENCES

[1] G. Punj and D. W. Stewart, "Cluster analysis in marketing research: Review and suggestions for application," Journal of Marketing Research, vol. 20, no. 2, pp. 134–148, 1983.
[2] K. Chaudhry, J. Yadav, and B. Mallick, "A review of fraud detection techniques: Credit card," International Journal of Computer Applications, vol. 45, no. 1, pp. 39–44, 2012.
[3] D. Jiang, C. Tang, and A. Zhang, "Cluster analysis for gene expression data: A survey," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1370–1386, 2004.
[4] M. Steinbach, P.-N. Tan, V. Kumar, S. Klooster, and C. Potter, "Discovery of climate indices using clustering," in Proceedings of 9th ACM International Conference on Knowledge Discovery and Data Mining, 2003, pp. 446–455.
[5] A. Hotho, S. Staab, and G. Stumme, "Ontologies improve text document clustering," in Proceedings of 3rd IEEE International Conference on Data Mining, 2003, pp. 541–544.
[6] W. Liu, S. Liu, Q. Gu, J. Chen, X. Chen, and D. Chen, "Empirical studies of a two-stage data preprocessing approach for software fault prediction," IEEE Transactions on Reliability, vol. 65, no. 1, pp. 38–53, 2016.
[7] A. Fahad, N. Alshatri, Z. Tari, A. Alamri, I. Khalil, and A. Y. Zomaya, "A survey of clustering algorithms for big data: Taxonomy and empirical analysis," IEEE Transactions on Emerging Topics in Computing, vol. 2, no. 3, pp. 267–279, 2014.
[8] J. Xie, R. Girshick, and A. Farhadi, "Unsupervised deep embedding for clustering analysis," in Proceedings of 33rd International Conference on Machine Learning, 2016, pp. 478–487.
[9] A. Saxena et al., "A review of clustering techniques and developments," Neurocomputing, vol. 267, pp. 664–681, 2017.
[10] A. Banerjee and J. Langford, "An objective evaluation criterion for clustering," in Proceedings of 10th International Conference on Knowledge Discovery and Data Mining, 2004, pp. 515–520.
[11] A. Williams, What is clustering and why is it hard? Sept. 2015. [Online]. Available: http://alexhwilliams.info/itsneuronalblog/2015/09/11/clustering1/. [Accessed Dec. 29, 2018].
[12] U. von Luxburg, R. C. Williamson, and I. Guyon, "Clustering: Science or art?" in Proceedings of 2011 International Conference on Unsupervised and Transfer Learning Workshop, 2011, pp. 65–79.
[13] T. Y. Chen, F. C. Kuo, H. Liu, P. L. Poon, D. Towey, T. H. Tse, and Z. Q. Zhou, "Metamorphic testing: A review of challenges and opportunities," ACM Computing Surveys, vol. 51, no. 1, pp. 4:1–4:27, 2018.
[14] E. Rendón, I. Abundez, A. Arizmendi, and E. M. Quiroz, "Internal versus external cluster validation indexes," International Journal of Computers and Communications, vol. 5, no. 1, pp. 27–34, 2011.
[15] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo, "A comparison of extrinsic clustering evaluation metrics based on formal constraints," Information Retrieval, vol. 12, no. 4, pp. 461–486, 2009.
[16] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu, "Understanding of internal clustering validation measures," in Proceedings of 10th International Conference on Data Mining, 2010, pp. 911–916.
[17] R. S. Olson, W. L. Cava, Z. Mustahsan, A. Varik, and J. H. Moore, "Data-driven advice for applying machine learning to bioinformatics problems," in Proceedings of the Pacific Symposium on Biocomputing, 2018, pp. 192–203.
[18] F. Di Maio, P. Secchi, S. Vantini, and E. Zio, "Fuzzy c-means clustering of signal functional principal components for post-processing dynamic scenarios of a nuclear power plant digital instrumentation and control system," IEEE Transactions on Reliability, vol. 60, no. 2, pp. 415–425, 2011.
[19] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J. M. Pérez, and I. Perona, "An extensive comparative study of cluster validity indices," Pattern Recognition, vol. 46, no. 1, pp. 243–256, 2013.
[20] C. Hennig, M. Meila, F. Murtagh, and R. Rocci, Eds., Handbook of Cluster Analysis. FL: CRC Press, 2015.
[21] T. Y. Chen, S. C. Cheung, and S. M. Yiu, "Metamorphic testing: A new approach for generating next test cases," Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, Tech. Rep. HKUST-CS98-01, 1998.
[22] C. Murphy, G. E. Kaiser, L. Hu, and L. Wu, "Properties of machine learning applications for use in metamorphic testing," in Proceedings of 20th International Conference on Software Engineering and Knowledge Engineering, 2008, pp. 867–872.
[23] X. Xie, J. W. Ho, C. Murphy, G. Kaiser, B. Xu, and T. Y. Chen, "Testing and validating machine learning classifiers by metamorphic testing," Journal of Systems and Software, vol. 84, no. 4, pp. 544–558, 2011.
[24] Z. Q. Zhou, S. Xiang, and T. Y. Chen, "Metamorphic testing for software quality assessment: A study of search engines," IEEE Transactions on Software Engineering, vol. 42, no. 3, pp. 264–284, 2016.
[25] M. Olsen and M. Raunak, "Increasing validity of simulation models through metamorphic testing," IEEE Transactions on Reliability, vol. 68, no. 1, pp. 91–108, 2019.
[26] J. Ding, X. Kang, and X. Hu, "Validating a deep learning framework by metamorphic testing," in Proceedings of 2nd International Workshop on Metamorphic Testing, 2017, pp. 28–34.
[27] Z. Q. Zhou and L. Sun, "Metamorphic testing of driverless cars," Communications of the ACM, vol. 62, no. 2, pp. 61–69, 2019.
[28] Y. Tian, K. Pei, S. Jana, and B. Ray, "DeepTest: Automated testing of deep-neural-network-driven autonomous cars," in Proceedings of 40th International Conference on Software Engineering, 2018, pp. 303–314.
[29] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems," in Proceedings of 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 132–142.
[30] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed., The Morgan Kaufmann Series in Data Management Systems, Amsterdam: Elsevier, 2011.
[31] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in Proceedings of KDD Workshop on Text Mining, 2000, pp. 109–110.
[32] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley Series in Probability and Statistics, NJ: Wiley, 2009.
[33] T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann, "Stability-based validation of clustering solutions," Neural Computation, vol. 16, no. 6, pp. 1299–1323, 2004.
[34] C. Hennig, "Cluster-wise assessment of cluster stability," Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 258–271, 2007.
[35] U. Möller, "Resampling methods for unsupervised learning from sample data," in Machine Learning, A. Mellouk and A. Chebira, Eds., London: IntechOpen, 2009, pp. 289–304.
[36] I. M. G. Dresen, T. Boes, J. Huesing, M. Neuhaeuser, and K.-H. Joeckel, "New resampling method for evaluating stability of clusters," BMC Bioinformatics, vol. 9, no. 1, p. 42, 2008.
[37] A. K. Jain and J. Moreau, "Bootstrap technique in cluster analysis," Pattern Recognition, vol. 20, no. 5, pp. 547–568, 1987.
[38] J. Umbrich, M. Hausenblas, A. Hogan, A. Polleres, and S. Decker, "Towards dataset dynamics: Change frequency of linked open data sources," in Proceedings of 3rd International Workshop on Linked Data on the Web, 2010. [Online]. Available: https://aran.library.nuigalway.ie/bitstream/handle/10379/1120/dynamics_ldow2010.pdf?sequence=1&isAllowed=y. [Accessed Dec. 29, 2018].
[39] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[40] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén, Experimentation in Software Engineering, Heidelberg: Springer, 2012.
[41] M. G. Omran, A. P. Engelbrecht, and A. Salman, "An overview of clustering methods," Intelligent Data Analysis, vol. 11, no. 6, pp. 583–605, 2007.
[42] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[43] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, 4th ed., Amsterdam: Elsevier, 2017.
[44] D. Pelleg and A. W. Moore, "X-means: Extending k-means with efficient estimation of the number of clusters," in Proceedings of 17th International Conference on Machine Learning, vol. 1, 2000, pp. 727–734.
[45] S. Konishi and G. Kitagawa, Information Criteria and Statistical Modeling, Springer Series in Statistics, NY: Springer, 2008.
[46] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, no. 2, pp. 195–239, 1984.
[47] S. Dasgupta and P. M. Long, "Performance guarantees for hierarchical clustering," Journal of Computer and System Sciences, vol. 70, no. 4, pp. 555–569, 2005.
[48] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[49] J. Segal, "Two principles of end-user software engineering research," ACM SIGSOFT Software Engineering Notes, vol. 30, no. 4, pp. 1–5, 2005.
[50] M. Burnett, "What is end-user software engineering and why does it matter?" in International Symposium on End User Development, 2009, pp. 15–28.
[51] R. Caruana and A. Niculescu-Mizil, "An empirical comparison of supervised learning algorithms," in Proceedings of 23rd International Conference on Machine Learning, 2006, pp. 161–168.
[52] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?" Journal of Machine Learning Research, vol. 15, no. 1, pp. 3133–3181, 2014.
[53] H. Zhu, P. A. V. Hall, and J. H. R. May, "Software unit test coverage and adequacy," ACM Computing Surveys, vol. 29, no. 4, pp. 366–427, 1997.
[54] J. M. Zhang, M. Harman, L. Ma, and Y. Liu, "Machine learning testing: Survey, landscapes and horizons," private communication.
[55] L. Ma et al., "DeepGauge: Multi-granularity testing criteria for deep learning systems," in Proceedings of 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 120–131.
[56] J. Kim, R. Feldt, and S. Yoo, "Guiding deep learning system testing using surprise adequacy," arXiv preprint arXiv:1808.08444, 2018.
[57] X. Xie, J. Li, C. Wang, and T. Y. Chen, "Looking for an MR? Try METWiki today," in Proceedings of 1st International Workshop on Metamorphic Testing, 2016, pp. 1–4.
[58] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.
[59] Z. Huang, D. W. Cheung, and M. K. Ng, "An empirical study on the visual cluster validation method with Fastmap," in Proceedings of 7th International Conference on Database Systems for Advanced Applications, 2001, pp. 84–91.
[60] S. Segura, G. Fraser, A. B. Sánchez, and A. Ruiz-Cortés, "A survey on metamorphic testing," IEEE Transactions on Software Engineering, vol. 42, no. 9, pp. 805–824, 2016.
[61] S. Segura, J. A. Parejo, J. Troya, and A. Ruiz-Cortés, "Metamorphic testing of RESTful web APIs," IEEE Transactions on Software Engineering, vol. 44, no. 11, pp. 1083–1099, 2018.
[62] A. Dwarakanath et al., "Identifying implementation bugs in machine learning based image classifiers using metamorphic testing," in Proceedings of 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2018, pp. 118–128.
[63] S. Segura, J. Troya, A. Durán, and A. Ruiz-Cortés, "Performance metamorphic testing: A proof of concept," Information and Software Technology, vol. 98, pp. 1–4, 2018.
[64] Accenture, Quality engineering in the new: A vision and R&D update from Accenture Labs and Accenture Testing Services, 2018. [Online]. Available: https://www.accenture.com/t20180627T065422Z__w__/cz-en/_acnmedia/PDF-81/Accenture-Quality-Engineering-POV.pdf. [Accessed Dec. 30, 2018].
[65] Imperial Innovations, GraphicsFuzz launches testing solution for graphics drivers, Apr. 26, 2018. [Online]. Available: https://www.imperialinnovations.co.uk/news-events/news/2018/apr/26/graphicsfuzz-launches-testing-solution-graphics-dr/. [Accessed Dec. 31, 2018].
[66] F. Lardinois, Google acquires GraphicsFuzz, a service that tests Android graphics drivers, techcrunch.com, Aug. 6, 2018. [Online]. Available: https://techcrunch.com/2018/08/06/google-acquires-graphicsfuzz-a-service-that-tests-android-graphics-drivers/?via=indexdotco. [Accessed Dec. 31, 2018].