MLCheck – Property-Driven Testing of Machine Learning Models
Arnab Sharma, Department of Computer Science, University of Oldenburg, Oldenburg, Germany
Caglar Demir, Data Science Group, Paderborn University, Paderborn, Germany
Axel-Cyrille Ngonga Ngomo, Data Science Group, Paderborn University, Paderborn, Germany
Heike Wehrheim, Department of Computer Science, University of Oldenburg, Oldenburg, Germany

arXiv:2105.00741v1 [cs.SE] 3 May 2021

ABSTRACT

In recent years, we observe an increasing amount of software with machine learning components being deployed. This poses the question of quality assurance for such components: how can we validate whether specified requirements are fulfilled by a machine learned software? Current testing and verification approaches either focus on a single requirement (e.g., fairness) or specialize on a single type of machine learning model (e.g., neural networks).

In this paper, we propose property-driven testing of machine learning models. Our approach MLCheck encompasses (1) a language for property specification, and (2) a technique for systematic test case generation. The specification language is comparable to property-based testing languages. Test case generation employs advanced verification technology for a systematic, property-dependent construction of test suites, without additional user-supplied generator functions. We evaluate MLCheck using requirements and data sets from three different application areas (software discrimination, learning on knowledge graphs and security). Our evaluation shows that despite its generality MLCheck can even outperform specialised testing approaches while having a comparable runtime.

KEYWORDS

Machine Learning Testing, Decision Tree, Neural Network, Property-Based Testing.

ACM Reference Format:
Arnab Sharma, Caglar Demir, Axel-Cyrille Ngonga Ngomo, and Heike Wehrheim. 2021. MLCheck – Property-Driven Testing of Machine Learning Models. In Proceedings of ACM Conference (Conference'DD). ACM, New York, NY, USA, 12 pages. https:XXXX

1 INTRODUCTION

The importance of quality assurance for applications developed using machine learning (ML) increases steadily as they are being deployed in a growing number of domains and sites. Supervised ML algorithms "learn" their behaviour as generalizations of training data using sophisticated statistical or mathematical methods. Still, developers need to make sure that their software, whether learned or programmed, satisfies certain specified requirements. Currently, two orthogonal approaches can be followed to achieve this goal: (A) employing an ML algorithm guaranteeing some requirement per design, or (B) validating the requirement on the model generated by the ML algorithm.

Both approaches have their individual shortcomings: Approach A is only available for a handful of requirements (e.g., fairness, monotonicity, robustness) [23, 33, 41]. Moreover, such algorithms cannot ensure the complete fulfillment of the requirement. For example, Galhotra et al. [15] have found fairness-aware ML algorithms to generate unfair predictions, and Sharma et al. [35] detected non-monotonic predictions in supposedly monotone classifiers. For robustness to adversarial attacks, the algorithms can only reduce the attack surface. Approach B, on the other hand, is only possible if a validation technique exists which is applicable to (1) the specific type of machine learning classifier under consideration (i.e., neural network, SVM, etc.) and (2) the specific property to be checked. Current validation techniques are restricted to either a single model type or a single property (or even both).

In this paper, we propose property-driven testing as a validation technique for machine learning models overcoming the shortcomings of approach B. Our technique allows developers to specify the property under interest and, based on the property, performs a targeted generation of test cases. The target is to find test cases violating the property. The approach is applicable to arbitrary types of non-stochastic properties and arbitrary types of supervised machine learning models. We consider the model under test (MUT) as a black-box of which we just observe the input-output behaviour. To achieve a systematic generation of test cases, specific to both MUT and property, we train a second white-box model approximating the MUT by using its predictions as training data. Knowing the white-box's internal structure, we can apply state-of-the-art verification technology to verify the property on it. A verification result of "failure" (property not satisfied) is then accompanied by (one or more) counterexamples, which we subsequently store as test inputs whenever they are failures for the MUT as well.

We currently employ two types of ML models as approximating white-boxes: decision trees and neural networks. While no prior knowledge is required pertaining to the internal structure of the model under test, the internals of the white-box model are accessible to verification. Test generation proceeds by (1) encoding both property and white-box model as logical formulae and (2) using an SMT (Satisfiability Modulo Theories) solver to check their satisfiability. Counterexamples in this case directly come in the form of satisfying assignments to logical variables which encode feature and class values. Due to the usage of an approximating white-box model, test generation is an iterative procedure: whenever a counterexample on the white-box model is found which is not valid for the MUT, the white-box model gets retrained. This way the approximation quality of the white-box model is successively improved.

We have implemented our approach in a tool called MLCheck and evaluated it on requirements of three different application areas:

• Software discrimination studies whether ML models give predictions which are (un)biased with respect to some attributes. Different definitions of such fairness requirements exist (see [40]); we exemplarily use individual discrimination [15].
• Knowledge graphs are a family of knowledge representation techniques. We consider learning classifiers for entities based on knowledge graphs embeddings [12] and exemplarily consider the properties of class disjointness and subsumption.
• Security of machine learning applications investigates if ML models are vulnerable to attacks, i.e., can be manipulated as to give specific predictions. We exemplarily study vulnerability to trojan attacks [17].

In all three areas, we compare our approach to either other tools specifically testing such properties (if they exist)

2 FOUNDATIONS

We start by introducing some basic terminology in machine learning and formally defining the properties to be checked for our three application areas.

A supervised machine learning (ML) algorithm works in two steps. In the first (learning) phase, it is presented with a set of data instances (training data) and generates a function (the predictive model), generalising from the training data. The generated predictive model (short, model) is then used in the second (prediction) phase to predict classes for unknown data instances.

Formally, the generated model is a function

$$M : X_1 \times \ldots \times X_n \to Z_1 \times \ldots \times Z_m,$$

where $X_i$ is the value set of feature $i$, $1 \le i \le n$, and every $Z_j$, $1 \le j \le m$, contains the classes for the $j$th label. Instead of numbering features and labels, we also use feature names $F_1, \ldots, F_n$ and label names $L_1, \ldots, L_m$, and let $F = \{F_1, \ldots, F_n\}$, $L = \{L_1, \ldots, L_m\}$. We freely mix numbers and names in our formalizations. When $m > 1$, the learning problem is a multilabel classification problem; when $|Z_i| > 2$ for some $i$, the learning problem is a multiclass classification problem. In case $|Z_i| = 2$ for all $i$, it is a binary classification problem.

We write $\vec{X}$ for $X_1 \times \ldots \times X_n$, $\vec{Z}$ for $Z_1 \times \ldots \times Z_m$ and use an index (like in $x_i$) to access the $i$-th component. The training data consists of elements from $\vec{X} \times \vec{Z}$, i.e., data instances with known associated class labels. During the prediction, the generated predictive model assigns classes $z \in \vec{Z}$ to a data instance $x \in \vec{X}$ (which is potentially not in the training data). Based on this formalization, we define properties relevant to our three application areas software discrimination, knowledge representation and security.

Software discrimination. Fairness of predictive models refers to the absence of discrimination of individuals due to certain feature values. More precisely, a model has no individual discrimination [15] if flipping the value of a single, so called sensitive feature, while keeping the values of other features does not change the prediction.
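
As an illustration of the individual discrimination property stated above, the following minimal sketch treats a model as a black-box function and searches a given list of instances for pairs that differ only in the sensitive feature yet receive different predictions. It is a brute-force check over supplied instances, not MLCheck's verification-based test generation. All names here (`discrimination_witnesses`, `biased_model`, `fair_model`) and the toy feature layout (age, gender, income) are hypothetical and chosen only for this example.

```python
from typing import Any, Callable, List, Sequence, Tuple

Instance = Tuple[Any, ...]


def discrimination_witnesses(
    model: Callable[[Instance], Any],
    instances: Sequence[Instance],
    sensitive_index: int,
    sensitive_values: Sequence[Any],
) -> List[Tuple[Instance, Instance]]:
    """Return pairs (x, x') that differ only in the sensitive feature
    but are assigned different predictions by the black-box model."""
    witnesses = []
    for x in instances:
        for value in sensitive_values:
            if value == x[sensitive_index]:
                continue  # flipping to the same value changes nothing
            # Copy x, replacing only the sensitive feature.
            x_flipped = x[:sensitive_index] + (value,) + x[sensitive_index + 1:]
            if model(x) != model(x_flipped):
                witnesses.append((x, x_flipped))
    return witnesses


# Hypothetical toy models over instances (age, gender, income).
def biased_model(x: Instance) -> int:
    _age, gender, income = x
    return 1 if income > 50 and gender == 0 else 0


def fair_model(x: Instance) -> int:
    _age, _gender, income = x
    return 1 if income > 50 else 0


samples = [(30, 0, 60), (45, 1, 70), (22, 0, 20)]
biased_pairs = discrimination_witnesses(biased_model, samples, 1, [0, 1])
fair_pairs = discrimination_witnesses(fair_model, samples, 1, [0, 1])
```

In this toy setting, each pair in `biased_pairs` is a failing test case for the property, while `fair_pairs` stays empty. An empty result only means that no violation was found among the supplied instances; it does not show that the model satisfies the property.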