Arxiv:2103.10226V1 [Cs.LG] 18 Mar 2021 Would Suggest to Add Or Remove Sunglasses, Beard, Or Scarf, While a Non-Diverse Set Would All Suggest to Add Or Remove 1
Total Page:16
File Type:pdf, Size:1020Kb
Beyond Trivial Counterfactual Explanations with Diverse Valuable Explanations Pau Rodr´ıguez1 Massimo Caccia1;2 Alexandre Lacoste1 Lee Zamparo1 Issam Laradji1;2;3 Laurent Charlin2;4 David Vazquez1 1Element AI 2MILA 3McGill University 4HEC Montreal [email protected] Abstract model is confused by people wearing sunglasses then the system could generate alternative images of faces without Explainability for machine learning models has gained sunglasses that would be correctly recognized. In order considerable attention within our research community to discover a model’s limitations, counterfactual generation given the importance of deploying more reliable machine- systems could be used to generate images that would con- learning systems. In computer vision applications, gen- fuse the classifier, such as people wearing sunglasses or erative counterfactual methods indicate how to perturb a scarfs occluding the mouth. This is different from other model’s input to change its prediction, providing details types of explainability methods such as feature importance about the model’s decision-making. Current counterfactual methods [4, 51, 52] and boundary approximation meth- methods make ambiguous interpretations as they combine ods [38, 48], which highlight salient regions of the input multiple biases of the model and the data in a single coun- like the sunglasses but do not indicate how the ML model terfactual interpretation of the model’s decision. Moreover, could achieve a different prediction. these methods tend to generate trivial counterfactuals about According to [39, 49], counterfactual explanations the model’s decision, as they often suggest to exaggerate or should be valid, proximal, and sparse.A valid counterfac- remove the presence of the attribute being classified. For the tual explanation changes the prediction of the ML model, machine learning practitioner, these types of counterfactu- for instance, adding sunglasses to confuse a smile classi- als offer little value, since they provide no new information fier. The explanation is sparse if it only changes a mini- about undesired model or data biases. In this work, we mal set of attributes, for instance, it only adds sunglasses propose a counterfactual method that learns a perturbation and it does not add a hat, a beard, or the like. An expla- in a disentangled latent space that is constrained using a nation is proximal if it is perceptually similar to the origi- diversity-enforcing loss to uncover multiple valuable expla- nal image, for instance, a ninety degree rotation of an im- nations about the model’s prediction. Further, we introduce age would be a sparse but not proximal. In addition to the a mechanism to prevent the model from producing trivial three former properties, generating a set of diverse expla- explanations. Experiments on CelebA and Synbols demon- nations increases the likelihood of finding a useful expla- strate that our model improves the success rate of producing nation [39, 49]. A set of counterfactuals is diverse if each high-quality valuable explanations when compared to pre- one proposes to change a different set of attributes. Fol- vious state-of-the-art methods. We will publish the code. lowing the previous example, a diverse set of explanations arXiv:2103.10226v1 [cs.LG] 18 Mar 2021 would suggest to add or remove sunglasses, beard, or scarf, while a non-diverse set would all suggest to add or remove 1. Introduction different brands of sunglasses. Intuitively, each explanation Consider an image recognition model such as a smile should shed light on a different action that a user can take classifier. In case of erroneous prediction, an explainabil- to change the ML model’s outcome. ity system should provide information to machine learn- Current generative counterfactual methods like ing practitioners to understand why such error happened xGEM [26] generate a single explanation that is not and how to prevent it. Counterfactual explanation meth- constrained to be similar to the input. Thus, they fail to be ods [4, 11, 13] can help highlight the limitations of an ML proximal, sparse, and diverse. Progressive Exaggeration model by uncovering data and model biases. Counterfac- (PE) [54] provides higher-quality explanations that are tual explanations provide perturbed versions of the input more proximal than xGEM, but it still fails to provide a data that emphasize features that contributed the most to diverse set of explanations. In addition, the image generator the ML model’s output. For the smile classifier, if the of PE is trained on the same data as the image classifier in order to detect biases thereby limiting their applicability. art in terms of the validity, proximity, and sparsity of its ex- Both of these two methods tend to produce trivial explana- planations, detecting biases on the datasets, and producing tions, which only address the attribute that was intended to multiple explanations for an image. 3) We identify the im- be classified, without further exploring failure cases due to portance of finding non-trivial explanations and we propose biases in the data or spurious correlations. For instance, an a new benchmark to evaluate how valuable the explanations explanation that suggests to increase the ‘smile’ attribute are. 4) We propose to leverage the Fisher information ma- of a ‘smile’ classifier for an already-smiling face is trivial trix of the latent space for finding spurious features that pro- and it does not explain why a misclassification occurred. duce non-trivial explanations. On the other hand, a non-trivial explanation that suggests to change the facial skin color would uncover a racial bias 2. Related Work in the data that should be addressed by the ML practitioner. In this work, we focus on diverse valuable explanations, Explainable artificial intelligence (XAI) is a suite of that is, valid, proximal, sparse, and non-trivial. techniques developed to make either the construction or in- terpretation of model decisions more accessible and mean- We propose Diverse Valuable Explanations (DiVE), an ingful. Broadly speaking, there are two branches of work in explainability method that can interpret ML model predic- XAI, ad-hoc and post-hoc. Ad-hoc methods focus on mak- tions by identifying sets of valuable attributes that have the ing models interpretable, by imbuing model components or most effect on model output. DiVE produces multiple coun- parameters with interpretations that are rooted in the data valuable terfactual explanations which are enforced to be , themselves [25, 40, 46]. To date, most successful machine diverse and , resulting in more informative explanations for learning methods, including deep learning ones, are unin- machine learning practitioners than competing methods in terpretable [6, 18, 24, 32]. the literature. Our method first learns a generative model of Post-hoc methods aim to explain the decisions of un- β the data using a -TCVAE[5] to obtain a disentangled latent interpretable models. These methods can be categorized proximal sparse representation which leads to more and ex- as non-generative and generative. Non-generative methods planations. In addition, the VAEis not required to be trained use information from an ML model to identify the features on the same dataset as the ML model to be explained. DiVE most responsible for an outcome for a given input. Ap- then learns a latent perturbation using constraints to enforce proaches like [38, 42, 48] interpret ML model decisions by diversity sparsity proximity non- , , and . In order to generate fitting a locally interpretable model. Others use the gra- trivial explanations, DiVE leverages the Fisher information dient of the ML model parameters to perform feature at- matrix of its latent space to focus its search on the less in- tribution [1, 51–53, 55, 61, 62], sometimes by employing fluential factors of variation of the ML model. This mecha- a reference distribution for the features [11, 52]. This has nism enables the discovery of spurious correlations learned the advantage of identifying alternative feature values that, by the ML model. when substituted for the observed values, would result in a We provide experiments to assess whether our expla- different outcome. These methods are limited to small con- nations are more valuable and diverse than current state- tiguous regions of features with high influence on the target of-the-art. First, we assess their validity on the CelebA model outcome. In so doing, they can struggle to provide dataset [34] and provide quantitative and qualitative results plausible changes of the input that are useful for a user in on a bias detection benchmark [54]. Second, we show that order to correct a certain output or bias of the model. Gener- the generated explanations are more proximal in terms of ative methods such as [4,5,7, 15] propose proximal modi- Frechet´ Inception Distance (FID) [20], which is a measure fications of the input that change the model decision. How- of similarity between two datasets of images commonly ever the generated perturbations are usually performed in used to evaluate the quality of generated images. In addi- pixel space and bound to masking small regions of the im- tion, we evaluate the proximity in latent space and face ver- age without necessarily having a semantic meaning. Closest ification accuracy, as reported by Singla et al. [54]. Third, to our work are generative counterfactual explanation meth- we assess the sparsity of the generated counterfactuals by ods which synthesize perturbed versions of observed data computing the average change in facial attributes. Finally, that result in a change of the model prediction. These can we propose a new benchmark to quantify the success of ex- be further subdivided into two families. The first family of plainability systems at finding non-trivial explanations. We methods conditions the generative model on attributes, by find that DiVE is more successful at finding non-trivial ex- e.g. using a conditional GAN [26, 33, 50, 59, 60].