[Re] Replication Study of 'Generative causal explanations of black-box classifiers'


[Re] Replication Study of 'Generative causal explanations of black-box classifiers'
Replication / ML Reproducibility Challenge 2020, ReScience C 7.2 (#23)

Ivo Verhoeven¹, Xinyi Chen¹, Qingzhi Hu¹, and Mario Holubar¹
¹University of Amsterdam, Amsterdam, the Netherlands

Edited by Koustuv Sinha. Reviewed by anonymous reviewers.
Received 29 January 2021. Published 27 May 2021. DOI 10.5281/zenodo.4835600

Reproducibility Summary

Scope of Reproducibility

We verify the outcome of the methodology proposed in the article, which attempts to provide post-hoc causal explanations for black-box classifiers through causal inference. We achieve this by replicating the code step by step, following the descriptions in the paper. We examine all the claims made in the paper, and provide additional metrics to evaluate the portability, expressive power, algorithmic complexity and data fidelity of the proposed framework. We further extend the original analyses to cover all benchmark datasets used, confirming the reported results.

Methodology

We use the same architecture and (hyper)parameters for replication. However, our code has a different structure, and we provide a more efficient implementation of the measure of information flow. In addition, Algorithm 1 of the original paper is not implemented in the authors' repository, so we implemented it ourselves. We also extended the framework to another domain (text data), although unsuccessfully. Furthermore, we include a detailed table showing the time needed to produce the results of each reproduced experiment. All models were trained on Nvidia GeForce GTX 1080 GPUs provided by SURFsara's LISA cluster computing service at the University of Amsterdam¹.

Results

We reproduced the framework of the original paper and verified the main claims made by the authors.
However, the GCE model in our extension study did not manage to separate causal from non-causal factors for a text classifier, owing to the complexity of fine-tuning the model.

¹This is our course project for the master course Fairness, Accountability, Confidentiality and Transparency in AI at the University of Amsterdam. LISA cluster: https://userinfo.surfsara.nl/systems/lisa.

Copyright © 2021 I. Verhoeven et al., released under a Creative Commons Attribution 4.0 International license. Correspondence should be addressed to Ivo Verhoeven ([email protected]). The authors have declared that no competing interests exist. Code is available at https://github.com/shin-ee-chen/UvA_FACT_2021 – SWH swh:1:dir:9d960937f2b11bb67b53bbe4eaa940c14cadf9d1. Open peer review is available at https://openreview.net/forum?id=Vqtf4kZg2j. ReScience C 7.2 (#23) – Verhoeven et al. 2021.

What was easy

The original paper comes with extensive appendices, many of which contain crucial details for implementation and for understanding the intended function. The authors provide code for most of the experiments presented in the paper. Although their code repository was not functional at the beginning, we used it as a reference while re-implementing our own code. The authors also updated their code two weeks after we started our own implementation, which made it easy for us to verify the correctness of our re-implementation.

What was difficult

The codebase the authors provided was initially unusable, with missing or renamed imports, hardcoded file paths and an all-around convoluted structure. Additionally, the description of Algorithm 1 is quite vague, and no implementation of it was given. Beyond this, computational expense was a serious issue, given the need for inefficient training steps and for re-iterating training several times during hyperparameter search.
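The measure of information flow mentioned in the Methodology section can be estimated by Monte-Carlo sampling of the mutual information between a causal latent factor and the classifier output, marginalising over the non-causal factor. The sketch below is our own illustration on a toy discrete decoder and classifier, not the authors' implementation; the names `information_flow`, `decoder` and `classifier` are hypothetical.

```python
import math
import random

def information_flow(classifier, decoder, n_samples=20000, seed=0):
    """Monte-Carlo estimate of the mutual information I(alpha; Y)
    between a binary causal latent factor alpha and the classifier
    output Y, marginalising over the non-causal factor beta."""
    rng = random.Random(seed)
    joint = {}  # counts of (alpha, y) pairs
    for _ in range(n_samples):
        alpha = rng.randint(0, 1)  # causal latent factor
        beta = rng.randint(0, 1)   # non-causal latent factor
        y = classifier(decoder(alpha, beta))
        joint[(alpha, y)] = joint.get((alpha, y), 0) + 1
    # marginal counts
    pa, py = {}, {}
    for (a, y), c in joint.items():
        pa[a] = pa.get(a, 0) + c
        py[y] = py.get(y, 0) + c
    # I(alpha; Y) = sum p(a,y) log2( p(a,y) / (p(a) p(y)) )
    mi = 0.0
    for (a, y), c in joint.items():
        mi += (c / n_samples) * math.log2(c * n_samples / (pa[a] * py[y]))
    return mi

# Toy set-up: the decoder writes alpha into the first coordinate and
# beta into the second; the classifier only looks at the first one,
# so roughly one bit of information flows from alpha to the output.
decoder = lambda a, b: (a, b)
classifier = lambda x: x[0]
print(round(information_flow(classifier, decoder), 2))  # close to 1 bit
```

A classifier that instead reads only the second coordinate would yield an estimate near zero, which is the signature of a non-causal factor.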
Communication with original authors

This reproducibility study is part of a course on fairness, accountability, confidentiality and transparency in AI. Since it is a course project in which we interacted with other groups in a forum, and another group working on this paper had already reached out to the authors about problems with the initial repository, we did not find it necessary to contact them again.

1 Introduction

Machine learning is increasingly used in a wide range of applications. The wide-scale spread of these methods places more emphasis on transparent algorithmic decision making, which has the potential to mitigate disruptive social effects. Yet, despite the reliable results of complex black boxes, their internal reasoning and inner workings are not necessarily apparent to end-users or even designers. As a result, not even trained experts can grasp the reasoning behind their predictions. Moreover, modern legislation has created the need to be able to challenge these systems, especially in heavily regulated domains, increasing the demand for machine learning systems that are (post-hoc) interpretable [1].

Black-box artificial intelligence approaches like deep neural networks have often proven able to capture complex dependencies within data and to make accurate predictions. However, the actual internal logic of systems built on such approaches is often nebulous or entirely opaque. In this work, we reproduce a paper that focuses on the explainability aspect of AI. Explainable Artificial Intelligence (XAI) refers to systems that seek to clarify how a black-box AI model achieves its performance. Post-hoc XAI achieves the desired explainability by generating reasons for decisions after a black-box classifier has been trained. This is often achieved by extracting correlations between input features and the eventual predictions [2].
This paper aims to reproduce such a post-hoc XAI algorithm, capable of providing clear and interpretable explanations for complex black-box classifiers [3]. The central contribution of [3] is placing explanations in a causal-generative framework. By using a variational auto-encoder (VAE) [4], low-dimensional representations of the data can be obtained. By further incorporating a mutual-information loss between the classifier and the latent variables, the latent space is decorrelated into factors that actively influence the classifier's prediction and factors that capture superfluous variation in the underlying dataset. The resulting causal VAE, dubbed a generative causal explainer (GCE) by [3], can be studied and intervened upon, providing a powerful framework for inducing explainability.

There exists literature suggesting that many published studies cannot be replicated [5], even in the most respected journals and conferences in AI. Steps must be taken to ensure the high trustworthiness of AI algorithms [6]. In the experiments presented here, we re-implement the framework used by [3], extend their analyses to all benchmark datasets discussed, and provide an extension to a novel domain.

When we began, the officially provided code repository accompanying the authors' publication could neither run in all discussed scenarios nor produce results similar to those presented². However, our re-implementation has verified all results, has been extended to incorporate all analyses, and allows us to report their framework to be reproducible. All code has been made publicly available via a GitHub repository³.

2 Scope of Reproducibility

The original paper establishes a causal inference framework.
While intuitive, this introduces the additional challenge of balancing causal intervention with expressive latent variables. To overcome this hurdle, [3] integrate causal inference into the VAE learning objective. This system consists of two basic components: a way to describe the data distribution, and a method of generating a consistent classification model. In order to set a clear objective for the reproduction, we consider the main claims of the original paper to be:

• Claim 1: creating a generative framework that describes a black-box classifier, and a process that achieves this while maintaining latent descriptions that result in high data fidelity and scalability of the overall system

• Claim 2: their objective function helps the generative model disentangle the latent variables into causal and non-causal factors for the classifier. Their approach is sufficient for any black-box classifier that can provide class probabilities and gradients with respect to the classifier input

• Claim 3: the resulting explanations show what is important to the accompanying classifier

• Claim 4: the learned representation may also be used to produce counterfactual explanations, as it incorporates both generative and causal modelling

The standard of reproducibility demands that machine learning methods be routinely evaluated on the verifiability of their results. The following additional metrics (properties) will be used as measures of reproducibility:

• Portability: the modelling domains and types of machine learning models that can benefit from their framework

• Expressive Power: the explanation structure that can be produced by their framework

• Algorithmic Complexity: the computational complexity of their algorithm

• Data Fidelity:

²Since then, the authors have responded to suggestions and rewritten their code base substantially.
³https://github.com/shin-ee-chen/UvA_FACT_2021
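The learning objective behind Claim 2 augments a VAE-style data-fidelity term with a causal-influence term between the causal latents and the classifier output. The sketch below shows one way such a composite objective might be assembled; the function name, sign conventions and the weight `lam` are our own assumptions, not the authors' exact formulation.

```python
def gce_objective(recon_nll, kl_alpha, kl_beta, causal_influence, lam=0.05):
    """Schematic composite loss for a generative causal explainer.

    recon_nll        -- reconstruction negative log-likelihood of the VAE
    kl_alpha/kl_beta -- KL terms for the causal / non-causal latent groups
    causal_influence -- estimated mutual information between the causal
                        latents and the black-box classifier's output
    lam              -- trade-off weight (hypothetical value)

    Minimising this trades data fidelity (the negative-ELBO part)
    against maximising information flow through the causal factors.
    """
    neg_elbo = recon_nll + kl_alpha + kl_beta
    return neg_elbo - lam * causal_influence
```

During training, each scalar would come from the current minibatch; lowering the returned value simultaneously improves reconstruction and pushes classification-relevant information into the causal latents, which is the disentanglement behaviour Claim 2 asserts.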