BIG Data, BIG Responsibility (Data Lineage Management with Template for Reproducible Scientific Papers)

Mohammad Akhlaghi
Instituto de Astrofísica de Canarias (IAC), Tenerife, Spain
10th Iberian Grid Conference (Ibergrid2019), Santiago de Compostela (Spain), September 23rd, 2019
Slides available at http://akhlaghi.org/pdf/reproducible-paper.pdf

Reproducibility is critically important in the sciences (example from astronomy)

Example: Detecting outer regions of M51 in a single exposure SDSS image, using NoiseChisel, with default and optimized parameters.

  • When optimized, outer wing detected to S/N = 1/4, or 28.3 mag/arcsec².
  • Complete tutorial in manual fully describes how to derive/reproduce optimized result:
      • Run-time options/configuration.
      • Steps before/after NoiseChisel.
  • Deep/orange image from Watkins+2015 (arXiv:1501.04599).
  • Therefore:
      • Default settings not enough.
      • Final number not just from NoiseChisel (more software involved).

[Figure panels: Input image; Default NoiseChisel; Optimized NoiseChisel; Much deeper image]

Simply reporting in your paper that "we used NoiseChisel" is not enough to reproduce, understand, or verify your result.

Reproducibility crisis in the sciences/astronomy

  • Snakes on a Spaceship - An Overview of Python in Heliophysics: "...inadequate analysis descriptions and loss of scientific data have made scientific studies difficult or impossible to replicate". From Burrell+2018 (arXiv:1901.00143).
  • Perspectives on Reproducibility and Sustainability of Open-Source Scientific Software: "It is our interest that NASA adopt an open-code policy because without it, reproducibility in computational science is needlessly hampered". From Oishi+2018 (arXiv:1801.08200).
  • Schroedinger's code: source code availability and link persistence in astrophysics: "We were unable to find source code online ... for 40.4% of the codes used in the research we looked at". From Allen+2018 (arXiv:1801.02094).

Original image from https://www.redbubble.com

General outline of a project

[Diagram elements: Software; Build; Run software on data; Paper; Hardware/data. Image from https://heywhatwhatdidyousay.wordpress.com]

  • What version?

Different package managers have different versions of software (repology.org, 2019/08/19)

GNU Astronomy Utilities (Gnuastro) packaging status

    Repository                    Version
    Debian Oldstable              0.2.33
    Debian Stable                 0.8
    Debian Testing                0.10
    Debian Unstable               0.10
    Deepin                        0.5
    Devuan 2.0 (ASCII)            0.2.33
    Devuan 3.0 (Beowulf)          0.8
    Devuan Unstable               0.10
    DPorts                        0.9
    FreeBSD Ports                 0.10
    Funtoo 1.3                    0.3
    Gentoo                        0.3
    Kali Linux Rolling            0.10
    openSUSE Leap 15.1            0.8
    openSUSE Tumbleweed           0.8
    openSUSE Science Tumbleweed   0.8
    Pardus                        0.2.33
    Parrot                        0.10
    PLD Linux                     0.8
    PureOS green                  0.8
    PureOS landing                0.8
    Raspbian Oldstable            0.2.33
    Raspbian Stable               0.8
    Raspbian Testing              0.10
    Ubuntu 18.04                  0.5
    Ubuntu 18.10                  0.7
    Ubuntu 19.04                  0.8
    Ubuntu 19.10                  0.8
    Ubuntu 19.10 Proposed         0.10

Astropy packaging status

    Repository                    Version
    Debian Stable                 3.1.2
    Debian Testing                3.1.2
    Debian Unstable               3.2.1
    Deepin                        3.0.2
    Devuan 3.0 (Beowulf)          3.1.2
    Devuan Unstable               3.2.1
    Kali Linux Rolling            3.1.2
    Parrot                        3.1.2
    PureOS green                  3.1.2
    PureOS landing                3.1.2
    Raspbian Stable               3.1.2
    Raspbian Testing              3.1.2
    Ubuntu 18.04                  3.0
    Ubuntu 18.10                  3.0.4
    Ubuntu 19.04                  3.1.1
    Ubuntu 19.10                  3.1.2
    Ubuntu 19.10 Proposed         3.2.1

General outline of a project (continued)

  • Repository?
  • Dependencies?
  • Dep. versions?
  • Config options?
  • Config environment?

Example: Matplotlib (a Python visualization library) build dependencies

From "Attributing and Referencing (Research) Software: Best Practices and Outlook from Inria" (Alliez et al. 2019, hal-02135891).

Impact of "Dependency hell" on native building in various hardware (CPU architectures)

  • Astropy depends on Matplotlib.
  • GNU Astronomy Utilities doesn't.

General outline of a project (continued)

Existing solutions:

  • Virtual machines
  • Containers (e.g., Docker)
  • OSs (e.g., Nix, GNU Guix)

Further questions:

  • Database, or PID?
  • Calibration/version?
  • Integrity?
  • What order?
  • Runtime options?
  • Human error?
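
The "Integrity?" question about input data can be illustrated with a checksum comparison. The following is a minimal sketch and not the method used in the slides: it assumes the project has recorded a SHA-256 checksum for each input file, and the function names sha256_of_file and verify_input are hypothetical, chosen only for this example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_input(path, expected_sha256):
    """Raise ValueError if the file does not match the recorded checksum.

    `expected_sha256` is assumed to be a checksum stored alongside the
    project (hypothetical convention for this sketch).
    """
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path}: checksum mismatch "
            f"(expected {expected_sha256}, got {actual})"
        )
    return actual
```

Calling verify_input on each input file before the analysis starts would stop the run as soon as a downloaded or copied dataset differs from the one the results were derived from.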
