Customizable Framework for Managing Data Lineage

BIG Data, BIG responsibility Introducing Maneage: customizable framework for managing data lineage Mohammad Akhlaghi Instituto de Astrof´ısicade Canarias (IAC), Tenerife, Spain PLACE HOLDER MONTH DAY, YEAR Most recent slides available in link below (this PDF is built from Git commit 7c49cdd): https://maneage.org/pdf/slides-intro.pdf Let's start with this nice image of the Wirlpool galaxy (M51): https://i.redd.it/jfqgpqg0hfk11.jpg Now, let's assume you want to study M51's outer structure, but you'll have to detect it first. Example: Using a single exposure SDSS image with NoiseChisel (a program that is part of `GNU Astronomy Utilities'). I When optimized, outskirts detected down to S=N =1=4, or 28:3 mag/arcsec2. By default, it only reaches S=N > 1=2. I Akhlaghi 2019 (arXiv:1909.11230) describes optimized result: I Run-time options/configuration. I Steps before/after NoiseChisel. Input image Default NoiseChisel I Deep/orange image from Watkins+2015 (arXiv:1501.04599) shown for reference. I Therefore: I Default settings not enough. I Final number not just from NoiseChisel (more software involved). Simply reporting in your paper that \we used NoiseChisel" is not enough to reproduce, understand, Optimized NoiseChisel Much deeper image or verify your result. Perspectives on Reproducibility and Sustainability of Open-Source Scientific Software \It is our interest that NASA adopt an open-code policy because without it, reproducibility in computational science is needlessly hampered". From Oishi+2018, (arXiv:1801.08200). Schroedinger's code: source code availability and link persistence in astrophysics \We were unable to find source code online ... for 40:4% of the codes used in the research we looked at". From Allen+2018, (arXiv:1801.02094). Reproducibility crisis in the sciences/astronomy Snakes on a Spaceship { An Overview of Python in Heliophysics \...inadequate analysis descriptions and loss of scientific data have made scientific studies difficult or impossible to replicate". From Burrell+2018, (arXiv:1901.00143). Schroedinger's code: source code availability and link persistence in astrophysics \We were unable to find source code online ... for 40:4% of the codes used in the research we looked at". From Allen+2018, (arXiv:1801.02094). Reproducibility crisis in the sciences/astronomy Snakes on a Spaceship { An Overview of Python in Heliophysics \...inadequate analysis descriptions and loss of scientific data have made scientific studies difficult or impossible to replicate". From Burrell+2018, (arXiv:1901.00143). Perspectives on Reproducibility and Sustainability of Open-Source Scientific Software \It is our interest that NASA adopt an open-code policy because without it, reproducibility in computational science is needlessly hampered". From Oishi+2018, (arXiv:1801.08200). Reproducibility crisis in the sciences/astronomy Snakes on a Spaceship { An Overview of Python in Heliophysics \...inadequate analysis descriptions and loss of scientific data have made scientific studies difficult or impossible to replicate". From Burrell+2018, (arXiv:1901.00143). Perspectives on Reproducibility and Sustainability of Open-Source Scientific Software \It is our interest that NASA adopt an open-code policy because without it, reproducibility in computational science is needlessly hampered". From Oishi+2018, (arXiv:1801.08200). Schroedinger's code: source code availability and link persistence in astrophysics \We were unable to find source code online ... for 40:4% of the codes used in the research we looked at". From Allen+2018, (arXiv:1801.02094). Original image from https://www.redbubble.com \Reproducibility crisis" in the sciences? (Baker 2016, Nature 533, 452) Definitions & Clarification (from the National Academies report in 2019, DOI:10.17226/25303) Replicability (hardware/statistical) I Involves data collection. I Inherently includes measurements errors (can never be exactly reproduced). I Example: Raw telescope image/spectra. I NOT DISCUSSED HERE. http://slittlefair.staff.shef.ac.uk Definitions & Clarification (from the National Academies report in 2019, DOI:10.17226/25303) Replicability (hardware/statistical) I Involves data collection. I Inherently includes measurements errors (can never be exactly reproduced). I Example: Raw telescope image/spectra. I NOT DISCUSSED HERE. http://slittlefair.staff.shef.ac.uk Definitions & Clarification (from the National Academies report in 2019, DOI:10.17226/25303) Replicability (hardware/statistical) Reproducibility (Software/Deterministic) I Involves data collection. I Involves data analysis, or simulations. I Inherently includes measurements errors I Starts after data is collected/digitized. (can never be exactly reproduced). I Example: Raw telescope image/spectra. I Example: 2 + 2 = 4 (i.e., sum of datasets). I NOT DISCUSSED HERE. I DISCUSSED HERE. https://tsongas.com http://slittlefair.staff.shef.ac.uk General outline of a project (after data collection) Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. General outline of a project (after data collection) Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. https://heywhatwhatdidyousay.wordpress.com General outline of a project (after data collection) What version? Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. https://heywhatwhatdidyousay.wordpress.com Different package managers have different versions of software (repology.org, 2019/11/20) GNU Astronomy Utilities (Gnuastro) PPackagingackaging statusstatus DDebianebian OldstableOldstable 00.2.33.2.33 DDebianebian StableStable 00.8.8 DDebianebian TestingTesting 00.10.10 DDebianebian UnstableUnstable 00.10.10 DDebianebian ExperimentalExperimental 00.10.39.10.39 Astropy DDeepineepin 00.5.5 PPackagingackaging statusstatus DDevuanevuan 2.02.0 (ASCII)(ASCII) 00.2.33.2.33 DDevuanevuan 3.03.0 (Beowulf)(Beowulf) 00.8.8 DDebianebian StableStable 33.1.2.1.2 DDevuanevuan UnstableUnstable 00.10.10 DDebianebian TestingTesting 33.2.3.2.3 DDPortsPorts 00.9.9 DDebianebian UnstableUnstable 33.2.3.2.3 FFreeBSDreeBSD PortsPorts 00.10.10 DDeepineepin 33.0.2.0.2 FFuntoountoo 1.31.3 00.3.3 DDevuanevuan 3.03.0 (Beowulf)(Beowulf) 33.1.2.1.2 FFuntoountoo 1.41.4 00.3.3 DDevuanevuan UnstableUnstable 33.2.3.2.3 GGentooentoo 00.3.3 KKaliali LinuxLinux RollingRolling 33.2.3.2.3 GGNUNU GGuixuix 00.10.10 PParrotarrot 33.2.3.2.3 KKaliali LinuxLinux RollingRolling 00.10.10 PPureOSureOS AmberAmber 33.1.2.1.2 oopenSUSEpenSUSE LeapLeap 15.115.1 00.8.8 PPureOSureOS landinglanding 33.2.3.2.3 oopenSUSEpenSUSE LeapLeap 15.215.2 00.8.8 RRaspbianaspbian StableStable 33.1.2.1.2 oopenSUSEpenSUSE TumbleweedTumbleweed 00.8.8 RRaspbianaspbian TestingTesting 33.2.3.2.3 oopenSUSEpenSUSE ScienceScience TumbleweedTumbleweed 00.10.10 UUbuntubuntu 18.0418.04 33.0.0 PPardusardus 00.2.33.2.33 UUbuntubuntu 18.1018.10 33.0.4.0.4 PParrotarrot 00.10.10 UUbuntubuntu 19.0419.04 33.1.1.1.1 PPLDLD LLinuxinux 00.8.8 UUbuntubuntu 19.1019.10 33.2.1.2.1 PPureOSureOS AmberAmber 00.8.8 UUbuntubuntu 20.0420.04 33.2.2.2.2 PPureOSureOS landinglanding 00.10.10 UUbuntubuntu 20.0420.04 ProposedProposed 33.2.3.2.3 RRaspbianaspbian OldstableOldstable 00.2.33.2.33 RRaspbianaspbian StableStable 00.8.8 RRaspbianaspbian TestingTesting 00.10.10 UUbuntubuntu 18.0418.04 00.5.5 UUbuntubuntu 18.1018.10 00.7.7 UUbuntubuntu 19.0419.04 00.8.8 UUbuntubuntu 19.1019.10 00.10.10 UUbuntubuntu 20.0420.04 00.10.10 General outline of a project (after data collection) Repository? What version? Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. https://heywhatwhatdidyousay.wordpress.com General outline of a project (after data collection) Repository? What version? Dependencies? Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. https://heywhatwhatdidyousay.wordpress.com General outline of a project (after data collection) Repository? Dep. versions? What version? Dependencies? Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. https://heywhatwhatdidyousay.wordpress.com General outline of a project (after data collection) Config options? Repository? Dep. versions? What version? Dependencies? Software Build Run software on data Paper Hardware/data Green boxes with sharp corners: source/input components/files. Blue boxes with rounded corners: built components. Red boxes with dashed borders: questions that must be clarified for each phase. https://heywhatwhatdidyousay.wordpress.com General outline of a project (after data collection) Config environment? Config options?

Customizable Framework for Managing Data Lineage

Red Hat Enterprise Linux 8 Installing, Managing, and Removing User-Space Components

Using the Mysql Yum Repository Abstract

Debian Installation Manual

Using System Call Interposition to Automatically Create Portable Software Packages

GNU Guix Cookbook Tutorials and Examples for Using the GNU Guix Functional Package Manager

Managing Software with Yum

2004 USENIX Annual Technical Conference

Red Hat Developer Toolset 9 User Guide

Introduction to the Nix Package Manager

Functional Package and Configuration Management with GNU Guix

Chapter 7 Package Management

Bulletin Issue 25