Variability of the Neuroimaging Results Across OS, and How to Avoid It

Total pages: 16 · File type: PDF · Size: 1020 KB

Center for Open Neuroscience ("cogito ergo sum")
Yaroslav O. Halchenko, Dartmouth College, Hanover, NH, USA
May 2018
http://datalad.org
http://www.pymvpa.org

Houston, we've got a problem...
- Gronenschild et al. (2012)
- Glatard et al. (2015)

How should a neuroimaging study be implemented to make it verifiably re-executable?

Simple re-executable neuroimaging publication (Ghosh et al., 2017)

Goal: investigate and document procedures to generate a verifiable, re-executable publication
- obtain the same result using the same software on the same data
- make it easy to check whether results hold when the computational environment changes

Approach
- make the entire analysis fully scripted and the full computing environment unambiguously specified
- complement the classical publication with all necessary code, data, and documentation for anyone to (re)run the analysis (on the same or different data/environment)

Simple re-executable: Data
- input dataset created by querying the NITRC Image Repository (NITRC-IR; nitrc.org/ir) for MRI images: subject age = 10-15 years, magnet field strength = 3
- 24 subjects from the 1000 Functional Connectomes project (Biswal et al., 2010): age = 13.5 ± 1.4 years; 16 males, 8 females; 8 right-handed, 1 left-handed, 15 unknown
- data provided to the analysis as a simple list of URLs

Simple re-executable: Analysis software
- FSL (fsl.fmrib.ox.ac.uk/fsl/fslwiki): brain segmentation and parcellation
- nipype (nipy.org/nipype): Python framework to "glue" all tools together into a workflow
- NeuroDebian (neuro.debian.net): turnkey software platform for neuroscience
- NITRC-CE (nitrc.org/ce): NeuroDebian-based computing environment "in the cloud"

Simple re-executable: Nipype workflow
- github.com/ReproNim/simple_workflow/blob/master/run_demo_workflow.py
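The stated goal, obtaining the same result and making it easy to check, can be sketched as a byte-level checksum comparison of output files. This is a minimal illustration in plain Python; the function names are hypothetical, not part of the published workflow:

```python
import hashlib

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def same_result(reference_path, rerun_path):
    """True if a re-executed output is byte-identical to the reference."""
    return file_digest(reference_path) == file_digest(rerun_path)
```

Byte identity is the strictest possible criterion; as the rest of the talk shows, results from different environments often differ at this level, so in practice numeric tolerances per output value may be more informative.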
Simple re-executable: GitHub repository/publication
The analysis (full or partial) could be executed on:
- a custom/personal computing environment (e.g., an OS X desktop)
- a pre-deployed environment such as NITRC-CE
- a pre-generated Docker environment, to obtain the "reference result"
- a custom (Docker image) environment, which you can generate by specifying some past date
- CircleCI, whenever any changes are pushed to GitHub (buzzword: "Continuous Integration")

How do the results of running on OS X differ from a reference run (on NeuroDebian GNU/Linux)?

Result: Reference vs OS X

How could using the same software and data produce different results?

The "same" code but different behavior:
- different arithmetic precision across platforms
- different compiler flags could trigger different optimizations and floating-point arithmetic behavior; even the very first step of data conversion (from DICOM to NIfTI) could be affected!
- different implementations of core libraries (e.g., LAPACK and BLAS)

Environment (variables) matter:
- the choice of which tool is actually executed (PATH)
- the choice of libraries being used (LD_LIBRARY_PATH)
- the actual behavior of the tool: see, e.g., AFNI's README.environment for a list of > 300 environment variables that could alter AFNI's behavior
Visit www.reproducibleimaging.org/module-reproducible-basics/01-shell-basics/ for more information

Beware: No software is written by God
(Unknown artist/origin, borrowed from http://blogs.quovantis.com/god-programmer/)

Beware: All software has bugs!
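One source of the variability listed above, floating-point addition not being associative, can be demonstrated without any neuroimaging tools. Compilers and BLAS/LAPACK implementations differ precisely in how they order and group such reductions:

```python
import math

vals = [1e16, 1.0, -1e16, 1.0]

# The same numbers, summed in different orders, give different answers:
naive = sum(vals)              # left-to-right: the first 1.0 is absorbed by 1e16
reordered = sum(sorted(vals))  # different grouping, a different result
exact = math.fsum(vals)        # correctly rounded sum: 2.0

print(naive, reordered, exact)
```

If two builds of the same tool vectorize a reduction differently, they are effectively computing `naive` vs. `reordered`, and downstream nonlinear steps (registration, segmentation) can amplify such tiny differences.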
Beware: Even data can have bugs!

Bugs: bad news for "exact" reproducibility
- e.g., some lead to random (often 0s, but not always) data being used as a starting point

Bugs: good news for science, but...
- some bugs do get fixed
- "most recent" version does not mean "more correct"

Lessons learned, or: How to make any analysis re-executable and its results reproducible?

How to improve reproducibility
- Establish efficient code and data management to retain the full history of changes
- Test (unit-, regression-) your analysis and assumptions
- (Re)use public datasets
- Collect information about your computational environment
- Create your own virtualized/containerized computational environments from an exact/unambiguous specification
- Automate your analysis as much as possible

Sounds Utopian? The neuroimaging community is blessed with great initiatives and tools to make it actually possible:
- Code and data management: BIDS, ReproIn, DataLad (git, git-annex), ...
- Testing: PyTest, MOxUnit, CTest, Travis-CI, CircleCI, ...
- Public datasets: NITRC-IR, OpenfMRI/OpenNeuro, INDI, datasets.datalad.org, ...
- Computational environment capture: ReproZip, NICEMAN, ...
- Virtualized/containerized environments from an exact specification: NeuroDebian, NITRC-CE, NeuroDocker, ...
- Analysis automation: nipype, ...

BIDS: Brain Imaging Data Structure (bids-apps.neuroimaging.io)
- a directory/files structure for neuroimaging
- Benefits:
  - BIDS is both machine- and human-friendly
  - could be automatically verified using bids-validator (github.com/INCF/bids-validator)
  - turnkey use of pre-crafted BIDS-apps such as fmriprep, mriqc, etc. (bids-apps.neuroimaging.io)

ReproIn: Reproducible Input (reproin.repronim.org)
- ReproIn: pile of DICOMs → BIDS
- the ReproIn heuristic is part of github.com/nipy/heudiconv
- Benefits:
  - minimal, one-time investment of adhering to a sequence naming convention; to convert existing studies, look into heudiconv to create a custom conversion heuristic
  - all datasets within a center are organized into a hierarchy reflecting the hierarchy at the scanner console
  - sidecar .json files in BIDS contain "useful" DICOM fields (no more "lost slice order")
  - DICOM files are retained under sourcedata/ (easy to re-convert if needed)
  - all data (optionally) maintained under a distributed version control system (DataLad) to facilitate incremental updates, collaboration, and orchestration of data flow across computing infrastructure

DataLad: distribution and management platform for all digital artifacts (datalad.org)
- data discovery, access, version control, ...
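The BIDS naming convention is strict enough to be checked mechanically, which is what bids-validator does far more thoroughly. A toy sketch of the idea for T1-weighted anatomical images (the pattern is a deliberate simplification, not the real validator's rules):

```python
import re

# Simplified layout: sub-<label>/anat/sub-<label>_T1w.nii.gz
# Real BIDS allows sessions, more entities, and more suffixes.
BIDS_T1W = re.compile(
    r"^sub-(?P<sub>[a-zA-Z0-9]+)/anat/sub-(?P=sub)_T1w\.nii\.gz$"
)

def looks_like_bids_t1w(relpath):
    """True if a relative path matches the (simplified) BIDS T1w layout."""
    return BIDS_T1W.match(relpath) is not None
```

Because the convention is machine-checkable, BIDS-apps can locate inputs by pattern instead of per-study configuration; that is what makes "turnkey" use of fmriprep or mriqc possible.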
DataLad: 10,000 ft overview
- uses git and git-annex for managing all digital artifacts of science (e.g., code and data)
- allows managing data spread across a wide range of local or cloud resources
- provides access to over 11 TB of neuroimaging data from various data providers: authentication, crawling of websites with data resources, getting data from archives, publishing new or derived data
- assists with recording the provenance of analyses
- makes meta-data useful to normal humans
- can be used to store images of computation environments

NeuroDebian: turnkey platform for neuroscience (neuro.debian.net)

  apt-get install python-nipype heudiconv datalad fsl afni singularity-container

What is NeuroDebian?
[Diagram: the NeuroDebian ecosystem: packaging and QA/testing (build-time, maintainer, and interoperability tests); portability (Windows, Mac OS X, virtual machines; Debian/Ubuntu on i386, amd64, armel, mips(el), kfreebsd); communication (mailing lists, IRC #neurodebian on OFTC, blog, conference booths, talks); legal checks; repository snapshotting and a worldwide network of mirrors (DE, GR, US-CA, ...)]
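git-annex, on which DataLad builds, tracks large files by content rather than by name. The core idea, a key derived from the file's checksum with the file stored under that key, can be sketched as follows (a conceptual illustration only; git-annex's actual key format and store layout differ):

```python
import hashlib
import os
import shutil

def annex_store(path, store_dir):
    """Move a file into a content-addressed store; return its key."""
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(store_dir, exist_ok=True)
    dest = os.path.join(store_dir, key)
    if not os.path.exists(dest):
        shutil.move(path, dest)  # identical content is stored only once
    return key
```

Identical files map to the same key, so duplicates are deduplicated automatically, and the key doubles as an integrity check; git then versions only the small key, not the bulky data.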
NeuroDocker: creator of containerized environments for neuroimaging (github.com/kaczmarj/neurodocker)

NICEMAN: Neuroimaging Computation Environments Manager (niceman.repronim.org)
- Collect sufficient information about a computing environment to:
  - document it (include alongside the publication)
  - compare or validate it
  - re-instantiate it (exactly)
- Discover information about: Debian-based distributions; Conda; virtualenv, pip; version control systems (git, svn); WiP: containers (Docker, Singularity)
- Unify the interface to work with different resources: local host, remote hosts (ssh), Docker, Singularity, cloud (Amazon)

Reproducibility is in reach! Free and Open Source Software, data sharing, and good code and data management practices are helping to make it happen.

Acknowledgments: David Kennedy, Satrajit Ghosh; NSF-CRCNS (1429999)
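Collecting basic facts about a computing environment, the kind of information NICEMAN gathers far more completely, takes only a few lines of standard-library Python (a minimal sketch; NICEMAN's actual interface and output differ):

```python
import json
import platform
import sys

def environment_snapshot():
    """Capture a minimal description of the current computing environment."""
    return {
        "os": platform.system(),          # e.g. "Linux" or "Darwin"
        "os_release": platform.release(),
        "machine": platform.machine(),    # e.g. "x86_64"
        "python": sys.version.split()[0],
    }

# Emit as JSON so it can be archived alongside the analysis outputs.
print(json.dumps(environment_snapshot(), indent=2))
```

Even this crude snapshot, committed next to the results, is enough to later ask "was the OS X run on the same OS release as the reference run?"; full specifications (package versions, libraries) are what make exact re-instantiation possible.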