Center for Open Neuroscience

Variability of the Neuroimaging results across OS, and how to avoid it

Yaroslav O. Halchenko Dartmouth College, Hanover, NH, USA May 2018

http://datalad.org http://www.pymvpa.org

Houston, we've got a problem...

Gronenschild et al. (2012)

Glatard et al. (2015)

How should a neuroimaging study be implemented to make it verifiably re-executable?

Simple re-executable neuroimaging publication

Ghosh et al. (2017)

Goal
- investigate and document procedures to generate a verifiable, re-executable publication:
  - obtain the same result using the same software on the same data
  - make it easy to check whether results hold when the computational environment changes

Approach
- make the entire analysis fully scripted and the full computing environment unambiguously specified
- complement the classical publication with all necessary code, data, and documentation for anyone to (re)run the analysis (on the same or different data/environment)

Simple re-executable: Data

- input dataset created by querying the NITRC Image Repository (NITRC-IR; nitrc.org/ir) for MRI images
  - subjects age = 10-15 years
  - magnet field strength = 3 T
- 24 subjects from the 1000 Functional Connectomes project (Biswal et al., 2010)
  - age = 13.5 +/- 1.4 years; 16 males, 8 females; 8 right-handed, 1 left-handed, 15 unknown
- data provided to the analysis as a simple list of URLs

Simple re-executable: Analysis software
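The "data provided as a simple list of URLs" idea from the data slide can be sketched in a few lines of stdlib Python; the function names and the separation into "derive local name" and "plan downloads" steps are illustrative, not the workflow's actual mechanism:

```python
import os
from urllib.parse import urlparse

def local_name(url):
    """Derive a local file name from a download URL."""
    return os.path.basename(urlparse(url).path)

def plan_downloads(url_list, dest="inputs"):
    """Map each input URL to the local path it would be fetched to."""
    return {url: os.path.join(dest, local_name(url)) for url in url_list}
```

The actual fetching could then use `urllib.request.urlretrieve`, or better, a tool such as DataLad that also records provenance of what was downloaded and when.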

- FSL (fsl.fmrib.ox.ac.uk/fsl/fslwiki): brain segmentation and parcellation
- nipype (nipy.org/nipype): Python framework to "glue" all tools together into a workflow
- NeuroDebian (neuro.debian.net): turnkey software platform for neuroscience
- NITRC-CE (nitrc.org/ce): NeuroDebian-based computing environment "in the cloud"

Simple re-executable: Nipype workflow

github.com/ReproNim/simple_workflow/blob/master/run_demo_workflow.py

Simple re-executable: GitHub repository/publication

Analysis (full or partial) could be executed on
- a custom/personal computing environment (e.g., OS X desktop)
- some pre-deployed environment such as NITRC-CE
- a pre-generated Docker environment, to obtain the "reference result"
- a custom (Docker image) environment, which you can generate by specifying some past date
- CircleCI, whenever any changes are pushed to GitHub (buzzword: "Continuous Integration")
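One concrete way to "generate an environment by specifying some past date" is Debian's snapshot archive (snapshot.debian.org), which serves the package repository exactly as it existed at a given timestamp. The sketch below is a hypothetical minimal Dockerfile illustrating the idea, not the one from the actual publication repository; the date and package choice are made up:

```dockerfile
# Hypothetical sketch: build against the Debian archive as of 2017-06-01,
# so the same package versions are installed no matter when you build.
FROM debian:stretch
RUN echo 'deb http://snapshot.debian.org/archive/debian/20170601T000000Z/ stretch main' \
      > /etc/apt/sources.list \
 && apt-get -o Acquire::Check-Valid-Until=false update \
 && apt-get install -y --no-install-recommends python-numpy \
 && rm -rf /var/lib/apt/lists/*
```

The `Acquire::Check-Valid-Until=false` option is needed because the Release files of a dated snapshot have long expired.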

How do results of running on OS X differ from a reference run (on NeuroDebian GNU/Linux)?

Result: Reference vs OS X

How could using the same software and data produce different results?

The "same" code but different behavior

- different arithmetic precision across platforms
- different compiler flags could trigger different optimizations and floating-point arithmetic behavior
- even the very first step of data conversion (from DICOM to NIfTI) can be affected!
- different implementations of core libraries (e.g., LAPACK and BLAS)

Environment (variables) matter
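The "different arithmetic precision" bullet is easy to demonstrate even on a single machine: IEEE 754 addition is not associative, so a compiler or library that merely reorders a sum can change the result at the last-bit level:

```python
# Floating-point addition is not associative: reordering a sum changes bits.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # 0.6000000000000001
right = a + (b + c)  # 0.6

print(left == right)      # False
print(abs(left - right))  # an ulp-scale difference, ~1.1e-16
```

Per-voxel, such differences are negligible; propagated through iterative algorithms (registration, segmentation), they can grow into the visible cross-OS discrepancies reported by Glatard et al. (2015).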

- the choice of which tool is actually executed (PATH)
- the choice of libraries being used (LD_LIBRARY_PATH)
- the actual behavior of the tool: see, e.g., AFNI's README.environment for a list of > 300 environment variables that can alter the behavior of AFNI
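The PATH point can be sketched as follows: the same command name resolves to whichever matching executable appears first on PATH, so two machines with differently ordered PATHs silently run different binaries. The stub tool named `bet` (after FSL's brain extraction tool) is used purely for illustration; this sketch assumes a POSIX system:

```python
import os
import shutil
import stat
import tempfile

def make_fake_tool(directory, name):
    """Create an executable stub called `name` inside `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path

with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    make_fake_tool(d1, "bet")   # pretend two FSL installations both ship `bet`
    make_fake_tool(d2, "bet")

    # Same command name, different PATH order -> a different binary is executed.
    first = shutil.which("bet", path=os.pathsep.join([d1, d2]))
    second = shutil.which("bet", path=os.pathsep.join([d2, d1]))
    print(first.startswith(d1), second.startswith(d2))  # True True
```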

Visit www.reproducibleimaging.org/module-reproducible-basics/01-shell-basics/ for more information Beware: No software is written by God

Unknown artist/origin, borrowed from http://blogs.quovantis.com/god-programmer/ Beware: All software has bugs!

Beware: Even data can have bugs!

"most recent" version does not mean "more correct"

Bugs: bad news for “exact” reproducibility

e.g., some bugs lead to random (often 0s, but not always) data being used as a starting point

Bugs: good news for science, but ...

some bugs do get fixed
"most recent" version does not mean "more correct"

Lessons learned, or: How to make any analysis re-executable and results reproducible?

How to improve reproducibility

- Establish efficient code and data management to retain full history of changes
- Test (unit-, regression-) your analysis and assumptions
- (Re)use public datasets
- Collect information about your computing environment
- Create your own virtualized/containerized computational environments from an exact/unambiguous specification
- Automate your analysis as much as possible

How to improve reproducibility
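The "test your analysis and assumptions" item can be as lightweight as a regression test that compares freshly computed outputs against stored reference values within a tolerance; the function, region names, and numbers below are invented for illustration (the actual simple_workflow publication compares per-region brain volumes in a similar spirit):

```python
def check_against_reference(results, reference, rel_tol=1e-6):
    """Compare a dict of computed scalar outputs against stored reference values.

    Returns the list of keys whose relative deviation exceeds rel_tol
    (or which are missing from results entirely).
    """
    failures = []
    for key, ref in reference.items():
        got = results.get(key)
        if got is None or abs(got - ref) > rel_tol * abs(ref):
            failures.append(key)
    return failures

# Hypothetical regional volumes (mm^3) from a reference run vs. a re-run:
reference = {"L_hippocampus": 4250.0, "R_hippocampus": 4312.5}
rerun     = {"L_hippocampus": 4250.0, "R_hippocampus": 4390.1}

print(check_against_reference(rerun, reference))  # ['R_hippocampus']
```

Run under pytest on every push (via Travis-CI or CircleCI), such a check flags environment-induced drift automatically.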

Sounds Utopian? The neuroimaging community is blessed with great initiatives and tools to make it actually possible.

How to improve reproducibility

- Establish efficient code and data management to retain full history of changes: BIDS, ReproIn, DataLad (git, git-annex), ...
- Test (unit-, regression-) your analysis and assumptions: PyTest, MOxUnit, CTest, Travis-CI, CircleCI, ...
- (Re)use public datasets: NITRC-IR, OpenfMRI/OpenNeuro, INDI, datasets.datalad.org, ...
- Collect information about your computing environment: ReproZip, NICEMAN, ...
- Create your own virtualized/containerized computational environments from an exact/unambiguous specification: NeuroDebian, NITRC-CE, NeuroDocker, ...
- Automate your analysis as much as possible: nipype, ...

BIDS: Brain Imaging Data Structure (bids-apps.neuroimaging.io)

BIDS: a directory/files structure for neuroimaging

BIDS Benefits:

- BIDS is both machine- and human-friendly
- can be automatically verified using bids-validator
- turnkey use of pre-crafted BIDS-apps such as fmriprep, mriqc, etc.
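A taste of the kind of check a validator performs on BIDS file names; the regular expression below covers only a toy subset of the specification (the real bids-validator checks vastly more, including directory layout and sidecar metadata):

```python
import re

# Toy subset of BIDS naming: sub-<label>[_ses-<label>]_task-<label>_bold.nii[.gz]
BIDS_FUNC = re.compile(
    r"^sub-[a-zA-Z0-9]+"
    r"(_ses-[a-zA-Z0-9]+)?"
    r"_task-[a-zA-Z0-9]+"
    r"_bold\.nii(\.gz)?$"
)

def looks_like_bids_bold(filename):
    """Very rough check that a functional image follows BIDS naming."""
    return BIDS_FUNC.match(filename) is not None

print(looks_like_bids_bold("sub-01_task-rest_bold.nii.gz"))  # True
print(looks_like_bids_bold("subject1_rest.nii.gz"))          # False
```

Because the naming is machine-parsable, BIDS-apps can discover all inputs for a subject without any custom configuration.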

github.com/INCF/bids-validator bids-apps.neuroimaging.io

ReproIn: Reproducible Input (reproin.repronim.org)

ReproIn: Pile of DICOMs → BIDS

The ReproIn heuristic is part of github.com/nipy/heudiconv

ReproIn: Benefits

- Minimal one-time investment of adhering to the sequence naming convention
  - to convert existing studies, look into HeuDiConv to create a custom conversion heuristic
- All datasets within a center are organized into a hierarchy reflecting the hierarchy at the scanner console
- Sidecar .json files in BIDS contain "useful" DICOM fields (no more "lost slice order")
- DICOM files are retained under sourcedata/ (easy to re-convert if needed)
- All data (optionally) are maintained under a distributed version control system (DataLad) to facilitate
  - incremental updates
  - collaboration
  - orchestration of data flow across computing infrastructure

DataLad: Distribution and management platform for all digital artifacts (datalad.org)

DataLad: data discovery, access, version control, ...

DataLad: 10,000 ft overview
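The sidecar .json point ("no more lost slice order") in practice: acquisition parameters that used to live only in DICOM headers travel next to the NIfTI as plain JSON. RepetitionTime and SliceTiming are real BIDS metadata keys; the values below are made up:

```python
import json

# Made-up sidecar content with real BIDS field names:
sidecar_text = """{
  "RepetitionTime": 2.0,
  "SliceTiming": [0.0, 1.0, 0.5, 1.5],
  "EchoTime": 0.03
}"""

sidecar = json.loads(sidecar_text)

# Recover the slice acquisition order directly from the sidecar
# instead of guessing it years later:
order = sorted(range(len(sidecar["SliceTiming"])),
               key=lambda i: sidecar["SliceTiming"][i])
print(order)  # [0, 2, 1, 3] -> interleaved acquisition
```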

- uses git and git-annex for managing all (e.g., code and data) digital artifacts of science
- allows managing data spread across a wide range of local or cloud resources
- provides access to over 11 TB of neuroimaging data from various data providers
  - authentication
  - crawling of websites with data resources
  - getting data from archives
- publishing new or derived data
- assists with recording provenance of the analyses
- makes meta-data useful to normal humans
- can be used to store images of computation environments

NeuroDebian: Turnkey platform for neuroscience (neuro.debian.net)

apt-get install python-nipype heudiconv datalad \ fsl afni singularity-container

neuro.debian.net

What is NeuroDebian?

[Figure: word cloud of the NeuroDebian ecosystem, spanning packaging and repositories (Debian and Ubuntu, i386/amd64/armel/mips(el)/kfreebsd, snapshotting, mirrors in DE, GR, US-CA, US-NH, US-TN, derivatives), QA/testing (build-time tests, maintainer tests, interoperability and compatibility testing, virtual machine images for Windows and Mac OS X, usage statistics), communication (mailing lists, IRC: #neurodebian on OFTC, identi.ca/twitter: neurodebian, NeuroDebian Insider blog, conference booth, personal talks, [email protected], proxy bug reports to the Debian BTS), and reproducible research, teaching, publications, and mentoring]

Neurodocker: Creator of containerized environments for neuroimaging (github.com/kaczmarj/neurodocker)

NICEMAN: Neuroimaging Computation Environments Manager (niceman.repronim.org)

NICEMAN:

- Collect sufficient information about the computing environment to
  - document (include alongside the publication)
  - compare or validate
  - re-instantiate (exactly)
- Discover information about
  - Debian-based distributions
  - Conda
  - virtualenv, pip
  - version control systems (git, svn)
  - WiP: containers (Docker, Singularity)
- Unified interface to work with different resources
  - local host
  - remote hosts (ssh)
  - Docker
  - Singularity
  - cloud (Amazon)

niceman.repronim.org

Reproducibility is in reach!
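The "collect sufficient information about the computing environment" idea can be sketched with the standard library alone; NICEMAN itself goes much further (down to installed package versions and their origins), so this is only a minimal illustration:

```python
import json
import platform
import sys

def environment_snapshot():
    """Record basic facts about the current computing environment."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "executable": sys.executable,
    }

# Dump a snapshot suitable for including alongside a publication:
snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Two such snapshots, taken at analysis time and re-execution time, make "same environment?" a diff rather than a guess.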

Free and Open Source Software, data sharing, and good code and data management practices are helping to make it happen.

Acknowledgments

People: David Kennedy, Satrajit Ghosh, Jean-Baptiste Poline, Matt Travers, Robert Buccigrossi, Christian Haselgrove, Mathias Goncalves, Matteo Visconti di Oleggio Castello, Michael Hanke, Kyle Meyer, Benjamin Poldrack

Funding: NSF-CRCNS (1429999), NIH-NIBIB (P41 EB019936)

FOSS developers of Python, NumPy, SciPy, NiPy, nipype, heudiconv, nibabel, Matplotlib, Shogun, scikit-learn, Neurodocker, Inkscape, OBS, git, git-annex, ...

Debian, NeuroDebian, and Neurostars communities

about the slides:

© (largely) 2018 Yaroslav O. Halchenko, CC BY-SA 3.0 (Creative Commons Attribution-ShareAlike 3.0), available from goo.gl/tXRyuv

References

Biswal, B. B., Mennes, M., Zuo, X.-N., Gohel, S., Kelly, C., Smith, S. M., Beckmann, C. F., Adelstein, J. S., Buckner, R. L., Colcombe, S., Dogonowski, A.-M., Ernst, M., Fair, D., Hampson, M., Hoptman, M. J., Hyde, J. S., Kiviniemi, V. J., Kötter, R., Li, S.-J., Lin, C.-P., Lowe, M. J., Mackay, C., Madden, D. J., Madsen, K. H., Margulies, D. S., Mayberg, H. S., McMahon, K., Monk, C. S., Mostofsky, S. H., Nagel, B. J., Pekar, J. J., Peltier, S. J., Petersen, S. E., Riedl, V., Rombouts, S. A. R. B., Rypma, B., Schlaggar, B. L., Schmidt, S., Seidler, R. D., Siegle, G. J., Sorg, C., Teng, G.-J., Veijola, J., Villringer, A., Walter, M., Wang, L., Weng, X.-C., Whitfield-Gabrieli, S., Williamson, P., Windischberger, C., Zang, Y.-F., Zhang, H.-Y., Castellanos, F. X., and Milham, M. P. (2010). Toward discovery science of human brain function. Proceedings of the National Academy of Sciences, 107(10):4734-4739.

Ghosh, S., Poline, J., Keator, D., Halchenko, Y., Thomas, A., Kessler, D., and Kennedy, D. (2017). A very simple, re-executable neuroimaging publication [version 2; referees: 1 approved, 3 approved with reservations]. F1000Research, 6(124).

Glatard, T., Lewis, L. B., Ferreira da Silva, R., Adalat, R., Beck, N., Lepage, C., Rioux, P., Rousseau, M.-E., Sherif, T., Deelman, E., Khalili-Mahani, N., and Evans, A. C. (2015). Reproducibility of neuroimaging analyses across operating systems. Frontiers in Neuroinformatics, 9:12.

Gorgolewski, K., Burns, C. D., Madison, C., Clark, D., Halchenko, Y. O., Waskom, M. L., and Ghosh, S. S. (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Front. Neuroinform., 5:13. PMC3159964.

Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., Flandin, G., Ghosh, S. S., Glatard, T., Halchenko, Y. O., Handwerker, D. A., Hanke, M., Keator, D., Li, X., Michael, Z., Maumet, C., Nichols, B. N., Nichols, T. E., Pellman, J., Poline, J.-B., Rokem, A., Schaefer, G., Sochat, V., Triplett, W., Turner, J. A., Varoquaux, G., and Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3:160044.

Gronenschild, E. H. B. M., Habets, P., Jacobs, H. I. L., Mengelers, R., Rozendaal, N., van Os, J., and Marcelis, M. (2012). The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PLoS ONE, 7.

Halchenko, Y. O. and Hanke, M. (2012). Open is not enough. Let's take the next step: An integrated, community-driven computing platform for neuroscience. Frontiers in Neuroinformatics, 6(00022). PMC3458431.

Jenkinson, M., Beckmann, C. F., Behrens, T. E. J., Woolrich, M. W., and Smith, S. M. (2012). FSL. NeuroImage, 62(2):782-790.