Latest/Contributing.Html 12 13 14
Total Page:16
File Type:pdf, Size:1020Kb
v0.14+332.gd98c6648.dirty Handbook Introduction • Advanced topics • Use cases ADINA WAGNER &MICHAEL HANKE with Laura Waite, Kyle Meyer, Marisa Heckner, Benjamin Poldrack, Yaroslav Halchenko, Chris Markiewicz, Pattarawat Chormai, Lisa N. Mochalski, Lisa Wiersch, Jean-Baptiste Poline, Nevena Kraljevic, Alex Waite, Lya K. Paas, Niels Reuter, Peter Vavra, Tobias Kadelka, Peer Herholz, Alexandre Hutton, Sarah Oliveira, Dorian Pustina, Hamzah Hamid Baagil, Tristan Glatard, Giulia Ippoliti, Christian Mönch, Togaru Surya Teja, Dorien Huijser, Ariel Rokem, Remi Gau, Judith Bomba, Konrad Hinsen, Wu Jianxiao, Małgorzata Wierzba, Stefan Appelhoff, Michael Joseph, Tamara Cook, Stephan Heunis, Joerg Stadler, Sin Kim, Oscar Esteban CONTENTS I Introduction1 1 A brief overview of DataLad2 1.1 On Data..........................................2 1.2 The DataLad Philosophy.................................3 2 How to use the handbook5 2.1 For whom this book is written..............................5 2.2 How to read this book..................................5 2.3 Let’s get going!......................................8 3 Installation and configuration 10 3.1 Install DataLad...................................... 10 3.2 Initial configuration................................... 17 4 General prerequisites 19 4.1 The Command Line................................... 19 4.2 Command Syntax.................................... 20 4.3 Basic Commands..................................... 20 4.4 The Prompt........................................ 21 4.5 Paths........................................... 21 4.6 Text Editors........................................ 22 4.7 Shells........................................... 22 4.8 Tab Completion...................................... 23 5 What you really need to know 24 5.1 DataLad datasets..................................... 24 5.2 Simplified local version control workflows....................... 25 5.3 Consumption and collaboration............................. 25 5.4 Dataset linkage...................................... 26 5.5 Full provenance capture and reproducibility...................... 26 5.6 Third party service integration............................. 27 5.7 Metadata handling.................................... 27 5.8 All in all. ......................................... 28 II Basics 29 6 DataLad datasets 31 6.1 Create a dataset..................................... 31 6.2 Populate a dataset.................................... 34 i 6.3 Modify content...................................... 40 6.4 Install datasets...................................... 43 6.5 Dataset nesting...................................... 50 6.6 Summary......................................... 53 7 DataLad, Run! 56 7.1 Keeping track....................................... 56 7.2 DataLad, Re-Run!..................................... 61 7.3 Input and output..................................... 67 7.4 Clean desk........................................ 75 7.5 Summary......................................... 78 8 Under the hood: git-annex 80 8.1 Data safety........................................ 80 8.2 Data integrity....................................... 82 9 Collaboration 89 9.1 Looking without touching................................ 89 9.2 Where’s Waldo?..................................... 96 9.3 Retrace and reenact................................... 98 9.4 Stay up to date...................................... 100 9.5 Networking........................................ 102 9.6 Summary......................................... 107 10 Tuning datasets to your needs 109 10.1 DIY configurations................................... 109 10.2 More on DIY configurations.............................. 114 10.3 Configurations to go.................................. 123 10.4 Summary........................................ 126 11 Make the most out of datasets 132 11.1 A Data Analysis Project with DataLad......................... 132 11.2 YODA: Best practices for data analyses in a dataset................. 133 11.3 YODA-compliant data analysis projects........................ 140 11.4 Summary........................................ 153 12 One step further 161 12.1 More on Dataset nesting................................ 161 12.2 Computational reproducibility with software containers............... 163 12.3 Summary........................................ 169 13 Third party infrastructure 174 13.1 Beyond shared infrastructure............................. 174 13.2 Dataset hosting on GIN................................. 185 13.3 Amazon S3 as a special remote............................ 192 13.4 Overview: Publishing datasets............................. 201 13.5 Summary........................................ 205 14 Help yourself 207 14.1 What to do if things go wrong............................. 207 14.2 Miscellaneous file system operations......................... 207 14.3 Back and forth in time................................. 228 14.4 How to get help..................................... 241 ii 14.5 Gists........................................... 256 III Advanced 263 15 Advanced options 265 15.1 How to hide content from DataLad.......................... 265 15.2 DataLad extensions................................... 268 15.3 DataLad’s result hooks................................. 270 15.4 Configure custom data access............................. 273 15.5 Remote Indexed Archives for dataset storage and backup.............. 277 15.6 Prioritizing subdataset clone locations........................ 293 15.7 Subsample datasets using datalad copy-file...................... 296 16 Go big or go home 306 16.1 Going big with DataLad................................ 306 16.2 Calculate in greater numbers............................. 309 16.3 Fixing up too-large datasets.............................. 311 16.4 Summary........................................ 313 17 Computing on clusters 314 17.1 DataLad on High Throughput or High Performance Compute Clusters....... 314 17.2 DataLad-centric analysis with job scheduling and parallel computing....... 315 17.3 Walkthrough: Parallel ENKI preprocessing with fMRIprep.............. 324 18 Better late than never 335 18.1 Transitioning existing projects into DataLad..................... 335 19 Special purpose showrooms 340 19.1 Reproducible machine learning analyses: DataLad as DVC............. 340 IV Use cases 365 20 A typical collaborative data management workflow 367 20.1 The Challenge...................................... 367 20.2 The DataLad Approach................................. 367 20.3 Step-by-Step....................................... 368 21 Basic provenance tracking 372 21.1 The Challenge...................................... 372 21.2 The DataLad Approach................................. 372 21.3 Step-by-Step....................................... 372 22 Writing a reproducible paper 377 22.1 The Challenge...................................... 377 22.2 The DataLad Approach................................. 378 22.3 Step-by-Step....................................... 378 22.4 Automation with existing tools............................ 381 23 Student supervision in a research project 385 23.1 The Challenge...................................... 385 23.2 The DataLad Approach................................. 386 iii 23.3 Step-by-Step....................................... 386 24 A basic automatically and computationally reproducible neuroimaging analysis 390 24.1 The Challenge...................................... 390 24.2 The DataLad Approach................................. 390 24.3 Step-by-Step....................................... 391 25 An automatically and computationally reproducible neuroimaging analysis from scratch 398 25.1 The Challenge...................................... 398 25.2 The DataLad Approach................................. 399 25.3 Step-by-Step....................................... 399 26 Scaling up: Managing 80TB and 15 million files from the HCP release 411 26.1 The Challenge...................................... 412 26.2 The DataLad Approach................................. 412 26.3 Step-by-Step....................................... 413 27 Building a scalable data storage for scientific computing 421 27.1 The Challenge...................................... 421 27.2 The DataLad approach................................. 422 27.3 Step-by-step....................................... 423 28 Using Globus as a data store for the Canadian Open Neuroscience Portal 427 28.1 The Challenge...................................... 427 28.2 The Datalad Approach................................. 428 28.3 Step-by-Step....................................... 429 28.4 Resources........................................ 431 29 DataLad for reproducible machine-learning analyses 432 29.1 The Challenge...................................... 432 29.2 The DataLad Approach................................. 433 29.3 Step-by-Step....................................... 433 29.4 References........................................ 446 30 Contributing 447 V Appendix 448 A Glossary 449 B Frequently Asked Questions 457 B.1 What is Git?....................................... 457 B.2 Where is Git’s “staging area” in DataLad datasets?.................. 457 B.3 What is git-annex?.................................... 457 B.4 What does DataLad add to Git and git-annex?..................... 458 B.5 Does DataLad host my data?.............................. 458 B.6 How does GitHub relate to DataLad?.......................... 459 B.7 Does DataLad scale to large dataset