Introduction to Reproducible Science in R 2 Contents
Total Page:16
File Type:pdf, Size:1020Kb
Brian Lee Yung Rowe Introduction to Reproducible Science in R 2 Contents 1 The reproducible scientific method 1 1.1 The rise of operational models . 3 1.2 The importance of being functional . 8 1.2.1 Determinism of programs . 8 1.2.2 Lazy evaluation . 10 1.2.3 Generic versus specific types . 10 1.2.4 Computational graphs . 11 1.3 The UNIX philosophy . 13 1.4 Other approaches to reproducible science . 13 1.4.1 The tidyverse . 14 1.4.2 ReproZip . 14 1.4.3 REANA . 14 1.5 Coding conventions . 15 1.6 Package dependencies and datasets . 15 1.7 Summary . 16 1.8 Exercises . 17 I Tools for Model Development 19 2 Core tools and requirements 21 2.1 GNU/Linux . 21 2.1.1 The command shell . 22 2.1.2 File permissions and scripts . 25 2.1.3 Linking commands with the pipe . 26 2.1.4 File redirects . 28 2.1.5 Return codes . 29 2.1.6 Processes . 30 2.1.7 Variables . 32 2.1.8 Functions . 33 2.1.9 Conditional expressions . 35 2.1.10 Conditional statements . 36 2.1.11 Case statements . 37 2.1.12 Loops . 37 2.2 The make build tool . 41 2.2.1 Variables . 42 2.2.2 Conditional blocks . 44 i ii 2.2.3 Macros and functions . 44 2.3 Literate programming and LATEX................. 45 2.3.1R documentation . 46 2.3.2 Articles, papers, and reports . 48 2.3.3 Notebooks . 50 2.4 Containerization and Docker . 50 2.5R, Python, and interoperability . 52 2.6 The crant utility . 55 2.6.1 Creating projects and packages . 56 2.6.2 Building packages . 58 2.6.3 Developing multiple packages . 58 2.6.4 Installing packages . 59 2.7 Exercises . 60 3 Project conventions 63 3.1 Directory structure . 64 3.2 CreatingRfiles . 66 3.3 Working with dependencies . 69 3.4 Documenting your work . 70 4 Source code management 71 4.1 Version management . 73 4.2 Branches . 76 4.3 Merging branches . 78 4.4 Remote repositories . 83 4.5 Exercises . 85 5 Formalizing code in a package 87 5.1 Creating a package . 87 5.2 Building a package . 89 5.3 Using packages . 91 5.4 Testing packages . 92 5.5 Continuous integration . 93 5.6 Exercises . 95 6 Developing with containers 97 6.1 Working in a container . 100 6.1.1 Volume mapping . 102 6.1.2 Using graphics in a container . 103 6.2 Managing containers . 103 6.3 Working with images . 104 6.4 Running a notebook from a container . 104 6.5 Exercises . 107 II Model Development Workflows 109 iii 7 Workflows and repeatability 111 8 Exploratory analysis and hypothesis creation 115 9 Model design 127 10 Operationalization 135 10.1 Data processing as a pipeline . 137 10.2 Job initiation . 138 10.3 Scheduling jobs with cron.................. 141 10.4 Event-driven processing . 142 10.5 Error handling . 145 10.6 Development environments . 149 11 Reporting and visualization 153 11.1 Rmarkdown and knitr forflat reports . 154 11.2 Notebooks for interactive documents . 159 11.3 Shiny for interactive visualizations . 159 11.4 OpenCPU for dashboards . 162 11.5 Summary . 162 11.6 Exercises . 163 III A Discipline for Data Science 165 12 Algorithm styles 167 13 A brief history of object-oriented programming 173 14 Elements of a functional programming language 179 14.1 A taste of functional programming sorcery . 179 14.1.1 Everything I see is a function to me . 180 14.1.2 Iteration over values . 182 14.1.3 Composing functions . 185 14.1.4 Eliminating conditional expressions . 186 14.1.5 How functional is your code? . 187 14.2 Properties of the functional programming style . 188 14.3 Function representation . 190 14.3.1 Function signatures . 191 14.3.2 Function definitions . 191 14.3.3 Function application . 191 14.4 Vectorization . 191 14.5 First-Class Functions . 194 14.6 Closures . 196 iv 15 Programming conventions 203 15.1 PERL’s three virtues . 203 15.2 The Erlang manifesto . 204 15.3 Self-documenting code . 204 15.4 Abbreviations . 205 15.5 Suffixes . 206 15.6 Line length . 206 15.7 Indentation . 207 15.8 Spaces . 207 15.9 Consistency . 208 16 Troubleshooting and diagnostics 209 16.1 Testing . 209 16.1.1 Edge cases and well-formed data . 209 16.1.2 Function boundaries . 211 16.1.3 Property testing . 211 16.1.4 Model validation . 213 16.1.5 A note on test coverage . 213 16.2 Logging . 214 16.3 Debugging . 214 16.4 Exercises . 215 17 Mathematical data structures 217 17.1 Scalars . 218 17.2 Sets . 219 17.3 Tuples . 219 17.4 Relations . 220 17.5 Predicates . 220 17.6 Vectors . 220 17.7 Matrices . 221 17.8 Tensors . 221 18 Data structures in data science 223 18.1 Tabular data . 223 18.2 Hierarchical data . 224 18.3 Summary . 224 18.4 Exercises . 224 19 Common data transformations 227 19.1 Retrieving and transforming input . 228 19.2 Data normalization . 229 19.3 Feature engineering . 230 19.4 Partitioning and aggregation . 230 19.5 Matching model interfaces . ..