Brian Lee Yung Rowe

Introduction to Reproducible Science in R

Contents

1 The reproducible scientific method
  1.1 The rise of operational models
  1.2 The importance of being functional
    1.2.1 Determinism of programs
    1.2.2 Lazy evaluation
    1.2.3 Generic versus specific types
    1.2.4 Computational graphs
  1.3 The UNIX philosophy
  1.4 Other approaches to reproducible science
    1.4.1 The tidyverse
    1.4.2 ReproZip
    1.4.3 REANA
  1.5 Coding conventions
  1.6 Package dependencies and datasets
  1.7 Summary
  1.8 Exercises

I Tools for Model Development

2 Core tools and requirements
  2.1 GNU/Linux
    2.1.1 The command shell
    2.1.2 Permissions and scripts
    2.1.3 Linking commands with the pipe
    2.1.4 File redirects
    2.1.5 Return codes
    2.1.6 Processes
    2.1.7 Variables
    2.1.8 Functions
    2.1.9 Conditional expressions
    2.1.10 Conditional statements
    2.1.11 Case statements
    2.1.12 Loops
  2.2 The build tool
    2.2.1 Variables
    2.2.2 Conditional blocks
    2.2.3 Macros and functions
  2.3 Literate programming and LaTeX
    2.3.1 R documentation
    2.3.2 Articles, papers, and reports
    2.3.3 Notebooks
  2.4 Containerization and Docker
  2.5 R, Python, and interoperability
  2.6 The crant utility
    2.6.1 Creating projects and packages
    2.6.2 Building packages
    2.6.3 Developing multiple packages
    2.6.4 Installing packages
  2.7 Exercises

3 Project conventions
  3.1 Directory structure
  3.2 Creating R files
  3.3 Working with dependencies
  3.4 Documenting your work

4 Source code management
  4.1 Version management
  4.2 Branches
  4.3 Merging branches
  4.4 Remote repositories
  4.5 Exercises

5 Formalizing code in a package
  5.1 Creating a package
  5.2 Building a package
  5.3 Using packages
  5.4 Testing packages
  5.5 Continuous integration
  5.6 Exercises

6 Developing with containers
  6.1 Working in a container
    6.1.1 Volume mapping
    6.1.2 Using graphics in a container
  6.2 Managing containers
  6.3 Working with images
  6.4 Running a notebook from a container
  6.5 Exercises

II Model Development Workflows

7 Workflows and repeatability

8 Exploratory analysis and hypothesis creation

9 Model design

10 Operationalization
  10.1 Data processing as a pipeline
  10.2 Job initiation
  10.3 Scheduling jobs with cron
  10.4 Event-driven processing
  10.5 Error handling
  10.6 Development environments

11 Reporting and visualization
  11.1 Rmarkdown and knitr for flat reports
  11.2 Notebooks for interactive documents
  11.3 Shiny for interactive visualizations
  11.4 OpenCPU for dashboards
  11.5 Summary
  11.6 Exercises

III A Discipline for Data Science

12 Algorithm styles

13 A brief history of object-oriented programming

14 Elements of a functional programming language
  14.1 A taste of functional programming sorcery
    14.1.1 Everything I see is a function to me
    14.1.2 Iteration over values
    14.1.3 Composing functions
    14.1.4 Eliminating conditional expressions
    14.1.5 How functional is your code?
  14.2 Properties of the functional programming style
  14.3 Function representation
    14.3.1 Function signatures
    14.3.2 Function definitions
    14.3.3 Function application
  14.4 Vectorization
  14.5 First-Class Functions
  14.6 Closures

15 Programming conventions
  15.1 PERL's three virtues
  15.2 The Erlang manifesto
  15.3 Self-documenting code
  15.4 Abbreviations
  15.5 Suffixes
  15.6 Line length
  15.7 Indentation
  15.8 Spaces
  15.9 Consistency

16 Troubleshooting and diagnostics
  16.1 Testing
    16.1.1 Edge cases and well-formed data
    16.1.2 Function boundaries
    16.1.3 Property testing
    16.1.4 Model validation
    16.1.5 A note on coverage
  16.2 Logging
  16.3 Debugging
  16.4 Exercises

17 Mathematical data structures
  17.1 Scalars
  17.2 Sets
  17.3 Tuples
  17.4 Relations
  17.5 Predicates
  17.6 Vectors
  17.7 Matrices
  17.8 Tensors

18 Data structures in data science
  18.1 Tabular data
  18.2 Hierarchical data
  18.3 Summary
  18.4 Exercises

19 Common data transformations
  19.1 Retrieving and transforming input
  19.2 Data normalization
  19.3 Feature engineering
  19.4 Partitioning and aggregation
  19.5 Matching model interfaces
  19.6 Saving models and performance data
  19.7 Data visualization
  19.8 Summary

20 Functional design patterns
  20.1 Functions as factories
  20.2 Mediating iteration
  20.3 Interface compatibility
  20.4 Codifying behavioral changes
  20.5 Inversion of control via callbacks
  20.6 State representation
  20.7 Mutable state
  20.8 Summary
  20.9 Exercises

Glossary

Bibliography

Index


1 The reproducible scientific method

The unexamined life is not worth living.

Socrates

Stripped to its essence, reproducible science is just science: the ability to achieve similar results under similar conditions. Prior to the rise of the computational sciences and open data, this was the only definition of reproducible science. Due to the portability of data, reproducibility exists on a continuum today. A weak form of reproducibility uses the same data and same experimental setup (code) but by a different team. Some call this scenario reproducibility [26, 13], while others call it replicability [14, 8, 15]. A strong form of reproducibility uses different data and a different experimental setup. In between is generalizability: different data but same code. While reproducibility is the end goal for science, for business generalizability is typically the goal.

In the computational sciences, it's possible to achieve exactly the same results with exactly the same initial conditions. Unfortunately, due to the complexities of software, replicability has been surprisingly elusive. Thankfully, times have changed. Numerous factors have made it easier to replicate results, including open source code, open data, increased code literacy, and a plethora of collaboration tools. This book shows one approach for making your work easier to replicate, generalize, and reproduce. Reproducible science benefits not just the researcher or experimenter, but also the colleague, the peer reviewer, and the auditor. Irrespective of the role, the initial goal of reproducible science is literally replicating stated results with the same data, the same code, and the same model parameters. Assuming the code and data are available, there's no point attempting to reproduce results if they cannot be replicated first. After replication, it's equally possible to focus on reproducing results or generalizing results. For data science and non-academic uses, emphasis is often on generalizability: once trained, a model will be applied to new data. For peer review, strong reproducibility is required to truly verify scientific results. But even in business and government, strong reproducibility is important. Depending on the impact of a model, it should be vetted by an independent team. In investment management this approach is fairly common, but it is surprisingly less common in situations like determining the length of a prison sentence [34, 28].


Reproducible science requires a repeatable process. Numerous steps must be taken to achieve practical repeatability: datasets must be easy to acquire and construct, code must be easy to run and modify, model parameters must be explicit, and even seeds for random number generators must be specified. The ad hoc nature of data exploration must be replaced by a more disciplined and structured development process. But how do you balance exploration efficiency at the beginning with repeatability later on? Too much structure at the beginning of a project becomes an opportunity cost if a model does not have enough explanatory power. Wouldn't it be better to add in structure along the way, as your confidence in your model and project increases? With functional programming, this is not only possible but practical. Software written in a functional programming style easily adapts to new uses. You'll find that it is remarkably easy to shift from one mode of work to another without many changes to your code. Your code will also be easier to understand, more reliable, and more reusable.

What makes some work more readily reproducible than other work? Like patents, some models are technically reproducible but practically impossible to reproduce. Research that is easily reproduced is accessible, transparent, and automated. Accessible work is either self-contained, providing all necessary artifacts, or makes it easy to obtain all necessary artifacts. Clean data consolidated into a single file is more accessible than a set of raw files with unclear transformation rules. Similarly, code that includes all dependencies is more accessible than code that only lists its dependencies. Transparency encompasses the availability and readability of code, data, and documentation. It is easier to reproduce results when the process and workflow surrounding the research is available, not just the data and methods. Open source code that includes data sets is transparent, but it loses value when lacking adequate documentation describing the approach. It also slows the advancement of science when others have to spend time just to replicate work or understand how a process works. In cases where redistribution of data is not possible, the next best option is to provide a script that downloads and constructs a dataset. To reproduce results quickly, automation is necessary. Ideally, bundled scripts are provided that can install dependencies, download data, and run a model without human intervention. Automation is also required to run models in an operational context. Introduced in Section 1.1, operational models are integrated into a business process or system, making predictions and/or decisions regularly.

Accessibility and transparency are characteristics of open science, enabling anyone to run, examine, and modify experiments. On the other hand, automation can stifle openness and freedom by locking you into a particular way of doing things. Automated programs can make replication trivial but complicate generalization and reproduction to the point that it's impractical to examine or modify an existing experiment. Avoiding this perverse consequence requires a good programming practice and deliberate design decisions that preserve openness and independence. My approach to reproducible science is thus based on a good programming practice. Well-written programs are usable and naturally accessible, transparent, and automated. The key to good programming is embracing simplicity and modularity.
Like any discipline, it can take years to become proficient. This book aims to speed that process by highlighting aspects of functional programming (Section 1.2) and the UNIX philosophy (Section 1.3) that simplify code and make it more modular. The core toolset I use for open and reproducible science leans heavily on the UNIX tradition and includes Linux, make, and Docker. The reason for choosing this set of tools is that the time spent learning the tools is transferable: you will encounter them over and over throughout your career. For example, much of R relies on the same suite of tools. The same is true for data visualization and web programming, data engineering, not to mention plain old systems engineering. In addition to the above tools, I also reference my crant utility throughout the book. This utility automates numerous development tasks, streamlining model development. Its focus is on system-level processes, and it works in conjunction with lower level build tools like devtools. Section 1.4 compares this book's approach to reproducible science with other approaches. Regardless of the choice of tools, this book provides a reusable framework for thinking about models and systems that promotes reproducible science.

1.1 The rise of operational models

Back in the twentieth century, a statistician was often a hired gun. She would grace the town whenever the good townspeople were unable to tame a particularly nefarious problem. This lone statistician would ride in, assess the situation, build a model and run some tests. Problem overcome, she would submit a final report of what to do with the scoundrel and ride off into the sunset, leaving a dust trail of model artifacts and one-off code. Things are different today. Data is no longer an intermittent friend or foe walking through a dusty town. Data is now so ubiquitous that it has instead become the lifeblood of a bustling digital metropolis. In the not-so-distant future, most of our lives will run on data. The education we can access, the jobs we can interview for, the amount of money we can borrow, the quality of healthcare we receive, the insurance premia we pay, and even the prison terms we get will eventually all be dictated by data. As models become a part of our daily lives, our dependence on them will increase. Consequently, there is a greater need to scrutinize the models that affect our lives. Reproducible science is an important step in making scientific research more transparent. As we enter a world where models and AI make decisions for us, reproducible science not only improves science but has the potential to improve society.

Societal implications aside, the extent to which data will influence our lives is great. As scientists, engineers, and researchers 1 working with data, the way we interact with data is changing alongside the role of data. It's gone from a periodic strategic need to a continuous, operational need. The need for reproducibility and the transition to operational models impact the modern day scientist in a number of ways.

1 There are many professions that engage in quantitative modeling, numerical analysis, etc. Collectively, I refer to anyone engaging in data modeling as a scientist or modeler. For variation, I will occasionally use other titles to mean the same thing. I refer to the research that scientists engage in as data modeling, modeling, or data science.

First, science is no longer a singular, isolated endeavor. Like software development and business, it has become collaborative and iterative. Collaborators range from department managers and business analysts to other data scientists to software development teams. Each role has its own requirements regarding the model and its results. Managers will likely want summaries of results in a flat report or dashboard. Business analysts may want a more interactive interface to investigate relationships and tinker with data on their own. Peers may want a detailed view presented in code or as a notebook. Systems engineering will need to review the architecture, system characteristics, and operational needs of a model. If a separate software team intends to reimplement the model for a production system, they need a model specification, a reference implementation, and test cases. Each audience has its own needs in terms of visibility into how the model works and how well it works. Irrespective of the audience, models also need to be easy to run with clearly defined inputs and outputs. In other words, repeatable research is an essential trait of modeling, no matter the audience.

Second, expectations of code quality have changed. A standalone analysis or research project has few requirements outside of the immediate goal of obtaining results from a dataset and model. While both the data and model need to be robust, the underlying quality of the code is often ignored. Isolated projects are not meant to have a long shelf-life nor many users. It's therefore understandable if the corresponding code is disorganized, monolithic, inflexible, poorly documented, slow, etc. The moment that someone else wants to use a model or portion of a model, low quality code is no longer viable. To avoid wasting countless hours helping a collaborator reproduce your results in their environment, your model needs to be portable, robust, and well-documented. When a model is destined for operational use in a business process, the quality of the code must be even higher. Operational models need to be resilient to the constant turbulence of production systems. Whether it's missing data, bad data, updates to data, regime change, or other unforeseen factors affecting the model process and/or result, a production-quality model must anticipate and handle these scenarios.

Finally, the responsibilities of a data scientist have increased, but often the available resources have not. In some ways, modelers haven't fully shed the guise of the lone wolf. Sometimes a team of one, model teams are often small. Modelers are like the homesteaders of yesteryear and need to be self-sufficient. In addition to building models, scientists can be responsible for acquiring and managing data, visualizing data, creating reports and dashboards, and deploying and maintaining operational models. The key to juggling so many responsibilities is writing robust, reliable software, so models and systems take care of themselves. A poorly designed model system is no better than a straw house that needs constant repair and upkeep. How can you possibly tend to the fields when all your energy and time is wasted trying to keep a dry roof over your head?

The software that realizes a model thus plays a prominent role in this story. It is software that extracts data from disparate sources, sometimes on demand and sometimes on a regular schedule. It is software that processes data and implements models. It is software that runs and tests models. It is software that generates reports and visualizes results. And it is the quality of the software you write that dictates the pace of your research and the reliability of your model. Software is the universal tool to make data analysis and modeling tractable, efficient, and reproducible. Like any tool or machinery, if the software breaks, you're forced to fix the tool before you can use it. When models don't produce desired or expected results, you need to take apart the software machine to identify the problem. Bugs can appear in three places: the data, the model, and the code. Debugging is tedious and time-consuming. To debug data, you need tools to inspect data structures. To debug models, you need tools to partition data. Sometimes you need to modify a process to inject a different dataset into a model. To debug code, you need tools to interrupt the flow of a program and taps to peer into the transient state of your machine. You may find some of these tools ready-made. Other times you may need to create your own tools from scratch. The conclusion? The majority of a data scientist's time is spent programming. And the cruel reality is this: the worse you are at programming, the more time you spend doing it. Given this perverse imbalance, it's worth embarking on the journey ahead to hone your programming skills to even out the time you spend modeling versus the time you spend writing code.

This is your call to adventure. If you have prior experience programming, you may be tempted to ignore this call and stay within the comfortable world you know. Unfortunately, even if you've taken programming classes, it's likely that most of what you know about programming and software design doesn't apply to data science. Many academic programs focus on web development or mobile app development. These curricula tend to focus on object-oriented programming, which is well suited for GUI applications that deal with digital analogues to real world phenomena. Even server-side systems may use classes to organize groups of operations and maintain state. While this approach can be effective, it works less well for systems built for data science. To be clear, I am talking about model pipelines, batch processing jobs, and even real-time predictive models built by modelers. This distinction is important because systems built by professional programmers have design goals different from computational systems, which may warrant the use of object-oriented programming.

In contrast, model systems almost always benefit from adhering to functional programming principles. This simple design philosophy emphasizes writing many small functions. A system is built by composing these reusable functions into a pipeline. Another way of describing this structure is that a model system can be represented as a computational graph, discussed in Section 1.2.4. When done well, these systems are easy to understand, modify, automate, reuse, and exchange.

To achieve reproducibility, is it really necessary to let go of other programming styles and adopt functional programming techniques? Put another way, is programming for data science really that different from systems or app development? Consider the programming lifecycle of model development versus systems or app development. Professional software development focuses on automated operation of systems. The system is designed with this intention from the start, so attention is paid to how bits of code are organized. Data science programs typically start as an exploratory exercise. Many experiments will result in nothing conclusive, so the code will be thrown away. Some survive this gauntlet of viability and emerge as the vehicle for reproducible science. Others reach a higher level of transcendence, becoming operational code that runs continuously as part of a business process. Hence, data science programs often have a changing raison d'être depending on the maturity of the research and the model. How we write the code determines how long it takes for code to metamorphose from one stage to the next. Of course, this pupa feeds on our time to fully transform, which can be quite costly and painful.

As a model matures, a strange phenomenon occurs: the surrounding environment changes with the model. This can impact the reproducibility of the model and must be accounted for. Projects usually start as an idea or hypothesis that warrants investigation. This first phase is exploratory in nature. At this stage, data is probably stored in flat files with few external resources used. If the analysis yields a promising result, more time and energy might be invested. Datasets will likely grow, triggering additional data processing pipelines to appear. Model outputs may shift from a binary Rdata file to CSVs to tables in a database. Eventually the model might mature into code that runs in a production environment. Production environments have more stringent requirements than development environments. Data will no longer be extracted from local files and must instead be pulled from a source datastore. Other resources may also be moved from a single process to other standalone microservices. Supporting a production environment adds a new layer of software to support the integration of the model with other systems. The environments surrounding the model are stacked like parallel universes, where the model shifts between the universes depending on who is using the model. Hence, the different methods of data access must all be supported and be internally consistent given the environment.

FIGURE 1.1: The model development workflow and the actors interacting with models. As a model matures, more actors enter the stage, and the model system must support all of them. (Diagram not reproduced; it pairs actors — you, colleagues, analysts, tech ops, and managers — with the stages of exploration, modeling, tuning, operationalization, and reporting.)

To support these shifting realities, it's important that model systems remain simple and flexible. Part II of this book discusses these workflows in detail.

At the component level, the data structures and objects present in the system also differ from conventional software systems. The reason is that much of data science programming is manipulating data and transforming it with mathematical operations. It's quite possible that a model script or program can be reduced to a long sequence of repeated function application or extended function composition. One reason this is possible is because once a model system starts, there is just one use case and rarely any human interaction. Another reason is that model systems rarely need shared state. In contrast, GUIs mediate a workflow where a human is an integral part of the process. A GUI element needs to know what happened previously, what's happening around it, as well as what's inside it. GUIs support complex interactions, so data structures and operations must be defined in advance. Without this design stage, many object-oriented systems end up being difficult to understand, difficult to change, and ultimately fragile. Components in model systems usually represent some sort of transformation, mathematical or otherwise. Data might be converted from one format to another, features added, a series normalized or transformed, partitioned, etc. These operations keep to themselves and don't care about what happens around them. For example, a linear operator doesn't need to know anything more than its operand to function. It doesn't need to know about what happened previously, nor what will happen later on in the processing. This selfish ignorance helps to simplify model systems.

1.2 The importance of being functional

What exactly is functional programming? I think of it mainly as a philosophy for writing software that uses functions as the primary building block. Functional programming has a rich mathematical theory underpinning it, but it is largely unnecessary to tap the benefits of the philosophy. Programs written in a functional programming style (a.k.a. functional programs) exhibit a number of properties. Functional programs are:

• declarative: algorithms focus on what transformations they make, not how;
• deterministic: functions avoid side effects so that their behavior can be fully deduced;
• modular: individual functions are small and can be used in various contexts;
• composable: programs are simply functions chained together (i.e. function composition);
• type-free: functions and programs generally use a handful of simple types.

Functional programming is well suited to model development because model systems often resemble transformation pipelines, which are no more than a chain of function transformations. Much of machine learning and big data leverages functional programming in one way or another. In fact, the rise of big data is one of the driving forces behind the renewed interest in functional programming. Starting with the MapReduce paradigm for big data processing, it uses two core higher-order functions, map and reduce (a.k.a. fold), as the foundation for arbitrary data transformations. Parallelization and distributed computing largely rely on being able to apply the same function to subsets of data on different compute nodes. This is simply a map operation.
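To make this concrete, R ships with these higher-order functions, so a small pipeline can be written without any packages. The following minimal sketch maps, filters, and then folds a vector of numbers; the values are arbitrary.

xs <- 1:10
squares <- Map(function(x) x^2, xs)                  # map: transform each element
evens   <- Filter(function(x) x %% 2 == 0, squares)  # filter: keep only the even squares
total   <- Reduce(`+`, evens)                        # reduce (fold): collapse to one value
total                                                # 220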

1.2.1 Determinism of programs

Mathematical functions are deterministic by definition. Recall that the graph of a function comprises a (theoretical) table associating an input value with an output value. Functions defined on the real numbers will have an infinitely long table. Irrespective of the size of the table, it is completely deterministic: given a specific input x, the output will always be y. Some "functions" are problematic and violate this definition. Consider the square root function. For a given positive number x, the square root √x is y, where both y² and (−y)² equal x. Multiple values for the same input are problematic because it's unclear which value is expected in an operation. In other words, the function is not deterministic. Imagine how inefficient it would be if you had to check whether the square root meant the positive or negative root every time you encountered it. Even worse is if its value was randomly the positive or negative root. Mathematics gets around this annoying problem by declaring the principal square root as the default square root, unless otherwise specified. This function limits the range of √· to the positive numbers, thus skirting the issue of two output values altogether.

Mathematics carefully defines functions to be deterministic, but the same is not true in computer science. Programming functions are essentially step-by-step algorithms that manipulate or move data from one place to another. Technically, these functions are completely deterministic given all initial conditions and inputs are known. The difficulty arises when initial conditions are unknown. Consider the function read_line(fd) that reads a line from a file. This function takes a single argument, which is a file descriptor. What is the result of calling read_line(fd)? Without knowing what file the file descriptor references and the contents of that file, it is impossible to deduce the output. This is not the only way that non-determinism seeps into functions. The use of global variables introduces hidden state that prevents programs from being deterministic. Global variables typically involve more than one function. Consider a function write_state and a function read_state that write to and read from a global variable, respectively. The function read_state is non-deterministic since its return value depends on more than its function arguments.

Determinism is not only a necessary ingredient of deduction; it is necessary for repeatability as well. To preserve determinism, the same care taken in mathematics must be applied to computing. This means avoiding the use of global variables and shared state in general. Sometimes shared state is unavoidable in programs. The functional programming approach uses closures to manage shared values. Instead of declaring global variables accessible to everyone and anyone, variables are wrapped within the execution environment of a function. Changes to a shared variable are mediated by a function, which limits how a variable can be changed. Functions related to I/O necessarily depend on side effects (with the exception of pure functional languages). These functions should also be isolated from the rest of the system to limit the reach of side effects.
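As a minimal sketch of mediating shared state through a closure rather than a global variable, consider a simple counter; the names here are invented for illustration.

make_counter <- function() {
  count <- 0                 # the shared value lives in the closure's environment
  function() {
    count <<- count + 1      # the enclosing function is the only way to change it
    count
  }
}
next_id <- make_counter()
next_id()   # 1
next_id()   # 2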
Scientific computing introduces another area of non-determinism, this time deliberately. Pseudo random number generators pose an interesting conundrum. On the one hand, they are meant to be random. But for repeatability, they need to be deterministic. RNGs are technically deterministic given an initial seed. Each successive value is known based on the initial seed and previous value. The sequence appears random and satisfies the properties of randomness, despite being deterministic. However, if the initial seed is unknown, the pseudo random numbers are effectively random. Therefore, to facilitate repeatability, the initial seed is typically set explicitly.
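For example, fixing the seed explicitly makes a "random" sequence repeatable; the seed value below is arbitrary.

set.seed(42)        # fix the initial seed explicitly
a <- rnorm(3)
set.seed(42)        # same seed ...
b <- rnorm(3)
identical(a, b)     # ... same sequence: TRUE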

1.2.2 Lazy evaluation

Imagine a time before computing. In this age, there were no electronic calculators to compute roots or logarithms. Instead, lookup tables were used to find the value of a function based on an input value. Suppose no table yet exists. How do you go about constructing this table? One approach is to collect a set of input values and construct the image of the function on that set. For example, you can construct a table for the base 10 logarithm of the integers 1–100. Whenever you need the value of the logarithm, you just look it up and use the value. There are two drawbacks to this approach. First, the table contains a small set of possible input values. Any numbers that don't exist need to be computed. One solution is to add more numbers to the table. But the input domain is infinite, so there will always be numbers that aren't present in the table. Worse, it takes time to compute the logarithm. Calculating a table with 100 numbers requires a lot of time upfront, before starting an analysis. There's no guarantee that any of the precalculated values will actually be used, so it's a waste to compute this upfront. To reduce the table construction time, it's possible to construct the table incrementally instead. Values are added to the table only when they are computed for the first time. Any subsequent calculations will use the lookup table. This technique is known as memoization and is an example of a just-in-time (JIT) process.

Another example of a JIT process is lazy evaluation. Like the calculation of the lookup table, variables or functions that are lazily evaluated are not evaluated until needed. Lazy evaluation eliminates a lot of computing overhead, since unused values are not computed. Most functional programming languages are lazy, and R is mostly lazy. Lazy evaluation and just-in-time processes are not limited to computing. Toyota introduced JIT manufacturing in the 1960s [?] to minimize inventory overhead. During the model development process, being lazy can save the scientist precious time by minimizing unused work.
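The incremental lookup table can be sketched in R as a memoizing closure; the function below is an illustrative sketch rather than a general-purpose memoizer.

make_memo_log10 <- function() {
  cache <- list()                          # the incrementally built lookup table
  function(x) {
    key <- as.character(x)
    if (is.null(cache[[key]]))
      cache[[key]] <<- log10(x)            # compute only on the first request
    cache[[key]]                           # later calls reuse the stored value
  }
}
log10_memo <- make_memo_log10()
log10_memo(100)   # computed and cached: 2
log10_memo(100)   # served from the cache: 2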

1.2.3 Generic versus specific types

Many diverse fields of mathematics are described in terms of sets. In addition to sets, there are tuples (ordered sets), vectors, matrices, and of course, functions. These basic types are used over and over. Some disciplines of mathematics may introduce their own types, but they usually build on these basic types. For example, a random variable is a special type of function, as are sequences. Matrices can represent all sorts of real-world phenomena, from asset returns to economic utility to social networks to a corpus of documents. This flexibility of matrices to represent arbitrary concepts maximizes reuse and compatibility of work built on matrices. Any field that utilizes matrices can reuse analysis methods from another field that utilizes matrices. This compatibility is great for science since it minimizes duplication of effort.

Unlike object-oriented languages that include a massive framework of types, most functional programming languages rely on just a few general types. R also includes a limited number of built-in types. This spartan set of types is usually sufficient to create a complete model and surrounding system. The UNIX philosophy also embraces a minimal set of types. Most programs support a single input format: a text stream. Core UNIX tools read from the so-called standard input stdin and write to the standard output stdout. This simple interface is highly modular. Programs that adhere to this simple protocol can be used and arranged in any number of arbitrary ways to accomplish a task. Put another way, a small set of generic types promotes modularity and reuse.

Taking a lazy approach, most model systems should rely on the base types first. Only when it is clear that type specialization is required should custom types be created. Compare this to object-oriented programming, where the definition of custom classes is central to the programming work. The simple act of creating a class hierarchy imposes a structure that is difficult to change. Any changes to the underlying model require refactoring of the class structure, which takes away valuable time from the work of analysis.
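As a small illustration, a named list can stand in for a model specification and the same function applies to any numeric matrix, whatever the matrix happens to represent; the names and values below are invented for illustration.

spec <- list(target = "glucose", features = c("age", "bmi"), lambda = 0.1)
center_cols <- function(m) sweep(m, 2, colMeans(m), `-`)   # works on any numeric matrix
x <- matrix(rnorm(20), nrow = 5)
round(colMeans(center_cols(x)), 10)                        # every column is now centered on zero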

1.2.4 Computational graphs

Frameworks for deep learning, like TensorFlow, describe transformations on data as a computational graph. These graphs leverage two principles of functional programming: declarative notation and function composition. Formally, computational graphs are directed acyclic graphs (DAGs) that represent a computation. For example, the simple calculation a + b can be represented as a computational graph:

      (+)
     /   \
   (a)   (b)

In this representation, functions (operators) and variables are represented as nodes. Variables are usually terminal. Edges show the dependency from the function to the operand. More complicated computations can also be represented. Take the mean of a vector x, which can be described by:

        (÷)
       /   \
     (Σ)   (|·|)
       \   /
        (x)

FIGURE 1.2: Functions in the diabetes package drawn as a dependency graph. The black node represents the function itself. Gray nodes represent function dependencies in the body of a function, while white nodes represent dependencies from the function arguments. (Diagram not reproduced; its nodes cover the training pipeline, from reading and transforming the diabetes data to constructing glucose features, fitting a random forest, and saving the result.)

Notice that both the summation and the length operator depend on x. One reason for using computational graphs is that updates to values can automatically propagate up the edges to dependent nodes. In the graph for the mean of x, if x changes, then the values of Σx and |x| will change. Once these values change, the result of the division operation will update to reflect the new values of its operands.

Another benefit of computational graphs (and DAGs in general) is that it's easy to partition a computation into smaller subgraphs. These subgraphs can then be moved to a different compute node, paving the way to distributed computing. Numerous approaches exist for distributed computing. Functional programming approaches have become popular because they are easy to implement and easy to reason about.

Computational graphs are not limited to deep learning. Indeed, so long as the functions used are deterministic and acyclic, any chain of function composition can be viewed as a graph. Data processing pipelines can often be represented as a graph. Figure 1.2 shows the training pipeline for a diabetes analysis as a computational graph. Two types of dependencies are represented. Within the body of a function, a function may call another function. This dependency is colored gray. White nodes represent a dependency in a function argument itself. Formulating model systems as computational graphs imposes certain restrictions on the structure of the program. These restrictions promote determinism by eliminating global variables and complicated dependencies. Representing existing systems as a graph can identify unnecessary interdependencies. The DAG also gives clues about what needs testing.
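Such a pipeline reduces to ordinary function composition. The sketch below is hypothetical: the function and column names merely echo the labels in Figure 1.2, and a linear model stands in for the random forest; this is not the actual diabetes package.

read_diabetes      <- function(path) read.csv(path)                   # assumes a CSV export of the raw data
transform_diabetes <- function(df) { df$date <- as.Date(df$date); df }
construct_features <- function(df) { df$lag1 <- c(NA, head(df$glucose, -1)); df }
fit_glucose        <- function(df) lm(glucose ~ lag1, data = df)      # stand-in for the model fit

train_glucose_pipeline <- function(path)
  fit_glucose(construct_features(transform_diabetes(read_diabetes(path))))

Each function depends only on its argument, mirroring the edges of the DAG: change the input file and the change propagates through every downstream node when the pipeline is re-run.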

1.3 The UNIX philosophy

In an age of planned obsolescence, it's remarkable that the Internet runs on 40-year-old technology: UNIX. What has made UNIX stand the test of time? UNIX started its life in 1969, created by Ken Thompson. There have been numerous changes to UNIX over the years (and its successors, e.g. Linux), but the philosophy surrounding UNIX has changed remarkably little. The UNIX philosophy started as a collection of programming principles based on Thompson's insights creating UNIX. Over time, the philosophy has been refined. Key elements of the UNIX philosophy mirror functional programming, such as modularity, composability, and simplicity. Other aspects echo the goals of reproducible science: transparency and extensibility. Accessibility appears in the guise of representation, least surprise, and separation. While the UNIX philosophy is targeted at the operating system (OS) and the ecosystem of programs surrounding the OS, the principles equally apply to functions within a program. This book shows how to leverage the UNIX philosophy for data science.

1.4 Other approaches to reproducible science

One pillar of the UNIX philosophy is codified in [32] as the Rule of Diversity: "Distrust all claims for 'one true way'". There are plenty of approaches to reproducible science, all equally valid. What sets this book apart from other approaches is that it is less a framework and more a philosophy. Frameworks offer convenience and simplicity early on. They act as training wheels for beginners to become productive in a short amount of time. At some point training wheels start getting in the way. Maybe the sensible defaults no longer seem sensible; maybe the automagical functionality has lost its appeal; maybe you want to move a little faster; or maybe you've started forming your own opinions. This book is for those ready to shed their training wheels and become truly independent. Programming is an expression of the programmer. You don't speak or write like others. Instead you develop your own voice. Programming is no different, and this book guides the way. If you've been on training wheels for a while, your initial reaction may be to reject this call to adventure. After all, your process works fine as it is. And that's okay. Eventually though, you may find certain operations difficult to do using the approach of your favorite framework. Once this happens, it's necessary to go deeper.

Another drawback to frameworks is that the time spent learning the framework is usually not transferable. For example, the pandas package in Python has an idiosyncratic syntax that is difficult to learn. Once you've mastered these idiosyncrasies, that knowledge cannot be applied anywhere else. Time is finite, so we have to spend our time wisely. Do we want to spend time learning idiosyncratic knowledge or core principles and tools that have unlimited application?

1.4.1 The tidyverse

The so-called tidyverse is a collection of packages with consistent syntax and semantics. It is good for getting started with data science and R. For the beginner, it makes R less scary with the promise of becoming productive quickly. The tidyverse has strong and consistent opinions, stripping away functionality and providing usually sensible defaults. 2 According to [37], there are four principles of the tidyverse. They state that you should:

• reuse existing data structures,
• compose simple functions with the pipe,
• embrace functional programming, and
• design for humans.

These principles are generally consistent with this book and help prepare you for the next stage of functional programming proficiency. In other words, the tidyverse acts as training wheels for the budding data scientist, either new to modeling, programming, or both.

2 Too many defaults and "automagic" behavior can cause confusion later on. An example is ggplot2, where it "will only use six shapes at a time. By default, additional groups will go unplotted when you use the shape aesthetic." [16]

1.4.2 ReproZip

Taking a different approach, ReproZip is minimally invasive and doesn't require any changes to your programming style nor adhering to other people's opinions. ReproZip focuses on tracing the execution path of your model to produce a self-contained package to re-run the model. Behind the scenes, it uses virtual machines or containers to execute the code. While ReproZip doesn't require any changes to your code, you do have to learn how to use their toolchain. You need to decide if you want to spend time learning one-off syntax or general syntax (e.g. make and Docker) that can be more broadly applied. The facade of simplicity also hides the underlying implementation, creating a certain amount of vendor lock-in.

1.4.3 REANA

Initially developed to facilitate repeatable experiments in particle physics, REANA leverages Python virtual environments and containers (via Kubernetes) to manage and run scientific workflows. REANA is comprehensive and spans deploying to a compute cluster. However, REANA is very Python-centric and support for R is undocumented. There is also a fair amount of upfront learning and configuration just to get started. For example, you need to understand the syntax of the configuration files in addition to the specific workflow defined by REANA. For computationally intensive models, this upfront investment is probably justified, as REANA abstracts a lot of complexities of reproducible science.

1.5 Coding conventions

To improve readability, different fonts are used to represent different entities within the text. When referring to the language, R is typeset using a sans serif font. Code listings use a fixed width font. References to variables, functions, and packages within the prose also use a fixed width font. The three canonical higher-order functions, map, fold, and filter, are also italicized to underscore their importance. This helps avoid confusion with other meanings for these words and reinforces the canonical nature of these higher-order functions. Mathematical expressions and objects are italicized. When switching between mathematical objects and their code counterparts, I maintain font consistency to help distinguish the context. In general, code is provided inline, as this is more conversational. That said, breaking up a function into too many little pieces can impede understanding. Thankfully, functions are meant to be short, so they can usually be listed in toto. Code examples may also depict the result of a command in the R console. In these situations, executed code is preceded by a > character, which represents the console prompt.
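For example, a short interactive exchange is displayed as follows (the values are arbitrary):

> x <- c(1, 2, 3)
> mean(x)
[1] 2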

1.6 Package dependencies and datasets

A number of packages are referenced in this book. Many are written by me to support my own model development. Table 1.1 lists the packages used in the book and their purpose. Readers are expected to know how to install and load libraries. If not, an introductory text on R should be read prior to reading this book.

Most chapters set the stage with a single dataset. This helps focus the discussion on a single analysis theme. For example, this chapter focused on the diabetes dataset. Some datasets are available within R, such as iris, trees, swiss, fgl (in MASS), and baseball (in plyr). Others, like adult, congress, and diabetes, need to be downloaded and constructed locally.

Package        Purpose
caret          Model development and tuning
dplyr          Data manipulation
futile.logger  Logging
lambda.r       Functional programming dispatching and type system
lambda.tools   Collection of functions for functional programming
magrittr       Infix pipe notation
MASS           Contains useful datasets
plyr           Data manipulation
purrr          Collection of functions for standardized development
rmr2           MapReduce implementation for RHadoop project
zeallot        Pattern matching for assignment

TABLE 1.1: Packages used in the book.

Some of these datasets have non-standard data formats. They are intentionally chosen to provide real world examples of writing code to extract, assemble, and normalize the data. The diabetes dataset [20] is a collection of readings for 70 diabetics. This dataset is different from the Pima Indian diabetes dataset that comes with R. A file in the dataset corresponds to various measurements for a patient. These include glucose and insulin levels. The entries are key-value pairs tagged with a date and time. The name of each key is provided in a separate data dictionary. The adult dataset [24] is designed for binary classification. This dataset contains demographic information on a population of people. The response variable indicates whether each person has an annual income greater or less than $50,000. It is a single file with a slightly bespoke format. There is no header, and the response variable isn't coded as a standard boolean value. The congress [?] dataset provides information about the legislators of the United States Congress. It includes demographic information for both houses. The bills [30] dataset comprises all of the bills and resolutions introduced (and possibly passed) in the US Congress. This dataset is an example of hierarchical data.
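A few lines of base R suffice to read such a file. The file name, the position of the response column, and the ">50K" coding below are assumptions made for illustration rather than details of the actual dataset files.

adult <- read.csv("adult.data", header = FALSE, strip.white = TRUE)  # no header row in the raw file
names(adult)[ncol(adult)] <- "income"        # assume the last column holds the response
adult$over_50k <- adult$income == ">50K"     # recode the non-boolean label as a logical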

1.7 Summary

Like any tool, programming can either make your work easier or harder. It's harder to cut with a dull knife than a sharp knife, and it is even more dangerous. Dull programming skills are similar. The extra effort required to cut through a problem often leaves a trail of blood and tears. It's therefore important to be honest about the role of programming in science and also your skill level.

Recognizing some of the traps of procedural and imperative programming is your first call to adventure. Before stepping over the threshold, Parts I and II will help you prepare for your journey to ensure a successful return.

1.8 Exercises

Exercise 1.1. Find a paper on https://arxiv.org/ that describes an experiment. Provide the URL of the paper and a paragraph discussion on its reproducibility. What information do you need to reproduce the results of the paper? How difficult is it to reproduce?

Exercise 1.2. In theory, any practitioner of a relevant field can take a patent application and reproduce the claim. Does this hold in practice? Use the USPTO patent search at http://appft.uspto.gov/netahtml/PTO/search-bool.html to find two patents: one on the keyword "Bayesian" and one on the keyword "machine learning". Provide the URLs to both patent applications and discuss how practical it is to reproduce the claimed results. Which application do you think is easier to reproduce?

Exercise 1.3. Define an algorithm for a daily routine in your life. What is your expected outcome? What assumptions are embedded in your algorithm? How difficult would it be for someone else to achieve the same outcome? For example, an algorithm for how you clothe yourself in the morning is not guaranteed to produce the same results (i.e. outfit) given two different people. Both the clothes and body are embedded assumptions.

Part I

Tools for Model Development


2 Core tools and requirements

We shape our buildings; thereafter they shape us.

Winston S. Churchill

Before embarking on this journey, proper preparation is necessary. It's important to have a model development environment designed for reproducible science and the incremental development typical of data science. Our environment dictates how we work, so it's important to start with an environment that fosters workflows typical of data science. A minimally working environment consists of three building blocks: GNU/Linux 1 and corresponding build tools, R, and Docker. You also need to consider the operating system that hosts your model system. Personal workstations may run Linux, Mac, or Windows, but automated system and scientific computing environments almost always run Linux. An overview of Linux essentials is provided in Section 2.1. Container technology is useful to reconcile these different computing environments. Section 2.4 discusses using Docker to run Linux environments within a host system running a different operating system. While this book is about R, the reality is that scientific work is also conducted in Python. At times, it's useful to use libraries or packages in Python. Section 2.5 shows how to achieve interoperability between R and Python. My crant utility automates this process and can create an isolated Docker environment for each project you work on. Discussed in Section 2.6, this tool can also build and install R packages from the command line.

2.1 GNU/Linux

Some journeys are taken on foot, though it's usually faster with a trusty steed. The same is true of this journey, where the steed takes the form of a self-contained and reusable model environment. Linux makes a solid foundation for model development. It continues the rich heritage of UNIX and is the operating system of choice for production models.

1 Colloquially known as Linux, the complete operating system is technically GNU/Linux. For the sake of brevity, I will often just say Linux.

Doing development in a Linux-like environment removes potential friction later on. And Linux has never been easier to use. Modern Linux distributions sport user-friendly GUIs and one-click installers for technical novices, while preserving the power of the command line for more experienced users. There are two dominant lineages of Linux: Debian and RedHat. 2 Either is suitable for data science, although my examples and code assume a Debian environment. The scripts in this book should be portable across these two strains of Linux, and even Mac OS X and Windows (using Cygwin). If you're not convinced that Linux is preferable for model development, you can use Docker to create a Linux environment to run your models. There is good reason to do this even if you develop in Linux, which I discuss further in Chapter 6.

Most UNIX tools appear simple but are dizzyingly complex and powerful. It can take years to fully master the text editor vim, let alone the bash command shell and the stream editor sed. The steep learning curve often discourages novices from fully embracing Linux. This is a shame, because UNIX and its philosophy have much to offer. Understanding it provides the backdrop and context for much of programming as we know it, not to mention R. An effective approach to using and learning UNIX tools is to start small. It's okay not to understand how everything works from day one. Focus on solving the specific problems you have. From there commit some time to learning incrementally. Use new problems as opportunities to learn a little bit more. You'll be surprised at how quickly you become proficient.

2.1.1 The command shell

In a UNIX system, the shell is how you interact with the operating system. Colloquially known as the terminal or command line, it has a text-based interface. The command line provides a prompt to issue commands and execute programs. Most of UNIX is modular, and the shell is no different. Due to its near ubiquity, a good starting point is to use bash. While bash is most common, the older sh also comes standard with Linux. Other shells, like ksh and zsh, offer different features and conveniences but must be installed explicitly.

Virtually every command in UNIX is actually an executable program. Some useful commands in UNIX are given in Table 2.1. We'll use many of these commands throughout this chapter. Any file with executable permissions is deemed a program. Whether it runs is dependent on the file. Programs are executed by specifying their path (location in the file system) on the command line and hitting enter. This can be an absolute path (starting at root, or /) or a relative path, based on the current directory. For example, to list all files in a directory you execute the command /bin/ls. This command is normally executed using just the name of the program, in this case ls.

2 Popular distributions like Ubuntu, Kali Linux, and Tails are all derived from Debian. Fedora and RHEL are RedHat distributions, while CentOS is derived from RHEL.

Command  Mnemonic                  Description
man      Manual                    View the manual for a given command/topic
set      N/A                       Set/view shell options
date     N/A                       Get or set the system date
pwd      Print working directory   View which directory the current shell is in
cd       Change directory          Change the directory of the current shell
ls       List                      Show the files in a directory
rm       Remove                    Delete a file, files, and/or directories
touch    Touch a file              Create a file with no contents
mkdir    Make directory            Create a directory
rmdir    Remove directory          Delete a directory
chmod    Change file mode          Change the permissions of a file
chown    Change owner              Change the owner of a file
su       Superuser or switch user  Switch to another user, possibly the superuser
sudo     Pseudo                    Temporarily run a command as another user
.        N/A                       Execute a file as a program
source   N/A                       Execute lines in a file in current shell
xargs    Execute with arguments    Execute a command using stdin as argument list
du       Disk usage                View the size of a file or directory
top      Top processes             View running processes interactively
ps       Process                   Report on some running processes
kill     Kill                      Stop a process
find     N/A                       Find files matching conditions
echo     N/A                       Display a line of text
cat      Concatenate               Concatenate files and print to stdout
head     N/A                       Print the first n lines of a file
tail     N/A                       Print the last n lines of a file
grep     Regular expression        Search files using regular expressions
cut      Cut file                  Remove columns from delimited files
sort     Sort file                 Sort lines in a file
uniq     Unique                    Remove duplicate lines
diff     Difference                Find differences between two files
tr       Translate                 Replace one character with another
sed      Stream editor             Modify files using scripted rules
awk      N/A                       A language for manipulating text files

TABLE 2.1: Some useful commands available in Linux distributions. Commands range from navigating the file system to managing processes to processing text files. The man command provides reference documentation for each command.

executed using just the name of the program, in this case ls. Most commands don't require the full path to be specified because they are located in the PATH environment variable. Linux uses this variable to find the full path of a program entered on the command line. A common trick is to type in a few letters of a program and hit the tab key to see what programs in the path match this string. To view the complete contents of the path, type

$ echo $PATH
/home/brian/google-cloud-sdk/bin:/home/brian/bin:/home/brian/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/brian/workspace/crant

Each element in the PATH is separated by a : and searched in order. The first matching program will be the one used. Any text following a command is treated as its argument list. Each argument is separated by a space. For example ls workspace shows the contents of the workspace directory. Notice that these commands simply denote functions. The above command denoted as a function in a programming language resembles ls("workspace") and has equivalent meaning. Commands also support options that are orthogonal to their arguments. Technically there is no difference between an argument and an option, but tools tend to treat each differently. Arguments act as input data to a function, whereas options tweak the behavior. This is similar to a model that takes a training set as its input and also has hyperparameters that control the behavior of the algorithm. UNIX commands support two option styles: short and long. Short options are abbreviated using a single letter, (typically) preceded by a hyphen. The -l option of ls lists files and displays the output as a table with extra information. Short options can be combined as a single string following the hyphen. To list files and sort by reverse chronological order, use ls -ltr. The -t option sorts the files in chronological order, newest first. The -r reverses the view. Combined they produce reverse chronological order. Even with command options, the modularity and composition make an appearance. Some programs don't require a hyphen for their options. Examples include ps and tar. Options for these commands are provided as a single string, such as ps aux, which lists all running processes in a user-oriented table. Options are generally commutative, so the order doesn't matter. The command ps axu is therefore equivalent to ps aux. However, we'll see more complex commands later on where the order of options does matter. In particular, docker has order-specific arguments, as does git. Options can also include a value that represents a key-value pair. In this form, the option key is followed by its value, such as ps -C R, which shows all running R processes. These options are typically separated from the options without arguments, such as ps w -C R. The option value can follow the switch without a space. Consider the awk language, which can be used to extract a specific column from a delimited text file. The -F switch sets the delimiter field, which can be any single character. It doesn't matter whether a space separates the switch from the actual delimiter field or not. For example, awk -F, behaves the same as awk -F ,. Long options are meant to be self-documenting at the expense of conciseness. Instead of a single hyphen, long options start with two hyphens, followed by the option name. The name can span multiple words, where hyphens replace spaces. 3 For example, ls -l --author includes the author of each file in the long listing. Long options can also represent key-value pairs. For this usage, the option name is followed by an equals sign and then the option value. The displayed file size units in ls can be controlled using ls -l --block-size=M, which changes the units to megabytes.
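As a concrete illustration (the long spellings below are the GNU coreutils equivalents; -l itself has no long form), the following two commands produce the same reverse-chronological long listing:
$ ls -ltr
$ ls -l --sort=time --reverse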

2.1.2 File permissions and scripts

All files in UNIX have an owner and a group. The owner is initially the creator of the file, though it can be changed to anyone. A group represents a group of users sharing the same permissions. Depending on the system, the group might just be the owner initially. All other users are considered a third group, exclusive to the union of the file's user and group. Each of these three roles has a set of permissions: read (r), write (w), and execute (x). A bit mask represents these nine permissions, where 1 means granted and 0 means not granted. In the file system, letter abbreviations are used to indicate that a permission is granted, while a hyphen means not granted. Each set of permissions is grouped together to form an equivalent octal number. For example, rw- implies 110, which is shortened to 6 in octal. A read-only file is marked r--, which is 4 in octal. The permissions for all three roles represent a three digit octal number. A file that everyone can read but that only the owner or a group member can write to is represented as rw-rw-r--, which translates to 664. Marking a file as executable with chmod +x tells the system the file may be executed. Whether it can be executed depends on the file. Shell scripts are the easiest way to create a custom executable file. The simplest bash scripts just list out a sequence of commands to run. When called on the command line, the bash interpreter will execute each command in the script in order. Executable scripts require a header that specifies the interpreter that actually runs the script. This header must be the first line of the file 4 and begins with a #! (read she-bang). What follows is the full path of the interpreter to use for the remainder of the file. Aside from /bin/bash, /usr/bin/env python and /usr/bin/env Rscript are common interpreters.

Example 2.1. A toy script just prints "Hello, world" to the shell. We can create this directly in bash using the echo command and a file redirect.
$ echo '#!/bin/bash
> echo "Hello, world"' > hello_world.sh

3This is a convention and not enforced. 4The only exception is that comments and newlines can precede it.

The output of echo is sent to hello_world.sh instead of stdout. This script is not executable yet because files are not executable by default. The ls command can verify this statement.
$ ls -l hello_world.sh
-rw-rw-r-- 1 brian brian 31 Sep 17 12:07 hello_world.sh
The first character (a hyphen) indicates the file type; a hyphen denotes a regular file, while a d denotes a directory. The following nine characters represent the complete file permissions, which here are 664. A script can be made executable using the chmod command. File permissions can be set by specifying the permissions in octal. A common permission is 755, which means the owner can read, write, and execute the file, while everyone else can read and execute. This means only you can modify the file, which is a safe default.
$ chmod 755 hello_world.sh
$ ls -l hello_world.sh
-rwxr-xr-x 1 brian brian 31 Sep 17 12:07 hello_world.sh
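The same permissions can be granted symbolically, and the resulting octal mode can be checked with stat (a small illustration; the -c format option is specific to GNU coreutils):
$ chmod u=rwx,go=rx hello_world.sh
$ stat -c '%a %n' hello_world.sh
755 hello_world.sh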

Alternatively, the file can be made executable by passing the +x mode to chmod. The hello_world.sh script can now be executed. To run the script in the current directory, it must be prefaced with ./.
$ ./hello_world.sh
Hello, world

If this file is accessible via the PATH, it can be called without the preceding ./. �
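For instance, the script could be copied into a directory that is already searched (a sketch assuming a personal ~/bin directory; adjust the path for your system):
$ mkdir -p ~/bin
$ cp hello_world.sh ~/bin/
$ export PATH="$HOME/bin:$PATH"
$ hello_world.sh
Hello, world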

Scripts help codify and automate manual procedures. It's generally a good idea to write a script if you find yourself doing the same thing more than once. Scripts are also useful for tasks that happen infrequently. If not documented, you waste time remembering how to do an infrequent task each time you need to do the task. This process knowledge must be re-learned over and over. A more efficient approach is to capture the steps in a script, which documents the process until it matures into a standalone tool.

2.1.3 Linking commands with the pipe

UNIX is meant to be modular and programs composable. The UNIX pipe operator | is the glue that joins programs together. Given commands a and b, the syntax a | b directs the stdout of a into the stdin of b. Multiple programs can be composed together using the pipe. This is possible since UNIX programs adhere to text as a common data format. If we ignore command options and only look at stdin, UNIX commands are univariate functions. Pipes thus represent simple function composition with left-to-right precedence. Traditional function composition has right-to-left precedence and looks like (b ∘ a)(x) = b(a(x)). This duality is useful when designing systems and data processing pipelines as graphs (see Section 1.2.4). The pipe concept appears in programming languages in a few guises.

Haskell gives the most formal treatment, showing how the operator is actually a monad. 5 Object oriented languages accomplish composition using method chaining. The magrittr package implements a version of pipes in R. These pipes support multivariate functions, creating semantics similar to method chaining. When functions are chained together, the first function argument is assumed to receive the return value of the prior function in the chain. The remaining function arguments are specified using partial application. The pipe operator therefore rewrites the function call to obtain the correct argument order prior to calling the function.

Example 2.2. One way to generate random numbers in bash is to read from /dev/urandom. This special device is a kernel-based random number generator that gathers noise from device drivers and elsewhere to create reliably random numbers. [3] Reading from /dev/urandom can generate any random sequence of bytes. We can emulate 10 rolls of a die using
$ cat /dev/urandom | tr -dc '1-6' | fold -w 10 | head -1
3633546154

Like grep, tr supports specifying the inverse, or complement, of a pattern. The -c option specifies the complement of the range 1 to 6 (every byte outside the range), while the -d option deletes those characters, leaving only the digits 1 through 6. �

Example 2.3. When your hard drive fills up, it's necessary to delete files to free up space. It's inefficient to randomly delete things. Wouldn't it be easier to delete or archive the largest files first? Let's find all directories from the current directory that are larger than a gigabyte.

du -sh * | grep '[[:digit:]]G'
This approach uses du to summarize the disk usage of files and directories. The output is piped to grep, which filters the results using a regular expression. If the working directory contains more than just directories, individual files may also appear in the output. We can use grep a second time to filter out entries that end in a file extension.

du -sh * | grep ’[[:digit:]]G’ | grep -v ’\.\w\+$’ �

Example 2.4. The ps command shows running processes. Suppose you want to find all running R processes and see how long they've been running. Maybe you want to check for any scripts running longer than 10 hours.
$ ps u -C R | grep brian | awk '{print $2, $10}'
22112 8:49
28387 6:10
29438 2:32

This command finds all R processes owned by brian and prints the process ID (PID) and the elapsed time of each R process.

5My package lambda.tools implements some common monads in R.

� Example 2.5. In Example 2.3, we looked for gigabyte files. What if we want to find the file sizes of all CSVs instead? The find command will find all files matching a pattern. We'll then pipe this to xargs, which transforms the output of the preceding command to arguments of the succeeding command.
find . -name '*.csv' | xargs du -sh
All matching CSVs are piped to xargs, which constructs a complete argument list for the command du and executes it. (The pattern is quoted so that the shell passes *.csv to find verbatim instead of expanding it.) The xargs command thus transforms the stdout of any program into an argument list. A similar technique is available in R using do.call. �
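One caveat: file names containing spaces will be split by xargs. A more robust sketch, using the null-delimited options available in GNU findutils, is
find . -name '*.csv' -print0 | xargs -0 du -sh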

2.1.4 File redirects

The pipe redirects stdout to the stdin of a subsequent program. It's possible to redirect program output to a file instead. Two redirection operators control program output. The > operator truncates and writes to the specified file. The >> operator appends output to the given file, creating it as necessary. Used less frequently is the file redirect operator < for stdin. This operator is useful for programs that expect input coming from stdin but you want it to read from a file instead. These operators can be used together, with < preceding >.
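For example, both redirects can appear in a single command. The following sketch assumes a hypothetical file notes.txt and writes an upper-cased copy alongside it:
$ tr 'a-z' 'A-Z' < notes.txt > notes_upper.txt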

Example 2.6. The single > effectively creates a new file each time it is used by truncating existing files. Any previous data in the file are lost.
$ echo "hello" > hello.txt
$ echo "hello" > hello.txt
$ cat hello.txt
hello
This operator therefore exhibits idempotence since it can be called repeatedly without any change in value. The double output redirection operator >> is not idempotent. Every time it is called, the data will be appended to the existing file.
$ echo "hello" >> hello.txt
$ echo "hello" >> hello.txt
$ cat hello.txt
hello
hello
hello

Note that the first hello is the initial content of the file from the previous operation. �

Example 2.7. Before starting an analysis, there's usually a bit of time spent getting familiar with a dataset. For files in non-standard formats, it can be convenient to examine some files in the shell before writing a script to parse them in R. The diabetes data contain multiple files, one per patient. The files are tab delimited, using a format similar to key-value pairs. Suppose we want to know how many records in the first patient file data-01 contain insulin doses, which are features 33, 34, and 35.
$ cut -f 3 < data-01 | grep '33\|34\|35' | wc -l
381

We use a file input redirect to pass a file to cut, which extracts the third column (the data code) of data-01. We then use grep to match all codes that are either 33, 34, or 35. Finally, we pipe this output to wc -l, which counts the total number of matching lines. �

Most programs read and write from stdin and stdout, respectively. We don't want errors and warnings to appear in stdout since they can corrupt the input to another program. To keep stdout clean, programs are supposed to write to the separate stderr stream. Doing so helps separate program output from error output. Both stdout and stderr are printed to the console. We can redirect any of the standard streams by prepending a file descriptor to the redirection operator. The three standard streams are denoted 0 (standard input), 1 (standard output), and 2 (standard error).

Example 2.8. To suppress error messages altogether, stderr can be redirected to the special file /dev/null. A typical example is directory creation. If the directory already exists, mkdir emits an error.
$ mkdir data
$ mkdir data
mkdir: cannot create directory 'data': File exists

If you just want to ensure that the data directory exists, this is a pointless error. Printing the error in a script is poor hygiene and creates red herrings that impede future troubleshooting. One solution is to suppress the error message.
$ mkdir data 2> /dev/null
Note that there cannot be a space between the file descriptor and the redirection operator. �
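The file descriptor prefixes can also route each stream to its own file, or merge the two. A brief sketch (missing_dir is a hypothetical, nonexistent directory):
$ ls data missing_dir > listing.txt 2> errors.txt
$ ls data missing_dir > combined.txt 2>&1
In the second command, 2>&1 sends stderr to wherever stdout points, so it must appear after the > redirect.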

2.1.5 Return codes

Error messages are helpful, but for automation, error codes are more useful. All UNIX programs return a code upon termination. Scripts can use this value to execute different logical paths. By convention, a 0 indicates a success, while a positive integer indicates a failure. This may seem counterintuitive to traditional boolean logic. However, it is convenient since there is only one possible success code but an arbitrary number of error codes. The return code of a program is given by the special variable $?. This variable only holds the value of the last executed program, so it must be used immediately or saved in another variable.

Example 2.9. The which program finds the full path for a program assuming it is accessible from the PATH. Normal usage looks like
$ which ls
/bin/ls
If no program is found, it returns nothing:
$ which s
We could test that the output was empty, but it's more explicit to examine the return code directly.
$ echo $?
1

According to which, the input s resulted in an error, since no program was found. �

Example 2.10. Even if error messages are suppressed, the error code can still be accessed. Suppose I list files in a directory. If this directory doesn't exist, I'll see an error message.
$ ls ksdf
ls: cannot access 'ksdf': No such file or directory
In a script, this message isn't so useful and can be hidden using a redirect.
$ ls ksdf 2> /dev/null

Now the output is empty, since nothing is written to stdout. Within a script, I can check for an error by inspecting the return code, which in this case is 2.
$ ls ksdf 2> /dev/null
$ echo $?
2
�
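Since $? is overwritten by every subsequent command, a script will typically save it right away. A minimal sketch continuing this example:
$ ls ksdf 2> /dev/null
$ rc=$?
$ echo "ls exited with code $rc"
ls exited with code 2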

2.1.6 Processes

Every program runs as a distinct process in UNIX. When the program starts, it is assigned a process ID (PID) and a priority. The scheduler uses this information to decide which program gets CPU cycles at any given time. When UNIX boots, a number of processes start automatically. These long-lived processes are typically daemons and servers. They run in the background, out of sight. In the shell, programs run in the foreground. A foreground process will get its input from the shell and display to the shell. This is necessary and useful to interact with a program. In the background, a program won't get input from the shell. Running processes can be temporarily suspended by typing Control-Z. The foreground process of a shell will be paused and control will return to you via the shell. To resume the process, type fg. To send the process to the background type bg. Once in the background, a process must be stopped using kill. This program sends a termination signal to a program, which then performs any necessary clean up before shutting down.

1 #!/bin/bash
2
3 i=1
4 while [ "$i" -gt "0" ]
5 do
6 echo $i
7 sleep 1
8 ((i+=1))
9 done

LISTING 2.1: A simple script count.sh that counts from one.

Example 2.11. Let’s create a toy script that counts the whole numbers from one on. Listing 2.1 gives one such implementation. We’ll use this script to show that a program is running even when it is in the background. Name the script count.sh and update the permissions so it is executable. Executing the script will start printing the natural numbers to the console. Hit Ctrl-Z to pause the program. You can use this shell for other purposes now. When ready, type fg to resume the program. �

For long-running commands or to start a server, it is useful to execute a command in the background. The & at the end of a command tells the shell to run the command in the background. When the program runs in the background, you can still use the shell for other activities. But be aware that the background command still has access to stdout, so your shell may get periodic updates from the program! To completely separate the program from your shell, use the disown built-in command and specify the PID of the process to disown. The terminal window can be closed and the process will still run.

Now run the count.sh script from Example 2.11 in the foreground, as though we’re running a model job. 32 Introduction to Reproducible Science in R

./count.sh
Imagine that after some time, you realize the job will take quite a while longer (in this case an infinite amount of time). It's no longer practical to sit and monitor its progress. It's time to send the job to the background and avoid accidental termination due to an inactive session. Use Ctrl-Z to suspend the process and then bg to send it to the background. While bg sends a process to the background, it is still tied to the current shell. The disown command dissociates the process from the current shell. Once that command is executed, it is safe to close the shell.
$ disown $(ps ux | grep 'count\.sh' | awk '{print $2}')
$

In a separate terminal window, use ps to verify count.sh is still running. When you've completed this experiment, it's a good idea to clean up and terminate the program. The same ps command can be used to find the process and kill used to terminate it. For variation, we can pipe the PID to xargs, which constructs the command using stdin as the argument list.
$ ps ux | grep 'count\.sh' | awk '{print $2}' | xargs kill
Respecting different approaches is another part of the UNIX philosophy, which is embodied by the Rule of Diversity: distrust all claims for "one true way". [32] �
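A related alternative, not used above, is to launch the job with nohup so that it ignores the hangup signal from the start; the sketch below also captures its output in a log file:
$ nohup ./count.sh > count.log 2>&1 &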

2.1.7 Variables

As a Turing complete language, bash provides variables, functions, and control structures. Variables are defined using an identifier followed by an equals sign and then the value, such as a=1. Variable names in scripts typically use lowercase letters, while environment variables use uppercase letters. For example, most standard environment variables in the shell are uppercase. These include PATH, LANG and TERM, which can be found using the set command. Following this convention makes it visually clear where a variable was defined. For assignment, a quirk of bash is that no spaces are allowed surrounding the equals sign. Another interesting behavior is that strings do not always need to be wrapped in quotation marks. For example, the string 'cali' can be assigned to the variable a as a string literal:
$ a=cali
Multi-word strings do need to be wrapped in quotation marks, otherwise the second token will be treated as a command (see Section 2.2 for how this is used). To reference a variable, a dollar sign is placed before the identifier. The variable a can be used in the assignment of b:
$ b="super, $a"
$ echo $b
super, cali

Variables are often used to construct strings. If a variable is embedded within a string without spaces, there's no way for bash to detect the end of the variable name. To properly delimit the variable from the string, it can be wrapped in braces.
$ echo "Super${a}fragilisticexpialidocious"
Supercalifragilisticexpialidocious
Both single and double quotes can wrap a string, but they have different behavior. Variables in a double-quoted string are interpolated as one might expect. In single quotes, the variables are treated as-is and don't perform any interpolation. With single quotes, the variable a isn't evaluated.
$ echo 'super, $a'
super, $a
Compare what happens when we assign the uninterpolated string to another variable. The literal $a is preserved until the string is explicitly expanded later, for example with eval. This behavior is useful when chaining commands together, since a variable can be interpolated in the future. Delayed interpolation is partially possible thanks to dynamic scoping. Of course, it should be used sparingly as misuse can cause unnecessary confusion.
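A minimal sketch of this delayed interpolation, using eval to force the expansion (eval should be used sparingly, for the reasons above):
$ a=cali
$ c='super, $a'
$ echo $c
super, $a
$ eval echo "$c"
super, cali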

2.1.8 Functions

As with other languages, functions help organize and modularize code. Function definitions begin with an identifier followed by a set of parentheses and then braces demarcating the body of the function. Functions must have at least one line in the body, otherwise bash will complain. Listing 2.2 shows a simple function that determines whether a program with the given PID is running. The function gets a process list from ps. This output is visually tabular. To make it machine readable, sed replaces a contiguous sequence of whitespace with a comma. Operating on the resultant CSV, cut extracts the second column, which is the PID. Finally, grep searches for a process ID that matches the given ID. All output sent to stdout is ignored by redirecting it to the special device /dev/null. We only care about the return code, so superfluous output just adds noise. If not given explicitly, the return code of the function is dictated by the last program called, which is grep.

1 pid_exists() {
2 ps ux | sed -e 's/\s\+/,/g' | cut -d , -f 2 | grep "^$1\$" >> /dev/null
3 }
LISTING 2.2: A function to see whether a process is running.

Unlike other languages, no function arguments are allowed between the parentheses in the function signature. The function pid_exists might be a function of one argument or a function of four arguments. We only know based on the arguments referenced within the function. Positional arguments

are captured using the special variables $n, where n can range from 1 to 9. 6 Only $1 is used within pid_exists, which represents a process ID. Functions are called like any other program and are passed positional arguments after the function name. To check whether the PID 29049 exists (i.e. is running), the function can be pasted into a shell and executed:
$ pid_exists 29049
Like programs, the function output is different from the return code. This function does not write to stdout, so it appears to have done nothing. It does have a return code though, which can be found by examining $?.
$ echo $?
0
Remember that a return code of 0 indicates that the program was successful. For pid_exists, that means the PID 29049 was found (on my system, at time of writing). The output of a program can be captured in a variable. The final line of a function often calls echo to produce a text output that can be piped to another program or stored in a variable. Using return is ineffective as this only sets the return code. The special operator $() executes a command as a subprocess and captures its output. Whatever is written to stdout can thus be assigned to a variable. The following command gets the name of the current directory.
$ dir=$(pwd | tr / '\n' | tail -1)

Example 2.13. Configuration data can often be inferred from one location and used to automatically set variables in other places. Doing so streamlines configuration, eliminates syntax errors, and documents a process all at once. In an R package, the package name is used in multiple places, such as in the DESCRIPTION and the package source file. For example, if a package is named bestmodel, there is supposed to be a bestmodel-package.R file in the R directory. This boilerplate code can be generated automatically by getting the name of the current directory and using that to construct other file names.

1 get_directory_name() { 2 pwd | tr / ’\n’ | tail -1 3 }

This can now be used in a script to create the package file.
touch R/$(get_directory_name)-package.R
�

Example 2.14. The echo command transforms newlines into spaces. This can be confusing to novices because the behavior may seem inconsistent. However, it only happens when a variable is unquoted. Consider the random number generator described in Example 2.2. Let's create a function to roll a die n times. This output can be easily written to a file to be read by R.

6There is also a special $0, which holds the name of the script being called.

1 roll_die() {
2 cat /dev/urandom | tr -dc '1-6' | fold -w 1 | head -$1
3 }
LISTING 2.3: A function to simulate rolling a die n times.

Calling this function directly from the shell preserves newlines.
$ roll_die 4
4
4
2
4
Let's capture the rolls in a variable and print it.
$ x=$(roll_die 4)
$ echo $x
6 5 1 5

This transformation of newline to space happens because the unquoted variable is split into separate words by the shell, which echo then joins with spaces. Quoting the variable preserves the newlines.
$ echo "$x"
6
5
1
5
We'll see in Section 2.1.12 that this strange behavior is useful for looping over a sequence of values. �

2.1.9 Conditional expressions

Conditional expressions are defined using either test or the equivalent [ notation. Conditional expressions are just command line programs, and conditional operators are simply options to the program. Numerical equality is tested using the -eq option, such as test $x -eq 5. Similarly, the greater than and less than comparators use -gt and -lt, respectively. The = operator can compare both numbers and strings. Conjunction and disjunction are expressed as the options -a and -o. The order of options is important for these operations. Clearly $y -gt 2 is different from 2 -gt $y. The test notation is interchangeable with [ notation, so test $x -eq 5 is equivalent to [ $x -eq 5 ]. Extended tests use [[ notation. This operator enables the use of more familiar operators like >, <, &&, and ||.

Example 2.15. Syntax errors in tests can fail silently, leading to unexpected behavior. Consider the variable
$ y=3

To check if y is less than 2, the -lt switch must be used in standard test notation.

$ [ $y -lt 2 ] && echo true
Since this expression evaluates to false, nothing is printed. The same test works using extended notation.
[[ $y -lt 2 ]] && echo true
With extended notation, it's possible to use the more common < comparator.
[[ $y < 2 ]] && echo true
If this comparator is used in a regular test, the result is unexpected.
$ [ $y < 2 ] && echo true
true
In a regular test, the < is parsed by the shell as a file redirection rather than a comparison, so the two numbers are never actually compared. �

In addition to testing equality and making comparisons, Bash offers some non-standard operators as well. The -z and -n options test whether a variable is empty or not. Hence, these operators test for existence. When a variable is undefined it is equivalent to an empty string. 7 These switches are useful when coupled with program flags. A behavior can be enabled or disabled depending on whether a variable is set or not.

Example 2.16. A common option in a script is whether it should print log messages to the screen. A variable verbose can be defined and set to any non-empty value to indicate the script should print log messages, such as verbose=yes. The script simply needs to check whether this variable exists and print a log message if it does.
[ -n "$verbose" ] && echo "Step 1"
Variables can be unset by assigning a variable to nothing (i.e. an empty string).
verbose=
if [ -n "$verbose" ]; then echo "Step 2"; fi
Alternatively, some scripts are verbose by default. These scripts need to ensure quiet has not been set.
[ -z "$quiet" ] && echo "Step 1"
In either case, the output verbosity can be controlled using a variable. �

2.1.10 Conditional statements

Conditional statements control the branching of different logical paths. Rather than using braces or indentation to represent alternate code blocks, Bash tends to use beginning and ending keywords. Conditional blocks start with if, followed by the conditional expression. The keyword then indicates the beginning of the true block, which is followed by an optional else keyword. The end of the block is marked with fi, which is just if backwards. Here's a block that shows whether the variable $verbose was set or not. Multiple statements can appear in either block.

7It may seem strange to conflate these two concepts, but R does something similar in terms of undefined vector indices. Given a vector x with length n, accessing x[n+1] yields NA.

$ if [ "$verbose" = "yes" ] > then > echo verbose > else > echo not verbose > fi

If more than one condition needs to be tested, the elif keyword is used. This syntax is equivalent to starting a new if-then block after the else keyword. In cases where there is only a true case, the conditional can be condensed to one line using a ; to indicate the end of each statement.
$ if [ "$verbose" = "yes" ]; then echo verbose; fi

Even more concise is the use of the && operator, which creates a conjunction of the two statements.
$ [ "$verbose" = "yes" ] && echo verbose
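For completeness, here is a sketch of a multi-branch block using elif (assuming verbose is unset, so the else branch runs):
$ if [ "$verbose" = "2" ]
> then
> echo very verbose
> elif [ "$verbose" = "1" ]
> then
> echo verbose
> else
> echo quiet
> fi
quiet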

2.1.11 Case statements

When testing whether a single variable matches one of many values, it is more readable to use a switch (or case) statement. The syntax is similar to a conditional block, where case and its reverse, esac, indicate the beginning and end of the block. The body of the block is delimited by the keyword in. Each alternative is defined by a pattern followed by ), giving the appearance of a list of choices. A sequence of statements follows each choice. The final statement (even if there's just one) must be delimited with ;;. For example, some scripts support multiple levels of verbosity. A case statement can determine which level is desired.
$ case $opt in
> 0) verbose=0;;
> 1|verbose) verbose=1;;
> 2|loud) verbose=2;;
> *) verbose=0;;
> esac
Choices support limited pattern matching. Alternation of multiple equivalent values is possible with the | operator. Hence, the pattern 1|verbose indicates that either 1 or verbose matches this case. Since bash supports string literals, notice that no quotation marks are needed around verbose. The * matches anything and acts as a default case.

2.1.12 Loops

Loops can be constructed using for or while. For-loops iterate over a list of values. In bash this list is explicitly itemized with whitespace delimiting each item.
$ for i in H T T H; do echo Coin is $i; done

Coin is H
Coin is T
Coin is T
Coin is H

Explicitly listing all the values to loop over is rather limiting. A variable can represent the list instead. This time, we'll just echo the value so it can be easily read as data.
$ x="H T T H"
$ for i in $x; do echo $i; done
H
T
T
H

Example 2.17. Sequences can be generated more conveniently using a function. Using the random number generator in Example 2.14, let's simulate n rolls of a die to use as a random variable. We can compute the expected value of this random variable by iterating over its values.

1 expected_value() { 2 sum=0 3 for i in $(roll_die $1) 4 do 5 ((sum+=i)) 6 done 7 echo $((sum / $1)) 8 }

The special (( notation on line 5 represents an arithmetic expression evaluated by bash. This approach can only yield an integer result and is therefore imprecise. See Exercise 2.5 for a more precise approach. �
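As a quick usage sketch: since the true mean of a fair die is 3.5 and the arithmetic above truncates to an integer, a reasonably large n will almost always report 3.
$ expected_value 1000
3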

Example 2.18. The diabetes dataset consists of a number of tab-delimited files. A for loop can be used to convert each file into a CSV.

$ for f in $(ls data-*)
> do
> tr '\t' , < $f > ${f}.csv
> done

Within the loop, tr translates all tab characters into commas. The input file is passed to stdin using the < redirection operator. The output is also written to a file using the > redirection operator. �

While-loops work similarly to for-loops but stop when an expression no longer evaluates to true. Behind the scenes, while is just testing the return code for a success value (i.e. 0). Therefore, while can evaluate any function or program call, so long as it conforms to the return code convention. A function call replaces the test expression and the return code is tested directly by while.

Example 2.19. It's useful to know how many resources an application consumes. We can monitor the CPU utilization and memory consumption directly from bash. Let's write a function that monitors the CPU and memory usage of a process given a PID. This function will continue to monitor the process until it either quits or is cancelled. The former behavior is useful for monitoring a model job with an unknown run time.

1 monitor_usage() { 2 while $(pid_exists $1) 3 do 4 usage=$(ps ux | sed -e ’s/\s\+/,/g’ | grep ",$1," | grep -v grep | cut -d , -f 3,4) 5 echo "$usage" 6 sleep 1 7 done 8 }

This loop tests the return code of the pid_exists function, which was defined in Listing 2.2. The loop will continue until the process terminates or Ctrl-C is typed into the terminal. �

Example 2.20. The getopts function is used to collect options passed to a script. It takes two arguments: a string that represents all possible options, followed by the name of a variable that holds the current option name. Any option not specified in the argument list will result in an error. Only short options are supported, where each letter represents an option/flag to your script. A colon following a letter indicates that the option is a key-value pair. The value is populated in the temporary variable $OPTARG. The while-loop iterates over all arguments given. A case statement tests each option and performs the appropriate action. Listing 2.4 extends Example 2.19 and creates a complete script to monitor the CPU and memory usage of a process. It's customary to provide documentation on using a script or program. The -h and -? switches are aliases for printing the script usage syntax. Alternation is used on line 27 to support both flags in a single case. The do_help function on line 3 provides the basic syntax for using the script. The actual usage line is printed using a here document [12], which is a special form of input redirection. Line 22 sets delta_t to one second, which is used as the default sleep interval between iterations. Without a sleep interval, the loop will run too fast, potentially affecting the system performance. The -t switch enables changing the sleep interval by assigning delta_t with the current value of $OPTARG. No type checking is performed, so an invalid value will result in an error. The actual getopts call is on line 23. The options listed in the case statement should match those specified in getopts. If an option is specified in getopts but not handled by a case, it is simply ignored. In the opposite situation, getopts will throw an error. After a getopts block, it is customary

1 #!/bin/bash
2
3 do_help() {
4 cat <<EOT
5 usage: $0 [-t <interval>] pid
6 EOT
7 }
8
9 pid_exists() {
10 ps ux | sed -e 's/\s\+/,/g' | cut -d , -f 2 | grep "^$1\$" >> /dev/null
11 }
12
13 monitor_usage() {
14 while $(pid_exists $1)
15 do
16 usage=$(ps ux | sed -e 's/\s\+/,/g' | grep ",$1," | grep -v grep | cut -d , -f 3,4)
17 echo "$usage"
18 sleep $delta_t
19 done
20 }
21
22 delta_t=1
23 while getopts "t:h?" opt
24 do
25 case $opt in
26 t) delta_t=$OPTARG;;
27 h|?) do_help; exit 0;;
28 esac
29 done
30 shift $(($OPTIND - 1))
31 pid=$1
32
33 monitor_usage $pid
LISTING 2.4: A complete script to monitor CPU usage and memory utilization. The output can easily be redirected into a CSV for analysis.

to shift the array of input arguments. Line 30 does just this, with the effect that the first non-option argument becomes $1. �
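As a usage sketch, save Listing 2.4 as monitor_usage.sh (a hypothetical name), start a job, and record its usage; the PID shown is illustrative:
$ ./count.sh > /dev/null &
[1] 14302
$ ./monitor_usage.sh -t 2 14302 > usage.csv
The monitor runs until count.sh exits or Ctrl-C is pressed; usage.csv then holds one %CPU,%MEM pair per sample.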

2.2 The make build tool

Modularity and reuse is good for software, but it doesn't come free. Within your code modularity has few negative consequences, but external modularity can complicate building your software. When a program depends on other programs or libraries, those dependencies must be available on every system your code runs on. To guarantee repeatability, you must install the same versions of all software used to build your code. Dependencies themselves have dependencies, so it's turtles all the way down. And that's the best case scenario. Dependencies are usually not linear and look more like a web of interconnections. In perverse situations circular dependencies might prevent your dependencies from ever being met. To stop the insanity, a script or tool is needed to make this process more reliable. Numerous tools solve this problem, and the standard tool in UNIX is make. 8 Two fundamental problems are solved by make: understanding application dependencies and automating the process of building a program. For sake of brevity I'll skip most of the dependency management features since they are unnecessary for interpreted languages like R and Python. Besides, both R and Python have their own package managers, so it's usually unnecessary to use make to mediate this process. In Section 2.4 we'll see how these requirements can be managed in Docker. Makefiles define independent targets that can be linked together. A target is an identifier followed by : and optional targets it depends on. These targets are run first, to ensure dependencies are satisfied. Targets can either model a dependency graph or simply codify commands needed to build and run a program. The statements within a target are indented with a tab character. Each statement is a line of shell code that gets executed by make. This code is run in a separate environment and will not necessarily inherit variables defined in the current shell. To build a project, make is typically run from the base of the project directory, where a Makefile is assumed to exist. The name of the target to run is given afterward, such as make all. If no target is given, make runs the first defined target. Makefiles require at least one target. A common target is all, 9 which usually lists a set of targets it depends on. Calling make without a target thus runs every target necessary to build the software.

Example 2.21. Suppose you implemented a model and created a package

8Package managers accomplish the same thing for system dependencies. Programming languages also typically provide a package manager to install packages and their dependencies. 9The all target is considered "phony" since it doesn't utilize make's compilation features. It can be marked as such, but it isn't necessary since for R all targets are phony.

bestmodel. A simple Makefile for the package might contain targets for building the package, testing the package with a dataset, and running the model on new data to generate a report. We can assume a standard process for building, such as with my crant tool (see Section 2.6). To test the model, we can use Rscript to call a specific function within the package. Generating a report probably requires arbitrary data that we cannot know about in advance. Only the static targets are worth including in all.
all: build test

build:
	crant

test:
	Rscript -e 'library(bestmodel); run_model("private/test_set.csv")'

run:
	Rscript -e 'library(bestmodel); out <- run_model("$(INPUT)"); run_report(out)'
�
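Invoking these targets from the shell might then look like the following (the INPUT path is hypothetical):
$ make
$ make run INPUT=private/new_patients.csv
The first command runs the all target, building and testing the package; the second passes INPUT on the command line so the run target can reference it as $(INPUT).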

2.2.1 Variables

While technically Turing complete [36], Makefiles are impractical Universal Turing Machines. That said, a number of useful programming constructs are available, such as variables, conditional expressions, and functions. Variables are assigned using the := operator or the = operator. The former operator assigns the variable immediately, whereas = assigns the variable lazily. [29] In make terminology the latter is known as a recursive variable. Any variables referenced within the definition will only be evaluated when the variable is used. This delayed evaluation can be useful when defining variables based on context.

Example 2.22. Suppose you want to distribute a model for use in a runtime system. The model contains a version number dictated by git. This version is incremented by crant when a build is successful. If the VERSION variable is set prematurely it will have the wrong value. Hence, lazy evaluation is appropriate.
VERSION = $(shell git tag | tail -1)

build:
	crant
	tar jcf model_$(VERSION).tbz2 data
�

When executing make, variables can be specified on the command line. Variables can either be defined temporarily in the environment or just for make. In the first form, a local environment variable is set first. Here's a Makefile that simply prints a variable X:
all:
	@echo X=$(X)

We can provide this variable via the environment, such as
$ X=5 make
X=5

The script then has access to this variable. This approach works for any program.
$ X=5 sh -c 'echo "X=$X"'
X=5

The second syntax calls make and sets a local variable within make's environment.
$ make X=5
X=5

Furthermore, multiple variables can be specified this way. Let's add another variable to the Makefile.
all:
	@echo X=$(X), Y=$(Y)

And again:
$ make X=5 Y=4
X=5, Y=4

This approach works for make but not necessarily other programs.
$ sh X=5 -c 'echo "X=$X"'
sh: 0: Can't open X=5

Some variables are only assigned if they don't have a value. The ?= operator is used for this purpose. In our toy Makefile we can set a default value for Y.
Y ?= 1

all:
	@echo X=$(X), Y=$(Y)

We can now run the Makefile without specifying Y.
$ make X=5
X=5, Y=1

Example 2.23. In the diabetes Makefile, a number of optional variables are defined, including the PORT. This variable is used to set up network address translation (NAT) between the host machine and a running Docker container. Since this file is auto-generated, multiple packages will share the same port, causing a conflict. A second instance can run by overriding PORT on the command line.
$ make PORT=8090 run
�

2.2.2 Conditional blocks

Conditional directives can evaluate a block of code depending on whether a variable is defined (ifdef) or not (ifndef). Both blocks end with endif. For example, to support different operating systems, sometimes a slightly different configuration is needed. This behavior can be controlled depending on the existence of a specific environment variable. Another common situation is where behavior differs between two versions of a library. The build process might be different between these two versions. Here, the use of ifeq or ifneq can test whether a variable is equal to a particular string value or not, respectively.

Example 2.24. When debugging code it's useful to output extra log messages. This can be controlled in the Makefile by checking if a variable VERBOSE is defined. If so, we can change the call to docker to turn on debugging, as well as pass additional environment variables to indicate debugging is active.
ifdef VERBOSE
DOCKER := docker -D -e THRESHOLD=DEBUG
else
DOCKER := docker
endif
�

2.2.3 Macros and functions

Variables in make are actually macros in disguise. Macros are essentially subroutines that can be referenced later. [29] Unlike simple variable assignment, macros support multiple statements. The define directive defines a macro.
TIMESTAMP = $(shell date +%s)
define archive
mkdir old.$(TIMESTAMP)
mv *.tar.Z *.tar.gz old.$(TIMESTAMP)
endef
Once a macro is defined it can be used like any other variable.
all:
	$(archive)
Macros essentially define a replacement rule, where the name of the macro is replaced by the body. Arguments can be passed to a macro, making them indistinguishable from functions. Within the body, these arguments can be referenced. Like bash, arguments are positional and referenced by $n, where n ∈ [1, 9]. To call a function, the same variable dereferencing syntax is used, except it starts with call, followed by the name of the macro. Any arguments to the macro come next and are delimited by commas, such as $(call macro, arg1, arg2). Rather than assuming the directory uses a timestamp as a suffix, let's pass the suffix to the macro as the first argument.

TIMESTAMP = $(shell date +%s)
define archive
mkdir old.$1
mv *.tar.Z *.tar.gz old.$1
endef
The call to the macro must now include the suffix. To preserve the previous functionality, the timestamp can be passed to the macro.
$(call archive, $(TIMESTAMP))
This approach is a common pattern for refactoring a function to make it more general, while ensuring backwards compatibility. One final point is in order. Tools are only useful if they are used. Most tools fall into disrepair when not used. The same holds true for Makefiles. To ensure they stay useful, they need to be used and maintained. In a team setting, all members need to use the Makefile to ensure consistent results. For larger projects, one trick is to use the Makefile in an automated build system (see Section 5.5). This way it is always in use. Any changes to the build process that aren't reflected in the Makefile will fail, thus prompting an update. Like code, Makefiles will evolve over time to accommodate more situations. Cavalier additions to the file will eventually turn a simple and clean script into a complicated mess. Complex systems are brittle and are difficult to understand, preventing change. When software is difficult to change, it goes from being empowering to being oppressive. To avoid this prison, it's important to allocate at least some time to refactoring (i.e. cleaning up) the file whenever new changes are required. While simplicity and modularity are the keys to removing complexity, without the discipline to refactor, it's hard to maintain a simple system.

2.3 Literate programming and LATEX

Created by Donald Knuth in 1978, TEX is a language used for typesetting documents. TEX is the cornerstone of literate programming: the idea that software should be written for humans and not machines. Meant to be more than documentation, it is intended to dictate the structure of a program. Literate programs treat exposition and documentation as the primary focus. A special compiler weaves the parts of a program together into an executable. LATEX, written by Leslie Lamport, extends TEX. It provides numerous conveniences and formatting options for academic and scientific writing. Formal papers and presentations often use LATEX for typesetting. R package documentation is rendered as a PDF using LATEX. The older Sweave and newer knitr allow LATEX to be used in conjunction with R code outside of package documentation. In theory, literate programs can be written this way, but it is largely impractical. Figure 2.1 shows the difference between the approaches. Donald

(a) The original literate programming produced both a TEX document to describe typesetting and a source code file. (b) Only the weave branch is present in the R interpretation of literate programming.

FIGURE 2.1: A comparison of the original vision for literate programming and how it appears in R.

Knuth's original approach for literate programming began with a WEB file that comprised both exposition and code. The weave program transformed this file into a TEX source file, which could then be compiled using standard tools to produce a DVI typesetting document. In the other branch, the tangle program produced a PASCAL source file, which was compiled to produce an executable binary file. The approach used in R mimics the weave path, which is reflected in the precursor to knitr, Sweave. Instead of producing a DVI, these days a PDF or HTML file is produced. Three places in the R universe exhibit literate programming concepts: R documentation, papers/reports, and notebooks. The Roxygen2 package enables the mixing of documentation with code to produce standalone documentation files. This process resembles a weave stage, except the source file is simply code and therefore doesn't require a corresponding tangle transformation. Papers and reports can be written in LATEX or Rmarkdown with embedded R code. This time it's exposition that hosts code. Instead of producing a standalone executable, the code generates data to support the exposition. In some ways, notebooks are the most faithful interpretation of literate programming. Both documentation and code live in a single document. The notebook is interactive, so code is evaluated on the fly with results appearing in the notebook.

2.3.1 R documentation

The Roxygen2 package is inspired by doxygen. This framework enables documentation to be embedded in code and rendered to a separate document. Unfortunately, the documentation is not first-class, as is the ideal. It also doesn't support modules in the sense that a function can be broken up into chunks, surrounded by exposition. Despite these limitations, it is valuable to bring together code and documentation. Roxygen2 processes

all comment lines that begin with #'. These lines are read with the code to produce package documentation. Roxygen2 scans the code following documentation, conveniently filling in required metadata. For usable documentation, standard functions only require a few elements. The first line is treated as the title. An empty documentation line separates a description section. All documentation sections can be specified in the code comments. Function arguments are given using the special tag @param. These tags are collected to generate the Arguments section. The @return tag is used to produce the Value section of the documentation. Similarly, the @examples tag produces the Examples section. When I document a function, I usually begin with this bare minimum of information. Particularly important are the examples, which help others understand how to use a function. I usually start with the examples, even before I've implemented the function. The examples help me work out the design of the function signature. I often change it a few times until I get a balance between convenience, flexibility, and functionality.

Example 2.25. Let's create a toy function to simulate rolling a six-sided die. This function simply calls sample with some parameters filled in and provides a custom, specific function name. When first writing a function, I typically don't document it thoroughly. The reason is that functions can be in flux for a while, and I don't want to waste time re-writing documentation. Initially, I'll write just enough to remind myself what the function does and how to call it. In Listing 2.5 that means focusing on just a title and an example or two. Over time, the function signature and behavior will stabilize, and I'll add more description and documentation on function arguments. During the operationalization phase, I'll fill in additional documentation necessary for public consumption of a package.

1 #’ Simulate rolling a die 2 #’ 3 #’ Roll a fair six-sided die a given number of times. This 4 #’ function gives a probability of 1/6 for each number. 5 #’ 6 #’ @param n The number of times to roll the die 7 #’ @return A vector representing n rolls of the die 8 #’ 9 #’ @examples 10 #’ roll_die(4) 11 roll_die <- function(n) sample(1:6, n, replace=TRUE) LISTING 2.5: An example of ”just enough” documentation to inform people of the purpose of a function and what types of inputs it expects.

While the function is only one line of code, the documentation is much longer. This is often the case and highlights the fact that documentation can often take longer to write than the code itself. Rather than shun documentation as a waste of time, literate programming emphasizes that documentation is the primary goal of writing software. �

Roxygen2 documentation can be generated from R using the command roxygen2::roxygenize or devtools::document. [1] Alternatively, the documentation can be built using crant as part of the package building process. Two options are available: -x generates documentation and also builds the package, while -X just generates the documentation. The latter option can be useful when you just need to update documentation without making code changes.
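Either route can be scripted from the shell; a brief sketch run from the package root:
$ Rscript -e 'roxygen2::roxygenize()'
$ crant -X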

2.3.2 Articles, papers, and reports

Papers often need to include data and charts from R. The knitr package can produce a working LATEX file from a file that contains both LATEX and R code. Both TEX and LATEX are extremely powerful yet complex tools. To get started with LATEX, it can be helpful to install a GUI-based editor that provides hints and code completion. Online editors can also be valuable. The most basic LATEX document defines a document class and specifies the beginning and end of the document. The document class acts like a template, defining page dimensions, margins, etc. Commands in LATEX begin with a backslash character. These commands sometimes look like tags, with a starting and ending tag, like \begin{document} and \end{document}. Others look more like function calls, where the terms in curly braces are the parameters to the function. Consider a simple document in Listing 2.6. Line 1 calls \documentclass and passes the argument article to it. What about the square brackets, such as [11pt]? These terms are optional arguments and can be positional or named. Multiple options are separated by commas.

1 \documentclass[11pt]{article} 2 \title{An Initial Report} 3 \author{Anonymous Data Scientist} 4 5 \begin{document} 6 \maketitle 7 \section{Introduction} 8 9 \section{Method} 10 11 \section{Conclusion} 12 \end{document}

LISTING 2.6: A simple LATEX document for writing an article.

To create a PDF document, the command pdflatex can be used. Assuming Listing 2.6 is in a file basic_article.tex, the command
$ pdflatex basic_article.tex
This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex)
restricted \write18 enabled.
entering extended mode
(./basic_article.tex
...
Output written on basic_article.pdf (1 page, 39028 bytes).

Transcript written on basic_article.log.

will produce a PDF, along with supporting metadata and a log file. If an error exists, the output will make it clear. Usually the corresponding log file (the transcript) will provide clues regarding the nature of the error. If you want to embed R code within LaTeX, the file extension .Rnw is typically used. [19] An example appears in Listing 2.7. The file is basically the same as the original file but now includes code chunks that knitr will evaluate. A code chunk starts with a token surrounded by << and >>=. This delimiter accepts options, separated by commas. The initial setup chunk on line 7 includes two such options. Lines 17-19 show a regular code chunk. The contents will be evaluated by knitr. Both the code and the output will replace the code chunk in the resulting TeX output file.

1 \documentclass[11pt]{article}
2 \title{An Initial Report}
3 \author{Anonymous Data Scientist}
4
5 \begin{document}
6 \maketitle
7 <<...>>=
8 library(knitr)
9 set.seed(6)
10 roll_die <- function(n) sample(1:6, n, replace=TRUE)
11 @
12
13 \section{Introduction}
14
15 \section{Method}
16
17 <<...>>=
18 roll_die(5)
19 @
20
21
22 \section{Conclusion}
23 \end{document}

LISTING 2.7: An R noweb document that embeds code within TeX.

This file cannot be compiled by pdflatex, which is reflected in Figure 2.1. Instead, knitr::knit must be run first in R, which will produce a TeX file compatible with pdflatex, as shown below. As with most sufficiently complex tools, learning LaTeX can be intimidating. A simpler alternative is Markdown, which is popular for web documents. The rmarkdown package leverages knitr to generate documents in various output formats from a combination of languages. Chapter 11 discusses the use of rmarkdown and other aspects of creating reports and visualizations.
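For instance, assuming the noweb document from Listing 2.7 is saved as basic_article.Rnw (the file name here is an assumption), the two-step build looks like this:

# In R: evaluate the chunks and write basic_article.tex
library(knitr)
knit("basic_article.Rnw")
# Then compile the generated basic_article.tex with pdflatex as before.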

2.3.3 Notebooks

A notebook is an interactive document where exposition and code coexist. Interactive notebooks have found renewed popularity thanks to the Jupyter project. While associated with Python, Jupyter notebooks support R as well. Notebooks are similar to R noweb documents, except that exposition and code live in distinct cells. Unlike Roxygen2 and knitr, using a notebook requires more infrastructure. Section 6.4 shows how to run a notebook from a container, while Section 11.2 discusses some of the benefits and drawbacks of notebooks.

2.4 Containerization and Docker

With a familiar grasp of GNU/Linux, the development environment is almost complete. Modern systems tend to use container technology to promote portability and repeatability. Containers also offer isolation, separating the requirements for running a model from the requirements for running your system. Sometimes these requirements are incompatible. For example, prior to publishing a package to CRAN, it should be tested against three versions of R. The system is only designed to have one version running at a time. While it's possible to install three versions of R, doing so has unintended consequences that hinder repeatability. Other times you may want to test two versions of a model using different versions of dependencies. Or you may want to run two different models on the same workstation that have conflicting requirements. How do you resolve these issues? Models need to be self-contained and run independent of the host workstation and other models. This isolation is what ensures repeatability. Rather than worry about writing flexible code that works across heterogeneous platforms and environments, containers allow us to define and construct a single reference environment. Others can use an exact replica of the environment we create. This environment is self-contained and includes the operating system, directory structure, and build tools.

These days I use a Docker container for all model development, even if I'm just working locally. Containers can be created and destroyed quickly, so there's little cost in creating isolated computing environments on demand. Proper use of containers nearly guarantees repeatable science, and I always assume that my work will be shared with others. I also always assume that one day my model will reach a production environment. These optimistic assumptions help put me in the mindset of collaboration. The benefit far outweighs the cost, even if the assumption doesn't hold and my model ends up dying in the valley of statistical insignificance. And if my model is destined for the garbage heap, it's likely that I'll cannibalize certain parts of it before the model completely rots away. Functional programming principles make it easy to harvest the good bits from the code. When code is already modular, with minimal dependencies, it's much easier to derive some value from it regardless of the final state of the project that birthed it.


FIGURE 2.2: Relationship of Docker commands and objects. While only one image can be created from a given Dockerfile, multiple containers can be created from the same image.

Docker can be installed on Linux, Mac OS X, and Windows. For Debian users, installation follows the standard approach, using apt-get install docker-engine in a bash session. Mac OS X users will want to use the standalone installation package for Docker CE (Community Edition). Download it from https://www.docker.com/docker-mac and follow the instructions. Windows installation is similar. To test your installation from bash, simply run

$ docker run --rm -it zatonovo/r-base R

This command will download the zatonovo/r-base image and create a running container from the image. If you are developing within a container, Docker is the only other dependency required for your host system. Everything else is provided within the image. I created this image to consolidate dependencies for notebooks and OpenCPU, so it can be used during exploration and potentially as a live server. It contains all of the dependencies necessary for model development in R, and is the base image my company uses. 10 One of the benefits of container technology is that images can be derived from other images. This image inherits from the Jupyter project, providing many tools and dependencies transparently.

10 If you want to use a different image, other public images are available at https://hub.docker.com.

Containers are essentially lightweight virtual machines. Rather than emulating all hardware, they reuse some services from the OS. Both virtual machines and container technology rely on templates (images) that define a specific configuration (initial state) of a system. Instances (containers) are created from these templates. One key feature of container technology is how images are defined and shared. Inspired by Makefiles, the Dockerfile defines the exact steps necessary to build an image. Each command in the Dockerfile represents a layer that can be cached and reused. Images are thus a stack of layers. And rather than starting from scratch, an image can be derived from other images. A custom image only needs to define the steps you care about. Many public images are available on Docker Hub. Images exist for virtually every operating system, making it easy to experiment with other platforms. Building images and running containers all rely on the docker command. Figure 2.2 shows how different Docker commands relate to images and containers. Once an image is built, it can be run in a container. Additional commands can be executed on a running container using the docker exec command. Like Linux, Docker provides a ps command to view all running containers. The docker stop command stops a running container. Docker will save this container, which can be started later.

Container technology promotes repeatability: build an image and you're guaranteed to get an exact replica of the original image. But once a container has run, won't the system state be different from the initial state? While containers can save state, containers are meant to be disposable and stateless. This approach ensures repeatability. What if you do want the system state to evolve? Do you have to rebuild the image every time? If so, this would be awfully inefficient and would hinder the development process. One approach is to commit changes to the container. This approach works in a pinch but is less repeatable than updating the Dockerfile. If the changes to system state are limited to data, a better approach exists: volume mapping. Volumes (i.e., logical disks) can be mapped from the host to the container. A directory on the host therefore appears as a directory in the container. Changes to the container directory are reflected in the mapped folder on the host. Section 2.6 shows how to use containers for model development, while Chapter 6 discusses container technology in more detail.

2.5 R, Python, and interoperability

For a book on R, it may seem surprising to talk about other languages. The reality of data science is that you need to be comfortable with both R and Python, and possibly other languages. Sometimes a model may require code from both languages. This is particularly true of deep learning frameworks, which are predominantly written in Python. It's also useful to embrace diversity, as it is a vehicle for learning.

R                    Python
NULL                 None
NA                   numpy.nan
Vector of length 1   scalar
Vector               list
list                 list
array                numpy.ndarray
matrix               numpy.ndarray
data.frame           pandas.DataFrame

TABLE 2.2: Equivalent data structures between Python and R using reticulate.

Many of the programming concepts discussed in this book apply equally well to programming in either language. Some projects comprise multiple languages. These tend to be more challenging to automate. One approach is to glue the parts together using a bash script. Another approach is to use an interoperability library. Both the rPython 11 library and reticulate enable running Python code from R. The newer reticulate library offers more seamless integration with Python than the older rPython library. It also provides interoperability in the opposite direction: Python accessing R. Prior to reticulate, a separate library, Rpy2, was used for this purpose.

Packages like reticulate attach a Python process to a running R process. They manage the conversion of data structures from one language to another. It's important to understand how data structures get converted between the two languages because types are not always compatible, though reticulate does a good job of marshalling between the two languages. Table 2.2 shows how types are converted. To use reticulate, you just need a Python interpreter discoverable by R. If you have custom packages that you want to use, they need to be accessible via the PYTHONPATH. In R, the Python session becomes available when loading the library. Python modules can be imported and any functions defined within those modules can be called. Here's a glimpse of the functions and constants within the Python os module.

> library(reticulate)
> os <- import("os")
> ls(os)
> head(names(os))
[1] "abc"    "abort"  "access" "altsep" "chdir"  "chmod"

Objects have properties and instance methods as defined in Python. In R, both modules and objects are implemented as environments, which makes it easy to inspect objects.

11 There is also the package RPython, which aims to build an R interpreter within Python.

> np <- import("numpy")
> x <- np$array(1:4)
> x
[1] 1 2 3 4
> class(x)
[1] "array"

Eventually you'll want to transform these objects into their R counterparts. For simple types and objects, you can usually use them as is. For composite objects, like a pandas DataFrame, I find it easier to convert subsets or views of an object rather than the complete object.
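As a minimal sketch of that idea (assuming pandas is installed in the attached Python environment; the object names here are purely illustrative):

library(reticulate)
pd <- import("pandas", convert=FALSE)              # keep results as Python references
df_py <- pd$DataFrame(list(a=1:3, b=c(10, 20, 30)))
# Convert only the column we need rather than the whole DataFrame
a_r <- py_to_r(df_py$a$values)
a_r                                                # [1] 1 2 3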

Example 2.26. For natural language processing, the spacy library is popular for being fast and feature rich. Let's use it to extract named entities from a sentence. First, create a new Docker container and enter it via bash.

$ docker run -it --rm zatonovo/r-base bash

Now install spacy and download NLP models for English. Then start an R session within the container.

$ pip install spacy
$ python -m spacy download en_core_web_sm
$ R

Within R, load reticulate and import spacy. Load the model we downloaded, which is a callable object. We'll use this nlp object to parse actual text.

> library(reticulate)
> spacy <- import('spacy')
> nlp <- spacy$load('en_core_web_sm')

Let's parse some text about Stanford AI professor Li Fei-Fei.

> text <- "Fei-Fei is working as Chief Scientist of AI/ML of Google Cloud while being on leave from Stanford till the second half of 2018."
> doc <- nlp(text)

We can work with the parsed document like any other R object. For example, let's iterate over all found entities and produce a matrix showing the matched text and the named entity label.

> t(sapply(doc$ents, function(x) c(entity=x$text, label=x$label_)))
     entity                    label
[1,] "Fei-Fei"                 "PERSON"
[2,] "AI"                      "GPE"
[3,] "Google Cloud"            "LOC"
[4,] "Stanford"                "ORG"
[5,] "the second half of 2018" "DATE"

When you exit from this container, remember that all changes are temporary and will be lost. This includes spacy and its English model. To make a permanent change to the image, modify the Dockerfile or create a new one that extends the existing image. □

Embedding a language runtime is convenient for development but not usually suitable in a production environment. The reason is that memory and resource management are not as stable or reliable as within a single language runtime. Queue technology like RabbitMQ, or raw sockets like zeromq, offer a different strategy for connecting runtimes in two languages. Another approach is to wrap each runtime in a standalone web service, one for each language. These approaches are more involved than reticulate but are more reliable and easier to scale.

Part of the reticulate philosophy is to encourage this weaving of different languages, particularly within Rmarkdown documents. Just because you can write Rmarkdown in R and Python doesn't mean you should. One drawback to this approach is that it's easy to become complacent. Instead of implementing code cleanly, it can be alluring to just do what you know how to do. That may mean crossing language borders repeatedly just because you don't know how to do something in R. The cost of marshalling can be high when data must be regularly converted between the two languages. Not only will it eventually affect performance, it will make testing more difficult. Your code will also be less portable and harder to reuse. If a library or method is only available in Python, this is a good reason to embed it. Similarly, if a production model is implemented in Python, it may make more sense to use those routines directly in R rather than try to keep two independent implementations synchronized. The R code can be used for interactive data analysis, evaluating new models, and so forth. The key is to maintain proper modularity, so explicit boundaries exist between different parts of the system.

2.6 The crant utility

John Chambers, the creator of S, describes statistical analysis as a journey that begins with ad hoc exploration and transitions into structured model development. The S3 and S4 class systems were designed with this idea in mind: only add structure and process when needed. [10] The reason? Efficiency. Our time is finite and design is an investment. When you don't know if an analysis will bear fruit, why spend time designing a system? My crant tool (along with this book) extends this idea to the model development process overall. Focused on R, crant provides tools for formalizing the model development workflow once a model looks promising. Rather than focusing on a structured process too early, crant tools are designed to be used only when necessary. This saves time and avoids premature structure that can hinder exploration.

Leaning on the UNIX philosophy, crant provides a handful of simple utilities to ease model development. These tools can be integrated into custom scripts to set up Docker images or remote machines. The following scripts come with crant:

• init_package creates packages,
• crant builds them,
• rpackage installs them, and
• init_shiny initializes Shiny projects.

The first three tools are described in the following sections. For a more detailed discussion on creating and building packages, see Chapter 5. To install crant, simply clone it from https://github.com/zatonovo/crant. Once cloned, you can add the project to your PATH environment variable. Then you can issue any of the commands without specifying the full path. Assuming you cloned crant into your ~/workspace directory, you can update your PATH with

$ export PATH=$PATH:~/workspace/crant

This change will only affect your current terminal session. A more permanent approach is to put this in your ~/.profile, which is loaded for every login bash session:

$ echo "PATH=$PATH:~/workspace/crant" >> ~/.profile

To see the change in your current terminal, you need to source the file.

$ source ~/.profile

2.6.1 Creating projects and packages

The philosophy of crant mirrors this book: develop within a custom Docker image. While not required, init_package provides configuration files for various tools to streamline development using Docker. The workflow is illustrated in Figure 2.3. Project initialization can occur at any time. I usually do this once I finish some initial data exploration. At this point I usually have a few functions defined to automate repetitive tasks. From the perspective of Chambers, this happens after the "strict cut-and-stage". [10] Functions may encompass data cleaning and normalization. They may also include filtering and some basic feature extraction (e.g. PCA). I may also have some functions that help me debug a model. This is usually the point where the complexity of the code gets in the way of development itself. One common symptom is an environment polluted by temporary variables. Due to the interactive nature of the REPL, these variables may not be in the state you expect them to be in. Creating a package helps to clean up this sprawl. The built-in package.skeleton can initialize a package based on your code. I find it too finicky for my tastes. Another option is devtools::create, but like Goldilocks, I find that it also doesn't suit me. 12

12 It's important to stress that these are just my opinions. There is nothing wrong with these tools, which benefit scores of people.


FIGURE 2.3: A portion of the model development process and where different tools are used. The rpackage script is used during exploration and model design to install new packages. The init_package script is called once, when the model has begun to stabilize. crant builds a package and can be used as part of a continuous build process. Deployments occur when a model has been operationalized or packaged for distribution.

I wrote the init_package script to be just right. It creates an immediately buildable package with some extra features, including a .gitignore, a .Rbuildignore, a .travis.yml build file, a Dockerfile, and a Makefile. Not only do you get the batteries, you get the full accessory kit for free. These accessories reflect the practical realities of modern model development: source code management (Chapter 4), containers (Chapter 6), and continuous integration (Chapter 5.5). The init_package script attempts to be flexible enough to support most data science workflows. This includes working in Python and using notebooks. The make target python starts a Python 3 interpreter. Libraries can be added using pip and its requirements.txt. Simply add the line

RUN pip3 install -r requirements.txt

to the package Dockerfile to install the packages in the image. Similarly, the notebook target starts the Jupyter notebook server.

The .gitignore tells git to ignore specified files, such as project configuration, temporary files, and data. These artifacts should not be committed to a repository, and the generated .gitignore ensures they cannot be added. The .dockerignore works similarly but for Docker image builds. Data or other large artifacts can slow down the build process, since the data must be copied to the Docker engine. Telling Docker to ignore these files can therefore reduce the build time. During the build process, R also supports its own ignore file, called .Rbuildignore. It works similarly and ensures R does not erroneously include files within the package. The syntax of these files is similar: file patterns are added one per line, and any matching files will be ignored by the appropriate command.

2.6.2 Building packages

When you are ready to share your work, documentation needs to be generated, tests run, code committed, etc. The built-in command is R CMD build, but the syntax can be difficult to remember. I always forget if CMD or build is capitalized. These inconsistencies chip away at my memory and distract me from my actual goals. It is also just one step in a complete workflow. To simplify this process, the eponymous crant tool streamlines this workflow. Without arguments, crant will build and run tests like R CMD build but with friendlier syntax. A number of options control the specific behavior of crant. When I'm in development mode, I use

crant -SCi

which allows a dirty (uncommitted) repository, skips checks and tests, and installs the package. This is similar to using devtools::load_all to emulate building a package, except that it actually builds the package. Another common workflow is to update the documentation based on changes to the Roxygen2 comments. For this task I use

crant -xi

The -x option will build documentation and commit the changes, while -i installs the package locally. Running code as a package is more repeatable and reliable than individual scripts. Unfortunately, this reliability comes at a cost. Developing with packages is less efficient than free-flowing interactive development. The time it takes to build a package can interrupt flow. The devtools package addresses this issue by emulating package building using load_all. This is faster than a formal build, but also skips some of the safety mechanisms. It's therefore useful during exploration and initial model development, but eventually it's more rigorous to formally build a package.
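For instance, the emulation route during exploration might look like this (a sketch; the working directory is assumed to be the package root):

# Load all package code without a formal build or install
devtools::load_all(".")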

2.6.3 Developing multiple packages

For large projects, multiple R packages might be necessary. For my work, I tend to split packages based on functionality (e.g. futile.logger versus lambda.r) and also license (open source versus proprietary). Each package needs to be built separately. In an R session, packages can be individually removed and reloaded. The caveat is that a package can only be removed if no other loaded package in the session depends on it. If one does, those packages must be removed first. The detach function removes a package from the current R session. In order to load a new version of a package, the old version must also be unloaded. Otherwise, the previous version will remain cached and be used. For example, the diabetes package can be unloaded using

detach('package:diabetes', unload=TRUE)

.reload <- function() {
  packages <- c('package.c', 'package.b', 'package.a')
  # Unload the current package first, then its dependencies
  sapply(packages, function(p)
    detach(paste('package', p, sep=':'), unload=TRUE, character.only=TRUE))
  # Reload in reverse order; character.only is required for computed names
  sapply(packages[length(packages):1], library, character.only=TRUE)
}

LISTING 2.8: A private function to reload a package and its dependencies.

Custom model packages shouldn't be too deep in the dependency graph, so it should be relatively painless to reload a package. Even so, I find it useful to write a private package function that can reload packages for me, such as .reload. By default, functions starting with a . are not publicly visible in the package, so this function won't be exported. A hypothetical example is shown in Listing 2.8. In this listing, package.c is the current package. It depends on package.b, which itself depends on package.a. To reload a change in package.b requires unloading package.c first. For convenience, the function unloads all local packages in one go. Once unloaded, the updated packages can be reloaded. The loading process works in the reverse order of the unloading process. In other words, package.a is loaded first, then package.b, and finally package.c. When the function returns, all local packages will reflect changes from a command line build/install. This approach ensures consistent behavior.

2.6.4 Installing packages

It's easy to install packages from an R session. It's less easy to do it from the command line. While it's possible to execute an ad hoc one-liner

Rscript -e 'install.packages()'

or

Rscript -e 'library(devtools); install_github("user/package")'

these approaches are inconvenient to type. The rpackage utility solves this by providing a simple command line interface to install both packages published on CRAN and those available via git. For example, it's possible to install multiple packages at once, such as my functional programming language extension, functional programming tools, and my logging package.

rpackage lambda.r lambda.tools https://github.com/zatonovo/futile.logger.git

In this sense, it works similarly to a package manager that can install more than one package at a time. However, rpackage is dumb and doesn't manage dependencies. It's up to you to list packages in the correct order. When does it make sense to use packages on CRAN versus Github? Versions of packages on CRAN are typically more stable than versions on Github. CRAN enforces rigorous rules to ensure software available on CRAN is reliable and well documented. Stable software is also older, so versions on CRAN generally don't include the latest features. If you want newer features, install Github versions over CRAN versions. Another important difference is that a version on Github can change arbitrarily. These changes may break your code, and there will be little advance warning if this happens.
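For comparison, the equivalent installs from within an R session would be something like the following (a sketch; it assumes devtools is already installed):

# Packages published on CRAN
install.packages(c("lambda.r", "lambda.tools"))
# A package hosted on GitHub
devtools::install_github("zatonovo/futile.logger")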

2.7 Exercises

Exercise 2.1. Use bash commands to print each component of the PATH on a separate line. For example, given the path

/home/brian/bin:/home/brian/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/brian/workspace/crant

use bash to produce the following output:

/home/brian/bin
/home/brian/.local/bin
/usr/local/sbin
/usr/local/bin
/usr/sbin
/usr/bin
/sbin
/bin
/usr/games
/usr/local/games
/snap/bin
/home/brian/workspace/crant

Hint: use tr.

Exercise 2.2. Write a bash script to generate the DESCRIPTION file for an R package. Use Example 2.13 as a guide. What command line arguments does your script accept?

Exercise 2.3. In the following bash code, explain why the value of xy_test is empty instead of 0.

$ xy_test=$(test $x -eq 5 -o $y -eq 3)
$ echo $xy_test

Exercise 2.4. Listing 2.1 gave a script that counts from one. Modify the script to accept an optional argument for the starting number. Show an example of the script running. Does it make sense to use getopts to manage the optional argument? Why or why not?

Exercise 2.5. In Example 2.17, the expected value of a die was computed in bash. That approach is limited to integers, which makes it of little use here. To rectify this, the bc program can be used. For example, a fraction can be computed to two decimal places with

echo "scale=2; 39/10" | bc

Modify the code listing in Example 2.17 to compute the expected value using bc.

Exercise 2.6. Create a script that runs indefinitely. Start the script in the foreground. Disown the script and close the window. Open another shell and verify the process is still running. Show your script, the commands you ran, and the output that shows the process is still running.

Exercise 2.7. Use the kill command to stop the process you started in Exercise 2.6.

Exercise 2.8. Use Docker to run the r-base public image. What operating system is this image using? Show the commands you used to determine this.

Exercise 2.9. By default crant requires a clean working directory (all changes committed in git). This can be disabled with the -S switch. Why do you think crant requires a clean working directory before building the package?

3 Project conventions

No important institution is ever merely what the law makes it. It accumulates about itself traditions, conventions, ways of behaviour, which are not less formidable in their influence.

Harold Laski

Repetition in everyday life can become a mind-numbing sludge. It's also the grease that makes machines efficient and processes automated. A repeatable process is the foundation of repeatable science. To keep things structured and predictable, repeatable processes rely on various conventions. Little details like the directory structure, file names, function names, variable names, and how to format source code are shaped by the conventions we use. Conventions even appear in the architecture of a program, dictating how modules fit together and interact. Without conventions, code loses its structure and its predictability. More time is spent figuring out what things are and where they go instead of being productive. This chapter recommends some conventions, many of which come from R itself. Using the same conventions as R limits surprises because the conventions are consistent. Maintaining consistency extends to function design, particularly in the choice of default values of optional arguments. Known as sensible defaults, default behaviors and conventions shouldn't surprise the end user.

In programming circles, there is a software design philosophy known as "convention over configuration". [17] This approach tries to limit the number of decisions a developer must make to use a function, library, or framework. Rather than offering a lot of configuration options, certain conventions are used to simplify both the system and its configuration. These act as default choices that can be overridden as needed. But sometimes these conventions are so ingrained in software that changing them can prevent the software from working! It's usually possible to stray from conventions without such dire consequences, but things can get difficult quickly.

How R tests a package is an example of convention over configuration. During the build process, R runs any scripts in the tests directory, assuming they are tests. Any number of testing packages can be used for testing. A convention is to name the main testing script test-all.R, creating a subdirectory named after the test package used.

Hence, if using testthat, there would be a directory tests/testthat, while if using testit, the directory would be named tests/testit. This is just a convention, though, and the directory can be named anything.

In some cases, conventions can be overridden by a configuration. This depends on the software designer and whether she chose to expose a particular parameter as configurable. In this case, the convention becomes the default value. The effectiveness of this approach is dictated by how sensible (i.e., probable) the defaults are. An example in R is ggplot2, which uses many defaults to set colors, legend placement, etc. My logging package futile.logger also uses many sensible defaults to simplify usage of the package. I provide sensible defaults that correspond to a common use case, so users don't have to worry about configuring every aspect of the logging system just to use the package (a short sketch follows below). Contrast this approach with similar logging packages in Python and Java that require many lines of configuration code just to print a message to the console!

The following sections describe different aspects of software development that tend to have conventions. Some conventions are dictated by software while others are dictated by best practices. I suggest following these conventions until you've mastered development and are comfortable creating your own.
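To illustrate those sensible defaults, futile.logger logs with no configuration at all, and its defaults are overridden only when needed (a minimal sketch):

library(futile.logger)
flog.info("Processing %s records", 100)   # logs to the console at the default INFO threshold
flog.threshold(DEBUG)                     # override the default threshold only when needed
flog.debug("Debug messages now appear too")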

3.1 Directory structure

New projects should start in their own directory within your filesystem. I generally put all my projects in a folder called workspace, which is a leftover from an old integrated development environment (IDE). I don't use IDEs any more, and this book is written from that perspective. It may seem arcane, but disavowing IDEs will ultimately increase your productivity. One reason for this ascetic approach is to limit complexity by eliminating unnecessary dependencies. Another reason is that the conveniences provided by IDEs are training wheels that eventually become a wheelchair. While initially useful for getting you to your destination, overreliance on IDEs transforms into dependence. When this happens, it becomes difficult to be productive without the IDE, creating vendor lock-in and difficulty adapting to other work environments. Software written without an IDE is more portable and can be shared more readily with different audiences. IDEs generally introduce their own workflow for development. If this is the only workflow you support when sharing your work, it is a big ask of others to change how they work. Minimizing dependencies makes it easier for others to review or adopt your work, since you aren't imposing your workflow on them. 1

Path                               Description
~/workspace/diabetes               Project home
~/workspace/diabetes/data          Package data
~/workspace/diabetes/inst          Other package files
~/workspace/diabetes/man           Package documentation
~/workspace/diabetes/notebooks     Jupyter notebook files
~/workspace/diabetes/private       Non-package data
~/workspace/diabetes/R             R source code
~/workspace/diabetes/tests         Test scripts

TABLE 3.1: Directories within an R project

Whether you choose to use an IDE is your choice, though you may need to adjust some of the conventions in the book based on the conventions your IDE imposes.

The directory structure conventions I use are dictated by the package building process. Doing so minimizes friction later and doesn't cost me anything upfront aside from educating myself about how packages in R are structured. A complete directory structure appears in Table 3.1. The private and notebooks directories are my personal conventions and are not required for building packages. The tests, data, and inst directories are used by R but are also optional.

Example 3.1. Let's create a new project for the diabetes analysis. The project home directory is diabetes and located at ~/workspace/diabetes. This directory needs to be created in your host operating system. We expect to add functions in .R files, so we'll create the R subdirectory as well. We can do this in one shot with the following commands in bash:

$ mkdir -p ~/workspace/diabetes/R
$ cd ~/workspace/diabetes

Next, download and extract the diabetes data into private.

$ mkdir private
$ cd private
$ curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/diabetes/diabetes-data.tar.Z
$ tar zxf diabetes-data.tar.Z
$ mv Diabetes-Data diabetes

It's easy to execute some commands, download your data, and get on with your life. But what about reproducibility? How will someone else acquire the same dataset? It's best to codify this process in a script, leaving a trail of breadcrumbs for others to follow. Not only is it easier for someone else to reproduce your work, it documents the process in case you forget. A simple approach is to add these commands to the project Makefile as a data target. The data can then be downloaded at any time, by anyone, simply by running make data. □

1 Of course, one can argue that I'm merely imposing a new workflow to replace the IDE workflow. This is true, up to a point. However, the workflow I introduce minimizes new dependencies and is designed for freedom and portability, as opposed to dependence and lock-in.

When we convert the source code into a formal package, the private directory will be ignored by R and git. It's important to segregate data from your source code. Ownership and licenses are typically different for data and source code, so it's good practice to keep them separate. Even if you collected and compiled the data yourself, you probably want to adhere to this practice. The reason is that the nature of data differs from the nature of code. Data files are usually much larger than source code. Massive repositories take a long time to download, update, and change. This can lower adoption of a package or model, simply because it's inconvenient to download a bloated repository. Data also changes differently from source code, making it harder to track changes in either the data or the code. Some users may not even care about your data. They may be interested in using your code for a completely different dataset, making the included data superfluous.

The drawback to using separate repositories for source code and data is that it's less convenient. Now two repositories need to be downloaded and maintained. This inconvenience can be scripted away in the Makefile, though. The script can download data directly into the private directory, which is ignored by git. By using the Makefile we get an effective compromise between convenience and correctness. 2

3.2 Creating R files

During the exploratory phase, it's easy to spend all your time in a notebook or interactive session. The allure of immediate feedback is strong, so much so that it's easy to lose yourself in the rapture of convenience. Too much freewheeling accumulates what's known as technical debt. And like Shylock coming to redeem his pound of flesh, there's never a good time to repay this debt. Better to prevent it from accumulating in the first place. An effective first step is to encapsulate algorithms in functions within .R source files. Interactive sessions are thus limited to calling functions instead of implementing algorithms. This simple approach makes your work more repeatable. Notebooks are also easier to follow when procedural code is kept to a minimum.

2 It's also possible to use git submodules, but they introduce their own complexities.

Even if your model isn't intended to go into production, writing scripts is more permanent than a transient session. Crashes due to memory issues can result in unsaved history. It's easy to rely on the history within an interactive session, but one accidental printing of a large variable can wipe out the scrollback buffer of your terminal. Accidental shutdowns of a session can also cause unnecessary stress. Scripts protect against these dangers. Furthermore, code in source files is first-class and can be unit tested and tracked in a version control system.

Example 3.2. Let's start by creating a function to read the diabetes data, as shown in Listing 3.1. This dataset comprises multiple files, one per patient. To produce a single data frame, all the files need to be merged together.

1 read_diabetes <- function(bad=c(2,27,29,40), base='private/diabetes') {
2   read_one <- function(i) {
3     path <- sprintf('%s/data-%02i', base, i)
4     flog.info("Loading file %s", path)
5     classes <- c('character','character','numeric','numeric')
6     o <- read.delim(path, header=FALSE, colClasses=classes)
7     colnames(o) <- c('date','time','feature','value')
8     o$date <- as.Date(o$date, format='%m-%d-%Y')
9     o$id <- i
10     o
11   }
12   idx <- 1:70
13   do.call(rbind, lapply(idx[-bad], read_one))
14 }

LISTING 3.1: A function that concatenates individual patient data into a single data frame.

The first thing this function does is define a closure on line 2 that is responsible for reading a single file from the filesystem. It takes an index, constructs a path (line 3), and reads the file (line 6). The data frame o is then cleaned up and returned. Line 13 iterates over a sequence of indices via lapply, less some indices corresponding to files with bad data. The output of lapply is a list of data frames, each one produced by the closure read_one. Finally, do.call is a special function that dynamically calls its first operand with the second operand as arguments. In this case, rbind is called with the list of data frames returned by lapply. The end result is a single data frame containing the events from all patients. Note that read_diabetes depends on futile.logger. Be sure to add require('futile.logger') to the top of your file. □
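To see the do.call pattern in isolation, here is a toy illustration (unrelated to the diabetes data):

# do.call(rbind, list(a, b, c)) is equivalent to rbind(a, b, c)
dfs <- lapply(1:3, function(i) data.frame(id=i, value=i^2))
do.call(rbind, dfs)   # a single data frame with three rows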

How should R source files be organized, and how many files should be created? Let's answer this question with another question: in which file should read_diabetes of Listing 3.1 go? A good heuristic is to group functions in files based on their role. Grouping similar functions together is akin to modules in Python, where functions related to a particular task live in the same module.

Functions responsible for reading in data can go in a file called read.R. A second file called explore.R can codify functions you write for data exploration. Grouping functions based on their purpose also answers the question of how many files to create. Initially, though, the lower bound of one source file is fine. When you have collaborators, it becomes inconvenient if everyone is modifying the same file simultaneously. This is a good time to start grouping functions by role or context and to create different source files for each. Why not use the upper bound of one file per function? Nothing is theoretically wrong with this approach. However, with large systems it is inconvenient having functions spread across so many files.

Example 3.3. Let's explore the diabetes dataset. The data are organized as key-value pairs, such as

> head(df)
        date  time feature value id
1 1991-04-21  9:09      58   100  1
2 1991-04-21  9:09      33     9  1
3 1991-04-21  9:09      34    13  1
4 1991-04-21 17:08      62   119  1
5 1991-04-21 17:08      33     7  1
6 1991-04-21 22:51      48   123  1

The meaning of each feature is given in the Data-Codes file. For example, feature 58 corresponds to "Pre-breakfast blood glucose measurement" and feature 33 is "Regular insulin dose". No units are given for the values. The id is not part of the original dataset and is added by the read_diabetes function to distinguish readings from one patient to another. The data for each patient varies greatly. We need to get an understanding of which features have enough data for an analysis. A quick way to determine the counts of each feature is to create a contingency table. Listing 3.2 shows how to do this. The get_event_table function should be defined in explore.R.

get_event_table <- function(df) {
  table(df$id, df$feature)
}

LISTING 3.2: Create a contingency table of events within the data.

The body of this function is just one line. Do we really need a function for this? One purpose of a function is to provide context via its name. If we just use table(df$id, df$feature) in an interactive session or in a larger function, we may forget what the purpose of this line is. Wrapping it in a function provides more information that aids recall. Code is also a living entity. It evolves over time. A one-line function may become multiple lines in the near future. When code is encapsulated in a function, it's easy to modify because it lives in a single place. The alternative is searching for all instances of a particular line of code and changing each one. It only takes a few rounds of this before you'll want to write functions for everything!

Another example of a one-line function appears in Listing 3.3. If you weren't convinced that you might forget the purpose of a line of code, this one may prompt you to reconsider. Even if you are adept at reading code, all you learn from the code itself is what a function does. To understand why, you usually need more context. A function name is a simple device to provide that context.

get_patient_duration <- function(df) {
  tapply(df$date, df$id, function(date) max(date) - min(date))
}

LISTING 3.3: Get the duration of each patient's participation in the study.

So what does this function do? We already know that the feature counts in the dataset vary quite a bit. Since each patient is different, we can't make assumptions about how many readings each patient has. For each patient, the get_patient_duration function computes the duration of that patient's participation in the study. Initially, we'll assume that data exists for each day, but that will need to be verified as well. □

Creating multiple .R files is good practice. What do you do if you change multiple files and need to reload them? It gets annoying having to source each file individually. If you're not ready to create a package, a simple trick is to create a separate file that is responsible for sourcing all the other files. I generally call this file .init.R and only keep it around locally. Loading all your .R files then requires a single call: source("R/.init.R"). This temporary file is only needed until you convert your project into a formal package, which is discussed in Chapter 5.
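A minimal sketch of such a file follows. It assumes all source files live under R/; because list.files omits hidden files by default, the file does not source itself.

# R/.init.R: source every .R file in the R directory
invisible(lapply(list.files("R", pattern = "\\.R$", full.names = TRUE), source))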

3.3 Working with dependencies

Good programmers are known for being "lazy" because they minimize the work to be done. For example, reusing a function is a form of laziness, since you aren't rewriting the same code over and over. This concept is known as the Don't Repeat Yourself (DRY) principle. Taken to the extreme, DRY dictates that you should never implement something that has already been implemented. Unchecked, this philosophy can lead to a mass of dependencies in your code that ultimately adds complexity to the software. Writing software is ultimately a multi-objective optimization problem, where DRY is but one objective. It needs to be balanced with simplicity and ease of use to be truly effective. Too many dependencies can be counterproductive, increasing build times and the possibility of unforeseen dependency issues, like conflicts or breaking changes. For fairly trivial operations, I tend to re-implement a function in lieu of adding a dependency. For example, I rarely use a library to calculate precision, recall, or a confusion matrix (a sketch appears at the end of this section). These are only a few lines of code, and I benefit from knowing exactly how they are implemented. In short, to decide whether you should include a dependency or implement something yourself, optimize on both DRY and simplicity.

Package dependencies are listed in the DESCRIPTION file. The convention is to list dependencies under the Imports section and not the Depends section. Imports do not expose the package's dependencies to the user of the package. They used to be exposed in older versions of R, and this caused namespace collisions: any symbol defined by a dependency could potentially overwrite a variable defined in your workspace. The Imports section solves this issue by only importing the public symbols declared in the package's NAMESPACE file. Within this file, a similar convention exists: only export functions that end users care about. When users have access to all of the functions in your package, it creates additional dependencies that are difficult to remove later. By only exposing public facades (the API), you, as a package developer, preserve the flexibility to change the implementation behind the interface without affecting users.
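As a minimal sketch of what those few lines can look like (binary labels are assumed, and the positive class label is a parameter):

# Confusion matrix, precision, and recall from actual vs. predicted labels
confusion <- function(actual, predicted) table(actual, predicted)
precision <- function(cm, positive="1") cm[positive, positive] / sum(cm[, positive])
recall    <- function(cm, positive="1") cm[positive, positive] / sum(cm[positive, ])

For example, cm <- confusion(actual, predicted) followed by precision(cm) and recall(cm) recovers the usual definitions: true positives divided by predicted positives, and true positives divided by actual positives, respectively.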

3.4 Documenting your work

Mastering programming is not just about writing code. It's also necessary to read code, whether your own or someone else's. Code literacy is directly proportional to how quickly you can understand how an algorithm works. This is necessary not just for debugging, but for evaluating the correctness of a peer's model and for reusing someone else's code. I discussed documentation in relation to literate programming in Section 2.3. In addition to documenting the code itself, it is helpful to provide high-level package documentation. This documentation describes the purpose of the package, how to install it, and how to use it. Some of this may appear in the R package documentation or in a vignette, though a more common location is a README-like file in the project root. R looks for either a plain text README or a README.md. If the latter, it will use pandoc to parse it and generate corresponding HTML documentation for the package. One benefit of using a README.md is that it is compatible with Github and will be rendered automatically there as well. If your source code is hosted on Github and you are using Travis CI (see Chapter 5.5), another convention is to add the build status to the README.md as an image that updates based on the build status as determined by Travis. This is a convenient way for users to know the current state of your development code and whether it is stable.

4 Source code management

There’s an old saying about those who forget history. I don’t remember it, but it’s good.

Stephen Colbert

Depending on your disposition, accidents are either something to be avoided or a chance for serendipity. Accidents with source code are usually something to be avoided and relate to unintentional deletion or replacement of code. It can be painful trying to recover or recreate code previously written. To avoid this frustration, it's wise to use a source code repository, or repo for short. Source code management tools provide such a repository for tracking changes to source code. They typically live in the cloud, on a server with appropriate redundancy to protect against failure and deletion. These days, most people use git as their SCM of choice. Written by the creator of Linux, git is orders of magnitude more powerful than older SCMs. The drawback is that it's also more difficult to learn. This section covers a few core concepts to get you familiar with git. For a more thorough discussion, see [9].

Git is based on a model of distributed development, which is how Linux is developed. Each local repository is a complete repository with a full history. Users either push changes from their local repository to another repository or pull changes from another repository into their local repository. Unlike other SCMs, git has no concept of a master repository. In a distributed model, no repository is a single point of "truth". The closest thing to truth is the release manager's repository, which will have a record of published releases. Many people would rather have a central repository, even if it's unnecessary. A central repository acts as the single point of truth that everyone can reference. Web-based repositories like Github, GitLab, and BitBucket are used for this purpose. This remote, central repository is typically called origin. Figure 4.1 illustrates the relationship between multiple repositories. In this model, you make changes to your local repository. You commit these changes locally. When you want to share your work with others, you push your repository to another repository, usually origin. Other users then pull your changes from the central repository. If no central repository is used, only a push or a pull is needed to synchronize two repositories.

All commands in git start with git, followed by the command name and then any optional arguments. For example, to initialize a repository you use the command git init.



FIGURE 4.1: Git is a distributed source code management system. Each circle represents a complete repository. The origin is the central repository that acts as truth. While a central repository is optional, it simplifies coordination. For example, Alice commits changes to her local repository. She then pushes her changes to origin. Betty and Carla pull from origin to retrieve her changes.

Many commands do not have required arguments. Other commands, like git push, need to know the repository and branch to push to. Table 4.1 summarizes some useful git commands.

Example 4.1. We created a few .R files for the diabetes example in Chapter 3. This is a good time to create a repository for the project.

$ cd ~/workspace/diabetes
$ git init
Initialized empty Git repository in /home/brian/workspace/diabetes/.git/

This command creates a hidden directory within your project called .git. This is the actual repository that tracks all the commits and branches. Very rarely will you need to interact with this folder directly. In fact, it's best to leave it alone; just be sure not to delete it! Back to our project. You can now check the status of your newly created repository. There won't be anything in it, and git will tell you what is not tracked in the repository.

$ git status
On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        R/
        private/

nothing added to commit but untracked files present (use "git add" to track)

Command        Purpose
git init       Initialize a local repository
git status     Check status of current repository
git add        Add a file or directory to the repository
git rm         Remove a file or directory from the repository
git commit     Commit changes to the repository
git tag        Label a commit with a name
git log        View a history of commits for a file or the repository
git blame      View the last person to commit each line within a file
git diff       Compare two versions of the repository
git branch     Show current branch
git checkout   Check out a specific commit or branch
git rebase     Apply one branch on another
git merge      Merge two branches together
git fetch      Retrieve changes from a remote repository
git pull       Retrieve and merge changes from a remote repository
git push       Push local changes to a remote repository

TABLE 4.1: Common git commands and their use.

The next step will be to add files to the repository so changes can be tracked. Version management is the essence of a source repository and is discussed in the next section. □

4.1 Version management

The primary benefit of repositories is that all changes to code are tracked. That means you can view any past version of your code whenever you want. The unit of change in a repository is the commit. Commits can represent changes in multiple files. They can be as small or as large as you want. Just remember that commits are atomic – you can't divide them into smaller commits. 1 This is most important within a single file. If you make multiple changes to a file in the same commit, you can't retrieve the intermediate changes later on.

1 You can, however, merge multiple commits into a single commit. See git squash for examples.

Only files tracked by the repository can be committed. The first step is to git add the files you want to track. You can add specific files or complete directories. If you specify git add ., all files and folders in the current working directory will be recursively added to the repository. I generally don't do this until I have added a .gitignore to tell git what files I don't want tracked. Otherwise I may end up with a lot of garbage in my repository, which isn't a good way to start. Once files are tracked by git, git commit will commit the current changes to the repository. Like adding files, a commit can operate on specific files, directories, or the repository as a whole. For this operation I usually use the -a switch to commit all changes. When would you not want to commit all changes at once? Suppose you already have changes in your local repository. Then a collaborator asks you to make a change for her. You make the change and want to share it, but you don't want to share all your changes. In this case you might just commit the change for your collaborator, push that, and then carry on with your work.

Example 4.2. You can commit single files or all changes to the repository in a single commit. I tend to do the latter (but only by first confirming changes with git status) by using the -a switch. The -m switch tells git the message to use for the commit. This message is what you see in the git log, so it should be something descriptive to jog your memory. If you don't provide the -m switch, git will open up a text editor and ask you to provide a message.

$ git add R
$ git commit -am "Initial commit"
[master (root-commit) c4f2976] Initial commit
 2 files changed, 81 insertions(+)
 create mode 100644 R/explore.R
 create mode 100644 R/read.R

You can verify the commit by viewing the log. Notice that the message I used in the commit appears in the log output.

$ git log
commit c4f2976eb94093322a01dc6f32391ff15fb0f35a
Author: Brian Lee Yung Rowe
Date:   Wed Jun 6 13:58:40 2018 -0400

    Initial commit

The first line of the output shows the commit hash. This hash uniquely identifies the commit. �
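To make the collaborator scenario above concrete, here is a minimal sketch of committing and sharing only one file. The file name and commit message are illustrative, and the final push assumes a remote repository has already been configured (see Section 4.4).

$ git add R/read.R                          # stage only the file your collaborator needs
$ git commit -m "Fix reader for collaborator"
$ git push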

To view the previous state of a repository, use git checkout. This command takes either a commit hash (see Example 4.2), a tag, or a branch. Your current position in the commit graph is called HEAD. Usually HEAD points to the latest commit in a branch. However, if you check out a specific, older commit, HEAD will move there, leaving you in a so-called "detached HEAD" state. To revert to a normal state, just check out master again.
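As a short sketch, using the commit hash from Example 4.2 (any commit reference works), inspecting an old commit and then returning to the branch tip looks like this:

$ git checkout c4f2976     # HEAD now points directly at this commit (detached)
$ git log --oneline        # history as it existed at that commit
$ git checkout master      # reattach HEAD to the tip of master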

Example 4.3. Our commit history is pretty small, so it's difficult to see how HEAD works. Let's write a few more functions and add them to our repository. Add the function filter_events to explore.R. This function removes unused features from the data frame produced by read_diabetes. It's easier to work with the dataset when we remove superfluous data. The features in keep are those that I determined have sufficient data for an analysis.

filter_events <- function(df) {
  keep <- c(33,34,58,60,62)
  df[df$feature %in% keep,]
}

LISTING 4.1: Remove unused features from the diabetes data frame.

Before adding the next function, let's commit this change to the repository.

$ git commit -am "Add filter_events"
[master 04e6ec5] Add filter_events
 1 file changed, 11 insertions(+)

This is strictly for pedagogical purposes, so you have a slightly larger commit history to work with. Normally, commits don't need to be this small. Now let's add a function to analyze the times at which events were recorded for a given feature. Eventually, we'll want to normalize this data. A first step is just seeing what's there.

get_event_times <- function(df, feature, id=NULL) {
  if (is.null(id))
    df$time[df$feature==feature]
  else
    df$time[df$feature==feature & df$id==id]
}

LISTING 4.2: Get the time of each event of a given feature.

Let's commit this change to the repository as well.

$ git commit -am "Add get_event_times"
[master f40b865] Add get_event_times
 1 file changed, 12 insertions(+)

This gives us three commits in our history to work with. We'll use this nascent commit history throughout this chapter. �

What exactly are commits? Put simply, they are a set of changes, called diffs, from the previous commit. These changes are produced by the Unix utility diff, which was originally written in 1974. [18] Akin to a line-oriented Levenshtein distance, diff shows the additions and deletions required to transform one file into another. This record of changes is what git uses to actually transform one file into another. Storing just the diffs is more space-efficient than storing complete files for every version. The differences between two commits can be viewed with git diff. As with git checkout, any reference to a commit can be used. The syntax is

git diff commit1..commit2

where the output indicates the edits needed to transform commit1 into commit2. If only one commit is referenced, it is assumed to be commit1. The other commit is assumed to be HEAD.

Example 4.4. Let's compare the differences between two commits. How do we get the hash of a previous commit? Calling git log will show us a history that provides the hashes. For a more compact view that also shows branches, we can display the history as a graph. 2

$ git log --oneline --decorate --graph --all
* f40b865 (HEAD -> master) Add get_event_times
* 04e6ec5 Add filter_events
* c4f2976 Initial commit

A portion of each commit hash is shown followed by the commit comment. Git also shows us where HEAD is, as well as our master branch. The diff between HEAD and the previous commit is obtained with

$ git diff 04e6ec5
diff --git a/R/explore.R b/R/explore.R
index 6e4f765..f76ddcb 100644
--- a/R/explore.R
+++ b/R/explore.R
@@ -40,3 +40,15 @@ filter_events <- function(df) {
 }

+#' table(get_event_times(df, 58))
+get_event_times <- function(df, feature, id=NULL) {
+  if (is.null(id))
+    df$time[df$feature==feature]
+  else
+    df$time[df$feature==feature & df$id==id]
+}
+

The + indicates lines that must be added to 04e6ec5 to produce the version in HEAD. Similarly, lines starting with - indicate deletions. �

4.2 Branches

Suppose you have a model running somewhere. You're working on a new analysis, and then you find out your current model is broken. How do you make a fix and deploy it without introducing your new, incomplete changes? Branches solve this problem. A branch is simply a pointer to a specific commit.

2 If you prefer a more interactive UI, an alternative is gitk.

[Figure 4.2 diagram: master consists of c4f297, 04e6ec, f40b86, and a439f9 (HEAD); feat/new_feature branches off after f40b86 with 90c99b and 100727.]

FIGURE 4.2: An example commit tree. Arrows point in the direction of the parent commit. Commits are appended at HEAD, which corresponds to a branch. Switching branches changes the location of HEAD, which changes where commits are appended.

Effectively though, branches represent a whole chain of commits starting from the first commit. When a repository is initialized, there is just one branch. By default, this branch is called master. New branches are created using the command

git checkout -b <branch name>

Upon creation, this branch points to the same commit as the current branch. Once commits are added, though, the commit chain will be distinct from the commit chains in master. Commits will be added wherever HEAD is located. So switching branches requires calling git checkout with a different branch name. Figure 4.2 shows an example of a branch that departs from master. Both master and feat/new_feature have different commits following the commit at f40b86.

You can name a branch anything you want. For predictability, you may want to follow some conventions. Here are a few that I use. If the branch is personal, it is usually prefixed with your name or initials. Suppose I create a new branch to work with more features. I would call this branch br/new_feature. If I plan on collaborating with someone on this branch, prefixing with my initials isn't appropriate. In this case the prefix feat indicates that the branch is a feature branch. You can define your own conventions if you choose. The key is to apply your conventions consistently so others know what to expect.

Example 4.5. Let's add a branch to our diabetes project. We'll pretend that this branch is for collaboration.

$ git checkout -b feat/new_feature

With this command, git has created a new branch and pointed HEAD to it. We can verify this by viewing the log.

$ git log --oneline --decorate --graph --all
* f40b865 (HEAD -> feat/new_feature, master) Add get_event_times
* 04e6ec5 Add filter_events

* c4f2976 Initial commit

Any new commits will be along the feat/new_feature branch. To confirm this behavior, add a function stub to explore.R.

normalize_times <- function(df) {
  # Implement this later
}

Save and commit your changes to the repository. Viewing the log will show how the new commit is associated with feat/new_feature and not master.

$ git commit -am "Add stub for normalize_times"
[feat/new_feature 90c99b1] Add stub for normalize_times
 1 file changed, 2 insertions(+), 1 deletion(-)
$ git log --oneline --decorate --graph --all
* 90c99b1 (HEAD -> feat/new_feature) Add stub for normalize_times
* f40b865 (master) Add get_event_times
* 04e6ec5 Add filter_events
* c4f2976 Initial commit

This output shows that the feature branch has advanced to the new commit. The master branch is still at the older commit. Now switch back to the master branch.

$ git checkout master
Switched to branch 'master'

Any changes you make now will create a new commit along the master branch. If you open explore.R, you'll find that your function stub is not there! Don't worry, you just need to switch back to the feat/new_feature branch to see those changes. For now, stay on master and edit explore.R. Commit the change. View the log again, and you should see a branching tree-like structure that is congruent with Figure 4.2.

$ git log --oneline --decorate --graph --all
* a439f9b (HEAD -> master) Add stub for examples for section 2.4
| * 1007278 (feat/new_feature) Clean up file
| * 90c99b1 Add stub for normalize_times
|/
* f40b865 Add get_event_times
* 04e6ec5 Add filter_events
* c4f2976 Initial commit

This view shows the latest commits first, with the oldest commit at the bottom. The commit f40b865 is a common ancestor to both branches. �

4.3 Merging branches

The killer feature of git is the disposable branch. It's cheap and easy to both create and destroy branches.

[Figure 4.3 diagram: master advances from a439f9 to the merge commit b5b362 (HEAD); feat/new_feature still points to 100727.]

FIGURE 4.3: A merge commit has two parent commits.

Most collaboration workflows in git revolve around branches. These branches are typically short-lived and only exist until the code in the branch is ready to be merged with another branch, typically master. 3

There are two approaches to merging branches. One is the eponymous git merge, which preserves commit history. With this approach, git creates a new commit that reconciles changes between your two branches and their common ancestor. This merge is known as a three-way merge and is shown in Figure 4.3.

Example 4.6. To see how a merge operation works, we can merge the branch feat/new_feature into master.

$ git merge feat/new_feature
Auto-merging R/explore.R
CONFLICT (content): Merge conflict in R/explore.R
Automatic merge failed; fix conflicts and then commit the result.

Ironically, git is pretty smart. Still, it won't always be able to deduce the correct way to merge the changes. When this happens, git will complain about a merge conflict. Each file with a conflict is indicated in the output. In this case it's our file explore.R. To fix the conflict, edit the file. Git inserts conflict markers into the file that delimit the competing changes from each branch (a sketch of what this looks like follows the git status output below). Edit the file accordingly and save. The file has been fixed, but you still need to tell git that you resolved the conflict.

$ vi R/explore.R
$ git status
On branch master
You have unmerged paths.
  (fix conflicts and run "git commit")

Unmerged paths:
  (use "git add <file>..." to mark resolution)

3 Other branch management strategies exist, which are out of scope for this book.

both modified: R/explore.R

Untracked files:
  (use "git add <file>..." to include in what will be committed)

private/

no changes added to commit (use "git add" and/or "git commit -a")
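For orientation, the conflicted region of explore.R will look roughly like the sketch below. The markers are standard git conflict markers; the code between them depends on the edits made on each branch, so the bodies shown here are purely illustrative.

<<<<<<< HEAD
# the edits committed on master (the stubs added for section 2.4)
=======
normalize_times <- function(df) {
  # Implement this later
}
>>>>>>> feat/new_feature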

Follow the instructions given by git status and add the resolved file, then commit the change like any other commit.

$ git add R/explore.R
$ git commit -am "Fix merge conflicts"
[master b5b3627] Fix merge conflicts

Let's check the log once more to see the result of the merge.

$ git log --oneline --decorate --graph --all
*   b5b3627 (HEAD -> master) Fix merge conflicts
|\
| * 1007278 (feat/new_feature) Clean up file
| * 90c99b1 Add stub for normalize_times
* | a439f9b Add stub for examples for section 2.4
|/
* f40b865 Add get_event_times
* 04e6ec5 Add filter_events
* c4f2976 Initial commit

We can see that the special merge commit has two parents, unlike regular commits. The master branch points to this new commit. At this point it is safe to delete feat/new_feature with

$ git branch -d feat/new_feature
Deleted branch feat/new_feature (was 1007278). �

Another approach to merging two branches is to rebase one branch onto another. Instead of preserving separate commit histories, git rebase collapses the two branches into a single, linear history. This is possible since each commit is a diff, and git is capable of re-applying diffs in a different order. Some people prefer a rebase over a merge since the commit history stays linear. However, one challenge with a rebase is that the rebased commits are given new hashes, since each one is re-applied on top of a new parent and may be modified to resolve conflicts. This can make it harder to trace issues later on.

Example 4.7. Suppose we choose to do a rebase instead of the merge in Example 4.6. For a rebase we work in the feature branch, which is the opposite of the merge operation. The first thing git does is rewind to the common ancestor and then replay each commit of the feature branch on top of master.

$ git checkout feat/new_feature
$ git rebase master
First, rewinding head to replay your work on top of it...

[Figure 4.4 diagram: a single linear chain c4f297, 04e6ec, f40b86, a439f9, f89cfc, 663389 after the rebase, with HEAD, master, and feat/new_feature labeled.]

FIGURE 4.4: A rebase applies one branch onto another, creating a linear history.

Applying: Add stub for normalize_times
Using index info to reconstruct a base tree...
M       R/explore.R
Falling back to patching base and 3-way merge...
Auto-merging R/explore.R
CONFLICT (content): Merge conflict in R/explore.R
error: Failed to merge in the changes.
Patch failed at 0001 Add stub for normalize_times
The copy of the patch that failed is found in:
   .git/rebase-apply/patch

When you have resolved this problem, run "git rebase --continue".
If you prefer to skip this patch, run "git rebase --skip" instead.
To check out the original branch and stop rebasing, run "git rebase --abort".

Unfortunately, git still doesn't know how to handle our conflict, so we have to manually resolve it.

$ vi R/explore.R
$ git status
rebase in progress; onto a439f9b
You are currently rebasing branch 'feat/new_feature' on 'a439f9b'.
  (fix conflicts and then run "git rebase --continue")
  (use "git rebase --skip" to skip this patch)
  (use "git rebase --abort" to check out the original branch)

Unmerged paths:
  (use "git reset HEAD <file>..." to unstage)
  (use "git add <file>..." to mark resolution)

both modified: R/explore.R

Untracked files:
  (use "git add <file>..." to include in what will be committed)

  private/

no changes added to commit (use "git add" and/or "git commit -a")

We resolve the conflict the same way as with a merge, by running git add on the conflicted file. Instead of committing the change, we tell git to continue the rebase operation. It will then apply all of the commits from the common ancestor to feat/new_feature in order.

$ git add R/explore.R
$ git rebase --continue

Applying: Add stub for normalize_times
Applying: Clean up file

The end result is a linear history, where feat/new_feature is now a descendant of master.

$ git log --oneline --decorate --graph --all
* 6633890 (HEAD -> feat/new_feature) Clean up file
* f89cfcf Add stub for normalize_times
* a439f9b (master) Add stub for examples for section 2.4
* f40b865 Add get_event_times
* 04e6ec5 Add filter_events
* c4f2976 Initial commit

Because the commits were replayed during the rebase (and modified to resolve the conflict), they differ from the original commits in feat/new_feature. This results in new commit hashes, which is sometimes problematic. For now we'll safely ignore this detail. To fully integrate feat/new_feature into master, we need to merge the branch. The first step is to check out master.

$ git checkout master
Switched to branch 'master'

Now merge in feat/new_feature. Notice that this is a so-called fast-forward merge. To merge in the branch, the branch pointer for master just needs to be moved to coincide with the pointer for feat/new_feature.

$ git merge feat/new_feature
Updating a439f9b..6633890
Fast-forward
 R/explore.R | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Once the branch has been merged, it can be safely deleted.

$ git branch -d feat/new_feature
Deleted branch feat/new_feature (was 6633890).

It's good practice to delete merged branches. Like anything else, periodic maintenance goes a long way in keeping things organized and manageable. �

So rebasing branches takes more steps than a simple merge, and commit hashes will change. Why in the world would we prefer this to a simple merge? One reason is that fast-forward merges are easier on maintainers. As a dedicated role, maintainers don't have the context related to the changes a developer makes to her code. With a rebase operation, resolving merge conflicts stays with the person who has that context, which is a more efficient process.

The rebase operation is also a very fine instrument. It's not uncommon for repositories to go horribly wrong for one reason or another. We're human after all. With git rebase it's possible to perform surgery on a repository and save the countless hours or days necessary to recreate changes lost in a repo due to cavalier commits.
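One such surgical tool is the interactive rebase. The sketch below is illustrative rather than something to run verbatim on the diabetes repository: git rebase -i opens the listed commits in an editor, and changing the command at the start of a line (pick, squash, drop, and so on) reorders, combines, or discards commits when you save and exit.

$ git rebase -i HEAD~3
# git opens an editor containing a todo list along these lines:
pick 04e6ec5 Add filter_events       # keep this commit as-is
squash f40b865 Add get_event_times   # fold this commit into the previous one
drop a439f9b Add stub for examples   # discard this commit entirely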

4.4 Remote repositories

So far, all the operations we've performed in git have been local to our repository. What if we use a remote repository as our public repo? There are two common scenarios to consider. If we create a package and want to share it, it's likely we already have a repository and just need to push it to a remote repository. To do this, we need to create the remote target repository first. Alternatively, if we are using someone else's code, we'll first clone the repository into our local workspace.

Example 4.8. Suppose you want to share the diabetes project with a collaborator. Any of the web-based git providers can host your repository. Create the repository in your host of choice. Since you already have a local git repository, all you need to do is tell your local repository where your remote repository is.

$ git remote add origin https://github.com/profbrowe/diabetes.git

This command tells git to associate the remote repository with the name "origin". Assuming you have an account on Github and have properly authenticated, you can now push your current branch to origin, specifying the target branch. 4

$ git push -u origin master
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (22/22), done.
Writing objects: 100% (29/29), 3.11 KiB | 0 bytes/s, done.
Total 29 (delta 6), reused 0 (delta 0)
To https://github.com/profbrowe/diabetes.git
 * [new branch]      master -> master
Branch master set up to track remote branch master from origin.
updating local tracking ref 'refs/remotes/origin/master'

The -u switch tells your local repository to track this remote repository, creating a link between the two using the name origin. You only need to use this option the first time.

If a collaborator pushes changes to origin, you may want to retrieve her changes. You can retrieve the contents of a remote repository with git fetch. This command downloads the commit history of the remote repository into a special namespace called remotes. Fetching a remote branch does not change your local working copy. Integrating remote changes thus requires a merge operation, because git treats this remote branch (living in your local repository) like any other branch. These two operations are "conveniently" combined in the git pull command. However, I rarely use this because I may accidentally merge master into the wrong branch or a dirty workspace. This creates

4 If you are following along, you need to create your own repository. Pushing to my repository will fail.

avoidable headaches, such as fixing merge conflicts or deleting commits. I almost always fetch first to see if there are any changes to worry about. If so, I run git status to make sure my repository is in the right state to merge the remote branch. If not, I'll either create a new branch or git stash my changes. Once my local repository is ready, I merge in the remote branch.

Example 4.9. To see how remote merges work, find a friend to help you or clone the repository a second time into a new directory. Make some changes and push to origin. Now go to your original repository and fetch and merge.

$ git checkout master
$ git fetch
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 3 (delta 0), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (3/3), done.
From https://github.com/profbrowe/diabetes
   6633890..b025354  master     -> origin/master
$ git merge origin/master
�

Merge syntax is different from push and pull. With a merge, you specify the local branch corresponding to the remote branch. This is a single branch name. Push and pull operations specify a remote repository followed by the name of the branch in the remote repository.

Example 4.10. Consider this alternative scenario. Your collaborator is working on a feature branch and shares it with you via origin. This is the first time you'll work on it, so you need to check out the branch. The first step is to fetch the latest changes of the remote repository.

$ git fetch
remote: Counting objects: 4, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 4 (delta 0), reused 4 (delta 0), pack-reused 0
Unpacking objects: 100% (4/4), done.
From https://github.com/profbrowe/diabetes
 * [new branch]      feat/visualization -> origin/feat/visualization

Remember, when you call git fetch, it's downloading the remote commit history into your local repository. The last line of the output tells you exactly this: it takes the remote branch called feat/visualization and links it to a local branch called origin/feat/visualization. The checkout command operates on this local branch that mirrors the remote branch. This is why the branch name is prefixed with origin.

$ git checkout -t origin/feat/visualization
Branch feat/visualization set up to track remote branch feat/visualization from origin.
Switched to a new branch 'feat/visualization'

Once you check out the remote branch, git informs you that it's tracking it. Any changes you make will be associated with your local branch. It's only when you push to origin that those changes will be reflected in your local mirror of origin, in addition to origin itself. �
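Pulling the cautious workflow described earlier in this section together, a minimal sketch looks like the following. The stash steps are only needed when git status reports a dirty working tree, and the branch name is illustrative.

$ git fetch                  # download remote history without touching local files
$ git status                 # check whether the working tree is clean
$ git stash                  # if not, set local edits aside temporarily
$ git merge origin/master    # integrate the remote branch
$ git stash pop              # restore the stashed edits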

4.5 Exercises

Exercise 4.1. It's not necessary to provide the complete commit hash to check out a specific commit. Starting with the left-most characters of the hash, determine the minimum number of characters needed for git to recognize your commit hash.

Exercise 4.2. Suppose you are on the master branch of the diabetes repo. What is the output of git diff master? (a) Does this output make sense? Explain. (b) Under what circumstance would the output be different?

Exercise 4.3. Suppose you clone a project directly and later want to fork it. It's unnecessary to re-clone the repo after forking. Instead, use git remote set-url to change the location of origin.

Exercise 4.4. Check out an older commit. Create a branch from that commit and then add some commits. Use git log --oneline --graph to show the structure of the repository.

Exercise 4.5. A common challenge with git is when a file is accidentally added to a repository. Since the git repo contains all history, large files can bloat a repo and make it harder for other people to clone it. The git filter-branch operation can remove commits completely from a repo. Remove a commit from the diabetes repo. Show the commit graph before and after the change.

5 Formalizing code in a package

Repetition is the only form of permanence that nature can achieve.

George Santayana

Unlike pure software engineering, it's rare for modelers to create a package at the beginning of a project. Projects may not reach a maturity level that dictates a self-contained project, so it's inconvenient to create a lot of structure early on. Instead, it's better to add structure when needed, usually after a model shows some promise or needs to be more widely distributed. Another good time to add structure is when you begin to lose faith in your application. When it becomes unreliable and is no longer predictable, you need to start writing tests. This is the time to transition your scripts into a formal package. In other words, you are preparing your work to be repeatable and collaborative.

Packages thus have value even if they aren't published to CRAN. For starters, packages automate testing. Rather than testing functions ad hoc, all tests can be run when a package is built. Coupled with continuous integration tools (see Section 5.5), this means less time worrying whether your software will work as expected. It's better to know that your code works, rather than hope that it works. Packages also simplify dependency management, since they explicitly define your code dependencies. Installation piggy-backs off the R package installation process, making it easier to verify portability. Packaged models can usually be shared as is, without needing to provide a complete model environment. External documentation is also kept to a minimum, since this work has been bundled into the package itself.

5.1 Creating a package

There are many ways to create a package. We'll use my crant utility to initialize a package using the scripts you already have. Crant tries to minimize the amount of manual work required by inferring as much information from your existing R scripts as possible. The init_package script that comes with

crant recognizes that the decision to package code may happen later in the development process (as does package.skeleton). It uses this observation to generate package metadata with the correct information, minimizing the post-creation steps needed to get the package to build. Second, crant is largely unopinionated. 1 It doesn't force you to do things a particular way, so you retain your freedom and independence as a modeler.

The init_package script reads all .R files in an R directory. It constructs the DESCRIPTION and NAMESPACE files required for R packages using the information in the scripts. The script is non-destructive, so if these files already exist, it doesn't overwrite them. Most importantly, it finds all the dependencies and adds them to the DESCRIPTION file. It also provides a working stub for tests. The end result is a package that can be built with minimal work.

Example 5.1. It's possible to create a fully buildable package by specifying three arguments to init_package: the package title, the author of the package, and the description.

$ cd ~/workspace/diabetes
$ init_package -t 'Analyze Diabetes Patients' \
>   -a "Author Name" \
>   -d "The description. It needs to contain at least two sentences."
Create .Rbuildignore
Create .gitignore
Create package descriptor R/diabetes-package.R
Create DESCRIPTION file
Create NAMESPACE file
Set up test harness
Create Travis CI build script
To complete initialization be sure to complete R/diabetes-package.R
and possibly DESCRIPTION (if you didn't provide title, author, etc.).
Then commit and run crant -x to generate documentation, test, and build

Like git, crant tries to be helpful and guide you through next steps, so you know what to do. An easy way to see what was created is to check the git status.

$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)

  .Rbuildignore
  .gitignore
  .travis.yml
  DESCRIPTION
  Dockerfile
  Makefile
  NAMESPACE
  R/diabetes-package.R
  tests/

nothing added to commit but untracked files present (use "git add" to track)

1 That said, two executive decisions are made: one for documentation and one for testing. ROxygen2 is used for documentation, and testit is used for unit testing. These decisions are optional though and can be easily changed.

Lo and behold, we have a bunch of new files. Add them like any other files and commit them to your repository.

$ git add .
$ git commit -am "Add package metadata"
[master c425cba] Add package metadata
 10 files changed, 110 insertions(+)
 create mode 100644 .Rbuildignore
 create mode 100644 .gitignore
 create mode 100644 .travis.yml
 create mode 100644 DESCRIPTION
 create mode 100644 Dockerfile
 create mode 100644 Makefile
 create mode 100644 NAMESPACE
 create mode 100644 R/diabetes-package.R
 create mode 100644 tests/test-all.R
 create mode 100644 tests/testit/test-example.R

Documentation can now be generated and the package built. �

5.2 Building a package

Once the package metadata has been generated, you can build and install the package however you see fit. With crant, the line

$ crant -xi

will generate documentation, build, run tests, and install locally. The -x switch will run ROxygen2 to generate the documentation and commit the changes to the git repository. ROxygen2 generates the R package documentation inline. This switch is only necessary when the documentation changes.

When building, you'll see a NOTE in the package check. Follow the instructions to update your NAMESPACE file appropriately. Inside, you'll notice that the dependency on futile.logger has already been added by crant. The script also adds all the source files to the collate section. If you add more files, you'll need to manually add them to the DESCRIPTION file. Use

$ git commit -am "Add import to NAMESPACE"
$ crant -i

to rebuild and reinstall the package. The -x switch is unnecessary this time since no documentation has changed. In your R session, load the package using the standard library command. If you have already loaded the package in your session, you may want to remove all variables in your R workspace first.

> rm(list=ls())
> library(diabetes)

There are a few ways to verify that the package has been loaded. One way is to use the sessionInfo function to inspect the current state of your session. This function lists the version of R you're using, along with the versions of the packages currently loaded.

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] futile.logger_1.4.3

loaded via a namespace (and not attached):
[1] compiler_3.4.4       formatR_1.5          lambda.r_1.2.1
[4] futile.options_1.0.1

This function also shows you how a package was loaded. Some packages depend on other packages. These packages are imported but are not directly accessible. To use them in your session, you need to attach them by explicitly calling library or require.

During package development, you may decide to hold your version number constant. This is convenient, but it can make it more challenging to verify that the appropriate package code was loaded. One workaround is to explicitly inspect the contents of the package. The ls function can do this for you, showing all the defined functions in a package.

> ls('package:diabetes')
[1] "read_diabetes"

The body of an exported function can be viewed by typing the function name in the shell. For non-exported functions, the package name must be provided as a prefix followed by three colons, such as diabetes:::read_diabetes.

5.3 Using packages

During development, you'll be rebuilding and reinstalling your package regularly. The pure R approach to reloading a package first detaches it and then reloads it using library or require.

> detach('package:diabetes', unload=TRUE); library(diabetes)

If this is too verbose, devtools provides two alternatives. You can simulate a build directly in R using load_all. This function simulates loading the functions and data within your package.

> load_all()

This can be convenient, but it's important to remember that load_all only simulates a package build, and doesn't actually do one. The process isn't identical and can produce unexpected behavior. This is fine for personal work. When you want to publish or share your work, it's best to correctly build and load your package to ensure proper behavior.

Another approach uses the reload function, also in devtools. This function assumes you are developing a package and operates on package metadata. It thus requires a file path to load the metadata and attempts to unload and reload packages accessible by the path. If no path is provided, it assumes ".".

> reload()
Reloading installed diabetes

Attaching package: ‘diabetes’

The following object is masked _by_ ‘.GlobalEnv’:

    read_diabetes

If the package has not been loaded yet, reload will complain, so you'll have to use library to load it first. If you want to reload an installed package, an extra function call is required.

> reload(inst("diabetes"))
Reloading installed diabetes

Attaching package: ‘diabetes’

The following object is masked _by_ ‘.GlobalEnv’:

read_diabetes

The inst function will find the path to the installed package, so reload can operate on it like a local source package. On the one hand, this design preserves a consistent interface for reload. On the other hand, the opinions embedded within the design make usage more complicated. For the sake of convenience, you have to learn this tool's way of doing things. Most opinionated software suffers from this tradeoff: if you stick to the opinionated way of doing things, life is simple. To do anything else requires learning idiosyncratic semantics and commands. Unfortunately this knowledge is less transferable than learning how R works and using base commands.

For completeness, recall that I discussed a compromise in Section 2.6.3 that defines a private function responsible for reloading a package. This is usually only necessary if you are developing multiple packages simultaneously.

5.4 Testing packages

Even if code runs, there is no guarantee that the results are correct. This statement is particularly true of models, where it is easy to incorrectly implement an equation or algorithm. Only the foolhardy think that the code they've written is error free. The rest of us write tests to verify that code works as expected. The simplest tests are manual and rely on human inspection of results. While this may instill some initial confidence, these tests do not promote repeatability. To establish correctness and reliability, tests must be consistent and have adequate coverage. Test automation ensures the same tests are executed every time and becomes more convenient as test coverage increases.

Numerous packages exist for managing tests, including RUnit, testthat, and testit. Crant uses testit, because I tend to favor simplicity over functionality. The choice of framework is less important than actually choosing one and using it. These frameworks integrate with the R package build process. If you run init_package on your package, you'll find a test harness at tests/test-all.R and an example test file at tests/testit/test-example.R. The harness is responsible for loading the package being tested and executing the tests. The example test file implements a tautology, which acts as a workable example.

The syntax of tests in testit is simple: call the function assert and provide a label and a block of code to execute. Remember that code blocks are delineated by curly braces and can be used anywhere to encapsulate multiple statements. Tests are wrapped in parentheses and are expected to evaluate to TRUE. When the test is run, testit scans the code, looks for these assertions, and reports the status of each one.

Example 5.2. Let's implement the successor function and test it. This function is used in the Peano axioms to construct the natural numbers.

> succ <- function(x) x + 1

The following test case defines a number of individual tests to verify the behavior of the succ function. The logical tests can be interspersed between normal code. The structure of a test case block typically begins with initialization of input variables, followed by a function call to collect the output result. Finally, the output is compared with the expected value.

> assert('Successor function adds 1', {

+   x <- 3
+   (succ(x) > x)
+   y <- succ(x)
+   (y > x)
+   (x + succ(y) == succ(x + y))
+ })

Running this assert block in the console will return nothing, which means that all tests passed. On the other hand, if a test fails, testit will print out an error message associated with the first test failure.

> assert('Successor function with incorrect test implementation', {
+   x <- 3
+   (succ(x) == x)
+   (succ(x) == x + 1)
+   y <- succ(x)
+   (y == x)
+ })
assertion failed: Successor function with incorrect test implementation
Error: (succ(x) == x) is not TRUE
�

To automate tests, it's best to add them as new files in the tests directory. To be detected by testit, test file names need to begin with test. Since tests are run as part of the build process, all tests can be run by calling crant -S from the command line. The -S option tells crant that a dirty working directory is okay. Alternatively, a single test file can be sourced from an R session.

Knowing how to test is important. Knowing what to test will ensure that your tests are effective. Chapter 16 discusses different types of tests and how they relate to the model development process.

5.5 Continuous integration

Every change we make to a model can affect the results. The same is true of software. Even seemingly small, innocuous changes can have an out-sized effect. It would be nice to know about the effects of these changes before we start a large model job or share our model with others. Continuous integration is a technology that does just this. It was designed to integrate multiple branches of code into a mainline branch à la master, in order to identify breaking changes as quickly as possible. Coupled with test-driven development, CI tools can also run a suite of unit tests to verify system behavior and notify the offending programmer. Depending on the size of your project and the number of collaborators, continuous integration moves from being a convenience to a necessity. In many development processes, a version of code is not released into a production environment unless all tests run on a CI server pass. The concept of continuous integration can extend to continuous delivery, where code is automatically deployed to a production-like environment.

FIGURE 5.1: A special image indicates the build status of a project in Travis CI. The image automatically updates when the build status changes.

This approach simplifies integration testing and also enables non-technical stakeholders to test more quickly.

One option for continuous integration is Travis CI. It integrates easily with Github and is free for open source projects. 2 When you initialize a package with init_package, the script creates a hidden configuration file called .travis.yml. If your Github account is linked with Travis CI and your project activated, Travis CI will automatically build and test your package on three versions of R: devel, release, and oldrel. Testing against these three releases is a prerequisite of publishing to CRAN. It's also good practice, as it increases the chance that your code is compatible with an end-user's system. The devel version is the upcoming version of R currently in development. The release is the current publicly available R version. Finally, oldrel is the previous public release of R. Testing against all three of these versions increases the portability of your package.

Every project in Travis CI has a build status. This status is exposed via a small image that can be added to your repository. The URL to the image has the structure https://travis-ci.org/<username>/<project>.png, where <username> is your username and <project> is the name of your project. A common approach is to add it to the top of the README.md. For your diabetes project, prepend this line to your README.md, replacing my username with yours.

[![Build Status](https://travis-ci.org/profbrowe/diabetes.png)](https://travis-ci.org/profbrowe/diabetes)

When you push to your repository, you’ll see an image showing the current build status similar to Figure 5.1. When the build status changes, the image will update showing the new status.

2 See https://github.com/marketplace/travis-ci for adding Travis CI to your Github account.

5.6 Exercises

Exercise 5.1. To include a new dependency in a package, it needs to be added to the DESCRIPTION and NAMESPACE files. Let's include SVM as another classification method for the diabetes analysis. Add the e1071 package to the DESCRIPTION file under Imports. In the NAMESPACE file, use the importFrom directive to only add the svm function. Show the contents of your DESCRIPTION and NAMESPACE files.

Exercise 5.2. Add a test in a new file test-explore.R in the tests directory. Run crant -S and verify that your test was executed. What steps did you take to verify your test actually ran? Show the contents of your test and the relevant output from crant.

Exercise 5.3. Add Travis CI to your Github account. Update the README.md of your fork of the diabetes project with your project's build status image. Push to your repository and view the build status, which should be failing. Use the Travis CI logs to determine the reason for the failure. What needs to be corrected? Include a screenshot to support your diagnosis.

6 Developing with containers

Container technology has quickly become the de facto standard for building systems. Interestingly, containers have also infiltrated the development process, replacing virtual machines and other approaches to creating multiple isolated computing environments on a workstation. This makes repeatable science far simpler, since you can guarantee that your code will run in different environments without ever truly leaving yours. It also means that all this compatibility testing can be automated, reducing the amount of manual work the data scientist must do.

Like virtual machines, containers provide an isolated computing environment to run applications and systems. Unlike virtual machines, containers are lightweight, composable, and disposable. Containers are great for package developers, because you can test your package against different versions of R without having to uninstall and reinstall these versions on your actual workstation. The ability to easily create isolated computing environments is beneficial to all modelers, irrespective of whether a package will be published.

Developing inside containers keeps your local environment clean. This is important, since some tools have massive dependencies that may affect your underlying system. This is less an issue with R than with Python. Historically, changes to a Python environment needed for modeling had the potential to break scripts built to manage the operating system. Another example is the deep learning framework TensorFlow. In early releases, installing TensorFlow required downloading and installing a programming language (Go), a build tool (bazel), and other dependencies that added some 3 GB of additional software. If you wanted to later uninstall this system, you would be dependent on the TensorFlow maintainers to provide a correct uninstaller to restore your previous system state. Many solutions exist for managing dependencies. Containers provide a solution that transforms a headache into a trivial exercise.

Containers also facilitate repeatable science. My init_package script generates a Dockerfile that defines an isolated environment for your R package. The resulting image includes all the dependencies needed to run your package. The script also creates a Makefile that simplifies working with the Docker image and container. The result is a repeatable model environment that can reproduce results using just one command. And your collaborators now only have one dependency to satisfy: Docker.


1 FROM zatonovo/r-base
2
3 COPY . /app/diabetes
4 WORKDIR /app/diabetes
5 RUN crant -SCi

LISTING 6.1: A minimal Dockerfile for a custom package. This file is generated automatically by crant.

Anyone with Docker is guaranteed to be able to build and run your package, 1 and ultimately reproduce the results you achieved. This approach is far more efficient than the days before containers, when any little dependency had the potential to torpedo an installation if it was missing or out of date.

Containers are popular in large systems, providing the infrastructure for so-called microservices. With your model already in a container, it is easy to migrate your model to a production environment. This process may involve parallelizing computations. Since containers run independently and are identical, they can be used in a compute grid to speed up computations. Combined with cloud computing, a model can be created locally on a truncated dataset. When ready for a complete run, it can be easily deployed to the cloud into a virtual machine instance with far more computing resources. When the job is done, you can stop the process, so you only pay for the computing resources you need.

Are containers the silver bullet of the computing world? Unfortunately, it's not all daisies. Containers require a lot of disk space. This usually isn't an issue these days, but images can eat up a lot of storage. A bigger issue is that they add a layer of complexity. Not only do you need to have a basic understanding of container technology, but so do your collaborators. For simple package development this is usually fine, as the advanced features of containers are barely utilized. Things get more complicated when creating an interactive dashboard or notebook. Now peers need to understand how ports work and potentially volume mapping, among other things. One of the reasons that init_package creates a Makefile is to hide some of these details. The Makefile simplifies usage of Docker by automating and inferring many of the configuration options.

Recall that a Dockerfile defines an image. For example, the Dockerfile created by init_package is shown in Listing 6.1. Line 1 says that this image is derived from the image called zatonovo/r-base. I created this base image to hide a lot of the details related to system dependencies. The next line instructs the Docker engine to copy the contents of the current directory . into the image directory /app/diabetes. Line 4 sets the working directory to this newly created directory. Finally, the package is built using crant.

1 Nothing is fully guaranteed, and there are situations where local state can seep into a Docker container. This seepage can lower the probability of repeatability.

To build the image, a Docker command must be executed. A basic workflow showing the relationship between various Docker commands is shown in Figure 6.1. The first step is to call docker build. This reads the Dockerfile and builds the image based on the zatonovo/r-base image, which contains most of the dependencies you need. Alternatively, we can use the Makefile to call the command for us. The command sudo make must be run in the directory of the Makefile. 2 Without arguments, make runs the first target it finds, which is all.

$ sudo make
docker build -t diabetes:1.0.0 .
Sending build context to Docker daemon  156.7kB
Step 1/4 : FROM zatonovo/r-base
 ---> 3abbaada846b
Step 2/4 : COPY . /app/diabetes
 ---> 7b835946b617
Removing intermediate container f62537c91d1f
Step 3/4 : WORKDIR /app/diabetes
 ---> 2244b7431b1d
Removing intermediate container 9be82ba2f4a8
Step 4/4 : RUN crant -SCi
 ---> Running in dc762ce81c40
Running build chain on diabetes
* checking for file 'diabetes/DESCRIPTION' ... OK
* preparing 'diabetes':
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
Removed empty directory 'diabetes/man'
* building 'diabetes_1.0.0.tar.gz'

/app
Installing package diabetes
* installing to library '/usr/local/lib/R/site-library'
* installing *source* package 'diabetes' ...
** R
** preparing package for lazy loading
No man pages found in package 'diabetes'
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (diabetes)
Done with package diabetes
 ---> 373e8dd13420
Removing intermediate container dc762ce81c40
Successfully built 373e8dd13420
Successfully tagged diabetes:1.0.0
docker tag diabetes:1.0.0 diabetes:latest

The Makefile bundles the syntax to build an image into a simpler command. It also ensures we tag the image appropriately for consistent version management.

2 Depending on how you configured Docker, you may need to use sudo to temporarily grant yourself administrator rights.

You can verify that the image was built by querying the Docker daemon.

$ sudo docker images | grep diabetes
diabetes   1.0.0    dd0112f87c2c   About a minute ago   3.69GB
diabetes   latest   dd0112f87c2c   About a minute ago   3.69GB

where diabetes is the default name of the image defined in the Makefile. Docker maintains a registry of all the images that you've built. Images are built in layers, one command at a time. This makes it easy to reuse parts of an image or create a new image based on an existing image.

6.1 Working in a container

Now that we've built an image, let's create a container. A container is a temporary or semi-permanent runtime instance of an image. Persistent containers act like traditional virtual machines in that they preserve changes even when the container is stopped. This behavior is convenient but can also reduce repeatability, since a container's state will diverge from the image state. Temporary containers (specified by the switch --rm) are destroyed once they are stopped. This approach ensures that running containers are started in the same initial state as the image. Again, let's use the Makefile to start the container with make run. The Makefile always shows the underlying Docker command, so you aren't bound to the Makefile for life.

$ sudo make run
docker build -t diabetes:1.0.0 .
Sending build context to Docker daemon  156.7kB
Step 1/4 : FROM zatonovo/r-base
 ---> 3abbaada846b
Step 2/4 : COPY . /app/diabetes
 ---> Using cache
 ---> 7b835946b617
Step 3/4 : WORKDIR /app/diabetes
 ---> Using cache
 ---> 2244b7431b1d
Step 4/4 : RUN crant -SCi
 ---> Using cache
 ---> 373e8dd13420
Successfully built 373e8dd13420
Successfully tagged diabetes:1.0.0
docker tag diabetes:1.0.0 diabetes:latest
docker run -d -p 8004:8004 -v /home/brian/workspace/diabetes:/app/diabetes diabetes
ac1f232ee5d5b93cdadc3aa3d08e5a6233170c65427a185e221c6e42d0edb27b

This command also builds the image. If nothing changes, then a cached version is used.

[Figure 6.1 diagram: docker build creates an image from a Dockerfile; docker run creates containers from the image; exec, stop, and start operate on containers, while rmi and purge clean up stored images.]

FIGURE 6.1: An extended view of Docker commands. Multiple containers can be created from the same image, and commands can be executed on a running container. Built images are stored locally in a repository that must be cleaned periodically.

Finally, it shows the docker run command for creating a new container. We can verify that the container is running by using Docker's version of the ps command.

$ sudo docker ps
CONTAINER ID   IMAGE      COMMAND                  CREATED         STATUS         PORTS                                               NAMES
ac1f232ee5d5   diabetes   "tini -- /bin/sh -..."   3 minutes ago   Up 3 minutes   80/tcp, 443/tcp, 8888/tcp, 0.0.0.0:8004->8004/tcp   xenodochial_engelbart

It's great that we've created a container, but what can we do with it? First, let's create a bash session inside the container.

$ sudo make bash
docker exec -it ac1f232ee5d5 bash
root@ac1f232ee5d5:/app/diabetes#

This command attaches a bash session within your newly minted container. This looks and feels like a normal Linux environment, except that project directories are in a directory named /app. This is just a convention defined in the Dockerfile. Notice that the prompt has changed to indicate the user (root) and the container, identified by part of the hash ID that uniquely identifies it. We can work inside this container as though it were a remote computer we logged into. For example, let's start an R session and load our diabetes package.

root@ac1f232ee5d5:/app/diabetes# R --quiet
> library(diabetes)
> df <- read_diabetes()
INFO [2018-06-12 15:03:48] Loading file private/diabetes/data-01

This two-step dance is designed for running the container as a standalone server. If you only want to use the container for an interactive R session, just run sudo make r and a new temporary container will be started instead. To shut down the container, run the following command on your host machine.

$ sudo make stop
docker stop ac1f232ee5d5
ac1f232ee5d5

If you ran this command from a new terminal window, your existing session within the container will be terminated automatically.

6.1.1 Volume mapping

Multiple containers can be created from the same image. These containers start off as identical clones. Over time, their contents (data) may change. But when a temporary container is stopped, all changes are lost. That means upon startup, the container will revert to the initial state and will be indistinguishable from a newborn clone. A clean state is one way containers guarantee repeatability, but what if we want a process to save some data? One approach is to send the data to another server that acts as a datastore. This is appropriate for production environments but cumbersome during model development, particularly during exploration.

The solution is to add a volume mapping. When starting the container, you specify a local directory that corresponds to a directory in the container. Changes to this folder made in the container will be visible to your host workstation and vice versa. This behavior violates the idea of complete isolation for the sake of convenience. The Makefile automatically maps your local diabetes folder to /app/diabetes within the container. That means that any changes you make to this folder on your host workstation will be visible in the container.

Example 6.1. To illustrate how volume mappings work, let's run the container with and without the volume mapping. First, start the container using make run. As mentioned, this command automatically creates the volume mapping. Enter the container with make bash and execute the following command:

# echo "From container" >> from_container.txt

This command writes the text "From container" into the file called from_container.txt. From your host machine, open a new terminal window and cd into your diabetes project. Type

$ cat from_container.txt
From container

Now let's do the same thing without the volume mapping. This time we'll execute the Docker command explicitly and start a bash shell directly.

$ docker run -it --rm diabetes bash

This command creates an interactive terminal session. The --rm switch tells Docker to remove this container after we exit. From within the container, delete the file we created above.

# rm from_container.txt

In your host system, check whether the file is there or not. What do you see? �

Theoretically there is no distinction between state and data, and the system state can be changed with cavalier use of volume mapping. For example, it's possible to map the R library directory of the host into a container. Doing so streamlines development but no longer guarantees identical initial conditions. If you have a model working on your machine with your installed libraries, someone else running the code may have different libraries, which can lead to different results.
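As an illustration of this trade-off (not a recommendation), mapping a host R library into the container might look like the following sketch. The host path is hypothetical and depends on where your R libraries are installed; the container path matches the site library seen in the build output earlier.

$ docker run -it --rm \
>   -v "$HOME/R/x86_64-pc-linux-gnu-library/3.4:/usr/local/lib/R/site-library" \
>   diabetes bash
# The container now sees (and can modify) the host's installed packages,
# so its initial state is no longer identical to the image.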

6.1.2 Using graphics in a container

One issue with running R or Python from Docker is that these interpreters do not have access to the host GUI by default. It's difficult to render plots, which can be inconvenient. If your host system is Linux, the easiest solution is to use X forwarding, which sends graphics output over the network. Since UNIX started life as a headless server system, clients were responsible for rendering graphics. The X11 windowing system provides this functionality out of the box and just needs to be configured. The Makefile provided by init_package is already configured to forward X11 for your R session. Add the following to your ~/.profile to ensure you have your X11 forwarding rules properly set up.

$ echo "XAUTH=$HOME/.Xauthority; touch $XAUTH" >> ~/.profile

Then run

$ sudo make r

and use the R console as you normally would.

6.2 Managing containers

Containers that run a foreground server process will run indefinitely until stopped. The command docker ps shows the running containers. These containers are terminated using docker stop with either the container name (if set) or the container ID. By default, containers are persistent. Stopped containers can be viewed using docker ps -a. If you want to remove a stopped container completely, use the command docker rm <container ID>. When a container is run with the --rm switch, it is simply deleted after it is stopped.

Running containers are like a remote server. You can "log in" by attaching a new session to the container. The make bash command does just this by using docker exec instead of docker run. Attached sessions are indispensable for debugging. If your log files are written to a folder that is not volume mapped, the easiest way to view them is to open a bash session within the running container. It's also convenient to directly check versions of libraries within a container and even make changes. These changes can be reflected in the image by updating the Dockerfile.
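A minimal sketch of this housekeeping, with placeholder IDs, looks like this:

$ sudo docker ps                     # list running containers
$ sudo docker ps -a                  # also include stopped containers
$ sudo docker stop <container ID>    # terminate a running container
$ sudo docker rm <container ID>      # delete a stopped container for good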

6.3 Working with images

Docker keeps a copy of every image you've built or run on your workstation. It also tracks intermediate layers, which can speed up building an image. A container can be created from any layer, which is useful if an image is broken. If a container cannot be started from an image, a common troubleshooting step is to start a bash session from the preceding layer and investigate. Sometimes it's necessary to go up a few layers until a container can actually be run. The container is created like any other container using docker run --rm -it <image ID>. The only exception is that the image ID must be specified explicitly instead of using a name.

Image layers can take up a lot of space and may need to be cleaned up periodically. It's not uncommon for layers to take up 50-100 GB of disk space. The docker rmi command removes individual images. The docker image prune command removes so-called dangling images. These images are associated with older versions of an existing image that were rebuilt using the same label. Consequently, the older version loses its name, which just displays <none>. Adding the -a switch to docker image prune removes all unused images as well. These images are not associated with any container (running or stopped).
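The commands for inspecting and cleaning up images look like the following. The layer ID 3f2c1a9b is a made-up example; docker history is one way to discover the layer IDs of an image.

$ docker images                       # list images on the workstation
$ docker history diabetes             # show the layers of the diabetes image
$ docker run --rm -it 3f2c1a9b bash   # start a shell from a specific layer
$ docker rmi diabetes                 # remove a single image
$ docker image prune                  # remove dangling images
$ docker image prune -a               # also remove images unused by any container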

6.4 Running a notebook from a container

Interactive documents that support code and render results were introduced to the scientific computing world in the first version of Mathematica, back in 1988 [38]. Thirty years later, the popularity of the notebook interface has surged. Most data scientists are introduced to notebooks via Jupyter, the Python-based notebook system.

FIGURE 6.2: The notebook home page, showing all available notebooks. If no notebooks have been created, this will be an empty table.

Notebook technology comprises two parts: a kernel that executes code and a front end that provides an editor and renders results. Bundled within the zatonovo/r-base image are the dependencies to run a Jupyter kernel for R.

To run the kernel, a different make command is used. First, ensure that your container is stopped. The make notebook command will start the notebook server and provide instructions on how to authenticate your browser with the server.

$ sudo make notebook
... truncated for brevity ...
[I 17:53:59.926 NotebookApp] The Jupyter Notebook is running at:
[I 17:53:59.926 NotebookApp] http://[all ip addresses on your system]:8888/?token=dc363562c559051dc3e7a75239d47e3981a3a9d02f13a8e3
[I 17:53:59.927 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 17:53:59.927 NotebookApp]

Copy/paste this URL into your browser when you connect for the first time, to login with a token:
    http://localhost:8888/?token=dc363562c559051dc3e7a75239d47e3981a3a9d02f13a8e3

The end of the log output provides the URL to enter into your browser. Open this location to authenticate and then go to the notebook home page, which will look similar to Figure 6.2. The initial page is not so exciting, because you don't have any notebooks yet. To create a new notebook, click the menu button "New" and choose R. Doing so opens a new window with the active notebook. Use the file menu to save the notebook with the name "Diabetes". Notebooks are saved to the notebooks folder within the package. This directory is configured to be ignored by R when building the package. However, to support collaboration, it is not ignored by git. When you create your first notebook, git will show you that the notebooks folder is not tracked. Use git add to add it to your repository.
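For example, assuming the notebook was saved as Diabetes.ipynb in the notebooks folder:

$ git add notebooks/Diabetes.ipynb
$ git commit -m "Add diabetes notebook"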

FIGURE 6.3: A notebook session that reads in the diabetes dataset and displays the first six lines.

Since the notebook kernel is running within the container, it has access to the same R libraries as a typical R session. Assuming you have downloaded the diabetes dataset (see Section 3.1), you can read the dataset and interact with it in the notebook. It's possible to either source the R files explicitly or load the package. Figure 6.3 shows the result of reading in the dataset and displaying the first six records.

Example 6.2. One approach to development is to make all changes in package files. The notebook is used to test functions and/or use the functions on real data. Since your local package directory is visible in the container, it's possible to build the package locally and load it dynamically in the container. To do so, you need to add another volume mapping, from your local R package installation directory to the one in the container. Your local package installation directory is typically in your home directory and looks like ~/R/x86_64-pc-linux-gnu-library/3.4. Within the container, packages are installed into /usr/local/lib/R/site-library.

$ docker run -d -p 8004:8004 \
  -v /home/brian/workspace/diabetes:/app/diabetes \
  -v ~/R/x86_64-pc-linux-gnu-library/3.4:/usr/local/lib/R/site-library \
  diabetes

For this to work properly, it's important that the major and minor versions of R are the same on your local workstation and in the container.
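A quick way to check is to print the R version in both environments; this sketch assumes Rscript is on your host PATH and the image is named diabetes.

$ Rscript -e 'print(R.version.string)'
$ docker run --rm diabetes Rscript -e 'print(R.version.string)'

If the two major.minor versions disagree, packages built on the host may fail to load inside the container.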

□

6.5 Exercises

Exercise 6.1. Start your container and then open a bash shell. Create a file in /opt. You can use the touch command to create an empty file. Stop the container and start it again. Is your file there? Why or why not?

Exercise 6.2. It's possible to create multiple containers from the same image. If you execute the command make run a second time to start a second container, you'll get an error message. That's because Docker is trying to bind the same workstation port for another container. To assign a different port, use the command make PORT=8005 run. Use the command docker ps to show the running Docker containers. What is the output?

Exercise 6.3. The Jupyter notebook is installed in the zatonovo/r-base image. You can start it using the command make notebook. Follow the instructions to view a notebook app in your browser. Create a new notebook and try some R commands. Stop the container and start it up again. Is the notebook still there? Stop the container and then start it again using the command make DATA='' notebook. Repeat the process of creating a notebook, stopping the container and restarting it.

Exercise 6.4. The zatonovo/r-base image contains a working web server. Verify the web server is running by accessing its URL in your browser. What is the output?

Exercise 6.5. The base image doesn’t contain a lot of packages. If you want to use a package not included, you need to install it. However, since the container does not persist data, anything installed will be lost. The solution is to update the Dockerfile in your project directory to include the packages you want. Modify the Dockerfile to install randomForest. Hint: use the rpackage command from crant.

Exercise 6.6. Stopping a container deletes the container. It's possible to pause a container for a while, which effectively freezes its operation. The command docker pause <container ID> accomplishes this. Use this command on a running container. Try creating a bash session in your container with the command docker exec -it <container ID> bash. What happens?