Sharing and organizing research products as R packages

Matti Vuorre¹ & Matthew J. C. Crump²
¹ Oxford Internet Institute, University of Oxford, United Kingdom
² Department of Psychology, Brooklyn College of CUNY, New York, USA

A consensus on the importance of open data and reproducible code is emerging. How should data and code be shared to maximize the key desiderata of reproducibility, permanence, and accessibility? Research assets should be stored persistently in formats that are not software restrictive, and documented so that others can reproduce and extend the required computations. The sharing method should be easy to adopt by already busy researchers. We suggest the R package standard as a solution for creating, curating, and communicating research assets. The R package standard, with extensions discussed herein, provides a format for assets and metadata that satisfies the above desiderata, facilitates reproducibility, open access, and sharing of materials through online platforms like GitHub and Open Science Framework. We discuss a stack of R resources that help users create reproducible collections of research assets, from experiments to manuscripts, in the RStudio interface. We created an R package, vertical, to help researchers incorporate these tools into their workflows, and discuss its functionality at length in an online supplement. Together, these tools may increase the reproducibility and openness of psychological science.

Keywords: reproducibility; research methods; R; open data; open science
Word count: 5155

The authors declare equal authorship. MV is funded by the Adolescent Well-Being in the Digital Age project funded by the Huo Family Foundation. Correspondence concerning this article should be addressed to Matti Vuorre, 1 St Giles, Oxford OX1 3JS, United Kingdom. E-mail: [email protected]

Introduction

Research projects produce experiments, data, analyses, manuscripts, posters, slides, stimuli and materials, computational models, and more. However, the potential added value of these products is not fully realized due to limited sharing and curating practices. Although more transparent communication of these research products has recently been encouraged (Houtkoop et al., 2018; Klein et al., 2018; Lindsay, 2017; Martone, Garcia-Castro, & VandenBos, 2018; Rouder, 2016; Rouder, Haaf, & Snyder, 2019; Vanpaemel, Vermorgen, Deriemaecker, & Storms, 2015; Wicherts, Borsboom, Kats, & Molenaar, 2006), these efforts often focus narrowly on sharing data (and sometimes analysis code). Further, the practical value of sharing is often limited by poor documentation, incompatible file formats, and lack of organization, resulting in low rates of reproducibility (Hardwicke et al., 2018).¹ Standardization of protocols for sharing would be beneficial, but such standards have not emerged, possibly due to the variance of research in psychology. Instead of developing another standard, we suggest borrowing existing standards and practices from software engineering. Specifically, the R package standard, with additional R authoring tools, provides a robust framework for organizing and sharing reproducible research products.

¹ By “reproducibility”, we mean computational or analytic reproducibility by which previous results can be reproduced using existing data. Contrast this to “replicability”, in which entire procedures are repeated to generate new data to test a scientific idea.

Some advances in data-sharing standards have emerged: It is becoming more popular to share data on the Open Science Framework (OSF). However, those materials often contain idiosyncratic file organization and minimal or missing documentation for raw data. In specific areas, organization and documentation standards have emerged (e.g., the BIDS framework in neuroscience, Gorgolewski et al., 2016), but they usually only consider data and code instead of the project as a whole. More comprehensive proposals are described in the Transparency and Openness Promotion (Nosek et al., 2015), and Peer Reviewers’ Openness Initiative guidelines (Morey et al., 2016), but these fall short of describing detailed standards for organization and metadata. Additionally, these standards (if they exist) may not be widely known, recognized, agreed upon, and/or adopted at large.

We sought a standard for organizing and sharing that would adhere to the FAIR (Findable, Accessible, Interoperable, Reusable) guidelines to maximize the reuse potential of data and support “discovery through good data management” (Wilkinson et al., 2016, see also general discussion). Additionally, we recognized the added value of including other research outputs (“products”; e.g., manuscripts) beyond datasets in a reproducible collection of materials. We identified the R package standard with modern online-based workflows as a solution that doesn’t present overwhelming overhead for already busy researchers. Here, we discuss the R package standard for creating reproducible research material containers for data, analysis code, and metadata. Then, we introduce additional R packages for incorporating other research products into the reproducible container. These additional tools are easily accessible through our vertical package, thus named because an entire research project can be completed from top to bottom as a stack of R processes from within the RStudio software platform.

R Packages

R is a programming language for statistical analyses (R Core Team, 2020; https://www.r-project.org), and RStudio is its associated integrated development environment (IDE; RStudio Team, 2016; https://www.rstudio.com/). Both are free, open source, and work on Windows, Mac, and Linux systems. These tools are already widely used among psychology researchers: In one crowd-sourced analysis, 16 of 29 analysis teams used R (Silberzahn et al., 2018), and many psychologists have already developed R packages (see also Yee & Debbie, 2017, and http://r4stats.com/articles/popularity/).

A cornerstone of R is its user-created packages, which contain functions, (meta)data, and documentation in a standard format that allows seamless sharing across users and operating systems. Data and functions in R packages are immediately available to others through the R console, and documentation can be viewed in R or online. We first outline how data and functions are included in R packages, and then turn to including other assets, such as manuscripts and posters. A complete guide to R packages is outside our scope (see Wickham, 2015).

When an R package is created as shown in Figure 1, the required files and directories are automatically created. At this point, the package can be installed (click “Install and Restart” in RStudio’s “Build” tab) and shared online (see below). However, to make it useful, content needs to be added, beginning with the package’s description. That information (e.g., authors, dependencies) is written in the “DESCRIPTION” file in a machine-readable format. To edit that file, click on its name in RStudio’s File browser. The “blank.Rproj”, “man” and “NAMESPACE” files/directories shown in Figure 1 should not be edited by users, and we thus turn to R functions.

Functions

Researchers often develop and use custom functions in their analysis. Although it is possible to declare them directly in the analysis scripts, we suggest declaring functions using the R package standard, by placing the function’s source code into a file in the “R” directory.

[...] created (see Figure 1). Functions in R packages are portable, such that others can install the package from their R console, load it, and start using the functions immediately. Packages can also depend on other packages (and be depended on), such that R automatically installs any requirements for your functions to work appropriately. Functions within R packages are documented in a standardized manner, and the documentation for a function can be viewed in R (e.g., try ?mean) or online.

Wrapping packages also provide a reuse benefit: Functions can be difficult to find in old scripts, but easy to find and load if they are called from an existing package. Thus, formally including one’s functions in R packages facilitates reproducibility and sharing.

Data

Broadly, there are 3 steps to including data in an R package: 1. placing raw data in the “data-raw” directory, 2. creating an R script that processes the raw data and creates an R data object in the “data” directory, and 3. documenting the final data object.

First, raw data in any format is stored in the “data-raw” directory to ensure that the method of sharing remains software-agnostic. If possible, that raw data should be in a text-based format, rather than proprietary formats such as SPSS files.

Second, instructions for converting and cleaning the raw data into analyzable format should also be included in this directory (the raw data files should not themselves be modified). These pre-processing steps should be saved so that the complete path from raw data to results is transparent and reproducible, preferably in an R script in the “data-raw” directory. That script should end with creating an R data object in the “data” directory. Placing the R data object there ensures that it is included in the resulting R package, which makes using the data effortless for any R user, and eschews the need to download additional files. Loading the R package makes the included data immediately available in the R console.

Finally, the resulting data object in “data” should be documented in a standardized format by placing a data.R file in the “R” directory, a process discussed in more detail in our complete online supplementary tutorial (https://crumplab.github.io/vertical/), and in Wickham (2015). That standard format of documentation is especially useful because it allows datasets’ documentation to be viewed in R (e.g., type ?attitude in the R console) and online (see below).

R packages thus provide a useful standard for functions and (meta)data. However, as we will see, the package structure can be extended to include analyses, manuscripts, posters, and presentations as well. The complete project is then easily shared online, and even showcased as a website, as we will
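The package layout described in this section (a “DESCRIPTION” file, functions in “R”, raw data and its processing script in “data-raw”, the processed object in “data”, and its documentation in “R/data.R”) can be sketched in base R. This is a minimal illustration under stated assumptions, not the authors' own workflow: the package name mypackage, the function sem(), the dataset reaction_times, and every file name are hypothetical, and the files are written by hand into a temporary directory instead of through RStudio's project dialog. The #' comment blocks follow the roxygen2 convention described in Wickham (2015).

```r
# Hypothetical package location; everything is written to a temporary directory.
pkg <- file.path(tempdir(), "mypackage")
for (d in file.path(pkg, c("R", "data", "data-raw"))) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# DESCRIPTION: machine-readable package metadata (all field values are placeholders).
desc <- data.frame(
  Package = "mypackage",
  Title = "Example Research Compendium",
  Version = "0.0.1",
  Description = "Hypothetical package holding data and functions.",
  License = "CC0",
  Encoding = "UTF-8",
  LazyData = "true",
  stringsAsFactors = FALSE
)
write.dcf(desc, file.path(pkg, "DESCRIPTION"))

# A custom function lives in its own file in the "R" directory,
# preceded by roxygen2-style documentation comments.
writeLines(c(
  "#' Standard error of the mean",
  "#'",
  "#' @param x A numeric vector.",
  "#' @return The standard error of the mean of x, ignoring missing values.",
  "#' @export",
  "sem <- function(x) {",
  "  x <- x[!is.na(x)]",
  "  sd(x) / sqrt(length(x))",
  "}"
), file.path(pkg, "R", "sem.R"))

# Step 1: raw data goes into "data-raw" in a text-based format (made-up values).
raw <- data.frame(id = c(1, 2, 3, 4), rt = c(512, 498, NA, 530))
write.csv(raw, file.path(pkg, "data-raw", "reaction_times.csv"),
          row.names = FALSE)

# Step 2: these lines would normally be saved as a script in "data-raw".
# They read the raw file, clean it, and save an R data object in "data".
# The raw file itself is never modified.
reaction_times <- read.csv(file.path(pkg, "data-raw", "reaction_times.csv"))
reaction_times <- reaction_times[!is.na(reaction_times$rt), ]
save(reaction_times, file = file.path(pkg, "data", "reaction_times.rda"))

# Step 3: document the data object in "R/data.R".
writeLines(c(
  "#' Reaction times (hypothetical example data)",
  "#'",
  "#' @format A data frame with 3 rows and 2 variables:",
  "#' \\describe{",
  "#'   \\item{id}{Participant identifier}",
  "#'   \\item{rt}{Reaction time in milliseconds}",
  "#' }",
  "\"reaction_times\""
), file.path(pkg, "R", "data.R"))

# Inspect the resulting file layout.
list.files(pkg, recursive = TRUE)
```

In a real project, a tool such as roxygen2 (for example through devtools::document()) would turn the #' comments into help pages in “man” and update “NAMESPACE”, which is consistent with the advice not to edit those by hand.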