Application of BagIt-Serialized Research Object Bundles for Packaging and Re-Execution of Computational Analyses
Kyle Chard, Computation Institute, University of Chicago, Chicago, IL
Bertram Ludäscher, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL
Thomas Thelen, NCEAS, University of California at Santa Barbara, Santa Barbara, CA
Niall Gaffney, Texas Advanced Computing Center, University of Texas at Austin, Austin, TX
Jarek Nabrzyski, Center for Research Computing, University of Notre Dame, South Bend, IN
Matthew J. Turk, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL
Matthew B. Jones, NCEAS, University of California at Santa Barbara, Santa Barbara, CA
Victoria Stodden, School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL
Craig Willis (corresponding author), NCSA, University of Illinois at Urbana-Champaign, Champaign, IL
Kacper Kowalik, NCSA, University of Illinois at Urbana-Champaign, Champaign, IL
Ian Taylor, Center for Research Computing, University of Notre Dame, South Bend, IN

Abstract—In this paper we describe our experience adopting the Research Object Bundle (RO-Bundle) format with BagIt serialization (BagIt-RO) for the design and implementation of "tales" in the Whole Tale platform. A tale is an executable research object intended for the dissemination of computational scientific findings that captures the information needed to facilitate understanding, transparency, and re-execution for review and computational reproducibility at the time of publication. We describe the Whole Tale platform and the requirements that led to our adoption of BagIt-RO, the specifics of our implementation, and discuss migrating to the emerging Research Object Crate (RO-Crate) standard.

Index Terms—Reproducibility of results, Standards, Packaging, Interoperability, Software, Digital preservation

I. INTRODUCTION

Whole Tale (http://wholetale.org) is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of "tales" [?], [?]. Tales are executable research objects that capture the code, data, and environment, along with the narrative and workflow information needed to re-create computational results from scientific studies.

Since its inception, the Whole Tale platform has been designed to bring together existing open science infrastructure. Researchers can ingest existing data from various scientific archival repositories; launch popular analytical tools (such as Jupyter and RStudio); create and customize computational environments (using repo2docker, https://repo2docker.readthedocs.io/); conduct analyses; create or upload code and data; and publish the resulting package back to an archival repository. Tales are also downloadable and re-executable locally, including the ability to retrieve remotely published data.

With the May 2019 release of version 0.7 of the platform, we adopted the Research Object Bundle BagIt serialization (BagIt-RO) format [?]. By combining the BagIt-RO serialization with our repo2docker-based execution framework and the BDBag tools [?], we were able to define and implement a standards-compliant, self-describing, portable, re-executable research object with the ability to retrieve remotely published data.
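For readers unfamiliar with BagIt, the serialization underlying a tale export is simple: a bag is a directory containing a bagit.txt declaration, a data/ payload directory, and checksum manifests such as manifest-sha256.txt listing one checksum/path pair per payload file (RFC 8493). The following is a minimal illustrative sketch, not Whole Tale or BDBag code, of writing and verifying such a manifest with the Python standard library:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256, as recorded in manifest-sha256.txt."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(bag_dir: Path) -> Path:
    """Write manifest-sha256.txt covering every file under data/."""
    lines = []
    for p in sorted((bag_dir / "data").rglob("*")):
        if p.is_file():
            lines.append(f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}")
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n")
    return manifest

def verify_bag(bag_dir: Path) -> bool:
    """Re-hash each payload file and compare against the manifest."""
    for line in (bag_dir / "manifest-sha256.txt").read_text().splitlines():
        digest, rel = line.split(None, 1)
        if sha256_of(bag_dir / rel) != digest:
            return False
    return True
```

A real bag also carries bagit.txt, bag-info.txt, and tag manifests, and BDBag adds fetch.txt handling; this sketch covers only the payload-integrity idea that makes an exported tale verifiable.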
A goal of the Whole Tale platform (WT) is to produce an archival package that is exportable, publishable, and can be used for verification of computational reproducibility, for example as part of the peer-review process.

In this paper we describe the Whole Tale platform and the requirements that led to our adoption of the BagIt-RO format. The paper is organized as follows. In Section II we present a motivating example of the use of the Whole Tale platform, followed by a brief description of the system architecture in Section III. In Section IV we outline the requirements that led to our adoption of the BagIt-RO format. In Section V we describe our implementation in more detail, followed by a discussion and conclusions.

II. EXAMPLE SCENARIO: ANALYZING SEAL MIGRATION PATTERNS

We begin with a motivating example to illustrate the end-to-end Whole Tale workflow for creating, exporting, and publishing a tale based on existing data archived using the Research Workspace (https://www.researchworkspace.com), a DataONE member node. This example is based on tutorial material described in [?].
The workspace also con- A research team is preparing to publish a manuscript tains repo2docker-compatible configuration files defining the describing a computational model for estimating tale environment, described below. This appears as the animal movement paths from telemetry data. The workspace folder mounted into the running tale environ- source data for their analysis, tracking data for ment. juvenile seals in Alaska [?], has been published in 2) External data: Optionally, each tale can include refer- Research Workspace, a DataONE network member. ences to externally published data. The data is then registered Using the Whole Tale platform, the researchers with the Whole Tale system and managed by the DMS. register the external dataset. They then create a new Externally referenced data appears in the data folder, a tale by launching an RStudio environment based on sibling to the workspace. images maintained by the Rocker Project [?]. Using 3) Environment customization: Users can optionally cus- the interactive environment, they clone a Github tomize the tale environment using repo2docker-compatible repository, modify an R Markdown document, cus- configuration files. Whole Tale extends repo2docker via the 4 tomize the environment by specifying OS and R repo2docker_wholetale package, which adds build- packages via repo2docker configuration files, and packs to support Rocker, Spark, and OpenRefine images. execute their code to generate outputs. They down- 4) Metadata: Tales have basic descriptive metadata in- load the package in a compressed BagIt-RO format cluding creator, authors, title, description, keywords as well and run locally to verify their tale. Finally, they enter as information about the selected environment, licenses, and descriptive metadata and publish the final package associated persistent identifiers. 
This scenario is further illustrated in Figure II.

III. SYSTEM ARCHITECTURE

This section provides a brief overview of the Whole Tale system architecture illustrated in Figure III. Whole Tale provides a scalable platform based on the Docker Swarm container orchestration system, exposing a set of core services via REST APIs and a Single Page Application (SPA). Key components include:

• Whole Tale Dashboard: an Ember.js single-page application
• Whole Tale API: a REST API built using the Girder framework (https://girder.readthedocs.io) to expose key features including authentication, user/group management, tale lifecycle, data management, and integration with remote repositories
• Whole Tale File System: a custom filesystem based on WebDAV and FUSE used to mount user and registered data into running container environments
• Image registry: a local Docker registry used to host images associated with tales
• Jobs and task management: a task distribution and notification framework based on Girder and Celery
• Data Management System (DMS): a system for fetching, caching, and exposing externally published datasets

Several aspects of the Whole Tale system relate to the BagIt-RO serialization format, including filesystem organization, user-defined environments, and metadata, as well as the export and publication functions. We describe these in more detail below.

1) Tale workspace: Each tale has a workspace (folder) that contains user-created code, data, workflow, documentation, and narrative information. The workspace also contains repo2docker-compatible configuration files defining the tale environment, described below. This appears as the workspace folder mounted into the running tale environment.

2) External data: Optionally, each tale can include references to externally published data. The data is then registered with the Whole Tale system and managed by the DMS. Externally referenced data appears in the data folder, a sibling to the workspace.

3) Environment customization: Users can optionally customize the tale environment using repo2docker-compatible configuration files. Whole Tale extends repo2docker via the repo2docker_wholetale package (https://github.com/whole-tale/repo2docker_wholetale), which adds buildpacks to support Rocker, Spark, and OpenRefine images.

4) Metadata: Tales have basic descriptive metadata including creator, authors, title, description, and keywords, as well as information about the selected environment, licenses, and associated persistent identifiers. The tale metadata is included in the metadata directory, in both the manifest.json and environment.json files. The license is included in the BagIt payload directory, but not as part of the tale workspace.

5) Exporting tales: Tales can be exported as a BagIt-RO serialized archive that contains the contents of the tale workspace (code, local data, narrative, workflow, repo2docker configuration files) as well as references to external data, the tale metadata, and a script to run the tale locally. BDBag [?] is used to materialize "holey" bags by downloading the files specified in the fetch.txt file, initially via HTTP and eventually via DOI, Globus, and Agave schemes. The script to run the tale locally is stored at the root of the exported BagIt archive.

Table 1 describes the contents of an exported tale in the BagIt-RO format. A complete example is available at https://doi.org/10.5281/zenodo.2641314.
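A "holey" bag omits remotely published files from the data/ payload and instead lists them in fetch.txt, one "URL LENGTH PATH" triple per line (LENGTH may be "-" when unknown, per the BagIt specification); materializing the bag means downloading each entry to its target path. The following is a simplified, stdlib-only sketch of that idea, not the actual BDBag implementation:

```python
import urllib.request
from pathlib import Path

def parse_fetch(fetch_text):
    """Parse fetch.txt lines of the form: URL LENGTH PATH.

    Returns (url, size_or_None, relative_path) tuples.
    """
    entries = []
    for line in fetch_text.splitlines():
        if not line.strip():
            continue
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append((url, size, path))
    return entries

def materialize(bag_dir: Path) -> None:
    """Download every fetch.txt entry into its place in the payload."""
    fetch_file = bag_dir / "fetch.txt"
    for url, _size, rel in parse_fetch(fetch_file.read_text()):
        target = bag_dir / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        # HTTP(S) only in this sketch; BDBag also resolves DOI, Globus,
        # and Agave URL schemes and verifies sizes and checksums.
        urllib.request.urlretrieve(url, target)
```

After materialization, the bag's payload is complete and can be verified against its checksum manifests before the tale is re-executed.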
IV. REQUIREMENTS

The scenario described in Section II highlights several key requirements of the Whole Tale platform that led to our adoption of the BagIt-RO format.

[Figure: end-to-end scenario workflow. Register the telemetry dataset by digital object identifier (doi:10.24431/rw1k118); create a tale, entering a name and selecting the RStudio (Rocker) environment; a container is launched based on the selected environment, with an empty workspace and external data mounted read-only; enter descriptive metadata (authors, title, description, illustration image), which maps to manifest terms such as schema:author, schema:name, schema:category, pav:createdBy, and schema:license.]
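The tale manifest (manifest.json) mentioned above is a JSON-LD document; the BagIt-RO conventions describe the research object and its aggregated resources using terms from vocabularies such as schema.org and PAV (e.g., schema:author, pav:createdBy, schema:license). The sketch below is hypothetical and deliberately abbreviated — the field values are invented and the structure is simplified relative to the real manifest format — but it shows the shape of such a document:

```python
import json

def build_manifest(title, authors, license_url, aggregates):
    """Assemble a simplified JSON-LD tale manifest (illustrative only)."""
    doc = {
        # Map vocabulary prefixes to their namespaces.
        "@context": {
            "schema": "http://schema.org/",
            "pav": "http://purl.org/pav/",
        },
        "schema:name": title,
        "schema:author": [{"schema:name": a} for a in authors],
        "schema:license": license_url,
        # Resources aggregated into the research object
        # (workspace files plus references to external data).
        "aggregates": [{"uri": u} for u in aggregates],
    }
    return json.dumps(doc, indent=2)
```

A consumer of the exported bag can read this document to discover the tale's authors, license, and contents without executing anything.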