Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 31 January 2020 doi:10.20944/preprints202001.0378.v1

Approaches for containerized scientific workflows in cloud environments with applications in life science

Ola Spjuth1,*, Marco Capuccini1,2, Matteo Carone1, Anders Larsson3, Wesley Schaal1, Jon Ander Novella1, Oliver Stein1, Morgan Ekmefjord1, Paolo Di Tommaso4, Evan Floden4, Cedric Notredame4, Pablo Moreno5, Payam Emami Khoonsari6, Stephanie Herman1,6, Kim Kultima6, and Samuel Lampa1

1 Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Sweden
2 Department of Information Technology, Uppsala University, Sweden
3 National Bioinformatics Infrastructure Sweden, Uppsala University, Sweden
4 Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain
5 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
6 Department of Medical Sciences, Clinical Chemistry, Uppsala University, Sweden
* Corresponding author: [email protected]

January 30, 2020

© 2020 by the author(s). Distributed under a Creative Commons CC BY license.

Abstract

Containers are gaining popularity in life science research as they provide a solution for encompassing dependencies of provisioned tools, simplify software installations for end users and offer a form of isolation between processes. Scientific workflows are ideal for chaining containers into data analysis pipelines to aid in creating reproducible analyses. In this manuscript we review a number of approaches to using containers as implemented in the workflow tools Nextflow, Galaxy, Pachyderm, Argo, Kubeflow, Luigi and SciPipe, when deployed in cloud environments. A particular focus is placed on the workflow tools' interaction with the Kubernetes container orchestration framework.

1 Introduction

The life sciences have become data-intensive, driven largely by the massive increase in throughput and resolution of molecular data generating technologies. Massively parallel sequencing (also known as next-generation sequencing or NGS) is where the largest increase has been seen in recent years, but other domains are also increasing dramatically, including proteomics, metabolomics, systems biology and biological imaging [1, 2, 3]. Consequently, the need for computational and storage resources has continued to grow, but the focus has also shifted towards the downstream steps of biological experiments and the need to carry out efficient and reproducible data analysis. While high-performance computing (HPC) and high-throughput computing (HTC) clusters remain the main e-infrastructure resource used in biological data analysis, cloud computing is emerging as an appealing alternative where, in the infrastructure-as-a-service (IaaS) case, scientists are able to spawn virtual instances or infrastructure on demand to facilitate their analysis, after which the resources are released [4, 5].

One area which has traditionally caused headaches for users is the installation of software tools and their inclusion into workflow tools or other computing frameworks [6]. This is an area where cloud resources provide an advantage over HPC, in that scientists are not dependent on system administrators to install software, but can handle the installation themselves.



However, software for biological analyses can be quite challenging to install due to sometimes complex dependencies. Virtual machines (VMs) offer the benefit of instantiating ready-made environments with data and software, including all dependencies for specific tasks or analyses, and constitute a big step towards making computational reproducibility easier and more realistic to achieve in daily work. VMs form the backbone of traditional cloud-based infrastructures. Virtual machine images (VMIs), however, can easily become large and take considerable time to instantiate, transfer or rebuild when software in the image needs to be updated. Such images can easily be shared between users and can be deposited and indexed by one of the available VMI catalogs [7].

1.1 Containers

Software container technology has emerged as a complement to virtual machines. While containers offer a bit less isolation between processes, they are on the other hand more lightweight and easy to share; sharing the operating system kernel makes them much faster to launch and terminate [8]. Containers encompass all dependencies of the provisioned tools and greatly simplify software installations for end users. The most widely used containerization solution is Docker (www.docker.com), but Singularity [9], uDocker [10] and Shifter [11] are recent alternatives that prevent users from running containers with root privileges, addressing the most common security issues when deploying containers in multi-tenant computing clusters, such as on high-performance computing (HPC) clusters. Docker containers are commonly shared via Docker Hub [12]. There are also initiatives for standardizing containers in the life sciences, such as BioContainers [13]. Containers have seen an increased uptake in the life sciences, both for delivering software tools and for facilitating data analysis in various ways [14, 15, 16, 17, 18].

When running more than just a few containers, an orchestration system is needed to coordinate and manage their execution and handle issues related to e.g. load balancing, health checks and scaling. Kubernetes [19] has over the last couple of years risen to become the de facto standard container orchestration system, but Docker Swarm [20] and Apache Mesos [21] are other well-known systems. A key objective for these systems is to allow users to treat a cluster of compute nodes as a single deployment target, handling the packaging of containers onto compute nodes behind the scenes.

1.2 Scientific workflows

While orchestration tools such as Kubernetes enable scientists to run many containers in a distributed environment (such as a virtual infrastructure on a cloud provider), the problem remains of how to orchestrate the scheduling of, and dependencies between, containers and how to chain them (define data dependencies between them) into an analysis pipeline. This is where scientific workflow management systems (WMSs) can be very useful; in fact, it can be difficult to carry out more complex analyses without such a system. Traditionally in data-intensive bioinformatics, an important task for WMSs has been to support analyses consisting of several components by running a set of command-line tools in a sequential fashion, commonly on a computer cluster. The main benefits of using a WMS include: i) making automation of multi-step computations easier to create, more robust and easier to change; ii) providing more transparency as to what the pipeline does, through its more high-level workflow description and better reporting and visualization facilities; and iii) providing more reliable policies to handle transient or persistent error conditions, including strategies to recover from interrupted computations while re-using any partially finished data.

Using containerized components as the nodes in a scientific workflow adds isolation between processes, completely encapsulates tool dependencies within each container, and reduces the burden of administrating the host system where the workflow is scheduled to run. This means the workflow can be tested on a local computer and executed on remote servers and clusters without modification, using the exact same containers. Naturally, this opens up for execution on virtual infrastructures, even those provisioned on-demand with features such as auto-scaling. Apart from greatly simplifying access to the needed distributed compute resources, a positive side-effect is that the entire analysis becomes portable and reproducible.
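As a minimal illustration of this pattern, two containerized steps can be chained by mounting a shared work directory into each container. The sketch below uses plain 'docker run' calls from Python, with hypothetical image names, commands and file names; a real workflow engine adds dependency tracking, scheduling and error handling on top of this basic idea:

    import subprocess
    from pathlib import Path

    # Shared work directory that acts as the "storage system" between steps
    workdir = Path("/tmp/analysis").resolve()
    workdir.mkdir(parents=True, exist_ok=True)

    def run_step(image, command):
        """Run one tool in its own container, with the shared work
        directory mounted at /data inside the container."""
        subprocess.run(
            ["docker", "run", "--rm", "-v", f"{workdir}:/data", image] + command,
            check=True,
        )

    # Step 1: a (hypothetical) QC tool writes its report to shared storage
    run_step("example/qc-tool:1.0",
             ["qc", "--in", "/data/reads.fastq", "--out", "/data/qc.txt"])

    # Step 2: a (hypothetical) aligner reads the output of step 1
    run_step("example/aligner:1.0",
             ["align", "--qc", "/data/qc.txt", "--out", "/data/aln.bam"])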



Data ingestion and access to reference data constitute a key step when using cloud-based resources. When spawning a virtual infrastructure, data used in the analysis either needs to be available on storage connected to the compute nodes, or ingested into a locally provisioned file system. At the end of the analysis, the resulting data needs to either be put into persistent storage on the cloud or downloaded to local storage. However, the overhead with this can also be turned around: if data resides on storage connected to a cloud infrastructure, it is possible to simply move the workflow and containers there and "bring compute to the data".

Figure 1: Overview of a containerized workflow with a workflow engine managing the dependency graph and scheduling of the containers encompassing the tools to run. The tools pass data between each other by reading and writing to a shared storage system.

Figure 2: Comparison between two approaches for how to run the workflow engine. It can run either a) on the user's computer, or b) in a pod in the Kubernetes (abbreviated "k8s" in the image) cluster.

2 Workflow systems

Due to their simplicity, containers are supported out of the box by most workflow tools. However, the approach for interacting with containers differs, especially when using virtual infrastructures on cloud resources. This section describes the approaches taken by a set of workflow management systems, with varying degrees of uptake in the life science community, to interacting with containers in cloud environments.

2.1 Nextflow

Nextflow [22] is a workflow framework that aims to ease the implementation of scientific workflows in a portable and reproducible manner across heterogeneous computing platforms and to enable the smooth migration of these applications to the cloud. The framework is based on the dataflow paradigm, a functional/reactive programming model in which tasks are isolated from each other and are expected to be executed in a stateless manner. This paradigm simplifies the deployment of complex workflows in distributed computing environments and matches the immutable execution model of containers.

Nextflow provides built-in support for the most widely used container runtimes, i.e. Docker, Singularity, Shifter and uDocker (though the last two are undocumented as they are still in an experimental state). The use of containers is defined in a declarative manner, i.e. by specifying the container image that a workflow needs to use for the task executions. Once annotated in this way, the Nextflow runtime takes care of transparently running each task in its own container instance while mounting the required input and output files as needed. This approach allows a user who needs to target the requirements of a specific execution platform to quickly switch from one containerization technology to another with just a few changes in the workflow configuration file (e.g. the same application can be deployed in an HPC cluster using Singularity or in the cloud with Docker).

Great flexibility is given for container image configuration. Nextflow supports both the image-per-workflow pattern (the same image for all executed tasks) and the image-per-task pattern (each task uses a different container image). The container image used by a task can even be specified at runtime by a given input configuration.

Nextflow allows the deployment of containerized workloads in a cloud-native manner using different strategies, as briefly discussed in the following paragraphs.

Cloud-managed computing service: in this approach workflow tasks are submitted to a cloud provider's batch execution service. The containerization and auto-scaling are delegated to the cloud service. The workflow inputs/outputs and intermediate result files need to be stored using the provider's object storage service. Nextflow adapts the task definition, resource requests and container image to the cloud provider format, submitting the corresponding API requests and properly staging the task data. Currently, Nextflow has built-in support for the AWS Batch service and the Google Genomics Pipelines service.

Cloud unmanaged computing service: when using this deployment strategy, Nextflow takes care of provisioning a set of cloud VM instances in which it automatically sets up its own clustering engine for the distributed execution of the workflow. The VM instances only require the availability of a Docker runtime and the Java virtual machine. Workflow tasks exchange data via a shared file system or by using a cloud object storage service. The workflow execution is managed by Nextflow, which also handles the cluster auto-scaling, i.e. adds or removes VM instances on demand to adapt to the actual needs of a workload at any point in time and optimizes the overall computing cost when applicable. At the time of writing, this feature supports the use of the AWS EC2 cloud service and the Google Compute Engine service.

Figure 3: Overview of a containerized workflow with a workflow engine managing the dependency graph and scheduling of the containers encompassing the tools to run. The tools pass data between each other by reading and writing to a shared storage system.


2.2 Galaxy

Galaxy (Galaxy, RRID:SCR_006281) is a data-analysis workflow platform that has probably the most active development community in the field of workflow environments in bioinformatics. This has enabled it to persist as an active project and to continue to evolve since 2005, enabling reproducible research through active channels for sharing tools, workflows and datasets [23]. Galaxy is currently used by tens of thousands of scientists across the globe [24]. Besides offering a modern user interface (UI) for end users, Galaxy is accessible through its REST API or through the BioBlend Python binding (which talks to this REST API).

For execution versatility, Galaxy has adapters to offload jobs to a wide variety of systems, from local Docker containers to different batch systems. In Galaxy, there is a complete separation of concerns between tools, workflows and execution environments, meaning that a workflow definition is not bound to a particular execution environment. Within the last five years Galaxy gained the ability to launch on Amazon AWS through the CloudMan Launcher Galaxy sub-project [25]. This setup follows a relatively VM-centric approach, which has natural limitations for scaling in the number of tools included, as they all need to be provisioned inside the VM used for the provisioned cluster.

More recently, through efforts within the PhenoMeNal H2020 Project [26], Galaxy gained the ability to deploy inside Kubernetes in an automated fashion through the use of Helm Charts [27]. The Galaxy-stable Helm chart makes use of community-maintained Galaxy container images (Docker-Galaxy-stable), which are used on other container orchestrators, and includes support for SSH file transfer protocol (SFTP) access, database backends and even deployment/compatibility of HPC scheduling systems like Condor or SLURM for hybrid cloud setups. The Kubernetes job runner, also contributed to the Galaxy community by the PhenoMeNal H2020 Project, allows Galaxy to offload jobs to Kubernetes, either when Galaxy runs inside the container orchestrator or from the outside (provided that the shared file system is accessible to both the container orchestrator and Galaxy). The Kubernetes runner for Galaxy takes care of resource requirement settings, recovery of jobs in Kubernetes if there is a restart, auto-upgrading of resource utilization, Persistent Volume Claim mounting both for Galaxy and in each job's pod, as well as adequate supplementary groups handling for network filesystem (NFS) and other shared file systems. This integration requires the use of a shared file system, which needs to be mountable in Kubernetes in a read-write-many configuration, as both the main Galaxy pod and the job pods need to mount it at the same time (through the PV/PVC abstraction). The Kubernetes runner for Galaxy adheres to the principles of fail-safety and redundancy commonly considered in cloud computing, providing ways to recover from jobs being lost due to cluster nodes going down in the middle of an execution.

The Kubernetes runner for Galaxy can benefit from explicit tool-to-container mapping through Galaxy dynamic destinations, or take advantage of dynamic Bioconda-to-container mapping thanks to built-in functionality in Galaxy for this purpose. This means that if a Galaxy tool declares a Bioconda package as the dependency to be resolved, it will bring in an automatically built container for that Bioconda package [28] from BioContainers [13].

The Galaxy-Kubernetes integration, which includes all the aspects related to Kubernetes mentioned in the previous paragraphs, has been battle tested by the PhenoMeNal H2020 Project, with its public instance having received on the order of tens of thousands of jobs during the past two years, and with tens of deployments in production environments and possibly hundreds of deployments in development settings. This integration has been tested locally on Minikube, on OpenStack deployments, on Google Cloud Platform (GCP) and on Amazon Web Services (AWS). It is expected that in the next few years the Galaxy-Kubernetes integration will be embraced in the areas of proteomics and transcriptomics. The CloudMan Galaxy launcher is being considered for migration to a more containerized deployment scheme based on the Galaxy-Kubernetes setup.


2.3 Pachyderm

Pachyderm [29] is a large-scale data processing tool built natively on top of Kubernetes. This platform offers: i) a data management system based on Git semantics and ii) a workflow system for creating distributed and reproducible pipelines based on application containers. In order to create workflows with Pachyderm, users simply need to supply a JSON pipeline specification including a Docker image, an entrypoint command to execute in the user containers and one or more data input(s). Afterwards, Pachyderm will ensure that the corresponding pods are created in Kubernetes, and will share the input data across them and collect the corresponding outputs. Thanks to using Kubernetes for container orchestration, Pachyderm is able to, among other things: i) optimize cluster resource utilization, ii) run seamlessly on multiple cloud and on-premise environments, and iii) self-heal pipeline jobs and related resources.

This workflow tool provides an easy mechanism to distribute computations over a collection of containers. In other words, it offers similar capabilities to frameworks such as Apache Spark, but replaces MapReduce syntax with legacy code. More specifically, Pachyderm is able to distribute workloads over collections of containers by partitioning the data into minimal data units of computation called "datums". The contents of these datums are defined by glob patterns such as "/" or "/*", which instruct Pachyderm how to process the input data: all together as a single datum, or in isolation (as separate datums). Users can define the number of pipeline workers to process these datums, which will be processed one at a time in isolation. When the datums have been successfully processed, the Pachyderm daemon gathers the results corresponding to each datum, combines them and versions the complete output of the pipeline. The scheduling of pipeline workers is determined by the resource utilization status of the worker nodes and the resource requests that are made for each pipeline, allowing workloads to be distributed efficiently. Recently, work to integrate and demonstrate Pachyderm in bioinformatics was published [30].
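To give an idea of what such a specification can look like, the sketch below assembles a minimal pipeline spec as a Python dictionary and writes it to JSON. The image, repository name and parallelism value are hypothetical, and the exact set of supported fields depends on the Pachyderm version:

    import json

    pipeline_spec = {
        "pipeline": {"name": "qc"},
        "transform": {
            # Container image and command to run for each datum
            "image": "example/qc-tool:1.0",
            "cmd": ["sh", "-c", "qc --in /pfs/reads --out /pfs/out/qc.txt"],
        },
        # One datum per top-level entry in the input repository
        "input": {"pfs": {"repo": "reads", "glob": "/*"}},
        # Number of pipeline workers processing datums in parallel
        "parallelism_spec": {"constant": 4},
    }

    with open("qc_pipeline.json", "w") as f:
        json.dump(pipeline_spec, f, indent=2)
    # The resulting JSON file is then submitted with the pachctl command-line
    # client (exact command syntax varies between Pachyderm versions).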
2.4 Luigi

Luigi [31] is an open-source Python module for defining and running batch-like workflows. The system is strongly oriented towards Big Data, and it comes as a convenience tool to pipe together jobs from well-established analytics frameworks (e.g. Hadoop, Spark, Pig and Hive). Hence, instead of trying to substitute any of the available analytics systems, Luigi provides the means to plug together jobs from different frameworks into a single batch-oriented workflow. In doing so, Luigi seamlessly handles many boilerplate operations such as dependency resolution, visualization, failure tolerance and atomic file system operations.

The Kubernetes integration was developed by the PhenoMeNal consortium, and it is now part of the official Luigi project and thus supported and maintained by the community. The integration makes it possible to include pipeline processing steps as Kubernetes Jobs, which represent batch-like application containers that run until command completion. Hence, Luigi mainly uses Kubernetes as a cluster-resource manager where multiple parallel jobs can be run concurrently. To this extent, it is important to point out that Luigi is not meant to be used as a substitute for parallel processing frameworks (such as Spark and Hadoop), as it has limited support for concurrent tasks. In fact, in our experience Luigi can handle up to 300 parallel tasks before it starts to break down, or 64 parallel tasks if running on a single node. A project to adapt Luigi for scientific applications in general and bioinformatics in particular resulted in a tool named SciLuigi [32].

We benchmarked Luigi on Kubernetes by reproducing a large-scale metabolomics study [33]. The analysis was carried out on a cloud-based Kubernetes cluster, provisioned via KubeNow and hosted by EMBL-EBI, showing good scalability up to 40 concurrent jobs [34]. The workflow definition, and instructions to reproduce the experiment, are publicly available on GitHub [35].
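To make the Kubernetes Jobs integration more concrete, the following minimal sketch defines one containerized pipeline step as a Luigi task executed as a Kubernetes Job. The task name, image and command are hypothetical, and the luigi.contrib.kubernetes module documentation should be consulted for the full set of options and cluster configuration:

    import luigi
    from luigi.contrib.kubernetes import KubernetesJobTask

    class QCStep(KubernetesJobTask):
        """Run a (hypothetical) containerized QC tool as a Kubernetes Job."""
        name = "qc-step"      # name given to the Kubernetes Job
        max_retrials = 2      # allowed pod restarts before the task is reported as failed
        spec_schema = {
            # Container part of the Kubernetes Job specification
            "containers": [{
                "name": "qc",
                "image": "example/qc-tool:1.0",
                "command": ["qc", "--in", "/data/reads.fastq", "--out", "/data/qc.txt"],
            }]
        }
        # In a real pipeline, an output() target on shared storage would let
        # downstream tasks declare a dependency on this step.

    if __name__ == "__main__":
        luigi.build([QCStep()], local_scheduler=True)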


2.5 SciPipe

SciPipe [36] (SciPipe, RRID:SCR_017086) is a workflow library based on flow-based programming principles, which enables building workflows from component libraries of predefined workflow components. SciPipe is implemented as a programming library in Go, which enables using the full power of the Go programming language to define workflows, as well as compiling workflows to executable files, for maximum ease of deployment in cloud scenarios.

SciPipe has been used to orchestrate machine learning pipelines to build predictive models for off-target binding in pharmacology [37]. SciPipe currently provides early support for container-based workloads via Kubernetes, implemented through the official Kubernetes Go client library [38]. The Go client library enables transparent access to a Kubernetes cluster regardless of whether the SciPipe workflow runs outside the cluster or inside it, in a pod. When running inside the cluster, it automatically detects cluster settings. When not inside a cluster, the connection to a cluster can be configured by setting the appropriate environment variables. The Kubernetes integration in SciPipe is demonstrated through the implementation of a use case workflow consisting of a set of data preparation steps for mass-spectrometry data, using the OpenMS software suite [39]. The demonstration consists of a Go file (with the .go extension) containing the workflow, provided together with a Kubernetes job-spec file in YAML format (with the .yml extension) for starting SciPipe inside a Kubernetes cluster [40].

Singularity containers [9] can already today be used from SciPipe on a simple level, by configuring the appropriate shell commands when creating new processes through the shell-command process constructor (the scipipe.NewProc() function in SciPipe). Development of more integrated support for Singularity containers is planned.

2.6 Argo workflow

Argo workflow [41] originates from a corporate application with the need for a competent continuous integration / continuous deployment (CI/CD) solution, but has matured into a generalized workflow engine. It was open sourced by Intuit and is now adopted across several industries and companies, among them life science and biopharma. A big differentiator of Argo is that no domain-specific language knowledge is required to build workflows. The workflows are instead constructed through a clearly defined declarative approach manifested as Kubernetes custom resource definitions (CRDs), which is a templated approach to enable composability and extendability on top of the original Kubernetes resource orchestration. Argo is often referred to as a cloud-native workflow orchestration tool and is designed for container-native execution without the overhead of other non-native container solutions.

Argo provides a robust way to declare and run workflows, as arguments and artifact objects can be declared as input and output, and containers run as state transitions merely passing objects and arguments through the pipeline. In each workflow step the state is stored in redundant etcd [42] memory for resilience and reproducibility. Argo is complemented by an artifacts concept that enables data and parameter versioning by complementing Argo with an S3-compatible storage backend. Furthermore, Argo Events enables signals and events for triggering workflows on webhooks, resource changes, GitHub or Docker image changes etc., enabling additional automation scenarios.
2.7 Kubeflow

Kubeflow [43] is another open-source platform for Kubernetes, specifically aimed at building and running machine learning workflows, and is developed by Google. As a framework, Kubeflow is intended to provide a set of tools for various tasks in the process of machine learning workflows, from data preparation and training to model serving and maintenance. Many of the tools are existing tools with a user base in the industry, such as Jupyter Notebook, while others are customized wrappers or endpoints to frameworks like TensorFlow and PyTorch.

In particular, Kubeflow provides a workflow tool called Kubeflow Pipelines, which allows users to create workflows of varying complexity, chaining together series of container operations. The tool is based on the workflow engine Argo, which has been integrated with the Kubeflow ecosystem with a different UI than the original Argo UI. In essence, a user declares a workflow in the Kubeflow Pipelines software development kit (SDK), defining the different steps in terms of container operations, i.e. what container to use for the step, Kubernetes resource allocations, commands to execute in the container and so on. The user also declares the order of these container operations and any desired input parameters to the workflow, and from this definition a pipeline can be compiled. This pipeline is a portable configuration which in theory can be uploaded and used in any Kubeflow environment. With the pipeline deployed in Kubeflow, a user can start a run, where available pipeline parameters can be provided and the progress of the run tracked in the Kubeflow Pipelines UI. Here, logs of the containers' standard output, output artifacts and various Kubernetes information can be tracked for each step, along with a visualization of the workflow graph.

As an evaluation of the Kubeflow Pipelines system, a simple pipeline was developed based on previous work [44], where a convolutional neural network is trained to "... predict cell mechanisms of action in response to chemical perturbations ...", based on a dataset of cell microscopy images from the Broad Bioimage Benchmark Collection. The pipeline covers the following steps of the machine learning process: data pre-processing, model training, model evaluation and model serving preparation. The result is a workflow that handles the entire process end-to-end, building a servable machine learning model and publishing it as a Docker container that can be deployed for serving predictions. The pipeline source code, while partly tailored to a specific environment, is available in [45].
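As a rough sketch of what such a pipeline definition looks like in the Kubeflow Pipelines Python SDK (v1-style dsl.ContainerOp syntax), the example below chains two container operations and compiles them into a portable package. The images, commands and step names are hypothetical and greatly simplified compared to the pipeline in [45]:

    import kfp
    from kfp import dsl

    def preprocess_op():
        # One workflow step corresponds to one container operation
        return dsl.ContainerOp(
            name="preprocess",
            image="example/preprocess:1.0",
            command=["python", "preprocess.py"],
            arguments=["--out", "/data/train"],
        )

    def train_op():
        return dsl.ContainerOp(
            name="train",
            image="example/train:1.0",
            command=["python", "train.py"],
            arguments=["--data", "/data/train", "--model", "/data/model"],
        )

    @dsl.pipeline(name="cnn-demo", description="Preprocess data and train a model")
    def cnn_pipeline():
        pre = preprocess_op()
        train_op().after(pre)   # declare execution order between container operations

    if __name__ == "__main__":
        # Compile into a package that can be uploaded to a Kubeflow Pipelines installation
        kfp.compiler.Compiler().compile(cnn_pipeline, "cnn_pipeline.tar.gz")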


3 Discussion

The tools covered in this study have taken slightly different approaches to working with containers, with different implications. Most tools are designed to work with general command-line tools while also supporting containers, except Pachyderm, Argo and Kubeflow, which are Kubernetes-native and only support containers as the means of processing. Workflow tools can almost always work with containers instead of plain command-line tools, using the 'docker run' command and taking advantage of an external scheduling system such as a batch scheduler (e.g. SLURM). When deployed in cloud environments, Kubernetes is the orchestration framework most widely used by the tools, though Galaxy also supports Docker Swarm as well as provides a convenient mapping from the Conda package manager to containers. For Kubernetes, the tools differ in the way they interact with its API; Pachyderm, Argo, Kubeflow Pipelines and SciPipe interact via the Go API, whereas Nextflow, Galaxy and Luigi communicate via the REST API. While these APIs are not identical, the API designs are very similar in terms of data structures and in effect allow for the same level of control. In more detail, the structure of the JSON documents sent to the REST API is closely matched by the hierarchic structure of the 'spec' structs in the Go API. The workflow tools also differ somewhat in the level of flexibility they allow for utilizing the API features, where Luigi is the most versatile as it allows for passing a raw JSON string with full flexibility, but also requires the most skill from workflow authors.

Error management for containerized workflows is an important concept. Kubernetes is designed to restart pods that fail, which is not particularly useful if there is an error reported from the containerized tool. The cloud paradigm includes fault-tolerance, and the tools differ in their strategies. Galaxy, Luigi and SciPipe use Kubernetes Jobs, which allow restarting the pods N times before reporting a failure. Argo and Kubeflow Pipelines run workflow steps as pods that are governed by Argo's custom Kubernetes Workflow resource; as such, errors for individual containers are handled as for normal Kubernetes pods, including error messages detailing reasons for failure etc. Nextflow offers the possibility to automatically tune the job, e.g. to increase the available memory for a job before reporting it as failed.
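To make the Jobs-based strategy concrete, the sketch below submits one containerized step as a Kubernetes Job with a bounded number of retries, using the official Kubernetes Python client for illustration (it wraps the same REST API, and the Go client used by several of the tools exposes an equivalent hierarchy of 'spec' structures). Image, names and paths are hypothetical:

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in a pod

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="qc-step"),
        spec=client.V1JobSpec(
            backoff_limit=2,  # restart the pod at most twice before marking the Job as failed
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="qc",
                        image="example/qc-tool:1.0",
                        command=["qc", "--in", "/data/reads.fastq", "--out", "/data/qc.txt"],
                    )],
                )
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)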
The way in which workflow systems schedule containers on Kubernetes is one factor where they differ from each other. Nextflow implements an actor-like model to handle job requests. Jobs are submitted in a manner similar to an internal job queue; then a separate thread submits a request to the Kubernetes cluster for each of them and periodically checks their status until job completion is detected. Galaxy has two handlers iterating over all running jobs and keeps no separate threads or processes per Kubernetes job. Pachyderm makes direct use of Kubernetes scheduling and uses etcd for storing metadata for data versioning and the status of all jobs. It is also the only system that uses the Kubernetes feature of "parallel jobs". Argo and Kubeflow Pipelines similarly utilize etcd for storing metadata of workflows and executions, and employ a workflow controller component to schedule containers based on the state of deployed workflows. Luigi keeps a thread open and continuously polls the job status. SciPipe keeps a lightweight thread (go-routine) alive for each created job, as the Go API call for creating jobs blocks the calling go-routine until the job finishes or is stopped for other reasons.

Handling many concurrent jobs can become a problem in Luigi/SciLuigi, where a separate Python process is started for each job, which imposes a practical limit on the number of concurrent jobs; in the authors' experience, 64 is a rule-of-thumb upper limit if running on a single node, or else 300. Going above this tends to result in HTTP timeouts, since workers are talking to the central scheduler via HTTP calls over the network.

So what are the main differences between the systems when running containers in cloud environments? Galaxy has a graphical user interface and might be more attractive for users with less expertise in programming/scripting. Workflows in Argo can be submitted and controlled from multi-platform CLIs as well as monitored through a graphical user interface, but can also be controlled by other Kubernetes cluster resources, since workflows are standard custom resource definition (CRD) resources. Kubeflow also provides a GUI as the main method for running pipelines, in addition to being a very portable platform in Kubernetes environments. Kubeflow Pipelines does however have a somewhat steep learning curve, as the Python-based domain-specific language (DSL) requires a good grasp of Kubernetes concepts. Nextflow is arguably the most versatile system for working with containers, being able to operate either on cloud or HPC batch schedulers. Pachyderm, Argo and Kubeflow are built specifically for running on Kubernetes, and Pachyderm has a data-versioning file system built in. Argo and Kubeflow Pipelines support artifacts in workflows and versioning of input and output artifacts through interfacing with an optional S3 storage backend.


Luigi/SciLuigi supports many data sources out of the box, including HDFS, S3, etc., and integrates seamlessly with Apache Hadoop and Spark. SciPipe implements an agile workflow programming API that smoothly integrates with the fast-growing Go programming language ecosystem.

We would like to end with a word of caution regarding the use of containers. While containers have many advantages from a technical perspective and alleviate many of the problems with dependency management, they also come with implications, as they can make tools more opaque and discourage accessing and inspecting the inner workings of containers. This could potentially lead to misuse or a lack of understanding of what a specific tool does. Thus, when wiring together containers in a scientific workflow, proper care needs to be taken that the observed output matches what would be expected from using the tool directly [46]. Also for this reason, we strongly suggest following community best practices for the packaging and containerization of bioinformatics software [47].

4 Declarations

4.1 List of abbreviations

• API: Application programming interface
• AWS: Amazon Web Services
• CI/CD: Continuous integration / continuous deployment
• CRD: Custom resource definition
• DSL: Domain-specific language
• GCP: Google Cloud Platform
• GUI: Graphical user interface
• HPC: High-performance computing
• HTC: High-throughput computing
• NFS: Network file system
• NGS: Next-generation sequencing
• PV: Persistent volume
• PVC: Persistent volume claim
• SDK: Software development kit
• SFTP: SSH file transfer protocol
• UI: User interface
• VM: Virtual machine
• VMI: Virtual machine image
• WMS: Workflow management system

4.2 Funding

This research was supported by the European Commission's Horizon 2020 programme under grant agreement number 654241 (PhenoMeNal) and grant agreement number 731075 (OpenRiskNet), the Swedish Foundation for Strategic Research, the Swedish Research Council FORMAS, the Swedish e-Science Research Centre (SeRC), the Åke Wiberg Foundation, and the Nordic e-Infrastructure Collaboration (NeIC) via the Glenna2 project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

4.3 Author's Contributions

OS and SL coordinated the project. All authors contributed with experiences on containers and workflow systems as well as manuscript preparation.

4.4 Competing Interests

PDT and EF are founders of Seqera Labs, a company based in Barcelona, Spain, offering commercial support for the (open source) Nextflow software. SL is involved in RIL Partner AB, a Sweden-based company offering commercial support for the (open source) SciPipe software.


References

[1] Vivien Marx. Biology: The big challenges of big data. Nature, 498(7453):255–260, Jun 2013.

[2] Bertil Schmidt and Andreas Hildebrandt. Next-generation sequencing: big data meets high performance computing. Drug Discovery Today, 22(4):712–717, Apr 2017.

[3] Mike May. Big data, big picture: Metabolomics meets systems biology. Science, 356(6338):646–648, 2017.

[4] Vivien Marx. Genomics in the clouds. Nature Methods, 10(10):941–945, September 2013.

[5] Nadia Drake. How to catch a cloud. Nature, 522(7554):115–116, June 2015.

[6] Björn A Grüning, Samuel Lampa, Marc Vaudel, and Daniel Blankenberg. Software engineering for scientific big data analysis. GigaScience, 8(5), May 2019.

[7] Martin Dahlö, Frédéric Haziza, Aleksi Kallio, Eija Korpelainen, Erik Bongcam-Rudloff, and Ola Spjuth. BioImg.org: A Catalog of Virtual Machine Images for the Life Sciences. Bioinformatics and Biology Insights, 9:BBI.S28636, 2015.

[8] Andrew Silver. Software simplified. Nature, 546(7656):173–174, June 2017.

[9] Gregory M. Kurtzer, Vanessa Sochat, and Michael W. Bauer. Singularity: Scientific containers for mobility of compute. PLOS ONE, 12(5):e0177459, May 2017.

[10] Jorge Gomes, Emanuele Bagnaschi, Isabel Campos, Mario David, Luís Alves, João Martins, João Pina, Alvaro López-García, and Pablo Orviz. Enabling rootless linux containers in multi-user environments: The udocker tool. Computer Physics Communications, 232:84–97, November 2018.

[11] Richard Shane Canon and Doug Jacobsen. Shifter: containers for hpc. Proceedings of the Cray User Group, 2016.

[12] Docker Hub. Accessed 20 Jan 2020.

[13] Felipe da Veiga Leprevost, Björn A Grüning, Saulo Alves Aflitos, Hannes L Röst, Julian Uszkoreit, Harald Barsnes, Marc Vaudel, Pablo Moreno, Laurent Gatto, Jonas Weber, Mingze Bai, Rafael C Jimenez, Timo Sachsenberg, Julianus Pfeuffer, Roberto Vera Alvarez, Johannes Griss, Alexey I Nesvizhskii, and Yasset Perez-Riverol. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics, 33(16):2580–2582, March 2017.

[14] Reem Almugbel, Ling-Hong Hung, Jiaming Hu, Abeer Almutairy, Nicole Ortogero, Yashaswi Tamta, and Ka Yee Yeung. Reproducible bioconductor workflows using browser-based interactive notebooks and containers. Journal of the American Medical Informatics Association, 25(1):4–12, October 2017.

[15] Heru Suhartanto, Agung P Pasaribu, Muhammad F Siddiq, Muhammad I Fadhila, Muhammad H Hilman, and Arry Yanuar. A preliminary study on shifting from virtual machine to docker container for insilico drug discovery in the cloud. International Journal of Technology, 8(4):611, July 2017.

[16] Ling-Hong Hung, Daniel Kristiyanto, Sung Bong Lee, and Ka Yee Yeung. GUIdock: Using docker containers with a common graphics user interface to address the reproducibility of research. PLOS ONE, 11(4):e0152686, April 2016.

[17] Baekdoo Kim, Thahmina Ali, Carlos Lijeron, Enis Afgan, and Konstantinos Krampis. Bio-docklets: virtualization containers for single-step execution of NGS pipelines. GigaScience, 6(8), June 2017.

[18] Wade L Schulz, Thomas Durant, Alexa J Siddon, and Richard Torres. Use of application containers and workflows for genomic data analysis. Journal of Pathology Informatics, 7(1):53, 2016.

[19] Kubernetes Project. Accessed 20 Jan 2020.

[20] Docker Swarm. Accessed 20 Jan 2020.

[21] Apache Mesos. Accessed 20 Jan 2020.


[22] Paolo Di Tommaso, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4):316–319, April 2017.

[23] Daniel Blankenberg, Gregory Von Kuster, Emil Bouvier, Dannon Baker, Enis Afgan, Nicholas Stoler, James Taylor, and Anton Nekrutenko. Dissemination of scientific software with galaxy ToolShed. Genome Biology, 15(2):403, 2014.

[24] Enis Afgan, Dannon Baker, Bérénice Batut, Marius van den Beek, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Björn A Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Saskia Hiltemann, Vahid Jalili, Helena Rasche, Nicola Soranzo, Jeremy Goecks, James Taylor, Anton Nekrutenko, and Daniel Blankenberg. The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, 46(W1):W537–W544, May 2018.

[25] C. Sloggett, N. Goonasekera, and E. Afgan. BioBlend: automating pipeline analyses within galaxy and CloudMan. Bioinformatics, 29(13):1685–1686, April 2013.

[26] Kristian Peters, James Bradbury, Sven Bergmann, Marco Capuccini, Marta Cascante, Pedro de Atauri, Timothy M D Ebbels, Carles Foguet, Robert Glen, Alejandra Gonzalez-Beltran, Ulrich L Günther, Evangelos Handakas, Thomas Hankemeier, Kenneth Haug, Stephanie Herman, Petr Holub, Massimiliano Izzo, Daniel Jacob, David Johnson, Fabien Jourdan, Namrata Kale, Ibrahim Karaman, Bita Khalili, Payam Emami Khonsari, Kim Kultima, Samuel Lampa, Anders Larsson, Christian Ludwig, Pablo Moreno, Steffen Neumann, Jon Ander Novella, Claire O'Donovan, Jake T M Pearce, Alina Peluso, Marco Enrico Piras, Luca Pireddu, Michelle A C Reed, Philippe Rocca-Serra, Pierrick Roger, Antonio Rosato, Rico Rueedi, Christoph Ruttkies, Noureddin Sadawi, Reza M Salek, Susanna-Assunta Sansone, Vitaly Selivanov, Ola Spjuth, Daniel Schober, Etienne A Thévenot, Mattia Tomasoni, Merlijn van Rijswijk, Michael van Vliet, Mark R Viant, Ralf J M Weber, Gianluigi Zanetti, and Christoph Steinbeck. PhenoMeNal: processing and analysis of metabolomics data in the cloud. GigaScience, 8(2), 12 2018. giy149.

[27] Pablo Moreno, Luca Pireddu, Pierrick Roger, Nuwan Goonasekera, Enis Afgan, Marius van den Beek, Sijin He, Anders Larsson, Daniel Schober, Christoph Ruttkies, David Johnson, Philippe Rocca-Serra, Ralf JM Weber, Björn Gruening, Reza M Salek, Namrata Kale, Yasset Perez-Riverol, Irene Papatheodorou, Ola Spjuth, and Steffen Neumann. Galaxy-kubernetes integration: scaling bioinformatics workflows in the cloud. bioRxiv, December 2018.

[28] Björn Grüning, Ryan Dale, Andreas Sjödin, Brad A. Chapman, Jillian Rowe, Christopher H. Tomkins-Tinch, Renan Valieris, Johannes Köster, and The BioConda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7):475–476, July 2018.

[29] Pachyderm. Accessed 20 Jan 2020.

[30] Jon Ander Novella, Payam Emami Khoonsari, Stephanie Herman, Daniel Whitenack, Marco Capuccini, Joachim Burman, Kim Kultima, and Ola Spjuth. Container-based bioinformatics with pachyderm. Bioinformatics, 35(5):839–846, August 2018.

[31] Luigi. Accessed 20 Jan 2020.

[32] Samuel Lampa, Jonathan Alvarsson, and Ola Spjuth. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. Journal of Cheminformatics, 8(1), November 2016.

[33] Christina Ranninger, Lukas Schmidt, Marc Rurik, Alice Limonciel, Paul Jennings, Oliver Kohlbacher, and Christian Huber. MTBLS233: Improving global feature detectabilities through scan range splitting for untargeted metabolomics by high-performance liquid chromatography-Orbitrap mass spectrometry. Accessed 20 Jan 2020.


[34] Marco Capuccini, Anders Larsson, Matteo Carone, Jon Ander Novella, Noureddin Sadawi, Jianliang Gao, Salman Toor, and Ola Spjuth. On-demand virtual research environments using microservices. PeerJ Computer Science, 5:e232, November 2019.

[35] MTBLS233 with PhenoMeNal Jupyter. Accessed 20 Jan 2020.

[36] Samuel Lampa, Martin Dahlö, Jonathan Alvarsson, and Ola Spjuth. SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines. GigaScience, 8(5), April 2019.

[37] Samuel Lampa, Jonathan Alvarsson, Staffan Arvidsson Mc Shane, Arvid Berg, Ernst Ahlberg, and Ola Spjuth. Predicting off-target binding profiles with confidence using conformal prediction. Frontiers in Pharmacology, 9, November 2018.

[38] The Kubernetes Community (2016). Go client for kubernetes. Accessed 16 Jan 2018.

[39] Hannes L Röst, Timo Sachsenberg, Stephan Aiche, Chris Bielow, Hendrik Weisser, Fabian Aicheler, Sandro Andreotti, Hans-Christian Ehrlich, Petra Gutenbrunner, Erhan Kenar, Xiao Liang, Sven Nahnsen, Lars Nilse, Julianus Pfeuffer, George Rosenberger, Marc Rurik, Uwe Schmitt, Johannes Veit, Mathias Walzer, David Wojnar, Witold E Wolski, Oliver Schilling, Jyoti S Choudhary, Lars Malmström, Ruedi Aebersold, Knut Reinert, and Oliver Kohlbacher. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nature Methods, 13(9):741–748, August 2016.

[40] Samuel Lampa. OpenMS SciPipe workflow example. Accessed 16 Jan 2018.

[41] The Argo Authors (2019). Argo project. Accessed 14 Oct 2019.

[42] The Etcd Community (2019). Distributed reliable key-value store for the most critical data of a distributed system. Accessed 14 Oct 2019.

[43] Kubeflow. Accessed 20 Jan 2020.

[44] Alexander Kensert, Philip J. Harrison, and Ola Spjuth. Transfer learning with deep convolutional neural networks for classifying cellular morphological changes. SLAS DISCOVERY: Advancing Life Sciences R&D, 24(4):466–475, January 2019.

[45] Alexander Kensert. CNN example pipeline. Accessed 20 Jan 2020.

[46] Konrad Hinsen. Verifiability in computer-aided research: the role of digital scientific notations at the human-computer interface. PeerJ Computer Science, 4:e158, July 2018.

[47] B Gruening, O Sallou, P Moreno, F da Veiga Leprevost, H Ménager, D Søndergaard, H Röst, T Sachsenberg, B O'Connor, F Madeira, V Dominguez Del Angel, MR Crusoe, S Varma, D Blankenberg, RC Jimenez, and Y Perez-Riverol. Recommendations for the packaging and containerizing of bioinformatics software [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research, 7(742), 2019.
