Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 31 January 2020 doi:10.20944/preprints202001.0378.v1

Approaches for containerized scientific workflows in cloud environments with applications in life science

Ola Spjuth1,*, Marco Capuccini1,2, Matteo Carone1, Anders Larsson3, Wesley Schaal1, Jon Ander Novella1, Oliver Stein1, Morgan Ekmefjord1, Paolo Di Tommaso4, Evan Floden4, Cedric Notredame4, Pablo Moreno5, Payam Emami Khoonsari6, Stephanie Herman1,6, Kim Kultima6, and Samuel Lampa1

1 Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Sweden
2 Department of Information Technology, Uppsala University, Sweden
3 National Bioinformatics Infrastructure Sweden, Uppsala University, Sweden
4 Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain
5 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
6 Department of Medical Sciences, Clinical Chemistry, Uppsala University, Sweden
* Corresponding author: [email protected]

January 30, 2020

© 2020 by the author(s). Distributed under a Creative Commons CC BY license.

Abstract

Containers are gaining popularity in life science research as they provide a solution for encompassing dependencies of provisioned tools, simplify software installations for end users and offer a form of isolation between processes. Scientific workflows are ideal for chaining containers into data analysis pipelines to aid in creating reproducible analyses. In this manuscript we review a number of approaches to using containers as implemented in the workflow tools Nextflow, Galaxy, Pachyderm, Argo, Kubeflow, Luigi and SciPipe, when deployed in cloud environments. A particular focus is placed on the workflow tools' interaction with the Kubernetes container orchestration framework.

1 Introduction

The life sciences have become data-intensive, driven largely by the massive increase in throughput and resolution of molecular data generating technologies. Massively parallel sequencing (also known as next-generation sequencing or NGS) is where the largest increase has been seen in recent years, but other domains are also increasing dramatically, including proteomics, metabolomics, systems biology and biological imaging [1, 2, 3]. Consequently, the need for computational and storage resources has continued to grow, but the focus has also shifted towards the downstream steps of biological experiments and the need to carry out efficient and reproducible data analysis. While high-performance computing (HPC) and high-throughput computing (HTC) clusters remain the main e-infrastructure resource used in biological data analysis, cloud computing is emerging as an appealing alternative where, in the infrastructure-as-a-service (IaaS) case, scientists are able to spawn virtual instances or infrastructure on demand to facilitate their analysis, after which the resources are released [4, 5].

One area which has traditionally caused headaches for users is the installation of software tools and their inclusion into workflow tools or other computing frameworks [6]. This is an area where cloud resources provide an advantage over HPC, in that scientists are not dependent on system administrators to install software, but can handle the installation themselves.



However, software for biological analyses can be quite challenging to install due to sometimes complex dependencies. Virtual machines (VMs) offer the benefit of instantiating ready-made environments with data and software, including all dependencies for specific tasks or analyses, and constitute a big step towards making computational reproducibility easier and more realistic to achieve in daily work. VMs form the backbone of traditional cloud-based infrastructures. Virtual machine images (VMIs), however, can easily become large and take considerable time to instantiate, transfer or rebuild when software in the image needs to be updated. Such images can easily be shared between users and can be deposited and indexed by one of the available VMI catalogs [7].

1.1 Containers

Software container technology has emerged as a complement to virtual machines. While containers offer a bit less isolation between processes, they are on the other hand more lightweight and easy to share; sharing the operating system kernel makes them much faster to launch and terminate [8]. Containers encompass all dependencies of the provisioned tools and greatly simplify software installations for end users. The most widely used containerization solution is Docker (www.docker.com), but Singularity [9], uDocker [10] and Shifter [11] are recent alternatives that prevent users from running containers with root privileges, addressing the most common security issues when deploying containers in multi-tenant computing clusters, such as on high-performance computing (HPC) clusters. Docker containers are commonly shared via Docker Hub [12]. There are also initiatives for standardizing containers in the life sciences, such as BioContainers [13]. Containers have seen an increased uptake in the life sciences, both for delivering software tools and for facilitating data analysis in various ways [14, 15, 16, 17, 18].

When running more than just a few containers, an orchestration system is needed to coordinate and manage their execution and handle issues related to e.g. load balancing, health checks and scaling. Kubernetes [19] has over the last couple of years risen to become the de facto standard container orchestration system, but Docker Swarm [20] and Apache Mesos [21] are other well-known systems. A key objective for these systems is to allow users to treat a cluster of compute nodes as a single deployment target, handling the packaging of containers onto compute nodes behind the scenes.

1.2 Scientific workflows

While orchestration tools such as Kubernetes enable scientists to run many containers in a distributed environment (such as a virtual infrastructure on a cloud provider), the problem remains of how to orchestrate the scheduling of, and dependencies between, containers and how to chain them (define data dependencies between them) into an analysis pipeline. This is where scientific workflow management systems (WMSs) can be very useful; in fact, it can be difficult to carry out more complex analyses without such a system. Traditionally in data-intensive bioinformatics, an important task for WMSs has been to support analyses consisting of several components by running a set of command-line tools in a sequential fashion, commonly on a computer cluster. The main benefits of using a WMS include: i) making automation of multi-step computations easier to create, more robust and easier to change; ii) providing more transparency as to what the pipeline does, through its more high-level workflow description and better reporting and visualization facilities; and iii) providing more reliable policies to handle transient or persistent error conditions, including strategies to recover from interrupted computations while re-using any partially finished data.

Using containerized components as the nodes in a scientific workflow adds isolation between processes, completely encapsulates tool dependencies within each container, and reduces the burden of administrating the host system where the workflow is scheduled to run. This means the workflow can be tested on a local computer and executed on remote servers and clusters without modification, using the exact same containers. Naturally, this opens up for execution on virtual infrastructures, even those provisioned on-demand with features such as auto-scaling. Apart from greatly simplifying access to the needed distributed compute resources, a positive side-effect is that the entire analysis becomes portable and reproducible.
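As a minimal illustration of this pattern, two containerized steps can be chained by mounting a shared work directory into each container. The sketch below uses plain 'docker run' calls from Python, with hypothetical image names, commands and file names; a real workflow engine adds dependency tracking, scheduling and error handling on top of this basic idea:

    import subprocess
    from pathlib import Path

    # Shared work directory that acts as the "storage system" between steps
    workdir = Path("/tmp/analysis").resolve()
    workdir.mkdir(parents=True, exist_ok=True)

    def run_step(image, command):
        """Run one tool in its own container, with the shared work
        directory mounted at /data inside the container."""
        subprocess.run(
            ["docker", "run", "--rm", "-v", f"{workdir}:/data", image] + command,
            check=True,
        )

    # Step 1: a (hypothetical) QC tool writes its report to shared storage
    run_step("example/qc-tool:1.0",
             ["qc", "--in", "/data/reads.fastq", "--out", "/data/qc.txt"])

    # Step 2: a (hypothetical) aligner reads the output of step 1
    run_step("example/aligner:1.0",
             ["align", "--qc", "/data/qc.txt", "--out", "/data/aln.bam"])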



Data ingestion and access to reference data constitute a key step when using cloud-based resources. When spawning a virtual infrastructure, data used in the analysis either needs to be available on storage connected to the compute nodes, or ingested into a locally provisioned file system. At the end of the analysis, the resulting data needs to either be put into persistent storage on the cloud or downloaded to local storage. However, the overhead with this can also be turned around: if data resides on storage connected to a cloud infrastructure, it is possible to simply move the workflow and containers there and "bring compute to the data".

Figure 1: Overview of a containerized workflow with a workflow engine managing the dependency graph and scheduling of the containers encompassing the tools to run. The tools pass data between each other by reading and writing to a shared storage system.

Figure 2: Comparison between two approaches for how to run the workflow engine. It can run either a) on the user's computer, or b) in a pod in the Kubernetes (abbreviated "k8s" in the image) cluster.

2 Workflow systems

Due to their simplicity, containers are supported out of the box by most workflow tools. However, the approach for interacting with containers differs, especially when using virtual infrastructures on cloud resources. This section describes the approaches taken by a set of workflow management systems, with varying degrees of uptake in the life science community, to interacting with containers in cloud environments.

2.1 Nextflow

Nextflow [22] is a workflow framework that aims to ease the implementation of scientific workflows in a portable and reproducible manner across heterogeneous computing platforms and to enable the smooth migration of these applications to the cloud. The framework is based on the dataflow paradigm, a functional/reactive programming model in which tasks are isolated from each other and are expected to be executed in a stateless manner. This paradigm simplifies the deployment of complex workflows in distributed computing environments and matches the immutable execution model of containers.

Nextflow provides built-in support for the most widely used container runtimes, i.e. Docker, Singularity, Shifter and uDocker (though the last two are undocumented as they are still in an experimental state). The use of containers is defined in a declarative manner, i.e. by specifying the container image that a workflow needs to use for the task executions. Once annotated in this way, the Nextflow runtime takes care of transparently running each task in its own container instance while mounting the required input and output files as needed. This approach allows a user who needs to target the requirements of a specific execution platform to quickly switch from one containerization technology to another with just a few changes in the workflow configuration file (e.g. the same application can be deployed in an HPC cluster using Singularity or in the cloud with Docker).

Great flexibility is given for container image configuration. Nextflow supports both the image-per-workflow pattern (the same image for all executed tasks) and the image-per-task pattern (each task uses a different container image). The container image used by a task can even be specified at runtime by a given input configuration.

Nextflow allows the deployment of containerized workloads in a cloud-native manner using different strategies, as briefly discussed in the following paragraphs.

Cloud-managed computing service: in this approach workflow tasks are submitted to a cloud provider's batch execution service. The containerization and auto-scaling are delegated to the cloud service. The workflow inputs/outputs and intermediate result files need to be stored using the provider's object storage service. Nextflow adapts the task definition, resource requests and container image to the cloud provider format, submitting the corresponding API requests and properly staging the task data. Currently, Nextflow has built-in support for the AWS Batch service and the Google Genomics Pipelines service.

Cloud unmanaged computing service: when using this deployment strategy, Nextflow takes care of provisioning a set of cloud VM instances in which it automatically sets up its own clustering engine for the distributed execution of the workflow. The VM instances only require the availability of a Docker runtime and the Java virtual machine. Workflow tasks exchange data via a shared file system or by using a cloud object storage service. The workflow execution is managed by Nextflow, which also handles the cluster auto-scaling, i.e. adds or removes VM instances on demand to adapt to the actual needs of a workload at any point in time and optimizes the overall computing cost when applicable. At the time of writing, this feature supports the use of the AWS EC2 cloud service and the Google Compute Engine service.

Figure 3: Overview of a containerized workflow with a workflow engine managing the dependency graph and scheduling of the containers encompassing the tools to run. The tools pass data between each other by reading and writing to a shared storage system.


2.2 Galaxy

Galaxy (Galaxy, RRID:SCR_006281) is a data-analysis workflow platform that has probably the most active development community in the field of workflow environments in bioinformatics. This has enabled it to persist as an active project and to continue to evolve since 2005, enabling reproducible research through active channels for sharing tools, workflows and datasets [23]. Galaxy is currently used by tens of thousands of scientists across the globe [24]. Besides offering a modern user interface (UI) for end users, Galaxy is accessible through its REST API or through the BioBlend Python binding (which talks to this REST API).

For execution versatility, Galaxy has adapters to offload jobs to a wide variety of systems, from local Docker containers to different batch systems. In Galaxy, there is a complete separation of concerns between tools, workflows and execution environments, meaning that a workflow definition is not bound to a particular execution environment. Within the last five years Galaxy gained the ability to launch on Amazon AWS through the CloudMan Launcher Galaxy sub-project [25]. This setup follows a relatively VM-centric approach, which has natural limitations for scaling in the number of tools included, as they all need to be provisioned inside the VM used for the provisioned cluster.

More recently, through efforts within the PhenoMeNal H2020 Project [26], Galaxy gained the ability to deploy inside Kubernetes in an automated fashion through the use of Helm Charts [27]. The Galaxy-stable Helm chart makes use of community-maintained Galaxy container images (Docker-Galaxy-stable), which are used on other container orchestrators, and includes support for SSH file transfer protocol (SFTP) access, database backends and even deployment/compatibility of HPC scheduling systems like Condor or SLURM for hybrid cloud setups. The Kubernetes job runner, also contributed to the Galaxy community by the PhenoMeNal H2020 Project, allows Galaxy to offload jobs to Kubernetes, either when Galaxy runs inside the container orchestrator or from the outside (provided that the shared file system is accessible to both the container orchestrator and Galaxy). The Kubernetes runner for Galaxy takes care of resource requirement settings, recovery of jobs in Kubernetes if there is a restart, auto-upgrading of resource utilization, Persistent Volume Claim mounting both for Galaxy and in each job's pod, as well as adequate supplementary groups handling for network filesystem (NFS) and other shared file systems. This integration requires the use of a shared file system, which needs to be mountable in Kubernetes in a read-write-many configuration, as both the main Galaxy pod and the job pods need to mount it at the same time (through the PV/PVC abstraction). The Kubernetes runner for Galaxy adheres to the principles of fail-safety and redundancy commonly considered in cloud computing, providing ways to recover from jobs being lost due to cluster nodes going down in the middle of an execution.

The Kubernetes runner for Galaxy can benefit from explicit tool-to-container mapping through Galaxy dynamic destinations, or take advantage of dynamic Bioconda-to-container mapping thanks to built-in functionality in Galaxy for this purpose. This means that if a Galaxy tool declares a Bioconda package as the dependency to be resolved, it will bring in an automatically built container for that Bioconda package [28] from BioContainers [13].

The Galaxy-Kubernetes integration, which includes all the aspects related to Kubernetes mentioned in the previous paragraphs, has been battle tested by the PhenoMeNal H2020 Project, with its public instance having received on the order of tens of thousands of jobs during the past two years, and with tens of deployments in production environments and possibly hundreds of deployments in development settings. This integration has been tested locally on Minikube, on OpenStack deployments, on Google Cloud Platform (GCP) and on Amazon Web Services (AWS). It is expected that in the next few years the Galaxy-Kubernetes integration will be embraced in the areas of proteomics and transcriptomics. The CloudMan Galaxy launcher is being considered for migration to a more containerized deployment scheme based on the Galaxy-Kubernetes setup.


2.3 Pachyderm

Pachyderm [29] is a large-scale data processing tool built natively on top of Kubernetes. This platform offers: i) a data management system based on Git semantics and ii) a workflow system for creating distributed and reproducible pipelines based on application containers. In order to create workflows with Pachyderm, users simply need to supply a JSON pipeline specification including a Docker image, an entrypoint command to execute in the user containers and one or more data input(s). Afterwards, Pachyderm will ensure that the corresponding pods are created in Kubernetes, and will share the input data across them and collect the corresponding outputs. Thanks to using Kubernetes for container orchestration, Pachyderm is able to, among other things: i) optimize cluster resource utilization, ii) run seamlessly on multiple cloud and on-premise environments, and iii) self-heal pipeline jobs and related resources.

This workflow tool provides an easy mechanism to distribute computations over a collection of containers. In other words, it offers similar capabilities to frameworks such as Apache Spark, but replaces MapReduce syntax with legacy code. More specifically, Pachyderm is able to distribute workloads over collections of containers by partitioning the data into minimal data units of computation called "datums". The contents of these datums are defined by glob patterns such as "/" or "/*", which instruct Pachyderm how to process the input data: all together as a single datum, or in isolation (as separate datums). Users can define the number of pipeline workers to process these datums, which will be processed one at a time in isolation. When the datums have been successfully processed, the Pachyderm daemon gathers the results corresponding to each datum, combines them and versions the complete output of the pipeline. The scheduling of pipeline workers is determined by the resource utilization status of the worker nodes and the resource requests that are made for each pipeline, allowing workloads to be distributed efficiently. Recently, work to integrate and demonstrate Pachyderm in bioinformatics was published [30].
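To give an idea of what such a specification can look like, the sketch below assembles a minimal pipeline spec as a Python dictionary and writes it to JSON. The image, repository name and parallelism value are hypothetical, and the exact set of supported fields depends on the Pachyderm version:

    import json

    pipeline_spec = {
        "pipeline": {"name": "qc"},
        "transform": {
            # Container image and command to run for each datum
            "image": "example/qc-tool:1.0",
            "cmd": ["sh", "-c", "qc --in /pfs/reads --out /pfs/out/qc.txt"],
        },
        # One datum per top-level entry in the input repository
        "input": {"pfs": {"repo": "reads", "glob": "/*"}},
        # Number of pipeline workers processing datums in parallel
        "parallelism_spec": {"constant": 4},
    }

    with open("qc_pipeline.json", "w") as f:
        json.dump(pipeline_spec, f, indent=2)
    # The resulting JSON file is then submitted with the pachctl command-line
    # client (exact command syntax varies between Pachyderm versions).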
2.4 Luigi

Luigi [31] is an open-source Python module for defining and running batch-like workflows. The system is strongly oriented towards Big Data, and it comes as a convenience tool to pipe together jobs from well-established analytics frameworks (e.g. Hadoop, Spark, Pig and Hive). Hence, instead of trying to substitute any of the available analytics systems, Luigi provides the means to plug together jobs from different frameworks into a single batch-oriented workflow. In doing so, Luigi seamlessly handles many boilerplate operations such as dependency resolution, visualization, failure tolerance and atomic file system operations.

The Kubernetes integration was developed by the PhenoMeNal consortium, and it is now part of the official Luigi project and thus supported and maintained by the community. The integration makes it possible to include pipeline processing steps as Kubernetes Jobs, which represent batch-like application containers that run until command completion. Hence, Luigi mainly uses Kubernetes as a cluster-resource manager where multiple parallel jobs can be run concurrently. To this extent, it is important to point out that Luigi is not meant to be used as a substitute for parallel processing frameworks (such as Spark and Hadoop), as it has limited support for concurrent tasks. In fact, in our experience Luigi can handle up to 300 parallel tasks before it starts to break down, or 64 parallel tasks if running on a single node. A project to adapt Luigi for scientific applications in general and bioinformatics in particular resulted in a tool named SciLuigi [32].

We benchmarked Luigi on Kubernetes by reproducing a large-scale metabolomics study [33]. The analysis was carried out on a cloud-based Kubernetes cluster, provisioned via KubeNow and hosted by EMBL-EBI, showing good scalability up to 40 concurrent jobs [34]. The workflow definition, and instructions to reproduce the experiment, are publicly available on GitHub [35].
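To make the Kubernetes Jobs integration more concrete, the following minimal sketch defines one containerized pipeline step as a Luigi task executed as a Kubernetes Job. The task name, image and command are hypothetical, and the luigi.contrib.kubernetes module documentation should be consulted for the full set of options and cluster configuration:

    import luigi
    from luigi.contrib.kubernetes import KubernetesJobTask

    class QCStep(KubernetesJobTask):
        """Run a (hypothetical) containerized QC tool as a Kubernetes Job."""
        name = "qc-step"      # name given to the Kubernetes Job
        max_retrials = 2      # allowed pod restarts before the task is reported as failed
        spec_schema = {
            # Container part of the Kubernetes Job specification
            "containers": [{
                "name": "qc",
                "image": "example/qc-tool:1.0",
                "command": ["qc", "--in", "/data/reads.fastq", "--out", "/data/qc.txt"],
            }]
        }
        # In a real pipeline, an output() target on shared storage would let
        # downstream tasks declare a dependency on this step.

    if __name__ == "__main__":
        luigi.build([QCStep()], local_scheduler=True)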


2.5 SciPipe

SciPipe [36] (SciPipe, RRID:SCR_017086) is a workflow library based on flow-based programming principles, which enables building workflows from component libraries of predefined workflow components. SciPipe is implemented as a programming library in Go, which enables using the full power of the Go programming language to define workflows, as well as compiling workflows to executable files, for maximum ease of deployment in cloud scenarios.

SciPipe has been used to orchestrate machine learning pipelines to build predictive models for off-target binding in pharmacology [37]. SciPipe currently provides early support for container-based workloads via Kubernetes, implemented through the official Kubernetes Go client library [38]. The Go client library enables transparent access to a Kubernetes cluster regardless of whether the SciPipe workflow runs outside the cluster or inside it, in a pod. When running inside the cluster, it automatically detects cluster settings. When not inside a cluster, the connection to a cluster can be configured by setting the appropriate environment variables. The Kubernetes integration in SciPipe is demonstrated through the implementation of a use case workflow consisting of a set of data preparation steps for mass-spectrometry data, using the OpenMS software suite [39]. The demonstration consists of a Go file (with the .go extension) containing the workflow, provided together with a Kubernetes job-spec file in YAML format (with the .yml extension) for starting SciPipe inside a Kubernetes cluster [40].

Singularity containers [9] can already today be used from SciPipe on a simple level, by configuring the appropriate shell commands when creating new processes through the shell-command process constructor (the scipipe.NewProc() function in SciPipe). Development of more integrated support for Singularity containers is planned.

2.6 Argo workflow

Argo workflow [41] originates from a corporate application with the need for a competent continuous integration / continuous deployment (CI/CD) solution, but has matured into a generalized workflow engine. It was open sourced by Intuit and is now adopted across several industries and companies, among them life science and biopharma. A big differentiator of Argo is that no domain-specific language knowledge is required to build workflows. The workflows are instead constructed through a clearly defined declarative approach manifested as Kubernetes custom resource definitions (CRDs), which is a templated approach to enable composability and extendability on top of the original Kubernetes resource orchestration. Argo is often referred to as a cloud-native workflow orchestration tool and is designed for container-native execution without the overhead of other non-native container solutions.

Argo provides a robust way to declare and run workflows, as arguments and artifact objects can be declared as input and output, and containers run as state transitions merely passing objects and arguments through the pipeline. In each workflow step the state is stored in redundant etcd [42] memory for resilience and reproducibility. Argo is complemented by an artifacts concept that enables data and parameter versioning by complementing Argo with an S3-compatible storage backend. Furthermore, Argo Events enables signals and events for triggering workflows on webhooks, resource changes, GitHub or Docker image changes etc., enabling additional automation scenarios.
2.7 Kubeflow

Kubeflow [43] is another open-source platform for Kubernetes, specifically aimed at building and running machine learning workflows, and is developed by Google. As a framework, Kubeflow is intended to provide a set of tools for various tasks in the process of machine learning workflows, from data preparation and training to model serving and maintenance. Many of the tools are existing tools with a user base in the industry, such as Jupyter Notebook, while others are customized wrappers or endpoints to frameworks like TensorFlow and PyTorch.

In particular, Kubeflow provides a workflow tool called Kubeflow Pipelines, which allows users to create workflows of varying complexity, chaining together series of container operations. The tool is based on the workflow engine Argo, which has been integrated with the Kubeflow ecosystem with a different UI than the original Argo UI. In essence, a user declares a workflow in the Kubeflow Pipelines software development kit (SDK), defining the different steps in terms of container operations, i.e. what container to use for the step, Kubernetes resource allocations, commands to execute in the container and so on. The user also declares the order of these container operations and any desired input parameters to the workflow, and from this definition a pipeline can be compiled. This pipeline is a portable configuration which in theory can be uploaded and used in any Kubeflow environment. With the pipeline deployed in Kubeflow, a user can start a run, where available pipeline parameters can be provided and the progress of the run tracked in the Kubeflow Pipelines UI. Here, logs of the containers' standard output, output artifacts and various Kubernetes information can be tracked for each step, along with a visualization of the workflow graph.

As an evaluation of the Kubeflow Pipelines system, a simple pipeline was developed based on previous work [44], where a convolutional neural network is trained to "... predict cell mechanisms of action in response to chemical perturbations ...", based on a dataset of cell microscopy images from the Broad Bioimage Benchmark Collection. The pipeline covers the following steps of the machine learning process: data pre-processing, model training, model evaluation and model serving preparation. The result is a workflow that handles the entire process end-to-end, building a servable machine learning model and publishing it as a Docker container that can be deployed for serving predictions. The pipeline source code, while partly tailored to a specific environment, is available in [45].
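As a rough sketch of what such a pipeline definition looks like in the Kubeflow Pipelines Python SDK (v1-style dsl.ContainerOp syntax), the example below chains two container operations and compiles them into a portable package. The images, commands and step names are hypothetical and greatly simplified compared to the pipeline in [45]:

    import kfp
    from kfp import dsl

    def preprocess_op():
        # One workflow step corresponds to one container operation
        return dsl.ContainerOp(
            name="preprocess",
            image="example/preprocess:1.0",
            command=["python", "preprocess.py"],
            arguments=["--out", "/data/train"],
        )

    def train_op():
        return dsl.ContainerOp(
            name="train",
            image="example/train:1.0",
            command=["python", "train.py"],
            arguments=["--data", "/data/train", "--model", "/data/model"],
        )

    @dsl.pipeline(name="cnn-demo", description="Preprocess data and train a model")
    def cnn_pipeline():
        pre = preprocess_op()
        train_op().after(pre)   # declare execution order between container operations

    if __name__ == "__main__":
        # Compile into a package that can be uploaded to a Kubeflow Pipelines installation
        kfp.compiler.Compiler().compile(cnn_pipeline, "cnn_pipeline.tar.gz")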


3 Discussion

The tools covered in this study have taken slightly different approaches to working with containers, with different implications. Most tools are designed to work with general command-line tools while also supporting containers, except Pachyderm, Argo and Kubeflow, which are Kubernetes-native and only support containers as the means of processing. Workflow tools can almost always work with containers instead of plain command-line tools, using the 'docker run' command and taking advantage of an external scheduling system such as a batch scheduler (e.g. SLURM). When deployed in cloud environments, Kubernetes is the orchestration framework most widely used by the tools, though Galaxy also supports Docker Swarm as well as provides a convenient mapping from the Conda package manager to containers. For Kubernetes, the tools differ in the way they interact with its API; Pachyderm, Argo, Kubeflow Pipelines and SciPipe interact via the Go API, whereas Nextflow, Galaxy and Luigi communicate via the REST API. While these APIs are not identical, the API designs are very similar in terms of data structures and in effect allow for the same level of control. In more detail, the structure of the JSON documents sent to the REST API is closely matched by the hierarchic structure of the 'spec' structs in the Go API. The workflow tools also differ somewhat in the level of flexibility they allow for utilizing the API features, where Luigi is the most versatile as it allows for passing a raw JSON string with full flexibility, but also requires the most skill from workflow authors.

Error management for containerized workflows is an important concept. Kubernetes is designed to restart pods that fail, which is not particularly useful if there is an error reported from the containerized tool. The cloud paradigm includes fault-tolerance, and the tools differ in their strategies. Galaxy, Luigi and SciPipe use Kubernetes Jobs, which allow restarting the pods N times before reporting a failure. Argo and Kubeflow Pipelines run workflow steps as pods that are governed by Argo's custom Kubernetes Workflow resource; as such, errors for individual containers are handled as for normal Kubernetes pods, including error messages detailing reasons for failure etc. Nextflow offers the possibility to automatically tune the job, e.g. to increase the available memory for a job before reporting it as failed.
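To make the Jobs-based strategy concrete, the sketch below submits one containerized step as a Kubernetes Job with a bounded number of retries, using the official Kubernetes Python client for illustration (it wraps the same REST API, and the Go client used by several of the tools exposes an equivalent hierarchy of 'spec' structures). Image, names and paths are hypothetical:

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when running in a pod

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="qc-step"),
        spec=client.V1JobSpec(
            backoff_limit=2,  # restart the pod at most twice before marking the Job as failed
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="qc",
                        image="example/qc-tool:1.0",
                        command=["qc", "--in", "/data/reads.fastq", "--out", "/data/qc.txt"],
                    )],
                )
            ),
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)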
The way in which workflow systems schedule containers on Kubernetes is one factor where they differ from each other. Nextflow implements an actor-like model to handle job requests. Jobs are submitted in a manner similar to an internal job queue; then a separate thread submits a request to the Kubernetes cluster for each of them and periodically checks their status until job completion is detected. Galaxy has two handlers iterating over all running jobs and keeps no separate threads or processes per Kubernetes job. Pachyderm makes direct use of Kubernetes scheduling and uses etcd for storing metadata for data versioning and the status of all jobs. It is also the only system that uses the Kubernetes feature of "parallel jobs". Argo and Kubeflow Pipelines similarly utilize etcd for storing metadata of workflows and executions, and employ a workflow controller component to schedule containers based on the state of deployed workflows. Luigi keeps a thread open and continuously polls the job status. SciPipe keeps a lightweight thread (go-routine) alive for each created job, as the Go API call for creating jobs blocks the calling go-routine until the job finishes or is stopped for other reasons.

Handling many concurrent jobs can become a problem in Luigi/SciLuigi, where a separate Python process is started for each job, which imposes a practical limit on the number of concurrent jobs; in the authors' experience, 64 is a rule-of-thumb upper limit if running on a single node, or else 300. Going above this tends to result in HTTP timeouts, since workers are talking to the central scheduler via HTTP calls over the network.

So what are the main differences between the systems when running containers in cloud environments? Galaxy has a graphical user interface and might be more attractive for users with less expertise in programming/scripting. Workflows in Argo can be submitted and controlled from multi-platform CLIs as well as monitored through a graphical user interface, but can also be controlled by other Kubernetes cluster resources, since workflows are standard custom resource definition (CRD) resources. Kubeflow also provides a GUI as the main method for running pipelines, in addition to being a very portable platform in Kubernetes environments. Kubeflow Pipelines does however have a somewhat steep learning curve, as the Python-based domain-specific language (DSL) requires a good grasp of Kubernetes concepts. Nextflow is arguably the most versatile system for working with containers, being able to operate either on cloud or HPC batch schedulers. Pachyderm, Argo and Kubeflow are built specifically for running on Kubernetes, and Pachyderm has a data-versioning file system built in. Argo and Kubeflow Pipelines support artifacts in workflows and versioning of input and output artifacts through interfacing with an optional S3 storage backend.


Luigi/SciLuigi supports many data sources out of the box, including HDFS, S3, etc., and integrates seamlessly with Apache Hadoop and Spark. SciPipe implements an agile workflow programming API that smoothly integrates with the fast-growing Go programming language ecosystem.

We would like to end with a word of caution regarding the use of containers. While containers have many advantages from a technical perspective and alleviate many of the problems with dependency management, they also come with implications, as they can make tools more opaque and discourage accessing and inspecting the inner workings of containers. This could potentially lead to misuse or a lack of understanding of what a specific tool does. Thus, when wiring together containers in a scientific workflow, proper care needs to be taken that the observed output matches what would be expected from using the tool directly [46]. Also for this reason, we strongly suggest following community best practices for the packaging and containerization of bioinformatics software [47].

4 Declarations

4.1 List of abbreviations

• API: Application programming interface
• AWS: Amazon Web Services
• CI/CD: Continuous integration / continuous deployment
• CRD: Custom resource definition
• DSL: Domain-specific language
• GCP: Google Cloud Platform
• GUI: Graphical user interface
• HPC: High-performance computing
• HTC: High-throughput computing
• NFS: Network file system
• NGS: Next-generation sequencing
• PV: Persistent volume
• PVC: Persistent volume claim
• SDK: Software development kit
• SFTP: SSH file transfer protocol
• UI: User interface
• VM: Virtual machine
• VMI: Virtual machine image
• WMS: Workflow management system

4.2 Funding

This research was supported by the European Commission's Horizon 2020 programme under grant agreement number 654241 (PhenoMeNal) and grant agreement number 731075 (OpenRiskNet), the Swedish Foundation for Strategic Research, the Swedish Research Council FORMAS, the Swedish e-Science Research Centre (SeRC), the Åke Wiberg Foundation, and the Nordic e-Infrastructure Collaboration (NeIC) via the Glenna2 project. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

4.3 Author's Contributions

OS and SL coordinated the project. All authors contributed with experiences on containers and workflow systems as well as manuscript preparation.

4.4 Competing Interests

PDT and EF are founders of Seqera Labs, a company based in Barcelona, Spain, offering commercial support for the (open source) Nextflow software. SL is involved in RIL Partner AB, a Sweden-based company offering commercial support for the (open source) SciPipe software.


References

[1] Vivien Marx. Biology: The big challenges of big data. Nature, 498(7453):255–260, Jun 2013.

[2] Bertil Schmidt and Andreas Hildebrandt. Next-generation sequencing: big data meets high performance computing. Drug Discovery Today, 22(4):712–717, Apr 2017.

[3] Mike May. Big data, big picture: Metabolomics meets systems biology. Science, 356(6338):646–648, 2017.

[4] Vivien Marx. Genomics in the clouds. Nature Methods, 10(10):941–945, September 2013.

[5] Nadia Drake. How to catch a cloud. Nature, 522(7554):115–116, June 2015.

[6] Björn A Grüning, Samuel Lampa, Marc Vaudel, and Daniel Blankenberg. Software engineering for scientific big data analysis. GigaScience, 8(5), May 2019.

[7] Martin Dahlö, Frédéric Haziza, Aleksi Kallio, Eija Korpelainen, Erik Bongcam-Rudloff, and Ola Spjuth. BioImg.org: A Catalog of Virtual Machine Images for the Life Sciences. Bioinformatics and Biology Insights, 9:BBI.S28636, 2015.

[8] Andrew Silver. Software simplified. Nature, 546(7656):173–174, June 2017.

[9] Gregory M. Kurtzer, Vanessa Sochat, and Michael W. Bauer. Singularity: Scientific containers for mobility of compute. PLOS ONE, 12(5):e0177459, May 2017.

[10] Jorge Gomes, Emanuele Bagnaschi, Isabel Campos, Mario David, Luís Alves, João Martins, João Pina, Alvaro López-García, and Pablo Orviz. Enabling rootless linux containers in multi-user environments: The udocker tool. Computer Physics Communications, 232:84–97, November 2018.

[11] Richard Shane Canon and Doug Jacobsen. Shifter: containers for hpc. Proceedings of the Cray User Group, 2016.

[12] Docker Hub. Accessed 20 Jan 2020.

[13] Felipe da Veiga Leprevost, Björn A Grüning, Saulo Alves Aflitos, Hannes L Röst, Julian Uszkoreit, Harald Barsnes, Marc Vaudel, Pablo Moreno, Laurent Gatto, Jonas Weber, Mingze Bai, Rafael C Jimenez, Timo Sachsenberg, Julianus Pfeuffer, Roberto Vera Alvarez, Johannes Griss, Alexey I Nesvizhskii, and Yasset Perez-Riverol. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics, 33(16):2580–2582, March 2017.

[14] Reem Almugbel, Ling-Hong Hung, Jiaming Hu, Abeer Almutairy, Nicole Ortogero, Yashaswi Tamta, and Ka Yee Yeung. Reproducible bioconductor workflows using browser-based interactive notebooks and containers. Journal of the American Medical Informatics Association, 25(1):4–12, October 2017.

[15] Heru Suhartanto, Agung P Pasaribu, Muhammad F Siddiq, Muhammad I Fadhila, Muhammad H Hilman, and Arry Yanuar. A preliminary study on shifting from virtual machine to docker container for insilico drug discovery in the cloud. International Journal of Technology, 8(4):611, July 2017.

[16] Ling-Hong Hung, Daniel Kristiyanto, Sung Bong Lee, and Ka Yee Yeung. GUIdock: Using docker containers with a common graphics user interface to address the reproducibility of research. PLOS ONE, 11(4):e0152686, April 2016.

[17] Baekdoo Kim, Thahmina Ali, Carlos Lijeron, Enis Afgan, and Konstantinos Krampis. Bio-docklets: virtualization containers for single-step execution of NGS pipelines. GigaScience, 6(8), June 2017.

[18] Wade L Schulz, Thomas Durant, Alexa J Siddon, and Richard Torres. Use of application containers and workflows for genomic data analysis. Journal of Pathology Informatics, 7(1):53, 2016.

[19] Kubernetes Project. Accessed 20 Jan 2020.

[20] Docker Swarm. Accessed 20 Jan 2020.

[21] Apache Mesos. Accessed 20 Jan 2020.


[22] Paolo Di Tommaso, Maria Chatzou, Evan W Floden, Pablo Prieto Barja, Emilio Palumbo, and Cedric Notredame. Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4):316–319, April 2017.

[23] Daniel Blankenberg, Gregory Von Kuster, Emil Bouvier, Dannon Baker, Enis Afgan, Nicholas Stoler, James Taylor, and Anton Nekrutenko. Dissemination of scientific software with galaxy ToolShed. Genome Biology, 15(2):403, 2014.

[24] Enis Afgan, Dannon Baker, Bérénice Batut, Marius van den Beek, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Björn A Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Saskia Hiltemann, Vahid Jalili, Helena Rasche, Nicola Soranzo, Jeremy Goecks, James Taylor, Anton Nekrutenko, and Daniel Blankenberg. The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, 46(W1):W537–W544, May 2018.

[25] C. Sloggett, N. Goonasekera, and E. Afgan. BioBlend: automating pipeline analyses within galaxy and CloudMan. Bioinformatics, 29(13):1685–1686, April 2013.

[26] Kristian Peters, James Bradbury, Sven Bergmann, Marco Capuccini, Marta Cascante, Pedro de Atauri, Timothy M D Ebbels, Carles Foguet, Robert Glen, Alejandra Gonzalez-Beltran, Ulrich L Günther, Evangelos Handakas, Thomas Hankemeier, Kenneth Haug, Stephanie Herman, Petr Holub, Massimiliano Izzo, Daniel Jacob, David Johnson, Fabien Jourdan, Namrata Kale, Ibrahim Karaman, Bita Khalili, Payam Emami Khonsari, Kim Kultima, Samuel Lampa, Anders Larsson, Christian Ludwig, Pablo Moreno, Steffen Neumann, Jon Ander Novella, Claire O'Donovan, Jake T M Pearce, Alina Peluso, Marco Enrico Piras, Luca Pireddu, Michelle A C Reed, Philippe Rocca-Serra, Pierrick Roger, Antonio Rosato, Rico Rueedi, Christoph Ruttkies, Noureddin Sadawi, Reza M Salek, Susanna-Assunta Sansone, Vitaly Selivanov, Ola Spjuth, Daniel Schober, Etienne A Thévenot, Mattia Tomasoni, Merlijn van Rijswijk, Michael van Vliet, Mark R Viant, Ralf J M Weber, Gianluigi Zanetti, and Christoph Steinbeck. PhenoMeNal: processing and analysis of metabolomics data in the cloud. GigaScience, 8(2), 12 2018. giy149.

[27] Pablo Moreno, Luca Pireddu, Pierrick Roger, Nuwan Goonasekera, Enis Afgan, Marius van den Beek, Sijin He, Anders Larsson, Daniel Schober, Christoph Ruttkies, David Johnson, Philippe Rocca-Serra, Ralf JM Weber, Björn Gruening, Reza M Salek, Namrata Kale, Yasset Perez-Riverol, Irene Papatheodorou, Ola Spjuth, and Steffen Neumann. Galaxy-kubernetes integration: scaling bioinformatics workflows in the cloud. bioRxiv, December 2018.

[28] Björn Grüning, Ryan Dale, Andreas Sjödin, Brad A. Chapman, Jillian Rowe, Christopher H. Tomkins-Tinch, Renan Valieris, Johannes Köster, and The BioConda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7):475–476, July 2018.

[29] Pachyderm. Accessed 20 Jan 2020.

[30] Jon Ander Novella, Payam Emami Khoonsari, Stephanie Herman, Daniel Whitenack, Marco Capuccini, Joachim Burman, Kim Kultima, and Ola Spjuth. Container-based bioinformatics with pachyderm. Bioinformatics, 35(5):839–846, August 2018.

[31] Luigi. Accessed 20 Jan 2020.

[32] Samuel Lampa, Jonathan Alvarsson, and Ola Spjuth. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. Journal of Cheminformatics, 8(1), November 2016.

[33] Christina Ranninger, Lukas Schmidt, Marc Rurik, Alice Limonciel, Paul Jennings, Oliver Kohlbacher, and Christian Huber. MTBLS233: Improving global feature detectabilities through scan range splitting for untargeted metabolomics by high-performance liquid chromatography-Orbitrap mass spectrometry. Accessed 20 Jan 2020.


[34] Marco Capuccini, Anders Larsson, Matteo Carone, Jon Ander Novella, Noureddin Sadawi, Jianliang Gao, Salman Toor, and Ola Spjuth. On-demand virtual research environments using microservices. PeerJ Computer Science, 5:e232, November 2019.

[35] MTBLS233 with PhenoMeNal Jupyter. Accessed 20 Jan 2020.

[36] Samuel Lampa, Martin Dahlö, Jonathan Alvarsson, and Ola Spjuth. SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines. GigaScience, 8(5), April 2019.

[37] Samuel Lampa, Jonathan Alvarsson, Staffan Arvidsson Mc Shane, Arvid Berg, Ernst Ahlberg, and Ola Spjuth. Predicting off-target binding profiles with confidence using conformal prediction. Frontiers in Pharmacology, 9, November 2018.

[38] The Kubernetes Community (2016). Go client for kubernetes. Accessed 16 Jan 2018.

[39] Hannes L Röst, Timo Sachsenberg, Stephan Aiche, Chris Bielow, Hendrik Weisser, Fabian Aicheler, Sandro Andreotti, Hans-Christian Ehrlich, Petra Gutenbrunner, Erhan Kenar, Xiao Liang, Sven Nahnsen, Lars Nilse, Julianus Pfeuffer, George Rosenberger, Marc Rurik, Uwe Schmitt, Johannes Veit, Mathias Walzer, David Wojnar, Witold E Wolski, Oliver Schilling, Jyoti S Choudhary, Lars Malmström, Ruedi Aebersold, Knut Reinert, and Oliver Kohlbacher. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nature Methods, 13(9):741–748, August 2016.

[40] Samuel Lampa. OpenMS SciPipe workflow example. Accessed 16 Jan 2018.

[41] The Argo Authors (2019). Argo project. Accessed 14 Oct 2019.

[42] The Etcd Community (2019). Distributed reliable key-value store for the most critical data of a distributed system. Accessed 14 Oct 2019.

[43] Kubeflow. Accessed 20 Jan 2020.

[44] Alexander Kensert, Philip J. Harrison, and Ola Spjuth. Transfer learning with deep convolutional neural networks for classifying cellular morphological changes. SLAS DISCOVERY: Advancing Life Sciences R&D, 24(4):466–475, January 2019.

[45] Alexander Kensert. CNN example pipeline. Accessed 20 Jan 2020.

[46] Konrad Hinsen. Verifiability in computer-aided research: the role of digital scientific notations at the human-computer interface. PeerJ Computer Science, 4:e158, July 2018.

[47] B Gruening, O Sallou, P Moreno, F da Veiga Leprevost, H Ménager, D Søndergaard, H Röst, T Sachsenberg, B O'Connor, F Madeira, V Dominguez Del Angel, MR Crusoe, S Varma, D Blankenberg, RC Jimenez, and Y Perez-Riverol. Recommendations for the packaging and containerizing of bioinformatics software [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research, 7(742), 2019.
