METHODS ENABLING PORTABILITY OF SCIENTIFIC WORKFLOWS

A Dissertation

Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

by Nicholas Hazekamp

Douglas Thain, Director

Graduate Program in Computer Science and Engineering

Notre Dame, Indiana

December 2019

© Copyright by Nicholas Hazekamp 2019

All Rights Reserved

METHODS ENABLING PORTABILITY OF SCIENTIFIC WORKFLOWS

Abstract

by Nicholas Hazekamp

Scientific workflows are common and powerful tools used to elevate small-scale analysis to large-scale distributed computation. They provide ease of use for domain scientists by supporting the use of applications as they are, partitioning the data for concurrency instead of the application. However, many of these workflows are written in a way that couples the scientific intention with the specificity of the execution environment. This coupling limits the flexibility and portability of the workflow, requiring the workflow to be re-engineered for each new dataset or site.

I propose that workflows can be written for pure scientific intent, with the idiosyncrasies of execution resolved at runtime using workflow abstractions. These abstractions allow workflows to be quickly transformed for different configurations, specifically handling new datasets, diverse sites, and different configurations. I examine three methods for developing workflow abstractions on static workflows, apply these methods to a dynamic workflow, and propose an approach that separates the user from the distributed environment.

In developing these methods for static workflows I first explored Dynamic Workflow Expansion, which allows workflows to be quickly adapted for new and diverse datasets. I then describe an algorithm for statically determining a workflow's storage needs, which is used at runtime to prevent storage deadlocks. Finally, I develop an algebra for transforming workflows, which isolates site- and configuration-specific designs so they can be applied to workflows as needed. These methods were combined and applied to a dynamic workflow, adapting a site-bound MPI application into a dynamic cloud workflow.

I combined these methods and formulated the Continuously Divisible Jobs abstraction to separate the domain scientist's application from the distributed logic of a dynamic workflow. This abstraction defines an API which applications can implement to allow for dynamic distributed computation, showcasing the flexibility and portability provided through workflow abstractions.

CONTENTS

Figures

Tables

Acknowledgments

Chapter 1: Introduction
  1.1 Scientific Workflows
  1.2 Methods for Improving Workflow Flexibility
  1.3 Overview of this Dissertation
  1.4 Example Applications

Chapter 2: Related Work
  2.1 Static Workflow Systems
    2.1.1 Workflow Specification Languages
  2.2 Dynamic Workflow Systems
  2.3 Batch Computing Systems
  2.4 Resource Provisioning and Management
    2.4.1 Storage Management
  2.5 Environment Re-Creation
  2.6 Makeflow and Work Queue

Chapter 3: Dynamic Workflow Expansion
  3.1 Introduction
  3.2 Galaxy
  3.3 Dynamic Job Expansion
  3.4 Example Application
  3.5 Design Considerations
    3.5.1 Dependency Management
    3.5.2 Remote Execution
    3.5.3 Garbage Collection
  3.6 Opportunities
    3.6.1 Expression of Job Status
    3.6.2 Checkpoints and Partial Failure
  3.7 Evaluation
  3.8 Conclusion

Chapter 4: Static Analysis and Dynamic Management of Workflow Storage
  4.1 Introduction
  4.2 Definition of Files and Storage in Workflows
  4.3 Storage Management Components
  4.4 The Storage Footprint
  4.5 Example Workflows
  4.6 Static Analysis Algorithm
    4.6.1 Limitations
  4.7 Dynamic Storage Management
    4.7.1 Dynamic Storage Algorithm
    4.7.2 Worked Example
    4.7.3 Impact of Local Storage
  4.8 Overall Evaluation
    4.8.1 Configuration
  4.9 Conclusions

Chapter 5: An Algebra for Robust Workflow Transformations
  5.1 Introduction
  5.2 Challenges in Transforming Workflows
  5.3 An Algebra of Workflow Transformations
    5.3.1 Notation
    5.3.2 Semantics
    5.3.3 Transformations as Functions
    5.3.4 Applying the Sandbox Model
  5.4 Transformations in Practice
    5.4.1 Composability versus Commutability
    5.4.2 Command Description
    5.4.3 File List Management
    5.4.4 Resource Provisioning
    5.4.5 Environment Elaboration
  5.5 Applications of Transformations
    5.5.1 Sandbox Transform
    5.5.2 Container Transform
    5.5.3 Resource Monitoring Transform
    5.5.4 Environment Transform
    5.5.5 Failure Handling Transform
  5.6 Case Studies
    5.6.1 Resource Usage in a Container
    5.6.2 Failure Analysis
    5.6.3 Complex Software Configuration
  5.7 Conclusion

Chapter 6: Applying Static Techniques to Dynamic Workflows
  6.1 Introduction
  6.2 Jetstream
  6.3 Portable Reproducible Environments
    6.3.1 Machine Images
    6.3.2 Container Images
    6.3.3 Deployment Services
    6.3.4 VC3
    6.3.5 Deploying MAKER
  6.4 Scalability
    6.4.1 MAKER's MPI Behavior
    6.4.2 WQ-MAKER
    6.4.3 Scaling Up vs Scaling Out
  6.5 Exposing Execution Feedback
    6.5.1 Clean Environment Builds
    6.5.2 Deploying Workers
    6.5.3 Evaluate Performance
    6.5.4 Diagnosing Errors
  6.6 Evaluation
  6.7 Conclusion

Chapter 7: Continuously Divisible Jobs
  7.1 Introduction
  7.2 Contributions
  7.3 Challenges
  7.4 Continuously Divisible Jobs
    7.4.1 Operations
    7.4.2 Abstract Jobs
    7.4.3 Job Coordinators
    7.4.4 Design Considerations
      7.4.4.1 File Partitioning
      7.4.4.2 Job Namespaces
      7.4.4.3 Execution Sandbox
      7.4.4.4 Job Ordering
  7.5 Virtual File Abstraction
    7.5.1 Operations on Virtual Files
      7.5.1.1 Indexing
      7.5.1.2 Partitioning
      7.5.1.3 Index Look-up
      7.5.1.4 Instantiate
      7.5.1.5 Serialization
  7.6 Implementation
    7.6.1 Example Continuously Divisible Job Application
    7.6.2 Virtual Files
  7.7 Results
    7.7.1 Dynamic Sizing
    7.7.2 Virtual File Effectiveness
    7.7.3 Tiered Sizing
  7.8 Conclusion

Chapter 8: Conclusions and Broader Impact
  8.1 Recapitulation
  8.2 Future Work
    8.2.1 Continuously Divisible Jobs
    8.2.2 Nested Workflows and Transformations
  8.3 Impact: Software and Publications

Bibliography

FIGURES

1.1 Example static workflow
1.2 Partitioning's effect on BWA execution
1.3 Map of dissertation
3.1 The dynamic job expansion process
3.2 Detail of the example application
3.3 Results of the BWA-GATK workflow on various datasets
3.4 Histogram of the sizes of dynamically expanded workflows
3.5 Scatter plot of BWA-GATK usage in Galaxy
4.1 File relationship between workflow and task
4.2 Three example workflows
4.3 Worked storage example DAG
4.4 Dynamic allocation data structure
4.5 Storage limits applied to example workflows
5.1 An example workflow
5.2 Basic task defined using JSON
5.3 The sandbox model of task execution
5.4 JSON defining an abstract Singularity transformation
5.5 Resulting task of applying Singularity to T1
5.6 Script created when evaluating Singularity(T1)
5.7 Verbose JSON object file specification
5.8 General approach to the sandbox model of execution
5.9 Evolution of a task as transformations are applied
5.10 Histogram of task execution with nested transformations
5.11 JSON showing the stack trace transformation
5.12 JSON showing the VC3-Builder transformation
6.1 MAKER, MPI MAKER, and WQ-MAKER models
6.2 Comparison of methods on the Fungal (41MB) dataset
6.3 Comparison on the partial Hummingbird (900MB) dataset
6.4 Performance of Saguaro cactus genome (1.6GB) annotation using WQ-MAKER on Condor
7.1 Effect of partitioning on BWA execution
7.2 Diagram of the Continuously Divisible Job architecture
7.3 Capabilities of the Continuously Divisible Job abstraction
7.4 File usage of normal and virtual file implementations
7.5 Performance of static partitions, Continuously Divisible Job, and Continuously Divisible Job using virtual files
7.6 Performance of ROOT and SQL dimuon detection methods
7.7 Performance of virtual files with static partitions
7.8 Comparison of tiered performance for several configurations

TABLES

3.1 Dataset Size and Task per Worker Reference
4.1 Worked Storage Example Variables
4.2 Static Analysis Results
5.1 Comparison of Core-Dump and Stack Trace Data
5.2 MAKER Build-times using VC3-Builder on Various Sites
6.1 Genomes Sequenced with MAKER on Jetstream
6.2 Comparison of MAKER Methods Performance

ACKNOWLEDGMENTS

First and foremost, I want to thank Dr. Douglas Thain, my advisor, who always had the time to discuss the big ideas. Dr. Thain has always pushed me to ask the right questions, verify my results, and take pride in the solutions we create. I deeply appreciate all of the opportunities offered to me through my time with Dr. Thain, everything from teaching tutorials and classes, to traveling for conferences, to having a graduate experience where my time was valued.

I also want to thank:

Dr. Scott Emrich, who has been a mentor and supporter of my research since my graduate career began. Dr. Emrich has always provided helpful insight into the many bioinformatics projects I undertook, and has made an effort to keep in touch and stay apprised of my progress since moving to UTK.

Dr. Nirav Merchant, who has provided a wealth of work through our collaborations. I have appreciated the enthusiasm you bring to discussing different problem solutions and your willingness to find time for our remote discussions.

Dr. Jarek Nabrzyski, Dr. Paul Brenner, and the Center for Research Computing, for supporting all of the computing I could attempt. Even though I have not always been an ideal computing tenant, the CRC finds the time to help find a solution or discuss different approaches.

Dr. Aaron Striegel, who helped guide my early graduate career with Operating Systems, where I learned how to read and discuss research.

Dr. Jakob Blomer and the CVMFS team, who supported my work and collaboration on the Shrinkwrap project, as well as a wonderful summer in Geneva, which I will always cherish.

Dr. Ben Tovar, who has been a great sounding board for ideas and programmatic approaches through every project. Ben has helped mold me into a decent programmer, having spent patient afternoons walking through some large and messy pull requests.

Nate Kremer-Herman and Tim Shaffer, who are always willing to talk through ideas. The honest and upfront approach with which you both discuss ideas, whether related to research or not, has provided helpful feedback and refreshingly candid conversations. I am only sad that our future lunches are limited and I will have no more conferences to experience with you both. Thanks for keeping it interesting.

Charles Zheng, Peter Ivie, Haiyan Meng, Patrick Donnelly, and the whole Cooperative Computing Lab, whose many collaborations helped give me a broader view of distributed computing. Throughout my career, the many members of the CCL, both past and present, have been instrumental in keeping me aware of the field and what talents we all bring to the table.

Joyce Yeats and the Computer Science administrative staff, without whose help I would never have been able to get a degree, buy a house, or find answers to innumerable questions.

My parents, Jeff and Gwen Hazekamp, who have supported me through all of my endeavors, academic and otherwise. Their wonderful support and the occasional 'shouldn't this be done already' have motivated me to finish.

And lastly, as a constant source of encouragement, I thank my wife, Laura Hazekamp. Laura has helped and supported me through my whole graduate career, motivating me to finally be done. Despite the concerns of my lengthening graduate program, she resisted the urge to press, remaining steadfast in support.

CHAPTER 1

INTRODUCTION

1.1 Scientific Workflows

With the increased demands of scientific computing, there is a constant search for improved ways to harness available resources from clusters, clouds, and grids. A variety of fields are experiencing a boom in data production, and there is an increased need for user-friendly approaches to process this rapidly growing data. Workflows are a main avenue scientists are utilizing to exploit these resources for analysis[118], with examples in a number of fields such as bioinformatics[4, 12, 48, 50, 55, 78], high energy physics[35, 124], and astronomy[36]. Workflows are used to organize a set of tasks describing the transformation from inputs into outputs. Scientific workflows are advantageous since they can be written once and do not need to be tailored or compiled for the specific resource chosen. Well-organized workflows can be scaled to any amount of data produced and leverage resources as required.

Users working to process large-scale data find workflows provide a tractable solution. Workflows are designed with a specific application or processing pipeline in mind, based on the developer's experience and data. Workflows embed many assumptions from the developer, such as the data, execution site, and configuration. Common cases include assumptions like an adequate partition size for efficient resource usage, the type or number of resources being provisioned, or even the options passed to the underlying application. If these assumptions do not change, the initial workflow will perform consistently on the desired resources.

Workflows designed for a specific purpose are hard coded. Using workflows designed for other configurations requires updating them, and even in easily configurable workflows, changes can be time consuming and their effects may be unknown. Users adapting existing scientific workflows need to understand the tunable variables and how these variables relate to their data. The process of finding a good configuration may be difficult, including determining partition granularity, managing several workflows at once, and providing mechanisms to adjust for new sites or configurations.

I will define workflows in two categories: static, with systems such as Pegasus[38], Kepler[3], and DAGMan[31]; and dynamic, with systems such as Parsl[10], Work Queue[16], and SWIFT[121]. Static workflows make one-time bulk partitioning decisions on the data. Data is partitioned and then operated on concurrently. Static workflows enable more complex or strict scheduling and resource handling because their a priori knowledge of tasks and dependencies can be used. Dynamic workflows partition data on demand to process large data sets or iterate over many parameters. The dynamic nature of partitioning allows workflows to converge on results, use available resources, or modify the search space and configurations. Dynamic workflows allow for more complex computational loops, but require a higher level of understanding of the analysis and workflow computing to perform efficiently.

In static workflows, the partitioning of the data set is determined by a static partition size or number of partitions desired. When static sizes are used, changes to data structure or density affect performance. This is exacerbated when the number of partitions is static, as concurrency does not track with increased data size. Using bioinformatics as an example, the difference between genome contigs and large scaffolds, hundreds of base pairs versus thousands respectively, introduces an order of magnitude difference in execution time and resource needs. Inappropriate partitioning can result in two scenarios: under- and over-sized partitions. Under-sized partitions lead to increased overhead, limited concurrency (a result of shorter tasks), and lower utilization of resources (from the limited bandwidth of the coordinator). Over-sized partitions lead to limited resource capacity (from a lack of available partitions), more expensive fault tolerance (from reduced granularity), and larger resource requirements to accommodate execution.

Figure 1.1. Diagram showing an example static workflow. A typical static workflow makes an initial static partitioning decision, executes an application on each partition, and joins the results.

To show a simple illustration of this problem I varied the partition size of a BWA workflow, a bioinformatics alignment application. The BWA workflow takes a query and a reference genome, partitions the query genome, and performs alignments between the query partitions and the reference. This BWA workflow maps cleanly to the example workflow shown in Figure 1.1. Figure 1.2 shows the effect partitioning has on the execution time of BWA. This result measures execution time for different numbers of partitions of the query genome sequence. The log-scale graph shows a valley where performance is reasonable, with many sizes yielding acceptable performance. At the edges of the graph we see orders of magnitude longer execution as a result of under- and over-sizing.

Figure 1.2. Partitioning's effect on BWA execution. This is a sample bioinformatics (BWA) workflow's performance with the input partition size varied. The number of partitions was varied from 10 to 100,000, using a statically sized query of 1,000,000 sequences.
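To make the static approach concrete, the sketch below shows how such a split-map-join driver might be written: the query is cut into fixed-size chunks and one alignment command is generated per chunk. The chunk size of 10,000 reads, the file names, and the exact bwa invocation are illustrative assumptions, not the scripts used in this dissertation.

```python
import os
from itertools import islice

def split_fastq(query_path, chunk_size, out_dir="parts"):
    """Statically partition a FASTQ query into fixed-size chunks.

    Each FASTQ record is 4 lines; chunk_size is the number of reads
    per partition. Returns the list of partition file names.
    """
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    with open(query_path) as fq:
        while True:
            records = list(islice(fq, 4 * chunk_size))
            if not records:
                break
            part = os.path.join(out_dir, "query.%04d.fastq" % len(parts))
            with open(part, "w") as out:
                out.writelines(records)
            parts.append(part)
    return parts

def make_tasks(reference, parts):
    """Emit one alignment command per partition (the 'map' stage)."""
    return ["bwa mem %s %s > %s.sam" % (reference, p, p) for p in parts]

if __name__ == "__main__":
    # The partition size is the decision baked into the workflow up front.
    parts = split_fastq("query.fastq", chunk_size=10000)
    for cmd in make_tasks("reference.fa", parts):
        print(cmd)   # hand each command to the batch system of choice
```

The important point is that the partitioning decision is fixed before execution begins; choosing it poorly produces the under- or over-sized behavior seen in Figure 1.2.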

As a result of partitioning decisions, resources are acquired (or intermittently needed) beyond the scope of the initial workflow state and are used concurrently with the resources needed to execute the workflow. This means that additional storage is needed for intermediate and partially completed tasks, resources are needed for managing and retrieving external data, and out-of-workflow resources such as databases and servers must be managed. Persistent usage of storage provides a clear illustration: additional storage accrued during execution leads to execution limitations and failures (i.e., deadlocking the workflow for lack of storage space). This problem is exacerbated as users run multiple workflows concurrently with no control or coordination mechanism.

Workflows that are properly sized and have resources appropriately allocated indicate that the workflow was designed with a specific site, resource, and configuration in mind. As a user looks to change these expectations, they are left to change the underlying workflow to reflect the desired environment. Containers serve as a premier example, as they are both commonly used in scientific computing and delivered in a number of ways, such as Docker[89] and Singularity[74]. Containers are used in scientific workflows to replace the underlying environment configuration. Existing workflows can be modified to build the container into the existing tasks,

but this creates new dependencies and does not pertain to the original intent of the scientific workflow. If the workflow is modified for a specific container, the new workflow may only be suitable for sites using that container engine, requiring updates for future or different technologies. The difficulty compounds with each additional change needed, resulting in tasks with unwieldy or indefinite definitions.

I will demonstrate several methods for increasing the flexibility and portability of workflows. These methods include cleanly expanding static workflows, tracking the resources required, and transforming tasks and workflows for new configurations. These methods are then applied to a dynamic workflow as proof of wider applicability. Finally, I will define the Continuously Divisible Jobs abstraction for bridging the simplicity of static definition with the flexibility of dynamic execution.

Most workflows are designed with the intention of pure scientific analysis, then cluttered with site and configuration specific implementations. Incorporating methods for abstracting the execution specifics from the scientific intention allows sound scientific workflows to be used more flexibly; adapting to new data, sites, and configurations without changing the underlying analysis.

1.2 Methods for Improving Workflow Flexibility

I will be examining methods for improving the flexibility and portability of workflows, first by developing methods for static workflows and then by applying them to dynamic workflows. The three main areas of focus are partitioning and expanding workflows, analyzing and managing sustained resource needs, and transforming existing workflows to accommodate new requirements. Though these are not exhaustive when looking at improving flexibility, they tackle several of the main concerns encountered by scientific users. The culmination of these methods will be presented through the Continuously Divisible Jobs abstraction, a novel approach to workflow design.

Dynamic Workflow Expansion allows a template workflow to be defined and expanded at runtime. It uses static partition sizes and known application behavior to expand applications into workflows on target systems with limited user interaction. Dynamic Workflow Expansion has three objectives: expand the workflow behind a user interface to reduce the knowledge required of the user, make partitioning decisions that target the software and site of deployment, and deliver results comparable to normal execution. This allows the user to seamlessly employ the application workflow in a larger analysis pipeline.

Workflow Resource Management uses static analysis of the workflow to determine its expected size, then employs this knowledge for runtime management of the workflow. This concept relies on the static nature of directed acyclic graph (DAG) workflows to analyze the full workflow, estimating the need for and lifetime of resources. One method uses strict scheduling of the DAG to limit overall size and overlapping resource usage. That method requires knowledge of resource requirements and accurate execution times to execute concurrently and effectively; such knowledge is difficult to have a priori, and incorrect estimates reduce overall performance significantly. The method developed here relaxes the expected knowledge of execution time, relying solely on resources and allowing logical constraints to be placed on the DAG without strict scheduling. The workflow can utilize the maximum concurrency without over-committing resources that will be needed later. The underlying design statically analyzes task footprints to determine the allocated resources and how long they are committed. This analysis is used at runtime to determine whether an execution branch can reserve the requisite resources, allowing concurrency within specified limits.
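As a rough illustration of the static analysis idea (not the algorithm developed in Chapter 4), the sketch below estimates the peak storage of a DAG executed serially in a given topological order, freeing each file once its last consumer has completed. The node and file names in the toy workflow are hypothetical.

```python
def storage_footprint(nodes, file_size):
    """Estimate peak workflow storage for one serial execution order.

    nodes: list of (inputs, outputs) tuples in topological order.
    file_size: dict mapping file name -> size.
    A file is freed once the last node that reads it has completed.
    """
    last_use = {}
    for i, (inputs, _) in enumerate(nodes):
        for f in inputs:
            last_use[f] = i

    # Files that are never produced by a node are workflow inputs,
    # so they occupy space from the start.
    produced = {f for _, outputs in nodes for f in outputs}
    stored = {f for inputs, _ in nodes for f in inputs if f not in produced}
    current = sum(file_size[f] for f in stored)
    peak = current

    for i, (inputs, outputs) in enumerate(nodes):
        for f in outputs:                  # outputs appear when the node runs
            if f not in stored:
                stored.add(f)
                current += file_size[f]
        peak = max(peak, current)
        for f in inputs:                   # free files with no later consumer
            if last_use.get(f) == i and f in stored:
                stored.remove(f)
                current -= file_size[f]
    return peak

# Toy split-process-join workflow: split -> two apps -> join.
nodes = [
    (["input"], ["p1", "p2"]),
    (["p1"], ["o1"]),
    (["p2"], ["o2"]),
    (["o1", "o2"], ["output"]),
]
sizes = {"input": 4, "p1": 2, "p2": 2, "o1": 2, "o2": 2, "output": 4}
print(storage_footprint(nodes, sizes))   # peak storage for this order
```

A runtime manager can compare such an estimate against the space actually available before committing a branch of the workflow to execution.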

Workflow Transformations allow workflows to be dynamically adapted for new requirements without changing the underlying workflow. Changing the size and parity of the data while running concurrently requires re-creating the necessary environment for task execution. As new resources or execution sites are adopted, the originally expected environment needs to be distributed along with the work. This environment extends beyond the simple definition of task inputs and includes information not pertinent to the scientific analysis but necessary for execution. This approach uses a strict task sandbox to outline a task definition and how tasks are transformed for new environments. It is demonstrated using the base case of a single task: capturing the task's behavior and then applying several task transformations to adapt it to new configurations.

As outlined above, these three methods are used to increase the flexibility of static workflows, but they can also benefit dynamic workflows. This is explored by applying the lessons learned to dynamic workflows, examining how to transform tasks to accommodate complex software configurations, manage dynamic task sizing, and provide clear tools with feedback so the user can make informed decisions about resource allocation.

The Continuously Divisible Jobs abstraction is based on the methods and lessons learned from both static and dynamic workflows. It emphasizes the importance of separating the application mechanism, which the user understands, from the execution policy, which relies on distributed computing knowledge. An abstraction is needed for creating static applications that enable dynamic partitioning and execution. The proposed solution, Continuously Divisible Jobs, defines a simple API that allows an application to encode how its data is manipulated and its analysis executed. This API is then used by job coordinators to enable different partitioning models, scheduling methods, and execution systems. With Continuously Divisible Jobs, decisions about how to handle data can be made based on the underlying systems and abstracted away from the application level, allowing users to focus on writing applications while job coordinators allocate available resources.
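A minimal sketch of what such an interface might look like is shown below. The method names and the toy coordinator are illustrative assumptions; the actual operations, abstract jobs, and job coordinators are defined in Chapter 7.

```python
from abc import ABC, abstractmethod

class DivisibleJob(ABC):
    """Hypothetical sketch of a Continuously Divisible Job interface.

    The application encodes only the mechanism: how its data slice is
    split, how a slice is executed, and how results are joined. A job
    coordinator decides when to split, where to run, and how to join.
    """

    @abstractmethod
    def split(self, pieces):
        """Divide this job's data slice into `pieces` smaller jobs."""

    @abstractmethod
    def execute(self):
        """Run the application on this job's data slice and return a result."""

    @abstractmethod
    def join(self, results):
        """Combine the results of previously split jobs into one result."""


def coordinate(job, workers):
    """A toy coordinator: split once to match the workers, run, and join."""
    subjobs = job.split(pieces=len(workers))
    results = [sub.execute() for sub in subjobs]   # could be dispatched remotely
    return job.join(results)
```

The point of the separation is that the same application class can be driven by very different coordinators, from a local loop as above to a distributed master-worker system, without any change to the application code.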

1.3 Overview of this Dissertation

Each chapter is briefly outlined below, including an outline of the conclusion. A pictorial overview is shown in Figure 1.3. Chapters are labeled numerically on the overview, with a corresponding description in the caption.

Chapter 2 focuses on background and related work, exploring existing technologies and prior research. Specifically, this chapter looks at existing work in scientific workflows, differentiating between static and dynamic workflows. It explores existing work in data partitioning, resource provisioning, and storage analysis. Next, it highlights research related to different approaches for capturing, replicating, and providing environments. I then delve more in depth into the software used throughout this research, including examples of static and dynamic workflow systems.

Chapter 3 focuses on dynamic workflow expansion. Dynamic workflow expansion is explored to create BWA and GATK workflows that are expanded behind the Galaxy user interface. The goal of dynamic job expansion is to capture a logical workflow created by the user and expand each step into a workflow of concurrent tasks. Relying on a model of the software's behavior, dynamic job expansion accelerates performance with minimal user interaction. This chapter focuses on alleviating the upfront cost from a user's perspective, while avoiding under- and over-sizing of partitions in a target system.

Chapter 4 focuses on static and dynamic resource management, which analyzes the entire workflow to prevent deadlock from storage exhaustion. The algorithm developed analyzes a static DAG and reports the estimated minimum and maximum storage needed by a workflow. This data is then used for dynamic workflow management, which accounts for allocated storage space. This technique is able to consistently prevent deadlock and safely execute a workflow within a user-specified limit. This chapter focuses on the analysis and management of intermittent and accumulated resource needs of scientific workflows.

Chapter 5 focuses on workflow transformations. Workflow transformations aim to address the problems that arise in accurately executing partitioned work on different resources. I explore how to model and then transform workflows to provide a consistent, correct task execution environment. This is accomplished by defining a model of isolation using task sandboxes. Using strictly defined task sandboxes, tasks and transformations can be arbitrarily nested, allowing for flexible task and workflow handling.

Chapter 6 focuses on applying static methods to dynamic workflows. Key challenges in this work included providing the correct environment for a dynamic partition, handling complex input and output setups, and scaling using both external concurrency and internal parallelism. Using WQ-MAKER, the local parallel performance using MPI was compared with the remote concurrent performance using Work Queue, a master-worker framework. Following this, the Work Queue implementation was used to analyze several larger datasets to show the continued scalability of the framework and the flexibility of the applied methods.

Chapter 7 focuses on an abstraction for bridging static and dynamic workflows. It introduces the Continuously Divisible Jobs model, which defines an API for applications that is used by job coordinators for execution. The goal is to separate the mechanism of the application from the execution on a platform, using the API and job coordinators respectively, leveraging the flexibility of dynamic workflows to avoid bad configurations while using static application definitions for easier creation. This work illustrates the need for Virtual Files to abstract logical data partitioning from physical data movement. I show that by using Continuously Divisible Jobs in conjunction with Virtual Files, execution consistently outperforms static workflows.

Chapter 8 concludes by evaluating the overall effectiveness and applicability of the different methods defined in this dissertation, reiterating the contributions as they apply to workflows and looking more broadly at how these methods apply to scientific computing in general. I then briefly discuss the software that was created in the process of completing this work, the peer-reviewed publications that resulted, and the broader impacts of this work.

1.4 Example Applications

I will now outline several applications used consistently throughout this dissertation's research. Each application is briefly introduced along with an outline of its typical behavior. These applications provide examples of behavior seen in a number of fields, such as bioinformatics and high-energy physics.

BWA[77] employs the Burrows-Wheeler Transform algorithm to align genome queries. BWA is a lightweight alignment tool that supports paired-end mapping, gapped alignment, and various file formats like ILLUMINA [30] and ABI SOLiD [53]. The output format is SAM (Sequence Alignment Map), which can be analyzed using a number of tools such as GATK and the SAMtools package [79]. In related work[27] we observed that the runtime of BWA is roughly proportional to the product of the reference size and the input size. Partitioning the reference is possible, but it increases the complexity of joining and negatively impacts the benefit of cached files. The joining phase of the BWA workflow separates the header and content of each output, generating a single result with one header and the concatenated alignments.
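A minimal sketch of that header-aware join is shown below, assuming plain-text SAM partition outputs with hypothetical file names: a single copy of the header is kept from the first partition, and the alignment records of every partition are concatenated.

```python
def join_sam(partition_outputs, merged_path):
    """Concatenate per-partition SAM outputs into one result.

    Header lines (starting with '@') are taken from the first partition
    only; alignment records from every partition are appended in order.
    """
    with open(merged_path, "w") as merged:
        for i, path in enumerate(partition_outputs):
            with open(path) as sam:
                for line in sam:
                    if line.startswith("@"):
                        if i == 0:          # keep a single copy of the header
                            merged.write(line)
                    else:
                        merged.write(line)

# e.g. join_sam(["part.0000.sam", "part.0001.sam"], "result.sam")
```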

Genome Analysis Toolkit (GATK)[94] employs a Bayesian algorithm to compare aligned sequences with the reference. GATK provides a number of functions, such as HaplotypeCaller and UnifiedGenotyper, used for variant analysis in genomics. HaplotypeCaller functions by indexing the input set and creating walkers to locate variations between the query and the reference. Once a difference is detected, HaplotypeCaller performs local assemblies to fill gaps or correct mistakes. This produces an output that expresses how closely alignments match, the match confidence, and other statistics of the analysis. GATK is generally used to prune alignment information for later analysis. The execution time of GATK is dominated by the size of the reference file, as it attempts to map the full reference in memory. Thus, partitioning both the query file and the reference yields performance gains that outweigh the additional data handling, allowing consistent performance on smaller machines. There is added complexity in joining results, due to correcting quality scores, but the effect on execution time is small compared to serial execution.

MAKER[18] is a bioinformatics pipeline used to annotate genomic information. MAKER utilizes standard programs in bioinformatics to customize the processing and preparation of the raw data. This includes processes to identify repeats, align ESTs and proteins to a target genome, predict genes, and quantify the quality of the results based on the provided evidence. MAKER focuses on automating the entire annotation process to create an easy and consistent initial annotation. MAKER is still under active development and is used in many areas of organism modeling. MAKER is interesting because, as a pipeline, there are a large number of dependencies and hurdles that must be overcome for setup and execution. These include a number of restrictions such as database handling, hardcoded installation paths, and specific data locations and names. Additionally, MAKER supports MPI internally, which provides opportunities for increased concurrency on workers.

Dimuon Detection is a high energy physics application that analyzes events from the Large Hadron Collider (LHC) for dimuon activity. The input to this detection relies on the event data presented in ROOT[14] format. This format is large and can be cumbersome when analyzing events concurrently. This leads to interesting opportunities for rearranging data handling in scientific workflows.


Figure 1.3. Map of dissertation. The top graphic illustrates a static DAG, which partitions a dataset, runs a pipeline of applications, and then concatenates the results. Chapter 3 looks at dynamically expanding a dataset and pipeline to create a workflow, as well as handling data partition sizes. Chapter 4 then looks at statically analyzing the storage needs of a workflow and dynamically managing it. Chapter 5 follows, exploring how to address environment needs using workflow transformations. The middle graphic shows a dynamic workflow, whose partitions are determined during execution. The lessons learned from static workflows are applied and discussed in Chapter 6. Finally, bridging the two categories, Continuously Divisible Jobs are introduced in Chapter 7 and enable a static application definition to use dynamic runtime behavior, with the goal of avoiding bad partitioning decisions.

CHAPTER 2

RELATED WORK

In this chapter, I will outline and discuss existing work related to this dissertation. The primary focus is on existing workflow systems, where I separate static (Section 2.1) and dynamic workflows (Section 2.2). I will also briefly highlight some related batch execution systems (Section 2.3). Following the discussion of workflow systems, I review research related to the problems tackled and solutions used throughout: task and workflow resource provisioning (Section 2.4), workflow storage analysis and management (Section 2.4.1), and environment creation tools such as containers (Section 2.5). Finally, I conclude with a more in-depth look at the tools used for this research, specifically Makeflow and Work Queue (Section 2.6).

2.1 Static Workflow Systems

Pegasus [38, 40] is a workflow management system which utilizes directed acyclic graphs (DAGs), similar to Makeflow. It makes data flow and movement a priority and attempts to optimize these in scheduling. Pegasus focuses on grouping and managing the individual tasks of the workflow to partition the work while minimizing communication overhead[37]. This was explored further to combine the results of workflow partitioning with resource provisioning[22], and then refined for storage-constrained conditions[21]. Pegasus explored a number of different workflow optimizations such as task clustering, job throttling to limit traffic, and pre-staging of data[20]. As a Workflow Management System (WMS), Pegasus[37, 38] focuses on static planning of workflows on remote sites. It utilizes a multi-stage method to

plan, execute, monitor, and re-use workflows across multiple platforms. Pegasus's similar DAG structure would allow methods like dynamic job expansion and workflow transformations to be applied in very similar ways, though with Pegasus's more verbose task structure, care would need to be taken when translating transformations. Additionally, Pegasus's emphasis on static planning may limit the interface with Continuously Divisible Jobs, where tasks are dynamically created.

DAGMan[31] is a DAG manager built as a meta-level job manager on top of HTCondor. DAGMan controls the order and rate at which tasks are submitted to HTCondor based on workflow dependencies. DAGMan relies on HTCondor to allocate and control the resources needed for a task, trickling jobs in as dependencies are met. Much like Pegasus and Makeflow, DAGMan operates on DAG workflows, allowing dynamic workflow expansion and workflow transformations to apply cleanly to the defined abstraction. DAGMan plays a similar role in job throttling to the job coordinator described in Continuously Divisible Jobs, but its static DAG nature limits the flexibility. Unlike Pegasus and Makeflow, the dependencies for DAGMan are described purely at the job level, rather than through files. This limits the clear mapping of the static and dynamic storage management, as DAGMan does not define a global view of files and their lifetimes.

Snakemake[76] allows for static workflow definitions that exploit file wildcards. It is written in Python and allows data scaling without the need to rewrite workflows, functioning similarly to what was achieved with dynamic workflow expansion. Snakemake is growing in popularity as it supports a number of bioinformatics tools out of the box, is easy to learn and write, and supports Python, shell, and scripts directly. Snakemake assumes a similar task structure to that used in other workflow systems, which would allow workflow transformations to be applied. The flexibility and ease of use are tempered by limited support for batch systems, targeting newer cloud platforms but supporting only Sun Grid Engine (SGE) for clusters. Snakemake

determines dependencies using files and generates the workflow as jobs are executed. This limits the amount of verification that can be done on the workflow prior to execution, as well as the ability to statically analyze the storage needs.

Dryad[66, 72] is a distributed execution engine geared toward data-parallel applications. Its workflows are defined as DAGs, where the vertices describe computation and the edges consist of data communications. This description is vague because Dryad supports a number of different communication pathways, such as files, TCP pipes, and shared-memory queues. The underlying design of Dryad is to allow scaling from a single machine up to large scale distributed environments. The inherent complexity of Dryad's supported communication pathways limits the direct applicability of the methods used within this dissertation. Dynamic workflow expansion assumes there is a partition point at which the work can be split into multiple files. This is possible with these other communication methods, but it complicates workflow creation, validation, and management. Similarly, in Chapter 5 a task is defined using only files, as they are a stable way to pass data through several layers. Each additional communication method complicates the transformation process and increases its fragility, depending on the shell used.

PaRSEC[13] is a task scheduling system used for hybrid distributed systems. PaRSEC is a dataflow system that builds a directed acyclic graph based on implied data connections. This graph is then traversed, launching parallel tasks on available resources. It focuses on leveraging available hardware accelerators for computation[125]. Additionally, in moving toward a dynamic approach, the most recent implementation provides dynamic task discovery[65], which allows tasks to be injected into an existing runtime as they are created. PaRSEC operates on sequential code, transforming the code into a DAG workflow, producing tasks as function calls in a parallel environment. For this to work, the environment is assumed to be highly connected and consistent, much akin to an MPI application.

This assumption limits the effectiveness of solutions such as dynamic workflow expansion and static workflow analysis, as the workflows are derived from the application rather than built around a workflow structure. Additionally, workflow transformations use a strict task definition and nesting using processes, which eliminates the advantages of PaRSEC's close-to-the-hardware approach.

Some workflow management systems operate as high-level applications that focus on the development and organization of workflows for a user. These workflow management systems are GUI-based, providing a rich editing environment for users, and are often the first point of entry for users into the realm of workflows. These systems facilitate easy workflow design. The high-level nature of these systems, however, lacks expressiveness for dynamic concurrency when compared with the above solutions. In practice this is remedied by partitioning the data prior to submission, but this places an additional burden on the users.

Galaxy[12, 48] is a web-based workflow management system focused on bioinformatics. Galaxy is built with similar goals as Kepler, with an emphasis on a low barrier to entry and simple workflow development, using drag-and-drop components. To facilitate this ease of development, users are provided a curated set of applications and tools for analysis. However, tools cannot be added by normal users and must be managed by system administrators. Workflows in Galaxy are comprised of applications/tools that are linked together graphically by the user to connect outputs with the tools that require them. In Galaxy, concurrency is limited as workflows lack any means of dynamic partitioning. As is discussed in Chapter 3, Galaxy provides an opportunity to abstract the concurrency from the user using dynamic job expansion during execution. Additionally, job clean-up may be inconsistent, so high-level workflow storage management using static analysis and dynamic management is crucial to safely deploying dynamic workflows.

Kepler [3] is a workflow management system that provides a GUI interface for users

16 to build and refine workflows. The core design targets ease of use for users with little background in workflows or distributed computing. Kepler aims to provide tools for creating generic scientific workflows where portability and power are important. Unlike other workflow management systems, Kepler allows the workflow design to be separated from the computational model, more akin to typical workflow execution systems. It also offers support for tracking and managing provenance. Combining the provenance and workflow design aspects, Kepler facilitates workflow sharing for collaboration and reproducibility. The similarities between Galaxy and Kepler allow the methods described in Chapter 3 and Chapter 4 to apply.

Apache Taverna[114, 122] allows the user to create a static workflow that is used as a template, but the degree of partitioning is static within the workflow unless other means are provided. It supports a wide range of tools and domains for scientific workflows, with three core elements: the Taverna Engine executes the tasks of workflows concurrently across resources, the Workbench is the desktop client application targeted at the end user, and the Server allows the user to execute workflows remotely. Like Galaxy, Taverna is comprised of applications/tools that are linked together graphically by the user to connect outputs with the tools that require them. The underlying complexity of Taverna modifies the approach needed to leverage available resources. Dynamic job expansion would need to configure how to submit new jobs to the Taverna Engine for new task execution, as opposed to Galaxy, where the resources were out of Galaxy's control.

Uintah[32] is a parallel design environment, where workflows are constructed of components that create and consume data for other components or for data warehouses. Uintah strives to provide a clear interface for scientists to develop and steer large scale computation, and stands apart from several of these workflow management systems in that it builds parallelism into its design. This assumption of flexible concurrency allows for powerful scaling, without having to re-engineer the entire system.

17 Uintah is actively continuing development targeting next-generation exascale systems. The approach Uintah provides for concurrency pushes the work and complexity of defining concurrency points onto the user, which differs from the more abstracted approach examined using dynamic job expansion in Galaxy.

Prune[67] focuses on preserving and providing provenance for scientific computation and workflows. Though not specifically a workflow management system, Prune tracks the organization of workflows, which allows data to be changed or updated, triggering fresh computation. Prune uses Makeflow as a target back-end, interfacing to allow for complex workflows to be executed and preserved. As a result of the provenance information, these workflows and computations can be shared for collaboration, similar to the goals of Kepler and Galaxy. Storage management is a crucial benefit of Prune, as files can be removed and regenerated as needed across several workflows. This helps to avoid storage deadlocks entirely, as the space is actively managed by Prune. Additionally, Prune's support for Makeflow workflows and tasks allows both dynamic job expansion and workflow transformations to be applied as needed, with the caveat that each transformation or expansion may define new paths in Prune, limiting the reuse of previously executed tasks.

2.1.1 Workflow Specification Languages

As a side-effect of creating a new and unique workflow system, a specification is created. Many of these specifications have unique design considerations such as implicit concurrency, parameter mapping, or system specific properties. For most workflows, the core components remain the same: inputs, outputs, and analysis. There have been many attempts to create a universal specification, but many do not reach a wide level of usage due to a lack of consistency. Common Workflow Language[6] strives to unify workflow specification languages. Though the Common Workflow Language does not provide its own implementation, it boasts support

18 from a large number of popular workflow systems. In a similar approach, the Broad Institute has developed the Workflow Description Language[117], which is coupled with their Cromwell implementation. Both approaches aim to be a common interface or intermediate representation of workflows, though in many cases design decisions limit their overall applicability in specific workflow systems.

Though there are minor differences between the specifications, both define tasks similarly to existing workflow systems, particularly Makeflow, which allows dynamic job expansion and workflow transformations to apply, with the restriction that the dynamic file names used in the Common Workflow Language are not compatible. The Common Workflow Language allows inputs and outputs to be defined both as static file names and as file name globs, which may be resolved using wildcards or JavaScript functions. This dynamic and possibly non-deterministic file mapping limits the proper handling of partitioned tasks in a dynamically expanded job.

2.2 Dynamic Workflow Systems

As opposed to static workflows, dynamic workflows allow for flexible execution patterns. To achieve this flexibility, the work of partitioning the data, defining tasks, and controlling dependencies is left to the user. Once the user has defined this level of detail, they are provided a robust set of execution tools to migrate and scale work as needed. With all of these systems, the inherent dynamic behavior leads to incompatibilities with dynamic job expansion and static storage analysis at the master level. Dynamic job expansion would be possible if applied to the resulting task submitted by the workflow system, at which point the task is static in definition. In the same way, workflow and task transformations assume a static definition, and would need to be applied after the task is submitted.

RADICAL-Pilot[90] is a pilot-based system that aims to provide scalable approaches on a variety of platforms. RADICAL-Pilot is written in Python, allowing

19 easy adoption by the currently flourishing scientific Python community. It is built on the SAGA[51] platform, providing support for common execution platforms such as the Portable Batch System and Sun Grid Engine. RADICAL-Pilot operates very similarly to Work Queue from a user's perspective by providing a clear task definition and pilot jobs for execution. These similarities would allow a job coordinator for Continuously Divisible Jobs to be created similarly to what was designed using Work Queue.

Swift[121] is a dynamic workflow system built on a unique scripting language[129]. Swift's scripting language is an implicitly parallel functional language whose goal was to provide a simple language to define parallel workflows. It has even been used to investigate parallel approaches integrated with Galaxy[85], similar to what is explored in Chapter 3. The difference between these methods is how the concurrency in Chapter 3 is abstracted from the user. Swift uses compiler techniques to collapse and expand tasks within the implicit task parallelism[7]. Swift operates most similarly to Continuously Divisible Jobs in that Swift applications define how a workflow operates and what operations are allowed for partitioning, and then allow the underlying engine to evaluate and execute the workflow.

Parsl[10] is a Python-based workflow system that separates the user from the execution engine. Applications are written in Python with functions tagged as Parsl tasks, which are parsed and distributed with little user interference. The decisions on how to execute remote tasks are defined by the user-specified execution engine. This allows a number of execution techniques and systems to be combined for an application-specific configuration.

Pydron[92] performs semi-automatic parallelization of Python code to support multi-core and cloud execution. Pydron has a similar philosophy to Parsl, aiming to provide a simple interface for users to achieve parallelism in a language they are already using. Pydron adds two decorators to Python that allow functions to be

20 marked as either schedule, saying the function should be run in parallel, or functional, saying the function is free of side-effects. Schedule runs the execution on the same machine, while functional can be run on other machines.

Both Parsl and Pydron focus on providing an easy interface for users and less on creating a distributed execution platform, favoring existing solutions configured for Python tasks. From this perspective, Parsl and Pydron create jobs and submit them to existing job coordinators, similar to abstract jobs in the Continuously Divisible Jobs definition. Both systems rely on the user to make concurrency decisions, whereas Continuously Divisible Jobs passes a concurrency definition to the job coordinator.

From here on, the dynamic workflow systems are similar in principle to the ones already discussed, but the underlying task structure and definitions vary widely. As a result, workflow transformation is applicable in concept, but in practice the differences in task structure would need to be resolved. As an example, Spark defines tasks using inputs, outputs, and the code to execute, but environment and resource definitions are relegated to the executing platform, such as Hadoop. Charm++ and Legion, on the other hand, define lower level tasks that may not be tied to files but to data more generally, making clean task encapsulation more loosely tied to a single process execution. As a result, transformations such as applying a container do not make sense in these environments.

Spark [126] is another powerful dynamic workflow implementation that focuses on utilizing RAM to achieve higher performance, and could be leveraged to further boost performance. It relies on the concept of Resilient Distributed Datasets[127] to address both performance issues and the storage issue created by holding the data in memory, and it performs well with sufficient memory for the analysis. Spark functions similarly in execution to Continuously Divisible Jobs by defining how a type of analysis is done, relegating the concurrency to the underlying batch execution

21 (Hadoop). Hadoop does the data partitioning and distribution, allowing Spark to define execution.

Charm++[71] is not directly a dynamic workflow system, but relies on task-based concurrency similar to most workflows. Charm++ is a parallel programming system that interfaces with applications using C++. The C++ interface allows applications to create Charm++ objects, which are used as the base concurrent unit and define data, coupled functions, and relationships to other objects.

Legion[11] is a programming model that organizes applications into logical regions. These regions define the locality of computation, the independence of data, and the separation of analysis from other logical regions. Applications that have defined their logical regions then leverage the Legion runtime system, which identifies independent tasks and available parallelism. To simplify the creation of these logical regions, Regent[103] allows the user to write high level programs that define tasks and logical regions, and are compiled to Legion.

The last several approaches are all similar because they provide a mechanism for describing or translating concurrency into a form that can be executed in a distributed manner. Spark does this using Hadoop, while Charm++ and Legion do it using threads and MPI. These approaches share high-level goals with Continuously Divisible Jobs, mainly allowing the user to define the analysis while an automated execution system provides the parallelism. However, a key difference in these approaches is the limited set of task execution platforms, which are assumed to be closely bound to the host node. Continuously Divisible Jobs are defined such that communication and task execution can be delegated to a number of different services when defined as job coordinators.

2.3 Batch Computing Systems

One of the key advantages of using a workflow system is that it allows the user to leverage large scale resources for computation. To achieve this, workflow systems in

22 turn rely on the underlying batch system to allocate resources and schedule tasks. Generally, batch systems operate as a centralized coordinator of all available resources. These systems provide an interface for submitting jobs and specifying resources. The centralized scheduler then finds available resources, places the job on the resources, and starts the job. For most large scale compute sites this is the norm, with systems such as SLURM[69], Sun Grid Engine (SGE)[45], and the Portable Batch System (PBS)[63].

HTCondor[80] is a cycle-scavenging batch system that provides large scale computation scheduling for resources. Jobs are submitted to HTCondor labeled with resource needs (using ClassAds[97]), and when a suitable resource becomes available the job is placed. HTCondor differs from many standard batch systems in that it coordinates resources and submissions, rather than having a single centralized decision engine.

MapReduce[33], as implemented in Hadoop, is a parallel execution engine which strives to move the computation to the data. This is achieved by partitioning the data when it enters the system, and creating replicas as specified. The combination of partitions and replicas allows each piece of the data to be operated on concurrently with no data movement. This is leveraged in a number of systems, such as Spark, to provide the underlying concurrency for execution.

ATLAS PanDA[83, 84] is a large scale workload management system deployed for organizing the LHC ATLAS experiment analysis, primarily for the US ATLAS collaboration. Rather than creating and organizing individual workflows, PanDA operates on all jobs submitted for analysis and coordinates their execution on available resources.

When considering the applicability of methods explored throughout this dissertation, workflow transformations and Continuously Divisible Jobs have clear implications of possible value for any batch system. Though workflow transformations

23 are aimed at applying transformations to full workflows, they build off of the single-task base case. This base case is directly applicable to batch systems, where each job generally defines a single task or a class of similar tasks. If a transformation evaluator were written for any of these batch systems, the lessons and techniques used could be applied quickly.

In all of these cases, each batch system defines a potential job coordinator. The consistency and similarities between how most of these systems define tasks would allow job coordinators to be easily written. This in turn could provide a consistent base case on which system administrators could tune or refine the job coordinator for their specific site. One exception to this is MapReduce/Hadoop, which generally assumes internal control and partitioning of the data, preventing Continuously Divisible Jobs from dynamically and reactively partitioning the data.

Liu et al.[81] proposed the idea of elastic job bundling, which defines a method for implicitly decomposing large jobs into smaller jobs to reduce queue wait time. This approach is positioned between the parallel application and the underlying batch system to break up monolithic jobs. It is similar to the work discussed in Chapter 3 and Chapter 7, in that the further partitioning of work is done without user input. A significant difference here is that jobs are tightly bound to a single batch system, relying on the assumed parallel environment to accommodate data movement and consistency. Additionally, the partitioning and join jobs act only on virtual data references, not physically partitioning the data as may be needed for non-shared distributed computing.

2.4 Resource Provisioning and Management

Resource provisioning can be viewed from either the resources needed by the work- flow or more specifically the resources used by each individual task. Statically planned workflows employ workflow-wide provisioning to ensure there are ample resources or

24 to minimize the needed resources through scheduling [28, 39, 70, 86, 91, 102]. Many of these approaches take into account the resources provisioned at the task level but require information at the planning stages that may not be known in a dynamic workflow. The more fine-tuned task level specification is the most useful as it allows different categories of tasks to be specified with their required resources. Workflow systems utilize one of three approaches:

• Naive approach - Assumes task will fit on any resource. Each task is scheduled to the whole resource, and resources are harder to adjust at the master.

• Static solution - Tasks are labeled with resource needs and scheduled only on machines that can satisfy those needs. This is often backed by systems such as ClassAds in HTCondor[97] or resource advertising in Mesos[64]. This is done on the workflow side by setting static attributes on the task, as evidenced in the workflow systems that statically determine resources beforehand (a small sketch of such labeling follows this list).

• Dynamic solution - Task resource specification is adjusted to meet the actual performance of the workflow. An example of this is the Resource Monitor[49].
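To make the static-labeling approach concrete, the sketch below annotates a task with its resource needs using the Work Queue Python bindings discussed in Section 2.6. It is a minimal illustration based on my recollection of that API; the method names, port, and values are assumptions, not code from this dissertation.

    import work_queue as wq

    q = wq.WorkQueue(port=9123)

    t = wq.Task("./analyze input.dat > output.dat")
    t.specify_input_file("input.dat")
    t.specify_output_file("output.dat")

    # Static resource labels, analogous in spirit to ClassAd requirements:
    # the task is only dispatched to a worker that can satisfy them.
    t.specify_cores(4)
    t.specify_memory(4096)   # MB
    t.specify_disk(8192)     # MB

    q.submit(t)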

Using these approaches, a wide variety of works [43, 62, 75, 99, 110] have proposed different approaches for solving the classic DAG scheduling problem. These approaches typically focus on minimizing the overall execution time, while considering the utilization of various resources as a means of determining good candidates. However, in a storage constrained environment these algorithms may overcommit the system, with limited accounting of the persistent storage needed to hold the files produced. As we show in Chapter 4, naive accounting of storage without consideration for dependencies can lead to deadlock. When operating with more dynamic execution, as with Continuously Divisible Jobs, approaches like the resource monitor[49] could prove to be a powerful tool for capturing resource usage and relating that usage to a model of the data and computation.

25 2.4.1 Storage Management

The persistent nature of storage use changes the nature of the problem beyond what much of the resource provisioning work above aims to address. For a better understanding, there needs to be a discussion of workflow data management and of how to manage data lifetimes[34]. An example of this is Overflow[113], which provides a uniform data management system over several sites based on the needs of a scientific workflow. Overflow aims to be a pluggable solution for workflows that coordinates and manages data usage, being cognizant of the costs of storage and transfer.

Pegasus's strict planning allows for stricter adherence to quota limits and attempts to leverage the maximal parallelism, even across multiple sites[19]. This strict planning also allows Pegasus to pre-stage data to limit task lifetimes when storage is available[26]. It also allows for changes to a workflow, doing so internally by clustering tasks[23–25], spawning clean-up tasks to remove files[104], and restructuring the workflow[52]. Pegasus has also explored using a data store for long term storage and staging data in as needed for execution[115]. However, in none of these cases does the system analyze the expected storage needs of the workflow, limiting the effectiveness of planning in storage constrained environments. The similar structure and expectations of Pegasus mean that the static analysis and runtime management used in Chapter 4 could also apply. The case of managing storage across several sites, as is often done in Pegasus, is more complex, but the core file lifetimes and footprints still apply.

2.5 Environment Re-Creation

As part of making workflows more flexible and mobile it is crucial to be able to capture, preserve, and replicate a task or workflow execution environment[47, 82, 106, 130]. A variety of solutions exist within this space, and understanding these available

26 technologies is key to providing adequate abstractions for any solution.

A common consideration when looking for a way to build a reproducible environment is to use container technologies such as Docker [89], Singularity [73, 74], CharlieCloud[95], NERSC's Shifter[46], or Slacker[54]. Containers provide a means of creating reproducible execution environments. They can differ from the execution platform even at the operating system level. Similar to containers, Umbrella[88] captures the state of execution and data so that it can be used for reproduction. Umbrella expands on the expectations of containers to capture every aspect of an analysis for preservation, as opposed to the more constrained container builds.

Similarly, there are several automation systems [44, 98] that rely on system configuration to provide consistent environments. Richer formats such as RPMs and DEBs provide version requirements for software. However, both solutions are installed with administrative privileges. Virtual Cluster Builder[111] does not replicate the original operating system but builds the tools on the current system, allowing Virtual Cluster Builder to operate at the user level. The goal of this system is not to provide the exact execution environment, but to build the required tools in the new environment. This allows the Builder to provide a consistent software stack on a flexible platform.

Each of these approaches has different benefits and restrictions, but may be the correct solution depending on the application and site configurations. These different approaches are discussed in Chapter 5, where transformations that update or recreate an environment are some of the most common, powerful, and complex transformations. The methods explored here can also be seen in use in Chapter 6, where the Virtual Cluster Builder was used to compile and configure a complex MAKER software stack. This software stack was shown to be flexible and consistent across hundreds of cores. Lastly, these methods provide a solid basis for quickly and consistently deploying job coordinators for Continuously Divisible

27 Jobs, where execution needs to be consistent between a number of heterogeneous distributed nodes, as seen in Chapter 7.

2.6 Makeflow and Work Queue

Makeflow[2] is a system built to create and run workflows using a syntax that is very similar to that of classic Make, and it has now been expanded to support a JSON-like syntax called JX[100]. Each rule in the workflow must explicitly state input and output dependencies along with the command to run. Makeflow can dispatch jobs to a variety of execution systems, including Condor, SGE, and Work Queue. The simple syntax supports DAG-structured workflows, which are ideal for transformation of input to output over a number of predefined steps. This syntax also makes it easy to write, or have a script write, rules for a workflow based on the inputs, which allows for the dynamic creation of performance workflows from the provided input and a means to execute them. Makeflow is well suited to running workflows on a distributed platform as it supports several different execution engines. For improved flexibility, Makeflow assumes there is no shared file system and provides the execution engines with the information about file dependencies. Since many workflows are structured as a series of defined tasks that can be mapped across the input, scripts can readily describe the workflow as a Makeflow. This architecture also performs well as a lightweight workflow manager that can express the resources needed by tasks and schedule them when these requirements are met.
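As an illustration of the rule format (this example is not taken from the dissertation's own workflows; the tool and file names are hypothetical), a Makeflow rule names the output files, then the input dependencies, then the command, in Make-like syntax with the command tab-indented as in Make:

    # output(s) : input dependencies
    out.0.sam: bwa ref.fasta query.0.fastq
        ./bwa mem ref.fasta query.0.fastq > out.0.sam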

28 Work Queue [16] is a lightweight master-worker platform. Workers consist of a process that is started at an execution site and communicates with the master process to retrieve, execute, and return tasks. While preparing the task, the required dependencies described by Makeflow or the dynamic workflow application are staged and a sandbox is established. To utilize a worker, the site runs the worker, allowing workers to be created on any supported platform or through a batch system. Work Queue provides several benefits that help performance workflows. Workers are processes that persist outside of the execution of single tasks. This allows the master to cache files on the worker and reuse them to limit multiple transfers to a single worker. Caching helps to limit the execution time as well as the number of workers needed to impact performance. This benefit can be extended by utilizing "multi-slot" workers on multi-core machines. If a task is labeled with resource requirements and the worker is larger than the task's requirements, multiple tasks can be scheduled on the same worker. When a worker performs several tasks, they share a cache that limits transfers and total disk usage. This also prevents over-subscription of workers, which is crucial when running MPI or multi-threaded tasks. Worker persistence also enables workers to be given tasks as soon as they are available, without the overhead of waiting for the task to be scheduled through the batch system. Side-stepping the batch scheduler allows short tasks to be rapidly scheduled, with less overhead. The uninstantiated workers still need to go through the scheduling process, but once at an execution site they can be utilized. Further, a Work Queue pool can be created that scales active workers up and down as need arises. As more workers are needed, the pool will submit them to the specified resource, and it lets them time out when more exist than are needed.
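For a sense of how a master using these workers looks in practice, the sketch below submits a set of tasks and collects results. It is a minimal illustration written from my recollection of the Work Queue Python bindings; the project name, file names, and retry logic are assumptions, not code from this dissertation.

    import work_queue as wq

    q = wq.WorkQueue(port=9123)          # workers connect to this port
    q.specify_name("expansion-demo")     # hypothetical project name for worker matching

    for i in range(10):
        t = wq.Task("./align ref.fasta query.%d.fastq > out.%d.sam" % (i, i))
        t.specify_input_file("align", cache=True)      # executable cached at the worker
        t.specify_input_file("ref.fasta", cache=True)  # shared reference cached for reuse
        t.specify_input_file("query.%d.fastq" % i)
        t.specify_output_file("out.%d.sam" % i)
        q.submit(t)

    while not q.empty():
        t = q.wait(10)                   # returns a completed task, or None on timeout
        if t and t.return_status != 0:
            q.submit(t)                  # naive retry of a failed task

Because workers persist across tasks and share their cache, the executable and reference file in this sketch would only be transferred once per worker, which is the caching benefit described above.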

29 CHAPTER 3

DYNAMIC EXPANSION

3.1 Introduction

In the previous chapter, I reviewed a large selection of systems available for writing, managing, and executing applications concurrently using workflows. A core conceit of workflows is the leverage they provide users to scale applications in a concurrent manner with minimal modifications. Many workflow systems are presented as easy to use, but any system has a learning curve that must be overcome for the results to be worth the work. Once a workflow system is selected, the user is locked into the specifics of the chosen system, often with only a cursory knowledge of workflows and of how they would need to be used in execution. The question becomes: how do we provide the performance and flexibility of workflows in an opaque way, such that the user sees no difference in the behavior of their data pipeline, only an improvement in performance?

A great test bed for answering this question is in the context of workflow management systems such as Galaxy[12, 48, 50], Taverna[123], and Kepler[3]. Workflow management systems provide a high-level view of workflow design, linking files and tools to define the logical workflow for data processing. Abstracting the execution environment from the organization of the workflows allows workflow management systems to provide a flexible and resilient experience for users. However, the simplicity of expression makes specifying dynamic concurrency difficult. As data becomes larger there is either limited or no mechanism to increase concurrency. Enabling additional

30 parallelism must then come as either a change to the workflow management system's job handling, or an extension of the job structure already in place.

Dynamic job expansion provides a solution where each job of a logical workflow is expanded at runtime into a performance workflow in which the majority of the work is done in parallel, accelerating the performance of the original process. The purpose is to create tools that are indistinguishable from sequential tools, while supporting greater concurrency. Abstracting concurrency from the user allows dynamic job expansion to be utilized by workflow management systems without user involvement and enables the parallelization to be specialized to the local environment. Ideally this method could apply to any computation that consists of data-independent sub-groups, which we find to be prevalent in bioinformatics.

3.2 Galaxy

Galaxy is a web portal that was created to provide biologists with ready access to a number of bioinformatic tools. It creates an environment where the user only supplies inputs and selects the tool to perform computation. This abstraction pro- vides biologists with a generic analysis framework without the need for specifying resources. The administrator of a Galaxy instance configures the underlying infras- tructure and provides the tools available to the users. The portal records how and when the tools are run, so that reproduction of an experiment is consistent and easy. Each tool consists of a predefined web page that serves as a launch for the tool, as well as scripts in the background to provide the correct setup. Galaxy also provides the user with a means of creating DAG-based workflows. Workflows use the data dependencies between tools to determine order, and provide a means of creating a logical flow for processing inputs in a consistent way. Workflows created may be configured with predefined parameter values and can be shared so that analysis can be reproduced easily by numerous parties.

31 Figure 3.1. This diagram shows the dynamic job expansion process. In Stage 1, the job has been created by the user from the tool launch page. Once Galaxy gets the launch, the job is given an id, a working directory is created, and the job is added to the history. In Stage 2, the files selected at launch are located via the file database, and the location is communicated to the job. In Stage 3, inputs are collected, either directly or linked, in the job sandbox. Following setup, a script creates the performance workflow. In Stage 4, Makeflow is launched with the performance workflow in the job sandbox, and the workflow begins processing. In Stage 5, Makeflow creates a Work Queue master that communicates with workers to create execution locations. In Stage 6, the worker receives the task, retrieving the inputs and task information. The task is computed and the output delivered back to Work Queue, which relays it to Makeflow. The performance workflow moves through stages 4, 5, and 6 until the workflow is complete. After completion, stage 4 finalizes the outputs and copies them to the output location defined by Galaxy. If successful, stage 3 is cleaned up, and the wrapper process concludes. At stage 1 Galaxy changes the job status and the user is informed.

The portal creates an environment through which users can analyze data without the hassle of data and resource management. Galaxy’s abstraction makes using tools quick and easy, while offering a platform to create meaningful results. It also provides a means of sharing results and the process required to achieve them. Work can then be verified and analyzed by many people at once, allowing for easy collaboration.

3.3 Dynamic Job Expansion

Dynamic job expansion is the run-time transformation of a single job in a log-

32 ical workflow into a performance workflow. The resulting performance workflow must be logically indistinguishable from the original job by accepting the same input files and generating the same output files, while hiding the complexity from the user. Figure 3.1 gives an overview of dynamic job expansion.

For each type of logical job to be expanded, we must write a job expansion tool. This is a command-line tool that is invoked in an identical manner to the underlying application. Instead of running the application directly, it writes out the desired performance workflow, invokes the performance workflow manager, and waits for it to complete. From the perspective of the logical workflow manager, the job expansion tool is the job to be run.

Naturally, job expansion must be specialized to a given application and must take advantage of details of the application's structure and performance. For many applications, a split-process-join pattern is effective. The initial step of the generated performance workflow evaluates the size of the input and splits the inputs into appropriately sized pieces. Then, the primary application runs on each split of the input, generating multiple outputs. The final step merges the results into a single output file. In simple cases, this might be concatenation of the outputs, while in more complex cases, it could require recomputing statistical results. The more complex cases rely on knowledge of properly dividing the work, possibly in stages, to maintain equivalence with the original tool.

Using Galaxy, Makeflow, and Work Queue, the process works as follows. For each tool in the logical workflow, Galaxy assigns the job an identifier, specifies input locations, and generates output file-handles. Galaxy creates a working directory for the job, and job execution is moved to the directory where the job expansion tool can create the performance workflow. The Galaxy wrapper script links inputs to the directory and copies down the necessary execution files. The job expansion tool writes a performance Makeflow based on the input's characteristics.
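A job expansion tool of this kind can be a small script. The sketch below writes a split-process-join performance Makeflow for a hypothetical alignment tool; the file names, split count, helper scripts, and the use of the LOCAL prefix to force local execution of the split and join steps are illustrative assumptions, not the dissertation's actual tool.

    def write_performance_workflow(query, reference, n_parts, path="performance.mf"):
        parts = ["query.%d.fastq" % i for i in range(n_parts)]
        outs = ["out.%d.sam" % i for i in range(n_parts)]
        with open(path, "w") as mf:
            # Split: runs locally, partitioning the query into n_parts pieces.
            mf.write("%s: %s split.py\n" % (" ".join(parts), query))
            mf.write("\tLOCAL python split.py %s %d\n\n" % (query, n_parts))
            # Process: one rule per partition, dispatched to remote workers.
            for part, out in zip(parts, outs):
                mf.write("%s: align %s %s\n" % (out, reference, part))
                mf.write("\t./align %s %s > %s\n\n" % (reference, part, out))
            # Join: runs locally, merging the partial outputs into one result.
            mf.write("result.sam: join.py %s\n" % " ".join(outs))
            mf.write("\tLOCAL python join.py result.sam %s\n" % " ".join(outs))
        return path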

33 After the performance workflow is created, the job expander invokes Makeflow with Work Queue as its execution environment. Makeflow verifies that the structure of the performance workflow is correct, and confirms the presence of required input files. The split and join processes, which require most of the existing files, run locally to limit the data transfer. The remaining tasks utilize a subset of the files and are run in parallel. Makeflow sends each task through the Work Queue master to workers as required files become available. The Work Queue master communicates with workers to send tasks and inputs. When the task is ready at the worker, the process is executed in a task sandbox. The completed task returns the output to the master. Having verified that the task produced a successful return value, the output is collected and returned. Makeflow continues to submit and collect tasks until all the tasks have been completed. Upon completing the tasks, Makeflow joins the result and the wrapper copies out- puts to the specified file-handles. The Galaxy wrapper script completes and Galaxy verifies that the output files have been created. After the verification, Galaxy changes the state of the job to either successful or failed. The results are then delivered to the user, and the Galaxy job is complete. Dynamic job expansion benefits from several characteristics of this process.

• Hidden Complexity Using an expansion tool to write and run a performance workflow alleviates the user's need to understand the inter-process complexities and to expand the job within the logical workflow. This allows lay users to quickly pick up a tool and use it interchangeably with the original tool, without needing to know the background process.

• Environment Aware Decisions Expanding the workflow at the execution environment provides several benefits. Inputs are defined and can be used to make intelligent partitioning decisions [29]. Intermediate files are managed locally and do not fill the data store and history. Processes utilizing many inputs, such as split and join, can be run locally to prevent large amounts of data traffic.

• Flexible Execution Resources The ability to utilize resources that are not primarily dedicated provides a flexibility in scaling. This flexibility also provides

34 a means of users coming to a portal with their own resources.

3.4 Example Application

To demonstrate job expansion, we show a logical workflow that transforms general genomic data to aligned genotyped data. The working data is a query of 32GB of reads against a reference dataset of 36MB, which are the reference loci of Red Oak. This is done using BWA alignment, SAMtools sort, Picard AddOrReplaceReadGroups, and GATK HaplotypeCaller. When run on the working data, the BWA alignment and GATK genotyping steps take the longest to execute: 19 hours and 12 days, respectively. These two steps were used to demonstrate dynamic job expansion, and the remaining intermediate steps were left as sequential jobs. Figure 3.2 details how this particular logical workflow was expanded at runtime. The expansion strategy is slightly different for each of the two tools:

BWA implements the Burrows-Wheeler Transform algorithm to align genome queries. BWA is a light-weight alignment tool that supports paired-end mapping, gapped alignment, and various file formats like ILLUMINA [30] and ABI SOLiD [53]. The output format is SAM (Sequence Alignment Map), which can be analyzed using a number of tools such as GATK and the SAMtools package [79]. In previous work[27] we observed that the runtime of BWA is roughly proportional to the product of the reference and the input size. The input dataset was partitioned into splits of approximately 50K reads each. Partitioning the reference is also possible, but would increase the complexity of joining and negatively impact the use of cached files. We kept the reference whole and distributed it to all nodes of the performance workflow. The joining phase of the BWA workflow separated the header and content of each of the output splits, then generated a single output with one header and concatenated contents.
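Partitioning a FASTQ query by read count is straightforward, since each read occupies four lines. The sketch below is a simplified illustration of such a splitter; the chunk size and output naming are assumptions, and it is not the splitter used in this workflow.

    def split_fastq(path, reads_per_chunk=50000):
        # One FASTQ record is four lines; write query.N.fastq pieces of at
        # most reads_per_chunk reads and return the list of chunk names.
        names, buf = [], []
        with open(path) as f:
            while True:
                record = [f.readline() for _ in range(4)]
                if not record[0]:                      # end of file
                    break
                buf.extend(record)
                if len(buf) == 4 * reads_per_chunk:
                    names.append(write_chunk(buf, len(names)))
                    buf = []
        if buf:
            names.append(write_chunk(buf, len(names)))
        return names

    def write_chunk(lines, index):
        name = "query.%d.fastq" % index
        with open(name, "w") as out:
            out.writelines(lines)
        return name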

35 Figure 3.2. Diagram showing detail of the example application. The top level shows a workflow as it is represented in Galaxy. Each box is a tool, with the names and arrows differentiating inputs and outputs. The Galaxy Logical Workflow level simplifies the Galaxy representation to the simple logical workflow that it implies. This level shows the sequential nature of the jobs. The Galaxy Execution level defines the environment in which the jobs are running, local being on the Galaxy instance and Makeflow denoting that the Makeflow process is local but is creating tasks for parallelism. The Makeflow Environment level shows the process of expanding a single job. This level clearly shows the split-process-join nature of the performance workflow created. The lowest layer illustrates that workers from a number of systems can be utilized to perform the individual tasks.

36 GATK employs a sophisticated Bayesian algorithm to compare aligned sequences with the reference. In this workflow, GATK's HaplotypeCaller walker was used for its higher sensitivity compared with GATK's UnifiedGenotyper. HaplotypeCaller functions by indexing the input set and creating walkers to locate variations between the query and reference. Once a difference is detected, HaplotypeCaller performs local assemblies to fill gaps or correct mistakes. This produces an output that expresses how closely alignments match and other information about the analysis as a whole. The runtime of GATK is dominated by the size of the reference file. Thus, the job expander splits both the query file by size and the reference by distinct reference contigs. There is added complexity after completion due to correcting quality scores, but this affects the runtime of the application less than if the input were a single piece.

Using Galaxy's workflow creation system, we were able to create tools that split and join the input files. However, in order to be used in a workflow, the tool needed to have a static number of inputs and outputs, limiting the dynamic nature of this approach. This static design requirement is lifted when expanding a job at runtime, allowing the number of partitions to be based dynamically on the input rather than on an arbitrary decision made when creating a tool.

3.5 Design Considerations

In the process of implementing job expansion, we encountered several problems that resulted from converting what was previously a locally executed job into a perfor- mance workflow: dependency management, remote execution, and garbage collection.

3.5.1 Dependency Management

Problem: Jobs depend, explicitly and implicitly, on things being in the local environment. An input file is an example of an explicit resource and is straightforward to express. Explicit resources are specified in the command, and expressed by the

37 user at launch. Implicit resources are, by nature, left unspecified and known only by the program. Implicit resources fall into two categories. The first is reference material, such as database indexes, that is expected to exist within the environment. When a user is not expected to specify these resources, modifying a tool to require their expression complicates the tool implementation. The second category is environment resources, Java being a great example. Java is often required, but the environment's version may not be clearly expressed. This causes confusion for the developer as they attempt to use incorrect or absent versions.

For explicit resources, Galaxy tools clearly request resources using the tool XML launch page. This handles the expression of resources such as inputs and reference files. Implicit resources, such as database index files, are added to a default location within the Galaxy instance. Referencing this location in the environment allows the indexes to be located by the tools at execution. This solution requires users to utilize existing indexes or to add references and indexes, which is performed by administrators. Implicit resources such as Java versions are more difficult to deal with as they can be dynamic. Versioning, in systems like Java, creates incompatibilities that are difficult to handle in the program and necessitates that the program express these requirements. These requirements need either to be provided by the developer or expressed in a manner such that tools can access the intended variable or installation.

Solution: When designing an expanded tool, both the explicit and implicit resources must be expressed prior to execution. BWA alignment requires its reference to be indexed and the index to be present. Many users know these files, but often overlook them as they are assumed to be present. Galaxy allows this implementation, assuming that the reference is in a common location. This limits the references that the user can use, as this location file needs to be updated for every additional reference, along with having the files moved to a safe location. This was addressed here by creating the index files at job execution. This adds the index files to the local

38 environment, but they must be recreated every run. In the ideal situation, the first time an index is created it is stored alongside the reference for future use. This option does not require changes to the tool implementations that utilize the indexes, but must be monitored as these files are moved and collect in the system.

Handling environment settings also presents an interesting set of problems. Java is a great example, as many programs that utilize Java require a minimum and maximum allowed version. This is remedied by utilizing a script that unpacks the required version from an archive file and sets the necessary environment variables to utilize the unpackaged version. This allows the required environment to be created at any execution site specified. Packing the environment with the tools allows freedom of movement for the execution, as well as guarantees that it is utilizing the correct tool. Environment replication is a well known issue encountered in distributed computing. In our simple example, a Python script was created to package Java locally prior to the workflow, and another to unpack Java at the execution site. More complex situations can be handled by tools such as Docker [89] and Umbrella [87].
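A wrapper of the kind described above can be quite small. The following sketch unpacks a bundled Java runtime at the execution site and points the environment at it before launching the tool; the archive name, its layout, and the function name are assumptions for illustration rather than the script used here.

    import os, subprocess, tarfile

    def run_with_bundled_java(jar, args, bundle="jre.tar.gz", dest="jre"):
        # Unpack the shipped runtime into the task sandbox.
        with tarfile.open(bundle) as tar:
            tar.extractall(dest)
        # Point the environment at the unpacked runtime (assumes dest/bin/java).
        env = dict(os.environ)
        env["JAVA_HOME"] = os.path.abspath(dest)
        env["PATH"] = os.path.join(env["JAVA_HOME"], "bin") + os.pathsep + env["PATH"]
        # Invoke the tool with the bundled Java first on the PATH.
        return subprocess.call(["java", "-jar", jar] + list(args), env=env)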

3.5.2 Remote Execution

Problem: Galaxy assumes that all jobs are run and handled by the machine on which Galaxy is running. This restriction was relaxed with the introduction of CloudMan [1], which allows the user to send jobs to a number of different cloud resources. CloudMan allows for the creation of a runner that submits jobs to a cloud resource. Galaxy's assumption that each task in a workflow is a single job allows for easy mapping from a Galaxy workflow to the scheduled task on a cloud resource. An issue that arose in our implementation is that the tool creates tasks that cannot use cloud resources without having a mechanism to launch them within the Galaxy controlled environment.

Solution: This was solved by utilizing Work Queue as an execution engine. Using

39 Makeflow in conjunction with Work Queue, the tool uses a single port on the Galaxy machine that communicates with Work Queue workers on a variety of platforms. A pool of Work Queue workers was used to supply a steady stream of cloud resources. This allows tools to be used at different sites and with different execution engines. Work Queue utilizes project names and passwords to allow workers to connect without knowing the specific location ahead of time. Project names allow the tools to be configured, either by letting Galaxy name projects dynamically or by letting the user set the name. This allows users to devote resources to a project, even if the resources on the workflow management system are limited.

3.5.3 Garbage Collection

Problem: When running an application, it is important to create a clean namespace for the application to work in, and to clean the work-space after. Galaxy enforces this by creating a job working directory when a tool is launched. Galaxy runs the entire application from within this newly created directory. Tools limit the amount of space used by specifying the absolute path to the scripts, executables, inputs, and outputs. When using this method, tools make no noticeable footprint within the directory and require no clean up. Some applications create temporary files and miscellaneous outputs by default that are not cleaned by the tool. The tool implementation creates partitions of the input, as well as many intermediate files, which are ignored by Galaxy and exist only in the working directory.

Solution: The solution strives to minimize the footprint and clean the directory. At the end of a tool's execution, after the outputs are transferred back, the tool utilizes the Makeflow clean option. This option removes all intermediate files and output files that are located in the current directory. Following this, the only remaining things to clean are executables and inputs that were fully transferred to the directory. It is the responsibility of the developer to clean any files that were

40 explicitly added, as they are neither dynamically created nor removed by Makeflow.

3.6 Opportunities

We also observed two opportunities for better integration between Galaxy and Makeflow, but did not address them in this work: expression of job status and check- pointing.

3.6.1 Expression of Job Status

Problem: Galaxy expresses a job as running, waiting, or complete. However, there is no clear indication of either a job's progress or why a job is waiting. Additionally, the completion status relays to the user only whether a job failed or completed as expected. Failures can be clarified by looking at the logs and system output, but this leaves investigation of the cause to the user. The job running and completion statuses relay to the user the barest amount of information about the job. Some tools, such as GATK, report what is being done and progress through the input. This helps estimate the time needed to finish as well as assure the user that processing continues. These simple estimates would help Galaxy users to better understand the tools and Galaxy's overall progress.

Solution: Our tool implementation does not directly address this design choice, but a simple solution may be found. If, within a tool, a status file were designated, the workflow management system could be configured to follow the tail of the status file. This would allow any tool to create a status file of a supported format to allow the workflow management system to track progress. A simple example of this would be processes reporting their percentage done. The workflow management system would need a means of expressing status in a location other than the output file history. The workflow management system is informed about the location of the working directory and the status file to draw from. After this, tools need to simply

41 create a status file to enable status updates. If a tool did not create a status file, then no more information would be relayed than already provided.

3.6.2 Checkpoints and Partial Failure

Problem: Completion in Galaxy is a binary state, where the job has either failed or completed, but there is no concept of checkpointing or partial failures. This is adequate for tools that are extensively tested and where there is little possibility of system circumstances, such as an intermittent network connection, interfering with completion. Galaxy treats jobs as local Unix tasks, where the process is either done or not. Performance workflows, however, can be represented as partially finished, either as a result of failure or because the workflow is still computing. These partially completed workflows have checkpoints that allow the workflow to be moved and restarted. For example, Makeflow can be rerun using the Makeflow log when intermediate data is present. Though a workflow management system can preserve the logs, it is difficult to preserve the working directory for reuse. A manner of expressing both checkpoints and non-fatal failures would allow the underlying system to better express and handle failures. Also, workflows are easily created and run, but there is no functionality that allows for workflows to be rerun, as you would for a single tool. Utilizing previously computed results in restarting a partial workflow would improve usability and reproducibility.

Solution: The solution would be to define a means of returning either the working directory, if the user has machine level access, or an archive for the user to modify. The user resubmits the job after determining it was a non-fatal error, or modifies the job to correct a user error. This package or directory would then be resubmitted through a generic tool that detects the required tool. This solution fundamentally changes how the workflow management system views tools and would require careful thought for correct design.

42 As Work Queue utilizes network resources, failures that are unrelated to the execution tool may occur. Failures in Work Queue are not necessarily failures in the workflow, but possibly an issue on the resource on which the task ran. In this case, running the Makeflow-Work Queue process again from the same directory and log changes the outcome of the workflow, while not recomputing any previously successful work.

A mechanism within the workflow management system that allowed directly rerunning a workflow introduces a similar issue for workflows. Assuming the workflow management system allowed rerunning workflows directly, utilizing the previous results in a rerun would be as straightforward as using the links within the workflow to determine the work that has been previously done. The same result could be reached by creating a workflow log that would link previous progress to a new run of the workflow.

3.7 Evaluation

To evaluate the combined system, we implemented dynamic job expansion on BWA and GATK as described above, and then ran the combined workload in four different configurations, using a campus Condor [107] pool to provision workers on demand for the expanded jobs. We varied the amount of query data in the first three configurations, using workers configured to each run a single task. In the fourth configuration (described below), the workers were configured to run four tasks at once, sharing a local cache. Figure 3.3 shows the timeline of execution for each configuration. The thin line of each graph shows the number of tasks available to be executed, the thick line shows the number of tasks executing, and the gray bars show data transfer over a 10 second period. Note that the left axis measures tasks ready/running, while the right axis measures data transfers. In the first graph, dotted lines indicate the phases of the logical workflow. In the later graphs, only the BWA and GATK phases are shown.

[Figure 3.3 appears here: four timeline panels, Small, Medium, Large, and Large with Shared Local Caches Workflow Performance, plotting Tasks Ready/Running (left axis) and Data Transfer in MB/s (right axis) against time (HH:MM).]

Figure 3.3. Results of BWA-GATK workflow on various datasets. This shows the execution of BWA, sorting, adding read groups, and using GATK to refine the results. The thin line of each graph shows the number of tasks available to be executed. The thick line shows the number of tasks executing. The gray bars show data transfer during a 10 second period. The left axis corresponds to the two lines, while the right axis corresponds to the data transfer. The four graphs, from top to bottom, show the small, medium, large, and large with shared cache workflows. As you can see we are able to dynamically create partitions that allowed for greater parallelism and performance in both BWA and GATK.

[Figure 3.4 appears here: two histograms of dynamically expanded workflows, binned by the peak number of running tasks (top) and by the total number of tasks (bottom).]

Figure 3.4. Histogram showing the sizes of dynamically expanded workflows. The top image shows a histogram of the number of dynamically expanded workflows that had a maximum of X tasks running. The bottom image shows the number of total tasks that each dynamically expanded workflow had over the course of its execution. All workflows shown were run with Work Queue and managed through Makeflow, and they contain a mix of workflows run concurrently and alone. The difference between the graphs comes from several sources. The first is the tiered nature of the workflows, which allows only a portion of the total tasks to execute at any given point due to dependencies. The second is that for BWA the amount of concurrency was limited to 50 to prevent overloading the network. The last is the potential for better task partitioning and handling in future iterations, as not every situation is perfectly matched with workers.

45 TABLE 3.1

DATASET SIZE AND TASK PER WORKER REFERENCE

Config         Query    Reference   Tasks/Worker
Small          0.6 GB   36 MB       1
Medium         7.5 GB   36 MB       1
Large          32 GB    36 MB       1
Large-Shared   32 GB    36 MB       4

On the largest dataset, the sequential version of BWA ran 19 hours, while the expanded version completed in one hour, 3 minutes. This resulted in a speedup of 18X, utilizing up to 50 workers. Figure 3.3 Row 4 Column 1 shows BWA, with the bold line in front representing the splitting task, the jagged section running BWA, and the end bold line joining the results. The sequential version of GATK ran for 12 days, while the expanded version completed in 43 minutes for a speedup of 402X, utilizing up to 125 workers. (The super-linear speedup comes from keeping the memory consumption of each task within physical memory.) Figure 3.3 Row 4 Column 2 shows splitting, GATK, and joining similar to BWA. The four-job logical workflow is accelerated by 61.5X overall.

In each configuration, it can be noted that neither the available workers nor the running tasks are ever constant. The workers vary due to competition from other users of the shared Condor pool. The variance in running tasks is due to the structure of the workload, and also due to the data transfer between the master and the workers. That is, the master can only dispatch tasks when the bandwidth can support transfer of necessary data to workers.

46 The main barrier to scaling up further is data transfer. For both BWA and GATK, not only must the query and reference datasets be sent, but also the software tools and dependencies such as the Java virtual machine. Each of these unique items is cached at each worker node and reused for future tasks. As the workload progresses, more workers have the necessary data cached, and parallelism can increase.

The first three configurations use worker processes that can execute one task at a time. This turns out to be inefficient, because the physical machines are multi-core and often end up running multiple workers simultaneously, each with its own distinct cache that must be managed. To improve this situation, we reconfigured the workers to consume four cores each, thus sharing a single local cache among four running tasks. The result of this can be seen in the bottom graph of Figure 3.3, where the running tasks grow more quickly and move less data. In this configuration, we saw a 24 percent reduction in outgoing data from the master to the workers, from 69 GB to 52.1 GB. This is substantial, because as the number of concurrent tasks increases, the scalability is limited by the system's bandwidth.

The ideal scenario shown helps to showcase the benefits of this system design. To understand non-ideal situations, we looked at data gathered from the catalog server, which matches workflows with workers, over the several month period of operation to see actual behavior. Figure 3.4 shows a comparison of histograms grouped by the number of tasks in each group, running and total. The disparity between the running and total histograms shows that there are limitations in the current implementation. This is the result of several design decisions. The first is the limited concurrency through BWA. This only allows the running tasks to reach 50, despite the total number of tasks. This was done to limit the load on the network. The second is caused by general traffic on the machine that limited the ability to supply workers with work. These areas provide an opportunity to improve the performance and further limit strain on the system. Figure 3.5 gives a full view of the system and the

Overall, dynamic job expansion has a dramatic impact on the overall performance of the workload, to the point where attention must be paid to improving the performance of the intermediate sequential steps.

Figure 3.5. Scatter-plot of BWA-GATK usage in Galaxy. The image shows a scatter-plot of the concurrency expressed, in terms of the number of tasks, against the concurrency achieved, the number of running tasks. As mentioned above, there are several reasons for the disparity between expressed and achieved concurrency, but it also showcases areas for improvement in workflow creation, which would benefit from modeling functions to better partition the workflows.

3.8 Conclusion

I have illustrated a method through which scientific analysis can be dynamically partitioned for concurrency with limited user knowledge or interaction. This allows the user to utilize workflows for performance with only the small learning curve of understanding the workflow management system, without needing to learn workflow design. Dynamic workflow expansion shifts the control and care of resources to the resource provider, allowing the correct usage and management to be monitored and controlled.

Dynamic workflow expansion defines the static scientific intent and allows the data to be changed for each workflow. Workflows can now be quickly applied to a number of datasets, with the scale updated to meet the computational demands without intervention from the user. This is a first step toward creating abstractions that allow scientific workflows to be used flexibly with many configurations. With dynamic workflow expansion, workflows can be quickly adapted to new data, and it is inevitable that many such workflows will be run concurrently. As workflows are scaled up and more are run concurrently, the increased contention for resources requires careful management to avoid failures. In the next chapter, I examine how static workflows can be analyzed to estimate and manage the resource needs of a workflow, specifically for storage.

CHAPTER 4

STATIC ANALYSIS AND DYNAMIC MANAGEMENT OF WORKFLOW STORAGE

4.1 Introduction

Previously, I evaluated ways to abstract the creation and partitioning of a workflow, without requiring user interaction, using dynamic workflow expansion. As a result, the ease of running several workflows concurrently is increased, which leads to an increased need to manage and track workflow resources. Using these methods, resource limitations can cause workflows to deadlock when the user attempts to utilize more resources than are available, most visibly when hitting storage quotas. To allow these dynamically expanded workflows to be utilized concurrently, methods are needed for calculating and managing these resources more effectively. Cores and memory, which are released on job completion, contrast sharply with storage, which persists as files and logs through the entire execution. As storage reaches its limit, the number of tasks that can run simultaneously is limited by the available storage space. A large proportion of the storage consumed may lie in intermediate files between tasks in a workflow. As a result, the order in which tasks execute has a significant effect on the storage consumption of the workflow as a whole. Additionally, an intermediate file may exist only long enough to be consumed by the next task and then be deleted, or it may persist through the entire workflow execution. Existing work [8] focused on basic reduction of storage usage, using clean-up jobs to delete files which are no longer needed.

This chapter examines methods that use the full view of the workflow, estimating and managing storage with global limits in mind. I examine two coordinated techniques that allow users to manage storage at runtime in user-level workflow-structured applications. First, I present how the graph structure of workflow applications allows me to observe the minimum and maximum storage required by a workflow before execution, enabling a scheduler to judge whether a workflow will be able to run to completion with the available storage or whether more resources must be obtained. Second, I demonstrate an online accounting method by which a workflow manager tracks not only the actual storage use, but also the future storage needed to complete the current graph structure. This method is necessary to avoid deadlock that would be caused by a naive approach. A third technique, explored in 'Combining Static and Dynamic Storage Management for Data Intensive Scientific Workflows' [60], demonstrates several ways by which individual tasks may be monitored and contained so they do not overflow their expected storage consumption. A key challenge there was to identify storage exhaustion so the workflow manager can stop the task and re-plan.

4.2 Definition of Files and Storage in Workflows

Files are divided into three categories: input files, which exist before the workflow begins; output files, which are produced by the workflow and must be retained; and intermediate files, which are created within the workflow but may be removed once they are no longer needed. The size of input files is known in advance, and estimates of the sizes of intermediate and output files are stated in the workflow. Figure 4.1 shows the relationship between a workflow, running tasks, and the available storage. The workflow itself is fed into a workflow manager which dispatches individual tasks to a batch system that selects an execution node for each task. Each task has a task sandbox in which it executes, which must be large enough to contain the inputs, outputs, and auxiliary files needed for that single task.

Figure 4.1. File relationship between workflow and task. In the DAG, files are represented as squares and programs as circles. When a node is submitted, the input files are sent to the task sandbox, and upon completion the output files are retrieved. These files are held in the workflow sandbox, which is managed by the Workflow Management System.

Likewise, the workflow as a whole has a workflow sandbox which must be large enough to contain the inputs, outputs, and intermediate files of the workflow. Typically, the workflow sandbox is stored in a large parallel filesystem used for user data and home directories, such as Panasas [120], Lustre [93], or Ceph [119]. We assume that this storage is large and fast, but of finite capacity. There are two common configurations for the task sandbox. Some systems have storage local to each node, so the task's storage consumption is separate from the workflow's.

Other systems do not have storage local to each node, so the task's sandbox exists in the global filesystem and must be accounted against the workflow's storage. In either case, each task's consumption must be allocated and measured.

4.3 Storage Management Components

In the context of workflow storage management, I make two contributions:

1 - Static analysis of the storage footprint. I present an algorithm which statically determines the storage footprint of a DAG before execution. The algorithm uses a single-pass, bottom-up approach to determine the storage needs for the execution of each task. As the algorithm traverses up the tree it accumulates values that correspond to the storage needs of the task's descendants in a global context. Once the traversal is complete, the set of head nodes is used to determine the overall needs of the DAG, as well as to inform concurrent branches of the loose ordering needed to maintain limits. The algorithm makes a single pass through the graph, limiting the cost of analysis. Utilizing this analysis to prevent deadlock, as opposed to a scheduling algorithm, limits the runtime costs while still providing a precise minimum and maximum data footprint.

2 - Online management of the storage footprint. The dynamic management of the DAG is performed in two ways: storage allocation and task dispatch. The storage allocation works as a transaction hierarchy. When a task is committed, space is reserved for its descendants as well. When a node is selected to run, the data footprint from static analysis provides a limit and the length of that commitment. When a task completes, the appropriate files are deleted, and the committed storage is updated. The task dispatch component works in conjunction with the storage allocation, using the commitment hierarchy and the eligible node's footprint to determine if there is enough available space. If so, the node's footprint is committed and the node is submitted for execution.


Figure 4.2. Diagram showing three example workflows. Three example workflows used to evaluate storage management strategies. Binary Tree is a synthetic workflow which generates 1 GB files in a tree structure, resulting in a large amount of intermediate data. Montage is a widely used workflow benchmark which produces image mosaics from raw astronomical images. BWA-GATK is a bioinformatics workflow that performs alignment and genotyping of sequences related to the oak tree. Each graph shown is reduced in cardinality from the actual workflow in order to more clearly show the general structure.

4.4 The Storage Footprint

I begin static analysis by defining the storage footprint of a workflow and computing the footprint manually for several simple examples. This will serve as a basis for the static algorithm in the next section. For any given workflow, the absolute maximum storage it can consume occurs when all files (input, intermediate, and output) exist simultaneously. If the target storage system has enough space for the sum of all files mentioned in the workflow, then it is not necessary to delete any files during execution, and there is no storage management problem. If storage space is limited, then I may delete intermediate files incrementally, whenever they have been consumed and are no longer needed by any node in the workflow. I define the storage footprint of the workflow to be the maximum amount of storage consumed during execution under the policy that all files are deleted at the first possible opportunity.

The footprint is affected by the concurrency used to execute the workflow. I define two extreme values of the footprint:

• Maximum storage footprint is the largest possible footprint achieved when tasks run with maximum possible concurrency, subject to the workflow ordering constraints.

• Minimum storage footprint is the smallest possible footprint, achieved when only one workflow branch is executed at a time, and potentially concurrent tasks are executed in the order that minimizes the footprint.

By computing these values before executing the workflow, the algorithm can give the user a realistic assessment of the likelihood of success. If the available storage is less than the minimum footprint, the workflow cannot run to completion at all, so the end user is advised to look for another system or acquire more storage. If the available storage is between the minimum and maximum footprint, the workflow can complete but concurrency must be limited dynamically. If the available storage is at or above the maximum footprint, then the workflow can run at maximum concurrency if files are deleted at the first opportunity. Here are the footprints of a few simple workflows:

A → 0 → Z

Example 1: A single task 0 reads an input file A and produces an output file Z. The size of a file X is written |X|. At some point during the execution (however briefly) both A and Z must exist simultaneously, so the footprint of the workflow is the sum of the sizes of the files, |A| + |Z|, which is abbreviated as |AZ|. After the task completes, A may be deleted, but Z remains, so the residual file of the workflow is Z.

A → 0 → M → 1 → Z

Example 2: Two tasks execute in sequence. Intermediate file M is created by executing task 0 with input A, after which file A is no longer needed and can be deleted. Next, output file Z is created by executing task 1 with input M, at which point file M can also be removed, leaving only output file Z at the end. A and M must exist simultaneously, and M and Z must exist simultaneously, so the footprint is max(|AM|, |MZ|), and the residual is the sole output Z.

A → 0 → M and B → 1 → N; then {M, N} → 2 → Z

Example 3: Two tasks combine outputs in a third. This workflow can execute in three different ways:

• Case 1: If tasks 0 and 1 can execute simultaneously, then files A, B, M, and N must co-exist. Files A and B can be deleted, at which point M, N, and Z must co-exist. The footprint is max(|ABMN|, |MNZ|).

• Case 2: If task 0 executes first, then files A, B, and M co-exist, after which file A can be deleted. Then, task 1 executes, so files B, M, and N co-exist. File B can now be deleted. Finally, task 2 executes, so M, N, and Z co-exist. The footprint is the largest of the three steps: max(|ABM|, |BMN|, |MNZ|)

• Case 3: If task 1 executes first, the combinations are similar to Case 2: max(|ABN|, |AMN|, |MNZ|)

As can be seen, the maximum storage footprint occurs in Case 1, while the minimum storage footprint is the smaller of Cases 2 and 3. This presents the runtime manager of a workflow with a trade-off: increased concurrency can result in increased storage consumption. If this is not carefully quantified, storage may be accidentally exhausted.
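To make the trade-off concrete, the orderings of Example 3 can be evaluated directly. The following minimal Python sketch (with illustrative unit file sizes; it is not part of the workflow implementation) computes the footprint of each case as the largest set of files that must co-exist:

    # Illustrative file sizes for Example 3; units are arbitrary.
    size = {"A": 1, "B": 1, "M": 1, "N": 1, "Z": 1}

    def footprint(*steps):
        """Footprint of an ordering: the largest set of co-existing files."""
        return max(sum(size[f] for f in step) for step in steps)

    case1 = footprint("ABMN", "MNZ")          # tasks 0 and 1 concurrent
    case2 = footprint("ABM", "BMN", "MNZ")    # task 0 first
    case3 = footprint("ABN", "AMN", "MNZ")    # task 1 first

    maximum_footprint = case1                 # 4 with unit-sized files
    minimum_footprint = min(case2, case3)     # 3 with unit-sized files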

4.5 Example Workflows

Figure 4.2 shows three workflows that are used as running examples. Each of these workflows can be generated at various scales; the figure shows a low-degree example to make the macro-structure clear. Each presents a somewhat different storage management challenge; I can give a qualitative sense of the footprint by examining the workflow structure.

The Binary Tree workflow is a synthetic benchmark consisting of processes that each consume and produce a single file of 1MB. Each file is consumed by two children until the desired depth d, and then the data is reduced to a single file in a similar manner. If all branches are executed concurrently, the maximum footprint is 2^d + 2^(d-1), when two levels co-exist at once. The minimum footprint is d + 2, when a single branch plus one task must exist simultaneously. Many values in between may occur if the workflow proceeds unevenly. The footprint tends to peak in the middle of execution.
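As a quick check of these expressions, take depth d = 3, the smallest Binary Tree configuration analyzed later in Table 4.2: the maximum footprint is 2^3 + 2^2 = 12 files and the minimum footprint is 3 + 2 = 5 files, consistent with the 12 GB maximum and 5 GB minimum reported there for the depth-3 tree (the generated files are described as 1 GB each in Figure 4.2).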

The Montage [68] workflow computes mosaics of astronomical images, and is widely used as a benchmark for evaluating workflows. It can be generated at a variety of scales by varying the angle of sky (and thus the number of images) to be processed. Although Montage has been previously used to explore storage consumption [49], it is an unusual workflow in that most intermediate files it creates are used by multiple later stages, such that most files cannot be deleted until the final chain of individual jobs. Thus, the footprint tends to increase slowly until reaching a peak near the end of the workflow.

The BWA-GATK [58] workflow combines two common bioinformatics tools into a large-scale parallel application. A large genomic query file is split into task-sized pieces, and the BWA alignment tool aligns the queries to a reference dataset. Then, the GATK tool uses a Bayesian algorithm to compute the quality of successful alignments. The workflow size is increased by adding more data from multiple organisms. This workflow has an irregular footprint over time: each stage of the workflow produces files which are used only once and then may be deleted. In some cases, single files are consumed by single tasks in parallel, so the footprint changes in a fine-grained manner. In other cases, multiple files must be consumed by all tasks in a stage, creating a storage barrier in which nothing is deleted until all tasks complete.

These three workflows are not intended to present a complete profile of workflow behavior, but rather to show common behaviors which may result in deadlock under certain conditions. The static and dynamic algorithms I present are suitable for running these workflows in most distributed computing environments. In this chapter I utilized a centralized batch system due to its widespread use, but the work I present does not rely on a batch system environment. Further, I do not make any assumptions about typical file sizes, workflow behavior, or workflow needs. I do assume the user can accurately estimate the characteristics of their workflow to account for any special needs or storage behavior it may experience during execution.

4.6 Static Analysis Algorithm

The algorithm used to analyze the abstract DAG utilizes a single-pass, bottom-up approach for determining the estimated storage utilization. The algorithm is presented in Algorithm 1. The goal of the algorithm is to determine the storage footprint using information gathered at each node and passed upward.

Algorithm 1. Algorithm to Measure Storage Footprint.

Term                        Definition
n                           Current node under examination
n.descendants               Nodes that utilize a file produced by n
n.children                  Nodes related to n where no other descendant of n is a parent; subset of descendants
n.residual_nodes            List of nodes where all children are used
n.residual_files            Files held until the nearest residual node
n.diff                      Difference in size between n.min_footprint and n.residual_files
n.run_footprint             Files used/created during n's execution
n.diff_order_footprint      Files forming the largest footprint during diff-order traversal
n.wgt_order_footprint       Files forming the largest footprint during weight-order traversal
n.min_desc_footprint        Files forming the minimal space needed to execute to the next residual
n.min_footprint             Files forming the minimal space needed to execute self and children
n.max_desc_footprint        Files forming the maximal space needed to execute to the next residual
n.max_footprint             Files forming the maximal space needed to execute self and children

procedure MeasureStorageFootprint( n )
    for all c in n.children do
        MeasureStorageFootprint( c )
    end for
    n.residual_nodes ← n + ( ∩_{c ∈ n.children} c.residual_nodes )
    if |n.children| < 2 then
        n.residual_files ← n.outputs
    else
        n.residual_files ← max( n.outputs, ∪_{c ∈ n.children} c.residual_files )
    end if
    n.run_footprint ← n.inputs + n.outputs
    n.diff_order_footprint ← ∅
    footprint ← n.outputs
    for all c in n.children sorted by c.diff do
        if |footprint| + |c.min_footprint| > |n.diff_order_footprint| then
            n.diff_order_footprint ← footprint + c.min_footprint
        end if
        footprint ← footprint + c.residual_files
    end for
    n.wgt_order_footprint ← max_{c ∈ n.children}( c.min_footprint ) + Σ_{r ∈ n.children − c} r.residual_files
    n.min_desc_footprint ← min( n.diff_order_footprint, n.wgt_order_footprint )
    n.min_footprint ← max( n.parent_footprint, n.min_desc_footprint )
    n.max_desc_footprint ← ∪_{c ∈ n.children} c.max_footprint
    n.max_footprint ← max( n.parent_footprint, n.max_desc_footprint )
    n.diff ← |n.min_footprint| − |n.residual_files|
end procedure

A residual node is defined as any node that is the lone child of another node or nodes. This is useful as it provides a limit on the look-ahead needed to accurately determine the storage needs of the parent node. If a node's children culminate in a single node, the storage impact of these nodes is limited to the region between the node and the residual node where they culminate. The storage for the node's children can be calculated and saved at the node for future use. The value at the node then represents the space that needs to be committed for guaranteed execution.

Traversal begins from the leaf nodes, where the residual nodes and files include the node itself and its outputs. The only relevant footprint at this location is the run footprint, which is also the minimum and maximum footprint. As I traverse up the DAG, a node can be in one of two states. A node in the first case has only a single child node, at which point the residual nodes, files, and run footprint are equivalent to the child's residual nodes, files, and run footprint respectively. As there is only one possible ordering for child execution, the minimum and maximum footprints are defined as the minimum and maximum, respectively, between the node's and its child's footprints.

In the case where there are multiple children, the ordering can affect the overall footprint. To evaluate the interplay between children, I find the common subset of residual nodes shared by each child; this becomes the residual node set for the current node. Of the remaining uncommon residual nodes, the residual files and the largest footprint are found for each child. The node's residual file set is the union of all children's residual files. I evaluated two methods of traversal, each with its own goal. The straightforward approach is to find the largest footprint among the children and add the remaining children's residual files. The other approach is to order the children by the difference between their minimum footprint and residual files; traversing the nodes in this order favors nodes that consume and release space over nodes that hold their consumed space. The minimum of the two is selected, as both account for executing all of the nodes. The minimum footprint is determined by taking the largest of the current node's footprint and the footprints reported by its children. The maximum is the sum of all the children's maximum footprints.

Figure 4.3 and Table 4.1 show the algorithm applied to a simple workflow. Assuming each file is of size 1, the minimum footprint (ACLMN) occurs if the bottom-most branch of the workflow is executed first. Files LMN exist simultaneously, before being reduced to W, allowing the rest of the workflow to proceed. The maximum footprint (ACDLMNX) occurs when all three branches execute concurrently, so that DLMNX all must exist at once, along with the input files (AC) to the tasks that created them.
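The following is a much-simplified Python sketch of this bottom-up pass for a tree-shaped workflow. It collapses footprints to sizes rather than file sets, omits the residual-node bookkeeping of Algorithm 1, and uses illustrative names; it is meant only to convey the single-pass structure and the ordering heuristic, not the actual implementation:

    class Node:
        def __init__(self, inputs, outputs, children=()):
            self.inputs, self.outputs = set(inputs), set(outputs)
            self.children = list(children)

    def measure(node, size):
        """Bottom-up estimate of (min, max) storage footprint for the subtree
        rooted at node. `size` maps file names to their estimated sizes."""
        run = sum(size[f] for f in node.inputs | node.outputs)
        if not node.children:
            return run, run

        child = [measure(c, size) for c in node.children]      # (min, max) per child
        residual = [sum(size[f] for f in c.outputs) for c in node.children]

        # Maximum: all branches may be active at once, so their maxima accumulate.
        max_fp = max(run, sum(mx for _, mx in child))

        # Minimum: run one branch at a time, favoring branches that release the
        # most space; residual files of finished siblings are held meanwhile.
        order = sorted(range(len(node.children)),
                       key=lambda i: child[i][0] - residual[i], reverse=True)
        held = peak = 0
        for i in order:
            peak = max(peak, held + child[i][0])
            held += residual[i]
        min_fp = max(run, peak)

        return min_fp, max_fp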

Figure 4.3. Worked Storage Example DAG. Task 0 produces file A; tasks 1, 2, and 3 consume A to produce C, X, and D respectively; tasks 4, 5, and 6 consume C to produce L, M, and N; task 7 joins L, M, and N into W; task 8 consumes D to produce Y; and task 9 joins W, X, and Y into the final output Z.

TABLE 4.1

WORKED STORAGE EXAMPLE VARIABLES

Node (Res Nodes)    Min Footprint     Max Footprint       Residual Files   Run Footprint
9  {9}              {WXYZ}     (4)    {WXYZ}       (4)    {Z}       (1)    {WXYZ}   (4)
8  {9,8}            {DY}       (2)    {DY}         (2)    {Y}       (1)    {DY}     (2)
7  {9,7}            {LMNW}     (4)    {LMNW}       (4)    {W}       (1)    {LMNW}   (4)
6  {9,7,6}          {CN}       (2)    {CN}         (2)    {N}       (1)    {CN}     (2)
5  {9,7,5}          {CM}       (2)    {CM}         (2)    {M}       (1)    {CM}     (2)
4  {9,7,4}          {CL}       (2)    {CL}         (2)    {L}       (1)    {CL}     (2)
3  {9,8,3}          {AD}       (2)    {AD}         (2)    {D}       (1)    {AD}     (2)
2  {9,2}            {AX}       (2)    {AX}         (2)    {X}       (1)    {AX}     (2)
1  {9,7,1}          {CLMN}     (4)    {CLMN}       (4)    {LMN}     (3)    {AC}     (2)
0  {9,0}            {AWXY}     (4)    {ACDLMNX}    (7)    {WXY}     (3)    {A}      (1)

The variables in this table correspond to the worked example in Figure 4.3. This worked example is further explored in Section 4.7.2.

TABLE 4.2

STATIC ANALYSIS RESULTS

              Size    Min (GB)   Max (GB)    Abs (GB)   Tasks    Analysis Time (s)
Binary Tree   3       5          12          22         22       0.0015
              5       7          48          94         94       0.0065
              10      12         1536        3070       3070     0.2776
              15      17         49152       98302      98302    10.631
Montage       0.01    0.07       0.12        0.13       35       0.0041
              1       1.87       2.78        2.98       998      1.9558
              1.99    4.01       6.49        9.33       2984     2.3500
BWA-GATK      1       2.69       2.69        3.614      53       0.0117
              5       2.75       13.46       17.99      265      0.0439
              10      2.82       26.93       35.94      530      0.0749
              25      3.06       67.31       89.83      1325     0.1659
              50      3.43       134.63      179.64     2650     0.3082
              100     4.19       269.25      359.24     5300     0.4145
              500     10.25      1346.26     1796.11    26500    2.8163

Each section of this table refers to one of the three workflows used in our analysis. The size column refers to a different attribute for each workflow: the number of split levels in Binary Tree, the degrees of sky being analyzed in Montage, and the number of individual samples included in BWA-GATK. Min shows the estimated minimum footprint, Max the estimated maximum footprint, and Abs the sum of all files in the workflow. Tasks is the number of tasks created in Makeflow. Analysis Time shows the additional time needed to determine the footprint for each case.

Table 4.2 shows the results of applying this algorithm to the three example workflows previously described, as implemented in Makeflow [2]. For various sizes of each workflow, I compute the minimum footprint, maximum footprint, and absolute maximum. The sizes of the three workflows are given as the tree depth for Binary Tree, the degrees of resolution for Montage, and the number of organisms for BWA-GATK. The bold lines indicate the configurations actually run below. As discussed above, the maximum footprint of Binary Tree grows exponentially, the maximum footprint of Montage is close to the minimum footprint, and the maximum footprint of BWA-GATK is roughly linear in the width of the workflow. With the exception of the very large binary tree, the single-pass algorithm executes in a matter of seconds.

4.6.1 Limitations

This static analysis performs well in an organized, consistent environment, but there are several factors that affect execution. The first is untracked files. During execution, programs often create log files or auxiliary status files. These files have limited scientific worth outside of performance and error logs, but they still occupy space. If they are specified in the task then I account for them and remove them when no longer needed. When they are not specified, they clutter space and are never cleaned, causing more contention. Second, when files vary significantly from their expected size, such as log files, the static algorithm does not recompute to account for this. Ideally, I recompute and reallocate using the dynamic management, but if the limit is already close it may be beyond the point where a change will help.

To combat these issues, I have additional mechanisms to help determine if the environment is prohibiting forward progress outside the bounds of our management. This is done by actively cleaning old files and watching the working directory. If the filesystem reports nearly full utilization of storage, error messages are printed to bring the user's attention to the issue, though the static analysis does little to help prevent this.

4.7 Dynamic Storage Management

In the previous section, the static analysis defined the storage bounds of the workflow execution. The dynamic algorithm utilizes the static analysis results to enforce the space reservations needed for future execution. The dynamic storage allocator uses a data structure that tracks the current space utilized by files and the reservations of current and future nodes. The data structure is initialized with a base size, called the base allocation, which is specified by the end user with the help of the static analysis results. The base allocation defines the upper limit of available space for this execution of the workflow. Within the base allocation, reservations are created to hold space for a node and its ancestors. This creates a hierarchy in the data structure of node reservations above their descendant reservations. When a file is created it is accounted for in the current reservation and tracked until deletion.

4.7.1 Dynamic Storage Algorithm

The dynamic algorithm consists of three stages for each node's execution: verification, allocation, and release.

When a node is logically ready for execution, the dynamic storage allocator checks if there is sufficient space to run the selected node. The allocator utilizes the residual node set to compare against the existing data structure. The allocator starts with the lowest residual node (nodes later in the workflow) and compares the data structure with the required space for the residual node. If the residual node reservation does not exist in the data structure, the allocator checks for sufficient space in the base allocation. If the reservation exists, but there is not sufficient space, the allocator checks if the existing hierarchy can be enlarged to reserve the additional size. If there is sufficient space in the reservation hierarchy, the allocator continues up the residual list using the available space to check nodes. If there is not sufficient space, the node's submission is postponed.

If the verification step is successful, the node is allocated and submitted. To allocate a node, its residual nodes are passed to the allocator. The allocator creates a reservation for each residual node that does not exist in the data structure and grows smaller allocations to accommodate it. This proceeds for all residual nodes until a hierarchy exists above the base allocation. In this hierarchy, any lower node is at least as large as the nodes above it. If multiple nodes share a common residual node, the common node is at least as large as the sum of the higher nodes. In this way, space is reserved for the widest part of the ancestors to guarantee space.

After execution, the allocator must release the completed reservations. During this release stage, files that are no longer needed are deleted and their space is marked as available in the data structure. The reservation for the completed node is released, and files within the reservation are transferred to the containing reservation. These files continue to exist until they are no longer needed. If output files are larger than the existing reservation, the reservation attempts to grow to accommodate the increased size. This can cause deadlock or failure from resource exhaustion. Once these changes occur, the static analysis is outdated and can no longer guarantee execution.
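A minimal sketch of these three stages is given below, flattening the reservation hierarchy described above into a single pool; the names and exact accounting are illustrative rather than the actual Makeflow implementation:

    class StorageAllocator:
        def __init__(self, base_allocation):
            self.base = base_allocation   # limit chosen with help of the static analysis
            self.files = {}               # live files -> size in bytes
            self.reservations = {}        # node -> bytes committed for it and its descendants

        def committed(self):
            return sum(self.files.values()) + sum(self.reservations.values())

        def verify(self, footprint):
            # Verification: would committing this footprint stay within the base allocation?
            return self.committed() + footprint <= self.base

        def allocate(self, node, footprint):
            # Allocation: reserve the footprint; the node may then be submitted.
            if not self.verify(footprint):
                return False              # postpone submission
            self.reservations[node] = footprint
            return True

        def release(self, node, outputs, consumed):
            # Release: record outputs, delete files used for the last time, free the reservation.
            del self.reservations[node]
            self.files.update(outputs)    # {filename: size}
            for name in consumed:
                self.files.pop(name, None)

As in the full algorithm, the committed value is an upper bound: space reserved for a node is held until its release stage, so verification is conservative rather than exact.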

4.7.2 Worked Example

Figure 4.4 demonstrates the dynamic algorithm using the earlier worked example (Figure 4.3 and Table 4.1).

Step 1: With Node 0 available to run, the algorithm checks the stack to ensure there is enough space. Traversing the residual nodes of Node 0, Node 9 is added to the stack with the size needed for the nodes between 0 and 9. Node 0 is then reserved on top of Node 9. Once executed, File A moves from the space reserved by Node 0 to Node 9, and Node 0's reservation is removed.


Figure 4.4. Diagram showing the dynamic allocation data structure. Shown is the execution of the worked example from Figure 4.3. Numbered boxes represent storage allocations for nodes. Lettered boxes represent allocations for specific files. Boxes with slashes through them are allocations being removed, because either the node is complete or the file is no longer needed. Allocations are removed when no longer relevant.

Step 2: With File A existing, Nodes 1, 2, and 3 are available to run. The algorithm checks Node 1. Traversing Node 1's residual nodes, Node 9 is reserved. With the remaining space in Node 9, the algorithm can fit Node 7 and Node 1. A reservation for Node 7 is added above Node 9, and a reservation for Node 1 is created on Node 7. After completing Node 1, File C is created and Node 1's reservation is removed.

Step 3: The available nodes to run are Nodes 2, 3, 4, 5, and 6. There is not enough space for Nodes 2 and 3. However, in the reservation for Node 7, there is space for Nodes 4, 5, and 6, which is reserved. Upon completion, files L, M, and N are created and the node reservations are released. File C is no longer needed and is removed.

Step 4: With Nodes 2, 3, and 7 available, we again check which nodes will fit. The reservation for Node 7 already exists, so Node 7 is executed. After outputting file W, files L, M, and N are removed along with Node 7's reservation.

Step 5: Space now exists for Nodes 2 and 3, so reservations are created for Nodes 2, 3, and 8. After running, file A is removed, and reservations 2 and 3 are released.

Step 6: The reservation for Node 8 exists. File Y is created. File D is removed, and Node 8's reservation is released.

Step 7: The reservation for Node 9 exists. File Z is created. Files W, X, and Y are removed. Node 9's reservation is released.

Step 8: The final output File Z exists. The workflow is complete.

4.7.3 Impact of Local Storage

Local storage can affect dynamic storage management in several ways. First, if the local storage utilized for execution is accounted for within the same quota, then the dynamic management needs to adjust. In this case, I need to account for both the active space utilized during execution and the space consumed by caches used for data movement. In the former case, when a node is allocated, the requisite space needed for execution is added to the allocation but removed after execution. Caches are more difficult, as they are essentially copies of the existing data and may persist between executions, increasing the complexity. One method for handling this is to set a limit on the size of each cache such that old files are removed and the space is statically accounted for.

Second, in cases where temporary space is unaccounted for, such as scratch spaces or local temporary directories, I do not need to account for the space. These resources can still become contentious. In these cases I rely on the remote execution to report on limited space in order to maintain limits. Unfortunately, system-wide storage contention on scratch storage is not in the purview of this chapter. Currently, I do not consider either case, as our remote execution happens on local temporary space outside of the quota, though inclusion of these factors may become necessary in some execution environments.

Figure 4.5. Storage limits applied to example workflows. Timeline of storage consumption for each of the three example workflows, in different configurations. The top row shows uncontrolled executions which exceed the desired limit. The second row shows naive storage limits which can result in deadlock. The third row shows a limit applied using the dynamic algorithm. The fourth row shows the minimum storage footprint enforced. In each graph, the dark area shows storage actually consumed, while the light area indicates storage committed to future use.

4.8 Overall Evaluation

Figure 4.5 shows the behavior of the dynamic algorithm implemented in Makeflow. The workflows are evaluated by submitting each task to Work Queue, with up to 100 tasks running simultaneously on remote nodes. For each, I chose a storage footprint limit (dotted line) that is larger than the minimum footprint (solid line) but less than the maximum footprint. The dark shaded area on each graph shows the storage actually consumed, while the light shaded area shows the “committed” storage value.

4.8.1 Configuration

Each workflow was created and run using the emboldened configuration entry in Table 4.2 for the corresponding tool. All three workflows executed using a shared filesystem accessed by batch job workers. Each worker was allocated a single core and at least 8 GB of memory. The amount of storage allocated to the worker was based on the sizes from Table 4.2. The machines used to run the batch job workers were part of a campus-scale cluster and thus varied in their hardware configurations; however, the method used to acquire batch job workers ensured each worker had access to the requested amount of resources in order to begin work. Each workflow is run in four different configurations:

The No Limit row shows each workflow run with maximum concurrency and no explicit storage limits. The ordering of tasks is solely due to logical constraints, and the exact concurrency achieved depends upon the performance of the tasks in the batch system. As can be seen, each workflow meets or exceeds the minimum footprint, and in the absence of control, exceeds the desired limit or deadlocks.

The Naive row shows the workflow system attempting to enforce the desired storage limit by simply examining each file as it is submitted. Storage is committed for the immediate output files of a task when it is submitted. Tasks are only submitted if the committed value can be kept below the limit. This can result in deadlock (as shown) if the workflow manager commits too much space and is unable to execute tasks to consume created files.

We note that both the binary workflow and BWA-GATK experience deadlock using the naive strategy given the storage limit. The binary workflow's rapid expansion behavior caused the naive approach to very quickly exhaust its storage. In BWA-GATK, deadlock occurred when a group of the most data-intensive tasks in the workflow were dispatched at the same time. Some of the data-intensive tasks remained but could not be scheduled because there was not enough storage available.

The Dynamic row shows the workflow system using the dynamic algorithm described in this section to limit the submission of new tasks. We note that the committed storage increases more rapidly than in the naive case because the footprint of each task is committed before submission. The storage actually consumed does not always rise to the level of the committed storage, which is an upper bound on all possible executions following each node.

The Min row shows the workflow system using the dynamic algorithm, but with the limit set to the minimum possible value. As previously mentioned, the Montage Min graph utilizes the file list approach to allocation management.

Overall, we observe that the dynamic online algorithm is able to enforce the desired constraints without falling into deadlock. As noted above, the committed value is an upper bound on actual consumption. Even so, there is still room for improvement in that bound. The difficulty in optimally utilizing an upper bound is tied closely to the runtime behavior of the workflow. Improvements could be made if the execution time of tasks were combined with the expected growth of a branch in order to overlap branch executions more smoothly. The footprint is currently defined as a means of preventing over-commitment. Optimal storage use focuses more closely on scheduling as a solution and relies on consistent execution behavior, while static footprint analysis relies only on file sizes.

During real execution, we expect the static analysis and dynamic storage algorithms to have broad appeal. The static algorithm can help a user better understand the needs and behavior of their workflow, while the dynamic algorithm will help keep the workflow constrained when necessary. The runtime task constraint using loop devices is another way of guaranteeing that the user's storage requirements are not exceeded. We believe all three tools have general appeal to scientific workflows regardless of the workflow's behavior or size. The limiting factor of our tools' appeal comes from the user's storage needs.

4.9 Conclusions

I have illustrated two techniques for managing storage space in workflows: a static algorithm for offline analysis of storage consumption, and a dynamic algorithm which enforces runtime limits while avoiding deadlock. The dynamic algorithm provides an upper bound on consumption and can overestimate in cases where the same file is used in multiple places in the workflow. Utilizing these methods, we are able to see how storage can be estimated and tracked from a full workflow context.

In the larger context, these methods provide a means of managing workflow resources and preventing overuse by workflows. Both of these methods can be used to limit the effects of moving between sites and changing data configurations. More importantly, these methods provide a means of abstracting knowledge and control of workflow-specific resources to a higher level, such that workflow management systems or batch execution systems could leverage this information to prevent deadlock and failures. Using these techniques with dynamic workflow expansion allows workflows to safely expand and execute on new sites with mechanisms to analyze and manage resources, further increasing the flexibility of the workflow.

Using the dynamic expansion and resource management methods, we can quickly move to new sites with larger data. However, as we change the execution assumptions it is unavoidable that we encounter a site with a different environment than was originally tested. When these differences diverge more dramatically from the original implementation, methods are needed to adapt workflows to new sites. I next describe an algebra to adapt workflows to different environments, further increasing workflow flexibility and portability.

CHAPTER 5

AN ALGEBRA FOR ROBUST WORKFLOW TRANSFORMATIONS

5.1 Introduction

In the prior chapters I explored methods for dynamically creating and managing workflows. These methods allow workflows to be generated as needed for different datasets on previously unused sites. They provide a robust and flexible solution to creating workflows and managing their resource needs, but they rely on the site to provide a compatible execution environment. Differences between execution sites make porting workflows difficult and debugging complex. It is common for workflows to assume that libraries and programs are available, to use applications configured for a single operating system, or to rely on unspecified configurations, any of which may cause workflows to fail on a different site. Accommodating each site's configuration requires a number of unique transformations to properly execute a workflow. The tasks themselves do not change, but the environment, error handling, and configuration may.

A typical use case is the need to deploy the same operating system and software stack on several available compute sites. Unfortunately, each site may have a unique operating system or lack the necessary software requirements. Users need a way to quickly switch between sites without rewriting the workflow for each one. A possible solution is to use containers. But how does a user easily apply these containers to tasks? This is further complicated when each site may use different container technologies (e.g., Docker [89] and Singularity [74]).

The ability to combine available tools is required to handle unique configurations and environments. As the number and variety of tools increases, the complexity of combining them increases as well. For example, if Singularity and a custom script are both applied to a simple task by prepending their commands, characteristics of execution such as the exit status, the provenance of files, and the final executed command become opaque. Properly nesting the container inside of a script allows for differentiating failures, debugging, and consistent execution. Each additional layer must become a more nuanced transformation as nesting technologies, such as containers, resource monitoring, and error handling, becomes necessary. Different combinations of tools are required depending on the site's unique configuration. The variable nature of the required tools indicates the importance of only applying tools to a workflow as needed, rather than adding them to the workflow specification at each site.

To address this problem, I define an algebra for workflow transformations that addresses the complexity of nesting different tools and technologies. Based on the sandbox model of execution, this algebra formalizes the operations for applying transformations to tasks, producing new tasks. These transformations can then be applied in series to produce a task that incorporates all applied transformations. Using formalized task transformations, I am able to precisely apply multiple transformations to a workflow and cleanly map them to each task. This algebra was expressed using JSON so that it is independent of (and therefore portable to) a variety of systems. Using this JSON expression, a driver was written in Makeflow [2] that allows us to apply transformations to a full workflow. I will discuss the challenges in applying transformations and how these methods can be applied incorrectly or incompletely.

To show the efficacy of this solution I present several case studies. The first uses a Singularity container to provide consistent environments, a resource monitor to give accurate usage statistics, and a sandbox to isolate the available files and workspace. The second shows a failure handler that captures a core dump and converts it into a stacktrace, streamlining analysis and lowering data transfer. The final case study executes the same workflow on several sites using an environment builder that dynamically builds the required software at each task.

5.2 Challenges in Transforming Workflows

A workflow primarily describes the researcher's work: to run a set of simulations, to analyze a dataset, to produce a visualization, and so on. However, like any kind of program, there may be a number of secondary requirements that must be met to complete the work: a particular software environment should be constructed, resource controls for the batch system must be selected, monitoring and debugging tools should be applied to the task, and so forth. This might involve setting environment variables, providing additional inputs, capturing additional outputs, invoking helper processes, and more.

The first version of a workflow, constructed at a particular computing site, may have all of these aspects intertwined with the definition of the tasks to be done. The application may depend upon software environments installed at fixed paths in a shared filesystem. Environment controls may be set within individual tasks. Resources may be hard-coded for a particular batch system. The graph structure may reflect the current set of debugging tools enabled. While this may work well at the first site, it may become necessary to move the workflow to another site in order to improve performance, increase scale, or apply the workflow in a new context. All of these site-specific controls are unlikely to work in the new context, and the receiving user is then stuck with the problem of disentangling the core code from the local peculiarities.

An appealing approach to this problem is to define simple modifications that can be individually applied to tasks (transformations) in order to achieve specific local effects. For example, one might have a transformation to run a task in a container environment, another transformation to perform monitoring and troubleshooting, and a final transformation to configure a software environment for the local site.


Figure 5.1. Diagram showing an example workflow. This basic workflow shows standard split-join behavior. The first task partitions the work, the next set of tasks analyze the individual partitions, and the last task joins them all back together. Each task executes independently of the others, and tasks are often run on batch execution systems.

With this approach, the scientific objective of the workflow can be expressed in a portable way. A set of external transformations is used to modify the tasks as needed for the local site. Porting a workflow from one site to another becomes the simple job of adjusting a few transformations rather than rewriting the workflow from scratch. If it is necessary to transform the workflow in a new way, a new transformation can be written and shared with others so that it can be applied to many workflows.

However, our experience is that designing and using transformations is not so easily done. What may seem like a simple and obvious transformation can end up creating complex interactions and incorrect results.

As a simple example, suppose that I want to run each task inside a Singularity container named centos.img. At first, this sounds as simple as prepending singularity run centos.img to each command string and then running the task. While this works in limited cases, the general case of workflows with complex task definitions fails. There are several reasons for this:

Substitution semantics. Using basic string substitution to embed one command inside another often complicates execution. Commands that use input/output redirection, consume files, or change the environment collide under basic substitution. Addressing this uncertainty with shell quoting only further complicates the matter, and may change the execution.

Workflow modifications. Applying a transformation to a task not only changes the individual task, but may also have an effect on the global structure of the workflow.

A command transformed by a container now has an additional input (i.e., centos.img) which must be accounted for as a dependency in the workflow. Container images are large and affect the scheduling and resource management of the workflow. In a similar way, the container produces additional outputs which must be collected and managed by the workflow.

Namespace conflicts. Transformations can modify the local filesystem namespace. Log files with fixed names, temporary generated files, files based on task input files, or modifications to the working directory all alter the task and the workflow. These actions blindly modify files outside the workflow or cause race conditions with other concurrent transformations. Since it is not always possible to alter these hard-coded paths into unique filenames, collisions are inevitable.

Troubleshooting complications. The exit semantics of a transformed task are complex, as it is not sufficient for a transformation to simply return the task's integer exit status. Each exit status should be differentiated, as transformations may fail separately.

For example, preparing the environment may fail because a necessary software dependency is not present, a container may fail when pulling the container image over the network, or a resource monitor may exit when resources are exhausted. In each of these cases, the user must have a means of distinguishing between transformation failure and task failure. When multiple transformations are applied, the result of the task looks more like a stack trace than a single integer.

To address these challenges, I need a more rigorous way of defining tasks and the transformations on those tasks, such that any valid transformation applied to any valid task gives the expected result in a way that can be nested. In short, I need an algebra of workflow transformations in order to make scientific workflows more robust, portable, and usable.

5.3 An Algebra of Workflow Transformations

I designed a formal abstraction to accommodate the execution behavior of various tools. This formalism isolates each transformation for consistent execution, allowing for organized nesting. In particular, our abstraction describes how to define a transformation for a given tool, as well as which aspects of execution to consider. Transformations are based on the sandbox model of execution, which describes all aspects of execution for which a transformation is responsible.

5.3.1 Notation

For the purpose of expressing tasks and transformations in a precise way, I use a notation that is based on JavaScript Object Notation (JSON). In addition to the standard JSON elements of atomic values (true, 123, "hello"), dictionaries { name: value }, and lists [ 10, 20, ... ], I add:

• let X = Y is used to bind the name X to the value Y.

• define F(X) = Y defines a function F that will evaluate to the value Y using the bound variable X.

• Simple expressions can be built up using standard arithmetic operators and function calls on values and bound variables.

• eval X evaluates the expression X and returns its value.

Using this notation, a single task (T1) in a workflow is expressed in JSON like this:

let T1 = {
    "command": {
        "pre":  [ ],
        "cmd":  "sim.exe < in.txt > out.txt",
        "post": [ ],
    },
    "inputs"      : [ "sim.exe", "in.txt" ],
    "outputs"     : [ "out.txt" ],
    "environment" : { },
    "resources"   : { "cores": 1, "memory": 1G, "disk": 10G }
}

Figure 5.2. Basic task defined using JSON.

Note that the schema is fixed. Every task consists of a command with a pre, cmd, and post component, a list of input files, a list of output files, a dictionary of environment variables, and a dictionary of necessary resources. Also note that the formal list of inputs and outputs is distinct from the command-line to be executed, as guessing the precise set of files needed from an arbitrary command-line is difficult.

For example, a program might implicitly require a calibration file calib.dat and yet not mention it on the command line. The base task's list of inputs and outputs is drawn from the structure of the DAG by the workflow manager.

5.3.2 Semantics

Makeflow allows tasks in this form to be executed on a wide variety of execution platforms, including traditional batch systems (such as SLURM [69], HTCondor [80], and SGE [45]), cluster container managers, and cloud services. Because each of these systems differs in considerable ways, it is necessary to define precise semantics for the execution of the task and the namespace in which it lives. Once these semantics are established, it becomes possible to write transformations that work correctly regardless of the underlying system. To accommodate these varied systems, I introduce the sandbox model of execution.

The sandbox model of execution isolates the environment and limits interactions to only the specified files. Isolating the task to run in only the specified environment allows greater flexibility in where the task can run, as well as increasing the reproducibility of execution. Limiting the locally available files helps prevent undocumented file usage, enforcing the accuracy of the file lists.

Applying a sandbox to a task is a multi-step process for ensuring consistent environment creation:

1. Allocate/ensure appropriate space for execution, based on resources.

2. Create sandbox directory.

3. Link/copy inputs to ensure correct in-sandbox name, based on inputs.

4. Enumerate environment variables based on the specified environment.

5. Run task defined command, using pre, cmd, and post.

6. Move/copy outputs outside of sandbox with appropriate out-sandbox name, based on outputs.

7. Exit and destroy sandbox.

Figure 5.3. Diagram showing the sandbox model of task execution. This shows the different steps needed to isolate the task from the underlying workflow environment to prevent side-effects on the environment and filesystem.
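A minimal Python rendering of these seven steps is shown below; the actual system realizes them by generating shell scripts (Section 5.3.4 and Figure 5.6), so this is only an illustration and the helper names are assumptions:

    import os, shutil, subprocess, tempfile

    def run_in_sandbox(task, workdir="."):
        """Execute a task dictionary (shaped like Figure 5.2) under the sandbox model."""
        # Steps 1-2: ensure space (not checked here) and create the sandbox directory.
        sandbox = tempfile.mkdtemp(prefix="task-", dir=workdir)
        try:
            # Step 3: link inputs into the sandbox under their in-sandbox names.
            for f in task["inputs"]:
                os.link(os.path.join(workdir, f), os.path.join(sandbox, os.path.basename(f)))
            # Step 4: enumerate environment variables from the specified environment.
            env = dict(os.environ, **task["environment"])
            # Step 5: run pre, cmd, and post in order inside the sandbox.
            for cmd in task["command"]["pre"] + [task["command"]["cmd"]] + task["command"]["post"]:
                subprocess.run(cmd, shell=True, cwd=sandbox, env=env, check=True)
            # Step 6: move outputs back out of the sandbox.
            for f in task["outputs"]:
                shutil.move(os.path.join(sandbox, f), os.path.join(workdir, f))
        finally:
            # Step 7: destroy the sandbox.
            shutil.rmtree(sandbox, ignore_errors=True)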

5.3.3 Transformations as Functions

A transformation is an abstraction of a task, and provides the information needed to translate a raw program invocation into a properly defined task. A transformation contains the same fields defined in a task: a command, inputs, outputs, resources, and an environment. However, it is an incomplete task with unbound variables that are resolved when it is applied to a task as a function.

Figure 5.4 illustrates Singularity written as a transformation. As mentioned above, the generic definition of the transformation contains unbound variables such as T.cmd, T.inputs, and T.outputs. When the transformation is applied to a task, those variables are bound from the task's structure. Singularity requires additional space (3G) to account for the Singularity image; here, the disk resource is not defined as a static value, but as an addition to the underlying task's resource. Additionally, the Singularity transformation does not define an environment, so it is left out.

define Singularity(T) {
    "command" : {
        "cmd": "singularity run image " + T.script + " > log." + T.ID
    },
    "inputs"  : T.inputs + [ "image", T.script ],
    "outputs" : T.outputs + [ "log." + T.ID ],
    "resources" : {
        "disk" : T.resources{disk} + 3G
    }
}

Figure 5.4. JSON defining the abstract Singularity transformation. Describes the Singularity command, added files (such as the image and output log), and increases the required disk space. Note that several of the variables are unbound and will be resolved when applied to a task. Unaltered fields are left undefined.

The resulting task of evaluating Singularity(T1) can be seen in Figure 5.5. The previously unbound variables have been resolved, such as T.inputs becoming ["sim.exe", "in.txt"]. The values that were not defined or extended by Singularity were resolved from the underlying task, such as cores and memory. Importantly, to create a valid task even empty fields like pre, post, and environment are still specified, allowing for evaluation and for additional transformations to be applied. If you look carefully at Figure 5.4 you will notice two variables not bound by the underlying task directly: T.script and T.ID.

eval Singularity(T1) yields {
    "command" : {
        "pre" : [ ],
        "cmd" : "singularity run image t_ID.sh > log.ID",
        "post": [ ]
    },
    "inputs"  : ["sim.exe", "in.txt", "image", "t_ID.sh"],
    "outputs" : ["out.txt", "log.ID"],
    "environment" : {},
    "resources" : {
        "cores"  : 1,
        "memory" : 1G,
        "disk"   : 13G
    }
}

Figure 5.5. Resulting task of applying Singularity to T1. The transformed task has all of its variables bound. The file lists combine the previously defined files with the files added by Singularity. The resources are resolved, and the required values account for both the original task and the transformation.

As part of the abstraction, the underlying task is emitted as a script that is called in place of the command, creating T.script.

The ability to treat transformations as functions is achieved by isolating each transformation as a separate process. Isolating a transformation provides several key benefits: the ordering of transformations is clearly defined, instantiated environments persist only in that process and its children, and exit status can be attributed at each level to track failures. In practice this is achieved by producing a script that defines the task, as seen in Figure 5.6.

The second variable, T.ID, is key to this method's success, as the ability to uniquely identify each task provides a clear mapping to the workflow. A unique identifier is created using the checksum of the current task, which incorporates the command, the input files' names and contents, the output files' names, the environment, and the resources. This identifier names the emitted script and can be used by a transformation to uniquely identify files in the workflow. Additionally, as applying a transformation produces a new task, the identifier is updated after each transformation.
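As a rough illustration of how such an identifier could be derived (the file names and the hashing tool here are hypothetical, not the exact implementation), one could checksum a concatenation of everything that defines the task:

# Hypothetical sketch: derive a task identifier by hashing everything that
# defines the task, so any change to the task yields a new identifier.
ID=$( {
    echo "sim.exe < in.txt > out.txt"                   # the command
    echo "inputs: sim.exe in.txt"; cat sim.exe in.txt   # input names and contents
    echo "outputs: out.txt"                             # output file names
    echo "environment: {}"                              # environment
    echo "resources: cores=1 memory=1G disk=10G"        # resources
} | sha1sum | cut -d ' ' -f 1 )
echo "task identifier: $ID"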

#!/bin/sh
#ID TASK_CHECKSUM

# POST function
POST(){
    # Store exit code for use in analysis.
    EXIT=$?

    # Run post commands.

    # Exit with stored EXIT, which may
    # have been updated by post.
    exit $EXIT
}
# Trap on exit and call POST.
trap POST EXIT INT TERM

# Export specified environment.

# Run pre commands.

# Run core command.
sim.exe < in.txt > out.txt

Figure 5.6. Script created when evaluating Singularity(T1).

5.3.4 Applying the Sandbox Model

The creation of a script from a task focuses on isolating just the transformation, but relies on finalizing the task sandbox laid out in Section 5.3.2. To consistently apply the sandbox model to a task, I define a sandbox procedure that produces a script which creates a sandbox, handles files, and runs the command. This procedure is applied to a task prior to execution to isolate the task to a single sandbox directory. The procedure begins by creating a unique identifier, based on the task checksum. The identifier is used to create the sandbox and script names used in execution. In the script, a POST function captures the exit status, executes post commands, and returns the outputs. This function is set as a trap so that it can also analyze failures. Next, the sandbox is created and inputs are linked into it. The process changes directory into the sandbox, exports the environment, and runs the pre commands. After this, the environment is set up and the task cmd can run.
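The following is a minimal sketch of such a sandbox script in plain shell; the task script name (t_ID.sh), file names, and environment variable are placeholders rather than the exact script that Makeflow emits.

#!/bin/sh
# Sketch of the sandbox procedure: create an isolated directory named by the
# task identifier, stage inputs in, run the task script, and stage outputs out.
ID=TASK_CHECKSUM                          # placeholder for the task checksum
SANDBOX="sandbox.$ID"

mkdir -p "$SANDBOX"                       # steps 1-2: allocate space, create the sandbox
ln sim.exe in.txt t_ID.sh "$SANDBOX"/     # step 3: link inputs under their in-sandbox names

cd "$SANDBOX"
export EXAMPLE_VAR=value                  # step 4: enumerate the specified environment
./t_ID.sh                                 # step 5: run the task (pre, cmd, and post)
STATUS=$?
cd ..

mv "$SANDBOX/out.txt" .                   # step 6: move outputs to their out-sandbox names
rm -rf "$SANDBOX"                         # step 7: destroy the sandbox
exit $STATUS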

5.4 Transformations in Practice

In applying the above algebra, there are design considerations to be made. To maintain the ability to nest several transformations together, it is important to consider naming conflicts, the importance of differentiating pre, cmd, and post, file management, resource specification, and how the environment of a task is extrapolated.

5.4.1 Composability versus Commutability

An important aspect of this algebra is the ability to reason about how combinations of different transformations interact and whether they can be applied to create a valid task. Using the previously defined application of transformations, I find that the set of transformations is composable, but not commutable. These transformations are not commutable because the order in which they are applied changes the core evaluation of the task. This is by design, to allow for the differentiation of transformation ordering. Transformations, in general, are composable. Any transformation can be applied to any task and produce a valid task, with the exception of static name collisions. A static name collision can result when an application uses hard-coded or default names for files, careless naming, or even randomly generated names. Running a single transformation at a time may not cause a collision, but nested transformations and concurrent tasks make collisions inevitable, as is often seen with output logs and files sharing names between tasks.

Naming is resolved at the local level by detecting when applying a transformation creates overlapping names. If collisions are detected, the transformation is not applied and a failure is returned. Though this restricts some combinations, it can be overcome by better understanding the application and using options to produce unique files. However, if the same restrictions were applied to tasks across the workflow, transformations with static names would be prohibited entirely. As this may be unavoidable, static files may be remapped to a unique name in the workflow. As each task is isolated in a sandbox, static files can be renamed when moving to the global namespace using the task identifiers. Remapping a file relies on a more verbose file specification: a JSON object instead of a string filename. The JSON object specification enables the wrapper to specify an inner name, giving the name inside the sandbox, and an outer name, giving the name in the workflow context. An example of how this looks with a statically named file can be seen in Figure 5.7, which defines a resource monitor transformation.

define RMonitor(T) {
    "command" : {
        "cmd" : "rmonitor -- " + T.script
    },
    "inputs"  : T.inputs + ["rmonitor", T.script],
    "outputs" : T.outputs + [{"outer_name" : "summary." + T.ID,
                              "inner_name" : "summary"}]
}

Figure 5.7. Verbose JSON object file specification. In this example the resource monitor uses a statically named default summary file, "summary". The summary file is statically named and would collide in the global workflow context. To avert this collision, the file is specified with its static inner name and a unique outer name using the task's ID.

5.4.2 Command Description

Commands express the setup, execution, and post-processing of a task. Commands are broken into three parts, pre, cmd, and post, based on the command structure outlined above. Pre is a set of commands that run prior to task invocation and set up the task sandbox. This includes setting environment variables, configuring dependencies, and loading modules or software. For example, a Docker transformation would use pre to load or pull images. Post is a set of commands that run after task invocation and is used to handle failures by interpreting or masking them, to create outputs so that missing files do not cause batch system failures, or to validate the correctness of outputs. Post can differentiate Docker failing to load an image from a task execution failure, allowing more nuanced debugging. The cmd string outlines the context in which the underlying command is invoked.

Figure 5.8. General approach to the sandbox model of execution. The environment that exists at task execution is the result of several sources. The environment starts at the DAG, where variables are resolved internally and from the host machine. These values define the task's initial environment. Transformations applied to this task extend the environment, but are only applied at execution. At the execution site, the environment is defined by the execution node and the batch system. As execution starts, each transformation is applied and invokes its environment, limiting its effect to that transformation's execution.

Cmd outlines how the underlying task is called and isolates the effects of the calling transformation. A benefit of separating the command into these parts is that it allows us to differentiate the failures or problems that result from each part. This is useful for determining that the setup of a container failed so the task should not run, or for preventing the failure of post analysis from falsely indicating a task failure. This separation also allows each transformation to be clearly expressed in a script, enabling simplified debugging.

5.4.3 File List Management

As transformations are applied, the list of inputs and outputs grows. For the correct organization of transformations, it is key that the set of required files is outlined by the task structure, allowing the submitting system to confirm required inputs and verify expected outputs. It is possible for a transformation to rename or mask an existing file in the list. By doing so, the transformation changes the context of the task when evaluated. This can be done to redirect shared files or to use installed reference material. Maintaining a correct set of files helps prevent task collisions. This information can also be used to map a pre or post application onto the files, estimate the space needed for execution, or log the files for later analysis.

5.4.4 Resource Provisioning

The resources define the allocation necessary for proper task execution. This value is extended and augmented by transformations as the context and required resources change. Commonly, as transformations are applied, additional disk space is needed to store new files (like container images). Resource provisioning may not only be additive as transformations are applied, but also maximal. This is typically the case for cores: the number of cores does not grow as transformations are added, but reflects the largest number of cores needed by any transformation. For example, MPI utilizes a static number of cores, and to reflect that, the resource specification uses the maximum of the provided value and the previous specification. The resources required for a task thus track the largest value of each maximal resource and the sum of each additive resource. After the transformations have been applied, the final task contains a single specification reflecting the total expected usage.
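A hypothetical shell helper, not part of the actual implementation, makes the additive-versus-maximal distinction concrete:

# Illustrative only: combine a task's resources with a transformation's.
# Disk requirements accumulate across layers; cores take the maximum.
combine_resources() {
    task_cores=$1;  task_disk=$2      # resources already on the task (disk in GB)
    xform_cores=$3; xform_disk=$4     # resources requested by the transformation
    if [ "$xform_cores" -gt "$task_cores" ]; then
        cores=$xform_cores
    else
        cores=$task_cores
    fi
    disk=$(( task_disk + xform_disk ))
    echo "cores=$cores disk=${disk}G"
}

combine_resources 1 10 1 3   # e.g. Singularity adds 3G of disk -> cores=1 disk=13G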

5.4.5 Environment Elaboration

An important aspect of a task is the environment in which it executes. The environment defines a variety of values that control things such as available executables, required libraries and values, or even the machines available to a cluster execution node. However, the environment is often overlooked or ignored by the researcher, which can cause corruption, errors, and failures. This can be addressed directly on a single site, but as more sites are utilized, managing these environments becomes unrealistic. Here I discuss how the task environment is defined, when transformations are applied, and how the final environment is resolved at execution. The environment provided to a task varies between sites and evolves as a task is transformed and evaluated. The workflow is executed in the context of the submitting machine's environment (ES). As the workflow is evaluated, the environment is defined internally and from ES, resulting in ED. Tasks are created, and the environment specified by each task is derived from ED, resulting in ET. ES is not included directly, as its variables would be incorrect or reference non-existent programs, libraries, and values at execution. After the task is produced, transformations are applied that may append, update, or mask the provided variables. As a transformation is applied, ET is written out to a script. The transformation can also use the values set in ET to evaluate its own environment. Applying these transformations produces a chain of environments ETr

(Ea, Eb, Ec in Figure 5.8) as a result. Tasks are then placed on an execution node. The environment on the execution site varies from that of the submission site and is influenced by the batch system and the execution node. The batch system environment (EB) provides information about the assigned machines, available cores, and the location of software modules, and may be crucial for applications that use MPI or modules. The execution node environment

(EE) defines information such as local disks and available hardware.

The simple method of applying the environment is to apply all variables either at the beginning of execution or just prior to task invocation. If applied initially, there are likely uninstantiated or unbound dependencies. If applied just prior to task execution, the context of each layer is evaluated using incorrect values or software. Both ultimately lead to a disconnect between the intended and the actual environment. To prevent this, as tasks are invoked, each transformation creates a process that applies only its specified environment, limiting the environment's scope. Some transformations, such as containers, wipe or mask the provided environment. As transformation environments are applied, this should be taken into account, as the order and manner in which environments are instantiated may not carry through each transformation.
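As a minimal sketch (the script names are hypothetical), each transformation's script exports only its own variables and then invokes the next layer as a child process, so settings cannot leak past the transformation that set them:

#!/bin/sh
# Outer transformation script: apply only this layer's environment, then run
# the inner layer as a child process.
export SINGULARITY_CACHEDIR="$PWD/cache"   # visible to this process and its children only
./t_inner.sh                               # inner transformation; its own exports stay in its process
EXIT=$?
# Nothing exported by t_inner.sh is visible here, and SINGULARITY_CACHEDIR
# disappears once this script exits.
exit $EXIT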

5.5 Applications of Transformations

We will now look at several example applications of transformations and how they can be used to improve the portability and robustness of workflows.

5.5.1 Sandbox Transform

A sandbox transform creates a directory, transfers files, and runs the command of the task. This simple, lightweight transformation isolates the execution namespace from the workflow namespace, which allows for file renaming. The sandbox is removed after successful execution, which prevents unspecified local files from polluting the workflow namespace and disk quota. If a task does fail, the sandbox can be captured and analyzed.

5.5.2 Container Transform

A container transform utilizes the whole command structure. A container pre command is used to pull down or unpack containers. This is done separately to

differentiate failure of initialization from failure of the invocation. A container cmd invokes the container with the nested command, which creates its own isolated process. Finally, a container post command cleans up container images and reports the exit status. Calling the nested command from the container isolates the container arguments from the invoked shell script. This prevents issues with differentiating arguments, isolating file redirects, and instantiating an environment inside the container. Containers can also mask the execution environment, which can prevent an environment specified earlier from existing inside the container. Containers often increase the required resources to account for additional files, like container images.

5.5.3 Resource Monitoring Transform

A resource monitor transform measures the resources utilized during task execution. If limits are specified, the monitor will stop the task and report if the resources are exceeded. As the resource monitor relies on the expected resources, it can utilize the adjusted specification to adapt as transformations add required resources. Other functionality includes monitoring the files that are accessed and creating a time series of the utilized resources. The resource monitor uses the cmd to track the process. The resource monitor benefits from the sandbox directory, as the isolation allows the sandbox to be monitored for disk usage. This does not increase the resources but is used to enforce them. In addition to the executable and the usage summary, the resource monitor creates additional outputs, such as the list of accessed files and the resource usage time series, all of which are added as outputs.

5.5.4 Environment Transform

Regardless of whether a submit script, container, or virtual machine is used to run a task, there is often a need for configuration just prior to task execution.

Figure 5.9. Evolution of a task as transformations are applied. Starting from the left, we have the initial task with a single input and output. Next, a resource monitor is applied, which passes through the original files but also creates a summary of the resources used. After the resource monitor, a Singularity container is used to provide a consistent operating system, requiring an image to run from and creating a log. Finally, a sandbox is created to isolate execution, limiting file access when Singularity maps the current directory.

Such configuration is necessary in cases like redirecting the environment to reference data, configuring variables to include new libraries (such as LD_PRELOAD), or specifying a precise version of Java (by setting the Java home and library paths). These types of transformations rely on the pre command to initialize the environment, as sketched below. This can also be done with the environment dictionary, though those values are exported directly and do not allow for nuanced initialization.
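For instance, the pre block of such an environment transform might contain nothing more than a few exports (the paths below are placeholders):

# Hypothetical pre commands for an environment transform.
export LD_PRELOAD=/opt/lib/libcustom_io.so        # inject a replacement library
export JAVA_HOME=/opt/java/jdk8                   # pin a precise Java version
export PATH="$JAVA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$JAVA_HOME/lib:$LD_LIBRARY_PATH"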

5.5.5 Failure Handling Transform

A transform that analyzes and handles errors at the task execution site allows for evaluations of the environment where the error occurred. Running evaluations only on failure limits the overhead on normal tasks and lessens the analysis burden of the user. This is used in determining software configuration/version incompatibilities, verifying if failure was due to limited resources, analyzing output files to prevent

corrupted output, or processing core-dumps into stack traces. Regardless of workflow size, automating error handling helps the user analyze and address problems quickly. Error handling generally relies on the post commands to perform analysis based on the reported exit code or outputs.

5.6 Case Studies

5.6.1 Resource Usage in a Container

Resource monitoring helps build an understanding of how a task behaves in order to accurately assign resources. If the task requires a container for execution, the resources utilized may be mischaracterized. As a result, it is useful to be able to separate the resource utilization of the task from that of the container. To do this, I first applied a resource monitor transformation to the task, which measures the resources used only by the task. Second, I applied the container transformation that allows the application to run on different platforms. The definitions of these transformations can be seen earlier in Figure 5.4 and Figure 5.7 for Singularity and the resource monitor, respectively. To visualize the complexity that occurs when combining them, Figure 5.9 illustrates each transformation.

makeflow bwa.mf --apply rmonitor.jx --apply singularity.jx

Using the above makeflow call, I executed a workflow that runs BWA[78]. This workflow partitions a large query and runs each chunk concurrently. This workflow was used as the basis to evaluate nesting the resource monitor and Singularity. The workflow was run in four configurations: both Singularity and the resource monitor, just Singularity, just the resource monitor, and the workflow with task sandboxes alone. I can examine the distribution of task execution time under these different

configurations in Figure 5.10 and see that there is minimal additional overhead for each transformation. For these runs, Singularity utilized an image on a shared filesystem to limit the sandbox creation time, which can also be accomplished using a link. In situations with no shared filesystem, the image is transferred and affects performance.
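Concretely, the four configurations correspond to invocations along the following lines; the transformation file names follow the command above, and the sandbox-only baseline is shown here simply as the workflow run without additional transformations applied.

makeflow bwa.mf --apply rmonitor.jx --apply singularity.jx   # resource monitor inside Singularity
makeflow bwa.mf --apply singularity.jx                       # Singularity only
makeflow bwa.mf --apply rmonitor.jx                          # resource monitor only
makeflow bwa.mf                                              # task sandbox only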

[Histogram: Number of Tasks vs. Task Execution (Seconds), one panel per configuration.]

Figure 5.10. Histogram of task execution with nested transformations. The distribution of task execution time is grouped by applied transformations. The first configuration runs the resource monitor inside of a Singularity container, the second runs just Singularity, the third runs just the resource monitor, and the last runs the task inside of an application sandbox. We see that the distribution of execution time is consistent between runs, and the amount of transformation overhead is minimal.

5.6.2 Failure Analysis

When moving between sites or changing data, it is possible for an application to fail intermittently, producing a core-dump. These core-dumps are unwieldy to move

around and provide limited insight into the cause of the failure and its environment. To address this, we wrote a transformation that analyzes a core-dump at the execution site and sends back the resulting stack trace produced by GDB (the GNU Project Debugger). This transformation provides several key benefits. First, it automates the analysis of core-dump failures for the user. Performing this in the execution sandbox provides early resolution about the application that failed and which task created it. Also, core-dumps are bloated and contain all of the memory and stack, which are consolidated considerably in a stack trace. This consolidation limits the amount of data transferred back to the user. Enabling the capture of core-dumps depends on the system's default settings. A common default uses the "core" prefix for core dumps, but also limits their size. To accommodate this, we set the ulimit to unlimited. After execution, if a core-dump was created, we process it using GDB, creating a stack trace that condenses the program failure. We implemented this transformation as seen in Figure 5.11. To evaluate it, we wrote an application that allocates 1MB of memory and then fails roughly 20 percent of the time, creating a core-dump. We scaled this experiment to see the difference of sending the stack trace instead of the core-dump. As expected, we saw differences of several orders of magnitude in the data transferred, reducing as much as 2.4GB down to 0.5MB. Table 5.1 shows a comparison of workflows from 10 tasks up to 10000. Using this technique on larger, memory-intensive applications would yield more significant reductions.

5.6.3 Complex Software Configuration

In scientific computing, researchers often construct and rely on a complex stack of analysis tools built up over months to years of work and configuration. The resulting complexity often prevents researchers from scaling up or sharing their configuration with collaborators.

define StackTrace(T) {
    "command" : {
        "pre" : ["ulimit -c unlimited"],
        "cmd" : "./" + T.script,
        "post": ["gdb " + T.command{cmd} + " core* -ex bt > stack." + T.ID,
                 "touch stack." + T.ID]
    },
    "outputs" : T.outputs + ["stack." + T.ID]
}

Figure 5.11. JSON showing the stack trace transformation. The stack trace transformation allows a user to capture a core-dump of a failed task and convert it into a stack trace. This is done by setting the ulimit to allow the full core-dump, running the script, and then analyzing the core-dump with GDB. The step of touching the stack trace file prevents non-failed tasks from missing an output.

There are several tools and solutions that exist to address this, such as containers and build management tools (like Nix[41] or Spack[44]). The problem becomes more complex when the selected solution is not supported on a different platform, for example because of differing container support or required installation permissions (super-user). For this reason we selected VC3-Builder[111], a user-level environment specification and construction tool. As an example of complex software we use MAKER[18], a bioinformatics analysis pipeline which relies on 39 separate packages and has an installed size of 4.2G. A MAKER installation requires careful dependency management, as several tools rely on hard-coded, installation-specific paths, and installing by hand can take several hours. As a result, MAKER is often limited to a single carefully configured site. This is addressed with VC3-Builder, and we want to leverage it to use MAKER on several sites from one workflow.

The transformation for VC3-Builder often simply invokes vc3-builder, specifying the required dependencies and passing the command.

TABLE 5.1

COMPARISON OF CORE-DUMP AND STACK TRACE DATA

Workflow Tasks   Failed Tasks   Total Core Dump Size   Total Stack Trace Size
10               4              3.7 MB                 0.8 KB
100              24             28.6 MB                6.4 KB
1000             174            214.9 MB               48.8 KB
10000            1957           2.4 GB                 553.1 KB

However, because of MAKER's complexity, several required packages have restrictive licenses that require the user to provide and unpack the libraries. In Figure 5.12, we can see a file, manual-distribution.tar.gz, which contains the restricted packages. Using pre, we are able to set up the correct directory structure, unpack the manual packages, and prepare for vc3-builder. This workflow was executed using Makeflow and distributed with Work Queue[16], a master-worker execution platform. Workers were created on each target site, and tasks were distributed as workers were scheduled. To show the flexibility of transformations, workers were created on Stampede2, Jetstream, and HTCondor. Table 5.2 shows the MAKER build time on each system, all obtained from a single workflow.

5.7 Conclusion

I have now outlined and implemented an algebra for task and workflow transformations. This algebra, based on the sandbox model of execution, allows transformations to be applied while staying distinct in execution, providing a crucial part of abstracting the scientific intention from the specific implementation.

define VC3-Builder(T) {
    "inputs" : T.inputs + ["vc3-builder", "manual-distribution.tar.gz"],
    "command" : {
        "pre" : [
            "mkdir -p vc3-distfiles",
            "cd vc3-distfiles",
            "cp ../manual* ./",
            "tar xzvf manual*",
            "cd .."],
        "cmd" : "./vc3-builder --require maker " + T.script
    },
    "resources" : {
        "cores" : 4,
        "disk"  : T.resources{"disk"} + 4G
    }
}

Figure 5.12. JSON showing the VC3-Builder transformation. VC3-Builder is typically self-contained, and the specified cmd is sufficient for most software. MAKER, however, relies on several libraries with restricted licenses that must be provided by the user. As a result, the transformation must create the install structure and extract these libraries to the correct location prior to running VC3-Builder. We specify cores for the make threads and increase the disk for the installation.

Using transformations, the core scientific analysis can be clearly defined, with the specifics of execution, such as a container or environment setup, applied as needed. With this technique, a user can take a workflow as is, move to new resources, and quickly adapt to the specific needs and configurations of the site. It is now possible for site providers to create transformations that allow for more general changes in a way that fits their site. Workflow transformations allow static workflows to quickly adapt, providing increased flexibility. When transformations are coupled with dynamic workflow expansion and resource management, we can now see a clear path to transforming scientific analysis into a workflow that can be adapted for new data, sites, and configurations.

TABLE 5.2

MAKER BUILD-TIMES USING VC3-BUILDER ON VARIOUS SITES

                      Stampede2   Jetstream   HTCondor
Build Time (HH:MM)    01:29       00:22       00:30

This combination of techniques works well for static workflows using applications with defined or modeled behavior, but as computational demands grow we also need to consider more dynamic approaches to the workflow definition. In the next chapter we will explore how the lessons learned about workflow management and transformations can be applied to dynamic workflows, and what other considerations are needed.

CHAPTER 6

APPLYING STATIC TECHNIQUES TO DYNAMIC WORKFLOWS

6.1 Introduction

In the previous chapters I explored several techniques for dynamically expanding static workflows and transforming workflows to adapt to new sites and configurations. The challenges associated with relocating and transforming workflows are not isolated to static instances; they also apply to dynamic workflows. Historically, HPC applications and workflows are built in isolated environments supported by system administrators. In order to extract the maximum possible performance from specialized hardware, application creators rely on custom software stacks, hardware-optimized code, and I/O behavior tailored to exploit high-performance filesystems. As a result, these applications become dependent upon the specific environment in which they were created. As was demonstrated in Chapter 5, getting an application to run in a new environment is challenging, and once running, it may not be optimized or configured similarly for the hardware. To demonstrate how the previous methods can be adapted to dynamic workflows, I present a case study transforming the MAKER[17] bioinformatics pipeline. MAKER is a powerful pipeline that incorporates many common data processing steps for genomes. MAKER simplifies genome analysis by managing a number of different applications and their dependencies, but this makes migrating MAKER difficult. MAKER has a large number of software dependencies that must be installed, limited scalability in high-latency environments, and can produce configuration and execution

errors that are difficult to diagnose. In this case study I show three methods for adapting dynamic HPC workflows between platforms.

• A portable, reproducible environment for HPC, cloud, and user resources, targeting support with user-level permissions.

• The ability to leverage resources (threads/MPI) on local and remote resources (multiple non-contiguous machines).

• Feedback for a scalable and dynamic system, to aid in configuration and runtime decisions.

6.2 Jetstream

Jetstream [105, 112] is an NSF-funded cloud service built on OpenStack. It operates similarly to Amazon EC2 and has support for data transfer and storage. It allows user-created images to provide consistent platforms for review, comparison, and verification of results. One of Jetstream's goals is to provide a service that focuses on usability and support. As a cloud service, Jetstream is able to create and host custom images and environments that are more difficult to deliver on a more traditional HPC service.

6.3 Portable Reproducible Environments

Creating and supporting a reproducible environment is a current research topic, with work being done at the platform level by OpenStack and Amazon, at the container level by Docker and Singularity, and at the deployment level by Jenkins and Ansible. These three levels each provide a different way of creating a reproducible environment. As part of this work, we targeted Jetstream, but also our Condor and SGE clusters, as well as user machines. A key consideration was the user's ability to verify a setup and configuration locally prior to moving to larger, possibly costly, resources.

6.3.1 Machine Images

A machine image is a pre-built snapshot of a desired software stack. Machine images come in a variety of formats and are supported by a variable number of platforms, such as OpenStack. Machine images are ideal when working on a single operating system (OS) and platform, as a base that provides consistent low-level integration. However, outside the scope of a single operating system, images have less portability. This reduced portability requires a developer to maintain machine images for each supported platform. It also precludes using the image at HPC facilities that lack user-level integration with machine images. The machine image would be an ideal target were we not also targeting users without access to systems like Jetstream.

6.3.2 Container Images

A container image is, similar to a machine image, a snapshot of a desired software stack. Container technology allows users to run a container image on a supporting site using programs such as Docker, Singularity[74], and Charliecloud[95]. Container images provide portability at a higher level than machine images by running on any system that supports them. This allows container images of varying OS to run on any supporting resource. Containers are also now beginning to be supported at HPC centers, such as Singularity on a number of XSEDE resources. However, in the case of both Docker and Singularity, super-user privileges are required to install the software, leaving systems such as campus resources and local clusters unavailable. Charliecloud, assuming unprivileged user namespaces are enabled in the kernel, does provide a user-level container system. Unfortunately, not all kernels have this enabled by default, and availability varies between resources.

6.3.3 Deployment Services

In contrast with images, a deployment service installs and organizes independent software packages into a single coherent package. This often includes finding compatible source code or pre-compiled binaries, installing them, and configuring the different software packages together. Examples of deployment services range from tools as simple as apt-get and make, up to automated systems such as Ansible, Spack, and Homebrew, and server-level orchestration tools such as Jenkins and Puppet. In contrast with machine images, deployment services are often lightweight and only require a small number of predetermined packages to be installed. This allows for a high level of flexibility when deploying into a diverse set of environments and onto different platforms. While some cloud platforms may offer interoperability due to an underlying OpenStack framework, most platforms will require machine images to be recreated. Using a deployment service alleviates this by adapting to the current system and using generic build information from source where necessary. Deployment services adapt well to changes in versions and allow a user to customize them on the fly and test out different configurations. Deployment services help to codify required build steps, and when written with multiple OSes in mind they can reduce the work of supporting different platforms. However, with this flexibility comes the cost of building the software at each site for each use. Additionally, builds often rely on remote data such as git repositories or the software's host site, as in the case of MAKER. The large variance in power and scope of these tools results in a number of different situations where super-user privileges may or may not be needed.

6.3.4 VC3

As mentioned previously, several platforms were targeted, including Jetstream, a Condor pool, an SGE cluster, and individual machines. As a result, we targeted a

deployment service to allow for flexibility in both the OS and permissions. Some sites, such as local clusters, neither had the required tools installed nor allowed user-level installation. Targeting user-level permissions and OS-agnostic features provides the flexibility to target users' available resources. VC3 [111] was used to install and configure MAKER. VC3 creates a sub-shell with a self-contained environment and organizes software in a consistent, predictable manner. VC3 is based upon the idea of tool recipes, with inspiration taken from NixOS[42]. Each tool description consists of a recipe, dependencies, a version, and environment variables. VC3 has several features ideal for MAKER. The consistent file structure and referencing is important, as some MAKER dependencies rely on hard-coded paths and strict relative locations. This allowed resources to come from a number of configurations with the same structure. VC3 also requires only user-level permissions, allowing portability to any platform. VC3's interface allowed the invocation of MAKER and the organization of input data to be consistent between systems. Though there are other tools that could perform similarly, VC3 was picked for its flexibility, unprivileged operation, and familiarity. Similar solutions were written using Ansible, though this method was only used on Jetstream.
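As a rough usage illustration (the task script name is a placeholder), invoking MAKER through VC3 mirrors the cmd of the transformation shown earlier in Figure 5.12:

# Build (or reuse) the MAKER software stack at user level, then run the task
# script inside the resulting environment. run_maker.sh is hypothetical.
./vc3-builder --require maker ./run_maker.sh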

6.3.5 Deploying MAKER

When deployed onto Jetstream, MAKER was installed using deployment services to consistently handle the complex setup. VC3 and Ansible were both used for consistent builds on widely provided base images, such as CentOS 7 Development. This process included installing several of MAKER's required programs and libraries that cannot be distributed in an automated manner due to developer licensing. Though deployment services provide more flexibility when installing, a machine image provides easier, faster start-up for the users. To accommodate this, a machine

image was built with the VC3 package installed, allowing execution to only verify the MAKER install rather than build it each time. Additionally, Work Queue workers could be launched from the same machine image, limiting traffic to only task input and output and relying on the required software already being installed. This differed on our Condor cluster, where there was no shared filesystem or container support. As a result, a compressed install of VC3 was sent, so MAKER was built for each OS only once. A build using deployment services provides a great deal of flexibility, but as users primarily used this just for MAKER, the repeated build overhead limited the benefits. This was mitigated by using a machine image with MAKER installed via VC3 on Jetstream, and a compressed VC3 install on Condor. Using VC3 made rebuilding images, targeting a new OS, and adding new features simple. By using a static image based on the deployment services, regardless of machine or container image, we can update software without the user needing to build every run.

6.4 Scalability

We define scalability as the number of cores that any one project is able to harness at a time. In an HPC context, this translates to the number of cores times the number of machines allocated to the job. MPI, being able to work on distributed memory machines, can work across the boundaries of several machines. In a cloud context, jobs are often limited by the size of the selectable images. Some cloud platforms allow for creating sub-networks of machines, but the typical user (a researcher trying to run an analysis) will have neither the time nor the expertise to configure them. Scalability was achieved at two levels in this work.

1. Local parallelism, such as MPI, GPUs, or threads.

2. Distributed concurrency, partitioning across machines.

Our scalability goal was to limit both the involvement in the provided software and the work needed to target a different concurrency model. As such, we decided to use MAKER's provided concurrency model, MPI, to execute on each worker, where a worker is equivalent to a node. Using MPI allows us to scalably utilize resources, but lacks dynamicity as new resources become available. By leveraging the existing concurrency model we avoid the complexity of interfacing with MAKER's internal architecture. This also allows for smooth transitions between versions of MAKER and any underlying programs. Based on experience from our previous tightly interfaced work[109], interfacing tightly makes transitioning between versions, configurations, and platforms difficult, requiring nearly equal effort for each transition. Relying on the underlying concurrency model for local parallelism lets Work Queue reason about the scaling and distribution of work to all available resources. In contrast with concurrency models like MPI and threads, Work Queue does not rely on having a statically determined set of resources. Orchestrating the work distribution with Work Queue allows users to add workers to increase the resource pool, use resources from several allocations or sites, and provides fault tolerance to the application as a whole. Relying on MPI for local scalability, WQ-MAKER uses Work Queue to dynamically schedule on new resources.

6.4.1 MAKER’s MPI Behavior

MAKER utilizes MPI as the primary means of scalability. Concurrency in bioinformatics is often available at the sequence (contig/scaffold) level. This is a division commonly used for partitioning data, as each sequence is a unique piece of data analyzed separately from the other sequences. MAKER then creates an additional level of concurrency using each analysis tool as a sub-process in the pipeline. This allows the burden of longer-running sequences to be shared between multiple cores on the same machine and allows load balancing with smaller computational chunks.

Figure 6.1. Diagram showing the MAKER, MPI MAKER, and WQ-MAKER models. MAKER, without MPI, runs the sub-process analysis sequentially. MPI MAKER executes by sharing work, with MPI processes going to the pool of ready tasks and executing them. These processes are synchronized using data structures (in MAKER) and data files (in MAKER's sub-tools) passed between sub-processes (see dotted lines). WQ-MAKER partitions the data and sends it to separate workers. Each worker executes MAKER locally using MPI MAKER.

However, the secondary level of concurrency can rely on intermediate files, for locks and tool-specific data, existing in shared space between tasks. This is not a requirement of MPI, which discourages it, though some of MAKER's sub-processes rely on files for information and state. As a result, execution must also be located on a shared filesystem to allow the outputs to be coordinated between all MPI processes.

6.4.2 WQ-MAKER

WQ-MAKER is built using the Work Queue API. The work is partitioned into different sizes, anywhere from individual sequences to the entire query file. WQ-MAKER does not split the work past the sequence level, as MAKER does with MPI,

to prevent communication overhead from sub-processes. Each partition is a self-contained computational chunk that is distributed and organized after completion. WQ-MAKER utilizes Work Queue's resource interface to allocate resources based on the partition's size and structure. Controlling this at the task level allows for handling based on structure, such that long scaffolds are handled differently than short contigs. Using the resources allocated to a task by Work Queue, the worker can assign the appropriate number of cores for MPI. A model to accurately assign these resources is being developed as part of future work. Employing MPI on larger tasks, which occupy the entire worker, limits the master's management burden of monitoring workers.
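For reference, the command a single WQ-MAKER task might run on its worker resembles a standard MPI MAKER invocation over that partition's control files; the core count and file names here are placeholders corresponding to the resources Work Queue assigned.

# One partition executed on a worker: MPI provides the local parallelism,
# using the cores Work Queue allocated to this task.
mpiexec -n 8 maker maker_opts.ctl maker_bopts.ctl maker_exe.ctl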

6.4.3 Scaling Up vs Scaling Out

Scaling up helps accelerate the annotation of genomes, but scaling up is not always the best usage of resources. A common assumption is that it is best to scale up using all of the available resources immediately. However, in practice this is seldom the case, as distributing shared data (i.e., references), connecting to multiple resources, and spamming batch systems result in a gradual increase in resources, not an immediate deluge. This more gradual availability of ready resources can cause timeouts and under-utilization of the provided resources. This limitation leads users to under-provision applications instead of gradually adding resources as the application stabilizes. This was not addressed in this work other than to provide runtime feedback to the users about usage, so they are better informed. Work Queue masters track the capacity of an application and inform users to add resources as the master can support more. Compared with scaling up, scaling out can better utilize those resources for concurrent analysis of genomes, allowing the same pool of resources to be shared; workers are kept busy with tasks from different masters. Work Queue allows workers to match to multiple masters, enabling WQ-MAKER to share workers

between instances. This provides an additional level of load balancing without relying on additional underlying systems.

6.5 Exposing Execution Feedback

6.5.1 Clean Environment Builds

The first major obstacle was providing the users with clear feedback on the creation of the MAKER environment. Using deployment services to create, update, or modify the instance can initially cause errors and warnings, but once codified they offer consistent builds. These build errors were typically only encountered by developers and could be diagnosed quickly. To communicate this, VC3 and Ansible need clear error handling and messages that identify errors. Build errors are mitigated when using static images, but can still be relevant as users want different configurations. Additionally, for applications such as MAKER, where a variable number of the subsystems may be used, cursory testing does not always reveal configuration errors. As such, new errors can occur when using different sets of functionality and must be clearly differentiated from runtime errors. When using static images, like machine or container images, build errors are often self-explanatory, such as network errors, insufficient resources, or just bad luck. Jetstream's provided trouble-shooting gives possible solutions for users to attempt.

6.5.2 Deploying Workers

When deploying workers, users must log into each worker machine and manually start the worker process. This requires users to manage multiple ssh sessions, and any changes to connection information (i.e., IP address or project name) must be reflected on all workers. We use an Ansible playbook that allows the user to launch and manage workers from the host machine. For the user, this is as easy as creating an ssh key and saving it in Jetstream, allowing the master machine to propagate commands using Ansible.

[Plot: Execute Time (s) vs. Total Cores for MPI MAKER and WQ-MAKER on the Fungal dataset.]

Figure 6.2. Graph showing a comparison of methods on the Fungal (41MB) dataset. The Fungal dataset contains 231 contigs. With similar runtimes, we can see there was little or no overhead when using WQ-MAKER, though little was gained beyond 34 cores.

On other systems, such as Condor and SGE, Work Queue maintains a worker factory that can submit workers as resources become available. Part of deploying workers is monitoring how many are actively being used by the Work Queue master. This is done using a status program that queries masters for active workers. As previously mentioned, masters also track capacity and can allow the factory to submit workers as the master is able to support more, as a result of workers finishing initialization or varying task execution times.
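As an illustration (the project name and worker counts are placeholders), the factory and status tools are invoked roughly as follows:

# Maintain between 5 and 50 Condor workers for masters matching the project
# name, each sized for an 8-core MAKER task.
work_queue_factory -T condor -M wq-maker-project -w 5 -W 50 --cores 8

# Query the catalog server for known masters, their tasks, and workers.
work_queue_status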

6.5.3 Evaluate Performance

Following a run, WQ-MAKER verifies successful execution. This is done by reconciling the final output files against the input data to ensure that all contigs were analyzed. The produced statistics are examined by WQ-MAKER to understand the behavior on this run's data. Work Queue provides a suite of graphing scripts for more in-depth analysis. These graphs help in understanding the task execution, worker utilization, and file transfer speeds. All Work Queue graphs used in this chapter were created using these tools.

[Plot: Execute Time (s) vs. Total Cores for MPI MAKER and WQ-MAKER on the partial Hummingbird dataset.]

Figure 6.3. Graph showing a comparison on the partial Hummingbird (900MB) dataset. This subset of Hummingbird contains 5000 contigs. Here we can see an improvement of WQ-MAKER over the MPI run, likely as a result of reduced contention for resources.

6.5.4 Diagnosing Errors

Unfortunately, WQ-MAKER does not always run perfectly, and it is important to help users diagnose errors and where they originated. After a run completes, WQ-MAKER prints the successes and failures of contigs. WQ-MAKER will retry failed contigs to ensure that the failure was not intermittent, possibly the result of network issues, software bugs, or resource contention. On repeated failure, tasks are logged, reported to the user, and abandoned to avoid wasting effort retrying them further. The output of failed tasks is stored by WQ-MAKER, allowing users to diagnose the issue later. Work Queue provides a debugging log that can be turned on to diagnose network errors, firewall issues, or file transfer failures. If an error happens while running WQ-MAKER, the Work Queue framework will print the error along with additional information in the debug log.

6.6 Evaluation

Though performance is important, this chapter did not directly measure the differences in start time between machine images, container images, and deployment

services. Using deployment services we provide consistent builds, but leveraged machine images where available in Jetstream. The time needed to build the software is relevant when redeploying using deployment services. VC3 allows for multi-threaded deployment, which reduces a 1-hour build to roughly 10 minutes using between 16 and 24 cores. VC3 reuses existing builds, allowing us to re-enter an existing build consistently in under a minute. The consistent file structure of VC3 allows us to build once and distribute to workers to install if needed. WQ-MAKER was evaluated using several datasets. The MPI executions were done using differently sized Jetstream instances, up to the largest of 44 cores. WQ-MAKER used a master and a variable number of workers on medium instances of 6 cores. The fungal dataset consists of 231 contigs. This dataset executes in roughly 4 hours using 2 cores locally. With increased cores, WQ-MAKER performance scales with MPI. Considering that the fungal dataset is small, there is limited improvement after 40 cores, with a slight increase at 104 cores, as seen in Figure 6.2. The hummingbird genome consists of 5000 contigs, a medium-sized genome sample. The results of running this genome through MPI and WQ-MAKER can be seen in Figure 6.3. For this larger sample we were able to see improved performance over the standard MPI deployment as a result of lessening the memory burden on each partition using several workers. We were also able to see consistent reduction of execution time as we increased resources. The saguaro cactus dataset consists of 573,771 contigs. This dataset took 57 hours to run using MPI MAKER on 24 cores. The raw CPU time spent during this job was 52 days of computation. Using WQ-MAKER, we ran this same dataset on our Condor cluster. Using the Work Queue factory, a mix of workers was launched. Each task was partitioned into 100-sequence, 8-core jobs to allow them to fit in the Condor instances. Figure 6.4 shows the number of tasks running over time as the workflow executed. The final execution time was 3 hours and 36 minutes, running 168 tasks concurrently

at its peak, which equates to 1344 cores. As workers were dynamically added and removed, the total CPU hours were only 1725.5, a result of WQ-MAKER's dynamic nature. Table 6.2 shows a comparison of performance using standard MAKER, MPI MAKER, and WQ-MAKER. In total this project has been used by a number of users for annotation. We are actively developing and improving WQ-MAKER and working with users to better understand their needs. Table 6.1 shows a subset of the genomes that have been annotated using WQ-MAKER.

[Plot: instantaneous task counts (Waiting, Running, Complete) and cores used vs. master lifetime (min.).]

Figure 6.4. Performance of the Saguaro Cactus genome (1.6GB) annotation using WQ-MAKER on Condor. The two lines of note are the running tasks and cores. The running tasks indicate the number of actively running MAKER tasks. The cores indicate the cores utilized by WQ-MAKER. Condor's volatility as a job scavenging system causes the variability in available resources.

TABLE 6.1

GENOMES SEQUENCED WITH MAKER ON JETSTREAM

Genome                          Sequences             Workers   Runtime (hrs)
Sporobolus A                    11,789 contigs        22-40     144
Sporobolus B                    6,615 contigs         21-35     108
Brassica rapa                   44,000 scaffolds      10        4
Zea mays W22                    10 chromosomes        10        1440
Zea mays nc350                  6460 scaffolds        22        72
Culex tarsalis                  7478 scaffolds        40        120
MP29                            5003 scaffolds        40        24
Pigweed                         4126 scaffolds        20        120
Sclerotiana homeocarpa iso 10   231 contigs           10        6
Sclerotiana homeocarpa iso 11   257 contigs           10        6
Calypte anna                    265 super-scaffolds   10        8
Kochia scoparia                 19,671 scaffolds      21        72

TABLE 6.2

COMPARISON OF MAKER METHODS PERFORMANCE

             Execution Time   Cores     Total CPU Hours   Speedup
MAKER        52 days          1         —                 —
MPI MAKER    57 hrs           24        1368              —
WQ-MAKER     3.6 hrs          80-1344   1725.5            15.8

The MAKER execution time is an estimated time based on CPU time of MPI MAKER. The results from MPI MAKER and WQ-MAKER are from actual runs of each. The speedup is calculated with reference to base MAKER, but WQ-MAKER still had a ∼16x speedup over MPI MAKER.

6.7 Conclusion

I have showcased the flexibility of the methods described throughout, primarily the importance of building abstraction into the underlying workflow. The workflow developed for MAKER re-engineered the previous iterations by separating the underlying computation from the difficulties of distributing in cloud environments. Relying on this abstracted design, I illustrated how different methods for reconstructing the complex environment could be employed. This abstraction also allows concurrency to be leveraged at both the underlying application level and the workflow level, allowing the dynamic nature of the workflow to be adjusted to the available resources. The flexible nature of dynamic workflows allows partition sizes to be adjusted toward more resource-appropriate configurations. Ideally, I would prefer the flexibility of dynamic, responsive sizing in all workflows, but because dynamic workflows require a much higher level of effort and knowledge to design, as compared to static workflows, they are out of technical reach for many users. Even implementing dynamically expanded workflows is simple in

comparison, as they can leverage the built-in features of the underlying static system, but they do not provide the performance needed in large, complex applications. In the next chapter I will explore a computational abstraction that leverages static application definitions with job coordinators designed to adapt the sizing similarly to a dynamic workflow.

CHAPTER 7

CONTINUOUSLY DIVISIBLE JOBS

7.1 Introduction

Previously I evaluated the benefits of dynamic workflows and how the flexibility of partitioning allows performance to be adjusted as needed. However, the difficulty associated with writing dynamic workflows makes them an unfeasible solution for domain scientists looking to scale computation. In this chapter I propose an abstraction for separating the computation definition from the concurrent control of the execution. The goal is to allow domain scientists to write application- and format-specific job implementations that can be scaled with no additional expertise. In creating a flexible system, I focused on bag-of-tasks applications for their prevalence and to provide bounds on this solution's implementation. For many of these applications, decomposition into a bag-of-tasks approach allows for a variety of execution platforms. Examples are seen in the areas of batch systems (HTCondor, AWS Batch, SLURM), job execution frameworks such as MapReduce (Hadoop, Spark), and more general workflow management systems (Makeflow, Pegasus, Work Queue, Swift, etc.). These systems explore different ways to handle application execution, data management, and scalability. Application developers choose from a variety of concurrent systems based on needed features and then design an application using their knowledge of the underlying scientific application and the chosen system. However, having made a predetermined partition, the bag-of-tasks approach often limits how responsive these systems can be. Furthermore, the application designer

may not be an expert in the underlying scientific application, in scalable design, or in both, producing an application that makes naive design and partitioning decisions which lead to sub-par performance and resource utilization. I propose the Continuously Divisible Job abstraction, which introduces a dynamically sized job interface for scientific applications. The Continuously Divisible Job interface is used with an abstract job for portability and operation abstraction, and is managed using a job coordinator that scales abstract jobs based on resources and utilization. The Continuously Divisible Job abstraction relies on a user-specified interface that defines the mechanisms for how inputs are partitioned, jobs are executed, and outputs are handled. This interface exploits the application developer's domain knowledge and allows for dynamic behavior via job abstractions. To further enhance data partitioning, we also propose virtual files, which manage data indexing, lightweight partitioning, and just-in-time file instantiation. Virtual files help limit the amount of redundant file reads and writes, exploit cached or shared files, and allow lightweight partitioning.

7.2 Contributions

Our contributions in this chapter are:

1. I define the Continuously Divisible Job abstraction and how it can be used to flexibly partition, distribute, and execute on large datasets.

2. I show how Continuously Divisible Jobs can be used to tune applications online for better performance and to escape bad initial partition configurations.

3. I discuss how Continuously Divisible Job coordinators can be used to construct a hierarchy of resources and coordinators that respond to performance and load balance as needed.

4. I define a virtual file abstraction and how data partitioning and movement can be minimized with indexing, lightweight partitioning, and just-in-time realization of files.

7.3 Challenges

To address the limitations of bag-of-tasks style concurrency, the methods for constructing and executing them need to be explored. In the area of concurrent execution, batch systems form the basis of computational power. Batch systems provide low-level mechanisms to describe and schedule work to a host of machines, providing large-scale available resources. Batch systems are generally leveraged for high-performance computing, as with SLURM[69], PBS, and Torque, or high-throughput computing, such as HTCondor[108], Open Science Grid, and the WLCG. The general submission approach used by batch systems requires static submissions. Users aiming to harness more resources must partition, submit, and manage the work themselves. As more general batch systems are limited in how dynamically work can be partitioned, more specific execution frameworks have been developed. This includes approaches such as MapReduce[33], bulk synchronous parallel[116], and more general data-driven models such as workflows, all of which map cleanly to bag-of-tasks. MapReduce for example, as implemented in Hadoop, relies on the inherently parallel nature of the data analysis to scale smoothly. In the standard Hadoop setup, the execution is scheduled to the node where the data resides, relying on HDFS[101] to have created a sufficient number of data shards for high performance. This can lead to the predetermined splits having disproportionate work and long tail execution. Spark[126, 128], which is built on Hadoop, can still fall prey to the same issues. Though Spark is able to leverage more performance by enhancing HDFS with resilient distributed datasets (RDDs), the partitioning is still programmer- and system-driven, which can lead to poor configurations from imbalanced data and static sizing. For more general data-driven execution, scalable workflow systems offer high-level task abstractions allowing for more easily controlled scaling. Solutions for scalable workflow systems fall into two categories: static and dynamic. In a static approach the user defines the size, partitioning, and scalability of the work and relies on the workflow system for distribution and execution. This approach relies on the application developer's knowledge of the data to define and predetermine the partitioning. After this point the workflow system directs and manages the concurrency, dealing with resources, communication, and failure management. Example systems using this approach include Makeflow[2] and Pegasus[38], or could be defined more generally using the Common Workflow Language[5] or the Workflow Description Language[117]. Again, the static nature of partition decisions limits the responsiveness of the underlying workflow system. Once a workflow is defined, feedback cannot be used to adjust the size or shape of tasks. This static sizing can lead to poor performance, often lacking knowledge of the number of resources, network performance, or even application execution time. For example, in Figure 7.1 we show how the total runtime of BWA is influenced by the task size on a fixed dataset.

Figure 7.1. Effect of partitioning on BWA execution. This is a sample bioinformatics (BWA) workflow's performance with the input partition size varied. The number of partitions was varied from 10 to 100,000, using a statically sized query of 1,000,000 sequences.

The dynamic approach requires the application developer to devise and direct the concurrency of the application. This involves a more nuanced understanding of both the application (i.e. partitioning, performance, resource requirements) and distributed design (i.e. task scheduling/ordering, failure management, resource acquisition). Examples of the dynamic approach include Work Queue[16], Swift[129], Parsl[9], and RADICAL Cybertools[90]. The challenge is that this approach requires both knowledge of the application's core behavior and an understanding of distributed application behavior. The existing solutions provide many options for defining and executing work, but lack flexibility when running bag-of-tasks style work. In general, these approaches rely on static partitions, either defined by the developer or the underlying system, which constrain work similarly. Some of the common challenges that arise from static sizing are high partition and execution overhead, long tail execution from imbalanced work, and rigid mapping to resources. Additionally, when the execution system is unable to further manipulate sizing it is difficult to model solutions for more complex execution configurations, such as adapting to heterogeneous resources, nested resources for data distribution, or identifying and isolating failures.

7.4 Continuously Divisible Jobs

Continuously Divisible Jobs are applications with defined minimum computational units, such as events, sequences, or slices of input data that can be processed in large batches. This structure is common and can be seen in high-energy physics event processing for particle collisions, genome sequence alignment in large queries, and large batch simulations for model observation and validation. Each computational unit has a short execution time, often on the order of seconds to minutes. However, the collection of these units is large, often processing thousands to millions of events in a single batch, which greatly increases the execution time and resources needed.

Figure 7.2. Diagram of Continuously Divisible Job architecture. This diagram outlines the relationships between the data slice, Continuously Divisible Job interface, abstract jobs, and the job coordinator. An abstract job consists of a data slice, the application, and the interface wrapper. Abstract jobs may map a single slice or contain several independent slices. These abstract jobs are managed by the job coordinator, which splits, executes, and joins the work. The job coordinator decides how and when jobs are placed on resources.

Continuously Divisible Job interfaces are implemented in terms of five functions: SPLIT, JOIN, EXECUTE, TO DESC, and FROM DESC, which can be used to dynamically handle and execute these large datasets, with each function operating on any amount of data, from a single slice to the full dataset. Applications that have implemented this interface can be started and managed using a job coordinator that partitions and executes the data. These job coordinators can be designed for a number of execution platforms, such as batch systems or execution managers, or can run locally. Using abstract jobs, the job coordinators can be chained together to create flexible hierarchical stacks of resources that can share work and load balance as needed. These job coordinators can also be designed to tune the partitions to more efficient sizes, but more importantly can be used to escape from bad initial configurations (i.e. naive job partitions).

In this section we will define the design and capabilities of the Continuously Divisible Job interface implemented by an application, abstract jobs, and job coordinators that operate on abstract jobs to partition and execute the application. Dividing the Continuously Divisible Job abstraction this way allows the mechanism of partitioning and executing to be defined by the application domain expert, while the policy of executing these jobs with job coordinators is left to the distributed system expert or system administrator of a site.

7.4.1 Operations

To achieve the dynamic sizing and resource utilization of the Continuously Divisible Job abstraction, we define a set of operations and attributes that applications need to implement. These definitions can be implemented directly by the application designer or domain scientist, as decisions on how and where to partition data, what parameters are needed for execution, and the expected environment can directly impact the validity of the results. The Continuously Divisible Job interface instructs the abstract job on the mechanism of job handling, and is the only component the application designer needs to implement.

SPLIT(JOBs, COUNT, SIZE)::[JOBs1,..,JOBsn]
Given a count and size of splits, this creates a set of new jobs: the requested number of jobs at the specified size, plus an additional job containing any remaining slices. Each new job should be able to reconcile its context in the origin job and the dataset as a whole. If the split job does not contain enough slices for the full count of partitions, split should return a set with as many as possible. Splitting a job does not necessarily perform partitioning, but logically separates the slices for execution. In a base approach this may partition data, but as will be explored later in Section 7.5 there are other methods for late or just-in-time data partitioning.

JOIN(JOBa, JOBb)::[JOBjoin]||[JOBa, JOBb]

Join takes the specified job and joins it with the calling job, returning a set of new jobs. The implementing application should determine if the two jobs are joinable and either return a single combined job or the passed-in jobs in sorted order. This allows for application-specific join behavior for contiguous, ordered, or unordered slice combinations. The join operation may be called on jobs that have or have not been executed, requiring the application developer to handle both cases. As provided by abstract jobs, the application will not need to join executed and non-executed jobs. If the application chooses to allow for more application-level management, the application can simply merge all jobs and hold its own application-level slices.

EXECUTE(JOB)::RESULT
The execute operation performs the application's core computation. Application execution could be in the form of spawning a process, running a shell command, or simply calling a function. There are no parameters for the execute function, as all application-level variables should be specified in the job's definition. This operation may be executed remotely, additionally requiring the list of files, environment, and resources used.

TO DESC(JOB)::DESCRIPTION
FROM DESC(DESCRIPTION)::JOB
The TO DESC and FROM DESC operations define the basics for serializing and deserializing the application. As the location of execution is determined outside of the application's control, serialization allows job instances to be consistently packaged, moved, and reconstituted for execution, a core component used both in bootstrapping the initial job and in using remote execution job coordinators. There are many methods that can be used for implementing the serialization and deserialization of an application instance, such as converting objects to JSON or language-specific approaches (i.e. Python pickle). System-agnostic approaches such as JSON are preferred, allowing for a wide range of execution environments, possibly even mixed within a single application.

In combination with the core operations implemented for an application, there is also a set of attributes that allow the above operations to be called intelligently and the overall performance tuned. These are not strictly necessary, but provide insight that allows the jobs to be run on a variety of systems without assuming shared filesystems, complete node use, or a consistent environment for execution.

• Input files: Provides both the static files needed for all jobs and the data specific to each job's slices.

• Output files: The expected outputs of the job.

• Resources: The size of the resources needed for execution, such as cores, memory, and disk.

• Environment: The expected environment variables used by the underlying ap- plication.

• Size: The total size of the application instance allowing precise splitting and performance analysis.

• Result: An attribute that indicates whether the slice has been completed, and if so, whether it was successful.

Having defined an application's interface, an instance of the application can be instantiated. This can be done either by directly creating an instance or by using FROM DESC to bootstrap. Each partition of the data is considered a slice, and the combination of data slices with the application and its interface creates an abstract job. Abstract jobs, as will be defined below, can contain a number of slices, allowing dynamic sizing. The relationship between data, applications, abstract jobs, and the job coordinator can be seen in Figure 7.2.
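To make the interface concrete, the following is a minimal Python sketch of what an application-side implementation might look like. The class name, method signatures, and JSON-based serialization are illustrative assumptions for this sketch, not the exact API of the implementation evaluated later in this chapter.

```python
import json
from abc import ABC, abstractmethod

class ContinuouslyDivisibleJob(ABC):
    """Illustrative sketch of the Continuously Divisible Job interface:
    five operations plus the descriptive attributes used by coordinators."""

    def __init__(self, slices, inputs=None, outputs=None,
                 resources=None, environment=None):
        self.slices = slices                  # logical units of work
        self.inputs = inputs or []            # static files plus slice-specific data
        self.outputs = outputs or []          # expected output files
        self.resources = resources or {"cores": 1, "memory_mb": 512, "disk_mb": 512}
        self.environment = environment or {}
        self.result = None                    # None until executed

    @property
    def size(self):
        """Total number of slices, enabling precise splitting."""
        return len(self.slices)

    @abstractmethod
    def split(self, count, size):
        """Return up to `count` jobs of `size` slices each, plus one job
        holding any remaining slices."""

    @abstractmethod
    def join(self, other):
        """Return [combined] if the jobs can be merged, otherwise the two
        jobs in sorted order."""

    @abstractmethod
    def execute(self):
        """Run the application on this job's slices and set self.result."""

    def to_desc(self):
        """Serialize to a system-agnostic description for movement."""
        return json.dumps({"type": type(self).__name__,
                           "slices": self.slices,
                           "inputs": self.inputs,
                           "outputs": self.outputs})

    @classmethod
    def from_desc(cls, description):
        """Reconstitute a job instance from its description."""
        desc = json.loads(description)
        return cls(desc["slices"], inputs=desc["inputs"], outputs=desc["outputs"])
```

A coordinator only ever sees these methods and attributes, which is what keeps the application logic and the distribution policy separate.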

7.4.2 Abstract Jobs

The core bridge of the Continuously Divisible Job abstraction between applications and job coordinators is the abstract job. Abstract jobs provide the management of higher-level applications. An abstract job allows a single large slice or multiple slices to be grouped and managed without the application developer needing to handle every possible case of splitting and merging partitions. Non-mergeable partitions can coexist in a single job, enabling more dynamic sizing. Executed and non-executed partitions can be passed as one or separated with no additional application handling. Application-specific files and libraries can be captured with minimal user involvement, such as for bootstrapping an application remotely. To facilitate the flexible handling of slices without pushing the handling onto the job coordinator, the abstract job layer needs to handle splitting mixed- and multi-slice jobs, joining and sorting possibly non-contiguous jobs, and providing high-level mechanisms for reasoning about and classifying jobs. This provides additional functionality for the job coordinators to take advantage of in the areas of defining and tracking application results, grouping jobs based on the underlying state (unexecuted, failed, successful), and logically packing undersized slices together. This layer's consistent interface allows the job coordinator to organize and execute jobs with no knowledge of the application, and likewise the application designer does not need to interact with job coordinators. On execution the application bootstraps an instance with data, creates an abstract job, and submits it to the coordinator.
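As a rough illustration of this layer, the sketch below groups application jobs, classifies them by state, and delegates splitting and joining. The class and method names are hypothetical and assume jobs implementing the interface sketched above; they do not reflect the exact implementation.

```python
class AbstractJob:
    """Illustrative abstract job: groups one or more application-level jobs
    (each implementing the Continuously Divisible Job interface) so a
    coordinator can split, classify, and pack them without application logic."""

    def __init__(self, app_jobs):
        self.app_jobs = list(app_jobs)

    def by_state(self):
        """Group contained jobs by execution state for the coordinator."""
        states = {"unexecuted": [], "failed": [], "successful": []}
        for job in self.app_jobs:
            if job.result is None:
                states["unexecuted"].append(job)
            elif job.result:
                states["successful"].append(job)
            else:
                states["failed"].append(job)
        return states

    def split(self, count, size):
        """Delegate splitting to the application's SPLIT and wrap each piece."""
        pieces = []
        for job in self.app_jobs:
            pieces.extend(job.split(count, size))
        return [AbstractJob([piece]) for piece in pieces]

    def join(self, other):
        """Merge with another abstract job; a fuller version would repeatedly
        call the application's JOIN on compatible, same-state pairs."""
        return AbstractJob(self.app_jobs + other.app_jobs)
```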

7.4.3 Job Coordinators

A job coordinator provides the execution and policy management of the Continuously Divisible Job abstraction. Job coordinators are intended to be implemented by the execution system developers and site administrators, and are the primary component deciding on job sizing, execution, and collection. As such, the application developer (e.g., the scientist) does not implement a coordinator, but selects from existing job coordinators based on need. This could be in the form of a multicore executor, a coordinator that submits to an execution system, or a mix of job coordinators to achieve the desired configuration. Job coordinators work directly with abstract jobs to distribute and execute the specified computation. At its most basic, a job coordinator receives an abstract job and executes it, but more likely a job coordinator partitions the work to utilize many-core machines. Using the TO DESC and FROM DESC functions in tandem allows job coordinators to adapt to a variety of resources and sites.

As job coordinators operate on abstract jobs, the job coordinator needs to rely on its own feedback and metrics to inform partitioning size and performance. This allows the job coordinator to tune for general performance, without being application-specific. Tracking the time to create slices, execution time per slice, or cost of joining slices are just a few ways to measure performance. Designed properly, job coordinators can avoid bad performance in several cases. Jobs with high overhead can be scaled up, as the increased size may mitigate execution overhead and improve throughput. Data can be processed in batches with time limits to avoid losing entire submissions if the resources time out or are lost, providing timed checkpoints. The same structure of timed batches can be used to load balance between fast and slow workers or highly variable slice execution. Partitions can be sized to support any number of workers, while maintaining user responsiveness. The more flexible the execution system, the more ways jobs can be tuned and resource utilization improved.

In addition to job performance tuning, job coordinators can be used to distribute work in a number of ways. The recursive nature of partitioning and joining allows several coordinators to be used in combination to address the user's needs. This can be used to achieve multi- and mixed-tiered execution models as seen in Figure 7.3. This model could be further extended to have hierarchical job coordinators that submit to a tree of resources, but allow the resources to report results directly back to the source. In cases with large input data but compact results, this model would allow for parallel distribution but centralized result collation. Such a hierarchical model would be similar to what is shown in Figure 7.3.
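A minimal local job coordinator might look like the following sketch, which repeatedly splits off a batch of pieces, executes them in a process pool, and joins the completed results in order. The names, the fixed batch policy, and the direct use of the interface from the earlier sketch are assumptions made for illustration; a production coordinator would work through the abstract-job layer and adjust partition sizes from its own metrics.

```python
from concurrent.futures import ProcessPoolExecutor

def run_piece(description, job_class):
    """Worker-side helper: rebuild a job from its description, execute it,
    and return the serialized, executed job."""
    piece = job_class.from_desc(description)
    piece.execute()
    return piece.to_desc()

def local_coordinator(job, job_class, cores=8, batch=8, size=1000):
    """Illustrative multicore coordinator for a job implementing the
    Continuously Divisible Job interface sketched earlier."""
    completed = []
    remaining = job
    with ProcessPoolExecutor(max_workers=cores) as pool:
        while remaining is not None and remaining.size > 0:
            pieces = remaining.split(batch, size)
            # SPLIT returns an extra job holding any leftover slices.
            remaining = pieces.pop() if len(pieces) > batch else None
            descs = [p.to_desc() for p in pieces]
            for result in pool.map(run_piece, descs, [job_class] * len(descs)):
                completed.append(job_class.from_desc(result))
    # Join executed pieces in order, first slice onward; this sketch assumes
    # contiguous executed pieces combine into a single job.
    joined = completed[0]
    for piece in completed[1:]:
        joined = joined.join(piece)[0]
    return joined
```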

Finally, job coordinators can be used to identify and isolate failing partitions. As is often the case, data may be malformed or corrupted, causing the application to fail. In static approaches, it is left to the user to bisect the failing task and isolate the culprit. However, the dynamic sizing and result tracking of abstract jobs allows the job coordinator to automate isolation, minimizing the number of slices that must be analyzed.

Figure 7.3. Capabilities of the Continuously Divisible Job abstraction. Using the Continuously Divisible Job abstraction, jobs can be partitioned and run locally. Using the same design we can also distribute to multicore workers or to other job coordinators that further distribute the work. This highlights how the Continuously Divisible Job interface relates to the overall recursive design.

7.4.4 Design Considerations

In developing the application operations for the Continuously Divisible Job interface, there are several design considerations that should be taken into account. These considerations include methods for file partitioning, how namespaces and job sandboxes should be handled, and how result ordering can affect performance. Defined below are some of the larger considerations along with designs and methods to resolve them.

7.4.4.1 File Partitioning

File partitioning is often a crucial part of Continuously Divisible Job applications, as applications generally read in and analyze the full input dataset. As a result, each split requires new data partitions to be created, which generates redundant data in naive approaches. If each split directly partitions the data, the remaining post-split data is repeatedly written. This leads to a multiplicative effect on the necessary storage for execution, not even accounting for the redundant file reads and writes. To prevent this, methods for just-in-time or late file realization may be needed when using the Continuously Divisible Job abstraction. Two examples of file partitioning are shown in the analysis of this chapter: the first is a naive split-on-partition where files are written when the SPLIT operation is called. This creates unnecessary files, but allows the application to be executed without modification, similar to more static invocations. The second uses virtual files, defined in Section 7.5, to reference data slices. The virtual file abstraction allows for flexible data handling, such as just-in-time file instantiation or direct data access.

7.4.4.2 Job Namespaces

A job's namespace consists of all the files needed to complete the job. In the base case this is simple, as the namespace contains the uniquely named files used to execute. Each job is invoked the same way, so it is often tempting to use consistent generic names in the invocation, but this fails when scaling, as each partition loses file name uniqueness. This leads to the more general issue of clearly defining the namespace such that any split and join results in uniquely identifiable files and names. There are several approaches to resolving this: utilizing the partition name (the easiest and often most straightforward), generating and tracking unique identifiers for each name, or creating names based on content-derived hashes. Each of these methods prevents collisions within a single Continuously Divisible Job application, but with the possibility of other executions and concurrently running instances, these methods have varying success and should be considered carefully.
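One lightweight way to combine the partition-name and content-hash approaches is sketched below; the function name and naming pattern are purely illustrative.

```python
import hashlib

def partition_name(dataset, first_slice, last_slice, data_bytes):
    """Illustrative naming scheme: combine the source dataset name, the slice
    range, and a short content-derived hash so concurrent splits cannot collide."""
    digest = hashlib.sha256(data_bytes).hexdigest()[:12]
    return f"{dataset}.{first_slice}-{last_slice}.{digest}.part"
```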

7.4.4.3 Execution Sandbox

Similar to job namespaces, many applications operate naively in an execution environment. Naive environment usage is common where standard data is used or when the application relies on complex configurations of libraries and references. Common examples of this could be using hardcoded paths and file names for inputs or resolving reference databases from environment variables. In these cases and more, it is likely the application will not operate correctly when run concurrently in the same namespace, whether file namespaces, process namespaces, or both. As a result, it may be necessary to run an application in a sandbox to isolate both the file and process namespaces. The common nature of this problem has produced many solutions, such as using containers[74, 89], sandboxes, and wrappers[56] to isolate each application instance.

7.4.4.4 Job Ordering

Continuously Divisible Job applications require the implementation of the join operation, which incorporates combining and consolidating application results. The type and structure of this output data can have dramatic effects on the overall performance of the applications. To illustrate this, let us consider two different potential applications, X and Y. X is a large data-parallel analysis, where future pipeline steps require sorted result ordering. Joining X jobs together requires only combining contiguous jobs. Further, for performance, the application only joins from the first job upward, appending, to prevent repeatedly writing and re-writing data as out-of-order jobs are joined. Y is a complex simulation with small input and output datasets consisting mainly of statistics. Joining Y jobs requires only combining the statistics and can be done completely out-of-order. This allows Y to quickly join results as they are completed and retrieved.
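A minimal sketch of the in-order strategy used by an application like X is shown below; the helper name and the assumption that each job knows its first slice index and slice count are illustrative, not part of the actual implementations.

```python
def append_contiguous(completed, next_expected, write_output):
    """Sketch of an in-order join: `completed` maps a job's first slice index
    to the finished job. Append a job's output only when it is the next
    contiguous piece, so earlier output is never rewritten as out-of-order
    jobs arrive."""
    while next_expected in completed:
        job = completed.pop(next_expected)
        write_output(job)            # append this job's results to the output
        next_expected += job.size    # advance past the slices this job covered
    return next_expected
```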

7.5 Virtual File Abstraction

A virtual file defines a subset of a physical file, allowing large data files to be logically partitioned quickly. A virtual file points to a source file, keeps bounds on the logical slices (logical offset and range), and can quickly resolve a slice's actual location in the large file (byte offset and range). Virtual files offer a lightweight mechanism for partitioning and sub-referencing larger datasets, without the need to copy out the actual subset of data. To facilitate quick repeated resolution from a logical slice to a physical position, an index should be constructed. Using this quick translation, a subset of the larger file can be realized as a physical file just prior to use. As the physical offset is only needed prior to file realization, data can be partitioned, re-partitioned, or joined with little actual computational cost. Using virtual files, an application can adapt quickly to performance feedback, without making redundant copies of data.

7.5.1 Operations on Virtual Files

Virtual files have several operations that allow for faster or more lightweight operation on data than standard files. Operations such as indexing, partitioning, location, and instantiation can be done on standard data files, but introduce considerable overhead in file system activity and redundant work. In addition to the core virtual file operations, virtual file serialization is also key to allowing the recursive partitioning expected of jobs.

7.5.1.1 Indexing

Indexing a virtual file, as with any index, parses the origin data file and tracks the location of each logical slice in the data source. If the logical slices are uniform in size, the indexing step is quick and only the size of a slice needs to be tracked. If, however, the data is non-uniform in size, the byte offset for each slice needs to be tracked. This indexing step may be time-intensive initially, but as more partitions are created the cost is amortized over execution by avoiding file accesses and redundant data copies. Though it is possible to store this information in memory, it is advisable to use a more compact persistent representation that can be distributed and recovered. Due to the speed provided by indices, many applications and formats have existing methods for creating and accessing indexes, which can be leveraged for an application's virtual file.

7.5.1.2 Partitioning

Partitioning allows for quick logical splitting of a data source into virtual files. This partitioning can be done with no handling of the underlying data, using only logical slices. Partitioning at the virtual file level is much faster than partitioning a physical file because it handles logical slices that are resolved as needed. Resolving the byte range can be done lazily, when the range is actually needed, to limit accesses to the index. Operating on virtual files allows partitions to be merged or further split with less concern for the overhead of reading and writing the actual data. Removing the need to read and write for each partition allows job coordinators to group or further split applications with less overhead.

7.5.1.3 Index Look-up

Index look-up provides the physical location of a logical slice. This is used by partitioning and realization to find a slice location, but can also provide this information to the underlying application. An example of this can be seen later, where modifications were made to the BWA application to accept byte offsets for work. This look-up allowed BWA to jump to the relevant location prior to analysis, eliminating the need to create partition files entirely.

7.5.1.4 Instantiate

Instantiation writes a logical range of slices into a physical file. For applications that do not accept the offsets and ranges of a virtual file, the virtual file must be instantiated as a physical file. In a local system, file instantiation simply uses index look-up on the first slice and the last slice, writing the returned byte range to a new file. As more complex job coordinators are used, there may be cases where the data source is not available, such as during remote execution, and the realization needs to carry context. A virtual file instantiation not only copies the defined source data, but also creates a copy of the index and its new data offset. This allows the new offset to be quickly computed using the index look-up minus the sub-data file offset. This combination of the index and sub-data files allows virtual files to be used in the same recursive partitioning and distribution as Continuously Divisible Job applications.
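The operations above could be realized roughly as in the following sketch, where the index stores one byte offset per logical slice plus an end-of-file sentinel. The class, its methods, and the caller-supplied record-start predicate are illustrative assumptions, not the implementation measured later in this chapter.

```python
class VirtualFile:
    """Illustrative virtual file: a logical slice range over a source file,
    resolved to byte offsets through an index only when needed."""

    def __init__(self, source, index, first, last):
        self.source = source   # path to the physical data file
        self.index = index     # byte offset of each logical slice, plus a sentinel
        self.first = first     # first logical slice (inclusive)
        self.last = last       # last logical slice (exclusive)

    @classmethod
    def index_source(cls, source, is_record_start):
        """Build an index by recording the byte offset of each logical slice."""
        offsets, offset = [], 0
        with open(source, "rb") as f:
            for line in f:
                if is_record_start(line):
                    offsets.append(offset)
                offset += len(line)
        offsets.append(offset)             # sentinel: end of file
        return cls(source, offsets, 0, len(offsets) - 1)

    def partition(self, size):
        """Lightweight logical partitioning: no data is read or copied."""
        return [VirtualFile(self.source, self.index, i, min(i + size, self.last))
                for i in range(self.first, self.last, size)]

    def lookup(self):
        """Index look-up: the physical byte range covered by this slice range."""
        return self.index[self.first], self.index[self.last]

    def instantiate(self, path):
        """Realize the logical range as a physical file just before use."""
        start, end = self.lookup()
        with open(self.source, "rb") as src, open(path, "wb") as dst:
            src.seek(start)
            dst.write(src.read(end - start))
        return path
```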

7.5.1.5 Serialization

Serialization of a virtual file allows data files to be partitioned remotely, even when the source data is unavailable. This serialization is similar to the Continuously Divisible Job operations, and should define how to reconcile the data partition within the data source. As such, a complex hierarchy of how the partition was created is unnecessary; only the source data, the index, and the partition's relative location are needed. This information can be used to resolve the correct data offset, as well as to inform job coordinators how to retrieve the data remotely from the data source.

Figure 7.4. File usage of normal and virtual file implementations. In the normal case, the application directly opens and navigates the file. In the virtual file case, the reads are directed through the virtual file, using either an index or a query to resolve and redirect to the read location. This approach introduces the overhead of creating the index and resolving the data. However, as these applications are partitioned for concurrent execution, the normal usage requires expensive physical partitions, while partitioning of virtual files is essentially free, allowing concurrent use of the data.

7.6 Implementation

This section discusses the abstract implementations of the example Continuously Divisible Job applications. The implementations for all aspects of Continuously Divisible Job were done in Python, and the results below are based on this design.

Figure 7.5. Performance of static partitions, Continuously Divisible Job, and Continuously Divisible Job using virtual files. Jobs are partitioned as previous work is finished, allowing the size to dynamically find a stable partition while all of the cores remain busy. As a combination of lightweight partitioning and direct data access, we can see the virtual file implementation performs consistently better. The base Continuously Divisible Job, however, sees a bump in execution time from the added cost of redundant file writes.

In addition to the application implementations, we also discuss several ways that virtual files could be implemented and used.

7.6.1 Example Continuously Divisible Job Application

For this chapter, two application implementations were written. The first is a bioinformatics alignment tool, BWA, that is a common tool in genome annotation pipelines. BWA was selected as it compares each query sequence against a reference dataset, allowing the query dataset to be partitioned down to a single sequence. The second is a high-energy physics event analysis for detecting dimuon event candidates. This application serves as an investigation into using the complex ROOT format for concurrent event analysis and compares with a SQL-based virtual file implementation. For both example applications this section will outline how the interface was implemented, along with design challenges that needed to be addressed.

Figure 7.6. Performance of ROOT and SQL Dimuon detection methods. To evaluate both the benefit of the virtual file and dynamic sizing, each implementation was run using static partitions and then dynamic partitions. We can see that in the static case, the SQL implementation provides a consistent advantage over the ROOT file. However, in the dynamic case, as the initial partitions get smaller the difference shrinks. This is due to ROOT's high overhead more quickly pushing toward a better partition.

For the BWA interface two approaches were taken: a simple direct partitioning approach and an indexed virtual file approach. For the simple direct approach, split relies on the standard fastq format, which is commonly partitioned. Each split results in new jobs, each with a unique data file. When called to execute, the job invokes BWA, passing its unique data slice and the intended output as arguments. After completing with results to join, the simple case checks that the joining jobs are in the same state, returning if they differ.

If the data has not been processed, the inputs are combined. If the data has been processed, then the results are combined. To increase efficiency, the outputs are combined in order from the first slice on, limiting the number of redundant appends. The second approach relies on an indexed virtual file. When this approach is initialized, an index is created for the source data. After this, split is a lightweight call that partitions logical ranges, without handling physical data. To further exploit the indexed data, modifications were made to BWA that utilized byte ranges to quickly seek to the intended data. This implementation removed BWA's requirement for physically partitioned data, allowing minimal file manipulation. The join functionality remains the same, as the outputs created are identical.

We also use the Continuously Divisible Job abstraction to detect dimuons in a High Energy Physics (HEP) analysis. In a typical HEP task, events from a detector are analyzed one at a time, collecting diverse statistics. Events are recorded in a tree-like structure in which statistics are grouped into particles, which in turn are grouped into events, luminosity sections, and so on, which are stored in ROOT files[15]. In our implementation, jobs are given a range of events to process. To feed data to the jobs, we experimented with two different virtual file implementations. In the first one, each job is submitted with the same ROOT file, and ranges of events are read using uproot, a tool developed by the DianaHEP project[96] that converts the tree-like structure of the ROOT file into ragged numpy arrays. For the second one, we flatten the ROOT file into an SQLite database, in which each row encodes a particle in some event. For each, serialization was trivial and relied on converting dictionaries of class information into JSON using Python.
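As an illustration of the simple direct approach described above for BWA, the sketch below writes physical partitions of a fastq query file, assuming the common four-line-per-record layout; the function name and output naming are hypothetical, not the code used in the evaluation.

```python
def split_fastq(path, out_prefix, reads_per_part):
    """Sketch of the naive split-on-partition approach: write each group of
    reads to its own physical file for a separate BWA invocation."""
    parts, out = [], None
    record, read_count = [], 0
    with open(path) as src:
        for line in src:
            record.append(line)
            if len(record) == 4:                       # one complete read
                if read_count % reads_per_part == 0:   # start a new partition
                    if out:
                        out.close()
                    out = open(f"{out_prefix}.{len(parts)}.fastq", "w")
                    parts.append(out.name)
                out.writelines(record)
                record, read_count = [], read_count + 1
    if out:
        out.close()
    return parts
```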

7.6.2 Virtual Files

Virtual files can be implemented in several ways, such as directly in the application by physical offset and range resolved from slices (used in BWA), in an API that copies data when needed, or by the filesystem redirecting virtual files to sections of larger files. An example of the first was implemented in BWA, which allows BWA to process a section of the data without having to write out a sub-data file, limiting both the space needed to operate on a sub-data file and the time needed to partition the data. In the second case, a structure is needed to represent the data as partitions are defined and moved around. Using Work Queue[16], a virtual file can be specified for a job and realized at the worker without ever directly writing the copy at the master process. As the file already needs to be read for transfer, the intermediate step of writing out the sub-data is skipped. The third case could be implemented in the filesystem, where a new file type is created similarly to a symbolic link. In addition to the link to the origin file, an offset and range define the size. This implementation would interpose an EOF at the range to support normal file usage. Additionally, it would be prudent to force read-only semantics on virtual files and their origin counterparts to prevent invalid offsets and ranges, as well as changes to the intended data.

7.7 Results

For the analysis of the Continuously Divisible Job abstraction, we compared several of the design features discussed previously. The majority of these results were gathered using the BWA implementation, for which we had a moderately sized dataset. The results for this chapter were evaluated on a 250 MB subset of the dataset. For the static partitioning results that are compared against the Continuously Divisible Job implementations, a Makeflow BWA workflow was used. The structure of this was a simple split-join workflow that is common in bioinformatics. Makeflow was used as it supports similar execution platforms, has little overhead, and provides native multi-core execution. The following results will show:

(a) Under good configurations, execution time is similar to static partitioning.

(b) Under bad configurations dynamic sizing can find better configurations.

(c) Virtual files provide better performance even with static partitioning.

(d) In tiered, but uncoordinated configurations, resources can be under-utilized.

7.7.1 Dynamic Sizing

The first test we wanted to investigate was the effect of dynamic sizing using Continuously Divisible Jobs. As can be seen in Figure 7.5, we are comparing static batch partitioning and on-demand dynamic job partitioning. For each data point the jobs were run concurrently on 8 cores. The value on the X axis is the initial partition size. The dynamic sizing is based on a basic hill-climb algorithm that attempts to find the size with the highest throughput. As was briefly discussed earlier, the static partition's execution time is limited at the right by under-utilizing the available cores, and at the left by the increased overhead of file creation and job management. For the Continuously Divisible Job implementations, we compare the base implementation, with file creation on split, and the indexed virtual file approach that directly accesses the data. In both cases, we see that the dynamic sizing allows initial bad configurations to be escaped, leading to better performance at the left of the graph. For the virtual file implementation, we see consistently better performance, as it benefits from less file access and increased flexibility. In the base implementation, in the middle section of the graph we see worse performance than static partitioning, with the cost of partitioning and re-partitioning data competing with the benefit of dynamic sizing, but when compared to the orders of magnitude worse behavior at the left it may be a reasonable compromise.
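The text describes the dynamic sizing only as a basic hill climb on throughput; one possible sketch of such a policy is shown below, where the step factor, bounds, and function name are assumptions for illustration rather than the exact algorithm used.

```python
def next_partition_size(history, step=2.0, min_size=1, max_size=1_000_000):
    """Illustrative hill climb: `history` is a list of (size, throughput) pairs
    from completed batches. Keep moving partition size in the direction that
    improved throughput; reverse direction when throughput drops."""
    size, _ = history[-1]
    if len(history) < 2:
        return min(max_size, int(size * step))     # probe upward first
    (prev_size, prev_tp), (last_size, last_tp) = history[-2:]
    moved_up = last_size >= prev_size
    improved = last_tp >= prev_tp
    grow = moved_up == improved                    # keep direction if it helped
    new_size = size * step if grow else size / step
    return max(min_size, min(max_size, int(new_size)))
```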

For evaluating the dimuon detection implementation, we compared the ROOT and SQL approaches, as well as static and dynamic partitioning. Similarly to the results we saw with BWA, when looking at Figure 7.6 we can see that the dynamic sizing allows both implementations to perform consistently, avoiding the increasing execution time of smaller static partitions. Interestingly, the gap between the SQL and ROOT implementations shrinks when using dynamic sizing. This is likely the result of ROOT's higher overhead, causing the dynamic sizing to shift more quickly.

7.7.2 Virtual File Effectiveness

Figure 7.7. Performance of virtual files with static partitions. This shows a standard BWA bioinformatics workflow where only the size of each partition is varied. This is compared with similarly statically partitioned Continuously Divisible Job implementations. We can see that lightweight partitioning and dynamic joining help them outperform the fully static case.

For the second test we wanted to isolate the effect of virtual files when using static batch partitioning. The static batch partitioning eliminates the dynamic sizing and just compares the benefits of virtual files. This test disadvantages the virtual file implementation, as the data is still accessed and indexed, but the limited partitions do not allow the cost to be amortized. As can be seen in Figure 7.7, we compare static batch partitioning between a static workflow, the base BWA implementation, and the virtual file implementation. For each data point the jobs were run concurrently on 8 cores. The value on the X axis is the static partition size. It is important to note that these results are shown in log scale. The static workflow approach shows the same behavior as before, and we now see similar trends in both the base and virtual file implementations. The interesting result that can be seen is that for both Continuously Divisible Job implementations the results were consistently below the workflow approach, a result of the dynamic joining and lighter-weight partitioning. The gap between these approaches is consistent at log scale, showing the increased benefit in poor configurations. Additionally, the performance of the virtual file approach was consistently better than the base approach, showing that by limiting the file partitions we gain a consistent performance benefit. Similarly, when comparing the dimuon implementations we can clearly see advantages to the SQL approach, with consistently better performance (Figure 7.6).

7.7.3 Tiered Sizing

The last test we wanted to explore was the effect of tiered sizing, and how it can be either beneficial or detrimental. In the previous dynamic sizing test, only a single layer was adjusting the sizing. For this test, we used a master-worker framework (Work Queue) to partition at the master level and again for cores at the worker. The results can be seen in Figure 7.8. In this graph we are again comparing against static partitioning. Each of the subsequent lines is grouped by the initial master partition size and graphed along the initial worker size.

Figure 7.8. Comparison of tiered performance for several configurations. This compares static partitioning against several configurations of Continuously Divisible Jobs using virtual files. Each line is grouped by the initial master size and the X axis shows the initial worker size. As can be seen across each case, at the right the limited worker partitioning under-utilizes the cores. To the left, the small starting size suffers from initial splitting overhead.

Looking at these results, we can see that though it was able to find reasonable configurations in many cases, at the edges of the search space the combined dynamic partitioning competed with itself, slowing any corrective movement. For example, to the right of the graph, performance was limited by the worker's partition being larger than the master's. This forced only a single core to be used, wasting resources. This approach shows that job coordinators can be quickly combined, but more exploration is needed to understand how to avoid negative feedback.

7.8 Conclusion

I have introduced the concept of Continuously Divisible Jobs and discussed how dynamic sizing can be used to address the limitations of static partitioning. I show how implementing the Continuously Divisible Job interface allows applications to be dynamically partitioned, executed, and distributed to dynamic resources using abstract jobs and job coordinators. To further leverage this approach, we introduced virtual files and explored how they can be leveraged to provide lightweight partitioning and fast data access, and to eliminate redundant reads and writes.

Continuously Divisible Jobs builds on the ideas presented throughout this dissertation. Combining the lessons learned in Chapter 3 and Chapter 6, the API was defined to allow a static dataset to be partitioned dynamically and at will, which provides flexible, responsive sizing. The job coordinator, as well as the design considerations, take into account the need for resource management as laid out in Chapter 4. Finally, the complete design of Continuously Divisible Jobs relies on the underlying sandbox model of execution, as described in Chapter 5. Further, because this is built on the sandbox model of execution, jobs executed through a job coordinator can be nested and transformed as needed for execution on a number of platforms.

CHAPTER 8

CONCLUSIONS AND BROADER IMPACT

8.1 Recapitulation

In the area of scientific computing, data is growing faster than any single machine can analyze in a reasonable time. As a result, scientists are looking for ways to scale up their computation and accelerate research. Scientists have limited experience with workflow computing and need options that are direct and straightforward to implement, as opposed to more complex and nuanced solutions such as MPI. Workflows provide a clear path for domain scientists to increase their concurrency without needing to completely re-engineer their current software.

I have argued that building abstractions into the underlying models allows for highly flexible workflows that can be adapted to new data, sites, and configurations. Workflow abstractions allow the pure scientific intent to be declared without the nuance or clutter associated with run-time execution. Using this core workflow, abstractions can be combined to provide a responsive and flexible execution. Relying on the domain scientist to describe the pure scientific intent allows applications to be used directly, without re-writing the analysis.

I first explored this using static workflows, where I described several methods for abstracting workflow structure. I achieved flexibility for different data structures (size, density, or configuration) with dynamic workflow expansion, which allows a core workflow to quickly be molded and tuned for new data. These expanded workflows now use more dynamic resources.

To control this, I show how workflows can be statically analyzed for expected need and dynamically managed to execute within expected bounds. Coupling both methods, a user can leverage a large number of resources. The flexibility provided by workflow expansion enables the use of new resources and sites. To accommodate this movement, I illustrate an algebra that allows workflows to be transformed as needed. This algebra breaks down the definition of a task and shows how transformations can be quickly applied. Mapping this process across all tasks allows the workflow to be adapted as needed.

Having shown the flexibility gained with static workflows, I applied these methods to dynamic workflows. Dynamic workflows, by their nature, provide responsive sizing, but tightly couple the application to the workflow design. I showed how decoupling (abstracting) the workflow from the application allows transformations to be applied without losing the responsiveness. Taking this a step further, the decoupling allows the application to use built-in parallelism alongside the workflow's concurrency.

Though decoupling the dynamic workflow from the underlying application decreases the difficulty of implementation, the barrier to entry is still much higher than for static workflows. A middle ground is achieved using Continuously Divisible Jobs, which provides an easier implementation alongside the responsive sizing of dynamic workflows. I proposed and demonstrated how Continuously Divisible Jobs allow an application to define five functions and execute in a dynamic and responsive manner. A key benefit of this approach is that it abstracts the application definition from the intricacies of execution, which are provided by a job coordinator. Ideally, this allows users to define an application, couple it with data, and use the site's provided job coordinator without knowing the underlying configuration.

Through the methods and techniques I explored, workflows can be designed closer to the user's technical knowledge without losing flexibility and performance. Lowering the barrier to entry for users is the key to facilitating research in all fields, and the goal to be attained.

Looking at scientific computing more broadly, data and the need for computation will continue to grow as new and faster methods are discovered. Likewise, the scale of computation being produced to meet this need is becoming more complex and varied. Gone are the days that a single machine will be sufficient for a complete dataset or analysis. It is unclear if the data production or available computational power will be larger in the end, but what is becoming obvious is that as both grow, the interface between them becomes increasingly complex. This complexity has outstripped the ability of most domain scientists to have the time or experience to harness or fully utilize these emerging technologies. These domain scientists turn to distributed computing solutions, such as workflows, to bridge that gap, acting as a negotiator of resources and computation.

Solutions like Continuously Divisible Jobs offer an avenue of communication between the specifics of an individual's research and the ever-increasing landscape of resources, whether it be a campus cluster, a national-scale supercomputer, or corporate cloud infrastructure. The flexibility of Continuously Divisible Jobs crucially provides a common interface that is powerful enough that an abstract job can be migrated, scaled, and distributed to almost any resource. The clear expectations of a job definition also make implementing a job coordinator easy, and the API provides enough flexibility that the job coordinator can adapt to performance.

Overall, abstractions are becoming increasingly important as they allow research at both ends of the spectrum, from domain scientists doing critical research to system administrators scaling with new technologies, to communicate and scale. Providing flexible interfaces that facilitate both ends, such as Continuously Divisible Jobs, is crucial to the continued growth and success of scientific computing.

8.2 Future Work

8.2.1 Continuously Divisible Jobs

In this work, I introduced the idea of Continuously Divisible Jobs. The implementation presented here looks at single applications, abstracting and distributing their computation. This structure loosely translates to a simple split-join workflow, with no consideration of future analysis with the data. Compute time is spent on re-organizing and combining results, which may then be re-partitioned for future computation. Eliminating this unnecessary computation would allow for a more efficient and flexible pipeline, but a language is needed to address these relationships.

Extending the Continuously Divisible Jobs API to describe how a partition relates to the larger dataset would allow for more implicit concurrency. Some computation is by its nature completely data-independent, while other computation looks at the dataset holistically. What does a specification look like to describe this in a meaningful way? The current implementation implicitly defines the relationship between a data format and an application. To allow for more flexible data and computation handling, a clear characterization of how an application interacts with input and output formats should be explored.

If applications have a definition that explicitly states the parity of inputs to computation to outputs, applications can be combined dynamically and scalably with little user interaction, relying on the characterization to determine if the partitions are computationally valid. In its current state, much of the time developing and testing workflows has been invested in checking the validity of data manipulations, which should ideally be defined by the application developer. If the Continuously Divisible Jobs API were extended to more clearly define format-application parities, abstract data pipelines could be defined that scale to available resources and performance.

8.2.2 Nested Workflows and Transformations

A major consideration during this work was how to consistently and safely apply transformations to workflows. I think the methods developed for applying transformations to a workflow can be harnessed to provide hierarchy to workflows. Hierarchy in workflows is intended in two ways: first, by allowing for a richer definition of workflow structure, and second, by allowing nested workflows. Nested workflows have been supported one way or another in Makeflow for several years, with a loose definition of how to correctly communicate between layers of execution.

The workflows used and described throughout this dissertation are by definition a flat set of rules that are translated into a DAG. However, the user writing the workflow has an implicit set of expectations for the workflow's structure, such as analysis pipelines or resource groups. In Makeflow, categories allow for rules to be grouped together by expected resource use, but any more nuanced groupings are unavailable. Extending Makeflow to support hierarchical workflows would allow for more structurally meaningful grouping and staging to be performed.

In well-written JX Makeflows, an analysis pipeline is mapped to each set of data partitions. This hierarchical structure should be preserved and utilized to batch or pre-stage tasks to workers with relevant data. This will cut down on time spent pushing and pulling data with no computation underway. Mechanisms exist in Work Queue and Makeflow to over-stage jobs, but there is little consideration given to doing this based on pipelines.

If hierarchical structure can be studied and well supported within Makeflow, workflows themselves can be nested and reasoned about in the same way. Ideally, a nested workflow behaves similarly to an analysis pipeline, but with a wider set of rules and resource needs. In addition to this nested structure, a clear specification for communicating between workflow levels would be beneficial. This is by no means a trivial task, and requires careful thought of how to handle questions such as resource allocations, environment specifications, and possibly deeply nested workflows.

8.3 Impact: Software and Publications

The work throughout this dissertation was peer-reviewed and presented at several different conferences and journals. The work outlined in Chapter 3 was presented as a poster at CCGrid 2014[57], where it won best poster, and at the IEEE International Conference on eScience 2015[58]. The work outlined in Chapter 4 was published in IEEE Transactions on Parallel and Distributed Computing[60]. The work outlined in Chapter 5 was presented at the IEEE International Conference on eScience 2018[56]. The work outlined in Chapter 6 was presented at the IEEE International Conference on Cloud Engineering[59]. Finally, the work outlined in Chapter 7 was presented at the IEEE International Conference on eScience 2019[61].

The software developed through this work is supported by the Cooperative Computing Lab and distributed within the CCTools package. Additionally, example workflows for both Makeflow[55] and Work Queue are preserved in separate repositories to allow for much of this research to be reproduced.

BIBLIOGRAPHY

1. E. Afgan, D. Baker, N. Coraor, B. Chapman, A. Nekrutenko, and J. Taylor. Galaxy cloudman: delivering cloud compute clusters. BMC bioinformatics, 11 (Suppl 12):S4, 2010.

2. M. Albrecht, P. Donnelly, P. Bui, and D. Thain. Makeflow: A Portable Abstraction for Data Intensive Computing on Clusters, Clouds, and Grids. In Workshop on Scalable Workflow Enactment Engines and Technologies (SWEET) at ACM SIGMOD, 2012.

3. I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, and S. Mock. Kepler: an extensible system for design and execution of scientific workflows. In Scientific and Statistical Database Management, 2004. Proceedings. 16th International Conference on, pages 423–424. IEEE, 2004.

4. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215(3):403–410, Oct 1990.

5. P. Amstutz, M. R. Crusoe, N. Tijani, B. Chapman, J. Chilton, M. Heuer, A. Kartashov, D. Leehr, H. Ménager, M. Nedeljkovich, and et al. Common workflow language, v1.0, Jul 2016. URL https://figshare.com/articles/Common_Workflow_Language_draft_3/3115156/2.

6. P. Amstutz, M. R. Crusoe, N. Tijani, B. Chapman, J. Chilton, M. Heuer, A. Kartashov, D. Leehr, H. Ménager, M. Nedeljkovich, M. Scales, S. Soiland-Reyes, and L. Stojanovic. Common Workflow Language, v1.0, 7 2016.

7. T. G. Armstrong, J. M. Wozniak, M. Wilde, and I. T. Foster. Compiler techniques for massively scalable implicit task parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 299–310, Piscataway, NJ, USA, 2014. IEEE Press. ISBN 978-1-4799-5500-8. doi: 10.1109/SC.2014.30. URL https://doi.org/10.1109/SC.2014.30.

8. Arun Ramakrishnan, Gurmeet Singh, Henan Zhao, Ewa Deelman, Rizos Sakel- lariou, Karan Vahi, Kent Blackburn, David Meyers and Michael Samidi. Scheduling data-intensive workflows onto storage-constrained distributed re- sources. Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid), 2007.

152 9. Y. Babuji, A. Woodard, Z. Li, D. S. Katz, B. Clifford, R. Kumar, L. Lacinski, R. Chard, J. M. Wozniak, I. Foster, M. Wilde, and K. Chard. Parsl: Perva- sive parallel programming in python. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’19, pages 25–36, New York, NY, USA, 2019. ACM. ISBN 978-1-4503-6670-0. doi: 10.1145/3307681.3325400. URL http://doi.acm.org/10.1145/3307681. 3325400.

10. Y. N. Babuji, A. Woodard, Z. Li, D. S. Katz, B. Clifford, R. Kumar, L. Lacinski, R. Chard, J. M. Wozniak, I. T. Foster, M. Wilde, and K. Chard. Parsl: Pervasive parallel programming in python. CoRR, abs/1905.02158, 2019. URL http: //arxiv.org/abs/1905.02158.

11. M. Bauer, S. Treichler, E. Slaughter, and A. Aiken. Legion: Expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 66:1–66:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. ISBN 978-1-4673-0804-5. URL http://dl.acm.org/citation.cfm?id=2388996.2389086.

12. D. Blankenberg, G. V. Kuster, N. Coraor, G. Ananda, R. Lazarus, M. Mangan, A. Nekrutenko, and J. Taylor. Galaxy: A web-based genome analysis tool for experimentalists. Current protocols in molecular biology, pages 19–10, 2010.

13. G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault, and J. J. Dongarra. Parsec: Exploiting heterogeneity to enhance scalability. Computing in Science Engineering, 15(6):36–45, Nov 2013. doi: 10.1109/MCSE.2013.98.

14. R. Brun and F. Rademakers. ROOT: An object oriented data analysis framework. Nucl. Instrum. Meth., A389:81–86, 1997. doi: 10.1016/S0168-9002(97)00048-X.

15. R. Brun and F. Rademakers. ROOT: an object oriented data analysis framework. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 389:81–86, 04 1997. doi: 10.1016/S0168-9002(97)00048-X.

16. P. Bui, D. Rajan, B. Abdul-Wahid, J. Izaguirre, and D. Thain. Work Queue + Python: A Framework For Scalable Scientific Ensemble Applications. In Workshop on Python for High Performance and Scientific Computing (PyHPC) at the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (Supercomputing) , 2011.

17. M. S. Campbell, C. Holt, B. Moore, and M. Yandell. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics, 48:1–39, Dec 2014.

18. B. L. Cantarel, I. Korf, S. M. Robb, G. Parra, E. Ross, B. Moore, C. Holt, A. Sanchez Alvarado, and M. Yandell. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res., 18(1):188–196, Jan 2008.

19. W. Chen and E. Deelman. Partitioning and scheduling workflows across multiple sites with storage constraints. In 9th International Conference on Parallel Processing and Applied Mathematics, 2011. URL http://pegasus.isi.edu/publications/2011/WChen-Partitioning_and_Scheduling.pdf. Funding Acknowledgements: NSF IIS-0905032.

20. W. Chen and E. Deelman. Workflow overhead analysis and optimizations. In 6th Workshop on Workflows in Support of Large-Scale Science (WORKS 11), 2011.

21. W. Chen and E. Deelman. Partitioning and Scheduling Workflows across Multiple Sites with Storage Constraints, pages 11–20. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-31500-8. doi: 10.1007/978-3-642-31500-8_2. URL http://dx.doi.org/10.1007/978-3-642-31500-8_2.

22. W. Chen and E. Deelman. Integration of workflow partitioning and resource provisioning. In The 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), 2012.

23. W. Chen, R. Ferreira da Silva, E. Deelman, and R. Sakellariou. Balanced task clustering in scientific workflows. In 9th IEEE International Conference on e-Science (eScience 2013), 2013. URL http://pegasus.isi.edu/publications/2013/chen-clustering-escience2013.pdf. Funding Acknowledgements: NSF IIS-0905032 and NSF FutureGrid 0910812.

24. W. Chen, R. Ferreira da Silva, E. Deelman, and R. Sakellariou. Using imbalance metrics to optimize task clustering in scientific workflow executions. Future Generation Computer Systems, 46:69–84, 2015. doi: 10.1016/j.future.2014.09.014. URL http://pegasus.isi.edu/publications/2014/2014-fgcs-chen.pdf. Funding Acknowledgements: NSF IIS-0905032 and NSF FutureGrid 0910812.

25. W. Chen, R. Ferreira da Silva, E. Deelman, and T. Fahringer. Dynamic and fault-tolerant clustering for scientific workflows. IEEE Transactions on Cloud Computing, 4(1):49–62, 2016. doi: 10.1109/TCC.2015.2427200. URL http://pegasus.isi.edu/publications/2015/chen-tcc-2015.pdf. Funding Acknowledgements: NSF IIS-0905032, NSF ACI SI2-SSI 1148515, and NSF FutureGrid 0910812.

26. A. Chervenak, E. Deelman, M. Livny, M. Su, R. Schuler, S. Bharathi, G. Mehta, and K. Vahi. Data placement for scientific applications in distributed environments. In 2007 8th IEEE/ACM International Conference on Grid Computing, pages 267–274, Sep. 2007. doi: 10.1109/GRID.2007.4354142.

27. O. Choudhury, N. Hazekamp, D. Thain, and S. Emrich. Accelerating comparative genomics workflows in a distributed environment with optimized data partitioning. In C4Bio Workshop at CCGrid, 2014.

28. O. Choudhury, D. Rajan, N. Hazekamp, S. Gesing, D. Thain, and S. Emrich. Balancing Thread-level and Task-level Parallelism for Data-Intensive Workloads on Clusters and Clouds. In IEEE Conference on Cluster Computing, 2015.

29. O. Choudhury, D. Rajan, N. Hazekamp, S. Gesing, D. Thain, and S. Emrich. Balancing thread-level and task-level parallelism for data-intensive workloads on clusters and clouds. In IEEE Cluster, 2015.

30. P. J. Cock, C. J. Fields, N. Goto, M. L. Heuer, and P. M. Rice. The sanger fastq file format for sequences with quality scores, and the solexa/illumina fastq variants. Nucleic acids research, 38(6):1767–1771, 2010.

31. P. Couvares, T. Kosar, A. Roy, J. Weber, and K. Wenger. Workflow Management in Condor, pages 357–375. Springer London, London, 2007. ISBN 978-1-84628-757-2. doi: 10.1007/978-1-84628-757-2_22. URL https://doi.org/10.1007/978-1-84628-757-2_22.

32. J. D. de St. Germain, J. McCorquodale, S. G. Parker, and C. R. Johnson. Uintah: a massively parallel problem solving environment. In Proceedings the Ninth International Symposium on High-Performance Distributed Computing, pages 33–41, 2000. doi: 10.1109/HPDC.2000.868632.

33. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association. URL http://dl.acm.org/citation. cfm?id=1251254.1251264.

34. E. Deelman and A. Chervenak. Data management challenges of data-intensive scientific workflows. In 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pages 687–692, May 2008. doi: 10.1109/ CCGRID.2008.24.

35. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, S. Koranda, A. Lazzarini, G. Mehta, M. A. Papa, and K. Vahi. Pegasus and the pulsar search: From metadata to execution on the grid. In R. Wyrzykowski, J. Dongarra, M. Paprzycki, and J. Waśniewski, editors, Parallel Processing and Applied Mathematics, pages 821–830, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. ISBN 978-3-540-24669-5.

36. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, and M. Livny. Pegasus: Mapping scientific workflows onto the grid. In M. D. Dikaiakos, editor, Grid Computing, pages 11–20, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg. ISBN 978-3-540-28642-4.

37. E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M.-H. Su, K. Vahi, and M. Livny. Pegasus: Mapping Scientific Workflows onto the Grid, pages 11–20. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-540-28642-4. doi: 10.1007/978-3-540-28642-4_2. URL http://dx.doi.org/10.1007/978-3-540-28642-4_2.

38. E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, B. Berriman, J. Good, A. Laity, J. Jacob, and D. Katz. Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming Journal, 13(3), 2005.

39. E. Deelman, S. Callaghan, E. Field, H. Francoeur, R. Graves, N. Gupta, V. Gupta, T. H. Jordan, C. Kesselman, P. Maechling, J. Mehringer, G. Mehta, D. Okaya, K. Vahi, and L. Zhao. Managing large-scale workflow execution from resource provisioning to provenance tracking: The cybershake example. In 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science’06), pages 14–14, Dec 2006. doi: 10.1109/E-SCIENCE.2006.261098.

40. E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P. Maechling, R. Mayani, W. Chen, R. F. da Silva, M. Livny, et al. Pegasus, a workflow management system for science automation. submitted to Future Generation Computer Systems, 2014.

41. E. Dolstra, M. de Jonge, and E. Visser. Nix: A safe and policy-free system for software deployment. In Proceedings of the 18th USENIX Conference on System Administration, LISA ’04, pages 79–92, Berkeley, CA, USA, 2004. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1052676.1052686.

42. E. Dolstra, A. Löh, and N. Pierron. NixOS: A purely functional linux distribution. J. Funct. Program., 20(5-6):577–615, Nov. 2010. ISSN 0956-7968. doi: 10.1017/S0956796810000195. URL http://dx.doi.org/10.1017/S0956796810000195.

43. M. M. Eshaghian and Y.-C. Wu. Mapping heterogeneous task graphs onto heterogeneous system graphs. In Heterogeneous Computing Workshop, 1997. (HCW’97) Proceedings., Sixth, pages 147–160. IEEE, 1997.

44. T. Gamblin, M. LeGendre, M. R. Collette, G. L. Lee, A. Moody, B. R. de Supinski, and S. Futral. The spack package manager: Bringing order to hpc software chaos. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 40:1–40:12, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3723-6. doi: 10.1145/2807591.2807623. URL http://doi.acm.org/10.1145/2807591.2807623.

45. W. Gentzsch. Sun grid engine: Towards creating a compute power grid. In Proceedings of the 1st International Symposium on Cluster Computing and the Grid, CCGRID ’01, pages 35–, Washington, DC, USA, 2001. IEEE Computer Society. ISBN 0-7695-1010-8. URL http://dl.acm.org/citation.cfm?id=560889.792378.

46. L. Gerhardt, W. Bhimji, S. Canon, M. Fasel, D. Jacobsen, M. Mustafa, J. Porter, and V. Tsulaia. Shifter: Containers for HPC. Journal of Physics: Conference Series, 898:082021, oct 2017. doi: 10.1088/1742-6596/898/8/082021. URL https://doi.org/10.1088%2F1742-6596%2F898%2F8%2F082021.

47. W. Gerlach, W. Tang, A. Wilke, D. Olson, and F. Meyer. Container orchestration for scientific workflows. In 2015 IEEE International Conference on Cloud Engineering, pages 377–378, March 2015. doi: 10.1109/IC2E.2015.87.

48. B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, J. Taylor, W. C. Miller, W. J. Kent, and A. Nekrutenko. Galaxy: a platform for interactive large-scale genome analysis. Genome research, 15(10):1451–1455, 2005.

49. G. Juve, B. Tovar, R. Ferreira da Silva, D. Krol, D. Thain, E. Deelman, W. Allcock, and M. Livny. Practical resource monitoring for robust high throughput computing. Workshop on Monitoring and Analysis for High Performance Computing Systems Plus Applications, 2015.

50. J. Goecks, A. Nekrutenko, J. Taylor, and T. G. Team. Galaxy: a comprehen- sive approach for supporting accessible, reproducible, and transparent compu- tational research in the life sciences. Genome Biol, 11(8):R86, 2010.

51. T. Goodale, S. Jha, H. Kaiser, T. Kielmann, P. Kleijer, G. von Laszewski, C. Lee, A. Merzky, H. Rajic, and J. Shalf. Saga: A simple api for grid applications. high-level application programming on the grid. Computational Methods in Science and Technology, 12:7–20, 01 2006. doi: 10.12921/cmst.2006.12.01.07-20.

52. G. Singh, C. Kesselman, and E. Deelman. Optimizing grid-based workflow execution. Journal of Grid Computing, 3(3-4):201–219, December 2005.

53. O. Harismendy, P. C. Ng, R. L. Strausberg, X. Wang, T. B. Stockwell, K. Y. Beeson, N. J. Schork, S. S. Murray, E. J. Topol, S. Levy, et al. Evaluation of next generation sequencing platforms for population targeted sequencing studies. Genome Biol, 10(3):R32, 2009.

54. T. Harter, B. Salmon, R. Liu, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Slacker: Fast distribution with lazy docker containers. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 181–195, Santa Clara, CA, 2016. USENIX Association. ISBN 978-1-931971-28-7. URL https://www.usenix.org/conference/fast16/technical-sessions/presentation/harter.

55. N. Hazekamp and D. Thain. Makeflow examples repository, 2017. URL http://github.com/cooperative-computing-lab/makeflow-examples.

56. N. Hazekamp and D. Thain. An Algebra for Robust Workflow Transformations. In IEEE International Conference on e-Science, page 12, 2018.

57. N. Hazekamp, O. Choudhury, S. Gesing, S. Emrich, and D. Thain. Poster: Expanding Tasks of Logical Workflows into Independent Workflows for Improved Scalability. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 548–549, 2014.

58. N. Hazekamp, J. Sarro, O. Choudhury, S. Gesing, S. Emrich, and D. Thain. Scaling Up Bioinformatics Workflows with Dynamic Job Expansion. In IEEE International Conference on e-Science, 2015.

59. N. Hazekamp, U. K. Devisetty, N. Merchant, and D. Thain. MAKER as a Service: Moving HPC applications to Jetstream Cloud. In IEEE International Conference on Cloud Engineering, page 6, 2018.

60. N. Hazekamp, N. Kremer-Herman, B. Tovar, H. Meng, O. Choudhury, S. Emrich, and D. Thain. Combining Static and Dynamic Storage Management for Data Intensive Scientific Workflows. IEEE Transactions on Parallel and Distributed Systems, 29(2):338–350, 2018.

61. N. Hazekamp, B. Tovar, and D. Thain. Dynamic Sizing of Continuously Divisible Jobs for Heterogeneous Resources. In IEEE International Conference on e-Science, 2019.

62. L. He, S. A. Jarvis, D. P. Spooner, and G. R. Nudd. Dynamic, capability-driven scheduling of dag-based real-time jobs in heterogeneous clusters, 2004.

63. R. L. Henderson. Job scheduling under the portable batch system. In Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing, IPPS ’95, pages 279–294, London, UK, UK, 1995. Springer-Verlag. ISBN 3-540-60153-8. URL http://dl.acm.org/citation.cfm?id=646376.689372.

64. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pages 295–308, Berkeley, CA, USA, 2011. USENIX Association. URL http://dl.acm.org/citation.cfm? id=1972457.1972488.

65. R. Hoque, T. Herault, G. Bosilca, and J. Dongarra. Dynamic task discovery in parsec: A data-flow task-based runtime. In Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA ’17, pages 6:1–6:8, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5125-6. doi: 10.1145/3148226.3148233. URL http://doi.acm.org/10.1145/3148226.3148233.

66. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the 2Nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys ’07, pages 59–72, New York, NY, USA, 2007. ACM. ISBN 978-1- 59593-636-3. doi: 10.1145/1272996.1273005. URL http://doi.acm.org/10. 1145/1272996.1273005.

67. P. Ivie and D. Thain. PRUNE: A Preserving Run Environment for Reproducible Computing. In IEEE Conference on e-Science, 2016.

68. J. C. Jacob, D. S. Katz, G. B. Berriman, J. C. Good, A. C. Laity, E. Deelman, C. Kesselman, G. Singh, M. Su, T. A. Prince, and R. Williams. Montage: a grid portal and software toolkit for science-grade astronomical image mosaicking. Int. J. Comput. Sci. Eng., 4(2):73–87, July 2009. ISSN 1742-7185. doi: 10.1504/IJCSE.2009.026999. URL http://dx.doi.org/10.1504/IJCSE. 2009.026999.

69. M. A. Jette, A. B. Yoo, and M. Grondona. Slurm: Simple linux utility for resource management. In Lecture Notes in Computer Science: Proceedings of Job Scheduling Strategies for Parallel Processing (JSSPP) 2003, pages 44–60. Springer-Verlag, 2002.

70. G. Juve and E. Deelman. Resource provisioning options for large-scale scientific workflows. In Proceedings of the 2008 Fourth IEEE International Conference on eScience, ESCIENCE ’08, pages 608–613, Washington, DC, USA, 2008. IEEE Computer Society. ISBN 978-0-7695-3535-7. doi: 10.1109/eScience.2008.160. URL http://dx.doi.org/10.1109/eScience.2008.160.

71. L. Kalé and S. Krishnan. CHARM++: A Portable Concurrent Object Oriented System Based on C++. In A. Paepcke, editor, Proceedings of OOPSLA’93, pages 91–108. ACM Press, September 1993.

72. Q. Ke, V. Prabhakaran, Y. Xie, Y. Yu, J. Wu, and J. Yang. Optimizing data partitioning for data-parallel computing. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems, HotOS’13, pages 13–13, Berkeley, CA, USA, 2011. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1991596.1991614.

73. G. M. Kurtzer. Singularity 2.1.2 - Linux application and environment containers for science, Aug. 2016. URL https://doi.org/10.5281/zenodo.60736.

74. G. M. Kurtzer, V. Sochat, and M. W. Bauer. Singularity: Scientific containers for mobility of compute. PLoS ONE, May 2017. URL https://doi.org/10.1371/journal.pone.0177459.

75. Y.-K. Kwok and I. Ahmad. Static scheduling algorithms for allocating directed task graphs to multiprocessors. ACM Comput. Surv., 31(4):406–471, Dec. 1999. ISSN 0360-0300. doi: 10.1145/344588.344618. URL http://doi.acm.org/10. 1145/344588.344618.

76. J. Köster and S. Rahmann. Snakemake - a scalable bioinformatics workflow engine. Bioinformatics, 28(19):2520–2522, 08 2012. ISSN 1367-4803. doi: 10.1093/bioinformatics/bts480. URL https://doi.org/10.1093/bioinformatics/bts480.

77. H. Li and R. Durbin. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009.

78. H. Li and R. Durbin. Fast and accurate long-read alignment with Burrows- Wheeler transform. Bioinformatics, 26(5):589–595, Mar 2010.

79. H. Li, B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16):2078–2079, 2009.

80. M. Litzkow, M. Livny, and M. Mutka. Condor - a hunter of idle workstations. In Eighth International Conference of Distributed Computing Systems, June 1988.

81. F. Liu and J. B. Weissman. Elastic job bundling: An adaptive resource request strategy for large-scale parallel applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 33:1–33:12, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3723-6. doi: 10.1145/2807591.2807610. URL http://doi.acm.org/10.1145/2807591.2807610.

82. K. Liu, K. Aida, S. Yokoyama, and Y. Masatani. Flexible container-based computing platform on cloud for scientific workflows. In 2016 International Conference on Cloud Computing Research and Innovations (ICCCRI), pages 56–63, May 2016. doi: 10.1109/ICCCRI.2016.17.

83. T. Maeno, K. De, T. Wenaus, P. Nilsson, G. Stewart, R. Walker, A. Stradling, J. Caballero, M. Potekhin, and D. Smith. Overview of atlas panda workload management. Journal of Physics: Conference Series, 331, 02 2011. doi: 10. 1088/1742-6596/331/7/072024.

84. T. Maeno, K. De, A. Klimentov, P. Nilsson, D. Oleynik, S. Panitkin, A. Petrosyan, J. Schovancova, A. Vaniachine, T. Wenaus, and D. Yu. Evolution of the ATLAS PanDA workload management system for exascale computational science. Journal of Physics: Conference Series, 513(3):032062, jun 2014. doi: 10.1088/1742-6596/513/3/032062. URL https://doi.org/10.1088%2F1742-6596%2F513%2F3%2F032062.

85. K. Maheshwari, A. Rodriguez, D. Kelly, R. Madduri, J. Wozniak, M. Wilde, and I. Foster. Enabling multi-task computation on galaxy-based gateways using swift. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1–3, Sept 2013. doi: 10.1109/CLUSTER.2013.6702701.

86. M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski. Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in iaas clouds. Future Gener. Comput. Syst., 48(C):1–18, July 2015. ISSN 0167- 739X. doi: 10.1016/j.future.2015.01.004. URL http://dx.doi.org/10.1016/ j.future.2015.01.004.

87. H. Meng, R. Kommineni, Q. Pham, R. Gardner, T. Malik, and D. Thain. An invariant framework for conducting reproducible computational science. Journal of Computational Science, 9(0):137 – 142, 2015. ISSN 1877-7503. doi: http:// dx.doi.org/10.1016/j.jocs.2015.04.012. URL http://www.sciencedirect.com/ science/article/pii/S1877750315000502.

88. H. Meng, D. Thain, A. Vyushkov, M. Wolf, and A. Woodard. Conducting Reproducible Research with Umbrella: Tracking, Creating, and Preserving Execution Environments. In IEEE Conference on e-Science, 2016.

89. D. Merkel. Docker: Lightweight linux containers for consistent development and deployment. Linux J., 2014(239), Mar. 2014. ISSN 1075-3583. URL http: //dl.acm.org/citation.cfm?id=2600239.2600241.

90. A. Merzky, M. Santcroos, M. Turilli, and S. Jha. Radical-pilot: Scalable execution of heterogeneous and dynamic workloads on supercomputers. CoRR, abs/1512.08194, 2015. URL http://arxiv.org/abs/1512.08194.

91. C. L. Monma, A. Schrijver, M. J. Todd, and V. K. Wei. Convex resource allocation problems on directed acyclic graphs: Duality, complexity, special cases, and extensions. Mathematics of Operations Research, 15(4):736–748, 1990. doi: 10.1287/moor.15.4.736. URL https://doi.org/10.1287/moor.15.4.736.

92. S. C. Müller, G. Alonso, A. Amara, and A. Csillaghy. Pydron: Semi-automatic parallelization for multi-core and the cloud. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 645–659, Broomfield, CO, Oct. 2014. USENIX Association. ISBN 978-1-931971-16-4. URL https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muller.

93. H. Packard. Lustre: A scalable, high-performance file system. Technical report, Cluster File Systems, Inc., November 2002. URL http://www.cse.buffalo.edu/faculty/tkosar/cse710/papers/lustre-whitepaper.pdf.

94. R. Poplin, V. Ruano-Rubio, M. A. DePristo, T. J. Fennell, M. O. Carneiro, G. A. Van der Auwera, D. E. Kling, L. D. Gauthier, A. Levy-Moonshine, D. Roazen, K. Shakir, J. Thibault, S. Chandran, C. Whelan, M. Lek, S. Gabriel, M. J. Daly, B. Neale, D. G. MacArthur, and E. Banks. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 2018. doi: 10.1101/201178. URL https://www.biorxiv.org/content/early/2018/07/24/201178.

95. R. Priedhorsky and T. Randles. Charliecloud: Unprivileged containers for user-defined software stacks in hpc. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, pages 36:1–36:10, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-5114-0. doi: 10.1145/3126908.3126925. URL http://doi.acm.org/10.1145/3126908.3126925.

96. U. Project. uproot. https://github.com/scikit-hep/uproot, 2018.

97. R. Raman, M. Livny, and M. Solomon. Matchmaking: distributed resource management for high throughput computing. In Proceedings. The Seventh International Symposium on High Performance Distributed Computing (Cat. No.98TB100244), pages 140–146, Jul 1998. doi: 10.1109/HPDC.1998.709966.

98. Red Hat, Inc. Ansible, 2012. URL https://www.ansible.com/.

99. R. Sakellariou and H. Zhao. A hybrid heuristic for dag scheduling on heterogeneous systems. In Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, pages 111–, April 2004. doi: 10.1109/IPDPS.2004.1303065.

100. T. Shaffer, N. Kremer-Herman, and D. Thain. Flexible Partitioning of Scientific Workflows Using the JX Workflow Language. In Practice and Experience in Advanced Research Computing (PEARC), 2019.

101. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society. ISBN 978-1-4244-7152-2. doi: 10.1109/MSST.2010.5496972. URL http://dx.doi.org/10.1109/MSST.2010.5496972.

102. G. Singh, C. Kesselman, and E. Deelman. Application-level resource provisioning on the grid. In 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science’06), pages 83–83, Dec 2006. doi: 10.1109/E-SCIENCE.2006.261167.

103. E. Slaughter, W. Lee, S. Treichler, M. Bauer, and A. Aiken. Regent: A high-productivity programming language for hpc with logical regions. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 81:1–81:12, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3723-6. doi: 10.1145/2807591.2807629. URL http://doi.acm.org/10.1145/2807591.2807629.

104. S. Srinivasan, G. Juve, R. F. da Silva, K. Vahi, and E. Deelman. A cleanup algorithm for implementing storage constraints in scientific workflow executions. 9th Workshop on Workflows in Support of Large-Scale Science (WORKS), 2014.

105. C. A. Stewart, T. M. Cockerill, I. Foster, D. Hancock, N. Merchant, E. Skidmore, D. Stanzione, J. Taylor, S. Tuecke, G. Turner, M. Vaughn, and N. I. Gaffney. Jetstream: A self-provisioned, scalable science and engineering cloud environment. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, XSEDE ’15, pages 29:1–29:8, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3720-5. doi: 10.1145/2792745.2792774. URL http://doi.acm.org/10.1145/2792745.2792774.

106. K. Sweeney and D. Thain. Efficient Integration of Containers into Scientific Workflows. In Science Cloud Workshop at HPDC, 2018.

107. D. Thain, T. Tannenbaum, and M. Livny. Condor and the grid. In F. Berman, G. Fox, and T. Hey, editors, Grid Computing: Making the Global Infrastructure a Reality. John Wiley, 2003.

108. D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the condor experience. Concurrency - Practice and Experience, 17(2-4):323–356, 2005.

109. A. Thrasher, Z. Musgrave, B. Kachmark, D. Thain, and S. Emrich. Scaling Up Genome Annotation with MAKER and Work Queue. International Journal of Bioinformatics Research and Applications, 10(4-5):447–460, 2014.

110. H. Topcuoglu, S. Hariri, and M.-y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst., 13(3):260–274, Mar. 2002. ISSN 1045-9219. doi: 10.1109/71.993206. URL http://dx.doi.org/10.1109/71.993206.

111. B. Tovar, N. Hazekamp, N. Kremer-Herman, and D. Thain. Automatic Dependency Management for Scientific Applications on Clusters. In IEEE International Conference on Cloud Engineering (IC2E), 2018.

112. J. Towns, T. Cockerill, M. Dahan, I. Foster, K. Gaither, A. Grimshaw, V. Hazlewood, S. Lathrop, D. Lifka, G. D. Peterson, R. Roskies, J. R. Scott, and N. Wilkins-Diehr. XSEDE: Accelerating scientific discovery. Computing in Science Engineering, 16(5):62–74, Sept 2014. ISSN 1521-9615. doi: 10.1109/MCSE.2014.80.

113. R. Tudoran, A. Costan, and G. Antoniu. Overflow: Multi-site aware big data management for scientific workflows on clouds. IEEE Transactions on Cloud Computing, 4(1):76–89, Jan 2016. doi: 10.1109/TCC.2015.2440254.

114. D. Turi, P. Missier, C. Goble, D. D. Roure, and T. Oinn. Taverna workflows: Syntax and semantics. In Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007), pages 441–448, Dec 2007. doi: 10.1109/E-SCIENCE.2007.71.

115. K. Vahi, M. Rynge, G. Juve, R. Mayani, and E. Deelman. Rethinking data management for big data scientific workflows. In 2013 IEEE International Conference on Big Data, pages 27–35, Oct 2013. doi: 10.1109/BigData.2013.6691724.

116. L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, Aug. 1990. ISSN 0001-0782. doi: 10.1145/79173.79181. URL http://doi.acm.org/10.1145/79173.79181.

117. K. Voss, J. Gentry, and G. Van der Auwera. Full-stack genomics pipelining with GATK4 + WDL + Cromwell, 2017. URL https://doi.org/10.7490/f1000research.1114631.1.

118. J. Wang, D. Crawl, I. Altintas, and W. Li. Big data applications using workflows for data parallel computing. Computing in Science Engineering, 16(4):11–21, July 2014. ISSN 1521-9615. doi: 10.1109/MCSE.2014.50.

119. S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, OSDI ’06, pages 307–320, Berkeley, CA, USA, 2006. USENIX Association. ISBN 1-931971-47-1. URL http://dl.acm.org/citation.cfm?id=1298455.1298485.

120. B. Welch, M. Unangst, Z. Abbasi, G. Gibson, B. Mueller, J. Small, J. Zelenka, and B. Zhou. Scalable performance of the panasas parallel file system. In Proceedings of the 6th USENIX Conference on File and Storage Technologies, FAST’08, pages 2:1–2:17, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1364813.1364815.

121. M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, and I. Foster. Swift: A language for distributed parallel scripting. Parallel Computing, 37(9): 633–652, 2011.

122. K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame, F. Bacall, A. Hardisty, A. Nieva de la Hidalga, M. P. Balcazar Vargas, S. Sufi, and C. Goble. The taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Research, 41(W1):W557, 2013. doi: 10.1093/nar/gkt328. URL http://dx.doi.org/10.1093/nar/gkt328.

123. K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame, F. Bacall, A. Hardisty, A. Nieva de la Hidalga, M. P. Balcazar Vargas, S. Sufi, and C. Goble. The taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Research, 41(W1):W557–W561, 2013. doi: 10.1093/nar/gkt328. URL http://nar.oxfordjournals.org/content/41/W1/W557.abstract.

124. A. Woodard, M. Wolf, C. Mueller, N. Valls, B. Tovar, P. Donnelly, P. Ivie, K. H. Anampa, P. Brenner, D. Thain, K. Lannon, and M. Hildreth. Scaling Data Intensive Physics Applications to 10k Cores on Non-Dedicated Clusters with Lobster. In IEEE Conference on Cluster Computing, 2015.

125. W. Wu, A. Bouteiller, G. Bosilca, M. Faverge, and J. Dongarra. Hierarchical dag scheduling for hybrid distributed systems. In 2015 IEEE International Parallel and Distributed Processing Symposium, pages 156–165, May 2015. doi: 10.1109/IPDPS.2015.56.

126. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1863103.1863113.

127. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.

128. M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica. Apache Spark: A unified engine for big data processing. Commun. ACM, 59(11):56–65, Oct. 2016. ISSN 0001-0782. doi: 10.1145/2934664. URL http://doi.acm.org/10.1145/2934664.

129. Y. Zhao, J. Dobson, L. Moreau, I. Foster, and M. Wilde. A notation and system for expressing and executing cleanly typed workflows on messy scientific data. In SIGMOD, 2005.

130. C. C. Zheng and D. Thain. Integrating Containers into Workflows: A Case Study Using Makeflow, Work Queue, and Docker. In Workshop on Virtualization Technologies in Distributed Computing (VTDC), 2015.

This document was prepared & typeset with pdfLaTeX, and formatted with the NDdiss2ε class file (v3.2017.2 [2017/05/09])
