Methods Enabling Portability of Scientific Workflows
Total Page:16
File Type:pdf, Size:1020Kb
METHODS ENABLING PORTABILITY OF SCIENTIFIC WORKFLOWS A Dissertation Submitted to the Graduate School of the University of Notre Dame in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy by Nicholas Hazekamp Douglas Thain, Director Graduate Program in Computer Science and Engineering Notre Dame, Indiana December 2019 c Copyright by Nicholas Hazekamp 2019 All Rights Reserved METHODS ENABLING PORTABILITY OF SCIENTIFIC WORKFLOWS Abstract by Nicholas Hazekamp Scientific workflows are common and powerful tools used to elevate small scale analysis to large scale distributed computation. They provide ease of use for domain scientists by supporting the use of applications as they are, partitioning the data for concurrency instead of the application. However, many of these workflows are written in a way that couples the scientific intention with the specificity of the execution environment. This coupling limits the flexibility and portability of the workflow, requiring the workflow to be re-engineered for each new dataset or site. I propose that workflows can be written for pure scientific intent, with the id- iosyncrasies of execution resolved at runtime using workflow abstractions. These abstractions would allow workflows to be quickly transformed for different configura- tions, specifically handling new datasets, diverse sites, and different configurations. I examine three methods for developing workflow abstraction on static workflows, apply these methods to a dynamic workflow, and propose an approach that separates the user from the distributed environment. In developing these methods for static workflows I first explored Dynamic Work- flow Expansion, which allows workflows to be quickly adapted for new and diverse datasets. Then I describe an algorithm for statically determining a workflow's stor- age needs, which is used at runtime to prevent storage deadlocks. Finally, I develop an algebra for transforming workflows, which isolates site and configuration specific Nicholas Hazekamp designs to be applied to workflows as needed. These methods were combined and applied to a dynamic workflow, adapting a site bounds MPI application to a dynamic cloud workflow. I combine these methods and formulated the Continuously Divisible Jobs ab- straction to separate the domain scientist's application from the distributed logic of a dynamic workflow. This abstraction defines an API which applications can imple- ment to allow for dynamic distributed computation, showcasing the flexibility and portability provided through workflow abstractions. CONTENTS Figures . vi Tables . xiii Acknowledgments . xiv Chapter 1: Introduction . .1 1.1 Scientific Workflows . .1 1.2 Methods for Improving Workflow Flexibility . .5 1.3 Overview of this Dissertation . .8 1.4 Example Applications . 10 Chapter 2: Related Work . 13 2.1 Static Workflow Systems . 13 2.1.1 Workflow Specification Languages . 18 2.2 Dynamic Workflow Systems . 19 2.3 Batch Computing Systems . 22 2.4 Resource Provisioning and Management . 24 2.4.1 Storage Management . 26 2.5 Environment Re-Creation . 26 2.6 Makeflow and Work Queue . 28 Chapter 3: Dynamic Workflow Expansion . 30 3.1 Introduction . 30 3.2 Galaxy . 31 3.3 Dynamic Job Expansion . 32 3.4 Example Application . 35 3.5 Design Considerations . 37 3.5.1 Dependency Management . 37 3.5.2 Remote Execution . 39 3.5.3 Garbage Collection . 40 3.6 Opportunities . 41 3.6.1 Expression of Job Status . 41 3.6.2 Checkpoints and Partial Failure . 42 3.7 Evaluation . 43 ii 3.8 Conclusion . 48 Chapter 4: Static Analysis and Dynamic Management of Workflow Storage . 50 4.1 Introduction . 50 4.2 Definition of Files and Storage in Workflows . 51 4.3 Storage Management Components . 53 4.4 The Storage Footprint . 54 4.5 Example Workflows . 57 4.6 Static Analysis Algorithm . 58 4.6.1 Limitations . 64 4.7 Dynamic Storage Management . 65 4.7.1 Dynamic Storage Algorithm . 65 4.7.2 Worked Example . 66 4.7.3 Impact of Local Storage . 68 4.8 Overall Evaluation . 70 4.8.1 Configuration . 70 4.9 Conclusions . 72 Chapter 5: An Algebra for Robust Workflow Transformations . 74 5.1 Introduction . 74 5.2 Challenges in Transforming Workflows . 76 5.3 An Algebra of Workflow Transformations . 79 5.3.1 Notation . 79 5.3.2 Semantics . 81 5.3.3 Transformations as Functions . 82 5.3.4 Applying the Sandbox Model . 86 5.4 Transformations in Practice . 86 5.4.1 Composability versus Commutability . 86 5.4.2 Command Description . 88 5.4.3 File List Management . 90 5.4.4 Resource Provisioning . 90 5.4.5 Environment Elaboration . 91 5.5 Applications of Transformations . 92 5.5.1 Sandbox Transform . 92 5.5.2 Container Transform . 92 5.5.3 Resource Monitoring Transform . 93 5.5.4 Environment Transform . 93 5.5.5 Failure Handling Transform . 94 5.6 Case Studies . 95 5.6.1 Resource Usage in a Container . 95 5.6.2 Failure Analysis . 96 5.6.3 Complex Software Configuration . 97 5.7 Conclusion . 99 iii Chapter 6: Applying Static Techniques to Dynamic Workflows . 102 6.1 Introduction . 102 6.2 Jetstream . 103 6.3 Portable Reproducible Environments . 103 6.3.1 Machine Images . 104 6.3.2 Container Images . 104 6.3.3 Deployment Services . 105 6.3.4 VC3 . 105 6.3.5 Deploying MAKER . 106 6.4 Scalability . 107 6.4.1 MAKER's MPI Behavior . 108 6.4.2 WQ-MAKER . 109 6.4.3 Scaling Up vs Scaling Out . 110 6.5 Exposing Execution Feedback . 111 6.5.1 Clean Environment Builds . 111 6.5.2 Deploying Workers . 111 6.5.3 Evaluate Performance . 112 6.5.4 Diagnosing Errors . 113 6.6 Evaluation . 113 6.7 Conclusion . 117 Chapter 7: Continuously Divisible Jobs . 119 7.1 Introduction . 119 7.2 Contributions . 120 7.3 Challenges . 121 7.4 Continuously Divisible Jobs . 123 7.4.1 Operations . 125 7.4.2 Abstract Jobs . 127 7.4.3 Job Coordinators . 128 7.4.4 Design Considerations . 130 7.4.4.1 File Partitioning . 131 7.4.4.2 Job Namespaces . 131 7.4.4.3 Execution Sandbox . 132 7.4.4.4 Job Ordering . 132 7.5 Virtual File Abstraction . 133 7.5.1 Operations on Virtual Files . 133 7.5.1.1 Indexing . 134 7.5.1.2 Partitioning . 134 7.5.1.3 Index Look-up . 135 7.5.1.4 Instantiate . 135 7.5.1.5 Serialization . 135 7.6 Implementation . 136 7.6.1 Example Continuously Divisible Job Application . 137 iv 7.6.2 Virtual Files ..