An Intermediate Data-driven Methodology for Scientific Workflow Management System to Support Reusability

A Thesis Submitted to the College of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the Degree of Master of Science in the Department of Computer Science, University of Saskatchewan, Saskatoon

By Debasish Chakroborti

©Debasish Chakroborti, August/2019. All rights reserved.

Permission to Use

In presenting this thesis in partial fulfilment of the requirements for a Postgraduate degree from the University of Saskatchewan, I agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis. Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to:

Head of the Department of Computer Science
176 Thorvaldson Building, 110 Science Place
University of Saskatchewan
Saskatoon, Saskatchewan S7N 5C9
Canada

Or

Dean
College of Graduate and Postdoctoral Studies
University of Saskatchewan
116 Thorvaldson Building, 110 Science Place
Saskatoon, Saskatchewan S7N 5C9
Canada

Abstract

A workflow automates the processing of different logical sub-tasks by a set of rules. A workflow management system (WfMS) helps us accomplish a complex scientific task by arranging sub-tasks, available as tools, into an executable sequence. In a WfMS, workflows are formed with modules from various domains, and many collaborators from those domains are involved in the workflow design process. Workflow Management Systems (WfMSs) have gained popularity in recent years for managing various tools in a system and ensuring dependencies while building a sequence of executions for scientific analyses. Because heterogeneous tools are involved and collaboration is required, Collaborative Scientific Workflow Management Systems (CSWfMSs) have received significant interest in the scientific analysis community. In such systems, big data explosion issues arise, as heterogeneous data from different domains come in large volumes with high velocity and variety. Therefore, a large amount of heterogeneous data needs to be managed in a Scientific Workflow Management System (SWfMS) with a proper decision mechanism. Although a number of studies have addressed the cost management of data, none of the existing studies are related to a real-time decision mechanism or a reusability mechanism. Besides, frequent execution of workflows in a SWfMS generates a massive amount of data, and such data are always incremental. Input data and module outcomes of a workflow in a SWfMS are usually large, and processing such data-intensive workflows is time-consuming because the modules are computationally expensive for their respective inputs. In addition, the lack of data reusability, limited error recovery, inefficient workflow processing, inefficient storing of derived data, lack of metadata association, and lack of validation of the effectiveness of existing techniques need to be addressed in a SWfMS for efficient workflow building under the big data explosion. To address these issues, in this thesis we first propose an intermediate data management scheme for a SWfMS. In our second study, we explore the possibilities and introduce an automatic recommendation technique for a SWfMS using real-world workflow data (i.e., Galaxy [1] workflows), where our investigations show that the proposed technique can facilitate 51% of workflow building in a SWfMS by reusing intermediate data of previous workflows and can reduce the execution time of workflow building in a SWfMS by 74%. Later, we propose an adaptive version of our technique that considers the states of tools in a SWfMS, which shows around 40% reusability for workflows. Consequently, in our fourth study, we conduct several experiments to analyze the performance and explore the effectiveness of the technique in a SWfMS in various environments. The technique is introduced to reduce storing cost, increase data reusability, and speed up workflow execution, and to the best of our knowledge it is the first of its kind. The detailed architecture and evaluation of the technique are presented in this thesis. We believe our findings and the developed system will contribute significantly to the research domain of SWfMSs.

Acknowledgements

First and foremost, I have to thank my research supervisors, Dr. Chanchal K. Roy and Dr. Kevin A. Schneider, for their continuous support, immense knowledge, and patience during my MSc study. I express my heartfelt gratitude to Dr. Banani Roy, who made this work possible. Her guidance helped me in formulating and shaping the thesis into its current form. I could not have imagined having a better advisor and mentor for my MSc study. I also wish to express my gratitude to Dr. Manishankar Mondal for extended discussions, valuable suggestions, passionate participation, and input, which have contributed greatly to the improvement of the thesis. Besides my advisors, I would like to thank the rest of my thesis committee: Dr. Mark Keil, Dr. Ralph Deters, and Dr. Aryan Saadat Mehr, for their willingness to take part in the advisement and evaluation of my thesis work. I would like to express my special appreciation and thanks to the anonymous reviewers for their valuable comments and suggestions on improving the papers produced from this thesis. I thank my fellow labmates of the Software Research Lab for their stimulating discussions, important feedback on my work, and all the fun we have had in the last two years. In particular, I would like to thank Shamima Yeasmin, CM Khaled Saifullah, Kawser Wazed Nafi, Amit Kumar Mondal, Masud Rahman, Muhammad Asaduzzaman, Avijit Bhattacharjee, Md Nadim, Saikat Mondal, Judith Islam, Tonny Kar, Hamid Khodabandeloo, Zonayed Ahmed, and Golam Mostaeen. I am grateful to the Department of Computer Science of the University of Saskatchewan for their generous financial support through scholarships, awards, and bursaries that helped me to concentrate more deeply on my thesis work. I am also grateful to all the staff of the Department for their constant support. In particular, I would like to thank Gwen Lancaster, Findlay Sophie, and Heather Webb. Getting through my thesis required more than academic support, and I have many, many people to thank for listening to and, at times, having to tolerate me over the past two years. I cannot begin to express my gratitude and appreciation for their friendship. Paromita Sengupta, Mashrafi Haider Iqra, Sakib Mostafa, and Mike SB have been unwavering in their personal and professional support during the time I spent at the University. For many memorable evenings out and in, I must thank everyone above as well as Sanchita Sen, Purnendu B. Sen, Asif Anjum Khan, and Yeasir Jubaer. I would also like to thank Pritam Sen, who opened both his home and his help to me when I first arrived in the city. I would especially like to thank all my seniors who have been extremely helpful and encouraging throughout the period of my thesis work. In particular, I would like to thank Md. Sami Uddin and Md. Aminul Islam. And finally, last but by no means least, I would like to thank my family: my father Rajendranath Chakroborti, my mother Protima Chakroborti, my wife Sristy Sumana Nath, my daughter Ushasi Srija Chakroborti, my elder brother Shamiron Chakroborti, my elder sister Mousumi Chakroborti, and Moumita Chakroborti, for their unconditional support, encouragement, inspiration, and love at every stage of my life. Without their endless sacrifice, I would not have come this far.

I dedicate this thesis to my mother, Protima Chakroborti, my wife, Sristy Sumana Nath, and my daughter, Ushasi Srija Chakroborti, whose influence helps me to accomplish every step of my life.

Contents

Permission to Use
Abstract
Acknowledgements
Contents
List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Our Contribution
  1.4 Publications
  1.5 Thesis Outline

2 Background
  2.1 Workflow and Workflow Management System
  2.2 Scientific Workflow
  2.3 Scientific Workflow Management System
  2.4 Tool and Technologies
  2.5 Web Services vs. Local Scripts vs. Distributed Scripts

3 A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms
  3.1 Motivation
  3.2 Related Work
    3.2.1 Study on file systems of Big Data processing
    3.2.2 Big Data in terms of Image Processing
    3.2.3 Study on Intermediate-data
  3.3 Proposed Method
  3.4 Experimental Setup
  3.5 Results and Evaluation
    3.5.1 Experiment for reusability
    3.5.2 Experiment for error recovery
    3.5.3 Experiment for fast processing
  3.6 Discussion and Lesson Learned
  3.7 Conclusion

4 Optimized Storing of Workflow Outputs through Mining Association Rules
  4.1 Motivation
  4.2 Background
  4.3 Recommending intermediate stage results to store for reusing
    4.3.1 Determining association rules from pipelines
    4.3.2 Determining the supports and confidences of the association rules obtained from the pipelines
    4.3.3 Recommending an intermediate state result for storing
  4.4 Experiment Setup
  4.5 Evaluating our proposed technique
    4.5.1 Candidate techniques for comparison
    4.5.2 Investigated measures
    4.5.3 Comparing the candidate recommendation techniques on the basis of the measures
    4.5.4 Considering module execution time for evaluating our proposed recommendation technique
  4.6 Threats to Validity
  4.7 Related Work
  4.8 Conclusion

5 Optimal Storing Modes of Workflow in A Scientific Workflow Management System Considering Tool State
  5.1 Motivation
  5.2 Background
  5.3 Experiment Setup
  5.4 Evaluating our proposed technique
    5.4.1 Candidate techniques for comparison
    5.4.2 Investigated measures
    5.4.3 Comparing the candidate recommendation techniques on the basis of the measures
  5.5 Conclusion

6 Designing for Recommending Intermediate States in A Scientific Workflow Management System
  6.1 Motivation
  6.2 Related Works
    6.2.1 Recommendations in SWfMSs
    6.2.2 Case Studies, Technology Integrations and Evaluations
  6.3 RISP in a SWfMS
    6.3.1 System Overview
    6.3.2 Background
    6.3.3 RISP Interaction Modelling
    6.3.4 System Design
  6.4 Experimental Studies and Results
    6.4.1 Participants
    6.4.2 Study 1: Image Processing Workflow Composition
    6.4.3 Study 2: Workflow Composition
    6.4.4 Study 3: User Behavior in the Interface
  6.5 Discussion and Future Directions
  6.6 Threat to Validity
  6.7 Conclusion and Future Works

7 Conclusion
  7.1 Summary
  7.2 Future Directions
    7.2.1 Integration of the technique in a fully functional model
    7.2.2 Prioritizing the expensive modules
    7.2.3 Genetic algorithm to recommend data
    7.2.4 Data and domain awareness

References

Appendix A Data Management Scheme
  A.1 Code fragment of a modularized job

Appendix B RISP
  B.1 Code fragment of RISP analyzer

Appendix C Adaptive RISP
  C.1 JSON structure of a workflow

Appendix D GUI-RISP: User Study
  D.1 Code fragment of a distributed job
  D.2 Distributed job submission
  D.3 Source Code Snippets
  D.4 Recorded JMeter script fragment
  D.5 Database Mapping Classes and View

Appendix E User Study: User Manual
  E.1 Evaluation of RISP
  E.2 RISP in SWfMS
  E.3 Evaluation of the technique (Study 1)
  E.4 Evaluation of the technique (Study 2)

List of Tables

1.1 Research problems and associated solutions
3.1 Time gain for the reusable intermediate data
4.1 Workflow information for the comparison
5.1 Workflow information for the comparison
6.1 Overall load of the SWfMS
6.2 Total Visit Duration

List of Figures

2.1 Service oriented architecture
2.2 Local scripts, web services, and distributed computing scripts in a workflow
3.1 Proposed data management scheme
3.2 Reusable intermediate data in workflows
3.3 Pipeline building with image processing tools
3.4 Block diagram of components of SHIPPI
3.5 Performance comparison of workflows with-intermediate and without-intermediate data
3.6 Image description module in our interactive workflow composition environment
3.7 Image matching module in our interactive workflow composition environment
3.8 Common error occurrences in workflow execution
3.9 Performance comparison in both environments
4.1 Automatically suggesting intermediate state results for storing from a workflow under progress
4.2 Our proposed technique for suggesting intermediate state results to store for future reuse
4.3 Comparing the candidate techniques according to the percentage of pipelines that could be made by reusing previously stored results
4.4 Comparing the candidate techniques on the basis of the percentage of saved data that could be reused from previously stored results
4.5 Comparing the candidate techniques on the basis of the frequency of reusing previously stored intermediate state results
4.6 Comparing the candidate techniques on the basis of the percentage of intermediate state results that were saved from all intermediate states
4.7 Calculation of execution time gain
4.8 Execution time gain for different pipelines by reusing previously stored intermediate state results according to our proposed technique
5.1 Automatically suggesting intermediate state results for storing from a workflow under progress
5.2 Comparing the candidate techniques according to the percentage of pipelines that could be made by reusing previously stored results
5.3 Comparing the candidate techniques on the basis of the percentage of saved data that could be reused from previously stored results
5.4 Comparing the candidate techniques on the basis of the frequency of reusing previously stored intermediate state results
5.5 Comparing the candidate techniques on the basis of the percentage of intermediate state results that were saved from all intermediate states
6.1 High level architecture of the SWfMS with RISP
6.2 Workflow composition with the proposed technique
6.3 User interface for building workflows
6.4 Required assembling and execution time in image processing workflows
6.5 Use of intermediate states in image processing workflows
6.6 Checking and using intermediate states in image processing workflows
6.7 NASA-TLX responses for image processing workflows
6.8 Time differences among various types of bioinformatics workflow
6.9 Use of intermediate states in bioinformatics workflows
6.10 Use of intermediate states in bioinformatics workflows
6.11 NASA-TLX responses for bioinformatics workflows
6.12 Heatmap of the interface after building workflows
E.1 User interface for building workflows
E.2 Image Processing Workflow 1
E.3 Image Processing Workflow 2
E.4 Image Processing Workflow 3
E.5 Image Processing Workflow 4
E.6 Image Processing Workflow 5
E.7 Bioinformatics Workflow 1
E.8 Bioinformatics Workflow 2
E.9 Bioinformatics Workflow 3
E.10 Bioinformatics Workflow 4

List of Abbreviations

WfMS  Workflow Management System
SWfMS  Scientific Workflow Management System
CSWfMS  Collaborative Scientific Workflow Management System
RISP  Recommending Intermediate States from Pipelines
NASA-TLX  The NASA Task Load Index
DAG  Directed Acyclic Graph
RDD  Resilient Distributed Datasets
HDFS  Hadoop's Distributed Filesystem
SHIPPI  Spark High-throughput Image Processing Pipeline
ID  Intermediate Data
P2IRC  Plant Phenotyping and Imaging Research Centre
LRWoI  Leaves Recognition workflow without storing intermediate data
LRWtI  Leaves Recognition workflow with storing intermediate data
LRSD  Leaves Recognition workflow by skipping descriptor operation
SWoI  Segmentation workflow without storing intermediate data
SWtI  Segmentation workflow with storing intermediate data
SSTA  Segmentation workflow by skipping transformation and analysis
CWoI  Clustering workflow without storing intermediate data
CWtI  Clustering workflow with storing intermediate data
CSTA  Clustering workflow by skipping transformation and analysis
PT  Proposed Technique (RISP)
TSAR  Technique that recommends Storing All Results
TSPAR  Technique that recommends Storing Previously Appeared Results
TSFR  Technique that recommends Storing the Final Result
LR  Likeliness of Reusing from previously stored results
PSRR  Percentage of Stored Results that were Reused
FRSR  Frequency of Reusing Stored Results
PISRS  Percentage of Intermediate State Results that were Stored
GUI  Graphical User Interface
APM  Application Performance Monitoring

1 Introduction

In this chapter, a brief introduction to the thesis is presented. Section 1.1 presents the main motivation of the thesis. In Sections 1.2 and 1.3, we define the problems and describe our contributions towards mitigating them. In Section 1.4, all the publications related to the thesis are listed to show our contribution to the Scientific Workflow Management domain. Finally, Section 1.5 outlines the remaining chapters of the thesis.

1.1 Motivation

A Workflow Management System (WfMS) provides a foundation for executing, monitoring, and improving the performance of sequences of predefined tasks known as workflows. Similarly, a Scientific Workflow Management System (SWfMS) is a specialized form of WfMS in which a sequence of computational or data manipulation modules is composed and executed for scientific analyses (discussed in detail in Chapter 2). The need to consume and produce large amounts of data from various sources, with proper dataflow across different technologies, influences researchers in many domains to move towards data-driven methodologies of sequential processing. SWfMSs have gained popularity in recent years for managing such sequential tasks efficiently. Besides, visualization, monitoring, and error recovery can be supported in a SWfMS as a compact solution with proper technologies. Frequently executed workflows with data-intensive or computationally intensive modules in a SWfMS require special care to reduce time consumption and complexity. Thus, most SWfMSs have evolved with basic support for specification, modification, execution, failure recovery, and monitoring. Furthermore, the incremental data from frequently executed workflows need to be handled with a proper data management scheme. For instance, in a study [100] of the cost management of SWfMSs entitled "Workflow Provenance: An Analysis of Long Term Storage Costs", the authors showed that a significant cost is required for storing the data produced from a SWfMS. Other studies [106] [48] showed that special algorithms are required to manage data in a SWfMS. In plant phenotyping and genotyping, where both image processing and bioinformatics analyses are mandatory, a SWfMS is required to establish a correlation among heterogeneous tools. In such a system, a proper data management scheme is essential to maintain the cost of the incremental heterogeneous data in a storage system.

In the past few years, a significant number of research experiments and developments, such as Anduril, Apache Taverna, Galaxy, Kepler, KNIME, Pegasus, VisTrails, and so on, have taken place in the area of Scientific Workflow Management Systems for the bioinformatics, image analysis, astronomy, biodiversity, and genomics domains. However, none of them considered the management of intermediate data produced from executed workflows. Furthermore, there is no study that considered data reusability in a SWfMS. Besides, considering only a predefined cost of data in a data management technique of a SWfMS might not be useful for future reuse. A great number of studies [74] [36] [24] [82] [32] [40] [23] have been carried out in recent years to support data management techniques in SWfMSs. Several studies [24] [100] [106] explored the possibility of optimal data cost in a management technique for a SWfMS. In the era of big data, researchers are focusing more on the optimal cost of data management techniques and other areas of big data applications, which is also considered in our proposed management technique along with increased reusability of data in a SWfMS.

1.2 Problem Statement

In a SWfMS, the repetitive nature of most tasks makes researchers think in terms of reusable units. Most tasks are divided into modules to make them reusable in a SWfMS. Typically, SWfMSs provide both local scripts and web services as modules, and these are reusable in different workflows. However, the outcomes of the modules in a workflow are not considered as reusable units in the existing studies, which causes performance degradation. In the following, we discuss the shortcomings in detail.

Problem Statement 1 (Lack of data reusability and limitation of error recovery in workflow composition and execution): Data reusability in a SWfMS has not received significant attention from researchers except in some data provenance studies. Many provenance mechanisms [36] [24] [82] [23] [64] use storing procedures for streamed data in a SWfMS to keep traces for checkpoints. With those provenance models, error recovery and rollback are possible up to a certain level (i.e., for a specific session), but reusability of the data (i.e., intermediate data) is not possible after a certain period. Hence, a processed-data storing mechanism in a SWfMS is essential and is one of the important research problems in the domain. To address this problem, two research questions need to be answered before designing a data management scheme in a SWfMS: "Does the system support data reusability that ultimately reduces time and effort?" and "How is the system supporting fast processing for frequently used workflows?". Furthermore, a number of studies [22] [83] [66] [112] [89] showed that errors are common in a SWfMS. Scientific workflows incorporate different active components such as associated module ports, scripts, data links, intermediate data, provenance information, and so on [32]. Errors can frequently occur in workflows because of their complex structure, associations, and compatibility among various components. Various mechanisms have been proposed [22] [83] [66] for error recovery and rollback with checkpoints for workflows in a SWfMS. However, many things still need to be addressed, and a few important research questions need to be answered while designing such a management system, such as "Does it have any error recovery mechanism that could construct workflows more quickly if errors occur?" and "How long does the technique persist for other users in a system?".

Problem Statement 2 (Inefficient processing of frequent workflows and storing of module outcomes):

A great number of studies [5] [23] [32] [77] on efficient processing show the need for better execution techniques in a SWfMS. However, the existing techniques do not persist after a certain period. Moreover, there is no study on the optimal storing of data by a decision mechanism in a SWfMS. Although a number of provenance studies follow a fully reproducible workflow mechanism, all of the generated data must be stored to use such a model in a SWfMS. There is no study on storing a minimum amount of data for maximum reuse in the future. This problem is severe when input data sets are large and modules are computationally expensive. Thus, a few questions need to be answered in this context, such as "What is the gain in execution time by using the technique?", "How often can we reuse components for efficiency?", "How can we measure the efficiency of the technique?", and "Is the system costly in terms of storage?".

Problem Statement 3 (Lacking in metadata apposition with tool outcomes in a SWfMS): Although the taxonomy of metadata and the usage of metadata for raw data management are considered in a number of studies [24] [9] [52] [86] [51] [71] [72] [21], intermediate data management has never been proposed together with metadata management for a SWfMS. How tool states can be formed in a SWfMS is discussed in existing studies, but their effect on intermediate data is not. Configuring and tuning parameters in a workflow can have a significant impact on certain types of models, which can change the states of processes (i.e., modules). Different outputs might be produced by the same sequence of modules with identical input datasets in different workflows for different sets of parameter configurations. "How can we handle the recommendation considering tool states in a SWfMS?" and "How much data are reusable if we consider tool state?" need to be answered before designing a recommendation technique with tool states.

Problem Statement 4 (Lacking in validation of the effectiveness of a methodology): Scientists from different areas, from bioinformatics to image processing, are working with SWfMSs. Various tools and technologies are used in a SWfMS, and various methodologies [22] [5] [23] [32] [77] [83] [66] have been proposed by researchers to increase efficiency. However, the existing studies present only qualitative discussions rather than quantitative analyses to validate the effectiveness of the techniques. To test the effectiveness of a new methodology in a SWfMS, some factors need to be considered through user evaluation, such as "What is the overall performance of the methodology in a SWfMS?", "How is the methodology being used by users in the system?", and "Is a tool designed with the methodology helping users to compose workflows efficiently?". In addition, research on Collaborative Scientific Workflow Management Systems (CSWfMSs) [108] [55] has gained significant interest among researchers for real-time collaboration among experts from different domains. However, due to time constraints, it may not always be possible to compose workflows in real time; how such situations can be handled needs to be addressed, and what percentage of the components in a SWfMS is reusable or shareable for future use needs to be explored.

1.3 Our Contribution

To introduce an automated data management technique for SWfMSs, we carried out studies concentrating on the research problems above; these studies contribute to the following four areas. All the research problems and their solutions are summarized in Table 1.1 together with the research questions. The four major contributions of our studies are briefly discussed in this section.

Data Management Scheme for Micro-Level Modular Computation-intensive Programs

A great number of studies have been done on data management in SWfMSs for efficient handling of data while building workflows. In these studies, cost management is a vital issue, and how users can efficiently store various data from workflows so that they can be associated later has been studied with efficient algorithms [24] [100] [106]. These management techniques help users manage incremental data automatically in a SWfMS at low cost. A number of studies have also addressed metadata management, execution log management, and raw data management [24] [9] [52]. However, intermediate data from the different modules of workflows in a scientific workflow management system are huge in size and incremental, and some systems manage them, by storing everything, only for data provenance and reproducibility. There is no automatic process to store them in a SWfMS in an optimal way. In order to introduce such a management technique and explore the possibilities of intermediate data for computationally expensive distributed programs, we propose a data management scheme that can manage intermediate data for efficient execution, error recovery, and increased reusability. Using the proposed technique, users can save a significant amount of execution time while building workflows. We conducted several experiments in both a distributed environment and a local environment by executing workflows from our SWfMS to compare and explore the possible time gains. Our experiments show that the proposed technique can gain up to 87% in execution time in a SWfMS while building workflows. In addition, several significant research questions, including the questions in Problem Statement 1 of Section 1.2, are answered in the study, along with future directions and an automatic approach to intermediate data management. We present this study in detail in Chapter 3.

A Technique for Optimized Storing of Workflow Outputs

A data management scheme is essential in a SWfMS for efficiently associating and using data in various workflows. However, an automatic technique is also profoundly needed for deciding which data to store and retrieve at workflow runtime in a SWfMS. Furthermore, the technique should suggest an optimal selection of workflow outcomes to store, as there can be multiple outcomes from a particular module across different workflows. For optimal storing of workflow outputs, a few systems consider storing workflow outputs with minimum cost [24] [100] [106]. In this study, we emphasize reusability when storing outcomes, using a technique that ultimately works towards an optimal solution. We used association rule mining in the technique to associate data and modules in a SWfMS. We present our study by exploring real-life Galaxy [1] workflows to find an optimal solution for increasing data reusability in a SWfMS, and we answer the research questions of Problem Statement 2 in Section 1.2. Details of this study are presented in Chapter 4.

Optimal Storing Modes of Workflow Considering Tool State

By analyzing different workflows from various commonly used SWfMSs on MyExperiment [38], we realized that tool state needs to be considered while recommending intermediate states for modules. The reason is that different parameter configurations of a tool while composing workflows in a SWfMS create different states of the tool, which could generate varying outcomes. Considering tool state, we propose an adaptive version of RISP (our technique for Recommending Intermediate States from Pipelines) that can work with parameter configuration and tuning in a SWfMS. In previous studies [32] [25] [52] [78], tool state is considered to introduce proper reproducibility, and the storing and management of parameter configurations are also addressed. In our proposed approach, we mainly consider the effects of parameter configurations on the outcomes (i.e., intermediate data) of modules, as we are proposing a data management technique for module outcomes. We present different experimental studies on 534 workflows of a SWfMS (i.e., Galaxy) for evaluating the technique. We present the studies with different metrics to ensure reusability in a SWfMS and to answer the research questions presented in Problem Statement 3 of Section 1.2. We present the details of this study in Chapter 5.

Usability Study of Recommending Intermediate States

We conducted several user studies to observe user behavior with the technique in a SWfMS and to validate the technique in a SWfMS. In our user studies, participants were asked to compose and execute workflows, and Google Analytics and a Tobii eye-tracking device were used to trace all the activities of the participants. The collected data revealed several interesting facts that help us answer the research questions in Problem Statement 4 of Section 1.2. Details of this study are presented in Chapter 6.

1.4 Publications

Below is the list of publications and other works prepared for submission (with collaborators) from this thesis.

• Debasish Chakroborti, Banani Roy, Amit Mondal, Golam Mostaeen, Chanchal Roy, Kevin Schneider, and Ralph Deters, "A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms", International Symposium on Big Data Management and Analytics (BIDMA), University of Calgary, 2018.

• Debasish Chakroborti, Manishankar Mondal, Banani Roy, Chanchal K. Roy, Kevin A. Schneider, "Optimized Storing of Workflow Outputs through Mining Association Rules", 2018 IEEE International Conference on Big Data (IEEE Big Data), pp. 508-515, Seattle, WA, USA, 2018.

• Debasish Chakroborti, Sristy Sumana Nath, Banani Roy, Chanchal K. Roy, Kevin A. Schneider, "Optimal Storing Modes of Workflow in A Scientific Workflow Management System Considering Tool State", IEEE Journal - Transactions on Services Computing (to be submitted).

• Debasish Chakroborti, Banani Roy, Sristy Sumana Nath, Manishankar Mondal, Chanchal K. Roy, Kevin A. Schneider, “Designing for Recommending Intermediate States in A Scientific Workflow Management System”, IEEE Journal - Transactions on Services Computing (to be submitted).

1.5 Thesis Outline

The thesis is outlined as follows. Chapter 2 gives some background knowledge on workflows, scientific workflows, Scientific Workflow Management Systems, and the required technologies (i.e., distributed environments, association rule mining, and so on). We present our data management scheme for SWfMSs in Chapter 3. Chapter 4 describes the technique (i.e., RISP) for optimized storing of workflow outputs in a SWfMS. In Chapter 5, an adaptive version of RISP is presented that considers tool states in workflows, with reusability analyses. We also developed a prototype of the technique and integrated it into a SWfMS; to ensure the effectiveness and efficiency of the technique, we conducted user studies in the SWfMS, and details of these user studies are presented in Chapter 6. Lastly, Chapter 7 presents an overall summary of our work and concludes with some future directions.

Table 1.1: Research problems and associated solutions

Problem: Lack of data reusability and limitation of error recovery in workflow composition and execution
Research questions: Does the system support data reusability that ultimately reduces time and effort? How is the system supporting fast processing for frequently used workflows? Does it have any error recovery mechanism that could construct workflows more quickly if errors occur? How long does the technique persist for other users in a system?
Solution: Data Management Scheme for Micro-Level Modular Computation-intensive Programs

Problem: Inefficient processing of frequent workflows and storing of module outcomes
Research questions: What is the gain in execution time by using the technique? How often can we reuse components for efficiency? How can we measure the efficiency of the technique? Is the system costly in terms of storage?
Solution: A Technique for Optimized Storing of Workflow Outputs

Problem: Lacking in metadata apposition with tool outcomes in a SWfMS
Research questions: How can we handle the recommendation considering tool states in a SWfMS? How much data are reusable if we consider tool state?
Solution: Optimal Storing Modes of Workflow Considering Tool State

Problem: Lacking in validation of the effectiveness of a methodology
Research questions: What is the overall performance of the technique/scheme in a SWfMS? How is the technique being used by users in the system? Is the tool designed with the methodology helping to compose workflows efficiently?
Solution: Usability Study of Recommending Intermediate States

2 Background

A brief discussion of the background and the required technologies of this thesis is presented in this chapter. Section 2.1 defines workflows and Workflow Management Systems (WfMSs) in a generalized way. In Sections 2.2 and 2.3, we give a general overview of Scientific Workflows and Scientific Workflow Management Systems (SWfMSs). Section 2.4 discusses the tools and technologies required for this study.

2.1 Workflow and Workflow Management System

A workflow is composed for the automation of a sequence of jobs in which information or responsibilities are passed among stakeholders according to a predefined set of rules to reach or provide a particular goal [45]. Workflow theory developed from the concept of processes in manufacturing and office work. Work activities are separated into well-defined tasks, roles, rules, and procedures, which control most of the work in manufacturing or offices through a workflow [35]. Processes that were previously carried out by human beings are nowadays automated by information systems. So, the concept of a workflow is similar to reengineering and automating business and information processes in an environment. Such a workflow comprises cases, resources, and triggers that are related to a particular task, and it can be represented with this content as a Petri net with a specific workflow state [93].

A Workflow Management System is an arrangement that can fully define, manage, and execute "workflows" through software whose order of execution is driven by a computer representation of the workflow logic [45]. A Workflow Management System (WfMS) is a foundation that supports reengineering and automation of the processes of a particular environment; that is, it is a software solution for implementing a workflow system that supports workflows to be executed in a particular environment. A WfMS is not a specific business solution; by configuring the system and its workflows, it can be turned into one, or it can be used for other purposes. In other words, a workflow management system is a generic application to support various workflows [93]. Additionally, a WfMS coordinates the execution of its processes and supports fast design, interoperability among heterogeneous resources, autonomous components, and distributed systems, while ensuring reliability and handling concurrency and failures.

2.2 Scientific Workflow

A workflow is made by processing different logical data processing activities under a set of rules and automating them [53]. In scientific workflows, scientific data processing activities are assembled and automated to reduce execution time [53]. A scientific workflow is usually defined in a system as a DAG (Directed Acyclic Graph), where the nodes depict individual computational activities (known as modules) and the edges denote data flow and conditions among the modules [27]. Scientific workflows are gaining interest in scientific research communities, which can be seen from various events such as workshops, forums, research projects, and organizational funding [56]. In addition, because of the opportunities for data-intensive operation, SWfMSs (e.g., Galaxy [1], myGrid/Taverna [69], Kepler [56], VisTrails [16], and so on) are becoming increasingly common [23] for scientific data analysis. Their capability of handling heterogeneous data has also made scientific workflows a traditional way to analyze scientific problems [36].

2.3 Scientific Workflow Management System

A Scientific Workflow Management System (SWfMS) is a software solution to execute scientific workflows and manage scientific data sets in various computing environments [53] [7]. Such a SWfMS can manage a scientific workflow throughout its life cycle [53]. In other words, a Workflow Management System (WfMS) is an arrangement that can define, create, and manage the execution of workflows, and a SWfMS is a WfMS that can handle and manage scientific workflow execution [53]. In addition, in a SWfMS, automated techniques and provenance tracking provide advantages for both composition and execution [23] [36]. By automating computational activities in a SWfMS, many disciplines benefit, such as astronomy, biology, chemistry, environmental science, engineering, geosciences, medicine, physics, and the social sciences [36]. Furthermore, a SWfMS needs to be reliable and efficient in coordinating and automating data flow and task execution in various environments, such as distributed computing clusters, cyberinfrastructures, and commercial and academic clouds [27]. In a workflow management system, translating tasks into jobs and executing them is not the only concern; data management, execution monitoring, and failure handling also need to be considered for reliability and efficiency [27] [36].

2.4 Tool and Technologies

In this section, we provide a short description of the tools and technologies we have used to implement the various prototypes of the thesis. The remaining subsections discuss the tools and technologies required to implement our Scientific Workflow Management System.

Hadoop

Apache Hadoop [99] is a collection of tools that facilitates using a cluster of computers to solve computationally expensive problems that require huge amounts of data. The software provided by the Hadoop framework supports both distributed storage and distributed processing of data. To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job; MapReduce is a distributed data processing model and execution environment that runs on large clusters of computers.
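To make the MapReduce idea concrete, the following sketch shows a minimal word-count job in the style of Hadoop Streaming, which lets plain Python scripts act as the map and reduce phases; the script layout and names are our own illustrative assumptions, not code from the thesis.

# word_count_streaming.py -- minimal Hadoop Streaming sketch (assumed file name).
# In practice the two functions would live in separate mapper/reducer scripts
# passed to: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py
import sys

def mapper(stream=sys.stdin):
    # Emit "<word>\t1" for every word; Hadoop shuffles and sorts these pairs by key.
    for line in stream:
        for word in line.strip().split():
            sys.stdout.write(word + "\t1\n")

def reducer(stream=sys.stdin):
    # Pairs arrive grouped by key, so counts can be summed in a single pass.
    current, total = None, 0
    for line in stream:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            sys.stdout.write(current + "\t" + str(total) + "\n")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        sys.stdout.write(current + "\t" + str(total) + "\n")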

Spark

Spark is a distributed framework that helps iterative processes reuse data while retaining the scalability and fault tolerance of MapReduce [107]. To support this reusability, Spark offers a concept named resilient distributed datasets (RDDs). RDDs in a distributed system are read-only collections of objects distributed over different machines that can be reconstructed if necessary.
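As a small illustration of the RDD concept (our own sketch, not code from the thesis; the HDFS path is a placeholder), the PySpark fragment below builds an RDD, transforms it lazily, and persists it so that several later actions reuse the cached partitions instead of recomputing them.

from pyspark import SparkContext

sc = SparkContext(appName="rdd-reuse-sketch")

# Build an RDD from a (placeholder) HDFS path and transform it lazily.
records = sc.textFile("hdfs:///data/image_metadata.csv")
parsed = records.map(lambda row: row.split(","))

# persist() keeps the transformed partitions in memory so that several
# downstream modules can reuse them instead of re-reading and re-parsing.
parsed.persist()

print(parsed.count())   # the first action triggers the actual computation
print(parsed.take(2))   # later actions reuse the cached partitions

sc.stop()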

YARN

Because of the tight coupling and centralized job handling of MapReduce, the YARN computing platform was introduced into the Hadoop ecosystem. YARN decouples the programming model from the resource management infrastructure and delegates many scheduling duties within a cluster [94].

HDFS

Hadoop's Distributed Filesystem (HDFS) is a filesystem that runs on large clusters of physical or virtual machines and manages data storage and retrieval in the cluster. HDFS is designed to work efficiently in conjunction with the MapReduce model [99]. Furthermore, HDFS is intended to store vast data sets reliably and to stream those data sets at high bandwidth to their respective applications [81].

Association Rules

Association rules [2] were introduced to identify regularities between goods in large-scale transaction data recorded by a system in a store. Association rule mining is a rule-based technique for finding interesting relationships between items in large databases. Strong rules can be discovered in large databases by requiring them to satisfy constraints such as minimum support and confidence.
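The toy Python sketch below (our own illustration, with hypothetical module names) shows the two measures that such constraints are usually expressed with and that reappear in Chapter 4: the support of a rule counts how often its items occur together in the recorded pipelines, and the confidence measures how often the consequent appears when the antecedent does.

# Toy support/confidence computation over recorded pipelines (hypothetical modules).
pipelines = [
    {"register", "stitch", "segment"},
    {"register", "stitch", "cluster"},
    {"register", "segment"},
    {"stitch", "segment"},
]

def support(itemset, transactions):
    # Fraction of pipelines that contain every module of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Support of the whole rule divided by the support of its antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Rule {register} => {stitch}: the two modules appear together in 2 of 4 pipelines.
print(support({"register", "stitch"}, pipelines))        # 0.5
print(confidence({"register"}, {"stitch"}, pipelines))   # ~0.67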

Bioinformatics Tools

In our various experiments, we have considered some well-known tools from the bioinformatics domain; mainly PEAR [110], FastQC [12], and FLASH [58] are used in our experiments while composing workflows.

Deep Learning Based Image Processing Tools

Spark's MLlib library [59] provides effective functionality for a wide variety of learning settings as well as several basic statistical, optimization, and linear algebra primitives. As Spark's distributed machine learning library, MLlib supports various languages and offers a high-level API that can be used with Spark-based programs to simplify the implementation of machine learning or deep learning pipelines. Deep Learning Pipelines also presents high-level APIs for deep learning methodologies in a Python environment. Currently, the pipelines support the InceptionV3, Xception, ResNet50, VGG16, and VGG19 deep learning models of Keras [19].
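As a hedged illustration, the sketch below loads one of the listed models, InceptionV3, directly through Keras and uses it as an image feature extractor; the image file name is a placeholder, the import paths assume the tf.keras distribution of Keras, and this is plain Keras rather than the Spark Deep Learning Pipelines wrappers.

import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input
from tensorflow.keras.preprocessing import image

# InceptionV3 without its classification head acts as a 2048-d feature extractor.
model = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

img = image.load_img("field_plot.png", target_size=(299, 299))  # placeholder file
batch = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

features = model.predict(batch)
print(features.shape)   # (1, 2048): one feature vector for the single input image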

Python

We chose the Python language for our development because it has a wide range of libraries for scientific analysis. As an interpreted language with an expressive syntax, it is also easy to develop with. As stated in [70], Python is an excellent language for extending scientific code written in different languages. Furthermore, Python can wrap fundamental implementations in a high-level language suitable for scientific analysis, allowing fast, immediate use while remaining flexible for additional extensions.

Servers

Our web server is configured with Apache and Nginx together. In this configuration, communication with the outside world happens in two steps: Nginx faces the clients, while Apache runs as a backend server and exchanges data only with Nginx. In the second step, a uWSGI module is required to communicate with our application, which is implemented in Python; the uWSGI module of Nginx allows Nginx to communicate with Python applications through the uwsgi protocol [68].
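For illustration only, the smallest Python application that uWSGI can serve behind Nginx is a WSGI callable such as the sketch below (the standard PEP 3333 interface); the response text is a placeholder, and a real SWfMS endpoint would do considerably more.

# A minimal WSGI callable (PEP 3333) that uWSGI could serve behind Nginx.
def application(environ, start_response):
    # environ carries the CGI-style request variables forwarded by uWSGI.
    body = b"SWfMS web server is up"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]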

CouchDB is a comparatively modern database management system, designed from scratch to satisfy the requirements of modern software that is intended to be web-based, document-oriented, and distributed [50]. In our application, a CouchDB server is used for metadata management.
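Since CouchDB exposes a plain HTTP/JSON interface, storing a metadata record for an intermediate result can be sketched with the requests library as below; the server address, database name, document id, and fields are hypothetical, and an unauthenticated development instance is assumed.

import requests

COUCH = "http://localhost:5984"   # default CouchDB port; adjust for a real deployment
DB = "workflow_metadata"          # hypothetical database name

# Create the database if it does not exist (CouchDB answers 412 when it already does).
requests.put(COUCH + "/" + DB)

# Store one metadata document describing an intermediate dataset.
doc = {
    "workflow_id": "wf-042",
    "module": "image_segmentation",
    "hdfs_path": "hdfs:///intermediate/wf-042/segmented",
    "created": "2019-08-01T10:15:00Z",
}
resp = requests.put(COUCH + "/" + DB + "/wf-042-segmentation", json=doc)
print(resp.status_code, resp.json())   # 201 and {"ok": true, ...} on success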

Web UI Technologies

Interactive flowcharts, organizational hierarchies, tree structures, and other complex diagrams of nodes can be implemented with the JavaScript library GoJS [80]. Its support for interactive diagrams and visualization on a web platform, together with the ease of constructing custom components, makes this technology a good choice for our SWfMS implementation. The GoJS library provides interactive features such as dragging-and-dropping, copying-and-pasting, text editing, tool-tipping, templating, data binding and modeling, and event handling for its components on a web interface. Such features helped us implement the interactive features of our SWfMS.

Google Analytics

As a web analytics service, Google Analytics provides APIs that allow collecting user usage data [20].

Apache JMeter

As a load testing tool, Apache JMeter is used for analyzing and measuring the performance of web applications [41].

2.5 Web Services vs. Local Scripts vs. Distributed Scripts

All computational scripts and resources in a SWfMS should have plug-and-play characteristics for efficient reproducibility [56]. Computational jobs used as modules in workflows in a SWfMS can be local scripts, distributed scripts, or web services (Figure 2.2) and can reside on different servers. In our various data management experiments, we use local scripts and distributed computing scripts, and their outcomes are managed with our proposed data management technique in a SWfMS. Local scripts are plain scripts for modularized tasks, and in our SWfMS they are written in Python. When a user executes a local-script module in a workflow, a request is sent to the web server of the SWfMS, which runs the Python script for that module as a subprocess on the same server. With the proposed data management technique, the outcomes of local scripts, as intermediate data, are suggested to be stored on the same system as the web server. In the case of a distributed computing job, a client request is processed on a Spark/YARN cluster where the script of the requested job resides on the master node of the cluster. The outcomes of distributed computing scripts (i.e., intermediate data of modules) are stored in the Hadoop Distributed File System (HDFS) of the same cluster. Another important job type in workflows in a SWfMS is the web service. Web services are deployed following a Service-Oriented Architecture (SOA) to serve responses to client requests. Because of the availability of APIs and the SOA architecture, web services can also be incorporated easily in workflows as modules. In a SWfMS, incorporated web services can be both composite services and micro-services; a composite service can itself serve as a module in a workflow with a nested structure (i.e., composed web services forming a workflow inside a workflow) [69]. With the help of service-oriented architecture, service providers also apply caching on the servers of the services for an efficient user experience [91] [33] [90]. In addition, Liu and Ralph [54] considered caching on both the client side and the server side to overcome service issues caused by temporary loss of connectivity and to introduce a seamless user experience. The techniques suggested in these caching studies are also designed to store service outcomes, selected by various algorithms, in active memory for rapid access. The major difference between that kind of storing and our storing scheme is the scope of usage and purpose. For instance, web service caching techniques store data or responses in a way that serves the objective of efficient and fast responses. Our technique, on the other hand, stores the outcomes of workflow modules for a long time in persistent memory. Intermediate data are stored in persistent memory with respect to a particular dataset so that they can be reused in the future in other workflows on the same dataset in a SWfMS.
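The following sketch is our simplification of this flow (script names and paths are placeholders): a local-script module is launched as a subprocess of the web server, and its outcome is kept as an intermediate dataset that later workflows can reuse instead of rerunning the module.

import subprocess
from pathlib import Path

def run_local_module(script, dataset, out_dir="intermediate"):
    # One output file per (module, dataset) pair acts as the stored intermediate data.
    out_path = Path(out_dir) / (Path(script).stem + "_" + Path(dataset).stem + ".out")
    out_path.parent.mkdir(parents=True, exist_ok=True)

    if out_path.exists():
        # An intermediate result already exists for this pair: reuse it, skip execution.
        return out_path

    # Otherwise run the module script as a subprocess of the web server process.
    with out_path.open("wb") as sink:
        subprocess.run(["python", script, dataset], stdout=sink, check=True)
    return out_path

# Example with hypothetical module and dataset names.
result = run_local_module("modules/segmentation.py", "datasets/field_images.csv")
print("intermediate data stored at", result)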

Figure 2.1: Service oriented architecture (a service registers with a service directory; a service consumer finds and invokes the service, with client-side and server-side caches).

Figure 2.2: Local scripts, web services, and distributed computing scripts in a workflow.

Other differences between the existing storing techniques of service composition and our technique are discussed in this paragraph. Web service caching can be introduced on both the service side and the service-consumer side of the SOA architecture (Figure 2.1). In the caching studies of either side, data are stored for a particular web service or composite service, and the lifetime of the data depends on the request and response cycles of that service. In our case, the data management technique in a SWfMS stores module outcomes for future workflows. Thus, the technique stores, in a SWfMS, an optimal set of outcomes with the highest probability of reuse; this set is chosen from all of the outcomes of the modules of all workflows over a particular timeline. In their data provenance study of SWfMSs, Woodman et al. [100] stated that it would not be feasible to store all of the data generated from workflows in a SWfMS for provenance and other purposes. Therefore, an optimization over the sets of module outcomes is required in a SWfMS, and the proposed data management technique works on these optimized sets. Although web services are considered from the workflow execution logs to explore the probability of data reusability in our analyses, web services are not explicitly tested with the data management technique in our SWfMS. As web services can be composite services or micro-services, and the selection of the intermediate data of such services is another research problem involving user accessibility, we plan to investigate this area in the future to manage web service data in a SWfMS. In the rest of the thesis, the term "services" refers to both local and distributed services.

3 A Data Management Scheme for Micro-Level Modular Computation-intensive Programs in Big Data Platforms

Big data analytics systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights from huge amounts of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes for managing the programs and data of workflows have already been proposed in different studies, and most of the systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing the intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. In order to address this shortcoming, we propose a scheme of Big Data management for micro-level modular computation-intensive programs in a Spark and Hadoop based platform. In this chapter, we investigate whether management of the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in the Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. From our experiments, we obtained promising results; for example, we found that with intermediate data management, we can gain up to 87% of the computation time for an image processing job.

3.1 Motivation

With the rapid advancement of Big Data platforms, software systems [60], [95], [57], [85], [47] are being developed to provide an interactive environment for large-scale data analysis to end users in the areas of scientific research, business, government, and journalism. Big Data platforms such as Hadoop, Spark, Google Dataflow, and so on provide a high-level abstract interface for implementing distributed cluster processing of data. Recently, many researchers [28], [57], [86], [96] have focused on developing architectures and frameworks for large-scale data analysis tools utilizing these platforms. Most of these architectural frameworks adopt workflows or pipelines [76], [85], [71] for data analysis to provide a flexible job creation environment. Workflows, in general, connect a sequence of interoperating tools to accomplish a task. For example, in order to find features from images of a large crop field, different image processing tools such as image registration, stitching, segmentation, clustering, and feature finding tools need to be connected to form a workflow. In such workflow management systems, modularization is important to support scalability and heterogeneity. Thus, a wide range of tools is incorporated, where each tool is treated as an independent process, module, or service and can be implemented using different programming languages. Researchers and data analysts typically need to run the same workflows frequently with different algorithms and models, which causes overwhelming effort, even with moderate data sizes, if all of the modules in the workflows are run every time. The utility of those modules can be increased by storing their outcomes as intermediate states in such systems. In other words, proper data management increases the possibility of using the transformed data in later stages to reduce computation time, enhance reusability, and ensure error recovery.

Modular programming is a software design approach that draws attention to decoupling the functionalities of a script into independent, interchangeable, reusable units [65]. On the other hand, data modularity is always anticipated for quick and flexible access to a dataset. In this chapter, the intermediate states of data are regarded as modular data, which are the outcomes of modular programs. The dynamic management of datasets is another reason to have intermediate states for a huge amount of data [8]. Some other common benefits of intermediate data are sustainability, scalability, quick construction, and cost savings [8], [92], [64], [43]. In many data-intensive distributed applications, image processing is an inevitable part, where contemporary technologies of image analysis are pertinent for processing large sets of image data. However, special care is needed for both image data and programs to process huge amounts of data with such emerging technologies. To make users' tasks more realistic in real time for Big Data analytics in image processing, intermediate modular data states can be a valuable aid in designing and observing pipelines or workflows. In image-based tasks, the common execution operations (such as pre-processing, transformation, model fitting, and analysis) can be wrapped up as modules, and their outputs can be used for both observation and post-processing. Another important aspect of Big Data analytics is storing large amounts of data in a way that can be accessed at low cost [46]; thus, data management with reusable intermediate data and program states can be a great help in restoring program state at a lower cost.

A distributed image processing system requires huge storage and long data processing times for large high-resolution files. For such a system, stored intermediate states can be more fruitful than transmitting raw files for a parallelized task. Spark itself generates some intermediate states of data in its processing engine to reduce I/O cost [104]. Hence, in our modular programming paradigm, if we introduce a mechanism for storing and managing intermediate states, we can reuse data from various modules without recomputation, which eventually minimizes the total cost of processing. Many researchers and organizations now focus on long-running data processing in distributed parallel environments as a low-cost and convenient way to handle large amounts of data [4]. Following these trends, applying machine/deep learning algorithms to heterogeneous data is another emerging practice for analyzing large amounts of data in agricultural and other data-intensive tasks [84]. Considering the current technologies and trends, our scheme would be a contemporary package for analyzing a large amount of image data.
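The following minimal sketch illustrates the reuse idea described above: a module's output is looked up in persistent storage before recomputing it. The path layout and helper names are hypothetical assumptions for illustration; the actual system stores states in HDFS through its own APIs.

import os
import pickle

def run_module(module_fn, input_data, cache_path):
    """Run one workflow module, reusing a stored intermediate state if present.

    cache_path is a hypothetical location (e.g., an HDFS-mounted directory)
    where a previous run pickled this module's output.
    """
    if os.path.exists(cache_path):
        # Reuse the stored intermediate state and skip recomputation.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = module_fn(input_data)      # compute the intermediate state
    with open(cache_path, "wb") as f:   # persist it for later workflows
        pickle.dump(result, f)
    return result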

Although a few studies [8], [92], [64], [43], [44] focused on results propagated through intermediate states in Big Data management, to the best of our knowledge none reflected on the performance implications of reusing intermediate data and models on distributed platforms. Moreover, these studies do not show how intermediate states can be reused across various image processing workflows. To fill the gap, in this chapter we investigate whether the intermediate states generated by modular programs can speed up the overall execution time of workflows in Hadoop/Spark-based distributed processing units while assuring enhanced reusability and error recovery. Our initial assumption was that loading intermediate states from HDFS (Hadoop Distributed File System) might take longer than generating them in memory during the execution of a workflow. In order to examine the actual situation, we developed a web-based interactive environment where users can easily access intermediate states and associate them across various workflows, e.g., some outcomes of the image segmentation pipeline are reusable in the image clustering and registration pipelines. We found that even though it takes some time to load intermediate states from HDFS-based storage, a workflow can be executed faster in a system that keeps intermediate states in persistent storage than in a system without them. Our finding is illustrated in Figure 3.5 for various workflows: if we can skip some modules with the help of stored intermediate states, we get better performance than without storing intermediate states. Details of the performance and reusability are discussed in Section 3.5. The rest of the chapter is organized as follows. Section 3.2 discusses the related work, Section 3.3 presents our proposed scheme of intermediate data management, Section 3.4 describes our experimental setup, Section 3.5 compares our proposed scheme with the usual technique of intermediate data handling using various image processing pipelines, Section 3.6 describes some valuable findings towards data management for modular programming, and finally, Section 3.7 concludes the chapter.

3.2 Related Work

A number of studies [13], [57], [96], [31] have investigated recent file system techniques for processing Big Data, and other studies [86], [51], [71], [72], [21] present applications of data or image processing in such file systems with metadata management. Similarly, various databases, algorithms, tools, technologies, and storage systems are acknowledged in a number of studies that address Big Data processing for image analysis and plant-based research in distributed environments. Studies related to phenotyping, image processing, and intermediate states in distributed environments are considered for comparison with our work, and some of them are presented below in three subsections.

3.2.1 Study on file systems of Big Data processing

Blomer [13] investigates various file systems of computer systems and compares them with parallel distributed file systems. The benefits of both distributed file systems and core operating system file systems are discussed.

A distributed file system as a layer on top of the main file system is also addressed, along with its common facilities and problems. In their work, key points of data and metadata management are also discussed for MapReduce-based problems and the fault tolerance concept. Luyen et al. [57] experiment with different data storage systems holding large amounts of data for phenotyping investigations. Their work on a number of heterogeneous database systems and technologies such as MongoDB, RDF, and SPARQL explores possibilities of interoperability and performance. Similar to the above studies, Donvito et al. [31] discuss and compare various popular distributed storage technologies such as HDFS, Ceph, and GlusterFS. Wang et al. [96] propose a metadata replication technique in a Hadoop-based parallel distributed framework for failure recovery. Emphasizing metadata replication on slave nodes for reliable data access in a Big Data ecosystem, the proposed technique works in three stages: initialization, replication, and failover recovery. Similarly, Li et al. [51] propose a log-based metadata replication and recovery model for metadata management in a distributed environment targeting high performance, balanced load, reliability, and scalability. In their work, caching and prefetching technologies are used for the metadata to enhance performance. While the above studies focus on metadata management in various file systems, our study is fundamentally different: we investigate a scheme of intermediate data management for distributed file systems that considers reusability, error recovery, and execution at a lower cost.

3.2.2 Big Data in terms of Image Processing

Minervini et al. [60] discuss different challenges in phenotyping by addressing the current limitations of image processing technologies, where collaboration among diverse research groups is emphasized to overcome the challenges. Although collaboration is an important part of a scientific workflow management system (SWfMS) and we are proposing a scheme for a SWfMS, the main focus of this study is the consideration of intermediate states of micro-level, modular, computation-intensive programs while managing workflows in a distributed environment to increase efficiency. Some other studies on phenotyping concentrate on processing costs and the appropriate use of current technologies to reduce those costs; for example, Minervini et al. [61] propose a context-aware capturing technique for JPEG images that are processed by distributed receivers in a phenotyping environment. Singh et al. [84] provide a comprehensive study of ML tools in the phenotyping area for their appropriate and correct application in plant research. Coppens et al. [21] emphasize data management for automated phenotyping and explore the potential of image metadata for synergism (the use of metadata alongside the main features). Their work also addresses how sensor metadata from the imaging technique could be filtered appropriately for such synergism. Smith et al. [86] address the importance of metadata in Big Data ecosystems for migrating and sharing data or code from one environment to another. Data consolidation, analysis, discovery, error, and integration are also scrutinized from a metadata usage perspective in the Big Data environment. Pineda-Morales et al. [72] consider both file metadata and task metadata to propose a model for filtering hot metadata. After categorizing and analyzing both hot and cold metadata, they recommend distributing hot metadata (frequently accessed)

and centralizing cold metadata in a data management system. Although the above works show the behavior and direction of metadata, none of them analyze processed or intermediate data as metadata for efficient management. In our work, the modular outcomes of computation-intensive programs are analyzed, and a scheme is proposed to store those intermediate outcomes for efficiency in a SWfMS.

3.2.3 Study on Intermediate-data

Becker et al. [8] survey current Big Data usage and emphasize intermediate data behavior for iterative analysis at every stage of processing, taking an algorithmic perspective to minimize the overall response time. They also consider saving intermediate data in persistent storage at checkpoints for error recovery. Similarly, Heit et al. [44] explain the nature of intermediate data while presenting a statistical model for Big Data and report that intermediate data should be independent. Intermediate data management is considered by Tudoran et al. [92], who point out storing intermediate data for rollback purposes at the application level. Intermediate data for checkpoint analysis are also considered by Mistrik and Bahsoon in their work [64]. Likewise, how intermediate data are dispatched in parallel distributed systems such as Spark and Hadoop is discussed by Han and Hong [43] in their book, which brings together and analyzes various areas of Big Data processing. Although some of the above works discuss fault tolerance using intermediate states, none of them discuss the possibility and necessity of storing intermediate data for Big Data analytics in a distributed environment. Reviewing the above works, we realized that efficient data access is a major concern in a distributed environment, and the time implications of both storing and loading data need to be observed there. Focusing on these issues, we investigate a data management mechanism that stores intermediate states with the help of modularized programs to improve processing time, reusability, and error recovery.

3.3 Proposed Method

Modular programming for computation-intensive tasks is already a proven technology for code reusability [116], [10], [88], [29], but when large raw datasets must be transferred and loaded from distributed systems, we cannot fully exploit the benefits of modularity. On the other hand, processed data require less memory and less transfer and loading time. Considering both cases, we store processed intermediate data along with other program data and explore three research questions in a SWfMS:

• RQ1: Does the system support data reusability that ultimately reduces time and effort?

• RQ2: Does it have an error recovery mechanism that allows workflows to be reconstructed more quickly when errors occur?

• RQ3: How does the system support fast processing for frequently used workflows?

Answers to these three research questions are presented in Section 3.5. To use both intermediate data and modular programs, and to explore the usability of modular outcomes in our full system, a modularized data access mechanism is introduced. We have an interactive web interface that is used for accessing the input and output datasets of various modules. A dedicated distributed data server, where images and intermediate data are stored, is used in our system for parallel processing. From the web interface, image processing workflows can be built, and the modules of the workflows can be submitted to a distributed processing master node as jobs; in this way, distributed image data are processed by the specified modules of the workflows.

Figure 3.1: Proposed data management scheme.

The two major types of metadata in an image processing environment are file-structure metadata and result metadata [72]. In our system, we manage both kinds of metadata using APIs that retrieve and store information on a dedicated Hadoop server. This dedicated server is designed to facilitate parallel processing by different modules of pipelines or workflows. When a SHIPPI (Spark High-throughput Image Processing Pipeline) module needs to be executed for a task, available data or intermediate data states are suggested on a web interface with the help of the metadata and HDFS APIs (see Figure 3.1). SHIPPI is our library that provides cloud-based APIs for the modular functionality of image processing [65]. Details of SHIPPI are described in the experimental setup section, i.e., Section 3.4. Our system

first tries to find suitable datasets for a modular task using the task information and information from the metadata server; the suggested datasets are then presented on a web interface so that one can be selected for the particular task from our distributed storage system, i.e., HDFS. Suggested datasets can be either intermediate or raw data, but intermediate states always have a higher priority for selection when available. Another important task of our system is to keep track of metadata and intermediate data. After every execution, a single module produces a few outputs, such as analysis results and intermediate data. Intermediate data are stored in the distributed file system with the actual data, and analysis results are stored on a server as metadata for future use.
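The suggestion step described above can be sketched as a small web endpoint. This is a minimal illustration only: the endpoint name, the in-memory metadata dictionary, and the paths are hypothetical stand-ins for the system's metadata and HDFS APIs.

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical metadata: for each (dataset, module) pair, the HDFS locations of
# the raw data and of any stored intermediate state produced by that module.
METADATA = {
    ("2KCanola", "transformation"): {
        "raw": "hdfs:///p2irc/raw/2KCanola",
        "intermediate": "hdfs:///p2irc/intermediate/2KCanola/transformation",
    },
}

@app.route("/suggest")
def suggest():
    """Suggest datasets for a module; intermediate states get higher priority."""
    key = (request.args.get("dataset"), request.args.get("module"))
    entry = METADATA.get(key, {})
    suggestions = []
    if "intermediate" in entry:              # preferred when available
        suggestions.append({"type": "intermediate", "path": entry["intermediate"]})
    if "raw" in entry:
        suggestions.append({"type": "raw", "path": entry["raw"]})
    return jsonify({"suggestions": suggestions})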

Figure 3.2: Reusable intermediate data in workflows.

How the proposed scheme of data management with modular programs and intermediate data can introduce reusability, error recovery, and faster processing in a Scientific Workflow Management System (SWfMS) is described with simple workflows (Figure 3.2). In the figure, four workflows are illustrated with their modules and intermediate data. All black- and grey-marked processes (Ps) and intermediate data (IDs) appear more than once across the four workflows and therefore have the potential to be reused. This means they are likely to be reused later, either in the same workflow with different tools or in other workflows in a SWfMS. Different intensities of black and grey represent the same modules or intermediate states from different workflows in different groups. A particular group is formed for the same operations on the same dataset, or for the intermediate states generated by the same sequence of module operations on the same input dataset. In workflow 1, we have four modules (P1, P3, P5, and P7). These modules perform operations sequentially on dataset D1 and produce three intermediate datasets (ID1, ID2, and ID3), where the last outcome is the desired output (O/P). All workflows have processing modules that operate on specific datasets or intermediate datasets to generate desired intermediate outcomes or final results. The intermediate states are usually consumed by the intermediate modules

sequentially in workflows (N.B. for simplicity we consider only sequential module processing in workflows). Workflow 2 has three modules (P1, P3, and P4) that generate two intermediate datasets (ID1 and ID2). Workflows 3 and 4 have five and four processing modules with four and three intermediate stages, respectively. The last three workflows in the figure work on the same dataset D2, and a few modules such as P1, P2, and P3 are frequently used for building those workflows. This scenario opens the possibility of reusing intermediate states, as it is common in a SWfMS that the same modules are used in different workflows. Likewise, the same sequence of operations on the same or a different dataset can occur in various workflows, supporting the possibility of program reusability. In workflow 2, an operation by module P1 is performed on dataset D2 and generates the intermediate dataset ID1. This operation and its outcome (ID1) are the same in the other workflows that apply the same module sequence to the same dataset. The same sequence P1 → P2 → P3 in workflow 3 also occurs in workflow 4 with the same input dataset D2. Thus, ID3 (an intermediate dataset) from workflow 3 can be used directly in workflow 4, skipping the first three module operations so that only the last module needs to run, which increases performance. To introduce this type of efficiency and performance enhancement in a workflow management system, module outcomes need to be stored with an appropriate access mechanism. Furthermore, if failures occur in any workflow, stored intermediate states (IDs) can be used to recover the workflow by addressing the issues only in the faulty modules. In a distributed environment, data are partitioned and replicated on different nodes, and the transition time of data is not as simple as in a single file system. Data transfer time from slave nodes to master nodes needs to be considered for both storing and retrieving. So, for storing and retrieving intermediate data in such distributed systems, experiments are necessary regarding transfer and computation time. Our experimental section (Section 3.5) is designed to investigate the actual scenario of I/O operation and computation time for micro-level, modular, computation-intensive programs in a distributed environment and to explore the possibility of reusability.
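To make the reuse condition above concrete, the following sketch keys an intermediate state by its input dataset and the module prefix that produced it, and looks for the longest stored prefix that matches a new workflow. The dictionary contents, the path, and the final module name are illustrative placeholders rather than values taken from Figure 3.2.

def longest_reusable_state(dataset, modules, stored_states):
    """Return (k, location) where k is the number of leading modules whose
    output is already stored for this dataset, or (0, None) if nothing matches.

    stored_states maps (dataset, module_prefix_tuple) -> intermediate-data path.
    """
    for k in range(len(modules), 0, -1):          # prefer the longest prefix
        key = (dataset, tuple(modules[:k]))
        if key in stored_states:
            return k, stored_states[key]
    return 0, None

# Mirroring the situation described for Figure 3.2: ID3 was produced by
# P1 -> P2 -> P3 on D2 in workflow 3, so a later workflow on D2 that starts
# with the same three modules (the last module name here is a placeholder)
# can skip them.
stored = {("D2", ("P1", "P2", "P3")): "hdfs:///intermediate/ID3"}
print(longest_reusable_state("D2", ["P1", "P2", "P3", "P_last"], stored))
# -> (3, 'hdfs:///intermediate/ID3')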

3.4 Experimental Setup

In our experiments, to evaluate full-system performance with the proposed scheme, a web platform is used; its web interface is built with current web technologies and the Flask (a Python micro-framework) web framework. From the web interface, SHIPPI's APIs are called to execute a job on an OpenStack-based Spark-Hadoop cluster (five nodes, 40 cores, and a total of 210 GB RAM). SHIPPI provides cloud-based APIs for the modular functionality of an image-processing pipeline and automatically parallelizes the different modules of a program. Each API call is modularized into four sections: conversion, estimation, model fitting, and analysis-transform. Figure 3.4 shows the block diagram of SHIPPI, which represents the relationships among the different libraries and modules of the framework. In our testing environment, HDFS is used to store our image data for parallel processing. To store intermediate states, we used the Pickle library of Python, and we designed three image processing pipelines - Segmentation, Clustering, and Leaves Recognition - following our

proposed model discussed in Section 3.3. Every pipeline was tested at least five times, and the average execution time was recorded. For testing purposes, we mainly used three datasets - Flavia [103], 2KCanola, and 4KCanola. Each of those datasets has more than 2000 images; Flavia is a well-known dataset used for leaf recognition systems, and the 2KCanola and 4KCanola datasets are from the P2IRC (Plant Phenotyping and Imaging Research Centre) project of the University of Saskatchewan and are used for various image analysis purposes.
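As a rough illustration of how a module's output can be persisted in HDFS with Python's pickle machinery, the snippet below uses PySpark's saveAsPickleFile/pickleFile pair on a toy RDD. The HDFS path is hypothetical, and this is a sketch rather than SHIPPI's actual code.

# A minimal sketch (not SHIPPI's actual code) of persisting a module's output
# as a pickled intermediate state in HDFS and loading it back in a later run.
from pyspark import SparkContext

sc = SparkContext(appName="intermediate-state-demo")

# Pretend these are per-image features produced by an estimation module.
features = sc.parallelize([("img_001", [0.12, 0.53]), ("img_002", [0.48, 0.09])])

state_path = "hdfs:///p2irc/intermediate/segmentation/estimation"  # hypothetical
features.saveAsPickleFile(state_path)   # store the intermediate state

reloaded = sc.pickleFile(state_path)    # a later workflow reuses it
print(reloaded.count())                 # 2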

Figure 3.3: Pipeline building with image processing tools.

The three pipelines are illustrated in Figure 3.3, where a single job such as leaves recognition, segmentation, or clustering is built from its respective modularized parts. The first module (Leaves Description) of the leaves recognition pipeline extracts features and prepares data for the next module, Leaves Matching. The Leaves Matching module, which is the final module of the leaves recognition pipeline, compares those features and reports a leaf classification result. Similarly, the image segmentation and clustering pipelines are modularized into four common modules: transformation, estimation, model fitting, and analysis. Transformation is mainly used for colour conversion, estimation is used for feature extraction, model fitting is used for training an algorithm and setting parameters, and the final module, analysis, is used for testing and result preparation. Pipelines can be built on the web interface by selecting modules and parameters and submitting the desired task to the distributed environment, where each module runs as a job and its outcomes are gathered from the master node and the distributed file system for further analysis.
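The four-phase modular structure just described can be outlined as follows. The function bodies are deliberately trivial placeholders (the real modules are SHIPPI jobs running on Spark); the point is that each phase is an independent unit whose output can be kept as an intermediate state and inspected or reused on its own.

def transformation(images):            # e.g., colour conversion
    return [img.lower() for img in images]                 # placeholder logic

def estimation(transformed):           # e.g., feature extraction
    return [{"image": t, "features": [len(t)]} for t in transformed]

def model_fitting(features):           # e.g., training / parameter setting
    return {"n_samples": len(features)}                    # placeholder "model"

def analysis(model, features):         # e.g., testing and result preparation
    return {"processed": model["n_samples"], "ok": True}

def run_pipeline(images, store):
    """Chain the four phases, keeping every phase's output as a reusable state."""
    t = transformation(images)
    store["transformation"] = t
    f = estimation(t)
    store["estimation"] = f
    m = model_fitting(f)
    store["model_fitting"] = m
    result = analysis(m, f)
    store["analysis"] = result
    return result

states = {}
print(run_pipeline(["IMG_001.PNG", "IMG_002.PNG"], states))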

3.5 Results and Evaluation

We developed a Spark High-throughput Image Processing Pipeline framework known as SHIPPI, an image processing pipeline library that abstracts MapReduce, data loading, and the low-level complexity of image operations. SHIPPI was written in Python 2.7 and Spark 2.0.1. The four modules of our modular program paradigm for a pipeline built with SHIPPI can be modified according to the image processing operations required for the desired purpose. Furthermore, these modules are parallelized with the help of the PySpark library, and each output

Figure 3.4: Block diagram of components of SHIPPI.

Table 3.1: Time gain for the reusable intermediate data

Pipeline Tool      | Step 1                | Step 2    | Gain
Leaves Description | 1163.7 sec + 35.7 sec | -         | -
Leaves Matching    | Can be skipped        | 175.9 sec | 1199.5 sec

of a module is used in the next module. In a Spark-based solution, all data are in main memory, so to reuse the processed data, we store them as intermediate states in the Hadoop distributed file system from those modules that yield reusable data. This system is designed not only for reusability but also for error recovery (given the higher failure rate of a long-running program or pipeline with a huge amount of data). Thus, if we store intermediate data, there is a chance to recover the system at low cost. In total, eight distinct modules of the three image processing pipelines have been implemented with current technologies, and all of them are designed for both the without-intermediate and with-intermediate data concepts. The transformation and estimation modules in the pipelines use the same technology, and they are similar in the last two pipelines. Figure 3.5 is the execution-time comparison graph of both concepts for the three pipelines in the distributed and Ubuntu file systems. Although saving intermediate states is a memory-overhead task for all of the pipelines in the figure, there are opportunities to use the stored data and skip some processing steps to execute a pipeline at low cost. This skipping procedure eventually increases the flexibility and reusability of analyzing fractions of pipelines at low cost. Pipelines with modules skipped by using intermediate data are shown with improved performance in the figure. Another major concern of our investigation was to explore the trade-off between computation time and data loading/storing time of modules in distributed systems. For investigating this, we previously assumed that the data and

metadata transfer time between slave and master nodes would exceed a module's computation time in a distributed environment. However, the experimental results for the pipelines with skipped modules show improved performance and contradict our assumption regarding the time implications of data transfer. Hence, a workflow management system with a distributed processing unit is capable of handling intermediate states, and such a scheme can be introduced in such systems for reusability, error recovery, and performance enhancement. We organize our experiments into three sections to answer the research questions about our system's usability. All three image processing pipelines are considered in each section for their respective illustrations. Below are the three main subsections of our experiments.


Figure 3.5: Performance comparison of workflows with-intermediate and without-intermediate data

3.5.1 Experiment for reusability

A collaborative SWfMS is used to design the image processing pipelines and submit them as jobs to a distributed parallel processing system. Figures 3.6 and 3.7 show the configuration interface for the image recognition pipeline, which is implemented with SHIPPI's four-phase modular architecture where only two modules are considered for simplicity. The intermediate state from the first module of this pipeline is reusable by image matching modules. So, the outcome of the image description stage is important for an image matching technique and needs to be stored for reuse. In Figure 3.7, for the image matching technique, there is an option to use intermediate states for reuse. This possibility of using intermediate states is the reusability model we employ in our system: the outcome of a module can be used by other modules if it is available and appropriate for those modules. Consider another situation, when image files are huge because of their high resolution; then pipelines spend a large portion of their total execution time only on data transfer and loading in a distributed parallel processing system. Thus, features or descriptors stored as intermediate states can help reduce the transfer time to a certain extent. From Table 3.1, we can see that the module at step 1 (Image Descriptor) required more than 1199.4 seconds to accomplish the feature extraction part. In a pipeline design, we can

Figure 3.6: Image description module in our interactive workflow composition environment.

Figure 3.7: Image matching module in our interactive workflow composition environment.

skip this step if we can use intermediate states and can gain up to 1199.4 seconds.

3.5.2 Experiment for error recovery

While a pipeline is running on a cluster, it is common that not all steps/modules succeed and the pipeline fails to produce the desired output. However, in a workflow management system, stored intermediate states can help recover a workflow execution with an appropriate module configuration. If we store intermediate data and the necessary parameters for modules, then even if an error occurs in a module operation, the data and configuration up to the previous module are still available as stored intermediate data and parameters. So, by configuring the failed module with the stored parameters and intermediate data, the next run of the pipeline will only take time for the failed modules. For example, in our case (Figure 3.8), the image recognition pipeline is divided into two modules using the SHIPPI library; the second stage sometimes fails due to too many comparisons and I/O operations, but using intermediate states we can quickly re-run the second stage by only addressing the error in that module. Similarly, both the segmentation and clustering pipelines are error-prone at the model fitting stage due to the heavy computation involved; in that case, up to two stages of computation can be saved if we can reuse the stored data with appropriate parameters, and this is possible from our web interface. Besides, in a distributed environment, when computation and data are enormous, the failure rate is high for the various tasks of a job due to memory overhead and the access limitations of different libraries. So, our proposed approach can be a feasible solution for such computation-intensive tasks to avoid errors or failures.
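The recovery behaviour described above amounts to checkpointing each stage and resuming from the last successful one. A minimal local sketch follows; the checkpoint directory, file naming, and in-memory module functions are hypothetical, whereas the real system keeps these states in HDFS and re-submits only the failed module as a job.

import os
import pickle

def run_with_recovery(modules, data, checkpoint_dir):
    """Run modules in order, checkpointing each stage's output. If a later
    module fails, re-running this function re-executes only the failed stage
    (and those after it), loading earlier stages from their checkpoints."""
    os.makedirs(checkpoint_dir, exist_ok=True)
    for i, module in enumerate(modules):
        ckpt = os.path.join(checkpoint_dir, "stage_{}.pkl".format(i))
        if os.path.exists(ckpt):               # stage already completed earlier
            with open(ckpt, "rb") as f:
                data = pickle.load(f)
            continue
        data = module(data)                    # may raise on failure
        with open(ckpt, "wb") as f:            # checkpoint the successful stage
            pickle.dump(data, f)
    return data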

Figure 3.8: Common error occurrences in workflow execution.

3.5.3 Experiment for fast processing

Both the distributed and Ubuntu file systems are considered in the execution time experiments for our scheme, where pipelines are designed both with and without intermediate data. Table 3.1, Figure 3.5, and Figure 3.9 present those execution time experiments. Examining the table, we can say that with intermediate states, the possibility of skipping execution steps can make pipeline execution

time shorter than the normal execution. Figure 3.5 illustrates the execution time comparisons and performance of the three image processing pipelines in our SWfMS. These image processing pipelines are categorized by three techniques of workflow building for the two file systems. The techniques are: executing workflows while storing intermediate data, while not storing intermediate data, and while skipping some module operations. So, there are nine experiments for each file system, which makes a total of eighteen execution scenarios in our system across both file systems. Short forms of the above techniques are given below.

• Leaves Recognition workflow without storing intermediate data (LRWoI)

• Leaves Recognition workflow with storing intermediate data (LRWtI)

• Leaves Recognition workflow by skipping descriptor operation (LRSD)

• Segmentation workflow without storing intermediate data (SWoI)

• Segmentation workflow with storing intermediate data (SWtI)

• Segmentation workflow by skipping transformation and analysis (SSTA)

• Clustering workflow without storing intermediate data (CWoI)

• Clustering workflow with storing intermediate data (CWtI)

• Clustering workflow by skipping transformation and analysis (CSTA)

The gridded dark-grey, grey, and white bars in Figure 3.5 represent the leaves recognition system's performance in the two different file systems. If we skip the descriptor calculation part using an intermediate state, the pipeline performs better; the gridded white bars in the figure show the skipped descriptor part and the resulting performance. The same goes for the clustering and segmentation pipelines: the dotted and lined bars in the figure represent the segmentation and clustering algorithms' performance respectively, and the white bars of these patterns represent the skipping of modules and the improved performance. Furthermore, from Figure 3.5, we can get an idea of the quantitative performance of the three different pipelines, where grey colours represent the computation time of all modules of the pipelines with or without intermediate states, and white colours represent the computation time of the pipelines that can skip some module operations because intermediate states are available. So, it can be interpreted from the chart that if we can skip some module operations thanks to intermediate states, the system or pipeline can be executed at a low cost in both file systems. There is some delay in storing data in a storage system, but with reusability a system can work much faster on already processed data. In a SWfMS, the amount of data increases frequently as various workflows are processed. So, without proper data management, the volume and velocity of Big Data cannot be addressed efficiently. Likewise, different types of data exist in a SWfMS; without proper annotation and association, the variety of Big Data processing cannot be handled. All three V's of Big Data can be appropriately handled in a scientific workflow management system by using data-centric component management. Data-centric component

development for large-scale data analysis is a current trend to reduce error rates and increase usability, and our system can provide such facilities using stored intermediate data.

3.6 Discussion and Lesson Learned

From our experimental studies, we found that although loading intermediate states from HDFS takes some extra time, using intermediate states yields better performance than loading large raw image files when executing image processing workflows. In Figure 3.9, the performance from normal execution to module skipping is illustrated for all three image processing pipelines. For each pipeline, the gain from skipping modules is roughly proportional to its normal execution time: as the computation time increases, so does the gain. This directly indicates that the proposed scheme is more effective for computationally intensive tasks than for pipelines with an average workload. We also experienced that discovering data reusability across various image processing tasks is challenging, and significant analysis is required in this regard. In this situation, breaking an image processing task into micro-level tasks and offering APIs for each task, following the micro-level modular programming concept, is a viable solution. We learned that in a system of modular programming a user has more control over task configuration; for instance, if a user wishes, certain steps can be used or skipped with respect to the required functionalities and data availability. Some other common benefits of modular program design that we experienced are easy bug detection, fewer computation errors, easy program handling, and modular testing. Although we noticed that this kind of modularity hampers performance, with appropriate reuse of intermediate states it is possible to recover the loss. To identify matching intermediate states across various image processing tasks, applying machine learning algorithms might be another solution. In this case, we would need to tag intermediate states with metadata to associate them with the tasks in which they might be used. A system should automatically suggest intermediate states to users during the composition of workflows. While working on our cluster, we gathered some valuable knowledge about distributed processing frameworks; for example, during parallel processing the cores of different nodes are busy with different tasks of a specific job, so if anyone wants to store any processed data, it needs to happen on a master node. To store any state, some data transfer time from the different nodes is added to the storing procedure, which eventually increases a program's execution time. Another common problem of a distributed system is the memory overhead exception. If a master node does not have enough main memory, programs cannot execute operations on a large dataset. For example, in our case, the 30 GB RAM of the master node could not process 4K images; it was later increased to 90 GB, and then 4K images were processed with a clustering algorithm. Another problem of a Spark-based distributed system is that, for programs with rapid file system access, it throws I/O exceptions or SSH library file-loading exceptions. Although there are various challenges in processing a large dataset

(Figure 3.9: bar chart; y-axis in seconds; x-axis configurations Without ID, With ID, and Skipping Modules; series: Leaves Recognition, Clustering, and Segmentation in the Ubuntu file system and in HDFS; ID = intermediate data.)

Figure 3.9: Performance comparison in both environments

in a distributed environment, such environments are inevitable for facilitating a real-time experience with Big Data.

3.7 Conclusion

We devised a data management scheme for our image processing pipeline framework in which we store intermediate data and program state so that processed data can be reused and a pipeline failure can be recovered from that data. When a dataset has a large number of images and the associated programs need long computation times, there is a high chance of errors. Despite this, existing models have not concentrated on organizing intermediate states, only raw data and metadata. Considering the above, we proposed a model of intermediate states and showed that using intermediate states makes it easier to recover from failures quickly and increases reusability in a SWfMS. In addition, intermediate states can be used in other pipelines to reduce time cost, which further increases data reusability. Providing end users with an interactive environment for processing Big Data in a parallel distributed framework while at the same time giving them reusability and more control is a challenging job. Modularization of a program or task alone might not be a feasible solution in many cases. So, our model with data modularization, i.e., an intermediate data state at every stage of a modular program, gives a user more control over usability. Above all, our study contributes to handling the Volume, Velocity, and Variety of Big Data in image processing. In the current practice of workflow or pipeline design, high-capability computing resources may not help users if the users do not have more control over data usability. Hence, using the proposed model of intermediate data states, resource utilization can be increased to a certain level. Our experimental results demonstrate such utilization, which makes the practical use of our system possible. In the next chapter, we discuss the technique of automatic data storing and recommendation for our proposed data management scheme.

4 Optimized Storing of Workflow Outputs through Mining Association Rules

While an intermediate data management scheme for modularized computation-intensive programs can increase reusability and performance in a SWfMS, an automatic decision mechanism is essential for storing and recommending workflow data in a SWfMS. Workflows are frequently built and used to systematically process large datasets using workflow management systems (WfMSs). A workflow (i.e., a pipeline) is a finite set of processing modules organized as a series of steps that is applied to an input dataset to produce a desired output. In a workflow management system, users generally create workflows manually for their own investigations. However, workflows can sometimes be lengthy, and the constituent processing modules might often be computationally expensive. In this situation, it would be beneficial if users could reuse intermediate stage results generated by previously executed workflows when executing their current workflow. In this chapter, we propose a novel technique based on association rule mining for suggesting which intermediate stage results of a workflow that a user is going to execute should be stored for future reuse. We call our proposed technique RISP (Recommending Intermediate States from Pipelines). According to our investigation of hundreds of workflows from two scientific workflow management systems, our proposed technique can efficiently suggest intermediate state results to store for future reuse. The results that are suggested for storage have a high reuse frequency. Moreover, for creating around 51% of the entire set of pipelines, we can reuse results suggested by our technique. Finally, we can achieve a considerable gain (74%) in execution time by reusing intermediate results stored according to the suggestions provided by our proposed technique. We believe that our technique (RISP) has the potential to have a significant positive impact on Big Data systems, because it can considerably reduce the execution time of workflows through appropriate reuse of intermediate state results and hence can improve the performance of such systems.

4.1 Motivation

A scientific workflow management system (SWfMS) is a special type of WfMS (workflow management system) that lets its users perform computationally expensive and time-consuming tasks by decomposing them into sequentially organized and inter-dependent modules. Our research in this chapter deals with workflows (i.e., pipelines) in scientific workflow management systems. Currently, scientific workflow management systems are commonly used in various scientific, engineering, and business organizations for conducting investigations, improving system efficiency, increasing productivity by reducing costs, and improving information exchange.

In Big Data analytics, when a large volume of heterogeneous data needs to be processed with various mechanisms, a scientific workflow management system (SWfMS) should be considered for efficiency.

In a SWfMS, users can manually build their workflows by selecting and sequentially adding processing modules from a finite set of modules available in the SWfMS for performing their desired investigations. Each workflow or pipeline works on a particular dataset provided by a user. The processing modules in a workflow are ordered in such a way that the output produced by a particular intermediate module can be used as input by the next module in the workflow. The output that we obtain from the last module is considered the final output from the workflow. Users often create workflows for processing large datasets. The intermediate state results produced by the modules in such workflows can also be very large. Moreover, each of the processing modules might require a considerable amount of time for data processing. In such a situation, it would be beneficial if a user could reuse results produced by previously executed workflows when the user plans to execute a workflow on the same dataset.

In order to provide automatic support for reusing results from previously executed workflows, we need a mechanism for determining which of the intermediate state results obtained from a particular workflow have a high possibility of being reused in the future. In our research presented in this chapter, we propose such a mechanism (i.e., technique), which we call RISP (Recommending Intermediate States from Pipelines). RISP provides suggestions for storing intermediate state results by analyzing association rules between data and processing modules from the pipelines in the history. To the best of our knowledge, our study is the first one to investigate providing suggestions for storing intermediate state results from pipelines.

Previous studies considered storing all intermediate state results from pipelines. While storing all intermediate state results is good for provenance, it is not suitable from the perspective of reusability and resource utilization. To store all intermediate state results from all pipelines, we need a significant amount of storage space. As new pipelines are created, the size of the stored results will continuously increase. Also, many of the stored intermediate state results might not be reused at all. On the other hand, if we do not store any of the intermediate state results, we might need to build and execute the same workflows again and again. This might have a significant negative impact on efficiency when the processing modules in the workflows are time-consuming. In this situation, our proposed technique, RISP, can be useful. It suggests intermediate state results for storing by analyzing their reuse possibility through mining association rules. The following example explains our idea.

Let us assume that a SWfMS without any mechanism for storing intermediate state results has been used three times for executing three workflows, as shown in Fig. 4.1. A user is now going to execute the fourth workflow. In such a situation, if we integrate RISP with this SWfMS, then it will suggest storing the result that will be obtained from module M2 of the fourth workflow. The reason behind making this suggestion is that the dataset D1 that is going to be processed in the fourth workflow was also processed in the first and third workflows, and the modules M1 and M2 were executed serially in both of these workflows.

Figure 4.1: Automatically suggesting intermediate state results for storing from a workflow under progress.

Thus, there is a high possibility that when a user attempts to create a workflow in the future using dataset D1, the user will first apply the two modules M1 and M2 serially on D1. Moreover, when executing the fourth workflow, if the result from M2 is stored in the system, then in the future when a user attempts to create a workflow using dataset D1, RISP will notify the user about the presence of the result that was stored from the fourth workflow. Section 4.3 describes our suggestion technique. We implemented our proposed technique, RISP, as a prototype tool and applied it to hundreds of pipelines created and used by the researchers and users of two scientific workflow management systems. We have the following findings: (1) By analyzing association rules from the previously executed pipelines, our proposed technique can automatically suggest which intermediate state results from a pipeline under progress should be stored for future reuse. (2) The intermediate state results that are stored according to the suggestions from our proposed technique have a high frequency of being reused. The 508 pipelines that we investigated had 7165 possible intermediate state results. However, our technique suggests storing only 49 of these. Each of these 49 results can be reused 5 times on average. The stored results can be reused for creating and executing around 51% of the entire set of pipelines. (3) By storing intermediate state results according to the suggestions from our proposed recommendation technique, we can have a considerable gain (around 74%) in execution time while executing pipelines. Our findings indicate that our proposed technique (RISP) can significantly improve the performance of Big Data systems through appropriate reuse of the intermediate state results of workflows. The rest of the chapter is organized as follows. Section 4.2 defines and describes association rules, Section 4.3 presents our proposed technique for suggesting intermediate state results to store for reuse, Section 4.4 describes our experiment setup, Section 4.5 compares our proposed technique with three other techniques for suggesting intermediate state results to store, Section 4.6 describes possible threats to validity, Section 4.7

discusses the related work, and finally, Section 4.8 concludes the chapter by mentioning future work.

4.2 Background

Association rules have been used in software engineering research and practice for performing impact analysis tasks. We define an association rule in the following way. Association Rule. An association rule [3] is an expression of the form X => Y, where X is the antecedent and Y is the consequent. Each of X and Y is a set of one or more program entities. The meaning of such a rule is that if X gets changed in a particular commit operation, Y also has a tendency to be changed in that commit. Support and Confidence. As defined by Zimmermann et al. [114], support is the number of commits in which an entity or a group of entities changed together. The support of an association rule is determined in the following way.

support(X => Y) = support(X, Y)    (4.1)

Here, (X, Y) is the union of X and Y, and so support(X => Y) = support(Y => X). The confidence of an association rule X => Y determines the probability that Y will change in a commit operation provided that X changed in that commit operation. We determine the confidence of X => Y in the following way.

confidence(X => Y) = support(X, Y) / support(X)    (4.2)

In our research, we derive association rules between datasets and modules from pipelines and investigate those for providing suggestions regarding which intermediate state result from a pipeline under progress should be stored.

4.3 Recommending intermediate stage results to store for reusing

Let us assume that a user is going to create the n-th pipeline in a scientific workflow management system (SWfMS) as shown in Fig. 4.2. Let us further assume that during the creation of each of the previous n − 1 pipelines, our recommendation system (RISP) recommended particular intermediate state results to store. Thus, we already have some stored results. The top left rectangle in Fig. 4.2 denotes these already stored results. When the user attempts to create the n-th pipeline, our recommendation system observes which dataset she is going to use and determines whether there are any stored intermediate state results using this dataset. These results are shown to the user so that she can decide whether she wants to reuse any of these already stored results. Let us assume that the user completes her pipeline with or without using the already stored intermediate state results. Just after she completes her pipeline (i.e., the n-th pipeline), our recommendation system determines all the distinct association rules from all the n pipelines (i.e., n − 1 previous pipelines and the n-th pipeline that the user is going to execute),

Figure 4.2: Our proposed technique for suggesting intermediate state results to store for future reuse.

calculates their supports and confidences, and analyzes these support and confidence values to recommend which of the intermediate state results from the newly created pipeline (i.e., the n-th pipeline) should be stored. On the basis of this recommendation, one intermediate state result from the n-th pipeline might be stored if it was not stored previously. In this way, when a user attempts to create a new pipeline in the SWfMS, our recommendation technique performs the following two tasks:

• Recommending a number of already stored intermediate stage results to the user so that she can reuse them.

• Recommending an intermediate stage result for storing from the pipeline she created by analyzing association rules from the pipelines in the history.

4.3.1 Determining association rules from pipelines

We derive association rules among datasets and modules by analyzing existing pipelines in the following way. Let us consider the first pipeline in Fig. 4.1. It consists of four modules (M1, M2, M3, and M4) that sequentially work on the dataset D1. The results generated from each of these modules except M4 are intermediate state results. The result that we obtain from M4 is the final result of the pipeline. It is possible to store the results from each of these four modules. If we store the result of a particular module, it may remove the need to execute that module and any preceding modules whenever a user needs to execute a similar pipeline on the same dataset in the future. For example, if we store the result obtained from module M3, then for executing the possible pipeline (D1 → M1 → M2 → M3 → M7) in the future, a user does not need to execute the first three modules. He can just reuse the previously stored

result from module M3 and can only execute module M7 on the stored result. We determine association rules from a pipeline on the basis of how many results can be stored from it. For example, from the first pipeline in Fig. 4.1, we determine the following four association rules: (1) D1=>M1, (2) D1=>[M1, M2], (3) D1=>[M1, M2, M3], and (4) D1=>[M1, M2, M3, M4]. For the first rule, D1 is the antecedent and M1 is the consequent. For the second rule, the consequent is the module sequence (M1, M2). An association rule, for example the second one, means that if a user attempts to make a pipeline using the dataset D1 in the future, then there is a possibility that he will apply the modules M1 and M2 sequentially on D1 as the first two modules. From all four pipelines in Fig. 4.1, we get ten distinct association rules: (1) D1=>M1, (2) D1=>[M1, M2], (3) D1=>[M1, M2, M3], (4) D1=>[M1, M2, M3, M4], (5) D2=>M2, (6) D2=>[M2, M5], (7) D2=>[M2, M5, M8], (8) D1=>[M1, M2, M3, M6], (9) D1=>[M1, M2, M7], and (10) D1=>[M1, M2, M7, M8].
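A small sketch of this rule derivation follows; a pipeline is represented simply as an input dataset name plus its ordered module names, which is an assumption about representation rather than RISP's actual data structures.

def rules_from_pipeline(dataset, modules):
    """Derive the dataset => module-prefix association rules of one pipeline,
    one rule per storable (intermediate or final) result, as described above."""
    return [(dataset, tuple(modules[:k])) for k in range(1, len(modules) + 1)]

# First pipeline of Fig. 4.1: D1 -> M1 -> M2 -> M3 -> M4
print(rules_from_pipeline("D1", ["M1", "M2", "M3", "M4"]))
# [('D1', ('M1',)), ('D1', ('M1', 'M2')),
#  ('D1', ('M1', 'M2', 'M3')), ('D1', ('M1', 'M2', 'M3', 'M4'))]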

4.3.2 Determining the supports and confidences of the association rules obtained from the pipelines

After determining all the distinct association rules from all the pipelines, we determine the support and confidence of each of the rules. For example, we will now determine the support and confidence of the association rule D1=>M1. Support of an association rule. The support of an association rule is the number of times it can be generated from the pipelines. The support of the association rule D1=>M1 is 3 because, from all four pipelines in Fig. 4.1, we can generate this association rule three times (i.e., from the first, third, and fourth pipelines). In the same way, the supports of the association rules D1=>[M1, M2] and D1=>[M1, M2, M3] are 3 and 2, respectively. We express the support of an association rule formally in the following way.

support(D1 => M1) = 3 (4.3)

Confidence of an association rule. We determine the confidence of the association rule D1=>M1 in the following way from its support value:

confidence(D1 => M1) = support(D1 => M1) / support(D1)    (4.4)

Here, support(D1) is the number of times D1 was used in the pipelines. Thus, support(D1) = 3, because D1 was used in three pipelines according to Fig. 4.1. Finally, confidence(D1=>M1) = 3/3 = 1. The highest confidence of an association rule is 1. Such a confidence for the association rule D1=>M1 implies that if someone attempts to make a pipeline in the future using the dataset D1, then there is a high probability that he will choose M1 as the first module in the pipeline. The confidence of the association rule D1=>[M1, M2, M3] is 0.66 (confidence(D1=>[M1, M2, M3]) = support(D1=>[M1, M2, M3])/support(D1) = 2/3). Thus, if a user attempts to build a pipeline in the future using the dataset D1, then there is a lower probability (0.66)

that he will use M1, M2, and M3 respectively as the first three modules in his pipeline.
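The support and confidence computations above can be sketched as follows. The pipeline list is reconstructed from the ten rules enumerated in Section 4.3.1, and the helper names are hypothetical.

from collections import Counter

def support_and_confidence(pipelines):
    """Compute support and confidence for every dataset => module-prefix rule,
    following Equations 4.3 and 4.4."""
    rule_support = Counter()
    dataset_support = Counter()
    for dataset, modules in pipelines:
        dataset_support[dataset] += 1
        for k in range(1, len(modules) + 1):
            rule_support[(dataset, tuple(modules[:k]))] += 1
    confidence = {rule: float(count) / dataset_support[rule[0]]
                  for rule, count in rule_support.items()}
    return rule_support, confidence

# The four pipelines of Fig. 4.1 (as implied by the rules listed in Section 4.3.1).
pipelines = [("D1", ["M1", "M2", "M3", "M4"]),
             ("D2", ["M2", "M5", "M8"]),
             ("D1", ["M1", "M2", "M3", "M6"]),
             ("D1", ["M1", "M2", "M7", "M8"])]
support, confidence = support_and_confidence(pipelines)
print(support[("D1", ("M1",))])                 # 3
print(confidence[("D1", ("M1", "M2", "M3"))])   # 0.666...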

4.3.3 Recommending an intermediate state result for storing

In Fig. 4.1, we see that the fourth pipeline is the one that is going to be executed (under progress). For recommending which intermediate state result should be stored from this pipeline, we sort the association rules made from this pipeline on the basis of their confidence values and determine the highest confidence rules. For example, from the fourth pipeline in Fig. 4.1 we get four association rules: D1=>M1, D1=>[M1, M2], D1=>[M1, M2, M7], and D1=>[M1, M2, M7, M8] with confidences 1, 1, 0.33, and 0.33 respectively. The highest confidence rules are D1=>M1 and D1=>[M1, M2]. We select the longest of these highest confidence rules for making the suggestion because it helps us skip the highest number of modules. Thus, from the fourth pipeline, we recommend storing the result obtained from module M2.
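A minimal sketch of this selection step is given below; it takes the confidence values computed in the previous sketch (or, here, a hand-filled dictionary consistent with the example above) and picks the longest rule among those with the highest confidence.

def recommend_state_to_store(dataset, modules, confidence):
    """From the pipeline under progress, choose the rule with the highest
    confidence, breaking ties by rule length, and recommend storing the output
    of the last module in that rule."""
    rules = [(dataset, tuple(modules[:k])) for k in range(1, len(modules) + 1)]
    best = max(rules, key=lambda r: (confidence.get(r, 0.0), len(r[1])))
    return best[1][-1], best   # module whose output to store, and the rule

# Confidences of the fourth pipeline's rules, as computed in Section 4.3.2.
confidence = {("D1", ("M1",)): 1.0,
              ("D1", ("M1", "M2")): 1.0,
              ("D1", ("M1", "M2", "M7")): 0.33,
              ("D1", ("M1", "M2", "M7", "M8")): 0.33}
print(recommend_state_to_store("D1", ["M1", "M2", "M7", "M8"], confidence))
# ('M2', ('D1', ('M1', 'M2')))  -> store the result produced by module M2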

4.4 Experiment Setup

For conducting our experiment, we downloaded 508 workflows from the Galaxy public server [1] at usegalaxy.org. These workflows were created and executed on the SWfMS of Galaxy between February 2010 and August 2018. The downloaded workflows were text-based files in a specific JSON format. We automatically retrieved the module execution sequence and dataset details of each workflow from its corresponding file using our implementation. The workflows were executed mostly for investigations related to bioinformatics with biological datasets as the inputs. We applied our recommendation technique (RISP) to these workflows to investigate how efficiently it can recommend intermediate state results for storing so that the results can be reused in the future. Section 4.5.3 describes our investigation.

We also integrated our proposed technique with the SWfMS of the Plant Phenotyping and Imaging Research Centre (P2IRC) at the University of Saskatchewan. This SWfMS runs on an OpenStack-based Spark-Hadoop cluster with 6 nodes, 40 cores, and 40 GB of RAM. It is frequently used to build pipelines for analyzing and investigating large volumes of image data. The image data are stored in the Hadoop Distributed File System (HDFS) of the cluster. To investigate how much gain in execution time can be achieved by storing intermediate state results according to the suggestions provided by our proposed technique, we first integrated our technique (RISP) with the SWfMS and then executed 32 image processing pipelines. The input dataset of each of these 32 pipelines consists of 4000 to 10000 images. The pipelines were created for different purposes such as image segmentation, image registration, counting the number of flowers, and image clustering. The Canola datasets (4KCanola, 10KCanola) of the P2IRC project were used as inputs for the pipelines. The findings of our investigation regarding execution time gain are described in Section 4.5.4.

4.5 Evaluating our proposed technique

4.5.1 Candidate techniques for comparison

We compared our proposed technique (RISP) with three other candidate recommendation techniques by doing investigation on 508 pipelines that we downloaded from Galaxy public server [1]. All four recommendation techniques (including our proposed one) are listed below.

• PT (Proposed Technique, RISP): RISP recommends storing the intermediate state results indicated by the association rules with the highest confidence.

• TSAR (Technique that recommends Storing All Results): This second technique recommends storing each of the intermediate state results of each of the pipelines.

• TSPAR (Technique that recommends Storing Previously Appeared Results): The third technique recommends storing those intermediate state results that were produced previously at least once.

• TSFR (Technique that recommends Storing the Final Result): This technique recommends storing the final result (output) from a pipeline.

The following paragraphs describe how we evaluate these techniques in order to compare them. Procedure of investigation using PT (Proposed Technique). For evaluating this technique, we analyze each of the pipelines in the history serially, beginning from the first one. While examining the n-th pipeline, we analyze our technique's recommendation capability in the following way. First, we apply our technique to determine which intermediate state results might have already been stored if our technique had been integrated with the SWfMS from the very beginning of the history. For this purpose, our system analyzes association rules from the previous n − 1 pipelines. If we see that any of the intermediate state results of the n-th pipeline have already been stored previously, then we can reuse those intermediate state results in this n-th pipeline, and we also know that our system previously took a correct decision in storing them. After completing the first step, we make association rules from this n-th pipeline and determine their supports and confidences together with the association rules from the previous n − 1 pipelines. We select the highest confidence rule(s) from the n-th pipeline and assume that we have stored the intermediate state result denoted by the longest of these highest confidence rules. In this way, we examine all the pipelines in the history and determine the number of cases where we could reuse previously stored intermediate state results if we were using our proposed technique. Procedure of investigation using TSAR (Technique that recommends Storing All Results). With this second technique, we analyze the pipelines serially from the very beginning. When analyzing the n-th pipeline, we first check the database to see if any of the stored intermediate state results

39 can be useful for skipping some processing modules in the pipeline. If several results are available, then the result that helps us skip the highest number of processing modules is used. After executing the pipeline, we store each of its intermediate state results in the database. If a particular result is already in the database, we do not need to store it again. Procedure of investigation using TSPAR (Technique that recommends Storing Previously Appeared Results). The third candidate technique is a variant of our proposed technique. While our proposed technique depends on the confidence values of the association rules for deciding which of the intermediate state results should be stored from the pipeline under progress, the third technique solely depends on the support values of the association rules. When analyzing the n-th pipeline, this technique first checks which of the already stored intermediate state results can be the most appropriate one for reusing for this pipeline. Then, it determines the association rules from the pipeline. It identifies which of the rules have a support of at least one in the previous usage history (i.e., first n − 1 pipelines). The intermediate state result indicated by the longest one of these association rules is considered for storing. Procedure of investigation using TSFR (Technique that recommends Storing the Final Re- sult). The fourth technique stores the final outcome of each of the pipelines. Thus, if the same pipeline is attempted to be re-executed in the future, then we can just reuse the result from the first execution. We do not need to again execute any module in the pipeline. We consider such a technique for comparison because we wanted to understand how often the same pipelines get executed.
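The sketch below replays the PT evaluation procedure described above in simplified form. It assumes a pipeline is represented as a (dataset, module-sequence) pair and an intermediate state result is keyed by a (dataset, module-prefix) pair; the real RISP implementation also tracks tool configurations and storage details that are omitted here.

```python
from collections import defaultdict

def prefixes(dataset, modules):
    """All (dataset, module-prefix) keys that identify intermediate results."""
    return [(dataset, tuple(modules[:i])) for i in range(1, len(modules) + 1)]

def replay_pt(pipelines):
    """pipelines: list of (dataset, [module1, module2, ...]) in history order."""
    stored = set()                    # intermediate results kept so far
    prefix_freq = defaultdict(int)    # frq(dataset, prefix) over seen pipelines
    dataset_freq = defaultdict(int)   # frq(dataset) over seen pipelines
    reusing_pipelines = 0

    for dataset, modules in pipelines:
        keys = prefixes(dataset, modules)

        # Step 1: could any already-stored result be reused for this pipeline?
        if any(k in stored for k in keys):
            reusing_pipelines += 1

        # Step 2: update the rule statistics with the current pipeline.
        dataset_freq[dataset] += 1
        for k in keys:
            prefix_freq[k] += 1

        # Step 3: store the result denoted by the longest highest-confidence rule.
        confidences = [(prefix_freq[k] / dataset_freq[dataset], len(k[1]), k)
                       for k in keys]
        best_conf = max(c for c, _, _ in confidences)
        _, _, best_key = max(t for t in confidences if t[0] == best_conf)
        stored.add(best_key)

    return reusing_pipelines, stored
```

Applied to a pipeline history, the returned counts are exactly what the LR measure defined in the next subsection needs.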

4.5.2 Investigated measures

We calculate four measures for our investigation for each of the candidate systems. The measures and their calculation mechanisms have been discussed below.

• LR (Likeliness of Reusing from previously stored results). This measure determines how often we can reuse intermediate state results that can be stored according to the suggestions from a candidate system.

• PSRR (Percentage of Stored Results that were Reused). The second measure determines what percentage of the intermediate state results that can be stored according to the suggestions from a candidate recommendation technique can be reused in future.

• FRSR (Frequency of Reusing Stored Results). This measure determines how many times a stored intermediate state result was reused on average.

• PISRS (Percentage of Intermediate State Results that were Stored). This measure calculates what percentage of the intermediate state results were stored for reusing according to the suggestions from a candidate recommendation technique.

Calculation mechanism for LR. By sequentially analyzing all the pipelines from the history, we determine two quantities: (1) the total number of pipelines that we have analyzed, and (2) the number of pipelines for which we could reuse previously stored intermediate state results. From these quantities we calculate LR using the following equation.

\[ \mathrm{LR} = \frac{\text{Number of pipelines for which we could reuse previously stored results}}{\text{Total number of pipelines}} \times 100 \tag{4.5} \]

Calculation mechanism for PSRR. If only a very small percentage of the previously stored results gets reused (i.e., most of the stored results remain unused) while creating pipelines, then the corresponding recommendation mechanism should not be considered an efficient one. Thus, when comparing the efficiencies of the candidate recommendation mechanisms, we should also focus on the second measure (PSRR). It determines what percentage of the stored results gets reused while creating pipelines. By examining all the pipelines in the history we determine two quantities: (1) the total number of intermediate state results that were stored by a candidate mechanism and (2) the number of intermediate state results that could be reused. We then calculate PSRR by the following equation.

\[ \mathrm{PSRR} = \frac{\text{No. of reused results}}{\text{No. of stored results}} \times 100 \tag{4.6} \]

Calculation mechanism for FRSR. The measure FRSR (Frequency of Reusing Stored Results) focuses on how many times a previously stored result was used while creating pipelines. If the intermediate state results stored according to the recommendation from a candidate technique were rarely used while creating pipelines, then the technique should not be regarded as an efficient one. For calculating FRSR, we determine two quantities by examining the entire set of pipelines: (1) the total number of intermediate state results that were stored from the recommendations of a candidate technique and (2) the total number of times the stored results were used for making pipelines. We then calculate this measure according to the following equation.

\[ \mathrm{FRSR} = \frac{\text{No. of times stored results were reused}}{\text{No. of stored results}} \tag{4.7} \]

Calculation mechanism for PISRS. The measure PISRS (Percentage of Intermediate State Results that were Stored) calculates what percentage of the intermediate state results produced during the execution of the entire set of pipelines were stored according to the recommendation of a candidate technique. A candidate technique that stores a comparatively smaller number of results than other techniques but ensures higher reusability should be considered an efficient one. For measuring PISRS, we examine the whole set of pipelines and determine the following two quantities for each recommendation technique: (1) the total number of possible intermediate states (including the final states), and (2) the total number of intermediate state results that were stored. We then calculate PISRS using the following equation.

\[ \mathrm{PISRS} = \frac{\text{No. of stored results}}{\text{No. of intermediate states}} \times 100 \tag{4.8} \]
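The helper below simply turns the raw counts collected while replaying the pipeline history into the four measures of Eqs. 4.5–4.8; it is an illustrative convenience, and the argument names are our own.

```python
def evaluation_measures(total_pipelines, pipelines_reusing, stored_results,
                        reused_results, reuse_events, total_intermediate_states):
    """Compute LR, PSRR, FRSR, and PISRS from the raw counts (Eqs. 4.5-4.8)."""
    return {
        "LR":    100.0 * pipelines_reusing / total_pipelines,          # Eq. 4.5
        "PSRR":  100.0 * reused_results / stored_results,              # Eq. 4.6
        "FRSR":  reuse_events / stored_results,                        # Eq. 4.7
        "PISRS": 100.0 * stored_results / total_intermediate_states,   # Eq. 4.8
    }

# With the PT column of Table 4.1 (508 pipelines, 264 of them reusing stored
# results, 49 stored results, 7165 possible intermediate states), LR works out
# to 264/508*100 ~= 51.97% and PISRS to 49/7165*100 ~= 0.68%, as reported in
# Section 4.5.3.
```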


Figure 4.3: Comparing the candidate techniques according to the percentage of pipelines that could be made by reusing previously stored results.

4.5.3 Comparing the candidate recommendation techniques on the basis of the measures

The following paragraphs compare the four candidate techniques (PT, TSAR, TSPAR, and TSFR) on the basis of the four measures (LR, PSRR, FRSR, and PISRS).

Comparison regarding LR (Likeliness of Reusing from previously stored results). For each candidate technique, Fig. 4.3 shows the percentage of pipelines that could be built by reusing previously stored intermediate state results. We can see that for the second candidate technique (TSAR), the percentage of pipelines (61.81%) that could be built using previously stored results is the highest among the four techniques. However, to achieve this percentage, we need to store all 7165 intermediate state results from the 508 workflows (as indicated in Table 4.1), and a SWfMS can face a huge storage overhead for storing such a large amount of results. For our proposed technique (PT), the likeliness of reusing stored results is around 51.97%. Although this percentage is smaller compared to TSAR, we can achieve it by storing only 49 intermediate state results. We also see that the percentage for our proposed technique is higher than the percentages for the third and fourth candidate techniques. From Table 4.1 we see that the third and fourth techniques suggest storing 159 and 457 intermediate state results respectively. Thus, even by storing a considerably smaller number of intermediate state results, our technique ensures a higher likeliness of reusing.

Comparison regarding PSRR (Percentage of Stored Results that were Reused) Fig. 4.4 makes a comparison among the candidate techniques by considering the PSRR measure. We see that the PSRR value regarding our proposed technique (PT) is the highest among all the techniques. The second candidate technique TSAR (that saves all the results) exhibits the lowest value (2.19%). Thus, only 2.19% of the intermediate state results stored according to the suggestion from TSAR can be reused. The remaining 97.81% of the stored results (i.e., around 7008 of the 7165 stored results) stay unused. Finally, our proposed technique ensures the highest reuse of stored results.

Figure 4.4: Comparing the candidate techniques on the basis of the percentage of saved data that could be reused from previously stored results.

Figure 4.5: Comparing the candidate techniques on the basis of the frequency of reusing previously stored intermediate state results.

Comparison regarding FRSR (Frequency of Reusing Stored Results). Fig. 4.5 illustrates how frequently the intermediate state results stored according to the suggestion from a candidate technique get reused while making pipelines. From the figure we again see that our proposed technique (PT) exhibits the highest reuse frequency (5.39) among all four techniques. Thus, the comparison in Fig. 4.5 also establishes our proposed technique as the best one.

Comparison regarding PISRS (Percentage of Intermediate State Results that were Stored). Fig. 4.6 compares the candidate techniques on the basis of the PISRS measure. According to our definition of PISRS, the technique with the lowest PISRS value should be regarded as the most efficient one. Fig. 4.6 shows that our proposed technique (PT) suggests storing the smallest percentage (0.68%) of all 7165 intermediate state results. According to Table 4.1, this percentage (0.68%) corresponds to storing only 49 results. The percentage for candidate technique TSAR is 100% because it recommends storing all 7165 results. The PISRS value for each of the other two techniques is also higher compared to our proposed technique. Thus, even according to this last measure (PISRS), our proposed technique performs the best.


Figure 4.6: Comparing the candidate techniques on the basis of the percentage of intermediate state results that were saved from all intermediate states

Figure 4.7: Calculation of execution time gain

4.5.4 Considering module execution time for evaluating our proposed recommendation technique

Execution times of the processing modules in a pipeline should be considered important when making decisions about which intermediate state results we should store from a pipeline. We consider module execution time to evaluate our proposed technique in the following way. Let us consider the pipeline in Fig. 4.7, where a recommendation technique recommends storing the result obtained from module M2. Let us assume that T1 is the time required to execute M1 and M2 and to store the result from M2 in HDFS (Hadoop Distributed File System), and T2 is the time to retrieve the result of M2 from HDFS. Storing the result from M2 is beneficial only if T1 is greater than T2. Eq. 4.9 calculates the execution time gain in this case.

\[ \text{Execution Time Gain} = T_1 - T_2 \tag{4.9} \]
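A minimal sketch of the decision rule behind Eq. 4.9 is shown below; the timing values in the usage comment are hypothetical and only illustrate that reuse pays off when producing and storing a result once costs more than retrieving it later.

```python
def execution_time_gain(t_produce_and_store, t_retrieve):
    """Eq. 4.9: gain of reusing a stored intermediate result (in seconds)."""
    return t_produce_and_store - t_retrieve

def worth_storing(t_produce_and_store, t_retrieve):
    """Storing is beneficial only when T1 (produce + store) exceeds T2 (retrieve)."""
    return execution_time_gain(t_produce_and_store, t_retrieve) > 0

# Hypothetical example: running M1 and M2 and writing M2's output to HDFS takes
# 540 s (T1); reading that output back takes 35 s (T2), so reuse saves 505 s.
print(execution_time_gain(540, 35), worth_storing(540, 35))   # -> 505 True
```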

We consider this timing factor for determining whether storing an intermediate state result from a pipeline according to the suggestion from our technique is beneficial.

Our investigation regarding module execution time. For conducting this investigation, we executed 32 pipelines in the scientific workflow management system (SWfMS) of a plant phenotyping based image research center (P2IRC). This SWfMS runs in a parallel and distributed environment empowered by a Spark-Hadoop cluster consisting of 6 nodes, 40 cores, and 40 GB of RAM. While executing the 32 pipelines, we recorded three things for each pipeline: (1) the time needed to execute each of the modules in the pipeline, (2) the time needed to store the output of a particular module in the HDFS, and (3) the time needed to retrieve the previously stored output of a particular module from the HDFS. We apply our recommendation mechanism on the 32 pipelines that we executed on the SWfMS and determine which modules could be skipped from which pipelines if we stored the intermediate state results according to the suggestions from our recommendation technique. We finally determine the possible gain in execution time if we could reuse previously stored results.

Figure 4.8: Execution time gain for different pipelines by reusing previously stored intermediate state results according to our proposed technique.

Fig. 4.8 shows the execution time gain (in seconds) calculated using Eq. 4.9 for each of the 32 pipelines. For the first few pipelines, we did not have a gain in execution time because there were no stored results to reuse while creating these pipelines. However, for the remaining ones, we often obtained a considerable gain. After executing all 32 pipelines by reusing results stored according to our proposed technique's suggestions, we can have a total gain of 17720 seconds (4 hours 55 minutes and 20 seconds). We also calculated the time for executing these 32 pipelines without reusing stored results and found that 23865 seconds were required for executing all pipelines. However, if we reuse the results stored according to the suggestions from our proposed technique, we can execute all the pipelines in only 6145 seconds (i.e., we can save 17720 seconds). In other words, we can save around 74% of the total execution time by reusing results. Thus, our proposed technique can help us achieve a considerable gain in execution time.

4.6 Threats to Validity

We analyzed 540 pipelines (508 pipelines from the Galaxy public server [1] and 32 pipelines from the SWfMS of P2IRC) in our experiment to assess our proposed technique's efficiency in making suggestions. While a higher number of pipelines could make our findings more generalizable, we see that our technique exhibits efficient performance on pipelines from two different workflow management systems. Thus, we believe that our findings cannot be attributed to chance. Our proposed technique can be considered an important contribution towards managing pipelines.

4.7 Related Work

A great many studies [62] [111] [105] [25] [26] [42] [39] [98] have been conducted on analyzing, reproducing, linking, visualizing, and managing workflows in a scientific workflow management system. We discuss these studies in the following paragraphs.

Woodman et al. [100] proposed an algorithm for determining which subset of the intermediate state results from a pipeline can be stored with the lowest cost. While their study solely focuses on storage cost, our study is fundamentally different because we focus on the reusability of stored results. We apply an association rule mining technique to identify which intermediate state results should be stored so that they can be reused for making pipelines in the future.

Yuan et al. [106] proposed an algorithm for determining which set of intermediate state results can be stored with the minimum cost. Our study is different because we focus on the maximum reuse of the stored results. We propose a technique based on association rule mining for determining which intermediate state results from workflows should be stored for future reuse.

Koop et al. [48] proposed a technique for automatically completing a pipeline under progress by analyzing pipeline execution history. While their technique focuses on identifying which processing modules could be added after an anchor module in a partially completed pipeline, our study has a completely different focus. We propose a technique for identifying which of the intermediate state results from a pipeline should be stored for reuse.

Many studies [18] [109] [87] have investigated the possibility of discovering and reusing services and workflows in a SWfMS. Our study is fundamentally different from those studies because we do not aim to help users in building pipelines. We identify which of the intermediate state results from a pipeline under progress should be stored for reuse.

From the above discussion, it is clear that none of the existing studies investigated recommending intermediate state results from a pipeline for storing so that the stored results can be reused in the future. To the best of our knowledge, our study is the first attempt towards such a recommendation. Our in-depth investigation on a good number of workflows indicates that our proposed technique can efficiently suggest which intermediate state result from a workflow should be stored for reuse.

4.8 Conclusion

In this chapter, we propose and investigate a novel technique (RISP) for suggesting which intermediate state result from a pipeline under progress should be considered for storing so that the result can be reused in the future. For making suggestions, our technique mines association rules from the already executed pipelines in the history and analyzes their support and confidence measures. We analyze the efficiency of our technique by applying it on 508 workflows downloaded from the Galaxy public server. According to our investigation, our proposed technique can efficiently suggest which intermediate state result from a pipeline under progress should be stored for reuse. The 508 workflows that we investigated had 7165 intermediate state results in total. However, our recommendation technique suggests storing only 49 of these results. We find that these 49 results can be reused for building around 51% of the entire set of pipelines. Moreover, the intermediate state results stored according to the suggestions from our technique exhibit a high reuse frequency.

We also apply our technique on the scientific workflow management system (SWfMS) of a plant phenotyping based image research center (P2IRC) to investigate how much gain in execution time can be achieved by reusing the stored results. From our investigation on 32 workflows that we executed on the SWfMS, we find that by reusing previously stored results we can save around 74% of the execution time that would be required if we did not reuse previously stored results.

The findings from our research indicate that our proposed recommendation technique (RISP) has the potential to significantly improve the performance of Big-Data systems by suggesting appropriate reuse of the intermediate state results from scientific workflows. Although the automation of data storing and recommendation can be achieved from the execution sequences of workflows alone, consideration of the states of the tools in RISP can give users proper recommendations with more user control in a SWfMS. In the next chapter, we discuss how RISP works with the states of tools in a SWfMS.

Table 4.1: Workflow information for the comparison

Description of calculated measures                                          PT     TSAR   TSPAR   TSFR
Number of investigated pipelines considering each candidate technique      508      508     508    508
Number of pipelines for which we could reuse previously stored data        264      314     261     70
Number of intermediate state results that were saved according to
the suggestion from a candidate technique                                    49     7165     159    457

PT = Proposed Technique TSAR = Technique that Recommends Storing All intermediate Results

TSPAR = Technique that Recommends Storing Previously Appeared Results TSFR = Technique that Recommends Storing the Final Result

5 Optimal Storing Modes of Workflow in A Scientific Workflow Management System Considering Tool State

In a Scientific Workflow Management System (SWfMS), workflows are frequently built for processing large datasets using a sequence of processes. The processing modules through which a job passes from initiation to completion need to be configured and tuned with parameters during workflow composition and execution. Configuring and tuning parameters in a workflow can have a significant impact on certain types of models and can change the states of processes (i.e., modules). Different outputs might be produced by the same sequence of modules with identical input datasets in different workflows for different sets of parameter configurations. Composing a lengthy workflow manually for a desired investigation, where constituent processing modules might often be computationally expensive for their current states and datasets, could be burdensome for SWfMS users. In this circumstance, proper management of data to reuse intermediate stage results generated by previously executed workflows would be beneficial for assisting both the composition and computation of a current workflow in a SWfMS. In our previous work, we proposed a novel technique known as RISP (Recommending Intermediate States from Pipelines), based on association rule mining, for suggesting which intermediate stage results from a workflow that a user is going to execute should be stored for future reuse. In this chapter, an extended version of the technique is presented that includes the module states of workflows in both the recommending and storing phases. Using the extended version, around 40% of the entire set of workflows can be created by reusing results suggested by our technique.

5.1 Motivation

As a framework, a Scientific Workflow Management System (SWfMS) offers the means to completely prepare, manage, monitor, and execute a scientific experiment in terms of sequentially organized and inter-dependent modules. Our research in this chapter deals with the sequential modules (i.e., workflows) in Scientific Workflow Management Systems. Optimizations in such SWfMSs are essential for investigating large datasets to improve system efficiency and increase productivity. Big data analytics with a large volume of heterogeneous data and tools in a SWfMS must be combined with a cost reduction technique, as a single module can generate multiple outputs and a workflow can involve a large amount of input and intermediate data. Workflows are frequently built and used to process large datasets, which can gradually increase the data volume in a system. Due to this Big Data velocity characteristic, the incremental data from workflows need to be handled with an automatic recommendation technique for proper resource utilization.

In a SWfMS, users can manually build their workflows by selecting and sequentially adding processing modules from a finite set of modules available in the SWfMS for performing their desired investigations. Each workflow or pipeline works on particular input datasets provided by a user. The processing modules in a workflow are ordered in such a way that the output produced by a particular intermediate module can be used as input by the next module in the workflow. A particular module in a workflow could work in a different way than the same module in other workflows due to parameter configuration and tuning. Thus, a specific module in different workflows can generate different outputs from identical inputs. The output that we obtain from the last module is considered the final output of a workflow. Users often create workflows for processing large datasets. The intermediate state results produced by the modules in such workflows can also be very large. Moreover, each of the processing modules might require a considerable amount of time for data processing. In such a situation, it would be beneficial if a user could reuse results produced by previously executed workflows when the user plans to execute a workflow on the same dataset.

In order to provide automatic support for reusing results from previously executed workflows, we need a mechanism for determining which of the intermediate state results obtained from a particular workflow have a high possibility of being reused in the future. In our research presented in the previous chapter (Chapter 4), we proposed such a mechanism (i.e., technique), which we call RISP (Recommending Intermediate States from Pipelines). RISP provides suggestions for storing intermediate state results by analyzing association rules between data and processing modules from the pipelines in the history. To the best of our knowledge, RISP is the first technique to investigate providing suggestions for storing intermediate state results from pipelines.

Previous studies considered storing all intermediate state results from pipelines. While storing all intermediate state results is good for provenance, it is not suitable from the perspective of reusability and resource utilization. To store all intermediate state results from all pipelines, we need a significant amount of storage space. As new pipelines are created, the size of the stored results will continuously increase. Also, it might be seen that many of the stored intermediate state results are not reused at all. On the other hand, if we do not store any of the intermediate state results, we might need to build and execute the same workflows again and again. This might have a significant negative impact on efficiency when the processing modules in the workflows are time-consuming. In this situation, our proposed technique, RISP, can be useful. However, for a proper recommendation, the parameters and their values of each module in the workflow currently being built need to be matched against the parameters and values of the modules of previous workflows. RISP suggests intermediate state results for storing by analyzing their reuse possibility through mining association rules, and this study presents an extended version of RISP that considers the states of modules. The following example explains our idea.

Let us assume that a SWfMS without any mechanism for storing intermediate state results has been used three times for executing three workflows, as shown in Fig. 5.1. A user is now going to execute the fourth workflow.

Figure 5.1: Automatically suggesting intermediate state results for storing from a workflow under progress. Workflows in the history: Workflow 1: D1, M1[C1], M2[C2], M3[C3], M4[C4]; Workflow 2: D2, M2[C2], M5[C5], M8[C8]; Workflow 3: D1, M1[C1], M2[C2], M3[C3], M6[C6]. Workflow under progress: Workflow 4: D1, M1[C1], M2[C2], M3[C3'], M8[C8]. The technique suggests storing the outcome of module M2.

In such a situation, if we integrate RISP with this SWfMS, it will suggest storing the result that will be obtained from the module M2 of the fourth workflow. The reason behind this suggestion is that the dataset D1 that is going to be processed in the fourth workflow was also processed in the first and third workflows, and the modules M1 and M2 were executed serially in both of those workflows. One might also think that RISP would suggest storing the outcome of M3, since the sequence M1, M2, M3 was also executed serially in the first and third workflows. However, the module M3 in the fourth workflow runs with a different parameter configuration set, C3', which differs from the configuration sets of the previously executed sequences. So, the extended version suggests storing only the outcome of M2, for which the common parameter configurations are satisfied. Thus, there is a high possibility that when a user attempts to create a workflow in the future using dataset D1, the user will first apply the two modules M1 and M2 serially on D1 with the same parameter configuration sets (i.e., C1 and C2). Moreover, if the result from M2 is stored when executing the fourth workflow, then in the future, when a user attempts to create a workflow using dataset D1, the extended version of RISP will notify the user about the presence of the result(s) stored from the fourth workflow, along with parameter matching information.

We implement our extended version of RISP as a prototype tool and apply it on hundreds of pipelines created and used by researchers and users in two scientific workflow management systems. We have the following findings: (1) By analyzing association rules from the previously executed pipelines and considering module states, the extended version of RISP can automatically suggest which intermediate state results from a pipeline under progress should be stored for future reuse. (2) The intermediate state results that can be stored according to the suggestions from our technique have a high frequency of being reused. The 534 pipelines that we investigated had 8510 possible intermediate state results. However, our technique suggests storing only 61 of these. Each of these 61 results can be reused 3 times on average. The stored results can be reused for creating and executing around 40% of the entire set of pipelines. Our findings indicate that our proposed technique (RISP) can significantly improve the performance of Big-Data systems through appropriate reuse of the intermediate state results from workflows.

The rest of the chapter is organized as follows. Section 5.2 defines and describes association rules, Section 5.3 describes our experiment setup, Section 5.4 compares our proposed technique with three other techniques in suggesting intermediate state results to store, and finally Section 5.5 concludes the chapter.

5.2 Background

Association rules are conditional statements that help determine the probability of connections among items in a large set. Association rules have been used in many research areas for discovering correlations and can be defined in the following way. An association rule consists of two parts: an antecedent (i.e., if) and a consequent (i.e., then). The antecedent is an item of the previous history, and the consequent is an item that comes with the antecedent. The support of an association rule, X => Y, indicates how frequently the rule appears in the history:

\[ \mathrm{Support}(X \Rightarrow Y) = frq(X, Y) \]

Here, (X, Y) is the union of X and Y, and so support(X => Y) = support(Y => X). The confidence of an association rule, X => Y, determines the probability that Y will appear in a pipeline provided that X appears in that pipeline. The confidence of X => Y can be determined in the following way:

\[ \mathrm{Confidence}(X \Rightarrow Y) = \frac{frq(X, Y)}{frq(X)} \]

In our technique RISP, we derive association rules between datasets and modules from pipelines and investigate them for providing suggestions regarding which intermediate state result from a workflow under progress should be stored.
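To make the definitions concrete, the sketch below counts the support and confidence of rules of the form "dataset D ⇒ module prefix" over a small pipeline history. The (dataset, module-sequence) encoding is an assumption made for this illustration, and the example pipelines mirror the first three workflows of Fig. 5.1 (configurations omitted).

```python
from collections import Counter

def rule_statistics(pipelines):
    """pipelines: iterable of (dataset, [module1, module2, ...])."""
    frq_x = Counter()    # occurrences of the antecedent (the dataset)
    frq_xy = Counter()   # occurrences of dataset and module-prefix together
    for dataset, modules in pipelines:
        frq_x[dataset] += 1
        for i in range(1, len(modules) + 1):
            frq_xy[(dataset, tuple(modules[:i]))] += 1

    def support(dataset, prefix):
        return frq_xy[(dataset, tuple(prefix))]

    def confidence(dataset, prefix):
        return frq_xy[(dataset, tuple(prefix))] / frq_x[dataset]

    return support, confidence

history = [("D1", ["M1", "M2", "M3", "M4"]),
           ("D2", ["M2", "M5", "M8"]),
           ("D1", ["M1", "M2", "M3", "M6"])]
support, confidence = rule_statistics(history)
print(support("D1", ["M1", "M2"]))      # 2
print(confidence("D1", ["M1", "M2"]))   # 1.0
```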

5.3 Experiment Setup

For our experimental analysis, we considered 534 workflows from the Galaxy public server at usegalaxy.org; these workflows were JSON-formatted text-based files. We parsed the dataset details and module execution sequence, along with their tool state information, from each pipeline using our own implementation. We applied our recommendation technique (adaptive RISP) on these workflows to investigate how efficiently it can recommend intermediate state results for storing so that the results can be reused in the future. Section 5.4.3 describes the experimental results with a comparison.

5.4 Evaluating our proposed technique

5.4.1 Candidate techniques for comparison

For evaluation, we compared our proposed adaptive RISP technique with three other candidate recommendation techniques. All the recommendation techniques (including our proposed one) are listed below; these techniques were introduced in the previous chapter (Chapter 4) with detailed descriptions.

PT (Proposed technique). The proposed adaptive RISP technique gradually examines each of the pipelines and recommends storing the intermediate state results based on the highest confidence value calculated from the association rules. During the n-th pipeline execution, all possible sequences of the pipeline are matched with the previously stored sequences of the first n − 1 pipelines; after parameter matching, previously stored results can be reused for the n-th pipeline, which allows skipping some module executions. A simplified sketch of this state-aware matching is given after the technique descriptions below.

TSAR (Technique that recommends Storing All Results). This technique also analyzes all possible subsequences of a workflow serially and recommends storing all the intermediate state results of each of the subsequences. While investigating the n-th pipeline, the technique checks the previously stored data by matching subsequences for recommending intermediate data. If the subsequences match, including their configuration values, then the system can reuse the existing data; otherwise, the technique recommends storing the result of the sequence after execution. Although this technique helps skip the highest number of processing modules because it saves all possible intermediate state results, it requires huge storage space to store all of the intermediate data.

TSPAR (Technique that recommends Storing Previously Appeared Results). The third technique recommends storing those intermediate state results of subsequences that were executed at least once in the history. That is, this technique decides which intermediate state results to store based on the support values of the association rules. It is a variant of our proposed method.

TSFR (Technique that recommends Storing the Final Result). This last technique recommends storing only the final outcome of a pipeline. Thus, if the same pipeline is attempted for execution in the future, this technique can help skip the whole execution and reuse the output. This technique is considered only for comparison because it helps us understand how often the same pipelines get executed.
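The sketch below illustrates the state-aware matching described for PT above: a stored intermediate result is considered reusable only when the dataset, the module prefix, and every module's parameter configuration all match. The data structures are assumptions made for this example, and the values mirror Workflow 4 of Fig. 5.1, where M3 runs with C3' instead of C3.

```python
def configs_match(stored_configs, current_configs):
    """Both arguments are lists of parameter dictionaries, one per module."""
    return (len(stored_configs) == len(current_configs)
            and all(s == c for s, c in zip(stored_configs, current_configs)))

def reusable_prefix(stored_entry, dataset, modules, configs):
    """stored_entry: (dataset, module_tuple, config_list) of a saved result."""
    s_dataset, s_modules, s_configs = stored_entry
    k = len(s_modules)
    return (s_dataset == dataset
            and tuple(modules[:k]) == s_modules
            and configs_match(s_configs, configs[:k]))

# The result stored after M2 of the earlier workflows (configurations C1, C2)
# is reusable for Workflow 4, even though M3 now runs with configuration C3'.
stored = ("D1", ("M1", "M2"), [{"config": "C1"}, {"config": "C2"}])
print(reusable_prefix(stored, "D1",
                      ["M1", "M2", "M3", "M8"],
                      [{"config": "C1"}, {"config": "C2"},
                       {"config": "C3'"}, {"config": "C8"}]))   # -> True
```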

5.4.2 Investigated measures

For comparing the techniques, we take into account four measures: Likeliness of Reusing from previously stored results (LR), Percentage of Stored Results that were Reused (PSRR), Frequency of Reusing Stored Results (FRSR), and Percentage of Intermediate State Results that were Stored (PISRS). The calculation mechanisms of these measures are discussed below.

LR Calculation. To compare performance, we measure how frequently a SWfMS can reuse intermediate state results that are stored according to the suggestions from a candidate system. By sequentially analyzing all the downloaded pipelines, we calculate the total number of pipelines that we have analyzed and the number of pipelines for which we could reuse previously stored intermediate state results. From these quantities, we determine LR using the following equation.

\[ \mathrm{LR} = \frac{\text{Number of pipelines for which we could reuse previously stored results}}{\text{Total number of pipelines}} \times 100 \tag{5.1} \]

PSRR Calculation. Here we calculate what percentage of the intermediate state results stored by a candidate recommendation technique a user can reuse in the future. This measurement is essential to consider because, if only a small percentage of the stored results gets reused during pipeline formation, the remaining portion of the data is unused while incurring a considerable storage cost. For calculating PSRR, we determine the total number of intermediate state results that were stored by a candidate mechanism and the number of intermediate state results that could be reused. We then calculate PSRR by the following equation.

\[ \mathrm{PSRR} = \frac{\text{No. of reused results}}{\text{No. of stored results}} \times 100 \tag{5.2} \]

FRSR Calculation. This measure focuses on how many times, on average, a saved intermediate state result was reused during workflow creation. If the intermediate state results stored according to the recommendation from a candidate technique are hardly ever used, then the method should not be considered a suitable one. For calculating FRSR, we determine the total number of intermediate state results that were stored and the total number of times the stored results were used for creating pipelines. We then calculate this measure according to the following equation.

\[ \mathrm{FRSR} = \frac{\text{No. of times stored results were reused}}{\text{No. of stored results}} \tag{5.3} \]

PISRS Calculation. This measure calculates the percentage of the intermediate state results that were stored for reuse while executing the pipelines according to the suggestions from a candidate recommendation technique. If the value of this measurement is comparatively low for a technique, then that technique ensures cost-effectiveness and is considered an efficient one. For measuring PISRS, we determine the total number of possible intermediate states (including the final states) and the total number of intermediate state results that were stored. We determine PISRS using the following equation.

\[ \mathrm{PISRS} = \frac{\text{No. of stored results}}{\text{No. of intermediate states}} \times 100 \tag{5.4} \]


Figure 5.2: Comparing the candidate techniques according to the percentage of pipelines that could be made by reusing previously stored results.

5.4.3 Comparing the candidate recommendation techniques on the basis of the measures

The following paragraphs compare the four candidate techniques (PT, TSAR, TSPAR, and TSFR) on the basis of the four measures (LR, PSRR, FRSR, and PISRS).

Comparison regarding LR. From Fig. 5.2, we can see that the percentage of pipelines reusing previously stored intermediate state results for our proposed technique (PT) is around 40%. The second candidate technique (TSAR) shows a higher percentage than PT, at 49%. However, according to Table 5.1, the PT method stores only 61 intermediate state results from 534 workflows, whereas the TSAR technique stores a very large number of intermediate state results (7598). For that reason, a SWfMS with TSAR can encounter a storage overhead problem. Besides, Fig. 5.2 shows that the data reuse percentage of our proposed technique is higher than those of the other two recommendation techniques, TSPAR and TSFR, which suggest storing 197 and 475 intermediate state results respectively (Table 5.1). Therefore, our technique ensures a higher likeliness of reusing while storing a considerably smaller number of intermediate state results.

Figure 5.3: Comparing the candidate techniques on the basis of the percentage of saved data that could be reused from previously stored results.

Comparison regarding PSRR. In Fig. 5.3, our proposed technique can reuse around 32% of the stored intermediate state results, which is the highest rate among the four techniques. In contrast, the technique that stores all intermediate results (TSAR) shows that only a small percentage (1.55%) of the stored data can be reused, so most of the stored results remain unused. The other two techniques, TSPAR (i.e., storing a result if it appeared at least once before) and TSFR (i.e., keeping only the final outcome), exhibit PSRR values of 28% and 4% respectively. Conclusively, according to the PSRR calculation, our proposed technique ensures the highest reuse of stored results among the four techniques.

Figure 5.4: Comparing the candidate techniques on the basis of the frequency of reusing previously stored intermediate state results.

Comparison regarding FRSR. Here we compare how frequently the intermediate state results stored according to the suggestions from a candidate technique get reused during the creation of workflows. In Fig. 5.4, we see that our proposed technique (PT) exhibits the highest reuse frequency (around 3) among all four techniques, whereas the other three techniques show reuse frequencies below two. Therefore, the FRSR metric also establishes the proposed method as an efficient one in terms of frequent reusability.

Figure 5.5: Comparing the candidate techniques on the basis of the percentage of intermediate state results that were saved from all intermediate states.

Comparison regarding PISRS. With the PISRS measure, we calculate the percentage of intermediate state results that were stored for the four techniques. According to the definition, the technique with the lowest PISRS value should be regarded as the most efficient one. In this experiment (Fig. 5.5), our proposed technique suggests storing only 61 results out of 8510 total intermediate state results, so the PISRS value of PT is only 0.71%. The second technique, TSAR, stores 7598 intermediate state results in total, so its PISRS value is 89%. Furthermore, the PISRS values of the other two techniques are higher than that of our proposed technique. Therefore, according to this last measure (PISRS), our proposed technique also performs the best among all four candidate techniques.

5.5 Conclusion

In a SWfMS, users usually create workflows manually by selecting and sequentially adding processing modules from a finite set of modules. This process of creating workflows is time-consuming, and computationally expensive modules require a considerable amount of time in a workflow execution for accomplishing a desired task. In our previous work, we proposed a recommendation technique, RISP, which suggests storing the intermediate state results in a SWfMS. In this work, by emphasizing parameter configurations and tuning of modules, we consider an extended version of the technique. The technique is designed to make the recommendation compatible with a SWfMS where a tool, as a module, is presented with its state instance. By analyzing 534 workflows from the Galaxy public server, we found that out of 8510 total intermediate state results, the recommendation technique suggests storing only 61. Besides, the proposed technique can be used to reuse data for around 40% of the entire set of workflows in a SWfMS with a high frequency of reusability. Therefore, the proposed extended version of RISP has significant potential to improve the efficiency of data management in a SWfMS even with the tool states of modules in workflows. To test the effectiveness and efficiency of RISP in a SWfMS, we also conducted several performance evaluations and user studies. The effectiveness and efficiency of our proposed technique RISP are discussed in the next chapter.

Table 5.1: Workflow information for the comparison

Description of calculated measures                                          PT     TSAR   TSPAR   TSFR
Number of investigated pipelines considering each candidate technique      534      534     534    534
Number of pipelines for which we could reuse previously stored data        200      247     209     69
Number of intermediate state results that were saved according to
the suggestion from a candidate technique                                    61     7598     197    475

PT = Proposed Technique TSAR = Technique that Recommends Storing All intermediate Results

TSPAR = Technique that Recommends Storing Previously Appeared Results TSFR = Technique that Recommends Storing the Final Result

6 Designing for Recommending Intermediate States in A Scientific Workflow Management System

To process a large amount of data sequentially and systematically, proper management of workflow components (i.e., modules, data, configurations, associations among ports and links) in a Scientific Workflow Management System (SWfMS) is inevitable. Managing data with provenance information in a SWfMS to support the reusability of workflows, modules, and data is not a simple task. Handling such components is even more burdensome for frequently assembled and executed complex workflows that investigate large datasets with different technologies (i.e., various learning algorithms or models). A great many studies propose various techniques and technologies for managing and recommending services in a SWfMS, but only a very few studies consider the management of data in a SWfMS for efficient storing and for facilitating workflow executions. Furthermore, no study inquires into the effectiveness and efficiency of such data management in a SWfMS from a user perspective. In this chapter, we present and evaluate a GUI version of such a novel approach to intermediate data management with some use cases. The technique, which we call RISP (Recommending Intermediate States from Pipelines), can facilitate the execution of workflows with stored processed data and can thus reduce the computational time of some modules in a SWfMS. For investigating the effectiveness of the technique in a SWfMS, its storing and recommending scenarios are analyzed from a user viewpoint. We integrated RISP with an existing workflow management system called SciWorCS. In SciWorCS, we also designed an interface that users use for selecting the recommended intermediate states. We investigated GUI-RISP's effectiveness from users' perspectives along with measuring its overhead in terms of storage and the efficiency of workflow execution.

6.1 Motivation

In Big-Data analytics, when a large volume of heterogeneous data needs to be processed with different technologies (i.e., various learning algorithms or models), proper data flow and module/service associations in a SWfMS are crucial for efficiency. Users often assemble workflows manually in SWfMSs, where processing modules are selected for the purpose of a particular task. Besides, users frequently run similar workflows by changing only a few modules to find suitable methodologies for investigating a particular input dataset [17]. In such frequently executed workflows, processing large datasets eventually turns some modules of a workflow in a SWfMS into computationally expensive ones. Investigating the same dataset with several workflows comprising computationally expensive modules ultimately requires a long execution time. In such a situation, experimenting with new methodologies by changing only a few modules of a workflow could be a tedious task. Since modules are reusable in different workflows, consideration of reusable outcomes of the modules in a SWfMS could similarly be a solution for increasing the performance of workflow execution.

However, improving the performance of a workflow by introducing reusability and storing all of the generated data in a SWfMS is not a feasible solution, as the generated or intermediate data are also large enough to increase the storing and storage costs dramatically. Moreover, storing all of the generated data from a workflow can hamper resource utilization in a SWfMS. In a workflow assembling process, where different port and parameter configurations for different types of data and data flows are involved, a proper decision-making scheme is necessary at runtime to decide what to store, to support reusability, to introduce efficiency, and to manage the storage cost of the heterogeneous data. Consequently, proper management of such data and metadata (i.e., association information of module to data, module to module, and module to configuration) in a workflow processing system is essential to facilitate the execution of workflows at low cost.

To introduce such data management, in our previous work we proposed a technique for automatically recommending the storing and retrieving of intermediate data of workflows in a SWfMS. The proposed technique, with its data management scheme, is intended to facilitate workflow assembling and execution, and this study is designed to evaluate the technique in a real-world SWfMS from users' perspectives. Furthermore, a GUI (Graphical User Interface) version of the technique is incorporated in the SciWorCS system, where users have more control and choice to select intermediate data. Multiple intermediate data can be found in a SWfMS for different states of a corresponding tool; for example, parameter configuration or tuning of a tool can generate different intermediate states in different workflows. If we consider the tool states in the association rules of RISP through parameter matching, we can give users more choices for selecting intermediate states. More details of this GUI version are discussed in Section 6.3.

Nowadays, SWfMSs are used to manage modules and data in a way that facilitates building workflows interactively and reduces configuration hassle. Modules are usually local scripts or web services, and most SWfMSs support interactive drag-and-drop facilities to add such modules while building workflows from a set of available modular tools. For a particular task, port linking, dataset importing, and parameter configuring are also possible interactively to assemble a sequential execution or build workflows as an acyclic graph in a SWfMS. Besides, a SWfMS also coordinates processing scripts, datasets, and resource allocations in the various steps of workflow execution to ensure the dependencies of the dataflow. For such systems, quantitative performance is hard to measure, and only organizational performance can be interpreted with use cases through user evaluation [75] [115]. Also, the effectiveness of a SWfMS, and of the integration of a new methodology in the SWfMS to increase efficiency, can be assessed by user evaluation. Thus, in this study, we evaluate our intermediate data recommendation technique in the SciWorCS system and assess its feasibility in various aspects of assembling and executing workflows.

The study is designed to demonstrate the integration synopsis of GUI-RISP in SciWorCS and to evaluate the SWfMS with the data recommendation technique from user usage perspectives by executing various types of workflows. We have the following findings:

1. By analyzing the workflow assembling and execution logs from our user study, we illustrate that the recommendation technique can help users execute pipelines efficiently (e.g., 56% fewer user requests and 25% less time in workflow execution).

2. GUI-RISP can help users more while developing long, complex workflows (e.g., 150% more use of available intermediate data than in short workflows).

3. The GUI version of RISP in SciWorCS gives users a more understandable way to design workflows with intermediate data, providing more information and control from the web interface.

Our findings indicate that our proposed technique can significantly improve the performance and efficiency of a SWfMS by reusing the intermediate states from previously executed workflows. The rest of the chapter is organized as follows. Section 6.2 discusses the related work, Section 6.3 presents the architecture of our proposed technique, Section 6.3.2 defines and describes some basic terms related to the study, Section 6.3.4 describes our experimental setup, Section 6.4 presents the user evaluations, Section 6.5 shows some future directions, Section 6.6 describes possible threats to validity, and finally, Section 6.7 concludes the chapter by mentioning our future direction.

6.2 Related Works

Most SWfMSs are implemented as process-aware information systems where data management is mostly neglected in terms of reuse for facilitating workflow execution. In this chapter, our study is mainly focused on the user evaluation of a data recommendation and management technique in a SWfMS. Existing studies on recommendation, management, and new technique evaluation for both data and processes in SWfMSs are presented below in two sections to compare with our proposed technique.

6.2.1 Recommendations in SWfMSs

A number of studies [100] [106] [48] [37] [49] have been conducted on recommending and managing modules and data in SWfMSs. We discuss these studies in the following paragraphs. Woodman et al. [100] proposed an algorithm for determining which subset of the intermediate state results from a pipeline could be stored at the lowest cost. Similarly, Yuan et al. [106] proposed an algorithm for determining which set of intermediate state results could be stored at a minimum cost in a SWfMS. Koop et al. [48] proposed a technique for automatically completing a pipeline under progress by analyzing pipeline execution history. Gil et al. [37] used metadata in their technique to select an appropriate model for a particular dataset and to set up modules' parameters dynamically. Likewise, Leake et al. [49] demonstrated the use of the semantic information of provenance for suggesting services. Besides, many studies [18] [109] [87] have investigated the possibility of discovering and reusing services and workflows in a SWfMS. While the above studies solely focus on the storage cost of data or on module recommendation, our study is fundamentally different because we focus on a reusability technique that recommends storing the outcomes of modules, and in this study we evaluate the effectiveness of the technique. From the above inspection, it is also clear that none of the existing studies investigated intermediate state results for reuse in a SWfMS. In this study, we present the detailed architecture and user evaluation of our recommendation technique for storing and reusing intermediate data in a SWfMS. Additionally, use cases from different areas of workflow building are presented to evaluate the proposed technique.

6.2.2 Case Studies, Technology Integrations and Evaluations

A great many studies have been done on the use-case analysis of system integration and architecture implementation in the domain of Scientific Workflow Management. Some of them are discussed here to understand the effectiveness and evaluation process of a SWfMS. Wu et al. [102] illustrated a distributed WfMS's implementation with web services and emphasized the quality of the mapping schemes of service-composing algorithms for performance enhancement. Zheng et al. [113] analyzed the effects of Docker containers on workflows to make them compatible with various infrastructures and explored the possibilities of virtual container management for low overhead in a SWfMS. Muniswamy-Reddy et al. [67] proposed a storage system with low transaction overhead where provenance traces are treated as metadata to provide functionality beyond that of a typical file system. Brown et al. [15] analyzed the LIGO WfMS by integrating techniques and technologies from various WfMSs; in their study of the LIGO system, gravitational wave data are used to explore future directions and the usefulness of techniques from different WfMSs in a single system. Wang et al. [97] demonstrated the architecture of distributed Kepler and explored its performance and future directions for data-intensive tasks. Bahsi et al. [6] analyzed different SWfMSs from the perspective of the conditional-workflow-building process and stated that conditions could be managed in various ways in different SWfMSs with respect to their core implementations. Building on the above studies on evaluating and verifying the effectiveness of SWfMSs, in this study we also want to investigate the use cases of our technique in a SWfMS. In addition, we want to observe the effectiveness of our technique in both the workflow building and execution phases.

6.3 RISP in a SWfMS

Before presenting the experimental studies of our user usage log interpretation, in this section we present a brief overview of the SciWorCS architecture with the integrated GUI version of RISP (our developed tool for assembling workflows with recommended reusable intermediate data). By integrating RISP in the SWfMS, scientific analysis can be done with a set of reusable modules and intermediate data. Intermediate data could be preserved in SciWorCS by different users from previously executed workflows. In SciWorCS, workflows are assembled on a web interface, where modules could be executed in both distributed and local environments depending on their implementation. Likewise, datasets could be stored in a local file system or a distributed file system (i.e., HDFS) as required. RISP mainly works in the background by getting some control information from the web-based GUI. Figure 6.1 illustrates the high-level overview of the architecture and the RISP integration of SciWorCS. In the figure, major components and user activities are numbered from 1 to 8 to clarify the data flow in SciWorCS while building a workflow using GUI-RISP.

Figure 6.1: High-level architecture of the SWfMS with RISP. The numbered components and activities are: (1) available tools in our SWfMS (tool server); (2) assembling and monitoring workflows in the client environment; (3) current tool states for parameter configuration and tuning; (4) and (6) RISP in the background (workflow mapping, status monitoring, and displaying results on the web page); (5) module execution in the local and distributed (Spark, HDFS) execution environments; (7) storing of execution sequences, results, and parameters (CouchDB servers); and (8) observing and analyzing the abstract workflow in the job manager.

6.3.1 System Overview

Reusable Module and Data

Workflows for scientific data analysis are often considered as Directed Acyclic Graphs (DAGs) [34] [73]. By considering intermediate data, a workflow can be presented as W = (D, M, E, ID, O), where M is a set of process modules with n elements m_i, (1 ≤ i ≤ n), and E is a set of directed edges of data flow e_ij = (m_i, m_j), (1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j), denoting the dependencies and conditions among modules and datasets. D is the input data set, O is the output data set, and ID represents the set of outcomes of different modules (i.e., intermediate data). Intermediate data sets (IDs) from various workflows are considered to be managed with RISP because of their incremental storage consumption. A module from the set M performs a specific task (e.g., data preparation, feature extraction, model fitting, or statistical analysis) in the SWfMS (i.e., SciWorCS). The computation scenario of such a module can be further customized with the parameter settings of a specific configuration set C, where the parameter configuration set is P and c_i ∈ C, (1 ≤ i ≤ |P|). By using a specific parameter set P, a module of set M might be in a different state from the state of the same module in similar or dissimilar workflows. Hence, a workflow module and an intermediate dataset can be generalized by Definition 6.3.1 and Definition 6.3.2 in SciWorCS, respectively.

Definition 6.3.1. A process module is defined as a tuple, m ⇒ ⟨id, I, O, C, S, T, Id⟩, where id is the unique identifier of the module in a system, and I and O are the supported input and output formats. C is the configuration set of different parameters, S is the source code, T is the state of the module in workflow execution, and Id is the set of intermediate data generated from the module in workflow execution. This generalized definition can help readers understand the reusable modules of the SWfMS while building workflows.

Definition 6.3.2. An intermediate dataset is defined as a tuple, Id ⇒ ⟨Sid, Did, S, sT, lT, m, T⟩, where Sid is the unique identifier of the dataset in a system and Did is the id of the raw dataset. S is the size of the dataset, sT and lT are the required saving and loading times for the dataset, and m is the module that generated the dataset with state T. This generalized definition can help readers understand the reusable intermediate data of the SWfMS while building workflows.
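The two tuples above can be mirrored directly as record types. The following is a minimal sketch (shown in Python 3 for brevity, although the SciWorCS core is written in Python 2; the class and field names are illustrative, not the actual implementation):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessModule:
    """Process module m = <id, I, O, C, S, T, Id> (illustrative field names)."""
    module_id: str                                  # unique identifier of the module
    input_formats: List[str]                        # supported input formats I
    output_formats: List[str]                       # supported output formats O
    config: dict = field(default_factory=dict)      # parameter configuration set C
    source: str = ""                                # source code / script path S
    state: str = ""                                 # tool state T under this configuration
    intermediate_ids: List[str] = field(default_factory=list)  # generated Id set


@dataclass
class IntermediateDataset:
    """Intermediate dataset Id = <Sid, Did, S, sT, lT, m, T> (illustrative)."""
    state_id: str          # Sid: unique identifier of the stored dataset
    raw_data_id: str       # Did: id of the raw input dataset it was derived from
    size_bytes: int        # S: size of the dataset
    saving_time: float     # sT: time required to save the dataset (seconds)
    loading_time: float    # lT: time required to load the dataset (seconds)
    producer_module: str   # m: module that generated the dataset
    producer_state: str    # T: state of that module when the data was produced
```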

Workflow Composition with Intermediate Data

To perform data analysis with a workflow W = (M, E), modules from M need to be selected from the toolbox of the web panel and assembled in the main composition window of SciWorCS. Besides, ports need to be mapped with links for the set E to sustain the required dependencies of the workflow. Simply executing all of the modules of the workflow in a SWfMS would cost a considerable amount of time. For example, in any given instance of workflow execution, W_i, there are some module outcomes (Ids ∈ ID), and a set Id is passed from one module to another for a dependency defined by a directed edge e_ij = (m_i, m_j) ∈ E, where each module with its state T performs some operations on the outcomes of the previous module (Id_i) to produce some new outcomes (Id_j). Usually, to perform operations on a large dataset, a single module takes a considerable amount of time, say t_i, to process its input and generate an intermediate dataset (Id_j). In most cases, these generated Id sets are also large and incrementally produced in a SWfMS. As a workflow could be composed of a considerable number of modules, the execution time of the workflow might be too long. Again, in a SWfMS users generally build similar workflows frequently by changing only a few modules, i.e., (m_k . . . m_l) ∈ M. In such a situation, rather than executing all modules (m_1 . . . m_n) ∈ M for a new workflow, previously executed results (IDs) could be used in a SWfMS to reduce the execution cost. Besides, considering the state of a tool for derived data could increase user choice in a SWfMS. Hence, we adopt the GUI-RISP in SciWorCS for each possible outcome of a module to consider the state of modules. Using the GUI version, we can recommend reusing appropriate existing intermediate data while building new workflows in a SWfMS, with proper parameter matching and data generation time information. Figure 6.2 demonstrates the core architecture of our data management and recommendation technique, GUI-RISP, in SciWorCS.
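As a concrete reading of this reuse idea, the sketch below shows how the longest already-executed prefix of a linear module sequence could be matched against stored intermediate data, so that execution can start after the last reusable module. The store layout and the function are our own simplification for illustration, with tool states included in the lookup key as discussed above; they are not the actual RISP matching code.

```python
def longest_reusable_prefix(modules, intermediate_store, input_data_id):
    """Return (k, dataset_id): the workflow can skip modules[0:k] by loading dataset_id.

    modules            -- ordered list of (module_id, state) pairs of the new workflow
    intermediate_store -- dict mapping (input_data_id, tuple of (module_id, state))
                          to the id of a stored intermediate dataset (assumed layout)
    """
    best_k, best_dataset = 0, None
    for k in range(1, len(modules) + 1):
        key = (input_data_id, tuple(modules[:k]))
        if key in intermediate_store:
            best_k, best_dataset = k, intermediate_store[key]
    return best_k, best_dataset


# Example with hypothetical ids: a previous run stored the output of m1 -> m2.
store = {("raw1", (("m1", "s1"), ("m2", "s1"))): "id_42"}
new_wf = [("m1", "s1"), ("m2", "s1"), ("m3", "s2")]
print(longest_reusable_prefix(new_wf, store, "raw1"))  # (2, 'id_42'): skip m1 and m2
```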

Figure 6.2: Workflow composition with the proposed technique

6.3.2 Background

Association rules have been used in many research areas for discovering correlations and can be defined in the following way.

Association Rule. An association rule [3] is an expression of the form X ⇒ Y, where X is the antecedent and Y is the consequent. Each of X and Y is a set of one or more program entities. In the original software-evolution setting, the meaning of such a rule is that if X gets changed in a particular commit operation, Y also has a tendency to be changed in that commit.

Support and Confidence. As defined by Zimmermann et al. [114], support is the number of commits in which an entity or a group of entities changed together. The support of an association rule is determined in the following way.

support(X ⇒ Y) = support(X, Y) (6.1)

Here, (X, Y) is the union of X and Y, and so support(X ⇒ Y) = support(Y ⇒ X). The confidence of an association rule, X ⇒ Y, determines the probability that Y will change in a commit operation provided that X changed in that commit operation. So the confidence of X ⇒ Y can be determined in the following way.

confidence(X ⇒ Y) = support(X, Y) / support(X) (6.2)

In RISP, we derive association rules between datasets and modules from pipelines and investigate them to suggest which intermediate-state results of a workflow in progress should be stored. To determine the supports and confidences of the association rules obtained from the workflows, RISP first determines all the distinct association rules from all the previously executed workflows, and then the support and confidence of each rule are calculated in the following way.

Support of an association rule. The support of an association rule is the number of times it can be generated from the workflows. The support of an association rule D ⇒ e_ij can be expressed formally in the following way.

support(D ⇒ e_ij) = number of times edge e_ij is found in the history (6.3)

Here e_ij ∈ E, e_ij = (m_i, m_j), (1 ≤ i ≤ n, 1 ≤ j ≤ n, i ≠ j).

Confidence of an association rule. We determine the confidence of the association rule D ⇒ e_ij in the following way from its support value:

confidence(D ⇒ e_ij) = support(D ⇒ e_ij) / support(D) (6.4)

Here, support(D) is the number of times input dataset D was used in all workflows.
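As a small illustration of Equations 6.3 and 6.4, the sketch below counts the supports and confidences of dataset-to-edge rules from an execution history. The history format (a list of workflows, each recorded as an input dataset plus its module edges) is an assumed simplification for this example rather than the exact RISP data model.

```python
from collections import Counter


def mine_rules(history):
    """history: list of (input_dataset, [(m_i, m_j), ...]) executed workflows.

    Returns {(D, (m_i, m_j)): (support, confidence)} for rules D => e_ij,
    following Equations 6.3 and 6.4.
    """
    edge_support = Counter()     # support(D => e_ij): times edge seen with dataset D
    dataset_support = Counter()  # support(D): times dataset D used in any workflow

    for dataset, edges in history:
        dataset_support[dataset] += 1
        for edge in edges:
            edge_support[(dataset, edge)] += 1

    return {
        (dataset, edge): (count, count / dataset_support[dataset])
        for (dataset, edge), count in edge_support.items()
    }


# Toy example with hypothetical module names:
history = [
    ("D1", [("m1", "m2"), ("m2", "m3")]),
    ("D1", [("m1", "m2"), ("m2", "m4")]),
]
print(mine_rules(history)[("D1", ("m1", "m2"))])  # (2, 1.0)
```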

6.3.3 RISP Interaction Modelling

We analyzed different situations in which users would utilize RISP in the user interface of a SWfMS. In this section of the study, we describe the possible workflow creation scenarios that inspired the design of the user interface for our data recommendation technique. After investigating several SWfMSs and their workflows, we identified the following major composition scenarios.

Composition scenarios.

• Creating a new workflow by dragging and dropping tools. In this scenario, users may compose both small and large workflows from scratch, where sub-workflows are unique to the system.

• Creating a workflow by dragging and dropping tools. In this case, sub-workflows could already be in the system from previous executions of different users, with proper access control.

• Importing an existing workflow into the system from a saved one or a shared source. In this case, sub-workflows could already be in the system from previous executions of different users, with proper access control.

Also, to increase the efficiency of the GUI-RISP selection process, we considered the scenarios in which a workflow changes after being composed in a SWfMS.

Change scenarios.

• Users could change, tune, or delete modules of a composed workflow. Users may even want to save the modified workflow in the system.

• Parameters of a module could be reconfigured.

• Tool state could be changed for a workflow.

Considering the above scenarios, recommendations are arranged on the SWfMS's interface as follows (a minimal decision sketch is given after the list).

Recommendations.

• If a subset of modules is unique in a workflow, then users will not get any recommendation to use existing datasets; rather, a recommendation for saving datasets may appear on the interface.

• If a subset of modules can be found in previous executions, then users will get a recommendation for the subset if datasets are available; otherwise, a recommendation for saving datasets will appear on the interface.

• A message or recommendation list for a subsequence appears after adding a new module and connecting it to a previous one

• Recommendations can change based on CRUD operations in a workflow.

• Recommendations can be updated after each parameter tuning and configuration.

• Recommendations are based on tool states, not the ids of tools.

• If users choose to import a workflow from existing ones rather than creating it from scratch, a dedicated button can be used to check the available intermediate data for subsequences.

• If the data loading time is more than the execution time for a module, users are warned with a message while using the recommendations

• The time required to execute a module, taken from previous executions, is presented on the module for better understandability.
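The rules above can be read as a small decision procedure. The following sketch is an illustrative summary of those rules (the function, the store layout, and the loading_time field are assumptions, not the SciWorCS API): for an assembled module subsequence it either proposes loading an existing intermediate dataset, warns when loading would be slower than re-execution, or proposes saving the new outcome.

```python
def recommend(subsequence, store, estimated_exec_time):
    """Decide what the interface should suggest for an assembled module subsequence.

    subsequence         -- tuple of (module_id, state) pairs, tool states included
    store               -- dict mapping subsequences to stored dataset records that
                           expose a .loading_time attribute (hypothetical layout)
    estimated_exec_time -- seconds needed to execute the subsequence from scratch
    """
    dataset = store.get(subsequence)
    if dataset is None:
        # Unique (or not yet executed) subsequence: suggest saving its outcome.
        return {"action": "save_outcome"}
    if dataset.loading_time > estimated_exec_time:
        # Data exists but loading is slower than recomputing: warn the user.
        return {"action": "load_available_but_slower", "dataset": dataset}
    return {"action": "load_intermediate_data", "dataset": dataset}
```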

6.3.4 System Design

The job manager of SciWorCS is responsible for executing the modules of an assembled workflow based on their dependencies and implementations (i.e., modules could be executed on a single server or a cluster). RISP acts alongside the job manager to suggest intermediate datasets automatically. Besides, using the GUI-RISP, users can manually check and load intermediate datasets of previously executed workflows while composing and executing a new workflow, reusing available datasets as the new workflow requires. SciWorCS also provides real-time status monitoring, data mapping, and parameter setting mechanisms for each module individually to enable more modularized control. The job manager further performs the storing procedure of intermediate data and other necessary data based on the recommendation from RISP. All of the execution records and configuration information of modules in a workflow are stored in a CouchDB server. RISP mainly associates the execution sequences and parameter information of previously executed workflows to recommend intermediate data while composing workflows in the SWfMS.
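As a rough illustration of this record keeping, the sketch below stores one module-execution record through CouchDB's standard HTTP document API using the requests library. The database name, document fields, and the assumption of an unauthenticated local server are for illustration only; the actual SciWorCS schema is not reproduced here.

```python
import uuid

import requests

COUCHDB_URL = "http://localhost:5984"   # assumed local, unauthenticated CouchDB server
DB_NAME = "workflow_runs"               # hypothetical database name


def store_execution_record(workflow_id, module_id, state, params, intermediate_id, duration):
    """Persist one module-execution record so the recommender can mine it later."""
    doc = {
        "workflow_id": workflow_id,
        "module_id": module_id,
        "tool_state": state,
        "parameters": params,
        "intermediate_id": intermediate_id,
        "execution_time_sec": duration,
    }
    doc_id = uuid.uuid4().hex
    # PUT /<db>/<doc_id> creates the document; requests sets the JSON content type.
    resp = requests.put(f"{COUCHDB_URL}/{DB_NAME}/{doc_id}", json=doc)
    resp.raise_for_status()
    return doc_id
```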

Implementation Details

SciWorCS is built with Flask (a Python microframework) to provide a web-based solution for end-users. Various resource management technologies such as HDFS, CouchDB, and the Unix file system are incorporated to enable interoperability among the cutting-edge technologies used in SciWorCS (e.g., Yarn, Spark, and Python subprocesses) for our data management scheme. The core architecture of SciWorCS is implemented in Python 2, and recent web technologies (e.g., JavaScript and GoJS) are used to provide an interactive user experience. The core implementation is hosted on a server where all of the local scripts of modules are executed. A distributed Spark cluster (i.e., a Compute Canada cluster of five nodes, 40 cores, and a total of 200 GB of RAM) and a CouchDB server are also associated with the SWfMS to submit distributed jobs and update log information from the web server using their respective APIs.
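Because the web layer is Flask-based, the intermediate-data check can be pictured as a small JSON endpoint that the composition GUI calls in the background. The route, payload fields, and in-memory store below are purely illustrative and do not reproduce the actual SciWorCS routes; the returned status labels mirror those shown on the interface.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Trivial in-memory stand-in for the intermediate-data lookup (hypothetical content).
INTERMEDIATE_STORE = {(("m1", "s1"), ("m2", "s1")): "id_42"}


@app.route("/risp/check", methods=["POST"])
def check_intermediate_data():
    """The composition GUI posts the assembled module subsequence (with tool
    states) and gets back either a dataset id to load or a 'not found' answer."""
    payload = request.get_json()
    subsequence = tuple((m["id"], m["state"]) for m in payload["modules"])
    dataset_id = INTERMEDIATE_STORE.get(subsequence)
    if dataset_id is None:
        return jsonify({"status": "Checked Not Found"})
    return jsonify({"status": "Checked Found", "dataset_id": dataset_id})


if __name__ == "__main__":
    app.run(port=5000)
```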

Rules

Figure 6.3 shows the interactive composition window of SciWorCS used to assemble and execute workflows. Modules of a composed workflow could be executed on the Linux server as local scripts or in the Spark cluster as distributed jobs. Both distributed and local modules can be dragged and dropped from the left-side toolbox panel (labeled as L) to the main container of workflows (labeled as MCW). After placing the required modules from panel L for a workflow, users need to connect them with links to resolve their conditions and dependencies. The right-side panel in Figure 6.3 shows the available input datasets and the outputs produced by workflows in the SWfMS. Datasets and parameters could be set for a module by double-clicking the module, which opens a pop-up window (labeled as MPW). Each module in a workflow has a particular status label for the recommendation technique, GUI-RISP (labeled as RISP). Initially, this status for each module is 'Not Checked', but after assembling, checking, or executing a module, the status could become 'Checked Not Found', 'Checked Found', or 'Load Data'. With the status 'Load Data', the SWfMS enables a button for loading intermediate data for an assembled workflow. The GUI version of RISP, where parameter matching and generation time are considered to give multiple options for selecting intermediate data, is placed below this status label as a list (labeled as A-RISP). By clicking an item from the list, a user can follow the same procedure of loading intermediate data for a workflow.

Suggestion

Figure 6.3 shows a workflow composition process in SciWorCS with the help of our proposed data recommendation technique, RISP. After assembling a desired workflow, the 'Search' option could be used to find the existing relevant intermediate data that were stored by the recommendation of RISP. The module status label also displays information on the possibility of loading and executing workflows with previously stored intermediate data. In the figure, green-colored modules and links indicate that the intermediate data are loaded up to the colored point, and there is no need to execute those modules for the depicted workflow. In this case, the subtraction module (labeled as m4) had the option of selecting intermediate data, and the data were loaded up to this module's execution point. Observing the parameter matching information in the options of the GUI-RISP and selecting an intermediate dataset serves the same purpose of executing a workflow in the Scientific Workflow Management System.

6.4 Experimental Studies and Results

To assess the workload of RISP in SciWorCS, we performed load testing, where we considered the SWfMS both with and without RISP. In our evaluation, to avoid biases and unintentional delays from users, we used Apache JMeter [41] and Google Analytics [20] to assemble workflows automatically and keep logs with sampling. First, we recorded workflow compositions and executions of the two scenarios (i.e., composition without RISP and with RISP) in the SWfMS using the JMeter Script Recorder.

Figure 6.3: User interface for building workflows

Table 6.1: Overall load of the SWfMS

Measures             Composition without RISP   Composition with RISP
Requests             743                        321
Average Time (sec)   210.3                      200.5
Throughput/sec       3.5                        5
Received KB/sec      7.32                       19.88
Sent KB/sec          4.63                       7.77

Then we ran each of the scripts from JMeter 100 times to report its performance. Table 6.1 compares various performance metrics of the two scenarios for the SWfMS. In the table, we can see that using RISP in SciWorCS does not change the average response time much for various requests from our web application. Besides, the throughput increases with RISP, because only very few requests and responses are needed in a workflow composition using RISP in SciWorCS (i.e., 56% fewer requests). Before the extensive user evaluation, we first inquire into the overall end-user experience of RISP, and to do this, we followed the application performance monitoring (APM) strategy. Following this strategy, we mainly answered one research question to assess the overall user experience of RISP in SciWorCS. RQ1: What is the overall performance of RISP in the system? Are there any components making the overall performance poor? To answer the question and present the overall performance, we calculate the Apdex score [79]; the equation to calculate it is given below:

Apdex_t = (SatisfiedCount + (ToleratingCount / 2)) / TotalSamples (6.5)
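A minimal sketch of the calculation in Equation 6.5 is given below; the 0.5-second threshold matches the default value used in our scoring, and the tolerating band of up to four times the threshold follows the usual Apdex convention.

```python
def apdex(response_times, threshold=0.5):
    """Apdex_t = (satisfied + tolerating / 2) / total_samples  (Equation 6.5).

    A request is 'satisfied' if it completes within the threshold, 'tolerating'
    if it completes within four times the threshold (standard Apdex convention),
    and 'frustrated' otherwise.
    """
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Example: mostly fast responses with a few slow ones.
print(round(apdex([0.2, 0.3, 0.4, 0.6, 1.2, 2.5]), 2))  # 0.67
```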

We chose Apdex in our initial evaluation because the Apdex score gives the overall satisfaction for a system from all users rather than the satisfaction of a few individuals from different categories [79]. In the Apdex scoring process, to categorize the responses from users for a system, we need to define a threshold time. In our scoring, we use the default threshold value (0.5 sec) to divide the satisfied and unsatisfied responses. To calculate the average single-request response time of workflows, we executed 18 workflows in SciWorCS with RISP and kept a log of the average response time of each request (except module operation requests). From our Apdex scoring, with a value of 0.89, we can say that nearly 90% of users are satisfied with the performance of SciWorCS after RISP integration. Thus, RQ1 can be answered with a positive statement for RISP being used in the SWfMS. Since the average response time does not vary much after RISP integration, components can interact normally with each other after RISP integration in the SWfMS, which answers part of RQ1 regarding the overall performance of the SWfMS. Furthermore, after integrating RISP in SciWorCS, the higher number of KB/sec processed (Table 6.1) between a client and the server shows the increased throughput of the SWfMS. So, no component is making the overall performance poor; rather, the performance of the SWfMS improves with RISP (this answers the last part of RQ1). For the detailed user evaluation, we mainly cover two popular scientific workflow composition areas (i.e., image processing and bioinformatics workflow compositions) by assembling and executing workflows with and without RISP in SciWorCS. Both studies of the two areas are discussed below to answer some important research questions. The research questions are designed to investigate the following three hypotheses: H1: RISP will not increase the overhead of SciWorCS. We hypothesize the technique will minimize the requests and responses between a client and the server of SciWorCS. H2: Users would use the intermediate data while executing workflows if the opportunity is given to them. Here, we want to explore how much of the available intermediate data is used in the SWfMS. H3: RISP will help to compose workflows efficiently. We hypothesize workflow execution time would be reduced if users use intermediate data.

6.4.1 Participants

We selected participants who had at least heard the term workflow (i.e., business or scientific workflow), as this study investigates the effectiveness of a technique in a SWfMS. We selected a total of 7 participants, aged from 24 to 35, from a university, including both graduate and undergraduate students. The participants had an average age of 28; 43% were female and 57% male.

6.4.2 Study 1: Image Processing Workflow Composition

In Study 1, we want to explore user satisfaction with the performance of composing and executing data-intensive workflows aided by the data of our technique (i.e., RISP). Image processing workflows are used in this study, where the differences between typical workflow composition and composition with our proposed technique are investigated in SciWorCS. In general, the work patterns and the efficiency of workflow composition and execution are explored for both scenarios (i.e., composition with and without RISP) in the SWfMS to answer our RQ2. RQ2: How is RISP being used by users in the system? In image processing workflows, where input datasets are normally large, the processing time of modules can be lengthy, and the produced intermediate data can be too large to manage. A data reusability technique might be effective for such workflows, and in this study, we want to verify the effectiveness of our technique with image processing workflows. In the case of large data analysis, dependencies and conditions in workflows especially need to be considered to solve a desired problem at minimum cost. Considering these dependencies of the directed acyclic type, several recommendation techniques [48] [18] [109] [87] for modules or services have been introduced. However, none of these studies address the usefulness of the recommended services or intermediate data for reuse in workflow design. Therefore, in this study, to explore the usability of efficient pipeline design by reusing intermediate data in a SWfMS, we use five image classification workflows with data-intensive computations. Consequently, how efficiently users can assemble and execute workflows with the recommendation technique, how much time is required to design and execute a workflow, and so on are analyzed in this study. Additional factors such as the effects of workflow length, complexity (highest degree in a workflow), and data availability are considered in our second study to assess the proposed technique and explore future directions in the workflow building process. Hence, in this part of our evaluation, we answer RQ2 using the time factors and the data usage of workflow assembling and execution.

Task and Stimulus

In this study, as mentioned above, to verify the effectiveness of enhancing performance and gaining time, we consider five image classification workflows. Modules of these five workflows perform distributed processing on the SWfMS's Spark cluster. For the classification models, we chose five popular deep neural network architectures (i.e., InceptionV3, Xception, ResNet50, VGG16, and VGG19) to give the option of choosing a model in the model-fitting module of the five workflows. In total, the five workflows contain ten distinct modules that are implemented in SciWorCS and compatible with the recommendation technique, GUI-RISP. Each of the five workflows is computationally expensive and solves a classification problem using a sequence of data preparation, feature extraction, model fitting, and analysis. In this study, to eliminate biases, users had the opportunity to choose the classification models arbitrarily to perform image classification on a given dataset using one of the five workflows. The dataset for the classification problem was collected from the P2IRC project of the University of Saskatchewan and has more than 3K images (2K for training, 1K for testing). In the study sessions, workflow structures were printed and provided to the participants. Three of the five workflows using the five models were explained to them as model testing problems of image classification. The workflow design procedure in SciWorCS, such as dragging and dropping, port linking, and dependency making of modules, was described to the participants before the main study. Besides, how to check, load, and use available intermediate data in the SWfMS was demonstrated to the participants using demo workflows.

Experiment Procedure and Data Collection

All of the events that occurred, along with their timestamps, were logged in the background by defining the JavaScript tracking snippet of Google Analytics [20] while the participants built pipelines. To capture valuable information about assembling and executing workflows in both scenarios (i.e., SciWorCS with and without RISP), each event is tracked with hitType, eventCategory, eventAction, eventValue, and eventLabel. Similarly, timing information is tracked with hitType, timingCategory, timingVar, timingValue, and timingLabel using the JavaScript snippets of Google Analytics. At the end of the study, participants were asked to fill out the NASA-TLX questionnaire to analyze the workload of executing image processing workflows in the SWfMS.
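For completeness, the sketch below shows one way such logged events could be reduced offline to per-workflow assembling and execution durations. The exported record format (workflow id, phase label, timestamp) is an assumption for illustration; only the aggregation step is sketched, not the Google Analytics export itself.

```python
from collections import defaultdict


def phase_durations(events):
    """events: iterable of (workflow_id, phase, timestamp_seconds) records,
    where phase is 'assembling' or 'executing' (assumed export format).

    Returns {workflow_id: {phase: duration_seconds}} using the span between
    the first and last logged event of each phase.
    """
    spans = defaultdict(lambda: defaultdict(lambda: [float("inf"), float("-inf")]))
    for wf, phase, ts in events:
        lo, hi = spans[wf][phase]
        spans[wf][phase] = [min(lo, ts), max(hi, ts)]
    return {
        wf: {phase: hi - lo for phase, (lo, hi) in phases.items()}
        for wf, phases in spans.items()
    }


events = [("wf1", "assembling", 0.0), ("wf1", "assembling", 42.0),
          ("wf1", "executing", 42.0), ("wf1", "executing", 100.0)]
print(phase_durations(events))  # {'wf1': {'assembling': 42.0, 'executing': 58.0}}
```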

Results

In the performance analysis of the image processing workflows, users' activities of building and executing workflows are traced separately. The average assembling and execution times from all participants for the five workflows are plotted in Figure 6.4. In Figure 6.4, gray-colored bars represent the assembling and execution times of the five workflows with the proposed technique, GUI-RISP. The dotted pattern in a bar denotes the assembling time, and the solid part depicts the execution time. In the figure, we can see that the assembling time of workflows with the GUI-RISP is higher than the assembling time of workflows without it (i.e., the dotted gray bars are higher than the dotted white bars). However, all five gray bars (i.e., aggregated dotted and solid) of the five workflows are lower in height than the corresponding white bars of the same workflows. That means the overall performance of all workflows with the proposed technique, GUI-RISP, is better than the performance of workflows without the technique in SciWorCS. The reason is that participants had the option to select intermediate data recommended by the GUI-RISP, and they used intermediate data to skip some module operations. Though the assembling time of workflows with GUI-RISP is higher than without it, the extra assembling time for the recommendation and data configuration is negligible compared to the overall performance gain for all of the workflows. Thus, the proposed technique can efficiently handle intermediate data to recommend for reuse in a real-life data-intensive workflow building environment. Figure 6.5 depicts the data usage pattern for known and unknown workflows. A workflow known to a user means the same workflow was executed by the same user previously. In our case, workflows 1, 2, and 5 (i.e., InceptionV3, Xception, and VGG19) were executed by the participants prior to the main study while showing the demo of workflow building. We found that participants tend to use more intermediate data for known workflows than for unknown workflows. In Figure 6.5, around 87% of the available intermediate data are used for the known workflows, which is higher than the 50% of the two unknown workflows (N.B. all workflows had at least one intermediate dataset from previous executions). In addition, we found that participants checked and used more intermediate data for computationally expensive modules of workflows (e.g., in our case, participants checked and used data for the model-fitting modules) than for inexpensive modules. Traces of checking and using intermediate data in the workflows are plotted in Figure 6.6. Figure 6.6 also represents the likelihood of using intermediate data after checking the data, and the results are more significant in all three known workflows (i.e., Workflows 1, 2, and 5) than in the unknown Workflows 3 and 4. In this study, of the provided raw data and generated intermediate data, intermediate data and raw data showed a split of 83% and 17% usage while building pipelines by all participants. This information also shows that if we use RISP, we can increase the reusability of a SWfMS. All of the findings in this study can be used to answer RQ2 partially, and up to this point, the results show that the GUI-RISP could be used efficiently for data-intensive workflows.

NASA-TLX responses from all of the participants for this study are plotted in Figure 6.7. In the figure, dark blue bars represent the average scores of all the participants for SciWorCS with RISP. The gray bars represent the average scores of all the participants for SciWorCS without RISP. All of the NASA-TLX subscales, such as Mental Demand, Physical Demand, Temporal Demand, Frustration Level, Overall Performance, and Effort, are considered in this study along with an overall score. From Figure 6.7, it can be interpreted that, on average, users experience a low level of demand with intermediate data in all subscales (i.e., the SWfMS with the GUI-RISP), even though the assembling and execution times of workflows vary between the two systems. One reason could be that assembling a workflow with the proposed technique in the SWfMS is more demanding, as the participants needed to learn more controls of the data management. However, when executing workflows with RISP, users needed to wait less time, which could reduce the subscale demands in the SWfMS. Overall, the aggregated effects of the two phases exhibit a low level of demand for the GUI-RISP system.

6.4.3 Study 2: Bioinformatics Workflow Composition

Scientific analysis of bioinformatics data is another popular domain that relies on SWfMSs for composing heterogeneous tools. In this study, we incorporate bioinformatics workflow composition using the GUI-RISP and compare it with traditional workflow composition. In the previous study, we considered workflows of the same length with heterogeneous technologies and large datasets. In this study, we consider composing micro-services or simple modules of bioinformatics (i.e., workflows are categorized as simple/smaller and complex/longer) with comparatively smaller datasets. Thanks to the active bioinformatics communities, various micro-services and services are available and ready to use. To analyze bioinformatics data in a SWfMS, both simple and complex workflows are important given the availability of micro-services and services. Thus, in this study, we conduct our experiment on both smaller and longer bioinformatics workflows to complete the answer to our RQ2.

Figure 6.4: Required assembling and execution time in image processing workflows (assembling and execution times in seconds for Wf1–Wf5, without and with intermediate data)

Figure 6.5: Use of intermediate states in image processing workflows (number of intermediate states used vs. available, average per user, for Wf 3, Wf 4, kWf 1, kWf 2, and kWf 5)

Task and Stimulus

In the previous study, we chose image processing workflows of distributed modules to process a massive amount of data. In contrast, here we choose four bioinformatics workflows of local-script modules and comparatively smaller datasets. The first two workflows are smaller in size and relatively easy to follow. The last two workflows are longer and contain the first two workflows in their structure. In total, the four workflows contain four distinct modules: Get input, Pear, Fastqc, and Flash. All of these modules are imported from Galaxy [1], integrated into SciWorCS, and compatible with the GUI-RISP. None of the four workflows is computationally expensive compared to the previous study, but they are complex in structure due to the many port mappings. The datasets for the bioinformatics problems were collected from the P2IRC project of the University of Saskatchewan. The dataset mainly has two files: one is the Forward FastQC, and the other is the Reverse FastQC. In the study sessions, workflow structures were printed and provided to the participants. All four workflows, with some of their modules, were explained to them as genome sequencing problems of bioinformatics. As in the previous study, the workflow design procedure in the SWfMS, such as dragging and dropping, port linking, and dependency making of modules, was described to the participants before the main study. Besides, how to check, load, and use available intermediate data in SciWorCS was again demonstrated to the participants using demo bioinformatics workflows.

Figure 6.6: Checking and Using Intermediate States in Image Processing Workflows (number of times checked vs. used for Wf 1–Wf 5)

Figure 6.7: NASA-TLX responses for image processing workflows (ratings for MD, PD, TD, FR, OP, EF, and overall, without and with RISP)

Experiment Procedure and Data Collection

As before, all of the activities of the participants were logged and timestamped using the JavaScript tracking snippet of Google Analytics while building workflows in the study session. We also capture the same information about assembling and executing workflows for both scenarios (i.e., SciWorCS with and without the GUI-RISP). Every event is tracked with hitType, eventCategory, eventAction, eventValue, and eventLabel for all four workflows. Similarly, timing information is tracked with hitType, timingCategory, timingVar, timingValue, and timingLabel using the JavaScript snippets of Google Analytics for all of the workflows. At the end of the main study, participants filled out the NASA-TLX questionnaire to evaluate the workload of executing bioinformatics workflows in the SWfMS.

Results

In this study of bioinformatics workflows, unlike the previous study, the assembling and execution time of each workflow is traced together. The average completion times of all four workflows are plotted in Figure 6.8. The first box of each group in the box plot represents the total completion time of a workflow with the GUI-RISP (i.e., with intermediate data). The second box of each group represents the completion time of a workflow without RISP (i.e., without intermediate data). The first two groups in the figure depict completion times for the two simple workflows, and the last two depict completion times for the two complex workflows. From Figure 6.8, it can be stated that with the GUI-RISP, users can gain more time for complex workflows, owing to the availability of intermediate data for a long sequence of modules, than for the simple/short workflows. The availability of intermediate data and their usage are clearly illustrated in Figure 6.9. In the figure, the participants' preferences of data usage while building both long and short workflows with GUI-RISP are represented. Similar to the previous study, this study also has workflows with at least one intermediate dataset from previous executions. In the figure, we can see that for long workflows users used more intermediate data than for the short workflows. In Figure 6.10, we show the intermediate data usage for known and unknown modules. A known module means the operation and relations of the module in each workflow were explained to the participants prior to the study. For this study, the known modules are Galaxy Pear and Galaxy Flash. From the figure, we can see that participants used more intermediate data to skip module operations for known modules than for unknown modules. In Figure 6.10, 64% of the available intermediate data are used for the known modules, which is greater than the 36% for unknown modules. This usage pattern could arise because the participants felt comfortable skipping the operations of those modules by knowing what they were skipping. So better understandability can improve the use of intermediate data, and in the future, we will shift the SWfMS towards a data-centric awareness system. In this study, of the provided raw data and generated intermediate data, intermediate data and raw data showed a split of 63% and 37% usage while building pipelines by all the participants. The experimental results of this study show that if we use the GUI-RISP, we can increase the reusability of a SWfMS for all types of workflows. The increased reusability and efficiency for the various types of workflows in this study complete the answer to our RQ2. NASA-TLX responses from all of the participants for this study are also collected and plotted in Figure 6.11. As in the previous study, dark blue bars represent the average scores of all the participants for SciWorCS with the GUI-RISP. The gray bars represent the average scores of all the participants for the SWfMS without GUI-RISP. All of the NASA-TLX subscales, such as Mental Demand, Physical Demand, Temporal Demand, Frustration Level, Overall Performance, and Effort, are also considered in this study with an overall score. From Figure 6.11, it can be interpreted that, on average, a user experiences low levels of subscale demands in the GUI-RISP system. This may be because the completion time of workflows varies with their complexity, and workflow completion with the proposed technique in the SWfMS shows a higher gain in execution for long workflows than for short workflows. So, on average, users are satisfied with the proposed technique in a SWfMS for its efficiency. The aggregated results of our NASA-TLX responses from both studies can be used to answer RQ3. RQ3: Is the GUI-RISP helping to compose workflows more efficiently? With the response values from the different NASA-TLX subscales in both studies, it can be interpreted that RISP does not increase workloads while composing workflows in SciWorCS. This result satisfies RQ3 of our experimental studies.

Figure 6.8: Time differences among various types of bioinformatics workflow (assembling plus execution time in seconds for simple and complex workflows, without/with intermediate data)

6.4.4 Study 3: User Behavior in the Interface

In this study of user behavior analysis, we consider the time-spent and gaze-point metrics to explore the eye-tracking data of the participants from the workflow composition and execution. One of the main reasons for choosing eye-tracking technology for this study is to validate user involvement with the intermediate data recommendation GUI (i.e., to validate the users' interest in intermediate data while composing workflows in SciWorCS). All of the workflows from the previous two studies are used to track user gaze points and time durations by specifying different areas on the composition interface. In particular, where participants are mostly looking and how much time they spend in those areas are examined with the Areas of Interest feature of Tobii Studio to answer RQ4. RQ4: What are the major areas where participants are mostly involved while assembling and executing workflows with the GUI-RISP? Assembling and executing workflows in a SWfMS requires significant attention from users, as there are some manual operations for both assembling and configuring the workflows. Thus, on a workflow composition interface, we need to analyze where the gaze points are and where users spend most of their time while composing workflows with the proposed management technique. Besides, we need to know whether the participants are really interested in intermediate data and whether the GUI-RISP is really helping them to compose workflows; consideration of these facts with our selected metrics should be adequate to answer RQ4.

Figure 6.9: Use of intermediate states in bioinformatics workflows (used data vs. available intermediate data, average per user, for sWf 1, sWf 2, lWf 1, and lWf 2)

Task and Stimulus

In this study of behavioral analysis, we did not consider any additional workflows. We collected both gaze data and time-spent data (with a heat map) for all of the workflows composed by the participants in Studies 1 and 2.

Experimental Procedure and Data Collection

Both the user behavioral data (i.e., gaze data and time spent data) are collected using a Tobii Eye-tracker while building workflows by the participants. The tracker was calibrated for each participant, and the participants maintained the recommended distance (i.e., between 18 to 40 inches) from the computer monitor.

Figure 6.10: Use of intermediate states in bioinformatics workflows (number of known and unknown modules vs. those used, for Wf1–Wf4)

Figure 6.11: NASA-TLX responses for bioinformatics workflows (ratings for MD, PD, TD, FR, OP, EF, and overall, without and with RISP)

Figure 6.12: Heatmap of the interface after building workflows

Important areas, such as the suggestion and composition areas, are divided into separate sections as 'areas of interest' for tracing the eye-tracking data. The Tobii Pro software solution is used for analyzing and representing the tracked data for the comparison.

Results

Figure 6.12 represents the heat map of the image processing workflows. In the heat map, we can see that users spent a considerable amount of time on modules after dragging and dropping them into the composition area, mostly on modules 2 and 3, where users had the option to select intermediate data. So, from the heat map, it can be interpreted that users are interested in the suggestions of the recommendation technique. Our areas of interest are divided into two regions, i.e., the composition area and the suggestion area. Table 6.2 represents the average time-spent data in these two areas separately. From the table, it can be seen that the time spent in the suggestion area is around 39% of the total composition-area time. This means that, with the facility of using intermediate data with parameter and time information, users tried to understand the suggested data and were interested in using them. The result of this study answers our research question 4 (RQ4) and explores the potential of suggested data on the interface.

6.5 Discussion and Future Directions

The results of the three studies indicate that users generally prefer to reuse intermediate data while executing workflows in a SWfMS. In the study of computationally expensive image processing workflows, we found that the participants mostly prefer to use intermediate data for computationally expensive modules. In the future, we plan to work on prioritizing the expensive modules in SciWorCS for storing their outcomes with the technique, so that we can increase reusability with respect to users' usage patterns. Other significant findings that revealed some research directions are also considered in this study for further improvement of our recommendation system. In particular, the finding from the two studies of users' preferences for intermediate data of known modules and known workflows could be used to design a new architecture of the SWfMS where data associations with modules and workflows will be expedited for efficiency. For instance, in real life, complex workflows are formed from different domains, and many collaborators from different areas are involved in the workflow design process. In such a case, intermediate data management can help a collaborator to design and execute a known part of a workflow, and other collaborators could reuse this part for the rest of the execution. This type of system is only possible if we can increase data awareness in a SWfMS, and we are currently working on using intermediate data in a collaborative environment. In our second study of bioinformatics workflows, we saw that while assembling long workflows users prefer to use more intermediate data. Since in the real-life workflow building process sometimes 100 or more modules are involved, RISP can have a positive impact on a SWfMS. Furthermore, the overall scores of the NASA-TLX responses from both studies separately show that after integrating the GUI-RISP with SciWorCS, participants experience less overhead. Moreover, the subscale scores are lower after integrating the GUI-RISP in the SWfMS. Hence, we believe the technique of recommending intermediate data can increase efficiency in a SWfMS without any shortcomings.

6.6 Threats to Validity

We considered workflows from two areas (image processing and bioinformatics) in our user study for analyzing the effectiveness of the GUI-RISP in SciWorCS. While more areas could make our findings more generalizable, we see that the GUI-RISP shows effectiveness on workflows from both areas. Thus, we believe that our findings cannot be attributed to chance, and the proposed technique, RISP, can be considered an efficient technique for managing workflows.

6.7 Conclusion and Future Works

To implement an efficient system for processing a large amount of data with numerous tools, proper data management in a workflow management system is crucial. We proposed a data recommendation technique for both storing and retrieving data while building workflows in a SWfMS, and this study is intended to investigate case studies of a GUI version of the technique to comprehend the behaviors and expectations of users in real-world workflow building. In our two case studies, of the provided raw data and generated intermediate data, intermediate data and raw data showed a split of 73% and 27% usage while building workflows by all participants. This usage trend and preference for intermediate data imply that the technique can fulfill the expectations of users in most cases of efficient workflow execution. Additionally, we found that the technique is more useful for long-running and complex workflows, so we believe our technique (i.e., RISP) has the potential to contribute to composing workflows with big data and heterogeneous tools. Moreover, using the NASA-TLX study, we found that the demand in each NASA-TLX subscale does not show any negative impact while assembling and executing workflows in the SWfMS with the technique. Therefore, we believe the technique can be used in any SWfMS without any additional burden, introducing reusability and building workflows efficiently.

Table 6.2: Total Visit Duration

p2irc-shipi.usask.ca:5000/cvs        Composition Area                          Suggestion Area
(Summary Only)                 N (Count)  Mean (Seconds)  Sum (Seconds)   N (Count)  Mean (Seconds)  Sum (Seconds)
All Recordings                 7          1539            10773           7          596.30          4174.1

7 Conclusion

This chapter is presented to conclude the thesis. We present a summary of all of our work in Section 7.1, and Section 7.2 indicates possibilities of future research.

7.1 Summary

A Scientific Workflow Management System (SWfMS) is a software solution to execute scientific workflows and manage scientific data sets in various computing environments [53] [7]. In such a SWfMS, automated techniques and provenance tracking provide advantages for both composition and execution [23] [36]. Furthermore, a SWfMS needs to be reliable and efficient to coordinate and automate data flow and task execution in various environments, such as distributed computing clusters, cyberinfrastructures, and commercial and academic clouds [27]. In scientific workflows, scientific data processing activities are assembled and automated to reduce the execution time of the activities [53]. Consequently, such systems face big data explosion issues with massive velocity and variety characteristics due to the large amount of heterogeneous data from different domains. One of the main reasons for this explosion is the frequent execution of workflows in a SWfMS, which generates a huge amount of data whose characteristics are always incremental. Therefore, a large amount of heterogeneous data needs to be managed in a Scientific Workflow Management System (SWfMS) with a proper decision mechanism.

Previous studies also emphasized data management with metadata and awareness of data in a scientific workflow management system because of their complexity, dimension, and volume [82] [24] [40]. Besides, reproducibility and reuse of essential components of workflows from SWfMSs are crucial in scientific communities [23] [32]. Although a few studies mentioned that automation and reproducibility support could accelerate and transform workflow processes [36], there is no explicit technique to do that. Even the existing data management techniques do not support data reusability directly; most of the systems are designed by emphasizing rules and cost [74]. By emphasizing reusability in a SWfMS, in this thesis we propose an intermediate data management scheme. Furthermore, for a concrete decision mechanism, RISP (Recommending Intermediate States in Pipelines) is introduced, which applies association rule mining to recommend intermediate data for storing and reusing while building workflows in a SWfMS. We have done several experiments to analyze the performance and explore the effectiveness of the technique in a SWfMS. All of our major contributions towards data management for a SWfMS are summarized below in four sections.

Management Scheme for Micro-Level Modular Programs

Although some of the previous works [8] [44] [92] [64] [43] discussed the fault-tolerance concept by managing intermediate states, none of them discussed the possibility and necessity of managing intermediate data for efficient scientific analysis. In this work, we proposed a novel data management scheme for intermediate data to support reusability in a SWfMS. Using the proposed management technique, error recovery and efficient processing can be ensured in a SWfMS. We conducted several experiments with the proposed technique on both a local file system and a distributed file system to validate its efficiency. Our experimental studies show that the proposed technique can execute workflows efficiently, increasing performance by up to 87%.

Optimized Storing of Workflow Outputs

A number of studies focused on recommending modules or services in the workflow building process in a SWfMS [48] [18] [109] [87]. However, the recommendation or discovery of modules or services in the composition phase of a workflow can ensure the reusability of processes but not of the module outcomes, which users may need to store in a SWfMS. Besides, the need to manage data or intermediate data systematically and automatically is also emphasized by some studies [100] [106], where the previous studies on intermediate data are presented in terms of cost management and techniques are introduced to minimize the cost of data storing. In this work, by experimenting on different workflows from well-known SWfMSs, we emphasized the reusability of data and proposed an automatic technique for both storing and retrieving data. Using the technique, reusability can be increased by up to 51% and the time gain can reach up to 74% in a SWfMS.

Storing Modes of Workflow

Associations of parameter configuration sets with modules and the management of parameter configuration sets in a SWfMS are considered in some studies [5] [32] [25] [52] [78] to support the efficient re-execution of some parts of a workflow. Configured parameters of modules, as an important part of metadata management and provenance, are considered in many studies and are common in practice. Dynamic parameter tuning is also considered by Dias et al. [30] to improve the performance of workflow execution. Consequently, in our SWfMS we want to manage the parameters in a systematic way to use them properly in our data management scheme. A parameter set, as part of a configuration set providing the tool state of a specific tool, is considered in our system. In this study, we wanted to explore the usefulness of our technique, RISP, considering the tool state concept. To do this, we analyzed 534 workflows from Galaxy, and our experimental results show that the technique can work with the tool state concept and achieves around 40% reusability in a SWfMS.

Usability Study

A number of studies [102] [113] [67] [15] [97] [6] have considered use-case analysis and user studies to reveal insights into real-world workflow composition in a SWfMS. Though these studies evaluate SWfMSs from the user perspective, there is a lack of inquiry into a data management technique in a SWfMS to see its efficiency and effectiveness in the real world. One main reason for this lack is that the previous studies of data management considered only cost management in a SWfMS. In this work, we have considered the reusability of data while storing them in a SWfMS, and with the concept of reusability, user behavior and styles of work are explored for a data management technique. We conducted several user studies of our SWfMS with the technique, RISP, for various domains. Our studies reveal that the technique is very useful for composing workflows and also illustrate some pieces of evidence that can be useful for future improvement.

7.2 Future Directions

In future, we will consider the following actions in a technique of data management for a SWfMS.

7.2.1 Integration of the technique in a fully functional provenance model

Provenance models have become a crucial part of a SWfMS, as can be seen from a number of studies [11] [23] [37] [5] [101] [63]. The lineage of scientific data, such as the origin, derivation, and contexts of datasets, is tracked in a provenance model [14]. Recent studies [37] [5] [101] investigated the regeneration process of workflows using provenance tracing in a scientific management system. Integration of our technique in a scientific provenance model could reduce the cost of storing derived outcomes and increase the reusability of workflow components using our recommendation. In the future, we plan to work on such a provenance model, where configuration sets as metadata with the proposed technique can be useful to reconstruct a full workflow at a low cost.

7.2.2 Prioritizing the expensive modules

In our user study of the technique in a SWfMS, we found that users tended to use intermediate data for expensive modules. Chapter 6 presents the study of user usage patterns for intermediate states. However, due to the limitations of the user study and the number of workflows, we cannot generalize this. In the future, we will investigate data types and their usage further. A priority-based model could be useful for users to get more relevant recommendations of intermediate data in a SWfMS based on their usage.

7.2.3 Genetic algorithm to recommend data

Genetic algorithms are already popular in the SWfMS community for workflow scheduling. In our SWfMS, we plan to introduce such techniques for data management to execute workflows at a low cost in a particular environment. In this case, we need to design a fitness function that maintains the trade-off between data transition time and processing time. However, to design such a function in a SWfMS, the SWfMS has to be mature enough for the validation of the function. Our SWfMS is just evolving now, and in the future we will investigate genetic-algorithm-based data management techniques in our Scientific Workflow Management System.
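As a first approximation of such a fitness function, a candidate storage/reuse plan could be scored by the weighted trade-off between data transition (load/save) time and module processing time. The sketch below is purely illustrative of that trade-off; the plan encoding and weights are assumptions rather than a validated design.

```python
def fitness(plan, transition_weight=1.0, processing_weight=1.0):
    """Score a candidate plan for a genetic algorithm (lower is better).

    plan: list of per-module dicts with 'reuse' (bool), 'load_time', 'save_time',
          and 'exec_time' in seconds -- an assumed encoding of one individual.
    """
    total = 0.0
    for step in plan:
        if step["reuse"]:
            # Reused intermediate data: pay only the data transition cost.
            total += transition_weight * step["load_time"]
        else:
            # Recompute the module and (optionally) store its outcome.
            total += processing_weight * step["exec_time"]
            total += transition_weight * step.get("save_time", 0.0)
    return total
```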

7.2.4 Data and domain awareness

The heterogeneity of tools and data in a SWfMS calls for collaboration across different domains in the SWfMS. In the collaboration process, where multiple domain experts are involved, a domain-specific suggestion mechanism would help participants understand more about data usability. With the help of metadata, our recommendation process could increase domain awareness in a SWfMS, and in the future, we will work on a data management technique for a SWfMS where data awareness is emphasized for various domains.

References

[1] Enis Afgan, Dannon Baker, Bérénice Batut, Marius van den Beek, Dave Bouvier, Martin Čech, John Chilton, Dave Clements, Nate Coraor, Björn A. Grüning, Aysam Guerler, Jennifer Hillman-Jackson, Saskia Hiltemann, Vahid Jalili, Helena Rasche, Nicola Soranzo, Jeremy Goecks, James Taylor, Anton Nekrutenko, and Daniel Blankenberg. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Research, 46:W537–W544, 2018.

[2] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Record, volume 22, pages 207–216. ACM, 1993.

[3] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22:207–216, 1993.

[4] S. K. Prasad et al. Parallel processing over spatial-temporal datasets from geo, bio, climate and social science communities: A research roadmap. IEEE International Congress on Big Data (BigData Congress), 1:232–250, 2017.

[5] Ilkay Altintas, Oscar Barney, and Efrat Jaeger-Frank. Provenance collection support in the kepler scientific workflow system. In Proceedings of the 2006 International Conference on Provenance and Annotation of Data, IPAW’06, pages 118–132, Berlin, Heidelberg, 2006. Springer-Verlag.

[6] Emir M. Bahsi, Emrah Ceyhan, and Tevfik Kosar. Conditional workflow management: A survey and analysis. Sci. Program., 15(4):283–297, December 2007.

[7] Adam Barker and Jano van Hemert. Scientific workflow: A survey and research directions. In Roman Wyrzykowski, Jack Dongarra, Konrad Karczewski, and Jerzy Wasniewski, editors, Parallel Processing and Applied Mathematics, pages 746–753, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[8] T. Becker, J. Cavanillas, E. Curry, and W. Wahlster. Big Data Usage, New Horizons for a Data-Driven Economy. Springer, 2016.

[9] Khalid Belhajjame, Katy Wolstencroft, Oscar Corcho, Tom Oinn, Franck Tanoh, Alan Williams, and Carole Goble. Metadata management in the Taverna workflow system. In 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pages 651–656. IEEE, 2008.

[10] N. Bezzo, J. Park, A. King, P. Gebhard, R. Ivanov, and I. Lee. Demo abstract: ROSLab — a modular programming environment for robotic applications. In ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), pages 214–214, Berlin, 2014.

[11] Fahima Bhuyan, Shiyong Lu, Robert Reynolds, Ishtiaq Ahmed, and Jia Zhang. Quality analysis for scientific workflow provenance access control policies. In 2018 IEEE International Conference on Services Computing (SCC), pages 261–264. IEEE, 2018.

[12] Babraham Bioinformatics. Fastqc: a quality control tool for high throughput sequence data. Cambridge, UK: Babraham Institute, 2011.

[13] J. Blomer. Experiences on file systems: Which is the best file system for you? Journal of Physics: Conference Series, 664:4, 2015.

[14] Rajendra Bose and James Frew. Lineage retrieval for scientific data processing: A survey. ACM Comput. Surv., 37(1):1–28, March 2005.

[15] Duncan A. Brown, Patrick R. Brady, Alexander Dietz, Junwei Cao, Ben Johnson, and John McNabb. A Case Study on the Use of Workflow Technologies for Scientific Analysis: Gravitational Wave Data Analysis, pages 39–59. Springer London, London, 2007.

[16] Steven P Callahan, Juliana Freire, Emanuele Santos, Carlos E Scheidegger, Cláudio T Silva, and Huy T Vo. Vistrails: visualization meets data management. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 745–747. ACM, 2006.

[17] Debasish Chakroborti, Manishankar Mondal, Banani Roy, Chanchal Roy, and Kevin A. Schneider. Optimized storing of workflow outputs through mining association rules. pages 508–515, 12 2018.

[18] Eran Chinthaka, Jaliya Ekanayake, David Leake, and Beth Plale. Cbr based workflow composition assistant. In Proc. of World Congress on Services, pages 352–355, 2009.

[19] François Chollet et al. Keras: The python deep learning library. Astrophysics Source Code Library, 2018.

[20] Brian Clifton. Advanced web metrics with Google Analytics. John Wiley & Sons, 2012.

[21] F. Coppens, N. Wuyts, D. Inzé, and S. Dhondt. Unlocking the potential of plant phenotyping data through integration and data-driven approaches. Current Opinion in Systems Biology, 1:58–63, 2017.

[22] Daniel Crawl and Ilkay Altintas. A provenance-based fault tolerance mechanism for scientific workflows. In International Provenance and Annotation Workshop, pages 152–159. Springer, 2008.

[23] Susan B. Davidson and Juliana Freire. Provenance and scientific workflows: Challenges and opportunities. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1345–1350, New York, NY, USA, 2008. ACM.

[24] E. Deelman and A. Chervenak. Data management challenges of data-intensive scientific workflows. In 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID), pages 687–692, May 2008.

[25] E. Deelman and A. Chervenak. Data management challenges of data-intensive scientific workflows. In Proc. of CCGRID, pages 687–692, 2008.

[26] Ewa Deelman, Gurmeet Singh, Miron Livny, Bruce Berriman, and John Good. The cost of doing science on the cloud: The montage example. In Proc. of SC, pages 1–12, 2008.

[27] Ewa Deelman, Karan Vahi, Gideon Juve, Mats Rynge, Scott Callaghan, Philip J. Maechling, Rajiv Mayani, Weiwei Chen, Rafael Ferreira da Silva, Miron Livny, and Kent Wenger. Pegasus, a workflow management system for science automation. Future Gener. Comput. Syst., 46(C):17–35, May 2015.

[28] B. Depardon, G. L. Mahec, and C. Séguin. Analysis of six distributed file systems. [Research Report], 1, 2013.

[29] F. Desprez and P. F. Dutot. Euro-par 2016: Parallel processing workshops. Euro-Par, 2016:24–26, August 2016.

[30] Jonas Dias, Eduardo Ogasawara, Daniel de Oliveira, Fabio Porto, Alvaro L.G.A. Coutinho, and Marta Mattoso. Supporting dynamic parameter sweep in adaptive and user-steered workflow. In Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, pages 31–36, New York, NY, USA, 2011. ACM.

[31] G. Donvito, G. Marzulli, and D. Diacono. Testing of several distributed file-systems (hdfs, ceph and glusterfs) for supporting the hep experiments analysis. Journal of Physics: Conference Series, 513:4, 2014.

[32] Juliana Freire, Cláudio T. Silva, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, and Huy T. Vo. Managing rapidly-evolving scientific workflows. In Luc Moreau and Ian Foster, editors, Provenance and Annotation of Data, pages 10–18, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

[33] Roy Friedman. Caching web services in mobile ad-hoc networks: Opportunities and challenges. In Proceedings of the Second ACM International Workshop on Principles of Mobile Computing, POMC '02, pages 90–96, New York, NY, USA, 2002. ACM.

[34] Ritu Garg and Awadhesh Kumar Singh. Adaptive workflow scheduling in grid computing based on dynamic resource availability. Engineering Science and Technology, an International Journal, 18(2):256 – 269, 2015.

[35] Dimitrios Georgakopoulos, Mark Hornick, and Amit Sheth. An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases, 3(2):119–153, 1995.

[36] Y. Gil, E. Deelman, M. Ellisman, T. Fahringer, G. Fox, D. Gannon, C. Goble, M. Livny, L. Moreau, and J. Myers. Examining the challenges of scientific workflows. Computer, 40(12):24–32, Dec 2007.

[37] Yolanda Gil, Pedro Szekely, Sandra Villamizar, Thomas C. Harmon, Varun Ratnakar, Shubham Gupta, Maria Muslea, Fabio Silva, and Craig A. Knoblock. Mind your metadata: Exploiting semantics for configuration, adaptation, and provenance in scientific workflows. In Proceedings of the 10th International Conference on The Semantic Web - Volume Part II, ISWC'11, pages 65–80, Berlin, Heidelberg, 2011. Springer-Verlag.

[38] Carole A Goble, Jiten Bhagat, Sergejs Aleksejevs, Don Cruickshank, Danius Michaelides, David Newman, Mark Borkum, Sean Bechhofer, Marco Roos, Peter Li, et al. myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic acids research, 38(suppl 2):W677–W682, 2010.

[39] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. Ten simple rules for the care and feeding of scientific data. PLOS Computational Biology, 10:1–5, 2014.

[40] Jim Gray, David T. Liu, Maria Nieto-Santisteban, Alex Szalay, David J. DeWitt, and Gerd Heber. Scientific data management in the coming decade. SIGMOD Rec., 34(4):34–41, December 2005.

[41] Emily H Halili. Apache JMeter: A practical beginner’s guide to automated testing and performance measurement for your websites. Packt Publishing Ltd, 2008.

[42] Stephanie Hampton, Carly Strasser, Joshua Tewksbury, Wendy Gram, Amber Budden, Archer Batcheller, Clifford Duke, and John H Porter. Big data and the future of ecology. Frontiers in Ecology and the Environment, 11:156–162, 2013.

[43] Z. Han and M. Hong. Signal Processing and Networking for Big Data Applications. Cambridge University Press, Apr 27, 2017.

[44] J. Heit, J. Liu, and M. Shah. An architecture for the deployment of statistical models for the big data era. In IEEE International Conference on Big Data, pages 1377–1384, 2016.

[45] David Hollingsworth and UK Hampshire. Workflow management coalition: The workflow reference model. Document Number TC00-1003, 19:16, 1995.

[46] A. S. Kaseb, A. Mohan, and Y. H. Lu. Cloud resource management for image and video analysis of big data from network cameras. In International Conference on Cloud Computing and Big Data (CCBD), pages 287–294, 2015.

[47] M. Kim, J. Choi, and J. Yoon. Development of the big data management system on national virtual power plant. In 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC), pages 100–107, 2015.

[48] D. Koop, C. E. Scheidegger, S. P. Callahan, J. Freire, and C. T. Silva. Viscomplete: Automating suggestions for visualization pipelines. IEEE Transactions on Visualization and Computer Graphics, 14:1691–1698, 2008.

[49] David Leake and Joseph Kendall-Morwick. Towards case-based support for e-science workflow generation by mining provenance. In Klaus-Dieter Althoff, Ralph Bergmann, Mirjam Minor, and Alexandre Hanft, editors, Advances in Case-Based Reasoning, pages 269–283, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[50] Joe Lennon. Introduction to couchdb. In Beginning CouchDB, pages 3–9. Springer, 2009.

[51] B. Li, Y. He, and K. Xu. Distributed metadata management scheme in cloud computing. In 6th International Conference on Pervasive Computing and Applications, pages 32–38. Port Elizabeth, 2011.

[52] C. Lin, S. Lu, X. Fei, A. Chebotko, D. Pai, Z. Lai, F. Fotouhi, and J. Hua. A reference architecture for scientific workflow management systems and the view soa solution. IEEE Transactions on Services Computing, 2(1):79–92, Jan 2009.

[53] Ji Liu, Esther Pacitti, Patrick Valduriez, and Marta Mattoso. A survey of data-intensive scientific workflow management. J. Grid Comput., 13(4):457–493, December 2015.

[54] Xin Liu and Ralph Deters. An efficient dual caching strategy for web service-enabled pdas. In Proceedings of the 2007 ACM symposium on Applied computing, pages 788–794. ACM, 2007.

[55] Shiyong Lu and Jia Zhang. Collaborative scientific workflows. In 2009 IEEE International Conference on Web Services, pages 527–534. IEEE, 2009.

[56] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A Lee, Jing Tao, and Yang Zhao. Scientific workflow management and the kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 2006.

[57] L. N. Luyen, A. Tireau, A. Venkatesan, P. Neveu, and P. Larmande. Development of a knowledge system for big data: Case study to plant phenotyping data. In Proceedings of the 6th International Conference on Web Intelligence, Article 27, 9 pages. ACM, New York, NY, USA, 2016.

[58] Tanja Magoč and Steven L Salzberg. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics, 27(21):2957–2963, 2011.

[59] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.

[60] M. Minervini, H. Scharr, and S. Tsaftaris. Image analysis: The new bottleneck in plant phenotyping [applications corner]. IEEE Signal Processing Magazine, 32(4):126–131, 2015.

[61] M. Minervini and S. A. Tsaftaris. Application-aware image compression for low cost and distributed plant phenotyping. In 18th International Conference on Digital Signal Processing (DSP), pages 1–6, 2013.

[62] Paolo Missier, Simon Woodman, Hugo Hiden, and Paul Watson. Provenance and data differencing for workflow reproducibility analysis. Concurr. Comput. : Pract. Exper., 28:995–1015, 2016.

[63] Paolo Missier, Simon Woodman, Hugo Hiden, and Paul Watson. Provenance and data differencing for workflow reproducibility analysis. In Concurrency Computation, 2016.

[64] I. Mistrik and R. Bahsoon. Software Architecture for Big Data and the Cloud. Jun 12, 2017.

[65] A. K. Mondal, B. Roy, C. K. Roy, and K. A. Schneider. Micro-level modularity of computation-intensive programs in big data platforms: A case study with image data. Technical report, University of Saskatchewan, Canada.

[66] Pierre Mouallem, Daniel Crawl, Ilkay Altintas, Mladen Vouk, and Ustun Yildiz. A fault-tolerance architecture for kepler-based distributed scientific workflows. In International Conference on Scientific and Statistical Database Management, pages 452–460. Springer, 2010.

[67] Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the Annual Conference on USENIX ’06 Annual Technical Conference, ATEC ’06, pages 4–4, Berkeley, CA, USA, 2006. USENIX Association.

[68] Clément Nedelcu. Nginx HTTP Server: Adopt Nginx for Your Web Applications to Make the Most of Your Infrastructure and Serve Pages Faster Than Ever. Packt Publishing Ltd, 2010.

[69] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R Pocock, Anil Wipat, et al. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054, 2004.

[70] Travis E Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–20, 2007.

[71] L. Pineda-Morales, A. Costan, and G. Antoniu. Towards multi-site metadata management for geographically distributed cloud workflows. In IEEE International Conference on Cluster Computing, pages 294–303, IL, 2015.

[72] L. Pineda-Morales, J. Liu, A. Costan, E. Pacitti, G. Antoniu, P. Valduriez, and M. Mattoso. Managing hot metadata for scientific workflows on multisite clouds. In IEEE International Conference on Big Data (Big Data), pages 390–397, 2016.

[73] Radu Prodan and Thomas Fahringer. Dynamic scheduling of scientific workflow applications on the grid: A case study. In Proceedings of the 2005 ACM Symposium on Applied Computing, SAC ’05, pages 687–694, New York, NY, USA, 2005. ACM.

[74] Arcot Rajasekar, Mike Wan, Reagan Moore, and Wayne Schroeder. A prototype rule-based distributed data management system. 01 2006.

[75] H.A. Reijers, I. Vanderfeesten, and W.M.P. van der Aalst. The effectiveness of workflow management systems: A longitudinal study. International Journal of Information Management, 36(1):126 – 141, 2016.

[76] B. Roy, A. K. Mondal, C. K. Roy, K. A. Schneider, and K. Wazed. Towards a reference architecture for cloud-based plant genotyping and phenotyping analysis frameworks. In IEEE International Conference on Software Architecture (ICSA), Gothenburg, pages 41–50, 2017.

[77] Carlos E Scheidegger, Huy T Vo, David Koop, Juliana Freire, and Claudio T Silva. Querying and re-using workflows with VisTrails. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1251–1254. ACM, 2008.

[78] Laura A Seffino, Claudia Bauzer Medeiros, Jansle V Rocha, and Bei Yi. woodss — a spatial decision support system based on workflows. Decision Support Systems, 27(1):105 – 123, 1999.

[79] Peter Sevcik. Defining the application performance index. Business Communications Review, 20, 2005.

[80] Farrukh Shahzad, Tarek R. Sheltami, Elhadi M. Shakshuki, and Omar Shaikh. A review of latest web tools and libraries for state-of-the-art visualization. Procedia Computer Science, 98:100 – 106, 2016. The 7th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2016)/The 6th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH-2016)/Affiliated Workshops.

[81] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, et al. The hadoop distributed file system. In MSST, volume 10, pages 1–10, 2010.

[82] Yogesh L. Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Rec., 34(3):31–36, September 2005.

[83] Elvin Sindrilaru, Alexandru Costan, and Valentin Cristea. Fault tolerance and recovery in grid workflow management systems. In 2010 international conference on complex, intelligent and software intensive systems, pages 475–480. IEEE, 2010.

[84] A. Singh, B. Ganapathysubramanian, A. K. Singh, and S. Sarkar. Machine learning for high-throughput stress phenotyping in plants. Trends in Plant Science, 21(2):1360–1385, 2016.

[85] E. Skidmore, S. Kim, S. Kuchimanchi, S. Singaram, N. Merchant, and D. Stanzione. iPlant atmosphere: a gateway to cloud infrastructure for the plant sciences. In Proceedings of the 2011 ACM workshop on Gateway computing environments, pages 59–64. ACM, New York, NY, USA, 2011.

[86] K. Smith, L. Seligman, A. Rosenthal, C. Kurcz, M. Greer, C. Macheret, M. Sexton, and A. Eckstein. "Big Metadata": The need for principled metadata management in big data ecosystems. In Proceedings of Workshop on Data analytics in the Cloud, Article 13, 4 pages. ACM, New York, NY, USA, 2014.

[87] Ola Spjuth, Erik Bongcam-Rudloff, Guillermo Carrasco Hernández, Lukas Forer, Mario Giovacchini, Roman Valls Guimera, Aleksi Kallio, Eija Korpelainen, Maciej M. Kańdula, Milko Krachunov, David P. Kreil, Ognyan Kulev, Paweł P. Łabaj, Samuel Lampa, Luca Pireddu, Sebastian Schönherr, Alexey Siretskiy, and Dimitar Vassilev. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct, 10:43, 2015.

[88] W. Sun, X. Wang, and X. Sun. AC 2012-3155: Using modular programming strategy to practice computer programming: a case study. American Society for Engineering Education, 1, 2012.

[89] Domenico Talia. Workflow systems for science: Concepts and tools. ISRN Software Engineering, 2013, 2013.

[90] Douglas B Terry and Venugopalan Ramasubramanian. Caching xml web services for mobility. ACM Queue, 1(3):70–78, 2003.

[91] Gregory Louis Truty. Systems and methods for adjusting caching policies for web service requests, August 15 2006. US Patent 7,093,073.

[92] R. Tudoran, B. Nicolae, and G. Brasche. Data Multiverse: The Uncertainty Challenge of Future Big Data Analytics, volume 10151 of Lecture Notes in Computer Science, Semantic Keyword-Based Search on Structured Data Sources. Springer, Cham, 2017.

[93] Wil Van Der Aalst, Kees Max Van Hee, and Kees van Hee. Workflow management: models, methods, and systems. MIT Press, 2004.

[94] Vinod Kumar Vavilapalli, Arun C Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, et al. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing, page 5. ACM, 2013.

[95] A. Walter, F. Liebisch, and A. Hund. Plant phenotyping: from bean weighing to image analysis. Plant Methods, 11:1, 2015.

[96] F. Wang, J. Qiu, J. Yang, B. Dong, X. Li, and Y. Li. Hadoop high availability through metadata replication. In Proceedings of the first international workshop on Cloud data management, pages 37–44. ACM, New York, 2009.

[97] Jianwu Wang, Daniel Crawl, and Ilkay Altintas. Kepler + hadoop: A general architecture facilitating data-intensive applications in scientific workflow systems. In Proceedings of the 4th Workshop on Workflows in Support of Large-Scale Science, WORKS '09, pages 12:1–12:8, New York, NY, USA, 2009. ACM.

[98] Ethan P. White, Elita Baldridge, Zachary T. Brym, Kenneth J. Locey, Daniel J. McGlinn, and Sarah R. Supp. Nine simple ways to make it easier to (re)use your data. PeerJ PrePrints, 1:e7v2, 2013.

[99] Tom White. Hadoop: The definitive guide. O'Reilly Media, Inc., 2012.

[100] Simon Woodman, Hugo Hiden, and Paul Watson. Workflow provenance: An analysis of long term storage costs. In Proc. of WORKS, pages 1–9, 2015.

[101] Simon Woodman, Hugo Hiden, Paul Watson, and Paolo Missier. Achieving reproducibility by combining provenance with service and workflow versioning. In Proceedings of the 6th Workshop on Workflows in Support of Large-scale Science, WORKS '11, pages 127–136, New York, NY, USA, 2011. ACM.

[102] Qishi Wu, Mengxia Zhu, Yi Gu, Patrick Brown, Xukang Lu, Wuyin Lin, and Yangang Liu. A distributed workflow management system with case study of real-life scientific applications on grids. Journal of Grid Computing, 10(3):367–393, Sep 2012.

[103] S. G. Wu, F. S. Bao, E. Y. Xu, Y. X. Wang, Y. F. Chang, and C. L. Shiang. A leaf recognition algorithm for plant classification using probabilistic neural network. In IEEE 7th International Symposium on Signal Processing and Information Technology, Cairo, Egypt, 2007.

[104] X. Yang, S. Liu, K. Feng, S. Zhou, and X. H. Sun. Visualization and adaptive subsetting of earth science data in hdfs: A novel data analysis strategy with hadoop and spark. IEEE International Conferences on Big Data and Cloud Computing (BDCloud), 1:89–96, 2016.

[105] Dong Yuan, Yun Yang, Xiao Liu, and Jinjun Chen. A data placement strategy in scientific cloud workflows. Future Gener. Comput. Syst., 26:1200–1214, 2010.

[106] Dong Yuan, Yun Yang, Xiao Liu, and Jinjun Chen. On-demand minimum cost benchmarking for intermediate dataset storage in scientific cloud workflow systems. J. Parallel Distrib. Comput., 71:316–332, 2011.

[107] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.

[108] Jia Zhang, Daniel Kuc, and Shiyong Lu. Confucius: A tool supporting collaborative scientific workflow composition. IEEE Transactions on Services Computing, 7(1):2–17, 2012.

[109] Jia Zhang, Wei Tan, Alexander John, Ian Foster, and Ravi Madduri. Recommend-as-you-go: A novel approach supporting services-oriented scientific workflow reuse. In Proc. of SCC, pages 48–55, 2011.

[110] Jiajie Zhang, Kassian Kobert, Tomáš Flouri, and Alexandros Stamatakis. Pear: a fast and accurate illumina paired-end read merger. Bioinformatics, 30(5):614–620, 2013.

[111] Jun Zhao, Robert Stevens, and Sean Bechhofer. Semantically linking and browsing provenance logs for e-science. In Semantics of a Networked World. Semantics for Grid Databases, pages 158–176. 2004.

[112] Yong Zhao, Xubo Fei, Ioan Raicu, and Shiyong Lu. Opportunities and challenges in running scientific workflows on the cloud. In 2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pages 455–462. IEEE, 2011.

[113] Charles Zheng and Douglas Thain. Integrating containers into workflows: A case study using makeflow, work queue, and docker. In Proceedings of the 8th International Workshop on Virtualization Technologies in Distributed Computing, VTDC '15, pages 31–38, New York, NY, USA, 2015. ACM.

[114] Thomas Zimmermann, Peter Weisgerber, Stephan Diehl, and Andreas Zeller. Mining version histories to guide software changes. In Proc. of ICSE, pages 563–572, 2004.

[115] M. zur Muhlen. Evaluation of workflow management systems using meta models. In Proceedings of the 32nd Annual Hawaii International Conference on Systems Sciences (HICSS-32), Abstracts and CD-ROM of Full Papers, volume Track 5, pages 11 pp.–, Jan 1999.

[116] A. M. Şuti, M. V. D. Brand, and T. Verhoeff. Exploration of modularity and reusability of domain-specific languages: an expression DSL in MetaMod. Computer Languages, Systems and Structures, 1:1477–8424.

Appendix A Data Management Scheme

A.1 Code fragment of a modularized job

import sys
from time import time

import requests
from pyspark.sql import SparkSession, SQLContext

# Project-specific helpers (ImgCluster, loadExecuteHDFS, common_transform,
# common_estimate, cluster_model, cluster_analysis) are defined elsewhere.


def hdfsClustering():
    server = sys.argv[1]
    uname = sys.argv[2]
    upass = sys.argv[3]
    data_path = sys.argv[4]
    save_path = sys.argv[5]
    img_type = sys.argv[6]
    k = int(sys.argv[7])
    iterations = int(sys.argv[8])

    pipes = ImgCluster(server, uname, upass)
    pipes.setLoadAndSavePath(data_path, save_path)
    processing_start_time = time()
    rdd = loadExecuteHDFS(sc, data_path, pipes)
    rdd = common_transform(pipes, rdd, (0))
    rdd = common_estimate(pipes, rdd, ("sift"))
    model = cluster_model(pipes, rdd, (k, iterations))
    rdd = cluster_analysis(pipes, rdd, model)
    processing_end_time = time() - processing_start_time
    pipes.saveClusterResult(rdd.collect())
    print "SUCCESS: Images processed in {} seconds".format(round(processing_end_time, 3))

    # Record the module execution as a document in CouchDB.
    url = 'http://XXX.XX.XXX.XX:5984/plantphenotype/2bbdc24f3d74d6859423e4311409b7be/'
    payload = {
        "_id": "2bbdc24f3d74d6859423e4311409b7be",
        "_rev": "3-6372072da016fa257640892ea85e06c7",
        "inputdataset": "hdfs://hadoop-master:54310/user/hadoop/leaves_images",
        "execution_time": processing_end_time,
        "the_doc_type": "module",
        "special_weight": "XX",
        "number_of_time_called": X,
        "caller_workflow": "2bbdc24f3d74d6859423e4311409aefa",
        "outputdataset": "hdfs://hadoop-master:54310/user/hadoop/leaves_result_new",
        "is_intermediate_states": False,
        "module_id": "module_id_1"
    }
    headers = {}
    res = requests.post(url, data=payload, headers=headers)


if __name__ == "__main__":
    spark = SparkSession.builder.appName("Clustering on Server").getOrCreate()
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)
    npartitions = XX
    # loger = sc._jvm.org.apache.log4j
    # logger = loger.LogManager.getRootLogger()
    # describeLeaves()
    # matchLeaves()
    # hdfsImgSeg()
    hdfsClustering()
    sc.stop()

Listing A.1: An example code of modularized programs

Appendix B RISP

B.1 Code fragment of RISP analyzer

public class Associationrules {

    public static void main(String[] args) {

        System.out.println("\n\nProposed System.");
        Pipelines p = new Pipelines();
        p.readPipelines();
        p.analyzePipelines();
        p.getResult();

        System.out.println("\n\nSystem that always saves.");
        Pipelines p2 = new Pipelines();
        p2.readPipelines();
        p2.analyzePipelinesAlwaysSave();
        p2.getResult();

        System.out.println("\n\nSystem that saves those results that were previously used at least once.");
        Pipelines p3 = new Pipelines();
        p3.readPipelines();
        p3.analyzePipelinesSupportOne();
        p3.getResult();

        System.out.println("\n\nSystem that saves final.");
        Pipelines p4 = new Pipelines();
        p4.readPipelines();
        p4.analyzePipelinesFinalResult();
        p4.getResult();
    }
}

Listing B.1: An example code of RISP evaluation

Appendix C Adaptive RISP

C.1 JSON structure of a workflow

1000 { 1001 "a_galaxy_workflow": "true", 1002 "annotation":"", 1003 "format-version":"1.1", 1004 " name ": "Detrprok", 1005 " steps ": { 1006 "0": { 1007 "annotation": "Select here the GFF annotation file containing coding genes, tRNAs, rRNAs and known ncRNAs. You can download it from GJI ( http://genome.jgi.doe.gov) or NCBI (http://ftp.ncbi.nih.gov/ genomes/Bacteria). Caution:1srt column should have exactly the same identifier as the genome sequence used for mapping.", 1008 "id":0, 1009 "input_connections": {} , 1010 " inputs ":[ 1011 { 1012 "description": "Select here the GFF annotation file containing coding genes, tRNAs, rRNAs and known ncRNAs. You can download it from GJI (http://genome.jgi.doe.gov) or NCBI (http://ftp. ncbi.nih.gov/genomes/Bacteria). Caution:1srt column should have exactly the same identifier as the genome sequence used for mapping.", 1013 " name ": "genome annotations" 1014 } 1015 ], 1016 " label ": null, 1017 " name ": "Input dataset", 1018 " outputs ":[], 1019 " position ": { 1020 " left ":206, 1021 " top ":152 1022 }, 1023 "tool_errors": null, 1024 " tool_id ": null, 1025 "tool_state":" { \" name \": \"genome annotations\" }", 1026 "tool_version": null, 1027 " type ": "data_input", 1028 "user_outputs":[], 1029 " uuid ":"8b700d1d-acf8-423e-92e3-d125a2a96526" 1030 }, 1031 "1": { 1032 "annotation": "Select here the bam file resulting from the mapping of RNA-seq reads on the genome sequence. You may use Bowtie or another mapper (use appropriate mapping option to retain only uniquely mapping reads).", 1033 "id":1,

102 1034 "input_connections": {} , 1035 " inputs ":[ 1036 { 1037 "description": "Select here the bam file resulting from the mapping of RNA-seq reads on the genome sequence. You may use Bowtie or another mapper (use appropriate mapping option to retain only uniquely mapping reads).", 1038 " name ": "read alignments on genome" 1039 } 1040 ], 1041 " label ": null, 1042 " name ": "Input dataset", 1043 " outputs ":[], 1044 " position ": { 1045 " left ":200, 1046 " top ":403 1047 }, 1048 "tool_errors": null, 1049 " tool_id ": null, 1050 "tool_state":" { \" name \": \"read alignments on genome\" }", 1051 "tool_version": null, 1052 " type ": "data_input", 1053 "user_outputs":[], 1054 " uuid ":"00d8d4c6-aca2-4eed-8844-cfa3cf9aa12c" 1055 }, 1056 . 1057 . 1058 . 1059 . 1060 . 1061 . 1062 . 1063 . 1064 . 1065 . 1066 . 1067 . 1068 . 1069 "42": { 1070 "annotation": "Uses an RGB color tag for visualization of long 5’UTR in the Artemis environment.", 1071 "id":42, 1072 "input_connections": { 1073 "referenciesFile": { 1074 "id":41, 1075 "output_name": "outputFile" 1076 } 1077 }, 1078 " inputs ":[], 1079 " label ": null, 1080 " name ": "colorGff", 1081 " outputs ":[ 1082 { 1083 " name ": "outputFile",

103 1084 " type ": "gff" 1085 } 1086 ], 1087 " position ": { 1088 " left ":3158, 1089 " top ":641 1090 }, 1091 "post_job_actions": { 1092 "RenameDatasetActionoutputFile": { 1093 "action_arguments": { 1094 " newname ": "long5_UTR list" 1095 }, 1096 "action_type": "RenameDatasetAction", 1097 "output_name": "outputFile" 1098 } 1099 }, 1100 "tool_errors": null, 1101 " tool_id ": "toolshed.g2.bx.psu.edu/repos/clairetn/ detrprok_scripts/colorGff/1.0.0", 1102 "tool_state":" { \"__page__\":0, \"__rerun_remap_job_id__\": null, \"RGBcolor\":\"\\\"250105180\\\"\", \"referenciesFile\":\" null \" }", 1103 "tool_version":"1.0.0", 1104 " type ": "tool", 1105 "user_outputs":[], 1106 " uuid ":"78bb45b9-1d16-498a-bea8-40bf3447a4bd" 1107 } 1108 }, 1109 " uuid ": "d190d09c-2654-4217-884e-a0ef7dbe05fc" 1110 } Listing C.1: An example workflow of stranded and prokaryotic RNAseq data analysis

Appendix D GUI-RISP: User Study

D.1 Code fragment of a distributed job

1000 # spark-submit--packages databricks:spark-deep-learning:1.2.0-spark2.3- s_2.11/home/joy/PycharmProjects/ToolsforSwfMS/ ClassificationModelTrainingAndTesting.py 1001 import pickle 1002 from pyspark.ml.image import ImageSchema 1003 import os 1004 os. environ["CUDA_VISIBLE_DEVICES"]="-1" 1005 from pyspark.ml.classification import LogisticRegression 1006 from pyspark.sql.functions import lit 1007 from sparkdl import DeepImageFeaturizer 1008 from pyspark.ml import Pipeline 1009 from pyspark import SparkContext, SparkConf, SQLContext 1010 from pyspark.sql.types import* 1011 from pyspark.sql import SparkSession 1012 1013 import requests 1014 from time import time 1015 import sys 1016 import ast 1017 1018 CouchDBServer =’XXX.XXX.X.XX’ 1019 HDFS_Dir ="/user/dec657/" 1020 1021 conf = SparkConf().setAppName(’ ClassificationModelTrainingAndTestingInceptionV3’) 1022 sc = SparkContext(conf=conf) 1023 spark = SparkSession.builder.getOrCreate() 1024 1025 1026 dict= {} 1027 dict = ast.literal_eval(sys.argv[-1]) 1028 workflowdocid = dict[’workflowdocid’] 1029 modulerev = dict[’modulerev’] 1030 module_id = dict[’module_id’] 1031 input_dataset = dict[’input’].split(",") 1032 output_dataset = dict[’output’].split(",") 1033 1034 1035 processing_start_time = time() 1036 train_df = spark.read.load(HDFS_Dir + input_dataset[0].strip()) 1037 train_df.show() 1038 test_df = spark.read.load(HDFS_Dir + input_dataset[1].strip()) 1039 test_df.show() 1040 featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features", modelName ="InceptionV3")

105 1041 lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3, labelCol ="label") 1042 p = Pipeline(stages=[featurizer, lr]) 1043 p_model = p.fit(train_df) 1044 df = p_model.transform(test_df) 1045 df.write.mode(’overwrite’).save(output_dataset[0].strip()) 1046 1047 processing_end_time = (time() - processing_start_time) 1048 print"SUCCESS: Images processed in{} seconds".format(round( processing_end_time,3)) 1049 1050 1051 url =’http://’+CouchDBServer+’:5984/plantphenotype_inter_v1/’+module_id+’ /’ 1052 payload = { 1053 "_id":""+module_id, 1054 "_rev":"" + modulerev, 1055 "inputdataset": input_dataset, 1056 "execution_time": processing_end_time, 1057 "the_doc_type":"module", 1058 "special_weight":5, 1059 "number_of_time_called":0, 1060 "caller_workflow":""+workflowdocid, 1061 "outputdataset": output_dataset, 1062 "is_intermediate_states":1, 1063 "module_id":"Image_Classification_ModelTrainingAndTesting_InceptionV3 " 1064 } 1065 headers = {} 1066 res = requests.put(url, json=payload) Listing D.1: Spark implementation of InceptionV3 model

D.2 Distributed job submission

1000 import paramiko 1001 import time 1002 from io import BytesIO as StringIO 1003 import requests 1004 import ast 1005 import time 1006 1007 SERVER = app.config[’DISTRIBUTEED_MAIN_NODE’] 1008 1009 USER =’dec657’ 1010 PASSWORD =’XXXXXXXXXXXXXXXXXXX’ 1011 1012 ssh = paramiko.SSHClient() 1013 ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy()) 1014 ssh.connect(SERVER, username=USER, password=PASSWORD) 1015 client = ssh.invoke_shell() 1016 stdout = client.makefile(’rb’) 1017 stdin = client.makefile(’wb’)

106 1018 job_name = app.config[’LOCAL_HOME’]+’ ClassificationModelTrainingAndTestingInceptionV3.py’ 1019 1020 rawcommand =’’’ 1021 spark-submit--packages databricks:spark-deep-learning:1.2.0-spark2.3-s_2 .11{job_name}"{plugin_params}" 1022 ’’’ 1023 cmmnd = rawcommand.format(job_name=str(job_name), plugin_params=str( { record_ids })) 1024 1025 stdin.write(cmmnd) 1026 1027 timeout =60 1028 import time 1029 endtime = time.time() + timeout 1030 while not stdout.channel.eof_received: 1031 time.sleep(1) 1032 if time.time() > endtime: 1033 stdout.channel.close() 1034 break 1035 print(stdout.read()) 1036 1037 stdin.close() 1038 ssh.close() 1039 1040 job_id ="job_id_X" 1041 1042 url = app.config[’COUCHDB_SERVER’]+’/’+app.config[’COUCHDB_DATABASE’]+’/’+ ’module_id_for_stutas/’ 1043 print (url) 1044 payload = {} 1045 headers = {} 1046 res = requests.get(url) 1047 res_dict = ast.literal_eval(res.text) 1048 1049 if res_dict[’outputdataset’] !="": 1050 if type(res_dict[’outputdataset’]) is str: 1051 res_dict[’outputdataset’] = ast.literal_eval(res_dict[’ outputdataset’]) 1052 for datasetname in res_dict[’outputdataset’]: 1053 if (datasetname.strip() !=’’): 1054 file= open("app_collaborative_sci_workflow/workflow_outputs/ test_workflow/"+job_id+"_"+datasetname.strip(),"w") 1055 file.close() Listing D.2: Example code for submitting a job

D.3 Source Code Snippets

1000 1001 1007 1008 1009 1010 ga(’send’, { 1011 hitType:’event’, 1012 eventCategory:’IntermediateData’, 1013 eventAction:’CheckedMultiple’, 1014 eventValue: parseInt(clickedNumber), 1015 eventLabel:’#’+user_email+’#’+currentAssembledModule 1016 }); 1017 1018 ga(’send’, { 1019 hitType:’timing’, 1020 timingCategory:’CanExecute’+is_intermediate_used, 1021 timingVar:’load_to_run’, 1022 timingValue: timeSincePageLoad, 1023 timingLabel:’#’+user_email+’#’+currentAssembledModule 1024 }); 1025 1026 ga(’send’, { 1027 hitType:’event’, 1028 eventCategory:’IntermediateData’, 1029 eventAction:’AModule’, 1030 eventValue: ModNames_for_log.length, 1031 eventLabel:’#’+user_email+’#’+currentAssembledModule 1032 }); 1033 1034 ga(’send’, { 1035 hitType:’event’, 1036 eventCategory:’IntermediateData’, 1037 eventAction:’UsedI’, 1038 eventValue: ModNames.length, 1039 eventLabel:’#’+user_email+’#’+currentAssembledModule 1040 }); 1041 1042 ga(’send’, { 1043 hitType:’event’, 1044 eventCategory:’IntermediateData’, 1045 eventAction:’AvailableI’, 1046 eventValue: ModNames.length, 1047 eventLabel:’#’+user_email+’#’+currentAssembledModule 1048 }); 1049 1050 ga(’send’, { 1051 hitType:’event’, 1052 eventCategory:’IntermediateData’, 1053 eventAction: user_email, 1054 eventValue: ModNames.length,

108 1055 eventLabel: currentAssembledModule 1056 }); 1057 1058 ga(’send’, { 1059 hitType:’event’, 1060 eventCategory:’IntermediateData’, 1061 eventAction:’LoadedSeqKnown’, 1062 eventValue: ModNames.length, 1063 eventLabel:’#’+user_email+’#’+currentAssembledModule 1064 }); 1065 1066 ga(’send’, { 1067 hitType:’event’, 1068 eventCategory:’IntermediateData’, 1069 eventAction:’LoadedSeqUnKnown’, 1070 eventValue: ModNames.length, 1071 eventLabel:’#’+user_email+’#’+currentAssembledModule 1072 }); 1073 1074 ga(’send’, { 1075 hitType:’event’, 1076 eventCategory:’IntermediateData’, 1077 eventAction:’CheckedSingle’, 1078 eventValue:id_found_position, 1079 eventLabel:’#’+user_email+’#’+currentAssembledModule 1080 }); 1081 1082 ga(’send’, { 1083 hitType:’timing’, 1084 timingCategory:’CanAssemble’+is_intermediate_used, 1085 timingVar:’load_to_add’, 1086 timingValue: timeSincePageLoad, 1087 timingLabel:’#’+user_email+’#’+currentAssembledModule 1088 }); Listing D.3: JavaScript Tracking Snippet of Google Analytics for User Log

D.4 Recorded JMeter script fragment

1000 1001 1002 1003 1004 1005 false 1006 false 1007 1008 1009 1010

109 1011 1012 1013 1014 1015 1016 host 1017 p2irc-shipi.usask.ca 1018 = 1019 1020 1021 scheme 1022 https 1023 = 1024 1025 1026 1027 1028 1029 1030 1031 1032 p2irc-shipi.usask.ca 1033 1034 https 1035 1036 1037 6 1038 1039 1040 1041 . 1042 . 1043 . 1044 . 1045 . 1046 . 1047 . 1048 . 1049 1050 false 1051 1052 saveConfig 1053 1054 1055 true 1056 true 1057 true

110 1058 1059 true 1060 true 1061 true 1062 true 1063 false 1064 true 1065 true 1066 false 1067 false 1068 false 1069 true 1070 false 1071 false 1072 false 1073 true 1074 0 1075 true 1076 true 1077 true 1078 true 1079 true 1080 true 1081 1082 1083 1084 1085 1086 1087 false 1088 1089 saveConfig 1090 1091 1092 true 1093 true 1094 true 1095 1096 true 1097 true 1098 true 1099 true 1100 false 1101 true 1102 true 1103 false 1104 false 1105 false 1106 true 1107 false 1108 false 1109 false

111 1110 true 1111 0 1112 true 1113 true 1114 true 1115 true 1116 true 1117 true 1118 1119 1120 1121 1122 1123 1124 1125 Listing D.4: An example Jmeter recoding script

D.5 Database Mapping Classes and View

1000 from flaskext.couchdb import Document, TextField, FloatField, DictField, Mapping, ListField, IntegerField, BooleanField 1001 from couchdb.design import ViewDefinition 1002 1003 class WorkFlow(Document): 1004 the_doc_type = TextField(default=’workflowintermediate’) 1005 workflow_id = ListField(DictField) 1006 workflow_name = TextField(default=’testName’) 1007 user_email = TextField() 1008 1009 class Module(Document): 1010 the_doc_type = TextField(default=’module’) 1011 module_id = TextField() 1012 storing_time = FloatField(default=0.0) 1013 execution_time = FloatField(default=0.0) 1014 retrieving_time = FloatField(default=0.0) 1015 inputdataset = TextField() 1016 outputdataset = TextField() 1017 is_intermediate_states = FloatField(default=0.0) 1018 special_weight = TextField(default=’0’) 1019 number_of_time_called = TextField(default=’0’) 1020 result = TextField(default =’0’) 1021 caller_workflow = TextField() 1022 parameters = ListField(DictField) 1023 1024 class Dataset(Document): 1025 the_doc_type = TextField(default=’dataset’) 1026 dataset_id = TextField() 1027 size = TextField(default=’00’) 1028 saving_time = FloatField(default=0.0) 1029 loading_time = FloatField(default=0.0)

112 1030 caller_module = TextField() 1031 1032 class IntermediateDataset(Document): 1033 the_doc_type = TextField(default=’intermediatedataset’) 1034 dataset_id = TextField() 1035 size = TextField(default=’00’) 1036 module_id = TextField() 1037 saving_time = FloatField(default=0.0) 1038 loading_time = FloatField(default=0.0) 1039 generator_module = TextField() 1040 from_dataset = TextField() 1041 1042 class Cost(Document): 1043 the_doc_type = TextField(default=’cost’) 1044 cost_id = TextField() 1045 storage_cost = FloatField(default=0.0) 1046 time_cost = FloatField(default=0.0) 1047 1048 class WorkFlowUser(Document): 1049 the_doc_type = TextField(default=’workflowuser’) 1050 workflow_id = TextField() 1051 user_name = TextField() 1052 user_email = TextField() 1053 1054 1055 views_by_caller_workflow_for_module = ViewDefinition(’ intermediate_workflow’,’module’,’’’ 1056 function(doc){ 1057 if(doc.the_doc_type&& doc.the_doc_type ==’module’){ 1058 emit(doc._id, doc.caller_workflow) 1059 }; 1060 } 1061 ’’’) 1062 1063 views_by_workflowintermediate = ViewDefinition(’intermediate_workflow’,’ workflowintermediate’,’’’ 1064 function(doc){ 1065 if(doc.the_doc_type&& doc.the_doc_type ==’workflowintermediate ’){ 1066 emit(doc.workflow_id, doc.user_email) 1067 }; 1068 } 1069 ’’’) 1070 1071 views_by_module = ViewDefinition(’intermediate_workflow’,’module’,’’’ 1072 function(doc){ 1073 if(doc.the_doc_type&& doc.the_doc_type ==’module’){ 1074 emit(doc.module_id, doc.caller_workflow) 1075 }; 1076 } 1077 ’’’) 1078 1079 views_by_module_for_outputdataset = ViewDefinition(’intermediate_workflow’ ,’module’,’’’

113 1080 function(doc){ 1081 if(doc.the_doc_type&& doc.the_doc_type ==’module’){ 1082 emit(doc.module_id, doc) 1083 }; 1084 } 1085 ’’’) 1086 1087 views_by_dataset_caller_module = ViewDefinition(’intermediate_workflow’,’ dataset’,’’’ 1088 function(doc){ 1089 if(doc.the_doc_type&& doc.the_doc_type ==’dataset’){ 1090 emit(doc.caller_module, doc._id) 1091 }; 1092 } 1093 ’’’) Listing D.5: Example codes for metadata

Appendix E User Study: User Manual

E.1 Evaluation of RISP

RISP (Recommending Intermediate States from Pipelines) allows pipelines to be executed with previously processed data, reducing the computational time of some modules in a SWfMS. This document will guide you through the evaluation of the usability of the technique in our SWfMS. We highly appreciate your participation.

E.2 RISP in SWfMS

In this study, you need to execute only seven workflows. You do not need to provide any feedback or answer any questions while building the workflows. All of your activities will be logged and timestamped for further analysis while you assemble and execute workflows. At the end of each section (there are only two sections), you only need to provide workload responses in the NASA-TLX app on an iPad. Below is the link to the website for composing and executing the seven workflows. The workflows of this study are presented graphically in the next two sections by their types. The following abbreviations are used:

L - Left side toolbox panel
R - Right side dataset panel
MCW - Main container of workflows
MPW - Module pop up window
R - RISP recommendation
AR - Adaptive RISP recommendation
RISP - The recommendation technique
A-RISP - Adaptive version of the technique

Figure E.1: User interface for building workflows.

E.3 Evaluation of the technique (Study 1)

Image Processing Workflow Composition: There are five workflows in total in this study. Execute only two of them (the same or different workflows, with and without RISP); that is, first build a workflow without intermediate states, and then you will have the choice to build another workflow with intermediate states.

Figure E.2: Image Processing Workflow 1.

Figure E.3: Image Processing Workflow 2.

Figure E.4: Image Processing Workflow 3.

Figure E.5: Image Processing Workflow 4.

Figure E.6: Image Processing Workflow 5.

E.4 Evaluation of the technique (Study 2)

Bioinformatics Workflow Composition: In the bioinformatics workflow composition, you have to complete all four workflows presented below. In this study, you are free to choose intermediate states based on their availability. The complexity of the workflows increases gradually.

Figure E.7: Bioinformatics Workflow 1.

Figure E.8: Bioinformatics Workflow 2.

Figure E.9: Bioinformatics Workflow 3.

Figure E.10: Bioinformatics Workflow 4.
